Intro to machine learning part 4: Support Vector Machines

In the fourth installment of our intro to machine learning series, I’ll introduce Support Vector Machines, a powerful tool for classification and regression. In my last post on Perceptron, we examined a procedure for finding a hyperplane that is an effective linear classifier and not subject to the curse of dimensionality. To review, our task during binary classification is to find a classifier that can predict the label of a data point using a vector of the point’s characteristic features. The Perceptron algorithm finds a hyperplane that is able to separate points of different classifications within the feature space. An example of a linear classifier found using perceptron for a data set with 3 features is shown in Figure 1 below.

Figure 1: A hyperplane for binary classification of a linearly separable data set

While Perceptron is a useful conceptual tool for understanding classification problems, it suffers from several flaws that limit its usefulness in practice. First, Perceptron will never converge if the data is not linearly separable, limiting its utility to well-behaved data sets. Second, while Perceptron is guaranteed to converge to a separating hyperplane if the data is linearly separable, such a hyperplane may take infinitely many possible forms and there is no guarantee that Perceptron will find the “best” one (more on what “best” means later). Support Vector Machines (SVMs) overcome these problems to find an “optimal” hyperplane and may be fit to data sets that are not linearly separable. SVMs may also be kernelized to fit non-linear decision boundaries and extended to regression contexts. In this post we’ll look into how and why SVMs work and discuss an example of SVMs in the water resources systems literature. I would like to note that the main body of content in this post was derived from the course notes of CS 4780 by Dr. Kilian Weinberger at Cornell. For deeper coverage of machine learning content, check out the course page here.

The Maximum Margin Hyperplane

Let’s define the features of point i as xi, and its label as yi. If the data is linearly separable, Perceptron will find a hyperplane, H, such that:

H = \{x: w^{T}x + b = 0\}

Where w is a vector orthogonal to hyperplane H that defines its orientation, and b represents an offset from the origin.
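
To make the classification rule concrete, a new point is labeled according to which side of the hyperplane it falls on, i.e., the sign of w^T x + b. Below is a minimal numpy sketch with made-up values for w, b, and the points (purely illustrative):

```python
import numpy as np

w = np.array([1.0, -2.0, 0.5])    # hypothetical orientation vector
b = -1.0                          # hypothetical offset from the origin

# Two example points (one per row) in a 3-feature space
X = np.array([[3.0, 0.5, 1.0],
              [0.0, 2.0, 0.0]])

labels = np.sign(X @ w + b)       # +1 on one side of the hyperplane, -1 on the other
print(labels)                     # [ 1. -1.]
```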

While a hyperplane found by Perceptron will accurately separate the two classes, it likely will not find the hyperplane that does the “best” job bisecting the classes. So how do we define what makes a good hyperplane for classification? In machine learning we train classifiers on a set of training data hoping that the classifiers will generalize to a broader set of data; we seek classifiers that not only have low training error, but will also have low generalization error. It turns out the best way to reduce generalization error is to find the hyperplane that is most in the “middle” of the two classes in the training set. We can define the middle of the two classes via the hyperplane with the largest margin, i.e., the greatest distance between the hyperplane and the closest training points of either class. This is known as the maximum margin hyperplane. The advantages of using the maximum margin hyperplane are easy to visualize. Figure 2 shows three hyperplanes attempting to classify a set of binary training data with two features, x1 and x2. Plane H1 does not separate the two classes. Planes H2 and H3 both correctly classify the data set, but plane H3, the maximum margin classifier, is the more natural partition. The two training points closest to hyperplane H3, which define the maximum margin, are known as “support vectors”.

Figure 2: Three hyperplanes, plane H1 does not separate the two data sets, plane H2 separates the two but with a small margin, H3 is the maximum margin hyperplane (Image source: Wikimedia commons under the creative commons license)

SVM changes the classification problem from “find a hyperplane that correctly classifies all training points” to “find the maximum margin hyperplane”, which allows us to formulate the problem as an optimization. Here, we will denote the margin of a hyperplane as γ, which is a function of the hyperplane’s weight vector w and intercept b. The optimization is thus:

\max_{w,b}\gamma(w,b)

s.t. \quad \forall i \quad y_{i}(w^{T}x_{i}+b) \geq 0

With a little math (if you’re interested, see these notes), we can simplify this problem to a quadratic optimization problem that can be efficiently solved with a Quadratically Constrained Quadratic Program (QCQP) solver.

\min_{w,b} \ w^{T}w

s.t. \quad \forall i \quad y_{i}(w^{T}x_{i}+b) \geq 1
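
As a concrete illustration of this formulation, the hard margin problem can be handed directly to a convex optimization package. Here is a minimal sketch using the cvxpy package on a made-up, linearly separable toy data set (the data and the choice of package are my assumptions, not part of the original formulation):

```python
import numpy as np
import cvxpy as cp

# Toy, linearly separable data (made up for illustration); labels must be +/-1
X = np.array([[2.0, 3.0], [3.0, 3.0], [0.0, 0.5], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])
b = cp.Variable()

objective = cp.Minimize(cp.sum_squares(w))        # w^T w
constraints = [cp.multiply(y, X @ w + b) >= 1]    # y_i (w^T x_i + b) >= 1 for all i
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
```

At the optimum, the constraints belonging to the closest points of each class are tight; those points are the support vectors.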

Soft Margin SVM

By finding the maximum margin classifier, SVM is guaranteed to find the optimal hyperplane for the given data and can do so using an off-the-shelf QCQP solver. However, our treatment of SVM thus far (known as “hard margin SVM”) will only work for linearly separable data. SVM can be extended to training sets that are not linearly separable by incorporating the optimization’s constraint as a penalty into the objective function:

\min_{w,b} \quad w^{T}w+C\sum^{n}_{i=1}\max[1-y_{i}(w^{T}x_{i}+b),0]

Where C represents a penalty coefficient that can be set by the user. This formulation is known as “soft margin SVM”. Instead of only finding hyperplanes that perfectly divide the data, soft margin SVM allows misclassification at a penalty. When C is very large, soft margin SVM will find the same hyperplane as hard margin SVM. For more info on the derivation of soft margin SVM, see these notes.
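
To see how the penalized objective behaves, here is a minimal numpy sketch that minimizes the soft margin objective by subgradient descent. This is purely pedagogical (the function name, learning rate, and iteration count are arbitrary choices of mine); in practice you would use one of the implementations listed in the “Implementing SVMs” section below.

```python
import numpy as np

def soft_margin_svm(X, y, C=1.0, lr=1e-3, epochs=1000):
    """Minimize w^T w + C * sum(max(1 - y_i(w^T x_i + b), 0)) by subgradient descent.

    X: (n, d) feature matrix, y: labels in {-1, +1}. Pedagogical sketch only.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                      # points violating the margin
        # Subgradient of the objective with respect to w and b
        grad_w = 2 * w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Only points that violate the margin contribute to the penalty term, so increasing C pushes the solution toward the hard margin hyperplane.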

An example of SVM in Water Resources Literature

So SVM is a linear classifier that finds the maximum margin hyperplane and can be applied to data that is not linearly separable, but how might one apply it in water resources systems? One example of SVM in the water resources literature is by Herman et al., 2016, who use SVM for scenario discovery. Scenario discovery (Groves and Lempert, 2007) assists decision makers in finding key uncertainties that represent vulnerability for a planning alternative. Herman et al., 2016, used SVM to examine the performance of a water supply portfolio under future demand growth and drought scenarios. Figure 7 of Herman et al., 2016, plots factor maps of drought frequency vs. demand growth scaling factor, and shows SVM classification of future states of the world based on whether they met stakeholder performance criteria. Each point on the plot represents a unique drought and demand scenario, and the point’s color represents its classification. The classification predicted by soft margin SVM is shown as the shading of the plots. The hyperplanes discovered by SVM allow decision makers to discover what ranges of uncertainties are likely to cause the chosen alternative to fail and can assist in the development of narrative scenarios for future planning purposes.

Figure 3: Figure 7 from Herman et al., 2016. Each point represents a sampled uncertainty, demand growth is shown on the horizontal axis and drought frequency (n times more likely) is shown on the vertical axis. The color of each point represents its classification (success or failure to meet stakeholder criteria) and the shading of the plots represents the predicted classification by soft margin SVM.

SVMs are also commonly used for regression in hydrology and downscaling of climate models. SVMs in a regression context are beyond the scope of this post, but see Asefa et al., 2006 for an example of SVMs used to forecast streamflow and Tripathi et al., 2006 for an example of SVMs in downscaling.

Implementing SVMs

In Python, Scikit-learn has an easy-to-use implementation of SVMs that can be used for classification or regression.
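
For example, a soft margin SVM with a linear kernel can be fit in a few lines (the toy data below is generated just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Generate a toy binary classification data set with 3 features
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="linear", C=1.0)   # C is the soft margin penalty coefficient
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print("number of support vectors:", clf.support_vectors_.shape[0])
```

Swapping the kernel argument (e.g., kernel="rbf") gives a kernelized SVM, and sklearn.svm.SVR provides the regression counterpart.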

Matlab has several functions for fitting SVMs including fitcsvm.

R has an svm function, and a library devoted to SVMs as part of the e1071 statistical package.

References

Herman, J.D., Zeff, H.B., Lamontagne, J.R., Reed, P.M., Characklis, G.W., 2016. Synthetic drought scenario generation to support bottom-up water supply vulnerability assessments. J. Water Resour. Plann. Manage. 142 (11), 04016050.

Groves, D.G., Lempert, R.J., 2007. A new analytic method for finding policy-relevant scenarios. Global Environ. Change 17 (1), 73–85. http://dx.doi.org/10.1016/j.gloenvcha.2006.11.006.

Discriminant analysis for classification

Data mining encompasses a variety of analytic techniques that can help us analyze, understand, and extract insight from large sets of data. These techniques generally fall into three broad categories:

Classification – Use a dataset of characteristics (variables) for an observation to determine in which discrete group/class the observation belongs. For example:

  • Loan applications: use data about an individual or business to determine whether the applicant is a “good risk” (approve loan) or a “bad risk” (refuse loan)
  • Project evaluation: sort possible capital investment projects into “priority”, “medium” and “unattractive” based on their characteristics

Prediction – Use data to predict the value (or a range of reasonable values) of a continuous numeric response variable. For example:

  • Predict sales volume of a product in a future time period
  • Predict annual heating/cooling expenses for a set of buildings

Segmentation – Separate a set of observations into some collection of groups that are “most alike” within a group and “different from” other groups. For example:

  • Identify groups of customers that have similar ordering patterns across seasons of the year
  • Identify groups of items that tend to be purchased together

This blogpost will focus on (linear) Discriminant Analysis (DA), one of the oldest techniques for solving classification problems. DA attempts to find a linear combination of features that separates two or more groups of observations. It is related to analysis of variance (ANOVA), the difference being that ANOVA attempts to predict a continuous dependent variable using one or more categorical independent variables, whereas DA attempts to predict a categorical dependent variable using one or more continuous independent variables. DA is very similar to logistic regression, with the fundamental difference that DA assumes the independent variables are normally distributed within each group. DA is also similar to Principal Component Analysis (PCA), especially in how the two are applied. DA differs from PCA in that it tries to find a vector in the variable space that best discriminates among the different groups, by modelling the difference between the centroids of each group. PCA instead tries to find a subspace of the variable space whose basis vectors best capture the variability across all observations. In other words, DA deals directly with discrimination between groups, whereas PCA identifies the principal components of the data in its entirety, without particularly focusing on the underlying group structure [1].
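
For readers who simply want to apply DA, scikit-learn ships a linear discriminant classifier. Here is a minimal sketch with made-up arrays (this is the standard library route, not the exact Mahalanobis-based rule worked through below):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Made-up training data: rows are observations, columns are the two test scores
X = np.array([[35, 25], [38, 28], [45, 35], [47, 36], [55, 45], [58, 47]])
y = np.array([0, 0, 1, 1, 2, 2])   # 0 = inferior, 1 = average, 2 = superior

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(lda.predict(np.array([[50, 40]])))   # predicted class of a new applicant
```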

To demonstrate how DA works, I’ll use an example adapted from Ragsdale (2018) [2], in which potential employees are given a pre-employment test measuring their mechanical and verbal aptitudes and are then classified as having superior, average, or inferior performance (Fig. 1).

Fig. 1 – Employee classification on the basis of mechanical and verbal aptitude scores

The centroids of the three groups indicate the average value of each independent variable (the two test scores) for that group. The aim of DA is to find a classification rule that maximizes the separation between the group means, while making as few “mistakes” as possible. There are multiple discrimination rules available; in this post I will be demonstrating the maximum likelihood rule, as achieved through the application of the Mahalanobis Distance. The idea is the following: an observation i from a multivariate normal distribution should be assigned to the group G_j that minimizes the Mahalanobis distance between i and the group centroid C_j.

Fig. 2 – Independent variables have different variances

The reason to use Mahalanobis (as opposed to, say, Euclidean) distance is twofold: if the independent variables have unequal variances or if they are correlated, a distance metric that does not account for that could potentially misallocate an observation to the wrong group. In Fig. 2, the independent variables are not correlated, but X2 appears to have much larger variance than X1, so the effects of small but important differences in X1 could be masked by large but unimportant differences in X2. The ellipses in the figure represent regions containing 99% of the values belonging to each group. By Euclidean distance, P1 would be assigned to group 2,

Fig. 3 – Independent variables are correlated

however it is unlikely to be in group 2, because its location with respect to the X1 axis exceeds the typical values for group 2. If the two attributes were additionally correlated (Fig. 3), we should also adjust for this correlation in our distance metric. The Mahalanobis distance therefore uses the covariance matrix of the independent variables in its calculation of the distance of an observation to the group means.

It is given by: D_{ij}=\sqrt{(X_i-\mu_j)^TS^{-1}(X_i-\mu_j)} where X_i is the vector of attributes for observation i, \mu_j is the vector of means for group j, and S is the covariance matrix for the variables. When using the Mahalanobis Distance, points that are equidistant from a group mean lie on tilted ellipses:

Fig. 4 – Using the Mahalanobis Distance, points equidistant from a group mean lie on tilted ellipses
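
To make the effect of the covariance adjustment concrete, here is a small numpy/scipy sketch with made-up numbers (mirroring the situation in Fig. 2, where X2 has a much larger variance than X1). A point can be closer to a centroid in Euclidean terms yet farther in Mahalanobis terms:

```python
import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis

# Hypothetical group centroid and covariance matrix (X2 variance >> X1 variance)
centroid = np.array([10.0, 50.0])
S = np.array([[1.0, 0.0],
              [0.0, 100.0]])
S_inv = np.linalg.inv(S)

p1 = np.array([13.0, 50.0])   # unusual in X1, typical in X2
p2 = np.array([10.0, 65.0])   # typical in X1, unusual in X2

for p in (p1, p2):
    print(euclidean(p, centroid), mahalanobis(p, centroid, S_inv))
# p1 is nearer by Euclidean distance (3 vs 15) but farther by Mahalanobis (3.0 vs 1.5),
# because a 3-unit deviation in low-variance X1 is more unusual than 15 units in X2.
```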

For further discussion of the two distance metrics and their application, there’s also this blogpost. Applying the formula to the dataset, we can calculate the distance between each observation and each group centroid, and use those distances to assign each observation to a group:

Table 1 – Distance of each observation to each group, using the Mahalanobis distance. Each observation is classified to the group with the minimum of the three distances. Values in red indicate misclassifications.

If we tally up the correct vs. incorrect (in red) classifications, our classification model is right 83.33% of the time. Assuming that this classification accuracy is sufficient, we can then apply this simple model to classify new employees based on their scores on the verbal and mechanical aptitude tests.
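
For completeness, the classification rule itself is only a few lines of numpy. The group means and covariance matrix below are placeholders (not the values from the Ragsdale example), so this is just a sketch of the procedure:

```python
import numpy as np

def classify(x, group_means, S_inv):
    """Assign observation x to the group with the smallest Mahalanobis distance."""
    d = [np.sqrt((x - mu) @ S_inv @ (x - mu)) for mu in group_means]
    return int(np.argmin(d)), d

# Hypothetical group centroids (mechanical score, verbal score) and covariance matrix
group_means = [np.array([35.0, 25.0]),   # inferior
               np.array([45.0, 35.0]),   # average
               np.array([55.0, 45.0])]   # superior
S = np.array([[40.0, 15.0],
              [15.0, 30.0]])
S_inv = np.linalg.inv(S)

new_applicant = np.array([50.0, 40.0])
group, distances = classify(new_applicant, group_means, S_inv)
print(group, np.round(distances, 2))
```

In practice, the group means and S would be estimated from the training data (e.g., using the pooled within-group covariance), as in the worked example above.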

[1] Martinez, A. M., and A. C. Kak. 2001. “PCA versus LDA.” IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2): 228–33. https://doi.org/10.1109/34.908974.

[2] Ragsdale, Cliff T. Spreadsheet Modeling and Decision Analysis: a Practical Introduction to Business Analytics. Eighth edition. Boston, MA: Cengage Learning, 2018.