Discriminant analysis for classification

Data mining encompasses a variety of analytic techniques that can help us analyze, understand, and extract insight from large sets of data. These techniques generally fall into three broad categories:

Classification – Use a dataset of characteristics (variables) for an observation to determine in which discrete group/class the observation belongs. For example:

  • Loan applications: use data about an individual or business to determine whether the applicant is a “good risk” (approve loan) or a “bad risk” (refuse loan)
  • Project evaluation: sort possible capital investment projects into “priority”, “medium” and “unattractive” based on their characteristics

Prediction – Use data to predict the value (or a range of reasonable values) of a continuous numeric response variable. For example:

  • Predict sales volume of a product in a future time period
  • Predict annual heating/cooling expenses for a set of buildings

Segmentation – Separate a set of observations into some collection of groups that are “most alike” within a group and “different from” other groups. For example:

  • Identify groups of customers that have similar ordering patterns across seasons of the year
  • Identify groups of items that tend to be purchased together

This blogpost will focus on (linear) Discriminant Analysis (DA), one of the oldest techniques for solving classification problems. DA attempts to find a linear combination of features that separates two or more groups of observations. It is related to analysis of variance (ANOVA), the difference being that ANOVA attempts to predict a continuous dependent variable using one or more categorical independent variables, whereas DA attempts to predict a categorical dependent variable using one or more continuous independent variables. DA is very similar to logistic regression, with the fundamental difference that DA assumes the independent variables are normally distributed within each group. DA is also similar to Principal Component Analysis (PCA), especially in their application. DA differs from PCA in that it tries to find a vector in the variable space that best discriminates among the different groups, by modelling the difference between the centroids of each group. PCA instead tries to find a subspace of the variable space whose basis vectors best capture the variability among the different observations. In other words, DA deals directly with discrimination between groups, whereas PCA identifies the principal components of the data in its entirety, without particularly focusing on the underlying group structure[1].

To demonstrate how DA works, I’ll use an example adapted from Ragsdale (2018)[2] where potential employees are given a pre-employment test measuring their mechanical and verbal aptitudes and are then classified into having a superior, average, or inferior performance (Fig. 1).

DA_1

Fig. 1 – Employee classification on the basis of mechanical and verbal aptitude scores

The centroids of the three groups indicate the average value of each independent variable (the two test scores) for that group. The aim of DA is to find a classification rule that maximizes the separation between the group means while making as few “mistakes” as possible. There are multiple discrimination rules available; in this post I will be demonstrating the maximum likelihood rule, as achieved through the application of the Mahalanobis distance. The idea is the following: an observation i from a multivariate normal distribution should be assigned to the group G_j whose centroid C_j minimizes the Mahalanobis distance between i and C_j.

DA_2

Fig. 2 – Independent variables have different variances

The reason to use the Mahalanobis distance (as opposed to, say, the Euclidean distance) is twofold: if the independent variables have unequal variances or if they are correlated, a distance metric that does not account for that could potentially misallocate an observation to the wrong group. In Fig. 2, the independent variables are not correlated, but X2 appears to have much larger variance than X1, so the effects of small but important differences in X1 could be masked by large but unimportant differences in X2. The ellipses in the figure represent regions containing 99% of the values belonging to each group. By Euclidean distance, P1 would be assigned to group 2,

DA_3

Fig. 3 – Independent variables are correlated

however it is unlikely to belong to group 2 because its location with respect to the X1 axis exceeds the typical values for group 2. If the two attributes were additionally correlated (Fig. 3), we should also adjust for this correlation in our distance metric. The Mahalanobis distance therefore uses the covariance matrix of the independent variables in its calculation of the distance of an observation to the group means.

It is given by:  D_{ij}=\sqrt{(X_i-\mu_j)^TS^{-1}(X_i-\mu_j)} where X_i is the vector of attributes for observation i, \mu_j is the vector of means for group j, and S is the covariance matrix for the variables. When using the Mahalanobis distance, points that are equidistant from a group mean lie on tilted ellipses:

DA_4

Fig. 4 – Using the Mahalanobis distance, points equidistant from a group mean lie on tilted ellipses

For further discussion on the two distance metrics and their application, there’s also this blogpost. Applying the formula to the dataset, we can calculate the distance between each observation and each group centroid, and use the minimum of those distances to classify each observation into one of the groups:

DA_5

Table 1 – Distance of each observation to each group, using the Mahalanobis distance. Each observation is classified into the group with the minimum of the three distances. Values in red indicate misclassifications.

If we tally up the correct vs. incorrect (in red) classifications, our classification model is right 83.33% of the time. Assuming that this classification accuracy is sufficient, we can then apply this simple model to classify new employees based on their scores on the verbal and mechanical aptitude tests.
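
To make the procedure concrete, here is a minimal sketch in Python/NumPy of the maximum likelihood rule described above: compute the Mahalanobis distance from each observation to each group centroid, assign each observation to the nearest centroid, tally the accuracy, and classify a new applicant. The aptitude scores, labels, and new-applicant values are hypothetical stand-ins rather than the actual Ragsdale (2018) data, and the overall sample covariance is used for S (a pooled within-group covariance is another common choice).

    import numpy as np

    def mahalanobis(x, mu, S_inv):
        # Mahalanobis distance between observation x and group centroid mu
        d = x - mu
        return np.sqrt(d @ S_inv @ d)

    # Hypothetical (mechanical, verbal) test scores and known groups:
    # 0 = superior, 1 = average, 2 = inferior
    X = np.array([[61., 74.], [66., 79.], [52., 68.], [58., 64.], [45., 60.], [38., 55.]])
    y = np.array([0, 0, 1, 1, 2, 2])

    # Group centroids and the inverse covariance matrix of the two scores
    centroids = np.array([X[y == g].mean(axis=0) for g in np.unique(y)])
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))

    # Assign each observation to the group with the smallest distance (the logic behind Table 1)
    D = np.array([[mahalanobis(x, mu, S_inv) for mu in centroids] for x in X])
    predicted = D.argmin(axis=1)
    print(f"Classification accuracy: {(predicted == y).mean():.2%}")

    # The same rule classifies a new applicant from their two test scores
    new_scores = np.array([55., 70.])
    print("New applicant assigned to group",
          np.argmin([mahalanobis(new_scores, mu, S_inv) for mu in centroids]))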

[1] Martinez, A. M., and A. C. Kak. 2001. “PCA versus LDA.” IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2): 228–33. https://doi.org/10.1109/34.908974.

[2] Ragsdale, Cliff T. Spreadsheet Modeling and Decision Analysis: a Practical Introduction to Business Analytics. Eighth edition. Boston, MA: Cengage Learning, 2018.

 

Multivariate Distances: Mahalanobis vs. Euclidean

Some supervised and unsupervised learning algorithms, such as k-nearest neighbors and k-means clustering, depend on distance calculations. In this post, I will discuss why the Mahalanobis distance is almost always better to use than the Euclidean distance for the multivariate case. There is overlap between the ideas here and David’s post on multicollinearity, so give that one a read too!

Why you should care about multivariate distance

Synthetic time series generation methods are of interest to many of us in systems optimization, and the topic has been covered extensively on this blog. For an overview on that subject, check out Jon’s blog post on synthetic streamflow. It’s a great read and will give you many more resources to explore.

Among synthetic time series generation methods, the k-nearest neighbors (k-NN) bootstrap resampling algorithm, developed by Lall and Sharma (1996), is a popular method for generating synthetic time series data of streamflow and weather variables. The k-NN algorithm resamples the observed data in a way that attempts to preserve the statistics of that data (e.g., mean and standard deviation at each timestep, lag-1 autocorrelation, etc.) but creates new and interesting synthetic records for the user to explore. As the name implies, this algorithm relies on finding k (generally set to be ⌊√(N)⌉, where N is the number of years of observed data) “nearest neighbors” to do its magic.
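
As a rough illustration of that resampling step (a sketch under simplifying assumptions, not a faithful reimplementation of Lall and Sharma’s full algorithm), the snippet below picks k ≈ √N nearest neighbors from a vector of precomputed distances and samples one of them using the decreasing 1/j kernel weights commonly used with the k-NN bootstrap. The function name and the distances are hypothetical.

    import numpy as np

    rng = np.random.default_rng(42)

    def knn_bootstrap_pick(distances, k=None):
        # Pick one "neighbor" index given distances from the current state to
        # every candidate year, weighting nearer neighbors more heavily
        n = len(distances)
        if k is None:
            k = int(round(np.sqrt(n)))          # k ~ sqrt(N), as noted above
        neighbors = np.argsort(distances)[:k]   # indices of the k nearest candidates
        weights = 1.0 / np.arange(1, k + 1)     # nearest neighbor gets the largest weight
        weights /= weights.sum()
        return rng.choice(neighbors, p=weights)

    # Hypothetical distances from the current state to 25 candidate years
    print(knn_bootstrap_pick(rng.random(25)))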

Determining the nearest neighbors

Finding those neighbors is straightforward in the univariate case (when there is only a single variable you want to simulate): you just calculate the Euclidean distance. The shorter the distance, the “nearer” the neighbor. Well, it gets a bit more complicated in the multivariate case. There, you’ve got different units involved and correlation among the variables, which throws a wrench in the whole Euclidean distance thing. So, in most cases the Mahalanobis distance is preferred. Let me explain…

Example: how multivariate distance can help buy a car

Say we want to buy a four-wheel drive (4wd) car that will get us up into the mountains. We’ve got our eye set on a dream car, a 4wd Jeep, but we know we should shop around. So, let’s look at other 4wd cars on the market and compare their highway gas mileage and displacement (the total volume of all the cylinders in your engine) to find other cars we might be interested in. In other words, we are looking for the dream car’s nearest neighbors, with respect to those two measures.

fig1

Figure 1. Comparing highway gas mileage with displacement for our dream car and the others available.

Euclidean Distance

At a glance, the distance calculation for the plot above might appear trivial. In fact, you can probably roughly rank which points lie closest to the dream car just by eyeballing it. But when you try to do the calculation for the Euclidean distance (equation 1), it will be skewed by the units used for gas mileage and displacement.

d(\overrightarrow{x},\overrightarrow{y})=\sqrt{(\overrightarrow{x}-\overrightarrow{y})^T(\overrightarrow{x}-\overrightarrow{y})}                                          (1)

Where: \overrightarrow{x} represents the attributes of our car and \overrightarrow{y} represents the attributes of another car.

For example, what if instead of miles per gallon, gas mileage was reported in feet per gallon? By changing those units, gas mileage would have multiple orders of magnitude more weight in the distance calculation than displacement. In that case, gas mileage would basically be the only thing that matters, which isn’t fair to poor old displacement. Therefore, when using the Euclidean distance to compare multiple variables we need to standardize the data which eliminates units and weights both measures equally. To do so, we can calculate the z-score (equation 2) for each observation:

z = \frac{x - \mu}{\sigma}                                      (2)

Where: z is the z-score (standardized variable), x is an observation, \mu and \sigma are the mean and standard deviation of the observation variable, respectively.

Visually, this is just like looking at our plot from before with no units at all.

fig2

Figure 2. Scale removed from Figure 1 to show that we need to remove the influence of units on the Euclidean distance calculation.

Now we can calculate the Euclidean distance and find the nearest neighbors!

euclidean-distance_cars

Figure 3. The Euclidean distance and rank assigned to each car, where rank 0 is our “dream car”. If we were interested in a k-nearest neighbor algorithm with k=4, the points in the orange box would be selected as the neighbors.
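
Here is a minimal sketch of that calculation, with made-up mileage and displacement numbers standing in for the real car data: z-score each column (equation 2), then compute the Euclidean distance (equation 1) from the dream car and rank the results.

    import numpy as np

    # Hypothetical (highway mpg, displacement in liters); row 0 is the "dream car"
    cars = np.array([[22, 3.6], [19, 5.7], [28, 2.0], [17, 6.2], [24, 3.0], [21, 4.0]])

    # Standardize each column (equation 2) so units no longer dominate the distance
    z = (cars - cars.mean(axis=0)) / cars.std(axis=0)

    # Euclidean distance (equation 1) from the dream car to every car, then rank
    euclid = np.sqrt(((z - z[0]) ** 2).sum(axis=1))
    print(euclid.argsort())   # indices from nearest to farthest; the dream car itself comes first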

Take note of the k-nearest neighbors in the orange box. Let’s see whether or not we get the same neighbors with the Mahalanobis distance.

Mahalanobis Distance

The Mahalanobis distance calculation (equation 3) differs only slightly from Euclidean distance (equation 1).

d(\overrightarrow{x},\overrightarrow{y})=\sqrt{(\overrightarrow{x}-\overrightarrow{y})^TS^{-1}(\overrightarrow{x}-\overrightarrow{y})}                                          (3)

Where: \overrightarrow{x} represents the attributes of our car, \overrightarrow{y} represents the attributes of another car, and S is the covariance matrix of the attributes (so S^{-1} is its inverse).

Unlike the Euclidean distance, though, the Mahalanobis distance accounts for how correlated the variables are to one another. For example, you might have noticed that gas mileage and displacement are highly correlated. Because of this, there is a lot of redundant information in that Euclidean distance calculation. By considering the covariance of the variables in the distance calculation, we remove that redundancy.
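
As a sketch (again using the made-up numbers from the previous snippet rather than the actual car data), the only change from the Euclidean version is the inverse covariance matrix inside the quadratic form. For a single pair of points, scipy.spatial.distance.mahalanobis performs the same calculation.

    import numpy as np

    # Reusing the hypothetical car data from the previous sketch; raw units are fine
    # here because the covariance matrix handles scaling as well as correlation
    cars = np.array([[22, 3.6], [19, 5.7], [28, 2.0], [17, 6.2], [24, 3.0], [21, 4.0]])
    S_inv = np.linalg.inv(np.cov(cars, rowvar=False))

    diff = cars - cars[0]                                  # offsets from the dream car
    mahal = np.sqrt(np.einsum('ij,jk,ik->i', diff, S_inv, diff))
    print(mahal.argsort())                                 # indices from nearest to farthest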

mahalanobis-distance_cars

Figure 4. The Mahalanobis distance and rank assigned to each car, where rank 0 is our “dream car”.

And look! By comparing the ranks in the orange boxes in Figures 3 and 4, we can see that although the ranks are similar between the two distance metrics, they do in fact yield different nearest neighbors. So which points get more weight when using the Mahalanobis distance vs. the Euclidean distance?

To answer that question, I’ve standardized the distance calculations so we can compare them to one another and plotted them against a 1-to-1 line. If the distance metrics were exactly the same, all the points would end up on that line and each would have a Mahalanobis-to-Euclidean ratio of 0. However, we see that certain points get more weight (i.e., a larger distance calculated) depending on the distance metric used.

mahalanobis-euclidean-distance-ratio

Figure 5. Mahalanobis and Euclidean distances plotted against each other for each car in the dataset. The points are colored by the Mahalanobis-to-Euclidean ratio, where zero means that the distance metrics have equal weight. Purple means the Mahalanobis distance has greater weight than the Euclidean distance, and orange means the opposite.

Let’s map the Mahalanobis-to-Euclidean ratio onto our gas mileage vs. displacement plot.

fig6

Figure 6. The gas mileage vs. displacement of the cars, color-coded by the Mahalanobis-to-Euclidean ratio from Figure 5.

Notice that many of the points at the top left and bottom right of the plot are orange, meaning that the Euclidean distance calculation would give them more weight. And then there’s that point at the bottom center of the plot. That one gets far more weight when using the Mahalanobis distance. To understand this, let’s look at the axes of greatest variability in the data, also known as the principal components. For a primer on that subject, check out David’s post and Ronhini’s post on principal component analysis!

mahalanobis-euclidean_pca

When using the Mahalanobis distance, the ellipse shown on the plot is squeezed towards a circle. Along the first principal component axis, there is a lot of work to get it there! The points in the top right and bottom right corners move quite a bit to reach a nice neat circle. Along the second principal component axis, there is not much squishing to do. The difference between these distance calculations is due to this “squishification” (a term used by the great 3blue1brown, so it must be real). The Mahalanobis distance can be thought of as calculating the Euclidean distance after performing this “squishification”. In fact, when the (standardized) variables are completely uncorrelated, no squishing happens and the two calculations are identical (i.e., S^{-1} reduces to the identity matrix).
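
One way to check this “squishification” numerically: the Mahalanobis distance equals the Euclidean distance computed after a whitening transform, which rotates the data onto its principal components and rescales each axis by its standard deviation. A small sketch with arbitrary synthetic data (not the car dataset):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal([0, 0], [[4.0, 1.8], [1.8, 1.0]], size=200)

    S = np.cov(X, rowvar=False)
    S_inv = np.linalg.inv(S)

    # Whitening transform: the eigenvectors of S are the principal axes, and
    # dividing by sqrt(eigenvalues) "squishes" the ellipse into a circle
    vals, vecs = np.linalg.eigh(S)
    W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T       # symmetric S^(-1/2)

    d = X[0] - X[1]
    mahal = np.sqrt(d @ S_inv @ d)                         # Mahalanobis distance
    euclid_whitened = np.linalg.norm(W @ d)                # Euclidean distance after whitening
    print(np.isclose(mahal, euclid_whitened))              # True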

Why you should use Mahalanobis distance (in general)

Which one should I use and when? When in doubt, Mahalanobis it out. When using the Mahalanobis distance, we don’t have to standardize the data like we did for the Euclidean distance; the covariance matrix calculation takes care of this. It also removes redundant information from correlated variables. Even if your variables aren’t very correlated, it can’t hurt to use the Mahalanobis distance; the results will just be quite similar to what you’d get with the Euclidean distance. You’ll notice that most recent k-NN resampling literature uses the Mahalanobis distance, e.g., Yates et al. (2003) and Sharif and Burn (2007).

One issue with the Mahalanobis distance is that it depends on taking the inverse of the covariance matrix. If this matrix is not invertible, no need to fear: you can use the pseudo-inverse of the covariance matrix instead (thanks to Philip Trettner for pointing that out!).
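
A minimal sketch of that fallback, using np.linalg.pinv on a contrived dataset whose covariance matrix is singular (the second column is an exact multiple of the first); the helper name mahalanobis_safe is hypothetical.

    import numpy as np

    def mahalanobis_safe(x, y, data):
        # Mahalanobis distance using the Moore-Penrose pseudo-inverse, so the
        # calculation still works when the covariance matrix is singular
        S_pinv = np.linalg.pinv(np.cov(data, rowvar=False))
        d = x - y
        return np.sqrt(d @ S_pinv @ d)

    # The covariance matrix of this data is singular, so a plain inverse is unreliable
    data = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
    print(mahalanobis_safe(data[0], data[2], data))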

Code Availability

For anyone interested in the code used to create the figures in this post, I’ve created a GitHub gist.

References

Lall, Upmanu, and Ashish Sharma. “A Nearest Neighbor Bootstrap For Resampling Hydrologic Time Series.” Water Resources Research 32, no. 3 (March 1, 1996): 679–93. https://doi.org/10.1029/95WR02966.
Sharif, Mohammed, and Donald Burn. “Improved K -Nearest Neighbor Weather Generating Model.” Journal of Hydrologic Engineering 12, no. 1 (January 1, 2007): 42–51. https://doi.org/10.1061/(ASCE)1084-0699(2007)12:1(42).
Yates, David, Subhrendu Gangopadhyay, Balaji Rajagopalan, and Kenneth Strzepek. “A Technique for Generating Regional Climate Scenarios Using a Nearest-Neighbor Algorithm.” Water Resources Research 39, no. 7 (2003). https://doi.org/10.1029/2002WR001769.