Multivariate Distances: Mahalanobis vs. Euclidean

Some supervised and unsupervised learning algorithms, such as k-nearest neighbors and k-means clustering, depend on distance calculations. In this post, I will discuss why the Mahalanobis distance is almost always better to use than the Euclidean distance for the multivariate case. There is overlap between the ideas here and David’s post on multicollinearity, so give that one a read too!

Why you should care about multivariate distance

Synthetic time series generation methods are of interest to many of us in systems optimization and the topic been covered extensively on this blog. For an overview on that subject, check out Jon’s blog post on synthetic streamflow. It’s a great read and will give you many more resources to explore.

Among synthetic time series generation methods the k-nearest neighbors (k-NN) bootstrap resampling algorithm, developed by Lall and Sharma (1996), is a popular method for generating synthetic time series data of streamflow and weather variables. The k-NN algorithm resamples the observed data in a way that attempts to preserve the statistics of that data (e.g., mean and standard deviation at each timestep, lag-1 autocorrelation, etc.) but creates new and interesting synthetic records for the user to explore. As the name implies, this algorithm relies on finding k (generally set to be ⌊√(N)⌉, where N is the number of years of observed data) “nearest neighbors” to do its magic.

Determining the nearest neighbors

Finding those neighbors is straightforward in the univariate case (when there is only a single variable you want to simulate)—you just calculate the Euclidean distance. The shorter the distance, the “nearer” the neighbor. Well, it gets a bit more complicated in the multivariate case. There, you’ve got different units involved and correlation among variables which throws a wrench in the whole Euclidean distance thing. So, in most cases the Mahalanobis distance is preferred. Let me explain…

Example: how multivariate distance can help buy a car

Say we want to buy a four-wheel drive (4wd) car that will get us up into the mountains. We’ve got our eye set on a dream car, a 4wd Jeep, but we know we should shop around. So, let’s look at other 4wd cars on the market and compare their highway gas mileage and displacement (the total volume of all the cylinders in your engine) to find other cars we might be interested in. In other words, we are looking for the dream car’s nearest neighbors, with respect to those two measures.

fig1

Figure 1. Comparing highway gas mileage with displacement for our dream car and the others available.

Euclidean Distance

By glancing at the plot above, the distance calculation might appear trivial. In fact, you can probably roughly rank which points lie closest to the dream car just by eyeballing it. But when you try to do the calculation for Euclidean distance (equation 1), it will be skewed based on the units for gas mileage and displacement.

d(\overrightarrow{x},\overrightarrow{y})=\sqrt{(\overrightarrow{x}-\overrightarrow{y})^T(\overrightarrow{x}-\overrightarrow{y})}                                          (1)

Where: \overrightarrow{x} represents the attributes of our car and \overrightarrow{y} represents the attributes of another car.

For example, what if instead of miles per gallon, gas mileage was reported in feet per gallon? By changing those units, gas mileage would have multiple orders of magnitude more weight in the distance calculation than displacement. In that case, gas mileage would basically be the only thing that matters, which isn’t fair to poor old displacement. Therefore, when using the Euclidean distance to compare multiple variables we need to standardize the data which eliminates units and weights both measures equally. To do so, we can calculate the z-score (equation 2) for each observation:

z = \frac{x - \mu}{\sigma}                                      (2)

Where: z is the z-score (standardized variable), x is an observation, \mu and \sigma are the mean and standard deviation of the observation variable, respectively.

Visually, this is just like looking at our plot from before with no units at all.

fig2

Figure 2. Scale removed from Figure 1 to show that we need to remove the influence of units on the Euclidean distance calculation.

Now we can calculate the Euclidean distance and find the nearest neighbors!

euclidean-distance_cars

Figure 3. The Euclidean distance and rank assigned to each car, where rank 0 is our “dream car”. If we were interested in a k-nearest neighbor algorithm with k=4, the points in the orange box would be selected as the neighbors.

Take note of the k-nearest neighbors in the orange box. Let’s see whether or not we get the same neighbors with the Mahalanobis distance.

Mahalanobis Distance

The Mahalanobis distance calculation (equation 3) differs only slightly from Euclidean distance (equation 1).

d(\overrightarrow{x},\overrightarrow{y})=\sqrt{(\overrightarrow{x}-\overrightarrow{y})^TS^{-1}(\overrightarrow{x}-\overrightarrow{y})}                                          (3)

Where: \overrightarrow{x} represents the attributes of our car, \overrightarrow{y} represents the attributes of another car, and S^{-1} is the covariance matrix of \overrightarrow{x} and \overrightarrow{y}

Unlike the Euclidean distance though, the Mahalanobis distance accounts for how correlated the variables are to one another. For example, you might have noticed that gas mileage and displacement are highly correlated. Because of this, there is a lot of redundant information in that Euclidean distance calculation. By considering the covariance between the points in the distance calculation, we remove that redundancy.

mahalanobis-distance_cars

Figure 4. The Mahalanobis distance and rank assigned to each car, where rank 0 is our “dream car”.

And look! By comparing the ranks in the orange boxes in Figures 3 and 4, we can see that  although the ranks are similar between the two distance metrics, they do in fact yield different nearest neighbors. So which points get more weight when using the Mahalnobis distance vs. using the Euclidean distance?

To answer that question, I’ve standardized the distance calculations so we can compare them to one another and plotted each on a 1-to-1 line. If the distance metrics were exactly the same, all the points would end up on that line and they would each have a Mahalanobis to Euclidean ratio of 0. However, we see that certain points get more weight (i.e., a larger distance calculated) depending on the distance metric used.

mahalanobis-euclidean-distance-ratio

Figure 5. Mahalanobis to Euclidean distances plotted for each car in the dataset. The points are colored based on the Mahalnobis to Euclidean ratio, where zero means that the distance metrics have equal weight. Purple means the Mahalanobis distance has greater weight than Euclidean and orange means the opposite.

Let’s map the Mahalanonbis to Euclidean ratio onto our gas mileage v. displacement plot.

fig6

Figure 6. The gas mileage vs. displacement of the cars as color-coded by the Mahalanobis to Euclidean ratio from Figure 5.

Notice that many of the points at the top left and bottom right part of the screen are orange, meaning that the Euclidean distance calculation would give them more weight. And then there’s that point at the bottom center of plot. That one gets far more weight when using Mahalanobis distance. To understand this let’s look at the axes of greatest variability in the data, these are also known as principle components. For a primer on that subject, check out David’s post and Ronhini’s post on principle component analysis!

mahalanobis-euclidean_pca

When using Mahalanobis, the ellipse shown on the plot is squeezed towards circle. Along the first principle component axis, there is a lot of work to get it there! The points in the top right and bottom right corners move quite a bit to get towards a nice neat circle. Along the second principle component axis, there is not much squishing to do. The difference between these distance calculations are due to this “squishification” (a term used by the great 3blue1brown so it must be real). The Mahalnobis distance can be thought of calculating the Euclidean distance after performing this “squishification”. In fact, when the variables are completely uncorrelated, no squishing can happen, thus these two calculations are identical (i.e., S^{-1}=1).

Why you should use Mahalanobis distance (in general)

Which one should I use and when? When in doubt, Mahalanobis it out. When using the Mahalanobis distance, we don’t have to standardize the data like we did for the Euclidean distance. The covariance matrix calculation takes care of this. Also, it removes redundant information from correlated variables. Even if your variables aren’t very correlated it can’t hurt to use Mahalanobis distance, it will just be quite similar to the results you’ll get from Euclidean. You’ll notice that most recent k-NN resampling literature uses the Mahalanobis distance: Yates et al. (2003) and Sharif and Burn (2007).

One issue with the Mahalanobis distance is that it depends on taking the inverse of the covariance matrix. If this matrix is not invertible, no need to fear, you can calculate the pseudo-inverse instead to calculate the Mahalanobis distance (thanks to Philip Trettner for pointing that out!).

Code Availability

For anyone interested in the code used to create the figures in this post, I’ve created a GitHub gist.

References

Lall, Upmanu, and Ashish Sharma. “A Nearest Neighbor Bootstrap For Resampling Hydrologic Time Series.” Water Resources Research 32, no. 3 (March 1, 1996): 679–93. https://doi.org/10.1029/95WR02966.
Sharif, Mohammed, and Donald Burn. “Improved K -Nearest Neighbor Weather Generating Model.” Journal of Hydrologic Engineering 12, no. 1 (January 1, 2007): 42–51. https://doi.org/10.1061/(ASCE)1084-0699(2007)12:1(42).
Yates, David, Subhrendu Gangopadhyay, Balaji Rajagopalan, and Kenneth Strzepek. “A Technique for Generating Regional Climate Scenarios Using a Nearest-Neighbor Algorithm.” Water Resources Research 39, no. 7 (2003). https://doi.org/10.1029/2002WR001769.
Advertisements

3 thoughts on “Multivariate Distances: Mahalanobis vs. Euclidean

  1. “if the covariance matrix is not invertible, you have to use the Euclidean distance calculation”

    You should be able to use the pseudoinverse instead, right? The covariance matrix loses a rank when two variables are exactly scaled versions of each other in which case the pseudoinverse should just ignore one of them (which is the behavior that you want because it’s redundant data).

  2. Pingback: Intro to Machine Learning: Classification using K-NN – Water Programming: A Collaborative Research Blog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s