Intro to Machine Learning: Classification using K-NN

Rohini and I are taking Kilian Weinberger’s fantastic machine learning class here at Cornell this semester, and I thought I’d use a couple of blog posts to share what we’ve been learning. I hope these posts can serve as a look into the topics covered in this class and into what is meant by the term “machine learning”, which is a hot topic these days. In these posts I’ll provide overviews of the topics we’re learning along with some example code. There is far more material in this class than could ever be covered on this blog, and if you’re interested in learning more on the subject I’d suggest checking out the course web page, where all lecture videos and lecture notes (from which the content in this post is derived) are free and open to the public.

In this post we’ll start with a very simple algorithm that you’ve probably heard of, the K-Nearest Neighbors (K-NN) algorithm. We’ll look at how K-NN is performed, why it works and the theoretical limits of its performance. We’ll then examine cases where K-NN performs poorly and learn why. The concepts in this simple example will lay the foundation for the more sophisticated Machine Learning techniques that will be discussed in future posts.

The Problem: Binary Classification

Say you have a set of data where each point has a number of features (attributes that define the point, such as an x-y coordinate) and a label, such as red/blue or +1/-1. If we were given a new point and told its features but not its label, could we predict its label based on information from the data we have? We’ll call our original data set the training set, and the new point a test input.


Figure 1: Can we predict the classification of the new point (star) based on its features (x, y coordinates) and those of the training data?

 

The K-NN algorithm

The K-NN algorithm is based on a simple idea: points that are close to each other in the feature space are likely to have similar labels. This allows us to create a classification rule:

For a test input x, assign the most common label among its k most similar training inputs

Here we define similarity in terms of distance within the feature space. To put the above definition formally, for a new point x the classification rule for its label, h(x), is:

h(x) = mode({y'' : (x'', y'') \in S_x})

where:

S_x is the set of the k nearest neighbors of x:

S_x \subseteq D \quad s.t. \quad |S_x| = k

and \forall (x',y') \in D \setminus S_x,

dist(x,x') \geq \underset{(x'',y'') \in S_x}{\max} \, dist(x,x'')


Figure 2: If k=5, the new point is classified as +1 because it has three +1 neighbors and two -1 neighbors.
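
Before moving on, here is a minimal sketch of the classification rule itself, written with NumPy. The function name and the toy clusters are my own inventions for illustration; the logic is simply: compute distances, grab the k closest training points, and take the most common label among them.

```python
import numpy as np
from collections import Counter

def knn_classify(x_test, X_train, y_train, k=5):
    """Predict the label of x_test as the mode of its k nearest training labels."""
    # Euclidean distance from the test input to every training input
    dists = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))
    # indices of the k smallest distances (the set S_x)
    nearest = np.argsort(dists)[:k]
    # most common label among those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# toy example: two clusters of 2-D points labeled -1 and +1
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y_train = np.array([-1] * 20 + [1] * 20)
print(knn_classify(np.array([2.5, 2.5]), X_train, y_train, k=5))  # likely +1, near the (3, 3) cluster
```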

The choice of k here is an important one: for noisy data with a smooth boundary between classes, a larger k will usually be preferable. Choosing k a priori is difficult; instead, it should be chosen by finding the k that minimizes the error on a validation set, and the resulting model should then be evaluated on a separate test set to check for overfitting.
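
As a concrete (and hedged) sketch of that model-selection loop, here is one way to do it with scikit-learn. The dataset, the candidate values of k, and the 60/20/20 split are arbitrary choices of mine for illustration:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# toy two-class dataset with a noisy, smooth boundary
X, y = make_moons(n_samples=600, noise=0.3, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=0)

# pick the k with the lowest error on the validation set
best_k = min(
    range(1, 30, 2),
    key=lambda k: 1 - KNeighborsClassifier(n_neighbors=k)
                      .fit(X_train, y_train)
                      .score(X_val, y_val),
)

# finally, check performance on data the model has never seen to guard against overfitting
test_accuracy = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train).score(X_test, y_test)
print(best_k, test_accuracy)
```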

Defining a distance metric

The concept of proximity is critical to K-NN: we define the set of nearest neighbors as the k closest points to our test point in the feature space. But how do we define closeness? When we think of distances in everyday life we usually think of Euclidean distance:

dist(x,z) = \sqrt{\sum_{r=1}^{d} (x_r-z_r)^2 }

However, other distance metrics may be useful. In fact, the Euclidean distance is just one form of the more general Minkowski distance:

dist(x,z) =(\sum_{r=1}^{d} |x_r-z_r|^p)^{1/p}

Aside from the Euclidean distance (p=2), other common forms of the Minkowski distance include the taxicab norm (p=1) and the infinity norm (the limit as p \to \infty).
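
A quick sketch of how these special cases relate, using NumPy (the two points here are arbitrary values I picked for illustration):

```python
import numpy as np

x = np.array([1.0, 4.0, 2.0])
z = np.array([3.0, 1.0, 2.0])
diff = np.abs(x - z)

taxicab  = diff.sum()                  # Minkowski with p = 1
euclid   = np.sqrt((diff ** 2).sum())  # Minkowski with p = 2
infinity = diff.max()                  # limit as p -> infinity

print(taxicab, euclid, infinity)       # 5.0, ~3.61, 3.0
```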

Another possible distance metric is the Mahalanobis distance, which accounts for correlation between features. For a detailed look into the use of the Mahalanobis and Euclidean distances, see Billy’s excellent post from earlier this year.

But how good of a classifier is K-NN?

How should we evaluate the performance of a machine learning algorithm? We can calculate the error rate of our predictions over a testing set, but this alone does not tell the full story as it neglects the properties of the testing set itself. It is therefore useful to set theoretical lower and upper bounds for error of a potential model.

The upper bound on error constitutes a classifier that our constructed model should always beat. A useful upper bound is the best constant predictor, which always predicts the most common label in the training set for any new point. If you construct a model that classifies points correctly 55% of the time, but 60% of the points in your data set have a certain label, your model is pretty useless.
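
As a tiny, hedged sketch of that baseline (the labels below are made up), the best constant predictor’s accuracy is simply the frequency of the most common label in the training set:

```python
import numpy as np

y_train = np.array([1, 1, -1, 1, -1, 1, 1, -1, 1, 1])      # toy training labels
labels, counts = np.unique(y_train, return_counts=True)
baseline_accuracy = counts.max() / len(y_train)
print(labels[counts.argmax()], baseline_accuracy)           # always predict +1; 70% accuracy
```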

Establishing the upper bound on error is pretty straightforward, but how do we specify a lower bound? What is the theoretical limit on the performance of a classification algorithm? To answer this question, let’s assume we know the probability of label y given a point with features x, P(y|x). We can then define the Bayes Optimal Classifier, h_opt(x), as:

h_{opt}(x) = y^* = \underset{y}{argmax} \, P(y|x)

The error rate of h_opt(x) is then defined as:

\epsilon_{BayesOpt} = 1-P(h_{opt}(x)|x) = 1-P(y^*|x)

This error is the theoretical limit of our performance: if we knew precisely the probability of each label given features x, this is the best we could do to predict a point’s label. So how does the K-NN algorithm stack up? Cover and Hart (1967) famously proved that the error rate of a 1-NN classifier is no more than twice the error of the Bayes optimal classifier (see the course notes for the full proof). Not bad!
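
To make the Bayes optimal idea concrete, here is a small sketch on a synthetic problem where we get to write down P(y|x) ourselves (in practice we never know it, which is the whole point). The particular distribution below is entirely made up:

```python
import numpy as np

# made-up conditional distribution: P(y = +1 | x) = 0.9 for x < 0 and 0.3 for x >= 0
def p_plus(x):
    return np.where(x < 0, 0.9, 0.3)

x = np.linspace(-1, 1, 1001)                   # treat x as uniformly distributed on [-1, 1]
p = p_plus(x)
bayes_prediction = np.where(p >= 0.5, 1, -1)   # argmax over the two labels
bayes_error = np.mean(np.minimum(p, 1 - p))    # average of 1 - P(y* | x)
print(bayes_error)                             # ~0.2 for this toy distribution
```

No classifier, however clever, can beat that error rate on this problem; it comes from the randomness in the labels themselves.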

The Curse of Dimensionality

The K-NN algorithm rests on the assumption that similar points have similar labels. As we increase the dimensionality of our feature space, however, finding “similar” points becomes increasingly difficult: as we increase the number of features we are comparing, all points in our data set become more distant from each other.

As an example, let’s take a look at the “similarity” between fellow Reed group member Rohini and me, based on a set of features I’ve chosen at random:

[Figure: feature-by-feature comparison of Rohini and Dave]

If we look at only three dimensions of features, Rohini and I are quite similar (in fact identical in terms of occupation, favorite class, and current location). As we increase the number of features, however, our similarity decreases precipitously. The more features we examine, the more unique each of us gets. If we were to randomly sample these features from 100 people on campus, the similarity between Rohini and me in this feature set would likely not stand out from the sample. This is the curse of dimensionality: as the number of features becomes large, the distance between all points increases to the point where it is impossible to determine “similarity” between points. If we are unable to find neighbors that are “similar” to a test point, how can we assume any other point can be used to predict its label?
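
You can see this effect with a quick simulation (my own sketch, not from the course notes): draw random points in a unit hypercube and watch how the gap between the nearest and farthest neighbor shrinks relative to the average distance as the number of dimensions grows.

```python
import numpy as np

rng = np.random.default_rng(1)
for d in [2, 10, 100, 1000]:
    X = rng.random((500, d))                      # 500 random points in the d-dimensional unit hypercube
    dists = np.linalg.norm(X[1:] - X[0], axis=1)  # distances from the first point to all the others
    spread = (dists.max() - dists.min()) / dists.mean()
    print(f"d = {d:4d}   relative spread of distances = {spread:.3f}")
```

As d grows, every point ends up nearly the same distance from our test point, so the notion of a “nearest” neighbor carries less and less information.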

For high dimensional data sets, techniques such as PCA may make K-NN feasible if the data has a lower dimensional structure. If this isn’t effective, we must seek other strategies for classification. While the curse of dimensionality dictates that all points in a large dimensional space are distant from each other, this concept does not hold for the distance between points and hyperplanes. The Perceptron Algorithm, which I’ll cover in the next post, exploits this idea to create linear classifiers that are effective on high dimensional data.


Multivariate Distances: Mahalanobis vs. Euclidean

Some supervised and unsupervised learning algorithms, such as k-nearest neighbors and k-means clustering, depend on distance calculations. In this post, I will discuss why the Mahalanobis distance is almost always better to use than the Euclidean distance for the multivariate case. There is overlap between the ideas here and David’s post on multicollinearity, so give that one a read too!

Why you should care about multivariate distance

Synthetic time series generation methods are of interest to many of us in systems optimization, and the topic has been covered extensively on this blog. For an overview of that subject, check out Jon’s blog post on synthetic streamflow. It’s a great read and will give you many more resources to explore.

Among synthetic time series generation methods the k-nearest neighbors (k-NN) bootstrap resampling algorithm, developed by Lall and Sharma (1996), is a popular method for generating synthetic time series data of streamflow and weather variables. The k-NN algorithm resamples the observed data in a way that attempts to preserve the statistics of that data (e.g., mean and standard deviation at each timestep, lag-1 autocorrelation, etc.) but creates new and interesting synthetic records for the user to explore. As the name implies, this algorithm relies on finding k (generally set to be ⌊√(N)⌉, where N is the number of years of observed data) “nearest neighbors” to do its magic.

Determining the nearest neighbors

Finding those neighbors is straightforward in the univariate case (when there is only a single variable you want to simulate)—you just calculate the Euclidean distance. The shorter the distance, the “nearer” the neighbor. Well, it gets a bit more complicated in the multivariate case. There, you’ve got different units involved and correlation among variables, which throws a wrench in the whole Euclidean distance thing. So, in most cases the Mahalanobis distance is preferred. Let me explain…

Example: how multivariate distance can help buy a car

Say we want to buy a four-wheel drive (4wd) car that will get us up into the mountains. We’ve got our eye set on a dream car, a 4wd Jeep, but we know we should shop around. So, let’s look at other 4wd cars on the market and compare their highway gas mileage and displacement (the total volume of all the cylinders in your engine) to find other cars we might be interested in. In other words, we are looking for the dream car’s nearest neighbors, with respect to those two measures.


Figure 1. Comparing highway gas mileage with displacement for our dream car and the others available.

Euclidean Distance

By glancing at the plot above, the distance calculation might appear trivial. In fact, you can probably roughly rank which points lie closest to the dream car just by eyeballing it. But when you try to do the calculation for Euclidean distance (equation 1), it will be skewed based on the units for gas mileage and displacement.

d(\overrightarrow{x},\overrightarrow{y})=\sqrt{(\overrightarrow{x}-\overrightarrow{y})^T(\overrightarrow{x}-\overrightarrow{y})}                                          (1)

Where: \overrightarrow{x} represents the attributes of our car and \overrightarrow{y} represents the attributes of another car.

For example, what if instead of miles per gallon, gas mileage was reported in feet per gallon? By changing those units, gas mileage would have multiple orders of magnitude more weight in the distance calculation than displacement. In that case, gas mileage would basically be the only thing that matters, which isn’t fair to poor old displacement. Therefore, when using the Euclidean distance to compare multiple variables we need to standardize the data which eliminates units and weights both measures equally. To do so, we can calculate the z-score (equation 2) for each observation:

z = \frac{x - \mu}{\sigma}                                      (2)

Where: z is the z-score (standardized variable), x is an observation, \mu and \sigma are the mean and standard deviation of the observation variable, respectively.

Visually, this is just like looking at our plot from before with no units at all.


Figure 2. Scale removed from Figure 1 to show that we need to remove the influence of units on the Euclidean distance calculation.

Now we can calculate the Euclidean distance and find the nearest neighbors!
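
Here is a hedged sketch of that calculation with pandas and NumPy. The little table of cars and its numbers are placeholders I’ve made up; the GitHub gist linked at the end of the post contains the actual code behind these figures.

```python
import numpy as np
import pandas as pd

# made-up subset of 4wd cars: highway mpg and displacement (liters)
cars = pd.DataFrame(
    {"hwy": [22, 26, 19, 28, 17], "displ": [3.6, 2.4, 5.7, 2.0, 6.2]},
    index=["dream_jeep", "car_a", "car_b", "car_c", "car_d"],
)

z = (cars - cars.mean()) / cars.std()           # z-score each column (equation 2)
diff = z - z.loc["dream_jeep"]
euclidean = np.sqrt((diff ** 2).sum(axis=1))    # equation 1 on the standardized data
print(euclidean.sort_values())                  # the dream car itself comes first at distance 0
```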


Figure 3. The Euclidean distance and rank assigned to each car, where rank 0 is our “dream car”. If we were interested in a k-nearest neighbor algorithm with k=4, the points in the orange box would be selected as the neighbors.

Take note of the k-nearest neighbors in the orange box. Let’s see whether or not we get the same neighbors with the Mahalanobis distance.

Mahalanobis Distance

The Mahalanobis distance calculation (equation 3) differs only slightly from Euclidean distance (equation 1).

d(\overrightarrow{x},\overrightarrow{y})=\sqrt{(\overrightarrow{x}-\overrightarrow{y})^TS^{-1}(\overrightarrow{x}-\overrightarrow{y})}                                          (3)

Where: \overrightarrow{x} represents the attributes of our car, \overrightarrow{y} represents the attributes of another car, and S is the covariance matrix of the attributes (so S^{-1} is its inverse).

Unlike the Euclidean distance though, the Mahalanobis distance accounts for how correlated the variables are to one another. For example, you might have noticed that gas mileage and displacement are highly correlated. Because of this, there is a lot of redundant information in that Euclidean distance calculation. By considering the covariance between the points in the distance calculation, we remove that redundancy.
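
Here is a minimal sketch of equation 3 in NumPy, using the same kind of made-up car attributes as before (the numbers are illustrative, not the data behind the figures):

```python
import numpy as np

# made-up (highway mpg, displacement) rows; the first row is the dream car
X = np.array([[22, 3.6], [26, 2.4], [19, 5.7], [28, 2.0], [17, 6.2]])
S_inv = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance of the attributes

def mahalanobis(x, y, S_inv):
    d = x - y
    return np.sqrt(d @ S_inv @ d)                # equation 3

dists = [mahalanobis(X[0], row, S_inv) for row in X]
print(np.argsort(dists))                         # rank of each car relative to the dream car
```

Note that no standardization step was needed here; the covariance matrix handles the units for us.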


Figure 4. The Mahalanobis distance and rank assigned to each car, where rank 0 is our “dream car”.

And look! By comparing the ranks in the orange boxes in Figures 3 and 4, we can see that although the ranks are similar between the two distance metrics, they do in fact yield different nearest neighbors. So which points get more weight when using the Mahalanobis distance vs. the Euclidean distance?

To answer that question, I’ve standardized the distance calculations so we can compare them to one another and plotted each on a 1-to-1 line. If the distance metrics were exactly the same, all the points would end up on that line and they would each have a Mahalanobis to Euclidean ratio of 0. However, we see that certain points get more weight (i.e., a larger distance calculated) depending on the distance metric used.


Figure 5. Mahalanobis to Euclidean distances plotted for each car in the dataset. The points are colored based on the Mahalanobis to Euclidean ratio, where zero means that the distance metrics have equal weight. Purple means the Mahalanobis distance has greater weight than Euclidean, and orange means the opposite.

Let’s map the Mahalanobis to Euclidean ratio onto our gas mileage vs. displacement plot.


Figure 6. The gas mileage vs. displacement of the cars as color-coded by the Mahalanobis to Euclidean ratio from Figure 5.

Notice that many of the points at the top left and bottom right of the plot are orange, meaning that the Euclidean distance calculation would give them more weight. And then there’s that point at the bottom center of the plot. That one gets far more weight when using the Mahalanobis distance. To understand this, let’s look at the axes of greatest variability in the data, also known as the principal components. For a primer on that subject, check out David’s post and Rohini’s post on principal component analysis!

[Figure: the car data with its principal component axes and covariance ellipse overlaid]

When using Mahalanobis, the ellipse shown on the plot is squeezed towards a circle. Along the first principal component axis there is a lot of work to get it there! The points in the top right and bottom right corners move quite a bit to get towards a nice neat circle. Along the second principal component axis, there is not much squishing to do. The difference between these distance calculations is due to this “squishification” (a term used by the great 3blue1brown, so it must be real). The Mahalanobis distance can be thought of as calculating the Euclidean distance after performing this “squishification”. In fact, when the variables are completely uncorrelated no squishing can happen, and (once the data are standardized) the two calculations are identical, since S^{-1} is just the identity matrix.
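
A small numerical check of that equivalence (my own sketch, not part of the original analysis): if you “squish” the data with a whitening transform built from the covariance matrix, plain Euclidean distance in the transformed space matches the Mahalanobis distance in the original space.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0], [[4.0, 3.0], [3.0, 4.0]], size=200)  # correlated toy data

S = np.cov(X, rowvar=False)
L = np.linalg.cholesky(np.linalg.inv(S))    # one possible whitening ("squishing") matrix
X_white = X @ L                             # data after squishification

x, y = X[0], X[1]
mahal = np.sqrt((x - y) @ np.linalg.inv(S) @ (x - y))   # Mahalanobis in the original space
euclid_white = np.linalg.norm(X_white[0] - X_white[1])  # Euclidean in the squished space
print(np.isclose(mahal, euclid_white))                  # True
```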

Why you should use Mahalanobis distance (in general)

Which one should I use, and when? When in doubt, Mahalanobis it out. When using the Mahalanobis distance, we don’t have to standardize the data like we did for the Euclidean distance; the covariance matrix calculation takes care of this. It also removes redundant information from correlated variables. Even if your variables aren’t very correlated, it can’t hurt to use the Mahalanobis distance; the results will just be quite similar to what you’d get from Euclidean. You’ll notice that most recent k-NN resampling literature uses the Mahalanobis distance, e.g., Yates et al. (2003) and Sharif and Burn (2007).

One issue with the Mahalanobis distance is that it depends on taking the inverse of the covariance matrix. If this matrix is not invertible, no need to fear: you can use the pseudo-inverse instead when calculating the Mahalanobis distance (thanks to Philip Trettner for pointing that out!).
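
In NumPy that is essentially a one-word change; a hedged sketch with made-up data: swap np.linalg.inv for np.linalg.pinv, which still works when the covariance matrix is singular.

```python
import numpy as np

def mahalanobis_pinv(x, y, X):
    """Mahalanobis distance using the pseudo-inverse of the sample covariance of X."""
    S_pinv = np.linalg.pinv(np.cov(X, rowvar=False))   # fine even if cov(X) is singular
    d = np.asarray(x) - np.asarray(y)
    return np.sqrt(d @ S_pinv @ d)

# example with a singular covariance matrix: the second attribute is a copy of the first
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [5.0, 5.0]])
print(mahalanobis_pinv(X[0], X[3], X))
```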

Code Availability

For anyone interested in the code used to create the figures in this post, I’ve created a GitHub gist.

References

Lall, Upmanu, and Ashish Sharma. “A Nearest Neighbor Bootstrap For Resampling Hydrologic Time Series.” Water Resources Research 32, no. 3 (March 1, 1996): 679–93. https://doi.org/10.1029/95WR02966.
Sharif, Mohammed, and Donald Burn. “Improved K-Nearest Neighbor Weather Generating Model.” Journal of Hydrologic Engineering 12, no. 1 (January 1, 2007): 42–51. https://doi.org/10.1061/(ASCE)1084-0699(2007)12:1(42).
Yates, David, Subhrendu Gangopadhyay, Balaji Rajagopalan, and Kenneth Strzepek. “A Technique for Generating Regional Climate Scenarios Using a Nearest-Neighbor Algorithm.” Water Resources Research 39, no. 7 (2003). https://doi.org/10.1029/2002WR001769.