# Intro to Machine Learning: Classification using K-NN

Rohini and I are taking Kilian Weinberger’s fantastic machine learning class here at Cornell this semester and I thought I’d use a couple blog posts to share what we’ve been learning. I hope these posts can serve as a look into the topics covered in this class and what is meant by the term “machine learning”, which is a hot topic these days. In these posts I’ll provide overviews of the topics we’re learning and some example code. There is way more material in this class than could every be covered on this blog and if you’re interested in learning more on the subject I’d suggest checking out the course web page where all lecture video links and lecture notes (where the content in this post is derived from) are free an open to the public.

In this post we’ll start with a very simple algorithm that you’ve probably heard of, the K-Nearest Neighbors (K-NN) algorithm. We’ll look at how K-NN is performed, why it works and the theoretical limits of its performance. We’ll then examine cases where K-NN performs poorly and learn why. The concepts in this simple example will lay the foundation for the more sophisticated Machine Learning techniques that will be discussed in future posts.

## The Problem: Binary Classification

Say you have a set of data where each point has a number of features (attributes that define the point, such as an x-y coordinate) and a label such as red/blue or 1/-1. If we were given a new point and told it’s features but not it’s label, can we predict its label based on information from the data we have? We’ll call our original data set the training set, and the new point a test input.

Figure 1: Can we predict the classification for the new point (star) based on it’s features (x,y coordinates) and those of the training data?

## The K-NN algorithm

The K-NN algorithm is based on a simple idea, that points close to each other in the feature space are likely to have similar labels. This allows us to create a classification rule:

For a test input x, assign the most common label among its k most similar training inputs

Here we are defining similarity in terms of distance within the feature space. To put the above definition formally for a new point x, the classification rule for label y of the new point, h(x) is:

$h(x) = mode({y'' : (x'', y'') \in S_x})$

where:

Sx is the set of k nearest neighbors of x:

$S_x \subseteq D \quad s.t \enspace |S_x| =k$

and $\forall (x',y') \in D,S_x,$

$dist(x,x') \geq \underset{dist(x,x'') \in S_x}{max}(x'',y'')$

Figure 2: If k=5, the new point is classified as +1 because it has three +1 neighbors and two -1 neighbors.

The choice of k here is important one, for noisy data with a smooth boundary between classes, a larger k will usually be preferable. Choosing k is difficult to do a priori, and should be done by finding k that minimizes the error in a validation set and testing it over a separate testing data set to check for over fitting.

## Defining a distance metric

The concept of proximity is critical to K-NN, we define a set of nearest neighbors as the k closest points to our test point in the feature space. But how do we define closeness? When we think of distances in everyday life we usually think of Euclidean distance:

$dist(x,z) = \sqrt{\sum_{r=1}^{d} (x_r-z_r)^2 }$

However, other distance metrics may be useful. In fact, the Euclidean distance is just one form of the more general Minkowski distance:

$dist(x,z) =(\sum_{r=1}^{d} (x_r-z_r)^p)^{1/p}$

Aside from the euclidean distance (p=2), other common forms of the Minkowski distance include the taxicab norm (p=1) or the infinity norm (p=infinity).

Another possible distance metric is the Mahalanobis distance, which accounts for correlation between points. For a detailed look into the use of Mahalanobis and Euclidean distance see Billy’s excellent post from earlier this year.

## But how good of a classifier is K-NN?

How should we evaluate the performance of a machine learning algorithm? We can calculate the error rate of our predictions over a testing set, but this alone does not tell the full story as it neglects the properties of the testing set itself. It is therefore useful to set theoretical lower and upper bounds for error of a potential model.

The upper bound on error constitutes a classifier that our constructed model should always beat. A useful upper bound is the best constant predictor, which will always predict the most common label in the training set for any new point. If you construct a model that is able to classify points correctly 55% of the time, but 60% of the points in your data set have a certain label, you’re model is pretty useless.

Establishing the upper bound on error is pretty straight forward, but how do we specify a lower bound? What is the theoretical limit to the performance of a classification algorithm? To answer this question, lets assume we knew the probability of label y of data set x, P(y|x). We can then define the Bayes Optimal Classifier, hopt(x) as:

$h_{opt}(x) = y*=\underset{y}{argmax}{P(y|x)}$

We the error rate of  hopt(x)  is then defined as:

$\epsilon_{BayesOpt} = 1-P(h_{opt}|y) = 1-P(y^*|x)$

This error is the theoretical limit of our performance. If we new precisely the probability that a point had features x, this is the best we could do to predict its label. So how does the K-NN algorithm stack up? Cover and Hart (1967) famously proved that the error rate of a 1-NN classifier is no more than twice there error of the Bayes optimal classifier (See course notes for the full proof). Not bad!

## The Curse of Dimensionality

The K-NN algorithm rests on the assumption that similar points have similar labels. As we increase the dimensionality of our feature space however, finding “similar” points becomes increasingly difficult. As we increase the number of features we are comparing, all points in our data set become more distant from each other.

As an example, lets take a look at the “similarity” between fellow Reed group member Rohini and I based on a set of features I’ve chosen at random:

If we look at only three dimensions of features, Rohini and I are quite similar (in fact identical in terms of occupation, favorite class and current location). As we increase the number of features however, our similarity decreases precipitously. The more features we examine, the more unique each of us gets.  If we were to randomly sample these features of 100 people on campus, the similarity between Rohini and I in this feature set would likely not stand out from the sample. This is the curse of dimensionality, as the the number of features become large, the distance between all points increases to a point where it is impossible to determine “similarity” between points. If  we are unable to find neighbors that are “similar” to a test point, how can we assume any other point can be used to predict its label?

For high dimensional data sets, techniques such as PCA may make K-NN feasible if the data has a lower dimensional structure. If this isn’t effective, we must seek other strategies for classification. While the curse of dimensionality dictates that all points in a large dimensional space are distant from each other, this concept does not hold for the distance between points and hyperplanes. The Perceptron Algorithm, which I’ll cover in the next post, exploits this idea to create linear classifiers that are effective on high dimensional data.