Intro to Machine Learning Part 3: Bias-Variance Tradeoff

The goal of this post is to introduce one of the most important topics in Machine Learning, termed the Bias-Variance Tradeoff. In the last two posts of this series, David Gold introduced two popular classification algorithms. This post will discuss how to assess your classifier’s error and improve it.

Decomposition of Error

Let’s assume that we have a dataset, D={(x1,y1)…(xn,yn)}, where xi is a data point and yi is the label associated with that point. We train a classifier, h_D, on this dataset and use it to predict the label y associated with a test point x. The classifier outputs a predicted label h_D(x). The classifier’s expected test error is the expected squared difference between the prediction h_D(x) and the true label y, averaged over test points and over possible training sets D. This error can be decomposed into three components, shown in the following equation:

E_{x,y,D}\big[(h_D(x)-y)^2\big] = \underbrace{E_{x,D}\big[(h_D(x)-\bar{h}(x))^2\big]}_{\text{Variance}} + \underbrace{E_{x}\big[(\bar{h}(x)-\bar{y}(x))^2\big]}_{\text{Bias}^2} + \underbrace{E_{x,y}\big[(\bar{y}(x)-y)^2\big]}_{\text{Noise}}

where \bar{h}(x)=E_D[h_D(x)] is the expected classifier (averaged over training sets) and \bar{y}(x)=E_{y|x}[y] is the average label.

Variance: h_D is the classifier that you have trained on dataset D. If you were to train a classifier on a different dataset, you would get a different classifier that could potentially give you different labels. This is unfortunate, because multiple large datasets are usually hard to come by. If we had infinitely many datasets, we could hypothetically average over them to obtain the expected classifier \bar{h}(x). The error due to variance is therefore defined as the expected squared difference between our classifier’s predictions and those of the expected classifier.

Bias: Suppose that you are able to train your model on infinitely many datasets and thereby obtain the expected classifier \bar{h}(x), but you still have a high test error. Then your classifier may suffer from error due to bias, which measures how much the expected classifier’s prediction differs from the average label \bar{y}(x). Error due to bias is a limitation of your model. For example, if you try to use a linear classifier to separate data that are not linearly separable, you will never get good results, even with infinitely many datasets.

Noise: The goal of a classifier is to predict the average label \bar{y}(x) associated with a data point. However, the actual label of a given point can differ from this average. This is termed error due to noise; it is intrinsic to your data and cannot be removed by any classifier.
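To make the decomposition concrete, here is a minimal Monte Carlo sketch (not from the original post) that estimates each term numerically. For simplicity it uses a polynomial regressor in place of a classifier, since the decomposition above is written in terms of squared error; the data-generating function, noise level, and polynomial degree are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE_SD = 0.3  # assumed label-noise standard deviation


def ybar(x):
    # Assumed "average label" function \bar{y}(x) for this toy problem
    return np.sin(x)


def sample_dataset(n=30):
    # Draw a fresh dataset D of n (x, y) pairs with noisy labels
    x = rng.uniform(0, 2 * np.pi, n)
    y = ybar(x) + rng.normal(scale=NOISE_SD, size=n)
    return x, y


def train(x, y):
    # h_D: a degree-3 polynomial fit to dataset D
    return np.poly1d(np.polyfit(x, y, deg=3))


x_test = np.linspace(0, 2 * np.pi, 200)

# Train many models on independent datasets to approximate \bar{h}(x)
preds = np.array([train(*sample_dataset())(x_test) for _ in range(500)])
hbar = preds.mean(axis=0)

variance = np.mean((preds - hbar) ** 2)        # E[(h_D(x) - hbar(x))^2]
bias_sq = np.mean((hbar - ybar(x_test)) ** 2)  # E[(hbar(x) - ybar(x))^2]
noise = NOISE_SD ** 2                          # E[(ybar(x) - y)^2]

print(f"variance={variance:.4f}, bias^2={bias_sq:.4f}, noise={noise:.4f}")
print(f"expected test error (sum of terms) = {variance + bias_sq + noise:.4f}")
```

Increasing the polynomial degree in this sketch inflates the variance term and shrinks the bias term, which is exactly the tradeoff the rest of this post is about.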

Improving Classification Error

The first step to improving error is to identify which type of error your classifier is suffering from. Figure 1 can help you diagnose this.


Figure 1: Training and test error as a function of the number of training instances

High Variance: If you classify your test set and find a large gap between the test error and the training error, then your classifier might have high variance. Another characteristic of this regime is that the training error is below the acceptable test error ε, which is ideal, but the test error is still much higher than ε, which is not.

Solutions

  • Add more data: As seen from the graph, as more data (training instances) are added, the test error initially goes down. Meanwhile, the training error increases, because it becomes harder for the classifier to fit every training point exactly. The ultimate goal is for the two error lines to converge and level off below the dashed line marking the acceptable error ε.
  • Reduce Model Complexity: High variance can be a sign that your classifier is too complex: it is overfit to the training set and therefore won’t generalize well to new test sets. One way to fix this is to reduce model complexity, for example by increasing regularization to penalize more complex solutions.
  • Bagging: Short for Bootstrap Aggregating, bagging is a procedure in which multiple datasets are produced by sampling dataset D with replacement. A classifier is trained on each resampled dataset, and the resulting classifiers are averaged to approximate the expected classifier. This method provably reduces the variance error; the details of the proof will be explained in the next blog post, and a minimal sketch follows this list.
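Here is a minimal bagging sketch using scikit-learn; the synthetic two-moons dataset and all parameter values are illustrative choices, not part of the original post.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy, non-linearly-separable data (illustrative only)
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single deep decision tree: low bias but high variance (overfits)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Bagging: train many trees on bootstrap resamples of the training set and
# average their votes (scikit-learn's default base estimator is a decision tree)
bag = BaggingClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("single tree test accuracy :", tree.score(X_test, y_test))
print("bagged trees test accuracy:", bag.score(X_test, y_test))
```

The bagged ensemble typically generalizes better than the single tree here because averaging over bootstrap resamples reduces the variance term without changing the bias much.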

High Bias: If you classify your test set and find that there is a small gap between the test error and training error, but both error lines have converged above the acceptable error, then your classifier might have high bias.

Solutions

A common misconception is that adding more data is the solution to most error woes. However, it’s very important to realize that if your classifier suffers from bias, there exists a fundamental problem with your model. More data will not fix this issue, and in fact will just increase training error.

  • Increase Model Complexity: High bias means that your classifier isn’t expressive enough to accurately predict labels, so you need to increase the power of the algorithm. If your data require a more complex classification boundary, consider algorithms that can learn non-linear boundaries, such as k-nearest neighbors (KNN). Non-linearities can also be incorporated into linear classifiers (such as logistic regression or SVMs) through kernelization, which maps feature vectors into higher-dimensional spaces where you are more likely to find a linearly separating hyperplane.
  • Add Features: Instead of increasing the complexity of the algorithm, an alternative is to add more features to represent each data point. This might increase the separation between classes enough that simpler algorithms can still be used.
  • Boosting: In boosting, weak classifiers with high bias (so called because they predict labels only slightly better than random guessing) are ensembled together to generate a strong learner with lower bias, as in the sketch after this list.
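As with bagging above, here is a minimal boosting sketch using scikit-learn’s AdaBoost implementation; the dataset and parameters are again illustrative assumptions rather than material from the original post.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same style of synthetic, non-linear data as in the bagging sketch
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single decision stump (depth-1 tree): a weak, high-bias learner
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_train, y_train)

# AdaBoost ensembles many stumps, reweighting the points earlier stumps got wrong
# (scikit-learn's default base estimator for AdaBoost is a depth-1 tree)
boost = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

print("single stump test accuracy  :", stump.score(X_test, y_test))
print("boosted stumps test accuracy:", boost.score(X_test, y_test))
```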

High Noise: If training error and test error are both too high, and perhaps even increase as more training instances are introduced, then your data may simply be very noisy. This can happen when identical training points have different labels. There are very few ways to overcome this type of error; the best option is to add more features that better characterize your training data and hopefully separate seemingly identical points.

This blog post should give you a taste of the different types of error that are prevalent among classifiers, and the next few blog posts in the series will go through bagging, boosting, and kernels as options for reducing these errors.

References:

All of the material for the post comes from Kilian Weinberger’s CS4780 class notes and lectures found at: http://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote12.html

Discriminant analysis for classification

Data mining encompasses a variety of analytic techniques that can help us analyze, understand, and extract insight from large sets of data. These techniques generally fall into three broad categories:

Classification – Use a dataset of characteristics (variables) for an observation to determine to which discrete group/class the observation belongs. For example:

  • Loan applications: use data about an individual or business to determine whether the applicant is a “good risk” (approve loan) or a “bad risk” (refuse loan)
  • Project evaluation: sort possible capital investment projects into “priority”, “medium” and “unattractive” based on their characteristics

Prediction – Use data to predict the value (or a range of reasonable values) of a continuous numeric response variable. For example:

  • Predict sales volume of a product in a future time period
  • Predict annual heating/cooling expenses for a set of buildings

Segmentation – Separate a set of observations into some collection of groups that are “most alike” within a group and “different from” other groups. For example:

  • Identify groups of customers that have similar ordering patterns across seasons of the year
  • Identify groups of items that tend to be purchased together

This blogpost will focus on (linear) Discriminant Analysis (DA), one of the oldest techniques for solving classification problems. DA attempts to find a linear combination of features that separates two or more groups of observations. It is related to analysis of variance (ANOVA), the difference being that ANOVA attempts to predict a continuous dependent variable using one or more categorical independent variables, whereas DA attempts to predict a categorical dependent variable using one or more continuous independent variables. DA is very similar to logistic regression, with the fundamental difference that DA assumes that the independent variables are normally distributed within each group. DA is also similar to Principal Component Analysis (PCA), especially in its application. DA differs from PCA in that it tries to find a vector in the variable space that best discriminates among the different groups, by modelling the difference between the centroids of each group. PCA instead tries to find a subspace of the variable space whose basis vectors best capture the variability among all observations. In other words, DA deals directly with discrimination between groups, whereas PCA identifies the principal components of the data in its entirety, without particularly focusing on the underlying group structure [1].
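To illustrate the difference in practice, here is a minimal scikit-learn sketch (my own illustrative example, not from the original post) that projects the classic iris dataset with both LDA and PCA: LDA uses the class labels to find discriminating directions, while PCA ignores them.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 3 classes, 4 continuous features

# LDA: supervised; finds directions that best separate the class centroids
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
X_lda = lda.transform(X)

# PCA: unsupervised; finds directions of maximum overall variance
pca = PCA(n_components=2).fit(X)
X_pca = pca.transform(X)

print("LDA projection shape:", X_lda.shape)
print("PCA projection shape:", X_pca.shape)
print("LDA resubstitution accuracy:", lda.score(X, y))
print("PCA explained variance ratio:", pca.explained_variance_ratio_)
```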

To demonstrate how DA works, I’ll use an example adapted from Ragsdale (2018) [2], where potential employees are given a pre-employment test measuring their mechanical and verbal aptitudes and are then classified as having superior, average, or inferior performance (Fig. 1).


Fig. 1 – Employee classification on the basis of mechanical and verbal aptitude scores

The centroids of the three groups indicate the average value of each independent variable (the two test scores) for that group. The aim of DA is to find a classification rule that maximizes the separation between the group means while making as few “mistakes” as possible. There are multiple discrimination rules available; in this post I will be demonstrating the maximum likelihood rule, as achieved through the application of the Mahalanobis distance. The idea is the following: an observation i from a multivariate normal distribution should be assigned to the group G_j whose centroid C_j minimizes the Mahalanobis distance to i.


Fig. 2 – Independent variables have different variances

The reason to use the Mahalanobis distance (as opposed to, say, the Euclidean distance) is twofold: if the independent variables have unequal variances or if they are correlated, a distance metric that does not account for this could misallocate an observation to the wrong group. In Fig. 2, the independent variables are not correlated, but X2 appears to have a much larger variance than X1, so the effects of small but important differences in X1 could be masked by large but unimportant differences in X2. The ellipses in the figure represent regions containing 99% of the values belonging to each group. By Euclidean distance, P1 would be assigned to group 2,


Fig. 3 – Independent variables are correlated

however, it is unlikely to belong to group 2, because its location with respect to the X1 axis exceeds the typical values for group 2. If the two attributes were additionally correlated (Fig. 3), we should also adjust for this correlation in our distance metric. The Mahalanobis distance therefore uses the covariance matrix of the independent variables in its calculation of the distance between an observation and the group means.

It is given by: D_{ij}=\sqrt{(X_i-\mu_j)^TS^{-1}(X_i-\mu_j)}, where X_i is the vector of attributes for observation i, \mu_j is the vector of means for group j, and S is the covariance matrix for the variables. When using the Mahalanobis distance, points that are equidistant from a group mean lie on tilted ellipses:


Fig. 4 – Using the Mahalanobis distance, points equidistant from a group mean lie on tilted ellipses

For further discussion on the two distance metrics and their application, there’s also this blogpost. Applying the formula to the dataset, we can calculate the distance between each observation and each group centroid, and use these distances to classify each observation into one of the three groups:


Table 1 – Distance of each observation to each group, using the Mahalanobis distance. Each observation is classified to the group with the minimum of the three distances. Values in red indicate misclassifications.

If we tally up the correct vs. incorrect (in red) classifications, our classification model is right 83.33% of the time. Assuming that this classification accuracy is sufficient, we can then apply this simple model to classify new employees based on their scores on the verbal and mechanical aptitude tests.
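For readers who prefer code to spreadsheets, the classification rule above can be reproduced in a few lines of numpy. The aptitude scores below are made-up placeholders rather than the actual Ragsdale (2018) data, so the numbers (and the 83.33% accuracy) will not match Table 1; the point is only to show the mechanics of the Mahalanobis-distance rule.

```python
import numpy as np

# Made-up scores: columns are [mechanical, verbal]; groups: 0=superior, 1=average, 2=inferior
X = np.array([[45., 38.], [42., 35.], [40., 30.],
              [35., 28.], [33., 24.], [30., 22.],
              [28., 18.], [25., 16.], [22., 14.]])
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# Group centroids \mu_j and the covariance matrix S of the attributes
centroids = np.array([X[groups == g].mean(axis=0) for g in np.unique(groups)])
S_inv = np.linalg.inv(np.cov(X, rowvar=False))


def mahalanobis(x, mu):
    # D_ij = sqrt((x - mu)^T S^{-1} (x - mu))
    d = x - mu
    return np.sqrt(d @ S_inv @ d)


# Distance of every observation to every group centroid, then assign to the nearest
D = np.array([[mahalanobis(x, mu) for mu in centroids] for x in X])
predicted = D.argmin(axis=1)

print("predicted groups:", predicted)
print("resubstitution accuracy:", (predicted == groups).mean())
```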

[1] Martinez, A. M., and A. C. Kak. 2001. “PCA versus LDA.” IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2): 228–33. https://doi.org/10.1109/34.908974.

[2] Ragsdale, Cliff T. Spreadsheet Modeling and Decision Analysis: a Practical Introduction to Business Analytics. Eighth edition. Boston, MA: Cengage Learning, 2018.