Discriminant analysis for classification

Discriminant analysis for classification

Data mining encompasses a variety of analytic techniques that can help us analyze, understand, and extract insight from large sets of data. The various data mining techniques generally fall in three potential categories:

Classification – Use a dataset of characteristics (variables) for an observation to determine in which discrete group/class the observation belongs. For example:

  • Loan applications: use data about an individual or business to determine whether the applicant is a “good risk” (approve loan) or a “bad risk” (refuse loan)
  • Project evaluation: sort possible capital investment projects into “priority”, “medium” and “unattractive” based on their characteristics

Prediction – Use data to predict the value (or a range of reasonable values) of a continuous numeric response variable. For example:

  • Predict sales volume of a product in a future time period
  • Predict annual heating/cooling expenses for a set of buildings

Segmentation – Separate a set of observations into some collection of groups that are “most alike” within a group and “different from” other groups. For example:

  • Identify groups of customers that have similar ordering patterns across seasons of the year
  • Identify groups of items that tend to be purchased together

This blogpost will focus on (linear) Discriminant Analysis (DA), one of the oldest techniques for solving classification problems. DA attempts to find a linear combination of features that separates two or more groups of observations. It is related to analysis of variance (ANOVA), the difference being that ANOVA attempts to predict a continuous dependent variable using one or more independent categorical variables, whereas DA attempts to predict a categorical dependent variable using one or more continuous independent variable. DA is very similar to logistic regression, with the fundamental difference that DA assumes that the independent variables are normally distributed within each group. DA is also similar to Principal Component Analysis (PCA), especially in their application. DA differs from PCA in that it tries to find a vector in the variable space that best discriminates among the different groups, by modelling the difference between the centroids of each group. PCA instead tries to find a subspace of the variable space, that has a basis vector that best captures the variability among the different observations. In other words, DA deals directly with discrimination between groups, whereas PCA identifies the principal components of the data in its entirety, without particularly focusing on the underlying group structure[1].

To demonstrate how DA works, I’ll use an example adapted from Ragsdale (2018)[2] where potential employees are given a pre-employment test measuring their mechanical and verbal aptitudes and are then classified into having a superior, average, or inferior performance (Fig. 1).


Fig. 1 – Employee classification on the basis of mechanical and verbal aptitude scores

The centroids of the three groups indicate the average value of each independent variable (the two test scores) for that group. The aim of DA is to find a classification rule that maximizes the separation between the group means, while making as few “mistakes” as possible. There are multiple discrimination rules available; in this post I will be demonstrating the maximum likelihood rule, as achieved though the application of the Mahalanobis Distance. The idea is the following: an observation i from a multivariate normal distribution should be assigned to group G_j that minimizes the Mahalanobis distance between i and group centroid C_j.


Fig. 2 – Independent variables have different variances

The reason to use Mahalanobis (as opposed to say, Euclidean) is twofold: if the independent variables have unequal variances or if they are correlated, a distance metric that does not account for that could potentially misallocate an observation to the wrong group. In Fig. 2, the independent variables are not correlated, but X2 appears to have much larger variance than X1, so the effects of small but important differences in X1 could be masked by large but unimportant differences in X2. The ellipses in the figure represent regions containing 99% of the values belonging to each group. By Euclidean distance, P1 would be assigned to group 2,


Fig. 3 – Independent variables are correlated

however it is unlikely to be in group 2 because its location with respect to the X1 axis exceeds the typical values for group 2. If the two attributes where additionally correlated (Fig. 3), we shall also adjust for this correlation in our distance metric. The Mahalanobis distance therefore uses the covariance matrix for the independent variables in its calculation of the distance of an observation to group means.

It is given by:  D_{ij}=\sqrt{(X_i-\mu_j)^TS^{-1}(X_i-\mu_j)} where X_i is the vector of attributes for observation i, \mu_j is the vector of means for group j, and S is the covariance matrix for the variables. When using the Mahalanobis Distance, points that are equidistant from a group mean are on tilted eclipses:


Fig. 4 – Using the Mahalanobis Distance, points equidistant from a group mean are on tilted eclipses



For further discussion on the two distance metrics and their application, there’s also this blogpost. Applying the formula to the dataset, we can calculate distances between each observation and each group, and use the distances to classify each observation to each of the groups:


Table 1 – Distance of each observation to each group, using Mahalanobis distance. Classification is estimated using the minimum between the three distances. Values in red indicate misclassifications.

If we tally up the correct vs. incorrect (in red) classifications, our classification model is right 83.33% of the time. Assuming that this classification accuracy is sufficient, we can then apply this simple model to classify new employees based on their scores on the verbal and mechanical aptitude tests.

[1] Martinez, A. M., and A. C. Kak. 2001. “PCA versus LDA.” IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2): 228–33. https://doi.org/10.1109/34.908974.

[2] Ragsdale, Cliff T. Spreadsheet Modeling and Decision Analysis: a Practical Introduction to Business Analytics. Eighth edition. Boston, MA: Cengage Learning, 2018.


A Deeper Dive into Principal Component Analysis

This post is meant to be a continuation of Dave Gold’s introductory post on Principal Component Analysis, which is an excellent explanation on how to conduct a PCA and visualize the principal components. The goal of this post is to elaborate on how to proceed after you have conducted a PCA and to address some common questions and concerns associated with the method.

Performing a PCA in R

Often times, you will perform a PCA on large datasets that contain many variables and/or many observations. One such dataset that will be used as an example is the Living Blended Drought Atlas (LBDA) which is a reconstruction of the Palmer Drought Severity Index (PDSI) over the contiguous United States from 1473-2005. This dataset contains 4968 columns, or variables, each of which is a grid cell over the U.S., and 533 rows, each of which is a yearly observation. We will call this dataset, X. In matrix notation, we will denote the PCA analysis formula as:

U=XW    (1)

where X is the dataset, W is the weighting matrix, whose columns are the key patterns in the data, and U is the matrix whose columns are the resulting principal components (PCs). You can perform a PCA on this dataset with a single function in R, prcomp.

 PCA=prcomp(X, scale=TRUE/FALSE) 

The first input into the function is your data matrix and the second input is used to declare if your dataset should be scaled to have a unit variance before the PCA is conducted. There are various other inputs into the function, listed here, that can be included if necessary.

prcomp returns three sets of results in a list that we have called “PCA”:

  1. sdev: the standard deviations of the principal components (if you square them, you get the eigenvalues of the covariance/correlation matrix)
  2. rotation: the loading matrix whose columns are the eigenvectors (W in the equation above)
  3. x: the rotated data or your PCs (The columns of U in the equation above)

And that’s it! You have the results of the PCA. Now comes the more difficult part: interpreting them.

How do I choose how many PCs to keep?

The dimensions associated with equation (1) are as follows:



If the number of observations is much larger than the number of variables in the dataset, i.e. n>>k, then the PCA will return k distinct eigenvectors. In our case, since n is smaller than our number of variables, the most non-zero eigenvectors that the PCA will return is n. Either way, we have many supposedly distinct patterns. How do we decide how many of those patterns to keep?

The answer is not always clear and most often subjective and case-dependent. One common tool used is a scree plot or an eigenvalue spectrum.

Scree Plot

Each column of our W matrix is a distinct, independent pattern, also called an empirical orthogonal function (EOF). Each EOF is responsible for explaining some amount of variance in the dataset. A scree plot, shown in Figure 1, allows you visualize this variance breakdown.


Figure 1: Scree Plot


On the x-axis of the scree plot is the EOF (we’ve chosen to keep 10) and on the y-axis is the total variance explained by that EOF. Each variance is equivalent to the eigenvalue associated with its respective eigenvector. You can find the eigenvalues/variances by squaring of the results of sdev that is returned by prcomp.

Generally speaking, one can look for the “elbow” of the scree plot to determine at what point to truncate the EOFs and retain up to the EOF before the elbow. At about the 5th EOF, the graph starts to level off and all subsequent EOFs start to contribute about the same amount of variance. Therefore, the elbow of the graph is located at about the 5th EOF and you will retain the first 4 EOFs.

North’s Rule of Thumb

North’s Rule of Thumb is more of a precise way of truncating and involves creating confidence intervals around your estimates of the variance. The rule states that you should truncated EOFs only when the confidence intervals of the variances start to intersect. At this point, the eigenvectors are considered too close to be interpretable and spacing might be due to sampling error rather than a clear distinction between the variances [1].

Rotated EOFs

At some point, your EOFs might start to exhibit patterns that can be hard to interpret or attribute to a physical phenomenon. It is not uncommon for these types of patterns to result from pure noise in your data, especially if you are analyzing latter EOFs that explain a very small amount of the variance [2].

Rotating EOFs is a practice that is done to simplify the patterns obtained in EOFs and make them more interpretable. A varimax orthogonal rotation can be used to determine an optimum rotation matrix that maximizes the variance in the columns of W. The variance of the columns is maximized by driving some of the loadings to zero and trying to maximize the values of other loadings. In R, this is done using the varimax function.


In the above command, my input is the first 10 EOFs from the original weighting matrix. The result, my.varimax, is a list with the following components:

  1. loadings: the resulting rotated loading matrix
  2. rotmat: the rotation matrix

The new loading matrix, Wrot , is still orthogonal after the rotation, and the eigenvectors are, therefore, still orthonormal. However, multiplication of the rotated loading matrix by the original dataset, X, to obtain a new U, results in principal components are not guaranteed to be independent. This can be seen through further inspection of the correlation matrix associated with U. Unfortunately, this is a tradeoff associated with obtaining EOFs that are simpler and easier to interpret.


[1] http://yyy.rsmas.miami.edu/users/bmapes/teaching/MPO581_2011/EOF_chapter_DelSole.pdf

[2] Hannachi, A., Jolliffe, I.T., and Stephenson, D.B. (2007), Empirical orthogonal functions and related techniques in atmospheric science: A review, International Journal of Climatology, 27, 1119-1152.

*All information or figures not specifically cited came from class notes and homework from Dr. Scott Steinschneider’s class, BEE 6300: Environmental Statistics 

Dealing With Multicollinearity: A Brief Overview and Introduction to Tolerant Methods

This semester I’m taking a Multivariate statistics course taught by Professor Scott Steinschneider in the BEE department at Cornell. I’ve been really enjoying the course thus far and thought I would share some of what we’ve covered in the class with a blog post. The material below on multicollinearity is from Dr. Steinschneider’s class, presented in my own words.

What is Multicollinearity?

Multicollinearity is the condition where two or more predictor variables in a statistical model are linearly related (Dormann et. al. 2013). The existence of multicollinearity in your data set can result in an increase of the variance of regression coefficients leading to unstable estimation of parameter values. This in turn can lead to erroneous identification of relevant predictors within a regression and detracts from a model’s ability to extrapolate beyond the range of the sample it was constructed with. In this post, I’ll explain how multicollinearity causes problems for linear regression by Ordinary Least Squares (OLS), introduce three metrics for detecting multicollinearity and detail two “Tolerant Methods” for dealing with multicollinearity within a data set.

How does multicollinearity cause problems in OLS regression?

To illustrate the problems caused by multicollinearity, let’s start with a linear regression:

y=x\beta +\epsilon


y=x\beta +\epsilon

x = a \hspace{.1 cm} vector \hspace{.1 cm} of \hspace{.1 cm} predictor \hspace{.1 cm} variables

\beta = a \hspace{.1 cm} vector \hspace{.1 cm} of \hspace{.1 cm} coefficients

\epsilon =  a \hspace{.1 cm} vector \hspace{.1 cm} of \hspace{.1 cm} residuals

The Gauss-Markov theorem states that the Best Linear Unbiased Estimator (BLUE) for each  coefficient can be found using OLS:

\hat{\beta}_{OLS} = (x^Tx)^{-1}x^Ty

This  estimate will have a variance defined as:

var(\hat{\beta}_{OLS}) =\sigma^2 (x^Tx)^{-1}


\sigma^2 = the \hspace{.1 cm} variance\hspace{.1 cm} of \hspace{.1 cm} the\hspace{.1 cm} residuals

If you dive into the matrix algebra, you will find that the term (xTx) is equal to a matrix with ones on the diagonals and the pairwise Pearson’s correlation coefficients (ρ) on the off-diagonals:

(x^Tx) =\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}

As the correlation values increase, the values within (xTx)-1 also increase. Even with a low residual variance, multicollinearity can cause large increases in estimator variance. Here are a few examples of the effect of multicollinearity using a hypothetical regression with two predictors:

 \rho = .3 \rightarrow (x^Tx)^{-1} =\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}^{-1} = \begin{bmatrix} 1.09 & -0.33 \\ -0.33 & 1.09 \end{bmatrix}

 \rho = .9 \rightarrow (x^Tx)^{-1} =\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}^{-1} = \begin{bmatrix} 5.26 & -4.73 \\ -5.26 & -4.73 \end{bmatrix}

 \rho = .999 \rightarrow (x^Tx)^{-1} =\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}^{-1} = \begin{bmatrix} 500.25 & -499.75 \\ -499.75 & 500.25\end{bmatrix}

So why should you care about the variance of your coefficient estimators? The answer depends on what the purpose of your model is. If your only goal is to obtain an accurate measure of the predictand, the presence of multicollinearity in your predictors might not be such a problem. If, however, you are trying to identify the key predictors that effect the predictand, multicollinearity is a big problem.

OLS estimators with large variances are highly unstable, meaning that if you construct estimators from different data samples you will potentially get wildly different estimates of your coefficient values (Dormann et al. 2013). Large estimator variance also undermines the trustworthiness of hypothesis testing of the significance of coefficients. Recall that the t value used in hypothesis testing for an OLS regression coefficient is a function of the sample standard deviation (the square root of the variance) of the  OLS estimator.

t_{n-2} =\frac{\hat{\beta_j}-0}{s_{\beta_j}}

An estimator with an inflated standard deviation, s_{\beta_j}, will thus yield a lower t value, which could lead to the false rejection of a significant predictor (ie. a type II error). See Ohlemüller et al. (2008) for some examples where hypothesis testing results are undermined by multicollinearity.

Detecting Multicollinearity within a data set

Now we know how multicollinearity causes problems in our regression, but how can we tell if there is multicollinearity within a data set? There are several commonly used metrics for which basic guidelines have been developed to determine whether multicollinearity is present.

The most basic metric is the pairwise Pearson Correlation Coefficient between predictors, r. Recall from your intro statistics course that the Pearson Correlation Coefficient is a measure of the linear relationship between two variables, defined as:


A common rule of thumb is that multicollinearity may be a problem in a data set if any pairwise |r| > 0.7 (Dormann et al. 2013).

Another common metric is known as the Variance Inflation Factor (VIF). This measure is calculated by regressing each predictor on all others being used in the regression.

VIF(\beta_j) = \frac{1}{1-R^2_j}

Where Rj2 is the R2 value generated by regressing predictor xj on all other predictors. Multicollinearity is thought to be a problem if VIF > 10 for any given predictor (Dormann et al. 2012).

A third metric for detecting multicollinearity in a data set is the Condition Number (CN) of the predictor matrix defined as the square root of the ratio of the largest and smallest eigenvalues in the predictor matrix:


CN> 15 indicates the possible presence of multicollinearity, while a CN > 30 indicates serious multicollinearity problems (Dormann et al. 2013).

Dealing with Multicollinearity using Tolerant Methods

In a statistical sense, there is no way to “fix” multicollinearity. However, methods have been developed to mitigate its effects. Perhaps the most effective way to remedy multicollinearity is to make a priori judgements about the relationship between predictors and remove or consolidate predictors that have known correlations. This is not always possible however, especially when the true functional forms of relationships are not known (which is often why regression is done in the first place). In this section I will explain two “Tolerant Methods” for dealing with multicollinearity.

The purpose of Tolerant Methods is to reduce the sensitivity of regression parameters to multicollinearity. This is accomplished through penalized regression. Since multicollinearity can result in large and opposite signed  estimator values for correlated predictors, a penalty function is imposed to keep the value of predictors below a pre-specified value.

\sum_{j=1}^{p}|\beta|^l \leq c

Where c is the predetermined value representing model complexity, p is the number of predictors and l is either 1 or 2 depending on the type of tolerant method employed (more on this below).

Ridge Regression

Ridge regression uses the L2 norm, or Euclidean distance, to constrain model coefficients (ie. l = 2 in the equation above). The coefficients created using ridge regression are defined as:

\hat{\beta}_{r} = (x^Tx+\lambda I)^{-1}x^Ty

Ridge regression adds a constant, λ, to the term xTx to construct the estimator. It should be noted that both x and y should be standardized before this estimator is constructed. The Ridge regression coefficient is the result of a constrained version of the ordinary least squares optimization problem. The objective is to minimize the sum of square errors for the regression while meeting the complexity constraint.

\hat{\beta_r} \begin{cases} argmin(\beta) \hspace{.1cm}\sum_{i=1}^{N} \epsilon_i^2  \\  \sum_{j=1}^{p}|\beta_j|^2 \leq c \end{cases}

To solve the constrained optimization, Lagrange multipliers can be employed. Let z equal the Residual Sum of Squares (RSS) to be minimized:

argmin(\beta) \hspace{.3cm}  z= (y-x\beta)^T(y-x\beta)+\lambda(\sum_{i=1}^{N}|\beta_j|^2-c)

This can be rewritten in terms of the L2 norm of β:

z = (y-x\beta)^T(y-x\beta)+\lambda||\beta||^2_2

Taking the derivative with respect to β and solving:

0 = \frac{\partial z}{\partial \beta} = -2x^T(y-x\beta)+2\lambda\beta

x^Ty = x^Tx\beta+\lambda\beta=(x^Tx+\lambda I)\beta

\hat{\beta}_{r} = (x^Tx+\lambda I)^{-1}x^Ty

Remember that the Gauss-Markov theorem states that the OLS estimate for regression coefficients is the BLUE, so by using ridge regression, we are sacrificing some benefits of OLS estimators in order to constrain estimator variance. Estimators constructed using ridge regression are in fact biased, this can be proven by calculating the expected value of ridge regression coefficients.

E[\hat{\beta_r}]=(I+\lambda(x^Tx)^{-1})\beta \neq \beta

For a scenario with two predictors, the tradeoff between reduced model complexity and increase bias in the estimators can be visualized graphically by plotting the estimators of the two beta values against each other. The vector of beta values estimated by regression are represented as points on this plot  (\hat{\beta}=[\beta_1, \beta_2]).  In Figure 1,\beta_{OLS} is plotted in the upper right quadrant and represents estimator that produces the smallest RSS possible for the model. The ellipses centered around  are representations of the increasing RSS resulting from the combination of β1 and β2  values, each RSS is a function of a different lambda value added to the regression.  The circle centered around the origin represents the chosen level of model complexity that is constraining the ridge regression. The ridge estimator is the point where this circle intersects a RSS ellipse. Notice that as the value of c increases, the error introduced into the estimators decreases and vice versa.


Figure 1: Geometric Interpretation of a ridge regression estimator. The blue dot indicates the OLS estimate of Beta, ellipses centered around the OLS estimates represent RSS contours for each Beta 1, Beta 2 combination (denoted on here as z from the optimization equation above). The model complexity is constrained by distance c from the origin. The ridge regression estimator of Beta is shown as the red dot, where the RSS contour meets the circle defined by c.

The c value displayed in Figure 1 is only presented to explain the theoretical underpinnings of ridge regression. In practice, c is never specified, rather, a value for λ is chosen a priori to model construction. Lambda is usually chosen through a process known as k-fold cross validation, which is accomplished through the following steps:

  1. Partition data set into K separate sets of equal size
  2. For each k = 1 …k, fit model with excluding the kth set.
  3. Predict for the kth set
  4. Calculate the cross validation error (CVerror)for kth set: CV^{\lambda_0}_k = E[\sum(y-\hat{y})^2]
  5. Repeat for different values of , choose a that minimizes   CV^{\lambda_0} = \frac{1}{k}CV^{\lambda_0}_k

Lasso Regression

Another Tolerant Method for dealing with multicollinearity known as Least Absolute Shrinkage and Selection Operator (LASSO) regression, solves the same constrained optimization problem as ridge regression, but uses the L1 norm rather than the L2 norm as a measure of complexity.

\hat{\beta}_{Lasso} \begin{cases} argmin(\beta) \hspace{.1cm}\sum_{i=1}^{N} \epsilon_i^2 \\ \sum_{j=1}^{p}|\beta_j|^1 \leq c \end{cases}

LASSO regression can be visualized similarly to ridge regression, but since c is defined by the sum of absolute values of beta, rather than sum of squares, the area it constrains is diamond shaped rather than circular.  Figure 2 shows the selection of the beta estimator from LASSO regression. As you can see, the use of the L1 norm means LASSO regression selects one of the predictors and drops the other (weights it as zero). This has been argued to provide a more interpretable estimators (Tibshirani 1996).


Figure 2: Geometric interpretation of Lasso Regression Estimator. The blue dot indicates the OLS estimate of Beta, ellipses centered around the OLS estimate represents RSS contours for each Beta 1, Beta 2 combination (denoted as z from the optimization equation). The mode complexity is constrained by the L1 norm representing model complexity. The Lasso estimator of Beta is shown as the red dot, the location where the RSS contour intersects the diamond defined by c.

Final thoughts

If you’re creating a model with multiple predictors, it’s important to be cognizant of potential for multicollinearity within your data set. Tolerant methods are only one of many possible remedies for multicollinearity (other notable techniques include data clustering and Principle Component Analysis) but it’s important to remember that no known technique can truly “solve” the problem of multicollinearity. The method chosen to deal with multicollinearity should be chosen on a case to case basis and multiple methods should be employed if possible to help identify the underlying structure within the predictor data set (Dormann et. al. 2013)


Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., Marquéz, J. R. G., Gruber, B., Lafourcade, B., Leitão, P. J., Münkemüller, T., McClean, C., Osborne, P. E., Reineking, B., Schröder, B., Skidmore, A. K., Zurell, D. and Lautenbach, S. 2013, “Collinearity: a review of methods to deal with it and a simulation study evaluating their performance.” Ecography, 36: 27–46. doi:10.1111/j.1600-0587.2012.07348.x

Ohlemüller, R. et al. 2008. “The coincidence of climatic and species rarity: high risk to small-range species from climate change.” Biology Letters. 4: 568 – 572.

Tibshirani, Robert 1996. “Regression shrinkage and selection via the lasso.” Journal of the Royal Statistical Society. Series B (Methodological): 267-288.