Data mining encompasses a variety of analytic techniques that can help us analyze, understand, and extract insight from large sets of data. These techniques generally fall into three categories:

**Classification** – Use a dataset of characteristics (variables) for an observation to determine in which discrete group/class the observation belongs. For example:

- Loan applications: use data about an individual or business to determine whether the applicant is a “good risk” (approve loan) or a “bad risk” (refuse loan)
- Project evaluation: sort possible capital investment projects into “priority”, “medium” and “unattractive” based on their characteristics

**Prediction** – Use data to predict the value (or a range of reasonable values) of a continuous numeric response variable. For example:

- Predict sales volume of a product in a future time period
- Predict annual heating/cooling expenses for a set of buildings

**Segmentation** – Separate a set of observations into some collection of groups that are “most alike” within a group and “different from” other groups. For example:

- Identify groups of customers that have similar ordering patterns across seasons of the year
- Identify groups of items that tend to be purchased together

This blog post will focus on (linear) Discriminant Analysis (DA), one of the oldest techniques for solving **classification** problems. DA attempts to find a linear combination of features that separates two or more groups of observations. It is related to analysis of variance (ANOVA), the difference being that ANOVA attempts to predict a *continuous* dependent variable using one or more *categorical* independent variables, whereas DA attempts to predict a *categorical* dependent variable using one or more *continuous* independent variables. DA is very similar to logistic regression, with the fundamental difference that DA assumes that the independent variables are *normally distributed* within each group.

DA is also similar to Principal Component Analysis (PCA), especially in how the two are applied. DA differs from PCA in that it tries to find a vector in the variable space that best discriminates among the different groups, by modelling the difference between the centroids of each group. PCA instead tries to find a subspace of the variable space whose basis vectors best capture the variability among the different observations. In other words, DA deals directly with discrimination between groups, whereas PCA identifies the principal components of the data in its entirety, without particularly focusing on the underlying group structure[1].

To demonstrate how DA works, I’ll use an example adapted from Ragsdale (2018)[2] where potential employees are given a pre-employment test measuring their mechanical and verbal aptitudes and are then classified as having superior, average, or inferior performance (Fig. 1).

The *centroids* of the three groups indicate the average value of each independent variable (the two test scores) for that group. The aim of DA is to find a classification rule that maximizes the separation between the group means, while making as few “mistakes” as possible. There are multiple discrimination rules available; in this post I will demonstrate the maximum likelihood rule, implemented through the Mahalanobis distance. The idea is the following: an observation *i* from a multivariate normal distribution should be assigned to the group *G_j* that minimizes the Mahalanobis distance between *i* and the group centroid *C_j*.
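As a minimal sketch of the first step described above, the centroid of each group is just the per-variable mean of its observations. The function name `group_centroids` and the scores below are made up for illustration; they are not the data from the Ragsdale example.

```python
from collections import defaultdict

def group_centroids(observations, labels):
    """Return {label: [mean of each variable]} for labelled observations."""
    sums = {}
    counts = defaultdict(int)
    for x, g in zip(observations, labels):
        if g not in sums:
            sums[g] = [0.0] * len(x)
        # Accumulate the running sum of each variable for this group.
        sums[g] = [s + v for s, v in zip(sums[g], x)]
        counts[g] += 1
    return {g: [s / counts[g] for s in sums[g]] for g in sums}

# Hypothetical (mechanical, verbal) test scores, for illustration only.
scores = [(40.0, 38.0), (44.0, 42.0), (30.0, 35.0), (34.0, 31.0)]
groups = ["superior", "superior", "inferior", "inferior"]

centroids = group_centroids(scores, groups)
print(centroids)
```

Each centroid has one entry per independent variable, so with two test scores it is a point in the same plane as the observations.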

The reason to use Mahalanobis (as opposed to, say, Euclidean) distance is twofold: if the independent variables have **unequal variances** or if they are **correlated**, a distance metric that does not account for this could assign an observation to the wrong group. In Fig. 2, the independent variables are not correlated, but X2 appears to have much larger variance than X1, so the effects of small but important differences in X1 could be masked by large but unimportant differences in X2. The ellipses in the figure represent regions containing 99% of the values belonging to each group. By Euclidean distance, P1 would be assigned to group 2; however, it is unlikely to be in group 2 because its location with respect to the X1 axis exceeds the typical values for group 2. If the two attributes were additionally correlated (Fig. 3), we should also adjust for this correlation in our distance metric. The Mahalanobis distance therefore uses the covariance matrix for the independent variables in its calculation of the distance of an observation to group means.

It is given by:

$$D_{ij} = \sqrt{(x_i - \mu_j)^{T} S^{-1} (x_i - \mu_j)}$$

where $x_i$ is the vector of attributes for observation $i$, $\mu_j$ is the vector of means for group $j$ (the centroid *C_j* above), and $S$ is the covariance matrix for the variables. When using the Mahalanobis distance, points that are equidistant from a group mean lie on tilted ellipses.
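To make the effect of the covariance matrix concrete, here is a small sketch of the standard Mahalanobis distance, sqrt((x - mu)^T S^-1 (x - mu)), for the two-variable case, with the 2x2 inverse written out by hand. The helper name `mahalanobis_2d` and all numbers are assumptions chosen for illustration, not values from the post's figures.

```python
import math

def mahalanobis_2d(x, mu, cov):
    """Mahalanobis distance for two variables; cov is [[s11, s12], [s21, s22]]."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    d1, d2 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T S^{-1} (x - mu) using the explicit 2x2 inverse.
    q = (s22 * d1 * d1 - (s12 + s21) * d1 * d2 + s11 * d2 * d2) / det
    return math.sqrt(q)

mu = (0.0, 0.0)

# Case 1: uncorrelated variables, X2 has nine times the variance of X1.
cov_unequal = [[1.0, 0.0], [0.0, 9.0]]
a, b = (2.0, 0.0), (0.0, 3.0)
print(math.dist(a, mu), math.dist(b, mu))          # Euclidean: a is closer
print(mahalanobis_2d(a, mu, cov_unequal),
      mahalanobis_2d(b, mu, cov_unequal))          # Mahalanobis: b is closer

# Case 2: equal variances but positively correlated variables.
cov_corr = [[1.0, 0.8], [0.8, 1.0]]
along, across = (1.0, 1.0), (1.0, -1.0)
d_along = mahalanobis_2d(along, mu, cov_corr)
d_across = mahalanobis_2d(across, mu, cov_corr)
```

In the second case both points are equally far from the mean in Euclidean terms, but the point lying along the direction of correlation is much closer in Mahalanobis terms, which is what produces the tilted ellipses.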

For further discussion on the two distance metrics and their application, there’s also this blog post. Applying the formula to the dataset, we can calculate the distance between each observation and each group centroid, and use these distances to assign each observation to its closest group:
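The classification step can be sketched end to end as follows. This is an illustration under stated assumptions rather than the exact spreadsheet procedure: it assumes a single covariance matrix shared by all groups, estimated as the pooled within-group covariance, and it uses two made-up groups ("A" and "B") instead of the employee data. All function names are hypothetical.

```python
import math

def centroid(points):
    """Mean of each of the two variables over a list of (x1, x2) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def pooled_covariance(groups):
    """Pooled within-group 2x2 covariance (assumed shared by all groups)."""
    s11 = s12 = s22 = 0.0
    n_total = 0
    for points in groups.values():
        m1, m2 = centroid(points)
        for x1, x2 in points:
            s11 += (x1 - m1) ** 2
            s12 += (x1 - m1) * (x2 - m2)
            s22 += (x2 - m2) ** 2
        n_total += len(points)
    dof = n_total - len(groups)  # observations minus number of groups
    return [[s11 / dof, s12 / dof], [s12 / dof, s22 / dof]]

def mahalanobis_2d(x, mu, cov):
    """Mahalanobis distance for two variables with a symmetric 2x2 covariance."""
    (s11, s12), (_, s22) = cov
    det = s11 * s22 - s12 * s12
    d1, d2 = x[0] - mu[0], x[1] - mu[1]
    return math.sqrt((s22 * d1 * d1 - 2 * s12 * d1 * d2 + s11 * d2 * d2) / det)

def classify(x, centroids, cov):
    """Return the label of the closest centroid and all distances."""
    distances = {g: mahalanobis_2d(x, mu, cov) for g, mu in centroids.items()}
    return min(distances, key=distances.get), distances

# Hypothetical training data: X2 varies far more than X1 within each group.
training = {
    "A": [(1.0, 10.0), (2.0, 14.0), (3.0, 9.0), (2.0, 15.0)],
    "B": [(5.0, 11.0), (6.0, 16.0), (7.0, 12.0), (6.0, 13.0)],
}
centroids = {g: centroid(pts) for g, pts in training.items()}
cov = pooled_covariance(training)

new_point = (3.5, 17.0)
label, distances = classify(new_point, centroids, cov)
nearest_euclid = min(centroids, key=lambda g: math.dist(new_point, centroids[g]))
print(label, nearest_euclid)
```

With these numbers the two rules disagree: the Euclidean rule is swayed by the large gap in X2, while the Mahalanobis rule discounts it because X2 has a much larger variance.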

If we tally up the correct vs. incorrect (in red) classifications, our classification model is right 83.33% of the time. Assuming that this classification accuracy is sufficient, we can then apply this simple model to classify new employees based on their scores on the verbal and mechanical aptitude tests.
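Tallying correct and incorrect classifications is a simple count. The sketch below uses made-up labels (six hypothetical employees, one misclassified) purely to show the arithmetic; it does not reproduce the post's classification table.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of observations whose predicted group matches the actual group."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Hypothetical labels, for illustration only.
actual = ["superior", "superior", "average", "average", "inferior", "inferior"]
predicted = ["superior", "average", "average", "average", "inferior", "inferior"]

rate = accuracy(actual, predicted)
# Counts of (actual, predicted) pairs, i.e. a flattened confusion matrix.
confusion = Counter(zip(actual, predicted))
print(f"{rate:.2%}")
```

Looking at the (actual, predicted) counts rather than the single accuracy figure shows which groups are being confused with each other, which matters if some mistakes are more costly than others.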

[1] Martinez, A. M., and A. C. Kak. 2001. “PCA versus LDA.” *IEEE Transactions on Pattern Analysis and Machine Intelligence* 23 (2): 228–33. https://doi.org/10.1109/34.908974.

[2] Ragsdale, Cliff T. *Spreadsheet Modeling and Decision Analysis: a Practical Introduction to Business Analytics.* Eighth edition. Boston, MA: Cengage Learning, 2018.