Machine learning problems often involve binary classification, which seeks to use a data point’s features, x, to correctly predict its label, y. In my last post I discussed binary classification with Support Vector Machines (SVM), which formulate the classification problem as a search for the maximum margin hyperplane that divides two classes. Today we’ll take a different view on binary classification: we’ll use our training set to construct P(y|x), the probability of class y given a set of features x, and classify each point by determining which class it is more likely to belong to. We’ll examine two algorithms that use different strategies for estimating P(y|x): Naïve Bayes and logistic regression.
I’ll demonstrate the two classifiers on an example data set I’ve created, shown in Figure 1 below. The data set contains features X = (X1, X2) and labels Y ∈ {+1, -1}; positive points are shown as blue circles and negative points as red triangles. This example was inspired by an in-class exercise in CS 5780 at Cornell, though I’ve created this data set and code myself using Python’s scikit-learn package.
Gaussian Naïve Bayes
Naïve Bayes is a generative algorithm, meaning that it uses a set of training data to estimate P(x,y) and then uses Bayes' rule to find P(y|x):
$$P(y|x) = \frac{P(x|y)\,P(y)}{P(x)} = \frac{P(y)\prod_{\alpha=1}^{d} P(x_\alpha|y)}{P(x)} \qquad (1)$$
Here d is the number of features. The factorization of P(x|y) into a product over features in equation 1 relies on the Naïve Bayes assumption, which states that feature values are independent given the label. While this is a strong assumption, it turns out that classifiers built on it can be effective even when it is violated.
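Since P(x) is the same for every label, it can be dropped when picking the most likely class. As a short worked step (with d denoting the number of features and c a candidate label, notation assumed here for illustration), the decision rule implied by Bayes' rule and the Naïve Bayes assumption can be written as:

```latex
\hat{y} = \arg\max_{c \in \{+1,-1\}} P(y=c \mid x)
        = \arg\max_{c \in \{+1,-1\}} \; P(y=c) \prod_{\alpha=1}^{d} P(x_\alpha \mid y=c)
```

In practice the product is usually evaluated as a sum of logarithms to avoid numerical underflow.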
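To use Bayes' rule to construct a classifier, we need a second assumption regarding the conditional distribution of the features x given each label y. Here we’ll use a Gaussian distribution such that:
$$P(x|y=c) = \mathcal{N}(\mu_c, \Sigma_c) \qquad (2)$$
where $\Sigma_c$ is a diagonal covariance matrix with $[\Sigma_c]_{\alpha,\alpha} = \sigma^2_{\alpha,c}$ for each feature $\alpha$.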
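For each feature, $\alpha$, and each class, $c$, we can then model $P(x_\alpha|y=c)$ as:
$$P(x_\alpha|y=c) = \frac{1}{\sqrt{2\pi\sigma^2_{\alpha,c}}} \exp\left(-\frac{(x_\alpha-\mu_{\alpha,c})^2}{2\sigma^2_{\alpha,c}}\right) \qquad (3)$$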
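We can then estimate the model parameters:
$$\mu_{\alpha,c} = \frac{1}{n_c}\sum_{i=1}^{n} I(y_i=c)\,x_{i,\alpha} \qquad (4)$$
$$\sigma^2_{\alpha,c} = \frac{1}{n_c}\sum_{i=1}^{n} I(y_i=c)\,(x_{i,\alpha}-\mu_{\alpha,c})^2 \qquad (5)$$
where $n_c = \sum_{i=1}^{n} I(y_i=c)$ is the number of training points in class $c$ and $I(\cdot)$ is the indicator function.
Parameters can be estimated with maximum likelihood estimation (MLE) or maximum a posteriori estimation (MAP).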
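As a minimal sketch of the MLE case, the per-class priors, means and variances can be computed directly with NumPy on the example data set used in this post. The helper name `fit_gaussian_moments` is hypothetical and is not part of scikit-learn.

```python
import numpy as np

# Example data set from this post: 13 positive and 15 negative points
pos = np.array([[1,5], [1,7], [1,9], [2,8], [3,7], [1,11], [3,3],
                [5,5], [4,8], [5,9], [2,6], [3,9], [4,4]])
neg = np.array([[4,1], [5,1], [3,2], [2,1], [8,4], [6,2], [5,3],
                [4,2], [7,1], [5,4], [6,3], [7,4], [4,3], [5,2], [8,5]])
X = np.concatenate((pos, neg), axis=0)
y = np.concatenate((np.ones(len(pos)), -np.ones(len(neg))))

def fit_gaussian_moments(X, y):
    """Hypothetical helper: MLE of the prior, per-feature mean and
    per-feature variance for each class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]               # training points belonging to class c
        prior = len(Xc) / len(X)     # estimate of P(y = c)
        mu = Xc.mean(axis=0)         # per-feature class mean
        var = Xc.var(axis=0)         # per-feature class variance (divides by n_c)
        params[c] = (prior, mu, var)
    return params

params = fit_gaussian_moments(X, y)
for c, (prior, mu, var) in params.items():
    print(c, prior, mu, var)
```

A MAP estimate would add a prior on these parameters; that variant is not shown here.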
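Once we have fit the conditional Gaussian model to our data set, we can derive a linear classifier, a hyperplane that separates the two classes. Strictly, the boundary is linear when each feature's variance is shared across the two classes. The classifier takes the following form:
$$P(y|x) = \frac{1}{1+e^{-y(w^\top x + b)}} \qquad (6)$$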
where w is a vector of coefficients that defines the separating hyperplane and b is the hyperplane’s intercept. Both w and b are functions of the Gaussian moments derived in equations 4 and 5. For a full derivation of the linear classifier starting with the Naive Bayes assumption, see the excellent course notes from CS 5780.
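The following sketch shows one way w and b can be written in terms of the Gaussian moments. It assumes a single variance per feature shared by both classes (a pooled estimate), which is the case in which the boundary is exactly linear. The function names are hypothetical, and scikit-learn's GaussianNB fits a separate variance for each class, so its boundary may differ from this one.

```python
import numpy as np

pos = np.array([[1,5], [1,7], [1,9], [2,8], [3,7], [1,11], [3,3],
                [5,5], [4,8], [5,9], [2,6], [3,9], [4,4]])
neg = np.array([[4,1], [5,1], [3,2], [2,1], [8,4], [6,2], [5,3],
                [4,2], [7,1], [5,4], [6,3], [7,4], [4,3], [5,2], [8,5]])

# Class priors and per-feature class means
pi_pos = len(pos) / (len(pos) + len(neg))
pi_neg = 1.0 - pi_pos
mu_pos, mu_neg = pos.mean(axis=0), neg.mean(axis=0)

# ASSUMPTION: one variance per feature, shared by both classes (pooled MLE)
resid = np.concatenate((pos - mu_pos, neg - mu_neg), axis=0)
var = (resid ** 2).mean(axis=0)

# Hyperplane parameters expressed through the moments
w = (mu_pos - mu_neg) / var
b = np.log(pi_pos / pi_neg) + np.sum((mu_neg ** 2 - mu_pos ** 2) / (2 * var))

def p_pos_linear(x):
    """P(y=+1|x) from the linear form 1 / (1 + exp(-(w.x + b)))."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def p_pos_bayes(x):
    """P(y=+1|x) computed directly from Bayes' rule, for comparison."""
    def likelihood(mu):
        return np.prod(np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var))
    num = likelihood(mu_pos) * pi_pos
    return num / (num + likelihood(mu_neg) * pi_neg)

x_query = np.array([3.0, 3.0])
print(p_pos_linear(x_query), p_pos_bayes(x_query))
```

The two printed values should agree up to floating point error, since the linear form is just the Bayes posterior rearranged.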
Logistic regression is the discriminative counterpart to Naive Bayes. Rather than modeling P(x,y) and using it to estimate P(y|x), logistic regression models P(y|x) directly:
$$P(y|x) = \frac{1}{1+e^{-y(w^\top x + b)}} \qquad (7)$$
Logistic regression uses MLE or MAP to directly estimate the parameters of the separating hyperplane, w and b, rather than deriving them from the moments of P(x,y). Rather than seeking to fit parameters that best describe the training data, logistic regression seeks to fit a hyperplane that best separates the training data. For a derivation of the MLE and MAP estimates of the logistic regression parameters, see the class notes from CS 5780.
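To make the MLE idea concrete, here is a rough sketch that fits w and b by plain gradient descent on the negative log-likelihood, using the example data from this post. The function names, step size and iteration count are illustrative assumptions; this is not how scikit-learn's solver works.

```python
import numpy as np

pos = np.array([[1,5], [1,7], [1,9], [2,8], [3,7], [1,11], [3,3],
                [5,5], [4,8], [5,9], [2,6], [3,9], [4,4]])
neg = np.array([[4,1], [5,1], [3,2], [2,1], [8,4], [6,2], [5,3],
                [4,2], [7,1], [5,4], [6,3], [7,4], [4,3], [5,2], [8,5]])
X = np.concatenate((pos, neg), axis=0).astype(float)
y = np.concatenate((np.ones(len(pos)), -np.ones(len(neg))))

def neg_log_likelihood(w, b, X, y):
    """Sum over points of log(1 + exp(-y_i (w.x_i + b)))."""
    return np.sum(np.logaddexp(0.0, -y * (X @ w + b)))

def fit_logistic(X, y, step=0.01, n_iter=5000):
    """Hypothetical helper: unregularized MLE by gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        margin = y * (X @ w + b)
        # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m)); the chain rule adds y_i x_i
        coef = -y / (1.0 + np.exp(margin))
        w -= step * (X.T @ coef) / len(y)
        b -= step * coef.sum() / len(y)
    return w, b

w, b = fit_logistic(X, y)
print("w =", w, "b =", b)
print("negative log-likelihood:", neg_log_likelihood(w, b, X, y))
```

Because this example data set happens to be linearly separable, the unregularized weights keep growing with more iterations. scikit-learn's LogisticRegression applies an L2 penalty by default, which corresponds to a MAP estimate rather than a pure MLE.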
Comparing Gaussian Naive Bayes and Logistic Regression
Below I’ve plotted the estimated classifications by the two algorithms using the Scikit-learn package in Python. Results are shown in Figure 2.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
import seaborn as sns
sns.set(style='whitegrid')

## create a test data set ##
pos = np.array([[1,5], [1,7], [1,9], [2,8], [3,7], [1,11], [3,3],
                [5,5], [4,8], [5,9], [2,6], [3,9], [4,4]])
neg = np.array([[4,1], [5,1], [3,2], [2,1], [8,4], [6,2], [5,3],
                [4,2], [7,1], [5,4], [6,3], [7,4], [4,3], [5,2], [8,5]])
all_points = np.concatenate((pos,neg), 0)
labels = np.array([1,1,1,1,1,1,1,1,1,1,1,1,1,
                   -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1])

## compare Naive Bayes and Logistic Regression ##
# Fit Naive Bayes
gnb = GaussianNB()
gnb.fit(all_points, labels)

# make NB predictions on a grid; column 1 of predict_proba is P(y=+1|x)
x1_mesh, x2_mesh = np.meshgrid(np.arange(0,11,1), np.arange(0,11,1))
Y_NB = gnb.predict_proba(np.c_[x1_mesh.ravel(), x2_mesh.ravel()])[:,1]
Y_NB = Y_NB.reshape(x1_mesh.shape)

# left panel: Naive Bayes classification (levels [0, 0.55, 1.1] give two filled bands)
fig1, axes = plt.subplots(1,2, figsize=(10,4))
axes[0].contourf(x1_mesh, x2_mesh, Y_NB, levels=(np.linspace(0,1.1,3)),
                 cmap='RdBu')
axes[0].scatter(pos[:,0], pos[:,1], s=50, edgecolors='none')
axes[0].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,
                edgecolors='none')
axes[0].set_xlim([0,10]); axes[0].set_ylim([0,10])
axes[0].set_xlabel('X1'); axes[0].set_ylabel('X2')
axes[0].set_title('Naive Bayes')
#plt.legend(['Positive Points', 'Negative Points'], scatterpoints=1)
#plt.savefig('NB_classification.png', bbox_inches='tight')

# Fit Logistic Regression
lr = LogisticRegression()
lr.fit(all_points, labels)

# Make predictions and plot in the right panel
Y_LR = lr.predict_proba(np.c_[x1_mesh.ravel(), x2_mesh.ravel()])[:,1]
Y_LR = Y_LR.reshape(x1_mesh.shape)
axes[1].contourf(x1_mesh, x2_mesh, Y_LR, levels=(np.linspace(0,1.1,3)),
                 cmap='RdBu')
axes[1].scatter(pos[:,0], pos[:,1], s=50, edgecolors='none')
axes[1].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,
                edgecolors='none')
axes[1].set_xlim([0,10]); axes[1].set_ylim([0,10])
axes[1].set_xlabel('X1'); axes[1].set_ylabel('X2')
axes[1].set_title("Logistic Regression")
plt.savefig('compare_classification.png', bbox_inches='tight')
Figure 2 illustrates an important difference in the treatment of outliers between the two classifiers. Gaussian Naive Bayes assumes that points close to the centroid of a class are likely to be members of that class, which leads it to mislabel the positive training points with features (3,3), (4,4) and (5,5). Logistic regression, on the other hand, is only concerned with correctly classifying points, so the signal from the outliers has more influence on its classification.
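One way to check this numerically rather than by reading the figure is to ask each fitted classifier for its labels at the three outlying points and to list every training point it mislabels. This is a small sketch with scikit-learn's default settings; the variable names are illustrative, and the printed results may vary with the library version.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

pos = np.array([[1,5], [1,7], [1,9], [2,8], [3,7], [1,11], [3,3],
                [5,5], [4,8], [5,9], [2,6], [3,9], [4,4]])
neg = np.array([[4,1], [5,1], [3,2], [2,1], [8,4], [6,2], [5,3],
                [4,2], [7,1], [5,4], [6,3], [7,4], [4,3], [5,2], [8,5]])
X = np.concatenate((pos, neg), axis=0)
y = np.array([1] * len(pos) + [-1] * len(neg))

# Fit both classifiers with their default settings
gnb = GaussianNB().fit(X, y)
logreg = LogisticRegression().fit(X, y)

# Predicted labels at the three positive points discussed above
outliers = np.array([[3, 3], [4, 4], [5, 5]])
pred_nb = gnb.predict(outliers)
pred_lr = logreg.predict(outliers)
print("Naive Bayes labels:        ", pred_nb)
print("Logistic regression labels:", pred_lr)

# Training points that each classifier mislabels
print("Mislabeled by Naive Bayes:\n", X[gnb.predict(X) != y])
print("Mislabeled by logistic regression:\n", X[logreg.predict(X) != y])
```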
So which algorithm should you use? The answer, as usual, is that it depends. In this example, logistic regression is able to correctly classify the outliers with positive labels while Naïve Bayes is not. If these points are indeed an indicator of the underlying structure of positive points, then logistic regression has performed better. On the other hand, if they are truly outliers, then Naïve Bayes has performed better. In general, logistic regression has been found to outperform Naïve Bayes on large data sets but is prone to overfit small data sets. The two algorithms will converge asymptotically if the Naïve Bayes assumption holds.
One advantage of these methods for classification is that they provide estimates of P(y|x), whereas other methods such as SVM only provide a separating hyperplane. These probabilities can be useful in decision-making contexts such as scenario discovery for water resources systems, as demonstrated in Quinn et al., 2018. Below, I use scikit-learn to plot the classification probabilities for both algorithms.
# plot Naive Bayes predicted probabilities (reuses x1_mesh, x2_mesh, Y_NB, Y_LR, pos, neg from above)
fig2, axes = plt.subplots(1,2, figsize=(12,4))
axes[0].contourf(x1_mesh, x2_mesh, Y_NB, levels=(np.linspace(0,1,100)),
                 cmap='RdBu')
axes[0].scatter(pos[:,0], pos[:,1], s=50, edgecolors='none')
axes[0].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,
                edgecolors='none')
axes[0].set_xlim([0,10]); axes[0].set_ylim([0,10])
axes[0].set_xlabel('X1'); axes[0].set_ylabel('X2')
axes[0].set_title('Naive Bayes')

# plot Logistic Regression predicted probabilities
LRcont = axes[1].contourf(x1_mesh, x2_mesh, Y_LR, levels=(np.linspace(0,1,100)),
                          cmap='RdBu')
axes[1].scatter(pos[:,0], pos[:,1], s=50, edgecolors='none')
axes[1].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,
                edgecolors='none')
axes[1].set_xlim([0,10]); axes[1].set_ylim([0,10])
axes[1].set_xlabel('X1'); axes[1].set_ylabel('X2')
axes[1].set_title('Logistic Regression')

# shared colorbar for both panels
cb = fig2.colorbar(LRcont, ax=axes.ravel().tolist())
cb.set_label('Probability of Positive Classification')
cb.set_ticks([0, .25, .5, .75, 1])
cb.set_ticklabels(["0", "0.25", "0.5", "0.75", "1.0"])
plt.savefig('compare_probs.png', bbox_inches='tight')
This post has focused on Gaussian Naive Bayes as it is the direct counterpart of logistic regression for continuous data. It’s important to note, however, that Naive Bayes is frequently used on data with binomial or multinomial features. Examples include spam filters and language classifiers. For more information on Naive Bayes in these contexts, see these notes from CS 5780.
As mentioned above, logistic regression has been used for scenario discovery in water resources systems. For more detail and context, see Julie’s blog post.
Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
Course Notes from the University of Pennsylvania (CIS 520): https://alliance.seas.upenn.edu/~cis520/wiki/index.php?n=Lectures.Logistic
Course Notes from Cornell: http://www.cs.cornell.edu/courses/cs4780/2018fa/syllabus/index.html
Quinn, J. D., Reed, P. M., Giuliani, M., Castelletti, A., Oyler, J. W., & Nicholas, R. E. (2018). Exploring how changing monsoonal dynamics and human pressures challenge multireservoir management for flood protection, hydropower production, and agricultural water supply. Water Resources Research, 54, 4638–4662. https://doi.org/10.1029/2018WR022743