Intro to Machine Learning Part 6: Gaussian Naive Bayes and Logistic Regression

Machine learning problems often involve binary classification, which seeks to use a data point’s features, x, to correctly predict its label, y. In my last post I discussed binary classification with Support Vector Machines (SVM), which formulate the classification problem as a search for the maximum margin hyperplane that divides two classes. Today we’ll take a different view on binary classification: we’ll use our training set to construct P(y|x), the probability of class y given a set of features x, and classify each point by determining which class is more likely. We’ll examine two algorithms that use different strategies for estimating P(y|x): Naïve Bayes and logistic regression. I’ll demonstrate the two classifiers on an example data set I’ve created, shown in Figure 1 below. The data set contains features X = (X1, X2) and labels Y ∈ {+1, -1}; positive points are shown as blue circles and negative points as red triangles. This example was inspired by an in-class exercise in CS 5780 at Cornell, though I created the data set and code myself using Python’s scikit-learn package.


Figure 1: Example training set

 

Gaussian Naïve Bayes

Naïve Bayes is a generative algorithm, meaning that it uses a set of training data to estimate the joint distribution P(x,y) and then uses Bayes’ rule to find P(y|x):

P(y|x)=\frac{P(x|y)P(y)}{P(x)}                                (1)

Bayes’ rule itself always holds. What makes P(x|y) tractable to estimate is the Naïve Bayes assumption, which states that feature values are independent given the label, so that P(x|y) = \prod_\alpha P(x_\alpha|y). While this is a strong assumption, classifiers built on it are often effective even when it is violated.

To use Bayes’ rule to construct a classifier, we need a second assumption regarding the conditional distribution of the features x given each label y. Here we’ll use a Gaussian distribution such that:

P(x|y) ~ N(\mu_y, \Sigma_y)                                                                                   (2)

Where \Sigma_y is a diagonal covariance matrix with [\Sigma_y]_{\alpha,\alpha}=\sigma^2_{\alpha, y} for each feature \alpha.

For each feature \alpha and each class c, we can then model P(x_\alpha|y) as:

P(x_\alpha|y=c) ~ N(\mu_{\alpha c},\sigma^2_{\alpha c})=\frac{1}{\sqrt{2\pi}\sigma_{\alpha c}}e^{-\frac{1}{2}(\frac{x_\alpha-\mu_{\alpha c}}{\sigma_{\alpha c}})^{2}}                              (3)

We can then estimate model parameters:

\mu_{\alpha c} = \frac{1}{n_c}\sum^{n}_{i=1}I(y_i=c)x_{i \alpha}                                                                   (4)

\sigma^2_{\alpha c} = \frac{1}{n_c}\sum^{n}_{i=1}I(y_i=c)(x_{i \alpha}-\mu_{\alpha c})^2                                                  (5)

Where:

n_c = \sum^{n}_{i=1}I(y_i=c)                                                                                (6)

Equations 4 and 5 are the maximum likelihood estimates (MLE) of the parameters; maximum a posteriori (MAP) estimation can be used instead.
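
Equations 1 and 3 through 6 can be sketched directly in NumPy. This is a minimal illustration rather than scikit-learn’s implementation; the function names `fit_gaussian_nb` and `posterior` are my own, and the data are the training points used later in this post.

```python
import numpy as np

# Training set from this post: 13 positive points followed by 15 negative points.
pos = np.array([[1, 5], [1, 7], [1, 9], [2, 8], [3, 7], [1, 11], [3, 3],
                [5, 5], [4, 8], [5, 9], [2, 6], [3, 9], [4, 4]], dtype=float)
neg = np.array([[4, 1], [5, 1], [3, 2], [2, 1], [8, 4], [6, 2], [5, 3],
                [4, 2], [7, 1], [5, 4], [6, 3], [7, 4], [4, 3], [5, 2], [8, 5]],
               dtype=float)
X = np.concatenate((pos, neg), axis=0)
y = np.concatenate((np.ones(len(pos)), -np.ones(len(neg))))


def fit_gaussian_nb(X, y):
    """Maximum likelihood estimates of the class priors, means and variances."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]                              # rows where I(y_i = c) = 1
        n_c = Xc.shape[0]                           # equation 6
        mu = Xc.sum(axis=0) / n_c                   # equation 4
        var = ((Xc - mu) ** 2).sum(axis=0) / n_c    # equation 5
        params[float(c)] = {"prior": n_c / X.shape[0], "mu": mu, "var": var}
    return params


def posterior(x, params):
    """P(y=c|x) from Bayes' rule (equation 1) with per-feature Gaussians (equation 3)."""
    joint = {}
    for c, p in params.items():
        density = (np.exp(-0.5 * (x - p["mu"]) ** 2 / p["var"])
                   / np.sqrt(2 * np.pi * p["var"]))
        # Naive Bayes assumption: P(x|y) is the product over features.
        joint[c] = p["prior"] * np.prod(density)
    evidence = sum(joint.values())                  # P(x)
    return {c: j / evidence for c, j in joint.items()}


params = fit_gaussian_nb(X, y)
print("positive class mean:", params[1.0]["mu"])
print("negative class mean:", params[-1.0]["mu"])
print("P(y|x) at x = (2, 8):", posterior(np.array([2.0, 8.0]), params))
```

The class prior P(y=c) is estimated here as n_c / n, which is the usual MLE but is my choice for this sketch.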

Once we have fit the conditional Gaussian model to our data set, we can derive a linear classifier, a hyperplane that separates the two classes. When each feature’s variance is shared across the two classes, the classifier takes the following form:

P(y|x) = \frac{1}{1+e^{-y(w^T x+b)}}                                                                             (7)

Where w is a vector of coefficients that defines the separating hyperplane and b is the hyperplane’s intercept. Both w and b are functions of the Gaussian moments estimated in equations 4 and 5. For a full derivation of the linear classifier starting with the Naive Bayes assumption, see the excellent course notes from CS 5780.
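
The mapping from Gaussian moments to w and b can be checked numerically. The sketch below assumes each feature’s variance is shared by both classes, the case in which the Gaussian Naive Bayes boundary is exactly linear. The parameter values are made up for illustration and are not fitted to the data in this post.

```python
import numpy as np

# Hypothetical parameters for two features (illustration only).
mu_pos = np.array([2.0, 7.0])    # class +1 means
mu_neg = np.array([5.0, 2.5])    # class -1 means
var = np.array([2.0, 3.0])       # per-feature variance, shared by both classes
prior_pos, prior_neg = 0.4, 0.6  # P(y=+1), P(y=-1)

# Hyperplane implied by the moments: w.x + b is the log-odds of y=+1 versus y=-1.
w = (mu_pos - mu_neg) / var
b = np.log(prior_pos / prior_neg) - np.sum((mu_pos ** 2 - mu_neg ** 2) / (2 * var))


def likelihood(x, mu, var):
    """P(x|y) as a product of independent per-feature Gaussian densities."""
    return np.prod(np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var))


def posterior_bayes(x):
    """P(y=+1|x) computed directly from Bayes' rule (equation 1)."""
    joint_pos = prior_pos * likelihood(x, mu_pos, var)
    joint_neg = prior_neg * likelihood(x, mu_neg, var)
    return joint_pos / (joint_pos + joint_neg)


def posterior_linear(x, y=1):
    """P(y|x) from the linear form in equation 7."""
    return 1.0 / (1.0 + np.exp(-y * (w @ x + b)))


x = np.array([3.0, 4.0])
print(posterior_bayes(x), posterior_linear(x))
```

The two printed values should agree to floating-point precision, because the quadratic terms in x cancel when the variances are shared.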

Logistic Regression

Logistic regression is the discriminative counterpart to Naive Bayes. Rather than modeling P(x,y) and using it to estimate P(y|x), logistic regression models P(y|x) directly:

P(y|x) = \frac{1}{1+e^{-y(w^T x+b)}}                                                                              (8)

Logistic regression uses MLE or MAP to directly estimate the parameters of the separating hyperplane, w and b, rather than deriving them from the moments of P(x,y). Rather than seeking to fit parameters that best describe the training data, logistic regression seeks to fit a hyperplane that best separates the training data. For a derivation of the MLE and MAP estimates of the logistic regression parameters, see the class notes from CS 5780.
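
To make the MLE step concrete, here is a minimal sketch that fits w and b by plain gradient descent on the negative log-likelihood implied by equation 8. The function names, step size and iteration count are my own choices for illustration. scikit-learn’s LogisticRegression uses a different solver and applies L2 regularization by default, so its coefficients will generally differ from these.

```python
import numpy as np

# Training set from this post, with labels in {-1, +1}.
pos = np.array([[1, 5], [1, 7], [1, 9], [2, 8], [3, 7], [1, 11], [3, 3],
                [5, 5], [4, 8], [5, 9], [2, 6], [3, 9], [4, 4]], dtype=float)
neg = np.array([[4, 1], [5, 1], [3, 2], [2, 1], [8, 4], [6, 2], [5, 3],
                [4, 2], [7, 1], [5, 4], [6, 3], [7, 4], [4, 3], [5, 2], [8, 5]],
               dtype=float)
X = np.concatenate((pos, neg), axis=0)
y = np.concatenate((np.ones(len(pos)), -np.ones(len(neg))))


def nll(X, y, w, b):
    """Mean negative log-likelihood: mean of log(1 + exp(-y (w.x + b)))."""
    return np.mean(np.log1p(np.exp(-y * (X @ w + b))))


def fit_logistic_regression(X, y, step=0.05, n_iter=5000):
    """Gradient descent on the mean negative log-likelihood (unregularized)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        margins = y * (X @ w + b)
        # derivative of log(1 + exp(-m)) with respect to (w.x + b) is -y / (1 + exp(m))
        coeff = -y / (1.0 + np.exp(margins))
        w -= step * (X.T @ coeff) / len(y)
        b -= step * coeff.mean()
    return w, b


def predict_proba(X, w, b):
    """P(y=+1|x) for each row of X, from equation 8 with y = +1."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))


w, b = fit_logistic_regression(X, y)
print("w =", w, "b =", b)
print("mean negative log-likelihood:", nll(X, y, w, b))
```

Stopping after a fixed number of iterations keeps the sketch short; a practical implementation would use a convergence check and usually a regularization term (the MAP estimate).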

Comparing Gaussian Naive Bayes and Logistic Regression

Below I fit both classifiers with Python’s scikit-learn package and plot their predicted classifications. Results are shown in Figure 2.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
import seaborn as sns
sns.set(style='whitegrid')

## create a training data set ##
pos = np.array([[1,5], [1,7], [1,9], [2,8], [3,7], [1,11], [3,3],
                [5,5], [4,8], [5,9], [2,6], [3,9], [4,4]])
neg = np.array([[4,1], [5,1], [3,2], [2,1], [8,4], [6,2], [5,3],
                [4,2], [7,1], [5,4], [6,3], [7,4], [4,3], [5,2], [8,5]])
all_points = np.concatenate((pos, neg), 0)
# labels: +1 for each positive point, -1 for each negative point
labels = np.concatenate((np.ones(len(pos), dtype=int), -np.ones(len(neg), dtype=int)))

## compare Naive Bayes and Logistic Regression ##

# Fit Naive Bayes
gnb = GaussianNB()
gnb.fit(all_points, labels)

# make NB predictions and plot
x1_mesh, x2_mesh = np.meshgrid(np.arange(0,11,1), np.arange(0,11,1))
Y_NB = gnb.predict_proba(np.c_[x1_mesh.ravel(), x2_mesh.ravel()])[:,1]
Y_NB = Y_NB.reshape(x1_mesh.shape)

fig1, axes = plt.subplots(1,2, figsize=(10,4))

# two filled bands split at 0.5, the probability at which the predicted label flips
axes[0].contourf(x1_mesh, x2_mesh, Y_NB, levels=[0, 0.5, 1], cmap='RdBu')
axes[0].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[0].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[0].set_xlim([0,10]); axes[0].set_ylim([0,10]); axes[0].set_xlabel('X1')
axes[0].set_ylabel('X2'); axes[0].set_title('Naive Bayes')

# Fit Logistic Regression
lr = LogisticRegression()
lr.fit(all_points, labels)

# Make predictions and plot
Y_LR = lr.predict_proba(np.c_[x1_mesh.ravel(), x2_mesh.ravel()])[:,1]
Y_LR = Y_LR.reshape(x1_mesh.shape)

# two filled bands split at 0.5, the probability at which the predicted label flips
axes[1].contourf(x1_mesh, x2_mesh, Y_LR, levels=[0, 0.5, 1], cmap='RdBu')
axes[1].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[1].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[1].set_xlim([0,10]); axes[1].set_ylim([0,10]); axes[1].set_xlabel('X1'); 
axes[1].set_ylabel('X2'); axes[1].set_title("Logistic Regression")
plt.savefig('compare_classification.png', bbox_inches='tight')

 

 


Figure 2: Example classification with Gaussian Naive Bayes (left) and Logistic Regression (right). Blue shaded areas represent predictions of positive labels; red shaded areas represent predictions of negative labels.

Figure 2 illustrates an important difference in the treatment of outliers between the two classifiers. Gaussian Naive Bayes assumes that points close to the centroid of a class are likely to be members of that class, which leads it to mislabel the positive training points with features (3,3), (4,4) and (5,5). Logistic regression, on the other hand, is only concerned with correctly classifying points, so the signal from the outliers has more influence on its classification.

So which algorithm should you use? The answer, as usual, is that it depends. In this example, logistic regression is able to correctly classify the outliers with positive labels while Naïve Bayes is not. If these points are indeed an indicator of the underlying structure of positive points, then logistic regression has performed better. On the other hand, if they are truly outliers, then Naïve Bayes has performed better. In general, logistic regression has been found to outperform Naïve Bayes on large data sets but is prone to overfitting small data sets. The two algorithms will converge asymptotically if the Naïve Bayes assumption holds.

Visualizing P(y|x)

One advantage of these methods is that they provide estimates of P(y|x), whereas other methods such as SVM only provide a separating hyperplane. These probabilities can be useful in decision-making contexts such as scenario discovery for water resources systems, as demonstrated in Quinn et al., 2018. Below, I use scikit-learn to plot the classification probabilities for both algorithms.

# plot Naive Bayes predicted probabilities
fig2, axes = plt.subplots(1,2, figsize=(12,4))
axes[0].contourf(x1_mesh, x2_mesh, Y_NB, levels=(np.linspace(0,1,100)), \
cmap='RdBu')
axes[0].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[0].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[0].set_xlim([0,10]); axes[0].set_ylim([0,10]); axes[0].set_xlabel('X1'); 
axes[0].set_ylabel('X2'); axes[0].set_title('Naive Bayes')

# plot Logistic Regression predicted probabilities
LRcont = axes[1].contourf(x1_mesh, x2_mesh, Y_LR, levels=(np.linspace(0,1,100)), \
cmap='RdBu')
axes[1].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[1].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[1].set_xlim([0,10]); axes[1].set_ylim([0,10]); axes[1].set_xlabel('X1')
axes[1].set_ylabel('X2'); axes[1].set_title('Logistic Regression')
cb = fig2.colorbar(LRcont, ax=axes.ravel().tolist())
cb.set_label('Probability of Positive Classification')
cb.set_ticks([0, .25, .5, .75, 1])
cb.set_ticklabels(["0", "0.25", "0.5", "0.75", "1.0"])
plt.savefig('compare_probs.png', bbox_inches='tight')


Figure 3: Conditional probabilities P(y|x) generated by Naive Bayes (left) and Logistic Regression (right).
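
Because both classifiers return P(y|x) rather than only a label, the probability at which a point is called positive can be chosen to suit the decision at hand. The sketch below is a hypothetical illustration: the `classify` helper and the probability values are made up and are not outputs of the models above.

```python
import numpy as np


def classify(prob_pos, threshold=0.5):
    """Return +1 where P(y=+1|x) >= threshold and -1 elsewhere."""
    return np.where(np.asarray(prob_pos) >= threshold, 1, -1)


# Hypothetical values of P(y=+1|x) for five points.
probs = np.array([0.05, 0.35, 0.55, 0.80, 0.97])

print(classify(probs))                 # default rule: the more likely class
print(classify(probs, threshold=0.9))  # stricter rule: positive only when very confident
```

A stricter threshold trades missed positives for fewer false positives, which can matter when the two kinds of error have different costs.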

Further reading

This post has focused on Gaussian Naive Bayes as it is the direct counterpart of Logistic Regression for continuous data. It’s important to note, however, that Naive Bayes is frequently used on data with binomial or multinomial features. Examples include spam filters and language classifiers. For more information on Naive Bayes in these contexts, see these notes from CS 5780.

As mentioned above, logistic regression has been used for scenario discovery in water resources systems; for more detail and context, see Julie’s blog post.

References

Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

Course Notes from the University of Pennsylvania (CIS 520): https://alliance.seas.upenn.edu/~cis520/wiki/index.php?n=Lectures.Logistic

Course Notes from Cornell: http://www.cs.cornell.edu/courses/cs4780/2018fa/syllabus/index.html

Quinn, J. D., Reed, P. M., Giuliani, M., Castelletti, A., Oyler, J. W., & Nicholas, R. E. (2018). Exploring how changing monsoonal dynamics and human pressures challenge multireservoir management for flood protection, hydropower production, and agricultural water supply. Water Resources Research, 54, 4638–4662. https://doi.org/10.1029/2018WR022743
