# Introduction

Once upon a time, in a machine learning class project, student Michael Kearns asked if it is possible to convert a weak classifier (high bias classifier whose outputs are correct only slightly over 50% of the time) into a classifier of arbitrary accuracy by using an ensemble of such classifiers. This question was posed in 1988 and, two years later, in 1990, Robert Schapire answered that it is possible (Schapire, 1990). And so boosting was born.

The idea of boosting is to train an ensemble of weak classifiers, each of which is an expert in a specific region of the space. The ensemble of classifiers has the form below:
$H(\vec{x}) = \sum_{t=1}^T \alpha_th_t(\vec{x})$
where H is the ensemble of classifiers, $\alpha_t$ is the weight assigned to weak classifier $h_t$ around samples $\vec{x}$ at iteration t, and T is the number of classifiers in the ensemble.
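
To make the formula concrete, here is a minimal sketch of how such an ensemble could be evaluated. The three stumps, their thresholds, and the weights in `alphas` are hypothetical values chosen only for illustration, and the sketch assumes each weak classifier returns -1 or 1 and that the final class is the sign of $H(\vec{x})$.

```python
def ensemble_score(x, classifiers, alphas):
    """Return H(x) = sum_t alpha_t * h_t(x) for a single sample x."""
    return sum(alpha * h(x) for h, alpha in zip(classifiers, alphas))


def ensemble_predict(x, classifiers, alphas):
    """Classify x by the sign of H(x); a score of exactly 0 is assigned to +1 here."""
    return 1 if ensemble_score(x, classifiers, alphas) >= 0 else -1


# Three hypothetical weak classifiers on 2-D points (x[0], x[1]).
stumps = [
    lambda x: 1 if x[1] > 0.5 else -1,  # horizontal split
    lambda x: 1 if x[0] < 0.3 else -1,  # vertical split on the left
    lambda x: 1 if x[0] < 0.8 else -1,  # vertical split on the right
]
alphas = [0.4, 0.6, 0.9]  # hypothetical classifier weights

score = ensemble_score((0.5, 0.9), stumps, alphas)   # 0.4 - 0.6 + 0.9
label = ensemble_predict((0.5, 0.9), stumps, alphas)
```

The second stump votes against the other two for this point, but its weight is not large enough to overturn the combined vote.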

Boosting creates such an ensemble in a similar fashion to gradient descent. However, instead of:
$\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \alpha \nabla \ell(\boldsymbol{x}_t)$
as in gradient descent in real space, where $\ell$ is a loss function, Boosting is trained via gradient descent in functional space, where each step adds a weak classifier $h_t$ that plays the role of the negative gradient direction, so that:
$H_{t+1}(\boldsymbol{X}) = H_t(\boldsymbol{X}) + \alpha_t h_t(\boldsymbol{X})$

The question then becomes how to find the $\alpha_t$'s and the classifiers $h_t$. Before answering these questions, we should first get a geometric intuition for Boosting. The presentation in the following sections is based on these course notes.

# Geometric Intuition

In the example below we have a set of blue crosses and red circles we would like our ensemble of weak classifiers to correctly classify (panel “a”). Our weak classifier for this example will be a line, orthogonal to one of the axes, dividing the region of blue crosses from the region of red circles. Such a classifier can also be called a CART tree (Breiman et al., 1984) with a depth of 1, hereafter called a tree stump. For now, let’s assume all tree stumps will have the same weight $\alpha_t$ in the final ensemble.

The first tree stump, a horizontal divide in panel “b,” classified ten out of thirteen points correctly but misclassified the remaining three. Since it misclassified a few points, we would like the next classifier to correctly classify these points. To make sure that will be the case, boosting will increase the weight of the points that were misclassified earlier before training the new tree stump. The second tree stump, a vertical line in panel “c,” correctly classifies the two blue crosses that were originally misclassified, although it misclassifies three crosses that were originally correctly classified. For the third classifier, Boosting will now increase the weight of the three misclassified crosses at the bottom as well as of the other two misclassified crosses and the circle, because they are still not correctly classified. Technically speaking they are tied, since each of the two classifiers classifies them in a different way, but here we consider this a wrong classification. The third iteration will then prioritize correcting the high-weight points again, and will end up as the vertical line on the right of panel “d.” Now, all points are correctly classified.

There are different ways to mathematically approach Boosting. But before getting to Boosting, it is a good idea to go over gradient descent, which is a basis of Boosting. Following that, this post will cover, as an example, the AdaBoost algorithm, which assumes an exponential loss function.

# Minimizing a Loss Function

Gradient descent is a minimization algorithm (hence, descent). The goal is to move from an initial $x_0$ to the value of $x$ with the minimum value of $f(x)$, which in machine learning is a loss function, henceforth called $\ell(x)$. Gradient descent does that by taking one step $s$ at a time, starting from $x_0$, in the direction of the steepest downhill slope at location $x_t$, the value of $x$ at the $t^{th}$ iteration. This idea is formalized by the first-order Taylor series expansion below:
$\ell(x_{t + 1}) = \ell(x_t + s) \approx \ell(x_t) + g(x_t)^Ts$
where $g(x_t)=\nabla\ell(x_t)$. Furthermore, choosing the step against the gradient,
$s=-\alpha g(x_t)$
gives
$\ell\left[x_t -\alpha g(x_t)\right] \approx \ell(x_t)-\alpha g(x_t)^Tg(x_t)$
where $\alpha$, called the learning rate, must be positive and can be set as a fixed parameter. The dot product in the last term, $g(x_t)^Tg(x_t)$, is never negative, which means that the loss *should* always decrease. The reason for the italics is that too high a value of $\alpha$ may make the algorithm diverge, so small values around 0.1 are recommended.
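
The update rule can be sketched in a few lines of Python. The quadratic loss $(x-3)^2$, the starting point, and the number of steps are assumptions made only for this illustration.

```python
def gradient_descent(grad, x0, alpha=0.1, steps=100):
    """Repeatedly step against the gradient: x <- x - alpha * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x


def grad_loss(x):
    """Gradient of the illustrative loss (x - 3)^2, whose minimum is at x = 3."""
    return 2.0 * (x - 3.0)


x_min = gradient_descent(grad_loss, x0=0.0, alpha=0.1)            # approaches 3
x_div = gradient_descent(grad_loss, x0=0.0, alpha=1.1, steps=50)  # too large a step: diverges
```

With $\alpha = 0.1$ the distance to the minimum shrinks by a constant factor at every step, whereas with $\alpha = 1.1$ each step overshoots the minimum by more than the previous error, so the iterates move away from it.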

## Gradient Descent in Functional Space

What if $x$ is actually a function instead of a real number? This would mean that the loss function $\ell(\cdot)$ would be a function of a function, say $\ell(H(\boldsymbol{x}))$, instead of a function of a real number $x$. As mentioned, gradient descent in real space works by adding small quantities to $x_0$ to find the final $x_{min}$, which is an ensemble of small $\Delta x$'s added together. By analogy, gradient descent in functional space works by adding functions to an ever-growing ensemble of functions. Using the definition of a functional gradient, which is beyond the scope of this post, this leads us to:
$\ell(H+\alpha h) \approx \ell(H) + \alpha \langle\nabla \ell(H),h\rangle$
where H is an ensemble of functions, h is a single function, and the $\langle f, g\rangle$ notation denotes a dot product between f and g. Gradient descent in function space is an important component of Boosting. Based on that, the next section will talk about AdaBoost, a specific Boosting algorithm.

## Basic Definitions

The goal of AdaBoost is to find an ensemble function H of functions h, the weak classifiers, that minimizes the exponential loss function below for a binary classification problem:
$\ell(H)=\sum_{i=1}^ne^{-y(x_i)H(x_i)}$
where $(x_i, y(x_i))$ is the $i^{th}$ data point in the training set. The step size $\alpha$ can be interpreted as the weight of each classifier in the ensemble, which is optimized for each function h added to the ensemble. AdaBoost is an algorithm for binary classification, meaning the independent variables $\boldsymbol{x}$ have a corresponding vector of dependent variables $\boldsymbol{y}$, in which each entry $y(x_i) \in \{-1, 1\}$ is the classification of one point, with -1 and 1 representing the two classes to which a point may belong (say, -1 for red circles and 1 for blue crosses). The weak classifiers h in AdaBoost also return $h(x) \in \{-1, 1\}$.
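
As a small illustration of the loss above, the sketch below evaluates it for four points. The labels and the ensemble scores $H(x_i)$ are made-up numbers, not taken from the example in the panels.

```python
import math


def exponential_loss(y, scores):
    """Sum of exp(-y_i * H(x_i)) over all training points."""
    return sum(math.exp(-yi * si) for yi, si in zip(y, scores))


y = [1, 1, -1, -1]              # true labels
scores = [2.0, -0.5, -1.0, 0.3]  # hypothetical ensemble outputs H(x_i)

loss = exponential_loss(y, scores)
loss_at_zero = exponential_loss(y, [0.0] * len(y))  # empty ensemble: every point contributes 1
```

Points classified correctly with a large score (the first one) contribute almost nothing, while points on the wrong side of the decision (the second and fourth) contribute more than 1 each.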

## Setting the Weights of Each Classifier

The weight $\alpha_t$ of each weak classifier h can be found by performing the following minimization:
$\alpha=\arg\min_{\alpha}\ell(H+\alpha h)$
Since the loss function is defined as the summation of the exponential loss of each point in the training set, the minimization problem above can be expanded into:
$\alpha=\arg\min_{\alpha}\sum_{i=1}^ne^{-y(\boldsymbol{x}_i)\left[H(\boldsymbol{x}_i)+\alpha h(\boldsymbol{x}_i)\right]}$
Differentiating the loss w.r.t. $\alpha$, equating it to zero and performing a few steps of algebra leads to:
$\alpha = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$
where $\epsilon$ is the classification error of weak classifier $h_t$. The error $\epsilon$ can be calculated for AdaBoost as shown next.
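
The closed form can be checked numerically. The sketch below assumes the point weights are normalized to sum to 1, in which case the loss as a function of $\alpha$ is proportional to $(1-\epsilon)e^{-\alpha} + \epsilon e^{\alpha}$: correctly classified points carry total weight $1-\epsilon$ and are scaled by $e^{-\alpha}$, while misclassified points carry total weight $\epsilon$ and are scaled by $e^{\alpha}$. The value $\epsilon = 0.2$ and the search grid are arbitrary choices for illustration.

```python
import math


def classifier_weight(eps):
    """Closed-form weight alpha = 0.5 * ln((1 - eps) / eps)."""
    return 0.5 * math.log((1.0 - eps) / eps)


def loss_along_alpha(alpha, eps):
    """Loss as a function of alpha, up to a constant factor (normalized weights assumed)."""
    return (1.0 - eps) * math.exp(-alpha) + eps * math.exp(alpha)


eps = 0.2
alpha_closed = classifier_weight(eps)

# Brute-force search over a grid of candidate alphas in [0, 3].
grid = [i / 1000.0 for i in range(0, 3001)]
alpha_grid = min(grid, key=lambda a: loss_along_alpha(a, eps))
```

The grid search and the closed form agree up to the grid resolution. Note also that the weight grows as $\epsilon$ shrinks, and equals zero when $\epsilon = 1/2$, that is, when the weak classifier is no better than chance.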

## The Classification Error of one Weak Classifier

Following from gradient descent in functional space, the next classifier to add will be the one that minimizes the term $\langle\nabla \ell(H),h\rangle$, that is, the one that decreases the loss the fastest. If this term is zero, the zero-slope ensemble has been reached, which for a convex problem such as this one means that the minimum value of the loss function has been reached. Replacing the dot product by a summation, this minimization problem can be written as:
$h=\arg\min_h\epsilon=\arg\min_h\langle\nabla \ell(H),h\rangle$
$h=\arg\min_h\sum_{i=1}^n\frac{\partial e^{-y(\boldsymbol{x}_i)H(\boldsymbol{x}_i)}}{\partial H(\boldsymbol{x}_i)}h(\boldsymbol{x}_i)$
which after some algebra becomes:
$h=\arg\min_h\sum_{i:h(\boldsymbol{x}_i)\neq y(\boldsymbol{x}_i)}w_i$
(the summation of the weights of misclassified points), where the weight $w_i$ of point $i$ is proportional to $e^{-y(\boldsymbol{x}_i)H(\boldsymbol{x}_i)}$.

Comparing the last with the first expression, we have that the error $\epsilon$ for iteration t is simply the summation of the weights of the points misclassified by $h_t$. For example, in panel “b” the error would be the summation of the weights of the two crosses in the upper left corner and of the circle in the bottom right corner. Now let’s get to these weights.
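
A minimal sketch of this weighted error, and of picking the candidate with the lowest error, is shown below. The five points, their uniform weights, and the two candidate stumps are hypothetical and do not reproduce the panels.

```python
def weighted_error(h, X, y, w):
    """Sum of the weights of the points that h misclassifies."""
    return sum(wi for xi, yi, wi in zip(X, y, w) if h(xi) != yi)


# Hypothetical 2-D points, labels (1 = cross, -1 = circle) and uniform weights.
X = [(0.1, 0.9), (0.2, 0.8), (0.7, 0.6), (0.4, 0.2), (0.9, 0.1)]
y = [1, 1, -1, 1, -1]
w = [0.2] * 5

# Two candidate tree stumps: one horizontal and one vertical split.
candidates = {
    "horizontal at 0.5": lambda x: 1 if x[1] > 0.5 else -1,
    "vertical at 0.5": lambda x: 1 if x[0] < 0.5 else -1,
}

errors = {name: weighted_error(h, X, y, w) for name, h in candidates.items()}
best = min(errors, key=errors.get)  # stump with the lowest weighted error
```

Here the horizontal stump misclassifies two points, each with weight 0.2, while the vertical stump misclassifies none, so the vertical stump would be selected.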

## The Weights of the Points

There are multiple ways we can think of for setting the weights of each data point at each iteration. Schapire (1990) found a great way of doing so:
$\boldsymbol{w}_{t+1}=\frac{\boldsymbol{w}_{t}}{Z}e^{-\alpha_th_t(\boldsymbol{x})y(\boldsymbol{x})}$
where $\boldsymbol{w}$ is a vector containing the weights of each point and Z is a normalization factor that ensures the weights sum up to 1. Be sure not to confuse $\alpha_t$, the weight of classifier t, with the weights of the points at iteration t, represented by $\boldsymbol{w}_t$. For the weights to sum up to 1, Z needs to be the sum of their pre-normalization values, which is actually identical to the loss function, so
$Z=\sum_{i=1}^ne^{-y(\boldsymbol{x}_i)H(\boldsymbol{x}_i)}$
Using the definition of the error $\epsilon$, the update for Z can be shown to be:
$Z_{t+1}=Z_t\cdot2\sqrt{\epsilon(1-\epsilon)}$
so that the complete update is:
$w_{t+1}=w_t \frac{e^{-\alpha_th_t(\boldsymbol{x})y(\boldsymbol{x})}}{2\sqrt{\epsilon(1-\epsilon)}}$
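
The update can be verified on a toy case. In the sketch below the four points, their labels, and the predictions of the weak classifier are invented for illustration; the only point misclassified is the second one, so $\epsilon = 0.25$ under uniform initial weights.

```python
import math


def update_weights(w, y, predictions, alpha):
    """Scale each weight by exp(-alpha * h(x_i) * y_i), then renormalize to sum to 1.

    Returns the new weights and the normalization factor that was divided out.
    """
    unnormalized = [wi * math.exp(-alpha * hi * yi)
                    for wi, yi, hi in zip(w, y, predictions)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized], z


w = [0.25] * 4
y = [1, 1, -1, -1]
predictions = [1, -1, -1, -1]  # hypothetical stump output; the second point is wrong

eps = 0.25                                  # weight of the single misclassified point
alpha = 0.5 * math.log((1.0 - eps) / eps)   # classifier weight from the closed form
new_w, z = update_weights(w, y, predictions, alpha)
```

The normalization factor matches $2\sqrt{\epsilon(1-\epsilon)}$, and after the update the misclassified point alone carries half of the total weight, which is what forces the next weak classifier to pay attention to it.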

Below is pseudo-code for AdaBoost. Note that it can be used with any weak (high-bias) classifier. Again, shallow decision trees are a common choice for their simplicity and good performance.
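
The steps described in the previous sections can be put together as the following minimal Python sketch. It is an unoptimized illustration under a few assumptions of mine rather than a reference implementation: the weak classifier is a tree stump found by exhaustive search, training stops early if no stump beats chance, $\epsilon$ is clipped away from zero to keep the logarithm finite, and the one-dimensional toy dataset is invented for the example.

```python
import math


def stump_predict(x, feature, threshold, sign):
    """Tree stump: return `sign` if x[feature] > threshold, otherwise -sign."""
    return sign if x[feature] > threshold else -sign


def train_stump(X, y, w):
    """Exhaustively search for the axis-aligned stump with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        values = sorted(set(x[feature] for x in X))
        # One threshold below all points, plus the midpoints between neighbors.
        thresholds = [values[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(values, values[1:])]
        for threshold in thresholds:
            for sign in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(xi, feature, threshold, sign) != yi)
                if best is None or err < best[0]:
                    best = (err, feature, threshold, sign)
    return best


def adaboost(X, y, n_rounds):
    """Return a list of (alpha, feature, threshold, sign) tuples, one per round."""
    n = len(X)
    w = [1.0 / n] * n  # all point weights start at 1/n
    ensemble = []
    for _ in range(n_rounds):
        eps, feature, threshold, sign = train_stump(X, y, w)
        if eps >= 0.5:  # no stump is better than chance: stop
            break
        eps = max(eps, 1e-10)  # assumption: clip to avoid log of infinity for a perfect stump
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        ensemble.append((alpha, feature, threshold, sign))
        # Increase the weights of misclassified points, decrease the others.
        w = [wi * math.exp(-alpha * stump_predict(xi, feature, threshold, sign) * yi)
             for xi, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble


def predict(ensemble, x):
    """Sign of the weighted vote H(x); a score of exactly 0 is assigned to +1."""
    score = sum(alpha * stump_predict(x, f, t, s) for alpha, f, t, s in ensemble)
    return 1 if score >= 0 else -1


# Toy 1-D dataset that no single stump can classify perfectly.
X = [(float(i),) for i in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost(X, y, n_rounds=3)
mistakes = sum(predict(ensemble, xi) != yi for xi, yi in zip(X, y))
```

On this toy dataset a single stump misclassifies three points at best, while the three-stump ensemble classifies all ten correctly, mirroring the three-panel example from the geometric intuition section.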

## Boosting Will Converge, and Fast

One fascinating fact about boosting is that any boosted algorithm is guaranteed to converge independently of which weak classifier is chosen. Recall that Z, the normalization factor for the point weights update, equals the loss function. That being the case, we get the following relation:
$\ell(H)=Z=n\prod_{t=1}^T2\sqrt{\epsilon_t(1-\epsilon_t)}$
where n is the normalizing factor for all weights at step 0 (all of the weights are initially set to 1/n). To derive an expression for the upper bound of the error, let’s assume that the errors at all steps t equal their highest value, $\epsilon_{max}$. We have that:
$\ell(H)\leq n\left[2\sqrt{\epsilon_{max}(1-\epsilon_{max})}\right]^T$
Given that necessarily $\epsilon_{max} < \frac{1}{2}$ (every weak classifier is better than chance), we have that
$\epsilon_{max}(1-\epsilon_{max})<\frac{1}{4}$
or, equivalently,
$\epsilon_{max}(1-\epsilon_{max})=\frac{1}{4}-\gamma^2$
with $\gamma = \frac{1}{2}-\epsilon_{max}$ in the interval $\left(0,\frac{1}{2}\right]$. Therefore, replacing the equation above in the first loss inequality written as a function of $\epsilon_{max}$, we have that:
$\ell(H)\leq n(1-4\gamma^2)^{T/2}$
which means that the training error is bounded by an exponential decay as you add classifiers to the ensemble. This is a fantastic result and applies to any boosted algorithm!
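
The bound is easy to check numerically. In the sketch below the number of points and the per-round errors are hypothetical values; $\gamma$ is computed as $\frac{1}{2}-\epsilon_{max}$, which is one choice that satisfies $\epsilon_{max}(1-\epsilon_{max})=\frac{1}{4}-\gamma^2$.

```python
import math


def loss_product(n, errors):
    """Loss after all rounds: n * prod_t 2 * sqrt(eps_t * (1 - eps_t))."""
    value = float(n)
    for eps in errors:
        value *= 2.0 * math.sqrt(eps * (1.0 - eps))
    return value


def loss_bound(n, gamma, rounds):
    """Upper bound n * (1 - 4 * gamma^2)^(rounds / 2)."""
    return n * (1.0 - 4.0 * gamma ** 2) ** (rounds / 2.0)


n = 10
errors = [0.30, 0.21, 0.18]  # hypothetical weighted errors of three rounds
gamma = 0.5 - max(errors)    # distance of the worst round from chance

exact = loss_product(n, errors)
bound = loss_bound(n, gamma, len(errors))
bound_after_200 = loss_bound(n, gamma, 200)
```

The exact product stays below the bound, and the bound itself shrinks geometrically with the number of rounds as long as $\gamma$ stays away from zero.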

## (Next to) No Overfitting

Lastly, Boosted algorithms are remarkably resistant to overfitting. According to Murphy (2012), a possible reason is that Boosting can be seen as a form of $\ell_1$ regularization, which tends to eliminate irrelevant features and thus reduce overfitting. Another explanation is related to the concept of margins: at least certain boosting algorithms keep pushing points away from the decision boundary, so that a point is classified by a certain margin, which also helps prevent overfitting.

# Final Remarks

In this post, I presented the general idea of boosting a weak classifier, emphasizing its use with shallow CART trees, and used the AdaBoost algorithm as an example. However, other loss functions can be used and boosting can also be used for non-binary classifiers and for regression. The Python package scikit-learn in fact allows the user to use boosting with different loss functions and with different weak classifiers.

Despite the theoretical arguments for why Boosting resists overfitting, researchers running it for extremely long times on rather big supercomputers found that at some point it starts to overfit, although still very slowly. Still, that is not likely to happen in your application. Lastly, Boosting with shallow decision trees is also a great way to have fine control over how much bias there will be in your model, as all you need to do for that is to choose the number of iterations T.

# Bibliography

Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees (Monterey, California: Wadsworth).
Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective (Cambridge, Massachusetts: MIT Press).
Schapire, R. E. (1990). The strength of weak learnability. Machine learning, 5(2), 197-227.