From MATLAB to Julia: Insights from Translating an Opensource Kirsch-Nowak Streamflow Generator to Julia

A quick look into translating code: speed comparisons, practicality, and comments

As I am becoming more and more familiar with Julia—an open-source programming language—I’ve been attracted to translate code to not only run it on an opensource and free language but also to test its performance. Since Julia was made to be an open source language made to handle matrix operations efficiently (when compared to other high-level opensource languages), finding a problem to utilize these performance advantages only makes sense.

As with any new language, understanding how well it performs relative to the other potential tools in your toolbox is vital. As such, I decided to use a problem that is easily scalable and can be directly compare the performances of MATLAB and Julia—the Kirsch-Nowak synthetic stationary streamflow generator.

So, in an effort to sharpen my understanding of the Kirsch-Nowak synthetic stationary streamflow generator created by Matteo GiulianiJon Herman and Julianne Quinn, I decided to take on this project of converting from this generator from MATLAB. This specific generator takes in historical streamflow data from multiple sites (while assuming stationarity) and returns a synthetically generated daily timeseries of streamflow. For a great background on synthetic streamflow generation, please refer to this post by Jon Lamontagne.

Model Description

The example is borrowed from Julie’s code utilizes data from the Susquehanna River flows (cfs) at both Marietta (USGS station 01576000) and Muddy Run along with lateral inflows (cfs) between Marietta and Conowingo Damn (1932-2001). Additionally, evaporation rates (in/day) over the Conowingo and Muddy Run Dams (from an OASIS model simulation) utilized. The generator developed by Kirsch et al. (2013) utilizes a Cholesky decomposition to create a monthly synthetic record which preserves the autocorrelation structure of the historical data. The method proposed by Nowak et al. (2010) is then used to disaggregate to daily flows (using a historical month +/- 7 days). A full description of the methods can be found at this link.

Comparing Julia and MATLAB

Comparison of Performance between Julia and MATLAB

To compare the speeds of each language, I adapted the MATLAB code into Julia (shown here) on as nearly of equal basis as possible. I attempted to keep the loops, data structures, and function formulation as similar as possible, even calling similar libraries for any given function. julia_matlab_streamflow

When examining the performance between Julia (solid lines) and MATLAB (dashed lines), there is only one instance where MATLAB(x) outperformed Julia(+)—in the 10-realization, 1000-year simulation shown in the yellow dots in the upper left. Needless to say, Julia easily outperformed MATLAB in all other situations and required only 53% of the time on average (all simulations considered equal). However, Julia was much proportionally faster at lower dimensions of years (17-35% of the time required) than MATLAB. This is likely because I did not handle arrays optimally—the code could likely be sped up even more.

Considerations for Speeding Up Code

Row- Versus Column-Major Array Architecture

It is worth knowing how a specific language processes its arrays/matrices. MATLAB and Julia are both column-major languages, meaning the sequential indexes and memory paths are grouped by descending down row by row through a column then going through the next column. On the other hand, Numpy in Python specifically uses row-major architecture. The Wikipedia article on this is brief but well worthwhile for understanding these quirks.

This is especially notable because ensuring that proper indexing and looping methods are followed can substantially speed up code. In fact, it is likely that the reason Julia slowed down significantly on a 10-realization 1000-year simulation when compared to both its previous performances and MATLAB because of how the arrays were looped through. As a direct example shown below, when exponentiating through a [20000, 20000] array row-by-row took approximately 47.7 seconds while doing the same operation column-by-column only took 12.7 seconds.

5555

Dealing with Arrays

Simply put, arrays and matrices in Julia are a pain compared to MATLAB. As an example of the bad and the ugly, unlike in MATLAB where you can directly declare any size array you wish to work with, you must first create an array and then fill the array with individual array in Julia. This is shown below where an array of arrays is initialized below.  However, once an array is established, Julia is extremely fast in loops, so dealing with filling a previously established array makes for a much faster experience.

# initialize output
qq = Array{Array}(undef, num_sites) #(4, 100, 1200)

for i = 1:num_sites
     qq[i] = Array{Float64}(undef, nR, nY * 12)
end

Once the plus side  when creating arrays, Julia is extremely powerful in its ability to assign variable types to the components of a given array. This can drastically speed up your code during the day. Shown below, it is easy to the range of declarations and assignments being made to populate the array. There’s an easy example of declaring an array with zeros, and another where we’re populating an array using slices of another. Note the indexing structure for Qd_cg in the second loop–it is not technically a 3-D array but rather a 2-D array nested within a 1-D array–showing the issues mentioned prior.

delta = zeros(n_totals)
for i = 1:n_totals
     for j = 1:n_sites
          delta[i] += (Qtotals[month][j][i] - Z[j]) ^ 2
     end
end

q_ = Array{Float64, 2}(undef, num_realizations[k], 365 * num_years[k])
for i = 1: Nsites
     # put into array of [realizations, 365*num_yrs]
     for j = 1: num_realizations[k]
          q_[j, :] = Qd_cg[j][:, i]'
     end
end

Code Profiling: Order of Experiments

An interesting observation I’ve noticed is that Julia’s first run on a given block of code is substantially slower than every other attempt. Thus, it is likely worthwhile to run a smaller-scale array through to initialize the code if there are plans to move on to substantially more expensive operations (i.e. scaling up).

In the example below, we can see that the second iteration of the same exact code was over 10% faster when calling it a second time. However, when running the code without the function wrapper (in the original timed runs), the code was 10% faster (177 seconds) than the second sequential run shown below. This points to the importance of profiling and experimenting with sections of your code.

4444

Basic profiling tools are directly built into Julia, as shown in the Julia profiling documentation. This can be visualized easily using the ProfileView library. The Juno IDE (standard with Julia Pro) allegedly has a good built-in profile as well. However, it should be expected that most any IDE should do the trick (links to IDEs can be found here).

Syntax and Library Depreciation

While Julia is very similar in its structure and language to MATLAB, much of the similar language has depreciated as Julia has been rapidly upgraded. Notably, Julia released V1.0 in late 2018 and recently released V1.1, moving further away from similarities in function names. Thus, this stands as a lesson for individuals wishing to translate all of their code between these languages. I found a useful website that assists in translating general syntax, but many of the functions have depreciated. However, as someone who didn’t have any experience with MATLAB but was vaguely familiar with Julia, this was a godsend for learning differences in coding styles.

For example, creating an identity matrix in MATLAB utilizes the function eye(size(R)) to create an nxn matrix the size of R. While this was initially the language used in Julia, this specific language was depreciated in V0.7. To get around this, either ‘I’ can be used to create a scalable identity matrix or Matrix{Float64}(I, size(R), size(R)) declare an identity matrix of size(R) by size(R) for a more foolproof and faster operation.

When declaring functions, I have found Julia to be relatively straightforward and Pythonic in its declarations. While I still look to insert colons at the ends of declarations while forgetting to add ‘end’ at the end of functions, loops, and more, the ease of creating, calling, and interacting with functions makes Julia very accessible.  Furthermore, its ability to interact with matrices in without special libraries (e.g. Numpy in Python) allows for more efficient coding without having to know specific library notation.

Debugging Drawbacks

One of the most significant drawbacks I run into when using Julia is the lack of clarity in generated error codes for common mistakes, such as adding extra brackets. For example, the following error code is generated in Python when adding an extra parenthesis at the end of an expression.3333

However, Julia produces the follow error for an identical mistake:

2222

One simple solution to this is to simply upgrade my development environment from Jupyter Notebooks to a general IDE to more easily root out issues by running code line-by-line. However, I see the lack of clarity in showing where specific errors arise a significant drawback to development within Julia. However, as shown in the example below where an array has gone awry, an IDE (such as Atom shown below) can make troubleshooting and debugging a relative breeze.

1111

Furthermore, when editing auxiliary functions in another file or module that was loaded as a library, Julia is not kind enough to simply reload and recompile the module; to get it to properly work in Atom, I had to shut down the Julia kernel then rerun the entirety of the code. Since Julia takes a bit to initially load and compile libraries and code, this slows down the debugging process substantially. There is a specific package (Revise) that exists to take care of this issue, but it is not standard and requires loading this specific library into your code.

GitHub Repositories: Streamflow Generators

PyMFGM: A parallelized Python version of the code, written by Bernardo Trindade

Kirsch-Nowak Stationary Generator in MATLAB: Developed by Matteo GiulianiJon Herman and Julianne Quinn

Kirsch-Nowak Stationary Generator in Julia: Please note that the results are not validated. However, you can easily access the Jupyter Notebook version to play around with the code in addition to running the code from your terminal using the main.jl script.

Full Kirsch-Nowak Streamflow Generator: Also developed by Matteo GiulianiJon Herman and Julianne Quinn and can handle rescaling flows for changes due to monsoons. I would highly suggest diving into this code alongside the relevant blog posts: Part 1 (explanation), Part 2 (validation).

Interactive visualizations of high-dimensional data using J3

Project Platypus is a repository that supports multiple Python libraries for multi-objective optimization, scenario discovery, and data analysis. Past blogposts have already demonstrated the Rhodium [1, 2] and Platypus [3] libraries. The aim of this post is to demonstrate the capabilities of J3 and its implementation within Project Platypus, through the Python module J3Py. J3 is an open source, cross-platform Java application for producing and sharing high-dimensional, interactive scientific visualizations. It can be used within Project Platypus, through J3Py, which is a Python module that allows us to call J3 within Python scripts. This blogpost will look into a simple system I’ve been working on and use the Rhodium library to generate management alternatives. I’ll then show how J3 can be used to explore the tradeoffs in the alternatives generated and aid in the negotiated selection of alternatives.

First thing to do is load the necessary libraries:

import numpy as np # This is a library required by the model
import itertools # This is a library required by the model
from rhodium import * # This is the library needed to use Rhodium
from j3 import J3 # This is the library we'll be using to visualize solutions

We then need to define the model function, it’s a bit long and not immediately pertinent to the blogpost so I’ll put it at the bottom so readers don’t have to scroll through it. The optimization will be performed using Rhodium and it’s set up like so:

model = Model(fish_game)

model.parameters = [Parameter("vars"),
                    Parameter("a"),
                    Parameter("b"),
                    Parameter("c"),
                    Parameter("d"),
                    Parameter("h"),
                    Parameter("K"),
                    Parameter("m"),
                    Parameter("sigmaX"),
                    Parameter("sigmaY")]

model.responses = [Response("NPV", Response.MAXIMIZE),
                   Response("PreyDeficit", Response.MINIMIZE),
                   Response("ConsLowHarvest", Response.MINIMIZE),
                   Response("WorstHarvest", Response.MAXIMIZE),
                   Response("PredatorExtinction", Response.INFO)]

model.constraints = [Constraint("PredatorExtinction < 1")]

model.levers = [RealLever("vars", 0.0, 1.0, length = 6)]

output = optimize(model, "NSGAII", 1000)

As Julie has covered Rhodium already, I won’t go into the details here, but it’s pretty intuitive that we first declare what the model is, input parameters, responses (i.e. objectives and constraints), and decision variables. Instead, I’ll focus on analyzing the output (the candidate solutions found) using J3. “output” here is a “DataSet” object of the Rhodium module and contains the decision variables, and objective performance of the solutions identified. There is also a constraint (“PredatorExtinction”) which is always zero in all the solutions, and I will not be visualizing here. I will not edit or change anything on my screen before screen-grabbing, to demonstrate how truly simple and easy it is to use. To call the J3 environment run:

J3(output.as_dataframe(['NPV', 'PreyDeficit', 'ConsLowHarvest', 'WorstHarvest']))

This produces a window with a 3D scatterplot of three of our objectives in the x, y, and z axes and the fourth used as the color.

Full resolution here

I’d like to start examining what my results look like so I’ll make this larger and rotate it a bit.

Full resolution here

I’d like to also change how my objectives are displayed. So I’ll change the orientation of the axes and the objective used for the color.

Full resolution here

The rainbow color-scheme is not really my aesthetic, so let’s change that also.

Full resolution here

A couple things we can see from this plot: there is a strong tradeoff between the NPV objective and the prey deficit, as well as between the prey deficit and the Worst Harvest. We can examine these pairs of tradeoffs more explicitly, by pulling the axes’ planes out and projecting the values on the 2D surfaces:

Full resolution here

We can also examine the tradeoffs using a parallel axis plot:

Full resolution here

We can also move the axes in the parallel axis plot:

Full resolution here

Having these multiple views, we can highlight and examine particular solutions and see how they compare with others, as well as get more detailed information:

Full resolution here and here

The final feature I’d like to showcase is solution brushing, which can facilitate in the negotiation of solutions process. Brushing allows decision makers to set limits on what the believe is acceptable or unacceptable performance (e.g. “I cannot accept costs above X amount”). It also allows decision makers to more closely examine where potential tensions might arise. If, for example, one negotiating party sets their bar too high, all remaining solutions might be unacceptable by the other decision making parties. Tools like brushing make this process more transparent and straightforward.

Full resolution here

The model function used in the example is posted below. I would also like to mention ScreenToGif, which is the tool I used to produce these GIFs and it’s been super easy to download and start using. Great product.

nRBF = 2 # no. of RBFs to use
nIn = 1 # no. of inputs (depending on selected strategy)
nOut = 1 # no. of outputs (depending on selected strategy)

N = 100 # Number of realizations of environmental stochasticity

tSteps = 100 # no. of timesteps to run the fish game on

# Define problem to be solved
def fish_game(vars, # contains all C, R, W for RBF policy
              a = 0.005, # rate at which the prey is available to the predator
              b = 0.5, # prey growth rate
              c = 0.5, # rate with which consumed prey is converted to predator abundance
              d = 0.1, # predator death rate
              h = 0.1, # handling time (time each predator needs to consume the caught prey)
              K = 2000, # prey carrying capacity given its environmental conditions
              m = 0.7, # predator interference parameter
              sigmaX = 0.004, # variance of stochastic noise in prey population
              sigmaY = 0.004): # variance of stochastic noise of predator population

    x = np.zeros(tSteps+1) # Create prey population array
    y = np.zeros(tSteps+1) # Create predator population array
    z = np.zeros(tSteps+1) # Create harvest array

    # Create array to store harvest for all realizations
    harvest = np.zeros([N,tSteps+1])
    # Create array to store effort for all realizations
    effort = np.zeros([N,tSteps+1])
    # Create array to store prey for all realizations
    prey = np.zeros([N,tSteps+1])
    # Create array to store predator for all realizations
    predator = np.zeros([N,tSteps+1])
    
    # Create array to store metrics per realization
    NPV = np.zeros(N)
    cons_low_harv = np.zeros(N)
    harv_1st_pc = np.zeros(N)    
    
    # Create array with environmental stochasticity for prey
    epsilon_prey = np.random.normal(0.0, sigmaX, N)
    
    # Create array with environmental stochasticity for predator
    epsilon_predator = np.random.normal(0.0, sigmaY, N)
    
    #Set policy input and output ranges
    input_ranges = [[0, K]] # Prey pop. range to use for normalization
    output_ranges = [[0, 1]] # Range to de-normalize harvest to

    # Go through N possible realizations
    for i in range(N):
        # Initialize populations and values
        x[0] = prey[i,0] = K
        y[0] = predator[i,0] = 250
        z[0] = effort[i,0] = hrvSTR([x[0]], vars, input_ranges, output_ranges)
        NPVharvest = harvest[i,0] = effort[i,0]*x[0]        
        # Go through all timesteps for prey, predator, and harvest
        for t in range(tSteps):
            if x[t] > 0 and y[t] > 0:
                x[t+1] = (x[t] + b*x[t]*(1-x[t]/K) - (a*x[t]*y[t])/(np.power(y[t],m)+a*h*x[t]) - z[t]*x[t])* np.exp(epsilon_prey[i]) # Prey growth equation
                y[t+1] = (y[t] + c*a*x[t]*y[t]/(np.power(y[t],m)+a*h*x[t]) - d*y[t]) *np.exp(epsilon_predator[i]) # Predator growth equation
                if t <= tSteps-1:
                    z[t+1] = hrvSTR([x[t]], vars, input_ranges, output_ranges)
            prey[i,t+1] = x[t+1]
            predator[i,t+1] = y[t+1]
            effort[i,t+1] = z[t+1]
            harvest[i,t+1] = z[t+1]*x[t+1]
            NPVharvest = NPVharvest + harvest[i,t+1]*(1+0.05)**(-(t+1))
        NPV[i] = NPVharvest
        low_hrv = [harvest[i,j]<prey[i,j]/20 for j in range(len(harvest[i,:]))] # Returns a list of True values when there's harvest below 5%
        count = [ sum( 1 for _ in group ) for key, group in itertools.groupby( low_hrv ) if key ] # Counts groups of True values in a row
        if count: # Checks if theres at least one count (if not, np.max won't work on empty list)
            cons_low_harv[i] = np.max(count)  # Finds the largest number of consecutive low harvests
        else:
            cons_low_harv[i] = 0
        harv_1st_pc[i] = np.percentile(harvest[i,:],1)
    
    return (np.mean(NPV), # Mean NPV for all realizations
            np.mean((K-prey)/K), # Mean prey deficit
            np.mean(cons_low_harv), # Mean worst case of consecutive low harvest across realizations
            np.mean(harv_1st_pc), # 5th percentile of all harvests
            np.mean((predator < 1).sum(axis=1))) # Mean number of predator extinction days per realization

# Calculate outputs (u) corresponding to each sample of inputs
# u is a 2-D matrix with nOut columns (1 for each output)
# and as many rows as there are samples of inputs
def hrvSTR(Inputs, vars, input_ranges, output_ranges):
    # Rearrange decision variables into C, R, and W arrays
    # C and R are nIn x nRBF and W is nOut x nRBF
    # Decision variables are arranged in 'vars' as nRBF consecutive
    # sets of {nIn pairs of {C, R} followed by nOut Ws}
    # E.g. for nRBF = 2, nIn = 3 and nOut = 4:
    # C, R, C, R, C, R, W, W, W, W, C, R, C, R, C, R, W, W, W, W
    C = np.zeros([nIn,nRBF])
    R = np.zeros([nIn,nRBF])
    W = np.zeros([nOut,nRBF])
    for n in range(nRBF):
        for m in range(nIn):
            C[m,n] = vars[(2*nIn+nOut)*n + 2*m]
            R[m,n] = vars[(2*nIn+nOut)*n + 2*m + 1]
        for k in range(nOut):
            W[k,n] = vars[(2*nIn+nOut)*n + 2*nIn + k]

    # Normalize weights to sum to 1 across the RBFs (each row of W should sum to 1)
    totals = np.sum(W,1)
    for k in range(nOut):
        if totals[k] > 0:
            W[k,:] = W[k,:]/totals[k]
    # Normalize inputs
    norm_in = np.zeros(nIn)
    for m in range (nIn):
        norm_in[m] = (Inputs[m]-input_ranges[m][0])/(input_ranges[m][1]-input_ranges[m][0])
    # Create array to store outputs
    u = np.zeros(nOut)
    # Calculate RBFs
    for k in range(nOut):
        for n in range(nRBF):
            BF = 0
            for m in range(nIn):
                if R[m,n] > 10**-6: # set so as to avoid division by 0
                    BF = BF + ((norm_in[m]-C[m,n])/R[m,n])**2
                else:
                    BF = BF + ((norm_in[m]-C[m,n])/(10**-6))**2
            u[k] = u[k] + W[k,n]*np.exp(-BF)
    # De-normalize outputs
    norm_u = np.zeros(nOut)
    for k in range(nOut):
        norm_u[k] = output_ranges[k][0] + u[k]*(output_ranges[k][1]-output_ranges[k][0])
    return norm_u

Intro to Boosting

Introduction

Once upon a time, in a machine learning class project, student Michael Kearns asked if it is possible to convert a weak classifier (high bias classifier whose outputs are correct only slightly over 50% of the time) into a classifier of arbitrary accuracy by using an ensemble of such classifiers. This question was posed in 1988 and, two years later, in 1990, Robert Schapire answered that it is possible (Schapire, 1990). And so boosting was born.

The idea of boosting is to train an ensemble of weak classifiers, each of which an expert in a specific region of the space. The ensemble of classifiers has the form below:
H(\vec{x}) = \sum_{t=1}^T \alpha_th_t(\vec{x})
where H is the ensemble of classifiers, \alpha_t is the weight assigned to weak classifier h_t around samples \vec{x} at iteration t, and T is the number of classifiers in the ensemble.

Boosting creates such an ensemble in a similar fashion to gradient descent. However, instead of:
\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \alpha \nabla \ell(\vec{x}_t)
as in gradient descent in real space, where \ell is a loss function, Boosting is trained via gradient descent in functional space, so that:
H_{t+1}(\boldsymbol{X}) = H_t(\boldsymbol{X}) + \alpha_t \nabla \ell(h_t(\boldsymbol{X}))

The question then becomes how to find the \alpha_t‘s and the classifiers h_t. Before answering these questions, we should get a geometric intuition for Boosting first. The presentation in the following sections were based on the presentation on these course notes.

Geometric Intuition

In the example below we have a set of blue crosses and red circles we would like our ensemble or weak classifiers to correctly classify (panel “a”). Our weak classifier for this example will be a line, orthogonal to either axes, dividing blue crosses from red circles regions. Such classifier can also be called a CART tree (Breiman et al., 1984) with depth of 1 — hereafter called a tree stump. For now, let’s assume all tree stumps will have the same weight \alpha_t in the final ensemble.

Boosting.png

The first tree stump, a horizontal divide in panel “b,” classified ten out of thirteen points correctly but failed to classify the remaining three. Since it incorrectly classified a few points in the last attempt, we would like the next classifier correctly classify these points. To make sure that will be the case, boosting will increase the weight of the points that were misclassified earlier before training the new tree stump. The second tree stump, a vertical line on panel “c,” correctly classifies the two blue crosses that were originally incorrectly classified, although it incorrectly three crosses that were originally correctly classified. For the third classifier, Boosting will now increase the weight of the three bottom misclassified crosses as well of the other misclassified two crosses and circle because they are still not correctly classified — technically speaking they are tied, each of the two classifiers classifies them in a different way, but we are here considering this a wrong classification. The third iteration will then prioritize correcting the high-weight points again, and will end up as the vertical line on the right of panel “d.” Now, all points are correctly classified.

There are different ways to mathematically approach Boosting. But before getting to Boosting, it is a good idea to go over gradient descent, which is a basis of Boosting. Following that, this post will cover, as an example, the AdaBoost algorithm, which assumes an exponential loss function.

Minimizing a Loss Function

Standard gradient descent

Before getting into Boosting per se, it is worth going over the standard gradient descent algorithm. Gradient descent is a minimization algorithm (hence, descent). The goal is to move from an initial x_0 to the value of x with the minimum value of f(x), which in machine learning is a loss function, henceforth called \ell(x). Gradient descent does that by moving one step of length s at a time starting from x^0 in the direction of the steeped downhill slope at location x_t, the value of x at the t^{th} iteration. This idea is formalized by the Taylor series expansion below:
\ell(x_{t + 1}) = \ell(x_t + s) \approx \ell(x_t) - sg(x_t)
where g(x_t)=\nabla\ell(x_t). Furthermore,
s=\alpha g(x_t)
which leads to:
\ell\left[x_{t+1} -\alpha g(x_t)\right] \approx \ell(x^{t})-\alpha g(x_t)^Tg(x_t)
where \alpha, called the learning rate, must be positive and can be set as a fixed parameter. The dot product on the last term g(x_t)^Tg(x_t) will also always be positive, which means that the loss should always decrease — the reason for the italics is that too high values for \alpha may make the algorithm diverge, so small values around 0.1 are recommended.

Gradient Descent in Functional Space

What if x is actually a function instead of a real number? This would mean that the loss function \ell(\cdot) would be a function of a function, say \ell(H(\boldsymbol{x})) instead of a real number x. As mentioned, Gradient Descent in real space works by adding small quantities to x_0 to find the final x_{min}, which is an ensemble of small \Delta x‘s added together. By analogy, gradient descent in functional space works by adding functions to an ever growing ensemble of functions. Using the definition of a functional gradient, which is beyond the scope of the this post, this leads us to:
\ell(H+\alpha h) \approx \ell(H) + \alpha \textless\nabla \ell(H),h\textgreater
where H is an ensemble of functions, h is a single function, and the \textless f, g\textgreater notation denotes a dot product between f and g. Gradient descent in function space is an important component of Boosting. Based on that, the next section will talk about AdaBoost, a specific Boosting algorithm.

AdaBoost

Basic Definitions

The goal of AdaBoost is to find an ensemble function H of functions h, a weak classifier, that minimize an exponential loss function below for a binary classification problem:
\ell(H)=\sum_{i=1}^ne^{-y(x_i)H(x_i)}
where x_i, y(x_i) is the i^{th} data point in the training set. The step size \alpha can be interpreted as the weight of each classifier in the ensemble, which optimized for each function h added to the ensemble. AdaBoost is an algorithm for binary classification, meaning the independent variables \boldsymbol{x} have corresponding vector of dependent variables \boldsymbol{y}(x_i), in which each y(x_i) \in \{-1, 1\} is a vector with the classification of each point, with -1 and 1 representing the two classes to which a point may belong (say, -1 for red circles and 1 for blue crosses). The weak classifiers h in AdaBoost also return h(x) \in \{-1, 1\}.

Setting the Weights of Each Classifier

The weight \alpha_t of each weak classifier h can be found by performing the following minimization:
\alpha=argmin_{\alpha}\ell(H+\alpha h)
Since the loss function is defined as the summation of the exponential loss of each point in the training set, the minimization problem above can be expanded into:
\alpha=argmin_{\alpha}\sum_{i=1}^ne^{y(\boldsymbol{x}_i)\left[H(\boldsymbol{x}_i)+\alpha h(\boldsymbol{x}_i)\right]}
Differentiating the error w.r.t. \alpha, equating it with zero and performing a few steps of algebra leads to:
\alpha = \frac{1}{2}ln\frac{1-\epsilon}{\epsilon}
where \epsilon is the classification error of weak classifier h_t. The error \epsilon can be calculated for AdaBoost as shown next.

The Classification Error of one Weak Classifier

Following from gradient descent in functional space, the next best classifier h_{t+1} will be the one that minimizes the term \textless\nabla \ell(H),h\textgreater, which when zero would mean that the zero-slope ensemble has been reached, denoting the minimum value of the loss function has been reached for a convex problem such as this. Replacing the dot product by a summation, this minimization problem can be written as:
h(\boldsymbol{x}_i)=argmin_h\epsilon=argmin_h\textless\nabla \ell(H),h\textgreater
h(\boldsymbol{x}_i)=argmin_h\sum_{i=1}^n\frac{\partial e^{-y(\boldsymbol{x}_i)H(\boldsymbol{x}_i)}}{\partial H(\boldsymbol{x}_i)}h(\boldsymbol{x}_i)
which after some algebra becomes:
h(\boldsymbol{x}_i)=argmin_h\sum_{i:h(\boldsymbol{x}_i)\neq y(\boldsymbol{x}_i)}w_i
(the summation of the weights of misclassified points)

Comparing the last with the first expression, we have that the error \epsilon for iteration t is simply the summation of the weights of the points misclassified by h(\boldsymbol{x}_i)_t — e.g., in panel “b” the error would be summation the of the weights of the two crosses on the upper left corner and of the circle at the bottom right corner. Now let’s get to these weights.

The Weights of the Points

There are multiple ways we can think of for setting the weights of each data point at each iteration. Schapire (1990) found a great way of doing so:
\boldsymbol{w}_{t+1}=\frac{\boldsymbol{w}^{t}}{Z}e^{-\alpha_th_t(\boldsymbol{x})y(\boldsymbol{x})}
where \boldsymbol{w} is a vector containing the weights of each point, Z is a normalization factor to ensure the weights will sum up to 1. Be sure not to confuse \alpha_t, the weight of classifier t, with the weight of the points at iteration t, represented by \boldsymbol{w}_t. For the weights to sum up to 1, Z needs to be the sum of their pre-normalization values, which is actually identical to the loss function, so
Z=\sum_{i=1}^ne^{-y(\boldsymbol{x}_i)H(\boldsymbol{x}_i)}
Using the definition of the error \epsilon, the update for Z can be shown to be:
Z_{t+1}=Z_t\cdot2\sqrt{\epsilon(1-\epsilon)}
so that the complete update is:
w_{t+1}=w_t \frac{e^{-\alpha_th_t(\boldsymbol{x})y(\boldsymbol{x})}}{2\sqrt{\epsilon(1-\epsilon)}}
Now AdaBoost is ready for implementation following the pseudo-code below.

AdaBoost Pseudo-code

Below is a pseudo-code of AdaBoost. Note that it can be used with any weak learner (high bias) classifier. Again, shallow decision trees are a common choice for their simplicity and good performance.Boosting_algorithm.png

Boosting Will Converge, and Fast

One fascinating fact about boosting is that any boosted algorithms is guaranteed to converge independently of which weak classified is chosen. Recall that Z, the normalization factor for the point weights update, equals the loss function. That being the case, we get the following relation:
\ell(H)=Z=n\prod_{t=1}^T2\sqrt{\epsilon_t(1-\epsilon_ t}
where n is the normalizing factor for all weights at step 0 (all of the weights are initially set to 1/n). To derive an expression for the upper bound of the error, let’s assume that the errors at all steps t equal their highest value, \epsilon_{max}. We have that:
\ell(H)\leq n\left[2\sqrt{\epsilon_{max}(1-\epsilon_{max})}\right]^T
Given that necessarily \epsilon_{max} \leq \frac{1}{2}, we have that
\epsilon_{max}(1-\epsilon_{max})<\frac{1}{4}
or
\epsilon_{max}(1-\epsilon_{max})=\frac{1}{4}-\gamma^2
for any \gamma in the interval \left[-\frac{1}{2},\frac{1}{2}\right]. Therefore, replacing the equation above in the first loss inequality written as a function of \epsilon_{max}, we have that:
\ell(H)\leq n(1-4\gamma^2)^{T/2}
which means that the training error is bound by an exponential decay as you add classifiers to the ensemble. This is a fantastic result and applies to any boosted algorithm!

(Next to) No Overfitting

Lastly, Boosted algorithms are remarkably resistant to overfitting. According to Murphy (2012), a possible reason is that Boosting can be seen as a form of \ell_1 regularization, which is prone to eliminate irrelevant features and thus reduce overfitting. Another explanation is related to the concept of margins, so that at least certain boosting algorithms force a classification on a point only if possible by a certain margin, thus also preventing overfitting.

Final Remarks

In this post, I presented the general idea of boosting a weak classifier, emphasizing its use with shallow CART trees, and used the AdaBoost algorithm as an example. However, other loss functions can be used and boosting can also be used for non-binary classifiers and for regression. The Python package scikit-learn in fact allows the user to use boosting with different loss functions and with different weak classifiers.

Despite the theoretical proof that Boosting does not overfit, researchers running it for extremely long times on rather big supercomputers found at at some point it starts to overfit, although still very slowly. Still, that is not likely to happen in your application. Lastly, Boosting with shallow decision trees is also a great way to have a fine control over how much bias there will be on your model, as all you need to do for that is to choose the number of T iterations.

Bibliography

Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression T rees (Monterey, California: Wadsworth).
Schapire, R. E. (1990). The strength of weak learnability. Machine learning, 5(2), 197-227.

Intro to Machine Learning Part 6: Gaussian Naive Bayes and Logistic Regression

Machine Learning problems often involve binary classification, which seeks to use a data point’s features, x, to correctly predict its label, y. In my last post I discussed binary classification with Support Vector Machines (SVM), which formulates the classification problem as a search for the maximum margin hyperplane that divides two classes. Today we’ll take different view on binary classification, we’ll use our training set to construct P(y|x), the probability of class y given a set of features x and classify each point by determining which class it is more likely to be. We’ll examine two algorithms for that use different strategies for estimating P(y|x), Naïve Bayes and Logistic regression. I’ll demonstrate the two classifiers on an example data set I’ve created, shown in Figure 1 below. The data set contains features X = (X1, X2) and  labels Y∈ (+1,-1),  positive points are shown as blue circles and negative as red triangles. This example was inspired by an in class exercise in CS 5780 at Cornell, though I’ve created this data set and set of code myself using python’s scikit-learn package.

raw_points

Figure 1: Example training set

 

Gaussian Naïve Bayes

Naïve Bayes is a generative algorithm, meaning that it uses a set of training data to generate P(x,y) and then uses Bayes Rule to find P(y|x):

P(y|x)=\frac{P(x|y)P(y)}{P(x)}                                (1)

A necessary condition for equation 1 to hold is the Naïve Bayes assumption, which states that feature values are independent given the label. While this is a strong assumption, it turns out that using this assumption can create effective classifiers even if it is violated.

To use Bayes rule to construct a classifier, we need a second assumption regarding the conditional distribution of each feature x on each label y. Here we’ll use a Gaussian distribution such that:

P(x|y) ~ N(\mu_y, \Sigma_y)                                                                                   (2)

Where \Sigma_y is a diagonal covariance matrix with [\Sigma_y]_{\alpha,\alpha}=\sigma^2_{\alpha, y} for each feature \alpha.

For each feature, $\alpha$, and each class, c we can then model P(x_\alpha|y) as:

P(x_\alpha|y=c) ~ N(\mu_{\alpha c},\sigma^2_{\alpha c})=\frac{1}{\sqrt{2\pi}\sigma_\alpha c}e^{-\frac{1}{2}(\frac{x_\alpha-\mu_{\alpha c}}{\sigma_{\alpha c}})^{2}}                              (3)

We can then estimate model parameters:

\mu_{\alpha c} = \frac{1}{n_c}\sum^{n}_{i=1}I(y_i=c)x_{i \alpha}                                                                   (4)

\sigma^2_{\alpha c} = \frac{1}{n_c}\sum^{n}_{i=1}I(y_i=c)(x_{i \alpha}-\mu_{\alpha c})^2                                                  (5)

Where:

n_c = \sum^{n}_{i=1}I(y_i=c)                                                                                (6)

Parameters can be estimated with Maximum likelihood estimation (MLE) or maximum a posteriori estimation (MAP).

Once we have fit the conditional Gaussian model to our data set, we can derive a linear classifier, a hyperplane that separates the two classes,  which takes the following form:

P(y|x) = \frac{1}{1+e^{-y(w^T x+b)}}                                                                             (7)

Where w is a vector of coefficients that define the separating hyperplane and b is the hyperplane’s intercept. W and b are functions of the Gaussian moments derived in equations 4 and 5. For a full derivation of the linear classifier starting with the Naive Bayes assumption, see the excellent course notes from CS 5780.

Logistic Regression

Logistic regression is the discriminative counterpart to Naive Bayes, rather than modeling P(x,y) and using it to estimate P(y|x), Logistic regression models P(y|x) directly:

P(y|x) = \frac{1}{1+e^{-y(w^T x+b)}}                                                                              (8)

Logistic regression uses MLE or MAP to directly estimate the parameters of the separating hyperplane, w and b rather than deriving them from the moments of P(x,y). Rather than seeking to fit parameters that best describe the test data, logistic regression seeks to fit a hyperplane that best separates the test data. For derivation of MLE and MAP estimates of logistic regression parameters, see the class notes from CS 5780.

Comparing Gaussian Naive Bayes and Logistic Regression

Below I’ve plotted the estimated classifications by the two algorithms using the Scikit-learn package in Python. Results are shown in Figure 2.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
import seaborn as sns
sns.set(style='whitegrid')

## create a test data set ##
pos = np.array([[1,5], [1,7], [1,9], [2,8], [3,7], [1,11], [3,3], \
[5,5], [4,8], [5,9], [2,6], [3,9], [4,4]])
neg = np.array([[4,1], [5,1], [3,2], [2,1], [8,4], [6,2], [5,3], \
[4,2], [7,1], [5,4], [6,3], [7,4], [4,3], [5,2], [8,5]])
all_points = np.concatenate((pos,neg), 0)
labels = np.array([1,1,1,1,1,1,1,1,1,1,1,1,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1])

## compare Naive Bayes and Logistic Regression ##

# Fit Naive Bayes
gnb = GaussianNB()
gnb.fit(all_points, labels)

# make NB predictions and plot
x1_mesh, x2_mesh = np.meshgrid(np.arange(0,11,1), np.arange(0,11,1))
Y_NB = gnb.predict_proba(np.c_[x1_mesh.ravel(), x2_mesh.ravel()])[:,1]
Y_NB = Y_NB.reshape(x1_mesh.shape)

fig1, axes = plt.subplots(1,2, figsize=(10,4))

axes[0].contourf(x1_mesh, x2_mesh, Y_NB, levels=(np.linspace(0,1.1,3)), \
cmap='RdBu')
axes[0].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[0].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[0].set_xlim([0,10]); axes[0].set_ylim([0,10]); axes[0].set_xlabel('X1')
axes[0].set_ylabel('X2'); axes[0].set_title('Naive Bayes')
#plt.legend(['Positive Points', 'Negative Points'], scatterpoints=1)
#.savefig('NB_classification.png', bbox_inches='tight')

# Fit Logistic Regression
lr = LogisticRegression()
lr.fit(all_points, labels)

# Make predictions and plot
Y_LR = lr.predict_proba(np.c_[x1_mesh.ravel(), x2_mesh.ravel()])[:,1]
Y_LR = Y_LR.reshape(x1_mesh.shape)

axes[1].contourf(x1_mesh, x2_mesh, Y_LR, levels=(np.linspace(0,1.1,3)), \
cmap='RdBu')
axes[1].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[1].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[1].set_xlim([0,10]); axes[1].set_ylim([0,10]); axes[1].set_xlabel('X1'); 
axes[1].set_ylabel('X2'); axes[1].set_title("Logistic Regression")
plt.savefig('compare_classification.png', bbox_inches='tight')

 

 

compare_classification

Figure 2: Example classification with Gaussian Naive Bayes (left) and Logistic regression. Blue shaded areas represent a prediction of positive labels for the data points, the red shaded areas represent predictions of negative labels.

Figure 2 illustrates an important difference in the treatment of outliers between the two classifiers. Gaussian Naive Bayes assumes that points close to the centroid of class are likely to be members of that class, which leads it to mislabel positive training points with features (3,3), (4,4) and (5,5). Logistic regression on the other hand is only concerned with correctly classifying points, so the signal from the outliers is more influential on its classification.

So which algorithm should you use? The answer, as usual, is that it depends. In this example, logistic regression is able to correctly classify the outliers with positive labels while Naïve Bayes is not. If these points are indeed an indicator of the underlying structure of positive points, then logistic regression has performed better. On the other hand, if they are truly outliers, than Naïve Bayes has performed better. In general, Logistic Regression has been found to outperform Naïve Bayes on large data sets but is prone to over fit small data sets. The two algorithms will converge asymptotically if the Naïve Bayes assumption holds.

Visualizing P(y|x)

One advantage to these methods for classification is that they provide estimates of P(y|x), whereas other methods such as SVM only provide a separating hyperplane. These probabilities can be useful in decision making contexts such as scenario discover for water resources systems, demonstrated in Quinn et al., 2018. Below, I use scikit-learn to plot the classification probabilities for both algorithms.

# plot Naive Bayes predicted probabilities
fig2, axes = plt.subplots(1,2, figsize=(12,4))
axes[0].contourf(x1_mesh, x2_mesh, Y_NB, levels=(np.linspace(0,1,100)), \
cmap='RdBu')
axes[0].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[0].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[0].set_xlim([0,10]); axes[0].set_ylim([0,10]); axes[0].set_xlabel('X1'); 
axes[0].set_ylabel('X2'); axes[0].set_title('Naive Bayes')

# plot Logistic Regression redicted probabilities
LRcont = axes[1].contourf(x1_mesh, x2_mesh, Y_LR, levels=(np.linspace(0,1,100)), \
cmap='RdBu')
axes[1].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[1].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[1].set_xlim([0,10]); axes[1].set_ylim([0,10]); axes[1].set_xlabel('X1')
axes[1].set_ylabel('X2'); axes[1].set_title('Logistic Regression')
cb = fig2.colorbar(LRcont, ax=axes.ravel().tolist())
cb.set_label('Probability of Positive Classification')
cb.set_ticks([0, .25, .5, .75, 1])
cb.set_ticklabels(["0", "0.25", "0.5", "0.75", "1.0"])
plt.savefig('compare_probs.png', bbox_inches='tight')

compare_probs

Figure 3: Conditional probabilities P(y|x) generated by Naive Bayes (left) and Logistic Regression.

Further reading

This post has focused on Gaussian Naive Bayes as it is the direct counterpart of Logistic Regression for continuous data. It’s important to note however, that Naive Bayes frequently used on data with binomial or multinomial features. Examples include spam filters and language classifiers. For more information on Naive Bayes in these context, see these notes from CS 5780.

As mentioned above, logistic regression has been for scenario discovery in water resources systems, for more detail and context see Julie’s blog post.

References

Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

Course Notes from MIT: https://alliance.seas.upenn.edu/~cis520/wiki/index.php?n=Lectures.Logistic

Course Notes from Cornell: http://www.cs.cornell.edu/courses/cs4780/2018fa/syllabus/index.html

Quinn, J. D., Reed, P. M., Giuliani, M., Castelletti, A., Oyler, J. W., & Nicholas, R. E. (2018). Exploring how changing monsoonal dynamics and human pressures challenge multireservoir management for flood protection, hydropower production, and agricultural water supplyWater Resources Research54, 4638–4662. https://doi.org/10.1029/2018WR022743