# Welcome to our blog!

Welcome to Water Programming! This blog is by Pat Reed’s group at Cornell, who use computer programs to solve problems — Multiobjective Evolutionary Algorithms (MOEAs), simulation models, visualization, and other techniques. Use the search feature and categories on the right panel to find topics of interest. Feel free to comment, and contact us if you want to contribute posts.

To find software:  Please consult the Pat Reed group website, MOEAFramework.org, and BorgMOEA.org.

The MOEAFramework Setup Guide: A detailed guide is now available. The focus of the document is connecting an optimization problem written in C/C++ to MOEAFramework, which is written in Java.

The Borg MOEA Guide: We are currently writing a tutorial on how to use the C version of the Borg MOEA, which is being released to researchers here.

Call for contributors: We want this to be a community resource to share tips and tricks. Are you interested in contributing? Please email dfg42 “at” cornell.edu. You’ll need a WordPress.com account.

# Performing random seed analysis and runtime diagnostics with the serial Borg Matlab wrapper

Search with Multiobjective Evolutionary Algorithms (MOEAs) is inherently stochastic. MOEAs are initialized with a random population of solutions that serves as the starting point for the multiobjective search. If the algorithm gets “lucky”, the initial population may contain points in an advantageous region of the decision space that give the algorithm a head start on the search. On the other hand, the initial population may only contain solutions in difficult regions of the decision space, which may slow the discovery of quality solutions. To overcome the effects of initial parameterization, we perform a random seed analysis, which involves running an ensemble of searches, each starting with a randomly sampled set of initial conditions that we’ll refer to from here on as a “random seed”. We combine search results across all random seeds to generate a “reference set”, which contains only the best (Pareto non-dominated) solutions across the ensemble.
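To make the idea of a reference set concrete, here is a minimal Python sketch of merging results from several seeds and keeping only the Pareto non-dominated points. It assumes all objectives are minimized, and the function names and the numbers in `seeds` are made up for illustration; it is not the MOEAFramework implementation.

```python
def dominates(a, b):
    """Return True if objective vector a Pareto-dominates b (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def reference_set(seed_results):
    """Merge per-seed solution lists and keep only the non-dominated points."""
    merged = [tuple(p) for seed in seed_results for p in seed]
    unique = list(dict.fromkeys(merged))  # drop exact duplicates, keep order
    return [p for p in unique if not any(dominates(q, p) for q in unique)]


# Hypothetical two-objective results from three seeds (made-up numbers).
seeds = [
    [(0.1, 0.9), (0.5, 0.5)],
    [(0.2, 0.7), (0.6, 0.6)],
    [(0.9, 0.1), (0.5, 0.5)],
]
merged_front = reference_set(seeds)
print(merged_front)
```

In this toy example, the point (0.6, 0.6) from the second seed is dropped because (0.5, 0.5) is better in both objectives; every other point survives into the reference set.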

Tracking the algorithm’s performance during search is an important part of a random seed analysis. When we use MOEAs to solve real-world problems (i.e., problems that don’t have analytical solutions), we don’t know the true Pareto set a priori. To determine if an algorithm has discovered an acceptable approximation of the true Pareto set, we must measure its performance across the search, and only end our analysis if we can demonstrate the search has ceased improving. Of course, this is not a criterion for true convergence, as it is possible the algorithm has simply failed to find better solutions to the problem; this is why performing rigorous diagnostic studies such as Zatarain et al., 2016 is important for understanding how various MOEAs perform in real-world problems. To measure MOEA search performance, we’ll use hypervolume, a metric that captures both convergence and diversity of a given approximation set (Knowles and Corne, 2002; Zitzler et al., 2003). Hypervolume represents the fraction of the objective space that is dominated by an approximation set, as shown in Figure 1 (from Zatarain et al., 2017). For more information on MOEA performance metrics, see Joe’s post from 2013.

Figure 1: A 2 objective example of hypervolume from Zatarain et al., 2017. To calculate hypervolume, an offset, delta, is taken from the bounds of the approximation set to construct a “reference point”. The hypervolume is a measure of the volume of the objective space between the approximation set and the reference point. A larger hypervolume indicates a better approximation set.
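For intuition, the two-objective case in Figure 1 can be computed with a simple sweep. The sketch below is a minimal Python illustration assuming both objectives are minimized; the helper names and the toy points are hypothetical, and the MOEAFramework's own calculation (including any normalization it applies) may differ in its details.

```python
def reference_point(points, delta):
    """Offset the worst value of each objective by delta (minimization)."""
    n_objs = len(points[0])
    return tuple(max(p[i] for p in points) + delta for i in range(n_objs))


def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by the reference point `ref`."""
    rx, ry = ref
    # Only points strictly better than the reference point contribute.
    pts = sorted(p for p in points if p[0] < rx and p[1] < ry)
    area = 0.0
    best_y = ry
    for x, y in pts:  # sweep from left to right
        if y < best_y:  # dominated points add nothing
            area += (rx - x) * (best_y - y)
            best_y = y
    return area


# Toy reference set and a coarser approximation of it (made-up numbers).
ref_set = [(1, 3), (2, 2), (3, 1)]
approx_set = [(2, 3), (3, 2)]

ref_pt = reference_point(ref_set, delta=1)
hv_reference = hypervolume_2d(ref_set, ref_pt)
hv_approx = hypervolume_2d(approx_set, ref_pt)
relative_hv = hv_approx / hv_reference
print(ref_pt, hv_reference, hv_approx, relative_hv)
```

Dividing an approximation set's hypervolume by that of the reference set gives a value between 0 and 1, which is convenient for comparing snapshots taken at different points in the search.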

This post will demonstrate how to perform a random seed analysis and runtime diagnostics using the Matlab wrapper for the serial Borg MOEA (for background on the Borg MOEA, see Hadka and Reed, 2013). I’ll use the DTLZ2 3 objective test problem as an example, which tasks the algorithm with approximating a spherical Pareto-optimal front (Deb et al., 2002). I’ve created a GitHub repository with the relevant code, which you can find here.

In this demonstration, I’ll use the Matlab IDE and Bash shell scripts called from a Linux terminal (Windows machines can use Cygwin, a free Linux emulator). If you are unfamiliar with using a Linux terminal, you can find a tutorial here. To perform runtime diagnostics, I’ll use the MOEAFramework, a Java library that you can download here (the demo version will work just fine for our purposes).

## A modified Matlab wrapper that produces runtime files

In order to track search performance across time, we need snapshots of Borg’s archive during the search. In the parallel “master-worker” and “multi-master” versions of Borg, these snapshots are generated by the Borg C library in the form of “runtime” files. The snapshots provided by the runtime files contain information on the number of function evaluations completed (NFE), elapsed time, operator probabilities, number of improvements, number of restarts, population size, archive size and the decision variables and objectives within the archive itself.

To my knowledge, the current release of the serial Borg Matlab wrapper does not print runtime files. To perform runtime diagnostics, I had to modify the wrapper file, nativeborg.cpp. I’ve posted my edited version to the aforementioned Github repository.

## Performing random seed analysis and runtime diagnostics

To perform a random seed analysis and runtime diagnostics with the Matlab wrapper, follow these steps:

### 1) Download and compile the Borg MOEA, then set up your working directory

To request access to the Borg MOEA, complete step 2 of Jazmin’s introduction to Borg, found here. To run Borg with Matlab, you must compile a MEX file; instructions for compiling on Windows can be found here, and here for Linux/Mac.

Once you’ve downloaded and compiled Borg for Matlab, clone the GitHub repository I’ve created and replace the nativeborg.cpp file from the Borg download with the edited version from the repository. Next, create three new folders in your working directory, called “Runtime”, “Objectives” and “metrics”. Make sure your working directory contains the following files:

• borg.c
• borg.h
• mt19937ar.c
• mt19937ar.h
• nativeborg.cpp (version from the Git repository)
• borg.m
• DTLZ2.m (test problem code, supplied from Github repository)
• calc_runtime_metrics.sh
• append_hash.sh
• MOEAFramework-2.12-Demo.jar

### 2) Use Matlab to run the Borg MOEA across an ensemble of random seeds

For this example we’ll use 10 seeds with 30,000 NFE each. We’ll print runtime snapshots every 500 NFE.

To run DTLZ2 across 10 seeds,  run the following script in Matlab:

for i = [1:10]
[vars, objs, runtime] = borg(12,3,0, @DTLZ2, 30000, zeros(1,12),ones(1,12), [0.01, 0.01, 0.01], {'frequency',500, 'seed', i});
objFile = sprintf('Objectives/DTLZ2_3_S%i.obj',i);
dlmwrite(objFile, objs, 'Delimiter', ' ');
end


The for loop above iterates across 10 random initializations of the algorithm. The first line within the for loop calls the Borg MOEA and returns decision variables (vars), objective values (objs) and a struct with basic runtime information. This function will also produce a runtime file, which will be printed in the Runtime folder created earlier (more on this later).

The second line within the for loop creates a string containing the name of a file to store the seed’s objectives and the third line prints the final objectives to this file.

### 3) Calculate the reference set across random seeds using the MOEAFramework

The 10 .obj files created in step two contain the final archives from each random seed. For our analysis, we want to generate a “reference set” of the best solutions across all seeds. To generate this set, we’ll use built-in tools from the MOEAFramework. The MOEAFramework requires that all .obj files have “#” at the end of the file, which is annoying to add in Matlab. To get around this, I’ve written a simple Bash script called “append_hash.sh”.
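If you would rather not use Bash for this step, the same idea can be sketched in a few lines of Python. This is not the contents of append_hash.sh (which lives in the repository); it is a hypothetical equivalent that assumes the “#” should sit on its own line at the end of each file, and the function name is made up for illustration.

```python
from pathlib import Path


def append_hash(folder, pattern="*.obj"):
    """Append a line containing '#' to every matching file that lacks one."""
    changed = []
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_text()
        lines = text.splitlines()
        if lines and lines[-1].strip() == "#":
            continue  # already terminated, leave the file untouched
        # Make sure the '#' starts on its own line.
        separator = "" if (not text or text.endswith("\n")) else "\n"
        path.write_text(text + separator + "#\n")
        changed.append(path.name)
    return changed
```

Calling `append_hash("Objectives")` from the working directory would then update every .obj file in the Objectives folder, and calling it a second time would change nothing.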

In your Linux terminal navigate to the working directory with your files (the folder just above Objectives) and run the Bash script like this:

 ./append_hash.sh

Now that a hash symbol has been appended to each .obj file, create an overall reference set by running the following command in your Linux terminal.

java -cp MOEAFramework-2.12-Demo.jar org.moeaframework.analysis.sensitivity.ResultFileSeedMerger -d 3 -e 0.01,0.01,0.01 -o Borg_DTLZ2_3.reference Objectives/*.obj


This command calls the MOEAFramework’s ResultFileSeedMerger tool, which will merge results across random seeds. The -d flag specifies the number of objectives in our problem (3), the -e flag specifies the epsilons for each objective (0.01 for all 3 objectives), the -o flag specifies the name of our newly created reference set file, and Objectives/*.obj tells the MOEAFramework to merge all files in the Objectives folder that have the extension “.obj”. This command will generate a new file named “Borg_DTLZ2_3.reference”, which will contain 3 columns, each corresponding to one objective. If we load this file into Matlab and plot it, we get the following plot of our Pareto approximate set.
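To give a feel for what the epsilons do, here is a minimal Python sketch of one common formulation of epsilon-box dominance: objective space is divided into boxes of width epsilon, at most one solution is kept per box, and boxes dominated by other boxes are discarded. It assumes minimization, the function names and sample points are hypothetical, and the MOEAFramework's actual implementation may differ in its details.

```python
import math


def box_index(objs, epsilons):
    """Map an objective vector to the integer index of its epsilon-box."""
    return tuple(math.floor(o / e) for o, e in zip(objs, epsilons))


def dominates(a, b):
    """Pareto dominance for minimization."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def epsilon_nondominated(points, epsilons):
    """Keep at most one point per epsilon-box, and only non-dominated boxes."""
    best_in_box = {}
    for p in points:
        idx = box_index(p, epsilons)
        corner = [i * e for i, e in zip(idx, epsilons)]
        dist = math.dist(p, corner)
        # Within a box, prefer the point closest to the box's lower corner.
        if idx not in best_in_box or dist < best_in_box[idx][0]:
            best_in_box[idx] = (dist, tuple(p))
    boxes = list(best_in_box)
    return [best_in_box[b][1] for b in boxes
            if not any(dominates(other, b) for other in boxes)]


# Made-up two-objective points with a deliberately coarse epsilon of 0.5.
points = [(0.2, 1.2), (0.3, 1.3), (1.2, 0.2), (1.3, 1.3)]
kept = epsilon_nondominated(points, (0.5, 0.5))
print(kept)
```

Here the first two points share a box, so only the one nearer the box corner is kept, and the last point is dropped because its box is dominated. Larger epsilons give a coarser, smaller reference set; smaller epsilons retain more solutions.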

Figure 2: The reference set generated by the Borg Matlab wrapper using 30,000 NFE.

### 4) Calculate and visualize runtime hypervolumes

We now have a reference set representing the best solutions across our random seeds. A final step in our analysis is to examine runtime data to understand how the search progressed across function evaluations. We’ll again use the MOEAFramework to examine each seed’s hypervolume at the distinct runtime snapshots provided in the .runtime files. I’ve written a Bash script to call the MOEAFramework, which is provided in the Git repository as “calc_runtime_metrics.sh” and reproduced below:

#!/bin/bash

NSEEDS=10
SEEDS=$(seq 1 ${NSEEDS})
JAVA_ARGS="-cp MOEAFramework-2.12-Demo.jar"

for SEED in ${SEEDS}
do
java ${JAVA_ARGS} org.moeaframework.analysis.sensitivity.ResultFileEvaluator -d 3 -i ./Runtime/runtime_S${SEED}.runtime -r Borg_DTLZ2_3.reference -o ./metrics/Borg_DTLZ2_3_S${SEED}.metrics
done


To execute the script in your terminal enter:


./calc_runtime_metrics.sh


The above command will generate 10 .metrics files inside the metrics folder. Each .metrics file contains MOEA performance metrics for one random seed; hypervolume is in the first column, and each row represents a different runtime snapshot. It’s important to note that the hypervolume calculated by the MOEAFramework here is the hypervolume relative to the reference set (i.e., the hypervolume achieved at each runtime snapshot divided by the hypervolume of the reference set).

To examine runtime performance across random seeds, we can load each .metrics file into Matlab and plot hypervolume against NFE. The runtime hypervolume for the DTLZ2 3 objective test case I ran is shown in Figure 3 below.
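The same loading step can also be sketched in Python. The snippet below is a minimal illustration that assumes each .metrics file is whitespace-delimited, that any header lines start with “#”, and that rows appear in snapshot order at the 500 NFE frequency used above; the function names and sample lines are hypothetical.

```python
def read_first_column(lines):
    """Parse the first numeric column, skipping blank and '#' lines."""
    values = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        values.append(float(line.split()[0]))
    return values


def hypervolume_trace(lines, frequency=500):
    """Pair each snapshot's hypervolume with its NFE (one row per snapshot)."""
    hv = read_first_column(lines)
    return [((i + 1) * frequency, v) for i, v in enumerate(hv)]


# Made-up contents standing in for one .metrics file.
sample_lines = ["#Hypervolume OtherMetric", "0.25 0.1", "0.5 0.05", "", "0.75 0.01"]
trace = hypervolume_trace(sample_lines)
print(trace)
```

With real output, the list of lines could come from `open("metrics/Borg_DTLZ2_3_S1.metrics")`, and the resulting (NFE, hypervolume) pairs can be passed to any plotting library.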

Figure 3: Runtime results for the DTLZ2 3 objective test case

Figure 3 shows that while there is some variance across the seeds, they all approach the hypervolume of the reference set after about 10,000 NFE. This leveling off of our search across many initial parameterizations indicates that our algorithm has likely converged to a final approximation of our Pareto set. If this plot had yielded hypervolumes that were still increasing after the 30,000 NFE, it would indicate that we need to extend our search to a higher number of NFE.
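The visual check above can be complemented with a simple numerical one. The following Python sketch flags a seed as having leveled off when its hypervolume improved by no more than a small tolerance over the last few snapshots. The window and tolerance values are arbitrary assumptions chosen for illustration, and, as noted earlier, passing this check does not prove true convergence.

```python
def has_leveled_off(hypervolumes, window=10, tolerance=0.01):
    """True if hypervolume rose by at most `tolerance` over the last `window` snapshots."""
    if len(hypervolumes) <= window:
        return False  # not enough snapshots to judge
    return hypervolumes[-1] - hypervolumes[-1 - window] <= tolerance


def all_seeds_leveled_off(traces, window=10, tolerance=0.01):
    """True only if every seed's hypervolume trace has leveled off."""
    return all(has_leveled_off(t, window, tolerance) for t in traces)


# Made-up traces with 60 snapshots each (30,000 NFE at 500 NFE per snapshot).
still_rising = [i / 60 for i in range(1, 61)]
plateaued = [min(1.0, i / 20) for i in range(1, 61)]

print(has_leveled_off(still_rising), has_leveled_off(plateaued))
```

If any seed fails the check, that is a hint to extend the search to a higher number of NFE before trusting the reference set.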

## References

Deb, K., Thiele, L., Laumanns, M., Zitzler, E., 2002. Scalable multi-objective optimization test problems. Proceedings of the 2002 Congress on Evolutionary Computation, CEC’02, (1), 825-830.

Hadka, D., Reed, P., 2013. Borg: an auto-adaptive many-objective evolutionary computing framework. Evol. Comput. 21 (2), 231–259.

Knowles, J., Corne, D., 2002. On metrics for comparing nondominated sets. Evolutionary Computation, 2002. CEC’02. Proceedings of the 2002 Congress on. 1. IEEE, pp. 711–716.

Zatarain Salazar, J., Reed, P.M., Herman, J.D., Giuliani, M., Castelletti, A., 2016. A diagnostic assessment of evolutionary algorithms for multi-objective surface water reservoir control. Adv. Water Resour. 92, 172–185.

Zatarain Salazar, J. J., Reed, P.M., Quinn, J.D., Giuliani, M., Castelletti, A., 2017. Balancing exploration, uncertainty and computational demands in many objective reservoir optimization. Adv. Water Resour. 109, 196–210.

Zitzler, E., Thiele, L., Laumanns, M., Fonseca, C.M., Da Fonseca, V.G., 2003. Performance assessment of multiobjective optimizers: an analysis and review. IEEE Trans. Evol. Comput. 7 (2), 117–132.

# MOEAFramework Training Part 1: Connecting an External Problem

The goal of this training is to step a user through becoming familiar with the capabilities of MOEAFramework, a free and open source Java library created by Dave Hadka, that allows the user to design, execute, and test out a variety of popular multi-objective evolutionary algorithms (MOEAs). In this series, we will demonstrate the capabilities of MOEAFramework in 4 parts. Part 1 demonstrates how to hook up an external optimization problem to MOEAFramework. Part 2 will cover the optimization of the problem using a variety of algorithms. Part 3 will illustrate how to calculate metrics to assess the performance of the algorithms. Finally, Part 4 will step through generation of relevant figures from the metrics that convey the effectiveness, efficiency, reliability, and controllability of the algorithms.

The example test case that will be used throughout this tutorial is the DPS version of the Lake Problem. The code is written and adapted by Dr. Julianne Quinn and can be found here. However, the tutorial will be set up so that the user can easily swap in their own problem formulation.

Finally, this guide is specifically built to connect an external problem written in C++, but Dave Hadka’s Beginner Guide to the MOEA Framework has examples of how to create the Java executable for problems written in languages other than C++.

Before starting this training, it is recommended to set up a directory in a Linux environment with Java installed. If you are a student in Reed Research Group, create a folder in your Cube directory called “MOEA_Diagnostics_Tutorial”. Throughout the training, we will be adding various subdirectories and files to this folder. For this part of the tutorial, you will need the following in your directory:

1. MOEAFramework Demo .jar file which can be found by clicking on “Demo Application” on the far right.
2. The C++ version of the Lake Problem
3. SOWs_Type6.txt (natural inflows read in by the .cpp file)
4. moeaframework.c and moeaframework.h found here

Once you have the .cpp file, the next step is to set it up so that it is recognized by MOEAFramework. I have made these changes in this file in my GitHub repo. If you compare it with the file in item 2 above, you will see some changes.

First, I have commented out all parts of the code related to Borg, which the problem was originally set up to be optimized with. For this tutorial, we will be optimizing with multiple algorithms and comparing their performance. Secondly, in lines 55-57, I have used the #define directive to declare the number of variables, objectives, and constraints as constants. Lines 300-309 are where the direct connection to MOEAFramework comes in. MOEA_Init initializes the communication between C++ and MOEAFramework and takes the number of objectives and constraints as arguments. Then we can start reading and evaluating solutions. MOEA_Read_doubles extracts the decision variables and stores them in an array, vars. Line 304 calls the lake_problem function that we would like to evaluate, passing it the variables, objectives, and constraints. This results in the objs array being filled. Finally, MOEA_Write writes the objectives back to the framework. When all solutions have been read and written, MOEA_Terminate closes the connection. The important thing here is just to make sure that you are passing the names of the arguments of your lake_problem function correctly.

Next, we must make an executable of the C++ file that Java can call. The makefile is located here; it takes lake.cpp, moeaframework.c, moeaframework.h, and some relevant libraries and compiles them into an executable called “lake”. To run the makefile, simply type “make”.

Finally, we must turn our executable into a Java class. The relevant Java file can be found here. This file can be found in the MOEAFramework documentation, but must be tailored to an external problem. Lines 21-25 import the relevant tools from MOEAFramework that help to configure and solve the problem. Lines 40, 42, 43, and 48 show the first change from the original file. Here we insert the name of the executable that was generated in the last step, “lake”. In lines 53, 58, and 63, we state the number of decision variables, objectives, and constraints. Finally, in lines 67-75, we create a newSolution() method to specify that our solution should have 6 real-valued decision variables between 0 and 1.

In order to create a class file, simply type:



javac -classpath MOEAFramework-2.12-Demo.jar lake.java



A file called lake.class will be created.

The first part of training is done! Next time, we will set up some scripts, call these executables, and optimize the lake problem with a variety of different algorithms.

# Introduction

Let me start this blog post by stating the obvious: Geospatial maps are interesting to look at and certainly make papers and presentations prettier and more impressive; however, those are not the only reasons that such maps exist. They are used to communicate various types of information including geographical locations of regions in the world.

## Why R?

Several available platforms have been used for drawing spatial maps and conducting geospatial data management. A prominent example is ArcGIS, which is a popular, flexible, and user-friendly geospatial mapping tool. Although ArcGIS is powerful and has many features, I am personally interested in open-source, Linux-friendly software.

Although there are several open-source GIS tools such as Python, GRASS, QGIS, and UbuntuGIS, in this blog post, I will explain how R can be used as an alternative tool for geospatial analysis and for drawing spatial maps. R offers several advantages. First, R is an open-source platform, whereas ArcGIS is relatively expensive. Second, R is script based. In some situations, you might have to generate several hundred maps from post-processed results; a tool such as R could offer faster and more flexible data processing. You can run R on Linux machines and computer clusters and link it to other models that work under the Linux operating system. Different packages in R have been developed for geospatial analysis. In this exercise, I am going to focus on “RGDAL,” a widely used R package. RGDAL provides R bindings to the Geospatial Data Abstraction Library (GDAL).

I recently moved to Cornell University, and I am eager to learn more about this region, so I decided to focus on the Susquehanna River Basin (SRB), located in the US mid-Atlantic. The SRB drains parts of New York, Pennsylvania, and Maryland to the Chesapeake Bay. Before I get entirely sidetracked by my interest in the SRB, let’s go back to the original intention of this blog post, which is making geospatial maps in R.

## Prerequisites

1- Susquehanna River Basin Boundary from here

2- Major Watersheds in the Susquehanna River Basin from here

3- Susquehanna River from here

### Open a new R script in RStudio, then install and load the following R packages using the commands below:

# install.packages("rgdal")
# install.packages("ggplot2")
# install.packages("RColorBrewer")

library(rgdal)
library(ggplot2)
library(RColorBrewer)


## Step 1- Map of Susquehanna River Basin

The first map of this exercise is a simple map of Susquehanna River Basin.

# I) The first step is to draw the map of SRB using the following code

SRB_Boundary <- readOGR(dsn = "spatial maps/Code/Shapefiles/srb/srb.shp")
plot(SRB_Boundary, col="gray90",
main="Figure 1", sub="Susquehanna River Basin", cex.main=2.5, cex.sub=2.5)

# II) Then we can add Susquehanna River to the map

# III) Adding information from the attribute table
#Shapefiles usually contain helpful information, such as name of objects,
#sub-basins, area/length of objects, etc.
#We are often interested in adding some of that information to our maps.
#Here is how we can do it in R:

text(SRB_Boundary$NAME, x=coordinates(SRB_Boundary)[1], y=coordinates(SRB_Boundary)[2]*1.2, cex=1.2, col="darkblue", font=2)
text(paste("Area=27,500 square miles"), x=coordinates(SRB_Boundary)[1], y=(coordinates(SRB_Boundary)[2]*1.15), font=3, cex=1, col="darkblue")

# Let's add coordinates to the map
llgridlines(SRB_Boundary, plotLabels = T, cex=1.5)

## Step 2- Selection of objects from an attribute table

If you have already worked with ArcGIS, you probably used its selection tools. What we are doing here is equivalent to selection from the attribute table. If you are not familiar with attribute tables, this short explanation from Esri should be helpful.

# I) Let's add SRB map again

plot(SRB_Boundary, col="gray90",
main="Figure 2", sub="Sub-basins greater than 800 square kilometer", cex.main=2.5, cex.sub=2.5)

# II) Then we can add all the subbasins in SRB to the map

Subbasin <- readOGR(dsn = "spatial maps/Code/Shapefiles/wshedmjr/wshedmjr.shp")
plot(Subbasin, add =T, col=alpha("darkolivegreen1", 0.9))

# III) For this exercise, we are going to select large sub-basins of SRB
# with area of greater than 800 square kilometer

LargestBasins=which(Subbasin$SQM>800) # square kilometer

# IV) Now we are going to change the color of these selected features on the map

plot(Subbasin[LargestBasins,], add =T, col=alpha("seagreen", 0.9))

## Step 3- Adding a legend to the map

In this part of the exercise, we are going to add a legend to the map.

# I) SRB map

plot(SRB_Boundary, col="gray70",lwd=4,
main="Figure 3", sub="Precipitation contour lines", cex.main=2.5, cex.sub=2.5)

# Now let's add precipitation contours to the SRB map

# We can use the following script to add a legend to the map
llgridlines(SRB_Boundary, plotLabels = T, cex=1.5)

legend("right",box.col = "white", legend = unique(isohyet$INCHES), fill=bpy.colors(11), cex=1.75, title = "Precipitation (inches)")

In this short tutorial, we went over some basic features of RGDAL. However, R can be used for more sophisticated geospatial analysis tasks, which I might cover in future blog posts.

# From MATLAB to Julia: Insights from Translating an Opensource Kirsch-Nowak Streamflow Generator to Julia

## A quick look into translating code: speed comparisons, practicality, and comments

As I become more familiar with Julia, an open-source programming language, I’ve been drawn to translating code, not only to run it in a free and open-source language but also to test its performance. Since Julia was designed to handle matrix operations efficiently (compared to other high-level open-source languages), it only makes sense to find a problem that takes advantage of this performance. As with any new language, understanding how well it performs relative to the other potential tools in your toolbox is vital. As such, I decided to use a problem that is easily scalable and allows a direct comparison of the performance of MATLAB and Julia: the Kirsch-Nowak synthetic stationary streamflow generator.

So, in an effort to sharpen my understanding of the Kirsch-Nowak synthetic stationary streamflow generator created by Matteo Giuliani, Jon Herman, and Julianne Quinn, I decided to take on the project of converting this generator from MATLAB to Julia. This specific generator takes in historical streamflow data from multiple sites (while assuming stationarity) and returns a synthetically generated daily time series of streamflow. For a great background on synthetic streamflow generation, please refer to this post by Jon Lamontagne.
### Model Description

The example, borrowed from Julie’s code, uses data from the Susquehanna River flows (cfs) at both Marietta (USGS station 01576000) and Muddy Run, along with lateral inflows (cfs) between Marietta and Conowingo Dam (1932-2001). Additionally, evaporation rates (in/day) over the Conowingo and Muddy Run Dams (from an OASIS model simulation) are used. The generator developed by Kirsch et al. (2013) uses a Cholesky decomposition to create a monthly synthetic record that preserves the autocorrelation structure of the historical data. The method proposed by Nowak et al. (2010) is then used to disaggregate to daily flows (using a historical month +/- 7 days). A full description of the methods can be found at this link.

## Comparing Julia and MATLAB

### Comparison of Performance between Julia and MATLAB

To compare the speeds of each language, I adapted the MATLAB code into Julia (shown here) on as nearly equal a basis as possible. I attempted to keep the loops, data structures, and function formulation as similar as possible, even calling similar libraries for any given function.

When examining the performance between Julia (solid lines) and MATLAB (dashed lines), there is only one instance where MATLAB (x) outperformed Julia (+): the 10-realization, 1000-year simulation shown in the yellow dots in the upper left. Julia outperformed MATLAB in all other situations and required only 53% of the time on average (all simulations weighted equally). However, Julia was proportionally much faster for simulations with fewer years (17-35% of the time required by MATLAB). This is likely because I did not handle arrays optimally; the code could likely be sped up even more.

### Considerations for Speeding Up Code

#### Row- Versus Column-Major Array Architecture

It is worth knowing how a specific language processes its arrays/matrices.
MATLAB and Julia are both column-major languages, meaning that consecutive elements in memory run down the rows of one column before moving on to the next column. On the other hand, NumPy in Python uses row-major order by default. The Wikipedia article on this is brief but well worthwhile for understanding these quirks. This is especially notable because ensuring that proper indexing and looping methods are followed can substantially speed up code. In fact, it is likely that Julia slowed down significantly on the 10-realization, 1000-year simulation, compared to both its previous performance and MATLAB, because of how the arrays were looped through. As a direct example, exponentiating through a [20000, 20000] array row-by-row took approximately 47.7 seconds, while doing the same operation column-by-column took only 12.7 seconds.

#### Dealing with Arrays

Simply put, arrays and matrices in Julia are a pain compared to MATLAB. As an example of the bad and the ugly, unlike in MATLAB, where you can directly declare an array of any size you wish to work with, in Julia you must first create an array and then fill it with individual arrays. This is shown below, where an array of arrays is initialized. However, once an array is established, Julia is extremely fast in loops, so filling a previously established array makes for a much faster experience.

# initialize output
qq = Array{Array}(undef, num_sites) #(4, 100, 1200)

for i = 1:num_sites
    qq[i] = Array{Float64}(undef, nR, nY * 12)
end

On the plus side, when creating arrays, Julia is extremely powerful in its ability to assign variable types to the components of a given array. This can drastically speed up your code. Shown below, it is easy to see the range of declarations and assignments being made to populate the array.
There’s an easy example of declaring an array with zeros, and another where we’re populating an array using slices of another. Note the indexing structure for Qd_cg in the second loop: it is not technically a 3-D array but rather a 2-D array nested within a 1-D array, showing the issues mentioned prior.

delta = zeros(n_totals)
for i = 1:n_totals
    for j = 1:n_sites
        delta[i] += (Qtotals[month][j][i] - Z[j]) ^ 2
    end
end

q_ = Array{Float64, 2}(undef, num_realizations[k], 365 * num_years[k])
for i = 1: Nsites
    # put into array of [realizations, 365*num_yrs]
    for j = 1: num_realizations[k]
        q_[j, :] = Qd_cg[j][:, i]'
    end
end

#### Code Profiling: Order of Experiments

An interesting observation is that Julia’s first run on a given block of code is substantially slower than every subsequent run. Thus, it is likely worthwhile to run a smaller-scale array through first to initialize the code if there are plans to move on to substantially more expensive operations (i.e. scaling up).

In the example below, we can see that the same exact code was over 10% faster when called a second time. However, when running the code without the function wrapper (in the original timed runs), the code was 10% faster (177 seconds) than the second sequential run shown below. This points to the importance of profiling and experimenting with sections of your code.

Basic profiling tools are built directly into Julia, as shown in the Julia profiling documentation. The results can be visualized easily using the ProfileView library. The Juno IDE (standard with Julia Pro) allegedly has a good built-in profiler as well. However, most any IDE should do the trick (links to IDEs can be found here).

#### Syntax and Library Deprecation

While Julia is very similar in its structure and language to MATLAB, much of the similar language has been deprecated as Julia has been rapidly upgraded.
Notably, Julia released V1.0 in August 2018 and recently released V1.1, moving further away from similarities in function names. Thus, this stands as a lesson for individuals wishing to translate all of their code between these languages. I found a useful website that assists in translating general syntax, but many of the functions have been deprecated. However, as someone who didn’t have any experience with MATLAB but was vaguely familiar with Julia, this was a godsend for learning differences in coding styles.

For example, creating an identity matrix in MATLAB uses the function eye(size(R)) to create an n x n matrix the size of R. While this was initially the syntax used in Julia, it was deprecated in V0.7. To get around this, either ‘I’ (from the LinearAlgebra standard library) can be used as a scalable identity matrix, or Matrix{Float64}(I, size(R, 1), size(R, 2)) can be used to declare an identity matrix with the dimensions of R for a more foolproof and faster operation.

When declaring functions, I have found Julia to be relatively straightforward and Pythonic in its declarations. While I still look to insert colons at the ends of declarations while forgetting to add ‘end’ at the end of functions, loops, and more, the ease of creating, calling, and interacting with functions makes Julia very accessible. Furthermore, its ability to work with matrices without special libraries (e.g. NumPy in Python) allows for more efficient coding without having to know specific library notation.

#### Debugging Drawbacks

One of the most significant drawbacks I run into when using Julia is the lack of clarity in generated error messages for common mistakes, such as adding extra brackets. For example, the following error message is generated in Python when adding an extra parenthesis at the end of an expression.
However, Julia produces the following error for an identical mistake:

One simple solution to this is to upgrade my development environment from Jupyter Notebooks to a general IDE to more easily root out issues by running code line-by-line. However, I see the lack of clarity in showing where specific errors arise as a significant drawback to development within Julia. That said, as shown in the example below where an array has gone awry, an IDE (such as Atom shown below) can make troubleshooting and debugging a relative breeze.

Furthermore, when editing auxiliary functions in another file or module that was loaded as a library, Julia is not kind enough to simply reload and recompile the module; to get it to properly work in Atom, I had to shut down the Julia kernel and then rerun the entirety of the code. Since Julia takes a bit to initially load and compile libraries and code, this slows down the debugging process substantially. There is a specific package (Revise) that exists to take care of this issue, but it is not standard and requires loading this specific library into your code.

## GitHub Repositories: Streamflow Generators

• PyMFGM: A parallelized Python version of the code, written by Bernardo Trindade.
• Kirsch-Nowak Stationary Generator in Julia: Please note that the results are not validated. However, you can easily access the Jupyter Notebook version to play around with the code in addition to running the code from your terminal using the main.jl script.
• Full Kirsch-Nowak Streamflow Generator: Also developed by Matteo Giuliani, Jon Herman, and Julianne Quinn; it can handle rescaling flows for changes due to monsoons. I would highly suggest diving into this code alongside the relevant blog posts: Part 1 (explanation), Part 2 (validation).

# Interactive visualizations of high-dimensional data using J3

Project Platypus is a repository that supports multiple Python libraries for multi-objective optimization, scenario discovery, and data analysis.
Past blogposts have already demonstrated the Rhodium [1, 2] and Platypus [3] libraries. The aim of this post is to demonstrate the capabilities of J3 and its implementation within Project Platypus, through the Python module J3Py.

J3 is an open source, cross-platform Java application for producing and sharing high-dimensional, interactive scientific visualizations. It can be used within Project Platypus, through J3Py, which is a Python module that allows us to call J3 within Python scripts.

This blogpost will look into a simple system I’ve been working on and use the Rhodium library to generate management alternatives. I’ll then show how J3 can be used to explore the tradeoffs in the alternatives generated and aid in the negotiated selection of alternatives.

First thing to do is load the necessary libraries:

import numpy as np # This is a library required by the model
import itertools # This is a library required by the model
from rhodium import * # This is the library needed to use Rhodium
from j3 import J3 # This is the library we'll be using to visualize solutions

We then need to define the model function. It’s a bit long and not immediately pertinent to the blogpost, so I’ll put it at the bottom so readers don’t have to scroll through it.
The optimization will be performed using Rhodium and it’s set up like so:

model = Model(fish_game)

model.parameters = [Parameter("vars"), Parameter("a"),
                    Parameter("b"), Parameter("c"),
                    Parameter("d"), Parameter("h"),
                    Parameter("K"), Parameter("m"),
                    Parameter("sigmaX"), Parameter("sigmaY")]

model.responses = [Response("NPV", Response.MAXIMIZE),
                   Response("PreyDeficit", Response.MINIMIZE),
                   Response("ConsLowHarvest", Response.MINIMIZE),
                   Response("WorstHarvest", Response.MAXIMIZE),
                   Response("PredatorExtinction", Response.INFO)]

model.constraints = [Constraint("PredatorExtinction < 1")]

model.levers = [RealLever("vars", 0.0, 1.0, length = 6)]

output = optimize(model, "NSGAII", 1000)

As Julie has covered Rhodium already, I won’t go into the details here, but it’s pretty intuitive that we first declare what the model is, input parameters, responses (i.e. objectives and constraints), and decision variables. Instead, I’ll focus on analyzing the output (the candidate solutions found) using J3. “output” here is a “DataSet” object of the Rhodium module and contains the decision variables and objective performance of the solutions identified. There is also a constraint (“PredatorExtinction”), which is zero in all the solutions and which I will not be visualizing here. I will not edit or change anything on my screen before screen-grabbing, to demonstrate how truly simple and easy it is to use.

To call the J3 environment run:

J3(output.as_dataframe(['NPV', 'PreyDeficit', 'ConsLowHarvest', 'WorstHarvest']))

This produces a window with a 3D scatterplot of three of our objectives in the x, y, and z axes and the fourth used as the color.

I’d like to start examining what my results look like so I’ll make this larger and rotate it a bit.

I’d like to also change how my objectives are displayed. So I’ll change the orientation of the axes and the objective used for the color. The rainbow color-scheme is not really my aesthetic, so let’s change that also.
A couple things we can see from this plot: there is a strong tradeoff between the NPV objective and the prey deficit, as well as between the prey deficit and the Worst Harvest. We can examine these pairs of tradeoffs more explicitly, by pulling the axes’ planes out and projecting the values on the 2D surfaces:

We can also examine the tradeoffs using a parallel axis plot:

We can also move the axes in the parallel axis plot:

Having these multiple views, we can highlight and examine particular solutions and see how they compare with others, as well as get more detailed information:

The final feature I’d like to showcase is solution brushing, which can facilitate the negotiation of solutions. Brushing allows decision makers to set limits on what they believe is acceptable or unacceptable performance (e.g. “I cannot accept costs above X amount”). It also allows decision makers to more closely examine where potential tensions might arise. If, for example, one negotiating party sets their bar too high, all remaining solutions might be unacceptable to the other decision making parties. Tools like brushing make this process more transparent and straightforward.

The model function used in the example is posted below. I would also like to mention ScreenToGif, which is the tool I used to produce these GIFs and it’s been super easy to download and start using. Great product.

nRBF = 2 # no. of RBFs to use
nIn = 1 # no. of inputs (depending on selected strategy)
nOut = 1 # no. of outputs (depending on selected strategy)

N = 100 # Number of realizations of environmental stochasticity

tSteps = 100 # no. of timesteps to run the fish game on

# Define problem to be solved
def fish_game(vars, # contains all C, R, W for RBF policy
              a = 0.005, # rate at which the prey is available to the predator
              b = 0.5, # prey growth rate
              c = 0.5, # rate with which consumed prey is converted to predator abundance
              d = 0.1, # predator death rate
              h = 0.1, # handling time (time each predator needs to consume the caught prey)
              K = 2000, # prey carrying capacity given its environmental conditions
              m = 0.7, # predator interference parameter
              sigmaX = 0.004, # standard deviation of stochastic noise in prey population
              sigmaY = 0.004): # standard deviation of stochastic noise of predator population

    x = np.zeros(tSteps+1) # Create prey population array
    y = np.zeros(tSteps+1) # Create predator population array
    z = np.zeros(tSteps+1) # Create harvest array

    # Create array to store harvest for all realizations
    harvest = np.zeros([N,tSteps+1])
    # Create array to store effort for all realizations
    effort = np.zeros([N,tSteps+1])
    # Create array to store prey for all realizations
    prey = np.zeros([N,tSteps+1])
    # Create array to store predator for all realizations
    predator = np.zeros([N,tSteps+1])

    # Create array to store metrics per realization
    NPV = np.zeros(N)
    cons_low_harv = np.zeros(N)
    harv_1st_pc = np.zeros(N)

    # Create array with environmental stochasticity for prey
    epsilon_prey = np.random.normal(0.0, sigmaX, N)

    # Create array with environmental stochasticity for predator
    epsilon_predator = np.random.normal(0.0, sigmaY, N)

    # Set policy input and output ranges
    input_ranges = [[0, K]] # Prey pop. range to use for normalization
    output_ranges = [[0, 1]] # Range to de-normalize harvest to

    # Go through N possible realizations
    for i in range(N):
        # Initialize populations and values
        x[0] = prey[i,0] = K
        y[0] = predator[i,0] = 250
        # hrvSTR returns an array with nOut = 1 entries, so take the single output
        z[0] = effort[i,0] = hrvSTR([x[0]], vars, input_ranges, output_ranges)[0]
        NPVharvest = harvest[i,0] = effort[i,0]*x[0]
        # Go through all timesteps for prey, predator, and harvest
        for t in range(tSteps):
            if x[t] > 0 and y[t] > 0:
                x[t+1] = (x[t] + b*x[t]*(1-x[t]/K) - (a*x[t]*y[t])/(np.power(y[t],m)+a*h*x[t]) - z[t]*x[t])* np.exp(epsilon_prey[i]) # Prey growth equation
                y[t+1] = (y[t] + c*a*x[t]*y[t]/(np.power(y[t],m)+a*h*x[t]) - d*y[t]) *np.exp(epsilon_predator[i]) # Predator growth equation
                if t <= tSteps-1:
                    z[t+1] = hrvSTR([x[t]], vars, input_ranges, output_ranges)[0]
            prey[i,t+1] = x[t+1]
            predator[i,t+1] = y[t+1]
            effort[i,t+1] = z[t+1]
            harvest[i,t+1] = z[t+1]*x[t+1]
            NPVharvest = NPVharvest + harvest[i,t+1]*(1+0.05)**(-(t+1))
        NPV[i] = NPVharvest
        low_hrv = [harvest[i,j]<prey[i,j]/20 for j in range(len(harvest[i,:]))] # Returns a list of True values when there's harvest below 5%
        count = [ sum( 1 for _ in group ) for key, group in itertools.groupby( low_hrv ) if key ] # Counts groups of True values in a row
        if count: # Checks if there's at least one count (if not, np.max won't work on empty list)
            cons_low_harv[i] = np.max(count) # Finds the largest number of consecutive low harvests
        else:
            cons_low_harv[i] = 0
        harv_1st_pc[i] = np.percentile(harvest[i,:],1)

    return (np.mean(NPV), # Mean NPV for all realizations
            np.mean((K-prey)/K), # Mean prey deficit
            np.mean(cons_low_harv), # Mean worst case of consecutive low harvest across realizations
            np.mean(harv_1st_pc), # Mean 1st percentile of harvest across realizations
            np.mean((predator < 1).sum(axis=1))) # Mean number of predator extinction days per realization

# Calculate outputs (u) corresponding to each sample of inputs
# u is a 2-D matrix with nOut columns (1 for each output)
# and as many rows as there are samples of inputs
def hrvSTR(Inputs, vars, input_ranges, output_ranges):
    # Rearrange decision variables into C, R, and W arrays
    # C and R are nIn x nRBF and W is nOut x nRBF
    # Decision variables are arranged in 'vars' as nRBF consecutive
    # sets of {nIn pairs of {C, R} followed by nOut Ws}
    # E.g. for nRBF = 2, nIn = 3 and nOut = 4:
    # C, R, C, R, C, R, W, W, W, W, C, R, C, R, C, R, W, W, W, W
    C = np.zeros([nIn,nRBF])
    R = np.zeros([nIn,nRBF])
    W = np.zeros([nOut,nRBF])
    for n in range(nRBF):
        for m in range(nIn):
            C[m,n] = vars[(2*nIn+nOut)*n + 2*m]
            R[m,n] = vars[(2*nIn+nOut)*n + 2*m + 1]
        for k in range(nOut):
            W[k,n] = vars[(2*nIn+nOut)*n + 2*nIn + k]

    # Normalize weights to sum to 1 across the RBFs (each row of W should sum to 1)
    totals = np.sum(W,1)
    for k in range(nOut):
        if totals[k] > 0:
            W[k,:] = W[k,:]/totals[k]

    # Normalize inputs
    norm_in = np.zeros(nIn)
    for m in range(nIn):
        norm_in[m] = (Inputs[m]-input_ranges[m][0])/(input_ranges[m][1]-input_ranges[m][0])

    # Create array to store outputs
    u = np.zeros(nOut)

    # Calculate RBFs
    for k in range(nOut):
        for n in range(nRBF):
            BF = 0
            for m in range(nIn):
                if R[m,n] > 10**-6: # set so as to avoid division by 0
                    BF = BF + ((norm_in[m]-C[m,n])/R[m,n])**2
                else:
                    BF = BF + ((norm_in[m]-C[m,n])/(10**-6))**2
            u[k] = u[k] + W[k,n]*np.exp(-BF)

    # De-normalize outputs
    norm_u = np.zeros(nOut)
    for k in range(nOut):
        norm_u[k] = output_ranges[k][0] + u[k]*(output_ranges[k][1]-output_ranges[k][0])

    return norm_u

# Intro to Boosting

# Introduction

Once upon a time, in a machine learning class project, student Michael Kearns asked if it is possible to convert a weak classifier (a high-bias classifier whose outputs are correct only slightly over 50% of the time) into a classifier of arbitrary accuracy by using an ensemble of such classifiers. This question was posed in 1988 and, two years later, in 1990, Robert Schapire answered that it is possible (Schapire, 1990). And so boosting was born.
The idea of boosting is to train an ensemble of weak classifiers, each of which is an expert in a specific region of the space. The ensemble of classifiers has the form below:

$H(\vec{x}) = \sum_{t=1}^T \alpha_t h_t(\vec{x})$

where H is the ensemble of classifiers, $\alpha_t$ is the weight assigned to weak classifier $h_t$ around samples $\vec{x}$ at iteration t, and T is the number of classifiers in the ensemble.

Boosting creates such an ensemble in a similar fashion to gradient descent. However, instead of:

$\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \alpha \nabla \ell(\vec{x}_t)$

as in gradient descent in real space, where $\ell$ is a loss function, Boosting is trained via gradient descent in functional space, so that:

$H_{t+1}(\boldsymbol{X}) = H_t(\boldsymbol{X}) + \alpha_t \nabla \ell(h_t(\boldsymbol{X}))$

The question then becomes how to find the $\alpha_t$’s and the classifiers $h_t$. Before answering these questions, we should get a geometric intuition for Boosting first. The presentation in the following sections is based on these course notes.

# Geometric Intuition

In the example below we have a set of blue crosses and red circles we would like our ensemble of weak classifiers to correctly classify (panel “a”). Our weak classifier for this example will be a line, orthogonal to one of the axes, dividing the blue-cross region from the red-circle region. Such a classifier can also be called a CART tree (Breiman et al., 1984) with depth of 1, hereafter called a tree stump. For now, let’s assume all tree stumps will have the same weight $\alpha_t$ in the final ensemble.

The first tree stump, a horizontal divide in panel “b,” classified ten out of thirteen points correctly but failed to classify the remaining three. Since it incorrectly classified a few points in the last attempt, we would like the next classifier to correctly classify these points.
To make sure that will be the case, boosting will increase the weight of the points that were misclassified earlier before training the new tree stump. The second tree stump, a vertical line on panel “c,” correctly classifies the two blue crosses that were originally incorrectly classified, although it incorrectly classifies three crosses that were originally correctly classified. For the third classifier, Boosting will now increase the weight of the three bottom misclassified crosses as well as of the other two misclassified crosses and the circle, because they are still not correctly classified. Technically speaking they are tied, since each of the two classifiers classifies them in a different way, but here we consider this a wrong classification. The third iteration will then prioritize correcting the high-weight points again, and will end up as the vertical line on the right of panel “d.” Now, all points are correctly classified.

There are different ways to mathematically approach Boosting. But before getting to Boosting, it is a good idea to go over gradient descent, which is a basis of Boosting. Following that, this post will cover, as an example, the AdaBoost algorithm, which assumes an exponential loss function.

# Minimizing a Loss Function

## Standard gradient descent

Before getting into Boosting per se, it is worth going over the standard gradient descent algorithm. Gradient descent is a minimization algorithm (hence, descent). The goal is to move from an initial $x_0$ to the value of $x$ with the minimum value of $f(x)$, which in machine learning is a loss function, henceforth called $\ell(x)$. Gradient descent does that by moving one step of length s at a time, starting from $x_0$, in the direction of the steepest downhill slope at location $x_t$, the value of $x$ at the $t^{th}$ iteration. This idea is formalized by the Taylor series expansion below:

$\ell(x_{t + 1}) = \ell(x_t - s) \approx \ell(x_t) - s^Tg(x_t)$

where $g(x_t)=\nabla\ell(x_t)$.
Furthermore, $s=\alpha g(x_t)$, which leads to:

$\ell\left[x_t -\alpha g(x_t)\right] \approx \ell(x_t)-\alpha g(x_t)^Tg(x_t)$

where $\alpha$, called the learning rate, must be positive and can be set as a fixed parameter. The dot product on the last term, $g(x_t)^Tg(x_t)$, will also always be positive, which means that the loss *should* always decrease. The reason for the italics is that too high values for $\alpha$ may make the algorithm diverge, so small values around 0.1 are recommended.

## Gradient Descent in Functional Space

What if $x$ is actually a function instead of a real number? This would mean that the loss function $\ell(\cdot)$ would be a function of a function, say $\ell(H(\boldsymbol{x}))$, instead of a function of a real number x. As mentioned, Gradient Descent in real space works by adding small quantities to $x_0$ to find the final $x_{min}$, which is an ensemble of small $\Delta x$’s added together. By analogy, gradient descent in functional space works by adding functions to an ever growing ensemble of functions. Using the definition of a functional gradient, which is beyond the scope of this post, this leads us to:

$\ell(H+\alpha h) \approx \ell(H) + \alpha \langle\nabla \ell(H),h\rangle$

where H is an ensemble of functions, h is a single function, and the $\langle f, g\rangle$ notation denotes a dot product between f and g. Gradient descent in function space is an important component of Boosting. Based on that, the next section will talk about AdaBoost, a specific Boosting algorithm.

# AdaBoost

## Basic Definitions

The goal of AdaBoost is to find an ensemble H of weak classifiers h that minimizes the exponential loss function below for a binary classification problem:

$\ell(H)=\sum_{i=1}^ne^{-y(x_i)H(x_i)}$

where $x_i, y(x_i)$ is the $i^{th}$ data point in the training set.
The step size $\alpha$ can be interpreted as the weight of each classifier in the ensemble, which is optimized for each function h added to the ensemble. AdaBoost is an algorithm for binary classification, meaning the independent variables $\boldsymbol{x}$ have a corresponding vector of dependent variables $\boldsymbol{y}$, in which each entry $y(x_i) \in \{-1, 1\}$ is the classification of one point, with -1 and 1 representing the two classes to which a point may belong (say, -1 for red circles and 1 for blue crosses). The weak classifiers h in AdaBoost also return $h(x) \in \{-1, 1\}$.

## Setting the Weights of Each Classifier

The weight $\alpha_t$ of each weak classifier h can be found by performing the following minimization:

$\alpha=argmin_{\alpha}\ell(H+\alpha h)$

Since the loss function is defined as the summation of the exponential loss of each point in the training set, the minimization problem above can be expanded into:

$\alpha=argmin_{\alpha}\sum_{i=1}^ne^{-y(\boldsymbol{x}_i)\left[H(\boldsymbol{x}_i)+\alpha h(\boldsymbol{x}_i)\right]}$

Differentiating the loss w.r.t. $\alpha$, equating it with zero and performing a few steps of algebra leads to:

$\alpha = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$

where $\epsilon$ is the classification error of weak classifier $h_t$. The error $\epsilon$ can be calculated for AdaBoost as shown next.

## The Classification Error of one Weak Classifier

Following from gradient descent in functional space, the next best classifier $h_{t+1}$ will be the one that minimizes the term $\langle\nabla \ell(H),h\rangle$, which when zero would mean that the zero-slope ensemble has been reached, denoting that the minimum value of the loss function has been reached for a convex problem such as this.
Replacing the dot product by a summation, this minimization problem can be written as:

$h=argmin_h\epsilon=argmin_h\langle\nabla \ell(H),h\rangle$

$h=argmin_h\sum_{i=1}^n\frac{\partial e^{-y(\boldsymbol{x}_i)H(\boldsymbol{x}_i)}}{\partial H(\boldsymbol{x}_i)}h(\boldsymbol{x}_i)$

which after some algebra becomes:

$h=argmin_h\sum_{i:h(\boldsymbol{x}_i)\neq y(\boldsymbol{x}_i)}w_i$

(the summation of the weights of misclassified points)

Comparing the last with the first expression, we have that the error $\epsilon$ for iteration t is simply the summation of the weights of the points misclassified by $h_t$. For example, in panel “b” the error would be the summation of the weights of the two crosses on the upper left corner and of the circle at the bottom right corner. Now let’s get to these weights.

## The Weights of the Points

There are multiple ways we can think of for setting the weights of each data point at each iteration. Schapire (1990) found a great way of doing so:

$\boldsymbol{w}_{t+1}=\frac{\boldsymbol{w}_{t}}{Z}e^{-\alpha_th_t(\boldsymbol{x})y(\boldsymbol{x})}$

where $\boldsymbol{w}$ is a vector containing the weights of each point and Z is a normalization factor to ensure the weights will sum up to 1. Be sure not to confuse $\alpha_t$, the weight of classifier t, with the weight of the points at iteration t, represented by $\boldsymbol{w}_t$. For the weights to sum up to 1, Z needs to be the sum of their pre-normalization values, which is actually identical to the loss function, so

$Z=\sum_{i=1}^ne^{-y(\boldsymbol{x}_i)H(\boldsymbol{x}_i)}$

Using the definition of the error $\epsilon$, the update for Z can be shown to be:

$Z_{t+1}=Z_t\cdot2\sqrt{\epsilon(1-\epsilon)}$

so that the complete update is:

$w_{t+1}=w_t \frac{e^{-\alpha_th_t(\boldsymbol{x})y(\boldsymbol{x})}}{2\sqrt{\epsilon(1-\epsilon)}}$

Now AdaBoost is ready for implementation following the pseudo-code below.
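To make the update rules derived above concrete, here is a minimal NumPy sketch of AdaBoost. It assumes axis-aligned tree stumps as the weak learner and a brute-force search over thresholds; the function names (`fit_stump`, `stump_predict`, `adaboost`, `predict`) and the toy data set are illustrative choices of mine, not part of any library or of the original pseudo-code.

```python
import numpy as np

def stump_predict(X, j, thr, pol):
    """Tree stump: predict `pol` where feature j <= thr, and -pol elsewhere."""
    return np.where(X[:, j] <= thr, pol, -pol)

def fit_stump(X, y, w):
    """Brute-force search for the stump with the lowest weighted error
    (the sum of the weights of the misclassified points)."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                err = w[stump_predict(X, j, thr, pol) != y].sum()
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, T=50):
    """Return a list of (alpha, feature, threshold, polarity) tuples."""
    n = len(y)
    w = np.full(n, 1.0 / n)           # all weights start at 1/n
    ensemble = []
    for _ in range(T):
        eps, j, thr, pol = fit_stump(X, y, w)
        if eps >= 0.5:                # no stump better than random guessing
            break
        eps = max(eps, 1e-10)         # guard against division by zero
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        h = stump_predict(X, j, thr, pol)
        w = w * np.exp(-alpha * h * y)
        w = w / w.sum()               # same as dividing by 2*sqrt(eps*(1-eps))
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def predict(ensemble, X):
    """Sign of the weighted vote H(x) = sum_t alpha_t * h_t(x)."""
    H = sum(a * stump_predict(X, j, thr, pol) for a, j, thr, pol in ensemble)
    return np.sign(H)

# Toy 1-D problem that no single stump can solve: + + + - - - + + +
X_toy = np.arange(9, dtype=float).reshape(-1, 1)
y_toy = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1])
ensemble = adaboost(X_toy, y_toy, T=50)
print("training accuracy:", np.mean(predict(ensemble, X_toy) == y_toy))
```

Normalizing the weights by their sum is used here instead of the closed-form factor $2\sqrt{\epsilon(1-\epsilon)}$; the two are equivalent when $\alpha$ is set as above, and the explicit sum is a little more robust to rounding.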
## AdaBoost Pseudo-code

Below is a pseudo-code of AdaBoost. Note that it can be used with any weak learner (high bias) classifier. Again, shallow decision trees are a common choice for their simplicity and good performance.

## (Next to) No Overfitting

Now, here is the most fascinating thing about boosting: theoretically speaking, Boosted algorithms do not overfit. Recall that Z, the normalization factor for the point weights update, equals the loss function. That being the case, we get the following relation:

$\ell(H)=Z=n\prod_{t=1}^T2\sqrt{\epsilon_t(1-\epsilon_t)}$

where n is the normalizing factor for all weights at step 0 (all of the weights are initially set to 1/n). To derive an expression for the upper bound of the error, let’s assume that the errors at all steps t equal their highest value, $\epsilon_{max}$. We have that:

$\ell(H)\leq n\left[2\sqrt{\epsilon_{max}(1-\epsilon_{max})}\right]^T$

Given that a weak classifier must do better than random guessing, $\epsilon_{max} < \frac{1}{2}$, so we have that $\epsilon_{max}(1-\epsilon_{max})<\frac{1}{4}$, or $\epsilon_{max}(1-\epsilon_{max})=\frac{1}{4}-\gamma^2$ for some $\gamma$ in the interval $\left(0,\frac{1}{2}\right]$. Therefore, replacing the equation above in the first loss inequality written as a function of $\epsilon_{max}$, we have that:

$\ell(H)\leq n(1-4\gamma^2)^{T/2}$

which means that the training error is bounded by an exponential decay as you add classifiers to the ensemble. This is a fantastic result and applies to any boosted algorithm!

# Final Remarks

In this post, I presented the general idea of boosting a weak classifier, emphasizing its use with shallow CART trees, and used the AdaBoost algorithm as an example. However, other loss functions can be used and boosting can also be used for non-binary classifiers and for regression. The Python package scikit-learn in fact allows the user to use boosting with different loss functions and with different weak classifiers.
Despite the theoretical proof that Boosting does not overfit, researchers running it for extremely long times on rather big supercomputers found that at some point it starts to overfit, although still very slowly. Still, that is not likely to happen in your application. Lastly, Boosting with shallow decision trees is also a great way to have fine control over how much bias there will be in your model, as all you need to do is choose the number of iterations T.

# Bibliography

Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees (Monterey, California: Wadsworth).

Schapire, R. E. (1990). The strength of weak learnability. Machine learning, 5(2), 197-227.

# Intro to Machine Learning Part 6: Gaussian Naive Bayes and Logistic Regression

Machine Learning problems often involve binary classification, which seeks to use a data point’s features, x, to correctly predict its label, y. In my last post I discussed binary classification with Support Vector Machines (SVM), which formulates the classification problem as a search for the maximum margin hyperplane that divides two classes. Today we’ll take a different view on binary classification: we’ll use our training set to construct P(y|x), the probability of class y given a set of features x, and classify each point by determining which class it is more likely to belong to. We’ll examine two algorithms that use different strategies for estimating P(y|x), Naïve Bayes and Logistic regression.

I’ll demonstrate the two classifiers on an example data set I’ve created, shown in Figure 1 below. The data set contains features X = (X1, X2) and labels Y ∈ {+1, -1}; positive points are shown as blue circles and negative points as red triangles. This example was inspired by an in-class exercise in CS 5780 at Cornell, though I’ve created this data set and code myself using Python’s scikit-learn package.
Figure 1: Example training set

## Gaussian Naïve Bayes

Naïve Bayes is a generative algorithm, meaning that it uses a set of training data to estimate P(x,y) and then uses Bayes Rule to find P(y|x):

$P(y|x)=\frac{P(x|y)P(y)}{P(x)}$                              (1)

To make P(x|y) in equation 1 tractable to estimate, we make the Naïve Bayes assumption, which states that feature values are independent given the label. While this is a strong assumption, it turns out that using this assumption can create effective classifiers even if it is violated.

To use Bayes rule to construct a classifier, we need a second assumption regarding the conditional distribution of each feature x on each label y. Here we’ll use a Gaussian distribution such that:

$P(x|y) \sim N(\mu_y, \Sigma_y)$                              (2)

Where $\Sigma_y$ is a diagonal covariance matrix with $[\Sigma_y]_{\alpha,\alpha}=\sigma^2_{\alpha, y}$ for each feature $\alpha$.

For each feature, $\alpha$, and each class, c, we can then model $P(x_\alpha|y)$ as:

$P(x_\alpha|y=c) \sim N(\mu_{\alpha c},\sigma^2_{\alpha c})=\frac{1}{\sqrt{2\pi}\sigma_{\alpha c}}e^{-\frac{1}{2}\left(\frac{x_\alpha-\mu_{\alpha c}}{\sigma_{\alpha c}}\right)^{2}}$                              (3)

We can then estimate model parameters:

$\mu_{\alpha c} = \frac{1}{n_c}\sum^{n}_{i=1}I(y_i=c)x_{i \alpha}$                                                                   (4)

$\sigma^2_{\alpha c} = \frac{1}{n_c}\sum^{n}_{i=1}I(y_i=c)(x_{i \alpha}-\mu_{\alpha c})^2$                                                  (5)

Where:

$n_c = \sum^{n}_{i=1}I(y_i=c)$                                                                                (6)

Parameters can be estimated with Maximum likelihood estimation (MLE) or maximum a posteriori estimation (MAP).
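As a rough illustration of equations 3 through 6, the sketch below computes the MLE moments with NumPy and plugs them into Bayes rule. It uses the same points as the example data set in this post; the helper names `fit_gnb` and `posterior_positive` are my own and are not scikit-learn functions.

```python
import numpy as np

# Example training set from this post (13 positive and 15 negative points)
pos = np.array([[1,5], [1,7], [1,9], [2,8], [3,7], [1,11], [3,3],
                [5,5], [4,8], [5,9], [2,6], [3,9], [4,4]], dtype=float)
neg = np.array([[4,1], [5,1], [3,2], [2,1], [8,4], [6,2], [5,3],
                [4,2], [7,1], [5,4], [6,3], [7,4], [4,3], [5,2], [8,5]], dtype=float)
X = np.concatenate((pos, neg), 0)
y = np.array([1] * len(pos) + [-1] * len(neg))

def fit_gnb(X, y):
    """MLE of the per-class, per-feature Gaussian moments (equations 4-6)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        n_c = Xc.shape[0]                          # equation 6
        mu = Xc.sum(axis=0) / n_c                  # equation 4
        var = ((Xc - mu) ** 2).sum(axis=0) / n_c   # equation 5
        prior = n_c / len(y)                       # MLE of P(y = c)
        params[int(c)] = (mu, var, prior)
    return params

def posterior_positive(x, params):
    """P(y = +1 | x) from Bayes rule, computed with log probabilities."""
    log_joint = {}
    for c, (mu, var, prior) in params.items():
        # Sum of the log of equation 3 over features (Naive Bayes assumption)
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        log_joint[c] = np.log(prior) + log_lik
    shift = max(log_joint.values())                # avoids underflow in exp
    evidence = sum(np.exp(v - shift) for v in log_joint.values())
    return np.exp(log_joint[1] - shift) / evidence

params = fit_gnb(X, y)
print("class means:", params[1][0], params[-1][0])
print("P(y=+1 | x=(1,9)):", posterior_positive(np.array([1.0, 9.0]), params))
print("P(y=+1 | x=(4,4)):", posterior_positive(np.array([4.0, 4.0]), params))
```

Working in log space is a design choice rather than a requirement; it keeps the product of small Gaussian densities from underflowing when there are many features.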

Once we have fit the conditional Gaussian model to our data set, we can derive a linear classifier, a hyperplane that separates the two classes,  which takes the following form:

$P(y|x) = \frac{1}{1+e^{-y(w^T x+b)}}$                                                                             (7)

Where w is a vector of coefficients that define the separating hyperplane and b is the hyperplane’s intercept. Both w and b are functions of the Gaussian moments derived in equations 4 and 5. For a full derivation of the linear classifier starting with the Naive Bayes assumption, see the excellent course notes from CS 5780.
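For readers who want a feel for where w and b come from, here is a short sketch of the log-odds under one extra assumption that I am adding for illustration: each feature's variance is shared by both classes, $\sigma^2_{\alpha,+1}=\sigma^2_{\alpha,-1}=\sigma^2_\alpha$. Without that assumption the log-odds also contain terms that are quadratic in x. The course notes remain the reference for the full derivation.

```latex
\begin{align*}
\log\frac{P(y=+1|x)}{P(y=-1|x)}
  &= \log\frac{P(y=+1)}{P(y=-1)}
   + \sum_{\alpha}\left[\log P(x_\alpha|y=+1) - \log P(x_\alpha|y=-1)\right] \\
  &= \log\frac{P(y=+1)}{P(y=-1)}
   + \sum_{\alpha}\left[\frac{\mu_{\alpha,+1}-\mu_{\alpha,-1}}{\sigma^2_\alpha}\,x_\alpha
   - \frac{\mu^2_{\alpha,+1}-\mu^2_{\alpha,-1}}{2\sigma^2_\alpha}\right] \\
  &= w^T x + b, \\
w_\alpha &= \frac{\mu_{\alpha,+1}-\mu_{\alpha,-1}}{\sigma^2_\alpha}, \qquad
b = \log\frac{P(y=+1)}{P(y=-1)}
   - \sum_{\alpha}\frac{\mu^2_{\alpha,+1}-\mu^2_{\alpha,-1}}{2\sigma^2_\alpha}
\end{align*}
```

Solving the first line for $P(y=+1|x)$ gives $1/(1+e^{-(w^Tx+b)})$, and $P(y=-1|x)=1/(1+e^{(w^Tx+b)})$, which together are the form in equation 7.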

## Logistic Regression

Logistic regression is the discriminative counterpart to Naive Bayes. Rather than modeling P(x,y) and using it to estimate P(y|x), logistic regression models P(y|x) directly:

$P(y|x) = \frac{1}{1+e^{-y(w^T x+b)}}$                                                                              (8)

Logistic regression uses MLE or MAP to directly estimate the parameters of the separating hyperplane, w and b, rather than deriving them from the moments of P(x,y). Rather than seeking to fit parameters that best describe the training data, logistic regression seeks to fit a hyperplane that best separates the training data. For derivation of MLE and MAP estimates of logistic regression parameters, see the class notes from CS 5780.
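As an illustration of the MLE route, the sketch below fits w and b by plain gradient descent on the negative log-likelihood implied by equation 8. This is a stand-in I chose for clarity, not the solver scikit-learn uses, and the learning rate, iteration count, and function names (`logistic_loss`, `fit_logistic`) are assumptions. The data are the same example points used in this post.

```python
import numpy as np

# Example training set from this post (13 positive and 15 negative points)
pos = np.array([[1,5], [1,7], [1,9], [2,8], [3,7], [1,11], [3,3],
                [5,5], [4,8], [5,9], [2,6], [3,9], [4,4]], dtype=float)
neg = np.array([[4,1], [5,1], [3,2], [2,1], [8,4], [6,2], [5,3],
                [4,2], [7,1], [5,4], [6,3], [7,4], [4,3], [5,2], [8,5]], dtype=float)
X = np.concatenate((pos, neg), 0)
y = np.array([1] * len(pos) + [-1] * len(neg))

def logistic_loss(w, b, X, y):
    """Mean negative log-likelihood: mean of log(1 + exp(-y (w^T x + b)))."""
    margins = y * (X @ w + b)
    return np.mean(np.logaddexp(0.0, -margins))

def fit_logistic(X, y, lr=0.05, n_iter=5000):
    """Unregularized MLE of (w, b) by gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        margins = y * (X @ w + b)
        # 1 / (1 + exp(margin)), computed in a numerically stable way
        p_wrong = np.exp(-np.logaddexp(0.0, margins))
        coeff = -y * p_wrong              # derivative of each loss term w.r.t. (w^T x + b)
        w -= lr * (X.T @ coeff) / len(y)
        b -= lr * coeff.mean()
    return w, b

w, b = fit_logistic(X, y)
print("w =", w, "b =", b)
print("loss:", logistic_loss(w, b, X, y))
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```

Note that scikit-learn's LogisticRegression applies L2 regularization by default, which corresponds to a MAP estimate, so its fitted coefficients will generally differ from this unregularized sketch.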

## Comparing Gaussian Naive Bayes and Logistic Regression

Below I’ve plotted the estimated classifications by the two algorithms using the Scikit-learn package in Python. Results are shown in Figure 2.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
import seaborn as sns
sns.set(style='whitegrid')

## create the example training data set ##
pos = np.array([[1,5], [1,7], [1,9], [2,8], [3,7], [1,11], [3,3], \
[5,5], [4,8], [5,9], [2,6], [3,9], [4,4]])
neg = np.array([[4,1], [5,1], [3,2], [2,1], [8,4], [6,2], [5,3], \
[4,2], [7,1], [5,4], [6,3], [7,4], [4,3], [5,2], [8,5]])
all_points = np.concatenate((pos,neg), 0)
labels = np.array([1,1,1,1,1,1,1,1,1,1,1,1,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1])

## compare Naive Bayes and Logistic Regression ##

# Fit Naive Bayes
gnb = GaussianNB()
gnb.fit(all_points, labels)

# make NB predictions and plot
x1_mesh, x2_mesh = np.meshgrid(np.arange(0,11,1), np.arange(0,11,1))
Y_NB = gnb.predict_proba(np.c_[x1_mesh.ravel(), x2_mesh.ravel()])[:,1]
Y_NB = Y_NB.reshape(x1_mesh.shape)

fig1, axes = plt.subplots(1,2, figsize=(10,4))

axes[0].contourf(x1_mesh, x2_mesh, Y_NB, levels=(np.linspace(0,1.1,3)), \
cmap='RdBu')
axes[0].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[0].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[0].set_xlim([0,10]); axes[0].set_ylim([0,10]); axes[0].set_xlabel('X1')
axes[0].set_ylabel('X2'); axes[0].set_title('Naive Bayes')
#plt.legend(['Positive Points', 'Negative Points'], scatterpoints=1)
#.savefig('NB_classification.png', bbox_inches='tight')

# Fit Logistic Regression
lr = LogisticRegression()
lr.fit(all_points, labels)

# Make predictions and plot
Y_LR = lr.predict_proba(np.c_[x1_mesh.ravel(), x2_mesh.ravel()])[:,1]
Y_LR = Y_LR.reshape(x1_mesh.shape)

axes[1].contourf(x1_mesh, x2_mesh, Y_LR, levels=(np.linspace(0,1.1,3)), \
cmap='RdBu')
axes[1].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[1].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[1].set_xlim([0,10]); axes[1].set_ylim([0,10]); axes[1].set_xlabel('X1');
axes[1].set_ylabel('X2'); axes[1].set_title("Logistic Regression")
plt.savefig('compare_classification.png', bbox_inches='tight')



Figure 2: Example classification with Gaussian Naive Bayes (left) and Logistic Regression (right). Blue shaded areas represent predictions of positive labels; red shaded areas represent predictions of negative labels.

Figure 2 illustrates an important difference in the treatment of outliers between the two classifiers. Gaussian Naive Bayes assumes that points close to the centroid of a class are likely to be members of that class, which leads it to mislabel positive training points with features (3,3), (4,4) and (5,5). Logistic regression, on the other hand, is only concerned with correctly classifying points, so the signal from the outliers is more influential on its classification.

So which algorithm should you use? The answer, as usual, is that it depends. In this example, logistic regression is able to correctly classify the outliers with positive labels while Naïve Bayes is not. If these points are indeed an indicator of the underlying structure of positive points, then logistic regression has performed better. On the other hand, if they are truly outliers, then Naïve Bayes has performed better. In general, Logistic Regression has been found to outperform Naïve Bayes on large data sets but is prone to overfit small data sets. The two algorithms will converge asymptotically if the Naïve Bayes assumption holds.

## Visualizing P(y|x)

One advantage to these methods for classification is that they provide estimates of P(y|x), whereas other methods such as SVM only provide a separating hyperplane. These probabilities can be useful in decision making contexts such as scenario discovery for water resources systems, demonstrated in Quinn et al., 2018. Below, I use scikit-learn to plot the classification probabilities for both algorithms.

# plot Naive Bayes predicted probabilities
fig2, axes = plt.subplots(1,2, figsize=(12,4))
axes[0].contourf(x1_mesh, x2_mesh, Y_NB, levels=(np.linspace(0,1,100)), \
cmap='RdBu')
axes[0].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[0].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[0].set_xlim([0,10]); axes[0].set_ylim([0,10]); axes[0].set_xlabel('X1');
axes[0].set_ylabel('X2'); axes[0].set_title('Naive Bayes')

# plot Logistic Regression predicted probabilities
LRcont = axes[1].contourf(x1_mesh, x2_mesh, Y_LR, levels=(np.linspace(0,1,100)), \
cmap='RdBu')
axes[1].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[1].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[1].set_xlim([0,10]); axes[1].set_ylim([0,10]); axes[1].set_xlabel('X1')
axes[1].set_ylabel('X2'); axes[1].set_title('Logistic Regression')
cb = fig2.colorbar(LRcont, ax=axes.ravel().tolist())
cb.set_label('Probability of Positive Classification')
cb.set_ticks([0, .25, .5, .75, 1])
cb.set_ticklabels(["0", "0.25", "0.5", "0.75", "1.0"])
plt.savefig('compare_probs.png', bbox_inches='tight')



Figure 3: Conditional probabilities P(y|x) generated by Naive Bayes (left) and Logistic Regression.

This post has focused on Gaussian Naive Bayes as it is the direct counterpart of Logistic Regression for continuous data. It’s important to note, however, that Naive Bayes is frequently used on data with binomial or multinomial features. Examples include spam filters and language classifiers. For more information on Naive Bayes in these contexts, see these notes from CS 5780.

As mentioned above, logistic regression has been used for scenario discovery in water resources systems. For more detail and context, see Julie’s blog post.

## References

Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

Course Notes from the University of Pennsylvania (CIS 520): https://alliance.seas.upenn.edu/~cis520/wiki/index.php?n=Lectures.Logistic

Course Notes from Cornell: http://www.cs.cornell.edu/courses/cs4780/2018fa/syllabus/index.html

Quinn, J. D., Reed, P. M., Giuliani, M., Castelletti, A., Oyler, J. W., & Nicholas, R. E. (2018). Exploring how changing monsoonal dynamics and human pressures challenge multireservoir management for flood protection, hydropower production, and agricultural water supply. Water Resources Research, 54, 4638–4662. https://doi.org/10.1029/2018WR022743