# Welcome to our blog!

Welcome to Water Programming! This blog is by Pat Reed’s group at Cornell, who use computer programs to solve problems — Multiobjective Evolutionary Algorithms (MOEAs), simulation models, visualization, and other techniques. Use the search feature and categories on the right panel to find topics of interest. Feel free to comment, and contact us if you want to contribute posts.

To find software:  Please consult the Pat Reed group website, MOEAFramework.org, and BorgMOEA.org.

The MOEAFramework Setup Guide: A detailed guide is now available. The focus of the document is connecting an optimization problem written in C/C++ to MOEAFramework, which is written in Java.

The Borg MOEA Guide: We are currently writing a tutorial on how to use the C version of the Borg MOEA, which is being released to researchers here.

Call for contributors: We want this to be a community resource to share tips and tricks. Are you interested in contributing? Please email dfg42 “at” cornell.edu. You’ll need a WordPress.com account.

# Introducing Julia: A Fast and Modern Language

In this blog post, I am introducing Julia, a high-level open-source dynamic programming language. While it has taken root in finance, machine learning, life sciences, energy, optimization, and economic realms, it is certainly not common within the water programming realm. Through this post, I will give an overview of the features, pros, and cons of Julia before comparing its performance to Python. Following this, I give an overview of common Julia development environments as well as linking to additional resources.

## Julia: What’s Going On?

Julia is a high-level open-source dynamic programming language built for scientific computing and data processing that is Pythonic in syntax to ensure accessibility while boosting computational efficiency.  It is parallelizable on high performance computing resources utilizing CPUs and GPUs, allowing for its use in large-scale experiments.

With Version 1.1.0 recently released, Julia now stands amongst the established and stable programming languages. As such, Julia can handle matrices with ease while performing at speeds nearly comparable to C and Fortran. It is comparable to Cython/Numba and faster than Numpy, Python, Matlab, and R. Note that further benchmarking is required on a project-specific basis due to the speed of individual packages/libraries, but a simple case is shown in the following section.

The biggest case for implementing Julia over C/C++/Fortran is its simplicity in writing and implementation. Its language is similar Python, allowing for not only list comprehension but also ‘sloppy’ dynamic variable type declarations and use. Julia has the ability to move between dynamic and static variable types. Furthermore, Julia allows users to create unique variable types, allowing for maximum flexibility while maintaining efficiency.

Installing and implementing packages from the start is nearly seamless, simply requiring the name of the package and an internet connection. Most packages and libraries are hosted on GitHub and are relatively straightforward to install with just a couple lines of code. On the same hand, distributing and contributing to the opensource community is straightforward. Furthermore, Julia can access Python modules with wrappers and C/Fortran functions without, making it a very versatile language. Likewise, it can easily be called from Python (PyJulia) to speed up otherwise cumbersome operations.

Julia has relatively straightforward profiling modules built in. Beyond being able to utilize Jupyter Notebook commands, Julia has a range of code diagnostic tools, such as a @Time command to inform the user of wall time and memory allocation for a function.

One of the most obvious downsides to Julia when compared to Python is the relatively small community. There is not nearly the base of documentation that I am accustomed to; nearly every issue imaginable in the Python world has been explored on Stack Overflow, Julia is much more limited. However, because much of its functionality is based on MATLAB, similar issues and their corresponding solutions bleed over.

Additional transition issues that may be encountered is that unlike Python, Julia’s indexing begins with 1. Furthermore, Julia uses column-major ordering (like Matlab, Fortran, and R) unlike Python and C (row-major ordering), making iterating column-by-column substantially faster. Thus, special care must be taken when switching between languages in addition to ensuring consistent strategies for speeding up code. Another substantial transition that may be required is that Julia is not directly object-oriented since objects do not directly have embedded methods; instead, a given object must be passed to a function to be modified.

Overall, I see Julia as having the potential to improve computational efficiency for computationally-intensive models without the need to learn a more complex language (e.g. Fortran or C/C++). Between its ease of integration between other common languages/libraries to its ability to properly utilize HPC CPU/GPU resources, Julia may be a useful tool to utilize alongside Python to create more efficient large-scale models.

## Julia Allows for Faster Matrix Operations

As a simple test to compare scalability of Julia and Python, I computed the dot product of increasingly large matrices using the base Julia language and Numpy in Python (both running in Jupyter Notebooks on a Windows desktop). You can find the code at the following link on Github.

As you can see in the generated figure, Julia outperforms Python for large matrix applications, indicating that Julia is more efficient and may be worth using for larger and larger operations. If you have any suggestions on improving this code for comparison, please let me know; I am absolutely game for improvement.

## Development Environments for Julia

The first and most obvious variables on what development environment is best is you, the user. If you’re wanting all the bells and whistles versus a simple text-editing environment, the range of products to fulfill you desires exists.

Juno—an extension to Atom—is the default IDE that is incorporated with JuliaPro, a pre-packaged version of Julia that includes a range of standard packages—Anaconda is the parallel in the Python world.

Jupyter Notebooks  are my go-to development environment for experimenting with code in both Python and Julia, especially for testing small sections of code.  Between being able to easily create visuals and examples inline with your code and combining code with Markdown, it allows for clean and interactive approach to sharing code or teaching. Note that installation generally requires IJulia but can allow for easy integration into already existing Jupyter workflows.

JetBrains IDEs (e.g. CLion, PyCharm) support a Julia plugin that supports a wide range of the same features JetBrains is known for. However, it is still in beta and is working to improve its formatter and debugging capabilities.

JuliaBox is an online web interface using Jupyter Notebooks for code development on a remote server, requiring no local installation of Julia. Whether you are in a classroom setting, wanting to write code on your phone, or are just wanting to experiment with parallel computing (or don’t have access to HPC resources), JuliaBox allows for a nearly seamless setup experience. However, note that this development environment limits each session to 90 minutes (up to 8 hours on a paid subscription) and requires a subscription to access any resources beyond 3 CPU cores. Note that you can access GPU instances with a paid subscription.

Julia Studio is a default IDE used by a range of users. It is a no-frills IDE based on Qt Creator, resulting in a clean look.

For anyone looking to use Visual Studio, you can install a VS Code extension.

Vim is not surprisingly available for Julia.  And if you’re feeling up the challenge, Emacs also has an interface.

Of course, you can just use a texteditor of your choice (e.g. Sublime Text with an extension) and simply run Julia from the terminal.

## Acknowledgements

Thanks to Dave Gold for his assistance in giving guidance for benchmarking and reminding me of the KISS principle.

# Introduction to Unit Testing

## Motivation

Here is something that has happened to most or all of us engineers who code. We write a function or a class and make sure it works by running it and printing some internal state output on the console. The following week, after adding or making improvements on seemingly unrelated parts of the code, the results start to look weird. After a few hours of investigation (a.k.a. wasted time) we find out the reason for the weird results is that that first function or class is either not giving us the right numbers anymore or not behaving as expected with different inputs. I wonder how many papers would have been written, how many deadlines would have been met, and how many “this $@#%& does not @#$% work!!!” thoughts and comments to friends and on our codes themselves would have been avoided if we managed to get rid of such frustrating and lengthy situations.

In this post I will introduce a solution to this rather pervasive issue: unit testing, which amounts to ways of creating standard tests throughout the development of a code to avoid the mentioned setbacks and simplify debugging. I will also get into some of the specifics of unit testing for Python and demonstrate it with a reservoir routing example. Lastly, I will briefly mention a unit testing framework for C++.

## The General Idea

Unit test frameworks are libraries with functions (or macros) that allow for the creation of standard tests for individual parts of code, such as function and classes, that can be run separately from the rest of code. The advantages of using Unit Testing are mainly that (1) it eliminates the need of running the entire code with a full set of inputs simply to see if a specific function or class is working, which in some cases can get convoluted take a really long time, (2) it allows for easy and fast debugging, given that if a certain test fails you know where the problem is, (3) it prevents you or someone else from breaking the code down the road by running the tests frequently as the code is developed, and (4) it prevents an error in a function or class to be made noticible in another, which complicates the debugging process. Of course, the more modular a code is the easier it is to separately test its parts — a code comprised of just one 1,000-lines long function cannot be properly unit-tested, or rather can only have one unit test for the entire function.

## Example Code to be Tested

The first thing we need to perform unit testing is a code to be tested. Let’s then consider the reservoir routing code below, comprised of a reservoir class, a synthetic stream flow generation function, a function to run the simulation over time, and the main.:

import numpy as np
import matplotlib.pyplot as plt

class Reservoir:
__capacity = -1
__storage_area_curve = np.array([[], []])
__evaporation_series = np.array([])
__inflow_series = np.array([])
__demand_series = np.array([])
__stored_volume = np.array([])

def __init__(self, storage_area_curve, evaporations, inflows, demands):
""" Constructor for reservoir class

Arguments:
storage_area_curve {2D numpy matrix} -- Matrix with a
row of storages and a row of areas
evaporations {array/list} -- array of evaporations
inflows {array/list} -- array of inflows
demands {array/list} -- array of demands
"""

self.__storage_area_curve = storage_area_curve
assert(storage_area_curve.shape[0] == 2)
assert(len(storage_area_curve[0]) == len(storage_area_curve[1]))

self.__capacity = storage_area_curve[0, -1]

n_weeks = len(demands)
self.__stored_volume = np.ones(n_weeks, dtype=float) * self.__capacity
self.__evaporation_series = evaporations
self.__inflow_series = inflows
self.__demand_series = demands

def calculate_area(self, stored_volume):
""" Calculates reservoir area based on its storave vs. area curve

Arguments:
stored_volume {float} -- current stored volume

Returns:
float -- reservoir area
"""

storage_area_curve_T = self.__storage_area_curve.T .astype(float)

if stored_volume  self.__capacity:
print "Storage volume {} graeter than capacity {}.".format(
stored_volume, self.__capacity)
raise ValueError

for i in range(1, len(storage_area_curve_T)):
s, a = storage_area_curve_T[i]
# the &st; below needs to be replace with "smaller than" symbol. WordPress code highlighter has a bug that was distorting the code because of this "smaller than."
if stored_volume &st; s:
sm, am = storage_area_curve_T[i - 1]
return am + (stored_volume - sm)/ (s - sm) * (a - am)

return a

def mass_balance(self, upstream_flow, week):
""" Performs mass balance on reservoir

Arguments:
upstream_flow {float} -- release from upstream reservoir
week {int} -- week

Returns:
double, double -- reservoir release and unfulfilled
demand (in case reservoir gets empty)
"""

evaporation = self.__evaporation_series[week] *\
self.calculate_area(self.__stored_volume[week - 1])
new_stored_volume = self.__stored_volume[week - 1] + upstream_flow +\
self.__inflow_series[week] - evaporation - self.__demand_series[week]

release = 0
unfulfiled_demand = 0

if (new_stored_volume > self.__capacity):
release = new_stored_volume - self.__capacity
new_stored_volume = self.__capacity
elif (new_stored_volume < 0.):
unfulfiled_demand = -new_stored_volume
new_stored_volume = 0.

self.__stored_volume[week] = new_stored_volume

return release, unfulfiled_demand

def get_stored_volume_series(self):
return self.__stored_volume

def run_mass_balance(reservoirs, n_weeks):
upstream_flow = 0
release = 0
total_unfulfilled_demand = 0
for week in range(1, n_weeks):
for reservoir in reservoirs:
release, unfulfilled_demand = reservoir.mass_balance(release, week)

def generate_streamflow(n_weeks, sin_amplitude, log_mu, log_sigma):
"""Log-normally distributed stream flow generator.

Arguments:
n_weeks {int} -- number of weeks of stream flows
sin_amplitude {double} -- amplitude of log-mean sinusoid fluctuation
log_mu {double} -- mean log-mean
log_sigma {double} -- log-sigma

Returns:
{Numpy array} -- stream flow series
"""

streamflows = np.zeros(n_weeks)

for i in range(n_weeks):
streamflow = np.random.randn() * log_sigma + log_mu *\
(1. + sin_amplitude * np.sin(2. * np.pi / 52 * (i % 52)))
streamflows[i] = np.exp(streamflow)

return streamflows

if __name__ == "__main__":
np.random.seed(41)
n_weeks = 522
sin_amplitude, log_mean, log_std = 1., 2.0, 1.8
streamflows = generate_streamflow(n_weeks, sin_amplitude, log_mean, log_std)
storage_area_curve = np.array([[0, 1000, 3000, 4000], [0, 400, 600, 900]])

reseroir1 = Reservoir(storage_area_curve, np.random.rand(n_weeks) / 10,\
streamflows, np.random.rand(n_weeks) * 60)

run_mass_balance([reseroir1], n_weeks)

plt.plot(reseroir1.get_stored_volume_series())
plt.xlabel("Weeks [-]")
plt.ylabel("Stored Volume [MG]")
plt.title("Storage Over Time")
plt.ylim(0, 4100)
plt.show()


We have quite a few functions to be tested in the code above, but lets concentrate on Reservoir.calculate_area and generate_streamflow. The first is a function within a class (Reservoir) while the second is a standalone function. Another difference between both functions is that we know exactly what output to expect from Reservoir.calculate_area (area given storage), while given generate_stream flow is setup to return random numbers we can only test their estimated statistical properties, which will change slightly every time we run the function. We will use the PyTest framework to perform unit testing in the reservoir routing code above.

## PyTest

PyTest is one among many available unit test frameworks for Python. The reason for its choice is that PyTest is commonly used by Python developers. You can find the API (list of commands) here and examples more here.

### Basic Work Flow

Using PyTest is pretty simple. All you have to do it:

• Install PyTest with pip or conda (pip install pytest or conda install pytest)
• Create a file (preferably in the same directory as the file with the code you want to test) whose name either begins or end with “test_” or “_test.py,” respectively — this is a “must.” In this example, the main code is called “reservoir_routing.py,” so I created the PyTest file in the same directory and named it “test_reservoir_routing.py.”
• In the unit test file (“test_reservoir_routing.py”), import from the main code file the functions and classes you want to test, the PyTest module pytest and any other packages you may use to write the tests, such as Numpy.
• Write the tests. Tests in PyTest are simply regular Python functions whose names begin with “test_” and that have assertions (“assert,” will get to that below) in them.
• On the Command Prompt, PowerShell, or Linux terminal, run the command “pytest.” PyTest will then look for all your files that begin with “test_” or end in “_test.py,” scan them for functions beginning with “test_,” which indicate a test function, run them and give you the results.

### Assertions

Assertions are the actual checks, the cores of the tests. Assertions are performed with the assert command and will return “pass” if the condition being asserting is true and “fail” otherwise. Below is an example of an assertion, which will return “pass” if the function being tested returns the value on the right-hand-side for the given arguments or “false” otherwise:

assert function_calculate_something(1, 'a', 5) == 100


Sometimes we are not sure of the value a function will return but have an approximate idea. This is the case of functions that perform stochastic calculations or calculate numerically approximate solutions — e.g. sampling from distributions, Newton-Rhapson minimum finder, E-M algorithm, finite-differences PDE solvers, etc. In this case, we should perform our assertion against an approximate value with:

# assert if function returns 100 +- 2
assert function_calculate_something(1, 'a', 5) == pytest.approx(100., abs=2.)


or with:

# assert if function returns 100 +- 0.1 (or 100 +- 100 * 1e-3)
assert function_calculate_something(1, 'a', 5) == pytest.approx(100., rel=1e-3)


Sometimes (although, unfortunately, rarely) we add code to a function to check the validity of the arguments, intermediate calculations, or to handle exceptions during the function’s execution. We can also test all that with:

# Test if an exception (such as Exception) are properly raised
with pytest.raises(Exception):
function_calculate_something(1345336, 'agfhdhdh', 54784574)


### Test File Example

Putting all the above together, below is a complete PyTest file with two tests (two test functions, each with whatever number of assertions):

from reservoir_routing import Reservoir, generate_streamflow
import numpy as np
import pytest

def test_calculate_area():
n_weeks = 522
storage_area_curve = np.array([[0, 500, 800, 1000], [0, 400, 600, 900]])

reseroir1 = Reservoir(storage_area_curve, np.random.rand(n_weeks),
np.zeros(n_weeks), np.random.rand(n_weeks) * 20)

# Test specific values of storages and areas
assert reseroir1.calculate_area(500) == 400
assert reseroir1.calculate_area(650) == 500
assert reseroir1.calculate_area(1000) == 900

# Test if exceptions are properly raised
with pytest.raises(ValueError):
reseroir1.calculate_area(-10)
reseroir1.calculate_area(1e6)

def test_generate_streamflow():
n_weeks = 100000
log_mu = 7.8
log_sigma = 0.5
sin_amplitude = 1.
streamflows = generate_streamflow(n_weeks, sin_amplitude, log_mu, log_sigma)

whitened_log_mu = (np.log(streamflows[range(0, n_weeks, 52)]) - log_mu) / log_sigma

# Test if whitened mean in log space is 0 +- 0.5
assert np.mean(whitened_log_mu) == pytest.approx(0., abs=0.05)


If I open my Command Prompt, navigate to the folder where both the main code and the test Python files are located and run the command pytest I will get the output below:

F:\Dropbox\Bernardo\Blog posts>pytest
============================= test session starts =============================
platform win32 -- Python 2.7.13, pytest-4.2.0, py-1.7.0, pluggy-0.8.1
rootdir: F:\Dropbox\Bernardo\Blog posts, inifile:
collected 2 items

test_reservoir_routing.py F.                                             [100%]

================================== FAILURES ===================================
_____________________________ test_calculate_area _____________________________

def test_calculate_area():
n_weeks = 522
storage_area_curve = np.array([[0, 500, 800, 1000], [0, 400, 600, 900]])

reseroir1 = Reservoir(storage_area_curve, np.random.rand(n_weeks),
np.zeros(n_weeks), np.random.rand(n_weeks) * 20)

# Test specific values of storages and areas
assert reseroir1.calculate_area(500) == 400
>       assert reseroir1.calculate_area(650) == 500
E       assert 400 == 500
E        +  where 400 = (650)
E        +    where  = .calculate_area

test_reservoir_routing.py:14: AssertionError
========================== deprecated python version ==========================
You are using Python 2.7.13, which will no longer be supported in pytest 5.0
https://docs.pytest.org/en/latest/py27-py34-deprecation.html
===================== 1 failed, 1 passed in 1.07 seconds ======================

F:\Dropbox\Bernardo\Blog posts>


This seems like a lot just to say that one test passed and the other failed, but there is actually some quite useful information in this output. The lines below (20 and 21 of the output above) indicate which assertion failed, so you can go ahead and debug your the corresponding specific part of code with a debugger:

>       assert reseroir1.calculate_area(650) == 500
E       assert 400 == 500


I actually found this error with PyTest as I was writing the reservoir routing code for this post and used VS Code, which has a debugger integrated with PyTest, to debug and fix the error. The error is happening because of an int to float conversion that is not being done correctly in line 45 of the original code, which can be replaced by the version below to fix the problem:

storage_area_curve_T = self.__storage_area_curve.T<strong>.astype(float)</strong>


The last piece of important information contained in the last few lines of the output is that support for Python 2.7 will be discontinued, as it is happening with other Python packages as well. It may be time to finally switch to Python 3.

### Practice Exercise

As a practice exercise, try writing a test function for the Reservoir.mass_balance function including cases in which the reservoir spills, gets empty, and is partly full.

## Unit Testing in C++

A framework for unit testing in C++ that is quite popular is the Catch2 framework. The idea is exactly the same: you have a code you want to test and use the framework to write test functions. An example of Catch2 being used to check a factorial function is shown below, with both the main and the test codes in the same file:

#define CATCH_CONFIG_MAIN  // This tells Catch to provide a main() - only do this in one cpp file
#include "catch.hpp"

unsigned int Factorial( unsigned int number ) {
return number <= 1 ? number : Factorial(number-1)*number;
}

TEST_CASE( "Factorials are computed", "[factorial]" ) {
REQUIRE( Factorial(1) == 1 );
REQUIRE( Factorial(2) == 2 );
REQUIRE( Factorial(3) == 6 );
REQUIRE( Factorial(10) == 3628800 );
}


## Conclusion

Setting up and running unit testing is not rocket science. Also, with my test functions I do not have to use my actual reservoir system test case in order to check if the surface area calculations are correct. If I find they are not, I can just re-run the test function with a debugger (VS Code is great for that) and fix my function without having to deal with the rest of the code. You would be surprised by how many hours of debugging can be saved by taking the unit test approach. Lastly, be sure to run your tests after any change to your code you consider minimally significant (I know, this sounds vague). You never know when you mess something up by trying to fix or improve something else.

# Intro to Machine Learning Part 5: Bagging

In Part 3 of this series, it was stated that classification error can be due to bias, variance, or noise. There are different methods to address each type of classification error. This post will focus on bootstrap aggregation, or bagging, which aims to improve classification error due to variance.

As a reminder, expected test error can be decomposed as follows:

where hd is the training data classifier and hbar is the ideal expected classifier. Now let’s just focus on the part of the testing error that is due to variance. From the equation, it is clear that if we want this error component to approach 0, then hd(x) must tend to hbar(x). We can use the the Weak Law of Large Numbers (WLLN) to understand how this can happen. WLLN states that if you have a sample of independent and identically distributed (i.i.d) random variables, as the sample size grows larger, the sample mean will tend toward the population mean.

If this is applied in a classification setting:

In this equation, a classifier is trained on each of the m datatsets and then the results are averaged. By WLLN, hd tends to hbar …but only as our number of datasets, m, approaches infinity. But we only have one datatset, D, so how can we get more datasets? Through bagging!

### Bagging

Fig 1: Bootstrapping Process

Let’s say our original dataset D comes from some distribution P, shown in Figure 1.We can simulate drawing from P by bootstrapping, or sampling our original dataset D with replacement. Because we are sampling with replacement, we are essentially creating new datasets out of our original one! Note that when we sample with replacement, we cannot say that the new samples are i.i.d., so therefore the Weak Law of Large Numbers does not quite apply. But error due to variance can still be considerably reduced with this method.

### Application: Random Forests

Bagging is most popularly utilized for random forests which are bagged decision trees. A decision tree is trained on each of the m datasets and then either the mode label is assigned (in a classification setting), or an average value is calculated (in a regression setting). Random Forests are termed “out of the box” algorithms; they are very easy to use because 1) only 2 hyper-parameters have to be defined, m (the number of datasets) and k (a feature splitting parameter) and  2) features don’t need to be normalized to have the same units or scale, which is extremely helpful for any kind of heterogeneous data.

#### Coding up the Classifier

Random Forests can be implemented in every programming language. In Python, scikit-learn has a function called RandomForestClassifier. Only the number of trees in the forest needs to be specified, but there are many extra parameters that can be fine-tuned, including the criterion to measure the quality of a split (Gini vs. Entropy) and the maximum depth of each tree. The code below outlines a basic implementation of the classifier. A training dataset with “value” and “label” fields is read in and then fit with the RandomForestClassifier function that is trained on 200 trees. Then the classifier is used to predict the labels associated with each test set “value” and the prediction accuracy is calculated.


#import libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
#import training data
#fit classifier in training data
rfc1=RandomForestClassifier(n_estimators= 200, max_depth=8, criterion='entropy')
rfc1.fit(training.value, training.label)
#import testing data
#predict labels
predicted = rfc1.predict(test.value)
#calculate accuracy
acc=np.mean(predicted == test.label)



### Random Forests in Water Resources

The Water Resources field has seen an increasing machine learning presence for the past decade. However, the majority of research has been focused on implementation of artificial neural networks. Though just emerging in the field, random forests are becoming increasingly popular for both prediction and regression. Shortridge et al. (2016), tests the ability of random forests (among other models such as GLMs and ANNs) to predict monthly streamflow for five rivers in the Lake Tana Basin in Ethiopia. Empirical models that estimate monthly streamflow as a function of climate conditions and agricultural land cover in each basin were developed for the period from 1961 to 2004. The predictive capabilities of each model was tested along with investigation of their error structures. Random forests were found to better capture non-linear relationships than their ANN counterparts. Furthermore, they were much more parsimonious and physically interpretable. The paper also addresses ability of these models to make meaningful predictions under uncertainty and a non-stationary climate.

#### References:

All of the material for the post comes from Kilian Weinberger’s CS4780 class notes and lectures found at: http://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote18.html

Shortridge, Julie E., et al. “Machine Learning Methods for Empirical Streamflow Simulation: a Comparison of Model Accuracy, Interpretability, and Uncertainty in Seasonal Watersheds.” Hydrology and Earth System Sciences, vol. 20, no. 7, 2016, pp. 2611–2628., doi:10.5194/hess-20-2611-2016.

# Intro to machine learning part 4: Support Vector Machines

In the fourth installment of our intro to machine learning series, I’ll introduce Support Vector Machines, a powerful tool for classification and regression. In my last post on Perceptron, we examined a procedure for finding a hyperplane that is an effective linear classifier and not subject to the curse of dimensionality. To review, our task during binary classification is to find a classifier that can predict the label of a data point using a vector of the point’s characteristic features. The Perceptron algorithm finds a hyperplane that is able to separate points of different classifications within the feature space. An example of a linear classifier found using perceptron for a data set with 3 features is shown in Figure 1 below.

Figure 1: A hyperplane for binary classification of a linearly separable data set

While Perceptron is a useful conceptual tool for understanding classification problems, it suffers from several flaws that limit its usefulness in most classification problems. First, Perceptron will never converge if the data is not linearly separable, limiting its utility to well-behaved data sets. Second, while Perceptron is guaranteed to converge to a separating hyperplane if the data is linearly separable, such a hyperplane may take infinitely many possible forms and there is no guarantee that Perceptron will find the “best” one (more on what “best” means later). Support Vector Machines (SVM) overcome these problems to find an “optimal” hyperplane that may be fit to data sets that are not linearly separable. SVMs may also be kernalized to fit non-linear decision boundaries and extended to regression contexts. In this post we’ll look into how and why SVMs work and discuss an example of SVMs in water resources systems literature. I would like to note that the main body of content in this post was derived from course notes of CS 4780 by Dr. Kilian Weinberger at Cornell. For deeper coverage of machine learning content, check out the course page here.

## The Maximum Margin Hyperplane

Let’s define the features of point i as xi, and its label as yi. If the data is linearly separable, Perceptron will find a hyperplane, H, such that:

$H(x) = dot\{x: (x,w) +b=0\}$

Where w  is an orthogonal vector that defines the orientation of hyperplane H, and b represents and offset from the origin.

While a hyperplane found by Perceptron will accurately separate the two classes, it likely will not find the hyperplane that does the “best” job bisecting the classes. So how do we define what makes a good hyperplane for classification? In machine learning we train classifiers on a set of training data hoping that the classifiers will generalize to a broader set of data; we seek classifiers that not only have low training error, but will also have low generalization error. It turns out the best way to reduce generalization error is to find the hyperplane that is most in the “middle” of the two classes in the training set. We can define the middle of the two data sets as the hyperplane that has the largest margin, or distance between the closest points in the two classes. This is known as the maximum margin hyperplane. The advantages of using the maximum margin hyperplane are easy to visualize. Figure 2 shows three hyperplanes attempting to classify a set of binary test data with two features, x1 and x2. Plane H1 does not separate the two data sets. Planes H2 and H3 are both able to correctly classify the data set, but plane H3, the maximum margin classifier, is the more natural partition. The two closest training points to the hyperplane H3, which define the maximum margin, are known as “support vectors”.

Figure 2: Three hyperplanes, plane H1 does not separate the two data sets, plane H2 separates the two but with a small margin, H3 is the maximum margin hyperplane (Image source: Wikimedia commons under the creative commons license)

SVM changes the classification problem from “find a hyperplane that correctly classifies all training points” to “find a the maximum margin hyperplane”, which allows us to formulate the problem as an optimization. Here, will denote the margin of a hyperplane as γ, which is a function of the hyperplane’s weight vector w and intercept b. The optimization is thus:

$\max_{w,b}\gamma(w,b)$

$s.t \quad \forall i \ y_{i}(w^{T}x_i+b) \geq 0$

With a little math (if you’re interested, see these notes), we can simplify this problem to a quadratic optimization problem that can be efficiently solved with a Quadratically Constrained Quadratic Program (QCQP) solver.

$\min_{w} \ w^{T}w$

$s.t \quad \forall i \ y_{i}(w^{T}x_{i}+b) \geq 1$

## Soft Margin SVM

By finding the maximum margin classifier, SVM is guaranteed to find the optimal hyperplane for the given data and can do so using an out of the box QCQP solver. However,  thus far, our treatment of SVM (known as “hard margin SVM”) will only work for linearly separable data. SVM can be extended to training sets that are not linearly separable by incorporating the optimization’s constraint as a penalty into the objective function:

$\min_{w,b} \quad w^Tw+C\sum^{n}_{i=1}max[1-y_{i}(w^Tx+b),0]$

Where C represents a penalty coefficient that can be set by the user. This formulation is known as “soft margin SVM”. Instead of only finding hyperplanes that perfectly divide the data, soft margin SVM allows misclassification at a penalty. When C is very large, soft margin SVM will find the same hyperplane as hard margin SVM. For more info on the derivation of soft margin SVM, see these notes.

## An example of SVM in Water Resources Literature

So SVM is a linear classifier that finds the maximum margin hyperplane and can be applied to data that is not linearly separable, but how might one apply it in water resources systems? One example of SVM in water resources literature is by Herman et al., 2016, who use SVM for scenario discovery. Scenario discovery (Groves and Lempert, 2007) assists decision makers in finding key uncertainties that represent vulnerability for a planning alternative. Herman et al., 2016, used SVM to examine the performance of a water supply portfolio under future demand growth and drought scenarios. Figure 7 of Herman et al., 2016, plots factor maps of drought frequency vs. demand growth scaling factor, and shows SVM classification of future states of the world based on whether they met stakeholder performance criteria. Each point on the plot represents a unique drought and demand scenario and the point’s color represents it classification. The classification predicted by soft margin SVM is shown as the shading of the plots.  The hyperplanes discovered by SVM allow decision makers to discover what ranges of uncertainties are likely to cause the chosen alternative to fail and can assist in the development of narrative scenarios for future planning purposes.

Figure 3: Figure 7 from Herman et al., 2016. Each point represents a sampled uncertainty, demand growth is shown on the horizontal axis and drought frequency (n times more likely) is shown on the vertical axis. The color of each point represents its classification (success or failure to meet stakeholder criteria) and the shading of the plots represents the predicted classification by soft margin SVM.

SVMs are also commonly used for regression in hydrology and downscaling of climate models. SVMs in a regression context are beyond the scope of this post, but see Asefa et al., 2006 for an example of SVMs used to forecast streamflow and Tripathi et al., 2006 for an example of SVMs in downscaling.

## Implementing SVMs

In Python, Scikit-learn has an easy to use implementation of SVMs that can be used for classification or regression.

Matlab has several functions for fitting SVMs including fitcsvm.

R has an svm function, and a library devoted to SVMs as part of the e1071 statistical package.

## References

Herman, J.D., Zeff, H.B., Lamontagne, J.R., Reed, P.M., Characklis, G.W., 2016. Synthetic drought scenario generation to support bottom-up water supply vulnerability assessments. J. Water Resour. Plann. Manage. 142 (11), 04016050.

Groves, D.G., Lempert, R.J., 2007. A new analytic method for finding policy-relevant scenarios. Global Environ. Change 17 (1), 73–85. http://dx.doi.org/10.1016/j. gloenvcha.20 06.11.0 06 . Uncertainty and Climate Change Adaptation and Mitiga- tion.

# Intro to Machine Learning Part 3: Bias-Variance Tradeoff

The goal of this post is to introduce one of the most important topics in Machine Learning, termed the Bias-Variance Tradeoff. In the last two posts of this series, David Gold introduced two popular classification algorithms. This post will discuss how to assess your classifier’s error and improve it.

### Decomposition of Error

Let’s assume that we have a dataset, D={(x1,y1)…(xn,yn)}, where xi is a data point and yi is the label associated with that point. We train a classifier, hd , on this dataset and use it to predict the label y associated with a test point x. The classifier outputs a predicted label hd(x). The classifier’s expected test error is defined as the squared difference between the prediction and the actual label y. This error can be decomposed into three different types of error shown in the following equation:

Variance: hd is the classifier that you have trained on dataset D. If you were to build a classifier on a different dataset, you would get a different classifier that could potentially give you different labels. This is unfortunate because multiple large datasets are usually hard to come by. If we had infinite datasets, we could hypothetically find the ideal expected classifier denoted as hbar(x). Therefore, the expected error due to variance is defined as the difference between our classifier and the ideal expected classifier.

Bias: Suppose that you are able to train your model on infinite datasets and are able to achieve the expected classifier hbar(x), but you still have a high test error. Then, your classifier may have error due to bias, which is a measure of how much the expected classifier’s prediction differs from the average label ybar(x). Error due to bias is a limitation of your model. For example, if you are trying to use a linear classifier to separate data which are not linearly separable, then you will never get good results, even with infinite datasets.

Noise: The goal of a classifier is to determine the average label ybar(x) associated with a data point. However, sometimes the actual label associated with the data point differs from the average label. This is termed error due to noise and is intrinsic to your data.

### Improving Classification Error

The first step to improving error is to first identify what type of error that your classifier is suffering from. Figure 1 can help you to diagnose this.

Figure 1: Training and test error due to number of training instances

High Variance: If you classify your test set and find that there is a large gap between the test error and training error, then your classifier might have high variance. Another characteristic of this region is that the training error is below the acceptable test error ε, which is ideal, but the test error is still much higher than ε, which is not ideal.

Solutions

• Add more data: As seen from the graph, as more data (training instances) are added, the test error will go down initially. Conversely, the training error will increase, because adding more training data will always make a classification problem more difficult. The ultimate goal is that the two errors lines will converge and level off below the dashed line.
• Reduce Model Complexity: If your model has high variance, then it can be a sign that your classifier is too complex. It is overfit to the training set and therefore won’t generalize well across test sets. One way to fix this is to reduce model complexity by increasing regularization to penalize more complex solutions.
• Bagging: Short for Bootstrapping Aggregation, bagging is a procedure where multiple datasets can be produced by sampling dataset D with replacement. A classifier is trained on each dataset and then averaged to try to obtain the expected classifier. The variance error can be provably reduced using this method. The details to this proof will be explained in the next blog post.

High Bias: If you classify your test set and find that there is a small gap between the test error and training error, but that both error lines have converged above the acceptable error then your classifier might have high bias.

Solutions

A common misconception is that adding more data is the solution to most error woes. However, it’s very important to realize that if your classifier suffers from bias, there exists a fundamental problem with your model. More data will not fix this issue, and in fact will just increase training error.

• Increase Model Complexity: High bias means that your classifier isn’t robust enough to accurately predict labels, so you need to increase the power of the algorithm. If your data require a more complex classification boundary, consider using algorithms that can find a non-linear boundary such as KNN or Logistic Regression. Non-linearities can be incorporated into linear classifiers through kernelization which involves mapping feature vectors into higher dimensional spaces where you can more likely  find a linearly separating hyperplane.
• Add Features: Instead of increasing the complexity of the algorithm, an alternative is to add more features to represent each data point. This might increase the separation between classes enough such that simpler algorithms can still be utilized.
• Boosting: In boosting, weak classifiers with high bias (termed as such because they perform only slightly better than random guessing at predicting labels) are ensembled together to generate a strong learner that has a lower bias.

High Noise: If training error and test error are both too high, and perhaps become even higher after more training instances are introduced, then your data may just be very noisy. This can be due to identical training points having different labels. There are very few ways to actually overcome this type of error, but the best way is to try to add more features to characterize your training data that will hopefully offer more separation between seemingly identical points.

This blog post should give you a taste of the different types of error that are prevalent among classifiers and the next few blog posts in the series will go through bagging, boosting, and kernels as some options to help reduce these errors.

### References:

All of the material for the post comes from Kilian Weinberger’s CS4780 class notes and lectures found at: http://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote12.html

# Discriminant analysis for classification

Data mining encompasses a variety of analytic techniques that can help us analyze, understand, and extract insight from large sets of data. The various data mining techniques generally fall in three potential categories:

Classification – Use a dataset of characteristics (variables) for an observation to determine in which discrete group/class the observation belongs. For example:

• Loan applications: use data about an individual or business to determine whether the applicant is a “good risk” (approve loan) or a “bad risk” (refuse loan)
• Project evaluation: sort possible capital investment projects into “priority”, “medium” and “unattractive” based on their characteristics

Prediction – Use data to predict the value (or a range of reasonable values) of a continuous numeric response variable. For example:

• Predict sales volume of a product in a future time period
• Predict annual heating/cooling expenses for a set of buildings

Segmentation – Separate a set of observations into some collection of groups that are “most alike” within a group and “different from” other groups. For example:

• Identify groups of customers that have similar ordering patterns across seasons of the year
• Identify groups of items that tend to be purchased together

This blogpost will focus on (linear) Discriminant Analysis (DA), one of the oldest techniques for solving classification problems. DA attempts to find a linear combination of features that separates two or more groups of observations. It is related to analysis of variance (ANOVA), the difference being that ANOVA attempts to predict a continuous dependent variable using one or more independent categorical variables, whereas DA attempts to predict a categorical dependent variable using one or more continuous independent variable. DA is very similar to logistic regression, with the fundamental difference that DA assumes that the independent variables are normally distributed within each group. DA is also similar to Principal Component Analysis (PCA), especially in their application. DA differs from PCA in that it tries to find a vector in the variable space that best discriminates among the different groups, by modelling the difference between the centroids of each group. PCA instead tries to find a subspace of the variable space, that has a basis vector that best captures the variability among the different observations. In other words, DA deals directly with discrimination between groups, whereas PCA identifies the principal components of the data in its entirety, without particularly focusing on the underlying group structure[1].

To demonstrate how DA works, I’ll use an example adapted from Ragsdale (2018)[2] where potential employees are given a pre-employment test measuring their mechanical and verbal aptitudes and are then classified into having a superior, average, or inferior performance (Fig. 1).

Fig. 1 – Employee classification on the basis of mechanical and verbal aptitude scores

The centroids of the three groups indicate the average value of each independent variable (the two test scores) for that group. The aim of DA is to find a classification rule that maximizes the separation between the group means, while making as few “mistakes” as possible. There are multiple discrimination rules available; in this post I will be demonstrating the maximum likelihood rule, as achieved though the application of the Mahalanobis Distance. The idea is the following: an observation i from a multivariate normal distribution should be assigned to group G_j that minimizes the Mahalanobis distance between i and group centroid C_j.

Fig. 2 – Independent variables have different variances

The reason to use Mahalanobis (as opposed to say, Euclidean) is twofold: if the independent variables have unequal variances or if they are correlated, a distance metric that does not account for that could potentially misallocate an observation to the wrong group. In Fig. 2, the independent variables are not correlated, but X2 appears to have much larger variance than X1, so the effects of small but important differences in X1 could be masked by large but unimportant differences in X2. The ellipses in the figure represent regions containing 99% of the values belonging to each group. By Euclidean distance, P1 would be assigned to group 2,

Fig. 3 – Independent variables are correlated

however it is unlikely to be in group 2 because its location with respect to the X1 axis exceeds the typical values for group 2. If the two attributes where additionally correlated (Fig. 3), we shall also adjust for this correlation in our distance metric. The Mahalanobis distance therefore uses the covariance matrix for the independent variables in its calculation of the distance of an observation to group means.

It is given by:  $D_{ij}=\sqrt{(X_i-\mu_j)^TS^{-1}(X_i-\mu_j)}$ where $X_i$ is the vector of attributes for observation $i$, $\mu_j$ is the vector of means for group $j$, and $S$ is the covariance matrix for the variables. When using the Mahalanobis Distance, points that are equidistant from a group mean are on tilted eclipses:

Fig. 4 – Using the Mahalanobis Distance, points equidistant from a group mean are on tilted eclipses

For further discussion on the two distance metrics and their application, there’s also this blogpost. Applying the formula to the dataset, we can calculate distances between each observation and each group, and use the distances to classify each observation to each of the groups:

Table 1 – Distance of each observation to each group, using Mahalanobis distance. Classification is estimated using the minimum between the three distances. Values in red indicate misclassifications.

If we tally up the correct vs. incorrect (in red) classifications, our classification model is right 83.33% of the time. Assuming that this classification accuracy is sufficient, we can then apply this simple model to classify new employees based on their scores on the verbal and mechanical aptitude tests.

[1] Martinez, A. M., and A. C. Kak. 2001. “PCA versus LDA.” IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2): 228–33. https://doi.org/10.1109/34.908974.

[2] Ragsdale, Cliff T. Spreadsheet Modeling and Decision Analysis: a Practical Introduction to Business Analytics. Eighth edition. Boston, MA: Cengage Learning, 2018.

# Introduction to Borg Operators Part 1: Simplex Crossover (SPX)

I have always found the connection between Genetic Algorithms and Biological Evolution very interesting. In many ways, Genetic Algorithms like Borg, emulate Biological Evolution. They start with a random population of potential solutions, the best fit of these solutions are selected to produce the future generation. During the production of offspring, crossover and mutation occur, which leads to more variation in the future generation. Strong (Non-dominated) offspring join the larger population and replace some of the previous weaker solutions (which are now dominated), and the process is repeated until some final condition is met.

In this post, I will briefly discuss Borg’s operators, and then dive deeper into Simplex Crossover (SPX) by Tsutsui et Al. (1999). This is the first of a series of posts in which I will discuss the Borg operators.

## General Overview of Borg Operators

Borg uses the six operators shown in the figure 1 below to achieve variation in its solutions. In the figure, the larger dots represent the parent solutions, while the smaller dots represent the offspring solutions (Hadka & Reed, 2013).

The question is how does Borg know which operators are best suited for a new problem? Borg auto-adaptively selects operators to based on their performance to aid the search. Based on section 3.4 in Hadka and Reed, 2013: In the first iteration in Borg assigns an equal probability for using each operator. For K operators, this probability is 1/K. With each iteration, these probabilities are updated, so that the operators that produced more solutions in the epsilon box archive, are assigned higher probabilities, while those with less solutions in the epsilon box dominance archive are assigned a lower probability. Hadka and Reed(2013) use the following weighted average equation to obtain this behavior:

Where:

• K is the number of operators used
• Qi ∈ {Q1,…, Qk} is the probability that an operator is used to produce offspring in the next generation. and is initialized at 1/K
• Ci ∈ {C1,…, Ck} is the number of solutions in the epsilon box dominance archive produced by each operator i.
• Sigma is a positive number that prevents any of the operators from reaching a probability of zero.

Figure 1 (Hadka & Reed, 2013)

Notice in the bottom three operators in Figure 1 show some similarity: the parents form a simplex (which is a triangle in 2 dimensions). However, the way the offspring are distributed around the parents is different. In Parent Centric Crossover, the offspring form around the parent solutions, in Unimodal Normal Distribution Crossover, the offspring are normally distributed around the center of mass of the three parents, and in Simplex Crossover, the solutions are uniformly distributed inside a simplex that is larger than the simplex created by the three parents (Deb et Al., 2002).

## Simplex Crossover (SPX)

Discussion based on Tsutsui et Al., 1999

Simplex Crossover (SPX) is a recombination operator that can use more than two parent solutions to generate offspring. The method involves expanding the simplex formed by the parent solutions by a factor of epsilon, and then sampling offspring from the resulting expanded simplex. The expansion of the simplex is centered around the center of mass of the parent solution vectors, and therefore independent of the coordinate system used. The number of parents must be at least 2 and less than or equal to the number of parameters plus one. That is:

Where:

• m: is the number of parents
• n: is the number of parameters

The SPX is denoted as SPX-n-m-ε.

Tsutsui et Al., 1999 states that for low dimensional functions, SPX works best with 3 parent solution vectors, and for high dimensional functions SPX works best with 4 parent solution vectors.

### The 3 Parent, 2 Parameter Case

It is easiest to see this in the 2-dimensional three parent, 2 parameter case, i.e. SPX-2-3-ε. This can be done in four steps, referring to figure 2:

Figure 2 (Tsutsui et Al., 1999)

1. Let X1, X2, and X3 be the three parent solution vectors. Each of these vectors has two parameters (it’s x and y coordinates). Calculate the center of mass, O, of the parents by computing the average of their two parameters:
2. Expand the simplex by epsilon at every vertex:
3. Produce offspring by using a uniform distribution to sample inside new expanded simplex defined in step 3.

You can see this in Python using the code below:


import numpy as np
import random

def SPX_2D(X1, X2, X3, epsilon, n_offspring=1, m=2):
# m is the number of parameters in each parent vector
# X1, X2, X3 are the parent vectors as np.arrays, each with m parameters (elements) i.e. len(Xi)=m
# epsilon is a value greated that zero by which the simplex is expanded
# n_offspring is the number of offspring you would like to produce

# Step 0: apply checks and inform the user if the input is wrong
if m<2:
print("error: at least two parameters needed for 2D SPX")
elif len(X1)!= len(X2) | len(X1)!=len(X3) | len(X1)!=len(X2):
print("error: the number of parameters in the parent vectors don't match")
elif len(X1)!=m:
print("error: the number of parameters in the parent vectors is not equal to m")

# if the checks pass, the function can proceed
else:
# Step 1: find the center of mass (CoM) of the three parents
CoM = 1/3 * (X1 + X2 + X3)

# Step 2: Define the vertices (Vi) of the simplex by expanding the simplex in the direction of Xi-CoM by (1+epsilon)
# note that the equation here is slightly different from the one in the paper, since the one in the paper produces the vector
# that translates the center of mass to the new vertices, and so the vectors need to be added to the center of mass coordinates
# to produce the new vertices.
V1=CoM+(X1-CoM)*(1+epsilon)
V2=CoM+(X2-CoM)*(1+epsilon)
V3=CoM+(X3-CoM)*(1+epsilon)

# Step 3: Produce offspring by sampling n_offspring points from the new simplex defined in step 3 using a uniform distribution
offspring = [point_on_triangle(V1, V2, V3) for _ in range(n_offspring)]

return offspring, V1, V2, V3, CoM

#########################################################################################################################

# Point_on_triangle function
# source: https://stackoverflow.com/questions/47410054/generate-random-locations-within-a-triangular-domain
# Solution by Mark Dickinson (Nov 21, 2017)

def point_on_triangle(pt1, pt2, pt3):

# pt1, pt2, pt3 are the vertices of a triangle

# find two random numbers s and t, note that random produces a float between 0 and 1 using a uniform distribution
s, t = sorted([random.random(), random.random()])

# use s & t to calculate a weighted average of each coordinate. This will produce the offspring vector
return (s * pt1[0] + (t-s)*pt2[0] + (1-t)*pt3[0],
s * pt1[1] + (t-s)*pt2[1] + (1-t)*pt3[1])

#########################################################################################################################

# Let's try an example
X1=np.array([-2,2])
X2=np.array([4,2])
X3=np.array([1,6])
epsilon=0.3
m=2
n_offspring=1000

offspring, V1, V2, V3, CoM = SPX_2D(X1, X2, X3, epsilon, n_offspring, m)

# Finally, you can plot the parents and offspring to produce Figure 3

import matplotlib.pyplot as plt

# Plot the parents
x1, y1 = zip(X1, X2, X3)
plt.scatter(x1, y1, s=50, label='Parents')

# Plot the center of mass
x2, y2 = zip(CoM)
plt.scatter(x2, y2, s=150, label='Center of Mass')

# Plot the expanded simplex
x3, y3 = zip(V1, V2, V3)
plt.scatter(x3, y3, s=100, label='Simplex Vertices')

# Plot the offspring
x4, y4 = zip(*offspring)
plt.scatter(x4, y4, s=0.05, label='Offspring')

plt.legend()

plt.show()



Figure 3

### General Case

The general case is denoted as SPX-n-m-ε, with n parameters and n parents.

Each parent solution vector is:

These vectors are in R^n.

Thinking in terms of Biology, this parent vector mimics a chromosome, with m parameters as its different traits.

The general case can be outlined in four steps:

1) Parent vectors are selected from the population pool

2)  R^n is space is then divided as follows:Divide R^n into h sets of R^m-1 spaces and one R^q space. I found this easier when I thought of it in the context of a chromosome, i.e. large parent vector of length n, being divided into smaller sub-vectors of length m-1, and a remainder vector or length q. See figure 4 below:

Figure 4

3) In each R^m-1 space, the m parent vectors are used to generate offspring using the expanded simplex as described in the 2-dimensional, 3 parent, 2 parameters case. Again, using the chromosome analogy, figure 5 shows how this can be depicted for m=3 parents and for example, n=9 parameters. R^9 is divided into four (h=integer(n/(m-1)=4) R^2 (m-1=2) spaces and one R^1 space (q=remainder(n/(m-1)=1).

Figure 5

4) The sets of offspring in R^m-1 produced in step 3 together with the vector in R^q (which remains unchanged), are then combined in their correct positions to form an offspring vector in R^n.

The code for the general SPX case can be found in David Hadka’s github.