# NetCDF Operators

This post is an introduction to Linux-based Climate Data Operators (CDO) and NetCDF Operators (NCO), which allow you to perform various operations on NetCDF files through the command line. I found these commands to be really nifty when I was working with pre-industrial control runs from a GCM. The output was being written at a daily timestep, across 1200 years, and for the whole world, so it was absolutely essential that I cut the size of the files down as much as I could before transferring them to my own computer.

The official documentation and installation instructions can be found here for NCO and here for CDO, but if you’re working on a supercomputer, the libraries will likely already be installed. I will outline how I used some of these functions for my pre-industrial runs.

## Concatenation

Some of the NCO commands have size limits of 40 GB, so it’s important to use the right order of operations when processing your files, which will differ depending on your ultimate goal. My goal was to ultimately get the 500-hPa geopotential height anomalies across the whole 1200-year period for the Western US specifically. Assuming you have a directory with all the NetCDF files, the first step is to concatenate the data, since my run was broken into many smaller files. The easiest way to do this is with the following command, which will take all the NetCDF files in the directory (using the *) and merge them into a file called merged_file.nc:

cdo mergetime *.nc merged_file.nc


## Return Individual Monthly Files

When calculating anomalies, you will need to determine a mean geopotential height value for each of the 12 months, and then calculate daily deviations with respect to these monthly means. You can split the merged file by month with the following command:

cdo splitmon merged_file.nc zg500.mon


This command will return 12 files of the form zg500.mon${i}.nc.

## Return Monthly Means and Daily Anomalies

The next step is to calculate a monthly mean for each of these files. For example, for January use:

cdo timmean zg500.mon1.nc zg500.mean.mon1.nc


## Return Anomalies

Now we subtract the means from each monthly file to return the daily anomalies for each month, which will be of the form zg500.mean.mon${i}.anom.nc. If you want to combine the last two steps into one loop, you can use the code below:

for i in $(seq 1 12)
do
  cdo timmean zg500.mon${i}.nc zg500.mean.mon${i}.nc
  cdo sub zg500.mon${i}.nc zg500.mean.mon${i}.nc zg500.mean.mon${i}.anom.nc
done


## Cut Down to Geographical Area of Interest

Finally, we need to cut the data down to just the Western US. We use ncks (NetCDF Kitchen Sink) from NCO, which is probably the most versatile of all the functions (hence the name). This command is one that has the 40 GB limit, which is why I had to wait until I had monthly files before I could cut them down geographically. We must first specify the variable of interest using the -v flag. In my case, I only had one variable to extract, but you can also extract multiple in one command. Then denote the range of latitude and longitude using the -d flags. It is very important to include the periods at the end of each lat/lon bound (even if your bounds are integers); otherwise ncks interprets the bounds as indices rather than coordinate values and the command will not subset as intended.

for i in $(seq 1 12)
do
  ncks -v zg500 -d lon,180.,260. -d lat,30.,60. zg500.mean.mon${i}.anom.nc -o zg500.mean.mon${i}.cut.anom.region.nc
done


Ultimately, you will get 12 files of the form zg500.mean.mon${i}.cut.anom.region.nc. And that’s it! You can concatenate the monthly files back together and resort the data into the correct sequence according to time. I was unsuccessful at finding a quick way to do this on the command line, but it is possible. I found it much easier to handle this step in R: I imported each of the 12 anomaly files, assigned a time vector, concatenated each monthly anomaly matrix into a larger matrix, and then sorted according to date. If your files are small enough by the end of the process, this is likely the easiest way to take care of resorting. Enjoy!
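For the final resorting step described above, a minimal R sketch might look like the following. The file reading is omitted; the placeholder dates and values simply stand in for whatever each monthly anomaly file contains, and all names here are hypothetical:

```r
# Hypothetical sketch: combine 12 monthly anomaly tables and sort by date.
# In practice each element of monthly_list would be built from one
# zg500.mean.mon${i}.cut.anom.region.nc file plus its time axis.
monthly_list <- lapply(1:12, function(i) {
  data.frame(time = as.Date(sprintf("2000-%02d-01", i)) + 0:2,  # placeholder dates
             anom = rnorm(3))                                    # placeholder anomalies
})

combined <- do.call(rbind, monthly_list)      # concatenate the monthly pieces
combined <- combined[order(combined$time), ]  # resort into chronological order
```

From here the full, time-ordered anomaly series can be analyzed or written back out as a single file.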

# Introduction to Wavelets

This post is a brief introduction to and overview of wavelets. Wavelets are a powerful tool for time series analysis, de-noising, and data compression, and their use has recently exploded in fields like hydrology and geophysics. We are often interested in discovering periodic phenomena or underlying frequencies that are characteristic of a signal of interest, such as low-frequency teleconnections. We may also want to better understand very rapidly changing seismic signals during an earthquake, or ocean wave dispersion. Wavelets can help us answer many questions in these applications!

### Basic Background

The typical approach to understanding what frequencies are present in a signal involves transforming the time-domain signal into the frequency domain by means of the Fourier Transform. A well-known deficiency of the Fourier Transform is that it provides information about the frequencies that are present, but all information about the time at which these frequencies occur is lost. Thus, signals that are non-stationary, i.e., that exhibit changing frequencies over time, cannot be fully analyzed by the Fourier Transform (Deshpande, 2015). One solution proposed to address this limitation was the Short-Time Fourier Transform (STFT). In a very basic sense, the STFT involves applying a window function to segment the signal in time and then performing a Fourier transform on each segment. The frequencies for each segment can then be plotted to better understand the changing spectra (Smith, 2011). A considerable amount of research between the 1940s and 1970s went into determining appropriate windowing functions. However, limitations still existed with this approach. The STFT uses the same windowing function across the whole time series, which may not be appropriate for all of the different frequencies that may characterize a signal. When a signal has very high frequency content, a narrow window is preferable, but this results in poor frequency resolution. When a signal has lower frequency content, a wider window is preferable, but this results in poor time resolution. This tradeoff is often described as an instance of the Heisenberg Uncertainty Principle (Marković et al., 2012).

The Short-Term Fourier Transform assumes that frequencies are present uniformly across the time series. Wavelet bases of different scales (frequency bands) can be influential at select times in the series.

In order to properly analyze signals that are non-stationary or transient (most signals of interest), and to appropriately address both large- and small-scale features in frequency space, we can make use of wavelets! Wavelet transforms are especially useful for analyzing signals which contain low frequencies for long durations and high frequencies for short durations. Furthermore, the basis functions are not restricted to sinusoids, which can make it easier to approximate functions with sharp peaks or corners.

In a continuous wavelet transform, expressed below in Equation (1), the basis function or window is termed the “mother” wavelet, which is designated by Ψ .
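Written out, with s the scale and τ the translation, the transform of a signal x(t) takes the standard form:

```latex
W(s,\tau) \;=\; \frac{1}{\sqrt{s}} \int_{-\infty}^{\infty} x(t)\, \Psi^{*}\!\left(\frac{t-\tau}{s}\right) dt
```

where Ψ* denotes the complex conjugate of the mother wavelet.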

This mother wavelet can be scaled and translated with the s and τ  parameters, respectively, to produce “daughter” or “child” wavelets, which are simply variations of the mother wavelet. The wavelets are not only translated across the signal but can be dilated or contracted to capture longer or shorter term events.

By definition, a continuous wavelet transform is a convolution of the input signal with the daughter wavelets, which is inherently a measure of similarity. The wavelet is “slid” over the signal of interest and similarities at each time point are measured. Therefore, the wavelet will be correlated with the parts of the series that contain a similar frequency.

The convolution with the chosen wavelets can lead to redundancy of information when the basis functions are not orthogonal. Families of orthonormal wavelet basis functions have been developed to address this limitation. The simplest orthonormal wavelet is the Haar wavelet, a discrete, square-shaped wavelet. The Haar wavelet is not continuous or differentiable; however, it is particularly useful for approximating the response of systems that may experience a sudden transition. A Morlet wavelet is a complex sinusoid enclosed in a Gaussian envelope and may be more useful in applications such as music, hearing/vision, and analyzing electrocardiograms.

Common Wavelets : a) Haar b) Gaussian c) Daubechies d) Morlet (Baker, 2007)
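As a quick illustration of the two shapes just described (a standalone sketch, not tied to any particular package), the Haar mother wavelet and the real part of a Morlet wavelet can be written directly in R:

```r
# Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere
haar <- function(t) ifelse(t >= 0 & t < 0.5, 1, ifelse(t >= 0.5 & t < 1, -1, 0))

# Morlet wavelet (real part): a cosine carrier inside a Gaussian envelope,
# using the conventional center frequency omega0 = 6
morlet <- function(t, omega0 = 6) exp(-t^2 / 2) * cos(omega0 * t)

t <- seq(-4, 4, by = 0.01)
plot(t, morlet(t), type = "l")  # a smooth oscillation that decays at the edges
```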

In a discrete wavelet transform (DWT), the scale and translation parameters, s and τ, are discretized in such a way that each successive wavelet covers a frequency band half (or twice) that of the one before, covering all but the very lowest frequencies. This is termed a dyadic filter bank. Each daughter wavelet can therefore be thought of as a bandpass filter that represents a specific frequency band, or scale, of interest.

Dyadic filter bank frequency response magnitudes (Smith, 2011)
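As a toy illustration of the dyadic idea (a hand-rolled sketch, not a library call), one level of a Haar DWT splits a signal into half-length approximation and detail coefficients, and each further level halves the frequency band again:

```r
# One level of a Haar discrete wavelet transform:
# pairwise scaled averages give the approximation (low-pass) coefficients,
# pairwise scaled differences give the detail (high-pass) coefficients.
haar_dwt_level <- function(x) {
  odd  <- x[seq(1, length(x), by = 2)]
  even <- x[seq(2, length(x), by = 2)]
  list(approx = (odd + even) / sqrt(2),
       detail = (odd - even) / sqrt(2))
}

x <- c(4, 6, 10, 12, 8, 6, 5, 5)
lvl1 <- haar_dwt_level(x)
lvl2 <- haar_dwt_level(lvl1$approx)  # next octave: half as many coefficients
```

Repeating this on the approximation coefficients builds the full dyadic decomposition.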

The correlation across time associated with the signal and each daughter wavelet can be plotted in a scalogram as shown below.

Mapping the wavelet scalogram (Shoeb & Clifford, 2006)

The coefficients of the wavelet transform indicate the correlation between the wavelet and the signal of interest at a specific time and scale. The squared amplitude of the wavelet coefficient, |W_i|^2, defines the wavelet power and can be used to create a wavelet power spectrum. A larger power corresponds to a higher correlation; therefore, the regions of high power in the spectrum correspond to areas of interest [5].

### WaveletComp

We will use the library WaveletComp in R to demonstrate how to find the wavelet power spectrum. Many wavelet libraries in Python and MATLAB work equivalently and just as simply, but I like WaveletComp due to the huge supplement in the package repository that contains many examples of how to use the functions. For this post, I took quarterly El Niño 3 Region (NINO3) SST anomalies from NOAA recorded over 1871–1996 and applied a single function from the package to create the spectrum.

library(WaveletComp)
my.w = analyze.wavelet(mydata, "Index",
                       loess.span = 0,
                       dt = 0.25, dj = 1/250,
                       lowerPeriod = 0.25,
                       upperPeriod = 32,
                       make.pval = TRUE, n.sim = 30)
wt.image(my.w, n.levels = 250,
         legend.params = list(lab = "wavelet power levels"))



Here we specify the name of the dataframe that contains the data (mydata), the column that contains the index (“Index”), the sampling interval in time units (dt = 0.25, since quarterly data gives four observations per year), and the upper and lower periods that bound the search area. The range of periods is generally represented as a series of octaves of 2^j, and each octave is divided into 250 sub-octaves based on the dj term. The make.pval = TRUE argument draws a white line around the areas that are deemed significant. Finally, we plot the wavelet object using wt.image.

Wavelet Power Spectrum

As seen in the power spectrum, the highest powers across the time series are recorded within the 2–7 year frequency bands, which matches the period at which El Niño events tend to occur. Interestingly, the El Niño signal seems strongest at the earliest and latest parts of the record and is notably less prominent between 1920 and 1960, which has been observed in other work (Torrence & Webster, 1999). The wavelet spectrum allows us to see how the periodicity of the El Niño signal has changed over time, with longer periods observed around 1920 and shorter periods closer to the beginning and end of the century.

Because wavelets are shifted across each time point in the convolution operation, the coefficients at both edges (within half of the wavelet duration at each scale) are not accurate. Smaller scales use shorter wavelet durations and are therefore affected over a narrower edge region, creating a “cone of influence” that is shown by the white shading on the edges of the plot. The cone of influence designates the areas of the plot that should be disregarded.

Coherence is a measure of the common oscillations that two signals share. Generally, coherence or cross-correlation is used to assess similarity in the time or frequency domain. However, if the two signals being compared are non-stationary, the correlation can change over time. Therefore, the coherence must be represented in a way that shows changes across frequency and time. We can create cross-wavelet plots in WaveletComp as well. I have extracted some data supplied by MathWorks which contains a monthly Niño3 index along with an average All-India Rainfall Index [6]. In order to assess how these two time series are linked, we use the analyze.coherency function in WaveletComp.

my.wc = analyze.coherency(data,
                          my.pair = c("Nino_Index", "Rainfall"),
                          loess.span = 0,
                          dt = 1/12, dj = 1/50,
                          lowerPeriod = 0.25, upperPeriod = 32,
                          make.pval = TRUE, n.sim = 10)

wc.image(my.wc, n.levels = 250,
         legend.params = list(lab = "cross-wavelet power levels"),
         color.key = "interval", show.date = TRUE)



Cross-Wavelet Power Spectrum

The maximum correlation aligns well within the 2–7 year bands that were observed in the above plot. The orientation of the arrows in the figure signifies the delay between the two signals at that time and period. The vertical-to-horizontal arrows denote a ¼–½ cycle delay within the significant areas, or a ½–3½ year lag between the El Niño SSTs observed off the coast of South America and their influence on rainfall in the Indian subcontinent. Pretty cool!

There is quite a bit of rigorous math that one should work through to truly understand the theory behind wavelets, but hopefully this post serves as a good introduction to how wavelets can be useful for your own analyses. When I first learned about wavelets and tried out WaveletComp, I immediately began conducting a wavelet analysis on every time series that I had on hand to look for hidden frequencies! I hope that this tutorial motivates you to explore wavelets for analyzing your non-stationary signals.

### Sources:

Baker, J. W. “Quantitative Classification of Near-Fault Ground Motions Using Wavelet Analysis.” Bulletin of the Seismological Society of America, vol. 97, no. 5, 2007, pp. 1486–1501., doi:10.1785/0120060255.

Deshpande, Jaidev. “Limitations of the Fourier Transform: Need For a Data Driven Approach.” Limitations of the Fourier Transform: Need For a Data Driven Approach – Pyhht 0.0.1 Documentation, 2015, pyhht.readthedocs.io/en/latest/tutorials/limitations_fourier.html.

Marković, Dejan, et al. “Time-Frequency Analysis: FFT and Wavelets.” DSP Architecture Design Essentials, 2012, pp. 145–170., doi:10.1007/978-1-4419-9660-2_8.

Smith, Julius O. Spectral Audio Signal Processing. W3K, 2011.

Torrence, Christopher, and Peter J. Webster. “Interdecadal Changes in the ENSO–Monsoon System.” Journal of Climate, vol. 12, no. 8, 1999, pp. 2679–2690., doi:10.1175/1520-0442(1999)012<2679:ICITEM>2.0.CO;2.

[5] “Wavelet Power Spectrum.” Wavelet Toolkit Architecture, Dartmouth, 16 June 2005, northstar-www.dartmouth.edu/doc/idl/html_6.2/Wavelet_Power_Spectrum.html.

[6] “Wcoherence.” MATLAB & Simulink Example, The MathWorks, Inc., 2020, http://www.mathworks.com/help/wavelet/examples/compare-time-frequency-content-in-signals-with-wavelet-coherence.html.

# Deeper Dive into Self-Organizing Maps (SOMs)

This blog post focuses on Self-Organizing Maps (SOMs), or Kohonen maps, first created by Teuvo Kohonen in the 1980s and most recently introduced in Dave Gold’s latest blog post. SOMs are a type of artificial neural network (ANN) that is trained using unsupervised learning to produce a two-dimensional representation of a higher-dimensional dataset. SOMs are therefore a good algorithm for dimension reduction, and inputs that are close in their higher-dimensional input space are also mapped close together in the resulting 2D representation [1].

SOMs can be very effective for clustering/classifying outputs and are advantageous because they “self-organize” to identify clusters. Instead of containing layers like traditional ANNs, the SOM is initialized as a user-defined 2D grid of neurons. The weight vectors that characterize the neurons can be randomly initialized or, more favorably, initialized by sampling along the eigenvectors of the two largest principal components (PCs) of the input dataset. The training process that SOMs use to build maps is called “competitive learning”, which is also fundamentally different from the backpropagation used for ANNs. During competitive learning, an input training point is fed into the algorithm. Its Euclidean distance from each of the neurons is calculated, and the closest neuron is termed the best matching unit, or BMU. The weights of the BMU and its surrounding neurons are adjusted to be closer to the input vector, using an update formula. This process is repeated for every input in the training data and for a large number of iterations until patterns and clusters begin to emerge [3]. The animation on the right illustrates competitive learning. Note how the locations of neurons are adjusted as training points get assigned to their respective BMUs and how the whole grid shifts in response to the new samples.
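The update formula mentioned above is conventionally written as:

```latex
w_i(t+1) \;=\; w_i(t) + \alpha(t)\, h_{b,i}(t)\, \bigl(x(t) - w_i(t)\bigr)
```

where x(t) is the current training input, α(t) is a learning rate that decays over the iterations, and h_{b,i}(t) is a neighborhood function (often Gaussian in grid distance) centered on the BMU b, which also shrinks as training proceeds.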

At a high level, this is how SOMs are trained. Next, we will go through a clustering example using the kohonen package offered in R.

### Test Case: Maryland Trout

For this test case, we will be using a presence/absence dataset of Maryland trout. Each data point is characterized by a feature vector that includes information on agricultural land cover, logarithmic distance to the nearest road, riffle quality, dissolved oxygen, and spring water temperature. The label associated with each point is a 0 or 1, denoting whether trout were observed. We will train a SOM to cluster the feature vectors and then investigate the characteristics of the resulting clusters. Below is a snippet of our dataset. We use columns 3–7 as our feature vector, while column 2 denotes our label. SOMs can also be used for classification, which is not discussed in this blog post; in that case, you would train the SOM on only a subset of the dataset and use the trained SOM to predict the clusters that the testing points are assigned to.

First, we must scale our data to make sure that variables in our feature vector with different magnitudes are treated equally by the algorithm. Then we define the SOM grid. The choice of the number of neurons generally follows the rule of thumb 5*sqrt(# of datapoints). In our case, we have 84 datapoints, so technically a grid of up to about 7×7 should work well. I decided to use 5×5, which will be explained in the next paragraph. I also specify that I want the grid to have a hexagonal topology. Finally, we train our model by feeding in our input data and the user-defined grid, and specifying rlen = 1000 iterations. This means that the input data will be presented 1000 times to the algorithm.

require(kohonen)
# Create the training dataset
data_train <- data[, c(3:7)]
# Change the data frame with training data to a matrix
# Also center and scale all variables to give them equal importance during
# the SOM training process.
data_train_matrix <- as.matrix(scale(data_train))
# Create the SOM grid
som_grid <- somgrid(xdim = 5, ydim = 5, topo = "hexagonal")
# Finally, train the SOM
som_model <- som(data_train_matrix,
                 grid = som_grid, rlen = 1000,
                 keep.data = TRUE)


Once the SOM is trained, there are a variety of figures that can be easily created. We can plot the training progress and observe how the distance to the BMU decreases as a function of iteration and starts to level off as we reach many iterations. This is ideal behavior because it means that the training points are associating closely with their BMU.

plot(som_model, type="changes")


We can also plot a heat map of the node counts, which shows how many of the training inputs are associated with each node. Any node that is grey means that no training points were assigned to that BMU. This may indicate that the map is too big, in which case you can go back and adjust the grid size accordingly until no grey nodes remain. For this dataset, that was around 5×5. Furthermore, we want the number of points associated with each node to be as uniform as possible.

plot(som_model, type="count", main="Node Counts")


Next we can view the neighbor distances for each neuron, which is also called the U-matrix. Small distances indicate similarity between neighboring nodes.

plot(som_model, type="dist.neighbours", main = "SOM neighbor distances")


We can also view the weight vectors of the nodes, which are termed codes. The fan in each node indicates the variables of prominence that are linking the datapoints assigned to that neuron. In our dataset, you can see that in the upper left corner of the map, the algorithm is assigning datapoints to these neurons because of the similarity of the agricultural area and water temperature variables in the feature vector. In the lower right corner, datapoints are linked primarily due to similarities in riffle quality and dissolved oxygen. This type of map is useful for understanding by which features the points are actually starting to cluster.

plot(som_model, type="codes")


In my opinion, one of the most useful standard figures that the kohonen library creates are heatmaps that show the gradient of input values across the nodes. A heatmap can be created for each feature and when placed side by side, they can help demonstrate correlation between inputs. We can clearly see that areas characterized by low agricultural land tend to have a better riffle quality, lower water temperature, and a higher dissolved oxygen. While these patterns are expected and have a strong physical explanation, these maps can also help to discover other unexpected patterns. Note that the values on the side bar are normalized and can be transformed back to the original scale.



#Plot heatmaps for each feature
plot(som_model, type = "property", property = getCodes(som_model)[,1], main=colnames(getCodes(som_model))[1],palette.name=magma)
plot(som_model, type = "property", property = getCodes(som_model)[,2], main=colnames(getCodes(som_model))[2],palette.name=magma)
plot(som_model, type = "property", property = getCodes(som_model)[,3], main=colnames(getCodes(som_model))[3],palette.name=magma)
plot(som_model, type = "property", property = getCodes(som_model)[,4], main=colnames(getCodes(som_model))[4],palette.name=magma)
plot(som_model, type = "property", property = getCodes(som_model)[,5], main=colnames(getCodes(som_model))[5],palette.name=magma)
#Unscale the variables- do this for each input column in data_train
var_unscaled <- aggregate(as.numeric(data_train[,1]), by=list(som_model$unit.classif), FUN=mean, simplify=TRUE)[,2]
plot(som_model, type = "property", property = var_unscaled, main=colnames(getCodes(som_model))[1], palette.name=magma)


Finally, we can cluster our neuron codes (weight vectors) to add some boundaries to our map. Typically, k-means clustering is used to determine an appropriate number of clusters that, in this case, minimizes the within-cluster sum of squares (WCSS). We can then plot the results and look for an elbow in the scree plot.

#k-means clustering code from [4]
mydata = som_model$codes[[1]]
wss = (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) {
  wss[i] = sum(kmeans(mydata, centers=i)$withinss)
}
par(mar=c(5.1,4.1,4.1,2.1))
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares",
     main="Within cluster sum of squares (WCSS)")


The elbow isn’t so clear in this dataset, but I go with 8 clusters and then use hierarchical clustering to assign each neuron to a cluster. Finally, we can obtain the cluster that each point is assigned to and insert that back as a column in our dataset.

#Final clustering code from [4]
#Define a palette and cluster with the number of clusters determined using k-means
#(an eighth color is added so that all 8 clusters have a color)
pretty_palette <- c("#1f77b4", '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f')
som_cluster <- cutree(hclust(dist(som_model$codes[[1]])), 8)
# plot these results:
plot(som_model, type="mapping", bgcol = pretty_palette[som_cluster], main = "Clusters")

# get vector with cluster value for each original data sample
cluster_assignment <- som_cluster[som_model$unit.classif]
# for ease of analysis, add the assignment as a column in the original data:
data$cluster <- cluster_assignment



### Post-Processing

We’re now interested in understanding which of our points actually clustered together, so we generate summary statistics for each cluster. The points highlighted in blue represent the best values for the feature while the points in red represent the worst. Note that the clusters with a large amount of trout presence on average are characterized by best values of features that are favorable for aquatic life, while the clusters that exhibit a lower presence of trout do not have good values for these features. You can also see that clusters with similar characteristics are side by side on the clustering map.
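One way to produce such per-cluster summaries is with aggregate. Below is a sketch with a hypothetical stand-in data frame; the real dataset has more feature columns and the column names here are invented, but aggregate works the same way on the full data:

```r
# Hypothetical data frame standing in for the trout dataset:
# a presence/absence label, two numeric features, and a cluster assignment.
data <- data.frame(trout   = c(1, 1, 0, 0),
                   do_mgL  = c(9.5, 9.1, 6.2, 5.8),  # dissolved oxygen
                   temp_C  = c(11, 12, 19, 20),       # spring water temperature
                   cluster = c(1, 1, 2, 2))

# Mean of every feature (and of the label itself) within each cluster
summary_stats <- aggregate(data[, c("trout", "do_mgL", "temp_C")],
                           by = list(cluster = data$cluster), FUN = mean)
print(summary_stats)
```

Scanning the resulting table row by row makes it easy to see which clusters pair high trout presence with favorable habitat features.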

Overall, I found the kohonen library to be a really useful tool to start to internalize how SOMs work and to fairly quickly be able to do a nice analysis on both the features and the resulting clusters. I would highly recommend this package as a starting point. There are packages in Python such as SimpSom and minisom that have snappier graphics and will be a good next step to explore.

References

[1]https://towardsdatascience.com/self-organizing-maps-ff5853a118d4

[2] By Chompinha – Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=77822988

[3]http://fourier.eng.hmc.edu/e161/lectures/som/node2.html

[4] Much direction for structuring this tutorial comes from https://www.shanelynn.ie/self-organising-maps-for-customer-segmentation-using-r/

# The New and Improved Borg R Wrapper

This semester, I have had the opportunity to build a multi-objective optimization model in R that uses the Borg evolutionary algorithm framework for the optimization component to discover tradeoffs of the modeled system. The R wrapper has historically been the most under-developed of the wrappers offered for Borg, which has been limiting for visiting students or groups on our campus in atmospheric or biological science that tend to use R. While fuller/faster Borg capabilities may be realized by transitioning to C++ or Python, I spent some time fixing up the R wrapper for serial Borg in an effort to help broaden our target users.

To request access to the Borg MOEA, complete step 2 of Jazmin’s introduction to Borg, found here.  Once you receive access, within the Borg folder, navigate to /plugins/R. You will find borg.R (the borg R wrapper), DTLZ2.R (a test function), and a script called test.R that calls the wrapper and performs an optimization.

### Setting up the Borg R Wrapper

The R wrapper pulls relevant functions and libraries from borg.c, borg.dll, and libborg.so, so these files must be in the same directory as the files above in order for the R wrapper to work. The R wrapper requires an R package called rdyncall, which allows R to call the C libraries that Borg uses. This package is unfortunately not offered within the current CRAN repository, so you can find an archived version of the package here and download it manually. In order to install a tar.gz package from a local file, you can use the following R command: install.packages("directory/to/package.tar.gz", repos = NULL, type = "source"). This command has variants that work specifically for Windows vs. Mac vs. Unix users and has given other students trouble in the past, so feel free to reach out if you can’t get this step to work. Once this is done, you can run test.R and get your optimization results. The way the wrapper is set up, it will return the decision variables and objective function values in a dataframe called “result”.

### Running Multiple Seeds

The R wrapper is set up to run only with Borg’s default parameterization, so you have to go into the wrapper and adjust Borg’s random state in order to use a different random initialization of Borg’s population or operator dynamics. Borg is a “seeded” algorithm, so the initialization of the algorithm is determined by the number shown in “rngstate”. This allows runs to be reproducible if the same seed is specified. In order to facilitate new seeds, you can just change this number. If we want to run 50 seeds, we wouldn’t want to change this number manually, so instead I subtract a variable “s” that will run from 1–50 depending on the seed.

We can feed this number in as an argument to specify seeds. Assuming you want to run this on the cluster, a bash script can be written to loop through the seeds and fed in as inputs into the test.R script as shown below:

The variable “s” must be defined at the top of your test.R script, pulled from the correct argument number. On my system, it happens to be the 6th value returned by commandArgs(). This may differ across computers, so it’s important to check that you’re pulling in the correct argument. Provided you are running the seeds in parallel, you’ll also want to save the result in a separate file for each seed. The arguments to the Borg call are: the number of variables, the number of objectives, the number of constraints, the name of the function that you are trying to optimize in DTLZ2.R (here “DTLZ2”), the NFE, and the epsilons for each objective.
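Inside test.R, the seed can be read from the command line and used to tag the output file. A sketch of that pattern is below; the argument handling and file naming are my own assumptions (using commandArgs(trailingOnly = TRUE) rather than a fixed index, since the index varies across systems), and the placeholder data frame stands in for the "result" dataframe returned by the wrapper:

```r
# Read the seed passed in from the bash loop, e.g. "Rscript test.R 7".
# trailingOnly = TRUE strips R's own arguments, so the seed is element 1;
# a default of 1 keeps the script runnable interactively.
args <- commandArgs(trailingOnly = TRUE)
s <- if (length(args) >= 1) as.numeric(args[1]) else 1

# ... Borg call goes here; `result` below is a placeholder for its output ...
result <- data.frame(obj1 = 0.1, obj2 = 0.9)

# Save the results to a seed-specific file so parallel runs don't collide
write.csv(result, file = paste0("borg_output_seed_", s, ".csv"), row.names = FALSE)
```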

As shown at the top of DTLZ2.R, you need to actually define nvars, nobjs, and nconstrs if you have them. Objective functions must be assigned as obj[1]=… etc., and the function must return the objectives (and constraints, if applicable) in a list.

### Creating Runtime Files

Another important tool that has been missing is the creation of runtime files which allow you to track how the Pareto front is evolving and to get a sense of convergence. In order to implement your own runtime files, you need to navigate to the part of the wrapper where the Borg MOEA is actually running (Lines 314-318). Inside this while loop, you need to implement the following statement block. First initialize the runtime file and write the problem characteristics at the time. Then, create a runtime vector that outlines the frequency that you want your output. Here I’m running 1000 FE and I want the runtime every 100 FE. Because Borg can be a little fickle and not print at the specific FE that you want, I specify a range of +2 around each value, so that if something doesn’t print quite at 100 FE, you’ll likely get output at 101 or 102 FE. When running jobs on the Cube, it’s important to have your output file stored somewhere that any external node can access; therefore I create my file in my scratch directory instead of my home directory.

When you land on a FE value in your output frequency range, you have to use the appropriate getter functions to first get the archive and get the variables and objectives out of that result. The NFE is printed along with the archive and then a # is added at the end to separate out solutions. An example runtime file is shown below.

This style of file might not be the easiest to pull data from, but it’s the structure that is necessary to be recognized by MOEAFramework to create a reference set and runtime metrics. Happy coding!

# Accessing a Virtual Machine in Red Cloud

This blog post is an introduction to Red Cloud- a cloud computing service that is maintained by Cornell’s Center for Advanced Computing (CAC). Red Cloud is a private research cloud and can only be accessed by those with a subscription, but exploratory accounts are available for free for Cornell students, faculty and staff.

Subscriptions to cloud systems such as Red Cloud allow access to a variety of remote computing resources within seconds. Users can request instances, or virtual machines (VMs), in a variety of configurations ranging from CPUs to GPUs with varying amounts of RAM. In Red Cloud, users can access instances with up to 28 cores and 240 GB of RAM. In this post, I’ll go through the very basic steps you need to access a VM through Red Cloud. These steps should generally apply to any cloud system that uses OpenStack as its cloud computing platform.

### Step 1: Accessing OpenStack

OpenStack is a cloud operating system that will allow us to access the Red Cloud resources through a simple web interface. Log in with your CAC username and password (for the Reed Group: your credentials to access the Cube). This will lead you to an overview page that shows your usage of the resources.

Click on the Images tab. This shows the virtual machines that are available for use. You can access machines that have Linux distributions such as CentOS (a -cuda suffix means the image can support GPUs) or Ubuntu. VMs usually have very minimal software installed, so there are also various images with pre-loaded software like Matlab.

OpenStack Overview Page

Available Images

### Step 2: Creating a Key Pair

A key pair needs to be set up before launching to allow secure access to your instance through SSH authentication. You can create a new key pair under the Key Pairs tab.

Creating a Key Pair (Source: CAC)

Give it a meaningful name and copy the private key to a text file. Change the extension to a .pem file and store it somewhere convenient on your computer.

### Step 3: Creating a Security Group

A security group lets you specify what internet traffic can come into (ingress) or go out of (egress) the instance. You can create your own, but for now, we will use Red Cloud’s default security group and add rules to that. Click on “Manage Rules”.

Overview of Security Groups

You’ll see a variety of Ingress/Egress rules already in the group.

Adding Rules to the Security Group

However, if you’re accessing a Linux-based VM, you will also need to allow SSH access. Click on “Add Rule”, choose “SSH” in the first drop-down menu, and then click “Add”. The SSH rule will now be listed as one of your rules. There are many options for rules, including restricting access to Cornell IP addresses only.

### Step 4: Launch an Instance

Now we have all the tools to launch an instance. Under the Compute and then Instances tab, you will have the option to launch an instance.

Launching an Instance

Under the Details tab, give your instance a name.

Under the Source tab, choose your image. I’ll go with the latest stable version of Ubuntu and then click the up arrow.

Choosing an Image

Then, choose your flavor. It’s recommended to start with the lowest RAM (8GB), especially if you are just starting to explore cloud computing. You can always scale up when your image is launched if you need to.

Choosing a Flavor

Under the Security Group tab, check to see that the default security group is selected. Then choose your key pair under the Key Pair tab. Great, now we can launch the instance by clicking the blue “Launch Instance” button.

The instance will go through a variety of tasks until it stabilizes at “Running”.

Instance Status

Now we can SSH into our remote server using the IP address that is listed for the instance. I’ll use MobaXterm to start a local terminal and navigate into the directory where I saved my private key. Use the following command, inserting the IP address of your instance and the name of your key.
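For a Linux VM launched from an Ubuntu image, the command takes the general form below. The key file name and IP address are placeholders for your own; on CentOS images the default user is typically centos rather than ubuntu.

```shell
# Placeholders: substitute your own key file name and the instance's IP address
chmod 400 my_key.pem                     # the private key must not be world-readable
ssh -i my_key.pem ubuntu@<instance-ip>   # "ubuntu" is the default user on Ubuntu images
```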

SSH-ing into the Ubuntu VM

Now we’ve entered into our Ubuntu machine and we can interact with it using the command line. Enjoy!

Ubuntu Image!

Once you are done, make sure to either shelve your instance (if you want your disk contents to be unchanged) or delete it. Even if the machine is idle, it holds resources that could otherwise be used by other users, and you will still be billed for it.

Shelving an Instance

In the next tutorial, I’ll describe Docker and how you can use VMs to run containerized code.

Acknowledgements: Much of the information shared in this tutorial comes from many conversations with staff at CAC, particularly Peter Vaillancourt and the Red Cloud wiki.

# So What’s the Rationale Behind the Water Programming Blog?

By Patrick M. Reed

In general, I’ve made an effort to keep this blog focused on the technical topics that have helped my students tackle various issues big and small. It helps with collaborative learning and maintaining our exploration of new ideas.

This post, however, represents a bit of a departure from our normal posts, in response to some requests for a suggested reading guide for my Fall 2019 AGU Paul A. Witherspoon Lecture entitled “Conflict, Coordination, and Control in Water Resources Systems Confronting Change” (on YouTube should you have interest).

The intent is to take a step back and zoom out a bit to get the bigger picture behind what we’re doing as a research group and much of the original motivation for initiating the Water Programming Blog itself. So below, I’ll provide a summary of papers related to the topics covered in the lecture, sequenced by the talk’s focal points. Hope this provides some interesting reading for folks. My intent here is to keep this informal: the highlighted reading resources were helpful to me and are not meant to be a formal review of any form.

So let’s first highlight Paul Witherspoon himself, a truly exceptional scientist and leader (7 minute marker, slides 2-3).

1. His biographical profile
2. A summary of his legacy
3. The LBL Memo Creating the Earth Sciences Division (an example of institutional change)

Next stop, do we understand coordination, control, and conflicting objectives in our institutionally complex river basins (10 minute marker, slides 6-8)? Some examples and a complex systems perspective.

1. The NY Times Bomb Cyclone Example
2. Interactive ProPublica and The Texas Tribune Interactive Boomtown, Flood Town (note this was written before Hurricane Harvey hit Houston)
3. A Perspective on Interactions, Multiple Stressors, and Complex Systems (NCA4)

Does your scientific workflow define the scope of your hypotheses? Or do your hypotheses define how you need to advance your workflow (13 minute marker, slide 9)? How should we collaborate and network in our science?

Perspectives and background on Artificial Intelligence (15 minute marker, slides 10-16)

The Wicked Problems Debate (~22 minute marker, slides 17-19) and the emergence of post normal science and decision making under deep uncertainty.

Lastly, the Vietnam and North Carolina application examples.

# Introduction to Unit Testing

## Motivation

Here is something that has happened to most of us engineers who code. We write a function or a class and make sure it works by running it and printing some internal state to the console. The following week, after adding or making improvements to seemingly unrelated parts of the code, the results start to look weird. After a few hours of investigation (a.k.a. wasted time), we find out that the reason for the weird results is that the first function or class is either not giving us the right numbers anymore or not behaving as expected with different inputs. I wonder how many papers would have been written, how many deadlines would have been met, and how many “this $@#%& does not @#$% work!!!” thoughts and comments to friends and on our codes themselves would have been avoided if we managed to get rid of such frustrating and lengthy situations.

In this post I will introduce a solution to this rather pervasive issue: unit testing, which amounts to ways of creating standard tests throughout the development of a code to avoid the mentioned setbacks and simplify debugging. I will also get into some of the specifics of unit testing for Python and demonstrate it with a reservoir routing example. Lastly, I will briefly mention a unit testing framework for C++.

## The General Idea

Unit test frameworks are libraries with functions (or macros) that allow for the creation of standard tests for individual parts of code, such as functions and classes, that can be run separately from the rest of the code. The advantages of using unit testing are mainly that (1) it eliminates the need to run the entire code with a full set of inputs simply to see if a specific function or class is working, which in some cases can get convoluted and take a really long time, (2) it allows for easy and fast debugging, given that if a certain test fails you know where the problem is, (3) it prevents you or someone else from breaking the code down the road by running the tests frequently as the code is developed, and (4) it prevents an error in one function or class from only surfacing in another, which complicates the debugging process. Of course, the more modular a code is, the easier it is to separately test its parts — a code comprised of just one 1,000-line-long function cannot be properly unit-tested, or rather can only have one unit test for the entire function.

## Example Code to be Tested

The first thing we need to perform unit testing is a code to be tested. Let’s then consider the reservoir routing code below, comprised of a reservoir class, a synthetic stream flow generation function, a function to run the simulation over time, and the main script:

import numpy as np
import matplotlib.pyplot as plt

class Reservoir:
    __capacity = -1
    __storage_area_curve = np.array([[], []])
    __evaporation_series = np.array([])
    __inflow_series = np.array([])
    __demand_series = np.array([])
    __stored_volume = np.array([])

    def __init__(self, storage_area_curve, evaporations, inflows, demands):
        """ Constructor for reservoir class

        Arguments:
            storage_area_curve {2D numpy matrix} -- Matrix with a
                row of storages and a row of areas
            evaporations {array/list} -- array of evaporations
            inflows {array/list} -- array of inflows
            demands {array/list} -- array of demands
        """

        self.__storage_area_curve = storage_area_curve
        assert(storage_area_curve.shape[0] == 2)
        assert(len(storage_area_curve[0]) == len(storage_area_curve[1]))

        self.__capacity = storage_area_curve[0, -1]

        n_weeks = len(demands)
        self.__stored_volume = np.ones(n_weeks, dtype=float) * self.__capacity
        self.__evaporation_series = evaporations
        self.__inflow_series = inflows
        self.__demand_series = demands

    def calculate_area(self, stored_volume):
        """ Calculates reservoir area based on its storage vs. area curve

        Arguments:
            stored_volume {float} -- current stored volume

        Returns:
            float -- reservoir area
        """

        storage_area_curve_T = self.__storage_area_curve.T

        if stored_volume > self.__capacity:
            print "Storage volume {} greater than capacity {}.".format(
                stored_volume, self.__capacity)
            raise ValueError

        for i in range(1, len(storage_area_curve_T)):
            s, a = storage_area_curve_T[i]
            if stored_volume < s:
                sm, am = storage_area_curve_T[i - 1]
                return am + (stored_volume - sm) / (s - sm) * (a - am)

        return a

    def mass_balance(self, upstream_flow, week):
        """ Performs mass balance on reservoir

        Arguments:
            upstream_flow {float} -- release from upstream reservoir
            week {int} -- week

        Returns:
            double, double -- reservoir release and unfulfilled
                demand (in case reservoir gets empty)
        """

        evaporation = self.__evaporation_series[week] *\
            self.calculate_area(self.__stored_volume[week - 1])
        new_stored_volume = self.__stored_volume[week - 1] + upstream_flow +\
            self.__inflow_series[week] - evaporation - self.__demand_series[week]

        release = 0
        unfulfilled_demand = 0

        if new_stored_volume > self.__capacity:
            release = new_stored_volume - self.__capacity
            new_stored_volume = self.__capacity
        elif new_stored_volume < 0.:
            unfulfilled_demand = -new_stored_volume
            new_stored_volume = 0.

        self.__stored_volume[week] = new_stored_volume

        return release, unfulfilled_demand

    def get_stored_volume_series(self):
        return self.__stored_volume

def run_mass_balance(reservoirs, n_weeks):
    upstream_flow = 0
    release = 0
    total_unfulfilled_demand = 0
    for week in range(1, n_weeks):
        for reservoir in reservoirs:
            release, unfulfilled_demand = reservoir.mass_balance(release, week)

def generate_streamflow(n_weeks, sin_amplitude, log_mu, log_sigma):
    """Log-normally distributed stream flow generator.

    Arguments:
        n_weeks {int} -- number of weeks of stream flows
        sin_amplitude {double} -- amplitude of log-mean sinusoid fluctuation
        log_mu {double} -- mean log-mean
        log_sigma {double} -- log-sigma

    Returns:
        {Numpy array} -- stream flow series
    """

    streamflows = np.zeros(n_weeks)

    for i in range(n_weeks):
        streamflow = np.random.randn() * log_sigma + log_mu *\
            (1. + sin_amplitude * np.sin(2. * np.pi / 52 * (i % 52)))
        streamflows[i] = np.exp(streamflow)

    return streamflows

if __name__ == "__main__":
    np.random.seed(41)
    n_weeks = 522
    sin_amplitude, log_mean, log_std = 1., 2.0, 1.8
    streamflows = generate_streamflow(n_weeks, sin_amplitude, log_mean, log_std)
    storage_area_curve = np.array([[0, 1000, 3000, 4000], [0, 400, 600, 900]])

    reservoir1 = Reservoir(storage_area_curve, np.random.rand(n_weeks) / 10,
                           streamflows, np.random.rand(n_weeks) * 60)

    run_mass_balance([reservoir1], n_weeks)

    plt.plot(reservoir1.get_stored_volume_series())
    plt.xlabel("Weeks [-]")
    plt.ylabel("Stored Volume [MG]")
    plt.title("Storage Over Time")
    plt.ylim(0, 4100)
    plt.show()


We have quite a few functions to be tested in the code above, but let’s concentrate on Reservoir.calculate_area and generate_streamflow. The first is a method of a class (Reservoir), while the second is a standalone function. Another difference between the two is that we know exactly what output to expect from Reservoir.calculate_area (area given storage), while generate_streamflow is set up to return random numbers, so we can only test their estimated statistical properties, which will change slightly every time we run the function. We will use the PyTest framework to perform unit testing on the reservoir routing code above.

## PyTest

PyTest is one among many available unit test frameworks for Python. The reason for its choice here is simply that PyTest is commonly used by Python developers. You can find the API (list of commands) here and more examples here.

### Basic Work Flow

Using PyTest is pretty simple. All you have to do is:

• Install PyTest with pip or conda (pip install pytest or conda install pytest)
• Create a file (preferably in the same directory as the file with the code you want to test) whose name either begins with “test_” or ends with “_test.py” — this is a must. In this example, the main code is called “reservoir_routing.py,” so I created the PyTest file in the same directory and named it “test_reservoir_routing.py.”
• In the unit test file (“test_reservoir_routing.py”), import from the main code file the functions and classes you want to test, the PyTest module pytest and any other packages you may use to write the tests, such as Numpy.
• Write the tests. Tests in PyTest are simply regular Python functions whose names begin with “test_” and that have assertions (“assert,” will get to that below) in them.
• On the Command Prompt, PowerShell, or Linux terminal, run the command “pytest.” PyTest will then look for all your files that begin with “test_” or end in “_test.py,” scan them for functions beginning with “test_,” which indicate a test function, run them and give you the results.

### Assertions

Assertions are the actual checks, the cores of the tests. Assertions are performed with the assert command and will return “pass” if the condition being asserted is true and “fail” otherwise. Below is an example of an assertion, which will pass if the function being tested returns the value on the right-hand side for the given arguments, and fail otherwise:

assert function_calculate_something(1, 'a', 5) == 100


Sometimes we are not sure of the value a function will return but have an approximate idea. This is the case of functions that perform stochastic calculations or calculate numerically approximate solutions — e.g. sampling from distributions, Newton-Raphson minimum finders, E-M algorithms, finite-differences PDE solvers, etc. In this case, we should perform our assertion against an approximate value with:

# assert if function returns 100 +- 2
assert function_calculate_something(1, 'a', 5) == pytest.approx(100., abs=2.)


or with:

# assert if function returns 100 +- 0.1 (or 100 +- 100 * 1e-3)
assert function_calculate_something(1, 'a', 5) == pytest.approx(100., rel=1e-3)


Sometimes (although, unfortunately, rarely) we add code to a function to check the validity of the arguments, intermediate calculations, or to handle exceptions during the function’s execution. We can also test all that with:

# Test if an exception (such as Exception) is properly raised
with pytest.raises(Exception):
    function_calculate_something(1345336, 'agfhdhdh', 54784574)


### Test File Example

Putting all the above together, below is a complete PyTest file with two tests (two test functions, each with whatever number of assertions):

from reservoir_routing import Reservoir, generate_streamflow
import numpy as np
import pytest

def test_calculate_area():
    n_weeks = 522
    storage_area_curve = np.array([[0, 500, 800, 1000], [0, 400, 600, 900]])

    reseroir1 = Reservoir(storage_area_curve, np.random.rand(n_weeks),
                          np.zeros(n_weeks), np.random.rand(n_weeks) * 20)

    # Test specific values of storages and areas
    assert reseroir1.calculate_area(500) == 400
    assert reseroir1.calculate_area(650) == 500
    assert reseroir1.calculate_area(1000) == 900

    # Test if exceptions are properly raised
    with pytest.raises(ValueError):
        reseroir1.calculate_area(-10)
        reseroir1.calculate_area(1e6)

def test_generate_streamflow():
    n_weeks = 100000
    log_mu = 7.8
    log_sigma = 0.5
    sin_amplitude = 1.
    streamflows = generate_streamflow(n_weeks, sin_amplitude, log_mu, log_sigma)

    whitened_log_mu = (np.log(streamflows[range(0, n_weeks, 52)]) - log_mu) / log_sigma

    # Test if whitened mean in log space is 0 +- 0.05
    assert np.mean(whitened_log_mu) == pytest.approx(0., abs=0.05)


If I open my Command Prompt, navigate to the folder where both the main code and the test Python files are located, and run the command pytest, I will get the output below:

F:\Dropbox\Bernardo\Blog posts>pytest
============================= test session starts =============================
platform win32 -- Python 2.7.13, pytest-4.2.0, py-1.7.0, pluggy-0.8.1
rootdir: F:\Dropbox\Bernardo\Blog posts, inifile:
collected 2 items

test_reservoir_routing.py F.                                             [100%]

================================== FAILURES ===================================
_____________________________ test_calculate_area _____________________________

    def test_calculate_area():
        n_weeks = 522
        storage_area_curve = np.array([[0, 500, 800, 1000], [0, 400, 600, 900]])

        reseroir1 = Reservoir(storage_area_curve, np.random.rand(n_weeks),
                              np.zeros(n_weeks), np.random.rand(n_weeks) * 20)

        # Test specific values of storages and areas
        assert reseroir1.calculate_area(500) == 400
>       assert reseroir1.calculate_area(650) == 500
E       assert 400 == 500
E        +  where 400 = <bound method Reservoir.calculate_area of <reservoir_routing.Reservoir instance at 0x...>>(650)
E        +    where <bound method Reservoir.calculate_area of <reservoir_routing.Reservoir instance at 0x...>> = reseroir1.calculate_area

test_reservoir_routing.py:14: AssertionError
========================== deprecated python version ==========================
You are using Python 2.7.13, which will no longer be supported in pytest 5.0
https://docs.pytest.org/en/latest/py27-py34-deprecation.html
===================== 1 failed, 1 passed in 1.07 seconds ======================

F:\Dropbox\Bernardo\Blog posts>


This seems like a lot just to say that one test passed and the other failed, but there is actually some quite useful information in this output. The lines below (lines 20 and 21 of the output above) indicate which assertion failed, so you can go ahead and debug the corresponding part of the code with a debugger:

>       assert reseroir1.calculate_area(650) == 500
E       assert 400 == 500


I actually found this error with PyTest as I was writing the reservoir routing code for this post and used VS Code, which has a debugger integrated with PyTest, to debug and fix the error. The error is happening because of an int to float conversion that is not being done correctly in line 45 of the original code, which can be replaced by the version below to fix the problem:

storage_area_curve_T = self.__storage_area_curve.T.astype(float)


The last piece of important information contained in the last few lines of the output is that support for Python 2.7 will be discontinued, as it is happening with other Python packages as well. It may be time to finally switch to Python 3.

### Practice Exercise

As a practice exercise, try writing a test function for the Reservoir.mass_balance function including cases in which the reservoir spills, gets empty, and is partly full.

## Unit Testing in C++

A framework for unit testing in C++ that is quite popular is the Catch2 framework. The idea is exactly the same: you have a code you want to test and use the framework to write test functions. An example of Catch2 being used to check a factorial function is shown below, with both the main and the test codes in the same file:

#define CATCH_CONFIG_MAIN  // This tells Catch to provide a main() - only do this in one cpp file
#include "catch.hpp"

unsigned int Factorial( unsigned int number ) {
    return number <= 1 ? number : Factorial(number-1)*number;
}

TEST_CASE( "Factorials are computed", "[factorial]" ) {
    REQUIRE( Factorial(1) == 1 );
    REQUIRE( Factorial(2) == 2 );
    REQUIRE( Factorial(3) == 6 );
    REQUIRE( Factorial(10) == 3628800 );
}


## Conclusion

Setting up and running unit testing is not rocket science. Also, with my test functions I do not have to use my actual reservoir system test case in order to check if the surface area calculations are correct. If I find they are not, I can just re-run the test function with a debugger (VS Code is great for that) and fix my function without having to deal with the rest of the code. You would be surprised by how many hours of debugging can be saved by taking the unit test approach. Lastly, be sure to run your tests after any change to your code you consider minimally significant (I know, this sounds vague). You never know when you mess something up by trying to fix or improve something else.

# Intro to Machine Learning Part 5: Bagging

In Part 3 of this series, it was stated that classification error can be due to bias, variance, or noise. There are different methods to address each type of classification error. This post will focus on bootstrap aggregation, or bagging, which aims to improve classification error due to variance.

As a reminder, expected test error can be decomposed as follows:

E[(hd(x) - y)^2] = E[(hd(x) - hbar(x))^2] + E[(hbar(x) - ybar(x))^2] + E[(ybar(x) - y)^2]
     (error)            (variance)               (bias^2)                   (noise)

where hd is the classifier trained on dataset D, hbar is the ideal expected classifier (the average of classifiers trained over all possible datasets), and ybar is the expected label. Now let’s just focus on the part of the testing error that is due to variance. From the equation, it is clear that if we want this error component to approach 0, then hd(x) must tend to hbar(x). We can use the Weak Law of Large Numbers (WLLN) to understand how this can happen. WLLN states that if you have a sample of independent and identically distributed (i.i.d.) random variables, as the sample size grows larger, the sample mean will tend toward the population mean.

If this is applied in a classification setting, averaging classifiers trained on m different datasets D_1, …, D_m:

hhat(x) = (1/m) * sum_{i=1}^{m} hd_i(x)

In this equation, a classifier is trained on each of the m datasets and then the results are averaged. By WLLN, hhat tends to hbar… but only as our number of datasets, m, approaches infinity. But we only have one dataset, D, so how can we get more datasets? Through bagging!
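The WLLN statement itself is easy to verify numerically. The quick illustration below is my own sketch, not part of the original lecture notes: the sample mean of i.i.d. draws gets closer to the population mean as the sample size grows.

```python
import numpy as np

np.random.seed(42)

mu = 3.0  # population mean of the i.i.d. draws

# Distance between sample mean and population mean for growing sample sizes
errors = []
for n in [10, 1000, 100000]:
    sample = np.random.normal(mu, 1.0, n)   # n i.i.d. random variables
    errors.append(abs(sample.mean() - mu))  # shrinks as n grows (WLLN)
```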

### Bagging

Fig 1: Bootstrapping Process

Let’s say our original dataset D comes from some distribution P, shown in Figure 1. We can simulate drawing from P by bootstrapping, or sampling our original dataset D with replacement. Because we are sampling with replacement, we are essentially creating new datasets out of our original one! Note that when we sample with replacement, we cannot say that the new samples are i.i.d., so the Weak Law of Large Numbers does not quite apply. But error due to variance can still be considerably reduced with this method.
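To make bagging concrete, here is a small sketch of my own (not from the lecture notes): bootstrap m datasets from an original dataset D, fit a trivially simple one-feature threshold “classifier” to each, and aggregate the m predictions by majority vote.

```python
import numpy as np

np.random.seed(0)

# Original dataset D: one feature, two well-separated classes
X = np.concatenate([np.random.normal(0., 1., 50), np.random.normal(5., 1., 50)])
y = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])

m = 25           # number of bootstrapped datasets
n = len(X)
thresholds = np.zeros(m)

# Bootstrap: sample D with replacement to simulate m draws from P, and fit
# a trivial "classifier" (a threshold halfway between the two class means)
# to each bootstrapped dataset
for i in range(m):
    idx = np.random.randint(0, n, size=n)   # indices sampled with replacement
    Xb, yb = X[idx], y[idx]
    thresholds[i] = (Xb[yb == 0].mean() + Xb[yb == 1].mean()) / 2.

def bagged_predict(x):
    # Each of the m classifiers votes; the majority (mode) label wins
    votes = (x > thresholds).astype(int)
    return int(votes.sum() > m / 2.)
```

Averaging over the m bootstrapped fits is what damps the variance of any single fit, which is exactly the error component bagging targets.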

### Application: Random Forests

Bagging is most popularly utilized for random forests which are bagged decision trees. A decision tree is trained on each of the m datasets and then either the mode label is assigned (in a classification setting), or an average value is calculated (in a regression setting). Random Forests are termed “out of the box” algorithms; they are very easy to use because 1) only 2 hyper-parameters have to be defined, m (the number of datasets) and k (a feature splitting parameter) and  2) features don’t need to be normalized to have the same units or scale, which is extremely helpful for any kind of heterogeneous data.

#### Coding up the Classifier

Random Forests can be implemented in every programming language. In Python, scikit-learn has a function called RandomForestClassifier. Only the number of trees in the forest needs to be specified, but there are many extra parameters that can be fine-tuned, including the criterion to measure the quality of a split (Gini vs. Entropy) and the maximum depth of each tree. The code below outlines a basic implementation of the classifier. A training dataset with “value” and “label” fields is read in and then fit with the RandomForestClassifier function that is trained on 200 trees. Then the classifier is used to predict the labels associated with each test set “value” and the prediction accuracy is calculated.


#import libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
#import training data
#fit classifier on training data (scikit-learn expects a 2D feature array,
#hence training[['value']] rather than training.value)
rfc1 = RandomForestClassifier(n_estimators=200, max_depth=8, criterion='entropy')
rfc1.fit(training[['value']], training.label)
#import testing data
#predict labels
predicted = rfc1.predict(test[['value']])
#calculate accuracy
acc = np.mean(predicted == test.label)



### Random Forests in Water Resources

The Water Resources field has seen an increasing machine learning presence for the past decade. However, the majority of research has been focused on implementation of artificial neural networks. Though just emerging in the field, random forests are becoming increasingly popular for both prediction and regression. Shortridge et al. (2016) test the ability of random forests (among other models such as GLMs and ANNs) to predict monthly streamflow for five rivers in the Lake Tana Basin in Ethiopia. Empirical models that estimate monthly streamflow as a function of climate conditions and agricultural land cover in each basin were developed for the period from 1961 to 2004. The predictive capabilities of each model were tested along with an investigation of their error structures. Random forests were found to better capture non-linear relationships than their ANN counterparts. Furthermore, they were much more parsimonious and physically interpretable. The paper also addresses the ability of these models to make meaningful predictions under uncertainty and a non-stationary climate.

#### References:

All of the material for the post comes from Kilian Weinberger’s CS4780 class notes and lectures found at: http://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote18.html

Shortridge, Julie E., et al. “Machine Learning Methods for Empirical Streamflow Simulation: a Comparison of Model Accuracy, Interpretability, and Uncertainty in Seasonal Watersheds.” Hydrology and Earth System Sciences, vol. 20, no. 7, 2016, pp. 2611–2628., doi:10.5194/hess-20-2611-2016.

# Introduction to DiscoveryDV

In this blog post, I will outline some of the capabilities of DiscoveryDV, a software created by Josh Kollat, a former PhD student of Dr. Reed and the CEO and founder of DecisionVis. DiscoveryDV allows users to visualize and explore tradeoffs in high dimensional data using pareto sort techniques, a variety of plotting tools, and animations. I will demonstrate these functionalities using a problem that Josh first explored when he was developing VIDEO (Visually Interactive Decision-Making and Design Using Evolutionary Multi-Objective Optimization), a precursor to DiscoveryDV. DiscoveryDV is not available for direct download, but you can request access here.

### Premise

The above linked paper uses a long-term groundwater monitoring case study to demonstrate how visualization tools can be helpful in exploring large multi-objective solution sets to aid in decision making. The case study premise is as follows: A PCE contamination plume from an underground storage tank is spreading through an aquifer. PCE data samples are available at 25 pre-determined well sampling locations and there are 1-3 sampling points along the vertical axis of the well that can also be sampled. The issue is addressed through a multi-objective optimization problem with four conflicting objectives.

1. Minimize Cost: Cost is equivalent to the number of wells chosen to be sampled.
2. Minimize Concentration Error: Concentration estimation error is calculated between a Kriged map of the plume utilizing all available sampling locations and a Kriged map of the plume which utilizes a subset of the sampling locations.
3. Minimize Uncertainty: The sum of the estimated variances attained for each estimation location.
4. Minimize Mass: Reflects the error between the total mass of PCE estimated by Kriging the domain based on all available well sampling locations, and the estimated mass of PCE obtained by Kriging the domain based on a subset of well sampling locations.

There are 33 million possible solutions to this optimization problem, so let’s use DiscoveryDV to dig in!

### Starting a Project in DiscoveryDV

Figure 1 is a screenshot of the DiscoveryDV interface which is very simple and easy to use. All data sets and plots can be accessed from the storyboard on the right.

Figure 1: DiscoveryDV Interface

There are many pre-loaded data sets in DiscoveryDV and we will be using the LTM example.

1. From the Examples tab in the toolbar, choose LTM. Notice that a new page has been created in your storyboard.
2. If you click the > by LTM, you will see the attributes for the project: data files, plots, brushing rules, snapshots, etc.

By clicking the example, DiscoveryDV will load in a CSV file with the pareto optimal results. There are 2569 pareto optimal solutions, so what solution do you choose? This is a very real dilemma for stakeholders. The first step is to visualize the pareto front.

### Creating a 3D Plot of the Pareto Surface

Since the problem formulation has 4 objectives, you should expect to see a 3D surface. We can plot the results of the 4th objective as the color of our surface.

In order to make this plot, you can use the N-Dimensional Plot tool. Right click on Plots in the storyboard and choose Create plot -> N-Dimensional Plot. Plot Cost, Error, and Risk on the X,Y, and Z axes respectively and Mass as color. You can rotate the graph and zoom in and out. You can show the utopia point by choosing it in the Glyphs menu. You can also save this image into a PowerPoint by choosing Share>Send To PowerPoint from the plot menu. Finally, you can create an animation, as shown in Figure 2.

Figure 2: Pareto Surface GIF

It is interesting to note that by solving for the high-order Pareto-optimal surface for four objectives, all the sub-problems based on three objectives, two objectives, or even a single objective are implicitly solved at the same time. It can be unwieldy to look at a whole pareto surface, so sometimes it is advantageous to just look at two-objective pareto fronts to gain a better understanding of the behavior of the objectives.

### Creating a 2D Plot to Showcase Two-Objective Tradeoffs

A quick way to visualize the 2D tradeoffs is by using the 2D scatter matrix plot (Figure 3). Along the diagonals are histograms that show the breakdown of solutions in bins. The rest of the plots show the tradeoffs. By hovering over a point, you can see what bin that point lies in along with its location on the other relevant 2D plots.

However, we are not necessarily interested in all of the solutions portrayed in the 2D scatter. Rather we would like to see the pareto fronts of the 2-objective formulations to better dissect what is happening in the 3D surface. The 2D scatter matrix doesn’t have pareto sorting functionality, so let’s look at one of these tradeoffs more closely using the XY Plot tool.

1. Right click on Plots in the storyboard and choose Create plot -> XY Plot.
2. Plot Cost and Error on the X and Y axis respectively
3. Perform a pareto sort. The dominated data points should turn transparent.
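Outside of DiscoveryDV, a two-objective pareto sort is straightforward to script. Here is a minimal sketch, assuming both objectives are minimized (the sample points are made up for illustration):

```python
def pareto_front(points):
    """Return the non-dominated subset of (cost, error) pairs (both minimized)."""
    front = []
    for p in points:
        # p is dominated if some q is no worse in both objectives
        # and strictly better in at least one
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

solutions = [(10, 30), (20, 15), (15, 20), (25, 25), (30, 5)]
print(pareto_front(solutions))  # (25, 25) drops out: it is dominated by (20, 15)
```

The dominated points are exactly the ones DiscoveryDV turns transparent after its one-click pareto sort.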

Figure 4: Two-objective Pareto front

From Figure 4, it is clear that increasing cost will decrease error, which is somewhat intuitive but maybe wasn’t clear from the 3D plot. This is the advantage of looking at two-objective tradeoffs. It is up to the decision maker to determine what criteria are important to them, and 2D plots can help them elicit these preferences. Is it acceptable to be cheap but have 35% error? When cost is low, increasing the cost marginally will decrease the error a lot, but when does it become too costly to minimize error? Pareto sorting on a 2D plot gives a clear idea of the optimal solutions for different combinations of objectives and an idea of the tradeoffs associated with those decisions. Furthermore, in DiscoveryDV, a pareto sort can be performed with a single click.

### Brushing to Elicit Preferences

At this point, the stakeholder should have a better understanding of the tradeoffs that exist, but we need some way to help narrow down the set of possible options. One way to do this is to brush solutions based on stakeholder preference. The stakeholder can decide on an acceptable range for each objective from the full ranges shown below.

| Objective | Range |
| --- | --- |
| Minimize Cost | 7-46 sampled locations |
| Minimize Concentration Error | 0-34.5 (Kg PCE per cubic meter of aquifer) |
| Minimize Uncertainty | 1284-1563 (Kg PCE per cubic meter of aquifer)² |
| Minimize Mass | 0-48.64 (Kg PCE) |

In order to implement the ranges, expand the brushing tab and add brushing rules for each objective. Adjust to your pre-specified ranges. On the “hiding” axis of the 3D plot, pick the “brushing” option. Only the solutions that meet the range requirements will remain (Figure 5).
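Conceptually, brushing is nothing more than range filtering. A minimal sketch in plain Python (the objective names and numbers here are hypothetical, chosen just for illustration):

```python
def brush(solutions, ranges):
    """Keep only solutions whose objectives all fall inside the brushed ranges."""
    return [
        s for s in solutions
        if all(lo <= s[obj] <= hi for obj, (lo, hi) in ranges.items())
    ]

# Hypothetical solutions and brushing rules for illustration
solutions = [
    {"cost": 10, "error": 5.0},
    {"cost": 40, "error": 1.0},
    {"cost": 20, "error": 12.0},
]
ranges = {"cost": (7, 30), "error": (0, 10)}
print(brush(solutions, ranges))  # only the first solution passes both filters
```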

Figure 5: Example of a Brushed Pareto Front

### Parallel Axis Plots to View Tradeoffs

Now let’s create a set from our brushed data for further inspection. Click the analysis menu option and then click Create Set From > Brushed on Data. On your storyboard, you will see a new member, A, under your set membership. Now we will use a new type of plot, a parallel axis plot, to view the tradeoffs among these solutions.

1. Right click on Plots in the storyboard and choose Create plot -> Parallel Coordinate Plot.
2. Place your objectives of interest on the plot: Cost, Error, Risk, and Mass. Perhaps the stakeholder is particularly interested in cost. Let’s place cost on the color axis for extra visibility.

Each line in the parallel axis plot represents one of our solutions and where the lines cross between two axes indicates a tradeoff. Now, on the hiding axis, specify “brushing”. This will narrow down your parallel axis plot to only your brushed solutions.
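A parallel axis plot implicitly rescales each objective onto a common vertical axis, typically [0, 1], so objectives with very different units can share one figure. A quick sketch of that min-max rescaling (the sample values are arbitrary):

```python
def rescale(values):
    """Min-max rescale a list of objective values onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

costs = [7, 20, 33, 46]
print(rescale(costs))  # endpoints map to 0.0 and 1.0
```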

Figure 6: Full and Brushed Parallel Axis Plots

You can also record an animation of your parallel axis plot and save it as a GIF (Figure 7)! In this GIF, I cycled through the solutions from lowest to highest cost.

Figure 7: Parallel Axis Plot GIF

From here, it is completely up to the decision maker to decide which solution is best. You can double click on solutions of interest in your parallel axis plot to mark them. They will show up as M1 and M2 in the marking tab of your storyboard. You can then mark them in your 3D plot using the marking axis and add captions for visualization purposes (Figure 8).

Figure 8: Brushed 3D Plot with Caption

A user would typically take on the order of hours to create these visualizations from scratch, but in DiscoveryDV, most graphs can be created with a few clicks of the mouse. This is especially useful when you are trying to quickly visualize large amounts of data. There is quite a bit more functionality that DiscoveryDV offers that wasn’t explored in this blog post, but once the user understands the interface, more complicated graphs can be made, such as Figure 9, which is the culminating figure from the linked paper.

Figure 9: Negotiated Design Selection using 2D Tradeoffs

# Intro to Machine Learning: Classification using K-NN

Rohini and I are taking Kilian Weinberger’s fantastic machine learning class here at Cornell this semester and I thought I’d use a couple blog posts to share what we’ve been learning. I hope these posts can serve as a look into the topics covered in this class and what is meant by the term “machine learning”, which is a hot topic these days. In these posts I’ll provide overviews of the topics we’re learning and some example code. There is way more material in this class than could ever be covered on this blog, and if you’re interested in learning more on the subject I’d suggest checking out the course web page, where all lecture video links and lecture notes (from which the content in this post is derived) are free and open to the public.

In this post we’ll start with a very simple algorithm that you’ve probably heard of, the K-Nearest Neighbors (K-NN) algorithm. We’ll look at how K-NN is performed, why it works and the theoretical limits of its performance. We’ll then examine cases where K-NN performs poorly and learn why. The concepts in this simple example will lay the foundation for the more sophisticated Machine Learning techniques that will be discussed in future posts.

## The Problem: Binary Classification

Say you have a set of data where each point has a number of features (attributes that define the point, such as an x-y coordinate) and a label such as red/blue or 1/-1. If we were given a new point and told its features but not its label, can we predict its label based on information from the data we have? We’ll call our original data set the training set, and the new point a test input.

Figure 1: Can we predict the classification for the new point (star) based on its features (x,y coordinates) and those of the training data?

## The K-NN algorithm

The K-NN algorithm is based on a simple idea, that points close to each other in the feature space are likely to have similar labels. This allows us to create a classification rule:

For a test input x, assign the most common label among its k most similar training inputs

Here we are defining similarity in terms of distance within the feature space. To put the above definition formally, for a new point x, the classification rule for its label y, h(x), is:

$h(x) = \text{mode}(\{y'' : (x'', y'') \in S_x\})$

where:

Sx is the set of k nearest neighbors of x:

$S_x \subseteq D \quad s.t. \enspace |S_x| = k$

and $\forall (x',y') \in D \setminus S_x,$

$dist(x,x') \geq \underset{(x'',y'') \in S_x}{\max} \; dist(x,x'')$

Figure 2: If k=5, the new point is classified as +1 because it has three +1 neighbors and two -1 neighbors.

The choice of k here is an important one: for noisy data with a smooth boundary between classes, a larger k will usually be preferable. Choosing k a priori is difficult; it should be done by finding the k that minimizes error on a validation set, then evaluating on a separate testing data set to check for overfitting.
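The classification rule above can be sketched in a few lines of numpy (the toy training data is made up to mirror Figure 2):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # indices of the k closest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                   # the mode of the labels

# Toy training set: three +1 points near the test point, two -1 points far away
X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [5, 6]])
y = np.array([1, 1, 1, -1, -1])
print(knn_predict(X, y, np.array([0.5, 0.5]), k=5))  # prints 1
```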

## Defining a distance metric

The concept of proximity is critical to K-NN: we define the set of nearest neighbors as the k closest points to our test point in the feature space. But how do we define closeness? When we think of distances in everyday life, we usually think of Euclidean distance:

$dist(x,z) = \sqrt{\sum_{r=1}^{d} (x_r-z_r)^2 }$

However, other distance metrics may be useful. In fact, the Euclidean distance is just one form of the more general Minkowski distance:

$dist(x,z) = \left(\sum_{r=1}^{d} |x_r-z_r|^p\right)^{1/p}$

Aside from the Euclidean distance (p=2), other common forms of the Minkowski distance include the taxicab norm (p=1) and the infinity norm (p=infinity).
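These special cases are easy to check numerically; a quick sketch:

```python
def minkowski(x, z, p):
    """Minkowski distance between two equal-length vectors."""
    if p == float("inf"):
        return max(abs(a - b) for a, b in zip(x, z))  # infinity norm
    return sum(abs(a - b) ** p for a, b in zip(x, z)) ** (1 / p)

print(minkowski([0, 0], [3, 4], 2))             # Euclidean: 5.0
print(minkowski([0, 0], [3, 4], 1))             # taxicab: 7.0
print(minkowski([0, 0], [3, 4], float("inf")))  # infinity norm: 4
```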

Another possible distance metric is the Mahalanobis distance, which accounts for correlation between points. For a detailed look into the use of Mahalanobis and Euclidean distance see Billy’s excellent post from earlier this year.

## But how good of a classifier is K-NN?

How should we evaluate the performance of a machine learning algorithm? We can calculate the error rate of our predictions over a testing set, but this alone does not tell the full story as it neglects the properties of the testing set itself. It is therefore useful to set theoretical lower and upper bounds for error of a potential model.

The upper bound on error constitutes a classifier that our constructed model should always beat. A useful upper bound is the best constant predictor, which will always predict the most common label in the training set for any new point. If you construct a model that is able to classify points correctly 55% of the time, but 60% of the points in your data set have a certain label, your model is pretty useless.
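As a quick sketch, the best constant predictor is just the mode of the training labels:

```python
from collections import Counter

def constant_predictor(y_train):
    """Baseline that always predicts the most common training label."""
    return Counter(y_train).most_common(1)[0][0]

y_train = [1, 1, 1, -1, -1]  # 60% of the labels are +1
print(constant_predictor(y_train))  # 1: a useful model must beat 60% accuracy here
```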

Establishing the upper bound on error is pretty straightforward, but how do we specify a lower bound? What is the theoretical limit to the performance of a classification algorithm? To answer this question, let’s assume we know the probability of label y given features x, P(y|x). We can then define the Bayes Optimal Classifier, hopt(x), as:

$h_{opt}(x) = y^* = \underset{y}{\operatorname{argmax}} \; P(y|x)$

The error rate of hopt(x) is then defined as:

$\epsilon_{BayesOpt} = 1-P(h_{opt}(x) \mid x) = 1-P(y^*|x)$

This error is the theoretical limit of our performance. If we knew precisely the probability of each label given features x, this is the best we could do to predict the label. So how does the K-NN algorithm stack up? Cover and Hart (1967) famously proved that the error rate of a 1-NN classifier is no more than twice the error of the Bayes optimal classifier (see the course notes for the full proof). Not bad!

## The Curse of Dimensionality

The K-NN algorithm rests on the assumption that similar points have similar labels. As we increase the dimensionality of our feature space however, finding “similar” points becomes increasingly difficult. As we increase the number of features we are comparing, all points in our data set become more distant from each other.

As an example, let’s take a look at the “similarity” between fellow Reed group member Rohini and me, based on a set of features I’ve chosen at random:

If we look at only three dimensions of features, Rohini and I are quite similar (in fact identical in terms of occupation, favorite class and current location). As we increase the number of features, however, our similarity decreases precipitously. The more features we examine, the more unique each of us gets. If we were to randomly sample these features from 100 people on campus, the similarity between Rohini and me in this feature set would likely not stand out from the sample. This is the curse of dimensionality: as the number of features becomes large, the distance between all points increases to a point where it is impossible to determine “similarity” between points. If we are unable to find neighbors that are “similar” to a test point, how can we assume any other point can be used to predict its label?
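A quick numerical illustration of this effect: if we sample random points in the unit hypercube, the average pairwise distance grows steadily with the number of features.

```python
import random

random.seed(0)

def avg_pairwise_dist(n_points, dim):
    """Average Euclidean distance between random points in the unit hypercube."""
    pts = [[random.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [
        sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
        for i, p in enumerate(pts) for q in pts[i + 1:]
    ]
    return sum(dists) / len(dists)

# Every point drifts away from every other point as dimensionality grows
for d in (2, 10, 100, 1000):
    print(d, round(avg_pairwise_dist(50, d), 2))
```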

For high dimensional data sets, techniques such as PCA may make K-NN feasible if the data has a lower dimensional structure. If this isn’t effective, we must seek other strategies for classification. While the curse of dimensionality dictates that all points in a large dimensional space are distant from each other, this concept does not hold for the distance between points and hyperplanes. The Perceptron Algorithm, which I’ll cover in the next post, exploits this idea to create linear classifiers that are effective on high dimensional data.