Welcome to our blog!

Welcome to Water Programming! This blog is by Pat Reed’s group at Cornell, who use computer programs to solve problems — Multiobjective Evolutionary Algorithms (MOEAs), simulation models, visualization, and other techniques. Use the search feature and categories on the right panel to find topics of interest. Feel free to comment, and contact us if you want to contribute posts.

To find software:  Please consult the Pat Reed group website, MOEAFramework.org, and BorgMOEA.org.

The MOEAFramework Setup Guide: A detailed guide is now available. The focus of the document is connecting an optimization problem written in C/C++ to MOEAFramework, which is written in Java.

The Borg MOEA Guide: We are currently writing a tutorial on how to use the C version of the Borg MOEA, which is being released to researchers here.

Call for contributors: We want this to be a community resource to share tips and tricks. Are you interested in contributing? Please email dfg42 “at” cornell.edu. You’ll need a WordPress.com account.

Introducing the R Package “biascorrection”

For variety of reasons, we need hydrological models for our short- and long-term predictions and planning.  However, it is no secret that these models always suffer from some degree of bias. This bias can stem from many different and often interacting sources. Some examples are biases in underlying model assumptions, missing processes, model parameters, calibration parameters, and imperfections in input data (Beven and Binley, 1992).

The question of how to use models, given all these uncertainties, has been an active area of research for at least 50 years and will probably remain so for the foreseeable future, but going through that is not the focus of this blog post.

In this post, I explain a technique called bias correction that is frequently used in an attempt to improve model predictions. I also introduce an R package for bias correction that I recently developed; the package is called “biascorrection.” Although most of the examples in this post are about hydrological models, the arguments and the R package might be useful in other disciplines, for example with atmospheric models that have been one of the hotspots of bias correction applications (for example, here, here and here). The reason is that the algorithm follows a series of simple mathematical procedures that can be applied to other questions and research areas.

What Do I Mean by Bias Correction?

Bias correction in the context of hydrological modeling usually means “manual” improvement of model simulations to ensure that they match the observed data. Here’s a simple example: Let’s say we know that our model tends to underestimate summer flow by 30%. Then we might reason that to improve our predictions, we could add 30% to our simulated summer flows. In this blog post, I use a more sophisticated method, but this example should give you an idea of what I’m talking about.

Bias Correction using Quantile Mapping

Quantile mapping is a popular bias correction method that has been used in various applications. In this method, we first create quantiles of observed and simulated data. After that, whenever we have a simulated value we can find its simulated quantile and replace it with the value of the closest observed quantile.

Figure 1 – A simplified workflow of the quantile mapping technique

We generally follow the following steps to do the bias correction (see here):

  • Monthly Quantiles

We need to have two monthly time series of observed and simulated streamflow. If both series use daily time steps, we must aggregate the daily values to monthly values first, to create the monthly quantiles. We then sort the observed and simulated streamflows and assign each value to a quantile. This process generates streamflow quantiles for each month (January, February, etc.).

  • Monthly Bias Correction

When the quantiles are ready, we can start from the first month of the simulated results and determine what quantile its values belong to. Then we can go to the same quantile of the observed values and use it instead of the simulated one. This creates a monthly bias-corrected stream.

  • Annual Adjustment

Hamlet and Lettenmaier 1999, (also here) argue that the monthly bias correction can dramatically change the magnitude of annual streamflow predictions. However, although hydrologic models usually perform poorly at monthly time steps, they are pretty good at capturing annual variations. Therefore, we tend to rely on them. We calculate the annual difference between the bias-corrected and simulated flows and then we apply that to each individual month. This way, we can make sure that while the monthly variations are consistent with the recorded streamflow, our model is still able to determine how the average annual flow looks.

  • Disaggregation to Daily

After we apply the annual adjustments, we can use simulated or historical observed values to disaggregate the monthly time series to a daily one. The “biascorrection” package provides two methods for doing that: (1) Rescaling the raw simulated daily time series to match the monthly bias-corrected values. For example, if the total simulated streamflow for a month is half of the bias-corrected values for that month, the disaggregation module multiplies the raw daily simulated values by two for that specific month. (2) Sampling from the daily observed historical times series. In this case, the model uses KNN to find some of the closest months in the historical period (in terms of average monthly values) and randomly selects one of them.

In some situations, in large river basins, several upstream stations contribute to a downstream station. Bias correction can add or remove water from the system, and that can cause spatial inconsistencies and water budget problems. To ensure that the basin-wide water balance isn’t violated in these cases, we start from the station furthest downstream station and move upward to make sure that the total water generated upstream of each station and the incremental flow between the stations sum up to the same total downstream flow. This is not included in our model.

In some situations, in large river basins, several upstream stations contribute to a downstream station. Bias correction can add or remove water from the system, and that can cause spatial inconsistencies and water budget problems. To ensure that the basin-wide water balance isn’t violated in these cases, we start from the station furthest downstream station and move upward to make sure that the total water generated upstream of each station and the incremental flow between the stations sum up to the same total downstream flow. This is not included in our model.

In climate change studies, if the historical, simulated, and observed quantiles do not differ widely from the projected future scenarios, we can use the historical quantiles. However, if the future values are fundamentally different from the historical time series, this might not be justifiable. In such a case, synthetic generation of future data may be the way to go. The R package doesn’t include this either.

There are just two quick disclaimers:

  1. The R package has been tested and seems to perform well, but its complete soundness is not guaranteed.
  2. This blog post and the R package only introduce this bias correction as a common practice; I do not endorse it as a remedy for all the problems of hydrological models. Readers should keep in mind that there are serious criticisms of bias correction (for example here). I will discuss some in the following sections.

Arguments Against Bias Correction

One of the main advantages of hydrological models is that they can simultaneously and, arguably, consistently simulate different water and energy balance processes. However, bias correction manually perturbs streamflow without taking into account its effects on other components of the system. Therefore, it takes away one of the main advantages of hydrological models. In some cases, this can even distort climate change signals.

The other problem is that bias correction tries only to match the overall, aggregate statistics of the simulated flow with the observed flow, although streamflow has many more attributes. For example, current bias-correction algorithms can systematically ignore extremes that occur in daily to weekly time steps.

The “biascorrection” Package

I recently developed an R package that can be used for bias correction in simulated streamflow. The package has four main functions. Its workflow is consistent with the four bias-correction steps described above: (1) quantile creation, (2) monthly quantile mapping, (3) annual adjustment, (4) disaggregation to daily. The package doesn’t have a prescribed unit, and it can be used in different applications that require bias correction.

How to Install “biascorrection” Package

The package is available on GitHub (here) and it can be installed using the following command:


You can also go to my “blog” folder on GitHub to download simulated and observed datasets of streamflow at inlet of the Arrow dam in British Columbia (AKA Keenleyside Dam). The simulated flow has been generated using the Variable Infiltration Capacity model (VIC), and the observed flow is from Bonneville Power Administration’s No Regulation-No Irrigation streamflow datasets.

observed_input<-read.table("sample_data/Arrow_observed.txt", header = T)
simulated_input<-read.table("sample_data/Arrow_simulated.txt", header = T)

Note that the two datasets have different starting and ending dates. I intentionally used them to show how biascorrection package handles these types of datasets.

However, I plotted the overlapping period of the two datasets to demonstrate the difference between them. The simulated data set tends to underestimate streamflow during low-flow periods and overestimate during the high-flow periods.

Figure 2 – Observed vs simulated inflow to Arrow dam

Monthly Bias-Corrections

To run the monthly bias-correction function, first, you need to define the following starting and ending dates of your observed and simulated data frames:


You also need to define the following two conditions:

date_type<-"JY" ## Water year (WY) or Julian Year (JY)

Finally, we can use the following to calculate the monthly bias corrected flow:


df_bc_month<-bias_correct_monthly(observed_flow, simulatred_flow, start_date_observed, end_date_observed, start_date_simulated, end_date_simulated, time_step, date_type)

Note that the format of observed and simulated inputs to the “bias_correct_monthly” function are not data frames. Here is how bias correction changes the simulated streamflow. As you see in the followin figure the bias-corrected flow doesn’t seem to have the underestimation problem during the low-flow anymore.

Figure 3- Monthly VIC simulated flow vs. bias-corrected flow

Annual Adjustment

You can use the following command to return the average annual flow back to what model originally simulated. Note that the inputs to the function is output of the monthly function.

df_bc_annual<-bias_correct_annual(df_bc_month$simulated, df_bc_month$bias_corrected, start_date_simulated, end_date_simulated)

Daily Disaggregation

The following function first uses the monthly function, then applies annual adjustment, and finally disaggregate the monthly streamflow to daily. The package has two options to convert monthly to daily: 1- it can multiply the simulated streamflow of each month by its bias-correction coefficient 2- it can use the K-nearest Neighbors method to find the closest month in the observed record

disaggregation_method<-"scaling_coeff" # "scaling_coeff" or "knn"

df_bc_daily<-bias_correct_daily(observed_flow, simulatred_flow, start_date_observed, end_date_observed, start_date_simulated, end_date_simulated, time_step, date_type, disaggregation_method)

Here is how the entire bias-correction procedure affect the simulated streamflow:

Figure 4 – Simulated vs. bias-corrected flow after annual adjustment and daily disaggregation

Known Issues and Future Plans

  • Currently the biascorrection package only accepts complete years. For example, if your year starts in January it has to end in December, and it can not continue to, let’s say, next February.
  • I am thinking about adding some more functionalities to the biascorrection package in a few months. Some built-in post-processing options, built-in example datasets, and at least one more bias-correction techniques are some of the options.
  • For the next version, I am also thinking about releasing the model through CRAN.

How to automate scripts on a cluster

There are several reasons why you might need to schedule or automate your scripts on a personal machine or a cluster:

  • You’re waiting for a job to finish before submitting another
  • You’d like to automate regular backups or cleanups of your data (e.g., move new data to another location or remove unnecessary output files)
  • You need to submit jobs to get around node limitations (e.g., you’d like to spread out the submissions over several days)
  • You need to retrieve regularly updated data (e.g., you have a model that uses daily precipitation data and you’d like to automatically collect them every day)

Cron is a utility program on Unix operating systems that allows you to schedule or repeat such tasks in the future. There’s a crontab file associated with every user in a cluster, where you’ll input all the information needed to schedule and automate your tasks. Note that not all clusters automatically allow their users to run cron jobs[1], for example, I can use it on the Reed Group’s Cube cluster, but not on XSEDE’s Comet.

To edit the crontab file associated with your user, type the following in your command line:

crontab -e

This will open a text editor (like Vim) which you can edit. To simply view your current crontab without editing, run:

crontab -l

Crontab syntax is made up of two parts: the timer indicating when to run and the command to run:


The timer accepts five fields, indicating the time and day for the command to run:

  • Minute — minute of the hour, from 0 to 59
  • Hour — hour of the day, from 0 to 23
  • Day of the month — day of the month, from 1 to 31
  • Month — month of the year, from 1 to 12
  • Day of the week — day of the week, from 0 to 7

For example the following would execute script.sh on January 2nd at 9:00AM:

0 9 2 1 * /home/user/scripts/script.sh

Special characters are naturally very useful here, as they allow multiple execution times or ranges:

Asterisk (*) — to use all scheduling parameters in a field, for example, run the script, every day at midnight:

0 0 * * * /home/user/scripts/script.sh

Comma (,) — to use more than one scheduling parameter in a field, for example, run the script every day at midnight and 12PM:

0 0,12 * * * /home/user/scripts/script.sh

Slash (/) — to create predetermined time intervals, for example, run the script every four hours:

0 */4 * * * /home/user/scripts/script.sh

Hyphen (-) — to determine a range of values in a field, for example, run the script every minute during the first 10 minutes of every hour, every day

0-10 * * * * /home/user/scripts/script.sh

Hyphens and slashes can be combined, for example, to run a script every 5 minutes during the first 30 minutes of every hour, every day:

0-30/5 * * * * /home/user/scripts/script.sh

Last (L) — this character can only be used in the day-of-the-month and day-of-the-week fields to specify the last occurrence of something, for example the last day of the month (which could differ):

0 9 L * * /home/user/scripts/script.sh

or, to specify constructs such as “the last Friday” of a every month:

0 9 * * 5L /home/user/scripts/script.sh

Weekday ( W) — this character is only allowed on the day-of-month field and is used to determine the closest weekday to that day of the month. For instance, using “15W” indicates to cron to run the script on the nearest weekday to the 15th day of the month. If the 15th is a Saturday, the script will be executed on Friday the 14th. If the 15th is a Sunday, the script will be executed on Monday the 16th. If the 15th is a weekday, the script will be executed on the same day:

0 0 15W * * /home/user/scripts/script.sh

Hash (#) — this character is only allowed in the day-of-week field and is used to specify constructs such as the second Friday of every month:

0 0 * * 5#2 /home/user/scripts/script.sh

Lastly, if you’d like to be notified whenever a script is executed you can use the MAILTO parameter, with your email address.

The important thing to remember when running cron on a cluster (as opposed to your own machine) is that it will launch a shell that with a new clean environment (i.e., without the environment variables that are automatically applied when you log on an interactive shell) and it will likely not be able to recognize some commands or where your modules are. This can be easily addressed by sourcing your bash_rc or bash_profile from your home directory before running anything. You also need to remember that it will launch at your home directory and you need to specify the absolute path of the scripts to be executed, or change directory before executing them.

For example my crontab file on the Reed Group cluster looks like this:

00 10 * * * . $HOME/.bashrc; cd /directory/where/my/project/is; git pull; sbatch ./script.sh
30 10 * * * . $HOME/.bashrc; cd /directory/where/my/project/is; git add . ; git commit -m 'fetched data'; git push

This does the following:
Every day at 10am it sources my bashrc profile so it knows all my environment variables. It changes to the directory of my project and pulls from git any new updates to that project. It then submits a script using sbatch. I get an email at the same time, with the text that would that would have appeared in my command line had I executed these commands in an interactive node (i.e., the git information and a line saying Submitted batch job xxxxx).
Then, every day at 10:30 am, I commit and push the new data back to git.

[1] If you’re just a regular user on a cluster you might need to request to be granted access. If you have root privileges (say, on a personal machine), you need to edit your cron allow and deny files:


Introduction to Wavelets

The post is a brief introduction and overview of wavelets. Wavelets are a powerful tool for time series analysis, de-noising, and data compression, but have recently exploded in fields like hydrology and geophysics. We are often interested in discovering periodic phenomena or underlying frequencies that are characteristic in a signal of interest, such as low-frequency teleconnections. We may also want to better understand very rapidly changing seismic signals during an earthquake or ocean wave dispersion. Wavelets can help us answer many questions in these applications!

Basic Background

The typical approach to understanding what frequencies are present in a signal involves transforming the time-domain signal into the frequency domain by means of the Fourier Transform. A well-known deficiency of the Fourier Transform is that is provides information about the frequencies that are present, but all information about the time at which these frequencies are present is lost. Thus, signals that are non-stationary, or exhibit changing frequencies over time cannot be analyzed by the Fourier Transform (Deshpande, 2015). One solution that was proposed to address this limitation was the Short-time Fourier Transform (STFT). In a very basic sense, the STFT involves applying a window function to segment the signal in time and then performing a Fourier transform on each segment. Then, the frequencies for each segment can be plotted to better understand the changing spectra (Smith, 2011). A considerable amount of research was spent on determining appropriate windowing functions between the 1940s and 1970s. However, limitations still existed with this approach. The STFT utilizes the same windowing function across the whole time series, which may not be appropriate for all the different frequencies that may be characterize a signal. When a signal has very high frequency content, a narrow window is preferable, but this results in poor frequency resolution. When a signal has lower frequency content, a wider window is preferable, but this results in poor time resolution. This tradeoff is often termed an example of the Heisenberg Uncertainty Principle (Marković et al, 2012).


The Short-Term Fourier Transform assumes that frequencies are present uniformly across the time series. Wavelet bases of different scales (frequency bands) can be influential at select times in the series.

In order to properly analyze signals that are non-stationary or transient (most signals of interest) and to appropriately address both large and small scale features in the frequency space, we can make use of wavelets! Wavelet transforms are especially useful for analyzing signals which have low frequencies for long durations and short frequencies for short durations. Furthermore, the basis functions are not restricted to only sinusoids, which can make it easier to approximate functions with sharper peaks or corners.

In a continuous wavelet transform, expressed below in Equation (1), the basis function or window is termed the “mother” wavelet, which is designated by Ψ .


This mother wavelet can be scaled and translated with the s and τ  parameters, respectively, to produce “daughter” or “child” wavelets, which are simply variations of the mother wavelet. The wavelets are not only translated across the signal but can be dilated or contracted to capture longer or shorter term events.

By definition, a continuous wavelet transform is a convolution of the input signal with the daughter wavelets, which is inherently a measure of similarity. The wavelet is “slid” over the signal of interest and similarities at each time point are measured. Therefore, the wavelet will be correlated with the parts of the series that contain a similar frequency.

The convolution with the selection of wavelets can lead to redundancy of information when the basis functions are not orthogonal. A family of wavelet basis functions have been developed to address this limitation. The simplest orthonormal wavelet is the Haar wavelet, which is a discrete, square shaped wavelet. The Haar wavelet is not continuous or differentiable, however it is particularly useful for approximating the response of systems that may experience a sudden transition. A Morlet wavelet is a complex sinusoid enclosed in a Gaussian envelope and may be more useful in applications such as music, hearing/vision, and for analyzing electrocardiograms.


Common Wavelets : a) Haar b) Gaussian c) Daubechies d) Morlet (Baker, 2007)

In a discrete wavelet transform (DWT), the translation and scale parameters, s and τ are discretized in such a way that each successive wavelet is twice the dimension as the one before, to cover all but very low frequencies. This is termed a dyadic filter bank. Each daughter wavelet can therefore be thought of as a bandpass filter that represents a specific frequency of interest or scale.


Dyadic filter bank frequency response magnitudes (Smith, 2011)

The correlation across time associated with the signal and each daughter wavelet can be plotted in a scalogram as shown below.


Mapping the wavelet scalogram (Shoeb & Clifford, 2006)

The coefficients of the wavelet indicate the correlation between the wavelet and the signal of interest at a specific time and scale. The amplitude squared of the wavelet coefficient,|Wi|2 defines the wavelet power and can be used to create a Wavelet Power Spectrum.  A larger power corresponds to a higher correlation, therefore, the regions of high power in the spectrum correspond to areas of interest5 .


We will use the library WaveletComp in R to demonstrate how to find the Wavelet Power Spectrum. Many wavelet libraries exist in Python and MATLAB work equivalently and just as simply, but I like WaveletComp due the huge supplement in the package repository that contains many examples on how to use the functions. For this post, I took quarterly El Niño 3 Region (NINO3) SST surface anomalies from NOAA recorded over 1871-1996 and applied a single function from the package to create the spectrum.

my.w = analyze.wavelet(mydata, "Index",
loess.span = 0,
dt = 0.25, dj = 1/250,
lowerPeriod = 0.25,
upperPeriod = 32,
make.pval = TRUE, n.sim = 30)
wt.image(my.w, n.levels = 250,
legend.params = list(lab = "wavelet power levels") )

Here we specify the name of the dataframe that contains the data (mydata), the column that contains the index (“Index”), the number of observations per unit time (dt=0.25 due to the quarterly resolution) and the upper and lower period that bounds the search area. The range of periods is generally represented as a series of octaves of 2j and each octave is divided into 250 sub-octaves based on the dj term. The “make.pvalue=TRUE” argument draws a white line around the areas that are deemed significant. Finally, we plot the wavelet objecive using wt.image.


Wavelet Power Spectrum

As seen in the power spectrum, the highest powers across the time series are recorded within the 2-7 year frequency bands, which matches with the period that El Niño events tend to occur. Interestingly, the El Niño signal seems strongest at the earliest and later parts of the decade and is notably less prominent between the years of 1920-1960, which has been observed in other work (Torrence & Webster, 1997). The wavelet spectrum allows us to see how the periodicity of the El Niño signal has changed over time, with longer periods being observed around 1920 and shorter periods closer to the beginning and end of the century.

Because wavelets are shifted across each time-point in the convolution operation, the coefficients at both edges (half of  the wavelet duration at each frequency) are not accurate. The smaller frequencies utilize smaller wavelet durations, creating a “cone of influence” that is shown by the white shading on the edges of the plot. The cone of influence designates the areas of the plot that should be disregarded.

Coherence is a measure of the common oscillations that two signals share. Generally, coherence or cross-correlation is used to assess similarity in the time or frequency domain. However, if the two signals being compared are non-stationary, the correlation can change over time. Therefore, the coherence must be represented in a way to show changes across frequency and time. We can create cross-correlation plots in WaveletComp as well. I have extracted some data supplied by MathWorks which contains a monthly Niño3 index along with an average All-India Rainfall Index6. In order to assess the how these two time series are linked, we use the analyze.coherency function in WaveletComp.

my.wc = analyze.coherency(data,
my.pair = c("Nino_Index", "Rainfall"),
loess.span = 0,
dt = 1/12, dj = 1/50,
lowerPeriod = 0.25, upperPeriod = 32,
make.pval = TRUE, n.sim = 10)

wc.image(my.wc, n.levels = 250,
legend.params = list(lab = "cross-wavelet power levels"),
color.key = "interval",show.date=TRUE)


Cross-Wavelet Power Spectrum

The maximum correlation aligns well within the 2-7 year bands that were observed in the above plot. The orientation of the arrows in the figure signify the delay between the two signals at that time period, The vertical to horizontal arrows denote a ¼ – ½ cycle delay within the significant areas or ½-3.5 years of delay between the El Niño SST’s observed off the coast of South America to influence rainfall in the Indian subcontinent. Pretty cool!

There is quite a bit of solid math and proofs that one should go through to truly understand the theory behind wavelets, but hopefully this post serves as a good introduction to show how wavelets can be useful for your respective analysis. When I first learned about wavelets and tried out WaveletComp, I immediately began conducting a wavelet analysis on every time series that I had on hand to look for hidden frequencies! I hope that this tutorial motivates you to explore wavelets for analyzing your non-stationary signals.


Baker, J. W. “Quantitative Classification of Near-Fault Ground Motions Using Wavelet Analysis.” Bulletin of the Seismological Society of America, vol. 97, no. 5, 2007, pp. 1486–1501., doi:10.1785/0120060255.

Deshpande, Jaidev. “Limitations of the Fourier Transform: Need For a Data Driven Approach.” Limitations of the Fourier Transform: Need For a Data Driven Approach – Pyhht 0.0.1 Documentation, 2015, pyhht.readthedocs.io/en/latest/tutorials/limitations_fourier.html.

Marković, Dejan, et al. “Time-Frequency Analysis: FFT and Wavelets.” DSP Architecture Design Essentials, 2012, pp. 145–170., doi:10.1007/978-1-4419-9660-2_8.

Smith, Julius O. Spectral Audio Signal Processing. W3K, 2011.

Torrence, Christopher, and Peter J. Webster. “Interdecadal Changes in the ENSO–Monsoon System.” Journal of Climate, vol. 12, no. 8, 1999, pp. 2679–2690., doi:10.1175/1520-0442(1999)0122.0.co;2.

[5] “Wavelet Power Spectrum.” Wavelet Toolkit Architecture, Dartmouth, 16 June 2005, northstar-www.dartmouth.edu/doc/idl/html_6.2/Wavelet_Power_Spectrum.html.

[6] “Wcoherence.” MATLAB & Simulink Example, The MathWorks, Inc., 2020, http://www.mathworks.com/help/wavelet/examples/compare-time-frequency-content-in-signals-with-wavelet-coherence.html.

Everything you want to know about subplots in Python’s Matplotlib

Sometimes its helpful to get back to basics. I recently created a summer course on data visualization with Python, and the experience made me realize that the workings of Python’s main visualization library, Matplotlib, are often left of out formal Python courses. Over the course of my graduate career I’ve taken several courses on programming and Python, and each time visualization or plotting was viewed as an afterthought. I don’t think my experience is uncommon, creating basic visualizations in Python is fairly straight forward, and a quick Google search will yield tutorials on how to make most common plot types. While constructing the summer course, I realized how much time I could have saved if I had learned how Matplotlib worked earlier in my PhD. With this in mind, I’m writing this post to serve as a guide for those new to plotting with Matplotlib, and to help fill in some gaps for those who are already experienced Matplotlib users.

I’ve chosen to devote this post to making subplots, a task often necessary for scientific visualizations that can be surprisingly frustrating if you don’t understand Matplotlib’s structure. Since Python is an open source language, there are multiple ways of creating and working with subplots, so in this post I’ll outline a few ways that work for me, and provide some context about how things work behind the scenes in Matplotlib.

A brief introduction to Matplotlib’s object oriented syntax

Let’s start with some history. As it’s name suggests, Matplotlib was originally created as a means of replicating Matlab style plotting functionality in Python. As such, one way we can create visualizations using Matplotlib is through “Matlab style” syntax which is contained within Matplotlib’s flagship module, Pyplot. For example, to create a simple line plot we can use the following code:

import matplotlib.pyplot as plt
import numpy as np
x = np.arange(0,10)
y = np.arange(0, 10)


Which will generate this plot:

While the Matlab style syntax is easy to use, it is actually quite limiting in its ability to create custom visualizations. Luckily, the Matlab style syntax is simply hiding an object oriented code structure that is at the core of Matplotlib. By directly accessing and working with this object oriented structure we can create highly customized visualizations.

For the purposes of this blog, there are two Matplotlib objects that are important: Figure objects and Axes objects. Figure objects represent the containers that hold a visualization. I think of this as the blank canvas that the plots will be generated on. Figure objects can contain one or multiple Axes objects, each representing an individual chart (this can be a big source of confusion when learning Matplotlib, the term “axes object” is inherited from Matlab and refers to an entire chart rather than a x or y axis) . To create the plot above using Matplotlib’s object oriented syntax we need to make a few small modifications to the original code. First, we’ll generate a Figure object using Pyplot which will automatically create an axes object behind the scenes. Next we access this Axes object using the Figure object’s “gca” method, which stands for “get current axes”. Finally, we’ll use the Axes object to generate the plot:

fig = plt.figure()
ax = plt.gca()

This will make the identical plot as above:

Note that we can use the Axes object to create any kind of plot that can be created with the Matlab style syntax. Examples include ax.plot (line plot), ax.scatter (scatter plot), ax.bar (bar plot), ax.imshow (heatmaps and color mesh) etc.

Creating Basic subplots

Now that we have some insight into the object oriented structure of Matplotlib, we can start making some subplots. There are two main ways we can make subplots with Matplotlib, the first is to use Pyplot’s “subplots” function, which allows us to specify the number of rows and columns of plots in your figure. This function will return both a Figure object and a list of Axes objects. Note that the list of Axes objects is arranged in the same way as the subplots are within the figure (i.e. list entry [0,0] is the top left subplot):

fig, [[ax0, ax1],[ax2, ax3]] = plt.subplots(nrows=2, ncols=2)
ax0.text(0.5, 0.5, "This is axes object 0", ha='center')
ax1.text(0.5, 0.5, "This is axes object 1", ha='center')
ax2.text(0.5, 0.5, "This is axes object 2", ha='center')
ax3.text(0.5, 0.5, "This is axes object 3", ha='center')

Which will make the following set of subplots:

Instead of specifying individual names of each axes, we can alternatively store them in a single named list and axes them via indices like this:

fig, axes = plt.subplots(nrows=2, ncols=2)
axes[0,0].text(0.5, 0.5, "This is axes object 0", ha='center')
axes[0,1].text(0.5, 0.5, "This is axes object 1", ha='center')
axes[1,0].text(0.5, 0.5, "This is axes object 2", ha='center')
axes[1,1].text(0.5, 0.5, "This is axes object 3", ha='center')

An alternative way to create subplots is to first make the Figure object on its own, and then add subplots one at a time using the method “add_subplot” from the figure object. This method takes one argument, a three digit number. The first digit represents the number of rows, the second represents the number of columns and the third represents the placement of the individual subplot (the location from left to right, top to bottom, with the first placement being the top left and the last being the lower right. Oddly, this is 1 indexed unlike everything else in the entire Python language):

fig = plt.figure()
ax0 =fig.add_subplot(221)
ax0.text(0.5, 0.5, "This is axes object 0", ha='center')
ax1 =fig.add_subplot(222)
ax1.text(0.5, 0.5, "This is axes object 1", ha='center')
ax2 =fig.add_subplot(223)
ax2.text(0.5, 0.5, "This is axes object 2", ha='center')
ax3 =fig.add_subplot(224)
ax3.text(0.5, 0.5, "This is axes object 3", ha='center')

This script will make the identical set of subplots as shown above:

Giving your plots some room to breath

You may have noticed that the plots above are very cramped. By default, there is very little room between adjacent subplots. One remedy is to use Matplotlib’s tight_layout function, which will automatically fit your subplots into the figure. Just add this one line after creating your subplots. For demonstration, I’ll also add titles and axes labels to each subplot:

fig, [[ax0, ax1],[ax2, ax3]] = plt.subplots(nrows=2, ncols=2)
ax0.text(0.5, 0.5, "This is axes object 0", ha='center')
ax0.set_title('Title 0')
ax1.text(0.5, 0.5, "This is axes object 1", ha='center')
ax1.set_title('Title 1')
ax2.text(0.5, 0.5, "This is axes object 2", ha='center')
ax2.set_title('Title 2')
ax3.text(0.5, 0.5, "This is axes object 3", ha='center')
ax3.set_title('Title 3')

This will create the following set of plots:

If we want to add some extra padding between subplots, we can add some arguments to override the default tight_layout parameters. The argument “pad” adds padding between subplots and the figure boarders, while “h_pad” and “w_pad” add height and width padding between subplots. The units of this padding are in percentages of the default font size.

plt.tight_layout(pad=1.5, h_pad=2.5, w_pad=2)

Note that to add the necessary padding, tight_layout will make the subplots themselves smaller and smaller. To fix this, we can increase the size of the figure object using the “figsize” argument when we create the Figure object. The units of this function are in inches:

fig, [[ax0, ax1],[ax2, ax3]] = plt.subplots(nrows=2, ncols=2, figsize=(8,6))

Before moving on, I should note that the function: plt.subplots_adjust() has very similar functionality to tight_layout and can allow you to adjust left, right, bottom and top paddings individually. For the sake of brevity I’m omitting it here.

Getting fancy: creating subplots of different sizes

So far we’ve been creating subplots of uniform size. In practice however, it can be helpful to generate plots of varying sizes. We can do this using the Gridspec class. Gridspec will create a grid of subplot locations within a Figure object. When generating subplots, we can assign each a location and size in Gridspec coordinates. Importantly, a single subplot can span multiple rows or columns of Gridspec coordinates. Below, I’ll make a 2×3 Gridspec and use it to create different sized subplots.

fig = plt.figure(figsize=(8,6))
gspec = fig.add_gridspec(nrows=2, ncols=3)

# the first subplot will span one row and two columns
# it will start at the top left
ax0 = fig.add_subplot(gspec[0,:2])
ax0.text(0.5, 0.5, "This is axes object 0, \ngspec coordinates [0,:2]", ha='center')

# the second subplot will span one row and one column
# it will start at the bottom left
ax1 = fig.add_subplot(gspec[1,0])
ax1.text(0.5, 0.5, "This is axes object 1, \ngspec coordinates [1,0]", ha='center')

# the third subplot will span one row and one column
# it will start at the bottom middle
ax2 = fig.add_subplot(gspec[1,1])
ax2.text(0.5, 0.5, "This is axes object 2, \ngspec coordinates [1,1]", ha='center')

# the fourth subplot will span two rows and one column
# it will start at the top right
ax3 = fig.add_subplot(gspec[:,2])
ax3.text(0.5, 0.5, "This is axes object 3, \ngspec coordinates [:,2]", ha='center')


Which will make this plot:

We can also customize the height and width of each Gridspec coordinate using the arguments “height_ratios” and “width_ratios” respectively when creating the Gridspec object.

fig = plt.figure(figsize=(8,6))
# parameters to specify the width and height ratios between rows and columns
widths= [1, 1.5, 2]
heights = [1, .5]

gspec = fig.add_gridspec(ncols=3, nrows=2, width_ratios = widths, height_ratios = heights)

# the first subplot will span one row and two columns
# it will start at the top left
ax0 = fig.add_subplot(gspec[0,:2])
ax0.text(0.5, 0.5, "This is axes object 0, \ngspec coordinates [0,:2]", ha='center')

# the second subplot will span one row and one column
# it will start at the bottom left
ax1 = fig.add_subplot(gspec[1,0])
ax1.text(0.5, 0.5, "This is axes object 1, \ngspec coordinates [1,0]", ha='center')

# the third subplot will span one row and one column
# it will start at the bottom middle
ax2 = fig.add_subplot(gspec[1,1])
ax2.text(0.5, 0.5, "This is axes object 2, \ngspec coordinates [1,1]", ha='center')

# the fourth subplot will span two rows and one column
# it will start at the top right
ax3 = fig.add_subplot(gspec[:,2])
ax3.text(0.5, 0.5, "This is axes object 3, \ngspec coordinates [:,2]", ha='center')


Final thoughts

There are many more ways we can customize subplots in Matplotlib, but the material in this post suits my needs 99% of the time. For further reading, check out the Matplotlib documentation and examples: https://matplotlib.org/gallery/subplots_axes_and_figures/subplots_demo.html#sphx-glr-gallery-subplots-axes-and-figures-subplots-demo-py

Spatial statistics (Part 2): Spatial Regression Models

Regression is one of the main techniques of data analysis. A regression model that can incorporate spatial dependency in a dependent variable is called a spatial regression model. It can be used as a simple surrogate model for prediction when the data are not available for some locations, or for understanding the factors behind patterns. In this blogpost, I am going to create a simple regression model for a crop yield, check the residuals for signs of relationships with nearby areas, and try to remove the potential spatial dependencies in the residuals by applying a spatial regression model. The autocorrelation in the residuals is a sign that the underlying process being studied varies systematically across the study area. In this situation, the resulting estimates of a fitted model are biased. Spatial regression models have applications in different fields such as agriculture (e.g., farm management, policy issues), natural sciences (e.g., species patterns), public health (e.g., air pollution), and social sciences (e.g., forecast population).The datasets that I am going to use (ww.* and WW_ave_hist.txt ) are available here. The .txt file includes historical winter wheat yield for some locations (4*4 km grid cells) with distinct IDs, and the “ww” shapefile (which includes 6 files) has some information for each location based on its ID as well. We will merge these two files, apply linear regression, and check whether we can use some explanatory variables from our data (predictors) to explain the variation in yield (dependent variable) across the region. We assume that we can predict yield by knowing annual potential evapotranspiration, precipitation, and available water in the soil profile.

setwd("---your path--- ")
Annual_var<- readOGR(".","ww") # This is  SpatialPolygonsDataFrame objects that brings the spatial representations of the polygons with the  data.
yield_ww<- read.table("---your path---/ww_ave_hist.txt",header = T)
yield<- merge(Annual_var,yield_ww,by="ID")
names (yield)
# fit the linear model
lm_yield <- lm(ww_ave_yield ~  ET_pot  +  precipitat +soil_water, data=yield) 
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5058.6326   291.9022   17.33   <2e-16 ***
ET_pot        -5.0868     0.2147  -23.70   <2e-16 ***
precipitat    12.8215     0.1626   78.86   <2e-16 ***
soil_water    -5.9440     0.5295  -11.23   <2e-16 ***
Adjusted R-squared:  0.9035

After fitting a regression, and checking the coefficients to ensure that all the variables are statistically significant, one should check the the residuals to make sure they are independent. If there is any correlation, it means that our regression’s coefficient estimates could be wrong and they can not be  assumed constant across the study area. Therefore, the fitted model is not a good representation of the dependent variable.From our fitted model, we can extract the residuals:

yield$residuals_lm <- residuals(lm_yield)

Here, we look at the spatial patterning in the distribution of the residuals:

yield %>% as.data.frame %>% 
  ggplot(aes(LON, LAT)) + geom_tile(aes(fill=residuals_lm), alpha=3/4) + 
  ggtitle("") + coord_equal() + theme_bw()+scale_fill_gradientn(colours=c("black","yellow","red"),breaks=seq(-1660,1360,400)) 

Beside the visual inspection of the residuals, a more formal test would be required to decide whether spatial autocorrelation is present. In the first part of this blogpost, we used the linkages based on the physical distance to examine the spatial autocorrelation. Similar as before, we are going to create a list of neighbors using the Queen criteria. In my previous blog post, we calculated the local Moran’s I test statistic for the actual data. Here, we want to apply it on the residuals, so we need to use another function that takes into account that the variable under consideration is a residual of a regression. Also, in this example, we look at the global rather than the local Moran’s which is based on both feature locations and feature values simultaneously.

neighbor <- poly2nb(yield)
lw <- nb2listw(neighbor)
lm.morantest(lm_yield, lw) 
# 0.7908136752

The result shows statistically significant value for Moran’s I. We can also use the “neighbor” object to get the average value for the neighbors of each location (grid cell), and look at the correlations between these and the residuals or create a scatter plot for visual inspection.

mean_function <- sapply(neighbor, function(x) mean(yield$residuals_lm[x]))
cor(yield$residuals_lm, mean_function) 
# 0.8847948
plot(yield$residuals_lm, mean_function, xlab='Residuals', ylab='Mean adjacent residuals')

We clearly see that the spatial dependencies in the residuals are significant. This means that, if we use this model, the predicted values are systematically underestimated or overestimated. Therefore, we need to use spatial autoregressive models that account for spatial dependencies. There are two models used as spatial regression models: the spatially lagged model and the spatial error model. The first model uses a spatial lag variable that averages the neighboring values for each location and accounts for autocorrelation using a weights matrix. In the second model, the spatial dependence is handled through the errors rather than through the systematic component of the model. In order to decide whether to fit the spatial error or lagged model, the Lagrange Multiplier (LM) test is used to distinguish which is more appropriate. The R function lm.LMtests() would perform this test by considering these statistics: the LM test for the error dependence (LMerr) and the spatially lagged dependent variable (LMlag), as well as for their robust forms (RLMerr and RLMlag; e.g., RLMerr examines the spatially autocorrelated residuals in the possible presence of an omitted lagged dependent variable).

 lm.LMtests(lm_yield, lw, test = c("LMerr","LMlag","RLMerr","RLMlag"))
#LMerr = 2142.1, df = 1, p-value < 2.2e-16
#LMlag = 1025, df = 1, p-value < 2.2e-16
#RLMerr = 1121.8, df = 1, p-value < 2.2e-16
#RLMlag = 4.6952, df = 1, p-value = 0.03025

Since both LMerr and LMlag have significant p-values, we compare the p-values of the robust forms RLMerr and RLMlag. In doing so, it can be seen that RLMerr is significant. Therefore, the LM test suggests that we should run a spatial error model.

fit_err <- errorsarlm(ww_ave_yield ~  ET_pot  +  precipitat +soil_water, data=yield, lw)
# lagsarlm() is a function that creates a  spatial lag model
#Lambda: 0.93674, LR test value: 1912.7, p-value: < 2.22e-16
#AIC: 15166, (AIC for lm: 17077)

The results show that the Likelihood Ratio (LR) test is highly significant (p value 2.22e-16).This shows further evidence that the spatial error model is a good fit. Also, the  Akaike Information Criterion (AIC:an estimate of out-of-sample prediction error and therefore the relative quality of statistical models for a given dataset) in this new model has a AIC of 15166 and has a better fit compared to the original linear model, with no spatial error dependencies (AIC of 17077).

We can now check the residuals of this new model. In addition, we can take a look at the Moran’s I statistic one more time. Note that we previously used a Moran’s I test for spatial autocorrelation in residuals from an estimated linear model. Now, we don’t have a linear model, so we can use a Permutation test for the Moran’s I statistic: the moran.mc() function uses random permutations of x for the given spatial weighting scheme. The residuals graph and Moran’s I statistic both show that there is no correlation in the residuals:

yield$residuals_error_model <- residuals(fit_err)
mean_function_error_model <- sapply(neighbor, function(x) mean(yield$residuals_error_model[x]))
cor(yield$residuals_error_model, mean_function_error_model)
plot(yield$residuals_error_model, mean_function_error_model, xlab='Residuals', ylab='Mean adjacent residuals')
moran.mc(yield$residuals_error_model, lw, 1000)   
# 1000 is a number of permutations

If we used the other model, the residuals would show some correlations:


Srinivasan, S., 2008. Spatial Regression Models, in: Shekhar, S., Xiong, H. (Eds.), Encyclopedia of GIS. Springer US, Boston, MA, pp. 1102–1105. https://doi.org/10.1007/978-0-387-35973-1_1294 https://rspatial.org/raster/analysis/index.html

Using Rhodium for exploratory modeling

Rhodium is a powerful, simple, open source Python library for multiobjective robust decision making. As part of Project Platypus, Rhodium is compatible with Platypus (a MOEA optimization library) and PRIM (the Patent Rule Induction Method for Python), making it a valuable tool for bridging optimization and analysis. 

In the Rhodium documentation, a simple example of optimization and analysis uses the Lake Problem (DPS formulation). The actual optimization is performed in the line:

optimize(model, "NSGAII", 10000)

This optimize function uses the Platypus library directly for optimization; here the NSGAII algorithm is used for 10,000 function evaluations on the defined Lake Problem (model). This optimization call is concise and simple, but there are a few reasons why it may not be ideal.

  1. Speed. Python, an interpreted language, is inherently slower than compiled languages (Java, C/C++, etc.) The Platypus library is built entirely in Python, making optimization slow.
  2. Scalability. Platypus has support for parallelizing optimization, but this method is not ideal for large-scale computational experiments on computing clusters. 
  3. MOEA Suite. State of the art MOEAs such as the Borg MOEA are not implemented in Platypus for licensing reasons, so it is not usable directly by Rhodium.

Thus, external optimization is necessary for computationally demanding Borg runs. Luckily, Rhodium is easily compatible with external data files, so analysis with Rhodium of independent optimizations is simple. In this post, I’ll use a sample dataset obtained from a parallel Borg run of the Lake Problem, using the Borg wrapper.

The code and data used in this post can be found in this GitHub repository. lakeset.csv contains a Pareto approximate Lake Problem set. Each line is a solution, where the first six values are the decision variables and the last four are the corresponding objectives values. 

We’ll use Pandas for data manipulation. The script below reads the sample .csv file with Pandas, converts it to a list of Python dictionaries, and creates a Rhodium DataSet. There are a few important elements to note. First, the Pandas to_dict function takes in an optional argument ‘records’ to specify the format of the output. This specific format creates a list of Python dictionaries, where each element of the list is an individual solution (i.e. a line from the .csv file) with dictionary keys corresponding to the decision / objective value names and dictionary values as each line’s data. This is the format necessary for making a Rhodium DataSetwhich we create by calling the constructor with the dictionary as input.

import pandas as pd
from rhodium import *

# use pandas to read the csv file
frame = pd.read_csv("lakeset.csv")

# convert the pandas data frame to a Python dict in record format
dictionary = frame.to_dict('records')

# create a Rhodium DataSet instance from the Python dictionary
dataset = DataSet(dictionary)

Printing the Rhodium DataSet with print(dataset) yields:

Index 204:
   c1: 0.286373779
   r1: 0.126801547
   w1: 0.6265428129999999
   c2: -0.133307575
   r2: 1.3584425430000002
   w2: 0.10987546599999999
   benefit: -0.412053431
   concentration: 0.359441661
   inertia: -0.98979798
   reliability: -0.9563

Once we have a Rhodium DataSet instantiated, we access many of the library’s functionalities, without performing direct optimization with Platypus. For example, if we want the policy with the lowest Phosphorus concentration (denoted by the ‘concentration’ field), the following code outputs:

policy = dataset.find_min('concentration')
{'c1': 0.44744488600000004, 'r1': 0.9600368159999999, 'w1': 0.260339899, 'c2': 0.283860122, 'r2': 1.246763577, 'w2': 0.5300663529999999, 'benefit': -0.213267399, 'concentration': 0.149320863, 'inertia': -1.0, 'reliability': -1.0}

Rhodium also offers powerful plotting functionalities. For example, we can easily create a Parallel Axis plot of our data to visualize the trade-offs between objectives. The following script uses the parallel_coordinates function in Rhodium on our external dataset. Here, since parallel_coordinates takes a Rhodium model as input, we can: 1) define the external optimization problem as a Rhodium model, or 2) define a ‘dummy’ model that gives us just enough information to create plots. For the sake of simplicity, we will use the latter, but the first option is simple to set up if there exists a Python translation of your problem/model. Note, to access the scenario discovery and sensitivity analysis functionalities of Rhodium, it is necessary to create a real Rhodium Model.

# define a trivial "dummy" model in Rhodium with an arbitrary function
model = Model(lambda x: x)

# set up the model's objective responses to match the keys in your dataset
# here, all objectives are minimized
# this is the only information needed to create a parallel coordinate plot
model.responses = [Response("benefit", Response.MINIMIZE),
                   Response("concentration", Response.MINIMIZE),
                   Response("inertia", Response.MINIMIZE),
                   Response("reliability", Response.MINIMIZE)]

# create the parallel coordinate plot from the results of our external optimization
fig = parallel_coordinates(model, dataset, target="bottom",
                           brush=[Brush("reliability < -0.95"), Brush("reliability >= -0.95")])

How to make horizon plots in Python

Horizon plots were invented about a decade ago to facilitate visual comparison between two time series. They are not intuitive to read right away, but they are great for comparing and presenting many sets of timeseries together. They can take advantage of a minimal design by avoiding titles and ticks on every axis and packing them close together to convey a bigger picture. The example below shows percent changes in the price of various food items in 25 years.

The way they are produced and read is by dividing the values along the y axis in bands based on ranges. The color of each band is given by a divergent color map. By collapsing the bands to the zero axis and layering the higher bands on top, one can create a time-varying heatmap of sorts.

Source: http://idl.cs.washington.edu/papers/horizon

I wasn’t able to find a script that could produce this in Python, besides some code in this github repository, that is about a decade old and cannot really run in Python 3. I cleaned it up and updated the scripts with some additional features. I also added example data comparing USGS streamflow data with model simulation data for the same locations for 38 years. The code can be found here and can be used with any two datasets that one would like to compare with as many points of comparison as needed (I used eight below, but the script can accept larger csv files with more or less comparison points, which will be detected automatically). The script handles the transformation of the data to uniform bands and produces the following figure, with every subplot comparing model output with observations at eight gauges, i.e. model prediction error. When the model is over predicting the area is colored blue, when the area is underpredicting, the area is colored red. Darker shades indicate further divergence from the zero axis. The script automatically uses three bands for both positive or negative divergence, but more can be added, as long as the user defines additional colors to be used.

Using this type of visualization for these data allows for time-varying comparisons of multiple locations in the same basin. The benefit of it is most exploited with many subplots that make up a bigger picture.

Future extensions in this repository will include code to accept more file types than csv, more flexibility in how the data is presented and options to select different colormaps when executing.

Visualizing High Dimensional Data Using the T-SNE Method

In this blog post, I am going to go over a machine learning method that is used to visualize high dimensional datasets. The method is called T-distributed Stochastic Neighbor Embedding, or T-SNE for short. The method was developed by the Google Brain Team (van der Maaten & Hinton, 2008). In this blog post, I describe T-SNE and provide a simple example in R.

Two weeks ago, Dave introduced a few useful dimensionality reduction methods to visualize the Pareto front. Last week, Rohini’s blog post provided some very useful information about self-organizing maps. I guess this blog post will officially make this a series of posts on dimensionality reduction and the visualization of high dimensional datasets. If you haven’t already read these two super informative blog posts, I highly recommend you take a look at them.

T-SNE is an extension of the Local Linear Embedding technique, and it uses an interesting method to maintain the local relationships and similarities that exist in the original high-dimensional datasets when transferring that into a 2-D plane. As van der Maaten and Hinton (2008) argue, other dimensionality reduction techniques (e.g., PCA) usually focus on macro distances in the datasets, which might lead to overlooking short-distance relationships. However, T-SNE is able to capture both local and macro distances in the data structure. Therefore, it can give more reasonable results when you are dealing with non-linear high dimensional datasets.

T-SNE uses the following formula, which basically creates a Gaussian distribution over each data point and computes the density of all other data points in respect to the distribution around that point. The denominator of the equation normalizes the nominator.

In this equation, xi is the reference point, and xj is its neighboring points. Also, Sigma is the bandwidth that returns the same perplexity for each point. Perplexity is a measure of uncertainty that has a direct relationship with entropy. For more information about it, you can read this Wikipedia page. Basically, perplexity is a hyper parameter of T-SNE, and the final outcome might be very sensitive to its value. Also, in this equation, pij is the probability of the similarity of each two points in the high dimensional space. Basically, if two points are close to each other, pij will be high. If they are far from each other, the probability of similarity is low.

To make the low dimensional map consistent with its original high dimensional dataset, T-SNE also calculates a similar point-to-point probability of similarity in the 2D map.

Basically, T-SNE moves the point on the 2D plain to find the locations that minimize the Kullback-Leibler divergence metric between the Pi distributions of the original data and the Qi distributions of the 2D plane.

As you probably noticed, in the low dimensional space, T-SNE uses the Student T-distribution curve (with one degree of freedom). The reason is that Student T (compared to Gaussian) is a fat-tailed distribution. This allows the T-SNE to separate dissimilar points with higher margins, which makes the map easier to interpret.

T-SNE Example in R

Here, I provide a simple R code example that demonstrates how you can use T-SNE to visualize high dimensional datasets. This example only has four dimensions, so it doesn’t really represent a high dimensional problem. As such, keep in mind that T-SNE shows its true capabilities when there are tens or even hundreds of dimensions.

First, you need to install “tsne” library and load it.


# We also need to load ggplot2 and ggpubr libraries


In this example, I use R’s famous “iris” data set. Here is how you can activate it.

# Load the iris dataset

Now you can use the following scrip to visualize the data using T-SNE.

# Calculating T-SNE's Y values which indicate where each cluster is on the 2-D map
y_tsne <- Rtsne(iris[,1:4],pca=FALSE,perplexity=30, max_iter=1000, check_duplicates = FALSE) # Run TSNE

# Create a data frame using IRIS values and the T-SNE values
df_to_gg<-as.data.frame(cbind(iris$Species, as.data.frame(y_tsne$Y)))

# Specify column names
names(df_to_gg)<-c("Species", "Y.1", "Y.2")

# Show the objects in the 2D T-SNE representation
ggplot(df_to_gg, aes(x=Y.1, y=Y.2, color=Species))+geom_point()+theme_minimal() +
  labs(title=paste("T-SNE Visualization- Perplexity Number = 30" )) +
  scale_color_manual(values = c("darkblue", "orange", "pink3"))

Here is the 2-D map that our code generates

Although T-SNE is a powerful method and has become quite popular in recent years, it might have some pitfalls. This blog post discusses some of these issues, including the sensitivity of the final map to perplexity hyperparameters or the fact that it can be tricky to interpret T-SNE maps because they might not represent the actual relationships between different clusters.

To explore the impacts of the perplexity hyperparameter on the final clusters, I use the following code to create T-SNE maps for different perplexity values.

# Here are the perplexity values that I am taking into account in my T-SNE analysis
perplexity_number_values<-c(2,5,10, 15, 30, 40)

# Initializing a plot list object
plot_list = list()

# This loop explores the effect of perplexity values on the T-SNE results
for (i_prp in 1:6){
  # I am setting a seed to makes sure that my results for different perplexities are not sensitive to any random factors
  # Perplexity number
  perplexity_number<- perplexity_number_values[i_prp]
  # Calculating T-SNE's Y values which indicate where each cluster is on the 2-D map
  y_tsne <- Rtsne(iris[,1:4],pca=FALSE,perplexity=perplexity_number, max_iter=1000, check_duplicates = FALSE) # Run TSNE
  # Create a data frame using IRIS values and the T-SNE values
  df_to_gg<-as.data.frame(cbind(iris$Species, as.data.frame(y_tsne$Y)))
  # Specify Column names
  names(df_to_gg)<-c("Species", "Y.1", "Y.2")
  # Show the objects in the 2D T-SNE representation
  plt<-ggplot(df_to_gg, aes(x=Y.1, y=Y.2, color=Species))+geom_point()+theme_minimal() +
    labs(title=paste("T-SNE Visualization- Perplexity Number = ", perplexity_number )) +
    scale_color_manual(values = c("darkblue", "orange", "pink3"))
  # Save figure in the plot list object
# Combine all the plot objects
plt_combined<-ggarrange(plotlist = plot_list[1:6], ncol = 2, nrow = 3, common.legend = T)

Here is the final result that underlines the high sensitivity of the location of our clusters on 2-D map to the choice of the perplexity values. There are other hyperparameters that can be tried as well, for example, the maximum number of iterations might change the final output in some cases.

There are many other very good tutorials on T-SNE, such as here and here. Also, for example, this tutorial provides a nice and easy-to-follow Python code for T-SNE. Finally, there are some great YouTube videos that clearly explain the logic behind T-SNE and its algorithm (for example, here and here).

Using the Python Borg Wrapper – Lake Problem Example

Python compatibility with the Borg MOEA is highly useful for practical development and optimization of Python models. The Borg algorithm is implemented in C, a strongly-typed, compiled language that offers high efficiency and scalability for complex optimization problems. Often, however, our models are not written in C/C++, rather simpler scripting languages like Python, MATLAB, and R. The Borg Python wrapper allows for problems in Python to be optimized by the underlying C algorithm, maintaining efficiency and ease of use. Use of the Python wrapper varies slightly across operating systems and computing architectures. This post will focus on Linux systems, like “The Cube”, our computing cluster here at Cornell. This post will work with the most current implementation of the Borg MOEA, which can be accessed with permission from this site.

The underlying communications between a Python problem and the Borg C source code are handled by the wrapper with ctypes, a library that provides Python compatibility with C types and functions in shared libraries. Shared libraries (conventionally .so files in Linux/Unix) provide dynamic linking, a systems tool that allows for compiled code to be linked and reused with different objects. For our purposes, we can think of the Borg shared library as a way to compile the C algorithm and reuse it with different Python optimization runs, without having to re-compile any code. The shared library gives the wrapper access to the underlying Borg functions needed for optimization, so we need to create this file first. In the directory with the Borg source code, use the following command to create the (serial) Borg shared library.

gcc -shared -fPIC -O3 -o libborg.so borg.c mt19937ar.c –lm 

Next, we need to move our shared library into the directory containing the Python wrapper (borg.py) and whatever problem we are optimizing.

In this post, we’ll be using the Lake Problem DPS formulation to demonstrate the wrapper. Here’s the source code for the problem:

@author: Rohini

#DPS Formulation 

#1) Maximize expected economic benefit
#2) Minimize worst case average P concentration 
#3) Maximize average inertia of P control policy
#4) Maximize average reliability 

#Reliability has to be <85%

#Decision Variables 
#vars: vector of size 3n 
#n is the number of radial basis functions needed to define the policy
#Each has a weight, center, and radius (w1, c1, r1..wm,cn,rn)

#Time Horizon for Planning, T: 100 Years
#Simulations, N: 100 


import numpy as np
from sys import *
from math import *
from scipy.optimize import root
import scipy.stats as ss

# Lake Parameters
b = 0.42
q = 2.0

# Natural Inflow Parameters
mu = 0.03
sigma = np.sqrt(10**-5)

# Economic Benefit Parameters
alpha = 0.4
delta = 0.98

# Set the number of RBFs (n), decision variables, objectives and constraints
n = 2
nvars = 3 * n
nobjs = 4
nYears = 100
nSamples = 100
nSeeds = 2
nconstrs = 1

# Set Thresholds
reliability_threshold = 0.85
inertia_threshold = -0.02

###### RBF Policy ######

#Define the RBF Policy
def RBFpolicy(lake_state, C, R, W):
    # Determine pollution emission decision, Y
    Y = 0
    for i in range(len(C)):
        if R[i] != 0:
            Y = Y + W[i] * ((np.absolute(lake_state - C[i]) / R[i])**3)

    Y = min(0.1, max(Y, 0.01))

    return Y

###### Main Lake Problem Model ######

def LakeProblemDPS(*vars):

    seed = 1234

    #Solve for the critical phosphorus level
    def pCrit(x):
        return [(x[0]**q) / (1 + x[0]**q) - b * x[0]]

    sol = root(pCrit, 0.5)
    critical_threshold = sol.x

    #Initialize arrays
    average_annual_P = np.zeros([nYears])
    discounted_benefit = np.zeros([nSamples])
    yrs_inertia_met = np.zeros([nSamples])
    yrs_Pcrit_met = np.zeros([nSamples])
    lake_state = np.zeros([nYears + 1])
    objs = [0.0] * nobjs
    constrs = [0.0] * nconstrs

    #Generate nSamples of nYears of natural phosphorus inflows
    natFlow = np.zeros([nSamples, nYears])
    for i in range(nSamples):
        np.random.seed(seed + i)
        natFlow[i, :] = np.random.lognormal(
            mean=log(mu**2 / np.sqrt(mu**2 + sigma**2)),
            sigma=np.sqrt(log((sigma**2 + mu**2) / mu**2)),

    # Determine centers, radii and weights of RBFs
    C = vars[0::3]
    R = vars[1::3]
    W = vars[2::3]
    newW = np.zeros(len(W))

    #Normalize weights to sum to 1
    total = sum(W)
    if total != 0.0:
        for i in range(len(W)):
            newW[i] = W[i] / total
        for i in range(len(W)):
            newW[i] = 1 / n

    #Run model simulation
    for s in range(nSamples):
        lake_state[0] = 0
        Y = np.zeros([nYears])

        #find policy-derived emission

        Y[0] = RBFpolicy(lake_state[0], C, R, newW)

        for i in range(nYears):
            lake_state[i + 1] = lake_state[i] * (1 - b) + (
                lake_state[i]**q) / (1 +
                                     (lake_state[i]**q)) + Y[i] + natFlow[s, i]
                i] = average_annual_P[i] + lake_state[i + 1] / nSamples
                s] = discounted_benefit[s] + alpha * Y[i] * delta**i

            if i >= 1 and ((Y[i] - Y[i - 1]) > inertia_threshold):
                yrs_inertia_met[s] = yrs_inertia_met[s] + 1

            if lake_state[i + 1] < critical_threshold:
                yrs_Pcrit_met[s] = yrs_Pcrit_met[s] + 1

            if i < (nYears - 1):
                #find policy-derived emission
                Y[i + 1] = RBFpolicy(lake_state[i + 1], C, R, newW)

# Calculate minimization objectives (defined in comments at beginning of file)
    objs[0] = -1 * np.mean(discounted_benefit)  #average economic benefit
    objs[1] = np.max(
        average_annual_P)  #minimize the max average annual P concentration
    objs[2] = -1 * np.sum(yrs_inertia_met) / (
        (nYears - 1) * nSamples
    )  #average percent of transitions meeting inertia thershold
    objs[3] = -1 * np.sum(yrs_Pcrit_met) / (nYears * nSamples
                                            )  #average reliability

    constrs[0] = max(0.0, reliability_threshold - (-1 * objs[3]))

    return (objs, constrs)

The important function for this blog post is LakeProblemDPS, which demonstrates how to configure your own problem with the wrapper. Your function must take in *vars, the decision variables, and return objs, a list of objective values (or a tuple of objective values and constraints). Within the problem, refer to vars[i] as the i-th decision variable, for i in [0,nvars-1]. Set the list of objective values in the same manner.

Once our problem is defined and compatible with the wrapper, we can optimize with Borg. The following code runs the Lake problem optimization for 10,000 function evaluations.

# Serial Borg run with Python wrapper
# ensure libborg.so is compiled and in this directory
from borg import *
from lake import *

maxevals = 10000

# create an instance of the serial Borg MOEA
borg = Borg(nvars, nobjs, nconstrs, LakeProblemDPS)

# set the decision variable bounds and objective epsilons
borg.setBounds(*[[-2, 2], [0, 2], [0, 1]] * int((nvars / 3)))
borg.setEpsilons(0.01, 0.01, 0.0001, 0.0001)

# perform the optimization
# pass in a dictionary of arguments, as defined in borg.py
result = borg.solve({"maxEvaluations": maxevals})

# print the resulting objectives
for sol in result:

Note the constructor Borg() creates an instance of the Borg algorithm with a specified number of variables, objectives, and constraints. The LakeProblemDPS argument is the objective function to be optimized by this instance of Borg. The setBounds and setEpsilons methods are required. solve() performs the optimization and takes in a dictionary of Borg parameters. See borg.py for a comprehensive list.

Using the Python wrapper to run the Parallel Borg MOEA

The previous example uses the serial Borg MOEA, but the wrapper also supports the master-worker and multi-master parallelizations. Configuring a parallel version of the algorithm requires a few additional steps. First, you must compile a shared library of the parallel implementation and move it to the wrapper directory.

For the master-worker version, use:

mpicc -shared -fPIC -O3 -o libborgms.so borgms.c mt19937ar.c -lm

For the multi-master version, use:

mpicc -shared -fPIC -O3 -o libborgmm.so borgmm.c mt19937ar.c -lm

To call the master-worker version, you must explicitly start up and shut down MPI using the Configuration class provided in borg.py. The following code performs a parallel master-worker optimization of the Lake problem:

# Master-worker Borg run with Python wrapper
# ensure libborgms.so or libborgms.so is compiled and in this directory
from borg import *
from lake import *

# set max time in hours
maxtime = 0.1

# need to start up MPI first

# create an instance of Borg with the Lake problem
borg = Borg(nvars, nobjs, nconstrs, LakeProblemDPS)

# set bounds and epsilons for the Lake problem
borg.setBounds(*[[-2, 2], [0, 2], [0, 1]] * int((nvars / 3)))
borg.setEpsilons(0.01, 0.01, 0.0001, 0.0001)

# perform the optimization
result = borg.solveMPI(maxTime=maxtime)

# shut down MPI

# only the master node returns a result
# print the objectives to output
if result:
    for solution in result:

This script must be called as a parallel process – here’s a SLURM submission script that can be used to run the optimization on 16 processors (compatible for The Cube):

#SBATCH -J py-wrapper
#SBATCH -o normal.out
#SBATCH -e normal.err
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 16

mpirun python3 mslake.py

sbatch submission.sbatch will allocate one node with 16 processors for the optimization run.


Depending on your machine’s MPI version and your shell’s LD_LIBRARY_PATH environment variable, the Borg wrapper may try to access an unavailable mpi shared library. This issue happens on our cluster, the Cube, and causes the following error:

OSError: libmpi.so.0: cannot open shared object file: No such file or directory

In borg.py, the startMPI method attempts to access the nonexistent libmpi.so.0 shared library. To fix this, find the location of your mpi files with:


Likely, a directory to your mpi library (i.e. /opt/ohpc/pub/mpi/openmpi3-gnu8/3.1.4/lib on the cube) will print. (Note, if such a path does not exist, set the LD_LIBRARY_PATH environment variable to include your mpi library) Navigate to this directory and view the file names. On the Cube, libmpi.so.0 (the library the Borg wrapper is trying to access) does not exist, but libmpi.so does (this is a software versioning discrepancy). Back in the startMPI method in borg.py, change the line

CDLL("libmpi.so.0", RTLD_GLOBAL)

to access the existing mpi library. On the cube:

CDLL("libmpi.so", RTLD_GLOBAL)

Deeper Dive into Self-Organizing Maps (SOMs)

This blog post focuses on Self-Organizing Maps (SOMs) or Kohonen maps, first created by Teuvo Kohonen in the 1980s, and most recently introduced in Dave Gold’s latest blog post. SOMs are a type of artificial neural network (ANN) that are trained using unsupervised learning to produce a two-dimensional pattern representation of higher-dimensional datasets. SOMs are therefore a good algorithm for dimension reduction, and inputs that are closer in their higher-dimensional input vectors are also mapped closely together in the resulting 2D representation [1].

Self Organizing Maps. Recently, I learned about SOMs while… | by ...
User-defined neuron grid that represents the basis of SOM  [1]

SOMs can be very effective for clustering/classifying outputs and are advantageous because they “self-organize” to identify clusters. The SOM map is first initialized with a user-defined 2D grid of neurons instead of containing layers like traditional ANNs. The weight vector that characterizes the neurons can be randomly initialized with value or more favorably, by sampling across the eigenvectors of the two largest PCs of the input dataset. The training process that SOMs use to build maps is called “competitive learning”, which is also fundamentally different from backpropagation used for ANNs. During competitive learning, an input training point is fed into the algorithm. Its Euclidean distance from each of the neurons is calculated, and the closest neuron is termed the best matching unit, or BMU. The weight of the neuron and surrounding neurons is adjusted to be closer to the related input vector, using an update formula. This process is repeated for every input in the training data and for a large number of iterations until patterns and clusters begin to emerge [3]. The animation on the right illustrates competitive learning. Note how the locations of neurons are adjusted as training points get assigned to their respective BMUs and how whole grid shifts in response to the new samples.

Competitive learning [2]

At a high level, this is how SOMs are trained. Next, we will go through a clustering example using the kohonen package offered in R.

Test Case: Maryland Trout

For this text case, we will be using a presence/absence dataset of Maryland trout. Each data point is characterized by a feature vector that includes information on agricultural land cover, logarithmic distance to the nearest road, riffle quality, dissolved oxygen, and spring water temperature. The label associated with each point is a 0 or 1, denoting if trout was observed or not. We will train a SOM to cluster the feature vector and then investigate the characteristics of the resulting clusters. Below is a snippet of our dataset. We use columns 3-7 as our feature vector while column 2 denotes our label. SOMs can also be used for classification, which is not discussed in this blog post. Therefore, you can train the SOM only on a subset of the dataset and used the trained SOM to predict the clusters that the testing points will be assigned to.


First we must scale our data in order to make sure that variables in our feature vector that may have different magnitudes are treated equally by the algorithm. Then we define the SOM grid. The choice of the number of neurons generally follows the rule: 5*sqrt(# of datapoints). In our case, we have 84 datapoints, so technically a grid that is smaller than 7×7 will work well. I decide to use 5×5, which will be explained in the next paragraph. I also specify that I want the grid to be in a hexagonal shape.  Finally, we train our model by feeding in our input data, the user-defined grid, and specifying an iteration rate of 1000. This means that the input data will be presented 1000 times to the algorithm.

#Create the training dataset
data_train &amp;amp;lt;-data[, c(3:7)]
# Change the data frame with training data to a matrix
# Also center and scale all variables to give them equal importance during
# the SOM training process.
data_train_matrix &amp;amp;lt;- as.matrix(scale(data_train))
# Create the SOM Grid
som_grid &amp;amp;lt;- somgrid(xdim = 5, ydim=5, topo="hexagonal")
# Finally, train the SOM
som_model &amp;amp;lt;- som(data_train_matrix,
                 grid=som_grid, rlen = 1000,
                 keep.data = TRUE )

Once the SOM is trained, there are a variety of figures that can be easily created. We can plot the training progress and observe how the distance to the BMU decreases as a function of iteration and starts to level off as we reach many iterations. This is ideal behavior because it means that the training points are associating closely with their BMU.

plot(som_model, type="changes")
Average distance from a test point to its BMU

We can also plot a heat map of the node counts, which shows how many of each of the training inputs are associated with each node. Any node that is grey means that no training points were assigned to that BMU. This may indicate that the map is too big, so you can go back and adjust the grid size adjust accordingly until the grey points are no longer existent. For this dataset, this was around 5×5. Furthermore, we want the number of points associated with each node to be as uniform as possible.

plot(som_model, type="count", main="Node Counts")

Next we can view the neighbor distances for each neuron, which is also called the U-matrix. Small distances indicate similarity between neighboring nodes.

plot(som_model, type="dist.neighbours", main = "SOM neighbor distances")

We can also view the weight vector of the nodes, which are termed codes. The fan in each node indicates the variables of prominence that are linking the datapoints assigned to the neuron. In our dataset, you can see that in the upper left corner of the map, the algorithm is assigning datapoints to these neurons because of the similarity among the agricultural area and water temperature variables in the feature vector. In the lower right corner, these datapoints are linked primarily due to similarities in riffle quality, and dissolved oxygen. This type of map is useful to understand by what features the points are actually starting to cluster.

plot(som_model, type="codes")

In my opinion, one of the most useful standard figures that the kohonen library creates are heatmaps that show the gradient of input values across the nodes. A heatmap can be created for each feature and when placed side by side, they can help demonstrate correlation between inputs. We can clearly see that areas characterized by low agricultural land tend to have a better riffle quality, lower water temperature, and a higher dissolved oxygen. While these patterns are expected and have a strong physical explanation, these maps can also help to discover other unexpected patterns. Note that the values on the side bar are normalized and can be transformed back to the original scale.


#Plot heatmaps for each feature
plot(som_model, type = "property", property = getCodes(som_model)[,1], main=colnames(getCodes(som_model))[1],palette.name=magma)
plot(som_model, type = "property", property = getCodes(som_model)[,2], main=colnames(getCodes(som_model))[2],palette.name=magma)
plot(som_model, type = "property", property = getCodes(som_model)[,3], main=colnames(getCodes(som_model))[3],palette.name=magma)
plot(som_model, type = "property", property = getCodes(som_model)[,4], main=colnames(getCodes(som_model))[4],palette.name=magma)
plot(som_model, type = "property", property = getCodes(som_model)[,5], main=colnames(getCodes(som_model))[5],palette.name=magma)
#Unscale the variables- do this for each input column in data_train
var_unscaled &amp;amp;lt;- aggregate(as.numeric(data_train[,1]), by=list(som_model$unit.classif), FUN=mean, simplify=TRUE)[,2]
plot(som_model, type = "property", property=var_unscaled, main=colnames(getCodes(som_model))[1], palette.name=magma)

Finally we can cluster our neuron codes weight vectors) to add some boundaries to our map. Typically what is done is to use k-means clustering to determine an appropriate number of clusters that, in this case, minimize the within-cluster sum of squares (WCSS). We can then plot the results and look for an elbow in the scree plot.


#k-means clustering code from [4]
mydata = som_model$codes[[1]]
wss = (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) {
wss[i] = sum(kmeans(mydata, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares", main="Within cluster sum of squares (WCSS)") #Define a palette and cluster with the number of clusters determined using k-means

The elbow isn’t so clear in this dataset, but I go with 8 clusters and then use hierarchical clustering to assign each neuron to a cluster. Finally, we can obtain the cluster that each point is assigned to and insert that back as a column in our dataset.

#Final clustering code from [4]
#Define a palette and cluster with the number of clusters determined using k-means
pretty_palette &amp;amp;lt;- c("#1f77b4", '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2')
som_cluster &amp;amp;lt;- cutree(hclust(dist(som_model$codes[[1]])), 8)
# plot these results:
plot(som_model, type="mapping", bgcol = pretty_palette[som_cluster], main = "Clusters")
add.cluster.boundaries(som_model, som_cluster)

# get vector with cluster value for each original data sample
cluster_assignment &amp;amp;lt;- som_cluster[som_model$unit.classif]
# for each of analysis, add the assignment as a column in the original data:
data$cluster &amp;amp;lt;- cluster_assignment



We’re now interested in understanding which of our points actually clustered together, so we generate summary statistics for each cluster. The points highlighted in blue represent the best values for the feature while the points in red represent the worst. Note that the clusters with a large amount of trout presence on average are characterized by best values of features that are favorable for aquatic life, while the clusters that exhibit a lower presence of trout do not have good values for these features. You can also see that clusters with similar characteristics are side by side on the clustering map.


Overall, I found the kohonen library to be a really useful tool to start to internalize how SOMs work and to fairly quickly be able to do a nice analysis on both the features and the resulting clusters. I would highly recommend this package as a starting point. There are packages in Python such as SimpSom and minisom that have snappier graphics and will be a good next step to explore.



[2] By Chompinha – Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=77822988


[4] Much direction for structuring this tutorial comes from https://www.shanelynn.ie/self-organising-maps-for-customer-segmentation-using-r/