Get your research workflow on

I have done some research on research workflow and that includes interviewing some of my peers at Cornell grad school to get a sense of what increases their productivity and what are their strategies for accomplishing long-term research goals.  In addition to this, I also gathered good advice from my PI, who is the most ultra-efficient human that I know.  Had I taken the following advice, I would’ve written this blog post a week ago.

The Get your research workflow on series, consists of two parts:

Part 1 covers General research workflow tips and Part 2. Setting up your technical workflow for people training in with the Decision Analytics crew (a.k.a. the best crew in town).

General research workflow tips

Disclosure: some of the contents of this list may be easier said than done.

First of all, a research workflow can be very personal and it is definitely tailored to each person’s requirements, personality and interests,  but here are some general categories that I think every researcher can relate to:

Taking notes, organizing and reflecting on ideas

I was gifted with the memory of a tuna fish, so I need to take notes for everything.  Unfortunately, taking notes with paper notebooks resulted in disaster for me in the past, it was very hard to  keep information organized, and occasionally my notebooks would either disappear or get  coffee stains all over.   Luckily, my office mate Dave, introduced me to the best application for note taking ever: Evernote,  this app allows you to keep your notes categorized, so you can keep the information that you need indexed and searchable across every single platform you have, that means that you can keep your notes synchronized  with your smartphone, laptop, desktop, etc, and have it accessible anywhere you go.

In addition, the Evernote web clipper tool allows you to save and categorize articles or webpages within your notes and make annotations on them.  Additionally, you can tag your notes, this is useful if you have notes that could fit into multiple notebooks.  You can also share  and invite people to edit notes and you can connect it with Google drive.  I would probably still flock to Google docs or Dropbox Paper for collaborative documents, but for personal notes, I prefer the Evernote interface.   There’s no limit on the amount of notebooks that you can have.  I’ve found this app very useful for brainstorming and developing ideas, I also use it to keep a research log to track my research progress.

Reading journal papers and reference management 

Keeping up with the scientific literature can be very challenging specially with the overwhelming amount of journal papers out there, but you can make things manageable for yourself if you find a reference manager that allows you to build a library that makes it easy to find, add, organize, read, prioritize and annotate papers  that you can later cite.  Additionally, you may want to set up smart notifications about new papers on topics that interest you, and get notified via e-mail. A couple of popular free and open source reference managers that allow you to do the previous are Zotero and Mendeley,  also Endnote basic, its free but you would need to upgrade to Endnote Desktop for unlimited storage.  These reference managers also allow you to export BibTex files for its integration with LaTeX.  You can check out the Grand Reference Management Comparison  table for all the reference management software available out there.

In addition to reference manager software,  a couple of popular subscription-based multidisciplinary databases are Web of Science and Scopus, they  differ from Google scholar,  by the fact that these are human curated databases, they are selected by scholarly and quality criteria by literature review committees, and they let you build connections between topics.

Finally,  I came across this article on How to keep up with the scientific literature, where a number of scientists were interviewed on the subject, and they all agree that it can be overwhelming but it is key to stay up to date with the literature as its the only way to contextualize your work and identify the knowledge gaps.  The article provided advice on how to prioritize what to read despite the overwhelming amount of scientific literature.


Time management and multi-tasking

This is my Achilles heel, and its a skill that requires a lot of practice and discipline.   Sometimes, progress in research can seem hard to accomplish, specially when you are involved in several projects, dealing with hard deadlines, taking many classes, TA-ing, or you’re simply busy being a socialité,  but there are several tricks to be on top of research while avoiding getting overwhelmed in a multi-tasking world.   Some, or most of this tips came from a time-manager master-mind:

Tip # 1.  Schedule everything and time everything

Schedule everything from hard, set-in-stone deadlines to casual meetings, that way you’ll know for sure how much time you’ll have to spare on different projects and you can block time for those projects in a weekly basis.  Keep track of the time that you spend on different projects/tasks.   There’s a very popular app among 3 economists, Julie’s brother and my friend Justyna called be focused that allows you to manage tasks and time them.  You can use it to keep track, for instance, of the time it takes you to read a paper,  some people use it to time the time it takes them to write a paper till completion, right now I’m tracking the amount of time its taking me to write this blogpost.  Timing everything will allow you to get better at predicting the time it will take you to accomplish something and reflect on how you can improve.   I always tend to underestimate my timings but this app is giving me a reality check.. very annoying.

Tip # 2. Different mindsets for different time slots

When your schedule is full of small time gaps, fill them doing tasks that involve less concentration, probably reading, answering e-mails, organizing yourself, and leave larger time slots for the most creative and challenging part of your work.

Also, a general recommendation of multi-tasking is don’t do it,  trying to do multiple things at once can hurt your productivity, instead,  block times to carry specific tasks, were you focus on that one thing, and then move on to the next. Remember to prioritize and tackle the most important thing first.

Tip #4. Visualize long-term research goals and work backwards

Picture what you want to accomplish in a year or in a semester, and work your way backwards, till you refine the accomplishments of each month, each week and each day to hit your long-term target.  Setup to-do lists for the week and for the day.

Tip #3. Set aside time for new skills that you want to acquire

Even if you set aside one or two hours a week devoted to that skill that you want to develop, it will pay off, you’ll have come a long way at the end of the year.  Challenge yourself and continue to develop new skills.

Tip #5. Don’t leave e-mails sitting in your inbox

There are a couple of strategies for this, you can either allocate time each day specifically for replying to e-mails or you can tackle each e-mail as it comes.   If it’s something that will require you more time, move it to a special list, if it’s a meeting, put it in your calendar, if it’s for reference, save it. No matter what your strategy is,  take action on an e-mail as soon as you read it.

Collaborative work

Some tools for collaborative work :

Overleaf– for writing LaTeX files collaboratively and visualizing the changes live, the platform has several journal templates and can track changes easily.

Github– platform for collaborative code development and management.

Slack – organize conversations with teams, and organize your collaborative workflow

A final recommendation is to have a consistent and intuitive organization of your research.  Document everything, and have reproducible code.  If you get hit by a bus and your colleagues are able to continue research were you left off in less than a week, then you’re in good shape, organization-wise.

I hope this helps, let me know if there are some crucial topics that I missed, I can always come back and edit.

Special thanks to all of my grad/postdoc friends that participated in the brief research workflow interview.






Open Source Streamflow Generator Part II: Validation

This is the second part of a two-part blog post on an open source synthetic streamflow generator written by Matteo Giuliani, Jon Herman and me, which combines the methods of Kirsch et al. (2013) and Nowak et al. (2010) to generate correlated synthetic hydrologic variables at multiple sites. Part I showed how to use the MATLAB code in the subdirectory /stationary_generator to generate synthetic hydrology, while this post shows how to use the Python code in the subdirectory /validation to statistically validate the synthetic data.

The goal of any synthetic streamflow generator is to produce a time series of synthetic hydrologic variables that expand upon those in the historical record while reproducing their statistics. The /validation subdirectory of our repository provides Python plotting functions to visually and statistically compare the historical and synthetic hydrologic data. The function plots the range of the flow duration (probability of exceedance) curves for each hydrologic variable across all historical and synthetic years. Lines 96-100 should be modified for your specific application. You may also have to modify line 60 to change the dimensions of the subplots in your figure. It’s currently set to plot a 2 x 2 figure for the four LSRB hydrologic variables. provides a visual, not statistical, analysis of the generator’s performance. An example plot from this function for the synthetic data generated for the Lower Susquehanna River Basin (LSRB) as described in Part I is shown below. These probability of exceedance curves were generated from 1000 years of synthetic hydrologic variables. Figure 1 indicates that the synthetic time series are successfully expanding upon the historical observations, as the synthetic hydrologic variables include more extreme high and low values. The synthetic hydrologic variables also appear unbiased, as this expansion is relatively equal in both directions. Finally, the synthetic probability of exceedance curves also follow the same shape as the historical, indicating that they reproduce the within-year distribution of daily values.

Figure 1

To more formally confirm that the synthetic hydrologic variables are unbiased and follow the same distribution as the historical, we can test whether or not the synthetic median and variance of real-space monthly values are statistically different from the historical using the function This function is currently set up to perform this analysis for the flows at Marietta, but the site being plotted can be changed on line 76. The results of these tests for Marietta are shown in Figure 2. This figure was generated from a 100-member ensemble of synthetic series of length 100 years, and a bootstrapped ensemble of historical years of the same size and length. Panel a shows boxplots of the real-space historical and synthetic monthly flows, while panels b and c show boxplots of their means and standard deviations, respectively. Because the real-space flows are not normally distributed, the non-parametric Wilcoxon rank-sum test and Levene’s test were used to test whether or not the synthetic monthly medians and variances were statistically different from the historical. The p-values associated with these tests are shown in Figures 2d and 2e, respectively. None of the synthetic medians or variances are statistically different from the historical at a significance level of 0.05.

Figure 2

In addition to verifying that the synthetic generator reproduces the first two moments of the historical monthly hydrologic variables, we can also verify that it reproduces both the historical autocorrelation and cross-site correlation at monthly and daily time steps using the functions and, respectively. The autocorrelation function is again set to perform the analysis on Marietta flows, but the site can be changed on line 39. The spatial correlation function performs the analysis for all site pairs, with site names listed on line 75.

The results of these analyses are shown in Figures 3 and 4, respectively. Figures 3a and 3b show the autocorrelation function of historical and synthetic real-space flows at Marietta for up to 12 lags of monthly flows (panel a) and 30 lags of daily flows (panel b). Also shown are 95% confidence intervals on the historical autocorrelations at each lag. The range of autocorrelations generated by the synthetic series expands upon that observed in the historical while remaining within the 95% confidence intervals for all months, suggesting that the historical monthly autocorrelation is well-preserved. On a daily time step, most simulated autocorrelations fall within the 95% confidence intervals for lags up to 10 days, and those falling outside do not represent significant biases.

Figure 3

Figures 4a and 4b show boxplots of the cross-site correlation in monthly (panel a) and daily (panel b) real-space hydrologic variables for all pairwise combinations of sites. The synthetic generator greatly expands upon the range of cross-site correlations observed in the historical record, both above and below. Table 1 lists which sites are included in each numbered pair of Figure 4. Wilcoxon rank sum tests (panels c and d) for differences in median monthly and daily correlations indicate that pairwise correlations are statistically different (α=0.5) between the synthetic and historical series at a monthly time step for site pairs 1, 2, 5 and 6, and at a daily time step for site pairs 1 and 2. However, biases for these site pairs appear small in panels a and b. In summary, Figures 1-4 indicate that the streamflow generator is reasonably reproducing historical statistics, while also expanding on the observed record.

Figure 4

Table 1

Pair Number Sites
1 Marietta and Muddy Run
2 Marietta and Lateral Inflows
3 Marietta and Evaporation
4 Muddy Run and Lateral Inflows
5 Muddy Run and Evaporation
6 Lateral Inflows and Evaporation



Open Source Streamflow Generator Part I: Synthetic Generation

This post describes how to use the Kirsch-Nowak synthetic streamflow generator to generate synthetic streamflow ensembles for water systems analysis. As Jon Lamontagne discussed in his introduction to synthetic streamflow generation, generating synthetic hydrology for water systems models allows us to stress-test alternative management plans under stochastic realizations outside of those observed in the historical record. These realizations may be generated assuming stationary or non-stationary models. In a few recent papers from our group applied to the Red River and Lower Susquehanna River Basins (Giuliani et al., 2017; Quinn et al., 2017; Zatarain Salazar et al., In Revision), we’ve generated stationary streamflow ensembles by combining methods from Kirsch et al. (2013) and Nowak et al. (2010). We use the method of Kirsch et al. (2013) to generate flows on a monthly time step and the method of Nowak et al. (2010) to disaggregate these monthly flows to a daily time step. The code for this streamflow generator, written by Matteo Giuliani, Jon Herman and me, is now available on Github. Here I’ll walk through how to use the MATLAB code in the subdirectory /stationary_generator to generate correlated synthetic hydrology at multiple sites, and in Part II I’ll show how to use the Python code in the subdirectory /validation to statistically validate the synthetic hydrology. As an example, I’ll use the Lower Susquehanna River Basin (LSRB).

A schematic of the LSRB, reproduced from Giuliani et al. (2014) is provided below. The system consists of two reservoirs: Conowingo and Muddy Run. For the system model, we generate synthetic hydrology upstream of the Conowingo Dam at the Marietta gauge (USGS station 01576000), as well as lateral inflows between Marietta and Conowingo, inflows to Muddy Run and evaporation rates over Conowingo and Muddy Run dams. The historical hydrology on which the synthetic hydrologic model is based consists of the historical record at the Marietta gauge from 1932-2001 and simulated flows and evaporation rates at all other sites over the same time frame generated by an OASIS system model. The historical data for the system can be found here.

The first step to use the synthetic generator is to format the historical data into an nD × nS matrix, where nD is the number of days of historical data with leap days removed and nS is the number of sites, or hydrologic variables. An example of how to format the Susquehanna data is provided in clean_data.m. Once the data has been reformatted, the synthetic generation can be performed by running script_example.m (with modifications for your application). Note that in the LSRB, the evaporation rates over the two reservoirs are identical, so we remove one of those columns from the historical data (line 37) for the synthetic generation. We also transform the historical evaporation with an exponential transformation (line 42) since the code assumes log-normally distributed hydrologic data, while evaporation in this region is more normally distributed. After the synthetic hydrology is generated, the synthetic evaporation rates are back-transformed with a log-transformation on line 60. While such modifications allow for additional hydrologic data beyond streamflows to be generated, for simplicity I will refer to all synthetic variables as “streamflows” for the remainder of this post. In addition to these modifications, you should also specify the number of realizations, nR, you would like to generate (line 52), the number of years, nY, to simulate in each realization (line 53) and a string with the dimensions nR × nY for naming the output file.

The actual synthetic generation is performed on line 58 of script_example.m which calls combined_generator.m. This function first generates monthly streamflows at all sites on line 10 where it calls monthly_main.m, which in turn calls monthly_gen.m to perform the monthly generation for the user-specified number of realizations. To understand the monthly generation, we denote the set of historical streamflows as \mathbf{Q_H}\in \mathbb{R}^{N_H\times T} and the set of synthetic streamflows as \mathbf{Q_S}\in \mathbb{R}^{N_S\times T}, where N_H and N_S are the number of years in the historical and synthetic records, respectively, and T is the number of time steps per year. Here T=12 for 12 months. For the synthetic generation, the streamflows in \mathbf{Q_H} are log-transformed to yield the matrix Y_{H_{i,j}}=\ln(Q_{H_{i,j}}), where i and j are the year and month of the historical record, respectively. The streamflows in \mathbf{Y_H} are then standardized to form the matrix \mathbf{Z_H}\in \mathbb{R}^{N_H\times T} according to equation 1:

1) Z_{H_{i,j}} = \frac{Y_{H_{i,j}}-\hat{\mu_j}}{\hat{\sigma_j}}

where \hat{\mu_j} and \hat{\sigma_j} are the sample mean and sample standard deviation of the j-th month’s log-transformed streamflows, respectively. These variables follow a standard normal distribution: Z_{H_{i,j}}\sim\mathcal{N}(0,1).

For each site, we generate standard normal synthetic streamflows that reproduce the statistics of \mathbf{Z_H} by first creating a matrix \mathbf{C}\in \mathbb{R}^{N_S\times T} of randomly sampled standard normal streamflows from \mathbf{Z_H}. This is done by formulating a random matrix \mathbf{M}\in \mathbb{R}^{N_S\times T} whose elements are independently sampled integers from (1,2,...,N_H). Each element of \mathbf{C} is then assigned the value C_{i,j}=Z_{H_{(M_{i,j}),j}}, i.e. the elements in each column of \mathbf{C} are randomly sampled standard normal streamflows from the same column (month) of \mathbf{Z_H}. In order to preserve the historical cross-site correlation, the same matrix \mathbf{M} is used to generate \mathbf{C} for each site.

Because of the random sampling used to populate \mathbf{C}, an additional step is needed to generate auto-correlated standard normal synthetic streamflows, \mathbf{Z_S}. Denoting the historical autocorrelation \mathbf{P_H}=corr(\mathbf{Z_H}), where corr(\mathbf{Z_H}) is the historical correlation between standardized streamflows in months i and j (columns of \mathbf{Z_H}), an upper right triangular matrix, \mathbf{U}, can be found using Cholesky decomposition (chol_corr.m) such that \mathbf{P_H}=\mathbf{U^\intercal U}. \mathbf{Z_S} is then generated as \mathbf{Z_S}=\mathbf{CU}. Finally, for each site, the auto-correlated synthetic standard normal streamflows \mathbf{Z_S} are converted back to log-space streamflows \mathbf{Y_S} according to Y_{S_{i,j}}=\hat{\mu_j}+Z_{S_{i,j}}\hat{\sigma_j}. These are then transformed back to real-space streamflows \mathbf{Q_S} according to Q_{S_{i,j}}=exp(Y_{S_{i,j}}).

While this method reproduces the within-year log-space autocorrelation, it does not preserve year to-year correlation, i.e. concatenating rows of \mathbf{Q_S} to yield a vector of length N_S\times T will yield discontinuities in the autocorrelation from month 12 of one year to month 1 of the next. To resolve this issue, Kirsch et al. (2013) repeat the method described above with a historical matrix \mathbf{Q_H'}\in \mathbb{R}^{N_{H-1}\times T}, where each row i of \mathbf{Q_H'} contains historical data from month 7 of year i to month 6 of year i+1, removing the first and last 6 months of streamflows from the historical record. \mathbf{U'} is then generated from \mathbf{Q_H'} in the same way as \mathbf{U} is generated from \mathbf{Q_H}, while \mathbf{C'} is generated from \mathbf{C} in the same way as \mathbf{Q_H'} is generated from \mathbf{Q_H}. As before, \mathbf{Z_S'} is then calculated as \mathbf{Z_S'}=\mathbf{C'U'}. Concatenating the last 6 columns of \mathbf{Z_S'} (months 1-6) beginning from row 1 and the last 6 columns of \mathbf{Z_S} (months 7-12) beginning from row 2 yields a set of synthetic standard normal streamflows that preserve correlation between the last month of the year and the first month of the following year. As before, these are then de-standardized and back-transformed to real space.

Once synthetic monthly flows have been generated, combined_generator.m then finds all historical total monthly flows to be used for disaggregation. When calculating all historical total monthly flows a window of +/- 7 days of the month being disaggregated is considered. That is, for January, combined_generator.m finds the total flow volumes in all consecutive 31-day periods within the window from 7 days before Jan 1st to 7 days after Jan 31st. For each month, all of the corresponding historical monthly totals are then passed to KNN_identification.m (line 76) along with the synthetic monthly total generated by monthly_main.mKNN_identification.m identifies the k nearest historical monthly totals to the synthetic monthly total based on Euclidean distance (equation 2):

2) d = \left[\sum^{M}_{m=1} \left({\left(q_{S}\right)}_{m} - {\left(q_{H}\right)}_{m}\right)^2\right]^{1/2}

where {(q_S)}_m is the real-space synthetic monthly flow generated at site m and {(q_H)}_m is the real-space historical monthly flow at site m. The k-nearest neighbors are then sorted from i=1 for the closest to i=k for the furthest, and probabilistically selected for proportionally scaling streamflows in disaggregation. KNN_identification.m uses the Kernel estimator given by Lall and Sharma (1996) to assign the probability p_n of selecting neighbor n (equation 3):

3) p_{n} = \frac{\frac{1}{n}}{\sum^{k}_{i=1} \frac{1}{i}}

Following Lall and Sharma (1996) and Nowak et al. (2010), we use k=\Big \lfloor N_H^{1/2} \Big \rceil. After a neighbor is selected, the final step in disaggregation is to proportionally scale all of the historical daily streamflows at site m from the selected neighbor so that they sum to the synthetically generated monthly total at site m. For example, if the first day of the month of the selected historical neighbor represented 5% of that month’s historical flow, the first day of the month of the synthetic series would represent 5% of that month’s synthetically-generated flow. The random neighbor selection is performed by KNN_sampling.m (called on line 80 of combined_generator.m), which also calculates the proportion matrix used to rescale the daily values at each site on line 83 of combined_generator.m. Finally, script_example.m writes the output of the synthetic streamflow generation to files in the subdirectory /validation. Part II shows how to use the Python code in this directory to statistically validate the synthetically generated hydrology, meaning ensure that it preserves the historical monthly and daily statistics, such as the mean, standard deviation, autocorrelation and spatial correlation.

Water Programming Blog Guide (3)

This is the final post on the Water Programming Blog Guide.  I encourage you to also look at the Water Programming Blog Guide (Part I) and Water Programming Blog Guide (Part 2) for a different set of topics.  In this post, the following topics will be covered:

  1. Visualization and figure editing
  2. Tutorials
  3. Training videos
  4. LaTex
  5. Reference management
  6. NetCDF
  7. Miscellaneous topics: Optimization and Statistics, Hydrologic Modeling, Global change assessment model

1.  Visualization and figure editing

Visualization is very important in our field,  its a good way to analyze tradeoffs and even interact with the system at hand.  We want to represent multi-dimensional data sets in an informative and appealing way , and the methods for doing are are well documented in our blog, we even have a brief historical overview  for visualizing multi-dimensional , along with useful resources to create either quick and interactive plots  or publication ready ones:

Parallel coordinate plots

Creating parallel axes plots

New web-based multi-objective data interactive visualization tools

Saving d3.parcoords to SVG

The Linked Parallel Coordinate Plot with Linked Heatmap

Alluvial Plots

Multidimensional data and color palettes

Visualization strategies for multidimensional data

Colorbrewer: Color palettes for your figures


AeroVis Documentation

AeroVis: Reproducing the DTLZ1 Animation

AeroVis: Turning a Glyph Plot Into a Figure

Code Sample: Modify AeroVis files for Matlab

Figure editing

Scientific figures in Illustrator

9 basic skills for editing and creating vector graphics in Illustrator

Other Resources

Data Visualization Resources

Plotting Code added to Lake Problem Diagnostics Repository

2. Tutorials

In this section you can find a series of training tutorials.  They consist on short test studies that emulate some of the larger studies that our group members have published,  and it will give you a good idea on how to use some of the tools so you can implement them on your own.  I also recommend looking at the Glossary of commonly used terms to get  familiar with the terminology on multi-objective optimization, decision analytics, systems analysis and water resources and some other lingo that we use in the research group that may not be evident to new members.

MOEA runtime analysis

Runtime dynamics with serial Borg

Random Seed Analysis for the Borg MOEA using DTLZ2, 3 objective instance

MOEA diagnostics

MOEA diagnostics for a simple test case (Part 1/3) ,  (Part 2/3) , (Part 3/3)

Algorithm Diagnostics Walkthrough using the Lake Problem as an example (Part 1 of 3: Generate Pareto approximate fronts) ,  (Part 2 of 3: Calculate metrics for Analysis) , (Part 3 of 3: Metrics-based analysis of algorithm performance)

Determining Whether Differences in Diagnostic Results are Statistically Significant (Part 1) , (Part 2)

3. Training videos

The following videos were developed mainly for MOEAFramework  training , they are not necessarily in sequence but they cover key topics that will enable you to use the MOEAFramework capabilities within your own problems.

Training Video: External problems in MOEAFramework

Training video: MOEAFramework GUI

Training video: Java file for external problems

Training video: Elements of Problem Formulation

Training video: MOEAframework Sobol Sensitivity Analysis

Training video: MOEAFramework, submitting multiple random seeds

Training video: Cluster job submission basics

4.  LaTeX

LaTex helps the quality and aesthetics of documents; also,  as a researcher LaTex gives you the opportunity to focus on the parts of your work that  really matter since it saves you the headache of formatting.   In our blog we have a variety of posts that will help you setup LaTex, have consistent fonts across figures and text within your document, use online platforms to work on LaTex documents collaboratively, and easily track changes from one document to another, this last feature is extremely handy for revisions.

Beginner’s LaTeX Guide

Setup for LaTeX and Sublime

Embedding figure text into a Latex document

Overleaf: LaTeX Collaboration and Publication

latexdiff: “track changes” for LaTeX

5. Reference management

Reference management can be a graduate student’s best friend.  It plays an important role in increasing your productivity by maintaining good record of the articles that you commonly use  and by keeping track of encountered literature.  In our blog you can find useful resources and tips to take advantage of reference managers and on how to populate efficiently your references.

PDFExtract: Get a list of BibTeX references from a scholarly PDF

RSS Feeds for Water Resources Journals

Writing a Paper in Markdown Using Pandoc

Zotero introduction (video)

Web-based Free Options for Bibliography Management and LaTeX Editing

6. NetCDF

The NetCDF website describes this data form better than I can:

NetCDF (network Common Data Form) is a set of interfaces for array-oriented data access and a freely distributed collection of data access libraries for C, Fortran, C++, Java, and other languages. The netCDF libraries support a machine-independent format for representing scientific data. Together, the interfaces, libraries, and format support the creation, access, and sharing of scientific data.

Source: NetCDF

We also have plenty of documentation on handling NetCDF from previous group members:

Using HDF5/zlib Compression in NetCDF4

Use python cf and esgf-python-client package to interact with ESGF data portal

Basic NetCDF processing in Matlab

Reducing size of NetCDF spatial data with list representation

Plotting a map of NetCDF data with Matplotlib/Basemap

7.  Miscellaneous topics

Finally, this is a recompilation of interesting topics that contributors to the blog have come across and have used in their own research and/or classes.

Fitting Multivariate Normal Distributions

Solving non-linear problems using linear programming

Hydrologic modeling

Updated rainfall-runoff models

Diffusion in a 2D Box

Global Change Assessment Model

Porting GCAM onto a UNIX cluster




Reading CSV files in C++

If you are an engineer used to coding in Python or Matlab who is transitioning to C++, you will soon find out that even the most innocent task will now require several lines of code. A previous post has already shown how to export data to a CSV file. In order to facilitate your transition to C++, see below for an example of how to read your new CSV file.


#include <string>
#include <vector>
#include <sstream> //istringstream
#include <iostream> // cout
#include <fstream> // ifstream

using namespace std;

 * Reads csv file into table, exported as a vector of vector of doubles.
 * @param inputFileName input file name (full path).
 * @return data as vector of vector of doubles.
vector<vector<double>> parse2DCsvFile(string inputFileName) {

    vector<vector<double> > data;
    ifstream inputFile(inputFileName);
    int l = 0;

    while (inputFile) {
        string s;
        if (!getline(inputFile, s)) break;
        if (s[0] != '#') {
            istringstream ss(s);
            vector<double> record;

            while (ss) {
                string line;
                if (!getline(ss, line, ','))
                try {
                catch (const std::invalid_argument e) {
                    cout << "NaN found in file " << inputFileName << " line " << l
                         << endl;


    if (!inputFile.eof()) {
        cerr << "Could not read file " << inputFileName << "\n";
        __throw_invalid_argument("File not found.");

    return data;

int main()
    vector<vector<double>> data = parse2DCsvFile("test.csv");

    for (auto l : data) {
    	for (auto x : l)
    		cout << x << " ";
    	cout << endl;

    return 0;

Glossary of commonly used terms

I have recently started training with the group and coming from a slightly different research background I was unfamiliar with some a lot of the terminology. I thought it might be useful to put together a glossary of sorts containing the terms that someone new to this field of study might not intuitively understand. The idea is that when someone encounters an unfamiliar term while going through the training or reading through the material, they can come to the blog glossary and quickly Ctrl-F the term (yes, that is a keyboard shortcut used as a verb).

The definitions are not exhaustive, but links/references are provided so the reader can find additional material. The glossary is a work in progress, some definitions are still incomplete, but it will be regularly updated. I’m also probably the least qualified person in the group to provide the definitions, so any corrections/suggestions are more than welcome. If there’s any other term I’ve omitted and you feel should be included, please leave a comment so I can edit the post.


Adaptive metropolis algorithm

It is based on the classical random walk Metropolis algorithm and adapts continuously to the target distribution. 1

Akaike’s Information Criterion (AIC)

A measure of the relative quality of statistical models for a given set of data.2


MOEA that applies a multi-algorithm search that combines the NSGAII, particle swarm optimization, differential evolution, and adaptive metropolis.3

Approximation set

The set of solutions output by a multi-objective evolutionary algorithm approximating the Pareto front.4


A secondary population used by many multi-objective evolutionary algorithms to store non-dominated solutions found through the generational process.5


Attainment plot

Bayesian Information Criterion


Iterative search algorithm for multi-objective optimization. It combines adaptive operator selection with ε-dominance, adaptive population sizing and time continuation. 6

Classification and Regression Trees (CART)

Decision tree algorithms that can used for classification and regression analysis of predictive modeling problems.7

Closed vs. open loop control


See Convexity.


Restrictions imposed on the decision space by the particular characteristics of the system. They must be satisfied for a certain solution to be considered acceptable.5

Control map


Refers to whether the parameterization choices have significant impacts on the success or failure for an algorithm.


Progress towards the reference set.


Geometrically, a function is convex is a line segment drawn from any point along function f to any other point along f lies on or above the graph of f. For optimization problems, a problem is convex if the objective function and all constraints are convex functions if minimizing, and concave is maximizing.

(The) Cube

The Reed Group’s computing cluster run by the Center for Advanced Computing (CAC) at Cornell. The cluster has 32 nodes, each with Dual 8-core E5-2680 CPUs @ 2.7 GHz and 128 GB of RAM. For more information see:

Decision space

The set of all decision variables.

Decision variables

The numerical quantities (variables) that are manipulated during the optimization process and they represent how our actions are encoded within the problem.5


When elements of a solution set at a given time are dominated by a solution set the algorithm maintained some time before.5

Differential evolution

Evolutionary algorithm designed to optimize problems over continuous domains.5

Direct policy search

Dominance resistance

The inability of an algorithm to produce offspring that dominates poorly performing, non-dominated members of the population. (See also Pareto dominance).8

DTLZ problems

A suite of representative test problems for MOEAs that for which the analytical solutions have been found.  The acronym is a combination of the first letters of the creators’ last names (Deb, Thiele, Laumanns, Zitzler). DTLZ problems have been used for benchmarking and diagnostics when evaluating the performance of MOEAs.

Dynamic memory


Refers to a case where as evolution progresses, non-dominated solutions will not be lost in subsequent generations.

Epsilon (ε) dominance

When dominance is determined by use of a user-specified precision to simplify sorting. Pareto epsilon (ε) optimality and Pareto epsilon (ε) front are defined accordingly. (See also Pareto dominance).5

Epsilon (ε) dominance archive

Epsilon (ε)-box dominance archive


A steady-state MOEA, the first to incorporate ε-dominance archiving into its search process.

Epsilon (ε) indicator

Additive ε-indicator (ε+-indicator) (performance metric) 

The smallest distance ε that the approximation set must be translated by in order to completely dominate the reference set.4

Epsilon (ε) progress


Evolutionary algorithms

A class of search and optimization algorithms inspired by processes of natural evolution.5

Multi-objective evolutionary algorithms (MOEAs)

Evolutionary algorithms used for solving multi-objective problems.5

Evolutionary operators

They operate on the population of an evolutionary algorithm attempting to generate solutions with higher and higher fitness.5

Mutation evolutionary operators

They perturb the decision variables of a single solution to look for improvement in its vicinity.

Recombination evolutionary operators

They combine decision variables from two or more solutions to create new solutions.

Selection evolutionary operators

They determine which solutions are allowed to proceed to the next cycle.


A cross-language automation tool for running models. (See also Project Platypus).

Exploratory Modeling and Analysis (EMA)

A research methodology that uses computational experiments to analyze complex and uncertain systems.9

Data-driven exploratory modeling

Used to reveal implications of a data set by searching through an ensemble of models for instances that are consistent with the data.

Question-driven exploratory modeling

Searches over an ensemble of models believed to be plausible to answer a question of interest or illuminate policy choices.

Model-driven exploratory modeling

Investigates the properties of an ensemble of models without reference to a data set or policy question. It is rather a theoretical investigation into the properties of a class of models.

Feasible region

The set of all decision variables in the decision space that are feasible (i.e. satisfy all constraints).5


Generational algorithms

A class of MOEAs that replace the entire population during each full mating, mutation, and selection iteration of the algorithm.5 (See also Steady-state algorithms).

Generational distance (performance metric)

The average distance from every solution in the approximation set to the nearest solution in the reference set.4

Gini index

A generalization of the binomial variance used in Classification and Regression Trees (CART). (See also Classification and Regression Trees (CART)). 

High performance computing

Hypervolume (performance metric)

The volume of the space dominated by the approximation set.4

Inverted generational distance (performance metric)

The average distance from every solution in the reference set to the nearest solution in the approximation set.4


A free desktop application for producing and sharing high-dimensional, interactive scientific visualizations. (See also Project Platypus).

Kernel density estimation

Latin Hypercube Sampling (LHS)

Stratified technique used to generate samples of parameter values.

Markov chain

Method of moments

MOEA Framework

A free and open source Java library for developing and experimenting with multi-objective evolutionary algorithms and other general-purpose optimization algorithms.4

Monte Carlo

Morris method


The Non-dominated Sorting Genetic Algorithm-II. MOEA featuring elitism, efficient non-domination sorting, and parameter free diversity maintenance.10


A generational algorithm that uses e-dominance archiving, adaptive population sizing and time continuation.

Number of function evaluations (NFE)


The criteria used to compare solutions in an optimization problem.


A particle swarm optimization algorithm—the first to include e-dominance as a means to solve many-objective problems.11


The process of identifying the best solution (or a set of best solutions) among a set of alternatives.

Multi-objective optimization

Multi-objective optimization employs two or more criteria to identify the best solution(s) among a set of alternatives

Intertemporal optimization

Parallel computing

Parametric generator

Pareto optimality

The notion that a solution is superior or inferior to another solution only when it is superior in all objectives or inferior in all objectives respectively.

Pareto dominance

A dominating solution is superior to another in all objectives. A dominated solution is inferior to another in all objectives. A non-dominated solution is superior in one objective but inferior in another.  

Pareto front

Contains the objective values of all non-dominated solutions (in the objective function space).

Pareto optimal set

Contains the decision variables of all non-dominated solutions (in the decision variable space).

Particle swarm optimization

Population-based stochastic optimization technique where the potential solutions, called particles, fly through the problem space by following the current optimum particles.

Patient Rule Induction method (PRIM)

A rule induction algorithm.

Performance metrics

Procedures used to compare the performance of approximation sets.



The set of encoded solutions that are manipulated and evaluated during the application of an evolutionary algorithm.

Principle Component Analysis (PCA)

Project Platypus

A Free and Open Source Python Library for Multiobjective Optimization. For more information see:

Radial basis function

Reference set

The set of globally optimal solutions in an optimization problem.


Python Library for Robust Decision Making and Exploratory Modelling. (See also Project Platypus).

Robust Decision Making (RDM)

An analytic framework that helps identify potential robust strategies for a particular problem, characterize the vulnerabilities of such strategies, and evaluate trade-offs among them.12

Multi-objective robust decision making (MORDM)

An extension of Robust Decision Making (RDM) to explicitly include the use of multi-objective optimization to discover robust strategies and explore the trade-offs among multiple competing performance objectives.13


An open source implementation of MORDM with the tools necessary to perform a complete MORDM analysis.14 For more information see:

Safe operating space



Sobol sampling

Spacing (performance metric)

The uniformity of the spacing between the solutions in an approximation set.


MOEA that assigns a fitness value to each solution based on the number of competing solutions it dominates.

 State of the world

A fundamental concept in decision theory which refers to a feature of the world that the agent/decision maker has no control over and is the origin of the agent’s uncertainty about the world.

Steady-state algorithms

A class of MOEAs that only replace one solution in the population during each full mating, mutation, and selection iteration of the algorithm. (See also Generational algorithms).

Time continuation

The injection of new solutions in the population to reinvigorate search.


The set of candidate solutions selected randomly from a population.


Visual analytics

The rapid analysis of large datasets using interactive software that enables multiple connected views of planning problems.

More information on the concepts

  1. Haario, H., Saksman, E. & Tamminen, J. An adaptive Metropolis algorithm. Bernoulli 7, 223–242 (2001).
  2. Akaike, H. Akaike’s information criterion. in International Encyclopedia of Statistical Science 25–25 (Springer, 2011).
  3. Vrugt, J. A. & Robinson, B. A. Improved evolutionary optimization from genetically adaptive multimethod search. Proc. Natl. Acad. Sci. 104, 708–711 (2007).
  4. Hadka, D. Beginner’s Guide to the MOEA Framework. (CreateSpace Independent Publishing Platform, 2016).
  5. Coello, C. A. C., Lamont, G. B. & Van Veldhuizen, D. A. Evolutionary algorithms for solving multi-objective problems. 5, (Springer, 2007).
  6. Hadka, D. & Reed, P. Borg: An Auto-Adaptive Many-Objective Evolutionary Computing Framework. Evol. Comput. 21, 231–259 (2012).
  7. Breiman, L. Classification and Regression Trees. (Wadsworth International Group, 1984).
  8. Reed, P. M., Hadka, D., Herman, J. D., Kasprzyk, J. R. & Kollat, J. B. Evolutionary multiobjective optimization in water resources: The past, present, and future. Adv. Water Resour. 51, 438–456 (2013).
  9. Bankes, S. Exploratory Modeling for Policy Analysis. Oper. Res. 41, 435–449 (1993).
  10. Deb, K., Pratap, A., Agarwal, S. & Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. Evol. Comput. IEEE Trans. On 6, 182–197 (2002).
  11. Sierra, M. R. & Coello, C. C. Improving PSO-based multi-objective optimization using crowding, mutation and e-dominance. in Evolutionary multi-criterion optimization 3410, 505–519 (Springer, 2005).
  12. Lempert, R. J., Groves, D. G., Popper, S. W. & Bankes, S. C. A general, analytic method for generating robust strategies and narrative scenarios. Manag. Sci. 52, 514–528 (2006).
  13. Kasprzyk, J. R., Nataraj, S., Reed, P. M. & Lempert, R. J. Many objective robust decision making for complex environmental systems undergoing change. Environ. Model. Softw. (2013). doi:10.1016/j.envsoft.2012.12.007
  14. Hadka, D., Herman, J., Reed, P. & Keller, K. An open source framework for many-objective robust decision making. Environ. Model. Softw. 74, 114–129 (2015).


Enhance your (Windows) remote terminal experience with MobaXterm

Jazmin and Julie recently introduced me to a helpful program for Windows called “MobaXterm” that has significantly sped up my workflow when running remotely on the Cube (our cluster here at Cornell). MobaXterm bills itself as an “all in one” toolbox for remote computing. The program’s interface includes a terminal window as well as a graphical SFTP browser. You can link the terminal to the SFTP browser so that as you move through folders on the terminal the browser follows you. The SFTP browser allows you to view and edit files using your text editor of choice on your windows desktop, a feature that I find quite helpful for making quick edits to shell scripts or pieces of code as go.


A screenshot of the MobaXterm interface. The graphical SFTP browser is on the left, while the terminal is on the right (note the checked box in the center of the left panel that links the browser to the terminal window).


You can set up a remote Cube session using MobaXterm with the following steps:

  1. Download MobaXterm using this link
  2.  Follow the installation instructions
  3. Open MobaXterm and select the “Session” icon in the upper left corner.
  4. In the session popup window, select a new SSH session in the upper left, enter “” as the name of the remote host and enter your username.
  5. When the session opens, check the box below the SFTP browser on the left to link the browser to your terminal
  6. Run your stuff!

Note that for a Linux system, you can simply link your file browser window to your terminal window and get the same functionality as MobaXterm. MobaXterm is not available for Mac, but Cyberduck and Filezilla are decent alternatives. An alternative graphical SFTP browser for Windows is WinSCP, though I prefer MobaXterm because of its linked terminal/SFTP interface.

For those new to remote computing, ssh or UNIX commands in general, I’d recommend checking out the following posts to get familiar with running on a remote cluster: