Magnitude-varying sensitivity analysis and visualization (Part 2)

In my last post, I talked about producing these flow-duration-curve-type figures for an output time-series one might be interested in, and talked about their potential use in an exploratory approach for the purpose of robust decision making. Again, the codes to perform the analysis and visualization are in this Github repository.


Fig. 1: Historical data vs. range of experiment outputs

As already discussed, there are multiple benefits for visualizing the output in such manner: we are often concerned with the levels and frequencies of extremes when making decisions about systems (e.g. “how bad is the worst case?”, “how rare is the worst case?”), or we might like to know how often we exceed a certain threshold (e.g. “how many years exceed an annual shortage of 1000 af?“). The various percentiles tell a different part of the story of how a system operates, the 5th percentile tells as that its level is exceeded 95% of the time, the 99th tells as that its level is only reached once in every 100 years in our records. These might seem obvious to the readers of this blog, but often times we perform our analyses for only some of these percentiles, “the worst event”, “the average”, etc., which is certainly very informative, but can potentially miss part of the bigger picture.

In this post I’m going to walk the reader through performing a sensitivity analysis using the output of an experiment using multiple Latin Hypercube Samples. The analysis will be magnitude-varying, i.e., it will be performed at different magnitudes of our output of interest. For this particular example, we aim to see what are the most significant drivers of shortage at the different levels it’s experienced by this user. In other words, if some factors appear to be driving the frequent small shortages experienced, are those factors the same for the rare large shortages?

To perform the sensitivity analysis, I am going to use SALib (featured in this blog multiple times already), to perform a Delta Moment-Independent Analysis [1] (also produces a first order Sobol sensitivity index [2]). You’ll probably need to install SALib if it’s not a package you’ve used already. I’m also going to use statsmodels, to perform a simple linear regression on the outputs and look at their R2 values. But, why, you might ask, perform not one, not two, but three sensitivity analyses for this? There are nuanced, yet potentially important differences between what the three methods capture:

Delta method: Look for parameters most significantly affecting the density function of observed shortages. This method is moment-independent, i.e., it looks at differences in the entire distribution of the output we’re interested in.
First order Sobol (S1): Look for parameters that most significantly affect the variance of observed outputs, including non-linear effects.
R2: Look for parameters best able to describe the variance of observed outputs, limited to linear effects.

Another important thing to note is that using the First order Sobol index, the total variance resulting from the parameters should equal 1. This means that if we sum up the S1’s we get from our analysis, the sum represents the variance described by the first order effects of our parameters, leaving whatever is left to interactions between our variables (that S1 cannot capture). The same holds using R2, as we are repeatedly fitting our parameters and scoring them on how much of the output variance they describe as a sole linear predictor (with no interactions or other relationships).

The following Python script will produce all three as well as confidence intervals for the Delta index and S1. The script essentially loops through all percentiles in the time-series and performs the two analyses for each one. In other words, we’re are looking at how sensitive each magnitude percentile is to each of the sampled parameters.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from SALib.analyze import delta
# Load parameter samples
LHsamples = np.loadtxt('./LHsamples.txt')
params_no = len(LHsamples[0,:])
param_bounds=np.loadtxt('./uncertain_params.txt', usecols=(1,2))
# Parameter names
# Define problem class
problem = {
'num_vars': params_no,
'names': param_names,
'bounds': param_bounds.tolist()
# Percentiles for analysis to loop over
percentiles = np.arange(0,100)
# Function to fit regression with Ordinary Least Squares using statsmodels
def fitOLS(dta, predictors):
# concatenate intercept column of 1s
dta['Intercept'] = np.ones(np.shape(dta)[0])
# get columns of predictors
cols = dta.columns.tolist()[-1:] + predictors
#fit OLS regression
ols = sm.OLS(dta['Shortage'], dta[cols])
result =
return result
# Create empty dataframes to store results
DELTA = pd.DataFrame(np.zeros((params_no, len(percentiles))), columns = percentiles)
DELTA_conf = pd.DataFrame(np.zeros((params_no, len(percentiles))), columns = percentiles)
S1 = pd.DataFrame(np.zeros((params_no, len(percentiles))), columns = percentiles)
S1_conf = pd.DataFrame(np.zeros((params_no, len(percentiles))), columns = percentiles)
R2_scores = pd.DataFrame(np.zeros((params_no, len(percentiles))), columns = percentiles)
DELTA.index=DELTA_conf.index=S1.index=S1_conf.index = R2_scores.index = param_names
# Read in experiment data
expData = np.loadtxt('./experiment_data.txt')
# Identify magnitude at each percentiles
syn_magnitude = np.zeros([len(percentiles),len(LHsamples[:,0])])
for j in range(len(LHsamples[:,0])):
syn_magnitude[:,j]=[np.percentile(expData[:,j], i) for i in percentiles]
# Delta Method analysis
for i in range(len(percentiles)):
if syn_magnitude[i,:].any():
result= delta.analyze(problem, LHsamples, syn_magnitude[i,:], print_to_console=False, num_resamples=2)
DELTA[percentiles[i]]= result['delta']
DELTA_conf[percentiles[i]] = result['delta_conf']
# OLS regression analysis
dta = pd.DataFrame(data = LHsamples, columns=param_names)
# fig = plt.figure()
for i in range(len(percentiles)):
shortage = np.zeros(len(LHsamples[:,0]))
for k in range(len(LHsamples[:,0])):
for m in range(params_no):
predictors = dta.columns.tolist()[m😦m+1)]
result = fitOLS(dta, predictors)[param_names[m],percentiles[i]]=result.rsquared

The script produces the sensitivity analysis indices for each magnitude percentile and stores them as .csv files.

I will now present a way of visualizing these outputs, using the curves from Fig. 1 as context.  The code below reads in the values for each sensitivity index, normalizes them to the range of magnitude at each percentile, and then plots them using matplotlib’s stackplot fuction, which stacks the contribution of each parameter to the sum (in this case the maximum of the resulting range)

I’ll go through what the code does in more detail:

First, we take the range boundaries (globalmax and globalmin) which give us the max and min values for each percentile. We then read in the values for each sensitivity index and normalize them to that range (i.e. globalmaxglobalmin for each percentile). The script also adds two more arrays (rows in the pandas dataframe), one representing interaction and one representing the globalmin, upon which we’re going to stack the rest of the values. [Note: This is a bit of a roundabout way of getting the figures how we like them, but it’s essentially creating a pseudo-stack for the globalmin, that we’re plotting in white.] 

The interaction array is only used when normalizing the S1 and R2 values, where we attribute to it the difference between 1 and the sum of the calculated indices (i.e. we’re attributing the rest to interaction between the parameters). We don’t need to do this for the delta method indices (if you run the code the array remains empty), but the reason I had to put it there was to make it simpler to create labels and a single legend later.

The plotting simply creates three subplots and for each one uses stackplot to plot the normalized values and then the edges in black. It is important to note that the colorblocks in each figure do not represent the volume of shortage attributed to each parameter at each percentile, but rather the contribution of each parameter to the change in the metric, namely, the density distribution (Delta Method), and the variance (S1 and R2). The code for this visualization is provided at the bottom of the post.


Fig. 2: Magnitude sensitivity curves using three sensitivity indeces

The first thing that pops out from this figure is the large blob of peach, which represents the irrigation demand multiplier in our experiment. The user of interest here was an irrigation user, which would suggest that their shortages are primarily driven by increases in their own demands and of other irrigation users. This is important, because irrigation demand is an uncertainty for which we could potentially have direct or indirect control over, e.g. through conservation efforts.

Looking at the other factors, performing the analysis in a magnitude-varying manner, allowed us to explore the vulnerabilities of this metric across its different levels. For example, dark blue and dark green represent the mean flow of dry and wet years, respectively. Across the three figures we can see that the contribution of mean wet-year flow is larger in the low-magnitude percentiles (left hand side) and diminishes as we move towards the larger-magnitude percentiles.

Another thing that I thought was interesting to note was the difference between the S1 and the R2 plots. They are both variance-based metrics, with R2 limited to linear effects in this case. In this particular case, the plots are fairly similar which would suggest that a lot of the parameter effects on the output variance are linear. Larger differences between the two would point to non-linearities between changes in parameter values and the output.

The code to produce Fig. 2:

# Percentiles for analysis to loop over
percentiles = np.arange(0,100)
# Estimate upper and lower bounds
globalmax = [np.percentile(np.max(expData_sort[:,:],1),p) for p in percentiles]
globalmin = [np.percentile(np.min(expData_sort[:,:],1),p) for p in percentiles]
delta_values = pd.read_csv('./DELTA_scores.csv')
delta_values = delta_values.clip(lower=0)
bottom_row = pd.DataFrame(data=np.array([np.zeros(100)]), index= ['Interaction'], columns=list(delta_values.columns.values))
top_row = pd.DataFrame(data=np.array([globalmin]), index= ['Min'], columns=list(delta_values.columns.values))
delta_values = pd.concat([top_row,delta_values.loc[:],bottom_row])
for p in range(len(percentiles)):
total = np.sum(delta_values[str(percentiles[p])])['Min',str(percentiles[p])]
if total!=0:
for param in param_names:
value = (globalmax[p]-globalmin[p])*[param,str(percentiles[p])]/total
delta_values = delta_values.round(decimals = 2)
delta_values_to_plot = delta_values.values.tolist()
S1_values = pd.read_csv('./S1_scores.csv')
S1_values = S1_values.clip(lower=0)
bottom_row = pd.DataFrame(data=np.array([np.zeros(100)]), index= ['Interaction'], columns=list(S1_values.columns.values))
top_row = pd.DataFrame(data=np.array([globalmin]), index= ['Min'], columns=list(S1_values.columns.values))
S1_values = pd.concat([top_row,S1_values.loc[:],bottom_row])
for p in range(len(percentiles)):
total = np.sum(S1_values[str(percentiles[p])])['Min',str(percentiles[p])]
if total!=0:
diff = 1-total
for param in param_names+['Interaction']:
value = (globalmax[p]-globalmin[p])*[param,str(percentiles[p])]
S1_values = S1_values.round(decimals = 2)
S1_values_to_plot = S1_values.values.tolist()
R2_values = pd.read_csv('./R2_scores.csv')
R2_values = R2_values.clip(lower=0)
bottom_row = pd.DataFrame(data=np.array([np.zeros(100)]), index= ['Interaction'], columns=list(R2_values.columns.values))
top_row = pd.DataFrame(data=np.array([globalmin]), index= ['Min'], columns=list(R2_values.columns.values))
R2_values = pd.concat([top_row,R2_values.loc[:],bottom_row])
for p in range(len(percentiles)):
total = np.sum(R2_values[str(percentiles[p])])['Min',str(percentiles[p])]
if total!=0:
diff = 1-total
for param in param_names+['Interaction']:
value = (globalmax[p]-globalmin[p])*[param,str(percentiles[p])]
R2_values = R2_values.round(decimals = 2)
R2_values_to_plot = R2_values.values.tolist()
color_list = ["white", "#F18670", "#E24D3F", "#CF233E", "#681E33", "#676572", "#F3BE22", "#59DEBA", "#14015C", "#DAF8A3", "#0B7A0A", "#F8FFA2", "#578DC0", "#4E4AD8", "#F77632"]
fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(14.5,8))
ax1.stackplot(percentiles, delta_values_to_plot, colors = color_list, labels=parameter_names_long)
l1 = ax1.plot(percentiles, globalmax, color='black', linewidth=2)
l2 = ax1.plot(percentiles, globalmin, color='black', linewidth=2)
ax1.set_title("Delta index")
ax2.stackplot(np.arange(0,100), S1_values_to_plot, colors = color_list, labels=parameter_names_long)
ax2.plot(percentiles, globalmax, color='black', linewidth=2)
ax2.plot(percentiles, globalmin, color='black', linewidth=2)
ax3.stackplot(np.arange(0,100), R2_values_to_plot, colors = color_list, labels=parameter_names_long)
ax3.plot(percentiles, globalmax, color='black', linewidth=2)
ax3.plot(percentiles, globalmin, color='black', linewidth=2)
handles, labels = ax3.get_legend_handles_labels()
ax1.set_ylabel('Annual shortage (af)', fontsize=12)
ax2.set_xlabel('Shortage magnitude percentile', fontsize=12)
ax1.legend((l1), ('Global ensemble',), fontsize=10, loc='upper left')
fig.legend(handles[1:], labels[1:], fontsize=10, loc='lower center',ncol = 5)


[1]: Borgonovo, E. “A New Uncertainty Importance Measure.” Reliability Engineering & System Safety 92, no. 6 (June 1, 2007): 771–84.

[2]: Sobol, I. M. (2001). “Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates.” Mathematics and Computers in Simulation, 55(1-3):271-280, doi:10.1016/S0378-4754(00)00270-6.

Magnitude-varying sensitivity analysis and visualization (Part 1)

Various posts have discussed sensitivity analysis and techniques in this blog before. The purpose of this post is to show an application of the methods and demonstrate how they can be used in an exploratory manner, for the purposes of robust decision making (RDM). RDM aims to evaluate the performance of a policy/strategy/management plan over an ensemble of deeply uncertain parameter combinations – commonly referred to as “states of the world” (SOWs) – and then identify the policies that are most robust to those uncertainties. Most importantly, this process allows the decision maker to examine the implications of their assumptions about the world (or how it will unfold) on their candidate strategies [1].

This is Part 1 of a two part post. In this first post, I’ll introduce the types of figures I’ll be talking about, and some visualization code. In the second post (up in a couple days), I’ll discuss sensitivity analysis for the system as well as some visuals. All the code and data to produce the figures below can be found in this repository.

Now assume the performance of a system is described by a time-series, produced by our model as an output. This might be a streamflow we care about, reservoir releases, nutrient loading, or any type of time-series produced by a run of our model. For the purposes of this example, I’ll use a time-series from the system I’ve been working on, which represents historical shortages for an agricultural user.


Fig. 1: Historical data in series

We can sort and rank these data, in the style of a flow duration curve, which would allow us to easily see, levels for median shortage (50th percentile), worst (99th), etc. The reasons one might care about these things (instead of, say, just looking at the mean, or at the time series as presented in Fig. 1) are multiple : we are often concerned with the levels and frequencies of our extremes when making decisions about systems (e.g. “how bad is the worst case?”, “how rare is the worst case?”), we might like to know how often we exceed a certain threshold (e.g. “how many years exceed an annual shortage of 1000 af?“), or, simply, maintain the distributional information of the series we care about in an easily interpretable format.


Fig. 2: Historical data sorted by percentile

For the purposes of an exploratory experiment, we would like to see how this time-series of model output might change under different conditions (or SOWs). There are multiple ways one might go about this [2], and in this study we sampled a broad range of parameters that we thought would potentially affect the system using Latin Hypercube Sampling [3], producing 1000 parameter combinations. We then re-simulated the system and saved all equivalent outputs for this time-series. We would like to see how this output changes under all the sampled runs.


Fig. 3: Historical data vs. experiment outputs (under 1000 SOWs)

Another way of visualizing this information, if we’re not interested in seeing all the individual lines, is to look at the range of outputs. To produce Fig. 4, I used the fill_between function in matplotlib, filling between the max and min values at each percentile level.


Fig. 4: Historical data vs. range of experiment outputs

By looking at the individual lines or the range, there’s one piece of potentially valuable information we just missed. We have little to no idea of what the density of outputs is within our experiment. We can see the max and min range, the lines thinning out at the edges, but it’s very difficult to infer any density of output within our samples. To address this, I’ve written a little function that loops through 10 frequency levels (you can also think of them as percentiles) and uses the fill_between function again. The only tricky thing to figure out was how to appropriately represent each layer of increasing opacity in the legend – they are all the same color and transparency, but become darker as they’re overlaid. I pulled two tricks for this. First, I needed a function that calculates the custom alpha, or the transparency, as it is not cumulative in matplotlib (e.g., two objects with transparency 0.2 together will appear as a single object with transparency 0.36).

def alpha(i, base=0.2):
l = lambda x: x+base-x*base
ar = [l(0)]
for j in range(i):
return ar[-1]
view raw hosted with ❤ by GitHub

Second, I needed proxy artists representing the color at each layer. These are the handles in the code below, produced with every loop iteration.

handles = []
fig = plt.figure()
for i in range(len(p)):
ax.fill_between(P, np.min(expData_sort[:,:],1), np.percentile(expData_sort[:,:], p[i], axis=1), color='#4286f4', alpha = 0.1)
ax.plot(P, np.percentile(expData_sort[:,:], p[i], axis=1), linewidth=0.5, color='#4286f4', alpha = 0.3)
handle = matplotlib.patches.Rectangle((0,0),1,1, color='#4286f4', alpha=alpha(i, base=0.1))
label = "{:.0f} %".format(100-p[i])
ax.plot(P,hist_sort, c='black', linewidth=2, label='Historical record')
ax.legend(handles=handles, labels=labels, framealpha=1, fontsize=8, loc='upper left', title='Frequency in experiment',ncol=2)
ax.set_xlabel('Shortage magnitude percentile', fontsize=12)
view raw hosted with ❤ by GitHub

Fig. 5: Historical data vs. frequency of experiment outputs

This allows us to draw some conclusions about how events of different magnitudes/frequencies shift under the SOWs we evaluated. For this particular case, it seems that high frequency, small shortages (left hand side) are becoming smaller and/or less frequent, whereas low frequency, large shortages (right hand side) are becoming larger and/or more frequent. Of course, the probabilistic inference here depends on the samples we chose, but it serves the exploratory purposes of this analysis.


[1]: Bryant, Benjamin P., and Robert J. Lempert. “Thinking inside the Box: A Participatory, Computer-Assisted Approach to Scenario Discovery.” Technological Forecasting and Social Change 77, no. 1 (January 1, 2010): 34–49.

[2]: Herman, Jonathan D., Patrick M. Reed, Harrison B. Zeff, and Gregory W. Characklis. “How Should Robustness Be Defined for Water Systems Planning under Change?” Journal of Water Resources Planning and Management 141, no. 10 (2015): 4015012.

[3]: McKay, M. D., R. J. Beckman, and W. J. Conover. “A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code.” Technometrics 21, no. 2 (1979): 239–45.

Introducing Julia: A Fast and Modern Language

In this blog post, I am introducing Julia, a high-level open-source dynamic programming language. While it has taken root in finance, machine learning, life sciences, energy, optimization, and economic realms, it is certainly not common within the water programming realm. Through this post, I will give an overview of the features, pros, and cons of Julia before comparing its performance to Python. Following this, I give an overview of common Julia development environments as well as linking to additional resources.

Julia: What’s Going On?

Julia is a high-level open-source dynamic programming language built for scientific computing and data processing that is Pythonic in syntax to ensure accessibility while boosting computational efficiency.  It is parallelizable on high performance computing resources utilizing CPUs and GPUs, allowing for its use in large-scale experiments.

With Version 1.1.0 recently released, Julia now stands amongst the established and stable programming languages. As such, Julia can handle matrices with ease while performing at speeds nearly comparable to C and Fortran. It is comparable to Cython/Numba and faster than Numpy, Python, Matlab, and R. Note that further benchmarking is required on a project-specific basis due to the speed of individual packages/libraries, but a simple case is shown in the following section.

The biggest case for implementing Julia over C/C++/Fortran is its simplicity in writing and implementation. Its language is similar Python, allowing for not only list comprehension but also ‘sloppy’ dynamic variable type declarations and use. Julia has the ability to move between dynamic and static variable types. Furthermore, Julia allows users to create unique variable types, allowing for maximum flexibility while maintaining efficiency.

Installing and implementing packages from the start is nearly seamless, simply requiring the name of the package and an internet connection. Most packages and libraries are hosted on GitHub and are relatively straightforward to install with just a couple lines of code. On the same hand, distributing and contributing to the opensource community is straightforward. Furthermore, Julia can access Python modules with wrappers and C/Fortran functions without, making it a very versatile language. Likewise, it can easily be called from Python (PyJulia) to speed up otherwise cumbersome operations.

Julia has relatively straightforward profiling modules built in. Beyond being able to utilize Jupyter Notebook commands, Julia has a range of code diagnostic tools, such as a @Time command to inform the user of wall time and memory allocation for a function.

One of the most obvious downsides to Julia when compared to Python is the relatively small community. There is not nearly the base of documentation that I am accustomed to; nearly every issue imaginable in the Python world has been explored on Stack Overflow, Julia is much more limited. However, because much of its functionality is based on MATLAB, similar issues and their corresponding solutions bleed over.

Additional transition issues that may be encountered is that unlike Python, Julia’s indexing begins with 1. Furthermore, Julia uses column-major ordering (like Matlab, Fortran, and R) unlike Python and C (row-major ordering), making iterating column-by-column substantially faster. Thus, special care must be taken when switching between languages in addition to ensuring consistent strategies for speeding up code. Another substantial transition that may be required is that Julia is not directly object-oriented since objects do not directly have embedded methods; instead, a given object must be passed to a function to be modified.

Overall, I see Julia as having the potential to improve computational efficiency for computationally-intensive models without the need to learn a more complex language (e.g. Fortran or C/C++). Between its ease of integration between other common languages/libraries to its ability to properly utilize HPC CPU/GPU resources, Julia may be a useful tool to utilize alongside Python to create more efficient large-scale models.

Julia Allows for Faster Matrix Operations

As a simple test to compare scalability of Julia and Python, I computed the dot product of increasingly large matrices using the base Julia language and Numpy in Python (both running in Jupyter Notebooks on a Windows desktop). You can find the code at the following link on Github.


As you can see in the generated figure, Julia outperforms Python for large matrix applications, indicating that Julia is more efficient and may be worth using for larger and larger operations. If you have any suggestions on improving this code for comparison, please let me know; I am absolutely game for improvement.

Development Environments for Julia

The first and most obvious variables on what development environment is best is you, the user. If you’re wanting all the bells and whistles versus a simple text-editing environment, the range of products to fulfill you desires exists.

Juno—an extension to Atom—is the default IDE that is incorporated with JuliaPro, a pre-packaged version of Julia that includes a range of standard packages—Anaconda is the parallel in the Python world.

Jupyter Notebooks  are my go-to development environment for experimenting with code in both Python and Julia, especially for testing small sections of code.  Between being able to easily create visuals and examples inline with your code and combining code with Markdown, it allows for clean and interactive approach to sharing code or teaching. Note that installation generally requires IJulia but can allow for easy integration into already existing Jupyter workflows.

JetBrains IDEs (e.g. CLion, PyCharm) support a Julia plugin that supports a wide range of the same features JetBrains is known for. However, it is still in beta and is working to improve its formatter and debugging capabilities.

JuliaBox is an online web interface using Jupyter Notebooks for code development on a remote server, requiring no local installation of Julia. Whether you are in a classroom setting, wanting to write code on your phone, or are just wanting to experiment with parallel computing (or don’t have access to HPC resources), JuliaBox allows for a nearly seamless setup experience. However, note that this development environment limits each session to 90 minutes (up to 8 hours on a paid subscription) and requires a subscription to access any resources beyond 3 CPU cores. Note that you can access GPU instances with a paid subscription.

Julia Studio is a default IDE used by a range of users. It is a no-frills IDE based on Qt Creator, resulting in a clean look.

For anyone looking to use Visual Studio, you can install a VS Code extension.

Vim is not surprisingly available for Julia.  And if you’re feeling up the challenge, Emacs also has an interface.

Of course, you can just use a texteditor of your choice (e.g. Sublime Text with an extension) and simply run Julia from the terminal.

Julia Resources


Thanks to Dave Gold for his assistance in giving guidance for benchmarking and reminding me of the KISS principle.

Introduction to Unit Testing


Here is something that has happened to most or all of us engineers who code. We write a function or a class and make sure it works by running it and printing some internal state output on the console. The following week, after adding or making improvements on seemingly unrelated parts of the code, the results start to look weird. After a few hours of investigation (a.k.a. wasted time) we find out the reason for the weird results is that that first function or class is either not giving us the right numbers anymore or not behaving as expected with different inputs. I wonder how many papers would have been written, how many deadlines would have been met, and how many “this $@#%& does not @#$% work!!!” thoughts and comments to friends and on our codes themselves would have been avoided if we managed to get rid of such frustrating and lengthy situations.

In this post I will introduce a solution to this rather pervasive issue: unit testing, which amounts to ways of creating standard tests throughout the development of a code to avoid the mentioned setbacks and simplify debugging. I will also get into some of the specifics of unit testing for Python and demonstrate it with a reservoir routing example. Lastly, I will briefly mention a unit testing framework for C++.

The General Idea

Unit test frameworks are libraries with functions (or macros) that allow for the creation of standard tests for individual parts of code, such as function and classes, that can be run separately from the rest of code. The advantages of using Unit Testing are mainly that (1) it eliminates the need of running the entire code with a full set of inputs simply to see if a specific function or class is working, which in some cases can get convoluted take a really long time, (2) it allows for easy and fast debugging, given that if a certain test fails you know where the problem is, (3) it prevents you or someone else from breaking the code down the road by running the tests frequently as the code is developed, and (4) it prevents an error in a function or class to be made noticible in another, which complicates the debugging process. Of course, the more modular a code is the easier it is to separately test its parts — a code comprised of just one 1,000-lines long function cannot be properly unit-tested, or rather can only have one unit test for the entire function.

Example Code to be Tested

The first thing we need to perform unit testing is a code to be tested. Let’s then consider the reservoir routing code below, comprised of a reservoir class, a synthetic stream flow generation function, a function to run the simulation over time, and the main.:

import numpy as np
import matplotlib.pyplot as plt

class Reservoir:
    __capacity = -1
    __storage_area_curve = np.array([[], []])
    __evaporation_series = np.array([])
    __inflow_series = np.array([])
    __demand_series = np.array([])
    __stored_volume = np.array([])

    def __init__(self, storage_area_curve, evaporations, inflows, demands):
        """ Constructor for reservoir class

            storage_area_curve {2D numpy matrix} -- Matrix with a
            row of storages and a row of areas
            evaporations {array/list} -- array of evaporations
            inflows {array/list} -- array of inflows
            demands {array/list} -- array of demands

        self.__storage_area_curve = storage_area_curve
        assert(storage_area_curve.shape[0] == 2)
        assert(len(storage_area_curve[0]) == len(storage_area_curve[1]))

        self.__capacity = storage_area_curve[0, -1]

        n_weeks = len(demands)
        self.__stored_volume = np.ones(n_weeks, dtype=float) * self.__capacity
        self.__evaporation_series = evaporations
        self.__inflow_series = inflows
        self.__demand_series = demands

    def calculate_area(self, stored_volume):
        """ Calculates reservoir area based on its storave vs. area curve

            stored_volume {float} -- current stored volume

            float -- reservoir area

        storage_area_curve_T = self.__storage_area_curve.T .astype(float)

        if stored_volume  self.__capacity:
            print "Storage volume {} graeter than capacity {}.".format(
                stored_volume, self.__capacity)
            raise ValueError

        for i in range(1, len(storage_area_curve_T)):
            s, a = storage_area_curve_T[i]
            # the &st; below needs to be replace with "smaller than" symbol. WordPress code highlighter has a bug that was distorting the code because of this "smaller than."
            if stored_volume &st; s:
                sm, am = storage_area_curve_T[i - 1]
                return am + (stored_volume - sm)/ (s - sm) * (a - am)

        return a

    def mass_balance(self, upstream_flow, week):
        """ Performs mass balance on reservoir

            upstream_flow {float} -- release from upstream reservoir
            week {int} -- week

            double, double -- reservoir release and unfulfilled
            demand (in case reservoir gets empty)

        evaporation = self.__evaporation_series[week] *\
            self.calculate_area(self.__stored_volume[week - 1])
        new_stored_volume = self.__stored_volume[week - 1] + upstream_flow +\
            self.__inflow_series[week] - evaporation - self.__demand_series[week]

        release = 0
        unfulfiled_demand = 0

        if (new_stored_volume > self.__capacity):
            release = new_stored_volume - self.__capacity
            new_stored_volume = self.__capacity
        elif (new_stored_volume < 0.):
            unfulfiled_demand = -new_stored_volume
            new_stored_volume = 0.

        self.__stored_volume[week] = new_stored_volume

        return release, unfulfiled_demand

    def get_stored_volume_series(self):
        return self.__stored_volume

def run_mass_balance(reservoirs, n_weeks):
    upstream_flow = 0
    release = 0
    total_unfulfilled_demand = 0
    for week in range(1, n_weeks):
        for reservoir in reservoirs:
            release, unfulfilled_demand = reservoir.mass_balance(release, week)

def generate_streamflow(n_weeks, sin_amplitude, log_mu, log_sigma):
    """Log-normally distributed stream flow generator.

        n_weeks {int} -- number of weeks of stream flows
        sin_amplitude {double} -- amplitude of log-mean sinusoid fluctuation
        log_mu {double} -- mean log-mean
        log_sigma {double} -- log-sigma

        {Numpy array} -- stream flow series

    streamflows = np.zeros(n_weeks)

    for i in range(n_weeks):
        streamflow = np.random.randn() * log_sigma + log_mu *\
            (1. + sin_amplitude * np.sin(2. * np.pi / 52 * (i % 52)))
        streamflows[i] = np.exp(streamflow)

    return streamflows

if __name__ == "__main__":
    n_weeks = 522
    sin_amplitude, log_mean, log_std = 1., 2.0, 1.8
    streamflows = generate_streamflow(n_weeks, sin_amplitude, log_mean, log_std)
    storage_area_curve = np.array([[0, 1000, 3000, 4000], [0, 400, 600, 900]])

    reseroir1 = Reservoir(storage_area_curve, np.random.rand(n_weeks) / 10,\
        streamflows, np.random.rand(n_weeks) * 60)

    run_mass_balance([reseroir1], n_weeks)

    plt.xlabel("Weeks [-]")
    plt.ylabel("Stored Volume [MG]")
    plt.title("Storage Over Time")
    plt.ylim(0, 4100)

We have quite a few functions to be tested in the code above, but lets concentrate on Reservoir.calculate_area and generate_streamflow. The first is a function within a class (Reservoir) while the second is a standalone function. Another difference between both functions is that we know exactly what output to expect from Reservoir.calculate_area (area given storage), while given generate_stream flow is setup to return random numbers we can only test their estimated statistical properties, which will change slightly every time we run the function. We will use the PyTest framework to perform unit testing in the reservoir routing code above.


PyTest is one among many available unit test frameworks for Python. The reason for its choice is that PyTest is commonly used by Python developers. You can find the API (list of commands) here and examples more here.

Basic Work Flow

Using PyTest is pretty simple. All you have to do it:

  • Install PyTest with pip or conda (pip install pytest or conda install pytest)
  • Create a file (preferably in the same directory as the file with the code you want to test) whose name either begins or end with “test_” or “,” respectively — this is a “must.” In this example, the main code is called “,” so I created the PyTest file in the same directory and named it “”
  • In the unit test file (“”), import from the main code file the functions and classes you want to test, the PyTest module pytest and any other packages you may use to write the tests, such as Numpy.
  • Write the tests. Tests in PyTest are simply regular Python functions whose names begin with “test_” and that have assertions (“assert,” will get to that below) in them.
  • On the Command Prompt, PowerShell, or Linux terminal, run the command “pytest.” PyTest will then look for all your files that begin with “test_” or end in “,” scan them for functions beginning with “test_,” which indicate a test function, run them and give you the results.


Assertions are the actual checks, the cores of the tests. Assertions are performed with the assert command and will return “pass” if the condition being asserting is true and “fail” otherwise. Below is an example of an assertion, which will return “pass” if the function being tested returns the value on the right-hand-side for the given arguments or “false” otherwise:

assert function_calculate_something(1, 'a', 5) == 100

Sometimes we are not sure of the value a function will return but have an approximate idea. This is the case of functions that perform stochastic calculations or calculate numerically approximate solutions — e.g. sampling from distributions, Newton-Rhapson minimum finder, E-M algorithm, finite-differences PDE solvers, etc. In this case, we should perform our assertion against an approximate value with:

# assert if function returns 100 +- 2
assert function_calculate_something(1, 'a', 5) == pytest.approx(100., abs=2.)

or with:

# assert if function returns 100 +- 0.1 (or 100 +- 100 * 1e-3)
assert function_calculate_something(1, 'a', 5) == pytest.approx(100., rel=1e-3)

Sometimes (although, unfortunately, rarely) we add code to a function to check the validity of the arguments, intermediate calculations, or to handle exceptions during the function’s execution. We can also test all that with:

# Test if an exception (such as Exception) are properly raised
with pytest.raises(Exception):
    function_calculate_something(1345336, 'agfhdhdh', 54784574)

Test File Example

Putting all the above together, below is a complete PyTest file with two tests (two test functions, each with whatever number of assertions):

from reservoir_routing import Reservoir, generate_streamflow
import numpy as np
import pytest

def test_calculate_area():
    n_weeks = 522
    storage_area_curve = np.array([[0, 500, 800, 1000], [0, 400, 600, 900]])

    reseroir1 = Reservoir(storage_area_curve, np.random.rand(n_weeks),
                            np.zeros(n_weeks), np.random.rand(n_weeks) * 20)

    # Test specific values of storages and areas
    assert reseroir1.calculate_area(500) == 400
    assert reseroir1.calculate_area(650) == 500
    assert reseroir1.calculate_area(1000) == 900

    # Test if exceptions are properly raised
    with pytest.raises(ValueError):

def test_generate_streamflow():
    n_weeks = 100000
    log_mu = 7.8
    log_sigma = 0.5
    sin_amplitude = 1.
    streamflows = generate_streamflow(n_weeks, sin_amplitude, log_mu, log_sigma)

    whitened_log_mu = (np.log(streamflows[range(0, n_weeks, 52)]) - log_mu) / log_sigma

    # Test if whitened mean in log space is 0 +- 0.5
    assert np.mean(whitened_log_mu) == pytest.approx(0., abs=0.05)

If I open my Command Prompt, navigate to the folder where both the main code and the test Python files are located and run the command pytest I will get the output below:

F:\Dropbox\Bernardo\Blog posts>pytest
============================= test session starts =============================
platform win32 -- Python 2.7.13, pytest-4.2.0, py-1.7.0, pluggy-0.8.1
rootdir: F:\Dropbox\Bernardo\Blog posts, inifile:
collected 2 items F.                                             [100%]

================================== FAILURES ===================================
_____________________________ test_calculate_area _____________________________

    def test_calculate_area():
        n_weeks = 522
        storage_area_curve = np.array([[0, 500, 800, 1000], [0, 400, 600, 900]])

        reseroir1 = Reservoir(storage_area_curve, np.random.rand(n_weeks),
                                np.zeros(n_weeks), np.random.rand(n_weeks) * 20)

        # Test specific values of storages and areas
        assert reseroir1.calculate_area(500) == 400
>       assert reseroir1.calculate_area(650) == 500
E       assert 400 == 500
E        +  where 400 = (650)
E        +    where  = .calculate_area AssertionError
========================== deprecated python version ==========================
You are using Python 2.7.13, which will no longer be supported in pytest 5.0
For more information, please read:
===================== 1 failed, 1 passed in 1.07 seconds ======================

F:\Dropbox\Bernardo\Blog posts>

This seems like a lot just to say that one test passed and the other failed, but there is actually some quite useful information in this output. The lines below (20 and 21 of the output above) indicate which assertion failed, so you can go ahead and debug your the corresponding specific part of code with a debugger:

>       assert reseroir1.calculate_area(650) == 500
E       assert 400 == 500

I actually found this error with PyTest as I was writing the reservoir routing code for this post and used VS Code, which has a debugger integrated with PyTest, to debug and fix the error. The error is happening because of an int to float conversion that is not being done correctly in line 45 of the original code, which can be replaced by the version below to fix the problem:

storage_area_curve_T = self.__storage_area_curve.T<strong>.astype(float)</strong>

The last piece of important information contained in the last few lines of the output is that support for Python 2.7 will be discontinued, as it is happening with other Python packages as well. It may be time to finally switch to Python 3.

Practice Exercise

As a practice exercise, try writing a test function for the Reservoir.mass_balance function including cases in which the reservoir spills, gets empty, and is partly full.

Unit Testing in C++

A framework for unit testing in C++ that is quite popular is the Catch2 framework. The idea is exactly the same: you have a code you want to test and use the framework to write test functions. An example of Catch2 being used to check a factorial function is shown below, with both the main and the test codes in the same file:

#define CATCH_CONFIG_MAIN  // This tells Catch to provide a main() - only do this in one cpp file
#include "catch.hpp"

unsigned int Factorial( unsigned int number ) {
    return number <= 1 ? number : Factorial(number-1)*number;

TEST_CASE( "Factorials are computed", "[factorial]" ) {
    REQUIRE( Factorial(1) == 1 );
    REQUIRE( Factorial(2) == 2 );
    REQUIRE( Factorial(3) == 6 );
    REQUIRE( Factorial(10) == 3628800 );


Setting up and running unit testing is not rocket science. Also, with my test functions I do not have to use my actual reservoir system test case in order to check if the surface area calculations are correct. If I find they are not, I can just re-run the test function with a debugger (VS Code is great for that) and fix my function without having to deal with the rest of the code. You would be surprised by how many hours of debugging can be saved by taking the unit test approach. Lastly, be sure to run your tests after any change to your code you consider minimally significant (I know, this sounds vague). You never know when you mess something up by trying to fix or improve something else.

Intro to Machine Learning Part 5: Bagging

In Part 3 of this series, it was stated that classification error can be due to bias, variance, or noise. There are different methods to address each type of classification error. This post will focus on bootstrap aggregation, or bagging, which aims to improve classification error due to variance.

As a reminder, expected test error can be decomposed as follows:


where hd is the training data classifier and hbar is the ideal expected classifier. Now let’s just focus on the part of the testing error that is due to variance. From the equation, it is clear that if we want this error component to approach 0, then hd(x) must tend to hbar(x). We can use the the Weak Law of Large Numbers (WLLN) to understand how this can happen. WLLN states that if you have a sample of independent and identically distributed (i.i.d) random variables, as the sample size grows larger, the sample mean will tend toward the population mean.

If this is applied in a classification setting:                                 Capture(4)

In this equation, a classifier is trained on each of the m datatsets and then the results are averaged. By WLLN, hd tends to hbar …but only as our number of datasets, m, approaches infinity. But we only have one datatset, D, so how can we get more datasets? Through bagging!



Fig 1: Bootstrapping Process

Let’s say our original dataset D comes from some distribution P, shown in Figure 1.We can simulate drawing from P by bootstrapping, or sampling our original dataset D with replacement. Because we are sampling with replacement, we are essentially creating new datasets out of our original one! Note that when we sample with replacement, we cannot say that the new samples are i.i.d., so therefore the Weak Law of Large Numbers does not quite apply. But error due to variance can still be considerably reduced with this method.

Application: Random Forests

Bagging is most popularly utilized for random forests which are bagged decision trees. A decision tree is trained on each of the m datasets and then either the mode label is assigned (in a classification setting), or an average value is calculated (in a regression setting). Random Forests are termed “out of the box” algorithms; they are very easy to use because 1) only 2 hyper-parameters have to be defined, m (the number of datasets) and k (a feature splitting parameter) and  2) features don’t need to be normalized to have the same units or scale, which is extremely helpful for any kind of heterogeneous data.

Coding up the Classifier

Random Forests can be implemented in every programming language. In Python, scikit-learn has a function called RandomForestClassifier. Only the number of trees in the forest needs to be specified, but there are many extra parameters that can be fine-tuned, including the criterion to measure the quality of a split (Gini vs. Entropy) and the maximum depth of each tree. The code below outlines a basic implementation of the classifier. A training dataset with “value” and “label” fields is read in and then fit with the RandomForestClassifier function that is trained on 200 trees. Then the classifier is used to predict the labels associated with each test set “value” and the prediction accuracy is calculated.

#import libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
#import training data
training = pd.read_csv('train.csv', delimiter=',')
#fit classifier in training data
rfc1=RandomForestClassifier(n_estimators= 200, max_depth=8, criterion='entropy'), training.label)
#import testing data
test = pd.read_csv('test.csv', delimiter=',')
#predict labels
predicted = rfc1.predict(test.value)
#calculate accuracy
acc=np.mean(predicted == test.label)

Random Forests in Water Resources

The Water Resources field has seen an increasing machine learning presence for the past decade. However, the majority of research has been focused on implementation of artificial neural networks. Though just emerging in the field, random forests are becoming increasingly popular for both prediction and regression. Shortridge et al. (2016), tests the ability of random forests (among other models such as GLMs and ANNs) to predict monthly streamflow for five rivers in the Lake Tana Basin in Ethiopia. Empirical models that estimate monthly streamflow as a function of climate conditions and agricultural land cover in each basin were developed for the period from 1961 to 2004. The predictive capabilities of each model was tested along with investigation of their error structures. Random forests were found to better capture non-linear relationships than their ANN counterparts. Furthermore, they were much more parsimonious and physically interpretable. The paper also addresses ability of these models to make meaningful predictions under uncertainty and a non-stationary climate.


All of the material for the post comes from Kilian Weinberger’s CS4780 class notes and lectures found at:

Shortridge, Julie E., et al. “Machine Learning Methods for Empirical Streamflow Simulation: a Comparison of Model Accuracy, Interpretability, and Uncertainty in Seasonal Watersheds.” Hydrology and Earth System Sciences, vol. 20, no. 7, 2016, pp. 2611–2628., doi:10.5194/hess-20-2611-2016.