Creating a collaborative research group lab manual with Jupyter Books


Onboarding new students, staff, or collaborators can be a challenge in highly technical fields. Often, the knowledge needed to run complex experiments or computing workflows is spread across multiple individuals or teams based on their unique experiences, and training of new team members tends to occur in an ad hoc and inefficient manner. These challenges are compounded by the inevitable turnover of students, postdocs, and external collaborators in academic research settings.

Over the years, the Reed Group has developed a large bank of training materials to help streamline the onboarding process for new team members. These materials introduce concepts and tools related to water systems modeling, multi-objective evolutionary algorithms, global sensitivity analysis, synthetic streamflow generation, etc. However, these materials are still spread across a variety of sources (academic papers, GitHub repositories, blog posts, etc.) and team members, and there has been growing recognition of the need to catalogue and compile relevant resources and trainings in a more structured way.

For this reason, we have begun to create a lab manual for the Reed group. This will include a wide variety of information relevant to new students and researchers – everything from training exercises and code snippets, to reading lists and coursework suggestions, to a code of conduct outlining our values and expectations. The goal is for this to be a collaborative, living document created and maintained by students and postdocs. Ideally this will continue to evolve for years along with the evolving state-of-the-art in methodology, software, and literature.

After considering a number of different platforms for constructing websites, we settled on the Jupyter Book package for Python. You can find our lab manual here, and the source code used to create it here – note that this is still very much in development, a skeleton waiting to be fleshed out. In the remainder of this blog post, I will highlight the major elements of a Jupyter Book website, using our skeleton lab manual as an example. Then in a future blog post, I will outline the Continuous Integration and Continuous Delivery (CI/CD) strategy we are using to manage versioning and platform dependency issues across multiple developers.

Intro to Jupyter Book

Jupyter Book is a Python package for creating static websites. The package is built on the popular Sphinx engine used to create documentation for many of your favorite Python packages. Sphinx was also used to create the ebook for “Addressing Uncertainty in MultiSector Dynamics Research“, as described in two recent blog posts by Rohini Gupta and Travis Thurber. The ebook was a source of inspiration for our lab manual and the reason we initially considered Sphinx-based workflows. However, Jupyter Books layers several additional functionalities on top of Sphinx. First, it supports use of the MyST Markdown language, which is more familiar and intuitive to most researchers than the reStructured Text format favored by Sphinx. And second, it allows for pages to be built from executable Jupyter Notebooks, a powerful tool for combining text and equations with formatted code blocks, program output, and generated figures.

The Jupyter Book documentation contains tutorials, examples, and references, and is an excellent resource for anyone looking to build their own site. The documentation itself is, of course, created using the Jupyter Book package, and interested readers can check out the source code here.

Designing the website structure

The hierarchical structure of a Jupyter Book is defined in a simple YAML-style Table of Contents file, which should be named _toc.yml. Here is the TOC for our lab manual at present:

format: jb-book
- chapters:
  - file: ExamplePages/
    - file: ExamplePages/
    - file: ExamplePages/nbExample.ipynb
  - file: Resources/
    - file: Resources/
    - file: Resources/
    - file: Resources/
    - file: Resources/
    - file: Resources/
    - file: Resources/
    - file: Resources/
    - file: Resources/
  - file: Training/
    - file: Training/
    - file: Training/
    - file: Training/
    - file: Training/
    - file: Training/

The “root” defines the landing page, in this case the markdown file. That landing page will link to three “chapters” called ExamplePages, Resources, and Training. Each of these chapters has it’s own landing page as well as multiple child “sections.” Each page can either be written as a Markdown file (.md) or a Jupyter Notebook (.ipynb).

The other important YAML file for all Jupyter Books is _config.yml:

title: Reed group lab manual
author: The Reed Group at Cornell CEE
logo: logo.png

# Force re-execution of notebooks on each build.
# See
  execute_notebooks: force

# Define the name of the latex output file for PDF builds
    targetname: book.tex

# Add a bibtex file so that we can create citations
  - references.bib

# Information about where the book exists on the web
  url:  # Online location of your book
  path_to_book: docs  # Optional path to your book, relative to the repository root

# Add GitHub buttons to your book
# See
  use_issues_button: true
  use_repository_button: true

We first define our website’s title and author, as well as an image logo to display. The line “execute_notebooks: force” means that we want to reexecute all Jupyter Notebooks each time the site is built (see docs for other options). The url gives the web address where we want to host our site – in this case the GitHub Pages address associated with the GitHub repository for the site. The path_to_book defines “docs” as the folder in the repository where all source code is to be held. Finally, the last two options are used to create buttons at the top of our site that link to the GitHub repository in case readers want to browse the source code or report an issue. For now, we are using the default vanilla style, but there are many ways to customize the structural and aesthetic style of the site. You would need to point to custom style files from this configuration file – see the Jupyter Book gallery for inspiration.

Building pages with Markdown and Jupyter Notebooks

Jupyter Book makes it very easy to write new pages using either Markdown or Jupyter Notebooks. For context, here is a screenshot of the site’s homepage:

The main content section for this page is built from the “root” file,

# Welcome to our lab manual! 

This site is still under construction

The purpose of this site is to help new students and collaborators get up to speed on the research methods/tools used by the Reed Group. This page is designed and maintained by other graduate students and post docs, and is intended to serve as a living document. 

This manual was created using the Jupyter Books Python package, and is hosted with GitHub Pages. You can find our source code at


As you can see, this uses a very human-readable and intuitive Markdown-based file structure. Jupyter Book provides simple functionality for warning labels and other emphasis boxes, as well as a Table of Contents that is automatically rendered from the _toc.yml file. The tableofcontents command can be used from anywhere in the hierarchical page tree and will automatically filter to include only children of the current page. The separate sidebar TOC will also expand to show “sections” as you navigate into different “chapters.” For example, here is the Markdown and rendered webpage for the “ExamplePages” chapter:

# Example Pages with JupyterBooks

For more detailed pages, you can also apply standard Markdown syntax to add section headers, bold/italic font, code blocks, lists, Latex equations, images, etc. For example, here is ExamplePages/

# Markdown example
This is an example page using just markdown

### Subsection 1
Here is a subsection

### Subsection 2
Here is another subsection. 

Here is a note!

And here is a code block:

e = mc^2

And here comes a cute image!

![capybara and friends](capybaraFriends.jpg "Capybara and friends")

Lastly, and most importantly for purposes of building a training manual, we can create pages using Jupyter Notebooks. For example, here are two screenshots of the webpage rendered from ExamplePages/nbExample.ipynb:

As you can see, the Notebook functionality allows us to combine text and equations with rendered Python code. We can also execute Bash, R, or other programs using Jupyter Notebook’s “magic” commands. Note that the Jupyter-based website is not interactive – for that you’ll need Binder, as demonstrated in this blog post by David Gold.

Nevertheless, the Notebook is reexecuted each time we rebuild the website, which should really streamline collaborative lab manual development. For example, once we have developed a code bank of visualization examples (stay tuned!), it will be straightforward to edit the existing examples and/or add new examples, with the rendered visualizations being automatically updated rather than needing to manually upload the new images. Additionally, reexecuting the Notebooks each time we rebuild the site will force us to maintain the functionality of our existing code bank rather than letting portions become obsolete due to package dependencies or other issues.

Next steps

You now have the basic building blocks to create your own lab manual or ebook using a collection of YAML files, Markdown files, and Jupyter Notebooks. The last two critical steps are to actually build the static site (e.g., the html files) using Jupyter Book, and then host the site using GitHub pages. I will demonstrate these steps, as well as our CI/CD strategy based on GitHub Actions, in my next blog post.

Efficient hydroclimatic data accessing with HyRiver for Python

This tutorial highlights the HyRiver software stack for Python, which is a very powerful tool for acquiring large sets of data from various web services.

I have uploaded a Jupyter-Notebook version of this post here if you would like to execute it yourself.

HyRiver Introduction

The HyRiver software suite was developed by Taher Chegini who, in their own words, describes HyRiver as:

“… a software stack consisting of seven Python libraries that are designed to aid in hydroclimate analysis through web services.”

This description does not do justice to the capability of this software. Through my research I have spent significant amounts of time wrangling various datasets – making sure that dates align, or accounting for spatial misalignment of available data. The HyRiver suite streamlines this process, and makes acquisition of different data from various sources much more efficient.

Here, I am going walk through a demonstration of how to easily access large amounts of data (streamflow, geophysical, and meteorological) for a basin of interest.

Before going through the code, I will highlight the three libraries from the HyRiver stack which I have found most useful: PyGeoHydro, PyNHD, and PyDaymet.


PyGeoHydro allows for interaction with eight different online datasets, including:

In this tutorial, I will only be interacting with the USGS NWIS, which provides daily streamflow data.


The PyNHD library is designed to interact with the National Hydrography Dataset (NHD)and the Hydro Network-Linked Data Index (NLDI).

NHDPlus (National Hydrography Dataset)

The NHD defines a high-resolutioon network of stream linkages, each with a unique idenfier (ComID).

NLDI (Network-Linked Data Index)

The NLDI aids in the discovery of indexed information along some NHD-specified geometry (ComIDs). The NLDI essentially tranverses the linkages specified by the NHD geometry and generates data either local or basin-aggregated data relative to a specific linkage (ComID).

As will be seen later in the tutorial, the NLDI is able to retrieve at least 126 different types of data for a given basin…


The PyDaymet GirHub repository summarizes the package as:

“[providing] access to climate data from Daymet V4 database using NetCDF Subset Service (NCSS). Both single pixel (using get_bycoords function) and gridded data (using get_bygeom) are supported which are returned as pandas.DataFrame and xarray.Dataset, respectively.”

Tutorial outline:

  1. Installation
  2. Retrieving USGS Water Data
  3. Retrieving Geophysical (NLDI) Data
  4. Retrieving Daymet Data

The HyRiver repository contains various examples demonstrating the use of the various libraries. I would definitely recommend digging in deeper to these, and other HyRiver documentation if this post piques your interest.

Step 0: Installation

In this tutorial, I only only interact with the PyNHD, PyGeoHydro, and PyDaymet libraries, so I do not need to install all of the HyRiver suite.

If you operate through pip, you can install these libraries using:

pip install pynhd pygeohydro pydaymet

If you use Anaconda package manager, you can install these packages using:

conda install -c conda-forge pynhd pygeohydro pydaymet

For more information on installation, refer to the HyRiver GitHub repository and related documentation.

Now, onto the fun part!

Step 1: Retreiving USGS Water Data

I am beginning here because streamflow data is typically the first point of interest for most hydrologic engineers or modelers.

Personally, I have gone through the process of trying to download data manually from the USGS NWIS website… My appreciation for the USGS prevents me from saying anything too negative, but let’s just say it was not a pleasant experience.

Pygeohydro allows for direct requests from the USGS National Water Information System (NWIS), which provides daily streamflow data from all USGS gages. The data is conveniently output as a Pandas DataFrame.

With this functionality alone, the PyGeoHydro library is worth learning.

1.1 Initialize PyGeoHydro NWIS tool

# Import common libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import the PyGeohydro libaray tools
import pygeohydro as gh
from pygeohydro import NWIS, plot

# Use the national water info system (NWIS)
nwis = NWIS()

1.2 Requesting USGS Streamflow Data

The get_streamflow() function does exactly as the name entails and will retrieve daily streamflow timeseries, however USGS gage station IDs must be provided. If you are only interested in a single location, then you can enter 8-digit gage ID number along with a specified date range to generate the data:

get_streamflow('########', dates = ('Y-M-D', 'Y-M-D'))

However, I am want to explore larger sets of data over an entire region. Thus, I am going to use PyGeoHydro's get_info() function to identify all gages within some region of interest.

First, I specify a region via (latitude, longitude) bounds, then I send a query which retrieves meta-data information on all the gages in the specified region. In this case, I am exploring the data available near Ithaca, NY.

# Query specifications
region = (-76.7, 42.3, -76, 42.6) # Ithaca, NY

# Send a query for all gage info in the region
query = {"bBox": ",".join(f"{b:.06f}" for b in region),
         "hasDataTypeCd": "dv",
         "outputDataTypeCd": "dv"}

info_box = nwis.get_info(query)

print(f'PyGeoHydro found {len(set(info_box.site_no))} unique gages in this region.')

# [Out]: PyGeoHydro found #N unique gages in this region.

Although, this info_box identify many gages in the region which have very old or very brief data records. Knowing this, I want to filter out data which does not have a suitable record length.

For the sake of this tutorial, I am considering data between January 1st, 2020 and December 31st, 2020.

# Specify date range of interest
dates = ("2020-01-01", "2020-12-31") 

# Filter stations to have only those with proper dates
stations = info_box[(info_box.begin_date <= dates[0]) & (info_box.end_date >= dates[1])].site_no.tolist()

# Remove duplicates by converting to a set
stations = set(stations)

Now, I am ready to use the gage IDs contained in stations to request the streamflow data!

# Retrieve the flow data
flow_data = nwis.get_streamflow(stations, dates, mmd=False)

# Remove gages with nans
flow_data = flow_data.dropna(axis = 1, how = 'any')

After removing duplicates and gages with nans, I have data from five unique gages in this region.

Additionally, PyGeoHydro has a convenient plotting feature to help quickly visualize the streamflow data.

from pygeohydro import plot

# Plot flow data summary
Summary of flow data for the 5 gages found with PyGeoHydro.

There is a lot more to be explored in the PyGeoHydro library, but I will leave that up to the curious reader.

Step 2: Retrieving Geophysical (NLDI) Data

So, you’ve got some streamflow data but you don’t know anything about the physical watershed…

This is where the PyNHD library comes in. Using this library, I can identify entire upstream network from a gage, then extract the NLDI data associated with the watershed linkages.

# Import the PyNHD library
import pynhd as pynhd
from pynhd import NHD
from pynhd import NLDI, WaterData

First, we can take a look at all possible local basin characteristic data that are available:

# Get list of local data types (AKA characteristics, or attributes)
possible_attributes = NLDI().get_validchars("local").index.to_list()

There are 126 characteristics available from the NLDI! These characteristics range from elevation, to reservoir capacity, to bedrock depth. Many if these are not of immediate interest to me, so I will specify a subset of select_attributes to retrieve (basin area, max elevation, and stream slope).

I then loop through all of my USGS stations for which I have data in flow_data, identifying the upstream basin linkages using NLDI().navigate_byid(). Once the basin is identified, I extract the ComID numbers for each linkage and use that number to retrieve the NLDI data of interest. I then store the data in nldi_data. This process is done by the following:

# Specify characteristics of interest
select_attributes = ['CAT_BASIN_AREA', 'CAT_ELEV_MAX', 'CAT_STREAM_SLOPE']

# Initialize a storage matrix
nldi_data = np.zeros((len(flow_data.columns), len(select_attributes)))

# Loop through all gages, and request NLDI data near each gage
for i, st in enumerate(flow_data.columns):

    # Navigate up all flowlines from gage
    flowlines = NLDI().navigate_byid(fsource = 'nwissite',
                                    fid = f'{st}',
                                    source = 'flowlines',
                                    distance = 10)

    # Get the nearest comid
    station_comid = flowlines.nhdplus_comid.to_list()[0]

    # Source NLDI local data
    nldi_data[i,:] = NLDI().getcharacteristic_byid(station_comid, "local", char_ids = select_attributes)

So far, I have timeseries streamflow data for five locations in the Ithaca, NY area, along with the basin area, max basin elevation, and stream slope for each stream. If I can access hydro-climate data, maybe I could begin studying the relationships between streamflow and physical basin features after some rain event.

Step 3: Meteorological data

The PyDaymet library allows for direct requests of meteorological data across an entire basin.

The available data includes:

  • Minimum and maximum temperature (tmin, tmax)
  • Precipitation (prcp)
  • Vapor pressure (vp)
  • Snow-Water Equivalent (swe)
  • Shortwave radiation (srad)

All data are reported daily at a 1km x 1km resolution. Additionally, the PyDaymet library has the ability to estimate potential evapotranspiration, using various approximation methods.

Here, I choose to only request precipitation (prcp) and max temperature (tmax).

So far, the Daymet data retrieval process has been the slowest aspect of my HyRiver workflow. Due to the high-resolution, and potential for large basins, this may be computationally over-intensive if you try to request data for many gages with long time ranges.

# Import the  PyDayment library
import pydaymet as daymet

## Specify which data to request
met_vars = ["prcp", "tmax"]
met_data_names = np.array(['mean_prcp','sd_prcp','mean_tmax','sd_tmax'])

## Initialize storage
daymet_data = np.zeros((len(flow_data.columns), len(met_data_names)))

Similar to the NLDI() process, I loop through each gage (flow_data.columns) and (1) identify the up-gage basin, (2) source the Daymet data within the basin, (3) aggregate and store the data in daymet_data.

## Loop through stations from above
for i, st in enumerate(flow_data.columns):

    # Get the up-station basin geometry
    geometry = NLDI().get_basins(st).geometry[0]

    # Source Daymet data within basin
    basin_met_data = daymet.get_bygeom(geometry, dates, variables= met_vars)

    ## Pull values, aggregate, and store
    # Mean and std dev precipitation
    daymet_data[i, 0] = np.nan_to_num(basin_met_data.prcp.values).mean()
    daymet_data[i, 1] = np.nan_to_num(basin_met_data.prcp.values).std()

    # Mean and std dev of max temperature
    daymet_data[i, 2] = np.nan_to_num(basin_met_data.tmax.values).mean()
    daymet_data[i, 3] = np.nan_to_num(basin_met_data.tmax.values).std()


# [Out]: (5, 4)

Without having used a web-browsers, I have been able to get access to a set of physical basin characteristics, various climate data, and observed streamflow relevant to my region of interest!

Now this data can be exported to a CSV, and used on any other project.


I hope this introduction to HyRiver has encouraged you to go bigger with your hydroclimate data ambitions.

If you are curious to learn more, I’d recommend you see the HyRiver Examples which have various in-depth Jupyter Notebook tutorials.


Chegini, Taher, et al. “HyRiver: Hydroclimate Data Retriever.” Journal of Open Source Software, vol. 6, no. 66, 27 Oct. 2021, p. 3175, 10.21105/joss.03175. Accessed 15 June 2022.

Bivariate choropleth maps

Bivariate choropleth maps

Choropleth maps are a ubiquitous tool in geospatial analysis where color is used to visualize the spatial variability of data (e.g., coloring counties based on temperature, GDP, or election results). Bivariate choropleth maps are a less common variant that use 2-dimensional color palettes to show two different types of data at once. This can be a powerful tool for visualizing spatial correlations and clustering patterns between two related variables.

This blog post will demonstrate how to create a bivariate choropleth map in Python. Data and code for this post can be found in this GitHub repository.

CalEnviroScreen dataset

CalEnviroScreen is a valuable dataset from the California Office of Environmental Health Hazard Assessment which seeks to “identify California communities that are most affected by many sources of pollution, and where people are often especially vulnerable to pollution’s effects.” It provides census tract-level data on a variety of pollution sources (ozone, PM2.5, drinking water contamination, traffic impacts, etc.) and population characteristics (asthma, cardiovascular disease, education, poverty, etc.). The dataset also includes more general aggregated indicators for “Pollution Burden” and “Population Characteristics,” and finally an overall indicator of pollution vulnerability which is the product of the previous two indicators. This data is used by various state agencies to locate disadvantaged communities that can be targeted for environmental justice programs and investments. The CalEnviroScreen 4.0 dataset is available either as an online mapping platform or a downloadable shapefile – the latter is used for this blog post.

Creating univariate choropleth maps using geopandas

This analysis requires the following Python packages: Glob for grabbing files, Matplotlib for general plotting, GenerativePy and Pillow (PIL) for manipulating colors, Geopandas for handling geospatial data, and Contextily for importing basemaps.

import glob
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
from matplotlib.collections import PatchCollection
from matplotlib.colors import rgb2hex
from generativepy.color import Color
from PIL import ImageColor
import geopandas as gpd
import contextily as cx

Once we have imported the necessary libraries, we can read the data into a Geopandas dataframe. Geopandas is a great package which allows us to read in and manipulate shapefiles as dataframes that act very similarly to Pandas dataframes while maintaining the spatial context.

### Read in CalEnviroScreen shapefile, from
shapefiles = glob.glob('*/*.shp')
ces = gpd.read_file(shapefiles[0])

If we print the columns of the dataframe, we see a large list of variables related to pollution burden and population characteristics (e.g., Ozone, Poverty) and aggregated indicators (PolBurd, PopChar, CIscore), as well as percentile scales for most of these variables (e.g., PolBurdP). The specific meanings of each variable can be found in the documentation. Several geospatial attributes are also included among the columns (e.g., Shape_Leng, geometry).

Index(['Tract', 'ZIP', 'County', 'ApproxLoc', 'TotPop19', 'CIscore',
       'CIscoreP', 'Ozone', 'OzoneP', 'PM2_5', 'PM2_5_P', 'DieselPM',
       'DieselPM_P', 'Pesticide', 'PesticideP', 'Tox_Rel', 'Tox_Rel_P',
       'Traffic', 'TrafficP', 'DrinkWat', 'DrinkWatP', 'Lead', 'Lead_P',
       'Cleanup', 'CleanupP', 'GWThreat', 'GWThreatP', 'HazWaste', 'HazWasteP',
       'ImpWatBod', 'ImpWatBodP', 'SolWaste', 'SolWasteP', 'PollBurd',
       'PolBurdSc', 'PolBurdP', 'Asthma', 'AsthmaP', 'LowBirtWt', 'LowBirWP',
       'Cardiovas', 'CardiovasP', 'Educatn', 'EducatP', 'Ling_Isol',
       'Ling_IsolP', 'Poverty', 'PovertyP', 'Unempl', 'UnemplP', 'HousBurd',
       'HousBurdP', 'PopChar', 'PopCharSc', 'PopCharP', 'Child_10',
       'Pop_10_64', 'Elderly65', 'Hispanic', 'White', 'AfricanAm', 'NativeAm',
       'OtherMult', 'Shape_Leng', 'Shape_Area', 'AAPI', 'geometry'],

After reading in the dataframe, it is straightforward to loop over our three aggregated indicator percentiles and create a separate choropleth map for each. Note that the data contain some holes, which are coded as -9999, so we only plot census tracts with values >= 0.

label_dict = {'CIscoreP': 'CalEnviroScreen score percentile', 
              'PolBurdP': 'Pollution burden percentile', 
              'PopCharP': 'Population characteristics percentile'}

for attr, label in label_dict.items():
    fig, ax = plt.subplots(1,1, figsize=(7,10))
    ces.loc[ces[attr] >= 0].plot(attr, ax=ax, alpha=alpha, vmin=0, legend=True, 
             legend_kwds={'label':label, 'orientation':'horizontal', 'shrink': 0.5, 'pad': 0.03, 'alpha':alpha})
    cx.add_basemap(ax = ax,, source=cx.providers.Stamen.TonerLite)
    _ = ax.set_xticks([])
    _ = ax.set_yticks([])
    plt.savefig(f'figs/{attr}.jpg', dpi=dpi, bbox_inches='tight')

Consider first the aggregated CalEnviroScreen Index. I have also included a zoomed version focusing on the Los Angeles/Orange County area (see repository for code).

Readers familiar with California’s environmental issues probably aren’t too surprised by the general pattern: the regions rated highest on the CalEnviroScreen Indicator are heavily concentrated in the Central Valley and the LA-Long Beach area. What about the underlying Pollution Burden and Population Characteristics Scores?

In general, we find that both pollution burden and vulnerable populations tend to be co-located in the Central Valley and the LA-Long Beach area. However, the data are not identical, and there are a variety of reasons why policymakers, public health experts, and researchers may want to compare the two variables and investigate their spatial correlation and clustering.

Creating a bivariate choropleth map

Is there a more effective way to compare and contrast the two variables then looking back and forth between the two figures? Enter the bivariate choropleth map. The goal is to use a two-dimensional colormap that blends two color scales together based on two sets of data. For example, a blue and red colormap would have white or grey in the lower left corner, representing low values for each variable. The upper left and lower right corners would be blue and red, representing a high value for one variable and low value for the other. The upper right corner where both variables take high values would be purple, a blending of red and blue. Intermediate parts of the space can be blended to be more blue or red depending on the underlying variable values. This can be a continuous colorscale, but discrete with 3-5 classes per variable is more common and generally easier to decipher. See this blog post for some examples.

The code below is used to create such a 2D color grid. I chose to use the pink/cyan color map from the post cited above, which has nice contrast and avoids the good/bad connotations of the blue/red map. I retrieved the hex codes for the four corner colors from the same post. We convert each hex color to RGB, and then to a Color object from the Generativepy package. This package provides the functionality for linear interpolation (“lerping”) of colors, which we use to create a 4×4 grid of blended colors representing the quartiles of each index. Lastly, we convert everything back to hex which is easier to use with Geopandas

### percentile bounds defining upper boundaries of color classes
percentile_bounds = [25, 50, 75, 100]

### function to convert hex color to rgb to Color object (generativepy package)
def hex_to_Color(hexcode):
    rgb = ImageColor.getcolor(hexcode, 'RGB')
    rgb = [v/256 for v in rgb]
    rgb = Color(*rgb)
    return rgb

### get corner colors from
c00 = hex_to_Color('#e8e8e8')
c10 = hex_to_Color('#be64ac')
c01 = hex_to_Color('#5ac8c8')
c11 = hex_to_Color('#3b4994')

### now create square grid of colors, using color interpolation from generativepy package
num_grps = len(percentile_bounds)
c00_to_c10 = []
c01_to_c11 = []
colorlist = []
for i in range(num_grps):
    c00_to_c10.append(c00.lerp(c10, 1/(num_grps-1) * i))
    c01_to_c11.append(c01.lerp(c11, 1/(num_grps-1) * i))
for i in range(num_grps):
    for j in range(num_grps):
        colorlist.append(c00_to_c10[i].lerp(c01_to_c11[i], 1/(num_grps-1) * j))

### convert back to hex color
colorlist = [rgb2hex([c.r, c.g, c.b]) for c in colorlist]

Having defined our percentile bounds and color grid, we can assign a color to each census tract based on its Pollution Burden and Population Characteristics Percentiles, and then create a map using Geopandas as before. The 2D color grid legend can be created on an inset axis using Matplotlib’s patch functionality.

### function to get bivariate color given two percentiles
def get_bivariate_choropleth_color(p1, p2):
    if p1>=0 and p2>=0:
        count = 0
        stop = False
        for percentile_bound_p1 in percentile_bounds:
            for percentile_bound_p2 in percentile_bounds:
                if (not stop) and (p1 <= percentile_bound_p1):
                    if (not stop) and (p2 <= percentile_bound_p2):
                        color = colorlist[count]
                        stop = True
                count += 1
        color = [0.8,0.8,0.8,1]
    return color

### plot map based on bivariate choropleth
fig, ax = plt.subplots(1,1, figsize=(8,10))
attr = 'CIscoreP'
ces['color_bivariate'] = [get_bivariate_choropleth_color(p1, p2) for p1,p2 in zip(ces['PolBurdP'].values, ces['PopCharP'].values)]
ces.loc[ces[attr] >= 0].plot(ax=ax, color=ces.loc[ces[attr] >= 0]['color_bivariate'], alpha=alpha, legend=False)
cx.add_basemap(ax = ax,, source=cx.providers.Stamen.TonerLite)

### now create inset legend
ax = ax.inset_axes([0.6,0.6,0.35,0.35])
ax.set_aspect('equal', adjustable='box')
count = 0
xticks = [0]
yticks = [0]
for i,percentile_bound_p1 in enumerate(percentile_bounds):
    for j,percentile_bound_p2 in enumerate(percentile_bounds):
        percentileboxes = [Rectangle((i,j), 1, 1)]
        pc = PatchCollection(percentileboxes, facecolor=colorlist[count], alpha=alpha)
        count += 1
        if i == 0:

_=ax.set_xticks(list(range(len(percentile_bounds)+1)), xticks)
_=ax.set_xlabel('Pollution burden percentile')
_=ax.set_yticks(list(range(len(percentile_bounds)+1)), yticks)
_=ax.set_ylabel('Population characteristics percentile')

plt.savefig(f'figs/bivariate.jpg', dpi=dpi, bbox_inches='tight')

The bivariate color palette allows us to distinguish at a more granular level how census tracts compare across the two variables. Some regions are found to exhibit high pollution burden but less vulnerable communities (pink), while others display the opposite pattern (cyan). Those that score highly on both pollution burden and population characteristics (deep blue/purple) would be considered the most vulnerable to pollution and the most in need of pollution-related state programs and investments.

A Hidden-Markov Modeling Based Approach to Creating Synthetic Streamflow Scenarios

As Dave mentioned in his prior post, our recently published eBook, Addressing Uncertainty in Multisector Dynamics Research, provides several interactive tutorials for hands on training in model diagnostics and uncertainty characterization. This blog post is a preview of an upcoming extension of these trainings featuring an interactive tutorial on creating synthetic streamflow scenarios using a Hidden-Markov Modeling (HMM)- based approach specifically for the Upper Colorado River Basin. This training heavily utilizes Julie Quinn’s prior tutorial on fitting an HMM-based streamflow generator and is inspired by her Earth’s Future paper on understanding if exploratory modeling can truly be scenario-neutral.

In this post, we will primarily be focusing on decadal hydrologic drought in the region as a metric of interest to explore in the historic period, the stationary-synthetic, and non-stationary synthetic ensembles. We will also explore how to place CMIP5 climate projections in the context of our synthetically-created ensembles.  All code for this demo is available in a Jupyter Notebook on Github that follows the progression of this blog step by step (though some content may be subject to change before going into the eBook formally). The code is written in Python.


In the Western United States, and particularly the Colorado River Basin, recent studies have used tree-ring reconstructions to suggest that the megadrought that has been occurring in the Southwest over the past 22 years is the regions worst drought since about 800 AD (Williams et al., 2022). The study’s lead author, UCLA climatologist Park Williams, suggested that had the sequence of wet-dry years occurred as observed but without the human-caused drying trend, the 2000s would have likely still been dry, but not on the same level as the worst of the last millennium’s megadroughts.

Lake Powell shows persistent effects from drought (Source: U.S. Bureau of Reclamation)

The recent trend of warming and reduced soil moisture in the SW is becoming extremely problematic from a water systems planning and management perspective for the Colorado River Basin. It has becoming rapidly clear that the river is completely over-allocated and won’t be able to sustain flow requirements as dictated by the Colorado Compact. Thus, there has been an increasing focus in understanding how susceptible water systems in this region are to plausible future streamflow scenarios, particularly those that may contain persistent droughts. In this tutorial, we’ll discuss how to create these scenarios using a Hidden Markov Model (HMM)- based synthetic generator.

Observed Data

First let’s take a look at the observed data from 1909-2013 for the outlet gauge of the Upper Colorado River that is located at the CO-UT stateline. Below we create a plot of the annual streamflow. Let’s also add an 11-year rolling mean which will give us a sense of long-term averaged behavior.

The Colorado Compact prescribing flows between the Upper and Lower Colorado Basins was negotiated using data prior to 1922, which we now know was one of the consistently wetter periods in the record. It’s clear today that since the 1980s, the Southwest has been experiencing imminent aridification and that this observed record alone isn’t an accurate representation of what future climate might look like in this region.

Let’s get a little more specific and formally quantify decadal drought risk in the observed period. We use a metric proposed in Ault et al. (2014). The authors define a decadal drought as a streamflow value that is more than a half a standard deviation below the 11-year running mean of the available record.

By this metric, the Upper Colorado Basin region has experienced two decadal droughts in the past 100 years.

Synthetic Stationary Generator to Understand Natural Variability

It is important to remember that the historical record of the region is only one instance of streamflow, which ultimately comes from an inherently stochastic and chaotic system (the atmosphere). We require a tool that will allow us to see multiple plausible realizations of that same variability. The tool that we use to develop synthetic flows for the region is a Gaussian Hidden Markov Model. If a system follows a Markov process, it switches between a number of “hidden states” dictated by a transition matrix. Each state has its own Gaussian probability distribution (defined by a mean and standard deviation) and one can draw from this distribution to create synthetic flows that fit the properties of the historical distribution. HMMs are an attractive choice for this region because they can simulate long persistence, particularly long droughts, which is an important consideration for water systems vulnerabilities. The figure below shows an example of a 2-state Gaussian HMM that we will fit for the region.

At this point, I will direct readers to either Julie’s blog or my own Jupyter notebook, which have all the details and code required to fit the model and investigate the parameters. Now let’s create a sample 105-year synthetic realization and see what the drought dynamics look like in the synthetic scenario that we created.

Wow, it looks like we’ve created something like a megadrought type situation by just sampling within the bounds of the historical distribution! You can keep sampling from the model to create more 105-year traces and note how the location and number of decadal droughts changes. Maybe you will experience two like in the the historical period, or fewer or more. Re-sampling from the model will demonstrate how different 105-year traces can look just within the range of natural variability and that what we’ve observed historically is only one trace. It’s also important to remember that when droughts occur can also define the ultimate effect of the drought (i.e. is it at a time when there is a large population growth or a time when humans can adapt by conserving or building more infrastructure?). A hydrologic drought need not manifest into an agricultural drought of the same magnitude if stored surface water is available, for example.

Non-Stationary Synthetic Generator to Impose Climate Changes

Now, we want to be able to create flows under non-stationary conditions to get a better understanding of what flows can look like under climate changes. In order to create flows under non-stationary conditions, we can toggle the parameters of the HMM model in order to create systematic changes to the model that can represent a changing climate. The HMM has 6 parameters that define it. In the historical distribution, we fit a baseline value for these parameters. In this non-stationary generator, we define a range to sample these parameters from.

ParameterCurrent ValueLower BoundUpper Bound
Log-Space Wet State Mean Multiplier1.000.981.02
Log-Space Dry State Mean Multiplier1.000.981.02
Log-Space Wet State Standard Deviation Multiplier1.000.751.25
Log-Space Dry State Standard Deviation Multiplier1.000.751.25
Change in Dry-Dry Transition Probability0.00-0.30+0.30
Change in Wet-Wet Transition Probability0.00-0.30+0.30

Now refer to the Jupyter notebook for the code for the following steps. We first sample from these parameter ranges 1000 times using the Latin Hypercube Sample functionality within SALib. We then swap out the baseline parameters with the newly sampled parameters and sample from the model. Below is an example trace from the new non-stationary model in which we are generating more instances of decadal drought.

Placing Non-Stationary Flows in the Context of CMIP5 Projections

By creating a non-stationary generator, we can broaden the drought conditions that we are creating which that can be very useful to understand how our water systems model performs under potentially extreme scenarios. However, it’s useful to place our synthetically generated flows in the context of physically-driven CMIP5 projections to get a better understanding of how the two approaches compare. This example makes use of 97 CMIP5 projections used in the Colorado River Water Availability Study (CWCB, 2012). In each of these projections, monthly precipitation factor changes and temperature delta changes were computed between mean projected 2035–2065 climate statistics and mean historical climate statistics from 1950–2013. These 97 different combinations of 12 monthly precipitation multipliers and 12 monthly temperature delta shifts were applied to historical precipitation and temperature time series from 1950–2013. The resulting climate time series were run through a Variable Infiltration Capacity (VIC) model of the UCRB, resulting in 97 time series of projected future streamflows at the Colorado‐Utah state line.

We can fit an HMM to each trace of projected streamflow and retain the fitted parameters. Then we take the ratio between these parameters and the baseline HMM parameters in order to ultimately have a set of multipliers associated with each CMIP5 projection. This allows us to place the CMIP5 projections into the same space that we are sampling for our synthetic realizations.

In order to visualize these two ensembles in the same space, we plot a response surface that allows us to map how combinations of HMM parameters tend to lead to increased decadal drought occurrence. In order to get a continuous surface, we’ll fit a non-linear regression that ultimately maps the parameter values to decadal drought occurrence over a set of grid points. This response surface is shown by the color gradient below.

We choose two parameters, the dry-dry transition probability and the dry mean, that would most likely influence the decadal drought occurrence and create the response surface (note that a more formal sensitivity analysis can be used to identify the top most influential parameters as well). From the response surface, you can see that we’re more likely to see more instances of decadal droughts when we (1) increase the dry-dry transition probability, which inherently will increase persistence of the dry state, and (2) when we make the dry state log mean drier. We also place those multipliers from the CMIP5 projections into the space as white dots. Note that these CMIP5 projections tend to span the extent of the dry mean sample space, but are less representative of the dry transition probability sample space, particularly the increase in the dry-dry transition probability which leads to more persistent droughts. This suggests that the types of hydrological droughts represented in the projections tend to only be wetter to slightly drier than our baseline.

Both methods of producing these scenarios are valid, though if your goal is to produce a variety of ensembles that are characterized by many different drought characteristics, you will likely find that a generator approach will serve this purpose better.

Concluding Remarks

If you have any questions about the notebook or feedback as you go through the tutorial, please post below in the comments section or email me. We love getting as much feedback as possible before publication in the eBook.


Ault, T. R., Cole, J. E., Overpeck, J. T., Pederson, G. T., & Meko, D. M. (2014). Assessing the risk of persistent drought using climate model simulations and paleoclimate data. Journal of Climate27(20), 7529-7549.

CWCB (2012).Colorado River Water Availability Study Phase I Report. Colorado Water Conservation Board

Williams, A. P., Cook, B. I., & Smerdon, J. E. (2022). Rapid intensification of the emerging southwestern North American megadrought in 2020–2021. Nature Climate Change12(3), 232-234.

Cythonizing your Python code, Part 2: Next steps with reservoir simulation

Cython is an extension of the Python programming language that allows you to add C-like functionality, such as static typing and compilation. This can lead to significant efficiency gains over pure Python. In my last blog post, I introduced Cython’s motivation and benefits using a simple Fibonacci sequence example. Cythonizing this program led to a 20-30x speedup on my laptop, for very little work!

However, you may be wondering about the feasibility and impact of Cython for more complex programs, such as simulation models used in research. In this blog post, I will demonstrate how Cython can be applied to larger, object-oriented programs that use functions, classes, and NumPy arrays, using a reservoir simulation model as an example. Readers unfamiliar with Cython are encouraged to read the first post in this series prior to proceeding.

Reservoir simulation model

All code used in this blog post can be found in this GitHub repository, in the reservoir_sim directory. This contains nine sub-directories with different versions of the model. I will describe the differences between the versions later in this post, but first I will briefly introduce the major components of the simulation model using the py_fast version. The model simulates a two-reservoir system, one of which is downstream from the other. Both have their own inflows, which are sinusoidal plus noise. Additionally, both reservoirs have their own demands and minimum flow thresholds, which are also sinusoidal plus noise. Each time step, the upper reservoir attempts to release the minimum flow requirement if there is enough water, then tries to satisfy the demand if possible. The lower reservoir then repeats the same process, with the upper reservoir’s releases added to its own catchment inflows. The simulation is run over a number of years, and the stochastic inputs (inflows, upstream releases, minimum flow requirements, demand) and outputs (releases, deliveries, storages) are stored as time series.

The directory py_fast contains three Python files, representing the Demand, Reservoir, and Model classes. The simplest class, Demand, is initialized with inputs that determine the parameterization of the random demand process, and a function that can be called each day to randomly sample from this process:

### Demand class - pure Python
from random import gauss
from math import pi, sin

class Demand():
  def __init__(self, name, demand_params): = name
    # amplitude, phase, shift, & noise std for sine wave of demands
    self.demand_amp, self.demand_phase, self.demand_shift, self.demand_noise = demand_params
    # save 2pi/365 to save conversion time
    self.days_to_radians = 2. * pi / 365.

  ### demand function
  def demand_t(self, t):
    noise = gauss(0., self.demand_noise)
    demand_t = self.demand_amp * sin((t - self.demand_phase)  * self.days_to_radians) + self.demand_shift + noise
    return demand_t

The Reservoir class is similar, with parameter initialization and its own stochastic inflow and minimum flow sampling methods. The initialization also instantiates a Demand object belonging to the Reservoir object.

### Reservoir class - pure Python
import numpy as np
from Demand import Demand

class Reservoir():
  def __init__(self, name, inflow_params, min_flow_params, demand_params, storage_params): = name
    # amplitude, phase, shift, & noise std for sine wave of inflows
    self.inflow_amp, self.inflow_phase, self.inflow_shift, self.inflow_noise = inflow_params
    # amplitude, phase, and shift for sine wave of minimum flows
    self.min_flow_amp, self.min_flow_phase, self.min_flow_shift = min_flow_params
    # reservoir capacity and initial storage
    self.capacity, = storage_params
    # set up demand object
    self.demand = Demand(name, demand_params)

  ### inflow function
  def inflow_t(self, t):
    noise = np.random.normal(0., self.inflow_noise, 1)[0]
    inflow_t = self.inflow_amp * np.sin((t - self.inflow_phase)  * 2. * np.pi / 365.) + self.inflow_shift + noise
    return inflow_t

  ### min flow function
  def min_flow_t(self, t):
    min_flow_t = self.min_flow_amp * np.sin((t - self.min_flow_phase)  * 2. * np.pi / 365.) + self.min_flow_shift
    return min_flow_t

The Reservoir class also has a step method, which dictates the release, delivery, and updated storage based on the availability of water.

  ### step reservoir another day
  def step(self, t, upstream_release):
    inflow = self.inflow_t(t)
    min_flow = self.min_flow_t(t)
    demand = self.demand.demand_t(t)
    # first assume release & delivery meet min_flow and demand exactly
    release = min_flow
    delivery = demand += inflow + upstream_release - release - demand

    # check if storage overflow
    if > self.capacity:
      release += - self.capacity = self.capacity

    # check if storage went negative. if so, curtail demand first, then env flows
    elif < 0:
      if delivery > (
        delivery += = 0.
      else: += delivery
        delivery = 0
        if release > (
          release +=
 = 0
          print('This should not be happening')
    return (inflow, upstream_release, min_flow, demand, release, delivery,

Finally, the Model class is used to set up and run the simulation. Upon initialization, the Model object first creates the upper and lower Reservoir objects, and creates a dictionary of lists to hold the results.

### Model class for running simulation
import matplotlib.pyplot as plt
import random
import numpy as np
from Reservoir import Reservoir

class Model():
  def __init__(self, years, plot, seed):
    # length of simulation (no leap years)
    self.years = years
    self.days = 365 * self.years
    # boolean for whether to plot
    self.plot = plot
    # random seed

    ### set up upper reservoir
    # inflow params in terms of sinusoid, (amplitude, phase, shift, noise). units = AF/day
    inflow_params = (300., 0., 500., 20.)
    # min flow params in terms of sinusoid, (amplitude, phase, shift). units = AF/day
    min_flow_params = (200., 122., 300.)
    # demand params in terms of sinusoid, (amplitude, phase, shift, noise). units = AF/day
    demand_params = (300., 163., 600., 30.)
    # storage params (capacity, starting storage). units = AF
    storage_params = (500000., 250000.)
    # initialize upper reservoir
    self.reservoir_upper = Reservoir('upper', inflow_params, min_flow_params, demand_params, storage_params)

    ### set up lower reservoir
    # inflow params in terms of sinusoid, (amplitude, phase, shift, noise). units = AF/day
    inflow_params = (200., 30., 600., 100.)
    # min flow params in terms of sinusoid, (amplitude, phase, shift). units = AF/day
    min_flow_params = (100., 91., 400.)
    # demand params in terms of sinusoid, (amplitude, phase, shift, noise). units = AF/day
    demand_params = (100., 152., 300., 10.)
    # storage params (capacity, starting storage). units = AF
    storage_params = (20000., 15000.)
    # initialize lower reservoir
    self.reservoir_lower = Reservoir('lower', inflow_params, min_flow_params, demand_params, storage_params)

    ### set up data storage
    self.reservoir_list = [self.reservoir_upper, self.reservoir_lower]
    self.output = {}
    for reservoir in self.reservoir_list:
      self.output[] = {}
      self.output[]['inflow'] = []
      self.output[]['upstream_release'] = []
      self.output[]['min_flow'] = []
      self.output[]['demand'] = []
      self.output[]['release'] = []
      self.output[]['delivery'] = []
      self.output[]['storage'] = []

The function performs the simulation loop.

  def run(self):
    for t in range(self.days):
      # step upper
      (inflow, upstream_release, min_flow, demand, release, delivery, storage) = self.reservoir_upper.step(t, 0.)

      (inflow, upstream_release, min_flow, demand, release, delivery, storage) = self.reservoir_lower.step(t, release)

    # return total storage last time step to make sure versions are equivalent
    return self.output['upper']['storage'][-1] + self.output['lower']['storage'][-1]

We can run the model and plot the results as follows (plot function not shown for brevity, see repository).

from Model import Model

# initialize model
model = Model(years = 5, plot = True, seed = 101)

# run simulation
tot_storage =

# plot results
Figure 1: Five-year realization of simulated inflows with and without upstream release (column 1), demand vs. deliveries (column 2), minimum flow requirement vs. release (column 3), and storage (column 4) for the upper (row 1) and lower (row 2) reservoirs.

Cythonizing a class

Now that we have our model in place, we can begin to add static type information and Cythonize it. I will focus now on the cy version of the model, which builds on the py_fast model. First consider Demand.pyx.

### Demand class - cython
from random import gauss
from math import pi, sin

cdef class Demand():
  def __init__(self, name, demand_params): = name
    # amplitude, phase, shift, & noise std for sine wave of demands
    self.demand_amp, self.demand_phase, self.demand_shift, self.demand_noise = demand_params
    # save 2pi/365 as double to save conversion time
    self.days_to_radians = 2. * pi / 365.

  ### demand function
  cdef double demand_t(self, double t):
    cdef double noise, demand_t, sin_t
    noise = gauss(0., self.demand_noise)
    sin_t = sin((t - self.demand_phase) * self.days_to_radians)
    demand_t = self.demand_amp * sin_t + self.demand_shift + noise
    return demand_t

This is similar to in the Python version, other than a few main differences. Cython classes are defined with “cdef” rather than “def”. Within a class, we can also define its methods using the cdef keyword. Similar to the Fibonacci sequence example in my earlier blog post, we provide type information for the return value and arguments in the function definition for demand_t(). Lastly, we can provide type information for any variables used within the function, such as noise, again using the cdef signifier.

It is not necessary to provide this information for every variable and function. For example, the __init__ function in general cannot be defined with cdef. However, the more type information you can provide, the more Cython will be able to speed up your code. Inspecting the annotated html that Cython outputs (see the first post in this series), as well as code profiling (see this blog post by Antonia Hadjimichael), can help to figure out which variables and methods are the most important bottlenecks to speed up, and which have diminishing marginal returns and can be left as untyped Python variables/methods.

You may have noticed that I did not provide static type information for class attributes (e.g.,, self.demand_amp). These variables are defined and typed in a separate file, Demand.pxd.

cdef class Demand():

    public str name

    public double demand_amp, demand_phase, demand_shift, demand_noise, days_to_radians

  cdef double demand_t(self, double t)

A .pxd “definition file” is the Cython equivalent of a header file in C; it defines attributes and methods of the class and gives type information. Any class that has a definition file will create a C extension type. Variables and methods in the definition file can only be accessed by other Cython scripts, rather than pure Python code. Cython also has a third kind of function definition, “cpdef”, which can be called from either Cython or Python. This can be useful if your application needs to be interoperable with Cython and Python, but will not perform as efficiently as cdef methods.

One important thing to notice is that each variable belonging to the class should be preceded by “public” if we want the attribute to be accessible to other objects. Also note that only cdef-defined functions should be included here, rather than pure Python functions.

The Reservoir class is Cythonized in a similar way, and the interested reader is referred to the GitHub repository to see the changes in detail. Moving to the Model class, there are several important differences to note in the .pxd file.

from Reservoir cimport Reservoir

cdef class Model():

    public int years, days

    public double tot_storage

    public bint plot

    public list reservoir_list

    public dict output

    public Reservoir reservoir_upper, reservoir_lower
  cdef double run(self) except *

First, notice that the Reservoir class is imported with the cimport command (“Cython import”) rather than import, since it is now an extension type. This should be done in both the .pyx and .pxd files. Second, I have included “except *” after the parentheses in the definition of the run() method. This instructs our compiled C code to pass exceptions through to the Python runtime. Otherwise, the program will try to treat errors as warnings and keep running, which makes it challenging to debug and verify behavior. This should be included in both the .pyx and .pxd files. Third, bint is the C equivalent of the bool type in Python.

The last important change in the Cython version is the inclusion of a new file, main.pyx. As noted previously, Cython extension types can only be imported using cimport from Cython, rather than pure Python code. For this reason, we can no longer directly import and run the Model class as we did in the pure Python implementation above. Instead, main.pyx is used as a go-between. This script is Cythonized (note the .pyx ending), but is not an extension type since it does not include a definition file (.pxd) with cythonized methods or attributes. Thus the file is allowed to import extension types (since it is a Cython file), while also remaining importable itself into pure Python (since it is not an extension type).

from Model cimport Model

cdef class main():
  cdef public Model model

  def __init__(self, int years, bint plot, int seed):
    # initialize model
    self.model = Model(years = years, plot = plot, seed = seed)

    # run simulation
    self.model.tot_storage =

    # plot results
    if self.model.plot:

The main class can now be imported in pure Python to run the simulation, in much the same way as we imported and ran the Model class before.

from main import main

# run model
main_obj = main(years = 5, plot = True, seed = 101)

# verify that final storage is the same

For both py_fast and cy versions, we can verify that the final storage at the end of five years is 12,066. However, as we shall see later in this post, the Cythonized version completes the procedure more quickly.

NumPy and Memory Views

Most scientific programs rely heavily on NumPy arrays for numeric computation. For example, the py_numpy version of the Model class stores time series outputs in a NumPy array rather than nested dictionaries and lists. This output array is initialized in the class constructor, and then populated with outputs as the model is run (compare this to the original Python version above).

class Model():
  def __init__(self, years, plot, seed):    

    self.output = np.empty((num_reservoirs * self.num_step_outputs, self.days + 1))
    for i in range(len(reservoir_list)):
      self.output[i * self.num_step_outputs + self.num_step_outputs - 1, 0] = reservoir_list[i].storage

  def run(self):
    for t in range(1, self.days + 1):
      t_run = t - 1

      ### step upper
      self.output[:7, t] = self.reservoir_upper.step(t_run, 0.)

      ### step lower, using upper release as inflow
      self.output[7:, t] = self.reservoir_lower.step(t_run, self.output[4, t])

    # return total storage last time step to make sure versions are equivalent
    return self.output[6, -1] + self.output[-1, -1]

Because a NumPy array is not a native Python type (e.g., float, list), it cannot be used as a static type for Cython (i.e., we can’t simply define an array arr with “cdef np.array arr“). However, we can use “memoryviews,” which are a more general form of typed memory buffer than NumPy (and are, in fact, what NumPy is based on). The nuances of memoryviews can be a bit challenging, and a full explanation is beyond the scope of this post. But I will give a simple example usage for Cython, and the interested reader is referred to the docs here and here for more information.

Memoryviews are created by designating the type of variable to go inside the array (e.g., double), followed by square brackets with colons and commas dictating the number of dimensions. For example, in the cython_numpy version of Model.pxd, we define the 2-dimensional memory view to store results as follows:

cdef class Model():


    public double[:,:] output

Now in the model constructor, we initialize the memory view as follows:

    output_np = np.empty((num_reservoirs * self.num_step_outputs, self.days + 1))
    self.output = output_np

    for i in range(len(reservoir_list)):
      self.output[i * self.num_step_outputs + self.num_step_outputs - 1, 0] = reservoir_list[i].storage

The only difference from the Pythonic implementation is that we have to split the creation of the memoryview into two steps; first we create a NumPy array (output_np), then we set the memoryview (np.output) to correspond to the block of memory holding output_np. Once we have created the memoryview, we can access and reassign values using standard NumPy syntax. For example, here is the part of the function that steps the upper reservoir.

      (inflow, upstream_release, min_flow, demand, release, delivery, storage) = self.reservoir_upper.step(t_run, 0.)
      self.output[0, t] = inflow
      self.output[1, t] = upstream_release
      self.output[2, t] = min_flow
      self.output[3, t] = demand
      self.output[4, t] = release
      self.output[5, t] = delivery
      self.output[6, t] = storage

Again, this has to be split into two steps; first call the reservoir’s step function and receive the output as a tuple of variables, and then reset each value in the memoryview using this data. We cannot simply set the entire slice self.output[:7, t] directly using the tuple returned by the function, like we did in python_numpy, because Cython is expecting an array rather than a tuple. It is possible to convert the tuple into a NumPy array and then reassign the slice using this NumPy array, but this was found to be less efficient when I timed different versions. Alternatively, we can alter the definition of Reservoir.step() to return a memoryview rather than a tuple, which allows the Model class to reassign values in the memoryview directly from the function call. However, this too was found to be less efficient, as we will see in the next section. The interested reader is referred to the cy_numpy_reservoir version of the model for implementation details.

Testing the performance of different versions

Now that I have described how to Cythonize your classes and NumPy arrays, you may be wondering, “Ok, but was it worth it? How much does this speed up the code?” We can answer this question with a timing experiment. The reservoir_sim directory in the GitHub repository contains nine versions of the code, summarized as follows:

  1. py_slow: My first attempt at creating the example reservoir simulation model in Python, which contains several bottlenecks.
  2. py_fast: An improved Python-only model after eliminating bottlenecks, used as a baseline for improvement.
  3. py_numpy: Python-only version that uses a NumPy array to store the time series of results.
  4. py_numpy_reservoir: Python-only version that uses NumPy arrays to store results inside the daily time step as well.
  5. py_cy: A copy of py_fast, which is then Cythonized without explicitly adding static type information.
  6. cy: Updates to py_fast to add static type information.
  7. cy_numpy: Similar to py_numpy, but Cythonized with static types and memory views.
  8. cy_numpy_reservoir: Similar to py_numpy_reservoir, but Cythonized with static types and memory views.
  9. cy_numpy_noCheck: Copy of cy_numpy, with the additional step of disabling Python’s automatic bounds checking and index wrapping.

This directory also includes two scripts: will first Cythonize all the Cython versions of the model, and then run a timing experiment with all versions, using the script The experiment will output the relative efficiency of each version, averaged over 4 trials of 20 simulations, each of which is 50 years long.

Starting timings

py_slow: answer = 10494.223405436973, time = 1.1308925539487973, efficiency  = 1.0

py_fast: answer = 9851.320332349133, time = 0.26572948903776705, efficiency relative to py_slow = 4.25580374253483

py_numpy: answer = 9851.320332349133, time = 0.3013913669856265, efficiency relative to py_fast = 0.8816758479032341

py_numpy_reservoir: answer = 9851.320332349133, time = 0.38436312798876315, efficiency relative to py_fast = 0.6913501053762255

py_cy: answer = 9851.320332349133, time = 0.23376567207742482, efficiency relative to py_fast = 1.1367344344286598

cy: answer = 9851.320332349133, time = 0.13966371200513095, efficiency relative to py_fast = 1.9026380240273488

cy_numpy: answer = 9851.320332349133, time = 0.09388123091775924, efficiency relative to py_fast = 2.8304857791068843

cy_numpy_reservoir: answer = 9851.320332349133, time = 0.22499373101163656, efficiency relative to py_fast = 1.181052857975068

cy_numpy_noCheck: answer = 9851.320332349133, time = 0.09497074002865702, efficiency relative to py_fast = 2.7980143037485474

Here are the major takeaways from comparing the speedups across versions:

  1. Make sure your Python code is optimized prior to using Cython. py_slow was 4.3x more efficient than py_fast. The only major difference between these is that in the former, I used the numpy.random module to generate stochastic realizations each day, whereas in the latter I used the basic random module. This small change led to significant gains due to avoiding the extra overhead of the NumPy generator (this is also why these versions got different answers, due to different random number generators). The point here is that major gains can often be made with very little work by finding and eliminating bottlenecks. As mentioned above, code profiling can be extremely helpful to make sure that you have eliminated major bottlenecks prior to implementing larger changes in Cython. Note that py_fast is used as the baseline comparison for all other versions, rather than py_slow. (For any of you thinking I should have just used NumPy to pre-generate an entire array of random numbers rather than generating them one-by-one each day – yes you are right, that would have probably been faster. But I set it up like this to provide a more challenging example of an object-oriented model with function calls between nested objects).
  2. NumPy arrays are not always faster. Like many scientists/engineers, I tend to rely on NumPy a lot, both due to their convenience and due to the fact that they are often praised for speed (NumPy is largely written in C, after all). However, in some cases this can actually degrade performance due to the larger overhead of creating NumPy arrays relative to lists. The py_numpy version that uses NumPy to hold the results was actually 12% less efficient than the baseline version using nested dictionaries and lists. Moreover, py_numpy_reservoir, which uses a new NumPy array to store results for each reservoir each day, is 31% slower than baseline. This is not to say NumPy arrays can’t be beneficial – they are extremely fast at vectorized numerical operations. But for situations with smaller arrays and/or minimal opportunity for vectorized computation, they may not be worth the overhead.
  3. Cythonizing Python code can improve performance. The py_cy version, which does not include static type information and is identical to py_fast except for the Cythonization, leads to a 20% efficiency gain. This is not huge, but could be meaningful and requires little extra work. Meanwhile, the cy version with statically typed Cython extension types has a larger efficiency gain of 90%.
  4. Cython and NumPy can be synergistic if used properly. The cy_numpy version further improves performance, with an efficiency of 2.8 relative to py_fast. Interestingly, this version is identical to py_numpy other than the Cython-related changes, but is found to lead to significant gains whereas the former leads to degraded performance. Cython and memoryviews allow us to interface more directly with NumPy’s underlying C code, reducing the Python-related overhead. However, this is not universally true – cy_numpy_reservoir is found to degrade performance significantly relative to the other Cythonized versions, suggesting that the creation of NumPy arrays within the inner daily step routine is not worth the overhead.
  5. Disabling certain Python features may not be worthwhile. The last version, cy_numpy_noCheck, builds on cy_numpy by adding two compiler directives to @boundscheck(False) and @wraparound(False). The first tells Cython to disable Python’s automatic bounds checking for lists, tuples, etc., while the second disables the ability to count backwards when indexing (e.g., arr[-3:]). These directives are meant to allow for further efficiency gains by reducing the Pythonic overhead. However, in this particular case, it isn’t found to have a meaningful impact. Since these directives can lead to mysterious bugs for users accustomed to Python syntax, they should be used with caution, and only after verifying that they actually create efficiency gains that are worth the risk.

Overall, Cython is found to improve the efficiency of this reservoir simulation model by 2-3x on my laptop. Unfortunately, this is much lower than the 20-30x speedups observed for the Fibonacci sequence example in the last blog post. This is due to the more complex operations and object-oriented structure of the simulation model, which leads to more interactions with the Python interpreter compared to the tight numerical loop of the Fibonacci generator. However, a 2-3x efficiency gain can still be very meaningful, especially in situations involving large ensembles of expensive simulations in parallel computing environments. In general, the potential gain from Cython, and whether or not this is worth the trouble, will depend on factors such as the structure of the model, its runtime, and the skill level of model developers and users.

Further reading

If this blog post has piqued your interest and you want to apply Cython to your own research, you might try looking through more advanced water resources simulation models that make use of Cython, such as Pywr or the California Food-Energy-Water System (CALFEWS).

For more general information, of course see the Cython docs. I also highly recommend Kurt W. Smith’s O’Reilly Media book, “Cython: A Guide for Python Programmers” for those wanting a deeper understanding. It also looks like the author also has a tutorial, but I haven’t tried it yet. Lastly, for information on how to improve the efficiency of your Python code more generally (prior to or in lieu of using Cython), I recommend Micha Gorelick and Ian Ozsvald’s O’Reilly Media book, “High Performance Python: Practical Performant Programming for Humans.”

Constructing interactive Ipywidgets: demonstration using the HYMOD model

Constructing interactive Ipywidgets: demonstration using the HYMOD model

Last week, Lillian and I published the first post in a series of training post studying the “Fisheries Game, which is a decision making problem within a complex, non-linear, and uncertain ecological context.

In preparing for that post, I learned about the Ipywidgets python library for widget construction. It stood out to me as a tool for highlighting the influence of parametric uncertainty on model performance. More broadly, I think it has great as an educational or data-narrative device.

This blogpost is designed to highlight this potential, and provide a basic introduction to the library. A tutorial demonstration of how an interactive widget is constructed is provided, this time using the HYMOD rainfall-runoff model.

This post is intended to be viewed through a Jupyter Notebook for interaction, which can be accessed through a Binder at this link!

The Binder was built with an internal environment specification, so it should not be necessary to install any packages on your local machine! Because of this, it may take a minute to load the page.

Alternatively, you can pull the source code and run the Jupyter Notebook from your local machine. All of the source code is available in a GitHub repository: Ipywidget_Demo_Interactive_HYMOD.

If using your local machine, you will first need to install the Ipywidget library:

pip install ipywidgets

Let’s begin!

HYMOD Introduction

HYMOD is a conceptual rainfall-runoff model. Given some observed precipitation and evaporation, a parameterized HYMOD model simulates the resulting down-basin runoff.

This post does not focus on specific properties or performance of the HYMOD model, but rather uses the model as a demonstration of the utility of the Ipywidget library.

I chose to use the HYMOD model for this, because the HYMOD model is commonly taught in introductory hydrologic modeling courses. This demonstration shows how an Ipywidget can be used in an educational context. The resulting widget can allow students to interact in real-time with the model behavior, by adjusting parameter values and visualizing the changes in the resulted streamflow.

If you are interested in the technical details of implementing the HYMOD model, you can dig into the source code, available (and throughly commented/descriptive) in the repository for this post: Ipywidget_Demo_Interactive_HYMOD.

HYMOD represents surface flow as a series of several quick-flow reservoirs. Groundwater flow is represented as a single slow-flow reservoir. The reservoirs have constant flow rates, with the quick-flow reservoir rate, Kq, being greater than the slow-flow reservoir rate, Ks.

Image source: Sun, Wenchao & Ishidaira, Hiroshi & Bastola, Satish. (2010)

HYMOD Parameters:

Like any hydrologic model, the performance of HYMOD will be dependent upon the specified parameter values. There are several parameters that can be adjusted:

  • Cmax: Max soil moisture storage (mm) [10-2000]
  • B: Distribution of soil stores [0.0 – 7.0]
  • Alpha: Division between quick/slow routing [0.0 – 1.0]
  • Kq: Quick flow reservoir rate constant (day^-1) [0.15 – 1.0]
  • Ks: Slow flow reservoir rate constant. (day^-1) [0.0 – 0.15]
  • N: The number of quick-flow reservoirs.

Interactive widget demonstration

I’ve constructed an Ipywidets object which allows a user to visualize the impact of the HYMOD model parameters on the resulting simulation timeseries. The user also has the option to select from three different error metrics, which display in the plot, and toggle the observed timeseries plot on and off.

Later in this post, I will give detail on how the widget was created.

Before provided the detail, I want to show the widget in action so that you know the expectation for the final product.

The gif below shows the widget in-use:

Demonstration of the HYMOD widget.

Ipywidgets Introduction

The Ipywdiget library allows for highly customized widgets, like the one above. As with any new tool, I’d recommend you check out the documentation here.

Below, I walk through the process of generating the widget shown above.

Lets begin!

Import the library

# Import the library
import ipywidgets as widgets

Basic widget components

Consider an Ipywidget as being an arrangement of modular components.

The tutorial walks through the construction of five key widget components:

  1. Variable slider
  2. Drop-down selectors
  3. Toggle buttons
  4. Label objects
  5. Interactive outputs (used to connect the plot to the other three components)

In the last section, I show how all of these components can be arranged together to construct the unified widget.


Sliders are one of the most common ipywidet tools. They allow for manual manipulation of a variable value. The slider is an object that can be passed to the interactive widget (more on this further down).

For my HYMOD widget, I would like to be able to manipulate each of the model parameters listed above. I begin by constructing a slider object for each of the variables.

Here is an example, for the C_max variable:

# Construct the slider
Cmax_slider = widgets.FloatSlider(value = 500, min = 10, max = 2000, step = 1.0, description = "C_max",
                                  disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')

# Display the slider

Notice that each slider recieves a specified minmax, and step corresponding to the possible values. For the HYMOD demo, I am using the parameter ranges specified in Herman, J.D., P.M. Reed, and T. Wagener (2013), Time-varying sensitivity analysis clarifies the effects of watershed model formulation on model behavior.

I will construct the sliders for the remaining parameters below. Notice that I don’t assign the description parameter in any of these sliders… this is intentional. Later in this tutorial I will show how to arrange the sliders with Label() objects for a cleaner widget design.

# Construct remaining sliders
Cmax_slider = widgets.FloatSlider(value = 100, min = 10, max = 2000, step = 1.0, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
B_slider = widgets.FloatSlider(value = 2.0, min = 0.0, max = 7.0, step = 0.1, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
Alpha_slider = widgets.FloatSlider(value = 0.30, min = 0.00, max = 1.00, step = 0.01, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
Kq_slider = widgets.FloatSlider(value = 0.33, min = 0.15, max = 1.00, step = 0.01, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
Ks_slider = widgets.FloatSlider(value = 0.07, min = 0.00, max = 0.15, step = 0.01, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
N_slider = widgets.IntSlider(value = 3, min = 2, max = 7, disabled = False, continuous_update = False, readout = True)

# Place all sliders in a list
list_of_sliders = [Kq_slider, Ks_slider, Cmax_slider, B_slider, Alpha_slider, N_slider]

The Dropdown() allows the user to select from a set of discrete variable options. Here, I want to give the user options on which error metric to use when comparing simulated and observed timeseries.

I provide three options:

  1. RMSE: Root mean square error
  2. NSE: Nash Sutcliffe efficiency
  3. ROCE: Runoff coefficient error

See the calculate_error_by_type inside the script to see how these are calculated.

To provide this functionality, I define the Dropdown() object, as below, with a list of options and the initial value:

# Construct the drop-down to select from different error metrics
drop_down = widgets.Dropdown(options=['RMSE','NSE','ROCE'], description='',
                                value = 'RMSE', disabled = False)

# Display the drop-down


The ToggleButton() allows for a bool variable to be toggled between True and False. For my streamflow plot function, I have an option plot_observed = False which determines if the observed streamflow timeseries is shown in the figure.

# Construct the button to toggle observed data On/Off
plot_button = widgets.ToggleButton(value = False, description = 'Toggle', disabled=False, button_style='', tooltip='Description')

# Display the button


As mentioned above, I choose to not include the description argument within the slider, drop-down, or toggle objects. This is because it is common for these labels to get cut-off when displaying the widget object.

For example, take a look at this slider below, with a long description argument:

# Make a slider with a long label
long_title_slider = widgets.FloatSlider(value = 2.0, min = 0.0, max = 7.0, step = 0.1, description = 'This slider has a long label!', readout = True)

# Display: Notice how the label is cut-off!

The ipywidgets.Label() function provides a way of avoiding this while allowing for long descriptions. Using Label() will ultimately provide you with a lot more control over your widget layout (last section of the tutorial).

The Label() function generates a separate object. Below, I create a unique Label() object for each HYMOD parameter.

# Import the Label() function
from ipywidgets import Label

# Make a list of label strings
param_labs = ['Kq : Quick flow reservoir rate constant (1/day)',
            'Ks : Slow flow reservoir rate constant (1/day)',
            'C_max : Maximum soil moisture storage (mm)',
            'B : Distribution of soil stores',
            'Alpha : Division between quick/slow routing',
            'N : Number of quick-flow reservoirs']

# Make a list of Label() objects
list_of_labels = [Label(i) for i in param_labs]

# Display the first label, for example.


Now that we have constructed interactive

The interactive_output function takes two inputs, the function to interact with, and a dictionary of variable assignments:

interactive_output( function, {‘variable_name’ : variable_widget, …} )

I have created a custome function plot_HYMOD_results which:

  1. Loads 1-year of precipitation and evaporation data for the Leaf River catchment.
  2. Runs the HYMOD simulation using the provided parameter values.
  3. Calculates the error of the simulated vs. observed data.
  4. Plots the timeseries of runoff.

The source code for this function can be found in the GitHub repository for this post, or specifically here.

The function receives parameter values for each of the HYMOD parameters discussed above, a bool indicator if observed data should be plotted, and a specified error metric.

plot_HYMOD_results(C_max, B, Alpha, Ks, Kq, N_reservoirs, plot_observed = False, error_type = ‘RMSE’):

I have already generated widget components corresponding to each of these variables! (If you are on the Jupyter Notebook version of this post, make sure to have Run every cell before this, or else the following code wont work.

I can now use the interactive_output function to link the widget components generated earlier with the function inputs:

# Import the interactive_output function
from ipywidgets import interactive_output

# Import my custom plotting function
from HYMOD_plots import plot_HYMOD_results

result_comparison_plot = interactive_output(plot_HYMOD_results, {'C_max' : Cmax_slider, 'B': B_slider, 'Alpha':Alpha_slider, 
                                                                 'Ks':Ks_slider, 'Kq':Kq_slider,'N_reservoirs':N_slider, 
                                                                 'plot_observed' : plot_button,'error_type': drop_down})

# Show the output
Output generated by the interactive_output().

Displaying the interactive_output reveals only the plot, but does not include any of the widget functionality…

Despite this, the plot is still linked to the widget components generated earlier. If you don’t believe me (and are reading the Jupyter Notebook version of this post), scroll up and click the ToggleButton a few cells up, then come back and look at the plot again.

Using the interactive_output() function, rather than other variations of the interact() functions available, allows for cleaner widgets to be produced, because now the arrangment of the widget components can be entirely customizable.

Keep reading for more detail on this!

Arranging widget components

Rather than using widget features one-at-a-time, Ipywidgets allow for several widgets to be arranged in a unified layout. Think of everything that has been generated previously as being a cell within the a gridded widget; the best part is that each cell is linked with one another.

Once the individual widget features (e.g., sliders, buttons, drop-downs, and output plots) are defined, they can be grouped using the VBox() (vertical box) and HBox() (horizontal box) functions.

I’ve constructed a visual representation of my intended widget layout, shown below. The dashed orange boxes show those components grouped by the HBox() function, and the blue boxes show those grouped by the VBox() function.

Visual representation of the final widget layout.

Before getting started, import some of the basic layout functions:

# Import the various 
from ipywidgets import HBox, VBox, Layout

Before constructing the entire widget, it is good to get familiar with the basic HBox() and VBox() functionality.

Remember the list of sliders and list of labels that we created earlier?

# Stack the list of label objects vertically:

# Try the same thing with the sliders (remove comment #):

In the final widget, I want the column of labels to be located on the left of the column of sliders. HBox() allows for these two columns to be arrange next to one another:

# Putting the columns side-by-side
HBox([VBox(list_of_labels), VBox(list_of_sliders)])

Generating the final widget

Using the basic HBox() and VBox() functions shown above, I arrange all of the widget components I’ve defined previously. I first define each row of the widget using HBox(), and finally stack the rows using VBox().

The script below will complete the arrangement, and call the final widget!

# Define secifications for the widgets: center and wrap 
box_layout = widgets.Layout(display='flex', flex_flow = 'row', align_items ='center', justify_content = 'center')

# Create the rows of the widets
title_row = Label('Select parameter values for the HYMOD model:')
slider_row = HBox([VBox(list_of_labels), VBox(list_of_sliders)], layout = box_layout)
error_menu_row = HBox([Label('Choose error metric:'), drop_down], layout = box_layout)
observed_toggle_row = HBox([Label('Click to show observed flow'), plot_button], layout = box_layout)
plot_row = HBox([result_comparison_plot], layout = box_layout)

# Combine label and slider box (row_one) with plot for the final widget
HYMOD_widget = VBox([title_row, slider_row, plot_row, error_menu_row, observed_toggle_row])

# Call the widget and have fun!

Concluding remarks

If you’ve made it this far, thank you for reading!

I hope that you are able to find some fun/interesting/educational use for the Ipywidget skills learned in this post.

Viewing your Scientific Landscape with VOSviewer

In this post, we’re taking a look at a cool tool for visualizing bibliometric networks called VOSviewer. In our group, we have been thinking more about our scientific landscape (i.e. what fields are we reaching with our work and who are we collaborating with most). VOSviewer provides a great way to interactively engage with this network in maps like this:

(The web app cannot always accommodate high traffic, so if the link appears broken, please check back in a couple hours!)

In this blog post, I’m going to highlight this tool by looking at the reach of the Borg Multi-Objective Evolutionary Algorithm (Borg MOEA) developed by Dave Hadka and Patrick Reed and first published in 2013. I also really like that these conceptual networks can be embedded into a website. Slight caveat: if you use WordPress, you must have a Business account with plugins enabled to embed these maps directly into your website. You’ll see that in order to view the interactive maps, I’ll have links to VosViewer’s online application, but unfortunately you will not be able to interact with the maps directly in this post.

Step 1: Download bibliographic data

The first step to creating these maps is to get our bibliographic network information. I’m using the Dimensions database which has a free web app here. It should be noted that Vosviewer also supports exports from Scopus, Web of Science, Lens, and PubMed. Once I get to the Dimensions web app, I type in “Borg MOEA” to find the original paper and download the citations associated with that paper.

You can choose the Borg paper as shown above and then scroll down and click on “show all” next to “Publication Citations”.

Then we export this by clicking the save/export button at the top of the page.

Make sure to choose “bibliographic mapping” so that the data are compatible with VosViewer. Dimensions will send an email with the link to download a zipped file which contains a .csv file with all the meta data needed to create the network.

Step 2: Create the map in VOSviewer

The user interface for the desktop version of VosViewer is fairly intuitive but also quite powerful. First we choose “Create” on the left  and then specify that we want to create a map based on bibliographic data.

Next, we specify that we want to read our data from a database file. Note that VosViewer also supports RIS files and has an API that you can directly query with. On the right, we choose the Dimensions tab and upload the metadata .csv file.

Now comes the fun part! We can create maps based on various components of the metadata that we downloaded. For example, let’s do the co-authorship analysis. This will show the strength of co-authorship among all the authors that have cited the Borg paper.

Click “Next>” and then fill out some narrowing thresholds. I decrease the number of documents that the author must have from 5 to 1 and then choose to show only connected authors to create a clean map. The result is this:

Here the colors represent the different clusters of authors and the size represents the number of documents that have been published by that author that have cited Borg. If we switch to the “Overlay Visualization” tab, we get more information on the years that the authors have published most together.

Step 3: Clean-Up and Customization

One thing you might have noticed is that there are duplicates of some names. For example, Patrick M. Reed and Patrick Reed. We want to merge these publications so that we only have one bubble to represent Pat. For this, we need to create thesaurus file. In the folder that contains the VOSviewer application, navigate to the “data” folder and look for “thesaurus_authors.txt”. Here we can specify which names to merge.

Now, I can recreate my map, but this time, I supply a thesaurus file when creating the map.

Below, we have our final merged map.

We can create another map that instead shows the journals that the citations come from. Now instead we create a map and choose “Citation” and “Sources”.

So we are seeing strong usage of Borg in journals pertaining to water resources, evolutionary computation, and some miscellaneous IEEE journals.

Note that you can also create maps associated with countries from where the authors’ institutions are located.

We see that the US is the strongest hub, followed by China, Italy, and the UK. We can finally show the institutions that the authors are affiliated with. Here we see lots of publications from Cornell, Penn State, and Politecnico di Milano.

Step 4: Creating a shared link or embedding in a website

Screenshots are not an ideal way to view these maps, so let’s see how we can go about sharing them as interactive applications that others can use. We can go to the right hand side of the application and choose “share” as a google drive link. After going through authentication, you now have a shareable link as follows:





If your website supports it, you can also embed these applications directly into a webpage using the following html code. You would just need to swap out the source location, src, with your own links. When swapping out, be sure to include &simple_ui=true after your link as shown below.

  style="border: 1px solid #ddd; max-width: 1200px; min-height: 500px"

This is just scraping the surface of what VOSviewer is capable of, so definitely take time to explore the website, the examples, and the manual. If you make any cool maps, please link them in the comments below!

Cythonizing your Python code, Part 1: the basics.

Intro to Python and Cython

Python has become one of the most loved and in demand programming languages in fields from data science and artificial intelligence to web and app development. Compared to lower-level programming languages like C/C++, Python puts an emphasis on readability and simplicity (you will often read that “Pythonic” code should be “beautiful”). This helps to accelerate the pace of program development and improve the sharing of ideas. According to Python developer Guido van Rossum,

Typically when you ask a programmer to explain to a lay person what a programming language is, they will say that it is how you tell a computer what to do… In reality, programming languages are how programmers express and communicate ideas — and the audience for those ideas is other programmers, not computers. The reason: the computer can take care of itself, but programmers are always working with other programmers, and poorly communicated ideas can cause expensive flops.

What we commonly call Python is really two different things: the language and the implementation. The Python language defines the syntax with which we write code. The most common implementation is CPython (written in C and Python), which compiles valid Python code into bytecode and then interprets it in real time as the program runs.

One of the reason’s for Python’s popularity is the flexibility afforded by its dynamic typing and memory management. I can define a variable x = 5, divide x by 2 to get 2.5, and then later set x = 'Florida'. In this sequence, x will have been typed as an integer, a float, and then a string. Python can do this for us automatically with no problem; under the hood it is dynamically allocating/deallocating memory, checking for type compatibility with mathematical operations, etc.

However, this flexibility has a cost, as it requires the interpreter to run a large number of tasks and checks behind the scenes at run time. Lower-level compiled languages like C require the user to pre-specify the type for each variable, how to allocate and deallocate memory, etc. This structure makes them less flexible and more difficult to use, but potentially much faster. In many applications Python is significantly slower (e.g., 10-100x) than typed, compiled languages such as C/C++.

Cython is an attempt to bridge the gap by bringing some of C’s qualities to Python. It is technically a separate programming language which is a superset of the Python language; Cython code mostly looks like Python code, but it also adds optional C-inspired syntax. Most notably, it allows us to declare static types when defining variables/classes/functions.

Cython can automatically translate our code into C, which can then be then compiled into CPython extension modules. These compiled modules can be imported and used within Python code with significantly lower overhead than the equivalent Python modules.


You can find all code used in this blog post at this GitHub repository.

If you want to follow along and build Cython code on your own machine, you will need a working Python 3 environment (preferably 3.6+ to avoid dependency issues) with NumPy, Cython, Matplotlib, and Jupyter.

You will also need the right compiler. OSX and Linux (including WSL) users should already have gcc standard. Windows users who don’t use WSL will need to download Microsoft Visual Studio 2019 to make sure you have the right compiler. Choose “Desktop development with C++” when it asks which programs you want to install.

A simple example: the Fibonacci sequence

The Fibonacci sequence is defined through the following recurrence relation:

Fn = Fn-1 + Fn-2

The sequence begins 0, 1, 1, 2, 3, 5, 8, …

In this blogpost we will build several versions of a function to calculate the n’th Fibonacci number. All scripts associated with this example can be found in the “fibonacci/” directory of the repository above. The examples in this section are adapted from Kurt W. Smith’s O’Reilly Media book, “Cython: A Guide for Python Programmers“.

First, consider the pure Python implementation of this function, saved as

def fib_py(n):
    a, b = 0, 1
    for i in range(n - 1):
        a, b = a + b, a
    return a

We can import and run this function, and print the 30th Fibonacci number as follows:

from fib_py import fib_py

The output is 514229, which is indeed the 30th number in the sequence.

The second file in the repository,, is identical to the first script. We will “Cythonize” it to compiled C code without making any manual changes.

The third version, fib_cy.pyx, is a Cythonic implementation. Note the new file ending, .pyx, for Cython source files. Each variable is now statically typed as an integer. In the function body, this is done by prefacing the variable definition by cdef. The cdef is not necessary for arguments in the function definition (e.g., n).

def fib_cy(int n):
    cdef int a = 0
    cdef int b = 1
    cdef int i
    for i in range(n - 1):
        a, b = a + b, a
    return a

Because of the importance of types for Cython, I have also included three more files which are equivalent to the first three, except that they use floating point numbers to count rather than integers. Note that, confusingly, the Python float class is equivalent to C doubles, not C floats.

Here is and These don’t explicitly define a static type, but a and b are defined using a decimal so that they will be stored as floating points rather than integers (e.g., 0. rather than 0).

def fib_py_double(n):
    a, b = 0., 1.
    for i in range(n - 1):
        a, b = a + b, a
    return a

And here is fib_cy_double.pyx with statically defined types:

def fib_cy_double(int n):
    cdef double a = 0.
    cdef double b = 1.
    cdef int i
    for i in range(n - 1):
        a, b = a + b, a
    return a

In order to run the Cython code, we first have to Cythonize it. This involves two steps: translating our Python-like code into C code, and compiling the C code into an executable. If you want to repeat this process on your own machine, first delete all *.so, *.pyd, *.c, and *.html files, as well as the build/ directory.

To Cythonize, we use a setup file,

from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules = cythonize(['', 'fib_cy.pyx', '', 'fib_cy_double.pyx'], annotate=True, language_level=3))

and run it in Python:

python3 build_ext --inplace

We get the following output:

Compiling because it changed.
Compiling fib_cy.pyx because it changed.
Compiling because it changed.
Compiling fib_cy_double.pyx because it changed.
[1/4] Cythonizing fib_cy.pyx
[2/4] Cythonizing fib_cy_double.pyx
[3/4] Cythonizing
[4/4] Cythonizing
running build_ext
building 'fib_py_cy' extension
creating build
creating build/temp.linux-x86_64-3.8
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.8 -c fib_py_cy.c -o build/temp.linux-x86_64-3.8/fib_py_cy.o
creating build/lib.linux-x86_64-3.8
x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fwrapv -O2 -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 build/temp.linux-x86_64-3.8/fib_py_cy.o -o build/lib.linux-x86_64-3.8/


copying build/lib.linux-x86_64-3.8/ ->
copying build/lib.linux-x86_64-3.8/ ->
copying build/lib.linux-x86_64-3.8/ ->
copying build/lib.linux-x86_64-3.8/ ->

This output has several steps. First, Python checks which of our files has changed since the last Cythonization (in this case, all of them, since we deleted all previous Cython files). Each of these is Cythonized, creating a *.c file. Then each *.c file is compiled using gcc to create a *.so dynamic library, which is able to communicate directly with CPython at runtime (this may be a *.pyd file if you are using Windows without WSL). Finally, the files are copied from the build/ directory to the working directory, where they can be imported by Python scripts.

If you look at the C files, you will notice they are much longer and more complicated than the original Python file, or for that matter, than a true C implementation would be. For example, my version of fib_cy.c is 3179 lines long. The vast majority of this is specialized functions for interfacing between C and Python and handling various errors and exceptions. Thankfully, when programming in Cython we generally don’t have any need to edit or interpret the C files themselves.

Also created is a *.html annotation file that can be opened in the browser. These files highlight regions of the original Cython file based on how much Python interaction they require at runtime. For example, compare the highlighting for and fib_cy.pyx:

Cython annotation for
Cython annotation for fib_cy.pyx

The former, which does not include static type information, has significant interaction with the Python interpreter due to the need for dynamic type checking and memory management throughout the loop. The latter, on the other hand, only interacts with Python when calling and returning from the function; the internal computation of the loop is self-contained C code. If you click on particular highlighted lines, you can expand to see the specific C code associated with those lines.

So how do the performance of these different versions compare? The script will run 10×100,000 repetitions of each version of our code, given a Fibonacci sequence number (e.g., 30) that is supplied as a command line argument:

$ python3 30

Pure python: answer = 514229, time = 0.07640906400047243, speedup = 1.0

Cythonized Python: answer = 514229, time = 0.03642013104399666, speedup = 2.0979898152526655

Typed Cython: answer = 514229, time = 0.0032417900511063635, speedup = 23.570022362921193

Pure Python (double): answer = 514229.0, time = 0.06389092199970037, speedup = 1.1959299006646167

Cythonized Python (double): answer = 514229.0, time = 0.016547597013413906, speedup = 4.617532318350108

Typed Cython (double): answer = 514229.0, time = 0.002909799979534, speedup = 26.259215251183424

For each of our six version, the answer calculated is the same, as expected. The “speedup” is calculated as the runtime for the pure Python integer implementation, divided by the runtime for the implementation of interest. The speedup is found to be 2.1 for, the version that was Cythonized without explicitly adding type information. Pretty good for basically no work! However, the speedup for the statically typed version is much larger, at 23.6. This highlights how important dynamic type checking and memory management are as a bottleneck for Python. The double versions are found to be somewhat faster than the integer versions as well.

However, redoing the analysis for n=50 highlights an important lesson:

$ python3 50

Pure python: answer = 7778742049, time = 0.1132775130099617, speedup = 1.0

Cythonized Python: answer = 7778742049, time = 0.06092342402553186, speedup = 1.8593425241905186

Typed Cython: answer = -811192543, time = 0.003898979048244655, speedup = 29.053121755286153

Pure Python (double): answer = 7778742049.0, time = 0.09670376998838037, speedup = 1.1713867310816608

Cythonized Python (double): answer = 7778742049.0, time = 0.02165580401197076, speedup = 5.2308153946787135

Typed Cython (double): answer = 7778742049.0, time = 0.0044801399926654994, speedup = 25.284369058870908

Notice how the answer produced by the typed Cython integer version is -811192543, rather than the correct answer of 7778742049. The reason for this is that Python3 integers can be unlimited size, while C integers can overflow if they get too big. This fact would typically be accounted for by the Python interpreter when interfacing with C code, by converting to a long integer rather than a standard integer when necessary (notice how the Cythonized Python version got the correct answer). However, in our static Cython code, these types of error checks are not performed. This shows that the speed advantages of Cython are not free – you do lose some of the flexibility & safety of Python, so some care is required. In practice, you may need to introduce manual checks into your function (much like in C) where overflow or other memory errors are possible.

Next steps

I hope this introduction to Cython has been helpful. In my next blog post, I will demonstrate how to use Cython for more complex tasks involving functions, classes, and Numpy arrays, using a reservoir simulation example.

Further reading

For more information, of course see the Cython docs. I also highly recommend Kurt W. Smith’s O’Reilly Media book, “Cython: A Guide for Python Programmers” for those wanting a deeper understanding. It also looks like the author also has a tutorial, but I haven’t tried it yet. Lastly, for information on how to improve the efficiency of your Python code more generally (prior to or in lieu of using Cython), I recommend Micha Gorelick and Ian Ozsvald’s O’Reilly Media book, “High Performance Python: Practical Performant Programming for Humans.”

A time-saving tip for simulating multiple scenarios

A time-saving tip for simulating multiple scenarios

Last week I was reading a paper by Thomlinson, Arnott, and Harou (2020), A water resource simulator in Python. In this, they describe a method of evaluating many simulation scenarios (or hydrologic realizations) simultaneously, and claim that this approach results in significant improvements in computational efficiency.

I was struck because I have been constructing similar multi-scenario simulations using a different approach, by evaluating one scenario at a time.

After having thought I perfected my code, was it possible that a slight modification to the implementation could significantly improve the efficiency? (Spoiler: Yes! By many hours, in fact.)

I decided to post this here (A) as a simple demonstration in comparing runtime metrics of two different algorithm approaches, and (B) as a reminder to remain open to the possibility of simple yet significant improvements in code performance.

A GitHub repository containing all of the relevant Python scripts used below can be accessed here.


There are many reasons why an analyst might want to perform simulation of many different scenarios. For example, when simulating a water resource policy which is affected by stochastic exogenous forces, it is preferable to evaluate the policy under many different realizations of the stochastic variable. The performance of that policy can then be measured as an aggregation of the performance under the different realizations.

How one decides to simulate the different realizations can have significant impacts on the efficiency of the program.

Here, I compare two different implementations for many-scenario evaluation, and demonstrate a comparison of the methods using the shallow lake problem (Carpenter et al., 1999).

The two implementation techniques are shown graphically in the figure below.

Figure: Visual representation of the two different multi-scenario simulation implementations discusses in this post. (This figure was inspired by a similar figure contained in a Pywr presentation made available by James Thomlinson)

In the past, I (and others who came before me) have used what I refer to as a serial implementation, where the realizations are looped-through one-by-one, evaluating the model over the entire simulation time period before moving on to the next realization. Now, I am contrasting that with what I am calling the block implementation. For the block implementation, the simulation is evaluated for all realizations before progressing to the subsequent time step.

When viewed this way, it may seem obvious that the block simulation will perform more efficiently. The model is applied to the columns rather than the individual elements of a matrix of realizations or scenarios. However, hindsight is 20-20, and if you’re a water resource modeler with an engineering background (such as myself) this may not be apparent to you when constructing the model.

Now, onto the demonstration.

The Lake Problem

The shallow lake problem, as introduced by Carpenter et al. (1999), was used as a case study for design of multi-objective pollution policies by Professor Julie Quinn and has been explored in depth on this blog. While a comprehensive understanding of the lake problem, and the corresponding model, is not necessary for understanding this post, I will provide a brief summary here.

The problem centers around a hypothetical town located near a shallow lake. The town would like to discharge nutrient pollution (e.g., waste water treatment plant effluent) into the lake, while balancing the economic benefits (i.e., cost savings associated with the discharge) and environmental objectives (i.e., preserving the ecological health of the lake).

The pollution dynamics of the lake are described by a non-linear model representing pollution fluxes into and out of the lake:

X_{t+1} = X_t + a_t + Y_t + \frac{X_t^q}{1+X_t^q} - bX_t

Where X_t is the pollution concentration in the lake, a_t is the anthropogenically pollution discharged from the town, Y_t \approx Log-Normal(\mu, \sigma) is the natural pollution inflow into the lake and follows a log-normal distribution, q is the rate at which pollution is recycled from the sediment into the water column, and b is the rate at which the nutrient pollution is removed from the lake due to natural processes.

The non-linearity of the nutrient dynamics shown above result in the presence of an unstable equilibrium or ecological tipping point beyond which the lake becomes permanently eutrophic (ecologically unhealthy).

The problem then becomes identifying pollution policies that prescribe a preferable amount of pollution, a_t during each year while not crossing the ecological tipping point.

The lake model

A model can be constructed which simulates a pollution policy and measures corresponding environmental and economic performance objectives over a 100-year period. In her 2017 publication, Prof. Quinn shows how an adaptive pollution policy, which determines pollution discharges rates conditional upon the current lake pollution level, can result in greater system performance and robustness. However, for the sake of simplicity in this demonstration I am modeling an intertemporal pollution policy which consists of a set of annual pollution discharge quantities (release_decision in the following code). I selected one pollution policy from a set of pareto-approximate policies generated previously. To learn more about how optimal pollution discharge policies can be generated for this model, see the blog post here.

Since natural nutrient pollution inflow into the lake is stochastic, it is best to evaluate a single policy for many different realizations of natural inflow. The objective performance can then be measured as the summary (e.g., mean, sum, or max) of the performance across the different realizations.

Below, I am going to show two different ways of implementing multi-realization simulations corresponding to the two approaches described in the Introduction.

I’ve separated the common and and unique features of the two different implementations, to highlight the differences. Both models begin by defining a set of parametric constants, initializing variables for later use, and by calculating the ecological tipping point (critical pollution threshold):

def lake_model(release_decision):

    # Choose the number of stochastic realizations
    n_realizations = 100

    #Initialize conditions
    n_years = 100
    n_objs = 4
    n_time = 100

    # Constants
    reliab_threshold = 0.85
    inertia_threshold = 0.02
    b = 0.42
    q = 2.0
    delta = 0.98
    X0 = 0
    alpha = 0.4
    mu = 0.5
    sigma = np.sqrt(10**-0.5)

    # Initialize variables
    years_inertia_met = np.zeros(n_realizations)
    years_reliability_met = np.zeros(n_realizations)
    expected_benefits = np.zeros(n_realizations)
    average_annual_concentration = np.zeros(n_realizations)
    objs = [0.0]*n_objs

    # Identify the critical pollution threshold
    #Function defining x-critical. X-critical exists for flux = 0.
    def flux(x):
            f = pow(x,q)/(1+pow(x,q)) - b*x
            return f

    # solve for the critical threshold with unique q and b
    Xcrit = sp.bisect(flux, 0.01, 1.0)

Serial simulation

For the serial model evaluation, lake quality over a 100-year period is evaluated for one realization at a time. A single realization of stochastic nutrient pollution inflow, Y_t is generated and the policy is evaluated under these conditions. This is repeated 100 times for 100 unique realizations of natural inflow.

def serial_lake_model(release_decision):


    # Begin evaluation; evaluate one realization at a time
    for s in range(n_realizations):

        ### Generate a single inflow_realization
        inflow_realizations = np.random.lognormal(mean = mu, sigma = sigma, size = n_time)

        # Initialize a new matrix of lake-state variables
        lake_state = np.zeros(n_years)

        # Set the initial lake pollution level
        lake_state[0] = X0

        for t in range(n_years-1):

            lake_state[t+1] = lake_state[t] + release_decision[t] + inflow_realizations[t] + (lake_state[t]**q)/(1 + lake_state[t]**q)

            expected_benefits[s] += alpha * release_decision[t] * delta**(t+1)

            # Check if policy meets inertia threshold
            if t>= 1:
                years_inertia_met[s] += (abs(release_decision[t-1] - release_decision[t]) < inertia_threshold) * 1

            # Check if policy meets reliability threshold
            years_reliability_met[s] += (lake_state[t] < Xcrit) * 1

        average_annual_concentration[s] = np.mean(lake_state)

    # Calculate objective scores
    objs[0] = -np.mean(expected_benefits)
    objs[1] = max(average_annual_concentration)
    objs[2] = -np.sum(years_inertia_met / (n_years - 1))/n_realizations
    objs[3] = -np.sum(years_reliability_met)/(n_realizations * n_years)

    return objs

This is the way I have written this program in the past, and the way it has been presented previously in this blog.

Block Simulation

I’ve now re-written the model using the block simulation method, as shown in Figure 1. Rather than taking a single timeseries of natural inflow at a time, I produce a matrix of 100-realizations and perform calculations one-timestep (or column) at a time.

I recommend you pause to compare the above and below code segments, to make sure that you have a solid understanding of the differences in implementation before looking at the results.

def matrix_lake_model(release_decision):

    ### Create a matrix contaiining all inflow realizations
    inflow_realizations = np.zeros((n_realizations, n_time))
    for s in range(n_realizations):
        inflow_realizations[s, :] = np.random.lognormal(mean = mu, sigma = sigma, size = n_time)

    # Begin evaluation; evaluate all realizations one time step at a time
    for t in range(n_years-1):

        lake_state[:,t+1] = lake_state[:,t] + release_decision[t] + inflow_realizations[:,t] + (lake_state[:,t]**q)/(1 + lake_state[:,t]**q)

        expected_benefits[:] += alpha * release_decision[t] * delta**(t+1)

        # Check if policy meets inertia threshold
        if t>= 1:
            years_inertia_met += (abs(release_decision[t-1] - release_decision[t]) < inertia_threshold) * 1

        # Check if policy meets reliability threshold
        years_reliability_met += (lake_state[:,t] < Xcrit) * 1

    # Calculate objective scores
    objs[0] = -np.mean(expected_benefits)
    objs[1] = max(np.mean(lake_state, axis = 1))
    objs[2] = -np.sum(years_inertia_met / (n_years - 1))/n_realizations
    objs[3] = -np.sum(years_reliability_met)/(n_realizations * n_years)

    return objs

Running the same pollution policy in both of the above models, I first made sure that the output objective scores were identical. Having verified that, I set up some simple timing tests.

Comparison of evaluation time efficiency

I used the built-in time module in Python to perform runtime tests to compare the two implementation methods. The time module contains the function perf_counter() which makes a precise recording of the time at which it is called.

Calling perf_counter() before and after model evaluation provides a simple way of measuring the evaluation time. An example of this time measurement is shown below:

import time
import lake_model

start = time.perf_counter()
model_output = lake_model(release_decision, n_realizations)
end = time.perf_counter()
model_runtime = end - start

When performing an optimization of the pollution release policy using the Borg multi-objective evolutionary algorithm (MOEA), it is standard for our group to simulate each policy for 100-realizations of hydrologic stochasticity. The objective scores are calculated as some aggregation of the performance across those realizations.

With this in mind, I first measured the difference in runtime between the two simulation methods for 100-realizations of stochasticity.

I found the model using the block implementation, evaluating all realizations one time-step at a time, was 0.048 seconds faster then evaluating the 100-realizations in series. That may seem marginal at first, but how does this difference in speed scale when performing many thousands of model evaluations throughout an optimization?

Previously, when optimizing the release decisions for the lake problem, I allowed the Borg MOEA to run for a maximum of 50,000 model evaluations per seed, and repeated this process for 20 random seeds… at a savings of 0.048 seconds per evaluation, I would have saved 13.3 hours of computation time if I had used the block implementation technique!

This difference grows exponentially with the number of realizations performed in each evaluation. In the figure below, I am showing runtime of the block implementation relative to the serial implementation (I.e., \frac{T_{simul}}{T_{serial}}) for an increasingly large number of realizations:

Figure: Runtime of the matrix implementation technique relative to the serial technique, for a range of realization numbers.


In a stochastic world, evaluation of water resource policies under a broad ensemble of scenarios allows for the selection of more policies that are more robust under uncertain factors. Writing code that is computationally burdensome can limit the size of the ensemble, and ultimately limit the analysis.

When working with high performance computing resources, researchers are often limited in allotted compute-hours, or the time spent processing a program on the machine. Given this context, it is highly valuable to identify time-saving measures such as the block implementation presented here.

Ultimately, this experimentation was pleasantly humbling. I hope that this serves as motivation to approach one of your existing models with a fresh perspective. At the least, I hope to keep you from making the same costly mistake as me.

I want to shout out Thomlinson, Arnott, and Harou again for writing the paper that sparked this re-assessment, check out their water resource simulator for solving resource allocation problems, Pywr, here.

Introduction to Drought Metrics

In this blog post, we’re going to discuss meteorological and hydrological drought metrics. Droughts are an important but difficult hazard to characterize. Whereas hurricanes or tornadoes occur suddenly and persist for a short and clearly defined interval, a drought is usually thought of as a “creeping disaster”, in which it is often difficult to pinpoint exactly when a drought begins and when it ends. Droughts can last for decades (termed a megadrought) where dryness is chronic but still can be interspersed with brief wet periods.

Droughts are expected to become more prominent globally in the upcoming decades due to the warming of the climate and even now, many locations in the Western US are experiencing droughts that exceed or rival some of the driest periods in the last millennium (California and the Southwest US). Thus, substantial literature is focused on investigating drought through observed gauge data, reconstructions using paleodata, and projections of precipitation and streamflow from climate models to better understand what future droughts may look like. However, many studies differ in their characterization of drought risk. Different GCMs can create contrasting representations of drought risk which often is caused by differences in the underlying representations of future warming and precipitation. Further, it is important to consider that a drought does not have a universal definition. It manifests due to the complex interactions across the atmosphere, land and hydrology, and the human system [1]. To account for this complexity, different definitions and metrics for droughts have been developed to account for single or combinations of these sectors (meteorological, hydrological, agricultural, etc.). Though a single definition has not been agreed upon, it has been widely suggested that using any one measure of drought can be limiting. For example, Swann (2018) suggests that ignoring plant response in any metric will overestimate drought. Furthermore, Hao et al. (2013) explains that a meteorological (precipitation) drought may not lead to an agricultural (soil moisture) drought in tropical areas where there is high soil moisture content. In this blog post, we’re going to define and discuss some of these metrics that are used to represent droughts to aid in deciding what metrics are appropriate for your case study.


The Standardized Precipitation Index (SPI) and Standardized Streamflow Index (SSI) are the most common means of characterizing a meteorological or hydrologic drought respectively. These metrics are highly popular because they only require a monthly time series (ideally of length > 30 years) of either precipitation or streamflow. These metrics quantify how much a specific month’s precipitation or streamflow deviates from the long-term monthly average. The steps to calculate SPI are as follows (SSI is the same):

  1. Determine a moving window or averaging period. This is usually a value from the following: 3, 6, 12, or 24 months. Any moving window from 1-6 months is better for capturing meteorological and soil moisture conditions whereas 6-24 month averaging windows will better represent droughts affecting streamflow, reservoir storage, and groundwater. If calculating a 6-month SPI,  the way to factor in the windows is as follows: the aggregate precipitation attributed to month j is accumulated over month j-5 to j [4]. Repeat this procedure for the whole time series.
  2. Next an appropriate probability density function is fit to the accumulated precipitation. Generally a gamma or Pearson Type III distribution works well for precipitation data, but be sure to check for goodness of fit.  
  3. The associated cumulative probability distribution from Step 2 is then estimated and subsequently transformed to a normal distribution. Now each data point, j, can be worked through this framework to ultimately calculate a precipitation deviation for a normally distributed probability density with a mean of zero and standard deviation of 1. Because the SPI is normalized, it means that you can use this metric to characterize both dry and wet periods.

The range that the SPI values can take are summarized below [5]:

> 2.00Extremely Wet
1.50 to 1.99Very Wet
1.00 to 1.49Moderately Wet
0.00 to 0.99Near Normal
0 to -0.99Mild Drought
-1.00 to -1.49Moderate Drought
-1.50 to -1.99Severe Drought
SPI Values and Classifications

From the SPI, you can define various drought metrics of interest. We’ll use an SPI of <= -1 to define the consequential drought threshold.

Percentage of Time in Drought: How many months in the time series have an SPI less than -1 divided by the total months in the time period in question?

Drought Duration: What is the number of consecutive months with an SPI below -1?

Drought Intensity or Peak: What is the minimum value of a drought’s period’s SPI?

Drought Recovery: How many months does it take to get from the peak of the drought to an SPI above -1?

There are lots of packages that exist to calculate SPI and SSI and it’s not difficult to code up these metrics from scratch either. Today, I’m going to demonstrate one of the easiest toolboxes to use for MATLAB, the Standardized Drought Analysis Toolbox (SDAT). We’ll use the provided example time series, precip.txt, which is 500 months long. I specify that I want to utilize a window of 6 months (line 50). Running the script creates the SPI index and plots it for you.

SDAT Output

Let’s investigate these results further. I’ll change the window to 12 months and plot the results side by side. In the plotting window, I isolate a piece of the time series that is of length 25 months. The instances of drought are denoted by the black squares and the peak of each drought is denoted by the circle. We can then calculate some of those metrics defined above.

SPI results across different averaging windows for a portion of the SPI index
Metric6 Month SPI12-month SPI
# of Drought Events21
Duration1,2 months 2 months
Recovery Time 1 month1 month
Intensity -1.78-1.36
Metrics derived from SPI time series created from 2 different windows

Note how different the SPI metric and perception of drought is depending on your chosen window. The window will greatly influence your perception of the number of droughts experienced and their intensity which is worth addressing. As much as SPI and SSI are easy metrics to calculate, that simplicity comes with limitations. Because these metrics only use the precipitation or streamflow time series, there is no consideration of evapotranspiration which means that any information about how much warming plays a role in these drought conditions is neglected, which can be a large piece of the puzzle. These metrics tend to be very sensitive to the length of the record as well [6].


Given the limitations of SPI, the Standardized Precipitation Evapotranspiration Index (SPEI) was proposed by Vicente-Serrano et al. (2010) that is based on both temperature and precipitation data. Mathematically, the SPEI is calculated similarly to the SPI, thought instead of using precipitation only, the aggregated value to generate the index on is now precipitation minus PET. PET can be calculated using a more physically-based equation such as the Penman-Montieth equation or the simpler Thornthwaite equation (which only uses air temperature to derive PET). After the aggregate data is created, one can simply outlined in the SPI section. Vicente-Serrano et al. (2010) compare SPI and SPEI in Indore, India and demonstrate that only SPEI can identify an increase in drought severity associated with higher water demand as a result of evapotranspiration. The key limitation of SPEI is mostly based on the fact that it requires a complete temperature and precipitation dataset which may limit its use due to insufficient data being available.


Another metric to be aware of is the Multivariate Standardized Drought Index (MSDI) from Hao et al. (2013). This metric probabilistically combines SPI and SSI for a more comprehensive drought characterization that covers both meteorological and agricultural drought conditions. The SPI and SSI metrics are determined as usual and then a copula is fit to derive a joint probability distribution across the two indices. The MSDI index is ultimately found by pushing the joint probabilities into an inverse normal distribution, thus placing it in the same space as the original SPI and SSI. The authors demonstrate that (1) the difference between SPI and SSI decreases as drought duration increases and (2) the MDSI can capture the evolution of both meteorological and hydrological droughts and particularly show that the onset of droughts is dominated by the SPI index and drought persistence is dominated more by the SSI index.

Decadal Drought

If you are interested more in capturing drought at longer time scales, such as decadal to multi-decadal, a similar style of metric is proposed in Ault et al. (2014). The authors define a decadal drought, such as the 1930s dustbowl or 1950s drought in the Southwest as a precipitation value that is more than ½ sigma below the 11-year running mean. A multi-decadal drought is characterized as a precipitation value that is more than ½ sigma below a 35-year running mean. This definition is somewhat arbitrary and more of a worse-case planning scenario because it corresponds to drought events that are worse than what has been reconstructed over the past millennium.  

I hope this post is useful to you! If you use other drought metrics that are not listed here or have certain preferences for metrics based on experiences, please feel free to comment below!


[1] Touma, D., Ashfaq, M., Nayak, M. A., Kao, S. C., & Diffenbaugh, N. S. (2015). A multi-model and multi-index evaluation of drought characteristics in the 21st century. Journal of Hydrology526, 196-207.

[2] Swann, A. L. (2018). Plants and drought in a changing climate. Current Climate Change Reports4(2), 192-201.

[3] Hao, Z., & AghaKouchak, A. (2013). Multivariate standardized drought index: a parametric multi-index model. Advances in Water Resources57, 12-18.

[4] Guenang, G. M., & Kamga, F. M. (2014). Computation of the standardized precipitation index (SPI) and its use to assess drought occurrences in Cameroon over recent decades. Journal of Applied Meteorology and Climatology53(10), 2310-2324.

[5] McKee, T. B., Doesken, N. J., & Kleist, J. (1993, January). The relationship of drought frequency and duration to time scales. In Proceedings of the 8th Conference on Applied Climatology (Vol. 17, No. 22, pp. 179-183).


[7] Vicente-Serrano S.M., Santiago Beguería, Juan I. López-Moreno, (2010) A Multi-scalar drought index sensitive to global warming: The Standardized Precipitation Evapotranspiration Index – SPEI. Journal of Climate 23: 1696-1718.

[8] Ault, T. R., Cole, J. E., Overpeck, J. T., Pederson, G. T., & Meko, D. M. (2014). Assessing the risk of persistent drought using climate model simulations and paleoclimate data. Journal of Climate27(20), 7529-7549.