Welcome to our blog!

Welcome to Water Programming! This blog is by Pat Reed’s group at Cornell, who use computer programs to solve problems — Multiobjective Evolutionary Algorithms (MOEAs), simulation models, visualization, and other techniques. Use the search feature and categories on the right panel to find topics of interest. Feel free to comment, and contact us if you want to contribute posts.

To find software: Please consult the Pat Reed group website, MOEAFramework.org, and BorgMOEA.org.

The MOEAFramework Setup Guide: A detailed guide is now available. The focus of the document is connecting an optimization problem written in C/C++ to MOEAFramework, which is written in Java.

The Borg MOEA Guide: We are currently writing a tutorial on how to use the C version of the Borg MOEA, which is being released to researchers here.

Call for contributors: We want this to be a community resource to share tips and tricks. Are you interested in contributing? Please email dfg42 “at” cornell.edu. You’ll need a WordPress.com account.

Creating a collaborative research group lab manual with Jupyter Books

Motivation

Onboarding new students, staff, or collaborators can be a challenge in highly technical fields. Often, the knowledge needed to run complex experiments or computing workflows is spread across multiple individuals or teams based on their unique experiences, and training of new team members tends to occur in an ad hoc and inefficient manner. These challenges are compounded by the inevitable turnover of students, postdocs, and external collaborators in academic research settings.

Over the years, the Reed Group has developed a large bank of training materials to help streamline the onboarding process for new team members. These materials introduce concepts and tools related to water systems modeling, multi-objective evolutionary algorithms, global sensitivity analysis, synthetic streamflow generation, etc. However, these materials are still spread across a variety of sources (academic papers, GitHub repositories, blog posts, etc.) and team members, and there has been growing recognition of the need to catalogue and compile relevant resources and trainings in a more structured way.

For this reason, we have begun to create a lab manual for the Reed group. This will include a wide variety of information relevant to new students and researchers – everything from training exercises and code snippets, to reading lists and coursework suggestions, to a code of conduct outlining our values and expectations. The goal is for this to be a collaborative, living document created and maintained by students and postdocs. Ideally this will continue to evolve for years along with the evolving state-of-the-art in methodology, software, and literature.

After considering a number of different platforms for constructing websites, we settled on the Jupyter Book package for Python. You can find our lab manual here, and the source code used to create it here – note that this is still very much in development, a skeleton waiting to be fleshed out. In the remainder of this blog post, I will highlight the major elements of a Jupyter Book website, using our skeleton lab manual as an example. Then in a future blog post, I will outline the Continuous Integration and Continuous Delivery (CI/CD) strategy we are using to manage versioning and platform dependency issues across multiple developers.

Intro to Jupyter Book

Jupyter Book is a Python package for creating static websites. The package is built on the popular Sphinx engine used to create documentation for many of your favorite Python packages. Sphinx was also used to create the ebook for “Addressing Uncertainty in MultiSector Dynamics Research”, as described in two recent blog posts by Rohini Gupta and Travis Thurber. The ebook was a source of inspiration for our lab manual and the reason we initially considered Sphinx-based workflows. However, Jupyter Book layers several additional functionalities on top of Sphinx. First, it supports use of the MyST Markdown language, which is more familiar and intuitive to most researchers than the reStructuredText format favored by Sphinx. And second, it allows pages to be built from executable Jupyter Notebooks, a powerful tool for combining text and equations with formatted code blocks, program output, and generated figures.

The Jupyter Book documentation contains tutorials, examples, and references, and is an excellent resource for anyone looking to build their own site. The documentation itself is, of course, created using the Jupyter Book package, and interested readers can check out the source code here.

Designing the website structure

The hierarchical structure of a Jupyter Book is defined in a simple YAML-style Table of Contents file, which should be named _toc.yml. Here is the TOC for our lab manual at present:

format: jb-book
root: intro.md
parts:
- chapters:
  - file: ExamplePages/ExamplePages.md
    sections:
    - file: ExamplePages/mdExample.md
    - file: ExamplePages/nbExample.ipynb
  - file: Resources/Resources.md
    sections:
    - file: Resources/ClusterBasics.md
    - file: Resources/Computing.md
    - file: Resources/Courses.md
    - file: Resources/DataVisualization.md
    - file: Resources/LifeAtCornell.md
    - file: Resources/ReedGroupTools.md
    - file: Resources/WritingProcess.md
    - file: Resources/CitationNetworkandDiscovery.md
  - file: Training/Training.md
    sections:
    - file: Training/Schedule.md
    - file: Training/Reading.md
    - file: Training/LakeProblem.md
    - file: Training/Training_Fisheries_Part1.md
    - file: Training/Linux_MOEAs_HPC.md

The “root” defines the landing page, in this case the intro.md markdown file. That landing page will link to three “chapters” called ExamplePages, Resources, and Training. Each of these chapters has its own landing page as well as multiple child “sections.” Each page can be written as either a Markdown file (.md) or a Jupyter Notebook (.ipynb).

The other important YAML file for all Jupyter Books is _config.yml:

title: Reed group lab manual
author: The Reed Group at Cornell CEE
logo: logo.png

# Force re-execution of notebooks on each build.
# See https://jupyterbook.org/content/execute.html
execute:
  execute_notebooks: force

# Define the name of the latex output file for PDF builds
latex:
  latex_documents:
    targetname: book.tex

# Add a bibtex file so that we can create citations
bibtex_bibfiles:
  - references.bib

# Information about where the book exists on the web
repository:
  url: https://github.com/reedgroup/reedgroup.github.io  # Online location of your book
  path_to_book: docs  # Optional path to your book, relative to the repository root

# Add GitHub buttons to your book
# See https://jupyterbook.org/customize/config.html#add-a-link-to-your-repository
html:
  use_issues_button: true
  use_repository_button: true

We first define our website’s title and author, as well as an image logo to display. The line “execute_notebooks: force” means that we want to reexecute all Jupyter Notebooks each time the site is built (see docs for other options). The url gives the web address where we want to host our site – in this case the GitHub Pages address associated with the GitHub repository for the site. The path_to_book defines “docs” as the folder in the repository where all source code is to be held. Finally, the last two options are used to create buttons at the top of our site that link to the GitHub repository in case readers want to browse the source code or report an issue. For now, we are using the default vanilla style, but there are many ways to customize the structural and aesthetic style of the site. You would need to point to custom style files from this configuration file – see the Jupyter Book gallery for inspiration.
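For example, custom themes or style files can be pointed to through a sphinx section in _config.yml, whose keys Jupyter Book passes through to the underlying Sphinx engine. This is just a hedged sketch (the theme and folder names here are illustrative, and the defaults work fine):

sphinx:
  config:
    html_theme: sphinx_book_theme   # the default theme; swap in another installed Sphinx theme
    html_static_path: ['_static']   # folder holding custom CSS/JS assets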

Building pages with Markdown and Jupyter Notebooks

Jupyter Book makes it very easy to write new pages using either Markdown or Jupyter Notebooks. For context, here is a screenshot of the site’s homepage:

The main content section for this page is built from the “root” file, intro.md:

# Welcome to our lab manual! 

```{warning}
This site is still under construction
```

The purpose of this site is to help new students and collaborators get up to speed on the research methods/tools used by the Reed Group. This page is designed and maintained by other graduate students and post docs, and is intended to serve as a living document. 

This manual was created using the Jupyter Books Python package, and is hosted with GitHub Pages. You can find our source code at https://github.com/reedgroup/reedgroup.github.io.

```{tableofcontents}```

As you can see, this uses a very human-readable and intuitive Markdown-based file structure. Jupyter Book provides simple functionality for warning labels and other emphasis boxes, as well as a Table of Contents that is automatically rendered from the _toc.yml file. The tableofcontents command can be used from anywhere in the hierarchical page tree and will automatically filter to include only children of the current page. The separate sidebar TOC will also expand to show “sections” as you navigate into different “chapters.” For example, here is the Markdown and rendered webpage for the “ExamplePages” chapter:

# Example Pages with JupyterBooks
```{tableofcontents}```

For more detailed pages, you can also apply standard Markdown syntax to add section headers, bold/italic font, code blocks, lists, Latex equations, images, etc. For example, here is ExamplePages/mdExample.md:

# Markdown example
This is an example page using just markdown

### Subsection 1
Here is a subsection

### Subsection 2
Here is another subsection. 

:::{note}
Here is a note!
:::

And here is a code block:

```
e = mc^2
```

And here comes a cute image!

![capybara and friends](capybaraFriends.jpg "Capybara and friends")

Lastly, and most importantly for purposes of building a training manual, we can create pages using Jupyter Notebooks. For example, here are two screenshots of the webpage rendered from ExamplePages/nbExample.ipynb:

As you can see, the Notebook functionality allows us to combine text and equations with rendered Python code. We can also execute Bash, R, or other programs using Jupyter Notebook’s “magic” commands. Note that the Jupyter-based website is not interactive – for that you’ll need Binder, as demonstrated in this blog post by David Gold.

Nevertheless, the Notebook is reexecuted each time we rebuild the website, which should really streamline collaborative lab manual development. For example, once we have developed a code bank of visualization examples (stay tuned!), it will be straightforward to edit the existing examples and/or add new examples, with the rendered visualizations being automatically updated rather than needing to manually upload the new images. Additionally, reexecuting the Notebooks each time we rebuild the site will force us to maintain the functionality of our existing code bank rather than letting portions become obsolete due to package dependencies or other issues.

Next steps

You now have the basic building blocks to create your own lab manual or ebook using a collection of YAML files, Markdown files, and Jupyter Notebooks. The last two critical steps are to actually build the static site (e.g., the html files) using Jupyter Book, and then host the site using GitHub pages. I will demonstrate these steps, as well as our CI/CD strategy based on GitHub Actions, in my next blog post.
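As a quick preview, the build step itself is a single command run from the repository root (a sketch, assuming the source lives in the docs folder as configured above):

jupyter-book build docs

The rendered HTML will appear in docs/_build/html. That folder can then be pushed to a gh-pages branch for hosting, for example with the ghp-import utility:

ghp-import -n -p -f docs/_build/html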

Efficient hydroclimatic data accessing with HyRiver for Python

This tutorial highlights the HyRiver software stack for Python, which is a very powerful tool for acquiring large sets of data from various web services.

I have uploaded a Jupyter-Notebook version of this post here if you would like to execute it yourself.

HyRiver Introduction

The HyRiver software suite was developed by Taher Chegini, who describes HyRiver as:

“… a software stack consisting of seven Python libraries that are designed to aid in hydroclimate analysis through web services.”

This description does not do justice to the capability of this software. Through my research I have spent significant amounts of time wrangling various datasets – making sure that dates align, or accounting for spatial misalignment of available data. The HyRiver suite streamlines this process, and makes acquisition of different data from various sources much more efficient.

Here, I am going to walk through a demonstration of how to easily access large amounts of data (streamflow, geophysical, and meteorological) for a basin of interest.

Before going through the code, I will highlight the three libraries from the HyRiver stack which I have found most useful: PyGeoHydro, PyNHD, and PyDaymet.

PyGeoHydro

PyGeoHydro allows for interaction with eight different online datasets, including the USGS National Water Information System (NWIS).

In this tutorial, I will only be interacting with the USGS NWIS, which provides daily streamflow data.

PyNHD

The PyNHD library is designed to interact with the National Hydrography Dataset (NHD) and the Hydro Network-Linked Data Index (NLDI).

NHDPlus (National Hydrography Dataset)

The NHD defines a high-resolution network of stream linkages, each with a unique identifier (ComID).

NLDI (Network-Linked Data Index)

The NLDI aids in the discovery of indexed information along some NHD-specified geometry (ComIDs). The NLDI essentially traverses the linkages specified by the NHD geometry and generates either local or basin-aggregated data relative to a specific linkage (ComID).

As will be seen later in the tutorial, the NLDI is able to retrieve at least 126 different types of data for a given basin…

PyDaymet

The PyDaymet GitHub repository summarizes the package as:

“[providing] access to climate data from Daymet V4 database using NetCDF Subset Service (NCSS). Both single pixel (using get_bycoords function) and gridded data (using get_bygeom) are supported which are returned as pandas.DataFrame and xarray.Dataset, respectively.”

Tutorial outline:

  1. Installation
  2. Retrieving USGS Water Data
  3. Retrieving Geophysical (NLDI) Data
  4. Retrieving Daymet Data

The HyRiver repository contains various examples demonstrating the use of the various libraries. I would definitely recommend digging deeper into these, and other HyRiver documentation, if this post piques your interest.

Step 0: Installation

In this tutorial, I only interact with the PyNHD, PyGeoHydro, and PyDaymet libraries, so I do not need to install the full HyRiver suite.

If you operate through pip, you can install these libraries using:

pip install pynhd pygeohydro pydaymet

If you use Anaconda package manager, you can install these packages using:

conda install -c conda-forge pynhd pygeohydro pydaymet

For more information on installation, refer to the HyRiver GitHub repository and related documentation.

Now, onto the fun part!

Step 1: Retrieving USGS Water Data

I am beginning here because streamflow data is typically the first point of interest for most hydrologic engineers or modelers.

Personally, I have gone through the process of trying to download data manually from the USGS NWIS website… My appreciation for the USGS prevents me from saying anything too negative, but let’s just say it was not a pleasant experience.

PyGeoHydro allows for direct requests from the USGS National Water Information System (NWIS), which provides daily streamflow data from all USGS gages. The data is conveniently output as a Pandas DataFrame.

With this functionality alone, the PyGeoHydro library is worth learning.

1.1 Initialize PyGeoHydro NWIS tool

# Import common libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import the PyGeoHydro library tools
import pygeohydro as gh
from pygeohydro import NWIS, plot

# Use the national water info system (NWIS)
nwis = NWIS()

1.2 Requesting USGS Streamflow Data

The get_streamflow() function does exactly what the name entails and will retrieve daily streamflow timeseries; however, USGS gage station IDs must be provided. If you are only interested in a single location, then you can enter an 8-digit gage ID number along with a specified date range to generate the data:

nwis.get_streamflow('########', dates=('YYYY-MM-DD', 'YYYY-MM-DD'))

However, I want to explore larger sets of data over an entire region. Thus, I am going to use PyGeoHydro's get_info() function to identify all gages within some region of interest.

First, I specify a region via (latitude, longitude) bounds, then I send a query which retrieves meta-data information on all the gages in the specified region. In this case, I am exploring the data available near Ithaca, NY.

# Query specifications
region = (-76.7, 42.3, -76, 42.6) # Ithaca, NY

# Send a query for all gage info in the region
query = {"bBox": ",".join(f"{b:.06f}" for b in region),
         "hasDataTypeCd": "dv",
         "outputDataTypeCd": "dv"}

info_box = nwis.get_info(query)

print(f'PyGeoHydro found {len(set(info_box.site_no))} unique gages in this region.')

# [Out]: PyGeoHydro found #N unique gages in this region.

However, this info_box identifies many gages in the region which have very old or very brief data records. Knowing this, I want to filter out gages which do not have a suitable record length.

For the sake of this tutorial, I am considering data between January 1st, 2020 and December 31st, 2020.

# Specify date range of interest
dates = ("2020-01-01", "2020-12-31") 

# Filter stations to have only those with proper dates
stations = info_box[(info_box.begin_date <= dates[0]) & (info_box.end_date >= dates[1])].site_no.tolist()

# Remove duplicates by converting to a set
stations = set(stations)

Now, I am ready to use the gage IDs contained in stations to request the streamflow data!

# Retrieve the flow data
flow_data = nwis.get_streamflow(stations, dates, mmd=False)

# Remove gages with nans
flow_data = flow_data.dropna(axis = 1, how = 'any')

After removing duplicates and gages with nans, I have data from five unique gages in this region.

Additionally, PyGeoHydro has a convenient plotting feature to help quickly visualize the streamflow data.

from pygeohydro import plot

# Plot flow data summary
plot.signatures(flow_data)
Summary of flow data for the 5 gages found with PyGeoHydro.

There is a lot more to be explored in the PyGeoHydro library, but I will leave that up to the curious reader.

Step 2: Retrieving Geophysical (NLDI) Data

So, you’ve got some streamflow data but you don’t know anything about the physical watershed…

This is where the PyNHD library comes in. Using this library, I can identify the entire upstream network from a gage, then extract the NLDI data associated with the watershed linkages.

# Import the PyNHD library
import pynhd as pynhd
from pynhd import NHD
from pynhd import NLDI, WaterData

First, we can take a look at all possible local basin characteristic data that are available:

# Get list of local data types (AKA characteristics, or attributes)
possible_attributes = NLDI().get_validchars("local").index.to_list()

There are 126 characteristics available from the NLDI! These characteristics range from elevation, to reservoir capacity, to bedrock depth. Many of these are not of immediate interest to me, so I will specify a subset of select_attributes to retrieve (basin area, max elevation, and stream slope).

I then loop through all of my USGS stations for which I have data in flow_data, identifying the upstream basin linkages using NLDI().navigate_byid(). Once the basin is identified, I extract the ComID numbers for each linkage and use that number to retrieve the NLDI data of interest. I then store the data in nldi_data. This process is done by the following:

# Specify characteristics of interest
select_attributes = ['CAT_BASIN_AREA', 'CAT_ELEV_MAX', 'CAT_STREAM_SLOPE']

# Initialize a storage matrix
nldi_data = np.zeros((len(flow_data.columns), len(select_attributes)))

# Loop through all gages, and request NLDI data near each gage
for i, st in enumerate(flow_data.columns):

    # Navigate up all flowlines from gage
    flowlines = NLDI().navigate_byid(fsource = 'nwissite',
                                    fid = f'{st}',
                                    navigation="upstreamTributaries",
                                    source = 'flowlines',
                                    distance = 10)

    # Get the nearest comid
    station_comid = flowlines.nhdplus_comid.to_list()[0]

    # Source NLDI local data
    nldi_data[i,:] = NLDI().getcharacteristic_byid(station_comid, "local", char_ids = select_attributes)

So far, I have timeseries streamflow data for five locations in the Ithaca, NY area, along with the basin area, max basin elevation, and stream slope for each stream. If I can access hydro-climate data, maybe I could begin studying the relationships between streamflow and physical basin features after some rain event.

Step 3: Meteorological data

The PyDaymet library allows for direct requests of meteorological data across an entire basin.

The available data includes:

  • Minimum and maximum temperature (tmin, tmax)
  • Precipitation (prcp)
  • Vapor pressure (vp)
  • Snow-Water Equivalent (swe)
  • Shortwave radiation (srad)

All data are reported daily at a 1km x 1km resolution. Additionally, the PyDaymet library has the ability to estimate potential evapotranspiration, using various approximation methods.

Here, I choose to only request precipitation (prcp) and max temperature (tmax).

NOTE:
So far, the Daymet data retrieval process has been the slowest aspect of my HyRiver workflow. Due to the high resolution and the potential for large basins, this may be computationally intensive if you try to request data for many gages over long time ranges.

# Import the PyDaymet library
import pydaymet as daymet

## Specify which data to request
met_vars = ["prcp", "tmax"]
met_data_names = np.array(['mean_prcp','sd_prcp','mean_tmax','sd_tmax'])

## Initialize storage
daymet_data = np.zeros((len(flow_data.columns), len(met_data_names)))

Similar to the NLDI() process, I loop through each gage (flow_data.columns) and (1) identify the up-gage basin, (2) source the Daymet data within the basin, (3) aggregate and store the data in daymet_data.

## Loop through stations from above
for i, st in enumerate(flow_data.columns):

    # Get the up-station basin geometry
    geometry = NLDI().get_basins(st).geometry[0]

    # Source Daymet data within basin
    basin_met_data = daymet.get_bygeom(geometry, dates, variables= met_vars)

    ## Pull values, aggregate, and store
    # Mean and std dev precipitation
    daymet_data[i, 0] = np.nan_to_num(basin_met_data.prcp.values).mean()
    daymet_data[i, 1] = np.nan_to_num(basin_met_data.prcp.values).std()

    # Mean and std dev of max temperature
    daymet_data[i, 2] = np.nan_to_num(basin_met_data.tmax.values).mean()
    daymet_data[i, 3] = np.nan_to_num(basin_met_data.tmax.values).std()

daymet_data.shape

# [Out]: (5, 4)

Without having used a web browser, I have been able to access a set of physical basin characteristics, various climate data, and observed streamflow relevant to my region of interest!

Now this data can be exported to a CSV and used in any other project.
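For example, a minimal export might look like the following (the output file names and the combined table layout are illustrative, building on the variables defined above):

import numpy as np
import pandas as pd

# Combine the NLDI basin characteristics and Daymet summaries into one table,
# with one row per gage
basin_df = pd.DataFrame(np.hstack([nldi_data, daymet_data]),
                        index=flow_data.columns,
                        columns=select_attributes + list(met_data_names))

# Export the summary table and the daily streamflow timeseries
basin_df.to_csv('basin_characteristics.csv')
flow_data.to_csv('streamflow_2020.csv')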

Conclusion

I hope this introduction to HyRiver has encouraged you to go bigger with your hydroclimate data ambitions.

If you are curious to learn more, I’d recommend you see the HyRiver Examples which have various in-depth Jupyter Notebook tutorials.

Citations

Chegini, Taher, et al. “HyRiver: Hydroclimate Data Retriever.” Journal of Open Source Software, vol. 6, no. 66, 27 Oct. 2021, p. 3175, 10.21105/joss.03175. Accessed 15 June 2022.

Fisheries Training Part 2 – Tradeoff Visualization and Introduction to J3

Hello there! If you’re here, then you likely have successfully navigated the previous two posts in our Fisheries Training Series.

In these posts, we explored the complex dynamics of a two-species predator-prey fisheries system. We also visualized various potential scenarios of stability and collapse that result from a variety of system parameter values. We then set up the problem components, which include its parameters and their associated uncertainty ranges, performance objectives, and the radial basis functions (RBFs) that map the current system state to policy action.

Now, we will build off the previous posts and generate the full Pareto-approximate set of performance objectives and their associated decision variable values. We will also specify our robustness multivariate satisficing criteria (Starr, 1963), as set by Hadjimichael et al. (2020), and use J3, a visualization software, to explore the tradeoff space and identify the solutions that meet these criteria.

To better follow along with our training series, please find the accompanying GitHub repository that contains all the source code here.

A brief recap on decision variables, parameters and performance objectives

In the Fisheries Training series, we describe the system using the following parameters:

  • x_{t} and y_{t}: The prey and predator population densities at time t respectively
  • \alpha: The rate at which the predator encounters the prey
  • b: The prey growth rate
  • c: The rate at which the predator converts prey to new predators
  • d: The predator death rate
  • h: The time the predator needs to consume the prey (handling time)
  • K: Environmental carrying capacity
  • m: The level of predator interaction
  • z: The fraction of prey that is harvested

Please refer to Post 0 for further details on the relevance of each parameter.

Our decision variables are the three RBF parameters: the center (c_{i}), radius (r_{i}) and weights (w_{i}) of each RBF i respectively. From Part 1, we opt to use two RBFs where i \in [1,2], resulting in six decision variables.

Next, our objectives are as follows:

  • Objective 1: Maximize net present value (NPV)
  • Objective 2: Minimize prey-population deficit
  • Objective 3: Minimize the longest duration of consecutive low harvest
  • Objective 4: Minimize worst harvest instance
  • Objective 5: Minimize harvest variance

Detailed explanation on the formulation and Python execution of the RBFs and objectives can be found in Post 1.

Now that we’ve reviewed the problem setup, let’s get to setting up the code!

Running the full problem optimization

Importing all libraries and setting up the problem

Before beginning, ensure that both Platypus and PyBorg are downloaded and installed as recommended by Post 1. Next, as previously performed, import all the necessary libraries:

# import all required libraries
from platypus import Problem, Real, Hypervolume, Generator
from pyborg import BorgMOEA
from fish_game_functions import *
import matplotlib.pyplot as plt
import time
import random

We then define the problem by setting the number of variables (nVars), performance objectives (nObjs) and constraints (nCnstr). We also define the upper and lower bounds of each objective. The negative values associated with Objectives 1 and 4 indicate that they are to be maximized.

# Set the number of decision variables, constraints and performance objectives
nVars = 6   # Define number of decision variables
nObjs = 5   # Define number of objectives
nCnstr = 1      # Define number of decision constraints

# Define the upper and lower bounds of the performance objectives
objs_lower_bounds = [-6000, 0, 0, -250, 0]
objs_upper_bounds = [0, 1, 100, 0, 32000]

Then we initialize the algorithm (algorithm) to run over 10,000 function evaluations (nfe) with a starting population of 500 (pop_size):

# initialize the optimization
algorithm = fisheries_game_problem_setup(nVars, nObjs, nCnstr)
nfe = 10000    # number of function evaluations
pop_size = 500    # population size

Storing the Pareto-approximate objectives and decision variables

We are ready to run this (Fisheries) world! But first, we will open two text files where we will store the Pareto-approximate objectives (Fisheries2_objs.txt) and their associated decision variables (Fisheries2_vars.txt). These are the (approximately) optimal performance objective values discovered by PyBorg and the RBF (c_{i}, r_{i}, w_{i}) vectors that give rise to them. We also record the total amount of time it takes to optimize the Fisheries over 10,000 NFEs with a population of 500.

# open file in which to store optimization objectives and variables
f_objs = open('Fisheries2_objs.txt', "w+")
f_vars = open('Fisheries2_vars.txt', "w+")

# get number of algorithm variables and performance objectives
nvars = algorithm.problem.nvars
nobjs = algorithm.problem.nobjs
    
# begin timing the optimization
opt_start_time = time.time()

algorithm = fisheries_game_problem_setup(nVars, nObjs, nCnstr, pop_size=int(pop_size))
algorithm.run(int(nfe))

# get the solution archive
arch = algorithm.archive[:]
for i in range(len(arch)):
    sol = arch[i]
    # write objectives to file
    for j in range(nobjs):
        f_objs.write(str(sol.objectives[j]) + " ")
    # write variables to file
    for j in range(nvars):
        f_vars.write(str(sol.variables[j]) + " ")

    # end each solution's line in both files
    f_objs.write("\n")
    f_vars.write("\n")

# end timing and print optimization time 
opt_end_time = time.time()

opt_total_time = opt_end_time - opt_start_time

f_objs.close()
f_vars.close()

# print the total time to console
print(f"\nTime taken = {opt_total_time} seconds")

The optimization should take approximately 3,100 seconds or 52 minutes. When the optimization is completed, you should be able to locate the Fisheries2_objs.txt and Fisheries2_vars.txt files in the same folder where the Jupyter notebook is stored.

Post-Processing

To ensure that our output can be used in our following steps, we perform post-processing to convert the .txt files into .csv files.

import numpy as np
import pandas as pd

# convert txt files to csv
# load the .txt files as numpy matrices; since each line holds one solution,
# the objectives matrix has shape (n_solns, nObjs) and the variables
# matrix has shape (n_solns, nVars)
matrix_objs = np.genfromtxt('Fisheries2_objs.txt')
matrix_vars = np.genfromtxt('Fisheries2_vars.txt')

# label the objectives and variables
objs_names = ['NPV', 'Pop_Deficit', 'Low_Harvest', 'Worst_Harvest', 'Variance']
var_names = ['c1', 'r1', 'w1', 'c2', 'r2', 'w2']

# Convert the matrices to dataframes with header names
df_objs = pd.DataFrame(matrix_objs, columns=objs_names)
df_vars = pd.DataFrame(matrix_vars, columns=var_names)

# save the processed matrices as csv files
df_objs.to_csv('Fisheries2_objs.csv', sep=',', index=False)
df_vars.to_csv('Fisheries2_vars.csv', sep=',', index=False)

You should now be able to locate the Fisheries2_objs.csv and Fisheries2_vars.csv within the same folder where you store the Jupyter Notebook.

In the following steps, we will introduce the J3 Visualization Software, which takes .csv files as inputs, to visualize and explore the tradeoff space of the fisheries problem.

Introduction to visualization with J3

J3 is an open-source app for producing and sharing high-dimensional, interactive scientific visualizations. It is part of the larger Project Platypus, which is a collection of libraries that aid in decision-making, optimization, and data visualization. It is influenced by D3.js, a JavaScript library for manipulating data using documents (data-driven documents; hence the name). Instead of documents, J3 manipulates data using many-dimensional plots, annotations, and animations.

There is a prior post by Antonia Hadjimichael that covers the Python implementation of J3. In this post, we will be exploring the J3 app itself.

Installing and setting up J3

To use J3, you should first install Java. Please follow the directions found on the official Java site to select the appropriate installation package for your operating system.

Next, you can install J3 in either one of two ways:

  1. Download the .zip file from the J3 GitHub Repository and extract its contents into a desired location on your local machine.
  2. Install using git clone:
cd your-desired-location-path
git clone https://github.com/Project-Platypus/J3.git

You should now see a folder called ‘J3’ located in the path where you chose to extract the repository. Run the J3.exe file within the folder as shown below:

Next, we upload our Fisheries2_objs.csv file into J3:

The GIF below shows a 3D tradeoff plot that demonstrates the function each of the toggles serves. In this 3D plot, the NPV and Harvest Variance are seen on the x- and y-axes, while the Worst-case Harvest is seen on the z-axis. The size of the points represents the Lowest Harvest Instance, and their colors show the Population Size.

Other functions not shown above include:

  1. Zooming in by scrolling on your mouse or trackpad
  2. Deleting the annotations by right-clicking on them
  3. Pressing the ‘esc’ key to de-select a point of interest

Next, we can also generate accompanying 2D-scatter and parallel axis plots to this 3D tradeoff figure:

In the parallel axis plot, the direction of preference is upwards. Here, we can visualize the significant tradeoffs between the net present value of the fisheries’ yield and the prey-population deficit. If stakeholders wish to maximize the economic value of the fisheries, they may experience unsustainable prey-population deficits. The relationship between the remaining objectives is less clear. In J3, you can move the parallel axis plot axes to better visualize the tradeoffs between two objectives:

Here, we observe that there is an additional tradeoff between minimizing the population deficit and maintaining low occurrences of low-harvest events. From this brief picture, we can observe that the main tradeoffs within the Fisheries system are between ecological objectives such as the population deficit and economic objectives such as net present value and harvest.

Note the Brushing tool that appears next to the parallel axis plot. This will be important as we begin our next step, and that is defining our robustness multivariate satisficing criteria.

The multivariate satisficing criteria and identifying robust solutions

The multivariate satisficing criteria are derived from Starr’s domain criterion satisficing measure (Starr, 1963). In Hadjimichael et al. (2020), the multivariate satisficing criteria were selected as they allow the identification of solutions that meet stakeholders’ competing requirements. In the context of Part 2, we use these criteria to identify the solutions in the Pareto-approximate set that satisfy the expectations of stakeholders. Here, the requirements are as follows:

  1. Net present value (NPV) \geq 1,500
  2. Prey-population deficit \leq 0.5
  3. Longest duration of consecutive low harvest \leq 5
  4. Worst harvest instance \geq 50
  5. Harvest variance \leq 1

Using the brushing tool to highlight only the solutions of interest, we find a pared-down version of the Pareto set. This tells us that not all optimal solutions are realistic, feasible, or satisfactory to decision-makers in the Fisheries system.
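If you would like to reproduce this brushing step programmatically, a minimal pandas sketch of the same filter might look like the following (note the assumed sign convention from the objective bounds above: NPV and worst harvest were maximized, so they are stored as negative values in the output files):

import pandas as pd

# load the processed Pareto-approximate objectives
df_objs = pd.read_csv('Fisheries2_objs.csv')

# apply the five satisficing criteria; negate NPV and Worst_Harvest
# since maximized objectives are stored as negative values
satisficing = df_objs[(-df_objs['NPV'] >= 1500) &
                      (df_objs['Pop_Deficit'] <= 0.5) &
                      (df_objs['Low_Harvest'] <= 5) &
                      (-df_objs['Worst_Harvest'] >= 50) &
                      (df_objs['Variance'] <= 1)]

print(f'{len(satisficing)} of {len(df_objs)} solutions meet all criteria.')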

Conclusion

Good job with making it this far! Your accomplishments are many:

  1. You ran a full optimization of the Fisheries Problem.
  2. You downloaded, installed, and learned how to use J3 to visualize and manipulate data to explore the tradeoff space of the Fisheries system.
  3. You learned about Multivariate Satisficing Criteria to identify solution tradeoffs that are acceptable to the stakeholders within the Fisheries system.

In our next post, we will further expand on the concept of the multivariate satisficing criteria and use it to evaluate how two to three of the solutions that initially satisfied stakeholder requirements perform when tested across more challenging scenarios. In the meantime, we recommend that you explore the use of J3 on alternative datasets as well, and see if you can come up with an interesting narrative based on your data!

Until then, happy visualizing!

References

Giuliani, M., Castelletti, A., Pianosi, F., Mason, E. and Reed, P., 2016. Curses, Tradeoffs, and Scalable Management: Advancing Evolutionary Multiobjective Direct Policy Search to Improve Water Reservoir Operations. Journal of Water Resources Planning and Management, 142(2).

Hadjimichael, A., Reed, P. and Quinn, J., 2020. Navigating Deeply Uncertain Tradeoffs in Harvested Predator-Prey Systems. Complexity, 2020, pp.1-18.

Starr, M., 1963. Product design and decision theory. Journal of the Franklin Institute, 276(1), p.79.

Bivariate choropleth maps

Choropleth maps are a ubiquitous tool in geospatial analysis where color is used to visualize the spatial variability of data (e.g., coloring counties based on temperature, GDP, or election results). Bivariate choropleth maps are a less common variant that use 2-dimensional color palettes to show two different types of data at once. This can be a powerful tool for visualizing spatial correlations and clustering patterns between two related variables.

This blog post will demonstrate how to create a bivariate choropleth map in Python. Data and code for this post can be found in this GitHub repository.

CalEnviroScreen dataset

CalEnviroScreen is a valuable dataset from the California Office of Environmental Health Hazard Assessment which seeks to “identify California communities that are most affected by many sources of pollution, and where people are often especially vulnerable to pollution’s effects.” It provides census tract-level data on a variety of pollution sources (ozone, PM2.5, drinking water contamination, traffic impacts, etc.) and population characteristics (asthma, cardiovascular disease, education, poverty, etc.). The dataset also includes more general aggregated indicators for “Pollution Burden” and “Population Characteristics,” and finally an overall indicator of pollution vulnerability which is the product of the previous two indicators. This data is used by various state agencies to locate disadvantaged communities that can be targeted for environmental justice programs and investments. The CalEnviroScreen 4.0 dataset is available either as an online mapping platform or a downloadable shapefile – the latter is used for this blog post.

Creating univariate choropleth maps using geopandas

This analysis requires the following Python packages: Glob for grabbing files, Matplotlib for general plotting, GenerativePy and Pillow (PIL) for manipulating colors, Geopandas for handling geospatial data, and Contextily for importing basemaps.

import glob
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
from matplotlib.collections import PatchCollection
from matplotlib.colors import rgb2hex
from generativepy.color import Color
from PIL import ImageColor
import geopandas as gpd
import contextily as cx

Once we have imported the necessary libraries, we can read the data into a Geopandas dataframe. Geopandas is a great package which allows us to read in and manipulate shapefiles as dataframes that act very similarly to Pandas dataframes while maintaining the spatial context.

### Read in CalEnviroScreen shapefile, from https://oehha.ca.gov/calenviroscreen/maps-data/download-data
shapefiles = glob.glob('*/*.shp')
ces = gpd.read_file(shapefiles[0])

If we print the columns of the dataframe, we see a large list of variables related to pollution burden and population characteristics (e.g., Ozone, Poverty) and aggregated indicators (PolBurd, PopChar, CIscore), as well as percentile scales for most of these variables (e.g., PolBurdP). The specific meanings of each variable can be found in the documentation. Several geospatial attributes are also included among the columns (e.g., Shape_Leng, geometry).

Index(['Tract', 'ZIP', 'County', 'ApproxLoc', 'TotPop19', 'CIscore',
       'CIscoreP', 'Ozone', 'OzoneP', 'PM2_5', 'PM2_5_P', 'DieselPM',
       'DieselPM_P', 'Pesticide', 'PesticideP', 'Tox_Rel', 'Tox_Rel_P',
       'Traffic', 'TrafficP', 'DrinkWat', 'DrinkWatP', 'Lead', 'Lead_P',
       'Cleanup', 'CleanupP', 'GWThreat', 'GWThreatP', 'HazWaste', 'HazWasteP',
       'ImpWatBod', 'ImpWatBodP', 'SolWaste', 'SolWasteP', 'PollBurd',
       'PolBurdSc', 'PolBurdP', 'Asthma', 'AsthmaP', 'LowBirtWt', 'LowBirWP',
       'Cardiovas', 'CardiovasP', 'Educatn', 'EducatP', 'Ling_Isol',
       'Ling_IsolP', 'Poverty', 'PovertyP', 'Unempl', 'UnemplP', 'HousBurd',
       'HousBurdP', 'PopChar', 'PopCharSc', 'PopCharP', 'Child_10',
       'Pop_10_64', 'Elderly65', 'Hispanic', 'White', 'AfricanAm', 'NativeAm',
       'OtherMult', 'Shape_Leng', 'Shape_Area', 'AAPI', 'geometry'],
      dtype='object')

After reading in the dataframe, it is straightforward to loop over our three aggregated indicator percentiles and create a separate choropleth map for each. Note that the data contain some holes, which are coded as -9999, so we only plot census tracts with values >= 0.

label_dict = {'CIscoreP': 'CalEnviroScreen score percentile', 
              'PolBurdP': 'Pollution burden percentile', 
              'PopCharP': 'Population characteristics percentile'}
alpha=0.85
dpi=300

for attr, label in label_dict.items():
    fig, ax = plt.subplots(1,1, figsize=(7,10))
    ces.loc[ces[attr] >= 0].plot(attr, ax=ax, alpha=alpha, vmin=0, legend=True, 
             legend_kwds={'label':label, 'orientation':'horizontal', 'shrink': 0.5, 'pad': 0.03, 'alpha':alpha})
    cx.add_basemap(ax = ax, crs=ces.crs, source=cx.providers.Stamen.TonerLite)
    _ = ax.set_xticks([])
    _ = ax.set_yticks([])
    plt.savefig(f'figs/{attr}.jpg', dpi=dpi, bbox_inches='tight')

Consider first the aggregated CalEnviroScreen Index. I have also included a zoomed version focusing on the Los Angeles/Orange County area (see repository for code).

Readers familiar with California’s environmental issues probably aren’t too surprised by the general pattern: the regions rated highest on the CalEnviroScreen Indicator are heavily concentrated in the Central Valley and the LA-Long Beach area. What about the underlying Pollution Burden and Population Characteristics Scores?

In general, we find that both pollution burden and vulnerable populations tend to be co-located in the Central Valley and the LA-Long Beach area. However, the data are not identical, and there are a variety of reasons why policymakers, public health experts, and researchers may want to compare the two variables and investigate their spatial correlation and clustering.

Creating a bivariate choropleth map

Is there a more effective way to compare and contrast the two variables than looking back and forth between the two figures? Enter the bivariate choropleth map. The goal is to use a two-dimensional colormap that blends two color scales together based on two sets of data. For example, a blue and red colormap would have white or grey in the lower left corner, representing low values for each variable. The upper left and lower right corners would be blue and red, representing a high value for one variable and a low value for the other. The upper right corner, where both variables take high values, would be purple, a blending of red and blue. Intermediate parts of the space can be blended to be more blue or red depending on the underlying variable values. This can be a continuous color scale, but a discrete scale with 3-5 classes per variable is more common and generally easier to decipher. See this blog post for some examples.

The code below is used to create such a 2D color grid. I chose to use the pink/cyan color map from the post cited above, which has nice contrast and avoids the good/bad connotations of the blue/red map. I retrieved the hex codes for the four corner colors from the same post. We convert each hex color to RGB, and then to a Color object from the Generativepy package. This package provides the functionality for linear interpolation (“lerping”) of colors, which we use to create a 4×4 grid of blended colors representing the quartiles of each index. Lastly, we convert everything back to hex, which is easier to use with Geopandas.

### percentile bounds defining upper boundaries of color classes
percentile_bounds = [25, 50, 75, 100]

### function to convert hex color to rgb to Color object (generativepy package)
def hex_to_Color(hexcode):
    rgb = ImageColor.getcolor(hexcode, 'RGB')
    rgb = [v/256 for v in rgb]
    rgb = Color(*rgb)
    return rgb

### get corner colors from https://www.joshuastevens.net/cartography/make-a-bivariate-choropleth-map/
c00 = hex_to_Color('#e8e8e8')
c10 = hex_to_Color('#be64ac')
c01 = hex_to_Color('#5ac8c8')
c11 = hex_to_Color('#3b4994')

### now create square grid of colors, using color interpolation from generativepy package
num_grps = len(percentile_bounds)
c00_to_c10 = []
c01_to_c11 = []
colorlist = []
for i in range(num_grps):
    c00_to_c10.append(c00.lerp(c10, 1/(num_grps-1) * i))
    c01_to_c11.append(c01.lerp(c11, 1/(num_grps-1) * i))
for i in range(num_grps):
    for j in range(num_grps):
        colorlist.append(c00_to_c10[i].lerp(c01_to_c11[i], 1/(num_grps-1) * j))

### convert back to hex color
colorlist = [rgb2hex([c.r, c.g, c.b]) for c in colorlist]

Having defined our percentile bounds and color grid, we can assign a color to each census tract based on its Pollution Burden and Population Characteristics Percentiles, and then create a map using Geopandas as before. The 2D color grid legend can be created on an inset axis using Matplotlib’s patch functionality.

### function to get bivariate color given two percentiles
def get_bivariate_choropleth_color(p1, p2):
    if p1>=0 and p2>=0:
        count = 0
        stop = False
        for percentile_bound_p1 in percentile_bounds:
            for percentile_bound_p2 in percentile_bounds:
                if (not stop) and (p1 <= percentile_bound_p1):
                    if (not stop) and (p2 <= percentile_bound_p2):
                        color = colorlist[count]
                        stop = True
                count += 1
    else:
        color = [0.8,0.8,0.8,1]
    return color

### plot map based on bivariate choropleth
fig, ax = plt.subplots(1,1, figsize=(8,10))
attr = 'CIscoreP'
ces['color_bivariate'] = [get_bivariate_choropleth_color(p1, p2) for p1,p2 in zip(ces['PolBurdP'].values, ces['PopCharP'].values)]
ces.loc[ces[attr] >= 0].plot(ax=ax, color=ces.loc[ces[attr] >= 0]['color_bivariate'], alpha=alpha, legend=False)
cx.add_basemap(ax = ax, crs=ces.crs, source=cx.providers.Stamen.TonerLite)
ax.set_xticks([])
ax.set_yticks([])

### now create inset legend
ax = ax.inset_axes([0.6,0.6,0.35,0.35])
ax.set_aspect('equal', adjustable='box')
count = 0
xticks = [0]
yticks = [0]
for i,percentile_bound_p1 in enumerate(percentile_bounds):
    for j,percentile_bound_p2 in enumerate(percentile_bounds):
        percentileboxes = [Rectangle((i,j), 1, 1)]
        pc = PatchCollection(percentileboxes, facecolor=colorlist[count], alpha=alpha)
        count += 1
        ax.add_collection(pc)
        if i == 0:
            yticks.append(percentile_bound_p2)
    xticks.append(percentile_bound_p1)

_=ax.set_xlim([0,len(percentile_bounds)])
_=ax.set_ylim([0,len(percentile_bounds)])
_=ax.set_xticks(list(range(len(percentile_bounds)+1)), xticks)
_=ax.set_xlabel('Pollution burden percentile')
_=ax.set_yticks(list(range(len(percentile_bounds)+1)), yticks)
_=ax.set_ylabel('Population characteristics percentile')

plt.savefig(f'figs/bivariate.jpg', dpi=dpi, bbox_inches='tight')

The bivariate color palette allows us to distinguish at a more granular level how census tracts compare across the two variables. Some regions are found to exhibit high pollution burden but less vulnerable communities (pink), while others display the opposite pattern (cyan). Those that score highly on both pollution burden and population characteristics (deep blue/purple) would be considered the most vulnerable to pollution and the most in need of pollution-related state programs and investments.

A Hidden-Markov Modeling Based Approach to Creating Synthetic Streamflow Scenarios

As Dave mentioned in his prior post, our recently published eBook, Addressing Uncertainty in Multisector Dynamics Research, provides several interactive tutorials for hands-on training in model diagnostics and uncertainty characterization. This blog post is a preview of an upcoming extension of these trainings featuring an interactive tutorial on creating synthetic streamflow scenarios using a Hidden Markov Model (HMM)-based approach, specifically for the Upper Colorado River Basin. This training heavily utilizes Julie Quinn’s prior tutorial on fitting an HMM-based streamflow generator and is inspired by her Earth’s Future paper on understanding if exploratory modeling can truly be scenario-neutral.

In this post, we will primarily be focusing on decadal hydrologic drought in the region as a metric of interest to explore in the historical period and in the stationary and non-stationary synthetic ensembles. We will also explore how to place CMIP5 climate projections in the context of our synthetically-created ensembles. All code for this demo is available in a Jupyter Notebook on GitHub that follows the progression of this blog step by step (though some content may be subject to change before going into the eBook formally). The code is written in Python.

Background

In the Western United States, and particularly the Colorado River Basin, recent studies have used tree-ring reconstructions to suggest that the megadrought that has been occurring in the Southwest over the past 22 years is the region’s worst drought since about 800 AD (Williams et al., 2022). The study’s lead author, UCLA climatologist Park Williams, suggested that had the sequence of wet-dry years occurred as observed but without the human-caused drying trend, the 2000s would have likely still been dry, but not on the same level as the worst of the last millennium’s megadroughts.

Lake Powell shows persistent effects from drought (Source: U.S. Bureau of Reclamation)

The recent trend of warming and reduced soil moisture in the Southwest is becoming extremely problematic from a water systems planning and management perspective for the Colorado River Basin. It has become rapidly clear that the river is completely over-allocated and won’t be able to sustain flow requirements as dictated by the Colorado Compact. Thus, there has been an increasing focus on understanding how susceptible water systems in this region are to plausible future streamflow scenarios, particularly those that may contain persistent droughts. In this tutorial, we’ll discuss how to create these scenarios using a Hidden Markov Model (HMM)-based synthetic generator.

Observed Data

First let’s take a look at the observed data from 1909-2013 for the outlet gauge of the Upper Colorado River that is located at the CO-UT stateline. Below we create a plot of the annual streamflow. Let’s also add an 11-year rolling mean which will give us a sense of long-term averaged behavior.

The Colorado Compact prescribing flows between the Upper and Lower Colorado Basins was negotiated using data prior to 1922, which we now know was one of the consistently wetter periods in the record. It’s clear today that since the 1980s, the Southwest has been experiencing ongoing aridification and that this observed record alone isn’t an accurate representation of what future climate might look like in this region.

Let’s get a little more specific and formally quantify decadal drought risk in the observed period. We use a metric proposed in Ault et al. (2014). The authors define a decadal drought as a streamflow value that is more than half a standard deviation below the 11-year running mean of the available record.

By this metric, the Upper Colorado Basin region has experienced two decadal droughts in the past 100 years.
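For concreteness, a minimal sketch of this check might look like the following (assuming annual_flow is a pandas Series of annual flows indexed by year; the variable name is hypothetical):

import pandas as pd

# 11-year centered rolling mean of annual flow
rolling_mean = annual_flow.rolling(window=11, center=True).mean()

# flag years whose flow falls more than half a standard deviation
# below the 11-year running mean
drought_years = annual_flow.index[annual_flow < rolling_mean - 0.5 * annual_flow.std()]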

Synthetic Stationary Generator to Understand Natural Variability

It is important to remember that the historical record of the region is only one instance of streamflow, which ultimately comes from an inherently stochastic and chaotic system (the atmosphere). We require a tool that will allow us to see multiple plausible realizations of that same variability. The tool that we use to develop synthetic flows for the region is a Gaussian Hidden Markov Model. If a system follows a Markov process, it switches between a number of “hidden states” dictated by a transition matrix. Each state has its own Gaussian probability distribution (defined by a mean and standard deviation) and one can draw from this distribution to create synthetic flows that fit the properties of the historical distribution. HMMs are an attractive choice for this region because they can simulate long persistence, particularly long droughts, which is an important consideration for water systems vulnerabilities. The figure below shows an example of a 2-state Gaussian HMM that we will fit for the region.

At this point, I will direct readers to either Julie’s blog or my own Jupyter notebook, which have all the details and code required to fit the model and investigate the parameters. Now let’s create a sample 105-year synthetic realization and see what the drought dynamics look like in the synthetic scenario that we created.
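For readers who want a feel for the fitting step without leaving this post, here is a minimal sketch using the hmmlearn package (assuming Q is a numpy array of annual flows; variable names are illustrative):

import numpy as np
from hmmlearn.hmm import GaussianHMM

# fit a 2-state Gaussian HMM to log-space annual flows
logQ = np.log(Q).reshape(-1, 1)
model = GaussianHMM(n_components=2, n_iter=1000).fit(logQ)

# inspect the fitted state means, variances, and transition matrix
print(model.means_, model.covars_, model.transmat_)

# draw a synthetic 105-year trace (returned in log space)
synthetic_logQ, hidden_states = model.sample(105)
synthetic_Q = np.exp(synthetic_logQ)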

Wow, it looks like we’ve created something like a megadrought-type situation by just sampling within the bounds of the historical distribution! You can keep sampling from the model to create more 105-year traces and note how the location and number of decadal droughts changes. Maybe you will experience two as in the historical period, or fewer, or more. Re-sampling from the model will demonstrate how different 105-year traces can look just within the range of natural variability and that what we’ve observed historically is only one trace. It’s also important to remember that when droughts occur can also define the ultimate effect of the drought (i.e., does it occur at a time of large population growth, or at a time when humans can adapt by conserving water or building more infrastructure?). A hydrologic drought need not manifest into an agricultural drought of the same magnitude if stored surface water is available, for example.

Non-Stationary Synthetic Generator to Impose Climate Changes

Now, we want to be able to create flows under non-stationary conditions to get a better understanding of what flows can look like under climate changes. To do so, we can toggle the parameters of the HMM in order to create systematic changes to the model that can represent a changing climate. The HMM has six parameters that define it. In the historical distribution, we fit a baseline value for these parameters. In this non-stationary generator, we define a range to sample these parameters from, shown in the table below.

| Parameter | Current Value | Lower Bound | Upper Bound |
|---|---|---|---|
| Log-Space Wet State Mean Multiplier | 1.00 | 0.98 | 1.02 |
| Log-Space Dry State Mean Multiplier | 1.00 | 0.98 | 1.02 |
| Log-Space Wet State Standard Deviation Multiplier | 1.00 | 0.75 | 1.25 |
| Log-Space Dry State Standard Deviation Multiplier | 1.00 | 0.75 | 1.25 |
| Change in Dry-Dry Transition Probability | 0.00 | -0.30 | +0.30 |
| Change in Wet-Wet Transition Probability | 0.00 | -0.30 | +0.30 |

Now refer to the Jupyter notebook for the code for the following steps. We first sample from these parameter ranges 1000 times using the Latin hypercube sampling functionality within SALib, as sketched below. We then swap out the baseline parameters with the newly sampled parameters and sample from the model. Below is an example trace from the new non-stationary model in which we are generating more instances of decadal drought.
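A rough sketch of that sampling step follows (the parameter names are illustrative; the bounds come from the table above):

from SALib.sample import latin

# define the 6 HMM parameter adjustments and their sampling ranges
problem = {
    'num_vars': 6,
    'names': ['wet_mean_mult', 'dry_mean_mult',
              'wet_std_mult', 'dry_std_mult',
              'dry_dry_delta', 'wet_wet_delta'],
    'bounds': [[0.98, 1.02], [0.98, 1.02],
               [0.75, 1.25], [0.75, 1.25],
               [-0.30, 0.30], [-0.30, 0.30]]
}

# draw 1000 Latin hypercube samples of the parameter adjustments
param_samples = latin.sample(problem, 1000)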

Placing Non-Stationary Flows in the Context of CMIP5 Projections

By creating a non-stationary generator, we can broaden the drought conditions that we are creating, which can be very useful for understanding how our water systems model performs under potentially extreme scenarios. However, it’s useful to place our synthetically generated flows in the context of physically-driven CMIP5 projections to get a better understanding of how the two approaches compare. This example makes use of 97 CMIP5 projections used in the Colorado River Water Availability Study (CWCB, 2012). In each of these projections, monthly precipitation factor changes and temperature delta changes were computed between mean projected 2035–2065 climate statistics and mean historical climate statistics from 1950–2013. These 97 different combinations of 12 monthly precipitation multipliers and 12 monthly temperature delta shifts were applied to historical precipitation and temperature time series from 1950–2013. The resulting climate time series were run through a Variable Infiltration Capacity (VIC) model of the UCRB, resulting in 97 time series of projected future streamflows at the Colorado‐Utah state line.

We can fit an HMM to each trace of projected streamflow and retain the fitted parameters. Then we take the ratio between these parameters and the baseline HMM parameters in order to ultimately have a set of multipliers associated with each CMIP5 projection. This allows us to place the CMIP5 projections into the same space that we are sampling for our synthetic realizations.
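A sketch of this step is shown below, using hmmlearn’s GaussianHMM as in the stationary generator; the flow series here are random placeholders so the snippet runs on its own, and the exact bookkeeping in the notebook may differ.

# Sketch: fit a 2-state Gaussian HMM to each CMIP5 flow trace and express its
# parameters as multipliers/deltas relative to the baseline (historical) HMM.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
historical_flows = np.exp(rng.normal(15.0, 0.3, 105))                 # placeholder annual flows
cmip5_projections = [np.exp(rng.normal(15.0, 0.3, 64)) for _ in range(97)]  # placeholder traces

def fit_hmm_params(log_flows):
    """Fit a 2-state HMM and return (means, stds, transition matrix), dry state first."""
    hmm = GaussianHMM(n_components=2, n_iter=1000).fit(log_flows.reshape(-1, 1))
    means = hmm.means_.flatten()
    stds = np.sqrt(np.array(hmm.covars_).flatten())
    order = np.argsort(means)                       # state 0 = dry (lower mean)
    return means[order], stds[order], hmm.transmat_[order][:, order]

base_means, base_stds, base_tp = fit_hmm_params(np.log(historical_flows))

cmip5_multipliers = []
for proj_flows in cmip5_projections:
    means, stds, tp = fit_hmm_params(np.log(proj_flows))
    cmip5_multipliers.append([means[1] / base_means[1],   # wet state mean multiplier
                              means[0] / base_means[0],   # dry state mean multiplier
                              stds[1] / base_stds[1],     # wet state std multiplier
                              stds[0] / base_stds[0],     # dry state std multiplier
                              tp[0, 0] - base_tp[0, 0],   # change in dry-dry probability
                              tp[1, 1] - base_tp[1, 1]])  # change in wet-wet probability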

In order to visualize these two ensembles in the same space, we plot a response surface that allows us to map how combinations of HMM parameters tend to lead to increased decadal drought occurrence. In order to get a continuous surface, we’ll fit a non-linear regression that ultimately maps the parameter values to decadal drought occurrence over a set of grid points. This response surface is shown by the color gradient below.

We choose two parameters, the dry-dry transition probability and the dry mean, that would most likely influence the decadal drought occurrence and create the response surface (note that a more formal sensitivity analysis can be used to identify the top most influential parameters as well). From the response surface, you can see that we’re more likely to see more instances of decadal droughts when we (1) increase the dry-dry transition probability, which inherently will increase persistence of the dry state, and (2) when we make the dry state log mean drier. We also place those multipliers from the CMIP5 projections into the space as white dots. Note that these CMIP5 projections tend to span the extent of the dry mean sample space, but are less representative of the dry transition probability sample space, particularly the increase in the dry-dry transition probability which leads to more persistent droughts. This suggests that the types of hydrological droughts represented in the projections tend to only be wetter to slightly drier than our baseline.
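The regression used in the notebook may differ, but the sketch below shows one simple way to build such a surface: a quadratic regression fit with scikit-learn, evaluated on a grid of the two chosen parameters. The inputs here are placeholder data so the snippet runs on its own; in practice X would hold the sampled parameter values and y the decadal drought counts from the synthetic traces.

# Sketch: quadratic response surface mapping two HMM parameters to decadal drought occurrence
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = np.column_stack([rng.uniform(-0.3, 0.3, 1000),     # change in dry-dry transition probability
                     rng.uniform(0.98, 1.02, 1000)])   # dry state mean multiplier
y = 1.0 + 4.0 * X[:, 0] - 40.0 * (X[:, 1] - 1.0) + rng.normal(0.0, 0.3, 1000)  # placeholder counts

poly = PolynomialFeatures(degree=2)
reg = LinearRegression().fit(poly.fit_transform(X), y)

# Evaluate the fitted surface on a regular grid and plot it as the background color gradient
xg, yg = np.meshgrid(np.linspace(-0.3, 0.3, 100), np.linspace(0.98, 1.02, 100))
zg = reg.predict(poly.transform(np.column_stack([xg.ravel(), yg.ravel()]))).reshape(xg.shape)

plt.pcolormesh(xg, yg, zg, cmap='RdBu_r')
plt.colorbar(label='Decadal drought occurrence')
plt.xlabel('Change in dry-dry transition probability')
plt.ylabel('Dry state mean multiplier')
plt.show()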

Both methods of producing these scenarios are valid, though if your goal is to produce a variety of ensembles that are characterized by many different drought characteristics, you will likely find that a generator approach will serve this purpose better.

Concluding Remarks

If you have any questions about the notebook or feedback as you go through the tutorial, please post below in the comments section or email me. We love getting as much feedback as possible before publication in the eBook.

References

Ault, T. R., Cole, J. E., Overpeck, J. T., Pederson, G. T., & Meko, D. M. (2014). Assessing the risk of persistent drought using climate model simulations and paleoclimate data. Journal of Climate, 27(20), 7529-7549.

CWCB (2012). Colorado River Water Availability Study Phase I Report. Colorado Water Conservation Board.

Williams, A. P., Cook, B. I., & Smerdon, J. E. (2022). Rapid intensification of the emerging southwestern North American megadrought in 2020–2021. Nature Climate Change, 12(3), 232-234.

Time-evolving scenario discovery for infrastructure pathways

Our recently published eBook, Addressing Uncertainty in Multisector Dynamics Research, provides several interactive tutorials for hands on training in model diagnostics and uncertainty characterization. This blog post is a preview of an upcoming extension of these trainings featuring an interactive tutorial on time-evolving scenario discovery for the development of adaptive infrastructure pathways for water supply planning. This post builds off the prior tutorial on gradient-boosted trees for scenario discovery.

I’ll first introduce a stylized water supply test case featuring two water utilities seeking to develop a cooperative infrastructure investment and management policy over a 45-year planning horizon. I’ll then demonstrate how the utilities can explore evolving vulnerability across the planning period. All code for this demo is available on GitHub. The code is written in Python, but the workflow is model agnostic and can be paired with simulation models in any language.

Background

The Bedford-Greene metropolitan area (Figure 1) is a stylized water resources test case containing two urban water utilities seeking to develop an infrastructure investment and management strategy to confront growing demands and a changing climate. The utilities have agreed to jointly finance and construct a new water treatment plant on Lake Classon, a large regional resource. Both utilities have also identified a set of individual infrastructure options to construct if necessary.

Figure 1: The Bedford-Greene metropolitan area

The utilities have formulated a cooperative and adaptive regional water supply management strategy that uses a risk-of-failure (ROF) metric to trigger both short-term drought mitigation actions (water use restrictions and treated transfers between utilities) and long-term infrastructure investment decisions (Figure 2a). ROFs represent a dynamic measure of the utilities’ evolving capacity-to-demand ratios. Both utilities have specified a set of ROF thresholds to trigger drought mitigation actions and plan to actively monitor their short-term ROF on a weekly basis. When a utility’s ROF crosses a specified threshold, the utility will implement drought mitigation actions in the following week. The utilities will also monitor long-term ROF on an annual basis, and trigger infrastructure investment if long-term risk crosses a threshold dictated by the policy. The utilities have also specified a construction order for available infrastructure options.

Figure 2: a) the regional infrastructure investment and management policy. b) adaptive pathways generated across 2,000 deeply uncertain SOWs

The utilities performed a Monte Carlo simulation to evaluate how this policy responds to a wide array of future states of the world (SOWs), each representing a different sample of uncertainties including demand growth rates, changes to streamflows, and financial variables.

The ROF-based policies respond to each SOW by generating a unique infrastructure pathway – a sequence of infrastructure investment decisions over time. Infrastructure pathways across 2,000 SOWs are shown in Figure 2b. Three clusters summarizing infrastructure pathways are plotted as green lines which represent the median week that options are triggered. The frequency that each option is triggered across all SOWs is plotted as the shading behind the lines. Bedford relies on the Joint Water Treatment facility and short-term measures (water use restrictions and transfers) to maintain supply reliability. Greene constructs the Fulton Creek reservoir in addition to the Joint Treatment plant and does not strongly rely on short-term measures to combat drought.

The utilities are now interested in evaluating the robustness of their proposed policy, characterizing how uncertainties generate vulnerability and understanding how this vulnerability may evolve over time.

Time-evolving robustness

To measure the robustness of the infrastructure investment and management policy, the two utilities employ a satisficing metric, which measures the fraction of SOWs where the policy is able to meet a set of performance criteria. The utilities have specified five performance criteria that measure the policy’s ability to maintain both supply reliability and financial stability. Performance criteria are shown in Table 1.

| Performance criteria | Failure threshold |
| --- | --- |
| Reliability | < 99% |
| Restriction frequency | > 20% |
| Worst-case cost | > 10% annual revenue |
| Peak financial cost | > 80% annual revenue |
| Stranded assets | > $5/kgal unit cost of expansion |

Table 1: Satisficing criteria
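To make the satisficing calculation concrete, here is a minimal sketch, assuming the simulated performance of each SOW is stored in a NumPy array with columns ordered as in Table 1; the array layout and thresholds’ units are assumptions for illustration, not the workflow’s actual code.

# Sketch: satisficing robustness = fraction of SOWs meeting all five criteria.
# Assumes perf is an (n_SOW, 5) array with columns [reliability, restriction
# frequency, worst-case cost, peak financial cost, stranded assets],
# evaluated over a given planning horizon.
import numpy as np

def satisficing(perf):
    meets = ((perf[:, 0] >= 0.99) &   # reliability at least 99%
             (perf[:, 1] <= 0.20) &   # restrictions used in at most 20% of years
             (perf[:, 2] <= 0.10) &   # worst-case cost <= 10% of annual revenue
             (perf[:, 3] <= 0.80) &   # peak financial cost <= 80% of annual revenue
             (perf[:, 4] <= 5.0))     # stranded assets <= $5/kgal unit cost
    return meets.mean()

# robustness over time: satisficing(perf_10yr), satisficing(perf_22yr), satisficing(perf_45yr)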

Figure 3 shows the evolution of robustness over time for the two utilities. While the cooperative policy is very robust after the first ten years of the planning horizon, it degrades sharply for both utilities over time. Bedford meets the performance criteria in nearly 100% of sampled SOWs after the first 10 years of the planning horizon, but its robustness is reduced to around 30% by the end of the 45-year planning period. Greene has a robustness of over 90% after the first 10 years and degrades to roughly 60% after 45 years. These degradations suggest that the cooperative infrastructure investment and management policy is insufficient to successfully maintain long-term performance in challenging future scenarios. But what is really going on here? The robustness metric aggregates performance across the five criteria shown in Table 1, giving us a general picture of evolving performance, but leaving questions about the nature of the utilities’ vulnerability.

Figure 3: Robustness over time

Figure 4 provides insight into how utility vulnerability evolves over time, showing the fraction of failure SOWs that can be attributed to each performance criterion when performance is measured in the near term (next 10 years), mid-term (next 22 years), and long term (next 45 years). The vulnerability of the two utilities evolves in very different ways over the planning period. Early in the planning period, all of Bedford’s failures can be attributed to supply reliability. As the planning horizon progresses, Bedford’s failures diversify into failures in restriction frequency and worst-case drought management cost, indicating that the utility is generally unable to manage future drought. Bedford likely needs more infrastructure investment than is specified by the policy to maintain supply reliability.

In contrast to Bedford’s performance, Greene begins with vulnerability to supply reliability, but its vulnerability shifts over time to become dominated by failures in peak financial cost and stranded assets – measures of the utility’s financial stability. This shift indicates that while the infrastructure investments specified by the cooperative policy mitigate supply failures by the end of the planning horizon, these investments drive the utility into financial failure in many future scenarios.

Figure 4: Failures over time
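The attribution shown in Figure 4 can be tabulated from the same performance array: among SOWs that fail at least one criterion, count how often each individual criterion is violated. The sketch below shows one simple way to do this; the actual attribution rule behind Figure 4 may differ (for example, by assigning each failure to a single dominant criterion).

# Sketch: among failure SOWs, the share in which each criterion is violated.
# Assumes the same (n_SOW, 5) perf array described above.
import numpy as np

def failure_attribution(perf):
    violated = np.column_stack([perf[:, 0] < 0.99,   # reliability
                                perf[:, 1] > 0.20,   # restriction frequency
                                perf[:, 2] > 0.10,   # worst-case cost
                                perf[:, 3] > 0.80,   # peak financial cost
                                perf[:, 4] > 5.0])   # stranded assets
    failures = violated.any(axis=1)
    return violated[failures].mean(axis=0)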

Factor mapping and factor ranking

To understand how and why vulnerability evolves over time, we perform factor mapping. Figure 5 below shows the uncertainty space projected onto the two most influential factors for Bedford across three planning horizons. Each point represents a sampled SOW; red points represent SOWs that resulted in failure, while white points represent SOWs that resulted in success. The color in the background shows the regions of predicted success and failure from the boosted trees classification.

Figure 5 indicates that Bedford’s vulnerability is primarily driven by rapid and sustained demand growth, and that this vulnerability increases over time. When evaluated using a 22-year planning horizon, the utility only appears vulnerable to extreme values of near-term demand growth combined with low values of restriction effectiveness. This indicates that the utility is relying on restrictions to mitigate supply failures, and is vulnerable when they do not work as anticipated. When evaluated over the 45-year planning horizon, Bedford’s failure is driven by rapid and sustained demand growth. If near-term demand grows faster than anticipated (scaling factor > 1.0 on the horizontal axis), the utility will likely fail to meet its performance criteria. If near-term demand is lower than anticipated, the utility may still fail to meet performance criteria under conditions of high mid-term demand growth. These results provide further evidence that the infrastructure investment and management policy is insufficient to meet Bedford’s long-term water supply needs.

Figure 5: Time-evolving factor maps for Bedford

Greene’s vulnerability (Figure 6) evolves very differently from Bedford’s. Greene is vulnerable to high-demand scenarios in the near term, indicating that its current infrastructure is insufficient to meet rapidly growing demands. Greene can avoid this failure under scenarios where the construction permitting time multiplier is lowest, indicating that new infrastructure investment can meet the utility’s near-term supply needs. When evaluated across a 22-year planning horizon, the utility fails when near-term demand is high and restriction effectiveness is low, a similar failure mode to Bedford. However, the 22-year planning horizon reveals a second failure mode – low demand growth. This failure mode is responsible for the stranded assets failures shown in Figure 4. This failure mode becomes more prominent when evaluated across the 45-year planning horizon, and is largely driven by low-demand futures in which the utility does not generate the revenue to cover the debt service payments needed to fund infrastructure investment.

Figure 6: Time-evolving factor maps for Greene

The factor maps in Figures 5 and 6 only show the two most influential factors determined by gradient boosted trees, however, the utilities are vulnerable to other sampled uncertainties. Figure 7 shows the factor importance as determined by gradient boosted trees for both utilities across the three planning horizons. While near-term demand growth is important for both utilities under all three planning horizons, the importance of other factors evolves over time. For example, restriction effectiveness plays an important role for Greene under the 22-year planning horizon but disappears under the 45-year planning horizon. In contrast, the bond interest rate is important for predicting success over the 45-year planning horizon, but does not appear important over the 10- or 22-year planning horizons. These findings highlight how assumptions about the planning period can have a large impact on modeling outcomes.

Figure 7: Factor rankings over time
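As a rough illustration of how rankings like those in Figure 7 can be generated, the sketch below fits scikit-learn’s gradient boosted classifier to SOW samples and success/failure labels and extracts the relative factor importances. The data here are placeholders so the snippet runs on its own; the study’s actual sampled factors, outcomes, and hyperparameters may differ.

# Sketch: factor ranking with gradient boosted trees (placeholder data; swap in
# the real SOW samples and success/failure outcomes for a given horizon).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
factor_names = ['near-term demand growth', 'mid-term demand growth',
                'restriction effectiveness', 'bond interest rate',
                'permitting time', 'streamflow scaling']
uncertainty_samples = rng.uniform(size=(2000, len(factor_names)))   # placeholder SOWs
success = uncertainty_samples[:, 0] < 0.7                           # placeholder outcomes

gbc = GradientBoostingClassifier(n_estimators=250, learning_rate=0.1, max_depth=3)
gbc.fit(uncertainty_samples, success)

# Relative factor importances (the quantity plotted in Figure 7)
for name, imp in sorted(zip(factor_names, gbc.feature_importances_),
                        key=lambda pair: pair[1], reverse=True):
    print(f'{name}: {imp:.3f}')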

Fisheries Training Part 1 – Harvest Optimization and MOEA Diagnostics

Welcome to the second post in the Fisheries Training Series, in which we are studying decision making under deep uncertainty within the context of a complex harvested predator-prey fishery. The accompanying GitHub repository, containing all of the source code used throughout this series, is available here. The full, in-depth Jupyter Notebook version of this post is available in the repository as well.

This post builds off of the initial post, Fisheries Training 0: Exploring Predator-Prey Dynamics, and presents the following:

  1. A brief re-cap of the harvested predator-prey model
  2. Formulation of the harvesting policy and an overview of radial basis functions (RBFs)
  3. Formulation of the policy objectives
  4. A simulation model for the harvested system
  5. Optimization of the harvesting policy using the PyBorg MOEA
    • Installation of Platypus and PyBorg*
    • Optimization problem formulation
    • Basic MOEA diagnostics

Note
*The PyBorg MOEA used in this demonstration is derived from the Borg MOEA and may only be used with permission from its creators. Fortunately, it is freely available for academic and non-commercial use. Visit BorgMOEA.org to request access.

Now, onto the tutorial!

Harvested predator-prey model

In the previous post, we introduced a modified form of the Lotka-Volterra system of ordinary differential equations (ODEs) defining predator-prey population dynamics.

This modified version includes a non-linear predator population growth dynamic originally proposed by Arditi and Akçakaya (1990), as well as a harvesting parameter, z. This system of equations is defined in Hadjimichael et al. (2020) as:

\frac{dx}{dt} = bx\Big(1 - \frac{x}{K}\Big) - \frac{\alpha xy}{y^m + \alpha hx} - zx

\frac{dy}{dt} = \frac{c\alpha xy}{y^m + \alpha hx} - dy

Where x is the prey population being harvested and y is the predator population. Please refer to Post 0 of this series for the rest of the parameter descriptions, and for insights into the non-linear dynamics that result from these ODEs. It also demonstrates how the system alternates between ‘basins’ of stability and population collapse.

Harvesting policy

In this post, we instead focus on the generation of harvesting policies which can be operated safely in the system without causing population collapse. Rather than assigning a deterministic (specific, pre-defined) harvest effort level for every time period, we instead design an adaptive policy which is a function of the current state of the system:

z_t = f(\cdot)

The problem then becomes the optimization of the control rule, f(\cdot), rather than specific parameter values, z = [z_1, z_2, ..., z_t]. The process of optimizing the parameters of a state-aware control rule is known as Direct Policy Search (DPS; Quinn et al, 2017).

Previous work done by Quinn et al. (2017) showed that an adaptive policy, generated using DPS, was able to navigate deeply uncertain ecological tipping points more reliably than intertemporal policies which prescribed specific efforts at each timestep.

Radial basis functions

At the core of the DPS method are radial basis functions (RBFs), which are flexible, parametric function formulations that map the current state of the system to a policy action. A previous study by Giuliani et al. (2015) demonstrated that RBFs are highly effective in generating Pareto-approximate sets of solutions, and that they perform well when applied to horizons different from the optimized simulation horizon.

There are various RBF approaches available, such as the cubic RBF used by Quinn et al. (2017). Here, we use the Gaussian RBF introduced by Hadjimichael et al. (2020), where the harvest effort during the next timestep, z_{t+1}, is mapped to the current prey population levels, x_t by the function:

z_{t+1} = \sum_{i=1}^n w_i \Big[exp\Big[-\Big(\frac{x_t-c_i}{b_i}\Big)^2\Big]\Big]

In this formulation c_i, b_i, and w_i are the center, radius, and weight of each RBF i respectively. Additionally, n is the number of RBFs used in the function; in this study we use n = 2 RBFs. With two RBFs, there are a total of 6 parameters. Increasing the number of RBFs allows for more flexible function forms to be achieved. However, two RBFs have been shown to be sufficient for this problem.

The sum of the weights must be equal to one, such that:

\sum_{i=1}^n w_i= 1

The function harvest_strategy() is contained within the fish_game_functions.py script, which can be accessed here in the repository.

A simplified rendition of the harvest_strategy() function, evaluate_RBF(), is shown below and uses the RBF parameter values (i.e., [c_1, b_1, w_1, c_2, b_2, w_2]), and the current prey population, to calculate the next year’s harvesting effort.

import numpy as np
import matplotlib.pyplot as plt

def evaluate_RBF(x, RBF_params, nRBFs):
    """
    Parameters:
    -----------
    x : float
        The current state of the system.
    RBF_params : list [3xnRBFs]
        The RBF parameters in the order of [c, r, w,...,c, r, w].
    nRBFs : int
        The number of RBFs used in the mapping function.

    Returns:
    --------
    z : float
        The policy action.
    """

    c = RBF_params[0::3]
    r = RBF_params[1::3]
    w = RBF_params[2::3]

    # Normalize the weights
    w_norm = []
    if np.sum(w) != 0:
        for w_i in w:
            w_norm.append(w_i / np.sum(w))
    else:
        w_norm = (1/nRBFs)*np.ones(len(w))
    
    z = 0.0
    
    for i in range(nRBFs):
        
        # Use the normalized weights, and avoid division by zero in the radius
        if r[i] != 0:
            z = z + w_norm[i] * np.exp(-((x - c[i])/r[i])**2)
        else:
            z = z + w_norm[i] * np.exp(-((x - c[i])/(10**-6))**2)
            
    # Impose limits on harvest effort
    if z < 0:
        z = 0
    elif z > 1:
        z = 1

    return z

To better understand the nature of the harvesting policy, it is helpful to visualize the policy function, z = f(\cdot).

For some arbitrary selection of RBF parameters:

[c_1, b_1, w_1, c_2, b_2, w_2] = [0.2, 1.1, 0.41, 0.34, 0.7, 0.59]

The following function will plot the harvesting strategy:

def plot_RBF_policy(x_range, x_label, y_range, y_label, RBF_params, nRBFs):
    """
    Parameters:
    -----------
    RBF_params : list [3xnRBFs]
        The RBF parameters in the order of [c, r, w,...,c, r, w].
    nRBFs : int
        The number of RBFs used in the mapping function.
        
    Returns:
    --------
    None.
    """
    # Step size
    n = 100
    x_min = x_range[0]
    x_max = x_range[1]
    y_min = y_range[0]
    y_max = y_range[1]

    # Generate data
    x_vals = np.linspace(x_min, x_max, n)
    y_vals = np.zeros(n)

    for i in range(n):
        y = evaluate_RBF(x_vals[i], RBF_params, nRBFs)

        # Check that assigned actions are within range
        if y < y_min:
            y = y_min
        elif y > y_max:
            y = y_max

        y_vals[i] = y

    # Plot
    fig, ax = plt.subplots(figsize = (5,5), dpi = 100)
    ax.plot(x_vals, y_vals, label = 'Policy', color = 'green')
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.set_title('RBF Policy')	
    plt.show()
    return

Let’s take a look at the policy that results from the random RBF parameters listed above. Setting my problem-specific inputs, and running the function:

# Set the RBF parameters
nRBFs = 2
RBF_params = [0.2, 1.1, 0.41, 0.34, 0.7, 0.59]

# Specify plot ranges
x_range = [0, 1]
x_label = 'Population ($x$)'
y_range = [0,1]
y_label = 'Harvest Effort ($z$)'

# Plot the policy curve
plot_RBF_policy(x_range, x_label, y_range, y_label, RBF_params, nRBFs)
Fig: A random RBF policy.

This policy does not make much intuitive sense… why should harvesting efforts be decreased when the fish population is large? Well, that’s because we chose these RBF parameter values randomly.

To demonstrate the flexibility of the RBF functions and the variety of policy functions that can result from them, I generated a few (n = 7) policies using a random sample of parameter values. The parameter values were sampled from a uniform distribution over each parameter’s range: c_i, b_i, w_i \in [0,1]. Below is a plot of the resulting random policy functions:

Fig: Many random RBF policies, showing flexibility of RBFs.
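If you would like to reproduce a similar figure yourself, here is a short sketch that reuses the evaluate_RBF() function and nRBFs = 2 defined above:

# Sketch: sample random RBF parameter sets and plot the resulting policies
np.random.seed(7)
x_vals = np.linspace(0, 1, 100)

fig, ax = plt.subplots(figsize=(5, 5), dpi=100)
for _ in range(7):
    random_params = np.random.uniform(0, 1, 3 * nRBFs)   # c_i, b_i, w_i in [0, 1]
    ax.plot(x_vals, [evaluate_RBF(x, random_params, nRBFs) for x in x_vals])

ax.set_xlabel('Population ($x$)')
ax.set_ylabel('Harvest Effort ($z$)')
ax.set_title('Random RBF policies')
plt.show()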

Finding the sets of RBF parameter values that result in Pareto-optimal harvesting policies is the next step in this process!

Harvest strategy objectives

We take a multi-objective approach to the generation of a harvesting strategy. Given that the populations are vulnerable to collapse, it is important to consider ecological objectives in the problem formulation.

Here, we consider five objectives, described below.

Objective 1: Net present value

The net present value (NPV) is an economic objective corresponding to the amount of fish harvested.

During the simulation-optimization process (later in this post), we simulate a single policy N times, and take the average objective score over the range of simulations. This method helps to account for variability in expected outcomes due to natural stochasticity. Here, we use N = 100 realizations of stochasticity.

With that in mind, the NPV (O_1) is calculated as:


O_1 = \frac{1}{N} \sum_{i=1}^N\Big( \sum_{t=0}^T \frac{z_{t+1,i}x_{t,i}}{(1+\delta)^t}\Big)

where \delta is the discount rate which converts future benefits to present economic value, here \delta = 0.05.

Objective 2: Prey population deficit

The second objective aims to minimize the average prey population deficit, relative to the prey population carrying capacity, K:

O_2 = \frac{1}{N} \sum_{i=1}^N\Big( \frac{1}{T} \sum_{t=1}^T \frac{K - x_{t,i}}{K}\Big)

Objective 3: Longest duration of consecutive low harvest

In order to maintain steady harvesting levels, we minimize the longest duration of consecutive low harvests. Here, a subjective definition of low harvest is imposed. In a practical decision making process, this threshold may be solicited from the relevant stakeholders.

Objective 3 is defined as:

O_3 = \frac{1}{N} \sum_{i=1}^N(max_T(\phi_{t,i}))

where \phi_{t,i} tracks the length of the current run of consecutive low-harvest time steps:

\phi_{t,i} = \begin{cases} \phi_{t-1,i} + 1, & \text{if } z_{t+1,i}x_{t,i} < limit \\ 0, & \text{otherwise} \end{cases}

and the low harvest limit is limit = 5\%.

Objective 4: Worst harvest instance

In addition to avoiding long periods of continuously low harvest, the stakeholders have a desire to limit financial risks associated with very low harvests. Here, we aim to maximize the worst 1% of harvest.

The fourth objective is defined as:

O_4 = \frac{1}{N} \sum_{i=1}^N(percentile_T(z_{t+1,i}x_{t,i}, 1))

Objective 5: Harvest variance

Lastly, policies which result in low harvest variance are more easily implemented, and can limit corresponding variance in fish populations.

The last objective minimizes the harvest variance, with the objective score defined as:

O_5 = \frac{1}{N} \sum_{i=1}^N(Var_T(z_{t+1,i}x_{t,i}))

Constraint: Avoid collapse of predator population

During the optimization process, we are able to include constraints on the harvesting policies.

Since population collapse is a stable equilibrium point, from which the population will not regrow, it is imperative to consider policies which prevent collapse.

With this in mind, the policy must not result in any population collapse across the N realizations of environmental stochasticity. Mathematically, this is enforced by:

\frac{1}{N} \sum_{i=1}^N \Big( \sum_{t=0}^T \Psi_{t,i} \Big) = 0

where

\Psi_{t,i} = \begin{cases} 1, & \text{if } y_{t,i} < 1 \\ 0, & \text{otherwise} \end{cases}

i.e., \Psi_{t,i} indicates a collapse of the predator population at time step t of realization i.

Problem formulation

Given the objectives described above, the optimization problem is:

Minimize \ F(z_x) = (-O_1, O_2, O_3, -O_4, O_5)

Simulation model of the harvested system

Here, we provide an overview of the fish_game_5_objs() model which combines many of the preceding topics. The goal for this model is to take a set of RBF parameters, which define the harvesting strategy, simulate the policy for some length of time, and then return the objective scores resulting from the policy.

Later, this model will allow for the optimization of the harvesting policy RBF parameters through a Multi-Objective Evolutionary Algorithm (MOEA). The MOEA will evaluate many thousands of policies (RBF parameter combinations) and attempt to find, through evolution, those RBF parameters which yield best objective performance.

A brief summary of the model process is described here, but the curious learner is encouraged to take a deeper look at the code and dissect the process.

The model can be understood as having three major sections:

  1. Initialization of storage vectors, stochastic variables, and assumed ODE parameters.
  2. Simulation of policy and fishery populations over time period T.
  3. Calculation of objective scores.

# Note: this function is defined in fish_game_functions.py, alongside harvest_strategy()
import itertools
import numpy as np

def fish_game_5_objs(vars):
    """
    Defines the full, 5-objective fish game problem to be solved

    Parameters
    ----------
    vars : list of floats
        Contains the C, R, W values

    Returns
    -------
    objs, cnstr
    """

    # Get chosen strategy
    strategy = 'Previous_Prey'

    # Define variables for RBFs
    nIn = 1 # no. of inputs (depending on selected strategy)
    nOut = 1 # no. of outputs (depending on selected strategy)
    nRBF = 2 # no. of RBFs to use

    nObjs = 5
    nCnstr = 1 # no. of constraints in output

    tSteps = 100 # no. of timesteps to run the fish game on
    N = 100 # Number of realizations of environmental stochasticity

    # Define assumed system parameters
    a = 0.005
    b = 0.5
    c = 0.5
    d = 0.1
    h = 0.1
    K = 2000
    m = 0.7
    sigmaX = 0.004
    sigmaY = 0.004

    # Initialize storage arrays for populations and harvest
    x = np.zeros(tSteps+1) # Prey population
    y = np.zeros(tSteps+1) # Predator population
    z = np.zeros(tSteps+1) # Harvest effort

    # Create array to store harvest for all realizations
    harvest = np.zeros([N,tSteps+1])
    # Create array to store effort for all realizations
    effort = np.zeros([N,tSteps+1])
    # Create array to store prey for all realizations
    prey = np.zeros([N,tSteps+1])
    # Create array to store predator for all realizations
    predator = np.zeros([N,tSteps+1])

    # Create array to store metrics per realization
    NPV = np.zeros(N)
    cons_low_harv = np.zeros(N)
    harv_1st_pc = np.zeros(N)
    variance = np.zeros(N)

    # Create arrays to store objectives and constraints
    objs = [0.0]*nObjs
    cnstr = [0.0]*nCnstr

    # Create array with environmental stochasticity for prey
    epsilon_prey = np.random.normal(0.0, sigmaX, N)

    # Create array with environmental stochasticity for predator
    epsilon_predator = np.random.normal(0.0, sigmaY, N)

    # Go through N possible realizations
    for i in range(N):

        # Initialize populations and values
        x[0] = prey[i,0] = K
        y[0] = predator[i,0] = 250
        z[0] = effort[i,0] = harvest_strategy([x[0]], vars, [[0, K]], [[0, 1]], nIn, nOut, nRBF)
        NPVharvest = harvest[i,0] = effort[i,0]*x[0]

        # Go through all timesteps for prey, predator, and harvest
        for t in range(tSteps):

            # Solve discretized form of ODE at subsequent time step
            if x[t] > 0 and y[t] > 0:
                x[t+1] = (x[t] + b*x[t]*(1-x[t]/K) - (a*x[t]*y[t])/(np.power(y[t],m)+a*h*x[t]) - z[t]*x[t])* np.exp(epsilon_prey[i]) # Prey growth equation
                y[t+1] = (y[t] + c*a*x[t]*y[t]/(np.power(y[t],m)+a*h*x[t]) - d*y[t]) *np.exp(epsilon_predator[i]) # Predator growth equation

                # Solve for harvesting effort at next timestep
                if t <= tSteps-1:
                    if strategy == 'Previous_Prey':
                        input_ranges = [[0, K]] # Prey pop. range to use for normalization
                        output_ranges = [[0, 1]] # Range to de-normalize harvest to
                        z[t+1] = harvest_strategy([x[t]], vars, input_ranges, output_ranges, nIn, nOut, nRBF)

            # Store values in arrays
            prey[i,t+1] = x[t+1]
            predator[i,t+1] = y[t+1]
            effort[i,t+1] = z[t+1]
            harvest[i,t+1] = z[t+1]*x[t+1]
            NPVharvest = NPVharvest + harvest[i,t+1]*(1+0.05)**(-(t+1))

        # Solve for objectives and constraint
        NPV[i] = NPVharvest
        low_hrv = [harvest[i,j]<prey[i,j]/20 for j in range(len(harvest[i,:]))] # Returns a list of True values when there's harvest below 5%
        count = [ sum( 1 for _ in group ) for key, group in itertools.groupby( low_hrv ) if key ] # Counts groups of True values in a row
        if count: # Checks if there's at least one count (if not, np.max won't work on an empty list)
            cons_low_harv[i] = np.max(count)  # Finds the largest number of consecutive low harvests
        else:
            cons_low_harv[i] = 0
        harv_1st_pc[i] = np.percentile(harvest[i,:],1)
        variance[i] = np.var(harvest[i,:])

    # Average objectives across N realizations
    objs[0] = -np.mean(NPV) # Mean NPV for all realizations
    objs[1] = np.mean((K-prey)/K) # Mean prey deficit
    objs[2] = np.mean(cons_low_harv) # Mean worst case of consecutive low harvest across realizations
    objs[3] = -np.mean(harv_1st_pc) # Mean 1st percentile of all harvests
    objs[4] = np.mean(variance) # Mean variance of harvest

    cnstr[0] = np.mean((predator < 1).sum(axis=1)) # Mean number of predator extinction days per realization

    # output should be all the objectives, and constraint
    return objs, cnstr

The next section shows how to optimize the harvest policy defined by vars, using the PyBorg MOEA.

A (Very) Brief Overview of PyBorg

PyBorg is a Python implementation of the Borg MOEA, written by David Hadka and Andrew Dircks. It is built using functions from the Platypus optimization library, a Python evolutionary computing framework.

As PyBorg is Borg’s Python wrapper and thus derived from the original Borg MOEA, it can only be used with permission from its creators. To obtain permission for download, please visit BorgMOEA and complete the web form. You should receive an email with the link to the BitBucket repository shortly.

Installation

The repository you have access to should be named ‘Serial Borg MOEA’ and contain a number of folders, including one called Python/. Within the Python/ folder, you will be able to locate a folder called pyborg. Once you have identified the folder, please follow these next steps carefully:

  1. Check your current Python version. Python 3.5 or later is required to run PyBorg.
  2. Download the pyborg folder and place it in the folder where this Jupyter Notebook and all other Part 1 training material are located.
  3. Install the platypus library. This can be done via your command line by running one of two options:

    If you are using pip:

pip install platypus-opt

    If you are using conda:

conda config --add channels conda-forge
conda install platypus-opt

  4. Make sure the following training startup files are located within the same folder as this Jupyter Notebook:
    a) fish_game_functions.py: Contains all function definitions to set up the problem, run the optimization, plot the hypervolume, and conduct random seed analysis.
    b) Part 1 - Harvest Optimization and MOEA Diagnostics.ipynb: This is the current notebook, where the Fisheries Game will be demonstrated.

We are now ready to proceed!

Optimization of the Fisheries Game

Import all libraries

All functions required for this post can be found in the fish_game_functions.py file. This code is adapted from Antonia Hadjimichael’s original post on exploring the Fisheries Game dynamics using PyBorg.

# import all required libraries
from platypus import Problem, Real, Hypervolume, Generator
from pyborg import BorgMOEA
from fish_game_functions import *
import matplotlib.pyplot as plt   # used for the tradeoff and hypervolume plots below
import time
import random

Formulating the problem

Define number of decision variables, constraints, and specify problem formulation:

# Set the number of decision variables, constraints and performance objectives
nVars = 6   # Define number of decision variables
nObjs = 5   # Define number of objectives
nCnstr = 1      # Define number of decision constraints

# Define the upper and lower bounds of the performance objectives
objs_lower_bounds = [-6000, 0, 0, -250, 0]
objs_upper_bounds = [0, 1, 100, 0, 32000]

Initialize the problem for optimization

We call the fisheries_game_problem_setup() function to set up the optimization problem. This function returns a PyBorg object, called algorithm in this exercise, that will be optimized in the next step.

def fisheries_game_problem_setup(nVars, nObjs, nCnstr, pop_size=100):
    """
    Sets up and runs the fisheries game for a given population size

    Parameters
    ----------
    nVars : int
        Number of decision variables.
    nObjs : int
        Number of performance objectives.
    nCnstr : int
        Number of constraints.
    pop_size : int, optional
        Initial population size of the randomly-generated set of solutions.
        The default is 100.

    Returns
    -------
    algorithm : pyBorg object
        The algorithm to optimize with a unique initial population size.

    """
    # Set up the problem
    problem = Problem(nVars, nObjs, nCnstr)

    # set bounds for each decision variable
    problem.types[0] = Real(0.0, 1.0)
    problem.types[1] = Real(0.0, 1.0)
    problem.types[2] = Real(0.0, 1.0)
    problem.types[3] = Real(0.0, 1.0)
    problem.types[4] = Real(0.0, 1.0)
    problem.types[5] = Real(0.0, 1.0)

    # each solution must satisfy the constraint (constraint value == 0, i.e. no predator collapse)
    problem.constraints[:] = "==0"

    # set problem function
    if nObjs == 5:
        problem.function = fish_game_5_objs
    else:
        problem.function = fish_game_3_objs

    algorithm = BorgMOEA(problem, epsilons=0.001, population_size=pop_size)
    return algorithm
# initialize the optimization
algorithm = fisheries_game_problem_setup(nVars, nObjs, nCnstr)

Define parameters for optimization

Before optimizing, we have to define our desired population size and number of function evaluations (NFE). The NFE is the number of times candidate solutions are evaluated during the search, and thus controls how long the algorithm evolves the population. For complex, many-objective problems, a large NFE may be necessary.

Here, we start with a small limit on NFE to test the speed of the optimization. Limiting the optimization to 100 NFE will produce relatively poor-performing solutions; however, it is a good starting point for our diagnostic tests.

init_nfe = 100
init_pop_size = 100

Begin the optimization

In addition to running the optimization, we also time the optimization to get a general estimate on the time the full hypervolume analysis will require.

# begin timing the Borg run
borg_start_time = time.time()

algorithm = fisheries_game_problem_setup(nVars, nObjs, nCnstr, pop_size=int(init_pop_size))
algorithm.run(int(init_nfe))

# end timing and print optimization time 
borg_end_time = time.time()

borg_total_time = borg_end_time - borg_start_time

print(f"borg_total_time={borg_total_time}s")
Output: borg_total_time=33.62936472892761s

NOTICE:
Running the PyBorg MOEA for 100 NFE took ~34 seconds (on the machine on which this was written…). Keep in mind that increasing the NFE will require correspondingly more time. If you increase the number too much, your machine may take a long time to compute the final Pareto front.

Plot the tradeoff surface

Here, we plot a 3-dimensional plot showing the tradeoff between a select number of objectives. If you have selected the 5-objective problem formulation, you should select the three objectives you would like to analyze the tradeoff surface for. Please select the (abbreviated) objective names from the following list:

Objective 1: Mean NPV
Objective 2: Mean prey deficit
Objective 3: Mean WCLH
Objective 4: Mean 1% harvest
Objective 5: Mean harvest variance

# Plot objective tradeoff surface
fig_objs = plt.figure(figsize=(8,8))
ax_objs = fig_objs.add_subplot(111, projection='3d')

# Select the objectives to plot from the list provided in the description above
obj1 = 'Mean NPV'
obj2 = 'Mean prey deficit'
obj3 = 'Mean 1% harvest'

plot_3d_tradeoff(algorithm, ax_objs, nObjs, obj1, obj2, obj3)
Fig: Pareto-approximate solutions generated with 100 function evaluations. The star is an ideal solution.

The objective scores aren’t very good, but that is because the number of function evaluations is so low. In order to get a better set of solutions, we need to run the MOEA for many more function evaluations.

The next section demonstrates the change in objective performance with respect to the number of function evaluations.

MOEA Diagnostics

A good MOEA is assessed by its ability to quickly converge to a set of solutions (the Pareto-approximate set) that is also diverse. This means that the final set of solutions is close to the true set and covers a large volume of the multi-dimensional problem space. There are three quantitative metrics by which convergence and diversity are evaluated:

  1. Generational distance approximates the average distance between the true Pareto front and the Pareto-approximate reference set that your MOEA identifies. It is the easiest metric to meet.
  2. Epsilon indicator is a harder metric than generational distance to meet. A high-performing MOEA will have a low epsilon indicator value when the distance of its worst-performing approximate solution from the true Pareto set is small.
  3. Hypervolume measures the ‘volume’ that a Pareto front covers across all dimensions of a problem. It is the hardest metric to meet and the most computationally intensive.

Both the generational distance and epsilon indicator metrics require a reference set, which is the known, true Pareto front. Conversely, the hypervolume does not have such a requirement. Given that the Fisheries Game is a complex, multi-dimensional, many-stakeholder problem with no known solution, the hypervolume metric is thus the most suitable to evaluate the ability of PyBorg to quickly converge to a diverse Pareto-approximate set of solutions.

More detailed descriptions of each metric are provided in this handy blog post by Joe Kasprzyk.

Hypervolume

The hypervolume is a measure of the multi-dimensional volume dominated by the approximated Pareto front. As the Pareto front advances toward the “ideal” solution, this value approaches 1.

The efficiency of an MOEA in optimizing a solution can be considered by measuring the hypervolume with respect to the number of function evaluations. This allows the user to understand how quickly the MOEA is converging to a good set of solutions, and how many function evaluations are needed to achieve a good set of solutions.
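For a single snapshot, the hypervolume of the current approximation set can be queried directly through Platypus’s Hypervolume indicator, normalized by the objective bounds defined earlier; the plot_hvol() helper wraps a similar calculation across NFE checkpoints, though its exact internals may differ.

# Sketch: hypervolume of the current approximation set using Platypus
from platypus import Hypervolume

hv = Hypervolume(minimum=objs_lower_bounds, maximum=objs_upper_bounds)
print(hv.calculate(algorithm.result))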

Defining hypervolume parameters

First, we define the maximum number of function evaluations (maxevals) and the NFE step size (frequency) over which we would like to evaluate the problem hypervolume. Try modifying these values to see how the plot changes.

Mind that the value of maxevals should always be more than that of your initial NFE, and that the value of frequency should be less than that of the initial NFE. Both values should be integer values.

Also be mindful that increasing the maxevals > 1000 is going to result in long runtimes.

maxevals = 500
frequency = 100

Plotting the hypervolume

Using these parameters, we then plot the hypervolume graph, showing the change in hypervolume value over the NFEs.

fig_hvol = plt.figure(figsize=(10,7))
ax_hvol = fig_hvol.add_subplot()

plot_hvol(algorithm, maxevals, frequency, objs_lower_bounds, objs_upper_bounds, ax_hvol)

plt.title('PyBorg Runtime (Hypervolume)')
plt.xlabel('Number of Function Evaluations')
plt.ylabel('Hypervolume')
plt.show()

Perform random seed analysis

Next, we perform random seed analysis (RSA).

Generally, RSA is performed to track an algorithm’s performance during search. In addition, it is also done to determine if an algorithm has discovered an acceptable approximation of the true Pareto set. More details on RSA can be found here in a blog post by Dave Gold.

For the Fisheries Game, we conduct RSA to determine if PyBorg’s performance is sensitive to the size of its initial population. We do this using the following steps:

  1. Run an ensemble of searches, each starting with a randomly sampled set of initial conditions (aka “random seeds”)
  2. Combine search results across all random seeds to generate a “reference set” that contains only the best non-dominated solutions across the ensemble
  3. Repeat steps 1 and 2 for an initial population size of 200, 400, etc.

pop_size_list = [100, 200, 400, 800, 1000]

fig_rand_seed = plt.figure(figsize=(10,7))
ax_rand_seed = fig_rand_seed.add_subplot()

for p in range(len(pop_size_list)):
    algorithm = fisheries_game_problem_setup(nVars, nObjs, nCnstr, pop_size=int(pop_size_list[p]))
    algorithm.run(int(init_nfe))
    
    plot_hvol(algorithm, maxevals, frequency, objs_lower_bounds, objs_upper_bounds, 
              ax_rand_seed, pop_size_list[p])

plt.title('PyBorg Random Seed Analysis')
plt.xlabel('Number of Function Evaluations')
plt.ylabel('Hypervolume')
plt.legend()
plt.show()

Notice that the runs performed with different initial population sizes tend to converge toward a similar hypervolume value after 500 NFEs.

This reveals that the PyBorg MOEA is not very sensitive to the specific initial parameters; it is adaptable enough to succeed under different configurations.

Conclusion

A classic decision-making idiom says ‘defining the problem is the problem’. Hopefully, this post has revealed that to be true; we have shown that changes to the harvesting strategy functions, simulation model, or objective formulations can result in very different outcomes.

And if you’ve made it this far, congratulations! Take a minute to think back on the progression of this post: we revisited the harvested predator-prey model, formulated the harvesting policy using RBFs, and formulated the policy objectives and its associated simulation model. Next, we optimized the harvesting policy using the PyBorg MOEA and performed basic MOEA diagnostics using hypervolume as our measure, and executed random seed analysis.

If you’ve progressed through this tutorial using the Jupyter Notebook, we encourage you to revisit the source code involved in this process. The next advisable step is to reproduce this problem from scratch, as this is the best way to develop a detailed understanding of the process.

Next time, we will explore the outcomes of this optimization, by considering the tradeoffs present across the Pareto set of solutions.

Till then, thanks for reading!

Cythonizing your Python code, Part 2: Next steps with reservoir simulation

Cython is an extension of the Python programming language that allows you to add C-like functionality, such as static typing and compilation. This can lead to significant efficiency gains over pure Python. In my last blog post, I introduced Cython’s motivation and benefits using a simple Fibonacci sequence example. Cythonizing this program led to a 20-30x speedup on my laptop, for very little work!

However, you may be wondering about the feasibility and impact of Cython for more complex programs, such as simulation models used in research. In this blog post, I will demonstrate how Cython can be applied to larger, object-oriented programs that use functions, classes, and NumPy arrays, using a reservoir simulation model as an example. Readers unfamiliar with Cython are encouraged to read the first post in this series prior to proceeding.

Reservoir simulation model

All code used in this blog post can be found in this GitHub repository, in the reservoir_sim directory. This contains nine sub-directories with different versions of the model. I will describe the differences between the versions later in this post, but first I will briefly introduce the major components of the simulation model using the py_fast version. The model simulates a two-reservoir system, one of which is downstream from the other. Both have their own inflows, which are sinusoidal plus noise. Additionally, both reservoirs have their own demands and minimum flow thresholds, which are also sinusoidal plus noise. Each time step, the upper reservoir attempts to release the minimum flow requirement if there is enough water, then tries to satisfy the demand if possible. The lower reservoir then repeats the same process, with the upper reservoir’s releases added to its own catchment inflows. The simulation is run over a number of years, and the stochastic inputs (inflows, upstream releases, minimum flow requirements, demand) and outputs (releases, deliveries, storages) are stored as time series.

The directory py_fast contains three Python files, representing the Demand, Reservoir, and Model classes. The simplest class, Demand, is initialized with inputs that determine the parameterization of the random demand process, and a function that can be called each day to randomly sample from this process:

### Demand class - pure Python
from random import gauss
from math import pi, sin

class Demand():
  def __init__(self, name, demand_params):
    self.name = name
    # amplitude, phase, shift, & noise std for sine wave of demands
    self.demand_amp, self.demand_phase, self.demand_shift, self.demand_noise = demand_params
    # save 2pi/365 to save conversion time
    self.days_to_radians = 2. * pi / 365.

  ### demand function
  def demand_t(self, t):
    noise = gauss(0., self.demand_noise)
    demand_t = self.demand_amp * sin((t - self.demand_phase)  * self.days_to_radians) + self.demand_shift + noise
    return demand_t
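A quick way to check that the class behaves as expected is to instantiate it and sample a few days; the parameter values below are illustrative, roughly matching those used for the upper reservoir later in this post.

### quick check of the Demand class (illustrative parameter values)
from Demand import Demand

demand = Demand('upper', (300., 163., 600., 30.))
# sample the stochastic demand at a few days of the year
print([round(demand.demand_t(t), 1) for t in (0, 91, 182, 274)])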

The Reservoir class is similar, with parameter initialization and its own stochastic inflow and minimum flow sampling methods. The initialization also instantiates a Demand object belonging to the Reservoir object.

### Reservoir class - pure Python
import numpy as np
from Demand import Demand

class Reservoir():
  def __init__(self, name, inflow_params, min_flow_params, demand_params, storage_params):
    self.name = name
    # amplitude, phase, shift, & noise std for sine wave of inflows
    self.inflow_amp, self.inflow_phase, self.inflow_shift, self.inflow_noise = inflow_params
    # amplitude, phase, and shift for sine wave of minimum flows
    self.min_flow_amp, self.min_flow_phase, self.min_flow_shift = min_flow_params
    # reservoir capacity and initial storage
    self.capacity, self.storage = storage_params
    # set up demand object
    self.demand = Demand(name, demand_params)

  ### inflow function
  def inflow_t(self, t):
    noise = np.random.normal(0., self.inflow_noise, 1)[0]
    inflow_t = self.inflow_amp * np.sin((t - self.inflow_phase)  * 2. * np.pi / 365.) + self.inflow_shift + noise
    return inflow_t

  ### min flow function
  def min_flow_t(self, t):
    min_flow_t = self.min_flow_amp * np.sin((t - self.min_flow_phase)  * 2. * np.pi / 365.) + self.min_flow_shift
    return min_flow_t

The Reservoir class also has a step method, which dictates the release, delivery, and updated storage based on the availability of water.

  ### step reservoir another day
  def step(self, t, upstream_release):
    inflow = self.inflow_t(t)
    min_flow = self.min_flow_t(t)
    demand = self.demand.demand_t(t)
    
    # first assume release & delivery meet min_flow and demand exactly
    release = min_flow
    delivery = demand
    self.storage += inflow + upstream_release - release - demand

    # check if storage overflow
    if self.storage > self.capacity:
      release += self.storage - self.capacity
      self.storage = self.capacity

    # check if storage went negative. if so, curtail demand first, then env flows
    elif self.storage < 0:
      if delivery > (-self.storage):
        delivery += self.storage
        self.storage = 0.
      else:
        self.storage += delivery
        delivery = 0
        if release > (-self.storage):
          release += self.storage
          self.storage = 0
        else:
          print('This should not be happening')
    
    return (inflow, upstream_release, min_flow, demand, release, delivery, self.storage)

Finally, the Model class is used to set up and run the simulation. Upon initialization, the Model object first creates the upper and lower Reservoir objects, and creates a dictionary of lists to hold the results.

### Model class for running simulation
import matplotlib.pyplot as plt
import random
import numpy as np
from Reservoir import Reservoir

class Model():
  def __init__(self, years, plot, seed):
    # length of simulation (no leap years)
    self.years = years
    self.days = 365 * self.years
    # boolean for whether to plot
    self.plot = plot
    # random seed
    random.seed(seed)

    ### set up upper reservoir
    # inflow params in terms of sinusoid, (amplitude, phase, shift, noise). units = AF/day
    inflow_params = (300., 0., 500., 20.)
    # min flow params in terms of sinusoid, (amplitude, phase, shift). units = AF/day
    min_flow_params = (200., 122., 300.)
    # demand params in terms of sinusoid, (amplitude, phase, shift, noise). units = AF/day
    demand_params = (300., 163., 600., 30.)
    # storage params (capacity, starting storage). units = AF
    storage_params = (500000., 250000.)
    # initialize upper reservoir
    self.reservoir_upper = Reservoir('upper', inflow_params, min_flow_params, demand_params, storage_params)

    ### set up lower reservoir
    # inflow params in terms of sinusoid, (amplitude, phase, shift, noise). units = AF/day
    inflow_params = (200., 30., 600., 100.)
    # min flow params in terms of sinusoid, (amplitude, phase, shift). units = AF/day
    min_flow_params = (100., 91., 400.)
    # demand params in terms of sinusoid, (amplitude, phase, shift, noise). units = AF/day
    demand_params = (100., 152., 300., 10.)
    # storage params (capacity, starting storage). units = AF
    storage_params = (20000., 15000.)
    # initialize lower reservoir
    self.reservoir_lower = Reservoir('lower', inflow_params, min_flow_params, demand_params, storage_params)

    ### set up data storage
    self.reservoir_list = [self.reservoir_upper, self.reservoir_lower]
    self.output = {}
    for reservoir in self.reservoir_list:
      self.output[reservoir.name] = {}
      self.output[reservoir.name]['inflow'] = []
      self.output[reservoir.name]['upstream_release'] = []
      self.output[reservoir.name]['min_flow'] = []
      self.output[reservoir.name]['demand'] = []
      self.output[reservoir.name]['release'] = []
      self.output[reservoir.name]['delivery'] = []
      self.output[reservoir.name]['storage'] = [reservoir.storage]

The Model.run function performs the simulation loop.

  def run(self):
    for t in range(self.days):
      # step upper
      (inflow, upstream_release, min_flow, demand, release, delivery, storage) = self.reservoir_upper.step(t, 0.)
      self.output['upper']['inflow'].append(inflow)
      self.output['upper']['upstream_release'].append(upstream_release)
      self.output['upper']['min_flow'].append(min_flow)
      self.output['upper']['demand'].append(demand)
      self.output['upper']['release'].append(release)
      self.output['upper']['delivery'].append(delivery)
      self.output['upper']['storage'].append(storage)

      (inflow, upstream_release, min_flow, demand, release, delivery, storage) = self.reservoir_lower.step(t, release)
      self.output['lower']['inflow'].append(inflow)
      self.output['lower']['upstream_release'].append(upstream_release)
      self.output['lower']['min_flow'].append(min_flow)
      self.output['lower']['demand'].append(demand)
      self.output['lower']['release'].append(release)
      self.output['lower']['delivery'].append(delivery)
      self.output['lower']['storage'].append(storage)

    # return total storage last time step to make sure versions are equivalent
    return self.output['upper']['storage'][-1] + self.output['lower']['storage'][-1]

We can run the model and plot the results as follows (plot function not shown for brevity, see repository).

from Model import Model

# initialize model
model = Model(years = 5, plot = True, seed = 101)

# run simulation
tot_storage = model.run()

# plot results
model.plot_results()
Figure 1: Five-year realization of simulated inflows with and without upstream release (column 1), demand vs. deliveries (column 2), minimum flow requirement vs. release (column 3), and storage (column 4) for the upper (row 1) and lower (row 2) reservoirs.

Cythonizing a class

Now that we have our model in place, we can begin to add static type information and Cythonize it. I will focus now on the cy version of the model, which builds on the py_fast model. First consider Demand.pyx.

### Demand class - cython
from random import gauss
from math import pi, sin

cdef class Demand():
  def __init__(self, name, demand_params):
    self.name = name
    # amplitude, phase, shift, & noise std for sine wave of demands
    self.demand_amp, self.demand_phase, self.demand_shift, self.demand_noise = demand_params
    # save 2pi/365 as double to save conversion time
    self.days_to_radians = 2. * pi / 365.

  ### demand function
  cdef double demand_t(self, double t):
    cdef double noise, demand_t, sin_t
    
    noise = gauss(0., self.demand_noise)
    sin_t = sin((t - self.demand_phase) * self.days_to_radians)
    demand_t = self.demand_amp * sin_t + self.demand_shift + noise
    return demand_t

This is similar to Demand.py in the Python version, other than a few main differences. Cython classes are defined with “cdef” rather than “def”. Within a class, we can also define its methods using the cdef keyword. Similar to the Fibonacci sequence example in my earlier blog post, we provide type information for the return value and arguments in the function definition for demand_t(). Lastly, we can provide type information for any variables used within the function, such as noise, again using the cdef signifier.

It is not necessary to provide this information for every variable and function. For example, the __init__ function in general cannot be defined with cdef. However, the more type information you can provide, the more Cython will be able to speed up your code. Inspecting the annotated html that Cython outputs (see the first post in this series), as well as code profiling (see this blog post by Antonia Hadjimichael), can help to figure out which variables and methods are the most important bottlenecks to speed up, and which have diminishing marginal returns and can be left as untyped Python variables/methods.
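For reference, the annotated HTML files (and the compiled extension modules themselves) are produced at build time. A minimal setup.py for this example might look like the sketch below; the repository’s actual build script may be organized differently.

### setup.py - minimal sketch for compiling the Cython version with annotation
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize(
        ['Demand.pyx', 'Reservoir.pyx', 'Model.pyx', 'main.pyx'],
        annotate=True,                              # write .html annotation files
        compiler_directives={'language_level': 3},  # use Python 3 semantics
    )
)

# build in place with: python setup.py build_ext --inplace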

You may have noticed that I did not provide static type information for class attributes (e.g., self.name, self.demand_amp). These variables are defined and typed in a separate file, Demand.pxd.

cdef class Demand():

  cdef:
    public str name

    public double demand_amp, demand_phase, demand_shift, demand_noise, days_to_radians

  cdef double demand_t(self, double t)

A .pxd “definition file” is the Cython equivalent of a header file in C; it defines attributes and methods of the class and gives type information. Any class that has a definition file will create a C extension type. Variables and methods in the definition file can only be accessed by other Cython scripts, rather than pure Python code. Cython also has a third kind of function definition, “cpdef”, which can be called from either Cython or Python. This can be useful if your application needs to be interoperable with Cython and Python, but will not perform as efficiently as cdef methods.

One important thing to notice is that each variable belonging to the class should be preceded by “public” if we want the attribute to be accessible to other objects. Also note that only cdef-defined functions should be included here, rather than pure Python functions.

The Reservoir class is Cythonized in a similar way, and the interested reader is referred to the GitHub repository to see the changes in detail. Moving to the Model class, there are several important differences to note in the .pxd file.

from Reservoir cimport Reservoir

cdef class Model():

  cdef:
    public int years, days

    public double tot_storage

    public bint plot

    public list reservoir_list

    public dict output

    public Reservoir reservoir_upper, reservoir_lower
  
  cdef double run(self) except *

First, notice that the Reservoir class is imported with the cimport command (“Cython import”) rather than import, since it is now an extension type. This should be done in both the .pyx and .pxd files. Second, I have included “except *” after the parentheses in the definition of the run() method. This instructs our compiled C code to pass exceptions through to the Python runtime. Otherwise, exceptions raised inside a cdef function with a non-object return type are simply printed as warnings while execution continues, which makes it challenging to debug and verify behavior. This should be included in both the .pyx and .pxd files. Third, bint is the Cython type corresponding to Python’s bool (stored as a C int).
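As an aside, “except *” is the most general (and slowest) option, since it forces an exception check after every call. For methods returning a numeric type, Cython also lets you reserve a sentinel value to signal errors more cheaply. Illustrative declarations (not from the repository):

cdef double run(self) except *             # check for exceptions after every call
cdef int get_day(self, int t) except -1    # -1 is reserved to signal an error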

The last important change in the Cython version is the inclusion of a new file, main.pyx. As noted previously, Cython extension types can only be imported using cimport from Cython, rather than pure Python code. For this reason, we can no longer directly import and run the Model class as we did in the pure Python implementation above. Instead, main.pyx is used as a go-between. This script is Cythonized (note the .pyx ending), but is not an extension type since it does not include a definition file (.pxd) with cythonized methods or attributes. Thus the file is allowed to import extension types (since it is a Cython file), while also remaining importable itself into pure Python (since it is not an extension type).

from Model cimport Model

cdef class main():
  cdef public Model model

  def __init__(self, int years, bint plot, int seed):
    # initialize model
    self.model = Model(years = years, plot = plot, seed = seed)

    # run simulation
    self.model.tot_storage = self.model.run()

    # plot results
    if self.model.plot:
      self.model.plot_results()

The main class can now be imported in pure Python to run the simulation, in much the same way as we imported and ran the Model class before.

from main import main

# run model
main_obj = main(years = 5, plot = True, seed = 101)

# verify that final storage is the same
print(main_obj.model.tot_storage)

For both py_fast and cy versions, we can verify that the final storage at the end of five years is 12,066. However, as we shall see later in this post, the Cythonized version completes the procedure more quickly.

NumPy and Memory Views

Most scientific programs rely heavily on NumPy arrays for numeric computation. For example, the py_numpy version of the Model class stores time series outputs in a NumPy array rather than nested dictionaries and lists. This output array is initialized in the class constructor, and then populated with outputs as the model is run (compare this to the original Python version above).

class Model():
  def __init__(self, years, plot, seed):    
    
    ...

    self.output = np.empty((num_reservoirs * self.num_step_outputs, self.days + 1))
    for i in range(len(reservoir_list)):
      self.output[i * self.num_step_outputs + self.num_step_outputs - 1, 0] = reservoir_list[i].storage


  def run(self):
    for t in range(1, self.days + 1):
      t_run = t - 1

      ### step upper
      self.output[:7, t] = self.reservoir_upper.step(t_run, 0.)

      ### step lower, using upper release as inflow
      self.output[7:, t] = self.reservoir_lower.step(t_run, self.output[4, t])

    # return total storage last time step to make sure versions are equivalent
    return self.output[6, -1] + self.output[-1, -1]

Because a NumPy array is not a native Python type (e.g., float, list), it cannot be used as a static type for Cython (i.e., we can’t simply define an array arr with “cdef np.array arr“). However, we can use typed “memoryviews,” which provide direct, efficient access to the memory buffer underlying any object that supports the Python buffer protocol, NumPy arrays included. The nuances of memoryviews can be a bit challenging, and a full explanation is beyond the scope of this post. But I will give a simple example usage for Cython, and the interested reader is referred to the docs here and here for more information.

Memoryviews are created by designating the type of variable to go inside the array (e.g., double), followed by square brackets with colons and commas dictating the number of dimensions. For example, in the cy_numpy version of Model.pxd, we define the 2-dimensional memoryview used to store results as follows:

cdef class Model():

  cdef:
    
    ...

    public double[:,:] output

Now in the model constructor, we initialize the memory view as follows:

    output_np = np.empty((num_reservoirs * self.num_step_outputs, self.days + 1))
    self.output = output_np

    for i in range(len(reservoir_list)):
      self.output[i * self.num_step_outputs + self.num_step_outputs - 1, 0] = reservoir_list[i].storage

The only difference from the Pythonic implementation is that we have to split the creation of the memoryview into two steps; first we create a NumPy array (output_np), then we set the memoryview (self.output) to correspond to the block of memory holding output_np. Once we have created the memoryview, we can access and reassign values using standard NumPy syntax. For example, here is the part of the Model.run() function that steps the upper reservoir.

      (inflow, upstream_release, min_flow, demand, release, delivery, storage) = self.reservoir_upper.step(t_run, 0.)
      self.output[0, t] = inflow
      self.output[1, t] = upstream_release
      self.output[2, t] = min_flow
      self.output[3, t] = demand
      self.output[4, t] = release
      self.output[5, t] = delivery
      self.output[6, t] = storage

Again, this has to be split into two steps; first call the reservoir’s step function and receive the output as a tuple of variables, and then reset each value in the memoryview using this data. We cannot simply set the entire slice self.output[:7, t] directly using the tuple returned by the function, like we did in py_numpy, because the memoryview slice expects an array-like buffer rather than a tuple. It is possible to convert the tuple into a NumPy array and then reassign the slice using this NumPy array (sketched below), but this was found to be less efficient when I timed different versions. Alternatively, we can alter the definition of Reservoir.step() to return a memoryview rather than a tuple, which allows the Model class to reassign values in the memoryview directly from the function call. However, this too was found to be less efficient, as we will see in the next section. The interested reader is referred to the cy_numpy_reservoir version of the model for implementation details.
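For illustration, that slower tuple-to-array alternative would look roughly like this (a sketch, not the code used in the repository):

      # hypothetical alternative: convert the tuple to an array, then assign the slice
      step_out = np.array(self.reservoir_upper.step(t_run, 0.))
      self.output[:7, t] = step_out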

Testing the performance of different versions

Now that I have described how to Cythonize your classes and NumPy arrays, you may be wondering, “Ok, but was it worth it? How much does this speed up the code?” We can answer this question with a timing experiment. The reservoir_sim directory in the GitHub repository contains nine versions of the code, summarized as follows:

  1. py_slow: My first attempt at creating the example reservoir simulation model in Python, which contains several bottlenecks.
  2. py_fast: An improved Python-only model after eliminating bottlenecks, used as a baseline for improvement.
  3. py_numpy: Python-only version that uses a NumPy array to store the time series of results.
  4. py_numpy_reservoir: Python-only version that uses NumPy arrays to store results inside the daily time step as well.
  5. py_cy: A copy of py_fast, which is then Cythonized without explicitly adding static type information.
  6. cy: Updates to py_fast to add static type information.
  7. cy_numpy: Similar to py_numpy, but Cythonized with static types and memory views.
  8. cy_numpy_reservoir: Similar to py_numpy_reservoir, but Cythonized with static types and memory views.
  9. cy_numpy_noCheck: Copy of cy_numpy, with the additional step of disabling Python’s automatic bounds checking and index wrapping.

This directory also includes two scripts: time_versions.sh will first Cythonize all the Cython versions of the model, and then run a timing experiment with all versions, using the script time_sim.py. The experiment will output the relative efficiency of each version, averaged over 4 trials of 20 simulations, each of which is 50 years long.
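If you want to reproduce something similar by hand, the core of the timing comparison can be sketched with the standard library’s timeit module (a simplified sketch with assumed import paths; the actual loop lives in time_sim.py):

# sketch: time 20 simulations of 50 years each for two versions and compare
import timeit

time_py = timeit.timeit('Model(years = 50, plot = False, seed = 101).run()',
                        setup = 'from Model import Model', number = 20)
time_cy = timeit.timeit('main(years = 50, plot = False, seed = 101)',
                        setup = 'from main import main', number = 20)

print('efficiency of cy relative to py_fast:', time_py / time_cy)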

Starting timings

py_slow: answer = 10494.223405436973, time = 1.1308925539487973, efficiency  = 1.0

py_fast: answer = 9851.320332349133, time = 0.26572948903776705, efficiency relative to py_slow = 4.25580374253483

py_numpy: answer = 9851.320332349133, time = 0.3013913669856265, efficiency relative to py_fast = 0.8816758479032341

py_numpy_reservoir: answer = 9851.320332349133, time = 0.38436312798876315, efficiency relative to py_fast = 0.6913501053762255

py_cy: answer = 9851.320332349133, time = 0.23376567207742482, efficiency relative to py_fast = 1.1367344344286598

cy: answer = 9851.320332349133, time = 0.13966371200513095, efficiency relative to py_fast = 1.9026380240273488

cy_numpy: answer = 9851.320332349133, time = 0.09388123091775924, efficiency relative to py_fast = 2.8304857791068843

cy_numpy_reservoir: answer = 9851.320332349133, time = 0.22499373101163656, efficiency relative to py_fast = 1.181052857975068

cy_numpy_noCheck: answer = 9851.320332349133, time = 0.09497074002865702, efficiency relative to py_fast = 2.7980143037485474

Here are the major takeaways from comparing the speedups across versions:

  1. Make sure your Python code is optimized prior to using Cython. py_fast was 4.3x more efficient than py_slow. The only major difference between the two is that py_slow uses the numpy.random module to generate stochastic realizations each day, whereas py_fast uses the basic random module. This small change led to significant gains by avoiding the extra overhead of the NumPy generator (it is also why these two versions give different answers – they use different random number generators). The point here is that major gains can often be made with very little work by finding and eliminating bottlenecks. As mentioned above, code profiling can be extremely helpful to make sure that you have eliminated major bottlenecks prior to implementing larger changes in Cython. Note that py_fast, rather than py_slow, is used as the baseline comparison for all other versions. (For any of you thinking I should have just used NumPy to pre-generate an entire array of random numbers rather than generating them one-by-one each day – yes, you are right, that would probably have been faster. But I set it up like this to provide a more challenging example of an object-oriented model with function calls between nested objects.)
  2. NumPy arrays are not always faster. Like many scientists/engineers, I tend to rely on NumPy arrays a lot, both for their convenience and because they are often praised for speed (NumPy is largely written in C, after all). However, in some cases this can actually degrade performance due to the larger overhead of creating NumPy arrays relative to lists. The py_numpy version that uses NumPy to hold the results was actually 12% less efficient than the baseline version using nested dictionaries and lists. Moreover, py_numpy_reservoir, which creates a new NumPy array to store results for each reservoir each day, is 31% less efficient than baseline. This is not to say NumPy arrays can’t be beneficial – they are extremely fast at vectorized numerical operations. But for situations with small arrays and/or minimal opportunity for vectorized computation, they may not be worth the overhead.
  3. Cythonizing Python code can improve performance. The py_cy version, which does not include static type information and is identical to py_fast except for the Cythonization, leads to a 14% efficiency gain. This is not huge, but could be meaningful and requires little extra work. Meanwhile, the cy version with statically typed Cython extension types has a much larger efficiency gain of 90%.
  4. Cython and NumPy can be synergistic if used properly. The cy_numpy version further improves performance, with an efficiency of 2.8 relative to py_fast. Interestingly, this version is identical to py_numpy other than the Cython-related changes, but is found to lead to significant gains whereas the former leads to degraded performance. Cython and memoryviews allow us to interface more directly with NumPy’s underlying C code, reducing the Python-related overhead. However, this is not universally true – cy_numpy_reservoir is found to degrade performance significantly relative to the other Cythonized versions, suggesting that the creation of NumPy arrays within the inner daily step routine is not worth the overhead.
  5. Disabling certain Python features may not be worthwhile. The last version, cy_numpy_noCheck, builds on cy_numpy by adding two compiler directives to model.run(): @boundscheck(False) and @wraparound(False). The first tells Cython to disable Python’s automatic bounds checking for lists, tuples, etc., while the second disables the ability to count backwards when indexing (e.g., arr[-3:]). These directives are meant to allow for further efficiency gains by reducing the Pythonic overhead. However, in this particular case, it isn’t found to have a meaningful impact. Since these directives can lead to mysterious bugs for users accustomed to Python syntax, they should be used with caution, and only after verifying that they actually create efficiency gains that are worth the risk.
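To make the last point concrete, here is roughly what applying these directives to a Cython function looks like (an illustrative standalone example, not the repository’s code):

# sketch: compiler directives applied to a simple cdef function
cimport cython

@cython.boundscheck(False)  # disable bounds checking on array indexing
@cython.wraparound(False)   # disable negative (wraparound) indexing
cdef double total(double[:] arr):
  cdef double s = 0.
  cdef int i
  for i in range(arr.shape[0]):
    s += arr[i]
  return s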

Overall, Cython is found to improve the efficiency of this reservoir simulation model by 2-3x on my laptop. Unfortunately, this is much lower than the 20-30x speedups observed for the Fibonacci sequence example in the last blog post. This is due to the more complex operations and object-oriented structure of the simulation model, which leads to more interactions with the Python interpreter compared to the tight numerical loop of the Fibonacci generator. However, a 2-3x efficiency gain can still be very meaningful, especially in situations involving large ensembles of expensive simulations in parallel computing environments. In general, the potential gain from Cython, and whether or not this is worth the trouble, will depend on factors such as the structure of the model, its runtime, and the skill level of model developers and users.

Further reading

If this blog post has piqued your interest and you want to apply Cython to your own research, you might try looking through more advanced water resources simulation models that make use of Cython, such as Pywr or the California Food-Energy-Water System (CALFEWS).

For more general information, of course see the Cython docs. I also highly recommend Kurt W. Smith’s O’Reilly Media book, “Cython: A Guide for Python Programmers” for those wanting a deeper understanding. It looks like the author also has a tutorial, though I haven’t tried it yet. Lastly, for information on how to improve the efficiency of your Python code more generally (prior to or in lieu of using Cython), I recommend Micha Gorelick and Ian Ozsvald’s O’Reilly Media book, “High Performance Python: Practical Performant Programming for Humans.”

Constructing interactive Ipywidgets: demonstration using the HYMOD model

Last week, Lillian and I published the first post in a series of training posts studying the “Fisheries Game,” which is a decision-making problem within a complex, non-linear, and uncertain ecological context.

In preparing for that post, I learned about the Ipywidgets Python library for widget construction. It stood out to me as a tool for highlighting the influence of parametric uncertainty on model performance. More broadly, I think it has great potential as an educational or data-narrative device.

This blog post is designed to highlight this potential and provide a basic introduction to the library. I provide a tutorial demonstration of how an interactive widget is constructed, this time using the HYMOD rainfall-runoff model.

This post is intended to be viewed through a Jupyter Notebook for interaction, which can be accessed through a Binder at this link!

The Binder was built with an internal environment specification, so it should not be necessary to install any packages on your local machine! Because of this, it may take a minute to load the page.

Alternatively, you can pull the source code and run the Jupyter Notebook from your local machine. All of the source code is available in a GitHub repository: Ipywidget_Demo_Interactive_HYMOD.

If using your local machine, you will first need to install the Ipywidget library:

pip install ipywidgets

Let’s begin!

HYMOD Introduction

HYMOD is a conceptual rainfall-runoff model. Given some observed precipitation and evaporation, a parameterized HYMOD model simulates the resulting down-basin runoff.

This post does not focus on specific properties or performance of the HYMOD model, but rather uses the model as a demonstration of the utility of the Ipywidget library.

I chose to use the HYMOD model here because it is commonly taught in introductory hydrologic modeling courses, so this demonstration shows how an Ipywidget can be used in an educational context. The resulting widget allows students to interact with the model behavior in real time, by adjusting parameter values and visualizing the changes in the resulting streamflow.

If you are interested in the technical details of implementing the HYMOD model, you can dig into the source code, available (and thoroughly commented/descriptive) in the repository for this post: Ipywidget_Demo_Interactive_HYMOD.

HYMOD represents surface flow as a series of several quick-flow reservoirs. Groundwater flow is represented as a single slow-flow reservoir. The reservoirs have constant flow rates, with the quick-flow reservoir rate, Kq, being greater than the slow-flow reservoir rate, Ks.

Image source: Sun, Wenchao & Ishidaira, Hiroshi & Bastola, Satish. (2010)
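Conceptually, the routing component can be sketched as a cascade of linear reservoirs, each releasing a constant fraction of its storage per day. Below is a simplified illustration of this idea (the function and variable names are mine, not the repository’s implementation):

# sketch: one day of HYMOD-style routing through the quick and slow reservoirs
def route_daily(quick_stores, slow_store, quick_in, slow_in, Kq, Ks):
    inflow = quick_in
    for i in range(len(quick_stores)):     # cascade of N quick-flow reservoirs
        quick_stores[i] += inflow
        inflow = Kq * quick_stores[i]      # constant-rate release
        quick_stores[i] -= inflow
    slow_store += slow_in                  # single slow-flow reservoir
    slow_out = Ks * slow_store
    slow_store -= slow_out
    return quick_stores, slow_store, inflow + slow_out   # runoff = quick + slow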

HYMOD Parameters:

Like any hydrologic model, HYMOD’s performance depends upon the specified parameter values. There are several parameters that can be adjusted:

  • Cmax: Max soil moisture storage (mm) [10-2000]
  • B: Distribution of soil stores [0.0 – 7.0]
  • Alpha: Division between quick/slow routing [0.0 – 1.0]
  • Kq: Quick flow reservoir rate constant (day^-1) [0.15 – 1.0]
  • Ks: Slow flow reservoir rate constant (day^-1) [0.0 – 0.15]
  • N: The number of quick-flow reservoirs.

Interactive widget demonstration

I’ve constructed an Ipywidgets object which allows a user to visualize the impact of the HYMOD model parameters on the resulting simulation timeseries. The user also has the option to select from three different error metrics, which are displayed in the plot, and to toggle the observed timeseries plot on and off.

Later in this post, I will give detail on how the widget was created.

Before providing the details, I want to show the widget in action so that you know what to expect for the final product.

The gif below shows the widget in-use:

Demonstration of the HYMOD widget.

Ipywidgets Introduction

The Ipywidgets library allows for highly customized widgets, like the one above. As with any new tool, I’d recommend you check out the documentation here.

Below, I walk through the process of generating the widget shown above.

Let’s begin!

Import the library

# Import the library
import ipywidgets as widgets

Basic widget components

Consider an Ipywidget as being an arrangement of modular components.

The tutorial walks through the construction of five key widget components:

  1. Variable slider
  2. Drop-down selectors
  3. Toggle buttons
  4. Label objects
  5. Interactive outputs (used to connect the plot to the other components)

In the last section, I show how all of these components can be arranged together to construct the unified widget.

Sliders

Sliders are one of the most common ipywidget tools. They allow for manual manipulation of a variable value. The slider is an object that can be passed to the interactive widget (more on this further down).

For my HYMOD widget, I would like to be able to manipulate each of the model parameters listed above. I begin by constructing a slider object for each of the variables.

Here is an example, for the C_max variable:

# Construct the slider
Cmax_slider = widgets.FloatSlider(value = 500, min = 10, max = 2000, step = 1.0, description = "C_max",
                                  disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')


# Display the slider
display(Cmax_slider)

Notice that each slider receives a specified min, max, and step corresponding to the possible values. For the HYMOD demo, I am using the parameter ranges specified in Herman, J.D., P.M. Reed, and T. Wagener (2013), Time-varying sensitivity analysis clarifies the effects of watershed model formulation on model behavior.

I will construct the sliders for the remaining parameters below. Notice that I don’t assign the description parameter in any of these sliders… this is intentional. Later in this tutorial I will show how to arrange the sliders with Label() objects for a cleaner widget design.

# Construct remaining sliders
Cmax_slider = widgets.FloatSlider(value = 100, min = 10, max = 2000, step = 1.0, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
B_slider = widgets.FloatSlider(value = 2.0, min = 0.0, max = 7.0, step = 0.1, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
Alpha_slider = widgets.FloatSlider(value = 0.30, min = 0.00, max = 1.00, step = 0.01, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
Kq_slider = widgets.FloatSlider(value = 0.33, min = 0.15, max = 1.00, step = 0.01, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
Ks_slider = widgets.FloatSlider(value = 0.07, min = 0.00, max = 0.15, step = 0.01, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
N_slider = widgets.IntSlider(value = 3, min = 2, max = 7, disabled = False, continuous_update = False, readout = True)

# Place all sliders in a list
list_of_sliders = [Kq_slider, Ks_slider, Cmax_slider, B_slider, Alpha_slider, N_slider]

Dropdown

The Dropdown() allows the user to select from a set of discrete variable options. Here, I want to give the user options on which error metric to use when comparing simulated and observed timeseries.

I provide three options:

  1. RMSE: Root mean square error
  2. NSE: Nash Sutcliffe efficiency
  3. ROCE: Runoff coefficient error

See the calculate_error_by_type function inside the HYMOD_components.py script to see how these error metrics are calculated.

To provide this functionality, I define the Dropdown() object, as below, with a list of options and the initial value:

# Construct the drop-down to select from different error metrics
drop_down = widgets.Dropdown(options=['RMSE','NSE','ROCE'], description='',
                                value = 'RMSE', disabled = False)

# Display the drop-down
display(drop_down)

ToggleButton

The ToggleButton() allows for a bool variable to be toggled between True and False. For my streamflow plot function, I have an option plot_observed = False which determines if the observed streamflow timeseries is shown in the figure.

# Construct the button to toggle observed data On/Off
plot_button = widgets.ToggleButton(value = False, description = 'Toggle', disabled=False, button_style='', tooltip='Description')

# Display the button
display(plot_button)

Labels

As mentioned above, I choose to not include the description argument within the slider, drop-down, or toggle objects. This is because it is common for these labels to get cut-off when displaying the widget object.

For example, take a look at this slider below, with a long description argument:

# Make a slider with a long label
long_title_slider = widgets.FloatSlider(value = 2.0, min = 0.0, max = 7.0, step = 0.1, description = 'This slider has a long label!', readout = True)

# Display: Notice how the label is cut-off!
display(long_title_slider)

The ipywidgets.Label() function provides a way of avoiding this while allowing for long descriptions. Using Label() will ultimately provide you with a lot more control over your widget layout (last section of the tutorial).

The Label() function generates a separate object. Below, I create a unique Label() object for each HYMOD parameter.

# Import the Label() function
from ipywidgets import Label

# Make a list of label strings
param_labs = ['Kq : Quick flow reservoir rate constant (1/day)',
            'Ks : Slow flow reservoir rate constant (1/day)',
            'C_max : Maximum soil moisture storage (mm)',
            'B : Distribution of soil stores',
            'Alpha : Division between quick/slow routing',
            'N : Number of quick-flow reservoirs']

# Make a list of Label() objects
list_of_labels = [Label(i) for i in param_labs]

# Display the first label, for example.
list_of_labels[0]

Interactive_output

Now that we have constructed the individual interactive components, we can connect them to the model using the interactive_output() function.

The interactive_output function takes two inputs, the function to interact with, and a dictionary of variable assignments:

interactive_output( function, {'variable_name' : variable_widget, ...} )
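As a minimal standalone illustration (hypothetical, unrelated to HYMOD), linking a single slider to a simple print function looks like this:

# minimal interactive_output example: one slider driving one function
import ipywidgets as widgets
from ipywidgets import interactive_output

def show_square(x):
    print('x squared =', x ** 2)

x_slider = widgets.FloatSlider(value = 2.0, min = 0.0, max = 10.0, step = 0.1)
out = interactive_output(show_square, {'x': x_slider})

# display the slider and the linked output
display(x_slider, out)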

I have created a custom function, plot_HYMOD_results, which:

  1. Loads 1-year of precipitation and evaporation data for the Leaf River catchment.
  2. Runs the HYMOD simulation using the provided parameter values.
  3. Calculates the error of the simulated vs. observed data.
  4. Plots the timeseries of runoff.

The source code for this function can be found in the GitHub repository for this post, or specifically here.

The function receives parameter values for each of the HYMOD parameters discussed above, a bool indicator if observed data should be plotted, and a specified error metric.

plot_HYMOD_results(C_max, B, Alpha, Ks, Kq, N_reservoirs, plot_observed = False, error_type = 'RMSE')

I have already generated widget components corresponding to each of these variables! (If you are reading the Jupyter Notebook version of this post, make sure you have run every cell above, or else the following code won’t work.)

I can now use the interactive_output function to link the widget components generated earlier with the function inputs:

# Import the interactive_output function
from ipywidgets import interactive_output

# Import my custom plotting function
from HYMOD_plots import plot_HYMOD_results

result_comparison_plot = interactive_output(plot_HYMOD_results, {'C_max' : Cmax_slider, 'B': B_slider, 'Alpha':Alpha_slider, 
                                                                 'Ks':Ks_slider, 'Kq':Kq_slider,'N_reservoirs':N_slider, 
                                                                 'plot_observed' : plot_button,'error_type': drop_down})

# Show the output
result_comparison_plot
Output generated by the interactive_output().

Displaying the interactive_output reveals only the plot, but does not include any of the widget functionality…

Despite this, the plot is still linked to the widget components generated earlier. If you don’t believe me (and are reading the Jupyter Notebook version of this post), scroll up and click the ToggleButton a few cells up, then come back and look at the plot again.

Using the interactive_output() function, rather than the other variations of the interact() function available, allows for cleaner widgets to be produced, because the arrangement of the widget components is now entirely customizable.

Keep reading for more detail on this!

Arranging widget components

Rather than using widget features one-at-a-time, Ipywidgets allows for several widgets to be arranged in a unified layout. Think of everything that has been generated previously as being a cell within a gridded widget; the best part is that each cell is linked with the others.

Once the individual widget features (e.g., sliders, buttons, drop-downs, and output plots) are defined, they can be grouped using the VBox() (vertical box) and HBox() (horizontal box) functions.

I’ve constructed a visual representation of my intended widget layout, shown below. The dashed orange boxes show those components grouped by the HBox() function, and the blue boxes show those grouped by the VBox() function.

Visual representation of the final widget layout.

Before getting started, import some of the basic layout functions:

# Import the layout functions
from ipywidgets import HBox, VBox, Layout

Before constructing the entire widget, it is good to get familiar with the basic HBox() and VBox() functionality.

Remember the list of sliders and list of labels that we created earlier?

# Stack the list of label objects vertically:
VBox(list_of_labels)

# Try the same thing with the sliders (remove comment #):
#VBox(list_of_sliders)

In the final widget, I want the column of labels to be located to the left of the column of sliders. HBox() allows these two columns to be arranged next to one another:

# Putting the columns side-by-side
HBox([VBox(list_of_labels), VBox(list_of_sliders)])

Generating the final widget

Using the basic HBox() and VBox() functions shown above, I arrange all of the widget components I’ve defined previously. I first define each row of the widget using HBox(), and finally stack the rows using VBox().

The script below will complete the arrangement, and call the final widget!

# Define specifications for the widget layout: center and wrap
box_layout = widgets.Layout(display='flex', flex_flow = 'row', align_items ='center', justify_content = 'center')

# Create the rows of the widget
title_row = Label('Select parameter values for the HYMOD model:')
slider_row = HBox([VBox(list_of_labels), VBox(list_of_sliders)], layout = box_layout)
error_menu_row = HBox([Label('Choose error metric:'), drop_down], layout = box_layout)
observed_toggle_row = HBox([Label('Click to show observed flow'), plot_button], layout = box_layout)
plot_row = HBox([result_comparison_plot], layout = box_layout)


# Combine label and slider box (row_one) with plot for the final widget
HYMOD_widget = VBox([title_row, slider_row, plot_row, error_menu_row, observed_toggle_row])


# Call the widget and have fun!
HYMOD_widget

Concluding remarks

If you’ve made it this far, thank you for reading!

I hope that you are able to find some fun/interesting/educational use for the Ipywidget skills learned in this post.

Fisheries Training 0: Exploring Predator-Prey Dynamics

Hello there, welcome to a new training series!

In a series of five posts, we will be analyzing and exploring the Fisheries Game, which is a rich predator-prey system with complex and nonlinear dynamics arising from the interactions between two fish species. This game will be used to walk through the steps of performing a broad a posteriori decision making process, including the exploration and characterization of the impacts of system uncertainty on performance outcomes. It also serves as a conceptual tool to demonstrate the importance of accounting for deep uncertainty, and the significance of its effects on system response to management actions.

A GitHub repository containing all of the necessary code for this series is available here, and will be updated as the series develops.

We will begin here (Post 0) with an introduction to the Fisheries Game, using a modified form of the Lotka-Volterra equations for predator-prey dynamics.

Toward the end of this post, we will provide an overview of where this series is going and the tools that will be used along the way.

A very quick introduction to the Lotka-Volterra system of equations

In this game, we are stakeholders for a managed fishery. Our goal is to determine a harvesting strategy which balances tradeoffs between ecological and economic objectives. Throughout this process we will consider ecological stability, harvest yield, harvest predictability, and profits while attempting to maintain the robustness of both the ecological system and economic performance under the presence of uncertainty.

The first step in our decision making process is to model population dynamics of the fish species.

\frac{dx}{dt} = bx - \alpha xy

\frac{dy}{dt} = c\alpha xy - dy

Equation 1: The base Lotka-Volterra SOE.

Equation 1 is the original Lotka-Volterra system of equations (SOE) as developed independently by Alfred Lotka in 1910, and by Vito Volterra in 1928. In this equation, x is the population density of the prey, and y is the population density of the predator. This SOE characterizes a prey-dependent predator-prey functional response, which assumes a linear relationship between prey consumption and predator growth.

Arditi and Akçakaya (1990) constructed an alternative predator-prey SOE, which assumes a non-linear functional response (i.e., predator population growth is not linear with consumption). It accounts for predation efficiency parameters such as interference between predators, the time needed to consume prey after capture, and a measure of how long it takes to convert consumed prey into new predators. This more complex SOE takes the form:

\frac{dx}{dt} = bx(1-\frac{x}{K}) - \frac{\alpha xy}{y^m + \alpha hx} - zx

\frac{dy}{dt} = \frac{c\alpha xy}{y^m + \alpha hx} - dy

Equation 2: The full predator-dependent predator-prey SOE, including predator interference (m) and harvesting (z).

The full description of the variables is as follows: x and y are the prey and predator population densities at time t respectively (where t is in years); \alpha is the rate at which the predator encounters the prey; b is the prey growth rate; c is the rate at which the predator converts prey to new predators; d is the predator death rate; h is the handling time (the time the predator needs to consume captured prey); K is the environmental carrying capacity; m is the level of predator interference; and z is the fraction of the prey population that is harvested. In this post, we will spend some time exploring the potential behavior of these equations.
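As a quick way to experiment with these dynamics yourself, the system can be integrated numerically with SciPy. The sketch below is a minimal illustration (the function and variable names are mine, not the notebook’s implementation), using the parameter values listed for Figure 1(a) further down:

# sketch: integrate the predator-dependent SOE (Equation 2) with SciPy
import numpy as np
from scipy.integrate import odeint

def fisheries_ode(P, t, a, b, c, d, h, K, m, z):
    x, y = P                                          # prey and predator densities
    consumption = a * x * y / (y**m + a * h * x)
    dx = b * x * (1 - x / K) - consumption - z * x    # growth, predation, harvest
    dy = c * consumption - d * y                      # conversion to predators, deaths
    return [dx, dy]

t = np.linspace(0, 100, 2000)                         # 100 years
params = (0.01, 0.35, 0.60, 0.20, 0.44, 1900, 0.30, 0.0)
trajectory = odeint(fisheries_ode, [900, 250], t, args = params)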

Before proceeding further, here are some key terms used throughout this post:

  • Zero isoclines: Lines on the plot indicating prey or predator population levels that result in constant population size over time (zero growth rate).
  • Global attractor: The specific value that the system tends to evolve toward, independent of initial conditions.
  • Equilibrium point: The intersection of two (or more) zero isoclines; a point or line of trajectory convergence.
    • Stable (nontrivial) equilibrium: Equilibrium at which both prey and predator exist
    • Unstable equilibrium: Global attractor is a stable limit cycle
  • Stable limit cycle: A closed (circular) trajectory
  • Trajectory: The path taken by the system given a specific set of initial conditions.
  • Functional response: The rate of prey consumption by your average predator.

Population equilibrium and stability

When beginning to consider a harvesting policy, it is important to first understand the natural stability of the system.

Population equilibrium is a necessary condition for stability. The predator and prey populations are in equilibrium if the rate of change of the populations is zero:

\frac{dx}{dt} = \frac{dy}{dt} = 0

Predator stability

Let’s first consider the equilibria of the predator population. In this case, we are only interested in the non-trivial (or co-existence) equilibria where the predator and prey populations are non-zero (y \neq 0, and x \neq 0).

\frac{dy}{dt} = \frac{c\alpha xy}{y^m + \alpha hx} -dy = 0

Here, we are interested in solving for the predator zero-isocline (x^*) which satisfies this equation. Dividing through by y (recall that y \neq 0) and re-arranging yields:

c\alpha x = d(y^m + \alpha hx)

\alpha x(c-dh) - dy^m = 0

Solving for x yields the predator zero-isocline:

x^* = \frac{dy^m}{\alpha (c - dh)}

In the context of the fisheries, we are interested in the co-existence conditions in which x^* > 0. From the zero-isocline stability equation above, it is clear that this is only true if:

c > hd

Simply put, the rate at which predators convert prey into new predators (c) must be greater than the death rate (d) scaled by the handling time (h).

Prey stability

A similar process can be followed for the derivation of the zero-isocline stability condition for the prey population. The stability condition can be determined by solving for:

\frac{dx}{dt} = bx(1-\frac{x}{K}) - \frac{\alpha xy}{y^m + \alpha hx} - zx = 0

As the derivation of the prey isocline is slightly more convoluted than the predator isocline, the details are not presented here, but are available in the Supplementary Materials of Hadjimichael et al. (2020), which is included in the GitHub repository for this post here.

Resolution of this stability condition yields the second necessary co-existence equilibrium condition:

\alpha (hK)^{1-m} < (b - z)^m
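These two conditions are straightforward to encode if you want to screen parameter sets yourself; a hypothetical helper might look like:

# sketch: check the two derived co-existence conditions for a parameter set
def coexistence_possible(a, b, c, d, h, K, m, z):
    predator_condition = c > h * d                        # predator zero-isocline
    prey_condition = a * (h * K)**(1 - m) < (b - z)**m    # prey zero-isocline
    return predator_condition and prey_condition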

Uncertainty

As indicated by the two conditions above, the stability of the predator and prey populations depends upon ecosystem properties (e.g., the availability of prey, \alpha, predator interference, m, and the carrying capacity, K) and unique species characteristics (e.g., the prey growth rate, b, and the predation efficiency parameters, c and h).

Specification of different values for these parameters results in wildly different system dynamics. Below are examples of three different population trajectories, each generated using the same predator-dependent system of equations, with different parameter values.

Figure 1: Three trajectory fields, resulting from predator-prey models with different parameter values. (a) A stable equilibrium with a single global attractor (values: a = 0.01, b = 0.35, c = 0.60, d = 0.20, h = 0.44, K = 1900, m = 0.30, z = 0.0); (b) an unstable system with limit cycles as global attractors (values: a = 0.32, b = 0.48, c = 0.40, d = 0.16, h = 0.23, K = 2400, m = 0.30, z = 0.0); (c) an unstable system with deterministic extinction of both species (values: a = 0.61, b = 0.11, c = 0.80, d = 0.18, h = 0.18, K = 3100, m = 0.70, z = 0.0).

Exploring system dynamics (interactive!)

To emphasize the significance of parameter uncertainty on the behavior of the predator-prey system, we have prepared an interactive Jupyter Notebook. Here, you have the ability to interactively change system parameter values and observe the resulting change in system behavior.

The Binder link to the interactive Jupyter Notebook prepared by Trevor Amestoy can be found here. Before the code can be run, open the ‘Terminal’, as shown in Figure 2 below.

Figure 2: Starting page of the Jupyter Notebook binder.

Install the necessary libraries by entering the following line into the terminal:

pip install numpy scipy matplotlib pandas

You’re good to go once the libraries have been loaded. Open the Part_0_Interactive_fisheries_ODEs.ipynb file in the Part_0_ODE_Dynamics/ folder.

Play around with the sliders to see what system trajectories you end up with! Notice how abruptly the population dynamics change, even with minor changes in parameter values.

The following GIFs show some potential trajectories you might observe as you vary the ranges of the variables:

Starting at a stable equilibrium

In Figure 3 below, both prey and predator eventually coexist at constant population sizes with no human harvesting.

Figure 3: The Fisheries system at a stable equilibrium quickly transitioning to an unstable equilibrium, and back again to a stable equilibrium.

However, this is a fishery with a very small initial prey population size. Here, note how quickly the system changes from one of stable equilibrium into that of unstable equilibrium with a small decrease in the value of α. Without this information, stakeholders might unintentionally over-harvest the system, causing the extinction of the prey population.

Starting at an unstable equilibrium

Next, Figure 4 below shows an unstable equilibrium with limit cycles and no human harvesting.

Figure 4: An unstable equilibrium where both the prey and predator populations oscillate in a stable limit cycle.

Figure 4 shows that a system will be in a state of unstable equilibrium when it takes very little time for the predator to consume prey, given moderately low prey population sizes and prey population growth rate. Although predators die at a relatively high rate, the prey population still experiences extinction as it is unable to replace population members lost through consumption by the predator species. This is a system in which stakeholders might instead choose to harvest the predator species.

Introducing human harvesting

Finally, Figure 5 below shows a stable equilibrium that changes when human harvesting of the prey population is introduced.

Figure 5: A stable equilibrium that transitions to instability as harvesting of the prey population increases.

This figure demonstrates a combination of system parameters that might represent a system that can be harvested in moderation. Its low prey availability is offset by a relatively high predator growth rate and relatively low conversion rate. Be that as it may, exceeding moderate harvesting rates may suddenly cause the system to transition into an unstable equilibrium, or in the worst case lead to an unexpected collapse of the population toward extinction.

Given that the ecosystem parameters are uncertain, and that small changes in the assumed parameter values can result in wildly different population dynamics, it begs the question: how can fisheries managers decide how much to harvest from the system, while maintaining sustainable population levels and avoiding system collapse?

This is the motivating question for the upcoming series!

The MSD UC eBook

To discover economic and environmental tradeoffs within the system, reveal the variables shaping the trajectories shown in Figures 3 to 5, and map regions of system vulnerability, we will need suitable tools.

In this training series, we will primarily be using the MSD UC eBook and its companion GitHub repository as the main resources for such tools. Titled ‘Addressing Uncertainty in MultiSector Dynamics Research‘, the eBook is part of an effort toward providing an open, ‘living’ toolkit of uncertainty characterization methods developed by and for the MultiSector Dynamics (MSD) community. The eBook also provides a hands-on Jupyter Notebook-based tutorial for performing a preliminary exploration of the dynamics within the Fisheries Game, which we will be expanding on in this tutorial series. We will primarily be using the eBook to introduce you to sensitivity analysis (SA) methods such as Sobol SA, sampling techniques such as Latin Hypercube Sampling, scenario discovery approaches like Logistic Regression, and their applications to complex, nonlinear problems within unpredictable system trajectories. We will also be making frequent use of the functions and software packages available on the MSD GitHub repository.

To make the most out of this training series, we highly recommend familiarizing yourself with the eBook prior to proceeding. In each of the following posts, we will also be making frequent references to sections of the eBook that are relevant to the current post.

Training sequence

In this post, we have provided an overview of the Fisheries Game. First, we introduced the basic Lotka-Volterra equations and their predator-dependent variation (Arditi and Akçakaya, 1990) used in Hadjimichael et al. (2020). We also visualized the system and explored how the system trajectories change with different parameter specifications. Next, we introduced the MSD UC eBook as a toolkit for exploring the Fisheries’ system dynamics.

Overall, this post is meant to be a gateway into a deep dive into the system dynamics of the Fisheries Game as formulated in Hadjimichael et al. (2020), using methods from the UC eBook as exploratory tools. The next few posts will set up the Fisheries Game and its objectives for optimization, explore the implications of varying significant decision variables, map parameter ranges that result in system vulnerability, and visualize these outcomes. The order for these posts is as follows:

Post 1 Problem formulation and optimization of the Fisheries Game in Rhodium

Post 2 Visualizing and exploring the Fisheries’ Pareto set and Pareto front using Rhodium and J3

Post 3 Identifying important system variables using different sensitivity analysis methods

Post 4 Mapping consequential scenarios that drive the Fisheries’ vulnerability to deep uncertainty

That’s all from us – see you in Post 1!

References

Abrams, P., & Ginzburg, L. (2000). The nature of predation: prey dependent, ratio dependent or neither? Trends in Ecology & Evolution, 15(8), 337-341. https://doi.org/10.1016/s0169-5347(00)01908-x

Arditi, R., & Akçakaya, H. R. (1990). Underestimation of Mutual Interference of Predators. Oecologia, 83(3), 358–361. http://www.jstor.org/stable/4219345

Hadjimichael, A., Reed, P., & Quinn, J. (2020). Navigating Deeply Uncertain Tradeoffs in Harvested Predator-Prey Systems. Complexity, 2020, 1-18. https://doi.org/10.1155/2020/4170453

Lotka, A. (1910). Contribution to the Theory of Periodic Reactions. The Journal Of Physical Chemistry, 14(3), 271-274. https://doi.org/10.1021/j150111a004

Reed, P.M., Hadjimichael, A., Malek, K., Karimi, T., Vernon, C.R., Srikrishnan, V., Gupta, R.S., Gold, D.F., Lee, B., Keller, K., Thurber, T.B, & Rice, J.S. (2022). Addressing Uncertainty in Multisector Dynamics Research [Book]. Zenodo. https://doi.org/10.5281/zenodo.6110623

Volterra, V. (1928). Variations and Fluctuations of the Number of Individuals in Animal Species living together. ICES Journal Of Marine Science, 3(1), 3-51. https://doi.org/10.1093/icesjms/3.1.3