Creating a collaborative research group lab manual with Jupyter Books

Motivation

Onboarding new students, staff, or collaborators can be a challenge in highly technical fields. Often, the knowledge needed to run complex experiments or computing workflows is spread across multiple individuals or teams based on their unique experiences, and training of new team members tends to occur in an ad hoc and inefficient manner. These challenges are compounded by the inevitable turnover of students, postdocs, and external collaborators in academic research settings.

Over the years, the Reed Group has developed a large bank of training materials to help streamline the onboarding process for new team members. These materials introduce concepts and tools related to water systems modeling, multi-objective evolutionary algorithms, global sensitivity analysis, synthetic streamflow generation, etc. However, these materials are still spread across a variety of sources (academic papers, GitHub repositories, blog posts, etc.) and team members, and there has been growing recognition of the need to catalogue and compile relevant resources and trainings in a more structured way.

For this reason, we have begun to create a lab manual for the Reed Group. This will include a wide variety of information relevant to new students and researchers – everything from training exercises and code snippets, to reading lists and coursework suggestions, to a code of conduct outlining our values and expectations. The goal is for this to be a collaborative, living document created and maintained by students and postdocs. Ideally, it will continue to evolve for years along with the evolving state of the art in methodology, software, and literature.

After considering a number of different platforms for constructing websites, we settled on the Jupyter Book package for Python. You can find our lab manual here, and the source code used to create it here – note that this is still very much in development, a skeleton waiting to be fleshed out. In the remainder of this blog post, I will highlight the major elements of a Jupyter Book website, using our skeleton lab manual as an example. Then in a future blog post, I will outline the Continuous Integration and Continuous Delivery (CI/CD) strategy we are using to manage versioning and platform dependency issues across multiple developers.

Intro to Jupyter Book

Jupyter Book is a Python package for creating static websites. The package is built on the popular Sphinx engine used to create documentation for many of your favorite Python packages. Sphinx was also used to create the ebook for “Addressing Uncertainty in MultiSector Dynamics Research”, as described in two recent blog posts by Rohini Gupta and Travis Thurber. The ebook was a source of inspiration for our lab manual and the reason we initially considered Sphinx-based workflows. However, Jupyter Book layers several additional functionalities on top of Sphinx. First, it supports the MyST Markdown language, which is more familiar and intuitive to most researchers than the reStructuredText format favored by Sphinx. And second, it allows for pages to be built from executable Jupyter Notebooks, a powerful tool for combining text and equations with formatted code blocks, program output, and generated figures.

The Jupyter Book documentation contains tutorials, examples, and references, and is an excellent resource for anyone looking to build their own site. The documentation itself is, of course, created using the Jupyter Book package, and interested readers can check out the source code here.

Designing the website structure

The hierarchical structure of a Jupyter Book is defined in a simple YAML-style Table of Contents file, which should be named _toc.yml. Here is the TOC for our lab manual at present:

format: jb-book
root: intro.md
parts:
- chapters:
  - file: ExamplePages/ExamplePages.md
    sections:
    - file: ExamplePages/mdExample.md
    - file: ExamplePages/nbExample.ipynb
  - file: Resources/Resources.md
    sections:
    - file: Resources/ClusterBasics.md
    - file: Resources/Computing.md
    - file: Resources/Courses.md
    - file: Resources/DataVisualization.md
    - file: Resources/LifeAtCornell.md
    - file: Resources/ReedGroupTools.md
    - file: Resources/WritingProcess.md
    - file: Resources/CitationNetworkandDiscovery.md
  - file: Training/Training.md
    sections:
    - file: Training/Schedule.md
    - file: Training/Reading.md
    - file: Training/LakeProblem.md
    - file: Training/Training_Fisheries_Part1.md
    - file: Training/Linux_MOEAs_HPC.md

The “root” defines the landing page, in this case the intro.md markdown file. That landing page links to three “chapters” called ExamplePages, Resources, and Training. Each of these chapters has its own landing page as well as multiple child “sections.” Each page can be written either as a Markdown file (.md) or a Jupyter Notebook (.ipynb).

The other important YAML file for all Jupyter Books is _config.yml:

title: Reed group lab manual
author: The Reed Group at Cornell CEE
logo: logo.png

# Force re-execution of notebooks on each build.
# See https://jupyterbook.org/content/execute.html
execute:
  execute_notebooks: force

# Define the name of the latex output file for PDF builds
latex:
  latex_documents:
    targetname: book.tex

# Add a bibtex file so that we can create citations
bibtex_bibfiles:
  - references.bib

# Information about where the book exists on the web
repository:
  url: https://github.com/reedgroup/reedgroup.github.io  # Online location of your book
  path_to_book: docs  # Optional path to your book, relative to the repository root

# Add GitHub buttons to your book
# See https://jupyterbook.org/customize/config.html#add-a-link-to-your-repository
html:
  use_issues_button: true
  use_repository_button: true

We first define our website’s title and author, as well as an image logo to display. The line “execute_notebooks: force” means that we want to reexecute all Jupyter Notebooks each time the site is built (see docs for other options). The url gives the web address where we want to host our site – in this case the GitHub Pages address associated with the GitHub repository for the site. The path_to_book defines “docs” as the folder in the repository where all source code is to be held. Finally, the last two options are used to create buttons at the top of our site that link to the GitHub repository in case readers want to browse the source code or report an issue. For now, we are using the default vanilla style, but there are many ways to customize the structural and aesthetic style of the site. You would need to point to custom style files from this configuration file – see the Jupyter Book gallery for inspiration.
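
As a hedged sketch of what that could look like, the snippet below passes two standard Sphinx options through the sphinx block of _config.yml; it assumes a hypothetical _static/custom.css file in the book folder, so check the Jupyter Book documentation for the exact keys your version supports.

# Sketch: pass styling options straight through to Sphinx
# (assumes a hypothetical _static/custom.css file alongside _config.yml)
sphinx:
  config:
    html_static_path: ["_static"]
    html_css_files: ["custom.css"]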

Building pages with Markdown and Jupyter Notebooks

Jupyter Book makes it very easy to write new pages using either Markdown or Jupyter Notebooks. For context, here is a screenshot of the site’s homepage:

The main content section for this page is built from the “root” file, intro.md:

# Welcome to our lab manual! 

```{warning}
This site is still under construction
```

The purpose of this site is to help new students and collaborators get up to speed on the research methods/tools used by the Reed Group. This page is designed and maintained by other graduate students and post docs, and is intended to serve as a living document. 

This manual was created using the Jupyter Books Python package, and is hosted with GitHub Pages. You can find our source code at https://github.com/reedgroup/reedgroup.github.io.

```{tableofcontents}```

As you can see, this uses a very human-readable and intuitive Markdown-based file structure. Jupyter Book provides simple functionality for warning labels and other emphasis boxes, as well as a Table of Contents that is automatically rendered from the _toc.yml file. The tableofcontents command can be used from anywhere in the hierarchical page tree and will automatically filter to include only children of the current page. The separate sidebar TOC will also expand to show “sections” as you navigate into different “chapters.” For example, here is the Markdown and rendered webpage for the “ExamplePages” chapter:

# Example Pages with JupyterBooks
```{tableofcontents}```

For more detailed pages, you can also apply standard Markdown syntax to add section headers, bold/italic font, code blocks, lists, Latex equations, images, etc. For example, here is ExamplePages/mdExample.md:

# Markdown example
This is an example page using just markdown

### Subsection 1
Here is a subsection

### Subsection 2
Here is another subsection. 

:::{note}
Here is a note!
:::

And here is a code block:

```
e = mc^2
```

And here comes a cute image!

![capybara and friends](capybaraFriends.jpg "Capybara and friends")

Lastly, and most importantly for purposes of building a training manual, we can create pages using Jupyter Notebooks. For example, here are two screenshots of the webpage rendered from ExamplePages/nbExample.ipynb:

As you can see, the Notebook functionality allows us to combine text and equations with rendered Python code. We can also execute Bash, R, or other programs using Jupyter Notebook’s “magic” commands. Note that the Jupyter-based website is not interactive – for that you’ll need Binder, as demonstrated in this blog post by David Gold.
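
For instance, a Notebook code cell that begins with the %%bash cell magic runs its contents through Bash instead of Python, and both the cell and its output appear on the rendered page. A minimal example cell (the listed path is just an illustration):

%%bash
# list the Markdown source files for the Resources chapter
ls Resources/*.md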

Even though the rendered site is static, each Notebook is re-executed every time we rebuild the website, which should really streamline collaborative lab manual development. For example, once we have developed a code bank of visualization examples (stay tuned!), it will be straightforward to edit the existing examples and/or add new ones, with the rendered visualizations being updated automatically rather than needing to manually upload new images. Additionally, re-executing the Notebooks each time we rebuild the site will force us to maintain the functionality of our existing code bank rather than letting portions become obsolete due to package dependencies or other issues.

Next steps

You now have the basic building blocks to create your own lab manual or ebook using a collection of YAML files, Markdown files, and Jupyter Notebooks. The last two critical steps are to actually build the static site (i.e., the HTML files) using Jupyter Book, and then host the site using GitHub Pages. I will demonstrate these steps, as well as our CI/CD strategy based on GitHub Actions, in my next blog post.
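
As a preview, those two commands typically look something like the sketch below, assuming the book source lives in a docs/ folder and that the ghp-import utility is used to push the built HTML to a gh-pages branch (our actual workflow automates this with GitHub Actions):

# build the static HTML site from the book source in docs/
jupyter-book build docs/

# publish the built site to the gh-pages branch (requires: pip install ghp-import)
ghp-import -n -p -f docs/_build/html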

Efficient hydroclimatic data accessing with HyRiver for Python

This tutorial highlights the HyRiver software stack for Python, which is a very powerful tool for acquiring large sets of data from various web services.

I have uploaded a Jupyter-Notebook version of this post here if you would like to execute it yourself.

HyRiver Introduction

The HyRiver software suite was developed by Taher Chegini who, in their own words, describes HyRiver as:

“… a software stack consisting of seven Python libraries that are designed to aid in hydroclimate analysis through web services.”

This description does not do justice to the capability of this software. Through my research I have spent significant amounts of time wrangling various datasets – making sure that dates align, or accounting for spatial misalignment of available data. The HyRiver suite streamlines this process, and makes acquisition of different data from various sources much more efficient.

Here, I am going to walk through a demonstration of how to easily access large amounts of data (streamflow, geophysical, and meteorological) for a basin of interest.

Before going through the code, I will highlight the three libraries from the HyRiver stack which I have found most useful: PyGeoHydro, PyNHD, and PyDaymet.

PyGeoHydro

PyGeoHydro allows for interaction with eight different online datasets. In this tutorial, I will only be interacting with the USGS NWIS, which provides daily streamflow data.

PyNHD

The PyNHD library is designed to interact with the National Hydrography Dataset (NHD) and the Hydro Network-Linked Data Index (NLDI).

NHDPlus (National Hydrography Dataset)

The NHD defines a high-resolution network of stream linkages, each with a unique identifier (ComID).

NLDI (Network-Linked Data Index)

The NLDI aids in the discovery of indexed information along some NHD-specified geometry (ComIDs). The NLDI essentially traverses the linkages specified by the NHD geometry and generates either local or basin-aggregated data relative to a specific linkage (ComID).

As will be seen later in the tutorial, the NLDI is able to retrieve at least 126 different types of data for a given basin…

PyDaymet

The PyDaymet GitHub repository summarizes the package as:

“[providing] access to climate data from Daymet V4 database using NetCDF Subset Service (NCSS). Both single pixel (using get_bycoords function) and gridded data (using get_bygeom) are supported which are returned as pandas.DataFrame and xarray.Dataset, respectively.”
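
As a quick illustration of the single-pixel option (which I do not use later in this tutorial), a get_bycoords request might look like the sketch below; the (longitude, latitude) point and date range are arbitrary examples:

import pydaymet as daymet

# request daily precipitation and max temperature at a single (lon, lat) point
coords = (-76.5, 42.44)                  # example point near Ithaca, NY
dates = ("2020-01-01", "2020-12-31")
point_data = daymet.get_bycoords(coords, dates, variables=["prcp", "tmax"])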

Tutorial outline:

  1. Installation
  2. Retrieving USGS Water Data
  3. Retrieving Geophysical (NLDI) Data
  4. Retrieving Daymet Data

The HyRiver repository contains various examples demonstrating the use of the various libraries. I would definitely recommend digging deeper into these examples, and the other HyRiver documentation, if this post piques your interest.

Step 0: Installation

In this tutorial, I only interact with the PyNHD, PyGeoHydro, and PyDaymet libraries, so I do not need to install the entire HyRiver suite.

If you operate through pip, you can install these libraries using:

pip install pynhd pygeohydro pydaymet

If you use Anaconda package manager, you can install these packages using:

conda install -c conda-forge pynhd pygeohydro pydaymet

For more information on installation, refer to the HyRiver GitHub repository and related documentation.

Now, onto the fun part!

Step 1: Retrieving USGS Water Data

I am beginning here because streamflow data is typically the first point of interest for most hydrologic engineers or modelers.

Personally, I have gone through the process of trying to download data manually from the USGS NWIS website… My appreciation for the USGS prevents me from saying anything too negative, but let’s just say it was not a pleasant experience.

PyGeoHydro allows for direct requests from the USGS National Water Information System (NWIS), which provides daily streamflow data from all USGS gages. The data is conveniently output as a Pandas DataFrame.

With this functionality alone, the PyGeoHydro library is worth learning.

1.1 Initialize PyGeoHydro NWIS tool

# Import common libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import the PyGeoHydro library tools
import pygeohydro as gh
from pygeohydro import NWIS, plot

# Use the national water info system (NWIS)
nwis = NWIS()

1.2 Requesting USGS Streamflow Data

The get_streamflow() function does exactly what the name entails and will retrieve daily streamflow timeseries; however, USGS gage station IDs must be provided. If you are only interested in a single location, you can enter an 8-digit gage ID number along with a specified date range to generate the data:

nwis.get_streamflow('########', dates=('YYYY-MM-DD', 'YYYY-MM-DD'))

However, I want to explore larger sets of data over an entire region. Thus, I am going to use PyGeoHydro's get_info() function to identify all gages within some region of interest.

First, I specify a region via a longitude/latitude bounding box (west, south, east, north), then I send a query which retrieves metadata on all of the gages in the specified region. In this case, I am exploring the data available near Ithaca, NY.

# Query specifications
region = (-76.7, 42.3, -76, 42.6) # Ithaca, NY

# Send a query for all gage info in the region
query = {"bBox": ",".join(f"{b:.06f}" for b in region),
         "hasDataTypeCd": "dv",
         "outputDataTypeCd": "dv"}

info_box = nwis.get_info(query)

print(f'PyGeoHydro found {len(set(info_box.site_no))} unique gages in this region.')

# [Out]: PyGeoHydro found #N unique gages in this region.

However, this info_box identifies many gages in the region which have very old or very brief data records. Knowing this, I want to filter out gages which do not have a suitable record length.

For the sake of this tutorial, I am considering data between January 1st, 2020 and December 31st, 2020.

# Specify date range of interest
dates = ("2020-01-01", "2020-12-31") 

# Filter stations to have only those with proper dates
stations = info_box[(info_box.begin_date <= dates[0]) & (info_box.end_date >= dates[1])].site_no.tolist()

# Remove duplicates by converting to a set
stations = set(stations)

Now, I am ready to use the gage IDs contained in stations to request the streamflow data!

# Retrieve the flow data
flow_data = nwis.get_streamflow(stations, dates, mmd=False)

# Remove gages with nans
flow_data = flow_data.dropna(axis = 1, how = 'any')

After removing duplicates and gages with nans, I have data from five unique gages in this region.

Additionally, PyGeoHydro has a convenient plotting feature to help quickly visualize the streamflow data.

from pygeohydro import plot

# Plot flow data summary
plot.signatures(flow_data)
Summary of flow data for the 5 gages found with PyGeoHydro.

There is a lot more to be explored in the PyGeoHydro library, but I will leave that up to the curious reader.

Step 2: Retrieving Geophysical (NLDI) Data

So, you’ve got some streamflow data but you don’t know anything about the physical watershed…

This is where the PyNHD library comes in. Using this library, I can identify the entire network upstream of a gage, then extract the NLDI data associated with the watershed linkages.

# Import the PyNHD library
import pynhd as pynhd
from pynhd import NHD
from pynhd import NLDI, WaterData

First, we can take a look at all possible local basin characteristic data that are available:

# Get list of local data types (AKA characteristics, or attributes)
possible_attributes = NLDI().get_validchars("local").index.to_list()

There are 126 characteristics available from the NLDI! These characteristics range from elevation, to reservoir capacity, to bedrock depth. Many of these are not of immediate interest to me, so I will specify a subset of select_attributes to retrieve (basin area, max elevation, and stream slope).

I then loop through all of my USGS stations for which I have data in flow_data, identifying the upstream basin linkages using NLDI().navigate_byid(). Once the basin is identified, I extract the ComID numbers for each linkage and use that number to retrieve the NLDI data of interest. I then store the data in nldi_data. This process is done by the following:

# Specify characteristics of interest
select_attributes = ['CAT_BASIN_AREA', 'CAT_ELEV_MAX', 'CAT_STREAM_SLOPE']

# Initialize a storage matrix
nldi_data = np.zeros((len(flow_data.columns), len(select_attributes)))

# Loop through all gages, and request NLDI data near each gage
for i, st in enumerate(flow_data.columns):

    # Navigate up all flowlines from gage
    flowlines = NLDI().navigate_byid(fsource = 'nwissite',
                                    fid = f'{st}',
                                    navigation="upstreamTributaries",
                                    source = 'flowlines',
                                    distance = 10)

    # Get the nearest comid
    station_comid = flowlines.nhdplus_comid.to_list()[0]

    # Source NLDI local data
    nldi_data[i,:] = NLDI().getcharacteristic_byid(station_comid, "local", char_ids = select_attributes)

So far, I have timeseries streamflow data for five locations in the Ithaca, NY area, along with the basin area, max basin elevation, and stream slope for each stream. If I can access hydro-climate data, maybe I could begin studying the relationships between streamflow and physical basin features after some rain event.

Step 3: Meteorological data

The PyDaymet library allows for direct requests of meteorological data across an entire basin.

The available data includes:

  • Minimum and maximum temperature (tmin, tmax)
  • Precipitation (prcp)
  • Vapor pressure (vp)
  • Snow-Water Equivalent (swe)
  • Shortwave radiation (srad)

All data are reported daily at a 1km x 1km resolution. Additionally, the PyDaymet library has the ability to estimate potential evapotranspiration, using various approximation methods.
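
For example, recent versions of PyDaymet expose this through an optional pet argument on the data-request functions. The sketch below assumes your installed version supports that argument and the hargreaves_samani method name, so treat both as assumptions and check the PyDaymet documentation before relying on them:

# Sketch: request precipitation plus an estimated potential evapotranspiration field
# for a basin polygon (such as the ones retrieved below with NLDI().get_basins()).
# Assumes the installed PyDaymet version supports the pet argument.
basin_met_pet = daymet.get_bygeom(geometry, dates, variables=["prcp"], pet="hargreaves_samani")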

Here, I choose to only request precipitation (prcp) and max temperature (tmax).

NOTE:
So far, the Daymet data retrieval process has been the slowest aspect of my HyRiver workflow. Due to the high resolution and the potential for large basins, this can become computationally intensive if you try to request data for many gages over long time ranges.

# Import the PyDaymet library
import pydaymet as daymet

## Specify which data to request
met_vars = ["prcp", "tmax"]
met_data_names = np.array(['mean_prcp','sd_prcp','mean_tmax','sd_tmax'])

## Initialize storage
daymet_data = np.zeros((len(flow_data.columns), len(met_data_names)))

Similar to the NLDI() process, I loop through each gage (flow_data.columns) and (1) identify the up-gage basin, (2) source the Daymet data within the basin, (3) aggregate and store the data in daymet_data.

## Loop through stations from above
for i, st in enumerate(flow_data.columns):

    # Get the up-station basin geometry
    geometry = NLDI().get_basins(st).geometry[0]

    # Source Daymet data within basin
    basin_met_data = daymet.get_bygeom(geometry, dates, variables= met_vars)

    ## Pull values, aggregate, and store
    # Mean and std dev precipitation
    daymet_data[i, 0] = np.nan_to_num(basin_met_data.prcp.values).mean()
    daymet_data[i, 1] = np.nan_to_num(basin_met_data.prcp.values).std()

    # Mean and std dev of max temperature
    daymet_data[i, 2] = np.nan_to_num(basin_met_data.tmax.values).mean()
    daymet_data[i, 3] = np.nan_to_num(basin_met_data.tmax.values).std()

daymet_data.shape

# [Out]: (5, 4)

Without having used a web browser, I have been able to access a set of physical basin characteristics, various climate data, and observed streamflow relevant to my region of interest!

Now this data can be exported to CSV and used in any other project.
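
As a minimal sketch of one way to do that, the snippet below bundles the NLDI characteristics and the aggregated Daymet statistics into a single table keyed by gage ID (the output file names are arbitrary):

import pandas as pd

# combine the NLDI characteristics and Daymet summaries into one table, indexed by gage ID
basin_summary = pd.DataFrame(np.hstack([nldi_data, daymet_data]),
                             index=flow_data.columns,
                             columns=select_attributes + list(met_data_names))

# export the basin summary and the raw streamflow records (arbitrary file names)
basin_summary.to_csv('basin_summary.csv')
flow_data.to_csv('streamflow_data.csv')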

Conclusion

I hope this introduction to HyRiver has encouraged you to go bigger with your hydroclimate data ambitions.

If you are curious to learn more, I'd recommend the HyRiver Examples, which include various in-depth Jupyter Notebook tutorials.

Citations

Chegini, Taher, et al. “HyRiver: Hydroclimate Data Retriever.” Journal of Open Source Software, vol. 6, no. 66, 27 Oct. 2021, p. 3175, 10.21105/joss.03175. Accessed 15 June 2022.

Fisheries Training Part 2 – Tradeoff Visualization and Introduction to J3

Hello there! If you're here, then you have likely successfully navigated the previous two posts in our Fisheries Training Series.

In these posts, we explored the complex dynamics of a two-species predator-prey fisheries system. We also visualized various potential scenarios of stability and collapse that result from a variety of system parameter values. We then set up the problem components, including the system parameters and their associated uncertainty ranges, the performance objectives, and the radial basis functions (RBFs) that map the current system state to a policy action.

Now, we will build off the previous posts and generate the full Pareto-approximate set of performance objectives and their associated decision variable values. We will also specify our robustness multivariate satisficing criteria (Starr, 1963) set by Hadjimichael et al. (2020) and use J3, a visualization software, to explore the tradeoff space and identify the solutions that meet these criteria.

To better follow along with our training series, please find the accompanying GitHub repository that contains all the source code here.

A brief recap on decision variables, parameters and performance objectives

In the Fisheries Training series, we describe the system using the following parameters:

  • x_{t} and y_{t}: The prey and predator population densities at time t respectively
  • \alpha: The rate at which the predator encounters the prey
  • b: The prey growth rate
  • c: The rate at which the predator converts prey to new predators
  • d: The predator death rate
  • h: The time the predator needs to consume the prey (handling time)
  • K: Environmental carrying capacity
  • m: The level of predator interaction
  • z: The fraction of prey that is harvested

Please refer to Post 0 for further details on the relevance of each parameter.

Our decision variables are the three RBF parameters: the center (c_{i}), radius (r_{i}) and weight (w_{i}) of each RBF i respectively. From Part 1, we opt to use two RBFs, where i \in [1,2], resulting in six decision variables.

Next, our objectives are as follows:

  • Objective 1: Maximize net present value (NPV)
  • Objective 2: Minimize prey-population deficit
  • Objective 3: Minimize the longest duration of consecutive low harvest
  • Objective 4: Minimize worst harvest instance
  • Objective 5: Minimize harvest variance

Detailed explanation on the formulation and Python execution of the RBFs and objectives can be found in Post 1.
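
As a refresher, below is a minimal sketch of a two-RBF harvest policy of the general Gaussian form used in direct policy search; it maps a normalized prey density to a harvest fraction, but the exact normalization and state inputs should be taken from the Post 1 code rather than from this illustration.

import numpy as np

def harvest_policy(x_norm, centers, radii, weights):
    """Sketch of a two-RBF policy mapping normalized prey density to a harvest fraction.
    Illustrative only -- see Post 1 for the exact formulation used in the optimization."""
    weights = np.array(weights) / np.sum(weights)  # normalize weights to sum to one
    z = sum(w * np.exp(-((x_norm - c) / r) ** 2)
            for c, r, w in zip(centers, radii, weights))
    return float(np.clip(z, 0.0, 1.0))             # keep the harvest fraction in [0, 1]

# example: evaluate one candidate policy (arbitrary RBF parameters) at a prey density of 0.4
print(harvest_policy(0.4, centers=[0.2, 0.7], radii=[0.3, 0.5], weights=[0.6, 0.4]))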

Now that we’ve reviewed the problem setup, let’s get to setting up the code!

Running the full problem optimization

Importing all libraries and setting up the problem

Before beginning, ensure that both Platypus and PyBorg are downloaded and installed as recommended by Post 1. Next, as previously performed, import all the necessary libraries:

# import all required libraries
from platypus import Problem, Real, Hypervolume, Generator
from pyborg import BorgMOEA
from fish_game_functions import *
from platypus import Problem, Real, Hypervolume, Generator
from pyborg import BorgMOEA
import matplotlib.pyplot as plt
import time
import random

We then define the problem by setting the number of variables (nVars), performance objectives (nObjs) and constraints (nCnstr). We also define the upper and lower bounds of each objective. The negative values associated with Objectives 1 and 4 indicate that they are to be maximized.

# Set the number of decision variables, constraints and performance objectives
nVars = 6   # Define number of decision variables
nObjs = 5   # Define number of objectives
nCnstr = 1      # Define number of decision constraints

# Define the upper and lower bounds of the performance objectives
objs_lower_bounds = [-6000, 0, 0, -250, 0]
objs_upper_bounds = [0, 1, 100, 0, 32000]

Then we initialize the algorithm (algorithm) to run over 10,000 function evaluations (nfe) with a starting population of 500 (pop_size):

# initialize the optimization
algorithm = fisheries_game_problem_setup(nVars, nObjs, nCnstr)
nfe = 10000    # number of function evaluations
pop_size = 500    # population size

Storing the Pareto-approximate objectives and decision variables

We are ready to run this (Fisheries) world! But first, we will open two text files where we will store the Pareto-approximate objectives (Fisheries2_objs.txt) and their associated decision variables (Fisheries2_vars.txt). These are the (approximately) optimal performance objective values and the RBF (c_{i}, r_{i}, w_{i}) vectors that give rise to them, as discovered by PyBorg. We also record the total amount of time it takes to optimize the Fisheries over 10,000 NFEs with a population of 500.

# open files in which to store the optimization objectives and variables
f_objs = open('Fisheries2_objs.txt', "w+")
f_vars = open('Fisheries2_vars.txt', "w+")

# get number of algorithm variables and performance objectives
nvars = algorithm.problem.nvars
nobjs = algorithm.problem.nobjs
    
# begin timing the optimization
opt_start_time = time.time()

algorithm = fisheries_game_problem_setup(nVars, nObjs, nCnstr, pop_size=int(pop_size))
algorithm.run(int(nfe))

# get the solution archive
arch = algorithm.archive[:]
for i in range(len(arch)):
    sol = arch[i]
    # write objectives to file
    for j in range(nobjs):
        f_objs.write(str(sol.objectives[j]) + " ")
    # write variables to file
    for j in range(nvars):
        f_vars.write(str(sol.variables[j]) + " ")
        
    # end this solution's row in both output files
    f_objs.write("\n")
    f_vars.write("\n")

# end timing and print optimization time 
opt_end_time = time.time()

opt_total_time = opt_end_time - opt_start_time

f_objs.close()
f_vars.close()

# print the total time to console
print(f"\nTime taken = {opt_total_time} seconds")

The optimization should take approximately 3,100 seconds or 52 minutes. When the optimization is completed, you should be able to locate the Fisheries2_objs.txt and Fisheries2_vars.txt files in the same folder where the Jupyter notebook is stored.

Post-Processing

To ensure that our output can be used in our following steps, we perform post-processing to convert the .txt files into .csv files.

import numpy as np
import pandas as pd

# convert txt files to csv 
# load the .txt files as numpy matrices
matrix_objs = np.genfromtxt('Fisheries2_objs.txt', delimiter=' ')
matrix_vars = np.genfromtxt('Fisheries2_vars.txt', delimiter=' ')

# check/reshape the matrices: each row of the output files corresponds to one solution
# the objectives matrix should have shape (n_solns, nObjs)
# the variables matrix should have shape (n_solns, nVars)
n_solns = matrix_objs.shape[0]

matrix_objs = np.reshape(matrix_objs, (n_solns,nObjs))
matrix_vars = np.reshape(matrix_vars, (n_solns,nVars))

# label the objectives and variables
objs_names = ['NPV', 'Pop_Deficit', 'Low_Harvest', 'Worst_Harvest', 'Variance']
var_names = ['c1', 'r1', 'w1', 'c2', 'r2', 'w2']

# Convert the matrices to dataframes with header names
df_objs = pd.DataFrame(matrix_objs, columns=objs_names)
df_vars = pd.DataFrame(matrix_vars, columns=var_names)

# save the processed matrices as csv files
df_objs.to_csv('Fisheries2_objs.csv', sep=',', index=False)
df_vars.to_csv('Fisheries2_vars.csv', sep=',', index=False)

You should now be able to locate the Fisheries2_objs.csv and Fisheries2_vars.csv within the same folder where you store the Jupyter Notebook.

In the following steps, we will introduce the J3 Visualization Software, which takes .csv files as inputs, to visualize and explore the tradeoff space of the fisheries problem.

Introduction to visualization with J3

J3 is an open-source app for producing and sharing high-dimensional, interactive scientific visualizations. It is part of the larger Project Platypus, a collection of libraries that aid in decision-making, optimization, and data visualization. It is influenced by D3.js, a JavaScript library for manipulating data using documents (data-driven documents; hence the name). Instead of documents, J3 manipulates data using many-dimensional plots, annotations, and animations.

There is a prior post by Antonia Hadjimichael that covers the Python implementation of J3. In this post, we will be exploring the J3 app itself.

Installing and setting up J3

To use J3, you should first install Java. Please follow the directions found on the official Java site to select the appropriate installation package for your operating system.

Next, you can install J3 in either one of two ways:

  1. Download the .zip file from the J3 GitHub repository and extract its contents into a desired location on your local machine.
  2. Install using git clone:
cd your-desired-location-path
git clone https://github.com/Project-Platypus/J3.git

You should now see a folder called ‘J3’ located in the path where you chose to extract the repository. Run the J3.exe file within the folder as shown below:

Next, we upload our Fisheries2_objs.csv file into J3:

The GIF below shows a 3D tradeoff plot that is used to demonstrate the functions that each of the toggles serves. In this 3D plot, NPV and Harvest Variance are shown on the x- and y-axes, while Worst-case Harvest is shown on the z-axis. The size of the points represents the Lowest Harvest Instance, and their colors represent the Population Size.

Other functions not shown above include:

  1. Zooming in by scrolling on your mouse or trackpad
  2. Deleting the annotations by right-clicking on them
  3. Pressing the ‘esc’ key to de-select a point of interest

Next, we can also generate accompanying 2D-scatter and parallel axis plots to this 3D tradeoff figure:

In the parallel axis plot, the direction of preference is upwards. Here, we can visualize the significant tradeoff between the net present value of the fisheries' yield and the prey-population deficit. If stakeholders wish to maximize the economic value of the fisheries, they may experience unsustainable prey population deficits. The relationship between the remaining objectives is less clear. In J3, you can move the parallel axis plot axes to better visualize the tradeoffs between two objectives:

Here, we observe that there is an additional tradeoff between minimizing the population deficit and maintaining low occurrences of low-harvest events. From this brief picture, we can observe that the main tradeoffs within the Fisheries system are between ecological objectives, such as population deficit, and economic objectives, such as net present value and harvest.

Note the Brushing tool that appears next to the parallel axis plot. This will be important as we begin our next step, and that is defining our robustness multivariate satisficing criteria.

The multivariate satisficing criteria and identifying robust solutions

The multivariate satisficing criteria are derived from Starr's domain criterion satisficing measure (Starr, 1963). In Hadjimichael et al. (2020), the multivariate satisficing criteria were selected because they allow the identification of solutions that meet stakeholders' competing requirements. In the context of Part 2, we use these criteria to identify the solutions in the Pareto-approximate set that satisfy the expectations of stakeholders. Here, the requirements are as follows:

  1. Net present value (NPV) \geq 1,500
  2. Prey-population deficit \leq 0.5
  3. Longest duration of consecutive low harvest \leq 5
  4. Worst harvest instance \geq 50
  5. Harvest variance \leq 1

Using the brushing tool to highlight only the solutions of interest, we find a pared-down version of the Pareto set. This tells us that not all optimal solutions are realistic, feasible, or satisfactory to decision-makers in the Fisheries system.
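
If you prefer a programmatic counterpart to the brushing tool, a minimal sketch of the same filter applied to the post-processed objectives dataframe is shown below. It assumes that, as suggested by the objective bounds defined earlier, the two maximization objectives (NPV and worst harvest) are stored with negated signs; flip those inequalities if your output uses a different convention.

# Sketch: apply the satisficing criteria to the Pareto-approximate objectives
# (assumes NPV and Worst_Harvest are stored as negative values because they are maximized)
satisficing = df_objs[(df_objs['NPV'] <= -1500) &
                      (df_objs['Pop_Deficit'] <= 0.5) &
                      (df_objs['Low_Harvest'] <= 5) &
                      (df_objs['Worst_Harvest'] <= -50) &
                      (df_objs['Variance'] <= 1)]

print(f'{len(satisficing)} of {len(df_objs)} solutions satisfy all five criteria')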

Conclusion

Good job with making it this far! Your accomplishments are many:

  1. You ran a full optimization of the Fisheries Problem.
  2. You downloaded, installed, and learned how to use J3 to visualize and manipulate data to explore the tradeoff space of the Fisheries system.
  3. You learned about Multivariate Satisficing Criteria to identify solution tradeoffs that are acceptable to the stakeholders within the Fisheries system.

In our next post, we will further expand on the concept of the multivariate satisficing criteria and use it to evaluate how a few of the solutions that initially satisfy stakeholder requirements perform when tested across more challenging scenarios. In the meantime, we recommend that you explore the use of J3 on alternative datasets as well, and see if you can come up with an interesting narrative based on your data!

Until then, happy visualizing!

References

Giuliani, M., Castelletti, A., Pianosi, F., Mason, E. and Reed, P., 2016. Curses, Tradeoffs, and Scalable Management: Advancing Evolutionary Multiobjective Direct Policy Search to Improve Water Reservoir Operations. Journal of Water Resources Planning and Management, 142(2).

Hadjimichael, A., Reed, P. and Quinn, J., 2020. Navigating Deeply Uncertain Tradeoffs in Harvested Predator-Prey Systems. Complexity, 2020, pp.1-18.

Starr, M., 1963. Product design and decision theory. Journal of the Franklin Institute, 276(1), p.79.

Bivariate choropleth maps

Choropleth maps are a ubiquitous tool in geospatial analysis where color is used to visualize the spatial variability of data (e.g., coloring counties based on temperature, GDP, or election results). Bivariate choropleth maps are a less common variant that use 2-dimensional color palettes to show two different types of data at once. This can be a powerful tool for visualizing spatial correlations and clustering patterns between two related variables.

This blog post will demonstrate how to create a bivariate choropleth map in Python. Data and code for this post can be found in this GitHub repository.

CalEnviroScreen dataset

CalEnviroScreen is a valuable dataset from the California Office of Environmental Health Hazard Assessment which seeks to “identify California communities that are most affected by many sources of pollution, and where people are often especially vulnerable to pollution’s effects.” It provides census tract-level data on a variety of pollution sources (ozone, PM2.5, drinking water contamination, traffic impacts, etc.) and population characteristics (asthma, cardiovascular disease, education, poverty, etc.). The dataset also includes more general aggregated indicators for “Pollution Burden” and “Population Characteristics,” and finally an overall indicator of pollution vulnerability which is the product of the previous two indicators. This data is used by various state agencies to locate disadvantaged communities that can be targeted for environmental justice programs and investments. The CalEnviroScreen 4.0 dataset is available either as an online mapping platform or a downloadable shapefile – the latter is used for this blog post.

Creating univariate choropleth maps using geopandas

This analysis requires the following Python packages: Glob for grabbing files, Matplotlib for general plotting, GenerativePy and Pillow (PIL) for manipulating colors, Geopandas for handling geospatial data, and Contextily for importing basemaps.

import glob
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
from matplotlib.collections import PatchCollection
from matplotlib.colors import rgb2hex
from generativepy.color import Color
from PIL import ImageColor
import geopandas as gpd
import contextily as cx

Once we have imported the necessary libraries, we can read the data into a Geopandas dataframe. Geopandas is a great package which allows us to read in and manipulate shapefiles as dataframes that act very similarly to Pandas dataframes while maintaining the spatial context.

### Read in CalEnviroScreen shapefile, from https://oehha.ca.gov/calenviroscreen/maps-data/download-data
shapefiles = glob.glob('*/*.shp')
ces = gpd.read_file(shapefiles[0])

If we print the columns of the dataframe, we see a large list of variables related to pollution burden and population characteristics (e.g., Ozone, Poverty) and aggregated indicators (PolBurd, PopChar, CIscore), as well as percentile scales for most of these variables (e.g., PolBurdP). The specific meanings of each variable can be found in the documentation. Several geospatial attributes are also included among the columns (e.g., Shape_Leng, geometry).

Index(['Tract', 'ZIP', 'County', 'ApproxLoc', 'TotPop19', 'CIscore',
       'CIscoreP', 'Ozone', 'OzoneP', 'PM2_5', 'PM2_5_P', 'DieselPM',
       'DieselPM_P', 'Pesticide', 'PesticideP', 'Tox_Rel', 'Tox_Rel_P',
       'Traffic', 'TrafficP', 'DrinkWat', 'DrinkWatP', 'Lead', 'Lead_P',
       'Cleanup', 'CleanupP', 'GWThreat', 'GWThreatP', 'HazWaste', 'HazWasteP',
       'ImpWatBod', 'ImpWatBodP', 'SolWaste', 'SolWasteP', 'PollBurd',
       'PolBurdSc', 'PolBurdP', 'Asthma', 'AsthmaP', 'LowBirtWt', 'LowBirWP',
       'Cardiovas', 'CardiovasP', 'Educatn', 'EducatP', 'Ling_Isol',
       'Ling_IsolP', 'Poverty', 'PovertyP', 'Unempl', 'UnemplP', 'HousBurd',
       'HousBurdP', 'PopChar', 'PopCharSc', 'PopCharP', 'Child_10',
       'Pop_10_64', 'Elderly65', 'Hispanic', 'White', 'AfricanAm', 'NativeAm',
       'OtherMult', 'Shape_Leng', 'Shape_Area', 'AAPI', 'geometry'],
      dtype='object')

After reading in the dataframe, it is straightforward to loop over our three aggregated indicator percentiles and create a separate choropleth map for each. Note that the data contain some holes, which are coded as -9999, so we only plot census tracts with values >= 0.

label_dict = {'CIscoreP': 'CalEnviroScreen score percentile', 
              'PolBurdP': 'Pollution burden percentile', 
              'PopCharP': 'Population characteristics percentile'}
alpha=0.85
dpi=300

for attr, label in label_dict.items():
    fig, ax = plt.subplots(1,1, figsize=(7,10))
    ces.loc[ces[attr] >= 0].plot(attr, ax=ax, alpha=alpha, vmin=0, legend=True, 
             legend_kwds={'label':label, 'orientation':'horizontal', 'shrink': 0.5, 'pad': 0.03, 'alpha':alpha})
    cx.add_basemap(ax = ax, crs=ces.crs, source=cx.providers.Stamen.TonerLite)
    _ = ax.set_xticks([])
    _ = ax.set_yticks([])
    plt.savefig(f'figs/{attr}.jpg', dpi=dpi, bbox_inches='tight')

Consider first the aggregated CalEnviroScreen Index. I have also included a zoomed version focusing on the Los Angeles/Orange County area (see repository for code).

Readers familiar with California’s environmental issues probably aren’t too surprised by the general pattern: the regions rated highest on the CalEnviroScreen Indicator are heavily concentrated in the Central Valley and the LA-Long Beach area. What about the underlying Pollution Burden and Population Characteristics Scores?

In general, we find that both pollution burden and vulnerable populations tend to be co-located in the Central Valley and the LA-Long Beach area. However, the data are not identical, and there are a variety of reasons why policymakers, public health experts, and researchers may want to compare the two variables and investigate their spatial correlation and clustering.

Creating a bivariate choropleth map

Is there a more effective way to compare and contrast the two variables than looking back and forth between the two figures? Enter the bivariate choropleth map. The goal is to use a two-dimensional colormap that blends two color scales together based on two sets of data. For example, a blue and red colormap would have white or grey in the lower left corner, representing low values for each variable. The upper left and lower right corners would be blue and red, representing a high value for one variable and a low value for the other. The upper right corner, where both variables take high values, would be purple, a blending of red and blue. Intermediate parts of the space can be blended to be more blue or red depending on the underlying variable values. This can be a continuous color scale, but a discrete scale with 3-5 classes per variable is more common and generally easier to decipher. See this blog post for some examples.

The code below is used to create such a 2D color grid. I chose to use the pink/cyan color map from the post cited above, which has nice contrast and avoids the good/bad connotations of the blue/red map. I retrieved the hex codes for the four corner colors from the same post. We convert each hex color to RGB, and then to a Color object from the Generativepy package. This package provides the functionality for linear interpolation (“lerping”) of colors, which we use to create a 4×4 grid of blended colors representing the quartiles of each index. Lastly, we convert everything back to hex, which is easier to use with Geopandas.

### percentile bounds defining upper boundaries of color classes
percentile_bounds = [25, 50, 75, 100]

### function to convert hex color to rgb to Color object (generativepy package)
def hex_to_Color(hexcode):
    rgb = ImageColor.getcolor(hexcode, 'RGB')
    rgb = [v/256 for v in rgb]
    rgb = Color(*rgb)
    return rgb

### get corner colors from https://www.joshuastevens.net/cartography/make-a-bivariate-choropleth-map/
c00 = hex_to_Color('#e8e8e8')
c10 = hex_to_Color('#be64ac')
c01 = hex_to_Color('#5ac8c8')
c11 = hex_to_Color('#3b4994')

### now create square grid of colors, using color interpolation from generativepy package
num_grps = len(percentile_bounds)
c00_to_c10 = []
c01_to_c11 = []
colorlist = []
for i in range(num_grps):
    c00_to_c10.append(c00.lerp(c10, 1/(num_grps-1) * i))
    c01_to_c11.append(c01.lerp(c11, 1/(num_grps-1) * i))
for i in range(num_grps):
    for j in range(num_grps):
        colorlist.append(c00_to_c10[i].lerp(c01_to_c11[i], 1/(num_grps-1) * j))

### convert back to hex color
colorlist = [rgb2hex([c.r, c.g, c.b]) for c in colorlist]

Having defined our percentile bounds and color grid, we can assign a color to each census tract based on its Pollution Burden and Population Characteristics Percentiles, and then create a map using Geopandas as before. The 2D color grid legend can be created on an inset axis using Matplotlib’s patch functionality.

### function to get bivariate color given two percentiles
def get_bivariate_choropleth_color(p1, p2):
    if p1>=0 and p2>=0:
        count = 0
        stop = False
        for percentile_bound_p1 in percentile_bounds:
            for percentile_bound_p2 in percentile_bounds:
                if (not stop) and (p1 <= percentile_bound_p1):
                    if (not stop) and (p2 <= percentile_bound_p2):
                        color = colorlist[count]
                        stop = True
                count += 1
    else:
        color = [0.8,0.8,0.8,1]
    return color

### plot map based on bivariate choropleth
fig, ax = plt.subplots(1,1, figsize=(8,10))
attr = 'CIscoreP'
ces['color_bivariate'] = [get_bivariate_choropleth_color(p1, p2) for p1,p2 in zip(ces['PolBurdP'].values, ces['PopCharP'].values)]
ces.loc[ces[attr] >= 0].plot(ax=ax, color=ces.loc[ces[attr] >= 0]['color_bivariate'], alpha=alpha, legend=False)
cx.add_basemap(ax = ax, crs=ces.crs, source=cx.providers.Stamen.TonerLite)
ax.set_xticks([])
ax.set_yticks([])

### now create inset legend
ax = ax.inset_axes([0.6,0.6,0.35,0.35])
ax.set_aspect('equal', adjustable='box')
count = 0
xticks = [0]
yticks = [0]
for i,percentile_bound_p1 in enumerate(percentile_bounds):
    for j,percentile_bound_p2 in enumerate(percentile_bounds):
        percentileboxes = [Rectangle((i,j), 1, 1)]
        pc = PatchCollection(percentileboxes, facecolor=colorlist[count], alpha=alpha)
        count += 1
        ax.add_collection(pc)
        if i == 0:
            yticks.append(percentile_bound_p2)
    xticks.append(percentile_bound_p1)

_=ax.set_xlim([0,len(percentile_bounds)])
_=ax.set_ylim([0,len(percentile_bounds)])
_=ax.set_xticks(list(range(len(percentile_bounds)+1)), xticks)
_=ax.set_xlabel('Pollution burden percentile')
_=ax.set_yticks(list(range(len(percentile_bounds)+1)), yticks)
_=ax.set_ylabel('Population characteristics percentile')

plt.savefig(f'figs/bivariate.jpg', dpi=dpi, bbox_inches='tight')

The bivariate color palette allows us to distinguish at a more granular level how census tracts compare across the two variables. Some regions are found to exhibit high pollution burden but less vulnerable communities (pink), while others display the opposite pattern (cyan). Those that score highly on both pollution burden and population characteristics (deep blue/purple) would be considered the most vulnerable to pollution and the most in need of pollution-related state programs and investments.
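
If you want to put numbers behind the visual pattern, a small cross-tabulation of the two percentile classes is one option; the sketch below reuses the ces dataframe and the same quartile bounds used for the colors.

import pandas as pd

# Sketch: count census tracts in each of the 16 bivariate color classes
valid = ces.loc[(ces['PolBurdP'] >= 0) & (ces['PopCharP'] >= 0)]
bins = [0] + percentile_bounds   # [0, 25, 50, 75, 100]
class_counts = pd.crosstab(pd.cut(valid['PolBurdP'], bins=bins),
                           pd.cut(valid['PopCharP'], bins=bins))
print(class_counts)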