Sankey diagrams for USGS gauge data in python(?)

This post was inspired by the Sankey diagram in Figure 1 of this pre-print led by Dave Gold: “Exploring the Spatially Compounding Multi-sectoral Drought Vulnerabilities in Colorado’s West Slope River Basins” (Gold, Reed & Gupta, In Review), which features a Sankey diagram of flow contributions to Lake Powell. I like the figure, and thought I’d make an effort to produce similar diagrams using USGS gauge data.

Sankey diagrams show data flows between different source and target destinations. Lots of people use them to visualize their personal or business cash flows. They’re an obvious visualization option for streamflows.

To explain the “(?)” in my title: when I started this, I quickly realized that I had to choose between two popular plotting packages: matplotlib or plotly.

I am a frequent matplotlib user and definitely appreciate the level of control it offers in the figure generation process. However, designing highly customized figures in matplotlib can sometimes be time- and code-intensive. On the other hand, in my experience, plotly tools can often produce appealing graphics with less code. I am also drawn to the fact that plotly graphics are interactive objects rather than static figures.

I decided to go with plotly to try something new. If you want to hear my complaints and thoughts on use context, you can skip to the conclusions below.

In the sections below, I provide some code which will:

  • Define a network of USGS gauge stations to include in the plot
  • Retrieve data from USGS gauge stations
  • Create a Sankey diagram using plotly showing streamflows across the network

Here, I focus on the Rio Grande upstream of Albuquerque, NM. However, you can plot a different streamflow network by modifying the dictionary of upstream nodes that defines the network.


Plotting a Sankey streamflow network with plotly

The code used here requires both plotly and the pygeohydro package (for USGS data retrieval).
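If you don’t already have them installed, both are available on PyPI under those names; in a Jupyter or Colab environment the install looks like:

!pip install plotly pygeohydro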

from pygeohydro import NWIS
import plotly.graph_objects as go

With that out of the way, we can get started.

Defining the flow network & data retrieval

I start by defining a dictionary called upstream_stations which defines the relationships between different gauges of interest.

This dictionary contains pairs of the form: {"GAUGE_ID" : ["LIST_OF", "UPSTREAM", "GAUGE_IDs"]}

If there is no upstream site, then include an empty list. For the Rio Grande network, this looks like:

# define relationships between each gauge and upstream sites
upstream_stations = {
    '08329918' : ['08319000', '08328950'], 
    '08319000' : ['08317400', '08317200'],
    '08328950' : [],
    '08317200' : [],
    '08317400' : ['08313000'],
    '08313000' : ['08290000', '08279500'],
    '08287000' : [],
    '08279500' : [],
    '08290000' : ['08287000', '08289000'],
    '08289000' : [],
}

# Get list of all stations from upstream_stations
all_stations = list(upstream_stations.keys())
for station, upstream in upstream_stations.items():
    all_stations += upstream
all_stations = list(set(all_stations))

Notice that I also made a list containing all the station IDs. I use the pygeohydro package from the HyRiver suite of tools to retrieve the gauge station data (Chegini, Li, & Leung, 2021). I often cite this package, and have written about it in a past post (“Efficient hydroclimatic data accessing with HyRiver for Python”).

Using the list of all_stations, I use the following code to pull daily streamflow data for each site from 2015-2020 (or some other specified dates):

def get_usgs_gauge_data(stations, dates):
    """
    Get streamflow data from USGS gauge stations using NWIS.
    """
    nwis = NWIS()
    df = nwis.get_streamflow(stations, dates, mmd=False)
    
    # get rid of USGS- in columns
    df.columns = df.columns.str.replace('USGS-', '')
    return df

# Get USGS flows
flows = get_usgs_gauge_data(all_stations, ('2015-01-01', '2020-12-31'))

For the Sankey diagram, we need a single flow value for each station. In this case I calculate an average of the annual total flows at each station:

# Average of the annual total flows at each station
agg_flows = flows.resample('Y').sum().agg('mean')

Creating the Sankey figure

At its core, a Sankey diagram is a visualization of a weighted network (also referred to as a graph) defined by:

  • Nodes
  • Links (aka Edges)
  • Weights

In our case, the nodes are the USGS gauge stations, the links are the connections between upstream and downstream gauges, and the weights are the average volumes of water flowing from one gauge to the next.

Each link is defined by a source and target node and a value. This is where the upstream_stations dictionary comes in. In the code block below, I set up the nodes and links, looping through upstream_stations to define all of the source-target relationships:

## Define nodes and links
# Nodes are station IDs
nodes = all_stations
node_indices = {node: i for i, node in enumerate(nodes)}

# make links based on upstream-downstream relationships
links = {
    'source': [],
    'target': [],
    'value': [],
}

# loop through upstream_stations dict
for station, upstream_list in upstream_stations.items():
    for stn in upstream_list:
        if stn in agg_flows and station in agg_flows:
            links['source'].append(node_indices[stn])
            links['target'].append(node_indices[station])
            links['value'].append(agg_flows[stn])

Lastly, I define some node labels and assign colors to each node. In this case, I want to make the nodes black if they represent reservoir releases (gauges at reservoir outlets) or blue if they are simple gauge stations.

labels = {
    '08329918' : 'Rio Grande at Alameda', 
    '08319000' : 'San Felipe Gauge',
    '08328950' : 'Jemez Canyon Reservoir',
    '08317200' : 'Santa Fe River',
    '08317400' : 'Cochiti Reservoir',
    '08313000' : 'Rio Grande at Otowi Bridge',
    '08287000' : 'Abiquiu Reservoir',
    '08279500' : 'Rio Grande',
    '08290000' : 'Rio Chama',
    '08289000' : 'Rio Ojo Caliente',
}

# Create nodes labels and colors lists
node_labels = [labels[node] for node in nodes]
node_colors = ['black' if 'Reservoir' in label else 'dodgerblue' for label in node_labels]


Finally, the function to generate the figure:

def create_sankey_diagram(node_labels, links, node_colors,
                          orientation='h',
                          size=(2000, 700)):
    """
    Create a Sankey diagram using Plotly.
    
    Parameters
    ----------
    node_labels : list
        List of node labels.
    links : dict
        Dictionary with keys 'source', 'target', and 'value'.
    node_colors : list
        List of node colors.
    orientation : str
        Orientation of the diagram, 'h' for horizontal and 'v' for vertical.
    size : tuple
        Figure (width, height) in pixels.
        
    Returns
    -------
    sankey_fig : plotly.graph_objects.Figure
        Plotly figure object.
    """
    sankey_fig = go.Figure(go.Sankey(
        orientation=orientation,
        node=dict(
            pad=70,
            thickness=45,
            line=dict(color='dodgerblue', width=0.5),
            label=node_labels,
            color=node_colors
        ),
        link=dict(
            source=links['source'],
            target=links['target'],
            value=links['value'],
            color='cornflowerblue'
        )
    ))
    
    sankey_fig.update_layout(
        title_text="Rio Grande Streamflow ",
        font=dict(size=23),
        width=size[0],
        height=size[1]
    )
    return sankey_fig

There are some options for manipulating this figure script to better suit your needs. Specifically you may want to modify:

  • pad=70 : this is the padding (in pixels) between nodes
  • thickness=45 : this is the thickness of the node elements

With our pre-prepped data from above, we can use the function like so:

sankey_fig = create_sankey_diagram(node_labels,
                                   links,
                                   node_colors,
                                   orientation='v', size=(1000, 1200))
sankey_fig

And here we have it:

I’d say it looks… okay. And admittedly this is the appearance after manipulating the node placement using the interactive interface.

It’s a bit squished vertically (which could be improved by making it a much taller figure). However, my biggest issue is that the text is difficult to read.

Changing the orientation to horizontal (orientation='h') results in a slightly better-looking figure, which makes sense, since Sankey diagrams are usually drawn horizontally. However, this does not preserve the relationship to the actual north-south flow direction of the Rio Grande, so I don’t like it as much.
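For reference, here is a sketch of how the horizontal version could be generated and exported as a standalone interactive HTML file (write_html() is plotly’s built-in HTML export; the filename is my own placeholder):

# Horizontal version of the same network, saved as an interactive HTML file
sankey_fig_h = create_sankey_diagram(node_labels, links, node_colors,
                                     orientation='h', size=(2000, 700))
sankey_fig_h.write_html('rio_grande_sankey.html')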

Conclusions

To answer the question posed by the title, “Sankey diagrams for USGS gauge data in python(?)”: Yes, sometimes. And sometimes something else.

Plotly complaints: While working on this post, I developed a few complaints with the plotly Sankey tools. Specifically:

  • It appears that the label text coloring cannot be modified. I don’t like the white edgecolor/blur effect, but could not get rid of this.
  • The font is very difficult to read… I had to make the text size very large for it to be reasonably legible.
  • You can only assign a single node thickness. I had wanted to make the reservoir nodes thicker and shrink the gauge station nodes, but it seems this cannot be done.
  • The diagrams appear low-resolution and I don’t see a way to save a high res version.

Ultimately, the plotly tools are very restrictive in the design of the graphic. However, this is a tradeoff in order to get the benefit of interactive graphics.

Plotly praise: The plotly Sankey tools have some advantages, specifically:

  • The plots are interactive
  • Plots can be saved as HTML and embedded in websites

These advantages make the plotly tools good for anyone who might want to have a dynamic and maybe frequently updated dashboard on a site.

On the other hand, if I wanted to prepare a publication-quality figure, where I had absolute control of the design elements, I’d likely turn to matplotlib. That way it could be saved as an SVG and further manipulated in a vector art program like Inkscape or Illustrator.

Thanks for reading!

References

Chegini, T., Li, H. Y., & Leung, L. R. (2021). HyRiver: Hydroclimate data retriever. Journal of Open Source Software, 6(66), 3175.


Interactive Visualization of Scientific Data and Results with Panel

This is a guest post by David Lafferty from the University of Illinois Urbana-Champaign. David is a fifth-year PhD student in the Department of Climate, Meteorology & Atmospheric Sciences working with Dr. Ryan Sriver. David’s research contributes to advances in climate change risk assessment. Dr. Casey Burleyson, a research scientist at the Pacific Northwest National Laboratory, also contributed text about the MSD-LIVE platform.

Visualization plays a crucial role in the scientific process. From an initial exploratory data analysis to the careful presentation of hard-earned results, data visualization is a powerful tool to generate hypotheses and insights, communicate findings to both scientists and stakeholders, and aid decision-making. Interactive data visualization can enhance each of these use cases and is particularly useful for facilitating collaborations and deepening user engagement. Effective data visualization has been a recurring theme of the Water Programming blog, with several past posts directly addressing the topic, including some that highlight various tools for interactive visualization [1, 2].

In this post, I’d like to highlight Panel, a data exploration and web app framework in Python.  I’ll also showcase how we used Panel to construct an interactive dashboard to accompany a recent paper, allowing readers to explore our results in more detail.

An overview of the HoloViz ecosystem, taken from the hvPlot website. Panel provides a high-level interface that allows users to build linked apps across all of these visualization libraries.

Panel is an open-source Python library that allows developers to create interactive web apps and dashboards in Python. It is part of the HoloViz ecosystem, which includes other visualization packages like HoloViews and GeoViews. Panel provides a high-level API for creating rich, interactive visualizations and user interfaces directly from Python code. The library includes an impressive number of features, such as different types of panes for displaying and arranging plots, media, text, or other external objects; a wide range of widgets; and flexible, customizable dashboard layouts. Panel itself does not provide visualization tools but instead relies on (and is designed to work well with) other popular plotting libraries such as Bokeh, Datashader, Folium, Plotly, and Matplotlib. Panel provides several deployment options for your polished apps and dashboards, but also works well within development environments such as Jupyter Notebooks, VS Code, PyCharm, or Spyder.
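To give a quick flavor of the API before building the temperature dashboard, here is a minimal sketch of a Panel app; the widget, function, and text are placeholders of my own, not part of the dashboard we build below:

import panel as pn
pn.extension()

amplitude = pn.widgets.FloatSlider(name='Amplitude', start=0, end=10, value=1)

def describe(a):
  # Return a Markdown string that reacts to the current slider value
  return f'### Current amplitude: {a}'

# Layouts accept widgets and bound functions directly
pn.Row(amplitude, pn.bind(describe, amplitude))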

A simple Panel dashboard

To get a sense of Panel’s capabilities, let’s build a simple dashboard that analyzes global temperature anomalies from different observational datasets. Global mean temperature rise above pre-industrial levels is one of the most widely used indicators of climate change, and 2023 was a record year in this respect – all major monitoring agencies recorded 2023 as the warmest year on record [3], and the exceptional warmth of 2023 fell outside the prediction intervals of many forecasting groups [4].

To visualize this temperature change, we’re going to look at four observational datasets: Berkeley Earth, HadCRUT5, NOAA, and GISTEMPv4. First, we need to import the required Python libraries:

!pip install hvplot

from functools import reduce
import pandas as pd
import hvplot.pandas

import panel as pn
pn.extension(comms='colab')

This notebook is available on Google Colab, which is why I’ve set comms='colab' in the Panel configuration settings. Next, we download and merge all observational datasets:

# HadCRUT5 anomaly relative to 1961-1990
# Link: https://www.metoffice.gov.uk/hadobs/hadcrut5/
hadcrut = pd.read_csv('https://www.metoffice.gov.uk/hadobs/hadcrut5/data/HadCRUT.5.0.2.0/analysis/diagnostics/HadCRUT.5.0.2.0.analysis.summary_series.global.annual.csv',
                      usecols=[0,1], skiprows=1, names=['Year','HadCRUT5'], index_col=0)

# GISTEMP v4 anomaly relative to 1951-1980
# Link: https://data.giss.nasa.gov/gistemp/
gistemp = pd.read_csv('https://data.giss.nasa.gov/gistemp/tabledata_v4/GLB.Ts+dSST.csv',
                      skiprows=2, usecols=[0,13], names=['Year', 'GISTEMPv4'], index_col=0, skipfooter=1, engine='python')

# NOAA anomaly relative to 1901-2000
# Link: https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series/globe/land_ocean/ytd/12
noaa = pd.read_csv('https://www.ncdc.noaa.gov/cag/global/time-series/globe/land_ocean/ytd/12/1880-2023.csv',
                   skiprows=5, names=['Year', 'NOAA'], index_col=0)

# Berkeley Earth anomaly relative to 1951-1980
# Link: https://berkeleyearth.org/data/
berkearth = pd.read_csv('https://berkeley-earth-temperature.s3.us-west-1.amazonaws.com/Global/Land_and_Ocean_summary.txt',
                        skiprows=58, usecols=[0,1], sep='\s+', names=['Year', 'BerkeleyEarth'], index_col=0)

# Merge
df = reduce(lambda x, y: pd.merge(x, y, left_index=True, right_index=True, how='outer'), [hadcrut, gistemp, noaa, berkearth])

As you can see, each of these datasets measures the global mean temperature anomaly relative to a fixed baseline, and the baselines are not constant across datasets. Climate scientists often define 1850-1900 as the canonical pre-industrial baseline, but choosing this period is a non-trivial exercise [5] and can impact the uncertainty around our estimates of climate change. Let’s explore this phenomenon interactively using Panel and the datasets downloaded above:

def adjust_baseline(obs_name, new_baseline):
  # Adjusts given observational dataset to selected new baseline
  ...

def plot_timeseries(obs_to_include, new_baseline_start, new_baseline_end):
  # Plots timeseries of anomalies for selected datasets and new baseline
  ...

def get_2023_anomaly(obs_to_include, new_baseline_start, new_baseline_end):
  # Calculates and shows 2023 anomaly for selected datasets and new baseline
  ...
The full code for each of these functions is available in the notebook. Briefly, adjust_baseline calculates a new timeseries of global temperature anomalies for a given observational dataset, now relative to a new user-defined baseline. plot_timeseries then plots these new timeseries, and the user can choose which datasets are included via the obs_to_include parameter. As I mentioned above, Panel interfaces nicely with several other plotting libraries – here I elected to use hvPlot with the Bokeh backend to add zooming and panning capabilities to the timeseries plot. Finally, get_2023_anomaly reports the 2023 temperature anomalies relative to the new baseline (where I’ve used Panel’s Number indicator to color-code the anomaly report for some added glitz).
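As a rough illustration of the first step, a minimal version of adjust_baseline might look like the sketch below; this is my own guess at the logic, not the notebook’s exact code, and it assumes new_baseline is a (start_year, end_year) tuple:

def adjust_baseline(obs_name, new_baseline):
  # Hypothetical sketch only -- the full implementation is in the notebook.
  # Re-reference one dataset's anomalies to the mean over the new baseline period,
  # where new_baseline = (start_year, end_year) and df is the merged DataFrame above.
  baseline_mean = df[obs_name].loc[new_baseline[0]:new_baseline[1]].mean()
  return df[obs_name] - baseline_mean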

Now, let’s add interactivity to these functions via Panel. First, we define a set of ‘widgets’ that allow the user to enter their chosen baseline, and which observational datasets to include in the plots:

# Widgets
new_baseline_start = pn.widgets.IntInput(name='Baseline start year', value=1981, step=5, start=1850, end=2022, width=150)
new_baseline_end = pn.widgets.IntInput(name='Baseline end year', value=2010, step=5, start=1851, end=2023, width=150)

obs_to_include = pn.widgets.CheckButtonGroup(value=obs_names, options=obs_names, orientation='vertical', width=150)
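# Note: obs_names is defined in the full notebook; presumably it is the list of dataset names, e.g. list(df.columns)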

Panel has a large number of widgets available for different use cases, so I encourage you to check those out. Then, we use Panel’s bind functionality to define reactive or bound versions of the functions above:

# Interactivity is generated by pn.bind
interactive_anomaly = pn.bind(get_2023_anomaly, obs_to_include, new_baseline_start, new_baseline_end)
interactive_plot = pn.bind(plot_timeseries, obs_to_include, new_baseline_start, new_baseline_end)

These functions will now be automatically re-run upon a user interaction. Finally, we can construct the dashboard by using a combination of Panel’s row and column layouts:

# Construct the dashboard
obs_to_include_heading = pn.pane.HTML('<label>Observational datasets to show:</label>')
anomaly_heading = pn.pane.Markdown('## 2023 anomaly:')

pn.Row(
    pn.Column(pn.WidgetBox(new_baseline_start, new_baseline_end, width=175, margin=10),
              pn.WidgetBox(pn.Column(obs_to_include_heading, obs_to_include), width=175, margin=10)),
    interactive_plot,
    pn.Column(anomaly_heading, interactive_anomaly),
    width=1000,
)

Et voila! You’ve just built your first Panel dashboard. If you run the code in the Google Colab notebook, it should look something like this:

You can try changing the baseline, zooming and panning around the plot, and toggling the datasets on/off. Notice that using an earlier baseline period typically results in a larger disagreement among the different products, reflecting the large uncertainties in early temperature measurements.  

Facilitating a deeper exploration of published results

We relied on Panel to produce a more elaborate dashboard to accompany a recent paper, Downscaling and bias-correction contribute considerable uncertainty to local climate projections in CMIP6, published in npj Climate & Atmospheric Science last year [6]. In brief, the goal of this paper was to understand the relative importance of various sources of uncertainty in local climate projections. In addition to the traditional sources of climate uncertainty, such as model uncertainty (different climate models can simulate different future climates) and internal variability (related to the chaotic nature of the Earth system), we also looked at downscaling and bias-correction uncertainty, which arises from the variety of available post-processing methods that aim to make climate model outputs more actionable (i.e., higher spatial resolution and reduced biases relative to observations). Our results were highly heterogeneous across space, time, and indicators of climate change, so it was challenging to condense everything into a few figures for the paper. As such, we developed a dashboard to allow users to explore our results in more detail by focusing in on specific locations or time periods that might be of more interest to them:

We relied on Panel for this dashboard mainly for its flexibility. Panel allowed us to include several different plot types and user interaction features, along with text boxes and figures to provide additional information. For example, users can explore the uncertainty breakdown at different locations by zooming and panning around a map plot, entering specific latitude and longitude coordinates, or selecting from a drop-down list of almost 900 world cities. As someone not well-versed in HTML or JavaScript, a major advantage provided by Panel was the ability to construct everything in Python.

Our dashboard is publicly available (https://lafferty-sriver-2023-downscaling-uncertainty.msdlive.org/) and we’d love for you to check it out! For deploying the dashboard, we worked with Carina Lansing and Casey Burleyson from PNNL to make it available on MSD-LIVE. MSD-LIVE (the MultiSector Dynamics Living, Intuitive, Value-adding, Environment) is a cloud-based flexible and scalable data and code management system combined with an advanced computing platform that will enable MSD researchers to document and archive their data, run their models and analysis tools, and share their data, software, and multi-model workflows within a robust Community of Practice. For the dashboard, we leveraged MSD-LIVE’s cloud computing capabilities to support the on-demand computing required to generate the plots. When a user clicks on the dashboard link, MSD-LIVE spins up a dedicated computing instance on the Amazon Web Services (AWS) cloud. That computing instance has the required Python packages pre-installed and is directly connected to the dataset that underpins the dashboard. Once the dashboard is closed or after a period of inactivity the dedicated computing instance is spun down. The underlying cloud computing capabilities in MSD-LIVE have also been used to support Jupyter-based interactive tutorials used to teach new people to use MSD models.

Conclusion

I hope this blog post has served as a useful introduction to Panel’s capabilities, and perhaps sparked some new ideas about how to visualize and communicate your scientific data and results. Although the temperature anomaly dashboard we created earlier was very simple, I think it demonstrates the key steps towards constructing more complicated apps: write functions to produce your desired visualizations, add interactivity via pn.bind, then arrange your dashboard using Panel’s plethora of panes, widgets, and layout features. Like any coding project, you will invariably run into unforeseen difficulties – there was definitely some trial-and-error involved in building the more complicated dashboard associated with our paper – but following this ordered approach should allow you to start creating interactive (and hopefully more effective) scientific visualizations. I would also encourage you to check out the Panel component and app galleries for yet more inspiration and demonstrations of what Panel can do.

References

  1. Trevor Amestoy, Creating Interactive Geospatial Maps in Python with Folium, Water Programming blog (2023). https://waterprogramming.wordpress.com/2023/04/05/creating-interactive-geospatial-maps-in-python-with-folium/
  2. Trevor Amestoy, Constructing interactive Ipywidgets: demonstration using the HYMOD model, Water Programming blog (2022). https://waterprogramming.wordpress.com/2022/07/20/constructing-interactive-ipywidgets-demonstration-using-the-hymod-model/
  3. Robert Rhode, Global Temperature Report for 2023, Berkeley Earth (2024). https://berkeleyearth.org/global-temperature-report-for-2023/
  4. Zeke Hausfather, 2023’s unexpected and unexplained warming, The Climate Brink (2024). https://www.theclimatebrink.com/p/2023s-unexpected-and-unexplained
  5. Hawkins, E., et al. (2017). Estimating Changes in Global Temperature since the Preindustrial Period. Bull. Amer. Meteor. Soc., 98, 1841–1856. https://doi.org/10.1175/BAMS-D-16-0007.1
  6. Lafferty, D.C., Sriver, R.L. Downscaling and bias-correction contribute considerable uncertainty to local climate projections in CMIP6. npj Clim Atmos Sci 6, 158 (2023). https://doi.org/10.1038/s41612-023-00486-0

Runtime Visualization of MOEA with Platypus in Python

The Platypus framework provides us with a Python library to solve and analyze multi-objective problems conveniently. In this notebook, we use Platypus to solve a multi-objective optimization problem – the 3 objective version of DTLZ2 (Deb et al., 2002a) – using the evolutionary algorithm NSGA-II (Deb et al., 2002b). We generate runtime visualizations for snapshots of the algorithm to help build an understanding of how the evolutionary algorithm works. Specifically, we are going to look at the population at each generation, their ranks (a key parameter NSGA-II uses to select offspring), the parallel coordinates plot of the current population, and runtime indicators such as hypervolume and generational distance.

Setting up the problem

I created a Google Colab for this post. Since this post is equivalent to the Google Colab file, you may choose to look at either one. This project is intended to be used in a Jupyter Notebook environment with Python.

First, we import the Python libraries we need for the visualization.

import numpy as np
import math
import pandas as pd
from pandas.plotting import parallel_coordinates
import random
from tqdm.notebook import tqdm

import matplotlib.pyplot as plt
from matplotlib import animation, rc, rcParams
rc('animation', html='jshtml')

from mpl_toolkits.mplot3d import Axes3D

!pip install platypus-opt
from platypus import (NSGAII, NSGAIII, DTLZ2, Hypervolume, EpsilonBoxArchive,
                      Solution, GenerationalDistance, InvertedGenerationalDistance,
                      EpsilonIndicator, Spacing)

Our goal is to visualize the population at each generation of the evolutionary algorithm. To do so, we utilize an interface provided by the Platypus library: the callback function.

At each iteration of the algorithm, the callback function (if initialized) is called. We define our callback function to store the current number of function evaluations (algorithm.nfe) and all the data points in the current population (algorithm.result). Each member of the population has the type Solution defined in Platypus, which contains not only the values of the variables but also solver-specific attributes such as ranks and crowding distances. We will access some of these attributes during visualization.

We also define how frequently (in NFE) we want to store values. Saving a snapshot of the algorithm at every single iteration may be too expensive or unnecessary. Hence, we set a frequency, and the callback function does nothing unless the current NFE has advanced by that amount since the last stored snapshot.

In the following example, we set up the problem DTLZ2 and use the callback function to solve it using NSGAII.

#define the frequency
frequency = 200

# define the problem definition
problem = DTLZ2(3)

# instantiate the optimization algorithm
# (note: divisions_outer is an NSGAIII parameter, so it is omitted here for NSGAII)
algorithm = NSGAII(problem)

# define callback function
solutions_list = []
hyp = []
nfe = []
last_calc = 0

def DTLZ2_callback(algorithm):
  global last_calc
  if algorithm.nfe == last_calc + frequency:
    last_calc = algorithm.nfe
    nfe.append(algorithm.nfe)
    solutions_list.append(algorithm.result)

# optimize the problem using 10,000 function evaluations
algorithm.run(10000,DTLZ2_callback)

In order to calculate the metrics, we first need to define our reference set. For DTLZ2(3), our points lie on the unit sphere in the first octant. Let’s randomly select 1000 such points and plot them.

# generate the reference set for 3D DTLZ2
reference_set = EpsilonBoxArchive([0.02, 0.02, 0.02])

for _ in range(1000):
    solution = Solution(problem)
    solution.variables = [random.uniform(0,1) if i < problem.nobjs-1 else 0.5 for i in range(problem.nvars)]
    solution.evaluate()
    reference_set.add(solution)

fig_ref = plt.figure()
ax_ref = fig_ref.add_subplot(projection="3d")
ax_ref.scatter(
              [s.objectives[0] for s in reference_set],
              [s.objectives[1] for s in reference_set],
              [s.objectives[2] for s in reference_set],
              )

Given the reference set, we can now calculate the indicators for each population across iterations. The Platypus library provides us with all the functions we need to calculate the following indicators: generational distance, hypervolume, epsilon indicator, and spacing.

We initialize them in a dictionary and iterate over all saved populations to calculate the indicators.

# calculate the indicators
indicators = {"gd" : GenerationalDistance(reference_set),
              "hyp" : Hypervolume(reference_set),
              "ei" : EpsilonIndicator(reference_set),
              "sp" : Spacing()}

indicator_results = {index : [] for index in indicators}

for indicator in tqdm(indicator_results):
  for solution in tqdm(solutions_list):
    indicator_results[indicator] += [indicators[indicator].calculate(solution)]

Setting up the visualization

At this point, we have the data we need to perform runtime visualizations. We will utilize the animation.FuncAnimation function in matplotlib to create interactive animations in Jupyter Notebook (or Google Colab). The idea behind creating such animations is to first initialize a static figure, and then define an update function to let the FuncAnimation know how to visualize new data for each iteration.

We define drawframe() that does the following: (1) clear the axis, so that previous data points are wiped out; (2) draw the new data points; (3) reset the limits of data axes so that the axes are consistent across frames; (4) update new data for indicator axes.

def drawframe(n):

    # clear axes
    ax.cla()
    ax_parallel.cla()

    # save results
    result = solutions_list[n]
    crowding_distances = [s.crowding_distance if s.crowding_distance != math.inf else 0 for s in result]
    ranks = [s.rank for s in result]

    points = {
              'X': [s.objectives[0] for s in result],
              'Y': [s.objectives[1] for s in result],
              'Z': [s.objectives[2] for s in result],
              'rank': [s.rank for s in result],
              'tag' : ['tag' for s in result]
    }
    df = pd.DataFrame(points)

    # update new data points
    ax.scatter(points['X'], points['Y'], points['Z'],
              c = ranks,
              alpha = 0.5,
              linestyle="", marker="o",
              cmap=cmap,
              vmax=max_rank,
              vmin=0)
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
    ax.set_zlim(zlim)
    ax.set_title('Solve DTLZ2, NFE = ' + str(nfe[n]))

    # update the parallel coordinates plot
    parallel_coordinates(df[['X','Y','Z','tag']], 'tag', ax=ax_parallel)

    # update indicator plots
    for indicator in indicator_axes:
      indicator_ax = indicator_axes[indicator]
      indicator_ax.plot(nfe[:n],indicator_results[indicator][:n], c = 'r')
      indicator_ax.set_xlim(left = min(nfe), right=max(nfe))
      indicator_ax.set_ylim(bottom = min(indicator_results[indicator]), top = max(indicator_results[indicator]))

With the drawframe() function, we create a static figure that initializes the axes we will feed into the drawframe() function. The initialization does the following: (1) set up the subplots of the figure; (2) calculate the maximum ranks of all points in all populations to determine the color mapping; (3) load the points from the first iteration; (4) initialize the scatter plots, parallel coordinates plot, and indicator plots.

fig = plt.figure(figsize=(20,10), dpi = 70)
fig.suptitle("Runtime Visualization", fontsize=20)
fig.subplots_adjust(wspace=0.3, hspace=0.3)
ax = fig.add_subplot(2, 3, 1, projection="3d")
ax_parallel = fig.add_subplot(2,3,4)

indicator_axes = {"gd" : fig.add_subplot(2, 3, 2),
                  "hyp" : fig.add_subplot(2, 3, 3),
                  "ei" : fig.add_subplot(2, 3, 5),
                  "sp" : fig.add_subplot(2, 3, 6)}

indicator_names = {"gd" : "Generational Distance",
                  "hyp" : "Hypervolume",
                  "ei" : "Epsilon Indicator",
                  "sp" : "Spacing"}

# load the ranks of all points
all_rank = [s.rank for result in solutions_list for s in result]
max_rank = max(all_rank)

# define the colormap
cmap = plt.get_cmap('Accent', max_rank)

# load the points from the first iteration
result = solutions_list[0]
points = {
    'X': [s.objectives[0] for s in result],
    'Y': [s.objectives[1] for s in result],
    'Z': [s.objectives[2] for s in result],
    'rank': [s.rank for s in result],
    'tag' : ['tag' for s in result]
}

df = pd.DataFrame(points)

# create the scatter plot
graph = ax.scatter(
              points['X'], points['Y'], points['Z'],
              c = points['rank'],
              alpha = 0.5,
              cmap=cmap,
              linestyle="", marker="o",
              vmax=max_rank,
              vmin=0
          )

# create the parallel coordinates plot
parallel_coordinates(df[['X','Y','Z','tag']], 'tag', ax=ax_parallel)

plt.colorbar(graph, label='Rank', pad = 0.2)

# save the dimensions for later use
xlim = ax.get_xlim()
ylim = ax.get_ylim()
zlim = ax.get_zlim()
title = ax.set_title("DTLZ2 Static Figure")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")

# initialize subplots for each indicator
for indicator in indicator_axes:
  indicator_axes[indicator].plot(nfe[:0],indicator_results[indicator][:0])
  indicator_axes[indicator].set_title(indicator_names[indicator] + " vs NFE")
  indicator_axes[indicator].set_xlabel("NFE")
  indicator_axes[indicator].set_ylabel(indicator_names[indicator])

Now we are ready to create an animation. We use the FuncAnimation() function, the initialized figure, and the drawframe() function to create the animation.

In the Jupyter Notebook or Google Colab, you will be able to play the animation using the play button at the bottom of the image. By default, the animation will play in a loop. You may select “once” so that the animation freezes in the last frame. Important: Before re-generating the animation, be sure to re-run the previous initialization so that the figure is reset. Otherwise, you may see overlapping points/lines.

ani = animation.FuncAnimation(fig, drawframe, frames=len(nfe), interval=20, blit=False)
ani
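If you would like to keep a copy of the animation outside of the notebook, matplotlib’s animation API can export it, for example as a GIF via the Pillow writer (the filename and frame rate below are arbitrary choices):

# Optional: export the animation as a GIF (requires the Pillow package)
ani.save('runtime_visualization.gif', writer='pillow', fps=10)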

This visualization shows how the algorithm progresses as NFE grows. For the set of solutions, we clearly see how they converge to the reference set. Moreover, more and more points have lower ranks, indicating they are getting closer to the Pareto Front (points on the Pareto Front have rank = 0). The parallel coordinates plot shows how our solutions get narrowed down and the tradeoffs we could make. Finally, the four indicator plots track the performance of our algorithm as NFE increases. The trajectory of generational distance, hypervolume, and epsilon indicator suggests convergence.

In conclusion, the project highlights the potential of the Platypus library in Python in providing valuable insights into the progress of evolutionary algorithms, not just their final outcomes. Through the use of NSGA-II as an illustrative example, we have demonstrated the ability to monitor the ranks of points across generations. In Dave’s post, the runtime visualizations revealed the changing probabilities of variators across iterations. These findings emphasize the power of incorporating dynamic techniques to gain a comprehensive understanding of the runtime behavior of MOEA algorithms. I hope this project opens doors to further explore, visualize, and analyze the dynamics of evolutionary algorithms.

References

[1] Deb, K., Thiele, L., Laumanns, M., & Zitzler, E. (2002a). Scalable multi-objective optimization test problems. In Proceedings of the 2002 Congress on Evolutionary Computation. CEC’02 (Cat. No. 02TH8600) (Vol. 1, pp. 825-830). IEEE.

[2] Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. A. M. T. (2002b). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE transactions on evolutionary computation, 6(2), 182-197.

[3] Gold, D. (2020, May 6). Beyond Hypervolume: Dynamic visualization of MOEA runtime. Water Programming: A Collaborative Research Blog. https://waterprogramming.wordpress.com/2020/05/06/beyond-hypervolume-dynamic-visualization-of-moea-runtime/

[4] Python – How to create 3D scatter animations – Stack Overflow https://stackoverflow.com/questions/41602588/how-to-create-3d-scatter-animations

[5] GitHub – Project-Platypus/Platypus: A Free and Open Source Python Library for Multiobjective Optimization https://github.com/Project-Platypus/Platypus

Figure Library Part 2 – Visualizations Supporting Synthetic Streamflow Diagnostics

Motivation

This is the second post in a series which was started by Dave last week and seeks to make a library of plotting functions for common visualization tasks. In this post, I share some plotting functions I have developed for synthetic streamflow diagnostic visualizations.

A thorough review of synthetic streamflow verification is outside the scope of this post, but in short we want to verify that the synthetic timeseries have statistical properties similar to the historic streamflow. Additionally, we want to see that the synthetic timeseries reflect the range of internal variability present in the basin by showing that a large synthetic ensemble envelops the limited historic observations.

The following functions can be used to compare the following properties of historic and synthetic streamflow:

  • Flow magnitude
  • Non-exceedance probabilities (flow duration curves)
  • Autocorrelation
  • Spatial correlation across locations

Diagnostic plotting functions

  • plot_flow_ranges()
    • Range and median of flow values in historic and synthetic flow timeseries.
  • plot_fdc_ranges()
    • The range of annual flow duration curves (FDCs) for historic and synthetic streamflow timeseries.
  • plot_autocorrelation()
    • Comparison of historic and synthetic autocorrelation across a range of lag values.
  • plot_correlation()
    • Heatmap visualizations of Pearson correlation between different flow locations in the historic data and a single synthetic realization.

These functions are designed to accept historic (Qh) and synthetic (Qs) data with the following properties:

  • Qh : A pd.Series containing a single historic streamflow timeseries. The index of this data must be type pd.DatetimeIndex.
  • Qs : A pd.DataFrame containing many synthetic streamflow timeseries. The columns should each contain a single synthetic realization and the index must be type pd.DatetimeIndex. The column names do not matter.

Additionally, each function (excluding plot_fdc_ranges()) has a timestep keyword argument which changes the temporal scale of the flow being considered. Options are ‘daily’, ‘weekly’, or ‘monthly’, and the pandas.groupby() function is used to aggregate data accordingly to allow for comparison of statistics at different temporal scales. The default is ‘daily’.
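These functions also rely on a handful of standard scientific Python imports that are not repeated in the snippets below; the assumed import block is:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr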

Streamflow data

The historic data used in this example is taken from the Delaware River Basin. Specifically, streamflow data from the USGS gauge at Lordville, NY, on the upper portion of the Delaware River is used.

The synthetic streamflow was generated using the Kirsch-Nowak [1, 2] synthetic streamflow generator. For more detail on this generator, see Julie’s post here: Open Source Streamflow Generator Part 1.

I created 100 synthetic realizations of 30-year, daily streamflow timeseries at the Lordville gauge.

The historic and synthetic streamflow timeseries are assumed to start at the same date. Most of the plotting functions below do not explicitly consider the date associated with the timeseries, however comparisons between flow statistics across some range of time assume that the timeseries are aligned for the same dates.


Flow magnitude range

This first plot simply shows the range (max, min) and median of flows for both historic and synthetic datasets.

def plot_flow_ranges(Qh, Qs, 
                     timestep = 'daily',
                     units = 'cms', y_scale = 'log',
                     savefig = False, fig_dir = '.',
                     figsize = (7,5), colors = ['black', 'orange'],
                     title_addon = ""):
    """Plots the range of flow for historic and syntehtic streamflows for a specific timestep scale.
    
    Args:
        Qh (pd.Series): Historic daily streamflow timeseries. Index must be pd.DatetimeIndex. 
        Qs (pd.DataFrame): Synthetic daily streamflow timeseries realizations. Each column is a unique realization. Index must be pd.DatetimeIndex.
        timestep (str, optional): The timestep which data should be aggregated over. Defaults to 'daily'. Options are 'daily', 'weekly', or 'monthly'.
        units (str, optional): Streamflow units, for axis label. Defaults to 'cms'.
        y_scale (str, optional): Scale of the y-axis. Defaults to 'log'.
        savefig (bool, optional): Allows for png to be saved to fig_dir. Defaults to False.
        fig_dir (str, optional): Location of saved figure output. Defaults to '.' (working directory).
        figsize (tuple, optional): The figure size. Defaults to (7,5).
        colors (list, optional): List of two colors for historic and synthetic data respectively. Defaults to ['black', 'orange'].
        title_addon (str, optional): Text to be added to the end of the title. Defaults to "".
    """

    # Assert formatting matches expected
    assert(type(Qh.index) == pd.DatetimeIndex), 'Historic streamflow (Qh) should have pd.DatetimeIndex.'
    assert(type(Qs.index) == pd.DatetimeIndex), 'Synthetic streamflow (Qs) should have pd.DatetimeIndex.'

    # Handle conditional datetime formatting
    if timestep == 'daily':
        h_grouper = Qh.index.dayofyear
        s_grouper = Qs.index.dayofyear
        x_lab = 'Day of the Year (Jan-Dec)'
    elif timestep == 'monthly':
        h_grouper = Qh.index.month
        s_grouper = Qs.index.month
        x_lab = 'Month of the Year (Jan-Dec)'
    elif timestep == 'weekly':
        h_grouper = pd.Index(Qh.index.isocalendar().week, dtype = int)
        s_grouper = pd.Index(Qs.index.isocalendar().week, dtype = int)
        x_lab = 'Week of the Year (Jan-Dec)'
    else:
        print('Invalid timestep input. Options: "daily", "monthly", "weekly".')
        return

    # Find flow ranges
    s_max = Qs.groupby(s_grouper).max().max(axis=1)
    s_min = Qs.groupby(s_grouper).min().min(axis=1)
    s_median = Qs.groupby(s_grouper).median().median(axis=1)
    h_max = Qh.groupby(h_grouper).max()
    h_min = Qh.groupby(h_grouper).min()
    h_median = Qh.groupby(h_grouper).median()
  
    ## Plotting  
    fig, ax = plt.subplots(figsize = figsize, dpi=150)
    xs = h_max.index
    ax.fill_between(xs, s_min, s_max, color = colors[1], label = 'Synthetic Range', alpha = 0.5)
    ax.plot(xs, s_median, color = colors[1], label = 'Synthetic Median')
    ax.fill_between(xs, h_min, h_max, color = colors[0], label = 'Historic Range', alpha = 0.3)
    ax.plot(xs, h_median, color = colors[0], label = 'Historic Median')
    
    ax.set_yscale(y_scale)
    ax.set_ylabel(f'{timestep.capitalize()} Flow ({units})', fontsize=12)
    ax.set_xlabel(x_lab, fontsize=12)
    ax.legend(ncols = 2, fontsize = 10, bbox_to_anchor = (0, -.5, 1.0, 0.2), loc = 'upper center')    
    ax.grid(True, linestyle='--', linewidth=0.5)
    ax.set_title(f'{timestep.capitalize()} Streamflow Ranges\nHistoric & Synthetic Timeseries at One Location\n{title_addon}')
    plt.tight_layout()
    
    if savefig:
        plt.savefig(f'{fig_dir}/flow_ranges_{timestep}.png', dpi = 150)
    return

Assuming I have loaded a pd.DataFrame named Qs which contains synthetic timeseries in each column, and a single historic timeseries Qh, I can quickly plot daily (or weekly or monthly) flow ranges using:

# Usage
plot_flow_ranges(Qh, Qs, timestep='daily')

This figure provides a lot of useful information about our synthetic ensemble:

  • The synthetic ensemble envelops the limited historic values and includes more extreme high and low flows
  • The seasonal trends in the synthetics align with the historic
  • The median of the synthetic ensemble aligns very closely with the median of the historic data.

Flow duration curve ranges

The plot_fdc_ranges() function will calculate an FDC for each year of the historic and synthetic data, and plot the ranges that these FDCs take. It also plots "total" FDCs calculated over the entirety of the historic and synthetic records.

def plot_fdc_ranges(Qh, Qs, 
                    units = 'cms', y_scale = 'log',
                    savefig = False, fig_dir = '.',
                    figsize = (5,5), colors = ['black', 'orange'],                   
                    title_addon = ""):
    """Plots the range and aggregate flow duration curves for historic and synthetic streamflows.
    
    Args:
        Qh (pd.Series): Historic daily streamflow timeseries. Index must be pd.DatetimeIndex. 
        Qs (pd.DataFrame): Synthetic daily streamflow timeseries realizations. Each column is a unique realization. Index must be pd.DatetimeIndex.
        units (str, optional): Streamflow units, for axis label. Defaults to 'cms'.
        y_scale (str, optional): Scale of the y-axis. Defaults to 'log'.
        savefig (bool, optional): Allows for png to be saved to fig_dir. Defaults to False.
        fig_dir (str, optional): Location of saved figure output. Defaults to '.' (working directory).
        figsize (tuple, optional): The figure size. Defaults to (5,5).
        colors (list, optional): List of two colors for historic and synthetic data respectively. Defaults to ['black', 'orange'].
        title_addon (str, optional): Text to be added to the end of the title. Defaults to "".
    """
    
    ## Assertions
    assert(type(Qs) == pd.DataFrame), 'Synthetic streamflow should be type pd.DataFrame.'
    assert(type(Qh.index) == pd.DatetimeIndex), 'Historic streamflow (Qh) should have pd.DatetimeIndex.'
    assert(type(Qs.index) == pd.DatetimeIndex), 'Synthetic streamflow (Qs) should have pd.DatetimeIndex.'

    
    # Calculate FDCs for total period and each realization
    nonexceedance = np.linspace(0.0001, 0.9999, 50)
    s_total_fdc = np.quantile(Qs.values.flatten(), nonexceedance)
    h_total_fdc = np.quantile(Qh.values.flatten(), nonexceedance) 
    
    s_fdc_max = np.zeros_like(nonexceedance)
    s_fdc_min = np.zeros_like(nonexceedance)
    h_fdc_max = np.zeros_like(nonexceedance)
    h_fdc_min = np.zeros_like(nonexceedance)

    annual_synthetics = Qs.groupby(Qs.index.year)
    annual_historic = Qh.groupby(Qh.index.year)

    for i, quant in enumerate(nonexceedance):
            s_fdc_max[i] = annual_synthetics.quantile(quant).max().max()
            s_fdc_min[i] = annual_synthetics.quantile(quant).min().min()
            h_fdc_max[i] = annual_historic.quantile(quant).max()
            h_fdc_min[i] = annual_historic.quantile(quant).min()
    
    ## Plotting
    fig, ax = plt.subplots(figsize=figsize, dpi=200)
    
    ax.fill_between(nonexceedance, s_fdc_min, s_fdc_max, color = colors[1], label = 'Synthetic Annual FDC Range', alpha = 0.5)
    ax.fill_between(nonexceedance, h_fdc_min, h_fdc_max, color = colors[0], label = 'Historic Annual FDC Range', alpha = 0.3)

    ax.plot(nonexceedance, s_total_fdc, color = colors[1], label = 'Synthetic Total FDC', alpha = 1)
    ax.plot(nonexceedance, h_total_fdc, color = colors[0], label = 'Historic Total FDC', alpha = 1, linewidth = 2)

    ax.grid(True, linestyle='--', linewidth=0.5)
    ax.set_yscale(y_scale)
    ax.set_ylabel(f'Flow ({units})')
    ax.set_xlabel('Non-Exceedance Probability')
    ax.legend(fontsize= 10)
    
    plt.title(f'Flow Duration Curve Ranges\nHistoric & Synthetic Streamflow\n{title_addon}')
    if savefig:
        plt.savefig(f'{fig_dir}/flow_duration_curves_{title_addon}.png', dpi=200)
    plt.show()
    return

## Usage
plot_fdc_ranges(Qh, Qs)

Similar to the flow range plot above, we see that the synthetic ensemble is nicely enveloping the historic data; there are synthetic instances where the extreme flows in a particular year are more extreme than historic observations. However, the FDC of the total synthetic ensemble (if all synthetics were combined) is almost exactly equal to the total historic FDC.

Autocorrelation

Hydrologic systems are known to have memory, or correlation between streamflow at different points in time. We want to make sure that our synthetic streamflow timeseries have similar memory or autocorrelation.

The plot_autocorrelation() function will calculate autocorrelation for the historic record, and each synthetic realization for a range of different lag values. In addition to the required Qh and Qs arguments this function requires lag_range which should be a list or array of lag values to be considered.

def plot_autocorrelation(Qh, Qs, lag_range, 
                         timestep = 'daily',
                         savefig = False, fig_dir = '.',
                         figsize=(7, 5), colors = ['black', 'orange'],
                         alpha = 0.3):
    """Plot autocorrelation of historic and synthetic flow over some range of lags.

    Args:
        Qh (pd.Series): Historic daily streamflow timeseries. Index must be pd.DatetimeIndex. 
        Qs (pd.DataFrame): Synthetic daily streamflow timeseries realizations. Each column is a unique realization. Index must be pd.DatetimeIndex.
        lag_range (iterable): A list or range of lag values to be used for autocorrelation calculation.
        
        timestep (str, optional): The timestep which data should be aggregated over. Defaults to 'daily'. Options are 'daily', 'weekly', or 'monthly'.
        savefig (bool, optional): Allows for png to be saved to fig_dir. Defaults to False.
        fig_dir (str, optional): Location of saved figure output. Defaults to '.' (working directory).
        figsize (tuple, optional): The figure size. Defaults to (7,5).
        colors (list, optional): List of two colors for historic and synthetic data respectively. Defaults to ['black', 'orange'].
        alpha (float, optional): The opacity of synthetic data. Defaults to 0.3.
    """
    
    ## Assertions
    assert(type(Qs) == pd.DataFrame), 'Synthetic streamflow should be type pd.DataFrame.'
    assert(type(Qh.index) == pd.DatetimeIndex), 'Historic streamflow (Qh) should have pd.DatetimeIndex.'
    assert(type(Qs.index) == pd.DatetimeIndex), 'Synthetic streamflow (Qs) should have pd.DatetimeIndex.'

    if timestep == 'monthly':
        Qh = Qh.resample('MS').sum()
        Qs = Qs.resample('MS').sum()
        time_label = 'months'
    elif timestep == 'weekly':
        Qh = Qh.resample('W-SUN').sum()
        Qs = Qs.resample('W-SUN').sum()
        time_label = f'weeks'
    elif timestep == 'daily':
        time_label = f'days'
        
    # Calculate autocorrelations
    autocorr_h = np.zeros(len(lag_range))
    confidence_autocorr_h = np.zeros((2, len(lag_range)))
    for i, lag in enumerate(lag_range):
        h_corr = pearsonr(Qh.values[:-lag], Qh.values[lag:])
        autocorr_h[i] = h_corr[0]
        confidence_autocorr_h[0,i] = h_corr[0] - h_corr.confidence_interval().low
        confidence_autocorr_h[1,i] = h_corr.confidence_interval().high - h_corr[0]
    
    autocorr_s = np.zeros((Qs.shape[1], len(lag_range)))
    for i, realization in enumerate(Qs.columns):
        autocorr_s[i] = [pearsonr(Qs[realization].values[:-lag], Qs[realization].values[lag:])[0] for lag in lag_range]


    ## Plotting
    fig, ax = plt.subplots(figsize=figsize, dpi=200)

    # Plot autocorrelation for each synthetic timeseries
    for i, realization in enumerate(Qs.columns):
        if i == 0:
            ax.plot(lag_range, autocorr_s[i], alpha=1, color = colors[1], label=f'Synthetic Realization')
        else:
            ax.plot(lag_range, autocorr_s[i], color = colors[1], alpha=alpha, zorder = 1)
        ax.scatter(lag_range, autocorr_s[i], alpha=alpha, color = 'orange', zorder = 2)

    # Plot autocorrelation for the historic timeseries
    ax.plot(lag_range, autocorr_h, color=colors[0], linewidth=2, label='Historic', zorder = 3)
    ax.scatter(lag_range, autocorr_h, color=colors[0], zorder = 4)
    ax.errorbar(lag_range, autocorr_h, yerr=confidence_autocorr_h, 
                color='black', capsize=4, alpha=0.75, label ='Historic 95% CI')

    # Set labels and title
    ax.set_xlabel(f'Lag ({time_label})', fontsize=12)
    ax.set_ylabel('Autocorrelation (Pearson)', fontsize=12)
    ax.set_title('Autocorrelation Comparison\nHistoric Timeseries vs. Synthetic Ensemble', fontsize=14)

    ax.legend(fontsize=10)
    ax.grid(True, linestyle='--', linewidth=0.5)
    plt.subplots_adjust(left=0.12, right=0.95, bottom=0.12, top=0.92)

    if savefig:
        plt.savefig(f'{fig_dir}/autocorrelation_plot_{timestep}.png', dpi=200, bbox_inches='tight')
    plt.show()
    return

Let’s visualize the autocorrelation of daily streamflow for a maximum of 100-day lag.

## Usage
plot_autocorrelation(Qh, Qs, lag_range=range(1,100, 5), timestep='daily') 

Spatial Correlation

The plot_correlation() function generates a heatmap visualization of the Pearson correlation matrix for streamflow at different locations in both the historic and synthetic datasets.

The plot_correlation() function is different from the other three functions shown above, since it requires a set of timeseries data for multiple different locations.

Rather than providing a set of many different realizations for a single site, we need to provide a single realization of synthetics across all sites of interest, with a timeseries for each site in each column. This new input is Qs_i, and it must have the same columns as Qh (both must contain a timeseries for each location).

def plot_correlation(Qh, Qs_i,
                        timestep = 'daily',
                        savefig = False, fig_dir = '.',
                        figsize = (8,4), color_map = 'BuPu',
                        cbar_on = True):
    """Plots the side-by-side heatmaps of flow correlation between sites for historic and synthetic multi-site data.
    
    Args:
        Qh (pd.DataFrame): Historic daily streamflow timeseries at many different locations. The columns of Qh and Qs_i must match, and index must be pd.DatetimeIndex. 
        Qs_i (pd.DataFrame): A single synthetic realization of daily streamflow timeseries at the same locations. The columns of Qh and Qs_i must match, and index must be pd.DatetimeIndex.
        timestep (str, optional): The timestep which data should be aggregated over. Defaults to 'daily'. Options are 'daily', 'weekly', or 'monthly'.
        savefig (bool, optional): Allows for png to be saved to fig_dir. Defaults to False.
        fig_dir (str, optional): Location of saved figure output. Defaults to '.' (working directory).
        figsize (tuple, optional): The figure size. Defaults to (4,4).
        color_map (str, optional): The colormap used for the heatmaps. Defaults to 'BuPu'.
        cbar_on (bool, optional): Indicator of whether the colorbar should be shown or not. Defaults to True.
    """
    
    ## Assertions
    assert(type(Qh) == type(Qs_i) and (type(Qh) == pd.DataFrame)), 'Both inputs Qh and Qs_i should be pd.DataFrames.'
    assert(np.mean(Qh.columns == Qs_i.columns) == 1.0), 'Historic and synthetic data should have same columns.'
    
    if timestep == 'monthly':
        Qh = Qh.resample('MS').sum()
        Qs_i = Qs_i.resample('MS').sum()
    elif timestep == 'weekly':
        Qh = Qh.resample('W-SUN').sum()
        Qs_i = Qs_i.resample('W-SUN').sum()
    
    ## Plotting
    fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=figsize, dpi = 250)
    
    # Heatmap of historic correlation
    h_correlation = np.corrcoef(Qh.values.transpose())
    sns.heatmap(h_correlation, ax = ax1, 
                square = True, annot = False,
                xticklabels = 5, yticklabels = 5, 
                cmap = color_map, cbar = False, vmin = 0, vmax = 1)

    # Synthetic correlation
    s_correlation = np.corrcoef(Qs_i.values.transpose())
    sns.heatmap(s_correlation, ax = ax2,
                square = True, annot = False,
                xticklabels = 5, yticklabels = 5, 
                cmap = color_map, cbar = False, vmin = 0, vmax = 1)
    
    for axi in (ax1, ax2):
        axi.set_xticklabels([])
        axi.set_yticklabels([])
    ax1.set(xlabel = 'Site i', ylabel = 'Site j', title = 'Historic Streamflow')
    ax2.set(xlabel = 'Site i', ylabel = 'Site j', title = 'Synthetic Realization')
    title_string = f'Pearson correlation across different streamflow locations\n{timestep.capitalize()} timescale\n'
        
    # Adjust the layout to make room for the colorbar
    if cbar_on:
        plt.tight_layout(rect=[0, 0, 0.9, 0.85])
        cbar_ax = fig.add_axes([0.92, 0.1, 0.02, 0.6])  # [left, bottom, width, height]
        cbar = fig.colorbar(ax1.collections[0], cax=cbar_ax)
        cbar.set_label("Pearson Correlation")
    plt.suptitle(title_string, fontsize = 12)
    
    if savefig:
        plt.savefig(f'{fig_dir}/spatial_correlation_comparison_{timestep}.png', dpi = 250)
    plt.show()
    return

To use this function, I need to load a different pd.DataFrame containing synthetic timeseries at all of my study locations.

## Usage
single_synthetic_all_sites = pd.read_csv('synthetic_streamflow_single_realization_all_sites.csv', sep=',', index_col=0, parse_dates=True)

plot_correlation(Qh_all_sites, single_synthetic_all_sites, timestep='weekly')

Across the different streamflow sites, we see that the synthetic timeseries do a decent job at preserving spatial correlation, although correlation is decreased for sites which are not strongly correlated in the historic record.

Conclusions

I hope these functions save you some time in your own work, and allow you to expand your synthetic diagnostic considerations.

References

[1] Kirsch, B. R., Characklis, G. W., & Zeff, H. B. (2013). Evaluating the impact of alternative hydro-climate scenarios on transfer agreements: Practical improvement for generating synthetic streamflows. Journal of Water Resources Planning and Management, 139(4), 396-406.

[2] Nowak, K., Prairie, J., Rajagopalan, B., & Lall, U. (2010). A nonparametric stochastic approach for multisite disaggregation of annual to daily streamflow. Water Resources Research, 46(8).

Building the Reed Group Figure Library – Part 1 Visualizations Supporting MOEA Diagnostics

This summer, we are constructing a Reed Group Lab manual (release forthcoming) as a guide for new graduate students and a resource about our tools for folks outside the group. The manual will complement this blog by organizing information and providing training syllabi for our tools and methodologies. One element of the lab manual that I’m excited about is a Figure Library, a gallery of starter code for common visualizations we use in the group. In this post, I’ll present the first set of visualizations that will be included in our Figure Library. These visualizations support MOEA diagnostic experiments. For some background on MOEA diagnostics, see this set of posts.

The section below is a reproduction of the Jupyter Notebook that will be included in our Figure Library.

MOEA Diagnostics

This page contains Python code to generate basic visualizations for the diagnostic evaluation of multiobjective evolutionary algorithms (MOEAs). MOEA diagnostics are a systematic process of evaluating algorithmic performance. For background on MOEA diagnostics and an introductory tutorial, see this post. For MOEA diagnostics on water resources applications, see Reed et al. (2013), Zatarain et al. (2016), Zatarain et al. (2017), and Gupta et al. (2020).

# first load the relevant libraries
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.cm import ScalarMappable
import matplotlib
from scipy.ndimage import gaussian_filter

Attainment Plots

MOEAs are stochastic tools that are initialized with random seeds (starting solutions) that may affect the quality of search results. To account for this inherent uncertainty in search, we run multiple random seeds of an algorithm. Attainment plots are a helpful way of visualizing an algorithm’s performance across random seeds. Attainment plots show the distribution of an MOEA performance metric across seeds and highlight the maximum performance achieved by an algorithm.

In the code below, four hypothetical algorithms are compared after running 20 random seeds each. The color of each bar represents the percentage of random seeds that achieve a given hypervolume, and the white points represent the best hypervolume achieved by each algorithm.

def attainment_plot(performance_file, metric):
    '''
    Creates an attainment plot comparing the performance of multiple algorithms across random seeds

    :param performance_file:            the path to a file containing performance metrics from diagnostic evaluation.
                                        the first row is assumed to be titles describing each algorithm

    :param metric:                      the name of the metric being plotted
    '''

    # load data
    performance_data = np.loadtxt(performance_file, delimiter=',', skiprows=1)

    # create a figure and get the axis object
    fig, ax = plt.subplots()

    # add a row of ones so the first (background) bar spans the full metric range
    performance_data = np.vstack((performance_data, np.ones(len(performance_data[0,:]))))

    # sort the hypervolume of each algorithm across all seeds from largest to smallest
    for i in range(len(performance_data[0,:])):
        performance_data[:,i] = np.sort(performance_data[:, i])[::-1]

    # plot a bar with the hypervolume achieved by each seed
    # start with 1 (full red), then highest, 2nd highest, etc.
    cmap = matplotlib.colormaps['RdBu'] # a color map for the plot
    for j in range(len(performance_data)):
        bar_color = (j)/len(performance_data) # color of the bar being plotted
        pcolor = cmap(bar_color) # pull color from colormap
        ax.bar(np.arange(len(performance_data[0,:])), performance_data[j,:], color=pcolor, alpha=1) # plot the seed

    # plot the best hypervolume achieved by each algorithm (after sorting, row 1 holds each algorithm's best seed)
    ax.scatter(np.arange(len(performance_data[0,:])), performance_data[1,:], color='w')

    # create a color bar
    sm = ScalarMappable(cmap= cmap)
    sm.set_array([])
    cbar = plt.colorbar(sm, ax=ax)
    cbar.set_label('Probability of Attainment')

    # format the plot
    title_df = pd.read_csv(performance_file)
    ax.set_xticks(np.arange(len(performance_data[0,:])))
    ax.set_xticklabels(title_df.columns)
    ax.set_ylabel(metric +'\nPreference $\longrightarrow$')

#Example Usage
attainment_plot('HV_20_Seeds.csv', 'Relative Hypervolume')

Control Maps

Many MOEAs have complex parameterizations that may impact their ability to solve challenging multiobjective problems. Control maps are one way we can visualize how an algorithm’s performance varies across different parameterizations. Control maps are usually two dimensional projections of sampled parameters, with each axis on the map representing a different parameter. The algorithm’s performance at different parameter combinations is shown as the color on the map. An advantage of some high quality algorithms such as the Borg MOEA is that they are relatively insensitive to their parameterization, and their performance is mainly controlled by the number of function evaluations (NFEs) performed.

In the example below, the control map shows an algorithm that is sensitive to both the NFEs and initial population size, indicating that users must be careful with their parameterization when using this algorithm.

def control_plot(control_data, parameter_1, parameter_2):
    '''
    Creates a two dimensional control plot for MOEA performance across many parameterizations

    :param control_data:                the path to a csv file with performance data arranged as an n x n grid, where n is the number of parameter samples
    :param parameter_1:                 the name of the parameter being plotted on the horizontal axis
    :param parameter_2:                 the name of the parameter being plotted on the vertical axis
    '''

    metrics = np.loadtxt(control_data, delimiter=',')
    filtered_metrics = gaussian_filter(metrics, sigma=5)

    fig, ax = plt.subplots()
    cmap = plt.get_cmap('RdYlBu')


    im = ax.imshow(filtered_metrics, cmap=cmap)
    ax.set_ylim([0,len(filtered_metrics[0,:])-1])
    ax.set_yticks(np.arange(len(filtered_metrics[:,0])))
    ax.set_xticks(np.arange(len(filtered_metrics[0,:])))
    ax.set_ylabel(parameter_2)
    ax.set_xlabel(parameter_1)
    ax.set_title('Control Map')

    # create a color bar
    sm = ScalarMappable(cmap= cmap)
    sm.set_array([])
    cbar = plt.colorbar(sm, ax=ax)
    cbar.set_label('Relative Hypervolume')

control_plot('Control_data.csv', 'Initial Population Size ($10^1$)', 'NFE ($10^3$)')

Runtime Dynamics

Another important way we can compare algorithms is by visualizing their runtime dynamics, which reveal how algorithms perform during search. We do this by tracking performance metrics during the search and plotting the performance against NFEs. In the chart below, I plot the performance of two algorithms on a hypothetical test problem. Here the max and min across seeds are plotted as a shaded region, and the mean performance is plotted as a solid line for each algorithm.

import seaborn as sns
sns.set()
runtime_1 = np.loadtxt('Alg1_runtime.csv', delimiter=',', skiprows=1)
runtime_1 = np.vstack((np.zeros(21), runtime_1))

maxes_1 = runtime_1[:,1:].max(axis=1)
mins_1 = runtime_1[:,1:].min(axis=1)
means_1 = runtime_1[:,1:].mean(axis=1)

runtime_2 = np.loadtxt('Alg2_runtime.csv', delimiter=',', skiprows=1)
runtime_2 = np.vstack((np.zeros(21), runtime_2))

maxes_2 = runtime_2[:,1:].max(axis=1)
mins_2 = runtime_2[:,1:].min(axis=1)
means_2 = runtime_2[:,1:].mean(axis=1)

fig, ax = plt.subplots()
ax.fill_between(runtime_1[:,0], mins_1, maxes_1, alpha=.3, color='lightcoral')
ax.plot(runtime_1[:,0], means_1, color='firebrick')
ax.fill_between(runtime_2[:,0], mins_2, maxes_2, alpha=.3, color='darkorchid')
ax.plot(runtime_2[:,0], means_2, color='darkviolet')

ax.set_xlim([0,200])
ax.set_ylim([0,1])
ax.set_xlabel('NFE ($10^3$)')
ax.set_ylabel('Relative Hypervolume')
ax.set_title('Runtime Dynamics')

Multiobjective trade-offs

Finally, we can visualize the implications of algorithmic performance by plotting the Pareto approximate sets across multiple objectives. An important tool for high-dimensional problems is the parallel axis plot. For an introduction to parallel axis plots, see this post. Pandas has a simple tool for making these plots that are useful for data with multiple categories (for example, different algorithms).

from pandas.plotting import parallel_coordinates

objs = pd.read_csv('objs_data.csv')
fig, ax =plt.subplots()
parallel_coordinates(objs, 'Algorithm', linewidth=3, alpha=.8, ax=ax, color = ('#556270', '#4ECDC4'))
ax.set_ylabel('Objective Value\n(Preference $\longrightarrow$)')
ax.set_title('Approximate Pareto Sets')

References


Gupta, R. S., Hamilton, A. L., Reed, P. M., & Characklis, G. W. (2020). Can modern multi-objective evolutionary algorithms discover high-dimensional financial risk portfolio tradeoffs for snow-dominated water-energy systems?. Advances in water resources, 145, 103718.

Reed, P. M., Hadka, D., Herman, J. D., Kasprzyk, J. R., & Kollat, J. B. (2013). Evolutionary multiobjective optimization in water resources: The past, present, and future. Advances in water resources, 51, 438-456.

Salazar, J. Z., Reed, P. M., Herman, J. D., Giuliani, M., & Castelletti, A. (2016). A diagnostic assessment of evolutionary algorithms for multi-objective surface water reservoir control. Advances in water resources, 92, 172-185.

Salazar, J. Z., Reed, P. M., Quinn, J. D., Giuliani, M., & Castelletti, A. (2017). Balancing exploration, uncertainty and computational demands in many objective reservoir optimization. Advances in water resources, 109, 196-210.

Creating Interactive Geospatial Maps in Python with Folium

Interactive mapping and data visualization provide data scientists and researchers with a unique opportunity to explore and analyze geospatial data, and to share their work with stakeholders in a more engaging and accessible way.

Personally, I’ve found that constructing an interactive map for my own research, and iteratively updating it as the work progresses, has been great for improving my own understanding of my project. My project map (not the one in this demo) has been a helpful narrative aid when introducing someone to the project.

I’ve had a lot of success using the folium Python package to create interactive maps, and want to share the tool with you all here.

In this post, I provide a demonstration on how to use the folium package to create an HTML based interactive map of a hydrologic watershed.

Folium: a quick introduction

Folium is a Python package that is used to create interactive maps.

It is built on top of the Leaflet JavaScript library and can be used to visualize geospatial data in unique ways. Using folium, users construct maps by adding various features including lines, markers, pop-up text boxes, and different base-maps (aka, tiles), and can export maps as interactive HTML documents.

The full Folium documentation is available here; however, I think it is best to learn through an example.
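As a bare-minimum example of the workflow (separate from the basin demo below, and using placeholder coordinates and a placeholder file name), the following few lines create a map, add a single marker, and export it to an HTML file:

# Minimal folium example: one map, one marker, saved to an interactive HTML file
import folium

m = folium.Map(location=[36.5, -106.5], zoom_start=8)
folium.Marker([36.5, -106.5], popup='Example marker').add_to(m)
m.save('minimal_map.html')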

Demo: Mapping hydrologic basin data with Folium

All of the code and data used in this post is located in a GitHub repository: trevorja/Folium_Interactive_Map_Demo.

To interact with the demo map directly, download the file here: basin_map.html

The folium_map_demo.ipynb Jupyter Notebook steps through the process shown below and will re-create the map shown above. You can create a similar plot for a different basin by changing the station_id number inside ‘retrieve_basin_data.ipynb’ to a specific USGS gauge of interest.

Here is a brief video showcasing the interaction with the final map:

Map Data

For this demo, I am plotting various features within a hydrologic basin in northern New Mexico.

All of the data in the map is retrieved using the pynhd package from the HyRiver suite for python. For more information about these packages, see my previous post, Efficient hydroclimatic data accessing with HyRiver for Python.

The script ‘retrieve_basin_data.ipynb’ is set up to retrieve several features from the basin, including:

  • Basin boundary
  • Mainstem river
  • Tributary rivers
  • USGS gauge locations
  • New Mexico Water Data Initiative (NMWDI) Sites
  • HUC12 pour points

The geospatial data (longitude and latitudes) for each of these features are exported to data/basin_data.csv and used later to generate the folium map.

Constructing a folium hydrologic map

Like many other data visualization programs, folium maps are constructed by iteratively adding features or map elements to the main map object. It is easy to add a map feature, however it takes more care to ensure that the features are well-organized and respond to user interaction the way you want them to.

In this demo, I deconstruct the process into five parts:

  1. Initializing the map
  2. Initializing feature groups (layers)
  3. Adding points to feature layers
  4. Adding polygons & lines to feature layers
  5. Adding layers onto the map

Before going any further, we need to import the necessary packages, load our basin data, and convert the geospatial data to geopandas.GeoDataFrame() geometry objects:

# Import packages
import pandas as pd
import folium
from folium.plugins import MarkerCluster
import geopandas as gpd
from shapely.geometry import Point, LineString
from shapely import wkt

# Specify the coordinate reference system (CRS)
crs = 4326  # EPSG:4326 (WGS84 latitude/longitude)

# Load the basin data
basin_data = pd.read_csv('./data/basin_data.csv', sep = ',', index_col=0)

# Convert to a geoDataFrame and geometry objects
basin_data['geometry'] = basin_data['geometry'].apply(wkt.loads)
geodata = gpd.GeoDataFrame(basin_data, crs = crs)

Additionally, I specify a few design options (colors, line widths, etc.) up front so that I can easily test different colors, and I store all of these specifications in a dict. This step is not necessary, but it helps to stay organized. The following code block shows some of these specifications; others are left out just to keep this post shorter:

## Plot options
# Line widths
basin_linewidth = 2
mainstem_linewidth = 3
tributary_linewidth = 1

# ...
# More color and design specs here...
# ...

# Combine options into a dictionary
plot_options = {
    'station':{'color': usgs_color,
               'size': usgs_size},
    'pourpoint': {'color': pourpoint_color,
                  'size': pourpoint_size},
    'nmwdi': {'color': nmwdi_color,
              'size': nmwdi_size}
              }

With that out of the way, we will start mapping.

Initializing the map

We start to construct our map using the folium.Map() function:

# Initialize the map
geomap = folium.Map(location = [36.5, -106.5],
                    zoom_start = 9.2,
                    tiles = 'cartodbpositron',
                    control_scale = True)

The location and zoom_start arguments set the default view; the user will be able to pan and zoom around the map, but this will be the starting location.

The tiles argument in the initial folium.Map() call sets the default base-map style, but different styles can be added using the folium.TileLayer() function. Each TileLayer() call adds a different base-map style that is then available from the drop-down menu in the final figure.

## Add different tiles (base maps)
folium.TileLayer('openstreetmap').add_to(geomap)
folium.TileLayer('stamenwatercolor').add_to(geomap)
folium.TileLayer('stamenterrain').add_to(geomap)

Here is a side-by-side comparison of the four different tiles used in this demo:

Initializing feature groups (layers)

Before we add any lines or markers to the map, it is best to set up different layers using folium.FeatureGroup(). When the map is complete, the different feature groups can be toggled on-and-off using the drop-down menu.

Below, I initialize a folium.FeatureGroup for each of the six feature types in the map.

# Start feature groups for toggle functionality
basin_layer = folium.FeatureGroup(name = 'Basin Boundary', show=True)
usgs_layer = folium.FeatureGroup(name = 'USGS Gauges', show=True)
mainstem_layer = folium.FeatureGroup(name = 'Mainstem River', show=True)
tributary_layer = folium.FeatureGroup(name='Tributary Rivers', show=False)
pourpoint_layer = folium.FeatureGroup(name= 'HUC12 Pour Points', show=False)
nmwdi_layer = folium.FeatureGroup(name='NM Water Data Initiative Gauge', show=True)

The name argument is what will be displayed on the drop-down menu. The show argument indicates whether that layer is visible by default (when the map is first opened), or whether it first needs to be selected using the drop-down menu.

Now that the layers are initialized, we can begin adding the features (polygons, lines, and point markers) to each layer.

Adding points to feature layers

The folium.CircleMarker() function is used to add circle markers at specific coordinate locations.

The following code shows how I am iteratively adding different points to the different layers of the map based upon the feature type.

For each point, I extract the latitude and longitude (coords = [point.geometry.y, point.geometry.x]) and pass it to the folium.CircleMarker() function with colors and sizes specific to each of the three different point features (USGS gauge stations (station), NMWDI (nmwdi), and HUC12 pourpoints (pourpoint)):

plot_types = ['station', 'nmwdi', 'pourpoint']
plot_layers = [usgs_layer, nmwdi_layer, pourpoint_layer]

# Loop through the different feature point types
for i, feature_type in enumerate(plot_types):

    # Specify the correct feature layer to add the point to
    map_layer = plot_layers[i]

    # Add each point
    for _, point in geodata.loc[geodata['type'] == feature_type].iterrows():
        coords = [point.geometry.y, point.geometry.x]

        # Add the popup box with description
        textbox = folium.Popup(point.description,
                               min_width= popup_width,
                               max_width= popup_width)

        # Add the marker at the coordinates with color-coordination
        folium.CircleMarker(coords,
                            popup= textbox,
                            fill_color = plot_options[feature_type]['color'],
                            fill = True,
                            fill_opacity = fill_opacity,
                            radius= plot_options[feature_type]['size'],
                            color = plot_options[feature_type]['color']).add_to(map_layer)

Adding polygons & lines to feature layers

I’ve found that it can be difficult to add polygons or lines to a folium map if the coordinate geometry is not formatted correctly. For me, the best method has been to convert the polygon or line data to a geopandas.GeoSeries and then convert that to JSON format using the .to_json() method.

Once in JSON format, the data can be added to the map using the folium.GeoJson() method, similar to other features. However, rather than adding it to the map directly, we add it to the specific feature layer so that we can toggle it later.

Below, I show how I add the basin boundary polygon to the map. This needs to be repeated for the mainstem and tributary river lines, and the full code is included in folium_map_demo.ipynb.

## Plot basin border
for i, r in geodata.loc[geodata['type'] == 'basin_boundary'].iterrows():  # NOTE: 'basin_boundary' is a placeholder; use the 'type' label stored in your basin_data.csv
    # Convert the Polygon or LineString to geoJSON format
    geo_json = gpd.GeoSeries(r['geometry']).simplify(tolerance = 0.000001).to_json()
    geo_json = folium.GeoJson(data= geo_json,
                              style_function=lambda x: {'fillColor': 'none',
                                                        'weight': basin_linewidth,
                                                        'color': basin_color,
                                                        'opacity': 1,
                                                        'fillOpacity': 1,
                                                        'fill': False})
    # Add popup with line description
    folium.Popup(r.description,
                 min_width = popup_width,
                 max_width= popup_width).add_to(geo_json)

    # Add the feature to the appropriate layer
    geo_json.add_to(basin_layer)

And with that, the hard part is done.

Last step: adding layers onto map

Now, if you try to visualize the map it will be empty! This is because we have not added the feature layers to the map. In this last step, we add each of the six feature layers to the map and also add folium.LayerControl(), which will allow us to toggle the different layers on and off:

# Add all feature layers to the map
basin_layer.add_to(geomap)
usgs_layer.add_to(geomap)
mainstem_layer.add_to(geomap)
tributary_layer.add_to(geomap)
pourpoint_layer.add_to(geomap)
nmwdi_layer.add_to(geomap)

# Add the toggle option for layers
folium.LayerControl().add_to(geomap)

Ready for the grand reveal?

Viewing, saving, and sharing the map

Viewing your map is as easy as calling the map name at any point in the script (i.e., geomap), and folium makes it easy to save the map as an HTML file using the .save() method, as shown here:

# Save and view the map
geomap.save("basin_map.html")
geomap

Once you have your HTML saved, and you’ve taken a moment to open it on your computer and made sure that the features are situated nicely, then it comes time to share. Other users can view the maps simply by opening the HTML file on their local machine, or you can add the HTML to a website.

Concluding thoughts

I hope you’ve found some inspiration here, and find a way to incorporate interactive geospatial mapping in your project. I don’t think it can be overstated how much an interactive visual such as a folium map can broaden access to your dataset or model.

Thanks for reading!

Exploring time-evolving vulnerability with the newly published interactive tutorial in the Addressing Uncertainty in MultiSector Dynamics Research eBook

We recently published two new Jupyter Notebook tutorials as technical appendices to our eBook on Addressing Uncertainty in MultiSector Dynamics Research developed as part of the Integrated MultiSector, Multiscale Modeling (IM3) project, supported by the Department of Energy Office of Science’s MultiSector Dynamics Program. The eBook provides an overview of diagnostic modeling, perspectives on model evaluation, and a framework for basic methods and concepts used in sensitivity analysis. The technical appendices demonstrate key concepts introduced in the text and provide example Python code to use as a starting point for future analysis. In this post, I’ll discuss the concepts introduced in one of the new Python tutorials, Time-evolving scenario discovery for infrastructure pathways. This post will focus on the notebook’s connections to topics in the main text of the eBook rather than detailing the code demonstrated in the notebook. For details on the test case and code used in the tutorial, see the tutorial I posted last year.

In this post, I’ll first give a brief overview of the example water supply planning test case used in the tutorial, then discuss the methodological steps used to explore uncertainty and vulnerability in the system. The main topics discussed in this post are the design of experiments, factor mapping, and factor prioritization.

The Bedford-Greene water supply test case

The Bedford-Greene infrastructure investment planning problem (Figure 1) is a stylized water resources test case designed to reflect the challenges faced when evaluating infrastructure systems that evolve over time and are subject to uncertain system inputs. The Bedford-Greene system contains two water utilities developing an infrastructure investment and management strategy to confront growing water demands and climate change. The test case was chosen for an eBook tutorial because it contains complex and evolving dynamics driven by strongly coupled human-natural system interactions. To explore these dynamics, the tutorial walks through an exploratory modeling experiment that evaluates how a large ensemble of uncertainties influences system performance and how these influences evolve over time.

In the Bedford-Greene system, modelers do not know or cannot agree upon the probability distributions of key system inputs, a condition referred to as “deep uncertainty” (Kwakkel et al., 2016). In the face of deep uncertainties, we perform an exploratory experiment to understand how a large ensemble of future scenarios may generate vulnerability for the water supply system.

Figure 1: The Bedford-Greene Water Resources Test Case

Setting up the computational experiment

The first step in the exploratory experiment is to define which factors in the mathematical model of the system are considered uncertain and how to sample the uncertainty space. The specific uncertain factors and their variability can be elicited through expert opinion, historical observation, values in literature, or physical meaning. In the Bedford-Greene system, we define 13 uncertain factors drawn from literature on real-world water supply systems (Gorelick et al., 2022). The uncertain factors used in this tutorial can be found in Table 1.

| Factor Name | Plausible Range | Description |
| --- | --- | --- |
| Near-Term Demand Growth Rate Mult. | -0.25 to 2.0 | A scaling factor on projected demand growth over the first 15 years of the planning period |
| Mid-Term Demand Growth Rate Mult. | -0.25 to 2.0 | A scaling factor on projected demand growth over the second 15 years of the planning period |
| Long-Term Demand Growth Rate Mult. | -0.25 to 2.0 | A scaling factor on projected demand growth over the third 15 years of the planning period |
| Bond Term | 0.8 to 1.2 | A scaling factor on the number of years over which infrastructure capital costs are repaid as debt service |
| Bond Interest Rate | 0.6 to 1.2 | A scaling factor that adjusts the fixed interest rate on bonds for infrastructure |
| Discount Rate | 0.8 to 1.2 | The rate at which infrastructure investment costs are discounted over time |
| Restriction Efficacy | 0.8 to 1.2 | A scaling factor on how effective water use restrictions are at reducing demands |
| Infrastructure Permitting Period | 0.75 to 1.5 | A scaling factor on estimated permitting periods for infrastructure projects |
| Infrastructure Construction Time | 1 to 1.2 | A scaling factor on estimated construction times for infrastructure projects |
| Inflow Amplitude | 0.8 to 1.2 | A sinusoidal scaling factor to apply non-stationary conditions to reservoir inflows |
| Inflow Frequency | 0.2 to 0.5 | A sinusoidal scaling factor to apply non-stationary conditions to reservoir inflows |
| Inflow Phase | -pi/2 to pi/2 | A sinusoidal scaling factor to apply non-stationary conditions to reservoir inflows |

Table 1: Deep Uncertainties sampled for the Bedford-Greene System

After the relevant uncertainties and their plausible ranges have been identified, the next step is to define a sampling strategy. A sampling strategy is often referred to as a design of experiments, a term that dates back to the work of Ronald Fisher in the context of laboratory or field-based experiments (Fisher, 1936). The design of experiments is a methodological choice that should be carefully considered before starting computational experiments; it should balance the computational cost of the exploratory experiment with the amount of information needed to accurately characterize system vulnerability. An effective design of experiments allows us to explore complex relationships within the model and evaluate the interactions of system inputs. Five commonly used designs of experiments are overviewed in Chapter 3.3 of the main text of the eBook.

In the Bedford-Greene test case, we employ a Latin Hypercube Sampling (LHS) strategy, shown in Figure 2d. With this sampling technique, a 13-dimensional hypercube is generated for the 13 factors shown in Table 1, with each factor divided into an equal number of levels to obtain 2,000 different samples of future scenarios. 2,000 samples were chosen here based on testing from similar water supply test cases (Trindade et al., 2020); the sample size must be determined on a case-by-case basis. The LHS design guarantees sampling from every level of the uncertainty space without overlaps and will generate diverse coverage of the entire space. When the number of samples is much greater than the number of uncertain factors, LHS effectively approximates the more computationally expensive full factorial sampling scheme shown in Figure 2a without needing to constrain samples to discrete levels for each factor, as done in fractional factorial sampling, shown in Figure 2c. For more details on each sampling scheme and general information on design of experiments, see Chapter 3.3 of the eBook.

Figure 2: Alternative designs of experiments reproduced from Figure 3.3 of the eBook main text. a) full factorial design sampling of three factors at four levels with a total of 64 samples; b) the exponential growth of a necessary number of samples when applying full factorial design at four levels; c) fractional factorial design of three factors at four levels at a total of 32 samples; d) Latin Hypercube sample of three factors with uniform distributions for a total of 32 samples.
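For readers who want to experiment with this step themselves, the snippet below is a minimal sketch (not the tutorial's actual code) of drawing a 2,000-point Latin Hypercube sample with scipy and scaling it to the plausible ranges in Table 1. Only three factors are written out here, and the factor names and the lhs_df variable are placeholders.

# A minimal LHS sketch using scipy.stats.qmc; factor names below are placeholders
import pandas as pd
from scipy.stats import qmc

factor_ranges = {
    'near_term_demand_mult': (-0.25, 2.0),
    'restriction_efficacy_mult': (0.8, 1.2),
    'bond_interest_rate_mult': (0.6, 1.2),
    # ... the remaining factors from Table 1 follow the same pattern ...
}

sampler = qmc.LatinHypercube(d=len(factor_ranges), seed=42)
unit_sample = sampler.random(n=2000)                          # points on the unit hypercube
lows = [bounds[0] for bounds in factor_ranges.values()]
highs = [bounds[1] for bounds in factor_ranges.values()]
lhs_df = pd.DataFrame(qmc.scale(unit_sample, lows, highs),    # rescale to plausible ranges
                      columns=list(factor_ranges.keys()))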

A final step in our experimental setup is to determine which model outputs are relevant to model users. In the Bedford-Greene test case, we specify five performance criteria along with performance thresholds that the water utilities would like their infrastructure investment policy to meet under all future conditions. The performance criteria and thresholds are shown in Table 2. These values are based on water supply literature, and relevant criteria and thresholds should be determined on a case-by-case basis.

| Performance criteria | Threshold |
| --- | --- |
| Reliability | < 99% |
| Restriction Frequency | > 20% |
| Worst-case cost | > 10% annual revenue |
| Peak financial cost | > 80% annual revenue |
| Stranded assets | > $5/kgal unit cost of expansion |

Table 2: Performance criteria and thresholds

Discovering consequential scenarios

To explore uncertainty in the Bedford-Greene system, we run the ensemble of uncertainties developed by our design of experiments through a water resources systems model and examine the outputs of each sampled combination of uncertainty. In this system, we’re interested in understanding 1) which uncertainties have the most impact on system vulnerability, 2) which combinations of uncertainties lead to consequential outcomes for the water supply system, and 3) how vulnerability evolves over time. Chapter 3.2 of the eBook introduces diagnostic approaches that can help us answer these questions. In this tutorial, we utilize gradient-boosted trees, a machine-learning algorithm that uses an ensemble of shallow trees to generate an accurate classifier (for more on boosting, see Bernardo’s post). Gradient-boosted trees are particularly well suited to infrastructure investment problems because they are able to capture non-linear and non-differentiable boundaries in the uncertainty space, which often occur as a result of discrete capacity expansions. Gradient-boosted trees are also resistant to overfitting, easy to interpret, and provide a simple means of ranking the importance of uncertainties. For more background on gradient-boosted trees for scenario discovery, see this post from last year.

Gradient-boosted trees provide a helpful measure of feature importance: the percentage decrease in the impurity of the ensemble of trees associated with each factor. We can use this measure to examine how each uncertainty contributes to the performance of the region’s infrastructure investment and management policy. Infrastructure investments fundamentally alter the water utilities’ storage-to-capacity ratios and levels of debt burden, which will impact their vulnerability over time. To account for these changes, we examine feature importance over three different time periods. The results of our exploratory modeling process are shown in Figure 3. We observe that the importance of various uncertainties evolves over time for both water utilities. For example, while near-term demand growth is a key factor for both utilities in all three time periods, restriction effectiveness is a key uncertainty for Bedford in the near- and mid-term but not in the long-term, likely indicating that infrastructure investment reduces the utility’s need to rely on water use restrictions. Greene is not sensitive to restriction effectiveness in the near-term or long-term, but is very sensitive in the mid-term. This likely indicates that the utility uses restrictions as a bridge to manage high demands before infrastructure investments have been fully constructed.

Figure 3: factor importance for the two utilities. Darker colors indicate that uncertainties have higher predictive value for discovering consequential scenarios.
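To make the feature-importance step concrete, here is a minimal scikit-learn sketch (not the tutorial's code). It reuses the placeholder lhs_df sample from the sketch above, and the failures labels here are a purely illustrative placeholder; in the tutorial they come from checking each simulated scenario against the performance thresholds in Table 2.

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# placeholder failure labels purely for illustration -- in practice these come from
# comparing each scenario's simulated performance against the Table 2 thresholds
failures = (lhs_df['near_term_demand_mult'] > 1.0).values

gbc = GradientBoostingClassifier(n_estimators=250, learning_rate=0.1, max_depth=3)
gbc.fit(lhs_df.values, failures)

# rank the uncertain factors by their contribution to classifying failures
factor_importance = pd.Series(gbc.feature_importances_, index=lhs_df.columns)
print(factor_importance.sort_values(ascending=False))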

To learn more about how vulnerability for the two water utilities evolves, we use factor mapping (eBook Chapter 3.2) to delineate regions of the uncertainty space that lead to consequential model outputs. The factor maps in Figures 4 and 5 complement the factor ranking in Figure 3 by providing additional information about which combinations of uncertainties generate vulnerability for the two utilities. While near-term demand growth and restriction effectiveness appear to generate vulnerability for Bedford in the near-term, Figure 4 reveals that the vast majority of sampled future states of the world meet the performance criteria. When evaluated using a 22-year planning horizon, however, failures emerge as a consequence of high demand and low restriction effectiveness. When evaluated across a 45-year planning horizon, the utility appears extremely vulnerable to high demand, indicating that the infrastructure investment policy is likely insufficient to maintain water supply reliability.

Figure 4: Factor maps for Bedford

Greene’s factor maps tell a different story. In the near-term, the utility is vulnerable to high-demand scenarios. In the mid-term, the vulnerable regions have transformed, and two failure modes are apparent. First, the utility is vulnerable to a combination of high near-term demand and low restriction effectiveness, indicating the potential for water supply reliability failures. Second, the utility is vulnerable to low-demand scenarios, highlighting a potential financial failure from over-investment in infrastructure. When analyzed across the 45-year planning horizon, the utility is vulnerable to only low-demand futures, indicating a severe financial risk from over-investment. These factor maps provide important context to the factor priorities shown in Figure 3. While the factor prioritization does highlight the importance of demand growth for Greene, it does not indicate which ranges of uncertainty generate vulnerability. Evaluating the system across time reveals that though the utility is always sensitive to demand growth, the consequences of demand growth and the range that generates vulnerability completely transform over the planning period.

Figure 5: Factor maps for Greene
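Continuing the hypothetical sketch above, a basic factor map can be drawn by projecting the fitted classifier onto two factors (here, the placeholder demand and restriction-efficacy multipliers) while holding the remaining sampled factors at their median values. This is only an illustration of the general idea, not the figures shown above.

# A hedged factor-map sketch using the placeholder classifier and sample from the sketches above
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

f1, f2 = 'near_term_demand_mult', 'restriction_efficacy_mult'   # placeholder factor names
x = np.linspace(lhs_df[f1].min(), lhs_df[f1].max(), 100)
y = np.linspace(lhs_df[f2].min(), lhs_df[f2].max(), 100)
X, Y = np.meshgrid(x, y)

# prediction grid: vary f1 and f2, hold all other factors at their sampled medians
grid = pd.DataFrame(np.tile(lhs_df.median().values, (X.size, 1)), columns=lhs_df.columns)
grid[f1] = X.ravel()
grid[f2] = Y.ravel()

# predicted probability of failure across this two-factor slice of the uncertainty space
prob_fail = gbc.predict_proba(grid.values)[:, 1].reshape(X.shape)

fig, ax = plt.subplots()
cf = ax.contourf(X, Y, prob_fail, levels=10, cmap='RdBu_r')
fig.colorbar(cf, ax=ax, label='Predicted probability of failure')
ax.set_xlabel(f1)
ax.set_ylabel(f2)
ax.set_title('Illustrative factor map')
plt.show()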

Concluding thoughts

The purpose of this post was to provide additional context to the eBook tutorial on time-evolving scenario discovery. The Bedford-Greene test case was chosen because it represents a tightly coupled human-natural system with complex and nonlinear dynamics. The infrastructure investments made by the two water utilities fundamentally alter the system’s state dynamics over time, necessitating an approach that can capture how vulnerability evolves. Through a carefully designed computational experiment and scenario discovery using gradient-boosted trees, we discover multiple failure modes for both water utilities, which can help regional decision-makers monitor policy performance and adapt to changing conditions. While each application will be different, the code in this tutorial can be used as a starting point for applying this methodology to other human-natural systems. As with all tutorials in the eBook, the Jupyter notebook ends with a section on how to apply this methodology to your problem.

References

Kwakkel, J. H., Walker, W. E., & Haasnoot, M. (2016). Coping with the wickedness of public policy problems: approaches for decision making under deep uncertainty. Journal of Water Resources Planning and Management, 142(3), 01816001.

Fisher, R.A. (1936). Design of experiments. Br Med J, 1(3923):554–554

Trindade, B. C., Gold, D. F., Reed, P. M., Zeff, H. B., & Characklis, G. W. (2020). Water pathways: An open source stochastic simulation system for integrated water supply portfolio management and infrastructure investment planning. Environmental Modelling & Software, 132, 104772.

Enhancing Jupyter Notebooks for Teaching

In our research group, we often use Jupyter Notebooks as teaching tools in the classroom and to train new students. Jupyter Notebooks are useful because they integrate code and its output into a single document that also supports visualizations, narrative text, mathematical equations, and other forms of media, which often helps students better contextualize what they are learning.

There are some aspects of Jupyter Notebooks that students sometimes struggle with as they shift from IDEs to this new environment. Many have noted difficulty tracking the value of variables through the notebook and getting stuck in a mindset of just pressing “Shift+Enter”. Since the advent of Project Jupyter, the community has put out a variety of “unofficial” extensions and widgets that I have found can help enhance the learning experience and make Jupyter Notebooks less frustrating. It’s just not apparent that these extensions exist until you go searching for them. I will demonstrate some below that have been particularly helpful when working with our undergrads, and hopefully they will be helpful for you too!

Variable Inspector

One very easy way to remedy the situation of not having a variable display is to enable the Variable Inspector extension, which allows you to see the values of all of the variables in an external floating window. In your command prompt, type:

#Installation
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user

#Enable extension
jupyter nbextension enable varInspector/main

Here we are installing a massive library of extensions called Nbextensions, several of which we will enable in this post. When you open your notebooks, you will now see a small icon at the top that you can click to create a variable inspector window that will reside in the margins of the notebook.

Variable Inspector Icon

As you step through the notebook, your variable inspector will become populated.

Variable Inspector Floating Window

Snippets of Code

The Snippets menu is pretty fantastic for students who are new to coding or for those of us who go back and forth between coding languages and keep having to quickly search how to make arrays or need code for example plots.

We can just enable this extension as follows right in Jupyter Notebook under the Nbextensions tab:

Enabling the Snippets Menu

This gif provides an example of some of the options that are available for NumPy, Matplotlib, and Pandas.

Jupyter Snippets Menu

If for instance, you want to create a histogram but can’t remember the form of the code, a snippet for this exists under the Matplotlib submenu.

Making a histogram with Matplotlib Snippets

Rubberband (Highlight Multiple Cells)

If you enable the Rubberband extension in the Nbextensions tab, you can highlight and execute multiple cells, which just gives a bit more precision in selecting a subset of cells that you want to run at once. You hold down “Ctrl+Shift” while dragging the mouse. Then you can execute the highlighted cells using “Shift+Enter”.

Execution Time

You can wrap your code in a variety of timing functions, or just enable the Execution Time extension in the Nbextensions tab. For every cell that you execute, the elapsed time will now be recorded under the cell. For me, this is particularly useful when some cells take a long time to execute. If you are creating training materials, it’s helpful for a user to get a breakdown of exactly how long the code should take to run (so that they know whether the timing is normal or there is an issue with the code).

Elapsed time is shown below each cell
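If you just want to time a single cell without installing any extensions, IPython’s built-in %%time cell magic is another option; for example:

%%time
# report wall and CPU time for everything in this cell
import numpy as np
A = np.random.rand(1000, 1000)
A_inv = np.linalg.inv(A)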

Creating a Jupyter Notebook Slideshow with RISE (a reveal.js extension)

The last teaching tool that I want to demonstrate is how to turn your Jupyter Notebook into an interactive slideshow, which has been infinitely more useful to use as a teaching tool than just scrolling through the Jupyter Notebook at a meeting.

First, installation is easy:

pip install RISE

Now, we have to prepare our notebook for a slideshow. Go to View -> Cell Toolbar -> Slideshow. We need to select the Slide Type for each of our cells.

Selecting the slide type for the slideshow

At the top of the screen, we now have a new icon that allows you to enter into slideshow mode when you are ready. We can now step through the slideshow that we have made and interactively execute the coding boxes as well.

I hope these tips help you to have a more pleasant experience while using Jupyter Notebooks. If there are other extensions that you enjoy using regularly, please post below!

Efficient hydroclimatic data accessing with HyRiver for Python

This tutorial highlights the HyRiver software stack for Python, which is a very powerful tool for acquiring large sets of data from various web services.

I have uploaded a Jupyter-Notebook version of this post here if you would like to execute it yourself.

HyRiver Introduction

The HyRiver software suite was developed by Taher Chegini who, in their own words, describes HyRiver as:

“… a software stack consisting of seven Python libraries that are designed to aid in hydroclimate analysis through web services.”

This description does not do justice to the capability of this software. Through my research I have spent significant amounts of time wrangling various datasets – making sure that dates align, or accounting for spatial misalignment of available data. The HyRiver suite streamlines this process, and makes acquisition of different data from various sources much more efficient.

Here, I am going to walk through a demonstration of how to easily access large amounts of data (streamflow, geophysical, and meteorological) for a basin of interest.

Before going through the code, I will highlight the three libraries from the HyRiver stack which I have found most useful: PyGeoHydro, PyNHD, and PyDaymet.

PyGeohydro

PyGeoHydro allows for interaction with eight different online datasets.

In this tutorial, I will only be interacting with the USGS NWIS, which provides daily streamflow data.

PyNHD

The PyNHD library is designed to interact with the National Hydrography Dataset (NHD) and the Hydro Network-Linked Data Index (NLDI).

NHDPlus (National Hydrography Dataset)

The NHD defines a high-resolution network of stream linkages, each with a unique identifier (ComID).

NLDI (Network-Linked Data Index)

The NLDI aids in the discovery of indexed information along some NHD-specified geometry (ComIDs). The NLDI essentially traverses the linkages specified by the NHD geometry and generates either local or basin-aggregated data relative to a specific linkage (ComID).

As will be seen later in the tutorial, the NLDI is able to retrieve at least 126 different types of data for a given basin…

PyDaymet

The PyDaymet GitHub repository summarizes the package as:

“[providing] access to climate data from Daymet V4 database using NetCDF Subset Service (NCSS). Both single pixel (using get_bycoords function) and gridded data (using get_bygeom) are supported which are returned as pandas.DataFrame and xarray.Dataset, respectively.”

Tutorial outline:

  1. Installation
  2. Retrieving USGS Water Data
  3. Retrieving Geophysical (NLDI) Data
  4. Retrieving Daymet Data

The HyRiver repository contains various examples demonstrating the use of the various libraries. I would definitely recommend digging deeper into these, and other HyRiver documentation, if this post piques your interest.

Step 0: Installation

In this tutorial, I only interact with the PyNHD, PyGeoHydro, and PyDaymet libraries, so I do not need to install all of the HyRiver suite.

If you operate through pip, you can install these libraries using:

pip install pynhd pygeohydro pydaymet

If you use Anaconda package manager, you can install these packages using:

conda install -c conda-forge pynhd pygeohydro pydaymet

For more information on installation, refer to the HyRiver GitHub repository and related documentation.

Now, onto the fun part!

Step 1: Retrieving USGS Water Data

I am beginning here because streamflow data is typically the first point of interest for most hydrologic engineers or modelers.

Personally, I have gone through the process of trying to download data manually from the USGS NWIS website… My appreciation for the USGS prevents me from saying anything too negative, but let’s just say it was not a pleasant experience.

Pygeohydro allows for direct requests from the USGS National Water Information System (NWIS), which provides daily streamflow data from all USGS gages. The data is conveniently output as a Pandas DataFrame.

With this functionality alone, the PyGeoHydro library is worth learning.

1.1 Initialize PyGeoHydro NWIS tool

# Import common libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import the PyGeohydro libaray tools
import pygeohydro as gh
from pygeohydro import NWIS, plot

# Use the national water info system (NWIS)
nwis = NWIS()

1.2 Requesting USGS Streamflow Data

The get_streamflow() function does exactly as the name entails and will retrieve daily streamflow timeseries; however, USGS gage station IDs must be provided. If you are only interested in a single location, then you can enter an 8-digit gage ID number along with a specified date range to generate the data:

nwis.get_streamflow('########', dates = ('Y-M-D', 'Y-M-D'))

However, I want to explore larger sets of data over an entire region. Thus, I am going to use PyGeoHydro's get_info() function to identify all gages within some region of interest.

First, I specify a region via (latitude, longitude) bounds, then I send a query which retrieves meta-data information on all the gages in the specified region. In this case, I am exploring the data available near Ithaca, NY.

# Query specifications
region = (-76.7, 42.3, -76, 42.6) # Ithaca, NY

# Send a query for all gage info in the region
query = {"bBox": ",".join(f"{b:.06f}" for b in region),
         "hasDataTypeCd": "dv",
         "outputDataTypeCd": "dv"}

info_box = nwis.get_info(query)

print(f'PyGeoHydro found {len(set(info_box.site_no))} unique gages in this region.')

# [Out]: PyGeoHydro found #N unique gages in this region.

However, this info_box identifies many gages in the region which have very old or very brief data records. Knowing this, I want to filter out gages which do not have a suitable record length.

For the sake of this tutorial, I am considering data between January 1st, 2020 and December 31st, 2020.

# Specify date range of interest
dates = ("2020-01-01", "2020-12-31") 

# Filter stations to have only those with proper dates
stations = info_box[(info_box.begin_date <= dates[0]) & (info_box.end_date >= dates[1])].site_no.tolist()

# Remove duplicates by converting to a set
stations = set(stations)

Now, I am ready to use the gage IDs contained in stations to request the streamflow data!

# Retrieve the flow data
flow_data = nwis.get_streamflow(stations, dates, mmd=False)

# Remove gages with nans
flow_data = flow_data.dropna(axis = 1, how = 'any')

After removing duplicates and gages with nans, I have data from five unique gages in this region.

Additionally, PyGeoHydro has a convenient plotting feature to help quickly visualize the streamflow data.

from pygeohydro import plot

# Plot flow data summary
plot.signatures(flow_data)
Summary of flow data for the 5 gages found with PyGeoHydro.

There is a lot more to be explored in the PyGeoHydro library, but I will leave that up to the curious reader.

Step 2: Retrieving Geophysical (NLDI) Data

So, you’ve got some streamflow data but you don’t know anything about the physical watershed…

This is where the PyNHD library comes in. Using this library, I can identify the entire upstream network from a gage, then extract the NLDI data associated with the watershed linkages.

# Import the PyNHD library
import pynhd as pynhd
from pynhd import NHD
from pynhd import NLDI, WaterData

First, we can take a look at all possible local basin characteristic data that are available:

# Get list of local data types (AKA characteristics, or attributes)
possible_attributes = NLDI().get_validchars("local").index.to_list()

There are 126 characteristics available from the NLDI! These characteristics range from elevation, to reservoir capacity, to bedrock depth. Many of these are not of immediate interest to me, so I will specify a subset of select_attributes to retrieve (basin area, max elevation, and stream slope).

I then loop through all of my USGS stations for which I have data in flow_data, identifying the upstream basin linkages using NLDI().navigate_byid(). Once the basin is identified, I extract the ComID numbers for each linkage and use that number to retrieve the NLDI data of interest. I then store the data in nldi_data. This process is done by the following:

# Specify characteristics of interest
select_attributes = ['CAT_BASIN_AREA', 'CAT_ELEV_MAX', 'CAT_STREAM_SLOPE']

# Initialize a storage matrix
nldi_data = np.zeros((len(flow_data.columns), len(select_attributes)))

# Loop through all gages, and request NLDI data near each gage
for i, st in enumerate(flow_data.columns):

    # Navigate up all flowlines from gage
    flowlines = NLDI().navigate_byid(fsource = 'nwissite',
                                    fid = f'{st}',
                                    navigation="upstreamTributaries",
                                    source = 'flowlines',
                                    distance = 10)

    # Get the nearest comid
    station_comid = flowlines.nhdplus_comid.to_list()[0]

    # Source NLDI local data
    nldi_data[i,:] = NLDI().getcharacteristic_byid(station_comid, "local", char_ids = select_attributes)

So far, I have timeseries streamflow data for five locations in the Ithaca, NY area, along with the basin area, max basin elevation, and stream slope for each stream. If I can access hydro-climate data, maybe I could begin studying the relationships between streamflow and physical basin features after some rain event.

Step 3: Meteorological data

The PyDaymet library allows for direct requests of meteorological data across an entire basin.

The available data includes:

  • Minimum and maximum temperature (tmin, tmax)
  • Precipitation (prcp)
  • Vapor pressure (vp)
  • Snow-Water Equivalent (swe)
  • Shortwave radiation (srad)

All data are reported daily at a 1km x 1km resolution. Additionally, the PyDaymet library has the ability to estimate potential evapotranspiration, using various approximation methods.

Here, I choose to only request precipitation (prcp) and max temperature (tmax).

NOTE:
So far, the Daymet data retrieval process has been the slowest aspect of my HyRiver workflow. Due to the high resolution and the potential for large basins, this may be computationally intensive if you try to request data for many gages over long time ranges.

# Import the PyDaymet library
import pydaymet as daymet

## Specify which data to request
met_vars = ["prcp", "tmax"]
met_data_names = np.array(['mean_prcp','sd_prcp','mean_tmax','sd_tmax'])

## Initialize storage
daymet_data = np.zeros((len(flow_data.columns), len(met_data_names)))

Similar to the NLDI() process, I loop through each gage (flow_data.columns) and (1) identify the up-gage basin, (2) source the Daymet data within the basin, (3) aggregate and store the data in daymet_data.

## Loop through stations from above
for i, st in enumerate(flow_data.columns):

    # Get the up-station basin geometry
    geometry = NLDI().get_basins(st).geometry[0]

    # Source Daymet data within basin
    basin_met_data = daymet.get_bygeom(geometry, dates, variables= met_vars)

    ## Pull values, aggregate, and store
    # Mean and std dev precipitation
    daymet_data[i, 0] = np.nan_to_num(basin_met_data.prcp.values).mean()
    daymet_data[i, 1] = np.nan_to_num(basin_met_data.prcp.values).std()

    # Mean and std dev of max temperature
    daymet_data[i, 2] = np.nan_to_num(basin_met_data.tmax.values).mean()
    daymet_data[i, 3] = np.nan_to_num(basin_met_data.tmax.values).std()

daymet_data.shape

# [Out]: (5, 4)

Without having used a web browser, I have been able to access a set of physical basin characteristics, various climate data, and observed streamflow relevant to my region of interest!

Now this data can be exported to a CSV and used in any other project.
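As a small sketch of that export step (the output file name here is just a placeholder), the arrays built above can be combined into one table keyed by gage ID:

# Combine the basin characteristics and aggregated Daymet statistics into one table (one row per gage)
nldi_df = pd.DataFrame(nldi_data, index=flow_data.columns, columns=select_attributes)
daymet_df = pd.DataFrame(daymet_data, index=flow_data.columns, columns=met_data_names)

basin_summary = pd.concat([nldi_df, daymet_df], axis=1)
basin_summary.to_csv('basin_summary.csv')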

Conclusion

I hope this introduction to HyRiver has encouraged you to go bigger with your hydroclimate data ambitions.

If you are curious to learn more, I’d recommend you see the HyRiver Examples which have various in-depth Jupyter Notebook tutorials.

Citations

Chegini, Taher, et al. “HyRiver: Hydroclimate Data Retriever.” Journal of Open Source Software, vol. 6, no. 66, 27 Oct. 2021, p. 3175, 10.21105/joss.03175. Accessed 15 June 2022.

Fisheries Training Part 2 – Tradeoff Visualization and Introduction to J3

Hello there! If you’re here, then you likely have successfully navigated the previous two posts in our Fisheries Training Series:

In these posts, we explored the complex dynamics of a two-species predator-prey fisheries system. We also visualized various potential scenarios of stability and collapse that result from a variety of system parameter values. We then set up the problem components, including its parameters and their associated uncertainty ranges, the performance objectives, and the radial basis functions (RBFs) that map the current system state to a policy action.

Now, we will build off the previous posts and generate the full Pareto-approximate set of performance objectives and their associated decision variable values. We will also specify our multivariate satisficing robustness criteria (Starr, 1963), as set by Hadjimichael et al. (2020), and use J3, a visualization software, to explore the tradeoff space and identify the solutions that meet these criteria.

To better follow along with our training series, please find the accompanying GitHub repository that contains all the source code here.

A brief recap on decision variables, parameters and performance objectives

In the Fisheries Training series, we describe the system using the following parameters:

  • x_{t} and y_{t}: The prey and predator population densities at time t respectively
  • \alpha: The rate at which the predator encounters the prey
  • b: The prey growth rate
  • c: The rate at which the predator converts prey to new predators
  • d: The predator death rate
  • h: The time the predator needs to consume the prey (handling time)
  • K: Environmental carrying capacity
  • m: The level of predator interaction
  • z: The fraction of prey that is harvested

Please refer to Post 0 for further details on the relevance of each parameter.

Our decision variables are the three RBF parameters: the center (c_{i}), radius (r_{i}) and weights (w_{i}) of each RBF i respectively. From Part 1, we opt to use two RBFs where i \in [1,2] to result in six decision variables.

Next, our objectives are as follows:

  • Objective 1: Maximize net present value (NPV)
  • Objective 2: Minimize prey-population deficit
  • Objective 3: Minimize the longest duration of consecutive low harvest
  • Objective 4: Minimize worst harvest instance
  • Objective 5: Minimize harvest variance

Detailed explanation on the formulation and Python execution of the RBFs and objectives can be found in Post 1.
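To make the link between the decision variables and the harvest policy concrete, here is a schematic sketch of a Gaussian RBF policy with two RBFs. It is only meant to illustrate the role of the (c_{i}, r_{i}, w_{i}) values; the exact formulation, normalization, and state inputs used in this series are those given in Post 1.

import numpy as np

def rbf_harvest_policy(prey_density, centers, radii, weights):
    # Schematic Gaussian RBF policy: map the (normalized) prey density to a
    # harvest fraction z in [0, 1]. Illustrative only; see Post 1 for the
    # exact formulation used in the Fisheries Training series.
    z = sum(w * np.exp(-((prey_density - c) / r) ** 2)
            for c, r, w in zip(centers, radii, weights))
    return float(np.clip(z, 0.0, 1.0))

# Two RBFs -> six decision variables, ordered (c1, r1, w1, c2, r2, w2)
example_vars = [0.2, 0.5, 0.3, 0.8, 0.4, 0.6]   # hypothetical values
centers = example_vars[0::3]
radii   = example_vars[1::3]
weights = example_vars[2::3]
print(rbf_harvest_policy(0.5, centers, radii, weights))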

Now that we’ve reviewed the problem setup, let’s get to setting up the code!

Running the full problem optimization

Importing all libraries and setting up the problem

Before beginning, ensure that both Platypus and PyBorg are downloaded and installed as recommended by Post 1. Next, as previously performed, import all the necessary libraries:

# import all required libraries
from platypus import Problem, Real, Hypervolume, Generator
from pyborg import BorgMOEA
from fish_game_functions import *
import matplotlib.pyplot as plt
import time
import random

We then define the problem by setting the number of variables (nVars), performance objectives (nObjs) and constraints (nCnstr). We also define the upper and lower bounds of each objective. The negative values associated with Objectives 1 and 4 indicate that they are to be maximized.

# Set the number of decision variables, constraints and performance objectives
nVars = 6   # Define number of decision variables
nObjs = 5   # Define number of objectives
nCnstr = 1      # Define number of decision constraints

# Define the upper and lower bounds of the performance objectives
objs_lower_bounds = [-6000, 0, 0, -250, 0]
objs_upper_bounds = [0, 1, 100, 0, 32000]

Then we initialize the algorithm (algorithm) to run over 10,000 function evaluations (nfe) with a starting population of 500 (pop_size):

# initialize the optimization
algorithm = fisheries_game_problem_setup(nVars, nObjs, nCnstr)
nfe = 10000    # number of function evaluations
pop_size = 500    # population size

Storing the Pareto-approximate objectives and decision variables

We are ready to run this (Fisheries) world! But first, we will open two text files in which to store the Pareto-approximate objectives (Fisheries2_objs.txt) and their associated decision variables (Fisheries2_vars.txt); these will be converted to CSV files during post-processing. They contain the (approximately) optimal performance objective values discovered by PyBorg and the RBF (c_{i}, r_{i}, w_{i}) vectors that give rise to them. We also record the total amount of time it takes to optimize the Fisheries over 10,000 NFEs with a population of 500.

# open file in which to store optimization objectives and variables
f_objs = open('Fisheries2_objs.txt', "w+")
f_vars = open('Fisheries2_vars.txt', "w+")

# get number of algorithm variables and performance objectives
nvars = algorithm.problem.nvars
nobjs = algorithm.problem.nobjs
    
# begin timing the optimization
opt_start_time = time.time()

# re-initialize the algorithm with the specified population size, then run it
algorithm = fisheries_game_problem_setup(nVars, nObjs, nCnstr, pop_size=int(pop_size))
algorithm.run(int(nfe))

# get the solution archive
arch = algorithm.archive[:]
for i in range(len(arch)):
    sol = arch[i]
    # write objectives to file
    for j in range(nobjs):
        f_objs.write(str(sol.objectives[j]) + " ")
    # write variables to file
    for j in range(nvars):
        f_vars.write(str(sol.variables[j]) + " ")
        
    # end this solution's row in both files
    f_objs.write("\n")
    f_vars.write("\n")

# end timing and print optimization time 
opt_end_time = time.time()

opt_total_time = opt_end_time - opt_start_time

f_objs.close()
f_vars.close()

# print the total time to console
print(f"\nTime taken = {opt_total_time} seconds")

The optimization should take approximately 3,100 seconds or 52 minutes. When the optimization is completed, you should be able to locate the Fisheries2_objs.txt and Fisheries2_vars.txt files in the same folder where the Jupyter notebook is stored.

Post-Processing

To ensure that our output can be used in our following steps, we perform post-processing to convert the .txt files into .csv files.

import numpy as np
import pandas as pd

# convert the txt files to csv
# load the .txt files as numpy matrices
matrix_objs = np.genfromtxt('Fisheries2_objs.txt', delimiter=' ')
matrix_vars = np.genfromtxt('Fisheries2_vars.txt', delimiter=' ')

# confirm the matrix shapes
# the objectives matrix should have shape (n_solns, nObjs)
# the variables matrix should have shape (n_solns, nVars)
n_solns = matrix_objs.shape[0]   # one solution per row

matrix_objs = np.reshape(matrix_objs, (n_solns, nObjs))
matrix_vars = np.reshape(matrix_vars, (n_solns, nVars))

# label the objectives and variables
objs_names = ['NPV', 'Pop_Deficit', 'Low_Harvest', 'Worst_Harvest', 'Variance']
var_names = ['c1', 'r1', 'w1', 'c2', 'r2', 'w2']

# Convert the matrices to dataframes with header names
df_objs = pd.DataFrame(matrix_objs, columns=objs_names)
df_vars = pd.DataFrame(matrix_vars, columns=var_names)

# save the processed matrices as csv files
df_objs.to_csv('Fisheries2_objs.csv', sep=',', index=False)
df_vars.to_csv('Fisheries2_vars.csv', sep=',', index=False)

You should now be able to locate the Fisheries2_objs.csv and Fisheries2_vars.csv within the same folder where you store the Jupyter Notebook.

In the following steps, we will introduce the J3 Visualization Software, which takes .csv files as inputs, to visualize and explore the tradeoff space of the fisheries problem.

Introduction to visualization with J3

J3 is an open-source app for producing and sharing high-dimensional, interactive scientific visualizations. It is part of the larger Project Platypus, a collection of libraries that aid in decision-making, optimization, and data visualization. It is influenced by D3.js, a JavaScript library for manipulating data using documents (data-driven documents, hence the name). Instead of documents, J3 manipulates data using many-dimensional plots, annotations, and animations.

There is a prior post by Antonia Hadjimichael that covers the Python implementation of J3. In this post, we will be exploring the J3 app itself.

Installing and setting up J3

To use J3, you should first install Java. Please follow the directions found on the official Java site to select the appropriate installation package for your operating system.

Next, you can install J3 in either one of two ways:

  1. Download the .zip file from the J3 GitHub repository and extract its contents into a desired location on your local machine.
  2. Install using git clone:
cd your-desired-location-path
git clone https://github.com/Project-Platypus/J3.git

You should now see a folder called ‘J3’ located in the path where you chose to extract the repository. Run the J3.exe file within the folder as shown below:

Next, we upload our Fisheries2_objs.csv file into J3:

The GIF below shows a 3D tradeoff plot and demonstrates the function of each of the toggles. In this 3D plot, the NPV and Harvest Variance are shown on the x- and y-axes, while the Worst-case Harvest is shown on the z-axis. The size of the points represents the Lowest Harvest Instance, and their colors represent the Population Size.

Other functions not shown above include:

  1. Zooming in by scrolling on your mouse or trackpad
  2. Deleting the annotations by right-clicking on them
  3. Pressing the ‘esc’ key to de-select a point of interest

Next, we can also generate accompanying 2D-scatter and parallel axis plots to this 3D tradeoff figure:

In the parallel axis plot, the direction of preference is upwards. Here, we can visualize the significant tradeoff between the net present value of the fisheries’ yield and the prey-population deficit. If stakeholders wish to maximize the economic value of the fisheries, they may experience unsustainable prey-population deficits. The relationships between the remaining objectives are less clear. In J3, you can rearrange the parallel axis plot’s axes to better visualize the tradeoffs between pairs of objectives:

Here, we observe an additional tradeoff between minimizing the population deficit and maintaining low occurrences of low-harvest events. From this brief picture, we can see that the main tradeoffs within the Fisheries system are between ecological objectives, such as the population deficit, and economic objectives, such as net present value and harvest.

Note the Brushing tool that appears next to the parallel axis plot. It will be important in our next step: defining our robustness multivariate satisficing criteria.

The multivariate satisficing criteria and identifying robust solutions

The multivariate satisficing criteria are derived from Starr’s domain criterion satisficing measure (Starr, 1963). In Hadjimichael et al. (2020), these criteria were selected because they allow the identification of solutions that meet stakeholders’ competing requirements. In the context of Part 2, we use them to identify the solutions in the Pareto-approximate set that satisfy stakeholder expectations. Here, the requirements are as follows:

  1. Net present value (NPV) \geq 1,500
  2. Prey-population deficit \leq 0.5
  3. Longest duration of consecutive low harvest \leq 5
  4. Worst harvest instance \geq 50
  5. Harvest variance \leq 1

Using the brushing tool to highlight only the solutions of interest, we find a pared-down version of the Pareto set. This tells us that not all optimal solutions are realistic, feasible, or satisfactory to decision-makers in the Fisheries system.
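If you would also like to check these criteria programmatically, here is a minimal pandas sketch. It assumes the Fisheries2_objs.csv file produced above, and that the two maximization objectives (NPV and Worst_Harvest) are stored as negated values, consistent with the objective bounds defined earlier:

import pandas as pd

# load the Pareto-approximate objectives generated earlier
df_objs = pd.read_csv('Fisheries2_objs.csv')

# apply the five satisficing thresholds; NPV and Worst_Harvest carry minus
# signs because the maximization objectives were stored as negated values
satisficing = (
    (-df_objs['NPV'] >= 1500) &
    (df_objs['Pop_Deficit'] <= 0.5) &
    (df_objs['Low_Harvest'] <= 5) &
    (-df_objs['Worst_Harvest'] >= 50) &
    (df_objs['Variance'] <= 1)
)

robust_solns = df_objs[satisficing]
print(f"{len(robust_solns)} of {len(df_objs)} solutions meet all criteria")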

Conclusion

Good job with making it this far! Your accomplishments are many:

  1. You ran a full optimization of the Fisheries Problem.
  2. You downloaded, installed, and learned how to use J3 to visualize and manipulate data to explore the tradeoff space of the Fisheries system.
  3. You learned about Multivariate Satisficing Criteria to identify solution tradeoffs that are acceptable to the stakeholders within the Fisheries system.

In our next post, we will further expand on the concept of the multivariate satisficing criteria and use it to evaluate how two or three of the solutions that initially satisfied the stakeholder requirements perform when tested across more challenging scenarios. In the meantime, we recommend that you explore the use of J3 on other datasets as well, and see if you can come up with an interesting narrative based on your data!

Until then, happy visualizing!

References

Giuliani, M., Castelletti, A., Pianosi, F., Mason, E. and Reed, P., 2016. Curses, Tradeoffs, and Scalable Management: Advancing Evolutionary Multiobjective Direct Policy Search to Improve Water Reservoir Operations. Journal of Water Resources Planning and Management, 142(2).

Hadjimichael, A., Reed, P. and Quinn, J., 2020. Navigating Deeply Uncertain Tradeoffs in Harvested Predator-Prey Systems. Complexity, 2020, pp.1-18.

Starr, M., 1963. Product design and decision theory. Journal of the Franklin Institute, 276(1), p.79.