Sankey diagrams for USGS gauge data in python(?)

This post was inspired by the Sankey diagram in Figure 1 of this pre-print led by Dave Gold:Exploring the Spatially Compounding Multi-sectoral Drought Vulnerabilities in Colorado’s West Slope River Basins” (Gold, Reed & Gupta, In Review) which features a Sankey diagram of flow contributions to Lake Powell. I like the figure, and thought I’d make an effort to produce similar diagrams using USGS gauge data.

Sankey diagrams show data flows between different source and target destinations. Lot’s of people use them to visualize their personal/business cashflows. It’s an obvious visualization option for streamflows.

To explain the “(?)” in my title: When I started this, I’d realized quickly that I to choose one of two popular plotting packages: matplotlib or plotly.

I am a frequent matplotlib user and definitely appreciate the level of control in the figure generation process. However, it can sometimes be more time- and line-intensive designing highly customized figures using matplotlib. On the other hand, in my experience, plotly tools can often produce appealing graphics with less code. I am also drawn to the fact that the plotly graphics are interactive objects rather than static figures.

I decided to go with plotly to try something new. If you want to hear my complaints and thoughts on use context, you can skip to the conclusions below.

In the sections below, I provide some code which will:

  • Define a network of USGS gauge stations to include in the plot
  • Retrieve data from USGS gauge stations
  • Create a Sankey diagram using plotly showing streamflows across the network

Here, I focus on the Rio Grande river upstream of Albuquerque, NM. However you can plot a different streamflow network by modifying the dictionary of upstream nodes defining the network.


Plotting a Sankey streamflow network with plotly

The code used here requires both plotly and the pygeohydro package (for USGS data retrieval).

from pygeohydro import NWIS
import plotly.graph_objects as go

With that out of the way, we can get started.

Defining the flow network & data retrieval

I start by defining a dictionary called upstream_stations which defines the relationships between different gauges of interest.

This dictionary contains pairs of the form: {"GAUGE_ID" : ["LIST_OF", "UPSTREAM", "GAUGE_IDs"]}

If there is no upstream site, then include an empty list. For the Rio Grande network, this looks like:

# define relationships between each gauge and upstream sites
upstream_stations = {
    '08329918' : ['08319000', '08328950'], 
    '08319000' : ['08317400', '08317200'],
    '08328950' : [],
    '08317200' : [],
    '08317400' : ['08313000'],
    '08313000' : ['08290000', '08279500'],
    '08287000' : [],
    '08279500' : [],
    '08290000' : ['08287000', '08289000'],
    '08289000' : [],
}

# Get list of all stations from upstream_stations
all_stations = list(upstream_stations.keys())
for station, upstream in upstream_stations.items():
    all_stations += upstream
all_stations = list(set(all_stations))

Notice that I also made a list containing all the stations IDs. I use the pygeohydro package from the HyRiver suite of tools to get retrieve the gauge station data (Chegini, Li, & Leung, 2021). I often cite this package, and have written about it in a past post (“Efficient hydroclimatic data accessing with HyRiver for Python”).

Using the list of all_stations, I use the following code to pull daily streamflow data for each site from 2015-2020 (or some other specified dates):

def get_usgs_gauge_data(stations, dates):
    """
    Get streamflow data from USGS gauge stations using NWIS.
    """
    nwis = NWIS()
    df = nwis.get_streamflow(stations, dates, mmd=False)
    
    # get rid of USGS- in columns
    df.columns = df.columns.str.replace('USGS-', '')
    return df

# Get USGS flows
flows = get_usgs_gauge_data(all_stations, ('2015-01-01', '2020-12-31'))

For the Sankey diagram, we need a single flow value for each station. In this case I calculate an average of the annual total flows at each station:

# Get annual mean flows
agg_flows = flows.resample('Y').sum().agg('mean')

Creating the Sankey figure

At it’s core, a Sankey diagram is a visualization of a weighted network (also referred to as a graph) defined by:

  • Nodes
  • Links (aka Edges)
  • Weights

In our case, the nodes are the USGS gauge stations, the links are the connections between upstream and downstream gauges, and the weights are the average volumes of water flowing from one gauge to the next.

Each link is defined by a source and target node and a value. This is where the upstream_stations dictionary comes in. In the code block below, I set up the nodes and links, looping through upstream_stations to define all of the source-target relationships:

## Deinfe nodes and links
# Nodes are station IDs
nodes = all_stations
node_indices = {node: i for i, node in enumerate(nodes)}

# make links based on upstream-downstream relationships
links = {
    'source': [],
    'target': [],
    'value': [],
}

# loop through upstream_stations dict
for station, upstream_list in upstream_stations.items():
    for stn in upstream_list:
        if stn in agg_flows and station in agg_flows:
            links['source'].append(node_indices[stn])
            links['target'].append(node_indices[station])
            links['value'].append(agg_flows[stn])

Lastly, I define some node labels and assign colors to each node. In this case, I want to make the nodes black if they represent reservoir releases (gauges at reservoir outlets) or blue if they are simple gauge stations.

labels = {
    '08329918' : 'Rio Grande at Alameda', 
    '08319000' : 'San Felipe Gauge',
    '08328950' : 'Jemez Canyon Reservoir',
    '08317200' : 'Santa Fe River',
    '08317400' : 'Cochiti Reservoir',
    '08313000' : 'Rio Grande at Otowi Bridge',
    '08287000' : 'Abiquiu Reservoir',
    '08279500' : 'Rio Grande',
    '08290000' : 'Rio Chama',
    '08289000' : 'Rio Ojo Caliente',
}

# Create nodes labels and colors lists
node_labels = [labels[node] for node in nodes]
node_colors = ['black' if 'Reservoir' in label else 'dodgerblue' for label in node_labels]


Finally, the function to generate the figure:

def create_sankey_diagram(node_labels, links, node_colors, 
						  orientation='h',
                          size=(2000, 700)):
    """
    Create a Sankey diagram of using Plotly.
    
    Parameters
    ----------
    node_labels : list
        List of node labels.
    links : dict
        Dictionary with keys 'source', 'target', and 'value'.
    node_colors : list
        List of node colors.
    orientation : str
        Orientation of the diagram, 'h' for horizontal and 'v' for vertical.
        
    Returns
    -------
    sankey_fig : plotly.graph_objects.Figure
        Plotly figure object.
    """
    sankey_fig = go.Figure(go.Sankey(
        orientation=orientation,
        node=dict(
            pad=70,
            thickness=45,
            line=dict(color='dodgerblue', width=0.5),
            label=node_labels,
            color=node_colors
        ),
        link=dict(
            source=links['source'],
            target=links['target'],
            value=links['value'],
            color='cornflowerblue'
        )
    ))
    
    sankey_fig.update_layout(
        title_text="Rio Grande Streamflow ",
        font=dict(size=23),
        width=size[0],
        height=size[1]
    )
    return sankey_fig

There are some options for manipulating this figure script to better suit your needs. Specifically you may want to modify:

  • pad=70 : this is the horizontal spacing between nodes
  • thickness=45 : this is the thickness of the node elements

With our pre-prepped data from above, we can use the function like so:

sankey_fig = create_sankey_diagram(node_labels, 
								   links, 
								   node_colors, 
								   orientation='v', size=(1000, 1200))
sankey_fig

And here we have it:

I’d say it looks… okay. And admittedly this is the appearance after manipulating the node placement using the interactive interface.

It’s a squished vertically (which can be improved by making it a much taller figure). However my biggest issue is with the text being difficult to read.

Changing the orientation to horizontal (orientation='h') results in a slightly better looking figure. Which makes sense, since the Sankey diagram is often shown horizontal. However, this does not preserve the relationship to the actual North-South flow direction in the Rio Grande, so I don’t like it as much.

Conclusions

To answer the question posed by the title, “Sankey diagrams for USGS gauge data in python(?)”: Yes, sometimes. And sometimes something else.

Plotly complaints: While working on this post, I developed a few complaints with the plotly Sankey tools. Specifically:

  • It appears that the label text coloring cannot be modified. I don’t like the white edgecolor/blur effect, but could not get rid of this.
  • The font is very difficult to read… I had to make the text size very large for it to be reasonably legible.
  • You can only assign a single node thickness. I had wanted to make the reservoirs thick, and shrink the size of the gauge station nodes. However, it seems this cannot be done.
  • The diagrams appear low-resolution and I don’t see a way to save a high res version.

Ultimately, the plotly tools are very restrictive in the design of the graphic. However, this is a tradeoff in order to get the benefit of interactive graphics.

Plotly praise: The plotly Sankey tools have some advantages, specifically:

  • The plots are interactive
  • Plots can be saved as HTML and embedded in websites

These advantages make the plotly tools good for anyone who might want to have a dynamic and maybe frequently updated dashboard on a site.

On the other hand, if I wanted to prepare a publication-quality figure, where I had absolute control of the design elements, I’d likely turn to matplotlib. That way it could be saved as an SVG and further manipulated in a vector art program link Inkscape or Illustrator.

Thanks for reading!

References

Chegini, T., Li, H. Y., & Leung, L. R. (2021). HyRiver: Hydroclimate data retriever. Journal of Open Source Software, 6(66), 3175.

Links

Introduction to Bayesian Regression using PyMC

Motivation

Fans of this blog will know that uncertainty is often a focus for our group. When approaching uncertainty, Bayesian methods might be of interest since they explicitly provide uncertainty estimates during the modeling process.

PyMC is the best tool I have come across for Bayesian modeling in Python; this post gives a super brief introduction to this toolkit.

Introduction to PyMC

PyMC, described in their own words:
“… is a probabilistic programming library for Python that allows users to build Bayesian models with a simple Python API and fit them using Markov chain Monte Carlo (MCMC) methods.”

In my opinion, the best part of PyMC is the flexibility and breadth of model design features. The space of different model configurations is massive. It allows you to make models ranging from simple linear regressions (shown here), to more complex hierarchical models, copulas, gaussian processes, and more.

Regardless of your model formulation, PyMC let’s you generate posterior estimates of model parameter distributions. These parameter distributions reflect the uncertainty in the model, and can propagate uncertainty into your final predictions.

The posterior estimates of model parameters are generated using Markov chain Monte Carlo (MCMC) methods. A detailed overview of MCMC is outside the scope of this post (maybe in a later post…).

In the simplest terms, MCMC is a method for estimating posterior parameter distributions for a Bayesian model. It generates a sequence of samples from the parameter space (which can be huge and complex), where the probability of each sample is proportional to its likelihood given the observed data. By collecting enough samples, MCMC generates an approximation of the posterior distribution, providing insights into the probable values of the model parameters along with their uncertainties. This is key when the models are very complex and the posterior cannot be directly defined.

The PyMC example gallery has lots of cool stuff to get you inspired, with examples that go far and beyond the simple linear regression case.


Demonstration:

When writing drafting this post, I wanted to include a demonstration which is (a) simple enough to cover in a brief post, and (b) relatively easy for others to replicate. I settled on the simple linear regression model described below, since this was able to be done using readily retrievable CAMELS data.

The example attempts to predict mean streamflow as a linear function of basin catchment area (both in log space). As you’ll see, it’s not the worst model, but its far from a good model; there is a lot of uncertainty!

CAMELS Data

For a description of the CAMELS dataset, see Addor, Newman, Mizukami and Clark (2017).

I pulled all of the national CAMELS data using the pygeohydro package from HyRiver which I have previously recommended on this blog. This is a convenient single-line code to get all the data:

import pygeohydro as gh

### Load camels data
camels_basins, camels_qobs = gh.get_camels()

The camels_basins variable is a dataframe with the different catchment attributes, and the camels_qobs is a xarray.Dataset. In this case we will only be using the camels_basins data.

The CAMELS data spans the continental US, but I want to focus on a specific region (since hydrologic patterns will be regional). Before going further, I filter the data to keep only sites in the Northeaster US:

# filter by mean long lat of geometry: NE US
camels_basins['mean_long'] = camels_basins.geometry.centroid.x
camels_basins['mean_lat'] = camels_basins.geometry.centroid.y
camels_basins = camels_basins[(camels_basins['mean_long'] > -80) & (camels_basins['mean_long'] < -70)]
camels_basins = camels_basins[(camels_basins['mean_lat'] > 35) & (camels_basins['mean_lat'] < 45)]

I also convert the mean flow data (q_mean) units from mm/day to cubic meters per day:

# convert q_mean from mm/day to m3/s
camels_basins['q_mean_cms'] = camels_basins['q_mean'] * (1e-3) *(camels_basins['area_gages2']*1000**2) * (1/(60*60*24)) 

And this is all the data we need for this crude model!

Bayesian linear model

The simple linear regression model (hello my old friend):

Normally you might assume that there is a single, best value corresponding to each of the model parameters (alpha and beta). This is considered a Frequentist perspective and is a common approach. In these cases, the best parameters can be estimated by minimizing the errors corresponding to a particular set of parameters (see least squares, for example.

However, we could take a different approach and assume that the parameters (intercept and slope) are random variables themselves, and have some corresponding distribution. This would constitute a Bayesian perspective.

Keeping with simplicity in this example, I will assume that the intercept and slope each come from a normal distribution with a mean and variance such that:

When it comes time to make inferences or predictions using our model, we can create a large number of predictions by sampling different parameter values from these distributions. Consequently, we will end up with a distribution of uncertain predictions.

PyMC implementation

I recommend you see the PyMC installation guide to help you get set up.

NOTE: The MCMC sampler used by PyMC is written in C and will be SIGNIFICANTLY faster if you provide have access to GCC compiler and specify the it’s directory using the the following:

import pymc as pm

import os
os.environ["THEANO_FLAGS"] = "gcc__cxxflags=-C:\mingw-w64\mingw64\bin"

You will get a warning if you don’t have this properly set up.

Now, onto the demo!

I start by retrieving our X and Y data from the CAMELS dataset we created above:

# Pull out X and Y of interest
x_ftr= 'area_gages2'
y_ftr = 'q_mean_cms'
xs = camels_basins[x_ftr] 
ys = camels_basins[y_ftr]

# Take log-transform 
xs = np.log(xs)
ys = np.log(ys)

At a glance, we see there is a reasonable linear relationship when working in the log space:

Two of the key features when building a model are:

  • The random variable distribution constructions
  • The deterministic model formulation

There are lots of different distributions available, and each one simply takes a name and set of parameter values as inputs. For example, the normal distribution defining our intercept parameter is:

alpha = pm.Normal('alpha', mu=intercept_prior, sigma=10)

The value of the parameter priors that you specify when construction the model may have a big impact depending on the complexity of your model. For simple models you may get away with having uninformative priors (e.g., setting mu=0), however if you have some initial guesses then that can help with getting reliable convergence.

In this case, I used a simple least squares estimate of the linear regression as the parameter priors:

slope_prior, intercept_prior = np.polyfit(xs.values.flatten(), ys.values.flatten(), 1)

Once we have our random variables defined, then we will need to formulate the deterministic element of our model prediction. This is the functional relationship between the input, parameters, and output. For our linear regression model, this is simply:

y_mu = alpha + beta * xs

In the case of our Bayesian regression, this can be thought of as the mean of the regression outputs. The final estimates are going to be distributed around the y_mu with the uncertainty resulting from the combinations of our different random variables.

Putting it all together now:

### PyMC linear model
with pm.Model() as model:
    
    # Priors
    alpha = pm.Normal('alpha', mu=intercept_prior, sigma=10)
    beta = pm.Normal('beta', mu=slope_prior, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=1)

    # mean/expected value of the model
    mu = alpha + beta * xs

    # likelihood
    y = pm.Normal('y', mu=mu, sigma=sigma, observed=ys)

    # sample from the posterior
    trace = pm.sample(2000, cores=6)
 

With our model constructed, we can use the pm.sample() function to begin the MCMC sampling process and estimate the posterior distribution of model parameters. Note that this process can be very computationally intensive for complex models! (Definitely make sure you have the GCC set up correctly if you plan on needing to sample complex models.)

Using the sampled parameter values, we can create posterior estimates of the predictions (log mean flow) using the posterior parameter distributions:

## Generate posterior predictive samples
ppc = pm.sample_posterior_predictive(trace, model=model)

Let’s go ahead and plot the range of the posterior distribution, to visualize the uncertainty in the model estimates:

### Plot the posterior predictive interval
fig, ax = plt.subplots(ncols=2, figsize=(8,4))

# log space
az.plot_hdi(xs, ppc['posterior_predictive']['y'], 
            color='cornflowerblue', ax=ax[0], hdi_prob=0.9)
ax[0].scatter(xs, ys, alpha=0.6, s=20, color='k')
ax[0].set_xlabel('Log ' + x_ftr)
ax[0].set_ylabel('Log Mean Flow (m3/s)')

# original dim space
az.plot_hdi(np.exp(xs), np.exp(ppc['posterior_predictive']['y']), 
            color='cornflowerblue', ax=ax[1], hdi_prob=0.9)
ax[1].scatter(np.exp(xs), np.exp(ys), alpha=0.6, s=20, color='k')
ax[1].set_xlabel(x_ftr)
ax[1].set_ylabel('Mean Flow (m3/s)')
plt.suptitle('90% Posterior Prediction Interval', fontsize=14)
plt.show()

And there we have it! The figure on the left shows the data and posterior prediction range in log-space, while the figure on the right is in non-log space.

As mentioned earlier, it’s not the best model (wayyy to much uncertainty in the large-basin mean flow estimates), but at least we have the benefit of knowing the uncertainty distribution since we took the Bayesian approach!

That’s all for now; this post was really meant to bring PyMC to your attention. Maybe you have a use case or will be more likely to consider Bayesian approaches in the future.

If you have other Bayesian/probabilistic programming tools that you like, please do comment below. PyMC is one (good) option, but I’m sure other people have their own favorites for different reasons.


PyMC resources:

References

Addor, N., Newman, A. J., Mizukami, N. and Clark, M. P. The CAMELS data set: catchment attributes and meteorology for large-sample studies, Hydrol. Earth Syst. Sci., 21, 5293–5313, doi:10.5194/hess-21-5293-2017, 2017.

Interactive Visualization of Scientific Data and Results with Panel

This is a guest post by David Lafferty from the University of Illinois Urbana-Champaign. David is a fifth-year PhD student in the Department of Climate, Meteorology & Atmospheric Sciences working with Dr. Ryan Sriver. David’s research contributes to advances in climate change risk assessment. Dr. Casey Burleyson, a research scientist at the Pacific Northwest National Laboratory, also contributed text about the MSD-LIVE platform.

Visualization plays a crucial role in the scientific process. From an initial exploratory data analysis to the careful presentation of hard-earned results, data visualization is a powerful tool to generate hypotheses and insights, communicate findings to both scientists and stakeholders, and aid decision-making. Interactive data visualization can enhance each of these use cases and is particularly useful for facilitating collaborations and deepening user engagement. Effective data visualization has been a recurring theme of the Water Programming blog, with several past posts directly addressing the topic, including some that highlight various tools for interactive visualization [1, 2].

In this post, I’d like to highlight Panel, a data exploration and web app framework in Python.  I’ll also showcase how we used Panel to construct an interactive dashboard to accompany a recent paper, allowing readers to explore our results in more detail.

An overview of the HoloViz ecosystem, taken from the hvPlot website. Panel provides a high-level interface that allows users to build linked apps across all of these visualization libraries.

Panel is an open-source Python library that allows developers to create interactive web apps and dashboards in Python. It is part of the HoloViz ecosystem, which includes other visualization packages like HoloViews and GeoViews. Panel provides a high-level API for creating rich, interactive visualizations and user interfaces directly from Python code. The library includes an impressive number of features, such as different types of panes for displaying and arranging plots, media, text, or other external objects; a wide range of widgets; and flexible, customizable dashboard layouts. Panel itself does not provide visualization tools but instead relies on (and is designed to work well with) other popular plotting libraries such as Bokeh, Datashader, Folium, Plotly, and Matplotlib. Panel provides several deployment options for your polished apps and dashboards, but also works well within development environments such as Jupyter Notebooks, VS Code, PyCharm, or Spyder.

A simple Panel dashboard

To get a sense of Panel’s capabilities, let’s build a simple dashboard that analyzes global temperature anomalies from different observational datasets. Global mean temperature rise above pre-industrial levels is one of most widely-used indicators of climate change, and 2023 was a record year in this respect – all major monitoring agencies recorded 2023 as the warmest year on record [3] and the exceptional warmth of 2023 fell outside the prediction intervals of many forecasting groups [4].

To visualize this temperature change, we’re going to look at four observational datasets: Berkeley Earth, HadCRUT5, NOAA, and GISTEMPv4. First, we need to import the required Python libraries:

!pip install hvplot

from functools import reduce
import pandas as pd
import hvplot.pandas

import panel as pn
pn.extension(comms='colab')

This notebook is available on Google Colab, which is why I’ve set comms=‘colab' in the Panel configuration settings. Next, we download and merge all observational datasets:

# HadCRUT5 anomaly relative to 1961-1990
# Link: https://www.metoffice.gov.uk/hadobs/hadcrut5/
hadcrut = pd.read_csv('https://www.metoffice.gov.uk/hadobs/hadcrut5/data/HadCRUT.5.0.2.0/analysis/diagnostics/HadCRUT.5.0.2.0.analysis.summary_series.global.annual.csv',
                      usecols=[0,1], skiprows=1, names=['Year','HadCRUT5'], index_col=0)

# GISTEMP v4 anomaly relative to 1951-1980
# Link: https://data.giss.nasa.gov/gistemp/
gistemp = pd.read_csv('https://data.giss.nasa.gov/gistemp/tabledata_v4/GLB.Ts+dSST.csv',
                      skiprows=2, usecols=[0,13], names=['Year', 'GISTEMPv4'], index_col=0, skipfooter=1, engine='python')

# NOAA anomaly relative to 1901-2000
# Link: https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series/globe/land_ocean/ytd/12
noaa = pd.read_csv('https://www.ncdc.noaa.gov/cag/global/time-series/globe/land_ocean/ytd/12/1880-2023.csv',
                   skiprows=5, names=['Year', 'NOAA'], index_col=0)

# Berkeley Earth anomaly relative to 1951-1980
# Link: https://berkeleyearth.org/data/
berkearth = pd.read_csv('https://berkeley-earth-temperature.s3.us-west-1.amazonaws.com/Global/Land_and_Ocean_summary.txt',
                        skiprows=58, usecols=[0,1], sep='\s+', names=['Year', 'BerkeleyEarth'], index_col=0)

# Merge
df = reduce(lambda x, y: pd.merge(x, y, left_index=True, right_index=True, how='outer'), [hadcrut, gistemp, noaa, berkearth])

As you can see, each of these datasets measures the global mean temperature anomaly relative to a fixed baseline, and the baselines are not constant across datasets. Climate scientists often define 1850-1900 as the canonical pre-industrial baseline, but choosing this period is a non-trivial exercise [5] and can impact the uncertainty around our estimates of climate change. Let’s explore this phenomenon interactively using Panel and the datasets downloaded above:

def adjust_baseline(obs_name, new_baseline):
  # Adjusts given observational dataset to selected new baseline

def plot_timeseries(obs_to_include, new_baseline_start, new_baseline_end):
  # Plots timeseries of anomalies for selected datasets and new baseline

def get_2023_anomaly(obs_to_include, new_baseline_start, new_baseline_end):
  # Calculates and shows 2023 anomaly for selected datasets and new baseline

The full code for each of these functions is available in the notebook. Briefly, adjust_baseline calculates a new timeseries of global temperature anomalies for a given observational dataset, now relative to a new user-defined baseline. plot_timeseries then plots these new timeseries, and the user can choose which datasets are included via the obs_to_include parameter. As I mentioned above, Panel interfaces nicely with several other plotting libraries – here I elected to use hvPlot with the Bokeh backend to add zooming and panning capabilities to the timeseries plot. Finally, get_2023_anomaly reports the 2023 temperature anomalies relative to the new baseline (where I’ve used Panel’s Number indicator to color-code the anomaly report for some added glitz).

Now, let’s add interactivity to these functions via Panel. First, we define a set of ‘widgets’ that allow the user to enter their chosen baseline, and which observational datasets to include in the plots:

# Widgets
new_baseline_start = pn.widgets.IntInput(name='Baseline start year', value=1981, step=5, start=1850, end=2022, width=150)
new_baseline_end = pn.widgets.IntInput(name='Baseline end year', value=2010, step=5, start=1851, end=2023, width=150)

obs_to_include = pn.widgets.CheckButtonGroup(value=obs_names, options=obs_names, orientation='vertical', width=150)

Panel has a large number of widgets available for different use cases, so I encourage you to check those out. Then, we use Panel’s bind functionality to define reactive or bound versions of the functions above:

# Interactivity is generated by pn.bind
interactive_anomaly = pn.bind(get_2023_anomaly, obs_to_include, new_baseline_start, new_baseline_end)
interactive_plot = pn.bind(plot_timeseries, obs_to_include, new_baseline_start, new_baseline_end)

These functions will now be automatically re-run upon a user interaction. Finally, we can construct the dashboard by using a combination of Panel’s row and column layouts:

# Construct the dashboard
obs_to_include_heading = pn.pane.HTML('<label>Observational datasets to show:</label>')
anomaly_heading = pn.pane.Markdown('## 2023 anomaly:')

pn.Row(
    pn.Column(pn.WidgetBox(new_baseline_start, new_baseline_end, width=175, margin=10),
              pn.WidgetBox(pn.Column(obs_to_include_heading, obs_to_include), width=175, margin=10)),
    interactive_plot,
    pn.Column(anomaly_heading, interactive_anomaly),
    width=1000,
)

Et voila! You’ve just built your first Panel dashboard. If you run the code in the Google Colab notebook, it should look something like this:

You can try changing the baseline, zooming and panning around the plot, and toggling the datasets on/off. Notice that using an earlier baseline period typically results in a larger disagreement among the different products, reflecting the large uncertainties in early temperature measurements.  

Facilitating a deeper exploration of published results

We relied on Panel to produce a more elaborate dashboard to accompany a recent paper, Downscaling and bias-correction contribute considerable uncertainty to local climate projections in CMIP6, published in npj Climate & Atmospheric Science last year [6]. In brief, the goal of this paper was to understand the relative importance of various sources of uncertainty in local climate projections. In addition to the traditional sources of climate uncertainty, such as model uncertainty (different climate models can simulate different future climates) and internal variability (related to the chaotic nature of the Earth system), we also looked at downscaling and bias-correction uncertainty, which arises from the variety of available post-processing methods that aim to make climate model outputs more actionable (i.e., higher spatial resolution and reduced biases relative to observations). Our results were highly heterogeneous across space, time, and indicators of climate change, so it was challenging to condense everything into a few figures for the paper. As such, we developed a dashboard to allow users to explore our results in more detail by focusing in on specific locations or time periods that might be of more interest to them:

We relied on Panel for this dashboard mainly for its flexibility. Panel allowed us to include several different plot types and user interaction features, along with text boxes and figures to provide additional information. For example, users can explore the uncertainty breakdown at different locations by zooming and panning around a map plot, entering specific latitude and longitude coordinates, or selecting from a drop-down list of almost 900 world cities. As someone not well-versed in HTML or JavaScript, a major advantage provided by Panel was the ability to construct everything in Python.

Our dashboard is publicly available (https://lafferty-sriver-2023-downscaling-uncertainty.msdlive.org/) and we’d love for you to check it out! For deploying the dashboard, we worked with Carina Lansing and Casey Burleyson from PNNL to make it available on MSD-LIVE. MSD-LIVE (the MultiSector Dynamics Living, Intuitive, Value-adding, Environment) is a cloud-based flexible and scalable data and code management system combined with an advanced computing platform that will enable MSD researchers to document and archive their data, run their models and analysis tools, and share their data, software, and multi-model workflows within a robust Community of Practice. For the dashboard, we leveraged MSD-LIVE’s cloud computing capabilities to support the on-demand computing required to generate the plots. When a user clicks on the dashboard link, MSD-LIVE spins up a dedicated computing instance on the Amazon Web Services (AWS) cloud. That computing instance has the required Python packages pre-installed and is directly connected to the dataset that underpins the dashboard. Once the dashboard is closed or after a period of inactivity the dedicated computing instance is spun down. The underlying cloud computing capabilities in MSD-LIVE have also been used to support Jupyter-based interactive tutorials used to teach new people to use MSD models.

Conclusion

I hope this blog post has served as a useful introduction to Panel’s capabilities, and perhaps sparked some new ideas about how to visualize and communicate your scientific data and results. Although the temperature anomaly dashboard we created earlier was very simple, I think it demonstrates the key steps towards constructing more complicated apps: write functions to produce your desired visualizations, add interactivity via pn.bind, then arrange your dashboard using Panel’s plethora of panes, widgets, and layout features. Like any coding project, you will invariably run into unforeseen difficulties – there was definitely some trial-and-error involved in building the more complicated dashboard associated with our paper – but following this ordered approach should allow you to start creating interactive (and hopefully more effective) scientific visualizations. I would also encourage you to check out the Panel component and app galleries for yet more inspiration and demonstrations of what Panel can do.

References

  1. Trevor Amestoy, Creating Interactive Geospatial Maps in Python with Folium, Water Programing blog (2023). https://waterprogramming.wordpress.com/2023/04/05/creating-interactive-geospatial-maps-in-python-with-folium/
  2. Trevor Amestoy, Constructing interactive Ipywidgets: demonstration using the HYMOD model, Water Programing blog (2022). https://waterprogramming.wordpress.com/2022/07/20/constructing-interactive-ipywidgets-demonstration-using-the-hymod-model/
  3. Robert Rhode, Global Temperature Report for 2023, Berkeley Earth (2024). https://berkeleyearth.org/global-temperature-report-for-2023/
  4. Zeke Hausfather, 2023’s unexpected and unexplained warming, The Climate Brink (2024). https://www.theclimatebrink.com/p/2023s-unexpected-and-unexplained
  5. Hawkins, E., et al., Estimating Changes in Global Temperature since the Preindustrial Period, Bull. Amer. Meteor. Soc., 98, 1841–1856, https://doi.org/10.1175/BAMS-D-16-0007.1
  6. Lafferty, D.C., Sriver, R.L. Downscaling and bias-correction contribute considerable uncertainty to local climate projections in CMIP6. npj Clim Atmos Sci 6, 158 (2023). https://doi.org/10.1038/s41612-023-00486-0

The Thomas-Fiering Model for Synthetic Streamflow Generation with a Python Implementation

In 1962 a group of economists, engineers and political scientists who were involved in the Harvard Water Program published “Design of Water Resource Systems“. In chapter 12 of the book, Thomas and Fiering present the following statistical model which was one of the first, if not the first, formal application of stochastic modelling for synthetic streamflow generation and water resource systems evaluation.

It is an autoregressive model which can simulate monthly streamflow values based on the mean, variance, and correlation of historic observations.

In this blog post, I present the model in it’s original form along with a modified form presented by Stedinger and Taylor (1982). Then, I share a Python implementation of the model which is used to generate an ensemble of synthetic flows. I use plotting tools from the Synthetic Generation Figure Library to plot the results.

All of the code used for this post is available here: ThomasFieringModelDemo

Let’s get into it!

The Thomas-Fiering Model

The model that Thomas and Fiering proposed took the form:

Where, for each month m, Q_m is the generated flow, \bar{Q}_m is the mean historic flow, b_m is an autoregression coefficient for predicting that months flow from the prior months flow, \sigma is the standard deviation, r is the correlation coefficient and \epsilon is a random standard normal variable.

A modification to this model was proposed by Stedinger and Taylor (1982), which transforms transforms the streamflow values before fitting the model. I refer to this as the “Stedinger transformation” below and in the code.

Given Q_{m} as the observed flows in month m, the Stedinger transformation of the observed flows is then:

where \hat{\tau}_m is the estimated “lower bound” for each month, calculated as:

The modeled flows are generated from the recursive relationship:

Where:

  • \mu_{m} is the observed average historic monthly x series
  • \sigma_{m}^2 is the observed variance of the historic monthly x series
  • \epsilon_{m} independent standard-normal random variables
  • \rho_m observed between-month correlations of the historic x series

The above steps are performed for each month, and the synthetic streamflow sequence is generated by iteratively applying the stochastic process for the desired duration.

Python Implementation

I built this version of the Thomas Fiering model as a Python class with the following structure:

class ThomasFieringGenerator():
    def __init__(self, Q, **kwargs):
        
    def preprocessing(self, **kwargs):
	    # Stedinger normalization
	    
    def fit(self, **kwargs):
	    # Calculate mu, sigma, and rho for each month
	    
    def generate(self, n_years, **kwargs):
	    # Iteratively generate a single timeseries
	    # Inverse stedinger normalization
        return Q_synthetic
    
    def generate_ensemble(self, n_years, 
                          n_realizations = 1, 
                          **kwargs):
        # Loop and generate multiple timeseries
        return 

Rather than posting the entire code here, which would clutter the page, I will refer you to and encourage you to check out the full implementation which is in the linked repository here: ThomasFieringModelDemo/model.py

To see how this is used and replicate the results below using some example data, see the Jupyter Notebook: ThomasFieringModelDemo/ThomasFiering_demo.ipynb

Synthetic ensemble results

I used the ThomasFieringGenerator to produce 100 samples of 50-year monthly streamflows at USGS gauge site 01434000 on the Delaware River which has data going back to 1904.

This data is available in the repo and is stored in the file usgs_monthly_streamflow_cms.csv

The plotting functions are taken from the Synthetic Generation Figure Library which was shared previously on the blog.

First we consider the range of historic and synthetic streamflow timeseries:

Generally when working with synthetic ensembles it is good for the distribution of synthetic ensembles “envelope” the historic range while maintaining a similar median. The Thomas Fiering model does a good job at this!

The next figure shows the range of flow-quantile values for both historic and synthetic flows. Again, we see a nice overlapping of the synthetic ensemble:

Conclusions

I personally think it is fun and helpful to look back at the foundational work in a field. Since Thomas and Fiering’s work in the early 1960s, there has been a significant amount of work focused on synthetic hydrology.

The Thomas Fiering model has a nice simplicity while still performing very nicely (with the help of the Stedinger normalization). Sure there are some limitations to the model (e.g., the estimation of distribution and correlation parameters may be inaccurate for short records, and the method does not explicitly prevent the generation of negative streamflows), but the model, and the Harvard Water Program more broadly, was successful in ushering in new approaches for water resource systems analysis.

References

Maass, A., Hufschmidt, M. M., Dorfman, R., Thomas, Jr, H. A., Marglin, S. A., & Fair, G. M. (1962). Design of water-resource systems: New techniques for relating economic objectives, engineering analysis, and governmental planning. Harvard University Press.

Stedinger, J. R., & Taylor, M. R. (1982). Synthetic streamflow generation: 1. Model verification and validation. Water resources research, 18(4), 909-918.

A quick and straightforward introduction to LIME

In this blog post, we will be discussing the many household uses of citrus aurantiifolio, or the common lime.

Just kidding, we’ll be talking about a completely different type of LIME, namely Local Interpretable Model-Agnostic Explanations (at this point you may be thinking that the former’s scientific name becomes the easier of the two to say). After all, this is the WaterProgramming blog.

On a more serious note though, LIME is one of the two widely-known model agnostic explainable AI (xAI) methods, alongside Shapley Additive Explanations (SHAP). This post is intended to be an introduction to LIME in particular, and we will be setting up the motivation for using xAI methods as well as a brief example application using the North Carolina Research Triangle system.

Before we proceed, here’s a quick graphic to get oriented:

The figure above mainly demonstrates three main concepts: Artificial Intelligence (AI), the methods used to achieve AI (one of which includes Machine Learning, or ML), and the methods to explain how such methods achieved their version of AI (explainable AI, or more catchily known as xAI). For more explanation on the different categories of AI, and their methods, please refer to these posts by IBM’s Data Science and AI team and this SpringerLink book by Sarker (2022) respectively.

Model-agnostic vs model-specific

Model-specific methods

As shown in the figure, model-specific xAI methods are techniques that can only be used on the specific model that it was designed for. Here’s a quick rundown of the type of model and their available selection of xAI methods:

  • Decision trees
    • Decision tree visualization (e.g. Classification and Regression Tree (CART) diagrams)
    • Feature importance rankings
  • Neural networks (NN), including deep neural networks
    • Coefficient (feature weight) analysis
    • Neural network neuron/layer activation visualization
    • Attention (input sequence) mechanisms
    • Integrated gradients
    • DeepLIFT
    • GradCAM
  • Reinforcement learning
    • Causal models

Further information on the mathematical formulation for these methods can be found in Holzinger et al (2022). Such methods account for the underlying structure of the model that they are used for, and therefore require some understanding of the model formulation to fully comprehend (e.g. interpreting NN activation visualizations). While less flexible than their agnostic counterparts, model-specific xAI methods provide more granular information on how specific types of information is processed by the model, and therefore how changing input sequences, or model structure, can affect model predictions.

Model-agnostic methods

Model-agnostic xAI (as it’s name suggests) relies solely on analyzing the input-output sets of a model, and therefore can be applied to a wide range of machine learning models regardless of model structure or type. It can be thought of as (very loosely) as sensitivity analysis, but applied to AI methods (for more information on this discussion, please refer to Scholbeck et al. (2023) and Razavi et al. (2021)). SHAP and LIME both fall under this set of methods, and approximately abide by the following process: perturb the input then identify how the output is changed. Note that this set of methods provide little insight as to the specifics of model formulation and how it affects model predictions. Nonetheless, it affords a higher degree of flexibility, and does not bind you to one specific model.

Why does this matter?

Let’s think about this in the context of water resources systems management and planning. Assume you are a water utility responsible for ensuring that you reliably deliver water to 280,000 residents on a daily basis. In addition, you are responsible for planning the next major water supply infrastructure project. Using a machine learning model to inform your short- and long-term management and planning decisions without interrogating how it arrived at its recommendations implicitly assumes that the model will make sensible decisions that balance all stakeholders’ needs while remaining equitable. More often than not, this assumption can be incorrect and may lead to (sometimes funny but mostly) unintentional detrimental cascading implications on the most vulnerable (for some well-narrated examples of how ML went awry, please refer to Brian Christian’s “The Alignment Problem”).

Having said that, using xAI as a next step in the general framework of adopting AI into our decision-making processes can help us better understand why a model makes its predictions, how those predictions came to be, and their degree of usefulness to a decision maker. In this post, I will be demonstrating the use of LIME to answer the following questions.

The next section will establish the three components we will need to apply LIME to our problem:

  1. An input (feature) set and the training dataset
  2. The model predictions
  3. The LIME explainer

A quick example using the North Carolina Research Triangle

The input (feature) set and training dataset

The Research Triangle region in North Carolina consists of six main water utilities that deliver water to their namesake cities: OWASA (Orange County), Durham, Cary, Raleigh, Chatham, and Pittsboro (Gold et al., 2023). All utilities have collaboratively decided that they will each measure their system robustness using a satisficing metric (Starr 1962; Herman et al. 2015), where they satisfy their criteria for robustness if the following criteria are met:

  1. Their reliability meets or exceeds 98%
  2. Their worst-case cost of drought mitigation actions amount to no more than 10% of their annual volumetric revenue
  3. They do not restrict demand more than 20% of the time

If all three criteria are met, then they are considered “successful” (represented as a 1). Otherwise, they have “failed” (represented as a 0). We have 1,000 training data points of success or failure, each representing a state of the world (SOW) in which a utility fails or succeeds to meet their satisficing critieria. This is our training dataset. Each SOW consists of a unique combination of features that include inflow, evaporation and water demand Fourier series coefficients, as well as socioeconomic, policy and infrastructure construction factors. This is our feature set.

Feel free to follow along this portion of the post using the code available in this Google Colab Notebook.

The model prediction

To generate our model prediction, let’s first code up our model. In this example, we will be using Extreme Gradient Boosting, otherwise known as XGBoost (xgboost) as our prediction model, and the LIME package. Let’s first install the both of them:

pip install lime
pip install xgboost

We will also need to import all the needed libraries:

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import lime
import lime.lime_tabular
import xgboost as xgb
from copy import deepcopy

Now let’s set up perform XGBoost! We will first need to upload our needed feature and training datasets:

satisficing = pd.read_csv('satisficing_all_utilites.csv')
du_factors = pd.read_csv('RDM_inputs_final_combined_headers.csv')  

# get all utility and feature names
utility_names = satisficing.columns[1:]
du_factor_names = du_factors.columns

There should be seven utility names (six, plus one that represents the entire region) and thirteen DU factors (or feature names). In this example, we will be focusing only on Pittsboro.

# select the utility 
utility = 'Pittsboro'

# convert to numpy arrays to be passed into xgboost function
du_factors_arr = du_factors.to_numpy()
satisficing_utility_arr = satisficing[utility].values

# initialize the figure object
fig, ax = plt.subplots(1, 1, figsize=(5, 5))  
perform_and_plot_xbg(utility, ax, du_factors_arr, satisficing_utility_arr, du_factor_names)

Note the perform_and_plot_xgb function being used – this function is not shown here for brevity, but you can view the full version of this function in this Google Colab Notebook.

The figure above is called a factor map that shows mid-term (Demand2) and long-term (Demand3) demand growth rates on its x- and y-axes respectively. The green denotes the range of demand growth rates where XGBoost has predicted that Pittsboro will successfully meet its satisficing criteria, and brown is otherwise. Each point is a sample from the original training dataset, where the color (white is 1 – success, and red is 0 – failure) denotes whether Pittsboro actually meets its satisficing criteria. In this case we can see that Pittsboro quickly transitions in failure when its mid- and long-term demand are both higher than expected (indicated by the 1.0-value on both axes).

The LIME explainer

Before we perform LIME, let’s first select an interesting point using the figure above.

In the previous section, we can see that the XGBoost algorithm predicts that Pittsboro’s robustness is affected most by mid-term (Demand2) and long-term (Demand3) demand growth. However, there is a point (indicated using the arrow and the brown circle below) where this prediction did not match the actual data point.

To better understand why then, this specific point was predicted to be a “successful” SOW where the true datapoint had it labeled as a “failure” SOW, let’s take a look at how the XGBoost algorithm made its decision.

First, let’s identify the index of this point:

# select an interesting point 
interesting_point_range_du = du_factors[(du_factors['Demand2'] < 0) & (du_factors['Demand3'] < 0)].index
interesting_point_satisficing = satisficing_utility_arr[interesting_point_range_du]
interesting_point_range_sat = np.where(satisficing_utility_arr == 0)
interesting_point = np.intersect1d(interesting_point_range_du, interesting_point_range_sat)[0]

This will return an index of 704. Next, we’ll run LIME to break down the how this (mis)classification was made:

# instantiate the lime explainer
explainer = lime_tabular.LimeTabularExplainer(du_factors_arr, 
                                              mode='classification', 
                                              feature_names=du_factor_names)

# explain the interesting instance
exp = explainer.explain_instance(du_factors_arr[interesting_point_selected], 
                                 xgb_model.predict_proba, 
                                 num_features=len(du_factor_names))

exp.show_in_notebook(show_table=True)

The following code, if run successfully, should result in the following figure.

Here’s how to interpret it:

  • The prediction probability bars on the furthest left show the model’s prediction. In this case, the XGBoost model classifies this point as a “failure” SOW with 94% confidence.
  • The tornado plot in the middle show the feature contributions. In this case, it shows the degree to which each SOW feature influenced the decision. In this case, the model misclassified the data point as a “success” although it was a failure as our trained model only accounts for the top two overall features that influence the entire dataset to plot the factor map, and did not account for short-term demand growth rates (Demand1) and the permitting time required for constructing the water treatment plant (JLWTP permit).
  • The table on the furthest right is the table of the values of all the features of this specific SOW.

Using LIME has therefore enabled us to identify the cause of XGBoost’s misclassification, allowing us to understand that the model needed information on short-term demand and permitting time to make the correct prediction. From here, it is possible to further dive into the types of SOWs and their specific characteristics that would cause them to be more vulnerable to short-term demand growth and infrastructure permitting time as opposed to mid- and long-term demand growth.

Summary

Okay, so I lied a bit – it wasn’t quite so “brief” after all. Nonetheless, I hope you learned a little about explainable AI, how to use LIME, and how to interpret its outcomes. We also walked through a quick example using the good ol’ Research Triangle case study. Do check out the Google Colab Notebook if you’re interested in how this problem was coded.

With that, thank you for sticking with me – happy learning!

References

Amazon Web Services. (1981). What’s the Difference Between AI and Machine Learning? Machine Learning & AI. https://aws.amazon.com/compare/the-difference-between-artificial-intelligence-and-machine-learning/

Btd. (2024, January 7). Explainable AI (XAI): Model-specific interpretability methods. Medium. https://baotramduong.medium.com/explainable-ai-model-specific-interpretability-methods-02e23ebceac1

Christian, B. (2020). The alignment problem: Machine Learning and human values. Norton & Company.

Gold, D. F., Reed, P. M., Gorelick, D. E., & Characklis, G. W. (2023). Advancing Regional Water Supply Management and infrastructure investment pathways that are equitable, robust, adaptive, and cooperatively stable. Water Resources Research, 59(9). https://doi.org/10.1029/2022wr033671

Herman, J. D., Reed, P. M., Zeff, H. B., & Characklis, G. W. (2015). How should robustness be defined for water systems planning under change? Journal of Water Resources Planning and Management, 141(10). https://doi.org/10.1061/(asce)wr.1943-5452.0000509

Holzinger, A., Saranti, A., Molnar, C., Biecek, P., & Samek, W. (2022). Explainable AI methods – A brief overview. xxAI – Beyond Explainable AI, 13–38. https://doi.org/10.1007/978-3-031-04083-2_2

IBM Data and AI Team. (2023, October 16). Understanding the different types of artificial intelligence. IBM Blog. https://www.ibm.com/blog/understanding-the-different-types-of-artificial-intelligence/

Sarker, I. H. (2022, February 10). AI-based modeling: Techniques, applications and research issues towards automation, intelligent and Smart Systems – SN Computer Science. SpringerLink. https://link.springer.com/article/10.1007/s42979-022-01043-x#Sec6

Scholbeck, C. A., Moosbauer, J., Casalicchio, G., Gupta, H., Bischl, B., & Heumann, C. (2023, December 20). Position paper: Bridging the gap between machine learning and sensitivity analysis. arXiv.org. http://arxiv.org/abs/2312.13234

Starr, M. K. (1963). Product design and decision theory. Prentice-Hall, Inc.

Geocoding Using Google API Key in Python

Introduction

In this post, I will delve into the transformative realm of geocoding, addressing the challenges posed by historical data laden with addresses. Throughout the discussion, I’ll guide you through the intricacies of leveraging the Google API key to seamlessly integrate location information.

Key points include:

  - Preparing your digital workspace for geocoding with essential tools.

  - Obtaining a Google API key to unlock the power of precise coordinates.

  - Applying practical steps to effortlessly transform addresses into valuable spatial insights using Pandas DataFrame.

ArcGIS Developers. (n.d.). ArcGIS API for Python. Retrieved from here.

As we study the realm of data-driven exploration, a significant challenge surfaces—one that personally resonated with me during my recent project. Historical data, often a treasure trove of insights, can be a double-edged sword. The challenge arises when this wealth of information arrives with addresses instead of coordinates, adding a layer of complexity to the analysis. In my own experience, this meant that a substantial portion of the received data was, unfortunately, unusable for my intended analysis without a transformative solution.

This is where geocoding steps in as a crucial ally, reshaping the landscape of historical datasets and making them analysis-ready. The inconsistency in reporting formats and the absence of standardized coordinates pose hurdles that can impede meaningful analysis. Geocoding becomes an indispensable tool, allowing me to bridge this gap, harmonize the data, and unlock its full potential for analysis.

This post focuses on the intricacies of geocoding—a transformative process that transcends mere addresses, providing the geographic insights necessary for a comprehensive understanding of your dataset. Equipped with the Google API key, we’ll delve into the practical steps needed to seamlessly integrate location information. The following sections outline the key stages of this geocoding journey, ensuring your dataset is prepared for advanced analysis.

Preparing your environment

Prepare your digital workspace for this impactful data transformation by ensuring you have all the necessary tools. Seamlessly install GoogleMaps and Pandas using the commands provided below in your terminal.

#Install necessary packages
pip install GoogleMaps
pip install pandas

Formatting your data

In the realm of geocoding, the importance of formatting data before initiating the geocoding process cannot be overstated, especially when leveraging the capabilities of the Google API key. The formatting step is a vital prerequisite for several reasons. Firstly, it ensures consistency and standardization in the structure of addresses, aligning them with the expectations of geocoding services. This consistency allows for more accurate interpretation and processing of addresses. Additionally, formatted addresses provide a clear separation of components such as city, state, and ZIP code, facilitating efficient matching during the geocoding algorithm’s execution. The optimization of API usage is another benefit, as well-formatted addresses contribute to accurate results, minimizing unnecessary costs associated with incorrect or ambiguous requests. By addressing these considerations in the formatting stage, one sets the stage for a more reliable and cost-effective geocoding journey, unlocking the full potential of spatial insights. Now, let’s delve into the practical implementation of this formatting process with the following code, which creates a new column containing the consolidated and formatted addresses within your dataset.

# Create new dataframe just with the address
df['ADDRESS'] = df['CITY'] + ',' + \
                df['STATE'] + ' ' + \
                df['ZIP'].astype(str)

Obtain a Google API Key

  1. Navigate to the Google Cloud console
  2. Under the Project drop-down menu, select the project that you want to work on. If one does not already exist, create a new project 
  3. Under the Library tab on the side menu, search “geocoding api” and ensure that it is enabled
  4. Under the Credentials tab on the side menu, select “Create Credentials” and create a new API key to generate a new key
  5. Your Google API key will then appear on the screen, save this key for future use

Geocoding with Pandas DataFrame

With your environment set up, your data formatted, and your Google API key in hand, it’s time to wield this powerful tool. Dive into the practical realm of geocoding by seamlessly applying the sample code below to your dataset, effortlessly transforming addresses into precise coordinates. Follow the code through each practical step, and witness your location data evolve into a valuable asset, ready for advanced analysis. 

# Import necessary libraries
import googlemaps
import pandas as pd

# Define your API key
# Replace "YOUR_GOOGLE_API_KEY" with your actual Google Maps API key that you saved previously.
gmaps_key = googlemaps.Client(key="YOUR_GOOGLE_API_KEY")

# Extract the last column (ADDRESS) to create a new dataframe (df_address)
df_address = df.iloc[:, -1:]

# Geocoding with Pandas DataFrame
# Define a function to geocode addresses and extract latitude and longitude
def geocode_and_extract(df_address):
    # Add empty columns for latitude and longitude
    df_address['LATITUDE'] = ""
    df_address['LONGITUDE'] = ""
    
    # Loop through each row in the df_address dataframe
    for i in range(len(df_address)):
        address = df_address['ADDRESS'][i]
        # Geocode the address using the Google Maps API
        geocode_result = gmaps_key.geocode(address)
        
        # Check if geocoding was successful
        if geocode_result:
            # Extract latitude and longitude from the geocoding result
            df_address['LATITUDE'][i] = geocode_result[0]['geometry']['location']['lat']
            df_address['LONGITUDE'][i] = geocode_result[0]['geometry']['location']['lng']

# Apply the geocoding function to your dataframe
geocode_and_extract(df_address)

# Print an example to verify correctness
example_address = df_address['ADDRESS'][0]
example_result = gmaps_key.geocode(example_address)
example_lat = example_result[0]["geometry"]["location"]["lat"]
example_long = example_result[0]["geometry"]["location"]["lng"]
print(f'Example Geocoded Address: {example_address}, Latitude: {example_lat}, Longitude: {example_long}')

# Let's join the coordinates with the original dataset
df['LATITUDE'] = df_address['LATITUDE']
df['LONGITUDE'] = df_address['LONGITUDE']

Replace "YOUR_GOOGLE_API_KEY" with your actual Google Maps API key that you saved previously.

Dealing with incomplete data

In the midst of my geocoding journey, I encountered a common challenge: incomplete data. Some addresses in my dataset were detailed with full street information, while others provided only city and state. The beauty of the Google API key revealed itself as it effortlessly transformed both types of data, allowing for a seamless integration of geographic insights.

This flexibility became especially crucial as I navigated through points where only city and state information existed. While it may seem like a potential hurdle, the Google API key’s ability to interpret varying address formats ensured that no data was left behind. For my specific analysis, the key only needed to fall within a designated radius of interest, and the results proved to be accurate enough for my purposes.

In handling diverse data formats, the Google API key stands as your ally, harmonizing and geocoding all pieces of the puzzle, ensuring that each contributes to the rich narrative your dataset is meant to tell.

Conclusion

Your geocoding journey has now equipped you to handle diverse datasets, turning a potential challenge into a transformative opportunity. As you venture forth, may your analyses unveil fresh perspectives and illuminate the hidden stories within your spatial data. The Google API key has been the linchpin throughout this exploration, ensuring that every piece of information contributes to the bigger picture. From handling varying addresses to pinpointing detailed locations, the key’s flexibility has proven invaluable, providing accurate insights.

Congratulations on mastering the art of geocoding—may your data-driven adventures persist in revealing the intricate layers of time, geography, and the untold stories that lie within.

Infrastructure Investment Selection as a Two-Stage Stochastic Programming Problem (Part 2)

In this post, we will continue where we left off from Part 1, in which we set up a two-stage stochastic programming problem to identify the optimal decision for the type of water supply infrastructure to build. As a recap, our simple case study involves a small water utility servicing the residents of Helm’s Keep (population 57,000) to identify the following:

  • The best new water supply infrastructure option to build
  • Its final built capacity
  • The total bond payment to be made
  • The expected daily deliveries that ‘ its final built capacity
  • The expected daily deliveries that it needs to meet

To identify these values, we will be setting up a two-stage stochastic problem using the Python cvxpy library. In this post, we will first review the formulation of the problem, provide information on the installation and use of the cvxpy library, and finally walk through code to identify the optimal solution to the problem.

Reviewing the infrastructure investment selection problem

In our previous post, we identified that Helm’s Keep would like to minimize their infrastructure net present cost (INPC), giving the following objective function:

min  f_{INPC} = \sum_{i=1}^{N}\sum_{s=1}^{S}p_s\sum_{y=1}^{30}\frac{PMT_{s,i}}{(1+d_{s,i})^y}

where

  • N=3 and S = 3 are the total number of infrastructure options and potential future scenarios to consider
  • p_s is the probability of occurrence for scenario s \in S
  • y is one year within the entire bond term T =[1,30]
  • PMT is the total bond payment, or bond principal
  • d_{s,i} is the discount rate in scenario s for infrastructure option i

In achieving this objective, Helm’s Keep also has to abide by the following constraints:

y_1 + y_2 + y_3 = 1

x_i \leq My_i

D_{s,i} \geq \frac{\delta_{s}}{1-MAX(\rho)}

D_{s,i} \leq \sum_{i=1}^{3}\frac{x_i}{1-\rho_i}

D_{s,i} \leq 8

PMT_{s,i} \leq 1.25R_{s,i}

where

  • x_i is the final built capacity of infrastructure option i
  • y_i is a binary [0,1] variable indicating if an infrastructure option is built (1) or not (0)
  • \delta_{s} is the daily demand in scenario s
  • D_{s,i} is the daily water deliveries from infrastructure option i in scenario s
  • \rho_i is the ratio of non-revenue water (NRW) if infrastructure option i is built
  • R_{s,i} is the net revenue from fulfilling demand (after accounting for NRW) using infrastructure option i in scenario s

For the full formulations of PMT and R, please refer to Part 1 of this tutorial.

In this problem, we our first-stage decision variables are the infrastructure option y_i to build and its final built capacity x_i. Our second-stage decision variables are the daily water deliveries made D_{s,i} in each scenario s.

The CVXPY Python library

To solve this problem, we will be using Python’s cvxpy library for convex optimization. It is one of the many solvers available for performing convex optimization in Python including Pyomo, as demonstrated in Trevor’s earlier blog post. Some other options options include PuLP, scipy.optimize, and Gurobi. For the purposes of our specific application, we will be using cvxpy as it can interpret lists and dictionaries of variables, constraints, or objectives. This allows direct appending and retrieval of said objects. This feature therefore makes it convenient to use with for-loops, which are useful in the case of two- or multi-stage stochastic problems where the decision variable space can exponentially grow with the number of scenarios or options being considered. You can find an introduction to cvxpy, documentation, and examples at the CVXPY website.

If you use pip, you can install cvxpy using the following:

pip install cvxpy

To install specific solver names, you can alternatively install cvxpy using

pip install cvxpy[CBC,CVXOPT,GLOP,GLPK,GUROBI,MOSEK,PDLP,SCIP,XPRESS]

where you can substitute the names of any convex optimization solver in the square brackets.

If you use conda, install cvxpy using

conda create --name cvxpy_env
conda activate cvxpy_env
conda install -c conda-forge cvxpy

Once you’ve installed cvxpy, you’re ready to follow along the next part!

Solving the two-stage stochastic programming problem

First, we import the cvxpy library into our file and define the constants and helper functions to calculate the values of PMT and R

import cvxpy as cp

# define constants
households = 57000  # number of households
T = 30  # bond term (years) for an infrastructure option 
UR = 0.06 # the uniform water rate per MGD

def calculate_pmt(br, C_i, y_i, VCC_i, x_i, T=30):
    """Calculate the annual payment for a loan.

    Args:
        br (float): The annual interest rate.
        C_i (float): capital cost of infrastructure option i 
        y_i (boolean): 1 if option i is built, 0 if not
        VCC_i (float): variable capital cost of infrastructure option i
        x_i (float): final built capacity of infrastructure option i
        T (const): T=30 years, the bond term in years.

    Returns:
        float: The annual payment amount.
    """
    # calculate p_i, the bond principal (total value of bond borrowed) for
    # infrastructure option i
    p_i = (C_i*y_i) + (VCC_i * x_i)

    # Convert the annual interest rate to a monthly rate.
    pmt = p_i*(br*(1+br)**T)/(((1+br)**T)-1)

    # Calculate the monthly payment.
    return pmt

def calculate_R(D_i, rho_i, VOC_i, households=57000, UR=0.06):
    """
    Calculates the potential net revenue from infrastructure option i in a given scenario.

    Args:
        D_i (float): per-capita daily water demand in MGD
        rho_i (float): percentage of water lost during transmission (non-revenue water, NRW) from infrastructure i to the city
        VOC_i (float): variable operating cost of i
        households (const): households=57000 the number of households in the city
        UR (const): UR=0.06/MGD the per-household uniform water rate

    Returns:
        R_i (float): The potential net revenue from infrastructure option i in a given scenario.
    """
    OC_i = VOC_i * (D_i/rho_i)
    R_i = ((D_i * UR * households)*(1-rho_i)) - OC_i
    return R_i

Then, we define all the variables required. We will first define the first-stage decision variables:

# Infrastructure options as boolean (1, 0) variables
y1 = cp.Variable(boolean = True, name='y1')
y2 = cp.Variable(boolean = True, name='y2')
y3 = cp.Variable(boolean = True, name='y3')

# capacity of each infrastructure option
x1 = cp.Variable(nonneg=True, name = 'x1')
x2 = cp.Variable(nonneg=True, name = 'x2')
x3 = cp.Variable(nonneg=True, name = 'x3')

# infrastructure option parameters
C = [15.6, 11.9, 13.9] # capital cost of each infrastructure option in $mil
VCC = [7.5, 4.7, 5.1] # variable capital cost of each infrastructure option in $mil/MGD capacity
VOC = [0.2, 0.5, 0.9] # variable operating cost of each infrastructure option in $mil/MGD
NRW = [0.2, 0.1, 0.12] # non-revenue water (NRW) for each infrastructure option

Next, we define the second-stage decision variable and the parameter values related to each potential scenario:

# volume of water delivered to the city in each scenario
D = {}
for i in range(3):
    D[i] = cp.Variable(nonneg=True)

P = [0.35, 0.41, 0.24]  # probability of each scenario
demand_increase = [4.4, 5.2, 3.9]  # per-capita daily demand increase in MGD
bond_rate = [0.043, 0.026, 0.052]  # bond rate in each scenario
discount_rate = [0.031, 0.026, 0.025]  # discount rate in each scenario

Note the for loop in the Lines 2-4 of the code snippet above. The cvxpy enables variables to be added to and accessed via a dictionary that allows access via both explicit and in-line for loop, as we will see below in the objective function code definition:

min_inpc = sum(P[s]*sum((calculate_pmt(bond_rate[s], C[0], y1, VCC[0], x1)/((1+discount_rate[s])**t)) for t in range(1,T+1)B) for s in range(3)) + \
    sum(P[s]*sum((calculate_pmt(bond_rate[s], C[0], y2, VCC[1], x2)/((1+discount_rate[s])**t)) for t in range(1,T+1)) for s in range(3)) + \
    sum(P[s]*sum((calculate_pmt(bond_rate[s], C[2], y3, VCC[2], x3)/((1+discount_rate[s])**t)) for t in range(1,T+1)) for s in range(3))

Some explanation is required here. Our goal is to find the minimum INPC required to build the supply required to meet potential demand growth. Our objective function formulation therefore is the sum of the INPC of all three potential infrastructure options, each calculated across the three scenarios. As the y_i variable is binary, only the sum across the three scenarios that requires the optimal infrastructure option will be chosen.

To constrain the solution space of our objective function, we define our constraints. Below, we can see the application of the ability of the cvxpy library that allows constraints to be added iteratively to a list:

constraints = []

# set an arbitrarily large value for M
M = 10e12

for s in range(3):
    constraints += [D[s] >= demand_increase[s]]   # daily water deliveries must be more than or equal to daily demand increase
    constraints += [D[s] <= ((x1/0.1) + (x2/0.1) + (x2/0.12))/1.2]   
    constraints += [calculate_pmt(bond_rate[s], C[0], y1, VCC[0], x1) <= 1.25*calculate_R(demand_increase[s], NRW[0], VOC[0])]
    constraints += [calculate_pmt(bond_rate[s], C[1], y2, VCC[1], x2) <= 1.25*calculate_R(demand_increase[s], NRW[1], VOC[1])]
    constraints += [calculate_pmt(bond_rate[s], C[2], y3, VCC[2], x3) <= 1.25*calculate_R(demand_increase[s], NRW[2], VOC[2])]

constraints += [y1 + y2 + y3 == 1]
constraints += [x1 <= M * y1]
constraints += [x2 <= M * y2]
constraints += [x3 <= M * y3]

Finally, we solve the problem using the Gurobi solver. This solver is selected as it comes pre-installed with the cvxpy library and does not require additional steps or licensing during installation. We also print the objective value and the solutions to the problem:

# set up the problem as a minimization
problem = cp.Problem(cp.Minimize(min_inpc), constraints)

# solve using Gurobi
problem.solve(solver=cp.GUROBI, verbose=False)

print(f'Optimal INPC: ${problem.value} mil' )
for variable in problem.variables():
  print(f"{variable.name()} = {variable.value}")

Obtaining the solutions

If you have closely followed the steps shown above, you would have identified that Helm’s Keep should build Infrastructure Option 3 (a new groundwater pumping station), to a total built capacity that allows total expected daily deliveries of 3.27MGD. This will result in a final INPC of USD$35.62 million. There are our first-stage decision variables.

In each scenario, the following daily deliveries (second-stage decision variables) should be expected:

ScenarioScenario probability (%)Demand increase (MGD)Daily deliveries (MGD)
1354.45.5
2415.26.5
3243.94.875

The values from the second and third column can be found in Part 1 of this tutorial. The final daily deliveries account for the maximum possible portion of NRW.

Let’s identify how much Helm’s Keep will require to pay in total annual bond payments and how much their future expected daily deliveries will be:

total_bond_payment = sum(P[s]*calculate_pmt(bond_rate[s], C[1], 1, VCC[1], x2.value) for s in range(3))
expected_daily_deliveries = sum(P[s]*D[s].value for s in range(3))

If you have closely followed the steps shown above, you would have obtained the following values:

Total annual bond paymentUSD$1.55 million
Expected total daily water deliveries5.76 MGD

Conclusion

Congratulations – you just solved a two-stage programming stochastic programming problem! In this post, we reviewed the content of Part 1, and provided a quick introduction to the cvxpy Python library and justified its use for the purpose of this test case. We also walked through the steps required to solve this problem in Python, identified that it should build a new groundwater pumping station with a 3.27MGD capacity. We also identified the total annual amount Helm’s Keep would have to pay annually to fulfill its debt service requirements, and how much it water would, on average, be expected to provide for its constituents.

I hope that both Parts 1 and 2 provided you with some information on what stochastic programming is, when to use it, and some methods that might be useful to approach it. Thank you for sticking around and happy learning!

An introduction to linear programming for reservoir operations (part 2) – Implementation with Pyomo

Introduction

Previously, in Part 1 I used a very simple reservoir operations scenario to demonstrate some linear programming (LP) concepts.

After some feedback on my initial formulation I went back and revised the formulation to make sure that (1) both reservoir releases and storage levels are optimized simultaneously and (2) the LP handles decisions over multiple timesteps (1,…,N) during optimization. Definitely look back at Part 1 for more context.

The current LP formulation is as follows:

In this post, I show a simple implementation of this LP using the Pyomo package for solving optimization problems in Python.

I have shared the code used in this demo in a repository here: Reservoir-LP-Demo


Constructing the LP model with Pyomo

While Pyomo can help us construct the LP model, you will need access to a separate solver software in order to actually run the optimization. I don’t get into the details here on how to set up these solvers (see their specific installation instructions), but generally you will need this solver to be accessible on you PATH.

Two solvers that I have had good experience with are:

As always, it’s best to consult the Pyomo documentation for any questions you might have. Here, I highlight a few things that are needed for our implementation.

We start by importing the pyomo.environ module:

import pyomo.environ as pyo

From this module we will need to use the following classes to help build our model:

  • pyo.ConcreteModel()
  • pyo.RangeSet()
  • pyo.Var()
  • pyo.Objective()
  • pyo.Constraint()
  • pyo.SolverFactory()

The nice thing about using pyomo rather than trying to manage the LP matrices yourself is that you can specify objectives and constraints as functions.

For example, the objective function is defined as:

# Objective Function
def objective_rule(m):
    return -sum((eta_0 * m.R[t]) + (m.S[t]/S_max*100) for t in m.T)

And a constraint used to enforce the lower limit of the storage mass balance can defined as:

def S_balance_lower(m, t):
    if t == 0:
        return m.S[t] + m.R[t] <= initial_storage + I[t] - D[t]
    return m.S[t] + m.R[t] <= m.S[t-1] + I[t] - D[t]

Rather than picking the full implementation apart, I present the entire function below, and encourage you to compare the code implementation with the problem definition above.

def pyomo_lp_reservoir(N, S_min, S_max, R_min, R_max, 
                       eta_0, I, D,  
                       initial_storage):

    # Model
    model = pyo.ConcreteModel()

    # Time range
    model.T = pyo.RangeSet(0, N-1)  

    # Decision Variables
    model.S = pyo.Var(model.T, bounds=(S_min, S_max))  # Storage
    model.R = pyo.Var(model.T, bounds=(R_min, R_max))  # Release

    # Objective Function
    def objective_rule(m):
        return -sum((eta_0 * m.R[t]) + (m.S[t]/S_max*100) for t in m.T)
    model.objective = pyo.Objective(rule=objective_rule, sense=pyo.minimize)

    # Constraints
    def S_balance_lower(m, t):
        if t == 0:
            return m.S[t] + m.R[t] <= initial_storage + I[t] - D[t]
        return m.S[t] + m.R[t] <= m.S[t-1] + I[t] - D[t]

    def S_balance_upper(m, t):
        if t == 0:
            return -(m.S[t] + m.R[t]) <= -(initial_storage + I[t] - D[t])
        return -(m.S[t] + m.R[t]) <= -(m.S[t-1] + I[t] - D[t])
    model.S_lower = pyo.Constraint(model.T, rule=S_balance_lower)
    model.S_upper = pyo.Constraint(model.T, rule=S_balance_upper)
    model.S_final = pyo.Constraint(expr=model.S[N-1] == initial_storage)

    # Solve
    solver = pyo.SolverFactory('scip')
    results = solver.solve(model)

    if results.solver.status == pyo.SolverStatus.ok:
        S_opt = np.array([pyo.value(model.S[t]) for t in model.T])
        R_opt = np.array([pyo.value(model.R[t]) for t in model.T])
        return S_opt, R_opt
    else:        
        raise ValueError('Not able to solve.')

Note that in this implementation, pyomo will optimize all of the reservoir release and storage decisions simultaneously, returning the vectors of length N which prescribe the next N days of operations.

Usage

Now we are ready to use our LP reservoir simulator. In the code block below, I set some specifications for our operational constraints, generate fake inflow and demand timeseries, run the LP solver, and plot the simulated results:

# spcifications
N_t = 30
S_min = 2500
S_max = 5000
R_min = 10
R_max = 1000
eta = 1.2

# Generate simple inflow and demand data
I, D = generate_data(N_t, correlation_factor = -0.70,
                     inflow_mean=500, inflow_std=100,
                     lag_correlation=0.2)


# Run LP operation simulation
S_sim, R_sim = pyomo_lp_reservoir(N_t, S_min, S_max, R_min, R_max, eta, I, D, 
                                  initial_storage=S_max)

# Plot results
plot_simulated_reservoir(I, D,
                         R_sim, S_sim, 
                         S_max, eta=eta)

The operations are shown:

Under this LP formulation, with a perfect inflow forecast, the reservoir operates as a “run of river” with the release rates being very close to the inflow rate.

In practice, operators may need to limit the difference in release volume from day-to-day. I added an optional parameter (R_change_limit) which adds a constraint on the difference subsequent releases from between each day.

The operations, with the daily release change rate limited to 50 is shown below.

Conclusions

The way you formulate the an LP problem dictates the “optimal” decisions that you will generate. The purpose of these past two posts was to make some attempt at giving a practice demo of some basic LP concepts. I hope for some that it might be useful as a starting point.

Runtime Visualization of MOEA with Platypus in Python

The Platypus framework provides us with a Python library to solve and analyze multi-objective problems conveniently. In this notebook, we use Platypus to solve a multi-objective optimization problem – the 3 objective version of DTLZ2 (Deb et al., 2002a) – using the evolutionary algorithm NSGA-II (Deb et al., 2002b). We generate runtime visualizations for snapshots of the algorithm to help build an understanding of how the evolutionary algorithm works. Specifically, we are going to look at the population at each generation, their ranks (a key parameter NSGA-II uses to select offspring), the parallel coordinates plot of the current population, and runtime indicators such as hypervolume and generational distance.

Setting up the problem

I created a Google Colab for this post. Since this post is equivalent to the Google Colab file, you may choose to look at either one. This project is intended to be used in a Jupyter Notebook environment with Python.

First, we import the Python libraries we need for the visualization.

import numpy as np
import math
import pandas as pd
from pandas.plotting import parallel_coordinates
import random
from tqdm.notebook import tqdm

import matplotlib.pyplot as plt
from matplotlib import animation, rc, rcParams
rc('animation', html='jshtml')

from mpl_toolkits.mplot3d import Axes3D

!pip install platypus-opt
from platypus import (NSGAII, NSGAIII, DTLZ2, Hypervolume, EpsilonBoxArchive,
                      Solution, GenerationalDistance, InvertedGenerationalDistance,
                      Hypervolume, EpsilonIndicator, Spacing)

Our goal is to visualize the population at each generation of the evolutionary algorithm. To do so, we utilize an interface provided by the Platypus library: the callback function.

At each iteration of the algorithm, the callback function (if initialized) is called. We define our callback function to store the current number of function evaluations (algorithm.nfe) and all the data points in the current population (algorithm.result). Each population has the type Solution defined in Platypus, which not only contains the values of the variables but also attributes specific to the solver, such as ranks and crowding distances. We will access some of these attributes during visualization.

We also define the frequency of NFEs that we want to store the values. Saving a snapshot of the algorithm at every single iteration may be too expensive or unnecessary. Hence, we set a lapse for some number of iterations, in which the callback function does nothing unless the last NFE is more than the frequency amount apart from this NFE.

In the following example, we set up the problem DTLZ2 and use the callback function to solve it using NSGAII.

#define the frequency
frequency = 200

# define the problem definition
problem = DTLZ2(3)

# instantiate the optimization algorithm
algorithm = NSGAII(problem, divisions_outer=12)

# define callback function
solutions_list = []
hyp = []
nfe = []
last_calc = 0

def DTLZ2_callback(algorithm):
  global last_calc
  if algorithm.nfe == last_calc + frequency:
    last_calc = algorithm.nfe
    nfe.append(algorithm.nfe)
    solutions_list.append(algorithm.result);

# optimize the problem using 10,000 function evaluations
algorithm.run(10000,DTLZ2_callback)

In order to calculate the metrics, we first need to define our reference set. For DTLZ2(3), our points lie on the unit sphere in the first octant. Let’s randomly select 1000 such points and plot them.

# generate the reference set for 3D DTLZ2
reference_set = EpsilonBoxArchive([0.02, 0.02, 0.02])

for _ in range(1000):
    solution = Solution(problem)
    solution.variables = [random.uniform(0,1) if i < problem.nobjs-1 else 0.5 for i in range(problem.nvars)]
    solution.evaluate()
    reference_set.add(solution)

fig_ref = plt.figure()
ax_ref = fig_ref.add_subplot(projection="3d")
ax_ref.scatter(
              [s.objectives[0] for s in reference_set],
              [s.objectives[1] for s in reference_set],
              [s.objectives[2] for s in reference_set],
              )

Given the reference set, we can now calculate the indicators for each population across iterations. The Platypus library provides us with all the functions we need to calculate the following indicators: generational distance, hypervolume, epsilon indicator, and spacing.

We initialize them in a dictionary and iterate over all saved populations to calculate the indicators.

# calculate the indicators
indicators = {"gd" : GenerationalDistance(reference_set),
              "hyp" : Hypervolume(reference_set),
              "ei" : EpsilonIndicator(reference_set),
              "sp" : Spacing()}

indicator_results = {index : [] for index in indicators}

for indicator in tqdm(indicator_results):
  for solution in tqdm(solutions_list):
    indicator_results[indicator] += [indicators[indicator].calculate(solution)]

Setting up the visualization

At this point, we have the data we need to perform runtime visualizations. We will utilize the animation.FuncAnimation function in matplotlib to create interactive animations in Jupyter Notebook (or Google Colab). The idea behind creating such animations is to first initialize a static figure, and then define an update function to let the FuncAnimation know how to visualize new data for each iteration.

We define drawframe() that does the following: (1) clear the axis, so that previous data points are wiped out; (2) draw the new data points; (3) reset the limits of data axes so that the axes are consistent across frames; (4) update new data for indicator axes.

def drawframe(n):

    # clear axes
    ax.cla()
    ax_parallel.cla()

    # save results
    result = solutions_list[n]
    crowding_distances = [s.crowding_distance if s.crowding_distance != math.inf else 0 for s in result]
    ranks = [s.rank for s in result]

    result = solutions_list[n]
    points = {
              'X': [s.objectives[0] for s in result],
              'Y': [s.objectives[1] for s in result],
              'Z': [s.objectives[2] for s in result],
              'rank': [s.rank for s in result],
              'tag' : ['tag' for s in result]
    }
    df = pd.DataFrame(points)

    # update new data points
    ax.scatter(points['X'], points['Y'], points['Z'],
              c = ranks,
              alpha = 0.5,
              linestyle="", marker="o",
              cmap=cmap,
              vmax=max_rank,
              vmin=0)
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
    ax.set_zlim(zlim)
    ax.set_title('Solve DTLZ2, NFE = ' + str(nfe[n]))

    # update the parallel coordinates plot
    parallel_coordinates(df[['X','Y','Z','tag']], 'tag', ax=ax_parallel)

    # update indicator plots
    for indicator in indicator_axes:
      indicator_ax = indicator_axes[indicator]
      indicator_ax.plot(nfe[:n],indicator_results[indicator][:n], c = 'r')
      indicator_ax.set_xlim(left = min(nfe), right=max(nfe))
      indicator_ax.set_ylim(bottom = min(indicator_results[indicator]), top = max(indicator_results[indicator]))

With the drawframe() function, we create a static figure that initializes the axes we will feed into the drawframe() function. The initialization does the following: (1) set up the subplots of the figure; (2) calculate the maximum ranks of all points in all populations to determine the color mapping; (3) load the points from the first iteration; (4) initialize the scatter plots, parallel coordinates plot, and indicator plots.

fig = plt.figure(figsize=(20,10), dpi = 70)
fig.suptitle("Runtime Visualization", fontsize=20)
fig.subplots_adjust(wspace=0.3, hspace=0.3)
ax = fig.add_subplot(2, 3, 1, projection="3d")
ax_parallel = fig.add_subplot(2,3,4)

indicator_axes = {"gd" : fig.add_subplot(2, 3, 2),
                  "hyp" : fig.add_subplot(2, 3, 3),
                  "ei" : fig.add_subplot(2, 3, 5),
                  "sp" : fig.add_subplot(2, 3, 6)}

indicator_names = {"gd" : "Generational Distance",
                  "hyp" : "Hypervolume",
                  "ei" : "Epsilon Indicator",
                  "sp" : "Spacing"}

# load the ranks of all points
all_rank = [s.rank for result in solutions_list for s in result]
max_rank = max(all_rank)

# define the colormap
cmap = plt.get_cmap('Accent', max_rank)

# load the points from the first iteration
result = solutions_list[0]
points = {
    'X': [s.objectives[0] for s in result],
    'Y': [s.objectives[1] for s in result],
    'Z': [s.objectives[2] for s in result],
    'rank': [s.rank for s in result],
    'tag' : ['tag' for s in result]
}

df = pd.DataFrame(points)

# create the scatter plot
graph = ax.scatter(
              points['X'], points['Y'], points['Z'],
              c = points['rank'],
              alpha = 0.5,
              cmap=cmap,
              linestyle="", marker="o",
              vmax=max_rank,
              vmin=0
          )

# create the parallel coordinates plot
parallel_coordinates(df[['X','Y','Z','tag']], 'tag', ax=ax_parallel)

plt.colorbar(graph, label='Rank', pad = 0.2)

# save the dimensions for later use
xlim = ax.get_xlim()
ylim = ax.get_ylim()
zlim = ax.get_zlim()
title = ax.set_title("DTLZ2 Static Figure")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")

# initialize subplots for each indicator
for indicator in indicator_axes:
  indicator_axes[indicator].plot(nfe[:0],indicator_results[indicator][:0])
  indicator_axes[indicator].set_title(indicator_names[indicator] + " vs NFE")
  indicator_axes[indicator].set_xlabel("NFE")
  indicator_axes[indicator].set_ylabel(indicator_names[indicator])

Now we are ready to create an animation. We use the FuncAnimation() function, the initialized figure, and the drawframe() function to create the animation.

In the Jupyter Notebook or Google Colab, you will be able to play the animation using the play button at the bottom of the image. By default, the animation will play in a loop. You may select “once” so that the animation freezes in the last frame. Important: Before re-generating the animation, be sure to re-run the previous initialization so that the figure is reset. Otherwise, you may see overlapping points/lines.

ani = animation.FuncAnimation(fig, drawframe, frames=len(nfe), interval=20, blit=False)
ani

This visualization shows how the algorithm progresses as NFE grows. For the set of solutions, we clearly see how they converge to the reference set. Moreover, more and more points have lower ranks, indicating they are getting closer to the Pareto Front (points on the Pareto Front have rank = 0). The parallel coordinates plot shows how our solutions get narrowed down and the tradeoffs we could make. Finally, the four indicator plots track the performance of our algorithm as NFE increases. The trajectory of generational distance, hypervolume, and epsilon indicator suggests convergence.

In conclusion, the project highlights the potential of the Platypus library in Python in providing valuable insights into the progress of evolutionary algorithms, not just their final outcomes. Through the use of NSGA-II as an illustrative example, we have demonstrated the ability to monitor the ranks of points across generations. In Dave’s post, the runtime visualizations revealed the changing probabilities of variators across iterations. These findings emphasize the power of incorporating dynamic techniques to gain a comprehensive understanding of the runtime behavior of MOEA algorithms. I hope this project opens doors to further explore, visualize, and analyze the dynamics of evolutionary algorithms.

References

[1] Deb, K., Thiele, L., Laumanns, M., & Zitzler, E. (2002a). Scalable multi-objective optimization test problems. In Proceedings of the 2002 Congress on Evolutionary Computation. CEC’02 (Cat. No. 02TH8600) (Vol. 1, pp. 825-830). IEEE.

[2] Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. A. M. T. (2002b). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE transactions on evolutionary computation, 6(2), 182-197.

[3] Gold, D. (2020, May 6). Beyond Hypervolume: Dynamic visualization of MOEA runtime. Water Programming: A Collaborative Research Blog. https://waterprogramming.wordpress.com/2020/05/06/beyond-hypervolume-dynamic-visualization-of-moea-runtime/

[4] Python – How to create 3D scatter animations – Stack Overflow https://stackoverflow.com/questions/41602588/how-to-create-3d-scatter-animations

[5] GitHub – Project-Platypus/Platypus: A Free and Open Source Python Library for Multiobjective Optimization https://github.com/Project-Platypus/Platypus

How ChatGPT Helped Me To Convert a MATLAB Toolbox to Python and Learn Python Coding

This post is a guest post by Veysel Yildiz from the University of Sheffield. Veysel is a third-year PhD student in Civil Engineering working with Dr. Charles Rouge (website). Veysel’s research contributes to scientific and technical advances in hydropower plant design. If you are interested in contributing a guest post to this blog, please contact Lillian Lau (lbl59@cornell.edu).

Like many engineers with a graduate education, I learned MATLAB first before realising Python could be a better language for code I want to make freely available.  But contrary to most, I had already written a 2500-line piece of software in MATLAB before realising other programming languages would be better. My software, the HYPER toolbox is the first to simultaneously optimise plant capacity and turbine design parameters for new run-of-river hydropower plants (for more information, please follow this link). In my PhD, I have been improving this toolbox to also evaluate the financial robustness of hydropower plant design in a changing world, i.e., is a plant still a sensible investment when climate and socio-economic conditions are different? This is a key question in countries such as Turkey, where I am from, that are getting drier. I have also been making my toolbox faster through algorithmic improvements, and want to release this fast version in a license-free language.

While I was proficient in MATLAB, my knowledge of Python was limited. Fortunately, with the help of ChatGPT, I was able to successfully navigate the transition from MATLAB to Python, both for my toolbox and for myself. Indeed, I was not only able to convert the code but also gain valuable insights into Python programming. In this blog post, I will share my ChatGPT experience, its benefits, and offer some practical tips for using ChatGPT effectively in Python coding.

The challenge of transitioning from MATLAB to Python

MATLAB is a sophisticated numerical computing environment known for its ease of use and extensive libraries for a wide range of scientific and engineering applications. Python, on the other hand, has grown in popularity due to its adaptability, open-source nature, and  diverse set of libraries and frameworks.  Given the changing environment of software development and user demands, it was vital to make the switch from MATLAB to Python. The main challenges I faced during this transition are:

  • Syntax Differences: MATLAB and Python have different syntax and conventions, making it challenging to convert code directly.
  • Lack of Proficiency: I was not proficient in Python, which added to the complexity of the task.
  • Code Size: With 2500 lines of code, the change could take several months.
  • Performance: Python can be slower than MATLAB for certain tasks such as matrix computation.

How ChatGPT accelerated the process

ChatGPT,  is a flexible large language model that can answer queries, generate code samples, and explain concepts in a variety of programming languages, including Python.

Here’s how ChatGPT helped me through this process:

Code Conversion Guidance

I started asking for basic functions including how to load/read a dataset. Then, I asked for guidance on how to convert specific functions and syntax from MATLAB to Python. ChatGPT provided clear explanations and Python code examples, helping me understand the nuances of Python syntax. For example:

ChatGPT guided me through the conversion process step by step. It assisted me in identifying and addressing syntactic differences, data structure changes, and library replacements needed to make the code run in Python. The conversion was not a one-time event. It involved continuous iterations and refinements. After implementing ChatGPT’s suggestions for one part of the code, I would return with additional questions as I progressed to the next steps. This iterative approach allowed me to build a solid foundation in Python while converting the MATLAB code.  Here is an example:

Debugging and Error Handling

During the conversion process, I encountered errors and unexpected behaviour in the Python code. Whenever it happened, I input the error messages to ChatGPT to identify the causes of the errors. ChatGPT described Python’s error messages and traceback information, allowing me to handle difficulties on my own. In other words, it assisted in debugging by offering tricks for error detection and correction. Here is an example:

ValueError: Unable to parse string “EKİM” at position 0

Code Optimization

When I successfully create a script that runs, I often ask ChatGPT to suggest methods to optimize the Python code for performance and readability.  Here is an example:

Learning Python Concepts

As I progressed in converting the code with the assistance of ChatGPT, I discovered that the journey was not just about code translation; it was an invaluable opportunity to learn essential Python concepts and best practices.  ChatGPT was quite beneficial in taking me through this learning process, providing explanations on Python’s data types, control structures, and how to use libraries efficiently.

A Word of Caution

ChatGPT is a powerful AI language model, but it, like any other, has limits. It might sometimes offer inaccurate or unsatisfactory code suggestions or descriptions. As a user, you should take caution and systematically validate the information given by ChatGPT. Here is an example:

Several Key Takeaway Messages of the Journey

  • Use clear instructions:  When interacting with ChatGPT, provide clear and concise instructions. Describe the problem you’re trying to solve, the desired outcome, and any constraints or requirements. The more specific you are, the better the model’s response is likely to be.
  • Understand its limitations: ChatGPT may not always produce perfect or optimal solutions. It is important to evaluate and verify the code created to ensure it meets your requirements.
  • Review and validate generated code: Carefully review the code generated by ChatGPT. Check for  accuracy, readability, and efficiency. Debug any issues that arise, and test the code with different inputs to ensure it works as expected.
  • Iterative refinement: If the initial code generated by ChatGPT doesn’t meet your requirements, provide feedback and iterate. Ask for specific improvements or clarifications, and guide the model towards a better solution.

Conclusion

Through the guidance and support of ChatGPT, I successfully converted 2500 lines of MATLAB code to Python in 10 days. This accomplishment was a significant milestone in my programming journey. Along the way, I gained proficiency in Python, a skill that opened up new opportunities in data science, and more.  This experience demonstrates the power of AI-driven tools in supporting programmers with challenging tasks while supporting continuous learning.

If you find yourself in a similar transition or coding challenge, try using AI-powered language models like ChatGPT to assist facilitate your journey and enrich your programming skills. It is not just about solving issues, but also about learning and growing as a programmer in a dynamic technological environment.

ChatGPT is a powerful tool to assist you with coding tasks, but it does not replace developing your own coding skills and actively engage in the learning process.