May 15, 2024 by Trevor Amestoy

Sankey diagrams for USGS gauge data in python(?)

This post was inspired by the Sankey diagram in Figure 1 of this pre-print led by Dave Gold: “Exploring the Spatially Compounding Multi-sectoral Drought Vulnerabilities in Colorado’s West Slope River Basins” (Gold, Reed & Gupta, In Review) which features a Sankey diagram of flow contributions to Lake Powell. I like the figure, and thought I’d make an effort to produce similar diagrams using USGS gauge data.

Sankey diagrams show data flows between different source and target destinations. Lot’s of people use them to visualize their personal/business cashflows. It’s an obvious visualization option for streamflows.

To explain the “(?)” in my title: When I started this, I’d realized quickly that I to choose one of two popular plotting packages: matplotlib or plotly.

I am a frequent matplotlib user and definitely appreciate the level of control in the figure generation process. However, it can sometimes be more time- and line-intensive designing highly customized figures using matplotlib. On the other hand, in my experience, plotly tools can often produce appealing graphics with less code. I am also drawn to the fact that the plotly graphics are interactive objects rather than static figures.

I decided to go with plotly to try something new. If you want to hear my complaints and thoughts on use context, you can skip to the conclusions below.

In the sections below, I provide some code which will:

Define a network of USGS gauge stations to include in the plot
Retrieve data from USGS gauge stations
Create a Sankey diagram using plotly showing streamflows across the network

Here, I focus on the Rio Grande river upstream of Albuquerque, NM. However you can plot a different streamflow network by modifying the dictionary of upstream nodes defining the network.

Plotting a Sankey streamflow network with plotly

The code used here requires both plotly and the pygeohydro package (for USGS data retrieval).

from pygeohydro import NWIS
import plotly.graph_objects as go

With that out of the way, we can get started.

Defining the flow network & data retrieval

I start by defining a dictionary called upstream_stations which defines the relationships between different gauges of interest.

This dictionary contains pairs of the form: {"GAUGE_ID" : ["LIST_OF", "UPSTREAM", "GAUGE_IDs"]}

If there is no upstream site, then include an empty list. For the Rio Grande network, this looks like:

# define relationships between each gauge and upstream sites
upstream_stations = {
    '08329918' : ['08319000', '08328950'], 
    '08319000' : ['08317400', '08317200'],
    '08328950' : [],
    '08317200' : [],
    '08317400' : ['08313000'],
    '08313000' : ['08290000', '08279500'],
    '08287000' : [],
    '08279500' : [],
    '08290000' : ['08287000', '08289000'],
    '08289000' : [],
}

# Get list of all stations from upstream_stations
all_stations = list(upstream_stations.keys())
for station, upstream in upstream_stations.items():
    all_stations += upstream
all_stations = list(set(all_stations))

Notice that I also made a list containing all the stations IDs. I use the pygeohydro package from the HyRiver suite of tools to get retrieve the gauge station data (Chegini, Li, & Leung, 2021). I often cite this package, and have written about it in a past post (“Efficient hydroclimatic data accessing with HyRiver for Python”).

Using the list of all_stations, I use the following code to pull daily streamflow data for each site from 2015-2020 (or some other specified dates):

def get_usgs_gauge_data(stations, dates):
    """
    Get streamflow data from USGS gauge stations using NWIS.
    """
    nwis = NWIS()
    df = nwis.get_streamflow(stations, dates, mmd=False)
    
    # get rid of USGS- in columns
    df.columns = df.columns.str.replace('USGS-', '')
    return df

# Get USGS flows
flows = get_usgs_gauge_data(all_stations, ('2015-01-01', '2020-12-31'))

For the Sankey diagram, we need a single flow value for each station. In this case I calculate an average of the annual total flows at each station:

# Get annual mean flows
agg_flows = flows.resample('Y').sum().agg('mean')

Creating the Sankey figure

At it’s core, a Sankey diagram is a visualization of a weighted network (also referred to as a graph) defined by:

Nodes
Links (aka Edges)
Weights

In our case, the nodes are the USGS gauge stations, the links are the connections between upstream and downstream gauges, and the weights are the average volumes of water flowing from one gauge to the next.

Each link is defined by a source and target node and a value. This is where the upstream_stations dictionary comes in. In the code block below, I set up the nodes and links, looping through upstream_stations to define all of the source-target relationships:

## Deinfe nodes and links
# Nodes are station IDs
nodes = all_stations
node_indices = {node: i for i, node in enumerate(nodes)}

# make links based on upstream-downstream relationships
links = {
    'source': [],
    'target': [],
    'value': [],
}

# loop through upstream_stations dict
for station, upstream_list in upstream_stations.items():
    for stn in upstream_list:
        if stn in agg_flows and station in agg_flows:
            links['source'].append(node_indices[stn])
            links['target'].append(node_indices[station])
            links['value'].append(agg_flows[stn])

Lastly, I define some node labels and assign colors to each node. In this case, I want to make the nodes black if they represent reservoir releases (gauges at reservoir outlets) or blue if they are simple gauge stations.

labels = {
    '08329918' : 'Rio Grande at Alameda', 
    '08319000' : 'San Felipe Gauge',
    '08328950' : 'Jemez Canyon Reservoir',
    '08317200' : 'Santa Fe River',
    '08317400' : 'Cochiti Reservoir',
    '08313000' : 'Rio Grande at Otowi Bridge',
    '08287000' : 'Abiquiu Reservoir',
    '08279500' : 'Rio Grande',
    '08290000' : 'Rio Chama',
    '08289000' : 'Rio Ojo Caliente',
}

# Create nodes labels and colors lists
node_labels = [labels[node] for node in nodes]
node_colors = ['black' if 'Reservoir' in label else 'dodgerblue' for label in node_labels]

Finally, the function to generate the figure:

def create_sankey_diagram(node_labels, links, node_colors, 
						  orientation='h',
                          size=(2000, 700)):
    """
    Create a Sankey diagram of using Plotly.
    
    Parameters
    ----------
    node_labels : list
        List of node labels.
    links : dict
        Dictionary with keys 'source', 'target', and 'value'.
    node_colors : list
        List of node colors.
    orientation : str
        Orientation of the diagram, 'h' for horizontal and 'v' for vertical.
        
    Returns
    -------
    sankey_fig : plotly.graph_objects.Figure
        Plotly figure object.
    """
    sankey_fig = go.Figure(go.Sankey(
        orientation=orientation,
        node=dict(
            pad=70,
            thickness=45,
            line=dict(color='dodgerblue', width=0.5),
            label=node_labels,
            color=node_colors
        ),
        link=dict(
            source=links['source'],
            target=links['target'],
            value=links['value'],
            color='cornflowerblue'
        )
    ))
    
    sankey_fig.update_layout(
        title_text="Rio Grande Streamflow ",
        font=dict(size=23),
        width=size[0],
        height=size[1]
    )
    return sankey_fig

There are some options for manipulating this figure script to better suit your needs. Specifically you may want to modify:

pad=70 : this is the horizontal spacing between nodes
thickness=45 : this is the thickness of the node elements

With our pre-prepped data from above, we can use the function like so:

sankey_fig = create_sankey_diagram(node_labels, 
								   links, 
								   node_colors, 
								   orientation='v', size=(1000, 1200))
sankey_fig

And here we have it:

I’d say it looks… okay. And admittedly this is the appearance after manipulating the node placement using the interactive interface.

It’s a squished vertically (which can be improved by making it a much taller figure). However my biggest issue is with the text being difficult to read.

Changing the orientation to horizontal (orientation='h') results in a slightly better looking figure. Which makes sense, since the Sankey diagram is often shown horizontal. However, this does not preserve the relationship to the actual North-South flow direction in the Rio Grande, so I don’t like it as much.

Conclusions

To answer the question posed by the title, “Sankey diagrams for USGS gauge data in python(?)”: Yes, sometimes. And sometimes something else.

Plotly complaints: While working on this post, I developed a few complaints with the plotly Sankey tools. Specifically:

It appears that the label text coloring cannot be modified. I don’t like the white edgecolor/blur effect, but could not get rid of this.
The font is very difficult to read… I had to make the text size very large for it to be reasonably legible.
You can only assign a single node thickness. I had wanted to make the reservoirs thick, and shrink the size of the gauge station nodes. However, it seems this cannot be done.
The diagrams appear low-resolution and I don’t see a way to save a high res version.

Ultimately, the plotly tools are very restrictive in the design of the graphic. However, this is a tradeoff in order to get the benefit of interactive graphics.

Plotly praise: The plotly Sankey tools have some advantages, specifically:

The plots are interactive
Plots can be saved as HTML and embedded in websites

These advantages make the plotly tools good for anyone who might want to have a dynamic and maybe frequently updated dashboard on a site.

On the other hand, if I wanted to prepare a publication-quality figure, where I had absolute control of the design elements, I’d likely turn to matplotlib. That way it could be saved as an SVG and further manipulated in a vector art program link Inkscape or Illustrator.

Thanks for reading!

References

Chegini, T., Li, H. Y., & Leung, L. R. (2021). HyRiver: Hydroclimate data retriever. Journal of Open Source Software, 6(66), 3175.

Performing runtime diagnostics using MOEAFramework

In this blog post, we will be reviewing how to perform runtime diagnostics using MOEAFramework. This software has been used in prior blog posts by Rohini and Jazmin to perform MOEA diagnostics across multiple MOEA parameterizations. Since then, MOEAFramework has undergone a number of updates and structure changes. This blog post will walk through the updated functionality of running MOEAFramework (version 4.0) via the command line to perform runtime diagnostics across 20 seeds using one set of parameters. We will be using the classic 3-objective DTLZ2 problem optimized using NSGAII, both of which are in-built into MOEAFramework.

Before we begin, some helpful tips and system configuration requirements:

Ensure that you have the latest version of Java installed (as of April 2024, this is Java Version 22). The current version of MOEAFramework was compiled using class file version 61.0, which was made available in Java Version 17 (find the complete list of Java versions and their associated class files here). This is the minimum requirement for being able to run MOEAFramework.
The following commands are written for a Linux interface. Download a Unix terminal emulator like Cygwin if needed.

Another helpful tip: to see all available flags and their corresponding variables, you can use the following structure:

java -cp MOEAFramework-4.0-Demo.jar org.moeaframework.mainClass.subClass.functionName --help

Replace mainClass, subClass, and functionName with the actual class, subclass, and function names. For example,

java -cp MOEAFramework-4.0-Demo.jar org.moeaframework.analysis.tools.SampleGenerator --help

You can also replace --help with -h (if the last three alphabets prove too much to type for your weary digits).

Runtime diagnostics across 20 different seeds for one set of parameters

Generating MOEA parameters and running the optimization

To run NSGAII using one set of parameters, make sure to have a “parameters file” saved as a text file containing the following:

populationSize 10.0 250.999
maxEvaluations 10000 10000
sbx.rate 0.0 1.0
sbx.distributionIndex 0.0 100.0
pm.rate 0.0 1.0
pm.distributionIndex 0.0 100.0

For the a full list of parameter files for each of the in-built MOEAFramework algorithms, please see Jazmin’s post here.

In this example, I have called it NSGAII_Params.txt. Note that maxEvaluations is set to 10,000 on both its lower and upper bounds. This is because we want to fix the number of function evaluations completed by NSGAII. Next, in our command line, we run:

java -cp MOEAFramework-4.0-Demo.jar org.moeaframework.analysis.tools.SampleGenerator --method latin --numberOfSamples 1 --parameterFile NSGAII_Params.txt --output NSGAII_Latin

The output NSGAII_Latin file should contain a single-line that can be opened as a text file. It should have six tab-delineated values that correspond to the six parameters in the input file that you have generated. Now that you have your MOEA parameter files, let’s begin running some optimizations!

First, make a new folder in your current directory to store your output data. Here, I am simply calling it data.

mkdir data

Next, optimize the DTLZ2 3-objective problem using NSGAII:

for i in {1..20}; do java -cp MOEAFramework-4.0-Demo.jar org.moeaframework.analysis.tools.RuntimeEvaluator --parameterFile NSGAII_Params.txt --input NSGAII_Latin --problem DTLZ2_3 --seed $i --frequency 1000 --algorithm NSGAII --output data/NSGAII_DTLZ2_3_$i.data; done

Here’s what’s going down:

First, you are performing a runtime evaluation of the optimization of the 3-objective DTLZ2 problem using NSGAII
You are obtaining the decision variables and objective values at every 1,000 function evaluations, effectively tracking the progress of NSGAII as it attempts to solve the problem
Finally, you are storing the output in the data/ folder
You then repeat this for 20 seeds (or for as many as you so desire).

Double check your .data file. It should contain information on your decision variables and objective values at every 1,000 NFEs or so, seperated from the next thousand by a “#”.

Generate the reference set

Next, obtain the only the objective values at every 1,000 NFEs by entering the following into your command line:

for i in {1..20}; do java -cp MOEAFramework-4.0-Demo.jar org.moeaframework.analysis.tools.ResultFileMerger --problem DTLZ2_3 --output data/NSGAII_DTLZ2_3_$i.set --epsilon 0.01,0.01,0.01 data/NSGAII_DTLZ2_3_$i.data; done

Notice that we have a new flag here – the --epsilon flag tells MOEAFramework that you only want objective values that are at least 0.01 better than other non-dominated objective values for a given objective. This helps to trim down the size of the final reference set (coming up soon) and remove solutions that are only marginally better (and may not be decision-relevant in the real-world context).

On to generating the reference set – let’s combine all objectives across all seeds using the following command line directive:

for i in {1..20}; do java -cp MOEAFramework-4.0-Demo.jar org.moeaframework.analysis.tools.ReferenceSetMerger --output data/NSGAII_DTLZ2_3.ref -epsilon 0.01,0.01,0.01 data/NSGAII_DTLZ2_3_$i.set; done

Your final reference set should now be contained within the NSGAII_DTLZ2_3.ref file in the data/ folder.

Generate the runtime metrics

Finally, let’s generate the runtime metrics. To avoid any mix-ups, let’s create a folder to store these files:

mkdir data_metrics

And finally, generate our metrics!

or i in {1..20}; do java -cp MOEAFramework-4.0-Demo.jar org.moeaframework.analysis.tools.ResultFileEvaluator --problem DTLZ2_3 --epsilon 0.01,0.01,0.01 --input data/NSGAII_DTLZ2_3_$i.data --reference data/NSGAII_DTLZ2_3.ref --output data_metrics/NSGAII_DTLZ2_3_$i.metrics; done

If all goes well, you should see 20 files (one each for each seed) similar in structure to the one below in your data_metrics/ folder:

The header values are the names of each of the MOEA performance metrics that MOEAFramework measures. In this blog post, we will proceed with visualizing the hypervolume over time across all 20 seeds.

Visualizing runtime diagnostics

The following Python code first extracts the metric that you would like to view, and saves the plot as a PNG file in the data_metrics/ folder:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

# define constants 
num_seeds = 20
NFE = 10000 
freq = 1000
num_output = int(NFE/freq)

algorithm = 'NSGAII'
problem = 'DTLZ2_3'
folder_name = 'data_metrics/'
metric_name = 'Hypervolume'
# create matrix of hypervolume runtimes 
hvol_matrix = np.zeros((num_seeds, num_output), dtype=float)
for seed in range(num_seeds):
    runtime_df = pd.read_csv(f'{folder_name}{algorithm}_{problem}_{seed+1}.metrics', delimiter=' ', header=0)
    if metric_name == 'Hypervolume':
        hvol_matrix[seed] = runtime_df['#Hypervolume'].values
    else:
        hvol_matrix[seed] = runtime_df[metric_name].values

# plot the hypervolume over time
fig, ax = plt.subplots(figsize=(10, 6))

ax.fill_between(np.arange(freq, NFE+freq, freq), np.min(hvol_matrix, axis=0), np.max(hvol_matrix, axis=0), color='paleturquoise', alpha=0.6)
ax.plot(np.arange(freq, NFE+freq, freq), np.min(hvol_matrix, axis=0), color='darkslategrey', linewidth=2.0)
ax.plot(np.arange(freq, NFE+freq, freq), np.max(hvol_matrix, axis=0), color='darkslategrey', linewidth=2.0)
ax.plot(np.arange(freq, NFE+freq, freq), np.mean(hvol_matrix, axis=0), color='darkslategrey', linewidth=2.0 ,label='Mean', linestyle='--')

ax.set_xlabel('NFE')
ax.set_xlim([freq, NFE+freq])
ax.set_ylabel(metric_name)
ax.set_ylim([0, 1])
ax.set_title(f'{metric_name} over time')
ax.legend(loc='upper left')

plt.savefig(f'{folder_name}{algorithm}{problem}_{metric_name}.png')

If you correctly implemented the code, you should be able to be view the following figure that shows how the hypervolume attained by the NSGAII algorithm improves steadily over time.

In the figure above, the colored inner region spans the hypervolume attained across all 20 seeds, with the dotted line representing the mean hypervolume over time. The solid upper and lower bounding lines are the maximum and minimum hypervolume achieved every 1,000 NFEs respectively. Note that, in this specific instance, NSGAII only achieves about 50% of the total hypervolume of the overall objective space. This implies that a higher NFE (a longer runtime) is required for NSGAII to further increase the hypervolume achieved. Nonetheless, the rate of hypervolume increase is gradually decreasing, indicating that this particular parameterization of NSGAII is fast approaching its maximum possible hypervolume, which additional NFEs only contributing small improvements to performance. It is also worth noting the narrow range of hypervolume values, especially as the number of NFEs grows larger. This is representative of the reliability of the NGSAII algorithm, demonstrating that is can somewhat reliably reproduce results across multiple different seeds.

Summary

This just about brings us to the end of this blog post! We’ve covered how to perform MOEA runtime diagnostics and plot the results. If you are curious, here are some additional things to explore:

Plot different performance metrics against NFE. Please see Joe Kasprzyk’s post here to better understand the plots you generate.
Explore different MOEAs that are built into MOEAFramework to see how they perform across multiple seeds.
Generate multiple MOEA parameter samples using the in-built MOEAFramework Latin Hypercube Sampler to analyze the sensitivity of a given MOEA to its parameterization.
Attempt examining the runtime diagnostics of Borg MOEA using the updated version of MOEAFramework.

On that note, make sure to check back for updates as MOEAFramework is being actively reviewed and improved! You can view the documentation of Version 4.0 here and access its GitHub repository here.

Happy coding!

April 11, 2024 by Trevor Amestoy

Introduction to Bayesian Regression using PyMC

Motivation

Fans of this blog will know that uncertainty is often a focus for our group. When approaching uncertainty, Bayesian methods might be of interest since they explicitly provide uncertainty estimates during the modeling process.

PyMC is the best tool I have come across for Bayesian modeling in Python; this post gives a super brief introduction to this toolkit.

Introduction to PyMC

PyMC, described in their own words:
“… is a probabilistic programming library for Python that allows users to build Bayesian models with a simple Python API and fit them using Markov chain Monte Carlo (MCMC) methods.”

In my opinion, the best part of PyMC is the flexibility and breadth of model design features. The space of different model configurations is massive. It allows you to make models ranging from simple linear regressions (shown here), to more complex hierarchical models, copulas, gaussian processes, and more.

Regardless of your model formulation, PyMC let’s you generate posterior estimates of model parameter distributions. These parameter distributions reflect the uncertainty in the model, and can propagate uncertainty into your final predictions.

The posterior estimates of model parameters are generated using Markov chain Monte Carlo (MCMC) methods. A detailed overview of MCMC is outside the scope of this post (maybe in a later post…).

In the simplest terms, MCMC is a method for estimating posterior parameter distributions for a Bayesian model. It generates a sequence of samples from the parameter space (which can be huge and complex), where the probability of each sample is proportional to its likelihood given the observed data. By collecting enough samples, MCMC generates an approximation of the posterior distribution, providing insights into the probable values of the model parameters along with their uncertainties. This is key when the models are very complex and the posterior cannot be directly defined.

The PyMC example gallery has lots of cool stuff to get you inspired, with examples that go far and beyond the simple linear regression case.

Demonstration:

When writing drafting this post, I wanted to include a demonstration which is (a) simple enough to cover in a brief post, and (b) relatively easy for others to replicate. I settled on the simple linear regression model described below, since this was able to be done using readily retrievable CAMELS data.

The example attempts to predict mean streamflow as a linear function of basin catchment area (both in log space). As you’ll see, it’s not the worst model, but its far from a good model; there is a lot of uncertainty!

CAMELS Data

For a description of the CAMELS dataset, see Addor, Newman, Mizukami and Clark (2017).

I pulled all of the national CAMELS data using the pygeohydro package from HyRiver which I have previously recommended on this blog. This is a convenient single-line code to get all the data:

import pygeohydro as gh

### Load camels data
camels_basins, camels_qobs = gh.get_camels()

The camels_basins variable is a dataframe with the different catchment attributes, and the camels_qobs is a xarray.Dataset. In this case we will only be using the camels_basins data.

The CAMELS data spans the continental US, but I want to focus on a specific region (since hydrologic patterns will be regional). Before going further, I filter the data to keep only sites in the Northeaster US:

# filter by mean long lat of geometry: NE US
camels_basins['mean_long'] = camels_basins.geometry.centroid.x
camels_basins['mean_lat'] = camels_basins.geometry.centroid.y
camels_basins = camels_basins[(camels_basins['mean_long'] > -80) & (camels_basins['mean_long'] < -70)]
camels_basins = camels_basins[(camels_basins['mean_lat'] > 35) & (camels_basins['mean_lat'] < 45)]

I also convert the mean flow data (q_mean) units from mm/day to cubic meters per day:

# convert q_mean from mm/day to m3/s
camels_basins['q_mean_cms'] = camels_basins['q_mean'] * (1e-3) *(camels_basins['area_gages2']*1000**2) * (1/(60*60*24))

And this is all the data we need for this crude model!

Bayesian linear model

The simple linear regression model (hello my old friend):

Normally you might assume that there is a single, best value corresponding to each of the model parameters (alpha and beta). This is considered a Frequentist perspective and is a common approach. In these cases, the best parameters can be estimated by minimizing the errors corresponding to a particular set of parameters (see least squares, for example.

However, we could take a different approach and assume that the parameters (intercept and slope) are random variables themselves, and have some corresponding distribution. This would constitute a Bayesian perspective.

Keeping with simplicity in this example, I will assume that the intercept and slope each come from a normal distribution with a mean and variance such that:

When it comes time to make inferences or predictions using our model, we can create a large number of predictions by sampling different parameter values from these distributions. Consequently, we will end up with a distribution of uncertain predictions.

PyMC implementation

I recommend you see the PyMC installation guide to help you get set up.

NOTE: The MCMC sampler used by PyMC is written in C and will be SIGNIFICANTLY faster if you provide have access to GCC compiler and specify the it’s directory using the the following:

import pymc as pm

import os
os.environ["THEANO_FLAGS"] = "gcc__cxxflags=-C:\mingw-w64\mingw64\bin"

You will get a warning if you don’t have this properly set up.

Now, onto the demo!

I start by retrieving our X and Y data from the CAMELS dataset we created above:

# Pull out X and Y of interest
x_ftr= 'area_gages2'
y_ftr = 'q_mean_cms'
xs = camels_basins[x_ftr] 
ys = camels_basins[y_ftr]

# Take log-transform 
xs = np.log(xs)
ys = np.log(ys)

At a glance, we see there is a reasonable linear relationship when working in the log space:

Two of the key features when building a model are:

The random variable distribution constructions
The deterministic model formulation

There are lots of different distributions available, and each one simply takes a name and set of parameter values as inputs. For example, the normal distribution defining our intercept parameter is:

alpha = pm.Normal('alpha', mu=intercept_prior, sigma=10)

The value of the parameter priors that you specify when construction the model may have a big impact depending on the complexity of your model. For simple models you may get away with having uninformative priors (e.g., setting mu=0), however if you have some initial guesses then that can help with getting reliable convergence.

In this case, I used a simple least squares estimate of the linear regression as the parameter priors:

slope_prior, intercept_prior = np.polyfit(xs.values.flatten(), ys.values.flatten(), 1)

Once we have our random variables defined, then we will need to formulate the deterministic element of our model prediction. This is the functional relationship between the input, parameters, and output. For our linear regression model, this is simply:

y_mu = alpha + beta * xs

In the case of our Bayesian regression, this can be thought of as the mean of the regression outputs. The final estimates are going to be distributed around the y_mu with the uncertainty resulting from the combinations of our different random variables.

Putting it all together now:

### PyMC linear model
with pm.Model() as model:
    
    # Priors
    alpha = pm.Normal('alpha', mu=intercept_prior, sigma=10)
    beta = pm.Normal('beta', mu=slope_prior, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=1)

    # mean/expected value of the model
    mu = alpha + beta * xs

    # likelihood
    y = pm.Normal('y', mu=mu, sigma=sigma, observed=ys)

    # sample from the posterior
    trace = pm.sample(2000, cores=6)

With our model constructed, we can use the pm.sample() function to begin the MCMC sampling process and estimate the posterior distribution of model parameters. Note that this process can be very computationally intensive for complex models! (Definitely make sure you have the GCC set up correctly if you plan on needing to sample complex models.)

Using the sampled parameter values, we can create posterior estimates of the predictions (log mean flow) using the posterior parameter distributions:

## Generate posterior predictive samples
ppc = pm.sample_posterior_predictive(trace, model=model)

Let’s go ahead and plot the range of the posterior distribution, to visualize the uncertainty in the model estimates:

### Plot the posterior predictive interval
fig, ax = plt.subplots(ncols=2, figsize=(8,4))

# log space
az.plot_hdi(xs, ppc['posterior_predictive']['y'], 
            color='cornflowerblue', ax=ax[0], hdi_prob=0.9)
ax[0].scatter(xs, ys, alpha=0.6, s=20, color='k')
ax[0].set_xlabel('Log ' + x_ftr)
ax[0].set_ylabel('Log Mean Flow (m3/s)')

# original dim space
az.plot_hdi(np.exp(xs), np.exp(ppc['posterior_predictive']['y']), 
            color='cornflowerblue', ax=ax[1], hdi_prob=0.9)
ax[1].scatter(np.exp(xs), np.exp(ys), alpha=0.6, s=20, color='k')
ax[1].set_xlabel(x_ftr)
ax[1].set_ylabel('Mean Flow (m3/s)')
plt.suptitle('90% Posterior Prediction Interval', fontsize=14)
plt.show()

And there we have it! The figure on the left shows the data and posterior prediction range in log-space, while the figure on the right is in non-log space.

As mentioned earlier, it’s not the best model (wayyy to much uncertainty in the large-basin mean flow estimates), but at least we have the benefit of knowing the uncertainty distribution since we took the Bayesian approach!

That’s all for now; this post was really meant to bring PyMC to your attention. Maybe you have a use case or will be more likely to consider Bayesian approaches in the future.

If you have other Bayesian/probabilistic programming tools that you like, please do comment below. PyMC is one (good) option, but I’m sure other people have their own favorites for different reasons.

PyMC resources:

The PyMC example gallery
I’ve found Thomas Wiecki’s site, twiecki.io, to have some great blog posts to help get started with PyMC:

References

Addor, N., Newman, A. J., Mizukami, N. and Clark, M. P. The CAMELS data set: catchment attributes and meteorology for large-sample studies, Hydrol. Earth Syst. Sci., 21, 5293–5313, doi:10.5194/hess-21-5293-2017, 2017.

March 27, 2024 by Lillian Lau

Parallel Programming Part 2: Demonstrating OpenMP constructs with a simple test case

In Parallel Programming Part 1, I introduced a number of basic OpenMP directives and constructs, as well as provided a quick guide to high-performance computing (HPC) terminology. In Part 2 (aka this post), we will apply some of the constructs previously learned to execute a few OpenMP examples. We will also view an example of incorrect use of OpenMP constructs and its implications.

TheCube Cluster Setup

Before we begin, there are a couple of steps to ensure that we will be able to run C++ code on the Cube. First, check the default version of the C++ compiler. This compiler essentially transforms your code into an executable file that can be easily understood and implemented by the Cube. You can do this by entering

g++ --version

into your command line. You should see that the Cube has G++ version 8.3.0 installed, as shown below.

Now, create two files:

openmp-demo-print.cpp this is the file where we will perform a print demo to show how tasks are distributed between threads
openmp-demo-access.cpp this file will demonstrate the importance of synchronization and the implications if proper synchronization is not implemented

Example 1: Hello from Threads

Open the openmp-demo-print.cpp file, and copy and paste the following code into it:

#include <iostream>
#include <omp.h>
#include <stdio.h>

int main() {

    int num_threads = omp_get_max_threads(); // get max number of threads available on the Cube
    omp_set_num_threads(num_threads);

    printf("The CUBE has %d threads.\n", num_threads);


    #pragma omp parallel 
    {
        int thread_id = omp_get_thread_num();
        printf("Hello from thread num %d\n", thread_id);
    }

    printf("Thread numbers written to file successfully. \n");

    return 0;
}

Here is what the code is doing:

Lines 1 to 3 are the libraries that need to be included in this script to enable the use of OpenMP constructs and printing to the command line.
In main() on Lines 7 and 8, we first get the maximum number of threads that the Cube has available on one node and assign it to the num_threads variable. We then set the number of threads we want to use.
Next, we declare a parallel section using #pragma omp parallel beginning on Line 13. In this section, we are assigning the printf() function to each of the num_threads threads that we have set on Line 8. Here, each thread is now executing the printf() function independently of each other. Note that the threads are under no obligation to you to execute their tasks in ascending order.
Finally, we end the call to main() by returning 0

Be sure to first save this file. Once you have done this, enter the following into your command line:

g++ openmp-demo-print.cpp -o openmp-demo-print -fopenmp

This tells the system to compile openmp-demo-print.cpp and create an executable object using the same file name. Next, in the command line, enter:

./openmp-demo-print

You should see the following (or some version of it) appear in your command line:

Notice that the threads are all being printed out of order. This demonstrates that each thread takes a different amount of time to complete its task. If you re-run the following script, you will find that the order of threads that are printed out would have changed, showing that there is a degree of uncertainty associated with the amount of time they each require. This has implications for the order in which threads read, write, and access data, which will be demonstrated in the next example.

Example 2: Who’s (seg)Fault Is It?

In this example, we will first execute a script that stores each thread number ID into an array. Each thread then accesses its ID and prints it out into a text file called access-demo.txt. See the implementation below:

#include <iostream>
#include <omp.h>
#include <vector>

using namespace std;

void parallel_access_demo() {
    vector<int> data;
    int num_threads = omp_get_max_threads();

    // Parallel region for appending elements to the vector
    #pragma omp parallel for
    for (int i = 0; i < num_threads; ++i) {
        // Append the thread ID to the vector
        data.push_back(omp_get_thread_num()); // Potential race condition here
    }
}

int main() {
    // Execute parallel access demo
    parallel_access_demo();
    printf("Code ran successfully. \n");

    return 0;
}

As per the usual, here’s what’s going down:

After including the necessary libraries, we use include namespace std on Line 5 so that we can more easily print text without having to used std:: every time we attempt to write a statement to the output file
Starting on Line 7, we write a simple parallel_access_demo() function that does the following:
- Initialize a vector of integers very creatively called data
- Get the maximum number of threads available per node on the Cube
- Declare a parallel for region where we iterate through all threads and push_back the thread ID number into the data vector
Finally, we execute this function in main() beginning Line 19.

If you execute this using a similar process shown in Example 1, you should see the following output in your command line:

Congratulations – you just encountered your first segmentation fault! Not to worry – this is by design. Take a few minutes to review the code above to identify why this might be the case.

Notice Line 15. Where, each thread is attempting to insert a new integer into the data vector. Since push_back() is being called in a parallel region, this means that multiple threads are attempting to modify data simultaneously. This will result in a race condition, where the size of data is changing unpredictably. There is a pretty simple fix to this, where we add a #pragma omp critical before Line 15.

As mentioned in Part 1, #pragma omp critical ensures that each thread executes its task one at a time, and not all at once. It essentially serializes a portion of the parallel for loop and prevents race conditions by ensuring that the size of data changes in a predictable and controlled manner. An alternative would be to declare #pragma omp single in place of #pragma omp parallel for on Line 12. This essentially delegates the following for-loop to a single thread, forcing the loop to be completed in serial. Both of these options will result in the blissful outcome of

Summary

In this post, I introduced two examples to demonstrate the use of OpenMP constructs. In Example 1, I demonstrated a parallel printing task to show that threads do not all complete their tasks at the same time. Example 2 showed how the lack of synchronization can lead to segmentation faults due to race conditions, and two methods to remedy this. In Part 3, I will introduce GPU programming using the Python PyTorch library and Google Colaboratory cloud infrastructure. Until then, happy coding!

February 8, 2024 by Trevor Amestoy

The Thomas-Fiering Model for Synthetic Streamflow Generation with a Python Implementation

In 1962 a group of economists, engineers and political scientists who were involved in the Harvard Water Program published “Design of Water Resource Systems“. In chapter 12 of the book, Thomas and Fiering present the following statistical model which was one of the first, if not the first, formal application of stochastic modelling for synthetic streamflow generation and water resource systems evaluation.

It is an autoregressive model which can simulate monthly streamflow values based on the mean, variance, and correlation of historic observations.

In this blog post, I present the model in it’s original form along with a modified form presented by Stedinger and Taylor (1982). Then, I share a Python implementation of the model which is used to generate an ensemble of synthetic flows. I use plotting tools from the Synthetic Generation Figure Library to plot the results.

All of the code used for this post is available here: ThomasFieringModelDemo

Let’s get into it!

The Thomas-Fiering Model

The model that Thomas and Fiering proposed took the form:

Where, for each month $m$ , $Q_m$ is the generated flow, $\bar{Q}_m$ is the mean historic flow, $b_m$ is an autoregression coefficient for predicting that months flow from the prior months flow, $\sigma$ is the standard deviation, $r$ is the correlation coefficient and $\epsilon$ is a random standard normal variable.

A modification to this model was proposed by Stedinger and Taylor (1982), which transforms transforms the streamflow values before fitting the model. I refer to this as the “Stedinger transformation” below and in the code.

Given $Q_{m}$ as the observed flows in month $m$ , the Stedinger transformation of the observed flows is then:

where $\hat{\tau}_m$ is the estimated “lower bound” for each month, calculated as:

The modeled flows are generated from the recursive relationship:

Where:

$\mu_{m}$ is the observed average historic monthly $x$ series
$\sigma_{m}^2$ is the observed variance of the historic monthly $x$ series
$\epsilon_{m}$ independent standard-normal random variables
$\rho_m$ observed between-month correlations of the historic $x$ series

The above steps are performed for each month, and the synthetic streamflow sequence is generated by iteratively applying the stochastic process for the desired duration.

Python Implementation

I built this version of the Thomas Fiering model as a Python class with the following structure:

class ThomasFieringGenerator():
    def __init__(self, Q, **kwargs):
        
    def preprocessing(self, **kwargs):
	    # Stedinger normalization
	    
    def fit(self, **kwargs):
	    # Calculate mu, sigma, and rho for each month
	    
    def generate(self, n_years, **kwargs):
	    # Iteratively generate a single timeseries
	    # Inverse stedinger normalization
        return Q_synthetic
    
    def generate_ensemble(self, n_years, 
                          n_realizations = 1, 
                          **kwargs):
        # Loop and generate multiple timeseries
        return

Rather than posting the entire code here, which would clutter the page, I will refer you to and encourage you to check out the full implementation which is in the linked repository here: ThomasFieringModelDemo/model.py

To see how this is used and replicate the results below using some example data, see the Jupyter Notebook: ThomasFieringModelDemo/ThomasFiering_demo.ipynb

Synthetic ensemble results

I used the ThomasFieringGenerator to produce 100 samples of 50-year monthly streamflows at USGS gauge site 01434000 on the Delaware River which has data going back to 1904.

This data is available in the repo and is stored in the file usgs_monthly_streamflow_cms.csv

The plotting functions are taken from the Synthetic Generation Figure Library which was shared previously on the blog.

First we consider the range of historic and synthetic streamflow timeseries:

Generally when working with synthetic ensembles it is good for the distribution of synthetic ensembles “envelope” the historic range while maintaining a similar median. The Thomas Fiering model does a good job at this!

The next figure shows the range of flow-quantile values for both historic and synthetic flows. Again, we see a nice overlapping of the synthetic ensemble:

Conclusions

I personally think it is fun and helpful to look back at the foundational work in a field. Since Thomas and Fiering’s work in the early 1960s, there has been a significant amount of work focused on synthetic hydrology.

The Thomas Fiering model has a nice simplicity while still performing very nicely (with the help of the Stedinger normalization). Sure there are some limitations to the model (e.g., the estimation of distribution and correlation parameters may be inaccurate for short records, and the method does not explicitly prevent the generation of negative streamflows), but the model, and the Harvard Water Program more broadly, was successful in ushering in new approaches for water resource systems analysis.

References

Maass, A., Hufschmidt, M. M., Dorfman, R., Thomas, Jr, H. A., Marglin, S. A., & Fair, G. M. (1962). Design of water-resource systems: New techniques for relating economic objectives, engineering analysis, and governmental planning. Harvard University Press.

Stedinger, J. R., & Taylor, M. R. (1982). Synthetic streamflow generation: 1. Model verification and validation. Water resources research, 18(4), 909-918.

January 29, 2024 by Lillian Lau

A quick and straightforward introduction to LIME

In this blog post, we will be discussing the many household uses of citrus aurantiifolio, or the common lime.

Just kidding, we’ll be talking about a completely different type of LIME, namely Local Interpretable Model-Agnostic Explanations (at this point you may be thinking that the former’s scientific name becomes the easier of the two to say). After all, this is the WaterProgramming blog.

On a more serious note though, LIME is one of the two widely-known model agnostic explainable AI (xAI) methods, alongside Shapley Additive Explanations (SHAP). This post is intended to be an introduction to LIME in particular, and we will be setting up the motivation for using xAI methods as well as a brief example application using the North Carolina Research Triangle system.

Before we proceed, here’s a quick graphic to get oriented:

The figure above mainly demonstrates three main concepts: Artificial Intelligence (AI), the methods used to achieve AI (one of which includes Machine Learning, or ML), and the methods to explain how such methods achieved their version of AI (explainable AI, or more catchily known as xAI). For more explanation on the different categories of AI, and their methods, please refer to these posts by IBM’s Data Science and AI team and this SpringerLink book by Sarker (2022) respectively.

Model-agnostic vs model-specific

Model-specific methods

As shown in the figure, model-specific xAI methods are techniques that can only be used on the specific model that it was designed for. Here’s a quick rundown of the type of model and their available selection of xAI methods:

Decision trees
- Decision tree visualization (e.g. Classification and Regression Tree (CART) diagrams)
- Feature importance rankings
Neural networks (NN), including deep neural networks
- Coefficient (feature weight) analysis
- Neural network neuron/layer activation visualization
- Attention (input sequence) mechanisms
- Integrated gradients
- DeepLIFT
- GradCAM
Reinforcement learning
- Causal models

Further information on the mathematical formulation for these methods can be found in Holzinger et al (2022). Such methods account for the underlying structure of the model that they are used for, and therefore require some understanding of the model formulation to fully comprehend (e.g. interpreting NN activation visualizations). While less flexible than their agnostic counterparts, model-specific xAI methods provide more granular information on how specific types of information is processed by the model, and therefore how changing input sequences, or model structure, can affect model predictions.

Model-agnostic methods

Model-agnostic xAI (as it’s name suggests) relies solely on analyzing the input-output sets of a model, and therefore can be applied to a wide range of machine learning models regardless of model structure or type. It can be thought of as (very loosely) as sensitivity analysis, but applied to AI methods (for more information on this discussion, please refer to Scholbeck et al. (2023) and Razavi et al. (2021)). SHAP and LIME both fall under this set of methods, and approximately abide by the following process: perturb the input then identify how the output is changed. Note that this set of methods provide little insight as to the specifics of model formulation and how it affects model predictions. Nonetheless, it affords a higher degree of flexibility, and does not bind you to one specific model.

Why does this matter?

Let’s think about this in the context of water resources systems management and planning. Assume you are a water utility responsible for ensuring that you reliably deliver water to 280,000 residents on a daily basis. In addition, you are responsible for planning the next major water supply infrastructure project. Using a machine learning model to inform your short- and long-term management and planning decisions without interrogating how it arrived at its recommendations implicitly assumes that the model will make sensible decisions that balance all stakeholders’ needs while remaining equitable. More often than not, this assumption can be incorrect and may lead to (sometimes funny but mostly) unintentional detrimental cascading implications on the most vulnerable (for some well-narrated examples of how ML went awry, please refer to Brian Christian’s “The Alignment Problem”).

Having said that, using xAI as a next step in the general framework of adopting AI into our decision-making processes can help us better understand why a model makes its predictions, how those predictions came to be, and their degree of usefulness to a decision maker. In this post, I will be demonstrating the use of LIME to answer the following questions.

The next section will establish the three components we will need to apply LIME to our problem:

An input (feature) set and the training dataset
The model predictions
The LIME explainer

A quick example using the North Carolina Research Triangle

The input (feature) set and training dataset

The Research Triangle region in North Carolina consists of six main water utilities that deliver water to their namesake cities: OWASA (Orange County), Durham, Cary, Raleigh, Chatham, and Pittsboro (Gold et al., 2023). All utilities have collaboratively decided that they will each measure their system robustness using a satisficing metric (Starr 1962; Herman et al. 2015), where they satisfy their criteria for robustness if the following criteria are met:

Their reliability meets or exceeds 98%
Their worst-case cost of drought mitigation actions amount to no more than 10% of their annual volumetric revenue
They do not restrict demand more than 20% of the time

If all three criteria are met, then they are considered “successful” (represented as a 1). Otherwise, they have “failed” (represented as a 0). We have 1,000 training data points of success or failure, each representing a state of the world (SOW) in which a utility fails or succeeds to meet their satisficing critieria. This is our training dataset. Each SOW consists of a unique combination of features that include inflow, evaporation and water demand Fourier series coefficients, as well as socioeconomic, policy and infrastructure construction factors. This is our feature set.

Feel free to follow along this portion of the post using the code available in this Google Colab Notebook.

The model prediction

To generate our model prediction, let’s first code up our model. In this example, we will be using Extreme Gradient Boosting, otherwise known as XGBoost (xgboost) as our prediction model, and the LIME package. Let’s first install the both of them:

pip install lime
pip install xgboost

We will also need to import all the needed libraries:

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import lime
import lime.lime_tabular
import xgboost as xgb
from copy import deepcopy

Now let’s set up perform XGBoost! We will first need to upload our needed feature and training datasets:

satisficing = pd.read_csv('satisficing_all_utilites.csv')
du_factors = pd.read_csv('RDM_inputs_final_combined_headers.csv')  

# get all utility and feature names
utility_names = satisficing.columns[1:]
du_factor_names = du_factors.columns

There should be seven utility names (six, plus one that represents the entire region) and thirteen DU factors (or feature names). In this example, we will be focusing only on Pittsboro.

# select the utility 
utility = 'Pittsboro'

# convert to numpy arrays to be passed into xgboost function
du_factors_arr = du_factors.to_numpy()
satisficing_utility_arr = satisficing[utility].values

# initialize the figure object
fig, ax = plt.subplots(1, 1, figsize=(5, 5))  
perform_and_plot_xbg(utility, ax, du_factors_arr, satisficing_utility_arr, du_factor_names)

Note the perform_and_plot_xgb function being used – this function is not shown here for brevity, but you can view the full version of this function in this Google Colab Notebook.

The figure above is called a factor map that shows mid-term (Demand2) and long-term (Demand3) demand growth rates on its x- and y-axes respectively. The green denotes the range of demand growth rates where XGBoost has predicted that Pittsboro will successfully meet its satisficing criteria, and brown is otherwise. Each point is a sample from the original training dataset, where the color (white is 1 – success, and red is 0 – failure) denotes whether Pittsboro actually meets its satisficing criteria. In this case we can see that Pittsboro quickly transitions in failure when its mid- and long-term demand are both higher than expected (indicated by the 1.0-value on both axes).

The LIME explainer

Before we perform LIME, let’s first select an interesting point using the figure above.

In the previous section, we can see that the XGBoost algorithm predicts that Pittsboro’s robustness is affected most by mid-term (Demand2) and long-term (Demand3) demand growth. However, there is a point (indicated using the arrow and the brown circle below) where this prediction did not match the actual data point.

To better understand why then, this specific point was predicted to be a “successful” SOW where the true datapoint had it labeled as a “failure” SOW, let’s take a look at how the XGBoost algorithm made its decision.

First, let’s identify the index of this point:

# select an interesting point 
interesting_point_range_du = du_factors[(du_factors['Demand2'] < 0) & (du_factors['Demand3'] < 0)].index
interesting_point_satisficing = satisficing_utility_arr[interesting_point_range_du]
interesting_point_range_sat = np.where(satisficing_utility_arr == 0)
interesting_point = np.intersect1d(interesting_point_range_du, interesting_point_range_sat)[0]

This will return an index of 704. Next, we’ll run LIME to break down the how this (mis)classification was made:

# instantiate the lime explainer
explainer = lime_tabular.LimeTabularExplainer(du_factors_arr, 
                                              mode='classification', 
                                              feature_names=du_factor_names)

# explain the interesting instance
exp = explainer.explain_instance(du_factors_arr[interesting_point_selected], 
                                 xgb_model.predict_proba, 
                                 num_features=len(du_factor_names))

exp.show_in_notebook(show_table=True)

The following code, if run successfully, should result in the following figure.

Here’s how to interpret it:

The prediction probability bars on the furthest left show the model’s prediction. In this case, the XGBoost model classifies this point as a “failure” SOW with 94% confidence.
The tornado plot in the middle show the feature contributions. In this case, it shows the degree to which each SOW feature influenced the decision. In this case, the model misclassified the data point as a “success” although it was a failure as our trained model only accounts for the top two overall features that influence the entire dataset to plot the factor map, and did not account for short-term demand growth rates (Demand1) and the permitting time required for constructing the water treatment plant (JLWTP permit).
The table on the furthest right is the table of the values of all the features of this specific SOW.

Using LIME has therefore enabled us to identify the cause of XGBoost’s misclassification, allowing us to understand that the model needed information on short-term demand and permitting time to make the correct prediction. From here, it is possible to further dive into the types of SOWs and their specific characteristics that would cause them to be more vulnerable to short-term demand growth and infrastructure permitting time as opposed to mid- and long-term demand growth.

Summary

Okay, so I lied a bit – it wasn’t quite so “brief” after all. Nonetheless, I hope you learned a little about explainable AI, how to use LIME, and how to interpret its outcomes. We also walked through a quick example using the good ol’ Research Triangle case study. Do check out the Google Colab Notebook if you’re interested in how this problem was coded.

With that, thank you for sticking with me – happy learning!

References

Amazon Web Services. (1981). What’s the Difference Between AI and Machine Learning? Machine Learning & AI. https://aws.amazon.com/compare/the-difference-between-artificial-intelligence-and-machine-learning/

Btd. (2024, January 7). Explainable AI (XAI): Model-specific interpretability methods. Medium. https://baotramduong.medium.com/explainable-ai-model-specific-interpretability-methods-02e23ebceac1

Christian, B. (2020). The alignment problem: Machine Learning and human values. Norton & Company.

Gold, D. F., Reed, P. M., Gorelick, D. E., & Characklis, G. W. (2023). Advancing Regional Water Supply Management and infrastructure investment pathways that are equitable, robust, adaptive, and cooperatively stable. Water Resources Research, 59(9). https://doi.org/10.1029/2022wr033671

Herman, J. D., Reed, P. M., Zeff, H. B., & Characklis, G. W. (2015). How should robustness be defined for water systems planning under change? Journal of Water Resources Planning and Management, 141(10). https://doi.org/10.1061/(asce)wr.1943-5452.0000509

Holzinger, A., Saranti, A., Molnar, C., Biecek, P., & Samek, W. (2022). Explainable AI methods – A brief overview. xxAI – Beyond Explainable AI, 13–38. https://doi.org/10.1007/978-3-031-04083-2_2

IBM Data and AI Team. (2023, October 16). Understanding the different types of artificial intelligence. IBM Blog. https://www.ibm.com/blog/understanding-the-different-types-of-artificial-intelligence/

Sarker, I. H. (2022, February 10). AI-based modeling: Techniques, applications and research issues towards automation, intelligent and Smart Systems – SN Computer Science. SpringerLink. https://link.springer.com/article/10.1007/s42979-022-01043-x#Sec6

Scholbeck, C. A., Moosbauer, J., Casalicchio, G., Gupta, H., Bischl, B., & Heumann, C. (2023, December 20). Position paper: Bridging the gap between machine learning and sensitivity analysis. arXiv.org. http://arxiv.org/abs/2312.13234

Starr, M. K. (1963). Product design and decision theory. Prentice-Hall, Inc.

January 22, 2024 by Marilyn Smith

Geocoding Using Google API Key in Python

Introduction

In this post, I will delve into the transformative realm of geocoding, addressing the challenges posed by historical data laden with addresses. Throughout the discussion, I’ll guide you through the intricacies of leveraging the Google API key to seamlessly integrate location information.

Key points include:

- Preparing your digital workspace for geocoding with essential tools.

- Obtaining a Google API key to unlock the power of precise coordinates.

- Applying practical steps to effortlessly transform addresses into valuable spatial insights using Pandas DataFrame.

ArcGIS Developers. (n.d.). *ArcGIS API for Python*. Retrieved from here.

As we study the realm of data-driven exploration, a significant challenge surfaces—one that personally resonated with me during my recent project. Historical data, often a treasure trove of insights, can be a double-edged sword. The challenge arises when this wealth of information arrives with addresses instead of coordinates, adding a layer of complexity to the analysis. In my own experience, this meant that a substantial portion of the received data was, unfortunately, unusable for my intended analysis without a transformative solution.

This is where geocoding steps in as a crucial ally, reshaping the landscape of historical datasets and making them analysis-ready. The inconsistency in reporting formats and the absence of standardized coordinates pose hurdles that can impede meaningful analysis. Geocoding becomes an indispensable tool, allowing me to bridge this gap, harmonize the data, and unlock its full potential for analysis.

This post focuses on the intricacies of geocoding—a transformative process that transcends mere addresses, providing the geographic insights necessary for a comprehensive understanding of your dataset. Equipped with the Google API key, we’ll delve into the practical steps needed to seamlessly integrate location information. The following sections outline the key stages of this geocoding journey, ensuring your dataset is prepared for advanced analysis.

Preparing your environment

Prepare your digital workspace for this impactful data transformation by ensuring you have all the necessary tools. Seamlessly install GoogleMaps and Pandas using the commands provided below in your terminal.

#Install necessary packages
pip install GoogleMaps
pip install pandas

Formatting your data

In the realm of geocoding, the importance of formatting data before initiating the geocoding process cannot be overstated, especially when leveraging the capabilities of the Google API key. The formatting step is a vital prerequisite for several reasons. Firstly, it ensures consistency and standardization in the structure of addresses, aligning them with the expectations of geocoding services. This consistency allows for more accurate interpretation and processing of addresses. Additionally, formatted addresses provide a clear separation of components such as city, state, and ZIP code, facilitating efficient matching during the geocoding algorithm’s execution. The optimization of API usage is another benefit, as well-formatted addresses contribute to accurate results, minimizing unnecessary costs associated with incorrect or ambiguous requests. By addressing these considerations in the formatting stage, one sets the stage for a more reliable and cost-effective geocoding journey, unlocking the full potential of spatial insights. Now, let’s delve into the practical implementation of this formatting process with the following code, which creates a new column containing the consolidated and formatted addresses within your dataset.

# Create new dataframe just with the address
df['ADDRESS'] = df['CITY'] + ',' + \
                df['STATE'] + ' ' + \
                df['ZIP'].astype(str)

Obtain a Google API Key

Navigate to the Google Cloud console
Under the Project drop-down menu, select the project that you want to work on. If one does not already exist, create a new project
Under the Library tab on the side menu, search “geocoding api” and ensure that it is enabled
Under the Credentials tab on the side menu, select “Create Credentials” and create a new API key to generate a new key
Your Google API key will then appear on the screen, save this key for future use

Geocoding with Pandas DataFrame

With your environment set up, your data formatted, and your Google API key in hand, it’s time to wield this powerful tool. Dive into the practical realm of geocoding by seamlessly applying the sample code below to your dataset, effortlessly transforming addresses into precise coordinates. Follow the code through each practical step, and witness your location data evolve into a valuable asset, ready for advanced analysis.

# Import necessary libraries
import googlemaps
import pandas as pd

# Define your API key
# Replace "YOUR_GOOGLE_API_KEY" with your actual Google Maps API key that you saved previously.
gmaps_key = googlemaps.Client(key="YOUR_GOOGLE_API_KEY")

# Extract the last column (ADDRESS) to create a new dataframe (df_address)
df_address = df.iloc[:, -1:]

# Geocoding with Pandas DataFrame
# Define a function to geocode addresses and extract latitude and longitude
def geocode_and_extract(df_address):
    # Add empty columns for latitude and longitude
    df_address['LATITUDE'] = ""
    df_address['LONGITUDE'] = ""
    
    # Loop through each row in the df_address dataframe
    for i in range(len(df_address)):
        address = df_address['ADDRESS'][i]
        # Geocode the address using the Google Maps API
        geocode_result = gmaps_key.geocode(address)
        
        # Check if geocoding was successful
        if geocode_result:
            # Extract latitude and longitude from the geocoding result
            df_address['LATITUDE'][i] = geocode_result[0]['geometry']['location']['lat']
            df_address['LONGITUDE'][i] = geocode_result[0]['geometry']['location']['lng']

# Apply the geocoding function to your dataframe
geocode_and_extract(df_address)

# Print an example to verify correctness
example_address = df_address['ADDRESS'][0]
example_result = gmaps_key.geocode(example_address)
example_lat = example_result[0]["geometry"]["location"]["lat"]
example_long = example_result[0]["geometry"]["location"]["lng"]
print(f'Example Geocoded Address: {example_address}, Latitude: {example_lat}, Longitude: {example_long}')

# Let's join the coordinates with the original dataset
df['LATITUDE'] = df_address['LATITUDE']
df['LONGITUDE'] = df_address['LONGITUDE']

Replace "YOUR_GOOGLE_API_KEY" with your actual Google Maps API key that you saved previously.

Dealing with incomplete data

In the midst of my geocoding journey, I encountered a common challenge: incomplete data. Some addresses in my dataset were detailed with full street information, while others provided only city and state. The beauty of the Google API key revealed itself as it effortlessly transformed both types of data, allowing for a seamless integration of geographic insights.

This flexibility became especially crucial as I navigated through points where only city and state information existed. While it may seem like a potential hurdle, the Google API key’s ability to interpret varying address formats ensured that no data was left behind. For my specific analysis, the key only needed to fall within a designated radius of interest, and the results proved to be accurate enough for my purposes.

In handling diverse data formats, the Google API key stands as your ally, harmonizing and geocoding all pieces of the puzzle, ensuring that each contributes to the rich narrative your dataset is meant to tell.

Conclusion

Your geocoding journey has now equipped you to handle diverse datasets, turning a potential challenge into a transformative opportunity. As you venture forth, may your analyses unveil fresh perspectives and illuminate the hidden stories within your spatial data. The Google API key has been the linchpin throughout this exploration, ensuring that every piece of information contributes to the bigger picture. From handling varying addresses to pinpointing detailed locations, the key’s flexibility has proven invaluable, providing accurate insights.

Congratulations on mastering the art of geocoding—may your data-driven adventures persist in revealing the intricate layers of time, geography, and the untold stories that lie within.

December 18, 2023 by Lillian Lau

Infrastructure Investment Selection as a Two-Stage Stochastic Programming Problem (Part 2)

In this post, we will continue where we left off from Part 1, in which we set up a two-stage stochastic programming problem to identify the optimal decision for the type of water supply infrastructure to build. As a recap, our simple case study involves a small water utility servicing the residents of Helm’s Keep (population 57,000) to identify the following:

The best new water supply infrastructure option to build
Its final built capacity
The total bond payment to be made
The expected daily deliveries that ‘ its final built capacity
The expected daily deliveries that it needs to meet

To identify these values, we will be setting up a two-stage stochastic problem using the Python cvxpy library. In this post, we will first review the formulation of the problem, provide information on the installation and use of the cvxpy library, and finally walk through code to identify the optimal solution to the problem.

Reviewing the infrastructure investment selection problem

In our previous post, we identified that Helm’s Keep would like to minimize their infrastructure net present cost (INPC), giving the following objective function:

$min f_{INPC} = \sum_{i=1}^{N}\sum_{s=1}^{S}p_s\sum_{y=1}^{30}\frac{PMT_{s,i}}{(1+d_{s,i})^y}$

where

$N=3$ and $S = 3$ are the total number of infrastructure options and potential future scenarios to consider
$p_s$ is the probability of occurrence for scenario $s \in S$
$y$ is one year within the entire bond term $T =[1,30]$
$PMT$ is the total bond payment, or bond principal
$d_{s,i}$ is the discount rate in scenario $s$ for infrastructure option $i$

In achieving this objective, Helm’s Keep also has to abide by the following constraints:

$y_1 + y_2 + y_3 = 1$

$x_i \leq My_i$

$D_{s,i} \geq \frac{\delta_{s}}{1-MAX(\rho)}$

$D_{s,i} \leq \sum_{i=1}^{3}\frac{x_i}{1-\rho_i}$

$D_{s,i} \leq 8$

$PMT_{s,i} \leq 1.25R_{s,i}$

where

$x_i$ is the final built capacity of infrastructure option $i$
$y_i$ is a binary $[0,1]$ variable indicating if an infrastructure option is built (1) or not (0)
$\delta_{s}$ is the daily demand in scenario $s$
$D_{s,i}$ is the daily water deliveries from infrastructure option $i$ in scenario $s$
$\rho_i$ is the ratio of non-revenue water (NRW) if infrastructure option $i$ is built
$R_{s,i}$ is the net revenue from fulfilling demand (after accounting for NRW) using infrastructure option $i$ in scenario $s$

For the full formulations of $PMT$ and $R$ , please refer to Part 1 of this tutorial.

In this problem, we our first-stage decision variables are the infrastructure option $y_i$ to build and its final built capacity $x_i$ . Our second-stage decision variables are the daily water deliveries made $D_{s,i}$ in each scenario $s$ .

The CVXPY Python library

To solve this problem, we will be using Python’s cvxpy library for convex optimization. It is one of the many solvers available for performing convex optimization in Python including Pyomo, as demonstrated in Trevor’s earlier blog post. Some other options options include PuLP, scipy.optimize, and Gurobi. For the purposes of our specific application, we will be using cvxpy as it can interpret lists and dictionaries of variables, constraints, or objectives. This allows direct appending and retrieval of said objects. This feature therefore makes it convenient to use with for-loops, which are useful in the case of two- or multi-stage stochastic problems where the decision variable space can exponentially grow with the number of scenarios or options being considered. You can find an introduction to cvxpy, documentation, and examples at the CVXPY website.

If you use pip, you can install cvxpy using the following:

pip install cvxpy

To install specific solver names, you can alternatively install cvxpy using

pip install cvxpy[CBC,CVXOPT,GLOP,GLPK,GUROBI,MOSEK,PDLP,SCIP,XPRESS]

where you can substitute the names of any convex optimization solver in the square brackets.

If you use conda, install cvxpy using

conda create --name cvxpy_env
conda activate cvxpy_env
conda install -c conda-forge cvxpy

Once you’ve installed cvxpy, you’re ready to follow along the next part!

Solving the two-stage stochastic programming problem

First, we import the cvxpy library into our file and define the constants and helper functions to calculate the values of $PMT$ and $R$

import cvxpy as cp

# define constants
households = 57000  # number of households
T = 30  # bond term (years) for an infrastructure option 
UR = 0.06 # the uniform water rate per MGD

def calculate_pmt(br, C_i, y_i, VCC_i, x_i, T=30):
    """Calculate the annual payment for a loan.

    Args:
        br (float): The annual interest rate.
        C_i (float): capital cost of infrastructure option i 
        y_i (boolean): 1 if option i is built, 0 if not
        VCC_i (float): variable capital cost of infrastructure option i
        x_i (float): final built capacity of infrastructure option i
        T (const): T=30 years, the bond term in years.

    Returns:
        float: The annual payment amount.
    """
    # calculate p_i, the bond principal (total value of bond borrowed) for
    # infrastructure option i
    p_i = (C_i*y_i) + (VCC_i * x_i)

    # Convert the annual interest rate to a monthly rate.
    pmt = p_i*(br*(1+br)**T)/(((1+br)**T)-1)

    # Calculate the monthly payment.
    return pmt

def calculate_R(D_i, rho_i, VOC_i, households=57000, UR=0.06):
    """
    Calculates the potential net revenue from infrastructure option i in a given scenario.

    Args:
        D_i (float): per-capita daily water demand in MGD
        rho_i (float): percentage of water lost during transmission (non-revenue water, NRW) from infrastructure i to the city
        VOC_i (float): variable operating cost of i
        households (const): households=57000 the number of households in the city
        UR (const): UR=0.06/MGD the per-household uniform water rate

    Returns:
        R_i (float): The potential net revenue from infrastructure option i in a given scenario.
    """
    OC_i = VOC_i * (D_i/rho_i)
    R_i = ((D_i * UR * households)*(1-rho_i)) - OC_i
    return R_i

Then, we define all the variables required. We will first define the first-stage decision variables:

# Infrastructure options as boolean (1, 0) variables
y1 = cp.Variable(boolean = True, name='y1')
y2 = cp.Variable(boolean = True, name='y2')
y3 = cp.Variable(boolean = True, name='y3')

# capacity of each infrastructure option
x1 = cp.Variable(nonneg=True, name = 'x1')
x2 = cp.Variable(nonneg=True, name = 'x2')
x3 = cp.Variable(nonneg=True, name = 'x3')

# infrastructure option parameters
C = [15.6, 11.9, 13.9] # capital cost of each infrastructure option in $mil
VCC = [7.5, 4.7, 5.1] # variable capital cost of each infrastructure option in $mil/MGD capacity
VOC = [0.2, 0.5, 0.9] # variable operating cost of each infrastructure option in $mil/MGD
NRW = [0.2, 0.1, 0.12] # non-revenue water (NRW) for each infrastructure option

Next, we define the second-stage decision variable and the parameter values related to each potential scenario:

# volume of water delivered to the city in each scenario
D = {}
for i in range(3):
    D[i] = cp.Variable(nonneg=True)

P = [0.35, 0.41, 0.24]  # probability of each scenario
demand_increase = [4.4, 5.2, 3.9]  # per-capita daily demand increase in MGD
bond_rate = [0.043, 0.026, 0.052]  # bond rate in each scenario
discount_rate = [0.031, 0.026, 0.025]  # discount rate in each scenario

Note the for loop in the Lines 2-4 of the code snippet above. The cvxpy enables variables to be added to and accessed via a dictionary that allows access via both explicit and in-line for loop, as we will see below in the objective function code definition:

min_inpc = sum(P[s]*sum((calculate_pmt(bond_rate[s], C[0], y1, VCC[0], x1)/((1+discount_rate[s])**t)) for t in range(1,T+1)B) for s in range(3)) + \
    sum(P[s]*sum((calculate_pmt(bond_rate[s], C[0], y2, VCC[1], x2)/((1+discount_rate[s])**t)) for t in range(1,T+1)) for s in range(3)) + \
    sum(P[s]*sum((calculate_pmt(bond_rate[s], C[2], y3, VCC[2], x3)/((1+discount_rate[s])**t)) for t in range(1,T+1)) for s in range(3))

Some explanation is required here. Our goal is to find the minimum INPC required to build the supply required to meet potential demand growth. Our objective function formulation therefore is the sum of the INPC of all three potential infrastructure options, each calculated across the three scenarios. As the $y_i$ variable is binary, only the sum across the three scenarios that requires the optimal infrastructure option will be chosen.

To constrain the solution space of our objective function, we define our constraints. Below, we can see the application of the ability of the cvxpy library that allows constraints to be added iteratively to a list:

constraints = []

# set an arbitrarily large value for M
M = 10e12

for s in range(3):
    constraints += [D[s] >= demand_increase[s]]   # daily water deliveries must be more than or equal to daily demand increase
    constraints += [D[s] <= ((x1/0.1) + (x2/0.1) + (x2/0.12))/1.2]   
    constraints += [calculate_pmt(bond_rate[s], C[0], y1, VCC[0], x1) <= 1.25*calculate_R(demand_increase[s], NRW[0], VOC[0])]
    constraints += [calculate_pmt(bond_rate[s], C[1], y2, VCC[1], x2) <= 1.25*calculate_R(demand_increase[s], NRW[1], VOC[1])]
    constraints += [calculate_pmt(bond_rate[s], C[2], y3, VCC[2], x3) <= 1.25*calculate_R(demand_increase[s], NRW[2], VOC[2])]

constraints += [y1 + y2 + y3 == 1]
constraints += [x1 <= M * y1]
constraints += [x2 <= M * y2]
constraints += [x3 <= M * y3]

Finally, we solve the problem using the Gurobi solver. This solver is selected as it comes pre-installed with the cvxpy library and does not require additional steps or licensing during installation. We also print the objective value and the solutions to the problem:

# set up the problem as a minimization
problem = cp.Problem(cp.Minimize(min_inpc), constraints)

# solve using Gurobi
problem.solve(solver=cp.GUROBI, verbose=False)

print(f'Optimal INPC: ${problem.value} mil' )
for variable in problem.variables():
  print(f"{variable.name()} = {variable.value}")

Obtaining the solutions

If you have closely followed the steps shown above, you would have identified that Helm’s Keep should build Infrastructure Option 3 (a new groundwater pumping station), to a total built capacity that allows total expected daily deliveries of 3.27MGD. This will result in a final INPC of USD$35.62 million. There are our first-stage decision variables.

In each scenario, the following daily deliveries (second-stage decision variables) should be expected:

Scenario	Scenario probability (%)	Demand increase (MGD)	Daily deliveries (MGD)
1	35	4.4	5.5
2	41	5.2	6.5
3	24	3.9	4.875

The values from the second and third column can be found in Part 1 of this tutorial. The final daily deliveries account for the maximum possible portion of NRW.

Let’s identify how much Helm’s Keep will require to pay in total annual bond payments and how much their future expected daily deliveries will be:

total_bond_payment = sum(P[s]*calculate_pmt(bond_rate[s], C[1], 1, VCC[1], x2.value) for s in range(3))
expected_daily_deliveries = sum(P[s]*D[s].value for s in range(3))

If you have closely followed the steps shown above, you would have obtained the following values:

Total annual bond payment	USD$1.55 million
Expected total daily water deliveries	5.76 MGD

Conclusion

Congratulations – you just solved a two-stage programming stochastic programming problem! In this post, we reviewed the content of Part 1, and provided a quick introduction to the cvxpy Python library and justified its use for the purpose of this test case. We also walked through the steps required to solve this problem in Python, identified that it should build a new groundwater pumping station with a 3.27MGD capacity. We also identified the total annual amount Helm’s Keep would have to pay annually to fulfill its debt service requirements, and how much it water would, on average, be expected to provide for its constituents.

I hope that both Parts 1 and 2 provided you with some information on what stochastic programming is, when to use it, and some methods that might be useful to approach it. Thank you for sticking around and happy learning!

December 1, 2023 by Trevor Amestoy

An introduction to linear programming for reservoir operations (part 2) – Implementation with Pyomo

Introduction

Previously, in Part 1 I used a very simple reservoir operations scenario to demonstrate some linear programming (LP) concepts.

After some feedback on my initial formulation I went back and revised the formulation to make sure that (1) both reservoir releases and storage levels are optimized simultaneously and (2) the LP handles decisions over multiple timesteps (1,…,N) during optimization. Definitely look back at Part 1 for more context.

The current LP formulation is as follows:

In this post, I show a simple implementation of this LP using the Pyomo package for solving optimization problems in Python.

I have shared the code used in this demo in a repository here: Reservoir-LP-Demo

Constructing the LP model with Pyomo

While Pyomo can help us construct the LP model, you will need access to a separate solver software in order to actually run the optimization. I don’t get into the details here on how to set up these solvers (see their specific installation instructions), but generally you will need this solver to be accessible on you PATH.

Two solvers that I have had good experience with are:

SCIP Optimization Suite (used here)
GNU Linear Programming Kit

As always, it’s best to consult the Pyomo documentation for any questions you might have. Here, I highlight a few things that are needed for our implementation.

We start by importing the pyomo.environ module:

import pyomo.environ as pyo

From this module we will need to use the following classes to help build our model:

pyo.ConcreteModel()
pyo.RangeSet()
pyo.Var()
pyo.Objective()
pyo.Constraint()
pyo.SolverFactory()

The nice thing about using pyomo rather than trying to manage the LP matrices yourself is that you can specify objectives and constraints as functions.

For example, the objective function is defined as:

# Objective Function
def objective_rule(m):
    return -sum((eta_0 * m.R[t]) + (m.S[t]/S_max*100) for t in m.T)

And a constraint used to enforce the lower limit of the storage mass balance can defined as:

def S_balance_lower(m, t):
    if t == 0:
        return m.S[t] + m.R[t] <= initial_storage + I[t] - D[t]
    return m.S[t] + m.R[t] <= m.S[t-1] + I[t] - D[t]

Rather than picking the full implementation apart, I present the entire function below, and encourage you to compare the code implementation with the problem definition above.

def pyomo_lp_reservoir(N, S_min, S_max, R_min, R_max, 
                       eta_0, I, D,  
                       initial_storage):

    # Model
    model = pyo.ConcreteModel()

    # Time range
    model.T = pyo.RangeSet(0, N-1)  

    # Decision Variables
    model.S = pyo.Var(model.T, bounds=(S_min, S_max))  # Storage
    model.R = pyo.Var(model.T, bounds=(R_min, R_max))  # Release

    # Objective Function
    def objective_rule(m):
        return -sum((eta_0 * m.R[t]) + (m.S[t]/S_max*100) for t in m.T)
    model.objective = pyo.Objective(rule=objective_rule, sense=pyo.minimize)

    # Constraints
    def S_balance_lower(m, t):
        if t == 0:
            return m.S[t] + m.R[t] <= initial_storage + I[t] - D[t]
        return m.S[t] + m.R[t] <= m.S[t-1] + I[t] - D[t]

    def S_balance_upper(m, t):
        if t == 0:
            return -(m.S[t] + m.R[t]) <= -(initial_storage + I[t] - D[t])
        return -(m.S[t] + m.R[t]) <= -(m.S[t-1] + I[t] - D[t])
    model.S_lower = pyo.Constraint(model.T, rule=S_balance_lower)
    model.S_upper = pyo.Constraint(model.T, rule=S_balance_upper)
    model.S_final = pyo.Constraint(expr=model.S[N-1] == initial_storage)

    # Solve
    solver = pyo.SolverFactory('scip')
    results = solver.solve(model)

    if results.solver.status == pyo.SolverStatus.ok:
        S_opt = np.array([pyo.value(model.S[t]) for t in model.T])
        R_opt = np.array([pyo.value(model.R[t]) for t in model.T])
        return S_opt, R_opt
    else:        
        raise ValueError('Not able to solve.')

Note that in this implementation, pyomo will optimize all of the reservoir release and storage decisions simultaneously, returning the vectors of length N which prescribe the next N days of operations.

Usage

Now we are ready to use our LP reservoir simulator. In the code block below, I set some specifications for our operational constraints, generate fake inflow and demand timeseries, run the LP solver, and plot the simulated results:

# spcifications
N_t = 30
S_min = 2500
S_max = 5000
R_min = 10
R_max = 1000
eta = 1.2

# Generate simple inflow and demand data
I, D = generate_data(N_t, correlation_factor = -0.70,
                     inflow_mean=500, inflow_std=100,
                     lag_correlation=0.2)


# Run LP operation simulation
S_sim, R_sim = pyomo_lp_reservoir(N_t, S_min, S_max, R_min, R_max, eta, I, D, 
                                  initial_storage=S_max)

# Plot results
plot_simulated_reservoir(I, D,
                         R_sim, S_sim, 
                         S_max, eta=eta)

The operations are shown:

Under this LP formulation, with a perfect inflow forecast, the reservoir operates as a “run of river” with the release rates being very close to the inflow rate.

In practice, operators may need to limit the difference in release volume from day-to-day. I added an optional parameter (R_change_limit) which adds a constraint on the difference subsequent releases from between each day.

The operations, with the daily release change rate limited to 50 is shown below.

Conclusions

The way you formulate the an LP problem dictates the “optimal” decisions that you will generate. The purpose of these past two posts was to make some attempt at giving a practice demo of some basic LP concepts. I hope for some that it might be useful as a starting point.

November 7, 2023 by Lillian Lau

Infrastructure Investment Selection as a Two-Stage Stochastic Programming Problem (Part 1)

About Stochastic Programming

Stochastic programming (SP) is an approach to decision-making under uncertainty (alongside robust optimization and chance constrained optimization). It is an extension of deterministic optimization techniques such as linear (or non-linear) programming that incorporates stochastic elements (hence the name) in the form of multiple potential future scenarios. Overall, the aim of stochastic programming is to identify optimal solutions to a given problem that perform well under a finite range of possible scenarios. This is done by optimizing the expected value of the objective function over all scenarios. For clarification, a “scenario” in the context of SP is defined as the combined realization (expected outcome) of all future uncertainties.

There are two versions of an SP problem: the two-stage SP problem consists of two decision-making stages, where the first stage decisions represent the decisions taken before the uncertainty (the future scenario, in this case) unfolds and becomes known. The second stage decisions are taken once the future has unfolded and corrective action needs to be taken. Two-stage SP problems are typically used for scenario planning, where a decision-maker’s goal is to identify a solution that will result in good performance across all potential scenarios. Next, the multi-stage SP problem with $s$ -stages is an extension of the two-stage SP problem, where the decision at stage $s$ is dependent on all decisions made up until $s$ . Due to the problem’s high dimensionality, the multi-stage SP problem is typically represented using a scenario tree with each branch associated with a probable outcome and its probability and each level associated with a stage $s$ .

In this post, I will provide a two-stage stochastic programming example discuss the pros and cons of its use. In the following post, I will and demonstrate how to approach and solve such a problem using the Python cvxpy library.

The infrastructure selection example

Disclaimer: All values are arbitrary .

A water utility serving the residents of Helm’s Keep (57,000 households) are deciding on a water supply storage expansion project to meet projected increase in demand and wants to identify a suitable yet affordable project to invest in:

Option 1: Water treatment capacity expansion on the distant (but yet untapped) Lake Heleborn
Option 2: The construction of small local reservoir on the outskirts of Helm’s Keep
Option 3: Construction of a new groundwater pumping station

Each of these infrastructure options are associated with the following capital investment costs. The variable capital cost and variable operating cost are functions of the infrastructure options and expected demand (both in MGD).

	Fixed capital cost ($mil)	Variable capital cost ($mil/MGD capacity)	Variable operating cost ($mil/MGD demand)
Option 1	15.6	7.5	0.2
Option 2	11.9	4.7	0.5
Option 3	13.9	5.1	0.9

Helm’s Keep’s priority in the selection of its new water supply infrastructure is to minimize its total infrastructure net present cost:

$f_{INPC}=\sum_{y=1}^{T}\frac{PMT}{(1+d)^y}$

where $INPC$ is the infrastructure net present cost, $T=30$ is bond term for an infrastructure option, and is the discount rate for a given year $y$ of debt service payment $d$ for an infrastructure option since its bond was issued. The value of $PMT$ is calculated as follows:

$\frac{P[BR(1+BR)^T]}{(1+BR)^{T}-1}$

where $P$ is the bond principal (the total value of the bond), $BR$ is the bond interest rate to be paid to the creditor throughout the bond term $T$ . All payments are discounted to their present value. The value of the bond principal is dependent on the capacity of the infrastructure option and is calculated as a function of the fixed and variable capital costs:

$P_i = C_iy_i + VCC_ix_i$

such that $C_i$ is the capital cost of infrastructure option $i$ and $y_i$ is a binary ${0,1}$ variable indicating if option $i$ is chosen. $VCC_i$ is the variable capital cost of infrastructure option $i$ , and $x_i$ is the final built capacity of $i$ .

In addition, there is some water loss in transporting water from either location due to aging infrastructure, which will result in losses caused by non-revenue water (NRW). The losses associated with option are listed below:

Option 1: 20% water loss
Option 2: 10% water loss
Option 3: 12% water loss

Helm’s Keep must also consider the debt is undertaking relative to its potential annual net revenue $R_{s,i}$ , calculated as follows:

$R_{s,i}=[(D_{s,i}\times UR \times \text{households})(1-\rho_{i})] - OC_{s,i}$

where $D_{s,i}$ is per-capita daily water delivered, $UR=\$0.06/MGD$ is the per-household uniform water rate, and $\rho_{i}$ is the percentage of water lost during transmission given that infrastructure $i$ was chosen. $OC_{s,i}$ is Helm’s Keep’s daily operating costs per capita where

$OC_{s,i}=VOC_i \frac{D_{s,i}}{\rho_i}$

such that $D_{s,i}$ is the demand in scenario $s$ for infrastructure option $i$ , $VOC_{s,i}$ is the variable operating cost of $i$ , and $\rho_i$ is the percentage of NRW.

However, it must also meet the following constraints:

The infrastructure option decision is exclusive (only one option can be selected).
The total capacity of the chosen infrastructure option cannot be less than 1.2 times that of daily demand increase (after accounting for NRW).
Daily deliveries be at least equal to demand (after accounting for NRW).
All water produced (after accounting for NRW) are considered for-revenue water and result in gross revenue.
Annual payments $PMT_{s,i}$ should not exceed 1.25 times that of Helm’s Keep’s annual net revenue $R_{s,i}$ .

There are three possible scenarios $s \in S = {1,2,3}$ of water demand increase $\delta_s$ , Federal Reserve bond interest rate, and discount rate. The probability of each scenario is listed in the table below:

Scenarios	Demand increase (MGD)	Bond rate (%)	Discount rate (%)	Probability (%)
Scenario 1	4.4	4.3	3.1	0.35
Scenario 2	5.2	2.6	2.6	0.41
Scenario 3	3.9	5.2	2.5	0.24

Helm’s Keep faces the following decisions:

Stage 1: Which infrastructure option $i$ to build and to what capacity $x_i$ for $i \forall {1,2,3}$ ?
Stage 2: How much will Helm’s Keep’s corresponding bond principal be how much water is delivered in each scenario?

Formulating the problem

Let the first-stage decision variables be:

Infrastructure option selection $y_i$ where $y_1 + y_2 + y_3 = 1$ and $y_i \in {0,1} \forall i=[1,2,3]$
Capacity of each infrastructure option $x_i \forall i = {1,2,3}$

Next, we set up the second stage decision variables. Let $D_s$ , $BR_s$ and $d_s$ be the amount of water delivered, bond rate, and discount rate, in scenario $s \in S = {1,2,3}$ . This will result in the following problem setup:

$min f_{INPC} = \sum_{i=1}^{N}\sum_{s=1}^{S}p_s\sum_{y=1}^{30}\frac{PMT_{s,i}}{(1+d_{s,i})^y}$

where

$PMT_{s,i}=(C_i y_i + VCC_i x_i) \frac{[BR_{s,i}(1+BR_{s,i})^{30}]}{(1+BR_{s,i})^{30} -1}$

with constraints

$y_1 + y_2 + y_3 = 1$

$D_{s,i} \geq \frac{\delta_{s}}{1-MAX(\rho)}$

$D_{s,i} \geq \sum_{i=1}^{3}\frac{x_i}{1-\rho_i}$

$D_{s,i} \leq 8$

$PMT_{s,i} \leq 1.25R_{s,1}$

$x_i \leq My_i$

such that

$R_{s,i} = [(3420D_{s,i})(1-\rho_i)] - (VOC_i\frac{D_{s,i}}{\rho_s})$

and $p_s$ is the probability of scenario $s$ and $UR \times \text{households}=3420$ .

Some caveats

As we walked through the formulation of this two-stage stochastic programming problem, it quickly becomes apparent that the computational complexity of such an approach can quickly become intractable when the number of uncertain scenarios and the complexity of the model (large number of uncertain parameters) is high. In addition, this approach assumes a priori knowledge of future uncertainty and sufficient underlying expert knowledge to generate 6uncertainty estimates, which cannot guarantee that the theoretical robustness of the identified solution to real world outcomes if the estimated uncertainty does not match the estimated uncertainty. We can see that this method may therefore be best suited for problems where uncertainty can be accurately characterized and modeled as random variables or scenarios. It is nonetheless a useful approach for addressing problems in manufacturing, production line optimization, financial portfolio design, etc, where data is abundant and uncertainty is well-characterized, and then identifying solutions that perform well across a finite number of scenarios.

In this post, we covered an introduction to stochastic programming and provided a 2-stage infrastructure investment selection example to demonstrate the method’s key concepts. In the next post, we will walk through the installation and use of the Python cvxpy library and use it to solve this problem to identify the most suitable infrastructure option to invest in, its final built capacity, and its total demand fulfilled.

Plotting a Sankey streamflow network with plotly

Defining the flow network & data retrieval

Creating the Sankey figure

Conclusions

References

Links

Runtime diagnostics across 20 different seeds for one set of parameters

Generating MOEA parameters and running the optimization

Generate the reference set

Generate the runtime metrics

Visualizing runtime diagnostics

Summary

Motivation

Introduction to PyMC

Demonstration:

CAMELS Data

Bayesian linear model

PyMC implementation

PyMC resources:

References

TheCube Cluster Setup

Example 1: Hello from Threads

Example 2: Who’s (seg)Fault Is It?

Summary

The Thomas-Fiering Model

Python Implementation

Synthetic ensemble results

Conclusions

References

Model-agnostic vs model-specific

Model-specific methods

Model-agnostic methods

Why does this matter?

A quick example using the North Carolina Research Triangle

The input (feature) set and training dataset

The model prediction

The LIME explainer

Summary

References

Introduction

Preparing your environment

Formatting your data

Obtain a Google API Key

Geocoding with Pandas DataFrame

Dealing with incomplete data

Conclusion

Reviewing the infrastructure investment selection problem

The CVXPY Python library

Solving the two-stage stochastic programming problem

Obtaining the solutions

Conclusion

Introduction

Constructing the LP model with Pyomo

Usage

Conclusions

About Stochastic Programming

The infrastructure selection example

Formulating the problem

Some caveats