Performing runtime diagnostics using MOEAFramework

In this blog post, we will review how to perform runtime diagnostics using MOEAFramework. This software has been used in prior blog posts by Rohini and Jazmin to perform MOEA diagnostics across multiple MOEA parameterizations. Since then, MOEAFramework has undergone a number of updates and structural changes. This blog post will walk through the updated functionality of running MOEAFramework (version 4.0) via the command line to perform runtime diagnostics across 20 seeds using one set of parameters. We will be using the classic 3-objective DTLZ2 problem optimized using NSGAII, both of which are built into MOEAFramework.

Before we begin, some helpful tips and system configuration requirements:

  • Ensure that you have the latest version of Java installed (as of April 2024, this is Java Version 22). The current version of MOEAFramework was compiled using class file version 61.0, which was made available in Java Version 17 (find the complete list of Java versions and their associated class files here). This is the minimum requirement for being able to run MOEAFramework.
  • The following commands are written for a Linux interface. Download a Unix terminal emulator like Cygwin if needed.

Another helpful tip: to see all available flags and their corresponding variables, you can use the following structure:

java -cp MOEAFramework-4.0-Demo.jar org.moeaframework.mainClass.subClass.functionName --help

Replace mainClass, subClass, and functionName with the actual class, subclass, and function names. For example,

java -cp MOEAFramework-4.0-Demo.jar org.moeaframework.analysis.tools.SampleGenerator --help

You can also replace --help with -h (if the last three letters prove too much to type for your weary digits).

Runtime diagnostics across 20 different seeds for one set of parameters

Generating MOEA parameters and running the optimization

To run NSGAII using one set of parameters, make sure to have a “parameters file” saved as a text file containing the following:

populationSize 10.0 250.999
maxEvaluations 10000 10000
sbx.rate 0.0 1.0
sbx.distributionIndex 0.0 100.0
pm.rate 0.0 1.0
pm.distributionIndex 0.0 100.0

For a full list of parameter files for each of the in-built MOEAFramework algorithms, please see Jazmin’s post here.

In this example, I have called it NSGAII_Params.txt. Note that maxEvaluations is set to 10,000 on both its lower and upper bounds. This is because we want to fix the number of function evaluations completed by NSGAII. Next, in our command line, we run:

java -cp MOEAFramework-4.0-Demo.jar org.moeaframework.analysis.tools.SampleGenerator --method latin --numberOfSamples 1 --parameterFile NSGAII_Params.txt --output NSGAII_Latin

The output NSGAII_Latin file should contain a single line and can be opened as a text file. It should have six tab-delimited values that correspond to the six parameters in the input file you generated. Now that you have your MOEA parameter sample, let’s begin running some optimizations!

First, make a new folder in your current directory to store your output data. Here, I am simply calling it data.

mkdir data

Next, optimize the DTLZ2 3-objective problem using NSGAII:

for i in {1..20}; do java -cp MOEAFramework-4.0-Demo.jar org.moeaframework.analysis.tools.RuntimeEvaluator --parameterFile NSGAII_Params.txt --input NSGAII_Latin --problem DTLZ2_3 --seed $i --frequency 1000 --algorithm NSGAII --output data/NSGAII_DTLZ2_3_$i.data; done

Here’s what’s going down:

  • First, you are performing a runtime evaluation of the optimization of the 3-objective DTLZ2 problem using NSGAII
  • You are obtaining the decision variables and objective values at every 1,000 function evaluations, effectively tracking the progress of NSGAII as it attempts to solve the problem
  • Finally, you are storing the output in the data/ folder
  • You then repeat this for 20 seeds (or for as many as you so desire).

Double check your .data file. It should contain information on your decision variables and objective values at every 1,000 NFEs or so, separated from the next thousand by a “#”.

Generate the reference set

Next, obtain only the objective values at every 1,000 NFEs by entering the following into your command line:

for i in {1..20}; do java -cp MOEAFramework-4.0-Demo.jar org.moeaframework.analysis.tools.ResultFileMerger --problem DTLZ2_3 --output data/NSGAII_DTLZ2_3_$i.set --epsilon 0.01,0.01,0.01 data/NSGAII_DTLZ2_3_$i.data; done

Notice that we have a new flag here – the --epsilon flag tells MOEAFramework that you only want to keep non-dominated solutions whose objective values differ by at least 0.01 in a given objective. For example, with an epsilon of 0.01 in each objective, two solutions whose objective values fall within the same 0.01-wide epsilon box in every objective are treated as equivalent, and only one of them is retained. This helps to trim down the size of the final reference set (coming up soon) and removes solutions that are only marginally better (and may not be decision-relevant in the real-world context).

On to generating the reference set – let’s combine all objectives across all seeds using the following command line directive:

for i in {1..20}; do java -cp MOEAFramework-4.0-Demo.jar org.moeaframework.analysis.tools.ReferenceSetMerger --output data/NSGAII_DTLZ2_3.ref --epsilon 0.01,0.01,0.01 data/NSGAII_DTLZ2_3_$i.set; done

Your final reference set should now be contained within the NSGAII_DTLZ2_3.ref file in the data/ folder.

Generate the runtime metrics

Finally, let’s generate the runtime metrics. To avoid any mix-ups, let’s create a folder to store these files:

mkdir data_metrics

And finally, generate our metrics!

for i in {1..20}; do java -cp MOEAFramework-4.0-Demo.jar org.moeaframework.analysis.tools.ResultFileEvaluator --problem DTLZ2_3 --epsilon 0.01,0.01,0.01 --input data/NSGAII_DTLZ2_3_$i.data --reference data/NSGAII_DTLZ2_3.ref --output data_metrics/NSGAII_DTLZ2_3_$i.metrics; done

If all goes well, you should see 20 files (one for each seed) similar in structure to the one below in your data_metrics/ folder:

The header values are the names of each of the MOEA performance metrics that MOEAFramework measures. In this blog post, we will proceed with visualizing the hypervolume over time across all 20 seeds.

Visualizing runtime diagnostics

The following Python code first extracts the metric that you would like to view, and saves the plot as a PNG file in the data_metrics/ folder:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

# define constants 
num_seeds = 20
NFE = 10000 
freq = 1000
num_output = int(NFE/freq)

algorithm = 'NSGAII'
problem = 'DTLZ2_3'
folder_name = 'data_metrics/'
metric_name = 'Hypervolume'
# create matrix of hypervolume runtimes 
hvol_matrix = np.zeros((num_seeds, num_output), dtype=float)
for seed in range(num_seeds):
    runtime_df = pd.read_csv(f'{folder_name}{algorithm}_{problem}_{seed+1}.metrics', delimiter=' ', header=0)
    if metric_name == 'Hypervolume':
        hvol_matrix[seed] = runtime_df['#Hypervolume'].values
    else:
        hvol_matrix[seed] = runtime_df[metric_name].values

# plot the hypervolume over time
fig, ax = plt.subplots(figsize=(10, 6))

ax.fill_between(np.arange(freq, NFE+freq, freq), np.min(hvol_matrix, axis=0), np.max(hvol_matrix, axis=0), color='paleturquoise', alpha=0.6)
ax.plot(np.arange(freq, NFE+freq, freq), np.min(hvol_matrix, axis=0), color='darkslategrey', linewidth=2.0)
ax.plot(np.arange(freq, NFE+freq, freq), np.max(hvol_matrix, axis=0), color='darkslategrey', linewidth=2.0)
ax.plot(np.arange(freq, NFE+freq, freq), np.mean(hvol_matrix, axis=0), color='darkslategrey', linewidth=2.0 ,label='Mean', linestyle='--')

ax.set_xlabel('NFE')
ax.set_xlim([freq, NFE+freq])
ax.set_ylabel(metric_name)
ax.set_ylim([0, 1])
ax.set_title(f'{metric_name} over time')
ax.legend(loc='upper left')

plt.savefig(f'{folder_name}{algorithm}{problem}_{metric_name}.png')

If you correctly implemented the code, you should be able to view the following figure, which shows how the hypervolume attained by the NSGAII algorithm improves steadily over time.

In the figure above, the colored inner region spans the hypervolume attained across all 20 seeds, with the dashed line representing the mean hypervolume over time. The solid upper and lower bounding lines are the maximum and minimum hypervolume achieved every 1,000 NFEs, respectively. Note that, in this specific instance, NSGAII only achieves about 50% of the total hypervolume of the overall objective space. This implies that a higher NFE (a longer runtime) is required for NSGAII to further increase the hypervolume achieved. Nonetheless, the rate of hypervolume increase is gradually decreasing, indicating that this particular parameterization of NSGAII is fast approaching its maximum possible hypervolume, with additional NFEs contributing only small improvements to performance. It is also worth noting the narrow range of hypervolume values, especially as the number of NFEs grows larger. This reflects the reliability of the NSGAII algorithm, demonstrating that it can somewhat reliably reproduce results across multiple different seeds.

Summary

This just about brings us to the end of this blog post! We’ve covered how to perform MOEA runtime diagnostics and plot the results. If you are curious, here are some additional things to explore:

  • Plot different performance metrics against NFE. Please see Joe Kasprzyk’s post here to better understand the plots you generate.
  • Explore different MOEAs that are built into MOEAFramework to see how they perform across multiple seeds.
  • Generate multiple MOEA parameter samples using the in-built MOEAFramework Latin Hypercube Sampler to analyze the sensitivity of a given MOEA to its parameterization.
  • Attempt examining the runtime diagnostics of Borg MOEA using the updated version of MOEAFramework.

On that note, make sure to check back for updates as MOEAFramework is being actively reviewed and improved! You can view the documentation of Version 4.0 here and access its GitHub repository here.

Happy coding!

Introduction to Bayesian Regression using PyMC

Motivation

Fans of this blog will know that uncertainty is often a focus for our group. When approaching uncertainty, Bayesian methods might be of interest since they explicitly provide uncertainty estimates during the modeling process.

PyMC is the best tool I have come across for Bayesian modeling in Python; this post gives a super brief introduction to this toolkit.

Introduction to PyMC

PyMC, described in their own words:
“… is a probabilistic programming library for Python that allows users to build Bayesian models with a simple Python API and fit them using Markov chain Monte Carlo (MCMC) methods.”

In my opinion, the best part of PyMC is the flexibility and breadth of model design features. The space of different model configurations is massive. It allows you to build models ranging from simple linear regressions (shown here) to more complex hierarchical models, copulas, Gaussian processes, and more.

Regardless of your model formulation, PyMC lets you generate posterior estimates of model parameter distributions. These parameter distributions reflect the uncertainty in the model, and can propagate uncertainty into your final predictions.

The posterior estimates of model parameters are generated using Markov chain Monte Carlo (MCMC) methods. A detailed overview of MCMC is outside the scope of this post (maybe in a later post…).

In the simplest terms, MCMC is a method for estimating posterior parameter distributions for a Bayesian model. It generates a sequence of samples from the parameter space (which can be huge and complex), where the probability of each sample is proportional to its likelihood given the observed data. By collecting enough samples, MCMC generates an approximation of the posterior distribution, providing insights into the probable values of the model parameters along with their uncertainties. This is key when the models are very complex and the posterior cannot be directly defined.

The PyMC example gallery has lots of cool stuff to get you inspired, with examples that go far and beyond the simple linear regression case.


Demonstration:

When drafting this post, I wanted to include a demonstration which is (a) simple enough to cover in a brief post, and (b) relatively easy for others to replicate. I settled on the simple linear regression model described below, since it could be done using readily retrievable CAMELS data.

The example attempts to predict mean streamflow as a linear function of basin catchment area (both in log space). As you’ll see, it’s not the worst model, but it’s far from a good model; there is a lot of uncertainty!

CAMELS Data

For a description of the CAMELS dataset, see Addor, Newman, Mizukami and Clark (2017).

I pulled all of the national CAMELS data using the pygeohydro package from HyRiver, which I have previously recommended on this blog. This is a convenient single line of code to get all the data:

import pygeohydro as gh

### Load camels data
camels_basins, camels_qobs = gh.get_camels()

The camels_basins variable is a dataframe with the different catchment attributes, and camels_qobs is an xarray.Dataset. In this case we will only be using the camels_basins data.

The CAMELS data spans the continental US, but I want to focus on a specific region (since hydrologic patterns will be regional). Before going further, I filter the data to keep only sites in the Northeastern US:

# filter by mean long lat of geometry: NE US
camels_basins['mean_long'] = camels_basins.geometry.centroid.x
camels_basins['mean_lat'] = camels_basins.geometry.centroid.y
camels_basins = camels_basins[(camels_basins['mean_long'] > -80) & (camels_basins['mean_long'] < -70)]
camels_basins = camels_basins[(camels_basins['mean_lat'] > 35) & (camels_basins['mean_lat'] < 45)]

I also convert the mean flow data (q_mean) units from mm/day to cubic meters per second:

# convert q_mean from mm/day to m3/s
camels_basins['q_mean_cms'] = camels_basins['q_mean'] * (1e-3) *(camels_basins['area_gages2']*1000**2) * (1/(60*60*24)) 
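As a quick dimensional check (assuming area_gages2 is reported in square kilometers, as in the CAMELS attribute tables), the conversion works out as:

\frac{\text{mm}}{\text{day}} \times 10^{-3}\,\frac{\text{m}}{\text{mm}} \times \text{km}^2 \times 1000^2\,\frac{\text{m}^2}{\text{km}^2} \times \frac{1\ \text{day}}{86{,}400\ \text{s}} = \frac{\text{m}^3}{\text{s}}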

And this is all the data we need for this crude model!

Bayesian linear model

The simple linear regression model (hello my old friend):
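Written out, with y the (log) mean flow, x the (log) basin area, and a normally distributed error term, it takes the familiar form below; the notation mirrors the PyMC model constructed later in this post:

y = \alpha + \beta x + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^{2})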

Normally you might assume that there is a single, best value corresponding to each of the model parameters (alpha and beta). This is considered a Frequentist perspective and is a common approach. In these cases, the best parameters can be estimated by minimizing the errors corresponding to a particular set of parameters (see least squares, for example).

However, we could take a different approach and assume that the parameters (intercept and slope) are random variables themselves, and have some corresponding distribution. This would constitute a Bayesian perspective.

Keeping with simplicity in this example, I will assume that the intercept and slope each come from a normal distribution with a mean and variance such that:
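One plausible way to write this (matching the priors used in the PyMC model below, where the means come from a least-squares fit and the standard deviations are set to 10) is:

\alpha \sim \mathcal{N}(\mu_{\alpha}, \sigma_{\alpha}^{2}), \qquad \beta \sim \mathcal{N}(\mu_{\beta}, \sigma_{\beta}^{2})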

When it comes time to make inferences or predictions using our model, we can create a large number of predictions by sampling different parameter values from these distributions. Consequently, we will end up with a distribution of uncertain predictions.

PyMC implementation

I recommend you see the PyMC installation guide to help you get set up.

NOTE: The MCMC sampler used by PyMC is written in C and will be SIGNIFICANTLY faster if you have access to a GCC compiler and specify its directory using the following:

import pymc as pm

import os
os.environ["THEANO_FLAGS"] = r"gcc__cxxflags=-C:\mingw-w64\mingw64\bin"  # raw string so the backslashes are not treated as escape sequences

You will get a warning if you don’t have this properly set up.

Now, onto the demo!

I start by retrieving our X and Y data from the CAMELS dataset we created above:

import numpy as np

# Pull out X and Y of interest
x_ftr = 'area_gages2'
y_ftr = 'q_mean_cms'
xs = camels_basins[x_ftr]
ys = camels_basins[y_ftr]

# Take log-transform 
xs = np.log(xs)
ys = np.log(ys)

At a glance, we see there is a reasonable linear relationship when working in the log space:
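A quick scatter plot is enough to generate that first look (a minimal sketch using the xs and ys defined above; the axis labels are my own choice):

import matplotlib.pyplot as plt

# quick visual check of the log-log relationship
fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(xs, ys, alpha=0.6, s=20, color='k')
ax.set_xlabel('Log basin area (area_gages2)')
ax.set_ylabel('Log mean flow (m3/s)')
plt.show()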

Two of the key features when building a model are:

  • The random variable distribution constructions
  • The deterministic model formulation

There are lots of different distributions available, and each one simply takes a name and set of parameter values as inputs. For example, the normal distribution defining our intercept parameter is:

alpha = pm.Normal('alpha', mu=intercept_prior, sigma=10)

The value of the parameter priors that you specify when constructing the model may have a big impact depending on the complexity of your model. For simple models you may get away with having uninformative priors (e.g., setting mu=0); however, if you have some initial guesses then that can help with getting reliable convergence.

In this case, I used a simple least squares estimate of the linear regression as the parameter priors:

slope_prior, intercept_prior = np.polyfit(xs.values.flatten(), ys.values.flatten(), 1)

Once we have our random variables defined, we need to formulate the deterministic element of our model prediction. This is the functional relationship between the input, parameters, and output. For our linear regression model, this is simply:

y_mu = alpha + beta * xs

In the case of our Bayesian regression, this can be thought of as the mean of the regression outputs. The final estimates are going to be distributed around the y_mu with the uncertainty resulting from the combinations of our different random variables.

Putting it all together now:

### PyMC linear model
with pm.Model() as model:
    
    # Priors
    alpha = pm.Normal('alpha', mu=intercept_prior, sigma=10)
    beta = pm.Normal('beta', mu=slope_prior, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=1)

    # mean/expected value of the model
    mu = alpha + beta * xs

    # likelihood
    y = pm.Normal('y', mu=mu, sigma=sigma, observed=ys)

    # sample from the posterior
    trace = pm.sample(2000, cores=6)
 

With our model constructed, we can use the pm.sample() function to begin the MCMC sampling process and estimate the posterior distribution of model parameters. Note that this process can be very computationally intensive for complex models! (Definitely make sure you have the GCC set up correctly if you plan on needing to sample complex models.)

Using the sampled parameter values, we can create posterior estimates of the predictions (log mean flow) using the posterior parameter distributions:

## Generate posterior predictive samples
ppc = pm.sample_posterior_predictive(trace, model=model)

Let’s go ahead and plot the range of the posterior distribution, to visualize the uncertainty in the model estimates:

### Plot the posterior predictive interval
import arviz as az
import matplotlib.pyplot as plt

fig, ax = plt.subplots(ncols=2, figsize=(8,4))

# log space
az.plot_hdi(xs, ppc['posterior_predictive']['y'], 
            color='cornflowerblue', ax=ax[0], hdi_prob=0.9)
ax[0].scatter(xs, ys, alpha=0.6, s=20, color='k')
ax[0].set_xlabel('Log ' + x_ftr)
ax[0].set_ylabel('Log Mean Flow (m3/s)')

# original dim space
az.plot_hdi(np.exp(xs), np.exp(ppc['posterior_predictive']['y']), 
            color='cornflowerblue', ax=ax[1], hdi_prob=0.9)
ax[1].scatter(np.exp(xs), np.exp(ys), alpha=0.6, s=20, color='k')
ax[1].set_xlabel(x_ftr)
ax[1].set_ylabel('Mean Flow (m3/s)')
plt.suptitle('90% Posterior Prediction Interval', fontsize=14)
plt.show()

And there we have it! The figure on the left shows the data and posterior prediction range in log-space, while the figure on the right is in non-log space.

As mentioned earlier, it’s not the best model (wayyy too much uncertainty in the large-basin mean flow estimates), but at least we have the benefit of knowing the uncertainty distribution since we took the Bayesian approach!

That’s all for now; this post was really meant to bring PyMC to your attention. Maybe you have a use case or will be more likely to consider Bayesian approaches in the future.

If you have other Bayesian/probabilistic programming tools that you like, please do comment below. PyMC is one (good) option, but I’m sure other people have their own favorites for different reasons.



References

Addor, N., Newman, A. J., Mizukami, N. and Clark, M. P. The CAMELS data set: catchment attributes and meteorology for large-sample studies, Hydrol. Earth Syst. Sci., 21, 5293–5313, doi:10.5194/hess-21-5293-2017, 2017.

An introduction to linear programming for reservoir operations (part 1)

Motivation:

If you were like me, you may have been provided an overview of linear programming (LP) methods but craved a more hands-on demonstration, as opposed to the abstract/generic mathematical formulation that is often presented in lectures. This post is for the hydrology enthusiasts out there who want to improve their linear programming understanding by stepping through a contextual example.

In this post, I provide a very brief intro to linear programming then go through the process of applying a linear programming approach to a simple hydroelectric reservoir problem focused on determining a release rate that will consider both storage targets and energy generation. In a later post, I will come back to show how the formulation generated here can be implemented within simulation code.

Content:

  • An introduction to linear programming
    • Formulating an LP
    • Solving LP problems: Simplex
    • Relevance for water resource managers
  • A toy reservoir demonstration
    • The problem formulation
      • Constraints
      • Objective
  • Conclusions and what’s next

An introduction to linear programming

Linear programming (LP) is a mathematical optimization technique for making decisions under constraints. The aim of an LP problem is to find the best outcome (either maximizing or minimizing) given a set of linear constraints and a linear objective function.

Formulating an LP

The quality of an LP solution is only as good as the formulation used to generate it.

An LP formulation consists of three main components:

1. Decision Variables: These are the variables that you have control over. The goal here is to find the specific decision variable values that produce optimal outcomes. There are two decision variables shown in the figure below, x1 and x2.

2. Constraints: These are the restrictions which limit what decisions are allowed. (For our reservoir problem, we will use constraints to make sure total mass is conserved, storage levels stay within limits, etc.) The constraints functions are required to be linear, and are expressed as inequality relationships.

3. Objective Function: This is the (single) metric used to measure outcome performance. This function is required to be linear and its value is denoted Z in the figure below, where it depends on two decision variables (x1 and x2).

In practice, the list of constraints on the system can get very large. Fortunately matrix operations can be used to solve these problems quickly, although this requires that the problem is formatted in “standard form” shown above. The standard form arranges the coefficient values for the constraints within matrices A and b.

Solving LP problems: keep it Simplex

Often the number of potential solutions that satisfy your constraints will be very large (infinitely large for continuous variables), and you will want to find the one solution in this region that maximizes/minimizes your objective of interest (the “optimal solution”).

The set of all potential solutions is referred to as the “feasible space” (see the graphic below), with the constraint functions forming the boundary edges of this region. Note that by changing the constraints, you are inherently changing the feasible space and thus are changing the set of solutions that you are or are not considering when solving the problem.

The fundamental theorem of linear programming states that the optimal solution for an LP problem will exist at one of the corners (vertices) of the feasible space.

Recognizing this, George Dantzig proposed the Simplex algorithm, which strategically moves between vertices along the boundary of the feasible region, testing for optimality until the best solution is identified.

A detailed review of the Simplex algorithm warrants its own blog post, and won’t be explained here. Fortunately, there are various open-source LP solvers. For example, one option in Python is scipy.optimize.linprog().
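To make the connection concrete, here is a minimal sketch (a toy two-variable LP, not the reservoir problem) showing how a standard-form problem maps onto scipy.optimize.linprog:

from scipy.optimize import linprog

# minimize c^T x  subject to  A_ub @ x <= b_ub  and  x >= 0
c = [-1.0, -2.0]               # maximize x1 + 2*x2 by minimizing its negative
A_ub = [[1.0, 1.0],            # x1 + x2 <= 4
        [1.0, 3.0]]            # x1 + 3*x2 <= 6
b_ub = [4.0, 6.0]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(result.x, result.fun)    # optimum lands at the vertex x = [3, 1]

Note that the optimum sits on a vertex of the feasible region, exactly as the fundamental theorem above predicts.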

Relevance for water resource managers

Why should you keep reading this post?

If you work in water resource systems, you will likely need to make decisions in highly constrained settings. In these cases, LP methods can be helpful. In fact there are many scenarios in water resource management in which LP methods can (and historically have) been useful. Some general application contexts include:

  • Resource allocation problems: LP can be used to allocate water efficiently among different users like agriculture, industry, and municipalities.
  • Operations optimization: LP can guide the operations decisions of water reservoirs, deciding on storage levels, releases, and energy production to maximize objectives like water availability or energy generation from hydropower (the topic of this demonstration below).

Toy Reservoir Demonstration: Problem Formulation

The following demo aims to provide a concrete example of applying linear programming (LP) in the realm of water resource management. We will focus on a very simple reservoir problem, which, despite its simplicity, captures the essence of many real-world challenges.

Imagine you are managing a small hydroelectric reservoir that provides water and energy to a nearby town. Your goal is to operate the reservoir, choosing how much water to release each day, such that (1) the water level stays near a specific target and (2) energy generation is maximized. Additionally, there is a legally mandated minimum flow requirement downstream to sustain local ecosystems.

Here, I will translate this problem context into formal LP notation, then in a later post I will provide Python code to implement the LP decision making process within a simulation model.

Problem formulation

We want to translate our problem context into formal mathematical notation that will allow us to use an LP solver. In order to help us get there, I’ve outlined the following table with all the variables being considered here.

Decision variables

In this case, the things we are controlling are the reservoir releases (Rt) and the storage levels (St) at each timestep.

Constraints:

Constraints are the backbone of any LP problem, defining the feasible space which ultimately dictates the optimal solution. Without constraints, an LP problem would have infinite solutions, rendering it practically meaningless. Constraints are meant to represent any sort of physical limitations, legal requirements, or specific conditions that must be satisfied by your decision. In the context of water resource management, constraints could signify capacity limits of a reservoir, minimum environmental flow requirements, or regulatory water diversion limits.

  • Mass balance requirement:

Naturally, you want to make sure that mass is conserved through the reservoir by balancing all of the inflows, outflows, and storage states for each timestep (t):
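A minimal version of this constraint (assuming an inflow It into the reservoir during timestep t, and ignoring evaporation and other losses) is:

S_{t+1} = S_t + I_t - R_t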

However, we need to separate this into two separate inequalities in order to meet the “standard form” for an LP formulation. I am also going to move the decision variables (Rt and St) to the left side of the inequalities:
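Under the same assumptions as above, the resulting pair of inequalities is:

S_{t+1} - S_t + R_t \leq I_t

-S_{t+1} + S_t - R_t \leq -I_t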

  • Storage limitations:

The most obvious constraints are that we don’t want the reservoir to overflow or run out of water. We do this by requiring the storage to be between some minimum (Smin) and maximum (Smax) values specified by the operator.

Again we need to separate this into two separate inequalities:
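Using the notation above, and keeping the decision variable on the left, these are:

S_t \leq S_{max}

-S_t \leq -S_{min}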

  • Maximum and minimum release limitations:

The last two constraints represent the physical limit of how much water can be discharged out of the reservoir at any given time (Rmax) and the minimum release requirement that is needed to maintain ecosystem quality downstream of the reservoir (Rmin).
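In the same standard form, these become:

R_t \leq R_{max}

-R_t \leq -R_{min}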

Objective:

The objective function in an LP represents your goals. In the case of our toy reservoir, we have two goals:

  1. Maximize water storage
  2. Maximize energy production.

Initially, when I was setting this demonstration up, there was no hydroelectric component. However, I chose to add the energy generation because it introduces more complexity in the actions (as we will see). This is because the two objectives conflict with one another. When generating energy, the LP is encouraged to release a lot of water to maximize energy generation; however, it must balance this with the goal of raising the storage to the target level. This tension between the two objectives makes for much more interesting decision dynamics.

1. Objective #1: Maximize storage

Since our reservoir managers want to make sure the town’s water supply is reliable, they want to maximize the storage. This demo scenario assumes that flooding is not a concern for this reservoir. We can define objective one simply as:

Again, we can replace the unknown value (St) with the mass-balance equation described above to generate a form of the objective function which includes the decision variable (Rt):

We can also drop the non-decision variables (all except Rt) since these cannot influence our solution:

2. Objective #2: Maximize energy generation:

Here I introduce a second component of the objective function associated with the amount of hydroelectric energy being generated by releases. Whereas the minimize-storage-deviation component of the objective function will encourage minimizing releases, the energy generation component of the objective function will incentivize maximizing releases to generate as much energy each day as possible.

I will define the energy generated during each timestep as:
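One common way to write this (an assumption on my part; the exact expression used in the original formulation may differ) is the standard hydropower relationship, with turbine efficiency \eta, water density \rho, gravitational acceleration g, hydraulic head h (treated as constant for simplicity), and timestep length \Delta t, where R_t is the average release rate over the timestep:

E_t = \eta \rho g h R_t \Delta t

Treating the head as constant is what keeps this term linear in the release R_t, and therefore compatible with an LP formulation.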

(1+2). Aggregate minimization objective:
Lastly, LP problems are only able to solve for a single objective function. Therefore, we need to aggregate the storage and energy generation objectives described above. In this case I take a simple sum of the two different objective scores. Of course this is not always appropriate, and the weighting should reflect the priorities of the decision makers and stakeholders.

In this case, this requires a priori specification of objective preferences, which will determine the solution to the optimization. In later posts I plan on showing an alternative method which allows for posterior interpretation of objective tradeoffs. But for now, we will stick with the basic LP and equal objective weighting.

Also, we want to formulate the optimization as a minimization problem. Since we are trying to maximize both of our objectives, we will make them negative in the final objective function.

The final aggregate objective is calculated as the sum of objectives of the N days of operation:

Final formulation:

As we prepare to solve the LP problem, it is best to lay out the complete formulation in a way that will easily allow us to transform the information into the form needed for the solver:

Conclusions and what’s next

If you’ve made it this far, good on you, and I hope you found some benefit to the process of formulating a reservoir LP problem from scratch.

This post was meant to include simulated reservoir dynamics which implement the LP solutions in simulation. However, the post has grown quite long at this point and so I will save the code for another post. In the meantime, you may find it helpful to try to code the reservoir simulator yourself and try to identify some weaknesses with the proposed operational LP formulation.

Thanks for reading, and see you here next time where we will use this formulation to implement release decisions in a reservoir simulation.

Variance-based local time varying sensitivity analysis: A tutorial

In this post, we will be performing time-varying sensitivity analysis (TVSA) to assess how the use of information from reservoir storage and downstream city water levels changes across time when attempting to meet competing objectives in the Red River basin. This exercise is derived from the Quinn et al. (2019) paper listed in the references below. You will be able to find all the files needed for this training in this GitHub repository. You can also follow along with this tutorial on Binder here.

Before beginning, here are some preliminary suggested readings to help you better understand TVSA and Radial Basis Functions (RBFs), which will be used in this training:

  • Herman, J. D., Reed, P. M., & Wagener, T. (2013). Time-varying sensitivity analysis clarifies the effects of watershed model formulation on model behavior. Water Resources Research, 49(3), 1400–1414. https://doi.org/10.1002/wrcr.20124
  • Giuliani, M., Zaniolo, M., Castelletti, A., Davoli, G., & Block, P. (2019). Detecting the state of the climate system via artificial intelligence to improve seasonal forecasts and inform reservoir operations. Water Resources Research, 55(11), 9133–9147. https://doi.org/10.1029/2019wr025035
  • Quinn, J. D., Reed, P. M., Giuliani, M., & Castelletti, A. (2019). What is controlling our control rules? opening the black box of multireservoir operating policies using time‐varying sensitivity analysis. Water Resources Research, 55(7), 5962–5984. https://doi.org/10.1029/2018wr024177
  • Reed, P.M., Hadjimichael, A., Malek, K., Karimi, T., Vernon, C.R., Srikrishnan, V., Gupta, R.S., Gold, D.F., Lee, B., Keller, K., Thurber, T.B, & Rice, J.S. (2022). Addressing Uncertainty in Multisector Dynamics Research [Book]. Zenodo. https://doi.org/10.5281/zenodo.6110623

The Red River System

The Red River basin contains four reservoirs: Son La (SL), Thac Ba (TB), Tuyen Quang (TQ), and Hoa Binh (HB), shown in the figure below. The reservoirs’ operators must make daily decisions on the volume of water to release from each reservoir to achieve the following competing objectives:

  1. Protect the city of Hanoi (which lies downstream of the four reservoirs) from flooding by managing city water levels.
  2. Achieve maximum possible hydropower production from each reservoir to serve the energy demands of Hanoi.
  3. Minimize agricultural water supply deficit to meet irrigation needs for the region.
Figure 1: (a) The Red River Basin relative to Vietnam, Laos, and China, and (b) the model abstracting key components of the reservoir system that manages it.

The reservoir operators use information from the projected storage levels at each reservoir and water level forecasts to make their daily release decisions. These decisions in turn determine if they are able to meet their three objectives. Therefore, it is important to understand how different types of information are used and coordinated throughout the year to make these daily decisions.

This exercise

To complete this exercise, you should have the following subfolders and files within your Binder.

  1. HydroInfo_100_thinned.resultfile: the file that contains the decision variables and objective values for all sets of optimal decisions. If this file is missing from the main folder structure, you can download the original version here.
  2. daily/SolnXX/: the subfolder that contains text files of each reservoir’s daily decision variables (the columns) across 1,000 years (the rows) for Solution XX.

Most of the daily data folders are large and cannot be stored on the linked GitHub repository. Note that only Soln33 can be found in the daily/ folder. If you would like to perform TVSA on other solutions, please download them separately here.

All in all, there are four selected solutions, named according to the objectives that they prioritize and their tradeoff characteristics:

  1. The “Best Flood” policy: Solution 33
  2. The “Best Hydropower” policy: Solution 37
  3. The “Best Deficit” policy: Solution 52
  4. The “Compromise” policy: Solution 35

Learning goals

In this exercise, we will:

  1. Briefly explain RBFs (Giuliani et al., 2016; Quinn et al., 2017; Giuliani et al., 2019).
  2. Unpack and understand the underlying mechanisms for calculating analytical variance-based first- and second-order sensitivity indices.
  3. Perform time-varying sensitivity analysis for each of the four listed solutions.
  4. Analyze the resulting figures to understand how information use changes throughout the year.

Understanding TVSA

Let’s walk through some key equations before we proceed. In this section, you will also find the Equation numbers that link these equations to their origins in the Quinn et al. (2019) paper listed above.

First, let’s look at some of the notation we will be using:

  • M is the number of policy inputs
  • K is the number of policy outputs
  • N is the number of radial basis functions (RBFs) used
  • t \in T=13\times365 is a day within the entire record, while t\%365 is the corresponding day of the year (DOY)
  • k \in K is the index of the reservoir whose release policy is being evaluated
  • u_{t}^{k} is the policy-prescribed release at reservoir k at time t
  • a, b index the system states (policy inputs), e.g. the storage at a given reservoir
  • (x_t)_a is the value of system state a within the vector of policy inputs x_t at time t
  • c_{n,m} and b_{n,m} are the centers and radii of policy input m \in M for RBF function n \in N
  • w_n is the weight of RBF function n \in N

Calculating the first-order sensitivity analysis

The following equation is drawn from Eqn.10 of Quinn et al. (2019):

\Bigl(\frac{\delta u_{t}^{k}}{\delta(x_t)_a}\Bigr)^{2}\text{Var }((x_{t\%365})_a)

This equation describes the annual first-order contribution of input variable (x_t)_a (the system state at a) alone to the variance of the release decision u_{t}^{k}.

Next, from Eqn.11 of Quinn et al. (2019):

\frac{\delta u_{t}^{k}}{\delta(x_t)_a}\frac{\delta u_{t}^{k}}{\delta(x_t)_b}\text{Cov }((x_{t\%365})_a, (x_{t\%365})_b)

This equation describes the annual second-order (interactive) contribution of the pair of input variables (x_t)_a and (x_t)_b, the system states at a and b, to the variance of the release decision. But how do we calculate the derivatives of the policy u_{t}^{k} with respect to its inputs?

Calculating the derivatives

We can calculate the derivative by taking the first partial differential of the policy with respect to each of its inputs. This approach is generalizable to all forms of policy formulation, but in this case, we will take the derivative with respect to the RBF function shown in Eqn 8 of Quinn et al. (2019).

\frac{\delta u_{t}^{k}}{\delta(x_t)_a} = \sum_{n=1}^{N}\Biggl\{\frac{-2w_{n}[(x_{t})_{a}-c_{n,a}]}{b_{n,a}^2}\text{ exp}\Biggl[-\sum_{m=1}^{M}\Bigl(\frac{(x_{t})_{m}-c_{n,m}}{b_{n,m}}\Bigr)^2\Biggr]\Biggr\}

For reference, here is the original RBF function (Eqn. 8 of Quinn et al. (2019)):

u_{t}^{k} = \sum_{n=1}^{N}w_{n}^{k}\text{ exp}\Biggl[-\sum_{m=1}^{M}\Biggl(\frac{(x_{t})_{m} - c_{n,m}}{b_{n,m}}\Biggr)^2\Biggr]

A quick aside on RBFs

As shown in the equation above, an RBF is characterized by three parameters: centers c \in C and radii b \in B for each input variable, and a weight w. It is a type of Gaussian process kernel that informs the model (in our case, the Red River model) of the degree of similarity between two input points (Natsume, 2022). Each input is shifted by its distance from its respective “optimal” center point, and scaled by its “optimal” radius. The overall influence of an RBF function on the final release decision is then weighted by an “optimal” constant, where the sum of all w should be equal to one.

Herein lies the key advantage of using RBFs: instead of identifying the optimal output (the releases) at each timestep, optimizing RBFs finds optimal values of c, b, and w associated with each input. This limits the number of computations and storage space required to estimate the best-release policy. This is done by eliminating the need to compute future outcomes u_{t+1}, u_{t+2},...,u_{T} across all following timesteps, and avoiding the storage of information on prior releases u_{t-1}.

General steps

Time-varying sensitivity analysis in general is conducted as follows:

  1. Calculate the variance and covariance of each input variable for each timestep t over all years. In this case, the timestep is in daily increments.
  2. Calculate the partial first and second derivatives of each release policy for each day with respect to the input variables at each location.
  3. Compute the first- and second-order contributions of each input variable across all timesteps (see the sketch after this list) by:
    • Multiplying the variances with the first-order derivatives
    • Multiplying the covariances with their respective second-order derivatives
  4. Plot the contribution of each input variable to daily release policies (i.e. what kind of information is being used to make a decision on releases from reservoir k).  
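As a rough illustration of steps 1 through 3, here is a minimal sketch (not the code in calc_SI.py; the array names and the choice to evaluate the gradient at the daily mean state are my own simplifications) of how the first-order contributions could be computed:

import numpy as np

def rbf_policy_gradient(x, C, B, W):
    # Analytical gradient of the RBF policy u(x) = sum_n w_n * exp(-sum_m ((x_m - c_nm)/b_nm)^2)
    # x: (M,) policy inputs at one timestep; C, B: (N, M) centers and radii; W: (N,) weights
    z = np.exp(-np.sum(((x - C) / B) ** 2, axis=1))     # (N,) RBF activations
    # du/dx_m = sum_n -2 w_n (x_m - c_nm) / b_nm^2 * z_n
    return np.sum(-2.0 * W[:, None] * (x - C) / B**2 * z[:, None], axis=0)   # (M,)

def first_order_contributions(x_all, C, B, W):
    # x_all: (num_years, 365, M) daily policy inputs across all simulated years
    var_doy = x_all.var(axis=0)                         # (365, M) variance of each input per day of year
    contrib = np.zeros_like(var_doy)
    for d in range(x_all.shape[1]):
        # evaluate the gradient at the mean state for that day (a simplifying choice)
        grad = rbf_policy_gradient(x_all[:, d, :].mean(axis=0), C, B, W)
        contrib[d] = grad**2 * var_doy[d]               # Eqn. 10 of Quinn et al. (2019)
    return contrib

The second-order terms follow the same pattern, with covariances in place of the variances and cross-products of the partial derivatives in place of the squared derivatives.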

Setting up the problem

To follow along with this tutorial, ensure that all files are located in their appropriate directories, then open the Jupyter notebook named tvsa_exercise.ipynb. Now, let’s import all the required libraries and set some constant parameters. We will be setting the values for the number of input parameters, M, the number of RBF functions, N, and the number of outputs, K.

# import the libraries
import numpy as np
import math 
import matplotlib.pyplot as plt
import seaborn as sns
from calc_SI import *
from plot_SI import *

sns.set_style("dark")
# set the parameters 
M = 5 # Storage at the four reservoirs and the forecasted water levels in Hanoi 
N = 12 # Number of RBFs to parameterize. Set a minimum of M+K
K = 4 # Number of outputs (release decisions at the four reservoirs)
num_years = 1000 # Number of years to simulate

We will also need to define our input and output names. These are the column names that will appear in the output .txt files needed to plot our figures later in this exercise. Let’s also set the solution that we would like to analyze.

inputNames = ['sSL', 'sHB', 'sTQ', 'sTB', 'HNwater']  # the storage at Son La, Hoa Binh, Tuyen Quang, Thac Ba, and the water level at Hanoi
outputNames = ['rSL', 'rHB', 'rTQ', 'rTB']  # the release at Son La, Hoa Binh, Tuyen Quang, Thac Ba
policy_file = 'HydroInfo_100_thinned.resultfile'
soln_num = 33

Performing TVSA

The main calculations that output the sensitivity indices for each day of the year across 1,000 years are performed by the function called calc_analytical_SI. This function can be found in the calc_SI.py file within this folder. It is highly recommended that you take the time to go through and understand the mechanisms of time-varying sensitivity analysis after completing this exercise. This step can take anywhere from 30 to 50 minutes, depending on your computing hardware. Feel free to comment out the line of code below if you would only like to plot the TVSA figures.

calc_analytical_SI(M, N, K, policy_file, soln_num, num_years, inputNames, outputNames)

If you are successful in running this function, you should see two new folders in daily/Soln_XX/ called release_decisions and Figures. The latter will become important in a while, when we begin to generate our TVSA figures. For now, open release_decisions to check if the files containing the sensitivity of each reservoir’s release decisions (e.g. rHB.txt) have populated the folder.

Are they there? Great! Let’s move on to generating the figures.

Plot the results

Now that we have the sensitivity indices of each reservoir’s release decision over time, let’s visualize how they use storage and water level forecast information. You can do this by using the plot_tvsa_policy function found in the plot_SI.py file. As before, you’re advised to take the time to understand the script once you have completed this exercise.

To begin, let’s specify the names of the inputs that you would like to see in the legend of the figure. For this exercise, we will re-use the inputNames list defined prior to this. We will also need to add one more entry to this list, called interactions, which quantifies the influence of input variable interactions on release decisions. In general, the list of input names and their associated colors should have length M+1.

Feel free to replace this with a list of other names that are more intuitive for you. We will also specify the colors they will be shown in. Since there are six legend entries (the five inputs plus their interactions), we will have to specify six colors.

# specify the legend labels for each input and their respective colors
input_list = inputNames + ['interactions']

# for Son La, Hoa Binh, Tuyen Quang, Thac Ba, Hanoi, and their interactions respectively
input_colors = ['#FFBF00', '#E83F6F', '#2274A5', '#E3D7FF', '#92BFB1', '#F7A072']

For now, let’s see how this information affects Son La’s release decisions over time for the “Best Flood” solution. This is the solution that minimizes the “worst first-percentile of the amount by which the maximum water level in Hanoi exceed 11.25m” (Quinn et al., 2019).

plot_tvsa_policy(num_years, 33, input_list, input_colors, 'rSL', 'Sensitivity of Son La releases')

If all the steps have been completed, you should be able to view the following image:

Figure 2: Average contributions of each input across all simulation years and their interactions to Son La’s release decision variance. The vertical axis represents the relative contributions of each input variable to variance. The horizontal axis shows the months in a year.

If you’d like to see and automatically save the plots for all four reservoirs, you can opt for a for loop:

res_names = ['Son La', 'Hoa Binh', 'Tuyen Quang', 'Thac Ba']

for r in range(len(outputNames)):
    plot_name = 'Sensitivity of ' + res_names[r] + ' releases'
    plot_tvsa_policy(num_years, 33, input_list, input_colors, 
                     outputNames[r], plot_name)

At the end of this, you should be able to locate all the figures in the daily/Soln_XX/Figures folder, where XX is the solution number of your choice.

Analyze the results

The figure below shows the time-varying sensitivity of the release decisions at all four reservoirs to the input variables. As a quick reminder, we are looking at Solution 33, which is the “Best Flood” solution.

Figure 3: Average contributions of each input across all simulation years and their interactions to (clockwise) Son La, Hoa Binh, Tuyen Quang, and Thac Ba. The vertical axis represents the relative contributions of each input variable to variance. The horizontal axis shows the months in a year.

In Figure 3 above, note the visual similarity between the contributions of each input to the release decision variance at Hoa Binh and Son La. The similarities between the pair of reservoirs could potentially be attributed to the geographic characteristics of the Red River basin (although further analysis is required here). In Figure 1, both Son La and Hoa Binh belong to the same Red River tributary, with the former preceding the latter. Their release decisions predominantly use information from Hanoi’s water level forecasts, particularly during the five months that correspond to Vietnam’s May-September monsoon season.

During the same period, the variance in Thac Ba’s and Tuyen Quang’s release decisions has relatively even contributions from Son La’s storage levels, Hanoi’s water level forecast, and interactions between different types of information. This indicates that it is crucial for the reservoir managers at Thac Ba and Tuyen Quang to carefully monitor these three main information sources during the monsoon period. Higher-resolution data and more frequent monitoring may be valuable during this time.

Interestingly, storage information from Hoa Binh appears to contribute most to the release decision variances of each reservoir before and at the beginning of the monsoon. Hoa Binh’s storage information is particularly important during the first half of the year at all four reservoirs, which happens to be immediately before and during the monsoon season. During the same period of time, interactions between information from all other reservoirs also play an important role in regulating release decisions.

Post-monsoon, the influence of interactions on release decisions becomes negligible, and each reservoir’s releases become dominantly driven by storage levels at Hoa Binh and Hanoi’s water levels once more, approximately mirroring the distribution of information contributions during the pre-monsoon period. However, storage levels at Thac Ba and Tuyen Quang play a more important role in determining release decisions during this time, and their contributions to release decisions are greater for their own reservoirs. Notably, information on storage at Thac Ba has the least influence on all reservoirs’ release decisions (including its own) throughout the entire year. This is likely due to it being the smallest reservoir in the system (Table 1; Quinn et al., 2019).

These observations can lead us to hypothesize that:

  1. For a policy that prioritizes minimizing flooding in Hanoi, information on storage levels at the Hoa Binh reservoir is vital.
  2. Storage levels at Hoa Binh should therefore be closely monitored and recorded at high resolution.
  3. During the monsoon, water levels in Hanoi are also key to release decisions and should be recorded at regular intervals.
  4. Interactions between different sources of information are also a significant contributor to the variance across all reservoirs’ release decisions; treating decisions and conditions at each reservoir as independent of the others can result in underestimating the risk of flooding.

Summary

Overall, TVSA is a useful method to help decision-makers identify what information is useful, and when it is best used. This information is valuable when determining the resolution at which observations should be taken, or if downscaling is appropriate. Note that this method of calculating sensitivity relies on a variance-based global sensitivity analysis method, where the sensitivity of an output to a given input (or a combination of inputs) is quantified as that input’s contribution to the total variance in the output. However, such methods may be biased by heavy-tailed distributions or outliers and therefore should be used only where appropriate (e.g., when the output distributions have been analyzed, or when outputs are not multimodal; Reed et al., 2022). As you may have noticed, TVSA is also a computationally expensive process that requires a significant amount of time. This requirement grows rapidly with the number of simulations/years that you are calculating variance across, or as the number of inputs and outputs increases.

In this exercise, we performed time-varying sensitivity analysis to understand how each reservoir within the Red River system in Vietnam uses storage and water level information on a day-to-day basis, and how this use of information changes over time. Similar methods have also been used to better understand how the formulation of hydrological processes affects watershed model behavior (Herman et al., 2013; Medina and Muñoz, 2020; Basijokaite and Kelleher, 2021), and to assess information use in microgrid energy management (Liu et al., 2023). Post-exercise, do take some time to peruse these papers (linked below in the References section).

That said, thank you for sticking around and hope you learned a little something here!

References

  1. Basijokaite, R., & Kelleher, C. (2021). Time‐varying sensitivity analysis reveals relationships between watershed climate and variations in annual parameter importance in regions with strong interannual variability. Water Resources Research, 57(1). https://doi.org/10.1029/2020wr028544
  2. Giuliani, M., Castelletti, A., Pianosi, F., Mason, E., & Reed, P. M. (2016). Curses, tradeoffs, and scalable management: Advancing Evolutionary Multiobjective Direct Policy search to improve water reservoir operations. Journal of Water Resources Planning and Management, 142(2). https://doi.org/10.1061/(asce)wr.1943-5452.0000570
  3. Herman, J. D., Reed, P. M., & Wagener, T. (2013a). Time-varying sensitivity analysis clarifies the effects of watershed model formulation on model behavior. Water Resources Research, 49(3), 1400–1414. https://doi.org/10.1002/wrcr.20124
  4. Liu, M.V., Reed, P.M., Gold, D.F., Quist, G., Anderson, C.L. (2023). A Multiobjective Reinforcement Learning Framework for Microgrid Energy Management. arXiv.
    https://doi.org/10.48550/arXiv.2307.08692
  5. Medina, Y., & Muñoz, E. (2020). A simple time-varying sensitivity analysis (TVSA) for assessment of temporal variability of hydrological processes. Water, 12(9), 2463. https://doi.org/10.3390/w12092463
  6. Natsume, Y. (2022, August 23). Gaussian process kernels. Medium. https://towardsdatascience.com/gaussian-process-kernels-96bafb4dd63e.
  7. Quinn, J. D., Reed, P. M., Giuliani, M., & Castelletti, A. (2019). What is controlling our control rules? opening the black box of multireservoir operating policies using time‐varying sensitivity analysis. Water Resources Research, 55(7), 5962–5984. https://doi.org/10.1029/2018wr024177
  8. Reed, P.M., Hadjimichael, A., Malek, K., Karimi, T., Vernon, C.R., Srikrishnan, V., Gupta, R.S., Gold, D.F., Lee, B., Keller, K., Thurber, T.B, & Rice, J.S. (2022). Addressing Uncertainty in Multisector Dynamics Research [Book]. Zenodo. https://doi.org/10.5281/zenodo.6110623

Highlighting the Excellence in Systems Analysis Teaching and Innovative Communication (ECSTATIC) Repository

*This post was written in collaboration with Dr. Chris Chini (Air Force Institute of Technology) and Dr. David Rosenberg (Utah State University)

The EWRI Congress and EWRS committee meeting that a good majority of us in water resources engineering attend every year serve as a reminder to me that there exist some really wonderful free resources to aid in teaching and communicating in our field. Most of us in EWRS are familiar with the ECSTATIC repository, but I wanted to use our blog as a forum to formally discuss the repository and provide guidance on how to contribute.

Background  

David: In 2013, several of us among the 100+ members of the Environmental Water Resources Systems (EWRS) committee saw the need to develop, share, and disseminate excellent and innovative methods to teach water systems content. There were organized channels such as journals and conferences to communicate research. Individual committee members were mostly using teaching resources they developed themselves. In most cases, those materials were not accessible to other faculty, students, or life learners. We wanted a channel – the ECSTATIC repository – to share excellent teaching videos, audio clips, lecture materials, readings, problems, games, code, etc.

A second purpose for the task committee was to submit a peer-reviewed article that reviewed the state of systems analysis education, discussed what a systems analysis curriculum should include, and assessed the impact of the interactive website. We ended up authoring and publishing two peer-reviewed articles. The second article took data from interviews we did with practitioners in the field. The notes/videos from the interviews were also posted to the repository!

What is in the repository?

The repository can be found here: https://digitalcommons.usu.edu/ecstatic/

The teaching repository has a multitude of collections ranging from course syllabi, games, problems, lecture materials, and code/models.

Some of the most popular materials on ECSTATIC include lecture notes on multi-objective optimization and nonlinear optimization from Dr. Lina Sela, an Associate Professor at the University of Texas at Austin. Additionally, Dr. Joseph Kasprzyk, an Associate Professor at the University of Colorado Boulder, has lectures on Introduction to Water Resources Systems and Introduction to Reservoir Analysis. Other submissions include GAMS examples (https://digitalcommons.usu.edu/ecstatic_all/63/), case studies for a guided project on stochastic hydrology (https://digitalcommons.usu.edu/ecstatic_all/57/), and syllabi for courses on environmental systems (https://digitalcommons.usu.edu/ecstatic_all/19/) and water technology & policy (https://digitalcommons.usu.edu/ecstatic_all/66/).

Example Slide Deck Content

There are also free books, video lectures (including a series on Managing Infrastructure with Deep Uncertainty Discussions), and games that can be used to supplement lectures.

Video Lectures from the Managing Infrastructure with Deep Uncertainty Discussions Series

Who can contribute?

Please submit new material at: https://digitalcommons.usu.edu/cgi/ir_submit.cgi?context=ecstatic_all

New submissions are always welcome and are accepted in whatever format suits the material. Submissions can be made by professors, PhD students, practitioners, or any person who feels they have educational materials to share to benefit the Water Systems community. For any questions on the repository, its content, or how to submit an article please reach out to either Nathan Bonham (nathan.bonham@colorado.edu) or Christopher Chini (christopher.m.chini@gmail.com).

David: Submissions on any topic related to water resources would be great, but the ECSTATIC task committee is particularly interested in entire courses from colleagues who may be retired or who are retiring soon. Furthermore, the committee would like to build up the video content in the repository. During the pandemic, many seminars were conducted online and recorded through Zoom. Provided the speakers are fine with having their seminars online, these videos are the best opportunity for people using the site to get context around the methods being used and the problems being solved. In short, content that you can’t get from a textbook or a syllabus.

What is the peer review process like?

New submissions have to pass through a review process. When a new work is submitted via Utah State University’s Digital Commons, a volunteer Chief Editor assigns one or more peer reviewers who are solicited from the EWRS community.

Reviewers use 4 criteria to evaluate submissions:

  • Is the submission topical to ECSTATIC (water resources)?
  • Are the materials organized so someone can download and use them?
  • What are the strengths of the submission?
  • What changes, if any, are needed to post on ECSTATIC?

The entire submission, review, and publishing process occurs through the content management system.

What is the reach of ECSTATIC?

ECSTATIC has grown – and continues to grow – to a global audience and usership that surpasses the EWRS community.

ECSTATIC Repository Reach

Since its inception, there have been 26,900 total downloads of material across many different countries (as shown in the map above) and types of institutions (academic, government, etc.).

Downloads over time

The ECSTATIC repository is a great resource, not only for preserving and sharing knowledge within our community, but also for giving those outside our community a platform to engage with it. Please consider sharing your resources and helping the ECSTATIC repository grow!

The Colorado River Basin: Exploring its Past, Present, and Future Management

I. Introduction

Figure 1. Aerial view of the Colorado River passing through the Grand Canyon in Arizona (McBride) [18]

The Colorado River Basin (CRB) covers a drainage area of 246,000 square miles across California, Colorado, Nevada, New Mexico, Utah, Wyoming, and Arizona. Studies have estimated that the basin provides $1.4 trillion of economic revenue per year, 16 million jobs across 7 states, water to 40 million people, and irrigation to approximately 5.5 million acres of agricultural land. For states such as New Mexico and Nevada, it supports roughly two-thirds or more of annual GDP (65% and 87%, respectively) [1].

On August 21, 2021, the U.S. federal government declared its first-ever water cuts in the basin due to the ongoing megadrought. The basin's reservoirs are estimated to be filled to 35% of their full capacity, and the basin has suffered a 20% decrease in inflow over the last century [2]. Rising temperatures, caused in part by climate change, have dried up the basin's water supply, with certain areas warming at more than double the global average rate. Hotter temperatures have had three notable impacts on the drying of the basin: accelerated evaporation and snowpack melting, soils that dry out before runoff can reach reservoirs, and increased wildfires, which cause erosion of sediment into the basin.

The CRB finds itself at a critical juncture; as the population and demand in the area continue to grow, the water supply is only diminishing. Ideally, the basin should provide for municipal, agricultural, tribal, recreational, and wildlife needs. Thus, appropriate water management policies will be foundational to the efficacy of water distribution in these critical times.

II. Brief History

Representatives from the seven Colorado River states negotiated and signed the Colorado River Compact on November 9th, 1922, and a century later this compact still defines much of the management of the CRB. The compact divided the basin into Upper and Lower sections and provided a framework for the distribution of water [3]. At the time of the signing, the annual flow of the basin was estimated at 16.4 million acre-feet per year (maf/y). Accordingly, each section of the basin was granted the right to 7.5 maf/y of water. A more accurate estimate later adjusted the total flow to 13.5 maf/y, with fluctuations between 4.4 maf/y and 22 maf/y [3]. State-specific allocations were defined in 1928 for the Lower Basin as part of the Boulder Canyon Act, and in 1948 for the Upper Basin in the Upper Colorado River Basin Compact [4]. Figure 2 displays water distribution on a state basis in million acre-feet per year.

The system of water allocation, which is still in place today, dates back to 1876, when the Colorado Constitution put into place the Doctrine of Prior Appropriation. The core of the doctrine is that water rights can only be obtained for beneficial use. These rights can be bought, sold, inherited, and relocated. The doctrine gives owners of land near the water equal rights of use, and it regulates the use of surface and tributary groundwater on a basis of priority. Senior water rights holders, those who have held their rights the longest, have the highest priority and are typically best positioned in times of drought. In contrast, junior water rights holders, who obtained their water rights more recently, tend to have their supply curtailed first in times of drought. This system is often described as “first-in-time, first-in-right” [5]. However, a key criticism of this doctrine is that it fails to fully account for the beneficial uses of water, which become particularly important when water is scarce.

Figure 3 summarizes the key historic moments of the Colorado River Basin, from the Colorado Constitution in 1876 to the most recent 2023 negotiations. Among the most notable events is the 1922 signing of the Colorado River Compact, which created the foundation for dividing and apportioning the basin's water. Additionally, the construction of dams in the early to mid 20th century, such as the Hoover, Parker, Glen Canyon, Flaming Gorge, Navajo, and Curecanti dams, has played an important role in determining water flow today. The formation of the Central Arizona Project in 1968 provided access to water for both agricultural lands and metropolitan areas such as Maricopa, Pinal, and Pima counties [6]. More recently, the critically low water levels of Lake Powell and the Glen Canyon Dam have raised concerns about their ability to generate hydropower. In early 2023, the seven states that utilize the CRB met in hopes of reaching an agreement on how to deal with the current shortages, but the proposal failed.

III. Key Players

Figure 4 highlights prominent players in the Colorado River Basin who are frequently involved in negotiations and policymaking.

IV. Current Situation

In January 2023, the states within the CRB failed to come to an agreement on an updated water consumption plan aimed at curbing the impacts of the megadrought. The proposed plan would require California to cut its water use from 4.4 maf/y to 1 maf/y. The agreement, aimed at preventing Lakes Powell and Mead from falling below the critical levels for hydropower, proposed major water cuts for the southwestern states of California and Arizona and incorporated losses from evaporation into the cutbacks [7]. Although six states agreed on a plan to move forward, California rejected the major water cuts due to concerns related to agriculture and legal water rights status. The proposed cuts would primarily impact regions such as Imperial County, which holds senior rights to 3.1 maf/y and generates agricultural revenue of $2.3 billion annually [8]. California proposed an alternative plan, which called for a 400,000 acre-foot reduction (for California specifically), with additional reductions contingent on the future water level of Lake Mead. While the two plans are similar, California's plan waits until the water level passes a critical threshold, whereas the other states' plan is more preventative in nature [7].

Figure 5. Key Facts about the CRB

In more recent news, the Biden Administration has proposed a plan to cut water allocations to California, Arizona, and Nevada evenly, by approximately one-quarter. An intervention from the federal government on such a scale would be unprecedented in the history of the region. The options considered include taking no action, making reductions based on senior water rights, or making equal reductions across the three states. Opting for the first option would likely result in a dead pool in Lakes Mead and Powell, a term used to describe when water levels fall too low to continue flowing downstream [12]. The second option would favor California, as it holds some of the most senior rights in the basin, particularly in the Coachella and Imperial Valleys of Southern California [9]. Consequently, this decision would more severely impact Arizona and Nevada, which are important swing states for the President and have a significant tribal presence. The final decision is anticipated in the summer of 2023 [10].

In many parts of California and Colorado, this winter brought particularly heavy rain and snow, raising hopes that the river could be replenished. Scholars estimate that it would take 3 years of average snowpack with no water consumption to fully restore the Colorado River Basin reservoirs, and 7 years under current consumption [11]. Unfortunately, the basin's soil is extremely dry, which means that any excess water from the snowpack is likely to be absorbed by the ground before it has a chance to reach rivers and streams. It is estimated that by the end of 2023 Lake Powell will be at 3,555 feet of elevation (roughly 32% of capacity). When Lake Powell reaches 3,490 feet, the Glen Canyon Dam will be unable to produce hydroelectric power, and at 3,370 feet a dead pool will be reached.

V. Paths to a Sustainable Future

As the water supply in the CRB continues to diminish, it has become increasingly crucial to find ways to minimize water loss from the system. One of the major contributors is evaporation, which accounts for approximately 1.5 maf/y of loss, more than Utah's allocation from the Colorado River. To minimize evaporation loss, an ideal reservoir would be very deep and have a small surface area. However, this is not the case for reservoirs like Lake Powell, which loses approximately 0.86 maf/y (more than 6% of the CRB's annual flow) [13]. This raises the important question of how best to allocate CRB water, considering that some states experience more evaporation loss than others. According to research from CU Boulder, possible solutions include covering the surface water with reflective materials, films of organic compounds, or lightweight shades. Additionally, relocating reservoir water to underground storage areas or aquifers could also reduce evaporation [14].

An alternative approach is cloud seeding. In early March of 2023, the federal government invested $2.4 million in cloud seeding for the CRB. Cloud seeding is a technique used to artificially induce more precipitation by injecting ice nuclei, silver iodide, or other small crystals into subfreezing clouds. This promotes condensation of water around the nuclei and the formation of rain drops, which is estimated to increase precipitation by 5-15% [15]. The grant will be used to fund new cloud seeding generators, which can be operated remotely, as well as aircraft for silver iodide injections. While this is a significant investment, cloud seeding has been practiced for decades in the CRB. Indeed, it is estimated that Colorado, Utah, and Wyoming each spend over $1 million annually on cloud seeding, and Utah plans to increase its spending to $14 million next year [16]. While the negative impacts of cloud seeding are still unclear, some scholars believe it could cause silver toxicity because of the use of potentially harmful chemicals. Additionally, wind can sometimes blow the seeded clouds to a different location [17]. Ultimately, cloud seeding does not solve the underlying problems of climate change or aridification in the region, but it may help alleviate some of the impacts of the drought until a more sustainable alternative can be found.

Works Cited

[1] “Economic Importance of the Colorado River.” The Nature Conservancy. The Nature Conservancy. Accessed December 1, 2021. https://www.nature.org/en-us/about-us/where-we-work/priority-landscapes/colorado-river/economic-importance-of-the-colorado-river/.

[2] Kann, Drew. “Climate Change Is Drying up the Colorado River, Putting Millions at Risk of ‘Severe Water Shortages’.” CNN. Cable News Network, February 22, 2020. https://www.cnn.com/2020/02/21/weather/colorado-river-flow-dwindling-warming-temperatures-climate-change/index.html.

[3] Megdal, Sharon B. “Sharing Colorado River Water: History, Public Policy and the Colorado River Compact.” The University of Arizona: Water Resources Research Center, 1 Aug. 1997, https://wrrc.arizona.edu/publication/sharing-colorado-river-water-history-public-policy-and-colorado-river-compact.

[4] Water Education Foundation. (n.d.). Colorado River Compact. Water Education Foundation. Retrieved April 17, 2023, from https://www.watereducation.org/aquapedia-background/colorado-river-compact#:~:text=The%20Lower%20Basin%20states%20were,River%20Basin%20Compact%20of%201948.

[5] Hockaday, S, and K.J Ormerod. “Western Water Law: Understanding the Doctrine of Prior Appropriation: Extension.” University of Nevada, Reno, 2020, https://extension.unr.edu/publication.aspx?PubID=3750#:~:text=Senior%20water%20rights%20are%20often,farming%2C%20ranching%20and%20agricultural%20uses.&text=A%20claim%20to%20water%20that%20is%20more%20recent%20than%20senior,municipal%2C%20environmental%20or%20recreational%20uses.

[6] State of the Rockies Project 2011-12 Research Team. “The Colorado River Basin: An Overview.” Colorado College, 2012. 

[7] Partlow, Joshua. “As the Colorado River Dries up, States Can’t Agree on Saving Water.” The Washington Post, The Washington Post, 4 Apr. 2023, https://www.washingtonpost.com/climate-environment/2023/01/31/colorado-river-states-water-cuts-agreement/.

[8] Bland, Alastair. “California, Other States Reach Impasse over Colorado River.” CalMatters, 1 Feb. 2023, https://calmatters.org/environment/2023/01/california-colorado-river-water-2/.

[9] Hager, Alex. “Six States Agree on a Proposal for Colorado River Cutbacks, California Has a Counter.” KUNC, NPR, 1 Feb. 2023, https://www.kunc.org/environment/2023-01-31/six-states-agree-on-a-proposal-for-colorado-river-cutbacks-california-has-a-counter.

[10] Flavelle, Christopher. “Biden Administration Proposes Evenly Cutting Water Allotments from Colorado River.” The New York Times, The New York Times, 11 Apr. 2023, https://www.nytimes.com/2023/04/11/climate/colorado-river-water-cuts-drought.html?campaign_id=190&%3Bemc=edit_ufn_20230411&%3Binstance_id=89950&%3Bnl=from-the-times&%3Bregi_id=108807334&%3Bsegment_id=130155&%3Bte=1&%3Buser_id=ea42cfd845993c7028deab54b22c44cd.

[11] Mullane, Shannon. “Colorado’s Healthy Snowpack Promises to Offer Some Relief for Strained Water Supplies.” The Colorado Sun, The Colorado Sun, 14 Mar. 2023, https://coloradosun.com/2023/03/14/colorado-snowpack-water-supply-relief/.

[12] Tavernise, Sabrina, host. “7 States, 1 River and an Agonizing Choice.” The Daily, The New York Times, 31 Jan. 2023. https://www.nytimes.com/2023/01/31/podcasts/the-daily/colorado-river-water-cuts.html?showTranscript=1

[13] “Lake Powell Reservoir: A Failed Solution.” Glen Canyon Institute, https://www.glencanyon.org/lake-powell-reservoir-a-failed-solution/.

[14] Blanken, Peter, et al. “Reservoir Evaporation a Big Challenge for Water Managers in West.” CU Boulder Today, 28 Dec. 2015, https://www.colorado.edu/today/2015/12/28/reservoir-evaporation-big-challenge-water-managers-west.

[15] McDonough, Frank. “What Is Cloud Seeding?” DRI, Desert Research Institute, 19 Sept. 2022, https://www.dri.edu/cloud-seeding-program/what-is-cloud-seeding/.

[16] Peterson, Brittany. “Feds Spend $2.4 Million on Cloud Seeding for Colorado River.” AP NEWS, Associated Press, 17 Mar. 2023, https://apnews.com/article/climate-change-cloud-seeding-colorado-river-f02c216532f698230d575d97a4a8ac7b.

[17] Rinkesh. “Various Pros and Cons of Cloud Seeding.” Conserve Energy Future, Conserve Energy Future, 28 July 2022, https://www.conserve-energy-future.com/pros-cons-of-cloud-seeding.php.

Works Cited for Photos

[18] McBride, P. (n.d.). The Colorado River winds through the Grand Canyon. photograph, Arizona.

[19] Kuhn, Eric, and John Fleck . “A Century Ago in Colorado River Compact Negotiations: Storage, Yes. but in the Compact?” Jfleck at Inkstain , 15 Nov. 2022, https://www.inkstain.net/2022/11/a-century-ago-in-colorado-river-compact-negotiations-storage-yes-but-in-the-compact/.

[20] “A Project for the Ages.” Bechtel Corporate, https://www.bechtel.com/projects/hoover-dam/.

[21] Lane, Taylor. “Lake Mead through the Decades – Lake Mead in 1950 Photograph from United States National Park Service Photography Collection .” Journal, Las Vegas Review-Journal, 17 Aug. 2022, https://www.reviewjournal.com/local/local-las-vegas/lake-mead-through-the-decades-photos-2602149/.

[22] “1944: U.S. Mexico Water Treaty.” Know Your Water News , Central Arizona Project , 25 Oct. 2022, https://knowyourwaternews.com/1944-u-s-mexico-water-treaty/.

[23] “Glen Canyon Unit – Aerial View of the Glen Canyon Dam and Lake Powell.” Bureau of Reclamation – Upper Colorado Region, Bureau of Reclamation, https://www.usbr.gov/uc/rm/crsp/gc/.

[24] “Reclamation and Arizona.” Reclamation, U.S Department of the Interior , https://www.usbr.gov/lc/phoenix/AZ100/1960/supreme_court_AZ_vs_CA.html.

[25] Ferris, Kathleen. “CAP: Tracking the Flow of Colorado River Water to Your City.” AMWUA, Arizona Municipal Water Users Association, 19 Oct. 2015, https://www.amwua.org/blog/cap-tracking-the-flow-of-colorado-river-water-to-your-city.

Exploring Internal Variability with Hidden Markov Models in Our Newly Published eBook Interactive Tutorial

We recently published two new Jupyter Notebook tutorials as technical appendices to our eBook on Addressing Uncertainty in MultiSector Dynamics Research. Dave debuted his tutorial on the blog earlier this month here. This post will cover my addition to the eBook, which outlines a hidden Markov modeling approach to creating synthetic streamflow scenarios. Similarly to Dave’s, this post will outline the notebook’s connections to topics in the main text of the eBook rather than detailing the code demonstrated in the notebook.

In this post, I’ll first give a brief overview of how a Hidden-Markov model-based streamflow generator can help facilitate exploratory modeling in the Colorado River Basin. Then I’ll go through examples of how it can be used to explore internal variability that characterizes the region’s hydroclimate.

Exploratory Modeling for Colorado River Basin Planning and Management

The Colorado River Basin (CRB) is experiencing an unprecedented water shortage crisis brought on by large hydroclimatic changes and over-allocation of the Colorado River. Parts of the basin are experiencing warming that is nearly twice the global average1. Even when above-average snowpack provides additional water, dry soils absorb this flow rather than allowing the runoff to reach and refill downstream reservoirs. The complexity of accommodating the demands of growing cities in the basin, combined with a diminishing water supply, makes planning and management a growing challenge in the region. Effective planning and management in the CRB must consider the fact that the region is experiencing a shift to a more arid climate, so the historical streamflow record is no longer a sufficient representation of future hydroclimate. Tools are needed that allow the exploration of the vulnerabilities of the CRB under future conditions that are plausible but haven’t been experienced before.

Herein lies the value of exploratory modeling as discussed in Chapter 2.2 of the eBook. Exploratory modeling can be used to run simulation models under vast ensembles of potential futures and then identify those with consequential effects. Rather than trying to predict future conditions exactly, the focus is shifted to a more holistic approach of discovering what conditions can lead to futures that cause system vulnerability or failure. The simulation model we use is a monthly water allocation model called StateMod which the State of Colorado uses for planning and simulating future management policies. In order to use this model in an exploratory modeling context, we need to find a way of generating plausible future scenarios that can be used to force the model. The future of the CRB will be defined by many uncertainties including demand, population growth, and policy changes pertaining to the Colorado River Compact. In this tutorial, we will focus specifically on the changing hydroclimate in the region and on a tool that allows us to generate streamflow scenarios for basins in the state of Colorado.

An HMM as a tool for exploring internal variability

As discussed in Chapter 3.3.6 of the eBook, there are many different ways to generate synthetic input data. To generate traces of streamflow, one can use physically-informed GCM projections, purely statistical approaches, or a hybrid of the two. In this tutorial, we utilize a Hidden Markov Model (HMM)-based streamflow generator, which is a purely statistical approach, to generate synthetic traces of streamflow. The HMM is (1) conditioned directly on regional streamflow and therefore better able to capture local drought dynamics than downscaled CMIP5/6 projections and (2) more computationally tractable to help facilitate exploratory analyses. The HMM can be utilized to generate stationary representations of streamflow and multipliers or shifts can be applied to the parameters to create non-stationary traces.

Decadal Drought Conditions

If we plot the historical annual streamflow at the outlet gage of the Upper Colorado River Basin (UCRB) in Figure 1, we can quantify decadal drought risk in the observed period. We use a metric proposed in Ault et al. (2014), in which a decadal drought is defined as a period where the 11-year running mean of the available record falls more than half a standard deviation below the overall mean of the record. By this metric, the region has experienced two decadal droughts.

Figure 1: Historical annual streamflow at the outlet of the UCRB. Decadal drought periods are overlaid in tan.
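As a rough sketch of this drought metric (the function below is my own shorthand, not the notebook's exact code; setting window=35 gives the multi-decadal version discussed later):

import pandas as pd

def decadal_drought_years(annual_flow, window=11, threshold=None):
    """Flag years whose centered running mean falls more than half a standard
    deviation below the record's overall mean (after Ault et al., 2014)."""
    series = pd.Series(annual_flow)
    if threshold is None:
        threshold = series.mean() - 0.5 * series.std()
    rolling_mean = series.rolling(window, center=True).mean()
    return rolling_mean < threshold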

However, this historical record has very few instances of extremes and thus underestimates flood/drought hazards. It is important to remember that the streamflow we have observed in the region over the last century is only one instance of the hydrology that could occur, since the atmosphere is an inherently stochastic system. Thus, we require a tool that will allow us to see multiple plausible realizations of the streamflow record to understand the internal variability that characterizes the historical period. This is where the HMM-based generator comes in. If a system follows a Markov process, it switches between a number of “hidden states” dictated by a transition matrix. The UCRB is best described by a two-state system: wet or dry. Each state has its own probability distribution; we use a Gaussian distribution for each hidden state, so each state is defined by a mean and standard deviation. One can draw samples from these distributions to create synthetic flows that fit the properties of the historical distribution. HMMs are an attractive choice for this region because they can simulate persistence (i.e., long-duration droughts), which is needed to create future conditions that the region will likely experience. Figure 2 shows the 2-state Gaussian HMM that is fit in the tutorial and the Viterbi sequence (i.e., the classification of each year into a state).

Figure 2: a) Example wet and dry state distributions and b) resulting Viterbi sequence
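As a rough sketch of how such a model can be fit and sampled (here I use the hmmlearn package; the input file name and the log transform of the flows are my assumptions, so see the notebook for the exact implementation):

import numpy as np
from hmmlearn.hmm import GaussianHMM

# Hypothetical input: a 1-D array of annual flows at the UCRB outlet gage
annual_flow = np.loadtxt("annual_flow.txt")
logQ = np.log(annual_flow).reshape(-1, 1)

# Fit a 2-state Gaussian HMM (wet and dry states) and recover the Viterbi sequence
model = GaussianHMM(n_components=2, n_iter=1000).fit(logQ)
viterbi_states = model.predict(logQ)

# Draw a synthetic 105-year trace from the fitted (stationary) generator
sampled_logQ, hidden_states = model.sample(105)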

If we sample 105-year traces from the HMM, we are creating alternative hydrologies that the region could have experienced that are different from what was experienced historically. In the historical dataset, there are two decadal drought periods, which amounts to 20 years classified as being in decadal drought conditions (1959-1969 and 2000-2010). However, if we sample 1000 times from the model to create 1000 more 105-year traces, we have the capacity to create realizations with many more years classified in decadal drought conditions. Figure 3 shows a histogram of the years classified in decadal drought across the 1000 samples. The historic trace is shown as a red vertical line. Note how the HMM is creating many instances of more than 20 years of decadal drought conditions.

Figure 3: The number of years in a 105-year trace classified to be in decadal drought conditions across 1000 samples from the stationary generator
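Reusing the helper and fitted model from the sketches above, counting drought years across a large sample might look like the following (here I hold the drought threshold at its historical value; the notebook's exact bookkeeping may differ):

# Count decadal-drought years in 1000 synthetic 105-year traces
hist_threshold = annual_flow.mean() - 0.5 * annual_flow.std()
drought_years_per_trace = []
for _ in range(1000):
    sampled_logQ, _ = model.sample(105)
    trace = np.exp(sampled_logQ.flatten())
    drought_years_per_trace.append(int(decadal_drought_years(trace, threshold=hist_threshold).sum()))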

Figure 4 shows one of the 1000 traces where 53 years (or half of the whole record) are classified in decadal drought conditions.

Figure 4: Example trace from stationary generator which is characterized by more frequent drought conditions

Multi-Decadal Drought Conditions

The decadal drought metric in Ault et al. (2014) can be extended to a multi-decadal drought context by inspecting where the 35-year running mean dips below the drought threshold. By this metric, the region has experienced no multi-decadal droughts (Figure 5).

Figure 5: Historical annual streamflow at the outlet of the UCRB. The 35-year rolling mean does not cross the drought threshold.

However, the stationary synthetic generator does have the capacity to create multi-decadal drought, such as the trace shown in Figure 6, even though the historical record does not have one.


Figure 6: An example trace from the stationary generator that exhibits a multi-decadal drought

The traces shown in Figures 4 and 6 are characterized by substantially more years of drought, and since they were created with a stationary generator, they are examples of what the UCRB could have experienced historically rather than what is shown in Figure 1. These examples demonstrate that natural variability alone can create records that contain more frequent droughts than what has been seen historically (even in the absence of climate change).

Drought Severity

Is the HMM creating droughts that are more severe than what was experienced historically? Let’s use the decadal drought as an example. We can define severity as the maximum volume (cf) by which the rolling mean dips below the drought threshold. Figure 7 shows the histogram of these dips; once again, we see that the generator is creating droughts that dip substantially below the threshold, unlike the historical behavior observed in Figure 1 and represented by the red line.


Figure 7: The severity of the decadal drought periods in each 105-year trace across 1000 samples from the stationary generator
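Continuing the sketch above, this severity measure can be computed as the largest exceedance of the threshold by the rolling mean (again, my shorthand rather than the notebook's exact code):

def decadal_drought_severity(annual_flow, window=11, threshold=None):
    """Return the maximum dip of the centered rolling mean below the drought threshold."""
    series = pd.Series(annual_flow)
    if threshold is None:
        threshold = series.mean() - 0.5 * series.std()
    rolling_mean = series.rolling(window, center=True).mean()
    return (threshold - rolling_mean).clip(lower=0).max()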

Figure 8 shows an example trace of severe drought conditions where the rolling mean substantially dips below the drought threshold. Thus we see that natural variability alone can create records that contain more severe droughts than what has been seen historically.

Figure 8: An example trace from the stationary generator that exhibits severe drought conditions.

Conclusion

This tutorial was meant to demonstrate the utility of the HMM-based generator in a stationary context, with a focus on its ability to create more frequent and severe decadal and multi-decadal droughts. Sampling from the generator allows us to explore the rich internal variability that characterizes the region. We demonstrate that the internal variability captured by the generator can produce more frequent and severe droughts than what was experienced historically, and that this occurs in the absence of any adjustments to the HMM to create climate change proxies. Thus, it is important to acknowledge that future droughts may become even more severe and frequent as phases of internal variability interact with climate changes.

Though we won’t go over it in this blog post, the technical appendix also covers adjustments that can be made to the HMM parameterization to create non-stationary traces which you can find here.

References

1 The Nature Conservancy: https://www.nature.org/en-us/about-us/where-we-work/priority-landscapes/colorado-river/colorado-river-in-crisis/

Ault, T. R., Cole, J. E., Overpeck, J. T., Pederson, G. T., & Meko, D. M. (2014). Assessing the risk of persistent drought using climate model simulations and paleoclimate data. Journal of Climate, 27(20), 7529-7549.

Using drawdata and Mercury in Jupyter Notebooks

I wrote a post last year on Enhancing Jupyter Notebooks for Teaching. Through my time at Cornell as a TA or mentoring our undergrad researchers, I’ve learned how important it is to find fun and interesting ways for students to play with data and to feel more comfortable diving into traditionally difficult topics like machine learning. This post features two cool new libraries that were brought to my attention recently that can help jazz up your tutorials:

(1) drawdata: allows a student to interactively draw a dataset (lines, histograms, scatterplots) with up to four different labels. The dataset and labels can be saved as JSONs and CSVs and also directly copied into Pandas dataframes. This can be useful to facilitate interactive machine learning tutorials.

(2) Mercury: converts a Jupyter notebook to a web app. This is especially useful for classroom tutorials because it doesn’t require that students have Jupyter, or even Python, installed.

I’ve combined both of these functionalities into a notebook that is focused on classification of a dataset with 2 labels using support vector machines (SVMs) that can be found here.

drawdata

First let’s install drawdata by typing pip install drawdata into our command line. Next, let’s follow through the steps of the Jupyter notebook. We’ll import the rest of our libraries and draw out a simple linearly-separable dataset.
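For reference, launching the drawing widget looks roughly like this (draw_scatter() is the entry point in the version of drawdata I assume here; newer releases may expose a different interface):

from drawdata import draw_scatter

# Opens an interactive canvas in the notebook output; draw points for up to
# four labels, then click "copy csv" to place the drawn dataset on the clipboard.
draw_scatter()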

By clicking “copy csv” we can copy this exact dataset to a Pandas dataframe. Upon inspection, we see that each datapoint has 2 features (an x and y coordinate) and a label that is either “a” or “b”. Let’s also adjust the labels to be [-1,1] for the purposes of using some classifiers from scikit-learn. We then fit a very basic support vector classifier and plot the decision boundary.

# Imports assumed for this notebook (pandas, numpy, matplotlib, scikit-learn, ipywidgets)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.inspection import DecisionBoundaryDisplay
from ipywidgets import interact, fixed

# Copy the dataset drawn with drawdata from the clipboard into a dataframe
data = pd.read_clipboard(sep=",")

# Rename the labels to integers: 'a' -> -1, everything else -> 1
for i in range(0, len(data)):
    if(data.iloc[i,2] == 'a'):
        data.iloc[i,2] = -1
    else:
        data.iloc[i,2] = 1
data.iloc[:,2]=data.iloc[:,2].astype('int')


#Create our datasets

X=np.array(data.iloc[:,0:2])
y=np.array(data.iloc[:,2])


#Create a 60/40 training and test split 

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=42)
    
#Fit classifier 

clf=svm.SVC(kernel='linear',C=0.025)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print("The classification accuracy is", score)
#The classification accuracy is 1.0


#Plot original dataset that you drew 

cm = plt.cm.get_cmap('cool_r')
cm_bright = ListedColormap(["purple", "cyan"])
figure = plt.figure(figsize=(8, 6))
ax = plt.subplot(1,1,1)
#Plot decision boundary
ax.set_title("Classification Boundary")
DecisionBoundaryDisplay.from_estimator(clf, X, cmap=cm, alpha=0.8, ax=ax, eps=0.5)
# Plot the training points
ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train,cmap=cm_bright , edgecolors="k")
# Plot the testing points
ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6, edgecolors="k")

We can also plot the margin and support vectors.

This is a straightforward classification example, but we can also look at non-linearly separable examples.

Ruh roh! Our linear classifier is never going to work for this dataset! We’ll have to use the kernel trick: that is, we need to map our data to a higher dimensional space with an additional feature that may better help us separate the two classes. We can create an additional feature that captures the distance from a central point (300, 300). This allows us to find a hyperplane that can separate the points in a higher dimensional space.

#Create an additional dimension
r = np.sqrt((X[:, 0]-300)**2+(X[:, 1]-300)**2)

def plot_3D(elev=30, azim=30, X=X, y=y):
    ax = plt.subplot(projection='3d')
    ax.scatter3D(X[:, 0], X[:, 1], r, c=y, s=50, cmap=cm_bright,edgecolors="k")
    ax.view_init(elev=elev, azim=azim)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_zlabel('r')
    

interact(plot_3D, elev=(-90, 90), azim=(-180, 180),
         X=fixed(X), y=fixed(y));

Thus, we can use a radial basis function (RBF) kernel in our SVC, rather than a linear kernel, to define a non-linear classification boundary.
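As a minimal sketch of that change (the variable names reuse those defined earlier in the notebook; the C and gamma settings are illustrative rather than taken from the tutorial):

# Refit the classifier with an RBF kernel instead of a linear one
clf_rbf = svm.SVC(kernel='rbf', C=1.0, gamma='scale')
clf_rbf.fit(X_train, y_train)
print("The RBF kernel classification accuracy is", clf_rbf.score(X_test, y_test))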

Mercury

Once you have a notebook that you are satisfied with, you can make it into an interactive web app by adding a YAML header. The YAML header also lets users interact with certain variables in the code and then run it. For example, users can select a classifier from a drop-down menu or slide through different values of C, and finally click the run button to execute the notebook.

A YAML header is added to the first RAW cell in the notebook. Here I want the user to be able to slide through the hardness of the margin (between 0 and 100).

---
title: SVM Classifier
description: Implement linear and non-linear classifiers
show-code: True
params:
    C_margin:
        input: slider 
        label: Value of Margin 
        value: 0.025
        min: 0
        max: 100
---
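For context on how the slider value reaches the code: Mercury looks for a notebook cell that defines a variable with the same name as the YAML parameter and overwrites its default with the widget setting when the user clicks Run. A minimal sketch, reusing names from earlier in the notebook (the exact cell contents here are my assumption):

# Default value; Mercury replaces this with the slider setting when the app runs
C_margin = 0.025

# Use the slider-controlled margin hardness when fitting the classifier
clf = svm.SVC(kernel='linear', C=C_margin)
clf.fit(X_train, y_train)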

Then we install Mercury (pip install mljar-mercury), go to the command line, and type mercury watch Classification_Tutorial.ipynb. The command will print a local address that you can open in your browser. Users can then draw their own dataset, adjust the hardness of the margin, and run the notebook.

Resources

Some of the SVM plotting code was borrowed from: https://jakevdp.github.io/PythonDataScienceHandbook/05.07-support-vector-machines.html

More info on Mercury can be found in this post: https://medium.com/@MLJARofficial/mercury-convert-jupyter-notebook-to-a-web-app-by-adding-yaml-header-872ce4e53676

Structuring a Python Project: Recommendations and a Template Example

Motivation:

The start of a new year is a good (albeit, relatively arbitrary) time to reassess aspects of your workflow.

I, like many people, taught myself Python by jumping into different projects. The consequence of this ad-hoc learning was that I did not learn some of the fundamentals until much later in my project development.

At the end of the Fall ’23 semester, I found myself spending a lot of time cleaning up repositories and modules that I had constructed in the preceding months. I ended up restructuring and reorganizing significant portions of the code base, implementing organizational practices that I had learned after its conception.

This post is intended to save the reader from making the same mistake. Here, I present a recommended structure for new Python projects and discuss the main components. This is largely targeted at Python users who have not had formal Python training, or who are working on their first significantly sized project.

I provide an example_python_project, hosted on GitHub, as a demonstration and to serve as a template for beginners.

Content:

  1. Recommended structure
  2. Example project repository
  3. Project components (with example project)
    1. Modules, packages, __init__.py
    2. Executable
    3. README
    4. Documentation
    5. Tests
    6. Requirements
    7. LICENSE
  4. Conclusion

Recommended project structure

I’ll begin by presenting a recommended project structure. Further down, I provide some explanation and justification for this structure.

My example_python_project follows recommendations from Kenneth Reitz, co-author of The Hitchhiker’s Guide to Python (1), while also drawing from a similar demo project, samplemod, by Navdeep Gill (2).

The project folder follows this structure:

example_python_project/ 
├── sample_package/ 
│   ├── subpackage/ 
│   │   ├── __init__.py 
│   │   └── subpackage_module.py 
│   ├── __init__.py 
│   ├── helpers.py 
│   └── module.py 
├── docs/ 
│   └── documentation.md 
├── tests/ 
│   ├── __init__.py 
│   └── basic_test.py 
├── main.py 
├── README.md 
├── requirements.txt 
└── LICENSE

If you are just starting a project, it may seem unnecessarily complex to begin with so much modularity. It may seem easier to open a .py file and start freewheeling. Here, I highlight several reasons why it is important to take care when initially constructing a Python project. Some of these reasons include:

  1. Maintenance: A well-structured project makes it easier to understand the code, fix bugs, and add new features. This is especially important as the project grows in size and complexity.
  2. Collaboration: When working on a project with multiple developers, a clear structure makes it easier for everyone to understand how the code is organized and how different components interact with each other.
  3. Scalability: A well-structured project allows you to scale up the codebase, adding new features and sub-components, without making the codebase hard to understand or maintain.
  4. Testing: A well-structured project makes it easier to write automated tests for the code. This helps to ensure that changes to the code do not break existing functionality.
  5. Distribution: A well-structured project makes it easier to distribute the code as a package. This allows others to easily install and use the code in their own projects.

Overall, taking the time to structure a Python project when starting can save a lot of time and heartache in the long run, by making the project easier to understand, maintain, and expand.


Example Project Repository

The repository containing this example project is available on GitHub here: example_python_project.

The project follows the recommended project structure above, and is designed to use modular functions from the module, helpers, and subpackage_module. It is intended to be a skeleton upon which you can build up your own project.

If you would like to experiment with your own copy of the code, you can fork a copy of the repository, or Download a ZIP version.

Project overview

The project is a silly riddle program with no real usefulness other than demonstrating the structure of the project. The bulk of the work is done in the main_module_function(), which first prints a riddle on the screen, then iteratively uses the helper_function() and subpackage_function() to try and “solve” the riddle. Both of these functions simply return a random True/False, and are repeatedly called until the riddle is solved (when status == True).

Below is a visual representation of how the different functions are interacting. The green-box functions are contained within the main sample_package, while the blue-box function is stored in the subpackage.
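To make the flow concrete, here is a minimal sketch of what that loop might look like; this is my paraphrase of the logic described above, not the repository's exact code:

import random

def helper_function():
    """Stand-in helper: randomly 'solves' the riddle."""
    print("The helper_function is helping!")
    return random.choice([True, False])

def subpackage_function():
    """Stand-in subpackage function: randomly 'solves' the riddle."""
    print("The subpackage_function is being used now.")
    return random.choice([True, False])

def main_module_function():
    """Print the riddle, then call the helpers until one of them solves it."""
    print("What runs but has no feet, roars but has no mouth?")
    status = False
    while not status:
        status = helper_function()
        if not status:
            status = subpackage_function()
    print('The riddle is solved, the answer is "A River"!')
    return status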

The program can then be executed from a command line using the main.py executable:

C:\<your-local-directory>\example_python_project> python main.py

The output will first print out the riddle, then print statements indicating which functions are being used to “solve” the riddle. This is simply a means of demonstrating how the different functions are being activated, and not necessarily a recommended “Best Practice”.

A normal output should resemble something similar to the below, although there may be more or fewer print statements depending upon how many times it takes the random generator to produce a “True” solution:

Here is a riddle, maybe `sample_package` can help solve it:

   What runs but has no feet, roars but has no mouth?

Lets see if the helper can solve the riddle.
The helper_function is helping!
The helper could not solve it.
Maybe the subpackage_module can help.
The subpackage_function is being used now.
The subpackage solved it, the answer is "A River"!

Project components

Modules, packages, __init__.py, oh my!

Before going any further, I want to take time to clarify some vocabulary which is helpful for understanding component interactions.

Module:
A module is simply a file ending in .py, which contains functions and or variables.

Package:
A package is a collection of modules (.py files) which relate to one another, and which contains an __init__.py file.

__init__.py
Inclusion of an __init__.py file (pronounced “dunder in-it”) within a folder indicates to Python that the folder is a package. Often, the __init__ module is empty; however, it can be used to import other modules or functions, which are then stored in the package namespace, making them available for use later.

For example, in my sample_package/__init__.py, I import all contents of the module.py and subpackage_module.py:

# Import the all functions from main and sub modules
from .module import *
from .subpackage.subpackage_module import *

This allows all of the functions stored within module to be callable directly from the primary sample_package, rather than specifying the various sub-structures needed to access each function. For example, by including from .subpackage.subpackage_module import *, I am able to run:

# IF __init__ imports all content from main and sub modules then you can do this:
import sample_package
sample_package.subpackage_module_function()

Rather than requiring the following fully-nested call, which is necessary when the __init__.py is empty:

# IF __init__ is EMPTY, then you need to do this:
import sample_package
sample_package.subpackage.subpackage_module.subpackage_module_function()

Notably, an __init__.py is not necessary to use modules and functions within a folder; however, customizing the imports present in the package's __init__.py gives you more control over how your project is used. As the project increases in complexity, strategic use of imports within the __init__ can keep your main executable functions cleaner.

Executables

So, you’ve crafted a Python project with a sleek, modular package design. The next step is to set up a single file which will execute the package.

Inclusion of a single executable has the benefit of providing a single-entry point for other users who want to run the program without getting lost in the project.

In the example_python_project, this is done with main.py:

# Import the main package
import sample_package

def run():
    solved = sample_package.main_module_function()
    return solved

# Run the function if this is the main file executed
if __name__ == "__main__":
    run()

The program can then be executed from the command line:

C:\<your-local-directory>\example_python_project> python main.py

README

The README.md file is typically someone’s first encounter with your project. This is particularly true if the project is hosted on GitHub, where the README.md is used as the home-page of a repository.

A README.md file should include, at minimum, a brief description of the project, its purpose, and clear instructions on how to use the code.

Often, README files are written in Markdown, which includes simple text-formatting options. You can find a basic Markdown Cheat Sheet here. reStructuredText is also commonly used, and even plain .txt files may be suitable.

Documentation

Great code requires great documentation. Initializing a new project with a dedicated docs/ folder may help hold you accountable for documenting the code along the way.

For information on how to use Sphinx and reStructuredText to create clean webpage-based documentation, you can see Rohini Gupta’s post on Using Python, Sphinx, and reStructuredText to Create a Book (and Introducing our eBook: Addressing Uncertainty in Multisector Dynamics Research!).

Tests

Bugs aren’t fun. They are even less fun when code that was bug-free yesterday contains bugs today. Implementing automated tests in your project can help verify functionality throughout the development process and catch bugs when they arise.

It is recommended to implement Unit Tests which verify individual components of the project. These tests should assert that function output properties align with expectations. As you develop your project in a modular way, you can progressively add tests for each new component, then run all of the tests before sharing or pushing the project to others.

A standard Python installation comes with the unittest package, which is intended to provide a framework for these tests. I provide an example test below, but a deeper dive into the unittest framework may require a dedicated future post.

In the example_python_project, I include the basic_test.py to verify that the solution generated by main_module_function() is True using the unittest package:

import sample_package
import unittest

# Define a test suite targeting specific functionality
class BasicTestSuite(unittest.TestCase):
    """Basic test cases."""
    def test_that_riddle_is_solved(self):
        solved = sample_package.module.main_module_function()
        self.assertTrue(solved)


if __name__ == '__main__':
    unittest.main()

Running the basic_test module from the command line produces an “OK” if everything runs smoothly; otherwise, it will provide information regarding which tests are failing.
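For example, from the project root this can be run with the built-in test runner (the exact invocation depends on your directory layout):

python -m unittest tests.basic_test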

----------------------------------------------------------------------
Ran 1 test in 0.004s
OK

Currently, the example_python_project requires the basic_test module to be executed manually. To learn more about automating this process, you can see Andrew Dirck’s 2020 post: Automate unit testing with Github Actions for research codes.

Requirements

The requirements.txt is a simple text file which lists the dependencies, or necessary packages that are required to run the code.

This can be particularly important if your code requires a specific version of a package, since the package version can be specified in the requirements.txt. Specifying a particular package version (e.g., numpy==1.24.1) can improve the reliability of your code, since different versions of these packages may operate in different ways in the future.

Here is an example of what might be inside a requirements.txt, if the numpy and scipy packages are necessary (note that modules from the standard library, such as random, do not need to be listed):

numpy==1.24.1
scipy==1.10.0

Users can easily install all the packages listed in requirements.txt using the command:

pip install -r requirements.txt

License

I’ll keep this section brief, since I am far from legally qualified to comment much on licensing. However, general advice suggests that if you are sharing code publicly, it is safest to include a license of some sort.

Inclusion of an open-source license allows other users to comfortably use and modify your code for their own purposes, allowing you to contribute to and benefit the broader community, while also protecting the original author from future liabilities associated with its use by others.

The GNU General Public License is one of the most common open-source licenses; however, if you would like to know more about the different options, you can find some guidance here: https://choosealicense.com/

Conclusions

If you are an experienced Python user, there may not be anything new for you here, but at the least I hope it serves as a reminder to take care in your project design this year.

Additionally, this is likely to be one part in a multi-part Introduction to Python series that I will be writing for future members of our research group. With that in mind, check back here later this spring for the subsequent parts if that interests you.

Best of luck!

References

(1) Reitz, K., & Schlusser, T. (2016). The Hitchhiker’s Guide to Python: Best Practices for Development. O’Reilly Media, Inc.
(2) Navdeep Gill. 2019. samplemod. https://github.com/navdeep-G/samplemod. (2023).

QPPQ Method for Streamflow Prediction at Ungauged Sites – Python Demo

Streamflow Predictions in Ungauged Basins

Predicting streamflow at ungauged locations is a classic problem in hydrology which has motivated significant research over the last several decades (Hrachowitz, Markus, et al., 2013).

There are numerous different methods for performing predictions in ungauged basins, but here I focus on the common QPPQ method.

Below, I describe the method and further down I provide a walkthrough demonstration of QPPQ streamflow prediction in Python.

The supporting code can be found on my GitHub here: QPPQ_Streamflow_Prediction_Tutorial.

QPPQ-Method for Streamflow Prediction

Fennessey (1994) introduced the QPPQ method for streamflow estimation at ungauged locations.

The QPPQ method is commonly used and encouraged by the USGS, and is described at length in their publication Estimation of Daily Mean Streamflow for Ungaged Stream locations… (2016).

QPPQ consists of four key steps:

  1. Estimating an FDC for the target catchment of interest, \hat{FDC}_{pred}.
  2. Identifying K donor locations nearest to the target point.
  3. Transferring the timeseries of nonexceedance probabilities (\mathbf{P}) from the donor site(s) to the target.
  4. Using the estimated FDC for the target to map the donated nonexceedance timeseries, \mathbf{P}, back to streamflow.

To limit the scope of this tutorial, let’s assume that an estimate of the FDC at the target site, \hat{FDC}_{pred}, has already been determined through some other statistical or observational study.

Then the QPPQ method can be described more formally. Given an ungauged location with an estimated FDC, \hat{FDC}_{pred}, and a set of observed streamflow timeseries \mathbf{q_i} at K neighboring sites, such that:

Q_{obs} = [\mathbf{q_1}, \mathbf{q_2}, ..., \mathbf{q_k}]

With corresponding K FDCs at the observation locations:

FDC = [FDC_1, FDC_2, ..., FDC_k]

The FDCs are used to convert the observed streamflow timeseries, \mathbf{q_{obs, i}}, to non-exceedance probability timeseries, \mathbf{p_{obs, i}}.

\displaystyle FDC_i : \mathbf{q_{i}} \to \mathbf{p_i}

We can then perform a weighted-aggregation of the non-exceedance probability timeseries to estimate the non-exceedance timeseries at the ungauged location. It is most common to apply an inverse-squared-distance weight to each observed timeseries such that:

\mathbf{p_{pred}} = \sum_{i=1}^{k} w_i \mathbf{p_i}

where w_i \propto 1 / d_i^2, d_i is the distance from observation i to the ungauged location, and the weights are normalized such that \sum_{i=1}^{k} w_i = 1.

Finally, the estimated FDC at the ungauged location, \hat{FDC}_{pred} is used to convert the non-exceedance timeseries to streamflow timeseries:

\hat{FDC}_{pred} : \mathbf{p}_{pred} \to \mathbf{q}_{pred}


Looking at this formulation, and the sequence of transformations that take place, I hope it is clear why the method is rightfully called the QPPQ method.

This method is summarized well by the following graphic, taken from the USGS Report on the topic:

In the following section, I step through an implementation of this method in Python.

Tutorial

All Python scripts used in this tutorial can be found in my GitHub repository: QPPQ_Streamflow_Prediction_Tutorial.

In order to run the scripts in this tutorial yourself, you will need to have installed a few Python libraries, listed in requirements.txt. Running pip install -r requirements.txt from the command line, while inside a local copy of the directory, will install all of these packages.

Data retrieval

I collected USGS streamflow data from N gages using the HyRiver suite for Python.

If you would like to learn more about hydro-environmental data acquisition in Python, check out my old post on Efficient hydroclimatic data accessing with HyRiver for Python.

The script used to retrieve the data is available here. If you would like to experiment with this method in other regions, you can change the region variable on line 21, which specifies the corners of a bounding-box within which gage data will be retrieved:


# Specify time-range and region of interest
dates = ("2000-01-01", "2010-12-31")
region = (-108.0, 38.0, -105.0, 40.0)

Above, I specify a region west of the Rocky Mountains in Colorado. Running generate_streamflow_data.py, I found 73 USGS gage locations (blue circles).

Fig: Locations of USGS gages used in this demo.

QPPQ Model

The file QPPQ.py contains the method outlined above, defined as the StreamflowGenerator class object.

The StreamflowGenerator has four key methods (or functions):

class StreamflowGenerator():
    def __init__(self, args):
        ...
    def get_knn(self):
        ...
    def calculate_nep(self):
        ...
    def interpolate_fdc(self):
        ...
    def predict_streamflow(self):
        ...
        return predicted_flow

The get_knn method finds the K gage locations nearest to the prediction location, and stores the distances to these observation locations (self.knn_distances) along with the indices associated with these locations (self.knn_indices).

    def get_knn(self):
        """Find distances and indices of the K nearest gage locations."""
        distances = np.zeros((self.n_observations))
        for i in range(self.n_observations):
            distances[i] = geodesic(self.prediction_location, self.observation_locations[i,:]).kilometers
        self.knn_indices = np.argsort(distances, axis = 0)[0:self.K].flatten()
        self.knn_distances = np.sort(distances, axis = 0)[0:self.K].flatten()
        return

The next method, calculate_nep, calculates the NEP of a flow at an observation location at time t, or P(Q \leq q_t)_{i}.

    def calculate_nep(self, KNN_Q, Q_t):
        "Finds the NEP at time t based on historic observatons."
        # Get historic FDC
        fdc = np.quantile(KNN_Q, self.fdc_neps, axis = 1).T
        # Find nearest FDC value
        nearest_quantile = np.argsort(abs(fdc - Q_t), axis = 1)[:,0]
        nep = self.fdc_neps[nearest_quantile]
        return nep	

The interpolate_fdc method performs a linear interpolation of the discrete FDC and estimates the flow for a given NEP.

    def interpolate_fdc(self, nep, fdc):
        "Performs linear interpolation of discrete FDC at a NEP."
        tol = 0.0000001
        if nep == 0:
            nep = np.array(tol)
        sq_diff = (self.fdc_neps - nep)**2

        # Index of nearest discrete NEP
        ind = np.argmin(sq_diff)

        # Handle edge-cases
        if nep <= self.fdc_neps[0]:
            return fdc[0]
        elif nep >= self.fdc_neps[-1]:
            return fdc[-1]

        if self.fdc_neps[ind] <= nep:
            flow_range = fdc[ind:ind+2]
            nep_range = self.fdc_neps[ind:ind+2]
        else:
            flow_range = fdc[ind-1:ind+1]
            nep_range = self.fdc_neps[ind-1:ind+1]

        slope = (flow_range[1] - flow_range[0])/(nep_range[1] - nep_range[0])
        # Interpolate upward from the lower endpoint of the bracketing interval
        flow = flow_range[0] + slope*(nep - nep_range[0])
        return flow

Finally, predict_streamflow(self, *args) combines these other methods and performs the full QPPQ prediction.

    def predict_streamflow(self, args):
        "Run the QPPQ prediction method for a single locations."
        self.prediction_location = args['prediction_location']
        self.prediction_fdc = args['prediction_fdc']
        self.fdc_quantiles = args['fdc_quantiles']
        self.n_predictions = self.prediction_location.shape[0]

        ### Find nearest K observations
        self.get_knn()
        knn_flows = self.historic_flows[self.knn_indices, :]

        ### Calculate weights as inverse square distance
        self.wts = 1/self.knn_distances**2

        # Normalize weights
        self.norm_wts = self.wts/np.sum(self.wts)

        ### Create timeseries of NEP at observation locations
        self.observed_neps = np.zeros_like(knn_flows)
        for t in range(self.T):
            self.observed_neps[:,t] = self.calculate_nep(knn_flows, knn_flows[:,t:t+1])

        ### Calculate predicted NEP timeseries using weights
        self.predicted_nep = np.zeros((self.n_predictions, self.T))
        for t in range(self.T):
            self.predicted_nep[:,t] = np.sum(self.observed_neps[:,t:t+1].T * self.norm_wts)

        ### Convert NEP timeseries to flow timeseries
        self.predicted_flow = np.zeros_like(self.predicted_nep)
        for t in range(self.T):
            nep_t = self.predicted_nep[0,:][t]
            self.predicted_flow[0,t] = self.interpolate_fdc(nep_t, self.prediction_fdc)

        return self.predicted_flow

The predict_streamflow method is the only function called directly by the user; get_knn, calculate_nep, and interpolate_fdc are all called internally by predict_streamflow.
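Putting the pieces together, the prediction step amounts to the inverse-distance-weighted aggregation at the heart of QPPQ:

$$\widehat{NEP}(t) = \sum_{i=1}^{K} w_i \, NEP_i(t), \qquad w_i = \frac{1/d_i^2}{\sum_{j=1}^{K} 1/d_j^2}$$

where $d_i$ is the geodesic distance to the $i$-th nearest gage and $NEP_i(t)$ is the non-exceedance probability of the flow observed at gage $i$ at time $t$. The predicted flow timeseries is then recovered by passing $\widehat{NEP}(t)$ back through the prediction site's FDC using interpolate_fdc.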

Generate streamflow predictions

The script run_QPPQ_predictions.py runs the model and produces predictions at a test site. First, the data generated by generate_streamflow_data.py is loaded:

import numpy as np
from QPPQ import StreamflowGenerator

### Load Data
gage_locations = np.loadtxt('./data/observed_gage_locations.csv', delimiter = ',')
observed_flows = np.loadtxt('./data/observed_streamflow.csv', delimiter = ',')

The FDCs at each site are estimated at 200 discrete quantiles:

fdc_quantiles = np.linspace(0, 1, 200)
observed_fdc = np.quantile(observed_flows, fdc_quantiles, axis = 1).T

A random test site is selected and removed from the observation data:

# Select a test site and remove it from observations
test_site = np.random.randint(0, gage_locations.shape[0])

# Store test site
test_location = gage_locations[test_site,:]
test_flow = observed_flows[test_site, :]
test_site_fdc = observed_fdc[test_site, :]

# Remove test site from observations
gage_locations = np.delete(gage_locations, test_site, axis = 0)
observed_flows = np.delete(observed_flows, test_site, axis = 0)

When initializing the StreamflowGenerator, we must provide an array of gage location data (longitude, latitude), historic streamflow data at each gage, and the number of nearest neighbors, K, to include in the timeseries aggregation.

I’ve set up the StreamflowGenerator class to receive these inputs as a dictionary, such as:

# Specify model prediction_inputs
QPPQ_args = {'observed_locations' : gage_locations,
            'historic_flows' : observed_flows,
            'K' : 20}

# Initialize the model
QPPQ_model = StreamflowGenerator(QPPQ_args)

Similarly, the prediction arguments are provided as a dictionary to the predict_streamflow function:

# Specify the prediction arguments
prediction_args = {'prediction_location': test_location,
                    'prediction_fdc' : test_site_fdc,
                    'fdc_quantiles' : fdc_quantiles}
                    
# Run the prediction
predicted_flow = QPPQ_model.predict_streamflow(prediction_args)

I made a function, plot_predicted_and_observed, which allows for a quick visual check of the predicted timeseries compared to the observed timeseries:

from plot_functions import plot_predicted_and_observed
plot_predicted_and_observed(predicted_flow, test_flow)
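The plot_functions module is not reproduced here; if you just want a quick look at the results, a minimal stand-in for plot_predicted_and_observed (assuming matplotlib, and keeping the same name and argument order) could be:

import numpy as np
import matplotlib.pyplot as plt

def plot_predicted_and_observed(predicted_flow, observed_flow):
    """Minimal stand-in: overlay predicted and observed flow timeseries."""
    predicted = np.atleast_2d(predicted_flow)[0]   # first (and only) prediction row
    fig, ax = plt.subplots(figsize = (10, 4))
    ax.plot(observed_flow, color = 'black', label = 'Observed')
    ax.plot(predicted, color = 'cornflowerblue', label = 'QPPQ predicted')
    ax.set_xlabel('Time (days)')
    ax.set_ylabel('Streamflow')
    ax.legend()
    plt.show()
    return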

This produces some nice-looking predictions!

Fig: Comparison of observed streamflow and streamflow generated through QPPQ.

One benefit of working with the StreamflowGenerator as a Python class object is that we can retrieve the internal variables for further inspection.

For example, I can call QPPQ_model.knn_distances to retrieve the distances to the K nearest neighbors used to predict the flow at the ungauged location. In this case, the gages used to make the prediction above were located 4.44, 13.23, 18.38, … kilometers away.
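Other attributes set during the prediction can be inspected the same way, for example:

# Inspect internal variables after running a prediction
print(QPPQ_model.knn_distances[:3])   # distances (km) to the nearest gages
print(QPPQ_model.knn_indices[:3])     # indices of those gages in the observation arrays
print(QPPQ_model.norm_wts[:3])        # normalized inverse-square-distance weights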

Caveat and Conclusion

It is worth highlighting one major caveat to this example: the FDC used for the prediction site was perfectly known from the historic record. In most cases, the FDC will not be known when making predictions in ungauged basins. Rather, estimates of the FDC will need to be used, and thus the prediction quality shown above is somewhat of an ideal case when performing a QPPQ in ungauged basins.

There are numerous methods for estimating FDCs at ungauged sites, including the Generalized Pareto distribution approximation proposed by Fennessey (1994) or, more recently, the use of neural networks, as highlighted in Worland et al. (2019).
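As a rough illustration only (this is a generic parametric fit with scipy, not the Fennessey (1994) procedure itself), an FDC could be approximated by fitting a distribution to whatever flow data are available and evaluating its quantiles at the desired NEPs:

import numpy as np
from scipy.stats import genpareto

# Illustrative sketch: fit a Generalized Pareto distribution to a hypothetical
# flow sample and evaluate it at discrete non-exceedance probabilities
sample_flows = np.random.gamma(2, 10, 500)
shape, loc, scale = genpareto.fit(sample_flows)
fdc_neps = np.linspace(0.001, 0.999, 200)
estimated_fdc = genpareto.ppf(fdc_neps, shape, loc = loc, scale = scale)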

Hopefully this tutorial helped you become familiar with a foundational streamflow prediction method.

References

Fennessey, Neil Merrick. "A Hydro-Climatological Model of Daily Stream Flow for the Northeast United States." Order No. 9510802 Tufts University, 1994. Ann Arbor: ProQuest. Web. 21 Nov. 2022.

Hrachowitz, Markus, et al. "A decade of Predictions in Ungauged Basins (PUB)—a review." Hydrological Sciences Journal 58.6 (2013): 1198-1255.

Razavi, Tara, and Paulin Coulibaly. "Streamflow prediction in ungauged basins: review of regionalization methods." Journal of Hydrologic Engineering 18.8 (2013): 958-975.

Stuckey, M.H., 2016, Estimation of daily mean streamflow for ungaged stream locations in the Delaware River Basin, water years 1960–2010: U.S. Geological Survey Scientific Investigations Report 2015–5157, 42 p., http://dx.doi.org/10.3133/sir20155157.

Worland, Scott C., et al. "Prediction and inference of flow duration curves using multioutput neural networks." Water Resources Research 55.8 (2019): 6850-6868.