# Fitting Hidden Markov Models Part II: Sample Python Script

This is the second part of a two-part blog series on fitting hidden Markov models (HMMs). In Part I, I explained what HMMs are, why we might want to use them to model hydro-climatological data, and the methods traditionally used to fit them. Here I will show how to apply these methods using the Python package hmmlearn using annual streamflows in the Colorado River basin at the Colorado/Utah state line (USGS gage 09163500). First, note that to use hmmlearn on a Windows machine, I had to install it on Cygwin as a Python 2.7 library.

For this example, we will assume the state each year is either wet or dry, and the distribution of annual streamflows under each state is modeled by a Gaussian distribution. More states can be considered, as well as other distributions, but we will use a two-state, Gaussian HMM here for simplicity. Since streamflow is strictly positive, it might make sense to first log-transform the annual flows at the state line so that the Gaussian models won’t generate negative streamflows, so that’s what we do here.

After installing hmmlearn, the first step is to load the Gaussian hidden Markov model class with from hmmlearn.hmm import GaussianHMM. The fit function of this class requires as inputs the number of states (n_components, here 2 for wet and dry), the number of iterations to run of the Baum-Welch algorithm described in Part I (n_iter; I chose 1000), and the time series to which the model is fit (here a column vector, Q, of the annual or log-transformed annual flows). You can also set initial parameter estimates before fitting the model and only state those which need to be initialized with the init_params argument. This is a string of characters where ‘s’ stands for startprob (the probability of being in each state at the start), ‘t’ for transmat (the probability transition matrix), ‘m’ for means (mean vector) and ‘c’ for covars (covariance matrix). As discussed in Part I it is good to test several different initial parameter estimates to prevent convergence to a local optimum. For simplicity, here I simply use default estimates, but this tutorial shows how to pass your own. I call the model I fit on line 5 model.

Among other attributes and methods, model will have associated with it the means (means_) and covariances (covars_) of the Gaussian distributions fit to each state, the state probability transition matrix (transmat_), the log-likelihood function of the model (score) and methods for simulating from the HMM (sample) and predicting the states of observed values with the Viterbi algorithm described in Part I (predict). The score attribute could be used to compare the performance of models fit with different initial parameter estimates.

It is important to note that which state (wet or dry) is assigned a 0 and which state is assigned a 1 is arbitrary and different assignments may be made with different runs of the algorithm. To avoid confusion, I choose to reorganize the vectors of means and variances and the transition probability matrix so that state 0 is always the dry state, and state 1 is always the wet state. This is done on lines 22-26 if the mean of state 0 is greater than the mean of state 1.


from hmmlearn.hmm import GaussianHMM

def fitHMM(Q, nSamples):
# fit Gaussian HMM to Q
model = GaussianHMM(n_components=2, n_iter=1000).fit(np.reshape(Q,[len(Q),1]))

# classify each observation as state 0 or 1
hidden_states = model.predict(np.reshape(Q,[len(Q),1]))

# find parameters of Gaussian HMM
mus = np.array(model.means_)
sigmas = np.array(np.sqrt(np.array([np.diag(model.covars_[0]),np.diag(model.covars_[1])])))
P = np.array(model.transmat_)

# find log-likelihood of Gaussian HMM
logProb = model.score(np.reshape(Q,[len(Q),1]))

# generate nSamples from Gaussian HMM
samples = model.sample(nSamples)

# re-organize mus, sigmas and P so that first row is lower mean (if not already)
if mus[0] > mus[1]:
mus = np.flipud(mus)
sigmas = np.flipud(sigmas)
P = np.fliplr(np.flipud(P))
hidden_states = 1 - hidden_states

return hidden_states, mus, sigmas, P, logProb, samples

# log transform the data and fit the HMM
logQ = np.log(AnnualQ)
hidden_states, mus, sigmas, P, logProb, samples = fitHMM(logQ, 100)



Okay great, we’ve fit an HMM! What does the model look like? Let’s plot the time series of hidden states. Since we made the lower mean always represented by state 0, we know that hidden_states == 0 corresponds to the dry state and hidden_states == 1 to the wet state.


from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np

def plotTimeSeries(Q, hidden_states, ylabel, filename):

sns.set()
fig = plt.figure()

xs = np.arange(len(Q))+1909
ax.plot(xs, Q, c='k')

ax.set_xlabel('Year')
ax.set_ylabel(ylabel)
handles, labels = plt.gca().get_legend_handles_labels()
fig.legend(handles, labels, loc='lower center', ncol=2, frameon=True)
fig.savefig(filename)
fig.clf()

return None

plt.switch_backend('agg') # turn off display when running with Cygwin
plotTimeSeries(logQ, hidden_states, 'log(Flow at State Line)', 'StateTseries_Log.png')



Wow, looks like there’s some persistence! What are the transition probabilities?


print(model.transmat_)



Running that we get the following:

[[ 0.6794469   0.3205531 ]
[ 0.34904974  0.65095026]]

When in a dry state, there is a 68% chance of transitioning to a dry state again in the next year, while in a wet state there is a 65% chance of transitioning to a wet state again in the next year.

What does the distribution of flows look like in the wet and dry states, and how do these compare with the overall distribution? Since the probability distribution of the wet and dry states are Gaussian in log-space, and each state has some probability of being observed, the overall probability distribution is a mixed, or weighted, Gaussian distribution in which the weight of each of the two Gaussian models is the unconditional probability of being in their respective state. These probabilities make up the stationary distribution, π, which is the vector solving the equation π = πP, where P is the probability transition matrix. As briefly mentioned in Part I, this can be found using the method described here: π = (1/ Σi[ei])e in which e is the eigenvector of PT corresponding to an eigenvalue of 1, and ei is the ith element of e. The overall distribution for our observations is then Y ~ π0N(μ0,σ02) + π1*N(μ1,σ12). We plot this distribution and the component distributions on top of a histogram of the log-space annual flows below.


from scipy import stats as ss

def plotDistribution(Q, mus, sigmas, P, filename):

# calculate stationary distribution
eigenvals, eigenvecs = np.linalg.eig(np.transpose(P))
one_eigval = np.argmin(np.abs(eigenvals-1))
pi = eigenvecs[:,one_eigval] / np.sum(eigenvecs[:,one_eigval])

x_0 = np.linspace(mus[0]-4*sigmas[0], mus[0]+4*sigmas[0], 10000)
fx_0 = pi[0]*ss.norm.pdf(x_0,mus[0],sigmas[0])

x_1 = np.linspace(mus[1]-4*sigmas[1], mus[1]+4*sigmas[1], 10000)
fx_1 = pi[1]*ss.norm.pdf(x_1,mus[1],sigmas[1])

x = np.linspace(mus[0]-4*sigmas[0], mus[1]+4*sigmas[1], 10000)
fx = pi[0]*ss.norm.pdf(x,mus[0],sigmas[0]) + \
pi[1]*ss.norm.pdf(x,mus[1],sigmas[1])

sns.set()
fig = plt.figure()
ax.hist(Q, color='k', alpha=0.5, density=True)
l1, = ax.plot(x_0, fx_0, c='r', linewidth=2, label='Dry State Distn')
l2, = ax.plot(x_1, fx_1, c='b', linewidth=2, label='Wet State Distn')
l3, = ax.plot(x, fx, c='k', linewidth=2, label='Combined State Distn')

handles, labels = plt.gca().get_legend_handles_labels()
fig.legend(handles, labels, loc='lower center', ncol=3, frameon=True)
fig.savefig(filename)
fig.clf()

return None

plotDistribution(logQ, mus, sigmas, P, 'MixedGaussianFit_Log.png')



Looks like a pretty good fit – seems like a Gaussian HMM is a decent model of log-transformed annual flows in the Colorado River at the Colorado/Utah state line. Hopefully you can find relevant applications for your work too. If so, I’d recommend reading through this hmmlearn tutorial, from which I learned how to do everything I’ve shown here.

# Fitting Hidden Markov Models Part I: Background and Methods

Hydro-climatological variables often exhibit long-term persistence caused by regime-shifting behavior in the climate, such as the El Niño-Southern Oscillations (ENSO). One popular way of modeling this long-term persistence is with hidden Markov models (HMMs) [Thyer and Kuczera, 2000; Akintug and Rasmussen, 2005; Bracken et al., 2014]. What is an HMM? Recall from my five blog posts on weather generators, that the occurrence of precipitation is often modeled by a (first order) Markov model in which the probability of rain on a given day depends only on whether or not it rained on the previous day. A (first order) hidden Markov model is similar in that the climate “state” (e.g., wet or dry) at a particular time step depends only on the state from the previous time step, but the state in this case is “hidden,” i.e. not observable. Instead, we only observe a random variable (discrete or continuous) that was generated under a particular state, but we don’t know what that state was.

For example, imagine you are a doctor trying to diagnose when an individual has the flu. On any given day, this person is in one of two states: sick or healthy. These states are likely to exhibit great persistence; when the person gets the flu, he/she will likely have it for several days or weeks, and when he/she is heathy, he/she will likely stay healthy for months. However, suppose you don’t have the ability to test the individual for the flu virus and can only observe his/her temperature. Different (overlapping) distributions of body temperatures may be observed depending on whether this person is sick or healthy, but the state itself is not observed. In this case, the person’s temperature can be modeled by an HMM.

So why are HMMs useful for describing hydro-climatological variables? Let’s go back to the example of ENSO. Maybe El Niño years in a particular basin tend to be wetter than La Niña years. Normally we can observe whether or not it is an El Niño year based on SST anomalies in the tropical Pacific, but suppose we only have paleodata of tree ring widths. We can infer from the tree ring data (with some error) what the total precipitation might have been in each year of the tree’s life, but we may not know what the SST anomalies were those years. Or even if we do know the SST anomalies, maybe there is another more predictive regime-shifting teleconnection we haven’t yet discovered. In either case, we can model the total annual precipitation with an HMM.

What is the benefit of modeling precipitation in these cases with an HMM as opposed to say, an autoregressive model? Well often the year to year correlation of annual precipitation may not actually be that high, but several consecutive wet or consecutive dry years are observed [Bracken et al., 2014]. Furthermore, paleodata suggests that greater persistence (e.g. megadroughts) in precipitation is often observed than would be predicted by autoregressive models [Ault et al., 2013; Ault et al., 2014]. This is where HMMs may come in handy.

Here I will explain how to fit HMMs generally, and in Part II I will show how to apply these methods using the Python package hmmlearn. To understand how to fit HMMs, we first need to define some notation. Let Yt be the observed variable at time t (e.g., annual streamflow). The distribution of Yt depends on the state at time t, Xt (e.g., wet or dry). Let’s assume for simplicity that our observations can be modeled by Gaussian distributions. Then f(Yt | Xt = i) ~ N(μi,σi 2) and f(Yt | Xt = j) ~ N(μj,σj 2) for a two-state HMM. The state at time t, Xt, depends on the state at the previous time step, Xt-1. Let P be the state transition matrix, where each element pi,j represents the probability of transitioning from state i at time t to state j at time t+1, i.e. pij = P(Xt+1 = j | Xt = i). P is a n x n matrix where n is the number of states (e.g. 2 for wet and dry). In all Markov models (hidden or not), the unconditional probability of being in each state, π can be modeled by the equation π = πP, where π is a 1 x n vector in which each element πi represents the unconditional probability of being in state i, i.e. πi = P(Xt = i). π is also called the stationary distribution and can be calculated from P as described here. Since we have no prior information on which to condition the first set of observations, we assume the initial probability of being in each state is the stationary distribution.

In fitting a two-state Gaussian HMM, we therefore need to estimate the following vector of parameters: θ = [μ0, σ0, μ1, σ1, p00, p11]. Note p01 = 1 – p00 and p10 = 1 – p11. The most common approach to estimating these parameters is through the Baum-Welch algorithm, an application of Expectation-Maximization built off of the forward-backward algorithm. The first step of this process is to set initial estimates for each of the parameters. These estimates can be random or based on an informed prior. We then begin with the forward step, which computes the joint probability of observing the first t observations and ending up in state i at time t, given the initial parameter estimates: P(Xt = i, Y1 = y1, Y2 = y2, …, Yt = yt | θ). This is computed for all t ϵ {1, …, T}. Then in the backward step, the conditional probability of observing the remaining observations after time t given the state observed at time t is computed: P(Yt+1 = yt+1, …, YT = yT | Xt=i, θ). Using Bayes’ theorem, it can shown that the product of the forward and backward probabilities is proportional to the probability of ending up in state i at time t given all of the observations, i.e. P(Xt = i | Y1 = y1,…, YT = yT, θ). This is derived below:

1) $P(X_t=i \vert Y_1=y_1,..., Y_T=y_T, \theta) = \frac{P(Y_1=y_1, ..., Y_T=y_T \vert X_t=i, \theta) P(X_t=i \vert \theta)}{P(Y_1=y_1, ..., Y_T=y_t \vert \theta)}$

2) $P(X_t=i \vert Y_1=y_1,..., Y_T=y_T, \theta) = \frac{P(Y_1=y_1, ..., Y_t=y_t \vert X_t=i, \theta) P(Y_{t+1} = y_{t+1}, ..., Y_T=y_T \vert X_t=i, \theta) P(X_t=i \vert \theta)}{P(Y_1=y_1, ..., Y_T=y_t \vert \theta)}$

3) $P(X_t=i \vert Y_1=y_1,..., Y_T=y_T, \theta) = \frac{P(X_t=i, Y_1=y_1, ..., Y_t=y_t \vert \theta) P(Y_{t+1} = y_{t+1}, ..., Y_T=y_T \vert X_t=i, \theta)}{P(Y_1=y_1, ..., Y_T=y_t \vert \theta)}$

4) $P(X_t=i \vert Y_1=y_1,..., Y_T=y_T, \theta) \propto P(X_t=i, Y_1=y_1, ..., Y_t=y_t \vert \theta) P(Y_{t+1}=y_{t+1}, ..., Y_T=y_T \vert X_t=i, \theta)$

The first equation is Bayes’ Theorem. The second equation is derived by the conditional independence of the observations up to time t (Y1, Y2, …, Yt) and the observations after time t (Yt+1, Yt+2, …, YT), given the state at time t (Xt). The third equation is derived from the definition of conditional probability, and the fourth recognizes the denominator as a normalizing constant.

Why do we care about the probability of ending up in state i at time t given all of the observations (the left hand side of the above equations)? In fitting a HMM, our goal is to find a set of parameters, θ, that maximize this probability, i.e. the likelihood function of the state trajectories given our observations. This is therefore equivalent to maximizing the product of the forward and backward probabilities. We can maximize this product using Expectation-Maximization. Expectation-Maximization is a two-step process for maximum likelihood estimation when the likelihood function cannot be computed directly, for example, because its observations are hidden as in an HMM. The first step is to calculate the expected value of the log likelihood function with respect to the conditional distribution of X given Y and θ (the left hand side of the above equations, or proportionally, the right hand side of equation 4). The second step is to find the parameters that maximize this function. These parameter estimates are then used to re-implement the forward-backward algorithm and the process repeats iteratively until convergence or some specified number of iterations. It is important to note that the maximization step is a local optimization around the current best estimate of θ. Hence, the Baum-Welch algorithm should be run multiple times with different initial parameter estimates to increase the chances of finding the global optimum.

Another interesting question beyond fitting HMMs to observations is diagnosing which states the observations were likely to have come from given the estimated parameters. This is often performed using the Viterbi algorithm, which employs dynamic programming (DP) to find the most likely state trajectory. In this case, the “decision variables” of the DP problem are the states at each time step, Xt, and the “future value function” being optimized is the probability of observing the true trajectory, (Y1, …,YT), given those alternative possible state trajectories. For example, let the probability that the first state was k be V1,k. Then V1,k = P(X1 = k) = P(Y1 = y1 | X1 = k)πk. For future time steps, Vt,k = P(Yt = yt | Xt = k)pik*Vt-1,i where i is the state in the previous time step. Thus, the Viterbi algorithm finds the state trajectory (X1, …, XT) maximizing VT,k.

Now that you know how HMMs are fit using the Baum-Welch algorithm and decoded using the Viterbi algorithm, read Part II to see how to perform these steps in practice in Python!

# Logistic Regression for Scenario Discovery

As most of you probably know, scenario discovery is an exploratory modeling approach [Bankes, 1993] that involves stress-testing proposed policies over plausible future “states of the world” (SOWs) to discover conditions under which those policies would fail to meet performance goals [Bryant and Lempert, 2010]. The scenario discovery process is therefore an exercise in statistical classification. Two commonly used methods used for the scenario discovery process are the Patient Rule Induction Method (PRIM; Friedman and Fisher [1999]) and Classification and Regression Trees (CART; Breiman et al. [1984]), both of which are included in the OpenMORDM R package and Rhodium Python package.

Another commonly used method in classification that hasn’t been given much attention in the scenario discovery literature is logistic regression. Logistic regression models estimate the probability that an event is classified as a success (1) as opposed to a failure (0) as a function of different covariates. This allows for the definition of “safe operating spaces,” or factor combinations leading to success, based on the probability with which one would like to be able to achieve the specified performance goal(s). We may not know the probability that a particular SOW will occur, but through the logistic regression we can estimate the probability of success in that SOW should it occur. The logistic regression can also identify which factors most influence a policy’s ability to meet those performance goals.

This blog post will illustrate how to build logistic regression models in Python for scenario discovery using the Red River basin as an example. Here we are interested in determining under what streamflow and demand characteristics reservoir operating policies are unable to protect Hanoi from the 100-yr flood. We assume operators want to ensure protection to this event with at least 95% reliability and use logistic regression to estimate under what combination of streamflow and demand characteristics they will be able to do so.

The form of the logistic regression model is given by Equation 1, where pi represents the probability that performance in the ith SOW is classified as a success and Xi represents a vector of covariates (in this case, streamflow and demand characteristics) describing the ith SOW:

1) $\ln\Bigg(\frac{p_i}{1-p_i}\Bigg) = \mathbf{X_i^\intercal}\mathbf{\beta}$.

The coefficients, $\mathbf{\beta}$, on the covariates are estimated using Maximum Likelihood Estimation.

To determine which streamflow and demand characteristics are most important in explaining successes and failures, we can compare the McFadden’s pseudo-R2 values associated with different models that include different covariates. McFadden’s pseudo-R2, $R_{McFadden}^2$, is given by Equation 2:

2) $R_{McFadden}^2 = 1 - \frac{\ln \hat{L}(M_{Full})}{\ln \hat{L}(M_{Intercept})}$

where $\ln \hat{L}(M_{Full})$ is the log-likelihood of the full model and $\ln \hat{L}(M_{Intercept})$ is the log-likelihood of the intercept model, i.e. a model with no covariates beyond the intercept. The intercept model therefore predicts the mean probability of success across all SOWs. $R_{McFadden}^2$ is a measure of improvement of the full model over the intercept model.

A common approach to fitting regression models is to add covariates one-by-one based on which most increase R2 (or in this case, $R_{McFadden}^2$), stopping once the increase of an additional covariate is marginal. The covariate that by itself most increases $R_{McFadden}^2$ is therefore the most important in predicting a policy’s success. To do this in Python, we will use the library statsmodels.

Imagine we have a pandas dataframe, dta that includes n columns of streamflow and demand characteristics describing different SOWs (rows) and a final column of 0s and 1s representing whether or not the policy being evaluated can provide protection to the 100-yr flood in that SOW (0 for no and 1 for yes). Assume the column of 0s and 1s is the last column and it is labeled Success. We can find the value of $R_{McFadden}^2$ for each covariate individually by running the following code:

import pandas as pd
import statsmodels.api as sm
from scipy import stats

# deal with fact that calling result.summary() in statsmodels.api
# calls scipy.stats.chisqprob, which no longer exists
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)

def fitLogit(dta, predictors):
# concatenate intercept column of 1s
dta['Intercept'] = np.ones(np.shape(dta)[0])

# get columns of predictors
cols = dta.columns.tolist()[-1:] + predictors

#fit logistic regression
logit = sm.Logit(dta['Success'], dta[cols])
result = logit.fit()

return result

n = len(dta.columns) - 1
for i in range(n):
predictors = dta.columns.tolist()[i:(i+1)]
result = fitLogit(dta, predictors)
print(result.summary())



A sample output for one predictor, Col1 is shown below. This predictor has a pseudo-R2 of 0.1138.

Once the most informative predictor has been determined, additional models can be tested by adding more predictors one-by-one as described above. Suppose that through this process, one finds that the first 3 columns of dta (Col1,Col2 and Col3) are the most informative for predicting success on providing protection to the 100-yr flood, while the subsequent columns provide little additional predictive power. We can use this model to visualize the probability of success as a function of these 3 factors using a contour map. If we want to show this as a 2D projection, the probability of success can only be shown for combinations of 2 of these factors. In this case, we can hold the third factor constant at some value, say its base value. This is illustrated in the code below, which also shows a scatter plot of the SOWs. The dots are shaded light blue if the policy succeeds in providing protection to the 100-yr flood in that world, and dark red if it does not.


import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import pandas as pd
import statsmodels.api as sm

def fitLogit(dta, predictors):
# concatenate intercept column of 1s
dta['Intercept'] = np.ones(np.shape(dta)[0])

# get columns of predictors
cols = dta.columns.tolist()[-1:] + predictors

#fit logistic regression
logit = sm.Logit(dta['Success'], dta[cols])
result = logit.fit()

return result

def plotContourMap(ax, result, constant, dta, contour_cmap, dot_cmap, levels, xgrid, ygrid, \
xvar, yvar, base):

# find probability of success for x=xgrid, y=ygrid
X, Y = np.meshgrid(xgrid, ygrid)
x = X.flatten()
y = Y.flatten()
if constant == 'x3': # 3rd predictor held constant at base value
grid = np.column_stack([np.ones(len(x)),x,y,np.ones(len(x))*base[2]])
elif constant == 'x2': # 2nd predictor held constant at base value
grid = np.column_stack([np.ones(len(x)),x,np.ones(len(x))*base[1],y])
else: # 1st predictor held constant at base value
grid = np.column_stack([np.ones(len(x)),np.ones(len(x))*base[0],x,y])

z = result.predict(grid)
Z = np.reshape(z, np.shape(X))

contourset = ax.contourf(X, Y, Z, levels, cmap=contour_cmap)
ax.scatter(dta[xvar].values, dta[yvar].values, c=dta['Success'].values, edgecolor='none', cmap=dot_cmap)
ax.set_xlim(np.min(X),np.max(X))
ax.set_ylim(np.min(Y),np.max(Y))
ax.set_xlabel(xvar,fontsize=24)
ax.set_ylabel(yvar,fontsize=24)
ax.tick_params(axis='both',labelsize=18)

return contourset

# build logistic regression model with first 3 columns of predictors from dta
predictors = dta.columns.tolist()[0:3]
result = fitLogit(dta, predictors)

# define color map for dots representing SOWs in which the policy
# succeeds (light blue) and fails (dark red)
dot_cmap = mpl.colors.ListedColormap(np.array([[227,26,28],[166,206,227]])/255.0)

# define color map for probability contours
contour_cmap = mpl.cm.get_cmap(‘RdBu’)

# define probability contours
contour_levels = np.arange(0.0, 1.05,0.1)

# define grid of x (1st predictor), y (2nd predictor), and z (3rd predictor) dimensions
# to plot contour map over
xgrid = np.arange(-0.1,1.1,0.01)
ygrid = np.arange(-0.1,1.1,0.01)
zgrid = np.arange(-0.1,1.1,0.01)

# define base values of 3 predictors
base = [0.5, 0.5, 0.5]

fig = plt.figure()
# plot contour map when 3rd predictor ('x3') is held constant
plotContourMap(ax, result, 'x3', dta, contour_cmap, dot_cmap, contour_levels, xgrid, ygrid, \
'Col1', 'Col2', base)
# plot contour map when 2nd predictor ('x2') is held constant
contourset = plotContourMap(ax, result, 'x2', dta, contour_cmap, dot_cmap, contour_levels, xgrid, zgrid, \
'Col1', 'Col3', base)

cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7])
cbar = fig.colorbar(contourset, cax=cbar_ax)
cbar_ax.set_ylabel('Probability of Success',fontsize=20)
yticklabels = cbar.ax.get_yticklabels()
cbar.ax.set_yticklabels(yticklabels,fontsize=18)
fig.set_size_inches([14.5,8])
fig.savefig('Fig1.png')
fig.clf()



This produces the following figure:

We can also use the probability contours discovered above to define “safe operating spaces” as combinations of these 3 factors under which the evaluated policy is able to succeed in providing protection to the 100-yr flood with some reliability, say 95%. The hyperplane of factor combinations defining that 95% probability contour can be determined by setting p to 0.95 in Equation 2. Again, to plot 2-D projections of that hyperplane, the values of the other covariates can be held constant at their base values. The code below illustrates how to do this with a 95% boundary.


# define colormap for classifying boundary between failure and success
class_cmap = mpl.colors.ListedColormap(np.array([[251,154,153],[31,120,180]])/255.0)

# define probability cutoff between failure and success
class_levels = [0.0, 0.95, 1.0]

fig = plt.figure()
# plot contour map when 3rd predictor ('x3') is held constant
plotContourMap(ax, result, 'x3', dta, class_cmap, dot_cmap, class_levels, xgrid, ygrid, \
'Col1', 'Col2', base)

# plot contour map when 2nd predictor ('x2') is held constant
plotContourMap(ax, result, 'x2', dta, class_cmap, dot_cmap, class_levels, xgrid, zgrid, \
'Col1', 'Col3', base)

fig.set_size_inches([14.5,8])
fig.savefig('Fig2.png')
fig.clf()



This produces the following figure, where the light red region is the parameter ranges in which the policy cannot provide protection to the 100-yr flood with 95% reliability, and the dark blue region is the “safe operating space” in which it can.

All code for this example can be found here.

# Policy Diagnostics with Time-Varying and State Space PDFs

Some of my work has focused on “policy diagnostics,” analyzing how policies (in this case, multi-reservoir operating policies) that favor different objectives perform under different conditions and why. This can guide analysts in choosing a policy to implement, or even in determining objectives that policies should be optimized to (cough, cough, see Quinn et al., 2017). One of the more effective ways we’ve found to analyze these policies is by examining their probabilistic behavior through time-varying PDFs and state-space PDFs. This blog post will illustrate these two types of figures and provide sample code for creating them. The code for the versions of these figures generated in the above paper can be found here.

Below is an example of how time-varying PDFs can provide insights into system behavior using the Red River basin as an example. These plots show the probability of the water level in Hanoi (y axis in both figures) being at different levels on different days of the year (x axis in both figures), from the beginning of the monsoon in May to the end of the dry season in April. Red shades represent high probabilities and blue shades represent low probabilities. The left plot shows these dynamics for a policy minimizing the 100-yr annual maximum water level, while the right plot shows them for a policy maximizing the 100-yr average hydropower production. The flood-minimizing policy has a lower probability of overtopping the dikes and crossing a stakeholder-elicited alarm level of 11.25 m (Second Alarm) compared to the hydropower-maximizing policy. However, this reduction in the probability of high floodwaters requires a higher probability of crossing a lower stakeholder-elicited alarm level of 6 m (First Alarm), highlighting a tradeoff between reducing severe floods and nuisance floods. There are also different dynamics during the dry season, where the flood-minimizing solution releases more to both meet agricultural demand at the time of planting and lower the reservoir level in advance of the next monsoon. There is a bifurcation in the high probability density streak during this time, suggesting how much needs to be released depends on what is needed to lower the reservoir level to an acceptable pre-flood season level or meet the agricultural demand.

To create this figure, we simply need an N x 365 matrix of the water level on each day (column) of N different annual simulations (rows). Let’s call this matrix ‘data’. We then need to reformat ‘data’ into a Y x 365 matrix, where Y is the number of “bins” along the y axis (between ymin and ymax) that we are going to group our data into to make a histogram for each day. Finally, we just need to count how many data points occur in each bin, and then divide this count by the total number of simulated years, N. This is shown using the function ‘getTimeVaryingProbs.py’ below assuming we have two datasets we want to plot, ‘data1’ and ‘data2’.

import numpy as np

def getTimeVaryingProbs(data, N, Y, ymin, ymax):
'''Finds the probability of being at a specific water level (y) on a given day.'''
probMatrix = np.zeros([Y,365])
step = (ymax-ymin)/Y
for i in range(np.shape(probMatrix)[0]):
for j in range(np.shape(probMatrix)[1]):
count = ((data[:,j] < ymax-step*i) & (data[:,j] >= ymax-step*(i+1))).sum()
probMatrix[i,j] = count/N

return probMatrix

probMatrix1 = getTimeVaryingProbs(data1, 100000, 366, 0, 15)
probMatrix2 = getTimeVaryingProbs(data2, 100000, 366, 0, 15)


After calling ‘getTimeVaryingProbs.py’ to generate ‘probMatrix1’ and ‘probMatrix2’, we can plot the time-varying PDF of each of these using ‘imshow’. Since we want to compare the two side-by-side, we need to make sure they’re normalized over the same range. We do this by finding the lowest and highest probabilities over the two matrices and normalizing our color map over that range:

import numpy as np
from matplotlib import pyplot as plt
import matplotlib as mpl

# find the lowest and highest probability between two probability matrices
probMin = min(np.min(probMatrix1), np.min(probMatrix2))
probMax = max(np.max(probMatrix1), np.max(probMatrix2))

fig = plt.figure()
sm = ax1.imshow(probMatrix1, cmap='RdYlBu', origin='upper', norm=mpl.colors.Normalize(vmin=probMin, vmax=ProbMax))
sm = ax2.imshow(probMatrix2, cmap='RdYlBu', origin='upper', norm=mpl.colors.Normalize(vmin=probMin, vmax=ProbMax))
cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7])
cbar = fig.colorbar(sm, cax=cbar_ax)
cbar.ax.set_ylabel('Probability Density',fontsize=16)
fig.show()


In some cases, it may be helpful to plot a log transformation of the probability matrices, as was done in the above paper since streamflows are highly skewed.

Below is an example of how state-space PDFs can provide insights into system behavior, again using the Red River basin as an example. These plots show the probability of the water level in Hanoi (y axis in both figures) being at different levels when the total storage in the reservoirs upstream is at different levels (x axis in both figures). Red shades again represent high probabilities and blue shades represent low probabilities. The left plot shows these dynamics for a compromise policy optimized to one set of objectives, while the right plot shows them for a compromise policy optimized to a different set of objectives. The compromise policy on the left fills up the reservoirs without releasing much water downstream, resulting in a high probability streak along the bottom of the plot at low water levels. This will favor hydropower production. However, when the largest reservoirs fill up, they are forced to spill, resulting in a spike in the water level downstream. This occurs before the smaller reservoirs have filled up, and in wet years, results in overtopping before total system storage has been reached. Consequently, this policy does not make full use of the total system storage for flood protection. The compromise policy on the right, however, increases the system storage and water level simultaneously, releasing some of what initially comes in to leave empty capacity for future flood events. This strategy makes better use of the full system capacity, only resulting in overtopping when maximum system storage has been reached. The difference in the behavior of these two compromise solutions highlights the need to test rival framings of objective functions for multi-objective optimization, as some formulations may suffer unintended consequences like the formulation on the left.

To create this figure, we need two N x 365 matrices, one of the water level on each day (column) of N different annual simulations (rows) and another of the total system storage. Let’s call these matrices ‘h’ and ‘s’, respectively. We then need to use these matrices to populate a Y x X probability matrix, where Y is the number of bins along the y axis (water level, h) between ymin and ymax, and X the number of bins along the x axis (storage, s) between xmin and xmax. This probability matrix will represent a 2D histogram of how many data points lie in a combined water level and storage bin.  We again just need to count how many data points occur in each bin, and then divide this count by the total number of simulated points (365N). This is shown using the function ‘getJointProbs.py’ below assuming we have two joint datasets, (h1,s1) and (h2,s2), that we want to plot.

def getJointProbs(h, s, Y, X, ymax, ymin, xmax, xmin):
'''Finds the probability of being at a specific water level (h) and storage (s) jointly'''
probMatrix = np.zeros([Y,X])
yStep = (ymax-ymin)/np.shape(probMatrix)[0]
xStep = (xmax-xmin)/np.shape(probMatrix)[1]
for i in range(np.shape(s)[0]):
for j in range(np.shape(s)[1]):
# figure out which "box" the simulated s and h are in
row = int(np.floor((ymax-h[i,j])/yStep))
col = int(np.ceil((s[i,j]-xmin)/xStep))
if row < np.shape(probMatrix)[0] and col < np.shape(probMatrix)[1]:
probMatrix[row,col] = probMatrix[row,col] + 1

# calculate probability of being in each box
probMatrix = probMatrix/(np.shape(s)[0]*np.shape(s)[1])

return probMatrix

probMatrix1 = getJointProbs(h1, s1, 100, 100, 15, 0, 3.0E10, 0.5E10)
probMatrix2 = getJointProbs(h2, s2, 100, 100, 15, 0, 3.0E10, 0.5E10)


After calling ‘getJointProbs.py’ to generate ‘probMatrix1’ and ‘probMatrix2’, we can again plot the state space PDF of each of these using ‘imshow’ as illustrated in the second snippet of code above. Now go analyze how your reservoirs are probabilistically operating as a system!

# Open Source Streamflow Generator Part II: Validation

This is the second part of a two-part blog post on an open source synthetic streamflow generator written by Matteo Giuliani, Jon Herman and me, which combines the methods of Kirsch et al. (2013) and Nowak et al. (2010) to generate correlated synthetic hydrologic variables at multiple sites. Part I showed how to use the MATLAB code in the subdirectory /stationary_generator to generate synthetic hydrology, while this post shows how to use the Python code in the subdirectory /validation to statistically validate the synthetic data.

The goal of any synthetic streamflow generator is to produce a time series of synthetic hydrologic variables that expand upon those in the historical record while reproducing their statistics. The /validation subdirectory of our repository provides Python plotting functions to visually and statistically compare the historical and synthetic hydrologic data. The function plotFDCrange.py plots the range of the flow duration (probability of exceedance) curves for each hydrologic variable across all historical and synthetic years. Lines 96-100 should be modified for your specific application. You may also have to modify line 60 to change the dimensions of the subplots in your figure. It’s currently set to plot a 2 x 2 figure for the four LSRB hydrologic variables.

plotFDCrange.py provides a visual, not statistical, analysis of the generator’s performance. An example plot from this function for the synthetic data generated for the Lower Susquehanna River Basin (LSRB) as described in Part I is shown below. These probability of exceedance curves were generated from 1000 years of synthetic hydrologic variables. Figure 1 indicates that the synthetic time series are successfully expanding upon the historical observations, as the synthetic hydrologic variables include more extreme high and low values. The synthetic hydrologic variables also appear unbiased, as this expansion is relatively equal in both directions. Finally, the synthetic probability of exceedance curves also follow the same shape as the historical, indicating that they reproduce the within-year distribution of daily values.

Figure 1

To more formally confirm that the synthetic hydrologic variables are unbiased and follow the same distribution as the historical, we can test whether or not the synthetic median and variance of real-space monthly values are statistically different from the historical using the function monthly-moments.py. This function is currently set up to perform this analysis for the flows at Marietta, but the site being plotted can be changed on line 76. The results of these tests for Marietta are shown in Figure 2. This figure was generated from a 100-member ensemble of synthetic series of length 100 years, and a bootstrapped ensemble of historical years of the same size and length. Panel a shows boxplots of the real-space historical and synthetic monthly flows, while panels b and c show boxplots of their means and standard deviations, respectively. Because the real-space flows are not normally distributed, the non-parametric Wilcoxon rank-sum test and Levene’s test were used to test whether or not the synthetic monthly medians and variances were statistically different from the historical. The p-values associated with these tests are shown in Figures 2d and 2e, respectively. None of the synthetic medians or variances are statistically different from the historical at a significance level of 0.05.

Figure 2

In addition to verifying that the synthetic generator reproduces the first two moments of the historical monthly hydrologic variables, we can also verify that it reproduces both the historical autocorrelation and cross-site correlation at monthly and daily time steps using the functions autocorr.py and spatial-corr.py, respectively. The autocorrelation function is again set to perform the analysis on Marietta flows, but the site can be changed on line 39. The spatial correlation function performs the analysis for all site pairs, with site names listed on line 75.

The results of these analyses are shown in Figures 3 and 4, respectively. Figures 3a and 3b show the autocorrelation function of historical and synthetic real-space flows at Marietta for up to 12 lags of monthly flows (panel a) and 30 lags of daily flows (panel b). Also shown are 95% confidence intervals on the historical autocorrelations at each lag. The range of autocorrelations generated by the synthetic series expands upon that observed in the historical while remaining within the 95% confidence intervals for all months, suggesting that the historical monthly autocorrelation is well-preserved. On a daily time step, most simulated autocorrelations fall within the 95% confidence intervals for lags up to 10 days, and those falling outside do not represent significant biases.

Figure 3

Figures 4a and 4b show boxplots of the cross-site correlation in monthly (panel a) and daily (panel b) real-space hydrologic variables for all pairwise combinations of sites. The synthetic generator greatly expands upon the range of cross-site correlations observed in the historical record, both above and below. Table 1 lists which sites are included in each numbered pair of Figure 4. Wilcoxon rank sum tests (panels c and d) for differences in median monthly and daily correlations indicate that pairwise correlations are statistically different (α=0.5) between the synthetic and historical series at a monthly time step for site pairs 1, 2, 5 and 6, and at a daily time step for site pairs 1 and 2. However, biases for these site pairs appear small in panels a and b. In summary, Figures 1-4 indicate that the streamflow generator is reasonably reproducing historical statistics, while also expanding on the observed record.

Figure 4

Table 1

 Pair Number Sites 1 Marietta and Muddy Run 2 Marietta and Lateral Inflows 3 Marietta and Evaporation 4 Muddy Run and Lateral Inflows 5 Muddy Run and Evaporation 6 Lateral Inflows and Evaporation

# Open Source Streamflow Generator Part I: Synthetic Generation

This post describes how to use the Kirsch-Nowak synthetic streamflow generator to generate synthetic streamflow ensembles for water systems analysis. As Jon Lamontagne discussed in his introduction to synthetic streamflow generation, generating synthetic hydrology for water systems models allows us to stress-test alternative management plans under stochastic realizations outside of those observed in the historical record. These realizations may be generated assuming stationary or non-stationary models. In a few recent papers from our group applied to the Red River and Lower Susquehanna River Basins (Giuliani et al., 2017; Quinn et al., 2017; Zatarain Salazar et al., 2017), we’ve generated stationary streamflow ensembles by combining methods from Kirsch et al. (2013) and Nowak et al. (2010). We use the method of Kirsch et al. (2013) to generate flows on a monthly time step and the method of Nowak et al. (2010) to disaggregate these monthly flows to a daily time step. The code for this streamflow generator, written by Matteo Giuliani, Jon Herman and me, is now available on Github. Here I’ll walk through how to use the MATLAB code in the subdirectory /stationary_generator to generate correlated synthetic hydrology at multiple sites, and in Part II I’ll show how to use the Python code in the subdirectory /validation to statistically validate the synthetic hydrology. As an example, I’ll use the Lower Susquehanna River Basin (LSRB).

A schematic of the LSRB, reproduced from Giuliani et al. (2014) is provided below. The system consists of two reservoirs: Conowingo and Muddy Run. For the system model, we generate synthetic hydrology upstream of the Conowingo Dam at the Marietta gauge (USGS station 01576000), as well as lateral inflows between Marietta and Conowingo, inflows to Muddy Run and evaporation rates over Conowingo and Muddy Run dams. The historical hydrology on which the synthetic hydrologic model is based consists of the historical record at the Marietta gauge from 1932-2001 and simulated flows and evaporation rates at all other sites over the same time frame generated by an OASIS system model. The historical data for the system can be found here.

The first step to use the synthetic generator is to format the historical data into an nD × nS matrix, where nD is the number of days of historical data with leap days removed and nS is the number of sites, or hydrologic variables. An example of how to format the Susquehanna data is provided in clean_data.m. Once the data has been reformatted, the synthetic generation can be performed by running script_example.m (with modifications for your application). Note that in the LSRB, the evaporation rates over the two reservoirs are identical, so we remove one of those columns from the historical data (line 37) for the synthetic generation. We also transform the historical evaporation with an exponential transformation (line 42) since the code assumes log-normally distributed hydrologic data, while evaporation in this region is more normally distributed. After the synthetic hydrology is generated, the synthetic evaporation rates are back-transformed with a log-transformation on line 60. While such modifications allow for additional hydrologic data beyond streamflows to be generated, for simplicity I will refer to all synthetic variables as “streamflows” for the remainder of this post. In addition to these modifications, you should also specify the number of realizations, nR, you would like to generate (line 52), the number of years, nY, to simulate in each realization (line 53) and a string with the dimensions nR × nY for naming the output file.

The actual synthetic generation is performed on line 58 of script_example.m which calls combined_generator.m. This function first generates monthly streamflows at all sites on line 10 where it calls monthly_main.m, which in turn calls monthly_gen.m to perform the monthly generation for the user-specified number of realizations. To understand the monthly generation, we denote the set of historical streamflows as $\mathbf{Q_H}\in \mathbb{R}^{N_H\times T}$ and the set of synthetic streamflows as $\mathbf{Q_S}\in \mathbb{R}^{N_S\times T}$, where $N_H$ and $N_S$ are the number of years in the historical and synthetic records, respectively, and T is the number of time steps per year. Here T=12 for 12 months. For the synthetic generation, the streamflows in $\mathbf{Q_H}$ are log-transformed to yield the matrix $Y_{H_{i,j}}=\ln(Q_{H_{i,j}})$, where i and j are the year and month of the historical record, respectively. The streamflows in $\mathbf{Y_H}$ are then standardized to form the matrix $\mathbf{Z_H}\in \mathbb{R}^{N_H\times T}$ according to equation 1:

1) $Z_{H_{i,j}} = \frac{Y_{H_{i,j}}-\hat{\mu_j}}{\hat{\sigma_j}}$

where $\hat{\mu_j}$ and $\hat{\sigma_j}$ are the sample mean and sample standard deviation of the j-th month’s log-transformed streamflows, respectively. These variables follow a standard normal distribution: $Z_{H_{i,j}}\sim\mathcal{N}(0,1)$.

For each site, we generate standard normal synthetic streamflows that reproduce the statistics of $\mathbf{Z_H}$ by first creating a matrix $\mathbf{C}\in \mathbb{R}^{N_S\times T}$ of randomly sampled standard normal streamflows from $\mathbf{Z_H}$. This is done by formulating a random matrix $\mathbf{M}\in \mathbb{R}^{N_S\times T}$ whose elements are independently sampled integers from $(1,2,...,N_H)$. Each element of $\mathbf{C}$ is then assigned the value $C_{i,j}=Z_{H_{(M_{i,j}),j}}$, i.e. the elements in each column of $\mathbf{C}$ are randomly sampled standard normal streamflows from the same column (month) of $\mathbf{Z_H}$. In order to preserve the historical cross-site correlation, the same matrix $\mathbf{M}$ is used to generate $\mathbf{C}$ for each site.

Because of the random sampling used to populate $\mathbf{C}$, an additional step is needed to generate auto-correlated standard normal synthetic streamflows, $\mathbf{Z_S}$. Denoting the historical autocorrelation $\mathbf{P_H}$=corr($\mathbf{Z_H}$), where corr($\mathbf{Z_H}$) is the historical correlation between standardized streamflows in months i and j (columns of $\mathbf{Z_H}$), an upper right triangular matrix, $\mathbf{U}$, can be found using Cholesky decomposition (chol_corr.m) such that $\mathbf{P_H}=\mathbf{U^\intercal U}$. $\mathbf{Z_S}$ is then generated as $\mathbf{Z_S}=\mathbf{CU}$. Finally, for each site, the auto-correlated synthetic standard normal streamflows $\mathbf{Z_S}$ are converted back to log-space streamflows $\mathbf{Y_S}$ according to $Y_{S_{i,j}}=\hat{\mu_j}+Z_{S_{i,j}}\hat{\sigma_j}$. These are then transformed back to real-space streamflows $\mathbf{Q_S}$ according to $Q_{S_{i,j}}$=exp($Y_{S_{i,j}}$).

While this method reproduces the within-year log-space autocorrelation, it does not preserve year to-year correlation, i.e. concatenating rows of $\mathbf{Q_S}$ to yield a vector of length $N_S\times T$ will yield discontinuities in the autocorrelation from month 12 of one year to month 1 of the next. To resolve this issue, Kirsch et al. (2013) repeat the method described above with a historical matrix $\mathbf{Q_H'}\in \mathbb{R}^{N_{H-1}\times T}$, where each row i of $\mathbf{Q_H'}$ contains historical data from month 7 of year i to month 6 of year i+1, removing the first and last 6 months of streamflows from the historical record. $\mathbf{U'}$ is then generated from $\mathbf{Q_H'}$ in the same way as $\mathbf{U}$ is generated from $\mathbf{Q_H}$, while $\mathbf{C'}$ is generated from $\mathbf{C}$ in the same way as $\mathbf{Q_H'}$ is generated from $\mathbf{Q_H}$. As before, $\mathbf{Z_S'}$ is then calculated as $\mathbf{Z_S'}=\mathbf{C'U'}$. Concatenating the last 6 columns of $\mathbf{Z_S'}$ (months 1-6) beginning from row 1 and the last 6 columns of $\mathbf{Z_S}$ (months 7-12) beginning from row 2 yields a set of synthetic standard normal streamflows that preserve correlation between the last month of the year and the first month of the following year. As before, these are then de-standardized and back-transformed to real space.

Once synthetic monthly flows have been generated, combined_generator.m then finds all historical total monthly flows to be used for disaggregation. When calculating all historical total monthly flows a window of +/- 7 days of the month being disaggregated is considered. That is, for January, combined_generator.m finds the total flow volumes in all consecutive 31-day periods within the window from 7 days before Jan 1st to 7 days after Jan 31st. For each month, all of the corresponding historical monthly totals are then passed to KNN_identification.m (line 76) along with the synthetic monthly total generated by monthly_main.mKNN_identification.m identifies the k nearest historical monthly totals to the synthetic monthly total based on Euclidean distance (equation 2):

2) $d = \left[\sum^{M}_{m=1} \left({\left(q_{S}\right)}_{m} - {\left(q_{H}\right)}_{m}\right)^2\right]^{1/2}$

where ${(q_S)}_m$ is the real-space synthetic monthly flow generated at site m and ${(q_H)}_m$ is the real-space historical monthly flow at site m. The k-nearest neighbors are then sorted from i=1 for the closest to i=k for the furthest, and probabilistically selected for proportionally scaling streamflows in disaggregation. KNN_identification.m uses the Kernel estimator given by Lall and Sharma (1996) to assign the probability $p_n$ of selecting neighbor n (equation 3):

3) $p_{n} = \frac{\frac{1}{n}}{\sum^{k}_{i=1} \frac{1}{i}}$

Following Lall and Sharma (1996) and Nowak et al. (2010), we use $k=\Big \lfloor N_H^{1/2} \Big \rceil$. After a neighbor is selected, the final step in disaggregation is to proportionally scale all of the historical daily streamflows at site m from the selected neighbor so that they sum to the synthetically generated monthly total at site m. For example, if the first day of the month of the selected historical neighbor represented 5% of that month’s historical flow, the first day of the month of the synthetic series would represent 5% of that month’s synthetically-generated flow. The random neighbor selection is performed by KNN_sampling.m (called on line 80 of combined_generator.m), which also calculates the proportion matrix used to rescale the daily values at each site on line 83 of combined_generator.m. Finally, script_example.m writes the output of the synthetic streamflow generation to files in the subdirectory /validation. Part II shows how to use the Python code in this directory to statistically validate the synthetically generated hydrology, meaning ensure that it preserves the historical monthly and daily statistics, such as the mean, standard deviation, autocorrelation and spatial correlation.

Works Cited

Giuliani, M., Herman, J. D., Castelletti, A., & Reed, P. (2014). Many‐objective reservoir policy identification and refinement to reduce policy inertia and myopia in water management. Water resources research50(4), 3355-3377.

Giuliani, M., Quinn, J. D., Herman, J. D., Castelletti, A., & Reed, P. M. (2017). Scalable multiobjective control for large-scale water resources systems under uncertainty. IEEE Transactions on Control Systems Technology26(4), 1492-1499.

Kirsch, B. R., Characklis, G. W., & Zeff, H. B. (2012). Evaluating the impact of alternative hydro-climate scenarios on transfer agreements: Practical improvement for generating synthetic streamflows. Journal of Water Resources Planning and Management139(4), 396-406.

Lall, U., & Sharma, A. (1996). A nearest neighbor bootstrap for resampling hydrologic time series. Water Resources Research32(3), 679-693.

Nowak, K., Prairie, J., Rajagopalan, B., & Lall, U. (2010). A nonparametric stochastic approach for multisite disaggregation of annual to daily streamflow. Water Resources Research46(8).

Quinn, J. D., Reed, P. M., Giuliani, M., & Castelletti, A. (2017). Rival framings: A framework for discovering how problem formulation uncertainties shape risk management trade‐offs in water resources systems. Water Resources Research53(8), 7208-7233.

Zatarain Salazar, J., Reed, P. M., Quinn, J. D., Giuliani, M., & Castelletti, A. (2017). Balancing exploration, uncertainty and computational demands in many objective reservoir optimization. Advances in water resources109, 196-210.

# Making Watershed Maps in Python

This post builds off of earlier posts by Jon Lamontagne and Jon Herman on making global maps in Python using matplotlib and basemap. However rather than making a global map, I’ll show how to zoom into a particular region, here the Red River basin in East Asia. To make these maps, you’ll need to have basemap installed (from github here, or using a Windows installer here).

The first step is to create a basemap. Both Jons used the ‘robin’ global projection to do this in their posts. Since I’m only interested in a particular region, I just specify the bounding box using the lower and upper latitudes and longitudes of the region I’d like to plot. As Jon H points out, you can also specify the resolution (‘f’ = full, ‘h’ =high, ‘i’ = intermediate, ‘l’ = low, ‘c’ = crude), and you can even use different ArcGIS images for the background (see here). I use ‘World_Shaded_Relief’. It’s also possible to add a lot of features such as rivers, countries, coastlines, counties, etc. I plot countries and rivers. The argument ‘zorder’ specifies the order of the layering from 1 to n, where 1 is the bottom layer and n the top.


from mpl_toolkits.basemap import Basemap
from matplotlib import pyplot as plt

fig = plt.figure()
fig.set_size_inches([17.05,8.15])

# plot basemap, rivers and countries
m = basemap(llcrnrlat=19.5, urcrnrlat=26.0, llcrnrlon=99.6, urcrnr=107.5, resolution='h')
m.drawrivers(color='dodgerblue',linewidth=1.0,zorder=1)
m.drawcountries(color='k',linewidth=1.25)



The above code makes the following image (it takes some time, since I’m using high resolution):

Now let’s add a shaded outline of the Red River basin. To do this, you need a shapefile of the basin. The FAO provides a shapefile of major watersheds in the world, from which you can extract the watershed you’re interested in using ArcGIS (see instructions here). In this shapefile, the Red River is labeled by its name in Vietnamese, ‘Song Hong.’ I chose not to draw the bounds of the basin in my map because it would be too busy with the country borders. Instead, I shaded the region gray (facecolor=’0.33′) with a slightly darker border (edgecolor=’0.5′) and slight transparency (alpha=0.5). To do that, I had to collect all of the patches associated with the shapefile (which I called ‘Basin’ when reading it in) that needed to be shaded.


from matplotlib.patches import Polygon
from matplotlib.collections import Patch Collection

# plot Red River basin
patches = []
for info, shape in zip(m.Basin_info, m.Basin):
if info['OBJECTID'] == 1: # attribute in attribute table of shapefile
patches.append(Polygon(np.array(shape), True))



This creates the following image:

Now let’s add the locations of major dams and cities in the basin using ‘scatter‘. You could again do this by adding a shapefile, but I’m just going to add their locations manually, either by uploading their latitude and longitude coordinates from a .csv file or by passing them directly.


import numpy as np

# plot dams
damsLatLong = np.loadtxt('DamLocations.csv', delimiter=',', skiprows=1, usecols=[1,2])
x, y = m(damsLatLong[:,1], damLatLong[:,0]) # m(longitude, latitude)
m.scatter(x, y, c='k', s = 150, marker = '^')

# plot Hanoi
x, y = m(105.8342, 21.0278)
m.scatter(x, y, facecolor='darkred', edgecolor='darkred', s=150)



This makes the following image:

If we want to label the dams and cities, we can add text specifying where on the map we’d like them to be located. This may require some guess-and-check work to determine the best place (comment if you know a better way!). I temporarily added gridlines to the map to aid in this process using ‘drawparallels‘ and ‘drawmeridians‘.


# label dams and Hanoi
plt.text(104.8, 21.0, 'Hoa Binh', fontsize=18, ha='center', va='center', color='k')
plt.text(104.0, 21.7, 'Son La', fontsize=18, ha='center', va='center', color='k')
plt.text(105.0, 21.95, 'Thac Ba', fontsize=18, ha='center', va='center', color='k')
plt.text(105.4, 22.55, 'Tuyen Quang', fontsize=18, ha='center', va='center', color='k')
plt.text(105.8, 21.2, 'Hanoi', fontsize=18, ha='center', va='center', color='k')



Now our map looks like this:

That looks nice, but it would be helpful to add some context as to where in the world the Red River basin is located. To illustrate this, we can create an inset of the greater geographical area by adding another set of axes with its own basemap. This one can be at a lower resolution.


from mpl_toolkits.axes_grid1.inset_locator import zoomed_inset_axes

# plot inset of greater geographic area
axins = zoomed_inset_axes(ax, 0.1, loc=1) # locations
axins.set_xlim(90, 115) # longitude boundaries of inset map
axins.set_ylim(8, 28) # latitude boundaries of inset map

# remove tick marks from inset axes
plt.xticks(visible=False)
plt.yticks(visible=False)

# add basemap to inset map
m2 = Basemap(llcrnrlat=8.0, urcrnclat=28.0, llcrnr=90.0, urcrnrlon=115.0, resolution='l', ax=axins)
m2.drawcountries(color='k', linewidth=0.5)



This image looks like this:

Now let’s highlight a country of interest (Vietnam) in green and also add the Red River basin in light gray again.


# plot Vietnam green in inset
patches2 = []
for info, shape in zip(m2.Vietnam_info, m2.Vietnam):
if info['Joiner'] == 1:
patches2.append(Polygon(np.array(shape), True))

# shade Red River basin gray in inset



Now our map looks like this:

Finally, let’s label the countries in the inset. Some of the countries are too small to fit their name inside, so we’ll have to create arrows pointing to them using ‘annotate‘. In this function, ‘xy’ specifies where the arrow points to and ‘xytext’ where the text is written relative to where the arrow points.


# label countries
plt.text(107.5, 25.5, 'China', fontsize=11, ha='center', va='center', color='k')
plt.text(102.5, 20.2, 'China', fontsize=11, ha='center', va='center', color='k')
plt.text(101.9, 15.5, 'China', fontsize=11, ha='center', va='center', color='k')
plt.text(9.5, 21.0, 'China', fontsize=11, ha='center', va='center', color='k')

# add arrows to label Vietnam and Cambodia
plt.annotate('Vietnam', xy=(108.0, 14.0), xycoords='data', xytext=(5.0, 20.0), textcoords='offset points', \
color='k', arrowprops=dict(arrowstyle='-'), fontsize=11)
plt.annotate('Cambodia', xy=(104.5, 12.0), xycoords='data', xytext=(-60.0, -25.0), textcoords='offset points', \
color='k', arrowprops=dict(arrowstyle='-'), fontsize=11)



Now our map looks like this:

I think that’s pretty good, so let’s save it ;). See below for all the code used to make this map, with all the import statements at the beginning rather than sporadically inserted throughout the code!

If you’re looking for any other tips on how to make different types of maps using basemap, I recommend browsing through the basemap toolkit documentation and this basemap tutorial, where I learned how to do most of what I showed here.


from mpl_toolkits.basemap import Basemap
from mpl_toolkits.axes_grid1.inset_locator import zoomed_inset_axes
from matplotlib import pyplot as plt
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
import numpy as np

# set-up Vietnam basemap
fig = plt.figure()
fig.set_size_inches([17.05, 8.15])

# plot basemap, rivers and countries
m = Basemap(llcrnrlat=19.5,urcrnrlat=26.0,llcrnrlon=99.6,urcrnrlon=107.5,resolution='h')
m.drawrivers(color='dodgerblue',linewidth=1.0,zorder=1)
m.drawcountries(color='k',linewidth=1.25)

# plot Red River basin
patches = []
for info, shape in zip(m.Basin_info, m.Basin):
if info['OBJECTID'] == 1:
patches.append(Polygon(np.array(shape), True))

# plot dams
x, y = m(damsLatLong[:,1], damsLatLong[:,0])
m.scatter(x, y, c='k', s=150, marker='^')

# plot Hanoi
x, y = m(105.8342, 21.0278)
m.scatter(x, y, facecolor='darkred', edgecolor='darkred', s=150)

# label reservoirs and Hanoi
plt.text(104.8, 21.0, 'Hoa Binh', fontsize=18, ha='center',va='center',color='k')
plt.text(104.0, 21.7, 'Son La', fontsize=18, ha='center', va='center', color='k')
plt.text(105.0, 21.95, 'Thac Ba', fontsize=18, ha='center', va='center', color='k')
plt.text(105.4, 22.55, 'Tuyen Quang', fontsize=18, ha='center', va='center', color='k')
plt.text(105.8, 21.2, 'Hanoi', fontsize=18, ha='center', va='center', color='k')

# plot inset of greater geographic area
axins = zoomed_inset_axes(ax, 0.1, loc=1)
axins.set_xlim(90, 115)
axins.set_ylim(8,28)

plt.xticks(visible=False)
plt.yticks(visible=False)

m2 = Basemap(llcrnrlat=8.0,urcrnrlat=28.0,llcrnrlon=90.0,urcrnrlon=115.0,resolution='l',ax=axins)
m2.drawcountries(color='k',linewidth=0.5)

# plot Vietnam green in inset
patches2 = []
for info, shape in zip(m2.Vietnam_info, m2.Vietnam):
if info['Joiner'] == 1:
patches2.append(Polygon(np.array(shape), True))

# shade Red River basin gray in inset

# label countries
plt.text(107.5, 25.5, 'China', fontsize=11, ha='center',va='center',color='k')
plt.text(102.5, 20.2, 'Laos', fontsize=11, ha='center', va='center', color='k')
plt.text(101.9, 15.5, 'Thailand', fontsize=11, ha='center', va='center', color='k')
plt.text(96.5, 21.0, 'Myanmar', fontsize=11, ha='center', va='center', color='k')

plt.annotate('Vietnam', xy=(108.0,14.0), xycoords='data', xytext=(5.0,20.0), textcoords='offset points', \
color='k',arrowprops=dict(arrowstyle='-'),fontsize=11)
plt.annotate('Cambodia', xy=(104.5,12.0), xycoords='data', xytext=(-60.0,-25.0), textcoords='offset points', \
color='k',arrowprops=dict(arrowstyle='-'),fontsize=11)

fig.savefig('RedRiverMap.png')
fig.clf()



# Using Rhodium for RDM Analysis of External Dataset

In my last blog post, I showed how to run an MORDM experiment using Rhodium. This process included the multi-objective optimization to an assumed state of the world (SOW) as well as the re-evaluation of the Pareto-approximate solutions on alternative SOWs, before using sensitivity and classification tools such as PRIM and CART for the scenario discovery analysis. However, there may be cases where you have to run the optimization and re-evaluation outside of Rhodium, for instance if your model is in another programming language than Python. There are two ways you can do this while still using Rhodium for the scenario discovery. The first option is to run the model through the Executioner. Another option is to run the model separately and import the output into the same format as is generated by Rhodium for post-analysis. I will explain the second method here for the fish game described in my last post.

The first step is to read the decision variables and objectives from the optimization into 2D arrays. Then the uncertainties, levers and responses can be defined as before, except they no longer have to be associated with an object of the class ‘Model‘.


# read in output of optimization

# make maximization objectives positive
maxIndices = [0]
objectives[:,maxIndices] = -objectives[:,maxIndices]

# define X of XLRM framework
uncertainties = [UniformUncertainty("a", 1.5, 4.0),
UniformUncertainty("b0", 0.25, 0.67)]

# define L of XLRM framework
levers = [RealLever("vars", 0.0, 1.0, length=2)]

# define R of XLRM framework
responses = [Response("NPVharvest", Response.MAXIMIZE),
Response("std_z", Response.MINIMIZE)]



Note: If you are interested in using Rhodium’s plotting tools to visualize the results of the optimization, you can still make the uncertainties, levers and responses attributes of a model object. However, you will have to create a model function to instantiate the model. This is sloppy, but you can fake this by just creating a function that takes in the decision variables and model parameters, and returns the objective values, but doesn’t actually perform any calculations.


def fishGame(vars,
a = 1.75, # rate of prey growth
b0 = 0.6, # initial rate of predator growth
F = 0, # rate of change of radiative forcing per unit time
S = 0.5): # climate sensitivity)

NPVharvest = None
std_z = None

return (NPVharvest, std_z)

model = Model(fishGame)

# define all parameters to the model that we will be studying
model.parameters = [Parameter("vars"),
Parameter("a"),
Parameter("b0"),
Parameter("F"),
Parameter("S")]



If using Rhodium for the optimization, this function would actually perform the desired calculation and Platypus could be used for the optimization. Since we have already performed the optimization, we just need to reformat the output of the optimization into that used by Rhodium for the RDM analysis. This can be done by mimicking the output structure that would be returned by the function ‘optimize‘.


# find number of solutions
nsolns = np.shape(objectives)[0]

# properly format output of optimization
output = DataSet()
for i in range(nsolns):
env = OrderedDict()
offset = 0

for lever in levers:
if lever.length == 1:
env[lever.name] = list(variables[i,:])
else:
env[lever.name] = list(variables[i,offset:offset+lever.length])

offset += lever.length

for j, response in enumerate(responses):
env[response.name] = objectives[i,j]

output.append(env)

# write output to file
with open("FishGame/FishGame_data.txt","w") as f:
json.dump(output, f)



Next we need to read in the uncertain parameters that were sampled for the re-evaluation and format the results of the re-evaluation into the same format as would be output by calling ‘evaluate‘ within Rhodium. Below is an example with the first solution (soln_index=0).

# read in LH samples of uncertain parameters and determine # of samples
nsamples = np.shape(LHsamples)[0]

soln_index = 0
policy = output[soln_index]

# load its objective values from re-evaluation and make maximization objectives positive
objectives = np.loadtxt('FishGame/MORDMreeval/FishGame_Soln' + str(soln_index+1) + '.obj')
objectives[:,maxIndices] = -objectives[:,maxIndices]

# convert re-evaluation output to proper format
results = DataSet()
for j in range(nsamples):
env = OrderedDict()
offset = 0

for k, uncertainty in enumerate(uncertainties):
env[uncertainty.name] = LHsamples[j,k]

for k, response in enumerate(responses):
env[response.name] = objectives[j,k]

for lever in levers:
if lever.length == 1:
env[lever.name] = list(variables[soln_index,:])
else:
env[lever.name] = list(variables[soln_index,offset:offset+lever.length])

offset += lever.length

results.append(env)

# write results to file
with open("FishGame/FishGame_Soln" + str(soln_index+1) + "_reeval.txt","w") as f:
json.dump(results, f)



Finally, you have to define the metrics.


# calculate M of XLRM framework
metric = ["Profitable" if v["NPVharvest"] >= 3.0 else "Unprofitable" for v in results]



Then you can run PRIM and CART.  This requires defining the names, or ‘keys’, of the uncertain parameters. If you created a fake model object, you can pass ‘include=model.uncertainties.keys()’ to the functions Prim() and Cart(). If not, you have to create your own list of ‘keys’ as I do below.


keys = []
for i in range(len(uncertainties)):
keys.append(uncertainties[i].name)

# run PRIM and CART on metrics
p = Prim(results, metric, include=keys, coi="Profitable")
box = p.find_box()
box.show_details()
plt.show()

c = Cart(results, metrics[j], include=keys)
c.print_tree(coi="Profitable")
c.show_tree()
plt.show()



The above code creates the following two figures.

If you had run the analysis using Sobol samples, you could use the SALib wrapper to calculate sensitivity indices and make bar charts or radial convergence plots of the results. (Note: My previous post did not show how to make these plots, but has since been updated. Check it out here.)


import numpy as np
import seaborn as sns
from rhodium import *
from SALib.analyze import sobol

def _S2_to_dict(matrix, problem):
result = {} names = list(problem["names"])
for i in range(problem["num_vars"]):
for j in range(i+1, problem["num_vars"]):
if names[i] not in result:
result[names[i]] = {}
if names[j] not in result:
result[names[j]] = {}

result[names[i]][names[j]] = result[names[j]][names[i]] = float(matrix[i][j])

return result

def get_pretty_result(result):
pretty_result = SAResult(result["names"] if "names" in result else problem["names"])

if "S1" in result:
pretty_result["S1"] = {k : float(v) for k, v in zip(problem["names"], result["S1"])}
if "S1_conf" in result:
pretty_result["S1_conf"] = {k : float(v) for k, v in zip(problem["names"], result["S1_conf"])}
if "ST" in result:
pretty_result["ST"] = {k : float(v) for k, v in zip(problem["names"], result["ST"])}
if "ST_conf" in result:
pretty_result["ST_conf"] = {k : float(v) for k, v in zip(problem["names"], result["ST_conf"])}
if "S2" in result:
pretty_result["S2"] = _S2_to_dict(result["S2"], problem)
if "S2_conf" in result:
pretty_result["S2_conf"] = _S2_to_dict(result["S2_conf"], problem)
if "delta" in result:
pretty_result["delta"] = {k : float(v) for k, v in zip(problem["names"], result["delta"])}
if "delta_conf" in result:
pretty_result["delta_conf"] = {k : float(v) for k, v in zip(problem["names"], result["delta_conf"])}
if "vi" in result:
pretty_result["vi"] = {k : float(v) for k, v in zip(problem["names"], result["vi"])}
if "vi_std" in result:
pretty_result["vi_std"] = {k : float(v) for k, v in zip(problem["names"], result["vi_std"])}
if "dgsm" in result:
pretty_result["dgsm"] = {k : float(v) for k, v in zip(problem["names"], result["dgsm"])}
if "dgsm_conf" in result:
pretty_result["dgsm_conf"] = {k : float(v) for k, v in zip(problem["names"], result["dgsm_conf"])}
if "mu" in result:
pretty_result["mu"] = {k : float(v) for k, v in zip(result["names"], result["mu"])}
if "mu_star" in result:
pretty_result["mu_star"] = {k : float(v) for k, v in zip(result["names"], result["mu_star"])}
if "mu_star_conf" in result:
pretty_result["mu_star_conf"] = {k : float(v) for k, v in zip(result["names"], result["mu_star_conf"])}
if "sigma" in result:
pretty_result["sigma"] = {k : float(v) for k, v in zip(result["names"], result["sigma"])}

return pretty_result

# Read the parameter range file and Sobol samples

Y = np.loadtxt('FishGame/SobolReeval/FishGame_Soln' + (soln_index+1) + '.obj')

# Evaluate sensitivity to the first objective, NPVharvest
obj_index = 0
Si = sobol.analyze(problem, Y[:,obj_index], calc_second_order=True, conf_level=0.95, print_to_console=False)
pretty_result = get_pretty_result(Si)

# Plot sensitivity
sns.set()
fig1 = pretty_result.plot()
fig2 = pretty_result.plot_sobol(threshold=0.01,groups={"Prey Growth Parameters" : ["a"], "Predator Growth Parameters" : ["b0"]})



So don’t feel like you need to run your optimization and re-evaluation in Python in order to use Rhodium!

# Rhodium – Open Source Python Library for (MO)RDM

Last year Dave Hadka introduced OpenMORDM (Hadka et al., 2015), an open source R package for Multi-Objective Robust Decision Making (Kasprzyk et al., 2013). If you liked the capabilities of OpenMORM but prefer coding in Python, you’ll be happy to hear Dave has also written an open source Python library for robust decision making (RDM) (Lempert et al., 2003), including multi-objective robust decision making (MORDM): Rhodium.

Rhodium is part of Project Platypus, which also contains Python libraries for multi-objective optimization (Platypus), a standalone version of the Patient Rule Induction method (PRIM) algorithm (Friedman and Fisher, 1999) implemented in Jan Kwakkel’s EMA Workbench, and a cross-language automation tool for running models (Executioner). Rhodium uses functions from both Platypus and PRIM, as I will briefly show here, but Jazmin will describe Platypus in more detail in a future blog post.

Dave provides an ipython notebook file with clear instructions on how to use Rhodium, with the lake problem given as an example. To give another example, I will walk through the fish game. The fish game is a chaotic predator-prey system in which the population of fish, x, and their predators, y, are co-dependent. Predator-prey systems are typically modeled by the classic Lotka-Volterra equations:

1) $\frac{dx}{dt} = \alpha x - \beta x y$

2) $\frac{dy}{dt} = \delta x y - \gamma y_t$

where α is the growth rate of the prey (fish), β is the rate of predation on the prey, δ is the growth rate of the predator, and γ is the death rate of the predator. This model assumes exponential growth of the prey, x, and exponential death of the predator. Based on a classroom exercise given at RAND, I modify the Lotka-Volterra model of the prey population for logistic growth (see the competitive Lotka-Volterra equations):

3) $\frac{dx}{dt} = \alpha x - r x^2 - \beta x y$

Discretizing equations 1 and 3 yields:

4) $x_{t+1} = (\alpha + 1)x_t (1 - \frac{r}{\alpha + 1} x_t) - \beta x_t y_t$ and

5) $y_{t+1} = (1 - \gamma)y_t + \delta x_t y_t$

RAND simplifies equation 4 by letting a = α + 1, r/(α + 1) = 1 and β = 1, and simplifies equation 5 by letting b = 1/δ and γ = 1. This yields the following equations:

6) $x_{t+1} = \alpha x_t(1-x_t) - x_t y_t,$

7) $y_{t+1} = \frac{x_t y_t}{b}.$

In this formulation, the parameter a controls the growth rate of the fish and b controls the growth rate of the predators. The growth rate of the predators is dependent on the temperature, which is increasing due to climate change according to the following equation:

8) $C \frac{dT}{dt} = (F_0 + Ft) - \frac{T}{S}$

where C is the heat capacity, assumed to be 50 W/m2/K/yr, F0 is the initial value of radiative forcing, assumed to be 1.0 W/m2, F is the rate of change of radiative forcing, S is the climate sensitivity in units of K/(W/m2), and T is the temperature increase from equilibrium, initialized at 0. The dependence of b on the temperature increase is given by:

9) $b = \text{max} \Bigg( b_0 e^{-0.3T},0.25 \Bigg).$

The parameters a, b, F, and S could all be considered deeply uncertain, but for this example I will use (unrealistically optimistic) values of F = 0 and S = 0.5 and assume possible ranges for a and b0 of 1.5 < a < 4 and 0.25 < b0 < 0.67. Within these bounds, different combinations of a and b parameters can lead to point attractors, strange attractors, or collapse of the predator population.

The goal of the game is to design a strategy for harvesting some number of fish, z, at each time step assuming that only the fish population can be observed, not the prey. The population of the fish then becomes:

10) $x_{t+1} = \alpha x_t(1-x_t) - x_t y_t - z_t$

For this example, I assume the user employs a strategy of harvesting some weighted average of the fish population in the previous two time steps:

11) $z_t = \begin{cases} \text{min} \Bigg( \alpha\beta x_t + \alpha(1-\beta)x_{t-1},x_{t} \Bigg), t \geq 2\\ \alpha\beta x_t, t = 1 \end{cases}$

where 0 ≤ α ≤ 1 and 0 ≤ β ≤ 1. The user is assumed to have two objectives: 1) to maximize the net present value of their total harvest over T time steps, and 2) to minimize the standard deviation of their harvests over T time steps:

12) Maximize: $NPV = \sum^T_{t=1} 1.05^{-t} z_t$

13) Minimize: $s_z = \sqrt{\frac{1}{T-1} \sum^T_{t=1} (z_t - \bar{z})^2}.$

As illustrated in the figure below, depending on the values of a and b0, the optimal Pareto sets for each “future” (each with initial populations of x0 = 0.48 and y0 = 0.26) can have very different shapes and attainable values.

 Future 1 2 3 4 5 a 1.75 1.75 3.75 3.75 2.75 b0 0.6 0.3 0.6 0.3 0.45

For this MORDM experiment, I first optimize to an assumed state of the world (SOW) in which a = 1.75 and b = 0.6. To do this, I first have to write a function that takes in the decision variables for the optimization problem as well as any potentially uncertain model parameters, and returns the objectives. Here the decision variables are represented by the vector ‘vars’, the uncertain parameters are passed at default values of a=1.75, b0 = 0.6, F = 0 and S = 0.5, and the returned objectives are NPVharvest and std_z.


import os
import math
import json
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy.optimize import brentq as root
from rhodium import *
from rhodium.config import RhodiumConfig
from platypus import MapEvaluator

RhodiumConfig.default_evaluator = MapEvaluator()

def fishGame(vars,
a = 1.75, # rate of prey growth
b0 = 0.6, # initial rate of predator growth
F = 0, # rate of change of radiative forcing per unit time
S = 0.5): # climate sensitivity)

# Objectives are:
# 1) maximize (average NPV of harvest) and
# 2) minimize (average standard deviation of harvest)
# x = population of prey at time 0 to t
# y = population of predator at time 0 to t
# z = harvested prey at time 1 to t

tSteps = 100
x = np.zeros(tSteps+1)
y = np.zeros(tSteps+1)
z = np.zeros(tSteps)

# initialize predator and prey populations
x[0] = 0.48
y[0] = 0.26

# Initialize climate parameters
F0 = 1
C = 50
T = 0
b = max(b0*np.exp(-0.3*T),0.25)

# find harvest at time t based on policy
z[0] = harvest(x, 0, vars)

#Initialize NPV of harvest
NPVharvest = 0

for t in range(tSteps):
x[t+1] = max(a*x[t]*(1-x[t]) - x[t]*y[t] - z[t],0)
y[t+1] = max(x[t]*y[t]/b,0)
if t < tSteps-1:
z[t+1] = harvest(x, t+1, vars)

NPVharvest = NPVharvest + z[t]*(1+0.05)**(-(t+1))

#Calculate next temperature and b values
T = T + (F0 + F*(t+1) - (1/S)*T)/C
b = max(b0*np.exp(-0.3*T),0.25)

# Calculate minimization objectives
std_z = np.std(z)

return (NPVharvest, std_z)

def harvest(x, t, vars):
if t > 0:
harvest = min(vars[0]*vars[1]*x[t] + vars[0]*(1-vars[1])*x[t-1],x[t])
else:
harvest = vars[0]*vars[1]*x[t]

return harvest



Next, the model class must be defined, as well as its parameters, objectives (or “responses”) and whether they need to be minimized or maximized, decision variables (or “levers”) and uncertainties.


model = Model(fishGame)

# define all parameters to the model that we will be studying
model.parameters = [Parameter("vars"), Parameter("a"), Parameter("b0"), Parameter("F"), Parameter("S")]

# define the model outputs
model.responses = [Response("NPVharvest", Response.MAXIMIZE), Response("std_z", Response.MINIMIZE)]

# some parameters are levers that we control via our policy
model.levers = [RealLever("vars", 0.0, 1.0, length=2)]

# some parameters are exogeneous uncertainties, and we want to better
# understand how these uncertainties impact our model and decision making
# process
model.uncertainties = [UniformUncertainty("a", 1.5, 4.0), UniformUncertainty("b0", 0.25, 0.67)]



The model can then be optimized using a multi-objective evolutionary algorithm (MOEA) in Platypus, and the output written to a file. Here I use NSGA-II.


output = optimize(model, "NSGAII", 100)
with open("data.txt", "w") as f:
json.dump(output, f)



The results can be easily visualized with simple commands. The Pareto sets can be plotted with ‘scatter2D’ or ‘scatter3D’, both of which allow brushing on one or more objective thresholds. Here I first brush on solutions with a NPV of harvest ≥ 1.0, and then add a condition that the standard deviation of harvest be ≤ 0.01.


# Use Seaborn settings for pretty plots
sns.set()

# Plot the points in 2D space
scatter2d(model, output)
plt.show()

# The optional interactive flag will show additional details of each point when
# hovering the mouse
# Most of Rhodiums's plotting functions accept an optional expr argument for
# classifying or highlighting points meeting some condition
scatter2d(model, output, x="NPVharvest", brush=Brush("NPVharvest >= 1.0"))
plt.show()

scatter2d(model, output, brush="NPVharvest >= 1.0 and std_z <= 0.01")
plt.show()



The above code creates the following images:

Rhodium can also plot Kernel density estimates of the solutions, or those attaining certain objective values.


# Kernel density estimation plots show density contours for samples. By
# default, it will show the density of all sampled points
kdeplot(model, output, x="NPVharvest", y="std_z")
plt.show()

# Alternatively, we can show the density of all points meeting one or more
# conditions
kdeplot(model, output, x="NPVharvest", y="std_z", brush=["NPVharvest >= 1.0", "std_z <= 0.01"], alpha=0.8)
plt.show()



Scatterplots of all pairwise objective combinations can also be plotted, along with histograms of the marginal distribution of each objective illustrated in the pairwise scatterplots. These can also be brushed by objective thresholds specified by the user.


# Pairwise scatter plots shown 2D scatter plots for all outputs
pairs(model, output)
plt.show()

# We can also highlight points meeting one or more conditions
pairs(model, output, brush=["NPVharvest >= 1.0", "std_z <= 0.01"])
plt.show()

# Joint plots show a single pair of parameters in 2D, their distributions using
# histograms, and the Pearson correlation coefficient
joint(model, output, x="NPVharvest", y="std_z")
plt.show()



Finally, tradeoffs can also be viewed on parallel axes plots, which can also be brushed on user-specified objective values.


# A parallel coordinates plot to view interactions among responses
parallel_coordinates(model, output, colormap="rainbow", zorder="NPVharvest", brush=Brush("NPVharvest > 1.0"))
plt.show()



But the real advantage of Rhodium is not visualization but uncertainty analysis. First, PRIM can be used to identify “boxes” best describing solutions meeting user-specified criteria. I define solutions with a NPV of harvest ≥ 1.0 as profitable, and those below unprofitable.


# The remaining figures look better using Matplotlib's default settings
mpl.rcdefaults()

# We can manually construct policies for analysis. A policy is simply a Python
# dict storing key-value pairs, one for each lever.
#policy = {"vars" : [0.02]*2}

# Or select one of our optimization results
policy = output[8]

# construct a specific policy and evaluate it against 1000 states-of-the-world
SOWs = sample_lhs(model, 1000)
results = evaluate(model, update(SOWs, policy))
metric = ["Profitable" if v["NPVharvest"] >= 1.0 else "Unprofitable" for v in results]

# use PRIM to identify the key uncertainties if we require NPVharvest >= 1.0
p = Prim(results, metric, include=model.uncertainties.keys(), coi="Profitable")
box = p.find_box()
box.show_details()
plt.show()



This will first show the smallest box with the greatest density but lowest coverage.

Clicking on “Back” will show the next largest box with slightly lower density but greater coverage, while “Next” moves in the opposite direction. In this case, since the smallest box is shown, “Next” moves full circle to the largest box with the lowest density, but greatest coverage, and clicking “Next” from this figure will start reducing the box size.

Classification And Regression Trees (CART; Breiman et al., 1984) can also be used to identify hierarchical conditional statements classifying successes and failures in meeting the user-specified criteria.

# use CART to identify the key uncertainties
c = Cart(results, metric, include=model.uncertainties.keys())
c.print_tree(coi="Profitable")
c.show_tree()
plt.show()



Finally, Dave has wrapped Rhodium around Jon Herman’s SALib for sensitivity analysis. Here’s an example of how to run the Method of Morris.


# Sensitivity analysis using Morris method
print(sa(model, "NPVharvest", policy=policy, method="morris", nsamples=1000, num_levels=4, grid_jump=2))



You can also create tornado and spider plots from one-at-a-time (OAT) sensitivity analysis.


# oat sensitivity
fig = oat(model, "NPVharvest",policy=policy,nsamples=1000)


Finally, you can visualize the output of Sobol sensitivity analysis with bar charts of the first and total order sensitivity indices, or as radial plots showing the interactions between parameters. In these plots the filled circles on each parameter represent their first order sensitivity, the open circles their total sensitivity, and the lines between them the second order indices of the connected parameters. You can even create groups of similar parameters with different colors for easier visual analysis.

Si = sa(model, "NPVharvest", policy=policy, method="sobol", nsamples=1000, calc_second_order=True)
fig1 = Si.plot()
fig2 = Si.plot_sobol(threshold=0.01)
fig3 = Si.plot_sobol(threshold=0.01,groups={"Prey Growth Parameters" : ["a"],
"Predator Growth Parameters" : ["b0"]})


As you can see, Rhodium makes MORDM analysis very simple! Now if only we could reduce uncertainty…

Works Cited

Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone (1984). Classification and Regression Trees. Wadsworth.

Friedman, J. H. and N. I. Fisher (1999). Bump-hunting for high dimensional data. Statistics and Computing, 9, 123-143.

Hadka, D., Herman, J., Reed, P., and Keller, K. (2015). An open source framework for many-objective robust decision making. Environmental Modelling & Software, 74, 114-129.

Kasprzyk, J. R., S. Nataraj, P. M. Reed, and R. J. Lempert (2013). Many objective robust decision making for complex environmental systems undergoing change. Environmental Modelling & Software, 42, 55-71.

Lempert, R. J. (2003). Shaping the next one hundred years: new methods for quantitative, long-term policy analysis. Rand Corporation.

# Fitting Multivariate Normal Distributions

In water resources modeling, we often want to generate synthetic multivariate data for a simulation model in a way that preserves their marginal and joint behavior. The multivariate normal (MVN) distribution is a common model choice for these simulations because 1) it often arises naturally due to the Central Limit Theorem, 2) it has many useful properties that make data manipulation convenient, and 3) data can often be transformed to MVN even if that is not their underlying distribution.

As an example, let X be a K-dimensional random vector with 1 mean vector μ and K covariance matrix [Σ]. If X is multivariate normal, i.e. X~MVN(μ,[Σ]), its probability density function is the following:

where det(·) denotes the determinant. The term (xμ)T[Σ]-1(xμ) is called the squared Mahalanobis distance, which measures how far away an observation x is from its distribution’s mean, scaled by a multi-dimensional measure of spread, [Σ]. It is therefore a measure of how far away from the mean each data vector is in a statistical sense, as opposed to Euclidean distance, which only measures distance in a physical sense. The two measures are related, though. If all of the K dimensions of X are independent, every non-diagonal element of [Σ] will be equal to 0, and the diagonals equal to the variances in each dimension, σk2 where k ϵ {1, 2, …, K}. In that case, the Mahalanobis distance is equal to the Euclidean distance after scaling each data vector by its standard deviation.

So how do we fit MVN distributions? Well, the MVN distribution has “four handy properties” (Wilks, 2011) that we can test. Here I will discuss two of them and how we can use these properties to test for multivariate normality. See Chapter 11 of Wilks (2011) for additional tests for multivariate normality.

Let X be a set of n joint observations of K variables. Denote each of the n observations xi = [xi1, xi2, …, xik] where i ϵ {1, 2, …, n} and each of the K marginals Xk = [xk1, xk2, …, xkn] where k ϵ {1, 2, …, K}. If X~MVN(μ,[Σ]), the following two properties (among others) hold:

1. All marginal distributions of X are univariate normal, i.e. Xk~N(μk, σk2)
2. The squared Mahalanobis distances, Di2 = (xiμ)T[Σ]-1(xiμ), have a χk2 distribution with k degrees of freedom.

So if we want to fit a MVN distribution to X, each of these will have to be true. Let’s look at an example where X is the standard deviation in daily flows during all historical Septembers at five different sites within the same basin. In this case, K=5 for the 5 sites and n=51, as there are 51 years in the historical record. To fit a MVN distribution to X, we’ll first want to ensure that the marginal distributions of the standard deviations in daily September flows are normal at each of the K sites. Let’s inspect these distributions visually with a histogram, first:

Clearly these distributions are not normal, as they are positively skewed. But that’s okay, we can transform the data so that they look more normal. The Box-Cox transformation is commonly used to transform non-normal data, X, to normal data, Y (see Chapter 3 of Wilks (2011) for more details):

Using λ=0 (a log-transform), our transformed data look like this:

These look much better! We can perform a formal hypothesis test to confirm that each of these 5 transformed data series are not inconsistent with the normal distribution using a number of tests, such as the Shapiro-Wilk test, the Kolmogorov-Smirnov test, and the Filliben Q-Q correlation test, which I use here (see Chapter 5 of Wilks, 2011 for a description of other tests). The Filliben Q-Q test finds the correlation between the sample data quantiles and the theoretical quantiles of the distribution being fit. I’ve plotted these below; the correlation coefficients at these 5 sites are [0.9922, 0.9951, 0.9822, 0.9909, 0.9945].

Rejection regions for the Filliben Q-Q test for the normal distribution are tabulated for different significance levels and sample sizes based on Monte Carlo results. The relevant section of the table is copied below. For a sample size of n≈50, the site with the lowest correlation (Site 3: 0.9822) fails to reject the null hypothesis that the data are normal at the 10% level, as the rejection region is ρ≤0.981. This means that if the data were normal, there would be a 10% chance that a data series of length n=50 would have a correlation coefficient below 0.981.

So now we know that none of the marginal distributions at each site is inconsistent with the normal distribution, but that does not guarantee that the joint distribution across sites will be multi-variate normal. There could be multi-variate outliers, or points which are not extreme within any particular site’s distribution, but are extreme within the context of the overall covariance structure. We test this by confirming that the squared Mahalanobis distances are not inconsistent with a χk2 distribution. Again, this can be done by comparing the sample data quantiles to the theoretical data quantiles (figure below). Here the correlation coefficient is 0.9964.

Because the rejection regions will depend not only on the sample size (n) and significance levels, but also the number of degrees of freedom (k), there are no tabulated critical values for this test (there would need to be a separate table for every possible k). Instead of using a table, one has to perform a Monte Carlo simulation to calculate the critical region for their specific application. In this case, I did that by generating 10,000 random samples of length n=51 from a χ2 distribution with k=5 degrees of freedom. Of the generated samples, 97.8% had correlation coefficients less than the observed value of 0.9964 suggesting that this sample is very consistent with a χ52 distribution.

So now that we know the MVN is a good fit for the log-transformed standard deviations in daily September flows, we can estimate the model parameters. This part is easy! The MLE estimator of the mean vector μ is the sample mean vector = [1,2, … , k] of the data (in this case, the log-transformed data), while the MLE estimator of the covariance [Σ] is the sample covariance $\left[S\right]= \frac{1}{N-1}\left[X^{'}\right]^{T}\left[X^{'}\right]$, where $\left[X^{'}\right] = \frac{1}{N}\left[1\right]\left[X\right]$ with [1] being an N×N matrix of 1s and [X] an N×K  matrix of the data (log transformed here).

Below is Python code for all of the fitting and plotting done here.


from __future__ import division
import numpy as np
from matplotlib import pyplot as plt
import seaborn.apionly as sns
from scipy import stats

def fitMVN():
# set plotting style
sns.set_style("darkgrid")

Nyears = int(np.shape(Qdaily)[0]/365)
Nsites = np.shape(Qdaily)[1]
Months = ['May','June','July','August','September','October','November','December','January','February','March','April']

# calculate standard deviation in daily flows each month and squared Mahalanobis distances
StdMonthly = calc_monthly_std(Qdaily, Nyears, Nsites)
D2 = calcD2(Nyears, Nsites, np.log(StdMonthly))

# calculate theoretical quantiles for a chi^2 distribution with dof = Nsites, and for the standard normal distribution
m = np.array(range(1,Nyears+1))
p = (m-0.5)/Nyears
chi2 = stats.chi2.ppf(p,Nsites)
norm = stats.norm.ppf(p,0,1)

# initialize matrices to store correlation coefficients and significance levels for marginal normal distributions and chi^2 distributions
normCorr = np.zeros([Nsites,12])
norm_sigLevel = np.zeros([Nsites,12])
chi2Corr = np.zeros([12])
chi2_sigLevel = np.zeros([12])

for i in range(len(Months)):
# plot histograms of standard deviation of daily flows each month, and of their logs
plotHistograms(Nsites, StdMonthly[:,:,i], 'Standard Deviation of Daily ' + Months[i] + ' Flows', Months[i] + 'Hist.png')
plotHistograms(Nsites, np.log(StdMonthly[:,:,i]), 'log(Standard Deviation of Daily ' + Months[i] + ' Flows)', \
'Log' + Months[i] + 'Hist.png')

# plot QQ plots of standard deviation of daily flows each month, and of their logs
plotNormQQ(Nsites, StdMonthly[:,:,i], norm, 'Standard Deviation of Daily ' + Months[i] + ' Flows', Months[i] + 'QQ.png')
normCorr[:,i] = plotNormQQ(Nsites, np.log(StdMonthly[:,:,i]), norm, 'log(Standard Deviation of Daily ' + Months[i] + ' Flows)', 'Log' + Months[i] + 'QQ.png')

# plot QQ plot of Chi Squared distribution of log of standard deviation in daily flows each month
chi2Corr[i] = plotChi2QQ(Nsites, D2[:,i], chi2, 'D$\mathregular{^2}\!$ of log(Standard Deviation of Daily ' + Months[i] + ' Flows)', \
'Log' + Months[i] + 'Chi2QQ.png')

# find significance levels
chi2_sigLevel[i] = chi2_MC(Nsites,Nyears,chi2,chi2Corr[i])
norm_sigLevel[:,i] = norm_MC(Nsites,Nyears,norm,normCorr[:,i])

np.savetxt('Norm_sigLevels.txt',np.transpose(norm_sigLevel))
np.savetxt('Norm_corr.txt',np.transpose(normCorr))
np.savetxt('Chi2_sigLevels.txt',chi2_sigLevel)
np.savetxt('Chi2_corr.txt',chi2Corr)

return None

def calc_monthly_std(Qdaily, Nyears, Nsites):
Nmonths = 12
# first month = May (1st month of water year)
DaysPerMonth = np.array([31, 30, 31, 31, 30, 31, 30, 31, 31, 28, 31, 30])

Qmonthly = np.zeros([Nsites, Nyears, Nmonths])
StdMonthly = np.zeros([Nsites, Nyears, Nmonths])
for year in range(Nyears):
for month in range(Nmonths):
start = year*365 + np.sum(DaysPerMonth[0:month])

for i in range(Nsites):
# find total flow each month
Qmonthly[i,year,month] = 86400*np.sum(Qdaily[start:start+DaysPerMonth[month],i])

# find standard deviation in daily flows each month
for i in range(Nsites):
for j in range(DaysPerMonth[month]):
StdMonthly[i,year,month] = StdMonthly[i,year,month] + \
(86400*Qdaily[start+j,i]-Qmonthly[i,year,month]/DaysPerMonth[month])**2

StdMonthly[i,year,month] = np.sqrt((1/(DaysPerMonth[month]-1))*StdMonthly[i,year,month])

return StdMonthly

def plotHistograms(Nsites, data, xlabel, filename):
fig = plt.figure()
for i in range(Nsites):
ax.hist(data[i,:],bins=10,color='navy',alpha=0.8)
ax.set_title('Site ' + str(i+1),fontsize=16)

fig.text(0.1, 0.5, 'Frequency', va='center', rotation='vertical', fontsize=14)
fig.text(0.5, 0.04, xlabel, ha='center', fontsize=14)
fig.set_size_inches([22.525,4.825])
fig.savefig('Hists/' + filename)
fig.clf()

return None

def plotNormQQ(Nsites, data, norm, title, filename):
corr = np.zeros([Nsites])
fig = plt.figure()
for i in range(Nsites):
corr[i] = np.corrcoef(np.sort(data[i,:]),norm)[0,1]
z = (data[i,:] - np.mean(data[i,:]))/np.std(data[i,:])
ax.scatter(norm,np.sort(z))
ax.plot([-3,3],[-3,3],c='r')
ax.set_title('Site ' + str(i+1),fontsize=16)
ax.set_xlim([-3,3])
ax.set_ylim([-3,3])

fig.text(0.1, 0.5, 'Sample Quantiles', va='center', rotation='vertical', fontsize=14)
fig.text(0.5, 0.04, 'Theoretical Quantiles', ha='center', fontsize=14)
fig.suptitle('Normal Q-Q Plot of ' + title,fontsize=16)
fig.set_size_inches([22.525,4.825])
fig.savefig('QQplots/' + filename)
fig.clf()

return corr

def calcD2(Nyears, Nsites, data):
D2 = np.zeros([Nyears,12])
X = np.zeros([Nyears, Nsites])
Xprime = np.zeros([Nyears,Nsites])
S = np.zeros(Nsites)
for i in range(12):
# fill data matrix, X, for ith month
for j in range(Nsites):
X[:,j] = data[j,:,i]

# calculate covariance matrix, S, for ith month
Xprime = X - (1/Nyears)*np.dot(np.ones([Nyears,Nyears]),X)
S = (1/(Nyears-1))*np.dot(np.transpose(Xprime),Xprime)

#calculate Mahalanobis distance, D2, for each year's ith month
for j in range(Nyears):
D2[j,i] = np.dot(np.dot((X[j,:] - np.mean(X,0)),np.linalg.inv(S)),(np.transpose(X[j,:] - np.mean(X,0))))

return D2

def plotChi2QQ(Nsites, data, chi2, title, filename):
corr = np.corrcoef(np.sort(data),chi2)[0,1]
fig = plt.figure()
ax.scatter(chi2,np.sort(data))
ax.plot([0,1.1*np.max(chi2)],[0,1.1*np.max(chi2)],c='r')
ax.set_xlabel('Theoretical Quantiles',fontsize=16)
ax.set_xlim([0, 1.1*np.max(chi2)])
ax.set_ylabel('Sample Quantiles',fontsize=16)
ax.set_ylim([0,1.1*np.max(data)])
ax.tick_params(axis='both',labelsize=14)
ax.set_title(r'$\chi^2$' + ' Q-Q Plot of ' + title,fontsize=16)
fig.savefig('QQplots/' + filename)
fig.clf()

return corr

def chi2_MC(Nsites,Nyears,theoretical,dataCorr):
corr = np.zeros(10000)
for i in range(10000): # 10,000 MC simulations
simulated = stats.chi2.rvs(Nsites,size=Nyears)
corr[i] = np.corrcoef(np.sort(simulated),theoretical)[0,1]

# find significance levels
corr = np.sort(corr)
for i in range(10000):
if dataCorr > corr[i]:
sigLevel = (i+0.5)/10000

return sigLevel

def norm_MC(Nsites,Nyears,theoretical,dataCorr):
sigLevel = np.zeros(Nsites)
corr = np.zeros([10000])
for i in range(10000): # 10,000 MC simulations
simulated = stats.norm.rvs(0,1,size=Nyears)
corr[i] = np.corrcoef(np.sort(simulated),theoretical)[0,1]

# find significance levels
corr = np.sort(corr)
for i in range(10000):
for j in range(Nsites):
if dataCorr[j] > corr[i]:
sigLevel[j] = (i+0.5)/10000

return sigLevel

fitMVN()