# Fitting Hidden Markov Models Part II: Sample Python Script

This is the second part of a two-part blog series on fitting hidden Markov models (HMMs). In Part I, I explained what HMMs are, why we might want to use them to model hydro-climatological data, and the methods traditionally used to fit them. Here I will show how to apply these methods using the Python package hmmlearn using annual streamflows in the Colorado River basin at the Colorado/Utah state line (USGS gage 09163500). First, note that to use hmmlearn on a Windows machine, I had to install it on Cygwin as a Python 2.7 library.

For this example, we will assume the state each year is either wet or dry, and the distribution of annual streamflows under each state is modeled by a Gaussian distribution. More states can be considered, as well as other distributions, but we will use a two-state, Gaussian HMM here for simplicity. Since streamflow is strictly positive, it might make sense to first log-transform the annual flows at the state line so that the Gaussian models won’t generate negative streamflows, so that’s what we do here.

After installing hmmlearn, the first step is to load the Gaussian hidden Markov model class with from hmmlearn.hmm import GaussianHMM. The fit function of this class requires as inputs the number of states (n_components, here 2 for wet and dry), the number of iterations to run of the Baum-Welch algorithm described in Part I (n_iter; I chose 1000), and the time series to which the model is fit (here a column vector, Q, of the annual or log-transformed annual flows). You can also set initial parameter estimates before fitting the model and only state those which need to be initialized with the init_params argument. This is a string of characters where ‘s’ stands for startprob (the probability of being in each state at the start), ‘t’ for transmat (the probability transition matrix), ‘m’ for means (mean vector) and ‘c’ for covars (covariance matrix). As discussed in Part I it is good to test several different initial parameter estimates to prevent convergence to a local optimum. For simplicity, here I simply use default estimates, but this tutorial shows how to pass your own. I call the model I fit on line 5 model.

Among other attributes and methods, model will have associated with it the means (means_) and covariances (covars_) of the Gaussian distributions fit to each state, the state probability transition matrix (transmat_), the log-likelihood function of the model (score) and methods for simulating from the HMM (sample) and predicting the states of observed values with the Viterbi algorithm described in Part I (predict). The score attribute could be used to compare the performance of models fit with different initial parameter estimates.

It is important to note that which state (wet or dry) is assigned a 0 and which state is assigned a 1 is arbitrary and different assignments may be made with different runs of the algorithm. To avoid confusion, I choose to reorganize the vectors of means and variances and the transition probability matrix so that state 0 is always the dry state, and state 1 is always the wet state. This is done on lines 22-26 if the mean of state 0 is greater than the mean of state 1.


from hmmlearn.hmm import GaussianHMM

def fitHMM(Q, nSamples):
# fit Gaussian HMM to Q
model = GaussianHMM(n_components=2, n_iter=1000).fit(np.reshape(Q,[len(Q),1]))

# classify each observation as state 0 or 1
hidden_states = model.predict(np.reshape(Q,[len(Q),1]))

# find parameters of Gaussian HMM
mus = np.array(model.means_)
sigmas = np.array(np.sqrt(np.array([np.diag(model.covars_[0]),np.diag(model.covars_[1])])))
P = np.array(model.transmat_)

# find log-likelihood of Gaussian HMM
logProb = model.score(np.reshape(Q,[len(Q),1]))

# generate nSamples from Gaussian HMM
samples = model.sample(nSamples)

# re-organize mus, sigmas and P so that first row is lower mean (if not already)
if mus[0] > mus[1]:
mus = np.flipud(mus)
sigmas = np.flipud(sigmas)
P = np.fliplr(np.flipud(P))
hidden_states = 1 - hidden_states

return hidden_states, mus, sigmas, P, logProb, samples

# log transform the data and fit the HMM
logQ = np.log(AnnualQ)
hidden_states, mus, sigmas, P, logProb, samples = fitHMM(logQ, 100)



Okay great, we’ve fit an HMM! What does the model look like? Let’s plot the time series of hidden states. Since we made the lower mean always represented by state 0, we know that hidden_states == 0 corresponds to the dry state and hidden_states == 1 to the wet state.


from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np

def plotTimeSeries(Q, hidden_states, ylabel, filename):

sns.set()
fig = plt.figure()

xs = np.arange(len(Q))+1909
ax.plot(xs, Q, c='k')

ax.set_xlabel('Year')
ax.set_ylabel(ylabel)
handles, labels = plt.gca().get_legend_handles_labels()
fig.legend(handles, labels, loc='lower center', ncol=2, frameon=True)
fig.savefig(filename)
fig.clf()

return None

plt.switch_backend('agg') # turn off display when running with Cygwin
plotTimeSeries(logQ, hidden_states, 'log(Flow at State Line)', 'StateTseries_Log.png')



Wow, looks like there’s some persistence! What are the transition probabilities?


print(model.transmat_)



Running that we get the following:

[[ 0.6794469   0.3205531 ]
[ 0.34904974  0.65095026]]

When in a dry state, there is a 68% chance of transitioning to a dry state again in the next year, while in a wet state there is a 65% chance of transitioning to a wet state again in the next year.

What does the distribution of flows look like in the wet and dry states, and how do these compare with the overall distribution? Since the probability distribution of the wet and dry states are Gaussian in log-space, and each state has some probability of being observed, the overall probability distribution is a mixed, or weighted, Gaussian distribution in which the weight of each of the two Gaussian models is the unconditional probability of being in their respective state. These probabilities make up the stationary distribution, π, which is the vector solving the equation π = πP, where P is the probability transition matrix. As briefly mentioned in Part I, this can be found using the method described here: π = (1/ Σi[ei])e in which e is the eigenvector of P corresponding to an eigenvalue of 1, and ei is the ith element of e. The overall distribution for our observations is then Y ~ π0N(μ0,σ02) + π1*N(μ1,σ12). We plot this distribution and the component distributions on top of a histogram of the log-space annual flows below.


from scipy import stats as ss

def plotDistribution(Q, mus, sigmas, P, filename):

# calculate stationary distribution
eigenvals, eigenvecs = np.linalg.eig(np.transpose(P))
one_eigval = np.argmin(np.abs(eigenvals-1))
pi = eigenvecs[:,one_eigval] / np.sum(eigenvecs[:,one_eigval])

x_0 = np.linspace(mus[0]-4*sigmas[0], mus[0]+4*sigmas[0], 10000)
fx_0 = pi[0]*ss.norm.pdf(x_0,mus[0],sigmas[0])

x_1 = np.linspace(mus[1]-4*sigmas[1], mus[1]+4*sigmas[1], 10000)
fx_1 = pi[1]*ss.norm.pdf(x_1,mus[1],sigmas[1])

x = np.linspace(mus[0]-4*sigmas[0], mus[1]+4*sigmas[1], 10000)
fx = pi[0]*ss.norm.pdf(x,mus[0],sigmas[0]) + \
pi[1]*ss.norm.pdf(x,mus[1],sigmas[1])

sns.set()
fig = plt.figure()
ax.hist(Q, color='k', alpha=0.5, density=True)
l1, = ax.plot(x_0, fx_0, c='r', linewidth=2, label='Dry State Distn')
l2, = ax.plot(x_1, fx_1, c='b', linewidth=2, label='Wet State Distn')
l3, = ax.plot(x, fx, c='k', linewidth=2, label='Combined State Distn')

handles, labels = plt.gca().get_legend_handles_labels()
fig.legend(handles, labels, loc='lower center', ncol=3, frameon=True)
fig.savefig(filename)
fig.clf()

return None

plotDistribution(logQ, mus, sigmas, P, 'MixedGaussianFit_Log.png')



Looks like a pretty good fit – seems like a Gaussian HMM is a decent model of log-transformed annual flows in the Colorado River at the Colorado/Utah state line. Hopefully you can find relevant applications for your work too. If so, I’d recommend reading through this hmmlearn tutorial, from which I learned how to do everything I’ve shown here.

# Creating shaded dial plots in python

I recently created a code for plotting shaded dials (figures that look like gauges or speedometers) in python and I thought I’d share my code here. The dials are well suited to plot things such as risk or maybe the probability of meeting a set of robustness criteria across a range of decision variables (shameless plug: if you’re at EWRI this week, come check out my talk: Conflicts in Coalitions, Wednesday morning at 8:30 in Northstar B for which I created these figures).

As hinted at above, I originally created the plot to show bivariate data, with one variable plotted as the location on the dial and the other as the color. You could also plot the same variable as both color and location if you wanted to emphasize the meaning of increasing value on the dial. An example dial created with the code is shown below.

Example custom dial. The above figure consists of two images, a dial plot (originally constructed from a pie plot) and a color bar, made as a separate image but using the same data.

The color distribution, location of arrow and labeling of the gauge and colorbar are all fully customizable. I created the figure by first making a pie chart using marplotlib, inscribing a small white circle in the middle and then cropping the image in half using the Python image processing library (PIL also known as Pillow). The arrow is created using the matplotlib “arrow” function and will point to a specified location on the dial. The code is created such that you can add an array of any length to specify your colors, the array does not have to be monotonic like the one shown above, but will accept any values between zero and one (if your data is not in this range I’d suggest normalizing).

Annotated code is below:

import matplotlib.pyplot as plt
from matplotlib import cm, gridspec
import numpy as np
import math
from PIL import Image
from mpl_toolkits.axes_grid1 import make_axes_locatable

# set your color array and name of figure here:
dial_colors = np.linspace(0,1,1000) # using linspace here as an example
figname = 'myDial'

# specify which index you want your arrow to point to
arrow_index = 750

# create labels at desired locations
# note that the pie plot ploots from right to left
labels = [' ']*len(dial_colors)*2
labels[25] = '100'
labels[250] = '75'
labels[500] = '50'
labels[750] = '25'
labels[975] = '0'

# function plotting a colored dial
def dial(color_array, arrow_index, labels, ax):
# Create bins to plot (equally sized)
size_of_groups=np.ones(len(color_array)*2)

# Create a pieplot, half white, half colored by your color array
white_half = np.ones(len(color_array))*.5
color_half = color_array
color_pallet = np.concatenate([color_half, white_half])

cs=cm.RdYlBu(color_pallet)
pie_wedge_collection = ax.pie(size_of_groups, colors=cs, labels=labels)

i=0
for pie_wedge in pie_wedge_collection[0]:
pie_wedge.set_edgecolor(cm.RdYlBu(color_pallet[i]))
i=i+1

# create a white circle to make the pie chart a dial
my_circle=plt.Circle( (0,0), 0.3, color='white')

# create the arrow, pointing at specified index
arrow_angle = (arrow_index/float(len(color_array)))*3.14159
arrow_x = 0.2*math.cos(arrow_angle)
arrow_y = 0.2*math.sin(arrow_angle)

# create figure and specify figure name
fig, ax = plt.subplots()

# make dial plot and save figure
dial(dial_colors, arrow_index, labels, ax)
ax.set_aspect('equal')
plt.savefig(figname + '.png', bbox_inches='tight')

# create a figure for the colorbar (crop so only colorbar is saved)
fig, ax2 = plt.subplots()
cmap = cm.ScalarMappable(cmap='RdYlBu')
cmap.set_array([min(dial_colors), max(dial_colors)])
cbar = plt.colorbar(cmap, orientation='horizontal')
cbar.ax.set_xlabel("Risk")
plt.savefig('cbar.png', bbox_inches='tight')
cbar = Image.open('cbar.png')
c_width, c_height = cbar.size
cbar = cbar.crop((0, .8*c_height, c_width, c_height)).save('cbar.png')

# open figure and crop bottom half
im = Image.open(figname + '.png')
width, height = im.size

# crop bottom half of figure
# function takes top corner &lt;span 				data-mce-type="bookmark" 				id="mce_SELREST_start" 				data-mce-style="overflow:hidden;line-height:0" 				style="overflow:hidden;line-height:0" 			&gt;&amp;#65279;&lt;/span&gt;and bottom corner coordinates
# of image to keep, (0,0) in python images is the top left corner
im = im.crop((0, 0, width+c_width, int(height/2.0))).save(figname + '.png')



## Other ways of doing this from around the web

This code was my way of making a dial plot, and I think it works well for plotting gradients on the dial. In the course of writing this I came across a couple similar codes, I’m listing them below. They both have advantages if you want to plot a small number of colors on your dial but I had trouble getting them to scale.

Here’s an example that creates dials using matplotlib patches, this method looks useful for plotting a small number of categorical data, I like the customization of the labels: http://nicolasfauchereau.github.io/climatecode/posts/drawing-a-gauge-with-matplotlib/

Here’s another alternative using the plotly library, I like the aesthetics but if you’re unfamiliar with plotly there’s a lot to learn before you can nicely customize the final product: https://plot.ly/python/gauge-charts/

# Creating parallel axis plots with multiple datasets, color gradients, and brushing in Python

Parallel axis plots (here is a good description of what they are) are a relatively recent development in the plotting world, so it is no surprise that there is no implementations of it with more than basic functionalities in the major plotting packages available online. Over the past couple of days I then created my own implementation of parallel axis plots in Python using Matplotlib Pandas’ and Plot.ly’s implementation get cumbersome when the user tries to apply brushing and multiple color gradients  to create versatile, high-resolution and story-telling plots for my next papers and presentations. This implementation allows for:

• Plotting multiple datasets,
• Displaying dataset names,
• Choosing columns to be plot,
• Coloring each dataset based on a column and a different Matplotlib color map,
• Specifying ranges to be plotted,
• Inverting multiple axis,
• Brushing by intervales in multiple axis,
• Choosing different fonts for title and rest of the plot, and
• Export result as a figure file or viewing plot in Matplotlib’s interactive window.

The source code can be found here, and below is an example of how to use it:

import numpy as np
from plotting.parallel_axis import paxis_plot
from matplotlib.colors import LinearSegmentedColormap
from matplotlib import cm

bu_cy = LinearSegmentedColormap.from_list('BuCy', [(0, 0, 1), (0, 1, 1)])
bu_cy_r = bu_cy.reversed()

data1 = np.random.normal(size=(100, 8))
data2 = np.random.normal(size=(100, 8))
columns_to_plot = [0, 1, 3, 5, 7]
color_column = 0
axis_labels = ['axes ' + str(i) for i in range(8)]
dataset_names = ['Data set 1', 'Data set 2']
plot_ranges = [[-3.5, 3.5]] * 3 + [[-2.9, 3.1]] + [[-3.5, 3.5]] * 4
axis_to_invert = [1, 5]
brush_criteria = {1: [-10., 0.], 7: [10., 0.]}

paxis_plot((data1, data2),
columns_to_plot,
color_column,
[bu_cy_r, cm.get_cmap('autumn_r')],
axis_labels,
'Title Here',
dataset_names,
axis_ranges=plot_ranges,
fontname_title='Gill Sans MT',
fontname_body='CMU Bright',
file_name='test.png',
axis_to_invert=axis_to_invert,
brush_criteria=brush_criteria)


The output of this script should be a file named “test.png” that looks similar to the plot below:

# Logistic Regression for Scenario Discovery

As most of you probably know, scenario discovery is an exploratory modeling approach [Bankes, 1993] that involves stress-testing proposed policies over plausible future “states of the world” (SOWs) to discover conditions under which those policies would fail to meet performance goals [Bryant and Lempert, 2010]. The scenario discovery process is therefore an exercise in statistical classification. Two commonly used methods used for the scenario discovery process are the Patient Rule Induction Method (PRIM; Friedman and Fisher [1999]) and Classification and Regression Trees (CART; Breiman et al. [1984]), both of which are included in the OpenMORDM R package and Rhodium Python package.

Another commonly used method in classification that hasn’t been given much attention in the scenario discovery literature is logistic regression. Logistic regression models estimate the probability that an event is classified as a success (1) as opposed to a failure (0) as a function of different covariates. This allows for the definition of “safe operating spaces,” or factor combinations leading to success, based on the probability with which one would like to be able to achieve the specified performance goal(s). We may not know the probability that a particular SOW will occur, but through the logistic regression we can estimate the probability of success in that SOW should it occur. The logistic regression can also identify which factors most influence a policy’s ability to meet those performance goals.

This blog post will illustrate how to build logistic regression models in Python for scenario discovery using the Red River basin as an example. Here we are interested in determining under what streamflow and demand characteristics reservoir operating policies are unable to protect Hanoi from the 100-yr flood. We assume operators want to ensure protection to this event with at least 95% reliability and use logistic regression to estimate under what combination of streamflow and demand characteristics they will be able to do so.

The form of the logistic regression model is given by Equation 1, where pi represents the probability that performance in the ith SOW is classified as a success and Xi represents a vector of covariates (in this case, streamflow and demand characteristics) describing the ith SOW:

1) $\ln\Bigg(\frac{p_i}{1-p_i}\Bigg) = \mathbf{X_i^\intercal}\mathbf{\beta}$.

The coefficients, $\mathbf{\beta}$, on the covariates are estimated using Maximum Likelihood Estimation.

To determine which streamflow and demand characteristics are most important in explaining successes and failures, we can compare the McFadden’s pseudo-R2 values associated with different models that include different covariates. McFadden’s pseudo-R2, $R_{McFadden}^2$, is given by Equation 2:

2) $R_{McFadden}^2 = 1 - \frac{\ln \hat{L}(M_{Full})}{\ln \hat{L}(M_{Intercept})}$

where $\ln \hat{L}(M_{Full})$ is the log-likelihood of the full model and $\ln \hat{L}(M_{Intercept})$ is the log-likelihood of the intercept model, i.e. a model with no covariates beyond the intercept. The intercept model therefore predicts the mean probability of success across all SOWs. $R_{McFadden}^2$ is a measure of improvement of the full model over the intercept model.

A common approach to fitting regression models is to add covariates one-by-one based on which most increase R2 (or in this case, $R_{McFadden}^2$), stopping once the increase of an additional covariate is marginal. The covariate that by itself most increases $R_{McFadden}^2$ is therefore the most important in predicting a policy’s success. To do this in Python, we will use the library statsmodels.

Imagine we have a pandas dataframe, dta that includes n columns of streamflow and demand characteristics describing different SOWs (rows) and a final column of 0s and 1s representing whether or not the policy being evaluated can provide protection to the 100-yr flood in that SOW (0 for no and 1 for yes). Assume the column of 0s and 1s is the last column and it is labeled Success. We can find the value of $R_{McFadden}^2$ for each covariate individually by running the following code:

import pandas as pd
import statsmodels.api as sm
from scipy import stats

# deal with fact that calling result.summary() in statsmodels.api
# calls scipy.stats.chisqprob, which no longer exists
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)

def fitLogit(dta, predictors):
# concatenate intercept column of 1s
dta['Intercept'] = np.ones(np.shape(dta)[0])

# get columns of predictors
cols = dta.columns.tolist()[-1:] + predictors

#fit logistic regression
logit = sm.Logit(dta['Success'], dta[cols])
result = logit.fit()

return result

n = len(dta.columns) - 1
for i in range(n):
predictors = dta.columns.tolist()[i:(i+1)]
result = fitLogit(dta, predictors)
print(result.summary())



A sample output for one predictor, Col1 is shown below. This predictor has a pseudo-R2 of 0.1138.

Once the most informative predictor has been determined, additional models can be tested by adding more predictors one-by-one as described above. Suppose that through this process, one finds that the first 3 columns of dta (Col1,Col2 and Col3) are the most informative for predicting success on providing protection to the 100-yr flood, while the subsequent columns provide little additional predictive power. We can use this model to visualize the probability of success as a function of these 3 factors using a contour map. If we want to show this as a 2D projection, the probability of success can only be shown for combinations of 2 of these factors. In this case, we can hold the third factor constant at some value, say its base value. This is illustrated in the code below, which also shows a scatter plot of the SOWs. The dots are shaded light blue if the policy succeeds in providing protection to the 100-yr flood in that world, and dark red if it does not.


import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import pandas as pd
import statsmodels.api as sm

def fitLogit(dta, predictors):
# concatenate intercept column of 1s
dta['Intercept'] = np.ones(np.shape(dta)[0])

# get columns of predictors
cols = dta.columns.tolist()[-1:] + predictors

#fit logistic regression
logit = sm.Logit(dta['Success'], dta[cols])
result = logit.fit()

return result

def plotContourMap(ax, result, constant, dta, contour_cmap, dot_cmap, levels, xgrid, ygrid, \
xvar, yvar, base):

# find probability of success for x=xgrid, y=ygrid
X, Y = np.meshgrid(xgrid, ygrid)
x = X.flatten()
y = Y.flatten()
if constant == 'x3': # 3rd predictor held constant at base value
grid = np.column_stack([np.ones(len(x)),x,y,np.ones(len(x))*base[2]])
elif constant == 'x2': # 2nd predictor held constant at base value
grid = np.column_stack([np.ones(len(x)),x,np.ones(len(x))*base[1],y])
else: # 1st predictor held constant at base value
grid = np.column_stack([np.ones(len(x)),np.ones(len(x))*base[0],x,y])

z = result.predict(grid)
Z = np.reshape(z, np.shape(X))

contourset = ax.contourf(X, Y, Z, levels, cmap=contour_cmap)
ax.scatter(dta[xvar].values, dta[yvar].values, c=dta['Success'].values, edgecolor='none', cmap=dot_cmap)
ax.set_xlim(np.min(X),np.max(X))
ax.set_ylim(np.min(Y),np.max(Y))
ax.set_xlabel(xvar,fontsize=24)
ax.set_ylabel(yvar,fontsize=24)
ax.tick_params(axis='both',labelsize=18)

return contourset

# build logistic regression model with first 3 columns of predictors from dta
predictors = dta.columns.tolist()[0:3]
result = fitLogit(dta, predictors)

# define color map for dots representing SOWs in which the policy
# succeeds (light blue) and fails (dark red)
dot_cmap = mpl.colors.ListedColormap(np.array([[227,26,28],[166,206,227]])/255.0)

# define color map for probability contours
contour_cmap = mpl.cm.get_cmap(‘RdBu’)

# define probability contours
contour_levels = np.arange(0.0, 1.05,0.1)

# define grid of x (1st predictor), y (2nd predictor), and z (3rd predictor) dimensions
# to plot contour map over
xgrid = np.arange(-0.1,1.1,0.01)
ygrid = np.arange(-0.1,1.1,0.01)
zgrid = np.arange(-0.1,1.1,0.01)

# define base values of 3 predictors
base = [0.5, 0.5, 0.5]

fig = plt.figure()
# plot contour map when 3rd predictor ('x3') is held constant
plotContourMap(ax, result, 'x3', dta, contour_cmap, dot_cmap, contour_levels, xgrid, ygrid, \
'Col1', 'Col2', base)
# plot contour map when 2nd predictor ('x2') is held constant
contourset = plotContourMap(ax, result, 'x2', dta, contour_cmap, dot_cmap, contour_levels, xgrid, zgrid, \
'Col1', 'Col3', base)

cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7])
cbar = fig.colorbar(contourset, cax=cbar_ax)
cbar_ax.set_ylabel('Probability of Success',fontsize=20)
yticklabels = cbar.ax.get_yticklabels()
cbar.ax.set_yticklabels(yticklabels,fontsize=18)
fig.set_size_inches([14.5,8])
fig.savefig('Fig1.png')
fig.clf()



This produces the following figure:

We can also use the probability contours discovered above to define “safe operating spaces” as combinations of these 3 factors under which the evaluated policy is able to succeed in providing protection to the 100-yr flood with some reliability, say 95%. The hyperplane of factor combinations defining that 95% probability contour can be determined by setting p to 0.95 in Equation 2. Again, to plot 2-D projections of that hyperplane, the values of the other covariates can be held constant at their base values. The code below illustrates how to do this with a 95% boundary.


# define colormap for classifying boundary between failure and success
class_cmap = mpl.colors.ListedColormap(np.array([[251,154,153],[31,120,180]])/255.0)

# define probability cutoff between failure and success
class_levels = [0.0, 0.95, 1.0]

fig = plt.figure()
# plot contour map when 3rd predictor ('x3') is held constant
plotContourMap(ax, result, 'x3', dta, class_cmap, dot_cmap, class_levels, xgrid, ygrid, \
'Col1', 'Col2', base)

# plot contour map when 2nd predictor ('x2') is held constant
plotContourMap(ax, result, 'x2', dta, class_cmap, dot_cmap, class_levels, xgrid, zgrid, \
'Col1', 'Col3', base)

fig.set_size_inches([14.5,8])
fig.savefig('Fig2.png')
fig.clf()



This produces the following figure, where the light red region is the parameter ranges in which the policy cannot provide protection to the 100-yr flood with 95% reliability, and the dark blue region is the “safe operating space” in which it can.

All code for this example can be found here.

# Policy Diagnostics with Time-Varying and State Space PDFs

Some of my work has focused on “policy diagnostics,” analyzing how policies (in this case, multi-reservoir operating policies) that favor different objectives perform under different conditions and why. This can guide analysts in choosing a policy to implement, or even in determining objectives that policies should be optimized to (cough, cough, see Quinn et al., 2017). One of the more effective ways we’ve found to analyze these policies is by examining their probabilistic behavior through time-varying PDFs and state-space PDFs. This blog post will illustrate these two types of figures and provide sample code for creating them. The code for the versions of these figures generated in the above paper can be found here.

Below is an example of how time-varying PDFs can provide insights into system behavior using the Red River basin as an example. These plots show the probability of the water level in Hanoi (y axis in both figures) being at different levels on different days of the year (x axis in both figures), from the beginning of the monsoon in May to the end of the dry season in April. Red shades represent high probabilities and blue shades represent low probabilities. The left plot shows these dynamics for a policy minimizing the 100-yr annual maximum water level, while the right plot shows them for a policy maximizing the 100-yr average hydropower production. The flood-minimizing policy has a lower probability of overtopping the dikes and crossing a stakeholder-elicited alarm level of 11.25 m (Second Alarm) compared to the hydropower-maximizing policy. However, this reduction in the probability of high floodwaters requires a higher probability of crossing a lower stakeholder-elicited alarm level of 6 m (First Alarm), highlighting a tradeoff between reducing severe floods and nuisance floods. There are also different dynamics during the dry season, where the flood-minimizing solution releases more to both meet agricultural demand at the time of planting and lower the reservoir level in advance of the next monsoon. There is a bifurcation in the high probability density streak during this time, suggesting how much needs to be released depends on what is needed to lower the reservoir level to an acceptable pre-flood season level or meet the agricultural demand.

To create this figure, we simply need an N x 365 matrix of the water level on each day (column) of N different annual simulations (rows). Let’s call this matrix ‘data’. We then need to reformat ‘data’ into a Y x 365 matrix, where Y is the number of “bins” along the y axis (between ymin and ymax) that we are going to group our data into to make a histogram for each day. Finally, we just need to count how many data points occur in each bin, and then divide this count by the total number of simulated years, N. This is shown using the function ‘getTimeVaryingProbs.py’ below assuming we have two datasets we want to plot, ‘data1’ and ‘data2’.

import numpy as np

def getTimeVaryingProbs(data, N, Y, ymin, ymax):
'''Finds the probability of being at a specific water level (y) on a given day.'''
probMatrix = np.zeros([Y,365])
step = (ymax-ymin)/Y
for i in range(np.shape(probMatrix)[0]):
for j in range(np.shape(probMatrix)[1]):
count = ((data[:,j] < ymax-step*i) & (data[:,j] >= ymax-step*(i+1))).sum()
probMatrix[i,j] = count/N

return probMatrix

probMatrix1 = getTimeVaryingProbs(data1, 100000, 366, 0, 15)
probMatrix2 = getTimeVaryingProbs(data2, 100000, 366, 0, 15)


After calling ‘getTimeVaryingProbs.py’ to generate ‘probMatrix1’ and ‘probMatrix2’, we can plot the time-varying PDF of each of these using ‘imshow’. Since we want to compare the two side-by-side, we need to make sure they’re normalized over the same range. We do this by finding the lowest and highest probabilities over the two matrices and normalizing our color map over that range:

import numpy as np
from matplotlib import pyplot as plt
import matplotlib as mpl

# find the lowest and highest probability between two probability matrices
probMin = min(np.min(probMatrix1), np.min(probMatrix2))
probMax = max(np.max(probMatrix1), np.max(probMatrix2))

fig = plt.figure()
sm = ax1.imshow(probMatrix1, cmap='RdYlBu', origin='upper', norm=mpl.colors.Normalize(vmin=probMin, vmax=ProbMax))
sm = ax2.imshow(probMatrix2, cmap='RdYlBu', origin='upper', norm=mpl.colors.Normalize(vmin=probMin, vmax=ProbMax))
cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7])
cbar = fig.colorbar(sm, cax=cbar_ax)
cbar.ax.set_ylabel('Probability Density',fontsize=16)
fig.show()


In some cases, it may be helpful to plot a log transformation of the probability matrices, as was done in the above paper since streamflows are highly skewed.

Below is an example of how state-space PDFs can provide insights into system behavior, again using the Red River basin as an example. These plots show the probability of the water level in Hanoi (y axis in both figures) being at different levels when the total storage in the reservoirs upstream is at different levels (x axis in both figures). Red shades again represent high probabilities and blue shades represent low probabilities. The left plot shows these dynamics for a compromise policy optimized to one set of objectives, while the right plot shows them for a compromise policy optimized to a different set of objectives. The compromise policy on the left fills up the reservoirs without releasing much water downstream, resulting in a high probability streak along the bottom of the plot at low water levels. This will favor hydropower production. However, when the largest reservoirs fill up, they are forced to spill, resulting in a spike in the water level downstream. This occurs before the smaller reservoirs have filled up, and in wet years, results in overtopping before total system storage has been reached. Consequently, this policy does not make full use of the total system storage for flood protection. The compromise policy on the right, however, increases the system storage and water level simultaneously, releasing some of what initially comes in to leave empty capacity for future flood events. This strategy makes better use of the full system capacity, only resulting in overtopping when maximum system storage has been reached. The difference in the behavior of these two compromise solutions highlights the need to test rival framings of objective functions for multi-objective optimization, as some formulations may suffer unintended consequences like the formulation on the left.

To create this figure, we need two N x 365 matrices, one of the water level on each day (column) of N different annual simulations (rows) and another of the total system storage. Let’s call these matrices ‘h’ and ‘s’, respectively. We then need to use these matrices to populate a Y x X probability matrix, where Y is the number of bins along the y axis (water level, h) between ymin and ymax, and X the number of bins along the x axis (storage, s) between xmin and xmax. This probability matrix will represent a 2D histogram of how many data points lie in a combined water level and storage bin.  We again just need to count how many data points occur in each bin, and then divide this count by the total number of simulated points (365N). This is shown using the function ‘getJointProbs.py’ below assuming we have two joint datasets, (h1,s1) and (h2,s2), that we want to plot.

def getJointProbs(h, s, Y, X, ymax, ymin, xmax, xmin):
'''Finds the probability of being at a specific water level (h) and storage (s) jointly'''
probMatrix = np.zeros([Y,X])
yStep = (ymax-ymin)/np.shape(probMatrix)[0]
xStep = (xmax-xmin)/np.shape(probMatrix)[1]
for i in range(np.shape(s)[0]):
for j in range(np.shape(s)[1]):
# figure out which "box" the simulated s and h are in
row = int(np.floor((ymax-h[i,j])/yStep))
col = int(np.ceil((s[i,j]-xmin)/xStep))
if row < np.shape(probMatrix)[0] and col < np.shape(probMatrix)[1]:
probMatrix[row,col] = probMatrix[row,col] + 1

# calculate probability of being in each box
probMatrix = probMatrix/(np.shape(s)[0]*np.shape(s)[1])

return probMatrix

probMatrix1 = getJointProbs(h1, s1, 100, 100, 15, 0, 3.0E10, 0.5E10)
probMatrix2 = getJointProbs(h2, s2, 100, 100, 15, 0, 3.0E10, 0.5E10)


After calling ‘getJointProbs.py’ to generate ‘probMatrix1’ and ‘probMatrix2’, we can again plot the state space PDF of each of these using ‘imshow’ as illustrated in the second snippet of code above. Now go analyze how your reservoirs are probabilistically operating as a system!

# Plotting trajectories and direction fields for a system of ODEs in Python

The aim of this post is to guide the reader through plotting trajectories and direction fields for a system of equations in Python. This is useful when investigating the equilibria and stability of the system, and to facilitate in understanding the general behavior of a system under study. I will use a system of predator-prey equations, that my most devoted online readers are already familiar with from my previous posts on identifying equilibria and stability, and on nondimensionalization. Specifically, I’ll be using the Lotka-Volterra set of equations with Holling’s Type II functional response:

$\frac{\mathrm{d} x}{\mathrm{d} t}=bx\left ( 1-\frac{x}{K} \right )-\frac{axy}{1+ahx}$

$\frac{\mathrm{d} y}{\mathrm{d} t}=\frac{caxy}{1+ahx}-dy$

where:

x: prey abundance

y: predator abundance

b: prey growth rate

d: predator death rate

c: rate with which consumed prey is converted to predator

a: rate with which prey is killed by a predator per unit of time

K: prey carrying capacity given the prey’s environmental conditions

h: handling time

This system has 3 equilibria: when both species are dead (0,0), when predators are dead and the prey grows to its carrying capacity (K,0) and a non-trivial equilibrium where both species coexist and is generally more interesting, given by:

$y^*=\frac{b}{a}(1+ahx^*)\left(1-\frac{x^*}{K} \right)$

$x^*=\frac{d}{a(c-dh)}$

The following code should produce both trajectories and direction fields for this system of ODEs (python virtuosos please excuse the extensive commenting, I try to comment as much as possible for people new to python):

import numpy as np
from matplotlib import pyplot as plt
from scipy import integrate

# I'm using this style for a pretier plot, but it's not actually necessary
plt.style.use('ggplot')

"""
This is to ignore RuntimeWarning: invalid value encountered in true_divide
I know that when my populations are zero there's some division by zero and
the resulting error terminates my function, which I want to avoid in this case.
"""
np.seterr(divide='ignore', invalid='ignore')

# These are the parameter values we'll be using
a = 0.005
b = 0.5
c = 0.5
d = 0.1
h = 0.1
K = 2000

# Define the system of ODEs
# P[0] is prey, P[1] is predator
def fish(P, t=0):
return ([b*P[0]*(1-P[0]/K) - (a*P[0]*P[1])/(1+a*h*P[0]),
c*(a*P[0]*P[1])/(1+a*h*P[0]) - d*P[1] ])

# Define equilibrium point
EQ = ([d/(a*(c-d*h)),b*(1+a*h*(d/(a*(c-d*h))))*(1-(d/(a*(c-d*h)))/K)/a])

"""
I need to define the possible values my initial points will take as they
relate to the equilibrium point. In this case I chose to plot 10 trajectories
ranging from 0.1 to 5
"""
values = np.linspace(0.1, 5, 10)
# I want each trajectory to have a different color
vcolors = plt.cm.autumn_r(np.linspace(0.1, 1, len(values)))

# Open figure
f = plt.figure()
"""
I need to define a range of time over which to integrate the system of ODEs
The values don't really matter in this case because our system doesn't have t
on the right hand side of dx/dt and dy/dt, but it is a necessary input for
integrate.odeint.
"""
t = np.linspace(0, 150, 1000)

# Plot trajectories by looping through the possible values
for v, col in zip(values, vcolors):
# Starting point of each trajectory
P0 = [E*v for E in EQ]
# Integrate system of ODEs to get x and y values
P = integrate.odeint(fish, P0, t)
# Plot each trajectory
plt.plot( P[:,0], P[:,1],
# Different line width for different trajectories (optional)
lw=0.5*v,
# Different color for each trajectory
color=col,
# Assign starting point to trajectory label
label='P0=(%.f, %.f)' % ( P0[0], P0[1]) )
"""
To plot the direction fields we first need to define a grid in order to
compute the direction at each point
"""
# Get limits of trajectory plot
ymax = plt.ylim(ymin=0)[1]
xmax = plt.xlim(xmin=0)[1]
# Define number of points
nb_points = 20
# Define x and y ranges
x = np.linspace(0, xmax, nb_points)
y = np.linspace(0, ymax, nb_points)
# Create meshgrid
X1 , Y1 = np.meshgrid(x,y)
# Calculate growth rate at each grid point
DX1, DY1 = fish([X1, Y1])
# Direction at each grid point is the hypotenuse of the prey direction and the
# predator direction.
M = (np.hypot(DX1, DY1))
# This is to avoid any divisions when normalizing
M[ M == 0] = 1.
# Normalize the length of each arrow (optional)
DX1 /= M
DY1 /= M

plt.title('Trajectories and direction fields')
"""
This is using the quiver function to plot the field of arrows using DX1 and
DY1 for direction and M for speed
"""
Q = plt.quiver(X1, Y1, DX1, DY1, M, pivot='mid', cmap=plt.cm.plasma)
plt.xlabel('Prey abundance')
plt.ylabel('Predator abundance')
plt.legend(bbox_to_anchor=(1.05, 1.0))
plt.grid()
plt.xlim(0, xmax)
plt.ylim(0, ymax)
plt.show()



This should produce the following plot. All P0s are the initial conditions we defined.

We can also see that this parameter combination produces limit cycles in our system. If we change the parameter values to:

a = 0.005
b = 0.5
c = 0.5
d = 0.1
h = 0.1
K = 200


i.e. reduce the available resources to the prey, our trajectories look like this:

The equilibrium becomes stable, attracting the trajectories to it.

The same can be seen if we increase the predator death rate:

a = 0.005
b = 0.5
c = 0.5
d = 1.5
h = 0.1
K = 2000


The implication of this observation is that an initially stable system, can become unstable given more resources for the prey or less efficient predators. This has been referred to as the Paradox of Enrichment and other predator-prey models have tried to address it (more on this in future posts).

P.S: I would also like to link to this scipy tutorial, that I found very helpful and that contains more plotting tips.

# Making Watershed Maps in Python

This post builds off of earlier posts by Jon Lamontagne and Jon Herman on making global maps in Python using matplotlib and basemap. However rather than making a global map, I’ll show how to zoom into a particular region, here the Red River basin in East Asia. To make these maps, you’ll need to have basemap installed (from github here, or using a Windows installer here).

The first step is to create a basemap. Both Jons used the ‘robin’ global projection to do this in their posts. Since I’m only interested in a particular region, I just specify the bounding box using the lower and upper latitudes and longitudes of the region I’d like to plot. As Jon H points out, you can also specify the resolution (‘f’ = full, ‘h’ =high, ‘i’ = intermediate, ‘l’ = low, ‘c’ = crude), and you can even use different ArcGIS images for the background (see here). I use ‘World_Shaded_Relief’. It’s also possible to add a lot of features such as rivers, countries, coastlines, counties, etc. I plot countries and rivers. The argument ‘zorder’ specifies the order of the layering from 1 to n, where 1 is the bottom layer and n the top.


from mpl_toolkits.basemap import Basemap
from matplotlib import pyplot as plt

fig = plt.figure()
fig.set_size_inches([17.05,8.15])

# plot basemap, rivers and countries
m = basemap(llcrnrlat=19.5, urcrnrlat=26.0, llcrnrlon=99.6, urcrnr=107.5, resolution='h')
m.drawrivers(color='dodgerblue',linewidth=1.0,zorder=1)
m.drawcountries(color='k',linewidth=1.25)



The above code makes the following image (it takes some time, since I’m using high resolution):

Now let’s add a shaded outline of the Red River basin. To do this, you need a shapefile of the basin. The FAO provides a shapefile of major watersheds in the world, from which you can extract the watershed you’re interested in using ArcGIS (see instructions here). In this shapefile, the Red River is labeled by its name in Vietnamese, ‘Song Hong.’ I chose not to draw the bounds of the basin in my map because it would be too busy with the country borders. Instead, I shaded the region gray (facecolor=’0.33′) with a slightly darker border (edgecolor=’0.5′) and slight transparency (alpha=0.5). To do that, I had to collect all of the patches associated with the shapefile (which I called ‘Basin’ when reading it in) that needed to be shaded.


from matplotlib.patches import Polygon
from matplotlib.collections import Patch Collection

# plot Red River basin
patches = []
for info, shape in zip(m.Basin_info, m.Basin):
if info['OBJECTID'] == 1: # attribute in attribute table of shapefile
patches.append(Polygon(np.array(shape), True))



This creates the following image:

Now let’s add the locations of major dams and cities in the basin using ‘scatter‘. You could again do this by adding a shapefile, but I’m just going to add their locations manually, either by uploading their latitude and longitude coordinates from a .csv file or by passing them directly.


import numpy as np

# plot dams
damsLatLong = np.loadtxt('DamLocations.csv', delimiter=',', skiprows=1, usecols=[1,2])
x, y = m(damsLatLong[:,1], damLatLong[:,0]) # m(longitude, latitude)
m.scatter(x, y, c='k', s = 150, marker = '^')

# plot Hanoi
x, y = m(105.8342, 21.0278)
m.scatter(x, y, facecolor='darkred', edgecolor='darkred', s=150)



This makes the following image:

If we want to label the dams and cities, we can add text specifying where on the map we’d like them to be located. This may require some guess-and-check work to determine the best place (comment if you know a better way!). I temporarily added gridlines to the map to aid in this process using ‘drawparallels‘ and ‘drawmeridians‘.


# label dams and Hanoi
plt.text(104.8, 21.0, 'Hoa Binh', fontsize=18, ha='center', va='center', color='k')
plt.text(104.0, 21.7, 'Son La', fontsize=18, ha='center', va='center', color='k')
plt.text(105.0, 21.95, 'Thac Ba', fontsize=18, ha='center', va='center', color='k')
plt.text(105.4, 22.55, 'Tuyen Quang', fontsize=18, ha='center', va='center', color='k')
plt.text(105.8, 21.2, 'Hanoi', fontsize=18, ha='center', va='center', color='k')



Now our map looks like this:

That looks nice, but it would be helpful to add some context as to where in the world the Red River basin is located. To illustrate this, we can create an inset of the greater geographical area by adding another set of axes with its own basemap. This one can be at a lower resolution.


from mpl_toolkits.axes_grid1.inset_locator import zoomed_inset_axes

# plot inset of greater geographic area
axins = zoomed_inset_axes(ax, 0.1, loc=1) # locations
axins.set_xlim(90, 115) # longitude boundaries of inset map
axins.set_ylim(8, 28) # latitude boundaries of inset map

# remove tick marks from inset axes
plt.xticks(visible=False)
plt.yticks(visible=False)

# add basemap to inset map
m2 = Basemap(llcrnrlat=8.0, urcrnclat=28.0, llcrnr=90.0, urcrnrlon=115.0, resolution='l', ax=axins)
m2.drawcountries(color='k', linewidth=0.5)



This image looks like this:

Now let’s highlight a country of interest (Vietnam) in green and also add the Red River basin in light gray again.


# plot Vietnam green in inset
patches2 = []
for info, shape in zip(m2.Vietnam_info, m2.Vietnam):
if info['Joiner'] == 1:
patches2.append(Polygon(np.array(shape), True))

# shade Red River basin gray in inset



Now our map looks like this:

Finally, let’s label the countries in the inset. Some of the countries are too small to fit their name inside, so we’ll have to create arrows pointing to them using ‘annotate‘. In this function, ‘xy’ specifies where the arrow points to and ‘xytext’ where the text is written relative to where the arrow points.


# label countries
plt.text(107.5, 25.5, 'China', fontsize=11, ha='center', va='center', color='k')
plt.text(102.5, 20.2, 'China', fontsize=11, ha='center', va='center', color='k')
plt.text(101.9, 15.5, 'China', fontsize=11, ha='center', va='center', color='k')
plt.text(9.5, 21.0, 'China', fontsize=11, ha='center', va='center', color='k')

# add arrows to label Vietnam and Cambodia
plt.annotate('Vietnam', xy=(108.0, 14.0), xycoords='data', xytext=(5.0, 20.0), textcoords='offset points', \
color='k', arrowprops=dict(arrowstyle='-'), fontsize=11)
plt.annotate('Cambodia', xy=(104.5, 12.0), xycoords='data', xytext=(-60.0, -25.0), textcoords='offset points', \
color='k', arrowprops=dict(arrowstyle='-'), fontsize=11)



Now our map looks like this:

I think that’s pretty good, so let’s save it ;). See below for all the code used to make this map, with all the import statements at the beginning rather than sporadically inserted throughout the code!

If you’re looking for any other tips on how to make different types of maps using basemap, I recommend browsing through the basemap toolkit documentation and this basemap tutorial, where I learned how to do most of what I showed here.


from mpl_toolkits.basemap import Basemap
from mpl_toolkits.axes_grid1.inset_locator import zoomed_inset_axes
from matplotlib import pyplot as plt
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
import numpy as np

# set-up Vietnam basemap
fig = plt.figure()
fig.set_size_inches([17.05, 8.15])

# plot basemap, rivers and countries
m = Basemap(llcrnrlat=19.5,urcrnrlat=26.0,llcrnrlon=99.6,urcrnrlon=107.5,resolution='h')
m.drawrivers(color='dodgerblue',linewidth=1.0,zorder=1)
m.drawcountries(color='k',linewidth=1.25)

# plot Red River basin
patches = []
for info, shape in zip(m.Basin_info, m.Basin):
if info['OBJECTID'] == 1:
patches.append(Polygon(np.array(shape), True))

# plot dams
x, y = m(damsLatLong[:,1], damsLatLong[:,0])
m.scatter(x, y, c='k', s=150, marker='^')

# plot Hanoi
x, y = m(105.8342, 21.0278)
m.scatter(x, y, facecolor='darkred', edgecolor='darkred', s=150)

# label reservoirs and Hanoi
plt.text(104.8, 21.0, 'Hoa Binh', fontsize=18, ha='center',va='center',color='k')
plt.text(104.0, 21.7, 'Son La', fontsize=18, ha='center', va='center', color='k')
plt.text(105.0, 21.95, 'Thac Ba', fontsize=18, ha='center', va='center', color='k')
plt.text(105.4, 22.55, 'Tuyen Quang', fontsize=18, ha='center', va='center', color='k')
plt.text(105.8, 21.2, 'Hanoi', fontsize=18, ha='center', va='center', color='k')

# plot inset of greater geographic area
axins = zoomed_inset_axes(ax, 0.1, loc=1)
axins.set_xlim(90, 115)
axins.set_ylim(8,28)

plt.xticks(visible=False)
plt.yticks(visible=False)

m2 = Basemap(llcrnrlat=8.0,urcrnrlat=28.0,llcrnrlon=90.0,urcrnrlon=115.0,resolution='l',ax=axins)
m2.drawcountries(color='k',linewidth=0.5)

# plot Vietnam green in inset
patches2 = []
for info, shape in zip(m2.Vietnam_info, m2.Vietnam):
if info['Joiner'] == 1:
patches2.append(Polygon(np.array(shape), True))

# shade Red River basin gray in inset

# label countries
plt.text(107.5, 25.5, 'China', fontsize=11, ha='center',va='center',color='k')
plt.text(102.5, 20.2, 'Laos', fontsize=11, ha='center', va='center', color='k')
plt.text(101.9, 15.5, 'Thailand', fontsize=11, ha='center', va='center', color='k')
plt.text(96.5, 21.0, 'Myanmar', fontsize=11, ha='center', va='center', color='k')

plt.annotate('Vietnam', xy=(108.0,14.0), xycoords='data', xytext=(5.0,20.0), textcoords='offset points', \
color='k',arrowprops=dict(arrowstyle='-'),fontsize=11)
plt.annotate('Cambodia', xy=(104.5,12.0), xycoords='data', xytext=(-60.0,-25.0), textcoords='offset points', \
color='k',arrowprops=dict(arrowstyle='-'),fontsize=11)

fig.savefig('RedRiverMap.png')
fig.clf()