# Fitting Hidden Markov Models Part II: Sample Python Script

This is the second part of a two-part blog series on fitting hidden Markov models (HMMs). In Part I, I explained what HMMs are, why we might want to use them to model hydro-climatological data, and the methods traditionally used to fit them. Here I will show how to apply these methods using the Python package hmmlearn using annual streamflows in the Colorado River basin at the Colorado/Utah state line (USGS gage 09163500). First, note that to use hmmlearn on a Windows machine, I had to install it on Cygwin as a Python 2.7 library.

For this example, we will assume the state each year is either wet or dry, and the distribution of annual streamflows under each state is modeled by a Gaussian distribution. More states can be considered, as well as other distributions, but we will use a two-state, Gaussian HMM here for simplicity. Since streamflow is strictly positive, it might make sense to first log-transform the annual flows at the state line so that the Gaussian models won’t generate negative streamflows, so that’s what we do here.

After installing hmmlearn, the first step is to load the Gaussian hidden Markov model class with from hmmlearn.hmm import GaussianHMM. The fit function of this class requires as inputs the number of states (n_components, here 2 for wet and dry), the number of iterations to run of the Baum-Welch algorithm described in Part I (n_iter; I chose 1000), and the time series to which the model is fit (here a column vector, Q, of the annual or log-transformed annual flows). You can also set initial parameter estimates before fitting the model and only state those which need to be initialized with the init_params argument. This is a string of characters where ‘s’ stands for startprob (the probability of being in each state at the start), ‘t’ for transmat (the probability transition matrix), ‘m’ for means (mean vector) and ‘c’ for covars (covariance matrix). As discussed in Part I it is good to test several different initial parameter estimates to prevent convergence to a local optimum. For simplicity, here I simply use default estimates, but this tutorial shows how to pass your own. I call the model I fit on line 5 model.

Among other attributes and methods, model will have associated with it the means (means_) and covariances (covars_) of the Gaussian distributions fit to each state, the state probability transition matrix (transmat_), the log-likelihood function of the model (score) and methods for simulating from the HMM (sample) and predicting the states of observed values with the Viterbi algorithm described in Part I (predict). The score attribute could be used to compare the performance of models fit with different initial parameter estimates.

It is important to note that which state (wet or dry) is assigned a 0 and which state is assigned a 1 is arbitrary and different assignments may be made with different runs of the algorithm. To avoid confusion, I choose to reorganize the vectors of means and variances and the transition probability matrix so that state 0 is always the dry state, and state 1 is always the wet state. This is done on lines 22-26 if the mean of state 0 is greater than the mean of state 1.


from hmmlearn.hmm import GaussianHMM

def fitHMM(Q, nSamples):
# fit Gaussian HMM to Q
model = GaussianHMM(n_components=2, n_iter=1000).fit(np.reshape(Q,[len(Q),1]))

# classify each observation as state 0 or 1
hidden_states = model.predict(np.reshape(Q,[len(Q),1]))

# find parameters of Gaussian HMM
mus = np.array(model.means_)
sigmas = np.array(np.sqrt(np.array([np.diag(model.covars_[0]),np.diag(model.covars_[1])])))
P = np.array(model.transmat_)

# find log-likelihood of Gaussian HMM
logProb = model.score(np.reshape(Q,[len(Q),1]))

# generate nSamples from Gaussian HMM
samples = model.sample(nSamples)

# re-organize mus, sigmas and P so that first row is lower mean (if not already)
if mus[0] > mus[1]:
mus = np.flipud(mus)
sigmas = np.flipud(sigmas)
P = np.fliplr(np.flipud(P))
hidden_states = 1 - hidden_states

return hidden_states, mus, sigmas, P, logProb, samples

# log transform the data and fit the HMM
logQ = np.log(AnnualQ)
hidden_states, mus, sigmas, P, logProb, samples = fitHMM(logQ, 100)



Okay great, we’ve fit an HMM! What does the model look like? Let’s plot the time series of hidden states. Since we made the lower mean always represented by state 0, we know that hidden_states == 0 corresponds to the dry state and hidden_states == 1 to the wet state.


from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np

def plotTimeSeries(Q, hidden_states, ylabel, filename):

sns.set()
fig = plt.figure()

xs = np.arange(len(Q))+1909
ax.plot(xs, Q, c='k')

ax.set_xlabel('Year')
ax.set_ylabel(ylabel)
handles, labels = plt.gca().get_legend_handles_labels()
fig.legend(handles, labels, loc='lower center', ncol=2, frameon=True)
fig.savefig(filename)
fig.clf()

return None

plt.switch_backend('agg') # turn off display when running with Cygwin
plotTimeSeries(logQ, hidden_states, 'log(Flow at State Line)', 'StateTseries_Log.png')



Wow, looks like there’s some persistence! What are the transition probabilities?


print(model.transmat_)



Running that we get the following:

[[ 0.6794469   0.3205531 ]
[ 0.34904974  0.65095026]]

When in a dry state, there is a 68% chance of transitioning to a dry state again in the next year, while in a wet state there is a 65% chance of transitioning to a wet state again in the next year.

What does the distribution of flows look like in the wet and dry states, and how do these compare with the overall distribution? Since the probability distribution of the wet and dry states are Gaussian in log-space, and each state has some probability of being observed, the overall probability distribution is a mixed, or weighted, Gaussian distribution in which the weight of each of the two Gaussian models is the unconditional probability of being in their respective state. These probabilities make up the stationary distribution, π, which is the vector solving the equation π = πP, where P is the probability transition matrix. As briefly mentioned in Part I, this can be found using the method described here: π = (1/ Σi[ei])e in which e is the eigenvector of P corresponding to an eigenvalue of 1, and ei is the ith element of e. The overall distribution for our observations is then Y ~ π0N(μ0,σ02) + π1*N(μ1,σ12). We plot this distribution and the component distributions on top of a histogram of the log-space annual flows below.


from scipy import stats as ss

def plotDistribution(Q, mus, sigmas, P, filename):

# calculate stationary distribution
eigenvals, eigenvecs = np.linalg.eig(np.transpose(P))
one_eigval = np.argmin(np.abs(eigenvals-1))
pi = eigenvecs[:,one_eigval] / np.sum(eigenvecs[:,one_eigval])

x_0 = np.linspace(mus[0]-4*sigmas[0], mus[0]+4*sigmas[0], 10000)
fx_0 = pi[0]*ss.norm.pdf(x_0,mus[0],sigmas[0])

x_1 = np.linspace(mus[1]-4*sigmas[1], mus[1]+4*sigmas[1], 10000)
fx_1 = pi[1]*ss.norm.pdf(x_1,mus[1],sigmas[1])

x = np.linspace(mus[0]-4*sigmas[0], mus[1]+4*sigmas[1], 10000)
fx = pi[0]*ss.norm.pdf(x,mus[0],sigmas[0]) + \
pi[1]*ss.norm.pdf(x,mus[1],sigmas[1])

sns.set()
fig = plt.figure()
ax.hist(Q, color='k', alpha=0.5, density=True)
l1, = ax.plot(x_0, fx_0, c='r', linewidth=2, label='Dry State Distn')
l2, = ax.plot(x_1, fx_1, c='b', linewidth=2, label='Wet State Distn')
l3, = ax.plot(x, fx, c='k', linewidth=2, label='Combined State Distn')

handles, labels = plt.gca().get_legend_handles_labels()
fig.legend(handles, labels, loc='lower center', ncol=3, frameon=True)
fig.savefig(filename)
fig.clf()

return None

plotDistribution(logQ, mus, sigmas, P, 'MixedGaussianFit_Log.png')



Looks like a pretty good fit – seems like a Gaussian HMM is a decent model of log-transformed annual flows in the Colorado River at the Colorado/Utah state line. Hopefully you can find relevant applications for your work too. If so, I’d recommend reading through this hmmlearn tutorial, from which I learned how to do everything I’ve shown here.

# Setting Up and Customizing Python Environments using Conda

Typing ‘python’ into your command line launches the default global Python environment (which you can change by changing your path) that includes every package you’ve likely installed since the dawn of man (or since you adopted your machine).

But what happens when you are working between Python 2.7 and Python 3.x due to collaboration, using Python 3.4 because the last time you updated your script was four years ago, collaborating with others and want to ensure reproducibility and compatible environments, or banging your head against the wall because that one Python library installation is throwing up errors (shakes fist at PIL/Pillow)?

Creating Python environments is a straightforward solution to save you headaches down the road.

Python environments are a topic that many of us have feared through the years due to ambiguous definitions filled with waving hands. An environment is simply the domain in which users run software or scripts. With this same train of thought, a python environment is the domain with all of the Python packages are installed where a user (you!) is executing a script (usually interfacing through an IDE or Terminal/Command Prompt).

However, different scripts will work or fail in different environments  avoid having to use all of these packages at once or having to completely reinstall Python, what we want to do is create new and independent Python environments. Applications of these environments include:

• Have multiple versions of Python (e.g. 2.7 and 3.4 and 3.6) installed on your machine at once that you can easily switch between
• Work with specific versions of packages and ensure they don’t update for the specific script you’re developing
• Allow for individuals to install the same, reproducible environment between workstations
• Create standardized environments for seamless collaboration
• Use older versions of packages to utilize outdated code

## Creating Your First Python Environment

One problem that recent arose in Ithaca was that someone was crunching towards deadlines and could only run PIL (Python Imaging Library) on their home machine and not their desktop on campus due to package installation issues. This individual had the following  packages they needed to install while using Python 2.7.5:

• PIL
• matplotlib
• numpy
• pandas
• statsmodels
• seaborn

To start, let’s first create an environment! To do this, we will be using Conda (install Anaconda for new users or MiniConda for anyone who doesn’t want their default Python environment to be jeopardized. If you want to avoid using Conda, feel free to explore Pipenv). As a quick note on syntax, I will be running everything in Windows 7 and every command I am using can be found on the Conda Cheatsheet. Only slight variations are required for MacOS/Linux.

First, with your Command Prompt open, type the following command to create the environment we will be working in:

conda create --name blog_pil_example python=2.7.5

At this point, a new environment titled blog_pil_example with Python 2.7.5 has been created. Congrats! Don’t forget to take screenshots to add to your new environment’s baby book (or just use the one above if it’s not your first environment).

From here, we need to activate the environment before interacting with it. To see which environments are available, use the following:

conda env list

Now, let’s go ahead and activate the environment that we want (blog_pil_example):

activate blog_pil_example

To leave the environment you’re in, simply use the following command:

deactivate

(For Linux and MaxOS, put ‘source ‘ prior to these commands)

We can see in the screenshot above that multiple other environments exist, but the selected/activated environment is shown in parentheses. Note that you’re still navigating through the same directories as before, you’re just selecting and running a different version of Python and installed packages when you’re using this environment.

### (Installing Packages)

Now onto the real meat and potatoes: installing the necessary packages. While you can use pip at this point, I’ve found Conda has run into fewer issues over the past year.  (Read into channel prioritization if you’re interested in where package files are being sourced from and how to change this.) As a quick back to basics, we’re going to install one of the desired packages, matplotlib, using Conda (or pip). Using these ensures that the proper versions of the packages for your environment (i.e. the Python version and operating system) are retrieved. At the same time, all dependent packages will also be installed (e.g. numpy). Use the following command when in the environment and confirm you want to install matplotlib:

conda install matplotlib

Note that you can specify a version much like how we specified the python version above for library compatibility issues:

conda install matplotlib=2.2.0

If you wish to remove matplotlib, use the following command:

conda remove matplotlib

If you wish to update a specific package, run:

conda update matplotlib

Or to update all packages:

conda update

Additionally, you can prevent specific packages from updating by creating a pinned file in the environment’s conda-meta directory. Be sure to do this prior to running the command to update all packages!

After installing all of the packages that were required at the start of this tutorial, let’s look into which packages are actually installed in this environment:

conda list

By only installing the required packages, Conda was kind and installed all of the dependencies at the same time. Now you have a Python environment that you’ve created from scratch and developed into a hopefully productive part of your workflow.

The simplest way to utilize your newly created python environment is simply run python directly in the Command Prompt above. You can run any script when this environment is activated (shown in the parentheses on the left of the command line) to utilize this setup!

If you want to use this environment in your IDE of choice, you can simply point the interpreter to this new environment. In PyCharm, you can easily create a new Conda Environment when creating a new project, or you can point the interpreter to a previously created environment (instructions here).

For a good ground-up and more in depth tutorial with visualizations on how Conda works (including directory structure, channel prioritization) that has been a major source of inspiration and knowledge for me, please check out this blog post by Gergely Szerovay.

If you’re looking for a great (and nearly exhaustive) source of Python Packages (both current and previous versions), check out Gohlke’s webpage. To install these packages, download the associated file for your system (32/64 bit and then your operating system) then use pip to install the file (in Command Prompt, navigate to the folder the .whl file is located in, then type ‘pip install ,file_name>’). I’ve found that installing packages this way sometimes allows me to step around errors I’ve encountered while using

You can also create environments for R. Check it out here.

If you understand most of the materials above, you can now claim to be environmentally conscious!

# Evaluating and visualizing sampling quality

State sampling is a necessary step for any computational experiment, and the way sampling is carried out will influence the experiment’s results. This is the case for instance, for sensitivity analysis (i.e., the analysis of model output sensitivity to values of the input variables). The popular method of Sobol’ (Sobol’, 2001) relies on tailor-made sampling techniques that have been perfected through time (e.g., Joe and Kuo, 2008; Saltelli et al., 2010). Likewise, the method of Morris (Morris, 1991), less computationally demanding than Sobol’s (Herman et al., 2013) and used for screening (i.e., understanding which are the inputs that most influence outputs), relies on specific sampling techniques (Morris, 1991; Campolongo et al., 2007).

But what makes a good sample, and how can we understand the strengths and weaknesses of the sampling techniques (and also of the associated sensitivity techniques we are using) through quick visualization of some associated metrics?

This post aims to answer this question. It will first look at what makes a good sample using some examples from a sampling technique called latin hypercube sampling. Then it will show some handy visualization tools for quickly testing and visualizing a sample.

# What makes a good sample?

Intuitively, the first criterion for a good sample is how well it covers the space from which to sample. The difficulty though, is how we define “how well” it practice, and the implications that has.

Let us take an example. A quick and popular way to generate a sample that covers the space fairly well is latin hypercube sampling (LHS; McKay et al., 1979). This algorithm relies on the following steps for drawing N samples from a hypercube-shaped of dimension p.:

1) Divide each dimension of the space in N equiprobabilistic bins. If we want uniform sampling, each bin will have the same length. Number bins from 1 to N each dimension.

2) Randomly draw points such that you have exactly one in each bin in each dimension.

For instance, for 6 points in 2 dimensions, this is a possible sample (points are selected randomly in each square labelled A to F):

It is easy to see that by definition, LHS has a good space coverage when projected on each individual axis. But space coverage in multiple dimensions all depends on the luck of the draw. Indeed, this is also a perfectly valid LHS configuration:

In the above configuration, it is easy to see that on top of poor space coverage, correlation between the sampled values along both axes is also a huge issue. For instance, if output values are hugely dependent on values of input 1, there will be large variations of the output values as values of input 2 change, regardless of the real impact of input 2 on the output.

Therefore, there are two kinds of issues to look at. One is correlation between sampled values of the input variables. We’ll look at it first because it is pretty straightforward. Then we’ll look at space coverage metrics, which are more numerous, do not look exactly at the same things, and can be sometimes conflicting. In fact, it is illuminating to see that sample quality metrics sometimes trade-off with one another, and several authors have turned to multi-objective optimization to come up with Pareto-optimal sample designs (e.g., Cioppa and Lucas, 2007; De Rainville et al., 2012).

One can look at authors such as Sheikholeslami and Razavi (2017) who summarize similar sets of variables. The goal there is not to write a summary of summaries but rather to give a sense that there is a relationship between which indicators of sampling quality matter, which sampling strategy to use, and what we want to do.

In what follows we note $x_{k,i}$ the kth  sampled value of input variable i, with $1\leq k \leq N$ and $1\leq i \leq p$.

### Correlation

Sample correlation is usually measured through the Pearson statistic. For inputs variables i and j among the p input variables, we note $x_{k,i}$ and $x_{k,j}$ the values of these variables i and j in sample k $(1\leq k \leq N)$ have:

$\rho_{ij} = \frac{\sum_{k=1}^N (x_{k,i}-\bar{x}_i)(x_{k,j}-\bar{x}_j)}{\sqrt{\sum_{k=1}^N (x_{k,i}-\bar{x}_i)^2 \sum_{k=1}^N (x_{k,j}-\bar{x}_j)^2}}$

In the above equation, ${\bar{x}_i}$  and ${\bar{x}_j}$ are the average sampled values of inputs i and j .

Then, the indicator of sample quality looks at the maximal level of correlation across all variables:

$\rho_{\max} = \max_{1\leq i \leq j \leq N} |\rho_{ij}|$

This definition relies on the remark that $\rho_{ij} = \rho_{ji}$.

### Space Coverage

There are different measures of space coverage.

We are best equipped to visualize space coverage via 1D or 2D projections of a sample. In 1D a measure of space coverage is by dividing each dimension in N equiprobable bins, and count the fraction of bins that have at least a point. Since N is the sample size, this measure is maximized when there is exactly one point in each bin — it is a measure that LHS maximizes.

Other measures of space coverage consider all dimension at once. A straightforward measure of space filling is the minimum Euclidean distance between two sampled points X in the generated ensemble:

$D = \min_{1\leq k \leq m \leq N} \left\{ d(\textbf{X}^k, \textbf{X}^m) \right\}$

Other indicators measure discrepancy which is a concept closely related to space coverage. In simple terms, a low discrepancy means that when we look at a subset of a sampled input space, its volume is roughly proportional to the number of points that are in it. In other words, there is no large subset with relatively few sampled points, and there is no small subset with a relatively large density of sampled points. A low discrepancy is desirable and in fact, Sobol’ sequences that form the basis of the Sobol’ sensitivity analysis method, are meant to minimize discrepancy.

# Sample visualization

The figures that follow can be easily reproduced by cloning a little repository SampleVis I put together, and by entering on the command line python readme.py &amp;&gt; output.txt. That Python routine can be used with both latin hypercube and Sobol’ sampling (using the SAlib sampling tool; SAlib is a Python library developed primarily by Jon Herman and Will Usher, and which is extensively discussed in this blog.)

In what follows I give examples using a random draw of latin hypercube sampling with 100 members and 7 sampled variables.

### Correlation

No luck, there is statistically significant pairwise correlation between in three pairs of variables: x1 and x4, x4 and x6, and x5 and x6. Using LHS, it can take some time to be lucky enough until the drawn sample is correlation-free (alternatively, methods to minimize correlations have been extensively researched over the years, though no “silver bullet” really emerges).

This means any inference that works for both variables in any of these pairs may be suspect. The SampleVis toolbox contains also tools to plot whether these correlations are positive or negative.

### Space coverage

The toolbox enables to plot several indicators of space coverage, assuming that the sampled space is the unit hypercube of dimension p (p=7 in this example). It computes discrepancy and minimal distance indicators. Ironically, my random LHS with 7 variables and 100 members has a better discrepancy (here I use an indicator called L2-star discrepancy) than a Sobol’ sequence with as many variables and members. The minimal Euclidean distance as well is better than for Sobol’ (0.330 vs. 0.348). This means that if for our experiment, space coverage is more important than correlation, the drawn LHS is pretty good.

To better grasp how well points cover the whole space, it is interesting to plot the distance of the point that is closest to each point, and to represent that in growing order:

This means that some points are not evenly spaced, and some are more isolated than others. When dealing with a limited number of variables, it can also be interesting to visualize 2D projections of the sample, like this one:

This again goes to show that the sample is pretty-well distributed in space. We can compare with the same diagram for a Sobol’ sampling with 100 members and 7 variables:

It is pretty clear that the deterministic nature of Sobol’ sampling, for so few points, leaves more systematic holes in the sampled space. Of course, this sample is too small for any serious Sobol’ sensitivity analysis, and holes are plugged by a larger sample. But again, this comparison is a visual heuristic that tells a similar story as the global coverage indicator: this LHS draw is pretty good when it comes to coverage.

# References

Campolongo, F., Cariboni, J. & Saltelli, A. (2007). An effective screening design for sensitivity analysis of large models. Environmental Modelling & Software, 22, 1509 – 1518.

Cioppa, T. M. & Lucas, T. W. (2007). Efficient Nearly Orthogonal and Space-Filling Latin Hypercubes. Technometrics, 49, 45-55.

De Rainville, F.-M., Gagné, C., Teytaud, O. & Laurendeau, D. (2012). Evolutionary Optimization of Low-discrepancy Sequences. ACM Trans. Model. Comput. Simul., ACM, 22, 9:1-9:25.

Herman, J. D., Kollat, J. B., Reed, P. M. & Wagener, T. (2013). Technical Note: Method of Morris effectively reduces the computational demands of global sensitivity analysis for distributed watershed models. Hydrology and Earth System Sciences, 17, 2893-2903.

Joe, S. & Kuo, F. (2008). Constructing Sobol Sequences with Better Two-Dimensional Projections. SIAM Journal on Scientific Computing, 30, 2635-2654.

McKay, M.D., Beckman R.J. & Conover, W.J. (1979).A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code. Technometrics, 21(2), 239-245.

Morris, M. D. (1991). Factorial Sampling Plans for Preliminary Computational Experiments. Technometrics, 33, 161-174.

Saltelli, A., Annoni, P., Azzini, I., Campolongo, F., Ratto, M. & Tarantola, S. (2010). Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index. Computer Physics Communications, 181, 259 – 270.

Sheikholeslami, R. & Razavi, S. (2017). Progressive Latin Hypercube Sampling: An efficient approach for robust sampling-based analysis of environmental models. Environmental Modelling & Software, 93, 109 – 126.

Sobol’, I. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and Computers in Simulation, 55, 271 – 280.

# Creating parallel axis plots with multiple datasets, color gradients, and brushing in Python

Parallel axis plots (here is a good description of what they are) are a relatively recent development in the plotting world, so it is no surprise that there is no implementations of it with more than basic functionalities in the major plotting packages available online. Over the past couple of days I then created my own implementation of parallel axis plots in Python using Matplotlib Pandas’ and Plot.ly’s implementation get cumbersome when the user tries to apply brushing and multiple color gradients  to create versatile, high-resolution and story-telling plots for my next papers and presentations. This implementation allows for:

• Plotting multiple datasets,
• Displaying dataset names,
• Choosing columns to be plot,
• Coloring each dataset based on a column and a different Matplotlib color map,
• Specifying ranges to be plotted,
• Inverting multiple axis,
• Brushing by intervales in multiple axis,
• Choosing different fonts for title and rest of the plot, and
• Export result as a figure file or viewing plot in Matplotlib’s interactive window.

The source code can be found here, and below is an example of how to use it:

import numpy as np
from plotting.parallel_axis import paxis_plot
from matplotlib.colors import LinearSegmentedColormap
from matplotlib import cm

bu_cy = LinearSegmentedColormap.from_list('BuCy', [(0, 0, 1), (0, 1, 1)])
bu_cy_r = bu_cy.reversed()

data1 = np.random.normal(size=(100, 8))
data2 = np.random.normal(size=(100, 8))
columns_to_plot = [0, 1, 3, 5, 7]
color_column = 0
axis_labels = ['axes ' + str(i) for i in range(8)]
dataset_names = ['Data set 1', 'Data set 2']
plot_ranges = [[-3.5, 3.5]] * 3 + [[-2.9, 3.1]] + [[-3.5, 3.5]] * 4
axis_to_invert = [1, 5]
brush_criteria = {1: [-10., 0.], 7: [10., 0.]}

paxis_plot((data1, data2),
columns_to_plot,
color_column,
[bu_cy_r, cm.get_cmap('autumn_r')],
axis_labels,
'Title Here',
dataset_names,
axis_ranges=plot_ranges,
fontname_title='Gill Sans MT',
fontname_body='CMU Bright',
file_name='test.png',
axis_to_invert=axis_to_invert,
brush_criteria=brush_criteria)


The output of this script should be a file named “test.png” that looks similar to the plot below:

# Logistic Regression for Scenario Discovery

As most of you probably know, scenario discovery is an exploratory modeling approach [Bankes, 1993] that involves stress-testing proposed policies over plausible future “states of the world” (SOWs) to discover conditions under which those policies would fail to meet performance goals [Bryant and Lempert, 2010]. The scenario discovery process is therefore an exercise in statistical classification. Two commonly used methods used for the scenario discovery process are the Patient Rule Induction Method (PRIM; Friedman and Fisher [1999]) and Classification and Regression Trees (CART; Breiman et al. [1984]), both of which are included in the OpenMORDM R package and Rhodium Python package.

Another commonly used method in classification that hasn’t been given much attention in the scenario discovery literature is logistic regression. Logistic regression models estimate the probability that an event is classified as a success (1) as opposed to a failure (0) as a function of different covariates. This allows for the definition of “safe operating spaces,” or factor combinations leading to success, based on the probability with which one would like to be able to achieve the specified performance goal(s). We may not know the probability that a particular SOW will occur, but through the logistic regression we can estimate the probability of success in that SOW should it occur. The logistic regression can also identify which factors most influence a policy’s ability to meet those performance goals.

This blog post will illustrate how to build logistic regression models in Python for scenario discovery using the Red River basin as an example. Here we are interested in determining under what streamflow and demand characteristics reservoir operating policies are unable to protect Hanoi from the 100-yr flood. We assume operators want to ensure protection to this event with at least 95% reliability and use logistic regression to estimate under what combination of streamflow and demand characteristics they will be able to do so.

The form of the logistic regression model is given by Equation 1, where pi represents the probability that performance in the ith SOW is classified as a success and Xi represents a vector of covariates (in this case, streamflow and demand characteristics) describing the ith SOW:

1) $\ln\Bigg(\frac{p_i}{1-p_i}\Bigg) = \mathbf{X_i^\intercal}\mathbf{\beta}$.

The coefficients, $\mathbf{\beta}$, on the covariates are estimated using Maximum Likelihood Estimation.

To determine which streamflow and demand characteristics are most important in explaining successes and failures, we can compare the McFadden’s pseudo-R2 values associated with different models that include different covariates. McFadden’s pseudo-R2, $R_{McFadden}^2$, is given by Equation 2:

2) $R_{McFadden}^2 = 1 - \frac{\ln \hat{L}(M_{Full})}{\ln \hat{L}(M_{Intercept})}$

where $\ln \hat{L}(M_{Full})$ is the log-likelihood of the full model and $\ln \hat{L}(M_{Intercept})$ is the log-likelihood of the intercept model, i.e. a model with no covariates beyond the intercept. The intercept model therefore predicts the mean probability of success across all SOWs. $R_{McFadden}^2$ is a measure of improvement of the full model over the intercept model.

A common approach to fitting regression models is to add covariates one-by-one based on which most increase R2 (or in this case, $R_{McFadden}^2$), stopping once the increase of an additional covariate is marginal. The covariate that by itself most increases $R_{McFadden}^2$ is therefore the most important in predicting a policy’s success. To do this in Python, we will use the library statsmodels.

Imagine we have a pandas dataframe, dta that includes n columns of streamflow and demand characteristics describing different SOWs (rows) and a final column of 0s and 1s representing whether or not the policy being evaluated can provide protection to the 100-yr flood in that SOW (0 for no and 1 for yes). Assume the column of 0s and 1s is the last column and it is labeled Success. We can find the value of $R_{McFadden}^2$ for each covariate individually by running the following code:

import pandas as pd
import statsmodels.api as sm
from scipy import stats

# deal with fact that calling result.summary() in statsmodels.api
# calls scipy.stats.chisqprob, which no longer exists
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)

def fitLogit(dta, predictors):
# concatenate intercept column of 1s
dta['Intercept'] = np.ones(np.shape(dta)[0])

# get columns of predictors
cols = dta.columns.tolist()[-1:] + predictors

#fit logistic regression
logit = sm.Logit(dta['Success'], dta[cols])
result = logit.fit()

return result

n = len(dta.columns) - 1
for i in range(n):
predictors = dta.columns.tolist()[i:(i+1)]
result = fitLogit(dta, predictors)
print(result.summary())



A sample output for one predictor, Col1 is shown below. This predictor has a pseudo-R2 of 0.1138.

Once the most informative predictor has been determined, additional models can be tested by adding more predictors one-by-one as described above. Suppose that through this process, one finds that the first 3 columns of dta (Col1,Col2 and Col3) are the most informative for predicting success on providing protection to the 100-yr flood, while the subsequent columns provide little additional predictive power. We can use this model to visualize the probability of success as a function of these 3 factors using a contour map. If we want to show this as a 2D projection, the probability of success can only be shown for combinations of 2 of these factors. In this case, we can hold the third factor constant at some value, say its base value. This is illustrated in the code below, which also shows a scatter plot of the SOWs. The dots are shaded light blue if the policy succeeds in providing protection to the 100-yr flood in that world, and dark red if it does not.


import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import pandas as pd
import statsmodels.api as sm

def fitLogit(dta, predictors):
# concatenate intercept column of 1s
dta['Intercept'] = np.ones(np.shape(dta)[0])

# get columns of predictors
cols = dta.columns.tolist()[-1:] + predictors

#fit logistic regression
logit = sm.Logit(dta['Success'], dta[cols])
result = logit.fit()

return result

def plotContourMap(ax, result, constant, dta, contour_cmap, dot_cmap, levels, xgrid, ygrid, \
xvar, yvar, base):

# find probability of success for x=xgrid, y=ygrid
X, Y = np.meshgrid(xgrid, ygrid)
x = X.flatten()
y = Y.flatten()
if constant == 'x3': # 3rd predictor held constant at base value
grid = np.column_stack([np.ones(len(x)),x,y,np.ones(len(x))*base[2]])
elif constant == 'x2': # 2nd predictor held constant at base value
grid = np.column_stack([np.ones(len(x)),x,np.ones(len(x))*base[1],y])
else: # 1st predictor held constant at base value
grid = np.column_stack([np.ones(len(x)),np.ones(len(x))*base[0],x,y])

z = result.predict(grid)
Z = np.reshape(z, np.shape(X))

contourset = ax.contourf(X, Y, Z, levels, cmap=contour_cmap)
ax.scatter(dta[xvar].values, dta[yvar].values, c=dta['Success'].values, edgecolor='none', cmap=dot_cmap)
ax.set_xlim(np.min(X),np.max(X))
ax.set_ylim(np.min(Y),np.max(Y))
ax.set_xlabel(xvar,fontsize=24)
ax.set_ylabel(yvar,fontsize=24)
ax.tick_params(axis='both',labelsize=18)

return contourset

# build logistic regression model with first 3 columns of predictors from dta
predictors = dta.columns.tolist()[0:3]
result = fitLogit(dta, predictors)

# define color map for dots representing SOWs in which the policy
# succeeds (light blue) and fails (dark red)
dot_cmap = mpl.colors.ListedColormap(np.array([[227,26,28],[166,206,227]])/255.0)

# define color map for probability contours
contour_cmap = mpl.cm.get_cmap(‘RdBu’)

# define probability contours
contour_levels = np.arange(0.0, 1.05,0.1)

# define grid of x (1st predictor), y (2nd predictor), and z (3rd predictor) dimensions
# to plot contour map over
xgrid = np.arange(-0.1,1.1,0.01)
ygrid = np.arange(-0.1,1.1,0.01)
zgrid = np.arange(-0.1,1.1,0.01)

# define base values of 3 predictors
base = [0.5, 0.5, 0.5]

fig = plt.figure()
# plot contour map when 3rd predictor ('x3') is held constant
plotContourMap(ax, result, 'x3', dta, contour_cmap, dot_cmap, contour_levels, xgrid, ygrid, \
'Col1', 'Col2', base)
# plot contour map when 2nd predictor ('x2') is held constant
contourset = plotContourMap(ax, result, 'x2', dta, contour_cmap, dot_cmap, contour_levels, xgrid, zgrid, \
'Col1', 'Col3', base)

cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7])
cbar = fig.colorbar(contourset, cax=cbar_ax)
cbar_ax.set_ylabel('Probability of Success',fontsize=20)
yticklabels = cbar.ax.get_yticklabels()
cbar.ax.set_yticklabels(yticklabels,fontsize=18)
fig.set_size_inches([14.5,8])
fig.savefig('Fig1.png')
fig.clf()



This produces the following figure:

We can also use the probability contours discovered above to define “safe operating spaces” as combinations of these 3 factors under which the evaluated policy is able to succeed in providing protection to the 100-yr flood with some reliability, say 95%. The hyperplane of factor combinations defining that 95% probability contour can be determined by setting p to 0.95 in Equation 2. Again, to plot 2-D projections of that hyperplane, the values of the other covariates can be held constant at their base values. The code below illustrates how to do this with a 95% boundary.


# define colormap for classifying boundary between failure and success
class_cmap = mpl.colors.ListedColormap(np.array([[251,154,153],[31,120,180]])/255.0)

# define probability cutoff between failure and success
class_levels = [0.0, 0.95, 1.0]

fig = plt.figure()
# plot contour map when 3rd predictor ('x3') is held constant
plotContourMap(ax, result, 'x3', dta, class_cmap, dot_cmap, class_levels, xgrid, ygrid, \
'Col1', 'Col2', base)

# plot contour map when 2nd predictor ('x2') is held constant
plotContourMap(ax, result, 'x2', dta, class_cmap, dot_cmap, class_levels, xgrid, zgrid, \
'Col1', 'Col3', base)

fig.set_size_inches([14.5,8])
fig.savefig('Fig2.png')
fig.clf()



This produces the following figure, where the light red region is the parameter ranges in which the policy cannot provide protection to the 100-yr flood with 95% reliability, and the dark blue region is the “safe operating space” in which it can.

All code for this example can be found here.

# Launching Jupyter Notebook Using an Icon/Shortcut in the Current Working Directory Folder

A petty annoyance I’ve encountered when wanting to open Jupyter Notebook (overview) is that I couldn’t find a way to instantly open it in my current Windows Explorer window. While one trick you can use to open the Command Prompt in this folder is by typing ‘cmd’ in the navigation bar above (shown below) and pressing Enter/Return, I wanted to create a shortcut or icon I could double-click in any given folder and have it open Jupyter Notebook in that same working directory.

This method allows you to drag-and-drop the icon you create into any folder and have it launch Jupyter Notebook from the new folder. It works for Windows 7, 8, and 10. Please feel free to let me know if you encounter any errors!

A great application for this shortcut may be to include this shortcut in GitHub folders where you wish to direct someone to launch Jupyter Notebook with minimal confusion. Just direct them to double-click on the icon and away they go!

# Creating Your Own Jupyter Notebook Shortcut

To begin, we must have already installed Jupyter Notebook or Jupyter Lab. Next, navigate to the folder we want to create your shortcut. Right-click, select ‘New’, then create a shortcut.

In the Create Shortcut Windows prompt, type the location of the item you want the Shortcut Icon to direct to. In this case, we are wanting direct this shortcut to the Command Prompt and have it run the command to open Jupyter Notebook. Copy/paste or type the following into the prompt:

cmd /k “jupyter notebook”

Note that cmd will change to the location of the Command Prompt executable file (e.g. C:\Windows\System32\cmd.exe), and ‘/k’ keeps the Command Prompt window open to ensure Jypyter Notebook does not crash. You can edit the command in the quotation marks to any command you would want, but in this case ‘jupyter notebook’ launches an instance of Jupyter Notebook.

You can then save this shortcut with whatever name you wish!

At this point, double-clicking the shortcut will open Jupyter Notebook in a static default directory (e.g. ‘C:\Windows\system32’). To fix this, we need to ensure that this shortcut instead directs to the current working directory (the location of the shortcut).

Next, we need to edit the location where the Command Prompt will run in. Right-click on your newly-created icon and select ‘Properties’ at the bottom of the menu to open the window shown on the left. One thing to note is that the ‘Target’ input is where we initially put in our ‘location’ prompt from above.

At this point, change the ‘Start in:’ input (e.g. ‘C:\Windows\system32’) to the following:

%cd%

By changing this input, instead of starting the Command Prompt in a static default directory, it instead starts the command prompt  in the current working directory for the shortcut.

At this point, you’re finished! You can drag and drop this icon to any new folder and have Jupyter Notebook start in that new folder.

If you wish to download a copy of the shortcut from Dropbox. Note that for security reasons, most browsers, hosting services, and email services will rename the file from ‘jupyter_notebook_shortcut.lnk’ to ‘jupyter_notebook_shortcut.downloads’.

Many thanks to users on superuser for helping develop this solution!

Please let me know if you have any questions, comments, or additional suggestions on applications for this shortcut!

# Policy Diagnostics with Time-Varying and State Space PDFs

Some of my work has focused on “policy diagnostics,” analyzing how policies (in this case, multi-reservoir operating policies) that favor different objectives perform under different conditions and why. This can guide analysts in choosing a policy to implement, or even in determining objectives that policies should be optimized to (cough, cough, see Quinn et al., 2017). One of the more effective ways we’ve found to analyze these policies is by examining their probabilistic behavior through time-varying PDFs and state-space PDFs. This blog post will illustrate these two types of figures and provide sample code for creating them. The code for the versions of these figures generated in the above paper can be found here.

Below is an example of how time-varying PDFs can provide insights into system behavior using the Red River basin as an example. These plots show the probability of the water level in Hanoi (y axis in both figures) being at different levels on different days of the year (x axis in both figures), from the beginning of the monsoon in May to the end of the dry season in April. Red shades represent high probabilities and blue shades represent low probabilities. The left plot shows these dynamics for a policy minimizing the 100-yr annual maximum water level, while the right plot shows them for a policy maximizing the 100-yr average hydropower production. The flood-minimizing policy has a lower probability of overtopping the dikes and crossing a stakeholder-elicited alarm level of 11.25 m (Second Alarm) compared to the hydropower-maximizing policy. However, this reduction in the probability of high floodwaters requires a higher probability of crossing a lower stakeholder-elicited alarm level of 6 m (First Alarm), highlighting a tradeoff between reducing severe floods and nuisance floods. There are also different dynamics during the dry season, where the flood-minimizing solution releases more to both meet agricultural demand at the time of planting and lower the reservoir level in advance of the next monsoon. There is a bifurcation in the high probability density streak during this time, suggesting how much needs to be released depends on what is needed to lower the reservoir level to an acceptable pre-flood season level or meet the agricultural demand.

To create this figure, we simply need an N x 365 matrix of the water level on each day (column) of N different annual simulations (rows). Let’s call this matrix ‘data’. We then need to reformat ‘data’ into a Y x 365 matrix, where Y is the number of “bins” along the y axis (between ymin and ymax) that we are going to group our data into to make a histogram for each day. Finally, we just need to count how many data points occur in each bin, and then divide this count by the total number of simulated years, N. This is shown using the function ‘getTimeVaryingProbs.py’ below assuming we have two datasets we want to plot, ‘data1’ and ‘data2’.

import numpy as np

def getTimeVaryingProbs(data, N, Y, ymin, ymax):
'''Finds the probability of being at a specific water level (y) on a given day.'''
probMatrix = np.zeros([Y,365])
step = (ymax-ymin)/Y
for i in range(np.shape(probMatrix)[0]):
for j in range(np.shape(probMatrix)[1]):
count = ((data[:,j] < ymax-step*i) & (data[:,j] >= ymax-step*(i+1))).sum()
probMatrix[i,j] = count/N

return probMatrix

probMatrix1 = getTimeVaryingProbs(data1, 100000, 366, 0, 15)
probMatrix2 = getTimeVaryingProbs(data2, 100000, 366, 0, 15)


After calling ‘getTimeVaryingProbs.py’ to generate ‘probMatrix1’ and ‘probMatrix2’, we can plot the time-varying PDF of each of these using ‘imshow’. Since we want to compare the two side-by-side, we need to make sure they’re normalized over the same range. We do this by finding the lowest and highest probabilities over the two matrices and normalizing our color map over that range:

import numpy as np
from matplotlib import pyplot as plt
import matplotlib as mpl

# find the lowest and highest probability between two probability matrices
probMin = min(np.min(probMatrix1), np.min(probMatrix2))
probMax = max(np.max(probMatrix1), np.max(probMatrix2))

fig = plt.figure()
sm = ax1.imshow(probMatrix1, cmap='RdYlBu', origin='upper', norm=mpl.colors.Normalize(vmin=probMin, vmax=ProbMax))
sm = ax2.imshow(probMatrix2, cmap='RdYlBu', origin='upper', norm=mpl.colors.Normalize(vmin=probMin, vmax=ProbMax))
cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7])
cbar = fig.colorbar(sm, cax=cbar_ax)
cbar.ax.set_ylabel('Probability Density',fontsize=16)
fig.show()


In some cases, it may be helpful to plot a log transformation of the probability matrices, as was done in the above paper since streamflows are highly skewed.

Below is an example of how state-space PDFs can provide insights into system behavior, again using the Red River basin as an example. These plots show the probability of the water level in Hanoi (y axis in both figures) being at different levels when the total storage in the reservoirs upstream is at different levels (x axis in both figures). Red shades again represent high probabilities and blue shades represent low probabilities. The left plot shows these dynamics for a compromise policy optimized to one set of objectives, while the right plot shows them for a compromise policy optimized to a different set of objectives. The compromise policy on the left fills up the reservoirs without releasing much water downstream, resulting in a high probability streak along the bottom of the plot at low water levels. This will favor hydropower production. However, when the largest reservoirs fill up, they are forced to spill, resulting in a spike in the water level downstream. This occurs before the smaller reservoirs have filled up, and in wet years, results in overtopping before total system storage has been reached. Consequently, this policy does not make full use of the total system storage for flood protection. The compromise policy on the right, however, increases the system storage and water level simultaneously, releasing some of what initially comes in to leave empty capacity for future flood events. This strategy makes better use of the full system capacity, only resulting in overtopping when maximum system storage has been reached. The difference in the behavior of these two compromise solutions highlights the need to test rival framings of objective functions for multi-objective optimization, as some formulations may suffer unintended consequences like the formulation on the left.

To create this figure, we need two N x 365 matrices, one of the water level on each day (column) of N different annual simulations (rows) and another of the total system storage. Let’s call these matrices ‘h’ and ‘s’, respectively. We then need to use these matrices to populate a Y x X probability matrix, where Y is the number of bins along the y axis (water level, h) between ymin and ymax, and X the number of bins along the x axis (storage, s) between xmin and xmax. This probability matrix will represent a 2D histogram of how many data points lie in a combined water level and storage bin.  We again just need to count how many data points occur in each bin, and then divide this count by the total number of simulated points (365N). This is shown using the function ‘getJointProbs.py’ below assuming we have two joint datasets, (h1,s1) and (h2,s2), that we want to plot.

def getJointProbs(h, s, Y, X, ymax, ymin, xmax, xmin):
'''Finds the probability of being at a specific water level (h) and storage (s) jointly'''
probMatrix = np.zeros([Y,X])
yStep = (ymax-ymin)/np.shape(probMatrix)[0]
xStep = (xmax-xmin)/np.shape(probMatrix)[1]
for i in range(np.shape(s)[0]):
for j in range(np.shape(s)[1]):
# figure out which "box" the simulated s and h are in
row = int(np.floor((ymax-h[i,j])/yStep))
col = int(np.ceil((s[i,j]-xmin)/xStep))
if row < np.shape(probMatrix)[0] and col < np.shape(probMatrix)[1]:
probMatrix[row,col] = probMatrix[row,col] + 1

# calculate probability of being in each box
probMatrix = probMatrix/(np.shape(s)[0]*np.shape(s)[1])

return probMatrix

probMatrix1 = getJointProbs(h1, s1, 100, 100, 15, 0, 3.0E10, 0.5E10)
probMatrix2 = getJointProbs(h2, s2, 100, 100, 15, 0, 3.0E10, 0.5E10)


After calling ‘getJointProbs.py’ to generate ‘probMatrix1’ and ‘probMatrix2’, we can again plot the state space PDF of each of these using ‘imshow’ as illustrated in the second snippet of code above. Now go analyze how your reservoirs are probabilistically operating as a system!