Logistic Regression for Scenario Discovery

As most of you probably know, scenario discovery is an exploratory modeling approach [Bankes, 1993] that involves stress-testing proposed policies over plausible future “states of the world” (SOWs) to discover conditions under which those policies would fail to meet performance goals [Bryant and Lempert, 2010]. The scenario discovery process is therefore an exercise in statistical classification. Two methods commonly used for scenario discovery are the Patient Rule Induction Method (PRIM; Friedman and Fisher [1999]) and Classification and Regression Trees (CART; Breiman et al. [1984]), both of which are included in the OpenMORDM R package and the Rhodium Python package.

Another commonly used method in classification that hasn’t been given much attention in the scenario discovery literature is logistic regression. Logistic regression models estimate the probability that an event is classified as a success (1) as opposed to a failure (0) as a function of different covariates. This allows for the definition of “safe operating spaces,” or factor combinations leading to success, based on the probability with which one would like to be able to achieve the specified performance goal(s). We may not know the probability that a particular SOW will occur, but through the logistic regression we can estimate the probability of success in that SOW should it occur. The logistic regression can also identify which factors most influence a policy’s ability to meet those performance goals.

This blog post will illustrate how to build logistic regression models in Python for scenario discovery using the Red River basin as an example. Here we are interested in determining under what streamflow and demand characteristics reservoir operating policies are unable to protect Hanoi from the 100-yr flood. We assume operators want to ensure protection to this event with at least 95% reliability and use logistic regression to estimate under what combination of streamflow and demand characteristics they will be able to do so.

The form of the logistic regression model is given by Equation 1, where pi represents the probability that performance in the ith SOW is classified as a success and Xi represents a vector of covariates (in this case, streamflow and demand characteristics) describing the ith SOW:

1) \ln\Bigg(\frac{p_i}{1-p_i}\Bigg) = \mathbf{X_i^\intercal}\mathbf{\beta}.

The coefficients, \mathbf{\beta}, on the covariates are estimated using Maximum Likelihood Estimation.
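
Inverting Equation 1 gives the predicted probability of success directly as p_i = \frac{\exp(\mathbf{X_i^\intercal}\mathbf{\beta})}{1+\exp(\mathbf{X_i^\intercal}\mathbf{\beta})}, which is the quantity statsmodels returns when you call predict() on a fitted model, as we do later in this post.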

To determine which streamflow and demand characteristics are most important in explaining successes and failures, we can compare the McFadden’s pseudo-R2 values associated with different models that include different covariates. McFadden’s pseudo-R2, R_{McFadden}^2, is given by Equation 2:

2) R_{McFadden}^2 = 1 - \frac{\ln \hat{L}(M_{Full})}{\ln \hat{L}(M_{Intercept})}

where \ln \hat{L}(M_{Full}) is the log-likelihood of the full model and \ln \hat{L}(M_{Intercept}) is the log-likelihood of the intercept model, i.e. a model with no covariates beyond the intercept. The intercept model therefore predicts the mean probability of success across all SOWs. R_{McFadden}^2 is a measure of improvement of the full model over the intercept model.

A common approach to fitting regression models is to add covariates one-by-one based on which most increases R2 (or in this case, R_{McFadden}^2), stopping once the improvement from adding another covariate is marginal. The covariate that by itself most increases R_{McFadden}^2 is therefore the most important in predicting a policy’s success. To do this in Python, we will use the library statsmodels.

Imagine we have a pandas dataframe, dta, that includes n columns of streamflow and demand characteristics describing different SOWs (rows) and a final column of 0s and 1s representing whether or not the policy being evaluated can provide protection to the 100-yr flood in that SOW (0 for no and 1 for yes). Assume this final column is labeled Success. We can find the value of R_{McFadden}^2 for each covariate individually by running the following code:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

# deal with fact that calling result.summary() in statsmodels.api
# calls scipy.stats.chisqprob, which no longer exists
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)

def fitLogit(dta, predictors):
    # concatenate intercept column of 1s
    dta['Intercept'] = np.ones(np.shape(dta)[0])

    # get columns of predictors
    cols = dta.columns.tolist()[-1:] + predictors

    #fit logistic regression
    logit = sm.Logit(dta['Success'], dta[cols])
    result = logit.fit()

    return result

dta = pd.read_csv('SampleData.txt')
n = len(dta.columns) - 1
for i in range(n):
    predictors = dta.columns.tolist()[i:(i+1)]
    result = fitLogit(dta, predictors)
    print(result.summary())

A sample output for one predictor, Col1, is shown below. This predictor has a pseudo-R2 of 0.1138.
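
If you would rather not read the pseudo-R2 off of each printed summary, statsmodels also stores McFadden’s pseudo-R2 on the fitted result as result.prsquared, so the covariates can be ranked programmatically. A minimal sketch continuing from the loop above:

# rank covariates by McFadden's pseudo-R2 when each is fit on its own
pseudoR2 = {}
for i in range(n):
    predictors = dta.columns.tolist()[i:(i+1)]
    result = fitLogit(dta, predictors)
    pseudoR2[predictors[0]] = result.prsquared

# print covariates from most to least informative
for name, value in sorted(pseudoR2.items(), key=lambda item: item[1], reverse=True):
    print(name, value)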

Once the most informative predictor has been determined, additional models can be tested by adding more predictors one-by-one as described above. Suppose that through this process, one finds that the first 3 columns of dta (Col1, Col2 and Col3) are the most informative for predicting success in providing protection to the 100-yr flood, while the subsequent columns provide little additional predictive power. We can use this model to visualize the probability of success as a function of these 3 factors using a contour map. If we want to show this as a 2D projection, the probability of success can only be shown for combinations of 2 of these factors. In this case, we can hold the third factor constant at some value, say its base value. This is illustrated in the code below, which also shows a scatter plot of the SOWs. The dots are shaded light blue if the policy succeeds in providing protection to the 100-yr flood in that world, and dark red if it does not.


import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import pandas as pd
import statsmodels.api as sm

def fitLogit(dta, predictors):
    # concatenate intercept column of 1s
    dta['Intercept'] = np.ones(np.shape(dta)[0])

    # get columns of predictors
    cols = dta.columns.tolist()[-1:] + predictors

    #fit logistic regression
    logit = sm.Logit(dta['Success'], dta[cols])
    result = logit.fit()
    
    return result

def plotContourMap(ax, result, constant, dta, contour_cmap, dot_cmap, levels, xgrid, ygrid, \
    xvar, yvar, base):

    # find probability of success for x=xgrid, y=ygrid
    X, Y = np.meshgrid(xgrid, ygrid)
    x = X.flatten()
    y = Y.flatten()
    if constant == 'x3': # 3rd predictor held constant at base value
        grid = np.column_stack([np.ones(len(x)),x,y,np.ones(len(x))*base[2]])
    elif constant == 'x2': # 2nd predictor held constant at base value
        grid = np.column_stack([np.ones(len(x)),x,np.ones(len(x))*base[1],y])
    else: # 1st predictor held constant at base value
        grid = np.column_stack([np.ones(len(x)),np.ones(len(x))*base[0],x,y])

    z = result.predict(grid)
    Z = np.reshape(z, np.shape(X))

    contourset = ax.contourf(X, Y, Z, levels, cmap=contour_cmap)
    ax.scatter(dta[xvar].values, dta[yvar].values, c=dta['Success'].values, edgecolor='none', cmap=dot_cmap)
    ax.set_xlim(np.min(X),np.max(X))
    ax.set_ylim(np.min(Y),np.max(Y))
    ax.set_xlabel(xvar,fontsize=24)
    ax.set_ylabel(yvar,fontsize=24)
    ax.tick_params(axis='both',labelsize=18)

    return contourset

# build logistic regression model with first 3 columns of predictors from dta
dta = pd.read_csv('SampleData.txt')
predictors = dta.columns.tolist()[0:3]
result = fitLogit(dta, predictors)

# define color map for dots representing SOWs in which the policy
# succeeds (light blue) and fails (dark red)
dot_cmap = mpl.colors.ListedColormap(np.array([[227,26,28],[166,206,227]])/255.0)

# define color map for probability contours
contour_cmap = mpl.cm.get_cmap('RdBu')

# define probability contours
contour_levels = np.arange(0.0, 1.05,0.1)

# define grid of x (1st predictor), y (2nd predictor), and z (3rd predictor) dimensions
# to plot contour map over
xgrid = np.arange(-0.1,1.1,0.01)
ygrid = np.arange(-0.1,1.1,0.01)
zgrid = np.arange(-0.1,1.1,0.01)

# define base values of 3 predictors
base = [0.5, 0.5, 0.5]

fig = plt.figure()
ax = fig.add_subplot(121)
# plot contour map when 3rd predictor ('x3') is held constant
plotContourMap(ax, result, 'x3', dta, contour_cmap, dot_cmap, contour_levels, xgrid, ygrid, \
    'Col1', 'Col2', base)
ax = fig.add_subplot(122)
# plot contour map when 2nd predictor ('x2') is held constant
contourset = plotContourMap(ax, result, 'x2', dta, contour_cmap, dot_cmap, contour_levels, xgrid, zgrid, \
    'Col1', 'Col3', base)

fig.subplots_adjust(wspace=0.3,hspace=0.3,right=0.8)
cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7])
cbar = fig.colorbar(contourset, cax=cbar_ax)
cbar_ax.set_ylabel('Probability of Success',fontsize=20)
yticklabels = cbar.ax.get_yticklabels()
cbar.ax.set_yticklabels(yticklabels,fontsize=18)
fig.set_size_inches([14.5,8])
fig.savefig('Fig1.png')
fig.clf()

This produces the following figure:

We can also use the probability contours discovered above to define “safe operating spaces” as combinations of these 3 factors under which the evaluated policy is able to succeed in providing protection to the 100-yr flood with some reliability, say 95%. The hyperplane of factor combinations defining that 95% probability contour can be determined by setting p to 0.95 in Equation 1. Again, to plot 2D projections of that hyperplane, the values of the other covariates can be held constant at their base values. The code below illustrates how to do this with a 95% boundary.


# define colormap for classifying boundary between failure and success
class_cmap = mpl.colors.ListedColormap(np.array([[251,154,153],[31,120,180]])/255.0)

# define probability cutoff between failure and success
class_levels = [0.0, 0.95, 1.0]

fig = plt.figure()
ax = fig.add_subplot(121)
# plot contour map when 3rd predictor ('x3') is held constant
plotContourMap(ax, result, 'x3', dta, class_cmap, dot_cmap, class_levels, xgrid, ygrid, \
    'Col1', 'Col2', base)

ax = fig.add_subplot(122)
# plot contour map when 2nd predictor ('x2') is held constant
plotContourMap(ax, result, 'x2', dta, class_cmap, dot_cmap, class_levels, xgrid, zgrid, \
    'Col1', 'Col3', base)

fig.set_size_inches([14.5,8])
fig.savefig('Fig2.png')
fig.clf()

This produces the following figure, where the light red region shows the parameter ranges in which the policy cannot provide protection to the 100-yr flood with 95% reliability, and the dark blue region shows the “safe operating space” in which it can.
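
As an aside, the 95% boundary plotted above can also be computed directly from the fitted coefficients: setting p = 0.95 in Equation 1 and holding Col3 at its base value gives a line in the (Col1, Col2) plane. A minimal sketch, assuming the Intercept, Col1, Col2, Col3 column order used in fitLogit above:

# solve Equation 1 for Col2 along the 95% probability contour,
# holding Col3 constant at its base value
beta = result.params # indexed by 'Intercept', 'Col1', 'Col2' and 'Col3'
logit_p = np.log(0.95/0.05)
col2_boundary = (logit_p - beta['Intercept'] - beta['Col1']*xgrid \
    - beta['Col3']*base[2]) / beta['Col2']
# col2_boundary could then be overlaid on the left panel with ax.plot(xgrid, col2_boundary)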

All code for this example can be found here.


Policy Diagnostics with Time-Varying and State Space PDFs

Some of my work has focused on “policy diagnostics,” analyzing how policies (in this case, multi-reservoir operating policies) that favor different objectives perform under different conditions and why. This can guide analysts in choosing a policy to implement, or even in determining objectives that policies should be optimized to (cough, cough, see Quinn et al., 2017). One of the more effective ways we’ve found to analyze these policies is by examining their probabilistic behavior through time-varying PDFs and state-space PDFs. This blog post will illustrate these two types of figures and provide sample code for creating them. The code for the versions of these figures generated in the above paper can be found here.

Below is an example of how time-varying PDFs can provide insights into system behavior using the Red River basin as an example. These plots show the probability of the water level in Hanoi (y axis in both figures) being at different levels on different days of the year (x axis in both figures), from the beginning of the monsoon in May to the end of the dry season in April. Red shades represent high probabilities and blue shades represent low probabilities. The left plot shows these dynamics for a policy minimizing the 100-yr annual maximum water level, while the right plot shows them for a policy maximizing the 100-yr average hydropower production. The flood-minimizing policy has a lower probability of overtopping the dikes and crossing a stakeholder-elicited alarm level of 11.25 m (Second Alarm) compared to the hydropower-maximizing policy. However, this reduction in the probability of high floodwaters requires a higher probability of crossing a lower stakeholder-elicited alarm level of 6 m (First Alarm), highlighting a tradeoff between reducing severe floods and nuisance floods. There are also different dynamics during the dry season, where the flood-minimizing solution releases more to both meet agricultural demand at the time of planting and lower the reservoir level in advance of the next monsoon. There is a bifurcation in the high probability density streak during this time, suggesting how much needs to be released depends on what is needed to lower the reservoir level to an acceptable pre-flood season level or meet the agricultural demand.

To create this figure, we simply need an N x 365 matrix of the water level on each day (column) of N different annual simulations (rows). Let’s call this matrix ‘data’. We then need to reformat ‘data’ into a Y x 365 matrix, where Y is the number of “bins” along the y axis (between ymin and ymax) that we are going to group our data into to make a histogram for each day. Finally, we just need to count how many data points occur in each bin, and then divide this count by the total number of simulated years, N. This is shown using the function getTimeVaryingProbs below, assuming we have two datasets we want to plot, ‘data1’ and ‘data2’.

import numpy as np

def getTimeVaryingProbs(data, N, Y, ymin, ymax):
    '''Finds the probability of being at a specific water level (y) on a given day.'''
    probMatrix = np.zeros([Y,365])
    step = (ymax-ymin)/float(Y) # float() guards against integer division under Python 2
    for i in range(np.shape(probMatrix)[0]):
        for j in range(np.shape(probMatrix)[1]):
            count = ((data[:,j] < ymax-step*i) & (data[:,j] >= ymax-step*(i+1))).sum()
            probMatrix[i,j] = count/float(N)

    return probMatrix

probMatrix1 = getTimeVaryingProbs(data1, 100000, 366, 0, 15)
probMatrix2 = getTimeVaryingProbs(data2, 100000, 366, 0, 15)

After calling getTimeVaryingProbs to generate ‘probMatrix1’ and ‘probMatrix2’, we can plot the time-varying PDF of each of these using ‘imshow’. Since we want to compare the two side-by-side, we need to make sure they’re normalized over the same range. We do this by finding the lowest and highest probabilities over the two matrices and normalizing our color map over that range:

import numpy as np
from matplotlib import pyplot as plt
import matplotlib as mpl

# find the lowest and highest probability between two probability matrices
probMin = min(np.min(probMatrix1), np.min(probMatrix2))
probMax = max(np.max(probMatrix1), np.max(probMatrix2))

fig = plt.figure()
ax1 = fig.add_subplot(121)
sm = ax1.imshow(probMatrix1, cmap='RdYlBu', origin='upper', norm=mpl.colors.Normalize(vmin=probMin, vmax=probMax))
ax2 = fig.add_subplot(122)
sm = ax2.imshow(probMatrix2, cmap='RdYlBu', origin='upper', norm=mpl.colors.Normalize(vmin=probMin, vmax=probMax))
fig.subplots_adjust(right=0.8)
cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7])
cbar = fig.colorbar(sm, cax=cbar_ax)
cbar.ax.set_ylabel('Probability Density',fontsize=16)
fig.show()

In some cases, it may be helpful to plot a log transformation of the probability matrices, as was done in the above paper since streamflows are highly skewed.
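
One way to do that (a minimal sketch, not the exact code from the paper) is to mask the zero-probability cells and pass a logarithmic norm to imshow:

import numpy as np
import matplotlib as mpl

# mask cells with zero probability so the log scale is well-defined
maskedProbs1 = np.ma.masked_equal(probMatrix1, 0.0)
maskedProbs2 = np.ma.masked_equal(probMatrix2, 0.0)

# use the same logarithmic color scale for both panels
norm = mpl.colors.LogNorm(vmin=min(maskedProbs1.min(), maskedProbs2.min()), \
    vmax=max(maskedProbs1.max(), maskedProbs2.max()))
sm = ax1.imshow(maskedProbs1, cmap='RdYlBu', origin='upper', norm=norm)
sm = ax2.imshow(maskedProbs2, cmap='RdYlBu', origin='upper', norm=norm)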

Below is an example of how state-space PDFs can provide insights into system behavior, again using the Red River basin as an example. These plots show the probability of the water level in Hanoi (y axis in both figures) being at different levels when the total storage in the reservoirs upstream is at different levels (x axis in both figures). Red shades again represent high probabilities and blue shades represent low probabilities. The left plot shows these dynamics for a compromise policy optimized to one set of objectives, while the right plot shows them for a compromise policy optimized to a different set of objectives. The compromise policy on the left fills up the reservoirs without releasing much water downstream, resulting in a high probability streak along the bottom of the plot at low water levels. This will favor hydropower production. However, when the largest reservoirs fill up, they are forced to spill, resulting in a spike in the water level downstream. This occurs before the smaller reservoirs have filled up, and in wet years, results in overtopping before total system storage has been reached. Consequently, this policy does not make full use of the total system storage for flood protection. The compromise policy on the right, however, increases the system storage and water level simultaneously, releasing some of what initially comes in to leave empty capacity for future flood events. This strategy makes better use of the full system capacity, only resulting in overtopping when maximum system storage has been reached. The difference in the behavior of these two compromise solutions highlights the need to test rival framings of objective functions for multi-objective optimization, as some formulations may suffer unintended consequences like the formulation on the left.

To create this figure, we need two N x 365 matrices, one of the water level on each day (column) of N different annual simulations (rows) and another of the total system storage. Let’s call these matrices ‘h’ and ‘s’, respectively. We then need to use these matrices to populate a Y x X probability matrix, where Y is the number of bins along the y axis (water level, h) between ymin and ymax, and X the number of bins along the x axis (storage, s) between xmin and xmax. This probability matrix will represent a 2D histogram of how many data points lie in a combined water level and storage bin. We again just need to count how many data points occur in each bin, and then divide this count by the total number of simulated points (365N). This is shown using the function getJointProbs below, assuming we have two joint datasets, (h1,s1) and (h2,s2), that we want to plot.

def getJointProbs(h, s, Y, X, ymax, ymin, xmax, xmin):
    '''Finds the probability of being at a specific water level (h) and storage (s) jointly'''
    probMatrix = np.zeros([Y,X])
    yStep = (ymax-ymin)/float(np.shape(probMatrix)[0]) # float() guards against integer division under Python 2
    xStep = (xmax-xmin)/float(np.shape(probMatrix)[1])
    for i in range(np.shape(s)[0]):
        for j in range(np.shape(s)[1]):
            # figure out which "box" the simulated s and h are in
            row = int(np.floor((ymax-h[i,j])/yStep))
            col = int(np.floor((s[i,j]-xmin)/xStep)) # floor (not ceil) so columns are binned consistently with the rows
            if row < np.shape(probMatrix)[0] and col < np.shape(probMatrix)[1]:
                probMatrix[row,col] = probMatrix[row,col] + 1
            
    # calculate probability of being in each box
    probMatrix = probMatrix/(np.shape(s)[0]*np.shape(s)[1])

    return probMatrix

probMatrix1 = getJointProbs(h1, s1, 100, 100, 15, 0, 3.0E10, 0.5E10)
probMatrix2 = getJointProbs(h2, s2, 100, 100, 15, 0, 3.0E10, 0.5E10)

After calling getJointProbs to generate ‘probMatrix1’ and ‘probMatrix2’, we can again plot the state space PDF of each of these using ‘imshow’ as illustrated in the second snippet of code above (a version with labeled axes is sketched below). Now go analyze how your reservoirs are probabilistically operating as a system!
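
For completeness, here is a minimal sketch of that plotting step with imshow’s extent argument added so the axes show storage and water level values rather than bin indices (the extent values below are just the bin bounds passed to getJointProbs above):

import numpy as np
from matplotlib import pyplot as plt
import matplotlib as mpl

# normalize both panels over the same probability range
probMin = min(np.min(probMatrix1), np.min(probMatrix2))
probMax = max(np.max(probMatrix1), np.max(probMatrix2))

fig = plt.figure()
ax1 = fig.add_subplot(121)
sm = ax1.imshow(probMatrix1, cmap='RdYlBu', origin='upper', aspect='auto', \
    extent=[0.5E10, 3.0E10, 0, 15], norm=mpl.colors.Normalize(vmin=probMin, vmax=probMax))
ax1.set_xlabel('Total System Storage')
ax1.set_ylabel('Water Level in Hanoi')

ax2 = fig.add_subplot(122)
sm = ax2.imshow(probMatrix2, cmap='RdYlBu', origin='upper', aspect='auto', \
    extent=[0.5E10, 3.0E10, 0, 15], norm=mpl.colors.Normalize(vmin=probMin, vmax=probMax))
ax2.set_xlabel('Total System Storage')
fig.show()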

Open Source Streamflow Generator Part II: Validation

This is the second part of a two-part blog post on an open source synthetic streamflow generator written by Matteo Giuliani, Jon Herman and me, which combines the methods of Kirsch et al. (2013) and Nowak et al. (2010) to generate correlated synthetic hydrologic variables at multiple sites. Part I showed how to use the MATLAB code in the subdirectory /stationary_generator to generate synthetic hydrology, while this post shows how to use the Python code in the subdirectory /validation to statistically validate the synthetic data.

The goal of any synthetic streamflow generator is to produce a time series of synthetic hydrologic variables that expand upon those in the historical record while reproducing their statistics. The /validation subdirectory of our repository provides Python plotting functions to visually and statistically compare the historical and synthetic hydrologic data. The function plotFDCrange.py plots the range of the flow duration (probability of exceedance) curves for each hydrologic variable across all historical and synthetic years. Lines 96-100 should be modified for your specific application. You may also have to modify line 60 to change the dimensions of the subplots in your figure. It’s currently set to plot a 2 x 2 figure for the four LSRB hydrologic variables.

plotFDCrange.py provides a visual, not statistical, analysis of the generator’s performance. An example plot from this function for the synthetic data generated for the Lower Susquehanna River Basin (LSRB) as described in Part I is shown below. These probability of exceedance curves were generated from 1000 years of synthetic hydrologic variables. Figure 1 indicates that the synthetic time series are successfully expanding upon the historical observations, as the synthetic hydrologic variables include more extreme high and low values. The synthetic hydrologic variables also appear unbiased, as this expansion is relatively equal in both directions. Finally, the synthetic probability of exceedance curves also follow the same shape as the historical, indicating that they reproduce the within-year distribution of daily values.

Figure 1

To more formally confirm that the synthetic hydrologic variables are unbiased and follow the same distribution as the historical, we can test whether or not the synthetic median and variance of real-space monthly values are statistically different from the historical using the function monthly-moments.py. This function is currently set up to perform this analysis for the flows at Marietta, but the site being plotted can be changed on line 76. The results of these tests for Marietta are shown in Figure 2. This figure was generated from a 100-member ensemble of synthetic series of length 100 years, and a bootstrapped ensemble of historical years of the same size and length. Panel a shows boxplots of the real-space historical and synthetic monthly flows, while panels b and c show boxplots of their means and standard deviations, respectively. Because the real-space flows are not normally distributed, the non-parametric Wilcoxon rank-sum test and Levene’s test were used to test whether or not the synthetic monthly medians and variances were statistically different from the historical. The p-values associated with these tests are shown in Figures 2d and 2e, respectively. None of the synthetic medians or variances are statistically different from the historical at a significance level of 0.05.

Figure 2
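
For those who want to run the same style of test on their own data, the scipy.stats versions of these tests look something like the following (a minimal sketch, not the code from monthly-moments.py; the file names and arrays are hypothetical stand-ins for historical and synthetic flows in a given month):

import numpy as np
from scipy import stats

# hypothetical arrays of real-space January flows: one value per historical
# (or bootstrapped) year and one per synthetic year
histMonthly = np.loadtxt('HistoricalJanFlows.txt')
synMonthly = np.loadtxt('SyntheticJanFlows.txt')

# Wilcoxon rank-sum test for a difference in medians
rankStat, rankP = stats.ranksums(histMonthly, synMonthly)

# Levene's test for a difference in variances
levStat, levP = stats.levene(histMonthly, synMonthly)

# no statistically significant difference at the 0.05 level if the p-values exceed 0.05
print(rankP, levP)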

In addition to verifying that the synthetic generator reproduces the first two moments of the historical monthly hydrologic variables, we can also verify that it reproduces both the historical autocorrelation and cross-site correlation at monthly and daily time steps using the functions autocorr.py and spatial-corr.py, respectively. The autocorrelation function is again set to perform the analysis on Marietta flows, but the site can be changed on line 39. The spatial correlation function performs the analysis for all site pairs, with site names listed on line 75.

The results of these analyses are shown in Figures 3 and 4, respectively. Figures 3a and 3b show the autocorrelation function of historical and synthetic real-space flows at Marietta for up to 12 lags of monthly flows (panel a) and 30 lags of daily flows (panel b). Also shown are 95% confidence intervals on the historical autocorrelations at each lag. The range of autocorrelations generated by the synthetic series expands upon that observed in the historical while remaining within the 95% confidence intervals for all months, suggesting that the historical monthly autocorrelation is well-preserved. On a daily time step, most simulated autocorrelations fall within the 95% confidence intervals for lags up to 10 days, and those falling outside do not represent significant biases.

Figure 3
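
For reference, the lag-k autocorrelations plotted in panels a and b can be computed with a few lines of numpy; a minimal sketch (flows is a hypothetical 1-D array of the concatenated monthly or daily series at one site):

import numpy as np

def autocorrelation(flows, maxLag):
    '''Returns the autocorrelation of a 1-D series at lags 0 through maxLag.'''
    rho = np.zeros(maxLag + 1)
    rho[0] = 1.0
    for k in range(1, maxLag + 1):
        rho[k] = np.corrcoef(flows[:-k], flows[k:])[0, 1]

    return rho

# e.g. 12 lags of monthly flows at one site
# rho = autocorrelation(monthlyFlows, 12)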

Figures 4a and 4b show boxplots of the cross-site correlation in monthly (panel a) and daily (panel b) real-space hydrologic variables for all pairwise combinations of sites. The synthetic generator greatly expands upon the range of cross-site correlations observed in the historical record, both above and below. Table 1 lists which sites are included in each numbered pair of Figure 4. Wilcoxon rank sum tests (panels c and d) for differences in median monthly and daily correlations indicate that pairwise correlations are statistically different (α=0.05) between the synthetic and historical series at a monthly time step for site pairs 1, 2, 5 and 6, and at a daily time step for site pairs 1 and 2. However, biases for these site pairs appear small in panels a and b. In summary, Figures 1-4 indicate that the streamflow generator is reasonably reproducing historical statistics, while also expanding on the observed record.

Figure 4

Table 1

Pair Number   Sites
1             Marietta and Muddy Run
2             Marietta and Lateral Inflows
3             Marietta and Evaporation
4             Muddy Run and Lateral Inflows
5             Muddy Run and Evaporation
6             Lateral Inflows and Evaporation

 

 

Open Source Streamflow Generator Part I: Synthetic Generation

This post describes how to use the Kirsch-Nowak synthetic streamflow generator to generate synthetic streamflow ensembles for water systems analysis. As Jon Lamontagne discussed in his introduction to synthetic streamflow generation, generating synthetic hydrology for water systems models allows us to stress-test alternative management plans under stochastic realizations outside of those observed in the historical record. These realizations may be generated assuming stationary or non-stationary models. In a few recent papers from our group applied to the Red River and Lower Susquehanna River Basins (Giuliani et al., 2017; Quinn et al., 2017; Zatarain Salazar et al., In Revision), we’ve generated stationary streamflow ensembles by combining methods from Kirsch et al. (2013) and Nowak et al. (2010). We use the method of Kirsch et al. (2013) to generate flows on a monthly time step and the method of Nowak et al. (2010) to disaggregate these monthly flows to a daily time step. The code for this streamflow generator, written by Matteo Giuliani, Jon Herman and me, is now available on Github. Here I’ll walk through how to use the MATLAB code in the subdirectory /stationary_generator to generate correlated synthetic hydrology at multiple sites, and in Part II I’ll show how to use the Python code in the subdirectory /validation to statistically validate the synthetic hydrology. As an example, I’ll use the Lower Susquehanna River Basin (LSRB).

A schematic of the LSRB, reproduced from Giuliani et al. (2014) is provided below. The system consists of two reservoirs: Conowingo and Muddy Run. For the system model, we generate synthetic hydrology upstream of the Conowingo Dam at the Marietta gauge (USGS station 01576000), as well as lateral inflows between Marietta and Conowingo, inflows to Muddy Run and evaporation rates over Conowingo and Muddy Run dams. The historical hydrology on which the synthetic hydrologic model is based consists of the historical record at the Marietta gauge from 1932-2001 and simulated flows and evaporation rates at all other sites over the same time frame generated by an OASIS system model. The historical data for the system can be found here.

The first step to use the synthetic generator is to format the historical data into an nD × nS matrix, where nD is the number of days of historical data with leap days removed and nS is the number of sites, or hydrologic variables. An example of how to format the Susquehanna data is provided in clean_data.m. Once the data has been reformatted, the synthetic generation can be performed by running script_example.m (with modifications for your application). Note that in the LSRB, the evaporation rates over the two reservoirs are identical, so we remove one of those columns from the historical data (line 37) for the synthetic generation. We also transform the historical evaporation with an exponential transformation (line 42) since the code assumes log-normally distributed hydrologic data, while evaporation in this region is more normally distributed. After the synthetic hydrology is generated, the synthetic evaporation rates are back-transformed with a log-transformation on line 60. While such modifications allow for additional hydrologic data beyond streamflows to be generated, for simplicity I will refer to all synthetic variables as “streamflows” for the remainder of this post. In addition to these modifications, you should also specify the number of realizations, nR, you would like to generate (line 52), the number of years, nY, to simulate in each realization (line 53) and a string with the dimensions nR × nY for naming the output file.

The actual synthetic generation is performed on line 58 of script_example.m which calls combined_generator.m. This function first generates monthly streamflows at all sites on line 10 where it calls monthly_main.m, which in turn calls monthly_gen.m to perform the monthly generation for the user-specified number of realizations. To understand the monthly generation, we denote the set of historical streamflows as \mathbf{Q_H}\in \mathbb{R}^{N_H\times T} and the set of synthetic streamflows as \mathbf{Q_S}\in \mathbb{R}^{N_S\times T}, where N_H and N_S are the number of years in the historical and synthetic records, respectively, and T is the number of time steps per year. Here T=12 for 12 months. For the synthetic generation, the streamflows in \mathbf{Q_H} are log-transformed to yield the matrix Y_{H_{i,j}}=\ln(Q_{H_{i,j}}), where i and j are the year and month of the historical record, respectively. The streamflows in \mathbf{Y_H} are then standardized to form the matrix \mathbf{Z_H}\in \mathbb{R}^{N_H\times T} according to equation 1:

1) Z_{H_{i,j}} = \frac{Y_{H_{i,j}}-\hat{\mu_j}}{\hat{\sigma_j}}

where \hat{\mu_j} and \hat{\sigma_j} are the sample mean and sample standard deviation of the j-th month’s log-transformed streamflows, respectively. These variables follow a standard normal distribution: Z_{H_{i,j}}\sim\mathcal{N}(0,1).

For each site, we generate standard normal synthetic streamflows that reproduce the statistics of \mathbf{Z_H} by first creating a matrix \mathbf{C}\in \mathbb{R}^{N_S\times T} of randomly sampled standard normal streamflows from \mathbf{Z_H}. This is done by formulating a random matrix \mathbf{M}\in \mathbb{R}^{N_S\times T} whose elements are independently sampled integers from (1,2,...,N_H). Each element of \mathbf{C} is then assigned the value C_{i,j}=Z_{H_{(M_{i,j}),j}}, i.e. the elements in each column of \mathbf{C} are randomly sampled standard normal streamflows from the same column (month) of \mathbf{Z_H}. In order to preserve the historical cross-site correlation, the same matrix \mathbf{M} is used to generate \mathbf{C} for each site.

Because of the random sampling used to populate \mathbf{C}, an additional step is needed to generate auto-correlated standard normal synthetic streamflows, \mathbf{Z_S}. Denoting the historical autocorrelation \mathbf{P_H}=corr(\mathbf{Z_H}), where corr(\mathbf{Z_H}) is the historical correlation between standardized streamflows in months i and j (columns of \mathbf{Z_H}), an upper right triangular matrix, \mathbf{U}, can be found using Cholesky decomposition (chol_corr.m) such that \mathbf{P_H}=\mathbf{U^\intercal U}. \mathbf{Z_S} is then generated as \mathbf{Z_S}=\mathbf{CU}. Finally, for each site, the auto-correlated synthetic standard normal streamflows \mathbf{Z_S} are converted back to log-space streamflows \mathbf{Y_S} according to Y_{S_{i,j}}=\hat{\mu_j}+Z_{S_{i,j}}\hat{\sigma_j}. These are then transformed back to real-space streamflows \mathbf{Q_S} according to Q_{S_{i,j}}=exp(Y_{S_{i,j}}).
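
To make the linear algebra concrete, here is a minimal numpy sketch of the monthly generation for a single site (ignoring the mid-year shift described next and the shared bootstrap matrix M used across sites; QH is a hypothetical N_H x 12 array of historical monthly flows):

import numpy as np

def kirsch_monthly(QH, NS, seed=None):
    '''Sketch of Kirsch et al. (2013) monthly generation for one site.'''
    rng = np.random.RandomState(seed)
    NH, T = QH.shape

    # log-transform and standardize each month (Equation 1)
    YH = np.log(QH)
    mu = YH.mean(axis=0)
    sigma = YH.std(axis=0, ddof=1)
    ZH = (YH - mu) / sigma

    # bootstrap years independently for each month (matrix M -> C)
    M = rng.randint(0, NH, size=(NS, T))
    C = ZH[M, np.arange(T)]

    # impose the historical month-to-month correlation via Cholesky decomposition
    PH = np.corrcoef(ZH, rowvar=False)  # T x T correlation matrix
    U = np.linalg.cholesky(PH).T        # upper triangular, PH = U'U
    ZS = C.dot(U)

    # de-standardize and back-transform to real space
    YS = mu + ZS * sigma
    return np.exp(YS)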

While this method reproduces the within-year log-space autocorrelation, it does not preserve year-to-year correlation, i.e. concatenating rows of \mathbf{Q_S} to yield a vector of length N_S\times T will yield discontinuities in the autocorrelation from month 12 of one year to month 1 of the next. To resolve this issue, Kirsch et al. (2013) repeat the method described above with a historical matrix \mathbf{Q_H'}\in \mathbb{R}^{(N_H-1)\times T}, where each row i of \mathbf{Q_H'} contains historical data from month 7 of year i to month 6 of year i+1, removing the first and last 6 months of streamflows from the historical record. \mathbf{U'} is then generated from \mathbf{Q_H'} in the same way as \mathbf{U} is generated from \mathbf{Q_H}, while \mathbf{C'} is generated from \mathbf{C} in the same way as \mathbf{Q_H'} is generated from \mathbf{Q_H}. As before, \mathbf{Z_S'} is then calculated as \mathbf{Z_S'}=\mathbf{C'U'}. Concatenating the last 6 columns of \mathbf{Z_S'} (months 1-6) beginning from row 1 and the last 6 columns of \mathbf{Z_S} (months 7-12) beginning from row 2 yields a set of synthetic standard normal streamflows that preserve correlation between the last month of the year and the first month of the following year. As before, these are then de-standardized and back-transformed to real space.

Once synthetic monthly flows have been generated, combined_generator.m then finds all historical total monthly flows to be used for disaggregation. When calculating these historical monthly totals, a window of +/- 7 days around the month being disaggregated is considered. That is, for January, combined_generator.m finds the total flow volumes in all consecutive 31-day periods within the window from 7 days before Jan 1st to 7 days after Jan 31st. For each month, all of the corresponding historical monthly totals are then passed to KNN_identification.m (line 76) along with the synthetic monthly total generated by monthly_main.m. KNN_identification.m identifies the k nearest historical monthly totals to the synthetic monthly total based on Euclidean distance (equation 2):

2) d = \left[\sum^{M}_{m=1} \left({\left(q_{S}\right)}_{m} - {\left(q_{H}\right)}_{m}\right)^2\right]^{1/2}

where {(q_S)}_m is the real-space synthetic monthly flow generated at site m and {(q_H)}_m is the real-space historical monthly flow at site m. The k-nearest neighbors are then sorted from i=1 for the closest to i=k for the furthest, and probabilistically selected for proportionally scaling streamflows in disaggregation. KNN_identification.m uses the Kernel estimator given by Lall and Sharma (1996) to assign the probability p_n of selecting neighbor n (equation 3):

3) p_{n} = \frac{\frac{1}{n}}{\sum^{k}_{i=1} \frac{1}{i}}

Following Lall and Sharma (1996) and Nowak et al. (2010), we use k=\Big \lfloor N_H^{1/2} \Big \rceil. After a neighbor is selected, the final step in disaggregation is to proportionally scale all of the historical daily streamflows at site m from the selected neighbor so that they sum to the synthetically generated monthly total at site m. For example, if the first day of the month of the selected historical neighbor represented 5% of that month’s historical flow, the first day of the month of the synthetic series would represent 5% of that month’s synthetically-generated flow. The random neighbor selection is performed by KNN_sampling.m (called on line 80 of combined_generator.m), which also calculates the proportion matrix used to rescale the daily values at each site on line 83 of combined_generator.m. Finally, script_example.m writes the output of the synthetic streamflow generation to files in the subdirectory /validation. Part II shows how to use the Python code in this directory to statistically validate the synthetically generated hydrology, meaning ensure that it preserves the historical monthly and daily statistics, such as the mean, standard deviation, autocorrelation and spatial correlation.
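
Before moving on, here is a minimal Python sketch of the neighbor selection step just described (the generator itself does this in MATLAB in KNN_identification.m and KNN_sampling.m; histTotals and synTotal are hypothetical arrays of historical and synthetic monthly totals across sites):

import numpy as np

def knn_select(histTotals, synTotal, seed=None):
    '''Probabilistically select one historical month to disaggregate from.'''
    rng = np.random.RandomState(seed)

    # Euclidean distance between the synthetic total and each historical monthly total (Equation 2)
    d = np.sqrt(np.sum((histTotals - synTotal)**2, axis=1))

    # k nearest neighbors, with k = sqrt(number of candidate historical months), rounded
    k = int(round(np.sqrt(histTotals.shape[0])))
    neighbors = np.argsort(d)[:k]

    # kernel weights of Lall and Sharma (1996): p_n proportional to 1/n (Equation 3)
    weights = 1.0/np.arange(1, k+1)
    weights = weights/weights.sum()

    # return the index of the selected historical month
    return rng.choice(neighbors, p=weights)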

Making Watershed Maps in Python

This post builds off of earlier posts by Jon Lamontagne and Jon Herman on making global maps in Python using matplotlib and basemap. However rather than making a global map, I’ll show how to zoom into a particular region, here the Red River basin in East Asia. To make these maps, you’ll need to have basemap installed (from github here, or using a Windows installer here).

The first step is to create a basemap. Both Jons used the ‘robin’ global projection to do this in their posts. Since I’m only interested in a particular region, I just specify the bounding box using the lower and upper latitudes and longitudes of the region I’d like to plot. As Jon H points out, you can also specify the resolution (‘f’ = full, ‘h’ =high, ‘i’ = intermediate, ‘l’ = low, ‘c’ = crude), and you can even use different ArcGIS images for the background (see here). I use ‘World_Shaded_Relief’. It’s also possible to add a lot of features such as rivers, countries, coastlines, counties, etc. I plot countries and rivers. The argument ‘zorder’ specifies the order of the layering from 1 to n, where 1 is the bottom layer and n the top.


from mpl_toolkits.basemap import Basemap
from matplotlib import pyplot as plt

fig = plt.figure()
fig.set_size_inches([17.05,8.15])
ax = fig.add_subplot(111)

# plot basemap, rivers and countries
m = Basemap(llcrnrlat=19.5, urcrnrlat=26.0, llcrnrlon=99.6, urcrnrlon=107.5, resolution='h')
m.arcgisimage(service='World_Shaded_Relief')
m.drawrivers(color='dodgerblue',linewidth=1.0,zorder=1)
m.drawcountries(color='k',linewidth=1.25)

The above code makes the following image (it takes some time, since I’m using high resolution):

Now let’s add a shaded outline of the Red River basin. To do this, you need a shapefile of the basin. The FAO provides a shapefile of major watersheds in the world, from which you can extract the watershed you’re interested in using ArcGIS (see instructions here). In this shapefile, the Red River is labeled by its name in Vietnamese, ‘Song Hong.’ I chose not to draw the bounds of the basin in my map because it would be too busy with the country borders. Instead, I shaded the region gray (facecolor=’0.33′) with a slightly darker border (edgecolor=’0.5′) and slight transparency (alpha=0.5). To do that, I had to collect all of the patches associated with the shapefile (which I called ‘Basin’ when reading it in) that needed to be shaded.


import numpy as np
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection

# plot Red River basin
m.readshapefile('RedRiverBasin_WGS1984', 'Basin', drawbounds=False)
patches = []
for info, shape in zip(m.Basin_info, m.Basin):
    if info['OBJECTID'] == 1: # attribute in attribute table of shapefile
        patches.append(Polygon(np.array(shape), True))

ax.add_collection(PatchCollection(patches, facecolor='0.33', edgecolor='0.5', alpha=0.5))

This creates the following image:

Now let’s add the locations of major dams and cities in the basin using ‘scatter‘. You could again do this by adding a shapefile, but I’m just going to add their locations manually, either by uploading their latitude and longitude coordinates from a .csv file or by passing them directly.


import numpy as np

# plot dams
damsLatLong = np.loadtxt('DamLocations.csv', delimiter=',', skiprows=1, usecols=[1,2])
x, y = m(damsLatLong[:,1], damsLatLong[:,0]) # m(longitude, latitude)
m.scatter(x, y, c='k', s = 150, marker = '^')

# plot Hanoi
x, y = m(105.8342, 21.0278)
m.scatter(x, y, facecolor='darkred', edgecolor='darkred', s=150)

This makes the following image:

If we want to label the dams and cities, we can add text specifying where on the map we’d like them to be located. This may require some guess-and-check work to determine the best place (comment if you know a better way!). I temporarily added gridlines to the map to aid in this process using ‘drawparallels‘ and ‘drawmeridians‘.
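
For example, temporary gridlines at 0.5-degree spacing can be added (and removed once the labels are placed) with something like:

# temporary gridlines to help position text labels; remove before saving the final figure
m.drawparallels(np.arange(19.5, 26.5, 0.5), labels=[1,0,0,0], fontsize=10)
m.drawmeridians(np.arange(99.5, 108.0, 0.5), labels=[0,0,0,1], fontsize=10)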


# label dams and Hanoi
plt.text(104.8, 21.0, 'Hoa Binh', fontsize=18, ha='center', va='center', color='k')
plt.text(104.0, 21.7, 'Son La', fontsize=18, ha='center', va='center', color='k')
plt.text(105.0, 21.95, 'Thac Ba', fontsize=18, ha='center', va='center', color='k')
plt.text(105.4, 22.55, 'Tuyen Quang', fontsize=18, ha='center', va='center', color='k')
plt.text(105.8, 21.2, 'Hanoi', fontsize=18, ha='center', va='center', color='k')

Now our map looks like this:

That looks nice, but it would be helpful to add some context as to where in the world the Red River basin is located. To illustrate this, we can create an inset of the greater geographical area by adding another set of axes with its own basemap. This one can be at a lower resolution.


from mpl_toolkits.axes_grid1.inset_locator import zoomed_inset_axes

# plot inset of greater geographic area
axins = zoomed_inset_axes(ax, 0.1, loc=1) # inset at 10% of parent size in the upper right corner (loc=1)
axins.set_xlim(90, 115) # longitude boundaries of inset map
axins.set_ylim(8, 28) # latitude boundaries of inset map

# remove tick marks from inset axes
plt.xticks(visible=False)
plt.yticks(visible=False)

# add basemap to inset map
m2 = Basemap(llcrnrlat=8.0, urcrnrlat=28.0, llcrnrlon=90.0, urcrnrlon=115.0, resolution='l', ax=axins)
m2.arcgisimage(service='World_Shaded_Relief')
m2.drawcountries(color='k', linewidth=0.5)

This image looks like this:

Now let’s highlight a country of interest (Vietnam) in green and also add the Red River basin in light gray again.


# plot Vietnam green in inset
m2.readshapefile('VN_borders_only_WGS1984', 'Vietnam', drawbounds=False)
patches2 = []
for info, shape in zip(m2.Vietnam_info, m2.Vietnam):
    if info['Joiner'] == 1:
        patches2.append(Polygon(np.array(shape), True))

axins.add_collection(PatchCollection(patches2, facecolor='forestgreen', edgecolor='0.5', alpha=0.5))

# shade Red River basin gray in inset
axins.add_collection(PatchCollection(patches, facecolor='0.33', edgecolor='0.5', alpha=0.5))

Now our map looks like this:

Finally, let’s label the countries in the inset. Some of the countries are too small to fit their name inside, so we’ll have to create arrows pointing to them using ‘annotate‘. In this function, ‘xy’ specifies where the arrow points to and ‘xytext’ where the text is written relative to where the arrow points.


# label countries
plt.text(107.5, 25.5, 'China', fontsize=11, ha='center', va='center', color='k')
plt.text(102.5, 20.2, 'Laos', fontsize=11, ha='center', va='center', color='k')
plt.text(101.9, 15.5, 'Thailand', fontsize=11, ha='center', va='center', color='k')
plt.text(96.5, 21.0, 'Myanmar', fontsize=11, ha='center', va='center', color='k')

# add arrows to label Vietnam and Cambodia 
plt.annotate('Vietnam', xy=(108.0, 14.0), xycoords='data', xytext=(5.0, 20.0), textcoords='offset points', \ 
    color='k', arrowprops=dict(arrowstyle='-'), fontsize=11)
plt.annotate('Cambodia', xy=(104.5, 12.0), xycoords='data', xytext=(-60.0, -25.0), textcoords='offset points', \ 
    color='k', arrowprops=dict(arrowstyle='-'), fontsize=11)

Now our map looks like this:

I think that’s pretty good, so let’s save it ;). See below for all the code used to make this map, with all the import statements at the beginning rather than sporadically inserted throughout the code!

If you’re looking for any other tips on how to make different types of maps using basemap, I recommend browsing through the basemap toolkit documentation and this basemap tutorial, where I learned how to do most of what I showed here.


from mpl_toolkits.basemap import Basemap
from mpl_toolkits.axes_grid1.inset_locator import zoomed_inset_axes
from matplotlib import pyplot as plt
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
import numpy as np

# set-up Vietnam basemap
fig = plt.figure()
fig.set_size_inches([17.05, 8.15])
ax = fig.add_subplot(111)

# plot basemap, rivers and countries
m = Basemap(llcrnrlat=19.5,urcrnrlat=26.0,llcrnrlon=99.6,urcrnrlon=107.5,resolution='h')
m.arcgisimage(service='World_Shaded_Relief')
m.drawrivers(color='dodgerblue',linewidth=1.0,zorder=1)
m.drawcountries(color='k',linewidth=1.25)

# plot Red River basin
m.readshapefile('RedRiverBasin_WGS1984','Basin',drawbounds=False)
patches = []
for info, shape in zip(m.Basin_info, m.Basin):
    if info['OBJECTID'] == 1:
        patches.append(Polygon(np.array(shape), True))

ax.add_collection(PatchCollection(patches, facecolor='0.33',edgecolor='0.5',alpha=0.5))

# plot dams
damsLatLong = np.loadtxt('DamLocations.csv',delimiter=',',skiprows=1,usecols=[1,2])
x, y = m(damsLatLong[:,1], damsLatLong[:,0])
m.scatter(x, y, c='k', s=150, marker='^')

# plot Hanoi
x, y = m(105.8342, 21.0278)
m.scatter(x, y, facecolor='darkred', edgecolor='darkred', s=150)

# label reservoirs and Hanoi
plt.text(104.8, 21.0, 'Hoa Binh', fontsize=18, ha='center',va='center',color='k')
plt.text(104.0, 21.7, 'Son La', fontsize=18, ha='center', va='center', color='k')
plt.text(105.0, 21.95, 'Thac Ba', fontsize=18, ha='center', va='center', color='k')
plt.text(105.4, 22.55, 'Tuyen Quang', fontsize=18, ha='center', va='center', color='k')
plt.text(105.8, 21.2, 'Hanoi', fontsize=18, ha='center', va='center', color='k')

# plot inset of greater geographic area
axins = zoomed_inset_axes(ax, 0.1, loc=1)
axins.set_xlim(90, 115)
axins.set_ylim(8,28)

plt.xticks(visible=False)
plt.yticks(visible=False)

m2 = Basemap(llcrnrlat=8.0,urcrnrlat=28.0,llcrnrlon=90.0,urcrnrlon=115.0,resolution='l',ax=axins)
m2.arcgisimage(service='World_Shaded_Relief')
m2.drawcountries(color='k',linewidth=0.5)

# plot Vietnam green in inset
m2.readshapefile('VN_borders_only_WGS1984','Vietnam',drawbounds=False)
patches2 = []
for info, shape in zip(m2.Vietnam_info, m2.Vietnam):
    if info['Joiner'] == 1:
        patches2.append(Polygon(np.array(shape), True))

axins.add_collection(PatchCollection(patches2, facecolor='forestgreen',edgecolor='0.5',alpha=0.5))

# shade Red River basin gray in inset
axins.add_collection(PatchCollection(patches, facecolor='0.33',edgecolor='0.5',alpha=0.5))

# label countries
plt.text(107.5, 25.5, 'China', fontsize=11, ha='center',va='center',color='k')
plt.text(102.5, 20.2, 'Laos', fontsize=11, ha='center', va='center', color='k')
plt.text(101.9, 15.5, 'Thailand', fontsize=11, ha='center', va='center', color='k')
plt.text(96.5, 21.0, 'Myanmar', fontsize=11, ha='center', va='center', color='k')

plt.annotate('Vietnam', xy=(108.0,14.0), xycoords='data', xytext=(5.0,20.0), textcoords='offset points', \
    color='k',arrowprops=dict(arrowstyle='-'),fontsize=11)
plt.annotate('Cambodia', xy=(104.5,12.0), xycoords='data', xytext=(-60.0,-25.0), textcoords='offset points', \
    color='k',arrowprops=dict(arrowstyle='-'),fontsize=11)

fig.savefig('RedRiverMap.png')
fig.clf()

Using Rhodium for RDM Analysis of External Dataset

In my last blog post, I showed how to run an MORDM experiment using Rhodium. This process included the multi-objective optimization to an assumed state of the world (SOW) as well as the re-evaluation of the Pareto-approximate solutions on alternative SOWs, before using sensitivity and classification tools such as PRIM and CART for the scenario discovery analysis. However, there may be cases where you have to run the optimization and re-evaluation outside of Rhodium, for instance if your model is written in a programming language other than Python. There are two ways you can do this while still using Rhodium for the scenario discovery. The first option is to run the model through the Executioner. Another option is to run the model separately and import the output into the same format as is generated by Rhodium for post-analysis. I will explain the second method here for the fish game described in my last post.

The first step is to read the decision variables and objectives from the optimization into 2D arrays. Then the uncertainties, levers and responses can be defined as before, except they no longer have to be associated with an object of the class ‘Model‘.


import numpy as np
from rhodium import *

# read in output of optimization
variables = np.loadtxt('FishGame/FishGame.resultfile',usecols=[0,1])
objectives = np.loadtxt('FishGame/FishGame.resultfile',usecols=[2,3])

# make maximization objectives positive
maxIndices = [0]
objectives[:,maxIndices] = -objectives[:,maxIndices]

# define X of XLRM framework
uncertainties = [UniformUncertainty("a", 1.5, 4.0),
    UniformUncertainty("b0", 0.25, 0.67)]

# define L of XLRM framework
levers = [RealLever("vars", 0.0, 1.0, length=2)]

# define R of XLRM framework
responses = [Response("NPVharvest", Response.MAXIMIZE),
    Response("std_z", Response.MINIMIZE)]

Note: If you are interested in using Rhodium’s plotting tools to visualize the results of the optimization, you can still make the uncertainties, levers and responses attributes of a model object. However, you will have to create a model function to instantiate the model. This is sloppy, but you can fake this by just creating a function that takes in the decision variables and model parameters, and returns the objective values, but doesn’t actually perform any calculations.


def fishGame(vars,
    a = 1.75, # rate of prey growth
    b0 = 0.6, # initial rate of predator growth
    F = 0, # rate of change of radiative forcing per unit time
    S = 0.5): # climate sensitivity

    NPVharvest = None
    std_z = None

    return (NPVharvest, std_z)

model = Model(fishGame)

# define all parameters to the model that we will be studying
model.parameters = [Parameter("vars"),
    Parameter("a"),
    Parameter("b0"),
    Parameter("F"),
    Parameter("S")]

If using Rhodium for the optimization, this function would actually perform the desired calculation and Platypus could be used for the optimization. Since we have already performed the optimization, we just need to reformat the output of the optimization into that used by Rhodium for the RDM analysis. This can be done by mimicking the output structure that would be returned by the function ‘optimize‘.


import json
from collections import OrderedDict

# find number of solutions
nsolns = np.shape(objectives)[0]

# properly format output of optimization
output = DataSet()
for i in range(nsolns):
    env = OrderedDict()
    offset = 0

    for lever in levers:
        if lever.length == 1:
            env[lever.name] = list(variables[i,:])
        else:
            env[lever.name] = list(variables[i,offset:offset+lever.length])
            
        offset += lever.length

    for j, response in enumerate(responses):
        env[response.name] = objectives[i,j]

    output.append(env)

# write output to file
with open("FishGame/FishGame_data.txt","w") as f:
    json.dump(output, f)

Next we need to read in the uncertain parameters that were sampled for the re-evaluation and format the results of the re-evaluation into the same format as would be output by calling ‘evaluate‘ within Rhodium. Below is an example with the first solution (soln_index=0).

# read in LH samples of uncertain parameters and determine # of samples
LHsamples = np.loadtxt('FishGame/LHsamples.txt')
nsamples = np.shape(LHsamples)[0]

# load policy from optimization
soln_index = 0
policy = output[soln_index]

# load its objective values from re-evaluation and make maximization objectives positive
objectives = np.loadtxt('FishGame/MORDMreeval/FishGame_Soln' + str(soln_index+1) + '.obj')
objectives[:,maxIndices] = -objectives[:,maxIndices]

# convert re-evaluation output to proper format
results = DataSet()
for j in range(nsamples):
    env = OrderedDict()
    offset = 0

    for k, uncertainty in enumerate(uncertainties):
        env[uncertainty.name] = LHsamples[j,k]

    for k, response in enumerate(responses):
        env[response.name] = objectives[j,k]

    for lever in levers:
        if lever.length == 1:
            env[lever.name] = list(variables[soln_index,:])
        else:
            env[lever.name] = list(variables[soln_index,offset:offset+lever.length])

        offset += lever.length

    results.append(env)

# write results to file
with open("FishGame/FishGame_Soln" + str(soln_index+1) + "_reeval.txt","w") as f:
    json.dump(results, f)

Finally, you have to define the metrics.


# calculate M of XLRM framework
metric = ["Profitable" if v["NPVharvest"] >= 3.0 else "Unprofitable" for v in results]

Then you can run PRIM and CART.  This requires defining the names, or ‘keys’, of the uncertain parameters. If you created a fake model object, you can pass ‘include=model.uncertainties.keys()’ to the functions Prim() and Cart(). If not, you have to create your own list of ‘keys’ as I do below.


keys = []
for i in range(len(uncertainties)):
    keys.append(uncertainties[i].name)

# run PRIM and CART on metrics
p = Prim(results, metric, include=keys, coi="Profitable")
box = p.find_box()
box.show_details()
plt.show()

c = Cart(results, metric, include=keys)
c.print_tree(coi="Profitable")
c.show_tree()
plt.show()

The above code creates the following two figures.


If you had run the analysis using Sobol samples, you could use the SALib wrapper to calculate sensitivity indices and make bar charts or radial convergence plots of the results. (Note: My previous post did not show how to make these plots, but has since been updated. Check it out here.)


import numpy as np
import seaborn as sns
from rhodium import *
from SALib.analyze import sobol
from SALib.util import read_param_file

def _S2_to_dict(matrix, problem):
    result = {}
    names = list(problem["names"])
    for i in range(problem["num_vars"]):
        for j in range(i+1, problem["num_vars"]):
            if names[i] not in result:
                result[names[i]] = {}
            if names[j] not in result:
                result[names[j]] = {}

            result[names[i]][names[j]] = result[names[j]][names[i]] = float(matrix[i][j])

    return result

def get_pretty_result(result):
    pretty_result = SAResult(result["names"] if "names" in result else problem["names"])

    if "S1" in result:
        pretty_result["S1"] = {k : float(v) for k, v in zip(problem["names"], result["S1"])}
    if "S1_conf" in result:
        pretty_result["S1_conf"] = {k : float(v) for k, v in zip(problem["names"], result["S1_conf"])}
    if "ST" in result:
        pretty_result["ST"] = {k : float(v) for k, v in zip(problem["names"], result["ST"])}
    if "ST_conf" in result:
        pretty_result["ST_conf"] = {k : float(v) for k, v in zip(problem["names"], result["ST_conf"])}
    if "S2" in result:
        pretty_result["S2"] = _S2_to_dict(result["S2"], problem)
    if "S2_conf" in result:
        pretty_result["S2_conf"] = _S2_to_dict(result["S2_conf"], problem)
    if "delta" in result:
        pretty_result["delta"] = {k : float(v) for k, v in zip(problem["names"], result["delta"])}
    if "delta_conf" in result:
        pretty_result["delta_conf"] = {k : float(v) for k, v in zip(problem["names"], result["delta_conf"])}
    if "vi" in result:
        pretty_result["vi"] = {k : float(v) for k, v in zip(problem["names"], result["vi"])}
    if "vi_std" in result:
        pretty_result["vi_std"] = {k : float(v) for k, v in zip(problem["names"], result["vi_std"])}
    if "dgsm" in result:
        pretty_result["dgsm"] = {k : float(v) for k, v in zip(problem["names"], result["dgsm"])}
    if "dgsm_conf" in result:
        pretty_result["dgsm_conf"] = {k : float(v) for k, v in zip(problem["names"], result["dgsm_conf"])}
    if "mu" in result:
        pretty_result["mu"] = {k : float(v) for k, v in zip(result["names"], result["mu"])}
    if "mu_star" in result:
        pretty_result["mu_star"] = {k : float(v) for k, v in zip(result["names"], result["mu_star"])}
    if "mu_star_conf" in result:
        pretty_result["mu_star_conf"] = {k : float(v) for k, v in zip(result["names"], result["mu_star_conf"])}
    if "sigma" in result:
        pretty_result["sigma"] = {k : float(v) for k, v in zip(result["names"], result["sigma"])}

    return pretty_result

# Read the parameter range file and Sobol samples 
problem = read_param_file('FishGame/uncertain_params.txt') 
param_values = np.loadtxt('FishGame/SobolSamples.txt') 

# Load the first solution 
Y = np.loadtxt('FishGame/SobolReeval/FishGame_Soln' + str(soln_index+1) + '.obj')

# Evaluate sensitivity to the first objective, NPVharvest 
obj_index = 0 
Si = sobol.analyze(problem, Y[:,obj_index], calc_second_order=True, conf_level=0.95, print_to_console=False) 
pretty_result = get_pretty_result(Si)

# Plot sensitivity
sns.set() 
fig1 = pretty_result.plot() 
fig2 = pretty_result.plot_sobol(threshold=0.01,groups={"Prey Growth Parameters" : ["a"], "Predator Growth Parameters" : ["b0"]})

soln1_obj1_barchart  soln1_obj1_radialplot

So don’t feel like you need to run your optimization and re-evaluation in Python in order to use Rhodium!

Rhodium – Open Source Python Library for (MO)RDM

Last year Dave Hadka introduced OpenMORDM (Hadka et al., 2015), an open source R package for Multi-Objective Robust Decision Making (Kasprzyk et al., 2013). If you liked the capabilities of OpenMORDM but prefer coding in Python, you’ll be happy to hear Dave has also written an open source Python library for robust decision making (RDM) (Lempert et al., 2003), including multi-objective robust decision making (MORDM): Rhodium.

Rhodium is part of Project Platypus, which also contains Python libraries for multi-objective optimization (Platypus), a standalone version of the Patient Rule Induction Method (PRIM) algorithm (Friedman and Fisher, 1999) implemented in Jan Kwakkel’s EMA Workbench, and a cross-language automation tool for running models (Executioner). Rhodium uses functions from both Platypus and PRIM, as I will briefly show here, but Jazmin will describe Platypus in more detail in a future blog post.

Dave provides an ipython notebook file with clear instructions on how to use Rhodium, with the lake problem given as an example. To give another example, I will walk through the fish game. The fish game is a chaotic predator-prey system in which the population of fish, x, and their predators, y, are co-dependent. Predator-prey systems are typically modeled by the classic Lotka-Volterra equations:

1) \frac{dx}{dt} = \alpha x - \beta x y

2) \frac{dy}{dt} = \delta x y - \gamma y

where α is the growth rate of the prey (fish), β is the rate of predation on the prey, δ is the growth rate of the predator, and γ is the death rate of the predator. This model assumes exponential growth of the prey, x, and exponential death of the predator. Based on a classroom exercise given at RAND, I modify the Lotka-Volterra model of the prey population for logistic growth (see the competitive Lotka-Volterra equations):

3) \frac{dx}{dt} = \alpha x - r x^2 - \beta x y

Discretizing equations 1 and 3 yields:

4) x_{t+1} = (\alpha + 1)x_t (1 - \frac{r}{\alpha + 1} x_t) - \beta x_t y_t and

5) y_{t+1} = (1 - \gamma)y_t + \delta x_t y_t

RAND simplifies equation 4 by letting a = α + 1, r/(α + 1) = 1 and β = 1, and simplifies equation 5 by letting b = 1/δ and γ = 1. This yields the following equations:

6) x_{t+1} = a x_t(1-x_t) - x_t y_t,

7) y_{t+1} = \frac{x_t y_t}{b}.

In this formulation, the parameter a controls the growth rate of the fish and b controls the growth rate of the predators. The growth rate of the predators is dependent on the temperature, which is increasing due to climate change according to the following equation:

8) C \frac{dT}{dt} = (F_0 + Ft) - \frac{T}{S}

where C is the heat capacity, assumed to be 50 W/m2/K/yr, F0 is the initial value of radiative forcing, assumed to be 1.0 W/m2, F is the rate of change of radiative forcing, S is the climate sensitivity in units of K/(W/m2), and T is the temperature increase from equilibrium, initialized at 0. The dependence of b on the temperature increase is given by:

9) b = \text{max} \Bigg( b_0 e^{-0.3T},0.25 \Bigg).

The parameters a, b, F, and S could all be considered deeply uncertain, but for this example I will use (unrealistically optimistic) values of F = 0 and S = 0.5 and assume possible ranges for a and b0 of 1.5 < a < 4 and 0.25 < b0 < 0.67. Within these bounds, different combinations of a and b parameters can lead to point attractors, strange attractors, or collapse of the predator population.
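
To build some intuition for these dynamics before optimizing anything, the map in equations 6-9 can be simulated directly. Below is a minimal sketch with no harvesting, using the F = 0 and S = 0.5 values assumed above and the initial populations (x0 = 0.48, y0 = 0.26) used later in the post; the function name simulate_fishery is just for illustration, and populations are clipped at zero as in the full model implementation below.


import numpy as np

def simulate_fishery(a, b0, F=0.0, S=0.5, tSteps=100, x0=0.48, y0=0.26):
    # Simulate the unharvested predator-prey map of equations 6-9
    x = np.zeros(tSteps+1)
    y = np.zeros(tSteps+1)
    x[0], y[0] = x0, y0
    F0, C, T = 1.0, 50.0, 0.0

    for t in range(tSteps):
        b = max(b0*np.exp(-0.3*T), 0.25)                    # equation 9
        x[t+1] = max(a*x[t]*(1.0 - x[t]) - x[t]*y[t], 0.0)  # equation 6, clipped at 0
        y[t+1] = max(x[t]*y[t]/b, 0.0)                      # equation 7, clipped at 0
        T = T + (F0 + F*(t+1) - T/S)/C                      # forward-Euler step of equation 8

    return x, y

# e.g., compare a slow-growing fishery with a fast-growing one
x_slow, y_slow = simulate_fishery(1.75, 0.6)
x_fast, y_fast = simulate_fishery(3.75, 0.3)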

The goal of the game is to design a strategy for harvesting some number of fish, z, at each time step, assuming that only the fish population can be observed, not the predator population. The population of the fish then becomes:

10) x_{t+1} = a x_t(1-x_t) - x_t y_t - z_t

For this example, I assume the user employs a strategy of harvesting some weighted average of the fish population in the previous two time steps:

11) z_t = \begin{cases}  \text{min} \Bigg( \alpha\beta x_t + \alpha(1-\beta)x_{t-1},x_{t} \Bigg),  t \geq 2\\  \alpha\beta x_t, t = 1  \end{cases}

where 0 ≤ α ≤ 1 and 0 ≤ β ≤ 1 are the two decision variables of the harvesting policy (not to be confused with the Lotka-Volterra parameters above). The user is assumed to have two objectives: 1) to maximize the net present value of their total harvest over T time steps, and 2) to minimize the standard deviation of their harvests over T time steps:

12) Maximize: NPV = \sum^T_{t=1} 1.05^{-t} z_t

13) Minimize: s_z = \sqrt{\frac{1}{T-1} \sum^T_{t=1} (z_t - \bar{z})^2}.
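
For concreteness, equations 12 and 13 can be computed from a vector of harvests with a few lines of NumPy. This is just an illustrative sketch (the function name is mine); note that the 1/(T-1) in equation 13 corresponds to ddof=1 in np.std.


import numpy as np

def harvest_objectives(z):
    # z is a 1-D array of harvests z_1, ..., z_T
    z = np.asarray(z, dtype=float)
    T = len(z)
    t = np.arange(1, T+1)
    NPV = np.sum(1.05**(-t) * z)    # equation 12: discounted total harvest
    std_z = np.std(z, ddof=1)       # equation 13: sample standard deviation of harvest
    return NPV, std_z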

As illustrated in the figure below, depending on the values of a and b0, the optimal Pareto sets for each “future” (each with initial populations of x0 = 0.48 and y0 = 0.26) can have very different shapes and attainable values.

Future    1       2       3       4       5
a         1.75    1.75    3.75    3.75    2.75
b0        0.6     0.3     0.6     0.3     0.45

fishfutures

For this MORDM experiment, I first optimize to an assumed state of the world (SOW) in which a = 1.75 and b0 = 0.6. To do this, I first have to write a function that takes in the decision variables for the optimization problem as well as any potentially uncertain model parameters, and returns the objectives. Here the decision variables are represented by the vector ‘vars’, the uncertain parameters are passed at default values of a = 1.75, b0 = 0.6, F = 0 and S = 0.5, and the returned objectives are NPVharvest and std_z.


import os
import math
import json
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.optimize import brentq as root
from rhodium import *
from rhodium.config import RhodiumConfig
from platypus import MapEvaluator

RhodiumConfig.default_evaluator = MapEvaluator()

def fishGame(vars,
    a = 1.75, # rate of prey growth
    b0 = 0.6, # initial rate of predator growth
    F = 0, # rate of change of radiative forcing per unit time
    S = 0.5): # climate sensitivity

    # Objectives are:
    # 1) maximize (average NPV of harvest) and 
    # 2) minimize (average standard deviation of harvest)
    # x = population of prey at time 0 to t
    # y = population of predator at time 0 to t
    # z = harvested prey at time 1 to t

    tSteps = 100
    x = np.zeros(tSteps+1)
    y = np.zeros(tSteps+1)
    z = np.zeros(tSteps)

    # initialize predator and prey populations
    x[0] = 0.48
    y[0] = 0.26

    # Initialize climate parameters
    F0 = 1
    C = 50
    T = 0
    b = max(b0*np.exp(-0.3*T),0.25)

    # find harvest at time t based on policy
    z[0] = harvest(x, 0, vars)

    #Initialize NPV of harvest
    NPVharvest = 0

    for t in range(tSteps):
        x[t+1] = max(a*x[t]*(1-x[t]) - x[t]*y[t] - z[t],0)
        y[t+1] = max(x[t]*y[t]/b,0)
        if t < tSteps-1:
            z[t+1] = harvest(x, t+1, vars)

        NPVharvest = NPVharvest + z[t]*(1+0.05)**(-(t+1))

        #Calculate next temperature and b values
        T = T + (F0 + F*(t+1) - (1/S)*T)/C
        b = max(b0*np.exp(-0.3*T),0.25)

    # Calculate the second (minimization) objective: standard deviation of harvest
    std_z = np.std(z)

    return (NPVharvest, std_z)

def harvest(x, t, vars):
    if t > 0:
        harvest = min(vars[0]*vars[1]*x[t] + vars[0]*(1-vars[1])*x[t-1],x[t])
    else:
        harvest = vars[0]*vars[1]*x[t]

    return harvest
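
Before handing fishGame to Rhodium, it can be sanity-checked directly with an arbitrary harvesting policy (the lever values below are just an example):


# Evaluate the model at an example policy [alpha, beta] = [0.5, 0.5]
# using the default SOW (a = 1.75, b0 = 0.6, F = 0, S = 0.5)
NPVharvest, std_z = fishGame([0.5, 0.5])
print(NPVharvest, std_z)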

Next, the model class must be defined, along with its parameters, its objectives (or “responses”) and whether each should be minimized or maximized, its decision variables (or “levers”), and its uncertainties.


model = Model(fishGame)

# define all parameters to the model that we will be studying
model.parameters = [Parameter("vars"), Parameter("a"), Parameter("b0"), Parameter("F"), Parameter("S")]

# define the model outputs
model.responses = [Response("NPVharvest", Response.MAXIMIZE), Response("std_z", Response.MINIMIZE)]

# some parameters are levers that we control via our policy
model.levers = [RealLever("vars", 0.0, 1.0, length=2)]

# some parameters are exogenous uncertainties, and we want to better
# understand how these uncertainties impact our model and decision making
# process
model.uncertainties = [UniformUncertainty("a", 1.5, 4.0), UniformUncertainty("b0", 0.25, 0.67)]

The model can then be optimized using a multi-objective evolutionary algorithm (MOEA) in Platypus, and the output written to a file. Here I use NSGA-II.


output = optimize(model, "NSGAII", 100)
with open("data.txt", "w") as f:
    json.dump(output, f)

The results can be easily visualized with simple commands. The Pareto sets can be plotted with ‘scatter2d’ or ‘scatter3d’, both of which allow brushing on one or more objective thresholds. Here I first brush on solutions with an NPV of harvest ≥ 1.0, and then add a condition that the standard deviation of harvest be ≤ 0.01.


# Use Seaborn settings for pretty plots
sns.set()

# Plot the points in 2D space
scatter2d(model, output)
plt.show()

# The optional interactive flag will show additional details of each point when
# hovering the mouse
# Most of Rhodium's plotting functions accept an optional expr argument for
# classifying or highlighting points meeting some condition
scatter2d(model, output, x="NPVharvest", brush=Brush("NPVharvest >= 1.0"))
plt.show()

scatter2d(model, output, brush="NPVharvest >= 1.0 and std_z <= 0.01")
plt.show()

The above code creates the following images:

figure1  figure2b  figure2

Rhodium can also plot kernel density estimates of the solutions, or of those attaining certain objective values.


# Kernel density estimation plots show density contours for samples. By
# default, it will show the density of all sampled points
kdeplot(model, output, x="NPVharvest", y="std_z")
plt.show()

# Alternatively, we can show the density of all points meeting one or more
# conditions
kdeplot(model, output, x="NPVharvest", y="std_z", brush=["NPVharvest >= 1.0", "std_z <= 0.01"], alpha=0.8)
plt.show()

figure4  figure5

Rhodium can also generate scatterplots of all pairwise objective combinations, along with histograms of the marginal distribution of each objective. These can also be brushed by objective thresholds specified by the user.


# Pairwise scatter plots show 2D scatter plots for all outputs
pairs(model, output)
plt.show()

# We can also highlight points meeting one or more conditions
pairs(model, output, brush=["NPVharvest >= 1.0", "std_z <= 0.01"])
plt.show()

# Joint plots show a single pair of parameters in 2D, their distributions using
# histograms, and the Pearson correlation coefficient
joint(model, output, x="NPVharvest", y="std_z")
plt.show()

figure6  figure7  figure8

Tradeoffs can also be viewed on parallel axes plots, which can likewise be brushed on user-specified objective values.


# A parallel coordinates plot to view interactions among responses
parallel_coordinates(model, output, colormap="rainbow", zorder="NPVharvest", brush=Brush("NPVharvest > 1.0")) 
plt.show()

figure10

But the real advantage of Rhodium is not visualization but uncertainty analysis. First, PRIM can be used to identify “boxes” best describing solutions meeting user-specified criteria. I define solutions with an NPV of harvest ≥ 1.0 as profitable, and those below that threshold as unprofitable.


# The remaining figures look better using Matplotlib's default settings
mpl.rcdefaults()

# We can manually construct policies for analysis. A policy is simply a Python
# dict storing key-value pairs, one for each lever.
#policy = {"vars" : [0.02]*2}

# Or select one of our optimization results
policy = output[8]

# construct a specific policy and evaluate it against 1000 states-of-the-world
SOWs = sample_lhs(model, 1000)
results = evaluate(model, update(SOWs, policy))
metric = ["Profitable" if v["NPVharvest"] >= 1.0 else "Unprofitable" for v in results]

# use PRIM to identify the key uncertainties if we require NPVharvest >= 1.0
p = Prim(results, metric, include=model.uncertainties.keys(), coi="Profitable")
box = p.find_box()
box.show_details()
plt.show()

This will first show the smallest box with the greatest density but lowest coverage.

pasting_trajectory_26

Clicking on “Back” will show the next largest box with slightly lower density but greater coverage, while “Next” moves in the opposite direction. In this case, since the smallest box is shown, “Next” moves full circle to the largest box with the lowest density, but greatest coverage, and clicking “Next” from this figure will start reducing the box size.

pasting_trajectory_1

Classification And Regression Trees (CART; Breiman et al., 1984) can also be used to identify hierarchical conditional statements classifying successes and failures in meeting the user-specified criteria.

# use CART to identify the key uncertainties
c = Cart(results, metric, include=model.uncertainties.keys())
c.print_tree(coi="Profitable")
c.show_tree()
plt.show()

figure11

Finally, Dave has wrapped Rhodium around Jon Herman’s SALib for sensitivity analysis. Here’s an example of how to run the Method of Morris.


# Sensitivity analysis using Morris method
print(sa(model, "NPVharvest", policy=policy, method="morris", nsamples=1000, num_levels=4, grid_jump=2))

morrisprintedoutput

You can also create tornado and spider plots from one-at-a-time (OAT) sensitivity analysis.


# OAT sensitivity analysis
fig = oat(model, "NPVharvest", policy=policy, nsamples=1000)

oatplot

Finally, you can visualize the output of Sobol sensitivity analysis with bar charts of the first and total order sensitivity indices, or as radial plots showing the interactions between parameters. In these plots the filled circles on each parameter represent their first order sensitivity, the open circles their total sensitivity, and the lines between them the second order indices of the connected parameters. You can even create groups of similar parameters with different colors for easier visual analysis.

Si = sa(model, "NPVharvest", policy=policy, method="sobol", nsamples=1000, calc_second_order=True)
fig1 = Si.plot()
fig2 = Si.plot_sobol(threshold=0.01)
fig3 = Si.plot_sobol(threshold=0.01,groups={"Prey Growth Parameters" : ["a"],
            "Predator Growth Parameters" : ["b0"]})

sobolsi_plot  fishgame_radialplot  fishgame_radialplot_groups

As you can see, Rhodium makes MORDM analysis very simple! Now if only we could reduce uncertainty…

Works Cited

Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone (1984). Classification and Regression Trees. Wadsworth.

Friedman, J. H. and N. I. Fisher (1999). Bump hunting in high-dimensional data. Statistics and Computing, 9(2), 123-143.

Hadka, D., Herman, J., Reed, P., and Keller, K. (2015). An open source framework for many-objective robust decision making. Environmental Modelling & Software, 74, 114-129.

Kasprzyk, J. R., S. Nataraj, P. M. Reed, and R. J. Lempert (2013). Many objective robust decision making for complex environmental systems undergoing change. Environmental Modelling & Software, 42, 55-71.

Lempert, R. J., S. W. Popper, and S. C. Bankes (2003). Shaping the Next One Hundred Years: New Methods for Quantitative, Long-Term Policy Analysis. RAND Corporation.