Open Source Streamflow Generator Part II: Validation

This is the second part of a two-part blog post on an open source synthetic streamflow generator written by Matteo Giuliani, Jon Herman and me, which combines the methods of Kirsch et al. (2013) and Nowak et al. (2010) to generate correlated synthetic hydrologic variables at multiple sites. Part I showed how to use the MATLAB code in the subdirectory /stationary_generator to generate synthetic hydrology, while this post shows how to use the Python code in the subdirectory /validation to statistically validate the synthetic data.

The goal of any synthetic streamflow generator is to produce a time series of synthetic hydrologic variables that expand upon those in the historical record while reproducing their statistics. The /validation subdirectory of our repository provides Python plotting functions to visually and statistically compare the historical and synthetic hydrologic data. The function plotFDCrange.py plots the range of the flow duration (probability of exceedance) curves for each hydrologic variable across all historical and synthetic years. Lines 96-100 should be modified for your specific application. You may also have to modify line 60 to change the dimensions of the subplots in your figure. It’s currently set to plot a 2 x 2 figure for the four LSRB hydrologic variables.

plotFDCrange.py provides a visual, not statistical, analysis of the generator’s performance. An example plot from this function for the synthetic data generated for the Lower Susquehanna River Basin (LSRB) as described in Part I is shown below. These probability of exceedance curves were generated from 1000 years of synthetic hydrologic variables. Figure 1 indicates that the synthetic time series are successfully expanding upon the historical observations, as the synthetic hydrologic variables include more extreme high and low values. The synthetic hydrologic variables also appear unbiased, as this expansion is relatively equal in both directions. Finally, the synthetic probability of exceedance curves also follow the same shape as the historical, indicating that they reproduce the within-year distribution of daily values.

Figure 1

To more formally confirm that the synthetic hydrologic variables are unbiased and follow the same distribution as the historical, we can test whether or not the synthetic median and variance of real-space monthly values are statistically different from the historical using the function monthly-moments.py. This function is currently set up to perform this analysis for the flows at Marietta, but the site being plotted can be changed on line 76. The results of these tests for Marietta are shown in Figure 2. This figure was generated from a 100-member ensemble of synthetic series of length 100 years, and a bootstrapped ensemble of historical years of the same size and length. Panel a shows boxplots of the real-space historical and synthetic monthly flows, while panels b and c show boxplots of their means and standard deviations, respectively. Because the real-space flows are not normally distributed, the non-parametric Wilcoxon rank-sum test and Levene’s test were used to test whether or not the synthetic monthly medians and variances were statistically different from the historical. The p-values associated with these tests are shown in Figures 2d and 2e, respectively. None of the synthetic medians or variances are statistically different from the historical at a significance level of 0.05.

Figure 2

In addition to verifying that the synthetic generator reproduces the first two moments of the historical monthly hydrologic variables, we can also verify that it reproduces both the historical autocorrelation and cross-site correlation at monthly and daily time steps using the functions autocorr.py and spatial-corr.py, respectively. The autocorrelation function is again set to perform the analysis on Marietta flows, but the site can be changed on line 39. The spatial correlation function performs the analysis for all site pairs, with site names listed on line 75.

The results of these analyses are shown in Figures 3 and 4, respectively. Figures 3a and 3b show the autocorrelation function of historical and synthetic real-space flows at Marietta for up to 12 lags of monthly flows (panel a) and 30 lags of daily flows (panel b). Also shown are 95% confidence intervals on the historical autocorrelations at each lag. The range of autocorrelations generated by the synthetic series expands upon that observed in the historical while remaining within the 95% confidence intervals for all months, suggesting that the historical monthly autocorrelation is well-preserved. On a daily time step, most simulated autocorrelations fall within the 95% confidence intervals for lags up to 10 days, and those falling outside do not represent significant biases.

Figure 3

Figures 4a and 4b show boxplots of the cross-site correlation in monthly (panel a) and daily (panel b) real-space hydrologic variables for all pairwise combinations of sites. The synthetic generator greatly expands upon the range of cross-site correlations observed in the historical record, both above and below. Table 1 lists which sites are included in each numbered pair of Figure 4. Wilcoxon rank sum tests (panels c and d) for differences in median monthly and daily correlations indicate that pairwise correlations are statistically different (α=0.5) between the synthetic and historical series at a monthly time step for site pairs 1, 2, 5 and 6, and at a daily time step for site pairs 1 and 2. However, biases for these site pairs appear small in panels a and b. In summary, Figures 1-4 indicate that the streamflow generator is reasonably reproducing historical statistics, while also expanding on the observed record.

Figure 4

Table 1

Pair Number Sites
1 Marietta and Muddy Run
2 Marietta and Lateral Inflows
3 Marietta and Evaporation
4 Muddy Run and Lateral Inflows
5 Muddy Run and Evaporation
6 Lateral Inflows and Evaporation

 

 

Advertisements

Open Source Streamflow Generator Part I: Synthetic Generation

This post describes how to use the Kirsch-Nowak synthetic streamflow generator to generate synthetic streamflow ensembles for water systems analysis. As Jon Lamontagne discussed in his introduction to synthetic streamflow generation, generating synthetic hydrology for water systems models allows us to stress-test alternative management plans under stochastic realizations outside of those observed in the historical record. These realizations may be generated assuming stationary or non-stationary models. In a few recent papers from our group applied to the Red River and Lower Susquehanna River Basins (Giuliani et al., 2017; Quinn et al., 2017; Zatarain Salazar et al., In Revision), we’ve generated stationary streamflow ensembles by combining methods from Kirsch et al. (2013) and Nowak et al. (2010). We use the method of Kirsch et al. (2013) to generate flows on a monthly time step and the method of Nowak et al. (2010) to disaggregate these monthly flows to a daily time step. The code for this streamflow generator, written by Matteo Giuliani, Jon Herman and me, is now available on Github. Here I’ll walk through how to use the MATLAB code in the subdirectory /stationary_generator to generate correlated synthetic hydrology at multiple sites, and in Part II I’ll show how to use the Python code in the subdirectory /validation to statistically validate the synthetic hydrology. As an example, I’ll use the Lower Susquehanna River Basin (LSRB).

A schematic of the LSRB, reproduced from Giuliani et al. (2014) is provided below. The system consists of two reservoirs: Conowingo and Muddy Run. For the system model, we generate synthetic hydrology upstream of the Conowingo Dam at the Marietta gauge (USGS station 01576000), as well as lateral inflows between Marietta and Conowingo, inflows to Muddy Run and evaporation rates over Conowingo and Muddy Run dams. The historical hydrology on which the synthetic hydrologic model is based consists of the historical record at the Marietta gauge from 1932-2001 and simulated flows and evaporation rates at all other sites over the same time frame generated by an OASIS system model. The historical data for the system can be found here.

The first step to use the synthetic generator is to format the historical data into an nD × nS matrix, where nD is the number of days of historical data with leap days removed and nS is the number of sites, or hydrologic variables. An example of how to format the Susquehanna data is provided in clean_data.m. Once the data has been reformatted, the synthetic generation can be performed by running script_example.m (with modifications for your application). Note that in the LSRB, the evaporation rates over the two reservoirs are identical, so we remove one of those columns from the historical data (line 37) for the synthetic generation. We also transform the historical evaporation with an exponential transformation (line 42) since the code assumes log-normally distributed hydrologic data, while evaporation in this region is more normally distributed. After the synthetic hydrology is generated, the synthetic evaporation rates are back-transformed with a log-transformation on line 60. While such modifications allow for additional hydrologic data beyond streamflows to be generated, for simplicity I will refer to all synthetic variables as “streamflows” for the remainder of this post. In addition to these modifications, you should also specify the number of realizations, nR, you would like to generate (line 52), the number of years, nY, to simulate in each realization (line 53) and a string with the dimensions nR × nY for naming the output file.

The actual synthetic generation is performed on line 58 of script_example.m which calls combined_generator.m. This function first generates monthly streamflows at all sites on line 10 where it calls monthly_main.m, which in turn calls monthly_gen.m to perform the monthly generation for the user-specified number of realizations. To understand the monthly generation, we denote the set of historical streamflows as \mathbf{Q_H}\in \mathbb{R}^{N_H\times T} and the set of synthetic streamflows as \mathbf{Q_S}\in \mathbb{R}^{N_S\times T}, where N_H and N_S are the number of years in the historical and synthetic records, respectively, and T is the number of time steps per year. Here T=12 for 12 months. For the synthetic generation, the streamflows in \mathbf{Q_H} are log-transformed to yield the matrix Y_{H_{i,j}}=\ln(Q_{H_{i,j}}), where i and j are the year and month of the historical record, respectively. The streamflows in \mathbf{Y_H} are then standardized to form the matrix \mathbf{Z_H}\in \mathbb{R}^{N_H\times T} according to equation 1:

1) Z_{H_{i,j}} = \frac{Y_{H_{i,j}}-\hat{\mu_j}}{\hat{\sigma_j}}

where \hat{\mu_j} and \hat{\sigma_j} are the sample mean and sample standard deviation of the j-th month’s log-transformed streamflows, respectively. These variables follow a standard normal distribution: Z_{H_{i,j}}\sim\mathcal{N}(0,1).

For each site, we generate standard normal synthetic streamflows that reproduce the statistics of \mathbf{Z_H} by first creating a matrix \mathbf{C}\in \mathbb{R}^{N_S\times T} of randomly sampled standard normal streamflows from \mathbf{Z_H}. This is done by formulating a random matrix \mathbf{M}\in \mathbb{R}^{N_S\times T} whose elements are independently sampled integers from (1,2,...,N_H). Each element of \mathbf{C} is then assigned the value C_{i,j}=Z_{H_{(M_{i,j}),j}}, i.e. the elements in each column of \mathbf{C} are randomly sampled standard normal streamflows from the same column (month) of \mathbf{Z_H}. In order to preserve the historical cross-site correlation, the same matrix \mathbf{M} is used to generate \mathbf{C} for each site.

Because of the random sampling used to populate \mathbf{C}, an additional step is needed to generate auto-correlated standard normal synthetic streamflows, \mathbf{Z_S}. Denoting the historical autocorrelation \mathbf{P_H}=corr(\mathbf{Z_H}), where corr(\mathbf{Z_H}) is the historical correlation between standardized streamflows in months i and j (columns of \mathbf{Z_H}), an upper right triangular matrix, \mathbf{U}, can be found using Cholesky decomposition (chol_corr.m) such that \mathbf{P_H}=\mathbf{U^\intercal U}. \mathbf{Z_S} is then generated as \mathbf{Z_S}=\mathbf{CU}. Finally, for each site, the auto-correlated synthetic standard normal streamflows \mathbf{Z_S} are converted back to log-space streamflows \mathbf{Y_S} according to Y_{S_{i,j}}=\hat{\mu_j}+Z_{S_{i,j}}\hat{\sigma_j}. These are then transformed back to real-space streamflows \mathbf{Q_S} according to Q_{S_{i,j}}=exp(Y_{S_{i,j}}).

While this method reproduces the within-year log-space autocorrelation, it does not preserve year to-year correlation, i.e. concatenating rows of \mathbf{Q_S} to yield a vector of length N_S\times T will yield discontinuities in the autocorrelation from month 12 of one year to month 1 of the next. To resolve this issue, Kirsch et al. (2013) repeat the method described above with a historical matrix \mathbf{Q_H'}\in \mathbb{R}^{N_{H-1}\times T}, where each row i of \mathbf{Q_H'} contains historical data from month 7 of year i to month 6 of year i+1, removing the first and last 6 months of streamflows from the historical record. \mathbf{U'} is then generated from \mathbf{Q_H'} in the same way as \mathbf{U} is generated from \mathbf{Q_H}, while \mathbf{C'} is generated from \mathbf{C} in the same way as \mathbf{Q_H'} is generated from \mathbf{Q_H}. As before, \mathbf{Z_S'} is then calculated as \mathbf{Z_S'}=\mathbf{C'U'}. Concatenating the last 6 columns of \mathbf{Z_S'} (months 1-6) beginning from row 1 and the last 6 columns of \mathbf{Z_S} (months 7-12) beginning from row 2 yields a set of synthetic standard normal streamflows that preserve correlation between the last month of the year and the first month of the following year. As before, these are then de-standardized and back-transformed to real space.

Once synthetic monthly flows have been generated, combined_generator.m then finds all historical total monthly flows to be used for disaggregation. When calculating all historical total monthly flows a window of +/- 7 days of the month being disaggregated is considered. That is, for January, combined_generator.m finds the total flow volumes in all consecutive 31-day periods within the window from 7 days before Jan 1st to 7 days after Jan 31st. For each month, all of the corresponding historical monthly totals are then passed to KNN_identification.m (line 76) along with the synthetic monthly total generated by monthly_main.mKNN_identification.m identifies the k nearest historical monthly totals to the synthetic monthly total based on Euclidean distance (equation 2):

2) d = \left[\sum^{M}_{m=1} \left({\left(q_{S}\right)}_{m} - {\left(q_{H}\right)}_{m}\right)^2\right]^{1/2}

where {(q_S)}_m is the real-space synthetic monthly flow generated at site m and {(q_H)}_m is the real-space historical monthly flow at site m. The k-nearest neighbors are then sorted from i=1 for the closest to i=k for the furthest, and probabilistically selected for proportionally scaling streamflows in disaggregation. KNN_identification.m uses the Kernel estimator given by Lall and Sharma (1996) to assign the probability p_n of selecting neighbor n (equation 3):

3) p_{n} = \frac{\frac{1}{n}}{\sum^{k}_{i=1} \frac{1}{i}}

Following Lall and Sharma (1996) and Nowak et al. (2010), we use k=\Big \lfloor N_H^{1/2} \Big \rceil. After a neighbor is selected, the final step in disaggregation is to proportionally scale all of the historical daily streamflows at site m from the selected neighbor so that they sum to the synthetically generated monthly total at site m. For example, if the first day of the month of the selected historical neighbor represented 5% of that month’s historical flow, the first day of the month of the synthetic series would represent 5% of that month’s synthetically-generated flow. The random neighbor selection is performed by KNN_sampling.m (called on line 80 of combined_generator.m), which also calculates the proportion matrix used to rescale the daily values at each site on line 83 of combined_generator.m. Finally, script_example.m writes the output of the synthetic streamflow generation to files in the subdirectory /validation. Part II shows how to use the Python code in this directory to statistically validate the synthetically generated hydrology, meaning ensure that it preserves the historical monthly and daily statistics, such as the mean, standard deviation, autocorrelation and spatial correlation.

Making Watershed Maps in Python

This post builds off of earlier posts by Jon Lamontagne and Jon Herman on making global maps in Python using matplotlib and basemap. However rather than making a global map, I’ll show how to zoom into a particular region, here the Red River basin in East Asia. To make these maps, you’ll need to have basemap installed (from github here, or using a Windows installer here).

The first step is to create a basemap. Both Jons used the ‘robin’ global projection to do this in their posts. Since I’m only interested in a particular region, I just specify the bounding box using the lower and upper latitudes and longitudes of the region I’d like to plot. As Jon H points out, you can also specify the resolution (‘f’ = full, ‘h’ =high, ‘i’ = intermediate, ‘l’ = low, ‘c’ = crude), and you can even use different ArcGIS images for the background (see here). I use ‘World_Shaded_Relief’. It’s also possible to add a lot of features such as rivers, countries, coastlines, counties, etc. I plot countries and rivers. The argument ‘zorder’ specifies the order of the layering from 1 to n, where 1 is the bottom layer and n the top.


from mpl_toolkits.basemap import Basemap
from matplotlib import pyplot as plt

fig = plt.figure()
fig.set_size_inches([17.05,8.15])
ax = fig.add_subplot(111)

# plot basemap, rivers and countries
m = basemap(llcrnrlat=19.5, urcrnrlat=26.0, llcrnrlon=99.6, urcrnr=107.5, resolution='h')
m.arcgisimage(service='World_Shaded_Relief')
m.drawrivers(color='dodgerblue',linewidth=1.0,zorder=1)
m.drawcountries(color='k',linewidth=1.25)

The above code makes the following image (it takes some time, since I’m using high resolution):

Now let’s add a shaded outline of the Red River basin. To do this, you need a shapefile of the basin. The FAO provides a shapefile of major watersheds in the world, from which you can extract the watershed you’re interested in using ArcGIS (see instructions here). In this shapefile, the Red River is labeled by its name in Vietnamese, ‘Song Hong.’ I chose not to draw the bounds of the basin in my map because it would be too busy with the country borders. Instead, I shaded the region gray (facecolor=’0.33′) with a slightly darker border (edgecolor=’0.5′) and slight transparency (alpha=0.5). To do that, I had to collect all of the patches associated with the shapefile (which I called ‘Basin’ when reading it in) that needed to be shaded.


from matplotlib.patches import Polygon
from matplotlib.collections import Patch Collection

# plot Red River basin
m.readshapefile('RedRiverBasin_WGS1984', 'Basin', drawbounds=False)
patches = []
for info, shape in zip(m.Basin_info, m.Basin):
    if info['OBJECTID'] == 1: # attribute in attribute table of shapefile
        patches.append(Polygon(np.array(shape), True))

ax.add_collection(PatchCollection(patches, facecolor='0.33', edgecolor='0.5', alpha=0.5))

This creates the following image:

Now let’s add the locations of major dams and cities in the basin using ‘scatter‘. You could again do this by adding a shapefile, but I’m just going to add their locations manually, either by uploading their latitude and longitude coordinates from a .csv file or by passing them directly.


import numpy as np

# plot dams
damsLatLong = np.loadtxt('DamLocations.csv', delimiter=',', skiprows=1, usecols=[1,2])
x, y = m(damsLatLong[:,1], damLatLong[:,0]) # m(longitude, latitude)
m.scatter(x, y, c='k', s = 150, marker = '^')

# plot Hanoi
x, y = m(105.8342, 21.0278)
m.scatter(x, y, facecolor='darkred', edgecolor='darkred', s=150)

This makes the following image:

If we want to label the dams and cities, we can add text specifying where on the map we’d like them to be located. This may require some guess-and-check work to determine the best place (comment if you know a better way!). I temporarily added gridlines to the map to aid in this process using ‘drawparallels‘ and ‘drawmeridians‘.


# label dams and Hanoi
plt.text(104.8, 21.0, 'Hoa Binh', fontsize=18, ha='center', va='center', color='k')
plt.text(104.0, 21.7, 'Son La', fontsize=18, ha='center', va='center', color='k')
plt.text(105.0, 21.95, 'Thac Ba', fontsize=18, ha='center', va='center', color='k')
plt.text(105.4, 22.55, 'Tuyen Quang', fontsize=18, ha='center', va='center', color='k')
plt.text(105.8, 21.2, 'Hanoi', fontsize=18, ha='center', va='center', color='k')

Now our map looks like this:

That looks nice, but it would be helpful to add some context as to where in the world the Red River basin is located. To illustrate this, we can create an inset of the greater geographical area by adding another set of axes with its own basemap. This one can be at a lower resolution.


from mpl_toolkits.axes_grid1.inset_locator import zoomed_inset_axes

# plot inset of greater geographic area
axins = zoomed_inset_axes(ax, 0.1, loc=1) # locations
axins.set_xlim(90, 115) # longitude boundaries of inset map
axins.set_ylim(8, 28) # latitude boundaries of inset map

# remove tick marks from inset axes
plt.xticks(visible=False)
plt.yticks(visible=False)

# add basemap to inset map
m2 = Basemap(llcrnrlat=8.0, urcrnclat=28.0, llcrnr=90.0, urcrnrlon=115.0, resolution='l', ax=axins)
m2.arcgisimage(service='World_Shaded_Relief')
m2.drawcountries(color='k', linewidth=0.5)

This image looks like this:

Now let’s highlight a country of interest (Vietnam) in green and also add the Red River basin in light gray again.


# plot Vietnam green in inset
m2.readshapefile('VN_borders_only_WGS1984', 'Vietnam', drawbounds=False)
patches2 = []
for info, shape in zip(m2.Vietnam_info, m2.Vietnam):
    if info['Joiner'] == 1:
        patches2.append(Polygon(np.array(shape), True))

axins.add_collection(PatchCollection(patches2, facecolor='forestgreen', edgecolor='0.5', alpha=0.5))

# shade Red River basin gray in inset
axins.add_collection(PatchCollection(patches, faceolor='0.33', edgecolor='0.5', alpha=0.5)

Now our map looks like this:

Finally, let’s label the countries in the inset. Some of the countries are too small to fit their name inside, so we’ll have to create arrows pointing to them using ‘annotate‘. In this function, ‘xy’ specifies where the arrow points to and ‘xytext’ where the text is written relative to where the arrow points.


# label countries
plt.text(107.5, 25.5, 'China', fontsize=11, ha='center', va='center', color='k')
plt.text(102.5, 20.2, 'China', fontsize=11, ha='center', va='center', color='k')
plt.text(101.9, 15.5, 'China', fontsize=11, ha='center', va='center', color='k')
plt.text(9.5, 21.0, 'China', fontsize=11, ha='center', va='center', color='k')

# add arrows to label Vietnam and Cambodia 
plt.annotate('Vietnam', xy=(108.0, 14.0), xycoords='data', xytext=(5.0, 20.0), textcoords='offset points', \ 
    color='k', arrowprops=dict(arrowstyle='-'), fontsize=11)
plt.annotate('Cambodia', xy=(104.5, 12.0), xycoords='data', xytext=(-60.0, -25.0), textcoords='offset points', \ 
    color='k', arrowprops=dict(arrowstyle='-'), fontsize=11)

Now our map looks like this:

I think that’s pretty good, so let’s save it ;). See below for all the code used to make this map, with all the import statements at the beginning rather than sporadically inserted throughout the code!

If you’re looking for any other tips on how to make different types of maps using basemap, I recommend browsing through the basemap toolkit documentation and this basemap tutorial, where I learned how to do most of what I showed here.


from mpl_toolkits.basemap import Basemap
from mpl_toolkits.axes_grid1.inset_locator import zoomed_inset_axes
from matplotlib import pyplot as plt
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
import numpy as np

# set-up Vietnam basemap
fig = plt.figure()
fig.set_size_inches([17.05, 8.15])
ax = fig.add_subplot(111)

# plot basemap, rivers and countries
m = Basemap(llcrnrlat=19.5,urcrnrlat=26.0,llcrnrlon=99.6,urcrnrlon=107.5,resolution='h')
m.arcgisimage(service='World_Shaded_Relief')
m.drawrivers(color='dodgerblue',linewidth=1.0,zorder=1)
m.drawcountries(color='k',linewidth=1.25)

# plot Red River basin
m.readshapefile('RedRiverBasin_WGS1984','Basin',drawbounds=False)
patches = []
for info, shape in zip(m.Basin_info, m.Basin):
    if info['OBJECTID'] == 1:
        patches.append(Polygon(np.array(shape), True))

ax.add_collection(PatchCollection(patches, facecolor='0.33',edgecolor='0.5',alpha=0.5))

# plot dams
damsLatLong = np.loadtxt('DamLocations.csv',delimiter=',',skiprows=1,usecols=[1,2])
x, y = m(damsLatLong[:,1], damsLatLong[:,0])
m.scatter(x, y, c='k', s=150, marker='^')

# plot Hanoi
x, y = m(105.8342, 21.0278)
m.scatter(x, y, facecolor='darkred', edgecolor='darkred', s=150)

# label reservoirs and Hanoi
plt.text(104.8, 21.0, 'Hoa Binh', fontsize=18, ha='center',va='center',color='k')
plt.text(104.0, 21.7, 'Son La', fontsize=18, ha='center', va='center', color='k')
plt.text(105.0, 21.95, 'Thac Ba', fontsize=18, ha='center', va='center', color='k')
plt.text(105.4, 22.55, 'Tuyen Quang', fontsize=18, ha='center', va='center', color='k')
plt.text(105.8, 21.2, 'Hanoi', fontsize=18, ha='center', va='center', color='k')

# plot inset of greater geographic area
axins = zoomed_inset_axes(ax, 0.1, loc=1)
axins.set_xlim(90, 115)
axins.set_ylim(8,28)

plt.xticks(visible=False)
plt.yticks(visible=False)

m2 = Basemap(llcrnrlat=8.0,urcrnrlat=28.0,llcrnrlon=90.0,urcrnrlon=115.0,resolution='l',ax=axins)
m2.arcgisimage(service='World_Shaded_Relief')
m2.drawcountries(color='k',linewidth=0.5)

# plot Vietnam green in inset
m2.readshapefile('VN_borders_only_WGS1984','Vietnam',drawbounds=False)
patches2 = []
for info, shape in zip(m2.Vietnam_info, m2.Vietnam):
    if info['Joiner'] == 1:
        patches2.append(Polygon(np.array(shape), True))

axins.add_collection(PatchCollection(patches2, facecolor='forestgreen',edgecolor='0.5',alpha=0.5))

# shade Red River basin gray in inset
axins.add_collection(PatchCollection(patches, facecolor='0.33',edgecolor='0.5',alpha=0.5))

# label countries
plt.text(107.5, 25.5, 'China', fontsize=11, ha='center',va='center',color='k')
plt.text(102.5, 20.2, 'Laos', fontsize=11, ha='center', va='center', color='k')
plt.text(101.9, 15.5, 'Thailand', fontsize=11, ha='center', va='center', color='k')
plt.text(96.5, 21.0, 'Myanmar', fontsize=11, ha='center', va='center', color='k')

plt.annotate('Vietnam', xy=(108.0,14.0), xycoords='data', xytext=(5.0,20.0), textcoords='offset points', \
    color='k',arrowprops=dict(arrowstyle='-'),fontsize=11)
plt.annotate('Cambodia', xy=(104.5,12.0), xycoords='data', xytext=(-60.0,-25.0), textcoords='offset points', \
    color='k',arrowprops=dict(arrowstyle='-'),fontsize=11)

fig.savefig('RedRiverMap.png')
fig.clf()

Using Rhodium for RDM Analysis of External Dataset

In my last blog post, I showed how to run an MORDM experiment using Rhodium. This process included the multi-objective optimization to an assumed state of the world (SOW) as well as the re-evaluation of the Pareto-approximate solutions on alternative SOWs, before using sensitivity and classification tools such as PRIM and CART for the scenario discovery analysis. However, there may be cases where you have to run the optimization and re-evaluation outside of Rhodium, for instance if your model is in another programming language than Python. There are two ways you can do this while still using Rhodium for the scenario discovery. The first option is to run the model through the Executioner. Another option is to run the model separately and import the output into the same format as is generated by Rhodium for post-analysis. I will explain the second method here for the fish game described in my last post.

The first step is to read the decision variables and objectives from the optimization into 2D arrays. Then the uncertainties, levers and responses can be defined as before, except they no longer have to be associated with an object of the class ‘Model‘.


# read in output of optimization
variables = np.loadtxt('FishGame/FishGame.resultfile',usecols=[0,1])
objectives = np.loadtxt('FishGame/FishGame.resultfile',usecols=[2,3])

# make maximization objectives positive
maxIndices = [0]
objectives[:,maxIndices] = -objectives[:,maxIndices]

# define X of XLRM framework
uncertainties = [UniformUncertainty("a", 1.5, 4.0),
    UniformUncertainty("b0", 0.25, 0.67)]

# define L of XLRM framework
levers = [RealLever("vars", 0.0, 1.0, length=2)]

# define R of XLRM framework
responses = [Response("NPVharvest", Response.MAXIMIZE),
    Response("std_z", Response.MINIMIZE)]

Note: If you are interested in using Rhodium’s plotting tools to visualize the results of the optimization, you can still make the uncertainties, levers and responses attributes of a model object. However, you will have to create a model function to instantiate the model. This is sloppy, but you can fake this by just creating a function that takes in the decision variables and model parameters, and returns the objective values, but doesn’t actually perform any calculations.


def fishGame(vars,
    a = 1.75, # rate of prey growth
    b0 = 0.6, # initial rate of predator growth
    F = 0, # rate of change of radiative forcing per unit time
    S = 0.5): # climate sensitivity)

    NPVharvest = None
    std_z = None

    return (NPVharvest, std_z)

model = Model(fishGame)

# define all parameters to the model that we will be studying
model.parameters = [Parameter("vars"),
    Parameter("a"),
    Parameter("b0"),
    Parameter("F"),
    Parameter("S")]

If using Rhodium for the optimization, this function would actually perform the desired calculation and Platypus could be used for the optimization. Since we have already performed the optimization, we just need to reformat the output of the optimization into that used by Rhodium for the RDM analysis. This can be done by mimicking the output structure that would be returned by the function ‘optimize‘.


# find number of solutions
nsolns = np.shape(objectives)[0]

# properly format output of optimization
output = DataSet()
for i in range(nsolns):
    env = OrderedDict()
    offset = 0

    for lever in levers:
        if lever.length == 1:
            env[lever.name] = list(variables[i,:])
        else:
            env[lever.name] = list(variables[i,offset:offset+lever.length])
            
        offset += lever.length

    for j, response in enumerate(responses):
        env[response.name] = objectives[i,j]

    output.append(env)

# write output to file
with open("FishGame/FishGame_data.txt","w") as f:
    json.dump(output, f)

Next we need to read in the uncertain parameters that were sampled for the re-evaluation and format the results of the re-evaluation into the same format as would be output by calling ‘evaluate‘ within Rhodium. Below is an example with the first solution (soln_index=0).

# read in LH samples of uncertain parameters and determine # of samples
LHsamples = np.loadtxt('FishGame/LHsamples.txt')
nsamples = np.shape(LHsamples)[0]

# load policy from optimization
soln_index = 0
policy = output[soln_index]

# load its objective values from re-evaluation and make maximization objectives positive
objectives = np.loadtxt('FishGame/MORDMreeval/FishGame_Soln' + str(soln_index+1) + '.obj')
objectives[:,maxIndices] = -objectives[:,maxIndices]

# convert re-evaluation output to proper format
results = DataSet()
for j in range(nsamples):
    env = OrderedDict()
    offset = 0

    for k, uncertainty in enumerate(uncertainties):
        env[uncertainty.name] = LHsamples[j,k]

    for k, response in enumerate(responses):
        env[response.name] = objectives[j,k]

    for lever in levers:
        if lever.length == 1:
            env[lever.name] = list(variables[soln_index,:])
        else:
            env[lever.name] = list(variables[soln_index,offset:offset+lever.length])

        offset += lever.length

    results.append(env)

# write results to file
with open("FishGame/FishGame_Soln" + str(soln_index+1) + "_reeval.txt","w") as f:
    json.dump(results, f)

Finally, you have to define the metrics.


# calculate M of XLRM framework
metric = ["Profitable" if v["NPVharvest"] >= 3.0 else "Unprofitable" for v in results]

Then you can run PRIM and CART.  This requires defining the names, or ‘keys’, of the uncertain parameters. If you created a fake model object, you can pass ‘include=model.uncertainties.keys()’ to the functions Prim() and Cart(). If not, you have to create your own list of ‘keys’ as I do below.


keys = []
for i in range(len(uncertainties)):
    keys.append(uncertainties[i].name)

# run PRIM and CART on metrics
p = Prim(results, metric, include=keys, coi="Profitable")
box = p.find_box()
box.show_details()
plt.show()

c = Cart(results, metrics[j], include=keys)
c.print_tree(coi="Profitable")
c.show_tree()
plt.show()

The above code creates the following two figures.

pasting_trajectory_24figure_1

 

If you had run the analysis using Sobol samples, you could use the SALib wrapper to calculate sensitivity indices and make bar charts or radial convergence plots of the results. (Note: My previous post did not show how to make these plots, but has since been updated. Check it out here.)


import seaborn as sns
from SALib.analyze import sobol
from SALib.util import read_param_file

# Read the parameter range file and Sobol samples
problem = read_param_file('FishGame/uncertain_params.txt')
param_values = np.loadtxt('FishGame/SobolSamples.txt')

# Load the first solution
Y = np.loadtxt('FishGame/SobolReeval/FishGame_Soln' + (soln_index+1) + '.obj')

# Evaluate sensitivity to the first objective, NPVharvest
obj_index = 0
Si = sobol.analyze(problem, Y[:,obj_index], calc_second_order=True, conf_level=0.95, print_to_console=False)
pretty_result = get_pretty_result(Si)

sns.set()
fig1 = pretty_result.plot()
fig2 = pretty_result.plot_sobol(threshold=0.01,groups={"Prey Growth Parameters" : ["a"],
        "Predator Growth Parameters" : ["b0"]})

def get_pretty_result(result):
    pretty_result = SAResult(result["names"] if "names" in result else problem["names"])

    if "S1" in result:
        pretty_result["S1"] = {k : float(v) for k, v in zip(problem["names"], result["S1"])}
    if "S1_conf" in result:
        pretty_result["S1_conf"] = {k : float(v) for k, v in zip(problem["names"], result["S1_conf"])}
    if "ST" in result:
        pretty_result["ST"] = {k : float(v) for k, v in zip(problem["names"], result["ST"])}
    if "ST_conf" in result:
        pretty_result["ST_conf"] = {k : float(v) for k, v in zip(problem["names"], result["ST_conf"])}
    if "S2" in result:
        pretty_result["S2"] = _S2_to_dict(result["S2"], problem)
    if "S2_conf" in result:
        pretty_result["S2_conf"] = _S2_to_dict(result["S2_conf"], problem)
    if "delta" in result:
        pretty_result["delta"] = {k : float(v) for k, v in zip(problem["names"], result["delta"])}
    if "delta_conf" in result:
        pretty_result["delta_conf"] = {k : float(v) for k, v in zip(problem["names"], result["delta_conf"])}
    if "vi" in result:
        pretty_result["vi"] = {k : float(v) for k, v in zip(problem["names"], result["vi"])}
    if "vi_std" in result:
        pretty_result["vi_std"] = {k : float(v) for k, v in zip(problem["names"], result["vi_std"])}
    if "dgsm" in result:
        pretty_result["dgsm"] = {k : float(v) for k, v in zip(problem["names"], result["dgsm"])}
    if "dgsm_conf" in result:
        pretty_result["dgsm_conf"] = {k : float(v) for k, v in zip(problem["names"], result["dgsm_conf"])}
    if "mu" in result:
        pretty_result["mu"] = {k : float(v) for k, v in zip(result["names"], result["mu"])}
    if "mu_star" in result:
        pretty_result["mu_star"] = {k : float(v) for k, v in zip(result["names"], result["mu_star"])}
    if "mu_star_conf" in result:
        pretty_result["mu_star_conf"] = {k : float(v) for k, v in zip(result["names"], result["mu_star_conf"])}
    if "sigma" in result:
        pretty_result["sigma"] = {k : float(v) for k, v in zip(result["names"], result["sigma"])}

    return pretty_result

def _S2_to_dict(matrix, problem):
    result = {}
    names = list(problem["names"])
    for i in range(problem["num_vars"]):
        for j in range(i+1, problem["num_vars"]):
            if names[i] not in result:
                result[names[i]] = {}
            if names[j] not in result:
                result[names[j]] = {}

            result[names[i]][names[j]] = result[names[j]][names[i]] = float(matrix[i][j])

    return result

soln1_obj1_barchartsoln1_obj1_radialplot
 

So don’t feel like you need to run your optimization and re-evaluation in Python in order to use Rhodium!

Rhodium – Open Source Python Library for (MO)RDM

Last year Dave Hadka introduced OpenMORDM (Hadka et al., 2015), an open source R package for Multi-Objective Robust Decision Making (Kasprzyk et al., 2013). If you liked the capabilities of OpenMORM but prefer coding in Python, you’ll be happy to hear Dave has also written an open source Python library for robust decision making (RDM) (Lempert et al., 2003), including multi-objective robust decision making (MORDM): Rhodium.

Rhodium is part of Project Platypus, which also contains Python libraries for multi-objective optimization (Platypus), a standalone version of the Patient Rule Induction method (PRIM) algorithm (Friedman and Fisher, 1999) implemented in Jan Kwakkel’s EMA Workbench, and a cross-language automation tool for running models (Executioner). Rhodium uses functions from both Platypus and PRIM, as I will briefly show here, but Jazmin will describe Platypus in more detail in a future blog post.

Dave provides an ipython notebook file with clear instructions on how to use Rhodium, with the lake problem given as an example. To give another example, I will walk through the fish game. The fish game is a chaotic predator-prey system in which the population of fish, x, and their predators, y, are co-dependent. Predator-prey systems are typically modeled by the classic Lotka-Volterra equations:

1) \frac{dx}{dt} = \alpha x - \beta x y

2) \frac{dy}{dt} = \delta x y - \gamma y_t

where α is the growth rate of the prey (fish), β is the rate of predation on the prey, δ is the growth rate of the predator, and γ is the death rate of the predator. This model assumes exponential growth of the prey, x, and exponential death of the predator. Based on a classroom exercise given at RAND, I modify the Lotka-Volterra model of the prey population for logistic growth (see the competitive Lotka-Volterra equations):

3) \frac{dx}{dt} = \alpha x - r x^2 - \beta x y

Discretizing equations 1 and 3 yields:

4) x_{t+1} = (\alpha + 1)x_t (1 - \frac{r}{\alpha + 1} x_t) - \beta x_t y_t and

5) y_{t+1} = (1 - \gamma)y_t + \delta x_t y_t

RAND simplifies equation 4 by letting a = α + 1, r/(α + 1) = 1 and β = 1, and simplifies equation 5 by letting b = 1/δ and γ = 1. This yields the following equations:

6) x_{t+1} = \alpha  x_t(1-x_t) - x_t y_t,

7) y_{t+1} = \frac{x_t y_t}{b}.

In this formulation, the parameter a controls the growth rate of the fish and b controls the growth rate of the predators. The growth rate of the predators is dependent on the temperature, which is increasing due to climate change according to the following equation:

8) C \frac{dT}{dt} = (F_0 + Ft) - \frac{T}{S}

where C is the heat capacity, assumed to be 50 W/m2/K/yr, F0 is the initial value of radiative forcing, assumed to be 1.0 W/m2, F is the rate of change of radiative forcing, S is the climate sensitivity in units of K/(W/m2), and T is the temperature increase from equilibrium, initialized at 0. The dependence of b on the temperature increase is given by:

9) b = \text{max} \Bigg( b_0 e^{-0.3T},0.25 \Bigg).

The parameters a, b, F, and S could all be considered deeply uncertain, but for this example I will use (unrealistically optimistic) values of F = 0 and S = 0.5 and assume possible ranges for a and b0 of 1.5 < a < 4 and 0.25 < b0 < 0.67. Within these bounds, different combinations of a and b parameters can lead to point attractors, strange attractors, or collapse of the predator population.

The goal of the game is to design a strategy for harvesting some number of fish, z, at each time step assuming that only the fish population can be observed, not the prey. The population of the fish then becomes:

10) x_{t+1} = \alpha x_t(1-x_t) - x_t y_t - z_t

For this example, I assume the user employs a strategy of harvesting some weighted average of the fish population in the previous two time steps:

11) z_t = \begin{cases}  \text{min} \Bigg( \alpha\beta x_t + \alpha(1-\beta)x_{t-1},x_{t} \Bigg),  t \geq 2\\  \alpha\beta x_t, t = 1  \end{cases}

where 0 ≤ α ≤ 1 and 0 ≤ β ≤ 1. The user is assumed to have two objectives: 1) to maximize the net present value of their total harvest over T time steps, and 2) to minimize the standard deviation of their harvests over T time steps:

12) Maximize: NPV = \sum^T_{t=1} 1.05^{-t} z_t

13) Minimize: s_z = \sqrt{\frac{1}{T-1} \sum^T_{t=1} (z_t - \bar{z})^2}.

As illustrated in the figure below, depending on the values of a and b0, the optimal Pareto sets for each “future” (each with initial populations of x0 = 0.48 and y0 = 0.26) can have very different shapes and attainable values.

Future 1 2 3 4 5
a 1.75 1.75 3.75 3.75 2.75
b0 0.6 0.3 0.6 0.3 0.45

fishfutures

For this MORDM experiment, I first optimize to an assumed state of the world (SOW) in which a = 1.75 and b = 0.6. To do this, I first have to write a function that takes in the decision variables for the optimization problem as well as any potentially uncertain model parameters, and returns the objectives. Here the decision variables are represented by the vector ‘vars’, the uncertain parameters are passed at default values of a=1.75, b0 = 0.6, F = 0 and S = 0.5, and the returned objectives are NPVharvest and std_z.


import os
import math
import json
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy.optimize import brentq as root
from rhodium import *
from rhodium.config import RhodiumConfig
from platypus import MapEvaluator

RhodiumConfig.default_evaluator = MapEvaluator()

def fishGame(vars,
    a = 1.75, # rate of prey growth
    b0 = 0.6, # initial rate of predator growth
    F = 0, # rate of change of radiative forcing per unit time
    S = 0.5): # climate sensitivity)

    # Objectives are:
    # 1) maximize (average NPV of harvest) and 
    # 2) minimize (average standard deviation of harvest)
    # x = population of prey at time 0 to t
    # y = population of predator at time 0 to t
    # z = harvested prey at time 1 to t

    tSteps = 100
    x = np.zeros(tSteps+1)
    y = np.zeros(tSteps+1)
    z = np.zeros(tSteps)

    # initialize predator and prey populations
    x[0] = 0.48
    y[0] = 0.26

    # Initialize climate parameters
    F0 = 1
    C = 50
    T = 0
    b = max(b0*np.exp(-0.3*T),0.25)

    # find harvest at time t based on policy
    z[0] = harvest(x, 0, vars)

    #Initialize NPV of harvest
    NPVharvest = 0

    for t in range(tSteps):
        x[t+1] = max(a*x[t]*(1-x[t]) - x[t]*y[t] - z[t],0)
        y[t+1] = max(x[t]*y[t]/b,0)
        if t < tSteps-1:
            z[t+1] = harvest(x, t+1, vars)

        NPVharvest = NPVharvest + z[t]*(1+0.05)**(-(t+1))

        #Calculate next temperature and b values
        T = T + (F0 + F*(t+1) - (1/S)*T)/C
        b = max(b0*np.exp(-0.3*T),0.25)

        # Calculate minimization objectives
        std_z = np.std(z)

    return (NPVharvest, std_z)

def harvest(x, t, vars):
    if t > 0:
        harvest = min(vars[0]*vars[1]*x[t] + vars[0]*(1-vars[1])*x[t-1],x[t])
    else:
        harvest = vars[0]*vars[1]*x[t]

    return harvest

Next, the model class must be defined, as well as its parameters, objectives (or “responses”) and whether they need to be minimized or maximized, decision variables (or “levers”) and uncertainties.


model = Model(fishGame)

# define all parameters to the model that we will be studying
model.parameters = [Parameter("vars"), Parameter("a"), Parameter("b0"), Parameter("F"), Parameter("S")]

# define the model outputs
model.responses = [Response("NPVharvest", Response.MAXIMIZE), Response("std_z", Response.MINIMIZE)]

# some parameters are levers that we control via our policy
model.levers = [RealLever("vars", 0.0, 1.0, length=2)]

# some parameters are exogeneous uncertainties, and we want to better
# understand how these uncertainties impact our model and decision making
# process
model.uncertainties = [UniformUncertainty("a", 1.5, 4.0), UniformUncertainty("b0", 0.25, 0.67)]

The model can then be optimized using a multi-objective evolutionary algorithm (MOEA) in Platypus, and the output written to a file. Here I use NSGA-II.


output = optimize(model, "NSGAII", 100)
with open("data.txt", "w") as f:
    json.dump(output, f)

The results can be easily visualized with simple commands. The Pareto sets can be plotted with ‘scatter2D’ or ‘scatter3D’, both of which allow brushing on one or more objective thresholds. Here I first brush on solutions with a NPV of harvest ≥ 1.0, and then add a condition that the standard deviation of harvest be ≤ 0.01.


# Use Seaborn settings for pretty plots
sns.set()

# Plot the points in 2D space
scatter2d(model, output)
plt.show()

# The optional interactive flag will show additional details of each point when
# hovering the mouse
# Most of Rhodiums's plotting functions accept an optional expr argument for
# classifying or highlighting points meeting some condition
scatter2d(model, output, x="NPVharvest", brush=Brush("NPVharvest >= 1.0"))
plt.show()

scatter2d(model, output, brush="NPVharvest >= 1.0 and std_z <= 0.01")
plt.show()

The above code creates the following images:

figure1figure2bfigure2

Rhodium can also plot Kernel density estimates of the solutions, or those attaining certain objective values.


# Kernel density estimation plots show density contours for samples. By
# default, it will show the density of all sampled points
kdeplot(model, output, x="NPVharvest", y="std_z")
plt.show()

# Alternatively, we can show the density of all points meeting one or more
# conditions
kdeplot(model, output, x="NPVharvest", y="std_z", brush=["NPVharvest >= 1.0", "std_z <= 0.01"], alpha=0.8)
plt.show()

figure4figure5

Scatterplots of all pairwise objective combinations can also be plotted, along with histograms of the marginal distribution of each objective illustrated in the pairwise scatterplots. These can also be brushed by objective thresholds specified by the user.


# Pairwise scatter plots shown 2D scatter plots for all outputs
pairs(model, output)
plt.show()

# We can also highlight points meeting one or more conditions
pairs(model, output, brush=["NPVharvest >= 1.0", "std_z <= 0.01"])
plt.show()

# Joint plots show a single pair of parameters in 2D, their distributions using
# histograms, and the Pearson correlation coefficient
joint(model, output, x="NPVharvest", y="std_z")
plt.show()

figure6figure7figure8

Finally, tradeoffs can also be viewed on parallel axes plots, which can also be brushed on user-specified objective values.


# A parallel coordinates plot to view interactions among responses
parallel_coordinates(model, output, colormap="rainbow", zorder="NPVharvest", brush=Brush("NPVharvest > 1.0")) 
plt.show()

figure10

But the real advantage of Rhodium is not visualization but uncertainty analysis. First, PRIM can be used to identify “boxes” best describing solutions meeting user-specified criteria. I define solutions with a NPV of harvest ≥ 1.0 as profitable, and those below unprofitable.


# The remaining figures look better using Matplotlib's default settings
mpl.rcdefaults()

# We can manually construct policies for analysis. A policy is simply a Python
# dict storing key-value pairs, one for each lever.
#policy = {"vars" : [0.02]*2}

# Or select one of our optimization results
policy = output[8]

# construct a specific policy and evaluate it against 1000 states-of-the-world
SOWs = sample_lhs(model, 1000)
results = evaluate(model, update(SOWs, policy))
metric = ["Profitable" if v["NPVharvest"] >= 1.0 else "Unprofitable" for v in results]

# use PRIM to identify the key uncertainties if we require NPVharvest >= 1.0
p = Prim(results, metric, include=model.uncertainties.keys(), coi="Profitable")
box = p.find_box()
box.show_details()
plt.show()

This will first show the smallest box with the greatest density but lowest coverage.

pasting_trajectory_26

Clicking on “Back” will show the next largest box with slightly lower density but greater coverage, while “Next” moves in the opposite direction. In this case, since the smallest box is shown, “Next” moves full circle to the largest box with the lowest density, but greatest coverage, and clicking “Next” from this figure will start reducing the box size.

pasting_trajectory_1

Classification And Regression Trees (CART; Breiman et al., 1984) can also be used to identify hierarchical conditional statements classifying successes and failures in meeting the user-specified criteria.

# use CART to identify the key uncertainties
c = Cart(results, metric, include=model.uncertainties.keys())
c.print_tree(coi="Profitable")
c.show_tree()
plt.show()

figure11

Finally, Dave has wrapped Rhodium around Jon Herman’s SALib for sensitivity analysis. Here’s an example of how to run the Method of Morris.


# Sensitivity analysis using Morris method
print(sa(model, "NPVharvest", policy=policy, method="morris", nsamples=1000, num_levels=4, grid_jump=2))

morrisprintedoutput

You can also create tornado and spider plots from one-at-a-time (OAT) sensitivity analysis.


# oat sensitivity
fig = oat(model, "NPVharvest",policy=policy,nsamples=1000)

oatplot

Finally, you can visualize the output of Sobol sensitivity analysis with bar charts of the first and total order sensitivity indices, or as radial plots showing the interactions between parameters. In these plots the filled circles on each parameter represent their first order sensitivity, the open circles their total sensitivity, and the lines between them the second order indices of the connected parameters. You can even create groups of similar parameters with different colors for easier visual analysis.

Si = sa(model, "NPVharvest", policy=policy, method="sobol", nsamples=1000, calc_second_order=True)
fig1 = Si.plot()
fig2 = Si.plot_sobol(threshold=0.01)
fig3 = Si.plot_sobol(threshold=0.01,groups={"Prey Growth Parameters" : ["a"],
            "Predator Growth Parameters" : ["b0"]})

sobolsi_plotfishgame_radialplotfishgame_radialplot_groups

As you can see, Rhodium makes MORDM analysis very simple! Now if only we could reduce uncertainty…

Works Cited

Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone (1984). Classification and Regression Trees. Wadsworth.

Friedman, J. H. and N. I. Fisher (1999). Bump-hunting for high dimensional data. Statistics and Computing, 9, 123-143.

Hadka, D., Herman, J., Reed, P., and Keller, K. (2015). An open source framework for many-objective robust decision making. Environmental Modelling & Software, 74, 114-129.

Kasprzyk, J. R., S. Nataraj, P. M. Reed, and R. J. Lempert (2013). Many objective robust decision making for complex environmental systems undergoing change. Environmental Modelling & Software, 42, 55-71.

Lempert, R. J. (2003). Shaping the next one hundred years: new methods for quantitative, long-term policy analysis. Rand Corporation.

Fitting Multivariate Normal Distributions

In water resources modeling, we often want to generate synthetic multivariate data for a simulation model in a way that preserves their marginal and joint behavior. The multivariate normal (MVN) distribution is a common model choice for these simulations because 1) it often arises naturally due to the Central Limit Theorem, 2) it has many useful properties that make data manipulation convenient, and 3) data can often be transformed to MVN even if that is not their underlying distribution.

As an example, let X be a K-dimensional random vector with 1 mean vector μ and K covariance matrix [Σ]. If X is multivariate normal, i.e. X~MVN(μ,[Σ]), its probability density function is the following:

mvnpdf

where det(·) denotes the determinant. The term (xμ)T[Σ]-1(xμ) is called the squared Mahalanobis distance, which measures how far away an observation x is from its distribution’s mean, scaled by a multi-dimensional measure of spread, [Σ]. It is therefore a measure of how far away from the mean each data vector is in a statistical sense, as opposed to Euclidean distance, which only measures distance in a physical sense. The two measures are related, though. If all of the K dimensions of X are independent, every non-diagonal element of [Σ] will be equal to 0, and the diagonals equal to the variances in each dimension, σk2 where k ϵ {1, 2, …, K}. In that case, the Mahalanobis distance is equal to the Euclidean distance after scaling each data vector by its standard deviation.

So how do we fit MVN distributions? Well, the MVN distribution has “four handy properties” (Wilks, 2011) that we can test. Here I will discuss two of them and how we can use these properties to test for multivariate normality. See Chapter 11 of Wilks (2011) for additional tests for multivariate normality.

Let X be a set of n joint observations of K variables. Denote each of the n observations xi = [xi1, xi2, …, xik] where i ϵ {1, 2, …, n} and each of the K marginals Xk = [xk1, xk2, …, xkn] where k ϵ {1, 2, …, K}. If X~MVN(μ,[Σ]), the following two properties (among others) hold:

  1. All marginal distributions of X are univariate normal, i.e. Xk~N(μk, σk2)
  2. The squared Mahalanobis distances, Di2 = (xiμ)T[Σ]-1(xiμ), have a χk2 distribution with k degrees of freedom.

So if we want to fit a MVN distribution to X, each of these will have to be true. Let’s look at an example where X is the standard deviation in daily flows during all historical Septembers at five different sites within the same basin. In this case, K=5 for the 5 sites and n=51, as there are 51 years in the historical record. To fit a MVN distribution to X, we’ll first want to ensure that the marginal distributions of the standard deviations in daily September flows are normal at each of the K sites. Let’s inspect these distributions visually with a histogram, first:

septhist

Clearly these distributions are not normal, as they are positively skewed. But that’s okay, we can transform the data so that they look more normal. The Box-Cox transformation is commonly used to transform non-normal data, X, to normal data, Y (see Chapter 3 of Wilks (2011) for more details):

boxcoxeqn

Using λ=0 (a log-transform), our transformed data look like this:

logsepthist

These look much better! We can perform a formal hypothesis test to confirm that each of these 5 transformed data series are not inconsistent with the normal distribution using a number of tests, such as the Shapiro-Wilk test, the Kolmogorov-Smirnov test, and the Filliben Q-Q correlation test, which I use here (see Chapter 5 of Wilks, 2011 for a description of other tests). The Filliben Q-Q test finds the correlation between the sample data quantiles and the theoretical quantiles of the distribution being fit. I’ve plotted these below; the correlation coefficients at these 5 sites are [0.9922, 0.9951, 0.9822, 0.9909, 0.9945].

logseptqq

Rejection regions for the Filliben Q-Q test for the normal distribution are tabulated for different significance levels and sample sizes based on Monte Carlo results. The relevant section of the table is copied below. For a sample size of n≈50, the site with the lowest correlation (Site 3: 0.9822) fails to reject the null hypothesis that the data are normal at the 10% level, as the rejection region is ρ≤0.981. This means that if the data were normal, there would be a 10% chance that a data series of length n=50 would have a correlation coefficient below 0.981.

criticalvalues

So now we know that none of the marginal distributions at each site is inconsistent with the normal distribution, but that does not guarantee that the joint distribution across sites will be multi-variate normal. There could be multi-variate outliers, or points which are not extreme within any particular site’s distribution, but are extreme within the context of the overall covariance structure. We test this by confirming that the squared Mahalanobis distances are not inconsistent with a χk2 distribution. Again, this can be done by comparing the sample data quantiles to the theoretical data quantiles (figure below). Here the correlation coefficient is 0.9964.

logseptchisqqq

Because the rejection regions will depend not only on the sample size (n) and significance levels, but also the number of degrees of freedom (k), there are no tabulated critical values for this test (there would need to be a separate table for every possible k). Instead of using a table, one has to perform a Monte Carlo simulation to calculate the critical region for their specific application. In this case, I did that by generating 10,000 random samples of length n=51 from a χ2 distribution with k=5 degrees of freedom. Of the generated samples, 97.8% had correlation coefficients less than the observed value of 0.9964 suggesting that this sample is very consistent with a χ52 distribution.

So now that we know the MVN is a good fit for the log-transformed standard deviations in daily September flows, we can estimate the model parameters. This part is easy! The MLE estimator of the mean vector μ is the sample mean vector = [1,2, … , k] of the data (in this case, the log-transformed data), while the MLE estimator of the covariance [Σ] is the sample covariance \left[S\right]= \frac{1}{N-1}\left[X^{'}\right]^{T}\left[X^{'}\right], where \left[X^{'}\right] = \frac{1}{N}\left[1\right]\left[X\right] with [1] being an N×N matrix of 1s and [X] an N×K  matrix of the data (log transformed here).

Below is Python code for all of the fitting and plotting done here.


from __future__ import division
import numpy as np
from matplotlib import pyplot as plt
import seaborn.apionly as sns
from scipy import stats

def fitMVN():
    # set plotting style
    sns.set_style("darkgrid")

    # load streamflow data
    Qdaily = np.loadtxt('data/Qdaily.txt')
    Nyears = int(np.shape(Qdaily)[0]/365)
    Nsites = np.shape(Qdaily)[1]
    Months = ['May','June','July','August','September','October','November','December','January','February','March','April']

    # calculate standard deviation in daily flows each month and squared Mahalanobis distances
    StdMonthly = calc_monthly_std(Qdaily, Nyears, Nsites)
    D2 = calcD2(Nyears, Nsites, np.log(StdMonthly))

    # calculate theoretical quantiles for a chi^2 distribution with dof = Nsites, and for the standard normal distribution
    m = np.array(range(1,Nyears+1))
    p = (m-0.5)/Nyears
    chi2 = stats.chi2.ppf(p,Nsites)
    norm = stats.norm.ppf(p,0,1)

    # initialize matrices to store correlation coefficients and significance levels for marginal normal distributions and chi^2 distributions
    normCorr = np.zeros([Nsites,12])
    norm_sigLevel = np.zeros([Nsites,12])
    chi2Corr = np.zeros([12])
    chi2_sigLevel = np.zeros([12])

    for i in range(len(Months)):
        # plot histograms of standard deviation of daily flows each month, and of their logs
        plotHistograms(Nsites, StdMonthly[:,:,i], 'Standard Deviation of Daily ' + Months[i] + ' Flows', Months[i] + 'Hist.png')
        plotHistograms(Nsites, np.log(StdMonthly[:,:,i]), 'log(Standard Deviation of Daily ' + Months[i] + ' Flows)', \
            'Log' + Months[i] + 'Hist.png')

        # plot QQ plots of standard deviation of daily flows each month, and of their logs
        plotNormQQ(Nsites, StdMonthly[:,:,i], norm, 'Standard Deviation of Daily ' + Months[i] + ' Flows', Months[i] + 'QQ.png')
        normCorr[:,i] = plotNormQQ(Nsites, np.log(StdMonthly[:,:,i]), norm, 'log(Standard Deviation of Daily ' + Months[i] + ' Flows)', 'Log' + Months[i] + 'QQ.png')

        # plot QQ plot of Chi Squared distribution of log of standard deviation in daily flows each month
        chi2Corr[i] = plotChi2QQ(Nsites, D2[:,i], chi2, 'D$\mathregular{^2}\!$ of log(Standard Deviation of Daily ' + Months[i] + ' Flows)', \
            'Log' + Months[i] + 'Chi2QQ.png')

        # find significance levels
        chi2_sigLevel[i] = chi2_MC(Nsites,Nyears,chi2,chi2Corr[i])
        norm_sigLevel[:,i] = norm_MC(Nsites,Nyears,norm,normCorr[:,i])

    np.savetxt('Norm_sigLevels.txt',np.transpose(norm_sigLevel))
    np.savetxt('Norm_corr.txt',np.transpose(normCorr))
    np.savetxt('Chi2_sigLevels.txt',chi2_sigLevel)
    np.savetxt('Chi2_corr.txt',chi2Corr)

    return None

def calc_monthly_std(Qdaily, Nyears, Nsites):
    Nmonths = 12
    # first month = May (1st month of water year)
    DaysPerMonth = np.array([31, 30, 31, 31, 30, 31, 30, 31, 31, 28, 31, 30])

    Qmonthly = np.zeros([Nsites, Nyears, Nmonths])
    StdMonthly = np.zeros([Nsites, Nyears, Nmonths])
    for year in range(Nyears):
        for month in range(Nmonths):
            start = year*365 + np.sum(DaysPerMonth[0:month])

            for i in range(Nsites):
                # find total flow each month
                Qmonthly[i,year,month] = 86400*np.sum(Qdaily[start:start+DaysPerMonth[month],i])

            # find standard deviation in daily flows each month
            for i in range(Nsites):
                for j in range(DaysPerMonth[month]):
                    StdMonthly[i,year,month] = StdMonthly[i,year,month] + \
                        (86400*Qdaily[start+j,i]-Qmonthly[i,year,month]/DaysPerMonth[month])**2

                StdMonthly[i,year,month] = np.sqrt((1/(DaysPerMonth[month]-1))*StdMonthly[i,year,month])

    return StdMonthly

def plotHistograms(Nsites, data, xlabel, filename):
    fig = plt.figure()
    for i in range(Nsites):
        ax = fig.add_subplot(1,Nsites,i+1)
        ax.hist(data[i,:],bins=10,color='navy',alpha=0.8)
        ax.set_title('Site ' + str(i+1),fontsize=16)

    fig.text(0.1, 0.5, 'Frequency', va='center', rotation='vertical', fontsize=14)
    fig.text(0.5, 0.04, xlabel, ha='center', fontsize=14)
    fig.subplots_adjust(bottom=0.15)
    fig.set_size_inches([22.525,4.825])
    fig.savefig('Hists/' + filename)
    fig.clf()

    return None

def plotNormQQ(Nsites, data, norm, title, filename):
    corr = np.zeros([Nsites])
    fig = plt.figure()
    for i in range(Nsites):
        corr[i] = np.corrcoef(np.sort(data[i,:]),norm)[0,1]
        z = (data[i,:] - np.mean(data[i,:]))/np.std(data[i,:])
        ax = fig.add_subplot(1,Nsites,i+1)
        ax.scatter(norm,np.sort(z))
        ax.plot([-3,3],[-3,3],c='r')
        ax.set_title('Site ' + str(i+1),fontsize=16)
        ax.set_xlim([-3,3])
        ax.set_ylim([-3,3])

    fig.text(0.1, 0.5, 'Sample Quantiles', va='center', rotation='vertical', fontsize=14)
    fig.text(0.5, 0.04, 'Theoretical Quantiles', ha='center', fontsize=14)
    fig.suptitle('Normal Q-Q Plot of ' + title,fontsize=16)
    fig.subplots_adjust(bottom=0.15,top=0.85)
    fig.set_size_inches([22.525,4.825])
    fig.savefig('QQplots/' + filename)
    fig.clf()

    return corr

def calcD2(Nyears, Nsites, data):
    D2 = np.zeros([Nyears,12])
    X = np.zeros([Nyears, Nsites])
    Xprime = np.zeros([Nyears,Nsites])
    S = np.zeros(Nsites)
    for i in range(12):
        # fill data matrix, X, for ith month
        for j in range(Nsites):
            X[:,j] = data[j,:,i]

        # calculate covariance matrix, S, for ith month
        Xprime = X - (1/Nyears)*np.dot(np.ones([Nyears,Nyears]),X)
        S = (1/(Nyears-1))*np.dot(np.transpose(Xprime),Xprime)

        #calculate Mahalanobis distance, D2, for each year's ith month
        for j in range(Nyears):
            D2[j,i] = np.dot(np.dot((X[j,:] - np.mean(X,0)),np.linalg.inv(S)),(np.transpose(X[j,:] - np.mean(X,0))))

    return D2

def plotChi2QQ(Nsites, data, chi2, title, filename):
    corr = np.corrcoef(np.sort(data),chi2)[0,1]
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.scatter(chi2,np.sort(data))
    ax.plot([0,1.1*np.max(chi2)],[0,1.1*np.max(chi2)],c='r')
    ax.set_xlabel('Theoretical Quantiles',fontsize=16)
    ax.set_xlim([0, 1.1*np.max(chi2)])
    ax.set_ylabel('Sample Quantiles',fontsize=16)
    ax.set_ylim([0,1.1*np.max(data)])
    ax.tick_params(axis='both',labelsize=14)
    ax.set_title(r'$\chi^2$' + ' Q-Q Plot of ' + title,fontsize=16)
    fig.savefig('QQplots/' + filename)
    fig.clf()

    return corr

def chi2_MC(Nsites,Nyears,theoretical,dataCorr):
    corr = np.zeros(10000)
    for i in range(10000): # 10,000 MC simulations
        simulated = stats.chi2.rvs(Nsites,size=Nyears)
        corr[i] = np.corrcoef(np.sort(simulated),theoretical)[0,1]

    # find significance levels
    corr = np.sort(corr)
    for i in range(10000):
        if dataCorr > corr[i]:
            sigLevel = (i+0.5)/10000

    return sigLevel

def norm_MC(Nsites,Nyears,theoretical,dataCorr):
    sigLevel = np.zeros(Nsites)
    corr = np.zeros([10000])
    for i in range(10000): # 10,000 MC simulations
        simulated = stats.norm.rvs(0,1,size=Nyears)
        corr[i] = np.corrcoef(np.sort(simulated),theoretical)[0,1]

    # find significance levels
    corr = np.sort(corr)
    for i in range(10000):
        for j in range(Nsites):
            if dataCorr[j] > corr[i]:
                sigLevel[j] = (i+0.5)/10000

    return sigLevel

fitMVN()

Root finding in MATLAB, R, Python and C++

In dynamical systems, we are often interested in finding stable points, or equilibria. Some systems have multiple equilibria. As an example, take the lake problem, which is modeled by the equation below where Xt is the lake P concentration, at are the anthropogenic P inputs, Yt~LN(μ,σ2)  are random natural P inputs, b is the P loss rate, and q is a shape parameter controlling the rate of P recycling from the sediment. The first three terms on the right hand side make up the “Inputs” in the figure, while the last term represents the “Outputs.” A lake is in equilibrium when the inputs are equal to the outputs and the lake P concentration therefore is not changing over time.

lakeModel

For irreversible lakes this occurs at three locations, even in the absence of anthropogenic and natural inputs: an oligotrophic equilibrium, an unstable equilibrium (called the critical P threshold) and a eutrophic equilibrium (see figure below).

PcritDiagram

The unstable equilibrium in this case is called the critical P threshold because once it is crossed, it is impossible to return to an oligotrophic equilibrium by reducing anthropogenic and natural P inputs alone. In irreversible lakes like this, we would therefore like to keep the lake P concentration below the critical P threshold. How do we find the critical P threshold? With a root finding algorithm!

As stated earlier, the system above will be in equilibrium when the inputs are equal to the outputs and the P concentration is not changing over time, i.e. when

X_{t+1} - X_t = \frac{X^q_t}{1+X^q_t} - bX_t = 0

Therefore we simply need to find the zero, or “root” of the above equation.  Most of the methods for this require either an initial estimate or upper and lower bounds on the location of the root. These are important, since an irreversible lake will have three roots. If we are only interested in the critical P threshold, we have to make sure that we provide an estimate which leads to the unstable equilibrium, not either of the stable equilibria. If possible, you should plot the function whose root you are finding to make sure you are giving a good initial estimate or bounds, and check afterward to ensure the root that was found is the one you want! Here are several examples of root-finding methods in different programming languages.

In MATLAB, roots can be found with the function fzero(fun,x0) where ‘fun’ is the function whose root you want to find, and x0 is an initial estimate. This function uses Brent’s method, which combines several root-finding methods: bisection, secant, and inverse quadratic interpolation. Below is an example using the lake problem.

myfun = @(x,b,q) x^q/(1+x^q)-b*x;
b = 0.42;
q = 2.0;
fun = @(x) myfun(x,b,q);
pcrit = fzero(fun,0.75);

This returns pcrit = 0.5445, which is correct. If we had provided an initial estimate of 0.25 instead of 0.75, we would get pcrit = 2.6617E-19, basically 0, which is the oligotrophic equilibrium in the absence of anthropogenic and natural P inputs. If we had used 1.5 as an initial estimate, we would get pcrit = 1.8364, the eutrophic equilibrium.

MatlabScreenShot

In R, roots can be found with the function uniroot, which also uses Brent’s method. Dave uses this on line 10 of the function lake.eval in his OpenMORDM example. Instead of taking in an initial estimate of the root, this function takes in a lower and upper bound. This is safer, as you at least know that the root estimate will lie within these bounds. Providing an initial estimate that is close to the true value should do well, but is less predictable; the root finding algorithm may head in the opposite direction from what is desired.

b <- 0.42
q <- 2.0
pcrit <- uniroot(function(x) x^q/(1+x^q) - b*x, c(0.01, 1.5))$root

This returns pcrit = 0.5445145. Good, we got the same answer as we did with MATLAB! If we had used bounds of c(0.75, 2.0) we would have gotten 1.836426, the eutrophic equilibrium.

What if we had given bounds that included both of these equilibria, say c(0.5, 2.0)? In that case, R returns an error: ‘f() values at end points not of opposite sign’. That is, if the value returned by f(x) is greater than 0 for the lower bound, it must be less than 0 for the upper bound and vice versa. In this case both f(0.5) and f(2.0) are greater than 0, so the algorithm fails. What if we gave bounds for which one is greater than 0 and another less, but within which there are multiple roots, say c(-0.5,2.0)? Then R just reports the first one it finds, in this case pcrit = 0.836437, the eutrophic equilibrium. So it’s important to make sure you pick narrow enough bounds that include the root you want, but not roots you don’t!

RscreenShot

In Python, you can use either scipy.optimize.root or scipy.optimize.brentq, which is what Jon uses on line 14 here. scipy.optimize.root can be used with several different algorithms, but the default is Powell’s hybrid method, also called Powell’s dogleg method. This function only requires an initial estimate of the root.

from scipy.optimize import root
b = 0.42
q = 2.0
pcrit = root(lambda x: x**(1+x**q) - b*x, 0.75)

scipy.optimize.root returns an object with several attributes. The attribute of interest to us is the root, represented by x, so we want pcrit.x. In this case, we get the correct value of 0.54454. You can play around with initial estimates to see how pcrit.x changes.

PythonScreenShot1

Not surprisingly, scipy.optimize.brentq uses Brent’s method and requires bounds as an input.

from scipy.optimize import brentq as root
b = 0.42
q = 2.0
pcrit = root(lambda x: x**(1+x**q) - b*x, 0.01, 1.5)

This just returns the root itself, pcrit = 0.5445. Again, you can play around with the bounds to see how this estimate changes.

PythonScreenShot2

In C++, Dave again shows how this can be done in the function ‘main-lake.cpp’ provided in the Supplementary Material to OpenMORDM linked from this page under the “Publications” section. On lines 165-168 he uses the bisect tool to find the root of the function given on lines 112-114. I’ve copied the relevant sections of his code into the function ‘find_Pcrit.cpp’ below.


#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <iostream>
#include <boost/math/tools/roots.hpp>

namespace tools = boost::math::tools;
using namespace std;

double b, q, pcrit;

double root_function(double x) {
  return pow(x,q)/(1+pow(x,q)) - b*x;
}

bool root_termination(double min, double max) {
  return abs(max - min) <= 0.000001;
}

int main(int argc, char* argv[])
{
  b = 0.42;
  q = 2.0;

  std::pair<double, double> result = tools::bisect(root_function, 0.01, 1.0, root_termination);
  pcrit = (result.first + result.second)/2;
  cout << pcrit << endl;
}

This yields the desired root of pcrit = 0.54454, but of course, changing the bounds may result in different estimates. In case you missed it, the take home message is to be careful about your initial estimate and bounds ;).

CPPscreenShot