Open exploration with the Exploratory Modelling Workbench

In this blog post, I will continue to showcase the functionality of the exploratory modelling workbench. In the previous blog post, I gave a general introduction to the workbench and showed how the Direct Policy Search example that comes with Rhodium can be adapted for use with the workbench. In this post, I will showcase how the workbench can be used for open exploration.

first a short background

In exploratory modeling, we are interested in understanding how regions in the uncertainty space and/or the decision space map to the whole outcome space, or partitions thereof. There are two general approaches for investigating this mapping. The first one is through systematic sampling of the uncertainty or decision space. This is sometimes also known as open exploration. The second one is to search through the space in a directed manner using some type of optimization approach. This is sometimes also known as directed search.

The workbench supports both open exploration and directed search. Both can be applied to investigate the mapping of the uncertainty space and/or the decision space to the outcome space. In most applications, search is used for finding promising mappings from the decision space to the outcome space, while exploration is used to stress test these mappings under a whole range of possible resolutions of the various uncertainties. This need not be the case, however. Optimization can be used to discover the worst possible scenario, while sampling can be used to get insight into the sensitivity of the outcomes to the various decision levers.

open exploration

To showcase the open exploration functionality, let’s start with a basic example using the DPS lake problem also used in the previous blog post. We are going to simultaneously sample over uncertainties and decision levers. We are going to generate 1000 scenarios and 5 policies, and see how they jointly affect the outcomes. A scenario is understood as a point in the uncertainty space, while a policy is a point in the decision space. The combination of a scenario and a policy is called an experiment. The uncertainty space is spanned by uncertainties, while the decision space is spanned by levers. Both uncertainties and levers are instances of RealParameter (a continuous range), IntegerParameter (a range of integers), or CategoricalParameter (an unordered set of categories). By default, the workbench uses Latin Hypercube sampling for generating both the scenarios and the policies. Each policy is always evaluated over all scenarios (i.e., a full factorial of scenarios and policies).

import numpy as np

from ema_workbench import (RealParameter, ScalarOutcome, Constant,
                           ReplicatorModel)

# lake_model is the Python function defined in the previous blog post
model = ReplicatorModel('lakeproblem', function=lake_model)
model.replications = 150

#specify uncertainties
model.uncertainties = [RealParameter('b', 0.1, 0.45),
                       RealParameter('q', 2.0, 4.5),
                       RealParameter('mean', 0.01, 0.05),
                       RealParameter('stdev', 0.001, 0.005),
                       RealParameter('delta', 0.93, 0.99)]

# set levers
model.levers = [RealParameter("c1", -2, 2),
                RealParameter("c2", -2, 2),
                RealParameter("r1", 0, 2),
                RealParameter("r2", 0, 2),
                RealParameter("w1", 0, 1)]

def process_p(values):
    values = np.asarray(values)
    values = np.mean(values, axis=0)
    return np.max(values)

#specify outcomes
model.outcomes = [ScalarOutcome('max_P', kind=ScalarOutcome.MINIMIZE,
                                function=process_p),
                  ScalarOutcome('utility', kind=ScalarOutcome.MAXIMIZE,
                                function=np.mean),
                  ScalarOutcome('inertia', kind=ScalarOutcome.MINIMIZE,
                                function=np.mean),
                  ScalarOutcome('reliability', kind=ScalarOutcome.MAXIMIZE,
                                function=np.mean)]

# override some of the defaults of the model
model.constants = [Constant('alpha', 0.41),
                   Constant('steps', 100)]

Next, we can perform experiments with this model.

from ema_workbench import (MultiprocessingEvaluator, ema_logging,
                           perform_experiments)
ema_logging.log_to_stderr(ema_logging.INFO)

with MultiprocessingEvaluator(model) as evaluator:
    results = evaluator.perform_experiments(scenarios=1000, policies=5)

Visual analysis

Having generated these results, the next step is to analyze them and see what we can learn from them. The workbench comes with a variety of techniques for this analysis. A simple first step is to make a few quick visualizations of the results. The workbench has convenience functions for this, but it is also possible to create your own visualizations using the scientific Python stack.

import matplotlib.pyplot as plt

from ema_workbench.analysis import pairs_plotting
fig, axes = pairs_plotting.pairs_scatter(results, group_by='policy',
                                         legend=False)
plt.show()

[Figure: pairs scatter plot of the four outcomes, grouped by policy]

Writing your own visualizations requires a more in-depth understanding of how the results from the workbench are structured. perform_experiments returns a tuple. The first item is a numpy structured array where each row is a single experiment. The second item contains the outcomes, structured in a dict with the name of the outcome as key and a numpy array as value. Experiments and outcomes are aligned based on index.

import pandas as pd
import seaborn as sns

experiments, outcomes = results

df = pd.DataFrame.from_dict(outcomes)
df = df.assign(policy=experiments['policy'])

# rename the policies using numbers
df['policy'] = df['policy'].map({p:i for i, p in
                                enumerate(set(experiments['policy']))})

# use seaborn to plot the dataframe
grid = sns.pairplot(df, hue='policy', vars=outcomes.keys())
ax = plt.gca()
plt.show()

[Figure: seaborn pairplot of the outcomes, colored by policy]

Often, it is convenient to separate the process of performing the experiments from the analysis. To make this possible, the workbench offers convenience functions for storing results to disk and loading them from disk. The workbench stores the results in a tarball with .csv files and separate metadata files. This is a convenient format that has proven sufficient over the years.

from ema_workbench import save_results

save_results(results, '1000 scenarios 5 policies.tar.gz')

from ema_workbench import load_results

results = load_results('1000 scenarios 5 policies.tar.gz')

advanced analysis

In addition to visual analysis, the workbench comes with a variety of techniques for performing a more in-depth analysis of the results. Other analyses can also be performed by utilizing the scientific Python stack. The workbench comes with

  • Scenario Discovery, a model driven approach to scenario development.
  • Dimensional stacking, a quick visual approach drawing on feature scoring to enable scenario discovery. This approach has received limited attention in the literature (Suzuki et al., 2015). The implementation in the workbench replaces the rule mining approach with a feature scoring approach.
  • Feature Scoring, a poor man’s alternative to global sensitivity analysis
  • Regional sensitivity analysis

Scenario Discovery

A detailed discussion of scenario discovery can be found in an earlier blog post. For completeness, I provide a code snippet here. Compared to the previous blog post, there is one small change. The library mpld3 is currently not being maintained and is broken on Python 3.5 and higher. To still use the interactive exploration of the trade-offs within the notebook, use the interactive back-end as shown below.

from ema_workbench.analysis import prim

experiments, outcomes = results

x = experiments
y = outcomes['max_P'] <0.8

prim_alg = prim.Prim(x, y, threshold=0.8)
box1 = prim_alg.find_box()

%matplotlib notebook

box1.show_tradeoff()
plt.show()

[Figure: interactive PRIM peeling trajectory showing the coverage-density trade-off]

%matplotlib inline
# back to the default, non-interactive back-end

box1.inspect(43)
box1.inspect(43, style='graph')
plt.show()

[Figure: inspection of box 43, showing the restricted uncertainty ranges]

dimensional stacking

Dimensional stacking was suggested as a more visual approach to scenario discovery. It involves two steps: identifying the most important uncertainties that affect system behavior, and creating a pivot table using the most influential uncertainties. Creating the pivot table involves binning the uncertainties. More details can be found in Suzuki et al. (2015) or by looking through the code in the workbench. Compared to the original paper, I use feature scoring for determining the most influential uncertainties. The code is set up in a modular way so other approaches to global sensitivity analysis can easily be used as well if so desired.

from ema_workbench.analysis import dimensional_stacking

x = experiments
y = outcomes['max_P'] <0.8

dimensional_stacking.create_pivot_plot(x,y, 2, nbins=3)
plt.show()

[Figure: dimensional stacking pivot plot for max_P < 0.8]

We can see from this visual that if b is low while q is high, we have a high concentration of cases where pollution stays below 0.8. The mean and delta have some limited additional influence. By playing around with the number of bins or the number of layers, patterns can be coarsened or refined.
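For example, the same data can be re-binned with three levels of stacking and five bins per uncertainty by changing the corresponding arguments of create_pivot_plot (a minimal variation on the call shown above):

# same x and y as above; three stacking levels, five bins per uncertainty
dimensional_stacking.create_pivot_plot(x, y, 3, nbins=5)
plt.show()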

regional sensitivity analysis

A third approach for supporting scenario discovery is to perform a regional sensitivity analysis. The workbench implements a visual approach based on plotting the empirical CDF given a classification vector. Please look at section 3.4 of Pianosi et al. (2016) for more details.

from ema_workbench.analysis import regional_sa
from numpy.lib import recfunctions as rf

x = rf.drop_fields(experiments, 'model', asrecarray=True)
y = outcomes['max_P'] < 0.8

regional_sa.plot_cdfs(x,y)
plt.show()

[Figure: regional sensitivity analysis CDF plots for each uncertainty]

feature scoring

Feature scoring is a family of techniques often used in machine learning to identify the most relevant features to include in a model. This is similar to one of the use cases for global sensitivity analysis, namely factor prioritisation. In some of the work ongoing in Delft, we are comparing feature scoring with Sobol and Morris, and the results are quite positive. The main advantage of feature scoring techniques is that they impose virtually no constraints on the experimental design, and they can handle real valued, integer valued, and categorical valued parameters. The workbench supports multiple techniques, the most useful of which is generally extra trees (Geurts et al., 2006).

For this example, we run feature scoring for each outcome of interest. We can also run it for a specific outcome if desired. Similarly, we can choose whether to run in regression mode or classification mode. The latter is applicable if the outcome is a categorical variable, and the results should be interpreted similarly to regional sensitivity analysis results. For more details, see the documentation.

from ema_workbench.analysis import feature_scoring

x = experiments
y = outcomes

fs = feature_scoring.get_feature_scores_all(x, y)
sns.heatmap(fs, cmap='viridis', annot=True)
plt.show()

[Figure: heatmap of feature scores for each input, per outcome]

From the results, we see that max_P is primarily influenced by b, while utility is driven by delta; for inertia and reliability the situation is a little less clear cut.
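If we are only interested in a single outcome, we can also bypass the workbench and apply extra trees from scikit-learn directly to the real-valued uncertainties. The snippet below is an illustrative sketch (it assumes scikit-learn is installed and that model, experiments, and outcomes are still in scope), not the workbench's own implementation:

from sklearn.ensemble import ExtraTreesRegressor

names = [u.name for u in model.uncertainties]
X = np.column_stack([experiments[name] for name in names])

# fit extra trees on a single outcome and report the resulting feature importances
et = ExtraTreesRegressor(n_estimators=100)
et.fit(X, outcomes['max_P'])

scores = pd.Series(et.feature_importances_, index=names)
print(scores.sort_values(ascending=False))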

linear regression

In addition to the prepackaged analyses that come with the workbench, it is also easy to rig up something quickly using the ever expanding scientific Python stack. Below is a quick example of performing a basic regression analysis on the results.

import scipy as sp
import scipy.stats  # ensures sp.stats is available

experiments, outcomes = results

for key, value in outcomes.items():
    params = model.uncertainties #+ model.levers[:]

    fig, axes = plt.subplots(ncols=len(params), sharey=True)

    y = value

    for i, param in enumerate(params):
        ax = axes[i]
        ax.set_xlabel(param.name)

        pearson = sp.stats.pearsonr(experiments[param.name], y)

        ax.annotate("r: {:6.3f}".format(pearson[0]), xy=(0.15, 0.85),
                    xycoords='axes fraction',fontsize=13)

        x = experiments[param.name]
        sns.regplot(x, y, ax=ax, ci=None, color='k',
                    scatter_kws={'alpha':0.2, 's':8, 'color':'gray'})

        ax.set_xlim(param.lower_bound, param.upper_bound)

    axes[0].set_ylabel(key)

plt.show()

[Figure: scatter plots with regression lines of each outcome against each uncertainty, annotated with Pearson correlations]

More advanced sampling techniques

The workbench can also be used with more advanced sampling techniques. To achieve this, it relies on SALib. On the workbench side, the only change is to specify the sampler we want to use. Next, we can use SALib directly to perform the analysis. To help with this, the workbench provides a convenience function for generating the problem dict that SALib expects. The example below focuses on performing Sobol analysis on the uncertainties, but we could do the exact same thing with the levers instead. The only changes required would be to set lever_sampling instead of uncertainty_sampling, and to get the SALib problem dict based on the levers.

from SALib.analyze import sobol
from ema_workbench.em_framework.salib_samplers import get_SALib_problem

with MultiprocessingEvaluator(model) as evaluator:
    sa_results = evaluator.perform_experiments(scenarios=1000,
                                               uncertainty_sampling='sobol')

experiments, outcomes = sa_results
problem = get_SALib_problem(model.uncertainties)

Si = sobol.analyze(problem, outcomes['max_P'],
                   calc_second_order=True, print_to_console=False)

Si_filter = {k:Si[k] for k in ['ST','ST_conf','S1','S1_conf']}
Si_df = pd.DataFrame(Si_filter, index=problem['names'])
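The resulting DataFrame can then be visualized directly with pandas and matplotlib; for example, a quick bar chart of the first order and total indices with their confidence intervals (a minimal sketch, not a workbench function):

indices = Si_df[['S1', 'ST']]
err = Si_df[['S1_conf', 'ST_conf']]

fig, ax = plt.subplots()
indices.plot.bar(yerr=err.values.T, ax=ax)
ax.set_ylabel('Sobol index for max_P')
plt.show()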

Animations (2/2)

In the second part of this two-part post we’ll learn to use different tools and techniques to visualize and save animations (first part here). All the code discussed here is available on a GitHub repository, here.

This part focuses on the moviepy Python library, and all the neat things one can do with it. There actually are some nice tutorials for when we have a continuous function t -> f(t) to work with (see here). Instead, we are often working with data structures that are indexed on time in a discrete way.

Moviepy can be used with any time-dependent data source, including netCDF data such as the data manipulated with VisIt in the first part of this post. But in this second part, we are instead going to focus on how to draw time-dependent trajectories to make sense of nonlinear dynamical systems, and then animate them as GIFs. I will use the well-known shallow lake problem and go through a first example with a detailed explanation of the code. Then I’ll finish with a second example showing trajectories.

Part I: using state trajectories to understand the concept of stable equilibria

The shallow lake problem is a classic problem in the management of coupled human and natural systems. Some human activities (e.g. agriculture) produce phosphorus that eventually ends up in water bodies such as lakes. Too much phosphorus in a lake causes a process called eutrophication, which usually destroys the lake’s diverse ecosystem (no more fish) and lowers water quality. A major problem is that eutrophication is difficult or even impossible to reverse: lowering phosphorus inputs to what they were pre-eutrophication simply won’t work. Simple nonlinear dynamics, first proposed by Carpenter et al. in 1999 (see here), describe the relationship between phosphorus inputs (L) and concentration (P). The first part of the code (uploaded to GitHub as movie1.py) reads:

import attractors
import numpy as np
from moviepy.video.io.bindings import mplfig_to_npimage
from moviepy.video.VideoClip import DataVideoClip
import matplotlib.pyplot as plt
import matplotlib.lines as mlines

# Lake parameters
b = 0.65
q = 4

# One step dynamic (P increment rate)
# arguments are current state x and lake parameters b,q and input l
def Dynamics(x, b, q, l):
    dp = (x ** q) / (1 + x ** q) - b * x + l
    return dp 

Here the first six lines contain the usual library imports. Note that I am importing an auxiliary Python module “attractors” to enable me to plot the attractors (see attractors.py on the GitHub repository). The function “Dynamics” corresponds to the evolution of P given L and the lake parameters b and q, also given in this bit of code. Then we introduce the time parameters:

# Time parameters
dt = 0.01 # time step
T = 40 # final horizon
nt = int(T/dt+1E-6) # number of time steps

To illustrate that lake phosphorus dynamics depend not only on the phosphorus inputs L but also on the initial phosphorus level, we are going to plot P trajectories for different constant values of L, and three cases regarding the initial P. We first introduce these initial phosphorus levels and the different input levels, then declare the arrays in which we’ll store the different trajectories.

# Initial phosphorus levels
pmin = 0
pmed = 1
pmax = 2.5

# Inputs levels
l = np.arange(0.001,0.401,0.005)

# Store trajectories
low_p = np.zeros([len(l),nt+1]) # Correspond to pmin
med_p = np.zeros([len(l),nt+1]) # Correspond to pmed
high_p = np.zeros([len(l),nt+1]) # Correspond to pmax 

Once that is done, we can use the attractors import to plot the equilibria of the lake problem. This bit of code is in the GitHub repository associated with this post, but I am not going to comment on it further here.
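For readers who do not want to open the repository, here is a hypothetical sketch of what computing these equilibria amounts to (the function name and search grid are mine, not those of attractors.py): for each input level l, the equilibria are the roots of the one-step dynamics.

from scipy.optimize import brentq

def equilibria(b, q, l, grid=np.linspace(0.0, 3.0, 301)):
    """Return the P values where Dynamics(P, b, q, l) equals zero."""
    f = lambda x: Dynamics(x, b, q, l)
    roots = []
    for lo, hi in zip(grid[:-1], grid[1:]):
        if f(lo) * f(hi) < 0:  # a sign change brackets a root
            roots.append(brentq(f, lo, hi))
    return roots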

After that we can generate the trajectories for P with constant L, and store them to the appropriate arrays:

# Generating the data: trajectories
def trajectory(b,q,p0,l,dt,T):
    # Declare outputs
    time = np.arange(0,T+dt,dt)
    traj = np.zeros(len(time))
    # Initialize traj
    traj[0] = p0
    # Fill traj with values
    for i in range(1,len(traj)):
        traj[i] = traj[i-1] + dt * Dynamics(traj[i-1],b,q,l)
    return traj

# Get them!
for i in range(len(l)):
    low_p[i,:] = trajectory(b,q,pmin,l[i],dt,T)
    med_p[i, :] = trajectory(b, q, pmed, l[i], dt, T)
    high_p[i,:] = trajectory(b,q,pmax,l[i],dt,T)

Now we are getting to the interesting part of making the plots for the animation. We need to declare a figure that all the frames in our animation will use (we don’t want the axes to wobble around). For that we use matplotlib / pyplot libraries:

# Draw animated figure
fig, ax = plt.subplots(1)
ax.set_xlabel('Phosphorus inputs L')
ax.set_ylabel('Phosphorus concentration P')
ax.set_xlim(0,l[-1])
ax.set_ylim(0,pmax)
line_low, = ax.plot(l,low_p[:,0],'.', label='State, P(0)=0')
line_med, = ax.plot(l,med_p[:,0],'.', label='State, P(0)=1')
line_high, = ax.plot(l,high_p[:, 0], '.', label='State, P(0)=2.5')

Once that is done, the last things we need to do before calling the core moviepy functions are to 1) define the parameters that manage time, and 2) have a function that makes frames for the instant that is being called.

For 1), we need to be careful because we are juggling different notions of time: a) time in the dynamics, b) the index of each instant in the dynamics (i.e., in the data, the arrays where we stored the trajectories), and c) time in the animation. We may also want to have a pause at the beginning or at the end of the GIF, rather than watch with tired eyes as the animation ruthlessly starts again before we realize what the hell happened. So here is how I declared all of this:

# Parameters of the animation
initial_delay = 0.5 # in seconds, delay where image is fixed before the animation
final_delay = 0.5 # in seconds, time interval where image is fixed at end of animation
time_interval = 0.25 # interval of time between two snapshots in the dynamics (time unit or non-dimensional)
fps = 20 # number of frames per second on the GIF
# Translation in the data structure
data_interval = int(time_interval/dt) # interval between two snapshots in the data structure
t_initial = -initial_delay*fps*data_interval
t_final = final_delay*fps*data_interval
time = np.arange(t_initial,low_p.shape[1]+t_final,data_interval) # time in the data structure

Now for 2), the function that makes the frames resets the parts of the plot that change across time indices (“t” below is the index in the data). If we don’t do that, the plot will keep the previously plotted elements and will grow messier as the animation goes on.

# Making frames
def make_frame(t):
    t = int(t)
    if t<0:
        return make_frame(0)
    elif t>nt:
        return make_frame(nt)
    else:
        line_low.set_ydata(low_p[:,t])
        line_med.set_ydata(med_p[:,t])
        line_high.set_ydata(high_p[:, t])
        ax.set_title(' Lake attractors, and dynamics at t=' + str(int(t*dt)), loc='left', x=0.2)
        if t > 0.25*nt:
            # lakeAttBase, eqList, stable and unstable come from the attractors
            # helper and the omitted equilibria-plotting code
            alpha = (t-0.25*nt) / (1.5*nt)
            lakeAttBase(eqList, 0.001, alpha=alpha)
            plt.legend(handles=[stable, unstable], loc=2)
        return mplfig_to_npimage(fig)

In the above, mplfig_to_npimage(fig) is a moviepy function that turns a figure into a frame of our GIF. Now we just have to call the frame-making function on the data, and turn the result into a GIF:

# Animating
animation = DataVideoClip(time,make_frame,fps=fps)
animation.write_gif("lake_attractors.gif",fps=fps)

Here the moviepy function DataVideoClip takes as arguments the sequence of indices defined by the vector “time” declared in the parameters of the animation, the “make_frame” routine we defined, and the number of frames per second we want to output. The last line integrates each frame into the GIF shown below:

[Figure: animated GIF of the lake attractors and dynamics]

Each point on the plot represents a different world (a different constant input level and a different initial phosphorus concentration), and the animation shows how these states converge towards a stable equilibrium point. The nonlinear lake dynamics make the initial concentration important in determining whether the final concentration is low (lower set of stable equilibria), or whether the lake ends up in a eutrophic state (upper set of stable equilibria).

Part II: plotting trajectories in the 2-D plane

Many trajectories can be plotted at the same time to understand the behavior of attractors, and to visualize the system dynamics for fixed human-controlled parameters (here, the phosphorus inputs L). Alternatively, if one changes the policy, trajectories evolve depending on both L and P, which changes how the trajectories themselves are defined.

I wrote a similar bit of code to show how one could plot trajectories in the 2-D plane. It is also uploaded to the GitHub repository (as movie2.py) and is similar in structure to the code above. What changes is the definition of the trajectories and where to store them. We define trajectories where the inputs are lowered at a constant rate, with a minimum input of 0.08. For three different initial states, that gives us the following animation, which illustrates how the system’s nonlinearity leads to very different trajectories even though the starting positions are close and the management policy is identical:

[Figure: animated GIF of lake trajectories in the (L, P) plane for three initial states]
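The exact trajectory definition lives in movie2.py; purely as a hypothetical sketch of the mechanism described above (the function name, arguments and rate parameter are mine, not those of the repository code), it could look like this:

# Hypothetical sketch (not the actual movie2.py): inputs L are lowered at a
# constant rate from an initial level l0 down to a floor of 0.08, while P
# follows the same one-step dynamics as before.
def trajectory_2d(b, q, p0, l0, rate, dt, T, l_min=0.08):
    n = int(T / dt) + 1
    P = np.zeros(n)
    L = np.zeros(n)
    P[0], L[0] = p0, l0
    for i in range(1, n):
        L[i] = max(L[i-1] - rate * dt, l_min)
        P[i] = P[i-1] + dt * Dynamics(P[i-1], b, q, L[i-1])
    return L, P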

 

This could easily be extended to trajectories in higher dimensional planes, with and without sets of equilibria to guide our eyes.


Animations (1/2)

In the first part of this two-part post we’ll learn to use different tools and techniques to visualize and save animations (second part here). This first part focuses on VisIt, an open-source visualization software developed by the Lawrence Livermore National Laboratory – and many others (…that’s an advantage of being open-source). There are a lot of cool things about VisIt, and you can read about them here or here, and probably in future posts. But for today, we’ll just learn how to use VisIt to play and save animations. VisIt supports over a hundred file formats, and this post focuses on reading spatially distributed and time-varying data stored in the netCDF format (more about netCDF here).

Let us assume we have VisIt installed and the VisIt path has been added to our bash profile. Then we just need to cd into the directory where our data are stored, and type visit in the command line to launch the program. VisIt comes with two windows, one for manipulating files and the data stored in them, and another where plots are drawn (it is possible to draw more plots in additional windows, but in this tutorial we’ll stick to one window). Both windows are visible in the screenshot below.

[Screenshot: the two VisIt windows, with the file/plot manager on the left and the plot window on the right]

It may be difficult to see, but the left-hand window is the space for managing files and plots. The active “Plot” is also displayed. Here the active file is “air.2m.2000.nc”, where .nc is the extension for netCDF files. The plot is a “Pseudocolor” of that data, the equivalent of “contourf” in Matlab or Python’s Matplotlib. It is drawn in the right-hand window: the worldwide distribution of daily air temperature (in Kelvin) at a 1 degree resolution for January 1, 2000 (data from NCEP). Note the toolbar above the plot: it contains “play / stop” buttons that allow us to play the 365 other days of year 2000 (the cursor is on “stop” in the screenshot). We are going to save the animation of the whole year with the VisIt movie wizard.

But first, let us change the specifications of our plot. In the left-hand window, we go to PlotAtts -> Pseudocolor (recall that is the plot type we have). We can change the colorbar, set minima and maxima so that the colorbar does not change for each day of the year when we save the movie, and even make the scale skewed so we can more easily see the differences in temperature in non-polar regions where people live (in the image above the temperature in the middle of the colorbar is 254.9 K, which corresponds to -18.1 C or roughly -1 F). The screenshot below shows our preferences. We click “Apply” for the changes to take effect, and exit the window.

[Screenshot: the Pseudocolor plot attributes window with the chosen colorbar settings]

Now we can save the movie. We go to File -> Save Movie and follow the VisIt Movie Wizard. You will find it very intuitive and easy. We save a “New Simple Movie” in DivX (other formats are available), with 20 frames (i.e. 20 days) per second. The resulting movie is produced within 2 minutes, but since WordPress does not support DivX unless I am willing to pay for premium access, I had to convert the result to a GIF. To be honest, this is a huge letdown as the quality suffers compared to a video (unless I want to upload a 1 GB monster). But the point of this post is that VisIt is really easy to use, not that you should convert its results into GIFs.

[Figure: GIF of the resulting animation of daily air temperature over the year 2000]

In this animation it is easy to see the outline of land masses as they get either warmer or cooler than the surrounding oceans! Videos would look neater: you can try them for yourself! (at home, wherever you want).

It is actually possible to command VisIt using Python scripts, but I haven’t mastered that yet so it will be a tale for another post.

An Introduction to Copulas

Modeling multivariate probability distributions can be difficult when the marginal probability density functions of the component random variables are different. Copulas are a useful tool for modeling dependence between random variables with any marginal distributions. This post will introduce the idea of a copula, run through the basic math that underlies its composition, and discuss some common copulas in use. While researching this post I found several comprehensive coding examples in Matlab, Python and R, so instead of creating my own I’m focusing on a theoretical introduction to copulas and will link to the coding tutorials at the end.

What is a copula?

The word copula is derived from the Latin noun for a link or a tie (as is the English word “couple”), and its purpose is to describe the dependence structure between random variables. Sklar’s theorem states that “Any multivariate joint distribution can be written in terms of univariate marginal distribution functions and a copula which describes the dependence structure between the two variables” [1].
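In symbols, for a joint CDF F with marginal CDFs F_1, …, F_k, Sklar’s theorem guarantees the existence of a copula C such that

$F(x_1, \ldots, x_k) = C\big(F_1(x_1), \ldots, F_k(x_k)\big)$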

To understand how a copula can describe the dependence structure between random variables, it’s helpful to first review some simple statistics.

Let X be a random variable with continuous CDF F_X, and define U = F_X(X). Then U is uniformly distributed on [0,1]:

$P(U \le u) = P(F_X(X) \le u) = P(X \le F_X^{-1}(u)) = F_X(F_X^{-1}(u)) = u, \quad u \in [0,1]$

The above statements can be summarized as saying that the values of the CDF of any marginal distribution are uniformly distributed on the interval [0,1], i.e. if you make a random draw from any distribution, you have the same probability of drawing the largest value (U=1) of that distribution as the smallest possible value (U=0) or the median value (U=0.5).

So what does this have to do with copulas? Great question: a copula is actually a joint distribution of the CDFs of the random variables it is modeling. Put formally:

A k-dimensional copula is a function $C:[0,1]^k \rightarrow [0,1]$ that is a CDF with uniform marginals [1].

$C(u_1, \ldots, u_k) = P(U_1 \le u_1, \ldots, U_k \le u_k)$, where each $U_i = F_i(X_i)$ is uniform on [0,1].

So now that we’ve defined what a copula is, let’s take a look at the form of some commonly used ones.

The Gaussian Copula

The Gaussian copula takes the form:

$C^{Gauss}_R(u_1, \ldots, u_k) = \Phi_R\big(\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_k)\big)$

Where:

Φ_R is the joint standard normal CDF with mean vector 0 and correlation matrix R (with entries ρ_(i,j)).

ρ_(i,j) is the correlation between random variables X_i and X_j.

Φ^-1 is the inverse standard normal CDF.

It’s important to note that the Pearson correlation coefficient is a bad choice for ρ here; a rank-based correlation such as Spearman’s ρ or Kendall’s τ is a better option, since these are scale invariant and do not require linearity.
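The coding tutorials linked at the end go into much more depth, but for orientation, here is a minimal NumPy/SciPy sketch of sampling from a Gaussian copula with arbitrary marginals (the correlation value and the gamma/lognormal marginals are chosen purely for illustration):

import numpy as np
from scipy import stats

# correlation matrix R for the Gaussian copula (use a rank-based estimate in practice)
R = np.array([[1.0, 0.7],
              [0.7, 1.0]])

# 1. draw from the joint standard normal with correlation R
z = stats.multivariate_normal(mean=[0.0, 0.0], cov=R).rvs(size=1000)

# 2. push each margin through the standard normal CDF, giving uniform U = Phi(Z)
u = stats.norm.cdf(z)

# 3. apply the inverse CDFs of the desired marginals
x1 = stats.gamma(a=2.0, scale=1.5).ppf(u[:, 0])
x2 = stats.lognorm(s=0.5).ppf(u[:, 1])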

Issues with Tail Dependency

The Gaussian copula is a helpful tool and relatively easy to fit, even for relatively large numbers of RVs with different marginal distributions. Gaussian copulas, however, do not do a good job capturing tail dependence, and can cause one to underestimate the risk of simultaneously being in the tails of each distribution. The failure of the Gaussian copula to capture tail dependence has been blamed for contributing to the 2008 financial crisis after it was widely used by investment firms on Wall Street (this is actually a really interesting story; for more details check out this article from the Financial Times).

Tail dependence can be quantified by the coefficients of upper and lower tail dependence (λ_u and λ_l), defined as:

$\lambda_u = \lim_{q \rightarrow 1^-} P\big(X_2 > F_2^{-1}(q) \mid X_1 > F_1^{-1}(q)\big)$

$\lambda_l = \lim_{q \rightarrow 0^+} P\big(X_2 \le F_2^{-1}(q) \mid X_1 \le F_1^{-1}(q)\big)$

The Student t Copula

Like the Student t distribution, the Student t copula has a similar shape to the Gaussian copula but with fatter tails, so it does a slightly better job of capturing tail dependence.

$C^t_{\nu,\Sigma}(u_1, \ldots, u_k) = t_{\nu,\Sigma}\big(t_\nu^{-1}(u_1), \ldots, t_\nu^{-1}(u_k)\big)$

Where:

t_{ν,Σ} is the joint Student t CDF, Σ is the covariance matrix (again, don’t use the Pearson correlation coefficient), ν is the degrees of freedom, and t^{-1}_ν is the inverse Student t CDF.

Archimedean Copulas

Archimedean copulas are a family of copulas with the following form:

$C(u_1, \ldots, u_k \mid \theta) = \psi^{-1}\big(\psi(u_1 \mid \theta) + \cdots + \psi(u_k \mid \theta) \mid \theta\big)$

where ψ(u|θ) is called the generator function and θ is the parameter of the copula.

Three common Archimedean copulas are:

  • Gumbel: which is good at modeling upper tail dependence
  • Clayton: which is good at modeling lower tail dependence
  • Frank: has lighter tails and more density in the middle

It’s important to note that these copulas are usually employed for bivariate cases; for more than two variables, the Gaussian or Student t copulas are usually used.

A comparison of the shape of the copulas above can be found in Figure 1.

[Figure 1: comparison of the shapes of the copulas discussed above. Image source: wikipedia.org]

Coding Copulas

There are numerous packages for modeling copulas in Matlab, Python and R.

In Matlab, the Statistics and Machine Learning Toolbox has some helpful functions. You can find some well narrated examples of copulas here. There’s also the Multivariate Copula Analysis Toolbox from UC Irvine.

In Python, the copulalib package can be used to model the Clayton, Frank and Gumbel copulas. The statsmodels package also has copulas built in. I found this post on the copulalib package; it has an attached Jupyter notebook with nice coding examples and figures. Here’s a post on the statsmodels copula implementation, along with an example Jupyter notebook.

Finally, here’s an example of coding copulas in R using the copulas library.

References and Acknowledgments

  1. Rüschendorf, L. (2013). Mathematical risk analysis dependence, risk bounds, optimal allocations and portfolios. Berlin: Springer.

I’d like to note that the majority of the content in the post came from Scott Steinschneider’s excellent course, BEE 6940: Multivariate Analysis, at Cornell.

Using the Exploratory Modelling Workbench

Over the last 7 years, I have been working on the development of an open source toolkit for supporting decision-making under deep uncertainty. This toolkit is known as the exploratory modeling workbench. The motivation for this name is that, in my opinion, all model-based deep uncertainty approaches are forms of exploratory modeling, as first introduced by Bankes (1993). The design of the workbench has undergone various changes over time, but it started to stabilize in the fall of 2016. This summer, I published a paper detailing the workbench (Kwakkel, 2017). There is an in-depth example in the paper, but in a series of blog posts I want to showcase the functionality in some more detail.

The workbench is readily available through pip, but it requires ipyparallel and mpld3 (both available through conda), SALib (via pip), and optionally platypus (pip install directly from the GitHub repo).

Adapting the DPS example from Rhodium

As a starting point, I will use the Direct Policy Search example that is available for Rhodium (Quinn et al 2017). I will adapt this code to work with the workbench. In this way, I can explain the workbench, as well as highlight some of the main differences between the workbench and Rhodium.

import math

import numpy as np
from scipy.optimize import brentq as root

# A function for evaluating our cubic DPS. This is based on equation (12)
# from [1].
def evaluateCubicDPS(policy, current_value):
    value = 0

    for i in range(policy["length"]):
        rbf = policy["rbfs"][i]
        value += rbf["weight"] * abs((current_value - rbf["center"]) / rbf["radius"])**3

    value = min(max(value, 0.01), 0.1)
    return value

# Construct the lake problem
def lake_problem(policy, # the DPS policy
                 b = 0.42, # decay rate for P in lake (0.42 = irreversible)
                 q = 2.0, # recycling exponent
                 mean = 0.02, # mean of natural inflows
                 stdev = 0.001, # standard deviation of natural inflows
                 alpha = 0.4, # utility from pollution
                 delta = 0.98, # future utility discount rate
                 nsamples = 100, # monte carlo sampling of natural inflows
                 steps = 100): # the number of time steps (e.g., days)
    Pcrit = root(lambda x: x**q/(1+x**q) - b*x, 0.01, 1.5)
    X = np.zeros((steps,))
    decisions = np.zeros((steps,))
    average_daily_P = np.zeros((steps,))
    reliability = 0.0
    utility = 0.0
    inertia = 0.0

    for _ in range(nsamples):
        X[0] = 0.0

        natural_inflows = np.random.lognormal(
                math.log(mean**2 / math.sqrt(stdev**2 + mean**2)),
                math.sqrt(math.log(1.0 + stdev**2 / mean**2)),
                size=steps)

        for t in range(1,steps):
            decisions[t-1] = evaluateCubicDPS(policy, X[t-1])
            X[t] = (1-b)*X[t-1] + X[t-1]**q/(1+X[t-1]**q) + decisions[t-1] + natural_inflows[t-1]
            average_daily_P[t] += X[t]/float(nsamples)

        reliability += np.sum(X < Pcrit)/float(steps) 
        utility += np.sum(alpha*decisions*np.power(delta,np.arange(steps)))
        inertia += np.sum(np.diff(decisions) > -0.01)/float(steps-1)

    max_P = np.max(average_daily_P)
    reliability /= float(nsamples)
    utility /= float(nsamples)
    inertia /= float(nsamples)

    return (max_P, utility, inertia, reliability)

The formulation of the decision rule assumes that policy is a dict, which is composed of a set of variables generated either through sampling or through optimization. This is relatively straightforward to do in Rhodium, but not so easy to do in the workbench. In the workbench, a policy is a composition of policy levers, where each policy lever is either a range of real values, a range of integers, or an unordered set of categories. To adapt the DPS version of the lake problem to work with the workbench, we first have to replace the policy dict with the different variables explicitly.

def get_antropogenic_release(xt, c1, c2, r1, r2, w1):
    '''
    Parameters
    ----------
    xt : float
         pollution in lake at time t
    c1 : float
         center rbf 1
    c2 : float
         center rbf 2
    r1 : float
         radius rbf 1
    r2 : float
         radius rbf 2
    w1 : float
         weight of rbf 1

    note:: w2 = 1 - w1

    '''
    # cubic RBF release rule, mirroring evaluateCubicDPS above
    rule = w1*(abs((xt-c1)/r1))**3 + (1-w1)*(abs((xt-c2)/r2))**3
    at = min(max(rule, 0.01), 0.1)
    return at

Next, we need to adapt the lake_problem function itself to use this adapted version of the decision rule. This requires two changes: replace policy in the function signature of the lake_model function with the actual underlying parameters c1, c2, r1, r2, and w1, and use these when calculating the anthropogenic pollution rate.

def lake_model(b=0.42, q=2.0, mean=0.02, stdev=0.001, alpha=0.4, delta=0.98,
               c1=0.25, c2=0.25, r1=0.5, r2=0.5, w1=0.5, nsamples=100,
               steps=100):
    Pcrit = root(lambda x: x**q/(1+x**q) - b*x, 0.01, 1.5)
    X = np.zeros((steps,))
    decisions = np.zeros((steps,))
    average_daily_P = np.zeros((steps,))
    reliability = 0.0
    utility = 0.0
    inertia = 0.0

    for _ in range(nsamples):
        X[0] = 0.0

        natural_inflows = np.random.lognormal(
                math.log(mean**2 / math.sqrt(stdev**2 + mean**2)),
                math.sqrt(math.log(1.0 + stdev**2 / mean**2)),
                          size=steps)

        for t in range(1,steps):
            decisions[t-1] = get_antropogenic_release(X[t-1], c1, c2, r1, r2, w1)
            X[t] = (1-b)*X[t-1] + X[t-1]**q/(1+X[t-1]**q) + decisions[t-1] + natural_inflows[t-1]
            average_daily_P[t] += X[t]/float(nsamples)

        reliability += np.sum(X < Pcrit)/float(steps)
        utility += np.sum(alpha*decisions*np.power(delta,np.arange(steps)))
        inertia += np.sum(np.diff(decisions) > -0.01)/float(steps-1)

    max_P = np.max(average_daily_P)
    reliability /= float(nsamples)
    utility /= float(nsamples)
    inertia /= float(nsamples)

    return (max_P, utility, inertia, reliability)

This version of the code can already be combined with the workbench. However, we can clean it up a bit more if we want to. Note how there are two for loops in the lake model. The outer loop generates stochastic realizations of the natural inflow, while the inner loop calculates the dynamics of the system given a stochastic realization. The workbench can be made responsible for this outer loop.

A quick note on terminology is in order here. I have a background in transport modeling. Here we often use discrete event simulation models. These are intrinsically stochastic models. It is standard practice to run these models several times and take descriptive statistics over the set of runs. In discrete event simulation, and also in the context of agent based modeling, this is known as running replications. The workbench adopts this terminology and draws a sharp distinction between designing experiments over a set of deeply uncertain factors, and performing replications of each experiment to cope with stochastic uncertainty.

Some other notes on the code:

  • To aid in debugging functions, it is good practice to make a function deterministic. In this case we can quite easily achieve this by including an optional argument for setting the seed of the random number generation.
  • I have slightly changed the formulation of inertia, so that it is closer to the mathematical formulation used in the various papers.
  • I have changed the for loop over t to get rid of virtually all the t-1 formulations.

 

from __future__ import division # python2
import math
import numpy as np
from scipy.optimize import brentq

def lake_model(b=0.42, q=2.0, mean=0.02, stdev=0.001, alpha=0.4,
               delta=0.98, c1=0.25, c2=0.25, r1=0.5, r2=0.5,
               w1=0.5, nsamples=100, steps=100, seed=None):
    '''runs the lake model for 1 stochastic realisation using the specified
       random seed.

    Parameters
    ----------
    b : float
        decay rate for P in lake (0.42 = irreversible)
    q : float
        recycling exponent
    mean : float
        mean of natural inflows
    stdev : float
        standard deviation of natural inflows
    alpha : float
        utility from pollution
    delta : float
        future utility discount rate
    c1 : float
    c2 : float
    r1 : float
    r2 : float
    w1 : float
    steps : int
        the number of time steps (e.g., days)
    seed : int, optional
        seed for the random number generator
    '''
    np.random.seed(seed)

    Pcrit = brentq(lambda x: x**q/(1+x**q) - b*x, 0.01, 1.5)
    X = np.zeros((steps,))
    decisions = np.zeros((steps,))

    X[0] = 0.0

    natural_inflows = np.random.lognormal(
                math.log(mean**2 / math.sqrt(stdev**2 + mean**2)),
                math.sqrt(math.log(1.0 + stdev**2 / mean**2)),
                size=steps)

    for t in range(steps-1):
        decisions[t] = get_antropogenic_release(X[t], c1, c2, r1, r2, w1)
        X[t+1] = (1-b)*X[t] + X[t]**q/(1+X[t]**q) + decisions[t] + natural_inflows[t]

    reliability = np.sum(X < Pcrit)/steps
    utility = np.sum(alpha*decisions*np.power(delta,np.arange(steps)))

    # note that I have slightly changed this formulation to retain
    # consistency with the equations in the papers
    inertia = np.sum(np.abs(np.diff(decisions)) < 0.01)/(steps-1)
    return X, utility, inertia, reliability

Now we are ready to connect this model to the workbench. This is fairly similar to how you would do it with Rhodium. We have to specify the uncertainties, the outcomes, and the policy levers. For the uncertainties and the levers, we can use real valued parameters, integer valued parameters, and categorical parameters. For outcomes, we can use either scalar, single valued outcomes or time series outcomes. For convenience, we can also explicitly control constants in case we want to have them set to a value different from their default value.

In this particular case, we are running the replications with the workbench. We still have to specify the descriptive statistics we would like to gather over the set of replications. For this, we can pass a function to an outcome. This function will be called with the results over the set of replications.

import numpy as np
from ema_workbench import (RealParameter, ScalarOutcome, Constant,
                           ReplicatorModel)

model = ReplicatorModel('lakeproblem', function=lake_model)
model.replications = 150

#specify uncertainties
model.uncertainties = [RealParameter('b', 0.1, 0.45),
                       RealParameter('q', 2.0, 4.5),
                       RealParameter('mean', 0.01, 0.05),
                       RealParameter('stdev', 0.001, 0.005),
                       RealParameter('delta', 0.93, 0.99)]

# set levers
model.levers = [RealParameter("c1", -2, 2),
                RealParameter("c2", -2, 2),
                RealParameter("r1", 0, 2),
                RealParameter("r2", 0, 2),
                RealParameter("w1", 0, 1)]

def process_p(values):
    values = np.asarray(values)
    values = np.mean(values, axis=0)
    return np.max(values)

#specify outcomes
model.outcomes = [ScalarOutcome('max_P', kind=ScalarOutcome.MINIMIZE,
                                function=process_p),
                  ScalarOutcome('utility', kind=ScalarOutcome.MAXIMIZE,
                                function=np.mean),
                  ScalarOutcome('inertia', kind=ScalarOutcome.MINIMIZE,
                                function=np.mean),
                  ScalarOutcome('reliability', kind=ScalarOutcome.MAXIMIZE,
                                function=np.mean)]

# override some of the defaults of the model
model.constants = [Constant('alpha', 0.41),
                   Constant('steps', 100)]

Open exploration

Now that we have specified the model with the workbench, we are ready to perform experiments on it. We can use evaluators to distribute these experiments either over multiple cores on a single machine, or over a cluster using ipyparallel. Using any parallelization is an advanced topic, in particular if you are on a Windows machine. The code as presented here will run fine in parallel on a Mac or Linux machine. If you are trying to run this in parallel using multiprocessing on a Windows machine from within a Jupyter notebook, it won’t work. The solution is to move lake_model and get_antropogenic_release to a separate Python module and import the lake model function into the notebook.
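A minimal sketch of that workaround (the module name dps_lake_model.py is only an example):

# dps_lake_model.py -- contains lake_model and get_antropogenic_release

# in the notebook:
from dps_lake_model import lake_model

model = ReplicatorModel('lakeproblem', function=lake_model)
model.replications = 150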

Another common practice when working with the exploratory modeling workbench is to turn on the logging functionality that it provides. This reports on the progress of the experiments and provides more insight into what is happening, in particular in case of errors.

If we want to perform experiments on the model we have just defined, we can use the perform_experiments method on the evaluator, or the stand-alone perform_experiments function. We can perform experiments over the uncertainties and/or over the levers. Any policy is evaluated over each of the scenarios, so if we use 100 scenarios and 10 policies, this means that we will end up performing 100 * 10 = 1000 experiments. By default, the workbench uses Latin hypercube sampling both for sampling over the levers and for sampling over the uncertainties. However, the workbench also offers support for full factorial, partial factorial, and Monte Carlo sampling, as well as wrappers for the various sampling schemes provided by SALib.

from ema_workbench import (MultiprocessingEvaluator, ema_logging,
                           perform_experiments)
ema_logging.log_to_stderr(ema_logging.INFO)

with MultiprocessingEvaluator(model) as evaluator:
    results = evaluator.perform_experiments(scenarios=10, policies=10)

Directed Search

Similarly, we can easily use the workbench to search for a good candidate strategy. This requires that platypus is installed. If it is, we can simply use the optimize method. By default, the workbench will use $\epsilon$-NSGAII. The workbench can be used to search over the levers in order to find a good candidate strategy, as is common in Many-Objective Robust Decision Making. The workbench can also be used to search over the uncertainties in order to find, for example, the worst possible outcomes and the conditions under which they appear. This is a form of worst case discovery. The optimize method takes an optional reference argument. This can be used to set the scenario for which you want to find good policies, or to set the policy for which you want to find the worst possible outcomes. This makes implementing the approach suggested in Watson & Kasprzyk (2017) very easy.

with MultiprocessingEvaluator(model) as evaluator:
    results = evaluator.optimize(nfe=1000, searchover='levers',
                                 epsilons=[0.1,]*len(model.outcomes))
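To illustrate the reference argument for worst case discovery, a sketch could look as follows. It assumes Policy can be imported from the ema_workbench package, the lever values of the reference policy are purely illustrative, and in practice you would also invert the direction of the outcomes of interest before searching over the uncertainties:

from ema_workbench import Policy

# an illustrative reference policy to stress test
reference = Policy('reference policy', **{'c1': 0.25, 'c2': 0.25,
                                          'r1': 0.5, 'r2': 0.5, 'w1': 0.5})

with MultiprocessingEvaluator(model) as evaluator:
    worst_case = evaluator.optimize(nfe=1000, searchover='uncertainties',
                                    reference=reference,
                                    epsilons=[0.1,]*len(model.outcomes))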

Robust optimization

A third possibility is to perform robust optimization. In this case, the search takes place over the levers, but a given policy is then evaluated over a set of scenarios and its performance is defined over this set. To do this, we need to explicitly define robustness. For this, we can use the outcome objects we have used before. In the example below we are defining robustness as the worst 10th percentile over the set of scenarios. We need to pass a variable_name argument to explicitly link the outcomes of the model to the robustness metrics.

import functools

percentile10 = functools.partial(np.percentile, q=10)
percentile90 = functools.partial(np.percentile, q=90)

MAXIMIZE = ScalarOutcome.MAXIMIZE
MINIMIZE = ScalarOutcome.MINIMIZE
robustness_functions = [ScalarOutcome('90th percentile max_p', kind=MINIMIZE,
                                      variable_name='max_P', function=percentile90),
                        ScalarOutcome('10th percentile reliability', kind=MAXIMIZE,
                                      variable_name='reliability', function=percentile10),
                        ScalarOutcome('10th percentile inertia', kind=MAXIMIZE,
                                      variable_name='inertia', function=percentile10),
                        ScalarOutcome('10th percentile utility', kind=MAXIMIZE,
                                      variable_name='utility', function=percentile10)]

Given the specification of the robustness functions, the remainder is straightforward and analogous to normal optimization.

from ema_workbench.em_framework.samplers import sample_uncertainties

n_scenarios = 200
scenarios = sample_uncertainties(model, n_scenarios)
nfe = 100000

with MultiprocessingEvaluator(model) as evaluator:
    robust_results = evaluator.robust_optimize(robustness_functions, scenarios,
                                               nfe=nfe, epsilons=[0.05,]*len(robustness_functions))

This blog post has introduced the exploratory modeling workbench and has shown its basic functionality for sampling or searching over uncertainties and levers. In subsequent blog posts, I will take a more in-depth look at this functionality, as well as demonstrate how the workbench facilitates the entire Many-Objective Robust Decision Making process.