For optimization, the workbench relies on platypus. You can easily install the latest version of platypus from GitHub using pip:

pip install git+https://github.com/Project-Platypus/Platypus.git

By default, the workbench uses ε-NSGA-II, but all the other algorithms available within platypus can be used as well.

Within the workbench, optimization can be used in three ways:

* Search over decision levers for a reference scenario

* Robust search: search over decision levers for a set of scenarios

* Worst case discovery: search over uncertainties for a reference policy

The search over decision levers or over uncertainties relies on the specification of the direction of optimization for each outcome of interest defined on the model. Only `ScalarOutcome` objects can be used for optimization.

Directed search is most often used to search over the decision levers in order to find good candidate strategies. This is for example the first step in the Many Objective Robust Decision Making process. This is straightforward to do with the workbench using the optimize method.

```python
from ema_workbench import MultiprocessingEvaluator, ema_logging

ema_logging.log_to_stderr(ema_logging.INFO)

with MultiprocessingEvaluator(model) as evaluator:
    results = evaluator.optimize(nfe=10000, searchover='levers',
                                 epsilons=[0.1] * len(model.outcomes),
                                 population_size=50)
```

The result from `optimize` is a DataFrame with the decision variables and outcomes of interest. The latest version of the workbench comes with a pure Python implementation of the parallel coordinates plot, built on top of matplotlib. It has been designed with the matplotlib and seaborn APIs in mind. We can use this to quickly visualize the optimization results.

```python
from ema_workbench.analysis import parcoords

paraxes = parcoords.ParallelAxes(parcoords.get_limits(results), rot=0)
paraxes.plot(results, color=sns.color_palette()[0])
paraxes.invert_axis('max_P')
plt.show()
```

Note how we can flip an axis using the `invert_axis` method. This eases interpretation of the figure, because the ideal solution in this case would be a straight line for the four outcomes of interest at the top of the figure.

In the previous example, we showed the most basic way of using the workbench to perform many-objective optimization. However, the workbench also offers support for constraints and tracking convergence. Constraints are an attribute of the optimization problem, rather than an attribute of the model as in Rhodium. Thus, we can pass a list of constraints to the optimize method. A constraint can be applied to the model input parameters (both uncertainties and levers), and/or outcomes. A constraint is essentially a function that should return the distance from the feasibility threshold. The distance should be 0 if the constraint is met.
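Since a constraint is in essence just a distance function, its semantics can be illustrated standalone, outside the workbench (the function name here is illustrative, not workbench API):

```python
def distance_to_feasibility(max_p, threshold=1.0):
    """A constraint as a distance function: 0 when the constraint is met,
    otherwise the distance to the feasibility threshold."""
    return max(0.0, max_p - threshold)

print(distance_to_feasibility(0.8))  # feasible, so distance is 0
print(distance_to_feasibility(1.5))  # infeasible by 0.5
```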

As a quick demonstration, let’s add a constraint on the maximum pollution. This constraint applies to the max_P outcome. The example below specifies that the maximum pollution should be below 1.

```python
from ema_workbench import MultiprocessingEvaluator, ema_logging, Constraint

ema_logging.log_to_stderr(ema_logging.INFO)

constraints = [Constraint("max pollution", outcome_names="max_P",
                          function=lambda x: max(0, x - 1))]

with MultiprocessingEvaluator(model) as evaluator:
    results = evaluator.optimize(nfe=1000, searchover='levers',
                                 epsilons=[0.1] * len(model.outcomes),
                                 population_size=25,
                                 constraints=constraints)
```

To track convergence, we need to specify which metric(s) we want to use and pass these to the optimize method. At present, the workbench comes with three options: hypervolume, epsilon progress, and a class that writes the archive at each iteration to a separate text file, enabling later processing. If convergence metrics are specified, optimize returns both the results and the convergence information.

```python
from ema_workbench import MultiprocessingEvaluator, ema_logging
from ema_workbench.em_framework.optimization import (HyperVolume,
                                                     EpsilonProgress)
from ema_workbench.em_framework.outcomes import Constraint

ema_logging.log_to_stderr(ema_logging.INFO)

# because of the constraint on pollution, we can specify the
# maximum easily
convergence_metrics = [HyperVolume(minimum=[0, 0, 0, 0], maximum=[1, 1, 1, 1]),
                       EpsilonProgress()]
constraints = [Constraint("max pollution", outcome_names="max_P",
                          function=lambda x: max(0, x - 1))]

with MultiprocessingEvaluator(model) as evaluator:
    results_ref1, convergence1 = evaluator.optimize(nfe=25000, searchover='levers',
                                                    epsilons=[0.05] * len(model.outcomes),
                                                    convergence=convergence_metrics,
                                                    constraints=constraints,
                                                    population_size=100)
```

We can visualize the results using parcoords as before, while the convergence information is in a DataFrame, making it easy to plot as well.

```python
fig, (ax1, ax2) = plt.subplots(ncols=2, sharex=True)
ax1.plot(convergence1.epsilon_progress)
ax1.set_xlabel('nr. of generations')
ax1.set_ylabel(r'$\epsilon$ progress')
ax2.plot(convergence1.hypervolume)
ax2.set_ylabel('hypervolume')
sns.despine()
plt.show()
```

Up till now, we have performed the optimization for an unspecified reference scenario. Since the lake model function takes default values for each of the deeply uncertain factors, these values have been implicitly assumed. It is however possible to explicitly pass a reference scenario that should be used instead. In this way, it is easy to apply the extended MORDM approach suggested by Watson and Kasprzyk (2017).

To see the effects of changing the reference scenario on the values for the decision levers found through the optimization, as well as to ensure a fair comparison with the previous results, we use the same convergence metrics and constraints as in the previous optimization. Since the constraints are in essence only functions and don't retain optimization-specific state, we can simply reuse them. The convergence metrics, in contrast, retain state, so we need to re-instantiate them.

```python
from ema_workbench import Scenario

reference = Scenario('reference', **dict(b=0.43, q=3, mean=0.02,
                                         stdev=0.004, delta=0.94))
convergence_metrics = [HyperVolume(minimum=[0, 0, 0, 0], maximum=[1, 1, 1, 1]),
                       EpsilonProgress()]

with MultiprocessingEvaluator(model) as evaluator:
    results_ref2, convergence2 = evaluator.optimize(nfe=25000, searchover='levers',
                                                    epsilons=[0.05] * len(model.outcomes),
                                                    convergence=convergence_metrics,
                                                    constraints=constraints,
                                                    population_size=100,
                                                    reference=reference)
```

To demonstrate the parcoords plotting functionality in some more detail, let’s combine the results from the optimizations for the two different reference scenarios and visualize them in the same plot. To do this, we need to first figure out the limits across both optimizations. Moreover, to get a better sense of which part of the decision space is being used, let’s set the limits for the decision levers on the basis of their specified ranges instead of inferring the limits from the optimization results.

```python
columns = [lever.name for lever in model.levers]
columns += [outcome.name for outcome in model.outcomes]

limits = {lever.name: (lever.lower_bound, lever.upper_bound)
          for lever in model.levers}
limits = dict(**limits, **{outcome.name: (0, 1) for outcome in model.outcomes})
limits = pd.DataFrame.from_dict(limits)

# we resort the limits in the order produced by the optimization
limits = limits[columns]

paraxes = parcoords.ParallelAxes(limits, rot=0)
paraxes.plot(results_ref1, color=sns.color_palette()[0], label='ref1')
paraxes.plot(results_ref2, color=sns.color_palette()[1], label='ref2')
paraxes.legend()
paraxes.invert_axis('max_P')
plt.show()
```

The workbench also comes with support for many-objective robust optimization. In this case, each candidate solution is evaluated over a set of scenarios, and the robustness of the performance over this set is calculated. This requires specifying two new pieces of information:

* the robustness metrics

* the scenarios over which to evaluate the candidate solutions

The robustness metrics are simply a collection of `ScalarOutcome` objects. For each one, we have to specify which model outcome(s) it uses, as well as the actual robustness function. For demonstrative purposes, let's use robustness functions based on descriptive statistics: we want to maximize the 10th percentile performance for reliability, inertia, and utility, while minimizing the 90th percentile performance for max_P.
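These percentile-based robustness functions are plain numpy, so we can sanity-check them on their own (the arrays here are made-up stand-ins for a candidate's outcomes across scenarios):

```python
import functools
import numpy as np

percentile10 = functools.partial(np.percentile, q=10)
percentile90 = functools.partial(np.percentile, q=90)

# hypothetical performance of one candidate across 11 scenarios
reliability = np.linspace(0.0, 1.0, 11)  # to be maximized at the 10th percentile
max_p = np.linspace(0.5, 2.5, 11)        # to be minimized at the 90th percentile

print(percentile10(reliability))
print(percentile90(max_p))
```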

We can specify our scenarios in various ways. The simplest is to pass the number of scenarios to the `robust_optimize` method, in which case a new set of scenarios is used for each generation. This can create noise and instability in the optimization. A better option is to explicitly generate the scenarios first and pass these to the method; in this way, the same set of scenarios is used for every generation.
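The noise from resampling can be illustrated outside the workbench: scoring the same candidate on two different random scenario draws gives two different robustness values (a toy stand-in model, not the lake problem):

```python
import numpy as np

def robustness(b_values):
    # stand-in model: the outcome depends only on the uncertain factor b
    outcomes = 1.0 / (1.0 + b_values)
    return np.percentile(outcomes, 10)

rng = np.random.default_rng(1)
draw_a = rng.uniform(0.1, 0.45, 10)  # scenarios sampled for one generation
draw_b = rng.uniform(0.1, 0.45, 10)  # scenarios resampled for the next

# same candidate, different scores: noise in the optimization
print(robustness(draw_a), robustness(draw_b))
```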

If we want to specify a constraint, this can easily be done. Note, however, that in the case of robust optimization, the constraints apply to the robustness metrics instead of the model outcomes. They can of course still apply to the decision variables as well.

```python
import functools

from ema_workbench import Constraint, MultiprocessingEvaluator, ema_logging
from ema_workbench.em_framework.optimization import (HyperVolume,
                                                     EpsilonProgress)
from ema_workbench.em_framework.samplers import sample_uncertainties

ema_logging.log_to_stderr(ema_logging.INFO)

percentile10 = functools.partial(np.percentile, q=10)
percentile90 = functools.partial(np.percentile, q=90)

MAXIMIZE = ScalarOutcome.MAXIMIZE
MINIMIZE = ScalarOutcome.MINIMIZE
robustness_functions = [ScalarOutcome('90th percentile max_p', kind=MINIMIZE,
                                      variable_name='max_P', function=percentile90),
                        ScalarOutcome('10th percentile reliability', kind=MAXIMIZE,
                                      variable_name='reliability', function=percentile10),
                        ScalarOutcome('10th percentile inertia', kind=MAXIMIZE,
                                      variable_name='inertia', function=percentile10),
                        ScalarOutcome('10th percentile utility', kind=MAXIMIZE,
                                      variable_name='utility', function=percentile10)]

def constraint(x):
    return max(0, percentile90(x) - 10)

constraints = [Constraint("max pollution", outcome_names=['90th percentile max_p'],
                          function=constraint)]
convergence_metrics = [HyperVolume(minimum=[0, 0, 0, 0], maximum=[10, 1, 1, 1]),
                       EpsilonProgress()]

n_scenarios = 10
scenarios = sample_uncertainties(model, n_scenarios)
nfe = 10000

with MultiprocessingEvaluator(model) as evaluator:
    robust_results, convergence = evaluator.robust_optimize(
        robustness_functions, scenarios, nfe=nfe, constraints=constraints,
        epsilons=[0.05] * len(robustness_functions),
        convergence=convergence_metrics)
```

```python
fig, (ax1, ax2) = plt.subplots(ncols=2)
ax1.plot(convergence.epsilon_progress.values)
ax1.set_xlabel('nr. of generations')
ax1.set_ylabel(r'$\epsilon$ progress')
ax2.plot(convergence.hypervolume)
ax2.set_ylabel('hypervolume')
sns.despine()
plt.show()
```

```python
paraxes = parcoords.ParallelAxes(parcoords.get_limits(robust_results), rot=45)
paraxes.plot(robust_results)
paraxes.invert_axis('90th percentile max_p')
plt.show()
```

Up till now, we have focused on optimizing over the decision levers. The workbench, however, can also be used for worst case discovery (Halim et al., 2016). In essence, the only change is to specify that we want to search over uncertainties instead of over levers. Constraints and convergence work just as in the previous examples.

Reusing the foregoing, however, we should change the direction of optimization of the outcomes. We are no longer interested in finding the best possible outcomes, but instead we want to find the worst possible outcomes.

```python
# change outcomes so direction is undesirable
minimize = ScalarOutcome.MINIMIZE
maximize = ScalarOutcome.MAXIMIZE

for outcome in model.outcomes:
    if outcome.kind == minimize:
        outcome.kind = maximize
    else:
        outcome.kind = minimize
```

We can reuse the reference keyword argument to perform worst case discovery for one of the policies found before. So, below we select solution number 9 from the Pareto-approximate set, turn it into a dict, and instantiate a Policy object.

```python
from ema_workbench import Policy

policy = Policy('9', **{k: v for k, v in results_ref1.loc[9].items()
                        if k in model.levers})

with MultiprocessingEvaluator(model) as evaluator:
    results = evaluator.optimize(nfe=1000, searchover='uncertainties',
                                 epsilons=[0.1] * len(model.outcomes),
                                 reference=policy)
```

Visualizing the results is straightforward using parcoords.

```python
paraxes = parcoords.ParallelAxes(parcoords.get_limits(results), rot=0)
paraxes.plot(results)
paraxes.invert_axis('max_P')
plt.show()
```

This blog showcased the functionality of the workbench for applying search based approaches to exploratory modelling. We specifically looked at the use of many-objective optimization for searching over the levers or uncertainties, as well as the use of many-objective robust optimization. This completes the overview of the functionality available in the workbench. In the next blog, I will put it all together to show how the workbench can be used to perform Many Objective Robust Decision Making.


- All variables and parameters in the new system are unitless, i.e., scales and units are not an issue in the new system, and the dynamics of systems operating on different time and/or spatial scales can be compared;
- The number of model parameters is reduced to a smaller set of fundamental parameters that govern the dynamics of the system; which also means
- The new model is simpler and easier to analyze, and
- The computational time becomes shorter.

I will now present an example using a predator-prey system of equations. In my last blogpost, I used the Lotka-Volterra system of equations for describing predator-prey interactions. Towards the end of that post I talked about the logistic Lotka-Volterra system, which is in the following form:

dx/dt = bx(1 − x/K) − axy
dy/dt = caxy − dy

where x is prey abundance, y is predator abundance, b is the prey growth rate, d is the predator death rate, c is the rate with which consumed prey is converted to predator abundance, a is the rate with which prey is killed by a predator per unit of time, and K is the carrying capacity of the prey given its environmental conditions.
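As a quick numerical illustration of these dynamics (the parameter values below are arbitrary choices for demonstration, not taken from the post):

```python
import numpy as np

def logistic_lv(x, y, b=1.0, d=0.5, c=0.5, a=1.0, K=10.0, dt=0.01, steps=20000):
    """Forward-Euler integration of the logistic Lotka-Volterra system."""
    for _ in range(steps):
        dx = b * x * (1 - x / K) - a * x * y
        dy = c * a * x * y - d * y
        x, y = x + dt * dx, y + dt * dy
    return x, y

# for these values the interior equilibrium is x = d/(c*a) = 1.0 and
# y = b*(1 - x/K)/a = 0.9; trajectories spiral in towards it
x, y = logistic_lv(1.5, 0.5)
print(x, y)
```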

The first step is to define the original model variables as products of new dimensionless variables (e.g., x*) and *scaling parameters* (e.g., X) carrying the same units as the original variable:

x = Xx*, y = Yy*, t = Tt*

The rescaled variables are then substituted into the original model:

(X/T) dx*/dt* = bXx*(1 − Xx*/K) − aXYx*y*
(Y/T) dy*/dt* = caXYx*y* − dYy*

Carrying out all cancellations and obvious simplifications:

dx*/dt* = bTx*(1 − (X/K)x*) − aTYx*y*
dy*/dt* = caTXx*y* − dTy*

Our task now is to define the rescaling parameters X, Y, and T to simplify our model – remember they have to carry the same units as our original variables.

| Variable/parameter | Unit |
| --- | --- |
| x | mass^{1} |
| y | mass^{1} |
| t | time |
| b | 1/time |
| d | 1/time |
| a | 1/(mass∙time)^{2} |
| c | mass/mass^{3} |
| K | mass |

There’s no single correct way of going about doing this, but using the units for guidance and trying to be smart we can simplify the structure of our model. For example, setting X=K will remove that term from the prey equation (notice that this way X has the same unit as our original x variable).

The choice of Y is not very obvious, so let's look at T first. We could go with either T=1/b or T=1/d. Unit-wise they both work, but one would serve to eliminate a parameter from the first equation and the other from the second. The decision here depends on what dynamics we're most interested in, so for the purposes of demonstration here, let's go with T=1/b.

We’re now left with defining Y, which only appears in the second term of the first equation. Looking at that term, the obvious substitution is Y=b/a, resulting in this set of equations:

Our system of equations is still not dimensionless, as we still have the model parameters to worry about. We can now define aggregate parameters using the original parameters in such a way that they will not carry any units and they will further simplify our model.

By setting p_{1}=caK/b and p_{2}=d/b we can transform our system to:

dx*/dt* = x*(1 − x*) − x*y*
dy*/dt* = p_{1}x*y* − p_{2}y*

a system of equations with no units and just two parameters.
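We can check the rescaling numerically: a forward-Euler integration of the original system and of the dimensionless one, mapped back through x = Kx*, y = (b/a)y*, and t* = bt, should trace the same trajectory (parameter values are arbitrary illustrations):

```python
import numpy as np

b, d, a, c, K = 1.2, 0.4, 0.8, 0.5, 5.0
p1, p2 = c * a * K / b, d / b  # aggregate dimensionless parameters

def step_original(x, y, dt):
    dx = b * x * (1 - x / K) - a * x * y
    dy = c * a * x * y - d * y
    return x + dt * dx, y + dt * dy

def step_dimensionless(xs, ys, dts):
    dxs = xs * (1 - xs) - xs * ys
    dys = p1 * xs * ys - p2 * ys
    return xs + dts * dxs, ys + dts * dys

# rescaled state: x* = x/X, y* = y/Y, with X = K and Y = b/a
x, y = 1.0, 0.5
xs, ys = x / K, y * a / b
dt = 0.001
for _ in range(2000):
    x, y = step_original(x, y, dt)
    xs, ys = step_dimensionless(xs, ys, dt * b)  # T = 1/b, so dt* = b*dt

# mapping the dimensionless trajectory back recovers the original one
print(abs(x - K * xs), abs(y - (b / a) * ys))
```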

^{1 }Prey and predator abundance don't necessarily have to be measured using mass units; it could be volume, density, or something else. The units for parameters a, c, and K would change equivalently and the rescaling still holds.

^{2 }This is the death rate per encounter with predator per time t.

^{3} This is the converted predator (mass) per prey (mass) consumed.


In exploratory modeling, we are interested in understanding how regions in the uncertainty space and/or the decision space map to the whole outcome space, or partitions thereof. There are two general approaches for investigating this mapping. The first one is through systematic sampling of the uncertainty or decision space. This is sometimes also known as open exploration. The second one is to search through the space in a directed manner using some type of optimization approach. This is sometimes also known as directed search.

The workbench supports both open exploration and directed search. Both can be applied to investigate the mapping of the uncertainty space and/or the decision space to the outcome space. In most applications, search is used for finding promising mappings from the decision space to the outcome space, while exploration is used to stress test these mappings under a whole range of possible resolutions of the various uncertainties. This need not be the case, however. Optimization can be used to discover the worst possible scenario, while sampling can be used to get insight into the sensitivity of the outcomes to the various decision levers.

To showcase the open exploration functionality, let's start with a basic example using the DPS lake problem also used in the previous blog post. We are going to simultaneously sample over uncertainties and decision levers. We are going to generate 1000 scenarios and 5 policies, and see how they jointly affect the outcomes. A *scenario* is understood as a point in the uncertainty space, while a *policy* is a point in the decision space. The combination of a scenario and a policy is called an *experiment*. The uncertainty space is spanned by uncertainties, while the decision space is spanned by levers. Both uncertainties and levers are instances of *RealParameter* (a continuous range), *IntegerParameter* (a range of integers), or *CategoricalParameter* (an unordered set of things). By default, the workbench uses Latin Hypercube sampling for generating both the scenarios and the policies. Each policy is always evaluated over all scenarios (i.e., a full factorial over scenarios and policies).
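The full-factorial pairing of scenarios and policies can be sketched with plain itertools (the scenario and policy labels are hypothetical):

```python
import itertools

scenarios = [f'scenario_{i}' for i in range(1000)]
policies = [f'policy_{i}' for i in range(5)]

# every policy is evaluated over every scenario
experiments = list(itertools.product(scenarios, policies))
print(len(experiments))  # 5000 experiments
```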

```python
from ema_workbench import (RealParameter, ScalarOutcome, Constant,
                           ReplicatorModel)

model = ReplicatorModel('lakeproblem', function=lake_model)
model.replications = 150

# specify uncertainties
model.uncertainties = [RealParameter('b', 0.1, 0.45),
                       RealParameter('q', 2.0, 4.5),
                       RealParameter('mean', 0.01, 0.05),
                       RealParameter('stdev', 0.001, 0.005),
                       RealParameter('delta', 0.93, 0.99)]

# set levers
model.levers = [RealParameter("c1", -2, 2),
                RealParameter("c2", -2, 2),
                RealParameter("r1", 0, 2),
                RealParameter("r2", 0, 2),
                RealParameter("w1", 0, 1)]

def process_p(values):
    values = np.asarray(values)
    values = np.mean(values, axis=0)
    return np.max(values)

# specify outcomes
model.outcomes = [ScalarOutcome('max_P', kind=ScalarOutcome.MINIMIZE,
                                function=process_p),
                  ScalarOutcome('utility', kind=ScalarOutcome.MAXIMIZE,
                                function=np.mean),
                  ScalarOutcome('inertia', kind=ScalarOutcome.MINIMIZE,
                                function=np.mean),
                  ScalarOutcome('reliability', kind=ScalarOutcome.MAXIMIZE,
                                function=np.mean)]

# override some of the defaults of the model
model.constants = [Constant('alpha', 0.41),
                   Constant('steps', 100)]
```

Next, we can perform experiments with this model.

```python
from ema_workbench import (MultiprocessingEvaluator, ema_logging,
                           perform_experiments)

ema_logging.log_to_stderr(ema_logging.INFO)

with MultiprocessingEvaluator(model) as evaluator:
    results = evaluator.perform_experiments(scenarios=1000, policies=5)
```

Having generated these results, the next step is to analyze them and see what we can learn from them. The workbench comes with a variety of techniques for this analysis. A simple first step is to make a few quick visualizations of the results. The workbench has convenience functions for this, but it is also possible to create your own visualizations using the scientific Python stack.

```python
from ema_workbench.analysis import pairs_plotting

fig, axes = pairs_plotting.pairs_scatter(results, group_by='policy',
                                         legend=False)
plt.show()
```

Writing your own visualizations requires a more in-depth understanding of how the results from the workbench are structured. `perform_experiments` returns a tuple. The first item is a numpy structured array in which each row is a single experiment. The second item contains the outcomes, structured in a dict with the name of the outcome as key and a numpy array as value. Experiments and outcomes are aligned based on index.
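A miniature stand-in illustrates this structure (made-up values, two experiments):

```python
import numpy as np

# a tiny stand-in for the (experiments, outcomes) tuple
experiments = np.array([(0.2, 'policy_0'), (0.3, 'policy_1')],
                       dtype=[('b', float), ('policy', 'U10')])
outcomes = {'max_P': np.array([0.7, 1.2]),
            'utility': np.array([0.4, 0.6])}

# experiments and outcomes are aligned on index
for i in range(len(experiments)):
    print(experiments['policy'][i], outcomes['max_P'][i])
```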

```python
import seaborn as sns

experiments, outcomes = results

df = pd.DataFrame.from_dict(outcomes)
df = df.assign(policy=experiments['policy'])

# rename the policies using numbers
df['policy'] = df['policy'].map({p: i for i, p in
                                 enumerate(set(experiments['policy']))})

# use seaborn to plot the dataframe
grid = sns.pairplot(df, hue='policy', vars=outcomes.keys())
ax = plt.gca()
plt.show()
```

Often, it is convenient to separate the process of performing the experiments from the analysis. To make this possible, the workbench offers convenience functions for storing results to disk and loading them from disk. The workbench stores the results in a tarball with .csv files and separate metadata files. This is a convenient format that has proven sufficient over the years.

```python
from ema_workbench import save_results, load_results

save_results(results, '1000 scenarios 5 policies.tar.gz')
results = load_results('1000 scenarios 5 policies.tar.gz')
```

In addition to visual analysis, the workbench comes with a variety of techniques for a more in-depth analysis of the results. Moreover, other analyses can simply be performed by utilizing the scientific Python stack. The workbench comes with:

- Scenario Discovery, a model driven approach to scenario development.
- Dimensional stacking, a quick visual approach drawing on feature scoring to enable scenario discovery. This approach has received limited attention in the literature (Suzuki et al., 2015). The implementation in the workbench replaces the rule mining approach with a feature scoring approach.
- Feature Scoring, a poor man’s alternative to global sensitivity analysis
- Regional sensitivity analysis

A detailed discussion of scenario discovery can be found in an earlier blogpost. For completeness, I provide a code snippet here. Compared to the previous blog post, there is one small change. The library mpld3 is currently unmaintained and broken on Python 3.5 and higher. To still utilize the interactive exploration of the trade-offs within the notebook, use the interactive back-end as shown below.

```python
from ema_workbench.analysis import prim

experiments, outcomes = results

x = experiments
y = outcomes['max_P'] < 0.8

prim_alg = prim.Prim(x, y, threshold=0.8)
box1 = prim_alg.find_box()
```

```python
%matplotlib notebook

box1.show_tradeoff()
plt.show()
```

```python
%matplotlib inline
# we go back to the default non-interactive backend

box1.inspect(43)
box1.inspect(43, style='graph')
plt.show()
```

Dimensional stacking was suggested as a more visual approach to scenario discovery. It involves two steps: identifying the most important uncertainties that affect system behavior, and creating a pivot table using the most influential uncertainties. Creating the pivot table involves binning the uncertainties. More details can be found in Suzuki et al. (2015) or by looking through the code in the workbench. Compared to the original paper, I use feature scoring for determining the most influential uncertainties. The code is set up in a modular way so other approaches to global sensitivity analysis can easily be used as well if so desired.

```python
from ema_workbench.analysis import dimensional_stacking

x = experiments
y = outcomes['max_P'] < 0.8

dimensional_stacking.create_pivot_plot(x, y, 2, nbins=3)
plt.show()
```

We can see from this visual that if b is low while q is high, we have a high concentration of cases where pollution stays below 0.8. The mean and delta have some limited additional influence. By playing around with the number of bins or the number of layers, patterns can be coarsened or refined.
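The binning-and-pivoting idea underlying this plot can be sketched with plain numpy on synthetic data (an illustration of the principle, not the workbench implementation):

```python
import numpy as np

rng = np.random.default_rng(42)
b = rng.uniform(0.1, 0.45, 1000)
q = rng.uniform(2.0, 4.5, 1000)
y = (b < 0.2) & (q > 3.5)  # stand-in for the cases of interest

# bin each uncertainty into 3 bins, as nbins=3 does above
b_bins = np.digitize(b, np.linspace(0.1, 0.45, 4)[1:-1])
q_bins = np.digitize(q, np.linspace(2.0, 4.5, 4)[1:-1])

# pivot table: fraction of cases of interest per (b-bin, q-bin) cell
pivot = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        cell = (b_bins == i) & (q_bins == j)
        if cell.any():
            pivot[i, j] = y[cell].mean()
```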

A third approach for supporting scenario discovery is to perform a regional sensitivity analysis. The workbench implements a visual approach based on plotting the empirical CDF given a classification vector. Please look at section 3.4 in Pianosi et al (2016) for more details.

```python
from ema_workbench.analysis import regional_sa
from numpy.lib import recfunctions as rf

x = rf.drop_fields(experiments, 'model', asrecarray=True)
y = outcomes['max_P'] < 0.8

regional_sa.plot_cdfs(x, y)
plt.show()
```

Feature scoring is a family of techniques often used in machine learning to identify the most relevant features to include in a model. This is similar to one of the use cases for global sensitivity analysis, namely factor prioritisation. In some of the work ongoing in Delft, we are comparing feature scoring with Sobol and Morris and the results are quite positive. The main advantage of feature scoring techniques is that they impose virtually no constraints on the experimental design, while they can handle real valued, integer valued, and categorical valued parameters. The workbench supports multiple techniques, the most useful of which generally is extra trees (Geurts et al. 2006).

For this example, we run feature scoring for each outcome of interest. We can also run it for a specific outcome if desired. Similarly, we can choose whether we want to run in regression mode or classification mode. The latter is applicable if the outcome is a categorical variable, and the results should be interpreted similarly to regional sensitivity analysis results. For more details, see the documentation.

```python
from ema_workbench.analysis import feature_scoring

x = experiments
y = outcomes

fs = feature_scoring.get_feature_scores_all(x, y)
sns.heatmap(fs, cmap='viridis', annot=True)
plt.show()
```

From the results, we see that max_P is primarily influenced by b, while utility is driven by delta; for inertia and reliability, the situation is a little less clear-cut.

In addition to the prepackaged analyses that come with the workbench, it is also easy to rig up something quickly using the ever expanding scientific Python stack. Below is a quick example of performing a basic regression analysis on the results.

```python
experiments, outcomes = results

for key, value in outcomes.items():
    params = model.uncertainties  # + model.levers[:]

    fig, axes = plt.subplots(ncols=len(params), sharey=True)

    y = value
    for i, param in enumerate(params):
        ax = axes[i]
        ax.set_xlabel(param.name)

        pearson = sp.stats.pearsonr(experiments[param.name], y)
        ax.annotate("r: {:6.3f}".format(pearson[0]), xy=(0.15, 0.85),
                    xycoords='axes fraction', fontsize=13)

        x = experiments[param.name]
        sns.regplot(x, y, ax=ax, ci=None, color='k',
                    scatter_kws={'alpha': 0.2, 's': 8, 'color': 'gray'})
        ax.set_xlim(param.lower_bound, param.upper_bound)

    axes[0].set_ylabel(key)
    plt.show()
```

The workbench can also be used for more advanced sampling techniques. To achieve this, it relies on SALib. On the workbench side, the only change is to specify the sampler we want to use. Next, we can use SALib directly to perform the analysis. To help with this, the workbench provides a convenience function for generating the problem dict that SALib requires. The example below focuses on performing Sobol on the uncertainties, but we could do the exact same thing with the levers instead. The only changes required would be to set `lever_sampling` instead of `uncertainty_sampling`, and to get the SALib problem dict based on the levers.

```python
from SALib.analyze import sobol
from ema_workbench.em_framework.salib_samplers import get_SALib_problem

with MultiprocessingEvaluator(model) as evaluator:
    sa_results = evaluator.perform_experiments(scenarios=1000,
                                               uncertainty_sampling='sobol')

experiments, outcomes = sa_results

problem = get_SALib_problem(model.uncertainties)
Si = sobol.analyze(problem, outcomes['max_P'],
                   calc_second_order=True, print_to_console=False)

Si_filter = {k: Si[k] for k in ['ST', 'ST_conf', 'S1', 'S1_conf']}
Si_df = pd.DataFrame(Si_filter, index=problem['names'])
```


This part focuses on the moviepy Python library, and all the neat things one can do with it. There actually are some nice tutorials for when we have a continuous function t -> f(t) to work with (see here). Instead, we are often working with data structures that are indexed on time in a discrete way.

Moviepy could be used with any time-dependent data source, including netCDF data such as that manipulated by VisIt in the first part of this post. But in this second part, we are instead going to focus on how to draw time-dependent trajectories to make sense of nonlinear dynamical systems, then animate them as GIFs. I will use the well-known shallow lake problem, and go through a first example with a detailed explanation of the code. Then I'll finish with a second example showing trajectories.

The shallow lake problem is a classic problem in the management of coupled human and natural systems. Some human activities (e.g., agriculture) produce phosphorus that eventually ends up in water bodies such as lakes. Too much phosphorus in a lake causes a process called eutrophication, which usually destroys the lake's diverse ecosystem (no more fish) and lowers water quality. A major problem is that eutrophication is difficult or even sometimes impossible to reverse: lowering phosphorus inputs to what they were pre-eutrophication simply won't work. Simple nonlinear dynamics, first proposed by Carpenter et al. in 1999 (see here), describe the relationship between phosphorus inputs (L) and concentration (P). The first part of the code (uploaded to GitHub as `movie1.py`) reads:

```python
import attractors
import numpy as np
from moviepy.video.io.bindings import mplfig_to_npimage
from moviepy.video.VideoClip import DataVideoClip
import matplotlib.pyplot as plt
import matplotlib.lines as mlines

# Lake parameters
b = 0.65
q = 4

# One-step dynamics (P increment rate)
# arguments are current state x, lake parameters b and q, and input l
def Dynamics(x, b, q, l):
    dp = (x ** q) / (1 + x ** q) - b * x + l
    return dp
```

The first six lines contain the usual library imports. Note that I am importing an auxiliary Python module, "attractors", to enable me to plot the attractors (see `attractors.py` in the GitHub repository). The function "Dynamics" corresponds to the evolution of P given L and the lake parameters b and q, also given in this bit of code. Then we introduce the time parameters:

```python
# Time parameters
dt = 0.01              # time step
T = 40                 # final horizon
nt = int(T/dt + 1E-6)  # number of time steps
```

To illustrate that lake phosphorus dynamics depend not only on the phosphorus inputs L but also on initial phosphorus levels, we are going to plot P trajectories for different constant values of L, and three cases regarding the initial P. We first introduce these initial phosphorus levels, and the different input levels, then declare the arrays in which we’ll store the different trajectories

```python
# Initial phosphorus levels
pmin = 0
pmed = 1
pmax = 2.5

# Input levels
l = np.arange(0.001, 0.401, 0.005)

# Store trajectories
low_p = np.zeros([len(l), nt+1])   # Corresponds to pmin
med_p = np.zeros([len(l), nt+1])   # Corresponds to pmed
high_p = np.zeros([len(l), nt+1])  # Corresponds to pmax
```

Once that is done, we can use the attractors import to plot the equilibria of the lake problem. That bit of code is in the GitHub repository associated with this post, but I am not going to comment on it further here.
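Though I won't reproduce `attractors.py` here, the idea behind those equilibria can be sketched in a few lines: the equilibria are simply the roots of the one-step dynamics. The snippet below is a simplified stand-in (my own illustration, not the repository's code), assuming the same b and q as above; it brackets sign changes of the dynamics on a grid and refines each bracketed root with scipy's `brentq`.

```python
import numpy as np
from scipy.optimize import brentq

b, q = 0.65, 4  # lake parameters used in the post

def dynamics(x, l):
    # One-step P increment rate, as in the Dynamics function above
    return x**q / (1 + x**q) - b * x + l

def equilibria(l, xmax=3.0, npoints=300):
    # Bracket sign changes of the dynamics on a grid, then refine each
    # bracketed root with brentq
    xs = np.linspace(0.0, xmax, npoints)
    fs = dynamics(xs, l)
    roots = []
    for i in range(len(xs) - 1):
        if fs[i] * fs[i + 1] < 0:
            roots.append(brentq(dynamics, xs[i], xs[i + 1], args=(l,)))
    return roots

# For intermediate inputs the lake is bistable: a low (oligotrophic) and a
# high (eutrophic) stable equilibrium separated by an unstable one.
print(equilibria(0.12))  # three equilibria
print(equilibria(0.35))  # only the eutrophic equilibrium remains
```

For low or high enough constant inputs only one equilibrium survives, which is exactly what the animation below makes visible.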

After that we can generate the trajectories for P with constant L, and store them to the appropriate arrays:

# Generating the data: trajectories
def trajectory(b, q, p0, l, dt, T):
    # Declare outputs
    time = np.arange(0, T+dt, dt)
    traj = np.zeros(len(time))
    # Initialize traj
    traj[0] = p0
    # Fill traj with values
    for i in range(1, len(traj)):
        traj[i] = traj[i-1] + dt * Dynamics(traj[i-1], b, q, l)
    return traj

# Get them!
for i in range(len(l)):
    low_p[i, :] = trajectory(b, q, pmin, l[i], dt, T)
    med_p[i, :] = trajectory(b, q, pmed, l[i], dt, T)
    high_p[i, :] = trajectory(b, q, pmax, l[i], dt, T)

Now we are getting to the interesting part of making the plots for the animation. We need to declare a figure that all the frames in our animation will use (we don’t want the axes to wobble around). For that we use matplotlib / pyplot libraries:

# Draw animated figure
fig, ax = plt.subplots(1)
ax.set_xlabel('Phosphorus inputs L')
ax.set_ylabel('Phosphorus concentration P')
ax.set_xlim(0, l[-1])
ax.set_ylim(0, pmax)
line_low, = ax.plot(l, low_p[:, 0], '.', label='State, P(0)=0')
line_med, = ax.plot(l, med_p[:, 0], '.', label='State, P(0)=1')
line_high, = ax.plot(l, high_p[:, 0], '.', label='State, P(0)=2.5')

Once that is done, the last things we need to do before calling the core moviepy functions are to 1) define the parameters that manage time, and 2) write a function that makes a frame for the instant that is being called.

For 1), we need to be careful because we are juggling different notions of time: a) time in the dynamics, b) the index of each instant in the dynamics (i.e., in the data, the arrays where we stored the trajectories), and c) time in the animation. We may also want a pause at the beginning or at the end of the GIF, rather than watch with tired eyes as the animation ruthlessly starts again before we have realized what the hell happened. So here is how I declared all of this:

# Parameters of the animation
initial_delay = 0.5   # in seconds, delay where image is fixed before the animation
final_delay = 0.5     # in seconds, time interval where image is fixed at end of animation
time_interval = 0.25  # interval of time between two snapshots in the dynamics (time unit or non-dimensional)
fps = 20              # number of frames per second on the GIF

# Translation in the data structure
data_interval = int(time_interval/dt)  # interval between two snapshots in the data structure
t_initial = -initial_delay*fps*data_interval
t_final = final_delay*fps*data_interval
time = np.arange(t_initial, low_p.shape[1]+t_final, data_interval)  # time in the data structure

Now for 2), the function that makes the frames resets the parts of the plot that change for different time indexes ("t" below is the index in the data). If we don't do that, the plot will keep the previously plotted elements, and will grow messier as the animation goes on.

# Making frames
def make_frame(t):
    t = int(t)
    if t < 0:
        return make_frame(0)
    elif t > nt:
        return make_frame(nt)
    else:
        line_low.set_ydata(low_p[:, t])
        line_med.set_ydata(med_p[:, t])
        line_high.set_ydata(high_p[:, t])
        ax.set_title('Lake attractors, and dynamics at t=' + str(int(t*dt)),
                     loc='left', x=0.2)
        if t > 0.25*nt:
            alpha = (t-0.25*nt) / (1.5*nt)
            lakeAttBase(eqList, 0.001, alpha=alpha)
            plt.legend(handles=[stable, unstable], loc=2)
        return mplfig_to_npimage(fig)

In the above, mplfig_to_npimage(fig) is a moviepy function that turns a figure into a frame of our GIF. Now we just have to call the function to make the frames from the data, and to turn the result into a GIF:

# Animating
animation = DataVideoClip(time, make_frame, fps=fps)
animation.write_gif("lake_attractors.gif", fps=fps)

Here the moviepy function DataVideoClip takes as arguments the sequence of indexes defined by the vector "time" declared in the parameters of the animation, the "make_frame" routine we defined, and the number of frames per second we want to output. The last line integrates each frame into the GIF that is plotted below:

Each point on the plot represents a different world (a different constant input level and initial phosphorus concentration), and the animation shows how these states converge toward a stable equilibrium point. The nonlinear lake dynamics make the initial concentration decisive in whether the final concentration is low (lower set of stable equilibria), or the lake ends up in a eutrophic state (upper set of stable equilibria).

Many trajectories can be plotted at the same time to understand the behavior of attractors, and to visualize system dynamics for fixed human-controlled parameters (here, the phosphorus inputs L). Alternatively, if one changes the policy, trajectories evolve depending on both L and P, and the trajectories themselves have to be redefined.

I wrote a similar bit of code to show how one could plot trajectories in the 2D plane. It is also uploaded to the GitHub repository (under `movie2.py`), and is similar in structure to the code above. The definition of the trajectories, and where to store them, changes: we define trajectories where inputs are lowered at a constant rate, with a minimum input of 0.08. For three different initial states, that gives us the following animation, which illustrates how the system's nonlinearity leads to very different trajectories even though the starting positions are close and the management policy is identical:

This could easily be extended to trajectories in higher dimensional planes, with and without sets of equilibria to guide our eyes.

Filed under: Python, Visualization

Let us assume we have VisIt installed and the VisIt path has been added to our bash profile. Then we just need to `cd` into the directory where our data are stored and type `visit` in the command line to launch the program. VisIt comes with two windows: one for manipulating files and the data stored in them, and another where plots are drawn (it is possible to draw more plots in additional windows, but in this tutorial we'll stick to one window). Both windows are visible in the screenshot below.

It may be difficult to see, but the left-hand window is the space for managing files and plots. The active "Plot" is also displayed. Here the active file is "air.2m.2000.nc", where nc is the extension for netCDF files. The plot is a "Pseudocolor" of that data, the equivalent of "contourf" in Matlab or Python's Matplotlib. It is plotted in the right-hand window: the worldwide distribution of daily air temperature (in Kelvin) with a 1 degree resolution for January 1, 2000 (data from NCEP). Note the toolbar above the plot: it contains "play / stop" buttons that allow us to play the 365 other days of year 2000 (the cursor is on "stop" in the screenshot). We are going to save the animation of the whole year with the VisIt movie wizard.

But first, let us change the specifications of our plot. In the left-hand window, we go to PlotAtts -> Pseudocolor (recall that is the plot type we have). We can change the colorbar, set minima and maxima so that the colorbar does not change for each day of the year when we save the movie, and even make the scale skewed so we can more easily see the differences in temperature in the non-polar regions where people live (on the image above, the temperature in the middle of the colorbar is 254.9 K, which corresponds to -18.1 C, or roughly -1 F). The screenshot below shows our preferences. We click "Apply" for the changes to take effect, and exit the window.

Now we can save the movie. We go to File -> Save Movie and follow the VisIt Movie Wizard. You will find it very intuitive and easy. We save a "New Simple Movie" in DivX (other formats are available), with 20 frames (i.e., 20 days) per second. The resulting movie is produced within 2 minutes, but since WordPress does not support DivX unless I am willing to pay for premium access, I had to convert the result to a GIF. To be honest, this is a huge letdown, as the quality suffers compared to a video (unless I want to upload a 1 GB monster). But the point of this post is that *VisIt is really easy to use*, not that you should convert its results into GIFs.

In this animation it is easy to see the outline of land masses as they get either warmer or cooler than the surrounding oceans! Videos would look neater: you can try them for yourself! (at home, wherever you want).

It is actually possible to command VisIt using Python scripts, but I haven’t mastered that yet so it will be a tale for another post.

Filed under: Tutorials, Visualization Tagged: VisIt

The word copula is derived from the Latin noun for a link or a tie (as is the English word “couple”) and its purpose is to describe the dependence structure between two variables. Sklar’s theorem states that “Any multivariate joint distribution can be written in terms of univariate marginal distribution functions and a copula which describes the dependence structure between the two variables” [1].

To understand how a copula can describe the dependence function between random variables, it's helpful to first review some simple statistics.

The above statements can be summarized by saying that the values of the CDF of any marginal distribution are uniformly distributed on the interval [0,1], i.e., if you make a random draw from any distribution, you have the same probability of drawing the largest value (U=1) of that distribution as the smallest possible value (U=0) or the median value (U=0.5).
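This fact (the probability integral transform) is easy to demonstrate numerically. The snippet below is my own illustration, not part of the original material: draw from an exponential distribution, push the draws through that distribution's CDF, and check that the result behaves like a Uniform(0,1) sample.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=100_000)  # draws from an exponential distribution
u = 1.0 - np.exp(-x / 2.0)                    # its CDF: F(x) = 1 - exp(-x/2)

# A Uniform(0, 1) sample has mean 0.5 and variance 1/12 (about 0.083)
print(u.mean(), u.var())
counts, _ = np.histogram(u, bins=10, range=(0.0, 1.0))
print(counts / len(u))  # roughly 0.1 of the mass in every bin
```

The same check works for any continuous distribution whose CDF you can evaluate.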

So what does this have to do with copulas? Great question: a copula is actually a joint distribution of the CDFs of the random variables it is modeling. Put formally:

A k-dimensional copula is a function C: [0,1]^k → [0,1] that is a CDF with uniform marginals [1].

So now that we’ve defined what a copula is, lets take a look at the form of some commonly used ones.

The Gaussian copula takes the form:

C(u_1, …, u_k) = Φ_R(Φ^-1(u_1), …, Φ^-1(u_k))

Where:

Φ_R is the joint standard normal CDF with correlation matrix R, where:

ρ_(i,j) is the correlation between random variables X_i and X_j.

Φ^-1 is the inverse standard normal CDF.

It’s important to note that the Pearson Correlation coefficient is a bad choice for ρ row here, a rank based correlation such as Spearman’s ρ or Kendall’s τ are better options since they are scale invariant and do not require linearity.

The Gaussian copula is a helpful tool and relatively easy to fit, even for relatively large numbers of RVs with different marginal distributions. Gaussian copulas, however, do not do a good job capturing tail dependence, and can cause one to underestimate the risk of simultaneously being in the tails of each distribution. The failure of the Gaussian copula to capture tail dependence has been blamed for contributing to the 2008 financial crisis, after it was widely used by investment firms on Wall Street (this is actually a really interesting story; for more details check out this article from the Financial Times).

Tail dependency can be quantified by the coefficients of upper and lower tail dependence (λ_u and λ_l), defined for a bivariate copula as:

λ_u = lim(q→1) P(U_2 > q | U_1 > q),  λ_l = lim(q→0) P(U_2 ≤ q | U_1 ≤ q)

Like the Student t distribution, the Student t copula has a similar shape to the Gaussian copula but with fatter tails, so it can do a slightly better job capturing tail dependence. It takes the form:

C(u_1, …, u_k) = t_ν,Σ(t^-1_ν(u_1), …, t^-1_ν(u_k))

Where:

t_ν,Σ is the joint Student t CDF, Σ is the covariance matrix (again, don't use the Pearson correlation coefficient), ν is the degrees of freedom, and t^-1_ν is the inverse Student t CDF.

Archimedean copulas are a family of copulas with the following form:

C(u_1, …, u_k) = ψ^-1(ψ(u_1|θ) + … + ψ(u_k|θ) | θ)

where ψ(u|θ) is called the generator function and θ is the parameter of the copula.

Three common Archimedean copulas are:

- Gumbel: which is good at modeling upper tail dependence
- Clayton: which is good at modeling lower tail dependence
- Frank: has lighter tails and more density in the middle
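As a concrete instance, here is a small sketch (my own illustration) of the bivariate Clayton copula. Its generator is ψ(t|θ) = (t^-θ - 1)/θ, which plugged into the Archimedean form above gives C(u,v) = (u^-θ + v^-θ - 1)^(-1/θ); sampling uses the standard conditional-inversion method, and Kendall's τ for Clayton is θ/(θ+2).

```python
import numpy as np
from scipy.stats import kendalltau

theta = 2.0  # Clayton parameter (theta > 0)

def psi(t):
    # Clayton generator
    return (t**(-theta) - 1.0) / theta

def psi_inv(s):
    # Inverse generator
    return (1.0 + theta * s)**(-1.0 / theta)

def clayton_cdf(u, v):
    # C(u, v) = psi^-1(psi(u) + psi(v)) = (u^-theta + v^-theta - 1)^(-1/theta)
    return psi_inv(psi(u) + psi(v))

# Sample via conditional inversion: given U = u and W ~ Uniform(0, 1),
# invert the conditional CDF of V given U = u at W
rng = np.random.default_rng(3)
u = rng.uniform(size=20_000)
w = rng.uniform(size=20_000)
v = (u**(-theta) * (w**(-theta / (1.0 + theta)) - 1.0) + 1.0)**(-1.0 / theta)

print(clayton_cdf(0.3, 1.0))  # uniform marginals imply C(u, 1) = u
print(kendalltau(u, v)[0])    # should be near theta/(theta+2) = 0.5
```

A scatter plot of (u, v) from this sketch shows the clustering in the lower-left corner that makes Clayton a good model for lower tail dependence.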

It’s important to note that these copulas are usually employed for bivariate cases, for more than two variables, the Gaussian or Student t copulas are usually used.

A comparison of the shape of the copulas above can be found in Figure 1.

There are numerous packages for modeling copulas in Matlab, Python and R.

In Matlab, the *Statistics and Machine Learning Toolbox* has some helpful functions. You can find some well narrated examples of copulas here. There's also the Multivariate Copula Analysis Toolbox from UC Irvine.

In Python, the *copulalib* package can be used to model the Clayton, Frank and Gumbel copulas. The *statsmodels* package also has copulas built in. I found this post on the *copulalib* package; it has an attached Jupyter notebook with nice coding examples and figures. Here's a post on the *statsmodels* copula implementation, along with an example Jupyter notebook.
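If you'd rather avoid an extra dependency, a Gaussian copula can also be sampled with numpy and scipy alone. The sketch below (the marginals are chosen arbitrarily, purely for illustration) draws correlated standard normals, maps each margin through the standard normal CDF to get the copula sample, and then applies the inverse CDFs of the desired marginals.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rho = 0.8
cov = [[1.0, rho], [rho, 1.0]]

# 1. Draw correlated standard normals
z = rng.multivariate_normal([0.0, 0.0], cov, size=50_000)

# 2. Push each margin through the standard normal CDF: the result is a
#    sample from the Gaussian copula (uniform margins, normal dependence)
u = stats.norm.cdf(z)

# 3. Apply the inverse CDFs of whatever marginals you want
x = stats.gamma.ppf(u[:, 0], a=2.0)
y = stats.lognorm.ppf(u[:, 1], s=0.5)

# Rank correlation survives the monotone transforms; for a Gaussian copula,
# Spearman's rho is (6/pi) * arcsin(rho/2), about 0.79 here
print(stats.spearmanr(x, y)[0])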

Finally, here’s an example of coding copulas in R using the copulas library.

- Rüschendorf, L. (2013). *Mathematical Risk Analysis: Dependence, Risk Bounds, Optimal Allocations and Portfolios*. Berlin: Springer.

I’d like to note that the majority of the content in the post came from Scott Steinschneider’s excellent course, BEE 6940: Multivariate Analysis, at Cornell.

Filed under: Uncategorized

The workbench is readily available through pip, but it requires ipyparallel and mpld3 (both available through conda), SALib (via pip), and optionally Platypus (pip install directly from the GitHub repo).

As a starting point, I will use the Direct Policy Search example that is available for Rhodium (Quinn et al 2017). I will adapt this code to work with the workbench. In this way, I can explain the workbench, as well as highlight some of the main differences between the workbench and Rhodium.

import math

import numpy as np
from scipy.optimize import brentq as root

# A function for evaluating our cubic DPS. This is based on equation (12)
# from [1].
def evaluateCubicDPS(policy, current_value):
    value = 0

    for i in range(policy["length"]):
        rbf = policy["rbfs"][i]
        value += rbf["weight"] * abs((current_value - rbf["center"]) / rbf["radius"])**3

    value = min(max(value, 0.01), 0.1)
    return value

# Construct the lake problem
def lake_problem(policy,        # the DPS policy
                 b=0.42,        # decay rate for P in lake (0.42 = irreversible)
                 q=2.0,         # recycling exponent
                 mean=0.02,     # mean of natural inflows
                 stdev=0.001,   # standard deviation of natural inflows
                 alpha=0.4,     # utility from pollution
                 delta=0.98,    # future utility discount rate
                 nsamples=100,  # monte carlo sampling of natural inflows
                 steps=100):    # the number of time steps (e.g., days)
    Pcrit = root(lambda x: x**q/(1+x**q) - b*x, 0.01, 1.5)
    X = np.zeros((steps,))
    decisions = np.zeros((steps,))
    average_daily_P = np.zeros((steps,))
    reliability = 0.0
    utility = 0.0
    inertia = 0.0

    for _ in range(nsamples):
        X[0] = 0.0
        natural_inflows = np.random.lognormal(
                math.log(mean**2 / math.sqrt(stdev**2 + mean**2)),
                math.sqrt(math.log(1.0 + stdev**2 / mean**2)),
                size=steps)

        for t in range(1, steps):
            decisions[t-1] = evaluateCubicDPS(policy, X[t-1])
            X[t] = (1-b)*X[t-1] + X[t-1]**q/(1+X[t-1]**q) + decisions[t-1] + natural_inflows[t-1]
            average_daily_P[t] += X[t]/float(nsamples)

        reliability += np.sum(X < Pcrit)/float(steps)
        utility += np.sum(alpha*decisions*np.power(delta, np.arange(steps)))
        inertia += np.sum(np.diff(decisions) > -0.01)/float(steps-1)

    max_P = np.max(average_daily_P)
    reliability /= float(nsamples)
    utility /= float(nsamples)
    inertia /= float(nsamples)

    return (max_P, utility, inertia, reliability)

The formulation of the decision rule assumes that `policy` is a dict, which is composed of a set of variables generated either through sampling or through optimization. This is relatively straightforward to do in Rhodium, but not so easy in the workbench. In the workbench, a policy is a composition of policy levers, where each policy lever is either a range of real values, a range of integers, or an unordered set of categories. To adapt the DPS version of the lake problem to work with the workbench, we first have to replace the policy dict with the different variables explicitly.

def get_antropogenic_release(xt, c1, c2, r1, r2, w1):
    '''
    Parameters
    ----------
    xt : float
         pollution in lake at time t
    c1 : float
         center rbf 1
    c2 : float
         center rbf 2
    r1 : float
         radius rbf 1
    r2 : float
         radius rbf 2
    w1 : float
         weight of rbf 1

    note:: w2 = 1 - w1

    '''
    rule = w1*(abs((xt-c1)/r1))**3 + (1-w1)*(abs((xt-c2)/r2))**3
    at = min(max(rule, 0.01), 0.1)
    return at

Next, we need to adapt the lake_problem function itself to use this adapted version of the decision rule. This requires two changes: replace policy in the function signature with the actual underlying parameters `c1`, `c2`, `r1`, `r2`, and `w1`, and use these when calculating the anthropogenic pollution rate.

def lake_model(b=0.42, q=2.0, mean=0.02, stdev=0.001, alpha=0.4,
               delta=0.98, c1=0.25, c2=0.25, r1=0.5, r2=0.5, w1=0.5,
               nsamples=100, steps=100):
    Pcrit = root(lambda x: x**q/(1+x**q) - b*x, 0.01, 1.5)
    X = np.zeros((steps,))
    decisions = np.zeros((steps,))
    average_daily_P = np.zeros((steps,))
    reliability = 0.0
    utility = 0.0
    inertia = 0.0

    for _ in range(nsamples):
        X[0] = 0.0
        natural_inflows = np.random.lognormal(
                math.log(mean**2 / math.sqrt(stdev**2 + mean**2)),
                math.sqrt(math.log(1.0 + stdev**2 / mean**2)),
                size=steps)

        for t in range(1, steps):
            decisions[t-1] = get_antropogenic_release(X[t-1], c1, c2, r1, r2, w1)
            X[t] = (1-b)*X[t-1] + X[t-1]**q/(1+X[t-1]**q) + decisions[t-1] + natural_inflows[t-1]
            average_daily_P[t] += X[t]/float(nsamples)

        reliability += np.sum(X < Pcrit)/float(steps)
        utility += np.sum(alpha*decisions*np.power(delta, np.arange(steps)))
        inertia += np.sum(np.diff(decisions) > -0.01)/float(steps-1)

    max_P = np.max(average_daily_P)
    reliability /= float(nsamples)
    utility /= float(nsamples)
    inertia /= float(nsamples)

    return (max_P, utility, inertia, reliability)

This version of the code can already be combined with the workbench. However, we can clean it up a bit more if we want to. Note that there are two for loops in the lake model. The outer loop generates stochastic realizations of the natural inflow, while the inner loop calculates the dynamics of the system given a stochastic realization. The workbench can be made responsible for this outer loop.

A quick note on terminology is in order here. I have a background in transport modeling. Here we often use discrete event simulation models. These are intrinsically stochastic models. It is standard practice to run these models several times and take descriptive statistics over the set of runs. In discrete event simulation, and also in the context of agent based modeling, this is known as running replications. The workbench adopts this terminology and draws a sharp distinction between designing experiments over a set of deeply uncertain factors, and performing replications of each experiment to cope with stochastic uncertainty.

Some other notes on the code:

* To aid in debugging functions, it is good practice to make a function deterministic. In this case we can quite easily achieve this by including an optional argument for setting the seed of the random number generator.

* I have slightly changed the formulation of inertia, which is closer to the mathematical formulation used in the various papers.

* I have changed the for loop over t to get rid of virtually all of the t-1 formulations

from __future__ import division  # python2
import math

import numpy as np
from scipy.optimize import brentq


def lake_model(b=0.42, q=2.0, mean=0.02, stdev=0.001, alpha=0.4,
               delta=0.98, c1=0.25, c2=0.25, r1=0.5, r2=0.5, w1=0.5,
               nsamples=100, steps=100, seed=None):
    '''runs the lake model for 1 stochastic realisation using specified
    random seed.

    Parameters
    ----------
    b : float
        decay rate for P in lake (0.42 = irreversible)
    q : float
        recycling exponent
    mean : float
        mean of natural inflows
    stdev : float
        standard deviation of natural inflows
    alpha : float
        utility from pollution
    delta : float
        future utility discount rate
    c1 : float
    c2 : float
    r1 : float
    r2 : float
    w1 : float
    steps : int
        the number of time steps (e.g., days)
    seed : int, optional
        seed for the random number generator

    '''
    np.random.seed(seed)

    Pcrit = brentq(lambda x: x**q/(1+x**q) - b*x, 0.01, 1.5)
    X = np.zeros((steps,))
    decisions = np.zeros((steps,))

    X[0] = 0.0
    natural_inflows = np.random.lognormal(
            math.log(mean**2 / math.sqrt(stdev**2 + mean**2)),
            math.sqrt(math.log(1.0 + stdev**2 / mean**2)),
            size=steps)

    for t in range(steps-1):
        decisions[t] = get_antropogenic_release(X[t], c1, c2, r1, r2, w1)
        X[t+1] = (1-b)*X[t] + X[t]**q/(1+X[t]**q) + decisions[t] + natural_inflows[t]

    reliability = np.sum(X < Pcrit)/steps
    utility = np.sum(alpha*decisions*np.power(delta, np.arange(steps)))

    # note that I have slightly changed this formulation to retain
    # consistency with the equations in the papers
    inertia = np.sum(np.abs(np.diff(decisions)) < 0.01)/(steps-1)

    return X, utility, inertia, reliability
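A quick aside on the two arguments passed to the lognormal sampler above: they convert the desired real-space mean and standard deviation of the inflows into the mu and sigma of the underlying normal. The sketch below (my own sanity check, not part of the model) verifies that draws come out with the requested moments.

```python
import math
import numpy as np

mean, stdev = 0.02, 0.001  # desired real-space mean and std of the inflows

# Underlying normal parameters, exactly as computed inside lake_model
mu = math.log(mean**2 / math.sqrt(stdev**2 + mean**2))
sigma = math.sqrt(math.log(1.0 + stdev**2 / mean**2))

rng = np.random.default_rng(1)
draws = rng.lognormal(mu, sigma, size=1_000_000)
print(draws.mean(), draws.std())  # close to 0.02 and 0.001
```

This is worth checking whenever you re-parameterize a distribution by hand; getting mu and sigma wrong silently shifts every inflow in the simulation.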

Now we are ready to connect this model to the workbench. This is fairly similar to how you would do it with Rhodium. We have to specify the uncertainties, the outcomes, and the policy levers. For the uncertainties and the levers, we can use real valued parameters, integer valued parameters, and categorical parameters. For outcomes, we can use either scalar, single valued outcomes or time series outcomes. For convenience, we can also explicitly control constants in case we want to have them set to a value different from their default value.

In this particular case, we are running the replications with the workbench. We still have to specify the descriptive statistics we would like to gather over the set of replications. For this, we can pass a function to an outcome. This function will be called with the results over the set of replications.

import numpy as np

from ema_workbench import (RealParameter, ScalarOutcome, Constant,
                           ReplicatorModel)

model = ReplicatorModel('lakeproblem', function=lake_model)
model.replications = 150

# specify uncertainties
model.uncertainties = [RealParameter('b', 0.1, 0.45),
                       RealParameter('q', 2.0, 4.5),
                       RealParameter('mean', 0.01, 0.05),
                       RealParameter('stdev', 0.001, 0.005),
                       RealParameter('delta', 0.93, 0.99)]

# set levers
model.levers = [RealParameter("c1", -2, 2),
                RealParameter("c2", -2, 2),
                RealParameter("r1", 0, 2),
                RealParameter("r2", 0, 2),
                RealParameter("w1", 0, 1)]

def process_p(values):
    values = np.asarray(values)
    values = np.mean(values, axis=0)
    return np.max(values)

# specify outcomes
model.outcomes = [ScalarOutcome('max_P', kind=ScalarOutcome.MINIMIZE,
                                function=process_p),
                  ScalarOutcome('utility', kind=ScalarOutcome.MAXIMIZE,
                                function=np.mean),
                  ScalarOutcome('inertia', kind=ScalarOutcome.MINIMIZE,
                                function=np.mean),
                  ScalarOutcome('reliability', kind=ScalarOutcome.MAXIMIZE,
                                function=np.mean)]

# override some of the defaults of the model
model.constants = [Constant('alpha', 0.41),
                   Constant('steps', 100)]

Now that we have specified the model with the workbench, we are ready to perform experiments on it. We can use evaluators to distribute these experiments either over multiple cores on a single machine, or over a cluster using `ipyparallel`. Using any parallelization is an advanced topic, in particular if you are on a Windows machine. The code as presented here will run fine in parallel on a Mac or Linux machine. If you are trying to run this in parallel using multiprocessing on a Windows machine from within a Jupyter notebook, it won't work. The solution is to move `lake_model` and `get_antropogenic_release` to a separate Python module and import the lake model function into the notebook.

Another common practice when working with the exploratory modeling workbench is to turn on the logging functionality that it provides. This will report on the progress of the experiments, as well as provide more insight into what is happening in particular in case of errors.

If we want to perform experiments on the model we have just defined, we can use the perform_experiments method on the evaluator, or the stand-alone perform_experiments function. We can perform experiments over the uncertainties and/or over the levers. Any policy is evaluated over each of the scenarios, so if we want to use 100 scenarios and 10 policies, we will end up performing 100 * 10 = 1000 experiments. By default, the workbench uses Latin hypercube sampling both for sampling over levers and for sampling over uncertainties. However, the workbench also offers support for full factorial, partial factorial, and Monte Carlo sampling, as well as wrappers for the various sampling schemes provided by SALib.

from ema_workbench import (MultiprocessingEvaluator, ema_logging,
                           perform_experiments)
ema_logging.log_to_stderr(ema_logging.INFO)

with MultiprocessingEvaluator(model) as evaluator:
    results = evaluator.perform_experiments(scenarios=10, policies=10)
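Latin hypercube sampling itself is easy to illustrate outside the workbench. The sketch below uses scipy's `qmc` module (my own illustration; the workbench does its own sampling internally) to draw 10 scenarios over the 5 uncertainty ranges defined above: each dimension is split into 10 equal-probability strata, and every stratum is sampled exactly once.

```python
import numpy as np
from scipy.stats import qmc

n, d = 10, 5  # 10 scenarios over 5 uncertain factors
sampler = qmc.LatinHypercube(d=d, seed=7)
unit_sample = sampler.random(n)  # n x d points in the unit hypercube

# Scale to the uncertainty ranges of the lake model defined above
l_bounds = [0.1, 2.0, 0.01, 0.001, 0.93]  # b, q, mean, stdev, delta
u_bounds = [0.45, 4.5, 0.05, 0.005, 0.99]
scenarios = qmc.scale(unit_sample, l_bounds, u_bounds)

# The defining property: in every dimension the n points occupy the n
# equal-probability strata [0, 1/n), ..., [(n-1)/n, 1) exactly once
strata = np.sort(np.floor(unit_sample * n).astype(int), axis=0)
print(strata[:, 0])  # 0 through 9, each exactly once
```

This stratification is why a small Latin hypercube design covers each uncertain range far more evenly than plain Monte Carlo sampling with the same number of points.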

Similarly, we can easily use the workbench to search for a good candidate strategy. This requires that platypus is installed. If platypus is installed, we can simply use the optimize method. By default, the workbench will use $\epsilon$-NSGAII. The workbench can be used to search over the levers in order to find a good candidate strategy as is common in Many-Objective Robust Decision Making. The workbench can also be used to search over the uncertainties in order to find for example the worst possible outcomes and the conditions under which they appear. This is a form of worst case discovery. The optimize method takes an optional reference argument. This can be used to set the scenario for which you want to find good policies, or for setting the policy for which you want to find the worst possible outcomes. This makes implementing the approach suggested in Watson & Kasprzyk (2017) very easy.

with MultiprocessingEvaluator(model) as evaluator:
    results = evaluator.optimize(nfe=1000, searchover='levers',
                                 epsilons=[0.1,]*len(model.outcomes))

A third possibility is to perform robust optimization. In this case, the search takes place over the levers, but a given policy is then evaluated for a set of scenarios, and its performance is defined over this set. To do this, we need to explicitly define robustness. For this, we can use the outcome object we have used before. In the example below we define robustness as the worst 10th percentile over the set of scenarios. We need to pass a `variable_name` argument to explicitly link outcomes of the model to the robustness metrics.

import functools

percentile10 = functools.partial(np.percentile, q=10)
percentile90 = functools.partial(np.percentile, q=90)

MAXIMIZE = ScalarOutcome.MAXIMIZE
MINIMIZE = ScalarOutcome.MINIMIZE
robustnes_functions = [ScalarOutcome('90th percentile max_p', kind=MINIMIZE,
                                     variable_name='max_P',
                                     function=percentile90),
                       ScalarOutcome('10th percentile reliability', kind=MAXIMIZE,
                                     variable_name='reliability',
                                     function=percentile10),
                       ScalarOutcome('10th percentile inertia', kind=MAXIMIZE,
                                     variable_name='inertia',
                                     function=percentile10),
                       ScalarOutcome('10th percentile utility', kind=MAXIMIZE,
                                     variable_name='utility',
                                     function=percentile10)]

Given the specification of the robustness function, the remainder is straightforward and analogous to normal optimization.

n_scenarios = 200
scenarios = sample_uncertainties(lake_model, n_scenarios)
nfe = 100000

with MultiprocessingEvaluator(lake_model) as evaluator:
    robust_results = evaluator.robust_optimize(robustnes_functions, scenarios,
                                               nfe=nfe,
                                               epsilons=[0.05,]*len(robustnes_functions))

This blog post has introduced the exploratory modeling workbench and has shown its basic functionality for sampling or searching over uncertainties and levers. In subsequent blog posts, I will take a more in-depth look at this functionality, as well as demonstrate how the workbench facilitates the entire Many Objective Robust Decision Making process.

Filed under: Code Examples, EMAworkbench, Many Objective Robust Decision Making, Python, Software, Tutorials

Here are some simple tips to make your C/C++ code run even faster, and how to get some advice about further performance improvements. The last idea (data locality) is transferable to Python and other languages.

**Most important trick**

Improve your algorithm. Thinking about whether there is a simpler way of doing what you coded may reduce your algorithm's complexity (say, from n^3 to n*log(n)), which would:

- yield a great speed-up when you have lots of data or need to run a computation several times in a row, and
- make you look smarter.
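As a toy illustration of this point (my own, and in Python, since the idea is language-independent): the same question, "does this list contain a duplicate?", answered by an O(n^2) pairwise scan and by an O(n log n) sort-then-scan.

```python
import random
import time

def has_duplicate_quadratic(values):
    # Compare every pair: O(n^2)
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] == values[j]:
                return True
    return False

def has_duplicate_nlogn(values):
    # Sort once, then compare adjacent elements: O(n log n)
    ordered = sorted(values)
    return any(a == b for a, b in zip(ordered, ordered[1:]))

data = random.sample(range(10**6), 3000)  # distinct values by construction

start = time.perf_counter()
slow = has_duplicate_quadratic(data)
t_slow = time.perf_counter() - start

start = time.perf_counter()
fast = has_duplicate_nlogn(data)
t_fast = time.perf_counter() - start

print(slow, fast)       # both False
print(t_slow > t_fast)  # the n log n version wins comfortably
```

No compiler flag below will buy you the kind of speed-up that this change in algorithm does, which is why it comes first.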

**Compiler flags**

First and foremost, let the people who know what they are doing (compiler developers) do the work for you by calling the compiler with the appropriate flags. There is an incredible number of flags you can use, but I would say the ones you should have on whenever possible are -O3 and -march=native.

The optimization flags (-O1 to -O3, the latter more aggressive than the former) will perform a series of modifications to your code behind the scenes to speed it up, sometimes by more than an order of magnitude. The issue is that these modifications may make your code behave differently than you expected, so it's always good to do a few smaller runs with -O0 and -O3 and compare their results before getting into production mode.

The -march=native flag will make the compiler fine-tune your code to the processor it is being compiled on (conversely, -march=haswell would fine-tune it to Haswell architectures, for example). This is great if you are only going to run your code on your own machine or on another machine known to have a similar architecture, but if you try to run the binary on a machine with a different architecture, especially an older one, you may end up with illegal instruction errors.

**Restricted pointer array**

When declaring a pointer array that you are sure will not be subject to pointer aliasing (namely, there will be no other pointer pointing to the same memory address), you can declare that pointer as a restrict pointer, as below:

- GCC: double* __restrict__ my_pointer_array
- Intel compiler: double* restrict my_pointer_array

This will let the compiler know that it can change the order of certain operations involving my_pointer_array to make your code faster, without having to preserve a read/write order that might otherwise change your results. If you want to use the restrict qualifier with the Intel compiler, the flag -restrict must be passed as an argument to the compiler.

**Aligned pointer array**

By aligning an array, you are making sure the data lies in the ideal location in memory for the processor to fetch it and perform the calculations it needs based on that array. To help your compiler optimize your data alignment, you need to (1) align your array by a specific number of bytes when it is declared, and (2) tell the compiler the array is aligned when using it to perform calculations (the compiler has no idea whether or not arrays received as function arguments are aligned). Below are examples in C, although the same ideas apply to C++ as well.

**GCC**

#include <stdio.h>
#include <omp.h>

#define SIZE_ARRAY 100000
#define ALIGN 64

void sum(double *__restrict__ a, double *__restrict__ b,
         double *__restrict__ c, int n)
{
    a = (double*) __builtin_assume_aligned(a, ALIGN);
    b = (double*) __builtin_assume_aligned(b, ALIGN);
    c = (double*) __builtin_assume_aligned(c, ALIGN);

    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

int main(void)
{
    double a[SIZE_ARRAY] __attribute__((aligned(ALIGN)));
    double b[SIZE_ARRAY] __attribute__((aligned(ALIGN)));
    double c[SIZE_ARRAY] __attribute__((aligned(ALIGN)));

    for (int i = 0; i < SIZE_ARRAY; ++i) {
        a[i] = 5.;
        b[i] = 2.;
    }

    double start_time = omp_get_wtime();
    sum(a, b, c, SIZE_ARRAY);
    double time = omp_get_wtime() - start_time;

    printf("%0.6fs", time);
}

**Intel compiler**

#include <stdio.h>
#include <omp.h>

#define SIZE_ARRAY 100000
#define ALIGN 64

void sum(double* restrict a, double* restrict b, double* restrict c, int n)
{
    __assume_aligned(a, ALIGN);
    __assume_aligned(b, ALIGN);
    __assume_aligned(c, ALIGN);

    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

int main(void)
{
    __declspec(align(ALIGN)) double a[SIZE_ARRAY];
    __declspec(align(ALIGN)) double b[SIZE_ARRAY];
    __declspec(align(ALIGN)) double c[SIZE_ARRAY];

    for (int i = 0; i < SIZE_ARRAY; ++i) {
        a[i] = 5.;
        b[i] = 2.;
    }

    double start_time = omp_get_wtime();
    sum(a, b, c, SIZE_ARRAY);
    double time = omp_get_wtime() - start_time;

    printf("%0.6fs", time);
}

Edit: In a comment to this post, Krister Walfridsson not only caught an issue with my GCC code, for which I thank him, but also showed the differences in the machine code generated with and without alignment.

**Data Locality**

Computers are physical things, which means that data is physically stored and must be physically moved around in memory, and between cache and processor, in order to be used in calculations. If your data is scattered across memory (e.g., in multiple pointer arrays in different parts of memory), the processor has to reach out to several parts of memory to fetch all of it before performing any computations. By laying your data out intelligently in memory, you ensure that all the data required for each computation is stored close together and is in cache at the same time, which becomes even more important if your code uses too much data to fit in the cache at once.

To make your processor's life easier, it is a good idea to ensure that all data required for a calculation step is close together. For example, if a given computation requires three arrays of fixed sizes, it is a good idea to merge them into one long array, as in the example below for the Intel compiler.

#include <stdio.h>
#include <omp.h>

#define SIZE_ARRAY 100000
#define ALIGN 64

void sum(double* restrict a, double* restrict b, double* restrict c, int n)
{
    __assume_aligned(a, ALIGN);
    __assume_aligned(b, ALIGN);
    __assume_aligned(c, ALIGN);
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

int main(void)
{
    /* one contiguous block holding a, b and c back to back */
    __declspec(align(ALIGN)) double abc[3 * SIZE_ARRAY];

    for (int i = 0; i < SIZE_ARRAY; ++i) {
        abc[i] = 5.;               /* the "a" section */
        abc[SIZE_ARRAY + i] = 2.;  /* the "b" section */
    }

    double start_time = omp_get_wtime();
    sum(abc, abc + SIZE_ARRAY, abc + 2 * SIZE_ARRAY, SIZE_ARRAY);
    double time = omp_get_wtime() - start_time;
    printf("%0.6fs", time);
}

Or, since c[i] depends only on a[i] and b[i], we can interleave the values of a, b and c to ensure that every computation is performed on data that sits right next to each other in memory:

#include <stdio.h>
#include <omp.h>

#define SIZE_ARRAY 100000
#define ALIGN 64
#define STRIDE 3

void sum(double* restrict abc, int n, int stride)
{
    __assume_aligned(abc, ALIGN);
    /* each triple (a_i, b_i, c_i) is adjacent in memory */
    for (int i = 0; i < n; i += stride)
        abc[i + 2] = abc[i] + abc[i + 1];
}

int main(void)
{
    __declspec(align(ALIGN)) double abc[3 * SIZE_ARRAY];

    for (int i = 0; i < 3 * SIZE_ARRAY; i += STRIDE) {
        abc[i] = 5.;
        abc[i + 1] = 2.;
    }

    double start_time = omp_get_wtime();
    sum(abc, 3 * SIZE_ARRAY, STRIDE);  /* n is the total length of abc */
    double time = omp_get_wtime() - start_time;
    printf("%0.6fs", time);
}

**Conclusion**

In a class project in which we had to write C code to perform matrix multiplication, the improvements suggested here sped up our code by a factor of 5 to 10. The idea of data locality also transfers to other languages, such as Python.
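To illustrate, here is a minimal pure-Python sketch of the interleaved layout from the strided C example above (the name sum_interleaved is my own; in practice a contiguous NumPy array is the usual way to get these locality benefits in Python):

```python
# One flat list stores triples (a_i, b_i, c_i) back to back,
# mirroring the strided C example above.
SIZE_ARRAY = 100000
STRIDE = 3

abc = [0.0] * (STRIDE * SIZE_ARRAY)
for i in range(0, STRIDE * SIZE_ARRAY, STRIDE):
    abc[i] = 5.0      # a_i
    abc[i + 1] = 2.0  # b_i

def sum_interleaved(abc, stride):
    """c_i = a_i + b_i, with all three values adjacent in the list."""
    for i in range(0, len(abc), stride):
        abc[i + 2] = abc[i] + abc[i + 1]

sum_interleaved(abc, STRIDE)
print(abc[2])  # 7.0
```

The layout is the same as in C; whether it actually speeds up pure Python depends on the interpreter, since Python lists store pointers to boxed objects rather than raw doubles.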

Filed under: c, C++, Code Examples, Python, Tips and Tricks, Tutorials, Uncategorized

Being able to automate data retrieval helps you avoid irritating, repetitive tasks. Often, these data files are separated by a range of times, site locations, or even measurement types, making them cumbersome to download manually. This blog post outlines how to download multiple zipped csv files from a webpage using both R and Python.

We will specifically explore downloading historical hourly locational marginal pricing (LMP) data files from PJM, a regional transmission organization coordinating a wholesale electricity market across 13 Midwestern, Southeastern, and Northeastern states. The files in question are located here.

A distinct advantage of using Python over R is being able to write fewer lines for the same results. However, the differences in the time required to execute the code are generally minimal between the two languages.

The URL for the most recent file from the webpage, for August 2017, has the following format: “http://www.pjm.com/pub/account/lmpmonthly/201708-da.zip”

Notably, the URL is straightforward in its structure and allows for a fairly simple approach in attempting to create this URL from scratch.

To begin, we can deconstruct the URL into three aspects: the base URL, the date, and the file extension.

The base URL is fairly straightforward because it is constant: “http://www.pjm.com/pub/account/lmpmonthly/ ”

As is the file extension: “-da.zip”

However, the largest issue presented is how to approach recreating the date: “201708”

It is in the form “yyyymm”, requiring a reconstruction of a 4-digit year (luckily these records don’t go back to the first millennium) and a 2-digit month.

In Python, this is easy to reconstruct using the .format() string method. All that is required is playing around with the format-specification mini-language within the curly braces. Note that the curly braces are inserted within the string in the following form: {index:modifier}
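For instance, a quick check of the zero-padding modifier, where the 02d spec pads an integer out to two digits:

```python
# "{0}" takes the first argument as-is; "{1:02d}" zero-pads the second to two digits
print("{0}{1:02d}".format(2017, 8))  # 201708
```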

In this example, we follow the format above, deconstruct the URL, and then reassemble it within one line of Python. Assigning this string to a variable allows for easy iteration through the code by simply changing the inputs. So let's create a quick function named url_creator that takes a date in the format (yyyy, mm) and returns the constructed URL as a string:

def url_creator(date):
    """Input is a list [yyyy, mm]."""
    yyyy, mm = date
    return "http://www.pjm.com/pub/account/lmpmonthly/{0}{1:02d}-da.zip".format(yyyy, mm)

To quickly generate the dates we're interested in, creating a list with entries in the format (yyyy, mm) for each month is easy enough using a couple of nested for loops. If you need different dates, you can alter the loop or manually create a list of lists to accommodate your needs:

#for-loop method
dates = []
for i in range(17):
    for j in range(12):
        dates.append([2001 + i, j + 1])

#alternative manual method
dates = [[2011, 1], [2011, 2], [2011, 3]]
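The same list can also be built in a single line with a list comprehension, equivalent to the nested loop above (covering 2001 through 2017):

```python
# 17 years x 12 months, in the same [yyyy, mm] format as before
dates = [[year, month] for year in range(2001, 2018) for month in range(1, 13)]
print(len(dates))  # 204
```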

Now that we can create the URL for this specific file type by calling the url_creator function, and have all of the dates we may be interested in, we need to be able to access, download, unzip/extract, and save these files. Utilizing the urllib library, we can request and download each zipped file. Note that urllib.request.urlretrieve only retrieves the file and does not attempt to read it. While we could simply save the zipped file at this step, it is preferable to extract it to prevent headaches down the line.

Utilizing the zipfile library, we can extract the downloaded files to a specified folder. Notably, I use the operating system interface method os.getcwd while extracting, so that the resulting csv file is saved into the directory where this script is running. Following this, the extracted file is closed.

import zipfile, urllib, os
from urllib.request import Request, urlopen, urlretrieve

for date in dates:
    baseurl = url_creator(date)
    # download the zip to a temporary local file
    local_filename, headers = urllib.request.urlretrieve(url=baseurl)
    zip_ref = zipfile.ZipFile(file=local_filename, mode='r')
    zip_ref.extractall(path=os.getcwd())  # os.getcwd() is the current working directory
    zip_ref.close()
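A slightly more defensive variant of the same loop body uses context managers, so the response and archive are closed even if extraction fails (download_and_extract is my own helper name, not part of any library):

```python
import io
import os
import zipfile
from urllib.request import urlopen

def download_and_extract(url, dest=None):
    """Hypothetical helper: fetch a zip archive from url and extract it into dest."""
    dest = dest or os.getcwd()
    with urlopen(url) as response:  # closed automatically
        data = response.read()
    with zipfile.ZipFile(io.BytesIO(data)) as archive:  # closed automatically
        archive.extractall(path=dest)
```

Calling download_and_extract(url_creator(date)) for each date then replaces the explicit ZipFile/close bookkeeping.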

At this point, the extracted csv files will be located in the directory where this script was running.

In R, we will use csv files to specify date ranges instead of the for loops shown in the Python example above. The advantage of using csv files rather than indexing i=2001:2010 is that, if you have a list of non-consecutive months or years, it is sometimes easier to make a list of the years and cycle through its elements than to figure out how to index the years directly. In this example, however, it would be just as easy to index through the years and months as through the csv files.

This first block of code sets the working directory and reads in two csv files from the directory. The first csv file contains a list of the years 2001-2010. The second CSV file lists out the months January-December as numbers.

#Set working directory
#specifically the directory where we want to "dump" the data
setwd("E:/Data/WebScraping/PJM/Markets_and_operations/Energy/Real_Time_Monthly/R")

#Read in csv files with a list of the relevant years and months
years=read.csv("Years.csv")
months=read.csv("Months.csv")

The breakdown of the URL was shown previously. To recreate it in R, we write a nested for-loop that cycles through all the years and months listed in the csv files. The indices i and j iterate through the elements of the year and month csv files, respectively. (The if statement will be discussed at the end of this section.) We then use the paste function to assemble the entire link into a single string, stored in downloadurl; the parts of the link surrounded by quotes are unchanging. The as.character function converts the elements read from the csv files into characters, and sep="" specifies that no space should separate them. The next two lines download the file into R as a temporary file.

#i indexes through the years and j indexes through the months
for (i in (1:16)){
  for (j in (1:12)){
    if (j>9){
      #download the url and save it as a temporary file
      downloadurl=paste("http://www.pjm.com/pub/account/lmpmonthly/",as.character(years[i,1]),as.character(months[j,1]),"-da.zip",sep="")
      temp="tempfile"
      download.file(downloadurl,temp)

Next, we unzip the temporary file and find the csv (which has a name of the form 200101-da.csv), reading it into the R environment as "x". We assign the name (e.g. "200101") to "x". Finally, it is written out as a csv file stored in the working directory, the temporary file is deleted, and the for-loop starts over again.

      #read the csv into the global environment
      x=read.csv((unz(temp,paste(as.character(years[i,1]),as.character(months[j,1]),"-da.csv",sep=""))))
      #Assign a name to the csv "x"
      newname=paste(as.character(years[i,1]),as.character(months[j,1]),sep="")
      assign(newname,x)
      #create a csv that is stored in your working directory
      write.csv(x,file=paste(newname,".csv",sep=""))
      unlink(temp) #delete temp file
    }
    #If j is between 1 and 9, the only difference is that a "0" has to be added
    #in front of the month number

The reason for the "if" statement is that it is not a trivial process to properly format the dates in Excel when writing the list of dates to a csv. When inputting "01", the entry is simply changed to "1" when the file is saved as a csv. However, in the names of the files, the first nine months have a 0 preceding their number, so the URL construction must be modified for these months. The downloadurl has been changed so that a 0 is added before the month number if j < 10. Aside from the hard-coded zeroes, this block of code is the same as the one above.

    else {
      downloadurl=paste("http://www.pjm.com/pub/account/lmpmonthly/",as.character(years[i,1]),"0",as.character(months[j,1]),"-da.zip",sep="")
      temp="tempfile"
      download.file(downloadurl,temp)
      x=read.csv((unz(temp,paste(as.character(years[i,1]),"0",as.character(months[j,1]),"-da.csv",sep=""))))
      newname=paste(as.character(years[i,1]),"0",as.character(months[j,1]),sep="")
      assign(newname,x)
      write.csv(x,file=paste(newname,".csv",sep=""))
      unlink(temp) #delete temp file
    }
  }
}

Many thanks to Rohini Gupta for contributing to this post by defining the problem and explaining her approach in R.

Versions: R = 3.4.1 Python = 3.6.2

To see the code in its entirety, please visit the linked GitHub Repository.

Filed under: Python, R Tagged: webscraping

How do visuals differ from visual analytics? In a scientific sense, a visual is a broad term for any picture, illustration, or graph that can be used to convey an idea. However, visual analytics is more than just generating a graph of complex data and handing it to a decision maker. Visual analytic tools help create graphs that allow the user to interact with the data, whether that involves manipulating a graph in three-dimensional space or allowing users to filter or brush for solutions that match certain criteria. Ultimately, visual analytics seeks to help in making decisions as fast as possible and to “enable learning through continual [problem] reformulation” (Woodruff et al., 2013) by presenting large data sets in an organized way so that the user can better recognize patterns and make inferences.

My goal with this blog post is to introduce two R libraries that are particularly useful to develop interactive graphs that will allow for better exploration of a three-dimensional space. I have found that documentation on these libraries and potential errors was sparse, so this post will consolidate my hours of Stack Overflow searching into a step-by-step process to produce beautiful graphs!

*R Libraries*

* Use *rgl* to create a GIF of a 3D graph

Spinning graphs can be especially useful to visualize a 3D Pareto front and make a nice visualization for a PowerPoint presentation. I will be using an example three-objective Pareto set from Julie's work on the Red River Basin for this tutorial. The script has been broken down and explained in the following sections.

#Set working directory
setwd("C:/Users/Me/Folder/Blog_Post_1")

#Read in csv of pareto set
data=read.csv("pareto.csv")

#Create three vectors for the three objectives
hydropower=data$WcAvgHydro
deficit=data$WcAvgDeficit
flood=data$WcAvgFlood

In this first block of code, the working directory is set, the data set is imported from a CSV file, and each column of the data frame is saved as a vector that is conveniently named. Now we will generate the plot.

#call the rgl library
library(rgl)

#Adjust the size of the window
par3d(windowRect=c(0,0,500,500))

If the *rgl* package isn't installed on your computer yet, simply type install.packages("rgl") into the console. Otherwise, use the library function to load the *rgl* package. The next line of code adjusts the window that the graph will pop up in. The default window is very small and, as such, the movie will have a low resolution if the window is not adjusted!

#brewer.pal comes from the RColorBrewer package
library(RColorBrewer)

#Plot the set in 3D space
plot3d(hydropower,deficit,flood,col=brewer.pal(8,"Blues"), size=2, type='s', alpha=0.75)

Let's plot these data in 3D space. The first three arguments of the plot3d function are the x, y, and z vectors, respectively. The rest of the parameters are subject to your personal preference. I used Color Brewer (install the package "RColorBrewer") to color the data points in different blue gradients; the first value is the number of colors that you want, and the second value is the color set. Color Brewer sets can be found here: http://www.datavis.ca/sasmac/brewerpal.html. My choice of colors is random, so I opted not to create a color scale. Creating a color scale is more involved in rgl: one option is to split your data into classes and use legend3d and the cut function to cut your legend into color levels, but there simply isn't an easy way to create a continuous color scale in rgl. Finally, I wanted my data points to be spheres, of size 2, that were 50% transparent, which is specified with type, size, and alpha, respectively. plot3d will open a window with your graph, and you can use your mouse to rotate it.

Now, let's make a movie of the graph. The movie3d function requires that you install ImageMagick, software that allows you to create a GIF by stitching together multiple pictures. ImageMagick also has cool functionalities like editing, resizing, and layering pictures. It can be installed through R using the first two lines of code below. Make sure not to re-run these lines once ImageMagick is installed. Note that ImageMagick doesn't have to be installed in your directory, just on your computer.

require(installr)
install.ImageMagick()

#Create a spinning movie of your plot
movie3d(spin3d(axis = c(0, 0, 1)), duration = 20, dir = getwd())

Finally, the last line of code is used to generate the movie. I have specified that I want the plot to spin about the z axis, specified a duration (you can play around with the number to see what suits your data), and that I want the movie to be saved in my current working directory. The resulting GIF is below. If the GIF has stopped running, reload the page and scroll down to this section again.

I have found that creating the movie can be a bit finicky and the last step is where errors usually occur. When you execute your code, make sure that you keep the plot window open while ImageMagick stitches together the snapshots otherwise you will get an error. If you have errors, please feel free to share because I most likely had them at one point and was able to ultimately fix them.

Overall, I found this package to be useful for a quick overview of the 3D space, but I wasn't pleased with the way the axis values and titles sometimes overlap when the graph spins. The workaround is to set the labels and title to NULL and insert your own non-moving labels and title when you add the GIF to a PowerPoint presentation.

* Use *plotly* to create an interactive scatter plot

I much prefer the *plotly* package to *rgl* for the aesthetic value, ease of creating a color scale, and the ability to mouse-over points to obtain coordinate values in a scatter plot. Plotly is an open source JavaScript graphing library but has an R API. The first step is to create a Plotly account at: https://plot.ly/. Once you have confirmed your email address, head to https://plot.ly/settings/api to get an API key. Click the “regenerate key” button and you’ll get a 20 character key that will be used to create a shareable link to your chart. Perfect, now we’re ready to get started!

setwd("C:/Users/Me/Folder/Blog_Post_1")

library(plotly)
library(ggplot2)

#Set environment variables
Sys.setenv("plotly_username"="rg727")
Sys.setenv("plotly_api_key"="insert key here")

#Read in pareto set data
pareto=read.csv("ieee_synthetic_thinned.csv")

Set the working directory, install the relevant libraries, set the environment variables and load in the data set. Be sure to insert your API key. You will need to regenerate a new key every time you make a new graph. Also, note that your data must be in the form of a data frame for plotly to work.

#Plot your data; the steps are chained together with the %>% pipe operator
plot = plot_ly(pareto, x = ~WcAvgHydro, y = ~WcAvgDeficit, z = ~WcAvgFlood,
               color = ~WcAvgFlood, colors = c('#00d6f7', '#ebfc2f')) %>%
  add_markers() %>%
  #Add a main title and axes titles
  layout(title="Pareto Set",
         scene = list(xaxis = list(title = 'Hydropower'),
                      yaxis = list(title = 'Deficit'),
                      zaxis = list(title = 'Flood')))

#call the name of your plot to make it appear in the viewer
plot

To correctly use the plotly command, the first input needed is the data frame, followed by the column names of the x,y, and z columns in the data frame. Precede each column name with a “~”.

I decided that I wanted the colors to scale with the value of the z variable. The colors were defined using color codes available at http://www.color-hex.com/. Use the layout function to add a main title and axis labels. Finally, call the name of your plot and you will see it appear in the viewer at the lower right of your screen. If your viewer shows up blank with only the color scale, click on the viewer or click "zoom". Depending on how large the data set is, it may take some time for the graph to load.

#Create a link to your chart and call it to launch the window
chart_link = api_create(plot, filename = "public-graph")
chart_link

Finally, create the chart link using the first line of code above; the next line will launch the graph in Plotly. Copy and save the URL, and anyone with it can access your graph, even without a Plotly account. Play around with the cool capabilities of my Plotly graph, like mousing over points, rotating, and zooming!

*Sources:*

https://plot.ly/r/3d-scatter-plots/

Woodruff, M.J., Reed, P.M. & Simpson, T.W. Struct Multidisc Optim (2013) 48: 201. https://doi.org/10.1007/s00158-013-0891-z

James J. Thomas and Kristin A. Cook (Ed.) (2005). *Illuminating the Path: The R&D Agenda for Visual Analytics* National Visualization and Analytics Center.

Filed under: R, Tutorials, Visualization