Directed search with the Exploratory Modeling workbench

This is the third blog in a series showcasing the functionality of the Exploratory Modeling workbench. Exploratory modeling entails investigating the way in which uncertainty and/or policy levers map to outcomes. To investigate these mappings, we can either use sampling based strategies (open exploration) or optimization based strategies (directed search) In the first blog, I gave a general overview of the workbench and showed briefly how both investigation strategies can be done. In the second blog, I demonstrated the use of the workbench for open exploration in substantial more detail. In this third blog, I will demonstrate in more detail how to use the workbench for directed search. Like in the previous two blog post, I will use the DPS version of the lake problem.

For optimization, the workbench relies on platypus. You can easily install the latest version of platypus from github using pip

pip install git+https://github.com/Project-Platypus/Platypus.git


By default, the workbench will use epsilon NSGA2, but all the other algorithms available within platypus can be used as well.

Within the workbench, optimization can be used in three ways:
* Search over decision levers for a reference scenario
* Robust search: search over decision levers for a set of scenarios
* worst case discovery: search over uncertainties for a reference policy

The search over decision levers or over uncertainties relies on the specification of the direction for each outcome of interest defined on the model. It is only possible to use ScalarOutcome objects for optimization.

Search over levers

Directed search is most often used to search over the decision levers in order to find good candidate strategies. This is for example the first step in the Many Objective Robust Decision Making process. This is straightforward to do with the workbench using the optimize method.

from ema_workbench import MultiprocessingEvaluator, ema_logging

ema_logging.log_to_stderr(ema_logging.INFO)

with MultiprocessingEvaluator(model) as evaluator:
results = evaluator.optimize(nfe=10000, searchover='levers',
epsilons=[0.1,]*len(model.outcomes),
population_size=50)


the result from optimize is a DataFrame with the decision variables and outcomes of interest. The latest version of the workbench comes with a pure python implementation of parallel coordinates plot built on top of matplotlib. It has been designed with the matplotlib and seaborn api in mind. We can use this to quickly visualize the optimization results.

from ema_workbench.analysis import parcoords

paraxes = parcoords.ParallelAxes(parcoords.get_limits(results), rot=0)
paraxes.plot(results, color=sns.color_palette()[0])
paraxes.invert_axis('max_P')
plt.show()


Note how we can flip an axis using the invert_axis method. This eases interpretation of the figure because the ideal solution in this case would be a straight line for the four outcomes of interest at the top of the figure.

Specifying constraints

In the previous example, we showed the most basic way for using the workbench to perform many-objective optimization. However, the workbench also offers support for constraints and tracking convergence. Constrains are an attribute of the optimization problem, rather than an attribute of the model as in Rhodium. Thus, we can pass a list of constraints to the optimize method. A constraint can be applied to the model input parameters (both uncertainties and levers), and/or outcomes. A constraint is essentially a function that should return the distance from the feasibility threshold. The distance should be 0 if the constraint is met.

As a quick demonstration, let’s add a constraint on the maximum pollution. This constraint applies to the max_P outcome. The example below specifies that the maximum pollution should be below 1.

from ema_workbench import MultiprocessingEvaluator, ema_logging, Constraint

ema_logging.log_to_stderr(ema_logging.INFO)

constraints = [Constraint("max pollution", outcome_names="max_P",
function=lambda x:max(0, x-1))]

with MultiprocessingEvaluator(model) as evaluator:
results = evaluator.optimize(nfe=1000, searchover='levers',
epsilons=[0.1,]*len(model.outcomes),
population_size=25, constraints=constraints)


tracking convergence

To track convergence, we need to specify which metric(s) we want to use and pass these to the optimize method. At present the workbench comes with 3 options: Hyper volume, Epsilon progress, and a class that will write the archive at each iteration to a separate text file enabling later processing. If convergence metrics are specified, optimize will return both the results as well as the convergence information.

from ema_workbench import MultiprocessingEvaluator, ema_logging
from ema_workbench.em_framework.optimization import (HyperVolume,
EpsilonProgress, )
from ema_workbench.em_framework.outcomes import Constraint

ema_logging.log_to_stderr(ema_logging.INFO)

# because of the constraint on pollution, we can specify the
# maximum easily
convergence_metrics = [HyperVolume(minimum=[0,0,0,0], maximum=[1,1,1,1]),
EpsilonProgress()]
constraints = [Constraint("max pollution", outcome_names="max_P",
function=lambda x:max(0, x-1))]

with MultiprocessingEvaluator(model) as evaluator:
results_ref1, convergence1 = evaluator.optimize(nfe=25000, searchover='levers',
epsilons=[0.05,]*len(model.outcomes),
convergence=convergence_metrics,
constraints=constraints,
population_size=100)


We can visualize the results using parcoords as before, while the convergence information is in a DataFrame making it also easy to plot.

fig, (ax1, ax2) = plt.subplots(ncols=2, sharex=True)
ax1.plot(convergence1.epsilon_progress)
ax1.set_xlabel('nr. of generations')
ax1.set_ylabel('$\epsilon$ progress')
ax2.plot(convergence1.hypervolume)
ax2.set_ylabel('hypervolume')
sns.despine()
plt.show()


Changing the reference scenario

Up till now, we have performed the optimization for an unspecified reference scenario. Since the lake model function takes default values for each of the deeply uncertain factors, these values have been implicitly assumed. It is however possible to explicitly pass a reference scenario that should be used instead. In this way, it is easy to apply the extended MORDM approach suggested by Watson and Kasprzyk (2017).

To see the effects of changing the reference scenario on the values for the decision levers found through the optimization, as well as ensuring a fair comparison with the previous results, we use the same convergence metrics and constraints from the previous optimization. Note that the constraints are in essence only a function, and don’t retain optimization specific state, we can simply reuse them. The convergence metrics, in contrast retain state and we thus need to re-instantiate them.

from ema_workbench import Scenario

reference = Scenario('reference', **dict(b=.43, q=3,mean=0.02,
stdev=0.004, delta=.94))
convergence_metrics = [HyperVolume(minimum=[0,0,0,0], maximum=[1,1,1,1]),
EpsilonProgress()]

with MultiprocessingEvaluator(model) as evaluator:
results_ref2, convergence2 = evaluator.optimize(nfe=25000, searchover='levers',
epsilons=[0.05,]*len(model.outcomes),
convergence=convergence_metrics,
constraints=constraints,
population_size=100, reference=reference)



comparing results for different reference scenarios

To demonstrate the parcoords plotting functionality in some more detail, let’s combine the results from the optimizations for the two different reference scenarios and visualize them in the same plot. To do this, we need to first figure out the limits across both optimizations. Moreover, to get a better sense of which part of the decision space is being used, let’s set the limits for the decision levers on the basis of their specified ranges instead of inferring the limits from the optimization results.

columns = [lever.name for lever in model.levers]
columns += [outcome.name for outcome in model.outcomes]
limits = {lever.name: (lever.lower_bound, lever.upper_bound) for lever in
model.levers}
limits = dict(**limits, **{outcome.name:(0,1) for outcome in model.outcomes})
limits = pd.DataFrame.from_dict(limits)
# we resort the limits in the order produced by the optimization
limits = limits[columns]

paraxes = parcoords.ParallelAxes(limits, rot=0)
paraxes.plot(results_ref1, color=sns.color_palette()[0], label='ref1')
paraxes.plot(results_ref2, color=sns.color_palette()[1], label='ref2')
paraxes.legend()
paraxes.invert_axis('max_P')
plt.show()


Robust Search

The workbench also comes with support for many objective robust optimization. In this case, each candidate solution is evaluated over a set of scenarios, and the robustness of the performance over this set is calculated. This requires specifying 2 new pieces of information:
* the robustness metrics
* the scenarios over which to evaluate the candidate solutions

The robustness metrics are simply a collection of ScalarOutcome objects. For each one, we have to specify which model outcome(s) it uses, as well as the actual robustness function. For demonstrative purposes, let’s assume we are use a robustness function using descriptive statistics: we want to maximize the 10th percentile performance for reliability, inertia, and utility, while minimizing the 90th percentile performance for max_P.

We can specify our scenarios in various ways. The simplest would be to pass the number of scenarios to the robust_optimize method. In this case for each generation a new set of scenarios is used. This can create noise and instability in the optimization. A better option is to explicitly generate the scenarios first, and pass these to the method. In this way, the same set of scenarios is used for each generation.

If we want to specify a constraint, this can easily be done. Note however, that in case of robust optimization, the constrains will apply to the robustness metrics instead of the model outcomes. They can of course still apply to the decision variables as well.

import functools
from ema_workbench import Constraint, MultiprocessingEvaluator
from ema_workbench import Constraint, ema_logging
from ema_workbench.em_framework.optimization import (HyperVolume,
EpsilonProgress)
from ema_workbench.em_framework.samplers import sample_uncertainties

ema_logging.log_to_stderr(ema_logging.INFO)

percentile10 = functools.partial(np.percentile, q=10)
percentile90 = functools.partial(np.percentile, q=90)

MAXIMIZE = ScalarOutcome.MAXIMIZE
MINIMIZE = ScalarOutcome.MINIMIZE
robustnes_functions = [ScalarOutcome('90th percentile max_p', kind=MINIMIZE,
variable_name='max_P', function=percentile90),
ScalarOutcome('10th percentile reliability', kind=MAXIMIZE,
variable_name='reliability', function=percentile10),
ScalarOutcome('10th percentile inertia', kind=MAXIMIZE,
variable_name='inertia', function=percentile10),
ScalarOutcome('10th percentile utility', kind=MAXIMIZE,
variable_name='utility', function=percentile10)]

def constraint(x):
return max(0, percentile90(x)-10)

constraints = [Constraint("max pollution",
outcome_names=['90th percentile max_p'],
function=constraint)]
convergence_metrics = [HyperVolume(minimum=[0,0,0,0], maximum=[10,1,1,1]),
EpsilonProgress()]
n_scenarios = 10
scenarios = sample_uncertainties(model, n_scenarios)

nfe = 10000

with MultiprocessingEvaluator(model) as evaluator:
robust_results, convergence = evaluator.robust_optimize(robustnes_functions,
scenarios, nfe=nfe, constraints=constraints,
epsilons=[0.05,]*len(robustnes_functions),
convergence=convergence_metrics,)

fig, (ax1, ax2) = plt.subplots(ncols=2)
ax1.plot(convergence.epsilon_progress.values)
ax1.set_xlabel('nr. of generations')
ax1.set_ylabel('$\epsilon$ progress')
ax2.plot(convergence.hypervolume)
ax2.set_ylabel('hypervolume')
sns.despine()
plt.show()


paraxes = parcoords.ParallelAxes(parcoords.get_limits(robust_results), rot=45)
paraxes.plot(robust_results)
paraxes.invert_axis('90th percentile max_p')
plt.show()


Search over uncertainties: worst case discovery

Up till now, we have focused on optimizing over the decision levers. The workbench however can also be used for worst case discovery (Halim et al, 2016). In essence, the only change is to specify that we want to search over uncertainties instead of over levers. Constraints and convergence works just as in the previous examples.

Reusing the foregoing, however, we should change the direction of optimization of the outcomes. We are no longer interested in finding the best possible outcomes, but instead we want to find the worst possible outcomes.

# change outcomes so direction is undesirable
minimize = ScalarOutcome.MINIMIZE
maximize = ScalarOutcome.MAXIMIZE

for outcome in model.outcomes:
if outcome.kind == minimize:
outcome.kind = maximize
else:
outcome.kind = minimize


We can reuse the reference keyword argument to perform worst case discovery for one of the policies found before. So, below we select solution number 9 from the pareto approximate set. We can turn this into a dict and instantiate a Policy objecti.

from ema_workbench import Policy

policy = Policy('9', **{k:v for k, v in results_ref1.loc[9].items()
if k in model.levers})

with MultiprocessingEvaluator(model) as evaluator:
results = evaluator.optimize(nfe=1000, searchover='uncertainties',
epsilons=[0.1,]*len(model.outcomes),
reference=policy)


Visualizing the results is straightforward using parcoords.

paraxes = parcoords.ParallelAxes(parcoords.get_limits(results), rot=0)
paraxes.plot(results)
paraxes.invert_axis('max_P')
plt.show()


Closing remarks

This blog showcased the functionality of the workbench for applying search based approaches to exploratory modelling. We specifically looked at the use of many-objective optimization for searching over the levers or uncertainties, as well as the use of many-objective robust optimization. This completes the overview of the functionality available in the workbench. In the next blog, I will put it all together to show how the workbench can be used to perform Many Objective Robust Decision Making.

Easy vectorized parallel plots for multiple data sets

I will share a very quick and straight-forward solution to generate parallel plots in python of multiple groups of data.   The idea is transitioning from the parallel axis plot tool  to a method that enables  the plots to be exported as a vectorized image.   You can also take a look at Matt’s python parallel.py code available in github: https://github.com/matthewjwoodruff/parallel.py .

This is the type of figure that you will get:

The previous figure was generated with the following lines of code:

import numpy as np
import pandas as pd
from pandas.tools.plotting import parallel_coordinates
import matplotlib.pyplot as plt
import seaborn

parallel_coordinates(data,'Name', color= ['#225ea8','#7fcdbb','#1d91c0'], linewidth=5, alpha=.8)
plt.ylabel('Direction of Preference $\\rightarrow$', fontsize=12)

plt.savefig('parallel_plot.svg')


Lines 1-4 are the required libraries.  I just threw in the seaborn library to give it the gray background but it is not necessary.  In the parallel_coordinates function, you need to specify the data, ‘Name’ and the color of the different groups.  You can substitute the color  variable for colormap and specify the colormap that you wish to use (e.g. colormap=’YlGnBu’).   I also specified an alpha for transparency to see overlapping lines. If you want to learn more, you can take a look at the parallel_coordinates source code.  I found this stack overflow link very useful,  it shows some examples on editing the source code to enable other capabilities.

Finally, the following snippet shows the format of the input data (the sample_data.csv  file that is read in line 7 ) :

Columns A-G the different categories to be plotted are specified (e.g. the objectives of a problem) and in Column H the names of the different data groups are specified.  And there you have it, I hope you find this plotting alternative useful.

Importing, Exporting and Organizing Time Series Data in Python – Part 1

This blog post is Part 1 of a multi-part series of posts (see here for Part 2) intended to introduce options in Python available for reading (importing) data (with particular emphasis on time series data, and how to handle Excel spreadsheets); (2) organizing time series data in Python (with emphasis on using the open-source data analysis library pandas); and (3) exporting/saving data from Python.

In modeling water resources and environmental systems, we frequently must import and export large quantities of data (particularly time series data), both to run models and to process model outputs. I will describe the approaches and tools I have found to be most useful for managing data in these circumstances.

This blog post focuses on approaches for reading (importing) time series data, with particular emphasis on how (and how not) to handle data in MS Excel spreadsheets. Future posts will cover the pandas data management/export component.

Through an example that follows, there are two main lessons I hope to convey in this post:

1. If you can store data, especially time series data, in text (.txt) or comma separated value (.csv) files, the time required for importing and exporting data will be vastly reduced compared to attempting the same thing in an Excel workbook (.xlsx file). Hence, I suggest that if you must work with an Excel workbook (.xlsx files), try to save each worksheet in the Excel workbook as a separate .csv file and use that instead. A .csv file can still be viewed in the MS Excel interface, but is much faster to import. I’ll show below how much time this can save you.
2. There are many ways to analyze the data once loaded into Python, but my suggestion is you take advantage of pandas, an open-source data analysis library in Python. This is why my example below makes use of built-in pandas features for importing data. (I will have some future posts with some useful pandas code for doing time series analysis/plotting. There are other previous pandas posts on our blog as well).

It is important at this point to note that Microsoft Excel files (.xlsx or .xls) are NOT an ideal means of storing data. I have had the great misfortune of working with Excel (and Visual Basic for Applications) intensively for many years, and it can be somewhat of a disaster for data management. I am using Excel only because the input data files for the model I work with are stored as .xlsx files, and because this is the data storage interface with which others using my Python-based model are most comfortable.

An Example:

Suppose that you wish to read in and analyze time series of 100 years of simulated daily reservoir releases from 2 reservoirs (“Reservoir_1” and “Reservoir_2”) for 100 separate simulations (or realizations). Let’s assume data for the reservoirs is stored in two separate files, either .csv or .xlsx files (e.g., “Reservoir_1.xlsx” and “Reservoir_2.xlsx”, or “Reservoir_1.csv” and “Reservoir_2.csv”).

The figure below shows a screenshot of an example layout of the file in either case (.csv or .xlsx). Feel free to make your own such file, with column names that correspond to the title of each simulation realization.

There are two import options for these data.

Option 1: Import your data directly into a dictionary of two pandas DataFrame objects, where the keys of the Python dictionary are the two reservoir names. Here is some code that does this. All you need to do is run this script in a directory where you also have your input files saved.


# Imports
import pandas as pd
from datetime import datetime
from datetime import timedelta
import time

start_time = time.time()  # Keep track of import time

start_date = datetime(1910, 1, 1)
simulation_dates = pd.date_range(start_date, start_date + timedelta(365*100 - 1))

# Specify File Type (Choices: (1) 'XLSX', or (2) 'CSV')
File_Type = 'CSV'

location_list = ['Reservoir_1', 'Reservoir_2']  # Names of data files
start = 0  # starting column of input file from which to import data
stop = 99  # ending column of input file from which to import data. Must not exceed number of columns in input file.

# Create a dictionary where keys are items in location_list, and content associated with each key is a pandas dataframe.
if File_Type is 'XLSX':
extension = '.xlsx'
Excel_data_dictionary = {name: pd.read_excel(name + extension, sheetname=None, usecols=[i for i in range(start, stop)]) for name in
location_list}
# Reset indices as desired dates
for keys in Excel_data_dictionary:
Excel_data_dictionary[keys] = Excel_data_dictionary[keys].set_index(simulation_dates)
print('XLSX Data Import Complete in %s seconds' % (time.time() - start_time))
elif File_Type is 'CSV':
extension = '.csv'
CSV_data_dictionary = {name: pd.read_csv(name + extension, usecols=[i for i in range(start, stop)]) for name in location_list}
# Reset indices as desired dates
for keys in CSV_data_dictionary:
CSV_data_dictionary[keys] = CSV_data_dictionary[keys].set_index(simulation_dates)
print('CSV Data Import Complete in %s seconds' % (time.time() - start_time))



When I run this code, the .csv data import took about 1.5 seconds on my office desktop computer (not bad), and the .xlsx import took 100 seconds (awful!)! This is why you should avoid storing data in Excel workbooks.

You can visualize the pandas DataFrame object per the image below, with a major axis that corresponds to dates, and a minor axis that corresponds to each simulation run (or realization). Note that I added code that manually sets the major axis index to date values, as my input sheet had no actual dates listed in it.

Now, if I want to see the data for a particular realization or date, pandas makes this easy. For example, the following would access the DataFrame column corresponding to realization 8 for Reservoir_1:

CSV_data_dictionary['Reservoir_1']['Realization8']

Option 2: If you must work with .xlsx files and need to import specific sheets and cell ranges from those files, I recommend that you use the open-source Python library OpenPyXL. (There are other options, some of which are reviewed here). OpenPyXL is nice because it can both read and write from and to Excel, it can loop through worksheets and cells within those worksheets, and does not require that you have Microsoft Office if you are working in Windows. Indeed it does not require Windows at all (it works well on Linux, though I have not tried it out in OSX). OpenPyXL documentation is available here, and includes numerous examples. This page also provides some nice examples of using OpenPyXL.

• Make sure you are working with the latest version. If you install openpyxl through an IDE, be sure it’s updated to at least version 2.2.4 or 2.2.5. I’ve had problems with earlier versions.
• I believe OpenPyXL only works with .xlsx files, not .xls files.

Scenario discovery in Python

The purpose of this blog post is to demonstrate how one can do scenario discovery in python. This blogpost will use the exploratory modeling workbench available on github. I will demonstrate how we can perform both PRIM in an interactive way, as well as briefly show how to use CART, which is also available in the exploratory modeling workbench. There is ample literature on both CART and PRIM and their relative merits for use in scenario discovery. So I won’t be discussing that here in any detail. This blog was first written as an ipython notebook, which can be found here

The workbench is mend as a one stop shop for doing exploratory modeling, scenario discovery, and (multi-objective) robust decision making. To support this, the workbench is split into several packages. The most important packages are expWorkbench that contains the support for setting up and executing computational experiments or (multi-objective) optimization with models; The connectors package, which contains connectors to vensim (system dynamics modeling package), netlogo (agent based modeling environment), and excel; and the analysis package that contains a wide range of techniques for visualization and analysis of the results from series of computational experiments. Here, we will focus on the analysis package. It some future blog post, I plan to demonstrate the use of the workbench for performing computational experimentation and multi-objective (robust) optimization.

The workbench can be found on github and downloaded from there. At present, the workbench is only available for python 2.7. There is a seperate branch where I am working on making a version of the workbench that works under both python 2.7 and 3. The workbench is depended on various scientific python libraries. If you have a standard scientific python distribution, like anaconda, installed, the main dependencies will be met. In addition to the standard scientific python libraries, the workbench is also dependend on deap for genetic algorithms. There are also some optional dependencies. These include seaborn and mpld3 for nicer and interactive visualizations, and jpype for controlling models implemented in Java, like netlogo, from within the workbench.

In order to demonstrate the use of the exploratory modeling workbench for scenario discovery, I am using a published example. I am using the data used in the original article by Ben Bryant and Rob Lempert where they first introduced scenario discovery. Ben Bryant kindly made this data available for my use. The data comes as a csv file. We can import the data easily using pandas. columns 2 up to and including 10 contain the experimental design, while the classification is presented in column 15

import pandas as pd

data = pd.DataFrame.from_csv('./data/bryant et al 2010 data.csv',
index_col=False)
x = data.ix[:, 2:11]
y = data.ix[:, 15]


The exploratory modeling workbench is built on top of numpy rather than pandas. This is partly a path dependecy issue. The earliest version of prim in the workbench is from 2012, when pandas was still under heavy development. Another problem is that the pandas does not contain explicit information on the datatypes of the columns. The implementation of prim in the exploratory workbench is however datatype aware, in contrast to the scenario discovery toolkit in R. That is, it will handle categorical data differently than continuous data. Internally, prim uses a numpy structured array for x, and a numpy array for y. We can easily transform the pandas dataframe to either.

x = x.to_records()
y = y.values


the exploratory modeling workbench comes with a seperate analysis package. This analysis package contains prim. So let’s import prim. The workbench also has its own logging functionality. We can turn this on to get some more insight into prim while it is running.

from analysis import prim
from expWorkbench import ema_logging
ema_logging.log_to_stderr(ema_logging.INFO);


Next, we need to instantiate the prim algorithm. To mimic the original work of Ben Bryant and Rob Lempert, we set the peeling alpha to 0.1. The peeling alpha determines how much data is peeled off in each iteration of the algorithm. The lower the value, the less data is removed in each iteration. The minimium coverage threshold that a box should meet is set to 0.8. Next, we can use the instantiated algorithm and find a first box.

prim_alg = prim.Prim(x, y, threshold=0.8, peel_alpha=0.1)
box1 = prim_alg.find_box()


Let’s investigate this first box is some detail. A first thing to look at is the trade off between coverage and density. The box has a convenience function for this called show_tradeoff. To support working in the ipython notebook, this method returns a matplotlib figure with some additional information than can be used by mpld3.

import matplotlib.pyplot as plt

plt.show()


The notebook contains an mpld3 version of the same figure with interactive pop ups. Let’s look at point 21, just as in the original paper. For this, we can use the inspect method. By default this will display two tables, but we can also make a nice graph instead that contains the same information.

box1.inspect(21)
box1.inspect(21, style='graph')
plt.show()


This first displays two tables, followed by a figure

coverage    0.752809
density     0.770115
mass        0.098639
mean        0.770115
res dim     4.000000
Name: 21, dtype: float64

box 21
min         max     qp values
Demand elasticity        -0.422000   -0.202000  1.184930e-16
Biomass backstop price  150.049995  199.600006  3.515113e-11
Total biomass           450.000000  755.799988  4.716969e-06
Cellulosic cost          72.650002  133.699997  1.574133e-01


If one where to do a detailed comparison with the results reported in the original article, one would see small numerical differences. These differences arise out of subtle differences in implementation. The most important difference is that the exploratory modeling workbench uses a custom objective function inside prim which is different from the one used in the scenario discovery toolkit. Other differences have to do with details about the hill climbing optimization that is used in prim, and in particular how ties are handled in selecting the next step. The differences between the two implementations are only numerical, and don’t affect the overarching conclusions drawn from the analysis.

Let’s select this 21 box, and get a more detailed view of what the box looks like. Following Bryant et al., we can use scatter plots for this.

box1.select(21)
fig = box1.show_pairs_scatter()
fig.set_size_inches((12,12))
plt.show()


We have now found a first box that explains close to 80% of the cases of interest. Let’s see if we can find a second box that explains the remainder of the cases.

box2 = prim_alg.find_box()


The logging will inform us in this case that no additional box can be found. The best coverage we can achieve is 0.35, which is well below the specified 0.8 threshold. Let’s look at the final overal results from interactively fitting PRIM to the data. For this, we can use to convenience functions that transform the stats and boxes to pandas data frames.

print prim_alg.stats_to_dataframe()
print prim_alg.boxes_to_dataframe()

       coverage   density      mass  res_dim
box 1  0.752809  0.770115  0.098639        4
box 2  0.247191  0.027673  0.901361        0
box 1              box 2
min         max    min         max
Demand elasticity        -0.422000   -0.202000   -0.8   -0.202000
Biomass backstop price  150.049995  199.600006   90.0  199.600006
Total biomass           450.000000  755.799988  450.0  997.799988
Cellulosic cost          72.650002  133.699997   67.0  133.699997


For comparison, we can also use CART for doing scenario discovery. This is readily supported by the exploratory modelling workbench.

from analysis import cart
cart_alg = cart.CART(x,y, 0.05)
cart_alg.build_tree()


Now that we have trained CART on the data, we can investigate its results. Just like PRIM, we can use stats_to_dataframe and boxes_to_dataframe to get an overview.

print cart_alg.stats_to_dataframe()
print cart_alg.boxes_to_dataframe()

       coverage   density      mass  res dim
box 1  0.011236  0.021739  0.052154        2
box 2  0.000000  0.000000  0.546485        2
box 3  0.000000  0.000000  0.103175        2
box 4  0.044944  0.090909  0.049887        2
box 5  0.224719  0.434783  0.052154        2
box 6  0.112360  0.227273  0.049887        3
box 7  0.000000  0.000000  0.051020        3
box 8  0.606742  0.642857  0.095238        2
box 1                  box 2               box 3  \
min         max        min         max     min
Cellulosic yield        80.0   81.649994  81.649994   99.900002  80.000
Demand elasticity       -0.8   -0.439000  -0.800000   -0.439000  -0.439
Biomass backstop price  90.0  199.600006  90.000000  199.600006  90.000

box 4                box 5  \
max         min         max      min
Cellulosic yield         99.900002   80.000000   99.900002   80.000
Demand elasticity        -0.316500   -0.439000   -0.316500   -0.439
Biomass backstop price  144.350006  144.350006  170.750000  170.750

box 6                  box 7  \
max      min         max        min
Cellulosic yield         99.900002  80.0000   89.050003  89.050003
Demand elasticity        -0.316500  -0.3165   -0.202000  -0.316500
Biomass backstop price  199.600006  90.0000  148.300003  90.000000

box 8
max         min         max
Cellulosic yield         99.900002   80.000000   99.900002
Demand elasticity        -0.202000   -0.316500   -0.202000
Biomass backstop price  148.300003  148.300003  199.600006


Alternatively, we might want to look at the classification tree directly. For this, we can use the show_tree method. This returns an image that we can either save, or display.

If we look at the results of CART and PRIM, we can see that in this case PRIM produces a better description of the data. The best box found by CART has a coverage and density of a little above 0.6. In contrast, PRIM produces a box with coverage and density above 0.75.

All of the Analysis Code for my Latest Study is on GitHub

I’ve published to GitHub all of the code I wrote for the paper I’m currently working on.  This includes:

• Python PBS submission script
• Python scripts to automate reference set generation using MOEAFramework
• Python scripts to automate hypervolume calculation using MOEAFramework and the WFG hypervolume engine
• Python / Pandas scripts for statistical summaries of the hypervolume data
• Python scripts to automate Sobol’ sensitivity analysis using MOEAFramework and tabulate the results.  (If I were starting today, I’d have an SALib version too.)
• Python / Pandas / Matplotlib figure generation scripts:
• Control maps for hypervolume attainment
• Radial convergence plots (“spider plots”) for Sobol’ global sensitivity analysis results
• Bar charts for Sobol’ global sensitivity analysis results
• CDF plots (dot / shaded bar, plus actual CDF plots) for hypervolume attainment
• Parallel coordinate plots
• Input file generation for AeroVis glyph plotting
• Joint PDF plots for hypervolume attainment across multiple problems

Not all of the figures I mentioned will turn up in the paper, but I provide them as examples in case they prove helpful.

Magic

I didn’t have the time or inclination to try to set up the iPython HTML notebook on the conference room computer for yesterday’s demo, but I really wanted to use the HTML notebook. What to do?

Magic.

Magic in this case means running the iPython HTML notebook on the cluster, forwarding the HTTP port that the HTML notebook uses, and displaying the session in a web browser running locally. In the rest of this post, I’ll explain each of the moving parts.

iPython HTML Notebook on the Cluster

The ipython that comes for free on the cluster doesn’t support the HTML notebook because the python/2.7 module doesn’t have tornado or pyzmq. On the plus side, you do have easy_install, so setting up these dependencies isn’t too hard.

1. Make a directory for your personal Python packages:
mkdir /gpfs/home/asdf1234/local
export PYTHONPATH=\$HOME/local/lib/python2.7/site-packages
3. python -measy_install --prefix /gpfs/home/asdf1234/local tornado
python -measy_install --prefix /gpfs/home/asdf1234/local pyzmq

If you have a local X server, you can check to see if this works:

ssh -Y asdf1234@cluster
ipython notebook --pylab=inline

Firefox should pop up with the HTML notebook. It’s perfectly usable like this, but I also didn’t want to set up an X server on the conference room computer. This leads us to…

Forwarding Port 8888

By default, the HTML notebook serves HTTP on port 8888. If you’re sitting in front of the computer, you get to port 8888 by using the loopback address 127.0.0.1:8888.
127.0.0.1 is only available locally. But using SSH port forwarding, we can connect to 127.0.0.1:8888 from a remote machine.

Here’s how you do that with a command-line ssh client:

ssh -L8888:127.0.0.1:8888 asdf1234@cluster

Here’s how you do it with PuTTY:

Click “Add.”  You should see this:

Now open your connection and login to the remote machine. Once there, cd to the directory where your data is and type

ipython notebook --pylab=inline

If you’re using X forwarding, this will open up the elinks text browser, which is woefully incapable of handling the HTML notebook. Fortunately that doesn’t sink the demo. You’ll see something like this:

This means that the iPython HTML notebook is up and running. If you actually want to use it, howerver, you need a better browser. Fortunately, we opened up SSH with a tunnel…

Open the Notebook in Your Browser

This was the one part of the demo that wasn’t under my control. You need a modern web browser, and I just had to hope that someone was keeping the conference room computer more or less up to date. My fallback plan was to use a text-mode ipython over ssh, but the notebook is much more fun! Fortunately for me, the computer had Firefox 14.

In your URL bar, type in

http://127.0.0.1:8888


If everything works, you’ll see this:

And you’re off to the races!

What Just Happened?

I said earlier that 127.0.0.1 is a special IP address that’s only reachable locally, i.e. on the machine you’re sitting in front of. Port 8888 on 127.0.0.1 is where ipython serves its HTML notebook, so you’d think the idea of using the HTML notebook over the network isn’t going to fly.

When you log in through ssh, however, it’s as if you are actually sitting in front of the computer you’re connected to. Every program you run, runs on that computer. Port forwarding takes this a step further and presents all traffic on port 8888 to the remote computer as if it were actually on the remote computer’s port 8888.

Python Data Analysis Part 2: Pandas / Matplotlib Live Demo

This post is part of our series on Python.  Part 1 came in three parts: a, b, and c.  We also have a post on setting up python.  And on how to write a python job submission script. For all python fun, search “python” in the search box on the right.

Transcript

At yesterday’s group meeting, I promised a transcript of the live demo. I’m pasting it below with light edits for formatting.

  1 import pandas
2 import matplotlib
4 metrics
5 metrics[0:40]
6 metrics["roundeddown"] = 3000 * ( metrics["NFE"] // 3000)
7 metrics[0:40]
8 grouped = metrics.groupby(["model", "roundeddown"])
9 mean = grouped.mean()
10 mean
11 mean.ix["response"]
12 grouped = grouped["SBX+PM"]
13 mean = grouped.mean()
14 mean
15 fig = pyplot.figure(figsize=(10,6))
17 ax.plot(mean.ix["response"].index.values,
18         mean.ix["response"].values, color = 'r')
19 fig
20 ax.plot(mean.ix["gasp"].index.values,
21         mean.ix["gasp"].values, color = 'b')
22 fig
23 summary = pandas.DataFrame(
24             data = {
25                     "mean":grouped.mean(),
26                     "tenth":grouped.quantile(0.1),
27                     "ninetieth":grouped.quantile(0.9)
28             }
29           )
30 summary
31 fig = pyplot.figure(figsize=(10,12))
33 index = summary.ix["gasp"].index.values
34 mean = summary.ix["gasp"]["mean"].values
35 tenth = summary.ix["gasp"]["tenth"].values
36 ninetieth=summary.ix["gasp"]["ninetieth"].values
37 ax.plot(index, mean,
38         index, tenth,
39         index, ninetieth,
40         color='k')
41 fig
42 ax.fill_between(index,tenth,ninetieth,color='k',alpha=0.01)
43 fig
44 ax.fill_between(index,tenth,ninetieth,color='k',alpha=0.1)
45 fig
47 index = summary.ix["response"].index.values
48 mean = summary.ix["response"]["mean"].values
49 tenth = summary.ix["response"]["tenth"].values
50 ninetieth=summary.ix["response"]["ninetieth"].values
51 ax.plot(index, mean,
52         index, tenth,
53         index, ninetieth,
54         color='k')
55 fig
56 ax.fill_between(index,tenth,ninetieth,color='k',alpha=0.1)
57 fig
58 for ax in fig.axes:
59     ax.set_yscale("log")
60     ax.set_ylim(0.0001, 1.0)
61     ax.set_xlim(0,103000)
62 fig

Description

The transcript picks up where your homework left off. We have 50 seeds worth of data for two models:

And we want to produce a summary plot:

You should be able to use text-based ipython on the cluster without any additional setup. However, you’ll need to add from matplotlib import pyplot to the top before you start typing in code from the transcript. (The HTML notebook I used for the demo imports pyplot automatically.)

If you want to see your figures, you’ll have to take the somewhat cumbersome steps of saving them (fig.savefig("filename")), copying them to your local machine (scp user1234@hammer.rcc.psu.edu:workingdirectory/filename.png .), and then viewing them.

I used an ipython HTML notebook for the demo. I’ll be making two posts about how to do this: one about how I did it for the demo, and another about installing ipython on your desktop.