MORDM Basics VI: Processing the output and reevaluating for robustness

In the previous post, we conducted a basic WaterPaths tutorial in which we ran a simulation-optimization of the North Carolina Research Triangle test case (Trindade et al, 2019); across 1000 possible futures, or states of the world (SOWs). To briefly recap, the Research Triangle test case consists of three water utilities in Cary (C), Durham (D) and Raleigh (R). Cary is the main supplier, having a water treatment plant of its own. Durham and Raleigh purchase water from Cary via treated transfers.

Having obtained the .set file containing the Pareto-optimal decision variables and their respective performance objective values, we will now walk through the .set file processing and visualize the decision variables and performance objective space.

Understanding the .set file

First, it is important that the .set file makes sense. The NC_output_MS_S3_N1000.set file should have between 30-80 rows, and a total of 35 columns. The first 20 columns contain values of the decision variables. Note that only Durham and Raleigh have water transfer triggers as they purchase treated water from Cary.

  1. Restriction trigger, RT (C, D, R)
  2. Transfer trigger, TT (D, R)
  3. Jordan Lake allocation, JLA (C, D, R)
  4. Reserve fund contribution as a percentage of total annual revenue, RC (C, D, R)
  5. Insurance trigger, IT (C, D, R)
  6. Insurance payments as a percentage of total annual revenue, IP (C, D, R)
  7. Infrastructure trigger, INF (C, D, R)

The last 15 columns contain the objective values for the following performance objectives of all three utilities:

  1. Reliability (REL) to be maximized
  2. Restriction frequency (RF) to be minimized
  3. Infrastructure net present cost (INF_NPC) to be minimized
  4. Peak financial cost of drought mitigation actions (PFC) to be minimized
  5. Worst-case financial cost of drought mitigation actions (WCC) to be minimized

This reference set needs to be processed to output a .csv file to enable reevaluation for robustness analysis. To do so, run the post_processing.py file found in this GitHub repository in the command line:

python post_processing.py

In addition to post-processing the optimization output files, this file also conduct regional minimax operation, where each regional performance objective is taken to be the objective value of the worst-performing utility (Gold et al, 2019).

This should output two files:

  1. NC_refset.csv No header row. This is the file that will be used to run the re-evaluation for robustness analysis in the next blog post.
  2. NC_dvs_objs.csv Contains a header row. This file that contains the labeled decision variables and objectives, including the minimax regional performance objectives. Will be used for visualizing the reference set’s decision variables and performance objectives.

Visualizing the reference set

Due to the higher number of decision variables, we utilize parallel axis plots to identify how varying the decision variables can potentially affect certain performance objectives. Here, we use the regional reliability performance objective, REL, as an example. Figure 1 below demonstrates how all decision variables vary with regional reliability.

Figure 1: All decision variables for the three utilities. A darker blue indicates a higher degree of reliability.

From Figure 1, most solutions found via the optimization conducted in the previous blog post seem to have relatively high reliability across the full range of decision variable values. It is unclear as to how each decision variable might affect regional reliability. It is thus more helpful to identify specific sets of decision variables (or policies) that enable the achievement of reliability beyond a certain threshold.

With this in mind, we assume that all members of the Triangle require that their collective reliability be at least 98%. This results in the following figure:

Figure 2: All decision variables across the three utilities. The dark lines represent the policies that are at least 98% reliable.

Figure 2 has one clear takeaway: Pareto-optimality does not indicate satisfactory performance. In addition, setting this threshold make the effects of each decision variable clearer. It can be seen that regional reliability is significantly affected by both Durham and Raleigh’s infrastructure trigger (INF). Desirable levels of regional reliability can be achieved when Durham sets a high INF value. Conversely, Raleigh can set lower INF values to benefit from satisfactory reliability. Figure 2 also shows having Durham set a high insurance trigger (IT) may benefit the regional in terms of reliability.

However, there are caveats. Higher INF and IT values for Durham implies that the financial burden of investment and insurance payments are being borne by Raleigh and Cary, as Durham is able to withstand more risk without having to trigger an investment or infrastructure payment. This may affect how each member utility perceives their individual risks and benefits by forming a cooperative contract.

The code to plot these figures can be found under ‘refset_parallel.py’ in the repository.

Robustness analysis and what’s next

So how is setting a threshold value of regional reliability significant?

Within the MORDM framework, robustness is defined using a multivariate satisficing metric (Gold et al, 2019). Depending on the requirements of the stakeholders, a set of criteria is defined that will then be used distinguish between success (all criteria are met) and failure (at least one criteria is not met). Using these criteria, the rest of Pareto-optimal policies are simulated across a number of uncertain SOWs. Each policy’s robustness is then represented by the percent of SOWs in which it meets the minimum performance criteria that has been set.

In this post, we processed the reference set and visualized its decision variable space with respect to each variable’s effect on the reliability performance objective. A similar process can be repeated across all utilities for all performance objectives.

Using the processed reference set, we will conduct multi-criterion robustness analysis using two criteria:

  1. Regional reliability should be at least 98%
  2. Regional restriction frequency should be less than or equal to 20%

We will also conduct sensitivity analysis to identify the the decision variables that most impact regional robustness. Finally, we will conduct scenario discovery to identify SOWs that may cause the policies to fail.

References

Gold, D. F., Reed, P. M., Trindade, B. C., & Characklis, G. W. (2019). Identifying actionable compromises: Navigating multi‐city robustness conflicts to discover cooperative safe operating spaces for regional water supply portfolios. Water Resources Research, 55(11), 9024–9050. https://doi.org/10.1029/2019wr025462

Trindade, B. C., Reed, P. M., & Characklis, G. W. (2019). DEEPLY UNCERTAIN PATHWAYS: Integrated multi-city Regional Water Supply Infrastructure Investment and portfolio management. Advances in Water Resources, 134, 103442. https://doi.org/10.1016/j.advwatres.2019.103442

Networks on maps: exploring spatial connections using NetworkX and Basemap

This blogpost is about generating network graphs interlaid on spatial maps. I’ll be using the data provided by this paper (in the supplementary material) which estimates flows of food across US counties. All the code I’m using here can be found here.

The dataset included in erl_14_8_084011_sd_3.csv of the supplementary material lists the tons of food transported per food category, using the standard classification of transported goods (SCTG) food categories included in the study. The last two columns, ori and des, indicate the origin and destination counties of each flow, using FIPS codes.

To draw the network nodes (the counties) in their geographic locations I had to identify lat and lon coordinates for each county using its FIPS code, which can be found here 1.

Now, let’s these connections in Python, using NetworkX and Basemap. The entire script is here, I’ll just be showing the important snippets below. In the paper, they limit the visualization to the largest 5% of food flows, which I can confirm is necessary otherwise the figure would be unreadable. We first load the data using pandas (or other package that reads csv files), identify the 95th percentile and restrict the data to only those 5% largest flows.

data = pd.read_csv('erl_14_8_084011_sd_3.csv')
threshold = np.percentile(data['total'], 95)
data = data.loc[(data['total'] > threshold)]

Using NetworkX, we can directly create a network out of these data. The most important things I need to define are the dataframe column that lists my source nodes, the column that lists my destination nodes and which attribute makes up my network edges (the connections between nodes), in this case the total food flows.

G = nx.from_pandas_edgelist(df=data, source='ori', target='des', edge_attr='total',create_using = nx.DiGraph())

Drawing this network without the spatial information attached (using the standard nx.draw(G)) looks something like below, which does hold some information about the structure of this network, but misses the spatial information we know to be associated with those nodes (counties).

To associate the spatial information with those nodes, we’ll employ Basemap to create a map and use its projection to convert the lat and lon values of each county to x and y positions for our matplotlib figure. When those positions are estimated and stored in the pos dictionary, I then draw the network using the specific positions. I finally also draw country and state lines. You’ll notice that I didn’t draw the entire network but only the edges (nx.draw_networkx_edges) in an effort to replicate the style of the figure from the original paper and to declutter the figure.

plt.figure(figsize = (12,8))
m = Basemap(projection='merc',llcrnrlon=-160,llcrnrlat=15,urcrnrlon=-60,
urcrnrlat=50, lat_ts=0, resolution='l',suppress_ticks=True)
mx, my = m(pos_data['lon'].values, pos_data['lat'].values)
pos = {}
for count, elem in enumerate(pos_data['nodes']):
     pos[elem] = (mx[count], my[count])
nx.draw_networkx_edges(G, pos = pos, edge_color='blue', alpha=0.1, arrows = False)
m.drawcountries(linewidth = 2)
m.drawstates(linewidth = 0.2)
m.drawcoastlines(linewidth=2)
plt.tight_layout()
plt.savefig("map.png", dpi = 300)
plt.show()

The resulting figure is the following, corresponding to Fig. 5B from the original paper.

I was also interested in replicating some of the analysis done in the paper, using NetworkX, to identify the counties most critical to the structure of the food flow network. Using the entire network now (not just the top 5% of flows) we can use NetworkX functions to calculate each node’s degree and between-ness centrality. The degree indicates the number of nodes a node is connected to, between-ness centrality is an indicator of the fraction of shortest paths between two nodes that pass through a specific node. These are network metrics that are unrelated to the physical distance between two counties and can be used (along with several other metrics) to make inferences about the importance and the position of a specific node in a network. We can calculate them in NetworkX as shown below and plot them using simple pyplot commands:

connectivity = list(G.degree())
connectivity_values = [n[1] for n in connectivity]
centrality = nx.betweenness_centrality(G).values()

plt.figure(figsize = (12,8))
plt.plot(centrality, connectivity_values,'ro')
plt.xlabel('Node centrality', fontsize='large')
plt.ylabel('Node connectivity', fontsize='large')
plt.savefig("node_connectivity.png", dpi = 300)
plt.show()

The resulting figure is shown below, matching the equivalent Fig. 6 of the original paper. As the authors point out, there are some counties in this network, those with high connectivity and high centrality, that are most critical to its structure: San Berndardino, CA; Riverside, CA; Los Angeles, CA; Shelby, TN; San Joaquin, CA; Maricopa, AZ; San Diego, CA; Harris, TX; and Fresno, CA.

1 – If you are interested in how this is done, I used the National Counties Gazetteer file from the US Census Bureau and looked up each code to get its lat and lon.

Interactive Lake Problem Visualizations Using Altair

Introduction

This post is an introduction to Altair, a python package derived from Vega and Vega-Lite, which allows high level visualization using a JSON syntax. I came across Altair while trying to look for a package that would allow interactive linking of plots. Altair allows the user to build sophisticated static and interactive plots using very little code. It is modular and has a simple “visualization grammar”  that allows plots to be built up layer by layer. Initially, there is a learning curve to understand the syntax and almost every plot type requires data to be in a Pandas dataframe. Furthermore, Altair has yet to introduce 3D plotting functionality. However, I believe that it can be a very useful interactive tool for teaching. The Altair API does not render figures, but rather outputs JSON data structures that can be saved as HTML files or viewed in Jupyter Notebooks.

In this blog post, I re-create/create some figures for the Lake Problem using Altair. Below you will see videos (be sure to hit “enable HD” on each video), but after installing Altair (instructions in Blog Post.ipynb) you can create the figures yourself by downloading the following GitHub tutorial and stepping through the Jupyter Notebook or you can simply click on the html files in the GitHub folder to open the interactive figures in your browser.

Linked Scatter and Line Plot

This type of figure allows you to link data across two figures using a common id, so when you click a point, the corresponding line is shown in the second figure. I used this figure to highlight the discharge policy that is associated with each point in the DPS Pareto front. I find this to be very  useful to be able to understand the general objective characteristics that lead to different policy shapes. Note how the high economic benefit solutions keep a steady release, while the solutions that favor low P concentration adjust accordingly. This style of figure creates a more wholesome experience; without it, the user is left to manually output a figure for each policy and keep track of which point on the front is associated with their respective policies.

Linked Scatter and Histogram

Using this type of figure allows the user to highlight sections of the scatter plot which contains points from different categories and to see a histogram of the highlighted points. I used this type of plot to recreate the first panel of Figure 10, which shows the parameter combinations  of b and q that lead to success and failure in different states of the world for the criteria of economic benefit > 0.2 and reliability > 0.95. For any given rectangular cross section you can see the breakdown of success and failure in the linked histogram. The histogram provides a more quantitative interpretation of the results, which may not be clear just looking a the scatter plot.

Linked Scatter Plots

By brushing one scatter plot, you see the corresponding behavior of points on the other linked plot. This style of interactive figure might be useful to view 2D sub-problems of more complex multi-objective problems side by side.

Linked Legend Histograms

This type of plot allows you to isolate parts of the figure by clicking on the legend. Here we have histograms of the objective values for the Intertemporal (orange) and DPS (blue) formulations. By clicking on the legend label, the respective histogram will be kept while the other becomes transparent. This is just a another convenient way of representing the objectives rather than in a scatter plot, gives a sense of which values are most common, and also draws the user’s attention to the shape of each histogram through the isolation technique.

 

A Beginner’s Guide to Visualizing Trade Offs across Multiple Objectives using the Python Bokeh Library

Python libraries can be very helpful in making visualizations. One particular library I like using is the Bokeh library. In this post, I will share how this library can be used for visualizing trade offs across multiple objectives.

First import the libraries we need for our plots:


import pandas as pd
from bokeh.io import output_file, show #for outputting and showing plots
from bokeh.plotting import figure #for creating figures, which are the plot area in this case
from bokeh.models.tools import HoverTool #for importing the hover tool
from bokeh.models import ColumnDataSource #which we will ultimately use for linking plots
from bokeh.models.widgets import Panel, Tabs #for making tabs

Basic Plotting

Let’s start by creating a basic plot for a three objective DTLZ2, where the first and second objectives are plotted on the x and y axes respectively, and the third objective is the size of the plotted points.


# create a new plot using the figure function
p = figure(title="DTLZ2 with 3 objectives",plot_width=400, plot_height=400)

# add axis titles
p.xaxis.axis_label = "Objective 1"
p.yaxis.axis_label = "Objective 2"

# upload your data
paretoFront = pd.read_csv("C:/Users/Sarah/Desktop/dmh309-serial-borg-moea-2c7702638d42/dtlz2obj.csv") 

# assign data to arrays
obj1=paretoFront['Obj1']
obj2=paretoFront['Obj2']
obj3=paretoFront['Obj3']

# plots points
paretoPlot = p.circle(obj1, obj2, size=obj3*10, # objective 3 is the shown by the size of the points; I multiplied by 10 to make them clearer
       fill_color="grey", # for more color options, check out https://bokeh.pydata.org/en/latest/docs/reference/colors.html
       fill_alpha=0.6, # alpha is the transparency of the points
       line_color=None) # line_color is the outline of the points

# show the results
show(p)

The figure below shows what your plot will look like. Notice the default tools that are outlined on the right of the plot.

figure 1

More useful tools

Although Bokeh plots come with a default set of useful tools, there are other tools which you could add to make your plots more interactive.

To demonstrate some of these useful tools, I will create a figure with three tabs. Each tab will contain a view of the Pareto front for two of the three objectives. In addition, these tabs will be linked, so that if you make a selection on the Pareto front in one of the tabs, the same points will be selected in the other views. I will also demonstrate the use of the vertical, horizontal, and mouse over hover tools, and add data labels for the mouse over hover tool.


paretoFront = pd.read_csv("C:/Users/Sarah/Desktop/dmh309-serial-borg-moea-2c7702638d42/dtlz2obj_reduced.csv") 

obj1=paretoFront['Obj1']
obj2=paretoFront['Obj2']
obj3=paretoFront['Obj3']

# create a column data source for the plots; this will allow us to link the plots we create
source = ColumnDataSource(data=dict(x=obj1, y=obj2, z=obj3))

# define the tools you want to use; if you are defining tools, you need to list all the tools you want, including the default Bokeh tools
TOOLS = "box_select,lasso_select,help,pan,wheel_zoom,box_zoom,reset,save"

# create a new plot for each view/tab using figure
p1 = figure(tools=TOOLS,title="DTLZ2 (Objective 1, Objective 2)",plot_width=400, plot_height=400)
p2 = figure(tools=TOOLS,title="DTLZ2 (Objective 1, Objective 3)",plot_width=400, plot_height=400)
p3 = figure(tools=TOOLS,title="DTLZ2 (Objective 2, Objective 3)",plot_width=400, plot_height=400)

# define functions for hovering
def verticalhover(p,paretoPlot):
    p.add_tools(HoverTool(tooltips=None, renderers=[paretoPlot], mode='vline'))

def horizontalhover(p,paretoPlot):
    p.add_tools(HoverTool(tooltips=None, renderers=[paretoPlot], mode='hline'))

def pointhover(p, paretoPlot, obja, objb):
    p.add_tools(HoverTool(tooltips=[(obja, "$x" ), (objb, "$y")],mode="mouse", point_policy="follow_mouse", renderers=[paretoPlot]))
# note that we added 'tooltips' here, this gives the values (like excel data labels) of the points when you hover over them

# plot 1
paretoPlot1 = p1.circle('x', 'y',
                        fill_color='grey',
                        fill_alpha=0.7,
                        hover_fill_color="crimson", hover_alpha=0.8, # this changes the color and transparecy of the points when we hover over them
                        line_color=None, hover_line_color="white", source=source) # we will use 'source' in all our plots, because we want to link them by having them share the same sources data

p1.xaxis.axis_label = "Objective 1"
p1.yaxis.axis_label = "Objective 2"

# add the hover tools
verticalhover(p1,paretoPlot1)
horizontalhover(p1,paretoPlot1)
pointhover(p1, paretoPlot1, "objective 1", "objective 2")

# add the panel for the first tab
tab1 = Panel(child=p1, title="obj1, obj2")

# repeat the same steps for tabs 2 and 3 (i.e. objective 1 vs. objective 3 and objective 2 vs. objective 3)
# plot 2

paretoPlot2 = p2.circle('x','z',
                        fill_color='grey', hover_fill_color="crimson",
                        fill_alpha=0.7, hover_alpha=0.8,
                        line_color=None, hover_line_color="white", source=source)

p2.xaxis.axis_label = "Objective 1"
p2.yaxis.axis_label = "Objective 3"

verticalhover(p2,paretoPlot2)
horizontalhover(p2,paretoPlot2)
pointhover(p2, paretoPlot2, "objective 1", "objective 3")

tab2 = Panel(child=p2, title="obj1, obj3")

# plot 3

paretoPlot3 = p3.circle('y', 'z',
                        fill_color='grey', hover_fill_color="crimson",
                        fill_alpha=0.7, hover_alpha=0.8,
                        line_color=None, hover_line_color="white", source=source)

p3.xaxis.axis_label = "Objective 2"
p3.yaxis.axis_label = "Objective 3"

verticalhover(p3,paretoPlot3)
horizontalhover(p3,paretoPlot3)
pointhover(p3, paretoPlot3, "objective 2", "objective 3")

tab3 = Panel(child=p3, title="obj2, obj3")

# finally create the tabs and show your plot!
tabs = Tabs(tabs=[ tab1, tab2, tab3 ])
show(tabs)

The figure below shows the plot you will obtain:

figure 2

Now, let’s explore the tools that we added:

  • Selection across multiple tabs: use lasso or box select to select some points in your first tab. When you move to the second and third tabs, the same points will be selected in different views.figure 3
  • Vertical and Horizontal Hover: select either the vertical or horizontal hover tool from the right toolbar. Make sure you deselect all other tools. The vertical hover will color in all the points that lie in the vertical line that crosses the point you are hovering over, and the horizontal hover does the same thing horizontally.figure 4
  • Point Hover and Data Labels: Select the point hover tool, deselect all the other tools, and hover over the points. When you hover over a point, you will get the values for the objectives for that point.figure 5

Sources

 

Directed search with the Exploratory Modeling workbench

This is the third blog in a series showcasing the functionality of the Exploratory Modeling workbench. Exploratory modeling entails investigating the way in which uncertainty and/or policy levers map to outcomes. To investigate these mappings, we can either use sampling based strategies (open exploration) or optimization based strategies (directed search) In the first blog, I gave a general overview of the workbench and showed briefly how both investigation strategies can be done. In the second blog, I demonstrated the use of the workbench for open exploration in substantial more detail. In this third blog, I will demonstrate in more detail how to use the workbench for directed search. Like in the previous two blog post, I will use the DPS version of the lake problem.

For optimization, the workbench relies on platypus. You can easily install the latest version of platypus from github using pip

pip install git+https://github.com/Project-Platypus/Platypus.git

By default, the workbench will use epsilon NSGA2, but all the other algorithms available within platypus can be used as well.

Within the workbench, optimization can be used in three ways:
* Search over decision levers for a reference scenario
* Robust search: search over decision levers for a set of scenarios
* worst case discovery: search over uncertainties for a reference policy

The search over decision levers or over uncertainties relies on the specification of the direction for each outcome of interest defined on the model. It is only possible to use ScalarOutcome objects for optimization.

Search over levers

Directed search is most often used to search over the decision levers in order to find good candidate strategies. This is for example the first step in the Many Objective Robust Decision Making process. This is straightforward to do with the workbench using the optimize method.

from ema_workbench import MultiprocessingEvaluator, ema_logging

ema_logging.log_to_stderr(ema_logging.INFO)

with MultiprocessingEvaluator(model) as evaluator:
    results = evaluator.optimize(nfe=10000, searchover='levers', 
                                 epsilons=[0.1,]*len(model.outcomes),
                                 population_size=50)

the result from optimize is a DataFrame with the decision variables and outcomes of interest. The latest version of the workbench comes with a pure python implementation of parallel coordinates plot built on top of matplotlib. It has been designed with the matplotlib and seaborn api in mind. We can use this to quickly visualize the optimization results.

from ema_workbench.analysis import parcoords

paraxes = parcoords.ParallelAxes(parcoords.get_limits(results), rot=0)
paraxes.plot(results, color=sns.color_palette()[0])
paraxes.invert_axis('max_P')
plt.show()

Note how we can flip an axis using the invert_axis method. This eases interpretation of the figure because the ideal solution in this case would be a straight line for the four outcomes of interest at the top of the figure.

output_8_1

Specifying constraints

In the previous example, we showed the most basic way for using the workbench to perform many-objective optimization. However, the workbench also offers support for constraints and tracking convergence. Constrains are an attribute of the optimization problem, rather than an attribute of the model as in Rhodium. Thus, we can pass a list of constraints to the optimize method. A constraint can be applied to the model input parameters (both uncertainties and levers), and/or outcomes. A constraint is essentially a function that should return the distance from the feasibility threshold. The distance should be 0 if the constraint is met.

As a quick demonstration, let’s add a constraint on the maximum pollution. This constraint applies to the max_P outcome. The example below specifies that the maximum pollution should be below 1.

from ema_workbench import MultiprocessingEvaluator, ema_logging, Constraint

ema_logging.log_to_stderr(ema_logging.INFO)

constraints = [Constraint("max pollution", outcome_names="max_P",
                          function=lambda x:max(0, x-1))]

with MultiprocessingEvaluator(model) as evaluator:
    results = evaluator.optimize(nfe=1000, searchover='levers', 
                                 epsilons=[0.1,]*len(model.outcomes),
                                 population_size=25, constraints=constraints)

tracking convergence

To track convergence, we need to specify which metric(s) we want to use and pass these to the optimize method. At present the workbench comes with 3 options: Hyper volume, Epsilon progress, and a class that will write the archive at each iteration to a separate text file enabling later processing. If convergence metrics are specified, optimize will return both the results as well as the convergence information.

from ema_workbench import MultiprocessingEvaluator, ema_logging
from ema_workbench.em_framework.optimization import (HyperVolume,
                                                     EpsilonProgress, )
from ema_workbench.em_framework.outcomes import Constraint

ema_logging.log_to_stderr(ema_logging.INFO)

# because of the constraint on pollution, we can specify the 
# maximum easily
convergence_metrics = [HyperVolume(minimum=[0,0,0,0], maximum=[1,1,1,1]),
                       EpsilonProgress()]
constraints = [Constraint("max pollution", outcome_names="max_P",
                          function=lambda x:max(0, x-1))]

with MultiprocessingEvaluator(model) as evaluator:
    results_ref1, convergence1 = evaluator.optimize(nfe=25000, searchover='levers', 
                                    epsilons=[0.05,]*len(model.outcomes),
                                    convergence=convergence_metrics,
                                    constraints=constraints,
                                    population_size=100)

We can visualize the results using parcoords as before, while the convergence information is in a DataFrame making it also easy to plot.

fig, (ax1, ax2) = plt.subplots(ncols=2, sharex=True)
ax1.plot(convergence1.epsilon_progress)
ax1.set_xlabel('nr. of generations')
ax1.set_ylabel('$\epsilon$ progress')
ax2.plot(convergence1.hypervolume)
ax2.set_ylabel('hypervolume')
sns.despine()
plt.show()

output_16_0

Changing the reference scenario

Up till now, we have performed the optimization for an unspecified reference scenario. Since the lake model function takes default values for each of the deeply uncertain factors, these values have been implicitly assumed. It is however possible to explicitly pass a reference scenario that should be used instead. In this way, it is easy to apply the extended MORDM approach suggested by Watson and Kasprzyk (2017).

To see the effects of changing the reference scenario on the values for the decision levers found through the optimization, as well as ensuring a fair comparison with the previous results, we use the same convergence metrics and constraints from the previous optimization. Note that the constraints are in essence only a function, and don’t retain optimization specific state, we can simply reuse them. The convergence metrics, in contrast retain state and we thus need to re-instantiate them.

from ema_workbench import Scenario

reference = Scenario('reference', **dict(b=.43, q=3,mean=0.02, 
                                         stdev=0.004, delta=.94))
convergence_metrics = [HyperVolume(minimum=[0,0,0,0], maximum=[1,1,1,1]),
                       EpsilonProgress()]

with MultiprocessingEvaluator(model) as evaluator:
    results_ref2, convergence2 = evaluator.optimize(nfe=25000, searchover='levers', 
                                  epsilons=[0.05,]*len(model.outcomes),
                                  convergence=convergence_metrics,
                                  constraints=constraints,
                                  population_size=100, reference=reference)

comparing results for different reference scenarios

To demonstrate the parcoords plotting functionality in some more detail, let’s combine the results from the optimizations for the two different reference scenarios and visualize them in the same plot. To do this, we need to first figure out the limits across both optimizations. Moreover, to get a better sense of which part of the decision space is being used, let’s set the limits for the decision levers on the basis of their specified ranges instead of inferring the limits from the optimization results.

columns = [lever.name for lever in model.levers]
columns += [outcome.name for outcome in model.outcomes]
limits = {lever.name: (lever.lower_bound, lever.upper_bound) for lever in 
           model.levers}
limits = dict(**limits, **{outcome.name:(0,1) for outcome in model.outcomes})
limits = pd.DataFrame.from_dict(limits)
# we resort the limits in the order produced by the optimization
limits = limits[columns] 

paraxes = parcoords.ParallelAxes(limits, rot=0)
paraxes.plot(results_ref1, color=sns.color_palette()[0], label='ref1')
paraxes.plot(results_ref2, color=sns.color_palette()[1], label='ref2')
paraxes.legend()
paraxes.invert_axis('max_P')
plt.show()

output_22_0.png

Robust Search

The workbench also comes with support for many objective robust optimization. In this case, each candidate solution is evaluated over a set of scenarios, and the robustness of the performance over this set is calculated. This requires specifying 2 new pieces of information:
* the robustness metrics
* the scenarios over which to evaluate the candidate solutions

The robustness metrics are simply a collection of ScalarOutcome objects. For each one, we have to specify which model outcome(s) it uses, as well as the actual robustness function. For demonstrative purposes, let’s assume we are use a robustness function using descriptive statistics: we want to maximize the 10th percentile performance for reliability, inertia, and utility, while minimizing the 90th percentile performance for max_P.

We can specify our scenarios in various ways. The simplest would be to pass the number of scenarios to the robust_optimize method. In this case for each generation a new set of scenarios is used. This can create noise and instability in the optimization. A better option is to explicitly generate the scenarios first, and pass these to the method. In this way, the same set of scenarios is used for each generation.

If we want to specify a constraint, this can easily be done. Note however, that in case of robust optimization, the constrains will apply to the robustness metrics instead of the model outcomes. They can of course still apply to the decision variables as well.

import functools
from ema_workbench import Constraint, MultiprocessingEvaluator
from ema_workbench import Constraint, ema_logging
from ema_workbench.em_framework.optimization import (HyperVolume,
                                                     EpsilonProgress)
from ema_workbench.em_framework.samplers import sample_uncertainties

ema_logging.log_to_stderr(ema_logging.INFO)

percentile10 = functools.partial(np.percentile, q=10)
percentile90 = functools.partial(np.percentile, q=90)

MAXIMIZE = ScalarOutcome.MAXIMIZE
MINIMIZE = ScalarOutcome.MINIMIZE
robustnes_functions = [ScalarOutcome('90th percentile max_p', kind=MINIMIZE, 
                             variable_name='max_P', function=percentile90),
                       ScalarOutcome('10th percentile reliability', kind=MAXIMIZE, 
                             variable_name='reliability', function=percentile10),
                       ScalarOutcome('10th percentile inertia', kind=MAXIMIZE, 
                             variable_name='inertia', function=percentile10),
                       ScalarOutcome('10th percentile utility', kind=MAXIMIZE, 
                             variable_name='utility', function=percentile10)]

def constraint(x):
    return max(0, percentile90(x)-10)

constraints = [Constraint("max pollution", 
                          outcome_names=['90th percentile max_p'],
                          function=constraint)]
convergence_metrics = [HyperVolume(minimum=[0,0,0,0], maximum=[10,1,1,1]),
                       EpsilonProgress()]
n_scenarios = 10
scenarios = sample_uncertainties(model, n_scenarios)

nfe = 10000

with MultiprocessingEvaluator(model) as evaluator:
    robust_results, convergence = evaluator.robust_optimize(robustnes_functions, 
                            scenarios, nfe=nfe, constraints=constraints,
                            epsilons=[0.05,]*len(robustnes_functions),
                            convergence=convergence_metrics,)
fig, (ax1, ax2) = plt.subplots(ncols=2)
ax1.plot(convergence.epsilon_progress.values)
ax1.set_xlabel('nr. of generations')
ax1.set_ylabel('$\epsilon$ progress')
ax2.plot(convergence.hypervolume)
ax2.set_ylabel('hypervolume')
sns.despine()
plt.show()

output_25_0.png

paraxes = parcoords.ParallelAxes(parcoords.get_limits(robust_results), rot=45)
paraxes.plot(robust_results)
paraxes.invert_axis('90th percentile max_p')
plt.show()

output_26_0.png

Search over uncertainties: worst case discovery

Up till now, we have focused on optimizing over the decision levers. The workbench however can also be used for worst case discovery (Halim et al, 2016). In essence, the only change is to specify that we want to search over uncertainties instead of over levers. Constraints and convergence works just as in the previous examples.

Reusing the foregoing, however, we should change the direction of optimization of the outcomes. We are no longer interested in finding the best possible outcomes, but instead we want to find the worst possible outcomes.

# change outcomes so direction is undesirable
minimize = ScalarOutcome.MINIMIZE
maximize = ScalarOutcome.MAXIMIZE

for outcome in model.outcomes:
    if outcome.kind == minimize:
        outcome.kind = maximize
    else:
        outcome.kind = minimize

We can reuse the reference keyword argument to perform worst case discovery for one of the policies found before. So, below we select solution number 9 from the pareto approximate set. We can turn this into a dict and instantiate a Policy objecti.

from ema_workbench import Policy

policy = Policy('9', **{k:v for k, v in results_ref1.loc[9].items()
                        if k in model.levers})

with MultiprocessingEvaluator(model) as evaluator:
    results = evaluator.optimize(nfe=1000, searchover='uncertainties', 
                                 epsilons=[0.1,]*len(model.outcomes),
                                 reference=policy)

Visualizing the results is straightforward using parcoords.

paraxes = parcoords.ParallelAxes(parcoords.get_limits(results), rot=0)
paraxes.plot(results)
paraxes.invert_axis('max_P')
plt.show()

output_30_0.png

Closing remarks

This blog showcased the functionality of the workbench for applying search based approaches to exploratory modelling. We specifically looked at the use of many-objective optimization for searching over the levers or uncertainties, as well as the use of many-objective robust optimization. This completes the overview of the functionality available in the workbench. In the next blog, I will put it all together to show how the workbench can be used to perform Many Objective Robust Decision Making.

Easy vectorized parallel plots for multiple data sets

I will share a very quick and straight-forward solution to generate parallel plots in python of multiple groups of data.   The idea is transitioning from the parallel axis plot tool  to a method that enables  the plots to be exported as a vectorized image.   You can also take a look at Matt’s python parallel.py code available in github: https://github.com/matthewjwoodruff/parallel.py .

This is the type of figure that you will get:

parallel3.png

The previous figure was generated with the following lines of code:

import numpy as np
import pandas as pd
from pandas.tools.plotting import parallel_coordinates
import matplotlib.pyplot as plt
import seaborn

data = pd.read_csv('sample_data.csv')

parallel_coordinates(data,'Name', color= ['#225ea8','#7fcdbb','#1d91c0'], linewidth=5, alpha=.8)
plt.ylabel('Direction of Preference $\\rightarrow$', fontsize=12)

plt.savefig('parallel_plot.svg')

Lines 1-4 are the required libraries.  I just threw in the seaborn library to give it the gray background but it is not necessary.  In the parallel_coordinates function, you need to specify the data, ‘Name’ and the color of the different groups.  You can substitute the color  variable for colormap and specify the colormap that you wish to use (e.g. colormap=’YlGnBu’).   I also specified an alpha for transparency to see overlapping lines. If you want to learn more, you can take a look at the parallel_coordinates source code.  I found this stack overflow link very useful,  it shows some examples on editing the source code to enable other capabilities.

Finally, the following snippet shows the format of the input data (the sample_data.csv  file that is read in line 7 ) :

sample_data.png

Columns A-G the different categories to be plotted are specified (e.g. the objectives of a problem) and in Column H the names of the different data groups are specified.  And there you have it, I hope you find this plotting alternative useful.

Importing, Exporting and Organizing Time Series Data in Python – Part 1

Importing, Exporting and Organizing Time Series Data in Python – Part 1

This blog post is Part 1 of a multi-part series of posts (see here for Part 2) intended to introduce options in Python available for reading (importing) data (with particular emphasis on time series data, and how to handle Excel spreadsheets); (2) organizing time series data in Python (with emphasis on using the open-source data analysis library pandas); and (3) exporting/saving data from Python.

In modeling water resources and environmental systems, we frequently must import and export large quantities of data (particularly time series data), both to run models and to process model outputs. I will describe the approaches and tools I have found to be most useful for managing data in these circumstances.

This blog post focuses on approaches for reading (importing) time series data, with particular emphasis on how (and how not) to handle data in MS Excel spreadsheets. Future posts will cover the pandas data management/export component.

Through an example that follows, there are two main lessons I hope to convey in this post:

  1. If you can store data, especially time series data, in text (.txt) or comma separated value (.csv) files, the time required for importing and exporting data will be vastly reduced compared to attempting the same thing in an Excel workbook (.xlsx file). Hence, I suggest that if you must work with an Excel workbook (.xlsx files), try to save each worksheet in the Excel workbook as a separate .csv file and use that instead. A .csv file can still be viewed in the MS Excel interface, but is much faster to import. I’ll show below how much time this can save you.
  2. There are many ways to analyze the data once loaded into Python, but my suggestion is you take advantage of pandas, an open-source data analysis library in Python. This is why my example below makes use of built-in pandas features for importing data. (I will have some future posts with some useful pandas code for doing time series analysis/plotting. There are other previous pandas posts on our blog as well).

It is important at this point to note that Microsoft Excel files (.xlsx or .xls) are NOT an ideal means of storing data. I have had the great misfortune of working with Excel (and Visual Basic for Applications) intensively for many years, and it can be somewhat of a disaster for data management. I am using Excel only because the input data files for the model I work with are stored as .xlsx files, and because this is the data storage interface with which others using my Python-based model are most comfortable.

An Example:

Suppose that you wish to read in and analyze time series of 100 years of simulated daily reservoir releases from 2 reservoirs (“Reservoir_1” and “Reservoir_2”) for 100 separate simulations (or realizations). Let’s assume data for the reservoirs is stored in two separate files, either .csv or .xlsx files (e.g., “Reservoir_1.xlsx” and “Reservoir_2.xlsx”, or “Reservoir_1.csv” and “Reservoir_2.csv”).

The figure below shows a screenshot of an example layout of the file in either case (.csv or .xlsx). Feel free to make your own such file, with column names that correspond to the title of each simulation realization.

Worksheet appearance

There are two import options for these data.

Option 1: Import your data directly into a dictionary of two pandas DataFrame objects, where the keys of the Python dictionary are the two reservoir names. Here is some code that does this. All you need to do is run this script in a directory where you also have your input files saved.


# Imports
import pandas as pd
from datetime import datetime
from datetime import timedelta
import time

start_time = time.time()  # Keep track of import time

start_date = datetime(1910, 1, 1)
simulation_dates = pd.date_range(start_date, start_date + timedelta(365*100 - 1))

# Specify File Type (Choices: (1) 'XLSX', or (2) 'CSV')
File_Type = 'CSV'

location_list = ['Reservoir_1', 'Reservoir_2']  # Names of data files
start = 0  # starting column of input file from which to import data
stop = 99  # ending column of input file from which to import data. Must not exceed number of columns in input file.

# Create a dictionary where keys are items in location_list, and content associated with each key is a pandas dataframe.
if File_Type is 'XLSX':
    extension = '.xlsx'
    Excel_data_dictionary = {name: pd.read_excel(name + extension, sheetname=None, usecols=[i for i in range(start, stop)]) for name in
                             location_list}
    # Reset indices as desired dates
    for keys in Excel_data_dictionary:
        Excel_data_dictionary[keys] = Excel_data_dictionary[keys].set_index(simulation_dates)
    print('XLSX Data Import Complete in %s seconds' % (time.time() - start_time))
elif File_Type is 'CSV':
    extension = '.csv'
    CSV_data_dictionary = {name: pd.read_csv(name + extension, usecols=[i for i in range(start, stop)]) for name in location_list}
    # Reset indices as desired dates
    for keys in CSV_data_dictionary:
        CSV_data_dictionary[keys] = CSV_data_dictionary[keys].set_index(simulation_dates)
    print('CSV Data Import Complete in %s seconds' % (time.time() - start_time))

When I run this code, the .csv data import took about 1.5 seconds on my office desktop computer (not bad), and the .xlsx import took 100 seconds (awful!)! This is why you should avoid storing data in Excel workbooks.

You can visualize the pandas DataFrame object per the image below, with a major axis that corresponds to dates, and a minor axis that corresponds to each simulation run (or realization). Note that I added code that manually sets the major axis index to date values, as my input sheet had no actual dates listed in it.

Pandas DF

Now, if I want to see the data for a particular realization or date, pandas makes this easy. For example, the following would access the DataFrame column corresponding to realization 8 for Reservoir_1:

CSV_data_dictionary['Reservoir_1']['Realization8']

Option 2: If you must work with .xlsx files and need to import specific sheets and cell ranges from those files, I recommend that you use the open-source Python library OpenPyXL. (There are other options, some of which are reviewed here). OpenPyXL is nice because it can both read and write from and to Excel, it can loop through worksheets and cells within those worksheets, and does not require that you have Microsoft Office if you are working in Windows. Indeed it does not require Windows at all (it works well on Linux, though I have not tried it out in OSX). OpenPyXL documentation is available here, and includes numerous examples. This page also provides some nice examples of using OpenPyXL.

Some additional tips regarding OpenPyXL:

  • Make sure you are working with the latest version. If you install openpyxl through an IDE, be sure it’s updated to at least version 2.2.4 or 2.2.5. I’ve had problems with earlier versions.
  • I believe OpenPyXL only works with .xlsx files, not .xls files.

Scenario discovery in Python

The purpose of this blog post is to demonstrate how one can do scenario discovery in python. This blogpost will use the exploratory modeling workbench available on github. I will demonstrate how we can perform both PRIM in an interactive way, as well as briefly show how to use CART, which is also available in the exploratory modeling workbench. There is ample literature on both CART and PRIM and their relative merits for use in scenario discovery. So I won’t be discussing that here in any detail. This blog was first written as an ipython notebook, which can be found here

The workbench is mend as a one stop shop for doing exploratory modeling, scenario discovery, and (multi-objective) robust decision making. To support this, the workbench is split into several packages. The most important packages are expWorkbench that contains the support for setting up and executing computational experiments or (multi-objective) optimization with models; The connectors package, which contains connectors to vensim (system dynamics modeling package), netlogo (agent based modeling environment), and excel; and the analysis package that contains a wide range of techniques for visualization and analysis of the results from series of computational experiments. Here, we will focus on the analysis package. It some future blog post, I plan to demonstrate the use of the workbench for performing computational experimentation and multi-objective (robust) optimization.

The workbench can be found on github and downloaded from there. At present, the workbench is only available for python 2.7. There is a seperate branch where I am working on making a version of the workbench that works under both python 2.7 and 3. The workbench is depended on various scientific python libraries. If you have a standard scientific python distribution, like anaconda, installed, the main dependencies will be met. In addition to the standard scientific python libraries, the workbench is also dependend on deap for genetic algorithms. There are also some optional dependencies. These include seaborn and mpld3 for nicer and interactive visualizations, and jpype for controlling models implemented in Java, like netlogo, from within the workbench.

In order to demonstrate the use of the exploratory modeling workbench for scenario discovery, I am using a published example. I am using the data used in the original article by Ben Bryant and Rob Lempert where they first introduced scenario discovery. Ben Bryant kindly made this data available for my use. The data comes as a csv file. We can import the data easily using pandas. columns 2 up to and including 10 contain the experimental design, while the classification is presented in column 15

import pandas as pd

data = pd.DataFrame.from_csv('./data/bryant et al 2010 data.csv',
                             index_col=False)
x = data.ix[:, 2:11]
y = data.ix[:, 15]

The exploratory modeling workbench is built on top of numpy rather than pandas. This is partly a path dependecy issue. The earliest version of prim in the workbench is from 2012, when pandas was still under heavy development. Another problem is that the pandas does not contain explicit information on the datatypes of the columns. The implementation of prim in the exploratory workbench is however datatype aware, in contrast to the scenario discovery toolkit in R. That is, it will handle categorical data differently than continuous data. Internally, prim uses a numpy structured array for x, and a numpy array for y. We can easily transform the pandas dataframe to either.

x = x.to_records()
y = y.values

the exploratory modeling workbench comes with a seperate analysis package. This analysis package contains prim. So let’s import prim. The workbench also has its own logging functionality. We can turn this on to get some more insight into prim while it is running.

from analysis import prim
from expWorkbench import ema_logging
ema_logging.log_to_stderr(ema_logging.INFO);

Next, we need to instantiate the prim algorithm. To mimic the original work of Ben Bryant and Rob Lempert, we set the peeling alpha to 0.1. The peeling alpha determines how much data is peeled off in each iteration of the algorithm. The lower the value, the less data is removed in each iteration. The minimium coverage threshold that a box should meet is set to 0.8. Next, we can use the instantiated algorithm and find a first box.

prim_alg = prim.Prim(x, y, threshold=0.8, peel_alpha=0.1)
box1 = prim_alg.find_box()

Let’s investigate this first box is some detail. A first thing to look at is the trade off between coverage and density. The box has a convenience function for this called show_tradeoff. To support working in the ipython notebook, this method returns a matplotlib figure with some additional information than can be used by mpld3.

import matplotlib.pyplot as plt

box1.show_tradeoff()
plt.show()

fig1

The notebook contains an mpld3 version of the same figure with interactive pop ups. Let’s look at point 21, just as in the original paper. For this, we can use the inspect method. By default this will display two tables, but we can also make a nice graph instead that contains the same information.

box1.inspect(21)
box1.inspect(21, style='graph')
plt.show()

This first displays two tables, followed by a figure

coverage    0.752809
density     0.770115
mass        0.098639
mean        0.770115
res dim     4.000000
Name: 21, dtype: float64

                            box 21
                               min         max     qp values
Demand elasticity        -0.422000   -0.202000  1.184930e-16
Biomass backstop price  150.049995  199.600006  3.515113e-11
Total biomass           450.000000  755.799988  4.716969e-06
Cellulosic cost          72.650002  133.699997  1.574133e-01

fig 2

If one where to do a detailed comparison with the results reported in the original article, one would see small numerical differences. These differences arise out of subtle differences in implementation. The most important difference is that the exploratory modeling workbench uses a custom objective function inside prim which is different from the one used in the scenario discovery toolkit. Other differences have to do with details about the hill climbing optimization that is used in prim, and in particular how ties are handled in selecting the next step. The differences between the two implementations are only numerical, and don’t affect the overarching conclusions drawn from the analysis.

Let’s select this 21 box, and get a more detailed view of what the box looks like. Following Bryant et al., we can use scatter plots for this.

box1.select(21)
fig = box1.show_pairs_scatter()
fig.set_size_inches((12,12))
plt.show()

fig3

We have now found a first box that explains close to 80% of the cases of interest. Let’s see if we can find a second box that explains the remainder of the cases.

box2 = prim_alg.find_box()

The logging will inform us in this case that no additional box can be found. The best coverage we can achieve is 0.35, which is well below the specified 0.8 threshold. Let’s look at the final overal results from interactively fitting PRIM to the data. For this, we can use to convenience functions that transform the stats and boxes to pandas data frames.

print prim_alg.stats_to_dataframe()
print prim_alg.boxes_to_dataframe()
       coverage   density      mass  res_dim
box 1  0.752809  0.770115  0.098639        4
box 2  0.247191  0.027673  0.901361        0
                             box 1              box 2
                               min         max    min         max
Demand elasticity        -0.422000   -0.202000   -0.8   -0.202000
Biomass backstop price  150.049995  199.600006   90.0  199.600006
Total biomass           450.000000  755.799988  450.0  997.799988
Cellulosic cost          72.650002  133.699997   67.0  133.699997

For comparison, we can also use CART for doing scenario discovery. This is readily supported by the exploratory modelling workbench.

from analysis import cart
cart_alg = cart.CART(x,y, 0.05)
cart_alg.build_tree()

Now that we have trained CART on the data, we can investigate its results. Just like PRIM, we can use stats_to_dataframe and boxes_to_dataframe to get an overview.

print cart_alg.stats_to_dataframe()
print cart_alg.boxes_to_dataframe()
       coverage   density      mass  res dim
box 1  0.011236  0.021739  0.052154        2
box 2  0.000000  0.000000  0.546485        2
box 3  0.000000  0.000000  0.103175        2
box 4  0.044944  0.090909  0.049887        2
box 5  0.224719  0.434783  0.052154        2
box 6  0.112360  0.227273  0.049887        3
box 7  0.000000  0.000000  0.051020        3
box 8  0.606742  0.642857  0.095238        2
                       box 1                  box 2               box 3  \
                         min         max        min         max     min
Cellulosic yield        80.0   81.649994  81.649994   99.900002  80.000
Demand elasticity       -0.8   -0.439000  -0.800000   -0.439000  -0.439
Biomass backstop price  90.0  199.600006  90.000000  199.600006  90.000   

                                         box 4                box 5  \
                               max         min         max      min
Cellulosic yield         99.900002   80.000000   99.900002   80.000
Demand elasticity        -0.316500   -0.439000   -0.316500   -0.439
Biomass backstop price  144.350006  144.350006  170.750000  170.750   

                                      box 6                  box 7  \
                               max      min         max        min
Cellulosic yield         99.900002  80.0000   89.050003  89.050003
Demand elasticity        -0.316500  -0.3165   -0.202000  -0.316500
Biomass backstop price  199.600006  90.0000  148.300003  90.000000   

                                         box 8
                               max         min         max
Cellulosic yield         99.900002   80.000000   99.900002
Demand elasticity        -0.202000   -0.316500   -0.202000
Biomass backstop price  148.300003  148.300003  199.600006

Alternatively, we might want to look at the classification tree directly. For this, we can use the show_tree method. This returns an image that we can either save, or display.

fig3

If we look at the results of CART and PRIM, we can see that in this case PRIM produces a better description of the data. The best box found by CART has a coverage and density of a little above 0.6. In contrast, PRIM produces a box with coverage and density above 0.75.

All of the Analysis Code for my Latest Study is on GitHub

I’ve published to GitHub all of the code I wrote for the paper I’m currently working on.  This includes:

  • Python PBS submission script
  • Python scripts to automate reference set generation using MOEAFramework
  • Python scripts to automate hypervolume calculation using MOEAFramework and the WFG hypervolume engine
  • Python / Pandas scripts for statistical summaries of the hypervolume data
  • Python scripts to automate Sobol’ sensitivity analysis using MOEAFramework and tabulate the results.  (If I were starting today, I’d have an SALib version too.)
  • Python / Pandas / Matplotlib figure generation scripts:
    • Control maps for hypervolume attainment
    • Radial convergence plots (“spider plots”) for Sobol’ global sensitivity analysis results
    • Bar charts for Sobol’ global sensitivity analysis results
    • CDF plots (dot / shaded bar, plus actual CDF plots) for hypervolume attainment
    • Parallel coordinate plots
    • Input file generation for AeroVis glyph plotting
    • Joint PDF plots for hypervolume attainment across multiple problems

Not all of the figures I mentioned will turn up in the paper, but I provide them as examples in case they prove helpful.

Connecting to an iPython HTML Notebook on the Cluster Using an SSH Tunnel

Magic

I didn’t have the time or inclination to try to set up the iPython HTML notebook on the conference room computer for yesterday’s demo, but I really wanted to use the HTML notebook. What to do?

Magic.

Magic in this case means running the iPython HTML notebook on the cluster, forwarding the HTTP port that the HTML notebook uses, and displaying the session in a web browser running locally. In the rest of this post, I’ll explain each of the moving parts.

iPython HTML Notebook on the Cluster

The ipython that comes for free on the cluster doesn’t support the HTML notebook because the python/2.7 module doesn’t have tornado or pyzmq. On the plus side, you do have easy_install, so setting up these dependencies isn’t too hard.

  1. Make a directory for your personal Python packages:
    mkdir /gpfs/home/asdf1234/local
  2. In your .bashrc, add
    export PYTHONPATH=$HOME/local/lib/python2.7/site-packages
  3. python -measy_install --prefix /gpfs/home/asdf1234/local tornado
    python -measy_install --prefix /gpfs/home/asdf1234/local pyzmq

If you have a local X server, you can check to see if this works:

ssh -Y asdf1234@cluster 
ipython notebook --pylab=inline

Firefox should pop up with the HTML notebook. It’s perfectly usable like this, but I also didn’t want to set up an X server on the conference room computer. This leads us to…

Forwarding Port 8888

By default, the HTML notebook serves HTTP on port 8888. If you’re sitting in front of the computer, you get to port 8888 by using the loopback address 127.0.0.1:8888.
127.0.0.1 is only available locally. But using SSH port forwarding, we can connect to 127.0.0.1:8888 from a remote machine.

Here’s how you do that with a command-line ssh client:

ssh -L8888:127.0.0.1:8888 asdf1234@cluster

Here’s how you do it with PuTTY:

putty

Click “Add.”  You should see this:

putty2

Now open your connection and login to the remote machine. Once there, cd to the directory where your data is and type

ipython notebook --pylab=inline

If you’re using X forwarding, this will open up the elinks text browser, which is woefully incapable of handling the HTML notebook. Fortunately that doesn’t sink the demo. You’ll see something like this:

elinks

This means that the iPython HTML notebook is up and running. If you actually want to use it, howerver, you need a better browser. Fortunately, we opened up SSH with a tunnel…

Open the Notebook in Your Browser

This was the one part of the demo that wasn’t under my control. You need a modern web browser, and I just had to hope that someone was keeping the conference room computer more or less up to date. My fallback plan was to use a text-mode ipython over ssh, but the notebook is much more fun! Fortunately for me, the computer had Firefox 14.

In your URL bar, type in

http://127.0.0.1:8888

If everything works, you’ll see this:
dashboard
And you’re off to the races!

What Just Happened?

I said earlier that 127.0.0.1 is a special IP address that’s only reachable locally, i.e. on the machine you’re sitting in front of. Port 8888 on 127.0.0.1 is where ipython serves its HTML notebook, so you’d think the idea of using the HTML notebook over the network isn’t going to fly.

When you log in through ssh, however, it’s as if you are actually sitting in front of the computer you’re connected to. Every program you run, runs on that computer. Port forwarding takes this a step further and presents all traffic on port 8888 to the remote computer as if it were actually on the remote computer’s port 8888.