Easy vectorized parallel plots for multiple data sets

I will share a quick and straightforward way to generate parallel plots in Python for multiple groups of data. The idea is to move from the parallel axis plot tool to a method that lets the plots be exported as a vectorized image. You can also take a look at Matt’s parallel.py code, available on GitHub: https://github.com/matthewjwoodruff/parallel.py .

This is the type of figure that you will get:

parallel3.png

The previous figure was generated with the following lines of code:

import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates  # older pandas versions use pandas.tools.plotting
import matplotlib.pyplot as plt
import seaborn

data = pd.read_csv('sample_data.csv')

parallel_coordinates(data, 'Name', color=['#225ea8', '#7fcdbb', '#1d91c0'], linewidth=5, alpha=0.8)
plt.ylabel('Direction of Preference $\\rightarrow$', fontsize=12)

plt.savefig('parallel_plot.svg')

The first five lines import the required libraries. I threw in seaborn just to get the gray background; it is not necessary. In the parallel_coordinates function you need to specify the data, the 'Name' column, and the colors of the different groups. You can substitute the color argument with colormap and specify the colormap you wish to use (e.g., colormap='YlGnBu'), as in the sketch below. I also specified an alpha for transparency so that overlapping lines remain visible. If you want to learn more, you can take a look at the parallel_coordinates source code. I found this Stack Overflow link very useful; it shows some examples of editing the source code to enable other capabilities.
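For example, here is a minimal variant of the call above (not from the original post), reusing the same imports and data but letting pandas pick colors from a colormap instead of an explicit color list:

# Variant sketch: assign line colors from a matplotlib colormap
parallel_coordinates(data, 'Name', colormap='YlGnBu', linewidth=5, alpha=0.8)
plt.savefig('parallel_plot_colormap.svg')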

Finally, the following snippet shows the format of the input data (the sample_data.csv file that is read by the read_csv call above):

sample_data.png

Columns A-G contain the different categories to be plotted (e.g., the objectives of a problem), and column H contains the names of the different data groups. And there you have it; I hope you find this plotting alternative useful.


Importing, Exporting and Organizing Time Series Data in Python – Part 1

This blog post is Part 1 of a multi-part series of posts (see here for Part 2) intended to introduce (1) options available in Python for reading (importing) data, with particular emphasis on time series data and on how to handle Excel spreadsheets; (2) organizing time series data in Python (with emphasis on using the open-source data analysis library pandas); and (3) exporting/saving data from Python.

In modeling water resources and environmental systems, we frequently must import and export large quantities of data (particularly time series data), both to run models and to process model outputs. I will describe the approaches and tools I have found to be most useful for managing data in these circumstances.

This blog post focuses on approaches for reading (importing) time series data, with particular emphasis on how (and how not) to handle data in MS Excel spreadsheets. Future posts will cover the pandas data management/export component.

Through an example that follows, there are two main lessons I hope to convey in this post:

  1. If you can store data, especially time series data, in text (.txt) or comma-separated value (.csv) files, the time required for importing and exporting data will be vastly reduced compared to attempting the same thing with an Excel workbook (.xlsx file). Hence, if you must work with an Excel workbook, I suggest saving each worksheet as a separate .csv file and using those instead (a one-time conversion sketch follows this list). A .csv file can still be viewed in the MS Excel interface, but is much faster to import. I’ll show below how much time this can save you.
  2. There are many ways to analyze the data once it is loaded into Python, but my suggestion is that you take advantage of pandas, an open-source data analysis library in Python. This is why my example below makes use of built-in pandas features for importing data. (I will have some future posts with useful pandas code for doing time series analysis/plotting. There are other pandas posts on our blog as well.)
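As a one-time conversion, something like the following sketch will write each worksheet of a workbook to its own .csv file. (This is not from the original post; the file name is only illustrative, and older pandas versions spell the keyword sheetname instead of sheet_name.)

import pandas as pd

# Read every worksheet of the workbook into a dictionary of {sheet name: DataFrame}
sheets = pd.read_excel('Reservoir_1.xlsx', sheet_name=None)
for sheet_name, df in sheets.items():
    # Write each worksheet out as its own .csv file
    df.to_csv('Reservoir_1_%s.csv' % sheet_name, index=False)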

It is important at this point to note that Microsoft Excel files (.xlsx or .xls) are NOT an ideal means of storing data. I have had the great misfortune of working with Excel (and Visual Basic for Applications) intensively for many years, and it can be somewhat of a disaster for data management. I am using Excel only because the input data files for the model I work with are stored as .xlsx files, and because this is the data storage interface with which others using my Python-based model are most comfortable.

An Example:

Suppose that you wish to read in and analyze time series of 100 years of simulated daily reservoir releases from 2 reservoirs (“Reservoir_1” and “Reservoir_2”) for 100 separate simulations (or realizations). Let’s assume data for the reservoirs is stored in two separate files, either .csv or .xlsx files (e.g., “Reservoir_1.xlsx” and “Reservoir_2.xlsx”, or “Reservoir_1.csv” and “Reservoir_2.csv”).

The figure below shows a screenshot of an example layout of the file in either case (.csv or .xlsx). Feel free to make your own such file, with column names that correspond to the title of each simulation realization.

Worksheet appearance

There are two import options for these data.

Option 1: Import your data directly into a dictionary of two pandas DataFrame objects, where the keys of the Python dictionary are the two reservoir names. Here is some code that does this. All you need to do is run this script in a directory where you also have your input files saved.


# Imports
import pandas as pd
from datetime import datetime
from datetime import timedelta
import time

start_time = time.time()  # Keep track of import time

start_date = datetime(1910, 1, 1)
simulation_dates = pd.date_range(start_date, start_date + timedelta(365*100 - 1))

# Specify File Type (Choices: (1) 'XLSX', or (2) 'CSV')
File_Type = 'CSV'

location_list = ['Reservoir_1', 'Reservoir_2']  # Names of data files
start = 0  # starting column of input file from which to import data
stop = 99  # ending column of input file from which to import data. Must not exceed number of columns in input file.

# Create a dictionary where keys are items in location_list, and content associated with each key is a pandas dataframe.
if File_Type == 'XLSX':  # compare strings with '==', not 'is'
    extension = '.xlsx'
    # sheet_name=0 reads the first worksheet (older pandas spells this keyword 'sheetname')
    Excel_data_dictionary = {name: pd.read_excel(name + extension, sheet_name=0, usecols=[i for i in range(start, stop)])
                             for name in location_list}
    # Reset indices as desired dates
    for keys in Excel_data_dictionary:
        Excel_data_dictionary[keys] = Excel_data_dictionary[keys].set_index(simulation_dates)
    print('XLSX Data Import Complete in %s seconds' % (time.time() - start_time))
elif File_Type == 'CSV':  # compare strings with '==', not 'is'
    extension = '.csv'
    CSV_data_dictionary = {name: pd.read_csv(name + extension, usecols=[i for i in range(start, stop)]) for name in location_list}
    # Reset indices as desired dates
    for keys in CSV_data_dictionary:
        CSV_data_dictionary[keys] = CSV_data_dictionary[keys].set_index(simulation_dates)
    print('CSV Data Import Complete in %s seconds' % (time.time() - start_time))

When I run this code, the .csv data import took about 1.5 seconds on my office desktop computer (not bad), while the .xlsx import took about 100 seconds (awful!). This is why you should avoid storing data in Excel workbooks.

You can visualize the pandas DataFrame object per the image below, with a row index (major axis) that corresponds to dates and columns (minor axis) that correspond to each simulation run (or realization). Note that the code above manually sets the index to date values, since my input sheet had no actual dates listed in it.

Pandas DF

Now, if I want to see the data for a particular realization or date, pandas makes this easy. For example, the following would access the DataFrame column corresponding to realization 8 for Reservoir_1:

CSV_data_dictionary['Reservoir_1']['Realization8']
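Because the index was set to simulation_dates, date-based selection works just as easily. For instance (the dates below are only illustrative):

# All realizations for Reservoir_1 on a single simulated day
CSV_data_dictionary['Reservoir_1'].loc['1950-06-01']

# Realization 8 for a range of simulated years
CSV_data_dictionary['Reservoir_1']['Realization8']['1950':'1960']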

Option 2: If you must work with .xlsx files and need to import specific sheets and cell ranges from those files, I recommend that you use the open-source Python library OpenPyXL. (There are other options, some of which are reviewed here.) OpenPyXL is nice because it can both read from and write to Excel, it can loop through worksheets and the cells within those worksheets, and it does not require that you have Microsoft Office if you are working in Windows. Indeed, it does not require Windows at all (it works well on Linux, though I have not tried it on OS X). OpenPyXL documentation is available here and includes numerous examples; this page also provides some nice examples of using OpenPyXL. A minimal reading sketch follows the tips below.

Some additional tips regarding OpenPyXL:

  • Make sure you are working with the latest version. If you install openpyxl through an IDE, be sure it’s updated to at least version 2.2.4 or 2.2.5. I’ve had problems with earlier versions.
  • I believe OpenPyXL only works with .xlsx files, not .xls files.
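Here is a minimal reading sketch (not from the original post; the file, worksheet name, and cell range are hypothetical):

from openpyxl import load_workbook

wb = load_workbook('Reservoir_1.xlsx')  # data_only=True would return cached formula results instead of formulas
ws = wb['Sheet1']                       # access a worksheet by name

# Collect the values in a rectangular cell range as a list of rows
values = [[cell.value for cell in row] for row in ws['A1:C10']]
print(values)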

Scenario discovery in Python

The purpose of this blog post is to demonstrate how one can do scenario discovery in Python. It uses the exploratory modeling workbench available on GitHub. I will demonstrate how we can perform PRIM in an interactive way, and briefly show how to use CART, which is also available in the exploratory modeling workbench. There is ample literature on both CART and PRIM and their relative merits for scenario discovery, so I won’t discuss that here in any detail. This post was first written as an IPython notebook, which can be found here.

The workbench is meant as a one-stop shop for doing exploratory modeling, scenario discovery, and (multi-objective) robust decision making. To support this, the workbench is split into several packages. The most important are expWorkbench, which contains the support for setting up and executing computational experiments or (multi-objective) optimization with models; the connectors package, which contains connectors to Vensim (a system dynamics modeling package), NetLogo (an agent-based modeling environment), and Excel; and the analysis package, which contains a wide range of techniques for visualization and analysis of the results from series of computational experiments. Here, we will focus on the analysis package. In some future blog post, I plan to demonstrate the use of the workbench for performing computational experimentation and multi-objective (robust) optimization.

The workbench can be found on GitHub and downloaded from there. At present, the workbench is only available for Python 2.7. There is a separate branch where I am working on a version of the workbench that works under both Python 2.7 and 3. The workbench depends on various scientific Python libraries; if you have a standard scientific Python distribution installed, like Anaconda, the main dependencies will be met. In addition to the standard scientific Python libraries, the workbench also depends on deap for genetic algorithms. There are also some optional dependencies: seaborn and mpld3 for nicer and interactive visualizations, and JPype for controlling models implemented in Java, like NetLogo, from within the workbench.

In order to demonstrate the use of the exploratory modeling workbench for scenario discovery, I am using a published example: the data used in the original article by Ben Bryant and Rob Lempert where they first introduced scenario discovery. Ben Bryant kindly made this data available for my use. The data comes as a .csv file, which we can import easily using pandas. Columns 2 up to and including 10 contain the experimental design, while the classification is presented in column 15.

import pandas as pd

data = pd.DataFrame.from_csv('./data/bryant et al 2010 data.csv',
                             index_col=False)  # pd.read_csv in newer pandas versions
x = data.ix[:, 2:11]  # columns 2-10: the experimental design (.iloc in newer pandas)
y = data.ix[:, 15]    # column 15: the classification

The exploratory modeling workbench is built on top of numpy rather than pandas. This is partly a path dependency issue: the earliest version of PRIM in the workbench is from 2012, when pandas was still under heavy development. Another issue is that pandas does not contain explicit information on the datatypes of the columns. The implementation of PRIM in the exploratory modeling workbench is, however, datatype aware, in contrast to the scenario discovery toolkit in R; that is, it handles categorical data differently than continuous data. Internally, PRIM uses a numpy structured array for x and a numpy array for y. We can easily transform the pandas DataFrame to either.

x = x.to_records()
y = y.values

The exploratory modeling workbench comes with a separate analysis package. This analysis package contains PRIM, so let’s import it. The workbench also has its own logging functionality, which we can turn on to get some more insight into PRIM while it is running.

from analysis import prim
from expWorkbench import ema_logging
ema_logging.log_to_stderr(ema_logging.INFO);

Next, we need to instantiate the PRIM algorithm. To mimic the original work of Ben Bryant and Rob Lempert, we set the peeling alpha to 0.1. The peeling alpha determines how much data is peeled off in each iteration of the algorithm; the lower the value, the less data is removed in each iteration. The minimum coverage threshold that a box should meet is set to 0.8. We can then use the instantiated algorithm to find a first box.

prim_alg = prim.Prim(x, y, threshold=0.8, peel_alpha=0.1)
box1 = prim_alg.find_box()

Let’s investigate this first box in some detail. A first thing to look at is the trade-off between coverage and density. The box has a convenience function for this called show_tradeoff. To support working in the IPython notebook, this method returns a matplotlib figure with some additional information that can be used by mpld3.

import matplotlib.pyplot as plt

box1.show_tradeoff()
plt.show()

fig1

The notebook contains an mpld3 version of the same figure with interactive pop-ups. Let’s look at point 21, just as in the original paper. For this, we can use the inspect method. By default this displays two tables, but we can also make a graph that contains the same information.

box1.inspect(21)
box1.inspect(21, style='graph')
plt.show()

This first displays two tables, followed by a figure

coverage    0.752809
density     0.770115
mass        0.098639
mean        0.770115
res dim     4.000000
Name: 21, dtype: float64

                            box 21
                               min         max     qp values
Demand elasticity        -0.422000   -0.202000  1.184930e-16
Biomass backstop price  150.049995  199.600006  3.515113e-11
Total biomass           450.000000  755.799988  4.716969e-06
Cellulosic cost          72.650002  133.699997  1.574133e-01

fig 2

If one were to do a detailed comparison with the results reported in the original article, one would see small numerical differences. These differences arise from subtle differences in implementation. The most important is that the exploratory modeling workbench uses a custom objective function inside PRIM that differs from the one used in the scenario discovery toolkit. Other differences have to do with details of the hill-climbing optimization used in PRIM, in particular how ties are handled in selecting the next step. The differences between the two implementations are only numerical, and don’t affect the overarching conclusions drawn from the analysis.

Let’s select box 21 and get a more detailed view of what it looks like. Following Bryant et al., we can use scatter plots for this.

box1.select(21)
fig = box1.show_pairs_scatter()
fig.set_size_inches((12,12))
plt.show()

fig3

We have now found a first box that explains close to 80% of the cases of interest. Let’s see if we can find a second box that explains the remainder of the cases.

box2 = prim_alg.find_box()

The logging will inform us in this case that no additional box can be found. The best coverage we can achieve is 0.35, which is well below the specified 0.8 threshold. Let’s look at the final overall results from interactively fitting PRIM to the data. For this, we can use two convenience functions that transform the stats and the boxes to pandas DataFrames.

print prim_alg.stats_to_dataframe()
print prim_alg.boxes_to_dataframe()
       coverage   density      mass  res_dim
box 1  0.752809  0.770115  0.098639        4
box 2  0.247191  0.027673  0.901361        0
                             box 1              box 2
                               min         max    min         max
Demand elasticity        -0.422000   -0.202000   -0.8   -0.202000
Biomass backstop price  150.049995  199.600006   90.0  199.600006
Total biomass           450.000000  755.799988  450.0  997.799988
Cellulosic cost          72.650002  133.699997   67.0  133.699997

For comparison, we can also use CART for scenario discovery. This is readily supported by the exploratory modeling workbench.

from analysis import cart
cart_alg = cart.CART(x, y, 0.05)
cart_alg.build_tree()

Now that we have trained CART on the data, we can investigate its results. Just like PRIM, we can use stats_to_dataframe and boxes_to_dataframe to get an overview.

print cart_alg.stats_to_dataframe()
print cart_alg.boxes_to_dataframe()
       coverage   density      mass  res dim
box 1  0.011236  0.021739  0.052154        2
box 2  0.000000  0.000000  0.546485        2
box 3  0.000000  0.000000  0.103175        2
box 4  0.044944  0.090909  0.049887        2
box 5  0.224719  0.434783  0.052154        2
box 6  0.112360  0.227273  0.049887        3
box 7  0.000000  0.000000  0.051020        3
box 8  0.606742  0.642857  0.095238        2
                       box 1                  box 2               box 3  \
                         min         max        min         max     min
Cellulosic yield        80.0   81.649994  81.649994   99.900002  80.000
Demand elasticity       -0.8   -0.439000  -0.800000   -0.439000  -0.439
Biomass backstop price  90.0  199.600006  90.000000  199.600006  90.000   

                                         box 4                box 5  \
                               max         min         max      min
Cellulosic yield         99.900002   80.000000   99.900002   80.000
Demand elasticity        -0.316500   -0.439000   -0.316500   -0.439
Biomass backstop price  144.350006  144.350006  170.750000  170.750   

                                      box 6                  box 7  \
                               max      min         max        min
Cellulosic yield         99.900002  80.0000   89.050003  89.050003
Demand elasticity        -0.316500  -0.3165   -0.202000  -0.316500
Biomass backstop price  199.600006  90.0000  148.300003  90.000000   

                                         box 8
                               max         min         max
Cellulosic yield         99.900002   80.000000   99.900002
Demand elasticity        -0.202000   -0.316500   -0.202000
Biomass backstop price  148.300003  148.300003  199.600006

Alternatively, we might want to look at the classification tree directly. For this, we can use the show_tree method, which returns an image that we can either save or display; a sketch follows below.
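For example, something along these lines (a sketch only; exactly what show_tree returns depends on the workbench version, so treat the savefig call as an assumption):

tree_fig = cart_alg.show_tree()
tree_fig.savefig('cart_tree.png')  # assumption: the returned object behaves like a matplotlib figure
plt.show()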

fig3

If we look at the results of CART and PRIM, we can see that in this case PRIM produces a better description of the data. The best box found by CART has a coverage and density of a little above 0.6. In contrast, PRIM produces a box with coverage and density above 0.75.

All of the Analysis Code for my Latest Study is on GitHub

I’ve published to GitHub all of the code I wrote for the paper I’m currently working on.  This includes:

  • Python PBS submission script
  • Python scripts to automate reference set generation using MOEAFramework
  • Python scripts to automate hypervolume calculation using MOEAFramework and the WFG hypervolume engine
  • Python / Pandas scripts for statistical summaries of the hypervolume data
  • Python scripts to automate Sobol’ sensitivity analysis using MOEAFramework and tabulate the results.  (If I were starting today, I’d have an SALib version too.)
  • Python / Pandas / Matplotlib figure generation scripts:
    • Control maps for hypervolume attainment
    • Radial convergence plots (“spider plots”) for Sobol’ global sensitivity analysis results
    • Bar charts for Sobol’ global sensitivity analysis results
    • CDF plots (dot / shaded bar, plus actual CDF plots) for hypervolume attainment
    • Parallel coordinate plots
    • Input file generation for AeroVis glyph plotting
    • Joint PDF plots for hypervolume attainment across multiple problems

Not all of the figures I mentioned will turn up in the paper, but I provide them as examples in case they prove helpful.

Connecting to an iPython HTML Notebook on the Cluster Using an SSH Tunnel

Magic

I didn’t have the time or inclination to try to set up the iPython HTML notebook on the conference room computer for yesterday’s demo, but I really wanted to use the HTML notebook. What to do?

Magic.

Magic in this case means running the iPython HTML notebook on the cluster, forwarding the HTTP port that the HTML notebook uses, and displaying the session in a web browser running locally. In the rest of this post, I’ll explain each of the moving parts.

iPython HTML Notebook on the Cluster

The ipython that comes for free on the cluster doesn’t support the HTML notebook because the python/2.7 module doesn’t have tornado or pyzmq. On the plus side, you do have easy_install, so setting up these dependencies isn’t too hard.

  1. Make a directory for your personal Python packages:
    mkdir /gpfs/home/asdf1234/local
  2. In your .bashrc, add
    export PYTHONPATH=$HOME/local/lib/python2.7/site-packages
  3. Install tornado and pyzmq into that directory:
    python -m easy_install --prefix /gpfs/home/asdf1234/local tornado
    python -m easy_install --prefix /gpfs/home/asdf1234/local pyzmq

If you have a local X server, you can check to see if this works:

ssh -Y asdf1234@cluster 
ipython notebook --pylab=inline

Firefox should pop up with the HTML notebook. It’s perfectly usable like this, but I also didn’t want to set up an X server on the conference room computer. This leads us to…

Forwarding Port 8888

By default, the HTML notebook serves HTTP on port 8888. If you’re sitting in front of the machine where it’s running, you get to it through the loopback address, 127.0.0.1:8888. The loopback address is only reachable locally, but with SSH port forwarding we can reach the cluster’s 127.0.0.1:8888 from another machine.

Here’s how you do that with a command-line ssh client:

ssh -L8888:127.0.0.1:8888 asdf1234@cluster

Here’s how you do it with PuTTY:

putty

Click “Add.”  You should see this:

putty2

Now open your connection and log in to the remote machine. Once there, cd to the directory where your data is and type

ipython notebook --pylab=inline

If you’re using X forwarding, this will open up the elinks text browser, which is woefully incapable of handling the HTML notebook. Fortunately that doesn’t sink the demo. You’ll see something like this:

elinks

This means that the iPython HTML notebook is up and running. If you actually want to use it, however, you need a better browser. Fortunately, we opened up SSH with a tunnel…

Open the Notebook in Your Browser

This was the one part of the demo that wasn’t under my control. You need a modern web browser, and I just had to hope that someone was keeping the conference room computer more or less up to date. My fallback plan was to use a text-mode ipython over ssh, but the notebook is much more fun! Fortunately for me, the computer had Firefox 14.

In your URL bar, type in

http://127.0.0.1:8888

If everything works, you’ll see this:
dashboard
And you’re off to the races!

What Just Happened?

I said earlier that 127.0.0.1 is a special IP address that’s only reachable locally, i.e. on the machine you’re sitting in front of. Port 8888 on 127.0.0.1 is where ipython serves its HTML notebook, so you’d think the idea of using the HTML notebook over the network isn’t going to fly.

When you log in through SSH, however, it’s as if you were actually sitting in front of the computer you’re connected to: every program you run, runs on that computer. Port forwarding takes this one step further. Traffic sent to port 8888 on your local machine is carried over the SSH connection and delivered to port 8888 on the remote machine, as if it had originated there.

Python Data Analysis Part 2: Pandas / Matplotlib Live Demo

This post is part of our series on Python. Part 1 came in three parts: a, b, and c. We also have a post on setting up Python, and another on how to write a Python job submission script. For all the Python fun, search “python” in the search box on the right.

Transcript

At yesterday’s group meeting, I promised a transcript of the live demo. I’m pasting it below with light edits for formatting.

  1 import pandas
  2 import matplotlib
  3 metrics = pandas.read_table("metrics.txt")
  4 metrics
  5 metrics[0:40]
  6 metrics["roundeddown"] = 3000 * ( metrics["NFE"] // 3000)
  7 metrics[0:40]
  8 grouped = metrics.groupby(["model", "roundeddown"])
  9 mean = grouped.mean()
 10 mean
 11 mean.ix["response"]
 12 grouped = grouped["SBX+PM"]
 13 mean = grouped.mean()
 14 mean
 15 fig = pyplot.figure(figsize=(10,6))
 16 ax = fig.add_subplot(1,1,1)
 17 ax.plot(mean.ix["response"].index.values,
 18         mean.ix["response"].values, color = 'r')
 19 fig
 20 ax.plot(mean.ix["gasp"].index.values,
 21         mean.ix["gasp"].values, color = 'b')
 22 fig
 23 summary = pandas.DataFrame(
 24             data = {
 25                     "mean":grouped.mean(),
 26                     "tenth":grouped.quantile(0.1),
 27                     "ninetieth":grouped.quantile(0.9)
 28             }
 29           )
 30 summary
 31 fig = pyplot.figure(figsize=(10,12))
 32 ax = fig.add_subplot(2,1,1)
 33 index = summary.ix["gasp"].index.values
 34 mean = summary.ix["gasp"]["mean"].values
 35 tenth = summary.ix["gasp"]["tenth"].values
 36 ninetieth=summary.ix["gasp"]["ninetieth"].values
 37 ax.plot(index, mean,
 38         index, tenth,
 39         index, ninetieth,
 40         color='k')
 41 fig
 42 ax.fill_between(index,tenth,ninetieth,color='k',alpha=0.01)
 43 fig
 44 ax.fill_between(index,tenth,ninetieth,color='k',alpha=0.1)
 45 fig
 46 ax = fig.add_subplot(2,1,2)
 47 index = summary.ix["response"].index.values
 48 mean = summary.ix["response"]["mean"].values
 49 tenth = summary.ix["response"]["tenth"].values
 50 ninetieth=summary.ix["response"]["ninetieth"].values
 51 ax.plot(index, mean,
 52         index, tenth,
 53         index, ninetieth,
 54         color='k')
 55 fig
 56 ax.fill_between(index,tenth,ninetieth,color='k',alpha=0.1)
 57 fig
 58 for ax in fig.axes:
 59     ax.set_yscale("log")
 60     ax.set_ylim(0.0001, 1.0)
 61     ax.set_xlim(0,103000)
 62 fig

Description

The transcript picks up where your homework left off. We have 50 seeds worth of data for two models:

sbx

And we want to produce a summary plot:

SBX_filled

You should be able to use text-based ipython on the cluster without any additional setup. However, you’ll need to add from matplotlib import pyplot to the top before you start typing in code from the transcript. (The HTML notebook I used for the demo imports pyplot automatically.)

If you want to see your figures, you’ll have to take the somewhat cumbersome steps of saving them (fig.savefig("filename")), copying them to your local machine (scp user1234@hammer.rcc.psu.edu:workingdirectory/filename.png .), and then viewing them.
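Spelled out, that workflow looks roughly like this (the user name, host, and file names are just the examples from above):

# On the cluster, inside ipython: save the current figure to disk
fig.savefig("filename.png")

# Then, from a terminal on your local machine, copy the file over and view it:
# scp user1234@hammer.rcc.psu.edu:workingdirectory/filename.png .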

I used an ipython HTML notebook for the demo. I’ll be making two posts about how to do this: one about how I did it for the demo, and another about installing ipython on your desktop.

Python Data Analysis Part 1c: Borg Runtime Metrics Plots

Borg Runtime Metrics

Have you done Part 1a yet?  How about Part 1b? Go do those first, and come back with metrics.txt, which has all of the data in a single file.  Afterwards, please check out Part 2.

We’re interested in Borg’s runtime metrics because they tell us interesting things about the problem we’re optimizing. See the Borg paper for more details.

The Script

Please consider this one released under the MIT license.

  1 import pandas
  2 import matplotlib
  3 import matplotlib.backends.backend_svg as svg
  4 import matplotlib.backends.backend_agg as agg
  5
  6 metrics = pandas.read_table('metrics.txt')
  7
  8 models = ["gasp", "response"]
  9 colors = ['b', 'r']
 10 seeds = range(50)
 11 toplot = [
 12     "Elapsed Time", "Population Size", "Archive Size",
 13     "GenerationalDistance", "AdditiveEpsilonIndicator",
 14     "SBX+PM", "DifferentialEvolution+PM",
 15     "PCX", "SPX", "UNDX", "UM"
 16     ]
 17 titles = {
 18     "GenerationalDistance": "Generational Distance",
 19     "AdditiveEpsilonIndicator": "Additive Epsilon Indicator",
 20     "SBX+PM": "Simulated Binary Crossover",
 21     "DifferentialEvolution+PM": "Differential Evolution",
 22     "PCX": "Parent Centric Crossover", "SPX": "Simplex Crossover",
 23     "UNDX": "Unimodal Normally Distributed Crossover",
 24     "UM": "Uniform Mutation"
 25     }
 26 filenames = {
 27     "Elapsed Time": "time", "Population Size": "popsize",
 28     "Archive Size": "archive",
 29     "GenerationalDistance": "gd",
 30     "AdditiveEpsilonIndicator":"aei",
 31     "SBX+PM": "sbx", "DifferentialEvolution+PM":"de",
 32     "PCX":"pcx", "SPX":"spx", "UNDX":"undx", "UM":"um"
 33     }
 34 axis_limits = {
 35     "SBX+PM": (0.0, 1.0), "DifferentialEvolution+PM": (0.0, 1.0),
 36     "PCX": (0.0, 1.0), "SPX": (0.0, 1.0), "UNDX": (0.0, 1.0),
 37     "UM": (0.0, 1.0)
 38     }
 39 for column in toplot:
 40     fig = matplotlib.figure.Figure(figsize=(10,6))
 41     svg.FigureCanvasSVG(fig) # for SVG
 42     # agg.FigureCanvasAgg(fig) # for PNG
 43     ax = fig.add_subplot(1,1,1)
 44     for model, color in zip(models, colors):
 45         for seed in seeds:
 46             filtered = metrics[(metrics['model'] == model) &
 47                                (metrics['seed'] == seed)]
 48             line = ax.plot(filtered['NFE'], filtered[column],
 49                     color=color)[0]
 50             line.set_label('_nolegend_')
 51         line.set_label({"gasp":"GASP","response":"RSM"}[model])
 52
 53     ax.set_xlim(0, 100000)
 54     limits = axis_limits.get(column, None)
 55     if limits:
 56         ax.set_ylim(limits[0], limits[1])
 57
 58     ax.legend(bbox_to_anchor=(1.0, 1.0))
 59     ax.set_title(titles.get(column, column))
 60     fig.savefig(filenames.get(column, column))

Reading in the Data

Remember all of the text manipulation we had to do in Part 1a to deal with our data? Pandas gives us a very convenient interface to our table of data that makes Part 1a look like banging rocks together.

Line 6 shows how I read in a table. No need to parse the header or tell Pandas that we used tabs as field separators. Pandas does all of that.

  6 metrics = pandas.read_table('metrics.txt')

metrics is now a pandas DataFrame object. Users of R will recognize the DataFrame concept: it’s a table of data with named columns, where the data in each column are homogeneous (e.g., all floating-point numbers or all text).

Setting Things Up

Line 8 identifies the models for which I expect to find data in metrics.txt, and Line 9 indicates what color to use for plotting, for each model. 'r' is red, and 'b' is blue. Line 10 identifies the seeds I expect to find in the data table. range(50) is a shorthand way of saying “integers 0 through 49”.

Lines 11 through 16 make a list of the columns I want to plot. Lines 17 through 25 set up a dictionary relating those column names to the names I want to use as plot titles. (Generational Distance instead of GenerationalDistance, for example). Likewise, Lines 26 through 33 make a dictionary relating column names to the names of the files where I want to save the plots, and Lines 34 through 38 specify Y-axis limits for the plots.

A dictionary (dict) in Python is an associative data structure. Every value stored in a dictionary is attached to a key, which is used to look it up. Wikipedia has a pretty good explanation.
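For instance, here is a tiny illustration using a trimmed-down version of the titles dictionary above; it also shows the get method with a default, which the script uses on Lines 54, 59, and 60:

titles = {"GenerationalDistance": "Generational Distance"}
print(titles["GenerationalDistance"])    # look up the value attached to a key
print(titles.get("Missing", "Missing"))  # get returns the default when the key is absent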

Making the Plots

Lines 39 through 60 make a plot for each of the columns we specified in toplot and save it to a file.

Setting up the Axes

When using Matplotlib, the axes object provides most of the plotting interface. Lines 40 through 43 set up an axes instance.

 40     fig = matplotlib.figure.Figure(figsize=(10,6))
 41     svg.FigureCanvasSVG(fig) # for SVG
 42     # agg.FigureCanvasAgg(fig) # for PNG
 43     ax = fig.add_subplot(1,1,1)

An axes object belongs to a figure. Line 40 sets up a figure, specifying that it should be 10 inches by 6 inches (this is a nominal size since we’re dealing with screen graphics).

Line 41 creates a canvas, which is a backend for drawing. This is not what the Matplotlib tutorials would have you do. They use pyplot.figure() to create a figure which then keeps a whole bunch of state in the background. The pyplot approach is apparently designed to ease the transition for Matlab users. Since I’m not a Matlab user it just seems weird to me, so I create a canvas explicitly. Commenting out Line 41 and switching to Line 42 instead switches between SVG and PNG output.

Line 43 creates the axes object for plotting the data in a column. add_subplot tells a figure where to put the axes. Matplotlib is designed from the ground up to support figures with multiple plots in them. The arguments 1, 1, 1 tell the figure it has a 1×1 array of subplots, and the one we’re interested in is in the first position.

Plotting the Runtime Metrics

Lines 44 through 51 do the plotting:

 44     for model, color in zip(models, colors):
 45         for seed in seeds:
 46             filtered = metrics[(metrics['model'] == model) &
 47                                (metrics['seed'] == seed)]
 48             line = ax.plot(filtered['NFE'], filtered[column],
 49                     color=color)[0]
 50             line.set_label('_nolegend_')
 51         line.set_label({"gasp":"GASP","response":"RSM"}[model])

The call to zip packs up models and colors into a list of tuples. It would look like this if you were to create it explicitly:

[("gasp", "b"), ("response", "r")]

So on the first iteration through the for loop, model is "gasp" and color is "b". Then on the second iteration, model is "response" and color is "r".

Line 45 opens a for loop that iterates through all 50 seeds (range(50) if you’ll recall.)

Lines 46 and 47 use a few excellent Pandas features. This bit:

metrics["model"] == model

returns an array of booleans (True/False values) indicating where the condition (== model) is true. The indices of that array correspond to the rows of the metrics table.

(metrics['model'] == model) & (metrics['seed'] == seed)

is an array of booleans where both conditions are true.
Pandas lets you filter a table using an array of booleans, so

metrics[(metrics['model'] == model) & (metrics['seed'] == seed)]

is the subset of rows in the full table where the model and seed are as specified. Pause for a second and think about how you would do that if you didn’t have Pandas.
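If you want to see the mechanics on something small, here is a toy illustration (the data below is made up; it is not the metrics table):

import pandas
toy = pandas.DataFrame({"model": ["gasp", "gasp", "response"], "seed": [0, 1, 0]})
mask = (toy["model"] == "gasp") & (toy["seed"] == 0)
print(mask.tolist())  # [True, False, False]
print(toy[mask])      # only the first row survives the filter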

Lines 48 and 49 then call the plotting routine itself. It puts NFE on the X axis and whichever metric you’re plotting on the Y axis, and it uses the color we specified. The plot method returns a list of the lines created (it can make more than one at a time, although we aren’t.) So the subscript [0] at the end means that we just assign to line the single line created by our call to plot, rather than a 1-element list containing that line.

Line 50 excludes every line from the legend, since we don’t want 100 items in the legend. (50 seeds times two models). Line 51 then selectively enables the legend for one line from each model (and it doesn’t matter which one because they all look the same.) Note that Line 51 is outside the loop that starts on Line 45, so it only gets executed twice.
This part:

{"gasp":"GASP","response":"RSM"}[model]

is probably just me getting excessively clever. I could have written this instead:

if model == "gasp":
    line.set_label("GASP")
else:
    line.set_label("RSM")

But I got lazy.

Wrapping Up

Lines 53 through 56 fix up the plotting limits, add a title and a legend, and write the figure out to a file.

 53     ax.set_xlim(0, 100000)
 54     limits = axis_limits.get(column, None)
 55     if limits:
 56         ax.set_ylim(limits[0], limits[1])
 57
 58     ax.legend(bbox_to_anchor=(1.0, 1.0))
 59     ax.set_title(titles.get(column, column))
 60     fig.savefig(filenames.get(column, column))

I already know that I ran 100,000 function evaluations, so I set the X-axis limits accordingly for each plot on Line 53. Line 54 checks to see if I wanted to set limits for Y-axis for the column I’m plotting. I only want to do this for the operator probabilities, because I want to scale them all between 0 and 1. (See Lines 34 through 38.) This means that some of the columns don’t have limits specified. Ordinarily, asking a dictionary for the value associated with a key that’s not in a dictionary raises an exception. However, I’m using the get method with a default of None, so if I didn’t specify any limits in axis_limits, limits gets set to None. Line 55 tests whether that is the case, so Line 56 sets the Y-axis limits only if I actually specified limits when I declared axis_limits.

Line 58 makes a legend and puts it where I specify (bbox_to_anchor). Line 59 sets the title for the plot, and Line 60 writes it to a file. Note that the filenames specified on Lines 26 to 33 don’t include extensions (.svg or .png). Matplotlib decides which one to use based on whether we chose the SVG canvas or the Agg canvas on Lines 41 and 42.

The Result

If you run this script, these files should appear in your directory.

aei.svg  
archive.svg  
de.svg  
gd.svg  
pcx.svg  
popsize.svg  
sbx.svg  
spx.svg  
time.svg  
um.svg  
undx.svg

Here’s what sbx.svg looks like.

sbx

(WordPress doesn’t accept SVGs, so this is actually a PNG rasterization.)

Your Homework

If you’re in the Pat Reed group, please give this (and Part 1a) a try before our meeting on the 13th. If you have trouble, please post comments to the blog so I can clarify in public.

After all this, please proceed to part 2!