Python Data Analysis Part 1c: Borg Runtime Metrics Plots

Borg Runtime Metrics

Have you done Part 1a yet?  How about Part 1b? Go do those first, and come back with metrics.txt, which has all of the data in a single file.  Afterwards, please check out Part 2.

We’re interested in Borg’s runtime metrics because they tell us interesting things about the problem we’re optimizing. See the Borg paper for more details.

The Script

Please consider this one released under the MIT license.

  1 import pandas
  2 import matplotlib
  3 import matplotlib.backends.backend_svg as svg
  4 import matplotlib.backends.backend_agg as agg
  5
  6 metrics = pandas.read_table('metrics.txt')
  7
  8 models = ["gasp", "response"]
  9 colors = ['b', 'r']
 10 seeds = range(50)
 11 toplot = [
 12     "Elapsed Time", "Population Size", "Archive Size",
 13     "GenerationalDistance", "AdditiveEpsilonIndicator",
 14     "SBX+PM", "DifferentialEvolution+PM",
 15     "PCX", "SPX", "UNDX", "UM"
 16     ]
 17 titles = {
 18     "GenerationalDistance": "Generational Distance",
 19     "AdditiveEpsilonIndicator": "Additive Epsilon Indicator",
 20     "SBX+PM": "Simulated Binary Crossover",
 21     "DifferentialEvolution+PM": "Differential Evolution",
 22     "PCX": "Parent Centric Crossover", "SPX": "Simplex Crossover",
 23     "UNDX": "Unimodal Normally Distributed Crossover",
 24     "UM": "Uniform Mutation"
 25     }
 26 filenames = {
 27     "Elapsed Time": "time", "Population Size": "popsize",
 28     "Archive Size": "archive",
 29     "GenerationalDistance": "gd",
 30     "AdditiveEpsilonIndicator":"aei",
 31     "SBX+PM": "sbx", "DifferentialEvolution+PM":"de",
 32     "PCX":"pcx", "SPX":"spx", "UNDX":"undx", "UM":"um"
 33     }
 34 axis_limits = {
 35     "SBX+PM": (0.0, 1.0), "DifferentialEvolution+PM": (0.0, 1.0),
 36     "PCX": (0.0, 1.0), "SPX": (0.0, 1.0), "UNDX": (0.0, 1.0),
 37     "UM": (0.0, 1.0)
 38     }
 39 for column in toplot:
 40     fig = matplotlib.figure.Figure(figsize=(10,6))
 41     svg.FigureCanvasSVG(fig) # for SVG
 42     # agg.FigureCanvasAgg(fig) # for PNG
 43     ax = fig.add_subplot(1,1,1)
 44     for model, color in zip(models, colors):
 45         for seed in seeds:
 46             filtered = metrics[(metrics['model'] == model) &
 47                                (metrics['seed'] == seed)]
 48             line = ax.plot(filtered['NFE'], filtered[column],
 49                     color=color)[0]
 50             line.set_label('_nolegend_')
 51         line.set_label({"gasp":"GASP","response":"RSM"}[model])
 52
 53     ax.set_xlim(0, 100000)
 54     limits = axis_limits.get(column, None)
 55     if limits:
 56         ax.set_ylim(limits[0], limits[1])
 57
 58     ax.legend(bbox_to_anchor=(1.0, 1.0))
 59     ax.set_title(titles.get(column, column))
 60     fig.savefig(filenames.get(column, column))

Reading in the Data

Remember all of the text manipulation we had to do in Part 1a to deal with our data? Pandas gives us a very convenient interface to our table of data that makes Part 1a look like banging rocks together.

Line 6 shows how I read in a table. No need to parse the header or tell Pandas that we used tabs as field separators: read_table assumes tab-separated fields by default and takes the column names from the first line of the file.

  6 metrics = pandas.read_table('metrics.txt')

metrics is now a pandas DataFrame object. Users of R will recognize the DataFrame concept. It’s a table of data with named columns, where the data in each column are homogeneous (e.g. all floating-point numbers or all text).
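If you don’t have metrics.txt handy, here is a minimal sketch of what read_table does. The three rows below are made up for illustration (only the column names model, seed, NFE, and Archive Size come from the script), and io.StringIO just stands in for a file on disk.

```python
import io
import pandas

# Hypothetical stand-in for metrics.txt: tab-separated, header on the first line.
sample = (
    "model\tseed\tNFE\tArchive Size\n"
    "gasp\t0\t1000\t12\n"
    "gasp\t0\t2000\t15\n"
    "response\t0\t1000\t9\n"
)

# read_table defaults to tab separators and takes column names from the header.
demo = pandas.read_table(io.StringIO(sample))

print(demo.columns.tolist())
print(demo.dtypes)  # one type per column: text for model, integers for the rest
```

Printing demo.dtypes is a quick way to confirm that each column really did come in with a single type.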

Setting Things Up

Line 8 identifies the models for which I expect to find data in metrics.txt, and Line 9 indicates what color to use for plotting, for each model. 'r' is red, and 'b' is blue. Line 10 identifies the seeds I expect to find in the data table. range(50) is a shorthand way of saying “integers 0 through 49”.

Lines 11 through 16 make a list of the columns I want to plot. Lines 17 through 25 set up a dictionary relating those column names to the names I want to use as plot titles. (Generational Distance instead of GenerationalDistance, for example). Likewise, Lines 26 through 33 make a dictionary relating column names to the names of the files where I want to save the plots, and Lines 34 through 38 specify Y-axis limits for the plots.

A dictionary (dict) in Python is an associative data structure. Every value stored in a dictionary is attached to a key, which is used to look it up. Wikipedia has a pretty good explanation.
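If dictionaries are new to you, this short sketch shows the basics, using a couple of entries copied from the titles dictionary in the script.

```python
# A dictionary maps keys to values; these entries are copied from the script.
titles = {
    "GenerationalDistance": "Generational Distance",
    "UM": "Uniform Mutation",
}

print(titles["UM"])              # look up a value by its key
print("Elapsed Time" in titles)  # membership test on the keys

titles["PCX"] = "Parent Centric Crossover"  # add a new key/value pair
print(sorted(titles))            # iterating a dict gives you its keys
```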

Making the Plots

Lines 39 through 60 make a plot for each of the columns we specified in toplot and save it to a file.

Setting up the Axes

When using Matplotlib, the axes object provides most of the plotting interface. Lines 40 through 43 set up an axes instance.

 40     fig = matplotlib.figure.Figure(figsize=(10,6))
 41     svg.FigureCanvasSVG(fig) # for SVG
 42     # agg.FigureCanvasAgg(fig) # for PNG
 43     ax = fig.add_subplot(1,1,1)

An axes object belongs to a figure. Line 40 sets up a figure, specifying that it should be 10 inches by 6 inches (this is a nominal size since we’re dealing with screen graphics).

Line 41 creates a canvas, which is a backend for drawing. This is not what the Matplotlib tutorials would have you do. They use pyplot.figure() to create a figure which then keeps a whole bunch of state in the background. The pyplot approach is apparently designed to ease the transition for Matlab users. Since I’m not a Matlab user it just seems weird to me, so I create a canvas explicitly. Commenting out Line 41 and switching to Line 42 instead switches between SVG and PNG output.

Line 43 creates the axes object for plotting the data in a column. add_subplot tells a figure where to put the axes. Matplotlib is designed from the ground up to support figures with multiple plots in them. The arguments 1, 1, 1 tell the figure it has a 1×1 array of subplots, and the one we’re interested in is in the first position.
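To see what the three arguments to add_subplot mean, here is a small sketch with a 2×1 grid instead of 1×1. The data are made up, and I write to an in-memory buffer rather than a file so nothing lands on disk.

```python
import io
import matplotlib.figure
import matplotlib.backends.backend_agg as agg

fig = matplotlib.figure.Figure(figsize=(10, 6))
agg.FigureCanvasAgg(fig)  # attach a PNG-capable canvas, as on Line 42

# A 2x1 grid: position 1 is the top plot, position 2 is the bottom one.
top = fig.add_subplot(2, 1, 1)
bottom = fig.add_subplot(2, 1, 2)
top.plot([0, 1, 2], [0, 1, 4])
bottom.plot([0, 1, 2], [4, 1, 0])

buffer = io.BytesIO()  # in-memory file
fig.savefig(buffer, format="png")
print(len(fig.axes), "axes in the figure")
```

Each call to add_subplot returns its own axes object, so the two plots can be drawn on and configured independently.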

Plotting the Runtime Metrics

Lines 44 through 51 do the plotting:

 44     for model, color in zip(models, colors):
 45         for seed in seeds:
 46             filtered = metrics[(metrics['model'] == model) &
 47                                (metrics['seed'] == seed)]
 48             line = ax.plot(filtered['NFE'], filtered[column],
 49                     color=color)[0]
 50             line.set_label('_nolegend_')
 51         line.set_label({"gasp":"GASP","response":"RSM"}[model])

The call to zip pairs up the elements of models and colors into a sequence of tuples. It would look like this if you were to create it explicitly:

[("gasp", "b"), ("response", "r")]

So on the first iteration through the for loop, model is "gasp" and color is "b". Then on the second iteration, model is "response" and color is "r".
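You can check this for yourself at the Python prompt. One caveat for this sketch: in Python 3, zip returns a lazy iterator rather than a list, so I wrap it in list() to see the pairs.

```python
models = ["gasp", "response"]
colors = ["b", "r"]

# In Python 3, zip returns a lazy iterator; list() makes the pairs visible.
pairs = list(zip(models, colors))
print(pairs)

# Unpacking each tuple in the for statement gives us two loop variables.
for model, color in zip(models, colors):
    print(model, color)
```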

Line 45 opens a for loop that iterates through all 50 seeds (range(50), if you’ll recall).

Lines 46 and 47 use a few excellent Pandas features. This bit:

metrics["model"] == model

returns an array of booleans (a Pandas Series), one per row of the metrics table: True where the condition ( == model) holds and False everywhere else. The indices of that array correspond to the rows in the metrics table.

(metrics['model'] == model) & (metrics['seed'] == seed)

is an array of booleans that is True only for the rows where both conditions hold.
Pandas lets you filter a table using an array of booleans, so

metrics[(metrics['model'] == model) & (metrics['seed'] == seed)]

is the subset of rows in the full table where the model and seed are as specified. Pause for a second and think about how you would do that if you didn’t have Pandas.
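Here is the same filtering idea as a self-contained sketch on a tiny table. The four rows are made up; only the column names match the script.

```python
import pandas

# Made-up rows, for illustration only.
demo = pandas.DataFrame({
    "model": ["gasp", "gasp", "response", "response"],
    "seed": [0, 1, 0, 1],
    "NFE": [1000, 1000, 1000, 1000],
})

# Each comparison yields one True/False per row; & combines them row by row.
mask = (demo["model"] == "gasp") & (demo["seed"] == 1)
print(mask.tolist())

# Indexing the table with the mask keeps only the rows marked True.
filtered = demo[mask]
print(filtered)
```

The parentheses around each comparison matter: in Python, & binds more tightly than ==, so leaving them out changes what gets compared.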

Lines 48 and 49 then call the plotting routine itself. It puts NFE on the X axis and whichever metric you’re plotting on the Y axis, and it uses the color we specified. The plot method returns a list of the lines created (it can make more than one at a time, although we don’t do that here). So the subscript [0] at the end means that we just assign to line the single line created by our call to plot, rather than a 1-element list containing that line.
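A quick sketch of that return value, with made-up coordinates. The first call passes two X/Y pairs at once, which is one way to get more than one line back.

```python
import matplotlib.figure
import matplotlib.lines

fig = matplotlib.figure.Figure()
ax = fig.add_subplot(1, 1, 1)

# Two x/y pairs in one call produce two lines.
two = ax.plot([0, 1], [0, 1], [0, 1], [1, 0])

# One x/y pair still comes back as a list, just with a single element.
one = ax.plot([0, 1], [0.5, 0.5], color="b")
line = one[0]

print(len(two), len(one), type(line).__name__)
```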

Line 50 excludes every line from the legend, since we don’t want 100 items in the legend. (50 seeds times two models). Line 51 then selectively enables the legend for one line from each model (and it doesn’t matter which one because they all look the same.) Note that Line 51 is outside the loop that starts on Line 45, so it only gets executed twice.
This part:

{"gasp":"GASP","response":"RSM"}[model]

is probably just me getting excessively clever. I could have written this instead:

if model == "gasp":
    line.set_label("GASP")
else:
    line.set_label("RSM")

But I got lazy.
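If you want to convince yourself that the legend really does end up with only two entries, here is a sketch with three made-up lines per model instead of 50. It relies on Matplotlib skipping any label that starts with an underscore when it builds the legend.

```python
import matplotlib.figure

fig = matplotlib.figure.Figure()
ax = fig.add_subplot(1, 1, 1)

for label, color in [("GASP", "b"), ("RSM", "r")]:
    for offset in range(3):  # stand-in for the 50 seeds
        line = ax.plot([0, 1], [offset, offset + 1], color=color)[0]
        line.set_label("_nolegend_")  # hidden from the legend
    line.set_label(label)  # relabel only the last line drawn for this model

# These are the entries ax.legend() would pick up by default.
handles, labels = ax.get_legend_handles_labels()
print(len(ax.lines), "lines drawn;", "legend entries:", labels)
```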

Wrapping Up

Lines 53 through 60 fix up the plotting limits, add a legend and a title, and write the figure out to a file.

 53     ax.set_xlim(0, 100000)
 54     limits = axis_limits.get(column, None)
 55     if limits:
 56         ax.set_ylim(limits[0], limits[1])
 57
 58     ax.legend(bbox_to_anchor=(1.0, 1.0))
 59     ax.set_title(titles.get(column, column))
 60     fig.savefig(filenames.get(column, column))

I already know that I ran 100,000 function evaluations, so I set the X-axis limits accordingly for each plot on Line 53. Line 54 checks to see if I wanted to set limits for the Y-axis for the column I’m plotting. I only want to do this for the operator probabilities, because I want to scale them all between 0 and 1. (See Lines 34 through 38.) This means that some of the columns don’t have limits specified. Ordinarily, asking a dictionary for the value associated with a key that’s not in the dictionary raises an exception (a KeyError). However, I’m using the get method with a default of None, so if I didn’t specify any limits in axis_limits, limits gets set to None. Line 55 tests whether limits is something other than None, so Line 56 sets the Y-axis limits only if I actually specified limits when I declared axis_limits.
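The difference between square-bracket lookup and get is easy to see in a sketch. I’ve trimmed axis_limits down to a single entry here to keep it short.

```python
axis_limits = {"UM": (0.0, 1.0)}  # trimmed-down copy of the dict in the script

# Square brackets raise KeyError for a key that isn't there...
try:
    axis_limits["Elapsed Time"]
    raised = False
except KeyError:
    raised = True

# ...while get hands back the default instead.
missing = axis_limits.get("Elapsed Time", None)
present = axis_limits.get("UM", None)

for name, limits in [("Elapsed Time", missing), ("UM", present)]:
    if limits:  # None counts as false, a non-empty tuple as true
        print(name, "-> fixed limits", limits)
    else:
        print(name, "-> let Matplotlib choose")
```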

Line 58 makes a legend and puts it where I specify (bbox_to_anchor). Line 59 sets the title for the plot, and Line 60 writes it to a file. Note that the filenames specified on Lines 26 to 33 don’t include extensions (.svg or .png). Matplotlib decides which one to use based on whether we chose the SVG canvas or the Agg canvas on Lines 41 and 42.
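Here is a sketch of that extension behavior. The file names vector and raster are made up, and I write into a temporary directory so nothing clutters your working directory. One assumption to flag: for the Agg canvas, the default extension comes from Matplotlib’s savefig.format setting, which is png unless you’ve changed it.

```python
import os
import tempfile
import matplotlib.figure
import matplotlib.backends.backend_svg as svg
import matplotlib.backends.backend_agg as agg

with tempfile.TemporaryDirectory() as folder:
    for name, canvas_class in [("vector", svg.FigureCanvasSVG),
                               ("raster", agg.FigureCanvasAgg)]:
        fig = matplotlib.figure.Figure(figsize=(4, 3))
        canvas_class(fig)  # the canvas determines the default file type
        fig.add_subplot(1, 1, 1).plot([0, 1], [0, 1])
        fig.savefig(os.path.join(folder, name))  # no extension given
    written = sorted(os.listdir(folder))

print(written)  # the names now carry the extensions Matplotlib chose
```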

The Result

If you run this script, these files should appear in your directory.

aei.svg  
archive.svg  
de.svg  
gd.svg  
pcx.svg  
popsize.svg  
sbx.svg  
spx.svg  
time.svg  
um.svg  
undx.svg

Here’s what sbx.svg looks like.

[Figure: sbx, the Simulated Binary Crossover plot]

(WordPress doesn’t accept SVGs, so this is actually a PNG rasterization.)

Your Homework

If you’re in the Pat Reed group, please give this (and Part 1a) a try before our meeting on the 13th. If you have trouble, please post comments to the blog so I can clarify in public.

After all this, please proceed to part 2!
