Connecting to an iPython HTML Notebook on the Cluster Using an SSH Tunnel

Magic

I didn’t have the time or inclination to try to set up the iPython HTML notebook on the conference room computer for yesterday’s demo, but I really wanted to use the HTML notebook. What to do?

Magic.

Magic in this case means running the iPython HTML notebook on the cluster, forwarding the HTTP port that the HTML notebook uses, and displaying the session in a web browser running locally. In the rest of this post, I’ll explain each of the moving parts.

iPython HTML Notebook on the Cluster

The ipython that comes for free on the cluster doesn’t support the HTML notebook because the python/2.7 module doesn’t have tornado or pyzmq. On the plus side, you do have easy_install, so setting up these dependencies isn’t too hard.

  1. Make a directory for your personal Python packages:
    mkdir /gpfs/home/asdf1234/local
  2. In your .bashrc, add
    export PYTHONPATH=$HOME/local/lib/python2.7/site-packages
  3. Install tornado and pyzmq into that directory (a quick import check follows below):
    python -m easy_install --prefix /gpfs/home/asdf1234/local tornado
    python -m easy_install --prefix /gpfs/home/asdf1234/local pyzmq
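Before firing up the notebook, a quick way to make sure the new packages landed where Python can find them is to try importing them (just a sanity check, not part of the original setup; note that pyzmq installs under the module name zmq):

import tornado
import zmq                  # this is pyzmq
print(tornado.version)      # the tornado release you just installed
print(zmq.zmq_version())    # the libzmq version pyzmq was built against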

If you have a local X server, you can check to see if this works:

ssh -Y asdf1234@cluster 
ipython notebook --pylab=inline

Firefox should pop up with the HTML notebook. It’s perfectly usable like this, but I also didn’t want to set up an X server on the conference room computer. This leads us to…

Forwarding Port 8888

By default, the HTML notebook serves HTTP on port 8888. If you’re sitting in front of the computer where it’s running, you get to it by using the loopback address 127.0.0.1:8888.
127.0.0.1 is only reachable locally, but using SSH port forwarding, we can reach the cluster’s 127.0.0.1:8888 from another machine.

Here’s how you do that with a command-line ssh client:

ssh -L8888:127.0.0.1:8888 asdf1234@cluster

Here’s how you do it with PuTTY. In the configuration window, go to Connection, then SSH, then Tunnels. Enter 8888 as the source port and 127.0.0.1:8888 as the destination:

[screenshot: PuTTY tunnel settings]

Click “Add.”  You should see the forwarded port appear in the list:

[screenshot: PuTTY with the forwarded port added]

Now open your connection and log in to the remote machine. Once there, cd to the directory where your data is and type

ipython notebook --pylab=inline

If you’re using X forwarding, this will open up the elinks text browser, which is woefully incapable of handling the HTML notebook. Fortunately that doesn’t sink the demo. You’ll see something like this:

[screenshot: elinks choking on the notebook page]

This means that the iPython HTML notebook is up and running. If you actually want to use it, however, you need a better browser. Fortunately, we opened up SSH with a tunnel…

Open the Notebook in Your Browser

This was the one part of the demo that wasn’t under my control. You need a modern web browser, and I just had to hope that someone was keeping the conference room computer more or less up to date. My fallback plan was to use a text-mode ipython over ssh, but the notebook is much more fun! Fortunately for me, the computer had Firefox 14.

In your URL bar, type in

http://127.0.0.1:8888

If everything works, you’ll see the notebook dashboard:

[screenshot: the iPython notebook dashboard]
And you’re off to the races!

What Just Happened?

I said earlier that 127.0.0.1 is a special IP address that’s only reachable locally, i.e. on the machine you’re sitting in front of. Port 8888 on 127.0.0.1 is where ipython serves its HTML notebook, so you’d think the idea of using the HTML notebook over the network isn’t going to fly.

When you log in through ssh, however, it’s as if you are actually sitting in front of the computer you’re connected to. Every program you run runs on that computer. Port forwarding takes this a step further: anything you send to port 8888 on your own machine is carried through the SSH connection and delivered to port 8888 on the remote machine, as if you had connected to it there.
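If you want to convince yourself that the tunnel is really doing the work, you can fetch the notebook’s front page through the forwarded port from the machine you’re sitting at (a minimal check using Python 2’s standard library; a status code of 200 means the request traveled through the tunnel and the notebook on the cluster answered):

import urllib2

response = urllib2.urlopen("http://127.0.0.1:8888")
print(response.getcode())   # 200 if the notebook answered through the tunnel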

Installing IPython

Intro

If you were at yesterday’s meeting, you’ve seen my enthusiasm for the ipython HTML notebook. This is a short post on how to get it set up.

Windows

iPython

Download the zip file from the iPython release archive. Pick the latest release (the biggest version number). Unzip it, and from inside the unzipped directory run

python setup.py install

Watch it complain about how you don’t have setuptools. Install setuptools as below and try again.

Setuptools

If you don’t have Python setuptools yet, go get it from PyPI, the Python Package Index.

Tornado

Now that you have setuptools:

python -m easy_install tornado

PyZMQ

python -m easy_install pyzmq

Linux (but not the cluster)

On Fedora, sudo yum install ipython. Probably sudo apt-get install ipython on Ubuntu, but let me know in the comments if I’m wrong about that and I’ll update the post. You might also need to install tornado and pyzmq, but I don’t recall having to do this. (I did my Fedora install a while ago.)

Starting the iPython HTML Notebook

ipython notebook --pylab=inline

A web browser should pop up. Click on “new notebook.”

Python Data Analysis Part 2: Pandas / Matplotlib Live Demo

This post is part of our series on Python.  Part 1 came in three parts: a, b, and c.  We also have a post on setting up python.  And on how to write a python job submission script. For all python fun, search “python” in the search box on the right.

Transcript

At yesterday’s group meeting, I promised a transcript of the live demo. I’m pasting it below with light edits for formatting.

  1 import pandas
  2 import matplotlib
  3 metrics = pandas.read_table("metrics.txt")
  4 metrics
  5 metrics[0:40]
  6 metrics["roundeddown"] = 3000 * ( metrics["NFE"] // 3000)
  7 metrics[0:40]
  8 grouped = metrics.groupby(["model", "roundeddown"])
  9 mean = grouped.mean()
 10 mean
 11 mean.ix["response"]
 12 grouped = grouped["SBX+PM"]
 13 mean = grouped.mean()
 14 mean
 15 fig = pyplot.figure(figsize=(10,6))
 16 ax = fig.add_subplot(1,1,1)
 17 ax.plot(mean.ix["response"].index.values,
 18         mean.ix["response"].values, color = 'r')
 19 fig
 20 ax.plot(mean.ix["gasp"].index.values,
 21         mean.ix["gasp"].values, color = 'b')
 22 fig
 23 summary = pandas.DataFrame(
 24             data = {
 25                     "mean":grouped.mean(),
 26                     "tenth":grouped.quantile(0.1),
 27                     "ninetieth":grouped.quantile(0.9)
 28             }
 29           )
 30 summary
 31 fig = pyplot.figure(figsize=(10,12))
 32 ax = fig.add_subplot(2,1,1)
 33 index = summary.ix["gasp"].index.values
 34 mean = summary.ix["gasp"]["mean"].values
 35 tenth = summary.ix["gasp"]["tenth"].values
 36 ninetieth=summary.ix["gasp"]["ninetieth"].values
 37 ax.plot(index, mean,
 38         index, tenth,
 39         index, ninetieth,
 40         color='k')
 41 fig
 42 ax.fill_between(index,tenth,ninetieth,color='k',alpha=0.01)
 43 fig
 44 ax.fill_between(index,tenth,ninetieth,color='k',alpha=0.1)
 45 fig
 46 ax = fig.add_subplot(2,1,2)
 47 index = summary.ix["response"].index.values
 48 mean = summary.ix["response"]["mean"].values
 49 tenth = summary.ix["response"]["tenth"].values
 50 ninetieth=summary.ix["response"]["ninetieth"].values
 51 ax.plot(index, mean,
 52         index, tenth,
 53         index, ninetieth,
 54         color='k')
 55 fig
 56 ax.fill_between(index,tenth,ninetieth,color='k',alpha=0.1)
 57 fig
 58 for ax in fig.axes:
 59     ax.set_yscale("log")
 60     ax.set_ylim(0.0001, 1.0)
 61     ax.set_xlim(0,103000)
 62 fig

Description

The transcript picks up where your homework left off. We have 50 seeds worth of data for two models:

[figure: sbx — the per-seed SBX operator probability plot from Part 1c]

And we want to produce a summary plot:

[figure: SBX_filled — the same metric summarized with a shaded 10th–90th percentile band]

You should be able to use text-based ipython on the cluster without any additional setup. However, you’ll need to add from matplotlib import pyplot to the top before you start typing in code from the transcript. (The HTML notebook I used for the demo imports pyplot automatically.)
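Concretely, the first few lines you type into text-mode ipython would look something like this (just the transcript’s imports with the extra pyplot line on top):

from matplotlib import pyplot
import pandas
import matplotlib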

If you want to see your figures, you’ll have to take the somewhat cumbersome steps of saving them (fig.savefig("filename")), copying them to your local machine (scp user1234@hammer.rcc.psu.edu:workingdirectory/filename.png .), and then viewing them.

I used an ipython HTML notebook for the demo. I’ll be making two posts about how to do this: one about how I did it for the demo, and another about installing ipython on your desktop.

Using linux “split”

Today I’d like to quickly talk about the linux command “split”.  I like writing about new simple Linux commands as evidenced here and here.

I often write customized C++ scripts to manipulate large data files.  There’s obviously a time and place for this, since you get ultimate control over every aspect of how your data looks going in and coming out.  We’ve written about this before, and I think string processing is an important skill no matter what language you use.  There’s a post about matlab (and another one here), some sample bash scripting, and a post about python among other things.  You should also see Matt’s series on python data analysis, since I’m doing some shameless plugging!

Anyway… little did I know that something very complicated in C++ can be easily done in linux/unix with “split”!

To split a large file into smaller files of, say, 100 lines each, you use: “split -l 100 myLargerFile.txt”.  There are also options to change the names of the output files, and so forth.

Read the man page for split, and check out forum posts here and here to get on your way!

Using linux “grep”

grep allows you to find an expression in one or more files in a folder on Linux.  I find it useful for programming.  Say, for example, I want to look for the string “nrec” in a set of source code and header files.  Maybe “nrec” is a variable and I forgot where I declared it (if this sounds a little too specific to be merely an example, you’re right. This is what I’m having to do right this second!).  The grep command is:

grep -in "nrec" *.*

What this means is: search for the expression “nrec” in every file in the folder (strictly, every file whose name contains a dot, which covers the source and header files here).  There are two useful flags set here as well.  “i” means that the search is case insensitive (that is, NREC and NrEc and nrec are each treated as equal).  “n” means that grep will show the line number of each occurrence of the desired phrase.  There are other options that I’m not using, including “inverting” a search to find all lines that do NOT contain the phrase, suppressing the file name or only showing the file name, etc.

If you were curious, here’s a sample of the output:

iras.h:144: int num_flow_datapoints; //originally: NRec
SimSysClass.cpp:806: flowrecs=sysstat(nrec)

(If you’re curious, the first instance is in a header file, on line 144.  I’m translating this code from one language to another, and originally the variable was called “nrec”. So in the header file I made a note that now my variable is called something else.  In the second instance, I had copied the original code into my file as a placeholder, so now I know that I need to use my new name in its place.  Also, the “i” flag in grep is helpful since fortran is not case-sensitive, and here you can see there were two different case styles for this variable even in our simple example.)

For more info, please consult some casual reference such as this excellent post about linux command line utilities,  a similar blog post about grep, and of course the Linux man page for the command. Also look at 15 grep tips.  As usual, remember that “man [insert command here]” gives you all the low-down on each command you’d like to learn.

Thanks for reading and please comment with additional tips or questions!

Python Data Analysis Part 1c: Borg Runtime Metrics Plots

Borg Runtime Metrics

Have you done Part 1a yet?  How about Part 1b? Go do those first, and come back with metrics.txt, which has all of the data in a single file.  Afterwards, please check out Part 2.

We’re interested in Borg’s runtime metrics because they tell us interesting things about the problem we’re optimizing. See the Borg paper for more details.

The Script

Please consider this one released under the MIT license.

  1 import pandas
  2 import matplotlib
  3 import matplotlib.backends.backend_svg as svg
  4 import matplotlib.backends.backend_agg as agg
  5
  6 metrics = pandas.read_table('metrics.txt')
  7
  8 models = ["gasp", "response"]
  9 colors = ['b', 'r']
 10 seeds = range(50)
 11 toplot = [
 12     "Elapsed Time", "Population Size", "Archive Size",
 13     "GenerationalDistance", "AdditiveEpsilonIndicator",
 14     "SBX+PM", "DifferentialEvolution+PM",
 15     "PCX", "SPX", "UNDX", "UM"
 16     ]
 17 titles = {
 18     "GenerationalDistance": "Generational Distance",
 19     "AdditiveEpsilonIndicator": "Additive Epsilon Indicator",
 20     "SBX+PM": "Simulated Binary Crossover",
 21     "DifferentialEvolution+PM": "Differential Evolution",
 22     "PCX": "Parent Centric Crossover", "SPX": "Simplex Crossover",
 23     "UNDX": "Unimodal Normally Distributed Crossover",
 24     "UM": "Uniform Mutation"
 25     }
 26 filenames = {
 27     "Elapsed Time": "time", "Population Size": "popsize",
 28     "Archive Size": "archive",
 29     "GenerationalDistance": "gd",
 30     "AdditiveEpsilonIndicator":"aei",
 31     "SBX+PM": "sbx", "DifferentialEvolution+PM":"de",
 32     "PCX":"pcx", "SPX":"spx", "UNDX":"undx", "UM":"um"
 33     }
 34 axis_limits = {
 35     "SBX+PM": (0.0, 1.0), "DifferentialEvolution+PM": (0.0, 1.0),
 36     "PCX": (0.0, 1.0), "SPX": (0.0, 1.0), "UNDX": (0.0, 1.0),
 37     "UM": (0.0, 1.0)
 38     }
 39 for column in toplot:
 40     fig = matplotlib.figure.Figure(figsize=(10,6))
 41     svg.FigureCanvasSVG(fig) # for SVG
 42     # agg.FigureCanvasAgg(fig) # for PNG
 43     ax = fig.add_subplot(1,1,1)
 44     for model, color in zip(models, colors):
 45         for seed in seeds:
 46             filtered = metrics[(metrics['model'] == model) &
 47                                (metrics['seed'] == seed)]
 48             line = ax.plot(filtered['NFE'], filtered[column],
 49                     color=color)[0]
 50             line.set_label('_nolegend_')
 51         line.set_label({"gasp":"GASP","response":"RSM"}[model])
 52
 53     ax.set_xlim(0, 100000)
 54     limits = axis_limits.get(column, None)
 55     if limits:
 56         ax.set_ylim(limits[0], limits[1])
 57
 58     ax.legend(bbox_to_anchor=(1.0, 1.0))
 59     ax.set_title(titles.get(column, column))
 60     fig.savefig(filenames.get(column, column))

Reading in the Data

Remember all of the text manipulation we had to do in Part 1a to deal with our data? Pandas gives us a very convenient interface to our table of data that makes Part 1a look like banging rocks together.

Line 6 shows how I read in a table. No need to parse the header or tell Pandas that we used tabs as field separators. Pandas does all of that.

  6 metrics = pandas.read_table('metrics.txt')

metrics is now a pandas DataFrame object. Users of R will recognize the DataFrame concept. It’s essentially a table of data with named columns, where the data in each column are homogeneous (e.g. all floating-point numbers or all text).
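If the DataFrame concept is new to you, here’s a tiny made-up example (not the metrics data) showing named columns with homogeneous contents:

import pandas

df = pandas.DataFrame({"model": ["gasp", "response"], "NFE": [3000, 6111]})
print(df)          # a little two-column table
print(df.dtypes)   # NFE holds integers, model holds text (dtype "object")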

Setting Things Up

Line 8 identifies the models for which I expect to find data in metrics.txt, and Line 9 indicates what color to use for plotting, for each model. 'r' is red, and 'b' is blue. Line 10 identifies the seeds I expect to find in the data table. range(50) is a shorthand way of saying “integers 0 through 49”.

Lines 11 through 16 make a list of the columns I want to plot. Lines 17 through 25 set up a dictionary relating those column names to the names I want to use as plot titles. (Generational Distance instead of GenerationalDistance, for example). Likewise, Lines 26 through 33 make a dictionary relating column names to the names of the files where I want to save the plots, and Lines 34 through 38 specify Y-axis limits for the plots.

A dictionary (dict) in Python is an associative data structure. Every value stored in a dictionary is attached to a key, which is used to look it up. Wikipedia has a pretty good explanation.
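Here’s the idea in miniature, using a couple of the entries from titles above (just an illustration):

titles = {"GenerationalDistance": "Generational Distance",
          "SBX+PM": "Simulated Binary Crossover"}
print(titles["GenerationalDistance"])   # looks up the value stored under that key
print("UM" in titles)                   # False: this little dict only has two keys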

Making the Plots

Lines 39 through 60 make a plot for each of the columns we specified in toplot and save it to a file.

Setting up the Axes

When using Matplotlib, the axes object provides most of the plotting interface. Lines 40 through 43 set up an axes instance.

 40     fig = matplotlib.figure.Figure(figsize=(10,6))
 41     svg.FigureCanvasSVG(fig) # for SVG
 42     # agg.FigureCanvasAgg(fig) # for PNG
 43     ax = fig.add_subplot(1,1,1)

An axes object belongs to a figure. Line 40 sets up a figure, specifying that it should be 10 inches by 6 inches (this is a nominal size since we’re dealing with screen graphics).

Line 41 creates a canvas, which is a backend for drawing. This is not what the Matplotlib tutorials would have you do. They use pyplot.figure() to create a figure, which then keeps a whole bunch of state in the background. The pyplot approach is apparently designed to ease the transition for Matlab users. Since I’m not a Matlab user it just seems weird to me, so I create a canvas explicitly. Commenting out Line 41 and uncommenting Line 42 switches the output from SVG to PNG.
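For comparison, here’s roughly what the pyplot route looks like (a sketch of the tutorial style, not what this script does):

from matplotlib import pyplot

fig = pyplot.figure(figsize=(10, 6))   # pyplot quietly keeps track of this figure
ax = fig.add_subplot(1, 1, 1)
ax.plot([0, 1, 2], [0, 1, 4])
fig.savefig("example.png")             # the file extension picks the output format
pyplot.close(fig)                      # tell pyplot to let go of the figure

Either way you end up with the same Figure and Axes objects; the difference is just who keeps track of the figure behind the scenes.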

Line 43 creates the axes object for plotting the data in a column. add_subplot tells a figure where to put the axes. Matplotlib is designed from the ground up to support figures with multiple plots in them. The arguments 1, 1, 1 tell the figure it has a 1×1 array of subplots, and the one we’re interested in is in the first position.

Plotting the Runtime Metrics

Lines 44 through 51 do the plotting:

 44     for model, color in zip(models, colors):
 45         for seed in seeds:
 46             filtered = metrics[(metrics['model'] == model) &
 47                                (metrics['seed'] == seed)]
 48             line = ax.plot(filtered['NFE'], filtered[column],
 49                     color=color)[0]
 50             line.set_label('_nolegend_')
 51         line.set_label({"gasp":"GASP","response":"RSM"}[model])

The call to zip packs up models and colors into a list of tuples (pairs). It would look like this if you were to create it explicitly:

[("gasp", "b"), ("response", "r")]

So on the first iteration through the for loop, model is “gasp” and color is “b”. Then on the second iteration, model is “response” and color is “r”.

Line 45 opens a for loop that iterates through all 50 seeds (range(50) if you’ll recall.)

Lines 46 and 47 use a few excellent Pandas features. This bit:

metrics["model"] == model

returns an array of booleans (True/False values), one entry per row, that is True wherever the condition (equality with model) holds. The indices of that array correspond to the rows in the metrics table.

(metrics['model'] == model) & (metrics['seed'] == seed)

is an array of booleans where both conditions are true.
Pandas lets you filter a table using an array of booleans, so

metrics[(metrics['model'] == model) & (metrics['seed'] == seed)]

is the subset of rows in the full table where the model and seed are as specified. Pause for a second and think about how you would do that if you didn’t have Pandas.
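For contrast, here’s a rough sketch of the same filter without Pandas, using a few made-up rows stored as plain dictionaries (hypothetical data, just to show the bookkeeping you’d be doing by hand):

rows = [{"model": "gasp", "seed": 0, "NFE": 3000, "SBX+PM": 0.617},
        {"model": "gasp", "seed": 1, "NFE": 3000, "SBX+PM": 0.640},
        {"model": "response", "seed": 0, "NFE": 3001, "SBX+PM": 0.086}]
model, seed = "gasp", 0
filtered = [row for row in rows
            if row["model"] == model and row["seed"] == seed]
print(filtered)   # only the gasp, seed-0 row survives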

Lines 48 and 49 then call the plotting routine itself. It puts NFE on the X axis and whichever metric you’re plotting on the Y axis, and it uses the color we specified. The plot method returns a list of the lines created (it can make more than one at a time, although we aren’t doing that here). So the subscript [0] at the end means that we assign to line the single line created by our call to plot, rather than a 1-element list containing that line.

Line 50 excludes every line from the legend, since we don’t want 100 items in the legend (50 seeds times two models). Line 51 then selectively enables the legend for one line from each model (and it doesn’t matter which one, because they all look the same). Note that Line 51 is outside the loop that starts on Line 45, so it only gets executed twice.
This part:

{"gasp":"GASP","response":"RSM"}[model]

is probably just me getting excessively clever. I could have written this instead:

if model == "gasp":
    line.set_label("GASP")
else:
    line.set_label("RSM")

But I got lazy.

Wrapping Up

Lines 53 through 60 fix up the plotting limits, add a title and a legend, and write the figure out to a file.

 53     ax.set_xlim(0, 100000)
 54     limits = axis_limits.get(column, None)
 55     if limits:
 56         ax.set_ylim(limits[0], limits[1])
 57
 58     ax.legend(bbox_to_anchor=(1.0, 1.0))
 59     ax.set_title(titles.get(column, column))
 60     fig.savefig(filenames.get(column, column))

I already know that I ran 100,000 function evaluations, so I set the X-axis limits accordingly for each plot on Line 53. Line 54 checks to see if I wanted to set Y-axis limits for the column I’m plotting. I only want to do this for the operator probabilities, because I want to scale them all between 0 and 1. (See Lines 34 through 38.) This means that some of the columns don’t have limits specified. Ordinarily, asking a dictionary for the value associated with a key that’s not in the dictionary raises an exception. However, I’m using the get method with a default of None, so if I didn’t specify any limits in axis_limits, limits gets set to None. Line 55 tests whether that is the case, so Line 56 sets the Y-axis limits only if I actually specified limits when I declared axis_limits.
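Here’s the get behavior in isolation (just an illustration with one entry from axis_limits):

axis_limits = {"SBX+PM": (0.0, 1.0)}
print(axis_limits.get("SBX+PM", None))        # (0.0, 1.0)
print(axis_limits.get("Elapsed Time", None))  # None, and no exception raised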

Line 58 makes a legend and puts it where I specify (bbox_to_anchor). Line 59 sets the title for the plot, and Line 60 writes it to a file. Note that the filenames specified on Lines 26 to 33 don’t include extensions (.svg or .png). Matplotlib decides which one to use based on whether we chose the SVG canvas or the Agg canvas on Lines 41 and 42.

The Result

If you run this script, these files should appear in your directory.

aei.svg  
archive.svg  
de.svg  
gd.svg  
pcx.svg  
popsize.svg  
sbx.svg  
spx.svg  
time.svg  
um.svg  
undx.svg

Here’s what sbx.svg looks like.

[figure: sbx.svg — Simulated Binary Crossover probability versus NFE, one line per seed for each model]

(WordPress doesn’t accept SVGs, so this is actually a PNG rasterization.)

Your Homework

If you’re in the Pat Reed group, please give this (and Part 1a) a try before our meeting on the 13th. If you have trouble, please post comments to the blog so I can clarify in public.

After all this, please proceed to part 2!

Python Data Analysis Part 1b: Setting up Matplotlib and Pandas

Installing Matplotlib and Pandas

So it turns out my two-part series of blog posts is going to be three parts, at least. This one is about getting and installing Python, Matplotlib, and Pandas. Skip down to the bottom for the best news of all: we now get these for free on Penn State’s HPC systems!

Windows

Python

Get Python from the official download site. Pick the 32-bit installer.

Make sure your environment variables are set up correctly, too. Find your Advanced System Settings and click on Environment Variables. Your PATH should include the directory where you installed Python, as well as the scripts subdirectory. On one of the machines I use, I put Python in d:\python27_32, so this is what I added to my PATH:

d:\python27_32;d:\python27_32\scripts

Why 32-bit?

Pandas depends on NumPy, and NumPy only has a 32-bit Windows version. (I think you have to compile blas yourself if you want 64-bit support. That builds character, but it’s way outside the scope of this tutorial.) So even if you have a 64-bit machine, which you probably do because it’s 2013, and a 64-bit version of Windows, which you might not because it’s 2013 and 32-bit Windows XP is still installed on everything, you need all your Python things to be 32-bit. That includes your Python interpreter, so make sure you have the right one installed. You should see something like this when you type python at the command prompt:

C:\>python
Python 2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)] on win
32
Type "help", "copyright", "credits" or "license" for more information.
>>>

You don’t have to be using Python version 2.7.3, but my code examples are based on Python 2 and not Python 3, so make sure you’re on a relatively modern release of Python 2.

Matplotlib

Download Matplotlib here. Remember to get the 32-bit version.

NumPy

Go to the NumPy downloads page and get the latest win32 superpack. As of this writing, it’s numpy-1.6.2-win32-superpack-python2.7.exe.

Pandas

Get the latest Win32 installer from the official download page. As of this writing, it’s pandas-0.10.1.win32-py2.7.exe.

If you have problems with this version, let me know. I’ve been using 0.10.0 and haven’t upgraded yet. Pandas is still below version 1.0, so Wes McKinney is under no obligation to keep things from breaking between versions.
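If you’re not sure which version you ended up with, you can check from the Python prompt (this series was written against the 0.10.x line):

import pandas
print(pandas.__version__)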

Linux

Most major distributions have Python, Matplotlib, NumPy, and Pandas neatly packaged up for you.  Packages are usually named python, python-matplotlib, python-numpy, and python-pandas, or something like that. You may need a bit of googling to find the right names for your distribution’s package repositories.

I ran into an issue on one of my laptops at home running a Debian derivative (Crunchbang) where the supplied NumPy was too old. If you have a stern constitution, get it from github and build your own.

On the Cluster

I was getting ready to write a whole mini-tutorial on getting source tarballs and building everything from scratch, because that’s what I had to do six months ago to get these packages on the cluster. However, I just discovered that some intelligent, farsighted individual has added them to the default Python 2.7 configuration. Here’s all you have to do to get going on the cluster:

module load python/2.7.3

Stick it in your .bashrc, and you’re ready to go!

Python Data Analysis Part 1a: Borg Runtime Metrics Plots (Preparing the Data)

Introduction

Welcome to a series of posts discussing making Borg runtime metrics plots using Python, with Pandas and Matplotlib.  To set up Python, see this post.  The first post is broken into three parts.  This part sets up the data, whereas the next part sets up the libraries, and the final part puts it all together.  There is also a second companion post going through a “live demo” of how to do everything.  Also search “python” on the right side for other blog posts relating to using Python!

Generating Runtime Metrics Data

First, I’ll assume that you’ve collected metrics during your optimization runs.

Here’s what I had at the end of my main function (I had used Java MOEAFramework for optimization).

// write the runtime dynamics 
Accumulator accumulator = instrumenter.getLastAccumulator();

String[] metrics = {
        "NFE",
        "Elapsed Time",
        "Population Size",
        "Archive Size",
        "CV",
        "GenerationalDistance",
        "AdditiveEpsilonIndicator",
        "SBX+PM",
        "UM",
        "DifferentialEvolution+PM",
        "PCX",
        "SPX",
        "UNDX"
};

for(String metric : metrics) {
    System.out.print(metric);
    System.out.print("\t");
}
System.out.print("\n");
for(int ii=0; ii<accumulator.size("NFE"); ii++) {
    for(String metric : metrics) {
        System.out.print(accumulator.get(metric, ii));
        System.out.print("\t");
    }
    System.out.print("\n");
}

Resulting Data Files

Here’s a sample of what was inside the resulting data files. They’re tab-delimited text, and unfortunately our WordPress theme will only let you see the first few columns, but you get the idea.

NFE Elapsed Time Population Size Archive Size CV Generational Distance Additive Epsilon Indicator SBX+PM UM Differential Evolution+PM PCX SPX UNDX
3000 77.359 308 93 0.0 0.048 0.639 0.617 0.021 0.042 0.148 0.148 0.021
6111 151.339 900 244 0.011 0.023 0.589 0.543 0.005 0.021 0.059 0.365 0.005
9111 222.376 1216 275 0.0 0.017 0.572 0.544 0.009 0.009 0.127 0.303 0.004
12112 293.615 952 310 0.0 0.013 0.578 0.494 0.003 0.003 0.083 0.411 0.003

Merging the Data

For my study, I did 50 optimization runs for each of two versions of my model. (GASP and a response surface model, or RSM.) This means I don’t have one metrics file, I have 50. Here’s what my directory listing looks like.

metrics_gasp_0.txt   metrics_gasp_40.txt      metrics_response_26.txt
metrics_gasp_10.txt  metrics_gasp_41.txt      metrics_response_27.txt
metrics_gasp_11.txt  metrics_gasp_42.txt      metrics_response_28.txt
metrics_gasp_12.txt  metrics_gasp_43.txt      metrics_response_29.txt
metrics_gasp_13.txt  metrics_gasp_44.txt      metrics_response_2.txt
metrics_gasp_14.txt  metrics_gasp_45.txt      metrics_response_30.txt
metrics_gasp_15.txt  metrics_gasp_46.txt      metrics_response_31.txt
metrics_gasp_16.txt  metrics_gasp_47.txt      metrics_response_32.txt
metrics_gasp_17.txt  metrics_gasp_48.txt      metrics_response_33.txt
metrics_gasp_18.txt  metrics_gasp_49.txt      metrics_response_34.txt
metrics_gasp_19.txt  metrics_gasp_4.txt       metrics_response_35.txt
metrics_gasp_1.txt   metrics_gasp_5.txt       metrics_response_36.txt
metrics_gasp_20.txt  metrics_gasp_6.txt       metrics_response_37.txt
metrics_gasp_21.txt  metrics_gasp_7.txt       metrics_response_38.txt
metrics_gasp_22.txt  metrics_gasp_8.txt       metrics_response_39.txt
metrics_gasp_23.txt  metrics_gasp_9.txt       metrics_response_3.txt
metrics_gasp_24.txt  metrics_response_0.txt   metrics_response_40.txt
metrics_gasp_25.txt  metrics_response_10.txt  metrics_response_41.txt
metrics_gasp_26.txt  metrics_response_11.txt  metrics_response_42.txt
metrics_gasp_27.txt  metrics_response_12.txt  metrics_response_43.txt
metrics_gasp_28.txt  metrics_response_13.txt  metrics_response_44.txt
metrics_gasp_29.txt  metrics_response_14.txt  metrics_response_45.txt
metrics_gasp_2.txt   metrics_response_15.txt  metrics_response_46.txt
metrics_gasp_30.txt  metrics_response_16.txt  metrics_response_47.txt
metrics_gasp_31.txt  metrics_response_17.txt  metrics_response_48.txt
metrics_gasp_32.txt  metrics_response_18.txt  metrics_response_49.txt
metrics_gasp_33.txt  metrics_response_19.txt  metrics_response_4.txt
metrics_gasp_34.txt  metrics_response_1.txt   metrics_response_5.txt
metrics_gasp_35.txt  metrics_response_20.txt  metrics_response_6.txt
metrics_gasp_36.txt  metrics_response_21.txt  metrics_response_7.txt
metrics_gasp_37.txt  metrics_response_22.txt  metrics_response_8.txt
metrics_gasp_38.txt  metrics_response_23.txt  metrics_response_9.txt
metrics_gasp_39.txt  metrics_response_24.txt
metrics_gasp_3.txt   metrics_response_25.txt

Python Data-Merging Script

To plot everything, it’s most convenient for me if all the data are in a single file.  There are many ways to combine all 100 files together, and even the Unix shell-scripting version is pretty straightforward. But since this post is about data analysis in Python anyway, I’ll give you a Python version.

1   def append_data(accumulator, model, seed):
2       filename = "metrics_{0}_{1}.txt".format(model, seed)
3       with open(filename, 'rb') as metrics:
4           header = metrics.readline().strip()
5   
6           for line in metrics:
7               line = line.strip()
8               line = "{0}\t{1}\t{2}\n".format(
9                       model, seed, line)
10              accumulator.append(line)
11
12      return header
13
14  models = ("response", "gasp")
15  seeds = range(50)
16  accumulator = []
17  for model in models:
18      for seed in seeds:
19          header = append_data(accumulator, model, seed)
20
21  with open("metrics.txt", 'wb') as accumulated:
22      header = "{0}\t{1}\t{2}\n".format(
23                      "model", "seed", header)
24      accumulated.write(header)
25      for line in accumulator:
26          accumulated.write(line)

This is a bit of a throwaway script (consider it released under the MIT license. Go do what you like with it.) It treats the data in the individual files like text, rather than converting to floating-point numbers and back again. It gathers every line from the individual data files, prepending the model and seed number, then writes them all back out as one file.

Let’s walk through the details…

Merging Loop

14  models = ("response", "gasp")
15  seeds = range(50)
16  accumulator = []
17  for model in models:
18      for seed in seeds:
19          header = append_data(accumulator, model, seed)

The action starts on line 14. This bit takes a list called accumulator and stuffs every line of every metrics file into it, except for the headers. The append_data function returns the header separately, and since I’m being lazy, I assume the header is always the same and let it get overwritten each time. When the loop exits, header is actually just the header (i.e. first line) of the last file.

Now, I’m not assuming you know Python, so I’ll make a couple of notes about the syntax here.

  • Parentheses, like on line 14, make a tuple object. This is like a list, but generally a tuple is short and immutable, while a list (made with brackets [ ]) is any length and mutable. (A list in Python is a general-purpose ordered collection of data.) There’s a short example of the difference right after this list.
  • Indentation is meaningful to the Python interpreter. Everything inside a for loop is at a deeper level of indentation, and the loop ends when a shallower level of indentation (or the end of the script) is reached. So lines 17-19 define a nested for loop that covers all 50 seeds for both models.
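Here’s the tuple/list distinction from the first bullet in miniature (just an illustration):

models = ("response", "gasp")     # a tuple: short, fixed, and immutable
accumulator = []                  # a list: starts empty and grows as we append
accumulator.append("some line of text\n")
print(len(accumulator))           # 1
# models[0] = "other"             # would raise a TypeError, because tuples can't be modified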

Function to append data

1   def append_data(accumulator, model, seed):
2       filename = "metrics_{0}_{1}.txt".format(model, seed)
3       with open(filename, 'rb') as metrics:
4           header = metrics.readline().strip()
5   
6           for line in metrics:
7               line = line.strip()
8               line = "{0}\t{1}\t{2}\n".format(
9                       model, seed, line)
10              accumulator.append(line)
11
12      return header

The append_data function appends to accumulator (a list) every line in the file identified by model and seed. Line 2 puts together a file name based on the model and seed number.

The with block starting on line 3 and ending on line 10 is a way of automatically releasing the metrics file when I’m done with it. The call to open gives us a file object I’m calling metrics. Because I used a with block, when the block ends, the file gets closed automatically. It’s an easy way to prevent your program from leaking open files.

On Line 4, I read the first line of the metrics file, its header, which will get returned at the end of the function. That’s not the main thing this function does, though. Lines 6-10 loop through the remaining lines in the file, prepend model and seed number to them, then append them to the accumulator list.

Writing out the data

21  with open("metrics.txt", 'wb') as accumulated:
22      header = "{0}\t{1}\t{2}\n".format(
23                      "model", "seed", header)
24      accumulated.write(header)
25      for line in accumulator:
26          accumulated.write(line)

Here I use another with block to take care of the file I’m writing data out to. Lines 22-23 create the header by prepending “model” and “seed” to it, then Line 24 writes it to the output file. Lines 25 and 26 loop through the accumulator list and write each line out to metrics.txt.

Comment

I want to repeat that, although the metrics files were full of numbers, I treated them like text here. All this merging script does is add some more text (model and seed number) to each line before writing the same text it read in right back out to a file. Once I get to plotting things, I’ll need to treat numbers like numbers, which is where pandas comes in.

Resulting Data File

Here’s a sample of what the resulting data file (metrics.txt) looks like:

model seed NFE Elapsed Time Population Size Archive Size CV Generational Distance Additive Epsilon Indicator SBX + PM UM Differential Evolution + PM PCX SPX UNDX
response 0 3001 2.852 184 57 0.0 0.032 0.863 0.086 0.017 0.017 0.172 0.672 0.034
response 0 6001 5.163 368 123 0.0 0.020 0.785 0.223 0.008 0.016 0.074 0.669 0.008
response 0 9001 7.581 512 151 0.0 0.017 0.764 0.178 0.006 0.046 0.026 0.735 0.006
response 0 12002 9.947 512 133 0.0 0.017 0.758 0.148 0.007 0.037 0.074 0.725 0.007

Play Along at Home!

If you’re lucky enough to be a member of the Pat Reed group, I’m making my metrics files available on our group Dropbox, in Presentations/py_data_analysis/metrics.zip. I encourage you to type in my code, save it as accumulate.py (or whatever name strikes your fancy) in the directory where you put the metrics files, and then run it by typing python accumulate.py. If you don’t have Python on your machine, you can run it on the cluster if you first load the Python module: module load python/2.7.3

(This is part of your homework for Wednesday February 13!)