Python Data Analysis Part 1a: Borg Runtime Metrics Plots (Preparing the Data)

Introduction

Welcome to a series of posts on making Borg runtime metrics plots in Python, with pandas and Matplotlib. To set up Python, see this post. The first post is broken into three parts: this part prepares the data, the next part sets up the libraries, and the final part puts it all together. There is also a companion post that walks through a “live demo” of the whole process. You can also search “python” on the right side of the page for other blog posts about using Python!

Generating Runtime Metrics Data

First, I’ll assume that you’ve collected metrics during your optimization runs.

Here’s what I had at the end of my main function (I used the Java MOEAFramework library for optimization).

// write the runtime dynamics 
Accumulator accumulator = instrumenter.getLastAccumulator();

String[] metrics = {
        "NFE",
        "Elapsed Time",
        "Population Size",
        "Archive Size",
        "CV",
        "GenerationalDistance",
        "AdditiveEpsilonIndicator",
        "SBX+PM",
        "UM",
        "DifferentialEvolution+PM",
        "PCX",
        "SPX",
        "UNDX"
};

for(String metric : metrics) {
    System.out.print(metric);
    System.out.print("\t");
}
System.out.print("\n");
for(int ii=0; ii<accumulator.size("NFE"); ii++) {
    for(String metric : metrics) {
        System.out.print(accumulator.get(metric, ii));
        System.out.print("\t");
    }
    System.out.print("\n");
}
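
If your metrics come from somewhere other than a MOEAFramework Accumulator, any tab-delimited dump with the same layout will do. Here is a minimal Python sketch of the same idea; the column names and rows are just illustrative values, and metrics_demo.txt is a made-up file name:

```python
# Minimal sketch: write a tab-delimited metrics file like the Java code above.
# The columns and rows here are illustrative, not real optimization output.
columns = ["NFE", "Elapsed Time", "Population Size"]
rows = [
    [3000, 77.359, 308],
    [6111, 151.339, 900],
]

with open("metrics_demo.txt", "w") as out:
    out.write("\t".join(columns) + "\n")          # header row
    for row in rows:
        # join the values with tabs, one row per line
        out.write("\t".join(str(value) for value in row) + "\n")
```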

Resulting Data Files

Here’s a sample of what was inside the resulting data files. They’re tab-delimited text, and unfortunately our WordPress theme will only let you see the first few columns, but you get the idea.

NFE Elapsed Time Population Size Archive Size CV Generational Distance Additive Epsilon Indicator SBX+PM UM Differential Evolution+PM PCX SPX UNDX
3000 77.359 308 93 0.0 0.048 0.639 0.617 0.021 0.042 0.148 0.148 0.021
6111 151.339 900 244 0.011 0.023 0.589 0.543 0.005 0.021 0.059 0.365 0.005
9111 222.376 1216 275 0.0 0.017 0.572 0.544 0.009 0.009 0.127 0.303 0.004
12112 293.615 952 310 0.0 0.013 0.578 0.494 0.003 0.003 0.083 0.411 0.003

Merging the Data

For my study, I did 50 optimization runs for each of two versions of my model (GASP and a response surface model, or RSM). This means I don’t have one metrics file; I have 100. Here’s what my directory listing looks like.

metrics_gasp_0.txt   metrics_gasp_40.txt      metrics_response_26.txt
metrics_gasp_10.txt  metrics_gasp_41.txt      metrics_response_27.txt
metrics_gasp_11.txt  metrics_gasp_42.txt      metrics_response_28.txt
metrics_gasp_12.txt  metrics_gasp_43.txt      metrics_response_29.txt
metrics_gasp_13.txt  metrics_gasp_44.txt      metrics_response_2.txt
metrics_gasp_14.txt  metrics_gasp_45.txt      metrics_response_30.txt
metrics_gasp_15.txt  metrics_gasp_46.txt      metrics_response_31.txt
metrics_gasp_16.txt  metrics_gasp_47.txt      metrics_response_32.txt
metrics_gasp_17.txt  metrics_gasp_48.txt      metrics_response_33.txt
metrics_gasp_18.txt  metrics_gasp_49.txt      metrics_response_34.txt
metrics_gasp_19.txt  metrics_gasp_4.txt       metrics_response_35.txt
metrics_gasp_1.txt   metrics_gasp_5.txt       metrics_response_36.txt
metrics_gasp_20.txt  metrics_gasp_6.txt       metrics_response_37.txt
metrics_gasp_21.txt  metrics_gasp_7.txt       metrics_response_38.txt
metrics_gasp_22.txt  metrics_gasp_8.txt       metrics_response_39.txt
metrics_gasp_23.txt  metrics_gasp_9.txt       metrics_response_3.txt
metrics_gasp_24.txt  metrics_response_0.txt   metrics_response_40.txt
metrics_gasp_25.txt  metrics_response_10.txt  metrics_response_41.txt
metrics_gasp_26.txt  metrics_response_11.txt  metrics_response_42.txt
metrics_gasp_27.txt  metrics_response_12.txt  metrics_response_43.txt
metrics_gasp_28.txt  metrics_response_13.txt  metrics_response_44.txt
metrics_gasp_29.txt  metrics_response_14.txt  metrics_response_45.txt
metrics_gasp_2.txt   metrics_response_15.txt  metrics_response_46.txt
metrics_gasp_30.txt  metrics_response_16.txt  metrics_response_47.txt
metrics_gasp_31.txt  metrics_response_17.txt  metrics_response_48.txt
metrics_gasp_32.txt  metrics_response_18.txt  metrics_response_49.txt
metrics_gasp_33.txt  metrics_response_19.txt  metrics_response_4.txt
metrics_gasp_34.txt  metrics_response_1.txt   metrics_response_5.txt
metrics_gasp_35.txt  metrics_response_20.txt  metrics_response_6.txt
metrics_gasp_36.txt  metrics_response_21.txt  metrics_response_7.txt
metrics_gasp_37.txt  metrics_response_22.txt  metrics_response_8.txt
metrics_gasp_38.txt  metrics_response_23.txt  metrics_response_9.txt
metrics_gasp_39.txt  metrics_response_24.txt
metrics_gasp_3.txt   metrics_response_25.txt

Python Data-Merging Script

To plot everything, it’s most convenient for me if all the data are in a single file. There are many ways to combine all 100 files, and even the Unix shell-scripting version is pretty straightforward. But since this post is about data analysis in Python anyway, I’ll give you a Python version.

1   def append_data(accumulator, model, seed):
2       filename = "metrics_{0}_{1}.txt".format(model, seed)
3       with open(filename, 'rb') as metrics:
4           header = metrics.readline().strip()
5   
6           for line in metrics:
7               line = line.strip()
8               line = "{0}\t{1}\t{2}\n".format(
9                       model, seed, line)
10              accumulator.append(line)
11
12      return header
13
14  models = ("response", "gasp")
15  seeds = range(50)
16  accumulator = []
17  for model in models:
18      for seed in seeds:
19          header = append_data(accumulator, model, seed)
20
21  with open("metrics.txt", 'wb') as accumulated:
22      header = "{0}\t{1}\t{2}\n".format(
23                      "model", "seed", header)
24      accumulated.write(header)
25      for line in accumulator:
26          accumulated.write(line)

This is a bit of a throwaway script (consider it released under the MIT license; go do what you like with it). It treats the data in the individual files as text, rather than converting to floating-point numbers and back again. It gathers every line from the individual data files, prepends the model and seed number to each, then writes them all back out as one file.

Let’s walk through the details…

Merging Loop

14  models = ("response", "gasp")
15  seeds = range(50)
16  accumulator = []
17  for model in models:
18      for seed in seeds:
19          header = append_data(accumulator, model, seed)

The action starts on line 14. This bit takes a list called accumulator and stuffs every line of every metrics file into it, except for the headers. The append_data function returns the header separately, and since I’m being lazy, I assume the header is always the same and let it get overwritten each time. When the loop exits, header is actually just the header (i.e. first line) of the last file.

Now, I’m not assuming you know Python, so I’ll make a couple of notes about the syntax here.

  • Parentheses, like on line 14, make a tuple object. This is like a list, but generally a tuple is short and immutable, while a list (made with brackets [ ]) is any length and mutable. (A list in Python is a general-purpose ordered collection of data.)
  • Indentation is meaningful to the Python interpreter. Everything inside a for loop is at a deeper level of indentation, and the loop ends when a shallower level of indentation (or the end of the script) is reached. So lines 17-19 define a nested for loop that covers every seed for both models.
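
To make those two notes concrete, here’s a tiny sketch of the same looping pattern, using 3 seeds instead of 50 to keep the output short:

```python
# Tuples are short and immutable; lists are mutable and grow as you append.
# The nested loop visits every (model, seed) combination, just like the
# merging loop above.
models = ("response", "gasp")   # tuple, made with parentheses
pairs = []                      # list, made with brackets
for model in models:
    for seed in range(3):       # 3 seeds here instead of 50, for brevity
        pairs.append((model, seed))

print(len(pairs))  # 6: every seed for both models
```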

Function to append data

1   def append_data(accumulator, model, seed):
2       filename = "metrics_{0}_{1}.txt".format(model, seed)
3       with open(filename, 'rb') as metrics:
4           header = metrics.readline().strip()
5   
6           for line in metrics:
7               line = line.strip()
8               line = "{0}\t{1}\t{2}\n".format(
9                       model, seed, line)
10              accumulator.append(line)
11
12      return header

The append_data function appends to accumulator (a list) every line in the file identified by model and seed. Line 2 puts together a file name based on the model and seed number.

The with block starting on line 3 and ending on line 10 is a way of automatically releasing the metrics file when I’m done with it. The call to open gives us a file object I’m calling metrics. Because I used a with block, when the block ends, the file gets closed automatically. It’s an easy way to prevent your program from leaking open files.
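
You can see the automatic cleanup for yourself with a quick sketch (with_demo.txt is just a throwaway file name for illustration):

```python
# A file object reports itself closed once its with block exits.
with open("with_demo.txt", "w") as handle:
    handle.write("hello\n")
    print(handle.closed)  # False: still open inside the block

print(handle.closed)  # True: closed automatically on exit
```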

On line 4, I read the first line of the metrics file, its header, which gets returned at the end of the function. That’s not the main thing this function does, though. Lines 6-10 loop through the remaining lines in the file, prepend the model and seed number to them, then append them to the accumulator list.
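
As an illustration of the prepending step on lines 7-9, here’s what happens to a single made-up data line:

```python
# One made-up data line, as it would come out of a metrics file.
model, seed = "gasp", 0
line = "3000\t77.359\t308\n"

line = line.strip()  # drop the trailing newline (and any stray whitespace)
line = "{0}\t{1}\t{2}\n".format(model, seed, line)

print(repr(line))  # 'gasp\t0\t3000\t77.359\t308\n'
```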

Writing out the data

21  with open("metrics.txt", 'wb') as accumulated:
22      header = "{0}\t{1}\t{2}\n".format(
23                      "model", "seed", header)
24      accumulated.write(header)
25      for line in accumulator:
26          accumulated.write(line)

Here I use another with block to take care of the file I’m writing data out to. Lines 22-23 create the header by prepending “model” and “seed” to it, then line 24 writes it to the output file. Lines 25 and 26 loop through the accumulator list and write each line out to the merged metrics file.

Comment

I want to repeat that, although the metrics files were full of numbers, I treated them like text here. All this merging script does is add some more text (model and seed number) to each line before writing the same text it read in right back out to a file. Once I get to plotting things, I’ll need to treat numbers like numbers, which is where pandas comes in.
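
As a preview of that step, here’s a minimal sketch of reading the merged file with pandas, with two rows like the sample in the next section inlined so the snippet stands alone:

```python
import io

import pandas

# Two rows of the merged file, inlined so the sketch is self-contained.
sample = (
    "model\tseed\tNFE\tElapsed Time\n"
    "response\t0\t3001\t2.852\n"
    "response\t0\t6001\t5.163\n"
)

# read_csv with sep="\t" parses the tab-delimited text and converts the
# numeric columns to actual numbers.
metrics = pandas.read_csv(io.StringIO(sample), sep="\t")
print(metrics["NFE"].max())  # 6001, as an integer rather than a string
```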

Resulting Data File

Here’s a sample of what the resulting data file (metrics.txt) looks like:

model seed NFE Elapsed Time Population Size Archive Size CV Generational Distance Additive Epsilon Indicator SBX+PM UM Differential Evolution+PM PCX SPX UNDX
response 0 3001 2.852 184 57 0.0 0.032 0.863 0.086 0.017 0.017 0.172 0.672 0.034
response 0 6001 5.163 368 123 0.0 0.020 0.785 0.223 0.008 0.016 0.074 0.669 0.008
response 0 9001 7.581 512 151 0.0 0.017 0.764 0.178 0.006 0.046 0.026 0.735 0.006
response 0 12002 9.947 512 133 0.0 0.017 0.758 0.148 0.007 0.037 0.074 0.725 0.007

Play Along at Home!

If you’re lucky enough to be a member of the Pat Reed group, I’m making my metrics files available on our group Dropbox, in Presentations/py_data_analysis/metrics.zip. I encourage you to type in my code, save it as accumulate.py (or whatever name strikes your fancy) in the directory where you put the metrics files, and then run it by typing python accumulate.py. If you don’t have Python on your machine, you can run it on the cluster if you first load the Python module: module load python/2.7.3

(This is part of your homework for Wednesday February 13!)

Comments

  3. Thanks for writing this up, Matt. I have two questions on this post, which deal more with my ignorance of Python than with data manipulation specifically:

    – In the append_data function, the “accumulator” list looks like it’s being passed by reference. Is this true for all Python variables, or only containers like lists/tuples/dictionaries?

    – When you open files, I get that the “r” and “w” flags are for read/write, but what does “b” do?

    Thanks.

    • Good questions!

      The whole pass-by-reference issue in Python can be confusing if you come from a C/C++ background. C++ especially does a whole lot of weird stuff under the hood to implement pass-by-value semantics for objects. This Stack Overflow thread does a pretty good job of clarifying the issue: http://stackoverflow.com/questions/986006/python-how-do-i-pass-a-variable-by-reference#986145

      What I’m doing with the accumulator list is conceptually the same as if I passed a reference to a vector in C++ and executed its push_back method a bunch of times.

      The ‘b’ flag puts the file into “binary” mode. Windows and Unix/Linux disagree about what character ends a line, and libraries that deal with text files add some compatibility magic to smooth things out for you. I want my files to have Unix line endings, so I put my files in binary mode to make sure that I’m not getting unexpected carriage return characters.
