Introduction
Welcome to a series of posts on making Borg runtime metrics plots in Python with pandas and Matplotlib. To set up Python, see this post. The first post is broken into three parts: this part sets up the data, the next part sets up the libraries, and the final part puts it all together. There is also a second companion post that walks through a “live demo” of how to do everything. Also, search “python” on the right side for other blog posts relating to using Python!
Generating Runtime Metrics Data
First, I’ll assume that you’ve collected metrics during your optimization runs.
Here’s what I had at the end of my main function (I had used Java MOEAFramework for optimization).
// write the runtime dynamics
Accumulator accumulator = instrumenter.getLastAccumulator();
String[] metrics = {
    "NFE", "Elapsed Time", "Population Size", "Archive Size", "CV",
    "GenerationalDistance", "AdditiveEpsilonIndicator",
    "SBX+PM", "UM", "DifferentialEvolution+PM", "PCX", "SPX", "UNDX"
};
for (String metric : metrics) {
    System.out.print(metric);
    System.out.print("\t");
}
System.out.print("\n");
for (int ii = 0; ii < accumulator.size("NFE"); ii++) {
    for (String metric : metrics) {
        System.out.print(accumulator.get(metric, ii));
        System.out.print("\t");
    }
    System.out.print("\n");
}
Resulting Data Files
Here’s a sample of what was inside the resulting data files. They’re tab-delimited text, and unfortunately our WordPress theme will only let you see the first few columns, but you get the idea.
NFE | Elapsed Time | Population Size | Archive Size | CV | Generational Distance | Additive Epsilon Indicator | SBX+PM | UM | Differential Evolution+PM | PCX | SPX | UNDX
3000 | 77.359 | 308 | 93 | 0.0 | 0.048 | 0.639 | 0.617 | 0.021 | 0.042 | 0.148 | 0.148 | 0.021
6111 | 151.339 | 900 | 244 | 0.011 | 0.023 | 0.589 | 0.543 | 0.005 | 0.021 | 0.059 | 0.365 | 0.005
9111 | 222.376 | 1216 | 275 | 0.0 | 0.017 | 0.572 | 0.544 | 0.009 | 0.009 | 0.127 | 0.303 | 0.004
12112 | 293.615 | 952 | 310 | 0.0 | 0.013 | 0.578 | 0.494 | 0.003 | 0.003 | 0.083 | 0.411 | 0.003
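As an aside, tab-delimited text like this is easy to parse programmatically even before pandas enters the picture. Here's a minimal sketch using the standard-library csv module over a couple of the rows above (the sample string is a stand-in for reading an actual metrics file):

```python
import csv
import io

# A couple of rows from the sample above, tab-delimited as the Java code writes them.
sample = (
    "NFE\tElapsed Time\tPopulation Size\n"
    "3000\t77.359\t308\n"
    "6111\t151.339\t900\n"
)

# DictReader splits each row on tabs and keys the values by the header names.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
rows = list(reader)

# Values come back as strings; convert when you need numbers.
nfe = [int(row["NFE"]) for row in rows]
print(nfe)  # [3000, 6111]
```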
Merging the Data
For my study, I did 50 optimization runs for each of two versions of my model (GASP and a response surface model, or RSM). This means I don't have one metrics file; I have 100. Here's what my directory listing looks like.
metrics_gasp_0.txt metrics_gasp_40.txt metrics_response_26.txt metrics_gasp_10.txt metrics_gasp_41.txt metrics_response_27.txt metrics_gasp_11.txt metrics_gasp_42.txt metrics_response_28.txt metrics_gasp_12.txt metrics_gasp_43.txt metrics_response_29.txt metrics_gasp_13.txt metrics_gasp_44.txt metrics_response_2.txt metrics_gasp_14.txt metrics_gasp_45.txt metrics_response_30.txt metrics_gasp_15.txt metrics_gasp_46.txt metrics_response_31.txt metrics_gasp_16.txt metrics_gasp_47.txt metrics_response_32.txt metrics_gasp_17.txt metrics_gasp_48.txt metrics_response_33.txt metrics_gasp_18.txt metrics_gasp_49.txt metrics_response_34.txt metrics_gasp_19.txt metrics_gasp_4.txt metrics_response_35.txt metrics_gasp_1.txt metrics_gasp_5.txt metrics_response_36.txt metrics_gasp_20.txt metrics_gasp_6.txt metrics_response_37.txt metrics_gasp_21.txt metrics_gasp_7.txt metrics_response_38.txt metrics_gasp_22.txt metrics_gasp_8.txt metrics_response_39.txt metrics_gasp_23.txt metrics_gasp_9.txt metrics_response_3.txt metrics_gasp_24.txt metrics_response_0.txt metrics_response_40.txt metrics_gasp_25.txt metrics_response_10.txt metrics_response_41.txt metrics_gasp_26.txt metrics_response_11.txt metrics_response_42.txt metrics_gasp_27.txt metrics_response_12.txt metrics_response_43.txt metrics_gasp_28.txt metrics_response_13.txt metrics_response_44.txt metrics_gasp_29.txt metrics_response_14.txt metrics_response_45.txt metrics_gasp_2.txt metrics_response_15.txt metrics_response_46.txt metrics_gasp_30.txt metrics_response_16.txt metrics_response_47.txt metrics_gasp_31.txt metrics_response_17.txt metrics_response_48.txt metrics_gasp_32.txt metrics_response_18.txt metrics_response_49.txt metrics_gasp_33.txt metrics_response_19.txt metrics_response_4.txt metrics_gasp_34.txt metrics_response_1.txt metrics_response_5.txt metrics_gasp_35.txt metrics_response_20.txt metrics_response_6.txt metrics_gasp_36.txt metrics_response_21.txt metrics_response_7.txt metrics_gasp_37.txt 
metrics_response_22.txt metrics_response_8.txt metrics_gasp_38.txt metrics_response_23.txt metrics_response_9.txt metrics_gasp_39.txt metrics_response_24.txt metrics_gasp_3.txt metrics_response_25.txt
Python Data-Merging Script
To plot everything, it’s most convenient for me if all the data are in a single file. There are many ways to combine all 100 files together, and even the Unix shell-scripting version is pretty straightforward. But since this post is about data analysis in Python anyway, I’ll give you a Python version.
 1 def append_data(accumulator, model, seed):
 2     filename = "metrics_{0}_{1}.txt".format(model, seed)
 3     with open(filename, 'rb') as metrics:
 4         header = metrics.readline().strip()
 5
 6         for line in metrics:
 7             line = line.strip()
 8             line = "{0}\t{1}\t{2}\n".format(
 9                 model, seed, line)
10             accumulator.append(line)
11
12     return header
13
14 models = ("response", "gasp")
15 seeds = range(50)
16 accumulator = []
17 for model in models:
18     for seed in seeds:
19         header = append_data(accumulator, model, seed)
20
21 with open("metrics.txt", 'wb') as accumulated:
22     header = "{0}\t{1}\t{2}\n".format(
23         "model", "seed", header)
24     accumulated.write(header)
25     for line in accumulator:
26         accumulated.write(line)
This is a bit of a throwaway script (consider it released under the MIT license; go do what you like with it). It treats the data in the individual files as text, rather than converting to floating-point numbers and back again. It gathers every line from the individual data files, prepending the model and seed number, then writes them all back out as one file.
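Incidentally, if your seed numbering ever has gaps, you could discover the files on disk instead of generating the names. Here's a sketch using the standard-library glob module (not part of the script above; it sets up a scratch directory with a couple of fake files so it runs on its own):

```python
import glob
import os
import re
import tempfile

# Work in a scratch directory with a few stand-in metrics files.
workdir = tempfile.mkdtemp()
for name in ("metrics_gasp_0.txt", "metrics_response_3.txt"):
    open(os.path.join(workdir, name), "w").close()

# Recover (model, seed) from each file name instead of assuming seeds 0..49.
runs = []
for path in sorted(glob.glob(os.path.join(workdir, "metrics_*_*.txt"))):
    match = re.match(r"metrics_([a-z]+)_(\d+)\.txt", os.path.basename(path))
    if match:
        runs.append((match.group(1), int(match.group(2))))

print(runs)  # [('gasp', 0), ('response', 3)]
```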
Let’s walk through the details…
Merging Loop
14 models = ("response", "gasp")
15 seeds = range(50)
16 accumulator = []
17 for model in models:
18     for seed in seeds:
19         header = append_data(accumulator, model, seed)
The action starts on line 14. This bit takes a list called accumulator and stuffs every line of every metrics file into it, except for the headers. The append_data function returns the header separately, and since I’m being lazy, I assume the header is always the same and let it get overwritten each time. When the loop exits, header is actually just the header (i.e. first line) of the last file.
Now, I’m not assuming you know Python, so I’ll make a couple of notes about the syntax here.
- Parentheses, like on line 14, make a tuple object. This is like a list, but generally a tuple is short and immutable, while a list (made with brackets [ ]) is any length and mutable. (A list in Python is a general-purpose ordered collection of data.)
- Indentation is meaningful to the Python interpreter. Everything inside a for loop is at a deeper level of indentation, and the loop ends when a shallower level of indentation (or the end of the script) is reached. So lines 17-19 define a nested for loop that covers every seed for every model.
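To see both points in action, here's a toy version of the loop (hypothetical names, not from the script above, and only 3 seeds for brevity):

```python
models = ("response", "gasp")   # tuple: parentheses, short, immutable
collected = []                  # list: brackets, mutable, any length

# The nested loop pairs every model with every seed.
for model in models:
    for seed in range(3):
        collected.append((model, seed))

# Trying to modify a tuple raises a TypeError.
try:
    models[0] = "other"
    mutable = True
except TypeError:
    mutable = False

print(len(collected), mutable)  # 6 False
```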
Function to append data
 1 def append_data(accumulator, model, seed):
 2     filename = "metrics_{0}_{1}.txt".format(model, seed)
 3     with open(filename, 'rb') as metrics:
 4         header = metrics.readline().strip()
 5
 6         for line in metrics:
 7             line = line.strip()
 8             line = "{0}\t{1}\t{2}\n".format(
 9                 model, seed, line)
10             accumulator.append(line)
11
12     return header
The append_data function appends to accumulator (a list) every line in the file identified by model and seed. Line 2 puts together a file name based on the model and seed number.
The with block starting on line 3 and ending on line 10 is a way of automatically releasing the metrics file when I’m done with it. The call to open gives us a file object I’m calling metrics. Because I used a with block, when the block ends, the file gets closed automatically. It’s an easy way to prevent your program from leaking open files.
On line 4, I read the first line of the metrics file, its header, which will get returned at the end of the function. That's not the main thing this function does, though. Lines 6-10 loop through the remaining lines in the file, prepend model and seed number to them, then append them to the accumulator list.
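As a side note, the with block is roughly shorthand for opening the file and closing it in a finally clause. Here's a sketch of the two forms side by side (using a hypothetical scratch file, not one of the metrics files):

```python
import os
import tempfile

# Write a small file the long way: try/finally guarantees the close.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
f = open(path, "w")
try:
    f.write("model\tseed\n")
finally:
    f.close()  # runs even if write() had raised

# The with block performs the same close automatically when the block ends.
with open(path) as metrics:
    header = metrics.readline().strip()

print(metrics.closed)  # True: the file is closed once the block exits
```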
Writing out the data
21 with open("metrics.txt", 'wb') as accumulated:
22     header = "{0}\t{1}\t{2}\n".format(
23         "model", "seed", header)
24     accumulated.write(header)
25     for line in accumulator:
26         accumulated.write(line)
Here I use another with block to take care of the file I’m writing data out to. Lines 22-23 create the header by prepending “model” and “seed” to it, then Line 24 writes it to the output file. Lines 25 and 26 loop through the accumulator list and write each line out to the metrics file.
Comment
I want to repeat that, although the metrics files were full of numbers, I treated them like text here. All this merging script does is add some more text (model and seed number) to each line before writing the same text it read in right back out to a file. Once I get to plotting things, I’ll need to treat numbers like numbers, which is where pandas comes in.
Resulting Data File
Here’s a sample of what the resulting data file (metrics.txt) looks like:
model | seed | NFE | Elapsed Time | Population Size | Archive Size | CV | Generational Distance | Additive Epsilon Indicator | SBX+PM | UM | Differential Evolution+PM | PCX | SPX | UNDX
response | 0 | 3001 | 2.852 | 184 | 57 | 0.0 | 0.032 | 0.863 | 0.086 | 0.017 | 0.017 | 0.172 | 0.672 | 0.034
response | 0 | 6001 | 5.163 | 368 | 123 | 0.0 | 0.020 | 0.785 | 0.223 | 0.008 | 0.016 | 0.074 | 0.669 | 0.008
response | 0 | 9001 | 7.581 | 512 | 151 | 0.0 | 0.017 | 0.764 | 0.178 | 0.006 | 0.046 | 0.026 | 0.735 | 0.006
response | 0 | 12002 | 9.947 | 512 | 133 | 0.0 | 0.017 | 0.758 | 0.148 | 0.007 | 0.037 | 0.074 | 0.725 | 0.007
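Before moving on to pandas, it's worth a quick sanity check that every run made it into the merged file, for example by counting rows per model. Here's a minimal sketch over a small stand-in list rather than the real metrics.txt (with 50 seeds each, the real file would show roughly equal counts for "response" and "gasp"):

```python
# A tiny stand-in for the merged metrics.txt, one string per line.
merged = [
    "model\tseed\tNFE\tElapsed Time\n",
    "response\t0\t3001\t2.852\n",
    "response\t0\t6001\t5.163\n",
    "gasp\t0\t3000\t77.359\n",
]

# Count rows per model; the model name is the first tab-delimited column.
counts = {}
for line in merged[1:]:          # skip the header line
    model = line.split("\t")[0]
    counts[model] = counts.get(model, 0) + 1

print(counts)  # {'response': 2, 'gasp': 1}
```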
Play Along at Home!
If you’re lucky enough to be a member of the Pat Reed group, I’m making my metrics files available on our group Dropbox, in Presentations/py_data_analysis/metrics.zip. I encourage you to type in my code, save it as accumulate.py (or whatever name strikes your fancy) in the directory where you put the metrics files, and then run it by typing python accumulate.py. If you don’t have Python on your machine, you can run it on the cluster if you first load the Python module: module load python/2.7.3
(This is part of your homework for Wednesday February 13!)
Thanks for writing this up Matt. I have two questions on this post, which deal more with my ignorance of Python than with data manipulation specifically:
– In the append_data function, the “accumulator” list looks like it’s being passed by reference. Is this true for all Python variables, or only containers like lists/tuples/dictionaries?
– When you open files, I get that the “r” and “w” flags are for read/write, but what does “b” do?
Thanks.
Good questions!
The whole pass-by-reference issue in Python can be confusing if you come from a C/C++ background. C++ especially does a whole lot of weird stuff under the hood to implement pass-by-value semantics for objects. This Stack Overflow thread does a pretty good job of clarifying the issue: http://stackoverflow.com/questions/986006/python-how-do-i-pass-a-variable-by-reference#986145
What I’m doing with the accumulator list is conceptually the same as if I passed a reference to a vector in C++ and executed its push_back method a bunch of times.
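A tiny demonstration of the distinction (Python passes object references by value, so this applies to every type): mutating the object a parameter refers to is visible to the caller, but rebinding the parameter name is not. The function names here are hypothetical, chosen to echo the script above:

```python
def append_data(accumulator):
    # Mutating the list the caller passed in: both names refer to the same
    # object, like calling push_back on a vector& in C++.
    accumulator.append("new line")

def rebind(accumulator):
    # Rebinding the local name does NOT affect the caller's list.
    accumulator = ["replaced"]

lines = []
append_data(lines)
rebind(lines)
print(lines)  # ['new line'] -- the append stuck, the rebind did not
```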
The ‘b’ flag puts the file into “binary” mode. Windows and Unix/Linux disagree about what character ends a line, and libraries that deal with text files add some compatibility magic to smooth things out for you. I want my files to have Unix line endings, so I put my files in binary mode to make sure that I’m not getting unexpected carriage return characters.