September 23, 2013 by Jon Herman

Method of Morris (Elementary Effects) using SALib

This post was updated on January 16, 2015 to correct a few errors and update the SALib module structure, and again in 2017.

The Sensitivity Analysis Library (SALib) is an open-source Python library for common sensitivity analysis routines, including the Sobol, Morris, and FAST methods. In 2017 it was published in the Journal of Open Source Software:

Herman, J. and Usher, W. (2017) SALib: An open-source Python library for sensitivity analysis. Journal of Open Source Software, 2(9).

This post describes how to use the command-line interface of the library to run the Method of Morris (also known as the Elementary Effects Method). The Github page gives an example of the Python interface. The default use case for SALib is to perform decoupled sensitivity analysis, i.e. the sampling and analysis steps are performed separately, and the model evaluations that occur in between are left to the user.

Step 0: Get the library

The easiest approach is to pip install SALib, which will pull the latest version from the Python package index. Or, you can download at the above link as a zip/tar file and run python setup.py install. If you’re interested in contributing, you can clone the git repository:

git clone https://github.com/SALib/SALib .

Step 1: Choose sampling bounds for your parameters

Create a simple text file with the form [parameter] [lower bound] [upper bound]. For example, such a file for the “Sobol G” test function might look like this:

x1 0.0 1.0
x2 0.0 1.0
x3 0.0 1.0
x4 0.0 1.0
x5 0.0 1.0
x6 0.0 1.0
x7 0.0 1.0
x8 0.0 1.0

The bounds are used to sample parameter values. The variable names will only appear in the printed output, and they will not affect the method itself. Let’s call this file params.txt.

Step 2: Generate parameter sets

Put your params.txt file in the same directory as the SALib folder. Move to this directory and type the following command:

python -m SALib.sample.morris -n 1000 -p params.txt -o my_samples_file.txt

The -n flag specifies the number of trajectories to generate. The -p flag specifies the parameter bounds file that you created in the first step. Finally, the -o flag tells the program where to output the matrix of parameter samples. A total of N(p+1) samples will be generated; in this case, N = 1000 and p = 8, leading to a total of 9000 model evaluations.

The sampling command also has two options that aren’t required, --num-levels and --grid-jump. By default, these are set to 4 and 2, respectively.

Step 3: Run the parameter sets through your model

The parameter sets are now saved in my_samples_file.txt. Run these parameter sets through your model, and save the output(s) to a text file. The output file should contain one row of output values for each model run. This process is performed outside of SALib, so the details will be language-specific. Be careful to read in your parameter sets in the same order they were sampled.

Step 4: Calculate Morris Indices

You now have the output values for all of your model runs saved to a file. For the sake of example, let’s call that file SGOutput.txt (the output from the Sobol G function). We need to send this information back to SALib to compute the sensitivity indices, using following command:

python -m SALib.analyze.morris -p params.txt -X my_samples_file.txt -Y SGOutput.txt -c 0

The options here are: the parameter file (-p), the file containing calculated outputs (-Y), and the column of the objective values file to read (-c). The columns are assumed to be zero-indexed; if you have calculated multiple objective values, you would continue on to -m 1, etc., repeating the same command as above. By default, this will print sensitivity indices to the command line. You may want to print them to a file using the “>” operator.

Step 5: Interpret your results

Say that you saved the indices from the above command into the file morrisIndices.txt. If you open this file in a text editor, it will look something like this:

Parameter Mu Sigma Mu_Star
x1 0.040674 2.573977 2.077549
x2 -0.113902 1.514879 1.109394
x3 -0.025667 0.620538 0.454424
x4 -0.001532 0.324167 0.229770
x5 -0.001736 0.032333 0.023077
x6 -0.000858 0.032265 0.022693
x7 -0.000976 0.032779 0.022949
x8 -0.000224 0.034278 0.024575

The parameter names will match the ones you specified in params.txt. The mean and variance of each parameter’s elementary effects are given by mu and sigma, respectively. Mu_star is the mean of the absolute values of the elementary effects, following Campolongo et al. (2007). This Mu_star value is the best approximation of “total” sensitivity provided by the Morris method. Note that these indices do not have a direct interpretation as an “attribution of variance”, like we saw in the example results from the Sobol method. Instead, they should be used to understand the ranking of the most sensitive parameters, and to provide an approximate quantification of sensitivity.

For a full description of available methods and options, please consult the readme in the Github repository or on the SALib website. Github users can also post issues or submit pull requests. Thanks for reading!

September 18, 2013 by Jon Herman

Git branch in bash prompt

I never seem to know what branch of a repository I’m on, and typing git branch gets tiresome. I found a neat trick to print the name of the current branch at the end of your shell prompt (if the current directory is not a git repository, then it’s just blank). The site I found this on (HalloweenBash) contains a lot of customizations, this is just one that I found useful.

First copy this short function into your .bashrc file:

function parse_git_branch {
   git branch --no-color 2> /dev/null | sed -e '/^[^*]/d' -e 's/* \(.*\)/(\1)/'
}

Then in your PS1 variable (which sets the prompt), put \$(parse_git_branch) where you want the branch name to appear. My full prompt looks like this:

PS1="\[\033[G\]\[\e]0;\w\a\]\n\[\e[32m\]\u \[\e[33m\]\w \[\e[36m\]\$(parse_git_branch)\[\e[0m\]\n\$ "

…but I confess to not knowing exactly what all the escape sequences do. The basics are, \u is your username, \h is the hostname (which I removed), and \w is the working directory. This PS1 makes a prompt that looks like this:

If you’re curious, you can check out my full .bashrc file here.

September 18, 2013 by matt

Announcing pareto.py: a free, open-source nondominated sorting utility

Jon Herman and I are pleased to announce version 0.2 of pareto.py, a free, open-source nondominated sorting utility. Released under the LGPL, pareto.py performs a nondominated sort on the data in any number of input files. To the best of our knowledge, this is the only standalone nondominated sorting utility available.

Implemented using only the Python standard library, pareto.py features:

epsilon-nondominated sorting
“importable” design: can be used from other Python scripts
tag-along variables: columns (even non-numeric ones) other than those to be sorted may be retained in the output
verbatim transcription of input: nondominated rows in the output appear exactly the same as they did in the input
optionally skip header rows, commented rows, and blank lines

We welcome feature requests, bug reports, and pull requests. If you’re in a hurry, here is the source code for version 0.2.

(Edit: bump to version 0.2)

September 16, 2013 by matt

Comparing Data Sets: Are Two Data Files the Same?

Jon and I were looking at nondominated sorting, and the question came up, “how do you validate a sorting routine?” You’ve got to compare the resulting data, but doing so is not straightforward. You can’t just diff the text files that come out, because there’s no guarantee everything comes out in the same order. Sorting and then diff-ing the files still doesn’t help, because output formatting may differ between sorting routines. So I wrote a Python script that evaluates whether two data tables are the same, to within a specified tolerance:

https://github.com/matthewjwoodruff/datacomparison

September 16, 2013 by JR Kasprzyk

Python for automating cluster tasks: Part 2, More advanced commands

This is part 2 in a series about using Python for automating cluster tasks. Part 1 is here. (For more on Python, check out: another tutorial part one and two, tips on setting up Python and Eclipse, and some specific examples including a cluster submission guide and a script that re-evaluates solutions using a different simulation model)

Edit: Added another example in the “copy” section below!

Welcome back! Let’s continue our discussion of basic Python commands. Let’s start by modifying our last code sample to facilitate random seed analysis. Now, instead of writing one file we will write 50 new files. This isn’t exactly how we’ll do the final product, but it will be helpful to introduce loops and some other string processing.

Loops and String Processing

import re
import os
import sys
import time
from subprocess import Popen
from subprocess import PIPE

def main():
    #the input filename and filestream are handled outside of the loop.
    #but the output filename and filestream have to occur inside the loop now.
    inFilename = "borg.c"
    inStream = open(inFilename, 'rb')

    for mySeed in range(1,51):
        outFilename = "borgNew.seed" + str(mySeed) + ".c"
        outStream = open(outFilename, 'w')

        print "Working on seed %s" % str(mySeed)

        for line in inStream:
            if "int mySeed" in line:
                newString = " int mySeed = " + str(mySeed) + ";\n"
                outStream.write(newString)
            else:
                outStream.write(line)
        outStream.close()
        inStream.seek(0) #reset the input file so you can read it again

if __name__ == "__main__":
    main()

Above, the range function allows us to iterate through a range of numbers. Note that the last member of the range is never included, so range(1,51) goes from 1 to 50. Also, now we have to be concerned with making sure our files are closed properly, and making sure that the input stream gets ‘reset’ every time. There may be a more efficient way to do this code, but sometimes it’s better to be more explicit to be sure that the code is doing exactly what you want it to. Also, if you had to rewrite multiple lines, it would be helpful to structure your loops the way I have them here.

By the way, after you run the sample program, you may want to do something like “rm BorgNew*” to remove all the files you just created.

Calling System Commands

Ok great, so now you can use Python to modify text files. What if you have to do something else in your workflow, such as copy files? Move them? Rename them? Call programs? Basically, you want your script to be able to do anything that you would do on the command line, or call system commands. For some background, check out this post on Stack Overflow, talking about the four or five different ways to call external commands in Python.

The code sample is below. Note that there’s two different ways to use the call command. Using “shell=True” allows you to have access to certain features of the shell such as the wildcard operator. But be careful with this! Accessing the shell directly can lead to problems as discussed here.

import re
import os
import sys
import time
from subprocess import Popen
from subprocess import PIPE
from subprocess import call

def main():

    print "Listing files..."
    call(["ls", "-l"])

    print "Showing the current working directory..."
    call(["pwd"])

    print "Now making ten copies of borg.c"
    for i in range(1,11):
        print "Working on file %s" % str(i)
        newFilename = "borgCopy." + str(i) + ".c"
        call(["cp", "borg.c", newFilename])
    print "All done copying!"

    print "Here's proof we did it.  Listing the directory..."
    call(["ls", "-l"])

    print "What a mess.  Let's clean up:"
    call("rm borgCopy*", shell=True)
    #the above is needed if you want to use a wildcard, see:
    #http://stackoverflow.com/questions/11025784/calling-rm-from-subprocess-using-wildcards-does-not-remove-the-files

    print "All done removing!"

if __name__ == "__main__":
    main()

You may also remember that there are multiple ways to call the system. You can use subprocess to, in a sense, open a shell and call your favorite Linux commands… or you can use Python’s os library to do some of the tasks directly. Here’s an example of how to create some directories and then copy the files into the directory. Thanks to Amy for helping to write some of this code:

import os
import shutil
import subprocess

print os.getcwd()
global src
src = "myFile.txt" #or whatever your file is called

for i in range(51, 53); #remember this will only do it for 51 and 52
   newFoldername = 'seed'+str(i)
   if not os.path.exists(newFoldername):
      os.mkdir(newFoldername)
   print "Listing files..."
   subprocess.call(["ls", "-l"])
   shutil.copy(src, newFoldername)
   #now, we should change to the new directory to see if the
   #copy worked correctly
   os.chdir(newFoldername)
   subprocess.call(["ls", "-l"])
   #make sure to change back
   os.chdir("..")

Conclusion

These two pieces of the puzzle should open up a lot of possibilities to you, as you’re setting up your jobs. Let us know if you want more by posting in the comments below!

September 13, 2013 by JR Kasprzyk

Python for automating cluster tasks: Part 1, Getting started

Yet another post in our discussion of Python. (To see more, check out: a tutorial part one and two, tips on setting up Python and Eclipse, and some specific examples including a cluster submission guide and a script that re-evaluates solutions using a different simulation model)

If you’re just getting into using MOEAs and simulation models together, you may have spent some time getting familiar with how to get the two technologies to “talk” to one another, and you may have solved a problem or two. But now you may realize, well there’s more to the process than just running a MOEA once. The following are some reasons why you may need to set up a “batch” submission of multiple MOEA runs:

1. Random Seed Analysis A MOEA is affected by the random number generator used to generate initial random solutions, generate random or synthetic input data, and perform variation operations to develop new solutions. Typically, a random seed analysis is used to test how robust the algorithm is to different sets of random numbers.

2. Diagnostic Analysis of Algorithm Performance MOEAs have parameters (population size, run duration, crossover probability, etc.). We’ve discussed methods to evaluate how well the algorithm does across a wide array of different parameter values (see, for example, this post).

3. Running Multiple Problem Formulations Perhaps you want to compare a 2 objective problem with a 10 objective problem. Or, you use different simulation models (a screening level model, a metamodel, all the way to a physics-based process model). All of these changes would require running the MOEA more than once.

4. Learning More about your Problem Even if you’re not trying 1-3, you may just need to run the MOEA for a problem that you’re still developing. You make an educated guess about the best set of objectives, decisions, and constraints, but by analyzing the results you could see that this needs to change. We use the term a posteriori to describe the process, because you don’t specify preferences between objectives, etc until after you generate alternatives. So it’s an interactive process after you start running the algorithm.

The manual approach

For this illustration, here are some assumptions on what you’re up to. First, you are using the Borg MOEA or MOEAframework, with a simulation model (either the source code, or you’ve written a wrapper to call it). You’ve set up one trial of the simulation-optimization process, maybe using some of the resources on this blog! You are using cluster computing and you have a terminal connection to the cluster all set up. And, python is installed on the cluster.

A typical workflow might look something like this. (Thanks to my student Amy Piscopo for being the “guinea pig” here). Some of these steps are kind of complicated, you may or may not have them.

1. Change the simulation or optimization source code. Typically we set up programming so you don’t need to actually change the code and re-compile if you’re making a simple change. For example, if there’s a really important parameter in your simulation code, you can pass it in as a command line argument. But, sometimes you can’t avoid this, so step one is that you might have to go into a text editor and modify code. (Example: “On line 2268, change “int seed = 1” to “int seed = 2”.

2. Compile the simulation or optimization source code. Any changes in source code must be compiled before you run the program. (Example: Write a command that compiles the files directly, such as in the Borg README, or use a makefile and the command “make”)

3. Modify a submission script. You thought your experiment was going to take 8 hours, and it really takes 24. Oops. Typically, when you create a “submission script” for the cluster, you need to tell it a few things about your job: what queue you want, how long to run, how many processors to request, and then the specific commands you need to run. Another thing to consider with multiple runs is that the command that you are specifying may actually change. (Example: Changing the submission script to say “borg.exe -s 2” instead of “borg.exe -s 1”)

4. Make multiple run folders, and copy all files into each folder. Ok, so you’ve made the necessary modifications for, say, “Seed 1” and “Seed 2”. Or, “Problem 1” and “Problem 2”. Now, you need to gather up all the files for your particular run and put them into their own folder. I usually recommend that you use a different folder for different runs, even if there are only a few files in the folder. It just helps keep things organized. (Example: Each seed has its own folder)

5. Submit the jobs! Whew. Too much work. After clicking and typing and hitting enter and clicking and drag and dropping, you’re finally ready to hit go.

If you’re exhausted by that list, I don’t blame you — so am I! This is a lot of manual steps, and the worst part is that the process won’t change hardly at all at seed 1, vs seed 50. Plus, if you do all this by hand, you are more likely to make a mistake (“Did I actually change seed 2 to seed 3? Oh gosh I don’t know”). The other thing that’s annoying about it is that you may need to make yet another change in the process later on. Instead of changing the seed, now you have to change some parameter!

Well, there’s another way…

Instead, use python!

That’s the point of this whole post. Let’s get started.

Learning Python syntax and starting your first script

How do you quickly get started learning this new language? First, log into the cluster and type “python”. You may need to load a module, consult the documentation for your own cluster. You’ll see a prompt that looks like this: “>>>” This is the python interpreter. You can type any command you want, and it will get executed right there, kind of like Matlab. So now you’re running Python! Saying it’s too hard to get started is no excuse. 🙂

Then, open the official Python tutorial, or a third party tutorial on the internet. I noticed the third party one has a Python interpreter right in the browser. Anyway, any time you see a new command, look it up in the tutorial! After a while you’ll be a Python expert.

One more comment in the “getting started” section. In a previous post the beginning of the script looked like this. The import commands load packages that have commands that you need. Any time you learn a new command, make sure you don’t need to also include a new import command at the beginning of the script.

import re
import os
import sys
import time
from subprocess import Popen
from subprocess import PIPE

And then, the main function is defined a little differently than you may be used to in Fortran or C. Something like this:

def main():

    #a bunch of code goes here

if __name__ == "__main__":
    main()

The default function, (“if __name__”) is a convention in Python. So just set up your script in a similar way and you can use it like you were programming in C or another language.

Make sure to pay attention to what the tutorial says about code indenting. Spaces/tabs/indents are very important in Python and they have to be done correctly in order for the code to work.

Ok now that you’re a verified Python expert let’s talk about how to do some of the basic functions that you need to know:

Modifying text files (or, changing one little number in 3000 lines of code)

Here’s your first Python program. It takes a file called borg.c, which, somewhere inside of which, contains the line “int mySeed = 1”. It then changes it to a number you specify in your Python program (in this case, 2). It does this by reading in every line of borg.c and spitting it into a new file, but when it gets to the mySeed line, it rewrites it. Note! Sometimes the spacing comes out weird on the blog. Be sure to follow correct Python spacing convention when writing your own code.

import re
import os
import sys
import time
from subprocess import Popen
from subprocess import PIPE
def main():
    inFilename = "borg.c"
    outFilename = "borgNew.c"
    seed = 2
    inStream = open(inFilename, 'rb')
    outStream = open(outFilename, 'w')
    for line in inStream:
        if "int mySeed" in line:
            newString = " int mySeed = " + str(seed) + ";\n"
            outStream.write(newString)
    else:
        outStream.write(line)
if __name__ == "__main__":
    main()

We’ve hit the ground running! The above code sample shows how to write a file, how to write a loop, and then how to actually modify lines of text. It also shows that ‘#’ indicates whether the line is a comment. Remember that the indentation will tell Python whether you’re inside one loop, or an if statement, or what have you. The indenting scheme makes the code easy to read!

If you save this code in a file (say, myPython.py), all you have to do to run the program is type “python myPython.py” at the command line. No news is good news, if the program runs, it worked!

Hopefully this has given you a taste for what you can do easily, with scripting. Yes, you could’ve opened a text editor and changed one line pretty easily. But could you do it easily 1000 times? Not without some effort. One easy modification here is that you could assign the ‘seed’ variable in a loop, and change the file multiple times, or create multiple files.

Next time… We’ll talk about how to call system commands within Python. So, in addition to changing text, we’ll be able to copy files, delete files, even submit jobs to the cluster. All within our script! Feel free to adapt any code samples here for your own purposes, and provide comments below!

Water Programming: A Collaborative Research Blog

Tips and tricks on programming, evolutionary algorithms, and doing research

Month: September 2013