The Ternary Operator

The ternary operator (the question mark and colon in the expression a ? b : c) is a tricky one.  Its use is relatively uncommon, but it's present in many programming languages, and sometimes it drastically simplifies a conditional assignment.  I bring it up because it was in the first version I saw of the GAA objective function, and it was doing something it shouldn't have been.  Before I took up the GAA problem, some of the work involved allowing a 3 percent constraint violation.  That allowance was still present in the objective function when I got my hands on it, even though we no longer wanted to allow a 3 percent constraint violation.

Here’s what it looked like:


constr[0] = CV_TOTAL > 0.03 ? CV_TOTAL : 0.0;

What it does is assign CV_TOTAL to constr[0] only if CV_TOTAL is greater than 0.03; otherwise it assigns 0 to constr[0].  It works this way because, in general, the expression a ? b : c evaluates to b if a is true, and to c if a is false. So writing d = a ? b : c is a shorthand way of saying, "assign b to d if a is true, and assign c to d if a is false."  And in this case, the effect is to allow a 3% constraint violation.
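To make the equivalence concrete, here is the same assignment written out as a plain if/else (a minimal sketch using the same variables as the snippet above):

if (CV_TOTAL > 0.03)
{
    constr[0] = CV_TOTAL; //the violation is large enough to count
}
else
{
    constr[0] = 0.0; //anything under 3% is forgiven
}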

My advice: there’s nothing wrong with using the ternary operator, but don’t be stingy with your comments.  Writing it like this could save you a lot of time:


constr[0] = CV_TOTAL > 0.03 ? CV_TOTAL : 0.0; // allow 3% constraint violation!

Of the languages that I've used, the ternary operator shows up in this form in C, C++, Java, and Ruby.

Zotero introduction (video)

I made a short Screenr video describing how to install and use Zotero for citation management in Word; specifically, it covers the Zotero standalone program plus the Chrome plugin. (Sorry about the breathing sounds in the microphone … I'll work on that next time.)

http://screenr.com/aZ5s

EDIT: As far as I can tell, WordPress only allows embedding from certain video sites which do not include Screenr. So I guess you have to just open the link. Sorry about that.

Edit by Rachel: Here is a second Screenr video I created about Zotero that talks about importing/exporting citations as well as using the PDF search capability in Zotero. I recommend watching this video using a full screen so you can read the text.

C++: Vectors vs. Arrays

In many of our programs we are using C-style arrays to store data. These are easy to write, but can be difficult to work with—they cannot be resized, they don’t know their own size, and most array operations require the use of pointers and memory management.

For many applications, you may want or need a data structure that is simpler to use. Check out the vector library in C++. The syntax for vectors may be new to you, but they make many operations a whole lot easier. A brief introduction to using the vector class is here.

Personally I like vectors because they make C++ operations more Matlab-esque, and they help prevent the memory management errors that are so common with C-style arrays. The downside, according to the C purists, is that vectors are slower than arrays. This can be true, depending on how you use them. But the vector class is really only an abstraction of the C-style array, so basic operations like storage and retrieval are still completed in constant time. The only time vectors slow down is when an append exceeds the vector's current capacity: behind the scenes, the vector allocates a larger array and copies every existing element into it. That copy runs in O(N) time, and repeatedly growing a vector this way inside a loop can really slow things down. (Source: the tutorial above, and here).

So: vectors can make your life a lot easier, and they are the same speed as arrays for indexing operations. Just be careful to resize vectors as few times as possible, for example by reserving the full size up front when you know it, to preserve the efficiency of your program.
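As a quick illustration, here is a minimal sketch of the two ways to avoid repeated growth (the size and variable names are made up for the example):

#include <vector>
using namespace std;

int main()
{
    int n = 1000000;

    vector<double> a;
    a.reserve(n);       //one allocation up front: appends never trigger a copy
    for (int i = 0; i < n; i++)
        a.push_back(i);

    vector<double> b(n); //alternatively, size the vector at construction
    for (int i = 0; i < n; i++)
        b[i] = i;        //plain indexing, same speed as an array

    return 0;
}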

P.S. I’m sure there are dissenting opinions on the array/vector debate … please add your two cents.

C++ Training: Valgrind

Valgrind is a tool for "memory debugging" of programs. It allows you to find places where memory is improperly allocated or accessed, errors which are difficult to find using traditional debuggers. Valgrind should be one of the first steps that you take when testing a program — even a simple one! It is available on Penn State's clusters and on most Linux systems.

Here are some “fun” programming errors you should avoid that Valgrind will help catch:

Trying to write outside the bounds

double *a;
int length = 5;
a = new double[length]; //a is now an array of size 5
for (int i = 0; i < 6; i++)
{
   a[i] = i; //error at i=5
}

The program will let you write a value at a[5], even though the only legal places to write are a[0] through a[4]. This corrupts memory, since a[5] refers to a location that could be in use by another variable. Valgrind will yell at you for this, and rightly so. Also be careful when allocating 2D and 3D arrays, since it is easy to confuse rows and columns, which will cause you major headaches.

Memory Leaks Inside Functions

void fun()
{
   double *a;
   a = new double[5];
   return;
}

Although the variable a gets destroyed as you leave the function, the memory that you allocated here doesn't get freed at all! This is dangerous, especially if you call fun() many times; the fix is to call delete [] a before the function returns.

Conditional Jump Depends on Uninitialized Variables

Even if you allocate memory properly, you may still commit this horrible error:

int y; //uninitialized at first
int x = 6;
int *a;
a = new int[5];

if (x < 5)
{
   y = 2;
}

a[y] = 7; //error: y was never initialized, because the branch above never ran
delete [] a; //a is deallocated properly (note the [] when deleting arrays)

Here we managed memory correctly, but we accessed y before it had a value assigned to it. The danger here is that y really holds some garbage value (e.g., -58342671), so the program may "sometimes" execute the line correctly, and sometimes it won't. Valgrind to the rescue!
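Running a program under Valgrind is a one-line change on the command line. A typical invocation (the program name is just a placeholder) looks like:

valgrind --leak-check=full ./myProgram

Compile with the -g flag first, so that Valgrind can report the file names and line numbers where each error occurs.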

This tutorial explains how to start using it.

C++ Training: Makefiles

The following is a basic makefile that worked for me.  However, there’s a link at the bottom to a forum post that provides a better solution.

#The C++ compiler you'd like to use:
CC=g++

#the following is for normal use:
CFLAGS=-c -O3 -Wall
LDFLAGS=
#the following is for using gprof or valgrind (swap these in for the pair above):
#CFLAGS=-g -pg -c -O3 -Wall
#LDFLAGS=-pg

#All your .cpp files go below. This simple example assumes they're all in the same directory:
SOURCES=calculations.cpp Dates.cpp Demands.cpp triangleSimulation.cpp

OBJECTS=$(SOURCES:.cpp=.o)

#Below you place your target name: what you want the program to be called.
EXECUTABLE=triangleSimulation

#Note: each command line below must be indented with a real tab character, not spaces!
all: $(SOURCES) $(EXECUTABLE)

$(EXECUTABLE): $(OBJECTS)
	$(CC) $(LDFLAGS) $(OBJECTS) -o $@

.cpp.o:
	$(CC) $(CFLAGS) $< -o $@

clean:
	rm -rf *.o $(EXECUTABLE)
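Once this is saved as Makefile in the same directory as your source files, building and cleaning up are each a single command:

make        #compiles the .o files and links the executable
make clean  #removes the .o files and the executable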

Forum post with advice on a simple makefile

MOEAs: Basic Concepts and Reading

(This post will be an introduction to the basic ideas behind MOEAs and will include references to relevant literature. Editors, please feel free to add citations as you see fit).

What is a MOEA?

Multi-objective evolutionary algorithms (MOEAs) are used to optimize a set of decision variables with respect to a set of objectives. (The "multi-objective" distinction just means that more than one metric is being used to evaluate performance). In between the decision variables and objectives lies your model, which could be any application where performance needs to be optimized. For a detailed explanation of terminology related to decision variables and objectives, see Joe's post. Evolutionary algorithms are considered heuristic methods because they do not solve problems deterministically, but rather explore the space of possible solutions and dynamically learn which decision variables lead to good performance. Perhaps the most comprehensive resource for learning about MOEAs is Evolutionary Algorithms for Solving Multi-Objective Problems by Coello Coello (ask around for hard or PDF copies). This can be a difficult read for those new to the subject, so you may want to read the rest of this post and peruse the suggested papers before diving in.

Why do we use them?

A number of different optimization strategies exist, including linear programming, response surface methods, and others. For some optimization problems, you can even find closed-form analytical solutions (think of finding function maxima in calculus). Compared to other strategies, evolutionary algorithms are often considered to be messy and inelegant; for example, in the Algorithm Design Manual, Steven Skiena refers to genetic algorithms as "voodoo". EAs truly are the sledgehammer of optimization methods—but this is why we use them! In engineering, we frequently encounter all sorts of difficult problems, including: constrained non-convex (where you must travel through infeasible space to reach feasible space), multi-modal (many "false" local optima), and high-dimensional problems, in terms of both decision variables and output objectives. More elegant methods typically need to be reformulated on a per-problem basis, but modern MOEAs can handle these difficult problems without altering their implementation. We want a solver that is reliable and robust across problem types, which is why we turn to MOEAs. A good discussion of the motivation for multi-objective optimization from a management perspective is given in Brill et al. (1982) Modeling to Generate Alternatives (note that the specific algorithm they use is less important for our purposes).

How do they work? (Single-objective)

For simplicity, let’s start by looking at a basic single-objective EA. This description is mostly based on the early genetic algorithms, but please note that a wide variety of variations exist. An EA begins with a randomly selected population of solutions (i.e., it randomly selects parameter sets and evaluates them in your model). From here, the algorithm uses mathematical operators to guide its search. EAs generally employ three operators—crossover, mutation, and selection—although the specific form of these operators varies between algorithms. The crossover operator combines decision variables from two or more solutions to create new solutions. The mutation operator perturbs the decision variables of a single solution to look for improvement in its local vicinity. Finally, the selection operator determines which solutions are allowed to proceed to the next “generation” (the next cycle of running these operators) and which will be discarded. By performing this process over multiple generations, the algorithm is able to evolve the best-known solution (hence the analogies to biological evolution).

Some technical considerations

Every run of an EA is a battle between convergence toward the best solution (governed by the selection operator) and diversity of solutions (governed by the crossover and mutation operators). Ideally, we want fast convergence with maximum diversity, but a balance must be struck between the two—we can sacrifice some convergence speed to ensure that the algorithm does not converge prematurely to a local optimum.

We also must consider the mathematical properties of our problems and the difficulties these may pose for evolutionary search. The success of early EAs generally depends on whether the problem is decomposable (i.e., if the decision variables can be optimized individually). This is closely related to the concept of rotational invariance, which describes whether the decision variables interact with each other and thus need to be optimized together rather than individually. Finally, the success of EAs also depends on the causality of the problem: in a problem with strong causality, a small change in decision variables results in a small change in the objective function(s). We can expect EAs to perform better on problems that have both decomposability and strong causality. However, modern MOEAs utilize adaptive operator schemes to attempt to overcome these problems. A good discussion of testing the ability of EAs to overcome nasty problems (especially high dimensionality) is given by Deb et al. (2005) Scalable Test Problems.

Moving to multi-objective EAs

In a single-objective optimization, an EA will return only one point that is optimal with regard to the objective function. What if we want to incorporate multiple objectives? Our first thought might be to combine all of the objectives into one, using some kind of weighting scheme. However, this requires that we know ahead of time which objectives we care about most, and this is not always the case. Instead, we can incorporate all of our desired objective functions separately using the (crucial) principle of nondominance. In a set of solutions, a solution is said to be "dominated" if another solution exists that is at least as good in every objective and strictly better in at least one. The solutions that are not dominated are said to be—you guessed it—"nondominated". The set of all nondominated solutions in a given generation is referred to as the Pareto front or tradeoff surface (because objectives usually conflict with one another, resulting in multiple possible "best" solutions). So, rather than optimizing all of the objective functions themselves, modern MOEAs instead optimize the nondominance of solutions. The first algorithm to utilize this approach was Deb's NSGA; Deb's NSGAII (2002 paper) addressed many shortcomings of NSGA, and soon became extremely popular (to say the least). A thorough, broader discussion of multi-objective optimization is given by Fleming et al. 2005.
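In code, the dominance check is only a few lines. Here is a minimal sketch for a minimization problem (the function name is mine, not taken from any particular library):

#include <vector>

//returns true if solution a dominates solution b (both minimized):
//a is at least as good in every objective and strictly better in one.
//assumes a and b have the same number of objectives
bool dominates(const std::vector<double> &a, const std::vector<double> &b)
{
    bool strictlyBetter = false;
    for (size_t i = 0; i < a.size(); i++)
    {
        if (a[i] > b[i]) return false; //worse in any objective: no dominance
        if (a[i] < b[i]) strictlyBetter = true;
    }
    return strictlyBetter;
}

A solution is nondominated within a set if dominates(other, it) returns false for every other member of the set.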

Where is the Pareto front?

A single run of a MOEA will provide a best-known approximation set of solutions, based on the principle of nondominance. But how do we know whether this is “right” or not? Many people have debated this question in more of a philosophical, mathematical sense. Practically speaking, we usually investigate this question using brute force. We don’t just run a MOEA one time—instead, we use many different random seeds, a high number of function evaluations, and sometimes even different algorithms. For example, our recent AWR study used 10 different algorithms with 50 random seeds each. Our approach to the question “is it right?” is just to take the results from all of these runs and find the final set of nondominated solutions, known as the reference set. This is the best-known set of solutions to the problem after performing as much evolution as computation allows. We can discuss whether a reference set is “right”, but from an engineering perspective, it is perhaps more important to discuss whether it provides a useful, diverse set of new solutions to the problem.

Improving efficiency: epsilon dominance

MOEAs rely on nondominated sorting to generate a set of optimal solutions. This is a computationally expensive operation, because an algorithm must compare every existing solution to every other solution. As a consequence, an algorithm may waste a lot of time sorting solutions that differ from each other by insignificant amounts. An important contribution to address this problem is the principle of epsilon dominance. Epsilon dominance asks the user to specify a desired precision for each objective value ahead of time in order to simplify the nondominated sorting operations. For example, you may only care about measuring cost to the nearest dollar, or for a big project, to the nearest thousand dollars. An algorithm will then only find the best solution within each “epsilon” of the objective space, rather than optimizing using full floating point precision. The theory behind epsilon dominance is discussed in Laumanns et al. 2002. Josh and Pat added epsilon dominance to NSGAII to create eNSGAII; a good description of the algorithm design and performance is given in their 2006 comparison paper. Finally, a clear example of the efficiency gains from epsilon dominance is given in Kollat and Reed (2007) Scaling Analysis.
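As a rough sketch of the idea (my own illustration, not the implementation from the papers above), each objective value is mapped to the index of its epsilon box, and dominance comparisons are then made between box indices instead of raw floating point values:

#include <cmath>
#include <vector>

//map an objective vector to its epsilon-box indices (minimization);
//e.g., with eps = 1000 for a cost objective, $12,345 and $12,999 land
//in the same box and are treated as equally good
std::vector<long> boxIndex(const std::vector<double> &f, const std::vector<double> &eps)
{
    std::vector<long> box(f.size());
    for (size_t i = 0; i < f.size(); i++)
        box[i] = (long)std::floor(f[i] / eps[i]);
    return box;
}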

Putting it all together: Visualizing alternatives

We perform all of this computation because, somewhere in the world, there are stakeholders who care about this problem. It would be somewhat unhelpful to approach these stakeholders with a 10-page ASCII printout of a reference set. If we want to increase the real-world impact of our heroic efforts, we need to provide stakeholders with visualizations of our best-known solutions. Fortunately for our group, Josh has created the AeroVis program to view solution sets in multiple dimensions. AeroVis allows us to explore our solution sets and identify regions of interest in the objective space, and allows stakeholders to easily see the full range of possible outcomes for their problem. Getting started with MOEAs involves a lot of in-depth technical effort, but it's important not to lose sight of the big picture—ultimately, we want to present meaningful results to stakeholders in a visually appealing way. A company or government agency may not care what algorithm you used, or how many random seeds you ran, etc., but they certainly do care if you can provide them with a better solution than the status quo. Some great examples of visualizing practical results are given in the Kollat and Reed (2007) VIDEO paper and also Kasprzyk et al. 2011.

The reading listed here should get you started for now. I plan to write another post about more advanced topics, including MOEA diagnostics (why we need them, what they mean, etc.). If you want to get a head start on MOEA diagnostics, you can check out Dave and Pat’s recent paper in Evolutionary Computation. Thanks for reading and feel free to email jdh366 (at) cornell (dot) edu with questions.

Software to Install on Personal Computers

I’ve been working lately with some visiting students, people outside the research group, and new students within the group that need to install some software on a personal laptop to get started.  Here is a guide to what you need to install. (Last Updated January 20, 2012).

  • (Required, Available to ANGEL Group Members Only)  If you haven’t already done so, contact Josh Kollat to become part of the AeroVis user group on Penn State’s ANGEL course management system, http://cms.psu.edu/.  There, you can download AeroVis, which is the visualization software that we use.  Also grab a copy of the sample data files and documentation which are really helpful.
  • (Required, Available to Everyone) Cygwin is used to connect to the cluster and see visual programs such as Matlab that are running on the cluster.  It can be found at http://www.cygwin.com/.  Download setup.exe and save it in a place (such as your Documents folder) where you can find it again, since you may need to run it more than once.  Double click on setup.exe to get started.  You can either select the "default" packages when it asks which packages to install, or go ahead and select "all" to install all packages.  Either way, you may have to go back and select additional packages in subsequent runs of setup.exe if everything doesn't install correctly.  The main package we'll be using is X-Window; according to the Cygwin website, you install it through the same setup.exe process as any other package.  This is probably the hardest of these software packages to install, and it takes a long time.  Let us know if you need any help, and leave a comment if you have any tips on how to make the install process easier.  Afterwards, edit your Windows environment variables to append ;C:\cygwin\bin to your Windows path. Note: Ryan has some additional ideas about the best way to access remote computers. Stay tuned for his post on the subject, and he can edit this post too with more details.
  • (Required, Available to Everyone) You’ll need a text editor, such as Notepad++ or PSPad.  Try both and see which one you like better, they are both free.  While you can use notepad or wordpad to do a lot of the text editing, these programs are a lot more comfortable for working with data and programming.
  • (Required, Available to Everyone) Use WinSCP for file transfer to the cluster.  Ryan's suggestions will probably pertain to this advice too; I know Matt and Ryan use different software packages, so I'd love their input here. See the comments on this post for additional discussion.
  • (Required Only for Visiting Students who Need College of Engineering Wireless) A group member can contact Bob White to get access for you.  A Penn State student must download software at the college’s site and load it on the visitor’s laptop.
  • (Required for Most Students Publishing Papers and Writing Theses Internally in the Group) You'll need a LaTeX system. LyX is an open source "document processor" that has compatibility with LaTeX. Please add additional suggestions for LaTeX environments and editors, preferably ones with syntax highlighting and some graphical features (such as adding equations using symbols). We have a license for WinEdt, but it's not free for personal use.
  • (Optional, Available to Everyone)  Open source tips:  When in a Penn State computer lab or at a computer in our office, you can use Adobe Photoshop and Illustrator for figure editing and Microsoft Office products.  If you want access to some nice programs on your personal computer, though, for free, try Inkscape (for vector images), GIMP (for raster images and pictures), and Open Office (an alternative to Microsoft) which are all freely available.
  • (Optional, Available to Penn State Students Only) Some good software is available at http://downloads.its.psu.edu/.  Secure Shell Client, under “File Transfer” at that site, is a file transfer/terminal program used to connect to the cluster.  Different people have varying preferences for file transfer and cluster stuff.  I personally recommend WinSCP and running terminal commands on Cygwin, so Secure Shell is not really required.

As for software that costs money, or for the computers in the office: we generally need Microsoft Office, Microsoft Visual Studio, Matlab, and the Adobe Suite.  Students shouldn't have to worry about installing those programs on their personal computers.  If you get Cygwin working correctly, your cluster access will allow you to use Matlab, Mathematica, programming compilers, and other software, so even if you don't have access to your own copy of Matlab, you can use it interactively on the cluster.

Let me know if you have any questions by emailing jrk301 at psu.edu.

C++ Training: Exercise 1

(Updated 1/30/12)

This example requires some data you can download here: inputdata.  Open the file in Excel and “Save As”… csv.

Write a C++ program that reads data from the csv file and then calculates the mean, the standard deviation, the 10th percentile, and the 90th percentile of each column of data.  Output these statistics to one or more new files.  You may create one file for each statistic, e.g.

means-of-data.csv

col1, col2, col3,
mean-of-col-1, mean-of-col-2, mean-of-col-3

Or put each statistic in its own row in a single file.

For this exercise, you may need the following functions, libraries, or features.  Please look them up on cplusplus.com and make liberal use of the example code there!

math: pow, operators such as +, -, and *

input/output of data: please use ifstream and ofstream.  You'll want to use the << and >> operators, and the function getline

manipulating text streams: the C++ string library, and stringstream, may be helpful.  You may need to look up how to handle csv files, and how to separate the commas from the data

data: sort (in the <algorithm> library)

control structures: if, else, while, for
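If the csv parsing has you stuck, here is a minimal sketch of one approach using getline and stringstream (the filename is a placeholder, and error checking is omitted):

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
using namespace std;

int main()
{
    ifstream in("inputdata.csv");
    string line;
    vector<vector<double> > data;
    while (getline(in, line)) //one row of the file per loop
    {
        stringstream ss(line);
        string field;
        vector<double> row;
        while (getline(ss, field, ',')) //split the row on commas
            row.push_back(strtod(field.c_str(), NULL));
        data.push_back(row);
    }
    in.close();
    cout << "Read " << data.size() << " rows" << endl;
    return 0;
}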

One piece of advice: if you want to convert a C++ style string to an integer or vice versa, you may find the following functions handy:


//these functions need <string>, <sstream>, and <cstdlib>, with using namespace std;

string intToString(int input_int)
{
    string s;
    stringstream out;
    out << input_int;
    s = out.str();
    return s;
}

int stringToInt(string input_string)
{
    return atoi(input_string.c_str());
}

double stringToDouble(string input_string)
{
    return strtod(input_string.c_str(), NULL);
}

Unit 1: Getting the Program Working

  1. Write a standard deviation function that can take the standard deviation of an arbitrary number of values. Verify that the function works with a calculator, Matlab, or Excel. Is the precision the same? What could cause differences between your answer and the calculator's?
  2. Learn how to use the input/output streams.  You’ll want to set up one stream to read the file, and another one to write output files. To test this, you can simply “copy” the file you read in directly into another file. Are there ways to do this without using too much system memory?
  3. The easiest statistics to calculate are mean and standard deviation, since they can be done without storing much information.  The 10th and 90th percentile are more difficult, because you must sort the values first.  Don’t try to do these until you have the other two tasks completed.
  4. Once you have your results, verify them using Matlab or Excel.  Are your statistics correct?

Unit 2: Intermediate Steps

After your program is working, please try the following procedures:

  1. Initially, you can write all the code necessary to do this task in one file, maybe even all in the main() function. This is fine for small tasks, but as you continue, you’ll need to have more complicated programs that span multiple files and use many more functions. A makefile is a way to compile large projects using Linux. Set up a makefile for your program, and try separating the code into multiple files. There’s a short blog post to help you (log in to see it). You can also use a development environment such as Eclipse to create a makefile for you, or Visual Studio to create a project file without needing a makefile.
  2. The first time you do this example, you may use (or have used) regular C-style arrays to store the data from the CSV files. This is fine, as long as you know how many rows and columns are in the file! Arrays also require you to keep track of pointers and memory management, which can make operations like sorting more painful than they need to be. C++’s answer to these problems is vectors. Vectors are like ArrayLists in Java; they can be sized and appended dynamically, and they know their own size. Consider implementing the same program using vectors for an input file with an unknown number of rows and columns.
  3. Use gprof to profile your code, to determine how much time is being spent in each function. This can identify “bottlenecks” in your code. See a blog post here for more info.
  4. Use valgrind to identify memory leaks in your code. This is very important, since memory issues can actually make your code fail to run. The blog post here explains.

Unit 3: Pulling Everything Together

  1. The last step will be to combine everything you've done so far to create your own Matrix class. You should be able to find some good websites explaining basic C++ classes, and if you need more advice, come talk to us. Your Matrix class should use vectors to create a matrix behind the scenes, so that you can use it easily in your main() function. The class should have some helpful functions to simplify your tasks … for example, you may want to have a Matrix.readFile(…) function, a Matrix.sortColumns() function, or whatever else you want to include! (A minimal skeleton is sketched just after this list.)
  2. You may want to implement your class using a separate header (.h) and source (.cpp) file. To compile a project with multiple source files, you’ll need to learn how to use basic makefiles, which are explained in Joe’s previous post (link is given in step 2).
  3. Re-implement your main() function using your new Matrix class. Make sure your output is the same as in the previous exercises. Once you do this, you will have written a general, re-usable class that could (potentially) save you lots of time in the future.
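Here is a minimal sketch of what such a class might look like; the method names are suggestions only, and you will want to add your own reading, writing, and statistics functions:

#include <algorithm>
#include <vector>

class Matrix
{
public:
    void addRow(const std::vector<double> &row) { data.push_back(row); }
    double get(int r, int c) const { return data[r][c]; }
    int numRows() const { return (int)data.size(); }
    int numCols() const { return data.empty() ? 0 : (int)data[0].size(); }

    //sort each column independently (useful for percentiles)
    void sortColumns()
    {
        for (int c = 0; c < numCols(); c++)
        {
            std::vector<double> col(numRows());
            for (int r = 0; r < numRows(); r++) col[r] = data[r][c];
            std::sort(col.begin(), col.end());
            for (int r = 0; r < numRows(); r++) data[r][c] = col[r];
        }
    }

private:
    std::vector<std::vector<double> > data; //the matrix behind the scenes
};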

If you have questions, please ask them and if the answers are general enough we will add them to this post.  Good luck!

–Jon and Joe

Meeting Notes – January 18

Here are some notes about using the blog, that stemmed from our meeting this afternoon:

  • I’ve added an “About” and “Important Post” page that will always be on every blog page, in the upper right corner.  It explains the rules of the blog and gives Pat’s suggestion of “what to read first.”  Please update it as you see fit.
  • The “Important Post” page has some suggestions of what we need on future posts.  Please add additional suggestions on topics that need to be covered, and if you want to address one of the to-be-covered topics, please do!
  • Generally, try to log in any time you are using the blog.  You’ll get to see private posts, your name will be associated with comments, and you can add new posts or edit old ones.  If you are reading this and you are a member of “the public”, visit our group webpage and get in touch with Dr. Reed if you would like to become a member of the blog or contribute.
  • The post: WordPress Tips has best practices on how to use the blog.  For example, there’s a neat syntax highlighting plugin that makes code samples really nice.  Maybe we should add this content to the “About” page.

Action items for next week:

  • Dave will be giving a presentation next week on some new stuff he’s been cooking up.  At that time, we’ll collaborate with him on updating our training materials on using his new framework.
  • Matt will be posting a version of the GAA problem that you can use as a test.  We will need to coordinate with Dave to help get everyone up to speed on the new framework.
  • Check out the “Important Posts” page and add suggested new content or start posting desired content.

Thanks everyone!

Running Sobol Sensitivity Analysis

Note: this post describes a version of Sobol written in C/C++ that is no longer supported. We now recommend the version of Sobol analysis included in SALib (https://github.com/SALib/SALib), an open-source Python library for sensitivity analysis methods. The link describes how to run different methods from the command line. If you prefer, there is another version of Sobol analysis included in MOEAFramework (http://moeaframework.org), written in Java. There is a separate post here describing how to use the MOEAFramework version.

Sobol sensitivity analysis is used to attribute the variance in a model’s output to its parameters and their interactions. Our group maintains a C/C++ version of the analysis that can be used as a black box. This post will describe how to hook your model into the Sobol framework and compute some sensitivity indices.

Step 1: Set up the environment

First, you will need to create a simple text file with the form [parameter] [lower bound] [upper bound]. For example, such a file for the Hymod model might look like this:

Cmax 0.0 1000.0
B 0.0 3.0
Alpha 0.0 1.0
Kq 0.15 1.0
Ks 0.0 0.15

The bounds are used to sample parameter values. The names of the parameters themselves are only used to print the final sensitivity indices (i.e., you can name them whatever you want). For this example, I will refer to this file as params.txt.

Obtain the Sobol source code from one of the group members and extract the folder somewhere in a UNIX environment (Cygwin on Windows will work, but you might as well put it on the cluster somewhere). Go to the Sobol directory and compile the source code (just type make). This will create three executables named generate, generate_sobol, and sobol.

Step 2: Generate parameter sets using the Sobol Sequence

From the Sobol directory, type the following command:


./generate -p params.txt -n 1000 -m 1 | ./generate_sobol -o sobolParams.txt

We are running two different executables here. The first, generate, reads the parameter file you give it and generates 1000 samples (specified by the -n flag) using the Sobol Sequence (specified by the -m flag). This output is piped (the “|” operator) to the executable generate_sobol, which resamples the parameters holding each one fixed at a time. This results in N(p+1) total parameter sets to be run in your model; for the Hymod example, we would have 1000*(5+1) = 6000 total model runs. The parameter sets are saved in sobolParams.txt.

Note that the Sobol method is very computationally intensive. Even for a simple model like Hymod, from personal experience I would recommend a sample size of at least N = 10,000 (which translates to 60,000 model runs). More complex models will be slower to run and will also require more samples to calculate accurate estimates of Sobol indices. Pay attention to the confidence bounds on your sensitivity indices to see whether you need to run more samples.

Step 3: Run the parameter sets through your model

The parameter sets are now saved in sobolParams.txt—the analysis so far has no idea what model you’re running. You need to set up your model to read in parameter sets from the file sobolParams.txt, run them through the model, and save your desired outputs in a file called objectiveValues.txt. You can use multiple objective values here—each row in the objectives file will correspond to a model run. Here is an example of reading in parameter sets for the Hymod model in C/C++ in the main() function:

//begin main()
string sampleFile = "sobolParams.txt";  //Input Sobol samples file to read

int numSamples = 60000; //make sure your program knows how many samples it has to read
int nParams = 5; //how many parameters do you have?
int nObj = 4; //how many objectives do you have?

ifstream in; //Input file stream
ofstream of; //output file stream for objective values

//obj is an array holding the objective values, and xreal holds the parameter values for each run
double *obj = new double [nObj];
double *xreal = new double [nParams];

//Initialize the model only once
init_HyMod(argc, argv);

//Open the output file and set some properties for it
of.open("objectiveValues.txt", ios::app);
of.setf(ios_base::fixed);
of.precision(8);
of.width(20);

in.open(sampleFile.c_str(), ios_base::in);
if (!in)
{
     cout << "Error: sampleFile could not be opened!" << endl;
     exit(1);
}

//Loop through parameter samples in sobolParams file
for (int s = 0; s < numSamples; s++)
{
     if (s%100 == 0) cout << s << endl; //keep track of your progress

     //Read in the parameter values from Sobol
     for (int i = 0; i < nParams; i++)
     {
          in >> xreal[i]; //read in each parameter individually for this set
     }
     in.ignore(1000,'\n'); //ignore the rest of the line

     //Run this sample once all parameters have been read in
     calc_HyMod(indPtr); //(indPtr comes from the full Hymod source, not shown here)

     //Print the objective value outputs
     for (int i = 0; i < nObj; i++)
     {
          of << setw(15) << obj[i];
     }
     of << endl;
} //close the samples loop

//Close the input and output files
in.close();
of.close();

//Remember to clean up any memory you allocated!
delete [] obj;
delete [] xreal;
deleteAllStuff(&hymodPtr->data, &hymodPtr->model, &hymodPtr->snowModel);

//end main()

As you can see, even this fairly simple task becomes complicated in C/C++. It might only take you a handful of lines to do the same thing in Matlab. But if you want speed, C/C++ is the way to go. Be careful to read in your parameters in the same order in which you sampled them!

Step 4: Calculate Sobol Indices

You now have the objective values for all of your model runs, arranged in columns in the file objectiveValues.txt. We need to send this back to the final executable, sobol, to compute the sensitivity indices. Do this with the following command:

./sobol -p params.txt -i objectiveValues.txt -o sobolIndices.txt -m 0

As shown here, the sobol executable requires several flags: the parameter file (-p), the file containing calculated objective values (-i), the output file for the sensitivity indices (-o), and the column of the objective values file to read (-m). The columns are assumed to be zero-indexed; if you have calculated multiple objective values, you would continue on to -m 1, etc., repeating the same command as above.

Optional parameters for the sobol program include the (-r) flag, which specifies the number of bootstrap resamples, and the (-s) flag, which specifies the random seed for bootstrapping. These parameters are used to calculate the confidence intervals of the sensitivity indices. If these flags are excluded, the program defaults to using the system time as the random seed, with 1000 bootstrap resamples.
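For example, to fix the random seed for reproducibility and request more resamples (the specific values here are just for illustration), the Step 4 command becomes:

./sobol -p params.txt -i objectiveValues.txt -o sobolIndices.txt -m 0 -r 10000 -s 42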

Step 5: Interpret your results

Open the file sobolIndices.txt in a regular text editor. You will see something like this:

Parameter	Sensitivity [Confidence]	Rank
1st-Order Sensitivity Indices
  Ks	+0.004141 [ +0.004347 ]	5
  Kq	+0.128900 [ +0.022733 ]	2
  DDF	+0.000007 [ +0.000006 ]	8
  Tb	+0.000007 [ +0.000006 ]	7
  Tth	+0.000007 [ +0.000006 ]	6
  Alpha	+0.262308 [ +0.037787 ]	1
  B	+0.039529 [ +0.018653 ]	4
  CMax	+0.112149 [ +0.037824 ]	3
Total-Order Sensitivity Indices
  Ks	+0.011856 [ +0.003890 ]	5
  Kq	+0.298414 [ +0.036949 ]	3
  DDF	-0.000000 [ +0.000000 ]	8
  Tb	+0.000000 [ +0.000000 ]	7
  Tth	+0.000000 [ +0.000000 ]	6
  Alpha	+0.617916 [ +0.038536 ]	1
  B	+0.172242 [ +0.029841 ]	4
  CMax	+0.379264 [ +0.053695 ]	2
2nd-Order Sensitivity Indices
  Ks * Kq	+0.000842 [ +0.006566 ]	12
  Ks * DDF	+0.005705 [ +0.006541 ]	10
  Ks * Tb	+0.005705 [ +0.006556 ]	9
  Ks * Tth	+0.005705 [ +0.006331 ]	8
  Ks * Alpha	+0.006472 [ +0.005962 ]	6
  .....(second order indices continued).....

The parameter names will match the ones you specified in params.txt. The first order, total order, and second order sensitivities are specified as indicated. Confidence intervals for each index are shown in brackets. Rankings are included in the rightmost column. Typically we use the total order indices to get a broad picture of model behavior—here you can see that the Alpha parameter is controlling most of the variance in the output. If the confidence interval of a dominant index is larger than ~10% of the value itself, you may want to consider increasing your sample size as computation permits.

Those are the basics for running Sobol. There are some cool extensions to this framework that I plan to cover in future posts, including time-varying sensitivities, sensitivity of parameter groups, and how to visualize indices in Matlab. Feel free to email jdh33 (at) psu (dot) edu with questions. Thanks for reading.