New Windows install? Here’s how to get everything set up

I’m installing Windows 8 on a computer, and I wanted to blog the process of getting everything installed because it may help someone in the group / or that does similar work remember all the software they need!

Free Software

1. Connecting to the cluster. I use Cygwin to do terminal commands and also GUI programs using the XWin server. When installing Cygwin, if you have the hard drive space, select all packages so you never need to worry about not having something you need. For file transfer I use WinSCP.
2. Chrome/Firefox
3. Dropbox desktop app
4. Skype
5. Media players: Apple iTunes or a media library if you need it. Also, I like VLC Media Player for a free open source alternative.  I also use MediaMonkey as a free alternative to the iTunes library.
6. Java development kit. Needed to do tasks related to MOEAframework.
7. Text editors. Recently I’ve been using Notepad++ and SciEdit.
8. LaTeX. We have a site license in the group for WinEdt, but there are free alternatives. You may also need a TeX package like TeXLive.
9. Thunderbird for email
10. AeroVis, as well as the Microsoft Redistributables, available over at decisionvis.com.
11. Data analysis: Python and R
12. Integrated Development Environments such as Eclipse

Paid/Licensed Software

1. Adobe Creative Suite, for editing figures.  There are some alternatives such as Gimp and Inkscape, as well.
2. Matlab.  Note that you can do a lot of the things that classically are done with Matlab using alternatives, such as Octave (the Matlab clone), Python, R, or other scripting.
3. Microsoft Office.  Again, there are alternatives such as using LaTeX for everything, or office packages such as OpenOffice.
4. Specific compilers such as Visual Studio

Integrating a MOEA with a sample water management problem

On this blog, we’ve covered some basics of getting started using MOEAs — you can search the “getting started” tag below for some examples, such as the problem formulation post, basic concepts and reading, etc.  But I thought it might be helpful to run through a sort-of hypothetical example, where you would be trying to optimize a portfolio of water infrastructure.  With that, let’s get started!

The first step is defining your problem formulation.  This type of formulation looks a little different than a traditional formulation for optimization of a water system.  What I mean is, if you are using an optimization model of, say, a reservoir, you may want to maximize a single benefit function of the reservoir, and your decision variables will actually be volumes of water that are allocated in the system.  The constraints will ensure continuity in the system (no water is created or destroyed), and do other housekeeping such as making sure that all the decision variables are non-negative.  While this is a perfectly fine way to set things up, there are some limitations there.  In the traditional system, your constraints are the only way to model the actual physics of the system, so you’re going to make some assumptions about how the system works.  Also, that single benefit function can’t always capture all of the important functionality of the system.  What if you want to maximize some cubic function for environmental flow, but your other benefits are calculated in another manner?

It seems I’ve gotten off into a tangent here.  The point is that our so-called many objective approach has a different way of defining those terms in the problem formulation.  It’s a simulation-optimization approach.  And what that means is that the simulation model is responsible for taking care of the system physics.  So your constraints aren’t needed to ensure continuity, that’s inside the model.  What you need to be concerned with is three major categories, explained below.  I’ll talk about each one of them means, and I’ll set up a hypothetical problem as an illustration

The problem in a nutshell: Imagine you have a system with three major locations A, B, and C.  You are trying to propose whether or not to build a reservoir at B.  Once you have the reservoir at B, you could use several different types of release curves at the reservoir.  You have some water transfers and conveyances in the system.  One of the transfer schemes is set, but there is another one, between B and C, that you have to design.  Each of the transfers uses a generic threshold to determine how the water should be conveyed from one location to the other.

Decision Variables

These are “levers” that the decision maker/designer/stakeholder can change in the system.  Some phrases to think about: “whether or not something should be designed”, “how big should it be”, “what rule should we use to operate it?”, “what design parameter should we use?” “what material?” “what regulation should we set?”  You can be very creative in setting this up!  The big rule to remember here, though, is that you need to be able to code this into something that the MOEA can automatically change.  Decision variables to a MOEA are typically values that can be modified, which have a uniform distribution between an upper and a lower range.  When I say distribution, I don’t necessarily mean a statistical distribution — I just mean that it is plausible that the values range between the upper and lower bound.  Perhaps the optimal values are all toward the high end of the range!  Who knows.  The point is you just need to set an upper and a lower bound.  This will become clearer when we actually look at the variables for this problem:

Variable — Lower Bound — Upper Bound — Description/Notes
Res — 0 — 1 — Boolean variable.  1 if the reservoir is built, 0 otherwise.
Rule — 1 — 3 — Discrete variable, which uses a lookup table.  You have 3 release curves, this variable shows which one you’re using.
Cap — 0 — 1000 — Real variable, the capacity of the reservoir.
Transf — 0 — 1 — Boolean variable. 1 if the transfer scheme is built, 0 otherwise.
Thresh — 0.0 — 3.0 — Real variable, the threshold used for the transfer scheme.

Be aware that some of these variables are technically dependent on one another.  That is, if Res = 0, Rule and Cap don’t matter.  If Transf = 0, Thresh doesn’t matter.  Amazingly, the MOEA can operate under these conditions!

Objectives

Multiple, quantitative metrics of system performance.  This isn’t much different than the classical definition of an objective function.  Just note that your traditional problem usually has the objective function as a direct function of the decision variables (which classically are volumes of water that are being allocated or routed).  Here, the objective function is sort of indirectly a function of the decision variables.  You’ll see this when we define them below:

Objective — Description
Cost — Fixed and operating cost of the reservoirs and transfers.  Here, the fixed cost is obviously directly correlated to the Res decision variable.  But, the operating costs are a function of a number of things — the input data in the simulation, how conservative the release curve is (i.e., the Rule variable), the transfer rules (the Thresh variable).  Cost is minimized.
Reliability — The likelihood of meeting performance targets.  There’s lots of ways to define reliability (maybe I can cover them in a separate post).  I like to think of it as “storage reliability” (meeting storage targets) and “demand reliability” (meeting your demands).  In general this is a percentage or ratio, such as 92% or 95%, and it is maximized.  There’s a conflict here with cost — if you spend a lot and build the big reservoir, you’re going to improve reliability.  But minimizing cost will conflict with this.  Or, there could be other strategies that don’t cost a lot but can still maintain performance.
Environmental Flows — Some function of flows that indicates how well your reservoir releases meet the needs of the environment.  It would be easy to operate a reservoir to maximize water supply and flood control.  Simply release when you need flood storage, retain water when you need storage for supply.  But fish are used to variety in the flows, so meeting that variety while also meeting your other uses could be challenging.

Constraints

Acceptable limits on performance, or preventing “infeasible” system designs.  This is a little different than the classical formulation, since the constraints are not needed to ensure the system runs properly.  But, similar to the classical formulation there is the idea of “feasibility”.  I like to think of it as setting limits on what a plausible solution is.  If your regulatory requirements say you have to meet 90% reliability, this can be a constraint.  Solutions with lower than 90% reliability will be infeasible.  Similarly, if it’s physically impossible to have a high transfer threshold if the reservoir is in place, that could be considered a constraint as well.  Unlike classical optimization, I’d venture to say constraints are not necessary for a proper many objective problem formulation.  But they can help ensure your tradeoff solutions have high performance and are meaningful.

The Actual Solution Process

About 1000 words later we get to the “good stuff”!  It’s broken up into a few parts.  First, set up the information the algorithm needs to know.  Then, a brief overview on what the algorithm is doing.  Next, I talk about what happens to the decision variables at each iteration.  Finally, I discuss what information is needed within the simulation model.

Setting Up the Algorithm

The algorithm needs the following information about the decision variables, set up in some type of table such as shown below.  You’ll notice we’ve covered some of this already:

Variable — Lower Bound — Upper Bound — Tag
Res — 0 — 1 — {Res}
Cap — 0 — 1000 — {Cap}
Rule — 1 — 3 — {Rule}
Transf — 0 — 1 — {Transf}
Thresh — 0.0 — 3.0 — {Thresh}

Here we need some tags.  Tags tell the algorithm where to replace the information already in the simulation model, with the specific value for a given solution.  More on that in a minute.

Regarding objectives, the algorithm needs to know how many you have, and what their names are.  You’ll also need “epsilon precision” for a lot of MOEAs, but that will be the topic of another post.  Basically that’s a significant precision for each objective — if your cost is in millions of dollars, for example, you might not care about cost differences of $0.01 so you set the precision accordingly.

So what is the algorithm doing?

The algorithm’s goal is to ‘create’ new values of the decision variables, and manipulate them to find better and better values of your objective functions.  When I say ‘create’, I mean two things.  At first, the algorithm will start with random generation of the decision variables.  After the second generation, an iterative process of selection of good solutions and variation on those solutions (like mutation) is used to create values that have better and better objective performance.  Lower costs, higher reliabilities — all while finding the tradeoff between the objectives.  Tradeoff here means, what is the best reliability I can get at each level of cost?  The best environmental flow at every level of reliability and cost?

Decision Variables Inside the Algorithm

Let xi represent the ith decision variable.  So x1 is the Res variable, x2 is cap, and so forth.  MOEAs use a population based approach.  What that means is there are multiple solutions that are generated at each step.  So the algorithm is responsible for the ‘creation’ of new values.  (Like I said, ‘creation’ at first is completely random.  Later, as the search continues, it consists of selection and variation of good values.)

Create a population of solutions:
Solution — Res, Cap, Rule, Transf, Thresh
Solution 1 — 1, 1000, 2, 1, 1.5
Solution 2 — 0, 500, 2, 1, 2.0
Solution 3 — 1, 500, 1, 0, 3.0

Solution 1 builds the reservoir at capacity 1000, operating the reservoir with rule set 2.  It also selects the transfer scheme, running it with a threshold of 1.5.

Solution 2 does not build the reservoir.  Recall that its values for the Cap and Rule variables are meaningless.  It does select the transfer scheme, though, and runs it with a threshold of 2.0

Solution 3 builds the reservoir with capacity 500 and rule set 1.  It does not select the transfer.

Notice the only human input here is the ranges for the decision variables!  A human being did not create solutions 1 2 and 3.  However, later in the process a human will decide between solutions 1 and 3 in the final tradeoff analysis.  But that, again, is the subject of another post.

So now the algorithm is done with its business for the time being, and it needs to:

Send the decision variable values to the simulation model

It then waits around for the simulation model to calculate objective functions.  Then, the algorithm:

Receives objective function values from the simulation model

and then,

Repeat for all solutions in the population

Setup within the Simulation Model

If you have a lot of control over the simulation model, you can write a function that receives decision variable values, puts them where they need to go, and then spits out objective function values seamlessly.  Unfortunately this isn’t always possible.  So I’ll explain a slightly more complicated version of how to do this that may be helpful for your application.

Simulation Setup File. Imagine you have a file that looks something like this:

==Reservoirs==
Name — Capacity — Release Rule — Height
Res A — 2000 — Rule A — 20
{Name} — {Cap} — {Rule} — 30
Res C — 1000 — Rule B — 40

==Transfers==
Name — On/Off — Threshold
A to B — On — 1.2
B to C — {Transf} — {Thresh}

The setup file specifies a generic solution for your simulation.  Reservoir A exists, with a capacity of 2000 ran with release rule set A, and a height of 20.  Reservoir B is the design!  Notice that its fields have tags in them — but no matter what Reservoir B will be at height 30.  Reservoir C also exists.  You can see the same thing is going on with the transfers.  The simulation model needs to do the following procedure:

0. When the model is started up, input data should be run in, memory allocated, and everything set up and ready to go.  These routines will ideally only be called once.
1. When decision variables are received from the algorithm, a function should be written which goes in and replaces all the tags with the appropriate values.  A tag like {Cap} is easy because there’s a 1 to 1 relationship between the decision variable value and the actual value used by the simulation.  A tag like {Name} is tougher though.  You have to write a function that says, if the value of Res is 1, then {Name} is replaced with something like “Reservoir B”.  If Res is 0, replace {Name} with “Off” or “N/A” or something that signifies to the simulation model that the reservoir does not exist.  Alternatively you can also set the capacity of Reservoir B equal to zero, and pass all the water through.  Whatever works for your application.
2. Now, run this set of infrastructure in the model.  Get some results!
3. Translate the results into the objectives which are output to the model.  There are varying levels of sophistication here.  One way of doing it is setting up a system of tags just like for the decision variables.
4. Write the values of the objectives back on your output stream, and send them back to the algorithm.
5. Make sure your simulation model is also set up to repeat this process for all solutions in the population, indeed for all solutions in the search procedure as it moves forward.

Conclusion

Hopefully this has been a nice introduction to the “nuts and bolts” of how the whole process works.  Again, what you’re really doing is automatically generating a set of designs that balance conflicting objectives that you care about in your system.  The result is different infrastructure portfolios that you can visualize and analyze, ultimately leading to better or more sustainable water planning!

Please feel free to ask questions or make comments below.  Until next time!

 

Running Sobol Sensitivity Analysis using SALib

This post was updated on August 11, 2014 to update the SALib module structure, and again in 2017.

The Sensitivity Analysis Library (SALib) is an open-source Python library for common sensitivity analysis routines, including the Sobol, Morris, and FAST methods. In 2017 it was published in the Journal of Open Source Software:

Herman, J. and Usher, W. (2017) SALib: An open-source Python library for sensitivity analysis. Journal of Open Source Software, 2(9).

This post describes how to use the command-line interface of the library. (The Github page gives an example of the Python interface).

Step 0: Get the library

The easiest way to install is pip install SALib, which will pull the latest version from the Python package index. Or, you can download at the above link and run python setup.py install. If you’re interested in contributing, you can clone the git repository:

git clone https://github.com/SALib/SALib .

Step 1: Choose sampling bounds for your parameters

First, you will need to create a simple text file with the form [parameter] [lower bound] [upper bound]. For example, such a file for the Hymod model might look like this:

Cmax 0.0 1000.0
B 0.0 3.0
Alpha 0.0 1.0
Kq 0.15 1.0
Ks 0.0 0.15

The bounds are used to sample parameter values. The names of the parameters themselves are only used to print the final sensitivity indices, so you can name them whatever you want. Let’s call this file params.txt.

Step 2: Generate parameter sets using the Sobol Sequence

Put your params.txt file in the same directory as the SALib folder. Move to this directory and type the following command:

python -m SALib.sample.saltelli -n 1000 -p params.txt -o my_samples_file.txt

The -n flag specifies the number of initial samples to generate from the pseudo-random Sobol sequence. The -p flag specifies the parameter bounds file that you created in the first step.

In this example, 1000 parameter sets are generated from the Sobol sequence. After that, the Saltelli method of cross-sampling is applied (for more information, see: Saltelli 2008, “Global Sensitivity Analysis: The Primer“). The cross-sampling scheme creates a total of 2N(p+1) total parameter sets to be run in your model; for the Hymod example, we would have 1000*(5+1) = 6000 total model runs. The parameter samples will be printed to the file specified with the -o flag, which in this case is called my_samples_file.txt.

Note that the Sobol method can be computationally intensive depending on the model being analyzed. Even for a simple model like Hymod, from personal experience I would recommend a sample size of at least N = 10,000 (which translates to 60,000 model runs). More complex models will be slower to run and will also require more samples to calculate accurate estimates of Sobol indices. Once you complete this process, pay attention to the confidence bounds on your sensitivity indices to see whether you need to run more samples.

Step 3: Run the parameter sets through your model

The parameter sets are now saved in my_samples_file.txt. Run these parameter sets through your model, and save the output(s) to a text file. The output file should contain one row of output values for each model run. This process is performed outside of SALib, so the details will be language-specific. Be careful to read in your parameter sets in the same order they were sampled.

Step 4: Calculate Sobol Indices

You now have the output values for all of your model runs saved to a file. For the sake of example, let’s call that file objectiveValues.txt. We need to send this information back to SALib to compute the sensitivity indices, using following command:

python -m SALib.analyze.sobol -p params.txt -Y objectiveValues.txt -c 0

The options here are: the parameter file (-p), the file containing calculated outputs (-Y), and the column of the objective values file to read (-c). The columns are assumed to be zero-indexed; if you have calculated multiple objective values, you would continue on to -m 1, etc., repeating the same command as above. By default, this will print sensitivity indices to the command line. You may want to print them to a file using the “>” operator.

Step 5: Interpret your results

Say that you saved the indices from the above command into the file sobolIndices.txt. If you open this file in a text editor, it will look something like this:

Parameter First_Order First_Order_Conf Total_Order Total_Order_Conf
x1 0.696371 0.183873 0.704233 0.211868
x2 0.232399 0.119619 0.264305 0.129080
x3 -0.021573 0.048209 0.027243 0.066093
...(etc.)

Parameter_1 Parameter_2 Second_Order Second_Order_Conf
x1 x2 -0.142104 0.307560
x1 x3 -0.009698 0.271062
x1 x4 -0.049298 0.283457
...(etc.)

The parameter names will match the ones you specified in params.txt (Here they don’t, but this is just an example). The first order, total order, and second order sensitivities are specified as indicated, along with their respective confidence intervals. Most of the indices are omitted here for the sake of brevity. Typically we use the total order indices to get a broad picture of model behavior, since they estimate all of the interactions between parameters. If the confidence intervals of your dominant indices are larger than roughly 10% of the value itself, you may want to consider increasing your sample size as computation permits. For total-order indices to be important, they will usually need to be above 0.05 at the very least (the most dominant parameters will have values upward of 0.8).

For a full description of available methods and options, please consult the readme and examples in the Github repository or on the SALib website. Github users can also post issues or submit pull requests. Thanks for reading!

Matlab and Matplotlib Plotting Examples

A while back, Jon published a set of Matlab plotting examples on GitHub.  I’m not sure if this has been publicized yet — I couldn’t find it with a quick search of the blog, so it couldn’t hurt to publicize it again!

I’ve started adding my own Python/Matplotlib equivalents to Jon’s Matlab examples, in the hopes of turning this into a comprehensive library of basic plotting examples.  I’m not done yet, but I’ll add more as time permits.  I take requests, so if there’s anything you’d like to see, please leave a note in the comments.

Here’s a quick list of what’s there now:

  • line plots (Matlab and Python)
  • line plots with shading between curves (Python)
  • stacked area plots (Matlab)
  • animated gifs like the one from Jon’s recent post (Matlab and Python)
  • histograms and cdf plots (Matlab)
  • parallel coordinate plots (Matlab)
  • gridded data contour and surface plots (Matlab and Python)
  • scatter plots (Matlab and Python)
  • non-gridded data contour plots (Matlab and Python)