Emacs in Cygwin

If certain commands aren't working in Emacs under Cygwin (e.g., C-x C-c doesn't exit the program), try adding this line to Cygwin.bat before the 'bash --login -i' line:

set CYGWIN=tty notitle glob

Cygwin.bat is located in the root directory of your Cygwin installation.  That should do it!

Speed up your Bash sessions with “alias”

Some of you probably already know about this, but I just discovered it and wanted to share.

Tired of typing in the same terminal commands all the time? Try this:

alias cy='ssh -Y jdh33@cyberstar.psu.edu'

Now when you type "cy" in the terminal, the shell replaces it with the full command you specified in quotes.

You can define aliases one session at a time, but it's even more helpful to put these alias commands in your .bashrc file so that they are run whenever you start bash (the "rc" in "bashrc" stands for "run commands" … at least I think it does). If you're using Cygwin, this file is located at C:\cygwin\home\(your username)\.bashrc. Open it in a text editor and add your alias commands anywhere in the file. (On the cluster, your .bashrc file is located in your home directory, ~.)

Some useful possibilities:

  • On your computer, create aliases to log into different clusters (“cy” for cyberstar, “lx” for Lion-XO, etc.)
  • On the cluster, create aliases for common queue operations ("qme" for "qstat -u jdh33", for example, which is a good idea I got from Ryan, or "killall" for "qdel $(qselect -u jdh33)"; just be aware that this particular name shadows the existing killall command. Pretty handy!)
  • Any long commands you find yourself repeating can be aliased! Just be careful not to overwrite any existing Linux commands when you do it.
You can check your current aliases at any time by running alias without any arguments.

That’s all, thanks for reading.

Moving Off-screen Windows in Windows 7

Has this ever happened to you?  You do something with your monitors, and suddenly Matlab or another program opens a window off your screen with no obvious way to get it back.  This is a pretty common occurrence if you have a laptop and frequently switch between a multi-monitor and a single-monitor setup.  I found this handy fix on the Microsoft forums.

Moving the window back is as simple as selecting it (from the taskbar or with Alt+Tab), holding down the Windows key, and pressing the left or right arrow key.  The window is automatically docked to one side of the screen, moving it back into view!

Debugging the NWS model: lessons learned

Josh and I have been setting up a Sobol sensitivity experiment using the National Weather Service’s HLRDHM model. This is a distributed rainfall-runoff model which uses the Sacramento structure for soil moisture accounting along with a separate routing model. We ran into some problems which may be helpful for people in the future, particularly if you find yourself inheriting a large code base from elsewhere. (All of the code here will be in C/C++, but the takeaways are generally applicable).

In general, the goal here is to write wrappers around the model to perform Sobol sampling and analysis without touching the model code itself. This approach would give us a portable method for similar analyses in the future, and would (hopefully) avoid conflicts if the NWS decided to update the model. Josh initially accomplished this by compiling the model into a separate executable, and calling it from the wrapper using the system command:

system("./theModel.exe -flags arguments");

This method should work in theory, and we were able to get some results this way. However, Josh noticed that many of the clusters would not allow calls to system(). We are still not sure why this is—our suspicion is that it has something to do with security, but we haven’t been able to confirm this.

The next idea was to compile the model into a shared library. This involves three fairly short steps: (1) renaming the model's main() function to something else (say, model_driver()), (2) compiling the model into a shared library, and (3) linking the wrapper against this library in its makefile. (The specifics of this setup could be covered in a separate post). Then, in your wrapper, you could call:

model_driver(arguments);
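
As a rough illustration, the wrapper side might then look something like the sketch below. The argument list, the "-i" flag, and the evaluateSample() function are hypothetical placeholders; the real signature and flags come from the model code.

// Hypothetical sketch only: model_driver() stands in for the model's renamed main(),
// and the "-i controlFile" flag is made up for illustration.
extern int model_driver(int argc, char **argv);

void evaluateSample(const char *controlFile)
{
    // Build an argv-style argument list, just as the command line would have provided
    const char *args[] = {"theModel.exe", "-i", controlFile};
    int argCount = sizeof(args) / sizeof(args[0]);

    model_driver(argCount, const_cast<char **>(args));

    // ...read the model's output files and compute objective values here...
}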

The shared library approach should also, in theory, work. It did run properly, but we found that the time required to run each simulation was increasing linearly throughout the experiment(!) This was a strange and unacceptable result and required some investigation.

We knew that the model is typically run as a standalone executable. This means that when it finishes running, the operating system reclaims all of its memory, so each run starts fresh. When we use a shared library instead, all of the global variables in the model keep their state between calls to model_driver(). This is a key point worth emphasizing: a variable declared inside a function is destroyed when it goes out of scope (i.e., when the function exits), whereas variables declared globally are not. For this reason, we suspected that the slowdown was caused by global variables not being cleared properly between model runs. But where to start looking?
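
Here is a toy example, entirely separate from the NWS code, that illustrates the difference:

#include <iostream>
#include <vector>

std::vector<double> globalState;             // global: persists across calls to run()

void run()
{
    std::vector<double> localState(1000);    // local: destroyed when run() returns
    globalState.insert(globalState.end(), localState.begin(), localState.end());
    std::cout << "global size is now " << globalState.size() << std::endl;
}

int main()
{
    // Each call appends to the global vector, so it keeps growing -- exactly the
    // kind of state that is wiped between runs of a standalone executable but
    // retained between calls into a shared library.
    for (int i = 0; i < 3; i++)
        run();
    return 0;
}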

Fortunately, we did not have to dig very deep to find the cause of the problem. The main function in the model basically calls a number of other functions in sequence, so the first step in this debugging process was to place timers around these different functions to identify which one was causing the slowdown. In C++, you can set up basic timers to print to stdout as follows:

// Includes needed for the timer macros and for printing
#include <ctime>      // clock(), CLOCKS_PER_SEC
#include <iostream>   // cout, endl
using namespace std;

// Create global timer macros to simplify things
#ifndef START_TIME
#define START_TIME clock()
#define GET_TIME(start) ((clock()-start)/(double)CLOCKS_PER_SEC)
#endif
double timer;
// ....start your main function, etc.
// Then, to record the timing of the function foo(), you do the following:
timer = START_TIME;
foo(bar);
timer = GET_TIME(timer);
cout << "Time to run foo(): " << timer << endl;

This is a very informative (if inelegant) way to learn about how your program is running. In the NWS model, this identified only one function as the cause of the slowdown—what luck! The offending function was:

hlrms_algo(deck, allPixels, factory, inputBeforeLoopFact, inputInsideLoopFact, inputAfterLoopFact);

As it turns out, most of the parameters being passed into this function are actually global variables. (Why would one need to pass global variables into a function, you ask? Good question.) A quick search in Visual Studio for the definitions of these variables indicated that they are of type map from the C++ standard library (a map stores key-value pairs, ordered by key, rather than values indexed by position like a vector).

At this point, all of this evidence suggests the following conclusion: one or more of these global map objects are not being cleared between model runs. A quick look at the STL documentation will tell you that map has a member function clear(), which removes all elements and, importantly, calls the destructors of those elements. Testing this conclusion was then just a matter of adding clear() calls after each model run:

hlrms_algo(deck, allPixels, factory, inputBeforeLoopFact, inputInsideLoopFact, inputAfterLoopFact);

deck.clear();
allPixels.clear();
inputBeforeLoopFact.clear();
inputInsideLoopFact.clear();
inputAfterLoopFact.clear();

And just like that, the slowdown problem disappears. Each simulation now takes approximately the same amount of time, because these global variables are being cleaned up after each run.

A more curious programmer might have some additional questions here. Which of these variables was causing the slowdown? What exactly do these variables do? Is map really the best choice of data structure for them? But remember, our goal is not to rewrite the NWS model—we just want to run our experiments on the model as it stands in a portable way. Now we can proceed with the “shared library” approach and get some results.

Common PBS Batch Options

PBS is a handy piece of cluster scheduling software, usually wrapped around a grid manager like MOAB.  It's useful in that you can supply options on the command line or in a batch script.  Arguments placed on the command line when calling the qsub command take precedence over those in the script, so you can build a general script and then test or tweak it by varying the options on the command line.  PSU has a pretty good guide to get you started using the PBS system.  However, there are some other options which are exceptionally useful for the moderate user.  In particular, the ability to pass the current environment, set up email notification, and redirect output are handy things to be able to use and modify.  An example script and header are presented below:

--------------------------------------------------------------------------------
#!/bin/csh                              #Using the C-Shell
#PBS -N SimpleScript                    #Give the job this name
#PBS -M youremailhere@psu.edu           #A single user; send notification emails there
#PBS -m a                               #Send notification of aborts (a) only
#PBS -V                                 #Pass the current environment variables to the job
#PBS -l nodes=1:ppn=1                   #Request a single node, single core for this job
#PBS -l walltime=96:00:00               #Request a maximum wall time of 96 hours [HH:MM:SS format]
#PBS -o output/$PBS_JOBNAME.out         #Redirect STDOUT to ./output/$PBS_JOBNAME.out
#PBS -e error/$PBS_JOBNAME.err          #Redirect STDERR to ./error/$PBS_JOBNAME.err

env                                     #Echo the environment (variables)
cd $PBS_O_WORKDIR                       #PBS starts your job in your home directory; cd to the submit/work directory
echo -n CWD:; /bin/pwd                  #Echo the current working directory path
echo $PBS_JOBNAME is live...            #Print to STDOUT (really, the file declared above) that the job is live
sleep 30                                #Sleep for 30 seconds, then exit
--------------------------------------------------------------------------------

In this case, I've configured the job to be named "SimpleScript," to email the user "youremailhere@psu.edu" if the job aborts, to use the same environment as the one the qsub command was issued from, to request 1 node and 1 processor on that node, to allow a maximum wall time of 96 hours, and to redirect the error/output messages to separate directories under my working directory.  Clearly this is a very simple example: it prints some basic info, pauses, and exits.  If you were going to run a process or other program, you'd put your commands in place of the sleep command.  Still, it provides a copy-and-paste set of commonly used options that you can include in your own batch scripts.  In case you want to modify those options, there's a brief review of the most commonly changed ones below.  For a more complete list, head to the NCCS listing of common PBS options:

http://www.nccs.gov/computing-resources/phoenix/running-jobs/common-pbs-options/

Commonly Used Options:

These options can either be given on the command line, like so:
qsub -N SimpleScript -j oe <batchScriptFile>

or included in the batch script file using the #PBS directive, as in:
#PBS -N SimpleScript

Recall that you can mix and match options on the command line and in the batch script, but be aware that the command-line options override those in the batch file.

[-N] Name: Declares the name of the job.  It may be up to 15 characters in length and must consist of printable, non-whitespace characters, with the first character alphabetic.

[-o] Output File Path: Defines the path to, and name of, the file to which STDOUT will be redirected.

[-e] Error File Path: Defines the path to, and name of, the file to which STDERR will be redirected.

[-j] Join STD* streams: Declares if the standard error stream of the job will be merged with the standard output stream of the job.

An option argument value of oe directs that the two streams will be merged, intermixed, as standard output.  The path and name of the file can then be specified with the -o option.

An option argument value of eo directs that the two streams will be merged, intermixed, as standard error.  The path and name of the file can then be specified with the -e option.

If the join argument is n or the option is not specified, the two streams will be kept as two separate files.

[-V] Pass Environment: This option declares that all environment variables in the qsub command’s environment are to be exported to the batch job.

[-m] – mail options: Defines the set of conditions under which the execution server will send a mail message about the job.  The mail_options argument is a string which consists of either the single  character “n”, or one or more of the characters “a”, “b”, and “e”.

If the character “n” is specified, no mail will be sent.

For the letters “a”, “b”, and “e”:

  • a  mail is sent when the job is aborted by the batch system.
  • b  mail is sent when the job begins execution.
  • e  mail is sent when the job terminates.

If the -m option is not specified, mail will be sent if the job is aborted.

[-M] User List: A list of users to send email notifications to.  The user_list argument is of the form:

user[@host][,user[@host],…]
If  unset, the list defaults to the submitting user at the qsub host, i.e. the job owner.

[-l , ‘ell’] – resource_list: Defines resources required by the job and establishes a limit to the amount of resource that can be consumed.  The list can be of the form:
resource_name[=[value]][,resource_name[=[value]],…]
Common arguments for this flag are "walltime" and "nodes".  The walltime argument sets the wall clock limit for the job, in HH:MM:SS format.  Check with your sysadmin to see if there's a maximum limit on this time.  The nodes argument defines how many nodes the job will grab, and its ppn modifier sets how many cores per node; for example, -l nodes=2:ppn=8,walltime=24:00:00 requests two nodes with eight cores each for up to 24 hours.

[-a] – date_time: Declares the time after which the job is eligible for execution. The date_time argument is in the form:
[[[[CC]YY]MM]DD]hhmm[.SS]

Where  CC is the first two digits of the year (the century), YY is the second two digits of the year, MM is the two digits for the month, DD is the day of the month, hh is the hour, mm is the minute, and the optional SS is the seconds.

Environment Variables Available to Job:

You can use these variables in your scripts as though they already exist in your environment; PBS sets them up as soon as your job starts running.

PBS_O_WORKDIR – the absolute path of the current working directory of the qsub command. You must ‘cd’ to this directory if you want to work in the folder you submitted the job from.
PBS_JOBNAME – the job name supplied by the user.
PBS_O_HOST – the name of the host upon which the qsub command is running.
PBS_SERVER – the hostname of the pbs_server which qsub submits the job to.
PBS_O_QUEUE – the name of the original queue to which the job was submitted.
PBS_ARRAYID – each member of a job array is assigned a unique identifier (see -t)
PBS_ENVIRONMENT – set to PBS_BATCH to indicate the job is a batch job, or to PBS_INTERACTIVE to indicate the job is a PBS interactive job, see -I option.
PBS_JOBID – the job identifier assigned to the job by the batch system.
PBS_NODEFILE – the name of the file containing the list of nodes assigned to the job (for parallel and cluster systems). This file is particularly useful if you want to log in to remote machines for parallel debugging.
PBS_QUEUE – the name of the queue from which the job is executed.


File IO Within a Loop – Lessons Learned

Consider the following piece of code:

//Reads a NWS HLRDHM output file containing time series data - highly specialized
#include <cstdlib>     // exit()
#include <fstream>     // ifstream
#include <iostream>    // cerr, endl
#include <string>      // string
using namespace std;

void readOutputTimeSeries(int numFiles, string *filenames, double **timeSeries)
{
    string test, junk;

    //Loop through output files
    for (int file=0; file<numFiles; file++)
    {
        ifstream in;
        //Open the file
        in.open(filenames[file].c_str(), ios::in);

        //Test file stream
        if (!in)
        {
            cerr << "Error opening model output file! Filename: " << filenames[file] << endl;
            exit(1);
        }

        //Find the line beginning with RDHM OUTPUTS
        while (1)
        {
            in >> test;
            if (test == "RDHM")
            {
                in >> test;
                if (test == "OUTPUTS")
                {
                    //We're there - ignore the rest of this line and the next
                    in.ignore(1000,'\n');
                    in.ignore(1000,'\n');
                    break;
                }
            }
        }

        //Now, we're at the discharge data
        int i=0;
        while (!in.eof())
        {
            //TODO: This will need to get updated to handle multiple time series outputs
            in >> junk >> junk >> junk >> timeSeries[file][i];
            in.ignore(1000,'\n');
            i++;
        }
        in.close();
    }
    return;
}

Description:

This function loops through a set of output files in different folders, looks for the beginning of the needed data, and collects the data in a 2-dimensional array.

The Problem:

Does anybody see anything wrong with this…?

While this seemed to work fine on some HPC clusters, today I ended up spending a couple of hours dealing with "Error opening model output file!" messages.  Thinking that it might have something to do with the fact that I was running in an MPI environment, I spent a lot of time making sure that the path and file name being used were correct.  As it turns out, there was nothing wrong with the path at all.  Apparently (and this doesn't appear to be universal across systems), once the eof (end-of-file) flag gets set after reading through the first file in the loop, it does not get reset the second time through the loop, even though we are closing the file stream.  Hence, the "if (!in)" conditional will always test true, even if the file was opened successfully.

The solution:

Place an "in.clear()" statement immediately before the "in.close()" call in order to reset any error flags that were set on the file stream during the previous pass through the loop.
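
In other words, the end of the loop body becomes:

        //Reset any error flags (e.g. the eof flag) left over from reading this file,
        //so the next pass through the loop starts with a clean stream
        in.clear();
        in.close();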

You have a problem integrated into your MOEA, now what?

A lot of the material here has covered steps that you’ll need to do to “get started” using MOEAs — basic programming, some discussions about what a MOEA is and what it does and so forth.

The purpose of this post is to briefly discuss some additional questions/issues you should think about as you’re generating tradeoffs for your problem.  The discussion focuses on how to do some analysis “by hand” by writing codes for the public version of epsilon-NSGAII (or another algorithm).  While the new MOEAFramework coming out of our group will automate some of these processes, it’s still a good idea to know how they work or what you need to do when you’re running a study.

Random Seed Analysis

The initial population and operators (crossover, mutation) depend on random number generators.  Therefore, there’s a chance you could “get lucky” and generate really good individuals in the first generation.  Or it just so happens that the crossover and mutation cause you to find great solutions really quickly.  How can you ensure that your algorithm results are not just dependent on “luck”?  Replicate the experiment!

Most random number generators allow you to "seed" them, meaning that they will produce a predictable but "random"-looking string of numbers once started from that initial seed.  This is great for testing, because repeating a model run will give exactly the same results every time, provided that the random seed is the same and that the random number generator controls all stochasticity in the model.
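
As a minimal sketch (C++11, not tied to any particular MOEA code), seeding a generator explicitly looks like this; running the program twice with the same seed prints the same numbers:

#include <iostream>
#include <random>

int main()
{
    unsigned int seed = 42;                                   // change this for each replicate
    std::mt19937 generator(seed);                             // Mersenne Twister, seeded explicitly
    std::uniform_real_distribution<double> uniform(0.0, 1.0);

    // Re-running this program with the same seed prints exactly the same numbers;
    // changing the seed gives a different (but still reproducible) sequence.
    for (int i = 0; i < 5; i++)
        std::cout << uniform(generator) << std::endl;

    return 0;
}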

Scripting to support random seed analysis, experimentation

Oh great, you've told me to replicate my entire optimization run!  How do I do this efficiently without going crazy over so many output files?  I highly recommend writing scripts in your favorite language (bash, C++, Python, etc.) to automate the process of copying files for random seed analysis or parameterization.  Some frameworks will do this for you automatically.  But if they don't, here's pseudocode for what you need to do:

  • Input the number of seeds or number of algorithm runs, etc.  Say we're doing 50 replicates of the same MOEA test.
  • Create 50 folders in your job folder.  In C++, loop over i = 1 to 50 and run something like: command_string = "mkdir " + to_string(i); system(command_string.c_str()); (see the sketch after this list).
  • Now, copy the files you need into each of these folders.  Again, do this in a loop.
  • If you need to, modify the input files for each run in each folder.  For example, you can replace the random seed number with your loop iterator (i) in each file.
  • Finally, submit each job!  One note about using the system command in C++ in a Linux environment: system actually spawns a new shell every time you call it, so if you need to use "cd", place the cd in the same string as the command you need to call.  More simply, here's an example.

This will not work:

system("cd myFolder");
system("qsub myScript.sh");

because after the first command, the shell that ran the "cd" is already gone.

This, however, will work:

system("cd myFolder; qsub myScript.sh");

If you do 50 random seed replicates of the algorithm, you'll have 50 approximations to the Pareto set as an output of the process.  You can sort the results together to get a reference set for your problem.  More on that in a later post.
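
Below is a rough C++ sketch of the whole process.  The folder names, input file names, and submission script are placeholders, so adjust them to your own setup:

#include <cstdlib>
#include <fstream>
#include <string>
using namespace std;

int main()
{
    int numSeeds = 50;
    for (int i = 1; i <= numSeeds; i++)
    {
        string folder = "run_" + to_string(i);

        // (1) create a folder for this replicate and copy the input files into it
        system(("mkdir " + folder).c_str());
        system(("cp input.txt myScript.sh " + folder).c_str());

        // (2) write this replicate's random seed into a small parameter file
        ofstream seedFile((folder + "/seed.txt").c_str());
        seedFile << i << endl;
        seedFile.close();

        // (3) submit the job -- the cd and qsub share one system() call,
        //     since each system() call spawns a fresh shell
        system(("cd " + folder + "; qsub myScript.sh").c_str());
    }
    return 0;
}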

How long do I run the algorithm?

For an analytical test problem, it's easy to tell "when you're done," because you know the optimal value or tradeoff surface.  Real-world problems rarely have such a clear stopping point, though, so we need to think creatively about how long to run the solver.  Animations using AeroVis or other software can help with this: look at the population every generation, or at some regular interval of generations, to see how the search progresses over time.  Does the algorithm stall?  Or do you quickly find better performance than you've ever seen?  Is the progress meaningful?  For example, if you're optimizing cost and your budget is 100 units, but the difference in cost across the results is only, say, 3 units, you may not consider that improvement significant.

Should I change my formulation?

In Kasprzyk et al. [2012], we extended some work in operations research that showed that problem formulations are constantly changing as we solve problems and learn more about our systems.  In a very basic way, solving a problem with an initial formulation allows you to figure out which objectives are important and which ones you may want to remove.  Some questions:

  • Do all objectives conflict with one another?  If there is a “collapse” in dimensions, you can satisfy two or more objectives with the same point.
  • But, that point may perform really poorly with respect to another objective.  You may not necessarily want to get rid of objectives that collapse, but you could use this as a lesson for which objectives are most important in your system.
  • Should you add objectives to increase fidelity in the system?  For example, if you use capital cost, you may also want to consider operating cost, or some other measure of efficiency for the system.

Parallelizing the algorithm

To use an example we're working on currently: there is an infrastructure model that takes about 7 seconds to run per evaluation.  This will severely limit our ability to do long searches if we don't have a way to parallelize the evaluations.

Using our example as a point of discussion, some thoughts:

  • We're using a very loosely coupled simulation model within the algorithm.  At each function evaluation, the algorithm generates an input file, spawns a system shell that runs the model executable, and then reads the objectives from the model's output.
  • Parallelization strategies often hinge on whether memory is shared between worker nodes or not.  MPI (Message Passing Interface) is typically used for scientific codes in which each worker node has its own memory and the nodes pass messages containing data between one another.
  • But on most parallel systems the filesystem is shared.  That makes it difficult to lock the ASCII input files, since any processor could theoretically access the files at any time.
  • So, a simple way to handle this is to give each processor its own "scratch" space in which it can create new input files, run its own copy of the executable, let the executable create output files, and then read those output files.  The traditional MPI evolutionary algorithm can still run; the only difference is that each worker node has its own copy of the data to work from.
  • We can create folders containing the data, with each folder named after the index of the worker processor that will be using it (see the sketch below).
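
As a rough sketch of that last idea (the folder and file names, the model flags, and the omitted MOEA loop are all placeholders):

#include <mpi.h>
#include <cstdlib>
#include <string>
using namespace std;

// Each worker evaluates solutions inside its own scratch folder, so parallel file
// I/O from different ranks never touches the same input or output files.
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Create a scratch folder named after this rank and copy the model data into it
    string scratch = "scratch_" + to_string(rank);
    system(("mkdir -p " + scratch).c_str());
    system(("cp -r modelData " + scratch).c_str());

    // ...the usual master-worker MOEA loop goes here; whenever this rank needs a
    // function evaluation, it writes the input file, runs the executable, and
    // reads the output files entirely inside its own folder, for example:
    system(("cd " + scratch + "; ./theModel.exe -flags arguments").c_str());

    MPI_Finalize();
    return 0;
}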