Debugging the NWS model: lessons learned

Josh and I have been setting up a Sobol sensitivity experiment using the National Weather Service’s HLRDHM model. This is a distributed rainfall-runoff model which uses the Sacramento structure for soil moisture accounting along with a separate routing model. We ran into some problems which may be helpful for people in the future, particularly if you find yourself inheriting a large code base from elsewhere. (All of the code here will be in C/C++, but the takeaways are generally applicable).

In general, the goal here is to write wrappers around the model to perform Sobol sampling and analysis without touching the model code itself. This approach would give us a portable method for similar analyses in the future, and would (hopefully) avoid conflicts if the NWS decided to update the model. Josh initially accomplished this by compiling the model into a separate executable, and calling it from the wrapper using the system command:

system("./theModel.exe -flags arguments");

This method should work in theory, and we were able to get some results this way. However, Josh noticed that many of the clusters would not allow calls to system(). We are still not sure why this is—our suspicion is that it has something to do with security, but we haven’t been able to confirm this.

The next idea was to compile the model into a shared library. This involves three fairly short steps: (1) renaming the model’s main() function to something else (such as model_driver(), for example), (2) compiling the model into a shared library, and (3) compiling the wrapper to link against this library in its makefile. (The specifics of this setup could be covered in a separate post). Then, in your wrapper, you could call:

model_driver(arguments);

The shared library approach should also, in theory, work. It did run properly, but we found that the time required to run each simulation was increasing linearly throughout the experiment(!) This was a strange and unacceptable result and required some investigation.

We knew that the model is typically run as a single executable. This means that when it finishes running, it will clear all of its memory before it runs again. However, when we use a shared library, all of the global variables in the model will save their state between calls to model_driver(). This is a key point worth emphasizing. A variable declared inside a function will be cleared when it goes out of scope (i.e., when the function exits), whereas variables declared globally will not. For this reason, we suspected that the slowdown issue was due to global variables not being cleared properly between model runs. But where to start looking?

Fortunately, we did not have to dig very deep to find the cause of the problem. The main function in the model basically calls a number of other functions in sequence, so the first step in this debugging process was to place timers around these different functions to identify which one was causing the slowdown. In C++, you can set up basic timers to print to stdout as follows:

// Create global timer macros to simplify things
#ifndef START_TIME
#define START_TIME clock()
#define GET_TIME(start) ((clock()-start)/(double)CLOCKS_PER_SEC)
#endif
double timer;
// ....start your main function, etc.
// Then, to record the timing of the function foo(), you do the following:
timer = START_TIME;
foo(bar);
timer = GET_TIME(timer);
cout << "Time to run foo(): " << timer << endl;

This is a very informative (if inelegant) way to learn about how your program is running. In the NWS model, this identified only one function as the cause of the slowdown—what luck! The offending function was:

hlrms_algo(deck, allPixels, factory, inputBeforeLoopFact, inputInsideLoopFact, inputAfterLoopFact);

As it turns out, most of the parameters being passed into this function are actually global variables. (Why would one need to pass global variables into a function, you ask? Good question.) A quick search in Visual Studio for the definitions of these variables indicated that they are of type map from the C++ standard library (a map object is similar to a vector, but instead of ordered values, a map contains unordered key-value pairs).

At this point, all of this evidence suggests the following conclusion: one or more of these global map objects are not being cleared properly between model runs. A quick search of STL documentation will tell you that map has a function clear() which removes all elements and, importantly, calls the destructors of those elements. Testing this conclusion, then, was just a matter of adding clear() functions after each model run:

hlrms_algo(deck, allPixels, factory, inputBeforeLoopFact, inputInsideLoopFact, inputAfterLoopFact);

deck.clear();
allPixels.clear();
inputBeforeLoopFact.clear();
inputInsideLoopFact.clear();
inputAfterLoopFact.clear();

And just like that, the slowdown problem disappears. Each simulation now takes approximately the same amount of time, because these global variables are being cleaned up after each run.

A more curious programmer might have some additional questions here. Which of these variables was causing the slowdown? What exactly do these variables do? Is map really the best choice of data structure for them? But remember, our goal is not to rewrite the NWS model—we just want to run our experiments on the model as it stands in a portable way. Now we can proceed with the “shared library” approach and get some results.

Advertisements

3 thoughts on “Debugging the NWS model: lessons learned

  1. Yes, clear() was the only fix needed. The additional “problem” yesterday was caused by my improper use of timing variables and was really not a problem at all.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s