Getting started with C and C++

I’ve been learning C and C++ recently and I thought I’d share my experience learning these languages in a post. Prior to learning C and C++, I had experience in Python and Matlab, but this was my first foray into lower-level languages. In my attempts to learn each language I made my way through several courses and books available online; some I found very helpful, others not so much. In this post I’ll detail my experiences with each resource and provide some tips that may help those planning to learn these languages.

Main takeaways

To learn the languages, I used four main resources: online courses from Lynda.com, a book titled Learn C the Hard Way, a book commonly known as K&R 2, and tutorials from cplusplus.com. For those who do not have the time or desire to read this whole post, I found the following resources to be the most useful:

For C:

Learn C the Hard Way

K&R 2

For C++:

Foundations of Programming: Object-Oriented Design (you need a Lynda.com login to access)

Up and Running with C++ (again, you need a Lynda.com login to access)

cplusplus.com

Everyone’s learning style is different, but I found that I learned the languages much faster by coding examples myself, rather than watching someone walk through a script. I also found that courses that taught straight from the command line were more effective than courses that taught through an IDE. When using an IDE, I often found myself spending more time working through glitches or nuances within the IDE than learning the languages themselves.

I’ll detail my experiences with each resource below. I’ll start with C resources, then discuss C++.

Resources for Learning C

C Essential Training – From Lynda.com

I started my training by taking a course on Lynda.com titled “C Essential Training”. Lynda.com is an online educational website with thousands of videos, many of which focus on programming. The service is free to Cornell students and graduate students (though I checked and unfortunately neither PSU, UC Davis nor CU Boulder have agreements with the site). I found the course to be well structured and I felt that the instructor presented the material clearly and understandably. Despite this, I do not feel that the course did an effective job of teaching me C. The main problem I had with the course was its reliance on the Eclipse IDE. Eclipse seems like a fine IDE, but I don’t plan to use an IDE when writing code and I didn’t want to spend the time taking a separate course to learn its intricacies (though Lynda.com does have a full course devoted to Eclipse). Throughout the course, I kept running into small Eclipse problems (e.g. not being able to change the project I was working on, or compiler errors) that were not hard to solve, but were never discussed in the lectures. I was able to solve each problem by doing some research online, but each little problem took time to resolve and was mentally taxing. After spending 30 minutes looking up an Eclipse fix, I was not in the mood to go troubleshooting interesting C questions. Another problem with using Eclipse is that the user is never forced to write their own makefiles, an omission that could really hurt someone who plans to run C programs through the command line. In summary, I would not recommend taking this course unless you are either thoroughly versed in Eclipse or plan to write all of your code through Eclipse.

Learn C the Hard Way

The next resource I used to learn C was a book that Jazmin pointed me to called Learn C the Hard Way by Zed A. Shaw (after some poking around I found this had been mentioned previously on this blog). The book is laid out as a tutorial, where each chapter teaches a new C concept (it’s really a C course in book form). The author takes a slightly nontraditional teaching approach in that he makes you write the code first, then explains in detail what you just wrote. I found this hands-on teaching method extremely helpful. When I wrote the code myself, I was forced to focus on every detail of the code (something that is very important in a language like C). I was also able to discover which concepts were genuinely challenging for me and needed more work. When I watched the Lynda.com lectures, I’d often believe I understood a concept, only to find out later that I had misinterpreted the instructor’s lesson.

The book does not use an IDE; instead, you write code in a text editor (I used Sublime Text) and run it from the Unix command line. The author provides a succinct introduction to makefiles and how to use them, which was refreshing after the Eclipse-based course that never mentioned makefiles or compilers.
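
For readers who have never seen one, a bare-bones makefile for a single-file C program might look something like the sketch below (the “ex1” names are placeholders of my own, not from the book; note that recipe lines must be indented with a tab):

# Minimal makefile sketch: build "ex1" from ex1.c with warnings and debug info
CFLAGS = -Wall -g

all: ex1

ex1: ex1.c
	cc $(CFLAGS) -o ex1 ex1.c

clean:
	rm -f ex1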

Overall I found the teaching method employed by the book to be very effective, and I would highly recommend it to new C users. I should note, however, that there seems to be some controversy surrounding the book. If you google “Learn C the Hard Way” you’ll find some very heated exchanges between the author and a blogger who criticized the book’s teaching methodology. The blogger had two main criticisms of the book: first, that it oversimplified and inaccurately presented key C concepts; and second, that the author failed to teach accepted programming standards for the C language. Mr. Shaw’s rebuttal was that the book’s purpose is to get people comfortable with C and actually coding in it, and then, once they are literate, have them go back and learn more about the language’s nuances. I personally agree with Mr. Shaw on this point, though I don’t have a background in computer science, so my opinion is only that of a beginner. Many of the criticisms of the book seemed to come from the perspective of an advanced coder who is unable to see the language through the eyes of a beginner. Mr. Shaw’s explanations might be oversimplified, but they do a good job demystifying many of the most foreign aspects of C. I think that use of this book should be supplemented with other sources, especially documents on accepted C coding standards, but if you’re looking for a quick way to get on your feet with C and gain some confidence, the book is a great resource.

I used a free beta version of the book, which can be found here: http://c.learncodethehardway.org/book/
You can also purchase the book from the author here: https://www.amazon.com/Learn-Hard-Way-Practical-Computational/dp/0321884922

I found the beta version to be just fine, but there were some minor errors and some sections were clearly under construction.

The blog post criticizing the book can be found here: http://hentenaar.com/dont-learn-c-the-wrong-way

K&R 2

A resource that I discovered through reading the exchanges between Shaw and his critics was “The C Programming Language” by Brian W. Kernighan and Dennis M. Ritchie (commonly referred to as K&R 2, which is what I’ll call it for the rest of the post). One of the book’s authors, Dennis Ritchie, actually created the C language, and the book is regarded as the go-to authority on all matters C. Mr. Shaw devoted a whole chapter of “Learn C the Hard Way” to bashing this book, but I found its layout and explanations quite accessible and useful. I did not find the tutorials as direct as those in “Learn C the Hard Way”, but it made a helpful supplement.


Resources for Learning C++

Foundations of Programming: Object-Oriented Design – From Lynda.com

A main difference between C and C++ is that C++ is an object-oriented language. I had some very basic experience with object-oriented programming, but was looking for a refresher before learning C++. “Foundations of Programming: Object-Oriented Design” was an excellent course that taught me all I wanted to know about object-oriented programming and more. The course is purely conceptual and does not teach any actual code or contain any practice problems. It presents the concepts in a simple yet comprehensive manner that I found very helpful. I would highly recommend this course to anyone hoping to learn or brush up on their knowledge of how object-oriented programming works.

Up and Running with C++ – From Lynda.com

This course was very similar in layout to the C course from Lynda.com, and I have the same criticisms. The entire course used Eclipse, and I kept having minor problems that were never addressed by the lectures but prevented me from executing my code. I did feel like I was able to learn the basic tools I needed from the lectures, but I would have gotten much more out of the class if it had been taught through the command line. I also felt that the course was sparse on exercises and heavy on lectures. I got much less out of watching the instructor write code than I would have from writing the code myself (as Learn C the Hard Way forces you to do).

cplusplus.com

This resource is linked often in older posts on this blog, and I found it helpful in answering C++ questions I had after finishing the Lynda.com courses. I did not find its tutorial to be the most helpful guide for learning C++ from scratch, but it has very succinct definitions of many C++ components and was helpful as a reference. I think this site is the one I will look to most when troubleshooting future C++ code.

Final thoughts

I’d like to note that I found WRASEMAN’s post on makefiles a few weeks back to be quite helpful. From my limited experience, ensuring that your code compiles correctly can be one of the most challenging parts of using a lower-level language, and the post has some excellent resources that explain what makefiles are and how they can be used.

I know there are a lot of contributors and readers of this blog who are much more versed in C and C++ than I am, so if you’d like to chime in on useful resources, please do so in the comments.


Using HDF5/zlib Compression in NetCDF4

Not too long ago, I posted an entry on writing NetCDF files in C and loading them in R.  In that post, I mentioned that the latest and greatest version of NetCDF includes HDF5/zlib compression, but I didn’t say much more beyond that.  In this post, I’ll explain briefly how to use this compression feature in your NetCDF4 files.

Disclaimer: I’m not an expert in any sense on the details of compression algorithms.  For more details on how HDF5/zlib compression is integrated into NetCDF, check out the NetCDF Documentation.  Also, I’ll be assuming that the NetCDF4 library was compiled on your machine to enable HDF5/zlib compression.  Details on building and installing NetCDF from source code can be found in the documentation too.

I will be using code similar to what was in my previous post.  The code generates three variables (x, y, z) each with 3 dimensions.  I’ve increased the size of the dimensions by an order of magnitude to better accentuate the compression capabilities.

  // Loop control variables
  int i, j, k;
  
  // Define the dimension sizes for
  // the example data.
  int dim1_size = 100;
  int dim2_size = 50;
  int dim3_size = 200;
  
  // Define the number of dimensions
  int ndims = 3;
  
  // Allocate the 3D vectors of example data
  float x[dim1_size][dim2_size][dim3_size]; 
  float y[dim1_size][dim2_size][dim3_size];
  float z[dim1_size][dim2_size][dim3_size];
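  // Note: these local arrays live on the stack; at 100 x 50 x 200 floats each
  // (roughly 4 MB apiece, about 12 MB total), they can exceed a typical 8 MB
  // default stack limit, so heap allocation may be safer for larger dimensions.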
  
  // Generate some example data
  for(i = 0; i < dim1_size; i++) {
        for(j = 0; j < dim2_size; j++) {
                for(k = 0; k < dim3_size; k++) {
                        x[i][j][k] = (i+j+k) * 0.2;
                        y[i][j][k] = (i+j+k) * 1.7;
                        z[i][j][k] = (i+j+k) * 2.4;
                }
        }
  }

Next, set up the various IDs, create the NetCDF file, and apply the dimensions to the NetCDF file.  This has not changed since the last post.

  // Allocate space for netCDF dimension ids
  int dim1id, dim2id, dim3id;
  
  // Allocate space for the netcdf file id
  int ncid;
  
  // Allocate space for the data variable ids
  int xid, yid, zid;
  
  // Setup the netcdf file
  int retval;
  if((retval = nc_create(ncfile, NC_NETCDF4, &ncid))) { ncError(retval); }
  
  // Define the dimensions in the netcdf file
  if((retval = nc_def_dim(ncid, "dim1_size", dim1_size, &dim1id))) { ncError(retval); }
  if((retval = nc_def_dim(ncid, "dim2_size", dim2_size, &dim2id))) { ncError(retval); }
  if((retval = nc_def_dim(ncid, "dim3_size", dim3_size, &dim3id))) { ncError(retval); }
  
  // Gather the dimids into an array for defining variables in the netcdf file
  int dimids[ndims];
  dimids[0] = dim1id;
  dimids[1] = dim2id;
  dimids[2] = dim3id;

Here’s where the magic happens.  The next step is to define the variables in the NetCDF file.  The variables must be defined in the file before you can tag them for compression.

  // Define the netcdf variables
  if((retval = nc_def_var(ncid, "x", NC_FLOAT, ndims, dimids, &xid))) { ncError(retval); }
  if((retval = nc_def_var(ncid, "y", NC_FLOAT, ndims, dimids, &yid))) { ncError(retval); }
  if((retval = nc_def_var(ncid, "z", NC_FLOAT, ndims, dimids, &zid))) { ncError(retval); }

Now that we’ve defined the variables in the NetCDF file, let’s tag them for compression.

  // OPTIONAL: Compress the variables
  int shuffle = 1;
  int deflate = 1;
  int deflate_level = 4;
  if((retval = nc_def_var_deflate(ncid, xid, shuffle, deflate, deflate_level))) { ncError(retval); }
  if((retval = nc_def_var_deflate(ncid, yid, shuffle, deflate, deflate_level))) { ncError(retval); }
  if((retval = nc_def_var_deflate(ncid, zid, shuffle, deflate, deflate_level))) { ncError(retval); }

The function nc_def_var_deflate() performs this.  It takes the following parameters:

  • int ncid – The NetCDF file ID returned from the nc_create() function
  • int varid – The variable ID associated with the variable you would like to compress.  This is returned from the nc_def_var() function
  • int shuffle – Enables the shuffle filter before compression.  Any non-zero integer enables the filter.  Zero disables the filter.  The shuffle filter rearranges the byte order in the data stream to enable more efficient compression. See this performance evaluation from the HDF group on integrating a shuffle filter into the HDF5 algorithm.
  • int deflate – Enable compression at the compression level indicated in the deflate_level parameter.  Any non-zero integer enables compression.
  • int deflate_level – The level to which the data should be compressed.  Levels are integers in the range [0-9].  Zero results in no compression whereas nine results in maximum compression.
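
If you want to sanity-check that these settings took effect, the library also provides nc_inq_var_deflate(), which reads the three settings back for a given variable. A quick check, reusing the ncid and xid from above, might look like this:

  // OPTIONAL: read back and print the compression settings for 'x'
  int shuffle_in, deflate_in, level_in;
  if((retval = nc_inq_var_deflate(ncid, xid, &shuffle_in, &deflate_in, &level_in))) { ncError(retval); }
  printf("x: shuffle=%d, deflate=%d, level=%d\n", shuffle_in, deflate_in, level_in);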

The rest of the code doesn’t change from the previous post.

  // OPTIONAL: Give these variables units
  if((retval = nc_put_att_text(ncid, xid, "units", 2, "cm"))) { ncError(retval); }
  if((retval = nc_put_att_text(ncid, yid, "units", 4, "degC"))) { ncError(retval); }
  if((retval = nc_put_att_text(ncid, zid, "units", 1, "s"))) { ncError(retval); }
  
  // End "Metadata" mode
  if((retval = nc_enddef(ncid))) { ncError(retval); }
  
  // Write the data to the file
  if((retval = nc_put_var(ncid, xid, &x[0][0][0]))) { ncError(retval); }
  if((retval = nc_put_var(ncid, yid, &y[0][0][0]))) { ncError(retval); }
  if((retval = nc_put_var(ncid, zid, &z[0][0][0]))) { ncError(retval); }
  
  // Close the netcdf file
  if((retval = nc_close(ncid))) { ncError(retval); }

So the question now is whether or not it’s worth compressing your data.  I performed a simple experiment with the code presented here and the resulting NetCDF files:

  1. Generate the example NetCDF file from the code above using each of the available compression levels.
  2. Time how long the code takes to generate the file.
  3. Note the final file size of the NetCDF.
  4. Time how long it takes to load and extract data from the compressed NetCDF file.

Below is a figure illustrating the results of the experiment (points 1-3).

[Figure: compress_plot (file generation time and file size for each compression level)]

Before I say anything about these results, note that individual results may vary.  I used a highly stylized data set to produce the NetCDF file, which likely benefits greatly from the shuffle filtering and compression.  These results show a 97%–99% reduction from the original file size.  While the run time did increase, it barely made a difference until hitting the highest compression levels (8, 9).  As for point 4, there was only a small difference in load/read times (0.2 seconds) between the uncompressed file and any of the compressed files (using ncdump and the ncdf4 package in R).  There’s no noticeable difference among the load/read times for any of the compressed NetCDF files.  Again, this could be a result of the highly stylized data set used as an example in this post.

For something more practical, I can only offer anecdotal evidence about the compression performance.  I recently included compression in my current project due to the potentially large number of multiobjective solutions and states-of-the-world (SOW).  The uncompressed file my code produced was on the order of 17.5 GB (for 300 time steps, 1000 SOW, and about 3000 solutions).  I enabled compression of all variables (11 variables – 5 with three dimensions and 6 with two dimensions – compression level 4).  The next run produced just over 7000 solutions, but the compressed file size was 9.3 GB.  The downside is that it took nearly 45 minutes to produce the compressed file, as opposed to 10 minutes for the previous run.  There are many things that could factor into these differences that I did not control for, but the results are promising…if you’ve got the computer time.

I hope you found this post useful in some fashion.  I’ve been told that compression performance can be increased if you also “chunk” your data properly.  I’m not too familiar with chunking data for writing in NetCDF files…perhaps someone more clever than I can write about this?
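
As a starting point for anyone who wants to experiment with chunking: chunk sizes are set with nc_def_var_chunking(), which is called in define mode on a per-variable basis, just like nc_def_var_deflate(). A minimal sketch (the chunk sizes here are arbitrary placeholders, not a recommendation):

  // OPTIONAL: set explicit chunk sizes for 'x' (one dim1 slice per chunk)
  size_t chunk_sizes[3] = {1, 50, 200};
  if((retval = nc_def_var_chunking(ncid, xid, NC_CHUNKED, chunk_sizes))) { ncError(retval); }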

Acknowledgement:  I would like to acknowledge Jared Oyler for his insight and helpful advice on some of the more intricate aspects of the NetCDF library.

From Writing NetCDF Files in C to Loading NetCDF Files in R

So much data from such little models…

It’s been my experience that even simple models can generate lots of data. If you’re a regular reader of this blog, I can imagine you’ve had similar experiences as well. My most recent experience with this is the work I’ve done with the Dynamic Integrated Climate-Economy (DICE) model. I had inherited a port of the 2007 version of the model, which would print relevant output to the screen. During my initial runs with the model, I would simply redirect the output to ASCII files for post-processing. I knew that eventually I would be adding all sorts of complexity to this model, ultimately leading to high-dimensional model output and rendering the use of ASCII files impractical. I knew that I would need a better way to handle all this data. So in updating the model to the 2013 version, I decided to incorporate support for netCDF file generation.

You can find details about the netCDF file format through Unidata (a University Corporation for Atmospheric Research [UCAR] Community Program) and through some of our previous blog posts (here, here, and here). What’s important to note here is that netCDF is a self-describing file format designed to manage high-dimensional hierarchical data sets.

I had become accustomed to netCDF files in my previous life as a meteorologist. Output from complex numerical weather prediction models would often come in netCDF format. While I had never needed to generate my own netCDF output files, I found it incredibly easy and convenient to process them in R (my preferred post-processing and visualization software). Trying to incorporate netCDF output support in my simple model seemed daunting at first, but after a few examples I found online and a little persistence, I had netCDF support incorporated into the DICE model.

The goal of this post is to guide you through the steps to generate and process a netCDF file. Some of our earlier posts go through a similar process using the Python and Matlab interfaces to the netCDF library. While I use R for post-processing, I generally use C/C++ for the modeling; thus I’ll step through generating a netCDF file in C and processing the generated netCDF file in R on a Linux machine.

Edit:  I originally put a link to the following code at the bottom of this post.  For convenience, here’s a link to the bitbucket repository that contains the code examples below.

Writing a netCDF file in C…

Confirm netCDF installation

First, be sure that netCDF is installed on your computing platform. Most scientific computing clusters will have the netCDF library already installed. If not, contact your system administrator to install the library as a module. If you would like to install it yourself, Unidata provides the source code and great documentation to step you through the process. The example I provide here isn’t all that complex, so any recent version (4.0+) should be able to handle this with no problem.

Setup and allocation

Include the header files

With the netCDF libraries installed, you can now begin to code netCDF support into your model. Again, I’ll be using C for this example. Begin by including the netCDF header file with your other include statements:

#include <stdlib.h>
#include <stdio.h>
#include <netcdf.h>

Setup an error handler

The netCDF library includes a nice way of handling possible errors from the various netCDF functions. I recommend writing a simple wrapper function that can take the returned values of the netCDF functions and produce the appropriate error message if necessary:

void ncError(int val)
{
  printf("Error: %s\n", nc_strerror(val));
  exit(2);
}

Generate some example data

Normally, your model will have generated important data at this point. For the sake of the example, let’s generate some data to put into a netCDF file:

  // Loop control variables
  int i, j, k;
  
  // Define the dimension sizes for
  // the example data.
  int dim1_size = 10;
  int dim2_size = 5;
  int dim3_size = 20;
  
  // Define the number of dimensions
  int ndims = 3;
  
  // Allocate the 3D vectors of example data
  float x[dim1_size][dim2_size][dim3_size]; 
  float y[dim1_size][dim2_size][dim3_size];
  float z[dim1_size][dim2_size][dim3_size];
  
  // Generate some example data
  for(i = 0; i < dim1_size; i++) {
      for(j = 0; j < dim2_size; j++) {
          for(k = 0; k < dim3_size; k++) {
              x[i][j][k] = (i+j+k) * 0.2;
              y[i][j][k] = (i+j+k) * 1.7;
              z[i][j][k] = (i+j+k) * 2.4;
          }
      }
  }

This generates three variables, each with three different size dimensions. Think of this, for example, as variables on a 3-D map with dimensions of [latitude, longitude, height]. In my modeling application, my dimensions were [uncertain state-of-the-world, BORG archive solution, time].

Allocate variables for IDs

Everything needed in creating a netCDF file depends on integer IDs, so the next step is to allocate variables for the netCDF file id, the dimension ids, and the variable ids:

// Allocate space for netCDF dimension ids
int dim1id, dim2id, dim3id;
  
// Allocate space for the netcdf file id
int ncid;
  
// Allocate space for the data variable ids
int xid, yid, zid;

Each one of these IDs will be returned through reference by the netCDF functions. While we’re at it, let’s make a variable to hold the return status of the netCDF function calls:

// Allocate return status variable
int retval;

Define the meta-data

Now we will start to build the netCDF file. This is a two-part process. The first part is defining the meta-data for the file and the second part is assigning the data.

Create an empty netCDF file

First, create the file:

// Setup the netcdf file
if((retval = nc_create("example.nc", NC_NETCDF4, &ncid))) { ncError(retval); }

Note that we store the return status of the function call in retval and test the return status for an error.  If there’s an error, we pass retval to our error handler.  The first parameter to the function call is the name of the netCDF file.  The second parameter is a flag that determines the type of netCDF file.  Here we use the latest-and-greatest NC_NETCDF4 type, which includes the HDF5/zlib compression features.  If you don’t need these features, or you need a version compatible with older versions of netCDF libraries, then use the default or 64-bit offset (NC_64BIT_OFFSET) versions.  The third parameter is the netCDF integer ID used for assigning variables to this file.
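
For example, if you needed a file readable by pre-4.0 tools, the call might instead look like the following sketch (at the cost of losing the compression features mentioned above):

// Create a 64-bit offset (classic model) file instead of a netCDF-4 file
if((retval = nc_create("example.nc", NC_64BIT_OFFSET, &ncid))) { ncError(retval); }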

Add the dimensions

Now that we have a clean netCDF file to work with, let’s add the dimensions we’ll be using:

 // Define the dimensions in the netcdf file
 if((retval = nc_def_dim(ncid, "dim1_size", dim1_size, &dim1id))) { ncError(retval); }
 if((retval = nc_def_dim(ncid, "dim2_size", dim2_size, &dim2id))) { ncError(retval); }
 if((retval = nc_def_dim(ncid, "dim3_size", dim3_size, &dim3id))) { ncError(retval); }
  
 // Gather the dimids into an array for defining variables in the netcdf file
 int dimids[ndims];
 dimids[0] = dim1id;
 dimids[1] = dim2id;
 dimids[2] = dim3id;

Just as before, we catch and test the function return status for any errors. The function nc_def_dim() takes four parameters. First is the netCDF file ID returned when we created the file. The second parameter is the name of the dimension. Here we’re using “dimX_size” – you would want to use something descriptive of this dimension (i.e. latitude, time, solution, etc.). The third parameter is the size of this dimension (i.e. number of latitude, number of solutions, etc.). The last is the ID for this dimension, which will be used in the next step of assigning variables. Note that we create an array of the dimension IDs to use in the next step.

Add the variables

The last step in defining the meta-data for the netCDF file is to add the variables:

// Define the netcdf variables
if((retval = nc_def_var(ncid, "x", NC_FLOAT, ndims, dimids, &xid))) { ncError(retval); }
if((retval = nc_def_var(ncid, "y", NC_FLOAT, ndims, dimids, &yid))) { ncError(retval); }
if((retval = nc_def_var(ncid, "z", NC_FLOAT, ndims, dimids, &zid))) { ncError(retval); }

The nc_def_var() function takes 6 parameters. These include (in order) the netCDF file ID, the variable name to be displayed in the file, the type of data the variable contains, the number of dimensions of the variable, the IDs for each of the dimensions, and the variable ID (which is returned through reference). The type of data in our example is NC_FLOAT, which is a 32-bit floating point. The netCDF documentation describes the full set of data types covered. The IDs for each dimension are passed as that combined array of dimension IDs we made earlier.

Optional: Add variable attributes

This part is optional, but is incredibly useful and true to the spirit of making a netCDF file. When sharing a netCDF file, the person receiving the file should have all the information they need about the data within the file itself. This can be done by adding “attributes”. For example, let’s add a “units” attribute to each of the variables:

 // OPTIONAL: Give these variables units
 if((retval = nc_put_att_text(ncid, xid, "units", 2, "cm"))) { ncError(retval); }
 if((retval = nc_put_att_text(ncid, yid, "units", 4, "degC"))) { ncError(retval); }
 if((retval = nc_put_att_text(ncid, zid, "units", 1, "s"))) { ncError(retval); }

The function nc_put_att_text() puts a text-based attribute onto a variable. The function takes the netCDF ID, the variable ID, the name of the attribute, the length of the string of characters for the attribute, and the text associated with the attribute. In this case, we’re adding an attribute called “units”. Variable ‘x’ has units of “cm”, which has a length of 2. Variable ‘y’ has units of “degC”, which has a length of 4 (and so on). You can apply text-based attributes as shown here or numeric-based attributes using the appropriate nc_put_att_X() function (see the documentation for the full list of numeric attribute functions). You can also attach attributes to a dimension indirectly by defining a coordinate variable with the same name as the dimension and putting the attributes on that variable, or set a global attribute by passing the constant NC_GLOBAL in place of a variable ID.
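
As a quick sketch of those two variations (the attribute names “valid_max” and “title” are only examples I made up for illustration):

 // OPTIONAL: a numeric attribute on variable 'x'
 float x_max = 100.0f;
 if((retval = nc_put_att_float(ncid, xid, "valid_max", NC_FLOAT, 1, &x_max))) { ncError(retval); }

 // OPTIONAL: a global attribute, using NC_GLOBAL in place of a variable ID
 if((retval = nc_put_att_text(ncid, NC_GLOBAL, "title", 12, "Example data"))) { ncError(retval); }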

End the meta-data definition portion

At this point, we’ve successfully created a netCDF file and defined the necessary meta-data. We can now end the meta-data portion:

 // End "Metadata" mode
 if((retval = nc_enddef(ncid))) { ncError(retval); }

…and move on to part 2 of the netCDF file creation process.

Populate the file with data

Put your data into the netCDF file

Here, all we do is put data into the variables we defined in the file:

 // Write the data to the file
 if((retval = nc_put_var(ncid, xid, &x[0][0][0]))) { ncError(retval); }
 if((retval = nc_put_var(ncid, yid, &y[0][0][0]))) { ncError(retval); }
 if((retval = nc_put_var(ncid, zid, &z[0][0][0]))) { ncError(retval); }

The function nc_put_var() takes three parameters: the netCDF file ID, the variable ID, and the memory address of the start of the multi-dimensional data array. At this point, the data will be written to the variable in the netCDF file. There is a way to write to the netCDF file in data chunks, which can help with memory management, and a way to use parallel I/O for writing data in parallel to the file, but I have no experience with that (yet). I refer those interested in these features to the netCDF documentation.
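
For the curious, the chunk-at-a-time writing mentioned above is done with the nc_put_vara_X() family of functions, which write a hyperslab described by a start corner and edge counts. A sketch that writes only the first dim1 slice of x might look like this:

 // Write just the slice x[0][:][:]: start at the origin and span one step
 // of dim1, all of dim2, and all of dim3
 size_t start[3] = {0, 0, 0};
 size_t count[3] = {1, dim2_size, dim3_size};
 if((retval = nc_put_vara_float(ncid, xid, start, count, &x[0][0][0]))) { ncError(retval); }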

Finalize the netCDF file

That’s it! We’re done writing to the netCDF file. Time to close it completely:

 // Close the netcdf file
 if((retval = nc_close(ncid))) { ncError(retval); }

Compile and run the code

Let’s compile and run the code to generate the example netCDF file:

gcc -o netcdf_example netcdf_write_example.c -lnetcdf

Some common problems people run into here are not including the netCDF library flag at the end of the compilation call, not having the header files in the include-path, and/or not having the netCDF library in the library-path. Check your user environment to make sure the netCDF paths are included in your C_INCLUDE_PATH and LIBRARY_PATH:

env | grep -i netcdf

Once the code compiles, run it to generate the example netCDF file:

./netcdf_example

If everything goes according to plan, there should be a file called “example.nc” in the same directory as your compiled code. Let’s load this up in R for some post-processing.

Reading a netCDF file in R…

Install and load the “ncdf4” package

To start using netCDF files in R, be sure to install the netCDF package “ncdf4”:

install.packages("ncdf4")
library(ncdf4)

Note that there’s also an “ncdf” package.  The “ncdf” package reads and writes the classic (default) and 64-bit offset versions of netCDF files.  I recommend against using this package, as the newer “ncdf4” package can handle the old file versions as well as the new netCDF4 version.  It turns out the “ncdf” package has since been removed from the CRAN repository, which is just as well, since the “ncdf4” package obsoletes it.


Open the netCDF file

With the library installed and sourced, let’s open the example netCDF file we just created:

 nc <- nc_open("example.nc")

This stores an open file handle to the netCDF file.

View summary of netCDF file

Calling or printing the open file handle will produce a quick summary of the contents of the netCDF file:

 print(nc)

This summary produces the names of the available variables, the appropriate dimensions, and any global/dimension/variable attributes.

Extract variables from the netCDF file

To extract those variables, use the command:

x <- ncvar_get(nc, "x")
y <- ncvar_get(nc, "y")
z <- ncvar_get(nc, "z")

At this point, the data you extracted from the netCDF file are loaded into your R environment as 3-dimensional arrays. You can treat these the same as you would any multi-dimensional array of data (i.e. subsetting, plotting, etc.). Note that the dimensions are reported in the reverse of the order in which you created the variables.

dim(x)
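# For this example, the C code declared x[10][5][20],
# so dim(x) here should return: 20  5 10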

Close the netCDF file

When you’re done, close the netCDF file:

nc_close(nc)

And there you have it! Hopefully this step-by-step tutorial has helped you incorporate netCDF support into your project. The code I described here is available through bitbucket.

Happy computing!

~Greg

Setting up Eclipse for C/C++

IDEs are tools that make code development a lot easier, especially if your project has multiple files, classes, and functions. However, setting up the IDE can sometimes be as painful as developing complex code without one. This post presents a short tutorial on how to install and configure Eclipse for C/C++ on Windows 7 in a (hopefully) fairly painless manner. This tutorial is sequenced as follows:

  1. Installation
    1. Downloading the Java Runtime Environment.
    2. Downloading the GCC compiler.
    3. Downloading Eclipse.
  2. First steps with Eclipse
    1. Setting up a template (optional)
    2. Creating a new project
    3. Including libraries in your project

INSTALLATION

Downloading the Java Runtime Environment

To check if you have the Java Runtime Environment installed, go to java.com with either Internet Explorer or Firefox (Chrome will block the plugin) and click on “Do I have Java?”. Accept running all the plugins and, if the website tells you that you do not have Java, download and install it from the link displayed on the website.

Downloading the GCC compiler

After the check is done, you will have to download the GCC compiler, which can be done from http://www.equation.com. On the side menu, there will be a link to Programming Tools which, when expanded, shows a link to Fortran, C, C++. Click on this link and download the right GCC version for your system (32/64-bit), as shown in the following screenshot.

[Screenshot: downloading GCC from equation.com]

After downloading it, double click on the executable, accept the license, and type “c:\MinGW” as the installation directory. This is important because this is the first folder where Eclipse will look for the compiler on your computer. Proceed with the installation.

Downloading Eclipse

Now it is time to download and install Eclipse. Go to the Eclipse download website and download Eclipse IDE for C/C++ Developers. Be sure to select the right option for your computer (Windows, 32-bit/64-bit); otherwise Eclipse may not install, and even if it does, it will not run afterwards. If unsure about which version you should download, this information can be found at Control Panel -> System by looking at System type.

[Screenshot: Eclipse download page]

After downloading it, extract the file contents to “C:\Program Files\eclipse” (“Program Files (x86)” if installing the 32-bit version) so that everything is organized. Note that for this you will need to start WinRAR or any other file compression program with administrative privileges. This can be done by right-clicking the name of the program on the start menu and clicking on Run as Administrator.

Now, go to C:\Program Files\eclipse and double click on eclipse.exe to open Eclipse. In case you get an error message saying, among other things:

Java was started but returned exit code=13
...
...
-os win32
-ws win32
...

then delete the whole Eclipse folder, go back to the Eclipse download page, download the 32-bit version of Eclipse, and extract it as previously described. You should not see the same error again when trying to run eclipse.exe.

Now that Eclipse is up and running, it is time to use it.

FIRST STEPS WITH ECLIPSE

The first thing Eclipse will do is ask you to choose a workspace folder. This is the folder where all your code projects will be stored. It should not matter too much which folder you choose, so using the default is probably a good idea.

Setting up templates (optional)

It is helpful to create a code template in order to avoid retyping the same standard piece of code every time you create a new file or project. Many scientific codes have similar includes (such as math.h and stdio.h) and all of them must have a main function (as any C++ program must). If we create a code template with a few common includes and the int main function, we can just tell Eclipse when creating a new project to add these to a new .cpp file.

In order to create the mentioned template, go to Window -> Preferences. There, under C/C++ -> Code Style on the left panel, click on Code Templates. Under Configure generated code and comments, expand Files -> C++ Source File, and then click on New. Choose a meaningful name for your template (I chose “Cpp with main”) and type a short description. After that, copy and paste the template below under “Pattern”.

/*
File: ${file_name}

Author: ${user}
Date: ${date}
*/

#include <iostream>
#include <string>
#include <math.h>
#include <stdio.h>
#include <string.h>

using namespace std;

int main()
{
    // Your code here.

    return 0;
}

Note that ${file_name}, ${date}, and ${user} are variables, which means that they will be replaced by your file’s actual data. To see a list of the other variables that can be inserted in your template, click on Insert Variable…. Click Ok and Ok again and your template will be ready to be used!

[Screenshot: configuring the code template]

Creating a new project

Click on File -> New -> C++ Project. Under Project type choose Empty Project, then under Toolchains choose MinGW GCC, and, finally, type “project1” as your project name and click on Finish.

[Screenshot: creating a new project]

After your project is created, click on File -> New -> Source File. Type “say_something.cpp” (no quotes and do not forget the .cpp after the file name) as the name of your source file and choose the template you created as the template. The window should then look like this:

[Screenshot: creating a new source file]

Click on Finish. If you used the template, replace the comment “// Your code here.” with “cout << “Yay, it worked!” << endl;”. Your code should look like the snippet below. If you have not created the template, just type the following code into your file.

/*
File: say_something.cpp

Author: bct52
Date: Jun 26, 2015
*/

#include <iostream>
#include <string>
#include <math.h>
#include <stdio.h>
#include <string.h>

using namespace std;

int main()
{
    cout << "Yay, it worked!" << endl;

    return 0;
}

Now, build the code by clicking on the small hammer above the code window and, after the project is built, click on the run button (the green circle with a white play sign in the center). If everything went well, your window should look like the screenshot below, which means your code compiled and runs as expected.

[Screenshot: project1 built and running]

Including libraries in your project

When developing code, often other people have already developed pieces of code that perform some of the intermediate steps we want our code to perform. These pieces of code are often publicly available in the form of libraries. Therefore, instead of reinventing the wheel, it may be better to simply use a library.

Some libraries are comprised of one or a few files only, and can be included in a project simply by dragging the files into the Eclipse project. Others, however, are more complex and must be installed on the computer and then called from the code. The procedure for the latter case will be described here, as it is the most general case. The installation and use of the Boost library with MinGW (GCC) will serve as a case study.

The first step is downloading the library. Download the Boost library from here and extract it anywhere in your computer, say in C:\Users\my_username\Downloads (it really doesn’t matter where because these files will not be used after installation is complete).

Now it is time to install it. For this:

    1. Hold the Windows keyboard button and press R, type “cmd”, and press enter.
    2. On the command prompt, type “cd C:\Users\bct52\Downloads\boost_1_58_0” (or the directory where you extracted boost to) and press enter.
    3. There should be a file called bootstrap.bat in this folder. If that is the case, run the command:
      bootstrap.bat mingw
    4. In order to compile Boost to be used with MinGW, compile Boost with the gcc toolset. You will have to choose an installation directory for Boost, which WILL NOT be the same directory where you extracted the files earlier. In my case, I used C:\boost. For this, run the command:
      b2 install --prefix=C:\boost toolset=gcc

      Now go read a book or work on something else because this will take a while.

Now, if the installation worked with just warnings, it is time to run a code example from Boost’s website that, of course, uses the Boost library. Create a new project called “reveillon” and add a source file to it called “days_between_new_years.cpp” following the steps from the “Creating a new project” section. There is no need to use the template this time.

You should now have a blank source file in front of you. If not, delete any text/comments/codes in the file so that the file is blank. Now, copy and paste the following code, from Boost’s example, into your file.

 /* Provides a simple example of using a date_generator, and simple
   * mathematical operations, to calculate the days since
   * New Years day of this year, and days until next New Years day.
   *
   * Expected results:
   * Adding together both durations will produce 366 (365 in a leap year).
   */
  #include <iostream>
  #include "boost/date_time/gregorian/gregorian.hpp"

  int
  main()
  {
    
    using namespace boost::gregorian;

    date today = day_clock::local_day();
    partial_date new_years_day(1,Jan);
    //Subtract two dates to get a duration
    days days_since_year_start = today - new_years_day.get_date(today.year());
    std::cout << "Days since Jan 1: " << days_since_year_start.days()
              << std::endl;
    
    days days_until_year_start = new_years_day.get_date(today.year()+1) - today;
    std::cout << "Days until next Jan 1: " << days_until_year_start.days()
              << std::endl;
    return 0;
  }

Note that the line “#include "boost/date_time/gregorian/gregorian.hpp"” is what tells your code what exactly is being used from Boost. The line “using namespace boost::gregorian;” saves you from having to type boost::gregorian every time you want to use one of its functions.

However, the project will still not compile in Eclipse because Eclipse still does not know where to look for the Boost library. This will require a couple of simple steps:

  1. Right click on the project (reveillon), under the Project Explorer side window, then click on Properties. Under C/C++ Build->Settings, click on Includes under GCC C++ Compiler. On the right there should be two blank boxes, the top one called Include paths (-I) and the other called Include files (-include). Under Include paths (top one), add the path “C:\boost\include\boost-1_58” (note that this path must reflect the path where you installed Boost as well as which version of Boost you have). This is where the compiler will look for the header file specified in the code with the #include statement.
  2. The compiled library files themselves must be included through the linker. This step is necessary only if you are using a compiled library. For this, on the same window, click on Libraries under MinGW C++ Linker. Add the path to the Boost libraries folder to the Library search path (-L) (bottom box). This path will be “C:\boost\lib” (again, if you installed Boost in a different folder your path will be slightly different). Now the actual compiled library must be added to the Libraries (-l) (top box). First, we need to figure out the name of the compiled library file used in the code. In this case, it is the file “libboost_date_time-mgw51-mt-d-1_58.a”. Therefore, add boost_date_time-mgw51-mt-d-1_58 (no lib prefix, no .a suffix, and be sure to match the name of your file) to Libraries (-l). Click Ok and Ok again.

Now compile the code by clicking on the hammer button and run the code by clicking on the play button. Below is a screenshot reflecting both steps above as well as the expected output after running the program.

[Screenshot: configuring the Boost library paths and the program output]

That’s it. After your model is in good shape and it is time to run it with Borg (or another optimization algorithm), just change your “int main()” to a function with your model’s name and the right arguments for Borg, add the standard Borg main function, and change the makefile accordingly. Details on how to do all this for Borg will be explained in a future post.

Python for automating cluster tasks: Part 2, More advanced commands

This is part 2 in a series about using Python for automating cluster tasks.  Part 1 is here. (For more on Python, check out: another tutorial part one and two, tips on setting up Python and Eclipse, and some specific examples including a cluster submission guide and a script that re-evaluates solutions using a different simulation model)

Edit: Added another example in the “copy” section below!

Welcome back!  Let’s continue our discussion of basic Python commands.  Let’s start by modifying our last code sample to facilitate random seed analysis.  Now, instead of writing one file we will write 50 new files.  This isn’t exactly how we’ll do the final product, but it will be helpful to introduce loops and some other string processing.

Loops and String Processing

import re
import os
import sys
import time
from subprocess import Popen
from subprocess import PIPE

def main():
    #the input filename and filestream are handled outside of the loop.
    #but the output filename and filestream have to occur inside the loop now.
    inFilename = "borg.c"
    inStream = open(inFilename, 'rb')

    for mySeed in range(1,51):
        outFilename = "borgNew.seed" + str(mySeed) + ".c"
        outStream = open(outFilename, 'w')

        print "Working on seed %s" % str(mySeed)

        for line in inStream:
            if "int mySeed" in line:
                newString = " int mySeed = " + str(mySeed) + ";\n"
                outStream.write(newString)
            else:
                outStream.write(line)
        outStream.close()
        inStream.seek(0) #reset the input file so you can read it again

if __name__ == "__main__":
    main()

Above, the range function allows us to iterate through a range of numbers. Note that the last member of the range is never included, so range(1,51) goes from 1 to 50. Also, now we have to be concerned with making sure our files are closed properly, and making sure that the input stream gets ‘reset’ every time. There may be a more efficient way to write this code, but sometimes it’s better to be more explicit to be sure that the code is doing exactly what you want it to. Also, if you had to rewrite multiple lines, it would be helpful to structure your loops the way I have them here.

By the way, after you run the sample program, you may want to do something like “rm borgNew*” to remove all the files you just created.

Calling System Commands

Ok great, so now you can use Python to modify text files. What if you have to do something else in your workflow, such as copy files? Move them? Rename them? Call programs? Basically, you want your script to be able to do anything that you would do on the command line, or call system commands. For some background, check out this post on Stack Overflow, talking about the four or five different ways to call external commands in Python.

The code sample is below. Note that there are two different ways to use the call command. Using “shell=True” allows you to have access to certain features of the shell, such as the wildcard operator. But be careful with this! Accessing the shell directly can lead to problems, as discussed here.

import re
import os
import sys
import time
from subprocess import Popen
from subprocess import PIPE
from subprocess import call

def main():

    print "Listing files..."
    call(["ls", "-l"])

    print "Showing the current working directory..."
    call(["pwd"])

    print "Now making ten copies of borg.c"
    for i in range(1,11):
        print "Working on file %s" % str(i)
        newFilename = "borgCopy." + str(i) + ".c"
        call(["cp", "borg.c", newFilename])
    print "All done copying!"

    print "Here's proof we did it.  Listing the directory..."
    call(["ls", "-l"])

    print "What a mess.  Let's clean up:"
    call("rm borgCopy*", shell=True)
    #the above is needed if you want to use a wildcard, see:
    #http://stackoverflow.com/questions/11025784/calling-rm-from-subprocess-using-wildcards-does-not-remove-the-files

    print "All done removing!"

if __name__ == "__main__":
    main()

You may also remember that there are multiple ways to call the system. You can use subprocess to, in a sense, open a shell and call your favorite Linux commands… or you can use Python’s os library to do some of the tasks directly. Here’s an example of how to create some directories and then copy the files into the directory. Thanks to Amy for helping to write some of this code:

import os
import shutil
import subprocess

print os.getcwd()
src = "myFile.txt" #or whatever your file is called

for i in range(51, 53): #remember this will only do it for 51 and 52
   newFoldername = 'seed'+str(i)
   if not os.path.exists(newFoldername):
      os.mkdir(newFoldername)
   print "Listing files..."
   subprocess.call(["ls", "-l"])
   shutil.copy(src, newFoldername)
   #now, we should change to the new directory to see if the
   #copy worked correctly
   os.chdir(newFoldername)
   subprocess.call(["ls", "-l"])
   #make sure to change back
   os.chdir("..")

Conclusion

These two pieces of the puzzle should open up a lot of possibilities to you, as you’re setting up your jobs. Let us know if you want more by posting in the comments below!

Python for automating cluster tasks: Part 1, Getting started

Yet another post in our discussion of Python.  (To see more, check out: a tutorial part one and two, tips on setting up Python and Eclipse, and some specific examples including a cluster submission guide and a script that re-evaluates solutions using a different simulation model)

If you’re just getting into using MOEAs and simulation models together, you may have spent some time getting familiar with how to get the two technologies to “talk” to one another, and you may have solved a problem or two.  But now you may realize, well there’s more to the process than just running a MOEA once.  The following are some reasons why you may need to set up a “batch” submission of multiple MOEA runs:

1. Random Seed Analysis. A MOEA is affected by the random number generator used to generate initial random solutions, generate random or synthetic input data, and perform variation operations to develop new solutions. Typically, a random seed analysis is used to test how robust the algorithm is to different sets of random numbers.

2. Diagnostic Analysis of Algorithm Performance. MOEAs have parameters (population size, run duration, crossover probability, etc.).  We’ve discussed methods to evaluate how well the algorithm does across a wide array of different parameter values (see, for example, this post).

3. Running Multiple Problem Formulations. Perhaps you want to compare a 2-objective problem with a 10-objective problem.  Or, you use different simulation models (a screening-level model, a metamodel, all the way to a physics-based process model).  All of these changes would require running the MOEA more than once.

4. Learning More about your Problem. Even if you’re not trying 1-3, you may just need to run the MOEA for a problem that you’re still developing.  You make an educated guess about the best set of objectives, decisions, and constraints, but by analyzing the results you could see that this needs to change.  We use the term a posteriori to describe the process, because you don’t specify preferences between objectives, etc., until after you generate alternatives.  So it’s an interactive process after you start running the algorithm.

The manual approach

For this illustration, here are some assumptions on what you’re up to.  First, you are using the Borg MOEA or MOEAframework, with a simulation model (either the source code, or you’ve written a wrapper to call it).  You’ve set up one trial of the simulation-optimization process, maybe using some of the resources on this blog!  You are using cluster computing and you have a terminal connection to the cluster all set up.  And, python is installed on the cluster.

A typical workflow might look something like this.  (Thanks to my student Amy Piscopo for being the “guinea pig” here.)  Some of these steps are kind of complicated, and you may or may not need all of them.

1. Change the simulation or optimization source code.  Typically we set up programming so you don’t need to actually change the code and re-compile if you’re making a simple change.  For example, if there’s a really important parameter in your simulation code, you can pass it in as a command line argument.  But, sometimes you can’t avoid this, so step one is that you might have to go into a text editor and modify code.  (Example: on line 2268, change “int seed = 1” to “int seed = 2”.)

2. Compile the simulation or optimization source code. Any changes in source code must be compiled before you run the program.   (Example: Write a command that compiles the files directly, such as in the Borg README, or use a makefile and the command “make”)

3. Modify a submission script. You thought your experiment was going to take 8 hours, and it really takes 24.  Oops.  Typically, when you create a “submission script” for the cluster, you need to tell it a few things about your job: what queue you want, how long to run, how many processors to request, and then the specific commands you need to run.  Another thing to consider with multiple runs is that the command that you are specifying may actually change. (Example: Changing the submission script to say “borg.exe -s 2” instead of “borg.exe -s 1”)

4. Make multiple run folders, and copy all files into each folder. Ok, so you’ve made the necessary modifications for, say, “Seed 1” and “Seed 2”.  Or, “Problem 1” and “Problem 2”.  Now, you need to gather up all the files for your particular run and put them into their own folder.  I usually recommend that you use a different folder for different runs, even if there are only a few files in the folder.  It just helps keep things organized. (Example: Each seed has its own folder)

5. Submit the jobs! Whew.  Too much work.  After clicking and typing and hitting enter and dragging and dropping, you’re finally ready to hit go.

If you’re exhausted by that list, I don’t blame you — so am I!  This is a lot of manual steps, and the worst part is that the process hardly changes at all from seed 1 to seed 50.  Plus, if you do all this by hand, you are more likely to make a mistake (“Did I actually change seed 2 to seed 3?  Oh gosh, I don’t know”).  The other thing that’s annoying about it is that you may need to make yet another change in the process later on.  Instead of changing the seed, now you have to change some parameter!

Well, there’s another way…

Instead, use python!

That’s the point of this whole post.  Let’s get started.

Learning Python syntax and starting your first script

How do you quickly get started learning this new language?  First, log into the cluster and type “python”.  You may need to load a module, consult the documentation for your own cluster.  You’ll see a prompt that looks like this: “>>>”  This is the python interpreter.  You can type any command you want, and it will get executed right there, kind of like Matlab.  So now you’re running Python!  Saying it’s too hard to get started is no excuse. 🙂

Then, open the official Python tutorial, or a third party tutorial on the internet.  I noticed the third party one has a Python interpreter right in the browser.  Anyway, any time you see a new command, look it up in the tutorial!  After a while you’ll be a Python expert.

One more comment in the “getting started” section.  In a previous post the beginning of the script looked like this.  The import commands load packages that have commands that you need.  Any time you learn a new command, make sure you don’t need to also include a new import command at the beginning of the script.

import re
import os
import sys
import time
from subprocess import Popen
from subprocess import PIPE

And then, the main function is defined a little differently than you may be used to in Fortran or C.  Something like this:

def main():

    #a bunch of code goes here

if __name__ == "__main__":
    main()

The “if __name__” check is a convention in Python.  So just set up your script in a similar way and you can use it like you were programming in C or another language.

Make sure to pay attention to what the tutorial says about code indenting.  Spaces/tabs/indents are very important in Python and they have to be done correctly in order for the code to work.

Ok, now that you’re a verified Python expert, let’s talk about how to do some of the basic tasks that you need to know:

Modifying text files (or, changing one little number in 3000 lines of code)

Here’s your first Python program. It takes a file called borg.c which, somewhere inside, contains the line “int mySeed = 1;”. The program changes that seed to a number you specify (in this case, 2). It does this by reading in every line of borg.c and writing it to a new file, but when it gets to the mySeed line, it rewrites it. Note! Sometimes the spacing comes out weird on the blog. Be sure to follow correct Python spacing conventions when writing your own code.

import re
import os
import sys
import time
from subprocess import Popen
from subprocess import PIPE

def main():
    inFilename = "borg.c"
    outFilename = "borgNew.c"
    seed = 2  # the new seed value to write into the file

    inStream = open(inFilename, 'r')    # open the source file for reading
    outStream = open(outFilename, 'w')  # open a new file for writing

    for line in inStream:
        if "int mySeed" in line:
            # found the seed declaration: replace it with our new value
            newString = " int mySeed = " + str(seed) + ";\n"
            outStream.write(newString)
        else:
            # every other line is copied through unchanged
            outStream.write(line)

    inStream.close()
    outStream.close()

if __name__ == "__main__":
    main()

We’ve hit the ground running!  The above code sample shows how to read and write files, how to write a loop, and how to actually modify lines of text.  It also shows that ‘#’ marks a comment line.  Remember that the indentation tells Python whether you’re inside a loop, an if statement, or what have you.  The indenting scheme makes the code easy to read!

If you save this code in a file (say, myPython.py), all you have to do to run the program is type “python myPython.py” at the command line.  No news is good news: if the program runs without complaint, it worked!

Hopefully this has given you a taste of what you can do easily with scripting.  Yes, you could’ve opened a text editor and changed one line pretty easily.  But could you do it easily 1000 times?  Not without some effort.  One easy modification here is to assign the ‘seed’ variable in a loop and change the file multiple times, or create multiple files; see the sketch below.
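For example, here is a minimal sketch of that modification (the output filenames and the range of 50 seeds are just placeholder assumptions):

def main():
    inFilename = "borg.c"

    for seed in range(1, 51):  # seeds 1 through 50
        outFilename = "borgSeed" + str(seed) + ".c"  # one output file per seed
        inStream = open(inFilename, 'r')
        outStream = open(outFilename, 'w')
        for line in inStream:
            if "int mySeed" in line:
                outStream.write(" int mySeed = " + str(seed) + ";\n")
            else:
                outStream.write(line)
        inStream.close()
        outStream.close()

if __name__ == "__main__":
    main()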

Next time, we’ll talk about how to call system commands within Python.  So, in addition to changing text, we’ll be able to copy files, delete files, even submit jobs to the cluster.  All within our script! Feel free to adapt any code samples here for your own purposes, and provide comments below!

Many Objective Robust Decision Making (MORDM): Concepts and Methods

This post provides an informal discussion of how to carry out the Many Objective Robust Decision Making (MORDM) procedure. The blog post was written by Jon Herman and Joe Kasprzyk. For the journal article describing MORDM, please click here.

Introduction

Numerical simulations of engineered systems define the relationship between decisions (inputs) and some measures of performance (objective values). The relationship between decisions and performance often depends on exogenous factors beyond the control of the decision maker, e.g., climate, economic variables, etc., which are liable to be highly uncertain. When such models account for uncertainty, they typically do so by calculating the expected value of performance under well-characterized probability distributions. They do not, however, account for deep uncertainty, where decision makers do not agree on the full set of risks to a system or their associated probabilities [1,2]. Robust Decision Making (RDM) is designed to address this challenge by identifying sets of decisions that perform well across a range of assumptions on deeply uncertain variables (i.e., decisions that are robust to uncertain states of the world).

This is an important distinction: by measuring performance across uncertain states of the world, RDM avoids the common problem of assigning probabilities to these outcomes. Instead, decision makers can explore which scenarios lead to vulnerabilities, and then determine a posteriori how likely these outcomes might be. Thus, RDM can shed light on two key questions:

  • Which deeply uncertain variables (and combinations thereof) are most responsible for changes in performance?
  • Which candidate solutions are most robust to these uncertain variables?

In our research, we have combined concepts from RDM and many objective analysis to propose a new framework, Many Objective Robust Decision Making (MORDM).  The MORDM process consists of four main steps: (1) problem formulation, (2) generating alternatives, (3) uncertainty analysis, and (4) scenario discovery and tradeoff analysis [3,4,5].

1. Problem Formulation

A “problem” in the context of RDM is defined by: exogenous uncertain variables, decision variables, a simulation model, and objective values. Following [6], these can be described with the acronym XLRM: uncertainties (X), decisions or “levers” (L), relationship between decisions and performance (R), and measures of performance (M).

Many of the existing applications that use the tools discussed on this blog will already have decision variables (levers), measures of performance, and a quantitative relationship or simulation. The new concept for creating MORDM analyses of these problems will be to identify a set of uncertain variables (X) that will collectively account for the primary exogenous sources of uncertainty in the system. The idea is to convert these concepts from the realm of deep uncertainty (i.e., stakeholders cannot agree on the full range of risks to the system) to a set of quantitative variables (creating an ensemble of feasible “states of the world” that describe uncertainties).

No two models will have the same set of uncertain variables, but here are some helpful guidelines:

  • Does the model contain variables that reflect future change? Is it possible that these values will be different from what is currently projected?
  • Does the model contain assumptions about the current state of the world that may not be correct? Many assumptions in the model will be well-defined from data, but others will likely be more suspect. It is worth exploring what impact these assumptions have on performance.
  • Are there any variables omitted from the current state of the world that could become relevant in the future?

Again, this is not a definitive list; your set of uncertain variables will be specific to your application. Once you have the XLRM components defined, you can start the next step.

2. Generating Alternatives

Alternatives are sets of model simulations (decisions and performance measures) of interest in the base state of the world. These are the solutions that will be subjected to the sources of uncertainty, X, defined above (this occurs later in Step #3). Different approaches exist for generating alternatives. Bryant and Lempert (2010) [7] propose a Latin hypercube sample over the decision variable space. Kasprzyk et al. (2013) [8] propose using a set of Pareto-approximate solutions found using a multi-objective evolutionary algorithm (MOEA) in an extension known as Many-Objective RDM. The MORDM approach confers several advantages: it allows the analysis of multiple performance objectives, and it ensures that decision makers are starting from a set of the best known solutions in the base state of the world. That is, the decision makers will be exploring the uncertainties associated with solutions that they would be likely to choose in the absence of RDM analysis.

To generate alternatives using the MORDM approach, you will need to perform a multi-objective optimization on your problem. This has been covered in more detail elsewhere, but here are some links to get started. For software, see MOEAFramework and Borg; for documentation about these, see here, here, and here.

3. Uncertainty Analysis

Uncertainty analysis involves running the set of alternatives generated above through a range of states of the world defined by the deeply uncertain variables (X). These states of the world can be generated, for example, with a Latin hypercube sample of the uncertain variables. The following Bash example shows how to generate such a sample using MOEAFramework:

#!/bin/bash

JAVA_ARGS="-Xmx256m -classpath MOEAFramework-1.16-Executable.jar"
NUM_SAMPLES=10000
METHOD=latin
RANGES_FILENAME=RDMFactors.txt
OUTPUT_FILENAME=RDMSamples.txt
CSV_FILENAME=RDMSamples.csv

java ${JAVA_ARGS} org.moeaframework.analysis.sensitivity.SampleGenerator -m ${METHOD} -n ${NUM_SAMPLES} -p ${RANGES_FILENAME} -o ${OUTPUT_FILENAME}

# The default output is space-separated. Convert to comma-separated file as follows: (optional)
sed 's/ /,/g' ${OUTPUT_FILENAME} > ${CSV_FILENAME}

This example generates 10,000 Latin hypercube samples of the variables defined in RDMFactors.txt, which contains the name, lower, and upper bound for each variable, like so:

Inflows 0.8 1.2
Evaporation 0.8 1.2
...

The uncertain variables should be explored over reasonable ranges of values, but should not be restricted to only those scenarios considered “possible”. By the definition of deep uncertainty, the system is likely to encounter conditions previously considered impossible, so it is valuable to include even extreme scenarios in the RDM analysis. Remember, we’re running a series of “what-if” experiments, not trying to determine the most likely future scenario.

There is no fixed rule for how many samples to generate. The more uncertain variables you have, the more samples you will need to get good coverage of the space. The sample size used here (10,000) provides reasonably good coverage for experiments on the order of tens of variables.

Once you’ve generated your set of uncertain states of the world (stored in RDMSamples.txt above), run each alternative solution for the entire ensemble of states of the world. For example, if you generated 100 alternatives in Step #2, and an ensemble of 10,000 states of the world in this step, you will need to perform 100 * 10,000 = 1 million model evaluations. This will be trivial for some models, and impossible for others—adjust accordingly. Some model-specific modifications will be required to perform these evaluations. You’ll need to read in the variable values from RDMSamples.txt, and the decision variables defined for your set of alternatives, and make sure these are assigned properly within the model. Depending on the complexity of your model, you may also need to get access to a computing cluster.
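The details are entirely model-specific, but a rough Python sketch of this evaluation loop might look like the following. The evaluate_model function, the alternatives.txt and objectives.txt filenames, and the output format are all hypothetical placeholders:

def evaluate_model(decisions, state):
    # placeholder: replace this with a call to your actual simulation model
    return [sum(decisions) * factor for factor in state]

def main():
    # read the ensemble of states of the world (one row per LHS sample)
    samples = []
    with open("RDMSamples.txt") as f:
        for line in f:
            samples.append([float(v) for v in line.split()])

    # read the decision variables for each alternative (hypothetical file)
    alternatives = []
    with open("alternatives.txt") as f:
        for line in f:
            alternatives.append([float(v) for v in line.split()])

    # evaluate every alternative in every sampled state of the world
    with open("objectives.txt", "w") as out:
        for decisions in alternatives:
            for state in samples:
                objectives = evaluate_model(decisions, state)
                out.write(" ".join(str(obj) for obj in objectives) + "\n")

if __name__ == "__main__":
    main()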

These model evaluations should output the performance measures calculated for each solution in each state of the world. Again, depending on the size of your experiment and the number of performance measures, this may be quite a bit of data. Make sure you save these somehow, either in files or a database, for the next step.

4. Scenario Discovery and Tradeoff Analysis

With our alternatives evaluated across all sampled states of the world, it’s now possible to address the two questions posed at the top of this post. First, which deeply uncertain variables, and combinations thereof, are most responsible for changes in performance? And second, which candidate solutions are most robust to these changes, and what visualization techniques can we use to identify them?

The first question can be answered using the process of scenario discovery [9,10], where clustering analyses are used to find combinations of uncertain variables that best predict a particular outcome defined in terms of performance measure thresholds. The outcome defined by these thresholds can be either good or bad, but typically it will reflect a critical vulnerability in the system. Following Kasprzyk et al. (2013), the MORDM approach allows these thresholds to be defined in terms of multiple objectives. Lempert et al. (2008) [11] compared different clustering approaches and favored the Patient Rule Induction Method (PRIM, [12]) for its ease of use and interactivity. PRIM works by identifying a subsection of the space of uncertain variables in which the performance thresholds are likely to be crossed. It returns which uncertainties are most likely to contribute to these vulnerabilities and, importantly, at which values this is likely to occur. An implementation of PRIM in the R language is freely available (Bryant, 2009).
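For those working in Python rather than R, a sketch of the same idea using the PRIM implementation in the EMA Workbench package might look like the following. The filenames, column names, and threshold values here are assumptions for illustration only:

import pandas as pd
from ema_workbench.analysis import prim

# uncertain variable samples (x), and a boolean vector (y) flagging the
# states of the world where a performance threshold was crossed
x = pd.read_csv("RDMSamples.csv", names=["Inflows", "Evaporation"])
y = (pd.read_csv("objectives.csv")["reliability"] < 0.95).values  # hypothetical threshold

prim_alg = prim.Prim(x, y, threshold=0.8)  # density threshold for the peeling procedure
box = prim_alg.find_box()
box.show_tradeoff()  # coverage vs. density for the candidate boxes
box.inspect()        # ranges of the uncertain variables defining the box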

The second question—the selection of a robust solution—is a highly interactive process and thus cannot follow a concrete set of steps. Particularly in the case of MORDM, identifying a robust solution strongly depends on the ability to visualize data in multiple dimensions (see Kasprzyk et al., 2013 for examples). Ideally, a robust solution will have good performance in the base state of the world, as well as minimal deviation from that performance across the ensemble of sampled states of the world. It is not uncommon for the solutions with the best performance in the base state of the world to be vulnerable to deviation otherwise, as this represents overfitting to the base state without considering deep uncertainties. The outcome of this analysis will be model-specific, however. Some uncertain variables may not affect performance at all, while others may have major impacts.

This has been a high-level overview of the concepts and methods related to RDM. For in-depth studies and example figures, please refer to the references below. Thanks for reading!

References:

[1] Knight, F.H. 1921. Risk, Uncertainty, and Profit. Houghton Mifflin, Boston, MA.

[2] Lempert, R.J. 2002. A new decision sciences for complex systems. Proceedings of the National Academy of Sciences 99, 7309-7313.

[3] Lempert, 2002.

[4] Bryant, B.P., Lempert, R.J., 2010. Thinking inside the box: a participatory, computer-assisted approach to scenario discovery. Technological Forecasting and Social Change 77, 34-49.

[5] Kasprzyk, J.R., Nataraj, S., Reed, P.M., Lempert, R.J., 2013. Many objective robust decision making for complex environmental systems undergoing change. Environmental Modelling & Software 42, 55-71. doi:10.1016/j.envsoft.2012.12.007.

[6] Lempert, R.J., Popper, S.W., Bankes, S.C., 2003. Shaping the Next One Hundred Years: New Methods for Quantitative, Long-term Policy Analysis. RAND, Santa Monica, CA.

[7] Bryant and Lempert, 2010.

[8] Kasprzyk et al., 2013.

[9] Lempert, R.J., Bryant, B.P., Bankes, S.C., 2008. Comparing algorithms for scenario discovery. Technical Report WR-557-NSF. RAND.

[10] Lempert, R.J., 2012. Scenarios that illuminate vulnerabilities and robust responses. Climatic Change.

[11] Lempert et al., 2008.

[12] Friedman, J.H., Fisher, N.I., 1999. Bump hunting in high-dimensional data. Statistics and Computing 9, 123-143.