Plotting a map of NetCDF data with Matplotlib/Basemap

In two previous posts I showed how to create a NetCDF file from a mess of ASCII data, and how to reduce its size by storing only the lat-lon grid cells for which data exists. This post will show a brief example of reading and plotting data from a NetCDF file using Matplotlib with the Basemap toolkit.

1. Read in NetCDF variables
This is an easy one—just open the file and assign the values you want to plot.

root_grp = Dataset('climatology-nc/vic_LHS_climatology_cells.nc')
vic_runoff = root_grp.variables['vic_runoff']
lat = root_grp.variables['latitude'][:]
lon = root_grp.variables['longitude'][:]
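If you are not sure what the file contains, it only takes a few lines to list its dimensions and variables before you start plotting. Here is a minimal sketch, assuming the same file as above:

from netCDF4 import Dataset

root_grp = Dataset('climatology-nc/vic_LHS_climatology_cells.nc')

# list each dimension and its length
for name, dim in root_grp.dimensions.items():
    print('%s: %d' % (name, len(dim)))

# list each variable with its dimensions and shape
for name, var in root_grp.variables.items():
    print('%s %s %s' % (name, var.dimensions, var.shape))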

2. Create the Basemap object
Next we create the Basemap object, m. I do this by calling a custom mapformat() function, which returns the object. Here is the function in its entirety:

from __future__ import division
import numpy as np
from mpl_toolkits.basemap import Basemap

def mapformat():

  m = Basemap(projection='robin', lon_0=0,resolution='c')
  # resolution c, l, i, h, f in that order

  m.drawmapboundary(fill_color='white', zorder=-1)
  m.fillcontinents(color='0.8', lake_color='white', zorder=0)

  m.drawcoastlines(color='0.6', linewidth=0.5)
  m.drawcountries(color='0.6', linewidth=0.5)

  m.drawparallels(np.arange(-90.,91.,30.), labels=[1,0,0,1], dashes=[1,1], linewidth=0.25, color='0.5')
  m.drawmeridians(np.arange(0., 360., 60.), labels=[1,0,0,1], dashes=[1,1], linewidth=0.25, color='0.5')

  return m

I did this because I wanted to modularize the projection and color scheme that I found to look “nice” for plotting global data. This gives light gray continents with white oceans, so that the emphasis is on the plotted data rather than the map background; your preferences may vary.
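Assuming the function above is saved in a file called mapformat.py (as in the full gist at the end of this post), using it is just an import and a call:

import matplotlib.pyplot as plt
from mapformat import mapformat

m = mapformat()   # draws the background map and returns the Basemap object
# ... plot your data on top of m here ...
plt.savefig('global_map.png', dpi=300)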

3. Populate a gridded array for plotting
Next, we’ll take our NetCDF variable of interest and put the relevant values into a single 2D array to be plotted. Recall in the previous post we arranged this file into a list format, so the NetCDF data is not gridded by default—we need to do a little bit of work to stick it into a 2D grid.

array = np.empty((180,360))
array[:] = np.NAN;

x = np.arange(0.5, 360.5, 1.0)
y = np.arange(-90, 91, 1.0)

for i in xrange(0, lat.size):
    ilat = int(lat[i] + 90.0)
    ilon = int(lon[i] - 0.5)
    if ilon < 179 or ilon > 181: # HACK: to avoid date-line wraparound problem
      array[ilat,ilon] = vic_runoff[i,ensemble,month]

array, x = shiftgrid(180, array, x)
array_mask = np.ma.masked_invalid(array)
x,y = np.meshgrid(x,y)
x,y = m(x,y)

Here, x and y are the projected coordinates we’re going to use for plotting. We loop through our list of grid cells and assign the values of vic_runoff to the array variable (which is just a 2D NumPy array). Note the creation of array_mask, which masks the NaN values in array so they are left out of the plot.
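If the masking step is unfamiliar, here is a tiny standalone example of what np.ma.masked_invalid does; pcolormesh simply leaves masked cells unfilled rather than coloring them:

import numpy as np

a = np.array([[1.0, np.nan],
              [3.0, 4.0]])
a_mask = np.ma.masked_invalid(a)
print(a_mask)        # the NaN entry displays as --
print(a_mask.mask)   # True where the original array was NaN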

4. Create and customize plot
The goal here is to call a function of the form plot(x,y,array). Specifically I’ll use the m.pcolormesh function, which is part of the Basemap library. Creating the plot looks like this:

rbb = np.loadtxt('cmaps/runoff-brownblue.txt')/255;
rbb = mpl.colors.ListedColormap(rbb)

m.pcolormesh(x,y,array_mask,vmin=0.0,vmax=100.0, cmap=rbb, rasterized=False, edgecolor='0.6', linewidth=0)
cbar = m.colorbar()
cbar.solids.set_edgecolor("face")
cbar.set_ticks([0,100])
plt.title("VIC Runoff (mm/month)")
plt.show()

And the result:
[Figure: vic_runoff_example, the resulting global map of VIC runoff]

Shameless plug: I created this beige-blue colormap using Colormap. I think it does a nice job of conveying low-to-high runoff values—intuitively, the transition between a “desert” and a “lake”. Also it’s monotonically decreasing in grayscale, so it will print correctly in black and white. None of this is required for the task at hand, but it’s something to think about.
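If you want to check for yourself that a colormap is monotonic in grayscale, one quick test is to convert each RGB entry to an approximate luminance and confirm it never changes direction. A rough sketch, using the same runoff-brownblue.txt file loaded above and the standard Rec. 601 luma weights:

import numpy as np

rgb = np.loadtxt('cmaps/runoff-brownblue.txt') / 255.0   # N x 3 array of RGB rows
luma = 0.299*rgb[:,0] + 0.587*rgb[:,1] + 0.114*rgb[:,2]  # approximate grayscale value
diffs = np.diff(luma)

print('monotonically decreasing: %s' % np.all(diffs <= 0))
print('monotonically increasing: %s' % np.all(diffs >= 0))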

So there you have it: NetCDF makes it easy to plot data because you don’t need to spend time redefining all of the columns of data that you would normally read in from ASCII files. Thanks for reading. Here is the full gist if you’re interested:

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from mpl_toolkits.basemap import Basemap,shiftgrid
import matplotlib as mpl
from mapformat import mapformat
from netCDF4 import Dataset
root_grp = Dataset('climatology-nc/vic_LHS_climatology_cells.nc')
vic_runoff = root_grp.variables['vic_runoff']
lat = root_grp.variables['latitude'][:]
lon = root_grp.variables['longitude'][:]
ensemble = 5000
month = 7
m = mapformat()
array = np.empty((180,360))
array[:] = np.NAN;
x = np.arange(0.5, 360.5, 1.0)
y = np.arange(-90, 91, 1.0)
for i in xrange(0, lat.size):
    ilat = int(lat[i] + 90.0)
    ilon = int(lon[i] - 0.5)
    if ilon < 179 or ilon > 181: # HACK: to avoid date-line wraparound problem
        array[ilat,ilon] = vic_runoff[i,ensemble,month]
array, x = shiftgrid(180, array, x)
array_mask = np.ma.masked_invalid(array)
x,y = np.meshgrid(x,y)
x,y = m(x,y)
rbb = np.loadtxt('cmaps/runoff-brownblue.txt')/255;
rbb = mpl.colors.ListedColormap(rbb)
m.pcolormesh(x,y,array_mask,vmin=0.0,vmax=100.0, cmap=rbb, rasterized=False, edgecolor='0.6', linewidth=0)
cbar = m.colorbar()
cbar.solids.set_edgecolor("face")
cbar.set_ticks([0,100])
plt.title("VIC Runoff (mm/month)")
plt.show()

Reducing size of NetCDF spatial data with list representation

In the previous post I described the process of creating a NetCDF file to store model output on a lat-lon grid. The main variable in this file was named vic_runoff (runoff from the VIC hydrologic model), and I set it up as a 4-dimensional array along the following dimensions: latitude (180), longitude (360), ensemble (10000), and month (12). I included the _FillValue attribute to account for the sea surface grid cells, since we are only doing land surface hydrology here.

The problem is that the full lat-lon grid is being stored, including the sea surface grid cells (which all contain fill values of -9999.0). I mistakenly assumed that these would be compressed in the file, but they were not by default: the 180 x 360 x 10000 x 12 matrix of 32-bit floats takes up about 29 GB, exactly as one would expect. This is an awful lot of wasted space.

There are two approaches we might try here. First, when we create the variables in the NetCDF file, we can include gzip compression like so:

vic_runoff = root_grp.createVariable('vic_runoff', 'f4', ('ncells', 'ensemble', 'month',), fill_value=-9999.0, zlib=True)

I’ve found this approach to slow things down considerably, both when creating the file and when reading from it. A simpler idea is just to stop storing the sea surface grid cells. Instead, we can create a dimension called ncells that just lists our land surface grid cells (at 1-degree resolution, there are 15836 land surface cells). The setup looks like this:

# New file
root_grp = Dataset('vic_LHS_climatology_cells.nc', 'w', format='NETCDF4')
root_grp.description = '(Cells only) Results from VIC 10K Latin Hypercube ensemble, 60-year simulation on Blue Waters'

# dimensions
root_grp.createDimension('ncells', 15836)
root_grp.createDimension('month', 12)
ensemble = root_grp.createDimension('ensemble', 10000)

# variables
latitudes = root_grp.createVariable('latitude', 'f4', ('ncells',))
longitudes = root_grp.createVariable('longitude', 'f4', ('ncells',))
vic_runoff = root_grp.createVariable('vic_runoff', 'f4', ('ncells', 'ensemble', 'month',), fill_value=-9999.0)
obs_runoff = root_grp.createVariable('obs_runoff', 'f4', ('ncells', 'month'), fill_value=-9999.0)

Note that latitude and longitude are still variables in the NetCDF file, because we need to store the locations of these grid cells. But we replace the dimensions lat and lon with ncells, which ensures that we will not be storing a whole bunch of wasted space.
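The price of the list representation is that you can no longer index the data directly by (lat, lon); you first have to find a cell’s position along the ncells dimension. Here is a minimal sketch of such a lookup, assuming the cells file created above and a hypothetical target location:

import numpy as np
from netCDF4 import Dataset

root_grp = Dataset('vic_LHS_climatology_cells.nc')
lat = root_grp.variables['latitude'][:]
lon = root_grp.variables['longitude'][:]

# hypothetical target: the 1-degree cell containing 40.5 N, 254.5 E
i = np.where((np.floor(lat) == np.floor(40.5)) & (np.floor(lon) == np.floor(254.5)))[0]

if i.size > 0:
    cell_runoff = root_grp.variables['vic_runoff'][i[0], :, :]   # all ensembles and months
    print(cell_runoff.shape)   # (10000, 12)
else:
    print('not a land surface grid cell')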

Below is a gist showing an example of copying an existing NetCDF file (with a full lat-lon grid) into a new NetCDF file containing only a list of the grid cells for which we have data. This reduces the filesize from 29 GB to 7 GB, without any compression! The only downside is that it takes a bit more effort to plot the data, as I’ll describe in the next post.

from __future__ import division
from netCDF4 import Dataset
import numpy as np
import os
import math
PATH = '.'
LL = np.loadtxt('%s/global_soils_default.txt' % PATH);
LL = LL[:,2:4]
OBS = np.loadtxt('%s/VIC_GRDC_Monthly_Climatology.txt' % PATH, delimiter=',', skiprows=1)
old_root= Dataset('vic_LHS_climatology.nc', 'r', format='NETCDF4')
old_vic_runoff = old_root.variables['vic_runoff']
old_obs_runoff = old_root.variables['obs_runoff']
# New file
root_grp = Dataset('vic_LHS_climatology_cells.nc', 'w', format='NETCDF4')
root_grp.description = '(Cells only) Results from VIC 10K Latin Hypercube ensemble, 60-year simulation on Blue Waters'
# dimensions
root_grp.createDimension('ncells', 15836)
root_grp.createDimension('month', 12)
ensemble = root_grp.createDimension('ensemble', 10000)
# variables
latitudes = root_grp.createVariable('latitude', 'f4', ('ncells',))
longitudes = root_grp.createVariable('longitude', 'f4', ('ncells',))
vic_runoff = root_grp.createVariable('vic_runoff', 'f4', ('ncells', 'ensemble', 'month',), fill_value=-9999.0)
obs_runoff = root_grp.createVariable('obs_runoff', 'f4', ('ncells', 'month'), fill_value=-9999.0)
vic_runoff.units = 'mm/month'
obs_runoff.units = 'mm/month'
cellsdone = 0
latiter = np.arange(-90.5, 89.5, 1.0)
loniter = np.arange(0.5, 360.5, 1.0)
for lati, lat in enumerate(latiter):
    for loni, lon in enumerate(loniter):
        # grab the index of the 0-15836 list of grid cells
        i = np.where((np.floor(LL[:,0]) == np.floor(lat)) & (np.floor(LL[:,1]) == np.floor(lon)))
        # if this is one of our land surface grid cells...
        if(np.size(i) > 0):
            i = i[0]
            print i
            latitudes[i] = LL[i,0]
            longitudes[i] = LL[i,1]
            obs_runoff[i,:] = old_obs_runoff[lati,loni,:]
            vic_runoff[i,:,:] = old_vic_runoff[lati,loni,:,:]
root_grp.close()

Converting ASCII data to NetCDF (Python)

This is the first of what I hope will be a series of posts and discussions about NetCDF (Network Common Data Form), a data storage and access interface created by the Unidata team at UCAR. In this post, I’ll describe recent efforts to convert a horrifying amount of plaintext data into the NetCDF format using the netCDF4-python module. But first, a quick word on installation:

Installation requires both the underlying NetCDF4 and HDF5 libraries, as well as the higher-level API for your language of choice (there are also APIs for R, Java, and I believe Matlab/Octave, in addition to the bedrock C/C++ and Fortran APIs). On Windows, you can probably just grab the appropriate binary from the link above and install it as you would Numpy/Scipy. On OSX/Linux, you may need to work through a chain of dependencies with your package manager, which won’t be covered in this post. To test whether Python can find your installation, run import netCDF4 in your Python interpreter; importing without errors is a necessary but not sufficient condition for success.
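For a slightly stronger check than a bare import, you can print the module version and round-trip a tiny file. A quick sketch (it writes a throwaway file called check.nc):

import numpy as np
import netCDF4
from netCDF4 import Dataset

print(netCDF4.__version__)

# write and read back a tiny file to confirm the underlying libraries work
grp = Dataset('check.nc', 'w', format='NETCDF4')
grp.createDimension('x', 3)
v = grp.createVariable('v', 'f4', ('x',))
v[:] = np.array([1.0, 2.0, 3.0])
grp.close()

grp = Dataset('check.nc', 'r')
print(grp.variables['v'][:])   # should print the three values written above
grp.close()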

Here is the situation. We have been running a global hydrology experiment in which we model the average monthly runoff at a 1-degree resolution for a number of parameter ensembles. There’s much more to it if you’re interested, but that’s the synopsis. Due to queue limits, we divided the ensembles across 200 separate jobs. I unwisely printed the output in tab-separated text files, which means we now have output files for every grid cell on the globe (about 16,000 for the land surface) for every job (200). I’ll spare you the math: that’s 3.2 million text files floating around. Each one is only about 7 KB, but the number of files is enough to drag down any filesystem, not to mention it’s difficult to document, maintain, and share.

Who wants to receive a tarball with 3.2 million files? Not me. Let’s try to do better.

One thing you’ll notice about this data is that it’s hierarchical. Every grid cell contains a number of parameter sets, and every parameter set contains monthly runoff values. If only some smart people had invented a data format to handle multidimensional arrays of hierarchical data! (Enter NetCDF, which now uses the Hierarchical Data Format, HDF5).

A NetCDF file consists of three main building blocks: Groups, Dimensions, and Variables. In this post I’m not using Groups; they would come into play if, for example, you had data from different experiments or multiple versions of the same experiment.

Dimensions describe, well, the dimensions of your multidimensional data. If you’re coming from a C background, you can think of this step like allocating memory. In Python, opening a NetCDF file for writing and creating some dimensions looks like this:

from netCDF4 import Dataset
import numpy as np

root_grp = Dataset('vic_LHS_climatology.nc', 'w', format='NETCDF4')
root_grp.description = 'Results from VIC 10K Latin Hypercube ensemble, 60-year simulation on Blue Waters'

# dimensions
root_grp.createDimension('lat', 180)
root_grp.createDimension('lon', 360)
root_grp.createDimension('month', 12)
root_grp.createDimension('ensemble', 10000)

The call to Dataset returns the object to access the “root group” of the file. Keep this around, because you’ll use it for all future reads/writes. Then we create our dimensions. Here, the dimensions are latitude, longitude, month, and parameter ensemble. We specify a name for each of these, as well as their size. Notice that we haven’t assigned any values here, only the name and length of each dimension.
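If you want to confirm the dimensions were created as intended, the root group keeps a dictionary of them:

# after the createDimension calls above
for name, dim in root_grp.dimensions.items():
    print('%s: %d' % (name, len(dim)))
# prints lat: 180, lon: 360, month: 12, ensemble: 10000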

Next we create our variables, which each use one or more of the dimensions above:

# variables
latitudes = root_grp.createVariable('latitude', 'f4', ('lat',))
longitudes = root_grp.createVariable('longitude', 'f4', ('lon',))
vic_runoff = root_grp.createVariable('vic_runoff', 'f4', ('lat', 'lon', 'ensemble', 'month',), fill_value=-9999.0)
obs_runoff = root_grp.createVariable('obs_runoff', 'f4', ('lat', 'lon', 'month'), fill_value=-9999.0)

vic_runoff.units = 'mm/month'
obs_runoff.units = 'mm/month'

# set the variables we know first (the [:] writes the values into the NetCDF variables)
latitudes[:] = np.arange(-90.5, 89.5, 1.0)
longitudes[:] = np.arange(0.5, 360.5, 1.0)

The createVariable function allows us to specify a name, a datatype (in this case, a 32-bit floating point value specified by f4), and importantly, the list of dimensions used to specify this variable. Note that latitude/longitude are both dimensions and variables; we first defined their length, and now we are going to assign them values. We do not need to do this for the month and ensemble dimensions, because their array indices will be the same as their values (1-12 and 1-10000, respectively). The vic_runoff variable uses all four dimensions, so we are saying that this will be a 4-dimensional array. The obs_runoff variable does not depend on the ensemble dimension, so it is only 3-dimensional.

Something that is both dead simple and incredibly useful: you can specify metadata like units directly in the file! So here we can specify the units of these variables, and it will follow the data wherever it goes. It’s a little embarrassing to think that I haven’t been doing this before.
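Units aren’t the only attribute you can attach; any name/value pair can be set on a variable or on the root group itself. Here is a small sketch of the kind of metadata you might add (the attribute names other than units are just common conventions, not requirements):

# variable-level metadata
vic_runoff.long_name = 'Simulated monthly runoff from the VIC model'
obs_runoff.long_name = 'Observed monthly runoff climatology'

# file-level (global) metadata on the root group
root_grp.history = 'Created from VIC ensemble text output'
root_grp.source = 'VIC hydrologic model, 10K Latin Hypercube ensemble'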

Finally, the latitude and longitude variables already have values that we know—namely, a 1-degree grid across the globe—so we can assign them straightaway. Note that NetCDF variables support NumPy-style slicing, so writing their values is just an array assignment; the [:] is what actually writes into the file, rather than rebinding the Python name.

Now comes the hard part. We need to read in all of our text data and stick it into the two multidimensional arrays vic_runoff and obs_runoff. I’ll spare you the exact details, but here’s the main idea:

for grid cells:
  for job files:
    read in text column data
    save as a chunk of a temporary variable
  # write the temporary array to netcdf file for this grid cell:
  vic_runoff[lati,loni,:,:] = tempstore

The reason for using temporary variables is that the dataset will not fit in working memory. Thus we need to read it in chunks and assign each chunk to part of the vic_runoff array. The line where this assignment occurs is writing the data to disk, so it’s going to be much slower than a regular NumPy array assignment. In general, you want to order the dimensions so that the ones you loop over come first and the ones you write in whole chunks come last. I originally had the dimensions of vic_runoff as (ensemble, month, lat, lon) and, believe me, it was slow. As in most languages with row-major (C-ordered) arrays, access is fastest along the trailing dimensions, so writing a contiguous (ensemble, month) block for each grid cell is much faster than a strided write across the leading dimensions.
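If you want to see the effect of dimension ordering for yourself, a scaled-down experiment like the following sketch (the sizes here are made up and much smaller than the real dataset) should make the difference visible: writing one contiguous chunk per grid cell is much faster when the cell dimensions come first.

import time
import numpy as np
from netCDF4 import Dataset

def make_file(fname, dims):
    sizes = {'lat': 18, 'lon': 36, 'ensemble': 1000, 'month': 12}
    grp = Dataset(fname, 'w', format='NETCDF4')
    for d in dims:
        grp.createDimension(d, sizes[d])
    var = grp.createVariable('runoff', 'f4', dims, fill_value=-9999.0)
    return grp, var

chunk = np.random.rand(1000, 12)

# cell dimensions first: one contiguous write per cell
grp1, v1 = make_file('order_cells_first.nc', ('lat', 'lon', 'ensemble', 'month'))
start = time.time()
for lati in range(18):
    for loni in range(36):
        v1[lati, loni, :, :] = chunk
grp1.close()
print('cell dims first: %.2f s' % (time.time() - start))

# cell dimensions last: strided writes scattered across the file
grp2, v2 = make_file('order_cells_last.nc', ('ensemble', 'month', 'lat', 'lon'))
start = time.time()
for lati in range(18):
    for loni in range(36):
        v2[:, :, lati, loni] = chunk
grp2.close()
print('cell dims last: %.2f s' % (time.time() - start))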

At the end, a call to root_grp.close() wraps things up. And now our massive unwieldy glob of text files has been condensed into a single NetCDF file that also contains metadata! (Read: we can share the dataset without shame!) Future posts will cover working with NetCDF datasets after they’ve been created and visualizing their contents.

Here is the full gist for the more adventurous:

from __future__ import division
from netCDF4 import Dataset
import numpy as np
import os
PATH = '.'
LL = np.loadtxt('%s/global_soils_default.txt' % PATH);
LL = LL[:,2:4]
OBS = np.loadtxt('%s/VIC_GRDC_Monthly_Climatology.txt' % PATH, delimiter=',', skiprows=1)
# NC file setup
root_grp = Dataset('vic_LHS_climatology.nc', 'w', format='NETCDF4')
root_grp.description = 'Results from VIC 10K Latin Hypercube ensemble, 60-year simulation on Blue Waters'
# dimensions
root_grp.createDimension('lat', 180)
root_grp.createDimension('lon', 360)
root_grp.createDimension('month', 12)
ensemble = root_grp.createDimension('ensemble', 10000)
# variables
latitudes = root_grp.createVariable('latitude', 'f4', ('lat',))
longitudes = root_grp.createVariable('longitude', 'f4', ('lon',))
vic_runoff = root_grp.createVariable('vic_runoff', 'f4', ('lat', 'lon', 'ensemble', 'month',), fill_value=-9999.0)
obs_runoff = root_grp.createVariable('obs_runoff', 'f4', ('lat', 'lon', 'month'), fill_value=-9999.0)
vic_runoff.units = 'mm/month'
obs_runoff.units = 'mm/month'
# set the variables we know first (the [:] writes the values into the NetCDF variables)
latitudes[:] = np.arange(-90.5, 89.5, 1.0)
longitudes[:] = np.arange(0.5, 360.5, 1.0)
cellsdone = 0
for lati, lat in enumerate(latitudes[:]):
    for loni, lon in enumerate(longitudes[:]):
        # grab the index of the 0-15836 list of grid cells
        i = np.where((np.floor(LL[:,0]) == np.floor(lat)) & (np.floor(LL[:,1]) == np.floor(lon)))
        # if this is one of our land surface grid cells...
        if(np.size(i) > 0):
            current_obs = OBS[np.where((np.floor(OBS[:,0]) == np.floor(lat)) & (np.floor(OBS[:,1]) == np.floor(lon))), 2:15]
            current_obs = np.squeeze(current_obs)
            if(current_obs.size > 0):
                obs_runoff[lati,loni,:] = current_obs
            # keep values in memory until ready to write a big chunk to NC file
            tempstore = np.zeros((len(ensemble), 12), float)
            for filenum in xrange(0,200):
                output_filename = '%s' % PATH + '/txt/file_' + '%d' % filenum + '/txt/hcube_lat_' + '%.6f' % LL[i,0] + '_long_' + '%.6f' % LL[i,1] + '.txt'
                if(os.path.isfile(output_filename)):
                    try:
                        output = np.loadtxt(output_filename)
                    except:
                        continue # skip job files that fail to load
                    # write the VIC output data
                    startix = filenum*50
                    tempstore[startix:(startix+output.shape[0]),:] = output # ensembles x months
            vic_runoff[lati,loni,:,:] = tempstore # write all ensembles to netcdf
            cellsdone = cellsdone + 1
            print cellsdone
root_grp.close()

Compiling OpenSees on the Janus cluster

We said this blog was about “research” not just “water-related research” so here’s a completely random set of instructions for you!

I recently had a conversation with a structural engineer about combining some of our optimization techniques with an open source earthquake engineering performance simulation called OpenSees. Talk about long evaluation times… this simulation takes minutes! Maybe even 10 minutes! And it contains thousands of files! But here goes.

First, go to the website (the link is above), and sign up for an account. Once you have an account, the website is pretty easy to navigate. The download link allows you to get executables, but what we really want is the source code. If you click the source code link, you find a list of all the files… but what they want you to do is actually use Subversion to download the code to your own machine.

Here’s the trick! I found a helpful forum post linked here, that explains where to find the download instructions. When you click on the Source Code link, you’re going to see a bunch of folders. Navigate to trunk/MAKES/ and you should see a whole bunch of Makefiles for different systems. If you’re using one of the big computing clusters such as STAMPEDE, follow the instructions for their file. You’ll probably not need to do anything else, but sometimes you have to adjust things here and there for your own system.

So if you’re at the University of Colorado, you are going to be compiling on Janus, which uses the REDHAT operating system. So click on the Makefile.def.EC2-REDHAT-ENTERPRISE link. Here’s a link to it but I’m not sure if the link works properly because it’s a complicated URL.

See where it says “Following are commands to build OpenSees”? That’s what you want to follow! But, a lot of these commands aren’t needed on your own system because the packages like gcc and tcl are already installed. Ok so here goes.

1. Log on to Janus in the normal way. You need to load some modules for yourself that will help you. To load them, type (each line should be followed by an ‘enter’):

module load tcl/tcl-8.5.13
module load gcc
module load tcl/activetcl-8.5.13

2. When you are first logged in, you are typically in your home directory. Type ‘pwd’ to confirm where you are. It will look like /home/[your identikey]. Remember, you can always get back to your home directory by typing cd ~.
3. The first thing the commands want you to do is to create two special directories in your home folder called bin, where executable programs are stored, and lib, where libraries are stored. These commands are:

mkdir bin
mkdir lib

4. Now, use svn to obtain the code. Note, I am copying the commands here but please refer to the actual list of commands in the Makefile from the website, because those will be the latest commands!

svn co svn://opensees.berkeley.edu:/usr/local/svn/OpenSees/trunk OpenSees

This will take a few minutes. What it is doing, is downloading all these source code files to your account on Janus, to get ready for you to compile. If you need a refresher on what compiling is, I have some comments about this on a related post here, and I’ve also posted about makefiles too.
5. At this point, you’ve created the special directories in your home directory, and also downloaded all the data. Next, you’ll want to change into the new directory that you just created:

cd OpenSees

and of course, you can make sure you’re in the proper place by running the pwd command.
6. Now that you’re in the OpenSees directory, you want to basically pick out the proper Makefile for you to use. The commands show you how to do this:

cp ./MAKES/Makefile.def.EC2-REDHAT-ENTERPRISE ./Makefile.def

What this command does, is it says, “There are like, dozens of makefiles in the directory… which one am I supposed to use?” and what you do is copy the specific makefile into a generic makefile in your OpenSees directory called, simply, Makefile.def.
7. Now you’re going to want to use a text editor to edit the Makefile.def for your own purposes. I like to use emacs, but there are others. To open the file for editing, type:

emacs Makefile.def

This is a really nice makefile; the folks at Berkeley have done a great job. However, there are a few things you need to edit to make it work on the University of Colorado system, in my experience. They are listed in the following steps.
8. If you scroll down to section 2, the value of the BASE variable is set to /usr/local. I got rid of this and just wrote BASE =
9. Take a look at section 3, which lists the libraries. The nice thing is that all of these libraries will be created for you and placed in the lib folder you created in your home directory. But watch out for the tcl library! The path for this library is not correct. By my calculation, on the University of Colorado computer the path should be:

TCL_LIBRARY = /curc/tools/x_86_64/rh6/tcl/8.5.13/lib/libtcl8.5.so

10. Section 4 tells the makefile where to look for the compilers. This also needs to change on the Colorado computer! The default commands have /usr/bin in front of them but we need to change them to:

C++ = g++
CC = gcc
FC = gfortran

11. The last part of Section 4 sets the flags that tell the compiler how to actually compile and link the code. In the LINKFLAGS variable, there are two options: -rdynamic and -Wl. Remove the -Wl option.
12. To save the file, type Control-x Control-s. Then, to exit emacs, type Control-x Control-c.
13. We are almost ready to compile, but first triple check you are in the correct directory. When you type ‘pwd’ you should see:

/home/[identikey]/OpenSees

If you don’t see that, type: cd ~/OpenSees.
14. Type:

make

and if everything went right, you should be all set! Note, this will take about 15 minutes.
15. To run the program from any directory, type ~/bin/OpenSees. In other words, the executable is in the bin folder inside your home directory (~/bin).

Now don’t ask me the first thing about what this program does… I just helped you compile it. 🙂

Invoking a Matlab wrapper while running a Parallel C process

As you can tell by my posts lately (thoughts on parallel MOEAs, and thoughts on compiling MODFLOW), I am working on a project where we are going to have to link a lot of codes together that may not like each other.  I guess it’s like little kids in the backseat of a car on a long trip!  So to continue this I wanted to talk about an interesting idea on extending the “Frankenstein’s monster” of these linked codes a step further…

The workflow

A parallel MOEA is coded in C.  It is running a master-worker paradigm in which all the workers are dutifully evaluating their simulations.  Note that each worker must work on his own individual input and output files!

The worker has to call on a MATLAB wrapper, which massages some input data and writes some files that are needed to run a simulation.  MATLAB has to somehow figure out what node it is running on!

Finally, MATLAB calls an executable that runs a simulation code written in a Mysterious Language.  The executable communicates with the rest of the world via input and output text files.

A problem

We know that we can call MATLAB using a system command. Consider trying to run a MATLAB file called myMatlabFunction.m. The command to invoke MATLAB is simply:

matlab -r myMatlabFunction

but note! MATLAB doesn’t have the luxury of having the MPI setup, so it doesn’t know what processor rank it is! In fact, MATLAB is inherently ‘stupid’, in the sense that it is a new process that has been spawned off of the calculations that the worker node is doing. So how do we tell the MATLAB process its rank? This is needed, of course, because the input filename for the simulation is going to have, within it, the name of the rank of the process.

In the C code, all we need to do to determine the rank of the processor is:

#include <mpi.h>
int rank;
MPI_Comm_rank (MPI_COMM_WORLD, &rank);/* get current process id */

My first idea was to include a parameter in the MATLAB function that would accept ‘rank’, and then the call to MATLAB would simply pass the parameter in parentheses! I would post where I found out how to do this, but I’m not going to because (wait for it) this is not a valid approach in Unix!

In fact, to do what we need to do, according to this helpful post on the Mathworks website, we need to set environment variables on the machine. Then, Matlab can read the environment variables.

The Solution

In your C code, first convert the rank to a char array, and then you can use the setenv command to set an environment variable. Note! For debugging purposes, it is helpful to save every input file that you create. Granted, this will mean that 1000s of files will be saved at once, but the benefit is that you get to see that the input files are created properly before you go destroying them. Therefore, there’s a second environment variable that you need to set which is the iteration. Essentially you will have a matrix of input files: ‘rank 2, iteration 5’ is the 5th iteration that rank 2 has worked on, etc. So:

sprintf(rankAsString, "%d", rank);
sprintf(iterationAsString, "%d", myCounter);

setenv("RANK", rankAsString, 1);
setenv("ITERATION", iterationAsString, 1);

To clarify, myCounter is a global variable that you increment up every time your function is called. So now, you have environment variables on your processor that can be read in by other programs to indicate the rank and iteration. This is important because your input file has those variables embedded in its name:

sprintf(theFilename, "input_rank-%d_iteration-%d.txt", rank, myCounter);

Now let’s look at the MATLAB code. To test initially, I just did something silly. All I did was have the MATLAB code read in the variables, and then spit them back out in a new file. If it can do that, it can obviously do more complicated things. Luckily MATLAB has a function that can read environment variables (getenv). You can also craft yourself some test output that you can look at to make sure the process is working:

function myMatlabFunction

rank = getenv('RANK');
myCounter = getenv('ITERATION');

disp('Rank is:');
disp(rank);

disp('Iteration is:');
disp(myCounter);

filename = ['input_rank-' num2str(rank) '_iteration-' num2str(myCounter) '.txt'];

outFilename = ['output_rank-' num2str(rank) '_iteration-' num2str(myCounter) '.txt'];

fid = fopen(filename);
for i = 1:11
	  vars(i) = fscanf(fid, '%f', 1);
end
fclose(fid);

disp(vars);

outid = fopen(outFilename, 'w');
fprintf(outid, '%f %f %f %f %f %f %f %f %f %f %f\n', vars);
fclose(outid);

exit

Note here that the number of variables is hard coded as 11, but this can be easily changed.

Conclusion

As you can see, just because you need to use source code to run a Parallel MOEA, that doesn’t mean that you can’t employ wrappers and executable simulation codes within your workflow. Granted, the use of wrappers is going to slow you down, but for a model with a long evaluation time, this won’t likely matter that much. As usual please comment and question below.

Thoughts on using models with a long evaluation time within a Parallel MOEA framework

Here is the result of some discussions I’ve had lately on a new project of mine. I tried to make the insights as generic as possible, so you can apply this to your own work!

Why will you need to parallelize your search?

Consider an engineering simulation that takes about 2.5 minutes on a brand new personal computer.  You’ve done some timing on your code, and you realize about  80% of the time is in the model’s actual calculation.  Sure, there may be some added benefit that you could gain by making the input and output to the model more efficient, etc.  But you’re not going to get the time down that much (you could argue the lowest amount of time the simulation would take is 2 minutes, removing all i/o overhead).  Let’s assume a  2.5 minute evaluation time and run some numbers:

In 60 minutes, we can do 24 evaluations.  Therefore, in 24 hours we can do 576 evaluations.

In some prior work, we saw that you could get a very good answer to the problem in about 80,000 function evaluations.  This new problem may actually be harder than the old one!  So I am guessing we would like to do at least 80,000 function evaluations per trial here.  You can see that it would take much longer than a day to do a search run (in fact, it would take 138 days, which is longer than we have to run a single job on the computer).

Let’s say we were able to cut down the computational time by 50% (which would be a huge undertaking); then we would only gain a 2x speedup, and it would still take 69 days.

Not to mention the fact that really long jobs, with durations on the order of days, take a long time to get through the queue on the system.

As a final thought, we could conceivably cut down the number of function evaluations.  But my guess is we would have to run (at the very least) about 10,000 function evaluations, which would take 17 days.  And it would probably require a lot more justification than we’re prepared to give.  Right now, our argument is that we look at the Pareto approximate set and cut the search off when it stops improving.  If we cut down to 10,000 function evaluations, that argument would no longer be valid.

Running the numbers for parallelized search

Borg has been shown to have scalable performance on tens of thousands of cores, in a forthcoming paper by Dave Hadka.

Scalable means two things.  First, just thinking about scalability in parallel computing, perfect scalability means that with N processors, you can do the work N times faster.  What takes 10 hours on a single PC would take 1 hour on 10 nodes.  Computer scientists rarely get perfect scalability, mainly because of the overhead of passing messages back and forth or coordinating processes (more on that later).  In parallel MOEA work, scalability must also include the quality of the solutions you are getting out of the search.  It is very possible for MOEA search to fail, so if the search does poorly, the former definition of scalability doesn’t make much sense.  So Dave et al. have developed ways to measure scalability that also take into account solution quality.

But, to make a long story short, let’s assume perfect scalability and run the numbers again.

We can do 24 evaluations in one hour.  But now, scaling to 512 processors, we can do 24*512 = 12,288 evaluations in one hour, and the calculations of run time become similar to our last project.  We will get a halfway decent answer in 1 hour, and we can run the same amount of computation in 8 hours or so.  In fact, using the max queue time of 24 h, we could get 294,912 evaluations done in one day.

Parallel Computing 101

Shared Memory: You are likely reading this on a shared memory machine as we speak.  In shared memory, there are N processors, and they all share access to the same memory.  There are programming paradigms for these systems, such as the ‘pthreads’ (POSIX threads) library.  The challenge of shared memory programming is preventing so-called “race” conditions: that is, Processor 1 tries to edit a variable while Processor 2 is working on the same variable.

Although the nodes have multi-core (shared memory) chips, we are not using them that way here, and instead we are dealing with…

Distributed Memory: Each of the N processors in a distributed memory system has its own memory.  Processor 1 never “sees” what Processor 2 is doing unless Processor 2 shares what she is doing with Processor 1.  The programming paradigm here is the Message Passing Interface (MPI).  A processor only receives information from another processor if a ‘message’ is received from somewhere.  The challenge is that coordinating the messages can be tricky, and all processors work at different rates, so you could have a situation where Processor 1 is waiting for slowpoke Processor 8 to do his job, and he never sends the message, and the program locks.  This can happen even when all processors are homogeneous, and you can imagine it’s worse when you have heterogeneous processors (i.e., some faster than others).

An additional challenge of MPI programming is that all processors see the same filesystem.  Two processors can’t edit the same file at once, without massive, horrible errors.  Keep this in mind in the next section.

Some pseudocode of the algorithm process

Parallel MOEAs can work in a number of different configurations, but the important one for us is called Master-Worker.  Basically, there is one and only one Master or Coordinator, and then a whole bunch of workers.  The Master’s job is to do the search calculations.  The Workers’ job is to evaluate the simulation model — take decision variable values and calculate objectives.  The Workers have no other job, and the workers cannot “see” what the others do.

Code to implement a Parallel MOEA is described in the following sections.

main()

The main program calls several functions.  First, you need code that initializes your own problem: for example, reading in the input data that doesn’t change over the run (a function called something like myProblemInit).  Then the algorithm is set up; for example, the random seed is set and the algorithm parameters are defined.

Finally, the algorithm is run, with a pointer to a function that contains your problem (see myProblem below)

myProblem(obj, vars, consts)

Ok, so now you need to write your own version of the “problem” function.  There may be only one task to do, or multiple.  For example, if you need to perform some kind of transformation on the decision variables you can do that here.  But otherwise, all you’re doing is actually calling the real simulation model.  Perhaps this simulation code comes from somewhere else… it’s for a version of the model that has never been attached to an optimization framework before.  It looks like:

callSim(obj, vars, consts)

Here, you may very well have an executable call to a completely external program that runs your simulation.  Maybe that external program even has a separate wrapper attached to it, in Python or MATLAB.  Steps in such a function will probably look like this:

1. Note that parameters from initialization are already in memory.  A lot of parameters don’t change from evaluation to evaluation.
2. Write input files, such as a file called inputData.dat
3. Call an external program that does the calculations
4. Interpret the output file, which is called something like outputData
5. Assign the objective values to the variable, obj which is passed by pointer into this function.

Hey, wait, there’s a problem with steps 2 and 4 above!

Ahh, good.  You realized it.  If we recall from our discussion of MPI on clusters… all processors share the same filesystem!  So each worker is going to be creating an input file with the same name (the inputData.dat from the steps above, or in our case a file called basicReaction.well), and all of them are going to be editing the same file at the same time, which cannot work.

The solution is to create a separate input and output file for each processor to ensure that this filesystem bug doesn’t happen to us.  MPI allows this because of a built-in function called MPI_Comm_Rank which tells each processor who he is in the world (existential, isn’t it?)

You can use the processor’s rank directly in the myProblem function above, and that way the processor will always use his own unique file and not confuse data from the other processors.
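The code in this project is C, but just to illustrate the per-rank filename idea compactly, here is a minimal Python sketch using mpi4py (an illustration only; the real workflow uses MPI_Comm_rank in C exactly as described above). Each process writes only to a path tagged with its own rank and iteration counter:

from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
my_counter = 0   # incremented once per function evaluation on this worker

def write_input_file(decision_vars):
    global my_counter
    my_counter += 1
    # every process writes only to a file tagged with its own rank and iteration
    filename = 'input_rank-%d_iteration-%d.txt' % (rank, my_counter)
    with open(filename, 'w') as f:
        f.write(' '.join('%f' % v for v in decision_vars) + '\n')
    return filename

print(write_input_file([1.0, 2.0, 3.0]))

Run it with something like mpirun -np 4 python, and each of the four processes produces its own uniquely named input file, so no two processes ever touch the same path.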

Conclusion

I hope this has been helpful!  Increased computing power with technologies like clusters and cloud computing are opening up more and more types of problems for many objective analysis.  With a little bit of skill in writing system calls and wrappers, you can use tools like Borg to do all sorts of fun things.  Just be careful to understand the unique properties of your parallel system when creating the code for your project.  Feel free to add comments and questions in the comments below!

Compiling MODFLOW2005 on the Janus computing cluster

Perhaps you are new to working with source code, and you want to know how to compile your first program on a supercomputer or cluster. Here are some instructions on how to do so.

The USGS site (http://water.usgs.gov/ogw/modflow/MODFLOW.html#downloads) has a link that says “MODFLOW-2005 source code and sample problems converted to the Unix file structure”.

I downloaded the folder, extracted it, and transferred it to Janus. If you don’t know how to transfer files, use Globus Connect or another service, as documented in our post about what software to install and our post on cluster commands.

The folder contains some subfolders, /doc/, /src/, /test-out/, and /test-run/. The source code is in /src/.

When you look at source code, different folks break it up in different ways. Sometimes the actual code is in /src/ and then the actual executable files are stored in a ‘binary’ folder called /bin/. Or something like that. But anyway, what you want to look for is a file called a makefile. Luckily, the friendly folks at USGS have put the makefile in the /src/ folder.

On Janus, navigate to the /src/ folder and type the command:

make

and this will compile the files. Of course, there is a problem (there’s always a problem!) This time it says:

f90: command not found

After looking at the webpage for compilers at research computing (https://www.rc.colorado.edu/support/userguide/compilersandmpi), it turns out that they have several families of compilers we can load up. In Linux/Unix, the GNU compilers are standard; there is also the Intel family of compilers.

For our purposes, we will use the GNU fortran compiler. First, to load that family of compilers run the command:

module load gcc

The makefile for the project needs to know that you are using the gfortran compiler. So, edit the Makefile in the /src/ folder. Instead of modifying the original line, it’s always a good idea to comment it out. So replace:

F90= f90

with:

#F90= f90
F90= gfortran

and save the file. Now, when you run:


make

you’ll see some magic happen. If you see a few warnings that’s ok.

Finally, to run the program, simply type the name of the executable:


./mf2005

I don’t know anything about MODFLOW, but this output indicates that it’s working:

MODFLOW-2005
U.S. GEOLOGICAL SURVEY MODULAR FINITE-DIFFERENCE GROUND-WATER FLOW MODEL
Version 1.11.00 8/8/2013

Enter the name of the NAME FILE:

Voila! Now follow the readme and documentation to actually try some examples. But hopefully this will help someone out there that is new to compiling and using clusters.

When coloring stacked solutions in Aerovis, make sure colorbar is first turned off.

The above is self-explanatory and probably very obvious to some people, but I became very frustrated the other day when my attempt to color stacked solutions with the data table tool made my image disappear. Now I realize that it is necessary to adjust the axes in the plot control menu to turn off the colorbar before choosing colors for each set in the data table tool.

After that I found this post from Matt helpful in trying to use Illustrator.

Job dependency in PBS submission

In one of our current projects, we’re running a bunch of batch jobs simultaneously. When a single job enters the queue, everything is fine. But when 15-20 of our jobs enter the queue at the same time, the filesystem slows to a crawl, and the jobs wind up exceeding the walltime.

So what we’d like to do is ensure that only a certain number of these jobs can run simultaneously. The brute force solution is to sit there and stare at the job queue, but that’s not very appealing. Instead we can use PBS job dependencies. (This is very helpful, and I can’t believe I’m just learning about it now).

Job dependency works like this:

qsub -N job_name -W depend=afterok:other_job_id my_job_script.sh

The -W argument to qsub just means “other options”. There are a number of different dependency options you can give; see this discussion for a more complete list. Here I’m using afterok, which means “only run after this other job has completed without errors”. In addition to solving our main problem, this also ensures that we won’t keep running the entire set of submissions if there’s an error in one of the jobs. Note that you can append multiple job ids to the afterok option, like this:

qsub -N job_name -W depend=afterok:job1:job2:job3 my_job_script.sh

Here is an example of submitting 200 separate jobs in a loop, where I only want to run a maximum of 7 at a time. The jobs are divided into “globs” (my non-technical term) where each glob will not start running until the previous glob has completed without errors.

#!/bin/bash
# This submission will submit 200 jobs in "globs",
# Where one glob will not begin until the previous one finishes without errors
FILENUMS=$(seq 0 199)
MAX_SIMULT=7 # max number of simultaneous jobs
GLOBNUM=0
GLOBNUM_PREV=0
DEPEND=""
DEPEND_PREV=""
for FILENUM in ${FILENUMS}
do
    NAME=hcube_${FILENUM}_9p_50
    echo "Submitting: ${NAME}"
    GLOBNUM=$(($FILENUM / $MAX_SIMULT))
    # the first MAX_SIMULT jobs have no dependency
    # submit the jobs and append to string
    if [ $GLOBNUM -eq 0 ]
    then
        DEPEND=$DEPEND:`qsub -N ${NAME} -v NUM=${FILENUM} Run_Hcube.sh`
    # Else if the counter ticks ahead to the next "glob" ...
    # Grab the existing job dependencies and start building the next string
    elif [ $GLOBNUM -ne $GLOBNUM_PREV ]
    then
        DEPEND_PREV=$DEPEND
        DEPEND=:`qsub -N ${NAME} -W depend=afterok${DEPEND_PREV} -v NUM=${FILENUM} Run_Hcube.sh`
        GLOBNUM_PREV=$GLOBNUM
    # If we're in the same glob as the previous iteration, use the same dependency string
    else
        DEPEND=$DEPEND:`qsub -N ${NAME} -W depend=afterok${DEPEND_PREV} -v NUM=${FILENUM} Run_Hcube.sh`
    fi
    sleep 0.5
done

Caveat emptor: I am still in the process of testing this. But regardless of this particular implementation, it’s a cool idea that should be useful in the future.