Compiling Code using Makefiles

For people new to coding or to using supercomputers to submit jobs, compiling can be a new concept. In programming, a compiler takes source code (e.g. written in C/C++ or Fortran) and translates it into a lower-level language (e.g. assembly language or machine code). When a compiler runs successfully, the source code is converted into an executable program that the computer knows how to run. For more info on compilers, check out this video.

To compile a program, we use the ‘make’ command. When we have multiple source files (which we often do when running complex water management models), a makefile helps organize the directions to give to the compiler. If you are not creating a model from scratch, you may already have an existing makefile, conventionally named ‘makefile’ or ‘Makefile’. If that is the case, compiling is easy as long as all the files are in the proper directory: simply type ‘make’ in the command line interface to compile your code.

If you would like to edit your makefile, create one from scratch, or just want to learn more about the ‘make’ command and makefiles, check out the resources below:

Introduction to ‘make’ and compiler options:

Introductory tutorials for makefiles:

Makefile naming:

Makefile macros:

Compiler flags:

For a list of compiler flags, see the manual or ‘man’ page for the compiler: e.g. $ man <compiler_name>. The makefile macros that hold compiler flags are conventionally named CFLAGS for gcc and CXXFLAGS for g++ (CPPFLAGS, used for g++ in the example below, strictly refers to preprocessor flags but is commonly used this way).

Example Makefile (C or C++):

The conventional file organization for this work is to create a src (or source) and a bin (or binary) directory. The source code goes in src/ while the makefile and any input files go in bin/. Once the executable is created, it will be located in bin/ as well. Below is a truncated version of a makefile I made for a water treatment plant model (written in C), based on a makefile I found for LRGV (written in C++). In general, there is little difference between a makefile for a program written in C and one written in C++, so this template can be used for either language by simply commenting or uncommenting the makefile macros for the language you want to compile (e.g. CC, CXX, CFLAGS, CPPFLAGS). One special difference between compiling C and C++ is that the math library behind the common header math.h is automatically linked for C++ but must be explicitly linked for C. You’ll notice that I’ve explicitly linked it in the code below, so as long as the C code I’ve written is also valid C++, I should be able to use either compiler, since C is (mostly) a subset of C++. Enjoy!

MAIN_DIR=./.. #from within /bin go to main directory which contains /bin and /src directories

SOURCE_DIR = $(MAIN_DIR)/src #navigate to directory which contains source code

SOURCES = \
   $(SOURCE_DIR)/basin.c    \
   $(SOURCE_DIR)/breakpt.c  \
   #I’ll leave out some files for the sake of space
   $(SOURCE_DIR)/unittype.c \
   $(SOURCE_DIR)/uptable.c  \
   $(SOURCE_DIR)/writewtp.c \

OBJECTS=$(SOURCES:.c=.o) #name object files based on .c files
CC=gcc #select the compiler: this one is for C
#CXX=g++
CFLAGS=-c -O3 -Wall -I. -I$(SOURCE_DIR) #set flags for gcc
#CPPFLAGS=-c -O3 -Wall -I. -I$(SOURCE_DIR) #set flags for g++
LIBS=-lm  # link libraries (e.g. math.h with -lm: goo.gl/Bgq75V)
EXECUTABLE=wtp_v2-2_borg-mp #name of the executable file
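# NOTE: the recipe lines under each target below must begin with a tab character (not spaces), or make will stop with a "missing separator" error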

all: $(SOURCES) $(EXECUTABLE)
    rm -rf $(SOURCE_DIR)/*.o

$(EXECUTABLE): $(OBJECTS)
    $(CC) $(OBJECTS) -o $@ $(LIBS)

.c.o:
    $(CC) $(CFLAGS) $^ -o $@

clean: #’make clean’ will remove all compilation files
    rm -f $(SOURCE_DIR)/*.o $(EXECUTABLE)

Using the Python `cf` and `esgf-python-client` packages to interact with the ESGF data portal

Working on climate change impact assessment often requires downloading a number of outputs from General Circulation Models (GCMs) to depict plausible changes in projected future conditions, as well as to use them for conventional top-down scenario analysis. These scenario data are usually publicly available for download and are stored in NetCDF format on data portals built on the Earth System Grid Federation (ESGF) architecture. Yet the procedure for downloading these data, which typically span a lengthy time horizon, can be very cumbersome and requires the following steps:

  • create a username and credentials on the ESGF data portal;
  • define the search entries to narrow down the results;
  • generate the *.wget file for batch downloading over the secured OpenDAP protocol;
  • run the *.wget file to download all the raw data onto the local disk.

The downloaded climate projection data are usually stored in NetCDF files with irregular coordinate systems (e.g., a rotated pole), segmented into small time periods, and available only at large continental scales, which is far more than is needed for a regional case study.

In this post, I would like to share some of my experience and tools for working with the ESGF data portal and for streamlining the process. The post touches on both NetCDF and the Python language, which have already been well introduced on our Water Programming blog. I strongly suggest revisiting those posts if anything is unclear, especially since I am an expert in neither NetCDF nor Python.

The first tool is the cf-python package, which implements comprehensive tools for reading, writing, and processing NetCDF data. Personally, I found it extremely useful for reading/writing NetCDF data, regridding, and extracting a sub-domain from the original dataset. More importantly, given an OpenDAP url, all of those operations can be done on the server side without explicitly downloading the original data onto the local disk! So instead of downloading 10 GB of projection data for the whole of Europe, which takes an hour, you can download just the part specific to your study area within a couple of seconds at a cost of a few MB.
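To give a sense of how this looks, here is a minimal sketch (the OpenDAP url is just a placeholder, and the index ranges are arbitrary); the full, working example further below does this end to end:

import cf

# Placeholder OpenDAP url of a remote NetCDF dataset on an ESGF data node
url = 'http://<esgf-data-node>/thredds/dodsC/<path-to-dataset>.nc'

fields = cf.read(url)        # reads the metadata; the data themselves are fetched lazily
field = fields[0]            # pick the field of interest, e.g., 'tas'
subset = field.subspace[:, 40:45, 49:54]   # extract a small lat/lon window; only this part is transferred
cf.write(subset, 'tas_subset.nc')          # write just the subset to the local disk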

The second tool is esgf-python-client. This package contains the API code for calling the ESGF Search API (whose documentation, by the way, has not been updated for quite a while). With the combined use of cf-python and this client, you are spared the user/password login, the tedious web interface of the ESGF search page (try here), and the manual downloading.

Below I explain things in more detail using an example of downloading European regional downscaling data for the RCP4.5 scenario from one of the ESGF data portals. The example is essentially a few lines of code with inline comments.

from pyesgf.logon import LogonManager
from pyesgf.search import SearchConnection
import cf
import numpy as np

def main():
    ## ====================== To be modified by user ======================
    # Define the search keywords for the facet fields of the ESGF node.
    # Note: the supported facet fields can be listed using:
    # " http://<base_search_URL>/search?facets=*&distrib=false&limit=0 ", where
    # <base_search_URL> is the domain of the esgf-node, e.g., https://pcmdi.llnl.gov/esg-search

    esgf_node = 'https://esgf-node.ipsl.upmc.fr/esg-search'  # you may try a different data portal, as many of them seem to be under maintenance longer than they are actually in service

    Project = 'CORDEX'
    Time_frequency = 'day'  # 'month', '6h', '3h'
    Experiment = 'rcp45'    # 'rcp26', 'rcp85', 'historical'
    Variable = 'tas'        # 'pr' for precipitation
    Ensemble = 'r1i1p1'     # do not change
    Domain = 'EUR-44'       # EUR-44i, EUR-22, EUR-22i, EUR-11i

    # Define the indices of the study region within the full domain. Say the original domain
    # is 100 (lon) by 60 (lat) grid cells, and your study area falls between 50 - 54 in
    # longitude and 41 - 45 in latitude.
    idx_lon = np.array(range(49, 54))  # 49 instead of 50 because numpy uses 0-based indexing
    idx_lat = np.array(range(40, 45))
    ## =====================================================================

    # Connect to the ESGF data portal using your openid and password. You may need a single
    # run of the *.wget file first to retrieve your personal credentials.
    lm = LogonManager()
    lm.logon_with_openid(your_openID, your_password)  # replace with your own OpenID and password

    conn = SearchConnection(esgf_node, distrib=True)

    # Build the query from the facet fields.
    ctx = conn.new_context(project=Project,
                           time_frequency=Time_frequency,
                           experiment=Experiment,
                           variable=Variable,
                           ensemble=Ensemble,
                           domain=Domain)

    # ctx.hit_count  # shows the total number of results found

    # Loop over the hit_count results
    for each_hit in range(0, ctx.hit_count):

        ds = ctx.search()[each_hit]
        files = ds.file_context().search()  # use "len(files)" to see the total number of files found

        # Sort the url list to get ascending timestamps, e.g., from 1/1/2005 to 12/31/2099
        url_list_sorted = sorted([f.opendap_url for f in files])
        filename_out = files[0].filename[0:-21]  # strip the date-range suffix (e.g., "_20060101-20101231.nc") to build the output file name

        data_list = []
        for each_url in url_list_sorted:
            meta_data = cf.read(each_url)  # read the metadata from the OpenDAP NetCDF url
            # In some cases the returned field list contains two fields, and only the
            # second field holds the variable of interest.
            if len(meta_data) > 1:
                data_list.append(meta_data[1])
            else:
                data_list.append(meta_data[0])

        # Aggregate over time: concatenate the files covering short time periods into a
        # single field with the complete time series
        dataset_agg = cf.aggregate(data_list)

        # Slice the dataset to the study region using the .subspace attribute
        if len(dataset_agg.shape) == 4:
            # 4-D data, e.g., [height, time, lat, lon]
            dataset_sliced = dataset_agg.subspace[0, :, idx_lat, idx_lon]
        elif len(dataset_agg.shape) == 3:
            # 3-D data, e.g., [time, lat, lon]
            dataset_sliced = dataset_agg.subspace[:, idx_lat, idx_lon]

        # Save the data in the desired format (e.g., savemat from 'scipy' could be used instead)
        cf.write(dataset_sliced, filename_out + '.nc', mode='w')

    lm.logoff()

if __name__ == "__main__":
    main()

Sorry for the lengthy and poorly organized post, but I hope you will find it useful. For any questions or suggestions, please feel free to contact me (likymice@gmail.com).

Using HDF5/zlib Compression in NetCDF4

Not too long ago, I posted an entry on writing NetCDF files in C and loading them in R.  In that post, I mentioned that the latest and greatest version of NetCDF includes HDF5/zlib compression, but I didn’t say much more beyond that.  In this post, I’ll explain briefly how to use this compression feature in your NetCDF4 files.

Disclaimer: I’m not an expert in any sense on the details of compression algorithms.  For more details on how HDF5/zlib compression is integrated into NetCDF, check out the NetCDF Documentation.  Also, I’ll be assuming that the NetCDF4 library was compiled on your machine to enable HDF5/zlib compression.  Details on building and installing NetCDF from source code can be found in the documentation too.

I will be using code similar to what was in my previous post.  The code generates three variables (x, y, z) each with 3 dimensions.  I’ve increased the size of the dimensions by an order of magnitude to better accentuate the compression capabilities.

  // Loop control variables
  int i, j, k;
  
  // Define the dimension sizes for
  // the example data.
  int dim1_size = 100;
  int dim2_size = 50;
  int dim3_size = 200;
  
  // Define the number of dimensions
  int ndims = 3;
  
  // Allocate the 3D vectors of example data
  float x[dim1_size][dim2_size][dim3_size]; 
  float y[dim1_size][dim2_size][dim3_size];
  float z[dim1_size][dim2_size][dim3_size];
  
  // Generate some example data
  for(i = 0; i < dim1_size; i++) {
        for(j = 0; j < dim2_size; j++) {
                for(k = 0; k < dim3_size; k++) {
                        x[i][j][k] = (i+j+k) * 0.2;
                        y[i][j][k] = (i+j+k) * 1.7;
                        z[i][j][k] = (i+j+k) * 2.4;
                }
        }
  }

Next is to setup the various IDs, create the NetCDF file, and apply the dimensions to the NetCDF file.  This has not changed since the last post.

  // Allocate space for netCDF dimension ids
  int dim1id, dim2id, dim3id;
  
  // Allocate space for the netcdf file id
  int ncid;
  
  // Allocate space for the data variable ids
  int xid, yid, zid;
  
  // Setup the netcdf file
  int retval;
  if((retval = nc_create(ncfile, NC_NETCDF4, &ncid))) { ncError(retval); }
  
  // Define the dimensions in the netcdf file
  if((retval = nc_def_dim(ncid, "dim1_size", dim1_size, &dim1id))) { ncError(retval); }
  if((retval = nc_def_dim(ncid, "dim2_size", dim2_size, &dim2id))) { ncError(retval); }
  if((retval = nc_def_dim(ncid, "dim3_size", dim3_size, &dim3id))) { ncError(retval); }
  
  // Gather the dimids into an array for defining variables in the netcdf file
  int dimids[ndims];
  dimids[0] = dim1id;
  dimids[1] = dim2id;
  dimids[2] = dim3id;

Here’s where the magic happens.  The next step is to define the variables in the NetCDF file.  The variables must be defined in the file before you tag it for compression.

  // Define the netcdf variables
  if((retval = nc_def_var(ncid, "x", NC_FLOAT, ndims, dimids, &xid))) { ncError(retval); }
  if((retval = nc_def_var(ncid, "y", NC_FLOAT, ndims, dimids, &yid))) { ncError(retval); }
  if((retval = nc_def_var(ncid, "z", NC_FLOAT, ndims, dimids, &zid))) { ncError(retval); }

Now that we’ve defined the variables in the NetCDF file, let’s tag them for compression.

  // OPTIONAL: Compress the variables
  int shuffle = 1;
  int deflate = 1;
  int deflate_level = 4;
  if((retval = nc_def_var_deflate(ncid, xid, shuffle, deflate, deflate_level))) { ncError(retval); }
  if((retval = nc_def_var_deflate(ncid, yid, shuffle, deflate, deflate_level))) { ncError(retval); }
  if((retval = nc_def_var_deflate(ncid, zid, shuffle, deflate, deflate_level))) { ncError(retval); }

The function nc_def_var_deflate() performs this.  It takes the following parameters (a rough Python equivalent using the netCDF4 package is sketched after the list):

  • int ncid – The NetCDF file ID returned from the nc_create() function
  • int varid – The variable ID associated with the variable you would like to compress.  This is returned from the nc_def_var() function
  • int shuffle – Enables the shuffle filter before compression.  Any non-zero integer enables the filter.  Zero disables the filter.  The shuffle filter rearranges the byte order in the data stream to enable more efficient compression. See this performance evaluation from the HDF group on integrating a shuffle filter into the HDF5 algorithm.
  • int deflate – Enable compression at the compression level indicated in the deflate_level parameter.  Any non-zero integer enables compression.
  • int deflate_level – The level to which the data should be compressed.  Levels are integers in the range [0-9].  Zero results in no compression whereas nine results in maximum compression.
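As an aside for readers writing NetCDF from Python rather than C, the netCDF4 package exposes the same HDF5/zlib options through createVariable. Below is a minimal sketch (the file name, variable, and values are invented for illustration, not taken from this post):

import numpy as np
from netCDF4 import Dataset

# Create a NetCDF-4 file and one compressed 3-D float variable
nc = Dataset('example_compressed.nc', 'w', format='NETCDF4')
nc.createDimension('dim1', 100)
nc.createDimension('dim2', 50)
nc.createDimension('dim3', 200)

# zlib=True enables deflate compression, complevel plays the role of deflate_level (0-9),
# and shuffle=True turns on the byte-shuffle filter, mirroring nc_def_var_deflate()
x = nc.createVariable('x', 'f4', ('dim1', 'dim2', 'dim3'),
                      zlib=True, complevel=4, shuffle=True)
x.units = 'cm'
x[:] = np.fromfunction(lambda i, j, k: (i + j + k) * 0.2, (100, 50, 200))
nc.close()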

The rest of the code doesn’t change from the previous post.

  // OPTIONAL: Give these variables units
  if((retval = nc_put_att_text(ncid, xid, "units", 2, "cm"))) { ncError(retval); }
  if((retval = nc_put_att_text(ncid, yid, "units", 4, "degC"))) { ncError(retval); }
  if((retval = nc_put_att_text(ncid, zid, "units", 1, "s"))) { ncError(retval); }
  
  // End "Metadata" mode
  if((retval = nc_enddef(ncid))) { ncError(retval); }
  
  // Write the data to the file
  if((retval = nc_put_var(ncid, xid, &x[0][0][0]))) { ncError(retval); }
  if((retval = nc_put_var(ncid, yid, &y[0][0][0]))) { ncError(retval); }
  if((retval = nc_put_var(ncid, zid, &z[0][0][0]))) { ncError(retval); }
  
  // Close the netcdf file
  if((retval = nc_close(ncid))) { ncError(retval); }

So the question now is whether or not it’s worth compressing your data.  I performed a simple experiment with the code presented here and the resulting NetCDF files:

  1. Generate the example NetCDF file from the code above using each of the available compression levels.
  2. Time how long the code takes to generate the file.
  3. Note the final file size of the NetCDF.
  4. Time how long it takes to load and extract data from the compressed NetCDF file.

Below is a figure illustrating the results of the experiment (points 1-3).

[Figure: compress_plot (file size and run time across compression levels)]

Before I say anything about these results, note that individual results may vary.  I used a highly stylized data set to produce the NetCDF file which likely benefits greatly from the shuffle filtering and compression.  These results show a compression of 97% – 99% of the original file size.  While the run time did increase, it barely made a difference until hitting the highest compression levels (8,9).  As for point 4, there was only a small difference in load/read times (0.2 seconds) between the uncompressed and any of the compressed files (using ncdump and the ncdf4 package in R).  There’s no noticeable difference among the load/read times for any of the compressed NetCDF files.  Again, this could be a result of the highly stylized data set used as an example in this post.

For something more practical, I can only offer anecdotal evidence about the compression performance.  I recently included compression in my current project due to the large possible number of multiobjective solutions and states-of-the-world (SOW).  The uncompressed file my code produced was on the order of 17.5 GB (for 300 time steps, 1000 SOW, and about 3000 solutions).  I enabled compression of all variables (11 variables – 5 with three dimensions and 6 with two dimensions – compression level 4).  The next run produced just over 7000 solutions, but the compressed file size was 9.3 GB.  The down side is that it took nearly 45 minutes to produce the compressed file, as opposed to 10 minutes with the previous run.  There are many things that can factor into these differences that I did not control for, but the results are promising…if you’ve got the computer time.

I hope you found this post useful in some fashion.  I’ve been told that compression performance can be increased if you also “chunk” your data properly.  I’m not too familiar with chunking data for writing in NetCDF files…perhaps someone more clever than I can write about this?

Acknowledgement:  I would like to acknowledge Jared Oyler for his insight and helpful advice on some of the more intricate aspects of the NetCDF library.

The Parallel Coordinate Plot with Linked Heatmap: A tool for simultaneous visualization of the objective and decision spaces

Introduction

When making real world decisions, the individual decisions that comprise a solution are often just as important to decision makers as the objective values that the solution generates. When evaluating MOEA search results for such problems, analysts need a tool that allows them to simultaneously visualize solutions in both the objective and decision spaces. The Parallel Coordinate Plot with Linked Heatmap tool was created to fill this need. The tool was built off of the D3 Interactive Parallel Coordinate tool that Bernardo introduced in his post a few weeks ago. As you may be able to guess from its name, the Parallel Coordinate Plot with Linked Heatmap tool links a heatmap containing decision space values with an interactive parallel coordinate plot containing objective space data. The tool can be found here.


Figure 1: The Parallel Coordinate Plot with Linked Heatmap tool demonstrated with a six objective problem with 31 decision variables.

Each axis on the parallel coordinate plot represents an objective function, while each line across the axes represents an individual Pareto optimal solution. Each row of the heatmap also represents an individual Pareto optimal solution, while each column corresponds to a decision variable. The shading of the heatmap cells represents how much of a decision variable is being used in each solution.

The tool allows analysts to simultaneously view the decision space along with the objective space in two different ways. First, the grouping functionality, carried over from the original parallel coordinate plot tool, allows the analyst to group her data based on decision space categories. These categories are then plotted as different colors on the parallel coordinate plot. The fraction of solutions contained within each group can be observed in the ring plot in the top left corner of the screen. This ring plot will update dynamically as data is brushed, reflecting how many of the brushed solutions come from each group.

The second way the tool can be used to visualize the decision space along with the objective space is through the linked heatmap. The heatmap shows the relative usage of each decision variable for each Pareto optimal solution. As the parallel coordinate plot is brushed, the heatmap will dynamically update, reflecting only solutions within the brushed dataset.

Uploading Data

To upload data, follow the procedure outlined in Bernardo’s post. For this tool, the input is a .csv file containing both decision variable and objective function values for each Pareto optimal solution. Like the input for Bernardo’s tool, this file should have “name” and “group” as the headings of the first two columns. Unlike the input for Bernardo’s tool, the next columns should hold the decision variables, with the name of each variable as the heading. The final columns should hold the objective values. The tool will automatically recognize the last six columns as objective values, no matter how many decision variables are present. (The tool is currently built to handle only six-objective problem formulations; I’m working on making it flexible to any number of objectives and will post an update once that is complete.) For best use of the tool, decision variable values should be normalized to the highest value of each decision variable. An example input can be seen in Figure 2.


Figure 2: Example input data format; note that the objective values are in the last six columns
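In case the figure does not render, a hypothetical input with three decision variables (all values invented purely to illustrate the layout) would look like this in the .csv file:

name,group,DV1,DV2,DV3,Obj1,Obj2,Obj3,Obj4,Obj5,Obj6
soln1,1,0.20,0.00,1.00,0.95,12.1,3.4,0.7,5.2,0.1
soln2,3,0.80,0.50,0.10,0.99,30.6,2.2,0.4,4.8,0.3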

 

Using the Tool

For purposes of illustration, I’ll demonstrate how the tool can be used on a six objective water supply problem that yielded a Pareto optimal set of 2600 solutions. This demonstration will follow Daniel Keim’s procedure for analyzing complex data, broadly outlined as: analyze and show the important, brush/filter, analyze further, show the important, and provide details on demand.

Step 1: Analyze and show the important

The full Pareto set can be analyzed for broad patterns within the decision and objective spaces. The distribution of colors (representing grouping) on the parallel coordinate plot can be used to determine any correlation between certain groups of solutions and resulting objective values. The ring plot in the upper right corner can also be used to determine which groups of solutions are most prevalent within the Pareto set.


Figure 3: The parallel coordinate plot of the unbrushed Pareto optimal solutions to the six objective problem. Note that groups four and five, colored green and purple, are most prevalent in the full Pareto set.

The heatmap can also provide some useful insight into the nature of the full Pareto set. Though the heatmap cannot provide helpful insight into single solutions when 2600 points are plotted simultaneously, it can provide insight into broad trends within the data. For example, when examining Figure 4 below, one can see that certain decision variables appear as columns of nearly solid yellow or blue. Solid yellow columns indicate decision variables that are not frequently employed in Pareto optimal solutions, while blue columns indicate that a decision variable is very frequently employed over the entire set.  In the example data set, decision variables 11 and 23 are rarely used, while decision variables 1 and 4 are used in the vast majority of solutions.


Figure 4: Heatmap of unbrushed Pareto optimal solutions and their corresponding decision variables.

Step 2: Brush/Filter the data

Once broad trends in the Pareto optimal set have been analyzed, the analyst can brush the data to meet satisficing criteria. Objectives of the highest importance can be brushed first. For this example, I first brushed the data to .99 reliability and above; after brushing, I hit the “keep” button in the upper left to rescale the axes to the brushed data.

 


Figure 5: Brushed Parallel Coordinate plot. The data is the same in both (a) and (b), but (b) shows the plot after the “keep” button, circled in red in (a), has been pressed, which rescales the axes.

The heatmap of decision variables is dynamically updated as the parallel coordinate plot is brushed. An examination of the new heatmap can yield insights into connections between highly reliable solutions and the decision variables that comprise them. The heatmap still contains too many rows to provide insight into the decisions that compose individual solutions, but its color patterns can be compared to the original heatmap. Columns 21, 25, and 27 are noticeably darker in Figure 6 than in Figure 4, meaning that those decision variables are employed more often in solutions with greater than .99 reliability than in the overall Pareto set. Conversely, columns 19, 29, and 30 are noticeably lighter in Figure 6 than in Figure 4, meaning that those decision variables are not employed as commonly in higher reliability solutions.


Figure 6: Heatmap of solutions brushed to 99% reliability criterion

Step 3: Analyze Further

I then brushed the Pareto set again, yielding a more refined set of solutions. Figure 7 shows the Pareto set brushed to the satisficing criteria reliability > 99% and storage < 30. The keep function was again used to rescale the axes.


Figure 7: The Pareto set brushed to the satisficing criteria of >99% reliability and <30 storage.

The heatmap can again be used to compare brushed solutions to the entire set. The heatmap in Figure 7 reveals that decision variables 2, 19, and 29 are employed in only three solutions, while columns 21, 25, and 27 are still very dark.

Step 4: Show the important

I then brushed the Pareto set a final time, which yielded a much smaller set of acceptable solutions. Figure 8 shows the Pareto set brushed to satisficing criteria of reliability > .99, storage < 30, and supply < 5. The parallel coordinate plot illuminates tradeoffs between the objective values of solutions, while the heatmap can be used to illuminate tradeoffs within the decision space. It can be noted from the heatmap that many decision variables are unused in any solution that meets these satisficing criteria; this may help analysts simplify future problem formulations.


Figure 8: Parallel Coordinate Plot and Linked heatmap shown for brushed criteria Reliability > .99, Storage < 30 and Supply < 5

It should also be noted that the proportion of solutions from each group in the brushed set differs from that of the entire Pareto set. While the most common group of solutions in the entire set was Group 5, no solution from Group 5 remains in the brushed set. Group 2, which represented only a small fraction of the entire Pareto set, makes up a significant portion of the solutions that meet the satisficing criteria.

Step 5: Provide details on demand

Information on individual solutions can be found either by hovering over a cell of the heatmap or by hovering over the tabulated row of a given solution, which will highlight it on the parallel coordinate plot.

Conclusion

This tool can assist analysts in understanding tradeoffs within the decision space as well as the objective space. It can also help identify broad mappings between decision variables and their effects on objective values. It was designed to be a dynamic and easy-to-use tool to assist analysts during an iterative decision making process that employs an MOEA. The tool is still a work in progress, so stay tuned to the blog to hear about improvements to its interactive features.

Python’s template class

Many analyses require the same model to be run many times, but with different inputs.  For instance a Sobol sensitivity analysis requires thousands (or millions) of model runs corresponding to some strategic sampling of the factor space.  Depending on how complicated your model is, facilitating hundreds or thousands of runs may or may not be straightforward.  Some models require a unique configuration file, so performing a Sobol analysis is not as simple as changing a vector of numbers passed to an executable.

A very simple solution suggested by Jon Herman in an earlier post is to use Python string templates.  It is such a handy tool that I thought it deserved its own post.  We’ll use the Template class from Python’s string module.  The Template class has two substitution methods: substitute and safe_substitute.  The difference between them is that substitute will throw an exception (e.g. a KeyError for a missing key) if there is some problem filling the template, whereas safe_substitute will not.  These methods work essentially like standard $-based substitutions in Python, but rather than altering a single string, we can alter an entire document and then save it with a unique name.
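As a quick illustration of that difference (a minimal sketch, not from the original post):

from string import Template

t = Template('$who is from $where')
print(t.safe_substitute(who='Bill'))  # prints 'Bill is from $where' -- the missing key is left in place
# t.substitute(who='Bill')            # would raise KeyError because 'where' was not supplied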

Let’s consider a simple example first where we modify a single string:

from string import Template
s = Template('$who is from $where')

d = {}
d['who'] = 'Bill'
d['where'] = 'Boston'

p = s.substitute(d)

Which returns the string Bill is from Boston…lucky Bill.  Now we can get a bit fancier with lists of people and places:

from string import Template
s = Template('$who is from $where')

people = ['Bill', 'Jim', 'Jack']
places = ['Boston', 'London', 'LA']

p = {}
cnt = 0
for person in people:
    for place in places:
        d = {}
        d['who'] = person
        d['where'] = place
        p[cnt] = s.substitute(d)
        cnt = cnt + 1
Which returns p as a dictionary of every combination of people and places:

Bill is from Boston
Bill is from London
Bill is from LA
Jim is from Boston
Jim is from London
Jim is from LA
Jack is from Boston
Jack is from London
Jack is from LA

Of course this is a silly example, but this sort of exercise proved really useful for some recent factorial experiments where we wanted to test model performance for every combination of input factors (specified by filename strings).
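As a sketch of that kind of setup (the factor names and file naming pattern here are hypothetical), the standard library’s itertools.product builds the combinations directly:

from itertools import product

# Hypothetical factor levels, each identified by a string that appears in an input filename
climates = ['wet', 'dry']
demands = ['high', 'low']

input_files = ['model_input_%s_%s.txt' % (c, d) for c, d in product(climates, demands)]
# ['model_input_wet_high.txt', 'model_input_wet_low.txt',
#  'model_input_dry_high.txt', 'model_input_dry_low.txt']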

Getting a bit more complex, let’s consider a long configuration file needed to run your model.  For example, GCAM, an integrated assessment model I’ve previously discussed, uses a configuration xml file that’s about 100 lines long. We’ll consider a pared-down version:

<?xml version="1.0" encoding="UTF-8"?>
<Configuration>
   <Files>
      <Value name="xmlInputFileName">../input/gcam-data-system/xml/modeltime-xml/modeltime.xml</Value>
      <Value name="BatchFileName">batch_ag.xml</Value>
      <Value name="policy-target-file">../input/policy/forcing_target_4p5.xml</Value>
      <Value name="xmldb-location">../output/database_basexdb</Value>
   </Files>
   <ScenarioComponents>
      <Value name = "climate">../input/climate/magicc.xml</Value>
   </ScenarioComponents>
</Configuration>

Now, suppose we want to vary the cost of solar power inside the model over a number of levels, and we want each model run to print to a unique output directory.  Our first step is to make a template xml file with $-placeholders where we want to vary the configuration file:

<?xml version="1.0" encoding="UTF-8"?>
<Configuration>
   <Files>
      <Value name="xmlInputFileName">../input/gcam-data-system/xml/modeltime-xml/modeltime.xml</Value>
      <Value name="BatchFileName">batch_ag.xml</Value>
      <Value name="policy-target-file">../input/policy/forcing_target_4p5.xml</Value>
      <Value name="xmldb-location">../output/database_basexdb_$RN</Value>
   </Files>
   <ScenarioComponents>
      <Value name = "climate">../input/climate/magicc.xml</Value>
      <!-- SOLAR -->
      $SOLAR
   </ScenarioComponents>
</Configuration>

We can utilize the template xml file using Python’s template class as follows:

from string import Template

# template_name is the path to the template xml file shown above
with open(template_name, 'r') as T:
    template = Template(T.read())

# The three experimental levels for solar cost; the empty string leaves the default settings in place
SOLAR_1 = '<Value name="solar">../input/gcam-data-system/xml/energy-xml/solar_low.xml</Value>'
SOLAR_2 = ''
SOLAR_3 = '<Value name="solar">../input/gcam-data-system/xml/energy-xml/solar_adv.xml</Value>'
SOLAR = [SOLAR_1, SOLAR_2, SOLAR_3]

for i in range(3):
    d = {}
    d['SOLAR'] = SOLAR[i]
    d['RN'] = str(i)
    S1 = template.safe_substitute(d)
    with open('./configuration_' + str(i) + '.xml', 'w') as f1:
        f1.write(S1)

Here we are looping over experimental particles, each defined by a unique setting of the solar power level in our experimental design.  For each particle, a GCAM configuration is generated: the solar level and run number are substituted in (see S1), and S1 is written to a unique XML file.  If we open configuration_0.xml we can see that the substitution has worked!

<?xml version="1.0" encoding="UTF-8"?>
<Configuration>
   <Files>
      <Value name="xmlInputFileName">../input/gcam-data-system/xml/modeltime-xml/modeltime.xml</Value>
      <Value name="BatchFileName">batch_ag.xml</Value>
      <Value name="policy-target-file">../input/policy/forcing_target_4p5.xml</Value>
      <Value name="xmldb-location">../output/database_basexdb_0</Value>
   </Files>
   <ScenarioComponents>
      <Value name = "climate">../input/climate/magicc.xml</Value>
<!-- SOLAR -->
<Value name="solar">../input/gcam-data-system/xml/energy-xml/solar_low.xml</Value>
</ScenarioComponents>
</Configuration>

Of course this is a very simple example, but it has proven incredibly useful in our ongoing work.

That’s all for now!