# MORDM VII: Optimality, robustness, and reevaluation under deep uncertainty

In the previous MORDM post, we visualized the reference set of performance objectives for the North Carolina Research Triangle and conducted a preliminary multi-criterion robustness analysis using two criteria: (1) regional reliability should be at least 98%, and (2) regional restriction frequency should be no more than 20%. Using these metrics, we found that Pareto-optimality does not guarantee satisfactory robustness: not all portfolios within the reference set satisfied the robustness criteria.

In this post, we will explain the differences between optimality and robustness, and justify the importance of robust optimization over sole reliance on a set of optimal solutions (also known as an optimal portfolio). To demonstrate these differences, we will also reevaluate the Pareto-optimal set of solutions under a more challenging set of states of the world (SOWs), a method first used in Herman et al (2013, 2014) and Zeff et al (2014). The formal term for this method, Deeply Uncertain (DU) Re-evaluation, was coined in a 2019 paper by Trindade et al.

### Optimality vs robustness

The descriptions of optimality are many. From a purely technical perspective, a Pareto-optimal set is a set of decision variables or solutions that maps to a Pareto front: a set of performance objectives in which one objective cannot be improved without degrading performance in another. For the purposes of this blog post, we shall use the definition of optimality laid out by Beyer and Sendhoff in their 2007 paper:

> The global optimal design…depends on the…(objective) functions and constraints…however, these functions always represent models and/or approximations of the real world.

Beyer and Sendhoff (2007)

In other words, a Pareto reference set is only optimal within the bounds of the model it was generated from. This makes sense; models are only approximations of the real world. Due to uncertainties driven by human action, natural variability, and incomplete knowledge, it is difficult and computationally expensive to place bounds on the degree of certainty with which the model optimum maps to the true optimum. Optimization is static in relation to reality: the set of solutions found does not change with time, and only accounts for the conditions within the model itself. Any deviation from this set of solutions, or unaccounted-for differences between the actual system and the model, may result in failure (Herman et al., 2015; Read et al., 2014).

This is why searching the set of optimal solutions for robust solutions is important. Herman et al. (2015) quote earlier work by Matalas and Fiering (1977) that defines robustness as the insensitivity of a system's portfolio to uncertainty. Within the MORDM context, robustness was found to be best defined using the multi-criterion satisficing robustness measure (Herman et al., 2015), which refers to the ability of a solution to meet one or more requirements (or criteria) set by the decision-makers when evaluated under a set of challenging scenarios. More information on alternative robustness measures can be found here.
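Concretely, the satisficing measure for a single portfolio is just the fraction of evaluated SOWs in which every criterion is met. A minimal numpy sketch using the two criteria above (the objective values here are synthetic, generated only for illustration):

```python
import numpy as np

# Synthetic performance of one portfolio across 1000 SOWs:
# regional reliability and restriction frequency
rng = np.random.default_rng(42)
reliability = rng.uniform(0.95, 1.0, 1000)
restriction_freq = rng.uniform(0.0, 0.4, 1000)

# Satisficing criteria: reliability >= 98%, restriction frequency <= 20%
meets_criteria = (reliability >= 0.98) & (restriction_freq <= 0.2)

# Robustness = fraction of SOWs in which all criteria are met
robustness = meets_criteria.mean()
print(f"Satisficing robustness: {robustness:.2f}")
```

A portfolio that is Pareto-optimal in the baseline model can still score poorly on this measure once the SOWs become challenging, which is exactly what the DU re-evaluation below probes.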

In this blog post, we will begin to explore this concept of robustness by conducting DU Re-evaluation, where we will perform the following steps:

### Generate a set of ROF tables from a more challenging set of SOWs

Recall that we previously stored our Pareto-optimal solution set in a .csv file named ‘NC_refset.csv’ (find the original Git Repository here). Now, we will write a quick Python script (called rof_tables_reeval.py in the Git Repository) that uses MPI4PY to parallelize and speed up the ROF table generation, along with a bash script to submit the job. More information on parallelization using MPI4PY can be found in this handy blog post by Dave Gold.

First, create a Python virtual environment within the folder where all your source code is kept and activate the virtual environment. I called mine python_venv:

python3 -m venv python_venv
source python_venv/bin/activate

Next, install the numpy and mpi4py libraries:

pip install numpy mpi4py

Then write the Python script as follows:

# -*- coding: utf-8 -*-
"""
Created on Tues March 1 2022 16:16
@author: Lillian Bei Jia Lau
"""

from mpi4py import MPI
import os

# 5 nodes, 20 RDMs per node, 10 tasks per node
comm = MPI.COMM_WORLD
rank = comm.Get_rank()  # rank ranges from 0 to 49
print('rank = ', rank)

N_RDMs_needed = 100
N_REALIZATIONS = 100
N_RDM_PER_NODE = 20
N_TASKS = 50
N_RDMS_PER_TASK = N_RDMs_needed // N_TASKS  # each task handles 2 RDMs

DATA_DIR = "/scratch/lbl59/blog/WaterPaths/"
SOLS_FILE_NAME = "NC_refset.csv"
N_SOLS = 1
OMP_NUM_THREADS = 5  # threads available to each WaterPaths instance

for i in range(N_RDMS_PER_TASK):
    current_RDM = rank + (N_TASKS * i)

    command_gen_tables = "./waterpaths -T {} -t 2344 -r {} -d {} -C 1 -O rof_tables_reeval/rdm_{} -e 0 \
        -U TestFiles/rdm_utilities_test_problem_reeval.csv \
        -W TestFiles/rdm_water_sources_test_problem_reeval.csv \
        -P TestFiles/rdm_dmp_test_problem_reeval.csv \
        -s {} -f 0 -l {} -R {} \
        -p false".format(OMP_NUM_THREADS, N_REALIZATIONS, DATA_DIR, current_RDM,
                         SOLS_FILE_NAME, N_SOLS, current_RDM)

    print(command_gen_tables)
    os.system(command_gen_tables)

comm.Barrier()

Before proceeding, a quick explanation of what all this means:

• The MPI setup parallelizes this job across 5 nodes on Cornell’s THECUBE (The Cube) computing cluster, with 10 tasks submitted to each node.
• Each task, in turn, handles 2 robust decision-making (RDM) multiplier files that scale a hydroclimatic realization up or down to make a state of the world more (or less) challenging. In this submission, we are creating 100 different variations of each hydroclimatic scenario using the 100 RDM files, and running them across only one solution.
• The ‘rank’ is the order in which the tasks are submitted. Since there are 10 tasks on each of the 5 nodes, there will be a total of 50 tasks, so rank ranges from 0 to 49. Note and understand how current_RDM is calculated from the rank.
• command_gen_tables is the command that you are going to submit to The Cube. Note the -C and -O flags; a value of 1 for the -C flag tells WaterPaths to generate ROF tables, and the -O flag tells WaterPaths to output each ROF table file into a folder within rof_tables_reeval/rdm_{} for each RDM. Feel free to change the filenames as you see fit.
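To see how this indexing covers all 100 RDMs exactly once, here is a quick standalone check of the current_RDM arithmetic (plain Python, no MPI needed):

```python
N_TASKS = 50
N_RDMS_PER_TASK = 2

# Each of the 50 MPI ranks computes its own RDM indices
assigned = [rank + N_TASKS * i
            for rank in range(N_TASKS)
            for i in range(N_RDMS_PER_TASK)]

# Every RDM index from 0 to 99 is covered exactly once
print(sorted(assigned) == list(range(100)))  # True
```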

To accompany this script, first create the following folders: output, out_reeval, and rof_tables_reeval. The output folder will contain the simulation results from running the 1 solution across the 100 hydroclimatic realizations. The out_reeval folder will store any output or error messages, such as script runtime. The rof_tables_reeval folder will hold the generated ROF tables for each RDM.

Then, write the following bash submission script:

#!/bin/bash
#SBATCH -n 50 -N 5 -p normal
#SBATCH --job-name=rof_tables_reeval
#SBATCH --output=out_reeval/rof_tables_reeval.out
#SBATCH --error=out_reeval/rof_tables_reeval.err
#SBATCH --time=200:00:00
#SBATCH --mail-user=lbl59@cornell.edu
#SBATCH --mail-type=all

module load py3-mpi4py
module load py3-numpy/1.15.3

START="$(date +%s)"

mpirun python3 rof_tables_reeval.py

DURATION=$(( $(date +%s) - START ))

echo ${DURATION}

You can find the bash script under the filename rof_table_gen_reeval.sh. Finally, submit the script using the following line:

sbatch ./rof_table_gen_reeval.sh

The run should take roughly 5 hours. We’re good for some time!

### Re-evaluate your solutions (and possibly your life choices, while you’re at it)

Once the ROF tables are generated, it’s time to get a little hands-on with the underlying WaterPaths code. Navigate to the following PaperTestProblem.cpp file using:
cd /yourfilepath/WaterPaths/src/Problem/

1. Delete PaperTestProblem.cpp and replace it with the file PaperTestProblem-reeval.cpp, which can be found in the main Git Repository.
2. Rename the latter file PaperTestProblem.cpp – it will be the new PaperTestProblem file that will be able to individually read each RDM scenario’s ROF tables.
3. Re-make WaterPaths by calling make clean and then make gcc in the command line. This will ensure that WaterPaths has no problems running the new PaperTestProblem.cpp file.

Next, write the following Python script (called run_du_reeval.py in the Git repository):

# -*- coding: utf-8 -*-
"""
Created on Tues March 1 2022 16:16

@author: Lillian Bei Jia Lau
"""

from mpi4py import MPI
import os

N_RDMs_needed = 100
N_REALIZATIONS = 100
N_RDM_PER_NODE = 20
N_TASKS_PER_NODE = 10
N_TASKS = 50
N_RDMS_PER_TASK = N_RDMs_needed // N_TASKS  # each task handles 2 RDMs

comm = MPI.COMM_WORLD
rank = comm.Get_rank()  # rank ranges from 0 to 49
print('rank = ', rank)

DATA_DIR = "/scratch/lbl59/blog/WaterPaths/"
SOLS_FILE_NAME = "NC_refset.csv"
N_SOLS = 69
OMP_NUM_THREADS = 5  # threads available to each WaterPaths instance

for i in range(N_RDMS_PER_TASK):
    current_RDM = rank + (N_TASKS * i)

    command_run_rdm = "./waterpaths -T {} -t 2344 -r {} -d {} -C -1 -O rof_tables_reeval/rdm_{} -e 0 \
        -U TestFiles/rdm_utilities_test_problem_reeval.csv \
        -W TestFiles/rdm_water_sources_test_problem_reeval.csv \
        -P TestFiles/rdm_dmp_test_problem_reeval.csv \
        -s {} -R {} -f 0 -l {} \
        -p false".format(OMP_NUM_THREADS, N_REALIZATIONS, DATA_DIR, current_RDM,
                         SOLS_FILE_NAME, current_RDM, N_SOLS)

    print(command_run_rdm)
    os.system(command_run_rdm)

comm.Barrier()

Note the change in the -C flag; its value is now -1, telling WaterPaths that it should import the ROF table values from the folder indicated by the -O flag. The resulting objective values for each RDM will be saved in the output folder we previously made.

The accompanying bash script, named du_reeval.sh, is as follows:

#!/bin/bash
#SBATCH -n 50 -N 5 -p normal
#SBATCH --job-name=mordm_training_du_reeval
#SBATCH --output=out_reeval/mordm_training_du_reeval.out
#SBATCH --error=out_reeval/mordm_training_du_reeval.err
#SBATCH --time=200:00:00
#SBATCH --mail-user=lbl59@cornell.edu
#SBATCH --mail-type=all

module load py3-mpi4py
module load py3-numpy/1.15.3

START="$(date +%s)"

mpirun python3 run_du_reeval.py

DURATION=$(( $(date +%s) - START ))

echo ${DURATION}

This run should take approximately three to four days. After that, you will have 100 files, each containing the 69 objective value sets resulting from running the 69 solutions across the 100 deeply uncertain states of the world.
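Once these files exist, ranking the solutions by their robustness is a one-liner per criterion. A hedged sketch of that aggregation step; the reliability matrix below is synthetic, standing in for the values you would load from the files in your output folder:

```python
import numpy as np

# Stand-in for the DU re-evaluation output: regional reliability of
# each of the 69 solutions in each of the 100 DU SOWs. In practice,
# load these values from the files in your output folder.
rng = np.random.default_rng(1)
reliability = rng.uniform(0.90, 1.0, size=(100, 69))

# Per-solution robustness: fraction of DU SOWs meeting the 98% criterion
robustness = (reliability >= 0.98).mean(axis=0)

most_robust = int(np.argmax(robustness))
print(f"Most robust solution: {most_robust} "
      f"({robustness[most_robust]:.0%} of SOWs satisfied)")
```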

### Summary

In this post, we defined optimality and robustness. We demonstrated how to run a DU re-evaluation across 100 challenging SOWs to observe how these ‘optimal’ solutions perform in more extreme scenarios. This is done to show that optimality is bound by current model states, and any deviation from the expected circumstances as defined by the model may lead to degradations in performance.

In the next blog post, we will be visualizing these changes in performance using a combination of sensitivity analysis, scenario discovery, and tradeoff analysis.

## References

Beyer, H. and Sendhoff, B., 2007. Robust optimization – A comprehensive survey. Computer Methods in Applied Mechanics and Engineering, 196(33-34), pp.3190-3218.

Herman, J., Reed, P., Zeff, H. and Characklis, G., 2015. How Should Robustness Be Defined for Water Systems Planning under Change?. Journal of Water Resources Planning and Management, 141(10), p.04015012.

Herman, J., Zeff, H., Reed, P. and Characklis, G., 2014. Beyond optimality: Multistakeholder robustness tradeoffs for regional water portfolio planning under deep uncertainty. Water Resources Research, 50(10), pp.7692-7713.

Matalas, N. C., and Fiering, M. B. (1977). “Water-resource systems planning.” Climate, climatic change, and water supply, studies in geophysics, National Academy of Sciences, Washington, DC, 99–110.

Read, L., Madani, K. and Inanloo, B., 2014. Optimality versus stability in water resource allocation. Journal of Environmental Management, 133, pp.343-354.

Trindade, B., Reed, P. and Characklis, G., 2019. Deeply uncertain pathways: Integrated multi-city regional water supply infrastructure investment and portfolio management. Advances in Water Resources, 134, p.103442.

Zeff, H., Kasprzyk, J., Herman, J., Reed, P. and Characklis, G., 2014. Navigating financial and supply reliability tradeoffs in regional drought management portfolios. Water Resources Research, 50(6), pp.4906-4923.

# NetCDF Operators

This post is an introduction to the Linux-based Climate Data Operators (CDO) and NetCDF Operators (NCO), which allow you to perform various operations on netCDF files through the command line. I found these commands to be really nifty when I was working with pre-industrial control runs from a GCM. The output was being written at a daily timestep, across 1200 years, and for the whole world, so it was absolutely essential that I cut the size of the files down as much as I could before transferring them to my own computer.

The official documentation and installation instructions for NCO can be found here and CDO here, but if you’re working on a supercomputer, the libraries will already likely be installed. I will outline how I used some of these functions for my pre-industrial runs.

## Concatenation

Some of the NCO commands have size limits of 40 GB, so it’s important to use the right order of operations when processing your files, which will be different depending on your ultimate goal. My goal was to ultimately get the 500-hPa geopotential height anomalies across the whole 1200-year period for specifically the Western US. Assuming you have a directory with all the NetCDF files, the first goal is to concatenate the data, since my run was broken into many smaller files. The easiest way to do this is with the following command, which will take all the NetCDF files in the directory (using the *) and merge them into a file called merged_file.nc:

cdo mergetime *.nc merged_file.nc

## Return Individual Monthly Files

When calculating anomalies, you will need to determine a mean geopotential height value for each of the 12 months, and then calculate daily deviations with respect to these monthly means. You can do this with the following command:

cdo splitmon merged_file.nc zg500.mon

This command will return 12 files of the form zg500.mon${i}.nc.

## Return Monthly Means and Daily Anomalies

The next step is to calculate a monthly mean for each of these files. For example, for January use:

cdo timmean zg500.mon1.nc zg500.mean.mon1.nc

## Return Anomalies

Now we subtract the means from each monthly file to return the daily anomalies for each month, which will be of the form zg500.mean.mon${i}.anom.nc. If you want to combine the last two steps into one loop, you can use the code below:

for i in $(seq 1 12)
do
cdo timmean zg500.mon${i}.nc zg500.mean.mon${i}.nc
cdo sub zg500.mon${i}.nc zg500.mean.mon${i}.nc zg500.mean.mon${i}.anom.nc
done
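The same monthly-mean-and-anomaly logic is easy to sanity-check in Python on a small synthetic array. This is just an illustration of the arithmetic behind timmean and sub, not a replacement for CDO:

```python
import numpy as np

# Synthetic "daily" geopotential height values for one month (10 days)
zg500_jan = np.array([5500., 5520., 5480., 5510., 5490.,
                      5530., 5470., 5500., 5515., 5485.])

# cdo timmean: the monthly mean
mean_jan = zg500_jan.mean()

# cdo sub: daily anomalies with respect to the monthly mean
anom_jan = zg500_jan - mean_jan

# Anomalies sum to ~zero by construction
print(round(anom_jan.sum(), 6))  # 0.0
```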

## Cut Down to Geographical Area of Interest

Finally, we need to cut down the data just to the Western US. We use ncks (NetCDF Kitchen Sink) from NCO, which is probably the most versatile of all the functions (hence the name). This command is one that has the 40 GB limit, which is why I had to wait until I had monthly files before I could cut them down geographically. We must first specify the variable of interest using the -v flag. In my case, I only had one variable to extract, but you can also extract multiple in one command. Then denote the range of latitude and longitude using the -d flags. It is very important to include the periods at the end of each lat/lon (even if your bounds are integers); otherwise the command does not work.

for i in $(seq 1 12)
do
ncks -v zg500 -d lon,180.,260. -d lat,30.,60. zg500.mean.mon${i}.anom.nc -o zg500.mean.mon${i}.cut.anom.region.nc
done

Ultimately, you will get 12 files of the form zg500.mean.mon${i}.cut.anom.region.nc. And that’s it! You can concatenate the monthly files back together and try to resort the data back into the correct sequence according to time. I was unsuccessful at finding a quick way to do this, but it is possible. I found it much easier to work on this within R. I imported each of the 12 anomaly files, assigned a time vector, concatenated each monthly anomaly matrix into a larger matrix and then sorted according to date. If your files are small enough by the end of the process, this likely is the easiest way to take care of resorting. Enjoy!
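I did the resorting in R, but the same concatenate-and-sort-by-time idea works anywhere. A tiny numpy sketch with two synthetic "months" (the time and anomaly values are made up for illustration):

```python
import numpy as np

# Suppose each monthly anomaly file gave us a time vector (days since
# the start of the record) and a matching anomaly vector
time_jan = np.array([0, 3, 5])     # days belonging to month 1
time_feb = np.array([1, 2, 4])     # days belonging to month 2
anom_jan = np.array([10., 40., 60.])
anom_feb = np.array([20., 30., 50.])

# Concatenate the monthly pieces, then sort everything by time
time_all = np.concatenate([time_jan, time_feb])
anom_all = np.concatenate([anom_jan, anom_feb])
order = np.argsort(time_all)

print(anom_all[order])  # [10. 20. 30. 40. 50. 60.]
```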

# More on simple Bash Shell scripts (examples of “find” and “sed”)

When you conduct a large ensemble of computer simulations with several scenarios, you are going to deal with a lot of data, including inputs and outputs. You also need to create several directories and subdirectories where you can put or generate the inputs and outputs for your model. For example, you may want to run a cropping system model across a large region, for 3500 grid cells, and you need to feed your model with the input files for each grid cell. Each grid cell has its own weather, soil, crop and management input files. Or you may want to run your model 100,000 times, each time using one set of crop parameters as an input, to conduct a sensitivity analysis. Another common task in large simulations is looking for any hidden error that happens during the runs. For example, your running jobs might look normal, without any obvious crash, but you may still get some kind of “error” or “warning” in your log files. So, you need to find those runs, correct them, delete the wrong files and rerun them to have a full set of outputs. These tasks are basic but can be challenging and very time-consuming if you do not know how to complete them efficiently. The Linux environment provides facilities that make our lives easier, as Dave said in his blog post, and Bernardo also provided some examples for this type of task. Here are a few more instances of simple but useful commands with “find” and “sed.”

## find

Sometimes, you want to know how many files with a specific pattern exist in all the subdirectories of a folder. You can type the command below at the upper-level folder. “f” means files, and after “name” we specify the pattern—for example, files that start with “218”. We can also look for all the files that have a specific extension [i.e. *.csv] or a specific string in their names [i.e. *yield*].

find . -type f -name "218*"

Then we can pipe [|] the list of results to the word-count command [wc], where the -l flag counts lines:

find . -type f -name "218*" | wc -l

You may want to find and delete all files with the specific pattern [i.e. 218_wheat.csv] in all directories in which they might exist. So, after we find the files, we can execute [exec] the remove command [rm]:

find . -type f -name "218_wheat*" -exec rm {} \;

If these files are located in different directories and we don’t want to delete them all, we can also filter the find results by specifying the pattern of path [i.e. output] and then delete them:

find . -type f -path "*/output/*" -name "218_wheat*" -exec rm {} \;

Sometimes, we need to find specific paths instead of files. For example, I have a text file, and I want to copy that into the “CO2” folder, which is located in the “Inputs” folders of several scenario folders:

find . -type d -path "*/Inputs/*" -name "CO2" -exec cp ../CO2_concentration_8.5.txt {} \;

“d” means directories, so we are searching for directories that contain “Inputs” folder and end with “CO2” folder. Then, we execute the copy command [cp], which copies the text file into the found directories.

If we are looking for a specific string inside some files, we can combine “find” and “grep.” For example, here I am looking for any error file [*.err] that starts with “218cs” if it contains this specific warning: “unable to find”

find . -type f -name "218cs*.err" -exec grep -i "unable to find" {} \;

Or we can look for files that do not contain “success.”

find . -type f -name "218cs*.err" -exec grep -L "success" {} \;
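If you ever want to prototype the same check in Python, pathlib’s rglob gives you a find-like traversal. A self-contained sketch that builds a tiny stand-in directory tree and reproduces the grep -L behavior (all names here are hypothetical):

```python
import tempfile
from pathlib import Path

# Build a tiny directory tree to search (stand-in for your run directories)
root = Path(tempfile.mkdtemp())
(root / "run1").mkdir()
(root / "run2").mkdir()
(root / "run1" / "218cs_a.err").write_text("simulation success\n")
(root / "run2" / "218cs_b.err").write_text("unable to find input file\n")

# Equivalent of: find . -type f -name "218cs*.err" -exec grep -L "success" {} \;
missing_success = [p for p in root.rglob("218cs*.err")
                   if "success" not in p.read_text()]

print([p.name for p in missing_success])  # ['218cs_b.err']
```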

## sed

“sed” is a powerful text editor. Most of the time, it is used to replace a specific string in a text file:

sed -i 's/1295/1360/' 218cs.txt

Here, the -i flag edits the file in place, and s substitutes the original string [1295] with the new string [1360]. There might be a few occurrences of “1295” in the file, and we may want to replace just one of them; for example, the one located at line 5:

sed -i '5s/1295/1360/' 218cs.txt

There might be more modifications that have to be done, so we can add them in one line using “-e”:

sed -i -e '5s/1295/1360/' -e '32s/1196/1200/' -e '10s/default/1420/' 218cs.txt

## find + sed

If you want to find specific text files (i.e., all the 218cs.txt files, inside RCP8.5 folders) and edit some lines in them by replacing them with new strings, this line will do it:

find . -type f -path "*/RCP8.5/*" -name "218*" -exec sed -i -e '5s/1295/1360/' -e '32s/1196/1200/' -e '10s/default/1420/'  {} \;

Sometimes, you want to replace an entire line in a text file with a long string, like a path, or keep some space in the new line. For example, I want to replace a line in a text file with the following line, which has the combination of space and a path:

“FORCING1         /home/fs02/pmr82_0001/tk662/calibration/451812118/forcings/data_”

For this modification, I am going to store the new line in a variable and then use it to replace the whole line located at line 119 of the text files [global_param_default.txt], which are located in specific directories [with this pattern “RCP4.5/451812118”].

path_new="FORCING1	/home/fs02/pmr82_0001/tk662/calibration/451812118/forcings/data_"
find . -type f -path "*RCP4.5/451812118/*" -name "global_param_*" -exec sed -i "119s|.*|${path_new}|" {} +

Sometimes, you want to add a new line at the specific position (i.e., line 275) to some text files (i.e., global_param_default.txt).

find . -type f -name "global_param_*" -exec sed -i "275i\OUTVAR    OUT_CROP_EVAP  %.4f OUT_TYPE_FLOAT  1" {} \;

Now, all of the “global_param_default” files have a new line with this content: “OUTVAR    OUT_CROP_EVAP  %.4f OUT_TYPE_FLOAT  1”.

It is also possible that you want to use a specific section of a path as the name of a variable or a file. For example, I am searching for directories that contain an “output” folder; this path would be one of them: ./453911731_CCF/output/. Now, I want to extract “453911731” and use it as a new name for a file [output__46.34375_-119.90625] that is already inside that path:

for P in $(find . -type d -name "output"); do (new_name="$(echo "${P}" | sed -r 's/..(.{9}).*/\1/')" && cd "${P}" && mv output__46.34375_-119.90625 "${new_name}"); done

With this command line, we repeat the whole process for each directory (matching the “output” pattern) by using “for,” “do,” and “done.” At each iteration, the path returned by “find” is assigned to the variable “P” by wrapping the “find” section in $(). Then, the result of “sed -r” is assigned to another variable [new_name]; “-r” in the sed command enables extended regular expressions.

With the “sed” command, we are extracting the 9 characters after “./” and removing everything that follows them. Each “.” matches any single character, and parentheses create a matching group. {9} means 9 occurrences of the preceding pattern (here “.”, i.e. any character), and “\1” refers to the first matched group.

“&&” is used to chain commands together. “cd” allows you to change into a specified path, which is stored in $P, and “mv” renames the file in this path from “output__46.34375_-119.90625” to “453911731,” which is stored in $new_name.
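Regexes like this one are easy to get subtly wrong, so it can help to prototype the extraction in Python before committing it to a one-liner. The same pattern applied to the example path above:

```python
import re

path = "./453911731_CCF/output/"

# Same idea as sed -r 's/..(.{9}).*/\1/': skip the first two characters
# ("./"), capture the next nine, and drop the rest
new_name = re.sub(r'^..(.{9}).*$', r'\1', path)

print(new_name)  # 453911731
```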

# Simple Bash shell scripts that have made my life easier

I’ve recently been using Bash shell scripts to improve the efficiency of my workflow when working on Linux systems and I thought I would share some of them here. I’m fairly new to Linux, so this post is not meant to be a comprehensive guide on how to write shell scripts; rather, I hope the scripts in this post can serve as examples for those who may also be learning Linux and unsure of where or how to start writing shell scripts. I didn’t write any of these from scratch; most of the scripts are based on files shared with me by group members Julie Quinn, Bernardo Trindade and Jazmin Zatarian Salazar. If you’re interested in learning more about any of the commands used in these scripts, I’ve put some references I found useful at the end of this post. If you’re more experienced in writing shell scripts, please feel free to put tips or suggestions in the comments.

## 1. A simple script for making directories

For my research I’m processing results of a Monte Carlo simulation for several solutions found through multi-objective search, and I needed to make folders in several locations to store the output from each solution. My first instinct was to make each directory separately using the mkdir command in the command line, but this quickly got tedious. Instead, I used a Bash script to loop through all the solution numbers and create a new directory for each. For more on using loops in Bash, check out this reference.

#!/bin/bash

# This script will create directories named "Solution_*" for
# a set of numbered solutions

# specify solution numbers
SOLUTIONS=('162' '1077' '1713' '1725' '1939' '2191' '2290' '2360')

# create a variable to store the string "Solution_"
DIRECTORY="Solution_"

# loop over solution numbers
for i in ${SOLUTIONS[@]}
do
# create a separate directory for each solution
mkdir $DIRECTORY${i}
done
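If you ever need the same thing from Python, pathlib makes it just as short. A quick equivalent sketch (writing into a temporary directory here so it is safe to run anywhere):

```python
import tempfile
from pathlib import Path

solutions = ['162', '1077', '1713', '1725', '1939', '2191', '2290', '2360']

# create a separate directory for each solution
base = Path(tempfile.mkdtemp())
for s in solutions:
    (base / f"Solution_{s}").mkdir()

print(len(list(base.iterdir())))  # 8
```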

## 2. Calling a Java function and saving the output

The MOEA framework is a tool written in Java with all sorts of cool functions. I used it to generate 1024 Latin hypercube samples across a given range for each of the 8 solutions mentioned above. Using a shell script allows you to easily set up the arguments needed for the MOEA framework, call the Java function and save the output in your desired file format. The MOEA framework’s tool spits out a .txt file, but this script uses the “sed” command to save it as a .csv file. More on “sed” can be found in the references at the end of this post.

#!/bin/bash
# this shell script will call the MOEA framework's Latin Hypercube
# Sampling tool to create 1024 samples from a set of
# prespecified ranges for each of 8 solutions

# create variables to store Java arguments
JAVA_ARGS="-Xmx1g -classpath MOEAFramework-1.16-Executable.jar"
NUM_SAMPLES=1024
METHOD=latin

# these are the solutions we will create samples from
SOLUTIONS=('162' '1077' '1713' '1725' '1939' '2191' '2290' '2360')

# loop through solutions
for i in ${SOLUTIONS[@]}
do
# define names for input (ranges) and output file names
RANGES_FILENAME=${i}ranges.txt
OUTPUT_FILENAME=Solution${i}_Samples.txt
CSV_FILENAME=Solution${i}_Samples.csv

# Call MOEA framework from JAVA using specified arguments to
# create LHS Samples, specify OUTPUT_FILENAME as output
java ${JAVA_ARGS} org.moeaframework.analysis.sensitivity.SampleGenerator -m ${METHOD} -n ${NUM_SAMPLES} -p ${RANGES_FILENAME} -o ${OUTPUT_FILENAME}

# Use the sed command to create a new comma separated values file
# from the original output .txt file
sed 's/ /,/g' ${OUTPUT_FILENAME} > ${CSV_FILENAME}

# remove .txt files
rm $OUTPUT_FILENAME
done

## 3. A piping example

Piping allows you to link together programs by making the output from one program or function the input to another. The script below was originally written by my friend Shrutarshi Basu for a class project we were working on together. This script is made to process the output from the Borg MOEA for 9 random seeds of the DTLZ2 benchmarking problem across several different algorithmic configurations, seen in the code as “masters” (for more on this see Jazmin’s post here). In addition to calling Java tools from the MOEAframework, Basu uses piping to link the Linux commands “tac”, “sed”, “grep” and “cut”.  For more on each of these commands, see the links at the bottom of this post.

# loop over each of 9 seeds
for i in {0..8}
do
obj=DTLZ2_S${i}.obj
output=dtlz2.volume

# loop over masters
for m in $(seq 0 $1)
do
runtime=DTLZ2_S${i}_M${m}.runtime
mobj=DTLZ2_S${i}_M${m}.obj

# extract objectives from output
echo "Extracting objectives"
tac ${runtime} | sed -n '1,/\/\// p' | grep -v "//" | cut -d' ' -f15-19 | tac > ${mobj};
done

# combine objectives into one file
echo "Combining objectives"
java -cp ../../moea.jar org.moeaframework.analysis.sensitivity.ResultFileSeedMerger \
-d 5 -e 0.01,0.01,0.01,0.01,0.01 \
-o ${obj} DTLZ2_S${i}_M*.obj

# calculate the hypervolume
echo "Finding final hypervolume"
hvol=$(java -cp ../../moea.jar HypervolumeEval ${obj})

printf "%s %s\n" "$i" "$hvol" >> ${output}
echo "Done with seed $i"
done
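To make the piped extraction step more concrete, here is the same logic as a hedged Python sketch operating on a synthetic runtime file: reverse the lines (tac), keep everything up to the first "//" separator (sed), drop the separator (grep -v), take fields 15-19 (cut), and reverse back (tac). The field layout below is made up purely for illustration:

```python
# Synthetic runtime file: two generations separated by "//" lines,
# each data line holding 19 space-delimited fields
lines = [
    "//",
    " ".join(f"g1f{j}" for j in range(1, 20)),
    "//",
    " ".join(f"g2f{j}" for j in range(1, 20)),
]

# tac: reverse; sed -n '1,/\/\// p': keep lines up to the first "//";
# grep -v "//": drop the separator itself
block = []
for line in lines[::-1]:
    block.append(line)
    if "//" in line:
        break
block = [l for l in block if "//" not in l]

# cut -d' ' -f15-19: take fields 15 through 19; final tac: restore order
extracted = [" ".join(l.split(" ")[14:19]) for l in block][::-1]

print(extracted)  # ['g2f15 g2f16 g2f17 g2f18 g2f19']
```

The net effect, as in the shell pipeline, is that only the last block of the file (the final generation's objectives) survives.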

# So much data from such little models…

It’s been my experience that even simple models can generate lots of data. If you’re a regular reader of this blog, I can imagine you’ve had similar experiences as well. My most recent experience with this is the work I’ve done with the Dynamic Integrated Climate-Economic model (DICE). I had inherited a port of the 2007 version of the model, which would print relevant output to the screen. During my initial runs with the model, I would simply redirect the output to ASCII files for post-processing. I knew that eventually I would be adding all sorts of complexity to this model, ultimately leading to high-dimensional model output and rendering the use of ASCII files impractical. I knew that I would need a better way to handle all this data. So in updating the model to the 2013 version, I decided to incorporate support for netCDF file generation.

You can find details about the netCDF file format through Unidata (a University Corporation for Atmospheric Research [UCAR] Community Program) and through some of our previous blog posts (here, here, and here). What’s important to note here is that netCDF is a self-describing file format designed to manage high-dimensional hierarchical data sets.

I had become accustomed to netCDF files in my previous life as a meteorologist. Output from complex numerical weather prediction models would often come in netCDF format. While I had never needed to generate my own netCDF output files, I found it incredibly easy and convenient to process them in R (my preferred post-processing and visualization software). Trying to incorporate netCDF output support in my simple model seemed daunting at first, but after a few examples I found online and a little persistence, I had netCDF support incorporated into the DICE model.

The goal of this post is to guide you through the steps to generate and process a netCDF file. Some of our earlier posts go through a similar process using the Python and Matlab interfaces to the netCDF library. While I use R for post-processing, I generally use C/C++ for the modeling; thus I’ll step through generating a netCDF file in C and processing the generated netCDF file in R on a Linux machine.

Edit: I originally put a link to the following code at the bottom of this post. For convenience, here’s a link to the bitbucket repository that contains the code examples below.

# Writing a netCDF file in C…

## Confirm netCDF installation

First, be sure that netCDF is installed on your computing platform. Most scientific computing clusters will have the netCDF library already installed. If not, contact your system administrator to install the library as a module. If you would like to install it yourself, Unidata provides the source code and great documentation to step you through the process. The example I provide here isn’t all that complex, so any recent version (4.0+) should be able to handle this with no problem.

## Setup and allocation

With the netCDF libraries installed, you can now begin to code netCDF support into your model. Again, I’ll be using C for this example. Begin by including the netCDF header file with your other include statements:

#include <stdlib.h>
#include <stdio.h>
#include <netcdf.h>

### Setup an error handler

The netCDF library includes a nice way of handling possible errors from the various netCDF functions. I recommend writing a simple wrapper function that can take the returned values of the netCDF functions and produce the appropriate error message if necessary:

void ncError(int val)
{
    printf("Error: %s\n", nc_strerror(val));
    exit(2);
}

### Generate some example data

Normally, your model will have generated important data at this point. For the sake of the example, let’s generate some data to put into a netCDF file:

// Loop control variables
int i, j, k;

// Define the dimension sizes for
// the example data.
int dim1_size = 10;
int dim2_size = 5;
int dim3_size = 20;

// Define the number of dimensions
int ndims = 3;

// Allocate the 3D vectors of example data
float x[dim1_size][dim2_size][dim3_size];
float y[dim1_size][dim2_size][dim3_size];
float z[dim1_size][dim2_size][dim3_size];

// Generate some example data
for(i = 0; i < dim1_size; i++) {
    for(j = 0; j < dim2_size; j++) {
        for(k = 0; k < dim3_size; k++) {
            x[i][j][k] = (i+j+k) * 0.2;
            y[i][j][k] = (i+j+k) * 1.7;
            z[i][j][k] = (i+j+k) * 2.4;
        }
    }
}

This generates three variables, each with three different size dimensions. Think of this, for example, as variables on a 3-D map with dimensions of [latitude, longitude, height]. In my modeling application, my dimensions were [uncertain state-of-the-world, BORG archive solution, time].

### Allocate variables for IDs

Everything needed in creating a netCDF file depends on integer IDs, so the next step is to allocate variables for the netCDF file id, the dimension ids, and the variable ids:

// Allocate space for netCDF dimension ids
int dim1id, dim2id, dim3id;

// Allocate space for the netcdf file id
int ncid;

// Allocate space for the data variable ids
int xid, yid, zid;

Each one of these IDs will be returned through reference by the netCDF functions. While we’re at it, let’s make a variable to hold the return status of the netCDF function calls:

// Allocate return status variable
int retval;

## Define the meta-data

Now we will start to build the netCDF file. This is a two-part process. The first part is defining the meta-data for the file and the second part is assigning the data.

### Create an empty netCDF file

First, create the file:

// Setup the netcdf file
if((retval = nc_create("example.nc", NC_NETCDF4, &ncid))) { ncError(retval); }

Note that we store the return status of the function call in retval and test the return status for an error. If there’s an error, we pass retval to our error handler. The first parameter to the function call is the name of the netCDF file. The second parameter is a flag that determines the type of netCDF file. Here we use NC_NETCDF4, the latest-and-greatest format, which includes the HDF5/zlib compression features. If you don’t need these features, or you need a version compatible with older netCDF libraries, then use the default (classic) or 64-bit offset (NC_64BIT_OFFSET) versions. The third parameter is the netCDF integer ID used for assigning variables to this file.

Now that we have a clean netCDF file to work with, let’s add the dimensions we’ll be using:

// Define the dimensions in the netcdf file
if((retval = nc_def_dim(ncid, "dim1_size", dim1_size, &dim1id))) { ncError(retval); }
if((retval = nc_def_dim(ncid, "dim2_size", dim2_size, &dim2id))) { ncError(retval); }
if((retval = nc_def_dim(ncid, "dim3_size", dim3_size, &dim3id))) { ncError(retval); }

// Gather the dimids into an array for defining variables in the netcdf file
int dimids[ndims];
dimids[0] = dim1id;
dimids[1] = dim2id;
dimids[2] = dim3id;

Just as before, we catch and test the function return status for any errors. The function nc_def_dim() takes four parameters. First is the netCDF file ID returned when we created the file. The second parameter is the name of the dimension. Here we’re using “dimX_size” – you would want to use something descriptive of the dimension (e.g. latitude, time, solution, etc.). The third parameter is the size of this dimension (e.g. the number of latitudes, the number of solutions, etc.). The last is the ID for this dimension, which will be used in the next step of assigning variables. Note that we gather the dimension IDs into an array for use in the next step.

The last step in defining the meta-data for the netCDF file is to add the variables:

// Define the netcdf variables
if((retval = nc_def_var(ncid, "x", NC_FLOAT, ndims, dimids, &xid))) { ncError(retval); }
if((retval = nc_def_var(ncid, "y", NC_FLOAT, ndims, dimids, &yid))) { ncError(retval); }
if((retval = nc_def_var(ncid, "z", NC_FLOAT, ndims, dimids, &zid))) { ncError(retval); }

The nc_def_var() function takes 6 parameters. These include (in order) the netCDF file ID, the variable name to be displayed in the file, the type of data the variable contains, the number of dimensions of the variable, the IDs for each of the dimensions, and the variable ID (which is returned through reference). The type of data in our example is NC_FLOAT, which is a 32-bit floating point. The netCDF documentation describes the full set of data types covered. The IDs for each dimension are passed as that combined array of dimension IDs we made earlier.

This part is optional, but is incredibly useful and true to the spirit of making a netCDF file. When sharing a netCDF file, the person receiving the file should have all the information they need about the data within the file itself. This can be done by adding “attributes”. For example, let’s add a “units” attribute to each of the variables:

// OPTIONAL: Give these variables units
if((retval = nc_put_att_text(ncid, xid, "units", 2, "cm"))) { ncError(retval); }
if((retval = nc_put_att_text(ncid, yid, "units", 4, "degC"))) { ncError(retval); }
if((retval = nc_put_att_text(ncid, zid, "units", 1, "s"))) { ncError(retval); }

The function nc_put_att_text() puts a text-based attribute onto a variable. The function takes the netCDF ID, the variable ID, the name of the attribute, the length of the string of characters for the attribute, and the text associated with the attribute. In this case, we’re adding an attribute called “units”. Variable ‘x’ has units of “cm”, which has a length of 2. Variable ‘y’ has units of “degC”, which has a length of 4 (and so on). You can apply text-based attributes as shown here or numeric-based attributes using the appropriate nc_put_att_X() function (see the documentation for the full list of numeric attribute functions). You can also apply attributes to dimensions by using the appropriate dimension ID, or set a global attribute by passing NC_GLOBAL in place of a variable ID.

### End the meta-data definition portion

At this point, we’ve successfully created a netCDF file and defined the necessary meta-data. We can now end the meta-data portion:

if((retval = nc_enddef(ncid))) { ncError(retval); }

…and move on to the part 2 of the netCDF file creation process.

## Populate the file with data

### Put your data into the netCDF file

Here, all we do is put data into the variables we defined in the file:

// Write the data to the file
if((retval = nc_put_var(ncid, xid, &x[0][0][0]))) { ncError(retval); }
if((retval = nc_put_var(ncid, yid, &y[0][0][0]))) { ncError(retval); }
if((retval = nc_put_var(ncid, zid, &z[0][0][0]))) { ncError(retval); }

The function nc_put_var() takes three parameters: the netCDF file ID, the variable ID, and the memory address of the start of the multi-dimensional data array. At this point, the data will be written to the variable in the netCDF file. There is a way to write to the netCDF file in data chunks, which can help with memory management, and a way to use parallel I/O for writing data in parallel to the file, but I have no experience with that (yet). I refer those interested in these features to the netCDF documentation.

### Finalize the netCDF file

That’s it! We’re done writing to the netCDF file. Time to close it completely:

// Close the netcdf file
if((retval = nc_close(ncid))) { ncError(retval); }

## Compile and run the code

Let’s compile and run the code to generate the example netCDF file:

gcc -o netcdf_example netcdf_write_example.c -lnetcdf

Some common problems people run into here are not including the netCDF library flag at the end of the compilation call, not having the header files in the include-path, and/or not having the netCDF library in the library-path. Check your user environment to make sure the netCDF paths are included in your C_INCLUDE_PATH and LIBRARY_PATH:

env | grep -i netcdf

Once the code compiles, run it to generate the example netCDF file:

./netcdf_example

If everything goes according to plan, there should be a file called “example.nc” in the same directory as your compiled code. Let’s load this up in R for some post-processing.

# Reading a netCDF file in R…

## Install and load the “ncdf4” package

To start using netCDF files in R, be sure to install the netCDF package “ncdf4”:

install.packages("ncdf4")
library(ncdf4)

Note that there’s also an older “ncdf” package, which reads and writes only the classic (default) and 64-bit offset versions of netCDF files. I recommend against using it: the newer “ncdf4” package handles the old file versions as well as the new netCDF4 version, and “ncdf” has since been removed from the CRAN repository anyway.

## Open the netCDF file

With the library installed and sourced, let’s open the example netCDF file we just created:

nc <- nc_open("example.nc")

This stores an open file handle to the netCDF file.

## View summary of netCDF file

Calling or printing the open file handle will produce a quick summary of the contents of the netCDF file:

print(nc)

This summary produces the names of the available variables, the appropriate dimensions, and any global/dimension/variable attributes.

## Extract variables from the netCDF file

To extract those variables, use the command:

x <- ncvar_get(nc, "x")
y <- ncvar_get(nc, "y")
z <- ncvar_get(nc, "z")

At this point, the data you extracted from the netCDF file are loaded into your R environment as 3-dimensional arrays. You can treat these the same as you would any multi-dimensional array of data (e.g. subsetting, plotting, etc.). Note that the dimensions are reported in the reverse order from which you created the variables:

dim(x)

## Close the netCDF file

When you’re done, close the netCDF file:

nc_close(nc)

And there you have it! Hopefully this step-by-step tutorial has helped you incorporate netCDF support into your project. The code I described here is available through bitbucket.

Happy computing!

~Greg

# Useful Linux commands to handle text files and speed up work

Most of us, nice and normal human beings, tend to prefer programs with GUIs over typing commands on a command prompt because the former looks more “real” and is more visual than the latter. However, one thing we don’t know (or, sometimes, don’t want to know) is that learning a few terminal commands can dramatically increase productivity. These commands can save us a lot of time by sparing us from opening and closing programs, navigating through menus and windows, moving the mouse around, as well as moving the hand back and forth from the mouse to the keyboard.

This post will mention and briefly describe some useful “intermediate level” Linux commands (some basic commands are presented in this post by Jon Herman), which can be called from a Linux OS, Cygwin (mostly), or Mac. Among the numerous tedious tasks these commands can greatly simplify is the particularly interesting chore of handling text files, be they scripts or data files. Commands for other tasks are covered as well. Keep in mind that the symbol * is a wild card (a character that can stand for any string of characters when searching for files), which is really useful when the goal is to repeatedly apply one command to multiple files. The “$” at the beginning of each command below represents the prompt and should not be typed.

DELIMITER SEPARATED FILES HANDLING

• Keep columns 1 to 30 and column 36 of a comma separated file (removing columns 31 to 35; cut fields are 1-indexed) and write the output to another file:
$ cut -d',' -f1-30,36 input.file >> output.file

• Print only columns 2 and 4 (starting from 1) of a comma separated file:
$ awk -F "," '{print $2,$4}' input.file >> output.file
• Count the number of columns in a file separated either by spaces or commas:
$ head -1 input.file | sed 's/[^, ]//g' | wc -c
or:
$ awk -F "[, ]" 'END{print NF}' input.file
• Print lines of a comma separated file in which the value in the 2nd column is lower than 100 and the value in the 5th column is higher than 0.3:
$ awk -F "," '$2<100 && $5>0.3' input.file >> output.file
• Print lines between the 10th and 20th lines (not inclusive) of a file:
$ awk 'NR>10 && NR<20' input.file >> output.file
• Add a string to the end of multiple files:
$ echo "your string" | tee -a *.set
• Add a string to the end of one file:
$ echo "your string" >> file
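As a quick sanity check, here is a short shell session that exercises a few of the one-liners above on a throwaway sample file (the file name and values are made up for the demo; the filter uses the 4th column since the sample only has four):

```shell
# Build a small comma separated sample file (4 columns)
printf 'a,10,x,0.5\nb,200,y,0.1\nc,50,z,0.9\n' > input.file

# Print only columns 2 and 4
awk -F "," '{print $2,$4}' input.file

# Count the number of columns in the file
head -1 input.file | sed 's/[^, ]//g' | wc -c

# Print lines where the 2nd column is lower than 100
# and the 4th column is higher than 0.3
awk -F "," '$2<100 && $4>0.3' input.file
```

The column count works because sed strips every character that is not a separator, leaving one comma per column boundary plus the trailing newline.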

FILE SEARCHING

• Find all text files in a folder that contain a certain string:
$ grep -rn './folder' -e your_string
• Find files recursively (* is a wildcard; quote the pattern so the shell does not expand it first):
$ find . -type f -name "name_of*.file"

FILES INFO

• See the contents of a zip/tar file without extracting it. Press q to quit:
$ less file.tar
• Count the number of lines in a file:
$ wc -l your.file
• List all files with a certain extension in a directory:
$ ls *.ext
• Print files and folders in tree fashion:
$ tree
• Print the size of all subfolders and files in (sub)folders to a certain max depth in the folder hierarchy:
$ du -h -a --max-depth=2

IMAGE HANDLING

• Convert svg files to png (you need to have Inkscape installed):
$ inkscape input.svg -d 300 -e output.png
• Convert svg files to pdf-latex (you need to have Inkscape installed):
$ inkscape input.svg --export-pdf output.pdf --export-latex
• Rotate a picture (the convert command is part of ImageMagick):
$ convert Fig6_prim.png -rotate 90 Fig6_prim_rotated.png

MISCELLANEOUS

• See the history of commands you have typed:
$ history
• See a calendar (month and year optional):
$ cal [month] [year]
• Combine pdf files into one (you need to have pdftk installed):
$ pdftk file1.pdf file2.pdf file3.pdf cat output newfile.pdf
or, to merge all pdf files in a directory:
$ pdftk *.pdf cat output newfile.pdf

In order to see how to combine only certain pages of pdf files, as well as how to split all pages into separate pdf files, see this page.

• See the manual of a command:
$ man command

Another useful idea is that of piping the output of one command into another. For example, if you want to print the number of files in a directory, you can pipe the output of the ls command (list all files in a directory) to the wc -l command (count the number of lines in its input). For this, use the “|” character:

$ ls | wc -l

However, you may instead want to check the number of lines in each of the files in a directory at once, which can be achieved by combining the ls and wc commands with the command xargs. The command then would look like:

$ ls | xargs wc -l

The command xargs collects the output of ls and passes the file names as arguments to wc -l, which then reports the line count of each file (plus a grand total).
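A minimal sketch makes the difference concrete (the directory and file names here are invented for the demo):

```shell
# Two small files in a fresh directory
mkdir -p xargs_demo && cd xargs_demo
printf 'one\ntwo\n' > a.txt    # 2 lines
printf 'three\n' > b.txt       # 1 line

# Counts the file names themselves: prints 2
ls | wc -l

# Counts the lines inside each file, plus a grand total
ls | xargs wc -l
```

The first pipeline counts two lines of ls output; the second has wc open each file, reporting 2 lines for a.txt, 1 for b.txt, and 3 total.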

Hope all this saves you some time!

# Using linux “split”

Today I’d like to quickly talk about the linux command “split”.  I like writing about new simple Linux commands as evidenced here and here.

I often write customized C++ scripts to manipulate large data files.  There’s obviously a time and place for this, since you get ultimate control on every aspect of how your data looks going in and coming out.  We’ve written about this before, and I think string processing is an important skill no matter what language.  There’s a post about matlab (and another one here), some sample bash scripting, and a post about python among other things.  You should also see Matt’s series on python data analysis, since I’m doing some shameless plugging!

Anyway… little did I know that something very complicated in C++ can be easily done in linux/unix with “split”!

To split a large file into smaller files with, say, 100 lines, you use: “split -l 100 myLargerFile.txt”  There are also commands to change the filenames of the output files, and so forth.

Read the man page for split, and check out forum posts here and here to get on your way!
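For instance, here is a quick sketch (using a generated 250-line file) of what split produces:

```shell
# Make a 250-line file
seq 250 > myLargerFile.txt

# Split it into pieces of at most 100 lines each
split -l 100 myLargerFile.txt

# By default, split names the pieces xaa, xab, xac, ...
wc -l xaa xab xac   # 100, 100, and 50 lines
```

The -l flag controls the chunk size in lines; see the man page for the options that change the output file name prefix and suffix style.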


# Using Linux input redirection in a C++ code test

I’m testing a function that does a numerical calculation, namely the Cholesky matrix factorization.  I found a reference to algorithm that does this, but how do I test my code quickly, to input some data, perform the calculation, and then output some results?

It turns out Linux gives us a simple way to do this!  Using “input redirection”, we can quickly input some data from a text file, and process it as if we were inputting it directly through the keyboard.  First let’s look at some pseudocode.

#include <stdio.h>
#include <math.h>
#include <iostream>
#include <fstream>
#include <string>

using namespace std;

ofstream out; // output file stream, useful for debugging

double myFun(double input1, double input2); // insert the function you want to test here

int main()
{
    // Let's say you need to enter an n by n matrix from a text file,
    // and you know n in advance. Then:
    const int n = 52;
    double input[n][n]; // you can use a vector of vectors, or a 2d C-style array if needed

    int i = 0; int j = 0;
    for (int k = 0; k < n*n; k++)
    {
        cin >> input[i][j];
        j++;
        if (j == n)
        {
            j = 0; i++;
        }
    }

    // Call myFun to test. Then, use cout or the output stream to output your results.
    // To open the output stream, simply write out.open("myOutput.txt") and then
    // use out << to output whatever you need.

    return 0;
}

Now, if you have data in space or tab delimited format in a text file, all you have to do is compile your program:

g++ myFile.cpp -o myExecutable

And then run it with input redirection:

./myExecutable < myData.txt

where the < operator feeds in the contents of myData.txt into the standard input of your program.  For more on input redirection read here, and feel free to comment!
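You can also try the `<` operator without compiling anything by pointing it at a standard utility that reads from standard input (the data file here is fabricated for the demo):

```shell
# Three whitespace-delimited numbers in a text file
printf '3 1 2\n' > myData.txt

# wc normally reads from the keyboard; with < it reads the file instead
wc -w < myData.txt   # prints 3
```

Your compiled program behaves the same way: anything it would read with cin comes from the redirected file instead of the keyboard.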

# Using linux “grep”

grep allows you to find an expression in one or more files in a folder on Linux.  I find it useful for programming.  Say, for example, I want to look for the string “nrec” in a set of source code and header files.  Maybe “nrec” is a variable and I forgot where I declared it (if this sounds a little too specific to be merely an example, you’re right. This is what I’m having to do right this second!).  The grep command is:

grep -in "nrec" *.*

What this means is, search for the “nrec” expression in every file in the folder.  There are two useful flags set here as well.  “i” means that the search is case insensitive (that is, NREC and NrEc and nrec are each treated as equal).  “n” means that the program will show me the line number of each occurrence of my desired phrase.  There are other options that I’m not using, including “inverting” a search to find all occurrences of NOT that phrase, suppressing the file name or only showing the file name, etc.

If you were curious, here’s a sample of the output:

iras.h:144: int num_flow_datapoints; //originally: NRec
SimSysClass.cpp:806: flowrecs=sysstat(nrec)

(If you’re curious, the first instance is in a header file, on line 144.  I’m translating this code from one language to another, and originally the variable was called “nrec”. So in the header file I made a note that now my variable is called something else.  In the second instance, I had copied the original code into my file as a placeholder, so now I know that I need to use my new name in its place.  Also, the “i” flag in grep is helpful since fortran is not case-sensitive, and here you can see there were two different case styles for this variable even in our simple example.)

For more info, please consult some casual reference such as this excellent post about linux command line utilities,  a similar blog post about grep, and of course the Linux man page for the command. Also look at 15 grep tips.  As usual, remember that “man [insert command here]” gives you all the low-down on each command you’d like to learn.
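Putting it all together, here is a self-contained sketch that recreates a tiny version of the example above (the files and their contents are fabricated to mimic the real output):

```shell
mkdir -p grep_demo && cd grep_demo
printf '// placeholder\nint NRec;\n' > iras.h
printf 'flowrecs = sysstat(nrec);\n' > SimSysClass.cpp

# Case-insensitive search with line numbers;
# each match is printed as file:line:text
grep -in "nrec" *.*
```

Both NRec and nrec are reported thanks to the -i flag, and the -n flag prepends the line number to each match.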

# Using linux “cut”

The following code takes a file that has 16 columns and outputs a file with 5 of those columns.  Some notes:

• Don’t use PATH as a variable.  The program won’t work, because PATH is a system variable!
• Note the C++ style syntax of the loop.  Versions of bash greater than 3.0 allow you to use curly brackets, like: for i in {1..50}.  But when you want to use variables inside the range, you have to do something else, such as my example.  Others are discussed here.
• The star of this script is the ‘cut’ command.  The -d flag tells what delimiter you’d like, and the -f flag tells what fields you want to cut.
• Then there are some simple commands around cut.  ‘cat’ displays the contents of the file.  Then, the | operator pipes the output of cat into the next command, which is cut.  Finally, you then use the > operator to direct the output of this command into a new file.
• Save this file on the cluster or a Linux system as myFileNameHere.sh.  Then, to run the code, simply type “sh myFileNameHere.sh”

#!/bin/bash

# Cut out only the objective function values from the CBorg output files.

MYPATH=./output/
INPUT_NAME_BASE=CBorg_LRGV_