# Easy batch parallelization of code in any language using mpi4py

The simplest form of parallel computing is what’s known as “embarrassingly” parallel processes. These processes involve fully independent runs of a model or script where little or no communication is needed across parallel processes. A common example is Monte Carlo evaluation, when we run a model over an ensemble of inputs. To parallelize an embarrassingly parallel application we simply need to send a set of commands to the cluster telling it to run each sample on a different core (or set of cores). For small applications, this can be done by submitting each run individually. For larger applications, SLURM Job Arrays (which are nicely detailed in Antonia’s post, here) can efficiently batch large number of function calls to independent computing cores. While this method is efficient and effective, I find it sometimes can be hard to keep track of, as you may be submitting tens or hundreds of jobs at a time. An alternative approach to submitting embarrassingly parallel tasks is to utilize MPI with Python to dispatch and organize jobs.

I like the MPI / Python combo because it consolidates all parallel applications into a single job, meaning you have one job to keep track of on a cluster at a time, and one output file generated by the batch set of model runs. I also find Python slightly easier to edit and debug than Bash scripts (which are used to create job arrays). Additionally, it’s very easy to assign each computing core a set of function evaluations to run (this can also be done with Job arrays, but again, I find Python easier to work with). Though Python is the language used to coordinate parallel tasks, we can use it to parallelize code in any language, as I’ll demonstrate below.

In this post I’ll first provide some background on MPI and its Python implementation, mpi4py. Next I’ll provide an example I’ve developed to demonstrate how to batch run a Matlab code on a cluster. The examples presented here are derived from some of Bernardo’s code in his post on Parallel programming in C/C++, which you can find here.

## A very light introduction to MPI

MPI stands for “Message Passing Interface” and is the standard library for distributed memory parallelization (for background, see this post). To understand how MPI works, it’s helpful to define some of it’s basic components.

1. Tasks: I’ll use the term task to define a processor (or group of processors) assigned to perform a specific set of instructions. These instructions may by a single evaluation of a function, or a set of function evaluations
2. Communicators: A communicator is a group of MPI task units that are permitted to communicate with each other. In advanced MPI applications you may have multiple communicators, but for embarrassingly parallel applications we’ll only use one. The default communicator is called “MPI_COMM_WORLD” (I don’t know why, if anyone does please feel free to share in the comments), and that’s what I’ll work with here.
3. Ranks: Each MPI task is assigned a unique identifier within the communicator called a rank. The processors running each task can access their own rank number, which will play an important role in how we use MPI for embarrassingly parallel applications.

A example schematic of the MPI_COMM_WORLD communicator with six tasks and their associated ranks is shown below.

## mpi4py

MPI is implemented in Python with the mpi4py library. When we run an MPI code on a cluster, MPI creates the communicator and assigns each task a rank, then each task unit independently load the script. The processor/s associated with a task can then access their own unique rank.

The following snip of code loads this library, accesses the communicator and stores the rank of the given process:

# load the mpi4py library
from mpi4py import MPI

# access the MPI COMM WORLD communicator and assign it to a variable
comm = MPI.COMM_WORLD

# get the rank of the current process (different for each process on the cluster)
rank = comm.Get_rank()



## Example of using mpi4py to batch parallel jobs

Here, I’ll parallelize the submission of a Matlab script called demoScript.m. This script reads an input file from a specific file location and prints out the contents of that file. For example purposes I’ve created 20 input files, each in their own folders. The folders are called “input_sample_0”, “input_sample_1” etc.. Each input_sample folder contains a file called “sample_data.txt”, which contains one line of text reading: “This is data for run <sample_number>”.

All code for this example can be found on Github, here: https://github.com/davidfgold/mpi4py_blog.git

Batching runs of demoScript.m process involves three components:

1. Write demoScript.m so that it reads the sample number from the input.
2. Write a Python script that will use mpi4py to distribute calls of demoScript.m. Here I’ll call this script “callDemoScript.py”
3. Write a Bash script that sets up your MPI run and calls the Python function. Here I’ll call this script “submitDemoScript.sh”

## 1. demoScript.m

The demo Matlab script is found below. It reads in two arguments that are called from the command line. The first argument is the rank, which will vary for each task, and the second is the sample number, which will specify which input folder to read from.

%%%%%%%%%%%%%%%%%%%%
% demoScript.m
%
% reads an input file from a given sample number (specified via command line)
% prints output from the sample file associated with the sample number
% also prints the rank for demonstration purposes
%%%%%%%%%%%%%%%%%%%%

% read in command line input
arg_list = argv();
rank = arg_list{1,1}; % rank is the first argument
sample = arg_list{2, 1}; % sample number is the second argument

% Create a string that contains the location of the proper sample directory

% create a string to print the rank number
rank_call = strcat("This is rank_", rank, ", recieving the following input: \n");

% format the output and print
output = strcat(rank_call, sample_out);
fprintf(output)



## 2. callDemoScript.py

The second component is a Python script that uses mpi4py to call demoScript.m many times across different tasks. Each task will run a number of samples equal to a variable called “N_SAMPLES_PER_TASK” which will be fed to this script when it is called.

'''
callDemoScript.py

Called to batch demoScript.m across multiple MPI tasks

'''
from mpi4py import MPI
import numpy as np
import sys
import os
import time

# locate the COMM WORLD communicator
comm = MPI.COMM_WORLD

# get the number of the current rank
rank = comm.Get_rank()

# read in arguments from the submission script
TOTAL_TASKS = int(sys.argv[1]) # number of MPI processes

# loop through samples assigned to current rank
sample= rank + TOTAL_TASKS * i

# write the command that will be sent to the terminal (here RUN will replace the {})
terminal_command = "octave-cli ./demoScript.m {} {} ".format(rank, sample)

# write the terminal command to the process
os.system(terminal_command)

# sleep before submitting the next command
time.sleep(1) # optional, for memory intensive submissions

comm.Barrier()



## submitDemoscript.sh

The final component is a Bash script that will send this MPI job to the cluster. Here I’ll use SLURM to create 4 MPI tasks across 2 Nodes (each node will have 2 associated task). This will create a total of 4 MPI tasks, and each task will be assigned 5 samples to run.

I wrote this for a local cluster at Cornell, note that I had to load two modules to run Python and a third to run Octave (which is used to call Matlab scripts on Linux). I’ll call the Python script with mpirun, and then specify the total number of MPI tasks before making the function call. The output of the script is printed to a text file called demoOutput.txt

# Set up your parallel runs
N_NODES=2 # number of nodes

TOTAL_TASKS=$(($N_NODES*$TASKS_PER_NODE)) # total number of tasks # Submit the parallel job #!/bin/bash #SBATCH -n$(TOTAL_TASKS) -N $(N_NODES) #SBATCH --time=0:01:00 #SBATCH --job-name=demoMPI4py #SBATCH --output=output/demo.out #SBATCH --error=output/demo.err #SBATCH --exclusive module load py3-mpi4py module load py3-numpy module load octave/6.3.0 mpirun -np$TOTAL_TASKS python3 callDemoScript.py $TOTAL_TASKS$SAMPLES_PER_TASK > demoOutput.txt



Putting some thought into how you design a set of parallel runs can save you a lot of time and headache. The example above has worked well for me when submitting sets of embarrassingly parallel tasks, but each application will be different, so take the time to find the procedure that works best for you. Our blog and the internet are full of resources that can help you parallelize your code, below are some suggestions:

Performing Experiments on HPC Systems

Scaling experiments: how to measure the performance of parallel code on HPC systems

Parallel processing with R on Windows

How to automate scripts on a cluster

Parallelization of C/C++ and Python on Clusters

Developing parallelised code with MPI for dummies, in C (Part 1/2)

Cornell CAC glossery on HPC terms: https://cvw.cac.cornell.edu/main/glossary

A great MPI tutorial I found online: https://mpitutorial.com/tutorials/

# Writing sharable Python code part II: Unit Testing

When writing Python functions that may be used by others, it is important to demonstrate that your functions work as intended. In my last post I discussed how a proper function specification establishes a sort of contract between developers and users of functions that delineates errors in implementation (user error) from bugs in the code (developer error). While this contract is provides an important chain of accountability, it does not actually contain any proof that the function works as advertised. This is where unit testing comes in. A unit test simply runs a function over a suite of test cases (set of inputs that produce known results) to verify performance. As its name implies, a single unit test is meant to test one basic component of a code. Large or complex codes will have many sets of unit tests, each testing different elements (usually individual functions). Unit testing provides users with proof that your code works as intended, and serves as a powerful tool for finding and removing any errors in your code.

In this post I’ll provide guidance on the development of unit tests and demonstrate an example for a simple python function. Material in this post represents my interpretation of content taught by Professor Walker White at Cornell, in a Python Fundamentals course that I proctored in the summer of 2020.

## An example script

To illustrate unit testing I’ll examine the a function called “check_satisficing.py” which tests whether a set of performance objectives meets a set of satisficing criteria (for background on satisficing, see this post). The function returns a boolean and has three parameters:

• objs: a numpy array containing the two objective values
• criteria: a numpy array two criteria,
• directions: a list of strings that specify whether the criteria is meant to be a lower or upper bound.

Note that the actual code for this function only takes eight lines, the rest is the function specification and precondition checks.

def check_satisficing(objs, criteria, directions):
"""
Returns whether a set of objectives meets a set of satisficing criteria

Value return has type bool

Parameters:
objs: objective values from a given solution
criteria: satisficing criteria for each objective
directions: a list of strings containing "ge" or "le" for each
criteria, where ge indicates satisfication for values greater than
or equal to the criteria and le indicates satisfaction for values
less than or equal to the criteria

Examples:
objs = [.5, .5], criteria = [0.4, 0.4], directions = ['ge', 'ge']
returns True
objs = [.5, .5], criteria = [0.4, 0.4], directions = ['le', 'le']
returns False
objs = [.4, .4], criteria = [0.5, 0.5], directions = ['le', 'le']
returns True
objs = [.5, .5], criteria = [0.4, 0.4], directions = ['ge', 'le']
returns False
objs = [.5, .5], criteria = [0.4, 0.4], directions = ['le', 'ge']
returns False
objs = [.5, .5], criteria = [0.5, 0.5], directions = ['ge', 'ge']
returns True
objs = [.5, .5], criteria = [0.5, 0.5], directions = ['le', 'le']
returns True

Preconditions:
objs is a numpy array of floats with length 2
criteria is a numpy array of floats with length 2
directions is a list of strings with length 2 containing either
"ge" or "le"
"""

# check preconditions
assert len(objs) == 2, 'objs has length ' + repr(len(objs)) + ', should be 2'
assert len(criteria) == 2, 'criteria has length ' + repr(len(criteria)) + \
', should be 2'
assert len(directions) == 2, 'directions has length ' + \
repr(len(directions)) + ', should be 2'

# check to make sure
for i in range(2):
assert type(objs[i])== np.float64, 'objs element ' + str(i) + ': ' + \
repr(objs[i]) + ', is not a numpy array of floats'
assert type(criteria[i])== np.float64, 'criteria element ' + str(i) + \
': ' + repr(criteria[i]) + ', is not a numpy array of floats'
assert type(directions[i])== str, 'directions element ' + str(i) + \
': ' + repr(directions[i]) + ', is not a string'
assert directions[i] == 'ge' or directions[i] == 'le', 'directions ' + \
str(i) + ' is ' + repr(directions[i]) + ', should be either "ge" or "le"'

# loop through objectives and check if each meets the criteria
meets_criteria = True
for i in range(2):
if directions[i] == 'ge':
meets_criteria = meets_criteria and (objs[i] >= criteria[i])
else:
meets_criteria = meets_criteria and (objs[i] <= criteria[i])

return meets_criteria


## Developing test cases

If you read my last post, you may notice that this specification includes an extra section called “Examples”. These examples show the user how the function is supposed to perform. They also represent the suite of test cases used to validate the function. Creating test cases is more of an art than a science, and test cases will be unique to each function you write. However, there is a set of basic rule you can follow to guide your implementation of unit testing which I’ll outline below.

1. Pick the simplest cases first. In this example, the simplest cases are when both objectives are both above or below the criteria
2. Move to more complex cases. In this example, a more complex case could be when one objective is above and the other below, or vice versa
3. Think about “weird” possibilities. One “weird” case for this code could be when one or both objectives are equal to the criteria
4. Never test a precondition violation. Precondition violations are outside the scope of the function and should not be included in unit tests

Test cases should be exhaustive and even simple codes may have many test cases. In my example above I provide seven, can you think of any more that are applicable to this code?

## Implementing a testing strategy

After you’ve developed your test cases, it’s time to implement your unit test. For demonstration purposes I’ve written my own unit test code which can be used to test the cases developed above. This function simply utilizes assert statements to check if each test input generates the correct output. A danger of writing your own testing function is that the test function itself may have errors. In practice, it’s easiest to use an established tool such as PyTest to perform unit testing (for in-depth coverage of PyTest see Bernardo’s post from 2019).

def test_check_satisficing():
"""
Unit test for the function check_satisficing
"""
import numpy as np
from check_satisficing import check_satisficing

print("Testing check_satisficing")

# test case 1:
objs1 = np.array([0.5, 0.5])
criteria1 = np.array([0.4, 0.4])
directions1 = ['ge','ge']
result1 = True

assert (check_satisficing(objs1, criteria1, directions1)) == result1, \
'Test 1 failed ' + repr(objs1) + ', ' + repr(criteria1) + ', ' + \
repr(directions1) + ' returned False, should be True'

# test case 2:
objs2 = np.array([0.5, 0.5])
criteria2 = np.array([0.4, 0.4])
directions2 = ['ge','le']
result2 = False

assert (check_satisficing(objs2, criteria2, directions2)) == result2, \
'Test 2 failed ' + repr(objs2) + ', ' + repr(criteria2) + ', ' + \
repr(directions2) + ' returned True, should be False'

# test case 3:
objs3 = np.array([0.4, 0.4])
criteria3 = np.array([0.5, 0.5])
directions3 = ['le','le']
result3= True

assert (check_satisficing(objs3, criteria3, directions3)) == result3, \
'Test 3 failed ' + repr(objs3) + ', ' + repr(criteria3) + ', ' + \
repr(directions3) + ' returned False, should be True'

# test case 4:
objs4 = np.array([0.5, 0.5])
criteria4 = np.array([0.4, 0.4])
directions4 = ['ge','le']
result4 = False

assert (check_satisficing(objs4, criteria4, directions4)) == result4, \
'Test 4 failed ' + repr(objs4) + ', ' + repr(criteria4) + ', ' + \
repr(directions4) + ' returned True, should be False'

# test case 5
objs5 = np.array([0.5, 0.5])
criteria5 = np.array([0.4, 0.4])
directions5 = ['le','ge']
result5 = False

assert (check_satisficing(objs5, criteria5, directions5)) == result5, \
'Test 5 failed ' + repr(objs5) + ', ' + repr(criteria5) + ', ' + \
repr(directions5) + ' returned True, should be False'

# test case 6:
objs6 = np.array([0.5, 0.5])
criteria6 = np.array([0.5, 0.5])
directions6 = ['ge','ge']
result6 = True

assert (check_satisficing(objs6, criteria6, directions6)) == result6, \
'Test 6 failed ' + repr(objs6) + ', ' + repr(criteria6) + ', ' + \
repr(directions6) + ' returned False, should be True'

# test case 7:
objs7 = np.array([0.5, 0.5])
criteria7 = np.array([0.5, 0.5])
directions7 = ['le','le']
result7 = True

assert (check_satisficing(objs7, criteria7, directions7)) == result7, \
'Test 7 failed ' + repr(objs7) + ', ' + repr(criteria7) + ', ' + \
repr(directions7) + ' returned False, should be True'

print("check_satisficing has passed all tests!")

return


## Concluding thoughts

When developing a new Python function, it’s good practice to code the test cases while you’re writing the function and test each component during the development cycle. Coding in this matter will allow you to identify and fix errors early an can save a lot of time and headache. Creating test cases during code development may also improve the quality of your code by helping you conceptualize and avoid potential bugs.

# Writing sharable Python code

Most of us in academia do not have formal training in computer science, yet for many of us writing and sharing code is an essential part of our research. While the quality of code produced by many graduate students can be quite impressive, many of us never learned the basic CS principles that allow programs to be sharable and inheritable by other programmers. Last summer I proctored an online class in Python fundamentals. The course was for beginner programmers and, though it covered material simpler than the scripts I write for my PhD, I learned a lot about how to properly document and structure Python functions to make them usable by others. In this post I’ll discuss two important concepts I took away from this course: writing proper function specifications and enforcing preconditions using assert statements.

## Function Specifications

One of the the most important aspects of writing a quality Python function is proper specification. While the term may sound generic, a specification actually has a very precise definition and implementation for a Python function. In practice, a specification is a docstring, a “string literal” that occurs as the first statement in a function, module, class or method, formed by a bracketed set of “””. Here is an example of a simple function I wrote with a specification:


"""
Returns: theta converted to degrees

Value return has type float

Parameter theta: the angle in radians
Precondition: theta is a float
"""
return theta * (180.0/3.14159)


The function specification is everything between the sets of “””. When Python sees this docstring at the front of a function definition, it automatically is stored as the “__doc__” associated with the function. With this specification in place, any user that loads this function can access its __doc__ by typing help(radians_to_degrees), which will print the following to the terminal:

Help on function radians_to_degrees in module __main__:

Returns: theta converted to degrees

Value return has type float

Parameter theta: the angle in radians
Precondition: theta is a float


The help function will print anything in the docstring at the start of a function, but a it is good practice for the specification to have the following elements:

1. A one-line summary of the function at the very beginning, if the function is a “fruitful function” (meaning it returns something), this line should tell what it returns. In my example I note that my function returns theta converted to degrees.
2. A more detailed description of the function. In my example this simply provides the type of the return value.
3. A description of the function’s parameter(s)
4. Any preconditions that are necessary for the code to run properly (more on this later)

I should note that officially the Python programming language has a more flexible set of requirements for function specifications, which can be found here, but the attributes above are a good starting point for writing clear specifications.

Properly specifying a Python function will clarify the function’s intended use and provide instructions for how new users can utilize it. It will also help you document your code for formal release if you ever publish it. Google any of your favorite Python functions and you’ll likely be brought to a page that has a fancy looking version of the function’s specification. These pages can be automatically generated by tools such as Spinx that create them right from the function’s definition.

Aside from clarifying and providing instructions for your function, specifications provide a means of creating a chain of accountability for any problems with your code. This chain of accountability is created through precondition statements (element four above). A precondition statement dictates requirements for the function to run properly. Preconditions may specify the type of parameter input (i.e. x is a float) or a general statement about the parameter (x < 0).

For large teams of many developers and users of functions, precondition statements create a chain of accountability for code problems. If the preconditions are violated and a code crashes, then it is the responsibility of the user, who did not use the code properly. On the other hand, if the preconditions were met and the code crashes, it is the responsibility of the developer, who did not properly specify the code.

## Asserting preconditions to catch errors early

One way to improve the usability of a code is to catch precondition violations with statements that throw specific errors. In Python, this may be done using assert statements. Assert statements evaluate a boolean expression and can display a custom error message if the expression returns false. For example, I can modify my radians_to_degrees function with the following assert statement (line 10):

def radians_to_degrees(theta):
"""
Returns: theta converted to degrees

Value return has type float

Parameter theta: the angle in radians
Precondition: theta is a float
"""
assert type(theta) == float, repr(theta) + ' is not float'
return theta * (180.0/3.14159)


A helpful function in my assert statement above is “repr”, which will return a printable representation of a given object (i.e. for a string it will keep the quotation marks around it). Now if I were to enter ‘a’ into my function, I would get the following return:

AssertionError: 'a' is not float


This is trivial for my example, but can save a lot of time when running large and complex functions. Rather than seeing 30 lines of “traceback…”, the user will be immediately brought to the source of the error. In general, you should try to make assert statements as precise as possible to pinpoint the violation.

It’s important to note that not every conceivable precondition violation requires an assert statement. There is a tradeoff between code efficiency and number of assert statements (checking a precondition takes time). Determining which preconditions to strictly enforce with assert statements is a balancing act and will be different for each application. It’s important to remember though, that even if you do not have an assert statement for a precondition, you should still list all preconditions in the function specification to preserve the chain of accountability.

## Further resources

This post concerns specifications for Python functions. However, the use of docstrings is also important when coding modules, classes and public methods. Guidance on how these docstrings may be formatted can be found here.

While this post focused on Python, informative function specifications are important for all programming languages. Some resources on documentation in other languages can be found below:

# More on simple Bash Shell scripts (examples of “find” and “sed”)

When you conduct a large ensemble of computer simulations with several scenarios, you are going to deal with many data, including inputs and outputs.  You also need to create several directories and subdirectories where you can put or generate the inputs and outputs for your model.  For example, you may want to run a cropping system model across a large region, for 3500 grid cells, and you need to feed your model with the input files for each grid cell. Each grid cell has its own weather, soil, crop and management input files. Or you may want to run your model 100,000 times and each time use one set of crop parameters as an input, to conduct a sensitivity analysis. Another common practice of large simulations is looking for any hidden error that happens during the simulations. For example, your running jobs might look normal, without any obvious crash, but you may still get some kind of “error” or “warning” in your log files. So, you need to find those runs, correct them, delete the wrong files and rerun them to have a full set of outputs. These tasks are basic but could be challenging and very time-consuming if you do not know how to complete them efficiently. Linux environment provides facilities that make our lives easier as Dave said in his blog post, and Bernardo also provided some examples for this type of task. Here are a few more instances of simple but useful commands with “find” and “sed.”

find

Sometimes, you want to know how many files with a specific pattern exist in all the subdirectories in a folder. You can type below command at the upper-level folder. “f” means files, and in front of the “name,” we specify the pattern—for example, files that start with “218”. Or we can look for all the files that have the specific extension [i.e. *.csv] or have a specific strings in their name [i.e. *yield*].

find . -type f -name "218*"


Then we can transfer the listed lines of results [-l] to a count function [wc] with pipe [|]:

find . -type f -name "218*" |  wc -l


You may want to find and delete all files with the specific pattern [i.e. 218_wheat.csv] in all directories in which they might exist. So, after we find the files, we can execute [exec] the remove command [rm]:

find . -type f -name "218_wheat*" -exec rm {} \;


If these files are located in different directories and we don’t want to delete them all, we can also filter the find results by specifying the pattern of path [i.e. output] and then delete them:

find . -type f -path "*/output/*" -name "218_wheat *" -exec rm {} \;


Sometimes, we need to find specific paths instead of files. For example, I have a text file, and I want to copy that into the “CO2” folder, which is located in the “Inputs” folders of several scenario folders:

find . -type d -path "*/Inputs/*" -name "CO2" -exec cp ../CO2_concentration_8.5.txt {} \;


“d” means directories, so we are searching for directories that contain “Inputs” folder and end with “CO2” folder. Then, we execute the copy command [cp], which copies the text file into the found directories.

If we are looking for a specific string inside some files, we can combine “find” and “grep.” For example, here I am looking for any error file [*.err] that starts with “218cs” if it contains this specific warning: “unable to find”

find . -type f -name “218cs*.err” –exec grep -i “unable to find” {} \;


Or we can look for files that do not contain “success.”

find . -type f -name 218cs*.err" -exec grep -L "success" {} \;


sed

“sed” is a powerful text editor. Most of the time it is used to replace specific string in a text file:

sed -i 's/1295/1360/' 218cs.txt


Here, we insert [i] and substitute [s] a new string [1360] to replace it with the original string [1295]. There might be few duplication of “1295” in a file, and we may just want to replace one of them—for example, one located at line 5:

sed -i '5s/1295/1360/' 218cs.txt


There might be more modifications that have to be done, so we can add them in one line using “-e”:

sed -i -e '5s/1295/1360/' -e '32s/1196/1200/' -e '10s/default/1420/' 218cs.txt


find + sed

If you want to find specific text files (i.e., all the 218cs.txt files, inside RCP8.5 folders) and edit some lines in them by replacing them with new strings, this line will do it:

find . -type f -path "*/RCP8.5/*" -name "218*" -exec sed -i -e '5s/1295/1360/' -e '32s/1196/1200/' -e '10s/default/1420/'  {} \;


Sometimes, you want to replace an entire line in a text file with a long string, like a path, or keep some space in the new line. For example, I want to replace a line in a text file with the following line, which has the combination of space and a path:

“FORCING1         /home/fs02/pmr82_0001/tk662/calibration/451812118/forcings/data_”

For this modification, I am going to assign a name to this line and then replace it with the whole string that is located at line 119 in text files [global_param_default.txt], which are located in specific directories [with this pattern “RCP4.5/451812118”].

path_new="FORCING1	/home/fs02/pmr82_0001/tk662/calibration/451812118/forcings/data_"
find . -type f -path "*RCP4.5/451812118/*" -name "global_param_*" -exec sed -i "119s|.*|$path_new|" global_param_default.txt {} +  Sometimes, you want to add a new line at the specific position (i.e., line 275) to some text files (i.e., global_param_default.txt). find . -type f -name " global_param_*" -exec sed -i "275i\OUTVAR OUT_CROP_EVAP %.4f OUT_TYPE_FLOAT 1" {} \;  Now, all of the “global_param_default” files have a new line with this content: “OUTVAR OUT_CROP_EVAP %.4f OUT_TYPE_FLOAT 1”. It is also possible that you want to use a specific section of a path and use it as a name of a variable or a file. For example, I am searching for directories that contain an “output” folder. This path would be one of the them: ./453911731_CCF/output/ Now, I want to extract “453911731” and use it as a new name for a file [output__46.34375_-119.90625] that is already inside that path: for P in$(find . -type d -name "output"); do (new_name="$(echo "${P}"| sed -r 's/..(.{9}).*/\1/')" && cd "${P}" && mv output__46.34375_-119.90625$ new_name); done


With this command line, we repeat the whole process for each directory (with “output” pattern) by using “for,” “do,” and “done.” At the beginning of the command, the first search result, which is a string of path, is assigned to the variable “P” by adding $and () around “find” section .Then, the result of “sed –r” is going to be assigned to another variable [new_name]; “-r” in the sed command enables extended regular expressions. With the “sed” command, we are extracting 9 characters after “./” and removing everything after 9 characters. Each “.” matches any single character. Parentheses are used to create a matching group. Number 9 means 9 occurrences of the character standing before (in this case “.” any character), and “\1” refers to the first matched substring “&&” is used to chain commands together. “cd” allows you to change into a specified path, which is stored in$P, and “mv” renames the file in this path from “output__46.34375_-119.90625” to “453911731,” which is stored in $new_name. # Profiling your Python script using cProfile Profiling refers to performing dynamic analysis on a script to measure its execution time, the execution time of its subcomponents, as well as how many times each subcomponent is being called. This produces data on where the script program is spending the most time, and can help with optimizing your script to minimize its execution time. This blog has had two past posts on profiling, one on C++ using Callgrind and one on Python using PyCharm. PyCharm is a Python IDE that’s very useful but unfortunately not free, so if you’re looking for some freeware profiling functionality in Python, this post is for you. Python has a module called cProfile. A simple example on timing the multiplication of two matrices with cProfile: import cProfile import numpy as np mat1 = ([1, 6, 3],[3, 6, 3],[2, 8, 3]) mat2 = ([2, 7, 6],[5, 4, 7],[6, 1, 9]) cProfile.run('np.dot(mat1,mat2)')  this should print out something like the following:  4 function calls in 0.000 seconds Ordered by: standard name ncalls tottime percall cumtime percall filename:lineno(function) 1 0.000 0.000 0.000 0.000 <string>:1(<module>) 1 0.000 0.000 0.000 0.000 {built-in method builtins.exec} 1 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.dot} 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}  I have recently used cProfile on one of my own scripts which I’ll be using here to demonstrate how it can be used in your own work. I have a function called fish_game, which contained my model and took vars as input. This function also calls function hrvSTR which represented my action policy function (it’s extraneous to this post what these functions do exactly, one represents the system and the other represents a policy that we use to act on the system, you can see the full model here). The fish_game function was called by my optimization algorithm during optimization. Running cProfile on it produces this: cProfile.run('fish_game(vars)',sort='cumulative') 282814 function calls in 0.698 seconds Ordered by: cumulative time ncalls tottime percall cumtime percall filename:lineno(function) 1 0.000 0.000 0.698 0.698 {built-in method builtins.exec} 1 0.000 0.000 0.698 0.698 <string>:1(<module>) 1 0.166 0.166 0.698 0.698 <ipython-input-4-df258d5a749f>:55(fish_game) 20200 0.385 0.000 0.531 0.000 <ipython-input-4-df258d5a749f>:1(hrvSTR) 20200 0.016 0.000 0.089 0.000 fromnumeric.py:1821(sum) 20200 0.021 0.000 0.069 0.000 fromnumeric.py:64(_wrapreduction) 121208 0.055 0.000 0.055 0.000 {built-in method numpy.core.multiarray.zeros} 20200 0.046 0.000 0.046 0.000 {method 'reduce' of 'numpy.ufunc' objects} 20200 0.003 0.000 0.003 0.000 {built-in method builtins.isinstance} 40400 0.003 0.000 0.003 0.000 {built-in method builtins.len} 20200 0.002 0.000 0.002 0.000 {method 'items' of 'dict' objects} 2 0.000 0.000 0.000 0.000 {method 'normal' of 'mtrand.RandomState' objects} 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}  This tells me one run of my function takes 0.698 seconds in total (this might vary slightly every time, depending on your processor usage at the time and other factors), and that most of that time, 0.531 seconds, are consumed by the hrvSTR function. Even though 0.7 seconds might not seem long, in the context of optimization, where this function would need to be evaluated tens of thousands of times, an additional 0.1 seconds might add hours of process time to your workflow. Trying to bring that down is probably a worthwhile investment of time that will result in time savings later on. As a result I figured there could be something I could do to reduce the time hrvSTR took. I particularly intrigued by the fact that some numpy process numpy.core.multiarray.zeros was called 121208 times, an order of magnitude more than every other method in my script, which prompted me to think that I might be unnecessarily repeating a process. Looking at my code more closely (this is a script I have been using for more than a year now), I realized that I was ordering and normalizing and creating arrays for my action policy every single time it was called, something that was unnecessary, as the same policy was used for all time steps. I could instead perform all those steps once, save the outputs and use them for every time step instead of recalculating every time. I spent some time to adjust my script to do that and running cProfile again, produced this:  cProfile.run('fish_game(vars)',sort='cumulative') 79282 function calls in 0.379 seconds Ordered by: cumulative time ncalls tottime percall cumtime percall filename:lineno(function) 1 0.000 0.000 0.379 0.379 {built-in method builtins.exec} 1 0.000 0.000 0.379 0.379 <string>:1(<module>) 1 0.150 0.150 0.379 0.379 <ipython-input-16-3c528334eb55>:62(fish_game) 19800 0.193 0.000 0.229 0.000 <ipython-input-16-3c528334eb55>:35(hrvSTR) 59414 0.036 0.000 0.036 0.000 {built-in method numpy.core.multiarray.zeros} 4 0.000 0.000 0.000 0.000 fromnumeric.py:2817(mean) 4 0.000 0.000 0.000 0.000 _methods.py:58(_mean) 2 0.000 0.000 0.000 0.000 <ipython-input-16-3c528334eb55>:4(generate_policy) 6 0.000 0.000 0.000 0.000 {method 'reduce' of 'numpy.ufunc' objects} 2 0.000 0.000 0.000 0.000 {method 'normal' of 'mtrand.RandomState' objects} 2 0.000 0.000 0.000 0.000 fromnumeric.py:1821(sum) 2 0.000 0.000 0.000 0.000 fromnumeric.py:64(_wrapreduction) 4 0.000 0.000 0.000 0.000 {built-in method builtins.hasattr} 4 0.000 0.000 0.000 0.000 _methods.py:48(_count_reduce_items) 2 0.000 0.000 0.000 0.000 {method 'clip' of 'numpy.generic' objects} 4 0.000 0.000 0.000 0.000 numeric.py:504(asanyarray) 10 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance} 8 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass} 4 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.array} 2 0.000 0.000 0.000 0.000 {method 'items' of 'dict' objects} 4 0.000 0.000 0.000 0.000 {built-in method builtins.len} 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}  I basically shaved 0.3 seconds off my function evaluation time by investing some time to look more closely at my script. If optimizing with 30000 function evaluations, this translates to some 2+ hours of processing time that I am saving (I spent far less figuring this out for the first time). You can also run cProfile directly from the command line like so: python -m cProfile -s cumtime fish_game.py  # Parallelization of C/C++ and Python on Clusters There are multiple ways of parallelizing C/C++ and Python codes to be used on clusters. The quick and dirty way, which often gets the job done, is to parallelize the code with MPI and launch one process per core across multiple nodes. Though easy to implement, this strategy may result in a fair waste of results because: • With too many processes on one node, inefficient use of memory may make the code memory bottle-necked, meaning that a lot of time will be spent accessing memory and the cores will be sitting twiddling their thumbs. • If you have a 48-core node with 192 GB of memory (as the Skylake nodes on Stampede 2) and your code uses 10 GB of RAM per process, you will not be able to have more than 19 out of 48 cores working at the same time because more than that would exceed the node’s memory, which would crash the node, which would get you kicked out of the cluster for a while without a warning — if you do not know how much RAM your code requires, you should look into the Remora tool. If you are working on big runs on systems that charge per usage time or if you are creating a code you will have to run multiple times, it may be worth looking into more clever parallelization strategies. Some of the techniques described in the next subsections will also make your code two to four times faster on your desktop/laptop and are rather straight-forward to implement. Most of the code in this blog post can be freely downloaded from this repository. Lastly, if the #include statements in the examples below are not displayed as followed by a library between “<>”, please refer to the mentioned online repository. ## Hybrid Parallelization in C/C++ The word hybrid refers in this context to a mixed use of shared and distributed memory parallelization techniques. In distributed parallelization techniques, such as the Message Passing Interface (MPI), each process has its own dedicated chunk of memory within a RAM chip. This means that each process will have its own copy of the input data (repeated across processes), variables within the program, etc. Conversely, shared memory parallelization techniques such as OpenMP allow all threads (parallel unit of computation) to access the same block of memory, which can make a mess in your code with variables being changed seemingly out of nowhere by various parallel runs, but which lessens the amount of RAM needed and in some cases allow for more efficient memory utilization. If running the Borg Master-Slave multiobjective evolutionary optimization algorithm (Borg MS) with hybrid parallelization, Borg MS will have each function evaluation performed by one MPI process, which could run in parallel with OpenMP (if your model is setup for that) across the cores designated for that process. As soon as a function evaluation is complete, the slave MPI process in charge of it will submit the results back to Borg MS’s master process and be given another set of decision variables to work on. The figure below is a schematic drawing of such hybrid parallelization scheme: ### OpenMP Parallelizing a for-loop in C or C++ in which each iteration is independent — such as in Monte Carlo simulations — is straight-forward: #include #include "path/to/my_model.h" int main(int argc, char *argv[]) { int n_monte_carlo_simulations = omp_get_num_threads(); // One model run per core. vector system_inputs = ; vector results(n_monte_carlo_simulations, 0); #pragma omp parallel for for (int i = 0; i < n_monte_carlo_simulations; ++i) { results[i] = RunMonteCarloSimulation(system_inputs[i]); } }  The only caveat is that you have to be sure that the independent simulations are not writing over each other. For example, if by mistake you had written results[0] instead of results[i], the final value on results[0] would be that of whichever thread (parallel simulation) finished last, while the rest of the vector would be of zeros. It is important to notice here that the vector system_inputs is read only once by the parallel code and all n_monte_carlo_simulations threads would have access to it in memory. Though this example is not rocket science, there is a lot more to OpenMP that can be found here and here. ### MPI MPI stands for Message Passing Interface. It creates processes which pass messages (information packages) to each other. The most basic MPI code is one that creates independent processes — each with its own allocated chunk of memory that cannot be accessed by any other process — and just has them running independently in parallel without communicating with each other. This is great to distribute Monte Carlo runs across various nodes, although it may lead to the limitations mentioned in the beginning of this post. Below is an example of a very basic MPI code to run Monte Carlo simulations in parallel: #include #include "path/to/my_model.h" int main(int argc, char *argv[]) { MPI_Init(NULL, NULL); int world_rank; MPI_Comm_rank(MPI_COMM_WORLD, &world_rank); vector system_inputs = ; RunMonteCarloSimulation(system_inputs[world_rank]); MPI_Finalize(); return 0; }  I will not cover in this post how to pass the output of the RunMonteCarloSimulation function back to a results vector, as in the previous example, because details of MPI is outside the scope of this post — a good MPI tutorial can be found here and great examples, including ones for Monte Carlo runs, can be found here. To run this code, you would have to compile it with mpicc mpi_demo.c -o MPIDemo and call it with the mpirun or mpiexec commands, which should be followed by the -n flag to specify the number of processes to be created, or mpirun -n 4 ./MPIDemo. The code above would run as many MonteCarlo simulations as there are processes (four processes if -n 4 was used), each with a rank (here a type of ID), and can be spread across as many nodes as you can get on a cluster. Each process of the code above would run on a core, which may result in the memory and/or processing limitations listed in the beginning of this post. ### Mixing OpenMP and MPI The ideal situation for most cases is when you use MPI to distribute processes across nodes and OpenMP to make sure that as many cores within a node as it makes sense to use are being used. To do this, you can use the structure of the code below. #include #include #include #include "path/to/my_model.cpp" int main(int argc, char *argv[]) { MPI_Init(NULL, NULL); int world_rank; MPI_Comm_rank(MPI_COMM_WORLD, &world_rank); int n_cores = omp_get_num_threads(); // get number of cores available for MPI process. vector system_inputs = ; vector results(n_cores , 0); #pragma omp parallel for for (int i = 0; i < n_cores ; ++i) { // printf("C hybrid. Process-thread: %s-%d\n", world_rank, omp_get_thread_num()); results[i] = RunMonteCarloSimulation(system_inputs[i]); } MPI_Finalize(); return 0; }  If you comment lines 4 and 20 and uncomment line 19, compile the code with mpicc hybrid.c -o Hybrid -fopenmp, set the number of threads in the terminal to eight by running set OMP_NUM_THREADS=8, and call the resulting executable with the command mpirun -n 4 ./Hybrid, the executable will spawn four processes, each of which spawning eight threads and printing the following output: C hybrid. Process-thread: 0-0 C hybrid. Process-thread: 0-1 C hybrid. Process-thread: 0-4 C hybrid. Process-thread: 0-5 C hybrid. Process-thread: 0-7 C hybrid. Process-thread: 0-6 C hybrid. Process-thread: 0-2 C hybrid. Process-thread: 0-3 C hybrid. Process-thread: 1-6 C hybrid. Process-thread: 1-0 C hybrid. Process-thread: 1-1 C hybrid. Process-thread: 1-3 C hybrid. Process-thread: 1-5 C hybrid. Process-thread: 1-4 C hybrid. Process-thread: 1-7 C hybrid. Process-thread: 1-2 C hybrid. Process-thread: 2-6 C hybrid. Process-thread: 2-0 C hybrid. Process-thread: 2-3 C hybrid. Process-thread: 2-7 C hybrid. Process-thread: 2-2 C hybrid. Process-thread: 2-4 C hybrid. Process-thread: 2-5 C hybrid. Process-thread: 2-1 C hybrid. Process-thread: 3-0 C hybrid. Process-thread: 3-1 C hybrid. Process-thread: 3-3 C hybrid. Process-thread: 3-4 C hybrid. Process-thread: 3-6 C hybrid. Process-thread: 3-7 C hybrid. Process-thread: 3-2 C hybrid. Process-thread: 3-5  The code above can be used to have one process per node (and therefore one copy of system_inputs per node) using as many threads as there are cores in that node — depending on the number of cores, it is sometimes more efficient to have two or three MPI processes per node, each using half or a third of the cores in that node. Below is a PBS job submission script that can be used to generate the output above — job submission scripts for Slurm to be used on Stampede 2 can be found here. #!/bin/bash #PBS -N test_hybrid #PBS -l nodes=2:ppn=16 #PBS -l walltime=:00:30 #PBS -o ./output/hybrid_c.out #PBS -e ./error/hybrid_c.err module load openmpi-1.10.7-gnu-x86_64 export OMP_NUM_THREADS=8 set -x cd "$PBS_O_WORKDIR"

# Construct a copy of the hostfile with only 16 entries per node.
# MPI can use this to run 16 tasks on each node.
uniq "$PBS_NODEFILE"|awk -v TASKS_PER_NODE="$TASKS_PER_NODE" '{for(i=0;i nodefile

#cat nodefile
mpiexec --hostfile nodefile -n 4 -x OMP_NUM_THREADS ./Hybrid


## Parallelizing Python on a Cluster

Python was not designed with shared-memory parallelization in mind, so its native parallelization library, the Multiprocessing library, does distributed-memory parallelization instead of shared. The issue with the Multiprocessing library is that it does not work across nodes, so you still need something like MPI. Here, we will use MPI4Py.

The code below parallelizes the function call across cores within the same node — please refer to this blog post for details about and other ways to use the Multiprocessing library.

from multiprocessing import Process, cpu_count

def do_something_useful(rank, shared_process_number):
# Do something useful here.
print 'Python hybrid, MPI_Process-local_process (not quite a thread): {}-{}'.format(rank, shared_process_number)

# Create node-local processes
shared_processes = []
#for i in range(cpu_count()):
for i in range(8):
p = Process(target=do_something_useful, args=(i,))
shared_processes.append(p)

# Start processes
for sp in shared_processes:
sp.start()

# Wait for all processes to finish
for sp in shared_processes:
sp.join()


By adding MPI4Py to the code above, following the same logic of the C/C++ example, we get the following result:

from mpi4py import MPI
from multiprocessing import Process, cpu_count

def do_something_useful(rank, shared_process_number):
# Do something useful here.
print 'Python hybrid, MPI_Process-local_process (not quite a thread): {}-{}'.format(rank, shared_process_number)

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

print 'Calling Python multiprocessing process from Python MPI rank {}'.format(rank)

# Create shared-ish processes
shared_processes = []
#for i in range(cpu_count()):
for i in range(8):
p = Process(target=do_something_useful, args=(rank, i))
shared_processes.append(p)

# Start processes
for sp in shared_processes:
sp.start()

# Wait for all processes to finish
for sp in shared_processes:
sp.join()

comm.Barrier()


Calling the code above with mpirun -n 4 python hybrid_pure_python.py will first result in four MPI processes being spawned with ranks 1 through 4, each spawning eight local processes. The output should look similar to the one below:

Calling Python multiprocessing process from Python MPI rank 0
Calling Python multiprocessing process from Python MPI rank 1
Python hybrid, MPI_Process-local_process (not quite a thread): 1-0
Python hybrid, MPI_Process-local_process (not quite a thread): 0-0
Python hybrid, MPI_Process-local_process (not quite a thread): 1-1
Python hybrid, MPI_Process-local_process (not quite a thread): 0-1
Python hybrid, MPI_Process-local_process (not quite a thread): 1-2
Python hybrid, MPI_Process-local_process (not quite a thread): 0-2
Python hybrid, MPI_Process-local_process (not quite a thread): 0-3
Python hybrid, MPI_Process-local_process (not quite a thread): 1-3
Calling Python multiprocessing process from Python MPI rank 2
Calling Python multiprocessing process from Python MPI rank 3
Python hybrid, MPI_Process-local_process (not quite a thread): 1-4
Python hybrid, MPI_Process-local_process (not quite a thread): 0-4
Python hybrid, MPI_Process-local_process (not quite a thread): 0-5
Python hybrid, MPI_Process-local_process (not quite a thread): 1-5
Python hybrid, MPI_Process-local_process (not quite a thread): 1-6
Python hybrid, MPI_Process-local_process (not quite a thread): 0-6
Python hybrid, MPI_Process-local_process (not quite a thread): 1-7
Python hybrid, MPI_Process-local_process (not quite a thread): 0-7
Python hybrid, MPI_Process-local_process (not quite a thread): 2-0
Python hybrid, MPI_Process-local_process (not quite a thread): 3-0
Python hybrid, MPI_Process-local_process (not quite a thread): 3-1
Python hybrid, MPI_Process-local_process (not quite a thread): 2-1
Python hybrid, MPI_Process-local_process (not quite a thread): 3-2
Python hybrid, MPI_Process-local_process (not quite a thread): 2-2
Python hybrid, MPI_Process-local_process (not quite a thread): 3-3
Python hybrid, MPI_Process-local_process (not quite a thread): 2-3
Python hybrid, MPI_Process-local_process (not quite a thread): 3-4
Python hybrid, MPI_Process-local_process (not quite a thread): 2-4
Python hybrid, MPI_Process-local_process (not quite a thread): 2-5
Python hybrid, MPI_Process-local_process (not quite a thread): 3-5
Python hybrid, MPI_Process-local_process (not quite a thread): 3-6
Python hybrid, MPI_Process-local_process (not quite a thread): 3-7
Python hybrid, MPI_Process-local_process (not quite a thread): 2-6
Python hybrid, MPI_Process-local_process (not quite a thread): 2-7


Below is a PBS submission script that can be used with the code above:

#!/bin/bash
#PBS -N test_hybrid
#PBS -l nodes=2:ppn=16
#PBS -l walltime=00:01:00
#PBS -o ./output/hybrid_python.out
#PBS -e ./error/hybrid_python.err

set -x
cd "$PBS_O_WORKDIR" # Construct a copy of the hostfile with only 16 entries per node. # MPI can use this to run 16 tasks on each node. export TASKS_PER_NODE=2 uniq "$PBS_NODEFILE"|awk -v TASKS_PER_NODE="$TASKS_PER_NODE" '{for(i=0;i nodefile #cat nodefile mpiexec --hostfile nodefile -n 4 -x OMP_NUM_THREADS python hybrid_pure_python.py  ## Parallelizing Monte Carlo Runs of C/C++ (or anything else) with Python Very often we have to run the same model dozens to thousands of times for Monte-Carlo-style analyses. One option is to go on a cluster and submit dozens to thousand of jobs, but this becomes a mess for the scheduler to manage and cluster managers tend to not be very fond of such approach. A cleaner way of performing Monte Carlo runs in parallel is to write a code in Python or other language that spawns MPI processes which then call the model to be run. This can be done in Python with the subprocess library, as in the example below: from mpi4py import MPI import subprocess comm = MPI.COMM_WORLD rank = comm.Get_rank() print 'Calling OpenmpTest from Python MPI rank {}'.format(rank) function_call = ['./OpenmpTest', '{}'.format(rank)] subprocess.call(function_call) comm.Barrier()  Here, OpenmpTest is an executable compiled from the C-OpenMP code below: #include #include int main(int argc, char *argv[]) { #pragma omp parallel for for (int i = 0; i < omp_get_num_threads(); ++i) { printf("C OpenMP. Process-thread: %s-%d\n", argv[1], i); } return 0; }  Submitting the code above with following PBS job submission script should result in the output below. #!/bin/bash #PBS -N test_hybrid #PBS -l nodes=2:ppn=16 #PBS -l walltime=00:01:00 #PBS -o ./output/hybrid_python_c.out #PBS -e ./error/hybrid_python_c.err module load python-2.7.5 export OMP_NUM_THREADS=8 set -x cd "$PBS_O_WORKDIR"

# Construct a copy of the hostfile with only 16 entries per node.
# MPI can use this to run 16 tasks on each node.
uniq "$PBS_NODEFILE"|awk -v TASKS_PER_NODE="$TASKS_PER_NODE" '{for(i=0;i nodefile

#cat nodefile
mpiexec –hostfile nodefile -n 4 -x OMP_NUM_THREADS python hybrid_calls_c.py

Output:

Calling OpenmpTest from Python MPI rank 0
Calling OpenmpTest from Python MPI rank 1
Calling OpenmpTest from Python MPI rank 2
Calling OpenmpTest from Python MPI rank 3


## End of the Story

As highlighted in the examples above (all of which can be download from here), creating parallel code is not rocket science and requires very few lines of code. The OpenMP and Multiprocessing parallelization tricks can save hours if you have to run a model several times on your personal computer, and paired with MPI these tricks can save you from hours to days in time and thousands of node-hours on clusters. I hope you make good use of it.

# Intro to Machine Learning Part 6: Gaussian Naive Bayes and Logistic Regression

Machine Learning problems often involve binary classification, which seeks to use a data point’s features, x, to correctly predict its label, y. In my last post I discussed binary classification with Support Vector Machines (SVM), which formulates the classification problem as a search for the maximum margin hyperplane that divides two classes. Today we’ll take different view on binary classification, we’ll use our training set to construct P(y|x), the probability of class y given a set of features x and classify each point by determining which class it is more likely to be. We’ll examine two algorithms for that use different strategies for estimating P(y|x), Naïve Bayes and Logistic regression. I’ll demonstrate the two classifiers on an example data set I’ve created, shown in Figure 1 below. The data set contains features X = (X1, X2) and  labels Y∈ (+1,-1),  positive points are shown as blue circles and negative as red triangles. This example was inspired by an in class exercise in CS 5780 at Cornell, though I’ve created this data set and set of code myself using python’s scikit-learn package.

Figure 1: Example training set

## Gaussian Naïve Bayes

Naïve Bayes is a generative algorithm, meaning that it uses a set of training data to generate P(x,y) and then uses Bayes Rule to find P(y|x):

$P(y|x)=\frac{P(x|y)P(y)}{P(x)}$                                (1)

A necessary condition for equation 1 to hold is the Naïve Bayes assumption, which states that feature values are independent given the label. While this is a strong assumption, it turns out that using this assumption can create effective classifiers even if it is violated.

To use Bayes rule to construct a classifier, we need a second assumption regarding the conditional distribution of each feature x on each label y. Here we’ll use a Gaussian distribution such that:

$P(x|y) ~ N(\mu_y, \Sigma_y)$                                                                                   (2)

Where $\Sigma_y$ is a diagonal covariance matrix with $[\Sigma_y]_{\alpha,\alpha}=\sigma^2_{\alpha, y}$ for each feature $\alpha$.

For each feature, $\alpha$, and each class, c we can then model $P(x_\alpha|y)$ as:

$P(x_\alpha|y=c) ~ N(\mu_{\alpha c},\sigma^2_{\alpha c})=\frac{1}{\sqrt{2\pi}\sigma_\alpha c}e^{-\frac{1}{2}(\frac{x_\alpha-\mu_{\alpha c}}{\sigma_{\alpha c}})^{2}}$                              (3)

We can then estimate model parameters:

$\mu_{\alpha c} = \frac{1}{n_c}\sum^{n}_{i=1}I(y_i=c)x_{i \alpha}$                                                                   (4)

$\sigma^2_{\alpha c} = \frac{1}{n_c}\sum^{n}_{i=1}I(y_i=c)(x_{i \alpha}-\mu_{\alpha c})^2$                                                  (5)

Where:

$n_c = \sum^{n}_{i=1}I(y_i=c)$                                                                                (6)

Parameters can be estimated with Maximum likelihood estimation (MLE) or maximum a posteriori estimation (MAP).

Once we have fit the conditional Gaussian model to our data set, we can derive a linear classifier, a hyperplane that separates the two classes,  which takes the following form:

$P(y|x) = \frac{1}{1+e^{-y(w^T x+b)}}$                                                                             (7)

Where w is a vector of coefficients that define the separating hyperplane and b is the hyperplane’s intercept. W and b are functions of the Gaussian moments derived in equations 4 and 5. For a full derivation of the linear classifier starting with the Naive Bayes assumption, see the excellent course notes from CS 5780.

## Logistic Regression

Logistic regression is the discriminative counterpart to Naive Bayes, rather than modeling P(x,y) and using it to estimate P(y|x), Logistic regression models P(y|x) directly:

$P(y|x) = \frac{1}{1+e^{-y(w^T x+b)}}$                                                                              (8)

Logistic regression uses MLE or MAP to directly estimate the parameters of the separating hyperplane, w and b rather than deriving them from the moments of P(x,y). Rather than seeking to fit parameters that best describe the test data, logistic regression seeks to fit a hyperplane that best separates the test data. For derivation of MLE and MAP estimates of logistic regression parameters, see the class notes from CS 5780.

## Comparing Gaussian Naive Bayes and Logistic Regression

Below I’ve plotted the estimated classifications by the two algorithms using the Scikit-learn package in Python. Results are shown in Figure 2.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
import seaborn as sns
sns.set(style='whitegrid')

## create a test data set ##
pos = np.array([[1,5], [1,7], [1,9], [2,8], [3,7], [1,11], [3,3], \
[5,5], [4,8], [5,9], [2,6], [3,9], [4,4]])
neg = np.array([[4,1], [5,1], [3,2], [2,1], [8,4], [6,2], [5,3], \
[4,2], [7,1], [5,4], [6,3], [7,4], [4,3], [5,2], [8,5]])
all_points = np.concatenate((pos,neg), 0)
labels = np.array([1,1,1,1,1,1,1,1,1,1,1,1,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1])

## compare Naive Bayes and Logistic Regression ##

# Fit Naive Bayes
gnb = GaussianNB()
gnb.fit(all_points, labels)

# make NB predictions and plot
x1_mesh, x2_mesh = np.meshgrid(np.arange(0,11,1), np.arange(0,11,1))
Y_NB = gnb.predict_proba(np.c_[x1_mesh.ravel(), x2_mesh.ravel()])[:,1]
Y_NB = Y_NB.reshape(x1_mesh.shape)

fig1, axes = plt.subplots(1,2, figsize=(10,4))

axes[0].contourf(x1_mesh, x2_mesh, Y_NB, levels=(np.linspace(0,1.1,3)), \
cmap='RdBu')
axes[0].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[0].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[0].set_xlim([0,10]); axes[0].set_ylim([0,10]); axes[0].set_xlabel('X1')
axes[0].set_ylabel('X2'); axes[0].set_title('Naive Bayes')
#plt.legend(['Positive Points', 'Negative Points'], scatterpoints=1)
#.savefig('NB_classification.png', bbox_inches='tight')

# Fit Logistic Regression
lr = LogisticRegression()
lr.fit(all_points, labels)

# Make predictions and plot
Y_LR = lr.predict_proba(np.c_[x1_mesh.ravel(), x2_mesh.ravel()])[:,1]
Y_LR = Y_LR.reshape(x1_mesh.shape)

axes[1].contourf(x1_mesh, x2_mesh, Y_LR, levels=(np.linspace(0,1.1,3)), \
cmap='RdBu')
axes[1].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[1].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[1].set_xlim([0,10]); axes[1].set_ylim([0,10]); axes[1].set_xlabel('X1');
axes[1].set_ylabel('X2'); axes[1].set_title("Logistic Regression")
plt.savefig('compare_classification.png', bbox_inches='tight')



Figure 2: Example classification with Gaussian Naive Bayes (left) and Logistic regression. Blue shaded areas represent a prediction of positive labels for the data points, the red shaded areas represent predictions of negative labels.

Figure 2 illustrates an important difference in the treatment of outliers between the two classifiers. Gaussian Naive Bayes assumes that points close to the centroid of class are likely to be members of that class, which leads it to mislabel positive training points with features (3,3), (4,4) and (5,5). Logistic regression on the other hand is only concerned with correctly classifying points, so the signal from the outliers is more influential on its classification.

So which algorithm should you use? The answer, as usual, is that it depends. In this example, logistic regression is able to correctly classify the outliers with positive labels while Naïve Bayes is not. If these points are indeed an indicator of the underlying structure of positive points, then logistic regression has performed better. On the other hand, if they are truly outliers, than Naïve Bayes has performed better. In general, Logistic Regression has been found to outperform Naïve Bayes on large data sets but is prone to over fit small data sets. The two algorithms will converge asymptotically if the Naïve Bayes assumption holds.

## Visualizing P(y|x)

One advantage to these methods for classification is that they provide estimates of P(y|x), whereas other methods such as SVM only provide a separating hyperplane. These probabilities can be useful in decision making contexts such as scenario discover for water resources systems, demonstrated in Quinn et al., 2018. Below, I use scikit-learn to plot the classification probabilities for both algorithms.

# plot Naive Bayes predicted probabilities
fig2, axes = plt.subplots(1,2, figsize=(12,4))
axes[0].contourf(x1_mesh, x2_mesh, Y_NB, levels=(np.linspace(0,1,100)), \
cmap='RdBu')
axes[0].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[0].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[0].set_xlim([0,10]); axes[0].set_ylim([0,10]); axes[0].set_xlabel('X1');
axes[0].set_ylabel('X2'); axes[0].set_title('Naive Bayes')

# plot Logistic Regression redicted probabilities
LRcont = axes[1].contourf(x1_mesh, x2_mesh, Y_LR, levels=(np.linspace(0,1,100)), \
cmap='RdBu')
axes[1].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[1].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[1].set_xlim([0,10]); axes[1].set_ylim([0,10]); axes[1].set_xlabel('X1')
axes[1].set_ylabel('X2'); axes[1].set_title('Logistic Regression')
cb = fig2.colorbar(LRcont, ax=axes.ravel().tolist())
cb.set_label('Probability of Positive Classification')
cb.set_ticks([0, .25, .5, .75, 1])
cb.set_ticklabels(["0", "0.25", "0.5", "0.75", "1.0"])
plt.savefig('compare_probs.png', bbox_inches='tight')



Figure 3: Conditional probabilities P(y|x) generated by Naive Bayes (left) and Logistic Regression.

This post has focused on Gaussian Naive Bayes as it is the direct counterpart of Logistic Regression for continuous data. It’s important to note however, that Naive Bayes frequently used on data with binomial or multinomial features. Examples include spam filters and language classifiers. For more information on Naive Bayes in these context, see these notes from CS 5780.

As mentioned above, logistic regression has been for scenario discovery in water resources systems, for more detail and context see Julie’s blog post.

## References

Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

Course Notes from MIT: https://alliance.seas.upenn.edu/~cis520/wiki/index.php?n=Lectures.Logistic

Course Notes from Cornell: http://www.cs.cornell.edu/courses/cs4780/2018fa/syllabus/index.html

Quinn, J. D., Reed, P. M., Giuliani, M., Castelletti, A., Oyler, J. W., & Nicholas, R. E. (2018). Exploring how changing monsoonal dynamics and human pressures challenge multireservoir management for flood protection, hydropower production, and agricultural water supplyWater Resources Research54, 4638–4662. https://doi.org/10.1029/2018WR022743

# Magnitude-varying sensitivity analysis and visualization (Part 2)

In my last post, I talked about producing these flow-duration-curve-type figures for an output time-series one might be interested in, and talked about their potential use in an exploratory approach for the purpose of robust decision making. Again, the codes to perform the analysis and visualization are in this Github repository.

Fig. 1: Historical data vs. range of experiment outputs

As already discussed, there are multiple benefits for visualizing the output in such manner: we are often concerned with the levels and frequencies of extremes when making decisions about systems (e.g. “how bad is the worst case?”, “how rare is the worst case?”), or we might like to know how often we exceed a certain threshold (e.g. “how many years exceed an annual shortage of 1000 af?“). The various percentiles tell a different part of the story of how a system operates, the 5th percentile tells as that its level is exceeded 95% of the time, the 99th tells as that its level is only reached once in every 100 years in our records. These might seem obvious to the readers of this blog, but often times we perform our analyses for only some of these percentiles, “the worst event”, “the average”, etc., which is certainly very informative, but can potentially miss part of the bigger picture.

In this post I’m going to walk the reader through performing a sensitivity analysis using the output of an experiment using multiple Latin Hypercube Samples. The analysis will be magnitude-varying, i.e., it will be performed at different magnitudes of our output of interest. For this particular example, we aim to see what are the most significant drivers of shortage at the different levels it’s experienced by this user. In other words, if some factors appear to be driving the frequent small shortages experienced, are those factors the same for the rare large shortages?

To perform the sensitivity analysis, I am going to use SALib (featured in this blog multiple times already), to perform a Delta Moment-Independent Analysis [1] (also produces a first order Sobol sensitivity index [2]). You’ll probably need to install SALib if it’s not a package you’ve used already. I’m also going to use statsmodels, to perform a simple linear regression on the outputs and look at their R2 values. But, why, you might ask, perform not one, not two, but three sensitivity analyses for this? There are nuanced, yet potentially important differences between what the three methods capture:

Delta method: Look for parameters most significantly affecting the density function of observed shortages. This method is moment-independent, i.e., it looks at differences in the entire distribution of the output we’re interested in.
First order Sobol (S1): Look for parameters that most significantly affect the variance of observed outputs, including non-linear effects.
R2: Look for parameters best able to describe the variance of observed outputs, limited to linear effects.

Another important thing to note is that using the First order Sobol index, the total variance resulting from the parameters should equal 1. This means that if we sum up the S1’s we get from our analysis, the sum represents the variance described by the first order effects of our parameters, leaving whatever is left to interactions between our variables (that S1 cannot capture). The same holds using R2, as we are repeatedly fitting our parameters and scoring them on how much of the output variance they describe as a sole linear predictor (with no interactions or other relationships).

The following Python script will produce all three as well as confidence intervals for the Delta index and S1. The script essentially loops through all percentiles in the time-series and performs the two analyses for each one. In other words, we’re are looking at how sensitive each magnitude percentile is to each of the sampled parameters.

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters
 import numpy as np import pandas as pd import statsmodels.api as sm from SALib.analyze import delta # Load parameter samples LHsamples = np.loadtxt('./LHsamples.txt') params_no = len(LHsamples[0,:]) param_bounds=np.loadtxt('./uncertain_params.txt', usecols=(1,2)) # Parameter names param_names=['IWRmultiplier','RESloss','TBDmultiplier','M_Imultiplier', 'Shoshone','ENVflows','EVAdelta','XBM_mu0','XBM_sigma0', 'XBM_mu1','XBM_sigma1','XBM_p00','XBM_p11'] # Define problem class problem = { 'num_vars': params_no, 'names': param_names, 'bounds': param_bounds.tolist() } # Percentiles for analysis to loop over percentiles = np.arange(0,100) # Function to fit regression with Ordinary Least Squares using statsmodels def fitOLS(dta, predictors): # concatenate intercept column of 1s dta['Intercept'] = np.ones(np.shape(dta)[0]) # get columns of predictors cols = dta.columns.tolist()[-1:] + predictors #fit OLS regression ols = sm.OLS(dta['Shortage'], dta[cols]) result = ols.fit() return result # Create empty dataframes to store results DELTA = pd.DataFrame(np.zeros((params_no, len(percentiles))), columns = percentiles) DELTA_conf = pd.DataFrame(np.zeros((params_no, len(percentiles))), columns = percentiles) S1 = pd.DataFrame(np.zeros((params_no, len(percentiles))), columns = percentiles) S1_conf = pd.DataFrame(np.zeros((params_no, len(percentiles))), columns = percentiles) R2_scores = pd.DataFrame(np.zeros((params_no, len(percentiles))), columns = percentiles) DELTA.index=DELTA_conf.index=S1.index=S1_conf.index = R2_scores.index = param_names # Read in experiment data expData = np.loadtxt('./experiment_data.txt') # Identify magnitude at each percentiles syn_magnitude = np.zeros([len(percentiles),len(LHsamples[:,0])]) for j in range(len(LHsamples[:,0])): syn_magnitude[:,j]=[np.percentile(expData[:,j], i) for i in percentiles] # Delta Method analysis for i in range(len(percentiles)): if syn_magnitude[i,:].any(): try: result= delta.analyze(problem, LHsamples, syn_magnitude[i,:], print_to_console=False, num_resamples=2) DELTA[percentiles[i]]= result['delta'] DELTA_conf[percentiles[i]] = result['delta_conf'] S1[percentiles[i]]=result['S1'] S1_conf[percentiles[i]]=result['S1_conf'] except: pass S1.to_csv('./S1_scores.csv') S1_conf.to_csv('./S1_conf_scores.csv') DELTA.to_csv('./DELTA_scores.csv') DELTA_conf.to_csv('./DELTA_conf_scores.csv') # OLS regression analysis dta = pd.DataFrame(data = LHsamples, columns=param_names) # fig = plt.figure() for i in range(len(percentiles)): shortage = np.zeros(len(LHsamples[:,0])) for k in range(len(LHsamples[:,0])): shortage[k]=syn_magnitude[i,k] dta['Shortage']=shortage for m in range(params_no): predictors = dta.columns.tolist()[m😦m+1)] result = fitOLS(dta, predictors) R2_scores.at[param_names[m],percentiles[i]]=result.rsquared R2_scores.to_csv('./R2_scores.csv')

The script produces the sensitivity analysis indices for each magnitude percentile and stores them as .csv files.

I will now present a way of visualizing these outputs, using the curves from Fig. 1 as context.  The code below reads in the values for each sensitivity index, normalizes them to the range of magnitude at each percentile, and then plots them using matplotlib’s stackplot fuction, which stacks the contribution of each parameter to the sum (in this case the maximum of the resulting range)

I’ll go through what the code does in more detail:

First, we take the range boundaries (globalmax and globalmin) which give us the max and min values for each percentile. We then read in the values for each sensitivity index and normalize them to that range (i.e. globalmaxglobalmin for each percentile). The script also adds two more arrays (rows in the pandas dataframe), one representing interaction and one representing the globalmin, upon which we’re going to stack the rest of the values. [Note: This is a bit of a roundabout way of getting the figures how we like them, but it’s essentially creating a pseudo-stack for the globalmin, that we’re plotting in white.]

The interaction array is only used when normalizing the S1 and R2 values, where we attribute to it the difference between 1 and the sum of the calculated indices (i.e. we’re attributing the rest to interaction between the parameters). We don’t need to do this for the delta method indices (if you run the code the array remains empty), but the reason I had to put it there was to make it simpler to create labels and a single legend later.

The plotting simply creates three subplots and for each one uses stackplot to plot the normalized values and then the edges in black. It is important to note that the colorblocks in each figure do not represent the volume of shortage attributed to each parameter at each percentile, but rather the contribution of each parameter to the change in the metric, namely, the density distribution (Delta Method), and the variance (S1 and R2). The code for this visualization is provided at the bottom of the post.

Fig. 2: Magnitude sensitivity curves using three sensitivity indeces

The first thing that pops out from this figure is the large blob of peach, which represents the irrigation demand multiplier in our experiment. The user of interest here was an irrigation user, which would suggest that their shortages are primarily driven by increases in their own demands and of other irrigation users. This is important, because irrigation demand is an uncertainty for which we could potentially have direct or indirect control over, e.g. through conservation efforts.

Looking at the other factors, performing the analysis in a magnitude-varying manner, allowed us to explore the vulnerabilities of this metric across its different levels. For example, dark blue and dark green represent the mean flow of dry and wet years, respectively. Across the three figures we can see that the contribution of mean wet-year flow is larger in the low-magnitude percentiles (left hand side) and diminishes as we move towards the larger-magnitude percentiles.

Another thing that I thought was interesting to note was the difference between the S1 and the R2 plots. They are both variance-based metrics, with R2 limited to linear effects in this case. In this particular case, the plots are fairly similar which would suggest that a lot of the parameter effects on the output variance are linear. Larger differences between the two would point to non-linearities between changes in parameter values and the output.

The code to produce Fig. 2:

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters
 # Percentiles for analysis to loop over percentiles = np.arange(0,100) # Estimate upper and lower bounds globalmax = [np.percentile(np.max(expData_sort[:,:],1),p) for p in percentiles] globalmin = [np.percentile(np.min(expData_sort[:,:],1),p) for p in percentiles] delta_values = pd.read_csv('./DELTA_scores.csv') delta_values.set_index(list(delta_values)[0],inplace=True) delta_values = delta_values.clip(lower=0) bottom_row = pd.DataFrame(data=np.array([np.zeros(100)]), index= ['Interaction'], columns=list(delta_values.columns.values)) top_row = pd.DataFrame(data=np.array([globalmin]), index= ['Min'], columns=list(delta_values.columns.values)) delta_values = pd.concat([top_row,delta_values.loc[:],bottom_row]) for p in range(len(percentiles)): total = np.sum(delta_values[str(percentiles[p])])-delta_values.at['Min',str(percentiles[p])] if total!=0: for param in param_names: value = (globalmax[p]-globalmin[p])*delta_values.at[param,str(percentiles[p])]/total delta_values.set_value(param,str(percentiles[p]),value) delta_values = delta_values.round(decimals = 2) delta_values_to_plot = delta_values.values.tolist() S1_values = pd.read_csv('./S1_scores.csv') S1_values.set_index(list(S1_values)[0],inplace=True) S1_values = S1_values.clip(lower=0) bottom_row = pd.DataFrame(data=np.array([np.zeros(100)]), index= ['Interaction'], columns=list(S1_values.columns.values)) top_row = pd.DataFrame(data=np.array([globalmin]), index= ['Min'], columns=list(S1_values.columns.values)) S1_values = pd.concat([top_row,S1_values.loc[:],bottom_row]) for p in range(len(percentiles)): total = np.sum(S1_values[str(percentiles[p])])-S1_values.at['Min',str(percentiles[p])] if total!=0: diff = 1-total S1_values.set_value('Interaction',str(percentiles[p]),diff) for param in param_names+['Interaction']: value = (globalmax[p]-globalmin[p])*S1_values.at[param,str(percentiles[p])] S1_values.set_value(param,str(percentiles[p]),value) S1_values = S1_values.round(decimals = 2) S1_values_to_plot = S1_values.values.tolist() R2_values = pd.read_csv('./R2_scores.csv') R2_values.set_index(list(R2_values)[0],inplace=True) R2_values = R2_values.clip(lower=0) bottom_row = pd.DataFrame(data=np.array([np.zeros(100)]), index= ['Interaction'], columns=list(R2_values.columns.values)) top_row = pd.DataFrame(data=np.array([globalmin]), index= ['Min'], columns=list(R2_values.columns.values)) R2_values = pd.concat([top_row,R2_values.loc[:],bottom_row]) for p in range(len(percentiles)): total = np.sum(R2_values[str(percentiles[p])])-R2_values.at['Min',str(percentiles[p])] if total!=0: diff = 1-total R2_values.set_value('Interaction',str(percentiles[p]),diff) for param in param_names+['Interaction']: value = (globalmax[p]-globalmin[p])*R2_values.at[param,str(percentiles[p])] R2_values.set_value(param,str(percentiles[p]),value) R2_values = R2_values.round(decimals = 2) R2_values_to_plot = R2_values.values.tolist() color_list = ["white", "#F18670", "#E24D3F", "#CF233E", "#681E33", "#676572", "#F3BE22", "#59DEBA", "#14015C", "#DAF8A3", "#0B7A0A", "#F8FFA2", "#578DC0", "#4E4AD8", "#F77632"] fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(14.5,8)) ax1.stackplot(percentiles, delta_values_to_plot, colors = color_list, labels=parameter_names_long) l1 = ax1.plot(percentiles, globalmax, color='black', linewidth=2) l2 = ax1.plot(percentiles, globalmin, color='black', linewidth=2) ax1.set_title("Delta index") ax1.set_xlim(0,100) ax2.stackplot(np.arange(0,100), S1_values_to_plot, colors = color_list, labels=parameter_names_long) ax2.plot(percentiles, globalmax, color='black', linewidth=2) ax2.plot(percentiles, globalmin, color='black', linewidth=2) ax2.set_title("S1") ax2.set_xlim(0,100) ax3.stackplot(np.arange(0,100), R2_values_to_plot, colors = color_list, labels=parameter_names_long) ax3.plot(percentiles, globalmax, color='black', linewidth=2) ax3.plot(percentiles, globalmin, color='black', linewidth=2) ax3.set_title("R^2") ax3.set_xlim(0,100) handles, labels = ax3.get_legend_handles_labels() ax1.set_ylabel('Annual shortage (af)', fontsize=12) ax2.set_xlabel('Shortage magnitude percentile', fontsize=12) ax1.legend((l1), ('Global ensemble',), fontsize=10, loc='upper left') fig.legend(handles[1:], labels[1:], fontsize=10, loc='lower center',ncol = 5) plt.subplots_adjust(bottom=0.2) fig.savefig('./experiment_sensitivity_curves.png')

References:

[1]: Borgonovo, E. “A New Uncertainty Importance Measure.” Reliability Engineering & System Safety 92, no. 6 (June 1, 2007): 771–84. https://doi.org/10.1016/j.ress.2006.04.015.

[2]: Sobol, I. M. (2001). “Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates.” Mathematics and Computers in Simulation, 55(1-3):271-280, doi:10.1016/S0378-4754(00)00270-6.

# Magnitude-varying sensitivity analysis and visualization (Part 1)

Various posts have discussed sensitivity analysis and techniques in this blog before. The purpose of this post is to show an application of the methods and demonstrate how they can be used in an exploratory manner, for the purposes of robust decision making (RDM). RDM aims to evaluate the performance of a policy/strategy/management plan over an ensemble of deeply uncertain parameter combinations – commonly referred to as “states of the world” (SOWs) – and then identify the policies that are most robust to those uncertainties. Most importantly, this process allows the decision maker to examine the implications of their assumptions about the world (or how it will unfold) on their candidate strategies [1].

This is Part 1 of a two part post. In this first post, I’ll introduce the types of figures I’ll be talking about, and some visualization code. In the second post (up in a couple days), I’ll discuss sensitivity analysis for the system as well as some visuals. All the code and data to produce the figures below can be found in this repository.

Now assume the performance of a system is described by a time-series, produced by our model as an output. This might be a streamflow we care about, reservoir releases, nutrient loading, or any type of time-series produced by a run of our model. For the purposes of this example, I’ll use a time-series from the system I’ve been working on, which represents historical shortages for an agricultural user.

Fig. 1: Historical data in series

We can sort and rank these data, in the style of a flow duration curve, which would allow us to easily see, levels for median shortage (50th percentile), worst (99th), etc. The reasons one might care about these things (instead of, say, just looking at the mean, or at the time series as presented in Fig. 1) are multiple : we are often concerned with the levels and frequencies of our extremes when making decisions about systems (e.g. “how bad is the worst case?”, “how rare is the worst case?”), we might like to know how often we exceed a certain threshold (e.g. “how many years exceed an annual shortage of 1000 af?“), or, simply, maintain the distributional information of the series we care about in an easily interpretable format.

Fig. 2: Historical data sorted by percentile

For the purposes of an exploratory experiment, we would like to see how this time-series of model output might change under different conditions (or SOWs). There are multiple ways one might go about this [2], and in this study we sampled a broad range of parameters that we thought would potentially affect the system using Latin Hypercube Sampling [3], producing 1000 parameter combinations. We then re-simulated the system and saved all equivalent outputs for this time-series. We would like to see how this output changes under all the sampled runs.

Fig. 3: Historical data vs. experiment outputs (under 1000 SOWs)

Another way of visualizing this information, if we’re not interested in seeing all the individual lines, is to look at the range of outputs. To produce Fig. 4, I used the fill_between function in matplotlib, filling between the max and min values at each percentile level.

Fig. 4: Historical data vs. range of experiment outputs

By looking at the individual lines or the range, there’s one piece of potentially valuable information we just missed. We have little to no idea of what the density of outputs is within our experiment. We can see the max and min range, the lines thinning out at the edges, but it’s very difficult to infer any density of output within our samples. To address this, I’ve written a little function that loops through 10 frequency levels (you can also think of them as percentiles) and uses the fill_between function again. The only tricky thing to figure out was how to appropriately represent each layer of increasing opacity in the legend – they are all the same color and transparency, but become darker as they’re overlaid. I pulled two tricks for this. First, I needed a function that calculates the custom alpha, or the transparency, as it is not cumulative in matplotlib (e.g., two objects with transparency 0.2 together will appear as a single object with transparency 0.36).

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters
 def alpha(i, base=0.2): l = lambda x: x+base-x*base ar = [l(0)] for j in range(i): ar.append(l(ar[-1])) return ar[-1]
view raw alpha.py hosted with ❤ by GitHub

Second, I needed proxy artists representing the color at each layer. These are the handles in the code below, produced with every loop iteration.

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters
 handles = [] labels=[] fig = plt.figure() ax=fig.add_subplot(1,1,1) for i in range(len(p)): ax.fill_between(P, np.min(expData_sort[:,:],1), np.percentile(expData_sort[:,:], p[i], axis=1), color='#4286f4', alpha = 0.1) ax.plot(P, np.percentile(expData_sort[:,:], p[i], axis=1), linewidth=0.5, color='#4286f4', alpha = 0.3) handle = matplotlib.patches.Rectangle((0,0),1,1, color='#4286f4', alpha=alpha(i, base=0.1)) handles.append(handle) label = "{:.0f} %".format(100-p[i]) labels.append(label) ax.plot(P,hist_sort, c='black', linewidth=2, label='Historical record') ax.set_xlim(0,100) ax.legend(handles=handles, labels=labels, framealpha=1, fontsize=8, loc='upper left', title='Frequency in experiment',ncol=2) ax.set_xlabel('Shortage magnitude percentile', fontsize=12) plt.savefig('experiment_data_density.png')
view raw plot.py hosted with ❤ by GitHub

Fig. 5: Historical data vs. frequency of experiment outputs

This allows us to draw some conclusions about how events of different magnitudes/frequencies shift under the SOWs we evaluated. For this particular case, it seems that high frequency, small shortages (left hand side) are becoming smaller and/or less frequent, whereas low frequency, large shortages (right hand side) are becoming larger and/or more frequent. Of course, the probabilistic inference here depends on the samples we chose, but it serves the exploratory purposes of this analysis.

References:

[1]: Bryant, Benjamin P., and Robert J. Lempert. “Thinking inside the Box: A Participatory, Computer-Assisted Approach to Scenario Discovery.” Technological Forecasting and Social Change 77, no. 1 (January 1, 2010): 34–49. https://doi.org/10.1016/j.techfore.2009.08.002.

[2]: Herman, Jonathan D., Patrick M. Reed, Harrison B. Zeff, and Gregory W. Characklis. “How Should Robustness Be Defined for Water Systems Planning under Change?” Journal of Water Resources Planning and Management 141, no. 10 (2015): 4015012. https://doi.org/10.1061/(ASCE)WR.1943-5452.0000509.

[3]: McKay, M. D., R. J. Beckman, and W. J. Conover. “A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code.” Technometrics 21, no. 2 (1979): 239–45. https://doi.org/10.2307/1268522.

# Introduction to Borg Operators Part 1: Simplex Crossover (SPX)

I have always found the connection between Genetic Algorithms and Biological Evolution very interesting. In many ways, Genetic Algorithms like Borg, emulate Biological Evolution. They start with a random population of potential solutions, the best fit of these solutions are selected to produce the future generation. During the production of offspring, crossover and mutation occur, which leads to more variation in the future generation. Strong (Non-dominated) offspring join the larger population and replace some of the previous weaker solutions (which are now dominated), and the process is repeated until some final condition is met.

In this post, I will briefly discuss Borg’s operators, and then dive deeper into Simplex Crossover (SPX) by Tsutsui et Al. (1999). This is the first of a series of posts in which I will discuss the Borg operators.

## General Overview of Borg Operators

Borg uses the six operators shown in the figure 1 below to achieve variation in its solutions. In the figure, the larger dots represent the parent solutions, while the smaller dots represent the offspring solutions (Hadka & Reed, 2013).

The question is how does Borg know which operators are best suited for a new problem? Borg auto-adaptively selects operators to based on their performance to aid the search. Based on section 3.4 in Hadka and Reed, 2013: In the first iteration in Borg assigns an equal probability for using each operator. For K operators, this probability is 1/K. With each iteration, these probabilities are updated, so that the operators that produced more solutions in the epsilon box archive, are assigned higher probabilities, while those with less solutions in the epsilon box dominance archive are assigned a lower probability. Hadka and Reed(2013) use the following weighted average equation to obtain this behavior:

Where:

• K is the number of operators used
• Qi ∈ {Q1,…, Qk} is the probability that an operator is used to produce offspring in the next generation. and is initialized at 1/K
• Ci ∈ {C1,…, Ck} is the number of solutions in the epsilon box dominance archive produced by each operator i.
• Sigma is a positive number that prevents any of the operators from reaching a probability of zero.

Figure 1 (Hadka & Reed, 2013)

Notice in the bottom three operators in Figure 1 show some similarity: the parents form a simplex (which is a triangle in 2 dimensions). However, the way the offspring are distributed around the parents is different. In Parent Centric Crossover, the offspring form around the parent solutions, in Unimodal Normal Distribution Crossover, the offspring are normally distributed around the center of mass of the three parents, and in Simplex Crossover, the solutions are uniformly distributed inside a simplex that is larger than the simplex created by the three parents (Deb et Al., 2002).

## Simplex Crossover (SPX)

Discussion based on Tsutsui et Al., 1999

Simplex Crossover (SPX) is a recombination operator that can use more than two parent solutions to generate offspring. The method involves expanding the simplex formed by the parent solutions by a factor of epsilon, and then sampling offspring from the resulting expanded simplex. The expansion of the simplex is centered around the center of mass of the parent solution vectors, and therefore independent of the coordinate system used. The number of parents must be at least 2 and less than or equal to the number of parameters plus one. That is:

Where:

• m: is the number of parents
• n: is the number of parameters

The SPX is denoted as SPX-n-m-ε.

Tsutsui et Al., 1999 states that for low dimensional functions, SPX works best with 3 parent solution vectors, and for high dimensional functions SPX works best with 4 parent solution vectors.

### The 3 Parent, 2 Parameter Case

It is easiest to see this in the 2-dimensional three parent, 2 parameter case, i.e. SPX-2-3-ε. This can be done in four steps, referring to figure 2:

Figure 2 (Tsutsui et Al., 1999)

1. Let X1, X2, and X3 be the three parent solution vectors. Each of these vectors has two parameters (it’s x and y coordinates). Calculate the center of mass, O, of the parents by computing the average of their two parameters:
2. Expand the simplex by epsilon at every vertex:
3. Produce offspring by using a uniform distribution to sample inside new expanded simplex defined in step 3.

You can see this in Python using the code below:


import numpy as np
import random

def SPX_2D(X1, X2, X3, epsilon, n_offspring=1, m=2):
# m is the number of parameters in each parent vector
# X1, X2, X3 are the parent vectors as np.arrays, each with m parameters (elements) i.e. len(Xi)=m
# epsilon is a value greated that zero by which the simplex is expanded
# n_offspring is the number of offspring you would like to produce

# Step 0: apply checks and inform the user if the input is wrong
if m<2:
print("error: at least two parameters needed for 2D SPX")
elif len(X1)!= len(X2) | len(X1)!=len(X3) | len(X1)!=len(X2):
print("error: the number of parameters in the parent vectors don't match")
elif len(X1)!=m:
print("error: the number of parameters in the parent vectors is not equal to m")

# if the checks pass, the function can proceed
else:
# Step 1: find the center of mass (CoM) of the three parents
CoM = 1/3 * (X1 + X2 + X3)

# Step 2: Define the vertices (Vi) of the simplex by expanding the simplex in the direction of Xi-CoM by (1+epsilon)
# note that the equation here is slightly different from the one in the paper, since the one in the paper produces the vector
# that translates the center of mass to the new vertices, and so the vectors need to be added to the center of mass coordinates
# to produce the new vertices.
V1=CoM+(X1-CoM)*(1+epsilon)
V2=CoM+(X2-CoM)*(1+epsilon)
V3=CoM+(X3-CoM)*(1+epsilon)

# Step 3: Produce offspring by sampling n_offspring points from the new simplex defined in step 3 using a uniform distribution
offspring = [point_on_triangle(V1, V2, V3) for _ in range(n_offspring)]

return offspring, V1, V2, V3, CoM

#########################################################################################################################

# Point_on_triangle function
# source: https://stackoverflow.com/questions/47410054/generate-random-locations-within-a-triangular-domain
# Solution by Mark Dickinson (Nov 21, 2017)

def point_on_triangle(pt1, pt2, pt3):

# pt1, pt2, pt3 are the vertices of a triangle

# find two random numbers s and t, note that random produces a float between 0 and 1 using a uniform distribution
s, t = sorted([random.random(), random.random()])

# use s & t to calculate a weighted average of each coordinate. This will produce the offspring vector
return (s * pt1[0] + (t-s)*pt2[0] + (1-t)*pt3[0],
s * pt1[1] + (t-s)*pt2[1] + (1-t)*pt3[1])

#########################################################################################################################

# Let's try an example
X1=np.array([-2,2])
X2=np.array([4,2])
X3=np.array([1,6])
epsilon=0.3
m=2
n_offspring=1000

offspring, V1, V2, V3, CoM = SPX_2D(X1, X2, X3, epsilon, n_offspring, m)

# Finally, you can plot the parents and offspring to produce Figure 3

import matplotlib.pyplot as plt

# Plot the parents
x1, y1 = zip(X1, X2, X3)
plt.scatter(x1, y1, s=50, label='Parents')

# Plot the center of mass
x2, y2 = zip(CoM)
plt.scatter(x2, y2, s=150, label='Center of Mass')

# Plot the expanded simplex
x3, y3 = zip(V1, V2, V3)
plt.scatter(x3, y3, s=100, label='Simplex Vertices')

# Plot the offspring
x4, y4 = zip(*offspring)
plt.scatter(x4, y4, s=0.05, label='Offspring')

plt.legend()

plt.show()



Figure 3

### General Case

The general case is denoted as SPX-n-m-ε, with n parameters and n parents.

Each parent solution vector is:

These vectors are in R^n.

Thinking in terms of Biology, this parent vector mimics a chromosome, with m parameters as its different traits.

The general case can be outlined in four steps:

1) Parent vectors are selected from the population pool

2)  R^n is space is then divided as follows:Divide R^n into h sets of R^m-1 spaces and one R^q space. I found this easier when I thought of it in the context of a chromosome, i.e. large parent vector of length n, being divided into smaller sub-vectors of length m-1, and a remainder vector or length q. See figure 4 below:

Figure 4

3) In each R^m-1 space, the m parent vectors are used to generate offspring using the expanded simplex as described in the 2-dimensional, 3 parent, 2 parameters case. Again, using the chromosome analogy, figure 5 shows how this can be depicted for m=3 parents and for example, n=9 parameters. R^9 is divided into four (h=integer(n/(m-1)=4) R^2 (m-1=2) spaces and one R^1 space (q=remainder(n/(m-1)=1).

Figure 5

4) The sets of offspring in R^m-1 produced in step 3 together with the vector in R^q (which remains unchanged), are then combined in their correct positions to form an offspring vector in R^n.

The code for the general SPX case can be found in David Hadka’s github.