Welcome to our blog!

Welcome to Water Programming! This blog is by Pat Reed’s group at Cornell, who use computer programs to solve problems — Multiobjective Evolutionary Algorithms (MOEAs), simulation models, visualization, and other techniques. Use the search feature and categories on the right panel to find topics of interest. Feel free to comment, and contact us if you want to contribute posts.

To find software:  Please consult the Pat Reed group website, MOEAFramework.org, and BorgMOEA.org.

The MOEAFramework Setup Guide: A detailed guide is now available. The focus of the document is connecting an optimization problem written in C/C++ to MOEAFramework, which is written in Java.

The Borg MOEA Guide: We are currently writing a tutorial on how to use the C version of the Borg MOEA, which is being released to researchers here.

Call for contributors: We want this to be a community resource to share tips and tricks. Are you interested in contributing? Please email dfg42 “at” cornell.edu. You’ll need a WordPress.com account.

Data Augmentation for Time Series Application

This post is meant to be an introduction to the concept of data augmentation. Data augmentation is the process of increasing the size of your dataset through small modifications to the original data. In instances where data availability is limited (basically every ML application), this technique is especially useful for creating more training data, which can lead to a more robust model that is less susceptible to overfitting. Let's begin with an example that demonstrates why data augmentation is useful in image classification. Imagine that you have trained your model to distinguish between images of cats and dogs. The figure on the left is of a very good boy named Lincoln, and this image resides in the training set. Let's suppose that the image in the middle is in the test set. To humans, this is very clearly Lincoln (and a dog) once again, but if the algorithm hasn't seen many images of dogs in this position, there is a chance that it won't classify this image correctly and may instead decide that Lincoln looks more like the cat in the training set that has a similar orientation and ears.

Lincoln, Lincoln, and….Lincoln? (Cat stock image from here)

However, if I were to augment my original image in the training set by rotating, scaling, and shifting it, as shown below, perhaps my model would be more likely to classify Lincoln correctly as a dog having been trained on these variations. Various studies have demonstrated the benefits of this augmentation in image processing applications.

Augmentations of original image that create new training data

This is a very simple example to demonstrate that limited data availability need not preclude the ability to make robust predictions. It is not a far stretch to wonder how data augmentation might be utilized for regression-based prediction problems, especially in the water resources field, where we have limited data. In particular, it is hard for us to predict extremes because we have so few data points to characterize them. This style of problem is inherently more complicated than classification because time series have a temporal structure and are connected to underlying (sometimes physical) relationships. Thus, any augmentation must not completely change the fundamental characteristics of the data. Below are some examples of techniques that could be useful, but these are extremely case-specific and require a strong understanding of the behavior of your system. Before you implement any of these techniques, first make sure to split your data into training and test sets. Then feel free to add variations to the training set and test them out!

Block Bootstrapping

Bootstrapping (sampling with replacement) single points from your dataset can only be done if each point is independent. This is not the case with time series data, which has a temporal structure. Thus, it is more appropriate to utilize block bootstrapping. This technique involves resampling blocks of contiguous data from the original training data to make a new training dataset. By using large contiguous blocks, we preserve the inherent structure in the data while allowing our algorithm to see new data (the original data in a new order).
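Below is a minimal Python sketch (not from the original post) of a simple block bootstrap for a one-dimensional time series; the array name series, the block size, and the random seed are all illustrative assumptions.

import numpy as np

def block_bootstrap(series, block_size=30, n_samples=None, seed=None):
    # Resample contiguous blocks of the training series (with replacement)
    # until we have at least n_samples points, then trim to length
    rng = np.random.default_rng(seed)
    n_samples = n_samples or len(series)
    blocks = []
    total = 0
    while total < n_samples:
        start = rng.integers(0, len(series) - block_size + 1)
        blocks.append(series[start:start + block_size])  # contiguous block preserves temporal structure
        total += block_size
    return np.concatenate(blocks)[:n_samples]

# Example: build a new training series the same length as the original
# new_train = block_bootstrap(original_train, block_size=60, seed=1)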

Jittering with Noise

A small sample size doesn’t give us the opportunity to map out the rich input-output space that characterizes our system. Often, adding a little bit of random noise to your training data can help expand your understanding of the space. If your system exhibits highly non-linear behavior, you have to be extra careful that the noise that you are adding is realistic. For example, in a rainfall-runoff model, the fluctuations of temperature and precipitation are very different. Small changes in precipitation can result in very large overall streamflow changes, whereas temperature often fluctuates widely during the day with very little effect on streamflow. Therefore, adding the same amount of noise to each feature and the output may not make sense. It is a non-trivial effort, but it could be interesting to determine how to appropriately add noise to features that exhibit different behavior.
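As a minimal sketch (with assumed array names), the snippet below adds noise scaled to each feature's own variability, so that a precipitation column and a temperature column in X_train are not perturbed by the same amount.

import numpy as np

def jitter(X_train, noise_fraction=0.05, seed=None):
    # Scale the noise for each feature (column) by that feature's standard deviation
    rng = np.random.default_rng(seed)
    sigma = noise_fraction * X_train.std(axis=0)
    return X_train + rng.normal(0.0, 1.0, X_train.shape) * sigma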

Interpolation

If you want to augment training data that has clear trends, interpolation between data points can be a viable option that won't distort those trends. However, using a linear interpolation method imposes the assumption that your data behave linearly; for instance, that a linear change in precipitation and temperature leads to a linear change in streamflow. This is likely not the case, so interpolation may not be a useful data augmentation technique for a rainfall-runoff regression-based model. However, interpolation could be useful in a less sensitive classification-based model.
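The sketch below (assuming a 2-D input array X_train and a 1-D output y_train ordered in time) creates synthetic samples by linearly interpolating between consecutive observations; whether this is appropriate depends entirely on how linear your system actually is.

import numpy as np

def interpolate_midpoints(X_train, y_train):
    # Midpoints between consecutive rows become new synthetic samples
    X_mid = 0.5 * (X_train[:-1] + X_train[1:])
    y_mid = 0.5 * (y_train[:-1] + y_train[1:])
    return np.vstack([X_train, X_mid]), np.concatenate([y_train, y_mid])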

Decomposition

Decomposition methods generally decompose time series signals by extracting features or underlying patterns from the training data. These features can either be used independently or recombined with noise and the old training data to generate new training data. Decomposition can be performed in either the time or frequency domain. Manifold-based techniques also fall within this category: a study by Forestier et al. (2017) calculates a weighted average that reflects the manifold of the original data and uses it as new data, with considerable success.
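As one possible (hypothetical) example of this idea, the sketch below uses statsmodels' seasonal_decompose to split a streamflow series (a 1-D array or pandas Series named flow) into trend, seasonal, and residual components, perturbs only the residual, and recombines the pieces into a new synthetic series; the period and noise scale are assumptions.

import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose

def augment_by_decomposition(flow, period=365, noise_scale=0.1, seed=None):
    rng = np.random.default_rng(seed)
    decomp = seasonal_decompose(flow, period=period, extrapolate_trend="freq")
    # Perturb only the residual component, then recombine with trend and seasonality
    noisy_resid = decomp.resid + rng.normal(0.0, noise_scale * decomp.resid.std(), len(flow))
    return decomp.trend + decomp.seasonal + noisy_resid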

Implementation

All of these techniques have shown success in very specific time series applications: those related to speech, audio, and gait recognition, and specifically for classification-based models. Very little has been published on regression-based models and the use of data augmentation in the water resources community seems nonexistent.

Below, I implemented one of these techniques using the CNN that I fit in my prior post. The results for the baseline prediction of streamflow are shown in the first panel. Then I tried a data augmentation scenario: I reversed the training set and kept the first 1000 data points (which contain more extremes), and then concatenated these points to the original training set. The new results are shown in the second panel. The CNN trained with the additional augmented data does a much better job of capturing the extremes in the test set, which is often what we are most interested in. There is a lot of work to be done and more complicated methods to explore, but these initial results look interesting and suggest that data augmentation can be useful in our field!
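For reference, a minimal sketch of the augmentation step described above (assuming the training arrays are named X_train and y_train, as in a typical Keras workflow) would look like:

import numpy as np

# Reverse the training set, keep its first 1000 points, and append them to the original data
X_aug = np.concatenate([X_train, X_train[::-1][:1000]])
y_aug = np.concatenate([y_train, y_train[::-1][:1000]])

# model.fit(X_aug, y_aug, ...)  # refit the CNN on the augmented training set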

References

Forestier, G., Petitjean, F., Dau, H. A., Webb, G. I., & Keogh, E. (2017, November). Generating synthetic time series to augment sparse datasets. In 2017 IEEE international conference on data mining (ICDM) (pp. 865-870). IEEE.

Oh, C., Han, S., & Jeong, J. (2020). Time-series data augmentation based on interpolation. Procedia Computer Science, 175, 64-71.

The ma-R-velous tidyverse

I am currently taking a class in environmental economics, which of late seems to be doubling as a crash course in the R tidyverse. This blog post will discuss several R packages and demonstrate the application of several tidyverse sub-packages and commands that I have learned to use through this class. Two datasets will be used: the storms dataset, which contains data on the types of storms that occurred in the United States from 1975 to 2015, and the damages dataset, which is a record of the dollar cost of the damages caused by every tropical storm that occurred from 1975 to 2015. The datasets required for this example can be found in this GitHub repository.

Tidy data and the tidyverse?

‘Tidy’ datasets provide a standardized way to link the physical layout of a dataset with the meaning associated with it [1]. A ‘tidy’ dataset has the following characteristics:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each cell contains only one value of only one data type

On the other hand, the tidyverse is a meta-package that bundles the packages facilitating data ‘tidying’, or data cleaning such that the final form of the dataset is ‘tidy’. You can check the full set of packages bundled into the tidyverse using tidyverse_packages(), which will show you the following output in the console:

Within this set, we will be using the dplyr, tidyr and lubridate packages. The dplyr and tidyr packages are "workhorse" packages whose main purpose is to clean and wrangle large datasets, and they will be the two packages that we use most often in this example. The lubridate package is used to convert strings into dates, which we will use later in this example.

In this example, we will transform a ‘messy’ dataset into a ‘tidy’ one, and perform several other operations that are enabled by the R tidyverse.

The pacman package management tool

No, this is not the Toru Iwatani-created, dot-munching yellow puck released in 1980 by Namco. Instead, the pacman tool is a convenient library and package wrapper that combines the functionality of base library-related functions into intuitively named functions [2]. It was created to improve workflow by reducing the time required in calling and re-calling obscurely named functions. Ideally, it should be called at the beginning of the R script, like so:

# Run this line
if (!require("pacman")) install.packages("pacman")

This enables you to automatically install and load packages throughout the base R libraries and the R tidyverse using the p_load function. For this example, we will only require the tidyverse package. Load this package using the following script:

# Then run p_load to load the necessary packages
pacman::p_load(
 tidyverse
)

As previously mentioned, this automatically loads the dplyr, tidyr and lubridate packages that we will need for this example.

Before beginning, make sure that you are running the correct versions of dplyr and tidyr. Check the package versions using the packageVersion('package_name') function. All packages should be at least version 1.0.0.

Working with the datasets

Some key tidyr verbs

With the explanations out of the way, we begin by loading our two datasets:

# Load dataset
storms <- read.csv("storms.csv")
damages <- read.csv("damages.csv")

Note that the damages dataset is in an inconvenient 'wide' shape. To convert damages into 'tidy' format, we use pivot_longer(). pivot_longer() is one of the four key tidyr verbs, the remaining three being pivot_wider(), separate() and unite(). These verbs do the following:

  1. pivot_longer(): Vertically stretches the data, increasing the number of rows and decreasing the number of columns
  2. pivot_wider(): Horizontally stretches the data, increasing the number of columns and decreasing the number of rows
  3. separate(): Splits a single character column into multiple columns
  4. unite(): Pastes multiple columns into one

For this example, pivot_longer() is used to turn damages from its original wide shape into a long, 3-column dataframe where each column contains data on the storm’s name, type, and total dollar cost of damages. The code is as follows:

# tidy-up damages
damages <- damages %>% mutate(status = "tropical storm") %>%
  pivot_longer(-status, names_to="name", values_to="damages")

This script first categorizes all the storms within damages as tropical storms, and then assigns the names of each storm to the column 'name' and the cost of their damages to 'damages'. This turns the original dataset:

into the following shape:

This results in a more readable dataset!

It’s a lubridate!

Next, we will revisit our storms dataset. Note that the information regarding the date and time for each storm is inconveniently split across four columns:

We use the as_datetime() function from the lubridate package to combine these columns into one easy-to-interpret column called 'date', using the following script:

# Paste four columns and convert to date-time format
storms <- storms %>% mutate(date=as_datetime(
  paste0(year, "-", month, "-", day, " ", hour, ":00:00")))

This pastes the data in the 'year', 'month', 'day' and 'hour' columns together and inserts the formatted date and time into the newly added 'date' column:

Great! Our storm dataset is now more readable.

Some dplyr verbs and operations

There are five key dplyr verbs, some of which you have already seen:

  1. filter(): Obtains a subset of rows based on their column values
  2. arrange(): Reorders rows (in ascending or descending order) based on their values
  3. select(): Selects columns or variables of interest
  4. mutate(): Creates new columns or variables and automatically appends them to the dataframe
  5. summarize(): Collapses multiple rows into a single summary value, commonly by grouping a pre-selected variable

The dplyr package also comes with a set of join operations, which enable the merging of two dataframes, x and y. The types of operations include:

  1. inner_join(): Matches pairs of observations whenever their values are equal
  2. left_join(): Keeps all observations in x, adding matching information from y
  3. right_join(): Keeps all observations in y, adding matching information from x
  4. full_join(): Keeps all observations that appear in either x or y
  5. semi_join(): Keeps all observations in x that have a match in y
  6. anti_join(): Drops all observations in x that have a match in y

In this example, we will only be using the inner_join() function to merge a subset of storms and the whole of damages.

The verbs can be used to modify datasets using 'pipes', represented in R as %>%. We have previously seen the application of the mutate verb when we added the new columns 'status' and 'date' to the damages and storms datasets, respectively. Now, let's apply the filter() verb to obtain only tropical storm data from storms:

# filter out only the tropical storms
ts <- storms %>% filter(
    stringr::str_detect(status, "tropical storm")
)

Here, we use stringr::str_detect() to detect the string 'tropical storm' within the 'status' column of storms and store the matching rows in the new ts dataframe.

Next, we use the select() function to select all columns but the columns containing the diameter of the hurricane/storm, since this is data that we will not use in this example:

# select a subset of columns
ts <- ts %>% select(!ts_diameter:hu_diameter)

Using ! indicates to the function that you would like to select the complement of the set following the exclamation point. Using : is useful when there are consecutive columns you would like to (de)select. Following this, you will have a pared-down dataset:

Now it’s time to apply our inner_join() operation to merge ts with damages in a new dataframe joined_sd that contains both geophysical information of each tropical storm, as well as its dollar cost of damages.

# joining tropical storms and damages
joined_sd <- inner_join(ts, damages)

where joined_sd has the following form:

Now, we perform some simple operations to obtain the mean dollar cost of damages per windspeed, and arrange each tropical storm in order of decreasing damage per windspeed:

# calculate average cost of damage per windspeed
joined_sd <- joined_sd %>% 
  mutate(damage_per_mph = mean(damages)/wind) %>% 
  arrange(desc(damage_per_mph))

This results in the following final dataset:

From here, you can see that tropical storm Amy resulted in the most expensive damages. Finally, you can also use the summarize() verb to identify the mean of the cost of damages per windspeed of all the tropical storms:

# summarize by mean and sd of cost per windspeed
summary_sd <- joined_sd %>% summarize(
  damage_per_mph=mean(damage_per_mph))

You will find that, on average, each tropical storm incurs USD 11,244,531 in damages per unit windspeed. Plotting the cost of damages with respect to time using

# plot damages wrt time
ggplot(joined_sd, aes(year, damages)) + 
  geom_point() + ylim(min(joined_sd$damages), max(joined_sd$damages)) +
  geom_smooth(method = lm) + ggtitle("Damages ($) with respect to time")

you will find a slight increase in the cost of damages incurred by tropical storms as the years go by.

Conclusion

In this blog post, we walked through two example datasets to demonstrate three sub-packages within the tidyverse: tidyr, lubridate and dplyr. For the full version of this example, please visit the author’s GitHub repository.

References

pacman package – RDocumentation. (2021). Retrieved 14 November 2021, from https://www.rdocumentation.org/packages/pacman/versions/0.5.1

Rudik, I. (2021). AEM 6510 Chapter 10: R and the tidyverse. Presentation.

Tidy data. (2021). Retrieved 14 November 2021, from https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html

Plotting change on maps

or how to replicate the New York Times presidential election shift map

This week's blogpost is a visualization demo replicating a popular map from last year. The map below shows the shift in voter margin between the 2016 and 2020 Presidential Elections by the two major political parties in the United States. The direction and color of each arrow indicate the party, and the length of the arrow indicates the magnitude of the shift. This type of figure can be useful in visualizing many types of spatially distributed changes (e.g., population change in a city, change in GDP per capita, losses and gains). This blogpost shows how to replicate it in Python using commonly used packages.

Screengrab of the original graphic from the NYT website. Original can be found here: https://www.nytimes.com/interactive/2020/11/03/us/elections/results-president.html

Even though the creators of the original provide their 2020 data, their 2016 data is not available so the data I’ll be using came from the MIT Election Data and Science Lab and can be downloaded here: https://doi.org/10.7910/DVN/VOQCHQ. All the code and data to replicate my figure can be found in this repository: https://github.com/antonia-had/election_data_shift

The main packages we’ll be using for this are cartopy and matplotlib to create the map and annotate elements on it, pandas for some simple data analysis and haversine to convert distances on the map (which you might not need if you’re applying the code to a small spatial scale).

First thing we do is load our packages and data. counties.csv contains the latitude and longitude for every county we'll be plotting. countypres_2000-2020.csv contains our downloaded election data. As you can see in the code comments, I had to clean out some of the datapoints due to inconsistencies or errors. I'll also only be plotting the contiguous US to simplify the exercise, but you can definitely include code to also plot Alaska and Hawaii in the same figure.

import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import pandas as pd
import cartopy.io.shapereader as shpreader
from haversine import inverse_haversine, Direction

# Read in county position data
pos_data = pd.read_csv('./data/counties.csv', delimiter=',', index_col=0)

# Read in county election data
# Data from https://doi.org/10.7910/DVN/VOQCHQ
# Data points without county FIPS code removed
all_election_data = pd.read_csv('./data/countypres_2000-2020.csv')
# Filter data to only keep years 2016 and 2020
# Dataset reports issues with Alaska data so filter those out too
# Missing data for 2020 for some counties
# County with FIPS code 46113 was assigned a new FIPS code (46102) which is changed in the downloaded data
mask = (all_election_data['year'] >= 2016) & \
       (all_election_data['state'] != 'ALASKA') &\
       (all_election_data['state'] != 'HAWAII') & \
       (all_election_data['county_fips'] != 11001) & \
       (all_election_data['county_fips'] != 51515) & \
       (all_election_data['county_fips'] != 36000)
election_data = all_election_data[mask]

Next we calculate the percentage of votes each party gained at each election and compare the results between the two elections to calculate their shift. A simplifying assumption here is that we're only focusing on the top two parties (but you could do more, with different color arrows for example). We're also copying the latitude and longitude of each county so everything is in one dataframe.

# Calculate vote percentage per party
election_data['percentagevote'] = election_data['candidatevotes']/election_data['totalvotes'] * 100

# Create new dataframe to store county change results
shift = election_data[['state', 'county_name', 'county_fips']].copy()
# Drop duplicate rows (original dataframe was both 2016 and 2020)
shift = shift.drop_duplicates(['county_fips'])

# Create columns to store change for every party
shift['DEMOCRAT'] = 0.0
shift['REPUBLICAN'] = 0.0

#Create columns for latitude and longitude so everything is in the same dataframe
shift['lat'] = 0.0
shift['lon'] = 0.0

# Iterate through every county and estimate difference in vote share for two major parties
for index, row in shift.iterrows():
    county = row['county_fips']
    for party in ['DEMOCRAT', 'REPUBLICAN']:
        previous_result = election_data.loc[(election_data['year'] == 2016) &
                                            (election_data['county_fips'] == county) &
                                            (election_data['party'] == party)]['percentagevote'].values[0]
        new_result = election_data.loc[(election_data['year'] == 2020) &
                                       (election_data['county_fips'] == county) &
                                       (election_data['party'] == party)]['percentagevote'].values[0]
        # If any of the two results is nan assign zero change
        if pd.isna(new_result) or pd.isna(previous_result):
            shift.at[index, party] = 0
        else:
            shift.at[index, party] = new_result - previous_result
    # Combine lat and long values also so it's all in one dataframe
    shift.at[index, 'lat'] = pos_data.at[county, 'lat']
    shift.at[index, 'lon'] = pos_data.at[county, 'lon']

To create our map we do the following.

Set up matplotlib figure with the map extent of the contiguous United States and use cartopy geometries to add the shapes of all states.

fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(1, 1, 1, projection=ccrs.LambertConformal(), frameon=False)
ax.set_extent([-120, -74, 24, 50], ccrs.PlateCarree())
# Add states shape
shapename = 'admin_1_states_provinces_lakes'
states_shp = shpreader.natural_earth(resolution='110m',
                                     category='cultural', name=shapename)
ax.add_geometries(shpreader.Reader(states_shp).geometries(), ccrs.PlateCarree(),
                  facecolor='#e5e5e5', edgecolor='white', zorder=0)

We then need to determine how the shift should be plotted in each county. A simplifying assumption here is that we're showing the largest positive shift (i.e., if both parties lost votes we're only showing a small grey point). There are several ways to draw an arrow at each point, depending on what you'd like to show and the complexity you're comfortable with. The approach I'm showing here exploits the matplotlib annotate function, typically used to annotate a figure with text and arrows.

The way I'm going about this is a little mischievous but works: I'm only using the arrow component of it with a blank text annotation, and I identify the point each arrow should point to using each county's lat and long and the estimated shift. If this were a simple matplotlib figure using cartesian coordinates, calculating the end point would be simple trigonometry. Since latitude and longitude are not on a cartesian plane, we need to convert them using the haversine formula (or its inverse). It's fairly easy to implement yourself, but since there already exists a handy python package for it, I'm using that instead. The transform function I am using up top is necessary for matplotlib to know how to transform the points from the annotation function (typically not necessary if using, say, ax.scatter()); some explanation of why that is can be found here. The colors and all other customization are done so the figure looks as close as possible to the original.

transform = ccrs.PlateCarree()._as_mpl_transform(ax)
for index, row in shift.iterrows():
    # Determine arrow color
    dem_shift = shift.at[index, 'DEMOCRAT']
    rep_shift = shift.at[index, 'REPUBLICAN']
    # Check if both lost votes, then set arrow to grey
    if dem_shift<0 and rep_shift<0:
        arrow_color = 'grey'
        ax.scatter(shift.at[index, 'lon'], shift.at[index, 'lat'],
                   color=arrow_color, transform=ccrs.PlateCarree(),
                   s=0.5)
    # If at least one of them gained votes
    else:
        if dem_shift >= rep_shift:
            arrow_color = '#1460a8'
            direction = Direction.NORTHWEST
            change = dem_shift
        else:
            arrow_color = '#bb1d2a'
            direction = Direction.NORTHEAST
            change = rep_shift
        end_location = inverse_haversine((shift.at[index, 'lat'], shift.at[index, 'lon']), change*25, direction)[::-1]
        ax.annotate(" ", xytext=(shift.at[index, 'lon'], shift.at[index, 'lat']), xy=end_location,
                    arrowprops=dict(facecolor=arrow_color, edgecolor=arrow_color,
                                    width=0.2, headwidth=3, headlength=5),
                    xycoords=transform, zorder=1)
plt.tight_layout()
plt.savefig('electionshiftmap.png', dpi=300)

The resulting figure looks like this, which I am calling pretty close, considering the dataset differences. Tinkering with colors, widths, lengths and transforms can get you a different look if you’re after that.

Pam agrees:

Easy batch parallelization of code in any language using mpi4py

The simplest form of parallel computing is what's known as "embarrassingly" parallel processes. These processes involve fully independent runs of a model or script where little or no communication is needed across parallel processes. A common example is Monte Carlo evaluation, where we run a model over an ensemble of inputs. To parallelize an embarrassingly parallel application we simply need to send a set of commands to the cluster telling it to run each sample on a different core (or set of cores). For small applications, this can be done by submitting each run individually. For larger applications, SLURM Job Arrays (which are nicely detailed in Antonia's post, here) can efficiently batch large numbers of function calls to independent computing cores. While this method is efficient and effective, I find it can sometimes be hard to keep track of, as you may be submitting tens or hundreds of jobs at a time. An alternative approach to submitting embarrassingly parallel tasks is to utilize MPI with Python to dispatch and organize jobs.

I like the MPI / Python combo because it consolidates all parallel applications into a single job, meaning you have one job to keep track of on a cluster at a time, and one output file generated by the batch set of model runs. I also find Python slightly easier to edit and debug than Bash scripts (which are used to create job arrays). Additionally, it’s very easy to assign each computing core a set of function evaluations to run (this can also be done with Job arrays, but again, I find Python easier to work with). Though Python is the language used to coordinate parallel tasks, we can use it to parallelize code in any language, as I’ll demonstrate below.

In this post I’ll first provide some background on MPI and its Python implementation, mpi4py. Next I’ll provide an example I’ve developed to demonstrate how to batch run a Matlab code on a cluster. The examples presented here are derived from some of Bernardo’s code in his post on Parallel programming in C/C++, which you can find here.

A very light introduction to MPI

MPI stands for "Message Passing Interface" and is the standard library for distributed memory parallelization (for background, see this post). To understand how MPI works, it's helpful to define some of its basic components.

  1. Tasks: I'll use the term task to define a processor (or group of processors) assigned to perform a specific set of instructions. These instructions may be a single evaluation of a function, or a set of function evaluations
  2. Communicators: A communicator is a group of MPI task units that are permitted to communicate with each other. In advanced MPI applications you may have multiple communicators, but for embarrassingly parallel applications we’ll only use one. The default communicator is called “MPI_COMM_WORLD” (I don’t know why, if anyone does please feel free to share in the comments), and that’s what I’ll work with here.
  3. Ranks: Each MPI task is assigned a unique identifier within the communicator called a rank. The processors running each task can access their own rank number, which will play an important role in how we use MPI for embarrassingly parallel applications.

An example schematic of the MPI_COMM_WORLD communicator with six tasks and their associated ranks is shown below.

mpi4py

MPI is implemented in Python with the mpi4py library. When we run an MPI code on a cluster, MPI creates the communicator and assigns each task a rank; each task then independently loads the script. The processor(s) associated with a task can then access their own unique rank.

The following snip of code loads this library, accesses the communicator and stores the rank of the given process:

# load the mpi4py library
from mpi4py import MPI

# access the MPI COMM WORLD communicator and assign it to a variable
comm = MPI.COMM_WORLD

# get the rank of the current process (different for each process on the cluster)
rank = comm.Get_rank()

Example of using mpi4py to batch parallel jobs

Here, I'll parallelize the submission of a Matlab script called demoScript.m. This script reads an input file from a specific file location and prints out the contents of that file. For example purposes I've created 20 input files, each in its own folder. The folders are called "input_sample_0", "input_sample_1" and so on. Each input_sample folder contains a file called "sample_data.txt", which contains one line of text reading: "This is data for run <sample_number>".
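In case you want to recreate the inputs yourself, the short (hypothetical) helper below generates the 20 folders and their sample_data.txt files as described above; it is not part of the repository.

import os

for i in range(20):
    folder = "input_sample_{}".format(i)
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, "sample_data.txt"), "w") as f:
        f.write("This is data for run {}\n".format(i))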

All code for this example can be found on Github, here: https://github.com/davidfgold/mpi4py_blog.git

Batching runs of demoScript.m involves three components:

  1. Write demoScript.m so that it reads the sample number from the input.
  2. Write a Python script that will use mpi4py to distribute calls of demoScript.m. Here I’ll call this script “callDemoScript.py”
  3. Write a Bash script that sets up your MPI run and calls the Python function. Here I’ll call this script “submitDemoScript.sh”

1. demoScript.m

The demo Matlab script is found below. It reads in two arguments passed from the command line. The first argument is the rank, which will vary for each task, and the second is the sample number, which specifies which input folder to read from.

%%%%%%%%%%%%%%%%%%%%
% demoScript.m
%
% reads an input file from a given sample number (specified via command line)
% prints output from the sample file associated with the sample number
% also prints the rank for demonstration purposes
%%%%%%%%%%%%%%%%%%%%

% read in command line input
arg_list = argv();
rank = arg_list{1,1}; % rank is the first argument
sample = arg_list{2, 1}; % sample number is the second argument

% Create a string that contains the location of the proper sample directory
sample_out = fileread(strcat("input_sample_", sample, "/sample_data.txt"));

% create a string to print the rank number
rank_call = strcat("This is rank_", rank, ", receiving the following input: \n");

% format the output and print
output = strcat(rank_call, sample_out);
fprintf(output)

2. callDemoScript.py

The second component is a Python script that uses mpi4py to call demoScript.m many times across different tasks. Each task will run a number of samples equal to a variable called “N_SAMPLES_PER_TASK” which will be fed to this script when it is called.

'''
callDemoScript.py

Called to batch demoScript.m across multiple MPI tasks

Reads in the total tasks and number of samples per task from command line.
'''
# load necessary libraries
from mpi4py import MPI
import numpy as np
import sys
import os
import time

# locate the COMM WORLD communicator
comm = MPI.COMM_WORLD

# get the number of the current rank
rank = comm.Get_rank()

# read in arguments from the submission script
TOTAL_TASKS = int(sys.argv[1]) # number of MPI processes
N_SAMPLES_PER_TASK = int(sys.argv[2]) # number of runs per/task

# loop through samples assigned to current rank
for i in range(N_SAMPLES_PER_TASK):
	sample= rank + TOTAL_TASKS * i

	# write the command that will be sent to the terminal (the rank and sample numbers fill the {} placeholders)
	terminal_command = "octave-cli ./demoScript.m {} {} ".format(rank, sample)

	# write the terminal command to the process
	os.system(terminal_command)

	# sleep before submitting the next command
	time.sleep(1) # optional, for memory intensive submissions

comm.Barrier()

3. submitDemoScript.sh

The final component is a Bash script that will send this MPI job to the cluster. Here I'll use SLURM to create 4 MPI tasks across 2 nodes (each node will have 2 associated tasks). This will create a total of 4 MPI tasks, and each task will be assigned 5 samples to run.

I wrote this for a local cluster at Cornell, note that I had to load two modules to run Python and a third to run Octave (which is used to call Matlab scripts on Linux). I’ll call the Python script with mpirun, and then specify the total number of MPI tasks before making the function call. The output of the script is printed to a text file called demoOutput.txt

#!/bin/bash
# SBATCH directives must come first and cannot reference shell variables,
# so -n and -N are hard-coded to match the values defined below
#SBATCH -n 4 -N 2
#SBATCH --time=0:01:00
#SBATCH --job-name=demoMPI4py
#SBATCH --output=output/demo.out
#SBATCH --error=output/demo.err
#SBATCH --exclusive

# Set up your parallel runs
SAMPLES_PER_TASK=5 # number of runs for each MPI task
N_NODES=2 # number of nodes
TASKS_PER_NODE=2 # number of tasks per node

TOTAL_TASKS=$(($N_NODES*$TASKS_PER_NODE)) # total number of tasks (must match the -n directive above)

# load the modules needed to run Python (with mpi4py and numpy) and Octave
module load py3-mpi4py
module load py3-numpy
module load octave/6.3.0

# Submit the parallel job
mpirun -np $TOTAL_TASKS python3 callDemoScript.py $TOTAL_TASKS $SAMPLES_PER_TASK > demoOutput.txt

Additional resources

Putting some thought into how you design a set of parallel runs can save you a lot of time and headache. The example above has worked well for me when submitting sets of embarrassingly parallel tasks, but each application will be different, so take the time to find the procedure that works best for you. Our blog and the internet are full of resources that can help you parallelize your code, below are some suggestions:

Performing Experiments on HPC Systems

Scaling experiments: how to measure the performance of parallel code on HPC systems

Parallel processing with R on Windows

How to automate scripts on a cluster

Parallelization of C/C++ and Python on Clusters

Developing parallelised code with MPI for dummies, in C (Part 1/2)

Cornell CAC glossary on HPC terms: https://cvw.cac.cornell.edu/main/glossary

A great MPI tutorial I found online: https://mpitutorial.com/tutorials/

Tips for Creating Watershed Maps in R

There have been a few posts on this blog on creating watershed maps (here, here, and here), but this post is going to be focused on some of my recent experiences on creating watershed maps in R with files that may be missing attributes, are in the wrong projection, and contain data that need to be clipped to a specific boundary shapefile. There are lots of packages that exist to do one or more of these things, but anyone who has ever tried to create watershed maps in R knows that there isn’t one package that does it all. My main goal for this post is to outline the most efficient workflow and use of packages that also allow for the most compatibility when plotting shapefiles and raster files in one figure.

In this post, we are going to be creating a map of the Tuolumne River Basin boundary and plot elevation data within the basin. All the data are found here. First we will read in the Tuolumne boundary shape file (.shp) and the elevation raster file (.asc which is an ASCII file) using the appropriate functions and do some preliminary plotting to see what we have.

#Import libraries 

library(rgdal)
library(ggplot2)
library(raster)

#Read in Tuolumne shapefile

tuolumne.basin <- readOGR(dsn = "doi_10.6071_M3FH3D__v5/tuolumne_merced_data_2009-2015/Merced_Tuolumne_Dataset_SpatialData/SpatialData/Tuolumne_utm.shp")

#Read in elevation raster

elevation.raster = raster("doi_10.6071_M3FH3D__v5/tuolumne_merced_data_2009-2015/Merced_Tuolumne_Dataset_SpatialData/SpatialData/merced_tuolumne_100mdem_utm.asc")

#Plot the files
 
ggplot() +  geom_polygon(data = tuolumne.basin, aes(x = long, y = lat, group = group), colour = "dark red", fill = NA)
plot(elevation.raster)
Raw shapefile and raster data

So we have the pieces that we need to build the map, but notice that the latitude and longitude are in the wrong projection. We can use the following command to check what projection the shapefile is in:

proj4string(tuolumne.basin) 

We see that the output is: “+proj=utm +zone=11 +datum=NAD83 +units=m +no_defs”. So we are in the Universal Transverse Mercator coordinate system, but we should change to WGS 84. We can do this using the function “spTransform” which will swap out the projection by adjusting the CRS (Coordinate Reference System) attribute of the shapefile. You can use “proj4string” to verify that the transformation took place.

 tuolumne.basin.transformed <- spTransform(tuolumne.basin, CRS("+proj=longlat +ellps=WGS84 +datum=WGS84"))

Now we need to transform the raster file's coordinates. Note that the raster file doesn't have an associated coordinate reference system listed. If you try to change the projection at this point, you will get an error. This is a minor inconvenience, since we know that the coordinate system should match that of the raw Tuolumne shapefile, and we can just insert the original coordinate system as a string under the projargs attribute. Then we can transform it to match the coordinate system of the transformed shapefile using "projectRaster".

elevation.raster@crs@projargs <- "+proj=utm +zone=11 +datum=NAD83 +units=m +no_defs" 
  
elevation.raster.transformed <- projectRaster(elevation.raster,crs=crs(tuolumne.basin.transformed))

Great, now we have all the data in the right projection. Next we have to clip the raster layer to show only the data within the bounds of our shapefile. We first use the "crop" function in the raster library to clip the layer based on the extent of the shapefile boundary, and then use the "mask" function. It is important to do both, otherwise the clip will not work!

elevation.raster.transformed.cropped <- crop(elevation.raster.transformed, extent(tuolumne.basin.transformed))
elevation.raster.transformed.cropped <- mask(elevation.raster.transformed.cropped, tuolumne.basin.transformed)

Now we need to get the appropriate elevation values and coordinates from the raster object so that we can plot it using ggplot. When we use ggplot here, notice that we only need to use geom_raster and the elevation data, since the clipped data will perfectly follow the shapefile boundary.

#Isolate elevation values from the raster file

val <- getValues(elevation.raster.transformed.cropped)
xy <-as.data.frame(xyFromCell(elevation.raster.transformed.cropped,1:ncell(elevation.raster.transformed.cropped)))
xy <- cbind(xy,val)

#Plot it!

ggplot()+geom_raster(data=xy, aes(x=x, y=y, fill=val))+ scale_fill_viridis_c()+theme_bw()
A not so pretty watershed figure

It's almost perfect aside from that gray box that results from clipping and masking. After we clip, we convert all values outside the boundary of the shapefile to NAs, which fall outside the bounds of our color scale. To fix this, we simply insert an additional argument to scale_fill_viridis_c() and also make some additional aesthetic changes to the theme.

#Final plot function

ggplot()+geom_raster(data=xy, aes(x=x, y=y, fill=val))+ scale_fill_viridis_c(na.value=NA,name = "Elevation (m)")+theme_bw()+ggtitle("Tuolumne River Basin Elevation (m)")+xlab("Longitude") + ylab("Latitude")+theme(text = element_text(size = 20)) 
A pretty watershed figure!

MORDM Basics VI: Processing the output and reevaluating for robustness

In the previous post, we conducted a basic WaterPaths tutorial in which we ran a simulation-optimization of the North Carolina Research Triangle test case (Trindade et al., 2019) across 1,000 possible futures, or states of the world (SOWs). To briefly recap, the Research Triangle test case consists of three water utilities in Cary (C), Durham (D) and Raleigh (R). Cary is the main supplier, having a water treatment plant of its own. Durham and Raleigh purchase water from Cary via treated transfers.

Having obtained the .set file containing the Pareto-optimal decision variables and their respective performance objective values, we will now walk through the .set file processing and visualize the decision variables and performance objective space.

Understanding the .set file

First, it is important to understand the structure of the .set file. The NC_output_MS_S3_N1000.set file should have between 30 and 80 rows and a total of 35 columns. The first 20 columns contain values of the decision variables. Note that only Durham and Raleigh have water transfer triggers, as they purchase treated water from Cary.

  1. Restriction trigger, RT (C, D, R)
  2. Transfer trigger, TT (D, R)
  3. Jordan Lake allocation, JLA (C, D, R)
  4. Reserve fund contribution as a percentage of total annual revenue, RC (C, D, R)
  5. Insurance trigger, IT (C, D, R)
  6. Insurance payments as a percentage of total annual revenue, IP (C, D, R)
  7. Infrastructure trigger, INF (C, D, R)

The last 15 columns contain the objective values for the following performance objectives of all three utilities:

  1. Reliability (REL) to be maximized
  2. Restriction frequency (RF) to be minimized
  3. Infrastructure net present cost (INF_NPC) to be minimized
  4. Peak financial cost of drought mitigation actions (PFC) to be minimized
  5. Worst-case financial cost of drought mitigation actions (WCC) to be minimized

This reference set needs to be processed to output a .csv file to enable reevaluation for robustness analysis. To do so, run the post_processing.py file found in this GitHub repository in the command line:

python post_processing.py

In addition to post-processing the optimization output files, this script also conducts a regional minimax operation, where each regional performance objective is taken to be the objective value of the worst-performing utility (Gold et al., 2019).
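To illustrate the minimax idea (this is a sketch, not the contents of post_processing.py, and the column names are assumptions), the regional objectives can be computed along these lines:

import pandas as pd

dvs_objs = pd.read_csv("NC_dvs_objs.csv")  # assumes labeled per-utility objective columns
# Reliability is maximized, so the worst-performing utility has the minimum value
dvs_objs["REL_regional"] = dvs_objs[["REL_C", "REL_D", "REL_R"]].min(axis=1)
# Restriction frequency is minimized, so the worst-performing utility has the maximum value
dvs_objs["RF_regional"] = dvs_objs[["RF_C", "RF_D", "RF_R"]].max(axis=1)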

This should output two files:

  1. NC_refset.csv: Contains no header row. This is the file that will be used to run the re-evaluation for robustness analysis in the next blog post.
  2. NC_dvs_objs.csv: Contains a header row. This file contains the labeled decision variables and objectives, including the minimax regional performance objectives, and will be used for visualizing the reference set's decision variables and performance objectives.

Visualizing the reference set

Due to the high number of decision variables, we utilize parallel axis plots to identify how varying the decision variables can potentially affect certain performance objectives. Here, we use the regional reliability performance objective, REL, as an example. Figure 1 below demonstrates how all decision variables vary with regional reliability.

Figure 1: All decision variables for the three utilities. A darker blue indicates a higher degree of reliability.

From Figure 1, most solutions found via the optimization conducted in the previous blog post seem to have relatively high reliability across the full range of decision variable values. It is unclear how each decision variable might affect regional reliability. It is thus more helpful to identify specific sets of decision variables (or policies) that enable reliability beyond a certain threshold to be achieved.

With this in mind, we assume that all members of the Triangle require that their collective reliability be at least 98%. This results in the following figure:

Figure 2: All decision variables across the three utilities. The dark lines represent the policies that are at least 98% reliable.

Figure 2 has one clear takeaway: Pareto-optimality does not indicate satisfactory performance. In addition, setting this threshold makes the effects of each decision variable clearer. It can be seen that regional reliability is significantly affected by both Durham and Raleigh's infrastructure triggers (INF). Desirable levels of regional reliability can be achieved when Durham sets a high INF value. Conversely, Raleigh can set lower INF values and still benefit from satisfactory reliability. Figure 2 also shows that having Durham set a high insurance trigger (IT) may benefit the region in terms of reliability.

However, there are caveats. Higher INF and IT values for Durham imply that the financial burden of investment and insurance payments is being borne by Raleigh and Cary, as Durham is able to withstand more risk without having to trigger an investment or insurance payment. This may affect how each member utility perceives the individual risks and benefits of forming a cooperative contract.

The code to plot these figures can be found under ‘refset_parallel.py’ in the repository.

Robustness analysis and what’s next

So how is setting a threshold value of regional reliability significant?

Within the MORDM framework, robustness is defined using a multivariate satisficing metric (Gold et al., 2019). Depending on the requirements of the stakeholders, a set of criteria is defined that will then be used to distinguish between success (all criteria are met) and failure (at least one criterion is not met). Using these criteria, the Pareto-optimal policies are re-simulated across a number of uncertain SOWs. Each policy's robustness is then represented by the percentage of SOWs in which it meets the minimum performance criteria that have been set.
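A minimal sketch of this satisficing metric for a single policy, assuming rel and rf are arrays of that policy's regional reliability and restriction frequency across the re-evaluation SOWs:

import numpy as np

def robustness(rel, rf, rel_min=0.98, rf_max=0.20):
    # A SOW counts as a success only if every criterion is met
    meets_criteria = (rel >= rel_min) & (rf <= rf_max)
    return meets_criteria.mean()  # fraction of SOWs in which the policy succeeds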

In this post, we processed the reference set and visualized its decision variable space with respect to each variable’s effect on the reliability performance objective. A similar process can be repeated across all utilities for all performance objectives.

Using the processed reference set, we will conduct multi-criterion robustness analysis using two criteria:

  1. Regional reliability should be at least 98%
  2. Regional restriction frequency should be less than or equal to 20%

We will also conduct sensitivity analysis to identify the decision variables that most impact regional robustness. Finally, we will conduct scenario discovery to identify SOWs that may cause the policies to fail.

References

Gold, D. F., Reed, P. M., Trindade, B. C., & Characklis, G. W. (2019). Identifying actionable compromises: Navigating multi‐city robustness conflicts to discover cooperative safe operating spaces for regional water supply portfolios. Water Resources Research, 55(11), 9024–9050. https://doi.org/10.1029/2019wr025462

Trindade, B. C., Reed, P. M., & Characklis, G. W. (2019). DEEPLY UNCERTAIN PATHWAYS: Integrated multi-city Regional Water Supply Infrastructure Investment and portfolio management. Advances in Water Resources, 134, 103442. https://doi.org/10.1016/j.advwatres.2019.103442

Make LaTeX easier with custom commands

LaTeX is a powerful tool for creating professional looking documents. Its ability to easily format mathematical equations, citations and complex figures makes LaTeX especially useful for developing peer-reviewed journal articles and scientific reports. LaTeX is highly customizable, which allows you to create documents that are not carbon copies of generic Microsoft Word templates.

Using LaTeX does have its drawbacks: instead of simply typing on a page, you construct the document by writing LaTeX code. Once you've written your code, a compiler translates it into a finished and formatted document. This can sometimes result in high overhead time for fixing bugs and managing formatting. But coding a document also has advantages: in addition to the vast array of existing LaTeX libraries and commands, you can create your own custom commands that speed up the writing and formatting process. Below I'll go over the basics of creating custom LaTeX commands and provide some illustrative examples.

Commands with no arguments

If you have an equation or a complex sequence of text that you know you’ll be repeating, you can create a custom command to produce it. For example, if I’m constantly referencing the equation for an Ordinary Least Squares (OLS) estimator, I can make a new command that produces it:

\newcommand{\OLS}{$\hat{\beta}=(X^TX)^{-1}X^Ty$}

There are three parts to defining this command, as shown in the figure below:

  1. Tell LaTeX you are defining a new command by specifying "\newcommand"
  2. Name the command (make sure to include the backslash)
  3. Specify the output of the new command

Example LaTeX code that calls the OLS command:

I can store complex terms using a predefined command: \OLS

Compiled output:

Commands with basic arguments

Single argument

You can also define commands that accept arguments. For example, if you want to make commands to assist tracking changes in a document, you can create a command that formats a section of added text so it has the color blue and is bolded:

\newcommand{\addtxt}[1]{{\color{blue} \textbf{#1}}} % Highlight text that has been added

The command defined above accepts one argument (shown as the “[1]”) and calls that argument using “#1”, as highlighted in the figure below:

Example LaTeX code using my command:

Demonstrating my custom commands with arguments: \addtxt{This text has been inserted into this sentence}

Example compiled output:

Multiple arguments

You can also define commands with multiple arguments. For example, you can create a template sentence that provides an update on the timing of a project:

\newcommand{\projReportA}[2]{The project was planned to finish on \textbf{#1}, after reviewing current progress, we have determined that it will likely finish on \textbf{#2}} % insert a date for when a project was planned to be completed and when a project is likely to be completed

Here, argument #1 is the date when the project was planned on being completed, and argument #2 is the date that the project will likely be completed.

Example use of this function:

Another way you can use an argument:\\ \\
\projReportA{September $9^{th}$}{October $1^{st}$}

Example compiled output:

Commands with optional arguments

The project report command above can be modified to accept a default completion date with an option to include an updated date.

\newcommand{\progReportB}[2][September 9th]{The project was planned to finish on \textbf{#2}, after reviewing current progress, we have determined that it will likely finish on \textbf{#1}}

To create an optional argument, specify the default value of the first argument in a new set of brackets. Note that in basic LaTeX this only works for a single default argument; for more defaults you can use a package such as xparse.

Here’s an example using this new command with the default argument:

Here I'll will use the command without the optional argument, so it will print the default: \\
\\
\progReportB{September $9^{th}$}

This will compile to:

Here’s an example with the optional argument specified

Now I'll add the optional argument, which will be added in place of the default: \\ \\
\progReportB[October $1^{st}$]{September $9^{th}$}

This will compile to:

Concluding thoughts

These simple examples only scratch the surface of what you can do with LaTeX commands. I should also note that while custom commands are useful, LaTeX also contains a large suite of packages with predefined commands that can be easily imported into your document.


Teaching Tools for Complex Adaptive Systems

This semester, I am taking a foundational class in the Systems Engineering department here at Cornell and I wanted to use this blog post to relay some cool packages and tools that we have used that hopefully can be useful teaching material for emerging faculty or anyone looking for interactive systems tutorials.

To begin, we have to first define what falls under the umbrella of complex adaptive systems. In a nutshell, these systems tend to (1) have networks of many components, (2) typically involve non-linear interactions between components, (3) exhibit self-organizing behavior, (4) have the potential to exhibit emergent properties. One really beautiful website that explains these properties in more detail is Complexity Explained, which started as a community outreach project to try to explain complex systems to a wider audience within the science community and the public. The website features interactive animations of systems properties and a short booklet that can be downloaded (in many languages) with key concepts.

It is well known that complex systems are hard for humans to understand because many of their characteristics are non-intuitive for us. For example, self-organizing behavior often runs counter to our everyday experience (when can you remember a time that a system around you naturally seemed to become more orderly as time passed?). Emergent properties can come about over long time scales and are often far removed from the original action. We can't always understand how decisions on the microscale result in large macroscale processes. Thus, in order to best approach complex systems, we must have the ability to interact with them, model them, and map out their complex behavior under many conditions. Below, I am introducing some tools that might help foster more understanding of these ideas using simple, yet dynamically rich, cases.

PyCX

One of the main creators of the Complexity Explained website and a visiting lecturer to my systems class is Hiroki Sayama, a world-renowned researcher and director of the Center for Collective Dynamics of Complex Systems at Binghamton University. Dr. Sayama has created a Python package called PyCX that contains sample Python codes of complex systems that a user can run interactively and then manipulate or build off of. Simply download the package from GitHub and all of the code and a simulator will be available to you. Figure 1 shows an example interactive simulation of a Turing pattern. In 1952, Alan Turing authored a paper in which he described how patterns in animals' coats, such as stripes and spots, can arise naturally from a chaotic system. He uses a simple set of reaction-diffusion equations to describe this process. Figure 1 shows the Python simulator in PyCX, the equation for the Turing pattern, and the evolution from the random initialization to the ordered spots.
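For reference (this general form is not taken from the original figure), a two-component reaction-diffusion system, of which Turing's model is a special case, can be written as

$$\frac{\partial u}{\partial t} = f(u,v) + D_u \nabla^2 u, \qquad \frac{\partial v}{\partial t} = g(u,v) + D_v \nabla^2 v$$

where f and g are the local reaction terms and D_u and D_v are diffusion coefficients; spatial patterns can emerge when the two species diffuse at sufficiently different rates.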

Figure 1: PyCX interactive simulation for the Turing Pattern

PyCX also allows you to toggle the parameters of the problem, which demonstrates how small perturbations in the system can lead to substantially different outcomes. You can adjust these parameters within the source Python code (which I believe is more useful for students than just clicking a "play" button). Figure 2 shows the difference in behavior across a forest fire model when the initial density is adjusted from 35% to 40% of the space.

Figure 2: The effect of initial conditions in a forest fire agent-based model

Golly- Game of Life Simulator

Golly is an open-source tool for visualizing cellular automata, including Conway’s Game of Life. Golly allows the user to draw different patterns and apply specific rules for how the systems evolve. You can stop the simulation midway and apply different rules to the existing patterns.

Figure 3: Golly Interface Screen Shot
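
Golly is the nicer way to explore this interactively, but the underlying rules of Conway's Game of Life (a dead cell with exactly three live neighbors is born; a live cell with two or three live neighbors survives) fit in a few lines of NumPy. Here is a quick sketch if you want students to see the rules spelled out in code:

import numpy as np

def life_step(grid):
    # count the 8 neighbors of every cell, with periodic boundaries
    neighbors = sum(np.roll(np.roll(grid, di, 0), dj, 1)
                    for di in (-1, 0, 1) for dj in (-1, 0, 1)
                    if (di, dj) != (0, 0))
    # alive next step: exactly 3 neighbors, or currently alive with exactly 2
    return (neighbors == 3) | (grid & (neighbors == 2))

rng = np.random.default_rng(1)
grid = rng.random((50, 50)) < 0.2   # random initial pattern
for _ in range(100):
    grid = life_step(grid)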

Swarm Behavior

Dr. Sayama also developed a really interesting Java application to study swarm behavior, the collective behavior exhibited by groups of entities, typically animals. This application, called Swarm Chemistry, creates agents with different kinetic parameters that dictate their dynamics. The application allows you to mix agents into a single population and observe how emergent dynamics form. Figure 4 shows the opening interface when you click the .jar executable. The application brings up six random agents that each exhibit some dynamic behavior. By clicking on any two agents, you create a new population that shows how the dynamics of those agents interact (Figure 5). You can keep mixing agents and adding more random swarms, and you can individually mutate certain swarms or edit their parameters as well. The pictures do not do this application justice. It is super fun (and slightly addictive) and a great way to get students excited about the concepts.

Figure 4: Swarm Chemistry Opening Interface

Figure 5: Emergent dynamic behavior

I had so much fun using these packages in class and I hope that these tools can help you/your students become more engaged and excited about complex systems!

References

My knowledge of these tools came from Hiroki Sayama’s guest lectures in SYSEN 6000 at Cornell University and from:

Sayama, H. (2015). Introduction to the Modeling and Analysis of Complex Systems, Open SUNY Textbooks, Milne Library, State University of New York at Geneseo.

A non-intimidating introduction to parallel computing with Numba

This blog post is adapted from material I learned during the 2021 San Diego Supercomputer Center (SDSC) Summer Institute. This was an introductory boot camp to high-performance computing (HPC), and one of the modules taught the application of Numba for in-line parallelization and speeding up of Python code.

What is Numba?

According to its official web page, Numba is a just-in-time (JIT) compiler that translates subsets of Python and NumPy code into fast machine code, enabling it to run at speeds approaching those of C or Fortran. This is because JIT compilation compiles specific pieces of code only when they are first needed. Numba also caches the compiled version of a function for each combination of input data types it has seen, which eliminates the need for recompilation every time the function is called with the same types.

This blog post will demonstrate some simple examples of using Numba and its most commonly used decorator, @jit, via Jupyter Notebook. The Binder file containing all the executable code can be found here.

Note: The ‘@’ symbol is used to indicate the use of a decorator.

Installing Numba and Setting up the Jupyter Notebook

First, in your command prompt, enter:

pip install numba

Alternatively, you can also use:

conda install numba

Next, import Numba:

import numpy as np
import numba
from numba import jit
from numba import vectorize

Great! Now let’s move onto using the @jit decorator.

Using @jit for executing functions on the CPU

The @jit decorator works best on numerical functions that use NumPy. It has two modes: nopython mode and object mode. Setting nopython=True tells the compiler to compile the entire decorated function into machine code without involving the Python interpreter. This setting leads to the best performance. However, in the case when:

  1. nopython=True fails
  2. nopython=False, or
  3. nopython is not set at all

the compiler defaults to object mode. In object mode, Numba identifies the loops that it can compile into machine code and runs the remaining code in the interpreter.

Here, @jit is demonstrated on a simple matrix multiplication function:

# a function that performs a matrix-vector multiplication, one row at a time
@jit(nopython=True)
def matrix_multiplication(A, x):
    b = np.empty(shape=(x.shape[0],1), dtype=np.float64)
    for i in range(x.shape[0]):
        b[i] = np.dot(A[i,:], x)
    return b

Remember – applying @jit does not compile the function yet! Compilation only happens the first time you call the function:

A = np.random.rand(10, 10)
x = np.random.rand(10, 1)
matrix_multiplication(A, x)
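
That first call is also where the compilation overhead lives. If you time the first and second calls separately (a quick check of my own, not part of the SDSC material), the difference is easy to see:

import time

start = time.perf_counter()
matrix_multiplication(A, x)      # first call: compiles the function, then runs it
print("first call: ", time.perf_counter() - start)

start = time.perf_counter()
matrix_multiplication(A, x)      # second call: reuses the cached machine code
print("second call:", time.perf_counter() - start)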

But how much faster is Numba, really? To find out, some benchmarking is in order. Jupyter Notebook has a handy magic command called %timeit that runs a statement many times in a loop to estimate its average execution time. It can be used as follows:

%timeit matrix_multiplication(A,x)

# 11.4 µs ± 7.34 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Numba has a special .py_func attribute that allows the decorated function to be run as the original, uncompiled Python function. Using this, we can compare its runtime to that of the decorated version:

%timeit matrix_multiplication.py_func(A,x)

# 35.5 µs ± 3.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

From here, you can see that the Numba version runs about three times faster than the plain NumPy version. Besides NumPy arrays, Numba also supports tuples, integers, floats, and Python lists. All other Python features supported by Numba can be found here.

Besides explicitly declaring @jit at the start of a function, Numba also makes it simple to turn an existing NumPy function into a Numba function by calling jit(nopython=True) on the original function. This essentially uses the @jit decorator as a function. A function that calculates the absolute percentage relative error demonstrates how this is done:

# Calculate absolute percentage relative error
def numpy_re(x, true):
    return np.abs(((x - true)/true))*100

numba_re = jit(nopython=True)(numpy_re)

And we can see how the Numba version is faster (here, x is the array we defined earlier):

%timeit numpy_re(x, 0.66)
%timeit numba_re(x, 0.66)

where the NumPy version takes approximately 2.61 microseconds to run, while the Numba version takes 687 nanoseconds.

Inline parallelization with Numba

The @jit decorator can also be used to enable inline parallelization by setting parallel=True. Parallelization in Numba is done via multi-threading, which distributes the work across threads running on the available CPU cores. An example of this can be seen in the code snippet below, describing a function that evaluates the normal probability density of a value x for arrays of means and standard deviations:

SQRT_2PI = np.sqrt(2 * np.pi)

@jit(nopython=True, parallel=True)
def normals(x, means, sds):
    # evaluate the normal probability density of x for each (mean, sd) pair
    result = np.exp(-0.5*((x - means)/sds)**2)
    return (1 / (sds * SQRT_2PI)) * result

As usual, the function is compiled on its first call:

means = np.random.uniform(-1,1, size=10**8)
sds = np.random.uniform(0.1, 0.2, size=10**8)

normals(0.6, means, sds)

To appreciate the speed-up that Numba’s multi-threading provides, compare the runtime for this with:

  1. A decorated version of the function with a disabled parallel pass
  2. The uncompiled, original NumPy function

The first example can be timed by:

normals_deco_nothread = jit(nopython=True)(normals.py_func)
%timeit normals_deco_nothread(0.6, means, sds)

# 3.24 s ± 757 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The first line of the code snippet makes an uncompiled copy of the normals function and then applies the @jit decorator to it. This effectively creates a version of normals that uses @jit but is not multi-threaded. This run of the function took approximately 3.2 seconds.

For the second example, simply:

%timeit normals.py_func(0.6, means, sds)

# 7.38 s ± 759 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Now, compare both these examples to the runtime of the decorated and multi-threaded normals function:

%timeit normals(0.6, means, sds)

# 933 ms ± 155 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The decorated, multi-threaded function is significantly faster (933 ms) than the decorated function without multi-threading (3.24 s), which in turn is faster than the uncompiled original NumPy function (7.38 s). However, the degree of speed-up may vary depending on the number of CPUs that the machine has available.
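
If your function contains explicit loops, you can also ask Numba to parallelize them directly by using numba.prange in place of range, which splits the loop iterations across threads. The short sketch below is my own example rather than part of the original SDSC material:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def row_means(data):
    n = data.shape[0]
    out = np.empty(n)
    for i in prange(n):          # iterations are distributed across threads
        out[i] = data[i, :].mean()
    return out

row_means(np.random.rand(10**4, 10**3))   # first call compiles, later calls are fast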

Summary

In general, the improvements achieved by using Numba on top of NumPy functions are marginal for simple, few-loop functions. Nevertheless, Numba is particularly useful for large datasets or high-dimensional arrays that require a large number of loops, and would benefit from the one-and-done compilation that it enables. For more information on using Numba, please refer to its official web page.

Simple profiling checks for running jobs on clusters

The goal of this short blog post is to share some simple tips on profiling your (to be) submitted jobs on high performance computing resources. Profiling your jobs can give you information about how efficiently you are using your computational resources, i.e., your CPUs and your allocated memory. Typically you would perform these checks on your experiment at a smaller scale, ensuring that everything is working as it should, before expanding to more tasks and CPUs.

Your first check is squeue, typically paired with your user ID on the cluster. Here’s an example:

(base) [ah986@login02 project_dir]$ squeue -u ah986
             JOBID PARTITION     NAME      USER  ST       TIME  NODES NODELIST(REASON) 
           5688212    shared <job_name>    ah986  R       0:05      1 exp-4-55 

This tells me that my submitted job is utilizing 1 node in the shared partition of this cluster. If your cluster uses the SLURM scheduler, you can also use sacct, which displays accounting data for all jobs you are currently running or have run in the past. There are many pieces of information available through sacct, which you can specify using the --format flag. Here’s an example for the same job:

(base) [ah986@login02 project_dir]$ sacct --format=JobID,partition,state,time,start,end,elapsed,nnodes,ncpus,nodelist,AllocTRES%32 -j 5688212
       JobID  Partition      State  Timelimit               Start                 End    Elapsed   NNodes      NCPUS        NodeList                        AllocTRES 
------------ ---------- ---------- ---------- ------------------- ------------------- ---------- -------- ---------- --------------- -------------------------------- 
5688212          shared    RUNNING   20:00:00 2021-09-08T10:55:40             Unknown   00:19:47        1        100        exp-4-55 billing=360000,cpu=100,mem=200G+ 
5688212.bat+               RUNNING            2021-09-08T10:55:40             Unknown   00:19:47        1        100        exp-4-55          cpu=100,mem=200G,node=1 
5688212.0                  RUNNING            2021-09-08T10:55:40             Unknown   00:19:47        1        100        exp-4-55          cpu=100,mem=200G,node=1 

In this case I can see the number of nodes (1) and the number of cores (100) utilized by my job as well as the resources allocated to it (100 CPUs and 200G of memory on 1 node). This information is useful in cases where a task launches other tasks and you’d like to diagnose whether the correct number of cores is being used.

Another useful tool is seff, which is actually a wrapper around sacct and summarizes your job’s overall performance. It is a little unreliable while the job is still running, but after the job is finished you can run:

(base) [ah986@login02 project_dir]$ seff 5688212
Job ID: 5688212
Cluster: expanse
User/Group: ah986/pen110
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 100
CPU Utilized: 1-01:59:46
CPU Efficiency: 68.16% of 1-14:08:20 core-walltime
Job Wall-clock time: 00:22:53
Memory Utilized: 38.25 GB
Memory Efficiency: 19.13% of 200.00 GB

The information here is very useful if you want to find out how efficiently you’re using your resources. For this example I had 100 separate tasks to perform, and I requested 100 cores on 1 node and 200 GB of memory. These results tell me that my job completed in about 23 minutes, the total time spent on the CPUs (CPU Utilized) was 1-01:59:46 (roughly 26 hours of core time), and, most importantly, they report the efficiency of my CPU use. CPU Efficiency is calculated “as the ratio of the actual core time from all cores divided by the number of cores requested divided by the run time”, in this case 68.16%. What this means is that I could be utilizing my cores more efficiently by allocating fewer cores to the same number of tasks, especially if scaling up to a larger number of nodes/cores. Additionally, my allocated memory is underutilized, and I could request a smaller memory allocation without inhibiting my runs.
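
To see where that efficiency number comes from, here is the arithmetic as a quick Python check (using the values reported by seff above):

cpu_utilized  = 1*86400 + 1*3600 + 59*60 + 46   # "1-01:59:46" in seconds
wall_clock    = 22*60 + 53                      # "00:22:53" in seconds
cores         = 100
core_walltime = cores * wall_clock              # equals "1-14:08:20"
print(cpu_utilized / core_walltime)             # ~0.6816, i.e. the 68.16% reported by seff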

Finally, while your job is still running, you can log into the node(s) executing it to look at live data. To do so, simply ssh to one of the nodes listed under NODELIST (not all clusters allow this). From there, you can run the top command as shown below (with your own username), which starts the live task manager:

(base) [ah986@r143 ~]$ top -u ah986

top - 15:17:34 up 25 days, 19:55,  1 user,  load average: 0.09, 12.62, 40.64
Tasks: 1727 total,   2 running, 1725 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  0.1 sy,  0.0 ni, 99.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 257662.9 total, 249783.4 free,   5561.6 used,   2317.9 buff/cache
MiB Swap: 716287.0 total, 716005.8 free,    281.2 used. 250321.1 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                              
 78985 ah986     20   0  276212   7068   4320 R   0.3   0.0   0:00.62 top                                                                                                                                  
 78229 ah986     20   0  222624   3352   2936 S   0.0   0.0   0:00.00 slurm_script                                                                                                                         
 78467 ah986     20   0  259464   8128   4712 S   0.0   0.0   0:00.00 srun                                                                                                                                 
 78468 ah986     20   0   54520    836      0 S   0.0   0.0   0:00.00 srun                                                                                                                                 
 78481 ah986     20   0  266404  19112   4704 S   0.0   0.0   0:00.24 parallel                                                                                                                             
 78592 ah986     20   0  217052    792    720 S   0.0   0.0   0:00.00 sleep                                                                                                                                
 78593 ah986     20   0  217052    732    660 S   0.0   0.0   0:00.00 sleep                                                                                                                                
 78594 ah986     20   0  217052    764    692 S   0.0   0.0   0:00.00 sleep                                                                                                                                
 78595 ah986     20   0  217052    708    636 S   0.0   0.0   0:00.00 sleep                                                                                                                                
 78596 ah986     20   0  217052    708    636 S   0.0   0.0   0:00.00 sleep                                                                                                                                
 78597 ah986     20   0  217052    796    728 S   0.0   0.0   0:00.00 sleep                                                                                                                                
 78598 ah986     20   0  217052    732    660 S   0.0   0.0   0:00.00 sleep       

Memory and CPU usage can be tracked from the RES and %CPU columns, respectively. In this case, for the sake of an example, I just assigned each of my cores to sleep for a certain number of minutes (using essentially no CPU or memory). Similar information can also be obtained using the ps command, with memory being tracked under the RSS column.

 (base) [ah986@r143 ~]$ ps -u$USER -o %cpu,rss,args
%CPU   RSS COMMAND
 0.0  3352 /bin/bash /var/spool/slurm/d/job3509431/slurm_script
 0.0  8128 srun --export=all --exclusive -N1 -n1 parallel -j 100 sleep {}m ::: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
 0.0   836 srun --export=all --exclusive -N1 -n1 parallel -j 100 sleep {}m ::: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
 0.1 19112 /usr/bin/perl /usr/bin/parallel -j 100 sleep {}m ::: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
 0.0   792 sleep 3m
 0.0   732 sleep 4m
 0.0   764 sleep 5m
 0.0   708 sleep 6m
 0.0   708 sleep 7m
 0.0   796 sleep 8m
 0.0   732 sleep 9m
 0.0   712 sleep 10m