Data Augmentation for Time Series Application

This post is meant to be an introduction to the concept of data augmentation. Data augmentation is the process of increasing the size of your data through small modifications to the original dataset. In instances where data availability are small (basically every ML application), this technique is especially useful to create more training data that can lead to a more robust model that isn’t as susceptible to overfitting. Let’s begin with an example that will demonstrate why data augmentation is useful in image classification. Imagine that you have trained your model to distinguish between images of cats and dogs. The figure on the left is of a very good boy named Lincoln and this image resides in the training set. Let’s suppose that the image in the middle is in the test set. To humans, this is very clearly Lincoln (and a dog) once again, but if the algorithm hasn’t seen many images of dogs in this position, there is a chance that it won’t classify this image correctly and may think that Lincoln looks more like this cat in the training set that has a similar orientation and ears.

Lincoln, Lincoln, and….Lincoln? (Cat stock image from here)

However, if I were to augment my original image in the training set by rotating, scaling, and shifting it, as shown below, perhaps my model would be more likely to classify Lincoln correctly as a dog having been trained on these variations. Various studies have demonstrated the benefits of this augmentation in image processing applications.

Augmentations of original image that create new training data

This is a very simple example to demonstrate that limited data availability need not preclude the ability to make robust predictions. It is not a far stretch to wonder how data augmentation may be utilized for regression-based prediction problems, especially in the water resources field where we have limited data. Particularly, it is hard for us to predict extremes because we have such few data points to characterize them. This style of problem is inherently more complicated than classification because time series have a temporal structure and are connected to underlying (sometimes physical) relationships. Thus, this requires that any augmentation does not completely change the fundamental characteristics of the data. Below are some examples of techniques that could be useful, but these are extremely case-specific and require a strong understanding of the behavior of your system. Before you implement any of these techniques, first make sure to split your data into the training and test set. Then feel free to add variations to the training set and test them out!

Block Bootstrapping

Bootstrapping (sampling with replacement) single points your dataset can only be done if each point is independent. This is not the case with time series data that has a temporal structure. Thus, it is more appropriate to utilize block-bootstrapping. This technique involves resampling blocks of continuous data from the original training data to make a new training dataset. By using large continuous blocks, we are preserving the inherent structure in the data, while allowing our algorithm to see new data (the original data in a new order).

Jittering with Noise

A small sample size doesn’t give us the opportunity to map out the rich input-output space that characterizes our system. Often, adding a little bit of random noise to your training data can help expand your understanding of the space. If your system exhibits highly non-linear behavior, you have to be extra careful that the noise that you are adding is realistic. For example, in a rainfall-runoff model, the fluctuations of temperature and precipitation are very different. Small changes in precipitation can result in very large overall streamflow changes, whereas temperature often fluctuates widely during the day with very little effect on streamflow. Therefore, adding the same amount of noise to each feature and the output may not make sense. It is a non-trivial effort, but it could be interesting to determine how to appropriately add noise to features that exhibit different behavior.

Interpolation

If you want to augment training data that has clear trends, interpolation between data points can be a viable option that won’t distort these trends. However, using a linear interpolation method sets the underlying assumption that your data are linear; for instance that a linear change in precipitation and temperature leads to a linear change in streamflow. This is likely not the case, so interpolation may not be a useful data augmentation technique for a rainfall-runoff regression-based model. However, interpolation could be useful in a less sensitive classification-based model.

Decomposition

Decomposition methods generally decompose time series signals by extracting features or underlying patterns from the training data. These features can either be used independently or recombined with noise and the old training data to generate new training data. Decomposition can be preformed in either the time or frequency domain. Within the decomposition domain lies manifold-based techniques as well. A study by Forestier et al., 2017 calculate a weighted average that reflects the manifold of the
original data and use it as new data with high success.

Implementation

All of these techniques have shown success in very specific time series applications: those related to speech, audio, and gait recognition, and specifically for classification-based models. Very little has been published on regression-based models and the use of data augmentation in the water resources community seems nonexistent.

Below, I implemented one of these techniques using the CNN that I fit in my prior post. The results for the baseline prediction of streamflow are shown in the first panel. Then I tried a data augmentation scenario. I took the training set and reversed it and kept the first 1000 data points (which have more extremes). I then took these points and concatenated them to the original training set. The new results are shown in the second panel. The CNN trained with the additional augmented data does a much better job of capturing the extremes in the test set, which is what we often are interested in. There is a lot of work to be done and more complicated methods to explore, but these initial results look interesting and suggest that data augmentation can be useful in our field!

References

Forestier, G., Petitjean, F., Dau, H. A., Webb, G. I., & Keogh, E. (2017, November). Generating synthetic time series to augment sparse datasets. In 2017 IEEE international conference on data mining (ICDM) (pp. 865-870). IEEE.

Oh, C., Han, S., & Jeong, J. (2020). Time-series Data Augmentation based on Interpolation. Procedia Computer Science175, 64-71.

The ma-R-velous tidyverse

I am currently taking a class in environmental economics, which of late seems to be doubling as a crash course in the R tidyverse. This blog post will discuss several R packages and demonstrate the application of several tidyverse sub-packages and commands that I have learned to use through this class. Two datasets will be used: the storm dataset, which contains data on the types of storms that occurred in the United States from 1975 to 2015, and the damages dataset, which is a record of the dollar cost of the damages caused by every tropical storm that occurred from 1975 to 2015. The datasets required for this example can be found in in this GitHub repository.

Tidy data and the tidyverse?

‘Tidy’ datasets provide a standardized way to link the physical layout of a dataset with the meaning associated with it [1]. A ‘tidy’ dataset has the following characteristics:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each cell contains only one value of only one data type

On the other hand, the tidyverse is a meta-package that bundles the packages facilitating data ‘tidying’, or data cleaning such that the final form of the dataset is ‘tidy’. You can check the full set of packages bundled into the tidyverse using tidyverse_packages(), which will show you the following output in the console:

Within this set, we will be using the dplyr, tidyr and lubridate packages. The dplyr and tidyr packages are “workhorse” packages – their main purpose is to clean and wrangle large datasets, and will be the two packages that we use most often in this example. The lubridate package is used to convert strings into dates, which we will use later in this example.

In this example, we will transform a ‘messy’ dataset into a ‘tidy’ one, and perform several other operations that are enabled by the R tidyverse.

The pacman package management tool

No, this is not the Toru Iwatani-created, dot-munching yellow puck released in 1980 by Namco. Instead, the pacman tool is a convenient library and package wrapper that combines the functionality of base library-related functions into intuitively named functions [2]. It was created to improve workflow by reducing the time required in calling and re-calling obscurely named functions. Ideally, it should be called at the beginning of the R script, like so:

# Run this line
if (!require("pacman")) install.packages("pacman")

This enables you to automatically install and load packages throughout the base R libraries and the R tidyverse using the p_load function. For this example, we will only require the tidyverse package. Load this package using the following script:

# Then run p_load to load the necessary packages
pacman::p_load(
 tidyverse
)

As previously mentioned, this automatically loads the dplyr, tidyr and lubridate packages that we will need for this example.

Before beginning, make sure that you are running the correct versions of dplyr and tidyr. Check the package versions using the packageVersion('package_name') function. All packages should be at least version 1.0.0.

Working with the datasets

Some key tidyr verbs

With the explanations out of the way, we begin by loading our two datasets:

# Load dataset
storms <- read.csv("storms.csv")
damages <- read.csv("damages.csv")

Note that the damages dataset is in an inconvenient ‘wide’ shape. To convert damages into ‘tidy’ format, we use pivot_longer(). pivot_longer() is one of the four key tidyr verbs, the remaining three being pivot_wider(), separate() and unite(). These verbs perform the following uses:

  1. pivot_longer(): Vertically stretches the data, increasing the number of rows and decreasing the number of columns
  2. pivot_wider(): Horizontally stretches the data, increasing the number of columns and decreasing the number of rows
  3. separate(): Turns a single-character column into multiple columns
  4. unite(): Pastes multiple columns into one

For this example, pivot_longer() is used to turn damages from its original wide shape into a long, 3-column dataframe where each column contains data on the storm’s name, type, and total dollar cost of damages. The code is as follows:

# tidy-up damages
damages <- damages %>% mutate(status = "tropical storm") %>%
  pivot_longer(-status, names_to="name", values_to="damages")

This script first categorizes all the storms within damages as tropical storms, and them assigns the names of each storm to the column ‘name’ and the cost of their damages to ‘damages’. This turns the original dataset:

into the following shape:

This results in a more readable dataset!

It’s a lubridate!

Next, we will visit out storm dataset. Note that the information regarding the date and time for each storm is inconveniently split between four columns:

We use the as_datetime() function within the lubridate package to combine these columns into one, easy-to-interpret column called ‘date’. Use the following script to:

# Paste four columns and convert to date-time format
storms <- storms %>% mutate(date=as_datetime(
  paste0(year, "-", month, "-", day, " ", hour, ":00:00")))

This pastes the data in the ‘year’, ‘month’, ‘day’ and ‘hour’ columns together and inserts formatted date and time into the newly-added ‘date’ columns:

Great! Our storm dataset is now more readable.

Some dplyr verbs and operations

There are five key dplyr verbs, some of which you have already seen:

  1. filter(): Obtains a subset of rows based on their column values
  2. arrange(): Reorders rows (in ascending or descending order) based on their values
  3. select(): Selects columns or variables of interest
  4. mutate(): Creates new columns or variables and automatically appends them to the dataframe
  5. summarize(): Collapses multiple rows into a single summary value, commonly by grouping a pre-selected variable

The dplyr package also come with a set of join operations, which enables the merging of two dataframes, x and y. The types of operations include:

  1. inner_join(): Matches pairs of observations whenever their values are equal
  2. left_join(): Keeps all x-observations that appear in either x or y
  3. right_join(): Keeps all y-observations that appear in either x or y
  4. full_join(): Keeps all x– and y-observations that appear in either x or y
  5. semi_join(): Keeps all observations in x that have a match in y
  6. anti_join(): Drops all observations in x that have a match in y

In this example, we will only be using the inner_join() function to merge a subset of storms and the whole of damages.

The verbs can be used to modify datasets using ‘pipes’, represented in R as %>%. We have previously seen the application of the mutate verb when we added the new columns ‘storm’ and ‘date’ to the damages and storms dataset respectively. Now, let’s apply the filter() verb to obtain only tropical storm data from storms:

# filter out only the tropical storms
ts <- storms %>% filter(
    stringr::str_detect(status, "tropical storm")
)

Here, we use stringr::str_detect() to detect the string ‘tropical storm’ within the ‘status’ column of storms and store it within the new ts dataframe.

Next, we use the select() function to select all columns but the columns containing the diameter of the hurricane/storm, since this is data that we will not use in this example:

# select a subset of columns
ts <- ts %>% select(!ts_diameter:hu_diameter)

Using ! indicates to the function that you would like to select the complement of the set following the exclamation point. Using : is useful when there are consecutive columns you would like to (de)select. Following this, you will have a pared-down dataset:

Now it’s time to apply our inner_join() operation to merge ts with damages in a new dataframe joined_sd that contains both geophysical information of each tropical storm, as well as its dollar cost of damages.

# joining tropical storms and damages
joined_sd <- inner_join(ts, damages)

where joined_sd has the following form:

Now, we perform some simple operations to obtain the mean dollar cost of damages per windspeed, and arrange each tropical storm in order of decreasing damage per windspeed:

# calculate average cost of damage per windspeed
joined_sd <- joined_storms_damages %>% 
  mutate(damage_per_mph = mean(damages)/wind) %>% 
  arrange(desc(damage_per_mph))

This results in the following final dataset:

From here, you can see that tropical storm Amy resulted in the most expensive damages. Finally, you can also use the summarize() verb to identify the mean of the cost of damages per windspeed of all the tropical storms:

# summarize by mean and sd of cost per windspeed
summary_sd <- joined_sd %>% summarize(
  damage_per_mph=mean(damage_per_mph))

You will find that, on average, each tropical storm costs USD$11,244,531. Plotting the cost of damages with respect to time using

# plot damages wrt time
ggplot(joined_sd, aes(year, damages)) + 
  geom_point() + ylim(min(joined_sd$damages), max(joined_sd$damages)) +
  geom_smooth(method = lm) + ggtitle("Damages ($) with respect to time")

you will find a slight increase in the cost of damages incurred by tropical storms as the years go by.

Conclusion

In this blog post, we walked through two example datasets to demonstrate three sub-packages within the tidyverse: tidyr, lubridate and dplyr. For the full version of this example, please visit the author’s GitHub repository.

References

pacman package – RDocumentation. (2021). Retrieved 14 November 2021, from https://www.rdocumentation.org/packages/pacman/versions/0.5.1

Rudik, I. (2021). AEM 6510 Chapter 10: R and the tidyverse. Presentation.

Tidy data. (2021). Retrieved 14 November 2021, from https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html

Plotting change on maps

or how to replicate the New York Times presidential election shift map

This week’s blogpost is a visualization demo replicating a popular map from last year. The map below shows the shift in voter margin between the 2016 and 2020 Presidential Elections by the two major political parties in the United States. The direction and color of the arrows indicates the party and the length of the arrow indicates the shift. This type of figure can be useful in visualizing many types of spatially distributed changes (e.g. population change in a city, change in GDP per capita, losses and gains). This blogpost shows how to replicate it in Python using commonly used packages.

Screengrab of the original graphic from the NYT website. Original can be found here: https://www.nytimes.com/interactive/2020/11/03/us/elections/results-president.html

Even though the creators of the original provide their 2020 data, their 2016 data is not available so the data I’ll be using came from the MIT Election Data and Science Lab and can be downloaded here: https://doi.org/10.7910/DVN/VOQCHQ. All the code and data to replicate my figure can be found in this repository: https://github.com/antonia-had/election_data_shift

The main packages we’ll be using for this are cartopy and matplotlib to create the map and annotate elements on it, pandas for some simple data analysis and haversine to convert distances on the map (which you might not need if you’re applying the code to a small spatial scale).

First thing we do is load our packages and data. counties.csv contains the latitude and longitude for every country we’ll be plotting. countypres_2000-2020.csv contains our downloaded election data. As you can see in the code comments, I had to clean out some of the datapoints due to inconsistencies or errors. I’ll also only be plotting the contiguous US to simplify the exercise, but you can definitely include code to also plot Alaska and Hawaii in the same figure.

import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import pandas as pd
import cartopy.io.shapereader as shpreader
from haversine import inverse_haversine, Direction

# Read in county position data
pos_data = pd.read_csv('./data/counties.csv', delimiter=',', index_col=0)

# Read in county election data
# Data from https://doi.org/10.7910/DVN/VOQCHQ
# Data points without county FIPS code removed
all_election_data = pd.read_csv('./data/countypres_2000-2020.csv')
# Filter data to only keep years 2016 and 2020
# Dataset reports issues with Alaska data so filter those out too
# Missing data for 2020 for some counties
# County with FIPS code 46113 was assigned a new FIPS code (46102) which is changed in the downloaded data
mask = (all_election_data['year'] >= 2016) & \
       (all_election_data['state'] != 'ALASKA') &\
       (all_election_data['state'] != 'HAWAII') & \
       (all_election_data['county_fips'] != 11001) & \
       (all_election_data['county_fips'] != 51515) & \
       (all_election_data['county_fips'] != 36000)
election_data = all_election_data[mask]

Next we calculate the percentage of votes each party gained at each election and compare the results between the two elections to calculate their shift. A simplifying assumption here is that we’re only focussing on the top two parties (but you can do more with different color arrows for example). We’re also copying the latitude and longitude of each county so everything is in one dataframe.

# Calculate vote percentage per party
election_data['percentagevote'] = election_data['candidatevotes']/election_data['totalvotes'] * 100

# Create new dataframe to store county change results
shift = election_data[['state', 'county_name', 'county_fips']].copy()
# Drop duplicate rows (original dataframe was both 2016 and 2020)
shift = shift.drop_duplicates(['county_fips'])

# Create columns to store change for every party
shift['DEMOCRAT'] = 0.0
shift['REPUBLICAN'] = 0.0

#Create columns for latitude and longitude so everything is in the same dataframe
shift['lat'] = 0.0
shift['lon'] = 0.0

# Iterate through every county and estimate difference in vote share for two major parties
for index, row in shift.iterrows():
    county = row['county_fips']
    for party in ['DEMOCRAT', 'REPUBLICAN']:
        previous_result = election_data.loc[(election_data['year'] == 2016) &
                                            (election_data['county_fips'] == county) &
                                            (election_data['party'] == party)]['percentagevote'].values[0]
        new_result = election_data.loc[(election_data['year'] == 2020) &
                                       (election_data['county_fips'] == county) &
                                       (election_data['party'] == party)]['percentagevote'].values[0]
        # If any of the two results is nan assign zero change
        if pd.isna(new_result) or pd.isna(previous_result):
            shift.at[index, party] = 0
        else:
            shift.at[index, party] = new_result - previous_result
    # Combine lat and long values also so it's all in one dataframe
    shift.at[index, 'lat'] = pos_data.at[county, 'lat']
    shift.at[index, 'lon'] = pos_data.at[county, 'lon']

To create our map we do the following.

Set up matplotlib figure with the map extent of the contiguous United States and use cartopy geometries to add the shapes of all states.

fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(1, 1, 1, projection=ccrs.LambertConformal(), frameon=False)
ax.set_extent([-120, -74, 24, 50], ccrs.PlateCarree())
# Add states shape
shapename = 'admin_1_states_provinces_lakes'
states_shp = shpreader.natural_earth(resolution='110m',
                                     category='cultural', name=shapename)
ax.add_geometries(shpreader.Reader(states_shp).geometries(), ccrs.PlateCarree(),
                  facecolor='#e5e5e5', edgecolor='white', zorder=0)

We then need to determine how the shift should be plotted in each county. A simplifying assumption here is that we’re showing the largest positive shift (i.e., if both parties lost votes we’re only showing a small grey point). There’s several ways to draw an arrow at each point, depending on what you’d like to show and the complexity you’re comfortable with. The way I am showing here is exploiting the matplotlib annotate function, typically used to annotate a figure with text and arrows.

The way I’m going about this is a little mischievous but works: I’m only using the arrow component of it with a blank text annotation and identify a point where each arrow should be pointing to by using each county’s lat and long and the estimated shift. If this was a simple matplotlib figure using cartesian coordinates, calculating the end point would be simple trigonometry. Since latitude and longitude are not on a cartesian plane, we need to convert them using the haversine formula (or its inverse). It’s fairly easy to implement yourself but since there already exceeds a handy python package for it, I’m using that instead. The transform function I am using up top is necessary for matplotlib to know how to transform the points from the annotation function (typically not necessary to do if using, say, ax.scatter()), some explanation of why that is can be found here. The colors and all other customization is done so the figure looks as close as possible to the original.

transform = ccrs.PlateCarree()._as_mpl_transform(ax)
for index, row in shift.iterrows():
    # Determine arrow color
    dem_shift = shift.at[index, 'DEMOCRAT']
    rep_shift = shift.at[index, 'REPUBLICAN']
    # Check if both lost votes, then set arrow to grey
    if dem_shift<0 and rep_shift<0:
        arrow_color = 'grey'
        ax.scatter(shift.at[index, 'lon'], shift.at[index, 'lat'],
                   color=arrow_color, transform=ccrs.PlateCarree(),
                   s=0.5)
    # If at least one of them gained votes
    else:
        if dem_shift >= rep_shift:
            arrow_color = '#1460a8'
            direction = Direction.NORTHWEST
            change = dem_shift
        else:
            arrow_color = '#bb1d2a'
            direction = Direction.NORTHEAST
            change = rep_shift
        end_location = inverse_haversine((shift.at[index, 'lat'], shift.at[index, 'lon']), change*25, direction)[::-1]
        ax.annotate(" ", xytext=(shift.at[index, 'lon'], shift.at[index, 'lat']), xy=end_location,
                    arrowprops=dict(facecolor=arrow_color, edgecolor=arrow_color,
                                    width=0.2, headwidth=3, headlength=5),
                    xycoords=transform, zorder=1)
plt.tight_layout()
plt.savefig('electionshiftmap.png', dpi=300)

The resulting figure looks like this, which I am calling pretty close, considering the dataset differences. Tinkering with colors, widths, lengths and transforms can get you a different look if you’re after that.

Pam agrees:

Easy batch parallelization of code in any language using mpi4py

The simplest form of parallel computing is what’s known as “embarrassingly” parallel processes. These processes involve fully independent runs of a model or script where little or no communication is needed across parallel processes. A common example is Monte Carlo evaluation, when we run a model over an ensemble of inputs. To parallelize an embarrassingly parallel application we simply need to send a set of commands to the cluster telling it to run each sample on a different core (or set of cores). For small applications, this can be done by submitting each run individually. For larger applications, SLURM Job Arrays (which are nicely detailed in Antonia’s post, here) can efficiently batch large number of function calls to independent computing cores. While this method is efficient and effective, I find it sometimes can be hard to keep track of, as you may be submitting tens or hundreds of jobs at a time. An alternative approach to submitting embarrassingly parallel tasks is to utilize MPI with Python to dispatch and organize jobs.

I like the MPI / Python combo because it consolidates all parallel applications into a single job, meaning you have one job to keep track of on a cluster at a time, and one output file generated by the batch set of model runs. I also find Python slightly easier to edit and debug than Bash scripts (which are used to create job arrays). Additionally, it’s very easy to assign each computing core a set of function evaluations to run (this can also be done with Job arrays, but again, I find Python easier to work with). Though Python is the language used to coordinate parallel tasks, we can use it to parallelize code in any language, as I’ll demonstrate below.

In this post I’ll first provide some background on MPI and its Python implementation, mpi4py. Next I’ll provide an example I’ve developed to demonstrate how to batch run a Matlab code on a cluster. The examples presented here are derived from some of Bernardo’s code in his post on Parallel programming in C/C++, which you can find here.

A very light introduction to MPI

MPI stands for “Message Passing Interface” and is the standard library for distributed memory parallelization (for background, see this post). To understand how MPI works, it’s helpful to define some of it’s basic components.

  1. Tasks: I’ll use the term task to define a processor (or group of processors) assigned to perform a specific set of instructions. These instructions may by a single evaluation of a function, or a set of function evaluations
  2. Communicators: A communicator is a group of MPI task units that are permitted to communicate with each other. In advanced MPI applications you may have multiple communicators, but for embarrassingly parallel applications we’ll only use one. The default communicator is called “MPI_COMM_WORLD” (I don’t know why, if anyone does please feel free to share in the comments), and that’s what I’ll work with here.
  3. Ranks: Each MPI task is assigned a unique identifier within the communicator called a rank. The processors running each task can access their own rank number, which will play an important role in how we use MPI for embarrassingly parallel applications.

A example schematic of the MPI_COMM_WORLD communicator with six tasks and their associated ranks is shown below.

mpi4py

MPI is implemented in Python with the mpi4py library. When we run an MPI code on a cluster, MPI creates the communicator and assigns each task a rank, then each task unit independently load the script. The processor/s associated with a task can then access their own unique rank.

The following snip of code loads this library, accesses the communicator and stores the rank of the given process:

# load the mpi4py library
from mpi4py import MPI

# access the MPI COMM WORLD communicator and assign it to a variable
comm = MPI.COMM_WORLD

# get the rank of the current process (different for each process on the cluster)
rank = comm.Get_rank()

Example of using mpi4py to batch parallel jobs

Here, I’ll parallelize the submission of a Matlab script called demoScript.m. This script reads an input file from a specific file location and prints out the contents of that file. For example purposes I’ve created 20 input files, each in their own folders. The folders are called “input_sample_0”, “input_sample_1” etc.. Each input_sample folder contains a file called “sample_data.txt”, which contains one line of text reading: “This is data for run <sample_number>”.

All code for this example can be found on Github, here: https://github.com/davidfgold/mpi4py_blog.git

Batching runs of demoScript.m process involves three components:

  1. Write demoScript.m so that it reads the sample number from the input.
  2. Write a Python script that will use mpi4py to distribute calls of demoScript.m. Here I’ll call this script “callDemoScript.py”
  3. Write a Bash script that sets up your MPI run and calls the Python function. Here I’ll call this script “submitDemoScript.sh”

1. demoScript.m

The demo Matlab script is found below. It reads in two arguments that are called from the command line. The first argument is the rank, which will vary for each task, and the second is the sample number, which will specify which input folder to read from.

%%%%%%%%%%%%%%%%%%%%
% demoScript.m
%
% reads an input file from a given sample number (specified via command line)
% prints output from the sample file associated with the sample number
% also prints the rank for demonstration purposes
%%%%%%%%%%%%%%%%%%%%

% read in command line input
arg_list = argv();
rank = arg_list{1,1}; % rank is the first argument
sample = arg_list{2, 1}; % sample number is the second argument

% Create a string that contains the location of the proper sample directory
sample_out = fileread(strcat("input_sample_", sample, "/sample_data.txt"));

% create a string to print the rank number
rank_call = strcat("This is rank_", rank, ", recieving the following input: \n");

% format the output and print
output = strcat(rank_call, sample_out);
fprintf(output)

2. callDemoScript.py

The second component is a Python script that uses mpi4py to call demoScript.m many times across different tasks. Each task will run a number of samples equal to a variable called “N_SAMPLES_PER_TASK” which will be fed to this script when it is called.

'''
callDemoScript.py

Called to batch demoScript.m across multiple MPI tasks

Reads in the total tasks and number of samples per task from command line.
'''
# load necessary libraries
from mpi4py import MPI
import numpy as np
import sys
import os
import time

# locate the COMM WORLD communicator
comm = MPI.COMM_WORLD

# get the number of the current rank
rank = comm.Get_rank()

# read in arguments from the submission script
TOTAL_TASKS = int(sys.argv[1]) # number of MPI processes
N_SAMPLES_PER_TASK = int(sys.argv[2]) # number of runs per/task

# loop through samples assigned to current rank
for i in range(N_SAMPLES_PER_TASK):
	sample= rank + TOTAL_TASKS * i

	# write the command that will be sent to the terminal (here RUN will replace the {})
	terminal_command = "octave-cli ./demoScript.m {} {} ".format(rank, sample)

	# write the terminal command to the process
	os.system(terminal_command)

	# sleep before submitting the next command
	time.sleep(1) # optional, for memory intensive submissions

comm.Barrier()

submitDemoscript.sh

The final component is a Bash script that will send this MPI job to the cluster. Here I’ll use SLURM to create 4 MPI tasks across 2 Nodes (each node will have 2 associated task). This will create a total of 4 MPI tasks, and each task will be assigned 5 samples to run.

I wrote this for a local cluster at Cornell, note that I had to load two modules to run Python and a third to run Octave (which is used to call Matlab scripts on Linux). I’ll call the Python script with mpirun, and then specify the total number of MPI tasks before making the function call. The output of the script is printed to a text file called demoOutput.txt

# Set up your parallel runs
SAMPLES_PER_TASK=5 # number of runs for each MPI task
N_NODES=2 # number of nodes
TASKS_PER_NODE=2 # number of tasks per node

TOTAL_TASKS=$(($N_NODES*$TASKS_PER_NODE)) # total number of tasks

# Submit the parallel job
#!/bin/bash
#SBATCH -n $(TOTAL_TASKS) -N $(N_NODES)
#SBATCH --time=0:01:00
#SBATCH --job-name=demoMPI4py
#SBATCH --output=output/demo.out
#SBATCH --error=output/demo.err
#SBATCH --exclusive
module load py3-mpi4py
module load py3-numpy
module load octave/6.3.0

mpirun -np $TOTAL_TASKS python3 callDemoScript.py $TOTAL_TASKS $SAMPLES_PER_TASK > demoOutput.txt

Additional resources

Putting some thought into how you design a set of parallel runs can save you a lot of time and headache. The example above has worked well for me when submitting sets of embarrassingly parallel tasks, but each application will be different, so take the time to find the procedure that works best for you. Our blog and the internet are full of resources that can help you parallelize your code, below are some suggestions:

Performing Experiments on HPC Systems

Scaling experiments: how to measure the performance of parallel code on HPC systems

Parallel processing with R on Windows

How to automate scripts on a cluster

Parallelization of C/C++ and Python on Clusters

Developing parallelised code with MPI for dummies, in C (Part 1/2)

Cornell CAC glossery on HPC terms: https://cvw.cac.cornell.edu/main/glossary

A great MPI tutorial I found online: https://mpitutorial.com/tutorials/