The ma-R-velous tidyverse

I am currently taking a class in environmental economics, which of late seems to be doubling as a crash course in the R tidyverse. This blog post will demonstrate the application of several tidyverse sub-packages and commands that I have learned to use through this class. Two datasets will be used: the storms dataset, which contains data on the types of storms that occurred in the United States from 1975 to 2015, and the damages dataset, which is a record of the dollar cost of the damages caused by every tropical storm that occurred from 1975 to 2015. The datasets required for this example can be found in this GitHub repository.

Tidy data and the tidyverse?

‘Tidy’ datasets provide a standardized way to link the physical layout of a dataset with the meaning associated with it [1]. A ‘tidy’ dataset has the following characteristics:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each cell contains only one value of only one data type

On the other hand, the tidyverse is a meta-package that bundles the packages facilitating data ‘tidying’, or data cleaning such that the final form of the dataset is ‘tidy’. You can check the full set of packages bundled into the tidyverse using tidyverse_packages(), which prints the complete list of bundled packages to the console.

Within this set, we will be using the dplyr, tidyr and lubridate packages. The dplyr and tidyr packages are “workhorse” packages – their main purpose is to clean and wrangle large datasets, and they will be the two packages we use most often in this example. The lubridate package is used to convert strings into dates; we will use it later in this example.

In this example, we will transform a ‘messy’ dataset into a ‘tidy’ one, and perform several other operations that are enabled by the R tidyverse.

The pacman package management tool

No, this is not the Toru Iwatani-created, dot-munching yellow puck released in 1980 by Namco. Instead, the pacman tool is a convenient library and package wrapper that combines the functionality of base library-related functions into intuitively named functions [2]. It was created to improve workflow by reducing the time required in calling and re-calling obscurely named functions. Ideally, it should be called at the beginning of the R script, like so:

# Run this line
if (!require("pacman")) install.packages("pacman")

This enables you to automatically install and load packages from base R and the R tidyverse using the p_load function. For this example, we will only require the tidyverse package. Load this package using the following script:

# Then run p_load to load the necessary packages
pacman::p_load(
 tidyverse
)

As previously mentioned, this automatically loads the dplyr, tidyr and lubridate packages that we will need for this example.

Before beginning, make sure that you are running the correct versions of dplyr and tidyr. Check the package versions using the packageVersion('package_name') function. All packages should be at least version 1.0.0.

Working with the datasets

Some key tidyr verbs

With the explanations out of the way, we begin by loading our two datasets:

# Load dataset
storms <- read.csv("storms.csv")
damages <- read.csv("damages.csv")

Note that the damages dataset is in an inconvenient ‘wide’ shape. To convert damages into ‘tidy’ format, we use pivot_longer(). pivot_longer() is one of the four key tidyr verbs, the remaining three being pivot_wider(), separate() and unite(). These verbs do the following:

  1. pivot_longer(): Vertically stretches the data, increasing the number of rows and decreasing the number of columns
  2. pivot_wider(): Horizontally stretches the data, increasing the number of columns and decreasing the number of rows
  3. separate(): Splits a single character column into multiple columns
  4. unite(): Pastes multiple columns into one

For this example, pivot_longer() is used to turn damages from its original wide shape into a long, three-column dataframe whose columns contain each storm’s name, type, and total dollar cost of damages. The code is as follows:

# tidy-up damages
damages <- damages %>% mutate(status = "tropical storm") %>%
  pivot_longer(-status, names_to="name", values_to="damages")

This script first categorizes all the storms within damages as tropical storms, and then assigns the names of each storm to the column ‘name’ and the cost of their damages to ‘damages’. This turns the original dataset:

into the following shape:

This results in a more readable dataset!

It’s a lubridate!

Next, we will return to our storms dataset. Note that the information regarding the date and time for each storm is inconveniently split across four columns:

We use the as_datetime() function within the lubridate package to combine these columns into one easy-to-interpret column called ‘date’. Use the following script:

# Paste four columns and convert to date-time format
storms <- storms %>% mutate(date=as_datetime(
  paste0(year, "-", month, "-", day, " ", hour, ":00:00")))

This pastes the data in the ‘year’, ‘month’, ‘day’ and ‘hour’ columns together and inserts the formatted date and time into the newly added ‘date’ column:

Great! Our storm dataset is now more readable.

Some dplyr verbs and operations

There are five key dplyr verbs, some of which you have already seen:

  1. filter(): Obtains a subset of rows based on their column values
  2. arrange(): Reorders rows (in ascending or descending order) based on their values
  3. select(): Selects columns or variables of interest
  4. mutate(): Creates new columns or variables and automatically appends them to the dataframe
  5. summarize(): Collapses multiple rows into a single summary value, commonly by grouping a pre-selected variable

The dplyr package also comes with a set of join operations, which enable the merging of two dataframes, x and y. The types of operations include:

  1. inner_join(): Matches pairs of observations whenever their values are equal
  2. left_join(): Keeps all observations in x, adding matching columns from y
  3. right_join(): Keeps all observations in y, adding matching columns from x
  4. full_join(): Keeps all observations that appear in either x or y
  5. semi_join(): Keeps all observations in x that have a match in y
  6. anti_join(): Drops all observations in x that have a match in y

In this example, we will only be using the inner_join() function to merge a subset of storms and the whole of damages.

The verbs can be used to modify datasets using ‘pipes’, represented in R as %>%. We have previously seen the application of the mutate verb when we added the new columns ‘status’ and ‘date’ to the damages and storms datasets respectively. Now, let’s apply the filter() verb to obtain only tropical storm data from storms:

# filter out only the tropical storms
ts <- storms %>% filter(
    stringr::str_detect(status, "tropical storm")
)

Here, we use stringr::str_detect() to keep only the rows whose ‘status’ column contains the string ‘tropical storm’, and store them in the new ts dataframe.

Next, we use the select() function to drop the columns containing the diameter of the hurricane/storm, since we will not use this data in this example:

# select a subset of columns
ts <- ts %>% select(!ts_diameter:hu_diameter)

Using ! indicates to the function that you would like to select the complement of the set following the exclamation point. Using : is useful when there are consecutive columns you would like to (de)select. Following this, you will have a pared-down dataset:

Now it’s time to apply our inner_join() operation to merge ts with damages into a new dataframe joined_sd that contains both the geophysical information of each tropical storm and its dollar cost of damages.

# joining tropical storms and damages
joined_sd <- inner_join(ts, damages)

where joined_sd has the following form:

Now, we perform some simple operations to obtain the mean dollar cost of damages per windspeed, and arrange each tropical storm in order of decreasing damage per windspeed:

# calculate average cost of damage per windspeed
joined_sd <- joined_sd %>% 
  mutate(damage_per_mph = mean(damages)/wind) %>% 
  arrange(desc(damage_per_mph))

This results in the following final dataset:

From here, you can see that tropical storm Amy resulted in the most expensive damages. Finally, you can also use the summarize() verb to identify the mean of the cost of damages per windspeed of all the tropical storms:

# summarize by mean and sd of cost per windspeed
summary_sd <- joined_sd %>% summarize(
  damage_per_mph=mean(damage_per_mph))

You will find that, on average, the cost of damages across these tropical storms works out to about USD 11,244,531 per mph of windspeed. Plotting the cost of damages with respect to time using

# plot damages wrt time
ggplot(joined_sd, aes(year, damages)) + 
  geom_point() + ylim(min(joined_sd$damages), max(joined_sd$damages)) +
  geom_smooth(method = lm) + ggtitle("Damages ($) with respect to time")

you will find a slight increase in the cost of damages incurred by tropical storms as the years go by.

Conclusion

In this blog post, we walked through two example datasets to demonstrate three sub-packages within the tidyverse: tidyr, lubridate and dplyr. For the full version of this example, please visit the author’s GitHub repository.

References

[1] Tidy data. (2021). Retrieved 14 November 2021, from https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html

[2] pacman package – RDocumentation. (2021). Retrieved 14 November 2021, from https://www.rdocumentation.org/packages/pacman/versions/0.5.1

[3] Rudik, I. (2021). AEM 6510 Chapter 10: R and the tidyverse. Presentation.

Plotting change on maps

or how to replicate the New York Times presidential election shift map

This week’s blogpost is a visualization demo replicating a popular map from last year. The map below shows the shift in voter margin between the 2016 and 2020 Presidential Elections by the two major political parties in the United States. The direction and color of the arrows indicate the party, and the length of each arrow indicates the size of the shift. This type of figure can be useful in visualizing many types of spatially distributed changes (e.g. population change in a city, change in GDP per capita, losses and gains). This blogpost shows how to replicate it in Python using commonly used packages.

Screengrab of the original graphic from the NYT website. Original can be found here: https://www.nytimes.com/interactive/2020/11/03/us/elections/results-president.html

Even though the creators of the original provide their 2020 data, their 2016 data is not available so the data I’ll be using came from the MIT Election Data and Science Lab and can be downloaded here: https://doi.org/10.7910/DVN/VOQCHQ. All the code and data to replicate my figure can be found in this repository: https://github.com/antonia-had/election_data_shift

The main packages we’ll be using for this are cartopy and matplotlib to create the map and annotate elements on it, pandas for some simple data analysis and haversine to convert distances on the map (which you might not need if you’re applying the code to a small spatial scale).

First thing we do is load our packages and data. counties.csv contains the latitude and longitude for every county we’ll be plotting. countypres_2000-2020.csv contains our downloaded election data. As you can see in the code comments, I had to clean out some of the datapoints due to inconsistencies or errors. I’ll also only be plotting the contiguous US to simplify the exercise, but you can definitely include code to also plot Alaska and Hawaii in the same figure.

import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import pandas as pd
import cartopy.io.shapereader as shpreader
from haversine import inverse_haversine, Direction

# Read in county position data
pos_data = pd.read_csv('./data/counties.csv', delimiter=',', index_col=0)

# Read in county election data
# Data from https://doi.org/10.7910/DVN/VOQCHQ
# Data points without county FIPS code removed
all_election_data = pd.read_csv('./data/countypres_2000-2020.csv')
# Filter data to only keep years 2016 and 2020
# Dataset reports issues with Alaska data so filter those out too
# Missing data for 2020 for some counties
# County with FIPS code 46113 was assigned a new FIPS code (46102) which is changed in the downloaded data
mask = (all_election_data['year'] >= 2016) & \
       (all_election_data['state'] != 'ALASKA') &\
       (all_election_data['state'] != 'HAWAII') & \
       (all_election_data['county_fips'] != 11001) & \
       (all_election_data['county_fips'] != 51515) & \
       (all_election_data['county_fips'] != 36000)
election_data = all_election_data[mask]

Next we calculate the percentage of votes each party gained at each election and compare the results between the two elections to calculate their shift. A simplifying assumption here is that we’re only focusing on the top two parties (but you can do more, with different color arrows for example). We’re also copying the latitude and longitude of each county so everything is in one dataframe.

# Calculate vote percentage per party
election_data['percentagevote'] = election_data['candidatevotes']/election_data['totalvotes'] * 100

# Create new dataframe to store county change results
shift = election_data[['state', 'county_name', 'county_fips']].copy()
# Drop duplicate rows (original dataframe was both 2016 and 2020)
shift = shift.drop_duplicates(['county_fips'])

# Create columns to store change for every party
shift['DEMOCRAT'] = 0.0
shift['REPUBLICAN'] = 0.0

#Create columns for latitude and longitude so everything is in the same dataframe
shift['lat'] = 0.0
shift['lon'] = 0.0

# Iterate through every county and estimate difference in vote share for two major parties
for index, row in shift.iterrows():
    county = row['county_fips']
    for party in ['DEMOCRAT', 'REPUBLICAN']:
        previous_result = election_data.loc[(election_data['year'] == 2016) &
                                            (election_data['county_fips'] == county) &
                                            (election_data['party'] == party)]['percentagevote'].values[0]
        new_result = election_data.loc[(election_data['year'] == 2020) &
                                       (election_data['county_fips'] == county) &
                                       (election_data['party'] == party)]['percentagevote'].values[0]
        # If any of the two results is nan assign zero change
        if pd.isna(new_result) or pd.isna(previous_result):
            shift.at[index, party] = 0
        else:
            shift.at[index, party] = new_result - previous_result
    # Combine lat and long values also so it's all in one dataframe
    shift.at[index, 'lat'] = pos_data.at[county, 'lat']
    shift.at[index, 'lon'] = pos_data.at[county, 'lon']

To create our map we do the following.

Set up a matplotlib figure with the map extent of the contiguous United States and use cartopy geometries to add the shapes of all states.

fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(1, 1, 1, projection=ccrs.LambertConformal(), frameon=False)
ax.set_extent([-120, -74, 24, 50], ccrs.PlateCarree())
# Add states shape
shapename = 'admin_1_states_provinces_lakes'
states_shp = shpreader.natural_earth(resolution='110m',
                                     category='cultural', name=shapename)
ax.add_geometries(shpreader.Reader(states_shp).geometries(), ccrs.PlateCarree(),
                  facecolor='#e5e5e5', edgecolor='white', zorder=0)

We then need to determine how the shift should be plotted in each county. A simplifying assumption here is that we’re showing the largest positive shift (i.e., if both parties lost votes we’re only showing a small grey point). There are several ways to draw an arrow at each point, depending on what you’d like to show and the complexity you’re comfortable with. The approach I show here exploits the matplotlib annotate function, typically used to annotate a figure with text and arrows.

The way I’m going about this is a little mischievous but works: I use only the arrow component with a blank text annotation, and identify the point each arrow should point to using each county’s latitude and longitude and the estimated shift. If this was a simple matplotlib figure using cartesian coordinates, calculating the end point would be simple trigonometry. Since latitude and longitude are not on a cartesian plane, we need to convert them using the haversine formula (or its inverse). It’s fairly easy to implement yourself, but since there already exists a handy python package for it, I’m using that instead. The transform function I am using up top is necessary for matplotlib to know how to transform the points from the annotation function (typically not necessary if using, say, ax.scatter()); some explanation of why that is can be found here. The colors and all other customization are done so the figure looks as close as possible to the original.

transform = ccrs.PlateCarree()._as_mpl_transform(ax)
for index, row in shift.iterrows():
    # Determine arrow color
    dem_shift = shift.at[index, 'DEMOCRAT']
    rep_shift = shift.at[index, 'REPUBLICAN']
    # Check if both lost votes, then set arrow to grey
    if dem_shift<0 and rep_shift<0:
        arrow_color = 'grey'
        ax.scatter(shift.at[index, 'lon'], shift.at[index, 'lat'],
                   color=arrow_color, transform=ccrs.PlateCarree(),
                   s=0.5)
    # If at least one of them gained votes
    else:
        if dem_shift >= rep_shift:
            arrow_color = '#1460a8'
            direction = Direction.NORTHWEST
            change = dem_shift
        else:
            arrow_color = '#bb1d2a'
            direction = Direction.NORTHEAST
            change = rep_shift
        end_location = inverse_haversine((shift.at[index, 'lat'], shift.at[index, 'lon']), change*25, direction)[::-1]
        ax.annotate(" ", xytext=(shift.at[index, 'lon'], shift.at[index, 'lat']), xy=end_location,
                    arrowprops=dict(facecolor=arrow_color, edgecolor=arrow_color,
                                    width=0.2, headwidth=3, headlength=5),
                    xycoords=transform, zorder=1)
plt.tight_layout()
plt.savefig('electionshiftmap.png', dpi=300)

The resulting figure looks like this, which I am calling pretty close, considering the dataset differences. Tinkering with colors, widths, lengths and transforms can get you a different look if you’re after that.

Pam agrees:

Easy batch parallelization of code in any language using mpi4py

The simplest form of parallel computing is what’s known as “embarrassingly” parallel processes. These processes involve fully independent runs of a model or script where little or no communication is needed across parallel processes. A common example is Monte Carlo evaluation, where we run a model over an ensemble of inputs. To parallelize an embarrassingly parallel application we simply need to send a set of commands to the cluster telling it to run each sample on a different core (or set of cores). For small applications, this can be done by submitting each run individually. For larger applications, SLURM Job Arrays (which are nicely detailed in Antonia’s post, here) can efficiently batch large numbers of function calls to independent computing cores. While this method is efficient and effective, I find it sometimes can be hard to keep track of, as you may be submitting tens or hundreds of jobs at a time. An alternative approach to submitting embarrassingly parallel tasks is to utilize MPI with Python to dispatch and organize jobs.

I like the MPI / Python combo because it consolidates all parallel applications into a single job, meaning you have one job to keep track of on a cluster at a time, and one output file generated by the batch set of model runs. I also find Python slightly easier to edit and debug than Bash scripts (which are used to create job arrays). Additionally, it’s very easy to assign each computing core a set of function evaluations to run (this can also be done with Job arrays, but again, I find Python easier to work with). Though Python is the language used to coordinate parallel tasks, we can use it to parallelize code in any language, as I’ll demonstrate below.

In this post I’ll first provide some background on MPI and its Python implementation, mpi4py. Next I’ll provide an example I’ve developed to demonstrate how to batch run a Matlab code on a cluster. The examples presented here are derived from some of Bernardo’s code in his post on Parallel programming in C/C++, which you can find here.

A very light introduction to MPI

MPI stands for “Message Passing Interface” and is the standard library for distributed memory parallelization (for background, see this post). To understand how MPI works, it’s helpful to define some of its basic components.

  1. Tasks: I’ll use the term task to define a processor (or group of processors) assigned to perform a specific set of instructions. These instructions may be a single evaluation of a function, or a set of function evaluations
  2. Communicators: A communicator is a group of MPI task units that are permitted to communicate with each other. In advanced MPI applications you may have multiple communicators, but for embarrassingly parallel applications we’ll only use one. The default communicator is called “MPI_COMM_WORLD” (I don’t know why, if anyone does please feel free to share in the comments), and that’s what I’ll work with here.
  3. Ranks: Each MPI task is assigned a unique identifier within the communicator called a rank. The processors running each task can access their own rank number, which will play an important role in how we use MPI for embarrassingly parallel applications.

An example schematic of the MPI_COMM_WORLD communicator with six tasks and their associated ranks is shown below.

mpi4py

MPI is implemented in Python with the mpi4py library. When we run an MPI code on a cluster, MPI creates the communicator and assigns each task a rank, then each task independently loads the script. The processor(s) associated with a task can then access their own unique rank.

The following snip of code loads this library, accesses the communicator and stores the rank of the given process:

# load the mpi4py library
from mpi4py import MPI

# access the MPI COMM WORLD communicator and assign it to a variable
comm = MPI.COMM_WORLD

# get the rank of the current process (different for each process on the cluster)
rank = comm.Get_rank()
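
Note that, besides the rank, you can also query the total number of tasks in the communicator with comm.Get_size(). Here is a minimal sketch of both calls together (the file name is just for illustration and is not part of the example below):

# count_tasks.py -- a minimal sketch, not part of the example repository
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # unique ID of this task within the communicator
size = comm.Get_size()   # total number of tasks in the communicator

# each task prints its own rank; run with e.g.: mpirun -np 4 python3 count_tasks.py
print("This is rank {} of {} tasks".format(rank, size))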

Example of using mpi4py to batch parallel jobs

Here, I’ll parallelize the submission of a Matlab script called demoScript.m. This script reads an input file from a specific file location and prints out the contents of that file. For example purposes I’ve created 20 input files, each in its own folder. The folders are called “input_sample_0”, “input_sample_1”, etc. Each input_sample folder contains a file called “sample_data.txt”, which contains one line of text reading: “This is data for run <sample_number>”.

All code for this example can be found on Github, here: https://github.com/davidfgold/mpi4py_blog.git
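
If you would like to recreate the input folders yourself instead of cloning them, a short helper along these lines will do it (this snippet is mine, not part of the repository; it simply follows the naming convention described above):

# make_inputs.py -- generate the 20 input_sample_* folders and their data files
import os

for i in range(20):
    folder = "input_sample_{}".format(i)
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, "sample_data.txt"), "w") as f:
        f.write("This is data for run {}\n".format(i))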

Batching runs of demoScript.m involves three components:

  1. Write demoScript.m so that it reads the sample number from the input.
  2. Write a Python script that will use mpi4py to distribute calls of demoScript.m. Here I’ll call this script “callDemoScript.py”
  3. Write a Bash script that sets up your MPI run and calls the Python function. Here I’ll call this script “submitDemoScript.sh”

1. demoScript.m

The demo Matlab script is found below. It reads in two arguments passed from the command line. The first argument is the rank, which will vary for each task, and the second is the sample number, which specifies which input folder to read from.

%%%%%%%%%%%%%%%%%%%%
% demoScript.m
%
% reads an input file from a given sample number (specified via command line)
% prints output from the sample file associated with the sample number
% also prints the rank for demonstration purposes
%%%%%%%%%%%%%%%%%%%%

% read in command line input
arg_list = argv();
rank = arg_list{1,1}; % rank is the first argument
sample = arg_list{2, 1}; % sample number is the second argument

% Create a string that contains the location of the proper sample directory
sample_out = fileread(strcat("input_sample_", sample, "/sample_data.txt"));

% create a string to print the rank number
rank_call = strcat("This is rank_", rank, ", receiving the following input: \n");

% format the output and print
output = strcat(rank_call, sample_out);
fprintf(output)

2. callDemoScript.py

The second component is a Python script that uses mpi4py to call demoScript.m many times across different tasks. Each task will run a number of samples equal to a variable called “N_SAMPLES_PER_TASK” which will be fed to this script when it is called.

'''
callDemoScript.py

Called to batch demoScript.m across multiple MPI tasks

Reads in the total tasks and number of samples per task from command line.
'''
# load necessary libraries
from mpi4py import MPI
import numpy as np
import sys
import os
import time

# locate the COMM WORLD communicator
comm = MPI.COMM_WORLD

# get the number of the current rank
rank = comm.Get_rank()

# read in arguments from the submission script
TOTAL_TASKS = int(sys.argv[1]) # number of MPI processes
N_SAMPLES_PER_TASK = int(sys.argv[2]) # number of runs per task

# loop through samples assigned to current rank
for i in range(N_SAMPLES_PER_TASK):
	sample= rank + TOTAL_TASKS * i

	# write the command that will be sent to the terminal (the rank and sample numbers replace the {} placeholders)
	terminal_command = "octave-cli ./demoScript.m {} {} ".format(rank, sample)

	# write the terminal command to the process
	os.system(terminal_command)

	# sleep before submitting the next command
	time.sleep(1) # optional, for memory intensive submissions

comm.Barrier()

3. submitDemoScript.sh

The final component is a Bash script that will send this MPI job to the cluster. Here I’ll use SLURM to create 4 MPI tasks across 2 nodes (each node will have 2 associated tasks), and each task will be assigned 5 samples to run.

I wrote this for a local cluster at Cornell; note that I had to load two modules to run Python and a third to run Octave (which is used to run Matlab scripts on Linux). I’ll call the Python script with mpirun, specifying the total number of MPI tasks before making the function call. The output of the script is printed to a text file called demoOutput.txt

#!/bin/bash
# SBATCH directives must be the first non-comment lines of the script, and they
# cannot expand shell variables, so the task and node counts are literals here
#SBATCH -n 4 -N 2
#SBATCH --time=0:01:00
#SBATCH --job-name=demoMPI4py
#SBATCH --output=output/demo.out
#SBATCH --error=output/demo.err
#SBATCH --exclusive

# Set up your parallel runs
SAMPLES_PER_TASK=5 # number of runs for each MPI task
N_NODES=2 # number of nodes
TASKS_PER_NODE=2 # number of tasks per node

TOTAL_TASKS=$(($N_NODES*$TASKS_PER_NODE)) # total number of tasks

# load modules for Python (with mpi4py and numpy) and Octave
module load py3-mpi4py
module load py3-numpy
module load octave/6.3.0

# Submit the parallel job
mpirun -np $TOTAL_TASKS python3 callDemoScript.py $TOTAL_TASKS $SAMPLES_PER_TASK > demoOutput.txt

Additional resources

Putting some thought into how you design a set of parallel runs can save you a lot of time and headache. The example above has worked well for me when submitting sets of embarrassingly parallel tasks, but each application will be different, so take the time to find the procedure that works best for you. Our blog and the internet are full of resources that can help you parallelize your code, below are some suggestions:

Performing Experiments on HPC Systems

Scaling experiments: how to measure the performance of parallel code on HPC systems

Parallel processing with R on Windows

How to automate scripts on a cluster

Parallelization of C/C++ and Python on Clusters

Developing parallelised code with MPI for dummies, in C (Part 1/2)

Cornell CAC glossary of HPC terms: https://cvw.cac.cornell.edu/main/glossary

A great MPI tutorial I found online: https://mpitutorial.com/tutorials/

Tips for Creating Watershed Maps in R

There have been a few posts on this blog on creating watershed maps (here, here, and here), but this post is going to be focused on some of my recent experiences on creating watershed maps in R with files that may be missing attributes, are in the wrong projection, and contain data that need to be clipped to a specific boundary shapefile. There are lots of packages that exist to do one or more of these things, but anyone who has ever tried to create watershed maps in R knows that there isn’t one package that does it all. My main goal for this post is to outline the most efficient workflow and use of packages that also allow for the most compatibility when plotting shapefiles and raster files in one figure.

In this post, we are going to be creating a map of the Tuolumne River Basin boundary and plot elevation data within the basin. All the data are found here. First we will read in the Tuolumne boundary shape file (.shp) and the elevation raster file (.asc which is an ASCII file) using the appropriate functions and do some preliminary plotting to see what we have.

#Import libraries 

library(rgdal)
library(ggplot2)
library(raster)

#Read in Tuolumne shapefile

tuolumne.basin <- readOGR(dsn = "doi_10.6071_M3FH3D__v5/tuolumne_merced_data_2009-2015/Merced_Tuolumne_Dataset_SpatialData/SpatialData/Tuolumne_utm.shp")

#Read in elevation raster

elevation.raster = raster("doi_10.6071_M3FH3D__v5/tuolumne_merced_data_2009-2015/Merced_Tuolumne_Dataset_SpatialData/SpatialData/merced_tuolumne_100mdem_utm.asc")

#Plot the files
 
ggplot() +  geom_polygon(data = tuolumne.basin, aes(x = long, y = lat, group = group), colour = "dark red", fill = NA)
plot(elevation.raster)
Raw shapefile and raster data

So we have the pieces that we need to build the map, but notice that the latitude and longitude are in the wrong projection. We can use the following command to check what projection the shapefile is in:

proj4string(tuolumne.basin) 

We see that the output is: “+proj=utm +zone=11 +datum=NAD83 +units=m +no_defs”. So we are in the Universal Transverse Mercator coordinate system, but we should change to WGS 84. We can do this using the function “spTransform” which will swap out the projection by adjusting the CRS (Coordinate Reference System) attribute of the shapefile. You can use “proj4string” to verify that the transformation took place.

 tuolumne.basin.transformed <- spTransform(tuolumne.basin, CRS("+proj=longlat +ellps=WGS84 +datum=WGS84"))

Now we need to transform the raster file’s coordinates. Note that the raster file doesn’t have an associated coordinate reference system listed. If you try to change the projection at this point, you will get an error. This is a minor inconvenience since we know that the coordinate system should match that of the raw Tuolumne shapefile, so we can just insert the original coordinate system as a string under the projargs attribute. Then we can transform it to match the coordinate system of the transformed shapefile using “projectRaster”.

elevation.raster@crs@projargs <- "+proj=utm +zone=11 +datum=NAD83 +units=m +no_defs" 
  
elevation.raster.transformed <- projectRaster(elevation.raster,crs=crs(tuolumne.basin.transformed))

Great, now we have all the data in the right projection. Now we have to clip the raster layer to show only the data in the bounds of our shapefile. We first use the “crop” function in the raster library to clip the layer based on the extent of the shapefile boundary, and then apply the “mask” function to the cropped raster. It is important to do both, otherwise the clip will not work!

elevation.raster.transformed.cropped <- crop(elevation.raster.transformed, extent(tuolumne.basin.transformed))
elevation.raster.transformed.cropped <- mask(elevation.raster.transformed.cropped, tuolumne.basin.transformed)

Now we need to get the appropriate elevation values and coordinates from the raster object so that we can plot it using ggplot. When we use ggplot here, notice that we only need geom_raster and the elevation data, since the clipped data will perfectly follow the shapefile boundary.

#Isolate elevation values from the raster file

val <- getValues(elevation.raster.transformed.cropped)
xy <-as.data.frame(xyFromCell(elevation.raster.transformed.cropped,1:ncell(elevation.raster.transformed.cropped)))
xy <- cbind(xy,val)

#Plot it!

ggplot()+geom_raster(data=xy, aes(x=x, y=y, fill=val))+ scale_fill_viridis_c()+theme_bw()
A not so pretty watershed figure

It’s almost perfect aside from that gray box that results from clipping and masking. After we clip and mask, all values outside the boundary of the shapefile are converted to NAs, which fall outside the bounds of our color scale. To fix this, we simply insert an additional argument to scale_fill_viridis_c() and we also make some additional aesthetic changes to the theme.

#Final plot function

ggplot()+geom_raster(data=xy, aes(x=x, y=y, fill=val))+ scale_fill_viridis_c(na.value=NA,name = "Elevation (m)")+theme_bw()+ggtitle("Tuolumne River Basin Elevation (m)")+xlab("Longitude") + ylab("Latitude")+theme(text = element_text(size = 20)) 
A pretty watershed figure!

MORDM Basics VI: Processing the output and reevaluating for robustness

In the previous post, we conducted a basic WaterPaths tutorial in which we ran a simulation-optimization of the North Carolina Research Triangle test case (Trindade et al., 2019) across 1,000 possible futures, or states of the world (SOWs). To briefly recap, the Research Triangle test case consists of three water utilities in Cary (C), Durham (D) and Raleigh (R). Cary is the main supplier, having a water treatment plant of its own. Durham and Raleigh purchase water from Cary via treated transfers.

Having obtained the .set file containing the Pareto-optimal decision variables and their respective performance objective values, we will now walk through the .set file processing and visualize the decision variables and performance objective space.

Understanding the .set file

First, it is important to make sense of the .set file. The NC_output_MS_S3_N1000.set file should have between 30 and 80 rows, and a total of 35 columns. The first 20 columns contain values of the decision variables. Note that only Durham and Raleigh have water transfer triggers, as they purchase treated water from Cary.

  1. Restriction trigger, RT (C, D, R)
  2. Transfer trigger, TT (D, R)
  3. Jordan Lake allocation, JLA (C, D, R)
  4. Reserve fund contribution as a percentage of total annual revenue, RC (C, D, R)
  5. Insurance trigger, IT (C, D, R)
  6. Insurance payments as a percentage of total annual revenue, IP (C, D, R)
  7. Infrastructure trigger, INF (C, D, R)

The last 15 columns contain the objective values for the following performance objectives of all three utilities:

  1. Reliability (REL) to be maximized
  2. Restriction frequency (RF) to be minimized
  3. Infrastructure net present cost (INF_NPC) to be minimized
  4. Peak financial cost of drought mitigation actions (PFC) to be minimized
  5. Worst-case financial cost of drought mitigation actions (WCC) to be minimized
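
With this column layout in mind, you can sanity-check the file’s shape yourself before post-processing. The snippet below is just a quick check of mine, not part of the tutorial’s code, and it assumes the .set file is plain whitespace-delimited numbers (any lines starting with ‘#’ are skipped by np.loadtxt by default):

import numpy as np

# load the reference set and confirm it has the expected 35 columns
ref_set = np.loadtxt("NC_output_MS_S3_N1000.set")
print(ref_set.shape)                 # expect (number of solutions, 35)

# split into the 20 decision variable columns and the 15 objective columns
dvs, objs = ref_set[:, :20], ref_set[:, 20:]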

This reference set needs to be processed to output a .csv file to enable reevaluation for robustness analysis. To do so, run the post_processing.py file found in this GitHub repository in the command line:

python post_processing.py

In addition to post-processing the optimization output files, this script also conducts a regional minimax operation, where each regional performance objective is taken to be the objective value of the worst-performing utility (Gold et al., 2019).
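
Conceptually, the minimax operation boils down to something like the sketch below, which continues from the objs array in the earlier snippet. The ordering of the objective columns here is an assumption (reliability for C, D, R, then restriction frequency for C, D, R, and so on); check post_processing.py for the actual ordering.

# the regional value of a maximization objective (reliability) is the minimum
# across utilities; for a minimization objective (e.g. restriction frequency)
# it is the maximum, i.e. the worst-performing utility in both cases
rel = objs[:, 0:3]               # assumed: reliability of C, D, R
rf = objs[:, 3:6]                # assumed: restriction frequency of C, D, R

regional_rel = rel.min(axis=1)
regional_rf = rf.max(axis=1)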

Running post_processing.py should output two files:

  1. NC_refset.csv: Contains no header row. This is the file that will be used to run the re-evaluation for robustness analysis in the next blog post.
  2. NC_dvs_objs.csv: Contains a header row. This file contains the labeled decision variables and objectives, including the minimax regional performance objectives. It will be used for visualizing the reference set’s decision variables and performance objectives.

Visualizing the reference set

Due to the high number of decision variables, we utilize parallel axis plots to identify how varying the decision variables can potentially affect certain performance objectives. Here, we use the regional reliability performance objective, REL, as an example. Figure 1 below demonstrates how all decision variables vary with regional reliability.

Figure 1: All decision variables for the three utilities. A darker blue indicates a higher degree of reliability.

From Figure 1, most solutions found via the optimization conducted in the previous blog post seem to have relatively high reliability across the full range of decision variable values. It is unclear how each decision variable might affect regional reliability. It is thus more helpful to identify specific sets of decision variables (or policies) that enable the achievement of reliability beyond a certain threshold.

With this in mind, we assume that all members of the Triangle require that their collective reliability be at least 98%. This results in the following figure:

Figure 2: All decision variables across the three utilities. The dark lines represent the policies that are at least 98% reliable.

Figure 2 has one clear takeaway: Pareto-optimality does not indicate satisfactory performance. In addition, setting this threshold makes the effects of each decision variable clearer. It can be seen that regional reliability is significantly affected by both Durham and Raleigh’s infrastructure triggers (INF). Desirable levels of regional reliability can be achieved when Durham sets a high INF value. Conversely, Raleigh can set lower INF values and still benefit from satisfactory reliability. Figure 2 also shows that having Durham set a high insurance trigger (IT) may benefit the region in terms of reliability.

However, there are caveats. Higher INF and IT values for Durham imply that the financial burden of investment and insurance payments is borne by Raleigh and Cary, as Durham is able to withstand more risk without having to trigger an infrastructure investment or insurance payment. This may affect how each member utility perceives its individual risks and benefits from forming a cooperative contract.

The code to plot these figures can be found under ‘refset_parallel.py’ in the repository.
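
If you just want a rough version of this kind of plot before digging into refset_parallel.py, pandas ships with a basic parallel-coordinates helper. The sketch below is mine, not the repository’s code, and it assumes that the first 20 columns of NC_dvs_objs.csv hold the decision variables and that the regional reliability column is named ‘REL_regional’ (check the header row for the actual names):

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

dvs_objs = pd.read_csv("NC_dvs_objs.csv")
dv_cols = list(dvs_objs.columns[:20])     # assumed: first 20 columns are decision variables

# normalize each decision variable to [0, 1] so the axes are comparable
dv_norm = (dvs_objs[dv_cols] - dvs_objs[dv_cols].min()) / \
          (dvs_objs[dv_cols].max() - dvs_objs[dv_cols].min())

# label each policy by whether it clears the reliability threshold
# (use 98 instead of 0.98 if reliability is stored as a percentage)
dv_norm["label"] = (dvs_objs["REL_regional"] >= 0.98).map(
    {True: "REL >= 98%", False: "REL < 98%"})

parallel_coordinates(dv_norm, "label", color=["#1460a8", "#cccccc"])
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()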

Robustness analysis and what’s next

So how is setting a threshold value of regional reliability significant?

Within the MORDM framework, robustness is defined using a multivariate satisficing metric (Gold et al., 2019). Depending on the requirements of the stakeholders, a set of criteria is defined that will then be used to distinguish between success (all criteria are met) and failure (at least one criterion is not met). Using these criteria, the set of Pareto-optimal policies is simulated across a number of uncertain SOWs. Each policy’s robustness is then represented by the percentage of SOWs in which it meets the minimum performance criteria that have been set.
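
As a sketch of what that metric looks like in code, suppose the reevaluation results for a single policy are stored with one row per SOW and columns for the regional objectives (the file and column names below are hypothetical placeholders, not outputs of the tutorial):

import pandas as pd

# one row per SOW for a single policy (hypothetical file name)
sow_results = pd.read_csv("policy_0_reevaluation.csv")

# a SOW counts as a success only if every criterion is met simultaneously
meets_criteria = (sow_results["REL_regional"] >= 0.98) & \
                 (sow_results["RF_regional"] <= 0.20)

# robustness = fraction of SOWs in which the policy succeeds
robustness = meets_criteria.mean()
print("Robustness: {:.1%}".format(robustness))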

In this post, we processed the reference set and visualized its decision variable space with respect to each variable’s effect on the reliability performance objective. A similar process can be repeated across all utilities for all performance objectives.

Using the processed reference set, we will conduct multi-criterion robustness analysis using two criteria:

  1. Regional reliability should be at least 98%
  2. Regional restriction frequency should be less than or equal to 20%

We will also conduct sensitivity analysis to identify the decision variables that most impact regional robustness. Finally, we will conduct scenario discovery to identify SOWs that may cause the policies to fail.

References

Gold, D. F., Reed, P. M., Trindade, B. C., & Characklis, G. W. (2019). Identifying actionable compromises: Navigating multi‐city robustness conflicts to discover cooperative safe operating spaces for regional water supply portfolios. Water Resources Research, 55(11), 9024–9050. https://doi.org/10.1029/2019wr025462

Trindade, B. C., Reed, P. M., & Characklis, G. W. (2019). Deeply uncertain pathways: Integrated multi-city regional water supply infrastructure investment and portfolio management. Advances in Water Resources, 134, 103442. https://doi.org/10.1016/j.advwatres.2019.103442

A non-intimidating introduction to parallel computing with Numba

This blog post is adapted from material I learned during the 2021 San Diego Supercomputer Center (SDSC) Summer Institute. This was an introductory boot camp to high-performance computing (HPC), and one of the modules taught the application of Numba for in-line parallelization and speeding up of Python code.

What is Numba?

According to its official web page, Numba is a just-in-time (JIT) compiler that translates subsets of Python and NumPy code into fast machine code, enabling it to run at speeds approaching those of C or Fortran. This is because JIT compilation compiles specific lines of code only when they are needed. Numba also caches the compiled machine code for each combination of input data types passed to a function, which eliminates the need for recompilation every time the function is called again with the same data types.

This blog post will demonstrate a simple example of using Numba and its most commonly-used decorator, @jit, via Jupyter Notebook. The Binder file containing all the executable code can be found here.

Note: The ‘@’ symbol is used to indicate the use of a decorator

Installing Numba and Setting up the Jupyter Notebook

First, in your command prompt, enter:

pip install numba

Alternatively, you can also use:

conda install numba

Next, import Numba:

import numpy as np
import numba
from numba import jit
from numba import vectorize

Great! Now let’s move onto using the @jit decorator.

Using @jit for executing functions on the CPU

The @jit decorator works best on numerical functions that use NumPy. It has two modes: nopython mode and object mode. Setting nopython=True tells the compiler to compile the entire decorated function without involving the Python interpreter. This setting leads to the best performance. However, in the case when:

  1. nopython=True fails
  2. nopython=False, or
  3. nopython is not set at all

the compiler defaults to object mode. Then, Numba will identify the loops that it can lift out and compile into machine code, and will run the remaining code in the interpreter.

Here, @jit is demonstrated on a simple matrix multiplication function:

# a function that multiplies a matrix by a vector, one row at a time
@jit(nopython=True)
def matrix_multiplication(A, x):
    b = np.empty(shape=(x.shape[0],1), dtype=np.float64)
    for i in range(x.shape[0]):
        b[i] = np.dot(A[i,:], x)
    return b

Remember – the use of @jit means that this function has not been compiled yet! Compilation only happens when you call the function:

A = np.random.rand(10, 10)
x = np.random.rand(10, 1)
matrix_multiplication(A,x)

But how much faster is Numba really? To find out, some benchmarking is in order. Jupyter Notebook has a handy magic command called %timeit that runs a statement many times in a loop to get its average execution time, and it can be used as follows:

%timeit matrix_multiplication(A,x)

# 11.4 µs ± 7.34 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Numba has a special .py_func attribute that effectively allows the decorated function to run as the original uncompiled Python function. Using this to compare its runtime to that of the decorated version,

%timeit matrix_multiplication.py_func(A,x)

# 35.5 µs ± 3.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

From here, you can see that the Numba version runs about 3 times faster than the plain NumPy version. In addition to NumPy arrays, Numba also supports tuples, integers, floats, and Python lists. All other Python features supported by Numba can be found here.
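
You can also peek at the type specializations Numba has cached for a function through its signatures attribute, which is one way to see the per-data-type compilation described earlier in action (the exact output formatting varies by Numba version):

# list the argument-type combinations compiled and cached so far;
# after the float64 calls above, this should show a single float64-based signature
print(matrix_multiplication.signatures)

# for a detailed, line-by-line view of the inferred types, you can also use:
# matrix_multiplication.inspect_types()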

Besides explicitly declaring @jit at the start of a function, Numba makes it simple to turn a NumPy function into a Numba function by attaching jit(nopython=True) to the original function. This essentially uses the @jit decorator as a function. The function to calculate absolute percentage relative error demonstrates how this is done:

# Calculate percentage relative error
def numpy_re(x, true):
    return np.abs(((x - true)/true))*100

numba_re = jit(nopython=True)(numpy_re)

And we can see how the Numba version is faster:

%timeit numpy_re(x, 0.66)
%timeit numba_re(x, 0.66)

where the NumPy version takes approximately 2.61 microseconds to run, while the Numba version takes 687 nanoseconds.

Inline parallelization with Numba

The @jit decorator can also be used to enable inline parallelization by setting its parallelization pass with parallel=True. Parallelization in Numba is done via multi-threading, which essentially creates threads of code that are distributed over all the available CPU cores. An example of this can be seen in the code snippet below, describing a function that calculates the normal distribution of a set of data with a given mean and standard deviation:

SQRT_2PI = np.sqrt(2 * np.pi)

@jit(nopython=True, parallel=True)
def normals(x, means, sds):
    n = means.shape[0]
    result = np.exp(-0.5*((x - means)/sds)**2)
    return (1 / (sds * SQRT_2PI)) * result

As usual, the function must be compiled:

means = np.random.uniform(-1,1, size=10**8)
sds = np.random.uniform(0.1, 0.2, size=10**8)

normals(0.6, means, sds)

To appreciate the speed-up that Numba’s multi-threading provides, compare the runtime for this with:

  1. A decorated version of the function with a disabled parallel pass
  2. The uncompiled, original NumPy function

The first example can be timed by:

normals_deco_nothread = jit(nopython=True)(normals.py_func)
%timeit normals_deco_nothread(0.6, means, sds)

# 3.24 s ± 757 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The first line of the code snippet first makes an uncompiled copy of the normals function, and then applies the @jit decorator to it. This effectively creates a version of normals that uses @jit, but is not multi-threaded. This run of the function took approximately 3.2 seconds.

For the second example, simply:

%timeit normals.py_func(0.6, means, sds)

# 7.38 s ± 759 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Now, compare both these examples to the runtime of the decorated and multi-threaded normals function:

%timeit normals(0.6, means, sds)

# 933 ms ± 155 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The decorated, multi-threaded function is significantly faster (933 ms) than the decorated function without multi-threading (3.24 s), which in turn is faster than the uncompiled original NumPy function (7.38 s). However, the degree of speed-up may vary depending on the number of CPUs that the machine has available.

Summary

In general, the improvements achieved by using Numba on top of NumPy functions are marginal for simple, few-loop functions. Nevertheless, Numba is particularly useful for large datasets or high-dimensional arrays that require a large number of loops, and would benefit from the one-and-done compilation that it enables. For more information on using Numba, please refer to its official web page.

Simple profiling checks for running jobs on clusters

The goal of this short blog post is to share some simple tips on profiling your (to be) submitted jobs on high performance computing resources. Profiling your jobs can give you information about how efficiently you are using your computational resources, i.e., your CPUs and your allocated memory. Typically you would perform these checks on your experiment at a smaller scale, ensuring that everything is working as it should, before expanding to more tasks and CPUs.

Your first check is squeue, typically paired with your user ID on a cluster. Here’s an example:

(base) [ah986@login02 project_dir]$ squeue -u ah986
             JOBID PARTITION     NAME      USER  ST       TIME  NODES NODELIST(REASON) 
           5688212    shared <job_name>    ah986  R       0:05      1 exp-4-55 

This tells me that my submitted job is utilizing 1 node in the shared partition of this cluster. If your cluster is using the SLURM scheduler, you can also use sacct, which can display accounting data for all jobs you are currently running or have run in the past. There are many pieces of information available with sacct, which you can specify using the --format flag. Here’s an example for the same job:

(base) [ah986@login02 project_dir]$ sacct --format=JobID,partition,state,time,start,end,elapsed,nnodes,ncpus,nodelist,AllocTRES%32 -j 5688212
       JobID  Partition      State  Timelimit               Start                 End    Elapsed   NNodes      NCPUS        NodeList                        AllocTRES 
------------ ---------- ---------- ---------- ------------------- ------------------- ---------- -------- ---------- --------------- -------------------------------- 
5688212          shared    RUNNING   20:00:00 2021-09-08T10:55:40             Unknown   00:19:47        1        100        exp-4-55 billing=360000,cpu=100,mem=200G+ 
5688212.bat+               RUNNING            2021-09-08T10:55:40             Unknown   00:19:47        1        100        exp-4-55          cpu=100,mem=200G,node=1 
5688212.0                  RUNNING            2021-09-08T10:55:40             Unknown   00:19:47        1        100        exp-4-55          cpu=100,mem=200G,node=1 

In this case I can see the number of nodes (1) and the number of cores (100) utilized by my job as well as the resources allocated to it (100 CPUs and 200G of memory on 1 node). This information is useful in cases where a task launches other tasks and you’d like to diagnose whether the correct number of cores is being used.

Another useful tool is seff, which is actually a wrapper around sacct and summarizes your job’s overall performance. It is a little unreliable while the job is still running, but after the job is finished you can run:

(base) [ah986@login02 project_dir]$ seff 5688212
Job ID: 5688212
Cluster: expanse
User/Group: ah986/pen110
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 100
CPU Utilized: 1-01:59:46
CPU Efficiency: 68.16% of 1-14:08:20 core-walltime
Job Wall-clock time: 00:22:53
Memory Utilized: 38.25 GB
Memory Efficiency: 19.13% of 200.00 GB

The information here is very useful if you want to find out how efficiently you’re using your resources. For this example I had 100 separate tasks I needed to perform, and I requested 100 cores on 1 node and 200 GB of memory. These results tell me that my job completed in 23 minutes or so, the total time using the CPUs (CPU Utilized) was 1-01:59:46, and most importantly, they show the efficiency of my CPU use. CPU Efficiency is calculated “as the ratio of the actual core time from all cores divided by the number of cores requested divided by the run time”, in this case 68.16%. What this means is that I could be utilizing my cores more efficiently by allocating fewer cores to the same number of tasks, especially if scaling up to a larger number of nodes/cores. Additionally, my allocated memory is underutilized and I could be requesting a smaller memory allocation without inhibiting my runs.
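
To make the efficiency number concrete: CPU Utilized here is 1-01:59:46, or roughly 26 core-hours of actual work, while holding 100 cores for the 00:22:53 wall-clock time amounts to the reported 1-14:08:20 (about 38.1) core-hours, and 26.0 / 38.1 ≈ 0.68, i.e., the 68.16% CPU Efficiency shown above.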

Finally, while your job is still running you can log into the node(s) executing the job to look at live data. To do so, you simply ssh to one of the nodes listed under NODELIST (not all clusters allow this). From there, you can run the top command like below (with your own username), which will start the live task manager:

(base) [ah986@r143 ~]$ top -u ah986

top - 15:17:34 up 25 days, 19:55,  1 user,  load average: 0.09, 12.62, 40.64
Tasks: 1727 total,   2 running, 1725 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  0.1 sy,  0.0 ni, 99.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 257662.9 total, 249783.4 free,   5561.6 used,   2317.9 buff/cache
MiB Swap: 716287.0 total, 716005.8 free,    281.2 used. 250321.1 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                              
 78985 ah986     20   0  276212   7068   4320 R   0.3   0.0   0:00.62 top                                                                                                                                  
 78229 ah986     20   0  222624   3352   2936 S   0.0   0.0   0:00.00 slurm_script                                                                                                                         
 78467 ah986     20   0  259464   8128   4712 S   0.0   0.0   0:00.00 srun                                                                                                                                 
 78468 ah986     20   0   54520    836      0 S   0.0   0.0   0:00.00 srun                                                                                                                                 
 78481 ah986     20   0  266404  19112   4704 S   0.0   0.0   0:00.24 parallel                                                                                                                             
 78592 ah986     20   0  217052    792    720 S   0.0   0.0   0:00.00 sleep                                                                                                                                
 78593 ah986     20   0  217052    732    660 S   0.0   0.0   0:00.00 sleep                                                                                                                                
 78594 ah986     20   0  217052    764    692 S   0.0   0.0   0:00.00 sleep                                                                                                                                
 78595 ah986     20   0  217052    708    636 S   0.0   0.0   0:00.00 sleep                                                                                                                                
 78596 ah986     20   0  217052    708    636 S   0.0   0.0   0:00.00 sleep                                                                                                                                
 78597 ah986     20   0  217052    796    728 S   0.0   0.0   0:00.00 sleep                                                                                                                                
 78598 ah986     20   0  217052    732    660 S   0.0   0.0   0:00.00 sleep       

Memory and CPU usage can be tracked from RES and %CPU columns respectively. In this case, for the sake of an example, I just assigned all my cores to sleep a certain number of minutes each (using no CPU or memory). Similar information can also be obtained using the ps command, with memory being tracked under the RSS column.

 (base) [ah986@r143 ~]$ ps -u$USER -o %cpu,rss,args
%CPU   RSS COMMAND
 0.0  3352 /bin/bash /var/spool/slurm/d/job3509431/slurm_script
 0.0  8128 srun --export=all --exclusive -N1 -n1 parallel -j 100 sleep {}m ::: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
 0.0   836 srun --export=all --exclusive -N1 -n1 parallel -j 100 sleep {}m ::: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
 0.1 19112 /usr/bin/perl /usr/bin/parallel -j 100 sleep {}m ::: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
 0.0   792 sleep 3m
 0.0   732 sleep 4m
 0.0   764 sleep 5m
 0.0   708 sleep 6m
 0.0   708 sleep 7m
 0.0   796 sleep 8m
 0.0   732 sleep 9m
 0.0   712 sleep 10m

Basics of data visualization with ggplot2


In my previous post, I showed how wonderful the ggplot2 library in R is for visualizing complex networks. I realized that while there are several posts on this blog going over the advanced visualization capabilities of the ggplot2 library, there isn’t a primer on structuring code for creating graphs in R…yet. In this post, I will go over the syntax for creating pretty ggplot2 graphs and tweaking various parameters. I am a self-declared Python aficionado, but love using ggplot2 because it is intuitive to use, beginner-friendly, and highly customizable all at the same time.

Dataset and setup

For this tutorial, I will be using one of the built-in datasets in R called mtcars which was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. Further documentation on this dataset can be found here. We import the data into our RStudio workspace.

# import the library into our workspace
library(ggplot2)

# import dataset
data(mtcars)
head(mtcars)

The resultant dataset looks something like this.

Basic plot

Now that we have the data, we can get to plotting with ggplot2. We can declaratively create graphics using this library. We just have to provide the data, specify how to map properties to graph aesthetics, and the library takes care of the rest for us! We need to specify three things for each ggplot — 1) the data, 2) the aesthetics, and 3) the geometry.

Let us start by creating a basic scatterplot of the mileage (mpg) of each car as a function of its horsepower (hp). In this case the data is our dataframe mtcars, and the aesthetics x and y will be defined as the names of the columns we wish to plot along each axis — hp and mpg. We can also set the color aesthetic to indicate the number of cylinders (cyl) in each car. One of the reasons ggplot2 is so user-friendly is because each graph property can be tacked on to the same line of code with a + sign. Since we want a scatterplot, the geometry will be defined using geom_point().

# basic scatterplot
g <- ggplot(data = mtcars, aes(x = hp, y = mpg, color=cyl))
g + geom_point()

Excellent! The library automatically assigns the column names as axis labels, and uses the default theme and colors, but all of this can be modified to suit our tastes and to create pretty graphs. It is also important to note that we could have visualized the same data (less helpfully) as a line plot instead of a scatterplot, just by tweaking the geometry function.

# basic line plot
g + geom_line()

Well, this looks unpleasant. But wait, we can do so much more. We can also layer multiple geometries on the same graph to make more interesting plots.

# basic scatter+line plot
g + geom_line() + geom_point()

Additionally, we can tweak the geometry properties in each graph. Here is how we can make the lines dotted and specify the line width and the marker shape and size.

# change properties of geometry
g + geom_point(shape = "diamond", size = 3) +
  geom_line(color = "black", linetype = "dotted", size = .3) 

While our graph looks much neater now, using a line plot is actually pretty unhelpful for our dataset since each data point is a separate car. We will stick with a scatterplot for the rest of this tutorial. However, the above sort of graph would work great for time series data or other data that measures change in one variable.
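
For instance, here is a minimal sketch of that kind of use case, using the economics dataset that ships with ggplot2 (a monthly US time series) instead of mtcars; the date and unemploy columns come from that dataset.

# line plot of a time series: monthly US unemployment from ggplot2's economics data
ggplot(data = economics, aes(x = date, y = unemploy)) +
  geom_line() +
  labs(x = "Date", y = "Unemployed (thousands)")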

Axis labels

One of the cardinal rules of good data visualization is to add axis labels to your graphs. While R automatically sets the axis labels to the column headers, we can override this to make the axis labels more informative with just one extra function.

# change axis titles
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)")

Title

This graph is in serious need of a title to give the reader some idea of what they're looking at. There are actually multiple ways to add a graph title here, but I find it easiest to use ggtitle().

# add title
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower") 

Alright, having a title is helpful, but I don't love its placement on the graph. R automatically left-aligns the title, where it clashes with the y-axis. I would much rather have the title right-aligned, in a bigger font, and bolded. Here is how to do that.

# change position of title
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower")  +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold"))

Theme

There are ways to manually change the background and gridlines of ggplot2 graphs using theme(), but an easy workaround is to use the built-in themes. Which theme you use depends greatly on the graph type and formatting guidelines, but I personally like a white background, faint gridlines, and a bounding box. One thing to note here is that theme_bw() is a complete theme, so it resets any theme() tweaks made before it; the order of the two therefore matters (see the short sketch after the next code block).

# add theme
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold"))
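
As a quick sketch of why that ordering matters: if the complete theme is added after the theme() call, it resets the earlier tweak and the bold, right-aligned title reverts to the default.

# wrong order: theme_bw() comes last and resets the plot.title tweak above it
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower") +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold")) +
  theme_bw()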

We can also use the theme() function to change the base font size and font family. Shown below is how to increase the base font size to 15 and change the base font family to Courier.

# use theme to change base font family and font size
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower")  +
  theme_bw(base_size = 15, base_family = "Courier") +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold"))

Legend

It has been bothering me for the last seven paragraphs that my legend title still uses the column name. However, this is an easy fix. All I have to do is add a label to the color aesthetic in the labs() function.

# change legend title
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)", color = "Cylinders") +
  ggtitle("Mileage vs Horsepower") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold"))

We can also change the position of the legend. R automatically places legends on the right, and while I like having it to the right in this case, I could also place the legend at the bottom of the graph. This changes the shape of the plotting area, since the legend now takes up vertical space below the plot.

# change legend position
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)", color = "Cylinders") +
  ggtitle("Mileage vs Horsepower") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold")) +
  theme(legend.position = "bottom")

Margins

The theme() function is of endless use in ggplot2, and can be used to manually adjust the graph margins and add or remove white space padding. The order of arguments in margin() is clockwise: top, right, bottom, left (helpfully remembered by the mnemonic TRouBLe).

# add plot margins
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)", color = "Cylinders") +
  ggtitle("Mileage vs Horsepower") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold")) +
  theme(legend.position = "right") +
  theme(plot.margin = margin(t = 1, r = 1, b = 1, l = 2, unit = "cm"))

Conclusion

I have barely scratched the surface of what can be achieved using ggplot2 in this post. There are hundreds of excellent tutorials online that dive deeper into ggplot2, like this blog post by Cedric Scherer. I still have a lot to learn about this library and about data visualization in general, but I have hopefully made a solid case for using ggplot2 to create clean and aesthetically pleasing data visualizations.

Custom Plotting Symbols in R

R has a lot of options for plotting symbols, but sometimes you might want something not in the base set. This post will cover a way to get a custom plotting symbol in R. The code is available on GitHub.

I will use data for the 24 nations that competed in the 2021 Euros (men's) tournament, specifically the FIFA rankings of the men's and women's national teams (accessed 12 Aug 2021). Using a flag to represent each country seems reasonable, and easier than text or combinations of plotting symbols and colors. Plus, it might add a bit of visual interest.

Although this might seem specialized, it highlights changing backgrounds in R base graphics, controlling aspect ratios, and adding mathematical notation to plots.

Some prep work before coding comes in handy. First, I downloaded PNG files of each country's flag and saved each one with the country's name. Next, I checked Wikipedia for the aspect ratio of each flag and added it to my CSV file of the FIFA rankings.
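
Based on the column names the code below expects after read.csv(), the rankings CSV looks roughly like the sketch here; the rank values are placeholders for illustration, not the actual 2021 rankings.

# hypothetical stand-in for the rankings CSV (rank values are placeholders);
# Aspect.Ratio is the flag's width divided by its height
df_example <- data.frame(
  Country = c("Switzerland", "Croatia"),
  Men.s.National.Team = c(13, 14),     # men's FIFA ranking
  Women.s.National.Team = c(18, 50),   # women's FIFA ranking
  Aspect.Ratio = c(1, 2)               # 1 = square flag, 2 = twice as wide as tall
)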

The code begins by loading the library we will be using for a custom PCH and then reading in my data from the CSV. The men’s rankings range from 1 to 62 and the women’s rankings from 2 to 131.

library(png) # library for a custom pch

df <- read.csv('/Users/calvinwhealton/Documents/GitHub/euros_2021_rankings/fifa_national_team_rankings.csv',
               header=T) # read-in data from csv

# calculating ranges for men's and women's teams
men_range <- range(df$Men.s.National.Team)
women_range <- range(df$Women.s.National.Team)

Next, we set the basic parameters of the plot. Using the png() function allows me to save the plot directly to a file and gives a stable image I can fine-tune. The height and width needed to be modified until the axes were close to the same scale (10 ranks on x = 10 ranks on y). This makes life easier when trying to scale the flags with the correct aspect ratio.

# setting plot to save as a png file
# checked the output file to make sure
# axes were on the same scale, or close to it
png(filename='/Users/calvinwhealton/Documents/GitHub/euros_2021_rankings/fifa_euro_plot.png'
    ,width=round(women_range[2]/15,2)
    ,height=round(men_range[2]/11.5,2)
    ,units='in'
    ,res=300)

par(mar=c(5,5,5,5)) # plot margins

Now, we can initialize the plot. Because we are not plotting the data yet, we can simply label the axes. Default axis ticks would include 0, so I used more natural axis ticks. The gray background makes flags with white more visible.

# setting up the base scatter plot
# used \ to get the "'" in titles
plot(x=NA
     ,y=NA
     ,ylab=expression('Men\'s Ranking')
     ,xlab=expression('Women\'s Ranking')
     ,main='Euros 2021 FIFA Nations\nMen\'s vs Women\'s National Teams'
     ,xlim=rev(women_range) # reverse axis so rank 1 is top right
     ,ylim=rev(men_range) # reverse axis so rank 1 is top right
     ,las=1 # rotate vertical axis labels to be horizontal
     ,xaxt='n' # no default x axis
     ,yaxt='n' # no default y axis
)

# adding axes
axis(side=1, at=c(1,seq(10,women_range[2],10)),las=1)
axis(side=2, at=c(1,seq(10,men_range[2],12)),las=1)

# adding a gray background to the plot
# many flags have a white portion that would disappear otherwise
rect(xleft=par('usr')[1],xright=par('usr')[2]
     ,ybottom=par('usr')[3],ytop=par('usr')[4]
     ,col='lightgray')

Now comes the fun part. Not all flags are shaped the same. The two extremes among these countries are Switzerland (aspect ratio = 1) and Croatia, Hungary, and North Macedonia (aspect ratio = 2). We will be telling the function the corners of where to plot the flag, so using the same x and y offsets for all flags would lead to distortions.

To solve this problem, I set a value for the area of all flags. This will ensure that each country has the same visual impact. After some math, that means the x and y dimensions for each flag can be found based on the aspect ratio. Once we have all these, we can loop through the countries, calculate the coordinates for the flag corners, and use the country’s flag as the plotting symbol.

# setting approximate "area" for each country flag
flag_area <- 25

# looping over all the rankings
for(i in 1:nrow(df)){
  # read in the image and set it to a variable
  img <- readPNG(paste('/Users/calvinwhealton/Documents/GitHub/euros_2021_rankings/flags/',df$Country[i],'.png',sep=''))
  
  # math for same flag areas (approximately)
  # area = dx*dy = (dy^2)*(aspect_ratio)
  # dy = sqrt(area/aspect_ratio)
  # dx = sqrt(area*aspect_ratio)
  dy <- sqrt(flag_area/df$Aspect.Ratio[i])
  dx <- sqrt(flag_area*df$Aspect.Ratio[i])
  
  # plot the png image as a raster
  # need the image and coordinates
  # men's team on y axis, women's team on x axis
  rasterImage(image=img
              ,ytop=df$Men.s.National.Team[i]+dy/2
              ,ybottom=df$Men.s.National.Team[i]-dy/2
              ,xleft=df$Women.s.National.Team[i]-dx/2
              ,xright=df$Women.s.National.Team[i]+dx/2)
}

For a little bit of fun, and to illustrate one way to get Greek letters in a plot notation, I calculated the correlation of these ranks. The last line of this code closes the figure file.

# adding a text correlation
# using linear correlation of ranks, not rank correlation
correl <- round(cor(df$Men.s.National.Team,df$Women.s.National.Team),2)

# adding text notation for correlation
text(x=women_range[2]-20,
     y=men_range[1]+10,
     labels = bquote(paste(" Linear Corr. (",rho,") = ",.(correl),sep=""))
     )

# close file
dev.off()

And we are done! We see that the flags do look to be about the same area and have approximately correct aspect ratios. (I won't say how long it took me to get that part figured out…) I hope this provides a fun way to change up your plots and helps an audience better understand the data. Although I used flags here because it seemed the most natural for this data, any PNG file could be used based on your data and needs.

[Figure: fifa_euro_plot.png, the final flag scatterplot]

Introduction to PyBorg – basic setup and running

PyBorg is a new secondary implementation of Borg, written entirely in Python using the Platypus optimization library. PyBorg was developed by Andrew Dircks based on the original C implementation, and it is intended primarily as a learning tool, as it is less efficient than the original C version (which you can still use from Python through the plugin "wrapper" also found in the package). PyBorg can be found in the same repository where the original Borg can be downloaded; you can request access here: http://borgmoea.org/#contact

This blog post is intended to demonstrate this new implementation. To follow along, you first need to either clone or download the BitBucket repository after you gain access.

Setting up the required packages is easy. In your terminal, navigate to the Python directory in the repository and install all prerequisites using python setup.py install. This will install all requirements (i.e. the Platypus library, numpy, scipy and six) for you in your current environment.

You can test that everything works fine by running the optimization on the DTLZ2 test function, found in dtlz2.py. The script creates an instance of the problem (as it is already defined in the Platypus library), sets it up as a problem for Borg to optimize, and runs the algorithm for 10,000 function evaluations:

    # define a DTLZ2 problem instance from the Platypus library
    nobjs = 3
    problem = DTLZ2(nobjs)

    # define and run the Borg algorithm for 10000 evaluations
    algorithm = BorgMOEA(problem, epsilons=0.1)
    algorithm.run(10000)

A handy 3D scatter plot is also generated to show the optimization results.

The repository also comes with two other scripts, dtlz2_runtime.py and dtlz2_advanced.py.
The first demonstrates how to use the Platypus hypervolume indicator at a specified runtime frequency to learn about the algorithm's progress as it goes through function evaluations.

The latter provides more advanced functionality that allows you to define custom parameters for Borg. It also includes a function to generate runtime data from the run. Both scripts are useful to diagnose how your algorithm is performing on any given problem.

The rest of this post is a demo of how you can use PyBorg with your own Python model and all of the above. I’ll be using a model I’ve used before, which can be found here, and I’ll formulate it so it only uses the first three objectives for the purposes of demonstration.

The first thing you need to do to optimize your problem is to define it. This is done very simply in the exact same way you’d do it on Project Platypus, using the Problem class:

from fishery import fish_game
from platypus import Problem, Real
from pyborg import BorgMOEA

# define a problem
nVars = 6
nObjs = 3 

problem = Problem(nVars, nObjs) # first input is no of decision variables, second input is no of objectives
problem.types[:] = Real(0, 1) #defines the type and bounds of each decision variable
problem.function = fish_game #defines the model function

This assumes that all decision variables are of the same type and range, but you can also define them individually using, e.g., problem.types[0].

Then you define the problem for the algorithm and set the number of function evaluations:

algorithm = BorgMOEA(problem, epsilons=0.001) #epsilons for each objective
algorithm.run(10000) # number of function evaluations

If you’d like to also produce a runtime file you can use the detailed_run function included in the demo (in the files referenced above), which wraps the algorithm and runs it in intervals so the progress can be monitored. You can combine it with runtime_hypervolume to also track your hypervolume indicator. To use it you need to define the total number of function evaluations, the frequency with which you’d like the progress to be monitored and the name of the output file. If you’d like to calculate the Hypervolume (you first need to import it from platypus) you also need to either provide a known reference set or define maximum and minimum values for your solutions.

maxevals = 10000
frequency = 100
output = "fishery.data"
hv = Hypervolume(minimum=[-6000, 0, 0], maximum=[0, 1, 100])

nfe, hyp = detailed_run(algorithm, maxevals, frequency, output, hv)

My full script can be found below. The detailed_run function is an edited version of the default that comes in the demo to also include the hypervolume calculation.

from fishery import fish_game
from platypus import Problem, Real, Hypervolume
from pyborg import BorgMOEA
from runtime_diagnostics import detailed_run

# define a problem
nVars = 6 # no. of decision variables to be optimized
nObjs = 3

problem = Problem(nVars, nObjs) # first input is no of decision variables, second input is no of objectives
problem.types[:] = Real(0, 1)
problem.function = fish_game

# define and run the Borg algorithm for 10000 evaluations
algorithm = BorgMOEA(problem, epsilons=0.001)
#algorithm.run(10000)

# define detailed_run parameters
maxevals = 10000
frequency = 100
output = "fishery.data"
hv = Hypervolume(minimum=[-6000, 0, 0], maximum=[0, 1, 100])

nfe, hyp = detailed_run(algorithm, maxevals, frequency, output, hv)

# plot the results using matplotlib
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter([s.objectives[0] for s in algorithm.result],
           [s.objectives[1] for s in algorithm.result],
           [s.objectives[2] for s in algorithm.result])
ax.set_xlabel('Objective 1')
ax.set_ylabel('Objective 2')
ax.set_zlabel('Objective 3')
ax.scatter(-6000, 0, 0, marker="*", c='orange', s=50)
plt.show()

plt.plot(nfe, hyp)
plt.title('PyBorg Runtime Hypervolume Fish game')
plt.xlabel('Number of Function Evaluations')
plt.ylabel('Hypervolume')
plt.show()

It produces the following two figures: