Efficient Storage and Querying of Geospatial Data with Parquet and DuckDB

This post is dedicated to Chris Vernon and Travis Thurber from PNNL who taught me how to use these tools!

Lately, I’ve been generating synthetic weather scenarios for basins in California. The weather scenarios are created at a daily time step and are informed by tree-ring products that span 600 years. We want to develop multiple ensembles of these scenarios that are representative of plausible future climate changes that the region could experience. It became clear to me that I would need to store these data efficiently and be able to parse through them quickly to generate plots and develop metrics.

Below, I wanted to discuss two points that have really helped my workflow: (1) focusing on file structure and compression and (2) changing the way that I explore the resulting scenarios.

File choice and compression

To illustrate the importance of considering how you choose to save your data, I grab the historical trace of daily precipitation over the last 50 years for the Tuolumne River Basin and load it into R as a dataframe. Below I show the code to take this dataframe and save it in various formats.

#Traditional RDS
system.time(saveRDS(prcp_tuolumne,"E:/blog/prcp_tuolumne.rds"))

#Compressed RDS
system.time(saveRDS(prcp_tuolumne,"E:/blog/prcp_tuolumne_compressed.rds",compress = 'xz'))

#Text 
system.time(write.table(prcp_tuolumne,"E:/blog/prcp_tuolumne.txt"))

#CSV
system.time(write.csv(prcp_tuolumne,"E:/blog/prcp_tuolumne.csv"))

#Parquet
library(arrow)
system.time(write_parquet(prcp_tuolumne,"E:/blog/prcp_tuolumne.parquet",compression = "gzip"))

#HDF5 
library(rhdf5)
system.time(h5write(prcp_tuolumne,"E:/blog/prcp_tuolumne.h5","df"))


#NetCDF
library(ncdf4)
# path and file name, set dname
ncpath <- "E:/blog/"
ncname <- "prcp_tuolumne"  
ncfname <- paste(ncpath, ncname, ".nc", sep="")
dname <- "prcp"  # note: tmp means temperature (not temporary)
timedim <- ncdim_def("time","days since 1950-11-1",as.numeric(dates))
# define variables
fillvalue <- 1e32
dlname <- "observed precipitation"
prcp_def <- ncvar_def("prcp","mm",list(timedim),1e32,"observed prcp",prec = "double")
# create netCDF file and put arrays
ncout <- nc_create(ncfname,list(prcp_def),force_v4=TRUE)
# put variables
system.time(ncvar_put(ncout,prcp_def,as.matrix(prcp_tuolumne[ ,1])))

Here, I summarize the resulting size and write times for each format.

File Type       | Size (KB) | Writing Time (seconds)
RDS             | 9,065     | 2.13
Compressed RDS  | 4,012     | 17.73
HDF5            | 5,204     | 323.03
Text            | 26,851    | 12.69
CSV             | 27,024    | 12.14
netCDF          | 141,360   | 3.12
Parquet         | 8,396     | 0.98

Note that the original file size was on the smaller side, so this exercise may seem trivial. But when you are creating many ensembles of both precipitation and temperature, across hundreds of years, across many basins, and across many climate change scenarios, file sizes add up and can limit the scope of the experiment you can conduct if you don’t have enough storage. Of these file types, one of my recent favorites has been Apache Parquet. Parquet is an open-source, column-oriented format that is specifically designed for efficient storage and accessibility of data. Whereas the compressed RDS file and HDF5 beat Parquet in terms of size, it takes much longer to write to these files and subsequently read them back in. Another advantage of Parquet is that it is recognized by common coding languages (R, Matlab, Python), which allows for a more seamless workflow between models written in different languages. If you have a little more storage to work with, Parquet is a good choice to balance size and writing-time tradeoffs.
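The read side of that tradeoff is easy to check yourself. As a quick sketch (reusing the file paths from the write step above), you can time reading the compressed RDS and Parquet files back into R:

#Read the files back in and compare timings
library(arrow)
system.time(prcp_rds <- readRDS("E:/blog/prcp_tuolumne_compressed.rds"))
system.time(prcp_parquet <- read_parquet("E:/blog/prcp_tuolumne.parquet"))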

Querying data with SQL and DuckDB

Once your data are in an efficient format like Parquet, the next order of business is to make sure that you can easily sort through it and, as Chris would say, “talk to your data and let it talk back”. One way to communicate with your data and ask questions in a more semantically meaningful way is to use Structured Query Language (SQL). There are many methods of querying in R, but a useful data management tool that I have enjoyed interacting with is DuckDB, a relational database management system. DuckDB has an R API and allows you to use SQL syntax to query data efficiently. You can perform similar aggregation and subsetting with base R functions, but DuckDB allows you to perform these tasks faster and across multiple files that you may have in a directory.

Let’s use a case where you have developed 30 traces of daily basin-averaged synthetic precipitation for the Tuolumne based on the historical period, and each trace is stored in its own Parquet file (“ensemble1.parquet”, “ensemble2.parquet”, …, “ensemble30.parquet”). Each ensemble looks something like this:

Example of the parquet file structure

Now let’s state a question that I want to ask of the data and see how to structure these questions using SQL and DuckDB.

“Hey DuckDB, can you create a dataframe for me that provides an annual average across all my synthetic traces? Let’s plot that with respect to the observed annual average and show the bounds of the synthetic traces.”

First we need to open a connection to DuckDB. Then we need to create annual averages from the daily historical and synthetic traces, and finally create max/min bounds to show the additional variability we are getting from the generator.

In order to answer this question, DuckDB will query all our Parquet files simultaneously (!) and return a dataframe of annual averages, and we can do some clever sub-querying to get the max/min of those averages. I really enjoy that I don’t have to read the files into my R workspace. R holds data frames in memory, so those of you who have worked with larger datasets might have found that you are limited in how many you can have open at once, which can be frustrating! This is not an issue with DuckDB.

#Historical trace 
historical=data.frame(dates.year,dates.month,dates.day,rowMeans(prcp_sacsma))
historical_yearly=aggregate(historical,by=list(dates.year),FUN=mean)

# open connection to DuckDB
library(DBI)
library(duckdb)
con <- dbConnect(duckdb::duckdb())

# query to find annual average across all synthetic traces
df=dbGetQuery(con, 
              "SELECT year,
           AVG(precipitation) AS yearly_average
           FROM 'E:/NHMM/blog/*.parquet'
           GROUP BY year
           ORDER BY year")

#For the max/min, we need to find the average first and then return the max/min values in something like a nested approach 

#First create a dataframe with all files 
all_files=dbGetQuery(con,"SELECT *
                FROM 'E:/NHMM/blog/*.parquet'")

# register the dataframe as a virtual DuckDB table, and give it a name
duckdb::duckdb_register(con, "all_files_table", all_files)


annual_average=dbGetQuery(con, 
               "SELECT year,
           AVG(precipitation) AS yearly_average
           FROM all_files_table
           GROUP BY year,sample
           ORDER BY year")

# register the dataframe as a virtual DuckDB table, and give it a name
duckdb::duckdb_register(con, "annual_table", annual_average)

#query to find max
max=dbGetQuery(con, 
               "SELECT year,
           MAX(yearly_average) AS max_yearly_average
           FROM annual_table
           GROUP BY year
           ORDER BY year")

#query to find min
min=dbGetQuery(con, 
               "SELECT year,
           min(yearly_average) AS min_yearly_average
           FROM annual_table
           GROUP BY year
           ORDER BY year")
#Plot the results!
library(ggplot2)
library(ggridges)   # provides theme_ridges()

ggplot(df, aes(x = year, y = yearly_average)) +
  geom_ribbon(aes(ymin = min$min_yearly_average,
                  ymax = max$max_yearly_average),
              alpha = 0.2, fill = "blue") +
  geom_line(color = "blue", size = 2) +
  geom_line(data = basin.wide.yearly[3:64, ], aes(x = year, y = precipitation),
            color = "black", size = 2) +
  ggtitle("Annual Average Precipitation") +
  xlab("Year") + ylab("Annual Average Precipitation (mm)") +
  theme_ridges()

Here’s the resulting figure:

Synthetic generation in blue compared to the historical trace in black

“Hey DuckDB, from all of the synthetic traces that I have generated, how many instances are there where we are producing daily precipitation that is above 10 mm for specifically the years 1980 or 2000?”

In order to answer this question, DuckDB will once again query all our parquet files simultaneously and return a dataframe with all the instances of >10 mm and only for the years 1980 or 2000. This is how you would write this question in “code”.

library(arrow)
library(duckdb)
library(fs)
library(tidyverse)
library(DBI)
library(glue)

# open connection to DuckDB
con <- dbConnect(duckdb::duckdb())

df=dbGetQuery(con,"SELECT *
                FROM 'E:/blog/*.parquet'
                WHERE precipitation > 10
                AND (year = 1980 OR year = 2000)
                order BY year")

If we look at the size of the resulting dataframe it looks like we have generated 1335 instances of daily precipitation that are greater than 10 mm in specifically the years 1980 or 2000:

“Hey DuckDB, I want to think about meteorological drought. In how many instances am I half a standard deviation below the long-term average monthly precipitation?”

To try this out, let’s just look at one synthetic trace. First we need to find the long term average across the trace and let’s plot what a drought might look like.

# query to grab the first synthetic trace
scenario1=dbGetQuery(con, 
              "SELECT month,year,
           AVG(precipitation) AS monthly_avg
           FROM 'E:/NHMM/blog/ensemble1.parquet'
           GROUP by year, month")

# query to find mean
long_term_average=dbGetQuery(con, 
              "SELECT AVG(precipitation) AS monthly_avg
           FROM 'E:/NHMM/blog/ensemble1.parquet'") 

# query to find the standard deviation
stdev=dbGetQuery(con, 
              "SELECT STDDEV(precipitation) AS monthly_stdev
           FROM 'E:/NHMM/blog/ensemble1.parquet'")

We can then plot the synthetic trace (blue), mean (black), and a 1/2 standard deviation (dashed line) below the mean.

Now we can count the number of months that are classified as a drought across the synthetic trace. Based on the stats that we calculated in the last block, I hard-coded the drought threshold, but in theory you can nest many functions with more complicated sub-queries.

# register the dataframe as a virtual DuckDB table, and give it a name
duckdb::duckdb_register(con, "scenario_1", scenario1)


instances=dbGetQuery(con, 
                     "SELECT month,year
           FROM scenario_1
           WHERE monthly_avg < 3.125291
           GROUP by year, month
           ORDER by year,month")

It looks like 115/325 months or about 35% of the time we are in a drought in this particular synthetic trace. Yikes! But it makes for a good way to assess the vulnerability of the system to future plausible meteorological drought conditions.
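As a sketch of the nested approach mentioned above, the drought threshold could also be computed inside the query itself rather than hard-coded; here the sub-query takes the mean and standard deviation of the monthly averages in the registered scenario_1 table (so the threshold may differ slightly from the daily-based stats above):

#Nested sub-query version: compute the threshold on the fly
instances_nested=dbGetQuery(con,
               "SELECT month,year
           FROM scenario_1
           WHERE monthly_avg < (SELECT AVG(monthly_avg) - 0.5*STDDEV(monthly_avg)
                                FROM scenario_1)
           ORDER BY year,month")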

I hope that these approaches can be as useful to you as they have been for me!

The ma-R-velous tidyverse

I am currently taking a class in environmental economics, which of late seems to be doubling as a crash course in the R tidyverse. This blog post will discuss several R packages and demonstrate the application of several tidyverse sub-packages and commands that I have learned to use through this class. Two datasets will be used: the storms dataset, which contains data on the types of storms that occurred in the United States from 1975 to 2015, and the damages dataset, which is a record of the dollar cost of the damages caused by every tropical storm that occurred from 1975 to 2015. The datasets required for this example can be found in this GitHub repository.

Tidy data and the tidyverse?

‘Tidy’ datasets provide a standardized way to link the physical layout of a dataset with the meaning associated with it [1]. A ‘tidy’ dataset has the following characteristics:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each cell contains only one value of only one data type

On the other hand, the tidyverse is a meta-package that bundles the packages facilitating data ‘tidying’, or data cleaning such that the final form of the dataset is ‘tidy’. You can check the full set of packages bundled into the tidyverse using tidyverse_packages(), which prints the full list of bundled packages to the console.

Within this set, we will be using the dplyr, tidyr and lubridate packages. The dplyr and tidyr packages are “workhorse” packages: their main purpose is to clean and wrangle large datasets, and they will be the two packages that we use most often in this example. The lubridate package is used to convert strings into dates, which we will use later in this example.
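As a tiny illustration of lubridate (the date string here is just a made-up example), parsing functions such as ymd() and as_datetime() turn character strings into proper date or date-time objects:

# minimal lubridate illustration with a made-up date string
library(lubridate)
ymd("1975-06-27")                      # returns a Date object
as_datetime("1975-06-27 06:00:00")     # returns a POSIXct date-time object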

In this example, we will transform a ‘messy’ dataset into a ‘tidy’ one, and perform several other operations that are enabled by the R tidyverse.

The pacman package management tool

No, this is not the Toru Iwatani-created, dot-munching yellow puck released in 1980 by Namco. Instead, the pacman tool is a convenient library and package wrapper that combines the functionality of base library-related functions into intuitively named functions [2]. It was created to improve workflow by reducing the time required in calling and re-calling obscurely named functions. Ideally, it should be called at the beginning of the R script, like so:

# Run this line
if (!require("pacman")) install.packages("pacman")

This enables you to automatically install and load packages throughout the base R libraries and the R tidyverse using the p_load function. For this example, we will only require the tidyverse package. Load this package using the following script:

# Then run p_load to load the necessary packages
pacman::p_load(
 tidyverse
)

As previously mentioned, this automatically loads the dplyr, tidyr and lubridate packages that we will need for this example.

Before beginning, make sure that you are running the correct versions of dplyr and tidyr. Check the package versions using the packageVersion('package_name') function. All packages should be at least version 1.0.0.
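For example, a quick check might look like this:

# confirm package versions (each should be >= 1.0.0)
packageVersion("dplyr")
packageVersion("tidyr")
packageVersion("lubridate")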

Working with the datasets

Some key tidyr verbs

With the explanations out of the way, we begin by loading our two datasets:

# Load dataset
storms <- read.csv("storms.csv")
damages <- read.csv("damages.csv")

Note that the damages dataset is in an inconvenient ‘wide’ shape. To convert damages into ‘tidy’ format, we use pivot_longer(). pivot_longer() is one of the four key tidyr verbs, the remaining three being pivot_wider(), separate() and unite(). These verbs perform the following uses:

  1. pivot_longer(): Vertically stretches the data, increasing the number of rows and decreasing the number of columns
  2. pivot_wider(): Horizontally stretches the data, increasing the number of columns and decreasing the number of rows
  3. separate(): Splits a single character column into multiple columns
  4. unite(): Pastes multiple columns into one

For this example, pivot_longer() is used to turn damages from its original wide shape into a long, 3-column dataframe whose columns contain each storm’s type (status), name, and total dollar cost of damages. The code is as follows:

# tidy-up damages
damages <- damages %>% mutate(status = "tropical storm") %>%
  pivot_longer(-status, names_to="name", values_to="damages")

This script first categorizes all the storms within damages as tropical storms, and then assigns the names of each storm to the column ‘name’ and the cost of their damages to ‘damages’. This turns the original wide dataset into a long, three-column format, resulting in a much more readable dataset!

It’s a lubridate!

Next, we will visit our storms dataset. Note that the information regarding the date and time of each storm is inconveniently split between four columns:

We use the as_datetime() function from the lubridate package to combine these columns into one easy-to-interpret column called ‘date’, using the following script:

# Paste four columns and convert to date-time format
storms <- storms %>% mutate(date=as_datetime(
  paste0(year, "-", month, "-", day, " ", hour, ":00:00")))

This pastes the data in the ‘year’, ‘month’, ‘day’ and ‘hour’ columns together and inserts the formatted date and time into the newly added ‘date’ column:

Great! Our storm dataset is now more readable.

Some dplyr verbs and operations

There are five key dplyr verbs, some of which you have already seen:

  1. filter(): Obtains a subset of rows based on their column values
  2. arrange(): Reorders rows (in ascending or descending order) based on their values
  3. select(): Selects columns or variables of interest
  4. mutate(): Creates new columns or variables and automatically appends them to the dataframe
  5. summarize(): Collapses multiple rows into a single summary value, commonly by grouping a pre-selected variable
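
As a quick sketch of the last verb (using the storms dataset loaded above, which has status and wind columns; the grouping is done with dplyr’s group_by() helper):

# summarize: mean wind speed for each storm status
storms %>%
  group_by(status) %>%
  summarize(mean_wind = mean(wind, na.rm = TRUE)) %>%
  arrange(desc(mean_wind))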

The dplyr package also comes with a set of join operations, which enable the merging of two dataframes, x and y. The types of operations include:

  1. inner_join(): Matches pairs of observations whenever their values are equal
  2. left_join(): Keeps all observations in x, adding matched columns from y
  3. right_join(): Keeps all observations in y, adding matched columns from x
  4. full_join(): Keeps all observations that appear in either x or y
  5. semi_join(): Keeps all observations in x that have a match in y
  6. anti_join(): Drops all observations in x that have a match in y

In this example, we will only be using the inner_join() function to merge a subset of storms and the whole of damages.

The verbs can be chained to modify datasets using ‘pipes’, represented in R as %>%. We have previously seen the application of the mutate verb when we added the new columns ‘status’ and ‘date’ to the damages and storms datasets respectively. Now, let’s apply the filter() verb to obtain only tropical storm data from storms:

# filter out only the tropical storms
ts <- storms %>% filter(
    stringr::str_detect(status, "tropical storm")
)

Here, we use stringr::str_detect() to detect the string ‘tropical storm’ within the ‘status’ column of storms and store it within the new ts dataframe.

Next, we use the select() function to select all columns but the columns containing the diameter of the hurricane/storm, since this is data that we will not use in this example:

# select a subset of columns
ts <- ts %>% select(!ts_diameter:hu_diameter)

Using ! indicates to the function that you would like to select the complement of the set following the exclamation point. Using : is useful when there are consecutive columns you would like to (de)select. Following this, you will have a pared-down dataset:

Now it’s time to apply our inner_join() operation to merge ts with damages into a new dataframe joined_sd that contains both the geophysical information of each tropical storm and its dollar cost of damages.

# joining tropical storms and damages
joined_sd <- inner_join(ts, damages)

where joined_sd has the following form:

Now, we perform some simple operations to obtain the mean dollar cost of damages per unit of wind speed, and arrange each tropical storm in order of decreasing damage per wind speed:

# calculate average cost of damage per windspeed
joined_sd <- joined_sd %>% 
  mutate(damage_per_mph = mean(damages)/wind) %>% 
  arrange(desc(damage_per_mph))

This results in the following final dataset:

From here, you can see that tropical storm Amy resulted in the most expensive damages. Finally, you can also use the summarize() verb to identify the mean of the cost of damages per windspeed of all the tropical storms:

# summarize by mean and sd of cost per windspeed
summary_sd <- joined_sd %>% summarize(
  damage_per_mph=mean(damage_per_mph))

You will find that, on average, each tropical storm costs USD$11,244,531. Plotting the cost of damages with respect to time using

# plot damages wrt time
ggplot(joined_sd, aes(year, damages)) + 
  geom_point() + ylim(min(joined_sd$damages), max(joined_sd$damages)) +
  geom_smooth(method = lm) + ggtitle("Damages ($) with respect to time")

you will find a slight increase in the cost of damages incurred by tropical storms as the years go by.

Conclusion

In this blog post, we walked through two example datasets to demonstrate three sub-packages within the tidyverse: tidyr, lubridate and dplyr. For the full version of this example, please visit the author’s GitHub repository.

References

[1] Tidy data. (2021). Retrieved 14 November 2021, from https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html

[2] pacman package – RDocumentation. (2021). Retrieved 14 November 2021, from https://www.rdocumentation.org/packages/pacman/versions/0.5.1

[3] Rudik, I. (2021). AEM 6510 Chapter 10: R and the tidyverse. Presentation.

How to automate scripts on a cluster

There are several reasons why you might need to schedule or automate your scripts on a personal machine or a cluster:

  • You’re waiting for a job to finish before submitting another
  • You’d like to automate regular backups or cleanups of your data (e.g., move new data to another location or remove unnecessary output files)
  • You need to submit jobs to get around node limitations (e.g., you’d like to spread out the submissions over several days)
  • You need to retrieve regularly updated data (e.g., you have a model that uses daily precipitation data and you’d like to automatically collect them every day)

Cron is a utility program on Unix operating systems that allows you to schedule or repeat such tasks in the future. There’s a crontab file associated with every user on a cluster, where you input all the information needed to schedule and automate your tasks. Note that not all clusters automatically allow their users to run cron jobs [1]; for example, I can use it on the Reed Group’s Cube cluster, but not on XSEDE’s Comet.

To edit the crontab file associated with your user, type the following in your command line:

crontab -e

This will open a text editor (like Vim) which you can edit. To simply view your current crontab without editing, run:

crontab -l

Crontab syntax is made up of two parts: the timer indicating when to run and the command to run:


The timer accepts five fields, indicating the time and day for the command to run:

  • Minute — minute of the hour, from 0 to 59
  • Hour — hour of the day, from 0 to 23
  • Day of the month — day of the month, from 1 to 31
  • Month — month of the year, from 1 to 12
  • Day of the week — day of the week, from 0 to 7

For example the following would execute script.sh on January 2nd at 9:00AM:

0 9 2 1 * /home/user/scripts/script.sh

Special characters are naturally very useful here, as they allow multiple execution times or ranges:

Asterisk (*) — to use all scheduling parameters in a field, for example, run the script, every day at midnight:

0 0 * * * /home/user/scripts/script.sh

Comma (,) — to use more than one scheduling parameter in a field, for example, run the script every day at midnight and 12PM:

0 0,12 * * * /home/user/scripts/script.sh

Slash (/) — to create predetermined time intervals, for example, run the script every four hours:

0 */4 * * * /home/user/scripts/script.sh

Hyphen (-) — to determine a range of values in a field, for example, run the script every minute during the first 10 minutes of every hour, every day:

0-10 * * * * /home/user/scripts/script.sh

Hyphens and slashes can be combined, for example, to run a script every 5 minutes during the first 30 minutes of every hour, every day:

0-30/5 * * * * /home/user/scripts/script.sh

Last (L) — this character can only be used in the day-of-the-month and day-of-the-week fields to specify the last occurrence of something, for example the last day of the month (which varies from month to month):

0 9 L * * /home/user/scripts/script.sh

or, to specify constructs such as “the last Friday” of every month:

0 9 * * 5L /home/user/scripts/script.sh

Weekday (W) — this character is only allowed in the day-of-month field and is used to determine the closest weekday to that day of the month. For instance, “15W” tells cron to run the script on the weekday nearest to the 15th day of the month. If the 15th is a Saturday, the script will be executed on Friday the 14th. If the 15th is a Sunday, the script will be executed on Monday the 16th. If the 15th is a weekday, the script will be executed on that same day:

0 0 15W * * /home/user/scripts/script.sh

Hash (#) — this character is only allowed in the day-of-week field and is used to specify constructs such as the second Friday of every month:

0 0 * * 5#2 /home/user/scripts/script.sh

Lastly, if you’d like to be notified whenever a script is executed you can use the MAILTO parameter, with your email address.

The important thing to remember when running cron on a cluster (as opposed to your own machine) is that it will launch a shell with a new, clean environment (i.e., without the environment variables that are automatically applied when you log on to an interactive shell), so it will likely not recognize some commands or know where your modules are. This can be easily addressed by sourcing your .bashrc or .bash_profile from your home directory before running anything. You also need to remember that it will launch from your home directory, so you need to specify the absolute path of the scripts to be executed, or change directory before executing them.

For example my crontab file on the Reed Group cluster looks like this:

#!/bin/bash
MAILTO=myemail@cornell.edu
00 10 * * * . $HOME/.bashrc; cd /directory/where/my/project/is; git pull; sbatch ./script.sh
30 10 * * * . $HOME/.bashrc; cd /directory/where/my/project/is; git add . ; git commit -m 'fetched data'; git push

This does the following: every day at 10 am, it sources my .bashrc so it knows all my environment variables, changes to the directory of my project, pulls any new updates to that project from git, and then submits a script using sbatch. I get an email at the same time, with the text that would have appeared in my command line had I executed these commands in an interactive node (i.e., the git information and a line saying Submitted batch job xxxxx).
Then, every day at 10:30 am, I commit and push the new data back to git.


[1] If you’re just a regular user on a cluster you might need to request to be granted access. If you have root privileges (say, on a personal machine), you need to edit your cron allow and deny files:

/etc/cron.allow
/etc/cron.deny

Getting started with API requests in Python

This is an introductory blog post on how to work with Application Programming Interfaces (APIs) in Python. Seeing as this is our blog’s first post on this topic, I’ll spend some time explaining some basic information I’ve had to learn in the process of doing this and why it might be useful in your research. There are several blogs and websites with tutorials online and there’s no point repeating them, but I’ll try to explain this for an audience like me, i.e., one that (a) has no formal training in computer science/web servers/HTTP but is competent with scripting, and (b) is interested in using this for research purposes.

What is an API?

Many sites (e.g., Facebook, Twitter, many many more) make their data available through what are called Application Programming Interfaces, or APIs. APIs allow pieces of software to interact and to send and receive data from servers, typically to provide additional services to businesses, mobile apps and the like, or to allow for additional analysis of the data for research or commercial purposes (e.g., collecting all trending Twitter hashtags per location to analyze how news and information propagate).

I am a civil/environmental engineer, what can APIs do for me?

APIs are particularly useful for data that change often or that involve repeated computation (e.g., daily precipitation measurements used to forecast lake levels). A lot of this kind of data, relevant for water resources systems analysts, is easily accessible through APIs.

I’m interested, how do I do it?

I’ll demonstrate some basic scripts using Python and the Requests library in the section below. There are many different API requests one can perform, but the most common one is GET, which is a request to retrieve data (not to modify it in any way). To retrieve data from an API you need two things: a base URL and an endpoint. The base URL is the static address of the API you’re interested in, and an endpoint is appended to the end of it as the server route used to collect a specific set of data from the API. Collecting different kinds of data from an API is basically a process of manipulating that endpoint to retrieve exactly what is needed for a specific computation. I’ll demo this using the USGS’s API, but be aware that it varies for each one, so you’d need to figure out how to construct your URLs for the specific API you’re trying to access. They most often come with a documentation page and example URL generators that help you figure out how to construct them.

The base URL for the USGS API is http://waterservices.usgs.gov/nwis/iv/

I am, for instance, interested in collecting streamflow data from a gage near my house for last May. In Python I would set up the specific URL for these data like so:

import requests

response = requests.get("http://waterservices.usgs.gov/nwis/iv/?format=json&indent=on&sites=04234000&startDT=2020-05-01&endDT=2020-05-31&parameterCd=00060&siteType=ST&siteStatus=all")

For some interpretation of how this was constructed, every parameter option is separated by &, and they can be read like so: data in json format; indented so I can read them more easily; site number is the USGS gage number; start and end dates; a USGS parameter code to indicate streamflow; site type ‘stream’; any status (active or inactive). This is obviously the format that works for this API, for a different API you’d have to figure out how to structure these arguments for the data you need. If you paste the URL in your browser you’ll see the same data I’m using Python to retrieve.

The response object now contains several attributes, the most useful of which are response.content and response.status_code. content contains the content of the URL, i.e., your data and other information, as a bytes object. status_code contains the status of your request, e.g., whether the server couldn’t find what you asked for or you weren’t authenticated to access the data. You can find what the codes mean here.

To access the data contained in your request (most usually in JSON format) you can use the response’s json() method, which returns a dictionary object:

data = response.json()

The Python json library is also very useful here to help you manipulate the data you retrieve.

How can I use this for multiple datasets with different arguments?

So obviously, the utility of APIs comes when one needs multiple datasets of something. Using a Python script we can iterate through the arguments needed and generate the respective endpoints to retrieve our data. An easier way of doing this without manipulating and appending strings is to set up a dictionary with the parameter values and pass that to the get function:

parameters = {"format": 'json', "indent": 'on',
              "sites": '04234000', "startDT": '2020-05-01',
              "endDT": '2020-05-31', "parameterCd":'00060',
              "siteType": 'ST', "siteStatus":'all'}
response = requests.get("http://waterservices.usgs.gov/nwis/iv/", 
                        params=parameters)

This produces the same data, but we now have an easier way to manipulate the arguments in the dictionary. Using simple loops and lists to iterate through the desired argument values, one can retrieve and store multiple different datasets from this API.

Other things to be aware of

Multiple other people use these APIs, and some of them carry datasets that are very large, so querying them takes time. If the server needs to process something before producing it for you, that takes time too. Some APIs limit the number of requests you can make as a result, and it is generally good practice to be as efficient as possible with your requests. This means not requesting large datasets you only need parts of, but being specific about what you need. Depending on the database, this might mean adding several filters to your request URL to be as specific as possible about the dates and attributes needed, or combining several attributes (e.g., multiple gages in the above example) in a single request.

Parallel File Compression in Linux

If you are dealing with big data and need to move them to different directories or archive them for long-term storage, you have to think about how to do so efficiently. Without an efficient method, you will probably need to spend days organizing and finishing the work. There are several utilities that are popular for this task. I had more than 3 TB of data that I needed to move to free up some space on the disk, so I thought about a strategy for moving my files. My smallest subdirectory was about 175 GB, and it took about 2 hours to compress with normal tar and gzip. I then realized that gzip has options for the level of compression and the speed of compression, which I did not know before. This can be helpful. I did a simple test with a smaller data set (about 1.94 GB) by applying different speed options from 1 to 9; 1 indicates the fastest but worst compression, and 9 indicates the slowest but best compression:

GZIP=-<speed option> tar cvzf OUTPUT_FILE.tar.gz ./Paths_to_Archive/    # e.g., GZIP=-1 for fastest, GZIP=-9 for best compression

Here is the result: the default is 6, but you can play with the speed option and, based on your priority, choose the timing and compression ratio. If your file or folder is much bigger than what I used here as an example, different speed options can really save you time. Here is the graph that I created from this experiment:

You can further speed it up by using more than one core/processor. pigz is a parallel implementation of gzip that compresses files/folders using multiple processors and cores.

tar cf - ./Paths_to_Archive | ./pigz-2.4/pigz -1 -p 20 > OUTPUT_FILE.tar.gz

In this case, I used “1” as the speed option, specified “20” processors for the task, and gave the path where I downloaded pigz. The good news is that you can write a simple bash script to run the same command on a compute node rather than on the login node. Therefore, you can use all the available processors on a node without slowing down the head node.

#!/bin/bash										
#SBATCH -t 24:00:00					
#SBATCH --job-name=test				
#SBATCH --mail-type=end				
#SBATCH -p normal					
#SBATCH --export=ALL				
#SBATCH --nodes=1			
#SBATCH --output="test.txt"				
#SBATCH --cpus-per-task=32

tar cf - ./Paths_to_Archive | ./pigz-2.4/pigz -1 -p 32 > OUTPUT_FILE.tar.gz

Save the lines above in a test.sh, and run it with sbatch test.sh.

I used a larger subset of my data (16.5 GB) to check different speed options in parallel, and here is the result: the slowest option was almost threefold faster (168 vs. 478 seconds) than my previous test on one processor; however, my previous test folder was much smaller (1.96 vs. 16.5 GB).

More Terminal Schooling

You are probably asking yourself “and why do I need more terminal schooling?”. The short answer is: so you don’t have to spend as much time as you do on the terminal, most of which is spent (1) pushing arrow keys thousands of times per afternoon to move through a command or history of commands, (2) waiting for a command that takes forever to finish before you can run anything else, (3) clicking all over the place on MobaXterm and still feeling lost, (4) manually running the same command multiple times with different inputs, (5) typing the two-step verification token every time you want to change a “+” to a “-” in a file on a supercomputer, (6) waiting forever for a time-consuming run done in serial on a single core, and (7, 8, …) other useless and horribly frustrating chores. Below are some tricks to make your Linux work more efficient and reduce the time you spend on the terminal. From now on, I will use a “$” sign to indicate that what follows is a command typed in the terminal.

The tab autocomplete is your best friend

When trying to do something with that file whose name is 5480458 characters long, be smart and don’t type the whole thing. Just type the first few letters and hit tab. If it doesn’t complete all the way, it’s because there are multiple files whose names begin with that sequence of characters. In this case, hitting tab twice will return the names of all such files. The tab autocomplete works for commands as well.

Ctrl+r to search through previous commands

When on the terminal, hit ctrl+r to switch to reverse search mode. This works like a simple search function in a text document, but instead it looks in your bash history file for commands you used over the last weeks or months. For example, if you hit ctrl+r and type sbatch, it will fill the line with the last command you ran that contained the word sbatch. If you hit ctrl+r again, it will find the second-to-last such command, and so on.

Vim basics to edit files on a system that requires two-step authentication

Vim is one of the most useful things I have come across when it comes to working on supercomputers with two-step identity verification, in which case using MobaXterm or VS Code requires typing a different security code all the time. Instead of uploading a new version of a code file every time you want to make a simple change, just edit the file on the remote computer itself using Vim. To make simple edits to your files, there are very few commands you need to know.

To open a file with Vim from the terminal: $ vim <file name> or $ vim +10 <file name>, if you want to open the file and go straight to line 10.

Vim has two modes of operation: text-edit (for you to type whatever you want in the file) and command (a replacement for clicking on File, Edit, View, etc. on the top bar of Notepad). When you open Vim, it will be in command mode.

To switch to text-edit mode, just hit either “a” or “i” (you should then see “– INSERT –” at the bottom of the screen). To return to command mode, hit escape (Esc). When in text-edit mode, the keys “Home,” “End,” “Pg Up,” “Pg Dn,” “Backspace,” and “Delete” work just like in Notepad and MS Word.

When in command mode, save your file by typing :w + Enter, save and quit with :wq, and quit without saving with :q!. Commands for selecting, copying and pasting, finding and replacing, replacing just one character, deleting a line, and other more advanced tasks can be found here. There’s also a great cheatsheet for Vim here. Hint: once you learn another five to ten commands, making complex edits to your files with Vim becomes blazingly fast.

Perform repetitive tasks on the terminal using one-line Bash for-loops.

Instead of manually typing a command for each operation you want to perform on a subset of files in a directory (e.g., cp file<i>.csv directory300-400 for i from 300 to 399, tar -xzvf myfile<i>.tar.gz, etc.), you can use a Bash for-loop when using a wildcard is not possible.

Consider a situation in which you have 10,000 files and want to move files number 200 to 299 to a certain directory. Using the wildcard “*” in this case wouldn’t work, as result_2*.csv would also return result_2.csv, result_20.csv to result_29.csv, and result_2000.csv to result_2999.csv (sometimes you may be able to use Regex, but that’s another story). To move a subset of result files to a directory using a Bash for-loop, you can use the following syntax:

$ for i in {200..299}; do cp result_$i.csv results_200s/; done

Keep in mind that you can have multiple commands inside a for-loop by separating them with “;” and also nest for-loops.

Run a time-intensive command in the background with an “&” and keep doing your terminal work

Some commands may take a long time to run and render the terminal unusable until they complete. Instead of opening another instance of the terminal and logging in again, you can send a command to the background by adding “&” at the end of it. For example, if you want to extract a tar file with tens of thousands of files in it and keep doing your work as the files are extracted, just run:

$ tar -xzf my_large_file.tar.gz &

If you have a directory with several tar files and want to extract a few of them in parallel while doing your work, you can use the for-loop described above and add “&” to the end of the tar command inside the loop. BE CAREFUL, if your for-loop iterates over dozens or more files, you may end up with your terminal trying to run dozens or more tasks at once. I accidentally crashed the Cube once doing this.

Check what is currently running on the terminal using ps

To make sure you are not overloading the terminal by throwing too many processes at it, you can check what it is currently running with the command ps. For example, if I run a program with MPI that creates two processes and run ps before my program is done, it will return the following:

bernardoct@DESKTOP-J6145HK /mnt/c/Users/Bernardo/CLionProjects/WaterPaths
$ mpirun -n 2 ./triangleSimulation -I Tests/test_input_file_borg.wp &
[1] 6129     <-- this is the process ID
bernardoct@DESKTOP-J6145HK /mnt/c/Users/Bernardo/CLionProjects/WaterPaths
 $ ps
 PID TTY TIME CMD
 8 tty1 00:00:00 bash
 6129 tty1 00:00:00 mpirun    <-- notice the process ID 6129 again
 6134 tty1 00:00:00 triangleSimulat
 6135 tty1 00:00:00 triangleSimulat
 6136 tty1 00:00:00 ps

Check the output of a command running in the background

If you run a program in the background, its output will not be printed on the screen. To know what’s happening with your program, redirect its output to a text file using the “>” symbol, which will be updated continuously as your program runs, and check it with cat <file name>, less +F <file name>, tail -n <number of lines> <file name>, or something similar. For example, if test_for_background.sh is a script that prints a number on the screen every second, you could do the following (note the “> pipe.csv” in the first command):

bernardoct@DESKTOP-J6145HK /mnt/c/Users/Bernardo/CLionProjects/WaterPaths
 $ ./test_for_background.sh > pipe.csv &
 [1] 6191

bernardoct@DESKTOP-J6145HK /mnt/c/Users/Bernardo/CLionProjects/WaterPaths
 $ cat pipe.csv
 1
 2

bernardoct@DESKTOP-J6145HK /mnt/c/Users/Bernardo/CLionProjects/WaterPaths
 $ cat pipe.csv
 1
 2
 3

bernardoct@DESKTOP-J6145HK /mnt/c/Users/Bernardo/CLionProjects/WaterPaths
 $ tail -3 pipe.csv
 8
 9
 10

This is also extremely useful in situations where you want to run a command that takes a long time to run but whose output is normally displayed only once on the screen. For example, if you want to check the contents of a directory with thousands of files to search for a few specific ones, you can redirect the output of ls to a file and send it to the background with ls > directory_contents.txt & and then search the resulting text file for the files of interest.

System monitor: check core and memory usage with htop, or top if htop is not available

If ps does not provide enough information given your needs, such as if you’re trying to check if your multi-thread application is using the number of cores it should, you can try running htop instead. This will show on your screen something along the lines of the Performance view  of Windows’ Task Manager, but without the time plot. It will also show how much memory is being used, so that you do not accidentally shut down a node on an HPC system. If htop is not available, you can try top.

Running make in parallel with make -j for much shorter compiling time

If a C++ code is properly modularized, make can compile certain source code files in parallel. To do that, run make -j<number of cores> <rule in makefile>. For example, the following command would compile WaterPaths in parallel over four cores:

$ make -j4 gcc

For WaterPaths, make gcc takes 54 s on the Cube, make -j4 gcc takes 15 s, and make -j8 gcc takes 9 s, so the time and patience savings are real if you have to compile the code several times per day. To make your life simpler, you can add an alias to .bash_aliases such as alias make='make -j4' (see the section about the .bash_aliases file below). DO NOT USE MAKE -J ON NSF HPC SYSTEMS: it is against the rules. On the Cube, keep it to four cores or fewer so as not to disturb other users, but use all available cores if on the cloud or in an interactive session.

Check the size of files and directories using du -hs

The title above is quite self-explanatory. Running du -hs <file name> will tell you its size.

Check the date and time a file was created or last modified using the stat command

Also rather self-explanatory. Running stat <file name> is really useful if you cannot remember on which file you saved the output last time you ran your program.

Split large files into smaller chunks with the split command and put them back together with cat

This works for splitting a large text file into files with fewer lines, as well as for splitting large binary files (such as large tar files) so that you can, for example, upload them to GitHub or e-mail them to someone. To split a text file with 10,000 lines into ten files with 1,000 lines each, use:

 $ split -l 1000 myfile myfile_part

This will result in ten files called myfile_partaa, myfile_partab, and so on, each with 1,000 lines (add the -d flag if you prefer numeric suffixes like myfile_part00). Similarly, the command below would break a binary file into parts of 50 MB each:

 $ split -b 50m myfile myfile_part

To put all files back together in either case, run:

$ cat myfile_part* > myfile

More information about the split command can be found in Joe’s post about it.

Checking your HPC submission history with `sacct`

Another quite self-explanatory title. If you want to remember when you submitted something, such as to check whether an output file resulted from this or that submission (see the stat command above), just run the command below on one line:

$ sacct -S 2019-09-18 -u bct52 --format=User,JobID,Jobname,start,end,elapsed,nnodes,nodelist,state

This will result in an output similar to the one below:

bct52 979 my_job 2019-09-10T21:48:30 2019-09-10T21:55:08 00:06:38 1 c0001 COMPLETED
bct52 980 skx_test_1 2019-09-11T01:44:08 2019-09-11T01:44:09 00:00:01 1 c0001 FAILED
bct52 981 skx_test_1 2019-09-11T01:44:33 2019-09-11T01:56:45 00:12:12 1 c0001 CANCELLED
bct52 1080 skx_test_4 2019-09-11T22:07:03 2019-09-11T22:08:39 00:01:36 4 c[0001-0004] COMPLETED
1080.0 orted 2019-09-11T22:08:38 2019-09-11T22:08:38 00:00:00 3 c[0002-0004] COMPLETED

Compare files with meld, fldiff, or diff

There are several programs that show the differences between text files. This is particularly useful when you want to see the changes between different versions of the same file, normally a source code file. If you are on a computer running a Linux OS or have an X server like Xming installed, you can use meld or kdiff3 for pretty output in a nice GUI, or fldiff to quickly handle files with a huge number of differences. Otherwise, diff will show you the differences in a cruder, pure-terminal, but still very much functional manner. The syntax for all of them is:

$ <command> <file1> <file2>

Except for diff, which is worth calling with the --color option:

$ diff --color <file1> <file2>

If you cannot run a graphical user interface but are feeling fancy today, you can install the ydiff Python package (done just once) with:

$ python3 -m pip install --user ydiff 

and pipe diff’s output to it with the following:

$ diff -u <file1> <file2> | python3 -m ydiff -s

This will show you the differences between two versions of a code file in a crystal clear, side by side, and colorized way.

Creating a .bashrc file for a terminal that’s easy to work with and good (or better) to look at

When we first log in to many Linux systems, the terminal is all black with white characters, it’s difficult to find the commands you typed amidst all the output printed on the screen, and autocomplete and history search are limited. In short, it’s a real pain, and it makes you long for Windows as much as you long for your mother’s weekend dinner. There is, however, a way of making the terminal less of a pain to work with, which is by creating a file called .bashrc with the right contents in your home directory. Below is an example of a .bashrc file with the following features for you to just copy and paste into your home directory (e.g., /home/username/, or ~/ for short):

  • Colorize your username and show the directory you’re currently in, so that it’s easy to see when the output of a command ends and the next one begins, as in the section “Check the output of a command running in the background.”
  • Allow for a search function with the up and down arrow keys. This way, if you’re looking for all the times you typed a command starting with sbatch, you can just type “sba” and hit up arrow until you find the call you’re looking for.
  • A function called extract that extracts a compressed file regardless of its format, so there is no more need to remember tar options, unzip, unrar, etc., so long as you have all of them installed.
  • Colored man pages. This means that when you look for the documentation of a program using man, such as man cat to see all available options for the cat command, the output will be colorized.
  • A function called pretty_csv to let you see csv files in a convenient, organized and clean way from the terminal, without having to download it to your computer.
# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi

# Load aliases
if [ -f ~/.bash_aliases ]; then
. ~/.bash_aliases
fi

# Automatically added by module
shopt -s expand_aliases

if [ ! -z "$PS1" ]; then
PS1='\[\033[G\]\[\e]0;\w\a\]\n\[\e[1;32m\]\u@\h \[\e[33m\]\w\[\e[0m\]\n\$ '
bind '"\e[A":history-search-backward'
bind '"\e[B":history-search-forward'
fi

# readline options (these must go through bind when set from .bashrc)
bind "set show-all-if-ambiguous on" 2>/dev/null
bind "set completion-ignore-case on" 2>/dev/null
export PATH=/usr/local/gcc-7.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/gcc-7.1/lib64:$LD_LIBRARY_PATH

history -a
export DISPLAY=localhost:0.0

sshd_status=$(service ssh status)
if [[ $sshd_status = *"is not running"* ]]; then
sudo service ssh --full-restart
fi

HISTSIZE=-1
HISTFILESIZE=-1

extract () {
if [ -f $1 ] ; then
case $1 in
*.tar.bz2)   tar xvjf $1    ;;
*.tar.gz)    tar xvzf $1    ;;
*.bz2)       bunzip2 $1     ;;
*.rar)       unrar x $1       ;;
*.gz)        gunzip $1      ;;
*.tar)       tar xvf $1     ;;
*.tbz2)      tar xvjf $1    ;;
*.tgz)       tar xvzf $1    ;;
*.zip)       unzip $1       ;;
*.Z)         uncompress $1  ;;
*.7z)        7z x $1        ;;
*)           echo "don't know how to extract '$1'..." ;;
esac
else
echo "'$1' is not a valid file!"
fi
}

# Colored man pages
export LESS_TERMCAP_mb=$'\E[01;31m'
export LESS_TERMCAP_md=$'\E[01;31m'
export LESS_TERMCAP_me=$'\E[0m'
export LESS_TERMCAP_se=$'\E[0m'
export LESS_TERMCAP_so=$'\E[01;44;33m'
export LESS_TERMCAP_ue=$'\E[0m'
export LESS_TERMCAP_us=$'\E[01;32m'

# Combine multiline commands into one in history
shopt -s cmdhist

# Ignore duplicates, ls without options and builtin commands
HISTCONTROL=ignoredups
export HISTIGNORE="&:ls:[bf]g:exit"

pretty_csv () {
cat "$1" | column -t -s, | less -S
}

There are several .bashrc example files online with all sorts of functionalities. Believe me, a nice .bashrc will make your life A LOT BETTER. Just copy and paste the above into a text file called .bashrc and save it in your home directory on your local machine or HPC system.

Make the terminal far more user-friendly and less arcane by setting up a .bash_aliases file

You should also have a .bash_aliases file to significantly reduce typing and to colorize the output of commands you often use for ease of navigation. Just copy all of the below into a file called .bash_aliases and copy it into your home directory (e.g., /home/username/, or ~/ for short). This way, every time you run the command between the word “alias” and the “=” sign, the command after the “=” sign will be run.

alias ls='ls --color=tty'
alias ll='ls -l --color=auto'
alias lh='ls -al --color=auto'
alias lt='ls -alt --color=auto'
alias uu='sudo apt-get update && sudo apt-get upgrade -y'
alias q='squeue -u '
alias qkill='scancel $(qselect -u bct52)'
alias csvd="awk -F, 'END {printf \"Number of Rows: %s\\nNumber of Columns: %s\\n\", NR, NF}'"
alias grep='grep --color=auto'                          #colorize grep output
alias gcc='gcc -fdiagnostics-color=always'                           #colorize gcc output
alias g++='g++ -fdiagnostics-color=always'                          #colorize g++ output
alias paper='cd /my/directory/with/my/beloved/paper/'
alias res='cd /my/directory/with/my/ok/results/'
alias diss='cd /my/directory/of/my/@#$%&/dissertation/'
alias aspell='aspell --lang=en --mode=tex check'
alias aspellall='find . -name "*.tex" -exec aspell --lang=en --mode=tex check "{}" \;'
alias make='make -j4'

Check for spelling mistakes in your Latex files using aspell

Command-line spell checker, you know what this is.

aspell --lang=en --mode=tex check <file name>

To run aspell check on all the Latexfiles in a directory and its subdirectories, run:

find . -name "*.tex" -exec aspell --lang=en --mode=tex check "{}" \;

Easily share a directory on certain HPC systems with others working on the same project [Hint from Stampede 2]

Here’s a great way to set permissions recursively to share a directory named projdir with your research group:

$ lfs find projdir | xargs chmod g+rX

Using lfs is faster and less stressful on Lustre than a recursive chmod. The capital “X” assigns group execute permissions only to files and directories for which the owner has execute permissions.

Run find and replace in all files in a directory [Hint from Stampede 2]

Suppose you wish to remove all trailing blanks in your *.c and *.h files. You can use the find command together with the sed command, with in-place editing and regular expressions, to do this. Starting in the current directory you can do:

$ find . -name '*.[ch]' -exec sed -i -e 's/ \+$//' {} \;

The find command locates all the *.c and *.h files in the current directory and below. The -exec option runs the sed command, replacing {} with the name of each file. The -i option tells sed to make the changes in place. The s/ \+$// tells sed to replace one or more blanks at the end of the line with nothing. The \; is required to let find know where the text for the -exec option ends. Being an effective user of sed and find can make a great difference in your productivity, so be sure to check Tina’s post about them.

Other posts in this blog

Be sure to look at other posts in this blog, such as Jon Herman’s post about ssh, Bernardo’s post about other useful Linux commands organized by task to be performed, and Joe’s posts about grep (search inside multiple files) and cut.

Remote terminal environment using VS Code for Windows and Mac

On Windows machines, the application MobaXterm is a valuable tool for computing on virtual machines and working through SSH clients. David Gold’s blog post walks through the installation and use of this app, which works well in Windows environments.

Working remotely on my Mac laptop, I have been struggling to achieve the same workflow as in the office, with a Windows machine. Unfortunately, MobaXterm is not available for download on Mac OS. Looking for alternatives, I discovered that using VS Code with the “Remote – SSH” extension is a great replacement with significant advantages over MobaXterm, as it is an SSH client interface and code editor in one.

A screenshot from my VS Code remote interface, with the graphical file browser on the left panel, the SSH server terminal on the bottom-right, and the VS Code editor on the top-right.

Here’s how you can set up a remote session on Mac (and Windows) using VS Code: 

  1. Install the VS Code application here. For installation help and a brief overview of the app, check out this video.
  2. With VS Code opened, go to View -> Extensions, and search “Remote – SSH.” Click on the extension and press the green “Install” button. You should see the message “This extension is enabled globally” appear. Check out this extension’s description below (I’ll run through the basics in this post).
  3. On the bottom left of your screen, there should be a small green box with two opposite-pointing arrowheads (the green box is the Remote – SSH extension). Click this.
  4. Choose the first pop-up option “Remote-SSH: Connect to host…” and then select “Add New SSH Host…” to connect to your SSH client.
  5. Here, enter your remote SSH username@serverid (here at Cornell, this would be yournetid@thecube.cac.cornell.edu to connect to our remote computing cluster, the Cube).
  6. In the same pop-up window, click the remote server that you just added. A new window will open and prompt you to enter your password for the server.
  7. Now you are in your remote SSH environment. Click “Open folder…” and select “OK” to see your remote directory on the left. You can navigate through these files in your remote machine the same way as in MobaXterm. Click View -> Terminal to see your SSH command line on the bottom of the screen (here’s where you can actually run the programs on your cluster).

Now using VS Code, you can install other extensions to aid in code editing in different languages (here’s an article with a few good ones for various uses). This environment has the same functionality as MobaXterm, without having to switch applications for editing code. Run your cluster programs in the terminal window and edit the code in the main VS Code editor!

EnGauge: R Code Repository for Environmental Gauge Data Acquisition, Processing, and Visualization

Introduction and Motivation

Gauge data is an essential component of water systems research projects; however, data acquisition, processing, and exploratory (spatio-temporal) data analysis often consume a large chunk of limited project research time. I developed the EnGauge GitHub repository to reduce the time required to download, process, and explore streamflow, water quality, and weather station gauge data that are hosted primarily on U.S. government servers. This repository compiles and modifies functions from other Packages for Hydrological Data Retrieval and Statistical Analysis, and develops new functions for processing and exploring the data.

Data Acquisition

Given a polygon shapefile of the region of interest and an optional radial buffer size, the types of gauge data downloaded can include:

  1. USGS streamflow from the NWIS portal
  2. EPA STORET, USGS, USDA and other water quality data via the water quality portal
  3. NOAA ACIS, GHCN weather station data

The USGS R package dataRetrieval and the NOAA rnoaa package contain the primary functions used for data acquisition. Additional references to learn about these packages are available in the EnGauge README file and at the provided web links.
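To make the acquisition step concrete, here is a minimal sketch (not the EnGauge workflow itself) that pulls daily streamflow for one of the Gwynns Falls gauges from the example below using the dataRetrieval package; the site number, parameter code, and date range are placeholders you would swap for your own region of interest.

#Minimal dataRetrieval sketch: daily streamflow for one gauge
library(dataRetrieval)

site_no  <- "01589330"   #a Gwynns Falls gauge from the example; replace with your own site(s)
param_cd <- "00060"      #USGS parameter code for discharge (cfs)

q_daily <- readNWISdv(siteNumbers = site_no,
                      parameterCd = param_cd,
                      startDate = "2000-01-01",
                      endDate = "2020-12-31")

q_daily <- renameNWISColumns(q_daily)   #replace cryptic USGS column names (e.g. X_00060_00003 -> Flow)
head(q_daily)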

Data Processing

Significant processing is required to use some of these gauge datasets for environmental modeling. The EnGauge repository has functions that may be used to address the following common data processing needs (a generic sketch of steps 4 and 5 follows this list):

  1. Check for duplicate records
  2. Check for zeros and negative values
  3. Check detection limits
  4. Fill date gaps (add NAs to dates missing from timeseries)
  5. Aggregate to daily, monthly, and/or annual timeseries
  6. Project spatial data to a specified coordinate system
  7. Write processed data to shapefiles, .txt files, and lists that can be loaded into other software for further analysis and/or modeling.
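To make steps 4 and 5 concrete, here is a generic sketch (not the EnGauge functions themselves) that fills date gaps with NAs and then aggregates to a monthly timeseries; it assumes a data frame q_daily with columns Date and Flow, such as the one produced by the acquisition sketch above.

#Step 4: join onto a complete daily sequence so that missing days become NA
library(dplyr)
full_dates <- data.frame(Date = seq(min(q_daily$Date), max(q_daily$Date), by = "day"))
q_filled <- left_join(full_dates, q_daily, by = "Date")

#Step 5: aggregate to monthly means (a month with any missing days returns NA)
q_monthly <- q_filled %>%
  mutate(year = format(Date, "%Y"), month = format(Date, "%m")) %>%
  group_by(year, month) %>%
  summarise(Flow_mean = mean(Flow), .groups = "drop")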

Data Visualization and Exploratory Data Analysis – From GitHub Example

This example is applied to the Gwynns Falls watershed in the Baltimore Ecosystem Study Long Term Ecological Research site. The following figures are some of the output from the EnGauge USGSdataRetrieval.R script (as of commit 2fc84cd).

  1. Record lengths at each gauge (figure: StremflowGauges_RecordLengths)
  2. Locations of sites with zero and/or negative values (figure: Streamflow_ZerosNegsMap_fn)
  3. Locations of sites with different water quality information: total nitrogen and total phosphorus in this example (figure: TNTPsites)
  4. Locations of sites with certain weather station data: maximum temperature in this example (figure: NOAA_Datatype_TMAX)
  5. Visualizing quality codes on timeseries (figure: TP_Timeseries_MDDNR-GWN0115)
  6. Summary exploratory spatial data analysis for sites (figure: StreamflowExceedanceTimeseries_Map_01589330)
  7. Summary daily, monthly, and annual information (figure: StreamflowEDA_01589330)
  8. Monthly heatmap (figure: TNMonthly_MDDNR-GWN0115); a generic sketch follows this list
  9. Outlier visualization: currently implements a simplistic global spatio-temporal method defined by flows greater than a selected quantile. Plots offer qualitative support for the flows at other stations on the dates with high outliers at the reference station. (figure: Outlier99Quantile_01589320)
  10. DEM vs. gauge elevation: if you supply a DEM, the reported gauge elevation can be compared to the DEM elevation within the region of interest (ROI) (figure: CompareGaugeElevToDEM_ROI_fn)
  11. Seasonal scatterplot with histograms: if you have two timeseries of different data types, e.g. streamflow and water quality, a scatterplot by season may be made (not in example code, but a function is available in the repository) (figure: TN_ScatterHist01583570 POBR)
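As a rough illustration of the monthly heatmap idea (item 8), the sketch below builds on the q_monthly data frame from the data processing sketch above; EnGauge’s own plotting functions differ, so treat this purely as a template.

#Monthly heatmap sketch with ggplot2
library(ggplot2)
ggplot(q_monthly, aes(x = year, y = month, fill = Flow_mean)) +
  geom_tile() +
  scale_fill_viridis_c(name = "Mean flow (cfs)") +
  labs(x = "Year", y = "Month", title = "Monthly mean streamflow")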

Concluding Thoughts

This repository can be used to download gauge data from several sources, to employ standard data processing methods across those sources, and to explore the resulting data. Spend less time getting your data ready to do your research, and more time thinking about what your data are telling you and actually using them for modeling. Check out the EnGauge repository for your next research project!

Performing Experiments on HPC Systems

A lot of the work we do in the Reed lab involves running computational experiments on High Performance Computing (HPC) systems. These experiments often consist of performing multi-objective optimization to aid decision making in complex systems, a task that requires hundreds of thousands of simulation runs and may not be possible without the use of thousands of computing cores. This post will outline some best practices for performing experiments on HPC systems.

1. Have a Plan

By nature, experiments run on HPC systems will consume a large amount of computational resources and generate large amounts of data. In order to stay organized, it’s important to have a plan for both how the computational resources will be used and how data will be managed.

Estimating your computational resources

Estimating the scale of your experiment is the first step to running on an HPC system. To make a reasonable estimate, you’ll need to gather the following pieces of information:

  • How long (in wall clock time) does a single model run take on your local machine?
  • How many function evaluations (for an MOEA run) or ensemble model runs will you need to perform?
  • How long do you have in wall clock time to run the experiment?

Using this information, you can estimate the number of parallel processes that you will need to successfully run the experiment. Applications such as running the Borg MOEA are known as “embarrassingly parallel” and scale quite well with an increase in processors, especially for problems with long function evaluation times (see Hadka and Reed, 2013 for more info). However, many other applications scale poorly, so it’s important to be aware of the parallel potential of your code. A helpful tip is to identify any inherently serial sections of the code which create bottlenecks to parallelization. Parallelizing tasks such as Monte Carlo runs and MOEA function evaluations will often result in higher efficiency than parallelizing the simulation model itself. For more resources on how to parallelize your code, see Bernardo’s post from last year.

Once you have an idea of the scale of your experiment, you’ll need to estimate the experiment’s computational expense. Each HPC resource has its own charging policy to track resource consumption. For example, XSEDE tracks charges in “service units,” which are defined differently for each cluster. On the Stampede2 Cluster, a service unit is defined as one node-hour of computing time, so if you run on 100 nodes for 10 hours, you spend 1,000 service units regardless of how many cores per node you utilize. On the Comet Cluster, a service unit is charged by the core-hour, so if you run 100 nodes for 10 hours and each node utilizes 24 cores, you’ll be charged 24,000 service units. Usually, the allocations you receive for each resource will be scaled accordingly, so even though Comet looks more expensive, you likely have a much larger allocation to work with. I usually make an estimate of the service units I need for an experiment and add another 20% as a factor of safety.
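As a back-of-the-envelope example of that estimate (the numbers below are placeholders, not recommendations), a few lines of R are enough to sanity-check your request:

#Rough service unit estimate; replace the assumed values with your own timing tests
runtime_single <- 2 / 60      #hours per model run, measured on your local machine
n_runs         <- 100000      #total function evaluations or ensemble runs
cores_per_node <- 48          #e.g. a Stampede2 Skylake node
n_nodes        <- 20          #nodes you plan to request

total_core_hours <- runtime_single * n_runs
wall_clock_hours <- total_core_hours / (n_nodes * cores_per_node)

su_node_hours <- n_nodes * wall_clock_hours                    #Stampede2-style charging (node-hours)
su_core_hours <- n_nodes * cores_per_node * wall_clock_hours   #Comet-style charging (core-hours)

su_request <- ceiling(1.2 * su_node_hours)                     #add a 20% factor of safety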

Data management:

Large experiments often create proportionately large amounts of data. Before you start, it’s important to think about where this data will be stored and how it will be transferred to and from the remote system. Many clusters have limits on how much you can store on different drives, and breaking these limits can cause performance issues for the system. System administrators often don’t take kindly to these performance issues, and in extreme cases, breaking the rules may result in suspension or removal from a cluster. It helps to create an informal data management plan for yourself that specifies:

  1. How will you transfer large amounts of data to and from the cluster? (Tools such as Globus are helpful here.)
  2. Where will you upload your code, and how will your files be structured?
  3. Where will you store data during your experimental runs? Often clusters have “scratch drives” with large or unlimited storage allocations. These drives may be cleaned periodically, so they are not suitable for long-term storage.
  4. Where will you store data during post-processing? This may still be on the cluster if your post-processing is computationally intensive or your local machine can’t handle the data size.
  5. Where will you store your experimental results and model data for publication and replication?

2. Test on your local machine

To make the most of your time on a cluster, it’s essential that you do everything you can to ensure your code is properly functioning and efficient before you launch your experiment. The biggest favor you can do for yourself is to properly test your code on a local machine before porting it to the HPC cluster. I always run the following four checks first:

  1. Unit testing: the worst case scenario after an HPC run is to find out there was a major error in your code that invalidates your results. To mitigate this risk as much as possible, it’s important to have careful quality control. One helpful tool for quality control is unit testing, which examines every function or module in your code to ensure it is working as expected. For an introduction to unit testing, see Bernardo’s post on Python and C++ (a minimal example in R follows this list).
  2. Memory checking: in low level code (think C, C++), memory leaks can be a silent problem that throws off your results or crashes your model runs. Sometimes, memory leaks go undetected during small runs but add up and crash your system in large-scale experiments. To test for memory leaks, make sure to use tools such as Valgrind before uploading your code to any cluster. This post features a nice introduction to Valgrind with C++.
  3. Timing and profiling: Profile your code to see which parts take longest and eliminate unnecessary sections. If you can, optimize your code to avoid computationally intensive pieces of code such as nested loops. There are numerous tools for profiling low level codes such as Callgrind and gprof. See this post for some tips on making your C/C++ code faster.
  4. Small MOEA tests: running small test runs (>1000 NFE) will give you an opportunity to ensure that the model is properly interacting with the MOEA (i.e., objectives are connected properly, decision variables are being changed, etc.). Make sure you are collecting runtime information so you can evaluate algorithmic performance.
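As a small illustration of point 1, here is a minimal unit test written in R with the testthat package; annual_max() is a hypothetical helper invented for this sketch, and the same idea carries over to Python or C++ testing frameworks.

library(testthat)

#Hypothetical helper: annual maximum of a daily series
annual_max <- function(flow, year) {
  tapply(flow, year, max, na.rm = TRUE)
}

test_that("annual_max returns one value per year and ignores NAs", {
  flow <- c(1, 5, 3, NA, 8, 2)
  year <- c(2001, 2001, 2001, 2002, 2002, 2002)
  result <- annual_max(flow, year)
  expect_length(result, 2)
  expect_equal(unname(result), c(5, 8))
})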

3. Stay organized on the cluster

After you’ve fully tested your code, it’s time to upload and port to the cluster. After transferring your code and data, you’ll likely need to use the command line to navigate files and run your code. A familiarity with Linux command line tools, bash scripting, and command line editors such as vim can make your life much easier at this point. I’ve found some basic Linux training modules online to be very helpful; in particular, “Learning Linux Command Line” from LinkedIn Learning (formerly Lynda.com) is a good starting point.

Once you’ve got your code properly organized and compiled, validate your timing estimate by running a small ensemble of simulation runs. Compare the timing on the cluster to your estimates from local tests and reconfigure your model runs if needed. If performing an experiment with an MOEA, run a small number of NFE on a development or debug node to confirm that the algorithm is running and properly parallelizing. Then run a single seed of the MOEA and perform runtime diagnostics to ensure things are working more or less as you expect. Finally, you’re ready to run the full experiment.

I’d love to hear your thoughts and suggestions

These tips have been derived from my experiences running on HPC systems. If anyone else has tips or best practices that you find useful, I’d love to hear about them in the comments.

Establishing an Effective Data Backup Strategy for Your Workstation

When determining the data management strategy for your workflow, it is paramount to consider a range of backup options for your data beyond a single copy on your workstation or external hard drive. Creating a seamless workspace that transitions easily between workstations while maintaining durability and availability is achievable once you know what resources are available and follow a few general guidelines.

Considering how you will manage and share data is crucial, especially for collaborative projects where files must often be accessible in real time.

Considering how long you might need to retain data and how often you might need to access it will drastically change your approach to your storage strategy.

3-2-1 Data Backup Rule

If you walk away from this with nothing else, remember the 3-2-1 rule. The key to ensuring durability of your data—preventing loss due to hardware or software malfunction, fire, viruses, and institutional changes or uproars—is following the 3-2-1 Rule: maintain three or more copies on two or more different media (e.g. cloud and HDD), with at least one off-site copy.

(Figure: the 3-2-1 backup rule. Source: https://cactus-it.co.uk/the-3-2-1-backup-rule/)

An example of this would be to have a primary copy of your data on your desktop that is backed up continuously via Dropbox and nightly via an external hard drive. There are three copies of your data between your local workstation, external hard drive (HD), and Dropbox. By having your media saved on hard drive disks (HDDs) on your workstation and external HD in addition to ‘the cloud’ (Dropbox), you have accomplished spreading your data across exactly two different mediums. Lastly, since cloud storage is located on external servers connected via the internet, you have successfully maintained at least one off-site copy. Additionally, with a second external HD, you could create weekly/monthly/yearly backups and store this HD offsite.

Version Control Versus Data Backup

Maintaining a robust version control protocol does not ensure your data will be properly backed up and vice versa. Notably, you should not be relying on services such as GitHub to back up your data, only your code (and possibly very small datasets, i.e. <50 MB). However, you should still maintain an effective strategy for version control.

  • Code Version Control
  • Large File Version Control
    • GitHub is not the place to be storing and sharing large datasets, only the code to produce large datasets
    • Git Large File Storage (LFS) can be used for Git-based version control of large files

Data Storage: Compression

Compressing data reduces the amount of storage required (thereby reducing cost), but ensuring the data’s integrity is an extremely complex topic that is continuously changing. While standard compression formats (e.g. .ZIP and HDF5) are generally effective and reliable, accessing such files requires an additional step (decompression) before the data are in a usable format. It is common practice (and often a common courtesy) to compress files prior to sharing them, especially when emailed.

7-Zip is a great open-source tool for standard compression file types (.ZIP, .RAR) and has its own compression file type. Additionally, a couple of guides looking into using HDF5/zlib for NetCDF files are located here and here.
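If you prefer to script compression rather than use a GUI tool, base R’s utils functions can build standard archives (they call the system’s zip/tar programs, which must be installed); the file names below are placeholders.

#Compress files before sharing them
files_to_share <- c("results_summary.csv", "model_parameters.txt")   #placeholder file names

#Standard .zip archive
zip(zipfile = "results_to_share.zip", files = files_to_share)

#Or a gzip-compressed tarball
tar(tarfile = "results_to_share.tar.gz", files = files_to_share, compression = "gzip")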

Creating Your Storage Strategy

To comply with the 3-2-1 strategy, you must actively choose where you wish to back up your files. In addition to pushing your code to GitHub, you need to choose how best to back up your other files, while considering any requirements you might have for your data handling; some common requirements and suitable options are listed below.

My personal strategy costs approximately $120 per year. For my workstation on campus, I primarily utilize Dropbox with a now-outdated version history plugin that allows me to access files up to one year after deletion. Additionally, I instantaneously sync these files to Google Drive (guide to syncing). Beyond these cloud services, I utilize an external HDD that backs up select directories nightly (refer below to my script that works with Windows 7).

It should be noted that Cornell could discontinue its contracts with Google so that unlimited storage on Google Drive is no longer available. Additionally, it is likely that Cornell students will lose access to Google Drive and Cornell Box upon graduation, rendering these options impractical for long-term or permanent storage.

  • Minimal Cost (Cornell Students)
    • Cornell Box
    • Google Drive
    • Local Storage
    • TheCube
  • Accessibility and Sharing
    • DropBox
    • Google Drive
    • Cornell Box (for sharing within Cornell, horrid for external sharing)
  • Minimal Local Computer Storage Availability: access via web interface (cloud storage) or File Explorer

    • DropBox
    • Google Drive (using Google Stream)
    • Cornell Box
    • TheCube
    • External HDD
  • Reliable (accessibility through time)
    • Local Storage (especially an external HDD if you will be relocating)
    • Dropbox
    • TheCube
  • Always Locally Accessible
    • Local Storage (notably where you will be utilizing the data, e.g. keep data on TheCube if you plan to utilize it there)
    • DropBox (with all files saved locally)
    • Cornell Box (with all files saved locally)
  • Large Capacity (~2 TB total)
    • Use Cornell Box or Google Drive
  • Extremely Large Capacity (or unlimited file size)

Storage Option Details and Tradeoffs

Working with large datasets across workstations can be challenging, changing the problem from simply incorporating the files directly within your workflow to interacting with them from afar (e.g. keeping and utilizing files on TheCube).

But on a personal computer level, the most significant differentiator between storage types is whether you can (almost) instantaneously update and access files across computers (cloud-based storage with desktop file access) or whether manual/automated backups occur. I personally like to have the majority of my files easily accessible, so I utilize Dropbox and Google Drive to constantly update between computers. I also back up all of my files from my workstation to an external hard drive just to maintain an extra layer of data protection in case something goes awry.

  • Requirements for Data Storage
  • Local Storage: The Tried and True
    • Internal HDD
      • Installed on your desktop or laptop
      • Can most readily access data for constant use, making interactions with files the fastest
      • Likely the most at-risk version due to potential exposure to viruses in addition to nearly-constant uptime (and bumps for laptops)
      • Note that solid state drives (SSDs) do not support the same number of read/write cycles as an HDD, which can lead to slowdowns or even failures if improperly managed. However, newer SSDs are less prone to these issues due to a combination of firmware and hardware advances.
      • A separate data drive (a secondary HDD that stores data and not the primary operating system) is useful for expanding easily-accessible space. However, it is not nearly as isolated as data contained within a user’s account on a computer and must be properly configured to ensure privacy of files
    • External Hard Drive Disk (HDD)
      • One-time cost ($50-200), depending on portability/size/speed
      • Can allow an offline version of your data to be stored, preventing newly introduced viruses (e.g. ransomware) from blocking access to or corrupting older versions; this requires keeping the drive isolated from your regular workflow
      • May back up data instantaneously or as often as desired: general practice is to back up nightly or weekly
      • Software provided with external hard drives is generally less effective than self-generated scripts (e.g. Robocopy in Windows)
      • Unless properly encrypted, can be easily accessed by anyone with physical access
      • May be used without internet access, only requiring physical access
      • Higher-quality (and higher-priced) HDDs generally offer greater capacity and/or read/write speeds
    • Alternative Media Storage
      • Flash Thumb Drive
        • Don’t use these for data storage, only temporary transfer of files (e.g. for a presentation)
        • Likely to be lost
        • Likely to malfunction/break
      • Outdated Methods
        • DVD/Blu-Ray
        • Floppy Disks
        • Magnetic Tapes
      • M-Discs
        • Requires a Blu-Ray or DVD reader/writer
        • Supposedly lasts multiple lifetimes
        • 375 GB for $67.50
  •  Dropbox
    • My experience is that Dropbox is the easiest cloud-storage solution to use
    • Free Version includes 2 GB of space without bells and whistles
    • 1 TB storage for $99.00/year
    • Maximum file size of 20 GB
    • Effective (and standard) for filesharing
    • 30-day version history (extended version history for one year can be purchased for an additional $39.00/year)
    • Professional, larger plans with additional features (e.g. collaborative document management) also available
    • Can easily create collaborative folders, but storage counts against all individuals added (an issue if individuals are sharing large datasets)
    • Can be accessed through both a web interface and desktop clients across operating systems
    • Fast upload/download speeds
    • Previous version control can allow access to previous versions if ransomware becomes an issue
    • Supports two-factor authentication
    • Requires internet access for online storage/backup, but has offline access
  • Google Drive
    • My experience is that Google Drive is relatively straightforward
    • Unlimited data/email storage for Cornell students, staff, and faculty
    • Costs $9.99/mo for 1 TB
    • Maximum file size of 5 GB
    • Easy access to G Suite, which allows for real-time collaboration on browser-based documents
    • Likely to lose access to storage capabilities upon graduation
    • Google Drive is migrating over to Google Stream which stores less commonly used files online as opposed to on your hard drive
    • Google File Stream (used to sync files with desktop) requires a constant internet connection except for recently-used files
    • Previous version control can allow access to previous versions if ransomware becomes an issue
    • Supports two-factor authentication
    • Requires internet access for online storage/backup
  • Cornell Box
    • My experiences are that Cornell Box is not easy to use relative to other options
    • Unlimited storage space, 15 GB file-size limit
    • Free for Cornell students, staff, and faculty, but alumni lose access once graduating
    • Can only be used for university-related activities (e.g. classwork, research)
    • Shareable links work well for internal Cornell users; however, external users must create an account to access files, which makes external sharing cumbersome
    • Version history retains the 100 most recent versions for each file
    • Can connect with Google Docs
    • Previous version control can allow access to previous versions if ransomware becomes an issue
    • Supports two-factor authentication
    • Requires internet access for online storage/backup, but has offline access
  • TheCube

Long-Term (5+ Years) Data Storage

It should be noted that most local media types degrade through time. Utilizing the 3-2-1 strategy is most important for long-term storage (with an emphasis on multiple media types and off-site storage). Notably, even if stored offline and never used, external HDDs, CDs, and Blu-Ray disks can only be expected to last at most around five years. Other strategies, such as magnetic tapes (10 years) or floppy disks (10-20 years), may last longer, but there is no truly permanent storage strategy (source of lifespans).

M-Discs are a write-once (i.e. read-only, cannot be modified) storage strategy that is projected to last many lifetimes and up to 1,000 years. If you’re able to dust off an old Blu-Ray disk reader/writer, M-Discs are probably the long-term data storage strategy most likely to survive the test of time; making two copies stored in two locations is definitely worthwhile. However, the biggest drawback is that M-Discs are relatively difficult to access compared to plugging in an external HD.

Because of the range of lifespans and how cheap storage has become, I would recommend maintaining your old (and likely relatively small) data archives within your regular storage strategy which is likely to migrate between services through time.

For larger datasets that you are required to retain and would like to easily access, I would maintain them on at least two offline external hard drives stored in separate locations (e.g. at home and at your office), occasionally (e.g. every six months) checking the health of the drives and replacing them as required.

Relying only on cloud storage for long-term storage is not recommended due to the possibility of companies closing their doors or simply deactivating your account. However, they can be used as an additional layer of protection in addition to having physical copies (i.e. external HD, M-Discs).

Windows 7 Robocopy Code

The script I use for backing up specific directories from my workstation (Windows 7) to an external HD is shown below. To set up my hard drive, I first formatted it to a file system compatible with multiple operating systems using this guide. Note that different maximum file sizes and operating systems require different formats. Following this, I used the following guide to implement a nightly backup of all of my data while keeping a log on my C: drive. Note that only new files and new versions of existing files are copied over, ensuring that the backup does not take ages.

@echo off
rem Mirror selected directories to the external drive (F:), logging to C:\externalbackup.log
rem /E copies subdirectories (including empty ones); /XA:H and /XA:SH exclude hidden (and system) files
rem /W:0 and /R:3 limit retry waits/attempts; /REG saves them as defaults; /XJ skips junction points
robocopy C:\Users\pqs4\Desktop F:\Backups\Desktop /E /XA:H /W:0 /R:3 /REG > C:\externalbackup.log
robocopy E:\Dropbox F:\Backups\Dropbox /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log
robocopy C:\Users\pqs4\Downloads F:\Backups\Downloads /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log
robocopy C:\Users\pqs4\Documents F:\Backups\Documents /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log
robocopy C:\Users\pqs4 F:\Backups\UserProfile /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log
robocopy "E:\Program Files\Zotero" F:\Backups\Zotero /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log