Using Python, Sphinx, and reStructuredText to Create a Book (and Introducing our eBook: Addressing Uncertainty in Multisector Dynamics Research!)

We are pleased to announce that we have published our free eBook on Addressing Uncertainty in Multisector Dynamics Research, a primer on uncertainty characterization methods for research on sustainability, climate, and human-natural systems: https://lnkd.in/d6AMPnM3

This eBook is the product of years of collaboration across several teams that brings in perspectives from natural sciences, decision making under (deep) uncertainty, software engineering, statistics, and risk analysis. The breakdown of the book is as follows (summarized by Antonia Hadjimichael):

Chapter 1 uses the IM3 project as a living lab to encapsulate the challenges that emerge in bridging disciplines to make consequential model-based insights while acknowledging the tremendous array of uncertainties that shape them.

Chapter 2 helps the reader to better understand the importance of using diagnostic modeling and the diverse disciplinary perspectives that exist on how best to pursue consequential model-based insights.

Chapter 3 is a technical tools-focused primer for readers on the key elements of uncertainty characterization that includes ensemble-based design of experiments, quantitative methods for computing global sensitivities, and a summary of existing software packages.

Chapter 4 narrates for readers how and why the tools from the previous chapter can be applied in a range of tasks from diagnosing model performance to formal exploratory modeling methods for making consequential model-based discoveries.

The supplemental appendices provide a glossary, a brief summary of uncertainty quantification tools, and a suite of Jupyter notebook tutorials that provide hands-on training tied to the contents of Chapters 3 and 4.

A central theme of this eBook is that it is a living document that will be actively maintained through its GitHub repository. If you work on exploratory modeling, sustainability, management of human-natural systems, complex systems or adjacent areas we’d love your feedback on how the document could be improved! You can open an issue here.

Over the next few months, we will be posting more about the process of making a book and extending out some of the Jupyter Notebook coding exercises to get readers more comfortable with adjusting the source code for their own use. 

In this post, Chris Vernon and I will be discussing how you can create a similar style of eBook or documentation using Python and Sphinx. Subsequent posts will contain more information on how the book is hosted and managed in GitHub. Please note that this is a brief overview of some aspects of the eBook; for full details, the source code is available here. You can find the source files for this tutorial in this repo and download the index.html file to click through the HTML page interactively.

Introduction to Sphinx

Sphinx is a tool written in Python that can facilitate the creation of a variety of documentation. Sphinx will translate a set of reStructuredText (reST) files that are linked in a hierarchy defined in an index file. First download and install Sphinx as follows (Note: the following pip commands assume that your version of pip is pointing to the instance of Python you want to use):

$pip install -U sphinx

To quickly set up a directory, execute the following from your terminal or command line while in your desired parent directory:

$sphinx-quickstart

This will create a build and source directory along with a makefile (on Windows, a batch file: make.bat), and within the source directory, you will find an index.rst file and a file called conf.py. The index.rst file serves as the root document and houses the table of contents, which organizes the structure of the document.

Changing the Theme

The default Sphinx theme is “alabaster”, which has a minimalist look. There are many other built-in themes, and examples of documentation that use these themes can be found here. We are using an external theme for the book (found here) that has a sleek design and more of a “book feel”. To download this theme, simply execute the following in your terminal or command line:

$pip install sphinx-book-theme

Then change the html_theme variable in conf.py:

html_theme = 'sphinx_book_theme'

Adding Content

To add content to our book, we need to write our text files using reStructuredText, which is the default plaintext markup language used by Sphinx. The syntax is similar to LaTeX and is easy to pick up as you get used to it. We will first add front matter so that when the HTML page is accessed, the front of the book is not empty. We place the front matter in index.rst using a “directive”. Directives begin with an explicit markup start (two periods and a space), followed by the directive type and two colons (collectively, the “directive marker”). Here we use an epigraph directive to render a classic block quote.

Creating front matter
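As a rough text-only sketch (the quote here is just a placeholder; our actual front matter differs), an epigraph directive in index.rst looks something like this:

.. epigraph::

   All models are wrong, but some are useful.

   -- George E. P. Box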

Let’s build our documentation to see what our book looks like so far. I use a terminal called Cygwin since I am working on a Windows computer and run the following:

$ sphinx-build source/ build/

I can then open the resulting index.html located in the build directory by dragging and dropping it into my favorite browser or simply double clicking the file. 

eBook homepage

Now let’s add more sections to my book. I’m going to create a preface, a single section of material, and a bibliography. I first create a text file in my source directory that I call preface.rst. I add text to this file as shown below.

Preface Content

Then I add a section to the index.rst file that tells Sphinx where the preface should be placed in the book through the toctree directive (“table of contents tree”). We customize the table of contents to contain multiple sections and thus adjust the default toctree using the raw directive.

Updated Table of Contents
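For reference, a bare-bones toctree looks something like the following, where each entry is the name of an .rst file in the source directory (listed without its extension):

.. toctree::
   :maxdepth: 2

   preface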

Next we create a section of content. In this section, I will demonstrate paragraph text, citations, images, notes, and equations, which cover the gamut of what you’ll likely want to include in various parts of your book. Let’s first install the necessary extensions for the bibliography as follows:

$pip install sphinxcontrib-bibtex

We then must add the extension into conf.py. I also specify the name of the file that contains my references (refs.bib).

Adding bibliography extensions and file names to conf.py
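In conf.py, that amounts to something like the following (bibtex_bibfiles is the option name used by recent versions of sphinxcontrib-bibtex; refs.bib is our bibliography file):

extensions = ['sphinxcontrib.bibtex']
bibtex_bibfiles = ['refs.bib']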

Below is an example of what refs.bib looks like along with the Bibliography file.

Bibliography .rst file and associated BibTeX

R.Bibliography.rst and refs.bib must be placed in the source directory, and R.Bibliography.rst should be referenced in the index.rst file as follows.

Updated Table of Contents

Now let’s start writing our section and then place the reference to it in our index.rst file as shown above. In this section, note the first line that denotes the reference to the section that is used in the index.rst file. We show how to place a figure in lines 14-20 and the syntax for in-text citations is shown in the figure caption.   

If we render the webpage (e.g., sphinx-build source/ build/) it looks like this:

Updated webpage

The bibliography has the one reference so far:

Bibliography as seen on the webpage

The last two components that I will demonstrate are equations and notes, which may come in handy for scientific communications. In order to write equations, we use the MathJax Sphinx extension which allows for the best display of math in HTML.

Updated extensions
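Enabling MathJax is just another entry in the extensions list in conf.py, shown here alongside the bibliography extension we added earlier:

extensions = ['sphinxcontrib.bibtex', 'sphinx.ext.mathjax']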

In our same section, we add additional text to show the equation and note functionality.

Adding equations and notes
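A bare-bones sketch of the reST involved (the equation is only a placeholder) looks something like this:

.. math::

   E = mc^2

.. note::

   Notes render as highlighted admonition boxes in the built HTML.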

We can then render the webpage one last time.

Updated webpage

Well, there you go! After completing this tutorial, you will have the foundations of a book in place. Feel free to use these files as a base for your own project. Enjoy!

Efficient Storage and Querying of Geospatial Data with Parquet and DuckDB

This post is dedicated to Chris Vernon and Travis Thurber from PNNL who taught me how to use these tools!

Lately, I’ve been generating synthetic weather scenarios for basins in California. The weather scenarios are created at a daily time step and are informed by tree-ring products that span 600 years. We want to develop multiple ensembles of these scenarios that are representative of plausible future climate changes that the region could experience. It became clear to me that I would need to make sure that I was storing the data efficiently and that I was able to parse through these data quickly to generate plots and develop metrics.

Below, I wanted to discuss two points that have really helped my workflow: (1) focusing on file structure and compression and (2) changing the way that I explore the resulting scenarios.

File choice and compression

To illustrate the importance of considering how you choose to save your data, I grab the historical trace of daily precipitation over the last 50 years for the Tuolumne River Basin and load it into R as a dataframe. Below I show the code to take this dataframe and save it in various formats.

#Traditional RDS
system.time(saveRDS(prcp_tuolumne,"E:/blog/prcp_tuolumne.rds"))

#Compressed RDS
system.time(saveRDS(prcp_tuolumne,"E:/blog/prcp_tuolumne_compressed.rds",compress = 'xz'))

#Text 
system.time(write.table(prcp_tuolumne,"E:/blog/prcp_tuolumne.txt"))

#CSV
system.time(write.csv(prcp_tuolumne,"E:/blog/prcp_tuolumne.csv"))

#Parquet
library(arrow)
system.time(write_parquet(prcp_tuolumne,"E:/blog/prcp_tuolumne.parquet",compression = "gzip"))

#HDF5 
library(rhdf5)
system.time(h5write(prcp_tuolumne,"E:/blog/prcp_tuolumne.h5","df"))


#NetCDF
library(ncdf4)
# path, file name, and variable name
ncpath <- "E:/blog/"
ncname <- "prcp_tuolumne"
ncfname <- paste(ncpath, ncname, ".nc", sep="")
dname <- "prcp"  # short name of the precipitation variable
# define the time dimension (dates is the vector of observation dates from the original dataframe)
timedim <- ncdim_def("time","days since 1950-11-1",as.numeric(dates))
# define the precipitation variable
fillvalue <- 1e32
dlname <- "observed precipitation"
prcp_def <- ncvar_def("prcp","mm",list(timedim),fillvalue,dlname,prec = "double")
# create the netCDF file and write the data
ncout <- nc_create(ncfname,list(prcp_def),force_v4=TRUE)
system.time(ncvar_put(ncout,prcp_def,as.matrix(prcp_tuolumne[ ,1])))
nc_close(ncout)

Here, I summarize the resulting size and write times for each format.

File Type         Size (KB)    Writing Time (seconds)
RDS               9,065        2.13
Compressed RDS    4,012        17.73
HDF5              5,204        323.03
Text              26,851       12.69
CSV               27,024       12.14
netCDF            141,360      3.12
Parquet           8,396        0.98

Note that the original file size was on the smaller side, so this exercise may seem trivial, but when you are creating many ensembles of both precipitation and temperature, across hundreds of years, across many basins, and across many climate change scenarios, even small files add up and can potentially limit the scope of the experiments you can conduct if you don’t have enough file storage. Of these file types, one of my recent favorites has been Apache Parquet. Parquet is an open-source, column-oriented format that is specifically designed for efficient storage and accessibility of data. Whereas the compressed RDS file and HDF5 beat Parquet in terms of size, it takes much longer to write to these files and subsequently read them back in. Another advantage of Parquet is that it is recognized by common coding languages (R, MATLAB, Python), which allows for a more seamless workflow between models written in different languages. If you have a little more storage to work with, Parquet is a good choice for balancing the tradeoff between size and writing time.
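For example, the same Parquet file written above can be read straight into Python; a minimal sketch using pandas (which relies on pyarrow under the hood, and assumes the file path from the R code above) looks like this:

import pandas as pd

# read the gzip-compressed Parquet file written by the R workflow above
prcp_tuolumne = pd.read_parquet("E:/blog/prcp_tuolumne.parquet")
print(prcp_tuolumne.head())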

Querying data with SQL and DuckDB

Once your data are in an efficient format like Parquet, the next order of business is to make sure that you can easily sort through them and, as Chris would say, “talk to your data and let it talk back”. One way to communicate with your data and ask questions in a more semantically meaningful way is to use Structured Query Language (SQL). There are many methods of querying in R, but a useful data management tool that I have enjoyed interacting with is DuckDB, a relational database management system. DuckDB has an R API and allows you to use SQL syntax to query data in an efficient way. It should be noted that you can perform similar aggregation and subsetting using base R functions, but DuckDB allows you to perform these tasks faster and across multiple files that you may have in a directory.

Let’s use a case where you have developed 30 traces of daily basin-averaged synthetic precipitation for the Tuolumne based on the historical period, and each trace is stored in its own gzip-compressed Parquet file (“ensemble1.parquet”, “ensemble2.parquet”, …, “ensemble30.parquet”). Each ensemble looks something like this:

Example of the parquet file structure

Now let’s state a question that I want to ask to the data and see how to structure these questions using SQL and DuckDB.

“Hey DuckDB, can you create a dataframe for me that provides an annual average across all my synthetic traces? Let’s plot that with respect to the observed annual average and show the bounds of the synthetic traces.”

First we need to open a connection to DuckDB. Then we are going to have to create annual averages from the daily historical and synthetic traces and then create max/min bounds to show the additional variability we are getting from the generator.

In order to answer this question, DuckDB will query all of our Parquet files simultaneously (!) and return a dataframe with annual averages, and we can do some clever sub-querying to get the max/min of the averages. I really enjoy that I don’t have to read the files into my R workspace. R holds the objects it loads in memory, so those of you who have worked with larger datasets may have found that you are limited in how many datasets you can have open at once, which can be frustrating! This is not an issue with DuckDB.

#Historical trace (dates.year, dates.month, dates.day, and prcp_sacsma come from the earlier data prep)
historical=data.frame(dates.year,dates.month,dates.day,rowMeans(prcp_sacsma))
historical_yearly=aggregate(historical,by=list(dates.year),FUN=mean)

library(DBI)
library(duckdb)

# open connection to an in-memory DuckDB database
con <- dbConnect(duckdb::duckdb())

# query to find annual average across all synthetic traces
df=dbGetQuery(con, 
              "SELECT year,
           AVG(precipitation) AS yearly_average
           FROM 'E:/NHMM/blog/*.parquet'
           GROUP BY year
           ORDER BY year")

#For the max/min, we need to find the average first and then return the max/min values in something like a nested approach 

#First create a dataframe with all files 
all_files=dbGetQuery(con,"SELECT *
                FROM 'E:/NHMM/blog/*.parquet'")

# register the dataframe as a virtual DuckDB table, and give it a name
# (duckdb_register works on plain R dataframes; duckdb_register_arrow is for Arrow objects)
duckdb::duckdb_register(con, "all_files_table", all_files)


annual_average=dbGetQuery(con, 
               "SELECT year,
           AVG(precipitation) AS yearly_average
           FROM all_files_table
           GROUP BY year,sample
           ORDER BY year")

# register the dataframe as a virtual DuckDB table, and give it a name
duckdb::duckdb_register(con, "annual_table", annual_average)

#query to find max
max=dbGetQuery(con, 
               "SELECT year,
           MAX(yearly_average) AS max_yearly_average
           FROM annual_table
           GROUP BY year
           ORDER BY year")

#query to find min
min=dbGetQuery(con, 
               "SELECT year,
           min(yearly_average) AS min_yearly_average
           FROM annual_table
           GROUP BY year
           ORDER BY year")
#Plot the results!
library(ggplot2)
library(ggridges)  # provides theme_ridges()

# basin.wide.yearly holds the observed annual averages computed earlier
ggplot(df, aes(x = year, y = yearly_average)) +
  geom_ribbon(aes(ymin = min$min_yearly_average,
                  ymax = max$max_yearly_average),
              alpha = 0.2, fill = "blue") +
  geom_line(color = "blue", size = 2) +
  geom_line(data = basin.wide.yearly[3:64, ], aes(x = year, y = precipitation),
            color = "black", size = 2) +
  ggtitle("Annual Average Precipitation") +
  xlab("Year") + ylab("Annual Average Precipitation (mm)") + theme_ridges()

Here’s the resulting figure:

Synthetic generation in blue compared to the historical trace in black

“Hey DuckDB, from all of the synthetic traces that I have generated, how many instances are there where we are producing daily precipitation that is above 10 mm for specifically the years 1980 or 2000?”

In order to answer this question, DuckDB will once again query all our parquet files simultaneously and return a dataframe with all the instances of >10 mm and only for the years 1980 or 2000. This is how you would write this question in “code”.

library(arrow)
library(duckdb)
library(fs)
library(tidyverse)
library(DBI)
library(glue)

# open connection to DuckDB
con <- dbConnect(duckdb::duckdb())

df=dbGetQuery(con,"SELECT *
                FROM 'E:/blog/*.parquet'
                WHERE precipitation > 10
                AND (year = 1980 OR year = 2000)
                order BY year")

If we look at the size of the resulting dataframe it looks like we have generated 1335 instances of daily precipitation that are greater than 10 mm in specifically the years 1980 or 2000:

“Hey DuckDB, I want to think about meteorological drought. In how many instances am I half a standard deviation below the long-term average monthly precipitation?”

To try this out, let’s just look at one synthetic trace. First we need to find the long term average across the trace and let’s plot what a drought might look like.

# query to grab the first synthetic trace
scenario1=dbGetQuery(con, 
              "SELECT month,year,
           AVG(precipitation) AS monthly_avg
           FROM 'E:/NHMM/blog/ensemble1.parquet'
           GROUP by year, month")

# query to find mean
long_term_average=dbGetQuery(con, 
              "SELECT AVG(precipitation) AS monthly_avg
           FROM 'E:/NHMM/blog/ensemble1.parquet'") 

# query to find the standard deviation
stdev=dbGetQuery(con, 
              "SELECT STDDEV(precipitation) AS monthly_stdev
           FROM 'E:/NHMM/blog/ensemble1.parquet'")

We can then plot the synthetic trace (blue), mean (black), and a 1/2 standard deviation (dashed line) below the mean.

Now we can count the number of months that we are classified in a drought across the synthetic trace. Based on the stats that we calculated in the last block, I hard coded in the drought threshold, but in theory, you can nest many functions with more complicated sub-queries.

# register the dataframe as a virtual DuckDB table, and give it a name
duckdb::duckdb_register(con, "scenario_1", scenario1)


instances=dbGetQuery(con, 
                     "SELECT month,year
           FROM scenario_1
           WHERE monthly_avg < 3.125291
           GROUP by year, month
           ORDER by year,month")

It looks like 115/325 months, or about 35% of the time, we are in a drought in this particular synthetic trace. Yikes! But it makes for a good way to assess the vulnerability of the system to plausible future meteorological drought conditions.

I hope that these approaches can be as useful to you as they have been for me!

Understanding Information Usage in Machine Learning Models

Deep learning models still largely remain black boxes, where users have little understanding of how and when input variables are being used for prediction. In some applications, simpler and more interpretable models may be appropriate, but there are instances where deep learning models are simply more accurate. The question is, does one have to sacrifice accuracy for interpretability? Ideally, no. If you truly understand how a model works, this puts you in an optimal position to determine what changes need to be made to make it more accurate. Techniques for better understanding the inner workings of more complex ML models will likely lead to the acceptance of these models in process-driven fields like hydrology. We should aspire to move away from the practice of blindly tuning hyperparameters to get good validation results, and instead focus more on what these parameters may physically represent (which is certainly not always easy or applicable, otherwise it would be done more often). The goal of this blog post is to outline some concepts within the computer science community that can help users understand information usage in their ML models and that can be applied to hydrology applications.

An example of a complex ML model

The current state of the art for recurrent-style neural networks is the Long Short-Term Memory (LSTM) network, which has become increasingly popular in hydrology applications. These RNN-style networks contain a cell state that has the ability to learn long-term dependencies, as well as gates to regulate the flow of information in and out of the cell.

A single LSTM cell and its components (Colah, 2020)

The top horizontal line denotes the cell state, C_t, which is continually being updated, either removing information that is no longer relevant or incorporating new recent information. The first gate is the forget gate, f_t. The forget gate takes the previous hidden state h_{t-1} and the current input x_t and decides what to forget from the previous time step’s cell state C_{t-1} through the application of a sigmoid function, which effectively scales values in the cell state to between 0 and 1 (i.e., from completely forget to completely retain). Next, the input gate, i_t, decides which values to update, again using a sigmoid function, while a tanh layer creates a vector of new cell state candidates, C~_t. The cell state is then updated by first forgetting the prescribed values at the forget gate and then adding the scaled candidates, i_t * C~_t. Finally, the output gate, o_t, is computed by passing h_{t-1} and x_t through another sigmoid; the updated cell state is pushed through a tanh function and multiplied by o_t to produce the new hidden state, h_t, shown at the bottom of the unit, which is relayed to the next unit. An LSTM layer can be made up of hundreds of these cells, and multiple layers can be stacked. Most importantly, the LSTM allows for the preservation and memory of past inputs and outputs, which can help expedite training in a system with slow-changing dynamics.
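Written out in the standard notation (where the W’s and b’s are learned weights and biases, \sigma is the sigmoid function, [h_{t-1}, x_t] is the concatenation of the previous hidden state and current input, and \odot denotes elementwise multiplication), the gate operations described above are:

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)
h_t = o_t \odot \tanh(C_t)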

Clearly an LSTM is one of the more complex deep learning models. However, it has shown great success in hydrology applications, particularly where it’s important to capture physical processes that occur at different time scales. For example, Kratzert et al. (2018) demonstrate how an LSTM network used for rainfall-runoff across 241 catchments is able to perform comparably to SAC-SMA in a fraction of the computational time. The authors also do a great job of unboxing some of the LSTM behavior, including demonstrating how the cell state captures temperature and snow dynamics. However, very few other studies investigate their ML models to this degree.

Techniques for Interpretability

Model-Agnostic Local Models

Many interpretability techniques have been introduced that utilize simple local methods to explain model predictions for a single sample in order to better understand the behavior of a more complex model. Local interpretable model-agnostic explanations (LIME) is one such technique, introduced by Ribeiro et al. (2016), that utilizes a local surrogate model. LIME generates perturbations of the input and output data, weighs the samples according to their proximity to the baseline sample, and trains a localized (usually linear) model. The goal is to find a surrogate that minimizes loss (the distance between the surrogate’s estimates and the complex model’s predictions) while also minimizing complexity.

Another localized technique utilizes Shapley values. Originating from cooperative game theory, Shapley values assign a payout depending on a player’s contribution to the total payout (Shapley, 1953). The analog to ML is that the Shapley value becomes the marginal contribution of a feature averaged across many coalitions (subsets) of possible feature combinations. Because Shapley values are calculated across many possible coalitions, computing them exactly can quickly become expensive.
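To make the coalition idea concrete, here is a brute-force Python sketch (with a made-up linear "model" and feature values) that computes exact Shapley values by enumerating every coalition; this is only feasible for a handful of features, which is why practical methods approximate them:

import itertools
import math

def shapley_values(features, value_fn):
    # exact Shapley values: weighted marginal contribution of each feature over all coalitions
    n = len(features)
    phi = {f: 0.0 for f in features}
    for f in features:
        others = [g for g in features if g != f]
        for r in range(len(others) + 1):
            for coalition in itertools.combinations(others, r):
                S = set(coalition)
                weight = math.factorial(len(S)) * math.factorial(n - len(S) - 1) / math.factorial(n)
                phi[f] += weight * (value_fn(S | {f}) - value_fn(S))
    return phi

# toy "model": the prediction from a coalition is a weighted sum of the features it contains
x = {"precip": 12.0, "temp": 8.0, "doy": 0.3}   # hypothetical feature values
w = {"precip": 0.7, "temp": 0.2, "doy": 0.1}    # hypothetical model weights
value = lambda S: sum(w[f] * x[f] for f in S)

print(shapley_values(list(x), value))  # for a linear model, each Shapley value is simply w[f]*x[f]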

Model-Specific Local Models

DeepLIFT is a backpropagation-based approach that propagates the contributions of all neurons in the network to every feature of the input. DeepLIFT compares the activation of each neuron to its ‘reference activation’ and assigns contribution scores according to the difference. DeepSHAP is a modification of the DeepLIFT algorithm used to efficiently estimate Shapley values over the input feature space for a given instance. DeepSHAP has been shown to be faster than model-agnostic methods and can also be used to explain a series of connected models rather than a single one (Chen et al., 2021). Rather than passing multipliers that compare neuron activations, DeepSHAP backpropagates SHAP values.

DeepSHAP

The authors of DeepSHAP have created a GitHub repository here that allows you to implement SHAP for any machine learning model and DeepSHAP for deep learning models implemented using TensorFlow and Keras. The package has some really beautiful figures that you can create with single lines of code. My goal was to see if I could harvest some tools from DeepSHAP to help make sense of a rainfall-runoff LSTM model that I had previously coded for an ephemeral subbasin in the Tuolumne River Basin.

I utilize the following features to predict outflow: precipitation, temperature, DOY (represented as the sine/cosine of the day of year in this case), prior precipitation aggregated over the past three days and the past two weeks, and then interaction variables. In a prior blog post, I demonstrate why these aggregate variables and interaction variables help to capture different regimes of flow. The results look great, but can we use DeepSHAP to think more about the parameters of the LSTM?

LSTM Prediction

One parameter of interest is the lag of information that the LSTM sees at every time step. In my LSTM, I feed the inputs in using a lookback window of 30 days, so at every time step, the LSTM sees the past 30 days of inputs. I use the DeepExplainer function to calculate the average Shapley values across a set of 500 time steps of the input data, and I plot these features for the past 30 days.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import shap
from keras import backend as K  # assumes the TF1-style Keras backend (needed for K.get_session())

#Deep SHAP implementation (training)
#model and X_Train come from the previously fit LSTM workflow
#Choose random points from the dataset to use as the DeepExplainer background
random_ind = np.random.choice(X_Train.shape[0], 1000, replace=False)
data = X_Train[random_ind[0:500]]
e = shap.DeepExplainer((model.layers[0].input, model.layers[-1].output), data, K.get_session())

#Deep SHAP testing
#Create a test set out of another part of the training set
test = X_Train[random_ind[500:1000]]
shap_values = e.shap_values(test)
shap_val = np.array(shap_values)
shap_val = np.reshape(shap_val, (int(shap_val.shape[1]), int(shap_val.shape[2]), int(shap_val.shape[3])))
shap_abs = np.absolute(shap_val)
sum_0 = np.mean(shap_abs, axis=0)  # mean |SHAP value| across samples for each (lag day, feature)

#Plot Shapley values
shap_plot = pd.DataFrame(sum_0, columns=["Precipitation","Temperature","DOY","Three Day Precip","Two Week Precip","Precip_Times_Temperature","Temperature_Times_Day","Precip_Times_Temperature_Times_Day"])
shap_plot['days'] = [i-31 for i in list(range(1,31))]  # one row per lag day: -30 (oldest) to -1 (most recent)
shap_plot.head()
shap_plot.plot.area(x='days', figsize=(10, 6), cmap='rainbow')
plt.title("Deep SHAP - Feature Importance")
plt.savefig('SHAP.png', bbox_inches='tight')
plt.show()
Feature importance for a 30-day lag

The figure above shows at what times in the batch certain information is being utilized based on the magnitude of the Shapley value associated with that information. We see primarily that current precipitation is the most prominent driver of the next day’s outflow along with the precipitation*temperature interactive variable which intuitively makes sense.

Let’s look back 5 days, though. Many of the interactive variables are still relevant, but we see that temperature and day of the year are now given higher priority than precipitation.

Shapley factors for 5 days prior to the current time step

Lags of these variables may carry information about seasonality that is more informative to the algorithm than the variables’ current values. Furthermore, precipitation at this lag may simply not be relevant. There’s lots of additional work to be done, but hopefully this tutorial illustrates that if you have the tools to look into your model and a good understanding of the system, it becomes possible to better attribute your model’s behavior to a process-based understanding.

References

Chen, H., Lundberg, S. M., & Lee, S. I. (2021). Explaining a Series of Models by Propagating Shapley Values. arXiv preprint arXiv:2105.00108.

Kratzert, F., Klotz, D., Brenner, C., Schulz, K., & Herrnegger, M. (2018). Rainfall–runoff modelling using long short-term memory (LSTM) networks. Hydrology and Earth System Sciences, 22(11), 6005-6022.

Ribeiro, M. T., Singh, S., & Guestrin, C. (2016, August). "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135-1144).

Shapley, L. S. (1953). A value for n-person games. Contributions to the Theory of Games, 2(28), 307-317.

I followed tutorials shown in these blogs as well:

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

https://github.com/danielhkt/deep-forecasting/blob/main/notebook/test_forecasting.ipynb

https://medium.datadriveninvestor.com/time-step-wise-feature-importance-in-deep-learning-using-shap-e1c46a655455

Data Augmentation for Time Series Application

This post is meant to be an introduction to the concept of data augmentation. Data augmentation is the process of increasing the size of your dataset through small modifications to the original data. In instances where data availability is limited (basically every ML application), this technique is especially useful for creating more training data, which can lead to a more robust model that is less susceptible to overfitting. Let’s begin with an example that will demonstrate why data augmentation is useful in image classification. Imagine that you have trained your model to distinguish between images of cats and dogs. The figure on the left is of a very good boy named Lincoln, and this image resides in the training set. Let’s suppose that the image in the middle is in the test set. To humans, this is very clearly Lincoln (and a dog) once again, but if the algorithm hasn’t seen many images of dogs in this position, there is a chance that it won’t classify this image correctly and may think that Lincoln looks more like the cat in the training set that has a similar orientation and ears.

Lincoln, Lincoln, and….Lincoln? (Cat stock image from here)

However, if I were to augment my original image in the training set by rotating, scaling, and shifting it, as shown below, perhaps my model would be more likely to classify Lincoln correctly as a dog having been trained on these variations. Various studies have demonstrated the benefits of this augmentation in image processing applications.

Augmentations of original image that create new training data

This is a very simple example to demonstrate that limited data availability need not preclude the ability to make robust predictions. It is not a far stretch to wonder how data augmentation may be utilized for regression-based prediction problems, especially in the water resources field where we have limited data. In particular, it is hard for us to predict extremes because we have so few data points to characterize them. This style of problem is inherently more complicated than classification because time series have a temporal structure and are connected to underlying (sometimes physical) relationships. Thus, any augmentation must not completely change the fundamental characteristics of the data. Below are some examples of techniques that could be useful, but these are extremely case-specific and require a strong understanding of the behavior of your system. Before you implement any of these techniques, first make sure to split your data into the training and test sets. Then feel free to add variations to the training set and test them out!

Block Bootstrapping

Bootstrapping (sampling with replacement) single points from your dataset can only be done if each point is independent. This is not the case with time series data, which has a temporal structure. Thus, it is more appropriate to utilize block bootstrapping. This technique involves resampling blocks of continuous data from the original training data to make a new training dataset. By using large continuous blocks, we preserve the inherent structure in the data while still allowing our algorithm to see new data (the original data in a new order).
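Here is a rough Python sketch of the idea (the block_bootstrap helper and the gamma-distributed precipitation series are made up for illustration):

import numpy as np

def block_bootstrap(series, block_size, n_samples, seed=None):
    # resample contiguous blocks (with replacement) and stitch them into new series of
    # the original length, preserving short-range temporal structure within each block
    rng = np.random.default_rng(seed)
    n = len(series)
    n_blocks = int(np.ceil(n / block_size))
    samples = []
    for _ in range(n_samples):
        starts = rng.integers(0, n - block_size + 1, size=n_blocks)
        blocks = [series[s:s + block_size] for s in starts]
        samples.append(np.concatenate(blocks)[:n])
    return np.stack(samples)

# hypothetical 10 years of daily precipitation
precip = np.random.default_rng(1).gamma(shape=0.5, scale=5.0, size=3650)
augmented = block_bootstrap(precip, block_size=90, n_samples=10)  # 10 resampled training series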

Jittering with Noise

A small sample size doesn’t give us the opportunity to map out the rich input-output space that characterizes our system. Often, adding a little bit of random noise to your training data can help expand your understanding of the space. If your system exhibits highly non-linear behavior, you have to be extra careful that the noise that you are adding is realistic. For example, in a rainfall-runoff model, the fluctuations of temperature and precipitation are very different. Small changes in precipitation can result in very large overall streamflow changes, whereas temperature often fluctuates widely during the day with very little effect on streamflow. Therefore, adding the same amount of noise to each feature and the output may not make sense. It is a non-trivial effort, but it could be interesting to determine how to appropriately add noise to features that exhibit different behavior.
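As a quick sketch (the per-feature noise scales here are arbitrary), per-feature jittering might look something like this in Python:

import numpy as np

def jitter(X, noise_scale, seed=None):
    # add zero-mean Gaussian noise, scaled per feature, so that variables with very
    # different magnitudes (e.g., precipitation vs. temperature) are perturbed sensibly
    rng = np.random.default_rng(seed)
    return X + rng.normal(0.0, 1.0, size=X.shape) * np.asarray(noise_scale)

# hypothetical training features: [precipitation (mm), temperature (deg C)]
rng = np.random.default_rng(2)
X_train = np.column_stack([rng.gamma(0.5, 5.0, 1000), 20 + 8 * rng.standard_normal(1000)])

# perturb precipitation only slightly, temperature more freely, and append to the training set
X_augmented = np.vstack([X_train, jitter(X_train, noise_scale=[0.1, 0.5])])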

Interpolation

If you want to augment training data that has clear trends, interpolation between data points can be a viable option that won’t distort these trends. However, using a linear interpolation method sets the underlying assumption that your data are linear; for instance that a linear change in precipitation and temperature leads to a linear change in streamflow. This is likely not the case, so interpolation may not be a useful data augmentation technique for a rainfall-runoff regression-based model. However, interpolation could be useful in a less sensitive classification-based model.

Decomposition

Decomposition methods generally decompose time series signals by extracting features or underlying patterns from the training data. These features can either be used independently or recombined with noise and the old training data to generate new training data. Decomposition can be performed in either the time or frequency domain, and manifold-based techniques also fall within this category. A study by Forestier et al. (2017) calculates a weighted average that reflects the manifold of the original data and uses it as new data with high success.

Implementation

All of these techniques have shown success in very specific time series applications: those related to speech, audio, and gait recognition, and specifically for classification-based models. Very little has been published on regression-based models and the use of data augmentation in the water resources community seems nonexistent.

Below, I implemented one of these techniques using the CNN that I fit in my prior post. The results for the baseline prediction of streamflow are shown in the first panel. Then I tried a data augmentation scenario. I took the training set and reversed it and kept the first 1000 data points (which have more extremes). I then took these points and concatenated them to the original training set. The new results are shown in the second panel. The CNN trained with the additional augmented data does a much better job of capturing the extremes in the test set, which is what we often are interested in. There is a lot of work to be done and more complicated methods to explore, but these initial results look interesting and suggest that data augmentation can be useful in our field!

References

Forestier, G., Petitjean, F., Dau, H. A., Webb, G. I., & Keogh, E. (2017, November). Generating synthetic time series to augment sparse datasets. In 2017 IEEE international conference on data mining (ICDM) (pp. 865-870). IEEE.

Oh, C., Han, S., & Jeong, J. (2020). Time-series data augmentation based on interpolation. Procedia Computer Science, 175, 64-71.

Tips for Creating Watershed Maps in R

There have been a few posts on this blog on creating watershed maps (here, here, and here), but this post is going to be focused on some of my recent experiences on creating watershed maps in R with files that may be missing attributes, are in the wrong projection, and contain data that need to be clipped to a specific boundary shapefile. There are lots of packages that exist to do one or more of these things, but anyone who has ever tried to create watershed maps in R knows that there isn’t one package that does it all. My main goal for this post is to outline the most efficient workflow and use of packages that also allow for the most compatibility when plotting shapefiles and raster files in one figure.

In this post, we are going to be creating a map of the Tuolumne River Basin boundary and plot elevation data within the basin. All the data are found here. First we will read in the Tuolumne boundary shape file (.shp) and the elevation raster file (.asc which is an ASCII file) using the appropriate functions and do some preliminary plotting to see what we have.

#Import libraries 

library(rgdal)
library(ggplot2)
library(raster)

#Read in Tuolumne shapefile

tuolumne.basin <- readOGR(dsn = "doi_10.6071_M3FH3D__v5/tuolumne_merced_data_2009-2015/Merced_Tuolumne_Dataset_SpatialData/SpatialData/Tuolumne_utm.shp")

#Read in elevation raster

elevation.raster = raster("doi_10.6071_M3FH3D__v5/tuolumne_merced_data_2009-2015/Merced_Tuolumne_Dataset_SpatialData/SpatialData/merced_tuolumne_100mdem_utm.asc")

#Plot the files
 
ggplot() +  geom_polygon(data = tuolumne.basin, aes(x = long, y = lat, group = group), colour = "dark red", fill = NA)
plot(elevation.raster)
Raw shapefile and raster data

So we have the pieces that we need to build the map, but notice that the latitude and longitude are in the wrong projection. We can use the following command to check what projection the shapefile is in:

proj4string(tuolumne.basin) 

We see that the output is: “+proj=utm +zone=11 +datum=NAD83 +units=m +no_defs”. So we are in the Universal Transverse Mercator coordinate system, but we should change to WGS 84. We can do this using the function “spTransform” which will swap out the projection by adjusting the CRS (Coordinate Reference System) attribute of the shapefile. You can use “proj4string” to verify that the transformation took place.

 tuolumne.basin.transformed <- spTransform(tuolumne.basin, CRS("+proj=longlat +ellps=WGS84 +datum=WGS84"))

Now we need to transform the raster file’s coordinates. Note that the raster file doesn’t have an associated coordinate reference system listed. If you try to change the projection at this point, you will get an error. This is a minor inconvenience since we know that the coordinate system should match that of the raw Tuolumne shapefile, and we can just insert the original coordinate system as a string under the projargs attribute. Then we can transform it to match the coordinate system of the transformed shapefile using “projectRaster”.

elevation.raster@crs@projargs <- "+proj=utm +zone=11 +datum=NAD83 +units=m +no_defs" 
  
elevation.raster.transformed <- projectRaster(elevation.raster,crs=crs(tuolumne.basin.transformed))

Great, now we have all the data in the right projection. Next we have to clip the raster layer to show only the data within the bounds of our shapefile. We first use the “crop” function in the raster library to clip the layer to the extent of the shapefile boundary and then apply the “mask” function. It is important to do both, otherwise the clip will not work!

elevation.raster.transformed.cropped <- crop(elevation.raster.transformed, extent(tuolumne.basin.transformed))
elevation.raster.transformed.cropped <- mask(elevation.raster.transformed.cropped, tuolumne.basin.transformed)

Now we need to get the appropriate elevation values and coordinates from the raster object so that we can plot it using ggplot. When we use ggplot here, notice that we only need geom_raster and the elevation data since the clipped data will perfectly follow the shapefile boundary.

#Isolate elevation values from the raster file

val <- getValues(elevation.raster.transformed.cropped)
xy <-as.data.frame(xyFromCell(elevation.raster.transformed.cropped,1:ncell(elevation.raster.transformed.cropped)))
xy <- cbind(xy,val)

#Plot it!

ggplot()+geom_raster(data=xy, aes(x=x, y=y, fill=val))+ scale_fill_viridis_c()+theme_bw()
A not so pretty watershed figure

It’s almost perfect aside from that gray box that results from clipping and masking. After we clip, we are converting all values outside the boundary of the shapefile to NAs, which falls out of the bounds of our color scale. To fix this, we simply insert an additional argument to scale_fill_viridis_c() and we also make some additional aesthetic changes to the theme.

#Final plot function

ggplot()+geom_raster(data=xy, aes(x=x, y=y, fill=val))+ scale_fill_viridis_c(na.value=NA,name = "Elevation (m)")+theme_bw()+ggtitle("Tuolumne River Basin Elevation (m)")+xlab("Longitude") + ylab("Latitude")+theme(text = element_text(size = 20)) 
A pretty watershed figure!

Teaching Tools for Complex Adaptive Systems

This semester, I am taking a foundational class in the Systems Engineering department here at Cornell and I wanted to use this blog post to relay some cool packages and tools that we have used that hopefully can be useful teaching material for emerging faculty or anyone looking for interactive systems tutorials.

To begin, we have to first define what falls under the umbrella of complex adaptive systems. In a nutshell, these systems tend to (1) have networks of many components, (2) typically involve non-linear interactions between components, (3) exhibit self-organizing behavior, and (4) have the potential to exhibit emergent properties. One really beautiful website that explains these properties in more detail is Complexity Explained, which started as a community outreach project to try to explain complex systems to a wider audience within the science community and the public. The website features interactive animations of systems properties and a short booklet that can be downloaded (in many languages) with key concepts.

It is well known that complex systems are hard for humans to understand because many of their characteristics are non-intuitive for us. For example, self-organizing behavior often runs contrary to our everyday experience (when can you remember a time that a system around you naturally seemed to become more orderly as time passed?). Emergent properties can come about over long time scales and are often far removed from the original action. We can’t always understand how decisions on the microscale resulted in large macroscale processes. Thus, in order to best approach complex systems, we must have the ability to interact with them, model them, and map out their complex behavior under many conditions. Below, I am introducing some tools that might help foster more understanding of these ideas using simple, yet dynamically rich cases.

PyCX

One of the main creators of the Complexity Explained website and a visiting lecturer to my systems class is Hiroki Sayama, a world-renowned researcher and director of the Center for Collective Dynamics of Complex Systems at Binghamton University. Dr. Sayama has created a Python package called PyCX that contains sample Python codes of complex systems that a user can run interactively and then manipulate or build off of. Simply download the package from GitHub and all of the code and a simulator will be available to you. Figure 1 shows an example interactive simulation of a Turing pattern. In 1952, Alan Turing authored a paper in which he described how patterns in animals’ coats, such as stripes and spots, can arise naturally from a chaotic system. He uses a simple set of reaction-diffusion equations to describe this process. Figure 1 shows the Python simulator in PyCX, the equation for the Turing pattern, and the evolution from the random initialization to the ordered spots.

Figure 1: PyCX interactive simulation for the Turing Pattern

PyCX also allows you to toggle the parameters of the problem, which can show how small perturbations in the system can lead to substantially different outcomes. You can adjust these parameters within the source Python code (which I believe is more useful for students than just clicking a “play” button). Figure 2 shows the difference in behavior of a forest fire model when the initial density is adjusted from 35% to 40% of the space.

Figure 2: The effect of initial conditions in a forest fire agent-based model

Golly- Game of Life Simulator

Golly is an open-source tool for visualizing cellular automata, including Conway’s Game of Life. Golly allows the user to draw different patterns and apply specific rules for how the systems evolve. You can stop the simulation midway and apply different rules to the existing patterns.

Figure 3: Golly Interface Screen Shot

Swarm Behavior

Dr. Sayama also developed a really interesting Java application to study swarm behavior, or collective behavior that is exhibited by entities, typically animals. This application, called Swarm Chemistry, creates agents with different kinetic parameters that dictate their dynamics. The application allows you to mix agents into a single population and observe how emergent dynamics form. Figure 4 shows the opening interface when you click the .jar executable. The application brings up 6 random agents that exhibit some dynamic behavior. By clicking on any two agents, you will create a new population that shows how the dynamics of the agents interact (Figure 5). You can keep mixing agents and adding more random swarms. You can individually mutate certain swarms or edit their parameters as well. The pictures do not do this application justice. It is super fun (and slightly addictive) and a great way to get students excited about the concepts.

Figure 4: Swarm Chemistry Opening Interface

Figure 5: Emergent dynamic behavior

I had so much fun using these packages in class and I hope that these tools can help you/your students become more engaged and excited about complex systems!

References

My knowledge of these tools came from Hiroki Sayama’s guest lectures in SYSEN 6000 at Cornell University and from:

Sayama, H. (2015). Introduction to the Modeling and Analysis of Complex Systems. Open SUNY Textbooks, Milne Library, State University of New York at Geneseo.

Fitting and Simulating from NHMMs in R

The purpose of this post is to give a workflow that one could follow if your goal is to fit Non-Homogeneous Hidden Markov Models (NHMMs) and to simulate sequences of states. Julie Quinn wrote a series of great posts on fitting Hidden Markov Models (HMMs) here, but the goal of this post is to discuss their non-homogeneous counterparts. NHMMs are distinguished from HMMs in that they involve non-stationary transition probabilities between states. These dynamic transition probability matrices are conditioned on one or more external covariates that influence transitions between states. One example of the use of NHMMs is to model and simulate precipitation. Precipitation models that simulate without using atmospheric information cannot be expected to perform well under conditions that deviate from those on which the model was fit. Thus, using a covariate that reflects changes in atmospheric circulation (such as geopotential height) can help capture some of the nonstationarity that a precipitation-only model could not.

I have had a lot of time to work with NHMMs in R; however at the beginning, I found overwhelmingly few examples to work off of, so this post is meant to outline a workflow that I have settled on that you can apply to your own application. First off, there are tons of packages available for HMMs, but very few that handle the dynamic transition matrices of NHMMs. My favorite is depmixS4 found here. This package has all the tools needed to fit your NHMM, but it takes some experimenting to understand the syntax and to piece together the functions to create a full workflow.

First we want to fit a depmix model. In the first line of the code, I am creating a model called modNHMM. The data that I am fitting the NHMM with are the first 9 principal components of geopotential height. It is important that you list these components in this syntax and they should be the column titles of your dataset, which is synoptic.pcs in my case. The number of states is how many states you want to fit your model with. For a precipitation/streamflow application, you could fit a two-state NHMM to represent wet/dry conditions, or if you are trying to identify multiple regimes, you may have more.

library(depmixS4)

modNHMM <- depmix(list(PC1~1,PC2~1,PC3~1,PC4~1,PC5~1,PC6~1,PC7~1,PC8~1,PC9~1),
                  nstates = num.states,
                  family = list(gaussian(),gaussian(),gaussian(),gaussian(),gaussian(),
                                gaussian(),gaussian(),gaussian(),gaussian()),
                  ntimes = nrow(synoptic.pcs),
                  data = data.frame(synoptic.pcs),
                  transition = ~predictor$PC1+predictor$PC2+predictor$PC3+predictor$PC4)

fit.modHMMs.depmix.paleo.check <- fit(modNHMM)

synoptic.state.assignments <- posterior(fit.modHMMs.depmix.paleo.check)$state # state sequence using the Viterbi algorithm

Then we choose a distribution for our responses, which I choose to be a Gaussian distribution. The argument ntimes refers to the length of our time series, which is the number of rows in our dataset. Next we specify the dataset that contains the PC data and the transition formula, which is dictated by our external covariates. In this case, my covariates are in a dataframe called predictor, and each column corresponds to one of the 4 covariates (which happen to be PCs of a different dataset) that will influence my transitions. Then we fit the model with our prescribed attributes using the fit function. Finally, we want to calculate the Viterbi sequence, which is the most probable path or sequence of states over the period that the model is fit. This step (last line of code) will return a vector of the length of the dataset with each day classified into one of the num.states states.

Now we need to locate the transition probability matrices. We use the following command:

fit.modHMMs.depmix.paleo.check@transition

If we were fitting an HMM, we would get one transition probability matrix. However, we get an output that looks like this:

Transition probability matrix coefficients

Now instead of having constant values for the transitions, we have equations of the form Intercept + b1*PC1 + b2*PC2 + b3*PC3 + b4*PC4, where the b values are the coefficients listed in the table. If we were to look at the first block [[1]], this block dictates transitions from State 1 to each of the 5 states. The transitions from State 1 are as follows:

State 1 to State 1: 0+0+0+0+0 (State 1 is used as the reference category in this case, so all of its coefficients are fixed at zero)

State 1 to State 2: -3.11+-0.22*PC1+0.22*PC2+0.014*PC3-0.13*PC4

and so on. You have to remember that these values are on the multinomial logit scale, so they must be transformed before they can be interpreted as probabilities.
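In other words, each row of the transition matrix is obtained by exponentiating these linear predictors and normalizing across the row (a softmax), which is exactly what the code below does. Writing \eta_{ij} for the linear predictor governing the move from state i to state j (with the reference state's \eta fixed at zero), this is:

P(S_t = j \mid S_{t-1} = i, \mathrm{PC}_t) = \frac{\exp(\eta_{ij})}{\sum_{k=1}^{K} \exp(\eta_{ik})},
\qquad
\eta_{ij} = b_{0,ij} + b_{1,ij}\,\mathrm{PC1}_t + b_{2,ij}\,\mathrm{PC2}_t + b_{3,ij}\,\mathrm{PC3}_t + b_{4,ij}\,\mathrm{PC4}_t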

Once you have these symbolic equations for the transition probability matrices, you can create a list of matrices that will allow you to simulate sequences of states for new sets of the PC1, PC2, PC3, and PC4 covariates. You can get a sense of how you might create n different transition matrices if you have time series of length n of the covariates. Below I am representing those symbolic equations in code, using the getpars function to acquire the coefficients and storing the resulting daily matrices in a list called mm. Depending on the number of covariates or states, you will need to adjust the indices accordingly.

n=dim(df)[[1]] # df holds the daily time series of the covariates (PC1-PC4)
mm<-matrix(list(), 1,n) # one 5x5 daily transition matrix per day


for (j in 1:n){
  transition_matrix=matrix(, nrow = 5, ncol = 5)
  for (i in 6:10){
    transition_matrix[1,i-5]=getpars(fit.modHMMs.depmix.paleo.check)[i]+(getpars(fit.modHMMs.depmix.paleo.check)[i+5]*df$PC1[j])+ (getpars(fit.modHMMs.depmix.paleo.check)[i+10]*df$PC2[j])+(getpars(fit.modHMMs.depmix.paleo.check)[i+15]*df$PC3[j])+(getpars(fit.modHMMs.depmix.paleo.check)[i+20]*df$PC4[j])
     }
  denominator=sum(exp(transition_matrix[1,]))
  for (i in 6:10){
    transition_matrix[1,i-5]=exp(transition_matrix[1,i-5])/denominator
  }
  
  for (i in 31:35){
    transition_matrix[2,i-30]=getpars(fit.modHMMs.depmix.paleo.check)[i]+(getpars(fit.modHMMs.depmix.paleo.check)[i+5]*df$PC1[j])+ (getpars(fit.modHMMs.depmix.paleo.check)[i+10]*df$PC2[j])+(getpars(fit.modHMMs.depmix.paleo.check)[i+15]*df$PC3[j])+(getpars(fit.modHMMs.depmix.paleo.check)[i+20]*df$PC4[j])
  }
  denominator=sum(exp(transition_matrix[2,]))
  for (i in 31:35){
    transition_matrix[2,i-30]=exp(transition_matrix[2,i-30])/denominator
  }
  for (i in 56:60){
    transition_matrix[3,i-55]=getpars(fit.modHMMs.depmix.paleo.check)[i]+(getpars(fit.modHMMs.depmix.paleo.check)[i+5]*df$PC1[j])+ (getpars(fit.modHMMs.depmix.paleo.check)[i+10]*df$PC2[j])+(getpars(fit.modHMMs.depmix.paleo.check)[i+15]*df$PC3[j])+(getpars(fit.modHMMs.depmix.paleo.check)[i+20]*df$PC4[j])
    
  }
  denominator=sum(exp(transition_matrix[3,]))
  for (i in 56:60){
    transition_matrix[3,i-55]=exp(transition_matrix[3,i-55])/denominator
    
  }
  for (i in 81:85){
    transition_matrix[4,i-80]=getpars(fit.modHMMs.depmix.paleo.check)[i]+(getpars(fit.modHMMs.depmix.paleo.check)[i+5]*df$PC1[j])+ (getpars(fit.modHMMs.depmix.paleo.check)[i+10]*df$PC2[j])+(getpars(fit.modHMMs.depmix.paleo.check)[i+15]*df$PC3[j])+(getpars(fit.modHMMs.depmix.paleo.check)[i+20]*df$PC4[j])
    
  }
  denominator=sum(exp(transition_matrix[4,]))
  for (i in 81:85){
    transition_matrix[4,i-80]=exp(transition_matrix[4,i-80])/denominator
    
  }
  
  for (i in 106:110){
    transition_matrix[5,i-105]=getpars(fit.modHMMs.depmix.paleo.check)[i]+(getpars(fit.modHMMs.depmix.paleo.check)[i+5]*df$PC1[j])+ (getpars(fit.modHMMs.depmix.paleo.check)[i+10]*df$PC2[j])+(getpars(fit.modHMMs.depmix.paleo.check)[i+15]*df$PC3[j])+(getpars(fit.modHMMs.depmix.paleo.check)[i+20]*df$PC4[j])
    
  }
  denominator=sum(exp(transition_matrix[5,]))
  for (i in 106:110){
    transition_matrix[5,i-105]=exp(transition_matrix[5,i-105])/denominator
    
  }
  mm[[j]]=transition_matrix

}

Once we have these matrices, we can then simulate state sequences that can result from the chain of transition matrices. For this part, we need to create markov lists with our transition matrices:

library(markovchain)
library(parallel) # provides mclapply

# iter = number of ensemble members and n.sim = number of simulation days (both defined earlier)
mcObject <- mclapply(X=1:iter, mc.preschedule=TRUE, mc.cores=1, FUN=function(j){

  # build one markovchain object per day from the daily transition matrices in mm
  mcObject.time.varying <- mclapply(X=1:n.sim, mc.preschedule=TRUE, mc.cores=1, FUN=function(t){
    tr.prob = as.matrix(mm[[t]])
    mcObject.time.varying.out <- new("markovchain", states = c("1","2","3","4","5"),
                                     transitionMatrix = tr.prob, name = paste("mcObject",t,sep=""))
    return(mcObject.time.varying.out)
  })

  # chain the daily markovchain objects together into a markovchainList
  mcObject.final <- new("markovchainList", markovchains = mcObject.time.varying, name = "mcObject.nh")
  return(mcObject.final)

})

Finally we simulate using the following:

simulate.mc <- function(mcObject,num.states,dates.sim,last.month,last.day,n.sim,iter) {
  
  #this function will simulate the Markov chain iter times
  
  #Arguments:
  #mcObject = a Markov chain object from the markovchain package
  #num.states = the number of states 
  #date.sim = a time series of dates for the simulation period
  #last.month = last month of the season
  #last.day = last day of the last month of the season
  #iter = the number of iterations
  
  #day and month sequences for the simulation period
  days.sim <- as.numeric(format(dates.sim,"%d"))
  months.sim <- as.numeric(format(dates.sim,"%m"))
  n.sim <- length(dates.sim)  #length of simulation
  
  final.mc.sim <- mclapply(X=1:iter,mc.preschedule=TRUE,mc.cores=1,FUN=function(i){  
    
    mc.sim <- as.numeric(rmarkovchain(n=1,object=mcObject[[i]][[1]]))
    end.state <- paste(mc.sim[length(mc.sim)])
    for (t in 1:n.sim) {
      mc.sim <- c(mc.sim,as.numeric(rmarkovchain(n=1,object=mcObject[[i]][[t]],t0=end.state)))
      end.state <- paste(mc.sim[length(mc.sim)])
      if(months.sim[t]==last.month & days.sim[t]==last.day) {end.state <- paste(sample(1:num.states,size=1))}
    }    
    
    #here is the final mc simulation
    final.mc.sim.iter <- mc.sim[2:(n.sim+1)]
    return(final.mc.sim.iter)
  }
  )
  return(final.mc.sim)
  
}



simulations=matrix(list(), 1,1000)
for (i in 1:1000){

  simulations[[i]] <- simulate.mc(mcObject=mcWeather.Regime,num.states=num.states,
                                  dates.sim=dates.sim,last.month=last.month,last.day=last.day,iter=iter)
}

And that’s it! You can simulate for many different iterations (1000 in my case), and you will be returned a large list containing your 1000 sequences of states over the simulation period.

CNNs for Time Series Applications

This post is meant to be an introduction to convolutional neural networks (CNNs) and how they can be applied to continuous prediction problems, such as time series prediction. CNNs have historically been used for image classification. At a high level, CNNs use small kernels (filters) that slide over localized regions of an image and detect features ranging from edges to faces, much in the same way as the visual cortex of the brain (Hubel and Wiesel, 1968). The basic concepts of a CNN were first introduced by Kunihiko Fukushima in 1980, and the first use of CNNs for image recognition was carried out by Yann LeCun in 1989. The major breakthrough for the algorithm did not come until the early 2010s with the advent of GPU training, and by 2015 CNNs were favored to win image recognition contests over other deep networks.

Recurrent architectures such as LSTMs are often considered the most appropriate algorithms for time series prediction, but studies suggest that CNNs can perform equivalently (or better), and that appropriately chosen filters can extract features that are coupled across variables and time while remaining computationally efficient to train (Bai et al., 2018; Rodrigues et al., 2021). Below, I’ll demonstrate some of the key characteristics of CNNs and how they can be used for time series prediction problems.

Architecture


Figure 1: CNN schematic for image classification (Sharma, 2018)

Figure 1 shows a schematic of a CNN’s architecture. The architecture primarily comprises a series of convolution and pooling layers followed by a fully connected network. Each convolution layer contains kernel matrices that are convolved with that layer’s input. It is up to the user to define the number and size of the kernels, but the kernel weights are learned using backpropagation. A bias is added to the output of the convolution layer, which is then passed through an activation function, such as the ReLU function, to yield feature maps. The feature maps are stacked into a cuboid whose depth equals the number of filters. If the convolution layer is followed by a pooling layer, the feature maps are down-sampled to produce a lower-dimensional representation. The output from the final pooling or convolution layer is flattened and fed to the fully connected layers.
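To make the high-level picture concrete, below is a minimal Keras sketch of this generic image-classification architecture. The input shape, filter counts, and layer sizes here are arbitrary illustrative choices and are not taken from Figure 1.

#Illustrative only: two convolution + pooling stages followed by a fully connected classifier
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(16, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)),  # feature maps
    MaxPooling2D(pool_size=(2, 2)),                                              # down-sample
    Conv2D(32, kernel_size=(3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),                                  # flatten feature maps to a vector
    Dense(64, activation='relu'),               # fully connected layers
    Dense(10, activation='softmax')             # class probabilities
])
model.summary()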

We will now look at the components of the architecture in more detail. To demonstrate how the convolutional layer works, we will use a toy example shown in Figure 2.

Figure 2: Convolution of a 3×3 kernel with the original image

Let’s say that our input image is represented as a 5×5 array and the filter is a 3×3 kernel that will be convolved with the image. The result is the array termed Conv1, which is just another array where each cell is the dot product between the filter and the 3×3 subsection of the image it covers. The numbers in color represent the values that the filter is centered on. Note that the convolution operation results in an output that is smaller than the input and can lose information around the boundaries of the image. Zero padding, which adds a border of zeros around the input array, can be used to preserve the input size. The kernel matrices are the mechanism by which the CNN identifies underlying patterns. Figure 3 shows examples of what successive outputs from convolution layers, or feature maps, can look like.
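Before moving on to Figure 3, here is a minimal numpy sketch of the sliding dot product just described. The image and kernel values below are made up for illustration; only the mechanics match Figure 2.

import numpy as np

image = np.array([[1, 0, 2, 1, 0],
                  [0, 1, 1, 0, 2],
                  [2, 1, 0, 1, 1],
                  [1, 0, 1, 2, 0],
                  [0, 2, 1, 0, 1]])

kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

out_size = image.shape[0] - kernel.shape[0] + 1  # 5 - 3 + 1 = 3
conv1 = np.zeros((out_size, out_size))
for r in range(out_size):
    for c in range(out_size):
        # dot product of the kernel with the 3x3 patch it currently covers
        conv1[r, c] = np.sum(image[r:r+3, c:c+3] * kernel)

print(conv1)  # 3x3 output: smaller than the 5x5 input unless zero padding is used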

Figure 3: Convolutional layer output for a CNN trained to distinguish between cats and dogs (Dertat, 2017)

The filters in the first convolutional layer of a CNN retain most of the information of the image, particularly edges. The brightest colors represent the most active pixels. The feature maps tend to become more abstract or focused on specific features as you move deeper into the network (Dertat, 2017). For example, Block 3 seems to be tailored to distinguish eyes.

The other key type of layer is a pooling layer. A pooling layer is added after convolution to reduce dimensionality, which both shortens training time (by reducing the number of parameters) and reduces the chance of overfitting. The most common type of pooling is max pooling, which returns the maximum value within an N×N pooling filter. This type of pooling retains the most active pixels in the feature map. As demonstrated in Figure 4, max pooling with a 2×2 filter and a stride (or shift) of 2 pixels reduces our Conv1 layer to a 2×2 lower-dimensional matrix. One can also use average pooling instead of max pooling, which takes the average of the values in each 2×2 subsection of the Conv1 layer.
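As a quick illustration of the mechanics (the 4×4 values below are made up and are not the Conv1 values from the figure), max pooling and average pooling with a 2×2 window and stride 2 can be sketched in numpy as follows:

import numpy as np

feature_map = np.array([[1, 3, 2, 0],
                        [4, 6, 1, 2],
                        [0, 2, 7, 5],
                        [1, 1, 3, 4]])

# group the 4x4 map into non-overlapping 2x2 blocks and take the max of each block
max_pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(max_pooled)   # [[6 2]
                    #  [2 7]]

# average pooling instead takes the mean of each 2x2 block
avg_pooled = feature_map.reshape(2, 2, 2, 2).mean(axis=(1, 3))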

Figure 4: Max pooling example

Application to Regression

CNNs are easiest to understand and visualize for image applications, which provides a basis for thinking about how we can use CNNs in a regression or prediction setting for time series. Let’s use a very simple example of a rainfall-runoff problem that uses daily precipitation and temperature to predict outflow in an ephemeral sub-basin within the Tuolumne Basin. Because the sub-basin’s creek is ephemeral, it can dry up across the simulation period, leaving extended periods of zero flow that make prediction in the basin very difficult. Here, we also implement a lag, which allows us to account for the residence time of the basin: precipitation and temperature from previous days likely contribute to predicting today’s outflow. We use a lag of 18, meaning that we use the previous 18 values of precipitation and temperature to predict outflow. The CNN model is implemented within Keras in the code below.

#import modules

import numpy as np
import pandas as pd
from keras.utils import to_categorical
from keras.models import Sequential, Model, load_model
from keras.layers import Input, LSTM, Dense
from keras.layers.convolutional import Conv1D, Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.layers import Dropout, Activation, Flatten
from keras.optimizers import SGD
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from tqdm import tqdm_notebook
import seaborn as sns
import os

os.getcwd()
os.chdir("C:/Users/Rohini/Documents/")
df_ge = pd.read_csv("Sub_0_daily.csv", index_col=0) 
print(df_ge.head())

#Check for nulls
print("checking if any null values are present\n", df_ge.isna().sum())

#Specify the training columns by their names
train_cols = ["Precipitation","Temperature"]
label_cols = ["Outflow"]



# This function normalizes the input data
def Normalization_Transform(x):
    x_mean=np.mean(x, axis=0)
    x_std= np.std(x, axis=0)
    xn = (x-x_mean)/x_std
    return xn, x_mean,x_std




# This function reverses the normalization 
def inverse_Normalization_Transform(xn, x_mean,x_std):
    xd = (xn*x_std)+x_mean
    return xd



# building timeseries data with given timesteps (lags)
def timeseries(X, Y, Y_actual, time_steps, out_steps):
    input_size_0 = X.shape[0] - time_steps
    input_size_1 = X.shape[1]
    X_values = np.zeros((input_size_0, time_steps, input_size_1))
    Y_values = np.zeros((input_size_0,))
    Y_values_actual = np.zeros((input_size_0,))
    
    for i in tqdm_notebook(range(input_size_0)):
        X_values[i] = X[i:time_steps+i]
        Y_values[i] = Y[time_steps+i-1, 0]
        Y_values_actual[i] = Y_actual[time_steps+i-1, 0]
        
    print("length of time-series i/o",X_values.shape,Y_values.shape)
    return X_values, Y_values, Y_values_actual


df_train, df_test = train_test_split(df_ge, train_size=0.8, test_size=0.2, shuffle=False)
x_train = df_train.loc[:,train_cols].values
y_train = df_train.loc[:,label_cols].values
x_test = df_test.loc[:,train_cols].values
y_test = df_test.loc[:,label_cols].values    
   
#Normalizing training data with the z-score transform defined above; keep the
#returned means and standard deviations so predictions can be inverted later
x_train_nor, x_mean_train, x_std_train = Normalization_Transform(x_train)
y_train_nor, y_mean_train, y_std_train = Normalization_Transform(y_train)

# Normalizing test data
x_test_nor, x_mean_test, x_std_test = Normalization_Transform(x_test)
y_test_nor, y_mean_test, y_std_test = Normalization_Transform(y_test)

# Saving actual train and test y_label to calculate mean square error later after training
y_train_actual = y_train
y_test_actual = y_test

#Building timeseries
X_Train, Y_Train, Y_train_actual = timeseries(x_train_nor, y_train_nor, y_train_actual, time_steps=18, out_steps=1)
X_Test, Y_Test, Y_test_actual = timeseries(x_test_nor, y_test_nor, y_test_actual, time_steps=18, out_steps=1)

#Define CNN model

def make_model(X_Train):
    input_layer = Input(shape=(X_Train.shape[1],X_Train.shape[2]))

    conv1 = Conv1D(filters=16, kernel_size=2, strides=1,
                    padding='same',activation='relu')(input_layer)
    conv2 = Conv1D(filters=32, kernel_size=3,strides = 1,
                          padding='same', activation='relu')(conv1)
    conv3 = Conv1D(filters=64, kernel_size=3,strides = 1,
                          padding='same', activation='relu')(conv2)
    flatten = Flatten()(conv3)
    dense1 = Dense(1152, activation='relu')(flatten)
    dense2 = Dense(576, activation='relu')(dense1)
    output_layer = Dense(1, activation='linear')(dense2)
    
    return Model(inputs=input_layer, outputs=output_layer)

model = make_model(X_Train)
model.compile(optimizer = 'adam', loss = 'mean_squared_error')
model.fit(X_Train, Y_Train, epochs=10)


#Prediction and inverting results 
ypred = model.predict(X_Test)
predict =inverse_Normalization_Transform(ypred,y_mean_train, y_std_train)


#Plot results
plt.figure(figsize=(11, 7))

plt.plot(y_test)
plt.plot(predict)

plt.title('Outflow Prediction (Precipitation+Temperature, Epochs=10, Lag=18 days)')
plt.ylabel('Outflow (cfs)')
plt.xlabel('Day')
plt.legend(['Actual Values','Predicted Values'], loc='upper right')
plt.show()

    

Just as with any algorithm, we normalize the input data and split it into testing and training sets. The CNN model is implemented in Keras and consists of three convolutional layers with kernel sizes that are explicitly defined to extract patterns that are coupled across variables and time. A schematic of the setup is shown in Figure 5.

Figure 5: Convolution layer setup for the Tuolumne case

Layer 1 uses a 1D convolutional layer with 16 filters of size 1×2 in order to extract features and interactions across the precipitation and temperature time series, as demonstrated in the top left of Figure 5. The result is an output of size 1×18×16. The second convolution layer uses 32 3×1 filters, which further capture temporal interactions down the output column vector. The third layer uses 64 3×1 filters to capture more complex temporal trends and is convolved with the output from the Conv2 layer. Note that zero padding is added (padding='same' in the code) to maintain the dimensions of the layers. The three convolutional layers are followed by a flattening layer and a three-layer dense network. The CNN was run 20 times and the results from the last iteration are shown in Figure 6. We also compare against an LSTM with an equivalent three-layer setup, also run 20 times (a sketch of such an LSTM is given below). The actual outflow is shown in blue while predictions are shown in red.
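For reference, the sketch below shows what an equivalent three-layer LSTM might look like in Keras. The unit counts and training settings are illustrative assumptions, not the exact configuration used to produce Figure 6.

from keras.models import Sequential
from keras.layers import LSTM, Dense

def make_lstm_model(X_Train):
    model = Sequential()
    # return_sequences=True passes the full sequence on to the next recurrent layer
    model.add(LSTM(64, return_sequences=True, input_shape=(X_Train.shape[1], X_Train.shape[2])))
    model.add(LSTM(32, return_sequences=True))
    model.add(LSTM(16))                        # final recurrent layer returns only the last state
    model.add(Dense(1, activation='linear'))   # single continuous output: outflow
    return model

lstm_model = make_lstm_model(X_Train)
lstm_model.compile(optimizer='adam', loss='mean_squared_error')
lstm_model.fit(X_Train, Y_Train, epochs=10)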

Figure 6: CNN vs LSTM prediction

For all intents and purposes, the visual comparison suggests that the CNN and LSTM perform equivalently, though the CNN was considerably faster to train. Notably, the CNN does a better job of capturing the large extremes recorded on day 100 and day 900, while still capturing the dynamics of the lower-flow regime. While these results are preliminary and largely un-optimized, the CNN shows the ability to outperform an LSTM on a style of problem it is not technically designed for. Using the specialized kernels, the CNN learns the interactions (both across variables and in time) without needing a mechanism specifically designed for memory, such as the cell state in an LSTM. Furthermore, CNNs can take greater advantage of GPU speedups, which do not always produce large efficiency gains for LSTM training. For now, we can at least conclude that CNNs are fast and promising alternatives to LSTMs that you may not have considered before. Future blog posts will dive more into the capabilities of CNNs in problems with more input variables and complex interactions, particularly where CNNs may help resolve the complex relationships that drive extremes.

References

Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. The Journal of Physiology, 195(1), 215-243.

Bai, S., Kolter, J. Z., & Koltun, V. (2018). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.

Rodrigues, N. M., Batista, J. E., Trujillo, L., Duarte, B., Giacobini, M., Vanneschi, L., & Silva, S. (2021). Plotting time: On the usage of CNNs for time series classification. arXiv preprint arXiv:2102.04179.

Sharma, V. (2018). https://vinodsblog.com/2018/10/15/everything-you-need-to-know-about-convolutional-neural-networks/

Dertat, A. (2017). https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2

NetCDF Operators

This post is an introduction to the Linux-based Climate Data Operators (CDO) and NetCDF Operators (NCO), which allow you to perform various operations on NetCDF files from the command line. I found these commands to be really nifty when I was working with pre-industrial control runs from a GCM. The output was being written at a daily timestep, across 1200 years, and for the whole world, so it was absolutely essential that I cut the file sizes down as much as I could before transferring them to my own computer.

The official documentation and installation instructions can be found here for NCO and here for CDO, but if you’re working on a supercomputer, the libraries will likely already be installed. I will outline how I used some of these functions for my pre-industrial runs.

Concatenation

Some of the NCO commands have size limits of 40 GB, so it’s important to use the right order of operations when processing your files, which will differ depending on your ultimate goal. My goal was to get the 500-hPa geopotential height anomalies across the whole 1200-year period for the Western US specifically. Assuming you have a directory with all the NetCDF files, the first step is to concatenate the data, since my run was broken into many smaller files. The easiest way to do this is with the following command, which takes all the NetCDF files in the directory (using the *) and merges them into a file called merged_file.nc:

cdo mergetime *.nc merged_file.nc

Return Individual Monthly Files

When calculating anomalies, you will need to determine a mean geopotential height value for each of the 12 months and then calculate daily deviations with respect to these monthly means. The first step is to split the merged file by month:

cdo splitmon merged_file.nc zg500.mon

This command will return 12 files of the form zg500.monMM.nc, where MM is the two-digit month (01–12).

Return Monthly Means and Daily Anomalies

The next step is to calculate a monthly mean for each of these files. For example, for January use:

cdo timmean zg500.mon01.nc zg500.mean.mon01.nc

Return Anomalies

Now we subtract the monthly means from each monthly file to return the daily anomalies, which will be files of the form zg500.mean.mon${i}.anom.nc. If you want to combine the last two steps into one loop, you can use the code below:

for i in $(seq -w 1 12)   # -w pads to two digits to match the splitmon file names
do
  cdo timmean zg500.mon${i}.nc zg500.mean.mon${i}.nc
  cdo sub zg500.mon${i}.nc zg500.mean.mon${i}.nc zg500.mean.mon${i}.anom.nc
done

Cut Down to Geographical Area of Interest

Finally, we need to cut the data down to just the Western US. We use ncks (NetCDF Kitchen Sink) from NCO, which is probably the most versatile of all the functions (hence the name). This command is one that has the 40 GB limit, which is why I had to wait until I had monthly files before I could cut them down geographically. We must first specify the variable of interest using the -v flag. In my case, I only had one variable to extract, but you can also extract multiple variables in one command. Then denote the ranges of latitude and longitude using the -d flags. It is very important to include the decimal point at the end of each latitude/longitude bound (even if your bounds are integers); otherwise ncks interprets the values as indices rather than coordinate values and the command will not subset the region you intend.

for i in $(seq -w 1 12)
do
  ncks -v zg500 -d lon,180.,260. -d lat,30.,60. zg500.mean.mon${i}.anom.nc -o zg500.mean.mon${i}.cut.anom.region.nc
done

Ultimately, you will get 12 files of the form zg500.mean.mon${i}.cut.anom.region.nc. And that’s it! You can concatenate the monthly files back together and re-sort the data into the correct sequence according to time. I was unsuccessful at finding a quick way to do this on the command line, but it is possible. I found it much easier to handle within R: I imported each of the 12 anomaly files, assigned a time vector, concatenated the monthly anomaly matrices into a larger matrix, and then sorted according to date. If your files are small enough by the end of the process, this is likely the easiest way to take care of the re-sorting.
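For readers who prefer Python, a roughly equivalent re-sorting workflow can be sketched with xarray. This is an alternative to the R approach described above, not what was used here; it assumes the 12 files share coordinates, that the time coordinate is named time, and that dask is installed so open_mfdataset can lazily combine the files.

import xarray as xr

# open the 12 monthly anomaly files, combine them along their coordinates,
# and re-sort into chronological order before writing a single output file
ds = xr.open_mfdataset("zg500.mean.mon*.cut.anom.region.nc", combine="by_coords")
ds = ds.sortby("time")
ds.to_netcdf("zg500.anom.region.sorted.nc")

Enjoy!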

Do The (Schaake) Shuffle

This post is an introduction to the Schaake Shuffle, a method that can be used to reconstruct space-time variability in forecasted and synthetic variables. The Schaake Shuffle was originally introduced in a synthetic weather generation post by Julie Quinn almost 5 years ago. Lately, the importance (and difficulty) of reproducing spatial and temporal variability in forecasts and synthetically generated variables across multiple correlated sites has been a prominent topic in our group. The goal of this post is simply to bump this topic back into discussion and to make readers aware of its existence as a nifty post-generation way to build spatial and temporal variability back into synthetically generated data or forecasts. In the fundamental paper that establishes the method, Clark et al. (2004), the authors apply the method to forecasts of precipitation and temperature. For weather variables such as temperature and precipitation, it is common to create forecasts for individual stations from a Numerical Weather Prediction (NWP) model; the NWP outputs serve as predictor variables in regression models that generate the forecasts. The problem with these types of approaches is that neither the spatial correlation between stations nor temporal persistence is preserved, both of which are very important for hydrologic applications with memory.

The Schaake Shuffle is a method that reorders ensemble forecasts of precipitation and temperature to better reconstruct the space-time variability using a rank-ordering approach constructed from a historical record. The basic steps are as follows:

  1. Gather appropriate data: The NWP model outputs forecasts of accumulated precipitation, air temperature, relative humidity at 700 hPa, wind speed, total column precipitable water, and mean sea level pressure, which are used as predictors in the forecast equations. Further, the authors acquire historical precipitation and temperature data for stations within four basins across the United States.
  2. Create Forecasts: The next step involves creating the precipitation and temperature ensemble forecasts. A multiple linear regression is used to develop model output statistics (MOS) equations. The forecasted variables taken from the NWP model are ultimately filtered down to keep only the variables that explain the highest variance in the response variable (in this example, the response variables are precipitation, minimum temperature, and maximum temperature). A separate regression equation is fit for each variable, station, and forecast lead time. The residuals of the regression equation are modeled stochastically to generate an ensemble of forecasts. Alternatively, one can apply the Schaake Shuffle to synthetically generated ensembles (not limited to forecasts).
  3. Reorder Forecasts: The reordering method can best be described by an example. For a given time, assume you have an ensemble of possible forecasts that you align in a 3D matrix X(i,j,k), where i=ensemble member, j=station, and k=variable of interest (precipitation or temperature). From the historical record, you construct an equally sized matrix Y(i,j,k), which contains historical station observations for the same date. For this matrix, i=an index of dates in the historical record, j=station, and k=variable of interest (precipitation or temperature).

Using this same notation, the authors create a toy example to demonstrate the process. For some time t, imagine we have forecasts of maximum temperature from 10 ensemble members for a given date and station.

Let X be the 10-member ensemble under consideration.

X=[15.3,11.2,8.8,11.9,7.5,9.7,8.3,12.5,10.3,10.1]

We can sort the vector X to create χ

χ=[7.5,8.3,8.8,9.7,10.1,10.3,11.2,11.9,12.5,15.3].

Then we go to the historical record and choose 10 dates that fall within a 7-day window around the date being forecasted. This is our Y vector.

Y=[10.7,9.3,6.8,11.3,12.2,13.6,8.9,9.9,11.8,12.9]

We can sort this vector to create γ.

γ=[6.8,8.9,9.3,9.9,10.7,11.3,11.8,12.2,12.9,13.6]

We also create a vector B, which records, for each entry of the sorted historical vector γ, its position in the original (unsorted) vector Y.

B=[3,7,2,8,1,4,9,5,10,6]

The key now is to reorder the ensemble forecast in the same order as the B vector. The rank-order-1 value is in position 5 of the B vector, so we take the 5th value from χ (10.1). Rank order 2 is in position 3, so we take the third value from χ (8.8). We continue doing this until we have

Xss=[10.1, 8.8, 7.5, 10.3, 11.9, 15.3, 8.3, 9.7, 11.2, 12.5]
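For concreteness, here is a minimal numpy sketch that reproduces this single-station, single-variable toy example. It only illustrates the rank-reordering step; it is not the depPPR implementation referenced later.

import numpy as np

X = np.array([15.3, 11.2, 8.8, 11.9, 7.5, 9.7, 8.3, 12.5, 10.3, 10.1])  # ensemble forecast
Y = np.array([10.7, 9.3, 6.8, 11.3, 12.2, 13.6, 8.9, 9.9, 11.8, 12.9])  # historical analogs

chi = np.sort(X)                    # sorted forecast values
ranks = np.argsort(np.argsort(Y))   # 0-based rank of each historical observation
X_ss = chi[ranks]                   # shuffled forecast inherits the historical rank structure

print(X_ss)  # [10.1  8.8  7.5 10.3 11.9 15.3  8.3  9.7 11.2 12.5]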

These are the basic fundamentals of the reordering algorithm, and it can be extended to forecasting at multiple stations, as demonstrated in the figure below. Table A shows 10 ensemble members for forecasting weather on January 14th, 2004, ranked from lowest to highest value for three stations. Table B shows the historical record, where the black and light gray ellipses represent the 1st and 2nd ensemble members, respectively. Table C shows the sorted historical record and where the selected historical observations lie in the sorted list. Finally, Table A can be reordered accordingly to form Table D.

Schaake Shuffle for a 10-member ensemble and 3 stations (Figure 2 from Clark et al., 2004)

It’s important to remember that the Schaake Shuffle is only meant to reproduce the Spearman rank correlation structure of the observations, not their actual correlations. The results from the paper, however, are quite remarkable and show how well the method captures spatial and temporal properties. The figure below shows an example of how the method preserves spatial correlation between two selected stations. The top set of figures shows raw ensemble output while the bottom figures show results after the ensemble is reordered. The black lines denote the target observed correlation. Clearly, the reordered output approximates the observed correlation across lead times better than the raw ensemble output.

Preservation of spatial correlation for raw (top) and reordered (bottom) forecast ensembles (Figure 6 from Clark et al., 2004)

One basic limitation of this approach is the assumption of stationarity: that the structure in the historical record will remain applicable to the forecasted data. While other methods exist that can preserve space-time variability well, the advantage of the Schaake Shuffle is the ability to reconstruct these patterns after the fact, as a post-processing step. If readers are interested in implementing the Schaake Shuffle, basic pseudocode is included at the end of the paper, but there is also an R package that automates the reordering process here. The steps to download the package and run the algorithm are denoted here. Note that this code only works for the single-station case. Here, each column in the X matrix is an ensemble member and the rows correspond to the days in the forecast. Implementing the example in Figure 2 for one station requires X and Y to each be a single row vector. Of course, one can manually extend this process to multiple stations.

install.packages("devtools")

devtools::install_github("katerobsau/depPPR")

library(depPPR)

forecast_example=as.matrix(read.delim("C:/Users/Rohini/Documents/synthetic.txt",header=FALSE))
climate_example=as.matrix(read.delim("C:/Users/Rohini/Documents/historical.txt",header=FALSE))

schaake_shuffle(X = forecast_example, Y = climate_example)

References:

All material is derived from the fundamental paper that introduces the Schaake Shuffle:

Clark, M., Gangopadhyay, S., Hay, L., Rajagopalan, B., & Wilby, R. (2004). The Schaake shuffle: A method for reconstructing space–time variability in forecasted precipitation and temperature fields. Journal of Hydrometeorology, 5(1), 243-262.