Data Augmentation for Time Series Application

This post is meant to be an introduction to the concept of data augmentation. Data augmentation is the process of increasing the size of your dataset through small modifications to the original data. In instances where data availability is limited (basically every ML application), this technique is especially useful for creating more training data, which can lead to a more robust model that is less susceptible to overfitting. Let’s begin with an example that demonstrates why data augmentation is useful in image classification. Imagine that you have trained your model to distinguish between images of cats and dogs. The figure on the left is of a very good boy named Lincoln, and this image resides in the training set. Let’s suppose that the image in the middle is in the test set. To humans, this is very clearly Lincoln (and a dog) once again, but if the algorithm hasn’t seen many images of dogs in this position, there is a chance that it won’t classify this image correctly and may decide that Lincoln looks more like the cat in the training set that has a similar orientation and ears.

Lincoln, Lincoln, and….Lincoln? (Cat stock image from here)

However, if I were to augment my original image in the training set by rotating, scaling, and shifting it, as shown below, perhaps my model would be more likely to classify Lincoln correctly as a dog having been trained on these variations. Various studies have demonstrated the benefits of this augmentation in image processing applications.

Augmentations of original image that create new training data

This is a very simple example to demonstrate that limited data availability need not preclude the ability to make robust predictions. It is not a far stretch to wonder how data augmentation may be utilized for regression-based prediction problems, especially in the water resources field where we have limited data. In particular, it is hard for us to predict extremes because we have so few data points to characterize them. This style of problem is inherently more complicated than classification because time series have a temporal structure and are connected to underlying (sometimes physical) relationships. Thus, any augmentation must not fundamentally change the characteristics of the data. Below are some examples of techniques that could be useful, but they are extremely case-specific and require a strong understanding of the behavior of your system. Before you implement any of these techniques, first make sure to split your data into training and test sets. Then feel free to add variations to the training set and test them out!

Block Bootstrapping

Bootstrapping (sampling with replacement) single points of your dataset can only be done if each point is independent, which is not the case for time series data with a temporal structure. Thus, it is more appropriate to utilize block bootstrapping. This technique involves resampling blocks of contiguous data from the original training data to make a new training dataset. By using large contiguous blocks, we preserve the inherent structure in the data while allowing our algorithm to see new data (the original data in a new order).
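As a rough illustration, here is a minimal NumPy sketch of a moving-block bootstrap; the block size, seed, and variable names are placeholders, and the block length would need to be tuned to the persistence of your own series:

import numpy as np

def block_bootstrap(series, block_size, n_samples, seed=None):
    """Resample contiguous blocks (with replacement) to build a new series."""
    rng = np.random.default_rng(seed)
    series = np.asarray(series)
    n_blocks = int(np.ceil(n_samples / block_size))
    starts = rng.integers(0, len(series) - block_size + 1, size=n_blocks)
    blocks = [series[s:s + block_size] for s in starts]
    return np.concatenate(blocks)[:n_samples]

# Example: build an augmented series the same length as the original training series
# train_flows = ...  # your training time series
# augmented = block_bootstrap(train_flows, block_size=30, n_samples=len(train_flows), seed=42)

Larger blocks preserve more of the temporal structure at the cost of less variety in the resampled series.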

Jittering with Noise

A small sample size doesn’t give us the opportunity to map out the rich input-output space that characterizes our system. Often, adding a little bit of random noise to your training data can help expand your understanding of the space. If your system exhibits highly non-linear behavior, you have to be extra careful that the noise that you are adding is realistic. For example, in a rainfall-runoff model, the fluctuations of temperature and precipitation are very different. Small changes in precipitation can result in very large overall streamflow changes, whereas temperature often fluctuates widely during the day with very little effect on streamflow. Therefore, adding the same amount of noise to each feature and the output may not make sense. It is a non-trivial effort, but it could be interesting to determine how to appropriately add noise to features that exhibit different behavior.
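As a sketch of what feature-specific jittering might look like, the snippet below scales the noise separately for each column; the column ordering and noise scales are hypothetical and would need to be chosen based on your system:

import numpy as np

def jitter(X, noise_scale, seed=None):
    """Add zero-mean Gaussian noise, scaled per feature (columns of X)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    sigma = noise_scale * X.std(axis=0)   # per-feature noise standard deviation
    return X + rng.normal(0.0, sigma, size=X.shape)

# Example: perturb temperature proportionally more than precipitation
# (columns assumed to be [precipitation, temperature]; scales are made up)
# X_aug = jitter(X_train, noise_scale=np.array([0.01, 0.05]), seed=7)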

Interpolation

If you want to augment training data that has clear trends, interpolation between data points can be a viable option that won’t distort these trends. However, using a linear interpolation method assumes that your data behave linearly; for instance, that a linear change in precipitation and temperature leads to a linear change in streamflow. This is likely not the case, so interpolation may not be a useful data augmentation technique for a rainfall-runoff regression-based model. However, interpolation could be useful in a less sensitive classification-based model.
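Under those caveats, a minimal version of this idea is to insert synthetic samples halfway between consecutive training points (a sketch, not tied to any particular dataset):

import numpy as np

def midpoint_interpolate(X, y):
    """Create synthetic samples halfway between consecutive (in time) training points."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X_mid = 0.5 * (X[:-1] + X[1:])
    y_mid = 0.5 * (y[:-1] + y[1:])
    return X_mid, y_mid

# X_aug, y_aug = midpoint_interpolate(X_train, y_train)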

Decomposition

Decomposition methods generally decompose time series signals by extracting features or underlying patterns from the training data. These features can either be used independently or recombined with noise and the old training data to generate new training data. Decomposition can be performed in either the time or frequency domain. Manifold-based techniques also fall within this category: Forestier et al. (2017) calculate a weighted average that reflects the manifold of the original data and use it as new data with good success.

Implementation

All of these techniques have shown success in very specific time series applications: those related to speech, audio, and gait recognition, and specifically for classification-based models. Very little has been published on regression-based models and the use of data augmentation in the water resources community seems nonexistent.

Below, I implement one of these techniques using the CNN that I fit in my prior post. The results for the baseline prediction of streamflow are shown in the first panel. Then I tried a data augmentation scenario: I reversed the training set and kept the first 1000 data points (which contain more extremes), and then concatenated these points to the original training set. The new results are shown in the second panel. The CNN trained with the additional augmented data does a much better job of capturing the extremes in the test set, which is often what we are most interested in. There is a lot of work to be done and more complicated methods to explore, but these initial results look interesting and suggest that data augmentation can be useful in our field!
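Roughly, that augmentation step looks like the following sketch (the variable names are placeholders; see the CNN post for how the training arrays were actually built):

import numpy as np

# x_train, y_train: original training inputs/outputs, ordered in time
x_rev = x_train[::-1][:1000]   # reverse the series and keep the first 1000 points
y_rev = y_train[::-1][:1000]
x_train_aug = np.concatenate([x_train, x_rev])
y_train_aug = np.concatenate([y_train, y_rev])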

References

Forestier, G., Petitjean, F., Dau, H. A., Webb, G. I., & Keogh, E. (2017, November). Generating synthetic time series to augment sparse datasets. In 2017 IEEE international conference on data mining (ICDM) (pp. 865-870). IEEE.

Oh, C., Han, S., & Jeong, J. (2020). Time-series data augmentation based on interpolation. Procedia Computer Science, 175, 64-71.

Tips for Creating Watershed Maps in R

There have been a few posts on this blog on creating watershed maps (here, here, and here), but this post is going to be focused on some of my recent experiences on creating watershed maps in R with files that may be missing attributes, are in the wrong projection, and contain data that need to be clipped to a specific boundary shapefile. There are lots of packages that exist to do one or more of these things, but anyone who has ever tried to create watershed maps in R knows that there isn’t one package that does it all. My main goal for this post is to outline the most efficient workflow and use of packages that also allow for the most compatibility when plotting shapefiles and raster files in one figure.

In this post, we are going to be creating a map of the Tuolumne River Basin boundary and plotting elevation data within the basin. All the data are found here. First we will read in the Tuolumne boundary shapefile (.shp) and the elevation raster file (.asc, which is an ASCII file) using the appropriate functions and do some preliminary plotting to see what we have.

#Import libraries 

library(rgdal)
library(ggplot2)
library(raster)

#Read in Tuolumne shapefile

tuolumne.basin <- readOGR(dsn = "doi_10.6071_M3FH3D__v5/tuolumne_merced_data_2009-2015/Merced_Tuolumne_Dataset_SpatialData/SpatialData/Tuolumne_utm.shp")

#Read in elevation raster

elevation.raster = raster("doi_10.6071_M3FH3D__v5/tuolumne_merced_data_2009-2015/Merced_Tuolumne_Dataset_SpatialData/SpatialData/merced_tuolumne_100mdem_utm.asc")

#Plot the files
 
ggplot() +  geom_polygon(data = tuolumne.basin, aes(x = long, y = lat, group = group), colour = "dark red", fill = NA)
plot(elevation.raster)
Raw shapefile and raster data

So we have the pieces that we need to build the map, but notice that the latitude and longitude are in the wrong projection. We can use the following command to check what projection the shapefile is in:

proj4string(tuolumne.basin) 

We see that the output is: “+proj=utm +zone=11 +datum=NAD83 +units=m +no_defs”. So we are in the Universal Transverse Mercator coordinate system, but we should change to WGS 84. We can do this using the function “spTransform” which will swap out the projection by adjusting the CRS (Coordinate Reference System) attribute of the shapefile. You can use “proj4string” to verify that the transformation took place.

 tuolumne.basin.transformed <- spTransform(tuolumne.basin, CRS("+proj=longlat +ellps=WGS84 +datum=WGS84"))

Now we need to transform the raster file’s coordinates. Note that the raster file doesn’t have an associated coordinate reference system listed, so if you try to change the projection at this point, you will get an error. This is a minor inconvenience since we know that the coordinate system should match that of the raw Tuolumne shapefile, and we can simply insert the original coordinate system as a string under the projargs attribute. Then we can transform it to match the coordinate system of the transformed shapefile using “projectRaster”.

elevation.raster@crs@projargs <- "+proj=utm +zone=11 +datum=NAD83 +units=m +no_defs" 
  
elevation.raster.transformed <- projectRaster(elevation.raster,crs=crs(tuolumne.basin.transformed))

Great, now we have all the data in the right projection. Next we have to clip the raster layer to show only the data within the bounds of our shapefile. We first use the “crop” function in the raster library to clip the layer based on the extent of the shapefile boundary, and then apply the “mask” function. It is important to do both, otherwise the clip will not work!

elevation.raster.transformed.cropped <- crop(elevation.raster.transformed, extent(tuolumne.basin.transformed))
elevation.raster.transformed.cropped <- mask(elevation.raster.transformed.cropped, tuolumne.basin.transformed)

Now we need to get the appropriate elevation values and coordinates from the raster object so that we can plot it using ggplot. When we use ggplot here, notice that we only need geom_raster and the elevation data, since the clipped data will perfectly follow the shapefile boundary.

#Isolate elevation values from the raster file

val <- getValues(elevation.raster.transformed.cropped)
xy <-as.data.frame(xyFromCell(elevation.raster.transformed.cropped,1:ncell(elevation.raster.transformed.cropped)))
xy <- cbind(xy,val)

#Plot it!

ggplot()+geom_raster(data=xy, aes(x=x, y=y, fill=val))+ scale_fill_viridis_c()+theme_bw()
A not so pretty watershed figure

It’s almost perfect, aside from that gray box that results from clipping and masking. After we clip, all values outside the boundary of the shapefile are converted to NAs, which fall outside the bounds of our color scale. To fix this, we simply add an additional argument to scale_fill_viridis_c() and make some additional aesthetic changes to the theme.

#Final plot function

ggplot()+geom_raster(data=xy, aes(x=x, y=y, fill=val))+ scale_fill_viridis_c(na.value=NA,name = "Elevation (m)")+theme_bw()+ggtitle("Tuolumne River Basin Elevation (m)")+xlab("Longitude") + ylab("Latitude")+theme(text = element_text(size = 20)) 
A pretty watershed figure!

Teaching Tools for Complex Adaptive Systems

This semester, I am taking a foundational class in the Systems Engineering department here at Cornell and I wanted to use this blog post to relay some cool packages and tools that we have used that hopefully can be useful teaching material for emerging faculty or anyone looking for interactive systems tutorials.

To begin, we have to first define what falls under the umbrella of complex adaptive systems. In a nutshell, these systems tend to (1) have networks of many components, (2) involve non-linear interactions between components, (3) exhibit self-organizing behavior, and (4) have the potential to exhibit emergent properties. One really beautiful website that explains these properties in more detail is Complexity Explained, which started as a community outreach project to explain complex systems to a wider audience within the science community and the public. The website features interactive animations of system properties and a short booklet that can be downloaded (in many languages) with key concepts.

It is well known that complex systems are hard for humans to understand because many of their characteristics are non-intuitive to us. For example, self-organizing behavior often contradicts our everyday experience (when can you remember a time that a system around you naturally became more orderly as time passed?). Emergent properties can arise over long time scales and are often far removed from the original actions that caused them, so we can’t always understand how decisions on the microscale resulted in large macroscale processes. Thus, in order to best approach complex systems, we must have the ability to interact with them, model them, and map out their complex behavior under many conditions. Below, I introduce some tools that might help foster more understanding of these ideas using simple, yet dynamically rich, cases.

PyCX

One of the main creators of the Complexity Explained website, and a visiting lecturer to my systems class, is Hiroki Sayama, a world-renowned researcher and director of the Center for Collective Dynamics of Complex Systems at Binghamton University. Dr. Sayama has created a Python package called PyCX that contains sample Python codes of complex systems that a user can run interactively and then manipulate or build off of. Simply download the package off of GitHub and all of the code and a simulator will be available to you. Figure 1 shows an example interactive simulation of a Turing pattern. In 1952, Alan Turing authored a paper in which he described how patterns in animals’ coats, such as stripes and spots, can arise naturally from a chaotic system. He used a simple set of reaction-diffusion equations to describe this process. Figure 1 shows the Python simulator in PyCX, the equation for the Turing pattern, and the evolution from the random initialization to the ordered spots.

Figure 1: PyCX interactive simulation for the Turing Pattern
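If you would like a standalone feel for how reaction-diffusion systems self-organize, the short NumPy sketch below simulates the Gray-Scott model, a related reaction-diffusion system that produces Turing-like spots. This is not PyCX’s own code, and the parameter values are just one common illustrative choice:

import numpy as np
import matplotlib.pyplot as plt

n, steps = 128, 5000
Du, Dv, F, k = 0.16, 0.08, 0.035, 0.065   # illustrative parameter choice
u = np.ones((n, n))
v = np.zeros((n, n))
# seed a small noisy square in the middle so that patterns can nucleate
u[54:74, 54:74] = 0.50 + 0.02 * np.random.rand(20, 20)
v[54:74, 54:74] = 0.25 + 0.02 * np.random.rand(20, 20)

def laplacian(z):
    # five-point Laplacian with periodic boundaries
    return (np.roll(z, 1, 0) + np.roll(z, -1, 0) +
            np.roll(z, 1, 1) + np.roll(z, -1, 1) - 4.0 * z)

for _ in range(steps):
    uvv = u * v * v
    u += Du * laplacian(u) - uvv + F * (1.0 - u)
    v += Dv * laplacian(v) + uvv - (F + k) * v

plt.imshow(u, cmap='viridis')
plt.title('Gray-Scott reaction-diffusion pattern')
plt.show()

Nudging F and k slightly produces stripes, spots, or no pattern at all, which is a hands-on version of the parameter sensitivity discussed next.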

PyCX also allows you to toggle the parameters of the problem, which can express how small perturbations in the system can lead to substantially different outcomes. You can adjust these parameters within the source python code (which I believe is more useful for students rather than just clicking a “play” button). Figure 2 shows the difference in behavior across a forest fire model when the initial density is adjusted from 35% to 40% of the space.

Figure 2: The effect of initial conditions in a forest fire agent-based model

Golly- Game of Life Simulator

Golly is an open-source tool for visualizing cellular automata, including Conway’s Game of Life. Golly allows the user to draw different patterns and apply specific rules for how the systems evolve. You can stop the simulation midway and apply different rules to the existing patterns.

Figure 3: Golly Interface Screen Shot

Swarm Behavior

Dr. Sayama also developed a really interesting Java application to study swarm behavior, the collective behavior exhibited by groups of entities, typically animals. This application, called Swarm Chemistry, creates agents with different kinetic parameters that dictate their dynamics. The application allows you to mix agents into a single population and observe how emergent dynamics form. Figure 4 shows the opening interface when you click the .jar executable. The application brings up 6 random agents that exhibit some dynamic behavior. By clicking on any two agents, you create a new population that shows how the dynamics of the agents interact (Figure 5). You can keep mixing agents and adding more random swarms, and you can individually mutate certain swarms or edit their parameters as well. The pictures do not do this application justice. It is super fun (and slightly addicting) and a great way to get students excited about the concepts.

Figure 4: Swarm Chemistry Opening Interface

Figure 5: Emergent dynamic behavior

I had so much fun using these packages in class and I hope that these tools can help you/your students become more engaged and excited about complex systems!

References

My knowledge of these tools came from Hiroki Sayama’s guest lectures in SYSEN 6000 at Cornell University and from:

Sayama, H. (2015). Introduction to the Modeling and Analysis of Complex Systems. Open SUNY Textbooks, Milne Library, State University of New York at Geneseo.

MORDM Basics V: WaterPaths Tutorial

The MORDM framework poses four fundamental questions that each corresponds to a section within the framework shown in Figure 1:

  1. What actions can we take to address the problem?
  2. What worlds are we implementing those actions in?
  3. What are acceptable levels of performance for those actions in all worlds?
  4. What controls failures across those worlds?
Figure 1: The MORDM framework (Herman et al., 2013).

In the previous blog post, we used state-aware ROF triggers to implement drought mitigation and supply management actions for one hydroclimatic and demand scenario. In the first MORDM blog post, we generated multiple deeply-uncertain synthetic realizations of hydroclimatic and demand scenarios. The next logical question would be: how do these actions fare across all worlds and across time?

Setting up the system

To explore this question, we will expand the scope of our test case to include Cary’s two neighboring utilities – Durham and Raleigh – within the Research Triangle. The three utilities aim to form a cooperative water supply and management agreement in which they would like to optimize the following objectives, as stated in Trindade et al (2019):

  1. Maximize supply reliability, REL
  2. Minimize the frequency of implementing water use restrictions, RT
  3. Minimize the net present value of infrastructure investment, NPV
  4. Minimize the financial cost of drought mitigation and debt repayment, FC
  5. Minimize the worst first-percentile cost of the FC

In this example, each objective is a function of short-term risk-of-failure triggers (stROFs) and long-term risk-of-failure triggers (ltROFs). The stROFs trigger short-term actions, typically on a weekly or monthly basis. The ltROFs trigger actions on a multi-year, sometimes decadal, timescale. Recall from a previous post that ROF triggers are state-aware, probabilistic decision rules that dictate how a system responds to risk. Here, we optimize for a Pareto-approximate set of ROF triggers (or risk thresholds) that will result in a range of performance objective tradeoffs. An example of a stROF is the restriction ROF trigger we explored in the post prior to this one.

In addition, an example of an ltROF would be the infrastructure construction ltROF. When this ltROF threshold is crossed, an infrastructure project is ‘built’. Potential infrastructure projects are ordered in a development pathway (Zeff et al 2016), and the ltROF triggers the next infrastructure option in the sequence. The term ‘pathway’ is used as these infrastructure sequences are not fixed, but are state-dependent and can be shuffled to allow the optimization process to discover multiple potential pathways.

Over the next two blog posts, we will explore the interaction between the water-use restriction stROF and the infrastructure ltROF, and how this affects system performance. For now, we will simulate the Research Triangle test case and optimize for the ‘best’ set of ROF triggers using WaterPaths and Borg MOEA.

Using WaterPaths and Borg MOEA

We will be using the WaterPaths utility planning and management tool (Trindade et al, 2020) to simulate the performance of the Research Triangle test case. For clarification, the default simulation within WaterPaths is that of the Research Triangle. The folder that you will be downloading from GitHub already contains all the input, uncertainty and decision variable files required. This tool will be paired with the Borg MOEA (Hadka and Reed, 2013) to optimize the performance objectives in each simulation to obtain a set of Pareto-optimal long- and short-term ROF triggers that result in a non-dominated set of tradeoffs. Jazmin has made a few posts that walk through compiling Borg and installing its different parallel and serial versions, which might be helpful to review before attempting this exercise.

Once you have Borg downloaded and set up, begin by downloading the GitHub repository into a file location on your machine of choice:

git clone https://github.com/lbl59/WaterPaths.git

Once all the files are downloaded, compile WaterPaths:

make gcc

To optimize the WaterPaths simulation with Borg, first move the Borg files into the main WaterPaths/borg folder:

mv Borg WaterPaths/borg

This line of code automatically creates a folder called borg within the WaterPaths folder and moves the external Borg files into it.

Next, cd into the /borg folder and run make in your terminal. This should generate a file called libborgms.a. Make a folder called lib within the WaterPaths folder, and copy this file into the WaterPaths/lib folder:

cp libborgms.a ../lib/

Next, cd back into the main folder and use the Makefile to compile the WaterPaths executable with Borg:

make borg

Great! Now, notice that the /rof_tables_test_problem folder is empty. You will need to generate ROF tables within the WaterPaths environment. To do so, use the generate_rof_tables.sh script provided in the GitHub repository. The script should look like this:

#!/bin/bash
#SBATCH -n 16 -N 2 -p normal
#SBATCH --job-name=gen_rof_tables
#SBATCH --output=output/gen_rof_tables.out
#SBATCH --error=error/gen_rof_tables.err
#SBATCH --time=04:00:00
#SBATCH --mail-user=YOURUSERNAME@cornell.edu
#SBATCH --mail-type=all

export OMP_NUM_THREADS=16
cd $SLURM_SUBMIT_DIR
./waterpaths \
    -T ${OMP_NUM_THREADS} \
    -t 2344 \
    -r 1000 \
    -d /YOURCURRENTDIRECTORY/ \
    -C 1 \
    -m 0 \
    -s sample_solutions.csv \
    -O rof_tables_test_problem/ \
    -e 0 \
    -U TestFiles/rdm_utilities_test_problem_opt.csv \
    -W TestFiles/rdm_water_sources_test_problem_opt.csv \
    -P TestFiles/rdm_dmp_test_problem_opt.csv \
    -p false

Replace all the ‘YOUR…’ parameters with your system-specific details. Make two new folders: output/ and error/. Then run the script above by entering

sbatch ./generate_rof_tables.sh

into your terminal. This should take approximately 50 minutes. Once the ROF tables have been generated, run the optimize_sedento_valley.sh script provided. It should look like this:

#!/bin/bash
#SBATCH -n 16 -N 3 -p normal
#SBATCH --job-name=sedento_valley_optimization
#SBATCH --output=output/sedento_valley_optimization.out
#SBATCH --error=error/sedento_valley_optimization.err
#SBATCH --time=04:00:00
#SBATCH --mail-user=YOURUSERNAME@cornell.edu
#SBATCH --mail-type=all

DATA_DIR=/YOURDIRECTORYPATH/
N_REALIZATIONS=1000
export OMP_NUM_THREADS=16
cd $SLURM_SUBMIT_DIR

mpirun ./waterpaths -T ${OMP_NUM_THREADS}\
  -t 2344 -r ${N_REALIZATIONS} -d ${DATA_DIR}\
  -C -1 -O rof_tables_test_problem/ -e 3\
  -U TestFiles/rdm_utilities_test_problem_opt.csv\
  -W TestFiles/rdm_water_sources_test_problem_opt.csv\
  -P TestFiles/rdm_dmp_test_problem_opt.csv\
  -b true -o 200 -n 1000

As usual, replace all the ‘YOUR…’ parameters with your system-specific details. Run this script by entering

sbatch ./optimize_sedento_valley.sh

into the terminal. This script runs the Borg MOEA optimization for 1,000 function evaluations, and will output a .set file every 200 function evaluations. At the end of the run, you should have two files within your main folder:

  1. NC_output_MS_S3_N1000.set contains the Pareto-optimal set of decision variables and the performance objective values for the individual utilities and the overall region.
  2. NC_runtime_MS_S3_N1000.runtime contains information on the time it took for 1000 simulations of the optimization of the Research Triangle to complete.

The process should take approximately 1 hour and 40 minutes.

Summary

Congratulations, you are officially the Dr Strange of the Research Triangle! You have successfully downloaded WaterPaths and Borg MOEA, as well as run a simulation-optimization of the North Carolina Research Triangle test case across 1000 possible futures, in which you were Pareto-optimal in more than one. You obtained the .set files containing the Pareto-optimal decision variables and their respective performance objective values. Now that we have optimized the performance of the Research Triangle system, we are ready to examine the performance objectives and the Pareto-optimal ROF trigger values that result in this optimal set of tradeoffs.

In the next blog post, we will process the output of the .set files to visualize the objective space, decision variable space, and the tradeoff space. We will also conduct robustness analysis on the Pareto-optimal set to delve further into the question of “What are acceptable levels of performance for those actions in all worlds?”. Finally, we will explore the temporal interactions between the water use restrictions stROF and the infrastructure construction ltROF, and how supply and demand are both impacted by – and have an effect on – these decision variables.

References

Hadka, D., & Reed, P. (2013). Borg: An auto-adaptive Many-objective evolutionary computing framework. Evolutionary Computation, 21(2), 231–259. https://doi.org/10.1162/evco_a_00075

Trindade, B. C., Gold, D. F., Reed, P. M., Zeff, H. B., & Characklis, G. W. (2020). Water pathways: An open source stochastic simulation system for integrated water supply portfolio management and infrastructure investment planning. Environmental Modelling & Software, 132, 104772. https://doi.org/10.1016/j.envsoft.2020.104772

Trindade, B. C., Reed, P. M., & Characklis, G. W. (2019). Deeply uncertain pathways: Integrated multi-city regional water supply infrastructure investment and portfolio management. Advances in Water Resources, 134, 103442. https://doi.org/10.1016/j.advwatres.2019.103442

Zeff, H. B., Herman, J. D., Reed, P. M., & Characklis, G. W. (2016). Cooperative drought adaptation: Integrating infrastructure development, conservation, and water transfers into adaptive policy pathways. Water Resources Research, 52(9), 7327–7346. https://doi.org/10.1002/2016wr018771

Fitting and Simulating from NHMMs in R

The purpose of this post is to give a workflow that one could follow if your goal is to fit Non-Homogeneous Hidden Markov Models (NHMMs) and to simulate sequences of states. Julie Quinn wrote a series of great posts on fitting Hidden Markov Models (HMMs) here, but the goal of this post is to discuss the non-homogeneous counterparts of HMMs. NHMMs can be distinguished from the former because they involve non-stationary transition probabilities between states. These dynamic transition probability matrices are conditioned on one or more external covariates that influence transitions between states. One example of the use of NHMMs is to model and simulate precipitation. Precipitation models that simulate without using atmospheric information cannot be expected to perform well under conditions that deviate from those on which the model was fit. Thus, using a covariate that reflects changes in atmospheric circulation (such as geopotential height) can help capture some of the nonstationarity that a precipitation-based model alone could not capture.

I have had a lot of time to work with NHMMs in R; however, at the beginning I found overwhelmingly few examples to work off of, so this post outlines a workflow that I have settled on and that you can apply to your own application. First off, there are tons of packages available for HMMs, but very few handle the dynamic transition matrices of NHMMs. My favorite is depmixS4, found here. This package has all the tools needed to fit your NHMM, but it takes some experimenting to understand the syntax and to piece together the functions to create a full workflow.

First we want to fit a depmix model. In the first line of the code, I am creating a model called modNHMM. The data that I am fitting the NHMM with are the first 9 principal components of geopotential height. It is important that you list these components in this syntax and they should be the column titles of your dataset, which is synoptic.pcs in my case. The number of states is how many states you want to fit your model with. For a precipitation/streamflow application, you could fit a two-state NHMM to represent wet/dry conditions, or if you are trying to identify multiple regimes, you may have more.

library(depmixS4)

modNHMM <- depmix(list(PC1 ~ 1, PC2 ~ 1, PC3 ~ 1, PC4 ~ 1, PC5 ~ 1,
                       PC6 ~ 1, PC7 ~ 1, PC8 ~ 1, PC9 ~ 1),
                  nstates = num.states,
                  family = list(gaussian(), gaussian(), gaussian(), gaussian(), gaussian(),
                                gaussian(), gaussian(), gaussian(), gaussian()),
                  ntimes = nrow(synoptic.pcs),
                  data = data.frame(synoptic.pcs),
                  transition = ~ predictor$PC1 + predictor$PC2 + predictor$PC3 + predictor$PC4)

fit.modNHMM.depmix.paleo.check <- fit(modNHMM)

# state sequence using the Viterbi algorithm
synoptic.state.assignments <- posterior(fit.modNHMM.depmix.paleo.check)$state

Then we choose a distribution for our responses, which I choose to be Gaussian. The argument ntimes refers to the length of our time series, i.e., the number of rows in our dataset. Next we specify the dataset that contains the PC data and the transition formula, which is dictated by our external covariates. In this case, my covariates are in a dataframe called predictor and each column corresponds to one of the 4 covariates (which happen to be PCs of a different dataset) that will influence my transitions. Then we fit the model with our prescribed attributes using the fit function. Finally, we want to calculate the Viterbi sequence, which is the most probable path, or sequence of states, over the period that the model is fit. This step (the last line of code) returns a vector the length of the dataset with each day classified into one of the num.states states.

Now we need to locate the transition probability matrices. We use the following command:

fit.modNHMM.depmix.paleo.check@transition

If we were fitting an HMM, we would get one transition probability matrix. However, we get an output that looks like this:

Transition probability matrix coefficients

Now, instead of having constant values for transitions, we have equations of the form Intercept + b1*PC1 + b2*PC2 + b3*PC3 + b4*PC4, where the b values are the coefficient values listed in the table. If we were to look at the first block [[1]], this block dictates transitions from State 1 to each of the 5 states. The transitions from State 1 are as follows:

State 1 to State 1: 0+0+0+0+0 (State 1 is used as a reference variable in this case and the probability would be found by subtracting the other probabilities from 1 at the end)

State 1 to State 2: -3.11 - 0.22*PC1 + 0.22*PC2 + 0.014*PC3 - 0.13*PC4

and so on. Remember that these are linear predictors, not probabilities: to recover probabilities you exponentiate each entry and divide by the row total of the exponentiated values (a softmax), as in the code below.
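Written out, this is the multinomial-logit (softmax) form of the transition probabilities, with the first state as the reference category as in the example above (a sketch of the general structure rather than package documentation):

\[
P(S_t = j \mid S_{t-1} = i,\, \mathbf{x}_t) = \frac{\exp(\eta_{ij,t})}{\sum_{k=1}^{K} \exp(\eta_{ik,t})},
\qquad
\eta_{ij,t} = b_{0,ij} + b_{1,ij}\,\mathrm{PC1}_t + b_{2,ij}\,\mathrm{PC2}_t + b_{3,ij}\,\mathrm{PC3}_t + b_{4,ij}\,\mathrm{PC4}_t,
\qquad
\eta_{i1,t} = 0 .
\]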

Once you have these symbolic equations for the transition probability matrices, you can create a list of matrices that will allow you to simulate sequences of states for new sets of the PC1–PC4 covariates. You can see how you would create n different transition matrices if you have covariate time series of length n. Below, I represent those symbolic equations in code, using the getpars function to extract the coefficients, and store the resulting daily matrices in a list called mm. Depending on the number of covariates or states, you will need to adjust the indices accordingly.

pars <- getpars(fit.modNHMM.depmix.paleo.check)   # all fitted coefficients
n <- nrow(df)                                     # df holds the covariate (PC1-PC4) time series
mm <- vector("list", n)

for (j in 1:n) {
  transition_matrix <- matrix(NA, nrow = 5, ncol = 5)
  for (k in 1:5) {                   # k = current state (row of the matrix)
    base <- 6 + (k - 1) * 25         # index of the first intercept for state k's transition model
    for (s in 1:5) {                 # s = next state (column of the matrix)
      idx <- base + (s - 1)
      transition_matrix[k, s] <- pars[idx] +
        pars[idx + 5]  * df$PC1[j] +
        pars[idx + 10] * df$PC2[j] +
        pars[idx + 15] * df$PC3[j] +
        pars[idx + 20] * df$PC4[j]
    }
    # softmax: exponentiate and normalize each row to obtain probabilities
    transition_matrix[k, ] <- exp(transition_matrix[k, ]) / sum(exp(transition_matrix[k, ]))
  }
  mm[[j]] <- transition_matrix
}

Once we have these matrices, we can then simulate state sequences that can result from the chain of transition matrices. For this part, we need to create markov lists with our transition matrices:

library(markovchain)
library(parallel)   # provides mclapply

# iter = number of stochastic realizations; n.sim = number of time steps (length of mm)
mcObject <- mclapply(X = 1:iter, mc.preschedule = TRUE, mc.cores = 1, FUN = function(j) {

  mcObject.time.varying <- mclapply(X = 1:n.sim, mc.preschedule = TRUE, mc.cores = 1, FUN = function(t) {
    tr.prob <- as.matrix(mm[[t]])
    mcObject.time.varying.out <- new("markovchain", states = c("1", "2", "3", "4", "5"),
                                     transitionMatrix = tr.prob, name = paste("mcObject", t, sep = ""))
    return(mcObject.time.varying.out)
  })

  mcObject.final <- new("markovchainList", markovchains = mcObject.time.varying, name = "mcObject.nh")
  return(mcObject.final)

})

Finally we simulate using the following:

simulate.mc <- function(mcObject,num.states,dates.sim,last.month,last.day,n.sim,iter) {
  
  #this function will simulate the Markov chain iter times
  
  #Arguments:
  #mcObject = a Markov chain object from the markovchain package
  #num.states = the number of states 
  #date.sim = a time series of dates for the simulation period
  #last.month = last month of the season
  #last.day = last day of the last month of the season
  #iter = the number of iterations
  
  #day and month sequences for the simulation period
  days.sim <- as.numeric(format(dates.sim,"%d"))
  months.sim <- as.numeric(format(dates.sim,"%m"))
  n.sim <- length(dates.sim)  #length of simulation
  
  final.mc.sim <- mclapply(X=1:iter,mc.preschedule=TRUE,mc.cores=1,FUN=function(i){  
    
    mc.sim <- as.numeric(rmarkovchain(n=1,object=mcObject[[i]][[1]]))
    end.state <- paste(mc.sim[length(mc.sim)])
    for (t in 1:n.sim) {
      mc.sim <- c(mc.sim,as.numeric(rmarkovchain(n=1,object=mcObject[[i]][[t]],t0=end.state)))
      end.state <- paste(mc.sim[length(mc.sim)])
      if(months.sim[t]==last.month & days.sim[t]==last.day) {end.state <- paste(sample(1:num.states,size=1))}
    }    
    
    #here is the final mc simulation
    final.mc.sim.iter <- mc.sim[2:(n.sim+1)]
    return(final.mc.sim.iter)
  }
  )
  return(final.mc.sim)
  
}



simulations <- matrix(list(), 1, 1000)
for (i in 1:1000) {
  simulations[[i]] <- simulate.mc(mcObject = mcObject, num.states = num.states,
                                  dates.sim = dates.sim, last.month = last.month,
                                  last.day = last.day, iter = iter)
}

And that’s it! You can simulate for many different iterations (1000 in my case) and you will be returned a large list with your 1000 sequences of states over the simulation period.

CNNs for Time Series Applications

This post is meant to be an introduction to convolutional neural networks (CNNs) and how they can be applied to continuous prediction problems, such as time series prediction. CNNs have historically been utilized in image classification applications. At a high level, CNNs use small kernels (filters) that slide over localized regions of an image and detect features ranging from edges to faces, much in the same way as the visual cortex of a brain (Hubel and Wiesel, 1968). The basic concepts of a CNN were first introduced by Kunihiko Fukushima in 1980, and the first use of CNNs for image recognition was carried out by Yann LeCun in the late 1980s. The major breakthrough for the algorithm didn’t happen until the 2000s with the advent of GPU computing, and by 2015 CNNs were favored to win image recognition contests over other deep networks.

It is believed that recurrent style networks such as LSTMs are the most appropriate algorithms for time series prediction, but studies have been conducted that suggest that CNNs can perform equivalently (or better) and that appropriate filters can extract features that are coupled across variables and time while being computationally efficient to train (Bai et al., 2018, Rodrigues et al., 2021). Below, I’ll demonstrate some of the key characteristics of CNNs and how CNNs can be used for time series prediction problems.

Architecture

Everything You Need to Know About Convolutional Neural Networks

Figure 1: CNN schematic for image classification (Sharma, 2018)

Figure 1 shows a schematic of a CNN’s architecture. The architecture is primarily composed of a series of convolution and pooling layers followed by a fully connected network. Each convolution layer contains kernel matrices that are convolved with the layer’s input. It is up to the user to define the number and size of the kernels, but the weights in the kernels are learned through backpropagation. A bias is added to the output of the convolution layer, which is then passed through an activation function, such as the ReLU function, to yield feature maps. The feature maps are stacked in a cuboid whose depth equals the number of filters. If the convolution layer is followed by a pooling layer, the feature maps are down-sampled to produce a lower-dimensional representation. The output from the final pooling or convolutional layer is flattened and fed to the fully connected layers.

We will now look at the components of the architecture in more detail. To demonstrate how the convolutional layer works, we will use a toy example shown in Figure 2.

Figure 2: Convolution of a 3×3 kernel with the original image

Let’s say that our input is an image represented as a 5×5 array, and the filter is a 3×3 kernel that will be convolved with the image. The result is the array termed Conv1, which is just another array in which each cell is the dot product between the filter and the 3×3 subsection of the image it is centered on. The numbers in color represent the values that the filter is centered on. Note that the convolution operation results in an output that is smaller than the input and can lose information around the boundaries of the image. Zero padding, which constitutes adding a border of zeros around the input array, can be used to preserve the input size. The kernel matrices are the mechanisms by which the CNN is able to identify underlying patterns. Figure 3 shows examples of what successive output from convolution layers, or feature maps, can look like.

Figure 3: Convolutional layer output for a CNN trained to distinguish between cats and dogs (Dertat, 2017)

The filters in the first convolutional layer of a CNN retain most of the information of the image, particularly edges. The brightest colors represent the most active pixels. The feature maps tend to become more abstract or focused on specific features as you move deeper into the network (Dertat, 2017). For example, Block 3 seems to be tailored to distinguish eyes.

The other key type of layer is a pooling layer. A pooling layer is added after convolution to reduce dimensionality, which reduces the computational cost of training by reducing the number of parameters and can also reduce the chances of overfitting. The most common type of pooling is max pooling, which returns the maximum value within an N×N pooling filter. This type of pooling retains the most active pixels in the feature map. As demonstrated in Figure 4, max pooling, using a 2×2 filter with a stride (or shift) of 2 pixels, reduces our Conv1 layer into a 2×2 lower-dimensional matrix. One can also use average pooling instead of max pooling, which would take the average of the values in each 2×2 subsection of the Conv1 layer.

Figure 4: Max pooling example
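For concreteness, here is a small NumPy sketch of the two operations just described; the array, kernel, and feature-map values are made up for illustration and are not the ones shown in the figures:

import numpy as np

# made-up 5x5 "image" and 3x3 kernel, mirroring the toy example in Figure 2
image = np.array([[1, 0, 2, 3, 1],
                  [0, 1, 1, 2, 0],
                  [2, 1, 0, 1, 3],
                  [1, 2, 1, 0, 1],
                  [0, 1, 3, 1, 2]], dtype=float)
kernel = np.array([[ 1, 0, -1],
                   [ 1, 0, -1],
                   [ 1, 0, -1]], dtype=float)

# 'valid' convolution (cross-correlation, as most deep learning libraries implement it):
# each output cell is the elementwise product of the kernel with a 3x3 patch, summed
conv1 = np.array([[np.sum(image[i:i+3, j:j+3] * kernel) for j in range(3)]
                  for i in range(3)])
print(conv1)      # 3x3 feature map: a 5x5 input shrinks without zero padding

# max pooling with a 2x2 window and a stride of 2, applied to a made-up 4x4 feature map
fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 5, 3, 2],
                 [1, 1, 2, 6]], dtype=float)
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)     # [[4., 2.], [5., 6.]]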

Application to Regression

CNNs are easiest to understand and visualize for image applications, but those ideas provide a basis for thinking about how we can use CNNs in a regression or prediction application for time series. Let’s use a very simple example of a rainfall-runoff problem that uses daily precipitation and temperature to predict outflow in an ephemeral sub-basin within the Tuolumne Basin. Because the sub-basin’s creek is ephemeral, it can dry up across the simulation period, leaving extended periods of zero flow, which can make predictions in the basin very difficult. Here, we also implement a lag, which allows us to account for the residence time of the basin: precipitation and temperature from preceding days likely contribute to predicting the outflow today. We use a lag of 18, meaning that we use the previous 18 days of precipitation and temperature to predict outflow. The CNN model is implemented within Keras in the code below.

#import modules

import numpy as np
import pandas as pd
from keras.utils import to_categorical
from keras.models import Sequential, Model, load_model
from keras.layers import LSTM, Dense, Input
from keras.layers.convolutional import Conv1D, Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.layers import Dropout, Activation, Flatten
from keras.optimizers import SGD
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from tqdm import tqdm_notebook
import seaborn as sns
import os

os.getcwd()
os.chdir("C:/Users/Rohini/Documents/")
df_ge = pd.read_csv("Sub_0_daily.csv", index_col=0) 
print(df_ge.head())

#Check for nulls
print("checking if any null values are present\n", df_ge.isna().sum())

#Specify the training columns by their names
train_cols = ["Precipitation","Temperature"]
label_cols = ["Outflow"]



# This function normalizes the input data
def Normalization_Transform(x):
    x_mean=np.mean(x, axis=0)
    x_std= np.std(x, axis=0)
    xn = (x-x_mean)/x_std
    return xn, x_mean,x_std




# This function reverses the normalization 
def inverse_Normalization_Transform(xn, x_mean,x_std):
    xd = (xn*x_std)+x_mean
    return xd



# building timeseries data with given timesteps (lags)
def timeseries(X, Y, Y_actual, time_steps, out_steps):
    input_size_0 = X.shape[0] - time_steps
    input_size_1 = X.shape[1]
    X_values = np.zeros((input_size_0, time_steps, input_size_1))
    Y_values = np.zeros((input_size_0,))
    Y_values_actual = np.zeros((input_size_0,))
    
    for i in tqdm_notebook(range(input_size_0)):
        X_values[i] = X[i:time_steps+i]
        Y_values[i] = Y[time_steps+i-1, 0]
        Y_values_actual[i] = Y_actual[time_steps+i-1, 0]
        
    print("length of time-series i/o",X_values.shape,Y_values.shape)
    return X_values, Y_values, Y_values_actual


df_train, df_test = train_test_split(df_ge, train_size=0.8, test_size=0.2, shuffle=False)
x_train = df_train.loc[:,train_cols].values
y_train = df_train.loc[:,label_cols].values
x_test = df_test.loc[:,train_cols].values
y_test = df_test.loc[:,label_cols].values    
   
#Normalizing training data using the z-score functions defined above
x_train_nor, x_mean_train, x_std_train = Normalization_Transform(x_train)
y_train_nor, y_mean_train, y_std_train = Normalization_Transform(y_train)

# Normalizing test data
x_test_nor, x_mean_test, x_std_test = Normalization_Transform(x_test)
y_test_nor, y_mean_test, y_std_test = Normalization_Transform(y_test)

# Saving actual train and test y_label to calculate mean square error later after training
y_train_actual = y_train
y_test_actual = y_test

#Building timeseries
X_Train, Y_Train, Y_train_actual = timeseries(x_train_nor, y_train_nor, y_train_actual, time_steps=18, out_steps=1)
X_Test, Y_Test, Y_test_actual = timeseries(x_test_nor, y_test_nor, y_test_actual, time_steps=18, out_steps=1)

#Define CNN model

def make_model(X_Train):
    input_layer = Input(shape=(X_Train.shape[1],X_Train.shape[2]))

    conv1 = Conv1D(filters=16, kernel_size=2, strides=1,
                    padding='same',activation='relu')(input_layer)
    conv2 = Conv1D(filters=32, kernel_size=3,strides = 1,
                          padding='same', activation='relu')(conv1)
    conv3 = Conv1D(filters=64, kernel_size=3,strides = 1,
                          padding='same', activation='relu')(conv2)
    flatten = Flatten()(conv3)
    dense1 = Dense(1152, activation='relu')(flatten)
    dense2 = Dense(576, activation='relu')(dense1)
    output_layer = Dense(1, activation='linear')(dense2)
    
    return Model(inputs=input_layer, outputs=output_layer)

model = make_model(X_Train)
model.compile(optimizer = 'adam', loss = 'mean_squared_error')
model.fit(X_Train, Y_Train, epochs=10)


#Prediction and inverting results 
ypred = model.predict(X_Test)
predict =inverse_Normalization_Transform(ypred,y_mean_train, y_std_train)


#Plot results
plt.figure(figsize=(11, 7))

plt.plot(y_test)
plt.plot((predict))

plt.title('Outflow Prediction (Precipitation+Temperature, Epochs=10, Lag=18 days)')
plt.ylabel('Outflow (cfs)')
plt.xlabel('Day')
plt.legend(['Actual Values','Predicted Values'], loc='upper right')
plt.show()

    

Just as with any algorithm, we normalize the input data and split it into testing and training sets. The CNN model is implemented in Keras and consists of three convolutional layers with kernel sizes that are explicitly defined to extract patterns that are coupled across variables and time. A schematic of the setup is shown in Figure 5.

Figure 5: Convolution layer setup for the Tuolumne case

Layer 1 uses a 1D convolutional layer with 16 filters of size 1×2 in order to extract features and interactions across the precipitation and temperature time series as demonstrated in the top left of Figure 5. The result of this is an output layer of 1x18x16. The second convolution layer uses 32, 3×1 filters which now will further capture temporal interactions down the output column vector. The third layer uses 64, 3×1 filters to capture more complex temporal trends which is convolved with the output from the Conv2 layer. Note that zero padding is added (padding =”same” in the code) to maintain the dimensions of the layers. The three convolutional layers are followed by a flattening layer and a three-layer dense network. The CNN was run 20 times and the results from the last iteration are shown in Figure 6. We also compare to an LSTM that has an equivalent 3-layer setup and that is also run 20 times. The actual outflow is shown in blue while predictions are shown in red.
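The exact LSTM configuration used for the comparison in Figure 6 is not shown here, but a comparable 3-layer Keras setup might look like the sketch below (the unit counts are assumptions chosen to mirror the three convolutional layers):

from keras.models import Model
from keras.layers import Input, Dense, LSTM

def make_lstm_model(X_Train):
    # three stacked LSTM layers as a rough analogue of the three convolutional layers above;
    # the unit counts are illustrative, not the exact configuration used for Figure 6
    input_layer = Input(shape=(X_Train.shape[1], X_Train.shape[2]))
    lstm1 = LSTM(16, return_sequences=True)(input_layer)
    lstm2 = LSTM(32, return_sequences=True)(lstm1)
    lstm3 = LSTM(64)(lstm2)
    output_layer = Dense(1, activation='linear')(lstm3)
    return Model(inputs=input_layer, outputs=output_layer)

# lstm_model = make_lstm_model(X_Train)
# lstm_model.compile(optimizer='adam', loss='mean_squared_error')
# lstm_model.fit(X_Train, Y_Train, epochs=10)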

Figure 6: CNN vs LSTM prediction

For all practical purposes, the visual comparison suggests that the CNN and LSTM work equivalently, though the CNN was considerably faster to train. Notably, the CNN does a better job of capturing the large extremes recorded around day 100 and day 900, while still capturing the dynamics of the lower flow regime. While these results are preliminary and largely un-optimized, the CNN shows the ability to outperform an LSTM for a style of problem that it is not technically designed for. Using the specialized kernels, the CNN learns the interactions (both across variables and temporally) without needing a mechanism specifically designed for memory, such as the cell state in an LSTM. Furthermore, CNNs can take great advantage of GPU speedups, which don’t always produce large efficiency gains for LSTM training. For now, we can at least conclude that CNNs are fast and promising alternatives to LSTMs that you may not have considered before. Future blog posts will dive more into the capabilities of CNNs in problems with more input variables and complex interactions, particularly if there seems to be a benefit from CNNs in resolving complex relationships that help to predict extremes.

References

Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. The Journal of Physiology, 195(1), 215-243.

Bai, S., Kolter, J. Z., & Koltun, V. (2018). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.

Rodrigues, N. M., Batista, J. E., Trujillo, L., Duarte, B., Giacobini, M., Vanneschi, L., & Silva, S. (2021). Plotting time: On the usage of CNNs for time series classification. arXiv preprint arXiv:2102.04179.

Sharma, V. (2018). https://vinodsblog.com/2018/10/15/everything-you-need-to-know-about-convolutional-neural-networks/

Dertat, A. (2017). https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2

MORDM Basics IV: Visualizing ROF-Storage Dynamics (finally)

The previous post described a simple, two-objective test case in which the city of Cary employed risk-of-failure (ROF) triggers as levers to adjust for its preferred tradeoff level between its objectives. The example given showed how ROF triggers allowed Cary to account for future uncertainty in its system inputs, thus enabling it to visualize how their risk appetite would affect their desired outcomes.

In meeting these objectives, different risk thresholds would have affected Cary’s response to extreme events such as floods and droughts, and its ability to fulfill demand. Simply analyzing the tradeoffs between objectives that result from a range of ROF trigger values only presents one side of the story. It is vital to visualize how these performance objectives and tradeoffs manifest in the system’s capacity (pun intended) to store enough water in times of scarcity, and by extension, its ability to fulfill its customers’ demand for water.

Using ROFs allows us to measure more concretely how the dynamics of both storage and demand fulfillment evolve and change over time for a given risk tolerance. In the long term, these dynamics will influence when and where new water infrastructure is built to cope with storage requirements and demand growth, but this is a topic for a future blog post. This week, we will focus on unpacking the dynamic evolution of storage and demand in response to different ROF trigger values.

As a quick refresher, our system is a water supply utility located in Cary, a city within the Research Triangle region in North Carolina (Trindade et al, 2017). Cary implements water-use restrictions, during which only 50% of demand is met, whenever its weekly ROF exceeds the threshold of risk it is willing to tolerate (α). More frequent water-use restrictions help to maintain reservoir levels and ensure reliability, which was defined in the previous blog post. However, the decision to implement restrictions (or not) will impact the storage levels of the system. With this in mind, we will first examine how storage responds to the triggering of a water-use restriction. For context, we consider a scenario in which Cary’s inflow timeseries is only 20% of the original levels. Figure 1 below shows the inflow, demand and storage timeseries for this scenario.
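Schematically, the weekly ROF can be written as the fraction of evaluated hydrologic realizations in which storage drops below the failure threshold of 20% of capacity; the number of realizations N and the evaluation horizon follow the earlier ROF posts and are only sketched here:

\[
\mathrm{ROF}_t = \frac{1}{N} \sum_{n=1}^{N} \mathbf{1}\!\left[ \min_{w \,\in\, \mathrm{horizon}(t)} \frac{S_{n,w}}{S_{\mathrm{cap}}} < 0.2 \right],
\qquad \text{restrictions are triggered when } \mathrm{ROF}_t > \alpha .
\]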

Figure 1: The hydrologic timeseries for Cary given that no water restrictions are implemented in a scenario where inflows are 20% of the original levels.

Cary’s challenge becomes apparent in Figure 1. While inflow decreases over time (fewer peaks), demand is steadily growing and has effectively tripled by the end of the period. This results in periods during which storage levels drop to zero, which occurs once past 2040. Also note that the frequency of low storage troughs increases in the second half of the period. The following questions can thus be posed:

  1. How does the system’s ROF change with increasing demand and decreasing supply?
  2. How does risk tolerance affect the implementation of water-use restrictions during drought?
  3. How will the system reservoir levels respond to different levels of risk tolerance?
Figure 2: The length of the pink bars denote the nth-week during which the first water use restriction was implemented for a given α-value. This is an indicator of the responsiveness of the system to a drought, or decrease in storage levels. The blue line indicates the percent of storage filled with water.

To answer the first question, it is useful to identify how different values of α affect the first instance of a water-use restriction. Figure 2, generated using ‘rof_dynamics.py‘, demonstrates that lower risk tolerances result in earlier implementation of restrictions. This is reasonable, as an actor who is more risk-averse will quickly implement water-use restrictions to maintain reliable storage levels during a drought, whereas an actor who is more willing to tolerate the chance of low reservoir levels will delay implementing water-use restrictions. The blue line juxtaposed on top of the bars indicates the inflows to the reservoir. After the first period of low flows between weeks 30-40, the plot shows that inflows do not recover and are likely insufficient to fill the reservoir to its initial levels. With a lower α, an actor is more likely to implement restrictions almost immediately after observing merely a few weeks of low inflows. In contrast, an actor who opts for a higher α will only resort to restrictions after seeing an extended period of low flows, during which they can be more certain that restrictions are absolutely necessary.

Answering the second and third questions first requires that periods of drought be more definitively quantified. To do this, the standardized streamflow indicator (SSI6) was used. The SSI6 is a method that identifies time periods during which the standardized inflow is less than the 6-month rolling mean (Herman et al, 2016). It detects a drought period when the value of the SSI6 is below 0 for three consecutive months and below -1 at least once during the three-month period. Juxtaposing the storage-restriction dynamics and the periods of drought will allow us to see where restrictions were imposed and their impacts on reservoir levels for a given demand timeseries.

Figure 3 and Figure 4 visualize how the system’s storage levels respond to drought (the red bars in the lower subplot) by implementing water-use restrictions (the dark red lines in the upper subplot), given α = 1% and α = 15% respectively. Predictably, restrictions coincide with periods of drought as defined by the SSI6. However, with a lower risk tolerance, periods of restrictions are longer and more frequent. As Figure 3 shows, an actor with a lower risk tolerance may implement restrictions where only a slight risk of failure exists.

Figure 3: Storage dynamics given α=1%. (Upper subplot) The blue lines indicate the reservoir storage levels in billion gallons per week. The yellow lines are the weekly ROF values, or the likelihood that the water stored will drop below 20% of full capacity. The grey lines indicate where water-use restrictions are implemented, and the red dashed line denotes α=1%. (Lower subplot) The zones where droughts were detected using the SSI6 method (Herman et al., 2016) are highlighted in red.

Compared to α = 1%, an actor who is willing to tolerate higher ROF values (Figure 4 as an example) will implement restrictions less frequently and for shorter periods of time. Although this means that demands are less likely to get disrupted, this also puts water supplies at a higher risk of dropping to critical levels (<20%), as restrictions may not get implemented even during times of drought.

Figure 4: Storage dynamics given α=15%. (Upper subplot) The blue lines indicate the reservoir storage levels in billion gallons per week. The yellow lines are the weekly ROF values, or the likelihood that the water stored will drop below 20% of full capacity. The grey lines indicate where water-use restrictions are implemented, and the red dashed line denotes α=15%. (Lower subplot) The zones where droughts were detected using the SSI6 method (Herman et al., 2016) are highlighted in red.

There is one important thing to note when comparing Figures 3 and 4. When the periods of water-use restriction coincide for both α-values (between 2040 and 2050), the actor with the lower tolerance implements restrictions at the beginning of both drought periods. This decision makes the biggest difference in terms of reservoir storage levels. By implementing water-use restrictions early and for a longer period of time, Cary’s reservoir levels are consistently kept above 50% of full capacity (given a full capacity of 7.45 BG). The actor with the higher risk tolerance, in contrast, allows water levels to drop below 30% of full capacity during periods of drought.

Although this seems undesirable, recall that the system is said to have failed only if storage drops below 20% of full capacity. Herein lies the power of the ROF metric: questions 2 and 3 can be answered by generating storage-restriction response figures like those shown above, which allow an actor to examine the consequences of varying levels of risk tolerance on their ability to fulfill demand while maintaining sufficient water levels. This can improve judgement on how much risk a utility can actually tolerate without adversely impacting the socioeconomic systems that depend on it. In addition, using ROFs enables a utility to better estimate when new infrastructure really needs to be built, instead of making premature investments as a result of unwarranted risk aversion.

To briefly summarize this blog post, we have shown how different risk tolerance levels affect the decisions made by an actor, and how these decisions in turn impact the system. Not shown here is the ability of an ROF to evolve over time given climate change and the construction of new water supply infrastructure. In the next blog post, we will briefly discuss the role of ROFs in mapping out adaptation pathways for a utility, how ROFs form the basis of dynamic and adaptive pathways and their associated operating policies, and connect this to the concept of the soft path (Gleick, 2002) in water supply management.

References

Gleick, P., 2002. Water management: Soft water paths. Nature, 418(6896), p.373.

Herman, J., Zeff, H., Lamontagne, J., Reed, P. and Characklis, G., 2016. Synthetic Drought Scenario Generation to Support Bottom-Up Water Supply Vulnerability Assessments. Journal of Water Resources Planning and Management, 142(11), p.04016050.

Trindade, B., Reed, P., Herman, J., Zeff, H. and Characklis, G., 2017. Reducing regional drought vulnerabilities and multi-city robustness conflicts using many-objective optimization under deep uncertainty. Advances in Water Resources, 104, pp.195-209.

NetCDF Operators

This post is an introduction to Linux-based Climate Data Operators (CDO) and NetCDF Operators (NCO), which allow you to perform various operations on netCDF files through the command line. I found these commands to be really nifty when I was working with pre-industrial control runs from a GCM. The output was being written at a daily timestep, across 1200 years, and for the whole world, so it was absolutely essential that I cut the size of the files down as much as I could before transferring them to my own computer.

The official documentation and installation instructions for NCO can be found here and CDO here, but if you’re working on a supercomputer, the libraries will likely already be installed. I will outline how I used some of these functions for my pre-industrial runs.

Concatenation

Some of the NCO commands have size limits of 40 GB, so it’s important to use the right order of operations when processing your files, which will differ depending on your ultimate goal. My goal was ultimately to get the 500-hPa geopotential height anomalies across the whole 1200-year period for just the Western US. Assuming you have a directory with all the netCDF files, the first step is to concatenate the data, since my run was broken into many smaller files. The easiest way to do this is with the following command, which will take all the netCDF files in the directory (using the *) and merge them into a file called merged_file.nc:

cdo mergetime *.nc merged_file.nc

Return Individual Monthly Files

When calculating anomalies, you will need to determine a mean geopotential height value for each of the 12 months, and then calculate daily deviations with respect to these monthly means. The first step is to split the merged file into individual monthly files with the following command:

cdo splitmon merged_file.nc zg500.mon

This command will return 12 files of the form zg500.mon$i.nc.

Return Monthly Means and Daily Anomalies

The next step is to calculate a monthly mean for each of these files. For example, for January use:

cdo timmean zg500.mon1.nc zg500.mean.mon1.nc

Return Anomalies

Now we subtract the means from each monthly file to return the daily anomalies for each month, which will be of the form: zg500.mean.mon${i}.anom.nc. If you want to combine the last two steps into one loop, you can use the code below:

for i in $(seq 1 12)
do
  cdo timmean zg500.mon${i}.nc zg500.mean.mon${i}.nc
  cdo sub zg500.mon${i}.nc zg500.mean.mon${i}.nc zg500.mean.mon${i}.anom.nc
done 

Cut Down to Geographical Area of Interest

Finally, we need to cut the data down to just the Western US. We use ncks (netCDF Kitchen Sink) from NCO, which is probably the most versatile of the NCO functions (hence the name). This command is one that has the 40 GB limit, which is why I had to wait until I had monthly files before I could cut them down geographically. We must first specify the variable of interest using the -v flag. In my case, I only had one variable to extract, but you can also extract multiple variables in one command. Then denote the range of latitude and longitude using the -d flags. It is very important to include the periods at the end of each lat/lon bound (even if your bounds are integers); otherwise ncks treats the values as indices rather than coordinates and the command will not subset as intended.

for i in $(seq 1 12)
do
  ncks -v zg500 -d lon,180.,260. -d lat,30.,60. zg500.mean.mon${i}.anom.nc -o zg500.mean.mon${i}.cut.anom.region.nc
done 

Ultimately, you will get 12 files of the form zg500.mean.mon${i}.cut.anom.region.nc. And that’s it! You can concatenate the monthly files back together and re-sort the data into the correct sequence according to time. I was unsuccessful at finding a quick way to do this on the command line, but it is possible. I found it much easier to work on this within R: I imported each of the 12 anomaly files, assigned a time vector, concatenated each monthly anomaly matrix into a larger matrix, and then sorted according to date. If your files are small enough by the end of the process, this is likely the easiest way to take care of the re-sorting. Enjoy!
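If you would rather stay in Python, the same recombine-and-sort step can be sketched with xarray. This is just an illustrative alternative to the R workflow described above; it assumes the files follow the naming pattern produced by the loop, and the output filename is my own choice.

import glob
import xarray as xr

# Read the 12 regional anomaly files, concatenate them along time, and
# sort the combined record back into chronological order.
files = sorted(glob.glob("zg500.mean.mon*.cut.anom.region.nc"))
anom = xr.concat([xr.open_dataset(f) for f in files], dim="time").sortby("time")
anom.to_netcdf("zg500.anom.region.sorted.nc")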

Do The (Schaake) Shuffle

This post is an introduction to the Schaake Shuffle, a method for reconstructing space-time variability in forecasted and synthetic variables. The Schaake Shuffle was originally introduced in a synthetic weather generation post by Julie Quinn almost 5 years ago. Lately, the importance (and difficulty) of reproducing spatial and temporal variability in forecasts and synthetically generated variables across multiple correlated sites has been a prominent topic in our group. The goal of this post is to “bump” this topic back into discussion and to make readers aware of its existence as a nifty post-generation way to build spatial and temporal variability back into synthetically generated data or forecasts. In the fundamental paper that establishes the method, Clark et al. (2004), the authors apply the method to forecasts of precipitation and temperature. For weather variables such as temperature and precipitation, it is common to create forecasts for individual stations from a Numerical Weather Prediction (NWP) model. These variables serve as predictor variables in regression models that can be used to generate forecasts. The problem with these approaches is that neither the spatial correlation between stations nor temporal persistence is preserved, both of which are very important for hydrologic applications with memory.

The Schaake Shuffle is a method that reorders ensemble forecasts of precipitation and temperature to better reconstruct the space-time variability using a rank-ordering approach constructed from a historical record. The basic steps are as follows:

  1. Gather appropriate data: The NWP model outputs forecasts of accumulated precipitation, air temperature, relative humidity at 700 hPa, wind speed, total column precipitable water, and mean sea level pressure, which are used as predictors in the forecast equations. Further, the authors acquire historical precipitation and temperature data for stations within four basins across the United States.
  2. Create Forecasts: The next step involves creating the precipitation and temperature ensemble forecasts. A multiple linear regression is used to develop model output statistics (MOS) equations. The forecasted variables taken from the NWP model are ultimately filtered down to keep only the variables that explain the most variance in the response variable (in this example, the response variables are precipitation, minimum temperature, and maximum temperature). A separate regression equation is fit for each variable, station, and forecast lead time. The residuals of the regression equation are modeled stochastically to generate an ensemble of forecasts. Alternatively, one can apply the Schaake Shuffle to synthetically generated ensembles (not limited to forecasts).
  3. Reorder Forecasts: The reordering method can best be described by an example. For a given time, assume you have an ensemble of possible forecasts that you align in a 3D matrix X_{i,j,k}, where i = ensemble member, j = station, and k = variable of interest (precipitation or temperature). From the historical record, you must construct an equally sized matrix Y_{i,j,k} which contains historical station observations for the same date. For this matrix, i = an index of dates in the historical time period, j = station, and k = variable of interest (precipitation or temperature).

Using this same notation, the authors create a toy example to demonstrate the process. For some time t, imagine we have a 10-member ensemble forecast of maximum temperature for a given date and station.

Let X be a 10 member ensemble in consideration.

X=[15.3,11.2,8.8,11.9,7.5,9.7,8.3,12.5,10.3,10.1]

We can sort the vector X to create χ

χ=[7.5,8.3,8.8,9.7,10.1,10.3,11.2,11.9,12.5,15.3].

Then we go to the historical record and choose 10 dates that reside in a 7 day window around the date that is being forecasted. This is our Y vector.

Y=[10.7,9.3,6.8,11.3,12.2,13.6,8.9,9.9,11.8,12.9]

We can sort this vector to create γ.

γ=[6.8,8.9,9.3,9.9,10.7,11.3,11.8,12.2,12.9,13.6]

We also create a vector B, which records, for each entry of the sorted historical vector γ, the position of that value in the unsorted vector Y.

B=[3,7,2,8,1,4,9,5,10,6]

The key is now to reorder the ensemble forecast according to the B vector. The rank-order-1 value appears in position 5 of the B vector, so the first entry of the shuffled ensemble takes the 5th value of χ (10.1). Rank order 2 appears in position 3, so the second entry takes the 3rd value of χ (8.8). We continue doing this until we have

Xss=[10.1, 8.8, 7.5, 10.3, 11.9, 15.3, 8.3, 9.7, 11.2, 12.5]
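For readers who want to see this in code, here is a minimal NumPy sketch of the single-station reordering step that reproduces the toy example above; the function name and implementation are my own, not taken from Clark et al. (2004).

import numpy as np

def schaake_shuffle_1d(X, Y):
    """Reorder the forecast ensemble X so it inherits the rank structure
    of the historical observations Y (single station, single variable)."""
    chi = np.sort(X)                    # sorted forecast values (chi)
    ranks = np.argsort(np.argsort(Y))   # 0-based rank of each historical value
    return chi[ranks]                   # the k-th ranked forecast goes where Y's k-th ranked value sat

X = np.array([15.3, 11.2, 8.8, 11.9, 7.5, 9.7, 8.3, 12.5, 10.3, 10.1])
Y = np.array([10.7, 9.3, 6.8, 11.3, 12.2, 13.6, 8.9, 9.9, 11.8, 12.9])
print(schaake_shuffle_1d(X, Y))
# [10.1  8.8  7.5 10.3 11.9 15.3  8.3  9.7 11.2 12.5]

Applying this column by column, using the same set of historical dates for every station, is what restores the spatial correlation across sites.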

These are the basic fundamentals of the reordering algorithm, and it can be extended to forecasting at multiple stations, as demonstrated in the figure below. Table A shows 10 ensemble members for forecasting weather on January 14th, 2004, ranked from lowest to highest value for three stations. Table B shows the historical record, and the black and light gray ellipses represent the 1st and 2nd ensemble members respectively. Table C shows the sorted historical record and where the selected historical observations lie in the sorted list. Finally, Table A can be reordered accordingly to form Table D.

Schaake Shuffle for a 10-member ensemble and 3 stations (Figure 2 from Clark et al., 2004)

It’s important to remember that the Schaake Shuffle is only meant to reconstruct the Spearman rank correlation structure of the observations, not their actual (Pearson) correlations. The results from the paper, however, are quite remarkable and show how well the method captures spatial and temporal properties. The figure below shows an example of how the method preserves spatial correlation between two selected stations. The top set of figures shows raw ensemble output, while the bottom figures show results after the ensemble is reordered. The black lines denote the target observed correlation. Clearly, the reordered output approximates the observed correlation across lead times better than the raw ensemble output.

Preservation of spatial correlation for raw (top) and reordered (bottom) forecast ensembles (Figure 6 from Clark et al., 2004)

One basic limitation of this approach is the assumption of stationarity: the rank structure in the historical record must still be applicable to the forecasted data. While other methods exist that can preserve space-time variability well, the advantage of the Schaake Shuffle is the ability to reconstruct these patterns after the fact, as a post-processing step. If readers are interested in implementing the Schaake Shuffle, basic pseudocode is included at the end of the paper, but there is also an R package that can automate the reordering process here. The steps to download the package and run the algorithm are denoted here. Note that this code only works for the single-station case. Here, each column in the X matrix is an ensemble member and the rows correspond to the days in the forecast. Implementing the example in Figure 2 for one station requires X and Y to each be a single row vector. Of course, one can manually extend this process to multiple stations.

install.packages("devtools")

devtools::install_github("katerobsau/depPPR")

library(depPPR)

forecast_example=as.matrix(read.delim("C:/Users/Rohini/Documents/synthetic.txt",header=FALSE))
climate_example=as.matrix(read.delim("C:/Users/Rohini/Documents/historical.txt",header=FALSE))

schaake_shuffle(X = forecast_example, Y = climate_example)

References:

All material is derived from the fundamental paper that introduces the Schaake Shuffle:

Clark, M., Gangopadhyay, S., Hay, L., Rajagopalan, B., & Wilby, R. (2004). The Schaake shuffle: A method for reconstructing space–time variability in forecasted precipitation and temperature fields. Journal of Hydrometeorology, 5(1), 243-262.

Potential Biases and Pitfalls of Machine Learning

Artificial intelligence has already changed the world in many ways. It is very difficult to find a manmade tool that does not use AI in one way or another. It can be argued that many of these changes have been positive; however, it can also easily be shown that AI has not benefited everybody evenly.

In terms of the application of AI in science, the number of publications on this topic has significantly increased during the last few years. I searched for the words water, machine, and learning on Google Scholar and limited the search to only count the studies with all of these words in their titles. I created the following figure from the results of that simple search. As you can see, the number of publications has significantly increased in the last few years. This is obviously not a thorough study, but it shows how important artificial intelligence has become in academic research.

However, there are many pitfalls that can degrade the usefulness of AI in academic and non-academic applications. Some people even argue that these pitfalls could potentially lead us to a new AI winter that nobody will enjoy. In this blog post, I decided to go over some of these issues and difficulties. Some of the problems have emerged recently, but others have been known by statisticians for decades. For many reasons, they continue to exist and negatively affect artificial intelligence projects.

Any machine learning project usually includes the following components: 1) sample collection and preparation, 2) ML model generation, and 3) result interpretation and extraction of actionable insights. Here, I will talk about some of the mistakes and biases that can happen in each of these steps.

1- Sample Collection and Preparation

The input dataset is a crucial source of bias in any ML project. The problem basically occurs when the dataset is not a good representation of the real world. Here are a few variants of this issue:

Exclusion Bias

This happens when specific groups that exist in the real world are systematically absent in the dataset. For example, a specific part of the population can be absent in the dataset, or the dataset can only contain information about particular countries, climates, etc. Having a diverse group that represents various segments of society and different worldviews can reduce the consequences of this type of bias.

Measurement Bias

There are different reasons that observations can be faulty and unreliable. For example, observations can be sensed and collected through malfunctioning measurement devices that can bias the entire dataset. Also, human reading errors and mistakes can cause measurement errors. Additionally, there are many things that can go wrong during the post-processing of raw measurements and can lead to measurement bias.

Other Sampling Biases

There are also other types of data collection biases that have been discussed in the statistical literature for years. One example is self-selection bias. Let’s imagine that you are running an online poll and want to analyze the results, but your participants are not fully random. People who choose to participate in your poll might have specific personalities or worldviews and not represent the population at large. Other biases that have been widely discussed in the statistical literature include survivor bias, prejudice bias, observer bias, and recall bias. You can find more information about these biases here, here, and here. If possible, these biases should be avoided. Otherwise, their effects should be estimated and removed from the analysis. Also, acquiring more comprehensive datasets can reduce these types of biases.

Imbalanced Datasets

If the data collection process is fair and reliable, an imbalanced dataset is not really a bias: it’s more of a challenge that is inherent to specific problems. Some events just don’t happen frequently. Problems and research questions that focus on rare events need their own appropriate treatments. In a previous blog post, I explained this issue and the ways that we can deal with it. You can also refer to this article and this article for more information about imbalanced datasets.

Data Leakage

Data leakage is a situation in which we mistakenly share a portion (or the entirety) of the data between the training and validation datasets. Data leakage can happen due to different types of mistakes during data preparation. Timeseries can also implicitly include data leakage: if the training and testing datasets are selected in a way that allows the training dataset to gain information about the held-out period, we have implicitly leaked data. For example, if we extract the testing dataset from the middle of the original timeseries, we basically give the model insight into what happens before and after the testing period, which is effectively cheating and inflates the apparent accuracy of the model. One way to avoid this is to make sure that our training data period precedes the validation period.
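As a small illustration of that last point, scikit-learn's TimeSeriesSplit builds folds in which every training window strictly precedes its test window. The arrays below are synthetic stand-ins, not data from any particular study.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in data: 100 time steps of one feature and one target.
X = np.arange(100).reshape(-1, 1)
y = np.random.default_rng(0).normal(size=100)

# Each split trains only on the past and tests only on the future,
# which avoids the "test set in the middle" form of leakage.
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    print(f"train: steps {train_idx[0]}-{train_idx[-1]}, "
          f"test: steps {test_idx[0]}-{test_idx[-1]}")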

Another type of leakage is called feature leakage, and it happens when our feature list includes a variable that is either closely correlated with the data labels or is a duplicate of the data label. Obviously, we won’t have that feature at prediction time, so a model trained on it will look deceptively accurate during validation and perform poorly in practice.

More information about data leakage can be found here and here.

2- ML and Statistical Models

Lack of Physical Understanding of the System

In machine learning, we use statistical relationships to explore how the system would behave under a specific condition. However, ML models do not gain any explicit knowledge about the actual physical laws that govern the system. For example, in hydrology, it is very possible to violate water and energy conservation laws while training and using machine learning-driven models. However, there are studies (here and here) that have been trying to incorporate these insights into their ML model. This issue is likely to attract more attention in the future.

Reproducibility

There is a general consensus among ML researchers that many ML-based studies are not reproducible, meaning that an independent researcher cannot replicate the ML workflow and obtain the same results. Many recent studies have shown that ML scientists are often not able to reproduce the results of other studies. While some people refer to this as the reproducibility crisis in ML studies, it is a general academic research issue. Based on a 2016 survey published by Nature, 70% of scientists across various disciplines were not able to reproduce the results of other studies, and more than 50% even failed to reproduce their own previously published results. This is obviously a big challenge and needs to be taken into consideration. More information about the reproducibility challenge can be found in these blog posts (here and here).

Instability of the Model

Model stability is also an important challenge. Models can be very efficient and accurate under specific random conditions (e.g., a particular random seed or data sample) and may fail under different conditions. This problem is closely related to the reproducibility challenge. Using different random seeds and developing model ensembles can provide a solution to this problem. The solution also depends on the inherent nature of the ML algorithm. For example, this blog post nicely explains the issue of stability and its solution in K-Means clustering.
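A quick way to get a feel for this kind of seed sensitivity is simply to refit the same model under several seeds and compare the outcomes. The sketch below does this for K-Means on synthetic data; the dataset, cluster count, and number of seeds are all illustrative choices.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, deliberately overlapping clusters.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=2.5, random_state=0)

# Refit with a single initialization per seed; a wide spread in inertia
# signals that the solution depends strongly on the random start.
inertias = [KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X).inertia_
            for seed in range(10)]
print(f"inertia across seeds: min={min(inertias):.1f}, max={max(inertias):.1f}")

In practice, scikit-learn's n_init parameter already reruns K-Means from multiple starts and keeps the best solution, which is one form of the ensemble idea mentioned above.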

Simpson’s Paradox

Simpson’s Paradox describes a situation in which an analysis reaches one conclusion when each group is investigated separately and a different (often opposite) conclusion when the groups are combined. The following figure shows an example of this situation for a simple linear regression. The implications of this phenomenon should be taken into account when we decide on labels and when we interpret our results.
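The sketch below generates a synthetic version of the situation in the figure: within each of two groups the fitted slope is positive, while the slope fitted to the pooled data is negative. All of the numbers are made up purely for illustration.

import numpy as np

rng = np.random.default_rng(42)
slopes, x_all, y_all = {}, [], []

# Within each group, y increases with x; the group offsets reverse the pooled trend.
for offset, label in [(0.0, "group A"), (4.0, "group B")]:
    x = rng.uniform(0, 3, 50) + offset
    y = (x - offset) - 2.0 * offset + rng.normal(0, 0.3, 50)
    slopes[label] = np.polyfit(x, y, 1)[0]
    x_all.append(x)
    y_all.append(y)

slopes["pooled"] = np.polyfit(np.concatenate(x_all), np.concatenate(y_all), 1)[0]
print(slopes)   # per-group slopes are roughly +1, while the pooled slope is negative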

Correlation Bias

It is important to make sure that there is no strong correlation between different features, since such collinearity can lead to unstable and unreliable ML models. This is the same issue as the widely discussed multicollinearity problem in regression. However, removing feature columns needs to be done carefully in order to avoid a significant loss of information.
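A simple first check is to scan the pairwise correlation matrix of the features and flag highly correlated pairs before deciding what to drop or combine. The sketch below does this on made-up hydrologic-looking features, and the 0.8 threshold is just an illustrative choice.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
precip = rng.gamma(2.0, 1.0, 500)
features = pd.DataFrame({
    "precip": precip,
    "wetness_index": 0.9 * precip + rng.normal(0, 0.1, 500),  # nearly a duplicate of precip
    "temperature": rng.normal(15, 5, 500),
    "wind_speed": rng.gamma(2.0, 2.0, 500),
})

corr = features.corr().abs()
# Flag feature pairs whose absolute correlation exceeds the chosen threshold.
flagged = [(a, b, round(corr.loc[a, b], 2))
           for i, a in enumerate(corr.columns)
           for b in corr.columns[i + 1:]
           if corr.loc[a, b] > 0.8]
print(flagged)   # expect ('precip', 'wetness_index') to be flagged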

Other Statistical and ML Biases

There are various ways that machine learning studies and statistical analysis studies can be biased and misleading, and P-hacking is one of the widely discussed pathways of that in hypothesis testing. P-hacking basically means that the researcher changes the definition of the problem or the significance threshold to prove a positive relationship. However, when reporting the results of the analysis, only the successful experiment gets reported to the scientific community.

Another issue occurs when a researcher works with a big dataset that has many features. The researcher keeps exploring different parts of the dataset looking for a statistically significant positive relationship; after enough digging, one can always be found. However, when the work is reported in scientific publications, only the significant positive results get reported, not the number of attempts it took to find them. This issue is often referred to as the multiple comparisons problem or the look-elsewhere effect.

There are many studies that show that p-hacking is a common problem and has significantly biased many scientific findings, but other studies exist that argue that, while p-hacking is common in scientific studies, it does not significantly change the big picture of what we eventually learn from scientific literature.

We should also always keep in mind that ML, and data mining analyses in general, are inherently exploratory, not confirmatory. This means that ML studies can only generate a hypothesis, while separate statistical testing is necessary to confirm it (more on this can be found here and here). This can also be seen as another aspect of the reproducibility problem in ML. Machine learning methods do not provide insights into causal relationships. Therefore, ML-driven methods can give us high prediction accuracy under a specific condition, but their accuracy might drop significantly under a different condition, and we need to pay attention to this issue.

Performance Metrics

It is crucial to take into account different performance aspects of an ML model, and, to do that, we often need to consider several metrics (refer to this blog post for more information). As I discussed in this previous blog post, it is also important to think carefully about which metrics are really desirable for the problem at hand. For example, the decision to focus on specificity vs. sensitivity should be determined by the nature of the problem. The multiple-metrics issue has been widely discussed in the hydrology literature, yet it is still ignored in many water-related studies.
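The toy example below shows why a single metric can mislead for rare events: accuracy looks respectable while sensitivity, the metric that matters for catching the rare class, is poor. The labels are invented for illustration.

from sklearn.metrics import confusion_matrix

# Invented binary labels for a rare-event problem (1 = event, e.g., a drought week).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # fraction of true events that were caught
specificity = tn / (tn + fp)   # fraction of non-events correctly left alone
print(f"accuracy={accuracy:.2f}, sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
# accuracy=0.80, sensitivity=0.50, specificity=0.88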

3- Results Interpretation and Extraction of Actionable Insights

Overinterpretation and Generalization Bias

ML models are trained and tested on a specific dataset. When they are used for prediction in real-world applications, they may or may not be as reliable. We should therefore be careful about how far we extrapolate the fidelity of our model to new datasets.

Confirmation Bias

This bias takes various forms. For example, we are subconsciously inclined to find, in the final results of our analysis, what we already believe to be true. We can also unknowingly lean toward results that are shinier and more groundbreaking. In addition, we might want to make our results more interesting for our stakeholders, clients, and the organizations we work for. These are some of the biases that we need to avoid in any type of research, and certainly in ML projects.

Interpretability

ML models can benefit from powerful and sophisticated algorithms that are built on comprehensive lists of features. However, that does not guarantee a decent level of interpretability. In fact, it might make the model and its results harder to understand. I won’t get into the details here, but there are many approaches that have been used to control model complexity and improve interpretability (here and here).

Publication Bias

Scientific journals and media outlets are less interested in publishing results that show non-significant causal relationships. In other words, research efforts that find negative results are less likely to get published. This phenomenon affects academic culture and creates pressure to find and submit positive results, and it also leads to the publication of positive results that have been found by chance. Publication bias has been recognized and talked about for a long time. The issue was first introduced by Theodore Sterling (1959), but it continues to harm the products of academia today.

There is another type of bias that frequently happens in publications: a tendency to show that the model developed in the study outperforms popular state-of-the-art algorithms. The issue is that the models developed by the authors often benefit from careful fine-tuning of hyperparameters, while a generic, untuned version of the popular model is used in the comparison. This creates an unfair comparison. Refer to this blog post and this publication for more on this type of bias.

Also, ML models can yield a very good response under a specific condition (e.g., a random seed) and fall apart under a different condition (e.g., a different random seed). For example, this problem frequently happens in unsupervised clustering. Running the model with multiple random seeds and starting points is one of the ways to avoid this.

How Can We Address These Limitations?

As discussed in this blog post, ML studies can be negatively affected by various dimensions of these problems, but there are ways to respond to them. For example, biases in data collection can be improved through promoting diversity in research teams (more on this here, and here) or paying more attention to the design of the experiment. Careful definition of the problem and scope of the study also seems to be critical. It would be idealistic, but a cultural shift in the academic world that results in giving more credit to studies that find negative results would improve some of these problems. Finally, effectively using multiple performance metrics, synthetic data, and visual analytics should also benefit ML models.