Do The (Schaake) Shuffle

This post is an introduction to the Schaake Shuffle, a method that can be used to reconstruct space-time variability in forecasted and synthetic variables. The Schaake Shuffle was originally introduced in a synthetic weather generation post by Julie Quinn almost 5 years ago. Lately, the importance (and difficulty) of reproducing spatial and temporal variability in forecasts and synthetically generated variables across multiple correlated sites has been a prominent topic in our group. The goal of this post is simply to bump this topic back into discussion and to make readers aware of its existence as a nifty post-generation way to build spatial and temporal variability back into synthetically generated data or forecasts. In the fundamental paper that establishes the method, Clark et al., 2004, the authors apply the method to forecasts of precipitation and temperature. For weather variables such as temperature and precipitation, it is common to create forecasts for individual stations from Numerical Weather Prediction (NWP) model output, which serves as the predictor set in regression models used to generate the forecasts. The problem with these approaches is that neither spatial correlation between stations nor temporal persistence is preserved, both of which are very important for hydrologic applications with memory.

The Schaake Shuffle is a method that reorders ensemble forecasts of precipitation and temperature to better reconstruct the space-time variability using a rank-ordering approach constructed from a historical record. The basic steps are as follows:

  1. Gather appropriate data: The NWP model outputs forecasts of accumulated precipitation, air temperature, relative humidity at 700 hPa, wind speed, total column precipitable water, and mean sea level pressure, which are used as predictors in the forecast equations. Further, the authors acquire historical precipitation and temperature data for stations within four basins across the United States.
  2. Create Forecasts: The next step involves creating the precipitation and temperature ensemble forecasts. Multiple linear regression is used to develop model output statistics (MOS) equations. The forecast variables taken from the NWP model are filtered down to keep only the variables that explain the most variance in the response variable (here, the response variables are precipitation, minimum temperature, and maximum temperature). A separate regression equation is fit for each variable, station, and forecast lead time. The residuals of the regression equation are modeled stochastically to generate an ensemble of forecasts (see the short sketch after this list). Alternatively, one can apply the Schaake Shuffle to synthetically generated ensembles (not limited to forecasts).
  3. Reorder Forecasts: The reordering method can best be described by an example. For a given time, assume you have an ensemble of possible forecasts that you align in a 3D matrix: Xi,j,k where i=ensemble member, j=station, and k=variable of interest (precipitation or temperature). From the historical record, you must construct an equally sized matrix Yi,j,k which contains historical station observations for the same date. For this matrix, i=an index of dates in the historical time period, j=station, and k=variable of interest (precipitation or temperature).
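As a concrete (and purely illustrative) sketch of step 2, the snippet below fits a regression of an observed variable on made-up NWP-style predictors and resamples the regression residuals to build a small ensemble at a single station. The predictor names and numbers are hypothetical, not the MOS equations of Clark et al., 2004.

#Toy MOS-style ensemble: fit a regression on NWP-style predictors and
#resample its residuals to generate a 10-member ensemble (illustrative only)
set.seed(1)
nwp=data.frame(precip_fcst=runif(30), pwat=runif(30))    #hypothetical predictors
obs=2 + 3*nwp$precip_fcst + rnorm(30, sd=0.5)            #observed response
fit=lm(obs ~ precip_fcst + pwat, data=nwp)

new_day=data.frame(precip_fcst=0.6, pwat=0.4)             #predictors for the forecast day
det_fcst=predict(fit, new_day)                            #deterministic MOS forecast
ens=det_fcst + sample(residuals(fit), 10, replace=TRUE)   #10-member ensemble
ens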

Using the notation from step 3, the authors create a toy example to demonstrate the process. For some time t, imagine we have a 10-member ensemble forecast of maximum temperature for a given date and station.

Let X be the 10-member ensemble under consideration.

X=[15.3,11.2,8.8,11.9,7.5,9.7,8.3,12.5,10.3,10.1]

We can sort the vector X to create χ

χ=[7.5,8.3,8.8,9.7,10.1,10.3,11.2,11.9,12.5,15.3].

Then we go to the historical record and choose 10 dates that reside in a 7-day window around the date that is being forecasted. This is our Y vector.

Y=[10.7,9.3,6.8,11.3,12.2,13.6,8.9,9.9,11.8,12.9]

We can sort this vector to create γ.

γ=[6.8,8.9,9.3,9.9,10.7,11.3,11.8,12.2,12.9,13.6]

We also create a vector B, which records, for each entry of the sorted historical vector γ, its position in the unsorted vector Y.

B=[3,7,2,8,1,4,9,5,10,6]

The key now is to reorder the ensemble forecast according to the B vector. The index 1 appears in position 5 of B, so the first member of the shuffled ensemble takes the 5th value of χ (10.1). The index 2 appears in position 3 of B, so the second member takes the 3rd value of χ (8.8). We continue in this way until we have

Xss=[10.1, 8.8, 7.5, 10.3, 11.9, 15.3, 8.3, 9.7, 11.2, 12.5]
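For readers who want to reproduce this toy example, here is a minimal base-R sketch of the single-station, single-variable reordering (a direct translation of the steps above, not the authors' pseudocode):

#Ensemble forecast X and historical analogs Y from the toy example
X=c(15.3,11.2,8.8,11.9,7.5,9.7,8.3,12.5,10.3,10.1)
Y=c(10.7,9.3,6.8,11.3,12.2,13.6,8.9,9.9,11.8,12.9)

chi=sort(X)        #sorted ensemble
Xss=chi[rank(Y)]   #the shuffled ensemble inherits the rank structure of Y
Xss
#[1] 10.1  8.8  7.5 10.3 11.9 15.3  8.3  9.7 11.2 12.5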

These are the basic fundamentals of the reordering algorithm, and it can be extended to forecasting at multiple stations, as demonstrated in the figure below. Table A shows a 10-member ensemble forecast for January 14th, 2004, ranked from lowest to highest value for three stations. Table B shows the historical record, with the black and light gray ellipses representing the 1st and 2nd ensemble members, respectively. Table C shows the sorted historical record and where the selected historical observations lie in the sorted list. Finally, Table A can be reordered accordingly to form Table D.

Schaake Shuffle for a 10-member ensemble and 3 stations (Figure 2 from Clark et al., 2004)

It’s important to remember that the Schaake Shuffle is only meant to capture the Spearman rank correlation structure of the observations, not to reconstruct the actual (Pearson) correlations. The results from the paper, however, are quite remarkable and show how well the method captures spatial and temporal properties. The figure below shows an example of how the method preserves spatial correlation between two selected stations. The top set of figures shows raw ensemble output, while the bottom figures show results after the ensemble is reordered. The black lines denote the target observed correlation. Clearly, the reordered output approximates the observed correlation across lead times better than the raw ensemble output.

Preservation of spatial correlation for raw (top) and reordered (bottom) forecast ensembles (Figure 6 from Clark et al., 2004)

One basic limitation of this approach is the assumption of stationarity, i.e., that the structure in the historical record remains applicable to the forecasted data. While other methods exist that can potentially preserve space-time variability well, the advantage of the Schaake Shuffle is the ability to reconstruct these patterns after the fact, as a post-processing step. If readers are interested in implementing the Schaake Shuffle, basic pseudocode is included at the end of the paper, and there is also an R package (depPPR) that automates the reordering process. The steps to download the package and run the algorithm are shown below. Note that this code only works for the single-station case. Each column of the X matrix is an ensemble member and the rows correspond to the days in the forecast. Implementing the example in Figure 2 for one station requires X and Y to each be a single row vector. Of course, one can manually extend this process to multiple stations.

#Install and load the depPPR package from GitHub
install.packages("devtools")
devtools::install_github("katerobsau/depPPR")
library(depPPR)

#Read in the forecast ensemble (X) and the historical record (Y) as matrices
forecast_example=as.matrix(read.delim("C:/Users/Rohini/Documents/synthetic.txt",header=FALSE))
climate_example=as.matrix(read.delim("C:/Users/Rohini/Documents/historical.txt",header=FALSE))

#Reorder the forecast ensemble using the rank structure of the historical record
schaake_shuffle(X = forecast_example, Y = climate_example)
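If you just want to experiment without input files, the single-station toy example from above can be passed in directly as 1 x 10 matrices (a sketch, assuming the function accepts the row-per-forecast-day orientation described above):

forecast_example=matrix(c(15.3,11.2,8.8,11.9,7.5,9.7,8.3,12.5,10.3,10.1),nrow=1)
climate_example=matrix(c(10.7,9.3,6.8,11.3,12.2,13.6,8.9,9.9,11.8,12.9),nrow=1)

schaake_shuffle(X = forecast_example, Y = climate_example)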

References:

All material is derived from the fundamental paper that introduces the Schaake Shuffle:

Clark, M., Gangopadhyay, S., Hay, L., Rajagopalan, B., & Wilby, R. (2004). The Schaake shuffle: A method for reconstructing space–time variability in forecasted precipitation and temperature fields. Journal of Hydrometeorology, 5(1), 243–262.

Introduction to Convergent Cross Mapping

This post is a follow-on to my previous post on Granger causality. Granger causality has well-known limitations. As previously discussed, the test can only find “predictive causality” and not true causality. Further, a key requirement of Granger causality is separability, meaning the causal variable is independent of the variable that it influences. This separability tends to be characteristic of stochastic and linear systems. For systems in which separability is not satisfied, or in which there are shared driving variables, Granger causality is not applicable. A newer approach has been suggested to try to understand causality in dynamical systems where the effects of causal variables cannot be separated or uncoupled from the variables that they influence.

This approach, called Convergent Cross Mapping (CCM), was first tested in 1991 but was later evolved by Sugihara et al., 2012 for identifying causation in ecological time series. CCM relies on the fundamental assumption that dynamics in the world aren’t purely stochastic and that in some applications there are governing dynamics that can be represented by some underlying manifold, M, shown in Figure 1. If two variables, X and Y, are causally linked, they should share an underlying M as a common attractor manifold. This means that the states of the variables can be used to inform each other. Furthermore, if X is an environmental driver of a population variable Y, information about the states of X can be recovered from Y, but not vice versa. The CCM approach tests how reliably the states of X can be estimated from Y [1].

Figure 1: Figure 2 from Sugihara et al., 2012 demonstrating CCM on shadow manifolds Mx and My

In essence, the approach seeks to create “shadow” manifolds of M, termed Mx and My, which are constructed using lagged information from each time series. At some time t, the nearest neighbors on each manifold are determined. Weights are assigned to each nearest neighbor and Y is estimated using the weighted nearest neighbors from Mx. Then, the correlation between Y and the estimate Y|Mx is computed. The length of the time series is important; a longer time series results in closer nearest neighbors and better predictability [2].
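To make the idea of a shadow manifold more concrete, here is a minimal base-R sketch (my own illustration; the multispatialCCM package used later handles this internally) of building the lagged state-space reconstruction of a single time series, where each row is the lagged state vector [x(t), x(t-tau), ..., x(t-(E-1)tau)]:

#Build a shadow manifold from time series x with embedding dimension E and lag tau
make_shadow_manifold=function(x, E, tau=1) {
  t_idx=((E-1)*tau + 1):length(x)                    #times with a full lag history
  M=sapply(0:(E-1), function(j) x[t_idx - j*tau])    #one column per lag
  colnames(M)=paste0("lag", (0:(E-1))*tau)
  M
}

#Example: reconstruct the manifold of a noisy sine wave with E=3
Mx=make_shadow_manifold(sin(seq(0, 20, by=0.1)) + rnorm(201, sd=0.05), E=3, tau=1)
head(Mx)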

Sugihara et al., 2012 demonstrates situations in which CCM is applicable:

  1. Unidirectional Causality: Species X influences Y, but the reverse is not true.
  2. Transitivity/External Forcing: Two species, X and Y, do not interact, but are driven by a common external variable Z. This type of interaction has been studied extensively in the famous anchovy and sardine example, where the two species tend to have linked dynamics (one peaks when the other is in a trough). Some theories suggest that the species are directly related and influence each other’s dynamics, but research has shown that neither species interacts with the other and rather that the population dynamics are shaped by an external driver (temperature). However, using correlation or applying Granger causality tends to imply a causal link. Using CCM, Sugihara et al., 2012 confirms that there is no cross-map signal between the anchovy and sardine time series, but there is a clear mapping between sea surface temperature and anchovies and sea surface temperature and sardines.

An R package called multispatialCCM is available to implement CCM; the author, Adam Clark, adapts the method to be applicable to ecological time series that are short (fewer than the ~30 sequential observations usually required by the traditional CCM method) and potentially disjoint [3]. The documentation of the package includes a full example applying the method to two synthetic time series, A and B, constructed so that A is causally forced by B but the opposite is not true. In that spirit, I tried out the package with a very simple example, using the same dataset as in the last post. Here I use precipitation and outflow, where precipitation should causally force outflow, but the opposite should not be true. The following code demonstrates this:

library(multispatialCCM)

#Import data
setwd("D:/Documents/")
data=read.delim("Minimal_Train.txt")
Accm=as.vector(data$observed_cms)[1:200]
Bccm=as.vector(data$Precip)[1:200]

#A = observed outflow (Accm) and B = precipitation (Bccm)
#Expectation: A is causally forced by B,
#but the reverse is not true.
#Calculate optimal E
maxE<-5 #Maximum E to test
#Matrix for storing output
Emat<-matrix(nrow=maxE-1, ncol=2); colnames(Emat)<-c("A", "B")

#Loop over potential E values and calculate predictive ability
#of each process for its own dynamics
for(E in 2:maxE) {
  #Uses defaults of looking forward one prediction step (predstep)
  #And using time lag intervals of one time step (tau)
  Emat[E-1,"A"]<-SSR_pred_boot(A=Accm, E=E, predstep=1, tau=1)$rho
  Emat[E-1,"B"]<-SSR_pred_boot(A=Bccm, E=E, predstep=1, tau=1)$rho
}

#Look at plots to find E for each process at which
#predictive ability rho is maximized
matplot(2:maxE, Emat, type="l", col=1:2, lty=1:2,
        xlab="E", ylab="rho", lwd=2)
legend("bottomleft", c("A", "B"), lty=1:2, col=1:2, lwd=2, bty="n")
E_A<-2
E_B<-3
#Check data for nonlinear signal that is not dominated by noise
#Checks whether predictive ability of processes declines with
#increasing time distance
#See manuscript and R code for details
signal_A_out<-SSR_check_signal(A=Accm, E=E_A, tau=1,
                               predsteplist=1:10)
signal_B_out<-SSR_check_signal(A=Bccm, E=E_B, tau=1,
                               predsteplist=1:10)
#Run the CCM test
#E_A and E_B are the embedding dimensions for A and B.
#tau is the length of time steps used (default is 1)
#iterations is the number of bootstrap iterations (default 100)
#Does A "cause" B?
#Note - more bootstrap iterations give more consistent results
CCM_boot_A<-CCM_boot(Accm, Bccm, E_A, tau=1, iterations=100)
# Does B "cause" A?
CCM_boot_B<-CCM_boot(Bccm, Accm, E_B, tau=1, iterations=100)
#Test for significant causal signal
#See R function for details
(CCM_significance_test<-ccmtest(CCM_boot_A,
                                CCM_boot_B))
#Plot results
plotxlimits<-range(c(CCM_boot_A$Lobs, CCM_boot_B$Lobs))
#Plot "A causes B"
plot(CCM_boot_A$Lobs, CCM_boot_A$rho, type="l", col=1, lwd=2,
     xlim=c(plotxlimits[1], plotxlimits[2]), ylim=c(0,1),
     xlab="L", ylab="rho")
#Add +/- 1 standard error
matlines(CCM_boot_A$Lobs,
         cbind(CCM_boot_A$rho-CCM_boot_A$sdevrho,
               CCM_boot_A$rho+CCM_boot_A$sdevrho),
         lty=3, col=1)
#Plot "B causes A"
lines(CCM_boot_B$Lobs, CCM_boot_B$rho, type="l", col=2, lty=2, lwd=2)
#Add +/- 1 standard error
matlines(CCM_boot_B$Lobs,
         cbind(CCM_boot_B$rho-CCM_boot_B$sdevrho,
               CCM_boot_B$rho+CCM_boot_B$sdevrho),
         lty=3, col=2)
legend("topleft",
       c("A causes B", "B causes A"),
       lty=c(1,2), col=c(1,2), lwd=2, bty="n")


First we read in the appropriate columns. I also decided to use just the first 200 observations from my 5-year-long dataset. While longer time series are supposed to lead to better convergence, I found that for my time series (characterized by a lot of natural variability and noise) it was best to limit the sequence to a cleaner portion of the signal.

The next step involves determining an appropriate embedding dimension E for A and B. The embedding dimension is roughly the number of lagged time steps that a single process needs to predict its own dynamics; the Emat matrix in the code above simply stores the predictive skill (rho) for each candidate value of E.

Figure 2: Predictability (rho) as a function of E

When looking at Figure 2, you can identify the value of E at which predictability (rho) is maximized for each process; E_A and E_B are then set accordingly in the code above (2 and 3, respectively). The CCM_boot function runs the actual convergent cross mapping and implements a bootstrapping procedure in which subsets of the time series are used, to make sure that the ordering of the sample is not a defining factor of the outcome and to characterize uncertainty in the prediction. Finally, the results are plotted in Figure 3. There are two main takeaways from this figure: the predictability of A causing B (outflow causing precipitation) is a flat line near 0, while the predictability of B causing A (precipitation causing outflow) is much higher. Furthermore, note how predictability increases with increasing time series length, which allows for better convergence on the system dynamics. Clark et al., 2015 contains many other well-defined ecological examples demonstrating the package.

Figure 3: Predictability as a function of L for A causes B and B causes A

References:

[1] Sugihara, G., May, R., Ye, H., Hsieh, C. H., Deyle, E., Fogarty, M., & Munch, S. (2012). Detecting causality in complex ecosystems. Science, 338(6106), 496–500.

[2] McCracken, J. M., & Weigel, R. S. (2014). Convergent cross-mapping and pairwise asymmetric inference. Physical Review E, 90(6), 062903.

[3] Clark, A. T., Ye, H., Isbell, F., Deyle, E. R., Cowles, J., Tilman, G. D., & Sugihara, G. (2015). Spatial convergent cross mapping to detect causal relationships from short time series. Ecology, 96(5), 1174–1181.

Special thanks to @ecoquant who suggested to look into this topic!

Recommended Software for the Kasprzyk Group

In this post, I provide a list of recommended software for multi-objective optimization research and a bit of context about each item. This is an update of two posts (for Windows users and Mac-users, respectively) that Joe made several years ago and is intended for Windows users. Although this list is catered toward members of the Kasprzyk Group at the University of Colorado Boulder (CU), it should be relevant to most readers of this blog.

Please feel free to make comments and give additional suggestions!

Text Editors

Text editors are a great way to view data (e.g., csv or space-delimited data) or review some code. If you are looking to run code in an interactive manner, you’ll want to get an integrated development environment (IDE) that corresponds to the programming language you are working with. I’ll get into that more in the “Programming Languages” section. If you have a Windows machine, you are probably used to opening things with the default text editor, Notepad. Notepad is the worst.

Notepad++

Notepad++ is infinitely better! The formatting is great, it has a bunch of plug-ins you can download, it improves readability, and it runs quickly. Notepad++ is my preferred text editor if you just want to look at a couple of files. If you are looking to manage a larger project with several files and multiple directories, Atom is the way to go.

Atom

Atom is extremely powerful and customizable. It is made with software developers in mind and is basically the modern version of Notepad++ (which has been around for a while). It integrates easily with Git and GitHub (see “Version Control and Online Repositories” for an explanation of these tools), it has an extensive library of packages–similar to Notepad++ plug-ins–and it’s growing constantly. Atom also blurs the line between text editor and IDE, because with the Atom-IDE packages it can have IDE-like functionality.

Programming Languages

You never know what languages you might work with in your research, but the main languages we use are Python, R, and C/C++. If Joe asks your preference, tell him that Python >> R and C/C++ >> R. Although if you catch him at a moment of weakness, he might admit that R can do some stats things, I guess.

Python

If you are working in Python (download), it will likely depend on the project whether you are working in 2.7 or 3.X (the X being whatever version they are on). A lot of scientific computing research still uses Python 2.7, but more people are transitioning to Python 3. In fact, Python 2.7 will stop being maintained at midnight on January 1, 2020. So remember to pour one out for your homie, Python 2.7, on New Year’s 2020. Although there are plenty of other Python IDEs (e.g., Spyder, Rodeo), we generally use PyCharm Community (download) in the group. For ease of installing packages, download Anaconda (download), which installs Python and over 150 scientific packages automatically.

R

If you are working in R (download), most everyone uses RStudio (download the Open Source License version) as an IDE. You will also want to download Rtools, which is helpful to have installed for building packages that require it. Generally, these packages require command line tools and compiling code written in languages other than R.

C/C++

Although most projects are focused on Python and R, we have some work in C and C++ as well. These closely related languages, unlike Python and R, need to be compiled before you can run the code. To do this, you’ll need to download a compiler. We generally recommend installing MinGW (download). This will allow you to compile programs that will work across different platforms (e.g., Windows, Linux) while working on a Windows machine. Pro tip: if you have Rtools installed, it comes with MinGW, so you don’t need to download it separately! If you want to compile things for POSIX application deployment on Windows, you’ll want Cygwin. If you have no idea what that means, that’s okay. It’s my belief that you shouldn’t download Cygwin unless you need it, because it takes up a lot of space on your system. But if you don’t care about storage, then go ahead! For most cases I run into, MinGW does the job.

Why do we even care about compiling things across platforms? Well, you may be running code on a Windows machine, but if you are doing any work with the supercomputer, you will also need to run that code in a Linux environment. So you want to make sure your code will work on both platforms.

Although it isn’t perfect, CodeLite (download) is a nice (and free) IDE for C/C++ work.

Version Control and Online Repositories

Version control is the best way to manage your code. It allows you to track your changes, make notes, and even revert to older versions of your code.

Git

Git (download) is the most popular way to implement version control, but other methods do exist. Learning Git takes a bit of time, but it is an essential skill that will pay dividends in the future. Most IDEs interface with Git, and even some text editors do too (I’m looking at you, Atom).

GitHub

GitHub is where people post their code and data (bundled into so-called repositories) to share with the world. It is an amazing collaborative environment, and since we do computational research, it is best practice to create a GitHub repository for the code related to papers that you publish. Soon it will be a requirement in most journals!

Both Git and GitHub have stellar documentation and tutorials about how to get started, so you will have lots of support when you start learning the ropes.

Command Line Interfaces, Supercomputing, and File Transfer

If you haven’t used a command line interface (CLI) before, you will definitely learn in this line of work. CLIs can be useful for interacting with Git (I prefer it for most version control tasks), installing open source software, and tasks like copying/moving files or creating/moving directories on your local machine. My apologies in advance if I butcher the language in this section related to Linux, shell, bash, etc!

Git Bash

For these tasks, you can use the Command Prompt, which is a default program in Windows. But more likely, you’ll want to use a CLI that accepts Linux commands (that way you can use the same commands for your local CLI–which runs Windows–and for interacting with the supercomputer–which runs Linux). I like using Git Bash since it gets installed automatically when you install Git. You can also use Cygwin, but like I said before, this is generally overkill if you just want to run Linux commands and already have Git Bash or something else installed. I’m sure there are many other CLIs to choose from as well that I’m not aware of.

While it’s a nifty skill to know how to use the command line while working on your local computer, it is absolutely essential when working on the supercomputer (i.e., cloud or remote computing). Although I’ve seen some graphical interfaces and interactive environments for using cloud computing resources, it is far more common to perform tasks from the command line. You can even connect to the supercomputer from a CLI using the ‘ssh’ command.

MobaXterm

To make your life a bit easier though when connecting to remote computing resources, you can download MobaXterm or Putty and WinSCP. As David explains in his post, MobaXterm is probably the best way to go. It does the job of both Putty (connecting to remote computing) and WinSCP (moving files between your local computer and a remote resource).

If you’re doing any work on the supercomputer at CU, check out Research Computing for tutorials and other details about our supercomputers.

Cloud Storage

The University of Colorado has unlimited Google Drive storage which is linked to your CU Gmail account; therefore, it is the cloud storage of choice for the group. Google Drive File Stream allows you to access files stored on your Drive from your local computer without having to download your whole Drive. This means it won’t take up a ton of storage, but it will ‘feel’ like the files have been downloaded onto your computer (as long as you are connected to the internet). If you know you won’t be connected to the internet, you can easily download certain files/folders or even your whole Drive if you would like.

The supercomputer can serve as cloud storage; however, it is best to keep those files backed up locally, if possible. I’ve heard too many horror stories about people storing important data on the supercomputer and it getting erased! Although this can be avoided by storing things in the right place, you might sleep better if you’ve got another copy.

Multi-objective Optimization and Visualization

Borg

You’ve probably heard of the Borg Multi-objective Evolutionary Algorithm (MOEA). If not, you will soon! There’s no direct download link for Borg, but you can fill out a form on its website to request the source code.

DiscoveryDV

Once you’ve performed an optimization, you will want to visualize the results. You can do this in your favorite programming language, but it is often difficult to interact with the data that way. For an interactive visualization experience, we generally use DiscoveryDV. Just like Borg, DiscoveryDV is not available for download directly, but you can request it on their website.

Open Source Projects

Additionally, here are a few open source projects related to multi-objective optimization, robust decision making, and visualization that may be useful to be familiar with: Project Platypus, OpenMORDM, and Exploratory modeling workbench.

Reference Managers

Reference managers are amazing things. Finding the right one will save you a lot of time and effort in the future. As Jazmin mentions in her post on research workflows, there are a ton to choose from. Check out this comparison table of reference management software, if you want to go down that rabbit hole.

Zotero

Joe’s group uses one called Zotero (download both the standalone and Chrome connector) which is free, easy to use, and integrates well with Microsoft Word.

Graphic Design

There are many ways to create custom figures for papers–PowerPoint is an easy choice because you likely already have Microsoft Office on your computer. However, Adobe Illustrator is much more powerful. Since Illustrator requires a license, ask Joe for more details if you need the software.

Happy downloading!