Do The (Schaake) Shuffle

This post in an introduction to the Schaake Shuffle, a method that can be used to address reconstructing space time variability in forecasted and synthetic variables. The Schaake Shuffle was originally introduced in a synthetic weather generation post by Julie Quinn almost 5 years ago. Lately, the importance (and difficulty) of being able to reproduce spatial and temporal variability in forecasts and synthetically generated variables across multiple correlated sites has been a prominent topic in our group. The goal of this post is to just “bump” this topic back into discussion and to make readers aware of its existence as a nifty post-generation way to build spatial and temporal variability back into synthetically generated data or forecasts. In the fundamental paper that establishes the method, Clark et al., 2004, the authors are looking to apply the method to forecasts of precipitation and temperature. In the case of weather variables such as temperature and precipitation, it is common to create forecasts for individual stations from a Numerical Weather Prediction (NWP) model. These variables serve as predictor variables in regression models that can be used to generate forecasts. The problem with these styles of approaches is that spatial correlation is not preserved between multiple stations nor temporal persistence, which is very important for hydrologic applications with memory.

The Schaake Shuffle is a method that reorders ensemble forecasts of precipitation and temperature to better reconstruct the space-time variability using a rank-ordering approach constructed from a historical record. The basic steps are as follows:

  1. Gather appropriate data: The NWP model outputs forecasts of accumulated precipitation, air temperature, relative humidity at 700 hpa, wind speed, total column precipitable water, and mean sea level pressure which are used as predictors in the forecast equations. Further, the authors acquire historical precipitation and temperature data for stations within four basins across the United States.
  2. Create Forecasts: The next step involves creating the precipitation and temperature ensemble forecasts. A multiple linear regression is used to develop model output statistics (MOS) equations. The forecasted variables that are taken from the NWP model are ultimately filtered down to keep on the variables that explain the highest variance in the response variable (in this example, response variables are precipitation, minimum temperature, maximum temperature). A separate regression equation is fit for each variable, station, and forecast lead time. The residuals of the regression equation are modeled stochastically to generate an ensemble of forecasts. Alternatively, one can apply the Schaake Shuffle to synthetically generated ensembles (not limited to forecasts).
  3. Reorder Forecasts: The reordering method can best be described by an example. For a given time, assume you have an ensemble of possible forecasts that you align in a 3D matrix: Xi,j,k where i=ensemble member, j=station, and k=variable of interest (precipitation or temperature). From the historical record, you must construct an equally sized matrix Yi,j,k which contains historical station observations for the same date. For this matrix, i=an index of dates in the historical time period, j=station, and k=variable of interest (precipitation or temperature).

Using this same notation the authors create a toy example to demonstrate the process. For some time t, imagine we have forecasts of maximum temperature for 10 ensembles, for a given date and station.

Let X be a 10 member ensemble in consideration.

X=[15.3,11.2,8.8,11.9,7.5,9.7,8.3,12.5,10.3,10.1]

We can sort the vector X to create χ

χ=[7.5,8.3,8.8,9.7,10.1,10.3,11.2,11.9,12.5,15.3].

Then we go to the historical record and choose 10 dates that reside in a 7 day window around the date that is being forecasted. This is our Y vector.

Y=[10.7,9.3,6.8,11.3,12.2,13.6,8.9,9.9,11.8,12.9]

We can sort this vector to create γ.

γ=[6.8,8.9,9.3,9.9,10.7,11.3,11.8,12.2,12.9,13.6]

We also create a vector B, which denotes the order of the sorted historical vector with respect to the unsorted vector.

B=[3,7,2,8,1,4,9,5,10,6]

The key is to now to reorder the ensemble forecast in the same order as the B vector. The rank order 1 value is in position 5 of the B vector. Therefore, we take the 5th value from χ (10.1). Then rank order 2 is in position 3. We take the third value from χ (8.8). We continue doing this until we have

Xss=[10.1, 8.8, 7.5, 10.3, 11.9, 15.3, 8.3, 9.7, 11.2, 12.5]

These are the basic fundamentals of the reordering algorithm and it can be extended to involve forecasting at multiple stations, demonstrated in the figure below. Table A shows 10 ensembles for forecasting weather on January 14th, 2004, ranked from lowest to highest value for three stations. Table B shows the historical record and the black and light gray ellipses represent the 1st and 2nd ensemble respectively. Table C shows the sorted historical record and where the selected historical observations lie in the sorted list. Finally Table A can be reordered accordingly to form Table D.

Schaake Shuffle for 10 member ensemble and 3 stations (Figure 2 from Clarke et al., 2004)

It’s important to remember that the Schaake Shuffle is only meant to capture the Spearman rank correlation of observations, but not to reconstruct the actual spearman correlations. The results from the paper, however, are quite remarkable and show how well the method captures spatial and temporal properties. The figure below shows an example of how the method preserves spatial correlation between two selected stations. The top set of figures show raw ensemble output while the bottom figures show results after the ensemble is reordered. The black lines denote the target observed correlation. Clearly, the reordered output approximates the observed correlation across lead times better than the raw ensemble output.

Preservation of spatial correlation for raw (top) and reordered (bottom) forecast ensembles (Figure 6 from Clarke et al, 2004)

One basic limitation of this approach is the assumption of stationarity and that the structure in the historical record will be applicable to the forecasted data. While other methods exist which can potentially preserve space-time variability well, the advantage of the Schaake Shuffle is the ability to reconstruct these patterns after the fact, as a post-processing step. If readers are interested in implementing the Schaake Shuffle, basic pseudocode is included at the end of the paper but there are also R packages that can automate the reordering process here. The steps to download the package and run the algorithm are denoted here. Note that this code only works for a single-station case. Here each column in the X vector will be an ensemble and the rows correspond to the number of days in the forecast. Implementing the example in Figure 2 for one station will requires X and Y to be a single row vector. Of course, one can manually extend this process to multiple stations.

install.packages("devtools")

devtools::install_github("katerobsau/depPPR")

library(depPPR)

forecast_example=as.matrix(read.delim("C:/Users/Rohini/Documents/synthetic.txt",header=FALSE))
climate_example=as.matrix(read.delim("C:/Users/Rohini/Documents/historical.txt",header=FALSE))

schaake_shuffle(X = forecast_example, Y = climate_example)

References:

All material is derived from the fundamental paper that introduces the Shaake Shuffle:

Clark, M., Gangopadhyay, S., Hay, L., Rajagopalan, B., & Wilby, R. (2004). The Schaake shuffle: A method for reconstructing space–time variability in forecasted precipitation and temperature fields. Journal of Hydrometeorology5(1), 243-262.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s