Do The (Schaake) Shuffle

This post is an introduction to the Schaake Shuffle, a method that can be used to reconstruct space-time variability in forecasted and synthetically generated variables. The Schaake Shuffle was first introduced on this blog in a synthetic weather generation post by Julie Quinn almost 5 years ago. Lately, the importance (and difficulty) of reproducing spatial and temporal variability in forecasts and synthetically generated variables across multiple correlated sites has been a prominent topic in our group. The goal of this post is to "bump" this topic back into discussion and to make readers aware of its existence as a nifty post-generation way to build spatial and temporal variability back into synthetically generated data or forecasts. In the fundamental paper that establishes the method, Clark et al., 2004, the authors apply the method to forecasts of precipitation and temperature. For weather variables such as temperature and precipitation, it is common to create forecasts for individual stations from a Numerical Weather Prediction (NWP) model, whose outputs serve as predictor variables in regression models used to generate the forecasts. The problem with these approaches is that neither spatial correlation between stations nor temporal persistence is preserved, both of which are very important for hydrologic applications with memory.

The Schaake Shuffle is a method that reorders ensemble forecasts of precipitation and temperature to better reconstruct the space-time variability using a rank-ordering approach constructed from a historical record. The basic steps are as follows:

  1. Gather appropriate data: The NWP model outputs forecasts of accumulated precipitation, air temperature, relative humidity at 700 hPa, wind speed, total column precipitable water, and mean sea level pressure, which are used as predictors in the forecast equations. Further, the authors acquire historical precipitation and temperature data for stations within four basins across the United States.
  2. Create Forecasts: The next step involves creating the precipitation and temperature ensemble forecasts. A multiple linear regression is used to develop model output statistics (MOS) equations. The forecasted variables that are taken from the NWP model are ultimately filtered down to keep only the variables that explain the highest variance in the response variable (in this example, the response variables are precipitation, minimum temperature, and maximum temperature). A separate regression equation is fit for each variable, station, and forecast lead time. The residuals of the regression equation are modeled stochastically to generate an ensemble of forecasts. Alternatively, one can apply the Schaake Shuffle to synthetically generated ensembles (not limited to forecasts).
  3. Reorder Forecasts: The reordering method can best be described by an example. For a given time, assume you have an ensemble of possible forecasts that you align in a 3D matrix: Xi,j,k where i=ensemble member, j=station, and k=variable of interest (precipitation or temperature). From the historical record, you must construct an equally sized matrix Yi,j,k which contains historical station observations for the same date. For this matrix, i=an index of dates in the historical time period, j=station, and k=variable of interest (precipitation or temperature).

Using this same notation, the authors create a toy example to demonstrate the process. For some time t, imagine we have a 10-member ensemble forecast of maximum temperature for a given date and station.

Let X be the 10-member ensemble under consideration.

X=[15.3,11.2,8.8,11.9,7.5,9.7,8.3,12.5,10.3,10.1]

We can sort the vector X to create χ

χ=[7.5,8.3,8.8,9.7,10.1,10.3,11.2,11.9,12.5,15.3].

Then we go to the historical record and choose 10 dates that reside in a 7 day window around the date that is being forecasted. This is our Y vector.

Y=[10.7,9.3,6.8,11.3,12.2,13.6,8.9,9.9,11.8,12.9]

We can sort this vector to create γ.

γ=[6.8,8.9,9.3,9.9,10.7,11.3,11.8,12.2,12.9,13.6]

We also create a vector B, which records, for each value of the sorted historical vector γ, its position in the original unsorted vector Y.

B=[3,7,2,8,1,4,9,5,10,6]

The key is now to reorder the ensemble forecast according to the B vector. The rank order 1 value is in position 5 of the B vector, so we take the 5th value from χ (10.1). Rank order 2 is in position 3, so we take the third value from χ (8.8). We continue in this fashion until we have

Xss=[10.1, 8.8, 7.5, 10.3, 11.9, 15.3, 8.3, 9.7, 11.2, 12.5]
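
To make the mechanics concrete, here is a minimal base-R sketch of this single-site reordering using the toy vectors above (this is my own illustration, not the authors' code):

#Toy ensemble forecast and historical observations from the example above
X=c(15.3,11.2,8.8,11.9,7.5,9.7,8.3,12.5,10.3,10.1)
Y=c(10.7,9.3,6.8,11.3,12.2,13.6,8.9,9.9,11.8,12.9)

chi=sort(X)             #sorted ensemble (chi)
B=order(Y)              #position in Y of each value of the sorted historical vector (gamma)
Xss=numeric(length(X))
Xss[B]=chi              #impose the observed rank structure on the sorted ensemble
Xss
#[1] 10.1  8.8  7.5 10.3 11.9 15.3  8.3  9.7 11.2 12.5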

These are the basic fundamentals of the reordering algorithm, and it can be extended to forecasting at multiple stations, as demonstrated in the figure below. Table A shows a 10-member ensemble forecast for January 14th, 2004, ranked from lowest to highest value for three stations. Table B shows the historical record, and the black and light gray ellipses represent the 1st and 2nd ensemble members, respectively. Table C shows the sorted historical record and where the selected historical observations lie in the sorted list. Finally, Table A can be reordered accordingly to form Table D.

Schaake Shuffle for a 10-member ensemble and 3 stations (Figure 2 from Clark et al., 2004)

It’s important to remember that the Schaake Shuffle is only meant to capture the Spearman rank correlation structure of the observations, not the actual (Pearson) correlations. The results from the paper, however, are quite remarkable and show how well the method captures spatial and temporal properties. The figure below shows an example of how the method preserves spatial correlation between two selected stations. The top set of figures shows raw ensemble output while the bottom figures show results after the ensemble is reordered. The black lines denote the target observed correlation. Clearly, the reordered output approximates the observed correlation across lead times better than the raw ensemble output.

Preservation of spatial correlation for raw (top) and reordered (bottom) forecast ensembles (Figure 6 from Clark et al., 2004)

One basic limitation of this approach is the assumption of stationarity, i.e., that the structure in the historical record will be applicable to the forecasted data. While other methods exist which can potentially preserve space-time variability well, the advantage of the Schaake Shuffle is the ability to reconstruct these patterns after the fact, as a post-processing step. If readers are interested in implementing the Schaake Shuffle, basic pseudocode is included at the end of the paper, but there is also an R package (depPPR) that can automate the reordering process. The steps to download the package and run the algorithm are shown below. Note that this code only works for the single-station case. Here each column of the X matrix is an ensemble member and the rows correspond to the number of days in the forecast. Implementing the example in Figure 2 for one station requires X and Y to each be a single row vector. Of course, one can manually extend this process to multiple stations (see the sketch after the code).

install.packages("devtools")

devtools::install_github("katerobsau/depPPR")

library(depPPR)

#Read in the forecast ensemble (rows = forecast days, columns = ensemble members)
forecast_example=as.matrix(read.delim("C:/Users/Rohini/Documents/synthetic.txt",header=FALSE))
#Read in the historical observations used to impose the observed rank structure
climate_example=as.matrix(read.delim("C:/Users/Rohini/Documents/historical.txt",header=FALSE))

#Apply the Schaake Shuffle to reorder the forecast ensemble
schaake_shuffle(X = forecast_example, Y = climate_example)
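
As mentioned above, one can extend this to multiple stations manually. Below is a hedged base-R sketch of that extension (my own code, not part of depPPR; I assume X_ens and Y_hist are matrices of dimension [ensemble members or historical dates x stations], with the same set of historical dates used at every station so that the observed spatial rank structure is imposed on the ensemble):

shuffle_multisite=function(X_ens, Y_hist){
  X_ss=X_ens
  for(j in seq_len(ncol(X_ens))){
    chi=sort(X_ens[,j])       #sorted ensemble at station j
    B=order(Y_hist[,j])       #rank structure of the shared historical dates at station j
    X_ss[B,j]=chi             #single-site Schaake Shuffle applied column by column
  }
  X_ss
}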

References:

All material is derived from the fundamental paper that introduces the Schaake Shuffle:

Clark, M., Gangopadhyay, S., Hay, L., Rajagopalan, B., & Wilby, R. (2004). The Schaake shuffle: A method for reconstructing space–time variability in forecasted precipitation and temperature fields. Journal of Hydrometeorology, 5(1), 243-262.

Potential Biases and Pitfalls of Machine Learning

Artificial intelligence has already changed the world in many ways. It is increasingly difficult to find a modern tool that does not use AI in one way or another. It can be argued that many of these changes have been positive; however, it can also easily be shown that AI has not benefited everyone equally.

In terms of the application of AI in science, the number of publications on this topic has increased significantly during the last few years. I searched for the words water, machine, and learning on Google Scholar and limited the search to count only the studies with all of these words in their titles. I created the following figure from the results of that simple search. As you can see, the number of publications has grown rapidly in recent years. This is obviously not a thorough study, but it shows how important artificial intelligence has become in academic research.

However, there are many pitfalls that can degrade the usefulness of AI in academic and non-academic applications. Some people even argue that these pitfalls could potentially lead us to a new AI winter that nobody will enjoy. In this blog post, I decided to go over some of these issues and difficulties. Some of the problems have emerged recently, but others have been known by statisticians for decades. For many reasons, they continue to exist and negatively affect artificial intelligence projects.

Any machine learning project usually includes the following components: 1) sample collection and preparation, 2) ML model generation, and 3) result interpretation and extraction of actionable insights. Here, I will talk about some of the mistakes and biases that can happen in each of these steps.

1- Sample Collection and Preparation

The input dataset is a crucial source of bias in any ML project. The problem basically occurs when the dataset is not a good representation of the real world. Here are a few variants of this issue:

Exclusion Bias

This happens when specific groups that exist in the real world are systematically absent in the dataset. For example, a specific part of the population can be absent in the dataset, or the dataset can only contain information about particular countries, climates, etc. Having a diverse group that represents various segments of society and different worldviews can reduce the consequences of this type of bias.

Measurement Bias

There are different reasons that observations can be faulty and unreliable. For example, observations can be sensed and collected through malfunctioning measurement devices that can bias the entire dataset. Also, human reading errors and mistakes can cause measurement errors. Additionally, there are many things that can go wrong during the post-processing of raw measurements and can lead to measurement bias.

Other Sampling Biases

There are also other types of data collection biases that have been discussed in statistical literature for years. One example of these is self-selection bias. Let’s imagine that you are running an online poll and want to analyze the results, but your participants are not fully random. People who choose to participate in your poll might have specific personalities or worldviews and not represent the population at large. Other biases that have been widely discussed in statistical literature include survivor bias, prejudice bias, observer bias, and recall bias. You can find more information about these biases here, here, and here. If possible, these biases should be avoided. Otherwise, their effects should be estimated and removed from the analysis. Also, acquiring more comprehensive datasets can reduce these types of biases.

Imbalanced Datasets

If the data collection process is fair and reliable, an imbalanced dataset is not really a bias: it’s more of a challenge that is inherent to specific problems. Some events just don’t happen frequently. The problems and research questions that focus on rare events need their own appropriate treatments. In a previous blog post, I explained this issue and the ways that we can deal with this problem. You can also refer to this article and this article for information about imbalanced datasets.

Data Leakage

Data leakage is a situation in which we mistakenly share a portion (or the entirety) of the data between the training and validation datasets. Data leakage can happen due to different types of mistakes during data preparation. Timeseries can also implicitly include data leakage: if the training and testing datasets are selected in a way that allows the training dataset to gain information about the held-out period, we have implicitly leaked data. For example, if we extract the testing dataset from the middle of the original timeseries, we basically give the model insight about what happens before and after the testing period, which is basically cheating and inflates the apparent accuracy of the model. One way to avoid this is to make sure that our training data period precedes the validation period.
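
As a minimal sketch of that idea (the daily timeseries here is synthetic and purely for illustration), a chronological split keeps the training period strictly before the validation period instead of sampling test rows from the middle of the record:

#Synthetic daily timeseries for illustration
set.seed(42)
df=data.frame(date=seq.Date(as.Date("2000-01-01"), by="day", length.out=1000),
              flow=cumsum(rnorm(1000)))

#Chronological split: earlier 80% for training, later 20% for validation
cutoff=floor(0.8*nrow(df))
train=df[1:cutoff, ]
test=df[(cutoff+1):nrow(df), ]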

Another type of leakage is called feature-leakage, and it happens when our feature list includes a variable that is either closely correlated with data labels or is a duplicate of the data label. Obviously, we won’t have that feature during the prediction. Therefore, our trained model will be highly biased.

More information about data leakage can be found here and here.

2- ML and Statistical Models

Lack of Physical Understanding of the System

In machine learning, we use statistical relationships to explore how the system would behave under a specific condition. However, ML models do not gain any explicit knowledge about the actual physical laws that govern the system. For example, in hydrology, it is very possible to violate water and energy conservation laws while training and using machine learning-driven models. That said, there are studies (here and here) that have been trying to incorporate these insights into their ML models. This issue is likely to attract more attention in the future.

Reproducibility

There is a general consensus among ML researchers that many ML-based studies are not reproducible, meaning that an independent researcher cannot replicate the ML process and obtain the same results. Many recent studies have shown that ML scientists are often not able to reproduce the results of other studies. While some people refer to this as the reproducibility crisis in ML studies, it is more of a general academic research issue. Based on a 2016 survey published by Nature, 70% of scientists across various disciplines were not able to reproduce the results of other studies, and more than 50% even failed to reproduce their own previously developed results. This is obviously a big challenge and needs to be taken into consideration. More information about the reproducibility challenge can be found in these blog posts (here and here).

Instability of the Model

Model stability is also an important challenge. As discussed earlier, models can be very efficient and accurate under specific random conditions (e.g., random seed or data sample) and will fail under different conditions. This problem is also closely related to the reproducibility challenge. Using different random seeds and developing model ensembles can provide a solution to this problem. The solution also depends on the inherent nature of the ML algorithm. For example, this blog post nicely explains the issue of stability and its solution in K-Means clustering.

Simpson’s Paradox

Simpson’s Paradox describes a situation in which an analysis reaches a different conclusion when each group is investigated separately than when the groups are all combined. The following figure shows an example of this situation under a simple linear regression. The implication of this phenomenon should be taken into account when we make decisions about labels and when we interpret our results.
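
A minimal synthetic illustration in R (the data are made up): within each group the relationship between x and y is negative, but pooling the groups flips the sign of the fitted slope.

set.seed(1)
g=rep(1:3, each=50)                      #three groups
x=rnorm(150, mean=5*g)                   #group means shift upward in x
y=10*g - (x - 5*g) + rnorm(150)          #within each group, y decreases with x
coef(lm(y ~ x))["x"]                     #pooled slope is positive
sapply(1:3, function(k) coef(lm(y[g==k] ~ x[g==k]))[2])   #within-group slopes are negative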

Correlation Bias

Significant correlation between different features can lead to the creation of unstable and unreliable ML models, so it is important to check for it. This is the same issue as the widely discussed multicollinearity problem in regression. However, removing feature columns needs to be done carefully in order to avoid significant loss of information.

Other Statistical and ML Biases

There are various ways that machine learning studies and statistical analysis studies can be biased and misleading, and P-hacking is one of the widely discussed pathways of that in hypothesis testing. P-hacking basically means that the researcher changes the definition of the problem or the significance threshold to prove a positive relationship. However, when reporting the results of the analysis, only the successful experiment gets reported to the scientific community.

Another issue occurs when a researcher works on a big dataset with many features. The researcher keeps exploring different parts of the dataset looking for a statistically significant positive relationship; after enough digging, one can usually be found. However, when reporting the data through scientific publications, only the significant positive results get reported, not the number of times that the researcher attempted to find them. This issue is often referred to as the multiple comparisons problem or the look-elsewhere effect.

There are many studies that show that p-hacking is a common problem and has significantly biased many scientific findings, but other studies exist that argue that, while p-hacking is common in scientific studies, it does not significantly change the big picture of what we eventually learn from scientific literature.

We should also always keep in mind that ML, and data mining analyses in general, are inherently exploratory, not confirmatory. This means that ML studies can only generate a hypothesis, while separate statistical testing is necessary to confirm the theory (more on this can be found here and here). This can also be seen as another aspect of the reproducibility problem in ML. Machine learning methods do not provide insights into causal relationships. Therefore, ML-driven methods can give us high prediction accuracy under a specific condition, but their accuracy might significantly drop under a different condition, and we need to pay attention to this issue.

Performance Metrics

It is crucial to take into account different performance aspects of an ML model, and, to do that, we often need to consider various metrics (refer to this blog post for more information). As I discussed in this previous blog post, it is also important to carefully think about which metrics are really desirable for the problem at hand. For example, the decision about focusing on specificity vs. sensitivity should be determined by examining the nature of our problem. The multiple metrics issue has been widely discussed in past hydrology literature, but it has also been ignored in many water-related studies.

3- Results Interpretation and Extraction of Actionable Insights

Overinterpretation and Generalization Bias

ML models are trained and tested on a specific dataset. When they are used for prediction in real-world applications, they may not be as reliable. We should therefore be careful about how we extrapolate the fidelity of our model to new datasets.

Confirmation Bias

This bias takes various forms. For example, we are subconsciously biased toward what we already think is the truth that will show up in the final results of our analysis. Also, we can unknowingly lean toward the results that are shinier and more groundbreaking. In addition, we might want to make our results more interesting for our stakeholders, clients, and the organizations we work for. These are some of the biases that we need to avoid in any type of research and certainly in ML projects.

Interpretability

ML models can benefit from powerful and sophisticated algorithms that are built based on comprehensive lists of features. However, that does not guarantee a decent level of interpretability. In fact, it might make the model and its results harder to understand. I won’t get into the details of this here, but there are many approaches that have been used to control model complexity and interpretability (here and here).

Publication Bias

Scientific journals and media outlets are less interested in publishing results that show non-significant causal relationships. In other words, research efforts that find negative results are less likely to get published. This phenomenon affects academic culture and creates pressure to find and submit positive results, and it also leads to the publication of positive results that have been found by chance. Publication bias has been recognized and talked about for a long time. The issue was first introduced by Theodore Sterling (1959), but it continues to harm the products of academia today.

There is another type of bias that frequently happens in publications. There is always a tendency toward showing that the model developed in the study outperforms popular state-of-the-art algorithms. The issue is that the models developed by the authors often benefit from careful fine-tuning of hyperparameters, while a generic version of the popular model is used in the comparison. This creates an unfair comparison. Refer to this blog post and this publication for more on this type of bias.

Also, ML models can yield a very good response under a specific condition (e.g., a random seed) and fall apart under a different condition (e.g., a different random seed). For example, this problem frequently happens in unsupervised clustering. Running the model with multiple random seeds and starting points is one of the ways to avoid this.

How Can We Address These Limitations?

As discussed in this blog post, ML studies can be negatively affected by various dimensions of these problems, but there are ways to respond to them. For example, biases in data collection can be improved through promoting diversity in research teams (more on this here, and here) or paying more attention to the design of the experiment. Careful definition of the problem and scope of the study also seems to be critical. It would be idealistic, but a cultural shift in the academic world that results in giving more credit to studies that find negative results would improve some of these problems. Finally, effectively using multiple performance metrics, synthetic data, and visual analytics should also benefit ML models.

Introduction to Convergent Cross Mapping

This post is a follow-on to my previous post on Granger causality. Granger causality has well-known limitations. As previously discussed, the test can only find “predictive causality” and not true causality. Further, a key requirement of Granger causality is separability, meaning the causal variable is independent of the variable that it influences. This separability tends to be characteristic of stochastic and linear systems. For systems in which separability is not satisfied or in which there are shared driving variables, Granger causality is not applicable. A new approach has been suggested to try to understand causality in dynamical systems where the effects of causal variables cannot be separated or uncoupled from the variables that they influence.

This approach, called Convergent Cross Mapping (CCM), was first tested in 1991 but was later developed further by Sugihara et al., 2012 for identifying causation in ecological time series. CCM relies on the fundamental assumption that dynamics in the world aren’t purely stochastic and that in some applications, there are governing dynamics that can be represented by some underlying manifold, M, shown in Figure 1. If two variables X and Y are causally linked, they should share an underlying M as a common attractor manifold. This means that the states of the variables can be used to inform each other. Furthermore, if X is an environmental driver of a population variable Y, information about the states of X can be recovered from Y, but not vice versa. The CCM approach tests how reliably the states of X can be estimated from Y [1].

Figure 1: Figure 2 from Sugihara et al., 2012 demonstrating CCM on shadow manifolds Mx and My

In essence, the approach seeks to create “shadow” manifolds of M, termed Mx and My, which are constructed using lagged information from each time series. At some time t, the nearest neighbors on each manifold are determined. Weights are assigned to each nearest neighbor and Y is estimated using the weighted nearest neighbors from X. Then, the correlation between Y and Y|Mx is computed. The length of the time series is important; a longer time series results in closer nearest neighbors and better predictability [2].

Sugihara et al., 2012 demonstrates situations in which CCM is applicable:

  1. Unidirectional Causality: Species X influences Y, but the reverse is not true.
  2. Transitivity/External Forcing: Two species, X and Y, do not interact, but are driven by a common external variable Z. This type of interaction has been studied extensively in the famous anchovy and sardine example, where the two species tend to have linked dynamics (one peaks when the other is in a trough). Some theories suggest that the species are directly related and influence each other’s dynamics, but research has shown that neither species interacts with the other and rather that the population dynamics are shaped by an external driver (temperature). However, using correlation or applying Granger causality tends to imply a causal link. Using CCM, Sugihara et al., 2012 confirm that there is no cross-map signal between the anchovy and sardine time series, but there is a clear mapping between sea surface temperature and anchovies and between sea surface temperature and sardines.

An R package called multispatialCCM is available to implement CCM, but the author, Adam Clark, adapts the methods to be applicable to ecological time series that are shorter (fewer than the ~30 sequential observations usually required for the traditional CCM method) and potentially disjoint [3]. The documentation of the package includes a full example applying the method to two synthetic time series, A and B, constructed so that A is causally forced by B but the opposite is not true. In that spirit, I tried out the package with a very simple example, using the same dataset from the last post. Here I use precipitation and outflow, where precipitation should causally force outflow, but the opposite should not be true. The following code demonstrates this:

library(multispatialCCM)

#Import data
setwd("D:/Documents/")
data=read.delim("Minimal_Train.txt")
Accm=as.vector(data$observed_cms)[1:200]
Bccm=as.vector(data$Precip)[1:200]

#In this example, A (outflow) should be causally forced by B (precipitation),
#but the reverse is not true.
#Calculate optimal E
maxE<-5 #Maximum E to test
#Matrix for storing output
Emat<-matrix(nrow=maxE-1, ncol=2); colnames(Emat)<-c("A", "B")

#Loop over potential E values and calculate predictive ability
#of each process for its own dynamics
for(E in 2:maxE) {
  #Uses defaults of looking forward one prediction step (predstep)
  #And using time lag intervals of one time step (tau)
  Emat[E-1,"A"]<-SSR_pred_boot(A=Accm, E=E, predstep=1, tau=1)$rho
  Emat[E-1,"B"]<-SSR_pred_boot(A=Bccm, E=E, predstep=1, tau=1)$rho
}

#Look at plots to find E for each process at which
#predictive ability rho is maximized
matplot(2:maxE, Emat, type="l", col=1:2, lty=1:2,
        xlab="E", ylab="rho", lwd=2)
legend("bottomleft", c("A", "B"), lty=1:2, col=1:2, lwd=2, bty="n")
E_A<-2
E_B<-3
#Check data for nonlinear signal that is not dominated by noise
#Checks whether predictive ability of processes declines with
#increasing time distance
#See manuscript and R code for details
signal_A_out<-SSR_check_signal(A=Accm, E=E_A, tau=1,
                               predsteplist=1:10)
signal_B_out<-SSR_check_signal(A=Bccm, E=E_B, tau=1,
                               predsteplist=1:10)
#Run the CCM test
#E_A and E_B are the embedding dimensions for A and B.
#tau is the length of time steps used (default is 1)
#iterations is the number of bootstrap iterations (default 100)
# Does A "cause" B?
#Note - increase iterations to 100 for consistent results

CCM_boot_A<-CCM_boot(Accm, Bccm, E_A, tau=1, iterations=100)
# Does B "cause" A?
CCM_boot_B<-CCM_boot(Bccm, Accm, E_B, tau=1, iterations=100)
#Test for significant causal signal
#See R function for details
(CCM_significance_test<-ccmtest(CCM_boot_A,
                                CCM_boot_B))
#Plot results
plotxlimits<-range(c(CCM_boot_A$Lobs, CCM_boot_B$Lobs))
#Plot "A causes B"
plot(CCM_boot_A$Lobs, CCM_boot_A$rho, type="l", col=1, lwd=2,
     xlim=c(plotxlimits[1], plotxlimits[2]), ylim=c(0,1),
     xlab="L", ylab="rho")
#Add +/- 1 standard error
matlines(CCM_boot_A$Lobs,
         cbind(CCM_boot_A$rho-CCM_boot_A$sdevrho,
               CCM_boot_A$rho+CCM_boot_A$sdevrho),
         lty=3, col=1)
#Plot "B causes A"
lines(CCM_boot_B$Lobs, CCM_boot_B$rho, type="l", col=2, lty=2, lwd=2)
#Add +/- 1 standard error
matlines(CCM_boot_B$Lobs,
         cbind(CCM_boot_B$rho-CCM_boot_B$sdevrho,
               CCM_boot_B$rho+CCM_boot_B$sdevrho),
         lty=3, col=2)
legend("topleft",
       c("A causes B", "B causes A"),
       lty=c(1,2), col=c(1,2), lwd=2, bty="n")


First we read in the appropriate columns. I also decided to use just the first 200 observations from my 5-year-long dataset. While longer time series are supposed to lead to better convergence, I found that for my time series (characterized by a lot of natural variability and noise), it was best to limit the sequence to a cleaner portion of the signal.

The next steps involve determining an appropriate embedding dimension, E, for A and B. The embedding dimension is akin to the number of lagged time steps that a single process needs to predict its own dynamics; the matrix Emat simply stores the predictability obtained for each candidate E.

Figure 2: Predictability (rho) as a function of E

Looking at Figure 2, you can see that rho (predictability) is maximized at E=3 for A and E=2 for B. The CCM_boot function runs the actual Convergent Cross Mapping and implements a bootstrapping method in which subsets of the time series are used to estimate E and the predictability, both to make sure that the ordering of the sample is not a defining factor of the outcome and to address uncertainty in the prediction. Finally, the results are plotted in Figure 3. There are two main takeaways from this figure: the predictability of A causing B (outflow causing precipitation) is a flat line around 0, while the predictability of B causing A (precipitation causing outflow) is much higher. Furthermore, note how the predictability increases with increasing time series length, which allows for better convergence on the system dynamics. Clark et al., 2015 contains many other, more well-defined ecological examples demonstrating the package as well.

Figure 3: Predictability as a function of L for A causes B and B causes A

References:

[1] Sugihara, G., May, R., Ye, H., Hsieh, C. H., Deyle, E., Fogarty, M., & Munch, S. (2012). Detecting causality in complex ecosystems. Science, 338(6106), 496-500.

[2] McCracken, J. M., & Weigel, R. S. (2014). Convergent cross-mapping and pairwise asymmetric inference. Physical Review E, 90(6), 062903.

[3] Clark, A. T., Ye, H., Isbell, F., Deyle, E. R., Cowles, J., Tilman, G. D., & Sugihara, G. (2015). Spatial convergent cross mapping to detect causal relationships from short time series. Ecology, 96(5), 1174-1181.

Special thanks to @ecoquant who suggested to look into this topic!

Introduction to Granger Causality

This post focuses on the question of determining whether one time series can be helpful in forecasting another time series. This question could be answered by investigating whether the two time series have some form of a causal relationship. Many time series models are based on univariate or multivariate autoregressive (AR) models. Often, a time series of another variable is added to a univariate forecasting model if the two variables are correlated. However, correlation does not necessarily imply causality. Causality is a special kind of relationship between variables where one variable causes, influences, or has an effect on the other variable. For example, although the sales of sunglasses and swimsuits are highly correlated, it cannot be concluded that an increase in the sales of sunglasses causes an increase in the sales of swimsuits. Nor can it be assumed that the reverse is true. The relationship between the two sales is likely due to a third time series variable, temperature, which has an effect on both variables. The relationships between a pair of highly correlated variables may also be purely coincidental with no causal association. Furthermore, the correlation between the variables of two time series that are causal may be small. For example, appliance usage per household and outside temperature have a causal relationship because a spike in the outside temperature leads to, after a time lag, a spike in appliance usage. In this case, the correlation between the two variables may actually be quite small. The drawback of using correlation in time series analysis is that the measure does not factor in the inherent time-ordering and lags of the series: the same correlation coefficient would be obtained even if both time series were shuffled in the same random order.

Classic Dilbert Reference

Clive Granger  (Nobel Prize Laureate in Economic Sciences) proposed that causality in economics could be tested by measuring the ability to predict the future values of a time series using prior values of another time series [1].  The causality relationship is based on two principles: the cause happens prior to its effect and the cause has unique information about the future values of its effect. He developed the Granger Causality test (GCT) which is a statistical hypothesis test for determining whether one time series is useful in forecasting another [1]. The test finds only “predictive causality” and not true causality. Therefore, if the prediction of future values of a time series Y(t) improves by including past values of a time series, X(t), X(t) is said to “Granger cause” Y(t). A compact notation for X(t) Granger causes Y(t) is: X(t) G.C. Y(t). Granger causality is (a) unidirectional if X(t) G.C. Y(t) but Y(t) does not G.C. X(t) and (b) bidirectional if X(t) G.C. Y(t) and Y(t) G.C. X(t).  Although GCT was originally developed for time series econometrics, it is now also used in various other fields such as neuroscience to characterize the directionality and influence between the time series neural activity from different neural sources in the brain and in water resources to analyze drivers of groundwater patterns [2-3].

The Granger Causality Test

Let Y(t) be the time series for which the future values have to be predicted and let X(t) be the other time series that will be used to augment Y(t) with the lagged values of X(t). The GCT is based on univariate and multivariate AR models which assume the time series are stationary.  Stationarity can be confirmed by applying the augmented Dickey-Fuller test. If the series are not stationary, they should be transformed into stationary series through operations such as detrending, differencing, or log transforming prior to applying the test.  The GCT uses a series of t-tests and F-tests conducted on the lagged values of Y(t) and X(t). The test is based on the following null hypothesis:

H0: X(t) does not G.C. Y(t)

Step 1: Fit the best (in terms of number of lags) univariate autoregressive model to the time series. The number of lags (model order) can be determined from the PACF or by choosing the model order that minimizes the AIC or BIC criteria. Let the resulting univariate AR model be represented by:

Y(t)=a0+a1Y(t-1)+a2Y(t-2)+…+anY(t-n)+ey(t)

where n is the model order, the ai are the model coefficients, and ey(t) is a white noise term with zero mean and some variance σe2.
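
As an illustrative sketch of the stationarity check and model-order selection described above (the series here is simulated; tseries::adf.test and the base-R ar and pacf functions are one of several possible choices):

library(tseries)                        #for the augmented Dickey-Fuller test

set.seed(1)
y=arima.sim(model=list(ar=c(0.5,0.2)), n=500)   #simulated stationary AR(2) series

adf.test(y)                             #small p-value -> reject non-stationarity
pacf(y)                                 #inspect the PACF for the lag cutoff
ar_fit=ar(y, order.max=10, aic=TRUE)    #or choose the order that minimizes AIC
ar_fit$order                            #selected model order n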

Step 2: Apply the t-test to each predictor coefficient to determine if it is significant. That is, for each coefficient the hypothesis test is:

H0: ai=0

H1: ai ≠ 0

The alternate hypothesis can also be H1: ai > 0 if ai is expected to be positive or H1: ai < 0 if ai is expected to be negative. Any coefficient that is not significant at the 0.05 level is excluded from the model. For example, if the coefficients a2 and a3 are found to be zero, the resulting model will be:

Y(t)=a0+a1Y(t-1)+a4Y(t-4)+…+anY(t-n)+ey(t)

For convenience, it will be assumed that all the coefficients of the model in the equation in Step 1 are significant.

Step 3: Augment the model in Step 1 with the lagged values of the second time series to generate the bivariate AR model given by:

Y(t)=a0+a1Y(t-1)+…+anY(t-n)+b1X(t-1)+…+bnX(t-n)+Ey(t)

where Ey(t) is the new white noise error term with zero mean and variance, σE2.

Step 4:   As in Step 2, apply the t-test now to each bi to determine statistical significance and remove the lagged terms which are not significant.  Next, test the significance of the retained coefficients jointly using the F-test.  Again, for convenience, it will be assumed that all the coefficients are individually significant.  The null hypothesis for the F-test is, therefore,

H0: b1=b2=…=bn=0

and the alternate hypothesis is that any coefficient is non-zero. 

Step 5: If the null hypothesis is rejected (p<0.05), it is concluded that the lags of X(t) help predict future values of Y(t) and, therefore, X(t) G.C. Y(t).

Quantifying Granger Causality

The variances of the prediction errors from the univariate and bivariate models in Steps 1 and 3 can be used to quantify Granger prediction (GP) using the following equation:

GP=ln[σe2/σE2]

GP will be zero if X(t) does not G.C. Y(t), as the univariate and bivariate AR models will have the same errors because the bi coefficients added in Step 3 will all be zero. GP will be greater than zero if X(t) G.C. Y(t), because at least one bi coefficient is non-zero and the bivariate AR model provides a better fit to the data than the univariate AR model. Consequently, the variance of the error terms in the bivariate AR model will be less than the variance of the error terms in the univariate AR model. Higher values of GP indicate stronger Granger causality.
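
As a hedged sketch of how GP could be computed in base R for two series y and x with n lags (this helper function is my own, not from the references):

granger_prediction=function(y, x, n){
  N=length(y)
  #Lag matrices: column k holds the series at lag k (i.e., values at t-1, ..., t-n)
  Ylags=sapply(1:n, function(k) y[(n-k+1):(N-k)])
  Xlags=sapply(1:n, function(k) x[(n-k+1):(N-k)])
  yt=y[(n+1):N]
  uni=lm(yt ~ Ylags)             #univariate AR model (Step 1)
  biv=lm(yt ~ Ylags + Xlags)     #bivariate AR model augmented with lags of x (Step 3)
  log(var(residuals(uni))/var(residuals(biv)))   #GP = ln(sigma_e^2/sigma_E^2)
}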

Implementation in R

The Granger causality test can be conducted very simply in R using the package lmtest. To demonstrate the functionality, we use a real-world dataset to identify drivers of outflow from Owasco Lake. The partial table below shows the predictors (precipitation, temperature, day, both real-time snow measurements and accumulated snow depth, relative humidity, wind speed, solar radiation) and the corresponding outflow from the lake.

Table 1: Predictors of Owasco Lake Outflow

In this example, it may be quite intuitive as to which variables would have the greatest predictive capabilities of flow, but you may encounter cases where the choice is less obvious. A common first step used in machine learning is to calculate the correlation between all predictors and the outflow to try to determine the most important predictors. The table below shows these correlation values.

Var   P      T      SN      SNWD    RH     WS     SR
Cor   0.37   -0.15  0.038   -0.070  0.14   0.074  -0.234

In the absence of any other information, these correlation values are quite low and may lead to the misleading conclusion that none of the variables can help predict outflow. However, we can use the Granger causality test to check this. As an example, let’s look at precipitation as a predictor for outflow. We use the function grangertest, and the only other parameter to specify is how many lags of the precipitation variable to use to predict outflow. Here, an arbitrary value of 10 is chosen.

library(lmtest)
data=read.csv("D:/Documents/OwascoPrediction.csv")
grangertest(observed_cms ~ Precip, order = 10, data = data)

The results are reported as follows:

The function returns the unrestricted model (Model 1) with the lagged precipitation values and the restricted model (Model 2) without the lags. The test comparing the two models yields a p-value of 0.0059, making it significant at the 0.01 level, so the improvement from adding the precipitation lags is very unlikely to be due to chance. This indicates that up to 10 lags of precipitation help to predict outflow; therefore, precipitation G.C. outflow.

As an additional example, while 3 lags of snow do not G.C. outflow, snow with 15 lags is significant at the 0.01 level, indicating that while real-time snow may not immediately cause changes in outflow, snow accumulated over 15 days does have an impact on outflow. This is a really nice way to partition out which variables may have an immediate or lagged effect on an output.

Applications to Machine Learning

As seen above, the Granger causality test can not only help identify important predictors but also provides extra information about the lags of the variables that offer more predictive capability. The GCT is based on univariate and multivariate AR modeling; however, the results can be used to select additional time series that are Granger causal, and the lags of those time series, to augment the input for better predictability in other models such as ANNs and LSTMs.

References

1. Granger, C. W. J. 2001 Essays in Econometrics: The Collected Papers of Clive W.J. Granger. Cambridge: Cambridge University Press.

2. Ding, M., Chen, Y., & Bressler, S.L. 2006 Granger causality: Basic theory and application to neuroscience. In Schelter, S., Winterhalder, N., & Timmer, J. (eds.), Handbook of Time Series Analysis. Wiley, Weinheim.

3. Singh, N.K., Borrok, D.M. A Granger causality analysis of groundwater patterns over a half-century. Sci Rep 9, 12828 (2019). https://doi.org/10.1038/s41598-019-49278-8

Creating Dendrograms in R

A dendrogram is an effective way of visualizing results from hierarchical clustering. The purpose of this post is to show how to make a basic dendrogram in R and illustrate the ways in which one can add colors to dendrogram labels and branches to help identify key clustering drivers. Making dendrograms in R is quite straightforward. However, customizing a dendrogram is not so straightforward, so this post shows some tricks that I learned and should help expedite the process!

First and foremost, your data must be in an appropriate form for hierarchical clustering to be conducted. Table 1 shows an example of how your data can be set up. Four different spatial temperature projections from CMIP5 models are shown, along with various attributes that could be potential driving forces behind clustering: the institution the model comes from, the RCP (radiative forcing scenario) used in the model, and the initial conditions with which the model was run.

Table 1: Model Attributes

At this point, it is helpful to add the model names as the row names (shown in the leftmost column) of your data frame, otherwise the dendrogram function will use the row number as a label on the dendrogram which can make it hard to interpret the clustering results.

Next, create a distance matrix, which will be composed of Euclidean distances between pairs of model projections. This is what clustering will be based on. We first create a new data frame composed of just the temperature values (shown below) by removing columns from the Model Attributes table.

pic2.png

Table 2: Temperature Projections

The following code can be used to create Table 2 from the original table and then the distance matrix.


#Create a new data frame with just temperature values

just_temperature=Model_Attributes[ -c(1:4) ]

#Create a distance matrix

d=dist(just_temperature)

Now, one can make the clustering diagram. Here I chose to use complete linkage clustering as the agglomeration method and wanted my dendrogram to be horizontal.


#Perform clustering

complete_linkage_cluster=as.dendrogram(hclust(d,method="complete"))

#Adjust dimensions of dendrogram so that it fits in plotting window

par(mar=c(3,4,1,15))

plot(complete_linkage_cluster,horiz =TRUE)

And that’s it! Here is the most basic dendrogram.

Plot_1

Figure 1: Dendrogram

Now for customization. You will first need to install the “dendextend” library in R.

We have 11 institutions that the models can come from, and we want to visualize whether institution has some impact on clustering by assigning a color to each label. Here we use the rainbow color palette to assign each institution a color and then replot the dendrogram.


library(dendextend)

#Create a vector of colors with one color for each institution

col=rainbow(max(Model_Attributes$Institution))

#Add colors to the ordered dendrogram
labels_colors(complete_linkage_cluster)= col[Model_Attributes$Institution][order.dendrogram(complete_linkage_cluster)]

#Replot the dendrogram

par(mar=c(3,4,1,15)) #Dendrogram parameters
plot(complete_linkage_cluster,horiz =TRUE)

Plot_2

Figure 2: Dendrogram with Colored Labels

Now suppose we wanted to change the branch colors to show what RCP each model was run with. Here, we assign a color from the rainbow palette to each of the four RCPs and add it to the dendrogram.


col=rainbow(max(Model_Attributes$RCP))

col_branches= col[Model_Attributes$RCP][order.dendrogram(complete_linkage_cluster)]

colored_dendrogram=color_branches(complete_linkage_cluster,col=col_branches)
par(mar=c(3,4,1,15))
plot(colored_dendrogram,horiz =TRUE)

Plot_3

Figure 3: Dendrogram with Colored Labels and Colored Branches

Now finally, we can change the node shapes to reflect the initial condition. There are 10 total initial conditions, so we’re going to use the first 10 standard pch (plot character) elements to represent the individual nodes.


pch=c(1:max(Model_Attributes$Initial_Conditions))
nodes=pch[Model_Attributes$Initial_Conditions][order.dendrogram(complete_linkage_cluster)]
nodePar = list(lab.cex = 0.6, pch = c(NA,19),cex = 0.7, col = "black") #node parameters

dend1 = colored_dendrogram %>% set("leaves_pch", c(nodes))

par(mar=c(3,4,1,15))
plot(dend1,horiz =TRUE)

Plot_4

Figure 4: Dendrogram with Colored Labels, Colored Branches, and Node Shapes

And that’s how you customize a dendrogram in R!

Dealing With Multicollinearity: A Brief Overview and Introduction to Tolerant Methods

This semester I’m taking a Multivariate statistics course taught by Professor Scott Steinschneider in the BEE department at Cornell. I’ve been really enjoying the course thus far and thought I would share some of what we’ve covered in the class with a blog post. The material below on multicollinearity is from Dr. Steinschneider’s class, presented in my own words.

What is Multicollinearity?

Multicollinearity is the condition where two or more predictor variables in a statistical model are linearly related (Dormann et al. 2013). The existence of multicollinearity in your data set can inflate the variance of regression coefficient estimates, leading to unstable estimation of parameter values. This in turn can lead to erroneous identification of relevant predictors within a regression and detracts from a model’s ability to extrapolate beyond the range of the sample it was constructed with. In this post, I’ll explain how multicollinearity causes problems for linear regression by Ordinary Least Squares (OLS), introduce three metrics for detecting multicollinearity, and detail two “Tolerant Methods” for dealing with multicollinearity within a data set.

How does multicollinearity cause problems in OLS regression?

To illustrate the problems caused by multicollinearity, let’s start with a linear regression:

y=x\beta +\epsilon

Where:

y = a \hspace{.1 cm} vector \hspace{.1 cm} of \hspace{.1 cm} observed \hspace{.1 cm} responses

x = a \hspace{.1 cm} vector \hspace{.1 cm} of \hspace{.1 cm} predictor \hspace{.1 cm} variables

\beta = a \hspace{.1 cm} vector \hspace{.1 cm} of \hspace{.1 cm} coefficients

\epsilon =  a \hspace{.1 cm} vector \hspace{.1 cm} of \hspace{.1 cm} residuals

The Gauss-Markov theorem states that the Best Linear Unbiased Estimator (BLUE) for each  coefficient can be found using OLS:

\hat{\beta}_{OLS} = (x^Tx)^{-1}x^Ty

This  estimate will have a variance defined as:

var(\hat{\beta}_{OLS}) =\sigma^2 (x^Tx)^{-1}

Where:

\sigma^2 = the \hspace{.1 cm} variance\hspace{.1 cm} of \hspace{.1 cm} the\hspace{.1 cm} residuals

If you dive into the matrix algebra, you will find that for standardized predictors the term (x^Tx) is equal to a matrix with ones on the diagonals and the pairwise Pearson’s correlation coefficients (ρ) on the off-diagonals:

(x^Tx) =\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}

As the correlation values increase, the values within (xTx)-1 also increase. Even with a low residual variance, multicollinearity can cause large increases in estimator variance. Here are a few examples of the effect of multicollinearity using a hypothetical regression with two predictors:

 \rho = .3 \rightarrow (x^Tx)^{-1} =\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}^{-1} = \begin{bmatrix} 1.09 & -0.33 \\ -0.33 & 1.09 \end{bmatrix}

 \rho = .9 \rightarrow (x^Tx)^{-1} =\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}^{-1} = \begin{bmatrix} 5.26 & -4.74 \\ -4.74 & 5.26 \end{bmatrix}

 \rho = .999 \rightarrow (x^Tx)^{-1} =\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}^{-1} = \begin{bmatrix} 500.25 & -499.75 \\ -499.75 & 500.25\end{bmatrix}

So why should you care about the variance of your coefficient estimators? The answer depends on what the purpose of your model is. If your only goal is to obtain accurate predictions of the predictand, the presence of multicollinearity in your predictors might not be such a problem. If, however, you are trying to identify the key predictors that affect the predictand, multicollinearity is a big problem.

OLS estimators with large variances are highly unstable, meaning that if you construct estimators from different data samples you will potentially get wildly different estimates of your coefficient values (Dormann et al. 2013). Large estimator variance also undermines the trustworthiness of hypothesis testing of the significance of coefficients. Recall that the t value used in hypothesis testing for an OLS regression coefficient is a function of the sample standard deviation (the square root of the variance) of the  OLS estimator.

t_{n-2} =\frac{\hat{\beta_j}-0}{s_{\beta_j}}

An estimator with an inflated standard deviation, s_{\beta_j}, will thus yield a lower t value, which could lead to failing to identify a truly significant predictor (i.e., a type II error). See Ohlemüller et al. (2008) for some examples where hypothesis testing results are undermined by multicollinearity.

Detecting Multicollinearity within a data set

Now we know how multicollinearity causes problems in our regression, but how can we tell if there is multicollinearity within a data set? There are several commonly used metrics for which basic guidelines have been developed to determine whether multicollinearity is present.

The most basic metric is the pairwise Pearson Correlation Coefficient between predictors, r. Recall from your intro statistics course that the Pearson Correlation Coefficient is a measure of the linear relationship between two variables, defined as:

r_{x_1,x_2}=\frac{cov(x_1,x_2)}{\sigma_{x_1}\sigma_{x_2}}

A common rule of thumb is that multicollinearity may be a problem in a data set if any pairwise |r| > 0.7 (Dormann et al. 2013).

Another common metric is known as the Variance Inflation Factor (VIF). This measure is calculated by regressing each predictor on all others being used in the regression.

VIF(\beta_j) = \frac{1}{1-R^2_j}

Where Rj2 is the R2 value generated by regressing predictor xj on all other predictors. Multicollinearity is thought to be a problem if VIF > 10 for any given predictor (Dormann et al. 2013).

A third metric for detecting multicollinearity in a data set is the Condition Number (CN) of the predictor matrix defined as the square root of the ratio of the largest and smallest eigenvalues in the predictor matrix:

CN=\sqrt{\frac{\lambda_{max}}{\lambda_{min}}}

CN > 15 indicates the possible presence of multicollinearity, while a CN > 30 indicates serious multicollinearity problems (Dormann et al. 2013).
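
A minimal sketch of these three diagnostics in base R, using synthetic predictors (the data and variable names are made up for illustration):

set.seed(1)
x1=rnorm(200)
x2=0.85*x1 + sqrt(1-0.85^2)*rnorm(200)   #strongly correlated with x1
x3=rnorm(200)
X=scale(cbind(x1, x2, x3))               #standardize the predictors

#Pairwise Pearson correlations
cor(X)

#Variance Inflation Factor: regress each predictor on all of the others
sapply(1:ncol(X), function(j){
  r2=summary(lm(X[,j] ~ X[,-j]))$r.squared
  1/(1-r2)
})

#Condition Number: sqrt of the ratio of the largest to smallest eigenvalue of X^T X
ev=eigen(crossprod(X), only.values=TRUE)$values
sqrt(max(ev)/min(ev))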

Dealing with Multicollinearity using Tolerant Methods

In a statistical sense, there is no way to “fix” multicollinearity. However, methods have been developed to mitigate its effects. Perhaps the most effective way to remedy multicollinearity is to make a priori judgements about the relationship between predictors and remove or consolidate predictors that have known correlations. This is not always possible however, especially when the true functional forms of relationships are not known (which is often why regression is done in the first place). In this section I will explain two “Tolerant Methods” for dealing with multicollinearity.

The purpose of Tolerant Methods is to reduce the sensitivity of regression parameters to multicollinearity. This is accomplished through penalized regression. Since multicollinearity can result in large and opposite-signed estimator values for correlated predictors, a penalty function is imposed to keep the magnitude of the coefficients below a pre-specified value.

\sum_{j=1}^{p}|\beta_j|^l \leq c

Where c is the predetermined value representing model complexity, p is the number of predictors and l is either 1 or 2 depending on the type of tolerant method employed (more on this below).

Ridge Regression

Ridge regression uses the L2 norm, or Euclidean distance, to constrain model coefficients (i.e., l = 2 in the equation above). The coefficients created using ridge regression are defined as:

\hat{\beta}_{r} = (x^Tx+\lambda I)^{-1}x^Ty

Ridge regression adds a constant, λ, to the diagonal of the term x^Tx (i.e., adds λI) to construct the estimator. It should be noted that both x and y should be standardized before this estimator is constructed. The ridge regression coefficient is the result of a constrained version of the ordinary least squares optimization problem. The objective is to minimize the sum of squared errors for the regression while meeting the complexity constraint.

\hat{\beta_r} \begin{cases} argmin(\beta) \hspace{.1cm}\sum_{i=1}^{N} \epsilon_i^2  \\  \sum_{j=1}^{p}|\beta_j|^2 \leq c \end{cases}

To solve the constrained optimization, Lagrange multipliers can be employed. Let z equal the Residual Sum of Squares (RSS) to be minimized:

argmin(\beta) \hspace{.3cm}  z= (y-x\beta)^T(y-x\beta)+\lambda(\sum_{j=1}^{p}|\beta_j|^2-c)

This can be rewritten in terms of the L2 norm of β:

z = (y-x\beta)^T(y-x\beta)+\lambda||\beta||^2_2

Taking the derivative with respect to β and solving:

0 = \frac{\partial z}{\partial \beta} = -2x^T(y-x\beta)+2\lambda\beta

x^Ty = x^Tx\beta+\lambda\beta=(x^Tx+\lambda I)\beta

\hat{\beta}_{r} = (x^Tx+\lambda I)^{-1}x^Ty

Remember that the Gauss-Markov theorem states that the OLS estimate for regression coefficients is the BLUE, so by using ridge regression, we are sacrificing some benefits of OLS estimators in order to constrain estimator variance. Estimators constructed using ridge regression are in fact biased; this can be proven by calculating the expected value of the ridge regression coefficients.

E[\hat{\beta_r}]=(I+\lambda(x^Tx)^{-1})^{-1}\beta \neq \beta

For a scenario with two predictors, the tradeoff between reduced model complexity and increased bias in the estimators can be visualized graphically by plotting the estimators of the two beta values against each other. The vector of beta values estimated by regression is represented as a point on this plot (\hat{\beta}=[\beta_1, \beta_2]). In Figure 1, \hat{\beta}_{OLS} is plotted in the upper right quadrant and represents the estimator that produces the smallest RSS possible for the model. The ellipses centered around \hat{\beta}_{OLS} represent contours of increasing RSS resulting from different combinations of β1 and β2 values; each contour corresponds to the RSS attained for a different value of λ added to the regression. The circle centered around the origin represents the chosen level of model complexity that is constraining the ridge regression. The ridge estimator is the point where this circle intersects an RSS ellipse. Notice that as the value of c increases, the error introduced into the estimators decreases, and vice versa.

ridge_regression

Figure 1: Geometric interpretation of a ridge regression estimator. The blue dot indicates the OLS estimate of Beta; ellipses centered around the OLS estimate represent RSS contours for each Beta 1, Beta 2 combination (denoted here as z from the optimization equation above). The model complexity is constrained by distance c from the origin. The ridge regression estimator of Beta is shown as the red dot, where the RSS contour meets the circle defined by c.

The c value displayed in Figure 1 is only presented to explain the theoretical underpinnings of ridge regression. In practice, c is never specified, rather, a value for λ is chosen a priori to model construction. Lambda is usually chosen through a process known as k-fold cross validation, which is accomplished through the following steps:

  1. Partition the data set into K separate sets of equal size
  2. For each k = 1, …, K, fit the model excluding the kth set.
  3. Predict for the kth set
  4. Calculate the cross validation error (CV error) for the kth set: CV^{\lambda_0}_k = E[\sum(y-\hat{y})^2]
  5. Repeat for different values of λ_0 and choose the λ that minimizes CV^{\lambda_0} = \frac{1}{K}\sum_{k=1}^{K}CV^{\lambda_0}_k
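To show the mechanics of the fold loop, here is a minimal C++ sketch under strong simplifying assumptions: a single standardized predictor, so the ridge fit on each training fold reduces to the scalar estimate β = Σxy / (Σx² + λ), and a small invented data set with a handful of candidate λ values.

#include <iostream>

using namespace std;

int main()
{
    // Toy standardized predictor x and centered response y (values invented).
    double x[] = {-1.6, -1.1, -0.6, -0.2, 0.1, 0.4, 0.7, 1.0, 1.3, 1.6};
    double y[] = {-2.9, -2.0, -1.3, -0.5, 0.2, 0.6, 1.4, 1.9, 2.4, 3.1};
    const int n = 10;
    const int K = 5;                           // number of folds
    double lambdas[] = {0.0, 0.1, 1.0, 10.0};  // candidate penalties
    const int nLambdas = 4;

    double bestLambda = lambdas[0];
    double bestCV = -1.0;

    for (int l = 0; l < nLambdas; ++l)
    {
        double lambda = lambdas[l];
        double cvError = 0.0;

        for (int k = 0; k < K; ++k)
        {
            // Fold k holds out observations k, k+K, k+2K, ...
            double sxy = 0.0, sxx = 0.0;
            for (int i = 0; i < n; ++i)
            {
                if (i % K != k)                // fit on the training observations
                {
                    sxy += x[i] * y[i];
                    sxx += x[i] * x[i];
                }
            }
            double beta = sxy / (sxx + lambda);   // scalar ridge estimate

            // Accumulate squared prediction error on the held-out fold.
            for (int i = 0; i < n; ++i)
            {
                if (i % K == k)
                {
                    double resid = y[i] - beta * x[i];
                    cvError += resid * resid;
                }
            }
        }
        cvError /= K;                          // average the CV error over folds

        cout << "lambda = " << lambda << ", CV error = " << cvError << endl;
        if (bestCV < 0.0 || cvError < bestCV)
        {
            bestCV = cvError;
            bestLambda = lambda;
        }
    }
    cout << "selected lambda = " << bestLambda << endl;
    return 0;
}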

Lasso Regression

Another Tolerant Method for dealing with multicollinearity, known as Least Absolute Shrinkage and Selection Operator (LASSO) regression, solves the same constrained optimization problem as ridge regression but uses the L1 norm rather than the L2 norm as the measure of complexity.

\hat{\beta}_{Lasso} = \begin{cases} argmin(\beta) \hspace{.1cm}\sum_{i=1}^{N} \epsilon_i^2 \\ \sum_{j=1}^{p}|\beta_j| \leq c \end{cases}

LASSO regression can be visualized similarly to ridge regression, but since c is defined by the sum of absolute values of beta rather than the sum of squares, the area it constrains is diamond shaped rather than circular. Figure 2 shows the selection of the beta estimator from LASSO regression. As you can see, in this case the geometry of the L1 norm leads LASSO regression to select one of the predictors and drop the other (weight it as zero). This has been argued to provide more interpretable estimators (Tibshirani 1996).

lasso_regression

Figure 2: Geometric interpretation of the LASSO regression estimator. The blue dot indicates the OLS estimate of Beta; ellipses centered around the OLS estimate represent RSS contours for each Beta 1, Beta 2 combination (denoted as z from the optimization equation). Model complexity is constrained by the L1 norm, which defines the diamond of size c. The LASSO estimator of Beta is shown as the red dot, the location where the RSS contour intersects the diamond defined by c.

Final thoughts

If you’re creating a model with multiple predictors, it’s important to be cognizant of the potential for multicollinearity within your data set. Tolerant methods are only one of many possible remedies for multicollinearity (other notable techniques include data clustering and Principal Component Analysis), but it’s important to remember that no known technique can truly “solve” the problem of multicollinearity. The method used to deal with multicollinearity should be chosen on a case-by-case basis, and multiple methods should be employed if possible to help identify the underlying structure within the predictor data set (Dormann et al. 2013).

Citations

Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., Marquéz, J. R. G., Gruber, B., Lafourcade, B., Leitão, P. J., Münkemüller, T., McClean, C., Osborne, P. E., Reineking, B., Schröder, B., Skidmore, A. K., Zurell, D. and Lautenbach, S. 2013, “Collinearity: a review of methods to deal with it and a simulation study evaluating their performance.” Ecography, 36: 27–46. doi:10.1111/j.1600-0587.2012.07348.x

Ohlemüller, R. et al. 2008. “The coincidence of climatic and species rarity: high risk to small-range species from climate change.” Biology Letters. 4: 568 – 572.

Tibshirani, Robert 1996. “Regression shrinkage and selection via the lasso.” Journal of the Royal Statistical Society. Series B (Methodological): 267-288.

 

 

 

Synthetic streamflow generation

A recent research focus of our group has been the development and use of synthetic streamflow generators.  There are many tools one might use to generate synthetic streamflows, and it may not be obvious which is right for a specific application or what the inherent limitations of each method are.  More fundamentally, it may not be obvious why it is desirable to generate synthetic streamflows in the first place.  This will be the first in a series of blog posts on synthetic streamflow generators, in which I hope to sketch out the various categories of generation methods and their appropriate use as I see it.  In this first post we’ll focus on the motivation and history behind the development of synthetic streamflow generators and broadly categorize them.

Why should we use synthetic hydrology?

The most obvious reason to use synthetic hydrology is if there is little or no data for your system (see Lamontagne, 2015).  Another obvious reason is if you are trying to evaluate the effect of hydrologic non-stationarity on your system (Herman et al. 2015; Borgomeo et al. 2015).  In that case you could use synthetic methods to generate flows reflecting a shift in hydrologic regime.  But are there other reasons to use synthetic hydrology?

In water resources systems analysis it is common practice to evaluate the efficacy of management or planning strategies by simulating system performance over the historical record, or over some critical period.  In this approach, new strategies are evaluated by asking the question: how well would we have done with the new strategy?

This may be an appealing approach, especially if some event was particularly traumatic to your system. But is this a robust way of evaluating alternative strategies?  It’s important to remember that any hydrologic record, no matter how long, is only a single realization of a stochastic process.  Importantly, drought and flood events emerge as the result of specific sequences of events, unlikely to be repeated.  Furthermore, there is a 50% chance that the worst flood or drought in an N-year record will be exceeded in the next N years.  Is it well advised to tailor our strategies to past circumstances that will likely never be repeated and will as likely as not be exceeded?  As Lettenmaier et al. [1987] remind us, “Little is certain about the future except that it will be unlike the past.”

Even under stationarity, and even with long hydrologic records, the use of synthetic streamflow can improve the efficacy of planning and management strategies by exposing them to a larger and more diverse set of flood and drought events than those in the record (Loucks et al. 1981; Vogel and Stedinger, 1988; Loucks et al. 2005).  Figure 7.12 from Loucks et al. 2005 shows a typical experimental set-up using synthetic hydrology with a simulation model.  Often our group will wrap an optimization model like Borg around this set-up, where the system design/operating policy (bottom of the figure) are the decision variables, and the system performance (right of the figure) provides the objective(s).

loucks-7-12

(Loucks et al. 2005)

 

What are the types of generators?

Many synthetic streamflow generation techniques have been proposed since the early 1960s.  It can be difficult for a researcher or practitioner to know which method is best suited to the problem at hand.  Thus, we’ll start with a very broad characterization of what is out there, then proceed to some history.

Broadly speaking there are two approaches to generating synthetic hydrology: indirect and direct.  The indirect approach generates streamflow by synthetically generating the forcings to a hydrologic model.  For instance one might generate precipitation and temperature series and input them to a hydrologic model of a basin (e.g. Steinschneider et al. 2014).  In contrast, direct methods use statistical techniques to generate streamflow timeseries directly.

The direct approach is generally easier to apply and more parsimonious because it does not require the selection, calibration, and validation of a separate hydrologic model (Najafi et al. 2011).  On the other hand, the indirect approach may be desirable in some settings.  Climate projections from GCMs often include temperature or precipitation changes, but may not describe hydrologic shifts at a resolution or precision that is useful.  In other cases, profound regime shifts may be difficult to represent with statistical models and may require process-driven models, thus necessitating the indirect approach.

Julie’s earlier series focused on indirect approaches, so we’ll focus on the direct approach here.  Regardless of the approach, many of the methods are the same.  In general, generator methods can be divided into two categories: parametric and non-parametric.  Parametric methods rely on a hypothesized statistical model of streamflow whose parameters are selected to achieve a desired result (Stedinger and Taylor, 1982a).  In contrast, non-parametric methods do not make strong structural assumptions about the processes generating the streamflow, but rather rely on re-sampling from the hydrologic record in some way (Lall, 1995).  Some methods combine parametric and non-parametric techniques, which we’ll refer to as semi-parametric (Herman et al. 2015).

Both parametric and non-parametric methods have advantages and disadvantages.  Parametric methods are often parsimonious, and often have analytical forms that allow easy parameter manipulation to reflect non-stationarity.  However, there can be concern that the underlying statistical models may not reflect the hydrologic reality well (Sharma et al. 1997).  Furthermore, in multi-dimensional, multi-scale problems the proliferation of parameters can make parametric models intractable (Grygier and Stedinger, 1988).  Extensive work has been done to confront both challenges, but they may lead a researcher to adopt a non-parametric method instead.

Because many non-parametric methods ‘re-sample’ flows from a record, realism is not generally a concern, and most re-sampling schemes are computationally straightforward, relatively speaking (a toy resampling sketch is given below).  On the other hand, manipulating synthetic flows to reflect non-stationarity may not be as straightforward as a simple parameter change, though methods have been suggested (Herman et al. 2015; Borgomeo et al. 2015).  More fundamentally, because non-parametric methods rely so heavily on the data, they require sufficiently long records to ensure there is enough hydrologic variability to sample.  Short records can be a concern for parametric methods as well, though parametric uncertainty can be explicitly considered in those methods (Stedinger and Taylor, 1982b).  Of course, parametric methods also have structural uncertainty that non-parametric models largely avoid by not assuming an explicit statistical model.
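As a toy illustration of the resampling idea only, and not of any particular published scheme (block bootstrap and k-nearest-neighbor methods add structure to preserve autocorrelation), the sketch below draws whole years with replacement from a short, invented annual record to build a longer synthetic trace.

#include <cstdlib>
#include <ctime>
#include <iostream>

using namespace std;

int main()
{
    // Short annual flow record (values invented for illustration).
    double record[] = {420.0, 515.0, 380.0, 610.0, 455.0, 530.0, 395.0, 470.0};
    const int n = 8;
    const int nSynthetic = 20;                 // length of the synthetic trace

    srand(static_cast<unsigned>(time(NULL)));

    // Each synthetic year is sampled, with replacement, from the record.
    for (int t = 0; t < nSynthetic; ++t)
    {
        int i = rand() % n;
        cout << record[i] << endl;
    }
    return 0;
}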

In the coming posts we’ll dig into the nuances of the different methods in greater detail.

A historical perspective

The first use of synthetic flow generation seems to have been by Hazen [1914].  That work attempted to quantify the reliability of a water supply by aggregating the streamflow records of local streams into a 300-year ‘synthetic record.’  Of course the problem with this is that the cross-correlation between concurrent flows rendered the effective record length much less than the nominal 300 years.

Next Barnes [1954] generated 1,000 years of streamflow for a basin in Australia by drawing random flows from a normal distribution with mean and variance equal to the sample estimates from the observed record.  That work was extended by researchers from the Harvard Water Program to account for autocorrelation of monthly flows (Maass et al., 1962; Thomas and Fiering, 1962).  Later work also considered the use of non-normal distributions (Fiering, 1967), and the generation of correlated concurrent flows at multiple sites (Beard, 1965; Matalas, 1967).

Those early methods relied on first-order autoregressive models that regressed flows in the current period on the flows of the previous period (see Loucks et al.’s Figure 7.13 below; a minimal lag-1 sketch follows the figure).  Box and Jenkins [1970] extended those methods to autoregressive, moving average, and autoregressive-moving average models of arbitrary order.  Those models were the focus of extensive research over the course of the 1970s and 1980s and underpin many of the parametric generators that are widely used in hydrology today (see Salas et al. 1980; Grygier and Stedinger, 1990; Salas, 1993; Loucks et al. 2005).

loucks-7-13

(Loucks et al. 2005)
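To make the lag-1 idea concrete, here is a minimal sketch of a stationary lag-1 autoregressive generator, q(t+1) = μ + ρ(q(t) - μ) + ε(t), where ε(t) is normal with mean zero and standard deviation σ√(1 - ρ²). This is my own illustration with invented moments and the C++11 <random> header, not a full Thomas-Fiering implementation (which, for example, lets the parameters vary by month).

#include <cmath>
#include <iostream>
#include <random>

using namespace std;

int main()
{
    // Invented annual statistics: mean, standard deviation, lag-1 correlation.
    const double mu = 500.0, sigma = 120.0, rho = 0.6;
    const int nYears = 25;

    mt19937 gen(42);                           // fixed seed for repeatability
    normal_distribution<double> noise(0.0, sigma * sqrt(1.0 - rho * rho));

    double q = mu;                             // start the chain at the mean
    for (int t = 0; t < nYears; ++t)
    {
        q = mu + rho * (q - mu) + noise(gen);  // lag-1 autoregressive update
        cout << q << endl;
    }
    return 0;
}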

By the mid-1990s, non-parametric methods began to gain popularity (Lall, 1995).  While much of this work has its roots in earlier work from the 1970s and 1980s (Yakowitz, 1973, 1979, 1985; Schuster and Yakowitz, 1979; Yakowitz and Karlsson, 1987; Karlsson and Yakowitz, 1987), improvements in computing and the availability of large data sets meant that by the mid-1990s non-parametric methods were feasible (Lall and Sharma, 1996).  Early examples of non-parametric methods include block bootstrapping (Vogel and Shallcross, 1996), k-nearest neighbor (Lall and Sharma, 1996), and kernel density methods (Sharma et al. 1997).  Since that time extensive research has made improvements to these methods, often by incorporating parametric elements.  For instance, Srinivas and Srinivasan (2001, 2005, and 2006) develop a hybrid autoregressive-block bootstrapping method designed to reduce the bias in lagged correlation and to generate flows other than those in the historical record, for multiple sites and multiple seasons.  K-nearest neighbor methods have also been the focus of extensive research (Rajagopalan and Lall, 1999; Harrold et al. 2003; Yates et al. 2003; Sharif and Burn, 2007; Mehrotra and Sharma, 2006; Prairie et al. 2006; Lee et al. 2010, Salas and Lee, 2010, Nowak et al., 2010), including recent work by our group (Giuliani et al. 2014).

Emerging work focuses on stochastic streamflow generation using copulas [Lee and Salas, 2011; Fan et al. 2016], entropy theory bootstrapping [Srivastav and Simonovic, 2014], and wavelets [Kwon et al. 2007; Erkyihun et al., 2016], among other methods.

In the following posts I’ll address different challenges in stochastic generation [e.g. long-term persistence, parametric uncertainty, multi-site generation, seasonality, etc.] and the relative strengths and shortcomings of the various methods for addressing them.

Works Cited

Barnes, F. B., Storage required for a city water supply, J. Inst. Eng. Australia 26(9), 198-203, 1954.

Beard, L. R., Use of interrelated records to simulate streamflow, J. Hydrol. Div., ASCE 91(HY5), 13-22, 1965.

Borgomeo, E., Farmer, C. L., and Hall, J. W. (2015). “Numerical rivers: A synthetic streamflow generator for water resources vulnerability assessments.” Water Resour. Res., 51(7), 5382–5405.

Y.R. Fan, W.W. Huang, G.H. Huang, Y.P. Li, K. Huang, Z. Li, Hydrologic risk analysis in the Yangtze River basin through coupling Gaussian mixtures into copulas, Advances in Water Resources, Volume 88, February 2016, Pages 170-185.

Fiering, M.B, Streamflow Synthesis, Harvard University Press, Cambridge, Mass., 1967.

Giuliani, M., J. D. Herman, A. Castelletti, and P. Reed (2014), Many-objective reservoir policy identification and refinement to reduce policy inertia and myopia in water management, Water Resour. Res., 50, 3355–3377, doi:10.1002/2013WR014700.

Grygier, J.C., and J.R. Stedinger, Condensed Disaggregation Procedures and Conservation Corrections for Stochastic Hydrology, Water Resour. Res. 24(10), 1574-1584, 1988.

Grygier, J.C., and J.R. Stedinger, SPIGOT Technical Description, Version 2.6, 1990.

Harrold, T. I., Sharma, A., and Sheather, S. J. (2003). “A nonparametric model for stochastic generation of daily rainfall amounts.” Water Resour. Res., 39(12), 1343.

Hazen, A., Storage to be provided in impounding reservoirs for municipal water systems, Trans. Am. Soc. Civ. Eng. 77, 1539, 1914.

Herman, J.D., H.B. Zeff, J.R. Lamontagne, P.M. Reed, and G. Characklis (2016), Synthetic Drought Scenario Generation to Support Bottom-Up Water Supply Vulnerability Assessments, Journal of Water Resources Planning & Management, doi: 10.1061/(ASCE)WR.1943-5452.0000701.

Karlsson, M., and S. Yakowitz, Nearest-Neighbor methods for nonparametric rainfall-runoff forecasting, Water Resour. Res., 23, 1300-1308, 1987.

Kwon, H.-H., U. Lall, and A. F. Khalil (2007), Stochastic simulation model for nonstationary time series using an autoregressive wavelet decomposition: Applications to rainfall and temperature, Water Resour. Res., 43, W05407, doi:10.1029/2006WR005258.

Lall, U., Recent advances in nonparametric function estimation: Hydraulic applications, U.S. Natl. Rep. Int. Union Geod. Geophys. 1991- 1994, Rev. Geophys., 33, 1093, 1995.

Lall, U., and A. Sharma (1996), A nearest neighbor bootstrap for resampling hydrologic time series, Water Resour. Res. 32(3), pp. 679-693.

Lamontagne, J.R. 2015, Representation of Uncertainty and Corridor DP for Hydropower Optimization, PhD dissertation, Cornell University, Ithaca, NY.

Lee, T., J. D. Salas, and J. Prairie (2010), An enhanced nonparametric streamflow disaggregation model with genetic algorithm, Water Resour. Res., 46, W08545, doi:10.1029/2009WR007761.

Lee, T., and J. Salas (2011), Copula-based stochastic simulation of hydrological data applied to Nile River flows, Hydrol. Res., 42(4), 318–330.

Lettenmaier, D. P., K. M. Latham, R. N. Palmer, J. R. Lund and S. J. Burges, Strategies for coping with drought Part II: Planning techniques for planning and reliability assessment, EPRI P-5201, Final Report Project 2194-1, June 1987.

Loucks, D.P., Stedinger, J.R. & Haith, D.A. 1981, Water Resources Systems Planning and Analysis, 1st edn, Prentice-Hall, Englewood Cliffs, N.J.

Loucks, D.P. et al. 2005, Water Resources Systems Planning and Management: An Introduction to Methods, Models and Applications, UNESCO, Delft, The Netherlands.

Maass, A., M. M. Hufschmidt, R. Dorfman, H. A. Thomas, Jr., S. A. Marglin and G. M. Fair, Design of Water Resource Systems, Harvard University Press, Cambridge, Mass., 1962.

Matalas, N. C., Mathematical assessment of synthetic hydrology, Water Resour. Res. 3(4), 937-945, 1967.

Najafi, M. R., Moradkhani, H., and Jung, I. W. (2011). “Assessing the uncertainties of hydrologic model selection in climate change impact studies.” Hydrol. Process., 25(18), 2814–2826.


Nowak, K., J. Prairie, B. Rajagopalan, and U. Lall (2010), A nonparametric stochastic approach for multisite disaggregation of annual to daily streamflow, Water Resour. Res., 46, W08529, doi:10.1029/2009WR008530.

Prairie, J. R., Rajagopalan, B., Fulp, T. J., and Zagona, E. A. (2006). “Modified K-NN model for stochastic streamflow simulation.” J. Hydrol. Eng., 11(4), 371–378.

Rajagopalan, B., and Lall, U. (1999). “A k-nearest-neighbor simulator for daily precipitation and other weather variables.” Water Resour. Res., 35(10), 3089–3101.

Salas, J. D., J. W. Deller, V. Yevjevich and W. L. Lane, Applied Modeling of Hydrologic Time Series, Water Resources Publications, Littleton, Colo., 1980.

Salas, J.D., 1993, Analysis and Modeling of Hydrologic Time Series, Chapter 19 (72 p.) in The McGraw Hill Handbook of Hydrology, D.R. Maidment, Editor.

Salas, J.D., T. Lee. (2010). Nonparametric Simulation of Single-Site Seasonal Streamflow, J. Hydrol. Eng., 15(4), 284-296.

Schuster, E., and S. Yakowitz, Contributions to the theory of nonparametric regression, with application to system identification, Ann. Stat., 7, 139-149, 1979.

Sharif, M., and Burn, D. H. (2007). “Improved K-nearest neighbor weather generating model.” J. Hydrol. Eng., 12(1), 42–51.

Sharma, A., Tarboton, D. G., and Lall, U., 1997. “Streamflow simulation: A nonparametric approach.” Water Resour. Res., 33(2), 291–308.

Srinivas, V. V., and Srinivasan, K. (2001). “A hybrid stochastic model for multiseason streamflow simulation.” Water Resour. Res., 37(10), 2537–2549.

Srinivas, V. V., and Srinivasan, K. (2005). “Hybrid moving block bootstrap for stochastic simulation of multi-site multi-season streamflows.” J. Hydrol., 302(1–4), 307–330.

Srinivas, V. V., and Srinivasan, K. (2006). “Hybrid matched-block bootstrap for stochastic simulation of multiseason streamflows.” J. Hydrol., 329(1–2), 1–15.

Roshan K. Srivastav, Slobodan P. Simonovic, An analytical procedure for multi-site, multi-season streamflow generation using maximum entropy bootstrapping, Environmental Modelling & Software, Volume 59, September 2014a, Pages 59-75.

Stedinger, J. R. and M. R. Taylor, Synthetic streamflow generation, Part 1. Model verification and validation, Water Resour. Res. 18(4), 909-918, 1982a.

Stedinger, J. R. and M. R. Taylor, Synthetic streamflow generation, Part 2. Parameter uncertainty, Water Resour. Res. 18(4), 919-924, 1982b.

Steinschneider, S., Wi, S., and Brown, C. (2014). “The integrated effects of climate and hydrologic uncertainty on future flood risk assessments.” Hydrol. Process., 29(12), 2823–2839.

Thomas, H. A. and M. B. Fiering, Mathematical synthesis of streamflow sequences for the analysis of river basins by simulation, in Design of Water Resource Systems, by A. Maass, M. Hufschmidt, R. Dorfman, H. A. Thomas, Jr., S. A. Marglin and G. M. Fair, Harvard University Press, Cambridge, Mass., 1962.

Vogel, R.M., and J.R. Stedinger, The value of stochastic streamflow models in over-year reservoir design applications, Water Resour. Res. 24(9), 1483-90, 1988.

Vogel, R. M., and A. L. Shallcross (1996), The moving block bootstrap versus parametric time series models, Water Resour. Res., 32(6), 1875–1882.

Yakowitz, S., A stochastic model for daily river flows in an arid region, Water Resour. Res., 9, 1271-1285, 1973.

Yakowitz, S., Nonparametric estimation of markov transition functions, Ann. Stat., 7, 671-679, 1979.

Yakowitz, S. J., Nonparametric density estimation, prediction, and regression for markov sequences J. Am. Stat. Assoc., 80, 215-221, 1985.

Yakowitz, S., and M. Karlsson, Nearest-neighbor methods with application to rainfall/runoff prediction, in Stochastic  Hydrology, edited by J. B. Macneil and G. J. Humphries, pp. 149-160, D. Reidel, Norwell, Mass., 1987.

Yates, D., Gangopadhyay, S., Rajagopalan, B., and Strzepek, K. (2003). “A technique for generating regional climate scenarios using a nearest-neighbor algorithm.” Water Resour. Res., 39(7), 1199.

New Federal Flood Frequency Analysis Guidelines

This post is a bit off topic for this blog, but I think it should be interesting to our readers.  The current procedure used by Federal agencies (FEMA, Bureau of Reclamation, USGS, Army Corps, etc.) to assess flood risk at gauged sites is described in Bulletin 17B.  That procedure is used to estimate flood risk for things like FEMA flood maps, to set flood insurance rates, and to design riparian structures like levees.  Given how important the procedure is, many people are surprised to learn that it has not been updated since 1982, despite improvements in statistical techniques and computational resources, and the additional data that is now available.  This post won't speculate as to why change has taken so long, but I will briefly describe the old procedures and the proposed updates.

Bulletin 17B Procedure:

Dr. Veronica Webster has written an excellent brief history on the evolution of national flood frequency standards in the US dating back to the 1960s.  For this post, we focus on the procedures adopted in 1982.  In short, Bulletin 17B recommends fitting the log-Pearson Type III distribution to the annual peak flow series at a gauged location, using the log-space method of moments.  In other words, one takes the logarithms of the flood series and then uses the mean, variance, and skew coefficient of the transformed series to fit a Pearson Type III distribution (a rough sketch of this moment computation is given below).  The Pearson Type III distribution is a shifted Gamma distribution.  When the population skew coefficient is positive the distribution is bounded on the low end, and when the population skew is negative the distribution is bounded on the high end.  When the population skew is zero, the Pearson Type III distribution becomes a normal distribution.  For those wanting more information, Veronica and Dr. Jery Stedinger have written a nice three-part series on the log-Pearson Type III, starting with a paper on the distribution characteristics.
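As a rough sketch of the log-space method of moments only (my own illustration with invented peak flows; the Bulletin adds regional skew weighting, outlier handling, and other adjustments not shown here), the code below computes the mean, standard deviation, and skew coefficient of the log10 flows and converts them to Pearson Type III parameters using α = 4/G², β = sG/2, and ξ = m - 2s/G.

#include <cmath>
#include <iostream>

using namespace std;

int main()
{
    // Invented annual peak flows, for illustration only.
    double peaks[] = {1200.0, 3400.0, 800.0, 5600.0, 2100.0,
                      950.0, 4300.0, 1700.0, 2600.0, 7200.0};
    const int n = 10;

    // Work with the base-10 logarithms of the peaks.
    double x[10];
    double mean = 0.0;
    for (int i = 0; i < n; ++i)
    {
        x[i] = log10(peaks[i]);
        mean += x[i];
    }
    mean /= n;

    // Sample standard deviation and adjusted skew coefficient of the log flows.
    double m2 = 0.0, m3 = 0.0;
    for (int i = 0; i < n; ++i)
    {
        double d = x[i] - mean;
        m2 += d * d;
        m3 += d * d * d;
    }
    double s = sqrt(m2 / (n - 1));
    double G = (double(n) / ((n - 1.0) * (n - 2.0))) * m3 / (s * s * s);

    // Convert the moments to Pearson Type III parameters in log space
    // (shape, scale, location); this assumes G is not near zero.
    double alpha = 4.0 / (G * G);
    double beta  = s * G / 2.0;
    double xi    = mean - 2.0 * s / G;

    cout << "log-space mean = " << mean << ", std dev = " << s
         << ", skew = " << G << endl;
    cout << "Pearson III: alpha = " << alpha << ", beta = " << beta
         << ", xi = " << xi << endl;
    return 0;
}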

Unfortunately, data is not always well behaved and the true distribution of floods is certainly not log-Pearson Type III, so the Bulletin takes measures to make the procedure more robust.  One such measure is the use of a regional skew coefficient to supplement the sample skew coefficient computed from the transformed flood record.  The idea here is that computing the skew coefficient from short samples is difficult because the skew coefficient can be noisy, so one instead uses a weighted average of the sample skew and a regional skew.  The relative weights are inversely proportional to the mean squared error of each skew, so the more accurate skew is given more weight.  The Bulletin recommends obtaining a regional skew from either an average of nearby and hydrologically similar records, a model of skew based on basin characteristics (like drainage area), or the use of the National Skew Map (see below), based on a 1974 study.
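Written out (my paraphrase of the weighting, not a quotation from the Bulletin), the weighted skew takes the form:

G_w = \frac{MSE_{\bar{G}} \cdot G + MSE_{G} \cdot \bar{G}}{MSE_{\bar{G}} + MSE_{G}}

where G is the skew computed from the log-transformed record at the site, \bar{G} is the regional skew, and MSE_{G} and MSE_{\bar{G}} are their respective mean squared errors, so each skew is weighted by the error of the other and the more accurate estimate receives more weight.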

National Skew Map provided in Bulletin 17B [IAWCD, 1982, Plate I]

A second concern is that small observations in a flood record might exert unjustifiable influence in the analysis and distort the estimates of the likelihood of the large floods of interest.  Often such small observations are caused by different hydrologic processes than those that cause the largest floods.  In my own experience, I’ve encountered basins wherein the maximum annual discharge is zero, or nearly zero in some years.  The Bulletin recommends censoring such observations, meaning that one removes them from the computation of the sample moments, then applies a probability adjustment to account for their presence.  The Bulletin provides an objective outlier detection test to identify potential nuisance observations, but ultimately leaves censoring decisions to the subjective judgement of the analyst.  The proposed objective test is the Grubbs-Beck test, which is a 10% significance test of whether the smallest observation in a normal sample is unusually small.

Updates:

The Hydrologic Frequency Analysis Work Group (HFAWG) is charged with updating the Bulletin.  Their recommendations can be found on-line, as well as a testing report which compares the new methods to the old Bulletin.  Bulletin 17C is currently being written.  Like Bulletin 17B, the new recommendations also fit the log-Pearson Type III distribution to the annual maximum series from a gaged site, but a new fitting technique is used: the expected moments algorithm (EMA).  This method is related to the method of moments estimators previously used, but allows for the incorporation of historical/paleo flood information, censored observations, and regional skew information in a unified, systematic methodology.

High water mark of 1530 flood of the Tiber River at Rome

Historical information might include non-systematic records of large floods: “The town hall has been there since 1760 and has only been flooded once,”  thus providing a threshold flow which has only been crossed once in 256 years.  EMA can include that sort of valuable community experience about flooding in a statistically rigorous way! For the case that no historical information is available, no observations are censored, and no regional skew is used, the EMA moment estimators are exactly the same as the Bulletin 17B method of moments estimators.  See Dr. Tim Cohn’s website for EMA software.

The EMA methodology also provides quantile confidence intervals (confidence intervals for the 100-year flood, etc.) that are more accurate than the ones used in Bulletin 17B.

Another update to the Bulletin involves the identification of outlier observations, which are now termed Potentially Influential Low Flows (PILFs).  A concern with the old detection test was that it rarely identified multiple outliers, even when several very small observations were present.  In fact, the Grubbs-Beck test, as used in the Bulletin, is only designed to test the smallest observation and not the second, third, or kth smallest observations.  Instead, the new Bulletin adopts a generalization of the Grubbs-Beck test designed to test for multiple PILFs (see this proceedings, full paper to come).  The new test identifies PILFs more often and will identify more PILFs than the previous test, but we’ve found that use of a reasonable outlier test actually results in improved quantile efficiency when fitting log-Pearson Type III data with EMA (again, full paper in review).

The final major revision is the retirement of the skew map (see above) and the adoption of language recommending more modern techniques, like Bayesian Generalized Least Squares (BGLS).  In fact, Dr. Andrea Veilleux and her colleagues at the USGS have been busy using iterations of BGLS to generate models of regional skew across the country, including California, Iowa, the Southeast, the Northwest, Missouri, Vermont, Alaska, Arizona, and California rainfall floods.  My master's work was in this area, and I’d be happy to write further on Bayesian regression techniques for hydrologic data, if folks are interested!

If folks are interested in software that implements the new techniques, the USGS has put out a package with good documentation.

That’s it for now!

Jon

 

C++ Training: Exercise 1

(Updated 1/30/12)

This example requires some data you can download here: inputdata.  Open the file in Excel and “Save As”… csv.

Write a C++ program that reads data from the csv file and then calculates the mean, the standard deviation, the 10th percentile, and the 90th percentile of each column of data.  Output these statistics to one or more new files.  You may create one file for each statistic, i.e.

means-of-data.csv

col1, col2, col3,
mean-of-col-1, mean-of-col-2, mean-of-col-3

Or put each statistic in its own row in a single file.

For this exercise, you may need the following functions, libraries, or features.  Please look them up on cplusplus.com and make liberal use of the example code there!

math: pow, operators such as +, -, and *

input/output of data: please use ifstream and ofstream.  You’ll want to use the << and >> operator, and the function: getline

manipulating text streams: the C++ library string, and stringstream, may be helpful.  You may need to look up how to handle csv files and how to separate the commas from the data; one possible sketch is given below.

data: sort

control structures: if, else, while, for
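As a hint for the comma handling (one possible approach, not the only one), a line of the csv file can be wrapped in a stringstream and split by calling getline with ',' as the delimiter. The line of text below is a made-up example.

#include <cstdlib>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

using namespace std;

int main()
{
    // Example line as it might appear in the csv file (made-up values).
    string line = "3.2,4.7,5.1";

    // Wrap the line in a stringstream and pull out comma-separated tokens.
    stringstream ss(line);
    string token;
    vector<double> values;
    while (getline(ss, token, ','))
    {
        values.push_back(strtod(token.c_str(), NULL));  // convert each token to a double
    }

    for (size_t j = 0; j < values.size(); ++j)
        cout << "column " << j + 1 << ": " << values[j] << endl;

    return 0;
}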

One piece of advice: if you want to convert a C++ style string to an integer or vice versa, you may find the following functions handy:


#include <cstdlib>    // atoi, strtod
#include <sstream>    // stringstream
#include <string>     // string

using namespace std;

// Convert an integer to a C++ string using a stringstream.
string intToString(int input_int)
{
    stringstream out;
    out << input_int;
    return out.str();
}

// Convert a C++ string to an integer (returns 0 if no conversion is possible).
int stringToInt(string input_string)
{
    return atoi(input_string.c_str());
}

// Convert a C++ string to a double (returns 0.0 if no conversion is possible).
double stringToDouble(string input_string)
{
    return strtod(input_string.c_str(), NULL);
}

Unit 1: Getting the Program Working

  1. Write a standard deviation function that can take the standard deviation of an arbitrary number of values (a sketch of one possible approach follows this list). Verify that the function works with a calculator, Matlab, or Excel. Is the precision the same? What could cause differences between your answer and the calculator's?
  2. Learn how to use the input/output streams.  You’ll want to set up one stream to read the file, and another one to write output files. To test this, you can simply “copy” the file you read in directly into another file. Are there ways to do this without using too much system memory?
  3. The easiest statistics to calculate are the mean and standard deviation, since they can be computed without storing much information.  The 10th and 90th percentiles are more difficult, because you must sort the values first.  Don’t try to do these until you have the other two tasks completed.
  4. Once you have your results, verify them using Matlab or Excel.  Are your statistics correct?
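If it helps to get started on step 1, here is one possible (two-pass) standard deviation function; it is only a sketch, and differences from a calculator or Excel usually come down to floating-point round-off and whether an n or n-1 denominator is used.

#include <cmath>
#include <iostream>
#include <vector>

using namespace std;

// Sample standard deviation (n-1 denominator) computed in two passes:
// first the mean, then the sum of squared deviations from the mean.
double standardDeviation(const vector<double> &data)
{
    if (data.size() < 2) return 0.0;

    double mean = 0.0;
    for (size_t i = 0; i < data.size(); ++i)
        mean += data[i];
    mean /= data.size();

    double sumSq = 0.0;
    for (size_t i = 0; i < data.size(); ++i)
        sumSq += (data[i] - mean) * (data[i] - mean);

    return sqrt(sumSq / (data.size() - 1));
}

int main()
{
    // Quick check against a hand calculation: 2, 4, 4, 4, 5, 5, 7, 9.
    double raw[] = {2, 4, 4, 4, 5, 5, 7, 9};
    vector<double> data(raw, raw + 8);
    cout << standardDeviation(data) << endl;   // sample std dev is about 2.14
    return 0;
}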

Unit 2: Intermediate Steps

After your program is working, please try the following procedures:

  1. Initially, you can write all the code necessary to do this task in one file, maybe even all in the main() function. This is fine for small tasks, but as you continue, you’ll need to write more complicated programs that span multiple files and use many more functions. A makefile is a way to compile large projects on Linux. Set up a makefile for your program, and try separating the code into multiple files. There’s a short blog post to help you (log in to see it). You can also use a development environment such as Eclipse to create a makefile for you, or Visual Studio to create a project file without needing a makefile.
  2. The first time you do this example, you may use (or have used) regular C-style arrays to store the data from the CSV files. This is fine, as long as you know how many rows and columns are in the file! Arrays also require you to keep track of pointers and memory management, which can make operations like sorting more painful than they need to be. C++’s answer to these problems is vectors. Vectors are like ArrayLists in Java; they can be sized and appended dynamically, and they know their own size. Consider implementing the same program using vectors for an input file with an unknown number of rows and columns (a sketch of one possible approach follows this list).
  3. Use gprof to profile your code and determine how much time is being spent in each function. This can identify “bottlenecks” in your code. See a blog post here for more info.
  4. Use valgrind to identify memory leaks in your code. This is very important, since memory issues can actually make your code fail to run. The blog post here explains.
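For step 2 of this unit, one possible sketch (not the only reasonable design) reads a file with an unknown number of rows and columns into a vector of vectors; the file name "data.csv" is just a placeholder for the csv file from the exercise.

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

using namespace std;

int main()
{
    ifstream in("data.csv");                   // placeholder file name
    if (!in)
    {
        cerr << "could not open file" << endl;
        return 1;
    }

    vector< vector<double> > table;            // grows by one row per line read
    string line;
    while (getline(in, line))
    {
        if (line.empty()) continue;            // skip blank lines
        vector<double> row;
        stringstream ss(line);
        string token;
        while (getline(ss, token, ','))
            row.push_back(strtod(token.c_str(), NULL));
        table.push_back(row);
    }

    cout << "read " << table.size() << " rows";
    if (!table.empty())
        cout << " and " << table[0].size() << " columns";
    cout << endl;
    return 0;
}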
Unit 3: Pulling Everything Together

  1. The last step will be to combine everything you’ve done so far to create your own Matrix class (a skeleton sketch is given after this list). You should be able to find some good websites explaining basic C++ classes, and if you need more advice, come talk to us. Your Matrix class should use vectors to create a matrix behind the scenes, so that you can use it easily in your main() function. The class should have some helpful functions to simplify your tasks … for example, you may want to have a Matrix.readFile(…) function, a Matrix.sortColumns() function, or whatever else you want to include!
  2. You may want to implement your class using a separate header (.h) and source (.cpp) file. To compile a project with multiple source files, you’ll need to learn how to use basic makefiles, which are explained in Joe’s previous post (the link is given in Unit 2).
  3. Re-implement your main() function using your new Matrix class. Make sure your output is the same as in the previous exercises. Once you do this, you will have written a general, re-usable class that could (potentially) save you lots of time in the future.
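As a starting point for step 1 (a skeleton only; the readFile and sortColumns names simply follow the suggestions above, and you will want to add your own statistics functions), one possible shape for the Matrix class is sketched below.

#include <algorithm>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

using namespace std;

// Minimal Matrix class that stores the data column-by-column so that
// column statistics and sorting are easy to add later.
class Matrix
{
public:
    // Read a csv file into the columns ("data.csv" below is a placeholder name).
    bool readFile(const string &filename)
    {
        ifstream in(filename.c_str());
        if (!in) return false;

        string line;
        while (getline(in, line))
        {
            if (line.empty()) continue;
            stringstream ss(line);
            string token;
            size_t col = 0;
            while (getline(ss, token, ','))
            {
                if (col >= columns.size()) columns.resize(col + 1);
                columns[col].push_back(strtod(token.c_str(), NULL));
                ++col;
            }
        }
        return true;
    }

    // Sort each column in ascending order (useful for the percentiles).
    void sortColumns()
    {
        for (size_t j = 0; j < columns.size(); ++j)
            sort(columns[j].begin(), columns[j].end());
    }

    size_t numColumns() const { return columns.size(); }

private:
    vector< vector<double> > columns;          // columns[j][i] = row i of column j
};

int main()
{
    Matrix m;
    if (m.readFile("data.csv"))
    {
        m.sortColumns();
        cout << "read " << m.numColumns() << " columns" << endl;
    }
    return 0;
}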

If you have questions, please ask them and if the answers are general enough we will add them to this post.  Good luck!

–Jon and Joe