MORDM Basics I: Synthetic Streamflow Generation

In this post, we will break down the key concepts underlying synthetic streamflow generation and how it fits within the Many Objective Robust Decision Making (MORDM) framework (Kasprzyk et al., 2013). This post is the first in a series on MORDM, which begins here with generating and validating the data used in the framework. For context on what we are about to attempt, please refer to this post by Jon Herman.

What is synthetic streamflow generation?

Synthetic streamflow generation is a non-parametric, direct statistical approach used to generate synthetic streamflow timeseries from a reasonably long historical record. It is used when there is a need to diversify extreme event scenarios, such as floods and droughts, or when we want to generate flows that reflect a shift in the hydrologic regime due to climate change. It is favored because it re-samples the historical record, preserves temporal correlation up to a certain degree, and results in a more realistic synthetic dataset. However, this dependence on the historical record also means the approach requires a relatively long historical inflow record. Jon Lamontagne’s post goes into further detail regarding this approach.

Why synthetic streamflow generation?

An important step in the MORDM framework is scenario discovery, which requires multiple realistic scenarios to predict future states of the world (Kasprzyk et al., 2013). Depending solely on the historical dataset is insufficient; we need to generate multiple realizations of realistic synthetic scenarios to facilitate a comprehensive scenario discovery process. As an approach that uses a long historical record to generate synthetic data that has been found to preserve seasonal and annual correlation (Kirsch et al., 2013; Herman et al., 2016), this method provides us with a way to:

  1. Fully utilize a large historical dataset
  2. Stochastically generate multiple synthetic datasets while preserving temporal correlation
  3. Explore many alternative climate scenarios by changing the mean and the spread of the synthetic datasets

The basics of synthetic streamflow generation in action

To better illustrate the inner workings of synthetic streamflow generation, it is helpful to use a test case. In this post, the historical dataset is obtained from the Research Triangle Region in North Carolina. The Research Triangle region consists of four main utilities: Raleigh, Durham, Cary, and the Orange County Water and Sewer Authority (OWASA). These utilities receive their water supplies from four water sources: the Little River Reservoir, Lake Wheeler, Lake Benson, and Jordan Lake (Figure 1), and historical streamflow data are obtained from ten different stream gauges located at these water sources. For the purpose of this example, we will be using 81 years’ worth of weekly streamflow data, available here.

Figure 1: The Research Triangle region (Trindade et al., 2019).

The statistical approach that drives synthetic streamflow generation is called the Kirsch Method (Kirsch et al., 2013). In plain language, this method does the following:

  1. Convert the historical streamflows from real space to log space, and then standardize the log-space data.
  2. Bootstrap the log-space historical matrix to obtain an uncorrelated matrix of historical data.
  3. Compute the correlation matrix of the standardized historical dataset and perform Cholesky decomposition on it.
  4. Impose the historical correlation on the uncorrelated matrix obtained in (2), by multiplying it by the Cholesky factor, to generate a standardized synthetic dataset. This preserves seasonal correlation.
  5. De-standardize the synthetic data, and transform it back into real space.
  6. Repeat steps (1) to (5) with a historical dataset that is shifted forward by 6 months (26 weeks). This preserves year-to-year correlation.

This post by Julie Quinn delves deeper into the Kirsch Method’s theoretical steps. The function that executes these steps can be found in the stress_dynamic.m Matlab file, which in turn is executed by the wsc_main_rate.m file by setting the input variable p = 0 as shown on Line 27. Both these files are available on GitHub here.
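
To make these steps more concrete, here is a minimal Python sketch of the core Kirsch logic, assuming weekly flows arranged as a years-by-weeks matrix. This is not the stress_dynamic.m implementation; the function and variable names (kirsch_generate, hist_flows) are illustrative, and the 6-month-shifted second pass in step (6) is omitted.

import numpy as np

def kirsch_generate(hist_flows, n_years_syn, seed=None):
    # hist_flows: (n_hist_years, 52) array of weekly flows in real space
    rng = np.random.default_rng(seed)

    # Step 1: log-transform and standardize each week across the historical years
    log_q = np.log(hist_flows)
    mu = log_q.mean(axis=0)
    sigma = log_q.std(axis=0, ddof=1)
    z_hist = (log_q - mu) / sigma

    # Step 2: bootstrap a whole historical year independently for each week,
    # which destroys the week-to-week correlation
    n_hist_years, n_weeks = z_hist.shape
    picks = rng.integers(0, n_hist_years, size=(n_years_syn, n_weeks))
    z_boot = z_hist[picks, np.arange(n_weeks)]

    # Steps 3-4: Cholesky-decompose the historical weekly correlation matrix
    # and impose it on the bootstrapped (uncorrelated) matrix
    corr = np.corrcoef(z_hist, rowvar=False)
    U = np.linalg.cholesky(corr).T          # upper-triangular factor
    z_syn = z_boot @ U

    # Step 5: de-standardize and transform back to real space
    return np.exp(z_syn * sigma + mu)

In practice, the historical correlation matrix may need small adjustments to remain positive definite before the Cholesky decomposition succeeds.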

However, this is where things get interesting. So far, steps (1) to (6) have only generated a synthetic dataset based on the historical statistical characteristics, as validated in Julie’s second blog post on a similar topic. Of the three motivations for using synthetic streamflow generation, the third (exploration of multiple scenarios) has yet to be satisfied. This is a nice segue into our next topic:

Generating multiple scenarios using synthetic streamflow generation

The true power of synthetic streamflow generation lies in its ability to generate multiple climate (or in this case, streamflow) scenarios. This is done in stress_dynamic.m using three variables:

Input variable    Description
p                 The lowest p% of streamflows
n                 A vector where each element n_i is the number of copies of the p-lowest streamflow years to be added to the bootstrapped historical dataset
m                 A vector where each element m_i is the number of copies of the (1-p)-highest streamflow years to be added to the bootstrapped historical dataset

Table 1: The input variables to the stress_dynamic function.

These three variables bootstrap (increase the length of) the historical record while allowing us to perturb the historical streamflows to reflect an increase in the frequency or severity of extreme events such as floods and droughts, using the following equation:

new_hist_years = old_hist_years + [(p * old_hist_years) * n_i] + [((1 - p) * old_hist_years) * m_i]

The stress_dynamic.m file contains more explanation regarding this step.
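
For intuition only, the sketch below lengthens the record in the same spirit: the driest p-fraction of years is appended extra times, as are the remaining years, before bootstrapping. Here n_copies and m_copies are illustrative scalars, whereas stress_dynamic.m accepts the vectors n and m described in Table 1.

import numpy as np

def perturb_record(hist_flows, p, n_copies, m_copies):
    # rank historical years by annual total flow
    annual_totals = hist_flows.sum(axis=1)
    order = np.argsort(annual_totals)
    n_low = int(np.floor(p * len(annual_totals)))

    dry_years = hist_flows[order[:n_low]]    # the p-lowest streamflow years
    wet_years = hist_flows[order[n_low:]]    # the (1 - p)-highest streamflow years

    # new_hist_years = old + (p * old) * n + ((1 - p) * old) * m
    return np.vstack([hist_flows,
                      np.tile(dry_years, (n_copies, 1)),
                      np.tile(wet_years, (m_copies, 1))])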

This raises the question: how do we choose the value of p? This brings us to the topic of the standardized streamflow indicator (SSI6).

The SSI6 is the 6-month moving average of the standardized streamflows, used to determine the occurrence and severity of drought on the basis of duration and frequency (Herman et al., 2016). Put simply, this method flags a drought if the value of the SSI6 is below 0 continuously for at least 3 months, and below -1 at least once during that 6-month interval. The periods and severity (or lack thereof) of drought can then be observed, informing the choice of the lengths of the n and m vectors (which correspond to the number of perturbation periods, or climate event periods). We will not go into further detail regarding this method, but there are two important points to be made:

  1. The SSI6 enables the determination of the frequency (likelihood) and severity of drought events in synthetic streamflow generation through the values contained in p, n and m.
  2. This approach can be used to generate flood events by exchanging the values between the n and m vectors.

A good example of point (2) is this test case, in which more frequent and more severe floods were simulated by ensuring that most of the values in m were larger than those in n. Please refer to Jon Herman’s 2016 paper, ‘Synthetic drought scenario generation to support bottom-up water supply vulnerability assessments’, for further detail.
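
A minimal sketch of the SSI6-based drought screen described above, assuming a monthly series of the 6-month moving average of standardized flows has already been computed (the function names are illustrative; the thresholds simply restate the criteria summarized here):

import numpy as np

def max_consecutive(mask):
    # length of the longest run of True values in a boolean array
    best = run = 0
    for flag in mask:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best

def find_drought_periods(ssi6):
    # ssi6: 1-D array of the 6-month moving average of standardized flows, one value per month
    droughts = []
    for start in range(len(ssi6) - 5):
        window = ssi6[start:start + 6]
        # drought: SSI6 < 0 continuously for at least 3 months,
        # and SSI6 < -1 at least once within the 6-month window
        if max_consecutive(window < 0) >= 3 and (window < -1).any():
            droughts.append((start, start + 6))
    return droughts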

A brief conceptual letup

Now we have shown how synthetic streamflow generation satisfies all three factors motivating its use. We should have two output folders:

  • synthetic-data-stat: contains the synthetic streamflows based on the unperturbed historical dataset
  • synthetic-data-dyn: contains the synthetic streamflows based on the perturbed historical dataset

By comparing these two datasets, we can see how increasing the likelihood and severity of floods affects the resulting synthetic data.

Validation

To exhaustively compare the statistical characteristics of the synthetic streamflow data, we will perform two forms of validation: visual and statistical. This method of validation is based on Julie’s post here.

Visual validation

Visual validation is done by generating flow duration curves (FDCs). Figure 2 below compares the unperturbed (top) and perturbed (bottom) synthetic datasets.

Figure 2: (Top) The FDC of the unperturbed historical dataset (pink) and its resulting synthetic dataset (blue). (Bottom) The corresponding perturbed historical and synthetic datasets.

The bottom plots in Figure 2 show an increase in the volume of weekly flows, as well as a smaller return period, when the historical streamflows were perturbed to reflect an increasing frequency and magnitude of flood events. Together with the top plots in Figure 2, this visually demonstrates that the synthetic streamflow generation approach (1) faithfully reconstructs historical streamflow patterns, (2) increases the range of possible streamflow scenarios, and (3) can model multiple extreme climate event scenarios by perturbing the historical dataset. The code to generate this figure can be found in plotFDCrange.py.
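
For reference, the basic FDC computation takes only a few lines. This is a generic sketch rather than plotFDCrange.py itself, and weekly_flows is an illustrative 1-D array of flows:

import numpy as np
import matplotlib.pyplot as plt

def plot_fdc(weekly_flows, **kwargs):
    # flow duration curve: flow magnitude vs. exceedance probability
    sorted_flows = np.sort(weekly_flows)[::-1]                             # descending order
    exceedance = np.arange(1, len(sorted_flows) + 1) / (len(sorted_flows) + 1)
    plt.plot(exceedance * 100, sorted_flows, **kwargs)
    plt.xlabel("Exceedance probability (%)")
    plt.ylabel("Weekly flow")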

Statistical validation

The means and standard deviations of the perturbed and unperturbed synthetic datasets are compared to show whether the perturbation resulted in significant changes. Ideally, the perturbed synthetic data would have higher means and similar standard deviations compared to the unperturbed synthetic data.

Figure 3: (Above) The unperturbed synthetic (blue) and historical (pink) streamflow datasets for each of the 10 gauges. (Below) The perturbed counterpart.

The means and tails of the synthetic streamflows in the bottom plots of Figure 3 show that the mean and maximum values of the perturbed synthetic flows are significantly higher than their unperturbed counterparts. In addition, the spread of the standard deviations of the perturbed synthetic streamflows is similar to that of the unperturbed ones. This shows that synthetic streamflow generation can be used to change the occurrence and magnitude of extreme events while maintaining the periodicity and spread of the data. The code to generate Figure 3 can be found in weekly-moments.py.
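
The underlying computation is simple; here is a sketch of the idea behind weekly-moments.py (not the file itself), assuming the synthetic flows are stored as a realizations-by-years-by-weeks array:

import numpy as np

def weekly_moments(syn_flows):
    # syn_flows: array of shape (n_realizations, n_years, 52)
    # pool realizations and years, then compute per-week mean and standard deviation
    flat = syn_flows.reshape(-1, syn_flows.shape[-1])
    return flat.mean(axis=0), flat.std(axis=0, ddof=1)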

Synthetic streamflow generation and internal variability

The generation of multiple unperturbed realizations of synthetic streamflow is vital for characterizing the internal variability of a system, otherwise known as variability that arises from natural variations in the system (Lehner et al., 2020). As internal variability is intrinsic to the system, its effects cannot be eliminated, but they can be moderated. By evaluating multiple realizations, we can determine the number of realizations at which the internal variability (quantified here by the standard deviation as a function of the number of realizations) stabilizes. Using the synthetic streamflow data for Jordan Lake, it is shown that more than 100 realizations are required for the standard deviation of the 25% highest streamflows across all years to stabilize (Figure 4). Knowing this, we can generate enough synthetic realizations to render the effects of internal variability insignificant.

Figure 4: The highest 25% of synthetic streamflows for the Jordan Lake gauge.

The file internal-variability.py contains the code to generate the above figure.
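
Conceptually, the figure tracks how a statistic of interest settles down as realizations are added. A hedged sketch of that calculation (not internal-variability.py itself; realizations is an illustrative 2-D array of synthetic flows) might look like:

import numpy as np

def stabilization_curve(realizations, quantile=0.75):
    # realizations: array of shape (n_realizations, n_weeks)
    # returns the std. dev. of flows above the given quantile as a function of
    # how many realizations are included
    curve = []
    for k in range(1, len(realizations) + 1):
        pooled = realizations[:k].ravel()
        high_flows = pooled[pooled >= np.quantile(pooled, quantile)]
        curve.append(high_flows.std(ddof=1))
    return np.array(curve)   # plot this vs. k to see where it flattens out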

How does this all fit within the context of MORDM?

So far, we have generated synthetic streamflow datasets and validated them. But how are these datasets used in the context of MORDM?

Synthetic streamflow generation lies within the second part of the MORDM framework, as shown in Figure 5 below. Specifically, synthetic streamflow generation plays an important role in the design of experiments by preserving the effects of deeply uncertain factors that drive natural events. As MORDM requires multiple scenarios to reliably evaluate possible futures, this approach enables the simulation of multiple scenarios while concurrently increasing the severity or frequency of extreme events in increments set by the user. This allows us to evaluate how coupled human-natural systems change over time under different scenarios, and the consequences for the robustness of the system being evaluated (in this case, the Research Triangle).

Figure 5: The taxonomy of robustness frameworks. The bold-outlined segments highlight where MORDM fits within this taxonomy (Herman et al., 2015).

Typically, this evaluation is performed in two main steps:

  1. Generation and evaluation of multiple realizations of unperturbed annual synthetic streamflow. The resulting synthetic data is used to generate the Pareto optimal set of policies. This step can help us understand how the system’s internal variability affects future decision-making by comparing it with the results in step (2).
  2. Generation and evaluation of multiple realizations of perturbed annual synthetic streamflow. These are the more extreme scenarios against which the previously identified Pareto-optimal policies will be evaluated. This step assesses the robustness of the base state under deeply uncertain deviations caused by the perturbations in the synthetic data and other deeply uncertain factors.

Conclusion

Overall, synthetic streamflow generation is an approach that is highly applicable in the bottom-up analysis of a system. It preserves the historical characteristics of a streamflow timeseries while providing the flexibility to modify the severity and frequency of extreme events in the face of climate change. It also allows the generation of multiple realizations, aiding in the characterization of a system’s internal variability and enabling a more exhaustive scenario discovery process.

This summarizes the basics of data generation for MORDM. In my next blog post, I will introduce risk-of-failure (ROF) triggers, their background, key concepts, and how they are applied within the MORDM framework.

References

Herman, J. D., Reed, P. M., Zeff, H. B., & Characklis, G. W. (2015). How should robustness be defined for water systems planning under change? Journal of Water Resources Planning and Management, 141(10), 04015012. doi:10.1061/(asce)wr.1943-5452.0000509

Herman, J. D., Zeff, H. B., Lamontagne, J. R., Reed, P. M., & Characklis, G. W. (2016). Synthetic drought scenario generation to support bottom-up water supply vulnerability assessments. Journal of Water Resources Planning and Management, 142(11), 04016050. doi:10.1061/(asce)wr.1943-5452.0000701

Kasprzyk, J. R., Nataraj, S., Reed, P. M., & Lempert, R. J. (2013). Many objective robust decision making for complex environmental systems undergoing change. Environmental Modelling & Software, 42, 55-71. doi:10.1016/j.envsoft.2012.12.007

Kirsch, B. R., Characklis, G. W., & Zeff, H. B. (2013). Evaluating the impact of alternative hydro-climate scenarios on transfer agreements: Practical improvement for generating synthetic streamflows. Journal of Water Resources Planning and Management, 139(4), 396-406. doi:10.1061/(asce)wr.1943-5452.0000287

Mankin, J. S., Lehner, F., Coats, S., & McKinnon, K. A. (2020). The value of initial condition large ensembles to Robust Adaptation Decision‐Making. Earth’s Future, 8(10). doi:10.1029/2020ef001610

Trindade, B., Reed, P., Herman, J., Zeff, H., & Characklis, G. (2017). Reducing regional drought vulnerabilities and multi-city robustness conflicts using many-objective optimization under deep uncertainty. Advances in Water Resources, 104, 195-209. doi:10.1016/j.advwatres.2017.03.023

How do we deal with extreme events and imbalanced datasets in machine learning?

One of the most common goals of machine learning is binary classification. Many of the problems with binary classification arise when we need to identify and classify rare events in our datasets. Binary classification of rare events comes up frequently in the detection of rare diseases, fraudulent financial activities, and defective manufactured products. In water systems research, we can take the detection of flood risk as an example. We might want to identify flood days from a combination of precipitation, snow coverage, temperature, time of year, soil moisture, and many other factors.

Training a machine learning model such as this can be very challenging because, in our historical record, we might only have 1% extreme flood days (or even much less), and the rest of the days are non-flood days (normal days). In this blog post, I go over some of the considerations that need to be taken into account when dealing with rare events. I will also discuss some of the techniques that can help us address the issues of imbalanced datasets.

Confusion Matrix

In machine learning, we often use confusion matrices (also known as error matrices) to investigate the performance of classification models. To make this more concrete, let’s imagine that our goal is to predict whether each day is a flood day or a non-flood day, and we train a machine learning model to identify flood days. In this case, the predicted label of each day falls into one of the following categories:

True positive (TP): We correctly classify a flood day as a flood day.

True negative (TN): We correctly classify a non-flood day as a normal day.

False positive (FP): We misclassify a normal day as a flood day.

False negative (FN): We misclassify a flood day as a normal day.

We then count the number of TPs, TNs, FPs, and FNs. After that, we can draw the following table, called a confusion matrix, in order to visualize the outcomes of our binary classification model and use this to better understand the performance of our model.
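
For example, scikit-learn’s confusion_matrix function returns these four counts directly; the labels below are illustrative flood/non-flood labels, not real data:

from sklearn.metrics import confusion_matrix

# Confusion matrix layout (rows = actual, columns = predicted):
#                    predicted non-flood   predicted flood
# actual non-flood           TN                  FP
# actual flood               FN                  TP

# 1 = flood day, 0 = non-flood day (illustrative labels)
y_true = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")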

Accuracy of Prediction

One of the most intuitive ways to evaluate performance is to count how many times our model’s classification is right and divide by the total number of events. This is called accuracy and can be calculated from the following equation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy Paradox

By definition, extreme and rare events are a small portion of our dataset, and using a single accuracy measure as the performance metric can cause significant bias in our model evaluation. I’ll provide an example to make this clearer. Let’s say that we need to identify a specific fraudulent activity in an online purchase dataset that only happens 0.1% of the time on average; as such, our goal is to create a high-accuracy model to identify these events. Let’s assume that we have a model that classifies all activities as normal activities. In this case, the accuracy of our prediction is 99.9%, which looks quite decent. However, this model misclassifies all of the rare activities that we are looking for. Therefore, we have a very bad model with a very high accuracy. This is called the accuracy paradox. To respond to this paradox, different methods and error metrics have been introduced.

Other Performance Indicators

  1. True Positive Rate (TPR), also known as sensitivity or recall: TPR = TP / (TP + FN)

  2. True Negative Rate (TNR), also known as specificity: TNR = TN / (TN + FP)

  3. False Positive Rate (FPR): FPR = FP / (FP + TN)

  4. Positive Predictive Value (PPV), also known as precision: PPV = TP / (TP + FP)

There are many other performance measures that focus on different aspects of model performance; more information on these binary classification metrics can be found here.

The selection of these metrics depends on the problem at hand and the concerns of stakeholders. In some cases, we have to give priority to reducing false negatives. In the flood example, reducing false negatives might improve the reliability of the system for flood protection. However, in other situations, reducing false positives can be more important, because incorrect flood alarms can trigger flood safety measures such as reducing the water volume in dams, which can put more pressure on irrigation supply systems.

Therefore, one of the main issues that we need to take into account in classifying rare events is carefully selecting the performance metrics that we use to train our model.

There are other ways to investigate the performance of binary classification models that I am not covering in this blog post, such as developing a Receiver Operating Characteristic (ROC) curve or calculating the area under the ROC curve (AUC). More information about these methods can be found here and here.

Manipulating the Training Dataset

As expected, rare events do not appear in our dataset frequently; however, they can cause important ramifications. When we train our model for an imbalanced dataset that only has 5% TRUE labels, for example, the model tends to learn more about FALSE labels (the majority) and do poorly in identifying TRUE labels. To avoid this, we can modify or resample our training dataset to force the model to focus on the TRUE labels (the minority class). There are various approaches to do that, and I am going to introduce four of these methods here: 1) undersampling; 2) oversampling; 3) SMOTE (Synthetic Minority Over-sampling TEchnique); and 4) increasing the cost of misclassifying the minority class.

Undersampling

Undersampling reduces the size of the majority class in our training dataset in order to balance the minority and majority classes. There are different resampling methods; for example, we can randomly select samples from the majority class and remove them. However, this method can lead to a loss of useful information and reduce our model’s performance. There are other, more intelligent methods that tend to preserve the useful information and focus on removing redundant samples. The Condensed Nearest Neighbor Rule (CNN), Tomek Links, and One-Sided Selection are some examples. You can refer to this blog post for more information about resampling techniques.
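
As a quick illustration, random undersampling takes only a few lines with the imbalanced-learn package; the toy dataset below is generated on the fly and is purely illustrative:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# toy imbalanced dataset: roughly 5% positive (rare) class
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

rus = RandomUnderSampler(random_state=0)
X_res, y_res = rus.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))   # the majority class is trimmed to match the minority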

Oversampling

Another way of modifying the training dataset is to increase the size of the minority class. The most basic way of doing that is to duplicate the minority class members until the data is not imbalanced anymore.

SMOTE

SMOTE (Chawla et al., 2002) is an oversampling technique that synthetically generates new samples from the original minority class. Here is how the SMOTE algorithm works: 1) it selects a random sample from the minority class; 2) it finds the k nearest minority-class neighbors of the selected point and randomly chooses one of them; 3) it computes the difference vector between the two points; 4) it generates a random number between 0 and 1; 5) it multiplies the difference vector by that random number and adds the result to the original point, generating a new synthetic point that lies a reasonable distance from the original minority sample; 6) the process repeats until the minority class reaches the desired size. There are some other variants of the basic SMOTE technique, which this blog post discusses. We can also use a combination of oversampling and undersampling to achieve a better result.
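
In Python, the imbalanced-learn package provides an implementation of SMOTE; the sketch below uses a toy dataset and illustrative parameter values:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# toy imbalanced dataset: roughly 5% positive (rare) class
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# k_neighbors controls how many nearest minority neighbors are considered
smote = SMOTE(k_neighbors=5, random_state=0)
X_res, y_res = smote.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))   # the minority class is grown to match the majority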

Penalized Models

Another method to reduce the impact of imbalanced datasets is to increase the cost of misclassifying the minority class. To do that, we can assign each sample a misclassification cost (a weight) that is low for the majority class (e.g., 1) and much higher for the minority class (e.g., 100). This forces the model to pay more attention to the minority class in order to reduce the errors during training.
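
In practice, many libraries expose this idea through class or sample weights. A minimal sketch using scikit-learn’s class_weight parameter is shown below; the 1:100 weighting and the toy dataset are illustrative:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# toy imbalanced dataset: roughly 5% positive (rare) class
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# misclassifying the rare class (label 1) costs 100x more than the majority class
clf = LogisticRegression(class_weight={0: 1, 1: 100}, max_iter=1000)
clf.fit(X, y)

Equivalently, per-sample weights can be passed to fit via the sample_weight argument.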

Software Packages for Manipulation of Imbalanced Datasets

Here, I introduce some of the Python and R libraries that have been developed to address the problems that arise when dealing with imbalanced datasets.

Python:

  1. scikit-learn offers the confusion_matrix, roc_curve, and auc functions, which can be used to generate the confusion matrix and ROC/AUC metrics.
  2. imbalanced-learn offers several over- and undersampling techniques as well as combined methods.  

R:

  1. pROC can be used for calculating ROC curve and AUC.
  2. ROSE performs simple over-, under-, and combined sampling.
  3. DMwR performs SMOTE.

Introduction to Granger Causality

This post focuses on the question of determining whether one time series can be helpful in forecasting another time series. This question could be answered by investigating whether the two time series have some form of a causal relationship. Many time series models are based on univariate or multivariate autoregressive (AR) models. Often, a time series of a second variable is added to a univariate forecasting model if the two variables are correlated. However, correlation does not necessarily imply causality. Causality is a special kind of relationship between variables where one variable causes, influences, or has an effect on the other variable. For example, although the sales of sunglasses and swimsuits are highly correlated, it cannot be concluded that an increase in the sales of sunglasses causes an increase in the sales of swimsuits. Nor can it be assumed that the reverse is true. The relationship between the two sales is likely due to a third time series variable, temperature, which has an effect on both variables. The relationships between a pair of highly correlated variables may also be purely coincidental, with no causal association. Furthermore, the correlation between two time series that are causally related may be small. For example, appliance usage per household and outside temperature have a causal relationship because a spike in the outside temperature leads to, after a time lag, a spike in appliance usage. In this case, the correlation between the two variables may actually be quite small. The drawback of using correlation in time series analysis is that the measure does not factor in the inherent time-ordering and lags of the series: the same correlation coefficient would be obtained even if both time series were reordered in the same random way.

Classic Dilbert Reference

Clive Granger (Nobel Laureate in Economic Sciences) proposed that causality in economics could be tested by measuring the ability to predict the future values of a time series using prior values of another time series [1]. The causality relationship is based on two principles: the cause happens prior to its effect, and the cause has unique information about the future values of its effect. He developed the Granger Causality Test (GCT), a statistical hypothesis test for determining whether one time series is useful in forecasting another [1]. The test finds only “predictive causality” and not true causality. Therefore, if the prediction of future values of a time series Y(t) improves by including past values of a time series X(t), then X(t) is said to “Granger cause” Y(t). A compact notation for “X(t) Granger causes Y(t)” is X(t) G.C. Y(t). Granger causality is (a) unidirectional if X(t) G.C. Y(t) but Y(t) does not G.C. X(t), and (b) bidirectional if X(t) G.C. Y(t) and Y(t) G.C. X(t). Although the GCT was originally developed for time series econometrics, it is now also used in various other fields, such as neuroscience, to characterize the directionality and influence between time series of neural activity from different sources in the brain, and in water resources to analyze drivers of groundwater patterns [2-3].

The Granger Causality Test

Let Y(t) be the time series whose future values are to be predicted, and let X(t) be the other time series whose lagged values will be used to augment the prediction of Y(t). The GCT is based on univariate and multivariate AR models, which assume the time series are stationary. Stationarity can be confirmed by applying the augmented Dickey-Fuller test. If the series are not stationary, they should be transformed into stationary series through operations such as detrending, differencing, or log transforming prior to applying the test. The GCT uses a series of t-tests and F-tests conducted on the lagged values of Y(t) and X(t). The test is based on the following null hypothesis:

H0: X(t) does not G.C. Y(t)

Step 1: Fit the best (in terms of number of lags) univariate autoregressive model to the time series Y(t). The number of lags (model order) can be determined from the PACF or by choosing the model order that minimizes the AIC or BIC criterion. Let the resulting univariate AR model be represented by:

Y(t) = a_0 + a_1 Y(t-1) + a_2 Y(t-2) + … + a_n Y(t-n) + e_y(t)

where n is the model order, a_i are the predictor coefficients, and e_y(t) is a white noise term with zero mean and variance σ_e^2.

Step 2: Apply the t-test to each predictor coefficient to determine whether it is significant. That is, for each coefficient the hypothesis test is:

H0: a_i = 0

H1: a_i ≠ 0

The alternate hypothesis can also be one-sided, H1: a_i > 0 or H1: a_i < 0, depending on the expected sign of the coefficient. Any coefficient that is not significant (p ≥ 0.05) is excluded from the model. For example, if the coefficients a_2 = a_3 = 0, the resulting model will be:

Y(t) = a_0 + a_1 Y(t-1) + a_4 Y(t-4) + … + a_n Y(t-n) + e_y(t)

For convenience, it will be assumed that all the coefficients of the model in the equation in Step 1 are significant.

Step 3: Augment the model in Step 1 with the lagged values of the second time series to generate the bivariate AR model given by:

Y(t) = a_0 + a_1 Y(t-1) + … + a_n Y(t-n) + b_1 X(t-1) + … + b_n X(t-n) + E_y(t)

where E_y(t) is the new white noise error term with zero mean and variance σ_E^2.

Step 4: As in Step 2, apply the t-test to each b_i to determine statistical significance and remove the lagged terms that are not significant. Next, test the significance of the retained coefficients jointly using the F-test. Again, for convenience, it will be assumed that all the coefficients are individually significant. The null hypothesis for the F-test is, therefore,

H0: b_1 = b_2 = … = b_n = 0

and the alternate hypothesis is that at least one coefficient is non-zero.

Step 5: If the null hypothesis is rejected (p < 0.05), it is concluded that the lags of X(t) help predict future values of Y(t) and, therefore, X(t) G.C. Y(t).

Quantifying Granger Causality

The variances of the prediction errors from the univariate and bivariate models in Steps 1 and 3 can be used to quantify Granger prediction (GP) using the following equation:

GP = ln(σ_e^2 / σ_E^2)

GP will be zero if X(t) does not G.C. Y(t), as the univariate and bivariate AR models will have the same errors because the b_i coefficients in Step 4 will all be zero. GP will be greater than zero if X(t) G.C. Y(t), because at least one b_i coefficient is non-zero and the bivariate AR model provides a better fit to the data than the univariate AR model. Consequently, the variance of the error terms in the bivariate AR model will be less than the variance of the error terms in the univariate AR model. Higher values of GP indicate stronger Granger causality.
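
As a rough Python sketch of Steps 1 through 5 and the GP calculation (skipping the per-coefficient t-tests and using a fixed lag order for brevity; the function name and the inputs y, x, and n_lags are illustrative):

import numpy as np
import statsmodels.api as sm

def granger_prediction(y, x, n_lags):
    # build lagged design matrices; rows correspond to time steps after the first n_lags
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    y_target = y[n_lags:]
    y_lags = np.column_stack([y[n_lags - k:-k] for k in range(1, n_lags + 1)])
    x_lags = np.column_stack([x[n_lags - k:-k] for k in range(1, n_lags + 1)])

    # restricted (univariate) and unrestricted (bivariate) AR models
    restricted = sm.OLS(y_target, sm.add_constant(y_lags)).fit()
    unrestricted = sm.OLS(y_target, sm.add_constant(np.hstack([y_lags, x_lags]))).fit()

    # F-test of H0: all lagged-x coefficients are jointly zero
    f_stat, p_value, _ = unrestricted.compare_f_test(restricted)

    # GP = ln(sigma_e^2 / sigma_E^2): log ratio of residual variances
    gp = np.log(restricted.resid.var(ddof=1) / unrestricted.resid.var(ddof=1))
    return f_stat, p_value, gp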

Implementation in R

The Granger causality test can be conducted very simply in R using the package lmtest. To demonstrate its functionality, we use a real-world dataset to identify drivers of flow in Owasco Lake. The partial table below shows the predictors (precipitation, temperature, day, both real-time snow measurements and accumulated snow depth, relative humidity, wind speed, and solar radiation) and the corresponding outflow from the lake.

Table 1: Predictors of Owasco Lake Outflow

In this example, it may be quite intuitive as to which variables would have the greatest predictive capabilities of flow, but you may encounter cases where the choice is less obvious. A common first step used in machine learning is to calculate the correlation between all predictors and the outflow to try to determine the most important predictors. The table below shows these correlation values.

Var    P      T       SN      SNWD     RH      WS      SR
Cor    0.37   -0.15   0.038   -0.070   0.14    0.074   -0.234

In the absence of any other information, these correlation values are quite low and may lead to the misleading conclusion that none of the variables can help predict outflow. However, we can use the Granger causality test to check this. As an example, let’s look at precipitation as a predictor of outflow. We use the function grangertest, and the only other parameter to specify is how many lags of the precipitation variable to use to predict outflow. Here, an arbitrary value of 10 is chosen.

library(lmtest)
data = read.csv("D:/Documents/OwascoPrediction.csv")
# does precipitation (up to 10 lags) Granger-cause the observed outflow?
grangertest(observed_cms ~ Precip, order = 10, data = data)

The results are reported as follows:

The function returns the unrestricted model (Model 1), which includes the lagged precipitation values, and the restricted model (Model 2) without them. The F-test comparing the two models has a p-value of 0.0059, so the lagged precipitation terms are jointly significant at the 0.01 level. This indicates that up to 10 lags of precipitation help to predict outflow; therefore, precipitation G.C. outflow.

As an additional example, while 3 lags of snow do not G.C. outflow, 15 lags of snow are significant at the 0.01 level, indicating that while real-time snow may not immediately cause changes in outflow, snow accumulated over 15 days does have an impact on outflow. This is a nice way to partition out which variables may have an immediate or lagged effect on an output.

Applications to Machine Learning

As seen above, the Granger causality test can not only help identify important predictors but also provides extra information about the lags of the variables that offer more predictive capability. The GCT is based on univariate and multivariate AR modeling; however, the results can be used to select additional time series that are Granger causal, and the lags of those time series, to augment the input for better predictability in other models such as ANNs and LSTMs.

References

1. Granger, C. W. J. 2001 Essays in Econometrics: The Collected Papers of Clive W.J. Granger. Cambridge: Cambridge University Press.

2. Ding, M., Chen, Y., & Bressler, S. L. 2006 Granger causality: Basic theory and application to neuroscience. In Schelter, S., Winterhalder, N., & Timmer, J. (eds.), Handbook of Time Series Analysis. Wiley, Weinheim.

3. Singh, N.K., Borrok, D.M. A Granger causality analysis of groundwater patterns over a half-century. Sci Rep 9, 12828 (2019). https://doi.org/10.1038/s41598-019-49278-8