Welcome to Water Programming! This blog is by Pat Reed’s group at Cornell, who use computer programs to solve problems — Multiobjective Evolutionary Algorithms (MOEAs), simulation models, visualization, and other techniques. Use the search feature and categories on the right panel to find topics of interest. Feel free to comment, and contact us if you want to contribute posts.


In this post, we will break down the key concepts underlying synthetic streamflow generation and how it fits within the Many Objective Robust Decision Making (MORDM) framework (Kasprzyk et al., 2013). This post is the first in a series on MORDM, beginning here with generating and validating the data used in the framework. For context on what we are about to attempt, please refer to this post by Jon Herman.

What is synthetic streamflow generation?

Synthetic streamflow generation is a non-parametric, direct statistical approach used to generate synthetic streamflow timeseries from a reasonably long historical record. It is used when there is a need to diversify extreme event scenarios, such as floods and droughts, or when we want to generate flows that reflect a shift in the hydrologic regime due to climate change. It is favored because it relies on re-sampling the historical record, preserves temporal correlation up to a certain degree, and results in a more realistic synthetic dataset. However, its dependence on a historical record also implies that this approach requires a relatively long historical inflow record. Jon Lamontagne’s post goes into further detail regarding this approach.

Why synthetic streamflow generation?

An important step in the MORDM framework is scenario discovery, which requires multiple realistic scenarios to predict future states of the world (Kasprzyk et al., 2013). Depending solely on the historical dataset is insufficient; we need to generate multiple realizations of realistic synthetic scenarios to facilitate a comprehensive scenario discovery process. As an approach that uses a long historical record to generate synthetic data that has been found to preserve seasonal and annual correlation (Kirsch et al., 2013; Herman et al., 2016), this method provides us with a way to:

Fully utilize a large historical dataset

Stochastically generate multiple synthetic datasets while preserving temporal correlation

Explore many alternative climate scenarios by changing the mean and the spread of the synthetic datasets

The basics of synthetic streamflow generation in action

To better illustrate the inner workings of synthetic streamflow generation, it is helpful to use a test case. In this post, the historical dataset is obtained from the Research Triangle Region in North Carolina. The Research Triangle region consists of four main utilities: Raleigh, Durham, Cary, and the Orange County Water and Sewer Authority (OWASA). These utilities receive their water supplies from four water sources: the Little River Reservoir, Lake Wheeler, Lake Benson, and Jordan Lake (Figure 1), and historical streamflow data is obtained from ten different stream gauges located at these water sources. For the purpose of this example, we will be using 81 years’ worth of weekly streamflow data available here.

The statistical approach that drives synthetic streamflow generation is called the Kirsch Method (Kirsch et al., 2013). In plain language, this method does the following:

1. Convert the historical streamflows from real space to log space, then standardize the log-space data.

2. Bootstrap the log-space historical matrix to obtain an uncorrelated matrix of historical data.

3. Compute the correlation matrix of the standardized historical dataset and perform its Cholesky decomposition.

4. Impose the historical correlation on the uncorrelated matrix obtained in step (2), using the Cholesky factor from step (3), to generate a standardized synthetic dataset. This preserves seasonal correlation.

5. De-standardize the synthetic data and transform it back into real space.

6. Repeat steps (1) to (5) with a historical dataset that is shifted forward by 6 months (26 weeks). This preserves year-to-year correlation.
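Steps (1) to (5) can be sketched in Python as follows. This is a minimal illustration, not the stress_dynamic.m implementation: it assumes weekly flows arranged as a years-by-weeks matrix with more historical years than weeks (so the Cholesky step succeeds), and it omits the 6-month-shifted repetition of step (6).

```python
import numpy as np

def kirsch_synthetic(Q_hist, n_years, rng=None):
    """One synthetic realization via steps (1)-(5) of the Kirsch method.

    Q_hist : (historical years x weeks) array of positive flows; needs
    more historical years than weeks so the correlation matrix is
    positive definite.
    """
    rng = np.random.default_rng(rng)
    n_hist, n_weeks = Q_hist.shape

    # (1) real space -> log space, then standardize each week
    Z = np.log(Q_hist)
    mu, sigma = Z.mean(axis=0), Z.std(axis=0)
    Z_std = (Z - mu) / sigma

    # (2) bootstrap years independently per week -> uncorrelated matrix
    C = np.empty((n_years, n_weeks))
    for w in range(n_weeks):
        C[:, w] = Z_std[rng.integers(0, n_hist, size=n_years), w]

    # (3) historical week-to-week correlation and its Cholesky factor
    P = np.corrcoef(Z_std, rowvar=False)
    U = np.linalg.cholesky(P).T  # upper-triangular factor, U'U = P

    # (4) impose historical correlation -> standardized synthetic data
    S = C @ U

    # (5) de-standardize and transform back into real space
    return np.exp(S * sigma + mu)
```

Applying the same function to the 26-week-shifted record and splicing the two results would complete step (6).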

This post by Julie Quinn delves deeper into the Kirsch Method’s theoretical steps. The function that executes these steps can be found in the stress_dynamic.m Matlab file, which in turn is executed by the wsc_main_rate.m file by setting the input variable p = 0 as shown on Line 27. Both these files are available on GitHub here.

However, this is where things get interesting. So far, steps (1) to (6) have simply generated a synthetic dataset based only on historical statistical characteristics, as validated here in Julie’s second blog post on a similar topic. Of the three motivations for using synthetic streamflow generation, the third (exploration of multiple scenarios) has yet to be satisfied. This is a nice segue into our next topic:

Generating multiple scenarios using synthetic streamflow generation

The true power of synthetic streamflow generation lies in its ability to generate multiple climate (or in this case, streamflow) scenarios. This is done in stress_dynamic.m using three variables:

Input variable: Description

p: The lowest x% of streamflows

n: A vector where each element n_{i} is the number of copies of the p-lowest streamflow years to be added to the bootstrapped historical dataset.

m: A vector where each element m_{i} is the number of copies of the (1-p)-highest streamflow years to be added to the bootstrapped historical dataset.

Table 1: The input variables to the stress_dynamic function.

These three variables bootstrap (increase the length of) the historical record while allowing us to perturb the historical streamflows to reflect an increase in the frequency or severity of extreme events such as floods and droughts, using the following equation:

The stress_dynamic.m file contains more explanation regarding this step.
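The idea behind n and m can be illustrated with a toy sketch. This is hypothetical code, not the stress_dynamic.m implementation: the real function takes n and m as vectors over perturbation periods, while this sketch uses single scalar counts.

```python
import numpy as np

def perturbed_pool(annual_flows, p=0.2, n_copies=2, m_copies=0):
    """Indices of years to bootstrap from, with extra copies of extreme years."""
    order = np.argsort(annual_flows)          # driest -> wettest years
    k = int(np.ceil(p * len(annual_flows)))   # size of the lowest p-fraction
    pool = list(range(len(annual_flows)))     # every historical year once
    pool += list(order[:k]) * n_copies        # n: duplicate the driest years
    pool += list(order[k:]) * m_copies        # m: duplicate the wettest years
    return np.array(pool)
```

Bootstrapping years by sampling uniformly from this enlarged pool makes the duplicated years proportionally more likely to be drawn, which increases the frequency of droughts (via n) or floods (via m) in the synthetic record.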

This begs the question: how do we choose the value of p? This brings us to the topic of the standardized streamflow indicator (SSI_{6}).

The SSI_{6} is the 6-month moving average of the standardized streamflows, used to determine the occurrence and severity of drought on the basis of duration and frequency (Herman et al., 2016). Put simply, this method flags the occurrence of drought if the value of the SSI_{6} is below 0 continuously for at least 3 months, and SSI_{6} < -1 at least once during the 6-month interval. The periods and severity (or lack thereof) of drought can then be observed, enabling the decision on the length of both the n and m vectors (which correspond to the number of perturbation periods, or climate event periods). We will not go into further detail regarding this method, but there are two important points to be made:

The SSI_{6} enables the determination of the frequency (likelihood) and severity of drought events in synthetic streamflow generation through the values contained in p, n and m.

This approach can be used to generate flood events by exchanging the values between the n and m vectors.

A good example of point (2) is this test case, in which more frequent and more severe floods were simulated by ensuring that most of the values in m were larger than those in n. Please refer to Jon Herman’s 2016 paper titled ‘Synthetic drought scenario generation to support bottom-up water supply vulnerability assessments’ for further detail.
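The drought-occurrence rule described above can be sketched as follows. This is a simplified illustration on a made-up monthly SSI_{6} series; computing SSI_{6} itself from flows is omitted.

```python
# A drought is flagged when SSI_6 < 0 for at least 3 consecutive months
# and SSI_6 < -1 at least once during that stretch.
def drought_periods(ssi6):
    periods, start = [], None
    for i, v in enumerate(list(ssi6) + [0.0]):  # sentinel closes an open run
        if v < 0 and start is None:
            start = i                            # a negative run begins
        elif v >= 0 and start is not None:
            run = ssi6[start:i]                  # the run that just ended
            if len(run) >= 3 and min(run) < -1:
                periods.append((start, i - 1))   # (first, last) month indices
            start = None
    return periods

ssi6 = [0.2, -0.3, -0.5, -1.2, -0.4, 0.1, -0.2, 0.3]
print(drought_periods(ssi6))  # [(1, 4)]
```

The number of periods returned (and their severity) is what informs the lengths chosen for the n and m vectors.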

A brief conceptual letup

Now we have shown how synthetic streamflow generation satisfies all three factors motivating its use. We should have two output folders:

synthetic-data-stat: contains the synthetic streamflows based on the unperturbed historical dataset

synthetic-data-dyn: contains the synthetic streamflows based on the perturbed historical dataset

By comparing these two datasets, we can see how increasing the likelihood and severity of floods has affected the resulting synthetic data.

Validation

To exhaustively compare the statistical characteristics of the synthetic streamflow data, we will perform two forms of validation: visual and statistical. This method of validation is based on Julie’s post here.

Visual validation

Visual validation is done by generating flow duration curves (FDCs). Figure 2 below compares the unperturbed (left) and perturbed (right) synthetic datasets.

The bottom plots in Figure 2 show an increase in the volume of weekly flows, as well as a smaller return period, when the historical streamflows were perturbed to reflect an increasing frequency and magnitude of flood events. Together with the upper plots in Figure 2, this visually demonstrates that the synthetic streamflow generation approach (1) faithfully reconstructs historical streamflow patterns, (2) increases the range of possible streamflow scenarios, and (3) can model multiple extreme climate event scenarios by perturbing the historical dataset. The code to generate this figure can be found in the plotFDCrange.py file.

Statistical validation

The mean and standard deviation of the perturbed and unperturbed historical datasets are compared to show if the perturbation resulted in significant changes in the synthetic datasets. Ideally, the perturbed synthetic data would have higher means and similar standard deviations compared to the unperturbed synthetic data.

The mean and tails of the synthetic streamflow values in the bottom plots of Figure 3 show that the mean and maximum values of the perturbed synthetic flows are significantly higher than the unperturbed values. In addition, the spread of the standard deviations of the perturbed synthetic streamflows is similar to that of their unperturbed counterparts. This demonstrates that synthetic streamflow generation can be used to synthetically change the occurrence and magnitude of extreme events while maintaining the periodicity and spread of the data. The code to generate Figure 3 can be found in weekly-moments.py.

Synthetic streamflow generation and internal variability

The generation of multiple unperturbed realizations of synthetic streamflow is vital for characterizing the internal variability of a system, otherwise known as variability that arises from natural variations in the system (Lehner et al., 2020). As internal variability is intrinsic to the system, its effects cannot be eliminated, but they can be moderated. By evaluating multiple realizations, we can determine the number of realizations at which the internal variability (quantified here by the standard deviation as a function of the number of realizations) stabilizes. Using the synthetic streamflow data for Jordan Lake, it is shown that more than 100 realizations are required for the standard deviation of the 25% highest streamflows across all years to stabilize (Figure 4). Knowing this, we can generate sufficient synthetic realizations to render the effects of internal variability insignificant.
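This stabilization check can be sketched as follows, using synthetic stand-in data rather than the Jordan Lake record; the statistic here is the mean of the top 25% of weekly flows in each realization.

```python
import numpy as np

# Stand-in data (NOT the Jordan Lake record): 300 realizations of one
# year of weekly flows, drawn from a lognormal distribution.
rng = np.random.default_rng(7)
flows = rng.lognormal(size=(300, 52))

# Per-realization statistic: mean of the highest 25% of weekly flows.
stat = np.array([r[r >= np.quantile(r, 0.75)].mean() for r in flows])

# Standard deviation of the statistic as more realizations are included.
ns = range(10, 301, 10)
running_std = np.array([stat[:n].std() for n in ns])
```

Plotting running_std against ns shows where the curve flattens; that point indicates roughly how many realizations are needed before the estimate of internal variability stabilizes.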

How does this all fit within the context of MORDM?

So far, we have generated synthetic streamflow datasets and validated them. But how are these datasets used in the context of MORDM?

Synthetic streamflow generation lies within the domain of the second part of the MORDM framework as shown in Figure 5 above. Specifically, synthetic streamflow generation plays an important role in the design of experiments by preserving the effects of deeply uncertain factors that cause natural events. As MORDM requires multiple scenarios to reliably evaluate all possible futures, this approach enables the simulation of multiple scenarios, while concurrently increasing the severity or frequency of extreme events in increments set by the user. This will allow us to evaluate how coupled human-natural systems change over time given different scenarios, and their consequences for the robustness of the system being evaluated (in this case, the Research Triangle).

Typically, this evaluation is performed in two main steps:

Generation and evaluation of multiple realizations of unperturbed annual synthetic streamflow. The resulting synthetic data is used to generate the Pareto optimal set of policies. This step can help us understand how the system’s internal variability affects future decision-making by comparing it with the results in step (2).

Generation and evaluation of multiple realizations of perturbed annual synthetic streamflow. These are the more extreme scenarios in which the previously-found Pareto-optimal policies will be evaluated against. This step assesses the robustness of the base state under deeply uncertain deviations caused by the perturbations in the synthetic data and other deeply uncertain factors.

Conclusion

Overall, synthetic streamflow generation is an approach that is highly applicable in the bottom-up analysis of a system. It preserves historical characteristics of a streamflow timeseries while providing the flexibility to modify the severity and frequency of extreme events in the face of climate change. It also allows the generation of multiple realizations, aiding in the characterization and understanding of a system’s internal variability, and a more exhaustive scenario discovery process.

This summarizes the basics of data generation for MORDM. In my next blog post, I will introduce risk-of-failure (ROF) triggers, their background, key concepts, and how they are applied within the MORDM framework.

References

Herman, J. D., Reed, P. M., Zeff, H. B., & Characklis, G. W. (2015). How should robustness be defined for water systems planning under change? Journal of Water Resources Planning and Management,141(10), 04015012. doi:10.1061/(asce)wr.1943-5452.0000509

Herman, J. D., Zeff, H. B., Lamontagne, J. R., Reed, P. M., & Characklis, G. W. (2016). Synthetic drought scenario generation to support bottom-up water supply vulnerability assessments. Journal of Water Resources Planning and Management,142(11), 04016050. doi:10.1061/(asce)wr.1943-5452.0000701

Kasprzyk, J. R., Nataraj, S., Reed, P. M., & Lempert, R. J. (2013). Many objective robust decision making for complex environmental systems undergoing change. Environmental Modelling & Software,42, 55-71. doi:10.1016/j.envsoft.2012.12.007

Kirsch, B. R., Characklis, G. W., & Zeff, H. B. (2013). Evaluating the impact of alternative hydro-climate scenarios on transfer agreements: Practical improvement for generating synthetic streamflows. Journal of Water Resources Planning and Management,139(4), 396-406. doi:10.1061/(asce)wr.1943-5452.0000287

Mankin, J. S., Lehner, F., Coats, S., & McKinnon, K. A. (2020). The value of initial condition large ensembles to Robust Adaptation Decision‐Making. Earth’s Future,8(10). doi:10.1029/2020ef001610

Trindade, B., Reed, P., Herman, J., Zeff, H., & Characklis, G. (2017). Reducing regional drought vulnerabilities and multi-city robustness conflicts using many-objective optimization under deep uncertainty. Advances in Water Resources,104, 195-209. doi:10.1016/j.advwatres.2017.03.023

One of the most common goals of machine learning is binary classification. Many of the problems with binary classification arise when we need to identify and classify rare events in our datasets. The binary classification of rare events comes up frequently in the detection of rare diseases, fraudulent financial activity, and defective manufactured products. In water system research, we can take the detection of flood risk as an example. We might want to identify flood days from a combination of precipitation, snow coverage, temperature, time of the year, soil moisture, and many other factors.

Training a machine learning model such as this can be very challenging because, in our historical record, we might only have 1% extreme flood days (or even much less), and the rest of the days are non-flood days (normal days). In this blog post, I go over some of the considerations that need to be taken into account when dealing with rare events. I will also discuss some of the techniques that can help us address the issues of imbalanced datasets.

Confusion Matrix

In machine learning, we often use confusion matrices (also known as error matrices) to investigate the performance of classification models. To make this more understandable, I am going to use the example of floods. Let’s imagine that our goal is to predict whether each day is a flood day or a non-flood day, and we train a machine learning model to identify flood days. In this case, the predicted label of each day can fall into one of the following categories:

True positive (TP): We correctly classify a flood day as a flood day.

True negative (TN): We correctly classify a non-flood day as a normal day.

False positive (FP): We misclassify a normal day as a flood day.

False negative (FN): We misclassify a flood day as a normal day.

We then count the number of TPs, TNs, FPs, and FNs. After that, we can draw the following table, called a confusion matrix, in order to visualize the outcomes of our binary classification model and use this to better understand the performance of our model.
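A minimal pure-Python sketch of these counts, using made-up labels (1 = flood day, 0 = normal day):

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive class)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

y_true = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]  # observed days
y_pred = [0, 0, 1, 1, 0, 0, 0, 0, 1, 0]  # model predictions
print(confusion_counts(y_true, y_pred))  # (2, 6, 1, 1)
```

Arranging the four counts in a 2x2 table, with true classes as rows and predicted classes as columns, gives the confusion matrix itself.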

Accuracy of Prediction

One of the most intuitive ways to perform error calculation is to count how many times our model’s classification is right and divide it by the total number of events. This is called accuracy and can be calculated from the following equation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy Paradox

By definition, extreme and rare events are a small portion of our dataset, and using a single accuracy measure as the performance metric can cause significant bias in our model evaluation. I’ll provide an example to make this clearer. Let’s say that we need to identify a specific fraudulent activity in an online purchase dataset that only happens 0.1% of the time on average; as such, our goal is to create a high-accuracy model to identify these events. Let’s assume that we have a model that classifies all activities as normal activities. In this case, the accuracy of our prediction is 99.9%, which looks quite decent. However, this model misclassifies all of the rare activities that we are looking for. Therefore, we have a very bad model with a very high accuracy. This is called the accuracy paradox. To respond to this paradox, different methods and error metrics have been introduced.
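The paradox is easy to reproduce with a few lines of arithmetic (hypothetical counts: 1 fraudulent transaction out of 1000, and a degenerate model that labels everything as normal):

```python
# One fraud in 1000 transactions; the "model" predicts normal every time.
n_total, n_fraud = 1000, 1
tp, fn = 0, n_fraud            # every fraud case is missed
tn, fp = n_total - n_fraud, 0  # every normal case is labeled correctly

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)        # fraction of frauds actually caught
print(accuracy, recall)        # 0.999 0.0
```

The model scores 99.9% accuracy while catching exactly none of the events we care about, which is why rare-event problems need the additional metrics below.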

Other Performance Indicators

1- True Positive Rate (TPR), which is also known as sensitivity or recall: TPR = TP / (TP + FN)

2- True Negative Rate (TNR), which is also known as specificity: TNR = TN / (TN + FP)

3- False Positive Rate (FPR): FPR = FP / (FP + TN)

4- Positive Predictive Value (PPV), which is also known as precision: PPV = TP / (TP + FP)
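Computed from confusion-matrix counts, these indicators can be written as a small pure-Python sketch (the counts here are made up):

```python
def rates(tp, tn, fp, fn):
    """The four indicators, computed from confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),  # sensitivity / recall
        "TNR": tn / (tn + fp),  # specificity
        "FPR": fp / (fp + tn),
        "PPV": tp / (tp + fp),  # precision
    }

print(rates(tp=2, tn=6, fp=1, fn=1))
```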

There are many other performance measures that focus on different aspects of model performance, and more information on these binary classification metrics can be found here.

The selection of these metrics depends on the actual problem at hand and the concerns of stakeholders. In some cases, we have to give priority to reducing false negatives. In the flood example, reducing false negatives might improve the reliability of the system in flood protection. However, in other situations, reducing false positives can be more important, because incorrect flood alarms can trigger flood safety measures such as reducing the water volume in dams, which can put more pressure on irrigation supply systems.

Therefore, one of the main issues that we need to take into account in classifying rare events is carefully selecting the performance metrics that we use to train our model.

There are other ways to investigate the performance of binary classification models that I am not covering in this blog post, such as developing a Receiver Operating Characteristic (ROC) curve or calculating the area under the ROC curve (AUC). More information about these methods can be found here and here.

Manipulating the Training Dataset

As expected, rare events do not appear in our dataset frequently; however, they can cause important ramifications. When we train our model for an imbalanced dataset that only has 5% TRUE labels, for example, the model tends to learn more about FALSE labels (the majority) and do poorly in identifying TRUE labels. To avoid this, we can modify or resample our training dataset to force the model to focus on the TRUE labels (the minority class). There are various approaches to do that, and I am going to introduce four of these methods here: 1) undersampling; 2) oversampling; 3) SMOTE (Synthetic Minority Over-sampling TEchnique); and 4) increasing the cost of misclassifying the minority class.

Undersampling

Undersampling reduces the size of the majority class in our training dataset in order to balance the minority and majority classes. There are different resampling methods; for example, we can randomly select samples from the majority class and remove them. However, this can lead to a loss of useful information and reduce our model’s performance. There are other, more intelligent methods that tend to preserve the useful information and focus on removing redundant samples. The Condensed Nearest Neighbor Rule (CNN), Tomek Links, and One-Sided Selection are some examples of these methods. You can refer to this blog post for more information about resampling techniques.

Oversampling

Another way of modifying the training dataset is to increase the size of the minority class. The most basic way of doing that is to duplicate the minority class members until the dataset is no longer imbalanced.

SMOTE

SMOTE (Chawla et al., 2002) is an oversampling technique that synthetically generates new samples from the original minority group. Here is how the SMOTE algorithm works. 1) The algorithm first selects a random minority sample. 2) It finds the k nearest minority neighbors of the selected point and randomly chooses one of them. 3) The algorithm then computes the difference between the feature vectors of the two points. 4) It generates a random number between 0 and 1. 5) The algorithm multiplies the difference vector by that random number and adds the result to the original sample, generating a new synthetic point that lies between the two minority samples. 6) We start over and continue this process until the minority class reaches the desired size. There are some other variants of the basic SMOTE technique, which this blog post discusses. We can also use a combination of oversampling and undersampling to achieve a better result.
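The interpolation at the heart of SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration of the steps above, not the imbalanced-learn implementation:

```python
import numpy as np

def smote_sample(X_min, k=3, rng=None):
    """Generate one synthetic point from the minority samples X_min."""
    rng = np.random.default_rng(rng)
    x = X_min[rng.integers(len(X_min))]    # 1) pick a random minority sample
    d = np.linalg.norm(X_min - x, axis=1)  # 2) distances to all minority samples
    nbrs = np.argsort(d)[1:k + 1]          #    k nearest neighbours (skip self)
    xn = X_min[rng.choice(nbrs)]           #    choose one neighbour at random
    gap = rng.random()                     # 3-4) random number in [0, 1)
    return x + gap * (xn - x)              # 5) new point along the segment
```

Calling this repeatedly (step 6) grows the minority class until the desired balance is reached; each new point is a convex combination of two existing minority samples.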

Penalized Models

Another method to reduce the impact of imbalanced datasets is to increase the cost of misclassifying the minority class. To do that, we can assign each training sample a weight that is low for the majority class (e.g., 1) and much greater for the minority class (e.g., 100). This forces the model to pay more attention to the minority class in order to reduce the error during training.
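A toy sketch of why such weighting matters (hypothetical costs and labels): under an asymmetric cost, the "always predict normal" model that wins on accuracy becomes the expensive one.

```python
def weighted_cost(y_true, y_pred, minority_cost=100, majority_cost=1):
    """Total misclassification cost with a heavier penalty on minority errors."""
    return sum(
        (minority_cost if t == 1 else majority_cost)
        for t, p in zip(y_true, y_pred) if t != p
    )

y_true = [0] * 9 + [1]            # one flood day in ten
lazy = [0] * 10                   # "predict normal always": 90% accurate
alert = [0] * 8 + [1, 1]          # one false alarm, but catches the flood
print(weighted_cost(y_true, lazy), weighted_cost(y_true, alert))  # 100 1
```

Training against such a weighted loss (e.g., via per-sample weights in a learning library) is what pushes the model toward the minority class.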

Software Packages for Manipulation of Imbalanced Datasets

Here, I introduce some of the Python and R libraries that have been developed to address the problems that arise when dealing with imbalanced datasets.

Python:

scikit-learn offers the confusion_matrix, roc_curve, and auc functions, which can be used to generate confusion matrices and ROC/AUC analyses.

imbalanced-learn offers several over- and undersampling techniques as well as combined methods.

R:

pROC can be used to calculate ROC curves and AUC.

ROSE performs simple over-, under-, and combined sampling.

This post focuses on the question of determining whether one time series can be helpful in forecasting another time series. This question could be answered by investigating whether the two time series have some form of a causal relationship. Many time series models are based on univariate or multivariate autoregressive (AR) models. More often, a time series of a variable is added to a univariate forecasting model if the two variables are correlated. However, correlation does not necessarily imply causality. Causality is a special kind of relationship between variables where one variable causes, influences, or has an effect on the other variable. For example, although the sales of sunglasses and swimsuits are highly correlated, it cannot be concluded that an increase in the sales of sunglasses causes an increase in the sales of swimsuits. Nor can it be assumed that the reverse is true. The relationship between the two sales is likely due to a third time series variable, temperature, which has an effect on both variables. The relationships between a pair of highly correlated variables may also be purely coincidental with no causality association. Furthermore, the correlation between the variables of two time series that are causal may be small. For example, the appliance usage per household and outside temperature have a causal relationship because a spike in the outside temperature leads to, after a time lag, a spike in the appliance usage. In this case, the correlation between the two variables may actually be quite small. The drawback of using correlation in time series analysis is that the measure does not factor in the inherent time-ordering and lags of the series. The same correlation coefficient would be obtained even if the time series were shuffled randomly.

Clive Granger (Nobel Prize Laureate in Economic Sciences) proposed that causality in economics could be tested by measuring the ability to predict the future values of a time series using prior values of another time series [1]. The causality relationship is based on two principles: the cause happens prior to its effect and the cause has unique information about the future values of its effect. He developed the Granger Causality test (GCT) which is a statistical hypothesis test for determining whether one time series is useful in forecasting another [1]. The test finds only “predictive causality” and not true causality. Therefore, if the prediction of future values of a time series Y(t) improves by including past values of a time series, X(t), X(t) is said to “Granger cause” Y(t). A compact notation for X(t) Granger causes Y(t) is: X(t) G.C. Y(t). Granger causality is (a) unidirectional if X(t) G.C. Y(t) but Y(t) does not G.C. X(t) and (b) bidirectional if X(t) G.C. Y(t) and Y(t) G.C. X(t). Although GCT was originally developed for time series econometrics, it is now also used in various other fields such as neuroscience to characterize the directionality and influence between the time series neural activity from different neural sources in the brain and in water resources to analyze drivers of groundwater patterns [2-3].

The Granger Causality Test

Let Y(t) be the time series for which the future values have to be predicted and let X(t) be the other time series that will be used to augment Y(t) with the lagged values of X(t). The GCT is based on univariate and multivariate AR models which assume the time series are stationary. Stationarity can be confirmed by applying the augmented Dickey-Fuller test. If the series are not stationary, they should be transformed into stationary series through operations such as detrending, differencing, or log transforming prior to applying the test. The GCT uses a series of t-tests and F-tests conducted on the lagged values of Y(t) and X(t). The test is based on the following null hypothesis:

H_{0}: X(t) does not G.C. Y(t)

Step 1: Fit the best (in terms of number of lags) univariate autoregressive model to the time series. The number of lags (model order) can be determined from the PACF or by choosing the model order that minimizes the AIC or BIC criteria. Let the resulting univariate AR model be represented by:

Y(t) = a_{1}Y(t-1) + a_{2}Y(t-2) + … + a_{n}Y(t-n) + e_{y}(t)

where n is the model order, the a_{i} are the predictor coefficients, and e_{y}(t) is a white noise term with a zero mean and some variance σ_{e}^{2}.

Step 2: Apply the t-test to each predictor coefficient to determine whether it is significant. That is, for each coefficient the hypothesis test is:

H_{0}: a_{i}=0

H_{1}: a_{i} ≠ 0

The alternate hypothesis can also be H_{1}: a_{i} > 0 if a_{i} is positive or H_{1}: a_{i} < 0 if a_{i} is negative. Any coefficient that is not significant (p > 0.05) is excluded from the model. For example, if the coefficients a_{2} = a_{3} = 0, the resulting model will be:

Y(t) = a_{1}Y(t-1) + a_{4}Y(t-4) + … + a_{n}Y(t-n) + e_{y}(t)

Step 3: Augment the reduced univariate model with the lagged values of X(t) to obtain the bivariate AR model:

Y(t) = a_{1}Y(t-1) + a_{4}Y(t-4) + … + a_{n}Y(t-n) + b_{1}X(t-1) + b_{2}X(t-2) + … + b_{n}X(t-n) + E_{y}(t)

where E_{y}(t) is the new white noise error term with zero mean and variance, σ_{E}^{2}.

Step 4: As in Step 2, apply the t-test now to each b_{i} to determine statistical significance and remove the lagged terms which are not significant. Next, test the significance of the retained coefficients jointly using the F-test. Again, for convenience, it will be assumed that all the coefficients are individually significant. The null hypothesis for the F-test is, therefore,

H_{0}: b_{1} = b_{2} = … = b_{n} = 0

and the alternate hypothesis is that any coefficient is non-zero.

Step 5: If the null hypothesis is rejected (p < 0.05), it is concluded that the lags of X(t) help predict future values of Y(t) and, therefore, X(t) G.C. Y(t).

Quantifying Granger Causality

The variances of the prediction errors from the univariate and bivariate models in Steps 1 and 3 can be used to quantify Granger prediction (GP) using the following equation:

GP=ln[σ_{e}^{2}/σ_{E}^{2}]

GP will be zero if X(t) does not G.C. Y(t), as the univariate and bivariate AR models will have the same errors because all the b_{i} coefficients will be zero. GP will be greater than zero if X(t) G.C. Y(t), because at least one b_{i} coefficient is non-zero and the bivariate AR model provides a better fit to the data than the univariate AR model. Consequently, the variance of the error terms in the bivariate AR model will be less than the variance of the error terms in the univariate AR model. Higher values of GP indicate stronger Granger causality.
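GP can be estimated directly with least squares. This is a bare-bones NumPy sketch: it fits full univariate and bivariate AR models of a fixed order and skips the t-test pruning of Steps 2 and 4.

```python
import numpy as np

def granger_prediction(y, x, order=2):
    """GP = ln(sigma_e^2 / sigma_E^2) from least-squares AR fits."""
    def lagmat(s, p):
        # column j holds s(t - j) for t = p .. len(s) - 1
        return np.column_stack([s[p - j: len(s) - j] for j in range(1, p + 1)])

    target = y[order:]
    Zu = np.column_stack([np.ones(len(target)), lagmat(y, order)])  # univariate
    Zb = np.column_stack([Zu, lagmat(x, order)])                    # + lags of x
    e = target - Zu @ np.linalg.lstsq(Zu, target, rcond=None)[0]
    E = target - Zb @ np.linalg.lstsq(Zb, target, rcond=None)[0]
    return np.log(e.var() / E.var())
```

Because the bivariate model nests the univariate one, this least-squares GP is always non-negative; values near zero suggest that X(t) adds no predictive information about Y(t).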

Implementation in R

The Granger causality test can be conducted very simply in R using the lmtest package. To demonstrate its functionality, we use a real-world dataset to identify drivers of flow in Owasco Lake. The partial table below shows the predictors (precipitation, temperature, day, real-time snow measurements, accumulated snow depth, relative humidity, wind speed, and solar radiation) and the corresponding outflow from the lake.

In this example, it may be quite intuitive as to which variables would have the greatest predictive capabilities of flow, but you may encounter cases where the choice is less obvious. A common first step used in machine learning is to calculate the correlation between all predictors and the outflow to try to determine the most important predictors. The table below shows these correlation values.

Var:  P     T      SN     SNWD    RH    WS     SR
Cor:  0.37  -0.15  0.038  -0.070  0.14  0.074  -0.234

In the absence of any other information, these correlation values are quite low and may lead to the misleading conclusion that none of the variables can help predict outflow. However, we can use the Granger causality test to check this. As an example, let’s look at precipitation as a predictor of outflow. We use the function grangertest, and the only other parameter to specify is how many lags of the precipitation variable to use to predict outflow. Here, an arbitrary value of 10 is chosen.

library(lmtest)
data = read.csv("D:/Documents/OwascoPrediction.csv")
grangertest(observed_cms ~ Precip, order = 10, data = data)

The results are reported as follows:

The function returns the unrestricted model (Model 1) with the lagged precipitation values and the restricted model (Model 2) without them. The F-test comparing the two models gives a p-value of 0.0059, making the result significant at the 0.01 level (there is less than a 1% chance of seeing this improvement in fit if the precipitation lags had no effect). This indicates that even up to 10 lags of precipitation help to predict outflow. Therefore, precipitation G.C. outflow.

As an additional example, while 3 lags of snow do not G.C. outflow, 15 lags of snow are significant at the 0.01 level, indicating that while real-time snow may not immediately cause changes in outflow, snow accumulated over 15 days does have an impact on outflow. This is a really nice way to partition out which variables have an immediate or lagged effect on an output.
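The lmtest package is R-specific, but the same restricted-vs-unrestricted model comparison behind grangertest can be sketched in Python with plain numpy least squares. The snippet below is a minimal illustration on synthetic data: the data-generating process and variable names are invented for the example (statsmodels also ships a ready-made grangercausalitytests function).

```python
import numpy as np

def granger_f_stat(y, x, order):
    """F-statistic comparing a restricted model (lags of y only) against an
    unrestricted model (lags of y plus lags of x), as lmtest::grangertest does."""
    n = len(y)
    rows = n - order
    Y = y[order:]
    lags_y = np.column_stack([y[order - k:n - k] for k in range(1, order + 1)])
    lags_x = np.column_stack([x[order - k:n - k] for k in range(1, order + 1)])
    const = np.ones((rows, 1))
    X_r = np.hstack([const, lags_y])            # restricted (Model 2)
    X_u = np.hstack([const, lags_y, lags_x])    # unrestricted (Model 1)
    rss = lambda X: np.sum((Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]) ** 2)
    df1 = order                                 # number of added x lags
    df2 = rows - X_u.shape[1]                   # residual degrees of freedom
    return ((rss(X_r) - rss(X_u)) / df1) / (rss(X_u) / df2)

# synthetic example: x drives y with a 2-step delay
rng = np.random.default_rng(1)
x = rng.normal(size=600)
y = np.zeros(600)
for t in range(2, 600):
    y[t] = 0.4 * y[t - 1] + 0.9 * x[t - 2] + 0.1 * rng.normal()

F = granger_f_stat(y, x, order=10)  # large F => x Granger-causes y
```

Comparing F against an F(df1, df2) distribution yields the p-value that grangertest reports.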

Applications to Machine Learning

As seen above, the Granger causality test can not only help identify important predictors but also gives extra information about which lags of those variables offer the most predictive power. The GCT is based on univariate and multivariate autoregressive (AR) modeling; its results can be used to select additional time series that are Granger-causal, and the appropriate lags, to augment the input for better predictability in other models such as ANNs and LSTMs.

References

1. Granger, C. W. J. (2001). Essays in Econometrics: The Collected Papers of Clive W. J. Granger. Cambridge: Cambridge University Press.

2. Ding, M., Chen, Y., & Bressler, S. L. (2006). Granger causality: Basic theory and application to neuroscience. In Schelter, B., Winterhalder, M., & Timmer, J. (Eds.), Handbook of Time Series Analysis. Wiley-VCH, Weinheim.

3. Singh, N. K., & Borrok, D. M. (2019). A Granger causality analysis of groundwater patterns over a half-century. Sci Rep 9, 12828. https://doi.org/10.1038/s41598-019-49278-8

Most of us in academia do not have formal training in computer science, yet for many of us writing and sharing code is an essential part of our research. While the quality of code produced by many graduate students can be quite impressive, many of us never learned the basic CS principles that allow programs to be sharable and inheritable by other programmers. Last summer I proctored an online class in Python fundamentals. The course was for beginner programmers and, though it covered material simpler than the scripts I write for my PhD, I learned a lot about how to properly document and structure Python functions to make them usable by others. In this post I’ll discuss two important concepts I took away from this course: writing proper function specifications and enforcing preconditions using assert statements.

Function Specifications

One of the most important aspects of writing a quality Python function is a proper specification. While the term may sound generic, a specification actually has a very precise definition and implementation for a Python function. In practice, a specification is a docstring: a string literal that occurs as the first statement in a function, module, class, or method, delimited by a pair of triple quotes ("""). Here is an example of a simple function I wrote with a specification:

def radians_to_degrees(theta):
"""
Returns: theta converted to degrees
Value return has type float
Parameter theta: the angle in radians
Precondition: theta is a float
"""
return theta * (180.0/3.14159)

The function specification is everything between the sets of """. When Python sees this docstring at the front of a function definition, it is automatically stored as the function’s __doc__ attribute. With this specification in place, any user that loads this function can access its __doc__ by typing help(radians_to_degrees), which will print the following to the terminal:

Help on function radians_to_degrees in module __main__:
radians_to_degrees(theta)
Returns: theta converted to degrees
Value return has type float
Parameter theta: the angle in radians
Precondition: theta is a float

The help function will print anything in the docstring at the start of a function, but it is good practice for the specification to have the following elements:

A one-line summary of the function at the very beginning. If the function is a “fruitful function” (meaning it returns something), this line should state what it returns. In my example I note that my function returns theta converted to degrees.

A more detailed description of the function. In my example this simply provides the type of the return value.

A description of the function’s parameter(s)

Any preconditions that are necessary for the code to run properly (more on this later)

I should note that officially the Python programming language has a more flexible set of requirements for function specifications, which can be found here, but the attributes above are a good starting point for writing clear specifications.

Properly specifying a Python function will clarify the function’s intended use and provide instructions for how new users can utilize it. It will also help you document your code for formal release if you ever publish it. Google any of your favorite Python functions and you’ll likely be brought to a page with a polished version of the function’s specification. These pages can be automatically generated by tools such as Sphinx that build them right from the function’s definition.

Aside from clarifying and providing instructions for your function, specifications provide a means of creating a chain of accountability for any problems with your code. This chain of accountability is created through precondition statements (element four above). A precondition statement dictates requirements for the function to run properly. Preconditions may specify the type of a parameter input (e.g., x is a float) or a general statement about the parameter (e.g., x < 0).

For large teams of many developers and users of functions, precondition statements create a chain of accountability for code problems. If the preconditions are violated and a code crashes, then it is the responsibility of the user, who did not use the code properly. On the other hand, if the preconditions were met and the code crashes, it is the responsibility of the developer, who did not properly specify the code.

Asserting preconditions to catch errors early

One way to improve the usability of a code is to catch precondition violations with statements that throw specific errors. In Python, this may be done using assert statements. Assert statements evaluate a boolean expression and can display a custom error message if the expression returns false. For example, I can modify my radians_to_degrees function with the following assert statement (the line just before the return):

def radians_to_degrees(theta):
"""
Returns: theta converted to degrees
Value return has type float
Parameter theta: the angle in radians
Precondition: theta is a float
"""
assert type(theta) == float, repr(theta) + ' is not float'
return theta * (180.0/3.14159)

A helpful function in my assert statement above is repr, which returns a printable representation of a given object (e.g., for a string it will keep the quotation marks around it). Now if I were to pass ‘a’ into my function, I would get the following return:

AssertionError: 'a' is not float

This is trivial for my example, but can save a lot of time when running large and complex functions. Rather than seeing 30 lines of “traceback…”, the user will be immediately brought to the source of the error. In general, you should try to make assert statements as precise as possible to pinpoint the violation.
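For instance, in a function with multiple preconditions (the function below is invented for illustration), giving each precondition its own assert with its own message tells the user exactly which requirement was violated:

```python
def scale_angle(theta, factor):
    """
    Returns: theta multiplied by factor
    Value return has type float
    Parameter theta: the angle in radians
    Precondition: theta is a float
    Parameter factor: the scaling multiplier
    Precondition: factor is a float and factor > 0
    """
    # one assert per precondition, so the message pinpoints the violation
    assert type(theta) == float, repr(theta) + ' is not float'
    assert type(factor) == float, repr(factor) + ' is not float'
    assert factor > 0, repr(factor) + ' is not positive'
    return theta * factor
```

Calling scale_angle(2.0, -1.0) now fails with AssertionError: -1.0 is not positive, rather than a generic traceback.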

It’s important to note that not every conceivable precondition violation requires an assert statement. There is a tradeoff between code efficiency and the number of assert statements (checking a precondition takes time). Determining which preconditions to strictly enforce with assert statements is a balancing act and will be different for each application. It’s important to remember, though, that even if you do not have an assert statement for a precondition, you should still list all preconditions in the function specification to preserve the chain of accountability.

Further resources

This post concerns specifications for Python functions. However, the use of docstrings is also important when coding modules, classes and public methods. Guidance on how these docstrings may be formatted can be found here.

While this post focused on Python, informative function specifications are important for all programming languages. Some resources on documentation in other languages can be found below:

This post is a follow up from my last one, where I’ll be demonstrating some of the basic network analysis capabilities of the Python library NetworkX. I will be using the same data and all my scripts can be found in the same repository. The data we’re using represent flows of food between US counties, which I am limiting to the 95th percentile of largest flows so the network is of a reasonable size for this simple analysis. Given that these are flows (i.e., from one place to another) this is referred to as a directed network, with every edge (link) having a source and a destination. NetworkX allows the analysis and visualization of several types of networks, illustrated below.

Undirected networks: Edges have no direction; the relationships are always reciprocal or symmetric. For example, A is friends with B.

Directed networks: Edges have direction and relationships don’t have to be reciprocal. For example, B sends an email to A.

Weighted networks: Edges contain quantitative information indicating the weight of a relationship. For example, A owes $6 to B, B owes $13 to C, etc.

Signed networks: Edges also carry information about the type of interaction between the two nodes, positive or negative. For example, A and B are friends but B and C are enemies.

Multi networks: Several connections or attributes might exist between two nodes. For example, A gave 6 apples and 3 pears to B, B gave 4 pears and 8 peaches to C, etc.
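For reference, these types map onto NetworkX graph classes. The sketch below uses toy nodes; note that NetworkX has no dedicated signed-graph class, so the sign is stored as an ordinary edge attribute:

```python
import networkx as nx

U = nx.Graph([('A', 'B')])                    # undirected: A-B is reciprocal
D = nx.DiGraph([('B', 'A')])                  # directed: B -> A only
W = nx.Graph()
W.add_edge('A', 'B', weight=6)                # weighted: A owes $6 to B
S = nx.Graph()
S.add_edge('B', 'C', sign='-')                # signed: sign kept as an attribute
M = nx.MultiDiGraph()                         # multi: parallel edges allowed
M.add_edge('A', 'B', fruit='apples', qty=6)
M.add_edge('A', 'B', fruit='pears', qty=3)
```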

I will use the rest of this blogpost to demonstrate some simple analysis that can be performed on a directed network like this one. This analysis is only demonstrative of the capabilities – obviously, US counties have several other connections between them in real life and the food network is only used here as a demonstration testbed, not to solve actual connectivity problems.

We’ll be answering questions such as:

How connected are counties to others?

Are there counties that are bigger ‘exporters’ than ‘importers’ of goods?

Can I send something from any one county to any other using only the already established connections in this network?

If yes, what is the largest number of connections that I would need? Are there counties with no connections between them?

Node connectivity is typically measured by the node’s degree. In undirected networks, this is simply the number of connections a node has. In directed networks, we can also distinguish between connections where the node is the source and where the node is the destination. To estimate them using NetworkX, we can use G.out_degree() and G.in_degree(), respectively. We can also calculate the average (in or out) degree by dividing by the total number of nodes. In this case they’re both around 3.08, i.e., on average, every node has three connections. Of course this tells us very little about our network, which is why most often people like to see the distribution of degrees. This is easily generated by sorting the degree values and plotting them with matplotlib.

import matplotlib.pyplot as plt  # G is the directed network built in the previous post
nnodes = G.number_of_nodes()
degrees_in = [d for n, d in G.in_degree()]
degrees_out = [d for n, d in G.out_degree()]
avrg_degree_in = sum(degrees_in) / float(nnodes)
avrg_degree_out = sum(degrees_out) / float(nnodes)
in_values = sorted(set(degrees_in))
in_hist = [degrees_in.count(x) for x in in_values]
out_values = sorted(set(degrees_out))
out_hist = [degrees_out.count(x) for x in out_values]
plt.figure()
plt.plot(in_values,in_hist,'ro-') # in-degree
plt.plot(out_values,out_hist,'bo-') # out-degree
plt.legend(['In-degree','Out-degree'])
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
plt.title('Food distribution network')
plt.show()

This shows that the network is primarily made up of nodes with few connections (degree < 5) and a few nodes with larger degrees. Distributions like this are common for real-world networks [1, 2]; oftentimes they follow an exponential or log-normal distribution, and sometimes a power-law distribution (also referred to as “scale free”).

We can also compare the in- and out-degrees of the nodes in this network which would give us information about counties that export to more counties than they import from and vice versa. For example, in the figure below, points below the diagonal line represent counties that import from more places than they export to.
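A minimal sketch of how such a comparison can be produced, with a toy graph standing in for the food-flow network; with in-degree on the x-axis and out-degree on the y-axis, points below the diagonal correspond to nodes that import from more places than they export to:

```python
import networkx as nx
import matplotlib
matplotlib.use('Agg')  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt

# toy directed graph standing in for the food-flow network
G = nx.DiGraph([('A', 'B'), ('A', 'C'), ('B', 'C'), ('C', 'A')])
din, dout = dict(G.in_degree()), dict(G.out_degree())
nodes = sorted(G.nodes())

plt.figure()
plt.scatter([din[n] for n in nodes], [dout[n] for n in nodes])
lim = max(max(din.values()), max(dout.values())) + 1
plt.plot([0, lim], [0, lim], 'k--')  # diagonal: equal imports and exports
plt.xlabel('In-degree (number of import sources)')
plt.ylabel('Out-degree (number of export destinations)')
plt.savefig('in_vs_out_degree.png')
```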

To address the last two prompt questions, we are essentially concerned with network connectedness. In directed networks such as this one, we can distinguish between strong and weak connectivity. A network is weakly connected if there is an undirected path between any pair of nodes (i.e., ignoring edge direction), and strongly connected if there is a directed path between every pair of vertices (i.e., only following edge direction) [3]. The networks below are all weakly, but not strongly, connected:

NetworkX can help answer these questions for our network using existing and intuitive functionality. Executing nx.is_strongly_connected(G) and nx.is_weakly_connected(G) will return False for both. This means that using the already established connections and directions, not all nodes can be reached by all other nodes, and this remains the case even if we ignore the directions (weak connectedness). This implies that our network is made up of more than one component, i.e., connected subgraphs of the network. For example, the undirected graph below has three components:

Strongly connected components in directed graphs also consider the direction of each edge. For example the directed graph below also has three components:

Weakly connected components in directed graphs are identified by ignoring the direction of the edges, so in the above example, the graph has one weakly connected component.

To examine these components for our network we can use nx.strongly_connected_components(G) and nx.weakly_connected_components(G), which produce lists of all strongly or weakly connected components in the network, respectively: in this case, 1,156 strongly connected and 111 weakly connected components. In both cases there is one giant component containing most of the network’s nodes (1,220 nodes in the strongly connected case and 2,348 in the weakly connected case) and hundreds of small components with fewer than 10 nodes trading between them.
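A toy example illustrates the distinction (the cycle between nodes 1 and 2 is strongly connected, while the one-way edge from 3 to 4 is only weakly connected):

```python
import networkx as nx

# toy directed graph: 1 <-> 2 form a cycle, 3 -> 4 is one-way
G = nx.DiGraph([(1, 2), (2, 1), (3, 4)])
scc = list(nx.strongly_connected_components(G))  # {1, 2}, {3}, {4}
wcc = list(nx.weakly_connected_components(G))    # {1, 2}, {3, 4}
```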

Finally, we can draw these strongly and weakly connected components. In the top row of the figure below, I show the largest component identified by each definition, and in the bottom row, all other components in the network according to each definition.

References: [1] Broido, A. D., & Clauset, A. Scale-free networks are rare. Nat Commun 10, 1017 (2019). https://doi.org/10.1038/s41467-019-08746-5 [2] http://networksciencebook.com/ [3] Skiena, S. “Strong and Weak Connectivity.” §5.1.2 in Implementing Discrete Mathematics: Combinatorics and Graph Theory with Mathematica. Reading, MA: Addison-Wesley, pp. 172-174, 1990.

Most of the time, we need to deal with complex systems that have several components, each with different properties and behavior. Usually, these properties vary with time and space, and understanding their spatiotemporal dynamics plays a key role in our understanding of the system performance and its vulnerabilities. Therefore, aggregation in time and space would hide variations that might be important in our analysis and understanding of the system behavior. In this blog post, I am going to show how we can add bar plots onto a map at different locations for better visualization of variables. Some examples for when this type of visualization is useful are irrigation districts that have different water rights or cropping patterns that affect their crop production and water requirements in dry or wet years; another example would be the sensitivity indices of specific model parameters that are clustered based on their location and varying based on the wetness and dryness of the system. To demonstrate this, I am going to create an animated graph that shows annual crop yield variations in different counties in the State of Washington. You can download the Washington counties’ data layer (WA_County_Boundaries-shp.zip) and NASS historical yield data for corn, winter wheat, and spring wheat (NASS.txt) from here.

“shp_counties” is a SpatialPolygonsDataFrame object and has a different format than a regular dataframe. It usually has a few slots that contain different types of information. For example, “data” holds the non-geographic properties. We can explore the information for each polygon with:

head(shp_counties@data)
#The "JURISDIC_2" and "JURISDIC_3" columns both contain the names of counties.

To visualize it with ggplot, it first has to be converted into a dataframe. Then, we can use it with the geom_polygon function:

library(ggplot2)
counties <- fortify(shp_counties, region="JURISDIC_2")
head(counties)
#long lat order hole piece id group
#1 -13131351 5984710 1 FALSE 1 Adams Adams.1
#2 -13131340 5983102 2 FALSE 1 Adams Adams.1
ggplot() + geom_polygon(data = counties, aes(x = long, y = lat, group = group), colour = "dark red", fill = NA)

Now, I am going to extract the center of each polygon so that I can later add the bar plots to these coordinates:

We are going to extract just a few years of data from the yield dataset to show on the map:

library(data.table)
yield<- fread("(your directory)/NASS.txt")
# Year Ag_District Ag_District_Code Data_Item Value County JURISDIC_5 Group
#1: 1947 EAST CENTRAL 50 CORN, GRAIN 41.0 Adams 53001 CORN
#2: 1948 EAST CENTRAL 50 CORN, GRAIN 40.0 Adams 53001 CORN
years<- as.data.frame(unique(yield$Year))
years<- as.data.frame(years[34:42,])

For the final map, I am going to need a common legend for all of the bar plots in all of the counties and years in which I am interested. So, I need to know all of the categories that are available:

unique(yield$Group)
unique(yield$Data_Item)

Based on the “Group” column, we know that there are three main groups of crops: corn, winter wheat, and spring wheat. Based on the “Data_Item” column, we know that there are three different types of winter and spring wheat (non-irrigated, non-irrigated with continuous cropping (CC), and non-irrigated with summer fallow (SF)). Note that we do not have all of these crop types for every county and year, and the common legend should be created from a county and year in which all of the crop types are available, so that it applies to all of the counties and years I am eventually going to plot. For this reason, I am going to subset the dataset and create a bar plot for one county and year in which all of the crop types are available:

I want to use custom colors so that each crop Group has a different color and the Data_Items within a Group share that Group’s color theme, gradually changing from darker to brighter. First, I create three color functions, one for each Group in my dataset.

Next, in the template bar plot, I can use the customized “colors” in scale_fill_manual() function:

ggplot(sample_yield,aes(x=Group,y=Value,fill=factor(Data_Item)))+
scale_fill_manual(values=colors)+
geom_bar(stat='identity', color = "black")

From this plot, we just need to extract the legend. We also need to change its font size, since we are going to add it to another plot; you should try different sizes to find which is most appropriate for your dataset/figure. I save this plot in “tmp” after removing the legend title and changing the font size. Then, I extract the legend section using the get_legend() function and convert it to a ggplot object using the as_ggplot() function.

Now, I am going to save the counties map with the appropriate font size and title and add the legend extracted from the yield data to it with the geom_subview() function. You will need to adjust the coordinates at which the legend appears on the map based on your data.

In the section below, I show how we can loop through all of the years and then all of the counties in the county map, use the center coordinate of each county (polygon), plot the corresponding bar plot, and print it at the center of each polygon. We can use the ggplotGrob() function to extract different pieces of a bar plot created with ggplot. With this function, we can treat each part of the bar plot (background, axis, labels, main plot panel, etc.) as a graphical object and move it to the coordinates of interest. For example, if we just want the main panel and no other components, we can extract the panel and adjust the other components as we wish in the final graph.

In this example, I add all of the bar plots for all counties in one year as graphical objects in a list, “barplot_list”. Then, using the annotation_custom() function, I add each item in “barplot_list” at the center coordinates of the corresponding polygon (county), which I already extracted. Note that the orders of the center coordinates and of the plots in “barplot_list” are the same.

At the end, I just add the base map with the customized legend (“tmp_map_final”) and with the list of all bar plots with their customized locations (“barplot_annotation_list”). Then, I add them all in a list (“all_list“) and repeat this process for every year. The last step is to save this list with saveGIF() to create an animated gif. Note that we can use the same procedure but replace the bar plot with other types of plots such as pie charts.

counties_list<- as.data.frame(unique(counties$id)) #list of counties
all_list<- list()
for (y in 1:nrow(years)){ #loop through all of the years
nass_oneyear<- subset(yield,yield$Year==years[y,]) #extract one year
barplot_list <-
##create bar plot for each county
lapply(1:length(shp_counties$JURISDIC_2), function(i) {
#extract one county
nass_oneyear_onecounty<- subset(nass_oneyear,nass_oneyear$County==counties_list[i,])
# As I mentioned, for each county and year the number of crop types might be different. So, I need to customize the color for each sub-dataset using the manual color ramp that I previously defined for each item.
nass_oneyear_onecounty$itemcolor<- "NA"
nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="CORN, GRAIN"]<- colors[1]
nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="SW, NON-IRRIGATED"]<- colors[2]
nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="SW, NON-IRRIGATED, CC"]<- colors[3]
nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="SW, NON-IRRIGATED, SF"]<- colors[4]
nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="WW, NON-IRRIGATED"]<- colors[5]
nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="WW, NON-IRRIGATED, CC"]<- colors[6]
nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="WW, NON-IRRIGATED, SF"]<- colors[7]
plots_comp <- ggplotGrob(
ggplot(nass_oneyear_onecounty,aes(x=Group,y=Value,group=(itemcolor),fill=factor(Data_Item)))+
scale_fill_manual(values=nass_oneyear_onecounty$itemcolor)+
geom_bar(stat='identity', color = "black") +
labs(x = NULL, y = NULL) +
theme(legend.position = "none", rect = element_blank(), axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x= element_blank(),
axis.ticks = element_blank(),
axis.text.y=element_text(size=14,face="bold",colour="black")) + coord_cartesian(expand=FALSE)
)})
barplot_annotation_list <- lapply(1:length(shp_counties), function(i)
annotation_custom(barplot_list[[i]], xmin = counties_centroids[i,1]- 28000, ymin = counties_centroids[i,2]- 28000, xmax = counties_centroids[i,1]+ 28000, ymax = counties_centroids[i,2]+ 28000))
# xmin, ymin, xmax and ymax can be used to adjust the horizontal and vertical locations of the bar plots
all_list[[y]] <- list(tmp_map_final + barplot_annotation_list)
}
library(animation)
saveGIF(
{lapply(all_list, print)}
, "(your directory)/final.gif", interval = 2, ani.width = 1600, ani.height = 1200)

If we want labels for our bar plots (instead of yield values on the y-axis), we may want to show the yield value corresponding to each item in a group. In this case, I can use the command lines below within the loop, before creating the graphical object with ggplotGrob(), to add a label column that shows the cumulative yield value in each group.

nass_oneyear_onecounty <- nass_oneyear_onecounty %>%
group_by(Group) %>%
mutate(label_y = cumsum(Value)-10) #I subtracted 10 from these cumulative values just to print them inside of the bar plot sections.

Or we can just have one label that shows the total yield in each group:

Then we can add these labels to the bar plots by adding the geom_text() function in the ggplot section within ggplotGrob(), and specifying the column of interest:

Instead of manually adjusting the position of the label (as in the first example), “vjust” can be added to geom_text() to modify the text alignment:

The evaluation of model performance is a widely discussed and yet extremely controversial topic in hydrology, atmospheric science, and environmental studies. Generally, it is not straightforward to quantify the accuracy of tools that simulate complex systems. One of the reasons is that these systems usually have various sub-systems. For example, a model of a complex river system will simulate regulated streamflow, water allocation, dam operations, etc. This implies that there are multiple goodness-of-fit objectives that cannot all be fully optimized simultaneously. In this situation, there will always be tradeoffs in the accuracy of the system representation.

The other reason is that these complex systems are often state-dependent, and their behavior non-linearly changes as the system evolves. Finally, there are variables in the system that have several properties, and it is usually impossible to improve all of their properties at the same time (Gupta et al., 1998). Take streamflow as an example. In calculating model accuracy, we can focus on the daily fluctuation of streamflow or look into the changes weekly, monthly, and annually. We can also concentrate on the seasonality of streamflow, extreme low and high cases, and the persistence of these extreme events. Therefore, our results are not fully defensible if we only focus on one performance metric and make inferences that ignore many other components of the system. In this blog post, I am going to explain a visualization technique that allows us to draw three or more metrics of the system on the same plot. The plot is called a Taylor Diagram. The Taylor Diagram is not really a solution to the problem that I explained, but it does provide a systematic and mathematically sound way of demonstrating a few popular goodness-of-fit measures. The original version of the Taylor Diagram includes three performance metrics: i) standard deviation of simulated vs. observed; ii) correlation coefficient between observed and simulated; and iii) centered root mean square error (CRMSE). However, there are other versions of the Taylor Diagram that include more than these three metrics (e.g., here). The Taylor Diagram was developed by Karl E. Taylor (original paper) and has been frequently used by meteorologists and atmospheric scientists. In this blog post, I focus on streamflow.

Underlying Mathematical Relationships

As mentioned above, the Taylor Diagram draws the following three statistical metrics on a single plot: standard deviation, CRMSE, and correlation. The equation below relates the three:

E′² = σ_o² + σ_s² − 2 σ_o σ_s R

In this equation, E′ is the centered root mean square error, σ_o and σ_s are the standard deviations of the observed and simulated time series, and R is the correlation coefficient between them.

This relationship can be easily derived using the definition of CRMSE, which is:

E′² = (1/N) Σₙ [(sₙ − s̄) − (oₙ − ō)]²

In this equation, sₙ and oₙ are the simulated and observed values at time step n, and s̄ and ō are their means over the N time steps.

We can expand this equation to the following:

E′² = σ_s² + σ_o² − (2/N) Σₙ (sₙ − s̄)(oₙ − ō)

The first two components of the equation are the variances of the simulated and observed time series. To understand the third component, we need to recall the mathematical definition of correlation:

R = [(1/N) Σₙ (sₙ − s̄)(oₙ − ō)] / (σ_o σ_s)

If we multiply both sides of this last equation by σ_o σ_s, we see that its numerator is exactly the third component of the expanded CRMSE equation, which recovers the first equation above. The Taylor Diagram uses polar coordinates to visualize all of these components. Readers can refer to the original paper for more discussion and the trigonometric proofs behind this form of plot.
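The identity relating CRMSE, the two standard deviations, and the correlation is easy to verify numerically. Below is a quick check in Python on synthetic observed and simulated series (standard deviations computed with ddof = 0, matching the derivation):

```python
import numpy as np

# synthetic observed and simulated series (stand-ins for real streamflow)
rng = np.random.default_rng(0)
obs = rng.normal(10.0, 3.0, 1000)
sim = obs + rng.normal(0.0, 1.5, 1000)

sigma_o, sigma_s = obs.std(), sim.std()   # standard deviations (ddof=0)
R = np.corrcoef(obs, sim)[0, 1]           # correlation coefficient

# CRMSE squared: mean squared difference of the mean-removed series
crmse2 = np.mean(((sim - sim.mean()) - (obs - obs.mean())) ** 2)

# right-hand side of the identity
rhs = sigma_o**2 + sigma_s**2 - 2 * sigma_o * sigma_s * R
```

The two quantities agree to floating-point precision.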

Code Example

In this blog post, I am presenting a simple R package that can be used to create Taylor Diagrams. In this example, I use the same streamflow datasets that I explained in the previous blog post. First, you need to download the time series from GitHub and use the following commands to read these two time series into R.

In this post, I use the “openair” R library, which was originally developed for atmospheric chemistry data analysis. In addition to the Taylor Diagram, the package provides many helpful visualization options that can be accessed from here. Note that I was able to find at least one more R package that can be used to create Taylor Diagrams (i.e., plotrix). There are also Python and MATLAB libraries for creating Taylor Diagrams.

You can use the following to install the “openair” package and activate the library.

install.packages("openair")
library(openair)

The following command can be used to create a Taylor Diagram.

combined_dataset <- cbind(Observed_Arrow[18263:length(Observed_Arrow[,1]),], Arrow_sim = Simulated_Arrow[1:10592,4])
TaylorDiagram(combined_dataset, obs = "ARROW_obs", mod = "Arrow_sim")

Interpretation of the Taylor Diagram

The Taylor Diagram marks the baseline observed point, where the correlation is 1 and the CRMSE is 0 (purple). If the simulation point is close to the observed point, the two series are similar in terms of standard deviation, their correlation is high, and their CRMSE is close to zero. There is also a black dashed line that represents the standard deviation of the observed time series; if the red dot falls above this line, the simulated dataset has a higher variation. The other helpful information we gain from the Taylor Diagram is the correlation between observed and simulated values: a higher correlation shows a higher level of agreement, and the correlation decreases as a point moves toward higher sectors of the graph (away from the horizontal axis). The centered root mean square error also reflects the quality of the simulation, although it puts more weight on outliers; the golden contour lines on the polar plot show the value of the CRMSE. On this figure, the red point has a higher standard deviation than the observations, the correlation is around 0.9, and the CRMSE is close to 20,000.

The package also allows us to look at the performance of the model during different months or years, and we can compare different simulation scenarios with the observed data. Here, I use the function to draw a point for each month. The “group” argument can be used to do that.

TaylorDiagram(combined_dataset, obs = "ARROW_obs", mod = "Arrow_sim", group = "Month")

The function also provides a simple way to break down the data into different plots for other attributes. For example, I created four panels for four different annual periods:

TaylorDiagram(combined_dataset, obs = "ARROW_obs", mod = "Arrow_sim", group = "Month", type = "Year")

This blogpost is about generating network graphs overlaid on spatial maps. I’ll be using the data provided by this paper (in the supplementary material), which estimates flows of food across US counties. All the code I’m using here can be found here.

The dataset included in erl_14_8_084011_sd_3.csv of the supplementary material lists the tons of food transported per food category, using the standard classification of transported goods (SCTG) food categories included in the study. The last two columns, ori and des, indicate the origin and destination counties of each flow, using FIPS codes.

To draw the network nodes (the counties) in their geographic locations, I had to identify lat and lon coordinates for each county using its FIPS code, which can be found here.

Now, let's draw these connections in Python, using NetworkX and Basemap. The entire script is here; I'll just be showing the important snippets below. In the paper, the authors limit the visualization to the largest 5% of food flows, which I can confirm is necessary, as otherwise the figure would be unreadable. We first load the data using pandas (or any other package that reads CSV files), identify the 95th percentile, and restrict the data to only those 5% largest flows.

import pandas as pd
import numpy as np

data = pd.read_csv('erl_14_8_084011_sd_3.csv')
# Keep only the largest 5% of flows by total tonnage
threshold = np.percentile(data['total'], 95)
data = data.loc[data['total'] > threshold]

Using NetworkX, we can directly create a network out of these data. The most important things I need to define are the dataframe column that lists my source nodes, the column that lists my destination nodes, and which attribute makes up my network edges (the connections between nodes); in this case, the total food flows.

import networkx as nx

G = nx.from_pandas_edgelist(df=data, source='ori', target='des',
                            edge_attr='total', create_using=nx.DiGraph())

Drawing this network without the spatial information attached (using the standard nx.draw(G)) looks something like below, which does hold some information about the structure of this network, but misses the spatial information we know to be associated with those nodes (counties).

To associate the spatial information with those nodes, we’ll employ Basemap to create a map and use its projection to convert the lat and lon values of each county to x and y positions for our matplotlib figure. When those positions are estimated and stored in the pos dictionary, I then draw the network using the specific positions. I finally also draw country and state lines. You’ll notice that I didn’t draw the entire network but only the edges (nx.draw_networkx_edges) in an effort to replicate the style of the figure from the original paper and to declutter the figure.
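As a minimal sketch of that step (the FIPS codes, coordinates, and flow values below are made up for illustration, and the Basemap projection is replaced by using raw lon/lat pairs as plot coordinates), the positions go in a pos dictionary that is passed to nx.draw_networkx_edges:

```python
import networkx as nx
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt

# Hypothetical lat/lon lookup keyed by county FIPS code (illustrative values)
coords = {6037: (34.05, -118.24),    # Los Angeles, CA
          6071: (34.10, -117.29),    # San Bernardino, CA
          48201: (29.76, -95.37)}    # Harris, TX

G = nx.DiGraph()
G.add_edge(6037, 48201, total=1000.0)  # made-up flows in tons
G.add_edge(6071, 6037, total=500.0)

# With Basemap, positions come from the projection: x, y = m(lon, lat).
# Here raw (lon, lat) pairs stand in for projected coordinates.
pos = {fips: (coords[fips][1], coords[fips][0]) for fips in G.nodes()}

fig, ax = plt.subplots(figsize=(8, 5))
# Draw only the edges, mimicking the decluttered style of the original figure
nx.draw_networkx_edges(G, pos, ax=ax, edge_color='tan', arrows=False)
fig.savefig('food_flows.png', dpi=150)
```

On the real dataset, the same pos dictionary would be built from the FIPS-to-coordinate table and the Basemap projection before drawing the edges.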

I was also interested in replicating some of the analysis done in the paper, using NetworkX, to identify the counties most critical to the structure of the food flow network. Using the entire network now (not just the top 5% of flows), we can use NetworkX functions to calculate each node's degree and betweenness centrality. The degree indicates the number of nodes a node is connected to; betweenness centrality indicates the fraction of shortest paths between pairs of nodes that pass through a specific node. These are network metrics that are unrelated to the physical distance between two counties and can be used (along with several other metrics) to make inferences about the importance and position of a specific node in a network. We can calculate them in NetworkX as shown below and plot them using simple pyplot commands:

import matplotlib.pyplot as plt

# Node degree: number of counties each county is connected to
connectivity = list(G.degree())
connectivity_values = [n[1] for n in connectivity]
# Betweenness centrality: fraction of shortest paths passing through each node
centrality = list(nx.betweenness_centrality(G).values())
plt.figure(figsize=(12, 8))
plt.plot(centrality, connectivity_values, 'ro')
plt.xlabel('Node centrality', fontsize='large')
plt.ylabel('Node connectivity', fontsize='large')
plt.savefig("node_connectivity.png", dpi=300)
plt.show()

The resulting figure is shown below, matching the equivalent Fig. 6 of the original paper. As the authors point out, there are some counties in this network, those with high connectivity and high centrality, that are most critical to its structure: San Bernardino, CA; Riverside, CA; Los Angeles, CA; Shelby, TN; San Joaquin, CA; Maricopa, AZ; San Diego, CA; Harris, TX; and Fresno, CA.

This blog post intends to demonstrate some machine learning tips that I have acquired through my PhD so far that I hope can be helpful to others, especially if you are working in water resources applications, in basins with complex hydrology, or with limited data.

Addressing Correlation and Multicollinearity

The first step in any machine learning problem is often to identify the information that will be most relevant to what you are predicting for your output. If you are in the rare situation where you have a large number of inputs available, the best course of action is not to assume that every single one of those inputs is improving the predictability of your model. If your inputs are highly collinear, then ultimately, you are adding dimensionality to your model without receiving any predictive power in return. A couple of suggestions:

Assess the correlation of your inputs with your outputs in the training set. It’s a good idea to just note which variables are most correlated with your output, because those will likely be extremely important for prediction.

As you add inputs, you should check for multicollinearity among the selected input variables. You can calculate a correlation matrix that shows the correlation among the variables and develop a criterion for dropping highly correlated ones. You can also calculate the Variance Inflation Factor (VIF), which quantifies how much the variance of an estimated regression coefficient is inflated due to collinearity with the other inputs.
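As a minimal sketch of this check (the data here are synthetic, and the vif helper is a hypothetical hand-rolled implementation using NumPy alone rather than a library routine), the VIF of input j is 1/(1 - R²_j), where R²_j comes from regressing input j on all the other inputs:

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X (n_samples x n_features).
    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j
    on the remaining columns (with an intercept)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.1 * rng.normal(size=200)   # nearly collinear with a
c = rng.normal(size=200)             # independent of both
X = np.column_stack([a, b, c])

corr = np.corrcoef(X, rowvar=False)  # correlation matrix among inputs
vifs = vif(X)                        # a and b get large VIFs; c stays near 1
```

A common rule of thumb is to flag inputs with a VIF above roughly 5 to 10 as candidates for removal.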

There are many algorithms available for information selection, including an Iterative Input Selection (IIS) algorithm which uses a regression tree approach to iteratively select candidate inputs based on correlation and returns the most predictive subset of inputs.

Incorporation of Time as an Input Variable

Water resources applications tend to be characterized by some periodicity due to seasonality, so it may be obvious that including the day or month of the year as an input can provide information about the cyclic nature of the data. What may be less obvious is making sure that the machine learning model understands that cyclic nature. If raw day or month values (1-365 or 1-12) are used as predictor variables, this gives the impression that day 1 is vastly different from day 365, or that January and December are the least related months, which is not actually the case and can send the wrong message to the algorithm. The trick is to create two new features for each of the day and month time series by taking the sine and cosine of each value. Both sine and cosine are needed to get a unique mapping for each point in the cycle. If we plot these features, you can see that the cyclic nature is preserved, compared to the uneven jumps in the raw values.
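The encoding can be sketched as follows for a day-of-year series (a fixed 365-day year is assumed here for simplicity):

```python
import numpy as np

day = np.arange(1, 366)  # day of year, 1 through 365
day_sin = np.sin(2 * np.pi * day / 365)
day_cos = np.cos(2 * np.pi * day / 365)
# Day 1 and day 365 now map to nearly identical (sin, cos) pairs,
# so the model sees December 31 and January 1 as neighbors.
```

The same transform applies to months with a period of 12 instead of 365.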

Aggregation of Variables

When working with rainfall-runoff prediction, you might see peaks in runoff that are somewhat lagged from the precipitation event of interest. This can create difficulties for the machine learning model, which potentially could see a zero value for precipitation on a day with an increased outflow. This can be due to residence time and different types of storage in the basin. Using a memory-based model such as an LSTM and/or an aggregation of past precipitation over various timescales can help to capture these effects.

Immediate Effects: Aggregation of precipitation over the last few hours to a few days will likely capture run-off responses and a few initial ‘buckets’ of storage (e.g. canopy, initial abstraction, etc.)

Medium-Term Effects: Aggregation over a few days to weeks might capture the bulk of soil storage layers (and movement between them and the stream (interflow) or deep groundwater)

Long-Term Effects: Ideally, we would have snow information like SWE or snowpack depth if it is relevant, but in my test cases, this information was not available. Aggregation of precipitation over the past water year is a proxy that can be used to capture snow storage.
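These three timescales can be sketched with pandas rolling sums (the precipitation series below is synthetic, and the window lengths are illustrative choices rather than prescriptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Two years of hypothetical daily precipitation (mm)
precip = pd.Series(rng.gamma(0.3, 5.0, size=730),
                   index=pd.date_range('2020-10-01', periods=730, freq='D'))

features = pd.DataFrame({
    'p_3d':   precip.rolling(3).sum(),    # immediate runoff response
    'p_30d':  precip.rolling(30).sum(),   # soil storage / interflow
    'p_365d': precip.rolling(365).sum(),  # past-year proxy for snow storage
})
```

The aggregated columns would then be joined with the other inputs before training.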

Interaction Variables

Classifiers often treat variables as independent, but the interaction of variables can explain specific events that would not be captured by the variables alone. Think of an instance in which you are trying to predict yield from water and fertilizer input variables. If you have no water but you have fertilizer, the yield will be zero. If you have water and no fertilizer, the yield will be some value greater than zero. However, if you have both water and fertilizer, the yield will be greater than what each variable can produce alone. This is a demonstration of interaction effects, and the way to encode it is to add another input that is the product of the amount of water and the amount of fertilizer. With this term, the model can attempt to capture these secondary effects. In a snow-driven watershed with limited information, interaction terms can be used to help predict runoff as follows:

For a given amount of precipitation, the flow will change based on temperature. For instance, if a watershed receives 100mm of precipitation at 1C, there will be lots of runoff. If a watershed receives 100mm of precipitation at -10C, there will be very little runoff because that precipitation will be stored as snow. However, if there is 100mm of precipitation at 10C, this will equate to lots of runoff plus lots of snowmelt. (interaction term = precipitation*temperature)

For a given amount of precipitation and temperature, the flow will change based on time of year. For instance, 100mm of precipitation at 10C in October will lead to lots of runoff. However, that same amount of precipitation at 10C in February can create flooding events, not only from the precipitation but also from additional rain-on-snow events. (interaction term = precipitation*temperature*sin(month)*cos(month))

Even independent of precipitation, the response of the watershed to various temperatures will be affected by the month of the water year or day of year. A temperature of 30C in April will probably lead to a flood event just from the snowmelt it would trigger. That same temperature in September will likely lead to no change in runoff. (interaction term = temperature*sin(month)*cos(month))
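The three interaction terms above can be sketched as extra dataframe columns (the sample values are made up for illustration, and the cyclic month encoding follows the sine/cosine trick described earlier):

```python
import numpy as np
import pandas as pd

# Hypothetical samples: precipitation (mm), temperature (C), calendar month
df = pd.DataFrame({'precip': [100.0, 100.0, 0.0],
                   'temp':   [1.0, -10.0, 30.0],
                   'month':  [10, 2, 4]})

# Cyclic month encoding
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

# Interaction terms from the three cases described above
df['precip_x_temp'] = df['precip'] * df['temp']
df['precip_x_temp_x_month'] = (df['precip_x_temp']
                               * df['month_sin'] * df['month_cos'])
df['temp_x_month'] = df['temp'] * df['month_sin'] * df['month_cos']
```

These columns are simply appended to the input matrix; the model then decides how much weight each interaction deserves.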

Hopefully these tips can be useful to you on your quest for making the most of your machine learning information!

A few weeks ago I filmed a video training guide to the Rhodium framework for the annual meeting of the Society for Decision Making Under Deep Uncertainty. Rhodium is a Python library that facilitates Many Objective Robust Decision Making. The training walks through a demonstration of Rhodium using the Lake Problem, and introduces a live Jupyter notebook Antonia and I created using Binder.