# Water Programming Blog Guide (Part 2)

Water Programming Blog Guide (Part 1)

This second part of the blog guide will cover the following topics:

1. Version control using git
2. Generating maps and working with spatial data in python
3. Reviews on synthetic streamflow and synthetic weather generation
4. Conceptual posts

## 1. Version Control using git

If you are developing code, it’s worth the time to gain familiarity with git to maintain reliable and stable development. Git allows a group of people to work together on large projects while minimizing the chaos when multiple people are editing the same files. It is also valuable for individual projects, as it allows you to keep multiple versions of a project, show the changes you have made over time, and undo those changes if necessary. For a quick introduction to git terminology and functionality, check out Intro to git Part 1: Local version control and Intro to git Part 2: Remote Repositories, which will guide you through your first git project (local or remote) while providing a set of useful commands. Other specialized tips can be found in Git branch in bash prompt and GitHub Pages. And if you are wondering how to use git with PyCharm, you’ll find this pair of posts useful: A Guide to Using Git in PyCharm – Part 1 and A Guide to Using Git in PyCharm – Part 2.

## 2. Generating maps and working with spatial data in python

To learn more about python’s capabilities on this subject,  this  lecture  summarizes key python libraries relevant for spatial analysis.  Also,  Julie and the Jons have documented their efforts when working with spatial data and with python’s basemap, leaving us with some valuable examples:

### Working with raster data

Python – Extract raster data value at a point

Python – Clip raster data with a shapefile

### Generating maps

Making Watershed Maps in Python

Plotting geographic data from geojson files using Python

### Generating map animations

Python makes the world go ’round

Making Movies of Time-Evolving Global Maps with Python

## 3. Reviews on synthetic streamflow and weather generation

We are lucky to have thorough reviews on synthetic weather and synthetic streamflow generation written by our experts Julie and Jon L. The series on synthetic weather generation consists of five parts. Part I and Part II cover parametric and non-parametric methods, respectively. Part III covers multi-site generation. Part IV discusses how to modify both parametric and non-parametric methods to simulate weather with climate change projections, and finally Part V covers how to simulate weather with seasonal climate forecasts.

The synthetic streamflow review provides a historical perspective while answering key questions such as “Why do we care about synthetic streamflow generation?”, “Why do we use it in water resources planning and management?” and “What are the different methods available?”

Synthetic streamflow generation

## 4.  Conceptual posts

### Multi-objective evolutionary algorithms (MOEAs)

We frequently use multi-objective evolutionary algorithms due to their power and flexibility to solve multi-objective problems in water resources applications, so you’ll find sufficient documentation in the blog on basic concepts, applications and performance metrics:

You have a problem integrated into your MOEA, now what?

On constraints within MOEAs

MOEA Performance Metrics

### Many Objective Robust Decision Making (MORDM) and Problem framing

The next post discusses the MORDM framework, which combines many-objective evolutionary optimization, robust decision making, and interactive visual analytics to frame and solve many-objective problems under uncertainty. This is valuable reading, along with the references within. The second post listed provides a systematic way of thinking about problem formulation and defines the key components of a many-objective problem:

Many Objective Robust Decision Making (MORDM): Concepts and Methods

“The Problem” is the Problem Formulation! Definitions and Getting Started

### Econometric analysis and handling multi-variate data

To close this second part of the blog guide, I leave you with a couple of selected topics from the Econometrics and Multivariate Statistics courses at Cornell, documented by Dave Gold:

A visual introduction to data compression through Principal Component Analysis

Dealing With Multicollinearity: A Brief Overview and Introduction to Tolerant Methods

# Introduction To Econometrics: Part II- Violations of OLS Assumptions & Methods for Fixing them

Regression is the primary tool used in econometrics to infer relationships between a group of explanatory variables, X and a dependent variable, y. My previous post focused on the mechanics of Ordinary Least Squares (OLS) Regression and outlined key assumptions that, if true, make OLS estimates the Best Linear Unbiased Estimator (BLUE) for the coefficients in the regression:

$y = \beta X+\epsilon$

This post will discuss three common violations of OLS assumptions and explain tools that have been developed for dealing with these violations. We’ll start with a violation of the assumption of a linear relationship between X and y, then discuss heteroskedasticity in the error terms and the issue of endogeneity.

### Linearity

If the relationship between X and y is not linear, OLS can no longer be used to estimate beta. A nonlinear regression of y on X has the form:

$y = g(X\beta)+\epsilon$

Where $g(X\beta)$ is the functional form of the nonlinear relationship between X and y and $\epsilon$ is the error term. Beta can be estimated using Nonlinear Least Squares (NLS) regression. Similar to OLS regression, NLS seeks to minimize the sum of the squared error terms:

$\hat{\beta} = argmin_{\beta}\ \epsilon'\epsilon = (y-g(X\beta))'(y-g(X\beta))$

To solve for beta, we again take the derivative and set it equal to zero, but for the nonlinear system there is no closed form solution, so the estimators have to be found using numerical optimization techniques.

The variance of a NLS estimator is:

$\hat{var}(\hat{\beta}_{NLS}) = \hat{\sigma}^2(\hat{G}'\hat{G})^{-1}$

Where $\hat{G}$ is a matrix of the partial derivatives of g with respect to each element of Beta.

Modern numerical optimization techniques can solve many NLS equations quite easily, making NLS a common alternative to OLS regression, especially when there is a hypothesized functional form for the relationship between X and y.
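As a minimal sketch of how this works in practice, the snippet below assumes a hypothetical exponential functional form, $g(X\beta) = exp(X\beta)$, generates synthetic data, and uses base MATLAB’s fminsearch to do the numerical minimization:

% Minimal NLS sketch assuming a hypothetical exponential form g(X*beta) = exp(X*beta)
rng(1)                                    % reproducibility
n = 200; X = [ones(n,1) randn(n,1)];      % synthetic explanatory variables
beta_true = [0.5; 0.8];
y = exp(X*beta_true) + 0.1*randn(n,1);    % synthetic dependent variable

sse = @(b) sum((y - exp(X*b)).^2);        % sum of squared errors as a function of beta

% No closed-form solution, so minimize numerically (Nelder-Mead simplex)
beta_nls = fminsearch(sse, [0; 0]);

% Var(beta_NLS) = sigma^2*(G'G)^-1, where G holds the partial derivatives of g
G = exp(X*beta_nls).*X;                         % d g(x_i*beta)/d beta_j = exp(x_i*beta)*x_ij
sigma2 = sse(beta_nls)/(n - numel(beta_nls));   % residual variance estimate
var_beta_nls = sigma2*inv(G'*G);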

### Heteroskedasticity

Heteroskedasticity arises within a data set when the errors do not have a constant variance with respect to X. In equation form, under heteroskedasticity:

$E(\epsilon_i^2|X ) \neq \sigma^2$

The presence of heteroskedasticity  increases the variance of Beta estimators found using OLS regression, reducing the efficiency of the estimator and causing it to no longer be the BLUE. As put by Allison (2012), OLS on heteroskedastic data puts “equal weight on all observations when, in fact, observations with larger disturbances contain less information”.

To fix this problem, the econometric literature provides two options, both of which use a form of weighting to correct for differences in variance among the error terms:

1. Use the OLS estimate for beta, but calculate the variance of beta with a robust variance-covariance matrix.
2. Estimate Beta using Feasible Generalized Least Squares (FGLS).

Let’s begin with the first strategy, using OLS beta estimates with a robust variance-covariance matrix. The robust variance-covariance matrix can be derived using the Generalized Method of Moments (GMM); for the sake of brevity, I’ll omit the derivation here and skip to the final result:

$\hat{var}(\hat{\beta}) = (X'X)^{-1}(X'\hat{D}X)(X'X)^{-1}$

Where $\hat{D}$ is a diagonal matrix with the squared residuals from the OLS regression on its diagonal, $\hat{D} = diag(\hat{\epsilon}_1^2, \ldots, \hat{\epsilon}_n^2)$.
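As a minimal sketch (assuming an n x k design matrix X with an intercept column and an n x 1 vector y are already in memory), the robust variance-covariance matrix can be computed directly:

% Heteroskedasticity-robust (sandwich) variance estimate for the OLS coefficients.
beta_ols = (X'*X)\(X'*y);              % OLS estimate
e = y - X*beta_ols;                    % OLS residuals
D = diag(e.^2);                        % squared residuals on the diagonal
XX_inv = inv(X'*X);
var_robust = XX_inv*(X'*D*X)*XX_inv;   % (X'X)^-1 (X'DX) (X'X)^-1
se_robust = sqrt(diag(var_robust));    % robust standard errors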

The second strategy, estimation using FGLS, requires a more involved process for estimating beta. FGLS can be accomplished through 3 steps:

1. Use OLS to find OLS estimate for beta and calculate the residuals:

$\hat{\epsilon}_i = y_i-x_i \hat{\beta}_{OLS}$

2. Regress the squared residuals on a subset of X, which we will call Z, to get an estimate of a new parameter, $\tilde{\theta}$. We then use this parameter to estimate the variance of the error term, $\sigma^2$, for each observation:

$\hat{\sigma}^2_i = z_i\tilde{\theta}$

A diagonal matrix, D (different than the D used for the robust variance-covariance matrix), is then constructed using these variance estimates.

3. Finally, we use the matrix D to find our FGLS estimator for beta:

$\hat{\beta}_{FGLS} = (X'\hat{D}^{-1}X)^{-1}(X'\hat{D}^{-1}y)$

The variance of the FGLS beta estimate is then defined as:

$\hat{var}(\hat{\beta}_{FGLS}) = (X'\hat{D}^{-1}X)^{-1}$
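A minimal sketch of these three steps in MATLAB, assuming for simplicity that Z = X and guarding against non-positive variance estimates:

% Minimal FGLS sketch following the three steps above (assumes X and y exist).
beta_ols = (X'*X)\(X'*y);
e = y - X*beta_ols;                      % step 1: OLS residuals

Z = X;                                   % step 2: regress squared residuals on Z
theta = (Z'*Z)\(Z'*(e.^2));
sigma2_i = max(Z*theta, 1e-6);           % fitted error variance for each observation
D = diag(sigma2_i);

beta_fgls = (X'/D*X)\(X'/D*y);           % step 3: (X'D^-1 X)^-1 X'D^-1 y
var_fgls  = inv(X'/D*X);                 % (X'D^-1 X)^-1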

### Endogeneity

Endogeneity arises when explanatory variables are correlated with the error term in a regression. This may be a result of simultaneity (when errors and explanatory variables are affected by the same exogenous influences), omitted variable bias (when an important variable is left out of a regression, causing the over- or underestimation of the effect of other explanatory variables and the error term), measurement error, or a lag in the dependent variable. Endogeneity can be hard to detect and may cause large errors in regression results.

A common way of correcting for endogeneity is through Instrumental Variables (IVs). Instrumental variables are explanatory variables that are highly correlated with the variables that cause endogeneity but are exogenous to the system. Examples include using proximity to cardiac care centers as an IV for heart surgery when modeling health, or state cigarette taxes as an IV for maternal smoking rate when modeling infant birth weight (Angrist and Krueger, 2001). For an expansive but accessible overview of IVs and their many applications, see Angrist and Krueger (2001).

A common technique for conducting a regression using IVs is Two-Stage Least Squares (2SLS) regression. The procedure is as follows:

1. Define Z as a new set of explanatory variables, which omits the endogenous variables and includes the IVs (which are usually not included in the original OLS regression).
2. Project X onto the column space of Z.
3. Estimate the 2SLS coefficients using this projection:

$\hat{\beta}_{2SLS} = [X'Z(Z'Z)^{-1}Z'X]^{-1}[X'Z(Z'Z)^{-1}Z'y]$
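As a sketch, the 2SLS estimator can be computed directly, assuming X holds all explanatory variables (including the endogenous ones), Z holds the exogenous variables plus the instruments, and y is the dependent variable:

% Minimal 2SLS sketch.
Pz = Z*((Z'*Z)\Z');                  % projection onto the column space of Z
beta_2sls = (X'*Pz*X)\(X'*Pz*y);     % [X'Z(Z'Z)^-1 Z'X]^-1 [X'Z(Z'Z)^-1 Z'y]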

Using 2SLS regression to correct for endogeneity is fairly simple; however, identifying good IVs for an endogenous variable can be extremely difficult. Finding a good IV (or set of IVs) can be enough to get one published in an economics journal (at least that’s what my economist friend told me).

## Concluding thoughts

These two posts have constituted an extremely brief introduction to the field of econometrics, meant for engineers who may be interested in learning about common empirical tools employed by economists. We covered the above methods in much more detail in class and also covered other topics such as panel data, Generalized Method of Moments estimation, Maximum Likelihood Estimation, systems of equations in regression, and discrete choice modeling. Overall, I found the course (AEM 7100) to be a useful introduction to a field that I hope to learn more about over the course of my PhD.

### References:

Allison, Paul D. (2012). Multiple Regression: A Primer. Thousand Oaks, CA: Pine Forge Press.

Angrist, J.; Krueger, A. (2001). “Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments”. Journal of Economic Perspectives. 15 (4): 69–85. doi:10.1257/jep.15.4.69.

# An Introduction To Econometrics: Part 1- Ordinary Least Squares Regression

I took a PhD-level econometrics course this semester in the Applied Economics and Management department here at Cornell, and I thought I’d share some of what I learned. Overall, I enjoyed the course and learned a great deal. It was very math and theory heavy, but Professor Shanjun Li did a nice job keeping the class lively and interesting. I would recommend the class to future EWRS students who may be looking for some grounding in econometrics, provided they’ve taken some basic statistics and linear algebra courses.

So let’s start with the basics: what does the term “econometrics” even mean? Hansen (2010) defined econometrics as “the unified study of economic models, mathematical statistics and economic data”. After taking this introductory course, I’m inclined to add my own definition: econometrics is “a study of the problems with regression using Ordinary Least Squares (OLS) and how to solve them”. This is obviously a gross oversimplification of the field; however, regression through OLS was the primary tool used for finding insights and patterns within data, and we spent the vast majority of the course examining it. In this post I’ll briefly summarize OLS mechanics and the classical OLS assumptions. In my next post, I’ll detail methods for dealing with violations of OLS assumptions. My hope is that reading this may help you understand some key terminology and the reasoning behind why certain econometric tools are employed.

## OLS mechanics

Our primary interest when creating an econometric model is to estimate some dependent variable, y, using observations from a set of independent variables, X. Usually y is a vector of length n, where n is the number of observations, and X is a matrix of size (n x k), where k is the number of explanatory variables (you can think of X as a table of observations, where each column contains a different variable and each row represents one observation). The goal of OLS regression is to estimate the coefficients, beta, for the model:

$y = X\beta+\epsilon$

Where beta is a k by 1 vector of coefficients on X and epsilon is an n by 1 vector of error terms.

OLS regression estimates beta by minimizing the sum of the square error term (hence the name “least squares”). Put in matrix notation, OLS estimates beta using the equation:

$\hat{\beta} = argmin_{\beta} SSE_N(\beta) = \epsilon ' \epsilon$

The optimal beta estimate can be found through the following equations:

$\epsilon = y-X\hat{\beta}$

$\epsilon ' \epsilon = (y-X\hat{\beta})'(y-X\hat{\beta})$

Taking the derivative and setting it equal to zero:

$-2X'y+2X'X\hat{\beta} = 0$

Then solving for the beta estimate:

$\hat{\beta} = (X'X)^{-1}X'y$
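As a quick illustration, here is a minimal sketch with a hypothetical data set that computes the OLS estimate, along with the estimator variance that appears under assumption 4 below:

% Minimal OLS sketch with a hypothetical data set.
rng(1)
n = 100;
X = [ones(n,1) randn(n,2)];          % n x k design matrix (k = 3 with an intercept)
beta_true = [1; 2; -0.5];
y = X*beta_true + 0.3*randn(n,1);    % n x 1 dependent variable with noise

beta_hat = (X'*X)\(X'*y);                       % (X'X)^-1 X'y (backslash avoids an explicit inverse)
e_hat = y - X*beta_hat;                         % estimated error terms
sigma2_hat = (e_hat'*e_hat)/(n - size(X,2));    % residual variance estimate
var_beta = sigma2_hat*inv(X'*X);                % var(beta_hat) = sigma^2 (X'X)^-1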

Estimation of y using OLS regression can be visualized as the orthogonal projection of the vector y onto the column space of X. The estimated error term, epsilon, is the orthogonal distance between the projection and the true vector y.  Figure 1 shows this projection for a y that is regressed on two explanatory variables, X1 and X2.

Figure 1: OLS regression as an orthogonal projection of vector y onto the column space of matrix X. The error term, $\hat{\epsilon}$, is the orthogonal distance between y and $X\hat{\beta}$. (Image source: Wikipedia commons)

## Assumptions and properties of OLS regression

The Gauss-Markov Theorem states that under a certain set of assumptions, the OLS estimator is the Best Linear Unbiased Estimator (BLUE) for vector y.

To understand the full meaning of the Gauss-Markov theorem, it’s important to define two fundamental properties that can be used to describe estimators: consistency and efficiency. An estimator is consistent if its value will converge to the true parameter value as the number of observations goes to infinity. An estimator is efficient if its asymptotic variance is no larger than the asymptotic variance of any other possible consistent estimator for the parameter. In light of these definitions, the Gauss-Markov Theorem can be restated as: the estimator found using OLS will be the most efficient consistent estimator for beta as long as the classical OLS assumptions hold. The remainder of this post will be devoted to describing the necessary assumptions for the OLS estimator to be the BLUE and detailing fixes for when these assumptions are violated.

The four classical assumptions for OLS to be the BLUE are:

1. Linearity: The relationship between X and y is linear, following the functional form:

$y = X\beta+\epsilon$.

2. Strict exogeneity: The error terms $\epsilon$ should be independent of the values of the explanatory variables, X. Put in equation form, this assumption requires:

$E(\epsilon_i|X) = 0$

$E(\epsilon_i) =0$

3. No perfect multicollinearity: columns of X should not be perfectly correlated with each other (see my earlier post on dealing with multicollinearity for fixes for violations of this assumption).

4. Spherical Error: Error terms should be homoskedastic, meaning they are evenly distributed around the X values. Put in equation form:

$E(\epsilon_i^2|X) =\sigma^2$

Where $\sigma^2$ is a constant value, and the error terms should be uncorrelated across observations ($i \neq j$):

$E(\epsilon_i \epsilon_j|X)=0$

Using assumption 4, we can define the variance of $\hat{\beta}$ as:

$var(\hat{\beta}_{OLS}) = \sigma^2(X'X)^{-1}$

If assumptions 1-4 hold, then the OLS estimate for beta is the BLUE. If, however, any of the assumptions are broken, we must employ other methods for estimating our regression coefficients.

In my next post I’ll detail the methods econometricians use when these assumptions are violated.

### References:

Hansen, Bruce. “Econometrics”. 2010. University of Wisconsin

http://www.ssc.wisc.edu/~bhansen/econometrics/Econometrics2010.pdf

# A visual introduction to data compression through Principal Component Analysis

Principal Component Analysis (PCA) is a powerful tool that can be used to create parsimonious representations of a multivariate data set. In this post I’ll code up an example from Dan Wilks’ book Statistical Methods in the Atmospheric Sciences to visually illustrate the PCA process. All code can be found at the bottom of this post.

As with many of the examples in Dr. Wilks’ excellent textbook, we’ll be looking at minimum temperature data from Ithaca and Canandaigua, New York  (if anyone is interested, here is the distance between the two towns).  Figure 1 is a scatter plot of the minimum temperature anomalies at each location for the month of January 1987.

Figure 1: Minimum temperature anomalies in Ithaca and Canandaigua, New York in January 1987

As you can observe from Figure 1, the two data sets are highly correlated; in fact, they have a Pearson correlation coefficient of 0.924. Through PCA, we can identify the primary mode of variability within this data set (its largest “principal component”) and use it to create a single variable which describes the majority of variation in both locations. Let x define the matrix of our minimum temperature anomalies in both locations. The eigenvectors (E) of the covariance matrix of x describe the primary modes of variability within the data set. PCA uses these eigenvectors to create a new matrix, u, whose columns contain the principal components of the variability in x.

$u = xE$

Each element in u is a linear combination of the original data, with the eigenvectors in E serving as a kind of weighting for each data point. The first column of u corresponds to the eigenvector associated with the largest eigenvalue of the covariance matrix. Each successive column of u represents a different level of variability within the data set, with u1 describing the direction of highest variability, u2 describing the direction of the second highest variability, and so on. The transformation resulting from PCA can be visualized as a rotation of the coordinate system (or change of basis) for the data set; this rotation is shown in Figure 2.

Figure 2: Geometric interpretation of PCA

As can be observed in Figure 2, each data point can now be described by its location along the newly rotated axes, which corresponds to its value in the newly created matrix u. The point (16, 17.8), highlighted in Figure 2, can now be described as (23, 6.6), meaning that it is 23 units away from the origin in the direction of highest variability and 6.6 units away in the direction of second highest variability. As shown in Figure 2, the question of “how different from the mean” each data point is can mostly be answered by looking at its corresponding u1 value.

Once transformed, the original data can be recovered through a process known as synthesis. Synthesis can be thought of as PCA in reverse. The elements in the original data set, x, can be approximated using the eigenvectors of the covariance matrix and the first principal component, u1.

$\tilde{x} = \tilde{u}\tilde{E}^T$

Where $\tilde{x}$ is the reconstructed data set, $\tilde{u}$ contains the principal components used for reconstruction (in our case the first column of u), and $\tilde{E}$ contains the corresponding eigenvectors.

For our data set, these reconstructions seem to work quite well, as can be observed in Figure 3.

Data compression through PCA can be a useful alternative to the tolerant methods for dealing with multicollinearity, which I discussed in my previous post. Rather than running a constrained regression, one can simply compress the data set to eliminate sources of multicollinearity. PCA can also be a helpful tool for identifying patterns within your data set or simply creating more parsimonious representations of a complex set of data. The Matlab code used to create the above plots can be found below.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Ithaca_Canandagua_PCA
% By: D. Gold
% Created: 3/20/17
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% This script will perform Principal Component Analysis on minimum
% temperature data from Ithaca and Canandaigua in January 1987, provided in
% Appendix A of Wilks (2011). It will then estimate minimum temperature
% values at both locations using the first principal component.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% create data sets
clear all

% data from appendix Wilks (2011) Appendix A.1
Ith = [19, 25, 22, -1, 4, 14, 21, 22, 23, 27, 29, 25, 29, 15, 29, 24, 0,...
2, 26, 17, 19, 9, 20, -6, -13, -13, -11, -4, -4, 11, 23]';

Can = [28, 28, 26, 19, 16, 24, 26, 24, 24, 29, 29, 27, 31, 26, 38, 23,...
13, 14, 28, 19, 19, 17, 22, 2, 4, 5, 7, 8, 14, 14, 23]';

%% center the data, plot temperature anomalies, calculate covariance and eigs

% center the data
x(:,1) = Ith - mean(Ith);
x(:,2) = Can - mean(Can);

% plot anomalies
figure
scatter(x(:,1),x(:,2),'Filled')
xlabel('Ithaca min temp anomaly ({\circ}F)')
ylabel('Canandaigua min temp anomaly ({\circ}F)')

% calculate covariance matrix and its corresponding eigenvalues & eigenvectors
S = cov(x(:,1),x(:,2));
[E, Lambda] = eig(S); % eig is sufficient for a small dense matrix

% Identify the maximum eigenvalue; its column in E is the first eigenvector
[~, idx] = max(diag(Lambda)); % column index of the largest eigenvalue

%% PCA
U = x*E(:,idx);

%% synthesis
x_syn = E(:,idx)*U'; % reconstructed values of x

% plot the reconstructed values against the original data
figure
subplot(2,1,1)
plot(1:31,x(:,1) ,1:31, x_syn(1,:),'--')
xlim([1 31])
xlabel('Time (days)')
ylabel('Ithaca min. temp. anomalies ({\circ}F)')
legend('Original', 'Reconstruction')
subplot(2,1,2)
plot(1:31, x(:,2), 1:31, x_syn(2,:)','--')
xlim([1 31])
xlabel('Time (days)')
ylabel('Canandaigua min. temp. anomalies ({\circ}F)')
legend('Original', 'Reconstruction')

Sources:

Wilks, D. S. (2011). Statistical methods in the atmospheric sciences. Amsterdam: Elsevier Academic Press.

# Dealing With Multicollinearity: A Brief Overview and Introduction to Tolerant Methods

This semester I’m taking a Multivariate statistics course taught by Professor Scott Steinschneider in the BEE department at Cornell. I’ve been really enjoying the course thus far and thought I would share some of what we’ve covered in the class with a blog post. The material below on multicollinearity is from Dr. Steinschneider’s class, presented in my own words.

### What is Multicollinearity?

Multicollinearity is the condition where two or more predictor variables in a statistical model are linearly related (Dormann et al. 2013). The existence of multicollinearity in your data set can inflate the variance of regression coefficients, leading to unstable estimates of parameter values. This in turn can lead to erroneous identification of relevant predictors within a regression and detracts from a model’s ability to extrapolate beyond the range of the sample it was constructed with. In this post, I’ll explain how multicollinearity causes problems for linear regression by Ordinary Least Squares (OLS), introduce three metrics for detecting multicollinearity, and detail two “Tolerant Methods” for dealing with multicollinearity within a data set.

### How does multicollinearity cause problems in OLS regression?

To illustrate the problems caused by multicollinearity, let’s start with a linear regression:

$y=x\beta +\epsilon$

Where $y$ is a vector of observations of the dependent variable, $x$ is a matrix of predictor variables, $\beta$ is a vector of coefficients, and $\epsilon$ is a vector of residuals.

The Gauss-Markov theorem states that the Best Linear Unbiased Estimator (BLUE) for each  coefficient can be found using OLS:

$\hat{\beta}_{OLS} = (x^Tx)^{-1}x^Ty$

This  estimate will have a variance defined as:

$var(\hat{\beta}_{OLS}) =\sigma^2 (x^Tx)^{-1}$

Where $\sigma^2$ is the variance of the residuals.

If you dive into the matrix algebra, you will find that, for standardized predictors, the term $(x^Tx)$ is equal to a matrix with ones on the diagonal and the pairwise Pearson’s correlation coefficients (ρ) on the off-diagonals:

$(x^Tx) =\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}$

As the correlation values increase, the values within $(x^Tx)^{-1}$ also increase. Even with a low residual variance, multicollinearity can cause large increases in estimator variance. Here are a few examples of the effect of multicollinearity using a hypothetical regression with two predictors:

$\rho = .3 \rightarrow (x^Tx)^{-1} =\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}^{-1} = \begin{bmatrix} 1.09 & -0.33 \\ -0.33 & 1.09 \end{bmatrix}$

$\rho = .9 \rightarrow (x^Tx)^{-1} =\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}^{-1} = \begin{bmatrix} 5.26 & -4.73 \\ -4.73 & 5.26 \end{bmatrix}$

$\rho = .999 \rightarrow (x^Tx)^{-1} =\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}^{-1} = \begin{bmatrix} 500.25 & -499.75 \\ -499.75 & 500.25\end{bmatrix}$

So why should you care about the variance of your coefficient estimators? The answer depends on what the purpose of your model is. If your only goal is to obtain an accurate measure of the predictand, the presence of multicollinearity in your predictors might not be such a problem. If, however, you are trying to identify the key predictors that affect the predictand, multicollinearity is a big problem.

OLS estimators with large variances are highly unstable, meaning that if you construct estimators from different data samples you will potentially get wildly different estimates of your coefficient values (Dormann et al. 2013). Large estimator variance also undermines the trustworthiness of hypothesis testing of the significance of coefficients. Recall that the t value used in hypothesis testing for an OLS regression coefficient is a function of the sample standard deviation (the square root of the variance) of the  OLS estimator.

$t_{n-2} =\frac{\hat{\beta_j}-0}{s_{\beta_j}}$

An estimator with an inflated standard deviation, $s_{\beta_j}$, will thus yield a lower t value, which could lead to the false rejection of a significant predictor (i.e., a type II error). See Ohlemüller et al. (2008) for some examples where hypothesis testing results are undermined by multicollinearity.

### Detecting Multicollinearity within a data set

Now we know how multicollinearity causes problems in our regression, but how can we tell if there is multicollinearity within a data set? There are several commonly used metrics for which basic guidelines have been developed to determine whether multicollinearity is present.

The most basic metric is the pairwise Pearson Correlation Coefficient between predictors, r. Recall from your intro statistics course that the Pearson Correlation Coefficient is a measure of the linear relationship between two variables, defined as:

$r_{x_1,x_2}=\frac{cov(x_1,x_2)}{\sigma_{x_1}\sigma_{x_2}}$

A common rule of thumb is that multicollinearity may be a problem in a data set if any pairwise |r| > 0.7 (Dormann et al. 2013).

Another common metric is known as the Variance Inflation Factor (VIF). This measure is calculated by regressing each predictor on all others being used in the regression.

$VIF(\beta_j) = \frac{1}{1-R^2_j}$

Where $R_j^2$ is the $R^2$ value generated by regressing predictor $x_j$ on all other predictors. Multicollinearity is thought to be a problem if VIF > 10 for any given predictor (Dormann et al. 2013).

A third metric for detecting multicollinearity in a data set is the Condition Number (CN) of the predictor matrix, defined as the square root of the ratio of the largest and smallest eigenvalues of $x^Tx$:

$CN=\sqrt{\frac{\lambda_{max}}{\lambda_{min}}}$

A CN > 15 indicates the possible presence of multicollinearity, while a CN > 30 indicates serious multicollinearity problems (Dormann et al. 2013).
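All three diagnostics are easy to compute; below is a minimal MATLAB sketch for a generic predictor matrix x (n observations by p predictors), standardizing the columns first:

% Multicollinearity diagnostics for a predictor matrix x (n x p).
x = (x - mean(x))./std(x);            % standardize columns (zero mean, unit variance)

% 1) Pairwise Pearson correlations: flag any |r| > 0.7
R = corrcoef(x);
high_r = abs(R) > 0.7 & ~eye(size(R));

% 2) Variance Inflation Factor: flag any VIF > 10
p = size(x,2);
VIF = zeros(p,1);
for j = 1:p
    others = setdiff(1:p, j);
    b = x(:,others)\x(:,j);                                       % regress x_j on the other predictors
    R2j = 1 - sum((x(:,j) - x(:,others)*b).^2)/sum(x(:,j).^2);    % R^2 of that regression
    VIF(j) = 1/(1 - R2j);
end

% 3) Condition number of x'x: flag CN > 15 (or CN > 30 for serious problems)
lambda = eig(x'*x);
CN = sqrt(max(lambda)/min(lambda));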

### Dealing with Multicollinearity using Tolerant Methods

In a statistical sense, there is no way to “fix” multicollinearity. However, methods have been developed to mitigate its effects. Perhaps the most effective way to remedy multicollinearity is to make a priori judgements about the relationship between predictors and remove or consolidate predictors that have known correlations. This is not always possible however, especially when the true functional forms of relationships are not known (which is often why regression is done in the first place). In this section I will explain two “Tolerant Methods” for dealing with multicollinearity.

The purpose of Tolerant Methods is to reduce the sensitivity of regression parameters to multicollinearity. This is accomplished through penalized regression. Since multicollinearity can result in large, opposite-signed estimator values for correlated predictors, a penalty function is imposed to keep the magnitude of the coefficients below a pre-specified value:

$\sum_{j=1}^{p}|\beta_j|^l \leq c$

Where c is the predetermined value representing model complexity, p is the number of predictors and l is either 1 or 2 depending on the type of tolerant method employed (more on this below).

#### Ridge Regression

Ridge regression uses the L2 norm, or Euclidean distance, to constrain model coefficients (i.e., $l = 2$ in the equation above). The coefficients created using ridge regression are defined as:

$\hat{\beta}_{r} = (x^Tx+\lambda I)^{-1}x^Ty$

Ridge regression adds a constant, λ, to the term xTx to construct the estimator. It should be noted that both x and y should be standardized before this estimator is constructed. The Ridge regression coefficient is the result of a constrained version of the ordinary least squares optimization problem. The objective is to minimize the sum of square errors for the regression while meeting the complexity constraint.

$\hat{\beta_r} \begin{cases} argmin(\beta) \hspace{.1cm}\sum_{i=1}^{N} \epsilon_i^2 \\ \sum_{j=1}^{p}|\beta_j|^2 \leq c \end{cases}$

To solve the constrained optimization, Lagrange multipliers can be employed. Let z equal the Residual Sum of Squares (RSS) to be minimized:

$argmin(\beta) \hspace{.3cm} z= (y-x\beta)^T(y-x\beta)+\lambda(\sum_{j=1}^{p}|\beta_j|^2-c)$

This can be rewritten in terms of the L2 norm of β:

$z = (y-x\beta)^T(y-x\beta)+\lambda||\beta||^2_2$

Taking the derivative with respect to β and solving:

$0 = \frac{\partial z}{\partial \beta} = -2x^T(y-x\beta)+2\lambda\beta$

$x^Ty = x^Tx\beta+\lambda\beta=(x^Tx+\lambda I)\beta$

$\hat{\beta}_{r} = (x^Tx+\lambda I)^{-1}x^Ty$

Remember that the Gauss-Markov theorem states that the OLS estimate for regression coefficients is the BLUE, so by using ridge regression we are sacrificing some benefits of OLS estimators in order to constrain estimator variance. Estimators constructed using ridge regression are in fact biased; this can be proven by calculating the expected value of the ridge regression coefficients:

$E[\hat{\beta_r}]=(I+\lambda(x^Tx)^{-1})^{-1}\beta \neq \beta$

For a scenario with two predictors, the tradeoff between reduced model complexity and increased bias in the estimators can be visualized graphically by plotting the estimators of the two beta values against each other. The vector of beta values estimated by regression is represented as a point on this plot, $(\hat{\beta}=[\beta_1, \beta_2])$. In Figure 1, $\beta_{OLS}$ is plotted in the upper right quadrant and represents the estimator that produces the smallest RSS possible for the model. The ellipses centered around $\beta_{OLS}$ are contours of increasing RSS resulting from different combinations of β1 and β2 values; each contour corresponds to a different λ value added to the regression. The circle centered around the origin represents the chosen level of model complexity that is constraining the ridge regression. The ridge estimator is the point where this circle intersects an RSS ellipse. Notice that as the value of c increases, the error introduced into the estimators decreases, and vice versa.

Figure 1: Geometric interpretation of the ridge regression estimator. The blue dot indicates the OLS estimate of Beta; ellipses centered around the OLS estimate represent RSS contours for each Beta 1, Beta 2 combination (denoted here as z in the optimization equation above). The model complexity is constrained by distance c from the origin. The ridge regression estimator of Beta is shown as the red dot, where the RSS contour meets the circle defined by c.

The c value displayed in Figure 1 is only presented to explain the theoretical underpinnings of ridge regression. In practice, c is never specified; rather, a value for λ is chosen prior to model construction. Lambda is usually chosen through a process known as k-fold cross validation, which is accomplished through the following steps (a code sketch follows the list):

1. Partition the data set into K separate sets of equal size.
2. For each k = 1, …, K, fit the model excluding the kth set.
3. Predict for the kth set.
4. Calculate the cross validation error for the kth set: $CV^{\lambda_0}_k = E[\sum(y-\hat{y})^2]$
5. Repeat for different values of $\lambda$, and choose the $\lambda$ that minimizes the average error across folds: $CV^{\lambda_0} = \frac{1}{K}\sum_{k=1}^{K}CV^{\lambda_0}_k$
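Below is a minimal sketch of these steps, assuming x and y have already been standardized; the ridge estimator comes from the closed form above and λ is picked by 10-fold cross validation:

% Ridge regression with lambda chosen by k-fold cross validation.
ridge_beta = @(xx, yy, lambda) (xx'*xx + lambda*eye(size(xx,2)))\(xx'*yy);

K = 10;                                   % number of folds
lambdas = logspace(-3, 3, 25);            % candidate lambda values
n = size(x,1);
fold = repmat(1:K, 1, ceil(n/K));
fold = fold(randperm(n));                 % random fold assignment for each observation
cv_err = zeros(size(lambdas));

for i = 1:numel(lambdas)
    for k = 1:K
        test = (fold == k);
        b = ridge_beta(x(~test,:), y(~test), lambdas(i));          % fit excluding the kth set
        cv_err(i) = cv_err(i) + sum((y(test) - x(test,:)*b).^2);   % prediction error on the kth set
    end
end
cv_err = cv_err/K;                        % average cross validation error

[~, best] = min(cv_err);
beta_ridge = ridge_beta(x, y, lambdas(best));   % final fit with the chosen lambda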

#### Lasso Regression

Another Tolerant Method for dealing with multicollinearity, known as Least Absolute Shrinkage and Selection Operator (LASSO) regression, solves the same constrained optimization problem as ridge regression, but uses the L1 norm rather than the L2 norm as a measure of complexity.

$\hat{\beta}_{Lasso} \begin{cases} argmin(\beta) \hspace{.1cm}\sum_{i=1}^{N} \epsilon_i^2 \\ \sum_{j=1}^{p}|\beta_j|^1 \leq c \end{cases}$

LASSO regression can be visualized similarly to ridge regression, but since c is defined by the sum of absolute values of beta, rather than the sum of squares, the area it constrains is diamond shaped rather than circular. Figure 2 shows the selection of the beta estimator from LASSO regression. As you can see, the use of the L1 norm means LASSO regression selects one of the predictors and drops the other (weights it as zero). This has been argued to provide more interpretable estimators (Tibshirani 1996).

Figure 2: Geometric interpretation of the LASSO regression estimator. The blue dot indicates the OLS estimate of Beta; ellipses centered around the OLS estimate represent RSS contours for each Beta 1, Beta 2 combination (denoted as z in the optimization equation). The model complexity is constrained by the L1 norm, shown as the diamond. The LASSO estimator of Beta is shown as the red dot, the location where the RSS contour intersects the diamond defined by c.
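If you have MATLAB’s Statistics and Machine Learning Toolbox, a minimal LASSO sketch (again choosing λ by cross validation) looks like the following; it assumes x and y are the standardized predictors and response used above:

% LASSO sketch using lasso() from the Statistics and Machine Learning Toolbox.
[B, FitInfo] = lasso(x, y, 'CV', 10);      % fits coefficients over a path of lambda values
beta_lasso = B(:, FitInfo.IndexMinMSE);    % coefficients at the lambda with the lowest CV error
dropped = find(beta_lasso == 0);           % predictors LASSO has weighted to exactly zero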

### Final thoughts

If you’re creating a model with multiple predictors, it’s important to be cognizant of the potential for multicollinearity within your data set. Tolerant methods are only one of many possible remedies for multicollinearity (other notable techniques include data clustering and Principal Component Analysis), but it’s important to remember that no known technique can truly “solve” the problem of multicollinearity. The method chosen to deal with multicollinearity should be selected on a case-by-case basis, and multiple methods should be employed if possible to help identify the underlying structure within the predictor data set (Dormann et al. 2013).

### Citations

Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., Marquéz, J. R. G., Gruber, B., Lafourcade, B., Leitão, P. J., Münkemüller, T., McClean, C., Osborne, P. E., Reineking, B., Schröder, B., Skidmore, A. K., Zurell, D. and Lautenbach, S. 2013, “Collinearity: a review of methods to deal with it and a simulation study evaluating their performance.” Ecography, 36: 27–46. doi:10.1111/j.1600-0587.2012.07348.x

Ohlemüller, R. et al. 2008. “The coincidence of climatic and species rarity: high risk to small-range species from climate change.” Biology Letters. 4: 568 – 572.

Tibshirani, Robert 1996. “Regression shrinkage and selection via the lasso.” Journal of the Royal Statistical Society. Series B (Methodological): 267-288.

# Calculating Risk-of-Failures as in the Research Triangle papers (2014-2016) – Part 1

There has been a series of papers (e.g., Palmer and Characklis, 2009; Zeff et al., 2014; Herman et al., 2014) suggesting the use of an approximate risk-of-failure (ROF) metric, as opposed to the more conventional days of supply remaining, for utilities’ managers to decide when to enact not only water use restrictions, but also water transfers between utilities. This approach was later expanded to decisions about the best time and the infrastructure project in which a utility should invest (Zeff et al., 2016), as opposed to setting fixed times in the future for either construction or options evaluation. What all these papers have in common is that drought mitigation and infrastructure expansion decisions are triggered when the values of the short- and long-term ROFs, respectively, for a given utility exceed pre-set triggers.

For example, the figure below shows that as streamflows (black line, subplot “a”) get lower while demands are maintained (subplot “b”), the combined storage levels of the fictitious utility start to drop around the month of April (subplot “c”), increasing the utility’s short-term ROF (subplot “d”) until it finally triggers transfers and restrictions (subplot “e”). Despite the triggered restrictions and transfers, the utility’s combined storage levels crossed the dashed line in subplot “c”, which denotes the failure criterion (i.e., combined storage levels dropping below 20% of the total capacity).

It is beyond the scope of this post to go into the details presented in all of these papers, but even after reading them readers may still wonder how exactly ROFs are calculated. In this post, I’ll try to show in a graphical and concise manner how short-term ROFs are calculated.

In order to calculate a utility’s ROF for week m, we would run 50 independent simulations (henceforth called ROF simulations), all departing from the system conditions (reservoir storage levels, demand probability density function, etc.) observed in week m, and each using one of the 50 years of streamflow time series recorded immediately prior to week m. The utility’s ROF is then calculated as the number of ROF simulations in which the combined storage level of that utility dropped below 20% of the total capacity in at least one week, divided by the number of ROF simulations run (50). An animation of the process can be seen below.

For example, for a water utility that started using ROF triggers on 01/01/2017, this week’s short-term ROF (02/13/2017, or week m=7) would be calculated using the recorded streamflows from weeks 6 through -47 (assuming here a year of 52 weeks, for simplicity) for ROF simulation 1, the streamflows from weeks -48 to -99 for ROF simulation 2, and so on until we reach 50 simulations. However, if the utility is running an optimization or scenario evaluation and wants to calculate the ROF in week 16 (04/10/2017) of a system simulation, ROF simulation 1 would use 10 weeks of synthetically generated streamflows (weeks 16 to 7) and 42 weeks of historical records (weeks 6 to -45), simulation 2 would use records for weeks -46 to -97, and so on, as in a 50-year moving window.
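To make the procedure concrete, here is a conceptual sketch of the short-term ROF calculation; historical_flows, current_state, and simulate_storage are hypothetical placeholders for the utility’s recorded inflows, the system state in week m, and a water balance simulator, respectively.

% Conceptual sketch of the short-term ROF calculation for week m.
% historical_flows(:, s) holds the s-th year of weekly streamflows recorded
% prior to week m; simulate_storage() is a hypothetical function returning the
% utility's combined storage (as a fraction of capacity) over one simulated year.
n_rof = 50;                           % number of ROF simulations
failures = 0;
for s = 1:n_rof
    inflows = historical_flows(:, s);
    storage = simulate_storage(current_state, inflows);
    if any(storage < 0.20)            % failure: storage drops below 20% of capacity
        failures = failures + 1;
    end
end
rof = failures/n_rof;                 % short-term risk of failure for week m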

In another blog post, I will show how to calculate the long-term ROF and the reasoning behind it.

Works cited

Herman, J. D., H. B. Zeff, P. M. Reed, and G. W. Characklis (2014), Beyond optimality: Multistakeholder robustness tradeoffs for regional water portfolio planning under deep uncertainty, Water Resour. Res., 50, 7692–7713, doi:10.1002/2014WR015338.

Palmer, R., and G. W. Characklis (2009), Reducing the costs of meeting regional water demand through risk-based transfer agreements, J. Environ. Manage., 90(5), 1703–1714.

Zeff, H. B., J. R. Kasprzyk, J. D. Herman, P. M. Reed, and G. W. Characklis (2014), Navigating financial and supply reliability tradeoffs in regional drought management portfolios, Water Resour. Res., 50, 4906–4923, doi:10.1002/2013WR015126.

Zeff, H. B., J. D. Herman, P. M. Reed, and G. W. Characklis (2016), Cooperative drought adaptation: Integrating infrastructure development, conservation, and water transfers into adaptive policy pathways, Water Resour. Res., 52, 7327–7346, doi:10.1002/2016WR018771.

# Synthetic streamflow generation

A recent research focus of our group has been the development and use of synthetic streamflow generators. There are many tools one might use to generate synthetic streamflows, and it may not be obvious which is right for a specific application or what the inherent limitations of each method are. More fundamentally, it may not be obvious why it is desirable to generate synthetic streamflows in the first place. This will be the first in a series of blog posts on synthetic streamflow generators in which I hope to sketch out the various categories of generation methods and their appropriate use as I see it. In this first post we’ll focus on the motivation and history behind the development of synthetic streamflow generators and broadly categorize them.

### Why should we use synthetic hydrology?

The most obvious reason to use synthetic hydrology is if there is little or no data for your system (see Lamontagne, 2015).  Another obvious reason is if you are trying to evaluate the effect of hydrologic non-stationarity on your system (Herman et al. 2015; Borgomeo et al. 2015).  In that case you could use synthetic methods to generate flows reflecting a shift in hydrologic regime.  But are there other reasons to use synthetic hydrology?

In water resources systems analysis it is common practice to evaluate the efficacy of management or planning strategies by simulating system performance over the historical record, or over some critical period. In this approach, new strategies are evaluated by asking the question: how well would we have done with the new strategy?

This may be an appealing approach, especially if some event was particularly traumatic to your system. But is this a robust way of evaluating alternative strategies? It’s important to remember that any hydrologic record, no matter how long, is only a single realization of a stochastic process. Importantly, drought and flood events emerge as the result of specific sequences of events, unlikely to be repeated. Furthermore, there is a 50% chance that the worst flood or drought in an N-year record will be exceeded in the next N years. Is it well advised to tailor our strategies to past circumstances that will likely never be repeated and will as likely as not be exceeded? As Lettenmaier et al. [1987] remind us, “Little is certain about the future except that it will be unlike the past.”

Even under stationarity, and even with long hydrologic records, the use of synthetic streamflow can improve the efficacy of planning and management strategies by exposing them to larger and more diverse floods and droughts than those in the record (Loucks et al. 1981; Vogel and Stedinger, 1988; Loucks et al. 2005). Figure 7.12 from Loucks et al. (2005) shows a typical experimental set-up using synthetic hydrology with a simulation model. Often our group will wrap an optimization model like Borg around this set-up, where the system design/operating policy (bottom of the figure) are the decision variables, and the system performance (right of the figure) are the objective(s).

(Loucks et al. 2005)

### What are the types of generators?

Many synthetic streamflow generation techniques have been proposed since the early 1960s.  It can be difficult for a researcher or practitioner to know which method is best suited to the problem at hand.  Thus, we’ll start with a very broad characterization of what is out there, then proceed to some history.

Broadly speaking, there are two approaches to generating synthetic hydrology: indirect and direct. The indirect approach generates streamflow by synthetically generating the forcings to a hydrologic model. For instance, one might generate precipitation and temperature series and input them to a hydrologic model of a basin (e.g. Steinschneider et al. 2014). In contrast, direct methods use statistical techniques to generate streamflow time series directly.

The direct approach is generally easier to apply and more parsimonious because it does not require the selection, calibration, and validation of a separate hydrologic model (Najafi et al. 2011).  On the other hand, the indirect approach may be desirable.  Climate projections from GCMs often include temperature or precipitation changes, but may not describe hydrologic shifts at a resolution or precision that is useful.  In other cases, profound regime shifts may be difficult to represent with statistical models and may require process-driven models, thus necessitating the indirect approach.

Julie’s earlier series focused on indirect approaches, so we’ll focus on the direct approach here. Regardless of the approach, many of the methods are the same. In general, generator methods can be divided into two categories: parametric and non-parametric. Parametric methods rely on a hypothesized statistical model of streamflow whose parameters are selected to achieve a desired result (Stedinger and Taylor, 1982a). In contrast, non-parametric methods do not make strong structural assumptions about the processes generating the streamflow, but rather rely on re-sampling from the hydrologic record in some way (Lall, 1995). Some methods combine parametric and non-parametric techniques, which we’ll refer to as semi-parametric (Herman et al. 2015).

Both parametric and non-parametric methods have advantages and disadvantages.  Parametric methods are often parsimonious, and often have analytical forms that allow easy parameter manipulation to reflect non-stationarity.  However, there can be concern that the underlying statistical models may not reflect the hydrologic reality well (Sharma et al. 1997).  Furthermore, in multi-dimensional, multi-scale problems the proliferation of parameters can make parametric models intractable (Grygier and Stedinger, 1988).  Extensive work has been done to confront both challenges, but they may lead a researcher to adopt a non-parametric method instead.

Because many non-parametric methods ‘re-sample’ flows from a record, realism is not generally a concern, and most re-sampling schemes are computationally straightforward (relatively speaking). On the other hand, manipulating synthetic flows to reflect non-stationarity may not be as straightforward as a simple parameter change, though methods have been suggested (Herman et al. 2015; Borgomeo et al. 2015). More fundamentally, because non-parametric methods rely so heavily on the data, they require sufficiently long records to ensure there is enough hydrologic variability to sample. Short records can be a concern for parametric methods as well, though parametric uncertainty can be explicitly considered in those methods (Stedinger and Taylor, 1982b). Of course, parametric methods also have structural uncertainty that non-parametric models largely avoid by not assuming an explicit statistical model.

In the coming posts we’ll dig into the nuances of the different methods in greater detail.

### A historical perspective

The first use of synthetic flow generation seems to have been by Hazen [1914].  That work attempted to quantify the reliability of a water supply by aggregating the streamflow records of local streams into a 300-year ‘synthetic record.’  Of course the problem with this is that the cross-correlation between concurrent flows rendered the effective record length much less than the nominal 300 years.

Next Barnes [1954] generated 1,000 years of streamflow for a basin in Australia by drawing random flows from a normal distribution with mean and variance equal to the sample estimates from the observed record.  That work was extended by researchers from the Harvard Water Program to account for autocorrelation of monthly flows (Maass et al., 1962; Thomas and Fiering, 1962).  Later work also considered the use of non-normal distributions (Fiering, 1967), and the generation of correlated concurrent flows at multiple sites (Beard, 1965; Matalas, 1967).

Those early methods relied on first-order autoregressive models that regressed flows in the current period on the flows of the previous period (see Loucks et al.’s Figure 7.13  below).  Box and Jenkins [1970] extended those methods to autoregressive models of arbitrary order, moving average models of arbitrary order, and autoregressive-moving average models of arbitrary order.  Those models were the focus of extensive research over the course of the 1970s and 1980s and underpin many of the parametric generators that are widely used in hydrology today (see Salas et al. 1980; Grygier and Stedinger, 1990; Salas, 1993; Loucks et al. 2005).

(Loucks et al. 2005)
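As a toy illustration of the lag-1 autoregressive idea behind those early methods, the sketch below fits a lag-1 model to a hypothetical vector of observed flows, q_obs, and generates a synthetic sequence; real Thomas-Fiering-type generators also handle seasonality and non-normal marginals.

% Toy lag-1 autoregressive generator: z_t = r*z_{t-1} + sqrt(1-r^2)*e_t,
% fit to a hypothetical column vector of observed flows q_obs and rescaled
% to its sample mean and standard deviation.
mu = mean(q_obs); sd = std(q_obs);
z = (q_obs - mu)/sd;                            % standardize the observed record
c = corrcoef(z(1:end-1), z(2:end));
r = c(1,2);                                     % lag-1 autocorrelation
n_syn = 1000;                                   % length of the synthetic record
z_syn = zeros(n_syn,1);
for t = 2:n_syn
    z_syn(t) = r*z_syn(t-1) + sqrt(1 - r^2)*randn;
end
q_syn = mu + sd*z_syn;                          % synthetic flows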

By the mid-1990s, non-parametric methods began to gain popularity (Lall, 1995). While much of this work has its roots in earlier work from the 1970s and 1980s (Yakowitz, 1973, 1979, 1985; Schuster and Yakowitz, 1979; Yakowitz and Karlsson, 1987; Karlsson and Yakowitz, 1987), improvements in computing and the availability of large data sets meant that by the mid-1990s non-parametric methods were feasible (Lall and Sharma, 1996). Early examples of non-parametric methods include block bootstrapping (Vogel and Shallcross, 1996), k-nearest neighbor (Lall and Sharma, 1996), and kernel density methods (Sharma et al. 1997). Since that time, extensive research has made improvements to these methods, often by incorporating parametric elements. For instance, Srinivas and Srinivasan (2001, 2005, and 2006) developed a hybrid autoregressive block bootstrapping method designed to improve the bias in lagged correlation and to generate flows other than those in the historical record, for multiple sites and multiple seasons. K-nearest neighbor methods have also been the focus of extensive research (Rajagopalan and Lall, 1999; Harrold et al. 2003; Yates et al. 2003; Sharif and Burn, 2007; Mehrotra and Sharma, 2006; Prairie et al. 2006; Lee et al. 2010; Salas and Lee, 2010; Nowak et al., 2010), including recent work by our group (Giuliani et al. 2014).

Emerging work focuses on stochastic streamflow generation using copulas [Lee and Salas, 2011; Fan et al. 2016], entropy theory bootstrapping [Srivastav and Simonovic, 2014], and wavelets [Kwon et al. 2007; Erkyihun et al., 2016], among other methods.

In the following posts I’ll address different challenges in stochastic generation [e.g. long-term persistence, parametric uncertainty, multi-site generation, seasonality, etc.] and the relative strengths and shortcomings of the various methods for addressing them.

### Works Cited

Barnes, F. B., Storage required for a city water supply, J. Inst. Eng. Australia 26(9), 198-203, 1954.

Beard, L. R., Use of interrelated records to simulate streamflow, J. Hydrol. Div., ASCE 91(HY5), 13-22, 1965.

Borgomeo, E., Farmer, C. L., and Hall, J. W. (2015). “Numerical rivers: A synthetic streamflow generator for water resources vulnerability assessments.” Water Resour. Res., 51(7), 5382–5405.

Y.R. Fan, W.W. Huang, G.H. Huang, Y.P. Li, K. Huang, Z. Li, Hydrologic risk analysis in the Yangtze River basin through coupling Gaussian mixtures into copulas, Advances in Water Resources, Volume 88, February 2016, Pages 170-185.

Fiering, M.B, Streamflow Synthesis, Harvard University Press, Cambridge, Mass., 1967.

Giuliani, M., J. D. Herman, A. Castelletti, and P. Reed (2014), Many-objective reservoir policy identification and refinement to reduce policy inertia and myopia in water management, Water Resour. Res., 50, 3355–3377, doi:10.1002/2013WR014700.

Grygier, J.C., and J.R. Stedinger, Condensed Disaggregation Procedures and Conservation Corrections for Stochastic Hydrology, Water Resour. Res. 24(10), 1574-1584, 1988.

Grygier, J.C., and J.R. Stedinger, SPIGOT Technical Description, Version 2.6, 1990.

Harrold, T. I., Sharma, A., and Sheather, S. J. (2003). “A nonparametric model for stochastic generation of daily rainfall amounts.” Water Resour. Res., 39(12), 1343.

Hazen, A., Storage to be provided in impounding reservoirs for municipal water systems, Trans. Am. Soc. Civ. Eng. 77, 1539, 1914.

Herman, J.D., H.B. Zeff, J.R. Lamontagne, P.M. Reed, and G. Characklis (2016), Synthetic Drought Scenario Generation to Support Bottom-Up Water Supply Vulnerability Assessments, Journal of Water Resources Planning & Management, doi: 10.1061/(ASCE)WR.1943-5452.0000701.

Karlsson, M., and S. Yakowitz, Nearest-Neighbor methods for nonparametric rainfall-runoff forecasting, Water Resour. Res., 23, 1300-1308, 1987.

Kwon, H.-H., U. Lall, and A. F. Khalil (2007), Stochastic simulation model for nonstationary time series using an autoregressive wavelet decomposition: Applications to rainfall and temperature, Water Resour. Res., 43, W05407, doi:10.1029/2006WR005258.

Lall, U., Recent advances in nonparametric function estimation: Hydraulic applications, U.S. Natl. Rep. Int. Union Geod. Geophys. 1991- 1994, Rev. Geophys., 33, 1093, 1995.

Lall, U., and A. Sharma (1996), A nearest neighbor bootstrap for resampling hydrologic time series, Water Resour. Res. 32(3), pp. 679-693.

Lamontagne, J.R. (2015). Representation of Uncertainty and Corridor DP for Hydropower Optimization, PhD dissertation, Cornell University, Ithaca, NY.

Lee, T., J. D. Salas, and J. Prairie (2010), An enhanced nonparametric streamflow disaggregation model with genetic algorithm, Water Resour. Res., 46, W08545, doi:10.1029/2009WR007761.

Lee, T., and J. Salas (2011), Copula-based stochastic simulation of hydrological data applied to Nile River flows, Hydrol. Res., 42(4), 318–330.

Lettenmaier, D. P., K. M. Latham, R. N. Palmer, J. R. Lund and S. J. Burges, Strategies for coping with drought Part II: Planning techniques for planning and reliability assessment, EPRI P-5201, Final Report Project 2194-1, June 1987.

Loucks, D.P., Stedinger, J.R. & Haith, D.A. 1981, Water Resources Systems Planning and Analysis, 1st edn, Prentice-Hall, Englewood Cliffs, N.J.

Loucks, D.P. et al. 2005, Water Resources Systems Planning and Management: An Introduction to Methods, Models and Applications, UNESCO, Delft, The Netherlands.

Maass, A., M. M. Hufschmidt, R. Dorfman, H. A. Thomas, Jr., S. A. Marglin and G. M. Fair, Design of Water Resource Systems, Harvard University Press, Cambridge, Mass., 1962.

Matalas, N. C., Mathematical assessment of synthetic hydrology, Water Resour. Res. 3(4), 937-945, 1967.

Najafi, M. R., Moradkhani, H., and Jung, I. W. (2011). “Assessing the uncertainties of hydrologic model selection in climate change impact studies.” Hydrol. Process., 25(18), 2814–2826.

Nowak, K., J. Prairie, B. Rajagopalan, and U. Lall (2010), A nonparametric stochastic approach for multisite disaggregation of annual to daily streamflow, Water Resour. Res., 46, W08529, doi:10.1029/2009WR008530.

Prairie, J. R., Rajagopalan, B., Fulp, T. J., and Zagona, E. A. (2006). “Modified K-NN model for stochastic streamflow simulation.” J. Hydrol. Eng., 11(4), 371–378.

Rajagopalan, B., and Lall, U. (1999). “A k-nearest-neighbor simulator for daily precipitation and other weather variables.” Water Resour. Res., 35(10), 3089–3101.

Salas, J. D., J. W. Deller, V. Yevjevich and W. L. Lane, Applied Modeling of Hydrologic Time Series, Water Resources Publications, Littleton, Colo., 1980.

Salas, J.D., 1993, Analysis and Modeling of Hydrologic Time Series, Chapter 19 (72 p.) in The McGraw Hill Handbook of Hydrology, D.R. Maidment, Editor.

Salas, J.D., T. Lee. (2010). Nonparametric Simulation of Single-Site Seasonal Streamflow, J. Hydrol. Eng., 15(4), 284-296.

Schuster, E., and S. Yakowitz, Contributions to the theory of nonparametric regression, with application to system identification, Ann. Stat., 7, 139-149, 1979.

Sharif, M., and Burn, D. H. (2007). “Improved K-nearest neighbor weather generating model.” J. Hydrol. Eng., 12(1), 42–51.

Sharma, A., Tarboton, D. G., and Lall, U., 1997. “Streamflow simulation: A nonparametric approach.” Water Resour. Res., 33(2), 291–308.

Srinivas, V. V., and Srinivasan, K. (2001). “A hybrid stochastic model for multiseason streamflow simulation.” Water Resour. Res., 37(10), 2537–2549.

Srinivas, V. V., and Srinivasan, K. (2005). “Hybrid moving block bootstrap for stochastic simulation of multi-site multi-season streamflows.” J. Hydrol., 302(1–4), 307–330.

Srinivas, V. V., and Srinivasan, K. (2006). “Hybrid matched-block bootstrap for stochastic simulation of multiseason streamflows.” J. Hydrol., 329(1–2), 1–15.

Srivastav, Roshan K., and Slobodan P. Simonovic (2014), An analytical procedure for multi-site, multi-season streamflow generation using maximum entropy bootstrapping, Environmental Modelling & Software, Volume 59, Pages 59-75.

Stedinger, J. R. and M. R. Taylor, Synthetic streamflow generation, Part 1. Model verification and validation, Water Resour. Res. 18(4), 909-918, 1982a.

Stedinger, J. R. and M. R. Taylor, Synthetic streamflow generation, Part 2. Parameter uncertainty, Water Resour. Res. 18(4), 919-924, 1982b.

Steinschneider, S., Wi, S., and Brown, C. (2014). “The integrated effects of climate and hydrologic uncertainty on future flood risk assessments.” Hydrol. Process., 29(12), 2823–2839.

Thomas, H. A. and M. B. Fiering, Mathematical synthesis of streamflow sequences for the analysis of river basins by simulation, in Design of Water Resource Systems, by A. Maass, M. Hufschmidt, R. Dorfman, H. A. Thomas, Jr., S. A. Marglin and G. M. Fair, Harvard University Press, Cambridge, Mass., 1962.

Vogel, R.M., and J.R. Stedinger, The value of stochastic streamflow models in over-year reservoir design applications, Water Resour. Res. 24(9), 1483-90, 1988.

Vogel, R. M., and A. L. Shallcross (1996), The moving block bootstrap versus parametric time series models, Water Resour. Res., 32(6), 1875–1882.

Yakowitz, S., A stochastic model for daily river flows in an arid region, Water Resour. Res., 9, 1271-1285, 1973.

Yakowitz, S., Nonparametric estimation of markov transition functions, Ann. Stat., 7, 671-679, 1979.

Yakowitz, S. J., Nonparametric density estimation, prediction, and regression for markov sequences J. Am. Stat. Assoc., 80, 215-221, 1985.

Yakowitz, S., and M. Karlsson, Nearest-neighbor methods with application to rainfall/runoff prediction, in Stochastic  Hydrology, edited by J. B. Macneil and G. J. Humphries, pp. 149-160, D. Reidel, Norwell, Mass., 1987.

Yates, D., Gangopadhyay, S., Rajagopalan, B., and Strzepek, K. (2003). “A technique for generating regional climate scenarios using a nearest-neighbor algorithm.” Water Resour. Res., 39(7), 1199.