Glossary of commonly used terms

I have recently started training with the group and, coming from a slightly different research background, I was unfamiliar with a lot of the terminology. I thought it might be useful to put together a glossary of sorts containing the terms that someone new to this field of study might not intuitively understand. The idea is that when someone encounters an unfamiliar term while going through the training or reading through the material, they can come to the blog glossary and quickly Ctrl-F the term (yes, that is a keyboard shortcut used as a verb).

The definitions are not exhaustive, but links/references are provided so the reader can find additional material. The glossary is a work in progress; some definitions are still incomplete, but it will be updated regularly. I’m also probably the least qualified person in the group to provide the definitions, so any corrections/suggestions are more than welcome. If there’s any other term I’ve omitted that you feel should be included, please leave a comment so I can edit the post.

Glossary 

Adaptive metropolis algorithm

It is based on the classical random walk Metropolis algorithm and adapts continuously to the target distribution. 1

Akaike’s Information Criterion (AIC)

A measure of the relative quality of statistical models for a given set of data.2

AMALGAM

MOEA that applies a multi-algorithm search that combines the NSGAII, particle swarm optimization, differential evolution, and adaptive metropolis.3

Approximation set

The set of solutions output by a multi-objective evolutionary algorithm approximating the Pareto front.4

Archive

A secondary population used by many multi-objective evolutionary algorithms to store non-dominated solutions found through the generational process.5

Attainment

Attainment plot

Bayesian Information Criterion

Borg MOEA

Iterative search algorithm for multi-objective optimization. It combines adaptive operator selection with ε-dominance, adaptive population sizing and time continuation. 6

Classification and Regression Trees (CART)

Decision tree algorithms that can be used for classification and regression analysis of predictive modeling problems.7

Closed vs. open loop control

Concavity

See Convexity.

Constraints

Restrictions imposed on the decision space by the particular characteristics of the system. They must be satisfied for a certain solution to be considered acceptable.5

Control map

Controllability

Refers to whether the parameterization choices have significant impacts on the success or failure of an algorithm.

Convergence

Progress towards the reference set.

Convexity

Geometrically, a function f is convex if a line segment drawn from any point on the graph of f to any other point on the graph lies on or above the graph of f. For optimization problems, a problem is convex if the objective function and all constraints are convex functions when minimizing, and concave when maximizing.

(The) Cube

The Reed Group’s computing cluster run by the Center for Advanced Computing (CAC) at Cornell. The cluster has 32 nodes, each with Dual 8-core E5-2680 CPUs @ 2.7 GHz and 128 GB of RAM. For more information see: https://www.cac.cornell.edu/wiki/index.php?title=THECUBE_Cluster

Decision space

The set of all decision variables.

Decision variables

The numerical quantities (variables) that are manipulated during the optimization process; they represent how our actions are encoded within the problem.5

Deterioration

When elements of a solution set at a given time are dominated by a solution set the algorithm maintained some time before.5

Differential evolution

Evolutionary algorithm designed to optimize problems over continuous domains.5

Direct policy search

Dominance resistance

The inability of an algorithm to produce offspring that dominate poorly performing, non-dominated members of the population. (See also Pareto dominance).8

DTLZ problems

A suite of representative test problems for MOEAs for which the analytical solutions have been found. The acronym is a combination of the first letters of the creators’ last names (Deb, Thiele, Laumanns, Zitzler). DTLZ problems have been used for benchmarking and diagnostics when evaluating the performance of MOEAs.

Dynamic memory

Elitism

Refers to the property that, as evolution progresses, non-dominated solutions are not lost in subsequent generations.

Epsilon (ε) dominance

When dominance is determined by use of a user-specified precision to simplify sorting. Pareto epsilon (ε) optimality and Pareto epsilon (ε) front are defined accordingly. (See also Pareto dominance).5

Epsilon (ε) dominance archive

Epsilon (ε)-box dominance archive

ε-MOEA

A steady-state MOEA, the first to incorporate ε-dominance archiving into its search process.

Epsilon (ε) indicator

Additive ε-indicator (ε+-indicator) (performance metric) 

The smallest distance ε that the approximation set must be translated by in order to completely dominate the reference set.4
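For illustration, the additive ε-indicator can be computed in a few lines of numpy (a generic sketch assuming all objectives are minimized, not the MOEA Framework’s implementation):

import numpy as np

def additive_epsilon(approx_set, reference_set):
    """Smallest epsilon by which the approximation set must be translated so that
    every reference point is weakly dominated (all objectives minimized)."""
    A = np.asarray(approx_set, dtype=float)      # shape (n_approx, n_objectives)
    R = np.asarray(reference_set, dtype=float)   # shape (n_ref, n_objectives)
    diffs = A[:, None, :] - R[None, :, :]        # pairwise objective differences
    return float(np.max(np.min(np.max(diffs, axis=2), axis=0)))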

Epsilon (ε) progress

Equifinality

Evolutionary algorithms

A class of search and optimization algorithms inspired by processes of natural evolution.5

Multi-objective evolutionary algorithms (MOEAs)

Evolutionary algorithms used for solving multi-objective problems.5

Evolutionary operators

They operate on the population of an evolutionary algorithm attempting to generate solutions with higher and higher fitness.5

Mutation evolutionary operators

They perturb the decision variables of a single solution to look for improvement in its vicinity.

Recombination evolutionary operators

They combine decision variables from two or more solutions to create new solutions.

Selection evolutionary operators

They determine which solutions are allowed to proceed to the next cycle.

Executioner

A cross-language automation tool for running models. (See also Project Platypus).

Exploratory Modeling and Analysis (EMA)

A research methodology that uses computational experiments to analyze complex and uncertain systems.9

Data-driven exploratory modeling

Used to reveal implications of a data set by searching through an ensemble of models for instances that are consistent with the data.

Question-driven exploratory modeling

Searches over an ensemble of models believed to be plausible to answer a question of interest or illuminate policy choices.

Model-driven exploratory modeling

Investigates the properties of an ensemble of models without reference to a data set or policy question. It is rather a theoretical investigation into the properties of a class of models.

Feasible region

The set of all decision variables in the decision space that are feasible (i.e. satisfy all constraints).5

GDE3

Generational algorithms

A class of MOEAs that replace the entire population during each full mating, mutation, and selection iteration of the algorithm.5 (See also Steady-state algorithms).

Generational distance (performance metric)

The average distance from every solution in the approximation set to the nearest solution in the reference set.4

Gini index

A generalization of the binomial variance used in Classification and Regression Trees (CART). (See also Classification and Regression Trees (CART)). 

High performance computing

Hypervolume (performance metric)

The volume of the space dominated by the approximation set.4

Inverted generational distance (performance metric)

The average distance from every solution in the reference set to the nearest solution in the approximation set.4

J3

A free desktop application for producing and sharing high-dimensional, interactive scientific visualizations. (See also Project Platypus).

Kernel density estimation

Latin Hypercube Sampling (LHS)

Stratified technique used to generate samples of parameter values.
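As an illustration, a minimal Latin Hypercube sample can be drawn with numpy alone (a sketch, not the SALib or scipy implementation): each parameter’s range is split into equally probable strata, one value is drawn per stratum, and the strata are shuffled independently across parameters.

import numpy as np

def latin_hypercube(n_samples, n_params, seed=None):
    """Return an (n_samples x n_params) array of LHS points on [0, 1)."""
    rng = np.random.default_rng(seed)
    # one random point inside each of n_samples equally probable strata, per parameter
    samples = (rng.random((n_samples, n_params))
               + np.arange(n_samples)[:, None]) / n_samples
    for j in range(n_params):                    # shuffle the strata independently
        samples[:, j] = rng.permutation(samples[:, j])
    return samples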

Markov chain

Method of moments

MOEA Framework

A free and open source Java library for developing and experimenting with multi-objective evolutionary algorithms and other general-purpose optimization algorithms.4

Monte Carlo

Morris method

NSGA-II

The Non-dominated Sorting Genetic Algorithm-II. MOEA featuring elitism, efficient non-domination sorting, and parameter free diversity maintenance.10

ε-NSGA-II 

A generational algorithm that uses ε-dominance archiving, adaptive population sizing and time continuation.

Number of function evaluations (NFE)

Objectives

The criteria used to compare solutions in an optimization problem.

OMOPSO

A particle swarm optimization algorithm—the first to include ε-dominance as a means to solve many-objective problems.11

Optimization

The process of identifying the best solution (or a set of best solutions) among a set of alternatives.

Multi-objective optimization

Multi-objective optimization employs two or more criteria to identify the best solution(s) among a set of alternatives.

Intertemporal optimization

Parallel computing

Parametric generator

Pareto optimality

The notion that a solution is superior or inferior to another solution only when it is superior in all objectives or inferior in all objectives respectively.

Pareto dominance

A solution dominates another if it is at least as good in every objective and strictly better in at least one; the dominated solution is the inferior one. Two solutions are mutually non-dominated when each is better than the other in at least one objective.
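A small sketch of a dominance check between two objective vectors (generic code assuming all objectives are minimized, not taken from any particular library):

import numpy as np

def dominates(a, b):
    """True if objective vector a Pareto-dominates b: a is no worse than b in
    every objective and strictly better in at least one (minimization)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return bool(np.all(a <= b) and np.any(a < b))

# dominates([1, 3], [2, 3]) is True; [1, 3] and [2, 1] are mutually non-dominated.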

Pareto front

Contains the objective values of all non-dominated solutions (in the objective function space).

Pareto optimal set

Contains the decision variables of all non-dominated solutions (in the decision variable space).

Particle swarm optimization

Population-based stochastic optimization technique where the potential solutions, called particles, fly through the problem space by following the current optimum particles.

Patient Rule Induction method (PRIM)

A rule induction algorithm.

Performance metrics

Procedures used to compare the performance of approximation sets.

Pointer

Population

The set of encoded solutions that are manipulated and evaluated during the application of an evolutionary algorithm.

Principal Component Analysis (PCA)

Project Platypus

A Free and Open Source Python Library for Multiobjective Optimization. For more information see: https://github.com/Project-Platypus

Radial basis function

Reference set

The set of globally optimal solutions in an optimization problem.

Rhodium

Python Library for Robust Decision Making and Exploratory Modelling. (See also Project Platypus).

Robust Decision Making (RDM)

An analytic framework that helps identify potential robust strategies for a particular problem, characterize the vulnerabilities of such strategies, and evaluate trade-offs among them.12

Multi-objective robust decision making (MORDM)

An extension of Robust Decision Making (RDM) to explicitly include the use of multi-objective optimization to discover robust strategies and explore the trade-offs among multiple competing performance objectives.13

OpenMORDM

An open source implementation of MORDM with the tools necessary to perform a complete MORDM analysis.14 For more information see: https://github.com/OpenMORDM

Safe operating space

SALib

Seeding

Sobol sampling

Spacing (performance metric)

The uniformity of the spacing between the solutions in an approximation set.

SPEA2

MOEA that assigns a fitness value to each solution based on the number of competing solutions it dominates.

 State of the world

A fundamental concept in decision theory which refers to a feature of the world that the agent/decision maker has no control over and is the origin of the agent’s uncertainty about the world.

Steady-state algorithms

A class of MOEAs that only replace one solution in the population during each full mating, mutation, and selection iteration of the algorithm. (See also Generational algorithms).

Time continuation

The injection of new solutions in the population to reinvigorate search.

Tournament

The set of candidate solutions selected randomly from a population.

Trace

Visual analytics

The rapid analysis of large datasets using interactive software that enables multiple connected views of planning problems.

More information on the concepts

  1. Haario, H., Saksman, E. & Tamminen, J. An adaptive Metropolis algorithm. Bernoulli 7, 223–242 (2001).
  2. Akaike, H. Akaike’s information criterion. in International Encyclopedia of Statistical Science 25–25 (Springer, 2011).
  3. Vrugt, J. A. & Robinson, B. A. Improved evolutionary optimization from genetically adaptive multimethod search. Proc. Natl. Acad. Sci. 104, 708–711 (2007).
  4. Hadka, D. Beginner’s Guide to the MOEA Framework. (CreateSpace Independent Publishing Platform, 2016).
  5. Coello, C. A. C., Lamont, G. B. & Van Veldhuizen, D. A. Evolutionary algorithms for solving multi-objective problems. 5, (Springer, 2007).
  6. Hadka, D. & Reed, P. Borg: An Auto-Adaptive Many-Objective Evolutionary Computing Framework. Evol. Comput. 21, 231–259 (2012).
  7. Breiman, L. Classification and Regression Trees. (Wadsworth International Group, 1984).
  8. Reed, P. M., Hadka, D., Herman, J. D., Kasprzyk, J. R. & Kollat, J. B. Evolutionary multiobjective optimization in water resources: The past, present, and future. Adv. Water Resour. 51, 438–456 (2013).
  9. Bankes, S. Exploratory Modeling for Policy Analysis. Oper. Res. 41, 435–449 (1993).
  10. Deb, K., Pratap, A., Agarwal, S. & Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. Evol. Comput. IEEE Trans. On 6, 182–197 (2002).
  11. Sierra, M. R. & Coello, C. C. Improving PSO-based multi-objective optimization using crowding, mutation and e-dominance. in Evolutionary multi-criterion optimization 3410, 505–519 (Springer, 2005).
  12. Lempert, R. J., Groves, D. G., Popper, S. W. & Bankes, S. C. A general, analytic method for generating robust strategies and narrative scenarios. Manag. Sci. 52, 514–528 (2006).
  13. Kasprzyk, J. R., Nataraj, S., Reed, P. M. & Lempert, R. J. Many objective robust decision making for complex environmental systems undergoing change. Environ. Model. Softw. (2013). doi:10.1016/j.envsoft.2012.12.007
  14. Hadka, D., Herman, J., Reed, P. & Keller, K. An open source framework for many-objective robust decision making. Environ. Model. Softw. 74, 114–129 (2015).

 


An Introduction To Econometrics: Part 1- Ordinary Least Squares Regression

I took a PhD level econometrics course this semester in the Applied Economics and Management department here at Cornell and I thought I’d share some of what I learned. Overall, I enjoyed the course and learned a great deal. It was very math and theory heavy, but Professor Shanjun Li did a nice job keeping the class lively and interesting. I would recommend the class to future EWRS students who may be looking for some grounding in econometrics, provided they’ve taken some basic statistics and linear algebra courses.

So let’s start with the basics: what does the term “econometrics” even mean? Hansen (2010) defined econometrics as “the unified study of economic models, mathematical statistics and economic data”. After taking this introductory course, I’m inclined to add my own definition: econometrics is “a study of the problems with regression using Ordinary Least Squares (OLS) and how to solve them”. This is obviously a gross oversimplification of the field; however, regression through OLS was the primary tool used for finding insights and patterns within data, and we spent the vast majority of the course examining it. In this post I’ll briefly summarize OLS mechanics and the classical OLS assumptions. In my next post, I’ll detail methods for dealing with violations of OLS assumptions. My hope is that reading this may help you understand some key terminology and the reasoning behind why certain econometric tools are employed.

OLS mechanics

Our primary interest when creating an econometric model is to estimate some dependent variable, y, using observations from a set of independent variables, X. Usually y is a vector of length n, where n is the number of observations, and X is a matrix of size (n x k), where k is the number of explanatory variables (you can think of X as a table of observations, where each column contains a different variable and each row represents an observation of those variables). The goal of OLS regression is to estimate the coefficients, beta, for the model:

y = X\beta+\epsilon

Where beta is a k by 1 vector of coefficients on X and epsilon is an n by 1 vector of error terms.

OLS regression estimates beta by minimizing the sum of squared errors (hence the name “least squares”). Put in matrix notation, OLS estimates beta using the equation:

\hat{\beta} = \arg\min_{\beta} SSE_N(\beta), \quad SSE_N(\beta) = \epsilon'\epsilon

The optimal beta estimate can be found through the following equations:

\epsilon = y-X\hat{\beta}

\epsilon ' \epsilon =  (y-X\hat{\beta})'(y-X\hat{\beta})

Taking the derivative and setting it equal to zero:

-2X'y+2X'X\hat{\beta} = 0

Then solving for the beta estimate:

\hat{\beta} = (X'X)^{-1}X'y
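As a quick numerical check, the estimator can be computed directly with numpy; this is just an illustrative sketch with made-up data, not code from the course.

import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # intercept + 2 regressors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# beta_hat = (X'X)^{-1} X'y; solving the normal equations avoids an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # should be close to beta_true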

 

Estimation of y using OLS regression can be visualized as the orthogonal projection of the vector y onto the column space of X. The estimated error term, epsilon, is the orthogonal distance between the projection and the true vector y.  Figure 1 shows this projection for a y that is regressed on two explanatory variables, X1 and X2.


Figure 1: OLS regression as an orthogonal projection of vector y onto the column space of matrix X. The error term, \hat{\epsilon}, is the orthogonal distance between y and X\hat{\beta}. (Image source: Wikipedia commons)

 Assumptions and properties of OLS regression

The Gauss-Markov Theorem states that under a certain set of assumptions, the OLS estimator is the Best Linear Unbiased Estimator (BLUE) for vector y.

To understand the full meaning of the Gauss-Markov theorem, it’s important to define two fundamental properties that can be used to describe estimators, consistency and efficiency. An estimator is consistent if its value will converge to the true parameter value as the number of observations goes to infinity. An estimator is efficient if its asymptotic variance is no larger than the asymptotic variance of any other possible consistent estimator for the parameter. In light of these definitions, the Gauss-Markov Theorem can be restated as: estimators found using OLS will be the most efficient consistent estimator for beta as long as the classical OLS assumptions hold. The remainder of this post will be devoted to describing the necessary assumptions for the OLS estimator to be the BLUE and detailing fixes for when these assumptions are violated.

The four classical assumptions for OLS to be the BLUE are:

  1. Linearity: The relationship between X and y is linear, following the functional form:

y = X\beta+\epsilon.

2. Strict exogeneity: The error terms \epsilon should be independent of the values of the explanatory variables, X. Put in equation form, this assumption requires:

E(\epsilon_i|X) = 0

E(\epsilon_i) =0

3. No perfect multicollinearity: the columns of X should not be linearly dependent, i.e., no explanatory variable should be an exact linear combination of the others (see my earlier post on dealing with multicollinearity for fixes for violations of this assumption).

4. Spherical Error: Error terms should be homoskedastic, meaning their variance is constant across observations, and should be uncorrelated with one another. Put in equation form:

E(\epsilon_i^2|X) =\sigma^2

Where \sigma^2 is a constant value.

E(\epsilon_i \epsilon_j|X)=0 \quad \text{for } i \neq j

Using assumption 4, we can define the variance of \hat{\beta} as:

var(\hat{\beta}_{OLS}) = \sigma^2(X'X)^{-1}
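Continuing the hypothetical numpy sketch from above (reusing X, y, beta_hat, n and k), sigma squared can be estimated from the residuals and plugged into this formula to obtain standard errors:

residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - k)        # unbiased estimate of sigma^2
var_beta_hat = sigma2_hat * np.linalg.inv(X.T @ X)  # sigma^2 (X'X)^{-1}
std_errors = np.sqrt(np.diag(var_beta_hat))         # standard errors of the coefficients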

If assumptions 1-4 hold, then the OLS estimate for beta is the BLUE. If, however, any of the assumptions are broken, we must employ other methods for estimating our regression coefficients.

In my next post I’ll detail the methods econometricians use when these assumptions are violated.

 References:

Hansen, Bruce. “Econometrics”. 2010. University of Wisconsin

http://www.ssc.wisc.edu/~bhansen/econometrics/Econometrics2010.pdf

Calculating Risk-of-Failures as in the Research Triangle papers (2014-2016) – Part 1

There has been a series of papers (e.g., Palmer and Characklis, 2009; Zeff et al., 2014; Herman et al., 2014) suggesting the use of an approximate risk-of-failure (ROF) metric, as opposed to the more conventional days of supply remaining, for utilities’ managers to decide when to enact not only water use restrictions, but also water transfers between utilities. This approach was expanded to decisions about the best time and in which new infrastructure project a utility should invest (Zeff et al., 2016), as opposed to setting fixed times in the future for either construction or options evaluation. What all these papers have in common is that drought mitigation and infrastructure expansion decisions are triggered when the values of the short- and long-term ROFs, respectively, for a given utility exceed those of pre-set triggers.

For example, the figure below shows that as streamflows (black line, subplot “a”) get lower while demands are maintained (subplot “b”), the combined storage levels of the fictitious utility start to drop around the month of April (subplot “c”), increasing the utility’s short-term ROF (subplot “d”) until it finally triggers transfers and restrictions (subplot “e”). Despite the triggered restrictions and transfers, the utility’s combined storage levels crossed the dashed line in subplot “c”, which denotes the failure criterion (i.e., combined storage levels dropping below 20% of the total capacity).

[Figure: streamflows (a), demands (b), combined storage levels (c), short-term ROF (d), and triggered restrictions and transfers (e) for the fictitious utility]

It is beyond the scope of this post to go into the details presented in all of these papers, but even after reading them the readers may be wondering how exactly ROFs are calculated. In this post, I’ll try to show in a graphical and concise manner how short-term ROFs are calculated.

In order to calculate a utility’s ROF for week m, we would run 50 independent simulations (henceforth called ROF simulations), all departing from the system conditions (reservoir storage levels, demand probability density function, etc.) observed in week m, and each using one of the 50 years of streamflow time series recorded immediately prior to week m. The utility’s ROF is then calculated as the number of ROF simulations in which the combined storage level of that utility dropped below 20% of the total capacity in at least one week, divided by the number of ROF simulations run (50). An animation of the process can be seen below.

[Animation of the ROF calculation process]

For example, for a water utility that started using ROF triggers on 01/01/2017, this week’s short-term ROF (02/13/2017, or week m=7) would be calculated using the recorded streamflows from weeks 6 through -47 (assuming here a year of 52 weeks, for simplicity) for ROF simulation 1, the streamflows from weeks -48 to -99 for ROF simulation 2, and so on until we reach 50 simulations. However, if the utility is running an optimization or scenario evaluation and wants to calculate the ROF in week 16 (04/10/2017) of a system simulation, ROF simulation 1 would use 10 weeks of synthetically generated streamflows (16 to 7) and 42 weeks of historical records (weeks 6 to -45), simulation 2 would use records for weeks -46 to -97, and so on, as in a 50-year moving window.
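A minimal sketch of the short-term ROF calculation described above might look like the following Python, where simulate_storage is a hypothetical function that runs the water supply model for one ROF simulation and returns weekly combined storage as a fraction of total capacity:

def short_term_rof(initial_conditions, inflow_windows, simulate_storage,
                   failure_fraction=0.2, n_simulations=50):
    """Fraction of ROF simulations in which combined storage drops below
    failure_fraction of capacity in at least one week."""
    failures = 0
    for inflows in inflow_windows[:n_simulations]:
        # each simulation starts from the current system conditions and uses one
        # 52-week window of previously recorded (or synthetic) streamflows
        storage_series = simulate_storage(initial_conditions, inflows)
        if min(storage_series) < failure_fraction:
            failures += 1
    return failures / n_simulations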

In another blog post, I will show how to calculate the long-term ROF and the reasoning behind it.

Works cited

Herman, J. D., H. B. Zeff, P. M. Reed, and G. W. Characklis (2014), Beyond optimality: Multistakeholder robustness tradeoffs for regional water portfolio planning under deep uncertainty, Water Resour. Res., 50, 7692–7713, doi:10.1002/2014WR015338.

Palmer, R., and G. W. Characklis (2009), Reducing the costs of meeting regional water demand through risk-based transfer agreements, J. Environ. Manage., 90(5), 1703–1714.

Zeff, H. B., J. R. Kasprzyk, J. D. Herman, P. M. Reed, and G. W. Characklis (2014), Navigating financial and supply reliability tradeoffs in regional drought management portfolios, Water Resour. Res., 50, 4906–4923, doi:10.1002/2013WR015126.

Zeff, H. B., J. D. Herman, P. M. Reed, and G. W. Characklis (2016), Cooperative drought adaptation: Integrating infrastructure development, conservation, and water transfers into adaptive policy pathways, Water Resour. Res., 52, 7327–7346, doi:10.1002/2016WR018771.

 

Synthetic streamflow generation

A recent research focus of our group has been the development and use of synthetic streamflow generators.  There are many tools one might use to generate synthetic streamflows and it may not be obvious which is right for a specific application or what the inherent limitations of each method are.  More fundamentally, it may not be obvious why it is desirable to generate synthetic streamflows in the first place.  This will be the first in a series of blog posts on the synthetic streamflow generators in which I hope to sketch out the various categories of generation methods and their appropriate use as I see it.  In this first post we’ll focus on the motivation and history behind the development of synthetic streamflow generators and broadly categorize them.

Why should we use synthetic hydrology?

The most obvious reason to use synthetic hydrology is if there is little or no data for your system (see Lamontagne, 2015).  Another obvious reason is if you are trying to evaluate the effect of hydrologic non-stationarity on your system (Herman et al. 2015; Borgomeo et al. 2015).  In that case you could use synthetic methods to generate flows reflecting a shift in hydrologic regime.  But are there other reasons to use synthetic hydrology?

In water resources systems analysis it is common practice to evaluate the efficacy of management or planning strategies by simulating system performance over the historical record, or over some critical period.  In this approach, new strategies are evaluated by asking the question: how well would we have done with the new strategy?

This may be an appealing approach, especially if some event was particularly traumatic to your system. But is this a robust way of evaluating alternative strategies?  It’s important to remember that any hydrologic record, no matter how long, is only a single realization of a stochastic process.  Importantly, drought and flood events emerge as the result of specific sequences of events, unlikely to be repeated.  Furthermore, there is a 50% chance that the worst flood or drought in an N-year record will be exceeded in the next N years.  Is it well advised to tailor our strategies to past circumstances that will likely never be repeated and will as likely as not be exceeded?  As Lettenmaier et al. [1987] remind us, “Little is certain about the future except that it will be unlike the past.”
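The 50% figure follows from a simple symmetry argument: if the annual maxima are independent and identically distributed (and ignoring ties), each of the 2N years is equally likely to contain the single largest event, so

P\left(\max_{N+1 \le i \le 2N} Q_i > \max_{1 \le i \le N} Q_i\right) = \frac{N}{2N} = \frac{1}{2}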

Even under stationarity and even with long hydrologic records, the use of synthetic streamflow can improve the efficacy of planning and management strategies by exposing them to larger and more diverse floods and droughts than those in the record (Loucks et al. 1981; Vogel and Stedinger, 1988; Loucks et al. 2005).  Figure 7.12 from Loucks et al. 2005 shows a typical experimental set-up using synthetic hydrology with a simulation model.  Often our group will wrap an optimization model like Borg around this set up, where the system design/operating policy (bottom of the figure) are the decision variables, and the system performance (right of the figure) are the objective(s).

[Figure 7.12 from Loucks et al. (2005): typical experimental set-up coupling synthetic hydrology with a simulation model]

 

What are the types of generators?

Many synthetic streamflow generation techniques have been proposed since the early 1960s.  It can be difficult for a researcher or practitioner to know which method is best suited to the problem at hand.  Thus, we’ll start with a very broad characterization of what is out there, then proceed to some history.

Broadly speaking there are two approaches to generating synthetic hydrology: indirect and direct.  The indirect approach generates streamflow by synthetically generating the forcings to a hydrologic model.  For instance one might generate precipitation and temperature series and input them to a hydrologic model of a basin (e.g. Steinschneider et al. 2014).  In contrast, direct methods use statistical techniques to generate streamflow timeseries directly.

The direct approach is generally easier to apply and more parsimonious because it does not require the selection, calibration, and validation of a separate hydrologic model (Najafi et al. 2011).  On the other hand, the indirect approach may be desirable.  Climate projections from GCMs often include temperature or precipitation changes, but may not describe hydrologic shifts at a resolution or precision that is useful.  In other cases, profound regime shifts may be difficult to represent with statistical models and may require process-driven models, thus necessitating the indirect approach.

Julie’s earlier series focused on indirect approaches, so we’ll focus on the direct approach.  Regardless of the approach, many of the methods are the same.  In general, generator methods can be divided into two categories: parametric and non-parametric.  Parametric methods rely on a hypothesized statistical model of streamflow whose parameters are selected to achieve a desired result (Stedinger and Taylor, 1982a).  In contrast, non-parametric methods do not make strong structural assumptions about the processes generating the streamflow, but rather rely on re-sampling from the hydrologic record in some way (Lall, 1995).  Some methods combine parametric and non-parametric techniques, which we’ll refer to as semi-parametric (Herman et al. 2015).

Both parametric and non-parametric methods have advantages and disadvantages.  Parametric methods are often parsimonious, and often have analytical forms that allow easy parameter manipulation to reflect non-stationarity.  However, there can be concern that the underlying statistical models may not reflect the hydrologic reality well (Sharma et al. 1997).  Furthermore, in multi-dimensional, multi-scale problems the proliferation of parameters can make parametric models intractable (Grygier and Stedinger, 1988).  Extensive work has been done to confront both challenges, but they may lead a researcher to adopt a non-parametric method instead.

Because many non-parametric methods ‘re-sample’ flows from a record, realism is not generally a concern, and most re-sampling schemes are computationally straightforward (relatively speaking).  On the other hand, manipulating synthetic flows to reflect non-stationarity may not be as straightforward as a simple parameter change, though methods have been suggested (Herman et al. 2015; Borgomeo et al. 2015).  More fundamentally, because non-parametric methods rely so heavily on the data, they require sufficiently long records to ensure there is enough hydrologic variability to sample.  Short records can be a concern for parametric methods as well, though parametric uncertainty can be explicitly considered in those methods (Stedinger and Taylor, 1982b).  Of course, parametric methods also have structural uncertainty that non-parametric models largely avoid by not assuming an explicit statistical model.
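To make the re-sampling idea concrete, here is a minimal annual block bootstrap in numpy (a generic sketch, not any of the published methods): whole years of weekly flows are re-sampled with replacement, which preserves within-year correlation and seasonality but cannot produce flows outside the observed record.

import numpy as np

def annual_block_bootstrap(historical_flows, n_years_out, seed=None):
    """historical_flows: array of shape (n_years, 52) of weekly flows.
    Returns a synthetic record of shape (n_years_out, 52)."""
    rng = np.random.default_rng(seed)
    sampled_years = rng.integers(0, historical_flows.shape[0], size=n_years_out)
    return historical_flows[sampled_years]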

In the coming posts we’ll dig into the nuances of the different methods in greater detail.

A historical perspective

The first use of synthetic flow generation seems to have been by Hazen [1914].  That work attempted to quantify the reliability of a water supply by aggregating the streamflow records of local streams into a 300-year ‘synthetic record.’  Of course the problem with this is that the cross-correlation between concurrent flows rendered the effective record length much less than the nominal 300 years.

Next Barnes [1954] generated 1,000 years of streamflow for a basin in Australia by drawing random flows from a normal distribution with mean and variance equal to the sample estimates from the observed record.  That work was extended by researchers from the Harvard Water Program to account for autocorrelation of monthly flows (Maass et al., 1962; Thomas and Fiering, 1962).  Later work also considered the use of non-normal distributions (Fiering, 1967), and the generation of correlated concurrent flows at multiple sites (Beard, 1965; Matalas, 1967).

Those early methods relied on first-order autoregressive models that regressed flows in the current period on the flows of the previous period (see Loucks et al.’s Figure 7.13  below).  Box and Jenkins [1970] extended those methods to autoregressive models of arbitrary order, moving average models of arbitrary order, and autoregressive-moving average models of arbitrary order.  Those models were the focus of extensive research over the course of the 1970s and 1980s and underpin many of the parametric generators that are widely used in hydrology today (see Salas et al. 1980; Grygier and Stedinger, 1990; Salas, 1993; Loucks et al. 2005).

[Figure 7.13 from Loucks et al. (2005): first-order autoregressive streamflow generation]
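As a rough illustration of the lag-1 idea, a single-site generator of (assumed normal) annual flows can be written in a few lines; this is a sketch of the general approach, not the Thomas-Fiering model itself.

import numpy as np

def ar1_generate(flows, n_years, seed=None):
    """Generate n_years of synthetic flows from a lag-1 autoregressive model
    fit to the historical record (normality assumed for simplicity)."""
    rng = np.random.default_rng(seed)
    mu, sigma = flows.mean(), flows.std(ddof=1)
    z = (flows - mu) / sigma                      # standardized historical flows
    rho = np.corrcoef(z[:-1], z[1:])[0, 1]        # lag-1 autocorrelation
    synthetic = np.empty(n_years)
    z_prev = rng.normal()
    for t in range(n_years):
        z_prev = rho * z_prev + np.sqrt(1 - rho**2) * rng.normal()
        synthetic[t] = mu + sigma * z_prev
    return synthetic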

By the mid-1990s, non-parametric methods began to gain popularity (Lall, 1995).  While much of this work has its roots in earlier work from the 1970s and 1980s (Yakowitz, 1973, 1979, 1985; Schuster and Yakowitz, 1979; Yakowitz and Karlsson, 1987; Karlsson and Yakowitz, 1987), improvements in computing and the availability of large data sets meant that by the mid-1990s non-parametric methods were feasible (Lall and Sharma, 1996).  Early examples of non-parametric methods include block bootstrapping (Vogel and Shallcross, 1996), k-nearest neighbor (Lall and Sharma, 1996), and kernel density methods (Sharma et al. 1997).  Since that time extensive research has made improvements to these methods, often by incorporating parametric elements.  For instance, Srinivas and Srinivasan (2001, 2005, and 2006) develop a hybrid autoregressive-block bootstrapping method designed to improve the bias in lagged correlation and to generate flows other than the historical, for multiple sites and multiple seasons.  K-nearest neighbor methods have also been the focus of extensive research (Rajagopalan and Lall, 1999; Harrold et al. 2003; Yates et al. 2003; Sharif and Burn, 2007; Mehrotra and Sharma, 2006; Prairie et al. 2006; Lee et al. 2010, Salas and Lee, 2010, Nowak et al., 2010), including recent work by our group (Giuliani et al. 2014).

Emerging work focuses on stochastic streamflow generation using copulas [Lee and Salas, 2011; Fan et al. 2016], entropy theory bootstrapping [Srivastav and Simonovic, 2014], and wavelets [Kwon et al. 2007; Erkyihun et al., 2016], among other methods.

In the following posts I’ll address different challenges in stochastic generation [e.g. long-term persistence, parametric uncertainty, multi-site generation, seasonality, etc.] and the relative strengths and shortcomings of the various methods for addressing them.

Works Cited

Barnes, F. B., Storage required for a city water supply, J. Inst. Eng. Australia 26(9), 198-203, 1954.

Beard, L. R., Use of interrelated records to simulate streamflow, J. Hydrol. Div., ASCE 91(HY5), 13-22, 1965.

Borgomeo, E., Farmer, C. L., and Hall, J. W. (2015). “Numerical rivers: A synthetic streamflow generator for water resources vulnerability assessments.” Water Resour. Res., 51(7), 5382–5405.

Y.R. Fan, W.W. Huang, G.H. Huang, Y.P. Li, K. Huang, Z. Li, Hydrologic risk analysis in the Yangtze River basin through coupling Gaussian mixtures into copulas, Advances in Water Resources, Volume 88, February 2016, Pages 170-185.

Fiering, M.B, Streamflow Synthesis, Harvard University Press, Cambridge, Mass., 1967.

Giuliani, M., J. D. Herman, A. Castelletti, and P. Reed (2014), Many-objective reservoir policy identification and refinement to reduce policy inertia and myopia in water management, Water Resour. Res., 50, 3355–3377, doi:10.1002/2013WR014700.

Grygier, J.C., and J.R. Stedinger, Condensed Disaggregation Procedures and Conservation Corrections for Stochastic Hydrology, Water Resour. Res. 24(10), 1574-1584, 1988.

Grygier, J.C., and J.R. Stedinger, SPIGOT Technical Description, Version 2.6, 1990.

Harrold, T. I., Sharma, A., and Sheather, S. J. (2003). “A nonparametric model for stochastic generation of daily rainfall amounts.” Water Resour. Res., 39(12), 1343.

Hazen, A., Storage to be provided in impounding reservoirs for municipal water systems, Trans. Am. Soc. Civ. Eng. 77, 1539, 1914.

Herman, J.D., H.B. Zeff, J.R. Lamontagne, P.M. Reed, and G. Characklis (2016), Synthetic Drought Scenario Generation to Support Bottom-Up Water Supply Vulnerability Assessments, Journal of Water Resources Planning & Management, doi: 10.1061/(ASCE)WR.1943-5452.0000701.

Karlsson, M., and S. Yakowitz, Nearest-Neighbor methods for nonparametric rainfall-runoff forecasting, Water Resour. Res., 23, 1300-1308, 1987.

Kwon, H.-H., U. Lall, and A. F. Khalil (2007), Stochastic simulation model for nonstationary time series using an autoregressive wavelet decomposition: Applications to rainfall and temperature, Water Resour. Res., 43, W05407, doi:10.1029/2006WR005258.

Lall, U., Recent advances in nonparametric function estimation: Hydraulic applications, U.S. Natl. Rep. Int. Union Geod. Geophys. 1991- 1994, Rev. Geophys., 33, 1093, 1995.

Lall, U., and A. Sharma (1996), A nearest neighbor bootstrap for resampling hydrologic time series, Water Resour. Res. 32(3), pp. 679-693.

Lamontagne, J.R. (2015), Representation of Uncertainty and Corridor DP for Hydropower Optimization, PhD dissertation, Cornell University, Ithaca, NY.

Lee, T., J. D. Salas, and J. Prairie (2010), An enhanced nonparametric streamflow disaggregation model with genetic algorithm, Water Resour. Res., 46, W08545, doi:10.1029/2009WR007761.

Lee, T., and J. Salas (2011), Copula-based stochastic simulation of hydrological data applied to Nile River flows, Hydrol. Res., 42(4), 318–330.

Lettenmaier, D. P., K. M. Latham, R. N. Palmer, J. R. Lund and S. J. Burges, Strategies for coping with drought Part II: Planning techniques for planning and reliability assessment, EPRI P-5201, Final Report Project 2194-1, June 1987.

Loucks, D.P., Stedinger, J.R. & Haith, D.A. 1981, Water Resources Systems Planning and Analysis, 1st edn, Prentice-Hall, Englewood Cliffs, N.J.

Loucks, D.P. et al. 2005, Water Resources Systems Planning and Management: An Introduction to Methods, Models and Applications, UNESCO, Delft, The Netherlands.

Maass, A., M. M. Hufschmidt, R. Dorfman, H. A. Thomas, Jr., S. A. Marglin and G. M. Fair, Design of Water Resource Systems, Harvard University Press, Cambridge, Mass., 1962.

Matalas, N. C., Mathematical assessment of synthetic hydrology, Water Resour. Res. 3(4), 937-945, 1967.

Najafi, M. R., Moradkhani, H., and Jung, I. W. (2011). “Assessing the uncertainties of hydrologic model selection in climate change impact studies.” Hydrol. Process., 25(18), 2814–2826.

Nowak, K., J. Prairie, B. Rajagopalan, and U. Lall (2010), A nonparametric stochastic approach for multisite disaggregation of annual to daily streamflow, Water Resour. Res., 46, W08529, doi:10.1029/2009WR008530.

Prairie, J. R., Rajagopalan, B., Fulp, T. J., and Zagona, E. A. (2006). “Modified K-NN model for stochastic streamflow simulation.” J. Hydrol. Eng., 11(4), 371–378.

Rajagopalan, B., and Lall, U. (1999). “A k-nearest-neighbor simulator for daily precipitation and other weather variables.” Water Resour. Res., 35(10), 3089–3101.

Salas, J. D., J. W. Deller, V. Yevjevich and W. L. Lane, Applied Modeling of Hydrologic Time Series, Water Resources Publications, Littleton, Colo., 1980.

Salas, J.D., 1993, Analysis and Modeling of Hydrologic Time Series, Chapter 19 (72 p.) in The McGraw Hill Handbook of Hydrology, D.R. Maidment, Editor.

Salas, J.D., T. Lee. (2010). Nonparametric Simulation of Single-Site Seasonal Streamflow, J. Hydrol. Eng., 15(4), 284-296.

Schuster, E., and S. Yakowitz, Contributions to the theory of nonparametric regression, with application to system identification, Ann. Stat., 7, 139-149, 1979.

Sharif, M., and Burn, D. H. (2007). “Improved K-nearest neighbor weather generating model.” J. Hydrol. Eng., 12(1), 42–51.

Sharma, A., Tarboton, D. G., and Lall, U., 1997. “Streamflow simulation: A nonparametric approach.” Water Resour. Res., 33(2), 291–308.

Srinivas, V. V., and Srinivasan, K. (2001). “A hybrid stochastic model for multiseason streamflow simulation.” Water Resour. Res., 37(10), 2537–2549.

Srinivas, V. V., and Srinivasan, K. (2005). “Hybrid moving block bootstrap for stochastic simulation of multi-site multi-season streamflows.” J. Hydrol., 302(1–4), 307–330.

Srinivas, V. V., and Srinivasan, K. (2006). “Hybrid matched-block bootstrap for stochastic simulation of multiseason streamflows.” J. Hydrol., 329(1–2), 1–15.

Roshan K. Srivastav, Slobodan P. Simonovic, An analytical procedure for multi-site, multi-season streamflow generation using maximum entropy bootstrapping, Environmental Modelling & Software, Volume 59, September 2014, Pages 59-75.

Stedinger, J. R. and M. R. Taylor, Synthetic streamflow generation, Part 1. Model verification and validation, Water Resour. Res. 18(4), 909-918, 1982a.

Stedinger, J. R. and M. R. Taylor, Synthetic streamflow generation, Part 2. Parameter uncertainty, Water Resour. Res. 18(4), 919-924, 1982b.

Steinschneider, S., Wi, S., and Brown, C. (2014). “The integrated effects of climate and hydrologic uncertainty on future flood risk assessments.” Hydrol. Process., 29(12), 2823–2839.

Thomas, H. A. and M. B. Fiering, Mathematical synthesis of streamflow sequences for the analysis of river basins by simulation, in Design of Water Resource Systems, by A. Maass, M. Hufschmidt, R. Dorfman, H. A. Thomas, Jr., S. A. Marglin and G. M. Fair, Harvard University Press, Cambridge, Mass., 1962.

Vogel, R.M., and J.R. Stedinger, The value of stochastic streamflow models in over-year reservoir design applications, Water Resour. Res. 24(9), 1483-90, 1988.

Vogel, R. M., and A. L. Shallcross (1996), The moving block bootstrap versus parametric time series models, Water Resour. Res., 32(6), 1875–1882.

Yakowitz, S., A stochastic model for daily river flows in an arid region, Water Resour. Res., 9, 1271-1285, 1973.

Yakowitz, S., Nonparametric estimation of markov transition functions, Ann. Stat., 7, 671-679, 1979.

Yakowitz, S. J., Nonparametric density estimation, prediction, and regression for markov sequences J. Am. Stat. Assoc., 80, 215-221, 1985.

Yakowitz, S., and M. Karlsson, Nearest-neighbor methods with application to rainfall/runoff prediction, in Stochastic  Hydrology, edited by J. B. Macneil and G. J. Humphries, pp. 149-160, D. Reidel, Norwell, Mass., 1987.

Yates, D., Gangopadhyay, S., Rajagopalan, B., and Strzepek, K. (2003). “A technique for generating regional climate scenarios using a nearest-neighbor algorithm.” Water Resour. Res., 39(7), 1199.

Basic Machine Learning in Python with Scikit-learn

Machine learning has become a hot topic in the last few years, and for good reason: it provides data analysts with efficient ways of extracting information from data for analysis and modeling purposes.

The Scikit-learn Python library has implementations of dozens of learning algorithms and is freely available for academic and commercial use under the terms of the BSD license. Some of these algorithms can be extremely useful for our job as water systems analysts, so given the overwhelming number of algorithms implemented in Scikit-learn, I thought I would mention a few I find particularly useful for my research. For each method below I included links with examples from the Scikit-learn website. Installation and usage instructions can also be found there.

CART Trees

CART trees can be used for regression or classification. Like any single tree, CART trees are considered poor (generally high-variance) classifiers unless bootstrap aggregated (bagged) or boosted (see supervised learning), but the resulting rules are easily interpretable.

CART Trees: http://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart
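A minimal usage sketch with synthetic data (the variables below are made up for illustration):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))                       # two made-up predictors
y = np.where(X[:, 0] > 0.5, 1.0, 0.2) + 0.05 * rng.normal(size=200)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)  # shallow tree keeps the rules readable
print(tree.predict([[0.8, 0.3]]))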

Dimensionality reduction

Principal Component Analysis (PCA) is perhaps the most widely used dimensionality reduction technique. It works by finding the basis that maximizes the data’s variance, allowing for the elimination of axes that have low variance. Among its uses are noise reduction, data visualization (as it approximately preserves the distances between data points), and improving the computational efficiency of other algorithms by getting rid of redundant information. PCA can be used in its pure form or it can be kernelized to handle data sets whose variance is maximum in a non-linear direction. Manifold learning is another way of performing dimensionality reduction by unwinding the lower dimensional manifold on which the information lies.

PCA: http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_3d.html#sphx-glr-auto-examples-decomposition-plot-pca-3d-py

Kernel PCA: http://scikit-learn.org/stable/auto_examples/decomposition/plot_kernel_pca.html#sphx-glr-auto-examples-decomposition-plot-kernel-pca-py

Manifold learning: http://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html#sphx-glr-auto-examples-manifold-plot-compare-methods-py
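A minimal PCA sketch (random data standing in for a real data set):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 10))                    # 500 samples, 10 features
data[:, 1] = 0.9 * data[:, 0] + 0.1 * data[:, 1]     # make two features redundant

pca = PCA(n_components=3).fit(data)
reduced = pca.transform(data)                        # shape (500, 3)
print(pca.explained_variance_ratio_)                 # variance captured by each component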

Clustering

Clustering is used to group similar points in a data set. One example is the problem of finding customer niches based on the products each customer buys. The most famous clustering algorithm is k-means, which, like any other machine learning algorithm, works well on some data sets but not on others. There are several alternative algorithms, all of which are exemplified in the following two links:

Clustering algorithms comparison: http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html#sphx-glr-auto-examples-cluster-plot-cluster-comparison-py

Gaussian Mixture Models (similar to k-means, but also provide covariance estimates for each cluster): http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html#sphx-glr-auto-examples-mixture-plot-gmm-covariances-py

Reducing the dimensionality of a dataset with PCA or kernel PCA may speed up clustering algorithms.
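A minimal k-means sketch on synthetic blobs (the number of clusters is something the analyst must choose):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_                              # cluster assignment of each point
print(kmeans.cluster_centers_)                       # one centroid per cluster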

Supervised learning

Supervised learning algorithms can be used for regression or classification problems (e.g. classifying a point as pass/fail) based on labeled data sets. The most “trendy” one nowadays is neural networks, but support vector machines, boosted and bagged trees, and others are also options that should be considered and tested on your data set. Below are links to some of the supervised learning algorithms implemented in Scikit-learn:

Comparison between supervised learning algorithms: http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py

Neural networks: http://scikit-learn.org/stable/modules/neural_networks_supervised.html

Gaussian processes are another supervised learning method (for regression) that can also be used for Bayesian optimization:

Gaussian processes: http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpr_noisy_targets.html#sphx-glr-auto-examples-gaussian-process-plot-gpr-noisy-targets-py

 

Synthetic Weather Generation: Part V

Conditioning Synthetic Weather Generation on Seasonal Climate Forecasts

This is the final blog post in a five part series on synthetic weather generators. See Parts I and II for a description of single-site parametric and non-parametric generators, respectively, and Part III for a description of multi-site generators of both types.

In my previous post, Part IV, I discussed how parametric and non-parametric weather generators can be modified to produce weather that is consistent with climate change projections for use in long-term planning. In the shorter-term, water managers may be able to exploit mid-range climate forecasts to inform seasonal reservoir operations (see e.g., Kim and Palmer (1997), Chiew et al. (2003), Block (2011), Block and Goddard (2012), Anghileri et al. (2016)). For such analyses, it could be useful to tailor management plans to simulated weather conditions consistent with these probabilistic forecasts. Here I discuss how one can condition weather generators on seasonal climate forecasts for such purposes.

Two major forecasting groups, the International Research Institute (IRI) for Climate and Society at Columbia University and the Climate Prediction Center (CPC) of the U.S. National Centers for Environmental Prediction, issue tercile seasonal forecasts that specify the probabilities of observing above normal (pA), near normal (pN) and below normal (pB) precipitation and temperature. Forecasts are issued each month for the upcoming three months (see those from IRI and CPC). While these forecasts are derived from dynamical and statistical models that include a variety of physically-based processes, most of the forecast skill can be explained by the effects of the El Niño-Southern Oscillation (ENSO) on the climate system (Barnston et al., 1994). As most of you probably know, ENSO refers to the quasi-periodic cycling of sea surface temperatures (SSTs) in the tropical eastern Pacific Ocean. The warm phase is known as El Niño and the cool phase as La Niña. These are quantified as five consecutive three-month periods with SST anomalies in the Niño 3.4 region of the Pacific > +0.5°C (El Niño) or < -0.5°C (La Niña). All other periods are considered neutral (Climate Prediction Center, 2016). Because much of seasonal climate forecast skill is derived from ENSO, current or forecasted SST anomalies in the Niño 3.4 region are sometimes used by themselves as proxy seasonal forecasts. Here I will discuss techniques for conditioning weather generator parameters on either tercile forecasts or current or projected ENSO conditions.

Parametric Weather Generators

Wilks (2002) presents a method for conditioning parametric weather generators of the Richardson (1981) type on tercile seasonal climate forecasts using a network of sites in New York State. The key idea, derived from Briggs and Wilks (1996), is to estimate the weather generator parameters not from the entire historical record, but from a weighted bootstrapped sample of the historical record consistent with the forecast. This is similar to the method of streamflow generation used by Herman et al. (2016) to increase the frequency of droughts of a given magnitude. As an illustrative example, Herman et al. (2016) empirically estimate the quantile of a noteworthy drought from the historical record and show the system impacts of droughts of that magnitude or worse becoming n times more likely. This is done by adapting the semi-parametric streamflow generator developed by Kirsch et al. (2013) to sample years with historical droughts of at least that magnitude n times more often.

While Herman et al. (2016) take a fully non-parametric approach by estimating quantiles empirically, Briggs and Wilks (1996) estimate terciles parametrically by fitting a normal distribution to the historical mean seasonal temperatures and a Gamma distribution to the historical seasonal precipitation totals at each site. After estimating terciles, Briggs and Wilks (1996) classify each year in the historical record as below normal, near normal or above normal in terms of temperature and precipitation in the season of interest. Because terciles are estimated parametrically, each category will not necessarily contain an equal number of years, even if the record length is a multiple of three. It should be noted, however, that like Herman et al. (2016), IRI defines terciles empirically from the most recent 30-yr record (see more here and here) and CPC from the most recent 30-yr record updated only every 10 years, i.e., the current reference frame is 1981-2010 (see more here).  IRI provides a discussion of the advantages and disadvantages to parametric and non-parametric tercile estimation methods. For consistency, it may be best to take the same approach as the agency whose forecasts are being used.

Once terciles have been defined, weather generator parameters can be estimated from a bootstrapped sample of the historical record consistent with the forecast (Wilks, 2002). Consider a historical record with NB years classified as being below normal, NN as near normal, and NA as above normal. Then the expected value of a given seasonal statistic, X, can be estimated from a bootstrapped sample of L years from the historical record in which the below normal, near normal and above normal years are sampled with probabilities pB, pN, and pA, respectively. Representing the statistic of interest in the ith below, near and above normal year as xi(B), xi(N) and xi(A), respectively, then the expected value of X is the following:

(1) E\left[X\right] = \lim_{L\to\infty} \frac{1}{L}\left[\sum_{i=1}^{N_B} \frac{p_BL}{N_B}x_{i}^{\left(B\right)} + \sum_{i=1}^{N_N} \frac{p_NL}{N_N}x_{i}^{\left(N\right)} + \sum_{i=1}^{N_A} \frac{p_AL}{N_A}x_{i}^{\left(A\right)}\right]

That is, the forecast-conditional value of X is a weighted sum of its average in the below normal, near normal and above normal years, where the weights are equal to the forecast probabilities for each of the three respective categories (Wilks, 2002). Note that X is a seasonal statistic, not a parameter, so one cannot simply estimate a weather generator parameter as a weighted average of its value in each of the terciles unless the parameter is itself a simple statistic.
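A small Python sketch of this forecast-weighted bootstrap may help (generic code under the assumptions above, not Wilks’ implementation): years are classified into terciles, then re-sampled with the forecast probabilities, and the statistic of interest is estimated from the weighted sample.

import numpy as np

def forecast_weighted_sample(yearly_stat, categories, p_below, p_above,
                             n_boot=10000, seed=None):
    """yearly_stat: seasonal statistic X for each historical year.
    categories: 'B', 'N' or 'A' tercile label for each year.
    Returns a bootstrapped sample of X consistent with the tercile forecast."""
    rng = np.random.default_rng(seed)
    yearly_stat = np.asarray(yearly_stat, dtype=float)
    categories = np.asarray(categories)
    p_near = 1.0 - p_below - p_above
    probs = np.empty(len(yearly_stat))
    # each year within a tercile is drawn with probability p_category / N_category
    for label, p_cat in zip(("B", "N", "A"), (p_below, p_near, p_above)):
        mask = categories == label
        probs[mask] = p_cat / mask.sum()
    return rng.choice(yearly_stat, size=n_boot, replace=True, p=probs)

# the mean of the returned sample approximates E[X] in Equation 1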

Fortunately, some of the weather generator parameters are simple statistics of this kind. Recall from Part I that the first step of the Richardson generator is to generate a sequence of daily rainfall occurrences from a first order Markov chain. This chain is defined by the probabilities of transitioning from a dry day to a wet day, p01, or from a wet day to another wet day, p11. As discussed in Part IV, these two parameters together define the unconditional probability of a wet day, π, and the first-order autocorrelation of the occurrences, d, where π = p01/(1 + p01 – p11) and d = p11 – p01 (Katz, 1983). The unconditional probability of a wet day is a simple statistic. Therefore, since pN = 1 – pA – pB, π can be estimated each month as a function of pB, pA and the average portion of wet days in below normal, near normal and above normal years for that month (x̅(B), x̅(N) and x̅(A)):

(2) \pi = p_B\bar{x}^{(B)} + (1 - p_B - p_A)\bar{x}^{(N)} + p_A\bar{x}^{(A)}

More generically, π = b0 + bBpB + bApA, where b0 = x̅(N), bB = x̅(B) – x̅(N) and bA = x̅(A) – x̅(N). Wilks (2002) recommends that the parameter π be estimated separately at each site and for each month, but that the below normal, near normal and above normal years be defined based on the total precipitation at that site in the entire three-month season. This is because the forecast is for the entire season, but the portion of wet days varies on a shorter time scale.

Like the unconditional probability of a wet day, the persistence parameter d can also be estimated as a function of pB, pA and the value of d in below normal, near normal and above normal years. Wilks (2002) shows that d can be represented by a quadratic function of pB and pA:

(3) d = b_0 + b_Bp_B + b_{BB}p_B^2 + b_Ap_A + b_{AA}p_A^2 + b_{BA}p_Bp_A

but finds that variations in d across forecasts are small such that one can reasonably assume the climatological estimate of d for all sites and months, regardless of the forecast.

The remaining weather generator parameters related to precipitation are those defining the distribution of precipitation amounts. Because these parameters (α and β if fitting a Gamma distribution, and α, λ1 and λ2 if fitting a mixed exponential distribution) are estimated iteratively in an MLE approach, they cannot be estimated as a function of the forecast probabilities like the occurrence parameters can. Instead, Wilks (2002) suggests using the Briggs and Wilks (1996) approach of bootstrapping a large sample of years from the historical record consistent with the forecast and fitting separate probability distributions to the precipitation amounts in each month’s weighted sample. When performing this estimation using a mixed exponential distribution to model precipitation amounts, Wilks (2002) found the estimates of the mixing parameter, α, to be the least consistent across the investigated sites and chose to hold it constant across sites and forecasts. Thus, only λ1 and λ2 were re-estimated for each month and site as a function of the seasonal forecast.

One drawback to the sampling scheme employed by Briggs and Wilks (1996) is that all historical years within each tercile have an equal probability of being sampled: pB/NB, pN/NN and pA/NA for below, near and above normal years, respectively. In reality, years similar to those in the tail are less likely to occur than years similar to those near the median. An alternative, more precise sampling scheme, the pdf-ratio method of Stedinger and Kim (2010), assigns each year i an un-normalized probability of selection, qi = (1/N)·f1(xi)/f0(xi), where N is the number of years in the historical record and f1 and f0 are pdfs of the statistic X under forecast and climatological conditions, respectively. The qi are then normalized so they sum to 1. Both f1 and f0 can be analytical or empirical distributions.
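
A sketch of the pdf-ratio weights is shown below; here f1 and f0 are taken to be normal distributions purely for illustration, since Stedinger and Kim (2010) allow either analytical or empirical choices:

import numpy as np
from scipy import stats

def pdf_ratio_weights(x_hist, f1, f0):
    # q_i proportional to (1/N) * f1(x_i) / f0(x_i), normalized to sum to one
    q = f1.pdf(x_hist) / f0.pdf(x_hist) / len(x_hist)
    return q / q.sum()

# Made-up seasonal totals; the forecast distribution f1 is shifted wet of the
# climatological f0, so wetter historical years receive larger selection probabilities.
x_hist = np.array([150.0, 210.0, 250.0, 280.0, 340.0])
weights = pdf_ratio_weights(x_hist, f1=stats.norm(280.0, 60.0), f0=stats.norm(250.0, 60.0))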

After estimating parameters of the precipitation amounts distributions using either the approach of Briggs and Wilks (1996) or Stedinger and Kim (2010), one must estimate the forecast-conditional temperature parameters. Recall that in the Richardson generator, separate harmonics are fit to the eight time series of means and standard deviations of minimum and maximum temperature on wet and dry days. Historical residuals from these fits are determined by first subtracting the predicted mean and then dividing by the predicted standard deviation. Finally, the residuals are modeled by an order-one vector autoregression, or VAR(1) model. Because forecast-conditional weather generators are only applied three months at a time, Wilks (2002) suggests instead fitting quadratic functions to these eight time series within the season of interest.

Like the parameters describing precipitation occurrences, the parameters of the quadratic functions of time describing the mean temperature on wet and dry days can be estimated as a weighted average of fits in each of the three terciles. As shown in Wilks (2002), the four mean temperature functions (minimum and maximum on wet and dry days), µ(t), at each site are specified by the function:

(4) µ(t) = (β0 + βBpB + βApA) + (γ0 + γBpB + γApA)t + (δ0 + δBpB + δApA)t²

where

(5) µ̅t(N) = β0 + γ0t + δ0t²,

(6) µ̅t(B) – µ̅t(N) = βB + γBt + δBt²,

(7) µ̅t(A) – µ̅t(N) = βA + γAt + δAt²,

t is the day and µ̅t(B), µ̅t(N) and µ̅t(A) are the mean temperature statistics of concern on each day of below normal, near normal and above normal years in the historical record, respectively.
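
One way to obtain these coefficients is to fit quadratics in t to the near normal daily means and to the tercile differences, exactly as Equations 5–7 suggest. The sketch below uses numpy.polyfit; the function and variable names are my own:

import numpy as np

def conditional_mean_temperature(t, mu_B, mu_N, mu_A, p_B, p_A):
    # t, mu_B, mu_N, mu_A are numpy arrays over the days of the season; the mu
    # arrays hold the daily mean temperature statistic (e.g., minimum temperature
    # on wet days) in below, near and above normal years.
    beta0, gamma0, delta0 = np.polyfit(t, mu_N, 2)[::-1]          # Equation 5
    betaB, gammaB, deltaB = np.polyfit(t, mu_B - mu_N, 2)[::-1]   # Equation 6
    betaA, gammaA, deltaA = np.polyfit(t, mu_A - mu_N, 2)[::-1]   # Equation 7
    return ((beta0 + betaB * p_B + betaA * p_A)                   # Equation 4
            + (gamma0 + gammaB * p_B + gammaA * p_A) * t
            + (delta0 + deltaB * p_B + deltaA * p_A) * t ** 2)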

Finally, Wilks (2002) shows that the standard deviations of minimum and maximum temperature on wet and dry days can be estimated by an extension of Equation 1:

(8) s\left(t\right) = \left[\frac{p_B}{N_B}\sum_{i=1}^{N_B}\left[x_{i}^{\left(B\right)}\left(t\right) - \mu\left(t\right)\right]^2 + \frac{p_N}{N_N}\sum_{i=1}^{N_N}\left[x_{i}^{\left(N\right)}\left(t\right) - \mu\left(t\right)\right]^2 + \frac{p_A}{N_A}\sum_{i=1}^{N_A}\left[x_{i}^{\left(A\right)}\left(t\right) - \mu\left(t\right)\right]^2\right]^{1/2}

where µ(t) is defined as in Equation 4. Once again, the forecast-conditional standard deviations of the four temperature series, σ(t), can then be estimated by quadratic functions of time, conditional on the forecast probabilities pA and pB:

(9) σ(t) = (β0 + βBpB + βBBpB² + βApA + βAApA² + βBApBpA) + (γ0 + γBpB + γBBpB² + γApA + γAApA² + γBApBpA)t + (δ0 + δBpB + δBBpB² + δApA + δAApA² + δBApBpA)t².
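
Equation 8 can be evaluated directly from the historical tercile series before fitting the quadratic form in Equation 9; a sketch (with my own variable names) is:

import numpy as np

def conditional_std(t, x_B, x_N, x_A, mu_t, p_B, p_N, p_A):
    # x_B, x_N, x_A are (n_years x n_days) arrays of the temperature statistic in
    # below, near and above normal years; t is the day index and mu_t is mu(t)
    # from Equation 4 evaluated on that day.
    def term(x, p):
        return (p / x.shape[0]) * np.sum((x[:, t] - mu_t) ** 2)
    return np.sqrt(term(x_B, p_B) + term(x_N, p_N) + term(x_A, p_A))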

For the VAR(1) model of temperature residuals, Wilks (2002) found that variations in the parameter estimates as a function of the forecast were, like those of d, minor for the investigated sites in New York. For this reason, the VAR(1) model was fit separately for each month and site based on the entire historical record and these estimates were left unchanged with the forecast. Finally, Wilks (2002) found that the spatial correlations of temperature and precipitation also did not change significantly between climatic terciles, so these too were assumed independent of the forecast. Correlations in temperature were included in the VAR(1) model as described in Part III. Correlations in precipitation occurrences, ω, and amounts, ζ, between sites k and l were approximated for all site pairs each month as a function of the horizontal distance, ck,l, between them (see Equations 10 and 11, respectively, for which parameters θ1, θ2 and θ3 were estimated).

(10) \omega = \left(1 + \frac{c_{k,l}}{\theta_1}\right)^{\theta_2}

(11) \zeta = \exp\left(-\theta_3c_{k,l}\right)
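
The parameters θ1, θ2 and θ3 could be estimated for each month by least squares, for example with scipy.optimize.curve_fit; the distances and correlations below are invented for illustration:

import numpy as np
from scipy.optimize import curve_fit

def occurrence_corr(c, theta1, theta2):
    # Equation 10: inter-site correlation of precipitation occurrences vs. distance
    return (1.0 + c / theta1) ** theta2

def amount_corr(c, theta3):
    # Equation 11: inter-site correlation of precipitation amounts vs. distance
    return np.exp(-theta3 * c)

# Made-up inter-site distances (km) and empirical correlations
distances = np.array([5.0, 10.0, 20.0, 40.0, 80.0, 160.0])
omega_emp = np.array([0.90, 0.82, 0.70, 0.55, 0.38, 0.22])
zeta_emp = np.array([0.85, 0.75, 0.58, 0.35, 0.15, 0.05])

(theta1, theta2), _ = curve_fit(occurrence_corr, distances, omega_emp, p0=[50.0, -1.0])
(theta3,), _ = curve_fit(amount_corr, distances, zeta_emp, p0=[0.01])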

If one received a tercile ENSO forecast, the same approach could be used as in Wilks (2002), except the season of interest in each historical year would be classified as La Niña, Neutral or El Niño instead of below normal, near normal or above normal.

Non-parametric Weather Generators

The key idea of weighted sampling from Briggs and Wilks (1996) has also been applied in non-parametric weather generators to condition synthetic weather series on seasonal climate forecasts. For example, Apipattanavis et al. (2007) modify their semi-parametric k-nn generator to find and probabilistically select neighbors, not from the entire historical record, but from a bootstrapped sample of the historical record consistent with the forecast. Again, this can be applied using tercile forecasts of either the {Below Normal, Near Normal, Above Normal} type or {La Niña, Neutral, El Niño} type.

Clark et al. (2004a) develop a more innovative approach that combines ideas from the non-parametric Schaake Shuffle method used to spatially correlate short-term precipitation and temperature forecasts (Clark et al., 2004b) with a parametric approach to weighted resampling presented by Yates et al. (2003) for the k-nn generator of Rajagopalan and Lall (1999). The Schaake Shuffle, originally devised by Dr. J. Schaake of the National Weather Service Office of Hydrologic Development, is a method of reordering ensemble precipitation and temperature forecasts to better capture the spatial and cross correlation of these spatial fields (Clark et al., 2004b).

Traditionally, model output statistics (MOS) from a Numerical Weather Prediction (NWP) model, such as temperature, humidity and wind speed at different pressure levels, are used as predictors in a regression model to forecast daily temperature and precipitation at a number of sites. To generate an ensemble of predictions for each forecasted day, normal random variables with mean 0 and variance σε² are added back to the mean prediction, where σε² is the variance of the regression residuals. However, these regressions are generally developed independently for each variable at each site and therefore do not reproduce the spatial or temporal correlation of the variables across sites and time (Clark et al., 2004b). To better capture these correlations, the Schaake Shuffle, illustrated in Figure 2 from Clark et al. (2004a) for a 10-member ensemble, re-orders the ensemble members each day so as to preserve the Spearman rank correlation of the temperature and precipitation variables between sites.

[Figure 2 from Clark et al. (2004a): example Schaake Shuffle tables (A–D) for a 10-member ensemble]

The Schaake Shuffle proceeds as follows. For a particular day, the original ensemble members for each variable at each station are ranked from lowest to highest, as shown in Table A of Figure 2 above. Next, a set of historical observations of the same size is generated by randomly selecting days from the historical record within a window of 7 days before and after the forecast date (Table B). Third, the historical observations are sorted from lowest to highest for each variable at each site, as shown in Table C. Finally, the original ensemble members in Table A are re-shuffled to form the final, spatially correlated ensembles in Table D in the following way:

  1. The rank of the data in the first historical observation (shown with dark circles in Tables B and C) is determined at each site.
  2. At each site, the member of the original ensemble with the same rank as the first historical observation for that site becomes the first member of the final, correlated ensemble (see dark circles in Table A and location in Table D).
  3. Steps 1 and 2 are repeated for every historical observation/ensemble member.

As stated earlier, this process reproduces the Spearman rank correlation of the observations across sites (Clark et al., 2004b). In order to preserve the temporal correlation of each variable, instead of re-generating a random set of historical observations to use for shuffling the next day’s forecast, the observations from the day following those used for the previous time step are used. While the Schaake Shuffle does not guarantee reproduction of the spatial correlation in the observations, just in their ranks, the results presented in Clark et al. (2004b) indicate that the method reproduces both reasonably well, and significantly improves upon the un-shuffled forecasts.
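
A minimal sketch of the reordering step for a single day and a single variable is given below; this is my own implementation of the ranking logic described above, not code from Clark et al. (2004b):

import numpy as np

def schaake_shuffle(ensemble, observations):
    # ensemble and observations are (n_members x n_sites) arrays for one day and variable.
    # At each site, the final ensemble member takes the value whose rank matches the
    # rank of the corresponding historical observation at that site.
    shuffled = np.empty_like(ensemble)
    for s in range(ensemble.shape[1]):
        sorted_members = np.sort(ensemble[:, s])             # sorted Table A (cf. Table C)
        obs_ranks = observations[:, s].argsort().argsort()   # rank of each historical observation
        shuffled[:, s] = sorted_members[obs_ranks]           # Table D
    return shuffled

Because the same historical observation (row) sets the rank at every site, the cross-site rank correlation of the observations is imposed on the shuffled ensemble.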

In the weather generator presented by Clark et al. (2004a), the same approach is used to simulate weather sequences except the ensembles in Table A are not generated by MOS regressions but by independently sampling historical observations within +/- 7 days of the simulated day at each site. To condition this weather generator on seasonal climate forecasts, the unshuffled ensembles are formed by preferential selection of different years from the historical record following an approach inspired by Yates et al. (2003). The first step in this approach is to sort all N historical years in terms of their similarity to a climate index, such as current SSTs in the Niño 3.4 region. The most similar year is given rank i = 1 and the least similar i = N. Next, a standard uniform random variable u is drawn and the year of rank i is chosen as an ensemble member, where i = INT(u^λN/α) + 1. Here INT(·) is the integer operator, λ is a weighting parameter, and α a selection parameter. Values of λ greater (less) than 1 increase (decrease) the probability of selecting years ranked more similar to the climate index. Values of α greater than 1 restrict the number of sampled years such that α = 5, for example, results in only the most similar 1/5 of years being selected (Clark et al., 2004a).
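
A sketch of this selection rule (with my own variable names) is:

import numpy as np

def select_year(ranked_years, lam, alpha, rng):
    # ranked_years is ordered from most similar (rank 1) to least similar (rank N)
    # to the climate index; lam weights the draw toward similar years and
    # alpha restricts selection to the most similar N/alpha years.
    N = len(ranked_years)
    u = rng.uniform()
    i = int(u ** lam * N / alpha) + 1      # i = INT(u^lambda * N / alpha) + 1
    return ranked_years[i - 1]

rng = np.random.default_rng(42)
# With lam = 2 and alpha = 2, only the most similar half of the 50 years can be
# drawn, and the most similar years are drawn most often.
sample = [select_year(list(range(1, 51)), lam=2.0, alpha=2.0, rng=rng) for _ in range(10)]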

Yates et al. (2003) apply a simplified version of this method with only one parameter, λ, in a scenario discovery-type approach, investigating the effects of e.g. warmer-drier springs and cooler-wetter summers. Clark et al. (2004a) first take this approach by ranking the historical years according to their similarity to the current Niño 3.4 index and exploring the effects of different choices of λ and α on the skill of the generated weather sequences in forecasting total winter precipitation at Petrified Forest in Arizona, measuring skill by the ranked probability skill score (RPSS). Interestingly, they find that high values of both λ and α, where years more similar to the climate index at the beginning of the season are selected, result in negative forecast skill. This highlights the importance of not being overconfident by only sampling years closest to current or forecast conditions. They note that the values of λ and α should depend on the strength of the Niño 3.4 index, and therefore should be re-optimized for different values of the index in order to maximize the RPSS.

All of these approaches could prove informative for seasonal water resources planning, if the forecasts being used are reliable. In the case of tercile forecasts, this means that, on average, when a given climate state is forecast to occur with probability p, it does in fact occur with relative frequency p. Given that past diagnostic assessments of IRI and CPC forecasts have found biases and overconfidence in some locations (Wilks and Godfrey, 2002; Wilks, 2000), water managers should proceed with caution in using them for seasonal planning. At a minimum, one should perform an analysis of the forecast value for the system of concern (Anghileri et al., 2016) before changing system operations. Fortunately, these forecasts continue to improve over time and several studies have already found value in using them to inform seasonal operations (e.g. Kim and Palmer (1997), Block (2011), Block and Goddard (2012), Anghileri et al. (2016)), indicating promise in their use for water resources planning.

Works Cited

Anghileri, D., Voisin, N., Castelletti, A., Pianosi, F., Nijssen, B., & Lettenmaier, D. P. (2016). Value of long-term streamflow forecasts to reservoir operations for water supply in snow-dominated river catchments. Water Resources Research, 52(6), 4209-4225.

Apipattanavis, S., Podestá, G., Rajagopalan, B., & Katz, R. W. (2007). A semiparametric multivariate and multisite weather generator. Water Resources Research, 43(11).

Barnston, A. G., van den Dool, H. M., Rodenhuis, D. R., Ropelewski, C. R., Kousky, V. E., O’Lenic, E. A., et al. (1994). Long-lead seasonal forecasts: Where do we stand? Bulletin of the American Meteorological Society, 75(11), 2097-2114.

Block, P. (2011). Tailoring seasonal climate forecasts for hydropower operations. Hydrology and Earth System Sciences, 15, 1355-1368.

Block, P., & Goddard, L. (2012). Statistical and dynamical climate predictions to guide water resources in Ethiopia. Journal of Water Resources Planning and Management, 138(3), 287-298.

Briggs, W. M., & Wilks, D. S. (1996). Extension of the Climate Prediction Center long-lead temperature and precipitation outlooks to general weather statistics. Journal of climate, 9(12), 3496-3504.

Chiew, F. H. S., Zhou, S. L., & McMahon, T. A. Use of seasonal streamflow forecasts in water resources management. Journal of Hydrology, 270(1), 135-144.

Clark, M. P., Gangopadhyay, S., Brandon, D., Werner, K., Hay, L., Rajagopalan, B., & Yates, D. (2004a). A resampling procedure for generating conditioned daily weather sequences. Water Resources Research, 40(4).

Clark, M., Gangopadhyay, S., Hay, L., Rajagopalan, B., & Wilby, R. (2004b). The Schaake shuffle: A method for reconstructing space-time variability in forecasted precipitation and temperature fields. Journal of Hydrometeorology, 5(1), 243-262.

Climate Prediction Center (2016). ENSO: Recent Evolution, Current Status and Predictions. National Oceanic and Atmospheric Administration, pp. 19-20.

Herman, J. D., Zeff, H. B., Lamontagne, J. R., Reed, P. M., & Characklis, G. W. (2016). Synthetic drought scenario generation to support bottom-up water supply vulnerability assessments. Journal of Water Resources Planning and Management, 04016050.

Katz, R. W. (1983). Statistical procedures for making inferences about precipitation changes simulated by an atmospheric general circulation model. Journal of the Atmospheric Sciences, 40(9), 2193-2201.

Kim, Y., & Palmer, R. (1997). Value of seasonal flow forecasts in Bayesian stochastic programming. Journal of Water Resources Planning and Management, 123(6), 327-335.

Kirsch, B. R., Characklis, G. W., & Zeff, H. B. (2013). Evaluating the impact of alternative hydro-climate scenarios on transfer agreements: Practical improvement for generating synthetic streamflows. Journal of Water Resources Planning and Management, 139(4), 396-406.

Rajagopalan, B., & Lall, U. (1999). A k-nearest-neighbor simulator for daily precipitation and other weather variables. Water Resources Research, 35(10), 3089-3101.

Richardson, C. W. (1981). Stochastic simulation of daily precipitation, temperature and solar radiation. Water Resources Research, 17, 182-190.

Stedinger, J. R., & Kim, Y. O. (2010). Probabilities for ensemble forecasts reflecting climate information. Journal of hydrology, 391(1), 9-23.

Wilks, D. S. (2000). Diagnostic verification of the Climate Prediction Center long-lead outlooks, 1995-1998. Journal of Climate, 13, 2389-2403.

Wilks, D. S. (2002). Realizations of daily weather in forecast seasonal climate. Journal of Hydrometeorology, 3(2), 195-207.

Wilks, D. S. & Godfrey, C. M. (2002). Diagnostic verification of the IRI net assessment forecasts, 1997-2000. Journal of Climate, 15(11), 1369-1377.

Yates, D., Gangopadhyay, S., Rajagopalan, B., & Strzepek, K. (2003). A technique for generating regional climate scenarios using a nearest-neighbor algorithm. Water Resources Research, 39(7).

Visualizing multidimensional data: a brief historical overview

The results of an MOEA search are presented as a set of multidimensional data points. In order to form useful conclusions from our results, we must be able to comprehend the multidimensional differences between results and effectively analyze and communicate them to decision makers.

Navigating through multiple dimensions is an inherently complex task for the human mind. We perceive the world in three dimensions, and thinking in higher dimensional space can be heavily taxing.  The difficulty of comprehending multidimensional data is compounded when one must display the data on a two dimensional surface such as a sheet of paper or a computer screen. The challenge of “flattening” data has persisted for centuries, and has plagued not only those who were concerned with gleaning scientific insights from data, but also artists and those seeking to accurately portray the physical world as perceived by the human eye.

For much of human history, even the greatest artists were unable to accurately express the three dimensional world in a two dimensional plane. Nicolo da Bologna’s 14th century work, The Marriage, fails to convey any sense of three dimensional space, giving the viewer the impression that the figures painted have been pressed against a pane of glass.

Nicolo da Bologna’s The Marriage (1350s) is unable to convey any sense of depth to the viewer.

During the Italian Renaissance, artists rediscovered the mathematics of perspective, allowing them to break free of the constraints of their two dimensional canvas and convey realistic images that gave the illusion of a third dimension. Raphael’s The School of Athens masterfully uses perspective to imbue his painting with a sense of depth. Through clever exploitation of Euclidean geometry and the mechanics of the human eye, Raphael is able to use the same medium (paint on a two dimensional surface) to convey a much richer representation of his subjects than his Bolognese predecessor.

Raphael’s The School of Athens (1509-1511) is an example of a masterful use of perspective. The painting vividly depicts a three dimensional space.

In the twentieth century, artists began attempting to convey more than three dimensions in two dimensional paintings. Cubists such as Picasso attempted to portray multiple viewpoints of the same image simultaneously, and futurists such as Umberto Boccioni attempted to depict motion and “Dynamism” in their paintings to convey time as a fourth dimension.

Pablo Picasso’s Portrait of Dora Maar (1938), depicts a woman’s face from multiple viewpoints simultaneously

Umberto Boccioni’s Dynamism of a Cyclist (1913) attempts to portray a fourth dimension, time, through a sense of motion and energy in the painting. Can you tell this is supposed to be a cyclist, or did I go too far out there for a water programming blog?

Regardless of your views on the validity of modern art, as engineers and scientists we have to admit that in this area we share similar goals and challenges with these artists: to effectively convey multidimensional data in a two dimensional space. Unlike artists, whose objectives are to convey emotions, beauty or abstract ideas through their work, we in the STEM fields seek to gain specific insights from multidimensional data that will guide our actions or investigations.

A notable historical example of the effective use of clever visualization was English physician John Snow’s map of the London Cholera epidemic of 1854. Snow combined cholera mortality data with patients’ home addresses to map the locations of cholera deaths within the city.

John Snow’s map of the 1854 London Cholera Epidemic. Each black bar is proportional to the number of cholera deaths at a given residence. Pumps are depicted using black circles. One can clearly see that the cholera deaths are clustered around the pump on Broad Street (which I’ve circled in red).

The results of his analysis led Snow to conclude that a contaminated well was the likely source of the outbreak, a pioneering feat in the field of public health. Snow’s effective visualization not only provided him with insights into the nature of the problem, but also allowed him to effectively communicate his results to a general public that had previously been resistant to the idea of waterborne disease.

In his insightful book Visual Explanations: Images and Quantities, Evidence and Narrative, Edward Tufte points to three strengths in John Snow’s use of data visualization in his analysis of the epidemic. First, Snow provided the appropriate context for his data. Rather than simply plotting a time series of cholera deaths, Snow placed those deaths within a new context, geographic location, which allowed him to make the connection to the contaminated pump. Second, Snow made quantitative comparisons within his data. As Tufte points out, a fundamental question when dealing with statistical analysis is “Compared with what?” It’s not sufficient to simply provide data about those who were struck with the disease; one must also explain why certain populations were not affected. By complementing his data collection with extensive interviews of the local population, Snow was able to show that there were indeed people who escaped disease within the area of interest, but these people all got their water from other sources, which strengthened his argument that the pump was the source of the epidemic. Finally, Tufte insists that one must always consider alternative explanations to the one that seems apparent from the data visualization before drawing final conclusions. It is easy to make a slick but misleading visualization, and in order to maintain credibility as an analyst, one must always keep an open mind to alternative explanations. Snow took the utmost care in crafting and verifying his conclusion, and as a result his work stands as a shining example of the use of visualization to explore multidimensional data.

While Snow’s methodology is impressive, and Tufte’s observations about his work helpful, we cannot directly apply his methodology to future evaluations of multidimensional data, because his map is only useful for evaluating data from the epidemic of 1854. There is a need for general tools that can be applied to multidimensional data to provide insights through visualization. Enter the field of visual analytics. As defined by Daniel Keim, “Visual analytics combines automated analysis techniques with interactive visualization for an effective understanding, reasoning and decision making on the basis of very large and complex data sets”. The field of visual analytics combines the disciplines of data analysis, data management, geospatial and temporal processes, spatial decision support, human-computer interaction and statistics. The goal of the field is to create flexible tools for visual analysis and data mining. Noted visualization expert Alfred Inselberg proposed six criteria that successful visualization tools should satisfy:

  1. Low representational complexity.
  2. Works for any number of dimensions.
  3. Every variable is treated uniformly.
  4. The displayed object can be recognized under progressive transformations (i.e. rotation, translation, scaling, perspective).
  5. The display easily/intuitively conveys information on the properties of the N-dimensional object it represents.
  6. The methodology is based on rigorous mathematical and algorithmic results.

Using the above criteria, Inselberg developed the Parallel Coordinate plot. Parallel Coordinate plots transform multidimensional relationships into two dimensional patterns which are well suited for visual data mining.

An example of a five dimensional data set plotted on a Parallel Coordinate plot. Each line represents a data point, while each axis represents the point’s value in each dimension.
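
For readers who want to experiment, pandas ships a simple parallel coordinates function; the snippet below plots a random five-objective data set standing in for an approximation set:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Random five-objective "solutions" standing in for MOEA output
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.random((30, 5)), columns=[f"Objective {i+1}" for i in range(5)])
df["set"] = "Approximation set"   # parallel_coordinates requires a class column

parallel_coordinates(df, "set", color=["steelblue"], alpha=0.6)
plt.ylabel("Objective value")
plt.tight_layout()
plt.show()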

As water resources analysts dealing with multiobjective problems, it is critical that we have the ability to comprehend and communicate the complexities of multidimensional data. By learning from historical data visualization examples and making use of cutting-edge visual analytics, we can make this task much more manageable. Parallel coordinate plots are just one example of the many visualization tools that have been created in recent decades by the ever-evolving field of visual analytics. As computing power continues its rapid advancement, it is important that we as analysts continue to ask ourselves whether we can improve our ability to visualize and gain insights from complex multidimensional data sets.