An Introduction To Econometrics: Part 1 – Ordinary Least Squares Regression

I took a PhD-level econometrics course this semester in the Applied Economics and Management department here at Cornell and I thought I’d share some of what I learned. Overall, I enjoyed the course and learned a great deal. It was very math and theory heavy, but Professor Shanjun Li did a nice job keeping the class lively and interesting. I would recommend the class to future EWRS students who may be looking for some grounding in econometrics, provided they’ve taken some basic statistics and linear algebra courses.

So let’s start with the basics: what does the term “econometrics” even mean? Hansen (2010) defined econometrics as “the unified study of economic models, mathematical statistics and economic data”. After taking this introductory course, I’m inclined to add my own definition: econometrics is “a study of the problems with regression using Ordinary Least Squares (OLS) and how to solve them”. This is obviously a gross oversimplification of the field; however, regression through OLS was the primary tool used for finding insights and patterns within data, and we spent the vast majority of the course examining it. In this post I’ll briefly summarize OLS mechanics and the classical OLS assumptions. In my next post, I’ll detail methods for dealing with violations of OLS assumptions. My hope is that reading this may help you understand some key terminology and the reasoning behind why certain econometric tools are employed.

OLS mechanics

Our primary interest when creating an econometric model is to estimate some dependent variable, y, using observations of a set of independent variables, X. Usually y is a vector of length n, where n is the number of observations, and X is a matrix of size (n x k), where k is the number of explanatory variables (you can think of X as a table of observations, where each column contains a different variable and each row represents one observation of those variables). The goal of OLS regression is to estimate the coefficients, beta, for the model:

y = X\beta+\epsilon

Where beta is a k by 1 vector of coefficients on X and epsilon is an n by 1 vector of error terms.

OLS regression estimates beta by minimizing the sum of squared errors (hence the name “least squares”). Put in matrix notation, OLS estimates beta using the equation:

\hat{\beta} = \underset{\beta}{\operatorname{argmin}} \; SSE_N(\beta), \quad \text{where } SSE_N(\beta) = \epsilon'\epsilon

The optimal beta estimate can be found through the following equations:

\epsilon = y-X\hat{\beta}

\epsilon ' \epsilon =  (y-X\hat{\beta})'(y-X\hat{\beta})

Taking the derivative and setting it equal to zero:

-2X'y+2X'X\hat{\beta} = 0

Then solving for the beta estimate:

\hat{\beta} = (X'X)^{-1}X'y
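Putting these mechanics into code, here is a minimal NumPy sketch with made-up data (the simulated X, y and beta_true are purely illustrative):

```python
import numpy as np

# Made-up data: n observations, k explanatory variables (including an intercept column)
rng = np.random.default_rng(42)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# OLS estimate: beta_hat = (X'X)^{-1} X'y, computed by solving the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq(X, y, rcond=None)[0] returns the same estimate more stably
```

In practice, library routines such as numpy.linalg.lstsq are preferable to explicitly inverting X'X, but they compute the same estimator.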

 

Estimation of y using OLS regression can be visualized as the orthogonal projection of the vector y onto the column space of X. The estimated residual vector, \hat{\epsilon}, is the orthogonal distance between the projection X\hat{\beta} and the true vector y. Figure 1 shows this projection for a y that is regressed on two explanatory variables, X1 and X2.


Figure 1: OLS regression as an orthogonal projection of vector y onto the column space of matrix X. The error term, \hat{\epsilon}, is the orthogonal distance between y and X\hat{\beta}. (Image source: Wikipedia commons)

 Assumptions and properties of OLS regression

The Gauss-Markov Theorem states that under a certain set of assumptions, the OLS estimator is the Best Linear Unbiased Estimator (BLUE) for vector y.

To understand the full meaning of the Gauss-Markov theorem, it’s important to define two fundamental properties used to describe estimators: consistency and efficiency. An estimator is consistent if its value converges to the true parameter value as the number of observations goes to infinity. An estimator is efficient if its asymptotic variance is no larger than the asymptotic variance of any other consistent estimator of the parameter. In light of these definitions, the Gauss-Markov Theorem can be restated as: the OLS estimator will be the most efficient consistent estimator of beta as long as the classical OLS assumptions hold. The remainder of this post describes the assumptions necessary for the OLS estimator to be the BLUE; fixes for when these assumptions are violated are covered in the next post.

The four classical assumptions for OLS to be the BLUE are:

  1. Linearity: The relationship between X and y is linear, following the functional form:

y = X\beta+\epsilon.

2. Strict exogeneity: The error terms, \epsilon, should be independent of the values of the explanatory variables, X. Put in equation form, this assumption requires:

E(\epsilon_i|X) = 0

E(\epsilon_i) =0

3. No perfect multicollinearity: the columns of X should not be perfectly correlated (i.e., linearly dependent) with one another (see my earlier post on dealing with multicollinearity for fixes for violations of this assumption).

4. Spherical errors: Error terms should be homoskedastic, meaning they have constant variance regardless of the values of X, and should be uncorrelated with one another. Put in equation form:

E(\epsilon_i^2|X) =\sigma^2

Where \sigma^2 is a constant value.

E(\epsilon_i \epsilon_j|X)=0 \quad \text{for } i \neq j

Using assumption 4, we can define the variance of \hat{\beta} as:

var(\hat{\beta}_{OLS}) = \sigma^2(X'X)^{-1}
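Continuing the NumPy sketch from above (same hypothetical X, y, beta_hat, n and k), the homoskedastic variance of the OLS estimator could be estimated as:

```python
# Residuals and the unbiased estimate of sigma^2 (degrees of freedom n - k)
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - k)

# var(beta_hat) = sigma^2 (X'X)^{-1} under assumptions 1-4; standard errors on the diagonal
var_beta_hat = sigma2_hat * np.linalg.inv(X.T @ X)
std_errors = np.sqrt(np.diag(var_beta_hat))
```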

If assumptions 1-4 hold, then the OLS estimate of beta is the BLUE. If, however, any of the assumptions are violated, we must employ other methods for estimating our regression coefficients.

In my next post I’ll detail the methods econometricians use when these assumptions are violated.

 References:

Hansen, Bruce (2010). Econometrics. University of Wisconsin. http://www.ssc.wisc.edu/~bhansen/econometrics/Econometrics2010.pdf

Calculating Risk-of-Failures as in the Research Triangle papers (2014-2016) – Part 1

There has been a series of papers (e.g., Palmer and Characklis, 2009; Zeff et al., 2014; Herman et al., 2014) suggesting the use of an approximate risk-of-failure (ROF) metric, as opposed to the more conventional days of supply remaining, for utilities’ managers to decide when to enact not only water use restrictions, but also water transfers between utilities. This approach was expanded to decisions about when, and in which new infrastructure project, a utility should invest (Zeff et al., 2016), as opposed to setting fixed times in the future for either construction or options evaluation. What all these papers have in common is that drought mitigation and infrastructure expansion decisions are triggered when the short- and long-term ROFs, respectively, for a given utility exceed pre-set triggers.

For example, the figure below shows that as streamflows (black line, subplot “a”) get lower while demands are maintained (subplot “b”), the combined storage levels of the fictitious utility start to drop around the month of April (subplot “c”), increasing the utility’s short-term ROF (subplot “d”) until it finally triggers transfers and restrictions (subplot “e”). Despite the triggered restrictions and transfers, the utility’s combined storage levels cross the dashed line in subplot “c”, which denotes the failure criterion (i.e., combined storage dropping below 20% of total capacity).

(Figure: time series of streamflows (a), demands (b), combined storage (c), short-term ROF (d), and triggered restrictions and transfers (e) for a fictitious utility.)

It is beyond the scope of this post to go into the details presented in all of these papers, but even after reading them readers may still wonder how exactly ROFs are calculated. In this post, I’ll try to show in a graphical and concise manner how short-term ROFs are calculated.

In order to calculate a utility’s ROF for week m, we would run 50 independent simulations (henceforth called ROF simulations), all departing from the system conditions (reservoir storage levels, demand probability density function, etc.) observed in week m, and each driven by one of the 50 years of streamflow time series recorded immediately prior to week m. The utility’s ROF is then calculated as the number of ROF simulations in which the combined storage level of that utility dropped below 20% of total capacity in at least one week, divided by the number of ROF simulations run (50). An animation of the process can be seen below.

(Animation: step-by-step illustration of the short-term ROF calculation.)

For example, for a water utility that started using ROF triggers on 01/01/2017, this week’s short-term ROF (02/13/2017, or week m=7) would be calculated using the recorded streamflows from weeks 6 through -47 (assuming here a year of 52 weeks, for simplicity) for ROF simulation 1, the streamflows from weeks -48 to -99 for ROF simulation 2, and so on until we reach 50 simulations. However, if the utility is running an optimization or scenario evaluation and wants to calculate the ROF in week 16 (04/10/2017) of a system simulation, ROF simulation 1 would use 10 weeks of synthetically generated streamflows (weeks 16 to 7) and 42 weeks of historical records (weeks 6 to -45), simulation 2 would use records for weeks -46 to -97, and so on, as in a 50-year moving window.
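As a rough sketch of the calculation described above (the simulation routine, state variables and data structures here are hypothetical placeholders, not the actual code from the cited papers), a short-term ROF routine might look like:

```python
import numpy as np

def short_term_rof(simulate_weeks, current_state, inflow_records, fail_fraction=0.20):
    """Fraction of ROF simulations in which combined storage drops below 20% of
    capacity in at least one week of the coming year.

    simulate_weeks(state, inflows) -> array of weekly combined storage, as a fraction
    of total capacity, over the next 52 weeks (placeholder interface).
    inflow_records -> list of 50 historical (or partly synthetic) 52-week inflow series.
    """
    failures = sum(
        1 for inflows in inflow_records
        if np.min(simulate_weeks(current_state, inflows)) < fail_fraction
    )
    return failures / len(inflow_records)
```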

In another blog post, I will show how to calculate the long-term ROF and the reasoning behind it.

Works cited

Herman, J. D., H. B. Zeff, P. M. Reed, and G. W. Characklis (2014), Beyond optimality: Multistakeholder robustness tradeoffs for regional water portfolio planning under deep uncertainty, Water Resour. Res., 50, 7692–7713, doi:10.1002/2014WR015338.

Palmer, R., and G. W. Characklis (2009), Reducing the costs of meeting regional water demand through risk-based transfer agreements, J. Environ. Manage., 90(5), 1703–1714.

Zeff, H. B., J. R. Kasprzyk, J. D. Herman, P. M. Reed, and G. W. Characklis (2014), Navigating financial and supply reliability tradeoffs in regional drought management portfolios, Water Resour. Res., 50, 4906–4923, doi:10.1002/2013WR015126.

Zeff, H. B., J. D. Herman, P. M. Reed, and G. W. Characklis (2016), Cooperative drought adaptation: Integrating infrastructure development, conservation, and water transfers into adaptive policy pathways, Water Resour. Res., 52, 7327–7346, doi:10.1002/2016WR018771.

 

Synthetic streamflow generation

A recent research focus of our group has been the development and use of synthetic streamflow generators.  There are many tools one might use to generate synthetic streamflows, and it may not be obvious which is right for a specific application or what the inherent limitations of each method are.  More fundamentally, it may not be obvious why it is desirable to generate synthetic streamflows in the first place.  This will be the first in a series of blog posts on synthetic streamflow generators in which I hope to sketch out the various categories of generation methods and their appropriate uses as I see them.  In this first post we’ll focus on the motivation and history behind the development of synthetic streamflow generators and broadly categorize them.

Why should we use synthetic hydrology?

The most obvious reason to use synthetic hydrology is if there is little or no data for your system (see Lamontagne, 2015).  Another obvious reason is if you are trying to evaluate the effect of hydrologic non-stationarity on your system (Herman et al. 2015; Borgomeo et al. 2015).  In that case you could use synthetic methods to generate flows reflecting a shift in hydrologic regime.  But are there other reasons to use synthetic hydrology?

In water resources systems analysis it is common practice to evaluate the efficacy of management or planning strategies by simulating system performance over the historical record, or over some critical period.  In this approach, new strategies are evaluated by asking the question: how well would we have done with the new strategy?

This may be an appealing approach, especially if some event was particularly traumatic to your system. But is this a robust way of evaluating alternative strategies?  It’s important to remember that any hydrologic record, no matter how long, is only a single realization of a stochastic process.  Importantly, drought and flood events emerge as the result of specific sequences of events that are unlikely to be repeated.  Furthermore, there is a 50% chance that the worst flood or drought in an N-year record will be exceeded in the next N years.  Is it well advised to tailor our strategies to past circumstances that will likely never be repeated and will as likely as not be exceeded?  As Lettenmaier et al. [1987] remind us, “Little is certain about the future except that it will be unlike the past.”

Even under stationarity, and even with long hydrologic records, the use of synthetic streamflow can improve the efficacy of planning and management strategies by exposing them to larger and more diverse flood and drought events than those in the record (Loucks et al. 1981; Vogel and Stedinger, 1988; Loucks et al. 2005).  Figure 7.12 from Loucks et al. 2005 shows a typical experimental set-up using synthetic hydrology with a simulation model.  Often our group will wrap an optimization model like Borg around this set-up, where the system design/operating policy (bottom of the figure) are the decision variables, and the system performance measures (right of the figure) are the objective(s).

(Figure 7.12 from Loucks et al. 2005: a typical experimental set-up using synthetic hydrology with a simulation model.)

 

What are the types of generators?

Many synthetic streamflow generation techniques have been proposed since the early 1960s.  It can be difficult for a researcher or practitioner to know which method is best suited to the problem at hand.  Thus, we’ll start with a very broad characterization of what is out there, then proceed to some history.

Broadly speaking there are two approaches to generating synthetic hydrology: indirect and direct.  The indirect approach generates streamflow by synthetically generating the forcings to a hydrologic model.  For instance one might generate precipitation and temperature series and input them to a hydrologic model of a basin (e.g. Steinschneider et al. 2014).  In contrast, direct methods use statistical techniques to generate streamflow timeseries directly.

The direct approach is generally easier to apply and more parsimonious because it does not require the selection, calibration, and validation of a separate hydrologic model (Najafi et al. 2011).  On the other hand, the indirect approach may be desirable in some cases.  Climate projections from GCMs often describe temperature or precipitation changes, but may not describe hydrologic shifts at a resolution or precision that is useful.  In other cases, profound regime shifts may be difficult to represent with statistical models and may require process-driven models, thus necessitating the indirect approach.

Julie’s earlier series focused on indirect approaches, so we’ll focus on the direct approach.  Regardless of the approach, many of the methods are the same.  In general, generator methods can be divided into two categories: parametric and non-parametric.  Parametric methods rely on a hypothesized statistical model of streamflow whose parameters are selected to achieve a desired result (Stedinger and Taylor, 1982a).  In contrast, non-parametric methods do not make strong structural assumptions about the processes generating the streamflow, but rather rely on re-sampling from the hydrologic record in some way (Lall, 1995).  Some methods combine parametric and non-parametric techniques, which we’ll refer to as semi-parametric (Herman et al. 2015).

Both parametric and non-parametric methods have advantages and disadvantages.  Parametric methods are often parsimonious, and often have analytical forms that allow easy parameter manipulation to reflect non-stationarity.  However, there can be concern that the underlying statistical models may not reflect the hydrologic reality well (Sharma et al. 1997).  Furthermore, in multi-dimensional, multi-scale problems the proliferation of parameters can make parametric models intractable (Grygier and Stedinger, 1988).  Extensive work has been done to confront both challenges, but they may lead a researcher to adopt a non-parametric method instead.

Because many non-parametric methods ‘re-sample’ flows from a record, realism is not generally a concern, and most re-sampling schemes are computationally straightforward (relatively speaking).  On the other hand, manipulating synthetic flows to reflect non-stationarity may not be as straightforward as a simple parameter change, though methods have been suggested (Herman et al. 2015; Borgomeo et al. 2015).  More fundamentally, because non-parametric methods rely so heavily on the data, they require sufficiently long records to ensure there is enough hydrologic variability to sample.  Short records can be a concern for parametric methods as well, though parametric uncertainty can be explicitly considered in those methods (Stedinger and Taylor, 1982b).  Of course, parametric methods also have structural uncertainty that non-parametric methods largely avoid by not assuming an explicit statistical model.
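To make the distinction concrete, here is a minimal, purely illustrative sketch (made-up annual flows, no seasonality or multi-site structure) of a lag-1 autoregressive generator fit to log flows (parametric) alongside a simple block bootstrap of the record (non-parametric):

```python
import numpy as np

rng = np.random.default_rng(1)
q_obs = rng.lognormal(mean=3.0, sigma=0.4, size=60)   # made-up 60-year annual flow record

# Parametric: lag-1 autoregressive (AR(1)) model of log flows
z = np.log(q_obs)
mu, sigma = z.mean(), z.std(ddof=1)
rho = np.corrcoef(z[:-1], z[1:])[0, 1]                # lag-1 autocorrelation
n_syn = 100
z_syn = np.empty(n_syn)
z_syn[0] = mu
for t in range(1, n_syn):
    z_syn[t] = mu + rho * (z_syn[t - 1] - mu) + rng.normal(0.0, sigma * np.sqrt(1 - rho**2))
q_parametric = np.exp(z_syn)

# Non-parametric: resample 5-year blocks of the observed record with replacement
block = 5
starts = rng.integers(0, len(q_obs) - block + 1, size=n_syn // block)
q_bootstrap = np.concatenate([q_obs[s:s + block] for s in starts])
```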

In the coming posts we’ll dig into the nuances of the different methods in greater detail.

A historical perspective

The first use of synthetic flow generation seems to have been by Hazen [1914].  That work attempted to quantify the reliability of a water supply by aggregating the streamflow records of local streams into a 300-year ‘synthetic record.’  Of course the problem with this is that the cross-correlation between concurrent flows rendered the effective record length much less than the nominal 300 years.

Next Barnes [1954] generated 1,000 years of streamflow for a basin in Australia by drawing random flows from a normal distribution with mean and variance equal to the sample estimates from the observed record.  That work was extended by researchers from the Harvard Water Program to account for autocorrelation of monthly flows (Maass et al., 1962; Thomas and Fiering, 1962).  Later work also considered the use of non-normal distributions (Fiering, 1967), and the generation of correlated concurrent flows at multiple sites (Beard, 1965; Matalas, 1967).

Those early methods relied on first-order autoregressive models that regressed flows in the current period on the flows of the previous period (see Loucks et al.’s Figure 7.13  below).  Box and Jenkins [1970] extended those methods to autoregressive models of arbitrary order, moving average models of arbitrary order, and autoregressive-moving average models of arbitrary order.  Those models were the focus of extensive research over the course of the 1970s and 1980s and underpin many of the parametric generators that are widely used in hydrology today (see Salas et al. 1980; Grygier and Stedinger, 1990; Salas, 1993; Loucks et al. 2005).

(Figure 7.13 from Loucks et al. 2005: schematic of a first-order autoregressive streamflow model.)

By the mid-1990s, non-parametric methods began to gain popularity (Lall, 1995).  While much of this work has its roots in earlier work from the 1970s and 1980s (Yakowitz, 1973, 1979, 1985; Schuster and Yakowitz, 1979; Yakowitz and Karlsson, 1987; Karlsson and Yakowitz, 1987), improvements in computing and the availability of large data sets meant that by the mid-1990s non-parametric methods were feasible (Lall and Sharma, 1996).  Early examples of non-parametric methods include block bootstrapping (Vogel and Shallcross, 1996), k-nearest neighbor (Lall and Sharma, 1996), and kernel density methods (Sharma et al. 1997).  Since that time, extensive research has made improvements to these methods, often by incorporating parametric elements.  For instance, Srinivas and Srinivasan (2001, 2005, and 2006) develop a hybrid autoregressive-block bootstrapping method designed to reduce the bias in lagged correlation and to generate flows not present in the historical record, for multiple sites and multiple seasons.  K-nearest neighbor methods have also been the focus of extensive research (Rajagopalan and Lall, 1999; Harrold et al. 2003; Yates et al. 2003; Sharif and Burn, 2007; Mehrotra and Sharma, 2006; Prairie et al. 2006; Lee et al. 2010; Salas and Lee, 2010; Nowak et al., 2010), including recent work by our group (Giuliani et al. 2014).

Emerging work focuses on stochastic streamflow generation using copulas [Lee and Salas, 2011; Fan et al. 2016], entropy theory bootstrapping [Srivastav and Simonovic, 2014], and wavelets [Kwon et al. 2007; Erkyihun et al., 2016], among other methods.

In the following posts I’ll address different challenges in stochastic generation [e.g. long-term persistence, parametric uncertainty, multi-site generation, seasonality, etc.] and the relative strengths and shortcomings of the various methods for addressing them.

Works Cited

Barnes, F. B., Storage required for a city water supply, J. Inst. Eng. Australia 26(9), 198-203, 1954.

Beard, L. R., Use of interrelated records to simulate streamflow, J. Hydrol. Div., ASCE 91(HY5), 13-22, 1965.

Borgomeo, E., Farmer, C. L., and Hall, J. W. (2015). “Numerical rivers: A synthetic streamflow generator for water resources vulnerability assessments.” Water Resour. Res., 51(7), 5382–5405.

Y.R. Fan, W.W. Huang, G.H. Huang, Y.P. Li, K. Huang, Z. Li, Hydrologic risk analysis in the Yangtze River basin through coupling Gaussian mixtures into copulas, Advances in Water Resources, Volume 88, February 2016, Pages 170-185.

Fiering, M.B, Streamflow Synthesis, Harvard University Press, Cambridge, Mass., 1967.

Giuliani, M., J. D. Herman, A. Castelletti, and P. Reed (2014), Many-objective reservoir policy identification and refinement to reduce policy inertia and myopia in water management, Water Resour. Res., 50, 3355–3377, doi:10.1002/2013WR014700.

Grygier, J.C., and J.R. Stedinger, Condensed Disaggregation Procedures and Conservation Corrections for Stochastic Hydrology, Water Resour. Res. 24(10), 1574-1584, 1988.

Grygier, J.C., and J.R. Stedinger, SPIGOT Technical Description, Version 2.6, 1990.

Harrold, T. I., Sharma, A., and Sheather, S. J. (2003). “A nonparametric model for stochastic generation of daily rainfall amounts.” Water Resour. Res., 39(12), 1343.

Hazen, A., Storage to be provided in impounding reservoirs for municipal water systems, Trans. Am. Soc. Civ. Eng. 77, 1539, 1914.

Herman, J.D., H.B. Zeff, J.R. Lamontagne, P.M. Reed, and G. Characklis (2016), Synthetic Drought Scenario Generation to Support Bottom-Up Water Supply Vulnerability Assessments, Journal of Water Resources Planning & Management, doi: 10.1061/(ASCE)WR.1943-5452.0000701.

Karlsson, M., and S. Yakowitz, Nearest-Neighbor methods for nonparametric rainfall-runoff forecasting, Water Resour. Res., 23, 1300-1308, 1987.

Kwon, H.-H., U. Lall, and A. F. Khalil (2007), Stochastic simulation model for nonstationary time series using an autoregressive wavelet decomposition: Applications to rainfall and temperature, Water Resour. Res., 43, W05407, doi:10.1029/2006WR005258.

Lall, U., Recent advances in nonparametric function estimation: Hydraulic applications, U.S. Natl. Rep. Int. Union Geod. Geophys. 1991- 1994, Rev. Geophys., 33, 1093, 1995.

Lall, U., and A. Sharma (1996), A nearest neighbor bootstrap for resampling hydrologic time series, Water Resour. Res. 32(3), pp. 679-693.

Lamontagne, J.R. (2015), Representation of Uncertainty and Corridor DP for Hydropower Optimization, PhD dissertation, Cornell University, Ithaca, NY.

Lee, T., J. D. Salas, and J. Prairie (2010), An enhanced nonparametric streamflow disaggregation model with genetic algorithm, Water Resour. Res., 46, W08545, doi:10.1029/2009WR007761.

Lee, T., and J. Salas (2011), Copula-based stochastic simulation of hydrological data applied to Nile River flows, Hydrol. Res., 42(4), 318–330.

Lettenmaier, D. P., K. M. Latham, R. N. Palmer, J. R. Lund and S. J. Burges, Strategies for coping with drought Part II: Planning techniques for planning and reliability assessment, EPRI P-5201, Final Report Project 2194-1, June 1987.

Loucks, D.P., Stedinger, J.R. & Haith, D.A. 1981, Water Resources Systems Planning and Analysis, 1st edn, Prentice-Hall, Englewood Cliffs, N.J.

Loucks, D.P. et al. 2005, Water Resources Systems Planning and Management: An Introduction to Methods, Models and Applications, UNESCO, Delft, The Netherlands.

Maass, A., M. M. Hufschmidt, R. Dorfman, H. A. Thomas, Jr., S. A. Marglin and G. M. Fair, Design of Water Resource Systems, Harvard University Press, Cambridge, Mass., 1962.

Matalas, N. C., Mathematical assessment of synthetic hydrology, Water Resour. Res. 3(4), 937-945, 1967.

Najafi, M. R., Moradkhani, H., and Jung, I. W. (2011). “Assessing the uncertainties of hydrologic model selection in climate change impact studies.” Hydrol. Process., 25(18), 2814–2826.

Nowak, K., J. Prairie, B. Rajagopalan, and U. Lall (2010), A nonparametric stochastic approach for multisite disaggregation of annual to daily streamflow, Water Resour. Res., 46, W08529, doi:10.1029/2009WR008530.

Prairie, J. R., Rajagopalan, B., Fulp, T. J., and Zagona, E. A. (2006). “Modified K-NN model for stochastic streamflow simulation.” J. Hydrol. Eng., 11(4), 371–378.

Rajagopalan, B., and Lall, U. (1999). “A k-nearest-neighbor simulator for daily precipitation and other weather variables.” Water Resour. Res., 35(10), 3089–3101.

Salas, J. D., J. W. Deller, V. Yevjevich and W. L. Lane, Applied Modeling of Hydrologic Time Series, Water Resources Publications, Littleton, Colo., 1980.

Salas, J.D., 1993, Analysis and Modeling of Hydrologic Time Series, Chapter 19 (72 p.) in The McGraw Hill Handbook of Hydrology, D.R. Maidment, Editor.

Salas, J.D., T. Lee. (2010). Nonparametric Simulation of Single-Site Seasonal Streamflow, J. Hydrol. Eng., 15(4), 284-296.

Schuster, E., and S. Yakowitz, Contributions to the theory of nonparametric regression, with application to system identification, Ann. Stat., 7, 139-149, 1979.

Sharif, M., and Burn, D. H. (2007). “Improved K-nearest neighbor weather generating model.” J. Hydrol. Eng., 12(1), 42–51.

Sharma, A., Tarboton, D. G., and Lall, U., 1997. “Streamflow simulation: A nonparametric approach.” Water Resour. Res., 33(2), 291–308.

Srinivas, V. V., and Srinivasan, K. (2001). “A hybrid stochastic model for multiseason streamflow simulation.” Water Resour. Res., 37(10), 2537–2549.

Srinivas, V. V., and Srinivasan, K. (2005). “Hybrid moving block bootstrap for stochastic simulation of multi-site multi-season streamflows.” J. Hydrol., 302(1–4), 307–330.

Srinivas, V. V., and Srinivasan, K. (2006). “Hybrid matched-block bootstrap for stochastic simulation of multiseason streamflows.” J. Hydrol., 329(1–2), 1–15.

Srivastav, R. K., and S. P. Simonovic, An analytical procedure for multi-site, multi-season streamflow generation using maximum entropy bootstrapping, Environmental Modelling & Software, Volume 59, September 2014, Pages 59-75.

Stedinger, J. R. and M. R. Taylor, Synthetic streamflow generation, Part 1. Model verification and validation, Water Resour. Res. 18(4), 909-918, 1982a.

Stedinger, J. R. and M. R. Taylor, Synthetic streamflow generation, Part 2. Parameter uncertainty, Water Resour. Res. 18(4), 919-924, 1982b.

Steinschneider, S., Wi, S., and Brown, C. (2014). “The integrated effects of climate and hydrologic uncertainty on future flood risk assessments.” Hydrol. Process., 29(12), 2823–2839.

Thomas, H. A. and M. B. Fiering, Mathematical synthesis of streamflow sequences for the analysis of river basins by simulation, in Design of Water Resource Systems, by A. Maass, M. Hufschmidt, R. Dorfman, H. A. Thomas, Jr., S. A. Marglin and G. M. Fair, Harvard University Press, Cambridge, Mass., 1962.

Vogel, R.M., and J.R. Stedinger, The value of stochastic streamflow models in over-year reservoir design applications, Water Resour. Res. 24(9), 1483-90, 1988.

Vogel, R. M., and A. L. Shallcross (1996), The moving block bootstrap versus parametric time series models, Water Resour. Res., 32(6), 1875–1882.

Yakowitz, S., A stochastic model for daily river flows in an arid region, Water Resour. Res., 9, 1271-1285, 1973.

Yakowitz, S., Nonparametric estimation of markov transition functions, Ann. Stat., 7, 671-679, 1979.

Yakowitz, S. J., Nonparametric density estimation, prediction, and regression for markov sequences J. Am. Stat. Assoc., 80, 215-221, 1985.

Yakowitz, S., and M. Karlsson, Nearest-neighbor methods with application to rainfall/runoff prediction, in Stochastic  Hydrology, edited by J. B. Macneil and G. J. Humphries, pp. 149-160, D. Reidel, Norwell, Mass., 1987.

Yates, D., Gangopadhyay, S., Rajagopalan, B., and Strzepek, K. (2003). “A technique for generating regional climate scenarios using a nearest-neighbor algorithm.” Water Resour. Res., 39(7), 1199.

Basic Machine Learning in Python with Scikit-learn


Machine learning has become a hot topic in the last few years, and for good reason. It provides data analysts with efficient ways of extracting information from data, allowing that information to be used for analysis and modeling purposes.

The Scikit-learn Python library has implementations of dozens of learning algorithms and is freely available for academic and commercial use under the terms of the BSD license. Some of these algorithms can be extremely useful for our job as water systems analysts, so given the overwhelming number of algorithms implemented in Scikit-learn, I thought I would mention a few I find particularly useful for my research. For each method below I included links with examples from the Scikit-learn website. Installation and usage instructions can be found on their website.

CART Trees

CART trees can be used for regression or classification. As with any single tree, CART trees are considered poor (generally high-variance) classifiers unless bootstrap-aggregated (bagged) or boosted (see supervised learning), but the resulting rules are easily interpretable.

CART Trees: http://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart
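For instance, fitting a regression tree takes only a few lines; the data below is made up purely for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Made-up example: predict an output from two input variables
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = 3.0 * X[:, 0] + np.sin(4 * X[:, 1]) + rng.normal(0, 0.1, size=200)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(tree.predict(X[:5]))                              # predictions for the first 5 rows
print(export_text(tree, feature_names=["x1", "x2"]))    # the interpretable rule set
```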

Dimensionality reduction

Principal Component Analysis (PCA) is perhaps the most widely used dimensionality reduction technique. It works by finding the basis that maximizes the data’s variance, allowing for the elimination of axes with low variance. Among its uses are noise reduction, data visualization (as it approximately preserves the distances between data points), and improving the computational efficiency of other algorithms by getting rid of redundant information. PCA can be used in its standard form, or it can be kernelized to handle data sets whose variance is maximized along a non-linear direction. Manifold learning is another way of performing dimensionality reduction, by unwinding the lower-dimensional manifold where the information lies.

PCA: http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_3d.html#sphx-glr-auto-examples-decomposition-plot-pca-3d-py

Kernel PCA: http://scikit-learn.org/stable/auto_examples/decomposition/plot_kernel_pca.html#sphx-glr-auto-examples-decomposition-plot-kernel-pca-py

Manifold learning: http://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html#sphx-glr-auto-examples-manifold-plot-compare-methods-py
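A minimal PCA example on made-up data might look like the following; the kernelized version just swaps in KernelPCA:

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))          # made-up data set: 300 samples, 10 variables

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)             # project onto the two highest-variance axes
print(pca.explained_variance_ratio_)    # fraction of variance captured by each axis

X_kpca = KernelPCA(n_components=2, kernel="rbf").fit_transform(X)
```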

Clustering

Clustering is used to group similar points in a data set. One example is the problem of finding customer niches based on the products each customer buys. The most famous clustering algorithm is k-means, which, like any other machine learning algorithm, works well on some data sets but not on others. There are several alternative algorithms, all of which are exemplified in the following two links:

Clustering algorithms comparison: http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html#sphx-glr-auto-examples-cluster-plot-cluster-comparison-py

Gaussian Mixture Models (similar in spirit to k-means, but also provides covariance estimates for each cluster): http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html#sphx-glr-auto-examples-mixture-plot-gmm-covariances-py

Reducing the dimensionality of a dataset with PCA or kernel PCA may speed up clustering algorithms.
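A short k-means example on made-up data (the Gaussian mixture version is nearly identical, swapping in GaussianMixture):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 3, 6)])  # 3 blobs

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_                  # cluster assignment for each point
centers = kmeans.cluster_centers_

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
gmm_labels = gmm.predict(X)              # covariance estimates are in gmm.covariances_
```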

Supervised learning

Supervised learning algorithms can be used for regression or classification problems (e.g. classify a point as pass/fail) based on labeled data sets. The most “trendy” one nowadays is neural networks, but support vector machines, boosted and bagged trees, and others are also options that should be considered and tested on your data set. Below are links to some of the supervised learning algorithms implemented in Scikit-learn:

Comparison between supervised learning algorithms: http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py

Neural networks: http://scikit-learn.org/stable/modules/neural_networks_supervised.html

Gaussian process regression is another supervised learning algorithm, which can also be used for Bayesian optimization:

Gaussian processes: http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpr_noisy_targets.html#sphx-glr-auto-examples-gaussian-process-plot-gpr-noisy-targets-py
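As a minimal example of the supervised learning workflow (made-up pass/fail data; any of the classifiers above could be swapped in):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Made-up labeled data set: two features, binary pass/fail label
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for model in (RandomForestClassifier(random_state=0), SVC(kernel="rbf")):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))   # classification accuracy
```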

 

Synthetic Weather Generation: Part V

Conditioning Synthetic Weather Generation on Seasonal Climate Forecasts

This is the final blog post in a five part series on synthetic weather generators. See Parts I and II for a description of single-site parametric and non-parametric generators, respectively, and Part III for a description of multi-site generators of both types.

In my previous post, Part IV, I discussed how parametric and non-parametric weather generators can be modified to produce weather that is consistent with climate change projections for use in long-term planning. In the shorter-term, water managers may be able to exploit mid-range climate forecasts to inform seasonal reservoir operations (see e.g., Kim and Palmer (1997), Chiew et al. (2003), Block (2011), Block and Goddard (2012), Anghileri et al. (2016)). For such analyses, it could be useful to tailor management plans to simulated weather conditions consistent with these probabilistic forecasts. Here I discuss how one can condition weather generators on seasonal climate forecasts for such purposes.

Two major forecasting groups, the International Research Institute (IRI) for Climate and Society at Columbia University and the Climate Prediction Center (CPC) of the U.S. National Centers for Environmental Prediction, issue tercile seasonal forecasts that specify the probabilities of observing above normal (pA), near normal (pN) and below normal (pB) precipitation and temperature. Forecasts are issued each month for the upcoming three months (see those from IRI and CPC). While these forecasts are derived from dynamical and statistical models that include a variety of physically-based processes, most of the forecast skill can be explained by the effects of the El Niño-Southern Oscillation (ENSO) on the climate system (Barnston et al., 1994). As most of you probably know, ENSO refers to the quasi-periodic cycling of sea surface temperatures (SSTs) in the tropical eastern Pacific Ocean. The warm phase is known as El Niño and the cool phase as La Niña. These are quantified as five consecutive three-month periods with SST anomalies in the Niño 3.4 region of the Pacific > +0.5°C (El Niño) or < -0.5°C (La Niña). All other periods are considered neutral (Climate Prediction Center, 2016). Because much of seasonal climate forecast skill is derived from ENSO, current or forecasted SST anomalies in the Niño 3.4 region are sometimes used by themselves as proxy seasonal forecasts. Here I will discuss techniques for conditioning weather generator parameters on either tercile forecasts or current or projected ENSO conditions.

Parametric Weather Generators

Wilks (2002) presents a method for conditioning parametric weather generators of the Richardson (1981) type on tercile seasonal climate forecasts using a network of sites in New York State. The key idea, derived from Briggs and Wilks (1996), is to estimate the weather generator parameters not from the entire historical record, but from a weighted bootstrapped sample of the historical record consistent with the forecast. This is similar to the method of streamflow generation used by Herman et al. (2016) to increase the frequency of droughts of a given magnitude. As an illustrative example, Herman et al. (2016) empirically estimate the quantile of a noteworthy drought from the historical record and show the system impacts of droughts of that magnitude or worse becoming n times more likely. This is done by adapting the semi-parametric streamflow generator developed by Kirsch et al. (2013) to sample years with historical droughts of at least that magnitude n times more often.

While Herman et al. (2016) take a fully non-parametric approach by estimating quantiles empirically, Briggs and Wilks (1996) estimate terciles parametrically by fitting a normal distribution to the historical mean seasonal temperatures and a Gamma distribution to the historical seasonal precipitation totals at each site. After estimating terciles, Briggs and Wilks (1996) classify each year in the historical record as below normal, near normal or above normal in terms of temperature and precipitation in the season of interest. Because terciles are estimated parametrically, each category will not necessarily contain an equal number of years, even if the record length is a multiple of three. It should be noted, however, that like Herman et al. (2016), IRI defines terciles empirically from the most recent 30-yr record (see more here and here) and CPC from the most recent 30-yr record updated only every 10 years, i.e., the current reference frame is 1981-2010 (see more here).  IRI provides a discussion of the advantages and disadvantages to parametric and non-parametric tercile estimation methods. For consistency, it may be best to take the same approach as the agency whose forecasts are being used.

Once terciles have been defined, weather generator parameters can be estimated from a bootstrapped sample of the historical record consistent with the forecast (Wilks, 2002). Consider a historical record with NB years classified as being below normal, NN as near normal, and NA as above normal. Then the expected value of a given seasonal statistic, X, can be estimated from a bootstrapped sample of L years from the historical record in which the below normal, near normal and above normal years are sampled with probabilities pB, pN, and pA, respectively. Representing the statistic of interest in the ith below, near and above normal year as xi(B), xi(N) and xi(A), respectively, then the expected value of X is the following:

(1) E\left[X\right] = \lim_{L\to\infty} \frac{1}{L}\left[\sum_{i=1}^{N_B} \frac{p_BL}{N_B}x_{i}^{\left(B\right)} + \sum_{i=1}^{N_N} \frac{p_NL}{N_N}x_{i}^{\left(N\right)} + \sum_{i=1}^{N_A} \frac{p_AL}{N_A}x_{i}^{\left(A\right)}\right]

That is, the forecast-conditional value of X is a weighted sum of its average in the below normal, near normal and above normal years, where the weights are equal to the forecast probabilities for each of the three respective categories (Wilks, 2002). Note that X is a seasonal statistic, not a parameter, so one cannot simply estimate a weather generator parameter as a weighted average of its value in each of the terciles unless the parameter is itself a simple statistic.

Fortunately, for some of the weather generator parameters this is the case. Recall from Part I that the first step of the Richardson generator is to generate a sequence of daily rainfall occurrences from a first order Markov chain. This chain is defined by the probabilities of transitioning from a dry day to a wet day, p01, or from a wet day to another wet day, p11. As discussed in Part IV, these two parameters together define the unconditional probability of a wet day, π, and the first-order autocorrelation of the occurrences, d, where π = p01/(1 + p01 – p11) and d = p11 – p01 (Katz, 1983). The unconditional probability of a wet day is a simple statistic. Therefore, since pN = 1 – pA – pB, π can be estimated each month as a function of pB, pA and the average portion of wet days in below normal, near normal and above normal years for that month (x̅(B), x̅(N) and x̅(A)):

(2) π = pBx̅(B) + (1 – pB – pA)x̅(N) + pAx̅(A).

More generically, π = b0 + bBpB + bApA, where b0 = x̅(N), bB = x̅(B) – x̅(N) and bA = x̅(A) – x̅(N). Wilks (2002) recommends that the parameter π be estimated separately at each site and for each month, but that the below normal, near normal and above normal years be defined based on the total precipitation at that site over the entire three-month season. This is because the forecast is for the entire season, while the portion of wet days varies on a shorter time scale.
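A tiny sketch of Equation 2 (the numbers are hypothetical; in practice x̅(B), x̅(N) and x̅(A) are the monthly wet-day fractions averaged over the historical years in each tercile):

```python
def conditional_wet_day_prob(p_below, p_above, xbar_below, xbar_near, xbar_above):
    """Forecast-conditional wet-day probability pi (Equation 2): a weighted average of
    the tercile means, weighted by the tercile forecast probabilities."""
    p_near = 1.0 - p_below - p_above
    return p_below * xbar_below + p_near * xbar_near + p_above * xbar_above

# e.g. a dry-leaning forecast (pB, pN, pA) = (0.5, 0.3, 0.2) with hypothetical tercile means
pi = conditional_wet_day_prob(0.5, 0.2, xbar_below=0.22, xbar_near=0.30, xbar_above=0.41)
```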

Like the unconditional probability of a wet day, the persistence parameter d can also be estimated as a function of pB, pA and the value of d in below normal, near normal and above normal years. Wilks (2002) shows that d can be represented by a quadratic function of pB and pA:

(3) d = b0 + bBpB + bBBpB² + bApA + bAApA² + bBApBpA,

but finds that variations in d across forecasts are small such that one can reasonably assume the climatological estimate of d for all sites and months, regardless of the forecast.

The remaining weather generator parameters related to precipitation are those defining the distribution of precipitation amounts. Because these parameters (α and β if fitting a Gamma distribution, and α, λ1 and λ2 if fitting a mixed exponential distribution) are estimated iteratively in an MLE approach, they cannot be estimated as a function of the forecast probabilities like the occurrence parameters can. Instead, Wilks (2002) suggests using the Briggs and Wilks (1996) approach of bootstrapping a large sample of years from the historical record consistent with the forecast and fitting separate probability distributions to the precipitation amounts in each month’s weighted sample. When performing this estimation using a mixed exponential distribution to model precipitation amounts, Wilks (2002) found the estimates of the mixing parameter, α, to be the least consistent across the investigated sites and chose to hold it constant across sites and forecasts. Thus, only λ1 and λ2 were re-estimated for each month and site as a function of the seasonal forecast.

One drawback to the sampling scheme employed by Briggs and Wilks (1996) is that all historical years within each tercile have an equal probability of being sampled: pB/NB, pN/NN and pA/NA for below, near and above normal years. In reality, years similar to those in the tail are less likely to occur than years similar to those near the median. An alternative, more precise sampling scheme called the pdf-ratio method suggested by Stedinger and Kim (2010) assigns each year i an un-normalized probability of selection, qi = (1/N)*f1(xi)/f0(xi) where N is the number of years in the historical record and f1 and f0 are pdfs of the statistic X under forecast and climatological conditions, respectively. The qi are then normalized such that they sum to 1. f1 and f0 can be analytical or empirical distributions.
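A sketch of the pdf-ratio weighting follows; the normal distributions here are stand-ins chosen for illustration (Stedinger and Kim (2010) allow f1 and f0 to be analytical or empirical):

```python
import numpy as np
from scipy import stats

def pdf_ratio_weights(x_hist, f1_pdf, f0_pdf):
    """Normalized selection probabilities q_i proportional to (1/N) f1(x_i)/f0(x_i)."""
    q = f1_pdf(x_hist) / f0_pdf(x_hist) / len(x_hist)
    return q / q.sum()

# Hypothetical seasonal precipitation totals and a drier-than-climatology forecast pdf
x_hist = np.array([210.0, 250.0, 300.0, 330.0, 400.0, 460.0, 520.0])
f0 = stats.norm(loc=x_hist.mean(), scale=x_hist.std(ddof=1)).pdf        # climatology
f1 = stats.norm(loc=x_hist.mean() - 60.0, scale=x_hist.std(ddof=1)).pdf  # forecast-shifted
weights = pdf_ratio_weights(x_hist, f1, f0)
resampled = np.random.default_rng(0).choice(len(x_hist), size=1000, p=weights)
```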

After estimating parameters of the precipitation amounts distributions using either the approach of Briggs and Wilks (1996) or Stedinger and Kim (2010), one must estimate the forecast-conditional temperature parameters. Recall that in the Richardson generator, separate harmonics are fit to the eight time series of means and standard deviations of minimum and maximum temperature on wet and dry days. Historical residuals from these fits are determined by first subtracting the predicted mean and then dividing by the predicted standard deviation. Finally, the residuals are modeled by an order-one vector autoregression, or VAR(1) model. Because forecast-conditional weather generators are only applied three months at a time, Wilks (2002) suggests instead fitting quadratic functions to these eight time series within the season of interest.

Like the parameters describing precipitation occurrences, the parameters of the quadratic functions of time describing the mean temperature on wet and dry days can be estimated as a weighted average of fits in each of the three terciles. As shown in Wilks (2002), the four mean temperature functions (minimum and maximum on wet and dry days), µ(t), at each site are specified by the function:

(4) µ(t) = (β0 + βBpB + βApA) + (γ0 + γBpB + γApA)t + (δ0 + δBpB + δApA)t²

where

(5) µ̅t(N) = β0 + γ0t + δ0t²,

(6) [µ̅t(B) – µ̅t(N)] = βB + γBt + δBt²,

(7) [µ̅t(A) – µ̅t(N)] = βA + γAt + δAt²,

t is the day and µ̅t(B), µ̅t(N) and µ̅t(A) are the mean temperature statistics of concern on each day of below normal, near normal and above normal years in the historical record, respectively.

Finally, Wilks (2002) shows that the standard deviations of minimum and maximum temperature on wet and dry days can be estimated by an extension of Equation 1:

(8) s\left(t\right) = \left[\frac{p_B}{N_B}\sum_{i=1}^{N_B}\left[x_{i}^{\left(B\right)}\left(t\right) - \mu\left(t\right)\right]^2 + \frac{p_N}{N_N}\sum_{i=1}^{N_N}\left[x_{i}^{\left(N\right)}\left(t\right) - \mu\left(t\right)\right]^2 + \frac{p_A}{N_A}\sum_{i=1}^{N_A}\left[x_{i}^{\left(A\right)}\left(t\right) - \mu\left(t\right)\right]^2\right]^{1/2}

where µ(t) is defined as in Equation 4. Once again, the forecast-conditional standard deviations of the four temperature series, σ(t), can then be estimated by quadratic functions of time, conditional on the forecast probabilities pA and pB:

(9) σ(t) = (β0 + βBpB + βBBpB² + βApA + βAApA² + βBApBpA) + (γ0 + γBpB + γBBpB² + γApA + γAApA² + γBApBpA)t + (δ0 + δBpB + δBBpB² + δApA + δAApA² + δBApBpA)t².

For the VAR(1) model of temperature residuals, Wilks (2002) found that variations in the estimates of these parameters as a function of the forecast, like d, were minor for the investigated sites in New York. For this reason, the VAR(1) model was fit separately for each month and site based on the entire historical record and these estimates were unchanged with the forecast. Finally, Wilks (2002) found that the spatial correlation of temperature and precipitation also did not change significantly between climatic terciles, and so they too were assumed independent of the forecast. Correlations in temperature were included in the VAR(1) model as described in Part III. Correlations in precipitation occurrences ω, and amounts, ζ, between sites k and l were approximated for all site pairs each month as a function of the horizontal distance, c, between them (see Equations 10 and 11, respectively, for which parameters θ1, θ2 and θ3 were estimated).

(10) \omega = \left(1 + \frac{c_{k,l}}{\theta_1}\right)^{\theta_2}

(11) \zeta = \exp\left(-\theta_3c_{k,l}\right)

If one received a tercile ENSO forecast, the same approach could be used as in Wilks (2002), except the season of interest in each historical year would be classified as La Niña, Neutral or El Niño instead of below normal, near normal or above normal.

Non-parametric Weather Generators

The key idea of weighted sampling from Briggs and Wilks (1996) has also been applied in non-parametric weather generators to condition synthetic weather series on seasonal climate forecasts. For example, Apipattanavis et al. (2007) modify their semi-parametric k-nn generator to find and probabilistically select neighbors, not from the entire historical record, but from a bootstrapped sample of the historical record consistent with the forecast. Again, this can be applied using tercile forecasts of either the {Below Normal, Near Normal, Above Normal} type or {La Niña, Neutral, El Niño} type.

Clark et al. (2004a) develop a more innovative approach that combines ideas from the non-parametric Schaake Shuffle method used to spatially correlate short-term precipitation and temperature forecasts (Clark et al., 2004b) with a parametric approach to weighted resampling presented by Yates et al. (2003) for the k-nn generator of Rajagopalan and Lall (1999). The Schaake Shuffle, originally devised by Dr. J. Schaake of the National Weather Service Office of Hydrologic Development, is a method of reordering ensemble precipitation and temperature forecasts to better capture the spatial and cross correlation of these spatial fields (Clark et al., 2004b).

Traditionally, model output statistics (MOS) from the Numerical Weather Prediction (NWP) model, such as temperature, humidity and wind speed at different pressure levels, are used as predictors in a regression model to forecast daily temperature and precipitation at a number of sites. To generate an ensemble of predictions for each forecasted day, normal random variables with mean 0 and variance σε² are added back to the mean prediction, where σε² is the variance of the regression residuals. However, these regressions are generally developed independently for each variable at each site and therefore do not reproduce the spatial or temporal correlation between the variables across sites and time (Clark et al., 2004b). To better capture these correlations, the Schaake Shuffle, illustrated in Figure 2 from Clark et al. (2004a) for a 10-member ensemble, re-orders the ensemble members each day in order to preserve the Spearman-rank correlation in the temperature and precipitation variables between sites.

(Figure 2 from Clark et al. (2004a): illustration of the Schaake Shuffle for a 10-member ensemble.)

The Schaake Shuffle proceeds as follows. For a particular day, the original ensemble members for each variable at each station are ranked from lowest to highest, as shown in Table A of Figure 2 above. Next, a set of historical observations of the same size is generated by randomly selecting days from the historical record within a window of 7 days before and after the forecast date (Table B). Third, the historical observations are sorted from lowest to highest for each variable at each site, as shown in Table C. Finally, the original ensemble members in Table A are re-shuffled to form the final, spatially correlated ensembles in Table D in the following way:

  1. The rank of the data in the first historical observation (shown with dark circles in Tables B and C) is determined at each site
  2. At each site, the member of the original ensemble with the same rank as the first historical observation for that site becomes the first member of the final, correlated ensemble (see dark circles in Table A and location in Table D).
  3. Steps 1 and 2 are repeated for every historical observation/ensemble member.

As stated earlier, this process reproduces the Spearman rank correlation of the observations across sites (Clark et al., 2004b). In order to preserve the temporal correlation of each variable, instead of re-generating a random set of historical observations to use for shuffling the next day’s forecast, the observations from the days immediately following those used for the previous time step are used. While the Schaake Shuffle does not guarantee reproduction of the spatial correlation in the observations, just in their ranks, the results presented in Clark et al. (2004b) indicate that the method does reasonably well for both, and significantly improves upon the un-shuffled forecasts.
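A bare-bones sketch of the reordering step for a single day and a single variable is given below (real applications handle multiple variables, the temporal chaining described above, and ties):

```python
import numpy as np

def schaake_shuffle(ensemble, observations):
    """Reorder the ensemble at each site so that, across members, its ranks match the
    ranks of the historical observations (rows = members/observations, cols = sites)."""
    shuffled = np.empty_like(ensemble)
    for j in range(ensemble.shape[1]):                           # loop over sites
        sorted_members = np.sort(ensemble[:, j])                 # ensemble values by rank
        obs_ranks = np.argsort(np.argsort(observations[:, j]))   # rank of each observation
        shuffled[:, j] = sorted_members[obs_ranks]               # member with matching rank
    return shuffled

# e.g., a 10-member ensemble at 3 sites, shuffled to mimic the observations' rank structure
rng = np.random.default_rng(0)
shuffled = schaake_shuffle(rng.normal(size=(10, 3)), rng.normal(size=(10, 3)))
```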

In the weather generator presented by Clark et al. (2004a), the same approach is used to simulate weather sequences except the ensembles in Table A are not generated by MOS regressions but by independently sampling historical observations within +/- 7 days of the simulated day at each site. To condition this weather generator on seasonal climate forecasts, the unshuffled ensembles are formed by preferential selection of different years from the historical record following an approach inspired by Yates et al. (2003). The first step in this approach is to sort all N historical years in terms of their similarity to a climate index, such as current SSTs in the Niño 3.4 region. The most similar year is given rank i = 1 and the least similar i = N. Next, a standard uniform random variable u is drawn and the year of rank i is chosen as an ensemble member, where i = INT(uλN/α) + 1. Here INT(·) is the integer operator, λ is a weighting parameter, and α a selection parameter. Values of λ greater (less) than 1 increase (decrease) the probability of selecting years ranked more similar to the climate index. Values of α greater than 1 restrict the number of sampled years such that α = 5, for example, results in only the most similar 1/5 of years being selected (Clark et al., 2004a).

Yates et al. (2003) apply a simplified version of this method with only one parameter, λ, in a scenario discovery-type approach, investigating the effects of e.g. warmer-drier springs and cooler-wetter summers. Clark et al. (2004a) first take this approach by ranking the historical years according to their similarity to the current Niño 3.4 index and exploring the effects of different choices of λ and α on the skill of the generated weather sequences in forecasting total winter precipitation at Petrified Forest in Arizona, measuring skill by the ranked probability skill score (RPSS). Interestingly, they find that high values of both λ and α, where years more similar to the climate index at the beginning of the season are selected, result in negative forecast skill. This highlights the importance of not being overconfident by only sampling years closest to current or forecast conditions. They note that the values of λ and α should depend on the strength of the Niño 3.4 index, and therefore should be re-optimized for different values of the index in order to maximize the RPSS.

All of these approaches could prove informative for seasonal water resources planning, if the forecasts being used are reliable. In the case of tercile forecasts, this means that, on average, when a given climate state is forecast to occur with probability p, it does in fact occur with that probability. Given that past diagnostic assessments of IRI and CPC forecasts have found biases and overconfidence in some locations (Wilks and Godfrey, 2002; Wilks, 2000), water managers should proceed with caution in using them for seasonal planning. At a minimum, one should perform an analysis of the forecast value for the system of concern (Anghileri et al., 2016) before changing system operations. Fortunately, these forecasts continue to improve over time and several studies have already found value in using them to inform seasonal operations (e.g., Kim and Palmer (1997), Block (2011), Block and Goddard (2012), Anghileri et al. (2016)), indicating promise in their use for water resources planning.

Works Cited

Anghileri, D. Voisin, N., Castelletti, A., Pianosi, A., Nijssen, B., & Lettenmaier, D. P. (2016). Value of long-term streamflow forecasts to reservoir operations for water supply in snow-dominated river catchments. Water Resources Research, 52(6), 4209-4225.

Apipattanavis, S., Podestá, G., Rajagopalan, B., & Katz, R. W. (2007). A semiparametric multivariate and multisite weather generator. Water Resources Research, 43(11).

Barnston, A. G., van den Dool, H. M., Rodenhuis, D. R., Ropelewski, C. R., Kousky, V. E., O’Lenic, E. A., et al. (1994). Long-lead seasonal forecasts-Where do we stand?. Bulletin of the American Meteorological Society, 75(11), 2097-2114.

Block, P. (2011). Tailoring seasonal climate forecasts for hydropower operations. Hydrology and Earth System Sciences, 15, 1355-1368.

Block, P., & Goddard, L. (2012). Statistical and dynamical climate predictions to guide water resources in Ethiopia. Journal of Water Resources Planning and Management, 138(3), 287-298.

Briggs, W. M., & Wilks, D. S. (1996). Extension of the Climate Prediction Center long-lead temperature and precipitation outlooks to general weather statistics. Journal of climate, 9(12), 3496-3504.

Chiew, F. H. S., Zhou, S. L., & McMahon, T. A. (2003). Use of seasonal streamflow forecasts in water resources management. Journal of Hydrology, 270(1), 135-144.

Clark, M. P., Gangopadhyay, S., Brandon, D., Werner, K., Hay, L., Rajagopalan, B., & Yates, D. (2004a). A resampling procedure for generating conditioned daily weather sequences. Water Resources Research, 40(4).

Clark, M., Gangopadhyay, S., Hay, L., Rajagopalan, B., & Wilby, R. (2004b). The Schaake shuffle: A method for reconstructing space-time variability in forecasted precipitation and temperature fields. Journal of Hydrometeorology, 5(1), 243-262.

Climate Prediction Center (2016). ENSO: Recent Evolution, Current Status and Predictions. National Oceanic and Atmospheric Administration, pp. 19-20.

Herman, J. D., Zeff, H. B., Lamontagne, J. R., Reed, P. M., & Characklis, G. W. (2016). Synthetic drought scenario generation to support bottom-up water supply vulnerability assessments. Journal of Water Resources Planning and Management, 04016050.

Katz, R. W. (1983). Statistical procedures for making inferences about precipitation changes simulated by an atmospheric general circulation model. Journal of the Atmospheric Sciences, 40(9), 2193-2201.

Kim, Y., & Palmer, R. (1997). Value of seasonal flow forecasts in Bayesian stochastic programming. Journal of Water Resources Planning and Management, 123(6), 327-335.

Kirsch, B. R., Characklis, G. W., & Zeff, H. B. (2013). Evaluating the impact of alternative hydro-climate scenarios on transfer agreements: Practical improvement for generating synthetic streamflows. Journal of Water Resources Planning and Management, 139(4), 396-406.

Rajagopalan, B., & Lall, U. (1999). A k-nearest-neighbor simulator for daily precipitation and other weather variables. Water Resources Research, 35(10), 3089-3101.

Richardson, C. W. (1981). Stochastic simulation of daily precipitation, temperature and solar radiation. Water Resources Research, 17, 182-190.

Stedinger, J. R., & Kim, Y. O. (2010). Probabilities for ensemble forecasts reflecting climate information. Journal of Hydrology, 391(1), 9-23.

Wilks, D. S. (2000). Diagnostic verification of the Climate Prediction Center long-lead outlooks, 1995-1998. Journal of Climate, 13, 2389-2403.

Wilks, D. S. (2002). Realizations of daily weather in forecast seasonal climate. Journal of Hydrometeorology, 3(2), 195-207.

Wilks, D. S. & Godfrey, C. M. (2002). Diagnostic verification of the IRI net assessment forecasts, 1997-2000. Journal of Climate, 15(11), 1369-1377.

Yates, D., Gangopadhyay, S., Rajagopalan, B., & Strzepek, K. (2003). A technique for generating regional climate scenarios using a nearest-neighbor algorithm. Water Resources Research, 39(7).

Visualizing multidimensional data: a brief historical overview

The results of a MOEA search are presented as a set of multidimensional data points. In order to form useful conclusions from our results, we must have the ability to comprehend the multidimensional differences between results and effectively analyze and communicate them to decision makers.

Navigating through multiple dimensions is an inherently complex task for the human mind. We perceive the world in three dimensions, and thinking in higher dimensional space can be heavily taxing.  The difficulty of comprehending multidimensional data is compounded when one must display the data on a two dimensional surface such as a sheet of paper or a computer screen. The challenge of “flattening” data has persisted for centuries, and has plagued not only those who were concerned with gleaning scientific insights from data, but also artists and those seeking to accurately portray the physical world as perceived by the human eye.

For much of human history, even the greatest artists were unable to accurately express the three dimensional world in a two dimensional plane. Nicolo da Bologna’s 14th century work, The Marriage, fails to convey any sense of three dimensional space, giving the viewer the impression that the figures painted have been pressed against a pane of glass.


Nicolo da Bologna’s The Marriage (1350s) is unable to convey any sense of depth to the viewer.

During the Italian Renaissance, artists rediscovered the mathematics of perspective, allowing them to break free of the constraints of their two dimensional canvas and convey realistic images that gave the illusion of a third dimension. Raphael’s The School of Athens masterfully uses perspective to imbue the painting with a sense of depth. Through clever exploitation of Euclidean geometry and the mechanics of the human eye, Raphael is able to use the same medium (paint on a two dimensional surface) to convey a much richer representation of his subjects than his Bolognese predecessor.


Raphael’s The School of Athens (1509-1511) is an example of a masterful use of perspective. The painting vividly depicts a three dimensional space.

In the twentieth century, artists began attempting to convey more than three dimensions in two dimensional paintings. Cubists such as Picasso attempted to portray multiple viewpoints of the same image simultaneously, and futurists such as Umberto Boccioni attempted to depict motion and “Dynamism” in their paintings to convey time as a fourth dimension.


Pablo Picasso’s Portrait of Dora Maar (1938) depicts a woman’s face from multiple viewpoints simultaneously.


Umberto Boccioni’s Dynamism of a Cyclist (1913) attempts to portray a fourth dimension, time, through a sense of motion and energy in the painting. Can you tell this is supposed to be a cyclist, or did I go too far out there for a water programming blog?

Regardless of your views on the validity of modern art, as engineers and scientists we have to admit that in this area we share similar goals and challenges with these artists: to effectively convey multidimensional data in a two dimensional space. Unlike artists, whose objectives are to convey emotions, beauty or abstract ideas through their work, we in the STEM fields seek to gain specific insights from multidimensional data that will guide our actions or investigations.

A notable historical example of the effective use of clever visualization was English physician John Snow’s map of the London cholera epidemic of 1854. Snow combined cholera mortality data with patients’ home addresses to map the locations of cholera deaths within the city.


John Snow’s map of the 1854 London Cholera Epidemic. Each black bar is proportional to the number of cholera deaths at a given residence. Pumps are depicted using black circles. One can clearly see that the cholera deaths are clustered around the pump on Broad Street (which I’ve circled in red).

The results of his analysis led Snow to conclude that a contaminated well was the likely source of the outbreak, a pioneering feat in the field of public health. Snow’s effective visualization not only provided him with insights into the nature of the problem, but also allowed him to effectively communicate his results to a general public that had previously been resistant to the idea of waterborne disease.

In his insightful book Visual Explanations: Images and Quantities, Evidence and Narrative, Edward Tufte points to three strengths in John Snow’s use of data visualization in his analysis of the epidemic. First, Snow provided the appropriate context for his data. Rather than simply plotting a time series of cholera deaths, Snow placed those deaths within a new context, geographic location, which allowed him to make the connection to the contaminated pump. Second, Snow made quantitative comparisons within his data. As Tufte points out, a fundamental question when dealing with statistical analysis is “Compared with what?” It is not sufficient to simply provide data about those who were struck by the disease; one must also explain why certain populations were not affected. By complementing his data collection with extensive interviews of the local population, Snow was able to show that there were indeed people who escaped the disease within the area of interest, but that these people all got their water from other sources, which strengthened his argument that the pump was the source of the epidemic. Finally, Tufte insists that one must always consider explanations other than the one that seems apparent from the data visualization before drawing final conclusions. It is easy to make a slick but misleading visualization, and in order to maintain credibility as an analyst, one must always keep an open mind to alternative explanations. Snow took the utmost care in crafting and verifying his conclusion, and as a result his work stands as a shining example of the use of visualization to explore multidimensional data.

While Snow’s methodology is impressive, and Tufte’s observations about his work helpful, we cannot directly apply his methodology to future evaluations of multidimensional data, because his map is only useful for evaluating data from the epidemic of 1854. There is a need for general tools that can be applied to multidimensional data to provide insights through visualization. Enter the field of visual analytics. As defined by Daniel Keim, “Visual analytics combines automated-analysis techniques with interactive visualization for an effective understanding, reasoning and decision making on the basis of very large and complex data sets”. The field of visual analytics combines the disciplines of data analysis, data management, geospatial and temporal processes, spatial decision support, human-computer interaction and statistics. The goal of the field is to create flexible tools for visual analysis and data mining. Noted visualization expert Alfred Inselberg proposed six criteria that successful visualization tools should satisfy:

  1. Low representational complexity.
  2. Works for any number of dimensions.
  3. Every variable is treated uniformly.
  4. The displayed object can be recognized under progressive transformations (i.e. rotation, translation, scaling, perspective).
  5. The display easily and intuitively conveys information on the properties of the N-dimensional object it represents.
  6. The methodology is based on rigorous mathematical and algorithmic results.

Using the above criteria, Inselberg developed the Parallel Coordinate plot. Parallel Coordinate plots transform multidimensional relationships into two dimensional patterns which are well suited for visual data mining.


An example of a five dimensional data set plotted on a Parallel Coordinate plot. Each line represents a data point, while each axis represents the point’s value in each dimension.
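To give a sense of how easily such a plot can be generated, below is a minimal sketch in Python using pandas’ built-in parallel_coordinates function. The five objective names and the random data are purely hypothetical placeholders, not results from an actual MOEA run.

# Minimal sketch of a parallel coordinate plot with pandas + matplotlib.
# The objective names and random values below are hypothetical placeholders.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

rng = np.random.default_rng(0)
objectives = ['cost', 'reliability', 'flood risk', 'deficit', 'surplus']
solutions = pd.DataFrame(rng.random((30, 5)), columns=objectives)

# Color the lines by a (hypothetical) grouping so patterns are easier to see.
solutions['group'] = np.where(solutions['reliability'] > 0.5, 'high reliability', 'low reliability')

# Each line is one solution; each vertical axis is one objective.
parallel_coordinates(solutions, class_column='group', colormap='viridis', alpha=0.7)
plt.ylabel('Objective value (normalized)')
plt.show()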

As water resources analysts dealing with multiobjective problems, it is critical that we have the ability to comprehend and communicate the complexities of multidimensional data. By learning from historical data visualization examples and making use of cutting edge visual analytics, we can make this task much more manageable. Parallel coordinate plots are just one example of the many visualization tools that have been created in recent decades by the ever evolving field of visual analytics.  As computing power continues its rapid advancement, it is important that we as analysts continue to ask ourselves whether we can improve our ability to visualize and gain insights from complex multidimensional data sets.

 

Synthetic Weather Generation: Part IV

Conditioning Synthetic Weather Generation on Climate Change Projections

This is the fourth blog post in a five part series on synthetic weather generators. You can read about common single-site parametric and non-parametric weather generators in Parts I and II, respectively, as well as multi-site generators of both types in Part III. Here I discuss how parametric and non-parametric weather generators can be modified to simulate weather that is consistent with climate change projections.

As you are all well aware, water managers are interested in finding robust water management plans that will consistently meet performance criteria in the face of hydrologic variability and change. One method for such analyses is to re-evaluate different water management plans across a range of potential future climate scenarios. Weather data that is consistent with these scenarios can be synthetically generated from a stochastic weather generator whose parameters or resampled data values have been conditioned on the climate scenarios of interest, and methods for doing so are discussed below.

Parametric Weather Generators

Most climate projections from GCMs are given in terms of monthly values, such as the monthly mean precipitation and temperature, as well as the variances of these monthly means. Wilks (1992) describes a method for adjusting parametric weather generators of the Richardson type in order to reproduce these projected means and variances. This method, described here, has been used for agricultural (Mearns et al., 1996; Riha et al., 1996) and hydrologic (Woodbury and Shoemaker, 2012) climate change assessments.

Recall from Part I that the first step of the Richardson generator is to simulate the daily precipitation states with a first-order Markov chain, and then the precipitation amounts on wet days by independently drawing from a Gamma distribution. Thus, the total monthly precipitation is a function of the transition probabilities describing the Markov chain and the parameters of the Gamma distribution describing precipitation amounts. The transition probabilities of the Markov chain, p01 and p11, represent the probabilities of transitioning from a dry day to a wet day, or a wet day to another wet day, respectively. An alternative representation of this model shown by Katz (1983) is with two different parameters, π and d, where π = p01/(1 + p01 − p11) and d = p11 − p01. Here π represents the unconditional probability of a wet day, and d represents the first-order autocorrelation of the Markov chain.
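As a quick illustration, here is a minimal sketch of this reparameterization in Python; the transition probabilities below are made-up values rather than estimates from any particular record.

# Katz (1983) reparameterization of the first-order Markov chain for daily
# precipitation occurrence. The transition probabilities are illustrative only.
p01 = 0.30  # P(wet today | dry yesterday)
p11 = 0.65  # P(wet today | wet yesterday)

pi = p01 / (1.0 + p01 - p11)  # unconditional probability of a wet day
d = p11 - p01                 # lag-1 autocorrelation of the occurrence process

# The mapping is invertible, so the original transition probabilities
# can be recovered from (pi, d):
p01_check = pi * (1.0 - d)
p11_check = d + pi * (1.0 - d)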

Letting SN be the sum of N daily precipitation amounts in a month (or equivalently, the monthly mean precipitation), Katz (1983) shows that if the precipitation amounts come from a Gamma distribution with shape parameter α and scale parameter β, the mean, μ, and variance, σ2, of SN are described by the following equations:

(1) μ(SN) = Nπαβ

(2) σ2(SN) = Nπαβ2[1 + α(1-π)(1+d)/(1-d)].

For climate change impact studies, one must find a set of parameters θ=(π,d,α,β) that satisfies the two equations above for the projected monthly mean precipitation amounts and variances of those means. Since there are only 2 equations but 4 unknowns, 2 additional constraints are required to fully specify the parameters (Wilks, 1992). For example, one might assume that the frequency and persistence of precipitation do not change, but that the mean and variance of the amounts distribution do. In that case, π and d would be unchanged from their estimates derived from the historical record, while α and β would be re-estimated to satisfy Equations (1) and (2). Other constraints can be chosen based on the intentions of the impacts assessment, or varied as part of a sensitivity analysis.
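To make this concrete, here is a minimal sketch of that last case in Python: π and d are held at their historical values, and α and β are re-estimated so that Equations (1) and (2) reproduce a projected monthly mean and variance. The projected statistics below are placeholders standing in for GCM output.

# Re-estimate the Gamma parameters (alpha, beta) from Equations (1) and (2),
# holding the occurrence parameters (pi, d) fixed at their historical values.
# The "projected" statistics are placeholder values, not actual GCM output.
N = 31             # number of days in the month
pi = 0.40          # historical unconditional wet-day probability
d = 0.35           # historical lag-1 autocorrelation of occurrence
mu_proj = 75.0     # projected mean monthly precipitation (mm), Equation (1)
var_proj = 1300.0  # projected variance of monthly precipitation (mm^2), Equation (2)

C = (1.0 - pi) * (1.0 + d) / (1.0 - d)
m = mu_proj / (N * pi)                         # m = alpha * beta, from Equation (1)
alpha = 1.0 / (var_proj / (mu_proj * m) - C)   # Equation (2) solved for alpha
beta = m / alpha

# alpha and beta must be positive; otherwise the projected mean and variance
# cannot be matched without also changing pi and/or d.
assert alpha > 0 and beta > 0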

To modify the temperature-specific parameters, recall that in the Richardson generator, the daily mean and standard deviation of the non-precipitation variables are modeled separately on wet and dry days by annual Fourier harmonics. Standardized residuals of daily minimum and maximum temperature are calculated for each day by subtracting the daily mean and dividing by the daily standard deviation given by these harmonics. The standardized residuals are then modeled using a first-order vector auto-regression, or VAR(1) model.

For generating weather conditional on climate change projections, Wilks (1992) assumes that the daily temperature auto and cross correlation structure remains the same under the future climate so that the VAR(1) model parameters are unchanged. However, the harmonics describing the mean and standard deviation of daily minimum and maximum temperature must be modified to capture projected temperature changes. GCM projections of temperature changes do not usually distinguish between wet and dry days, but it is reasonable to assume the changes are the same on both days (Wilks, 1992). However, it is not reasonable to assume that changes in minimum and maximum temperatures are the same, as observations indicate that minimum temperatures are increasing by more than maximum temperatures (Easterling et al., 1997; Vose et al., 2005).

Approximating the mean temperature, T, on any day t as the average of that day’s mean maximum temperature, µmax(t), and mean minimum temperature, µmin(t), the projected change in that day’s mean temperature, ΔT(t), can be modeled by Equation 3:

(3) \Delta \overline{T}\left(t\right) = \frac{1}{2}\left[\Delta\mu_{min}\left(t\right) + \Delta\mu_{max}\left(t\right)\right] = \frac{1}{2} \left(CX_0 + CX_1\cos\left[\frac{2\pi\left(t-\phi\right)}{365}\right] + CN_0 + CN_1\cos\left[\frac{2\pi\left(t-\phi\right)}{365}\right]\right)

where CX0 and CN0 represent the annual average changes in maximum and minimum temperatures, respectively, and CX1 and CN1 the corresponding amplitudes of the annual harmonics. The phase angle, φ, represents the day of the year with the greatest temperature change between the current and projected climate, which is generally assumed to be the same for the maximum and minimum temperature. Since GCMs predict that warming will be greater in the winter than the summer, a reasonable value of φ is 21 for January 21st, the middle of winter (Wilks, 1992).

In order to use Equation 3 to estimate harmonics of mean minimum and maximum temperature under the projected climate, one must estimate the values of CX0, CN0, CX1 and CN1. Wilks (1992) suggests a system of four equations that can be used to estimate these parameters:

(4) ΔT = 0.5*(CX0 + CN0)

(5) Δ[T(JJA) – T(DJF)] = -0.895(CX1 + CN1)

(6) ΔDR(DJF) = CX0 − CN0 + 0.895(CX1 − CN1)

(7) ΔDR(JJA) = CX0 − CN0 − 0.895(CX1 − CN1)

where the left hand sides of Equations (4)-(7) represent the annual average temperature change, the change in the temperature difference between summer (JJA) and winter (DJF), the change in average diurnal temperature range (DR = µmax – µmin) in winter, and the change in average diurnal temperature range in summer, respectively. The constant ±0.895 is simply the average value of the cosine term in Equation (3) evaluated at φ = 21 for the winter (+) and summer (−) seasons. The values for the left hand side of these equations can be taken from GCM projections, either as transient functions of time or as step changes.
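Since Equations (4)-(7) are linear in CX0, CN0, CX1 and CN1, they can be solved directly once the four projected changes are specified. Below is a minimal sketch using numpy; the left-hand-side values are placeholders standing in for GCM projections.

import numpy as np

# Solve Equations (4)-(7) for the temperature-change harmonic parameters.
# The projected changes below are placeholders, not actual GCM output.
dT_annual = 2.0     # Eq. (4): annual average temperature change (deg C)
dT_jja_djf = -1.0   # Eq. (5): change in T(JJA) - T(DJF)
dDR_djf = -0.6      # Eq. (6): change in diurnal temperature range, winter
dDR_jja = -0.2      # Eq. (7): change in diurnal temperature range, summer

# Rows correspond to Equations (4)-(7); columns to [CX0, CN0, CX1, CN1].
A = np.array([[0.5,  0.5,  0.0,    0.0],
              [0.0,  0.0, -0.895, -0.895],
              [1.0, -1.0,  0.895, -0.895],
              [1.0, -1.0, -0.895,  0.895]])
b = np.array([dT_annual, dT_jja_djf, dDR_djf, dDR_jja])

CX0, CN0, CX1, CN1 = np.linalg.solve(A, b)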

Equations (4)-(7) can be used to estimate the mean minimum and maximum temperature harmonics for the modified weather generator, but the variance in these means may also change. Unlike the changes in mean daily minimum and maximum temperature, it is fair to assume that the changes in their standard deviations are the same as each other, and equal to the GCM-projected change in the standard deviation of daily mean temperature, on both wet and dry days. Thus, the harmonics modeling the standard deviation of daily minimum and maximum temperature on wet and dry days can simply be scaled by some factor σd’/σd, where σd is the standard deviation of the daily mean temperature under the current climate, and σd’ is the standard deviation of the daily mean temperature under the climate change projection (Wilks, 1992). Like the daily mean temperature changes, this ratio can be specified as a transient function of time or as a step change.

It should be noted that several unanticipated changes can result from the modifications described above. For instance, modifying the probability of daily precipitation occurrence will change the mean daily temperature (since the temperature parameters depend on whether or not it rains), as well as its variance and autocorrelation (Katz, 1996). See Katz (1996) for additional examples and suggested modifications to overcome these potential problems.

Non-parametric Weather Generators

As described in Part II, most non-parametric and semi-parametric weather generators simulate weather data by resampling historical data. One drawback to this approach is that it does not simulate any values outside of the observed record; it simply re-orders them. Modifications to the simple resampling approach have been used in some stationary studies (Prairie et al., 2006; Leander and Buishand, 2009), as mentioned in Part II, and can be made for climate change studies as well. Steinschneider and Brown (2013) investigate several such methods with their semi-parametric weather generator. Since their generator does have some parameters (specifically, the transition probabilities of a spatially averaged Markov chain model of precipitation occurrence), these can be modified using the methods described by Wilks (1992). For the non-parametric part of the generator, Steinschneider and Brown (2013) modify the resampled data itself using a few different techniques.

The first two methods they explore are common in climate change assessments: applying scaling factors to precipitation data and delta shifts to temperature data. Using the scaling factor method, resampled data for variable i, xi, are simply multiplied by a scaling factor, a, to produce simulated weather under climate change, axi. Using delta shifts, resampled data, xi, are increased (or decreased) by a specified amount, δ, to produce simulated weather under climate change, xi + δ.
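As a minimal sketch, these two adjustments amount to one line of code each; the scaling factor, shift, and data values below are hypothetical placeholders.

import numpy as np

# Scaling factor applied to resampled precipitation and a delta shift applied to
# resampled temperature. The factor, shift, and data are hypothetical placeholders.
precip_resampled = np.array([0.0, 4.2, 11.5, 0.0, 2.8])    # mm
temp_resampled = np.array([21.3, 19.8, 18.4, 22.1, 20.6])  # deg C

a = 1.10     # hypothetical 10% increase in precipitation amounts
delta = 1.5  # hypothetical warming shift (deg C)

precip_scaled = a * precip_resampled
temp_shifted = temp_resampled + delta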

Another more sophisticated method is the quantile mapping approach. This procedure is generally applied to parametric CDFs, but can also be applied to empirical CDFs, as was done by Steinschneider and Brown (2013). Under the quantile mapping approach, historical data of the random variable, X, are assumed to come from some distribution, FX, under the current climate. The CDF of X under climate change can be specified by a different target distribution, FX*. Simulated weather variables xi under current climate conditions can then be mapped to values under the projected climate conditions, xi*, by equating their values to those of the same quantiles in the target distribution, i.e. x_i^* = F_{X^*}^{-1}\left(F_X\left(x_i\right)\right).
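Here is a minimal sketch of empirical quantile mapping in Python. The “historical” and “target” samples below are synthetic Gamma draws used purely for illustration; in practice the historical sample would come from the observed record and the target distribution from the climate change scenario of interest.

import numpy as np

# Empirical quantile mapping: pass a resampled value through the historical
# empirical CDF F_X, then through the inverse of the target CDF F_X*.
# The samples below are synthetic stand-ins, not real weather data.
rng = np.random.default_rng(42)
hist = rng.gamma(0.7, 8.0, size=5000)    # wet-day amounts under the current climate
target = rng.gamma(0.7, 9.5, size=5000)  # target distribution under climate change

def quantile_map(x, hist_sample, target_sample):
    """Map values x from the historical empirical CDF to the target empirical CDF."""
    probs = np.searchsorted(np.sort(hist_sample), x) / len(hist_sample)  # F_X(x)
    probs = np.clip(probs, 0.0, 1.0)
    return np.quantile(target_sample, probs)                             # F_X*^-1(.)

x_resampled = rng.gamma(0.7, 8.0, size=10)  # values drawn by the resampling generator
x_mapped = quantile_map(x_resampled, hist, target)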

While simple, these methods are effective approaches for top-down or bottom-up robustness analyses. Unfortunately, what one often finds from such analyses is that there is a tradeoff between meeting robustness criteria in one objective and sacrificing performance in another, a sacrifice termed regret. Fortunately, this tradeoff can occasionally be avoided if there is an informative climate signal that can be used to inform water management policies. In particular, skillful seasonal climate forecasts can be used to condition water management plans for the upcoming season. In order to evaluate these conditioned plans, one can generate synthetic weather consistent with such forecasts by again modifying the parameters or resampling schemes of a stochastic weather generator. Methods that can be used to modify weather generators consistent with seasonal climate forecasts will be discussed in my final blog post on synthetic weather generators.

Works Cited

Easterling, D. R., Horton, B., Jones, P. D., Peterson, T. C., Karl, T. R., Parker, D. E., et al. (1997). Maximum and minimum temperature trends for the globe. Science, 277(5324), 364-367.

Katz, R. W. (1983). Statistical procedures for making inferences about precipitation changes simulated by an atmospheric general circulation model. Journal of the Atmospheric Sciences, 40(9), 2193-2201.

Katz, R. W. (1996). Use of conditional stochastic models to generate climate change scenarios. Climatic Change, 32(3), 237-255.

Leander, R., & Buishand, T. A. (2009). A daily weather generator based on a two-stage resampling algorithm. Journal of Hydrology, 374, 185-195.

Mearns, L. O., Rosenzweig, C., & Goldberg, R. (1996). The effect of changes in daily and interannual climatic variability on CERES-Wheat: a sensitivity study. Climatic Change, 32, 257-292.

Prairie, J. R., Rajagopalan, B., Fulp, T. J., & Zagona, E. A. (2006). Modified K-NN model for stochastic streamflow simulation. Journal of Hydrologic Engineering, 11(4), 371-378.

Richardson, C. W. (1981). Stochastic simulation of daily precipitation, temperature and solar radiation. Water Resources Research, 17, 182-190.

Riha, S. J., Wilks, D. S., & Simoens, P. (1996). Impact of temperature and precipitation variability on crop model predictions. Climatic Change, 32, 293-311.

Steinschneider, S., & Brown, C. (2013). A semiparametric multivariate, multisite weather generator with low-frequency variability for use in climate risk assessments. Water Resources Research, 49, 7205-7220.

Vose, R. S., Easterling, D. R., & Gleason, B. (2005). Maximum and minimum temperature trends for the globe: An update through 2004. Geophysical Research Letters, 32(23).

Wilks, D. S. (1992). Adapting stochastic weather generation algorithms for climate change studies. Climatic Change, 22(1), 67-84.

Woodbury, J., & Shoemaker, C. A. (2012). Stochastic assessment of long-term impacts of phosphorus management options on sustainability with and without climate change. Journal of Water Resources Planning and Management, 139(5), 512-519.