# A Deeper Dive into Principal Component Analysis

This post is meant to be a continuation of Dave Gold’s introductory post on Principal Component Analysis, which is an excellent explanation on how to conduct a PCA and visualize the principal components. The goal of this post is to elaborate on how to proceed after you have conducted a PCA and to address some common questions and concerns associated with the method.

### Performing a PCA in R

Often times, you will perform a PCA on large datasets that contain many variables and/or many observations. One such dataset that will be used as an example is the Living Blended Drought Atlas (LBDA) which is a reconstruction of the Palmer Drought Severity Index (PDSI) over the contiguous United States from 1473-2005. This dataset contains 4968 columns, or variables, each of which is a grid cell over the U.S., and 533 rows, each of which is a yearly observation. We will call this dataset, X. In matrix notation, we will denote the PCA analysis formula as:

U=XW    (1)

where X is the dataset, W is the weighting matrix, whose columns are the key patterns in the data, and U is the matrix whose columns are the resulting principal components (PCs). You can perform a PCA on this dataset with a single function in R, prcomp.

 PCA=prcomp(X, scale=TRUE/FALSE)

The first input into the function is your data matrix and the second input is used to declare if your dataset should be scaled to have a unit variance before the PCA is conducted. There are various other inputs into the function, listed here, that can be included if necessary.

prcomp returns three sets of results in a list that we have called “PCA”:

1. sdev: the standard deviations of the principal components (if you square them, you get the eigenvalues of the covariance/correlation matrix)
2. rotation: the loading matrix whose columns are the eigenvectors (W in the equation above)
3. x: the rotated data or your PCs (The columns of U in the equation above)

And that’s it! You have the results of the PCA. Now comes the more difficult part: interpreting them.

### How do I choose how many PCs to keep?

The dimensions associated with equation (1) are as follows:

U=XW

(nxk)=(nxk)(kxk)

If the number of observations is much larger than the number of variables in the dataset, i.e. n>>k, then the PCA will return k distinct eigenvectors. In our case, since n is smaller than our number of variables, the most non-zero eigenvectors that the PCA will return is n. Either way, we have many supposedly distinct patterns. How do we decide how many of those patterns to keep?

The answer is not always clear and most often subjective and case-dependent. One common tool used is a scree plot or an eigenvalue spectrum.

Scree Plot

Each column of our W matrix is a distinct, independent pattern, also called an empirical orthogonal function (EOF). Each EOF is responsible for explaining some amount of variance in the dataset. A scree plot, shown in Figure 1, allows you visualize this variance breakdown.

Figure 1: Scree Plot

On the x-axis of the scree plot is the EOF (we’ve chosen to keep 10) and on the y-axis is the total variance explained by that EOF. Each variance is equivalent to the eigenvalue associated with its respective eigenvector. You can find the eigenvalues/variances by squaring of the results of sdev that is returned by prcomp.

Generally speaking, one can look for the “elbow” of the scree plot to determine at what point to truncate the EOFs and retain up to the EOF before the elbow. At about the 5th EOF, the graph starts to level off and all subsequent EOFs start to contribute about the same amount of variance. Therefore, the elbow of the graph is located at about the 5th EOF and you will retain the first 4 EOFs.

North’s Rule of Thumb

North’s Rule of Thumb is more of a precise way of truncating and involves creating confidence intervals around your estimates of the variance. The rule states that you should truncated EOFs only when the confidence intervals of the variances start to intersect. At this point, the eigenvectors are considered too close to be interpretable and spacing might be due to sampling error rather than a clear distinction between the variances [1].

### Rotated EOFs

At some point, your EOFs might start to exhibit patterns that can be hard to interpret or attribute to a physical phenomenon. It is not uncommon for these types of patterns to result from pure noise in your data, especially if you are analyzing latter EOFs that explain a very small amount of the variance [2].

Rotating EOFs is a practice that is done to simplify the patterns obtained in EOFs and make them more interpretable. A varimax orthogonal rotation can be used to determine an optimum rotation matrix that maximizes the variance in the columns of W. The variance of the columns is maximized by driving some of the loadings to zero and trying to maximize the values of other loadings. In R, this is done using the varimax function.

 my.varimax=varimax(PCA$rotation[,1:10])  In the above command, my input is the first 10 EOFs from the original weighting matrix. The result, my.varimax, is a list with the following components: 1. loadings: the resulting rotated loading matrix 2. rotmat: the rotation matrix The new loading matrix, Wrot , is still orthogonal after the rotation, and the eigenvectors are, therefore, still orthonormal. However, multiplication of the rotated loading matrix by the original dataset, X, to obtain a new U, results in principal components are not guaranteed to be independent. This can be seen through further inspection of the correlation matrix associated with U. Unfortunately, this is a tradeoff associated with obtaining EOFs that are simpler and easier to interpret. Sources: [1] http://yyy.rsmas.miami.edu/users/bmapes/teaching/MPO581_2011/EOF_chapter_DelSole.pdf [2] Hannachi, A., Jolliffe, I.T., and Stephenson, D.B. (2007), Empirical orthogonal functions and related techniques in atmospheric science: A review, International Journal of Climatology, 27, 1119-1152. *All information or figures not specifically cited came from class notes and homework from Dr. Scott Steinschneider’s class, BEE 6300: Environmental Statistics # A visual introduction to data compression through Principle Component Analysis Principle Component Analysis (PCA) is a powerful tool that can be used to create parsimonious representations of a multivariate data set. In this post I’ll code up an example from Dan Wilks’ book Statistical Methods in the Atmospheric Sciences to visually illustrate the PCA process. All code can be found at the bottom of this post. As with many of the examples in Dr. Wilks’ excellent textbook, we’ll be looking at minimum temperature data from Ithaca and Canandaigua, New York (if anyone is interested, here is the distance between the two towns). Figure 1 is a scatter plot of the minimum temperature anomalies at each location for the month of January 1987. Figure 1: Minimum temperature anomalies in Ithaca and Canandaigua, New York in January 1987 As you can observe from Figure 1, the two data sets are highly correlated, in fact, they have a Pearson correlation coefficient of 0.924. Through PCA, we can identify the primary mode of variability within this data set (its largest “principle component”) and use it to create a single variable which describes the majority of variation in both locations. Let x define the matrix of our minimum temperature anomalies in both locations. The eigenvectors (E) of the covariance matrix of x describe the primary modes variability within the data set. PCA uses these eigenvectors to create a new matrix, u, whose columns contain the principle components of the variability in x. $u = xE$ Each element in u is a linear combination of the original data, with eigenvectors in E serving as a kind of weighting for each data point. The first column of u corresponds to the eigenvector associated with the largest eigenvalue of the covariance matrix. Each successive column of u represents a different level of variability within the data set, with u1 describing the direction of highest variability, u2 describing the direction of the second highest variability and so on and so forth. The transformation resulting from PCA can be visualized as a rotation of the coordinate system (or change of basis) for the data set, this rotation is shown in Figure 2. Figure 2: Geometric interpretation of PCA As can be observed in Figure 2, each data point can now be described by its location along the newly rotated axes which correspond to its corresponding value in the newly created matrix u. The point (16, 17.8), highlighted in Figure 2, can now be described as (23, 6.6) meaning that it is 23 units away from the origin in the direction of highest variability and 6.6 in the direction of second highest variability. As shown in Figure 2, the question of “how different from the mean” each data point is can mostly be answered by looking at its corresponding u1 value. Once transformed, the original data can be recovered through a process known as synthesis. Synthesis can be thought of as PCA in reverse. The elements in the original data set x, can be approximated using the eigenvalues of the covariance matrix and the first principle component, u1. $\tilde{x} = \tilde{u}\tilde{E}^T$ Where: $\tilde{x}\hspace{.1cm} is\hspace{.1cm} the\hspace{.1cm} reconstructed\hspace{.1cm} data\hspace{.1cm} set$ $\tilde{u}\hspace{.1cm} is\hspace{.1cm} the\hspace{.1cm} PCs\hspace{.1cm} used \hspace{.1cm} for \hspace{.1cm} reconstruction\hspace{.1cm} (in\hspace{.1cm} our\hspace{.1cm} case\hspace{.1cm} the\hspace{.1cm} first\hspace{.1cm} column)$ $\tilde{E}\hspace{.1cm} is \hspace{.1cm} the \hspace{.1cm} eigenvector\hspace{.1cm} of \hspace{.1cm} the \hspace{.1cm} PCs \hspace{.1cm} used$ For our data set, these reconstructions seem to work quite well, as can be observed in Figure 3. Data compression through PCA can be a useful alternative tolerant methods for dealing with multicollinearity, which I discussed in my previous post. Rather than running a constrained regression, one can simply compress the data set to eliminate sources of multicollinearity. PCA can also be a helpful tool for identifying patterns within your data set or simply creating more parsimonious representations of a complex set of data. Matlab code used to create the above plots can be found below. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Ithaca_Canandagua_PCA % By: D. Gold % Created: 3/20/17 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % This script will perform Principle Component analysis on minimum % temperature data from Ithaca and Canadaigua in January, 1987 provided in % Appendix A of Wilks (2011). It will then estimate minimum temperature % values of both locations using the first principle component. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% create data sets clear all % data from appendix Wilks (2011) Appendix A.1 Ith = [19, 25, 22, -1, 4, 14, 21, 22, 23, 27, 29, 25, 29, 15, 29, 24, 0,... 2, 26, 17, 19, 9, 20, -6, -13, -13, -11, -4, -4, 11, 23]'; Can = [28, 28, 26, 19, 16, 24, 26, 24, 24, 29, 29, 27, 31, 26, 38, 23,... 13, 14, 28, 19, 19, 17, 22, 2, 4, 5, 7, 8, 14, 14, 23]'; %% center the data, plot temperature anomalies, calculate covariance and eigs % center the data x(:,1) = Ith - mean(Ith); x(:,2) = Can - mean(Can); % plot anomalies figure scatter(x(:,1),x(:,2),'Filled') xlabel('Ithaca min temp anomaly ({\circ}F)') ylabel('Canandagua min temp anomaly ({\circ}F)') % calculate covariance matrix and it's corresponding Eigenvalues & vectors S = cov(x(:,1),x(:,2)); [E, Lambda]=eigs(S); % Identify maximum eigenvalue, it's column will be the first eigenvector max_lambda = find(max(Lambda)); % index of maximum eigenvalue in Lambda idx = max_lambda(2); % column of max eigenvalue %% PCA U = x*E(:,idx); %% synthesis x_syn = E(:,idx)*U'; % reconstructed values of x % plot the reconstructed values against the original data figure subplot(2,1,1) plot(1:31,x(:,1) ,1:31, x_syn(1,:),'--') xlim([1 31]) xlabel('Time (days)') ylabel('Ithaca min. temp. anomalies ({\circ}F)') legend('Original', 'Reconstruction') subplot(2,1,2) plot(1:31, x(:,2),1:31, x_syn(2,:)','--') xlim([1 31]) xlabel('Time (days)') ylabel('Canadaigua min. temp. anomalies ({\circ}F)') legend('Original', 'Reconstruction') Sources: Wilks, D. S. (2011). Statistical methods in the atmospheric sciences. Amsterdam: Elsevier Academic Press. # Synthetic streamflow generation A recent research focus of our group has been the development and use of synthetic streamflow generators. There are many tools one might use to generate synthetic streamflows and it may not be obvious which is right for a specific application or what the inherent limitations of each method are. More fundamentally, it may not be obvious why it is desirable to generate synthetic streamflows in the first place. This will be the first in a series of blog posts on the synthetic streamflow generators in which I hope to sketch out the various categories of generation methods and their appropriate use as I see it. In this first post we’ll focus on the motivation and history behind the development of synthetic streamflow generators and broadly categorize them. ### Why should we use synthetic hydrology? The most obvious reason to use synthetic hydrology is if there is little or no data for your system (see Lamontagne, 2015). Another obvious reason is if you are trying to evaluate the effect of hydrologic non-stationarity on your system (Herman et al. 2015; Borgomeo et al. 2015). In that case you could use synthetic methods to generate flows reflecting a shift in hydrologic regime. But are there other reasons to use synthetic hydrology? In water resources systems analysis it is common practice to evaluate the efficacy of management or planning strategies by simulating system performance over the historical record, or over some critical period. In this approach, new strategies are evaluated by asking the question: How well would we have done with your new strategy? This may be an appealing approach, especially if some event was particularly traumatic to your system. But is this a robust way of evaluating alternative strategies? It’s important to remember that any hydrologic record, no matter how long, is only a single realization of a stochastic process. Importantly, drought and flood events emerge as the result of specific sequences of events, unlikely to be repeated. Furthermore, there is a 50% chance that the worst flood or drought in an N year record will be exceeded in the next N years. Is it well advised to tailor our strategies to past circumstances that will likely never be repeated and will as likely as not be exceeded? As Lettenmaier et al. [1987] reminds us “Little is certain about the future except that it will be unlike the past.” Even under stationarity and even with long hydrologic records, the use of synthetic streamflow can improve the efficacy of planning and management strategies by exposing them to larger and more diverse flood and drought than those in the record (Loucks et al. 1981; Vogel and Stedinger, 1988; Loucks et al. 2005). Figure 7.12 from Loucks et al. 2005 shows a typical experimental set-up using synthetic hydrology with a simulation model. Often our group will wrap an optimization model like Borg around this set up, where the system design/operating policy (bottom of the figure) are the decision variables, and the system performance (right of the figure) are the objective(s). (Loucks et al. 2005) ### What are the types of generators? Many synthetic streamflow generation techniques have been proposed since the early 1960s. It can be difficult for a researcher or practitioner to know which method is best suited to the problem at hand. Thus, we’ll start with a very broad characterization of what is out there, then proceed to some history. Broadly speaking there are two approaches to generating synthetic hydrology: indirect and direct. The indirect approach generates streamflow by synthetically generating the forcings to a hydrologic model. For instance one might generate precipitation and temperature series and input them to a hydrologic model of a basin (e.g. Steinschneider et al. 2014). In contrast, direct methods use statistical techniques to generate streamflow timeseries directly. The direct approach is generally easier to apply and more parsimonious because it does not require the selection, calibration, and validation of a separate hydrologic model (Najafi et al. 2011). On the other hand, the indirect approach may be desirable. Climate projections from GCMs often include temperature or precipitation changes, but may not describe hydrologic shifts at a resolution or precision that is useful. In other cases, profound regime shifts may be difficult to represent with statistical models and may require process-driven models, thus necessitating the indirect approach. Julie’s earlier series focused on indirect approaches, so we’ll focus on the direct approach. Regardless of the approach many of the methods are same. In general generator methods can be divided into two categories: parametric and non-parametricParametric methods rely on a hypothesized statistical model of streamflow whose parameters are selected to achieve a desired result (Stedinger and Taylor, 1982a). In contrast non-parametric methods do not make strong structural assumptions about the processes generating the streamflow, but rather rely on re-sampling from the hydrologic record in some way (Lall, 1995). Some methods combine parametric and non-parametric techniques, which we’ll refer to as semi-parametric (Herman et al. 2015). Both parametric and non-parametric methods have advantages and disadvantages. Parametric methods are often parsimonious, and often have analytical forms that allow easy parameter manipulation to reflect non-stationarity. However, there can be concern that the underlying statistical models may not reflect the hydrologic reality well (Sharma et al. 1997). Furthermore, in multi-dimensional, multi-scale problems the proliferation of parameters can make parametric models intractable (Grygier and Stedinger, 1988). Extensive work has been done to confront both challenges, but they may lead a researcher to adopt a non-parametric method instead. Because many non-parametric methods ‘re-sample’ flows from a record, realism is not generally a concern, and most re-sampling schemes are computationally straight forward (relatively speaking). On the other hand, manipulating synthetic flows to reflect non-stationarity may not be as straightforward as a simple parameter change, though methods have been suggested (Herman et al. 2015Borgomeo et al. 2015). More fundamentally, because non-parametric methods rely so heavily on the data, they require sufficiently long records to ensure there is enough hydrologic variability to sample. Short records can be a concern for parametric methods as well, though parametric uncertainty can be explicitly considered in those methods (Stedinger and Taylor, 1982b). Of course, parametric methods also have structural uncertainty that non-parametric models largely avoid by not assuming an explicit statistical model. In the coming posts we’ll dig into the nuances of the different methods in greater detail. ### A historical perspective The first use of synthetic flow generation seems to have been by Hazen [1914]. That work attempted to quantify the reliability of a water supply by aggregating the streamflow records of local streams into a 300-year ‘synthetic record.’ Of course the problem with this is that the cross-correlation between concurrent flows rendered the effective record length much less than the nominal 300 years. Next Barnes [1954] generated 1,000 years of streamflow for a basin in Australia by drawing random flows from a normal distribution with mean and variance equal to the sample estimates from the observed record. That work was extended by researchers from the Harvard Water Program to account for autocorrelation of monthly flows (Maass et al., 1962; Thomas and Fiering, 1962). Later work also considered the use of non-normal distributions (Fiering, 1967), and the generation of correlated concurrent flows at multiple sites (Beard, 1965; Matalas, 1967). Those early methods relied on first-order autoregressive models that regressed flows in the current period on the flows of the previous period (see Loucks et al.’s Figure 7.13 below). Box and Jenkins [1970] extended those methods to autoregressive models of arbitrary order, moving average models of arbitrary order, and autoregressive-moving average models of arbitrary order. Those models were the focus of extensive research over the course of the 1970s and 1980s and underpin many of the parametric generators that are widely used in hydrology today (see Salas et al. 1980; Grygier and Stedinger, 1990; Salas, 1993; Loucks et al. 2005). (Loucks et al. 2005) By the mid-1990s, non-parametric methods began to gain popularity (Lall, 1995). While much of this work has its roots in earlier work from the 1970s and 1980s (Yakowitz, 1973, 1979, 1985; Schuster and Yakowitz, 1979; Yakowitz and Karlsson, 1987; Karlson and Yakowitz, 1987), improvements in computing and the availability of large data sets meant that by the mid-1990s non-parametric methods were feasible (Lall and Sharma, 1996). Early examples of non-parametric methods include block bootstrapping (Vogel and Shallcross, 1996), k-nearest neighbor (Lall and Sharma, 1996), and kernel density methods (Sharma et al. 1997). Since that time extensive research has made improvement to these methods, often by incorporating parametric elements. For instance, Srinivas and Srinivasan (2001, 2005, and 2006) develop a hybrid autoregressive-block bootstrapping method designed to improve the bias in lagged correlation and to generate flows other than the historical, for multiple sites and multiple seasons. K-nearest neighbor methods have also been the focus of extensive research (Rajagopalan and Lall, 1999; Harrold et al. 2003; Yates et al. 2003; Sharif and Burn, 2007; Mehortra and Sharma, 2006; Prairie et al. 2006; Lee et al. 2010, Salas and Lee, 2010, Nowak et al., 2010), including recent work by our group (Giuliani et al. 2014). Emerging work focuses on stochastic streamflow generation using copulas [Lee and Salas, 2011; Fan et al. 2016], entropy theory bootstrapping [Srivastav and Simonovic, 2014], and wavelets [Kwon et al. 2007; Erkyihun et al., 2016], among other methods. In the following posts I’ll address different challenges in stochastic generation [e.g. long-term persistence, parametric uncertainty, multi-site generation, seasonality, etc.] and the relative strengths and shortcomings of the various methods for addressing them. ### Works Cited Barnes, F. B., Storage required for a city water supply, J. Inst. Eng. Australia 26(9), 198-203, 1954. Beard, L. R., Use of interrelated records to simulate streamflow, J. Hydrol. Div., ASCE 91(HY5), 13-22, 1965. Borgomeo, E., Farmer, C. L., and Hall, J. W. (2015). “Numerical rivers: A synthetic streamflow generator for water resources vulnerability assessments.” Water Resour. Res., 51(7), 5382–5405. Y.R. Fan, W.W. Huang, G.H. Huang, Y.P. Li, K. Huang, Z. Li, Hydrologic risk analysis in the Yangtze River basin through coupling Gaussian mixtures into copulas, Advances in Water Resources, Volume 88, February 2016, Pages 170-185. Fiering, M.B, Streamflow Synthesis, Harvard University Press, Cambridge, Mass., 1967. Giuliani, M., J. D. Herman, A. Castelletti, and P. Reed (2014), Many-objective reservoir policy identification and refinement to reduce policy inertia and myopia in water management, Water Resour. Res., 50, 3355–3377, doi:10.1002/2013WR014700. Grygier, J.C., and J.R. Stedinger, Condensed Disaggregation Procedures and Conservation Corrections for Stochastic Hydrology, Water Resour. Res. 24(10), 1574-1584, 1988. Grygier, J.C., and J.R. Stedinger, SPIGOT Technical Description, Version 2.6, 1990. Harrold, T. I., Sharma, A., and Sheather, S. J. (2003). “A nonparametric model for stochastic generation of daily rainfall amounts.” Water Resour. Res., 39(12), 1343. Hazen, A., Storage to be provided in impounding reservoirs for municipal water systems, Trans. Am. Soc. Civ. Eng. 77, 1539, 1914. Herman, J.D., H.B. Zeff, J.R. Lamontagne, P.M. Reed, and G. Characklis (2016), Synthetic Drought Scenario Generation to Support Bottom-Up Water Supply Vulnerability Assessments, Journal of Water Resources Planning & Management, doi: 10.1061/(ASCE)WR.1943-5452.0000701. Karlsson, M., and S. Yakowitz, Nearest-Neighbor methods for nonparametric rainfall-runoff forecasting, Water Resour. Res., 23, 1300-1308, 1987. Kwon, H.-H., U. Lall, and A. F. Khalil (2007), Stochastic simulation model for nonstationary time series using an autoregressive wavelet decomposition: Applications to rainfall and temperature, Water Resour. Res., 43, W05407, doi:10.1029/2006WR005258. Lall, U., Recent advances in nonparametric function estimation: Hydraulic applications, U.S. Natl. Rep. Int. Union Geod. Geophys. 1991- 1994, Rev. Geophys., 33, 1093, 1995. Lall, U., and A. Sharma (1996), A nearest neighbor bootstrap for resampling hydrologic time series, Water Resour. Res. 32(3), pp. 679-693. Lamontagne, J.R. 2015,Representation of Uncertainty and Corridor DP for Hydropower 272 Optimization, PhD edn, Cornell University, Ithaca, NY. Lee, T., J. D. Salas, and J. Prairie (2010), An enhanced nonparametric streamflow disaggregation model with genetic algorithm, Water Resour. Res., 46, W08545, doi:10.1029/2009WR007761. Lee, T., and J. Salas (2011), Copula-based stochastic simulation of hydrological data applied to Nile River flows, Hydrol. Res., 42(4), 318–330. Lettenmaier, D. P., K. M. Latham, R. N. Palmer, J. R. Lund and S. J. Burges, Strategies for coping with drought Part II: Planning techniques for planning and reliability assessment, EPRI P-5201, Final Report Project 2194-1, June 1987. Loucks, D.P., Stedinger, J.R. & Haith, D.A. 1981, Water Resources Systems Planning and Analysis, 1st edn, Prentice-Hall, Englewood Cliffs, N.J. Loucks, D.P. et al. 2005, Water Resources Systems Planning and Management: An Introduction to Methods, Models and Applications, UNESCO, Delft, The Netherlands. Maass, A., M. M. Hufschmidt, R. Dorfman, H. A. Thomas, Jr., S. A. Marglin and G. M. Fair, Design of Water Resource Systems, Harvard University Press, Cambridge, Mass., 1962. Matalas, N. C., Mathematical assessment of synthetic hydrology, Water Resour. Res. 3(4), 937-945, 1967. Najafi, M. R., Moradkhani, H., and Jung, I. W. (2011). “Assessing the uncertainties of hydrologic model selection in climate change impact studies.” Hydrol. Process., 25(18), 2814–2826. Nowak, K., J. Prairie, B. Rajagopalan, and U. Lall (2010), A nonparametric stochastic approach for multisite disaggregation of annual to daily streamﬂow, Water Resour. Res., 46, W08529, doi:10.1029/2009WR008530. Nowak, K., J. Prairie, B. Rajagopalan, and U. Lall (2010), A nonparametric stochastic approach for multisite disaggregation of annual to daily streamﬂow, Water Resour. Res., 46, W08529, doi:10.1029/2009WR008530. Nowak, K., J. Prairie, B. Rajagopalan, and U. Lall (2010), A nonparametric stochastic approach for multisite disaggregation of annual to daily streamﬂow, Water Resour. Res., 46, W08529, doi:10.1029/2009WR008530. Nowak, K., J. Prairie, B. Rajagopalan, and U. Lall (2010), A nonparametric stochastic approach for multisite disaggregation of annual to daily streamflow, Water Resour. Res., 46, W08529, doi:10.1029/2009WR008530. Prairie, J. R., Rajagopalan, B., Fulp, T. J., and Zagona, E. A. (2006). “Modified K-NN model for stochastic streamflow simulation.” J. Hydrol. Eng., 11(4), 371–378. Rajagopalan, B., and Lall, U. (1999). “A k-nearest-neighbor simulator for daily precipitation and other weather variables.” Water Resour. Res., 35(10), 3089–3101. Salas, J. D., J. W. Deller, V. Yevjevich and W. L. Lane, Applied Modeling of Hydrologic Time Series, Water Resources Publications, Littleton, Colo., 1980. Salas, J.D., 1993, Analysis and Modeling of Hydrologic Time Series, Chapter 19 (72 p.) in The McGraw Hill Handbook of Hydrology, D.R. Maidment, Editor. Salas, J.D., T. Lee. (2010). Nonparametric Simulation of Single-Site Seasonal Streamflow, J. Hydrol. Eng., 15(4), 284-296. Schuster, E., and S. Yakowitz, Contributions to the theory of nonparametric regression, with application to system identification, Ann. Stat., 7, 139-149, 1979. Sharif, M., and Burn, D. H. (2007). “Improved K-nearest neighbor weather generating model.” J. Hydrol. Eng., 12(1), 42–51. Sharma, A., Tarboton, D. G., and Lall, U., 1997. “Streamflow simulation: A nonparametric approach.” Water Resour. Res., 33(2), 291–308. Srinivas, V. V., and Srinivasan, K. (2001). “A hybrid stochastic model for multiseason streamflow simulation.” Water Resour. Res., 37(10), 2537–2549. Srinivas, V. V., and Srinivasan, K. (2005). “Hybrid moving block bootstrap for stochastic simulation of multi-site multi-season streamflows.” J. Hydrol., 302(1–4), 307–330. Srinivas, V. V., and Srinivasan, K. (2006). “Hybrid matched-block bootstrap for stochastic simulation of multiseason streamflows.” J. Hydrol., 329(1–2), 1–15. Roshan K. Srivastav, Slobodan P. Simonovic, An analytical procedure for multi-site, multi-season streamflow generation using maximum entropy bootstrapping, Environmental Modelling & Software, Volume 59, September 2014a, Pages 59-75. Stedinger, J. R. and M. R. Taylor, Sythetic streamflow generation, Part 1. Model verification and validation, Water Resour. Res. 18(4), 909-918, 1982a. Stedinger, J. R. and M. R. Taylor, Sythetic streamflow generation, Part 2. Parameter uncertainty,Water Resour. Res. 18(4), 919-924, 1982b. Steinschneider, S., Wi, S., and Brown, C. (2014). “The integrated effects of climate and hydrologic uncertainty on future flood risk assessments.” Hydrol. Process., 29(12), 2823–2839. Thomas, H. A. and M. B. Fiering, Mathematical synthesis of streamflow sequences for the analysis of river basins by simulation, in Design of Water Resource Systems, by A. Maass, M. Hufschmidt, R. Dorfman, H. A. Thomas, Jr., S. A. Marglin and G. M. Fair, Harvard University Press, Cambridge, Mass., 1962. Vogel, R.M., and J.R. Stedinger, The value of stochastic streamflow models in over-year reservoir design applications, Water Resour. Res. 24(9), 1483-90, 1988. Vogel, R. M., and A. L. Shallcross (1996), The moving block bootstrap versus parametric time series models, Water Resour. Res., 32(6), 1875–1882. Yakowitz, S., A stochastic model for daily river flows in an arid region, Water Resour. Res., 9, 1271-1285, 1973. Yakowitz, S., Nonparametric estimation of markov transition functions, Ann. Stat., 7, 671-679, 1979. Yakowitz, S. J., Nonparametric density estimation, prediction, and regression for markov sequences J. Am. Stat. Assoc., 80, 215-221, 1985. Yakowitz, S., and M. Karlsson, Nearest-neighbor methods with application to rainfall/runoff prediction, in Stochastic Hydrology, edited by J. B. Macneil and G. J. Humphries, pp. 149-160, D. Reidel, Norwell, Mass., 1987. Yates, D., Gangopadhyay, S., Rajagopalan, B., and Strzepek, K. (2003). “A technique for generating regional climate scenarios using a nearest-neighbor algorithm.” Water Resour. Res., 39(7), 1199. # Determining Whether Differences in Diagnostic Results are Statistically Significant (Part 2) This blog post follows part 1 of this series and provides a tutorial for the statistical analysis I performed on my diagnostic results. After completing the briefly described diagnostic steps, I had 900 files that looked like this. ./SOW4/metrics/Borg_myLake4ObjStoch_1.metrics Threshold: 0.75 Best: 0.9254680538114082 Attainment: 0.328 Controllability: 0.0 Efficiency: 1.0 There were 900 as I evaluated 3 metrics for each of 50 seeds for 6 six algorithms. To evaluate statistics, I first had to read in the information of interest from the files. The following snippet of Matlab code is useful for that. clc; clear all; close all; algorithms = {'Borg', 'eMOEA', 'eNSGAII', 'NSGAII', 'MOEAD', 'GDE3'}; seeds = (1:1:50); metrics = {'GenDist'; 'EpsInd'; 'Hypervolume';}; % work = sprintf('./working_directory/'); %specify the working directory problem = 'Problem'; %Loop through metrics for i=1:length(metrics) %Loop through algorithms for j=1:length(algorithms) %Loop through seeds for k=1:length(seeds) %open and read files filename = ['./' metrics{i} '_' num2str(75) '_' algorithms{j} '_' num2str(seeds(k)) '.txt']; fh = fopen(filename); if(fh == -1) disp('Error opening analysisFile!'); end values = textscan(fh, '%*s %f', 5, 'Headerlines',1); fclose(fh); values = values{1}; threshold(k,j,i) = values(1); best(k,j,i) = values(2); if strcmp(metrics{i},'Hypervolume'); best(k,j,i) = best(k,j,i)/(threshold(k,j,i)/(75/100)); end; %Normalize the best Hypervolume value to be between 0 and 1 attainment(k,j,i) = values(3); controllability(k,j,i) = values(4); efficiency(k,j,i) = values(5); end end end  The above code loops through all metrics, algorithms, and seeds to load and read from the corresponding files. Each of the numerical values was stored in an appropriately named 3 dimensional array. For this tutorial, I am going to focus on a statistical analysis of the probability of attainment for the Hypervolume metric across all algorithms. The values of interest are stored in the attainment array. Below is code that can be used for the Kruskall Wallis nonparametric one-way ANOVA. P = kruskalwallis(attainment(:,:,3),algorithms,'off');  The above code performs the kruskalwallis test on the matrix of 50 rows and 6 columns storing the Hypervolume attainment values and determines if there is a difference in performance between any 2 of the groups (columns). In this case the 6 groups are the algorithms. The ‘off’ flag indicates whether you want a graphical display. More information can be found using Matlab help as you probably already know. If this returns a small p-value indicating that there is a difference between at least 2 of your algorithms, it is worthwhile to continue to the Mann Whitney U test. I used the following code to loop through the possible algorithm pairs and determine if there was a statistical difference in the probability of attaining the 75% threshold for hypervolume. I wasn’t very concerned about the P value so it is constantly overwritten, but I stored the binary indicating whether or not to reject the null hypothesis in a matrix. In this case, I used an upper tail test as algorithms with higher probabilities of attaining a threshold are considered to perform better. for i = 1:length(algorithms) for j = 1:length(algorithms) [P3, Mann_Whitney_U_Hyp(i,j)] = ranksum(attainment(:,i,3),... attainment(:,j,3),... 'alpha', 0.05, 'tail', 'right'); end end  In the end, you get a table of zeros and ones like that below. A 1 indicates that the null hypothesis should be rejected at the 5% level of significance. As the help function indicates that a 1 indicates that the null hypothesis that two medians are the same can be rejected and the alternative hypothesis tested in this case was that the median of the first group is higher, this table was used to rank algorithm performance. As the loop went row then column, Borg is considered the best performing algorithm in this case as its median is higher than the median for all other algorithms at the 5% level of significance. The rest of my ranking goes NSGAII, eNSGAII, eMOEA, GDE3, and MOEA/D. Hopefully, this is helpful to at least one person in the future. Feel free to use the snippets in your own code to automate the whole process. 🙂 # Determining Whether Differences in Diagnostic Results are Statistically Significant (Part 1) While working on my diagnostic assessment of the difficulty of the “Lake Problem”, I decided that I would like to check that the differences in performance I observed were statistically significant. This part of the blog post is a broad overview of the steps I took before the statistical analysis and that analysis. Part 2 includes a tutorial. Looking through prior group papers, I drew initial inspiration from the following articles by Dave although I think my final method differs slightly. Hadka, D. and P. Reed. Diagnostic Assessment of Search Controls and Failure Modes in Many-Objective Evolutionary Optimization. Evolutionary Computation, 20(3):423-452, 2012. Hadka, D. and P. Reed. Borg: An Auto-Adaptive Many-Objective Evolutionary Computing Framework. Evolutionary Computation, 21(2):231-259, 2013. Below is a short summary of the steps preceding my statistical analysis with links to group resources for those who may need more background. For everyone else, feel free to skip ahead. Before I could determine the statistical significance of my results, I needed to perform a comparative study of algorithm performance. I got help from sections 8.1 to 8.8 and section 8.10 of the MOEAFramework manual, which can be found at the MOEAFramework website to generate approximate fronts for the problem for multiple random seeds (I used 50) using algorithms found within the MOEAFramework, calculate metrics for each seed across all parameterizations, and determine the best value for each seed across all parameterizations and the probability of attaining a specified threshold (I used 75%) for each seed. I skipped section 8.9 and performed the analysis described in 8.10 on metrics for each seed’s metrics as many samples are necessary to perform statistics on algorithm performance. Please note, I also included the Borg evolutionary algorithm in this study, which requires optimization outside of the framework. All other steps for Borg can be performed the same as for the MOEAFramework algorithms. In this case, I used Latin Hypercube Sampling to generate 500 samples of the feasible parameter space for each algorithm. I chose parameter ranges from Table 1 of the following paper. Reed, P., et al. Evolutionary Multiobjective Optimization in Water Resources: The Past, Present, and Future. (Editor Invited Submission to the 35th Anniversary Special Issue), Advances in Water Resources, 51:438-456, 2013. Here is the shell script I used for calculating the best attained values and probability of attaining the 75% threshold for all algorithms and all seeds. NSEEDS=50 METHOD=Latin PROBLEM=myLake4ObjStoch ALGORITHMS=( NSGAII GDE3 eNSGAII MOEAD eMOEA Borg) SEEDS=$(seq 1 ${NSEEDS}) JAVA_ARGS="-cp MOEAFramework-2.1-Demo.jar" set -e # Perform the analysis echo "" echo "Hypervolume Analysis:" for ALGORITHM in${ALGORITHMS[@]}
do
for SEED in ${SEEDS} do NAME=${ALGORITHM}_${PROBLEM}_${SEED}_Hypervolume

PBS="\
#PBS -N ${NAME}\n\ #PBS -l nodes=1\n\ #PBS -l walltime=96:00:00\n\ #PBS -o output/${NAME}\n\
#PBS -e error/${NAME}\n\ cd \$PBS_O_WORKDIR\n\
java ${JAVA_ARGS} org.moeaframework.analysis.sensitivity.Analysis \ -p${ALGORITHM}_params.txt -i ${ALGORITHM}_${METHOD} -t 0.75 -o ./SOW6/Hypervolume/${ALGORITHM}_${PROBLEM}_${SEED}_Hypervolume.analysis \ -c -e -m 0 --hypervolume 0.7986 ./SOW6/metrics/${ALGORITHM}_${PROBLEM}_${SEED}.metrics"
echo -e \$PBS | qsub
done

done

The above shell script is written for Hypervolume, but other methods can be specified easily by indicating a different number after the -m flag and removing the hypervolume flag.  If one wants to analyze Generational Distance, type 1 after the -m flag.  For the Additive Epsilon Indicator, type 4.  Typically, these are the three metrics of interest as explained in this blog post https://waterprogramming.wordpress.com/2013/06/25/moea-performance-metrics/.   The –hypervolume flag allows you to specify the hypervolume of your reference set. That way the individual reference set hypervolume values are normalized to that of the best known approximation to the Pareto front.

Hypervolume can be calculated for a given approximation set using the following command line argument.

java -cp MOEAFramework-2.1-Demo.jar org.moeaframework.analysis.sensitivity.SetHypervolume name_of_reference_set_file

Finally, I calculated statistics on these values.  I used the Kruskal Wallis and Mann Whitney U (Also called the Wilcoxon Rank Sum) tests that Dave used.  One advantage of these tests is that they are non-parameteric, meaning that assumptions are not made about the distributions from which the data come.  Another advantage I found is that they are built into Matlab as the kruskalwallis and ranksum functions.  I began with the Kruskal Wallis test as it tests the null hypothesis that independent samples from two or more groups come from distributions with equal medians and returns the p-value that this is true given the data.  This test is a useful screen, as there would not have been much point in continuing if this test indicated that the values for all algorithms had been statistically similar.

I received very low p-values using Kruskal Wallis so I continued to the Mann Whitney U test in every case.  The Mann Whitney U test provides a pair-wise comparison.   Matlab has options to test one of three alternative hypotheses

1. medians are not equal (default, two-tailed test)
2. median of X is greater than median of Y (right-tailed test)
3. median of X is less than median of Y (left-tailed test)

I used one tailed tests in order to determine if one algorithm’s performance was superior to another algorithm’s performance.  I then was able to rank algorithm performance on any given metric by determining the number of algorithms each algorithm outperformed according to this test. A detailed walk through is provided in Part 2.