# Visualizing multidimensional data: a brief historical overview

The results of a MOEA search are presented as a set of multidimensional data points. In order to form useful conclusions from our results, we must have the ability to comprehend the multidimensional differences between results and effectively analyze and communicate them to decision makers.

Navigating through multiple dimensions is an inherently complex task for the human mind. We perceive the world in three dimensions, and thinking in higher dimensional space can be heavily taxing.  The difficulty of comprehending multidimensional data is compounded when one must display the data on a two dimensional surface such as a sheet of paper or a computer screen. The challenge of “flattening” data has persisted for centuries, and has plagued not only those who were concerned with gleaning scientific insights from data, but also artists and those seeking to accurately portray the physical world as perceived by the human eye.

For much of human history, even the greatest artists were unable to accurately express the three dimensional world in a two dimensional plane. Nicolo da Bologna’s 14th century work, The Marriage, fails to convey any sense of three dimensional space, giving the viewer the impression that the figures painted have been pressed against a pane of glass.

Nicolo da Bologna’s The Marriage (1350s) is unable to convey any sense of depth to the viewer.

During the Italian Renaissance, artists rediscovered the mathematics of perspective, allowing them to break free of the constraints of their two dimensional canvas and convey realistic images that gave the illusion of a third dimension. Raphael’s The School of Athens masterfully uses perspective to imbue the painting with a sense of depth. Through clever exploitation of Euclidean geometry and the mechanics of the human eye, Raphael uses the same medium (paint on a two dimensional surface) to convey a much richer representation of his subjects than his Bolognese predecessor.

Raphael’s The School of Athens (1509-1511) is an example of a masterful use of perspective. The painting vividly depicts a three dimensional space.

In the twentieth century, artists began attempting to convey more than three dimensions in two dimensional paintings. Cubists such as Picasso portrayed multiple viewpoints of the same subject simultaneously, and futurists such as Umberto Boccioni depicted motion and “Dynamism” in their paintings to convey time as a fourth dimension.

Pablo Picasso’s Portrait of Dora Maar (1938) depicts a woman’s face from multiple viewpoints simultaneously.

Umberto Boccioni’s Dynamism of a Cyclist (1913) attempts to portray a fourth dimension, time, through a sense of motion and energy in the painting. Can you tell this is supposed to be a cyclist, or did I go too far out there for a water programming blog?

Regardless of your views on the validity of modern art, as engineers and scientists we have to admit that in this area we share similar goals and challenges with these artists: to effectively convey multidimensional data in a two dimensional space. Unlike artists, whose objectives are to convey emotions, beauty or abstract ideas through their work, we in the STEM fields seek to gain specific insights from multidimensional data that will guide our actions or investigations.

A notable historical example of the effective use of clever visualization was English physician John Snow’s map of the London Cholera epidemic of 1854. Snow combined data of cholera mortality with patient home addresses  to map the locations of cholera deaths within the city.

John Snow’s map of the 1854 London Cholera Epidemic. Each black bar is proportional to the number of cholera deaths at a given residence. Pumps are depicted using black circles. One can clearly see that the cholera deaths are clustered around the pump on Broad Street (which I’ve circled in red).

The results of his analysis led Snow to conclude that a contaminated well was the likely source of the outbreak, a pioneering feat in the field of public health. Snow’s effective visualization not only provided him with insights into the nature of the problem, but also allowed him to effectively communicate his results to a general public who had previously been resistant to the idea of water borne disease.

In his insightful book Visual Explanations: Images and Quantities, Evidence and Narrative, Edward Tufte points to three strengths of John Snow’s use of data visualization in his analysis of the epidemic. First, Snow provided the appropriate context for his data. Rather than simply plotting a time series of cholera deaths, Snow placed those deaths within a new context, geographic location, which allowed him to make the connection to the contaminated pump. Second, Snow made quantitative comparisons within his data. As Tufte points out, a fundamental question when dealing with statistical analysis is “Compared with what?” It is not sufficient to provide data about those who were struck by the disease; one must also explain why certain populations were not affected. By complementing his data collection with extensive interviews of the local population, Snow was able to show that there were indeed people within the area of interest who escaped the disease, but that these people all got their water from other sources, which strengthened his argument that the pump was the source of the epidemic. Finally, Tufte insists that one must always consider alternative explanations to the one that seems apparent from the data visualization before drawing final conclusions. It is easy to make a slick but misleading visualization, and to maintain credibility as an analyst, one must always keep an open mind to alternative explanations. Snow took the utmost care in crafting and verifying his conclusion, and as a result his work stands as a shining example of the use of visualization to explore multidimensional data.

While Snow’s methodology is impressive, and Tufte’s observations about his work helpful, we cannot directly apply his methodology to future evaluations of multidimensional data, because his map is only useful for evaluating data from the epidemic of 1854. There is a need for general tools that can be applied to multidimensional data to provide insights through visualizations. Enter the field of visual analytics. As defined by Daniel Keim, “Visual analytics combines automated analysis techniques with interactive visualization for an effective understanding, reasoning and decision making on the basis of very large and complex data sets”. The field of visual analytics combines the disciplines of data analysis, data management, geospatial and temporal processes, spatial decision support, human-computer interaction and statistics. The goal of the field is to create flexible tools for visual analysis and data mining. Noted visualization expert Alfred Inselberg proposed six criteria that successful visualization tools should meet:

1. Low representational complexity
2. Works for any number of dimensions
3. Every variable is treated uniformly
4. The displayed object can be recognized under progressive transformations (i.e. rotation, translation, scaling, perspective)
5. The display easily and intuitively conveys information on the properties of the N-dimensional object it represents
6. The methodology is based on rigorous mathematical and algorithmic results

Using the above criteria, Inselberg developed the Parallel Coordinate plot. Parallel Coordinate plots transform multidimensional relationships into two dimensional patterns which are well suited for visual data mining.

An example of a five dimensional data set plotted on a Parallel Coordinate plot. Each line represents a data point, while each axis represents the point’s value in each dimension.
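Such a plot is straightforward to construct with standard tools. Below is a minimal matplotlib sketch; the data and objective names are synthetic placeholders, not output from any MOEA run:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
data = rng.random((30, 5))  # 30 solutions x 5 objectives (synthetic)
labels = ["cost", "reliability", "vulnerability", "deficit", "flood risk"]

# rescale each objective to [0, 1] so all axes share one vertical scale
lo, hi = data.min(axis=0), data.max(axis=0)
scaled = (data - lo) / (hi - lo)

fig, ax = plt.subplots(figsize=(8, 4))
for row in scaled:
    ax.plot(range(5), row, color="steelblue", alpha=0.4)  # one polyline per solution
ax.set_xticks(range(5))
ax.set_xticklabels(labels)
ax.set_ylim(0, 1)
fig.savefig("parallel_coords.png")
```

Each vertical axis is one dimension, and each polyline crossing the axes is one data point, exactly as in the figure above.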

As water resources analysts dealing with multiobjective problems, it is critical that we have the ability to comprehend and communicate the complexities of multidimensional data. By learning from historical data visualization examples and making use of cutting edge visual analytics, we can make this task much more manageable. Parallel coordinate plots are just one example of the many visualization tools that have been created in recent decades by the ever evolving field of visual analytics.  As computing power continues its rapid advancement, it is important that we as analysts continue to ask ourselves whether we can improve our ability to visualize and gain insights from complex multidimensional data sets.

# Synthetic Weather Generation: Part IV

Conditioning Synthetic Weather Generation on Climate Change Projections

This is the fourth blog post in a five part series on synthetic weather generators. You can read about common single-site parametric and non-parametric weather generators in Parts I and II, respectively, as well as multi-site generators of both types in Part III. Here I discuss how parametric and non-parametric weather generators can be modified to simulate weather that is consistent with climate change projections.

As you are all well aware, water managers are interested in finding robust water management plans that will consistently meet performance criteria in the face of hydrologic variability and change. One method for such analyses is to re-evaluate different water management plans across a range of potential future climate scenarios. Weather data that is consistent with these scenarios can be synthetically generated from a stochastic weather generator whose parameters or resampled data values have been conditioned on the climate scenarios of interest, and methods for doing so are discussed below.

Parametric Weather Generators

Most climate projections from GCMs are given in terms of monthly values, such as the monthly mean precipitation and temperature, as well as the variances of these monthly means. Wilks (1992) describes a method for adjusting parametric weather generators of the Richardson type in order to reproduce these projected means and variances. This method, described here, has been used for agricultural (Mearns et al., 1996; Riha et al., 1996) and hydrologic (Woodbury and Shoemaker, 2012) climate change assessments.

Recall from Part I that the first step of the Richardson generator is to simulate the daily precipitation states with a first order Markov chain, and then the precipitation amounts on wet days by independently drawing from a Gamma distribution. Thus, the total monthly precipitation is a function of the transition probabilities describing the Markov chain and the parameters of the Gamma distribution describing precipitation amounts. The transition probabilities of the Markov chain, p01 and p11, represent the probabilities of transitioning from a dry day to a wet day, or a wet day to another wet day, respectively. An alternative representation of this model shown by Katz (1983) is with two different parameters, π and d, where π = p01/(1 + p01 − p11) and d = p11 − p01. Here π represents the unconditional probability of a wet day, and d represents the first-order autocorrelation of the Markov chain.
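These relationships are easy to check by simulation. The sketch below builds a first-order Markov chain from illustrative transition probabilities (not fitted to any record) and compares the empirical wet-day frequency and lag-1 autocorrelation to π and d:

```python
import numpy as np

p01, p11 = 0.3, 0.6          # P(wet | dry), P(wet | wet) -- illustrative values
pi = p01 / (1 + p01 - p11)   # unconditional (stationary) probability of a wet day
d = p11 - p01                # first-order autocorrelation of the chain

rng = np.random.default_rng(0)
n = 200_000
state = np.empty(n, dtype=int)
state[0] = 0
for t in range(1, n):
    p_wet = p11 if state[t - 1] == 1 else p01
    state[t] = rng.random() < p_wet

lag1 = np.corrcoef(state[:-1], state[1:])[0, 1]  # empirical lag-1 autocorrelation
```

With these values, `state.mean()` converges to π ≈ 0.429 and `lag1` to d = 0.3.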

Letting SN be the sum of N daily precipitation amounts in a month (i.e. the total monthly precipitation), Katz (1983) shows that if the precipitation amounts come from a Gamma distribution with shape parameter α and scale parameter β, the mean, μ, and variance, σ2, of SN are described by the following equations:

(1) μ(SN) = Nπαβ

(2) σ2(SN) = Nπαβ2[1 + α(1-π)(1+d)/(1-d)].

For climate change impact studies, one must find a set of parameters θ=(π,d,α,β) that satisfies the two equations above for the projected monthly mean precipitation amounts and variances of those means. Since there are only 2 equations but 4 unknowns, 2 additional constraints are required to fully specify the parameters (Wilks, 1992). For example, one might assume that the frequency and persistence of precipitation do not change, but that the mean and variance of the amounts distribution do. In that case, π and d would be unchanged from their estimates derived from the historical record, while α and β would be re-estimated to satisfy Equations (1) and (2). Other constraints can be chosen based on the intentions of the impacts assessment, or varied as part of a sensitivity analysis.
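For the example constraint above (π and d held at their historical values), Equations (1) and (2) reduce to two equations in α and β that can be solved in closed form: substituting αβ = μ/(Nπ) from Equation (1) into Equation (2) leaves a linear equation in β. A minimal sketch, where the projected mean and variance and the values of N, π and d are illustrative placeholders rather than GCM output:

```python
N, pi, d = 30, 0.4, 0.3                   # days per month, wet-day prob., persistence
mu_target, var_target = 90.0, 900.0       # projected monthly mean and variance (mm)

k = (1 - pi) * (1 + d) / (1 - d)          # recurring factor in Equation (2)
m = mu_target / (N * pi)                  # alpha * beta, from Equation (1)
beta = var_target / (N * pi * m) - m * k  # Equation (2) solved for beta
alpha = m / beta

# verify by plugging back into Equations (1) and (2)
mu_check = N * pi * alpha * beta
var_check = N * pi * alpha * beta**2 * (1 + alpha * k)
```

If the chosen targets produce a negative β, the projected variance is too small to be met with the assumed π and d, and a different pair of constraints must be chosen.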

To modify the temperature-specific parameters, recall that in the Richardson generator, the daily mean and standard deviation of the non-precipitation variables are modeled separately on wet and dry days by annual Fourier harmonics. Standardized residuals of daily minimum and maximum temperature are calculated for each day by subtracting the daily mean and dividing by the daily standard deviation given by these harmonics. The standardized residuals are then modeled using a first-order vector auto-regression, or VAR(1) model.

For generating weather conditional on climate change projections, Wilks (1992) assumes that the daily temperature auto and cross correlation structure remains the same under the future climate so that the VAR(1) model parameters are unchanged. However, the harmonics describing the mean and standard deviation of daily minimum and maximum temperature must be modified to capture projected temperature changes. GCM projections of temperature changes do not usually distinguish between wet and dry days, but it is reasonable to assume the changes are the same on both days (Wilks, 1992). However, it is not reasonable to assume that changes in minimum and maximum temperatures are the same, as observations indicate that minimum temperatures are increasing by more than maximum temperatures (Easterling et al., 1997; Vose et al., 2005).

Approximating the mean temperature, T, on any day t as the average of that day’s mean maximum temperature, µmax(t), and mean minimum temperature, µmin(t), the projected change in that day’s mean temperature, ΔT(t), can be modeled by Equation 3:

(3) $\Delta \overline{T}\left(t\right) = \frac{1}{2}\left[\Delta\mu_{max}\left(t\right) + \Delta\mu_{min}\left(t\right)\right] = \frac{1}{2} \left(CX_0 + CX_1\cos\left[\frac{2\pi\left(t-\phi\right)}{365}\right] + CN_0 + CN_1\cos\left[\frac{2\pi\left(t-\phi\right)}{365}\right]\right)$

where CX0 and CN0 represent the annual average changes in maximum and minimum temperatures, respectively, and CX1 and CN1 the corresponding amplitudes of the annual harmonics. The phase angle, φ, represents the day of the year with the greatest temperature change between the current and projected climate, which is generally assumed to be the same for the maximum and minimum temperature. Since GCMs predict that warming will be greater in the winter than the summer, a reasonable value of φ is 21 for January 21st, the middle of winter (Wilks, 1992).

In order to use Equation 3 to estimate harmonics of mean minimum and maximum temperature under the projected climate, one must estimate the values of CX0, CN0, CX1 and CN1. Wilks (1992) suggests a system of four equations that can be used to estimate these parameters:

(4) ΔT = 0.5*(CX0 + CN0)

(5) Δ[T(JJA) – T(DJF)] = -0.895(CX1 + CN1)

(6) ΔDR(DJF) = CX0 − CN0 + 0.895(CX1 − CN1)

(7) ΔDR(JJA) = CX0 − CN0 − 0.895(CX1 − CN1)

where the left hand sides of Equations (4)-(7) represent the annual average temperature change, the change in temperature range between summer (JJA) and winter (DJF), the change in average diurnal temperature range (DR = µmax – µmin) in winter, and the change in average diurnal temperature range in summer, respectively. The constant ±0.895 is simply the average value of the cosine term in Equation (3) evaluated at φ = 21 for the winter (+) and summer (−) seasons. The values for the left hand side of these equations can be taken from GCM projections, either as transient functions of time or as step changes.
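Since Equations (4)-(7) are linear in CX0, CN0, CX1 and CN1, they can be solved as a 4×4 system. A sketch, with made-up right-hand-side projections (not taken from any GCM):

```python
import numpy as np

# rows correspond to Equations (4)-(7); columns to [CX0, CN0, CX1, CN1]
A = np.array([
    [0.5,  0.5,  0.0,    0.0   ],   # Eq. (4): annual mean temperature change
    [0.0,  0.0, -0.895, -0.895 ],   # Eq. (5): change in summer-winter contrast
    [1.0, -1.0,  0.895, -0.895 ],   # Eq. (6): change in winter diurnal range
    [1.0, -1.0, -0.895,  0.895 ],   # Eq. (7): change in summer diurnal range
])
b = np.array([3.0, -1.0, -0.8, -0.4])   # hypothetical projected changes (deg C)

CX0, CN0, CX1, CN1 = np.linalg.solve(A, b)
```

With these placeholder projections the solution gives CN0 > CX0, i.e. minimum temperatures warming more than maximum temperatures, consistent with the observations cited above.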

Equations (4)-(7) can be used to estimate the mean minimum and maximum temperature harmonics for the modified weather generator, but the variance in daily temperatures may also change. Unlike changes in mean daily minimum and maximum temperature, it is fair to assume that the changes in their standard deviations are the same as each other, and equal to the GCM-projected change in the standard deviation of daily mean temperature on both wet and dry days. Thus, harmonics modeling the standard deviation of daily minimum and maximum temperature on wet and dry days can simply be scaled by some factor σd’/σd, where σd is the standard deviation of the daily mean temperature under the current climate, and σd’ is the standard deviation of the daily mean temperature under the climate change projection (Wilks, 1992). Like the daily mean temperature changes, this ratio can be specified as a transient function of time or a step change.

It should be noted that several unanticipated changes can occur from the modifications described above. For instance, if one modifies the probability of daily precipitation occurrence, this will change both the mean daily temperature (since temperature is a function of whether or not it rains) and its variance and autocorrelation (Katz, 1996). See Katz (1996) for additional examples and suggested modifications to overcome these potential problems.

Non-parametric Weather Generators

As described in Part II, most non-parametric and semi-parametric weather generators simulate weather data by resampling historical data. One drawback to this approach is that it does not simulate any data outside of the observed record; it simply re-orders them. Modifications to the simple resampling approach have been used in some stationary studies (Prairie et al., 2006; Leander and Buishand, 2009), as mentioned in Part II, and can be made for climate change studies as well. Steinschneider and Brown (2013) investigate several such methods with their semi-parametric weather generator. Since their generator does have some parameters (specifically, transition probabilities for a spatially averaged Markov chain model of precipitation occurrence), these can be modified using the methods described by Wilks (1992). For the non-parametric part of the generator, Steinschneider and Brown (2013) modify the resampled data itself using a few different techniques.

The first two methods they explore are common in climate change assessments: applying scaling factors to precipitation data and delta shifts to temperature data. Using the scaling factor method, resampled data for variable i, xi, are simply multiplied by a scaling factor, a, to produce simulated weather under climate change, axi. Using delta shifts, resampled data, xi, are increased (or decreased) by a specified amount, δ, to produce simulated weather under climate change, xi + δ.
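Both adjustments amount to one line of array arithmetic. A minimal sketch, with made-up values for a and δ and placeholder resampled data:

```python
import numpy as np

precip = np.array([0.0, 5.2, 0.0, 12.1])   # resampled historical precip (mm), placeholder
temp = np.array([21.0, 19.5, 23.2, 18.8])  # resampled historical temp (deg C), placeholder

a, delta = 1.1, 2.0                        # illustrative scaling factor and delta shift
precip_cc = a * precip                     # scaled precipitation under climate change
temp_cc = temp + delta                     # shifted temperature under climate change
```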

Another more sophisticated method is the quantile mapping approach. This procedure is generally applied to parametric CDFs, but can also be applied to empirical CDFs, as was done by Steinschneider and Brown (2013). Under the quantile mapping approach, historical data of the random variable, X, are assumed to come from some distribution, FX, under the current climate. The CDF of X under climate change can be specified by a different target distribution, FX*. Simulated weather variables xi under current climate conditions can then be mapped to values under the projected climate conditions, xi*, by equating their values to those of the same quantiles in the target distribution, i.e. xi* = FX*-1(FX(xi)).
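A sketch of empirical quantile mapping between two synthetic samples; in practice the target distribution would be informed by climate projections, and the Gamma samples here are placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
historical = rng.gamma(shape=2.0, scale=5.0, size=5000)  # current climate (synthetic)
target = rng.gamma(shape=2.0, scale=6.0, size=5000)      # wetter projection (synthetic)

def quantile_map(x, source, dest):
    """x* = F_dest^-1(F_source(x)) using empirical CDFs."""
    q = np.searchsorted(np.sort(source), x) / len(source)  # F_source(x)
    return np.quantile(dest, np.clip(q, 0.0, 1.0))         # F_dest^-1(q)

x = historical[:100]                     # resampled values to be mapped
x_star = quantile_map(x, historical, target)
```

Because the target sample here has a larger scale parameter, each value maps to a wetter counterpart at the same quantile.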

While simple, these methods are effective approaches for top-down or bottom-up robustness analyses. Unfortunately, what one often finds from such analyses is that there is a tradeoff between meeting robustness criteria in one objective, and sacrificing performance in another, termed regret. Fortunately, this tradeoff can occasionally be avoided if there is an informative climate signal that can be used to inform water management policies. In particular, skillful seasonal climate forecasts can be used to condition water management plans for the upcoming season. In order to evaluate these conditioned plans, one can generate synthetic weather consistent with such forecasts by again modifying the parameters or resampling schemes of a stochastic weather generator. Methods that can be used to modify weather generators consistent with seasonal climate forecasts will be discussed in my final blog post on synthetic weather generators.

Works Cited

Easterling, D. R., Horton, B., Jones, P. D., Peterson, T. C., Karl, T. R., Parker, D. E., et al. (1997). Maximum and minimum temperature trends for the globe. Science, 277(5324), 364-367.

Katz, R. W. (1983). Statistical procedures for making inferences about precipitation changes simulated by an atmospheric general circulation model. Journal of the Atmospheric Sciences, 40(9), 2193-2201.

Katz, R. W. (1996). Use of conditional stochastic models to generate climate change scenarios. Climatic Change, 32(3), 237-255.

Leander, R., & Buishand, T. A. (2009). A daily weather generator based on a two-stage resampling algorithm. Journal of Hydrology, 374, 185-195.

Mearns, L. O., Rosenzweig, C., & Goldberg, R. (1996). The effect of changes in daily and interannual climatic variability on CERES-Wheat: a sensitivity study. Climatic Change, 32, 257-292.

Prairie, J. R., Rajagopalan, B., Fulp, T. J., & Zagona, E. A. (2006). Modified K-NN model for stochastic streamflow simulation. Journal of Hydrologic Engineering, 11(4), 371-378.

Richardson, C. W. (1981). Stochastic simulation of daily precipitation, temperature and solar radiation. Water Resources Research, 17, 182-190.

Riha, S. J., Wilks, D. S., & Simoens, P. (1996). Impact of temperature and precipitation variability on crop model predictions. Climatic Change, 32, 293-311.

Steinschneider, S., & Brown, C. (2013). A semiparametric multivariate, multisite weather generator with low-frequency variability for use in climate risk assessments. Water Resources Research, 49, 7205-7220.

Vose, R. S., Easterling, D. R., & Gleason, B. (2005). Maximum and minimum temperature trends for the globe: An update through 2004. Geophysical Research Letters, 32(23).

Wilks, D. S. (1992). Adapting stochastic weather generation algorithms for climate change studies. Climatic Change, 22(1), 67-84.

Woodbury, J., & Shoemaker, C. A. (2012). Stochastic assessment of long-term impacts of phosphorus management options on sustainability with and without climate change. Journal of Water Resources Planning and Management, 139(5), 512-519.

# Synthetic Weather Generation: Part III

Multi-Site Weather Generators

This is the third part of a blog series on synthetic weather generators. In Parts I and II, I described common parametric and non-parametric methods of weather generation, respectively. Both of these blogs only discussed how to use these methods to generate daily weather simulations at one site. However, we are often interested in generating weather at multiple sites simultaneously, for instance, if we are using the weather as input to a distributed watershed model. If there is a high degree of spatial heterogeneity within the system being modeled, applying average weather conditions everywhere will not provide representative simulations of the hydrology, nor will independently generating weather at multiple sites within the system. Fortunately, several techniques have been developed to correlate synthetic weather sequences generated at multiple nearby sites so that they are consistent with observed data.

This blog will focus on two methods of generating correlated weather data: 1) driving multiple single-site parametric weather generators with spatially correlated random numbers (Wilks, 1998; Wilks 1999; Wilks, 2008; Wilks, 2009; Srikanthan and Pegram, 2009; Baigorria and Jones, 2010) and 2) resampling concurrent weather data from multiple sites with a non-parametric weather generator (Buishand and Brandsma, 2001; Beersma and Buishand, 2003; Apipattanavis et al., 2007; Steinschneider and Brown, 2013). Other methods exist, such as jointly sampling weather at multiple sites from a copula (Bardossy and Pegram, 2009; Serinaldi, 2009, Aghakouchak et al., 2010a; Aghakouchak et al., 2010b), modeling rainfall occurrence and amounts processes with generalized linear models (GLMs) (Kigobe et al., 2011), and disaggregating spatially averaged synthetic rainfall with a nonhomogeneous random cascade process (Jothityangkoon et al., 2000), but will not be discussed here.

Method 1

The first method, driving multiple single-site weather generators with spatially correlated random numbers, is described by Wilks (1998) for use with parametric generators of the Richardson type described in Part I. In the single-site formulation of this model, the first step is to generate precipitation occurrences at each site. Normally, this is done by generating a uniform random number, u ~ U[0,1], and comparing it to the critical probability, pc, of transitioning from the current state (wet or dry) to a wet state. If u ≤ pc, the state transitions to a wet state; otherwise it transitions to a dry state.

An alternative to generating uniform random numbers for the transition computation is to generate standard normal random numbers, w ~ N(0,1). In this case, state transitions from the current state to a wet state occur if w ≤ Φ-1(pc), where Φ-1 (·) is the inverse of the standard normal CDF. The benefit to using this alternative is that it is easy to generate correlated normally distributed random variables (Wilks, 2011, pp.499-500), and we will need to correlate the random number streams generated at each of K sites to ensure realistic weather simulations.

Let Xt(k) and Xt(l) be the precipitation states on day t at sites k and l, respectively, where X = 0 corresponds to a dry day and X = 1 corresponds to a wet day. In order to simulate realistic sequences of precipitation states at these two sites, one must find the correlation between the normally distributed random variables driving the transitions, ωo(k,l) = ρ(wt(k), wt(l)), that reproduces the historical correlation between the precipitation states, ξo(k,l) = ρ(Xt(k), Xt(l)). Unfortunately, this relationship must be calculated empirically by simulating precipitation states at two sites for different values of ω and calculating the corresponding values of ξ. This must be done for every one of the K(K-1)/2 station pairs. Figure 1 shows one such relationship for the July transition parameters at Ithaca and Freeville from Wilks (1998).

Once the relationship between ξ and ω has been mapped for each station pair, one can determine the correlation ωo(k,l) that should be used for the weather generator. In the case of Ithaca and Freeville, the historical correlation in precipitation states between the two sites, ξo, is 0.800. The dotted line in Figure 1 shows that ξo = 0.8 corresponds to a value of ωo = 0.957. Once all of the K(K-1)/2 pairs of correlations ωo(k,l) have been determined, one can generate a daily vector of standard normal random variables, wt, from a multi-variate normal distribution with mean vector 0 and correlation matrix [Ω], whose elements are the correlations ωo(k,l). One drawback to this method is that, while it does a good job replicating the lag 0 correlations between sites, it tends to under-represent lag 1 correlations between them (Wilks, 1998).
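A simplified sketch of one point on this ω–ξ relationship: correlated standard normal drivers are generated for two sites and converted to wet/dry states. For brevity the critical probability is held fixed rather than switched with the previous day's state as the full Markov model would require, and all parameter values are illustrative:

```python
import numpy as np
from scipy.stats import norm

omega = 0.957                       # correlation of the normal-score drivers
pc = 0.4                            # critical wet probability, same at both sites
cov = [[1.0, omega], [omega, 1.0]]

rng = np.random.default_rng(7)
w = rng.multivariate_normal([0.0, 0.0], cov, size=100_000)

wet = w <= norm.ppf(pc)             # transition to wet if w <= Phi^-1(pc)
xi = np.corrcoef(wet[:, 0], wet[:, 1])[0, 1]   # resulting state correlation
```

Note that ξ comes out smaller than ω: dichotomizing the normal drivers attenuates the correlation, which is exactly why ω must be inflated above the target ξ (0.957 versus 0.800 in the Ithaca–Freeville example).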

As in the single site weather generator, after simulating whether or not rain occurs at a particular site each day, the next step is to determine the depth of rain on wet days. Not surprisingly, the conditional distribution of rainfall amounts at a particular site depends on whether or not it is raining at nearby sites. A convenient way to accommodate this is to generate the precipitation amounts at wet sites from the mixed exponential distribution (Wilks, 1998).

Recall that the mixed exponential distribution, f(x) = αλ1exp(-λ1x) + (1–α) λ2exp(-λ2x), has three parameters: two rate parameters, λ1 and λ2 where λ1 < λ2, and a weight parameter, α. Thus, with probability α the rainfall amounts are generated from an exponential distribution with rate λ1, and with probability 1–α they are generated from an exponential distribution with rate λ2. Since the mean of the exponential distribution is 1/λ and λ1 < λ2, rainfall amounts generated from an exponential distribution with rate λ1 will have a greater mean than those generated from an exponential distribution with rate λ2. Therefore, rainfall amounts at sites neighboring dry sites are more likely to have come from an exponential distribution with rate λ2, while rainfall amounts at wet sites neighboring other wet sites are more likely to have come from an exponential distribution with rate λ1.

Keeping this in mind, one can conditionally generate rainfall amounts at a given site, k, from one of these two exponential distributions. As stated earlier, if w ≤ Φ-1(pc), or equivalently if Φ(w(k)) < pc(k), then rainfall is generated at site k. Because the transition probabilities at nearby sites are correlated, if Φ(w(k)) << pc(k), then it is very likely that nearby sites also transitioned to a wet state, so rainfall amounts are likely high. Therefore, if Φ(w(k))/pc(k) ≤ α(k), rainfall at the site should be generated from the exponential distribution with the greater mean, that with rate λ1. On the contrary, if Φ(w(k)) is less than, but close to pc(k), it is very likely that nearby sites did not transition to a wet state, so rainfall amounts are likely lower. Therefore, if Φ(w(k))/pc(k) > α(k), rainfall at the site should be generated from the exponential distribution with the smaller mean, that with rate λ2 (Wilks, 1998).

This is a coarse method that can be used to decide on the rate parameter for rainfall generation; Wilks (1998) also describes a method to continuously vary the smaller rate parameter to improve the simulation of precipitation fields. Regardless of which method is chosen, once one has decided on the rate parameter, the rainfall amount itself can be generated from the exponential distribution with this parameter. Just as the precipitation state can be determined by generation of a standard normal random variable, so too can the precipitation amount. Letting the rainfall amount at a particular site k equal r(k), the depth of precipitation is simulated as r(k) = rmin – (1/λ)ln(Φ(v(k))), where rmin is the minimum threshold below which a day is recorded as dry and λ is the rate parameter determined earlier.
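A sketch of this conditional amounts step at a single site, combining the coarse rate-selection rule with the depth formula; all parameter values are illustrative placeholders:

```python
import numpy as np
from scipy.stats import norm

alpha, lam1, lam2 = 0.3, 0.1, 0.5   # mixture weight and rates (lam1 < lam2)
pc, r_min = 0.4, 0.1                # wet-transition probability, trace threshold (mm)

rng = np.random.default_rng(3)
w = rng.standard_normal(100_000)    # occurrence drivers
v = rng.standard_normal(100_000)    # amount drivers

wet = norm.cdf(w) < pc
# driver far below its threshold -> nearby sites likely wet -> heavier-rain rate lam1
lam = np.where(norm.cdf(w) / pc <= alpha, lam1, lam2)
depth = r_min - np.log(norm.cdf(v)) / lam   # exponential depths above the threshold
rain = np.where(wet, depth, 0.0)
mean_wet = rain[wet].mean()
```

Since Φ(v) is uniform on [0,1], the term –ln(Φ(v))/λ is exponentially distributed with rate λ, so the mean wet-day depth is rmin + α/λ1 + (1–α)/λ2.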

The final step in precipitation simulation then is to generate a daily vector of correlated standard normal random variables, vt, that reproduces the historical spatial correlation structure in precipitation amounts across sites. This procedure is analogous to that used to reproduce the historical spatial correlation structure in precipitation occurrences across sites. However, Wilks (1998) found that this procedure did not always produce positive definite correlation matrices [Z] for generation of the multivariate normal random variables, vt. To overcome this problem, the correlations ζo(k,l) making up [Z] can be determined as smooth functions of the distance between station pairs. See Wilks (1998) for further details.

After generating spatially correlated precipitation, one must then generate spatially correlated minimum temperature, maximum temperature and solar radiation. Recall from Part I that in single site weather generators of the Richardson type, these three variables are generated from a first order vector auto-regressive (VAR(1)) model. As Wilks (1999) points out, this method can easily be extended by fitting a VAR(1) model to the simultaneous minimum temperature, maximum temperature and solar radiation across all sites, rather than just at one site. This increases the dimensionality from 3 for the 3 variables at one site, to 3K for the 3 variables at K sites. However, Wilks cautions that this may be difficult due to limited solar radiation data at many sites and inconsistent correlations if the record lengths are different at the K sites. See Wilks (1999) for details on how one can adjust the observed historical correlations such that they are mutually consistent.
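A bare-bones sketch of stepping such a VAR(1) model forward for 3K variables (here K = 2 sites). In practice [A] and the innovation covariance [B][B]ᵀ are estimated from the lag-0 and lag-1 correlations of the observed residuals; the matrices below are illustrative placeholders:

```python
import numpy as np

K = 2                               # number of sites
rng = np.random.default_rng(11)
A = 0.5 * np.eye(3 * K)             # placeholder autoregression matrix
BBt = 0.6 * np.eye(3 * K) + 0.1     # placeholder innovation covariance (pos. def.)
B = np.linalg.cholesky(BBt)

z = np.zeros(3 * K)                 # standardized residuals: tmin, tmax, srad at K sites
path = []
for _ in range(1000):
    z = A @ z + B @ rng.standard_normal(3 * K)  # VAR(1) update
    path.append(z)
path = np.array(path)
```

The simulated residuals would then be de-standardized with the (climate-adjusted) daily mean and standard deviation harmonics at each site.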

The method just described will generate weather at multiple sites with long enough historical records to fit the weather generator parameters. One can also interpolate these parameters using locally weighted regressions (Wilks, 2008) to generate weather at arbitrary locations between stations with historical weather records. Wilks (2009) shows how to do just that to generate gridded weather data, which is particularly useful for driving distributed hydrologic models.

Method 2

The second method, resampling concurrent weather data from multiple sites with a non-parametric weather generator, is a simpler method and a natural extension of the non-parametric single site weather generator. As discussed in Part II, most non-parametric weather generators probabilistically re-sample weather data from the historical record based on how “near” the most recently generated daily weather characteristics are to previous days in the historical record. “Nearness” is typically measured using Euclidean distance.

Buishand and Brandsma (2001) modify the single-site approach to generate weather at multiple sites by describing nearness in terms of Euclidean distance of spatially averaged temperature and precipitation standardized anomalies. The simulation begins with a randomly selected day from the historical record. The temperature and precipitation on that day are standardized by subtracting the historical daily mean and dividing by the historical daily standard deviation. The standardized temperature and precipitation anomalies at each site are then averaged and compared to the spatially averaged standardized anomalies on all other historical days within some window of the current day using Euclidean distance. One of the k nearest neighbors (k-nn) is then probabilistically selected using the Kernel density estimator derived by Lall and Sharma (1996), and the weather at all sites on the following day in the historical record is chosen to be the weather on the next simulated day.
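A minimal sketch of the neighbor-selection step, assuming the standardized anomalies have already been computed and averaged across sites (function and argument names are illustrative, not from the original paper):

```python
import numpy as np

def knn_select(current_anom, candidate_anoms, k, rng):
    """Pick one of the k nearest historical days to the current
    spatially averaged (temperature, precipitation) anomaly pair,
    using the discrete kernel of Lall and Sharma (1996)."""
    # Euclidean distance from the current day to every candidate day
    d = np.linalg.norm(candidate_anoms - current_anom, axis=1)
    nearest = np.argsort(d)[:k]          # indices of the k nearest days
    w = 1.0 / np.arange(1, k + 1)        # weight 1/j for rank j
    p = w / w.sum()                      # kernel selection probabilities
    return nearest[rng.choice(k, p=p)]   # index of the chosen neighbor
```

The weather at all sites on the day following the returned index would then become the next simulated day.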

While this non-parametric method of multi-site weather generation is much simpler than the parametric method of Wilks (1998), Mehrotra et al. (2006) found that the Wilks multi-site weather generator was better able to accommodate varying orders of serial dependence at each site while still maintaining the historical correlation structure across sites. Similarly, Young (1994) found that the k-nn weather generation approach tended to under-simulate the lengths of wet and dry spells. For this reason, Apipattanavis et al. (2007) adopt a semi-parametric multi-site weather generation approach that combines elements of each method.

The Apipattanavis generator, also employed by Steinschneider and Brown (2013), first generates spatially averaged precipitation states with a Markov chain model. To begin the simulation, the historical precipitation record is first converted to a time series of spatially averaged precipitation. If the spatially averaged rainfall on the first simulated day is below 0.3 mm, it is classified as a dry day. If it is greater than the 80th percentile of spatially averaged daily precipitation in the simulated month, it is classified as extremely wet. Otherwise it is simply wet. A three-state, first-order Markov chain is then used to simulate the next day’s spatially averaged precipitation state. One of the k-nearest neighbors from the historical record with the same transition (e.g. wet to extremely wet) is selected as in Buishand and Brandsma (2001). The weather at each site on the next simulated day is then chosen to be the weather at each site on the day following the selected neighbor. The main difference between the two methods, then, is that Buishand and Brandsma (2001) consider all historical neighbors within some window of the simulated day, while Apipattanavis et al. (2007) only consider those with the same transition in precipitation states as is simulated.
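The three-state classification could be sketched as follows; the 0.3 mm threshold follows the description above, and q80_mm is an assumed input holding that month's 80th percentile of spatially averaged daily precipitation:

```python
def precip_state(avg_rain_mm, q80_mm, dry_threshold=0.3):
    """Classify a day's spatially averaged rainfall into the three
    precipitation states of Apipattanavis et al. (2007).
    q80_mm is the simulated month's 80th percentile of spatially
    averaged daily precipitation (computed elsewhere)."""
    if avg_rain_mm < dry_threshold:
        return "dry"
    return "extremely wet" if avg_rain_mm > q80_mm else "wet"
```

A three-state, first-order Markov chain fit to these labels then drives the state transitions.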

All of the methods described above can reasonably reproduce the spatial and temporal dependence structure of weather at multiple sites, while maintaining the moments at each site. Additionally, they can be modified conditional on short-term, seasonal or long-term forecasts of climate change. That will be the subject of the next blog in this five-part series.

Works Cited

Aghakouchak, A., Bárdossy, A. & Habib, E. (2010a). Conditional simulation of remotely sensed rainfall data using a non-Gaussian v-transformed copula. Advances in Water Resources, 33, 624-634.

—. (2010b). Copula-based uncertainty modelling: application to multisensor precipitation estimates. Hydrological Processes, 24, 2111-2124.

Apipattanavis, S., Podesta, G., Rajagopalan, B., & Katz, R. W. (2007). A semiparametric multivariate and multisite weather generator. Water Resources Research, 43(11).

Baigorria, G. A., & Jones, J. W. (2010). GiST: A stochastic model for generating spatially and temporally correlated daily rainfall data. Journal of Climate, 23(22), 5990-6008.

Bárdossy, A., & Pegram, G. G. S. (2009). Copula based multisite model for daily precipitation simulation. Hydrology and Earth System Sciences, 13, 2299-2314.

Beersma, J. J., & Buishand, T. A. (2003). Multi-site simulation of daily precipitation and temperature conditional on the atmospheric circulation. Climate Research, 25, 121-133.

Buishand, T. A., & Brandsma, T. (2001). Multisite simulation of daily precipitation and temperature in the Rhine basin by nearest-neighbor resampling. Water Resources Research, 37(11), 2761-2776.

Jothityangkoon, C., Sivapalan, M., & Viney, N. R. (2000). Tests of a space-time model of daily rainfall in southwestern Australia based on nonhomogeneous random cascades. Water Resources Research, 36(1), 267-284.

Kigobe, M., McIntyre, N., Wheater, H., & Chandler, R. (2011). Multi-site stochastic modelling of daily rainfall in Uganda. Hydrological Sciences Journal, 56(1), 17-33.

Lall, U., & Sharma, A. (1996). A nearest neighbor bootstrap for resampling hydrologic time series. Water Resources Research, 32(3), 679-693.

Mehrotra, R., Srikanthan, R., & Sharma, A. (2006). A comparison of three stochastic multi-site precipitation occurrence generators. Journal of Hydrology, 331(1-2), 280-292.

Richardson, C. W. (1981). Stochastic simulation of daily precipitation, temperature and solar radiation. Water Resources Research, 17, 182-190.

Serinaldi, F. (2009). A multisite daily rainfall generator driven by bivariate copula-based mixed distributions. Journal of Geophysical Research: Atmospheres, 114(D10).

Srikanthan, R., & Pegram, G. G. S. (2009). A nested multisite daily rainfall stochastic generation model. Journal of Hydrology, 371, 142-153.

Steinschneider, S., & Brown, C. (2013). A semiparametric multivariate, multisite weather generator with low-frequency variability for use in climate risk assessments. Water Resources Research, 49, 7205-7220.

Wilks, D. S. (1998). Multisite generalization of a daily stochastic precipitation generation model. Journal of Hydrology, 210, 178-191.

Wilks, D. S. (1999). Simultaneous stochastic simulation of daily precipitation, temperature and solar radiation at multiple sites in complex terrain. Agricultural and Forest Meteorology, 96, 85-101.

Wilks, D. S. (2008). High-resolution spatial interpolation of weather generator parameters using local weighted regressions. Agricultural and Forest Meteorology, 148, 111-120.

Wilks, D. S. (2009). A gridded multisite weather generator and synchronization to observed weather data. Water Resources Research, 45(10).

Wilks, D. S. (2011). Statistical methods in the atmospheric sciences (Vol. 100). Academic press.

Young, K. C. (1994). A multivariate chain model for simulating climatic parameters from daily data. Journal of Applied Meteorology, 33(6), 661-671.

# Synthetic Weather Generation: Part II

Non-Parametric Generators

This is the second part of a blog series on synthetic weather generators. In Part I, I described common parametric methods of weather generation. Here I am going to discuss a subset of non-parametric methods. There are several techniques for non-parametric weather generation (Apipattanavis et al., 2007): simulating from empirical distributions of wet and dry spells and precipitation amounts (Semenov and Porter, 1995); using simulated annealing for generating precipitation (Bárdossy, 1998) and neural networks for generating temperature (Trigo and Palutikof, 1999); fitting multivariate PDFs to several weather variables using Kernel-based methods (Rajagopalan et al, 1997); and resampling historical observations (Young, 1994). For brevity, all of the methods I will discuss employ the nearest neighbor, or k-nn, bootstrap resampling technique introduced by Lall and Sharma (1996). Many of you are probably familiar with this method from streamflow generation, as our group has been using it to disaggregate monthly and annual flows to daily values using an adaptation of Lall and Sharma’s method described by Nowak et al. (2010).

The original purpose of the nearest-neighbor bootstrap was to reproduce the autocorrelation in simulated time series without assuming a parametric generating process (Lall and Sharma, 1996). Traditionally, auto-correlated time series are modeled using autoregressive (AR) or autoregressive moving average (ARMA) models, as described in Part I for simulating non-precipitation variables in parametric weather generators. Recall that in the AR model, the values of the non-precipitation variables at any given time step are linear combinations of the values of the non-precipitation variables at the previous time step(s), plus some Gaussian-distributed random error. Hence, ARMA models assume a linear generating process with independent, normally distributed errors. Non-parametric methods do not require that these assumptions be met.

The most common non-parametric method is the bootstrap, in which each simulated value is generated by independently sampling (with replacement) from the historical record (Efron, 1979; Efron and Tibshirani, 1993). This is commonly used to generate confidence intervals. If simulating time series, though, this method will not reproduce the auto-correlation of the observations. For this reason, the moving-blocks bootstrap has often been used, where “blocks” of consecutive values are sampled jointly with replacement (Kunsch, 1989; Vogel and Shallcross, 1996). This will preserve the correlation structure within each block, but not between blocks. The k-nn bootstrap, however, can reproduce the auto-correlation in the simulated time series without discontinuities between blocks (Lall and Sharma, 1996).

In the simplest case of the nearest neighbor bootstrap, the historical record is divided into overlapping blocks of length d, where d represents the dependency structure of the time series. So if d=2, then each value in the time series depends on the preceding two values. To simulate the next value in the time series, the most recent block of length d is compared to all historical blocks of length d to find the k “nearest” neighbors (hence k-nn). How “near” two blocks are to one another is determined using some measure of distance, such as Euclidean distance. The k nearest neighbors are then sorted and assigned a rank j from 1 to k, where 1 is the closest and k the furthest. The next simulated value will be that which succeeds one of the k nearest neighbors, where the probability of selecting the successor of each neighbor is inversely proportional to the neighbor’s rank, j.

P(j) = (1/j) / Σ_{i=1}^{k} (1/i)

To simulate the second value, the most recent block of length d will now include the first simulated value, as well as the last d-1 historical values, and the process repeats.
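The procedure above can be sketched in a few lines of Python. This is a schematic univariate version with illustrative names, not an implementation from any of the cited papers:

```python
import numpy as np

def knn_bootstrap(series, d, k, n_sim, rng):
    """Minimal sketch of the k-nn bootstrap of Lall and Sharma (1996)
    for a univariate series, using Euclidean distance between blocks
    of length d and the 1/j kernel for neighbor selection."""
    x = np.asarray(series, dtype=float)
    sim = list(x[:d])                       # seed with the first historical block
    p = 1.0 / np.arange(1, k + 1)
    p /= p.sum()                            # P(j) = (1/j) / sum_i (1/i)
    for _ in range(n_sim):
        block = np.array(sim[-d:])          # most recent block of length d
        starts = np.arange(len(x) - d)      # historical blocks with a successor
        dist = np.array([np.linalg.norm(x[s:s + d] - block) for s in starts])
        nearest = starts[np.argsort(dist)[:k]]   # k nearest blocks, ranked
        chosen = nearest[rng.choice(k, p=p)]     # probabilistic selection
        sim.append(x[chosen + d])                # successor of the chosen block
    return np.array(sim[d:])
```

Because every simulated value is a successor drawn from the record, the output only re-orders historical values, which motivates the regression-based extension discussed below.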

Lall and Sharma recommend that d and k be chosen using the Generalized Cross Validation (GCV) score (Craven and Wahba, 1979). However, they also note that choosing k = √n, where n is the length of the historical record, works well for 1 ≤ d ≤ 6 and n ≥ 100. Using d=1, k = √n and selecting neighbors only from the month preceding that which is being simulated, Lall and Sharma demonstrate the k-nn bootstrap for simulating monthly streamflows at a snowmelt-fed mountain stream in Utah. They find that it reproduces the first two log-space moments and auto-correlation in monthly flows just as well as an AR(1) model on the log-transformed flows, but does a much better job reproducing the historical skew.

Because of its simplicity and successful reproduction of observed statistics, Rajagopalan and Lall (1999) extend the k-nn bootstrap for weather generation. In their application, d is again set to 1, so the weather on any given day depends only on the previous day’s weather. Unlike Lall and Sharma’s univariate application of the nearest neighbor bootstrap for streamflow, Rajagopalan and Lall simulate multiple variables simultaneously. In their study, a six dimensional vector of solar radiation, maximum temperature, minimum temperature, average dew point temperature, average wind speed and precipitation is simulated each day by resampling from all historical vectors of these variables.

First, the time series of each variable is “de-seasonalized” by subtracting the calendar day’s mean and dividing by the calendar day’s standard deviation. Next, the most recent day’s vector is compared to all historical daily vectors from the same season using Euclidean distance. Seasons were defined as JFM, AMJ, JAS, and OND. The k nearest neighbors are then ranked and one probabilistically selected using the same Kernel density estimator given by Lall and Sharma. The vector of de-seasonalized weather variables on the day following the selected historical neighbor is chosen as the next simulated vector, after “re-seasonalizing” by back-transforming the weather variables. Rajagopalan and Lall find that this method better captures the auto- and cross-correlation structure between the different weather variables than a modification of the popular parametric method used by Richardson (1981), described in Part I of this series.

One shortcoming of the nearest neighbor bootstrap is that it will not simulate any values outside of the range of observations; it will only re-order them. To overcome this deficiency, Lall and Sharma (1996) suggest, and Prairie et al. (2006) implement, a modification to the k-nn approach. Using d=1 for monthly streamflows, a particular month’s streamflow is regressed on the previous month’s streamflow using locally-weighted polynomial regression on the k nearest neighbors. The order, p, of the polynomial is selected using the GCV score. The residuals of the regression are saved for future use in the simulation.

The first step in the simulation is to predict the expected flow in the next month using the local regression. Next, the k nearest neighbors to the current month are found using Euclidean distance and given the probability of selection suggested by Lall and Sharma. After a neighbor has been selected, its residual from the locally weighted polynomial regression is added to the expected flow given by the regression, and the process is repeated. This will yield simulated flows that are outside of those observed in the historical record.

One problem with this method is that adding a negative residual can result in the simulation of negative values for strictly positive variables like precipitation. Leander and Buishand (2009) modify this approach by multiplying the expected value from the regression by a positive residual factor: the ratio of the observation to the regression’s expected value, rather than their difference.
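A toy comparison of the two residual schemes, with illustrative scalar arguments, shows why the multiplicative factor keeps simulated values positive:

```python
def apply_residual(expected, observation, regression_value, multiplicative=True):
    """Contrast the additive residual of Prairie et al. (2006) with the
    multiplicative residual factor of Leander and Buishand (2009).
    All arguments here are illustrative scalars."""
    if multiplicative:
        factor = observation / regression_value   # always positive for positive data
        return expected * factor
    residual = observation - regression_value     # can be negative
    return expected + residual
```

With an expected value of 2.0, an observation of 3.0, and a regression value of 6.0, the additive scheme yields a negative simulated value while the multiplicative scheme stays positive.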

Another noted problem with the k-nn bootstrap for weather generation is that it often under-simulates the lengths of wet spells and dry spells (Apipattanavis et al., 2007). This is because precipitation is intermittent, while the other weather variables are not, but all variables are included simultaneously in the k-nn resampling. Apipattanavis et al. (2007) developed a semi-parametric approach to solve this problem. In their weather generator, they adopt the Richardson approach to modeling the occurrence of wet and dry days through a lag-1 Markov process. Once the precipitation state is determined, the k-nn resampling method is used, but with only neighbors of that precipitation state considered for selection.

Steinschneider and Brown (2013) expand on the Apipattanavis et al. (2007) generator by incorporating wavelet decomposition to better capture low-frequency variability in weather simulation, as both parametric and non-parametric weather generators are known to under-simulate inter-annual variability (Wilks and Wilby, 1999). Their generator uses a method developed by Kwon et al. (2007) in which the wavelet transform is used to decompose the historical time series of annual precipitation totals into H orthogonal component series, zh (plus a residual noise component), that represent different low-frequency signals of annual precipitation totals. Each signal, zh, is then simulated separately using auto-regressive models. The simulated precipitation time series is the sum of these H component models. This approach is called WARM for wavelet auto-regressive model.

The time series of annual precipitation totals produced by the WARM simulation model is used to condition the sampling of daily weather data using Apipattanavis et al.’s method. First, the precipitation total in each year of the WARM simulation is compared to the precipitation in all historical years using Euclidean distance. As in the k-nn generator, the k nearest neighbors are ranked and given a probability of selection according to Lall and Sharma’s Kernel density function. Next, instead of sampling one of these neighbors, 100 of them are selected with replacement, so there will be some repeats in the sampled set. The Apipattanavis generator is then used to generate daily weather data from this set of 100 sampled years rather than the entire historical record. Effectively, this method conditions the simulation such that data from years most similar to the WARM-simulated year are given a greater probability of being re-sampled.
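This conditional resampling step might be sketched as follows; the function and argument names are assumptions for illustration, not from the original paper:

```python
import numpy as np

def conditional_pool(sim_annual_total, hist_annual_totals, k, rng, n_pool=100):
    """Sample a pool of 100 historical years (with replacement) whose
    annual precipitation totals are nearest the WARM-simulated total,
    as in Steinschneider and Brown (2013). Sketch only."""
    d = np.abs(hist_annual_totals - sim_annual_total)  # distance in annual totals
    nearest = np.argsort(d)[:k]                        # k nearest historical years
    w = 1.0 / np.arange(1, k + 1)                      # Lall-Sharma kernel weights
    return nearest[rng.choice(k, p=w / w.sum(), size=n_pool, replace=True)]
```

The Apipattanavis generator would then resample daily weather only from the years in the returned pool, so years similar to the WARM-simulated year are over-represented.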

Another characteristic of the Apipattanavis et al. (2007) and Steinschneider and Brown (2013) weather generators is their ability to simulate correlated weather across multiple sites simultaneously. Part III discusses methods used by these authors and others to generate spatially consistent weather using parametric or non-parametric weather generators.

Works Cited

Apipattanavis, S., Podestá, G., Rajagopalan, B., & Katz, R. W. (2007). A semiparametric multivariate and multisite weather generator. Water Resources Research, 43(11).

Bárdossy, A. (1998). Generating precipitation time series using simulated annealing. Water Resources Research, 34(7), 1737-1744.

Craven, P., & Wahba, G. (1978). Smoothing noisy data with spline functions. Numerische Mathematik, 31(4), 377-403.

Efron, B. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, 1-26.

Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. CRC Press.

Kunsch, H. R. (1989). The jackknife and the bootstrap for general stationary observations. The Annals of Statistics, 1217-1241.

Kwon, H. H., Lall, U., & Khalil, A. F. (2007). Stochastic simulation model for nonstationary time series using an autoregressive wavelet decomposition: Applications to rainfall and temperature. Water Resources Research, 43(5).

Lall, U., & Sharma, A. (1996). A nearest neighbor bootstrap for resampling hydrologic time series. Water Resources Research, 32(3), 679-693.

Leander, R., & Buishand, T. A. (2009). A daily weather generator based on a two-stage resampling algorithm. Journal of Hydrology, 374, 185-195.

Nowak, K., Prairie, J., Rajagopalan, B., & Lall, U. (2010). A nonparametric stochastic approach for multisite disaggregation of annual to daily streamflow. Water Resources Research, 46.

Prairie, J. R., Rajagopalan, B., Fulp, T. J., & Zagona, E. A. (2006). Modified K-NN model for stochastic streamflow simulation. Journal of Hydrologic Engineering, 11(4), 371-378.

Rajagopalan, B., & Lall, U. (1999). A k-nearest-neighbor simulator for daily precipitation and other weather variables. Water Resources Research, 35(10), 3089-3101.

Rajagopalan, B., Lall, U., Tarboton, D. G., & Bowles, D. S. (1997). Multivariate nonparametric resampling scheme for generation of daily weather variables. Stochastic Hydrology and Hydraulics, 11(1), 65-93.

Semenov, M. A., & Porter, J. R. (1995). Climatic variability and the modelling of crop yields. Agricultural and Forest Meteorology, 73(3), 265-283.

Steinschneider, S., & Brown, C. (2013). A semiparametric multivariate, multisite weather generator with low-frequency variability for use in climate risk assessments. Water Resources Research, 49, 7205-7220.

Trigo, R. M., & Palutikof, J. P. (1999). Simulation of daily temperatures for climate change scenarios over Portugal: a neural network model approach. Climate Research, 13(1), 45-59.

Vogel, R. M., & Shallcross, A. L. (1996). The moving blocks bootstrap versus parametric time series models. Water Resources Research, 32(6), 1875-1882.

Wilks, D. S., & Wilby, R. L. (1999). The weather generation game: a review of stochastic weather models. Progress in Physical Geography, 23(3), 329-357.

Young, K. C. (1994). A multivariate chain model for simulating climatic parameters from daily data. Journal of Applied Meteorology, 33(6), 661-671.

# Synthetic Weather Generation: Part I

Parametric Generators

As water resources systems engineers, we spend a lot of time evaluating and comparing the performance of different water management policies. This is often achieved by modeling the system behavior under each policy when subjected to historical or synthetically generated hydrology. Testing performance under synthetic hydrology allows us to see how well the policy generalizes across plausible futures. One can generate synthetic hydrologies by directly simulating future streamflows, or by indirectly simulating them through a hydrologic model fed by real or synthetically generated weather conditions. Weather generation methods are also useful for predicting how hydrology will change given seasonal climate forecasts or long-term climate change projections.

Our group seems to be fairly well-versed in direct streamflow generation methods, but less familiar with indirect, weather generation methods. There are several similarities, though. First, both streamflow and weather (precipitation, temperature, solar radiation, etc.) can be simulated through “parametric” or “non-parametric” methods. In parametric methods, the streamflow and weather characteristics are assumed to come from some known distribution, whose parameters are fit with the historical record. In non-parametric methods, no such distribution is assumed, and instead future realizations are simulated by resampling from the historical record. Some methods, called “semi-parametric,” combine elements of each of these.

This will be the first of a series of blog posts on weather generators. In this blog, I will describe a handful of parametric weather generation techniques. In Part II, I will describe several non-parametric and semi-parametric weather generators. In Part III, I will cover different techniques for correlating weather generated at multiple sites. Finally, in Parts IV and V, I will discuss different methods of conditioning weather generation on climate change projections and seasonal climate forecasts, respectively. These blog posts will by no means be an exhaustive review of the literature, but a quick glimpse into commonly used weather generation techniques.

Generating Precipitation

Most parametric weather generation methods stem from the weather generator developed by Richardson (1981). In Richardson’s simulator, daily minimum temperature, maximum temperature and solar radiation are generated simultaneously, conditional on the occurrence or not of precipitation. Since both temperature and solar radiation are conditioned on whether or not precipitation has occurred, the first step is to generate a sequence of wet and dry days. This is done by assuming the occurrence of precipitation is a first-order Markov process, meaning whether or not it rains today depends solely on whether or not it rained yesterday.

Using the method of moments (MM), one can easily estimate the probability that today will be wet given that yesterday was wet or dry. Denoting a wet day at time t as Wt and a dry day at time t as Dt,

P(Wt+1|Dt) = (# of dry days followed by a wet day) / (# of dry days)

P(Wt+1|Wt) = (# of wet days followed by a wet day) / (# of wet days)

And by the law of total probability:

P(Dt+1|Dt) = 1 – P(Wt+1|Dt)

P(Dt+1|Wt) = 1 – P(Wt+1|Wt)

These estimates of the transition probabilities are the same as the Maximum Likelihood Estimates (MLE).
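A minimal sketch of these method-of-moments estimates from a boolean wet/dry record (function name is illustrative):

```python
import numpy as np

def transition_probs(wet):
    """Method-of-moments estimates of the first-order Markov chain
    transition probabilities from a boolean wet/dry daily record."""
    wet = np.asarray(wet, dtype=bool)
    prev, nxt = wet[:-1], wet[1:]                       # day t and day t+1 pairs
    p_w_given_d = (~prev & nxt).sum() / (~prev).sum()   # P(W_{t+1} | D_t)
    p_w_given_w = (prev & nxt).sum() / prev.sum()       # P(W_{t+1} | W_t)
    return p_w_given_d, p_w_given_w
```

The complementary probabilities P(D_{t+1}|D_t) and P(D_{t+1}|W_t) follow from the law of total probability, as above.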

Wilks (1999) finds that the first-order Markov process often works well, except in very dry climates, where it tends to under-simulate the lengths of dry spells. In that case, one could instead fit higher (k-)order Markov models (Chin, 1977; Gates and Tong, 1976), where the occurrence of rainy days depends on whether or not it rained on each of the previous k days. Unfortunately, the number of parameters that must be estimated for a Markov model of order k is 2^k, so a linear increase in the order of the process requires an exponential increase in the number of estimated parameters. Since it is often only dry spell lengths that are under-simulated, one could instead fit a hybrid-order Markov model in which the occurrence of a wet or dry day depends on the previous min(l, k) days, where l is the number of days since it last rained and k is the order of the hybrid model (Stern and Coe, 1984).

Another way to better capture persistence is to simulate the occurrence of wet and dry days by alternating between simulating the lengths of wet spells (consecutive wet days) and dry spells. A first order Markov process will create wet and dry spells whose lengths are geometrically distributed with parameter p, where p = P(Wt+1|Dt) for dry spells, and P(Dt+1|Wt) for wet spells. Racsko et al. (1991) found that they could better simulate dry spells in Hungary by using a mixed geometric distribution, where the probability of getting a dry spell, X, of length x is:

P(X = x) = αp1(1-p1)^(x-1) + (1-α)p2(1-p2)^(x-1).

So in this model, (100×α)% of dry spells are generated from a geometric distribution with parameter p1, and the rest of them are generated from a geometric distribution with parameter p2, thus requiring the estimation of three parameters. Since only dry spells were being under-simulated by the one-parameter geometric distribution, Racsko et al. (1991) still simulated wet spells from the one-parameter geometric distribution, requiring a total of four parameter estimates. Alternatively, one could simulate wet and dry spells from the negative binomial distribution (Wilby et al., 1998), which requires four total parameters if used for both wet and dry spells, and three if used only for dry spells while the geometric is used for wet spells.
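Sampling a dry-spell length from the mixed geometric distribution is straightforward: pick which component to use, then draw a geometric variate. A sketch with illustrative parameter values in the example:

```python
import numpy as np

def mixed_geometric_spell(alpha, p1, p2, rng):
    """Draw one dry-spell length from the mixed geometric distribution
    of Racsko et al. (1991): with probability alpha use parameter p1,
    otherwise use p2. Parameter values are illustrative."""
    p = p1 if rng.random() < alpha else p2
    return rng.geometric(p)   # spell lengths have support 1, 2, 3, ...
```

The mixture mean is α/p1 + (1-α)/p2, so a small p2 component lets the model produce the occasional very long dry spell that a single geometric distribution under-simulates.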

Regardless of how precipitation occurrences are determined, once a wet day has been simulated, the amount of rainfall on that day must also be generated. It is often assumed that the amount of rainfall on any given day is independent of the amount of rainfall on any other day. While the auto-correlation in precipitation amounts is statistically significant, the consequences of assuming independence are not (Wilks, 1999). Richardson assumes that rainfall amounts are generated from an exponential distribution, which only has one parameter, λ, that must be estimated. More commonly (Katz, 1977; Wilks, 1989, 1992) a Gamma distribution is assumed, which has two parameters, α and β. Some authors (Foufoula-Georgiou and Lettenmaier, 1987; Wilks, 1998; 1999; Woolhiser and Roldan, 1982) find that a mixed exponential distribution works best. Like the mixed geometric distribution, a mixed exponential distribution is a weighted average of two exponential distributions:

f(x) = αλ1exp(-λ1x) + (1-α) λ2exp(-λ2x)

So on (100×α)% of wet days, rainfall amounts are generated from an exponential distribution with parameter λ1, and the rest of the time they are generated from an exponential distribution with parameter λ2.
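A sketch of sampling wet-day amounts from the mixed exponential density (parameter values in the example are illustrative, not fitted):

```python
import numpy as np

def mixed_exponential(alpha, lam1, lam2, rng, size=1):
    """Draw rainfall amounts from the mixed exponential density
    f(x) = alpha*lam1*exp(-lam1*x) + (1-alpha)*lam2*exp(-lam2*x)."""
    # choose a component (rate) for each draw, then sample exponentially
    lam = np.where(rng.random(size) < alpha, lam1, lam2)
    return rng.exponential(1.0 / lam)   # numpy parameterizes by scale = 1/rate
```

The mixture mean is α/λ1 + (1-α)/λ2, which the test below checks approximately.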

Each of these models has a different number of parameters that must be estimated: 1 for the exponential, 2 for the gamma, and 3 for the mixed exponential. Using additional parameters should improve the fit, but at the risk of overfitting the data. One method of choosing between these models is the Akaike Information Criterion (AIC), which compares the log likelihood functions of the fitted models with a penalty for the number of parameters estimated. The model with the lowest AIC is said to lose the least information in representing the generating process of the data. Another option is the Bayesian Information Criterion (BIC), which more harshly penalizes the use of additional parameters (Wilks and Wilby, 1999).
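As an illustration of AIC-based selection, the following compares exponential and gamma fits using scipy. This is a sketch only; the mixed exponential would need a custom likelihood (e.g. fit by EM) and is omitted:

```python
import numpy as np
from scipy import stats

def aic_exponential_vs_gamma(amounts):
    """AIC = 2k - 2 ln L for exponential (k=1) and gamma (k=2) fits
    to an array of wet-day rainfall amounts."""
    lam = 1.0 / amounts.mean()                      # MLE of the exponential rate
    ll_exp = stats.expon.logpdf(amounts, scale=1/lam).sum()
    a, loc, b = stats.gamma.fit(amounts, floc=0)    # MLE gamma shape and scale
    ll_gam = stats.gamma.logpdf(amounts, a, loc=0, scale=b).sum()
    return 2*1 - 2*ll_exp, 2*2 - 2*ll_gam
```

With strongly non-exponential data, the gamma's extra parameter more than pays for its AIC penalty; with near-exponential data, the penalty tips the choice the other way.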

In Richardson’s model, the historical data in each year are pooled into 26 14-day periods, and the transition probabilities and parameters of the amounts distribution are calculated for each of those 2-week periods. He then fits a harmonic function to the time series of transition probabilities and parameter estimates to capture their seasonality, so that each day’s transition probabilities and distribution parameters are determined by this continuous function (see Figure 4 from his paper). It is more common to simply calculate transition probabilities and parameters of the amounts distribution for each month, and use those for precipitation generation on each day of that month (Wilks and Wilby, 1999). Despite the discontinuity from month to month, this method does a good job reproducing the seasonal pattern of observed weather.

Once the daily rainfall occurrence and amounts have been generated, the daily minimum temperature, maximum temperature and solar radiation can be simulated. These must be conditioned on the occurrence of rainfall because they are not independent; a rainy day in July is likely to be cooler than a dry day in July due to, among other reasons, increased cloud cover blocking incident solar radiation. Clearly, then, temperature and radiation are not only correlated with precipitation, but with each other. Additionally, like precipitation occurrence, there is great persistence in daily temperature and daily radiation time series. One model that can account for the persistence in the daily time series of each variable, as well as the cross-correlation between them, is a vector auto-regression (VAR) (Richardson, 1981). Often, a VAR(1) model is used, meaning the auto-regression is of order 1. This is similar to a Markov model of order 1, where the values of the weather variables on a particular day depend only on the values of those variables on the previous day.

To fit the vector auto-regression, one must first calculate the historical time series of “residual” minimum temperature, maximum temperature and solar radiation (Richardson, 1981). Outside of the tropics, temperature and solar radiation tend to vary sinusoidally over the course of a year, with peaks in the summer and troughs in the winter. One must remove this signal before fitting the auto-regressive model; removing it yields the “residuals” to which the VAR(1) model is fit. Using minimum temperature as an example, the mean and standard deviation of minimum temperature across every wet January 1st in the historical record are calculated. This is done for every calendar day, and a Fourier series is fit to the resulting time series of means and standard deviations. The same is done for minimum temperature on dry days.

To remove the signal, for each observation of minimum temperature on a wet day in the historical record, that day’s mean wet-day minimum temperature (predicted by the Fourier model) will be subtracted, and the resulting difference divided by that day’s standard deviation of wet-day minimum temperature (predicted by the Fourier model) to produce a residual. The analogous process is performed for each observation of minimum temperature on a dry day. What will result is a historical record of “residuals”, also called “standardized anomalies”; the “anomaly” is the difference between the observation and the mean, which is “standardized” by dividing by the standard deviation. See Figure 1 from Richardson (1981) for a visual depiction of this process for solar radiation. Note that if the residuals are not normal, it may be appropriate to first transform the raw data to normality using a Box-Cox transformation, as Parlange and Katz (2000) do for radiation and wind speed in their extension of the VAR(1) model to include daily mean wind speed and dew point.
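A minimal sketch of this standardization, assuming a single annual harmonic fit by least squares (the function names are mine, not Richardson’s):

```python
import numpy as np

def fit_first_harmonic(day_of_year, values):
    """Least-squares fit of f(d) = a0 + a1*cos(2*pi*d/365) + b1*sin(2*pi*d/365)."""
    w = 2.0 * np.pi * np.asarray(day_of_year) / 365.0
    X = np.column_stack([np.ones_like(w), np.cos(w), np.sin(w)])
    coef, *_ = np.linalg.lstsq(X, np.asarray(values, dtype=float), rcond=None)
    return coef

def evaluate_harmonic(coef, day_of_year):
    """Evaluate the fitted harmonic on any day(s) of the year."""
    w = 2.0 * np.pi * np.asarray(day_of_year) / 365.0
    return coef[0] + coef[1] * np.cos(w) + coef[2] * np.sin(w)

def standardized_anomalies(obs, day_of_year, mean_coef, std_coef):
    """Residual = (observation - Fourier mean) / Fourier standard deviation."""
    mu = evaluate_harmonic(mean_coef, day_of_year)
    sigma = evaluate_harmonic(std_coef, day_of_year)
    return (obs - mu) / sigma
```

In the full procedure, separate mean and standard deviation harmonics would be fit for wet days and dry days, and each observation standardized with the appropriate pair.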

Sometimes, one might not have a long enough or complete enough historical record to calculate reliable means and standard deviations of wet day and dry day temperature and solar radiation. In this case, the daily data can be lumped to calculate average monthly wet day and dry day temperature and solar radiation data over the historical record. Fourier series can be fit to the mean and standard deviation of these time series instead, with a few caveats. Consider fitting a harmonic function to mean monthly wet day temperature. If one is going to interpolate daily values from this function, several of the daily values within each month should be above the mean and several below. However, in the months containing the peak or trough of the harmonic, the interpolated daily values will all lie to one side of the fitted monthly value, so what is supposed to be the mean daily value for that month will actually be the minimum or maximum. To account for this, one can adjust the amplitudes of the harmonics using a method described by Epstein (1991). A second caveat is that the standard deviation in average monthly wet day temperature will be much smaller than the standard deviation in average daily wet day temperature. If the variance of some random variable, X, is σ², the variance in its average is σ²/n, where n is the number of observations being averaged. For this reason, the standard deviation of the average monthly wet day temperature must first be multiplied by √n before fitting a Fourier series to the monthly values, which again should be adjusted using Epstein’s method.
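The √n rescaling is easy to check numerically. This little sketch (with made-up numbers) simulates monthly averages of hypothetical daily wet-day temperatures and recovers the daily-scale standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_daily = 3.0   # hypothetical std of daily wet-day temperature
n = 30              # number of daily values averaged into each monthly value

# Simulate many "months" of daily values and average within each month.
daily = rng.normal(20.0, sigma_daily, size=(20000, n))
monthly_means = daily.mean(axis=1)

# The std of the monthly averages is roughly sigma/sqrt(n), so
# multiplying by sqrt(n) recovers the daily-scale std.
sd_recovered = monthly_means.std() * np.sqrt(n)
```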

The residual time series resulting from this process should be auto-correlated within a particular variable’s time series, and cross-correlated between the three variables of minimum temperature, maximum temperature and solar radiation. This is captured by a VAR(1) process:

z(t) = [A] z(t-1) + [B] ε(t)

where z(t) and z(t-1) are 3×1 vectors of the residuals of the three variables at times t and t-1, respectively, ε(t) is a 3×1 vector of independent standard normal random variables at time t, and [A] and [B] are 3×3 matrices that together capture the auto- and cross-correlation between the time series of the three variables. [A] and [B] are determined by the following two equations:

[A] = [S1][S]⁻¹

[B][B]ᵀ = [S] – [A][S1]ᵀ

where [S] is the covariance matrix of the three variables, and [S1] the matrix of lagged (by one time step) covariances of the three variables. [B] can be found using either Cholesky decomposition or spectral decomposition (Wilks, 2011, pp. 470-481, 500-504). Sometimes [A] and [B] will be similar throughout the year, while seasonal changes in the autocovariance structure may require different estimates for different times of the year (Wilks and Wilby, 1999).
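A sketch of these moment equations in code, assuming the residuals arrive as an (n, 3) array and taking the Cholesky option for [B]:

```python
import numpy as np

def fit_var1(z):
    """Fit [A] and [B] of z(t) = [A] z(t-1) + [B] eps(t) from an
    (n, 3) array of standardized residuals, using the moment
    equations above."""
    z = np.asarray(z, dtype=float)
    z = z - z.mean(axis=0)          # residuals should already be ~zero-mean
    z0, z1 = z[:-1], z[1:]
    n = len(z0)
    S = (z0.T @ z0) / n             # lag-0 covariance matrix [S]
    S1 = (z1.T @ z0) / n            # lag-1 covariance matrix [S1]
    A = S1 @ np.linalg.inv(S)
    M = S - A @ S1.T                # equals [B][B]^T
    B = np.linalg.cholesky(M)       # lower-triangular matrix square root
    return A, B
```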

Once all of these parameters have been fit, you can start generating synthetic weather!
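Putting the pieces together, a hypothetical generation loop might look like the sketch below. The parameter values are made up, and the step of mapping residuals back to physical temperatures and radiation (via the wet/dry-specific Fourier means and standard deviations described above) is omitted:

```python
import numpy as np

def generate_weather(n_days, p01, p11, shape, scale, A, B, rng):
    """End-to-end sketch: Markov occurrence, gamma amounts, and VAR(1)
    residuals for (tmin, tmax, radiation). Converting the residuals
    back to physical units would use the day- and wet/dry-specific
    Fourier means and standard deviations."""
    wet = False
    precip = np.zeros(n_days)
    z = np.zeros((n_days, 3))
    z_prev = np.zeros(3)
    for t in range(n_days):
        wet = rng.random() < (p11 if wet else p01)
        if wet:
            precip[t] = rng.gamma(shape, scale)
        z_prev = A @ z_prev + B @ rng.standard_normal(3)  # VAR(1) step
        z[t] = z_prev
    return precip, z

rng = np.random.default_rng(7)
precip, resid = generate_weather(365, 0.3, 0.7, shape=0.8, scale=6.0,
                                 A=0.5 * np.eye(3), B=np.eye(3), rng=rng)
```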

Works Cited

Chin, E. H. (1977). Modeling daily precipitation occurrence process with Markov chain. Water Resources Research, 13, 949-956.

Epstein, E. S. (1991). On obtaining daily climatological values from monthly means. Journal of Climate, 4, 365-368.

Foufoula-Georgiou, E., & Lettenmaier, D. P. (1987). A Markov renewal model for rainfall occurrences. Water Resources Research, 23, 875-884.

Gates, P., & Tong, H. (1976). On Markov chain modeling to some weather data. Journal of Applied Meteorology, 15, 1145-1151.

Katz, R. W. (1977). Precipitation as a chain-dependent process. Journal of Applied Meteorology, 16, 671-676.

Parlange, M. B., & Katz, R. W. (2000). An extended version of the Richardson model for simulating daily weather variables. Journal of Applied Meteorology, 39, 610-622.

Racsko, P., Szeidl, L., & Semenov, M. (1991). A serial approach to local stochastic weather models. Ecological Modelling, 57, 27-41.

Richardson, C. W. (1981). Stochastic simulation of daily precipitation, temperature and solar radiation. Water Resources Research, 17, 182-190.

Stern, R. D., & Coe, R. (1984). A model fitting analysis of daily rainfall data. Journal of the Royal Statistical Society, A147, 1-34.

Wilby, R. L., Wigley, T. M. L., Conway, D., Jones, P. D., Hewitson, B. C., Main, J., & Wilks, D. S. (1998). Statistical downscaling of general circulation model output: a comparison of methods. Water Resources Research, 34, 2995-3008.

Wilks, D. S. (1989). Conditioning stochastic daily precipitation models on total monthly precipitation. Water Resources Research, 25, 1429-1439.

Wilks, D. S. (1992). Adapting stochastic weather generation algorithms for climate change studies. Climatic Change, 22, 67-84.

Wilks, D. S. (1998). Multisite generalization of a daily stochastic precipitation generation model. Journal of Hydrology, 210, 178-191.

Wilks, D. S. (1999). Interannual variability and extreme-value characteristics of several stochastic daily precipitation models. Agricultural and Forest Meteorology, 93, 153-169.

Wilks, D. S. (2011). Statistical methods in the atmospheric sciences (Vol. 100). Academic Press.

Wilks, D. S., & Wilby, R. L. (1999). The weather generation game: a review of stochastic weather models. Progress in Physical Geography, 23(3), 329-357.

Woolhiser, D. A., & Roldan, J. (1982). Stochastic daily precipitation models. 2. A comparison of distribution amounts. Water Resources Research, 18, 1461-1468.

# New Federal Flood Frequency Analysis Guidelines

This post is a bit off topic for this blog, but I think it should be interesting to our readers.  The current procedure used by Federal agencies (FEMA, Bureau of Reclamation, USGS, Army Corps, etc.) to assess flood risk at gauged sites is described in Bulletin 17B.  That procedure is used to estimate flood risk for things like FEMA flood maps, to set flood insurance rates, and to design riparian structures like levees.  Given how important the procedure is, many people are surprised to learn that it has not been updated since 1982, despite improvements in statistical techniques and computational resources, and the additional data that is now available.  This post won’t speculate as to why change has taken so long, but I will briefly describe the old procedures and the proposed updates.

Bulletin 17B Procedure:

Dr. Veronica Webster has written an excellent brief history on the evolution of national flood frequency standards in the US dating back to the 1960s.  For this post, we focus on the procedures adopted in 1982.  In short, Bulletin 17B recommends fitting the log-Pearson Type III distribution to the annual peak flow series at a gauged location, using the log-space method of moments.  In other words, one takes the logarithms of the flood series and then uses the mean, variance, and skew coefficient of the transformed series to fit a Pearson Type III distribution.  The Pearson Type III distribution is a shifted Gamma distribution.  When the population skew coefficient is positive the distribution is bounded on the low end, and when the population skew is negative the distribution is bounded on the high end.  When the population skew is zero, the Pearson Type III distribution becomes a normal distribution.  For those wanting more information, Veronica and Dr. Jery Stedinger have written a nice three-part series on the log-Pearson Type III, starting with a paper on the distribution characteristics.
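As a rough sketch (not the Bulletin’s full procedure), the log-space method-of-moments quantile estimate, plus a Bulletin-style MSE-weighted skew, might look like this in Python, leaning on scipy’s `pearson3` for the frequency factor:

```python
import numpy as np
from scipy import stats

def lp3_quantile(peaks, p):
    """Fit log-Pearson Type III by log-space method of moments and
    return the flow with non-exceedance probability p
    (e.g., p = 0.99 for the "100-year" flood)."""
    logs = np.log10(np.asarray(peaks, dtype=float))
    mean, sd = logs.mean(), logs.std(ddof=1)
    skew = stats.skew(logs, bias=False)
    K = stats.pearson3.ppf(p, skew)   # Pearson III frequency factor
    return 10.0 ** (mean + K * sd)

def weighted_skew(sample_skew, mse_sample, regional_skew, mse_regional):
    """Weighted average of sample and regional skew, with weights
    inversely proportional to each estimate's mean squared error."""
    w1, w2 = 1.0 / mse_sample, 1.0 / mse_regional
    return (w1 * sample_skew + w2 * regional_skew) / (w1 + w2)
```

Note that when the skew is zero, `pearson3` reduces to the normal distribution, matching the limiting case described above.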

Unfortunately, data are not always well behaved and the true distribution of floods is certainly not log-Pearson Type III, so the Bulletin takes measures to make the procedure more robust.  One such measure is the use of a regional skew coefficient to supplement the sample skew coefficient computed from the transformed flood record.  The idea here is that computing the skew coefficient from short samples is difficult because the skew coefficient can be noisy, so one instead uses a weighted average of the sample skew and a regional skew.  The relative weights are inversely proportional to the mean squared error of each skew, with the more accurate skew given more weight.  The Bulletin recommends obtaining a regional skew from either an average of nearby and hydrologically similar records, a model of skew based on basin characteristics (like drainage area), or the use of the National Skew Map (see below), based on a 1974 study.

National Skew Map provided in Bulletin 17B [IAWCD, 1982, Plate I]

A second concern is that small observations in a flood record might exert unjustifiable influence in the analysis and distort the estimates of the likelihood of the large floods of interest.  Often such small observations are caused by different hydrologic processes than those that cause the largest floods.  In my own experience, I’ve encountered basins wherein the maximum annual discharge is zero, or nearly zero in some years.  The Bulletin recommends censoring such observations, meaning that one removes them from the computation of the sample moments, then applies a probability adjustment to account for their presence.  The Bulletin provides an objective outlier detection test to identify potential nuisance observations, but ultimately leaves censoring decisions to the subjective judgement of the analyst.  The proposed objective test is the Grubbs-Beck test, which is a 10% significance test of whether the smallest observation in a normal sample is unusually small.
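For illustration, here is a generic one-sided Grubbs test on the smallest observation of a (log-transformed) sample. Note this is the textbook version using a t-distribution critical value, not the Bulletin’s tabulated K_n criterion:

```python
import numpy as np
from scipy import stats

def grubbs_low_outlier(logs, alpha=0.10):
    """One-sided Grubbs test: is the smallest value in a roughly
    normal sample unusually small at significance level alpha?
    Returns True if the smallest observation is flagged."""
    x = np.sort(np.asarray(logs, dtype=float))
    n = len(x)
    G = (x.mean() - x[0]) / x.std(ddof=1)   # test statistic
    # one-sided critical value from the t distribution
    t2 = stats.t.ppf(alpha / n, n - 2) ** 2
    G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t2 / (n - 2 + t2))
    return G > G_crit
```

A flagged observation would then be censored from the moment computations, with the probability adjustment described above.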

The Hydrologic Frequency Analysis Work Group (HFAWG) is charged with updating the Bulletin.  Their recommendations can be found on-line, as well as a testing report which compares the new methods to the old Bulletin.  Bulletin 17C is currently being written.  Like Bulletin 17B, the new recommendations also fit the log-Pearson Type III distribution to the annual maximum series from a gauged site, but a new fitting technique is used: the expected moments algorithm (EMA).  This method is related to the method of moments estimators previously used, but allows for the incorporation of historical/paleo flood information, censored observations, and regional skew information in a unified, systematic methodology.

High water mark of 1530 flood of the Tiber River at Rome

Historical information might include non-systematic records of large floods: “The town hall has been there since 1760 and has only been flooded once,”  thus providing a threshold flow which has only been crossed once in 256 years.  EMA can include that sort of valuable community experience about flooding in a statistically rigorous way! For the case that no historical information is available, no observations are censored, and no regional skew is used, the EMA moment estimators are exactly the same as the Bulletin 17B method of moments estimators.  See Dr. Tim Cohn’s website for EMA software.

The EMA methodology also has correct quantile confidence intervals (confidence intervals for the 100-year flood, etc.), which are more accurate than the ones used in Bulletin 17B.

Another update to the Bulletin involves the identification of outlier observations, which are now termed Potentially Influential Low Flows (PILFs).  A concern with the old detection test was that it rarely identified multiple outliers, even when several very small observations are present.  In fact, the Grubbs-Beck test, as used in the Bulletin, is only designed to test the smallest observation and not the second, third, or kth smallest observations.  Instead, the updated Bulletin adopts a generalization of the Grubbs-Beck test designed to test for multiple PILFs (see this proceedings, full paper to come).  The new test identifies PILFs more often, and identifies more of them, than the previous test, but we’ve found that use of a reasonable outlier test actually results in improved quantile efficiency when fitting log-Pearson Type III data with EMA (again, full paper in review).

The final major revision is the retirement of the skew map (see above), and the adoption of language recommending more modern techniques, like Bayesian Generalized Least Squares (BGLS).  In fact, Dr. Andrea Veilleux, along with her colleagues at the USGS, has been busy using iterations of BGLS to generate models of regional skew across the country, including California, Iowa, the Southeast, the Northwest, Missouri, Vermont, Alaska, Arizona, and California rainfall floods.  My master’s work was in this area, and I’d be happy to write further on Bayesian regression techniques for hydrologic data, if folks are interested!

If folks are interested in software that implements the new techniques, the USGS has put out a package with good documentation.

That’s it for now!

Jon

# On constraints within MOEAs

When thinking about setting up a problem formulation for an MOEA like Borg, it’s helpful to spend a minute (or 15!) talking about our friend the constraint.  I’ve written about problem formulations in the past, and we even have a video about problem formulations, but there’s a subtlety that I thought it would be helpful to discuss again here!

## A comment about constraints in classical mathematical programming formulations

Classical mathematical programming formulations are chock-full of constraints.  Hundreds of constraints, even!  This is due to the fact that when you are setting up a linear program (LP) or something similar, the set of constraints allows the program to have some amount of realism.  The classical linear programming water allocation problem (for example, something you would find in the classic ReVelle et al. textbook) is something like:

Decisions: water allocations to different actors in different periods of time

Objective: minimize cost of the allocation plan, or maximize the net benefits of the plan

Constraints:

1. Ensure that demands are met

2. Ensure that, at each supply node, inflow equals outflow

3. Ensure that the capacity of the conveyance system is not exceeded

4. Ensure that all decision variables are greater than or equal to zero

5. Keep the water quality values above or below some threshold

We have proposed 5 categories of constraints, and there may be even more.  But mathematically, this simple formulation above could have hundreds of constraints, if you have to solve each equation for a particular point in time, or at a particular spatial location.
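To make this concrete, here is a toy two-user, single-period version of the allocation LP with made-up costs, demands, and supply, solved with scipy’s `linprog`:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical data: minimize cost of deliveries to two users, subject
# to meeting demands (constraint 1) without exceeding supply (a stand-in
# for constraints 2-3); x >= 0 (constraint 4) is handled by the bounds.
costs = np.array([2.0, 3.0])      # cost per unit delivered to each user
demands = np.array([30.0, 20.0])  # minimum required deliveries
supply = 100.0                    # total water available

# linprog expects A_ub @ x <= b_ub, so "meet demand" becomes -x <= -d.
A_ub = np.vstack([np.ones(2), -np.eye(2)])
b_ub = np.concatenate([[supply], -demands])
res = linprog(costs, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 2)
```

With positive costs, the cheapest feasible plan just meets each demand, and, as LP theory predicts, the optimum sits at an intersection of constraints.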

This is perfectly fine!  If you are solving a LP, I encourage you to put more constraints in there because constraints are really the only way that you can make your solutions look more physically accurate.  LP theory will tell you that, if there is one or more optimal solutions, at least one of them will be at the intersection of two or more of the constraints.  So, the constraints are important (more on this later).

## How the simulation-optimization of MOEAs is different

The MOEA approach we talk about on this blog is a simulation-optimization approach.  What that means is that instead of modeling your problem with a set of a few hundred constraints, you are actually using a simulation model of the system.  This is going to fundamentally change the way that your decisions and constraints are defined, and it may even change your objectives.  Let’s talk about each in turn, using our simple water allocation example as, well, as an example.

Decisions: It would probably make more sense to optimize rules about how the water is allocated, instead of actually treating each decision variable as a separate volume of water over time.  There are a number of reasons why this is.  In the formulation above, if each decision variable is an allocation of water in a particular period of time, once you start to plan over long planning horizons, the problem will get unwieldy, with thousands of variables.  Plus, if you say that decision variable x_83 is the allocation of water to user 5, at t = 3, then how are you to plan what happens if you don’t have accurate or deterministic data for what happens at t = 3?  But that’s really the purpose of another post.  We’re here to talk about:

Constraints: So, in the above formulation, constraints were designed to create physical reality to your system.  For example, if the decision variables are volumes of water, and you have a basic estimation of the concentration of some pollutant, you can create a constraint that says the number of kilograms of a pollutant should be less than a particular value.  In that fashion, you can think of hundreds of constraints, and we just talked about this, right?

In the simulation-optimization approach, though, the simulation model provides physical reality to the system, not the constraint set.  So it is better to treat constraints as limits on acceptable performance.  For example, if you are calculating reliability, you can say that all solutions should have at least some lower threshold of reliability (say, 98%, for example).  Or, don’t have constraints at all and instead just handle the value as an objective!  The point of doing tradeoff analysis with a MOEA is to find diversity of solutions, after all.  If you find that the performance of the values is too poor, or that you have too many solutions, you can impose constraints next time.  Or, you can also change the epsilon resolution within Borg in order to change the number of points in the tradeoff analysis; see this paper by Kollat and Reed where they showed the same tradeoff set with different resolutions, for example.

Constraints can also help with the representation of decisions within your formulation.  After all, the most basic constraint in any LP is that xi >= 0, right?  But think about creative ways to change the representation of decision variables such that all of the algorithm’s solutions are feasible.  This will greatly help the search process!

For example, consider a problem where you have to create drought restriction plans at different levels.  In other words, imagine that the value of restriction at level 2 needed to be greater than the value at level 1.  Mathematically, you could set this up as a constraint (x2 >= x1), but it will be very difficult to meet this, especially with an initially random population within an MOEA, and especially when you have, say, 5 or 10 or 100 of these levels that each need to be greater as you go up.  So, instead, you can do something like this.  Let xi’ be the decision variable in the algorithm, and xi (without a prime) be the actual decision variable used in your model.  Then, let:

x1 = x1′

x2 = x1 + x2′

The algorithm actually searches on x1′ and x2′.  Now, if you set your bounds up correctly, the algorithm can never generate an infeasible solution, and you are spending all your computational resources within search actually finding good, feasible solutions instead of just trying to find a single set of feasible solutions with which to work.  Something similar to this was used in Zeff et al. (2014).
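The transformation is just a running sum, so a one-liner sketch covers any number of levels:

```python
import numpy as np

def to_model_space(x_prime):
    """Map the algorithm's decision variables x' (each bounded below
    by zero) to monotone restriction levels x: x1 = x1', x2 = x1' + x2',
    and so on. Every vector the algorithm generates is feasible by
    construction."""
    return np.cumsum(np.asarray(x_prime, dtype=float))
```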

## Some comments on how constraint handling works within Borg

The concepts of epsilon non-domination, and non-domination, are used to determine whether solutions survive within the search process.  The general concept of non-domination is, informally, that a solution’s objective function values are better than another solution’s in at least one objective, and at least as good in all other objectives.  But that’s not what we’re here to talk about!  We are here to talk about constraints, remember? 🙂

Well remember that the definition of a feasible and infeasible solution is the same in LP-land as it is for us.  Feasible solutions meet all constraints, infeasible solutions violate one or more constraints. An infeasible solution is defined to not be acceptable to the decision maker.  This is an extremely important point to remember, so it is worth repeating!  An infeasible solution is unacceptable. Because of this, it’s up to you to set up the constraints to keep acceptable solutions, and throw away unacceptable (infeasible) solutions.

However, there is an important distinction between, say, an LP and what we do.  Remember the simplex algorithm for solving LPs?  What it does is intelligently finds a corner of the feasible space (defined by your constraint set), and then it walks from corner to adjacent corner, improving the objective at each step, until it reaches an optimal one.  Notice what’s going on here, though.  Most of the time, at least for a simple problem, it is easy to find a feasible solution in the simplex method, and then you are off to the races.  If you can’t find a feasible solution, you’re in a lot of trouble!

Similarly, you also need to find one feasible solution to get started with MOEA search…preferably you’ll have many many feasible solutions!  But what if it is difficult to find a feasible solution? This is going to impact your ability to find solutions, and if your function evaluation count, or search duration, is not long enough you can have a failed search.

So let’s get to the point of this section, which is explaining how constraint handling works within Borg.  As explained in Reed et al., most of the algorithms we’ve typically tested within this blogosphere use restricted tournament selection. There are three conditions under which solution a can dominate solution b:

1. Both solutions are feasible, and based on objective function performance, a dominates b

2. Solution a is feasible, and solution b is infeasible.

3. Both a and b are infeasible, and the sum of a’s constraint violations is less than the sum of b’s constraint violations

So as you can see, item 3 is very interesting!  What item 3 means, is that Borg or these other algorithms will actually keep some infeasible solutions in the population or archive while all solutions are infeasible.  This is important because it helps move the population toward eventually finding a feasible region of the decision space!  But, according to item 2, any solution that is feasible will beat an infeasible solution, which is consistent with the fact that infeasible solutions are unacceptable, and oh gosh I already said that a few times didn’t I!

By the way, you can also use penalty functions to handle constraints, instead of using this built in restricted tournament selection feature.  An example of that is in the Kollat and Reed paper that I already linked to.  Plus there are other ways to do this too!  By no means is this an exhaustive discussion, but I hope what we’ve talked about is helpful.  Anyway:

## Some informal recommendations on how to use constraints within Borg if you choose to use them

1. If you don’t have to use constraints, don’t use them!  Optimize without them and see how you do (see above).

2. Make the constraint violations that you’re reporting to Borg meaningful.  Notice item 3 in the restricted tournament selection actually does use the values that you report from constraint handling.  I’ve seen some folks just say: if infeasible, c = 100000000.  But that won’t actually help the restricted tournament mechanism move the set toward meaningful, feasible solutions.  For example let’s say that r gives you a value of reliability for a system, and reliability must be above capital R.  You could set your constraint violation to: c = R – r, which will give you the difference between your performance and what you are required to have.  If all of your solutions are infeasible initially, the algorithm will favor a solution that has, say, 1% reliability lower than R over a solution that has 20% reliability lower than R.

3. If you have an expensive simulation, skip the objective function calculations if the solution is infeasible.  The objective function values for an infeasible solution are not really used for anything in restricted tournament selection.  So you can report the constraint violation, but don’t spend time calculating objective functions that will never be used.

Two quick (important) caveats here:  One, remember that infeasible solutions are unacceptable.  And the reason why I keep bringing this up is the fact that you can technically shoot yourself in the foot here, and impose constraints that are too strict, where a real decision maker may actually want solutions that violate your imposed constraint.  So be careful when you define them!  Two, if you do skip the objective function calculation be sure to still report a value of your objectives to Borg, even if that value is just a placeholder.  This is important, to make sure the Borg procedure keeps running.  If you have questions about these caveats please leave them in the comment box below.
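Recommendations 2 and 3 (and the second caveat) can be sketched together.  Here `cheap_reliability` and `expensive_simulation` are hypothetical stand-ins for your own model code:

```python
def cheap_reliability(solution):
    # stand-in for a quick reliability calculation on a candidate solution
    return sum(solution) / len(solution)

def expensive_simulation(solution):
    # stand-in for the costly objective function evaluation
    return [sum(solution), max(solution)]

def evaluate(solution, R=0.98, n_objectives=2):
    """Report a meaningful constraint violation (c = R - r) and skip
    the expensive objective calculations when the solution is
    infeasible, returning placeholder objective values so Borg still
    receives a complete record for every evaluation."""
    r = cheap_reliability(solution)
    violation = max(0.0, R - r)
    if violation > 0.0:
        objectives = [1e9] * n_objectives   # placeholders, never used
    else:
        objectives = expensive_simulation(solution)
    return objectives, violation
```

Because the violation is the actual reliability shortfall, the tournament mechanism can still prefer a solution that is 1% short of R over one that is 20% short.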

## To conclude

This has been quite a long post so thanks for sticking with me til the end.  Hopefully you’ve learned something, I know I did!  We can continue the discussion on this blog (for the contributors) and on the comment boxes below (for everyone)!  Have a great day.