**Non-Parametric Generators**

This is the second part of a blog series on synthetic weather generators. In Part I, I described common parametric methods of weather generation. Here I am going to discuss a subset of non-parametric methods. There are several techniques for non-parametric weather generation (Apipattanavis et al., 2007): simulating from empirical distributions of wet and dry spells and precipitation amounts (Semenov and Porter, 1995); using simulated annealing for generating precipitation (Bárdossy, 1998) and neural networks for generating temperature (Trigo and Palutikof, 1999); fitting multivariate PDFs to several weather variables using Kernel-based methods (Rajagopalan et al, 1997); and resampling historical observations (Young, 1994). For brevity, all of the methods I will discuss employ the nearest neighbor, or *k*-nn, bootstrap resampling technique introduced by Lall and Sharma (1996). Many of you are probably familiar with this method from streamflow generation, as our group has been using it to disaggregate monthly and annual flows to daily values using an adaptation of Lall and Sharma’s method described by Nowak, et al. (2010).

The original purpose of the nearest-neighbor bootstrap was to reproduce the autocorrelation in simulated time series without assuming a parametric generating process (Lall and Sharma, 1996). Traditionally, auto-correlated time series are modeled using autoregressive (AR) or autoregressive moving average (ARMA) models, as described in Part I for simulating non-precipitation variables in parametric weather generators. Recall that in the AR model, the values of the non-precipitation variables at any given time step are linear combinations of the values of the non-precipitation variables at the previous time step(s), plus some Gaussian-distributed random error. Hence, ARMA models assume a linear generating process with independent, normally distributed errors. Non-parametric methods do not require that these assumptions be met.

The most common non-parametric method is the bootstrap, in which each simulated value is generated by independently sampling (with replacement) from the historical record (Efron, 1979; Efron and Tibshirani, 1993). This is commonly used to generate confidence intervals. If simulating time series, though, this method will not reproduce the auto-correlation of the observations. For this reason, the moving-blocks bootstrap has often been used, where “blocks” of consecutive values are sampled jointly with replacement (Kunsch, 1989; Vogel and Shallcross, 1996). This will preserve the correlation structure *within *each block, but not* between* blocks. The *k*-nn bootstrap, however, can reproduce the auto-correlation in the simulated time series without discontinuities between blocks (Lall and Sharma, 1996).

In the simplest case of the nearest neighbor bootstrap, the historical record is divided into overlapping blocks of length *d*, where *d* represents the dependency structure of the time series. So if *d*=2, then each value in the time series depends on the preceding two values. To simulate the next value in the time series, the most recent block of length *d* is compared to all historical blocks of length *d* to find the *k* “nearest” neighbors (hence *k*-nn). How “near” two blocks are to one another is determined using some measure of distance, such as Euclidean distance. The *k *nearest neighbors are then sorted and assigned a rank *j* from 1 to *k*, where 1 is the closest and *k* the furthest. The next simulated value will be that which succeeds one of the *k *nearest neighbors, where the probability of selecting the successor of each neighbor is inversely proportional to the neighbor’s rank, *j*.

P(j) = (1/j) / Σ_{i=1,…,k} (1/i)

To simulate the second value, the most recent block of length *d* will now include the first simulated value, as well as the last *d*-1 historical values, and the process repeats.

Lall and Sharma recommend that *d* and *k* be chosen using the Generalized Cross Validation score (GCV) (Craven and Wahba, 1979). However, they also note that choosing *k = √n*, where *n* is the length of the historical record, works well for 1 ≤ *d* ≤ 6 and *n* ≥ 100. Using *d*=1, *k = √n* and selecting neighbors only from the month preceding that which is being simulated, Lall and Sharma demonstrate the *k*-nn bootstrap for simulating monthly streamflows at a snowmelt-fed mountain stream in Utah. They find that it reproduces the first two log-space moments and auto-correlation in monthly flows just as well as an AR(1) model on the log-transformed flows, but does a much better job reproducing the historical skew.

Because of its simplicity and successful reproduction of observed statistics, Rajagopalan and Lall (1999) extend the *k*-nn bootstrap for weather generation. In their application, *d* is again set to 1, so the weather on any given day depends only on the previous day’s weather. Unlike Lall and Sharma’s univariate application of the nearest neighbor bootstrap for streamflow, Rajagopalan and Lall simulate multiple variables simultaneously. In their study, a six dimensional vector of solar radiation, maximum temperature, minimum temperature, average dew point temperature, average wind speed and precipitation is simulated each day by resampling from all historical vectors of these variables.

First, the time series of each variable is “de-seasonalized” by subtracting the calendar day’s mean and dividing by the calendar day’s standard deviation. Next, the most recent day’s vector is compared to all historical daily vectors from the same season using Euclidean distance. Seasons were defined as JFM, AMJ, JAS, and OND. The *k* nearest neighbors are then ranked and one probabilistically selected using the same Kernel density estimator given by Lall and Sharma. The vector of de-seasonalized weather variables on the day following the selected historical neighbor is chosen as the next simulated vector, after “re-seasonalizing” by back-transforming the weather variables. Rajagopalan and Lall find that this method better captures the auto- and cross-correlation structure between the different weather variables than a modification of the popular parametric method used by Richardson (1981), described in Part I of this series.

One shortcoming of the nearest neighbor bootstrap is that it will not simulate any values outside of the range of observations; it will only re-order them. To overcome this deficiency, Lall and Sharma (1996) suggest, and Prairie et al. (2006) implement, a modification to the *k*-nn approach. Using *d*=1 for monthly streamflows, a particular month’s streamflow is regressed on the previous month’s streamflow using locally-weighted polynomial regression on the *k* nearest neighbors. The order, *p*, of the polynomial is selected using the GCV score. The residuals of the regression are saved for future use in the simulation.

The first step in the simulation is to predict the expected flow in the next month using the local regression. Next, the *k* nearest neighbors to the current month are found using Euclidean distance and given the probability of selection suggested by Lall and Sharma. After a neighbor has been selected, its residual from the locally weighted polynomial regression is added to the expected flow given by the regression, and the process is repeated. This will yield simulated flows that are outside of those observed in the historical record.

One problem with this method is that adding a negative residual could result in the simulation of negative values for strictly positive variables like precipitation. Leander and Buishand (2009) modify this approach by multiplying the expected value from the regression by a positive residual factor. The residual factor is calculated as the quotient when the observation is divided by the expected value of the regression, rather than the difference when the expected value of the regression is subtracted from the observation.

Another noted problem with the *k*-nn bootstrap for weather generation is that it often under-simulates the lengths of wet spells and dry spells (Apipattanavis et al., 2007). This is because precipitation is intermittent, while the other weather variables are not, but all variables are included simultaneously in the *k*-nn resampling. Apipattanavis et al. (2007) developed a semi-parametric approach to solve this problem. In their weather generator, they adopt the Richardson approach to modeling the occurrence of wet and dry days through a lag-1 Markov process. Once the precipitation state is determined, the *k*-nn resampling method is used, but with only neighbors of that precipitation state considered for selection.

Steinschneider and Brown (2013) expand on the Apipattanavis et al. (2007) generator by incorporating wavelet decomposition to better capture low-frequency variability in weather simulation, as both parametric and non-parametric weather generators are known to under-simulate inter-annual variability (Wilks and Wilby, 1999). Their generator uses a method developed by Kwon et al. (2007) in which the wavelet transform is used to decompose the historical time series of annual precipitation totals into *H* orthogonal component series, *z _{h}* (plus a residual noise component), that represent different low-frequency signals of annual precipitation totals. Each signal,

*z*, is then simulated separately using auto-regressive models. The simulated precipitation time series is the sum of these

_{h}*H*component models. This approach is called WARM for wavelet auto-regressive model.

The time series of annual precipitation totals produced by the WARM simulation model is used to condition the sampling of daily weather data using Apipattanavis et al.’s method. First, the precipitation total in each year of the WARM simulation is compared to the precipitation in all historical years using Euclidean distance. As in the *k*-nn generator, the *k* nearest neighbors are ranked and given a probability of selection according to Lall and Sharma’s Kernel density function. Next, instead of sampling one of these neighbors, 100 of them are selected with replacement, so there will be some repeats in the sampled set. The Apipattanavis generator is then used to generate daily weather data from this set of 100 sampled years rather than the entire historical record. Effectively, this method conditions the simulation such that data from years most similar to the WARM-simulated year are given a greater probability of being re-sampled.

Another characteristic of the Apipattanavis et al. (2007) and Steinschneider and Brown (2013) weather generators is their ability to simulate correlated weather across multiple sites simultaneously. Part III discusses methods used by these authors and others to generate spatially consistent weather using parametric or non-parametric weather generators.

**Works Cited**

Apipattanavis, A., Podestá, G., Rajagopalan, B., & Katz, R. W. (2007). A semiparametric multivariate and multisite weather generator. *Water Resources Research, 43*.

Bárdossy, A. (1998). Generating precipitation time series using simulated annealing. *Water Resources Research*, *34**(7)*, 1737-1744.

Craven, P., & Wahba, G. (1978). Smoothing noisy data with spline functions. *Numerische Mathematik*, *31**(4),* 377-403.

Efron, B. (1979). Bootstrap methods: another look at the jackknife. *The annals of Statistics*, 1-26.

Efron, B., & Tibshirani, R. J. (1994). *An introduction to the bootstrap*. CRC press.

Kunsch, H. R. (1989). The jackknife and the bootstrap for general stationary observations. *The Annals of Statistics*, 1217-1241.

Kwon, H. H., Lall, U., & Khalil, A. F. (2007). Stochastic simulation model for nonstationary time series using an autoregressive wavelet decomposition: Applications to rainfall and temperature. *Water Resources Research*, *43**(5)*.

Lall, U., & Sharma, A. (1996). A nearest neighbor bootstrap for resampling hydrologic time series. *Water Resources Research, 32(3)*, 679-693.

Leander, R., & Buishand, T. A. (2009). A daily weather generator based on a two-stage resampling algorithm. *Journal of Hydrology, 374*, 185-195.

Nowak, K., Prairie, J., Rajagopalan, B., & Lall, U. (2010). A nonparametric stochastic approach for multisite disaggregation of annual to daily streamflow. *Water Resources Research, 46*.

Prairie, J. R., Rajagopalan, B., Fulp, T. J., & Zagona, E. A. (2006). Modified K-NN model for stochastic streamflow simulation. *Journal of Hydrologic Engineering*, *11*(4), 371-378.

Rajagopalan, B., & Lall, U. (1999). A k‐nearest‐neighbor simulator for daily precipitation and other weather variables. *Water Resources Research*, *35**(10)*, 3089-3101.

Rajagopalan, B., Lall, U., Tarboton, D. G., & Bowles, D. S. (1997). Multivariate nonparametric resampling scheme for generation of daily weather variables. *Stochastic Hydrology and Hydraulics*, *11*(1), 65-93.

Semenov, M. A., & Porter, J. R. (1995). Climatic variability and the modelling of crop yields. *Agricultural and forest meteorology*, *73**(3)*, 265-283.

Steinschneider, S., & Brown, C. (2013). A semiparametric multivariate, multisite weather generator with low-frequency variability for use in climate risk assessments. *Water Resources Research, 49*, 7205-7220.

Trigo, R. M., & Palutikof, J. P. (1999). Simulation of daily temperatures for climate change scenarios over Portugal: a neural network model approach.*Climate Research*, *13**(1)*, 45-59.

Vogel, R. M., & Shallcross, A. L. (1996). The moving blocks bootstrap versus parametric time series models. *Water Resources Research, 32(6)*, 1875-1882.

Wilks, D. S., & Wilby, R. L. (1999). The weather generation game: a review of stochastic weather models. *Progress in Physical Geography, 23(3)*, 329-357.

Young, K. C. (1994). A multivariate chain model for simulating climatic parameters from daily data. *Journal of Applied Meteorology*, *33*(6), 661-671.

Pingback: Synthetic Weather Generation: Part I | Water Programming: A Collaborative Research Blog

Pingback: Synthetic Weather Generation: Part III – Water Programming: A Collaborative Research Blog

Pingback: Synthetic Weather Generation: Part IV – Water Programming: A Collaborative Research Blog

Pingback: Synthetic Weather Generation: Part V – Water Programming: A Collaborative Research Blog

Pingback: Water Programming Blog Guide (Part 2) – Water Programming: A Collaborative Research Blog

Pingback: Fitting Hidden Markov Models Part I: Background and Methods – Water Programming: A Collaborative Research Blog