Open Source Streamflow Generator Part II: Validation

This is the second part of a two-part blog post on an open source synthetic streamflow generator written by Matteo Giuliani, Jon Herman and me, which combines the methods of Kirsch et al. (2013) and Nowak et al. (2010) to generate correlated synthetic hydrologic variables at multiple sites. Part I showed how to use the MATLAB code in the subdirectory /stationary_generator to generate synthetic hydrology, while this post shows how to use the Python code in the subdirectory /validation to statistically validate the synthetic data.

The goal of any synthetic streamflow generator is to produce a time series of synthetic hydrologic variables that expand upon those in the historical record while reproducing their statistics. The /validation subdirectory of our repository provides Python plotting functions to visually and statistically compare the historical and synthetic hydrologic data. The function plots the range of the flow duration (probability of exceedance) curves for each hydrologic variable across all historical and synthetic years. Lines 96-100 should be modified for your specific application. You may also have to modify line 60 to change the dimensions of the subplots in your figure. It’s currently set to plot a 2 x 2 figure for the four LSRB hydrologic variables. provides a visual, not statistical, analysis of the generator’s performance. An example plot from this function for the synthetic data generated for the Lower Susquehanna River Basin (LSRB) as described in Part I is shown below. These probability of exceedance curves were generated from 1000 years of synthetic hydrologic variables. Figure 1 indicates that the synthetic time series are successfully expanding upon the historical observations, as the synthetic hydrologic variables include more extreme high and low values. The synthetic hydrologic variables also appear unbiased, as this expansion is relatively equal in both directions. Finally, the synthetic probability of exceedance curves also follow the same shape as the historical, indicating that they reproduce the within-year distribution of daily values.

Figure 1

To more formally confirm that the synthetic hydrologic variables are unbiased and follow the same distribution as the historical, we can test whether or not the synthetic median and variance of real-space monthly values are statistically different from the historical using the function This function is currently set up to perform this analysis for the flows at Marietta, but the site being plotted can be changed on line 76. The results of these tests for Marietta are shown in Figure 2. This figure was generated from a 100-member ensemble of synthetic series of length 100 years, and a bootstrapped ensemble of historical years of the same size and length. Panel a shows boxplots of the real-space historical and synthetic monthly flows, while panels b and c show boxplots of their means and standard deviations, respectively. Because the real-space flows are not normally distributed, the non-parametric Wilcoxon rank-sum test and Levene’s test were used to test whether or not the synthetic monthly medians and variances were statistically different from the historical. The p-values associated with these tests are shown in Figures 2d and 2e, respectively. None of the synthetic medians or variances are statistically different from the historical at a significance level of 0.05.

Figure 2

In addition to verifying that the synthetic generator reproduces the first two moments of the historical monthly hydrologic variables, we can also verify that it reproduces both the historical autocorrelation and cross-site correlation at monthly and daily time steps using the functions and, respectively. The autocorrelation function is again set to perform the analysis on Marietta flows, but the site can be changed on line 39. The spatial correlation function performs the analysis for all site pairs, with site names listed on line 75.

The results of these analyses are shown in Figures 3 and 4, respectively. Figures 3a and 3b show the autocorrelation function of historical and synthetic real-space flows at Marietta for up to 12 lags of monthly flows (panel a) and 30 lags of daily flows (panel b). Also shown are 95% confidence intervals on the historical autocorrelations at each lag. The range of autocorrelations generated by the synthetic series expands upon that observed in the historical while remaining within the 95% confidence intervals for all months, suggesting that the historical monthly autocorrelation is well-preserved. On a daily time step, most simulated autocorrelations fall within the 95% confidence intervals for lags up to 10 days, and those falling outside do not represent significant biases.

Figure 3

Figures 4a and 4b show boxplots of the cross-site correlation in monthly (panel a) and daily (panel b) real-space hydrologic variables for all pairwise combinations of sites. The synthetic generator greatly expands upon the range of cross-site correlations observed in the historical record, both above and below. Table 1 lists which sites are included in each numbered pair of Figure 4. Wilcoxon rank sum tests (panels c and d) for differences in median monthly and daily correlations indicate that pairwise correlations are statistically different (α=0.5) between the synthetic and historical series at a monthly time step for site pairs 1, 2, 5 and 6, and at a daily time step for site pairs 1 and 2. However, biases for these site pairs appear small in panels a and b. In summary, Figures 1-4 indicate that the streamflow generator is reasonably reproducing historical statistics, while also expanding on the observed record.

Figure 4

Table 1

Pair Number Sites
1 Marietta and Muddy Run
2 Marietta and Lateral Inflows
3 Marietta and Evaporation
4 Muddy Run and Lateral Inflows
5 Muddy Run and Evaporation
6 Lateral Inflows and Evaporation




3 thoughts on “Open Source Streamflow Generator Part II: Validation

  1. Pingback: Open Source Streamflow Generator Part I: Synthetic Generation – Water Programming: A Collaborative Research Blog

  2. This is very interesting.

    We’ve been using monthly stochastic hydrology for more than 30 years now in South Africa to test different bulk water supply system configurations on a sub-continental scale through this type of probabilistic water resources system yield and planning models. Stochastic serial and cross correlation over larger areas are always tricky – having droughts and floods happening at the same time in large systems.

    Would be interesting to see how these results compare with our models. Also, we’ve concentrated on gross Draft-Storage curves for validation of stochastic ensembles.

    • Yeah, I’d be interested to hear how well it works for your systems. What methods are you currently using? Feel free to email me at

      Using draft-storage curves for validation is a good idea. We’re hoping to expand on this to build a larger repository of streamflow generation and validation methods, so that would be a good thing to add!

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s