Ensemble forecasting – application

In contrast to my theoretically oriented previous post on ensemble forecasting, I will attempt, in this post, to provide some practically oriented operational context for ensemble forecasting from the perspective of water management. This and the previous post will serve as a useful background for the planned third post in this series that will focus on the details of ensemble forecast verification. This post is largely a distillation of my experience in working with the formal Forecast Informed Reservoir Operations (FIRO) efforts that are ongoing, primarily in the western US. I use the term ‘formal’ here deliberately to differentiate from the more general notion of FIRO, which has been in the research literature since at least the 90’s. Importantly, ‘formal’ FIRO, as I choose to label it, is an effort to formalize the use of forecast information into the operations of our nation’s portfolio of federally owned or controlled dams, which contain the largest and most important reservoirs in CONUS. This is in stark contrast to the largely informal use of FIRO that has been applied somewhat ‘under the radar’ and scattershot in various ways and for various purposes by dam operators up to this point. While there is a very interesting discussion to be had about the institutional and operational complexities of this effort (perhaps a future post), I’ll leave this brief once-over-the-FIRO-world contextualization at that for now.

For the purposes of this post, I mention the operational context to narrow the window of forecast applications to a tractable space that is a) pertinent to water systems operations and b) a space that I am intimately familiar with. The world of hydrometeorological forecasting is large and growing, containing both advances in legacy dynamical forecasting techniques as well as emergent machine learning approaches. When it comes to federal agencies, tried/tested/verified is incredibly important, so I anticipate the uptake of the latter ML approaches to be some ways off. Baby steps…the US Army Corps of Engineers (USACE) is still trying to move past the ‘water on the ground’ legacy of water management.

Meteorological forecasting

In discussing the current state of ensemble forecasting, I will progress hierarchically from the meteorological to the hydrologic forecasting space. I will first discuss some aspects of dynamical meteorological forecasting that form the backbone of any hydrological forecasting effort. It is, after all, the weather gods casting their hydrometeors (rain/snow/ice) upon the earth that cause the majority of the hydrologic responses of interest to practical FIRO applications. This is an important point that I’ll note here and come back to later. Meteorological forcings dominate the hydrologic response and by consequence, dominate the hydrologic uncertainties when we start thinking in terms of future predictions (forecasts) where meteorologic uncertainties become large. This is wholly in contrast to efforts to understand hydrologic uncertainty with observational data as forcings (for instance, stochastic watershed modeling, see my post here), where the hydrologic model’s structural inadequacy is the primary driver of hydrologic predictive uncertainty. The observational (input) uncertainties are not inconsequential in these sorts of applications, but they form a less substantial portion of the overall uncertainty and can largely be considered aleatory.

Advances in meteorological forecasting skill

With some sense of the importance of meteorological uncertainty to the hydrologic forecasting problem, I want to turn now to a brief discussion of the advances in dynamical meteorological forecasting and areas for future research and improvement. Most folks have some practical sense of forecast skill (or lack thereof) from their daily lives; you check the smartphone weather before you head out for the day, put on shorts and a t-shirt, then get soaked walking to work due to some unforecast rain event. Despite events like this and the general challenge of dynamical uncertainty in the atmospheric system (see previous post), the reality is that forecast skill has steadily improved over time, to the tune of one day of ‘useful’ skill gained per decade (see Figure 1). Although dependent on the particular skill metric being used, it is common to regard skill above 60% as ‘useful’ and skill above 80% as a ‘high degree of accuracy’. By some visual extrapolation, one might surmise that we should be approaching ‘useful’ skill out to 10 days by now in the northern hemisphere (NH) for the atmospheric variable in question.

Figure 1. Improvements in forecast skill of 500 hPa geopotential height anomaly over time (Bauer et al., 2015)

Notably, Figure 1 shows the skill improvements for a large, synoptic-scale variable at a moderate altitude in the atmosphere (500 hPa ~ 18,000’ above sea level). Skill varies greatly across atmospheric variables and is particularly challenging for variables that are highly dependent on sub-grid scale processes (i.e. processes that occur below the native resolution of the forecast model); one of these, as luck would have it, is the aforementioned hydrometeor (precipitation) generating process of such interest to hydrologic applications. To loosely paraphrase something I heard from F. Martin Ralph, the current director of the Center for Western Weather and Water Extremes (CW3E) at Scripps: ‘our skill in predicting upper atmospheric variables at large scales has increased enormously, whereas our skill in predicting local scale precipitation has barely budged’. This basic notion is a Pandora’s box of sorts when thinking about how we might use the information that is available in forecast models (i.e. the highly accurate synoptic-scale variables) differently, to more directly forecast key variables of interest at the local scale. Many of these emerging techniques rely on ML methods applied across wide spatial scales that can more precisely target key hydrometeorological attributes of interest to water management, like the probability of exceeding some rainfall threshold (Zhang et al., 2022), or that map directly to streamflow itself, skipping the whole precipitation part (Nearing et al., 2021). These are certainly highly promising avenues for research, but we’ll return to some practical advancements in dynamical forecasting for now.

Areas for improving meteorological forecasting

One could spend an entire blog post (or many posts, really) on this subject. I want to touch on just a few key areas where much research effort is currently dedicated to advancing meteorological forecasting. The first of these is improving the spatiotemporal resolution of the forecast models. To take a quick step back, it is useful to compare a ‘forecast’ model to a General Circulation/Global Climate Model (GCM) or Earth System Model (ESM). In a broad sense, these models are all constructed in the same basic way. They start with a dynamical representation of the atmosphere based on the Navier-Stokes fluid dynamics equations that are discretized and solved across ‘grids’ of some horizontal and vertical dimensions across the globe. In climate science speak, this part of the model is often referred to as the ‘dynamical core’ (or ‘dy-core’ if you want to sound cool). For the most part, things that happen below this native grid resolution are parameterized (e.g. cloud formation), which means that they are modeled based on fitted statistical relationships between grid-scale atmospheric states and sub-grid scale process outcomes. These parameterizations are often referred to as the model’s ‘physics’. Intuitively, the smaller the scale at which we can represent atmospheric processes, the better we might be able to capture these local behaviors. Advances in computational power have enabled much higher resolutions for forecast models/GCMs/ESMs over time and will likely continue to do so.

Where these models primarily differ is in all the other stuff needed to run a simulation of earth’s atmosphere, namely the land/ocean/ice surfaces and their couplings with the atmosphere. Forecast models must be computationally tractable at operationally relevant timescales to be useful. In the space we are discussing, this means that the model must be able to reinitialize and produce a forecast at least daily. Moreover, the forecast temporal resolution must be much higher to capture the evolution of the climate system at operationally relevant scales, typically 6-hourly with current technology. To overcome this complexity tradeoff (see Figure 2), forecast models have traditionally relied on extremely simplified representations of the land/ocean/ice surface boundary conditions that are assumed to be constant across the forecast period. Conversely, the development of GCMs/ESMs has continually increased the complexity of the land/ocean/ice modeling and their atmospheric coupling while retaining sufficient atmospheric spatiotemporal resolution for long-term simulations. GCMs/ESMs must also provide provable closure of things like energy and carbon fluxes across long time scales; this is not really a primary concern for forecast models, which can be ‘leaky’. Increasingly, however, computational advances have enabled forecast models to look more like GCMs/ESMs, with state-of-the-science forecast models now modeling and coupling land/ocean interactions along with the atmospheric evolution.

Figure 2. Tradeoff between model complexity and resolution in modeling the earth system (Bauer et al., 2015)

The second key area that I’ll highlight is data assimilation. Data assimilation is the process of constraining forecast trajectories based on some intermediate assessment of the actual evolution of the climate system (i.e. an observation). In meteorological forecasting, this is typically done in a relatively short window after the forecast has been generated, for instance the 12-hour assimilation window in Figure 3 below. Importantly, for the issuance of forecasts, this assimilation period happens ‘under the hood’, so the user of forecasts would only see the forecast as being issued at 21:00, not the 09:00 time when the actual forecast model was initialized. While the details of data assimilation are complex and well outside the scope of this post, it is an incredibly important area of continued research for dynamical forecasting. To put this into context, another loose paraphrase, this one from Vijay Tallapragada, senior scientist at NOAA’s Environmental Modeling Center: ‘if you want long-term job security in weather forecasting, get into data assimilation’.

Figure 3. Data assimilation example in ensemble forecasting (Bauer et al., 2015)
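Without opening that box too far, the basic flavor of the simplest ensemble-based schemes is: nudge each forecast member toward a new observation, weighted by the relative uncertainties. Below is a purely illustrative sketch of a stochastic, ensemble-Kalman-style update for a single scalar state; the toy numbers and the perturbed-observation trick are my own simplifications, not a description of any operational system.

```python
import numpy as np

rng = np.random.default_rng(5)

# Prior: a 50-member forecast ensemble of one scalar state (toy numbers)
forecast = rng.normal(2.0, 1.0, size=50)
obs, obs_var = 3.0, 0.25  # a new observation and its error variance

# Kalman gain weighs prior spread against observation error
prior_var = forecast.var()
K = prior_var / (prior_var + obs_var)

# Stochastic update: each member is nudged toward a perturbed copy of
# the observation, shrinking the ensemble spread toward the obs
perturbed_obs = obs + rng.normal(0.0, np.sqrt(obs_var), size=50)
analysis = forecast + K * (perturbed_obs - forecast)
```

The analysis ensemble sits between the prior and the observation, with reduced spread, which is exactly the 'constraining' role assimilation plays in the forecast cycle.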

Lastly, as alluded to in my previous post, the methods to sample initial condition uncertainty and propagate it through the dynamical forecast model are also a very important area of advancement in meteorological ensemble forecasting. The mathematical challenge of adequately sampling initial condition uncertainty at global scales across highly correlated initialization variables is huge, as is the computational burden of running multiple perturbed members of the forecasts to produce an ensemble. The latter challenge has particular significance to the hydrological forecasting enterprise and may, at least partially, explain certain design choices in the hydrologic forecast systems discussed in the following section.

Hydrologic ensemble forecasting

To turn the meteorologic outputs of precipitation and temperature to streamflow, we need a streamflow generating process model. Moreover, if we want an ensemble of streamflow predictions, we need a way to generate some uncertainty in the streamflows to reflect the dominant meteorological uncertainties in the forecasts. With this modeling chain of meteorologic forecast model → streamflow process model, one could envision many possible ways to attempt to capture this uncertainty. Perhaps the most intuitive would be to take each ensemble output from the meteorologic forecast model and run it through the streamflow process model; voila! ensemble streamflow forecasts!
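To make the ‘voila!’ version concrete, here is a minimal sketch of that member-by-member propagation. The `simulate_streamflow` linear-reservoir function is a toy stand-in I invented for a calibrated watershed model, not any operational code:

```python
def simulate_streamflow(precip, k=0.5):
    """Toy linear-reservoir stand-in for a calibrated watershed model."""
    storage, flows = 0.0, []
    for p in precip:
        storage += p
        q = k * storage  # release a fixed fraction of storage each step
        storage -= q
        flows.append(q)
    return flows

def ensemble_streamflow_forecast(met_ensemble):
    """Naive ensemble approach: run every meteorological member through
    the hydrologic model, yielding one streamflow trace per member."""
    return [simulate_streamflow(member) for member in met_ensemble]

# three toy precipitation traces -> three streamflow traces
members = [[0.0, 10.0, 0.0], [5.0, 5.0, 0.0], [0.0, 0.0, 20.0]]
flows = ensemble_streamflow_forecast(members)
```

The streamflow ensemble spread here comes entirely from the meteorological ensemble, which is the intuition the next section pushes back on.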

The Hydrologic Ensemble Forecast Service (HEFS)

This is, however, not how it’s done in the Hydrologic Ensemble Forecast Service (HEFS) that is the current operational model used by NOAA/NWS river forecast centers (RFCs) and the model being primarily used for formal FIRO efforts (see Figure 4 below). Instead, HEFS has its own internal Meteorological Ensemble Forecast Processor (MEFP) that ingests an ensemble mean of temperature and precipitation inputs from the meteorological ensemble forecast (Global Ensemble Forecast System, GEFS) and then creates its own internal ensemble of meteorologic forecasts. These internally generated traces of forecasted temperature/precipitation are each used to force a single, watershed-calibrated version of the SAC-SMA hydrologic model to produce an ensemble of streamflow forecasts (Demargne et al., 2014). Why use this implementation instead of the seemingly more straightforward approach with the meteorological ensemble? I don’t have a definitive answer for that, but my understanding is that, from a statistical verification perspective, the current HEFS approach is more reliable.


Figure 4. Conceptual diagram of HEFS model (Demargne et al., 2014)

Another possible reason stems from the computational complexity challenge of meteorological ensemble forecasting mentioned in the previous section, particularly as it relates to hindcasting. Hindcasts are forecasts that are generated with a modern-day forecast model (and its modern-day skill) against historical observations of the appropriate quality (post-satellite era, generally speaking ~1979 – present). In meteorological spheres, hindcasts are important in documenting the performance of forecast models across a much larger dataset. For the purposes of advancing operational water management with forecasts, hindcasts are more than important…they are indispensable. Hindcasts form the basis for the design and testing of forecast informed water management procedures; there is no workaround for this. To generate a meteorological hindcast, the NWS has to carve out time to dedicate its warehouse-sized forecast model computational platform to generating 20-30 years of forecasts and archiving the data. This time must be balanced with the operational necessity of producing real-time forecasts for things like not getting wet going to work or maybe, I don’t know, aviation? This challenge means that hindcasts must typically be produced at reduced resolution compared to the operational forecasts. For example, the current NOAA Global Ensemble Forecast System (GEFSv12) produces operational forecasts with 31 ensemble members, but only 5 ensemble members in the hindcasts (Guan et al., 2019).

This limitation is significant. Five ensemble members are sufficient to produce a relatively good approximation of the ensemble mean, but grossly undersized for producing an adequate probabilistic streamflow ensemble. In this hindcast context, then, the construction of HEFS makes sense and enables more flexibility for generating historical forecast ensembles for the development of forecast informed water management policies.
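A quick synthetic experiment illustrates the asymmetry: with repeated 5-member draws from a known distribution, the sampling error in the ensemble mean is modest, while the error in an upper-tail quantile is much larger. The numbers below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Repeatedly draw 5-member 'ensembles' from a standard normal and track
# how well the mean vs. the upper tail is estimated each time
means, q95s = [], []
for _ in range(1000):
    members = rng.normal(0.0, 1.0, size=5)
    means.append(members.mean())
    q95s.append(np.quantile(members, 0.95))

print(np.std(means))  # ~1/sqrt(5): the mean is estimated reasonably well
print(np.std(q95s))   # substantially larger: the tail is poorly resolved
```

The mean converges quickly with ensemble size; tail quantiles, the quantities that matter most for flood operations, do not.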

Practical HEFS limitations

The current implementations of HEFS typically generate ~40 ensemble members. The MEFP is conditioned on a historical period of reference using a Schaake shuffle type procedure, so this ensemble size is actually the length (in years) of that historical conditioning period. As with meteorological forecasting, the production of HEFS hindcasts is also challenging, in that it requires the same carving out of time on an operational computational architecture to produce the hindcasts. This limitation motivates one of the core interests of folks within the formal FIRO implementation efforts: synthetic forecasting methods that can expand the length and availability of simulated hindcasts for operational policy development. In general, HEFS is reliable, but it suffers from systematic ‘Type-II conditional bias’, which in layman’s terms is the tendency to underpredict extreme events. I will dig into some of these attributes of HEFS forecasts in a future post on ensemble verification.
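For readers unfamiliar with the Schaake shuffle, the core idea is a rank reordering: ensemble values at each timestep are reassigned to members so that their rank structure matches that of historical traces, restoring realistic space-time correlation. The sketch below is my own simplified rendering of that reordering idea, not the MEFP implementation:

```python
import numpy as np

def schaake_shuffle(ensemble, historical):
    """Reorder ensemble values (members x times) so the rank structure
    across members at each timestep matches the historical traces."""
    shuffled = np.empty_like(ensemble)
    for t in range(ensemble.shape[1]):
        # rank of each historical trace's value at this timestep
        ranks = np.argsort(np.argsort(historical[:, t]))
        # assign sorted ensemble values according to those ranks
        shuffled[:, t] = np.sort(ensemble[:, t])[ranks]
    return shuffled

# 3 members x 2 timesteps: the historical traces are perfectly rank-
# correlated in time, so the shuffled members become monotone traces too
ens = np.array([[3.0, 1.0], [1.0, 3.0], [2.0, 2.0]])
hist = np.array([[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]])
out = schaake_shuffle(ens, hist)
```

The marginal distribution at each timestep is untouched (the same values appear), but the temporal sequencing of each member now mimics the historical record.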

Outside of questions of statistical reliability, there are other, harder to quantify, concerns about the implementation of HEFS. The ensemble-mean forecasts from the NOAA GEFS model that force HEFS are bound to be more ‘skillful’ than, for instance, the ensemble control forecast (note: the control forecast is the forecast trajectory initialized from the ‘best guess’ at the initial condition state of the atmosphere). This is a well-documented property of ensemble-mean predictions. However, ensemble-mean forecasts are not dynamically consistent representations of the atmosphere (Wilks, 2019). As I noted in my previous post, an ideal forecast ensemble would include all possible trajectories of the forecasts, which might diverge substantially at some future time. Where forecast trajectories have diverged, an average of those trajectories will produce a value that is not in the space of plausible dynamic trajectories. In practical terms, this leads to ensemble-mean forecasts collapsing to climatology at longer leads and to systematic reductions in the range of magnitudes of key variables as lead time increases. In short, this reliance on the meteorological ensemble-mean forecast does incur some risk (in my opinion) of missing high-impact extremes due to, for example, even relatively small spatial errors in the landfall locations of atmospheric river storms.
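A tiny synthetic example of this averaging effect: if we mimic diverging forecast trajectories with random walks, the member spread grows with lead time while the ensemble mean hugs climatology (zero here) and damps the extremes any single member might reach:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 'forecast' trajectories that progressively diverge with lead time
n_members, n_leads = 30, 10
traj = np.cumsum(rng.normal(0.0, 1.0, size=(n_members, n_leads)), axis=1)

spread = traj.std(axis=0)     # member spread grows with lead time...
ens_mean = traj.mean(axis=0)  # ...while the mean stays near climatology
```

Any one member can wander far from zero at long leads; the average of thirty diverged members cannot, which is precisely the mechanism behind the muted extremes.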

Final thoughts

The intent of this post was to provide some current operational context for the types of forecasts being considered for formal implementation of forecast informed water management operations. The ensemble forecasting space is large and exciting from both research and operational perspectives, but it’s challenging to navigate and understand what is actually being considered for use, and what is simply novel and interesting. In my next post, I’ll discuss some of the primary verification techniques used for ensemble forecasts with a focus on the performance of HEFS in an example watershed and some associated practical implementation.

References

Bauer, P., Thorpe, A., & Brunet, G. (2015). The quiet revolution of numerical weather prediction. Nature, 525(7567), 47–55. https://doi.org/10.1038/nature14956

Demargne, J., Wu, L., Regonda, S. K., Brown, J. D., Lee, H., He, M., … Zhu, Y. (2014). The science of NOAA’s operational hydrologic ensemble forecast service. Bulletin of the American Meteorological Society, 95(1), 79–98. https://doi.org/10.1175/BAMS-D-12-00081.1

Guan, H., Zhu, Y., Sinsky, E., Fu, B., Zhou, X., Li, W., et al. (2019). The NCEP GEFS-v12 Reforecasts to Support Subseasonal and Hydrometeorological Applications. 44th NOAA Annual Climate Diagnostics and Prediction Workshop, (October), 78–81.

Nearing, G. S., Kratzert, F., Sampson, A. K., Pelissier, C. S., Klotz, D., Frame, J. M., … Gupta, H. V. (2021). What Role Does Hydrological Science Play in the Age of Machine Learning? Water Resources Research, 57(3). https://doi.org/10.1029/2020WR028091

Wilks, D. S., (2019). Statistical Methods in the Atmospheric Sciences, 4th ed. Cambridge, MA: Elsevier.

Zhang, C., Brodeur, Z. P., Steinschneider, S., & Herman, J. D. (2022). Leveraging Spatial Patterns in Precipitation Forecasts Using Deep Learning to Support Regional Water Management. Water Resources Research, 58(9), 1–18. https://doi.org/10.1029/2021WR031910

Nonstationary stochastic watershed modeling

In this post, I will describe the motivation for and implementation of a nonstationary stochastic watershed modeling (SWM) approach that we developed in the Steinschneider group during the course of my PhD. This work is in final revision and should be published in the next month or so. This post will attempt to distill key components of the model and their motivation, saving the full methodological development for those who’d like to read the forthcoming paper.

SWMs vs SSM/SSG

Before diving into the construction of the model, some preliminaries are necessary. First, what are SWMs, what do they do, and why use them? SWMs are a framework that combines deterministic, process-based watershed models (think HYMOD, SAC-SMA, etc.; we’ll refer to these as DWMs from here forward) with a stochastic model that captures their uncertainty. The stochastic part of this framework can be used to generate ensembles of SWM simulations that both represent the hydrologic uncertainty and are less biased estimators of the streamflow observations (Vogel, 2017).

Figure 1: SWM conceptual diagram

SWMs were developed to address challenges to earlier stochastic streamflow modeling/generation techniques (SSM/SSG; see for instance Trevor’s post on the Thomas-Fiering SSG; Julie’s post and Lillian’s post on other SSG techniques), the most important of which (arguably) is the question of how to formulate them under non-stationarity. Since SSMs are statistical models fitted directly to historical data, any attempt to implement them in a non-stationary setting requires strong assumptions about what the streamflow response might look like under an alternate forcing scenario. This is not to say that such an approach is not useful or valid for exploratory analyses (for instance, Rohini’s post on synthetic streamflow generation to explore extreme droughts). SWMs attempt to address this issue of non-stationarity by using DWMs in their stochastic formulation, which lends some ‘physics-based’ cred to their response under alternate meteorological forcings.

Construction of an SWM

Over the years, there have been many SWM or SWM-esque approaches devised, ranging from simple autoregressive models to complex Bayesian approaches. In this work, we focus on a relatively straightforward SWM approach that models the hydrologic predictive uncertainty directly and simply adds random samples of it to the DWM simulations. The assumption here is that predictive uncertainty is an integrator of all the traditional component modeling uncertainties (input, parameter, model/structural), so adding it back in can inject all of these uncertainties into the SWM simulations at once (Shabestanipour et al., 2023).

Figure 2: Uncertainty components

By this straightforward approach, the fitting and parameter estimation of the DWM is accomplished first (and separately) via ‘standard’ fitting procedures; for instance, parameter optimization to maximize the Nash-Sutcliffe Efficiency (NSE). Subsequently, we develop the stochastic part of the model on the predictive uncertainty that remains, which, in this case, is defined simply by differencing the DWM predictions and the target observations. This distribution of differenced errors is the ‘predictive uncertainty distribution’ or ‘predictive errors’ that form the target of our stochastic model.
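In its simplest form, the procedure above can be sketched in a few lines; the numbers are toy values and the error sign convention is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

# toy observations and a (pre-fitted) DWM simulation of the same period
obs = np.array([5.0, 8.0, 12.0, 7.0, 4.0])
sim = np.array([6.0, 7.5, 10.0, 8.0, 4.5])

errors = sim - obs  # the empirical 'predictive errors'

# simplest possible SWM: resample the empirical errors at random and
# add them back onto the simulation to form an ensemble
n_ens = 100
ensemble = sim + rng.choice(errors, size=(n_ens, len(sim)))
```

This naive version treats the errors as independent and identically distributed, which, as the next section explains, real hydrologic errors most definitely are not.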

Challenges in modeling predictive uncertainty

Easy, right? Not so fast. There is a rather dense and somewhat unpalatable literature (except for the masochists out there) on the subject of hydrologic uncertainty that details the challenges in modeling these sorts of errors. Suffice it to say that they aren’t well behaved. Any model we devise for these errors must be able to manage these bad behaviors.

So, what if we decide that we want to try to use this SWM thing for planning under future climates? Certainly the DWM part can hack it. We all know that lumped, conceptual DWMs are top-notch predictors of natural streamflow… At the least, they can produce physically plausible simulations under alternate forcings (we think). What of the hydrologic predictive uncertainty then? Is it fair or sensible to presume that some model we constructed to emulate historical uncertainty is appropriate for future hydrologic scenarios with drastically different forcings? My line of rhetorical questioning should clue you in on my feelings on the subject. YES!, of course. ‘Stationarity is immortal!’ (Montanari & Koutsoyiannis, 2014).

Towards a hybrid, state-variable dependent SWM

No, actually, there are a number of good reasons why this statement might not hold for hydrologic predictive uncertainty under non-stationarity. You can read the paper for the laundry list. In short, hydrologic predictive uncertainty of a DWM is largely a reflection of its structural misrepresentation of the true process. Thus, the historical predictive uncertainty that we fit our model to is a reflection of that structural uncertainty propagated through historical model states under historical, ‘stationary’ forcings. If we fundamentally alter those forcings, we should expect to see model states that do not exist under historical conditions. The predictive errors that result from these fundamentally new model states are thus likely to not fall neatly into the box carved out by the historical scenarios.

Figure 3: Structural uncertainty

To bring this back to the proposition of a nonstationary SWM approach: the intrinsic link between model structure and its predictive uncertainty raises an interesting prospect. Could there be a way to leverage a DWM’s structure to understand its predictive uncertainty? Well, I hope so, because that’s the premise of this work! What I’ll describe in the ensuing sections is the implementation of a hybrid, state-variable dependent SWM approach. ‘Hybrid’ because it couples both machine learning (ML) and traditional statistical techniques. ‘State-variable dependent’ because it uses the timeseries of model states (described later) as the means to infer the hydrologic predictive uncertainty. I’ll refer to this as the ‘hybrid SWM’ for brevity.

Implementation of the hybrid SWM

So, with backstory in hand, let’s talk details. The remainder of this post will describe the implementation of this hybrid SWM. This high-level discussion of the approach supports a practical training exercise I put together for the Steinschneider group at the following public GitHub repo: https://github.com/zpb4/hybrid-SWM_training. This training also introduces a standard implementation of a GRRIEN repository (see Rohini’s post). Details of implementing the code are contained in the ‘README.md’ and ‘training_exercise.md’ files in the repository. My intent in this post is to describe the model implementation at a conceptual level.

Model-as-truth experimental design

First, in order to address the problem of non-stationary hydrologic predictive uncertainty, we need an experimental design that can produce it. There is a very real challenge here of not having observational data from significantly altered climates to compare our hydrologic model against. We address this problem by using a ‘model-as-truth’ experimental design, where we fit one hydrologic model (‘truth’ model) to observations, and a second hydrologic model (‘process’ model) to the first truth model. The truth model becomes a proxy for the true, target flow of the SWM modeling procedure, while the process model serves as our proposed model, or hypothesis, about that true process. Under this design, we can force both models with any plausible forcing scenario to try to understand how the predictive uncertainty between ‘truth’ and ‘process’ models might change.

Figure 4: Conceptual diagram of ‘model-as-truth’ experimental design
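The design can be sketched in a few lines of code. The `ToyModel` below is a deliberately trivial stand-in (flow proportional to precipitation) for SAC-SMA/HYMOD. Note that because the two toy models share the same structure, their predictive errors vanish; that is exactly the point of the design, since it is the structural mismatch between truth and process models that generates the errors of interest:

```python
class ToyModel:
    """Stand-in watershed model: flow = a * precip, 'calibrated' by a
    least-squares fit of the slope through the origin."""
    def fit(self, forcing, target):
        self.a = (sum(f * t for f, t in zip(forcing, target))
                  / sum(f * f for f in forcing))
        return self
    def simulate(self, forcing):
        return [self.a * f for f in forcing]

# truth model <- observations; process model <- truth model output
obs_forcing = [1.0, 2.0, 3.0]
obs_flow = [2.1, 3.9, 6.2]
truth = ToyModel().fit(obs_forcing, obs_flow)
process = ToyModel().fit(obs_forcing, truth.simulate(obs_forcing))

# under any scenario forcing, 'observed' flow comes from the truth model
scenario = [4.0, 5.0]
errors = [qt - qp for qt, qp in
          zip(truth.simulate(scenario), process.simulate(scenario))]
```

Swap in two structurally different models (as SAC-SMA and HYMOD are) and the errors no longer vanish; they become the target the hybrid SWM is built to emulate under any forcing scenario we care to impose.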

For the actual work, we consider a very simple non-stationary scenario in which we implement a 4°C shift to the temperature forcing data, which we refer to as the ‘Test+4C’ scenario. We choose this simple approach to confine non-stationarity to a high-confidence result of anthropogenic climate change, namely, thermodynamic warming. We compare this Test+4C scenario to a ‘Test’ scenario, which is the same out-of-sample temporal period (WY2005-2018) of meteorological inputs under historical values. SAC-SMA and HYMOD are the truth model and process model for this experiment, respectively. Other models could have been chosen; we chose these because they are conceptually similar and commonly used.

Figure 5: Errors between truth and process models in 5 wettest years of Test/Test+4C scenarios.

Hybrid SWM construction

The core feature of the hybrid SWM is a model for the predictive errors (truth model – process model) that uses the hydrologic model state-variables as predictors. We implement this model in two steps that have differing objectives, but use the same state-variable predictor information. An implicit assumption in using state-variable dependencies in both steps is that such dependencies can persist into the second stage; in other words, we do not expect the error-correction step to produce independent and identically distributed residuals. We call the first step an ‘error-correction model’ and the second step a ‘dynamic residual model’. Since we use HYMOD as our process model, we use its state-variables (Table 1) as the predictors for these two steps.

Table 1: HYMOD state variables

| Short Name | Long Name | Description |
| --- | --- | --- |
| sim | Simulation | HYMOD predicted streamflow in mm |
| runoff | Runoff | Upper reservoir flow of HYMOD in mm |
| baseflow | Baseflow | Lower reservoir flow of HYMOD in mm |
| precip | Precipitation | Basin averaged precipitation in mm |
| tavg | Average temperature | Basin averaged temperature in °C |
| et | Evapotranspiration | Modeled evapotranspiration (Hamon approach) in mm |
| upr_sm | Upper soil moisture | Basin averaged soil moisture content (mm) in upper reservoir |
| lwr_sm | Lower soil moisture | Basin averaged soil moisture (mm) in lower reservoir |
| swe | Snow water equivalent | Basin averaged snow water equivalent simulated by degree day snow module (mm) |

Hybrid SWM: Error correction

The error-correction model is simply a predictive model between the hydrologic model (HYMOD) state-variables and the raw predictive errors. The error-correction model also uses the lag-1 to lag-3 errors as covariates to account for autocorrelation. The objective of this step is to infer state-dependent biases in the errors, which are the result of the predictive errors subsuming the structural deficiencies of the hydrologic model. This ‘deterministic’ behavior in the predictive errors can also be conceived of as the ‘predictive errors doing what the model should be doing’ (Vogel, 2017). Once this error-correction model is fit to its training data, it can be implemented against any new timeseries of state-variables to predict and debias the errors. We use a Random Forest (RF) algorithm for this step because RFs are robust to overfitting, even with limited training data. This is certainly the case here, as we consider only individual basins and a ~15-year training period (WY1989-2004). Moreover, we partition the training period into a calibration and validation subset and fit the RF error-correction model only to the calibration data (WY1989-1998), reducing available RF training data to 9 years.
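A hedged sketch of this step using scikit-learn’s off-the-shelf random forest, with synthetic data and an invented state-dependent bias baked in (the real model’s predictors and fitting details follow the paper):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
n = 365 * 9  # roughly the 9-year calibration subset

# toy state variables (think upr_sm, swe, precip) and predictive errors
# with a synthetic state-dependent bias baked in
states = rng.normal(size=(n, 3))
errors = 0.8 * states[:, 0] + rng.normal(0.0, 0.5, size=n)

# covariates: state variables plus the lag-1..3 errors
X = np.column_stack([states[3:], errors[2:-1], errors[1:-2], errors[:-3]])
y = errors[3:]

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
residuals = y - rf.predict(X)  # handed off to the dynamic residual model
```

The RF soaks up the state-dependent (deterministic) portion of the errors; what it cannot explain remains in `residuals` for the stochastic stage.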

Hybrid SWM: Dynamic residual model

The dynamic residual model (DRM) is fit to the residuals of the error correction result in the validation subset. We predict the hydrologic model errors for the validation subset from the fitted RF model and subtract them from the empirical errors to yield the residual timeseries. By fitting the DRM to this separate validation subset (which the RF error-correction model has not seen), we ensure that the residuals adequately represent the out-of-sample uncertainty of the error-correction model.

A full mathematical treatment of the DRM is outside the scope of this post. In high-level terms, the DRM is built around a flexible distributional form particularly suited to hydrologic errors, called the skew exponential power (SEP) distribution. This distribution has 4 parameters (mean μ, standard deviation σ, kurtosis β, and skew ξ) and we assume a mean of zero (due to error-correction debiasing), while setting the other 3 parameters as time-varying predictands of the DRM model (i.e. σ_t, β_t, ξ_t). We also include a lag-1 autocorrelation term (φ_t) to account for any leftover autocorrelation from the error-correction procedure. We formulate a linear model for each of these parameters with the state-variables as predictors. These linear models are embedded in a log-likelihood function that is maximized (i.e. MLE) against the residuals to yield the optimal set of coefficients for each of the linear models.
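To illustrate the MLE machinery on a simpler stand-in: below, a single toy state variable drives a time-varying standard deviation through a log-linear model fit by maximum likelihood. The real DRM uses the 4-parameter SEP with additional time-varying parameters; a Gaussian is substituted here purely to keep the sketch short:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 2000
state = rng.uniform(0.0, 1.0, size=n)  # one toy state-variable predictor

# synthetic residuals whose spread depends on the state (heteroscedastic)
resid = rng.normal(0.0, np.exp(-1.0 + 1.5 * state))

# DRM idea in miniature: sigma_t = exp(b0 + b1 * state_t), fit by MLE
def neg_log_lik(theta):
    b0, b1 = theta
    sigma = np.exp(b0 + b1 * state)
    # Gaussian negative log-likelihood (constants dropped)
    return np.sum(0.5 * (resid / sigma) ** 2 + np.log(sigma))

fit = minimize(neg_log_lik, x0=[0.0, 0.0], method="Nelder-Mead")
b0_hat, b1_hat = fit.x
```

The fitted coefficients recover the state-dependence of the spread; in the full model, the same embedding trick is applied simultaneously to σ_t, β_t, ξ_t, and φ_t under the SEP likelihood.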

With a fitted model, the generation of a new residual at each timestep t is therefore a random draw from the SEP with parameters (mu = 0, sigma_t, beta_t, xi_t), modified by the residual at t-1 (epsilon_{t-1}) via the lag-1 coefficient (phi_t).
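A rough sketch of this residual-generation step, substituting SciPy’s skew-normal for the SEP (which is not available in SciPy, and unlike the SEP is not exactly mean-zero) and holding the skew and lag-1 coefficient fixed for simplicity; all numeric values are illustrative, not fitted coefficients:

```python
import numpy as np
from scipy.stats import skewnorm

rng = np.random.default_rng(1)

# Hypothetical daily state variable and illustrative coefficient values (not fitted)
soil_moisture = rng.uniform(0, 1, 365)
b0, b1 = 0.1, 0.5   # linear model for the spread: sigma_t = b0 + b1 * soil_moisture_t
skew_a = 2.0        # fixed skewness stand-in; in the DRM, beta_t and xi_t also vary in time
phi = 0.4           # lag-1 autocorrelation coefficient (held constant here)

resid = np.zeros(365)
for t in range(1, 365):
    sigma_t = b0 + b1 * soil_moisture[t]                 # state-dependent spread
    eps = skewnorm.rvs(skew_a, loc=0, scale=sigma_t, random_state=rng)
    resid[t] = phi * resid[t - 1] + eps                  # lag-1 structure on the residuals
```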

Figure 6: Conceptual diagram of hybrid SWM construction.

Hybrid SWM: Simulation

The DRM is the core uncertainty modeling component of the hybrid SWM. Given a timeseries of state-variables from the hydrologic model for any scenario, the DRM simulation is implemented first, as described in the previous section. Subsequently, the error-correction model is implemented in ‘predict’ mode with the timeseries of random residuals from the DRM step. Because the error-correction model includes lag-1 to 3 terms, it must be implemented sequentially using the errors generated at the previous 3 timesteps. The conclusion of these two simulation steps yields a timeseries of randomly generated, state-variable dependent errors that can be added to the hydrologic model simulation to produce a single SWM simulation. Repeating this procedure many times produces an ensemble of SWM simulations.

Final thoughts

Hopefully this discussion of the hybrid SWM approach has given you some appreciation for the nuanced differences between SWMs and SSM/SSGs, the challenges in constructing an adequate uncertainty model for an SWM, and the novel approach developed here in utilizing state-variable information to infer properties of the predictive uncertainty. The hybrid SWM approach shows a lot of potential for extracting key attributes of the predictive errors, even under unprecedented forcing scenarios. It decouples the task of inferring predictive uncertainty from features of the data like temporal seasonality (e.g. day of year) that may be poor predictors under climate change. When linked with stochastic weather generation (see Rohini’s post and Nasser’s post), SWMs can be part of a powerful bottom-up framework to understand the implications of climate change on water resources systems. Keep an eye out for the forthcoming paper and check out the training noted above on implementation of the model.

References:

Brodeur, Z., Wi, S., Shabestanipour, G., Lamontagne, J., & Steinschneider, S. (2024). A Hybrid, Non‐Stationary Stochastic Watershed Model (SWM) for Uncertain Hydrologic Simulations Under Climate Change. Water Resources Research, 60(5), e2023WR035042. https://doi.org/10.1029/2023WR035042

Montanari, A., & Koutsoyiannis, D. (2014). Modeling and mitigating natural hazards: Stationarity is immortal! Water Resources Research, 50, 9748–9756. https://doi.org/10.1002/2014WR016092

Shabestanipour, G., Brodeur, Z., Farmer, W. H., Steinschneider, S., Vogel, R. M., & Lamontagne, J. R. (2023). Stochastic Watershed Model Ensembles for Long-Range Planning: Verification and Validation. Water Resources Research, 59. https://doi.org/10.1029/2022WR032201

Vogel, R. M. (2017). Stochastic watershed models for hydrologic risk management. Water Security, 1, 28–35. https://doi.org/10.1016/j.wasec.2017.06.001

Weather Regime-Based Stochastic Weather Generation (Part 2/2)

In this post on the Water Programming Blog, we continue to explore the application of the stochastic weather generator (available on GitHub) for climate-change scenario developments. This is the second installment of a two-part series of blog posts, and readers are strongly encouraged to familiarize themselves with different components of the stochastic weather generator, as explained in Part 1 by Rohini Gupta (The Reed Research Group).

Here, we will begin by offering a concise overview of developing climate change scenarios and how these scenarios are integrated into the weather generation model. Following this, we will proceed to interpret the impact of these climatic change conditions on key statistical insights concerning occurrences of floods and droughts. Through these examples, the implications of these climatic shifts on water resources management and flood risk analysis will become evident.

Climate Change Perturbations

In this stochastic weather generator, we specifically focus on two aspects of climate change scenario development: 1) thermodynamic and 2) dynamic perturbations.

1) Thermodynamic Change

Thermodynamic climate change, often referred to as temperature-driven change, is primarily driven by changes in the Earth’s energy balance and temperature distribution. This warming affects various aspects of the climate system, such as intensifying precipitation extremes, melting snowpacks and ice sheets, rising sea levels, altered weather patterns, and shifts in ecosystems. The primary driver of temperature-driven climate change is the increase in regional-to-global average temperatures due to the enhanced greenhouse effect. As temperatures rise due to natural and anthropogenic forcings, they trigger a cascade of interconnected impacts throughout the climate system.

In the stochastic weather generator, scenarios of temperature change are treated simply by adding trends to simulated temperature data for each location across the spatial domain. However, scenarios of thermodynamic precipitation intensification are modeled using a quantile mapping technique, scaling the distribution of daily precipitation in a way that replicates the effects of warming temperatures on precipitation as the moisture holding capacity of the atmosphere increases. In the context of California, previous studies have demonstrated that as temperatures rise, the most severe precipitation events (often associated with landfalling Atmospheric Rivers) are projected to increase in frequency, while the intensity of smaller precipitation events is expected to decrease (Gershunov et al., 2019). This alteration effectively stretches the distribution of daily precipitation, causing extreme events to become more pronounced while reducing the occurrence and strength of lighter precipitation events. We replicate this phenomenon by making adjustments to the statistical characteristics and distribution percentiles of precipitation (e.g., Pendergrass and Hartmann, 2014). To further elaborate, we select a scaling factor for the 99th percentile of nonzero precipitation and then modify the gamma-GPD mixed distribution to enforce this chosen scaling factor. For instance, in a scenario with 3°C of warming and a 7% increase in extreme precipitation per °C (matching the theoretical Clausius-Clapeyron rate of increase in atmospheric water holding capacity due to warming; Najibi and Steinschneider, 2023), the most extreme precipitation events are projected to increase by approximately 22.5% (1.07^3 ≈ 1.225). We adjust the gamma-GPD models fitted to all locations to ensure this percentage increase in the upper tail of the distribution. Assuming that mean precipitation remains constant at baseline levels, this adjustment will cause smaller precipitation values in the gamma-GPD model to decrease, compensating for the increase in extreme events by stretching the distribution of nonzero precipitation.
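A simplified empirical analogue of this tail-stretching adjustment (the actual implementation refits the gamma-GPD mixture; here we just rescale a synthetic sample while preserving its mean):

```python
import numpy as np

rng = np.random.default_rng(2)
precip = rng.gamma(shape=0.5, scale=8.0, size=20000)  # synthetic nonzero daily precip (mm)

scale_99 = 1.07 ** 3          # ~22.5% increase in the 99th percentile (7%/degC x 3 degC)
q99 = np.quantile(precip, 0.99)

# Stretch the upper tail, then shrink the rest so the mean stays at its baseline value
stretched = np.where(precip > q99, precip * scale_99, precip)
excess = stretched.mean() - precip.mean()
below = stretched <= q99
shrink = 1 - excess * stretched.size / stretched[below].sum()
perturbed = np.where(below, stretched * shrink, stretched)
```

The result mimics the stated behavior: extremes intensify while light precipitation weakens, holding mean precipitation at baseline.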

Lines 33-40 from ‘config.simulations.R’ show the user-defined changes to implement the thermodynamic scenarios based on temperature in Celsius (tc: e.g. 1 °C), percent change in extreme precipitation quantile (pccc: e.g. 7% per °C), and percent change in average precipitation (pmuc: e.g. 12.5% decrease) inputs. Needless to say, the stochastic weather generator runs in baseline mode if tc = 0, pccc = 0, and pmuc = 0.

    ##-------------Define perturbations-------------##
    ##climate changes and jitter to apply:
    change.list <- data.frame("tc"=  c(0), # {e.g., 0, 1, 2, ...} (changes in temperature)
                              "jitter"= c(TRUE),
                              "pccc"= c( 0.00), # {e.g., 0, 0.07, 0.14, ...} (changes for precipitation extreme quantile -- CC)
                              "pmuc"= c( 0.00)# {e.g., 0, -.125, .125, ...} (changes in precipitation mean)
    )
    ##----------------------------------------------##

2) Dynamic Change

Dynamic climate change, also known as circulation-driven change, is driven by shifts in atmospheric and oceanic circulation patterns. These circulation patterns are influenced by a variety of factors, including temperature gradients, differences in air pressure, and Earth’s rotation. Changes in these patterns can lead to alterations in weather patterns, precipitation distribution, and regional climate characteristics. One well-known example of dynamic climate change is the phenomenon of El Niño and La Niña, which involve changes in ocean temperatures and atmospheric pressure in the Pacific Ocean. These events can significantly impact local-to-global weather patterns, causing droughts, heavy rainfall, and other extreme weather events (Pfahl et al., 2017).

Dynamic changes impact the evolution of weather patterns and can modify the occurrence frequency of these patterns. This influence can occur through direct adjustments to the transition probabilities between different weather regimes, or indirectly by modifying the covariates that govern the progression of these weather regimes. In Steinschneider et al. (2019), a Niño 3.4 index is used to force weather regime evolution and is systematically adjusted to create more frequent El Niño and La Niña events. In Gupta et al. (in review), a 600-year long sequence of tree-ring reconstructed principal components of weather regime occurrence are used as an alternative covariate to better capture natural variability inherent in the weather regimes.

In the most recent version of the stochastic weather generator, we developed a novel non-parametric approach to the simulation of weather regimes, allowing for future dynamic change scenarios with altered (customized) weather regime probabilities. Assume that the historical time series of weather regimes is divided into distinct, consecutive, non-overlapping segments, each with a duration of D years, for a total of ND segments (here, D = 4 and ND = 18). In the non-parametric method, each segment (indexed as n = 1 to ND) is assigned a sampling probability denoted as pn. To generate a new sequence of daily weather regimes spanning any desired number of years, the procedure involves resampling (with replacement) the nth D-year segment of daily weather regimes using the corresponding probability pn. This process is repeated until the required number of years of simulated weather regimes has been attained. If needed, the last segment can be trimmed to ensure the precise desired duration of simulated weather regimes.
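The resampling procedure can be sketched as follows, on a hypothetical daily regime record (regime labels and record length are illustrative; the actual generator is implemented in R):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical daily weather-regime record: 72 years split into ND=18 segments of D=4 years
D, ND, days_per_year = 4, 18, 365
historic_wrs = rng.integers(0, 10, size=D * ND * days_per_year)  # regime labels 0-9
segments = historic_wrs.reshape(ND, D * days_per_year)

# Baseline: equal sampling probability for each segment
p = np.full(ND, 1 / ND)

# Resample segments (with replacement) until the target length is reached, then trim
target_years = 100
target_days = target_years * days_per_year
n_draws = int(np.ceil(target_days / (D * days_per_year)))
drawn = rng.choice(ND, size=n_draws, p=p)
simulated = np.concatenate(segments[drawn])[:target_days]
```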

In the baseline scenario for the weather generator with no dynamic climate change (only thermodynamic change), each segment is considered equally likely (i.e., no changes to large-scale circulation patterns).

However, the probabilities pn can be adjusted to alter the frequencies of each of the identified weather regimes in the final simulation, enabling the generation of dynamic climate change scenarios (i.e., scenarios in which the frequencies of different atmospheric flow patterns change compared to their historical frequencies). This is achieved using a linear program (not shown), whose goal is to identify new sampling probabilities pn that, when used in the non-parametric simulation approach above, create a sequence of weather regimes with long-term average frequencies that approach some vector of target probabilities for those identified weather regimes.
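One way such a linear program might be set up: a simplified variant with scipy.optimize.linprog, on hypothetical per-segment regime frequencies, that minimizes the L1 deviation of the achieved long-run frequencies from the target (the actual model, with its lp.threshold and piecewise penalty on probability changes, is more involved):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
ND, n_wr = 18, 10

# Hypothetical per-segment weather-regime frequencies (each row sums to 1)
F = rng.dirichlet(np.ones(n_wr), size=ND)   # shape (ND, n_wr)
baseline = F.mean(axis=0)                   # frequencies under equal sampling

# Target: 30% more of WR3 (index 2), renormalized
target = baseline.copy()
target[2] *= 1.3
target /= target.sum()

# Variables x = [p_1..p_ND, v_1..v_nwr]; minimize sum(v), where v_j >= |(F^T p)_j - target_j|
c = np.concatenate([np.zeros(ND), np.ones(n_wr)])
Iw = np.eye(n_wr)
A_ub = np.block([[F.T, -Iw], [-F.T, -Iw]])
b_ub = np.concatenate([target, -target])
A_eq = np.concatenate([np.ones(ND), np.zeros(n_wr)]).reshape(1, -1)
b_eq = [1.0]   # sampling probabilities must sum to one

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
p_new = res.x[:ND]                # adjusted segment sampling probabilities
achieved = F.T @ p_new            # long-run WR frequencies under the new probabilities
```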

Lines 91-126 from ‘config.simulations.R’ show the user-defined changes to implement a non-parametric scenario with equal probabilities (0: no change to the historical sequence of weather regimes) to ten weather regimes, i.e., dynamic scenario #0; and a 30% increase in weather regime number three (a dry weather condition) in California, i.e., dynamic scenario #1.

##Choose below whether through parametric or non-parametric way to create the simulated WRs ##
    use.non_param.WRs <- TRUE #{TRUE, FALSE}: TRUE for non-parametric, FALSE for parametric simulated WRs

    dynamic.scenario  <- 0 # {0, 1, 2}: 0: no dynamic change; 1: dynamic scenario #1 (30% increase in WR3); or 2: dynamic scenario #2 (linear trend)

    if (use.non_param.WRs){      #----------- 1+2 dynamic scenarios ----------#
      if (dynamic.scenario==0){
        ##===> Attempt #0 (thermodynamic only; no change to freq of WRs) ===##
        # #specify target change (as a percent) for WR probabilities
        WR_prob_change <- c(0,0,0,0,0,0,0,0,0,0) # between 0 and 1
        # #how close (in % points) do the WR frequencies (probabilities) need to be to the target
        lp.threshold <- 0.00001
        # #how much change do we allow in a sub-period sampling probability before incurring a larger penalty in the optimization
        piecewise_limit <- .02
        
      }else if(dynamic.scenario==1){
        ##===> Attempt #1 (dynamic scenario #1) ===##
        # #specify target change (as a percent) for WR probabilities (if, increasing WR3 in future)
        WR_prob_change <- c(0,0,.3,0,0,0,0,0,0,0) # between 0 and 1
        # #how close (in % points) do the WR frequencies (probabilities) need to be to the target
        lp.threshold <- 0.007
        # #how much change do we allow in a sub-period sampling probability before incurring a larger penalty in the optimization
        piecewise_limit <- .02
        
      }else if(dynamic.scenario==2){
        ##===> Attempt #2 (dynamic scenario #2) ===##
        # specify target change (as a percent) for WR probabilities (if, continuing their current trends in future)
        WR_prob_change <- c(-0.09969436,  0.27467048,  0.33848792,
                            -0.28431861, -0.23549986,  0.03889970,
                            -0.05628958, 0.38059153, -0.16636739, -0.17995965) # between 0 and 1
        # how close (in % points) do the WR frequencies (probabilities) need to be to the target
        lp.threshold <- 0.008
        # how much change do we allow in a sub-period sampling probability before incurring a larger penalty in the optimization
        piecewise_limit <- .02
      }
    }

Stochastic Weather Generation for Climate Change Scenarios

The stochastic weather generator is utilized to generate the two climate change scenarios defined above. The specific configuration for each scenario is as follows:

  • Thermodynamic Scenario: 3°C increase in temperature, 7% per °C increase in precipitation extremes, no change in average precipitation.
  • Dynamic Scenario: 30% increase in the occurrence frequency of one weather regime only, labeled ‘WR3’, which exhibits a ridge directly over the northwest US that blocks moisture flow over California, resulting in dry conditions during the cold season.

Thus, we generated 1008 years of simulated precipitation and temperature for each of the 12 sites in the Tuolumne River Basin in California (similar to Part 1) following these two scenarios. Below is a series of figures to better understand the impact of each scenario on precipitation and temperature statistical distributions and important climate extremes at the basin scale.

The two Figures below present the cumulative distribution function (CDF) of the generated scenario for precipitation (left) and temperature (right) based on the thermodynamic and dynamic change, respectively. The observed time-series of precipitation and temperature across these 12 sites is also illustrated.

As seen above, although the 3°C warming is clearly manifested in the alteration of the simulated temperature’s CDF, it is hard to notice any drastic shift in the overall CDF of the precipitation time series, as the bulk of the distribution has not been altered (recall that average precipitation remained constant while its extreme quantile scaled up by ~22.5%).

Similarly, while the CDF of precipitation demonstrates a slight shift towards drier conditions, we notice a large shift in the tail of the temperature distribution.

Now, we go ahead and examine a set of important indices for climate risk assessment of water systems. The Figure below presents the 1-day precipitation maxima derived from the generated scenario for precipitation based on the thermodynamic (left) and dynamic (right) change.

In the plot above, depicted for thermodynamic climate change, the median 1-day precipitation extreme at the basin scale across the entire synthetic weather generation period demonstrates a 25.5% increase in magnitude relative to the historical period, consistent with the event-based precipitation scaling of 22.5% at each site. This metric, however, remains almost unchanged in the dynamic climate change scenario.

Finally, the Figure below shows the 5-year drought derived from the generated scenario for water-year precipitation total, based on the thermodynamic (left) and dynamic (right) change.

The boxplots presented above for the thermodynamic scenario reveal a consistent median 5-year drought magnitude, as anticipated (no shift in the bulk of the average precipitation distribution). In contrast, the dynamic climate change scenario exhibits a substantial exacerbation, with the 5-year drought magnitude worsening by around 9% compared to the historical record.

There is plenty more to explore! The stochastic weather generator is suitable to quickly simulate a long sequence of weather variables that reflect any climate change of interest. Keep an eye out for upcoming additions to the repository in the coming months, and do not hesitate to contact us or create a GitHub issue if you need clarification.

References

Gershunov, A., Shulgina, T., Clemesha, R.E.S. et al. (2019). Precipitation regime change in Western North America: The role of Atmospheric Rivers. Scientific Reports, 9, 9944. https://doi.org/10.1038/s41598-019-46169-w.

Gupta, R.S., Steinschneider S., Reed, P.M. Understanding Contributions of Paleo-Informed Natural Variability and Climate Changes on Hydroclimate Extremes in the Central Valley Region of California. Authorea. March 13, 2023. https://doi.org/10.22541/essoar.167870424.46495295/v1

Najibi, N., and Steinschneider, S. (2023). Extreme precipitation-temperature scaling in California: The role of Atmospheric Rivers, Geophysical Research Letters, 50(14), 1–11, e2023GL104606. https://doi.org/10.1029/2023GL104606.

Pendergrass, A.G., and Hartmann, D.L. (2014). Two modes of change of the distribution of rain, Journal of Climate, 27(22), 8357-8371. https://doi.org/10.1175/JCLI-D-14-00182.1

Pfahl, S., O’Gorman, P.A., Fischer, E.M. (2017). Understanding the regional pattern of projected future changes in extreme precipitation, Nature Climate Change, 7 (6), 423-427. http://dx.doi.org/10.1038/nclimate3287

Steinschneider, S., Ray, P., Rahat, S. H., & Kucharski, J. (2019). A weather‐regime‐based stochastic weather generator for climate vulnerability assessments of water systems in the western United States. Water Resources Research, 55(8), 6923-6945. https://doi.org/10.1029/2018WR024446

An overview of the National Hydrologic Model

Over the past several months, I have been working with data from the US Geological Survey’s (USGS) National Hydrologic Model (NHM), a valuable resource that required some time to become familiar with. The goal of this post is to provide an overview of the NHM, incorporating as many links as possible, with the hope of encouraging others to utilize these resources and serving as a springboard for further investigation.

Why should you care about the NHM?

Water systems modelers are consistently in need of data. You might find that the specific streamflow data you seek does not exist, or perhaps you have some data but want to expand your training set. You may also wish to broaden your data to include not only streamflow but also estimates of various geophysical, hydrological, or environmental variables such as catchment area, vegetation classifications, and evapotranspiration.

The NHM can quench your data thirst by offering Continental US (CONUS) scale estimates of different geo-hydro-climatic variables synthesized from multiple datasets.

You can access these NHM datasets (simulated streamflow, land-cover parameter values, etc.) yourself through the ScienceBase website. However, it is first beneficial to understand the various components of the NHM infrastructure more broadly.


Introduction

The National Hydrologic Model (NHM) infrastructure is designed with the goal…

“… to facilitate hydrologic model parameterization on the basis of datasets that characterize geology, soil, vegetation, contributing area, topography, growing season, snow dynamics, climate, solar radiation, and evapotranspiration across the CONUS.”

The NHM includes several different components:

  1. The Geospatial Fabric
  2. Model input datasets
  3. Physical models used to simulate hydrologic processes

In this post, I will delve into each of these components, providing more information, and conclude by pointing out ways you can access the model outputs yourself.

The Geospatial Fabric

The geospatial fabric contains spatial data that serves as the skeletal structure of the NHM, facilitating the routing of modeled streamflow across catchments. The image below shows the CONUS-scale geospatial fabric.

The geospatial fabric contains individual stream reaches (called “segments”), delineated sub-catchments (called “Hydrologic Response Units”, or HRUs), and many specific points of interest which correspond to USGS observational gauge locations.

The raw data for version 1.1 of the Geospatial Fabric can be found here.

The spatial data is largely drawn from the National Hydrography Dataset (NHD) which defines the high-resolution stream linkages and provides a unique identifier (called “ComID”) for each stream segment.

If you are looking to set up a workflow which uses NHM streamflow data, you will need to specify the ComID numbers for your locations of interest. You can then retrieve the data for each ComID location.

If you are doing this with Python, then I suggest you check out the PyNHD package, which I previously highlighted on this blog, and which can identify NHD ComID numbers based on coordinate points.

For more information on the geospatial fabric, you can see Bock et al. (2020).

PRMS

The Precipitation-Runoff Modeling System (PRMS) is described by the USGS as being:

“a deterministic, distributed-parameter, physical process based modeling system developed to evaluate the response of various combinations of climate and land use on streamflow and general watershed hydrology.”

The PRMS simulates many different hydrologic processes such as snowmelt, infiltration, groundwater recharge, surface runoff, and finally streamflow. A conceptual representation of the PRMS, taken from Markstrom et al. (2015) shows the modeled relationships between these processes:

The input data for the PRMS is flexible, but requires some combination of precipitation, air temperature, and solar radiation timeseries. Historic Daymet data provide the climate forcings for the historic period, but future climate projections can also be used.

Additionally, the model requires a set of catchment-delineated parameter values which quantify things such as soil types, types of vegetation coverage, percentage of impervious surfaces, etc. These data can be provided by the National Land Cover Database (NLCD), or alternative land cover change scenarios can be used to assess the impact of land surface on hydrology.

The PRMS thus provides a strong physical-process basis when coupled with the NHM. The combination of these two models is simply referred to as the NHM-PRMS.

You can access the user manual for the PRMS here, and a report describing the latest version (PRMS-IV) from Markstrom et al (2015) here.

Accessing NHM datasets

The NHM infrastructure is great in the sense that it is flexible and incorporates so many different datasets. However, this may introduce difficulty in running the models yourself (I have not attempted this, and don’t have guidance on that).

Fortunately, there are some datasets which have been shared by the USGS, and which conveniently provide various model outputs or calibrated parameters without the need to interact with the model directly.

You can access and explore available datasets from the NHM through the ScienceBase website.

A few notable datasets that you might be interested in using are:

NHM-PRMS simulated streamflow from 1980-2016

Here I want to highlight one of these datasets that I have the most experience working with, and which I believe may be the most interesting to the WaterProgramming crowd: the “Application of the National Hydrologic Model Infrastructure with the Precipitation-Runoff Modeling System (NHM-PRMS), 1980-2016, Daymet Version 3 calibration“.

This dataset contains daily streamflow values across CONUS for the period from 1980-2016, at each HRU and segment contained in the geospatial fabric, and is available through the link here.

Note: Given the scale of this dataset, the file is rather large (92 GB).

Here I want to show how easy it can be to access the streamflow timeseries from this dataset. Assuming that you have downloaded the full NHM-PRMS dataset file for the fully observed-calibrated and Muskingum routing simulation (byHRU_musk_obs.tar), you can extract the segment streamflow timeseries using just a few lines of Python code:

## Example of how to extract streamflow data from NHM-PRMS
import tarfile
import netCDF4 as nc
import pandas as pd

# Open the archive and extract just the segment outflow file
tar = tarfile.open('your_data_directory/byHRU_musk_obs.tar')
tar.extract('seg_outflow.nc', path='./extracted/')

# Open the extracted netCDF
segment_outflow = nc.Dataset('./extracted/seg_outflow.nc')

# Store values and segment IDs
vals = segment_outflow['seg_outflow'][:]
segment_ids = segment_outflow['segment'][:]

# Make a dataframe
segment_outflow_df = pd.DataFrame(vals, columns = segment_ids)

At this point, segment_outflow_df will contain data from all of CONUS, and you will likely want to choose a subset of this data using the ComID numbers for each segment; I’ll have to leave that part up to you!

I hope this post helped to shine some light on this great resource, and encourages you to consider leveraging the NHM in your own work. As always, thanks for reading!

References

  1. Bock, A.E, Santiago,M., Wieczorek, M.E., Foks, S.S., Norton, P.A., and Lombard, M.A., 2020, Geospatial Fabric for National Hydrologic Modeling, version 1.1 (ver. 3.0, November 2021): U.S. Geological Survey data release, https://doi.org/10.5066/P971JAGF.
  2. Regan, R.S., Markstrom, S.L., Hay, L.E., Viger, R.J., Norton, P.A., Driscoll, J.M., LaFontaine, J.H., 2018, Description of the National Hydrologic Model for use with the Precipitation-Runoff Modeling System (PRMS): U.S. Geological Survey Techniques and Methods, book 6, chap B9, 38 p., https://doi.org/10.3133/tm6B9.
  3. Leavesley, G. H. (1984). Precipitation-runoff modeling system: User’s manual (Vol. 83, No. 4238). US Department of the Interior.
  4. Markstrom, S. L., Regan, R. S., Hay, L. E., Viger, R. J., Webb, R. M., Payn, R. A., & LaFontaine, J. H. (2015). PRMS-IV, the precipitation-runoff modeling system, version 4. US Geological Survey Techniques and Methods, book 6, chap. B7.
  5. Hay, L.E., and LaFontaine, J.H., 2020, Application of the National Hydrologic Model Infrastructure with the Precipitation-Runoff Modeling System (NHM-PRMS),1980-2016, Daymet Version 3 calibration: U.S. Geological Survey data release, https://doi.org/10.5066/P9PGZE0S.

QPPQ Method for Streamflow Prediction at Ungauged Sites – Python Demo

Streamflow Predictions in Ungauged Basins

Predicting streamflow at ungauged locations is a classic problem in hydrology which has motivated significant research over the last several decades (Hrachowitz, Markus, et al., 2013).

There are numerous different methods for performing predictions in ungauged basins, but here I focus on the common QPPQ method.

Below, I describe the method and further down I provide a walkthrough demonstration of QPPQ streamflow prediction in Python.

The supporting code can be found on my GitHub here: QPPQ_Streamflow_Prediction_Tutorial.

QPPQ-Method for Streamflow Prediction

Fennessey (1994) introduced the QPPQ method for streamflow estimation at ungauged locations.

The QPPQ method is commonly used and encouraged by the USGS, and is described at length in their publication Estimation of Daily Mean Streamflow for Ungaged Stream locations… (2016).

QPPQ consists of four key steps:

  1. Estimate an FDC for the target catchment of interest, \hat{FDC}_{pred}.
  2. Identify K donor locations nearest to the target point.
  3. Transfer the timeseries of nonexceedance probabilities (\mathbf{P}) from the donor site(s) to the target.
  4. Use the estimated FDC for the target to map the donated nonexceedance timeseries, \mathbf{P}, back to streamflow.

To limit the scope of this tutorial, let’s assume that an estimate of the FDC at the target site, \hat{FDC}_{pred}, has already been determined through some other statistical or observational study.

Then the QPPQ method can be described more formally. Given an ungauged location with an estimated FDC, \hat{FDC}_{pred}, and a set of observed streamflow timeseries \mathbf{q_i} at K neighboring sites, such that:

Q_{obs} = [\mathbf{q_1}, \mathbf{q_2}, ..., \mathbf{q_k}]

With corresponding K FDCs at the observation locations:

FDC = [FDC_1, FDC_2, ..., FDC_k]

The FDCs are used to convert the observed streamflow timeseries, \mathbf{q_i}, to non-exceedance probability timeseries, \mathbf{p_i}.

\displaystyle FDC_i : \mathbf{q_{i}} \to \mathbf{p_i}

We can then perform a weighted-aggregation of the non-exceedance probability timeseries to estimate the non-exceedance timeseries at the ungauged location. It is most common to apply an inverse-squared-distance weight to each observed timeseries such that:

\mathbf{p_{pred}} = \sum^k (\mathbf{p_i}w_i)

where w_i \propto 1 / d_i^2, d_i is the distance from observation i to the ungauged location, and the weights are normalized such that \sum^k w_i = 1.

Finally, the estimated FDC at the ungauged location, \hat{FDC}_{pred} is used to convert the non-exceedance timeseries to streamflow timeseries:

\hat{FDC}_{pred} : \mathbf{p}_{pred} \to \mathbf{q}_{pred}


Looking at this formulation, and the sequence of transformations that take place, I hope it is clear why the method is rightfully called the QPPQ method.
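The full Q→P→P→Q sequence can be sketched on toy data (the donor records, distances, and target FDC below are all synthetic stand-ins):

```python
import numpy as np

rng = np.random.default_rng(5)
K, T = 3, 500   # number of donor gages and timesteps

# Toy donor streamflow records and distances (km) to the ungauged site
Q_obs = rng.gamma(2.0, 50.0, size=(K, T))
d = np.array([5.0, 12.0, 30.0])

# Q -> P: convert each donor record to nonexceedance probabilities via its empirical FDC
ranks = Q_obs.argsort(axis=1).argsort(axis=1)
P_obs = (ranks + 1) / (T + 1)     # Weibull plotting positions

# Inverse-squared-distance weights, normalized to sum to one
w = 1 / d**2
w /= w.sum()
P_pred = (w[:, None] * P_obs).sum(axis=0)

# P -> Q: map back through the (here, synthetic) estimated FDC of the target site
fdc_neps = np.linspace(0.01, 0.99, 99)
fdc_pred = np.quantile(rng.gamma(2.0, 40.0, 5000), fdc_neps)
Q_pred = np.interp(P_pred, fdc_neps, fdc_pred)
```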

This method is summarized well by the following graphic, taken from the USGS Report on the topic:

In the following section, I step through an implementation of this method in Python.

Tutorial

All Python scripts used in this tutorial can be found in my GitHub repository: QPPQ_Streamflow_Prediction_Tutorial.

In order to run the scripts in this tutorial yourself, you will need to have installed a few Python libraries, listed in requirements.txt. Running pip install -r requirements.txt from the command line, while inside a local copy of the directory, will install all of these packages.

Data retrieval

I collected USGS streamflow data from N gages using the HyRiver suite for Python.

If you would like to learn more about hydro-environmental data acquisition in Python, check out my old post on Efficient hydroclimatic data accessing with HyRiver for Python.

The script used to retrieve the data is available here. If you would like to experiment with this method in other regions, you can change the region variable on line 21, which specifies the corners of a bounding-box within which gage data will be retrieved:


# Specify time-range and region of interest
dates = ("2000-01-01", "2010-12-31")
region = (-108.0, 38.0, -105.0, 40.0)

Above, I specify a region west of the Rocky Mountains in Colorado. Running generate_streamflow_data.py, I found 73 USGS gage locations (blue circles).

Fig: Locations of USGS gages used in this demo.

QPPQ Model

The file QPPQ.py contains the method outlined above, defined as the StreamflowGenerator class object.

The StreamflowGenerator has four key methods (or functions):

class StreamflowGenerator():
    def __init__(self, args):
        ...
    def get_knn(self):
        ...
    def calculate_nep(self):
        ...
    def interpolate_fdc(self):
        ...
    def predict_streamflow(self):
        ...
        return predicted_flow

The method get_knn finds the K gage locations nearest to the prediction location, and stores both the distances to these observation locations (self.knn_distances) and the indices associated with these locations (self.knn_indices).

    def get_knn(self):
        """Find distances and indices of the K nearest gage locations."""
        distances = np.zeros((self.n_observations))
        for i in range(self.n_observations):
            distances[i] = geodesic(self.prediction_location, self.observation_locations[i,:]).kilometers
        self.knn_indices = np.argsort(distances, axis = 0)[0:self.K].flatten()
        self.knn_distances = np.sort(distances, axis = 0)[0:self.K].flatten()
        return
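The argsort/sort pattern used in get_knn can be checked in isolation. The sketch below swaps geopy's geodesic distance for a simple Euclidean distance so it stands alone (toy coordinates of my own choosing):

```python
import numpy as np

K = 2
prediction_location = np.array([0.0, 0.0])
observation_locations = np.array([[3.0, 4.0],   # distance 5
                                  [1.0, 0.0],   # distance 1
                                  [0.0, 2.0]])  # distance 2

# Distances from the prediction point to every observation
distances = np.linalg.norm(observation_locations - prediction_location, axis=1)

# Indices and distances of the K nearest observations
knn_indices = np.argsort(distances)[:K]
knn_distances = np.sort(distances)[:K]

print(knn_indices, knn_distances)  # [1 2] [1. 2.]
```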

The next method, calculate_nep, calculates the NEP of a flow at an observation location at time t, or P(Q \leq q_t)_{i}.

    def calculate_nep(self, KNN_Q, Q_t):
        "Finds the NEP at time t based on historic observations."
        # Get historic FDC
        fdc = np.quantile(KNN_Q, self.fdc_neps, axis = 1).T
        # Find nearest FDC value
        nearest_quantile = np.argsort(abs(fdc - Q_t), axis = 1)[:,0]
        nep = self.fdc_neps[nearest_quantile]
        return nep	
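The core of calculate_nep — matching a flow value to its nearest discrete quantile — can be illustrated on a single gage with toy data (the variable names here are my own):

```python
import numpy as np

# Toy historic record at one gage
q_hist = np.arange(1.0, 11.0)  # flows 1 through 10

# Discrete NEPs and the corresponding empirical FDC
neps = np.linspace(0, 1, 101)
fdc = np.quantile(q_hist, neps)

# NEP of an observed flow: the NEP of the nearest FDC value
q_t = 7.2
nep = neps[np.argmin(np.abs(fdc - q_t))]
print(nep)  # ~0.69
```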

The interpolate_fdc method performs linear interpolation of the discrete FDC and estimates the flow for a given NEP.

    def interpolate_fdc(self, nep, fdc):
        "Performs linear interpolation of discrete FDC at a NEP."
        tol = 0.0000001
        if nep == 0:
            nep = np.array(tol)
        sq_diff = (self.fdc_neps - nep)**2

        # Index of nearest discrete NEP
        ind = np.argmin(sq_diff)

        # Handle edge-cases
        if nep <= self.fdc_neps[0]:
            return fdc[0]
        elif nep >= self.fdc_neps[-1]:
            return fdc[-1]

        if self.fdc_neps[ind] <= nep:
            flow_range = fdc[ind:ind+2]
            nep_range = self.fdc_neps[ind:ind+2]
        else:
            flow_range = fdc[ind-1:ind+1]
            nep_range = self.fdc_neps[ind-1:ind+1]

        # Linear interpolation between the two bracketing FDC points
        slope = (flow_range[1] - flow_range[0])/(nep_range[1] - nep_range[0])
        flow = flow_range[0] + slope*(nep - nep_range[0])
        return flow
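As an aside, the same linear interpolation (including the endpoint clamping handled by the edge cases above) can be reproduced with np.interp, which makes a useful cross-check; the FDC below is a toy monotone curve of my own:

```python
import numpy as np

# Discrete FDC: NEPs and corresponding flows (toy monotone curve)
neps = np.linspace(0, 1, 11)
fdc = neps**2 * 100

# np.interp linearly interpolates between the two bracketing points
# and clamps to fdc[0] / fdc[-1] outside the NEP range
flow = np.interp(0.55, neps, fdc)
print(flow)  # 30.5
```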

Finally, predict_streamflow(self, args) combines these other methods and performs the full QPPQ prediction.

    def predict_streamflow(self, args):
        "Run the QPPQ prediction method for a single location."
        self.prediction_location = args['prediction_location']
        self.prediction_fdc = args['prediction_fdc']
        self.fdc_quantiles = args['fdc_quantiles']
        self.n_predictions = self.prediction_location.shape[0]

        ### Find nearest K observations
        self.get_knn()
        knn_flows = self.historic_flows[self.knn_indices, :]

        ### Calculate weights as inverse square distance
        self.wts = 1/self.knn_distances**2

        # Normalize weights
        self.norm_wts = self.wts/np.sum(self.wts)

        ### Create timeseries of NEP at observation locations
        self.observed_neps = np.zeros_like(knn_flows)
        for t in range(self.T):
            self.observed_neps[:,t] = self.calculate_nep(knn_flows, knn_flows[:,t:t+1])

        ### Calculate predicted NEP timeseries using weights
        self.predicted_nep = np.zeros((self.n_predictions, self.T))
        for t in range(self.T):
            self.predicted_nep[:,t] = np.sum(self.observed_neps[:,t:t+1].T * self.norm_wts)

        ### Convert NEP timeseries to flow timeseries
        self.predicted_flow = np.zeros_like(self.predicted_nep)
        for t in range(self.T):
            nep_t = self.predicted_nep[0,:][t]
            self.predicted_flow[0,t] = self.interpolate_fdc(nep_t, self.prediction_fdc)

        return self.predicted_flow

The predict_streamflow method is the only function called directly by the user; get_knn, calculate_nep, and interpolate_fdc are all used internally by predict_streamflow.
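The weighted NEP aggregation at the heart of predict_streamflow reduces to a weighted average per timestep, which can be sketched with toy numbers (K = 3 gages, T = 4 timesteps; all values are my own choosing):

```python
import numpy as np

# NEP timeseries at K = 3 neighboring gages over T = 4 timesteps
observed_neps = np.array([[0.2, 0.5, 0.9, 0.4],
                          [0.3, 0.6, 0.8, 0.5],
                          [0.1, 0.4, 0.7, 0.3]])

# Pre-normalized inverse-square-distance weights
norm_wts = np.array([0.6, 0.3, 0.1])

# Weighted-average NEP at the ungauged site, all timesteps at once
predicted_nep = norm_wts @ observed_neps
print(predicted_nep)  # [0.22 0.52 0.85 0.42]
```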

Generate streamflow predictions

The script run_QPPQ_predictions.py runs the model and produces predictions at a test site. First, the data generated by generate_streamflow_data.py is loaded:

import numpy as np
from QPPQ import StreamflowGenerator

### Load Data
gage_locations = np.loadtxt('./data/observed_gage_locations.csv', delimiter = ',')
observed_flows = np.loadtxt('./data/observed_streamflow.csv', delimiter = ',')

The FDCs at each site are estimated at 200 discrete quantiles:

fdc_quantiles = np.linspace(0,1,200)
observed_fdc = np.quantile(observed_flows, fdc_quantiles, axis =1).T

A random test site is selected, and removed from the observation data:

# Select a test site and remove it from observations
test_site = np.random.randint(0, gage_locations.shape[0])

# Store test site
test_location = gage_locations[test_site,:]
test_flow = observed_flows[test_site, :]
test_site_fdc = observed_fdc[test_site, :]

# Remove test site from observations
gage_locations = np.delete(gage_locations, test_site, axis = 0)
observed_flows = np.delete(observed_flows, test_site, axis = 0)

When initializing the StreamflowGenerator, we must provide an array of gage location data (longitude, latitude), historic streamflow data at each gage, and the K number of nearest neighbors to include in the timeseries aggregation.

I’ve set up the StreamflowGenerator class to receive these inputs as a dictionary, such as:

# Specify model prediction_inputs
QPPQ_args = {'observed_locations' : gage_locations,
            'historic_flows' : observed_flows,
            'K' : 20}

# Initialize the model
QPPQ_model = StreamflowGenerator(QPPQ_args)

Similarly, the prediction arguments are provided as a dictionary to the predict_streamflow function:

# Specify the prediction arguments
prediction_args = {'prediction_location': test_location,
                    'prediction_fdc' : test_site_fdc,
                    'fdc_quantiles' : fdc_quantiles}
                    
# Run the prediction
predicted_flow = QPPQ_model.predict_streamflow(prediction_args)

I made a function, plot_predicted_and_observed, which allows for a quick visual check of the predicted timeseries compared to the observed timeseries:

from plot_functions import plot_predicted_and_observed
plot_predicted_and_observed(predicted_flow, test_flow)

The result shows some nice quality predictions!

Fig: Comparison of observed streamflow and streamflow generated through QPPQ.

One benefit of working with the StreamflowGenerator as a Python class object is that we can retrieve the internal variables for further inspection.

For example, I can call QPPQ_model.knn_distances to retrieve the distances to the K nearest neighbors used to predict the flow at the ungauged location. In this case, the gages used to make the prediction above were located 4.44, 13.23, 18.38, … kilometers away.

Caveat and Conclusion

It is worth highlighting one major caveat to this example: the FDC used for the prediction site was perfectly known from the historic record. In most cases, the FDC will not be known when making predictions in ungauged basins. Rather, estimates of the FDC will need to be used, and thus the prediction quality shown above is somewhat of an ideal case when performing a QPPQ in ungauged basins.

There are numerous methods for estimating FDCs at the ungauged site, including the Generalized Pareto distribution approximation proposed by Fennessey (1994) or, more recently, through the use of Neural Networks, as highlighted in Worland, et al. (2019).

Hopefully this tutorial helped to get you familiar with a foundational streamflow prediction method.

References

Fennessey, Neil Merrick. "A Hydro-Climatological Model of Daily Stream Flow for the Northeast United States." Order No. 9510802 Tufts University, 1994. Ann Arbor: ProQuest. Web. 21 Nov. 2022.

Hrachowitz, Markus, et al. "A decade of Predictions in Ungauged Basins (PUB)—a review." Hydrological sciences journal 58.6 (2013): 1198-1255.

Razavi, Tara, and Paulin Coulibaly. "Streamflow prediction in ungauged basins: review of regionalization methods." Journal of hydrologic engineering 18.8 (2013): 958-975.

Stuckey, M.H., 2016, Estimation of daily mean streamflow for ungaged stream locations in the Delaware River Basin, water years 1960–2010: U.S. Geological Survey Scientific Investigations Report 2015–5157, 42 p., http://dx.doi.org/10.3133/sir20155157.

Worland, Scott C., et al. "Prediction and inference of flow duration curves using multioutput neural networks." Water Resources Research 55.8 (2019): 6850-6868.

Efficient hydroclimatic data accessing with HyRiver for Python

This tutorial highlights the HyRiver software stack for Python, which is a very powerful tool for acquiring large sets of data from various web services.

I have uploaded a Jupyter-Notebook version of this post here if you would like to execute it yourself.

HyRiver Introduction

The HyRiver software suite was developed by Taher Chegini who, in their own words, describes HyRiver as:

“… a software stack consisting of seven Python libraries that are designed to aid in hydroclimate analysis through web services.”

This description does not do justice to the capability of this software. Through my research I have spent significant amounts of time wrangling various datasets – making sure that dates align, or accounting for spatial misalignment of available data. The HyRiver suite streamlines this process, and makes acquisition of different data from various sources much more efficient.

Here, I am going to walk through a demonstration of how to easily access large amounts of data (streamflow, geophysical, and meteorological) for a basin of interest.

Before going through the code, I will highlight the three libraries from the HyRiver stack which I have found most useful: PyGeoHydro, PyNHD, and PyDaymet.

PyGeohydro

PyGeoHydro allows for interaction with eight different online datasets.

In this tutorial, I will only be interacting with the USGS NWIS, which provides daily streamflow data.

PyNHD

The PyNHD library is designed to interact with the National Hydrography Dataset (NHD) and the Hydro Network-Linked Data Index (NLDI).

NHDPlus (National Hydrography Dataset)

The NHD defines a high-resolution network of stream linkages, each with a unique identifier (ComID).

NLDI (Network-Linked Data Index)

The NLDI aids in the discovery of indexed information along some NHD-specified geometry (ComIDs). The NLDI essentially traverses the linkages specified by the NHD geometry and generates either local or basin-aggregated data relative to a specific linkage (ComID).

As will be seen later in the tutorial, the NLDI is able to retrieve at least 126 different types of data for a given basin…

PyDaymet

The PyDaymet GitHub repository summarizes the package as:

“[providing] access to climate data from Daymet V4 database using NetCDF Subset Service (NCSS). Both single pixel (using get_bycoords function) and gridded data (using get_bygeom) are supported which are returned as pandas.DataFrame and xarray.Dataset, respectively.”

Tutorial outline:

  1. Installation
  2. Retrieving USGS Water Data
  3. Retrieving Geophysical (NLDI) Data
  4. Retrieving Daymet Data

The HyRiver repository contains various examples demonstrating the use of the various libraries. I would definitely recommend digging deeper into these, and other HyRiver documentation, if this post piques your interest.

Step 0: Installation

In this tutorial, I only interact with the PyNHD, PyGeoHydro, and PyDaymet libraries, so I do not need to install all of the HyRiver suite.

If you operate through pip, you can install these libraries using:

pip install pynhd pygeohydro pydaymet

If you use Anaconda package manager, you can install these packages using:

conda install -c conda-forge pynhd pygeohydro pydaymet

For more information on installation, refer to the HyRiver GitHub repository and related documentation.

Now, onto the fun part!

Step 1: Retrieving USGS Water Data

I am beginning here because streamflow data is typically the first point of interest for most hydrologic engineers or modelers.

Personally, I have gone through the process of trying to download data manually from the USGS NWIS website… My appreciation for the USGS prevents me from saying anything too negative, but let’s just say it was not a pleasant experience.

Pygeohydro allows for direct requests from the USGS National Water Information System (NWIS), which provides daily streamflow data from all USGS gages. The data is conveniently output as a Pandas DataFrame.

With this functionality alone, the PyGeoHydro library is worth learning.

1.1 Initialize PyGeoHydro NWIS tool

# Import common libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import the PyGeohydro libaray tools
import pygeohydro as gh
from pygeohydro import NWIS, plot

# Use the national water info system (NWIS)
nwis = NWIS()

1.2 Requesting USGS Streamflow Data

The get_streamflow() function does exactly as the name entails and will retrieve daily streamflow timeseries; however, USGS gage station IDs must be provided. If you are only interested in a single location, then you can enter the 8-digit gage ID number along with a specified date range to generate the data:

get_streamflow('########', dates = ('Y-M-D', 'Y-M-D'))

However, I want to explore larger sets of data over an entire region. Thus, I am going to use PyGeoHydro's get_info() function to identify all gages within some region of interest.

First, I specify a region via (latitude, longitude) bounds, then I send a query which retrieves meta-data information on all the gages in the specified region. In this case, I am exploring the data available near Ithaca, NY.

# Query specifications
region = (-76.7, 42.3, -76, 42.6) # Ithaca, NY

# Send a query for all gage info in the region
query = {"bBox": ",".join(f"{b:.06f}" for b in region),
         "hasDataTypeCd": "dv",
         "outputDataTypeCd": "dv"}

info_box = nwis.get_info(query)

print(f'PyGeoHydro found {len(set(info_box.site_no))} unique gages in this region.')

# [Out]: PyGeoHydro found #N unique gages in this region.

However, this info_box identifies many gages in the region which have very old or very brief data records. Knowing this, I want to filter out data which does not have a suitable record length.

For the sake of this tutorial, I am considering data between January 1st, 2020 and December 31st, 2020.

# Specify date range of interest
dates = ("2020-01-01", "2020-12-31") 

# Filter stations to have only those with proper dates
stations = info_box[(info_box.begin_date <= dates[0]) & (info_box.end_date >= dates[1])].site_no.tolist()

# Remove duplicates by converting to a set
stations = set(stations)

Now, I am ready to use the gage IDs contained in stations to request the streamflow data!

# Retrieve the flow data
flow_data = nwis.get_streamflow(stations, dates, mmd=False)

# Remove gages with nans
flow_data = flow_data.dropna(axis = 1, how = 'any')

After removing duplicates and gages with nans, I have data from five unique gages in this region.
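The dropna call used above can be sanity-checked on a toy DataFrame (the gage names and values here are my own, not from NWIS):

```python
import numpy as np
import pandas as pd

# Toy daily-flow table: one gage column contains a missing value
flows = pd.DataFrame({'gage_a': [1.0, 2.0, 3.0],
                      'gage_b': [4.0, np.nan, 6.0]})

# Drop any gage (column) with at least one NaN
clean = flows.dropna(axis=1, how='any')
print(list(clean.columns))  # ['gage_a']
```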

Additionally, PyGeoHydro has a convenient plotting feature to help quickly visualize the streamflow data.

from pygeohydro import plot

# Plot flow data summary
plot.signatures(flow_data)
Summary of flow data for the 5 gages found with PyGeoHydro.

There is a lot more to be explored in the PyGeoHydro library, but I will leave that up to the curious reader.

Step 2: Retrieving Geophysical (NLDI) Data

So, you’ve got some streamflow data but you don’t know anything about the physical watershed…

This is where the PyNHD library comes in. Using this library, I can identify the entire upstream network from a gage, then extract the NLDI data associated with the watershed linkages.

# Import the PyNHD library
import pynhd as pynhd
from pynhd import NHD
from pynhd import NLDI, WaterData

First, we can take a look at all possible local basin characteristic data that are available:

# Get list of local data types (AKA characteristics, or attributes)
possible_attributes = NLDI().get_validchars("local").index.to_list()

There are 126 characteristics available from the NLDI! These characteristics range from elevation, to reservoir capacity, to bedrock depth. Many of these are not of immediate interest to me, so I will specify a subset of select_attributes to retrieve (basin area, max elevation, and stream slope).

I then loop through all of my USGS stations for which I have data in flow_data, identifying the upstream basin linkages using NLDI().navigate_byid(). Once the basin is identified, I extract the ComID numbers for each linkage and use that number to retrieve the NLDI data of interest. I then store the data in nldi_data. This process is done by the following:

# Specify characteristics of interest
select_attributes = ['CAT_BASIN_AREA', 'CAT_ELEV_MAX', 'CAT_STREAM_SLOPE']

# Initialize a storage matrix
nldi_data = np.zeros((len(flow_data.columns), len(select_attributes)))

# Loop through all gages, and request NLDI data near each gage
for i, st in enumerate(flow_data.columns):

    # Navigate up all flowlines from gage
    flowlines = NLDI().navigate_byid(fsource = 'nwissite',
                                    fid = f'{st}',
                                    navigation="upstreamTributaries",
                                    source = 'flowlines',
                                    distance = 10)

    # Get the nearest comid
    station_comid = flowlines.nhdplus_comid.to_list()[0]

    # Source NLDI local data
    nldi_data[i,:] = NLDI().getcharacteristic_byid(station_comid, "local", char_ids = select_attributes)

So far, I have timeseries streamflow data for five locations in the Ithaca, NY area, along with the basin area, max basin elevation, and stream slope for each stream. If I can access hydro-climate data, maybe I could begin studying the relationships between streamflow and physical basin features after some rain event.

Step 3: Meteorological data

The PyDaymet library allows for direct requests of meteorological data across an entire basin.

The available data includes:

  • Minimum and maximum temperature (tmin, tmax)
  • Precipitation (prcp)
  • Vapor pressure (vp)
  • Snow-Water Equivalent (swe)
  • Shortwave radiation (srad)

All data are reported daily at a 1km x 1km resolution. Additionally, the PyDaymet library has the ability to estimate potential evapotranspiration, using various approximation methods.

Here, I choose to only request precipitation (prcp) and max temperature (tmax).

NOTE:
So far, the Daymet data retrieval process has been the slowest aspect of my HyRiver workflow. Due to the high resolution, and the potential for large basins, this may be computationally intensive if you try to request data for many gages over long time ranges.

# Import the PyDaymet library
import pydaymet as daymet

## Specify which data to request
met_vars = ["prcp", "tmax"]
met_data_names = np.array(['mean_prcp','sd_prcp','mean_tmax','sd_tmax'])

## Initialize storage
daymet_data = np.zeros((len(flow_data.columns), len(met_data_names)))

Similar to the NLDI() process, I loop through each gage (flow_data.columns) and (1) identify the up-gage basin, (2) source the Daymet data within the basin, (3) aggregate and store the data in daymet_data.

## Loop through stations from above
for i, st in enumerate(flow_data.columns):

    # Get the up-station basin geometry
    geometry = NLDI().get_basins(st).geometry[0]

    # Source Daymet data within basin
    basin_met_data = daymet.get_bygeom(geometry, dates, variables= met_vars)

    ## Pull values, aggregate, and store
    # Mean and std dev precipitation
    daymet_data[i, 0] = np.nan_to_num(basin_met_data.prcp.values).mean()
    daymet_data[i, 1] = np.nan_to_num(basin_met_data.prcp.values).std()

    # Mean and std dev of max temperature
    daymet_data[i, 2] = np.nan_to_num(basin_met_data.tmax.values).mean()
    daymet_data[i, 3] = np.nan_to_num(basin_met_data.tmax.values).std()

daymet_data.shape

# [Out]: (5, 4)

Without having used a web browser, I have been able to access a set of physical basin characteristics, various climate data, and observed streamflow relevant to my region of interest!

Now this data can be exported to a CSV, and used on any other project.

Conclusion

I hope this introduction to HyRiver has encouraged you to go bigger with your hydroclimate data ambitions.

If you are curious to learn more, I’d recommend you see the HyRiver Examples which have various in-depth Jupyter Notebook tutorials.

Citations

Chegini, Taher, et al. “HyRiver: Hydroclimate Data Retriever.” Journal of Open Source Software, vol. 6, no. 66, 27 Oct. 2021, p. 3175, 10.21105/joss.03175. Accessed 15 June 2022.

Constructing interactive Ipywidgets: demonstration using the HYMOD model


Last week, Lillian and I published the first post in a series of training posts studying the “Fisheries Game,” which is a decision-making problem within a complex, non-linear, and uncertain ecological context.

In preparing for that post, I learned about the Ipywidgets Python library for widget construction. It stood out to me as a tool for highlighting the influence of parametric uncertainty on model performance. More broadly, I think it has great potential as an educational or data-narrative device.

This blogpost is designed to highlight this potential, and provide a basic introduction to the library. A tutorial demonstration of how an interactive widget is constructed is provided, this time using the HYMOD rainfall-runoff model.

This post is intended to be viewed through a Jupyter Notebook for interaction, which can be accessed through a Binder at this link!

The Binder was built with an internal environment specification, so it should not be necessary to install any packages on your local machine! Because of this, it may take a minute to load the page.

Alternatively, you can pull the source code and run the Jupyter Notebook from your local machine. All of the source code is available in a GitHub repository: Ipywidget_Demo_Interactive_HYMOD.

If using your local machine, you will first need to install the Ipywidget library:

pip install ipywidgets

Let’s begin!

HYMOD Introduction

HYMOD is a conceptual rainfall-runoff model. Given some observed precipitation and evaporation, a parameterized HYMOD model simulates the resulting down-basin runoff.

This post does not focus on specific properties or performance of the HYMOD model, but rather uses the model as a demonstration of the utility of the Ipywidget library.

I chose to use the HYMOD model for this because HYMOD is commonly taught in introductory hydrologic modeling courses. This demonstration shows how an Ipywidget can be used in an educational context. The resulting widget allows students to interact in real-time with the model behavior, by adjusting parameter values and visualizing the changes in the resulting streamflow.

If you are interested in the technical details of implementing the HYMOD model, you can dig into the source code, available (and thoroughly commented/descriptive) in the repository for this post: Ipywidget_Demo_Interactive_HYMOD.

HYMOD represents surface flow as a series of several quick-flow reservoirs. Groundwater flow is represented as a single slow-flow reservoir. The reservoirs have constant flow rates, with the quick-flow reservoir rate, Kq, being greater than the slow-flow reservoir rate, Ks.

Image source: Sun, Wenchao & Ishidaira, Hiroshi & Bastola, Satish. (2010)

HYMOD Parameters:

Like any hydrologic model, the performance of HYMOD will be dependent upon the specified parameter values. There are several parameters that can be adjusted:

  • Cmax: Max soil moisture storage (mm) [10-2000]
  • B: Distribution of soil stores [0.0 – 7.0]
  • Alpha: Division between quick/slow routing [0.0 – 1.0]
  • Kq: Quick flow reservoir rate constant (day^-1) [0.15 – 1.0]
  • Ks: Slow flow reservoir rate constant. (day^-1) [0.0 – 0.15]
  • N: The number of quick-flow reservoirs.

Interactive widget demonstration

I’ve constructed an Ipywidgets object which allows a user to visualize the impact of the HYMOD model parameters on the resulting simulation timeseries. The user also has the option to select from three different error metrics, which display in the plot, and to toggle the observed timeseries plot on and off.

Later in this post, I will give detail on how the widget was created.

Before providing that detail, I want to show the widget in action so that you know the expectation for the final product.

The gif below shows the widget in-use:

Demonstration of the HYMOD widget.

Ipywidgets Introduction

The Ipywidget library allows for highly customized widgets, like the one above. As with any new tool, I’d recommend you check out the documentation here.

Below, I walk through the process of generating the widget shown above.

Let’s begin!

Import the library

# Import the library
import ipywidgets as widgets

Basic widget components

Consider an Ipywidget as being an arrangement of modular components.

The tutorial walks through the construction of five key widget components:

  1. Variable slider
  2. Drop-down selectors
  3. Toggle buttons
  4. Label objects
  5. Interactive outputs (used to connect the plot to the other three components)

In the last section, I show how all of these components can be arranged together to construct the unified widget.

Sliders

Sliders are one of the most common ipywidget tools. They allow for manual manipulation of a variable value. The slider is an object that can be passed to the interactive widget (more on this further down).

For my HYMOD widget, I would like to be able to manipulate each of the model parameters listed above. I begin by constructing a slider object for each of the variables.

Here is an example, for the C_max variable:

# Construct the slider
Cmax_slider = widgets.FloatSlider(value = 500, min = 10, max = 2000, step = 1.0, description = "C_max",
                                  disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')


# Display the slider
display(Cmax_slider)

Notice that each slider receives a specified min, max, and step corresponding to the possible values. For the HYMOD demo, I am using the parameter ranges specified in Herman, J.D., P.M. Reed, and T. Wagener (2013), Time-varying sensitivity analysis clarifies the effects of watershed model formulation on model behavior.

I will construct the sliders for the remaining parameters below. Notice that I don’t assign the description parameter in any of these sliders… this is intentional. Later in this tutorial I will show how to arrange the sliders with Label() objects for a cleaner widget design.

# Construct remaining sliders
Cmax_slider = widgets.FloatSlider(value = 100, min = 10, max = 2000, step = 1.0, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
B_slider = widgets.FloatSlider(value = 2.0, min = 0.0, max = 7.0, step = 0.1, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
Alpha_slider = widgets.FloatSlider(value = 0.30, min = 0.00, max = 1.00, step = 0.01, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
Kq_slider = widgets.FloatSlider(value = 0.33, min = 0.15, max = 1.00, step = 0.01, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
Ks_slider = widgets.FloatSlider(value = 0.07, min = 0.00, max = 0.15, step = 0.01, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
N_slider = widgets.IntSlider(value = 3, min = 2, max = 7, disabled = False, continuous_update = False, readout = True)

# Place all sliders in a list
list_of_sliders = [Kq_slider, Ks_slider, Cmax_slider, B_slider, Alpha_slider, N_slider]

Dropdown

The Dropdown() allows the user to select from a set of discrete variable options. Here, I want to give the user options on which error metric to use when comparing simulated and observed timeseries.

I provide three options:

  1. RMSE: Root mean square error
  2. NSE: Nash Sutcliffe efficiency
  3. ROCE: Runoff coefficient error

See the calculate_error_by_type inside the HYMOD_components.py script to see how these are calculated.

To provide this functionality, I define the Dropdown() object, as below, with a list of options and the initial value:

# Construct the drop-down to select from different error metrics
drop_down = widgets.Dropdown(options=['RMSE','NSE','ROCE'], description='',
                                value = 'RMSE', disabled = False)

# Display the drop-down
display(drop_down)

ToggleButton

The ToggleButton() allows for a bool variable to be toggled between True and False. For my streamflow plot function, I have an option plot_observed = False which determines if the observed streamflow timeseries is shown in the figure.

# Construct the button to toggle observed data On/Off
plot_button = widgets.ToggleButton(value = False, description = 'Toggle', disabled=False, button_style='', tooltip='Description')

# Display the button
display(plot_button)

Labels

As mentioned above, I choose to not include the description argument within the slider, drop-down, or toggle objects. This is because it is common for these labels to get cut-off when displaying the widget object.

For example, take a look at this slider below, with a long description argument:

# Make a slider with a long label
long_title_slider = widgets.FloatSlider(value = 2.0, min = 0.0, max = 7.0, step = 0.1, description = 'This slider has a long label!', readout = True)

# Display: Notice how the label is cut-off!
display(long_title_slider)

The ipywidgets.Label() function provides a way of avoiding this while allowing for long descriptions. Using Label() will ultimately provide you with a lot more control over your widget layout (last section of the tutorial).

The Label() function generates a separate object. Below, I create a unique Label() object for each HYMOD parameter.

# Import the Label() function
from ipywidgets import Label

# Make a list of label strings
param_labs = ['Kq : Quick flow reservoir rate constant (1/day)',
            'Ks : Slow flow reservoir rate constant (1/day)',
            'C_max : Maximum soil moisture storage (mm)',
            'B : Distribution of soil stores',
            'Alpha : Division between quick/slow routing',
            'N : Number of quick-flow reservoirs']

# Make a list of Label() objects
list_of_labels = [Label(i) for i in param_labs]

# Display the first label, for example.
list_of_labels[0]

Interactive_output

Now that we have constructed the interactive components, we can link them to the plotting function.

The interactive_output function takes two inputs, the function to interact with, and a dictionary of variable assignments:

interactive_output( function, {‘variable_name’ : variable_widget, …} )

I have created a custom function plot_HYMOD_results which:

  1. Loads 1-year of precipitation and evaporation data for the Leaf River catchment.
  2. Runs the HYMOD simulation using the provided parameter values.
  3. Calculates the error of the simulated vs. observed data.
  4. Plots the timeseries of runoff.

The source code for this function can be found in the GitHub repository for this post, or specifically here.

The function receives parameter values for each of the HYMOD parameters discussed above, a bool indicator if observed data should be plotted, and a specified error metric.

plot_HYMOD_results(C_max, B, Alpha, Ks, Kq, N_reservoirs, plot_observed = False, error_type = ‘RMSE’):

I have already generated widget components corresponding to each of these variables! (If you are on the Jupyter Notebook version of this post, make sure to have run every cell before this one, or else the following code won’t work.)

I can now use the interactive_output function to link the widget components generated earlier with the function inputs:

# Import the interactive_output function
from ipywidgets import interactive_output

# Import my custom plotting function
from HYMOD_plots import plot_HYMOD_results

result_comparison_plot = interactive_output(plot_HYMOD_results, {'C_max' : Cmax_slider, 'B': B_slider, 'Alpha':Alpha_slider, 
                                                                 'Ks':Ks_slider, 'Kq':Kq_slider,'N_reservoirs':N_slider, 
                                                                 'plot_observed' : plot_button,'error_type': drop_down})

# Show the output
result_comparison_plot
Output generated by the interactive_output().

Displaying the interactive_output reveals only the plot, but does not include any of the widget functionality…

Despite this, the plot is still linked to the widget components generated earlier. If you don’t believe me (and are reading the Jupyter Notebook version of this post), scroll up and click the ToggleButton a few cells up, then come back and look at the plot again.

Using the interactive_output() function, rather than the other variations of the interact() function available, allows for cleaner widgets to be produced, because the arrangement of the widget components becomes entirely customizable.

Keep reading for more detail on this!

Arranging widget components

Rather than using widget features one at a time, ipywidgets allows for several widgets to be arranged in a unified layout. Think of everything that has been generated previously as being a cell within a gridded widget; the best part is that each cell is linked with the others.

Once the individual widget features (e.g., sliders, buttons, drop-downs, and output plots) are defined, they can be grouped using the VBox() (vertical box) and HBox() (horizontal box) functions.

I’ve constructed a visual representation of my intended widget layout, shown below. The dashed orange boxes show those components grouped by the HBox() function, and the blue boxes show those grouped by the VBox() function.

Visual representation of the final widget layout.

Before getting started, import some of the basic layout functions:

# Import the various layout functions
from ipywidgets import HBox, VBox, Layout

Before constructing the entire widget, it is good to get familiar with the basic HBox() and VBox() functionality.

Remember the list of sliders and list of labels that we created earlier?

# Stack the list of label objects vertically:
VBox(list_of_labels)

# Try the same thing with the sliders (remove comment #):
#VBox(list_of_sliders)

In the final widget, I want the column of labels to be located to the left of the column of sliders. HBox() allows these two columns to be arranged next to one another:

# Putting the columns side-by-side
HBox([VBox(list_of_labels), VBox(list_of_sliders)])

Generating the final widget

Using the basic HBox() and VBox() functions shown above, I arrange all of the widget components I’ve defined previously. I first define each row of the widget using HBox(), and finally stack the rows using VBox().

The script below will complete the arrangement, and call the final widget!

# Define specifications for the widgets: center and wrap 
box_layout = widgets.Layout(display='flex', flex_flow = 'row', align_items ='center', justify_content = 'center')

# Create the rows of the widgets
title_row = Label('Select parameter values for the HYMOD model:')
slider_row = HBox([VBox(list_of_labels), VBox(list_of_sliders)], layout = box_layout)
error_menu_row = HBox([Label('Choose error metric:'), drop_down], layout = box_layout)
observed_toggle_row = HBox([Label('Click to show observed flow'), plot_button], layout = box_layout)
plot_row = HBox([result_comparison_plot], layout = box_layout)


# Combine label and slider box (row_one) with plot for the final widget
HYMOD_widget = VBox([title_row, slider_row, plot_row, error_menu_row, observed_toggle_row])


# Call the widget and have fun!
HYMOD_widget

Concluding remarks

If you’ve made it this far, thank you for reading!

I hope that you are able to find some fun/interesting/educational use for the ipywidgets skills learned in this post.

MORDM Basics IV: Visualizing ROF-Storage Dynamics (finally)

The previous post described a simple, two-objective test case in which the city of Cary employed risk-of-failure (ROF) triggers as levers to adjust for its preferred tradeoff level between its objectives. The example showed how ROF triggers allowed Cary to account for future uncertainty in its system inputs, enabling it to visualize how its risk appetite would affect its desired outcomes.

In meeting these objectives, different risk thresholds would have affected Cary’s response to extreme events such as floods and droughts, and its ability to fulfill demand. Simply analyzing the tradeoffs between objectives that result from a range of ROF trigger values only presents one side of the story. It is vital to visualize how these performance objectives and tradeoffs manifest in the system’s capacity (pun intended) to store enough water in times of scarcity, and by extension, its ability to fulfill its customers’ demand for water.

Using ROFs allows us to more concretely measure how the dynamics of both storage and demand fulfillment evolve and change over time for a given risk tolerance. In the long term, these dynamics will influence when and where new water infrastructure is built to cope with storage requirements and demand growth, but this is a topic for a future blog post. This week, we will focus on unpacking the dynamic evolution of storage and demand in response to different ROF trigger values.

As a quick refresher, our system is the water supply utility of Cary, a city within the Research Triangle region of North Carolina (Trindade et al., 2017). Cary implements water-use restrictions, during which only 50% of demand is met, whenever the weekly ROF exceeds the threshold of risk that Cary is willing to tolerate (α). More frequent water-use restrictions help to maintain reservoir levels and ensure reliability, which was defined in the previous blog post. However, the decision to implement restrictions (or not) will impact the storage levels of the system. With this in mind, we will first examine how storage responds to the triggering of a water-use restriction. For context, we consider a scenario in which Cary's inflow timeseries is only 20% of the original levels. Figure 1 below shows the inflow, demand and storage timeseries for this scenario.

Figure 1: The hydrologic timeseries for Cary given that no water restrictions are implemented in a scenario where inflows are 20% of the original levels.
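The restriction rule described above — restrict to 50% of demand whenever the weekly ROF exceeds the tolerance α — can be sketched in a few lines. This is a minimal illustration of the trigger logic only; the function name and numbers are mine, not Cary's actual model.

```python
# Minimal sketch of the ROF-trigger logic described above.
# All names and numbers are illustrative, not Cary's actual model.

def demand_met(weekly_rof, alpha, full_demand):
    """Return the demand fulfilled this week under the ROF trigger.

    If the weekly risk of failure exceeds the tolerance alpha,
    a water-use restriction is triggered and only 50% of demand is met.
    """
    if weekly_rof > alpha:
        return 0.5 * full_demand  # restriction in effect
    return full_demand            # normal operations

# For the same 5% weekly ROF, a risk-averse actor (alpha = 1%)
# restricts while a risk-tolerant one (alpha = 15%) does not.
print(demand_met(0.05, alpha=0.01, full_demand=100.0))  # 50.0 (restricted)
print(demand_met(0.05, alpha=0.15, full_demand=100.0))  # 100.0 (unrestricted)
```

The two calls preview the behavior explored in Figures 2-4: the lower the tolerance α, the earlier and more often restrictions bind.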

Cary's challenge becomes apparent in Figure 1. While inflow decreases over time (fewer peaks), demand grows steadily and has effectively tripled by the end of the period. This results in periods during which storage levels drop to zero, which first occurs past 2040. Also note that the frequency of low storage levels has increased in the second half of the period. The following questions can thus be posed:

  1. How does the system’s ROF change with increasing demand and decreasing supply?
  2. How does risk tolerance affect the implementation of water-use restrictions during drought?
  3. How will the system reservoir levels respond to different levels of risk tolerance?
Figure 2: The length of the pink bars denotes the nth week during which the first water-use restriction was implemented for a given α-value. This is an indicator of the responsiveness of the system to a drought, or decrease in storage levels. The blue line indicates the percent of storage filled with water.

To answer the first question, it is useful to identify how different values of α affect the first instance of a water-use restriction. Figure 2, generated using ‘rof_dynamics.py‘, demonstrates that lower risk tolerances result in earlier implementation of restrictions. This is reasonable, as an actor who is more risk-averse will quickly implement water-use restrictions to maintain reliable levels of storage during a drought. However, an actor who is more willing to tolerate the chance of low reservoir levels will delay implementing water-use restrictions. The blue line juxtaposed on top of the bars indicates the inflows to the reservoir. After the first period of low flows between weeks 30-40, the plot shows that the inflows do not recover, and are likely insufficient to fill the reservoir to initial levels. With a lower α, an actor is more likely to implement restrictions almost immediately after observing merely a few weeks of low inflows. In contrast, an actor who opts for a higher α will only resort to restrictions after seeing an extended period of low flows, during which they can be more certain that restrictions are absolutely necessary.

Answering the second and third questions first requires that periods of drought be more definitively quantified. To do this, the standardized streamflow indicator (SSI6) was used. The SSI6 is a method that identifies time periods during which the standardized inflow is less than the 6-month rolling mean (Herman et al., 2016). It detects a drought period when the value of the SSI6 < 0 for three consecutive months and SSI6 < -1 at least once during the three-month period. The juxtaposition of storage restrictions and the periods of drought will allow us to see where restrictions were imposed, and their impacts on reservoir levels, for a given demand timeseries.
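The SSI6 drought criterion just described can be expressed compactly in Python. This is an illustrative reimplementation of the rule, not the code from the referenced papers; it flags any run of negative SSI6 values that lasts at least three months and dips below -1 at least once.

```python
# Hedged sketch of the SSI6 drought criterion described above:
# a drought is flagged when SSI6 stays below 0 for at least three
# consecutive months and dips below -1 at least once in that run.
# Illustrative reimplementation, not the referenced papers' code.

def drought_periods(ssi6):
    """Return (start, end) index pairs of detected droughts in a
    monthly SSI6 series (end is exclusive)."""
    periods = []
    start = None
    for i, v in enumerate(ssi6):
        if v < 0:
            if start is None:
                start = i
        else:
            if start is not None:
                run = ssi6[start:i]
                if len(run) >= 3 and min(run) < -1:
                    periods.append((start, i))
                start = None
    if start is not None:
        run = ssi6[start:]
        if len(run) >= 3 and min(run) < -1:
            periods.append((start, len(ssi6)))
    return periods

# A 4-month negative run dipping below -1 counts; a 2-month run does not.
print(drought_periods([0.5, -0.2, -1.3, -0.4, -0.1, 0.3, -0.5, -0.2, 0.1]))
# [(1, 5)]
```

Intervals returned by a function like this are what get shaded red in the lower subplots of Figures 3 and 4.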

Figure 3 and Figure 4 visualize how the system’s storage levels respond to drought (the red bars in the lower subplot) by implementing water-use restrictions (the dark red lines in the upper subplot) given α = 1% and α = 15%, respectively. Predictably, restrictions coincide with periods of drought as defined by the SSI6. However, with a lower risk tolerance, periods of restrictions are longer and more frequent. As Figure 3 shows, an actor with a lower risk tolerance may implement restrictions when only a slight risk of failure exists.

Figure 3: Storage dynamics given α=1%. (Upper subplot) The blue lines indicate the reservoir storage levels in billion gallons per week. The yellow lines are the weekly ROF values, or the likelihood that the percent of water stored will drop below 20% of the reservoir levels. The grey lines indicate where water use restrictions are implemented, and the red dashed line denotes α=1%. (Lower subplot) The zones where droughts were detected using the SSI6 method (Herman et al., 2016) are highlighted in red.

Compared to α = 1%, an actor who is willing to tolerate higher ROF values (Figure 4, for example) will implement restrictions less frequently and for shorter periods of time. Although this means that demand is less likely to be disrupted, it also puts water supplies at a higher risk of dropping to critical levels (<20%), as restrictions may not be implemented even during times of drought.

Figure 4: Storage dynamics given α=15%. (Upper subplot) The blue lines indicate the reservoir storage levels in billion gallons per week. The yellow lines are the weekly ROF values, or the likelihood that the percent of water stored will drop below 20% of the reservoir levels. The grey lines indicate where water use restrictions are implemented, and the red dashed line denotes α=15%. (Lower subplot) The zones where droughts were detected using the SSI6 method (Herman et al., 2016) are highlighted in red.

There is one important thing to note when comparing Figures 3 and 4. When the periods of water-use restrictions coincide for both α-values (between 2040 and 2050), the actor with the lower tolerance implements water-use restrictions at the beginning of both drought periods. This decision makes the biggest difference in terms of the reservoir storage levels. By implementing water-use restrictions early and for a longer period of time, Cary’s reservoir levels are consistently kept above 50% of full capacity (given a full capacity of 7.45 BG). The actor with the higher risk tolerance, in contrast, saw water levels drop below 30% of full capacity during periods of drought.

Although this seems undesirable, recall that the system is said to have failed if storage drops below 20% of full capacity. Herein lies the power of using an ROF metric – questions 2 and 3 can be answered by generating storage-restriction response figures like those shown above, which allow an actor to examine the consequences of varying levels of risk tolerance on their ability to fulfill demand while maintaining sufficient water levels. This ability can improve judgement on how much risk a utility can actually tolerate without adversely impacting the socioeconomic aspects of the systems dependent on it. In addition, using ROFs enables a utility to better estimate when new infrastructure actually needs to be built, instead of making premature investments as a result of unwarranted risk aversion.

To briefly summarize this blog post, we have shown how different risk tolerance levels affect the decisions made by an actor, and how these decisions in turn impact the system. Not shown here is the ability of an ROF to evolve over time given climate change and the building of new water supply infrastructure. In the next blog post, we will briefly discuss the role of ROFs in mapping out adaptation pathways for a utility, how ROFs form the basis of a dynamic and adaptive pathway and their associated operation policies, and connect this to the concept of the soft path (Gleick, 2002) in water supply management.

References

Gleick, P., 2002. Water management: Soft water paths. Nature, 418(6896), pp.373-373.

Herman, J., Zeff, H., Lamontagne, J., Reed, P. and Characklis, G., 2016. Synthetic Drought Scenario Generation to Support Bottom-Up Water Supply Vulnerability Assessments. Journal of Water Resources Planning and Management, 142(11), p.04016050.

Trindade, B., Reed, P., Herman, J., Zeff, H. and Characklis, G., 2017. Reducing regional drought vulnerabilities and multi-city robustness conflicts using many-objective optimization under deep uncertainty. Advances in Water Resources, 104, pp.195-209.

MORDM Basics I: Synthetic Streamflow Generation

In this post, we will break down the key concepts underlying synthetic streamflow generation, and how it fits within the Many Objective Robust Decision Making (MORDM) framework (Kasprzyk et al., 2013). This post is the first in a series on MORDM, beginning here with generating and validating the data used in the framework. To provide some context as to what we are about to attempt, please refer to this post by Jon Herman.

What is synthetic streamflow generation?

Synthetic streamflow generation is a non-parametric, direct statistical approach used to generate synthetic streamflow timeseries from a reasonably long historical record. It is used when there is a need to diversify extreme event scenarios, such as floods and droughts, or when we want to generate flows that reflect a shift in the hydrologic regime due to climate change. It is favored because it relies on a re-sampling of the historical record, preserves temporal correlation up to a certain degree, and results in a more realistic synthetic dataset. However, its dependence on a historical record also implies that this approach requires a relatively long historical inflow record. Jon Lamontagne’s post goes into further detail regarding this approach.

Why synthetic streamflow generation?

An important step in the MORDM framework is scenario discovery, which requires multiple realistic scenarios to predict future states of the world (Kasprzyk et al., 2013). Depending solely on the historical dataset is insufficient; we need to generate multiple realizations of realistic synthetic scenarios to facilitate a comprehensive scenario discovery process. As an approach that uses a long historical record to generate synthetic data that has been found to preserve seasonal and annual correlation (Kirsch et al., 2013; Herman et al., 2016), this method provides us with a way to:

  1. Fully utilize a large historical dataset
  2. Stochastically generate multiple synthetic datasets while preserving temporal correlation
  3. Explore many alternative climate scenarios by changing the mean and the spread of the synthetic datasets

The basics of synthetic streamflow generation in action

To better illustrate the inner workings of synthetic streamflow generation, it is helpful to use a test case. In this post, the historical dataset is obtained from the Research Triangle region in North Carolina. The Research Triangle region consists of four main utilities: Raleigh, Durham, Cary and the Orange County Water and Sewer Authority (OWASA). These utilities receive their water supplies from four water sources: the Little River Reservoir, Lake Wheeler, Lake Benson, and Jordan Lake (Figure 1), and historical streamflow data are obtained from ten different stream gauges located at these water sources. For the purpose of this example, we will be using 81 years’ worth of weekly streamflow data available here.

Figure 1: The Research Triangle region (Trindade et al., 2019).

The statistical approach that drives synthetic streamflow generation is called the Kirsch Method (Kirsch et al., 2013). In plain language, this method does the following:

  1. Convert the historical streamflows from real space to log space, then standardize the log-space data.
  2. Bootstrap the log-space historical matrix to obtain an uncorrelated matrix of historical data.
  3. Compute the correlation matrix of the historical dataset and perform Cholesky decomposition on it.
  4. Impose the historical correlation upon the uncorrelated matrix obtained in (2) to generate a standardized synthetic dataset. This preserves seasonal correlation.
  5. De-standardize the synthetic data and transform it back into real space.
  6. Repeat steps (1) to (5) with a historical dataset that is shifted forward by 6 months (26 weeks). This preserves year-to-year correlation.
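Steps (1) to (5) above can be condensed into a short NumPy sketch. This is illustrative only: the actual implementation lives in stress_dynamic.m, and details such as the 26-week shift of step (6) and edge handling are omitted here.

```python
# Condensed NumPy sketch of steps (1)-(5) of the Kirsch Method above.
# Illustrative only; the real implementation is in stress_dynamic.m.
import numpy as np

def kirsch_synthetic(Q_hist, n_syn_years, seed=0):
    """Q_hist: (n_years, n_weeks) matrix of historical flows (real space).
    Returns an (n_syn_years, n_weeks) matrix of synthetic flows."""
    rng = np.random.default_rng(seed)

    # (1) Real space -> log space, then standardize each week (column).
    Y = np.log(Q_hist)
    mu, sigma = Y.mean(axis=0), Y.std(axis=0)
    Z = (Y - mu) / sigma

    # (2) Bootstrap values independently per week -> uncorrelated matrix.
    n_years, n_weeks = Z.shape
    C = np.empty((n_syn_years, n_weeks))
    for w in range(n_weeks):
        C[:, w] = rng.choice(Z[:, w], size=n_syn_years, replace=True)

    # (3) Cholesky decomposition of the historical weekly correlation matrix.
    U = np.linalg.cholesky(np.corrcoef(Z, rowvar=False)).T  # upper triangular

    # (4) Impose the historical correlation on the uncorrelated matrix.
    Z_syn = C @ U

    # (5) De-standardize and transform back to real space.
    return np.exp(Z_syn * sigma + mu)
```

Multiplying the bootstrapped matrix by the upper-triangular Cholesky factor gives each synthetic row the historical week-to-week correlation, which is what preserves the seasonal structure.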

This post by Julie Quinn delves deeper into the Kirsch Method’s theoretical steps. The function that executes these steps can be found in the stress_dynamic.m Matlab file, which in turn is executed by the wsc_main_rate.m file by setting the input variable p = 0 as shown on Line 27. Both these files are available on GitHub here.

This is where things get interesting. Up to this point, steps (1) to (6) have simply generated a synthetic dataset based only on historical statistical characteristics, as validated here in Julie’s second blog post on a similar topic. Of the three motivations for using synthetic streamflow generation, the third (exploration of multiple scenarios) has yet to be satisfied. This is a nice segue into our next topic:

Generating multiple scenarios using synthetic streamflow generation

The true power of synthetic streamflow generation lies in its ability to generate multiple climate (or in this case, streamflow) scenarios. This is done in stress_dynamic.m using three variables:

Input variable : Description
  • p : the lowest x% of streamflows
  • n : a vector where each element ni is the number of copies of the p-lowest streamflow years to be added to the bootstrapped historical dataset
  • m : a vector where each element mi is the number of copies of the (1-p)-highest streamflow years to be added to the bootstrapped historical dataset
Table 1: The input variables to the stress_dynamic function.

These three variables lengthen (bootstrap) the historical record while allowing us to perturb the historical streamflows to reflect an increase in the frequency or severity of extreme events such as floods and droughts, using the following equation:

new_hist_years = old_hist_years + (p * old_hist_years) * ni + ((1 - p) * old_hist_years) * mi

The stress_dynamic.m file contains more explanation regarding this step.
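As a worked example of the record-lengthening equation above, here is the arithmetic with illustrative numbers (these values are mine, not the ones used in the actual test case):

```python
# Worked example of the record-lengthening equation above,
# with illustrative numbers (not from the actual test case).
old_hist_years = 81   # length of the historical record
p = 0.10              # lowest 10% of streamflow years
n_i = 2               # copies of the p-lowest years appended this period
m_i = 1               # copies of the (1-p)-highest years appended this period

low_years = round(p * old_hist_years)      # ~8 driest years
high_years = old_hist_years - low_years    # ~73 wettest years

new_hist_years = old_hist_years + low_years * n_i + high_years * m_i
print(new_hist_years)  # 81 + 8*2 + 73*1 = 170
```

Appending extra copies of the driest (or wettest) years raises the probability that the bootstrap in step (2) resamples them, which is how the frequency of extreme events is increased.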

This begs the question: how do we choose the value of p? This brings us to the topic of the standardized streamflow indicator (SSI6).

The SSI6 is the 6-month moving average of the standardized streamflows, used to determine the occurrence and severity of drought on the basis of duration and frequency (Herman et al., 2016). Put simply, this method flags a drought if the value of the SSI6 < 0 continuously for at least three months, and SSI6 < -1 at least once during that period. The periods and severity (or lack thereof) of drought can then be observed, enabling the decision on the lengths of the n and m vectors (which correspond to the number of perturbation periods, or climate event periods). We will not go into further detail regarding this method, but there are two important points to be made:

  1. The SSI6 enables the determination of the frequency (likelihood) and severity of drought events in synthetic streamflow generation through the values contained in p, n and m.
  2. This approach can be used to generate flood events by exchanging the values between the n and m vectors.

A good example of point (2) is this test case, in which more-frequent and more-severe floods were simulated by ensuring that most of the values in m were larger than those of n. Please refer to Jon Herman’s 2016 paper titled ‘Synthetic drought scenario generation to support bottom-up water supply vulnerability assessments’ for further detail.

A brief conceptual letup

Now we have shown how synthetic streamflow generation satisfies all three factors motivating its use. At this point, we should have two output folders:

  • synthetic-data-stat: contains the synthetic streamflows based on the unperturbed historical dataset
  • synthetic-data-dyn: contains the synthetic streamflows based on the perturbed historical dataset

Comparing these two datasets, we can see how increasing the likelihood and severity of floods affects the resulting synthetic data.

Validation

To exhaustively compare the statistical characteristics of the synthetic streamflow data, we will perform two forms of validation: visual and statistical. This method of validation is based on Julie’s post here.

Visual validation

Visual validation is done by generating flow duration curves (FDCs). Figure 2 below compares the unperturbed (upper) and perturbed (lower) synthetic datasets.

Figure 2: (Above) The FDC of the unperturbed historical dataset (pink) and its resulting synthetic dataset (blue). (Below) The corresponding perturbed historical and synthetic dataset.

The bottom plots in Figure 2 show an increase in the volume of weekly flows, as well as a smaller return period, when the historical streamflows were perturbed to reflect an increasing frequency and magnitude of flood events. Together with the upper plots in Figure 2, this visually demonstrates that the synthetic streamflow generation approach (1) faithfully reconstructs historical streamflow patterns, (2) increases the range of possible streamflow scenarios and (3) can model multiple extreme climate event scenarios by perturbing the historical dataset. The code to generate this figure can be found in the plotFDCrange.py file.
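For readers unfamiliar with FDCs, the curve itself is simple to construct: sort the flows in descending order and plot each against its exceedance probability. The sketch below is written in the spirit of plotFDCrange.py but is not that file's code; the Weibull plotting position used here is a common convention, not necessarily the one used in the post's figures.

```python
# Minimal sketch of building a flow duration curve (FDC) with NumPy.
# Not the plotFDCrange.py code; the plotting position is an assumption.
import numpy as np

def flow_duration_curve(flows):
    """Return (exceedance_prob, sorted_flows) for an FDC.

    Flows are sorted in descending order; the exceedance probability of
    the m-th largest flow out of n is m / (n + 1) (Weibull position).
    """
    q = np.sort(np.asarray(flows))[::-1]
    n = q.size
    exceedance = np.arange(1, n + 1) / (n + 1)
    return exceedance, q

# The largest flow has the smallest exceedance probability.
p, q = flow_duration_curve([10.0, 50.0, 20.0, 5.0])
print(q)  # [50. 20. 10.  5.]
print(p)  # [0.2 0.4 0.6 0.8]
```

Plotting the perturbed and unperturbed curves from such a function on shared axes reproduces the kind of comparison shown in Figure 2.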

Statistical validation

The mean and standard deviation of the perturbed and unperturbed datasets are compared to determine whether the perturbation resulted in significant changes in the synthetic datasets. Ideally, the perturbed synthetic data would have higher means and similar standard deviations compared to the unperturbed synthetic data.

Figure 3: (Above) The unperturbed synthetic (blue) and historical (pink) streamflow datasets for each of the 10 gauges. (Below) The perturbed counterpart.

The means and tails of the synthetic streamflow values in the bottom plots of Figure 3 show that the mean and maximum values of the perturbed synthetic flows are significantly higher than the unperturbed values. In addition, the spread of the standard deviations of the perturbed synthetic streamflows is similar to that of their unperturbed counterpart. This demonstrates that synthetic streamflow generation can be used to synthetically change the occurrence and magnitude of extreme events while maintaining the periodicity and spread of the data. The code to generate Figure 3 can be found in weekly-moments.py.

Synthetic streamflow generation and internal variability

The generation of multiple unperturbed realizations of synthetic streamflow is vital for characterizing the internal variability of a system, otherwise known as the variability that arises from natural variations in the system (Mankin et al., 2020). As internal variability is intrinsic to the system, its effects cannot be eliminated – but they can be moderated. By evaluating multiple realizations, we can determine the number of realizations at which the internal variability (quantified here by the standard deviation as a function of the number of realizations) stabilizes. Using the synthetic streamflow data for the Jordan Lake gauge, it is shown that more than 100 realizations are required for the standard deviation of the 25% highest streamflows across all years to stabilize (Figure 4). Knowing this, we can generate sufficient synthetic realizations to render the effects of internal variability insignificant.

Figure 4: The highest 25% of synthetic streamflows for the Jordan Lake gauge.

The file internal-variability.py contains the code to generate the above figure.
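The stabilization check described above can be sketched as follows. This is written in the spirit of internal-variability.py rather than reproducing it: for each ensemble size k, we pool the first k realizations and track the standard deviation of the top 25% of flows.

```python
# Sketch of the convergence check described above (in the spirit of
# internal-variability.py, not its actual code or data).
import numpy as np

def stabilization_curve(realizations, top_fraction=0.25):
    """realizations: (n_realizations, n_weeks) matrix. For each ensemble
    size k, return the std of the pooled top-`top_fraction` flows of the
    first k realizations."""
    Q = np.asarray(realizations)
    curve = []
    for k in range(1, Q.shape[0] + 1):
        pooled = np.sort(Q[:k].ravel())
        top = pooled[int((1 - top_fraction) * pooled.size):]
        curve.append(top.std())
    return np.array(curve)

# Stand-in ensemble of 200 heavy-tailed realizations, 52 weeks each.
rng = np.random.default_rng(42)
ensemble = rng.lognormal(mean=3.0, sigma=0.8, size=(200, 52))
curve = stabilization_curve(ensemble)
# The curve's late values vary far less than its early values,
# indicating the internal-variability estimate has stabilized.
```

Plotting `curve` against k is the analogue of Figure 4: once the curve flattens, adding realizations no longer changes the estimate of the high-flow variability.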

How does this all fit within the context of MORDM?

So far, we have generated synthetic streamflow datasets and validated them. But how are these datasets used in the context of MORDM?

Synthetic streamflow generation lies within the domain of the second part of the MORDM framework, as shown in Figure 5. Specifically, synthetic streamflow generation plays an important role in the design of experiments by preserving the effects of deeply uncertain factors that cause natural events. As MORDM requires multiple scenarios to reliably evaluate all possible futures, this approach enables the simulation of multiple scenarios while incrementally increasing the severity or frequency of extreme events as set by the user. This allows us to evaluate how coupled human-natural systems change over time under different scenarios, and the consequences for the robustness of the system being evaluated (in this case, the Research Triangle).

Figure 5: The taxonomy of robustness frameworks. The bold-outlined segments highlight where MORDM fits within this taxonomy (Herman et al., 2015).

Typically, this evaluation is performed in two main steps:

  1. Generation and evaluation of multiple realizations of unperturbed annual synthetic streamflow. The resulting synthetic data are used to generate the Pareto-optimal set of policies. This step can help us understand how the system’s internal variability affects future decision-making, by comparison with the results of step (2).
  2. Generation and evaluation of multiple realizations of perturbed annual synthetic streamflow. These are the more extreme scenarios against which the previously found Pareto-optimal policies will be evaluated. This step assesses the robustness of the base state under deeply uncertain deviations caused by the perturbations in the synthetic data and other deeply uncertain factors.

Conclusion

Overall, synthetic streamflow generation is an approach that is highly applicable in the bottom-up analysis of a system. It preserves the historical characteristics of a streamflow timeseries while providing the flexibility to modify the severity and frequency of extreme events in the face of climate change. It also allows the generation of multiple realizations, aiding in the characterization and understanding of a system’s internal variability, and enabling a more exhaustive scenario discovery process.

This summarizes the basics of data generation for MORDM. In my next blog post, I will introduce risk-of-failure (ROF) triggers, their background, key concepts, and how they are applied within the MORDM framework.

References

Herman, J. D., Reed, P. M., Zeff, H. B., & Characklis, G. W. (2015). How should robustness be defined for water systems planning under change? Journal of Water Resources Planning and Management, 141(10), 04015012. doi:10.1061/(asce)wr.1943-5452.0000509

Herman, J. D., Zeff, H. B., Lamontagne, J. R., Reed, P. M., & Characklis, G. W. (2016). Synthetic drought scenario generation to support bottom-up water supply vulnerability assessments. Journal of Water Resources Planning and Management, 142(11), 04016050. doi:10.1061/(asce)wr.1943-5452.0000701

Kasprzyk, J. R., Nataraj, S., Reed, P. M., & Lempert, R. J. (2013). Many objective robust decision making for complex environmental systems undergoing change. Environmental Modelling & Software, 42, 55-71. doi:10.1016/j.envsoft.2012.12.007

Kirsch, B. R., Characklis, G. W., & Zeff, H. B. (2013). Evaluating the impact of alternative hydro-climate scenarios on transfer agreements: Practical improvement for generating synthetic streamflows. Journal of Water Resources Planning and Management, 139(4), 396-406. doi:10.1061/(asce)wr.1943-5452.0000287

Mankin, J. S., Lehner, F., Coats, S., & McKinnon, K. A. (2020). The value of initial condition large ensembles to Robust Adaptation Decision‐Making. Earth’s Future, 8(10). doi:10.1029/2020ef001610

Trindade, B., Reed, P., Herman, J., Zeff, H., & Characklis, G. (2017). Reducing regional drought vulnerabilities and multi-city robustness conflicts using many-objective optimization under deep uncertainty. Advances in Water Resources, 104, 195-209. doi:10.1016/j.advwatres.2017.03.023

Variable Infiltration Capacity (VIC) Model- Part 2

As promised, I am back with the second part of my blog post for variable infiltration capacity (VIC) model training. Before we start talking about how you can run the model, take a look at my first blog post on VIC; that post will hopefully give you a high-level understanding of the model as well as its application, theoretical background, and main processes.

In this blog post, we’ll go over the model input and output files and the things that you need to do to run the model. We’ll use a “popular among first-timers” real-world example provided by the VIC development team. I found the instructions provided by the VIC website clear and very helpful, and I recommend referring to the site for more information beyond this post. However, the goal of this blog post is to provide easy-to-follow, hands-on instructions on how to run the VIC model. Lastly, before I forget: although many of these details are applicable to all VIC versions, these instructions are specifically for the VIC-4.x and VIC-5 (classic mode) versions.

Input Files

The most important input to the VIC model is likely the global parameter file. The global parameter file has two main responsibilities: (1) setting the model’s main options and (2) providing the paths to other input files. The other input files include soil parameters, meteorological data, vegetation library, vegetation parameter file, and snow band file. The following sections start by providing more information about these input files and then discuss the global parameter file.

Soil File

Broadly speaking, the soil file provides two categories of information to the model. (1) Calibration parameters include the coefficient that adjusts the rainfall-runoff formulation (bi) and additional parameters that control baseflow generation (Ws, Ds, Dsmax, C). Calibrating the depths of the middle and last soil layers of VIC is also standard practice. However, keep in mind that although the snow model of VIC can be calibrated, its parameters are not in the soil file. (2) Soil textural information includes the parameters of the Brooks-Corey/Campbell relationship, soil depth, soil bulk density, field capacity, permanent wilting point, and more. The soil file usually has multiple lines, and each line corresponds to a specific grid cell. The soil file also tells the model whether or not a specific grid cell should be part of the simulation; the first number in each line indicates this. If the number is zero (0), the cell will not be part of the simulation.
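The run-flag convention described above — each soil-file line starts with a 0/1 flag that switches the grid cell on or off — can be illustrated with a tiny parser. The line layout below is simplified for illustration; a real VIC soil file has many more columns.

```python
# Illustrative sketch of the soil-file run-flag convention described
# above. The line layout is simplified; a real VIC soil file has many
# more columns per grid cell.
def active_cells(soil_file_lines):
    """Return the indices of lines whose leading run flag is nonzero."""
    active = []
    for i, line in enumerate(soil_file_lines):
        fields = line.split()
        if fields and int(fields[0]) != 0:
            active.append(i)
    return active

# Three cells: the middle one (flag 0) is excluded from the simulation.
lines = [
    "1 1 0.20 10.0",   # run flag, cell id, b_i, Dsmax (illustrative)
    "0 2 0.25 12.0",
    "1 3 0.18  9.5",
]
print(active_cells(lines))  # [0, 2]
```

This is how VIC lets you carve a simulation domain out of a larger soil file without deleting any lines: just zero out the flag on the cells you want to skip.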

Met Data

Met data can be supplied to VIC in two ways. (1) Users provide only precipitation, maximum temperature, minimum temperature, and wind speed, and VIC’s internal weather generator (MTCLIM; Bohn et al. 2013) estimates the remaining variables, such as shortwave and longwave radiation, atmospheric pressure, and vapor pressure. (2) Users provide a complete time series of all required meteorological inputs.
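For option (1), a classic-mode ASCII forcing file is just headerless columns, one row per time step, with the column order declared in the global parameter file. The values below are made up for a three-day illustration; only the four-column layout (precip, tmax, tmin, wind) matters:

```shell
# Made-up daily forcing for one grid cell: precipitation [mm],
# max temperature [C], min temperature [C], wind speed [m/s].
# VIC classic reads these as plain whitespace-separated columns.
cat > forcing_sample.txt <<'EOF'
5.2 10.1  2.3 1.8
0.0 12.4  3.0 2.1
1.7  9.8  1.1 2.5
EOF

# Sanity check: every row should have exactly four fields.
awk 'NF != 4 { bad++ } END { print bad + 0 " malformed rows" }' forcing_sample.txt
```

Because the files carry no header, a mismatch between the column order here and the order declared in the global parameter file is a common and silent source of error.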

Vegetation Parameter File

The vegetation parameter file tells the model what percentage of each grid cell is occupied by each vegetation cover type. If you open the vegetation parameter file (see the test case for an example) and sum the vegetation fractions within a grid cell, you will probably notice that the total is almost always less than one. This is because the rest of the grid cell is bare soil with no vegetation cover; keep in mind that bare soil is still part of the simulation. The vegetation parameter file also includes information about root depth, root fraction, and other vegetation-related parameters.
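The fraction bookkeeping can be sketched as follows. The block below is an invented example in the spirit of the classic-mode format (a cell header line giving the cell id and number of vegetation tiles, then one line per tile with the vegetation class, its area fraction, and root-zone values); the specific numbers and root-zone columns are illustrative only:

```shell
# Made-up vegetation parameter block for one grid cell:
# line 1: cell id, number of vegetation tiles
# lines 2+: veg class, area fraction (Cv), root-zone depth/fraction pairs
cat > veg_sample.txt <<'EOF'
100 2
4 0.30 0.10 0.85 0.40 0.15
7 0.50 0.10 0.80 0.40 0.20
EOF

# Sum the area fractions; whatever is left over is treated as bare soil.
awk 'NR > 1 { cv += $2 } END { printf "vegetated: %.2f, bare soil: %.2f\n", cv, 1 - cv }' veg_sample.txt
```

Here the vegetated fractions sum to 0.80, so the remaining 0.20 of the cell is simulated as bare soil.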

Vegetation Library

The vegetation library provides the model with the characteristics of each vegetation type—for example, albedo, LAI, canopy coverage, roughness, and the other parameters the model needs to calculate Penman-Monteith evapotranspiration. The original vegetation library comes with twelve vegetation classes, and users usually don’t need to modify this file unless they want to add a new vegetation class. Keep in mind that the original vegetation library does not do a good job of representing different crop types. That’s one of the reasons that VIC-CropSyst exists.

Snow Band (aka Elevation Band) File

The snow band file divides each grid cell into different elevations. The model simulates each elevation band separately while lapsing temperature and precipitation for each elevation. VIC does this to improve the accuracy of the simulation. Remember that VIC is a large-scale model; a 50 km by 50 km grid cell might contain mountains and valleys with various climates. Although using the snow band file is optional (as specified in the global parameter file), doing so is usually recommended—especially over snow-dominant regions—because snow is very sensitive to elevation.
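As a rough sketch of what this file looks like, the (invented) line below describes one grid cell split into three elevation bands—first the area fraction of each band, then each band’s mean elevation, then the fraction of the cell’s precipitation assigned to each band. Treat the layout as illustrative and check the test case’s snow band file for the authoritative format:

```
# cell id | area fraction per band | mean elevation per band [m] | precip fraction per band
100  0.25 0.50 0.25  600 1100 1700  0.20 0.50 0.30
```

Note that the area fractions and precipitation fractions each sum to one across the bands of a cell.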

Global Parameter File

The global parameter file provides two things:

(1) The model’s main options, such as the output parameters of interest; whether the model should simulate frozen soil and the full energy balance; and information related to the start and end dates of the simulation, the start date of the met data file, the parameters included in the met data file, the number of soil layers, and the maximum number of snow bands.

(2) Paths to the other input files.
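To give a feel for the file, here is a heavily abridged, illustrative excerpt in the keyword–value style of a classic-mode global parameter file. The exact keywords vary somewhat between VIC-4.x and VIC-5, and the `...` placeholders stand for paths you supply—refer to the test case’s `global_param.STEHE.txt` for a complete, working example:

```
STARTYEAR    1949          # start of simulation
ENDYEAR      1949          # end of simulation
FULL_ENERGY  FALSE         # water-balance mode only
FROZEN_SOIL  FALSE         # skip frozen-soil physics
NLAYER       3             # number of soil layers
SOIL         ...path to soil parameter file...
VEGLIB       ...path to vegetation library...
VEGPARAM     ...path to vegetation parameter file...
SNOW_BAND    ...path to snow band file...
FORCING1     ...path/prefix of met data files...
RESULT_DIR   ...output directory...
```

The option lines cover item (1) above, and the path lines cover item (2).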

How to Download VIC

VIC works in a Linux/Unix environment. The VIC website has recently been updated and provides everything you need to know about the model. To download the model, go to its GitHub page; the download includes all of the source code. It’s worth exploring the different folders in your “VIC-master” folder. However, in this example we are only interested in the “/VIC-master/vic/drivers/classic” directory, which contains the classic driver’s code and is where the executable is built.

How to Download the Test Dataset

VIC has a test dataset for the Stehekin River basin in the Pacific Northwest. You can download it from here. This provides you with all the input files needed for the test case.

How to Adjust the Global Parameter File

The global parameter file needs to be adjusted based on what you want from the model. For example, if you need a specific output parameter, you must include it in the list of outputs and modify the number of output parameters accordingly. For this test case, let’s stick with the model’s original settings. You just need to point the global parameter file to your own directories. To do so, open “VIC_sample_data-master/classic/Stehekin/parameters/global_param.STEHE.txt” and edit the input and output paths so they match the locations of the files on your machine.
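If several path entries share a common prefix, a single `sed` substitution can retarget them all at once. This is a convenience sketch, not part of the official instructions; `/old/path` and `/home/me/vic` are placeholder directories, and the two lines in the sample file stand in for the real path entries in `global_param.STEHE.txt`:

```shell
# Two stand-in path lines mimicking entries in a global parameter file.
cat > global_sample.txt <<'EOF'
SOIL        /old/path/soil_param.txt
RESULT_DIR  /old/path/results/
EOF

# Rewrite the common prefix in place (GNU sed; on macOS use: sed -i '' ...).
sed -i 's|/old/path|/home/me/vic|g' global_sample.txt
cat global_sample.txt
```

Using `|` as the `sed` delimiter avoids having to escape the slashes in the paths.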

Create the Executable File

To run VIC, you need to have an executable file. To create an executable file, go to “/VIC-master/vic/drivers/classic,” and use Linux’s make command:

make clean
make

Run VIC

Finally, you’re ready to run the model. Simply type the following command in your Linux terminal:

/VIC-master/vic/drivers/classic/vic_classic.exe -g VIC_sample_data-master/classic/Stehekin/parameters/global_param.STEHE.txt

I think that’s enough on the VIC model for now. As I mentioned in my last blog post, there is a coupled agro-hydrologic model called VIC-CropSyst that simulates agricultural processes on top of the hydrology. If time allows, I will post about VIC-CropSyst in the future.