Exploring droughts produced by an ensemble of synthetically generated streamflow records

If you’re a reader of this blog, you are probably well aware that historical observations of streamflow and rainfall represent single traces of inherently stochastic processes. Simply relying on the historical record when evaluating decisions in water resources problems may cause us to underestimate uncertainty and overestimate the reliability of our system. Synthetic generation of streamflow and weather data can help us overcome this challenge. Synthetic generation is a family of methods that allow us to explore conditions not observed in the historical record while preserving fundamental statistical properties of the recorded observations (for a great history of synthetic generation, see Jon’s post). Before using synthetically generated records, we first need to perform a diagnostic evaluation to ensure that the synthetic records preserve properties such as the moments of the historical record as well as the spatial and temporal correlation structures (for some great tutorials on this process, see Julie and Lillian’s posts). Another important aspect of understanding the synthetically generated records is the extreme events they produce – a topic that this blog has not explored in detail. Evaluating extreme events helps us understand the internal variability present in the stochastic system and contextualize the ensemble of synthetic records with events in the historical record (such as large droughts and floods).

In this post, we’ll explore how a small ensemble of synthetically generated streamflows in the Upper Colorado River basin produces decadal drought events. We’ll start with a definition of decadal drought and then explore the frequency, magnitude, and distribution of drought events across our ensemble of synthetically generated streamflows. The streamflow records used in this post were generated using a Gaussian Hidden Markov Model (HMM) streamflow generator. These records have previously been diagnostically evaluated to confirm that they reproduce the statistical properties of the historical record. For details on HMM generation and diagnostic evaluation, see Julie’s posts on introducing HMM generators and providing example code.

Defining decadal drought

We’ll begin by defining a measure of decadal drought introduced by Ault et al. (2014). This measure defines a decadal drought as a period when the 11-year rolling mean of streamflows drops below a drought threshold, defined as half a standard deviation below the mean of the full historical record. Note that this is one of many definitions of drought. For a detailed discussion of drought metrics, see Rohini’s recent post.

\mu_{11} < \mu - 0.5\sigma

Periods of decadal drought in the historical record are highlighted with shading in the figure below.

[Figure: annual streamflow in the historical record with decadal drought periods shaded (cm_historical_ts.png)]

I’ve written a simple Python function to find these droughts in a streamflow record, count the number of droughts, and calculate the magnitude of drought events (more on this below).

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statistics
import matplotlib.cm as cm
from mpl_toolkits.axes_grid1 import make_axes_locatable

def drought_statistics(file_name, hist, drought_def):
    '''
    Identifies drought periods in a streamflow record. Returns drought instances, list of drought years, and severity

    :param file_name: path to the annual Q file (string)
    :param hist: historical record toggle (bool)
    :param drought_def: size of the moving window in years (int)

    :return:
        - drought_instances (list) - all drought periods in the record
        - final_drought_range (list) - ranges of years of decadal droughts in the record
        - total_severity (list) - sum of the differences between the mean and annual flows during each drought, normalized by the standard deviation
    '''

    # Read historical or synthetic record
    if hist:
        AnnualQ_s = pd.read_csv(file_name, sep=',')
    else:
        AnnualQ_s = pd.read_csv(file_name, sep=' ', header=None)
        AnnualQ_s.columns = ['cm', 'gm', 'ym', 'wm', 'sj']
    AnnualQ_s['Year'] = list(range(1909, 2014))

    # find the drought threshold
    std = statistics.stdev(AnnualQ_s.iloc[:, 0])
    threshold = np.mean(AnnualQ_s.iloc[:, 0]) - (0.5 * std)

    # flag rolling windows whose mean falls below the threshold
    drought_instances = [i for i, v in enumerate(AnnualQ_s.iloc[:, 0].rolling(drought_def).mean()) if v < threshold]
    drought_years = AnnualQ_s['Year'].rolling(drought_def).mean()[drought_instances]

    # find years of drought in each record
    drought_range = [range(0, 1)]  # seed with a dummy range so the first drought starts a new period
    for i in range(len(drought_years)):
        already_in = 0
        for j in range(len(drought_range)):
            if drought_years.iloc[i] in drought_range[j]:
                already_in = 1
        if already_in == 0:
            start_year = drought_years.iloc[i] - 5
            end_year = drought_years.iloc[i] + 5
            drought_range.append(range(int(start_year), int(end_year)))
        elif already_in == 1:
            if drought_years.iloc[i] - 5 < start_year:
                start_year = drought_years.iloc[i] - 5
            if drought_years.iloc[i] + 5 > end_year:
                end_year = drought_years.iloc[i] + 5
            drought_range.append(range(int(start_year), int(end_year)))

    # combine overlapping periods
    final_drought_range = []
    for i in range(1, len(drought_range)):
        if final_drought_range:
            if min(drought_range[i]) < max(final_drought_range[-1]):
                final_drought_range[-1] = range(min(final_drought_range[-1]), max(drought_range[i]) + 1)
            elif min(drought_range[i]) < min(drought_range[i - 1]):
                final_drought_range.append(range(min(drought_range[i - 1]), int(max(drought_range[i]) + 1)))
            else:
                final_drought_range.append(drought_range[i])
        else:
            if min(drought_range[i]) < min(drought_range[i - 1]):
                final_drought_range.append(range(min(drought_range[i - 1]), max(drought_range[i]) + 1))
            else:
                final_drought_range.append(drought_range[i])

    # calculate the severity of droughts
    total_severity = []
    if drought_instances:
        for i in range(len(final_drought_range)):
            cur_severity = 0
            for j in range(len(AnnualQ_s)):
                if AnnualQ_s['Year'].iloc[j] in final_drought_range[i]:
                    cur_severity += (AnnualQ_s.iloc[:, 0].mean() - AnnualQ_s.iloc[j, 0])/std
            total_severity.append(cur_severity)

    return drought_instances, final_drought_range, total_severity

We can run this function for our small ensemble of 100 generated records (which can be found here) using the code below:

drought_def = 11 # size of moving window

# synthetic realizations
num_droughts = np.zeros(100)
severity_curve = []
all_drought_years = []
max_severity = np.zeros(100)

for r in range(0,100):
    file_name = 'data/AnnualQ_s' + str(r) + '.txt' # edit here
    drought_instances, drought_years, drought_severity = drought_statistics(file_name, False, drought_def)

    if drought_instances:
        all_drought_years.append(drought_years)
        severity_curve.append(drought_severity)
        max_severity[r] = max(drought_severity)

# Historical (edit file name)
file_name = 'data/historical.csv'
drought_instances_hist, drought_years_hist, drought_severity_hist = drought_statistics(file_name, True, drought_def)

drought_ranks = np.zeros([len(severity_curve), 5])
for i in range(len(severity_curve)):
    sorted_curve = np.sort(severity_curve[i])[::-1]
    for j in range(len(sorted_curve)):
        drought_ranks[i, j] = sorted_curve[j]


drought_ranks_hist = np.zeros(5)
sorted_ranks_hist = np.sort(drought_severity_hist)[::-1]
for i in range(len(drought_severity_hist)):
    drought_ranks_hist[i] = sorted_ranks_hist[i]
max_ranks = np.max(drought_ranks, axis=0)
min_ranks = np.min(drought_ranks, axis=0)

# pad ranks with zeros for realizations without droughts
no_drought_traces = 100 - len(severity_curve)
print('number of realizations without droughts = ' + str(no_drought_traces))
for nr in range(no_drought_traces):
    drought_ranks = np.vstack((drought_ranks, np.zeros(5)))

Examining drought frequency

The first diagnostic we’ll perform is to examine the frequency of drought years in our ensemble of synthetic records (shaded years in the figure above) against the number of drought years in the historical record.

def count_drought_years(droughts):
    total_drought_years = 0
    for i in range(len(droughts)):
        total_drought_years += len(droughts[i])

    return total_drought_years

drought_lengths = np.zeros(100)
for d in range(len(all_drought_years)):
    drought_lengths[d] = count_drought_years(all_drought_years[d])

hist_drought_years = count_drought_years(drought_years_hist)


fig, ax = plt.subplots()
ax.bar(np.arange(len(num_droughts)), np.sort(drought_lengths))
ax.plot(np.arange(100), np.ones(100) * hist_drought_years, color='k', linestyle='--')
ax.set_xlim([0, 100])
ax.set_ylim([0, 50])
ax.set_xlabel('Rank Across Ensemble')
ax.set_ylabel('Number of Drought Years')
plt.savefig('DroughtFrequency.png')

This code produces the figure shown below. Each bar represents the number of drought years in a single synthetic realization (ranked from lowest to highest), and the dashed line represents the number of drought years in the historical record. We observe that the number of drought years in the historical record sits right in the middle of the range of synthetic realizations. We can see that the maximum number of drought years is twice as high as observed in the historical record – indicating that some realizations generate much more challenging drought scenarios than the observed record. We also see that some realizations have zero years of drought – indicating they are much milder than the observed record.

[Figure: ranked number of drought years across the synthetic ensemble, with the historical count shown as a dashed line (droughtfrequency.png)]

Evaluating drought magnitude

Another way we can evaluate droughts generated by our synthetic records is to examine the maximum magnitude of drought events produced by each record. To do this, we need a measure of drought magnitude. Here, we’ll define drought magnitude as the difference between the historical mean flow and the flow during drought periods normalized by the standard deviation. We can express this mathematically with a simple integral:

\int_{t_{start}}^{t_{end}} \frac{\mu - Q(t)}{\sigma} \, dt

We can visualize this metric in the figure below. Drought magnitude is the sum of the red regions for each drought.
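Since we are working with annual flows, the integral reduces to a discrete sum over the years of a drought period, which is exactly what the severity loop in drought_statistics computes:

\sum_{t=t_{start}}^{t_{end}} \frac{\mu - Q_t}{\sigma}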

Below, we’ll plot the maximum drought magnitude in each realization using a horizontal bar chart to distinguish it from the frequency plot. The maximum drought magnitude in the historical record is plotted as a black dashed line.

fig, ax = plt.subplots()
ax.barh(np.arange(len(num_droughts)), np.sort(max_severity))
ax.plot(np.ones(100)*max(drought_severity_hist), np.arange(100), color='k', linestyle='--')
ax.set_ylim([0,100])
ax.set_xlabel('Drought Magnitude')
ax.set_ylabel('Rank Across Ensemble')
plt.savefig('DroughtMagnitude.png')
[Figure: ranked maximum drought magnitude across the synthetic ensemble, with the historical maximum shown as a dashed line (droughtmagnitude.png)]

In the plot above, we observe that the magnitude of droughts in our synthetic records can get much more severe than the worst drought observed in the historical period. This finding aligns with studies of the Upper Colorado River basin using paleo data (Ault et al., 2014; Woodhouse et al., 2006) and underscores that relying solely on the historical record may underestimate drought risk for the basin.

Pulling it all together

To end our exploration, we’ll generate a final plot that simultaneously visualizes the frequency and magnitude of drought events in the synthetic realizations and the historical record. This plot will show all droughts in a given record, ranked according to their magnitude. To visualize the full ensemble of synthetic records, we’ll plot the percentiles of magnitude across the ensemble of 100 records.

fig, ax = plt.subplots()

# Calculate the percentiles of drought magnitude across the ensemble
ps = np.arange(0, 1.01, 0.01) * 100
for j in range(1, len(ps)):
    u = np.percentile(drought_ranks, ps[j], axis=0)
    l = np.percentile(drought_ranks, ps[j - 1], axis=0)
    # plot the percentiles
    ax.fill_between(np.arange(0, 1.25, .25), l, u, color=cm.OrRd(ps[j - 1] / 100.0), alpha=0.75,
                    edgecolor='none')
dummy = ax.scatter([100,200], [100,200], c=[0,1], cmap='OrRd') # dummy for the colorbar

# plot the historical droughts for context
for i in range(len(drought_ranks_hist)):
    if drought_ranks_hist[i] > 0:
        ax.scatter(np.arange(0, 1.25, .25)[i], drought_ranks_hist[i], color='k', s=50, zorder=3)

# make a color bar
divider = make_axes_locatable(ax)
cax = divider.append_axes('right', size='5%', pad=0.05)
fig.colorbar(dummy, cax=cax, orientation='vertical', label='Ensemble Exceedance')

# format the plot
ax.set_xlim([0, 1])
ax.set_xticks([0,.25, .5, .75, 1])
ax.set_ylim([0, 16])
ax.set_title('Drought In Synthetic Records', fontsize=16)
ax.set_ylabel('Drought Magnitude', fontsize=14)
ax.set_xlabel('Exceedance', fontsize=14)
plt.savefig('FreqSeverity.png')
[Figure: drought magnitude vs. exceedance, with ensemble percentiles shaded and historical droughts overlaid (freqseverity-1.png)]

The color in the figure above represents the percentiles of the synthetic records, and the two black points represent the droughts observed in the historical record. The vertical axis represents the magnitude of drought events, and the horizontal axis represents the fraction of droughts in a record that exceed a given magnitude. We can see that the maximum drought magnitude generated by the synthetic ensemble is much greater than the most severe drought of record (as observed in our magnitude figure). We can also see that the second most severe historical drought is right in the middle of the distribution of the second-highest droughts produced by the synthetic ensemble (0.25 on the horizontal axis). Finally, we observe that while most realizations produce two decadal drought events, a small fraction contain an additional drought event.

Final thoughts

This analysis complements traditional diagnostic evaluation of synthetic streamflow records by revealing the extent of extreme events generated by our synthetic generator and contextualizing synthetic events in the historical record. Historical contextualization can help us communicate our experimental results to decision-makers by making an analysis that uses synthetic records less abstract. This analysis also helps us understand the internal variability of the stochastic process, which can provide a helpful baseline if we’re investigating non-stationarity.

References

Ault, T. R., Cole, J. E., Overpeck, J. T., Pederson, G. T., & Meko, D. M. (2014). Assessing the risk of persistent drought using climate model simulations and paleoclimate data. Journal of Climate, 27(20), 7529-7549.

Woodhouse, C. A., Gray, S. T., & Meko, D. M. (2006). Updated streamflow reconstructions for the Upper Colorado River basin. Water Resources Research, 42(5).

An overview of the National Hydrologic Model

Over the past several months, I have been working with data from the US Geological Survey’s (USGS) National Hydrologic Model (NHM), a valuable resource that required some time to become familiar with. The goal of this post is to provide an overview of the NHM, incorporating as many links as possible, with the hope of encouraging others to utilize these resources and serving as a springboard for further investigation.

Why should you care about the NHM?

Water systems modelers are consistently in need of data. You might find that the specific streamflow data you seek does not exist, or perhaps you have some data but want to expand your training set. You may also wish to broaden your data to include not only streamflow but also estimates of various geophysical, hydrological, or environmental variables such as catchment area, vegetation classifications, and evapotranspiration.

The NHM can quench your data thirst by offering Continental US (CONUS) scale estimates of different geo-hydro-climatic variables synthesized from multiple datasets.

You can access these NHM datasets (simulated streamflow, land-cover parameter values, etc.) yourself through the ScienceBase website. However, it is first beneficial to understand the various components of the NHM infrastructure more broadly.


Introduction

The National Hydrologic Model (NHM) infrastructure is designed with the goal…

“… to facilitate hydrologic model parameterization on the basis of datasets that characterize geology, soil, vegetation, contributing area, topography, growing season, snow dynamics, climate, solar radiation, and evapotranspiration across the CONUS.”

The NHM includes several different components:

  1. The Geospatial Fabric
  2. Model input datasets
  3. Physical models used to simulate hydrologic processes

In this post, I will delve into each of these components, providing more information, and conclude by pointing out ways you can access the model outputs yourself.

The Geospatial Fabric

The geospatial fabric contains spatial data that serves as the skeletal structure of the NHM, facilitating the routing of modeled streamflow across catchments. The image below shows the CONUS-scale geospatial fabric.

The geospatial fabric contains individual stream reaches (called “segments”), delineated sub-catchments (called “Hydrologic Response Units”, or HRUs), and many specific points of interest which correspond to USGS observational gauge locations.

The raw data for version 1.1 of the Geospatial Fabric can be found here.

The spatial data is largely drawn from the National Hydrography Dataset (NHD) which defines the high-resolution stream linkages and provides a unique identifier (called “ComID”) for each stream segment.

If you are looking to set up a workflow which uses NHM streamflow data, you will need to specify the ComID numbers for your locations of interest. You can then retrieve the data for each ComID location.

If you are doing this with Python, I suggest you check out the PyNHD package, which I previously highlighted on this blog and which can identify NHD ComID numbers based on coordinate points.
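As a minimal sketch of that lookup (assuming a recent version of PyNHD, where the NLDI client exposes a comid_byloc method; the coordinate below is a hypothetical point, not one of my study locations):

# Minimal sketch: find the NHD ComID nearest a (lon, lat) point with PyNHD
from pynhd import NLDI

nldi = NLDI()
point = (-120.35, 37.95)  # hypothetical (lon, lat) coordinate
flowline = nldi.comid_byloc(point)  # GeoDataFrame describing the matched flowline, including its ComID
print(flowline)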

For more information on the geospatial fabric, you can see Bock et al. (2020).

PRMS

The Precipitation-Runoff Modeling System (PRMS) is described by the USGS as being:

“a deterministic, distributed-parameter, physical process based modeling system developed to evaluate the response of various combinations of climate and land use on streamflow and general watershed hydrology.”

The PRMS simulates many different hydrologic processes such as snowmelt, infiltration, groundwater recharge, surface runoff, and, ultimately, streamflow. A conceptual representation of the PRMS, taken from Markstrom et al. (2015), shows the modeled relationships between these processes:

The input data for the PRMS is flexible, but requires some combination of precipitation, air temperature, and solar radiation timeseries. Historic Daymet data provide the climate forcings for the historic period, but future climate projections can also be used.

Additionally, the model requires a set of catchment-delineated parameter values which quantify things such as soil types, types of vegetation coverage, percentage of impervious surfaces, etc. These data can be provided by the National Land Cover Database (NLCD), or alternative land cover change scenarios can be used to assess the impact of land surface on hydrology.

The PRMS thus provides a strong physical-process basis when coupled with the NHM. The combination of these two models is simply referred to as the NHM-PRMS.

You can access the user manual for the PRMS here, and a report describing the latest version (PRMS-IV) from Markstrom et al (2015) here.

Accessing NHM datasets

The NHM infrastructure is great in the sense that it is flexible and incorporates so many different datasets. However, this may introduce difficulty in running the models yourself (I have not attempted this, and don’t have guidance on that).

Fortunately, there are some datasets which have been shared by the USGS, and which conveniently provide various model outputs or calibrated parameters without the need to interact with the model directly.

You can access and explore available datasets from the NHM through the ScienceBase website.

A few notable datasets that you might be interested in using are:

NHM-PRMS simulated streamflow from 1980-2016

Here I want to highlight the dataset I have the most experience working with, and the one I believe may be most interesting to the WaterProgramming crowd: the “Application of the National Hydrologic Model Infrastructure with the Precipitation-Runoff Modeling System (NHM-PRMS), 1980-2016, Daymet Version 3 calibration”.

This dataset contains daily streamflow values across CONUS for the period from 1980-2016, at each HRU and segment contained in the geospatial fabric, and is available through the link here.

Note: Given the scale of this dataset, the file is rather large (92 GB).

Here I want to show how easy it can be to access the streamflow timeseries from this dataset. Assuming that you have downloaded the full NHM-PRMS dataset file for the fully observed-calibrated and Muskingum routing simulation (byHRU_musk_obs.tar), you can extract the segment streamflow timeseries using just a few lines of Python code:

## Example of how to extract streamflow data from NHM-PRMS
import tarfile
import netCDF4 as nc
import pandas as pd

# Open the archive and extract just the segment outflow member
tar = tarfile.open('your_data_directory/byHRU_musk_obs.tar')
tar.extract('seg_outflow', path='./extracted/')  # check tar.getnames() if this member is stored under a different name

# Open the extracted netCDF
segment_outflow = nc.Dataset('./extracted/seg_outflow.nc')

# Store values and segment IDs
vals = segment_outflow['seg_outflow'][:]
segment_ids = segment_outflow['segment'][:]

# Make a dataframe
segment_outflow_df = pd.DataFrame(vals, columns = segment_ids)

At this point, segment_outflow_df will contain data from all of CONUS, and you will likely want to choose a subset of this data using the ComID numbers for each segment; I’ll have to leave that part up to you!
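For illustration, that subsetting step might look like the following sketch (the segment IDs here are hypothetical placeholders; substitute the IDs for your own reaches of interest):

# Hypothetical example: subset the CONUS-wide dataframe to a few segments
my_segments = [1234, 5678]  # placeholder segment IDs
subset_outflow_df = segment_outflow_df.loc[:, my_segments]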

I hope this post helped to shine some light on this great resource, and encourages you to consider leveraging the NHM in your own work. As always, thanks for reading!

References

  1. Bock, A.E., Santiago, M., Wieczorek, M.E., Foks, S.S., Norton, P.A., and Lombard, M.A., 2020, Geospatial Fabric for National Hydrologic Modeling, version 1.1 (ver. 3.0, November 2021): U.S. Geological Survey data release, https://doi.org/10.5066/P971JAGF.
  2. Regan, R.S., Markstrom, S.L., Hay, L.E., Viger, R.J., Norton, P.A., Driscoll, J.M., LaFontaine, J.H., 2018, Description of the National Hydrologic Model for use with the Precipitation-Runoff Modeling System (PRMS): U.S. Geological Survey Techniques and Methods, book 6, chap B9, 38 p., https://doi.org/10.3133/tm6B9.
  3. Leavesley, G. H. (1984). Precipitation-runoff modeling system: User’s manual (Vol. 83, No. 4238). US Department of the Interior.
  4. Markstrom, S. L., Regan, R. S., Hay, L. E., Viger, R. J., Webb, R. M., Payn, R. A., & LaFontaine, J. H. (2015). PRMS-IV, the precipitation-runoff modeling system, version 4. U.S. Geological Survey Techniques and Methods, book 6, chap. B7.
  5. Hay, L.E., and LaFontaine, J.H., 2020, Application of the National Hydrologic Model Infrastructure with the Precipitation-Runoff Modeling System (NHM-PRMS), 1980-2016, Daymet Version 3 calibration: U.S. Geological Survey data release, https://doi.org/10.5066/P9PGZE0S.

Weather Regime-Based Stochastic Weather Generation (Part 1/2)

*This post is written in collaboration with Nasser Najibi from the Steinschneider Research Group*

In this blog post, we debut the weather-regime-based stochastic weather generator on the Water Programming Blog! This weather generator was first published in the following article:

Steinschneider, S., Ray, P., Rahat, S. H., & Kucharski, J. (2019). A weather‐regime‐based stochastic weather generator for climate vulnerability assessments of water systems in the western United States. Water Resources Research, 55(8), 6923-6945.

Since its inception, the generator has been applied in studies led by researchers primarily affiliated with Cornell or the Army Corps of Engineers. In this blog post, we’ll step through the key theory and components of this weather generator and show how to run a simple simulation and diagnostics with code that is now provided in an official GitHub repo! Future blog posts in this series will step through more sophisticated examples and additions to the generator. Readers can refer to the paper above for more details on the specific components or methods used in this generator.

Why use this generator?

If you are looking to develop alternative local weather scenarios informed by climate changes, you can go down two avenues: (1) using projections from General Circulation Models (GCMs) or (2) using statistical weather generation approaches. GCMs have a rich physical representation of ocean and atmospheric processes and are useful for generating internally consistent scenarios that can ultimately be downscaled to the region of interest. However, due to coarse model resolution, GCMs exhibit significant variability and biases in processes that are important for regional planning and vulnerability assessments, such as precipitation, atmospheric river flux, and droughts. These models are also computationally expensive to run, so you are limited to exploring a few storylines of warming scenarios.

Stochastic weather generators, however, are a complementary way to develop local weather scenarios. They are directly conditioned on the weather of the region of interest and therefore can better capture regional processes and be more applicable to regional water systems planning. This generator can be used to create many scenarios that preserve regional and temporal statistics but are not bounded by the historical record; notably, these scenarios can also be generated in a computationally efficient manner. The output from these generators can also be systematically adjusted to create basic climate change scenarios. What is often missing from these statistical models, however, is process information, and this is where the weather-regime-based stochastic generator can be useful.

What are the key components of this generator?

Figure 1: Generator flow chart (Steinschneider et al., 2019)

Figure 1 shows the main components of the first version of the generator, some of which have been extended or altered in newer studies. There are three primary levels to this generator, listed below.

Simulation of Weather Regimes: The first step in using this generator is identifying the weather regimes that bring regional weather to the application area. Weather regimes are large-scale atmospheric flow patterns that tend to persist, organize local weather systems, and can be effectively modeled as a Markovian process. They are often identified by either clustering or fitting a Hidden Markov Model (HMM) to a selected number of principal components of geopotential height (GPH) anomalies. In many of the studies that use this generator in the Western U.S. (Steinschneider et al., 2019; Najibi et al., 2021; Rahat et al., 2022; Gupta et al., in review), 500 hPa GPH anomalies are used to identify cool-season weather regimes because this level effectively captures the weather systems organized by the jet stream. The same GPH anomalies have also been used to identify North Atlantic weather regimes that bring weather systems to Europe (Charlton-Perez et al., 2018), and a combination of 500 and 250 hPa GPH anomalies has been used to identify weather regimes in Chile (Meseguer-Ruiz et al., 2020). The choice of the number of weather regimes is also region specific. For the Western U.S., prior work has developed a framework to identify the number of weather regimes that best reproduces regional weather statistics over a large geographic area in California (Najibi et al., 2021); there, 4-5 metastable regimes are defined (both in Guirguis et al., 2020 and Najibi et al., 2021), but other regions might require more. Once weather regimes are identified, they can be simulated as Markov chains, effectively dictating the large-scale atmospheric flow processes taking place on a given day (see the sketch after this list).

Simulation of Local Weather: The next step is a bootstrapping procedure to create local weather. Each day of the historical record is classified as belonging to a specific weather regime and has a certain value of precipitation and temperature associated with each grid cell in the basin of interest. Thus, when chains of weather regimes are simulated, precipitation and temperature can be bootstrapped from time periods of the historical record where simulated blocks of weather regimes match historical blocks. The block bootstrapping procedure helps maintain the joint distribution of the weather variables, while an additional copula-based jittering technique broadens the range of simulated precipitation beyond the historical distribution while still maintaining the shape of the underlying distribution.

Dynamic and Thermodynamic Perturbations: Dynamic and thermodynamic climate changes are imposed separately in this generator, which provides a way to systematically understand how specific climate changes lead to changes in local climate. Dynamic changes influence how weather regimes evolve and can change their frequency. This can be done directly, by changing the transition probabilities of the weather regimes, or indirectly, by imposing covariates that dictate the evolution of the weather regimes. In Steinschneider et al. (2019), a Niño 3.4 index is used to force weather regime evolution and is systematically adjusted to create more frequent El Niño and La Niña events. In Gupta et al. (in review), a 600-year sequence of tree-ring-reconstructed principal components of weather regime occurrence is used as an alternative covariate to better capture the natural variability inherent in the weather regimes. Thermodynamic trends are applied after the local weather is generated and consist of manipulations to temperature or precipitation. Stepwise temperature trends can be added to each grid cell, or elevation-dependent warming can be imposed. Based on an imposed temperature change, the mean or an upper quantile of the precipitation distribution can be scaled to reflect the increased capacity of a warmer atmosphere to hold moisture (see the sketch below).
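To make the regime-simulation and thermodynamic-perturbation steps concrete, below is a minimal illustrative sketch in Python (the repo’s implementation is in R; the 3-regime transition matrix and parameter values here are invented for illustration). The precipitation scaling mirrors the generator’s perc.q = (1 + pccc)^tc and perc.mu = (1 + pmuc) factors that appear in the demo code later in this post:

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 3-regime transition matrix (rows sum to 1); in practice this
# is estimated from classified historical days or an HMM fit
P = np.array([[0.80, 0.15, 0.05],
              [0.10, 0.75, 0.15],
              [0.05, 0.20, 0.75]])

def simulate_regimes(P, n_days, start=0):
    """Simulate a first-order Markov chain of daily weather regime labels."""
    chain = np.empty(n_days, dtype=int)
    chain[0] = start
    for t in range(1, n_days):
        chain[t] = rng.choice(len(P), p=P[chain[t - 1]])
    return chain

regimes = simulate_regimes(P, n_days=365 * 50)

# Thermodynamic perturbation factors for an imposed warming
tc = 2.0     # imposed temperature change (degrees C)
pccc = 0.07  # ~7%/degree C (Clausius-Clapeyron-like) scaling of the upper tail
pmuc = 0.0   # no shift to the mean in this example

upper_tail_factor = (1 + pccc) ** tc  # applied above a high quantile (0.99 in the demo)
mean_factor = 1 + pmuc                # applied to the rest of the non-zero precipitation
print(upper_tail_factor, mean_factor)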

Where can I use this generator?

A weather-regime-based stochastic weather generator is not applicable everywhere. It can be utilized for any region in which weather regimes dictate regional weather, which is generally true of the midlatitude regions of the Earth. Examples of locations where weather regimes are active and have been identified in studies include the Western U.S., the Great Lakes region of the U.S., parts of Western Europe, Morocco, Chile, and New Zealand, among others. If a midlatitude location is not applicable to your case study, consider using other synthetic weather generation techniques extensively covered in this series of posts by Julie Quinn. It is worth noting that the quality of the generator will also depend on having a sufficiently long historical precipitation and temperature record to bootstrap from.

Where can I find the code?

Now, for the first time, demo code is publicly available in an official GitHub repo. In this demo, created by Nasser, the user can simulate precipitation and temperature for 12 randomly selected grid cells in the Tuolumne River Basin in California. In its most complex form, the generator can be used to generate weather across 12 basins simultaneously in the Central Valley region of California. Download or clone the GitHub repository, then download the necessary data from the Zenodo archive here and drag them into the Data folder in the repo. From there, you can step through the rest of the demo.

  1. You will first simulate weather regimes for a certain number of years using either a parametric method (config.WRs.param.NHMM.R) or a non-parametric method (config.WRs.non_param.R).
    • Parametric Method: In this method, NHMMs are fitted using the following covariates: the first four PCs of the SPI index over California and two harmonics for the cold season. The two harmonics are the only covariates used during the warm season. From this fitted NHMM, one can simulate an arbitrary sequence of WRs. The parametric method is particularly useful if the user is interested in using specific covariates to force the evolution of the weather regime sequence.
    • Non-parametric Method: Randomly resample blocks of m years (here, m = 4) of observed WRs to create a 1000-yr sequence of simulated WRs. Because the non-parametric method doesn’t require covariates, it is particularly effective for flexibly generating very long sequences of weather regimes that can more appropriately capture natural variability, for example.
  2. Once the WRs are simulated, you can simulate local precipitation and temperature. This is done by running the config.simulations.R script.
#Demo code- simulates precipitation and temperature for 12 grid cells in the Tuolumne Basin
rm(list=ls())

library(MASS)
library(fExtremes)
library(evmix)
library(zoo)
library(abind)
library(parallel)
library(mvtnorm)
library(tictoc)
library(lubridate)

#adjust main directory and directory for simulation files
mainDir <- "D:/Projects/Tuolumne_River_Basin/GitHub_WGENv2.0"
setwd(mainDir)

dir.to.sim.files <- "./Data/simulated.data.files/WGEN.out"
dir.create(file.path(mainDir, dir.to.sim.files), showWarnings = FALSE)

num.iter <- 1 # A single long trace (e.g., thousand years) is sufficient although we create like 5 ensembles in the simulated WRs
basin.cnt <- 'myPilot' # for a set of 12 randomly selected Livneh grids in the Tuolumne River Basin 
number.years.long <- 3050 # e.g., 500, 1000, 2000, 3000, 5000 years, etc [note: current NHMM output (parametric) is for 1036 years; current non-parametric is for 3050 years]

##############################Define perturbations#######################################
## Climate changes and jitter to apply:
change.list <- data.frame("tc"=  c(0), # e.g., 1, 2, ...
                          "jitter"=  c(TRUE),
                          "pccc"=c( 0), # e.g., 0.07, 0.14, ...
                          "pmuc"=c( 0)# e.g., -.125
)

# For simulating the WRs (i.e., 'config.WRs.non_param.R'; 'config.WRs.param.NHMM.R'), do you use non-parametric or parametric method
use.non_param.WRs <- TRUE #TRUE for non-parametric, FALSE for parametric simulated WRs
####### Choose A (FALSE) or B (TRUE) below for the simulated WRs #################
#-- A) load in NHMM with WRs (parametric)
tmp.list <- readRDS(paste0("./Data/simulated.data.files/WRs.out/final.NHMM.output.rds"))
weather.state.assignments <- tmp.list$WR.historical # this is for BOTH the parametric and non-parametric approach 
sim.weather.state.assignments <- tmp.list$WR.simulation[,1:num.iter] # this is only for the parametric approach of simulating WRs
num.states <- length(unique(as.vector(weather.state.assignments)))    #number of WRs in the model
rm(tmp.list) # for memory

#-- B) load in non-parametric simulation with WRs
np.list.sim.weather.state.assignments <- readRDS(paste0("./Data/simulated.data.files/WRs.out/final.non_param.WRs.output.rds")) #this is only for the non-parametric approach of simulating WRs
num.states <- length(unique(np.list.sim.weather.state.assignments$markov.chain.sim[[num.iter]]))    #number of WRs in the model

# load in supporting functions
files.sources = list.files("./Programs/functions",full.names = TRUE)
my.functions <- sapply(files.sources, source)
##################################

#dates for the historical WRs
start_date_synoptic="1948-01-01"; end_date_synoptic="2021-12-31"
dates.synoptic <- seq(as.Date(start_date_synoptic),as.Date(end_date_synoptic), by="days")
#create dates for the simulated WRs
my.num.sim = ceiling(number.years.long/length(unique(format(dates.synoptic,'%Y')))) # number of chunks of historical periods; e.g., 1 is one set of simulation equal to the historical
long.dates.sim <- rep(dates.synoptic,times=my.num.sim)


  • Lines 14-23: The overarching parameters of the simulation. Here you can adjust paths to directories, how many iterations you want to run (currently set to 1), and the length of the simulation.
  • Lines 26-31: The thermodynamic climate change parameters. Here you can impose a temperature increase (tc), a scaling factor for the upper quantile of the precipitation distribution (pccc), and a mean shift to the simulated precipitation distribution (pmuc). Currently, no climate changes are imposed, but copula-based jittering is applied, so jitter is kept as TRUE.
  • Lines 33-45: Depending on which method was used to simulate the WRs, load in the WRs accordingly, using the blocks of code under either A or B. Be sure to run Line 38 no matter which method was used.
  3. Lines 62-98 contain more details on precipitation characteristics.
#########Precipitation characteristics#########

#location of obs weather data (RData format): weather data (e.g., precip and temp) as matrices (time x lat|lon: t-by-number of grids); dates vector for time; basin average precip (see the example meteohydro file)
processed.data.meteohydro <- paste0("./Data/processed.data.files/processed.meteohydro/processed.meteohydro.myPilot.RData")
load(processed.data.meteohydro) #load in weather data


qq <- .99              # percentile threshold to separate Gamma and GPD distributions
thshd.prcp <- apply(prcp.site,2,function(x) {quantile(x[x!=0],qq,na.rm=T)})


#Bootstrapping choices###
window.size <- rep(3,length(months))   #the size of the window (in days) from which runs can be bootstrapped around the current day of simulation, by month: Jan -- Dec

trace <- 0.25     #trace prcp threshold. 0.25 mm (for Livneh dataset); or 0.01 inches


#The spearman correlation between basin and site precipitation, used in the copula-based jitters
prcp.basin.site <- cbind(prcp.basin,prcp.site)
S <- cor(prcp.basin.site,method="spearman")
n.sites <- dim(prcp.site)[2] # Number of gridded points for precipitation


#fit emission distributions to each site by month. sites along the columns, parameters for month down the rows
emission.fits.site <- fit.emission(prcp.site=prcp.site,
                                   months=months,
                                   months.weather=months.weather,
                                   n.sites=n.sites,
                                   thshd.prcp=thshd.prcp)


#how often is prcp under threshold by month and site
qq.month <- sapply(1:n.sites,function(i,x=prcp.site,m,mm) {
  sapply(m,function(m) {
    length(which(as.numeric(x[x[,i]!=0 & mm==m & !is.na(x[,i]),i])<=thshd.prcp[i]))/length(as.numeric(x[x[,i]!=0 & mm==m & !is.na(x[,i]),i]))
  })},
  m=months, mm=months.weather)

  • thshd.prcp: the threshold precipitation quantile that dictates the separation between extreme precipitation and the rest of the distribution (set at 0.99). Precipitation scaling is applied to this upper quantile.
  • trace: the precipitation threshold below which a day is classified as “dry”
  • window.size: the window around each day of simulation from which data is bootstrapped from the historical dataset
  • S: the Spearman rank correlation between individual basin sites and the basin average (used in the copula-based jittering)
  • fit.emission: the function (in fit.emission.R) that fits gamma and GPD distributions to the prcp.site data. A GPD is fit to the precipitation that lies above thshd.prcp at each site, while a gamma distribution is fit to the rest of the precipitation data by month (see the sketch below).
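As an illustrative sketch of this mixed emission distribution (in Python with scipy on synthetic data, standing in for the repo’s R fitting routines), a gamma distribution is fit below the threshold and a generalized Pareto distribution (GPD) to the exceedances above it:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
wet = rng.gamma(shape=0.6, scale=8.0, size=5000)  # synthetic non-zero daily precip (mm)

qq = 0.99
thresh = np.quantile(wet, qq)  # analogous to thshd.prcp

# gamma fit to the bulk of the distribution (at or below the threshold)
gamma_params = stats.gamma.fit(wet[wet <= thresh], floc=0)

# GPD fit to the exceedances above the threshold
gpd_params = stats.genpareto.fit(wet[wet > thresh] - thresh, floc=0)

print("gamma (shape, loc, scale):", gamma_params)
print("GPD (shape, loc, scale):", gpd_params)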
  4. Line 122 is where the bootstrapping process occurs (wgen.simulator.R).
##########################Simulate model with perturbations#######################################

#decide whether or not to use historic or simulated WRs
if(use.non_param.WRs) {
  
  dates.sim <- np.list.sim.weather.state.assignments$dates.sim
  markov.chain.sim <- np.list.sim.weather.state.assignments$markov.chain.sim
  identical.dates.idx <- dates.synoptic%in%dates.weather
  weather.state.assignments <- weather.state.assignments[identical.dates.idx]
  
} else {
  dates.sim <- long.dates.sim
  markov.chain.sim <- as.list(data.frame(sim.weather.state.assignments))
}

#run the daily weather generate num.iter times using the num.iter Markov chains
mc.sim <- resampled.date.sim <- resampled.date.loc.sim <- prcp.site.sim <- tmin.site.sim <- tmax.site.sim <-  list()
start_time <- Sys.time()
for (k in 1:num.iter) {
  my.itertime <- Sys.time()
  my.sim <- wgen.simulator(weather.state.assignments=weather.state.assignments,mc.sim=markov.chain.sim[[k]],
                           prcp.basin=prcp.basin,dates.weather=dates.weather,
                           first.month=first.month,last.month=last.month,dates.sim=dates.sim,
                           months=months,window.size=window.size,min.thresh=trace)
  print(paste(k,":", Sys.time()-my.itertime))
  
  #each of these is a list of length iter
  mc.sim[[k]] <- my.sim[[1]]
  resampled.date.sim[[k]] <- my.sim[[2]]
  resampled.date.loc.sim[[k]] <- my.sim[[3]]
  prcp.site.sim[[k]] <- prcp.site[resampled.date.loc.sim[[k]],]
  tmin.site.sim[[k]] <- tmin.site[resampled.date.loc.sim[[k]],]
  tmax.site.sim[[k]] <- tmax.site[resampled.date.loc.sim[[k]],]
}
end_time <- Sys.time(); run.time.sim <- end_time - start_time
print(paste("SIMULATION started at:",start_time,", ended at:",end_time)); print(round(run.time.sim,2))
#remove for memory
rm(my.sim)
  5. Once baseline weather is generated, perturbations can be applied in Line 156 using the perturb.climate.R function. In this function, temperature trends and precipitation scaling are applied if applicable; if not, just the copula-based jittering is applied. This output is then saved for analysis.
#once the simulations are created, we now apply post-process climate changes (and jitters)
for (change in 1:nrow(change.list)) {
  start_time <- Sys.time()
  cur.tc <- change.list$tc[change]
  cur.jitter <- change.list$jitter[change]
  cur.pccc <- change.list$pccc[change]
  cur.pmuc <- change.list$pmuc[change]
  
  #precipitation scaling (temperature change dependent)
  perc.q <- (1 + cur.pccc)^cur.tc    #scaling in the upper tail for each month of non-zero prcp
  perc.mu <- (1 + cur.pmuc)          #scaling in the mean for each month of non-zero prcp
  
  #perturb the climate from the simulations above (the longest procedure in this function is saving the output files)
  set.seed(1)   #this ensures the copula-based jitterrs are always performed in the same way for each climate change
  perturbed.sim <-  perturb.climate(prcp.site.sim=prcp.site.sim,
                                    tmin.site.sim=tmin.site.sim,
                                    tmax.site.sim=tmax.site.sim,
                                    emission.fits.site=emission.fits.site,
                                    months=months,dates.sim=dates.sim,n.sites=n.sites,
                                    qq=qq,perc.mu=perc.mu,perc.q=perc.q,S=S,cur.jitter=cur.jitter,cur.tc=cur.tc,
                                    num.iter=num.iter,thshd.prcp=thshd.prcp,qq.month=qq.month)
  set.seed(NULL)
  prcp.site.sim.perturbed <- perturbed.sim[[1]]
  tmin.site.sim.perturbed <- perturbed.sim[[2]]
  tmax.site.sim.perturbed <- perturbed.sim[[3]]
  
  #remove for memory
  rm(perturbed.sim)
  end_time <- Sys.time(); run.time.qmap <- end_time - start_time
  print(paste("QMAPPING started at:",start_time,", ended at:",end_time)); print(round(run.time.qmap,2))
  
  #how to name each file name to track perturbations in each set of simulations
  file.suffix <- paste0(".temp.",cur.tc,"_p.CC.scale.",cur.pccc,"_p.mu.scale.",cur.pmuc,"_hist.state.",use.non_param.WRs,"_jitter.",cur.jitter,"_s",num.states,"_with_",num.iter,".",basin.cnt)
  
  print(paste("|---start saving---|"))
  write.output.large(dir.to.sim.files,
                     prcp.site.sim=prcp.site.sim.perturbed,
                     tmin.site.sim=tmin.site.sim.perturbed,
                     tmax.site.sim=tmax.site.sim.perturbed,
                     mc.sim=mc.sim,resampled.date.sim=resampled.date.sim,
                     dates.sim=dates.sim,file.suffix=file.suffix)
  print(paste("|---finished saving---|"))
  
}

#remove for memory
rm(resampled.date.sim,resampled.date.loc.sim,markov.chain.sim,mc.sim)
##################################################################################################
print(paste0("--- done.  state= ", num.states," --- ensemble member:",num.iter))
gc()

# The End
#####################################################################################
  6. The R script plot.statistics.full.diagnostics.long.R can then be used to generate a variety of figures that demonstrate the quality of the generator in preserving and/or expanding around the observed statistics of each site. Options include:
  • Exceedance probability curves for all 12 sites
  • Simulated vs. observed maximum precipitation for all 12 sites
  • Simulated vs. observed 1-20 year droughts consolidated across all locations

And much more! Code is provided for 11 different diagnostic assessments of the quality of the generator. Look out for additions to the repository over the coming months, and feel free to open a GitHub issue if you have questions!

References

Charlton‐Perez, A. J., Ferranti, L., & Lee, R. W. (2018). The influence of the stratospheric state on North Atlantic weather regimes. Quarterly Journal of the Royal Meteorological Society, 144(713), 1140-1151.

Guirguis, K., Gershunov, A., DeFlorio, M.J., Shulgina, T., Delle Monache, L., Subramanian, A.C., Corringham, T.W. and Ralph, F.M., 2020. Four atmospheric circulation regimes over the North Pacific and their relationship to California precipitation on daily to seasonal timescales. Geophysical Research Letters, 47(16), p.e2020GL087609.

Gupta, R.S., Steinschneider, S., & Reed, P.M. Understanding Contributions of Paleo-Informed Natural Variability and Climate Changes on Hydroclimate Extremes in the Central Valley Region of California. Authorea. March 13, 2023. DOI: 10.22541/essoar.167870424.46495295/v1

Meseguer-Ruiz, O., Cortesi, N., Guijarro, J. A., & Sarricolea, P. (2020). Weather regimes linked to daily precipitation anomalies in Northern Chile. Atmospheric Research, 236, 104802.

Najibi, N., Mukhopadhyay, S., & Steinschneider, S. (2021). Identifying weather regimes for regional‐scale stochastic weather generators. International Journal of Climatology, 41(4), 2456-2479.

Rahat, S. H., Steinschneider, S., Kucharski, J., Arnold, W., Olzewski, J., Walker, W., … & Ray, P. (2022). Characterizing hydrologic vulnerability under nonstationary climate and antecedent conditions using a process-informed stochastic weather generator. Journal of Water Resources Planning and Management, 148(6), 04022028.

Steinschneider, S., Ray, P., Rahat, S. H., & Kucharski, J. (2019). A weather‐regime‐based stochastic weather generator for climate vulnerability assessments of water systems in the western United States. Water Resources Research, 55(8), 6923-6945.