Intro to HDF5/h5py and comparison to CSV for speed & compression

The Hierarchical Data Format (HDF) is a family of free and open-source file formats designed to facilitate the storage and manipulation of large and heterogeneous datasets. The latest version, HDF5, is commonly used across a range of commercial and non-commercial applications. Like Parquet (see previous blog post by Rohini Gupta) and NetCDF (see this introductory post by Jon Herman as well as several others on more advanced topics), HDF5 offers substantial speed and compression advantages over flat text file formats like CSV and thus is especially useful for scientific modeling efforts that require storage and analysis of large and complex datasets.

In this post, I will first give an introduction to the HDF5 file format and how to read/write HDF5 files in Python using the h5py package. I will then perform an experiment writing and reading datasets of different sizes in order to quantitatively demonstrate the speed and compression advantages of HDF5/h5py over CSV.

Figure 1: Advantages of HDF5 format. Source: HDF Group.

Background

HDF was originally developed by the US National Center for Supercomputing Applications, and is now maintained by the non-profit HDF Group. HDF5 builds on the earlier HDF4 and is closely related to the popular NetCDF file format (in fact, modern NetCDF-4 files are stored using HDF5 under the hood). One of its major advantages is its hierarchical data structure that allows for nested “groups,” which are analogous to “directories” in a computer file system. Each group can contain one or more “datasets,” which are individual arrays of data. HDF5 allows for heterogeneous data types to be combined either within a dataset or across groups. HDF5 also allows for metadata “attributes” to be attached at any level of the hierarchy, from the root group down to individual datasets.

This hierarchical and flexible format has many advantages for data-heavy scientific modeling workflows. For example, in large-scale exploratory modeling workflows that dispatch ensembles of thousands of simulations, HDF5 format can be used to organize both the input data files and the output results files. Hierarchical groups can be used to organize data across nested experimental designs; for example, results can be organized with an outer group for each candidate policy/solution, an inner group for each sampled uncertain state-of-the-world, and one or more inner datasets to store arrays of relevant results for each policy/solution within each state-of-the-world. The relevant policy parameters, sampled states-of-the-world, and other important information can be stored right alongside the data as metadata at the proper level of the hierarchy. This can drastically improve the transparency and portability of results in complex modeling workflows.

Figure 2: Hierarchical structure of an HDF5 file. Source: Neon Science.

Another advantage is that HDF5 supports transparent compression filters such as gzip, which can significantly reduce the file size compared to flat text file formats like CSV (note that compression is optional and is requested when each dataset is created, rather than applied automatically). HDF5 also allows for “data slicing” and “chunking” to read and write small subsets of the data at a time. This enables users to efficiently interact with large datasets that exceed the available RAM.
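For those who want to see what this looks like in practice, here is a minimal sketch (the file, group, and dataset names are hypothetical) of how compression and chunking can be requested through h5py when a dataset is created, and how a small slice can then be read back without loading the full array:

import h5py
import numpy as np

### create some example data
data = np.random.random((100000, 50))

### compression and chunking are requested per dataset at creation time
with h5py.File('compressed_example.hdf5', 'w') as f:
    f.create_dataset('results/simulations', data=data, chunks=(1000, 50),
                     compression='gzip', compression_opts=4)

### read back only a small slice; h5py pulls just the needed chunks from disk
with h5py.File('compressed_example.hdf5', 'r') as f:
    subset = f['results/simulations'][:10, :5]
    print(subset.shape)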

Interacting with HDF5 & h5py

Because of its complex hierarchical format, HDF5 is not quite so simple to read and write as CSV files. You cannot open the files directly in a text editor or terminal, for example. However, the HDF Group provides the free HDFView software which has a simple and effective graphical user interface for viewing and editing the contents of HDF5 files.

Figure 3: Viewing an HDF5 dataset with HDFView software. The upper left panel shows the hierarchical structure of this data, with an outer group “soln50”, an inner group “du_1”, and within that many datasets such as “du10”. This dataset has several attributes such as “dunames”, an array of character strings (bottom left). We can also view the main numeric data array associated with “du10” (bottom right).

The other way to interact with HDF5 files is through an application programming interface (API) provided by your programming language of choice. For example, h5py is a free and open source Python package that provides a fast and user-friendly interface to the C-level HDF5 API. Python users who are already familiar with the basics of dictionaries and Numpy arrays should find the h5py interface to be very intuitive.

Here is an example demonstrating the basic syntax for writing and reading to HDF5 files using the h5py package.

import h5py
import numpy as np

### Create a new HDF5 file with a write connection
with h5py.File('example.hdf5', 'w') as f_write:
    ### create a new 2x2 array of 64-bit integers called mammals, within the animals group
    ds1 = f_write.create_dataset('animals/mammals', (2,2), dtype='i8')
    ### set values for mammals array
    ds1[...] = [[1,2],[3,4]]
    ### create/set metadata attribute tied to mammals array
    ds1.attrs['names'] = ['cat', 'dog']
    ### here is a different way to create and set a new dataset directly with data
    f_write['animals/reptiles'] = [[1.2, 2.3], [3.4, 4.5], [5.6, 6.7]]
    ### attribute tied directly to group rather than dataset
    f_write['animals'].attrs['data collection'] = 'made up'
    ### here is a new group and dataset, this time storing character string data.
    f_write['vehicles/cars'] = ['prius', 'focus']

### Reopen the HDF5 file in read mode to verify our data was stored correctly
with h5py.File('example.hdf5', 'r') as f_read:
    ### assign mammals array to variable ds1
    ds1 = f_read['animals/mammals']
    ### read stored array
    print(ds1[...])
    ### read attribute
    print(ds1.attrs['names'])
    ### you can also read array data directly without naming a variable first
    print(f_read['vehicles/cars'][...])
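
Because h5py groups behave much like Python dictionaries, you can also explore an unfamiliar file programmatically rather than opening it in HDFView. For example, continuing with the example.hdf5 file created above:

with h5py.File('example.hdf5', 'r') as f_read:
    ### list the groups/datasets at the root level, analogous to dict.keys()
    print(list(f_read.keys()))
    ### list the members of a particular group
    print(list(f_read['animals'].keys()))
    ### recursively print the path of every group and dataset in the file
    f_read.visit(print)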

Testing the performance of HDF5/h5py vs CSV

Now that we have covered the basics, I will demonstrate the value of HDF5/h5py over a standard CSV data workflow in terms of writing and reading speed as well as compressed storage size. To do this, I perform a simple experiment to loop over 9 different synthetic datasets, ranging from 200,000 to 60,000,000 total elements (Table 1). These datasets vary both in the size and number of arrays that are stored. For example, there are 3 datasets with 20,000,000 total elements: (1) a single array of 20,000,000 elements; (2) 1,000 different arrays with 20,000 elements each; and (3) 1,000,000 different arrays with 20 elements each. For each of the 9 synthetic datasets, I time both a writing-based and a reading-based task using CSV as well as HDF5. I also record the storage size for the full dataset in each case. Each individual array within a dataset is stored as its own CSV file, while for HDF5 the entire dataset is stored as a single file with individual arrays separated by multiple levels of nested groups.

For brevity’s sake I will only provide the code needed for the HDF5 experiment below. However, all code used in this blog post, including the CSV experiment, can be found at my WPB_HDF5 GitHub repository.

First, here is the function used to create the synthetic data for this experiment. Given the means and standard deviations for two random variables as well as their correlation, this function will return a NumPy array with nsamp randomly sampled values of the two variables.

import numpy as np
import h5py
import time
import os
import shutil
import sys
import subprocess

def generate_mvn(means, stds, corrs, nsamp):
    '''
    function for generating correlated multivariate normal random variables as numpy array
    :param means: 1d array of means for N variables
    :param stds: 1d array of standard deviations for N variables
    :param corrs: NxN correlation matrix
    :param nsamp: Number of samples to generate
    :return: (size x N) array of sampled variables
    '''
    covs = np.matmul(np.matmul(np.diag(stds), corrs),
                    np.diag(stds))
    return np.random.multivariate_normal(means, covs, size=nsamp)
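
As a quick sanity check (with made-up parameter values chosen purely for illustration), the function can be called directly to confirm the shape of its output:

### 5 samples of 2 correlated variables -> returns an array of shape (5, 2)
means = np.array([0., 1.])
stds = np.array([1., 0.5])
corrs = np.array([[1., 0.8], [0.8, 1.]])
print(generate_mvn(means, stds, corrs, 5).shape)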

Now to create the multi-level data, I draw nparam alternative parameterizations for the means, the standard deviations, and the correlation. For each of the nparam^3 total parameterizations, I draw nsamp samples of the 2 variables. For example, if nsamp=30 and nparam=100, then I have 1,000,000 total parameterizations, each of which has its own array of 2×30=60 elements. In the CSV experiment this is stored as 1,000,000 different CSV files. In the HDF5 experiment this is stored as a single HDF5 file with 1,000,000 different arrays. The data are nested with outer groups representing the 100 different sampled means. Inside each of these are 100 inner groups representing the sampled standard deviations, and inside each of those are the 100 datasets corresponding to the sampled correlations. Each of these innermost arrays has 2×30=60 elements.

In the experiment below, I loop over 9 different combinations of nsamp and nparam. For each, I randomly sample the means, standard deviations, and correlations, and then generate the random variable arrays for each combination. Each is stored as an array in the HDF5 file at the appropriate nested group path, along with the corresponding sampled parameter values, which are stored as attributes. I print the total time for writing each nsamp/nparam combination for HDF5 as well as CSV (not shown).

### loop over different experiment sizes
for nsamp, nparam in [(30,100), (30000,10), (10,100), (10000,10), (10000000,1), (1000,10), (1000000,1), (100,10), (100000,1)][::-1]:
    ### create new directories to hold data. Remove any old data.
    data_dir = f'nsamp{nsamp}_nparam{nparam}/'
    try:
        shutil.rmtree(data_dir)
    except:
        pass
    os.mkdir(data_dir)
    os.mkdir(data_dir + 'data_hdf5/')
    print(f'nsamp: {nsamp}, nparam: {nparam}\n')
    sys.stdout.flush()

    ### Randomly sample alternative means, stds, & corrs for 2-variable system.
    np.random.seed(101)
    mean_samps = np.random.uniform(-1, 1, size=(nparam,2))
    std_samps = np.random.uniform(0, 1, size=(nparam,2))
    corr_samps = np.random.uniform(0, 1, size=nparam)

    #### start timer
    t0 = time.perf_counter()

    ### Open connection to HDF5 file in write mode via h5py package
    with h5py.File(f'{data_dir}/data_hdf5/data.hdf5', 'w') as f_write:
        ### Loop over sampled means, stds, corrs -> within each, randomly sample 2-variable system nsamp times
        for i in range(nparam):
            means = mean_samps[i,:]
            for j in range(nparam):
                stds = std_samps[j,:]
                for k in range(nparam):
                    corr = corr_samps[k]
                    corrs = np.array([[1, corr], [corr, 1]])
                    ### generate sample for this combination of parameters
                    samps = generate_mvn(means, stds, corrs, nsamp)
                    ### store sample as hdf5 dataset in hierarchical structure -> outer group by means, inner group by stds, then separate dataset for separate corr samples within that
                    f_write[f'means{i}/stds{j}/corr{k}'] = samps
                    ### store correlation, stds, and means as metadata attributes at their appropriate group/dataset level
                    f_write[f'means{i}/stds{j}/corr{k}'].attrs['corr'] = [corr]
                f_write[f'means{i}/stds{j}'].attrs['stds'] = [stds[0], stds[1]]
            f_write[f'means{i}'].attrs['means'] = [means[0], means[1]]

    ### print run time
    dt = time.perf_counter() - t0
    print(f'HDF5 write: {dt:0.4f} seconds')
    sys.stdout.flush()

After writing, I also calculate the time needed to read each array back in, calculate its means, standard deviations, and correlation, and compare these to the original parameter values stored as attributes. Finally, I record the total disk storage size for each dataset before deleting it.

    ### now loop back over and read hdf5 to compare actual to sampled params for each set
    t0 = time.perf_counter()
    params_actual, params_samp = {}, {}

    ### Connect to previously written HDF5 file, in read mode
    with h5py.File(f'{data_dir}/data_hdf5/data.hdf5', 'r') as f_read:
        ### loop over previously sampled means, stds, & corrs
        for i in range(nparam):
            params_actual[f'means{i}'], params_samp[f'means{i}'] = {}, {}
            for j in range(nparam):
                params_actual[f'means{i}'][f'stds{j}'], params_samp[f'means{i}'][f'stds{j}'] = {}, {}
                for k in range(nparam):
                    params_actual[f'means{i}'][f'stds{j}'][f'corr{k}'], params_samp[f'means{i}'][f'stds{j}'][f'corr{k}'] = {}, {}
                    ### retrieve actual means, stds, & corr params that we used to sample -> stored as metadata attributes
                    params_actual[f'means{i}'][f'stds{j}'][f'corr{k}']['means'] = f_read[f'means{i}'].attrs['means']
                    params_actual[f'means{i}'][f'stds{j}'][f'corr{k}']['stds'] = f_read[f'means{i}/stds{j}'].attrs['stds']
                    params_actual[f'means{i}'][f'stds{j}'][f'corr{k}']['corr'] = f_read[f'means{i}/stds{j}/corr{k}'].attrs['corr']
                    ### Calculate means, stds, & corrs directly from sampled data arrays
                    params_samp[f'means{i}'][f'stds{j}'][f'corr{k}']['means'] = f_read[f'means{i}/stds{j}/corr{k}'][...].mean(axis=0)
                    params_samp[f'means{i}'][f'stds{j}'][f'corr{k}']['stds'] = f_read[f'means{i}/stds{j}/corr{k}'][...].std(axis=0)
                    params_samp[f'means{i}'][f'stds{j}'][f'corr{k}']['corr'] = np.corrcoef(f_read[f'means{i}/stds{j}/corr{k}'][:, 0], f_read[f'means{i}/stds{j}/corr{k}'][:, 1])[0, 1]

    ### print run time
    dt = time.perf_counter() - t0
    print(f'HDF5 read: {dt:0.4f} seconds')
    sys.stdout.flush()

    ### print directory size and remove
    subprocess.run(['du', '-h', f'{data_dir}/data_hdf5/'])
    print('')
    shutil.rmtree(f'{data_dir}/data_hdf5/')

Results

The results of the experiment can be found in Table 1 and Figure 4. In general, the HDF5 method significantly outperforms the CSV method on write speed, read speed, and storage size. The best performance in terms of timing occurs for the largest array size of 20,000,000 elements, where the write and read times for HDF5 are only 2.67% and 3.66% of the CSV times, respectively. Timing benefits are found to decay for array sizes under 10,000. At 200 elements or fewer, HDF5 actually performs similarly to CSV on writing and worse than CSV on reading. However, in general most large-scale scientific computing workflows that would consider using HDF5 will have larger array sizes and thus should benefit significantly from the 20x+ speedup seen for large file sizes.

In terms of compression, the best results are seen for the small array sizes. For example, the HDF5 file for the 20-element array case is only 13.18% of the size of the corresponding CSV collection. However, significant compression benefits are seen throughout the experiment, with approximately 31% relative storage size for all datasets with array sizes between 2,000 and 20,000,000 elements.

| Total number of samples | Samples per dataset | Number of datasets | CSV write time (s) | HDF5 write time (s) | Relative write time (%) | CSV read time (s) | HDF5 read time (s) | Relative read time (%) | CSV storage size (MB) | HDF5 storage size (MB) | Relative storage size (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 200,000 | 200,000 | 1 | 0.24 | 0.01 | 3.31 | 0.18 | 0.01 | 8.13 | 4.9 | 1.6 | 32.65 |
| 200,000 | 200 | 1,000 | 0.48 | 0.29 | 61.23 | 0.36 | 0.57 | 157.52 | 7.9 | 2 | 25.32 |
| 2,000,000 | 2,000,000 | 1 | 1.93 | 0.06 | 2.94 | 2.08 | 0.11 | 5.20 | 49 | 16 | 32.65 |
| 2,000,000 | 2,000 | 1,000 | 2.07 | 0.34 | 16.38 | 1.80 | 0.64 | 35.48 | 51 | 16 | 31.37 |
| 20,000,000 | 20,000,000 | 1 | 17.15 | 0.46 | 2.67 | 17.61 | 0.65 | 3.66 | 480 | 153 | 31.88 |
| 20,000,000 | 20,000 | 1,000 | 18.89 | 0.72 | 3.81 | 16.59 | 1.19 | 7.19 | 489 | 154 | 31.49 |
| 20,000,000 | 20 | 1,000,000 | 318.83 | 277.27 | 86.97 | 205.07 | 557.91 | 272.06 | 3,900 | 514 | 13.18 |
| 60,000,000 | 60,000 | 1,000 | 54.03 | 1.63 | 3.02 | 51.27 | 2.60 | 5.06 | 1,500 | 459 | 30.6 |
| 60,000,000 | 60 | 1,000,000 | 352.47 | 311.93 | 88.50 | 243.45 | 593.66 | 243.85 | 3,900 | 825 | 21.15 |
Table 1: Results from testing CSV vs HDF5 on write speed, compression, and read speed
Figure 4: Scaling of HDF5 performance (relative to CSV) as a function of the number of samples per dataset

Next steps

This scaling experiment demonstrates the significant advantages in write speed, read speed, and storage compression that are enabled by HDF5 and h5py. HDF5 also has significant organizational benefits due to the simplicity of storing complex heterogeneous data arrays and associated metadata within a hierarchical format.

In this blog post, I have only covered the basics of HDF5 and h5py. Additional benefits are possible by leveraging HDF5/h5py’s data slicing and chunking capabilities, which make it possible to efficiently access and manipulate small subsets of very large datasets that exceed available RAM. Some users may also be interested in PyTables, a Python package that provides additional flexibility for manipulating and querying complex heterogeneous HDF5 datasets.
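
To give a flavor of what that looks like (a hedged sketch with made-up file and dataset names, not code from the experiment above), a chunked dataset can be created empty and then written and read one piece at a time, so the full array never needs to be held in memory:

import h5py
import numpy as np

n_rows, n_cols = 10_000_000, 10
batch = 100_000

with h5py.File('larger_than_ram.hdf5', 'w') as f:
    ### create an empty, chunked dataset without allocating the full array in memory
    dset = f.create_dataset('big_results', shape=(n_rows, n_cols), dtype='f8',
                            chunks=(batch, n_cols))
    ### write the data in batches, e.g. as each block of simulations finishes
    for start in range(0, n_rows, batch):
        dset[start:start+batch, :] = np.random.random((batch, n_cols))

with h5py.File('larger_than_ram.hdf5', 'r') as f:
    ### read back only the rows we need
    print(f['big_results'][:5, :])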

Parameter estimation of Hidden Markov Models using a Markov Chain Monte Carlo method in Julia

I recently completed a final course project in which I attempted to implement a Hidden Markov Model (HMM) parameterized using the Markov Chain Monte Carlo (MCMC) method. In this post, I will briefly introduce HMMs as well as the MCMC method. I will then walk you through an exercise implementing a HMM in the Julia programming language and share some preliminary results from my exploration of combining these two methods. All the code for this exercise can be found in this GitHub repository.

Note: The exercise in this post will require some familiarity with Julia. For a quick introduction to this relatively recent newcomer to the programming languages scene, please refer to Peter Storm’s blog post here.

Hidden Markov Models

A Hidden Markov Model (HMM) is a statistical tool used to model time series data in fields ranging from signal processing, hydrology, and financial statistics to speech recognition and econometrics (Rabiner, 1990; Rydén, 2008), where the underlying process is assumed to be Markovian. Specifically, it has been found to be effective at modeling systems with long-term persistence, where past events affect current and future events at a multi-year scale (Akintug and Rasmussen, 2005; Bracken et al., 2014). Recent studies have also found that it is possible to use HMMs to model the dynamics of streamflow, where the HMM state space represents the different hydrological conditions (e.g. dry vs. wet) and the observation space represents the streamflow data (Hadjimichael et al., 2020; 2020a). Such synthetic streamflows can be generated computationally with the aid of packages such as hmmlearn in Python and HMMBase in Julia. An example implementation of hmmlearn can be found in Julie Quinn’s blog post here. She also wrote the prequel to the implementation post, where she dives into the background of HMMs and how to apply Expectation-Maximization (EM) to parameterize the model.

Markov Chain Monte Carlo

The EM approach to parameterizing HMMs reflects a frequentist perspective that assumes the existence of one “true” value for each HMM parameter. In this exercise, I will instead use a Bayesian approach via the Markov Chain Monte Carlo (MCMC) method, in which potential parameter values are generated and randomly sampled from Markov chains. A Markov chain is a statistical model of a Markovian time series, in which the probability of being in any given state s_t at the current time t depends only on the immediately preceding state. Using the MCMC approach, a Markov chain is constructed for each HMM parameter such that its stationary distribution (the distribution of the parameter values at equilibrium) is the model posterior (the likelihood of a model structure after, or a posteriori, obtaining observations of the system). Starting from the model priors (the assumed parameter distributions before having any knowledge about the system), parameter values are then repeatedly and randomly sampled from their respective chains (hence the term “Monte Carlo”). These samples are then used to estimate the statistical properties of the distribution of each parameter.
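
Written out explicitly (this is just the generic statement of Bayes' rule that underlies any such Bayesian approach, not something specific to HMMs), the posterior distribution targeted by each Markov chain is

p(\theta \mid x) \propto p(x \mid \theta) \, p(\theta)

where \theta denotes the HMM parameters, x denotes the observed data, p(\theta) is the prior, and p(x \mid \theta) is the likelihood. MCMC is useful precisely because the chain only needs this unnormalized product; the normalizing constant p(x) never has to be computed.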

Overall, MCMC is a powerful method to estimate the statistical properties of high-dimensional problems with many parameters. Nonetheless, it is beholden to its own set of limitations, which include high computational expense and the inability to guarantee convergence. Figure 1 below summarizes some of the benefits and drawbacks of using MCMC parameterization:

Figure 1: The pros and cons of using MCMC parameterization.

Setting up the parameters

In this exercise, I will be using the HMM to approximate the statistical properties of the streamflow data found in the annualQ.csv file, previously used by Rohini in her post on creating HMMs to generate synthetic streamflow series and by the MSD UC Ebook HMM training Jupyter Notebook. This streamflow time series originates from a USGS outlet gauge in the Upper Colorado River Basin and records 105 years of annual streamflow from 1909 to 2013 (inclusive). It was previously found that this record is best described using two states (wet and dry; Bracken et al., 2014). In this exercise, I also assume that the time series is best approximated by a first-order HMM, in which the state at the current time t (s_t) is affected only by the state at the timestep immediately prior to it (s_{t-1}).

In this exercise, I will be using MCMC to parameterize the following HMM parameters:

| Parameter | Description |
| --- | --- |
| a | The initial state distribution for the wet (1) and dry (2) states |
| A | The transition matrix, e.g. from a wet (1) to a dry (2) state |
| μ_i, σ_i | The probability distributions for the vector of observations x in a given state, where wet: 1 and dry: 2 |
Table 1: Summary of all HMM parameters to be adjusted using the MCMC approach.

From Table 1 above (these quantities are combined into the full model written out below the list):

  • P(a_i) is the initial state probability for state i
  • P(A_ij) is the transition probability from state i to state j
  • μ_i and σ_i are the mean and standard deviation for the vector of streamflow observations x for a given state i
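
Putting these pieces together (written out here only for clarity; this is the standard factorization of a first-order HMM and matches the model coded below), the joint distribution of the hidden states s_{1:T} and the log-streamflow observations x_{1:T} over the T = 105 years is

p(s_{1:T}, x_{1:T}) = P(a_{s_1}) \prod_{t=2}^{T} P(A_{s_{t-1} s_t}) \prod_{t=1}^{T} \mathcal{N}(x_t \mid \mu_{s_t}, \sigma_{s_t})

where \mathcal{N}(\cdot \mid \mu_i, \sigma_i) is a normal density with mean \mu_i and standard deviation \sigma_i.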

Now that the dataset and model parameters have been introduced, we will proceed with demonstrating the code required to implement the HMM model with MCMC parameterization.

Implementing the streamflow HMM with MCMC parameterization

Before we begin, I recommend downloading and installing Julia 1.8.5. Please find the Windows and Mac 64-bit installers here. Once Julia is set up, we can download and install all required packages. In Julia, this can be done as follows:

# make sure all required packages are installed
import Pkg
Pkg.activate(@__DIR__);
Pkg.instantiate();

Once this is done, we can load all the required packages. In Julia, this is done with the using keyword, which can be thought of as the parallel of Python's import statement. Note that I placed a semicolon after the using CSV line. This suppresses the output messages that Julia would otherwise display when loading packages.

# load all packages
using Random
using CSVFiles # load CSV data
using DataFrames # data storage and presentation
using Plots # plotting library
using StatsPlots # statistical plotting
using Distributions # statistical distribution interface
using Turing # probabilistic programming and MCMC
using Optim # optimization library
using StatsBase
using LaTeXStrings
using HMMBase
using DelimitedFiles
using CSV;

Now, let’s load the data and identify the number of years and number of states inherent to the streamflow time series.

# load data
# source: https://github.com/IMMM-SFA/msd_uncertainty_ebook/tree/main/notebooks/data
Q = CSV.File("annualQ.csv", header=[:Inflow]) |> DataFrame
Q = Q[2:end,:]
Q[!,:Inflow] = parse.(Float64, Q[!, :Inflow])
Q[!, :Year] = 1909:2013;
Q[!, :logQ] = log.(Q[!, :Inflow]);

# Assume first-order HMM where only the immediate prior step influences the next step
n_states = 2
logQ = Q[!,:logQ];
n_years = size(logQ)[1];

Next, we should set a random seed to ensure that the following results are reproducible. This is necessary because the “Distributions” package used in this exercise performs random sampling. Without setting the seed (shown below), the outcomes of the exercise may be wildly different each time the MCMC parameterization algorithm is run.

# Set the random seed
Random.seed!(1)

Next, we define the HMM model with MCMC sampling. The sampling is performed using methods and syntax built into the Turing.jl package. In the code below, also note the if/elseif block inside the second time loop at the end of the function. Within that block, the observations' states are enforced: observations with log-streamflows below the input mean are pushed toward the dry state, while observations with log-streamflows above the mean are pushed toward the wet state. This step is important to prevent mode collapse, where the HMM cannot distinguish between the wet and dry state distributions due to the random sampling of the MCMC parameterization.

@model function HMM_mcmc(logQ, n_states, n_years)
    # Source: https://chat.openai.com/c/d81ef577-9a2f-40b2-be9d-c19bf89f9980
    # Prompt example: Please suggest how you would parameterize Hidden Markov Model with the Markov Chain Monte Carlo method 
    # Uses the Turing.jl package in Julia. The input to this model will be an inflow timeseries with two hidden states.
    mean_logQ = mean(logQ)

    # Define the prior distributions for the parameters
    μ₁ ~ Normal(15.7, 0.25)   # wet state   
    μ₂ ~ Normal(15.2, 0.25)   # dry state
    σ₁ ~ truncated(Normal(0.24, 0.1), 0, Inf)
    σ₂ ~ truncated(Normal(0.24, 0.1), 0, Inf)
    a11 ~ Dirichlet(ones(n_states))
    A ~ filldist(Dirichlet(ones(n_states)), n_states)
    s = tzeros(Int, n_years)   # Define the vector of hidden states

    # Define initial state probability distribution
    a = [a11[1], 1-a11[1]]   # [wet state, dry state]

    # Define the observation distributions 
    B = [Normal(μ₁, σ₁), Normal(μ₂, σ₂)]

    # Sample the initial hidden state and observation variables
    s[1] ~ Categorical(a)

    # Loop over time steps and sample the hidden+observed variables 
    for t in 2:n_years
        s[t] ~ Categorical(A[:,s[t-1]])
    end
    for t in 1:n_years
        logQ[t] ~ B[s[t]]

        # if the logQ is greater than the input mean, force the state to be a wet state
        if logQ[t] > mean_logQ
            s[t] ~ Categorical(A[:,1])
        # if the logQ is smaller than the input mean, force the state to be a dry state
        elseif logQ[t] < mean_logQ
            s[t] ~ Categorical(A[:,2])
        end

    end
end

We then have to run the model for at least 3,000 iterations before the sample values for each parameter converge to stable values. The code to perform this is as follows:

# run for 3000 iterations; usually enough
# run using >3 chains to double check convergence if desired
hmm_model = HMM_mcmc(logQ, n_states, n_years)
g = Gibbs(HMC(0.05, 10, :μ₁, :μ₂, :σ₁, :σ₂, :a11, :A), PG(10, :s))
chains = sample(hmm_model, g, MCMCThreads(), 3000, 1, drop_warmup=true)

To check for convergence, you can run plot(chains) to inspect the trace and posterior of each parameter, and corner(chains) to examine the parameters for dependencies. In this post, we will not discuss MCMC convergence in detail, but some things to look out for are r_hat values close to 1 in the output of the code shown above, as well as multiple chains that are similar to each other. Refer to Table 2 and Figure 2 for some quick statistics that can be used as a baseline to determine whether your model has converged.

Table 2: The mean converged values for some of the HMM parameters.
Figure 2: The mean converged values of the transition matrix components.

Now, let’s use the model to generate 1,000 samples of streamflow. First, we define the streamflow prediction function:

function predict_inflow(chain, logQ)
    μ₁ = Array(group(chain, :μ₁))
    μ₂ = Array(group(chain, :μ₂))
    σ₁ = Array(group(chain, :σ₁))
    σ₂ = Array(group(chain, :σ₂))
    a11 = Array(group(chain, "a11[1]"))
    a22 = Array(group(chain, "a11[2]"))
    A11 = Array(group(chain, "A[1,1]"))
    A12 = Array(group(chain, "A[1,2]"))
    A21 = Array(group(chain, "A[2,1]"))
    A22 = Array(group(chain, "A[2,2]"))

    n_samples = 1000
    
    μ1_sample = sample(μ₁, n_samples, replace=true);
    μ2_sample = sample(μ₂, n_samples, replace=true);
    σ1_sample = sample(σ₁, n_samples, replace=true);
    σ2_sample = sample(σ₂, n_samples, replace=true);
    a11_sample = sample(a11, n_samples, replace=true);
    a22_sample = sample(a22, n_samples, replace=true);
    A11_sample = sample(A11, n_samples, replace=true);
    A12_sample = sample(A12, n_samples, replace=true);
    A21_sample = sample(A21, n_samples, replace=true);
    A22_sample = sample(A22, n_samples, replace=true);
    
    Q_predict = zeros(length(logQ), n_samples)    
    #residuals_ar = zeros(length(temp_data), n_samples)
    
    s_matrix = zeros(length(logQ), n_samples)
    logQ_matrix = zeros(length(logQ), n_samples)
    meanQ = mean(logQ)
    
    for n = 1:n_samples 
        s_sample = tzeros(Int, n_years)
        Q_sample = zeros(n_years)

        μ1 = μ1_sample[n]
        μ2 = μ2_sample[n]
        σ1 = σ1_sample[n]
        σ2 = σ2_sample[n]
        a11 = a11_sample[n]
        a22 = a22_sample[n]
        A11 = A11_sample[n]
        A21 = 1 - A11_sample[n]
        A12 = 1 - A22_sample[n]
        A22 = A22_sample[n]
        #println(A)
        A = [A11 A12; A21 A22]
        A = transpose(A)
        #println(A)
        #a =  [1-a22, a22]
        a = [a11, 1-a11]
        print(a)
        B = [Normal(μ1, σ1), Normal(μ2, σ2)]

        # Sample the initial hidden state and observation variables
        s_sample[1] = rand(Categorical(a))

        # Loop over time steps and sample the hidden+observed variables 
        for t in 2:n_years
            s_sample[t] = rand(Categorical(A[s_sample[t-1], :]))
        end
        for t in 1:n_years
            Q_sample[t] = rand(B[s_sample[t]])
            if Q_sample[t] < meanQ
                s_sample[t] = 1
            elseif Q_sample[t] > meanQ
                s_sample[t] = 2
            end
        end

        s_matrix[:, n] = s_sample
        Q_predict[:, n] = Q_sample
        
    end

    return s_matrix, Q_predict

end

From this function, we obtain the state and log-inflow matrix with dimensions 105 × 1000 each.

s_matrix, logQ_predict = predict_inflow(chains, logQ) 

We select one of the 1,000 timeseries output by the prediction function (here, index 7) and plot the wet and dry states identified:

# create the dry and wet masks
q = Q[!,:Inflow]
y = Q[!,:Year]

# select a random sample timeseries to plot
idx = 7
Q_wet_mcmc = q[s_matrix[:,idx] .== 1]
Q_dry_mcmc = q[s_matrix[:,idx] .== 2]

y_wet_mcmc = y[s_matrix[:,idx] .== 1]
y_dry_mcmc = y[s_matrix[:,idx] .== 2];

# plot the figure
plot(y, Q.Inflow, c=:grey, label="Inflow", xlabel="Year", ylabel="Annual inflow (cubic ft/yr)")
scatter!(y_dry_mcmc, Q_dry_mcmc, c=:salmon, label="Dry state (MCMC)", xlabel="Year", ylabel="Annual inflow (cubic ft/yr)")
scatter!(y_wet_mcmc, Q_wet_mcmc, c=:lightblue, label="Wet state (MCMC)", xlabel="Year", ylabel="Annual inflow (cubic ft/yr)")

The output figure should look similar to this:

Figure 3: Wet (pink) and dry (blue) states identified within the streamflow timeseries.

From this figure, we can see that the HMM can identify streamflow extremes as wet and dry states (albeit with some inconsistencies), where historically high streamflows are identified as a “wet state” and low streamflows are correspondingly identified as “dry”.

To make sure that the model can distinguish the observation distributions for wet and dry states, we plot their probability distribution functions. Before we plot anything, though, let’s define a function that will help us plot the distributions themselves. For this function, we first need to load the LinearAlgebra package.

# load the LinearAlgebra package 
using LinearAlgebra

# function to plot the distribution of the wet and dry states 
function plot_dist(Q, μ, σ, A)
    evals, evecs = eigen(A).values, eigen(A).vectors
    one_eval = argmin(abs.(evals.-1))
    π = evecs[:, one_eval] / sum(evecs[:, one_eval])

    x_0 = LinRange(μ[1] - 4*σ[1], μ[1] + 4*σ[1], 10000)
    norm0 = Normal(μ[1], σ[1])
    fx_0 = π[1].*pdf.(norm0, x_0)
 
    x_1 = LinRange(μ[2] - 4*σ[2], μ[2] + 4*σ[2], 10000)
    norm1 = Normal(μ[2], σ[2])
    fx_1 = π[2].*pdf.(norm1, x_1)
 
    x = LinRange(μ[1] - 4*σ[1], μ[2] + 4*σ[2], 10000)
    fx = π[1].*pdf.(norm0, x) .+ π[2].*pdf.(norm1, x)
    
    #fig, ax = plt.subplots(1, 1, figsize=(12, 8))

    b = range(μ[1] - 4*σ[1], μ[2] + 4*σ[2], length=100)
    
    histogram(log.(Q), c=:lightgrey, normalize=:pdf, label="Log-Inflow")
    plot!(x_0, fx_0, c=:red, linewidth=3, label="Dry State Distr", xlabel="x", ylabel="P(x)")
    plot!(x_1, fx_1, c=:blue, linewidth=2, label="Wet State Distr")
    plot!(x, fx, c=:black, linewidth=2, label="Combined State Distr")
end

Great! Now let’s use this function to plot the distributions themselves.

# get the model parameter value means
μ₁_mean = mean(chains, :μ₁)
μ₂_mean = mean(chains, :μ₂)
σ₁_mean = mean(chains, :σ₁)
σ₂_mean = mean(chains, :σ₂)
a11_mean = mean(chains, "a11[2]")
A11_mean = mean(chains, "A[1,1]")
A12_mean = mean(chains, "A[1,2]")
A21_mean = mean(chains, "A[2,1]")
A22_mean = mean(chains, "A[2,2]")

# obtained from model above
μ = [μ₂_mean, μ₁_mean]
σ = [σ₂_mean, σ₁_mean]
A_mcmc = [A11_mean 1-A11_mean; 1-A22_mean A22_mean]

# plot the distributions
plot_dist(Q[!,:Inflow], μ, σ, A_mcmc)

This code block will output the following figure:

Figure 4: The overall (black), wet (blue) and dry (red) streamflow probability distributions.

From Figure 4, we can observe that the dry state streamflow distribution is distinct from that of the wet state. The HMM correctly estimates a lower mean for the dry state and a higher one for the wet state. Based on visual analysis, the states have similar standard deviations, but this is likely an artifact of the prior parameterization of the initial HMM model that we set up earlier. The step we took earlier to force the wet/dry state distributions given a sampled log-streamflow value also likely prevented mode collapse and enabled the identification of two distinct distributions.

Now, let’s see if our HMM can be used to generate synthetic streamflow. We do this using the following lines of code:

# plot the FDC of the sampled inflow around the historical 
sampled_obs_mcmc = zeros(1000, n_years)
Q_predict_mcmc = transpose(logQ_predict)
for i in 1:1000
    sampled_obs_mcmc[i,:] = exp.(Q_predict_mcmc[i, :])
    sampled_obs_mcmc[i,:] = sort!(sampled_obs_mcmc[i, :], rev=true)
end

hist_obs = copy(Q[!,:Inflow])
hist_obs = sort!(hist_obs, rev=true)
probs = (1:n_years)/n_years

plot(probs, hist_obs, c=:grey, linewidth=2, label="Historical inflow",
    xlabel="Exceedance prob.", ylabel="Annual inflow (cubic ft/yr)")
for i in 1:1000
    if i == 1
        plot!(probs, sampled_obs_mcmc[i,:], c=:lightgreen, linewidth=2,
            label="Sampled inflow (MCMC)", xlabel="Exceedance prob.", ylabel="Annual inflow (cubic ft/yr)")
    end

    plot!(probs, sampled_obs_mcmc[i,:], c=:lightgreen, linewidth=2, legend=false,
        xlabel="Exceedance prob.", ylabel="Annual inflow (cubic ft/yr)")
end

plot!(probs, hist_obs, c=:grey, linewidth=2, legend=false,
    xlabel="Exceedance prob.", ylabel="Annual inflow (cubic ft/yr)")

Once implemented, the following figure should be output:

Figure 5: The original historical streamflow timeseries (grey) and the synthetic streamflow generated by the HMM (green).

This demonstrates that the HMM model is capable of generating synthetic streamflow that approximates the statistical properties of the historical timeseries while expanding the range of uncertainty in streamflow scenarios. This has implications for applications such as exploratory modeling, where such approaches can be used to generate wetter- or drier-than-historical streamflows.

Summary

While using MCMC to parameterize a HMM is a useful method to generate synthetic streamflow, one significant drawback is the long runtime and the large number of iterations that the MCMC algorithm takes to converge (Table 3).

Table 3: MCMC parameterization and runtime.

This is likely due to the HMM sampling setup. Notice that in the HMM_mcmc function, every hidden state is itself sampled (s[t] ~ Categorical(...), both in the transition loop and in the state-forcing block at the end of the function), essentially turning each s_t into a tunable parameter of its own. Since there are 105 elements in the streamflow timeseries, this introduces an additional 105 parameters for the MCMC algorithm to fit. This is the likely cause of the lengthy runtime. While we will not explore alternative HMM-MCMC parameterization approaches here, more efficient alternatives definitely exist (e.g. Turek et al., 2016), and you are highly encouraged to check them out!

Overall, we began this post with a quick introduction to HMM and MCMC sampling. We then compared the pros and cons of using the MCMC approach to parameterize a HMM. Once we understood its benefits and drawbacks, we formulated a HMM for a 105-year long timeseries of annual streamflow and examined the performance of the model by analyzing its converged parameter values, observation probability distributions, as well as its ability to detect wet/dry states in the historical time series. We also used the HMM to generate 1,000 synthetic streamflow scenarios. Finally, we diagnosed the cause for the long time to convergence of the MCMC sampling approach.

This brings us to the end of the post! I hope you now have a slightly better understanding of HMMs, how they can be parameterized using the MCMC approach, as well as issues you might run into when you implement the latter.

Happy coding!

References

Akintug, B., & Rasmussen, P. F. (2005). A Markov switching model for annual hydrologic time series. Water Resources Research, 41(9). https://doi.org/10.1029/2004wr003605

Bracken, C., Rajagopalan, B., & Zagona, E. (2014). A hidden markov model combined with climate indices for Multidecadal Streamflow Simulation. Water Resources Research, 50(10), 7836–7846. https://doi.org/10.1002/2014wr015567

Hadjimichael, A., Quinn, J., & Reed, P. (2020). Advancing diagnostic model evaluation to better understand water shortage mechanisms in institutionally complex river basins. Water Resources Research, 56(10), e2020WR028079.

Hadjimichael, A., Quinn, J., Wilson, E., Reed, P., Basdekas, L., Yates, D., & Garrison, M. (2020a). Defining robustness, vulnerabilities, and consequential scenarios for diverse stakeholder interests in institutionally complex river basins. Earth’s Future, 8(7), e2020EF001503.

Reed, P.M., Hadjimichael, A., Malek, K., Karimi, T., Vernon, C.R., Srikrishnan, V., Gupta, R.S., Gold, D.F., Lee, B., Keller, K., Thurber, T.B, & Rice, J.S. (2022). Addressing Uncertainty in Multisector Dynamics Research [Book]. Zenodo. https://doi.org/10.5281/zenodo.6110623

Rabiner, L. R. (1990). A tutorial on hidden Markov models and selected applications in speech recognition. Readings in Speech Recognition, 267–296. https://doi.org/10.1016/b978-0-08-051584-7.50027-9

Rydén, T. (2008). EM versus Markov chain Monte Carlo for estimation of hidden Markov models: A computational perspective. Bayesian Analysis, 3(4). https://doi.org/10.1214/08-ba326

Turek, D., de Valpine, P. and Paciorek, C.J. (2016) Efficient Markov Chain Monte Carlo Sampling for Hierarchical Hidden Markov Models, arXiv.org. Available at: https://arxiv.org/abs/1601.02698.

Highlighting the Excellence in Systems Analysis Teaching and Innovative Communication (ECSTATIC) Repository

*This post was written in collaboration with Dr. Chris Chini (Air Force Institute of Technology) and Dr. David Rosenberg (Utah State University)

The EWRI Congress and EWRS committee meeting that a good majority of us in water resources engineering attend every year serves as a reminder to me that there exist some really wonderful free resources to aid in teaching and communicating in our field. Most of us in EWRS are familiar with the ECSTATIC repository, but I wanted to use our blog as a forum to formally discuss the repository and provide guidance on how to contribute.

Background  

David: In 2013, several of us among the 100+ members of the Environmental Water Resources Systems (EWRS) committee saw the need to develop, share, and disseminate excellent and innovative methods to teach water systems content. There were organized channels such as journals and conferences to communicate research. Individual committee members were mostly using teaching resources they developed themselves. In most cases, those materials were not accessible to other faculty, students, or lifelong learners. We wanted a channel – the ECSTATIC repository – to share excellent teaching videos, audio clips, lecture materials, readings, problems, games, code, etc.

A second purpose for the task committee was to submit a peer-reviewed article that reviewed the state of systems analysis education, discussed what a systems analysis curriculum should include, and assessed the impact of the interactive website. We ended up authoring and publishing two peer-reviewed articles. The second article took data from interviews we did with practitioners in the field. The notes/videos from the interviews were also posted to the repository!

What is in the repository?

The repository can be found here: https://digitalcommons.usu.edu/ecstatic/

The teaching repository hosts a multitude of collections, including course syllabi, games, problems, lecture materials, and code/models.

Some of the most popular materials on ECSTATIC include lecture notes on multi-objective optimization and nonlinear optimization from Dr. Lina Sela, an Associate Professor at the University of Texas at Austin. Additionally, Dr. Joseph Kasprzyk, Associate Professor at the University of Colorado Boulder, has lectures on Introduction to Water Resources Systems and Introduction to Reservoir Analysis. Other submissions include GAMS examples (https://digitalcommons.usu.edu/ecstatic_all/63/), case studies for a guided project on stochastic hydrology (https://digitalcommons.usu.edu/ecstatic_all/57/), and syllabi for courses on environmental systems (https://digitalcommons.usu.edu/ecstatic_all/19/) and water technology & policy (https://digitalcommons.usu.edu/ecstatic_all/66/).

Example Slide Deck Content

There are also free books, video lectures (including a series on Managing Infrastructure with Deep Uncertainty Discussions), and games that can be used to supplement lectures.

Video Lectures from the Managing Infrastructure with Deep Uncertainty Discussions Series

Who can contribute?

Please submit new material at: https://digitalcommons.usu.edu/cgi/ir_submit.cgi?context=ecstatic_all

New submissions are always welcome and are accepted in whatever format suits the material. Submissions can be made by professors, PhD students, practitioners, or any person who feels they have educational materials to share to benefit the Water Systems community. For any questions on the repository, its content, or how to submit an article please reach out to either Nathan Bonham (nathan.bonham@colorado.edu) or Christopher Chini (christopher.m.chini@gmail.com).

David: Submissions on any topic related to water resources would be great, but the ECSTATIC task committee is particularly interested in entire courses from colleagues who may be retired or who are retiring soon. Furthermore, the committee would like to build up the video content in the repository. During the pandemic, many seminars were conducted online and recorded through Zoom. Provided the speakers are fine with having their seminars online, these videos are the best opportunity for people using the site to get context around the methods being used and the problems being solved. In short: content that you can’t get from a textbook or a syllabus.

What is the peer review process like?

New submissions have to pass through a review process. When a new work is submitted via Utah State University’s Digital Commons, a volunteer Chief Editor assigns one or more peer reviewers who are solicited from the EWRS community.

Reviewers use 4 criteria to evaluate submissions:

  • Is the submission topical to ECSTATIC (water resources)?
  • Are the materials organized so someone can download and use them?
  • What are the strengths of the submission?
  • What changes, if any, are needed to post on ECSTATIC?

The entire submission, review, and publishing process occurs through the content management system.

What is the reach of ECSTATIC?

ECSTATIC has grown – and continues to grow – to a global audience and usership that surpasses the EWRS community.

ECSTATIC Repository Reach

Since its inception, there have been 26,900 total downloads of material across many different countries (as shown in the map above) and types of institutions (academic, government, etc.).

Downloads over time

The ECSTATIC repo is a great resource, not only to preserve and share knowledge within our community but also to create a platform through which those outside our community can become integrated into it. Please consider sharing your resources and helping the ECSTATIC repository grow!