Generating synthetic timeseries while preserving historic correlation: A case study for reservoir inflow and water demand

Motivation

Robust water resource planning and management decisions rely upon the evaluation of alternative policies under a wide variety of plausible future scenarios. Often, despite the availability of historic records, non-stationarity and system uncertainty require the generation of synthetic datasets to be used in these analyses.

When creating synthetic timeseries data from historic records, it is important to replicate the statistical properties of the system while preserving the inherent stochasticity in the system. Along with replicating statistical autocorrelation, means, and variances it is important to replicate the correlation between variables present in the historic record.

Previous studies by Zeff et al. (2016) and Gold et al. (2022) have relied upon synthetic streamflow and water demand timeseries to inform infrastructure planning and management decisions in the “Research Triangle” region of North Carolina. The methods used for generating the synthetic streamflow data emphasized the preservation of autocorrelation, seasonal correlation, and cross-site correlation of the inflows. However, a comprehensive investigation into the preservation of correlation in the generated synthetic data has not been performed.

Given the critical influence of both reservoir inflow and water demand in the success of water resource decisions, it is important that potential interactions between these timeseries are not ignored.

In this post, I present methods for producing synthetic demand timeseries conditional upon synthetic streamflow data. I also present an analysis of the correlation in both the historic and synthetic timeseries.

A GitHub repository containing all of the necessary code and data can be accessed here.

Case Study: Reservoir Inflow and Water Demand

This post studies the correlation between reservoir inflow and water demand at one site in the Research Triangle region of North Carolina, and assesses the preservation of this correlation in synthetic timeseries generated using two different methods: an empirical joint probability distribution sampling scheme, and a conditional expectation sampling scheme.

Methods

Synthetic data was generated using historic reservoir inflow and water demand data from a shared 18-year period, at weekly timesteps. Demand data is reported as the unit water demand, in order to remove the influence of growing population demands. Unit water demand corresponds to the fraction of the average annual water demand observed in that week; i.e., a unit water demand of 1.2 suggests that water demand was 120% of the annual average during that week. Working with unit demand allows for the synthetic data to be scaled according to projected changes in water demand for a site.

Notably, all of the synthetic generation techniques presented below are performed using weekly-standardized inflow and demand data. This is necessary to remove the seasonality in both variables. If not standardized, measurement of the correlation will be dominated by this seasonal correlation. Measurement of the correlation between the standardized data thus accounts for shared deviances from the seasonal mean in both data. In each case, historic seasonality, as described by the weekly means and variances, is re-applied to the standardized synthetic data after it is generated.

Synthetic Streamflow Generation

Synthetic inflow was generated using the modified Fractional Gaussian Noise (mFGN) method described by Kirsch et al. (2013). The mFGN method is specifically intended to preserve both seasonal correlation, intra-annual autocorrelation, and inter-annual autocorrelation. The primary modification of the mFGN compared to the traditional Fractional Gaussian Noise method is a matrix manipulation technique which allows for the generation of longer timeseries, whereas the traditional technique was limited to timeseries of roughly 100-time steps (McLeod and Hipel, 1978; Kirsch et al., 2013).

Professor Julie Quinn wrote a wonderful blog post describing the mFGN synthetic streamflow generator in her 2017 post, Open Source Streamflow Generator Part 1: Synthetic Generation. For the sake of limiting redundancy on this blog, I will omit the details of the streamflow generation in this post, and refer you to the linked post above. My own version of the mFGN synthetic generator is included in the repository for this post, and can be found here.

Synthetic Demand Generation

Synthetic demand data is generated after the synthetic streamflow and is conditional upon the corresponding weekly synthetic streamflow. Here, two alternative synthetic demand generation methods are considered:

  1. An empirical joint probability distribution sampling method
  2. A conditional expectation sampling method

Joint Probability Distribution Sampling Method

The first method relies upon the construction of an empirical joint inflow-demand probability density function (PDF) using historic data. The synthetic streamflow is then used to perform a conditional sampling of demand from the PDF.

The joint PDF is constructed using the weekly standardized demand and weekly standardized log-inflow. Historic values are then assigned to one of sixteen bins within each inflow or demand PDF, ranging from -4.0 to 4.0 at 0.5 increments. The result is a 16 by 16 matrix joint PDF. A joint cumulative density function (CDF) is then generated from the PDF.

For some synthetic inflow timeseries, the synthetic log-inflow is standardized using historic inflow mean and standard deviations. The corresponding inflow-bin from the marginal inflow PDF is identified. A random number is randomly selected from a uniform distribution ranging from zero to the number of observations in that inflow-bin. The demand-CDF bin number corresponding to the value of the random sample is identified. The variance of the demand value is then determined to be the value corresponding to that bin along the discretized PDF range, from -4.0 to 4.0. Additionally, some statistical noise is added to the sampled standard demand by taking a random sample from a normal distribution, N(0, 0.5).

Admittedly, this process is difficult to translate into words. With that in mind, I recommend the curious reader take a look at the procedure in the code included in the repository.

Lastly, for each synthetic standard demand, d_{s_{i,j}}, the historic weekly demand mean, \mu_{D_j}, and standard deviation, \sigma_{D_j}, are applied to convert to a synthetic unit demand, D_{s_{i,j}}.

D_{s_{i,j}} = d_{s_{i,j}} \sigma_{D_j} + \mu_{D_j}

Additionally, the above process is season-specific: PDFs and CDFs are independently constructed for the irrigation and non-irrigation seasons. When sampling the synthetic demand, samples are drawn from the corresponding distribution according to the week in the synthetic timeseries.

Conditional Expectation Sampling Method

The second method does not rely upon an empirical joint PDF, but rather uses the correlation between standardized inflow and demand data to calculate demand expectation and variance conditional upon the corresponding synthetic streamflow and the correlation between historic observations. The conditional expectation of demand, E[D|Q_{s_i}], given a specific synthetic streamflow, Q_{s_i}, is:

E[D|Q_{s_i}] = E[D] + \rho \frac{\sigma_Q}{\sigma_D} (Q_i - \mu_Q)

Where \rho is the Pearson correlation coefficient of the weekly standardized historic inflow and demand data. Since the data is standardized, ( E[d] = 0 and \sigma_z = \sigma_d = 1) the above form of the equation simplifies to:

E[d|Z_{s_i}] = \rho (Z_{s_i})

Where d is standard synthetic demand and Z_{s_i} is the standard synthetic streamflow for the i^{th} week. The variance of the standard demand conditional upon the standard streamflow is then:

Var(d|Z_{s_i}) = \sigma_d^2(1 - \rho^2) = (1 - \rho^2)

The weekly standard demand, d_{s_i}, is then randomly sampled from a normal distribution centered around the conditional expectation with standard deviation equal to the square root of the conditional variance.

d_{s_i} \approx N(E[d|Z_{s_i}], Var(d|Z_{s_i})^{1/2})

As in the previous method, this method is performed according to whether the week is within the irrigation season or not. The correlation values used in the calculation of expected value and variance are calculated for both irrigated and non-irrigated seasons and applied respective of the week.

As in the first method, the standard synthetic demand is converted to a unit demand, and seasonality is reintroduced, using the weekly means and standard deviations of the historic demand:

D_{s_{i,j}} = d_{s_{i,j}} \sigma_{D_j} + \mu_{D_j}

Results

Historic Correlation Patterns

It is worthwhile to first consider the correlation pattern between stream inflow and demand in the historic record.

Figure 1: Correlation patterns between inflow and demand across weeks, for the historic record.

The correlation patterns between inflow and demand found in this analysis support the initial hypothesis that inflow and demand are correlated with one another. More specifically, there is a strong negative correlation between inflow and demand week to week (along the diagonal in the above figure). Contextually, this makes sense; low reservoir inflow correspond to generally dryer climatic conditions. When considering that agriculture accounts for a substantial contribution to demand in the region, it is understandable that demand will be high during dry periods, when are farmers require more reservoir supply to irrigate their crops. During wet periods, they depend less upon the reservoir supply.

Interestingly, there appears to be some type of lag-correlation, between variables across different weeks (dark coloring on the off-diagonals in the matrix). For example, there exists strong negative correlation between the inflow during week 15 with the demands in weeks 15, 16, 17 and 18. This may be indicative of persistence in climatic conditions which influence demand for several subsequent weeks.

Synthetic Streamflow Results

Figure 2: Comparison of the range of flow duration curves for the historic record (hist. = 81), and the synthetic streamflow data (nsyn = 50) generated using the mFGN method.

Consideration of the above flow duration curves reveal that the synthetic streamflow generated through the mFGN method exceedance probabilities are in close alignment with the historic record. While it should not be assumed that future hydrologic conditions will follow historic trends (Milly et al., 2008), the focus of this analysis is the replication of historic patterns. This result confirms previous studies by Mandelbrot and Wallis (1968) that the FGN method is capable of capturing flood and drought patterns from the historic record.

Synthetic Demand Results

Figure 3: Comparison of the ranges in unit demands in the historic record  (hist = 18) and the synthetic demand record (nsyn = 50) generated using both the joint PMF sampling method and the conditional expectation method.

The above figure shows a comparison of the ranges in unit demand data between historic and synthetic data sets. Like the synthetic streamflow data, these figures reveal that both demand generation techniques are producing timeseries that align closely with historic patterns. The joint probability sampling method does appear to produce consistently higher unit demands than the historic record, but this discrepancy is not significant enough to disregard the method, and may be corrected with some tweaking of the PDF-sampling scheme.

Synthetic Correlation Patterns

Now that we know both synthetic inflow and demand data resemble historic ranges, it is important to consider how correlation is replicated in those variables.

Figure 4: Correlation patterns between inflow and demand across weeks for both synthetic data generated using both the joint probability and conditional expectation methods.

Take a second to compare the historic correlation patterns in Figure 1 with the correlation in the synthetic data shown in Figure 4. The methods are working!

As in the historic data, the synthetic data contain strong negative correlations between inflow and demand week-to-week (along the diagonal).

Visualizing the joint distributions of the standardized data provides more insight into the correlation of the data. The Pearson correlation coefficients for each aggregated data set are shown in the upper right of each scatter plot, and in the table below.

Figure 5: Joint distributions for historic standardized inflow and demand using the full set of annual weekly observations (a) and irrigation-season data (b). Pearson correlation coefficients are included in the upper right of each plot.
Data TypeAnnual
Pearson Correlation
Irrigation Season
Pearson Correlation
Historic-0.29-0.47
Synthetic: Joint PDF Method-0.35-0.54
Synthetic: Conditional Expectation Method-0.38-0.48
Table 1: Pearson correlation coefficients for the historic data and the synthetic data sets, for the entire annual period and the irrigation seasons.

One concern with this result is that the correlation is actually too strong in the synthetic data. For both methods, the Pearson Correlation coefficient is greater in the synthetic data than it is in the historic data.

This may be due to the fact that correlation is highly variable throughout the year in the historic record, but the methods used here only separate the year into two seasons – non-irrigation and irrigation seasons. Aggregated across these seasons, the historic correlations are negative. However, there exist weeks (e.g., during the winter months) when weekly correlations are 0 or even positive. Imposing the aggregated negative-correlation to every week during the generation process may be the cause of the overly-negative correlation in the synthetic timeseries.

It may be possible to produce synthetic data with better preservation of historic correlations by performing the same demand generation methods but with more than two seasons.

Conclusions

When generating synthetic timeseries, it is important to replicate the historic means and variances of the data, but also to capture the correlation that exist between variables. Interactions between exogenous variables can have critical implications for policy outcomes.

For example, when evaluating water resource policies, strong negative correlation between demand and inflow can constitute a compounding risk (Simpson et al., 2021), where the risk associated with low streamflow during a drought is then compounded by high demand at the same time.

Here, I’ve shared two different methods of producing correlated synthetic timeseries which do well in preserving historic correlation patterns. Additionally, I’ve tried to demonstrate different analyses and visualizations that can be used to verify this preservation. While demonstrated using inflow and demand data, the methods described in this post can be applied to a variety of different timeseries variables.

Lastly, I want to thank David Gold and David Gorelick for sharing their data and insight on this project. I also want to give a shout out to Professor Scott Steinschneider whose Multivariate Environmental Statistics class at Cornell motivated this work, and who fielded questions along the way.

Happy programming!

References

Gold, D. F., Reed, P. M., Gorelick, D. E., & Characklis, G. W. (2022). Power and Pathways: Exploring Robustness, Cooperative Stability, and Power Relationships in Regional Infrastructure Investment and Water Supply Management Portfolio Pathways. Earth’s Future10(2), e2021EF002472.

Kirsch, B. R., Characklis, G. W., & Zeff, H. B. (2013). Evaluating the impact of alternative hydro-climate scenarios on transfer agreements: Practical improvement for generating synthetic streamflows. Journal of Water Resources Planning and Management, 139(4), 396-406.

Lettenmaier, D. P., Leytham, K. M., Palmer, R. N., Lund, J. R., & Burges, S. J. (1987). Strategies for coping with drought: Part 2, Planning techniques and reliability assessment (No. EPRI-P-5201). Washington Univ., Seattle (USA). Dept. of Civil Engineering; Electric Power Research Inst., Palo Alto, CA (USA).

Mandelbrot, B. B., & Wallis, J. R. (1968). Noah, Joseph, and operational hydrology. Water resources research, 4(5), 909-918.

McLeod, A. I., & Hipel, K. W. (1978). Preservation of the rescaled adjusted range: 1. A reassessment of the Hurst Phenomenon. Water Resources Research14(3), 491-508.

Simpson, N. P., Mach, K. J., Constable, A., Hess, J., Hogarth, R., Howden, M., … & Trisos, C. H. (2021). A framework for complex climate change risk assessment. One Earth4(4), 489-501.

Zeff, H. B., Herman, J. D., Reed, P. M., & Characklis, G. W. (2016). Cooperative drought adaptation: Integrating infrastructure development, conservation, and water transfers into adaptive policy pathways. Water Resources Research, 52(9), 7327-7346.

CNNs for Time Series Applications

This post is meant to be an introduction to convolutional neural networks (CNNs) and how they can be applied to continuous prediction problems, such as time series predictions. CNNs have historically been utilized in image classification applications. At a high level, CNNs use small kernels (filters) that can slide over localized regions of an image and detect features from edges to faces, much in the same way as the visual cortex of a brain (Hubel and Wiesel, 1968). The basic concepts of a CNN were first introduced by Kunihiko Fukushima in 1980 and the first use of CNNs for image recognition were carried out by Yann LeCun in 1988. The major breakthrough for the algorithm didn’t happen until 2000 with the advent of GPUs and by 2015, CNNs were favored to win image recognition contests over other deep networks.

It is believed that recurrent style networks such as LSTMs are the most appropriate algorithms for time series prediction, but studies have been conducted that suggest that CNNs can perform equivalently (or better) and that appropriate filters can extract features that are coupled across variables and time while being computationally efficient to train (Bai et al., 2018, Rodrigues et al., 2021). Below, I’ll demonstrate some of the key characteristics of CNNs and how CNNs can be used for time series prediction problems.

Architecture

Everything You Need to Know About Convolutional Neural Networks

Figure 1: CNN schematic for image classification (Sharma, 2018)

Figure 1 shows a schematic of a CNN’s architecture. The architecture is primarily comprised of a series of convolution and pooling layers followed by a fully connected network. In each convolution layer are kernel matrices that are convolved with the input into the convolution layer. It is up to the user to define the number of kernels and size of the kernels, but the weights in the kernel are learned using backpropagation. A bias is added to the output of the convolution layer and then passed through an activation function, such as ReLU function to yield feature maps. The feature maps are stacked in a cuboid of a depth that equals the number of filters. If the convolution layer is followed by a pooling layer, the feature maps are down-sampled to produce a lower dimensional representation of the feature maps. The output from the final pooling or convolutional layer is flattened and fed to the fully connected layers.

We will now look at the components of the architecture in more detail. To demonstrate how the convolutional layer works, we will use a toy example shown in Figure 2.

Figure 2: Convolution of a 3×3 kernel with the original image

Let’s say that our input is an image is represented as a 5×5 array and the filter is a 3×3 kernel that will be convolved with the image. The result is the array termed Conv1 which is just another array where each cell is the dot product between the filter and the 3×3 subsections of the image. The numbers in color represent the values that the filter is centered on. Note that the convolution operation will result in an output that is smaller than the input and can result in a loss of information around the boundaries of the image. Zero padding, which constitutes adding border of zeros around the input array, can be used to preserve the input size. The kernel matrices are the mechanisms by which the CNN is able to identify underlying patterns. Figure 3 shows examples of what successive output from convolution layers, or feature maps, can look like.

Figure 3: Convolutional layer output for a CNN trained to distinguish between cats and dogs (Dertat, 2017)

The filters in the first convolutional layer of a CNN retain most of the information of the image, particularly edges. The brightest colors represent the most active pixels. The feature maps tend to become more abstract or focused on specific features as you move deeper into the network (Dertat, 2017). For example, Block 3 seems to be tailored to distinguish eyes.

The other key type of layer is a pooling layer. A pooling layer is added after convolution to reduce dimensionality, which can both reduce computational time to train by reducing parameters but can also reduce the chances of overfitting. The most common type of pooling is max pooling which returns the max value in a NxN matrix pooling filter. This type of pooling retains the most active pixels in the feature map. As demonstrated in Figure 4, max pooling, using a 2×2 filter with a stride (or shift) of 2 pixels, reduces our Conv1 layer into a 2×2 lower dimensional matrix. One can also do average pooling instead of max pooling which would take the average of the values in each 2×2 subsection of the Conv1 layer.

Figure 4: Max pooling example

Application to Regression

CNNs are easiest to understand and visualize for image applications which provide a basis for thinking about how we can use CNNs in a regression or prediction application for time series. Let’s use a very simple example of a rainfall-runoff problem that uses daily precipitation and temperature to predict outflow in an ephemeral sub-basin within the Tuolumne Basin. Because the sub-basin features a creek that is ephemeral, this means that the creek can dry up across the simulation period and there can be extended periods of zero flow. This can make predictions in the basin very difficult. Here, we also implement a lag which allows us to consider the residence time of the basin and that precipitation/temperature from days before likely will contribute to predicting the outflow today. We use a lag of 18, meaning that we use the previous 18 values of precipitation and temperature to predict outflow. The CNN model is implemented within Keras in the code below.

#import modules

import numpy as np
import pandas as pd
from keras.utils import to_categorical
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense
from keras.layers.convolutional import Conv1D, Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.layers import Dropout, Activation, Flatten
from keras.optimizers import SGD
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from tqdm import tqdm_notebook
import seaborn as sns
import os

os.getcwd()
os.chdir("C:/Users/Rohini/Documents/")
df_ge = pd.read_csv("Sub_0_daily.csv", index_col=0) 
print(df_ge.head())

#Check for nulls
print("checking if any null values are present\n", df_ge.isna().sum())

#Specify the training columns by their names
train_cols = ["Precipitation","Temperature"]
label_cols = ["Outflow"]



# This function normalizes the input data
def Normalization_Transform(x):
    x_mean=np.mean(x, axis=0)
    x_std= np.std(x, axis=0)
    xn = (x-x_mean)/x_std
    return xn, x_mean,x_std




# This function reverses the normalization 
def inverse_Normalization_Transform(xn, x_mean,x_std):
    xd = (xn*x_std)+x_mean
    return xd



# building timeseries data with given timesteps (lags)
def timeseries(X, Y, Y_actual, time_steps, out_steps):
    input_size_0 = X.shape[0] - time_steps
    input_size_1 = X.shape[1]
    X_values = np.zeros((input_size_0, time_steps, input_size_1))
    Y_values = np.zeros((input_size_0,))
    Y_values_actual = np.zeros((input_size_0,))
    
    for i in tqdm_notebook(range(input_size_0)):
        X_values[i] = X[i:time_steps+i]
        Y_values[i] = Y[time_steps+i-1, 0]
        Y_values_actual[i] = Y_actual[time_steps+i-1, 0]
        
    print("length of time-series i/o",X_values.shape,Y_values.shape)
    return X_values, Y_values, Y_values_actual


df_train, df_test = train_test_split(df_ge, train_size=0.8, test_size=0.2, shuffle=False)
x_train = df_train.loc[:,train_cols].values
y_train = df_train.loc[:,label_cols].values
x_test = df_test.loc[:,train_cols].values
y_test = df_test.loc[:,label_cols].values    
   
#Normalizing training data
x_train_nor = xtrain_min_max_scaler.fit_transform(x_train)
y_train_nor = ytrain_min_max_scaler.fit_transform(y_train)

# Normalizing test data
x_test_nor = xtest_min_max_scaler.fit_transform(x_test)
y_test_nor = ytest_min_max_scaler.fit_transform(y_test)

# Saving actual train and test y_label to calculate mean square error later after training
y_train_actual = y_train
y_test_actual = y_test

#Building timeseries
X_Train, Y_Train, Y_train_actual = timeseries(x_train_nor, y_train_nor, y_train_actual, time_steps=18, out_steps=1)
X_Test, Y_Test, Y_test_actual = timeseries(x_test_nor, y_test_nor, y_test_actual, time_steps=18, out_steps=1)

#Define CNN model

def make_model(X_Train):
    input_layer = Input(shape=(X_Train.shape[1],X_Train.shape[2]))

    conv1 = Conv1D(filters=16, kernel_size=2, strides=1,
                    padding='same',activation='relu')(input_layer)
    conv2 = Conv1D(filters=32, kernel_size=3,strides = 1,
                          padding='same', activation='relu')(conv1)
    conv3 = Conv1D(filters=64, kernel_size=3,strides = 1,
                          padding='same', activation='relu')(conv2)
    flatten = Flatten()(conv3)
    dense1 = Dense(1152, activation='relu')(flatten)
    dense2 = Dense(576, activation='relu')(dense1)
    output_layer = Dense(1, activation='linear')(dense2)
    
    return Model(inputs=input_layer, outputs=output_layer)

model = make_model(X_Train)
model.compile(optimizer = 'adam', loss = 'mean_squared_error')
model.fit(X_Train, Y_Train, epochs=10)


#Prediction and inverting results 
ypred = model.predict(X_Test)
predict =inverse_Normalization_Transform(ypred,y_mean_train, y_std_train)


#Plot results
plt.figure(figsize=(11, 7))

plt.plot(y_test)
plt.plot((predict))

plt.title('Outflow Prediction (Precipitation+Temperature,Epochs=10, Lag=18 hours)')
plt.ylabel('Outflow (cfs)')
plt.xlabel('Day')
plt.legend(['Actual Values','Predicted Values'], loc='upper right')
plt.show()

    

Just as with any algorithm, we normalize the input data and split it into testing and training sets. The CNN model is implemented in Keras and consists of three convolutional layers with kernel sizes that are explicitly defined to extract patterns that are coupled across variables and time. A schematic of the setup is shown in Figure 5.

Figure 5: Convolution layer setup for the Tuolumne case

Layer 1 uses a 1D convolutional layer with 16 filters of size 1×2 in order to extract features and interactions across the precipitation and temperature time series as demonstrated in the top left of Figure 5. The result of this is an output layer of 1x18x16. The second convolution layer uses 32, 3×1 filters which now will further capture temporal interactions down the output column vector. The third layer uses 64, 3×1 filters to capture more complex temporal trends which is convolved with the output from the Conv2 layer. Note that zero padding is added (padding =”same” in the code) to maintain the dimensions of the layers. The three convolutional layers are followed by a flattening layer and a three-layer dense network. The CNN was run 20 times and the results from the last iteration are shown in Figure 6. We also compare to an LSTM that has an equivalent 3-layer setup and that is also run 20 times. The actual outflow is shown in blue while predictions are shown in red.

Figure 6: CNN vs LSTM prediction

For all purposes, the visual comparison yields that CNNs and LSTMs work equivalently, though the CNN was considerably faster to train. Notably, the CNN does a better job of capturing the large extremes recorded on day 100 and day 900, while still capturing the dynamics of the lower flow regime. While these results are preliminary and largely un-optimized, the CNN shows the ability to outperform an LSTM for a style of problem that it is not technically designed for. Using the specialized kernels, the CNN learns the interactions (both across variables and temporally) without needing a mechanism specifically designed for memory, such as a cell state in an LSTM. Furthermore, CNNs can greatly take advantage of additional speedups from GPUs which doesn’t always produce large gain in efficiency for LSTM training. For now, we can at least conclude that CNNs are fast and promising alternatives to LSTMs that you may not have considered before. Future blog posts will dive more into the capabilities of CNNs in problems with more input variables and complex interactions, particularly if there seems to be a benefit from CNNs in resolving complex relationships that help to predict extremes.

References

Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. The Journal of physiology195(1), 215-243.

Bai, S., Kolter, J. Z., & Koltun, V. (2018). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.

Rodrigues, N. M., Batista, J. E., Trujillo, L., Duarte, B., Giacobini, M., Vanneschi, L., & Silva, S. (2021). Plotting time: On the usage of CNNs for time series classification. arXiv preprint arXiv:2102.04179.

Sharma, V. (2018). https://vinodsblog.com/2018/10/15/everything-you-need-to-know-about-convolutional-neural-networks/

Dertat, A. (2017). https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2

MORDM Basics I: Synthetic Streamflow Generation

In this post, we will break down the key concepts underlying synthetic streamflow generation, and how it fits within the Many Objective Robust Decision Making (MORDM) framework (Kasprzyk, Nataraj et. al, 2012). This post is the first in a series on MORDM which will begin here: with generating and validating the data used in the framework. To provide some context as to what we are about to attempt, please refer to this post by Jon Herman.

What is synthetic streamflow generation?

Synthetic streamflow generation is a non-parametric, direct statistical approach used to generate synthetic streamflow timeseries from a reasonably long historical record. It is used when there is a need to diversify extreme event scenarios, such as flood and drought, or when we want to generate flows to reflect a shift in the hydrologic regime due to climate change. It is favored as it relies on a re-sampling of the historical record, preserving temporal correlation up to a certain degree, and results in a more realistic synthetic dataset. However, its dependence on a historical record also implies that this approach requires a relatively long historical inflow data. Jon Lamontagne’s post goes into further detail regarding this approach.

Why synthetic streamflow generation?

An important step in the MORDM framework is scenario discovery, which requires multiple realistic scenarios to predict future states of the world (Kasprzyk et. al., 2012). Depending solely on the historical dataset is insufficient; we need to generate multiple realizations of realistic synthetic scenarios to facilitate a comprehensive scenario discovery process. As an approach that uses a long historical record to generate synthetic data that has been found to preserve seasonal and annual correlation (Kirsch et. al., 2013; Herman et. al., 2016), this method provides us with a way to:

  1. Fully utilize a large historical dataset
  2. Stochastically generate multiple synthetic datasets while preserving temporal correlation
  3. Explore many alternative climate scenarios by changing the mean and the spread of the synthetic datasets

The basics of synthetic streamflow generation in action

To better illustrate the inner workings of synthetic streamflow generation, it is helpful to use a test case. In this post, the historical dataset is obtained from the Research Triangle Region in North Carolina. The Research Triangle region consists of four main utilities: Raleigh, Durham, Cary and the Orange County Water and Sewer Authority (OWASA). These utilities are receive their water supplies from four water sources: the Little River Reservoir, Lake Wheeler, Lake Benson, and the Jordan Lake (Figure 1), and historical streamflow data is obtained from ten different stream gauges located at each of these water sources. For the purpose of this example, we will be using 81 years’ worth of weekly streamflow data available here.

Figure 1: The Research Triangle region (Trindade et. al., 2019).

The statistical approach that drive synthetic streamflow generation is called the Kirsch Method (Kirsch et. al., 2013). In plain language, this method does the following:

  1. Converts the historical streamflows from real space to log space, and then standardize the log-space data.
  2. Bootstrap the log-space historical matrix to obtain an uncorrelated matrix of historical data.
  3. Obtain the correlation matrix of the historical dataset by performing Cholesky decomposition.
  4. Impose the historical correlation matrix upon the uncorrelated matrix obtained in (2) to generate a standardized synthetic dataset. This preserves seasonal correlation.
  5. De-standardize the synthetic data, and transform it back into real space.
  6. Repeat steps (1) to (5) with a historical dataset that is shifted forward by 6 months (26 weeks). This preserves year-to-year correlation.

This post by Julie Quinn delves deeper into the Kirsch Method’s theoretical steps. The function that executes these steps can be found in the stress_dynamic.m Matlab file, which in turn is executed by the wsc_main_rate.m file by setting the input variable p = 0 as shown on Line 27. Both these files are available on GitHub here.

However, this is simply where things get interesting. Prior to this, steps (1) to (6) would have simply generated a synthetic dataset based on only historical statistical characteristics as validated here in Julie’s second blog post on a similar topic. Out of the three motivations for using synthetic streamflow generation, the third one (exploration of multiple scenarios) has yet to be satisfied. This is a nice segue into out next topic:

Generating multiple scenarios using synthetic streamflow generation

The true power of synthetic streamflow generation lies in its ability to generate multiple climate (or in this case, streamflow) scenarios. This is done in stress_dynamic.m using three variables:

Input variableData type
pThe lowest x% of streamflows
nA vector where each element ni is the number of copies of the p-lowest streamflow years to be added to the bootstrapped historical dataset.
mA vector where each element mi is the number of copies of the (1-p)-highest streamflow years to be added to the bootstrapped historical dataset.
Table 1: The input variables to the stress_dynamic function.

These three variables bootstrap (increase the length of) the historical record while allow us to perturb the historical streamflow record streamflows to reflect an increase in frequency or severity of extreme events such as floods and droughts using the following equation:

new_hist_years = old_historical_years + [(p*old_historical_years)*ni ] + (old_hist_years – [(p*old_historical_years)mi])

The stress_dynamic.m file contains more explanation regarding this step.

This begs the question: how do we choose the value of p? This brings us to the topic of the standardized streamflow indicator (SSI6).

The SSI6 is the 6-month moving average of the standardized streamflows to determine the occurrence and severity of drought on the basis of duration and frequency (Herman et. al., 2016). Put simply, this method determines the occurrence of drought if the the value of the SSI6 < 0 continuously for at least 3 months, and SSI6 < -1 at least once during the 6-month interval. The periods and severity (or lack thereof) of drought can then be observed, enabling the decision on the length of both the n and m vectors (which correspond to the number of perturbation periods, or climate event periods). We will not go into further detail regarding this method, but there are two important points to be made:

  1. The SSI6 enables the determination of the frequency (likelihood) and severity of drought events in synthetic streamflow generation through the values contained in p, n and m.
  2. This approach can be used to generate flood events by exchanging the values between the n and m vectors.

A good example of point (2) is done in this test case, in which more-frequent and more-severe floods was simulated by ensuring that most of the values in m where larger than those of n. Please refer to Jon Herman’s 2016 paper titled ‘Synthetic drought scenario generation to support bottom-up water supply vulnerability assessments’ for further detail.

A brief conceptual letup

Now we have shown how synthetic streamflow generation satisfies all three factors motivating its use. We should have three output folders:

  • synthetic-data-stat: contains the synthetic streamflows based on the unperturbed historical dataset
  • synthetic-data-dyn: contains the synthetic streamflows based on the perturbed historical dataset

Comparing these two datasets, we can compare how increasing the likelihood and severity of floods has affected the resulting synthetic data.

Validation

To exhaustively compare the statistical characteristics of the synthetic streamflow data, we will perform two forms of validation: visual and statistical. This method of validation is based on Julie’s post here.

Visual validation

Done by generating flow duration curves (FDCs) . Figure 2 below compares the unperturbed (left) and perturbed (right) synthetic datasets.

Figure 2: (Above) The FDC of the unperturbed historical dataset (pink) and its resulting synthetic dataset (blue). (Below) The corresponsing perturbed historical and synthetic dataset.

The bottom plots in Figure 2 shows an increase in the volume of weekly flows, as well as an smaller return period, when the the historical streamflows were perturbed to reflect an increasing frequency and magnitude of flood events. Together with the upper plots in Figure 2, this visually demonstrates that the synthetic streamflow generation approach (1) faithfully reconstructs historical streamflow patterns, (2) increases the range of possible streamflow scenarios and (3) can model multiple extreme climate event scenarios by perturbing the historical dataset. The file to generate this Figure can be found in the plotFDCrange.py file.

Statistical validation

The mean and standard deviation of the perturbed and unperturbed historical datasets are compared to show if the perturbation resulted in significant changes in the synthetic datasets. Ideally, the perturbed synthetic data would have higher means and similar standard deviations compared to the unperturbed synthetic data.

Figure 3: (Above) The unperturbed synthetic (blue) and historical (pink) streamflow datasets for each of the 10 gauges. (Below) The perturbed counterpart.

The mean and tails of the synthetic streamflow values of the bottom plots in Figure 3 show that the mean and maximum values of the synthetic flows are significantly higher than the unperturbed values. In addition, the spread of the standard deviations of the perturbed synthetic streamflows are similar to that of its unperturbed counterpart. This proves that synthetic streamflow generation can be used to synthetically change the occurrence and magnitude of extreme events while maintaining the periodicity and spread of the data. The file to generate Figure 3 can be found in weekly-moments.py.

Synthetic streamflow generation and internal variability

The generation of multiple unperturbed realizations of synthetic streamflow is vital for characterizing the internal variability of a system., otherwise known as variability that arises from natural variations in the system (Lehner et. al., 2020). As internal variability is intrinsic to the system, its effects cannot be eliminated – but it can be moderated. By evaluating multiple realizations, we can determine the number of realizations at which the internal variability (quantified here by standard deviation as a function of the number of realizations) stabilizes. Using the synthetic streamflow data for the Jordan Lake, it is shown that more than 100 realizations are required for the standard deviation of the 25% highest streamflows across all years to stabilize (Figure 4). Knowing this, we can generate sufficient synthetic realizations to render the effects of internal variability insignificant.

Figure 4: The highest 25% of synthetic streamflows for the Jordan Lake gauge.

The file internal-variability.py contains the code to generate the above figure.

How does this all fit within the context of MORDM?

So far, we have generated synthetic streamflow datasets and validated them. But how are these datasets used in the context of MORDM?

Synthetic streamflow generation lies within the domain of the second part of the MORDM framework as shown in Figure 5 above. Specifically, synthetic streamflow generation plays an important role in the design of experiments by preserving the effects of deeply uncertain factors that cause natural events. As MORDM requires multiple scenarios to reliably evaluate all possible futures, this approach enables the simulation of multiple scenarios, while concurrently increasing the severity or frequency of extreme events in increments set by the user. This will allow us to evaluate how coupled human-natural systems change over time given different scenarios, and their consequences towards the robustness of the system being evaluated (in this case, the Research Triangle).

Figure 4: The taxonomy of robustness frameworks. The bold-outlined segments highlight where MORDM fits within this taxonomy (Herman et. al., 2015).

Typically, this evaluation is performed in two main steps:

  1. Generation and evaluation of multiple realizations of unperturbed annual synthetic streamflow. The resulting synthetic data is used to generate the Pareto optimal set of policies. This step can help us understand how the system’s internal variability affects future decision-making by comparing it with the results in step (2).
  2. Generation and evaluation of multiple realizations of perturbed annual synthetic streamflow. These are the more extreme scenarios in which the previously-found Pareto-optimal policies will be evaluated against. This step assesses the robustness of the base state under deeply uncertain deviations caused by the perturbations in the synthetic data and other deeply uncertain factors.

Conclusion

Overall, synthetic streamflow generation is an approach that is highly applicable in the bottom-up analysis of a system. It preserves historical characteristics of a streamflow timeseries while providing the flexibility to modify the severity and frequency of extreme events in the face of climate change. It also allows the generation of multiple realizations, aiding in the characterization and understanding of a system’s internal variability, and a more exhaustive scenario discovery process.

This summarizes the basics of data generation for MORDM. In my next blog post, I will introduce risk-of-failure (ROF) triggers, their background, key concepts, and how they are applied within the MORDM framework.

References

Herman, J. D., Reed, P. M., Zeff, H. B., & Characklis, G. W. (2015). How should robustness be defined for water systems planning under change? Journal of Water Resources Planning and Management, 141(10), 04015012. doi:10.1061/(asce)wr.1943-5452.0000509

Herman, J. D., Zeff, H. B., Lamontagne, J. R., Reed, P. M., & Characklis, G. W. (2016). Synthetic drought scenario generation to support bottom-up water supply vulnerability assessments. Journal of Water Resources Planning and Management, 142(11), 04016050. doi:10.1061/(asce)wr.1943-5452.0000701

Kasprzyk, J. R., Nataraj, S., Reed, P. M., & Lempert, R. J. (2013). Many objective robust decision making for complex environmental systems undergoing change. Environmental Modelling & Software, 42, 55-71. doi:10.1016/j.envsoft.2012.12.007

Kirsch, B. R., Characklis, G. W., & Zeff, H. B. (2013). Evaluating the impact of alternative hydro-climate scenarios on transfer agreements: Practical improvement for generating synthetic streamflows. Journal of Water Resources Planning and Management, 139(4), 396-406. doi:10.1061/(asce)wr.1943-5452.0000287

Mankin, J. S., Lehner, F., Coats, S., & McKinnon, K. A. (2020). The value of initial condition large ensembles to Robust Adaptation Decision‐Making. Earth’s Future, 8(10). doi:10.1029/2020ef001610

Trindade, B., Reed, P., Herman, J., Zeff, H., & Characklis, G. (2017). Reducing regional drought vulnerabilities and multi-city robustness conflicts using many-objective optimization under deep uncertainty. Advances in Water Resources, 104, 195-209. doi:10.1016/j.advwatres.2017.03.023

How to make horizon plots in Python

Horizon plots were invented about a decade ago to facilitate visual comparison between two time series. They are not intuitive to read right away, but they are great for comparing and presenting many sets of timeseries together. They can take advantage of a minimal design by avoiding titles and ticks on every axis and packing them close together to convey a bigger picture. The example below shows percent changes in the price of various food items in 25 years.

The way they are produced and read is by dividing the values along the y axis in bands based on ranges. The color of each band is given by a divergent color map. By collapsing the bands to the zero axis and layering the higher bands on top, one can create a time-varying heatmap of sorts.

Source: http://idl.cs.washington.edu/papers/horizon

I wasn’t able to find a script that could produce this in Python, besides some code in this github repository, that is about a decade old and cannot really run in Python 3. I cleaned it up and updated the scripts with some additional features. I also added example data comparing USGS streamflow data with model simulation data for the same locations for 38 years. The code can be found here and can be used with any two datasets that one would like to compare with as many points of comparison as needed (I used eight below, but the script can accept larger csv files with more or less comparison points, which will be detected automatically). The script handles the transformation of the data to uniform bands and produces the following figure, with every subplot comparing model output with observations at eight gauges, i.e. model prediction error. When the model is over predicting the area is colored blue, when the area is underpredicting, the area is colored red. Darker shades indicate further divergence from the zero axis. The script automatically uses three bands for both positive or negative divergence, but more can be added, as long as the user defines additional colors to be used.

Using this type of visualization for these data allows for time-varying comparisons of multiple locations in the same basin. The benefit of it is most exploited with many subplots that make up a bigger picture.

Future extensions in this repository will include code to accept more file types than csv, more flexibility in how the data is presented and options to select different colormaps when executing.

From MATLAB to Julia: Insights from Translating an Opensource Kirsch-Nowak Streamflow Generator to Julia

A quick look into translating code: speed comparisons, practicality, and comments

As I am becoming more and more familiar with Julia—an open-source programming language—I’ve been attracted to translate code to not only run it on an opensource and free language but also to test its performance. Since Julia was made to be an open source language made to handle matrix operations efficiently (when compared to other high-level opensource languages), finding a problem to utilize these performance advantages only makes sense.

As with any new language, understanding how well it performs relative to the other potential tools in your toolbox is vital. As such, I decided to use a problem that is easily scalable and can be directly compare the performances of MATLAB and Julia—the Kirsch-Nowak synthetic stationary streamflow generator.

So, in an effort to sharpen my understanding of the Kirsch-Nowak synthetic stationary streamflow generator created by Matteo GiulianiJon Herman and Julianne Quinn, I decided to take on this project of converting from this generator from MATLAB. This specific generator takes in historical streamflow data from multiple sites (while assuming stationarity) and returns a synthetically generated daily timeseries of streamflow. For a great background on synthetic streamflow generation, please refer to this post by Jon Lamontagne.

Model Description

The example is borrowed from Julie’s code utilizes data from the Susquehanna River flows (cfs) at both Marietta (USGS station 01576000) and Muddy Run along with lateral inflows (cfs) between Marietta and Conowingo Damn (1932-2001). Additionally, evaporation rates (in/day) over the Conowingo and Muddy Run Dams (from an OASIS model simulation) utilized. The generator developed by Kirsch et al. (2013) utilizes a Cholesky decomposition to create a monthly synthetic record which preserves the autocorrelation structure of the historical data. The method proposed by Nowak et al. (2010) is then used to disaggregate to daily flows (using a historical month +/- 7 days). A full description of the methods can be found at this link.

Comparing Julia and MATLAB

Comparison of Performance between Julia and MATLAB

To compare the speeds of each language, I adapted the MATLAB code into Julia (shown here) on as nearly of equal basis as possible. I attempted to keep the loops, data structures, and function formulation as similar as possible, even calling similar libraries for any given function. julia_matlab_streamflow

When examining the performance between Julia (solid lines) and MATLAB (dashed lines), there is only one instance where MATLAB(x) outperformed Julia(+)—in the 10-realization, 1000-year simulation shown in the yellow dots in the upper left. Needless to say, Julia easily outperformed MATLAB in all other situations and required only 53% of the time on average (all simulations considered equal). However, Julia was much proportionally faster at lower dimensions of years (17-35% of the time required) than MATLAB. This is likely because I did not handle arrays optimally—the code could likely be sped up even more.

Considerations for Speeding Up Code

Row- Versus Column-Major Array Architecture

It is worth knowing how a specific language processes its arrays/matrices. MATLAB and Julia are both column-major languages, meaning the sequential indexes and memory paths are grouped by descending down row by row through a column then going through the next column. On the other hand, Numpy in Python specifically uses row-major architecture. The Wikipedia article on this is brief but well worthwhile for understanding these quirks.

This is especially notable because ensuring that proper indexing and looping methods are followed can substantially speed up code. In fact, it is likely that the reason Julia slowed down significantly on a 10-realization 1000-year simulation when compared to both its previous performances and MATLAB because of how the arrays were looped through. As a direct example shown below, when exponentiating through a [20000, 20000] array row-by-row took approximately 47.7 seconds while doing the same operation column-by-column only took 12.7 seconds.

5555

Dealing with Arrays

Simply put, arrays and matrices in Julia are a pain compared to MATLAB. As an example of the bad and the ugly, unlike in MATLAB where you can directly declare any size array you wish to work with, you must first create an array and then fill the array with individual array in Julia. This is shown below where an array of arrays is initialized below.  However, once an array is established, Julia is extremely fast in loops, so dealing with filling a previously established array makes for a much faster experience.

# initialize output
qq = Array{Array}(undef, num_sites) #(4, 100, 1200)

for i = 1:num_sites
     qq[i] = Array{Float64}(undef, nR, nY * 12)
end

Once the plus side  when creating arrays, Julia is extremely powerful in its ability to assign variable types to the components of a given array. This can drastically speed up your code during the day. Shown below, it is easy to the range of declarations and assignments being made to populate the array. There’s an easy example of declaring an array with zeros, and another where we’re populating an array using slices of another. Note the indexing structure for Qd_cg in the second loop–it is not technically a 3-D array but rather a 2-D array nested within a 1-D array–showing the issues mentioned prior.

delta = zeros(n_totals)
for i = 1:n_totals
     for j = 1:n_sites
          delta[i] += (Qtotals[month][j][i] - Z[j]) ^ 2
     end
end

q_ = Array{Float64, 2}(undef, num_realizations[k], 365 * num_years[k])
for i = 1: Nsites
     # put into array of [realizations, 365*num_yrs]
     for j = 1: num_realizations[k]
          q_[j, :] = Qd_cg[j][:, i]'
     end
end

Code Profiling: Order of Experiments

An interesting observation I’ve noticed is that Julia’s first run on a given block of code is substantially slower than every other attempt. Thus, it is likely worthwhile to run a smaller-scale array through to initialize the code if there are plans to move on to substantially more expensive operations (i.e. scaling up).

In the example below, we can see that the second iteration of the same exact code was over 10% faster when calling it a second time. However, when running the code without the function wrapper (in the original timed runs), the code was 10% faster (177 seconds) than the second sequential run shown below. This points to the importance of profiling and experimenting with sections of your code.

4444

Basic profiling tools are directly built into Julia, as shown in the Julia profiling documentation. This can be visualized easily using the ProfileView library. The Juno IDE (standard with Julia Pro) allegedly has a good built-in profile as well. However, it should be expected that most any IDE should do the trick (links to IDEs can be found here).

Syntax and Library Depreciation

While Julia is very similar in its structure and language to MATLAB, much of the similar language has depreciated as Julia has been rapidly upgraded. Notably, Julia released V1.0 in late 2018 and recently released V1.1, moving further away from similarities in function names. Thus, this stands as a lesson for individuals wishing to translate all of their code between these languages. I found a useful website that assists in translating general syntax, but many of the functions have depreciated. However, as someone who didn’t have any experience with MATLAB but was vaguely familiar with Julia, this was a godsend for learning differences in coding styles.

For example, creating an identity matrix in MATLAB utilizes the function eye(size(R)) to create an nxn matrix the size of R. While this was initially the language used in Julia, this specific language was depreciated in V0.7. To get around this, either ‘I’ can be used to create a scalable identity matrix or Matrix{Float64}(I, size(R), size(R)) declare an identity matrix of size(R) by size(R) for a more foolproof and faster operation.

When declaring functions, I have found Julia to be relatively straightforward and Pythonic in its declarations. While I still look to insert colons at the ends of declarations while forgetting to add ‘end’ at the end of functions, loops, and more, the ease of creating, calling, and interacting with functions makes Julia very accessible.  Furthermore, its ability to interact with matrices in without special libraries (e.g. Numpy in Python) allows for more efficient coding without having to know specific library notation.

Debugging Drawbacks

One of the most significant drawbacks I run into when using Julia is the lack of clarity in generated error codes for common mistakes, such as adding extra brackets. For example, the following error code is generated in Python when adding an extra parenthesis at the end of an expression.3333

However, Julia produces the follow error for an identical mistake:

2222

One simple solution to this is to simply upgrade my development environment from Jupyter Notebooks to a general IDE to more easily root out issues by running code line-by-line. However, I see the lack of clarity in showing where specific errors arise a significant drawback to development within Julia. However, as shown in the example below where an array has gone awry, an IDE (such as Atom shown below) can make troubleshooting and debugging a relative breeze.

1111

Furthermore, when editing auxiliary functions in another file or module that was loaded as a library, Julia is not kind enough to simply reload and recompile the module; to get it to properly work in Atom, I had to shut down the Julia kernel then rerun the entirety of the code. Since Julia takes a bit to initially load and compile libraries and code, this slows down the debugging process substantially. There is a specific package (Revise) that exists to take care of this issue, but it is not standard and requires loading this specific library into your code.

GitHub Repositories: Streamflow Generators

PyMFGM: A parallelized Python version of the code, written by Bernardo Trindade

Kirsch-Nowak Stationary Generator in MATLAB: Developed by Matteo GiulianiJon Herman and Julianne Quinn

Kirsch-Nowak Stationary Generator in Julia: Please note that the results are not validated. However, you can easily access the Jupyter Notebook version to play around with the code in addition to running the code from your terminal using the main.jl script.

Full Kirsch-Nowak Streamflow Generator: Also developed by Matteo GiulianiJon Herman and Julianne Quinn and can handle rescaling flows for changes due to monsoons. I would highly suggest diving into this code alongside the relevant blog posts: Part 1 (explanation), Part 2 (validation).

Magnitude-varying sensitivity analysis and visualization (Part 2)

In my last post, I talked about producing these flow-duration-curve-type figures for an output time-series one might be interested in, and talked about their potential use in an exploratory approach for the purpose of robust decision making. Again, the codes to perform the analysis and visualization are in this Github repository.

experiment_data_range

Fig. 1: Historical data vs. range of experiment outputs

As already discussed, there are multiple benefits for visualizing the output in such manner: we are often concerned with the levels and frequencies of extremes when making decisions about systems (e.g. “how bad is the worst case?”, “how rare is the worst case?”), or we might like to know how often we exceed a certain threshold (e.g. “how many years exceed an annual shortage of 1000 af?“). The various percentiles tell a different part of the story of how a system operates, the 5th percentile tells as that its level is exceeded 95% of the time, the 99th tells as that its level is only reached once in every 100 years in our records. These might seem obvious to the readers of this blog, but often times we perform our analyses for only some of these percentiles, “the worst event”, “the average”, etc., which is certainly very informative, but can potentially miss part of the bigger picture.

In this post I’m going to walk the reader through performing a sensitivity analysis using the output of an experiment using multiple Latin Hypercube Samples. The analysis will be magnitude-varying, i.e., it will be performed at different magnitudes of our output of interest. For this particular example, we aim to see what are the most significant drivers of shortage at the different levels it’s experienced by this user. In other words, if some factors appear to be driving the frequent small shortages experienced, are those factors the same for the rare large shortages?

To perform the sensitivity analysis, I am going to use SALib (featured in this blog multiple times already), to perform a Delta Moment-Independent Analysis [1] (also produces a first order Sobol sensitivity index [2]). You’ll probably need to install SALib if it’s not a package you’ve used already. I’m also going to use statsmodels, to perform a simple linear regression on the outputs and look at their R2 values. But, why, you might ask, perform not one, not two, but three sensitivity analyses for this? There are nuanced, yet potentially important differences between what the three methods capture:

Delta method: Look for parameters most significantly affecting the density function of observed shortages. This method is moment-independent, i.e., it looks at differences in the entire distribution of the output we’re interested in.
First order Sobol (S1): Look for parameters that most significantly affect the variance of observed outputs, including non-linear effects.
R2: Look for parameters best able to describe the variance of observed outputs, limited to linear effects.

Another important thing to note is that using the First order Sobol index, the total variance resulting from the parameters should equal 1. This means that if we sum up the S1’s we get from our analysis, the sum represents the variance described by the first order effects of our parameters, leaving whatever is left to interactions between our variables (that S1 cannot capture). The same holds using R2, as we are repeatedly fitting our parameters and scoring them on how much of the output variance they describe as a sole linear predictor (with no interactions or other relationships).

The following Python script will produce all three as well as confidence intervals for the Delta index and S1. The script essentially loops through all percentiles in the time-series and performs the two analyses for each one. In other words, we’re are looking at how sensitive each magnitude percentile is to each of the sampled parameters.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from SALib.analyze import delta
# Load parameter samples
LHsamples = np.loadtxt('./LHsamples.txt')
params_no = len(LHsamples[0,:])
param_bounds=np.loadtxt('./uncertain_params.txt', usecols=(1,2))
# Parameter names
param_names=['IWRmultiplier','RESloss','TBDmultiplier','M_Imultiplier',
'Shoshone','ENVflows','EVAdelta','XBM_mu0','XBM_sigma0',
'XBM_mu1','XBM_sigma1','XBM_p00','XBM_p11']
# Define problem class
problem = {
'num_vars': params_no,
'names': param_names,
'bounds': param_bounds.tolist()
}
# Percentiles for analysis to loop over
percentiles = np.arange(0,100)
# Function to fit regression with Ordinary Least Squares using statsmodels
def fitOLS(dta, predictors):
# concatenate intercept column of 1s
dta['Intercept'] = np.ones(np.shape(dta)[0])
# get columns of predictors
cols = dta.columns.tolist()[-1:] + predictors
#fit OLS regression
ols = sm.OLS(dta['Shortage'], dta[cols])
result = ols.fit()
return result
# Create empty dataframes to store results
DELTA = pd.DataFrame(np.zeros((params_no, len(percentiles))), columns = percentiles)
DELTA_conf = pd.DataFrame(np.zeros((params_no, len(percentiles))), columns = percentiles)
S1 = pd.DataFrame(np.zeros((params_no, len(percentiles))), columns = percentiles)
S1_conf = pd.DataFrame(np.zeros((params_no, len(percentiles))), columns = percentiles)
R2_scores = pd.DataFrame(np.zeros((params_no, len(percentiles))), columns = percentiles)
DELTA.index=DELTA_conf.index=S1.index=S1_conf.index = R2_scores.index = param_names
# Read in experiment data
expData = np.loadtxt('./experiment_data.txt')
# Identify magnitude at each percentiles
syn_magnitude = np.zeros([len(percentiles),len(LHsamples[:,0])])
for j in range(len(LHsamples[:,0])):
syn_magnitude[:,j]=[np.percentile(expData[:,j], i) for i in percentiles]
# Delta Method analysis
for i in range(len(percentiles)):
if syn_magnitude[i,:].any():
try:
result= delta.analyze(problem, LHsamples, syn_magnitude[i,:], print_to_console=False, num_resamples=2)
DELTA[percentiles[i]]= result['delta']
DELTA_conf[percentiles[i]] = result['delta_conf']
S1[percentiles[i]]=result['S1']
S1_conf[percentiles[i]]=result['S1_conf']
except:
pass
S1.to_csv('./S1_scores.csv')
S1_conf.to_csv('./S1_conf_scores.csv')
DELTA.to_csv('./DELTA_scores.csv')
DELTA_conf.to_csv('./DELTA_conf_scores.csv')
# OLS regression analysis
dta = pd.DataFrame(data = LHsamples, columns=param_names)
# fig = plt.figure()
for i in range(len(percentiles)):
shortage = np.zeros(len(LHsamples[:,0]))
for k in range(len(LHsamples[:,0])):
shortage[k]=syn_magnitude[i,k]
dta['Shortage']=shortage
for m in range(params_no):
predictors = dta.columns.tolist()[m😦m+1)]
result = fitOLS(dta, predictors)
R2_scores.at[param_names[m],percentiles[i]]=result.rsquared
R2_scores.to_csv('./R2_scores.csv')

The script produces the sensitivity analysis indices for each magnitude percentile and stores them as .csv files.

I will now present a way of visualizing these outputs, using the curves from Fig. 1 as context.  The code below reads in the values for each sensitivity index, normalizes them to the range of magnitude at each percentile, and then plots them using matplotlib’s stackplot fuction, which stacks the contribution of each parameter to the sum (in this case the maximum of the resulting range)

I’ll go through what the code does in more detail:

First, we take the range boundaries (globalmax and globalmin) which give us the max and min values for each percentile. We then read in the values for each sensitivity index and normalize them to that range (i.e. globalmaxglobalmin for each percentile). The script also adds two more arrays (rows in the pandas dataframe), one representing interaction and one representing the globalmin, upon which we’re going to stack the rest of the values. [Note: This is a bit of a roundabout way of getting the figures how we like them, but it’s essentially creating a pseudo-stack for the globalmin, that we’re plotting in white.] 

The interaction array is only used when normalizing the S1 and R2 values, where we attribute to it the difference between 1 and the sum of the calculated indices (i.e. we’re attributing the rest to interaction between the parameters). We don’t need to do this for the delta method indices (if you run the code the array remains empty), but the reason I had to put it there was to make it simpler to create labels and a single legend later.

The plotting simply creates three subplots and for each one uses stackplot to plot the normalized values and then the edges in black. It is important to note that the colorblocks in each figure do not represent the volume of shortage attributed to each parameter at each percentile, but rather the contribution of each parameter to the change in the metric, namely, the density distribution (Delta Method), and the variance (S1 and R2). The code for this visualization is provided at the bottom of the post.

experiment_sensitivity_curves.png

Fig. 2: Magnitude sensitivity curves using three sensitivity indeces

The first thing that pops out from this figure is the large blob of peach, which represents the irrigation demand multiplier in our experiment. The user of interest here was an irrigation user, which would suggest that their shortages are primarily driven by increases in their own demands and of other irrigation users. This is important, because irrigation demand is an uncertainty for which we could potentially have direct or indirect control over, e.g. through conservation efforts.

Looking at the other factors, performing the analysis in a magnitude-varying manner, allowed us to explore the vulnerabilities of this metric across its different levels. For example, dark blue and dark green represent the mean flow of dry and wet years, respectively. Across the three figures we can see that the contribution of mean wet-year flow is larger in the low-magnitude percentiles (left hand side) and diminishes as we move towards the larger-magnitude percentiles.

Another thing that I thought was interesting to note was the difference between the S1 and the R2 plots. They are both variance-based metrics, with R2 limited to linear effects in this case. In this particular case, the plots are fairly similar which would suggest that a lot of the parameter effects on the output variance are linear. Larger differences between the two would point to non-linearities between changes in parameter values and the output.

The code to produce Fig. 2:

# Percentiles for analysis to loop over
percentiles = np.arange(0,100)
# Estimate upper and lower bounds
globalmax = [np.percentile(np.max(expData_sort[:,:],1),p) for p in percentiles]
globalmin = [np.percentile(np.min(expData_sort[:,:],1),p) for p in percentiles]
delta_values = pd.read_csv('./DELTA_scores.csv')
delta_values.set_index(list(delta_values)[0],inplace=True)
delta_values = delta_values.clip(lower=0)
bottom_row = pd.DataFrame(data=np.array([np.zeros(100)]), index= ['Interaction'], columns=list(delta_values.columns.values))
top_row = pd.DataFrame(data=np.array([globalmin]), index= ['Min'], columns=list(delta_values.columns.values))
delta_values = pd.concat([top_row,delta_values.loc[:],bottom_row])
for p in range(len(percentiles)):
total = np.sum(delta_values[str(percentiles[p])])-delta_values.at['Min',str(percentiles[p])]
if total!=0:
for param in param_names:
value = (globalmax[p]-globalmin[p])*delta_values.at[param,str(percentiles[p])]/total
delta_values.set_value(param,str(percentiles[p]),value)
delta_values = delta_values.round(decimals = 2)
delta_values_to_plot = delta_values.values.tolist()
S1_values = pd.read_csv('./S1_scores.csv')
S1_values.set_index(list(S1_values)[0],inplace=True)
S1_values = S1_values.clip(lower=0)
bottom_row = pd.DataFrame(data=np.array([np.zeros(100)]), index= ['Interaction'], columns=list(S1_values.columns.values))
top_row = pd.DataFrame(data=np.array([globalmin]), index= ['Min'], columns=list(S1_values.columns.values))
S1_values = pd.concat([top_row,S1_values.loc[:],bottom_row])
for p in range(len(percentiles)):
total = np.sum(S1_values[str(percentiles[p])])-S1_values.at['Min',str(percentiles[p])]
if total!=0:
diff = 1-total
S1_values.set_value('Interaction',str(percentiles[p]),diff)
for param in param_names+['Interaction']:
value = (globalmax[p]-globalmin[p])*S1_values.at[param,str(percentiles[p])]
S1_values.set_value(param,str(percentiles[p]),value)
S1_values = S1_values.round(decimals = 2)
S1_values_to_plot = S1_values.values.tolist()
R2_values = pd.read_csv('./R2_scores.csv')
R2_values.set_index(list(R2_values)[0],inplace=True)
R2_values = R2_values.clip(lower=0)
bottom_row = pd.DataFrame(data=np.array([np.zeros(100)]), index= ['Interaction'], columns=list(R2_values.columns.values))
top_row = pd.DataFrame(data=np.array([globalmin]), index= ['Min'], columns=list(R2_values.columns.values))
R2_values = pd.concat([top_row,R2_values.loc[:],bottom_row])
for p in range(len(percentiles)):
total = np.sum(R2_values[str(percentiles[p])])-R2_values.at['Min',str(percentiles[p])]
if total!=0:
diff = 1-total
R2_values.set_value('Interaction',str(percentiles[p]),diff)
for param in param_names+['Interaction']:
value = (globalmax[p]-globalmin[p])*R2_values.at[param,str(percentiles[p])]
R2_values.set_value(param,str(percentiles[p]),value)
R2_values = R2_values.round(decimals = 2)
R2_values_to_plot = R2_values.values.tolist()
color_list = ["white", "#F18670", "#E24D3F", "#CF233E", "#681E33", "#676572", "#F3BE22", "#59DEBA", "#14015C", "#DAF8A3", "#0B7A0A", "#F8FFA2", "#578DC0", "#4E4AD8", "#F77632"]
fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(14.5,8))
ax1.stackplot(percentiles, delta_values_to_plot, colors = color_list, labels=parameter_names_long)
l1 = ax1.plot(percentiles, globalmax, color='black', linewidth=2)
l2 = ax1.plot(percentiles, globalmin, color='black', linewidth=2)
ax1.set_title("Delta index")
ax1.set_xlim(0,100)
ax2.stackplot(np.arange(0,100), S1_values_to_plot, colors = color_list, labels=parameter_names_long)
ax2.plot(percentiles, globalmax, color='black', linewidth=2)
ax2.plot(percentiles, globalmin, color='black', linewidth=2)
ax2.set_title("S1")
ax2.set_xlim(0,100)
ax3.stackplot(np.arange(0,100), R2_values_to_plot, colors = color_list, labels=parameter_names_long)
ax3.plot(percentiles, globalmax, color='black', linewidth=2)
ax3.plot(percentiles, globalmin, color='black', linewidth=2)
ax3.set_title("R^2")
ax3.set_xlim(0,100)
handles, labels = ax3.get_legend_handles_labels()
ax1.set_ylabel('Annual shortage (af)', fontsize=12)
ax2.set_xlabel('Shortage magnitude percentile', fontsize=12)
ax1.legend((l1), ('Global ensemble',), fontsize=10, loc='upper left')
fig.legend(handles[1:], labels[1:], fontsize=10, loc='lower center',ncol = 5)
plt.subplots_adjust(bottom=0.2)
fig.savefig('./experiment_sensitivity_curves.png')

References:

[1]: Borgonovo, E. “A New Uncertainty Importance Measure.” Reliability Engineering & System Safety 92, no. 6 (June 1, 2007): 771–84. https://doi.org/10.1016/j.ress.2006.04.015.

[2]: Sobol, I. M. (2001). “Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates.” Mathematics and Computers in Simulation, 55(1-3):271-280, doi:10.1016/S0378-4754(00)00270-6.

Magnitude-varying sensitivity analysis and visualization (Part 1)

Various posts have discussed sensitivity analysis and techniques in this blog before. The purpose of this post is to show an application of the methods and demonstrate how they can be used in an exploratory manner, for the purposes of robust decision making (RDM). RDM aims to evaluate the performance of a policy/strategy/management plan over an ensemble of deeply uncertain parameter combinations – commonly referred to as “states of the world” (SOWs) – and then identify the policies that are most robust to those uncertainties. Most importantly, this process allows the decision maker to examine the implications of their assumptions about the world (or how it will unfold) on their candidate strategies [1].

This is Part 1 of a two part post. In this first post, I’ll introduce the types of figures I’ll be talking about, and some visualization code. In the second post (up in a couple days), I’ll discuss sensitivity analysis for the system as well as some visuals. All the code and data to produce the figures below can be found in this repository.

Now assume the performance of a system is described by a time-series, produced by our model as an output. This might be a streamflow we care about, reservoir releases, nutrient loading, or any type of time-series produced by a run of our model. For the purposes of this example, I’ll use a time-series from the system I’ve been working on, which represents historical shortages for an agricultural user.

historical_data

Fig. 1: Historical data in series

We can sort and rank these data, in the style of a flow duration curve, which would allow us to easily see, levels for median shortage (50th percentile), worst (99th), etc. The reasons one might care about these things (instead of, say, just looking at the mean, or at the time series as presented in Fig. 1) are multiple : we are often concerned with the levels and frequencies of our extremes when making decisions about systems (e.g. “how bad is the worst case?”, “how rare is the worst case?”), we might like to know how often we exceed a certain threshold (e.g. “how many years exceed an annual shortage of 1000 af?“), or, simply, maintain the distributional information of the series we care about in an easily interpretable format.

historical_data_percentiles

Fig. 2: Historical data sorted by percentile

For the purposes of an exploratory experiment, we would like to see how this time-series of model output might change under different conditions (or SOWs). There are multiple ways one might go about this [2], and in this study we sampled a broad range of parameters that we thought would potentially affect the system using Latin Hypercube Sampling [3], producing 1000 parameter combinations. We then re-simulated the system and saved all equivalent outputs for this time-series. We would like to see how this output changes under all the sampled runs.

experiment_data_all

Fig. 3: Historical data vs. experiment outputs (under 1000 SOWs)

Another way of visualizing this information, if we’re not interested in seeing all the individual lines, is to look at the range of outputs. To produce Fig. 4, I used the fill_between function in matplotlib, filling between the max and min values at each percentile level.

experiment_data_range.png

Fig. 4: Historical data vs. range of experiment outputs

By looking at the individual lines or the range, there’s one piece of potentially valuable information we just missed. We have little to no idea of what the density of outputs is within our experiment. We can see the max and min range, the lines thinning out at the edges, but it’s very difficult to infer any density of output within our samples. To address this, I’ve written a little function that loops through 10 frequency levels (you can also think of them as percentiles) and uses the fill_between function again. The only tricky thing to figure out was how to appropriately represent each layer of increasing opacity in the legend – they are all the same color and transparency, but become darker as they’re overlaid. I pulled two tricks for this. First, I needed a function that calculates the custom alpha, or the transparency, as it is not cumulative in matplotlib (e.g., two objects with transparency 0.2 together will appear as a single object with transparency 0.36).

def alpha(i, base=0.2):
l = lambda x: x+base-x*base
ar = [l(0)]
for j in range(i):
ar.append(l(ar[-1]))
return ar[-1]
view raw alpha.py hosted with ❤ by GitHub

Second, I needed proxy artists representing the color at each layer. These are the handles in the code below, produced with every loop iteration.

handles = []
labels=[]
fig = plt.figure()
ax=fig.add_subplot(1,1,1)
for i in range(len(p)):
ax.fill_between(P, np.min(expData_sort[:,:],1), np.percentile(expData_sort[:,:], p[i], axis=1), color='#4286f4', alpha = 0.1)
ax.plot(P, np.percentile(expData_sort[:,:], p[i], axis=1), linewidth=0.5, color='#4286f4', alpha = 0.3)
handle = matplotlib.patches.Rectangle((0,0),1,1, color='#4286f4', alpha=alpha(i, base=0.1))
handles.append(handle)
label = "{:.0f} %".format(100-p[i])
labels.append(label)
ax.plot(P,hist_sort, c='black', linewidth=2, label='Historical record')
ax.set_xlim(0,100)
ax.legend(handles=handles, labels=labels, framealpha=1, fontsize=8, loc='upper left', title='Frequency in experiment',ncol=2)
ax.set_xlabel('Shortage magnitude percentile', fontsize=12)
plt.savefig('experiment_data_density.png')
view raw plot.py hosted with ❤ by GitHub

experiment_data_density

Fig. 5: Historical data vs. frequency of experiment outputs

This allows us to draw some conclusions about how events of different magnitudes/frequencies shift under the SOWs we evaluated. For this particular case, it seems that high frequency, small shortages (left hand side) are becoming smaller and/or less frequent, whereas low frequency, large shortages (right hand side) are becoming larger and/or more frequent. Of course, the probabilistic inference here depends on the samples we chose, but it serves the exploratory purposes of this analysis.

References:

[1]: Bryant, Benjamin P., and Robert J. Lempert. “Thinking inside the Box: A Participatory, Computer-Assisted Approach to Scenario Discovery.” Technological Forecasting and Social Change 77, no. 1 (January 1, 2010): 34–49. https://doi.org/10.1016/j.techfore.2009.08.002.

[2]: Herman, Jonathan D., Patrick M. Reed, Harrison B. Zeff, and Gregory W. Characklis. “How Should Robustness Be Defined for Water Systems Planning under Change?” Journal of Water Resources Planning and Management 141, no. 10 (2015): 4015012. https://doi.org/10.1061/(ASCE)WR.1943-5452.0000509.

[3]: McKay, M. D., R. J. Beckman, and W. J. Conover. “A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code.” Technometrics 21, no. 2 (1979): 239–45. https://doi.org/10.2307/1268522.

Time series forecasting in Python for beginners

Time series forecasting in Python for beginners

This semester I am teaching Engineering Management Methods here at Cornell University. The course is aimed at introducing engineering students to systems thinking and a variety of tools and analyses they can use to analyze data. The first chapter has been on time series forecasting, where we discussed some of the simpler models one can use and apply for forecasting purposes, including Simple and Weighted Moving Average, Single and Double Exponential Smoothing, Additive and Multiplicative Seasonal Models, and Holt Winter’s Method.

The class applications as well as the homework are primarily performed in Excel, but I have been trying, with limited success, to encourage the use of programming languages for the assignments. One comment I’ve received by a student has been that it takes significantly more time to perform the calculations by coding; they feel that it’s a waste of time. I initially attributed the comment to the fact that the student was new to coding and it takes time in the beginning, but on later reflection I realized that, in fact, the student was probably simply manually repeating the same Excel operations by using code: take a set of 30 observations, create an array to store forecasts, loop through every value and calculate forecast using model formula, calculate error metrics, print results, repeat steps for next set of data. It occurred to me that of course they think it’s a waste of time, because doing it that way completely negates what programming is all about: designing and building an executable program or function to accomplish a specific computing task. In this instance, the task is to forecast using each of the models we learn in class and the advantage of coding comes with the development of some sort of program or function that performs these operations for us, given a set of data as input. Simply going through the steps of performing a set of calculations for a problem using code is not much different than doing so manually or in Excel. What is different (and beneficial) is designing a code so that it can then be effortlessly applied to all similar problems without having to re-perform all calculations. I realize this is obvious to the coding virtuosos frequenting this blog, but it’s not immediately obvious to the uninitiated who are rather confused on why Dr. Hadjimichael is asking them to waste so much time for a meager bonus on the homework.

So this blog post, is aimed at demonstrating to coding beginners how one can transition from one way of thinking to the other, and providing a small time-series-forecasting toolkit for users that simply want to apply the models to their data.

The code and data for this example can be found on my GitHub page and I will discuss it below. I will be using a wine sales dataset that lists Australian wine sales (in kiloliters) from January 1980 to October 1991. The data looks like this:

Date Sales
1/1/80 464
2/1/80 675
3/1/80 703
4/1/80 887
5/1/80 1139
6/1/80 1077
7/1/80 1318
8/1/80 1260
9/1/80 1120
10/1/80 963
11/1/80 996
12/1/80 960
view raw wine_sales.csv hosted with ❤ by GitHub

And this is what the time series looks like:

Figure_1-0

We first need to import the packages we’ll be using and load the data. I will be using Pandas in this example (but there’s other ways). I’m also defining the number of seasonal periods in a cycle, in this case 12.

import numpy as np #Package we'll use for numerical calculations
import matplotlib.pyplot as plt #From matplotlib package we import pyplot for plots
import pandas #Package to data manipulation
import scipy.optimize #Package we'll use to optimize
plt.style.use('seaborn-colorblind') #This is a pyplot style (optional)

'''Load the data into a pandas series with the name wine_sales'''
time_series = pandas.Series.from_csv("wine_sales.csv", header=0)

P=12 #number of seasonal periods in a cycle
In class, I’ve always mentioned that one should use a training and a validation set for model development, primarily to avoid overfitting our model to the specific training set. In this example, the functions are written as they apply to the training set. Should you choose to apply the functions listed here, you should apply the functions for the training set, extract forecasts and then use those to initialize your validation period. To divide the observations, you would do something like this:
training = time_series[0:108] # Up to December '88
validation = time_series[108:] # From January '89 until end

Now, if say, we wanted to apply the Naive model of the next steps forecast being equal to the current observation, i.e., \hat{y}_{t+1}=y_t, we’d do something like:

 
y_hat=pandas.Series().reindex_like(time_series) # Create an array to store forecasts
y_hat[0]= time_series[0] # Initialize forecasting array with first observation
''' Loop through every month using the model to forecast y_hat'''
for t in range(len(y_hat)-1): # Set a range for the index to loop through
    y_hat[t+1]= time_series[t] # Apply model to forecast time i+1

Now if we’d like to use this for any time series, so we don’t have to perform our calculations every time, we need to reformat this a bit so it’s a function:

 
def naive(time_series):
    y_hat=pandas.Series().reindex_like(time_series)
    y_hat[0]= time_series[0] # Initialize forecasting array with first observation
    ''' Loop through every month using the model to forecast y'''
    #This sets a range for the index to loop through
    for t in range(len(y_hat)-1): 
        y_hat[t+1]= time_series[t] # Apply model to forecast time i+1
    return y_hat

Now we can just call define this function at the top of our code and just call it with any time series as an input. The function as I’ve defined it returns a pandas.Series with all our forecasts. We can then do the same for all the other modeling methods (below). Some things to note:

  • The data we read in the top, outside the functions, as well as any parameters defined (P in this case) are global variables and do not need to be defined as an input to the function. The functions below only need a list of parameter values as inputs.
  • For the models with seasonality and/or trend we need to create separate series to store those estimates for E, S, and T.
  • Each model has its own initialization formulas and if we wanted to apply them to the validation set that follows our training set, we’d need to initialize with the last values of our training.
'''SIMPLE MOVING AVERAGE
Using this model, y_hat(t+1)=(y(t)+y(t-1)...+y(t-k+1))/k (i.e., the predicted 
next value is equal to the average of the last k observed values).'''
def SMA(params):
    k=int(np.array(params))
    y_hat=pandas.Series().reindex_like(time_series)
    y_hat[0:k]=time_series[0:k]
    ''' Loop through every month using the model to forecast y.
    Be careful with Python indexing!'''
    for t in range(k-1,len(y_hat)-1): #This sets a range for the index to loop through
        y_hat[t+1]= np.sum(time_series[t-k+1:t+1])/k # Apply model to forecast time i+1
    return y_hat

'''WEIGHTED MOVING AVERAGE
Using this model, y_hat(t+1)=w(1)*y(t)+w(2)*y(t-1)...+w(k)*y(t-k+1) (i.e., the 
predicted next value is equal to the weighted average of the last k observed 
values).'''
def WMA(params):
    weights = np.array(params)
    k=len(weights)
    y_hat=pandas.Series().reindex_like(time_series)
    y_hat[0:k]=time_series[0:k] # Initialize values
    ''' Loop through every month using the model to forecast y.
    Be careful with Python indexing!'''
    for t in range(k-1,len(y_hat)-1): #This sets a range for the index to loop through
        y_hat[t+1]= np.sum(time_series[t-k+1:t+1].multiply(weights)) # Apply model to forecast time i+1
    return y_hat
'''This model includes the constraint that all our weights should sum to one. 
To include this in our optimization later, we need to define it as a function of our 
weights.'''
def WMAcon(params):
    weights = np.array(params)
    return np.sum(weights)-1

'''SINGLE EXPONENTIAL SMOOTHING
Using this model, y_hat(t+1)=y_hat(t)+a*(y(t)-y_hat(t))(i.e., the 
predicted next value is equal to the weighted average of the last forecasted value and its
difference from the observed).'''
def SES(params):
    a = np.array(params)
    y_hat=pandas.Series().reindex_like(time_series)
    y_hat[0]=time_series[0] # Initialize values
    ''' Loop through every month using the model to forecast y.
    Be careful with Python indexing!'''
    for t in range(len(y_hat)-1): #This sets a range for the index to loop through
        y_hat[t+1]=  y_hat[t]+a*(time_series[t]-y_hat[t])# Apply model to forecast time i+1
    return y_hat

'''DOUBLE EXPONENTIAL SMOOTHING (Holts Method)
Using this model, y_hat(t+1)=E(t)+T(t) (i.e., the 
predicted next value is equal to the expected level of the time series plus the 
trend).'''
def DES(params):
    a,b = np.array(params)
    y_hat=pandas.Series().reindex_like(time_series)
    '''We need to create series to store our E and T values.'''
    E = pandas.Series().reindex_like(time_series)
    T = pandas.Series().reindex_like(time_series)
    y_hat[0]=E[0]=time_series[0] # Initialize values
    T[0]=0
    ''' Loop through every month using the model to forecast y.
    Be careful with Python indexing!'''
    for t in range(len(y_hat)-1): #This sets a range for the index to loop through
        E[t+1] = a*time_series[t]+(1-a)*(E[t]+T[t])
        T[t+1] = b*(E[t+1]-E[t])+(1-b)*T[t]
        y_hat[t+1] = E[t] + T[t] # Apply model to forecast time i+1
    return y_hat

'''ADDITIVE SEASONAL
Using this model, y_hat(t+1)=E(t)+S(t-p) (i.e., the 
predicted next value is equal to the expected level of the time series plus the 
appropriate seasonal factor). We first need to create an array to store our 
forecast values.'''
def ASM(params):
    a,b = np.array(params)
    p = P
    y_hat=pandas.Series().reindex_like(time_series)
    '''We need to create series to store our E and S values.'''
    E = pandas.Series().reindex_like(time_series)
    S = pandas.Series().reindex_like(time_series)
    y_hat[:p]=time_series[0] # Initialize values
    '''We need to initialize the first p number of E and S values'''
    E[:p] = np.sum(time_series[:p])/p
    S[:p] = time_series[:p]-E[:p]
    ''' Loop through every month using the model to forecast y.
    Be careful with Python indexing!'''
    for t in range(p-1, len(y_hat)-1): #This sets a range for the index to loop through
        E[t+1] = a*(time_series[t]-S[t+1-p])+(1-a)*E[t]
        S[t+1] = b*(time_series[t]-E[t])+(1-b)*S[t+1-p]
        y_hat[t+1] = E[t] + S[t+1-p] # Apply model to forecast time i+1
    return y_hat

'''MULTIPLICATIVE SEASONAL
Using this model, y_hat(t+1)=E(t)*S(t-p) (i.e., the 
predicted next value is equal to the expected level of the time series times 
the appropriate seasonal factor). We first need to create an array to store our 
forecast values.'''
def MSM(params):
    a,b = np.array(params)
    p = P
    y_hat=pandas.Series().reindex_like(time_series)
    '''We need to create series to store our E and S values.'''
    E = pandas.Series().reindex_like(time_series)
    S = pandas.Series().reindex_like(time_series)
    y_hat[:p]=time_series[0] # Initialize values
    '''We need to initialize the first p number of E and S values'''
    E[:p] = np.sum(time_series[:p])/p
    S[:p] = time_series[:p]/E[:p]
    ''' Loop through every month using the model to forecast y.
    Be careful with Python indexing!'''
    for t in range(p-1, len(y_hat)-1): #This sets a range for the index to loop through
        E[t+1] = a*(time_series[t]/S[t+1-p])+(1-a)*E[t]
        S[t+1] = b*(time_series[t]/E[t])+(1-b)*S[t+1-p]
        y_hat[t+1] = E[t]*S[t+1-p] # Apply model to forecast time i+1
    return y_hat

'''ADDITIVE HOLT-WINTERS METHOD
Using this model, y_hat(t+1)=(E(t)+T(t))*S(t-p) (i.e., the 
predicted next value is equal to the expected level of the time series plus the 
trend, times the appropriate seasonal factor). We first need to create an array 
to store our forecast values.'''
def AHW(params):
    a, b, g = np.array(params)
    p = P
    y_hat=pandas.Series().reindex_like(time_series)
    '''We need to create series to store our E and S values.'''
    E = pandas.Series().reindex_like(time_series)
    S = pandas.Series().reindex_like(time_series)
    T = pandas.Series().reindex_like(time_series)
    y_hat[:p]=time_series[0] # Initialize values
    '''We need to initialize the first p number of E and S values'''
    E[:p] = np.sum(time_series[:p])/p
    S[:p] = time_series[:p]-E[:p]
    T[:p] = 0
    ''' Loop through every month using the model to forecast y.
    Be careful with Python indexing!'''
    for t in range(p-1, len(y_hat)-1): #This sets a range for the index to loop through
        E[t+1] = a*(time_series[t]-S[t+1-p])+(1-a)*(E[t]+T[t])
        T[t+1] = b*(E[t+1]-E[t])+(1-b)*T[t]
        S[t+1] = g*(time_series[t]-E[t])+(1-g)*S[t+1-p]
        y_hat[t+1] = E[t]+T[t]+S[t+1-p] # Apply model to forecast time i+1
    return y_hat

'''MUTLIPLICATIVE HOLT-WINTERS METHOD
Using this model, y_hat(t+1)=(E(t)+T(t))*S(t-p) (i.e., the 
predicted next value is equal to the expected level of the time series plus the 
trend, times the appropriate seasonal factor). We first need to create an array 
to store our forecast values.'''
def MHW(params):
    a, b, g = np.array(params)
    p = P
    y_hat=pandas.Series().reindex_like(time_series)
    '''We need to create series to store our E and S values.'''
    E = pandas.Series().reindex_like(time_series)
    S = pandas.Series().reindex_like(time_series)
    T = pandas.Series().reindex_like(time_series)
    y_hat[:p]=time_series[0] # Initialize values
    '''We need to initialize the first p number of E and S values'''
    S[:p] = time_series[:p]/(np.sum(time_series[:p])/p)
    E[:p] = time_series[:p]/S[:p]
    T[:p] = 0
    ''' Loop through every month using the model to forecast y.
    Be careful with Python indexing!'''
    for t in range(p-1, len(y_hat)-1): #This sets a range for the index to loop through
        E[t+1] = a*(time_series[t]/S[t+1-p])+(1-a)*(E[t]+T[t])
        T[t+1] = b*(E[t+1]-E[t])+(1-b)*T[t]
        S[t+1] = g*(time_series[t]/E[t])+(1-g)*S[t+1-p]
        y_hat[t+1] = (E[t]+T[t])*S[t+1-p] # Apply model to forecast time i+1
    return y_hat

Having defined this, I can then, for example, call the Multiplicative Holt Winters method by simply typing:

MHW([0.5,0.5,0.5])

This will produce a forecast using the Multiplicative Holt Winters method with those default parameters, but we would like to calibrate them to get the “best” forecasts from our model. To do so, we need to define what we mean by “best”, and in this example I’m choosing to use Mean Square Error as my performance metric. I define it below as a function that receives the parameters and some additional arguments as inputs. I only need to set it up this way because my optimization function is trying to minimize the MSE function by use of those parameters. I’m using the “args” array to simply tell the function which model it’s using to forecast.

def MSE(params, args):
    model, = args
    t_error = np.zeros(len(time_series))
    forecast = model(params)
    for t in range(len(time_series)):
        t_error[t] = time_series[t]-forecast[t]
    MSE = np.mean(np.square(t_error))
    return MSE

To perform the optimization in Excel, we’d use Solver, but in Python we have other options. SciPy is a Python package that allows us, among many other things, to optimize such single-objective problems. What I’m doing here is that I define a list of all the models I want to optimize, their default parameters, and the parameters’ bounds. I then use a loop to go through my list of models and run the optimization. To store the minimized MSE values as well as the parameter values that produce them, we can create an array to store the MSEs and a list to store the parameter values for each model. The optimization function produces a “dictionary” item that contains the minimized MSE value (under ‘fun’), the parameters that produce it (under ‘x’) and other information.

''' List of all the models we will be optimizing'''
models = [SES, DES, ASM, MSM, AHW, MHW]
''' This is a list of all the default parameters for the models we will be 
optimizing. '''
                      #SES,  DES,     ASM
default_parameters = [[0.5],[0.5,0.5],[0.5,0.5],
                      #MSM,        AHW,            MHW
                      [0.5,0.5],[0.5,0.5,0.5],[0.5,0.5,0.5]]
''' This is a list of all the bounds for the default parameters we will be 
optimizing. All the a,b,g's are weights between 0 and 1. '''
bounds = [[(0,1)],[(0,1)]*2, [(0,1)]*2,
           [(0,1)]*2,[(0,1)]*3,[(0,1)]*3]
min_MSEs = np.zeros(len(models)) # Array to store minimized MSEs
opt_params = [None]*len(models) # Empty list to store optim. parameters
for i in range(len(models)):
    res = scipy.optimize.minimize(MSE, # Function we're minimizing (MSE in this case)
                                  default_parameters[i], # Default parameters to use
                                  # Additional arguments that the optimizer 
                                  # won't be changing (model in this case)
                                  args=[models[i]], 
                                  method='L-BFGS-B', # Optimization method to use
                                  bounds=bounds[i]) # Parameter bounds
    min_MSEs[i] = res['fun'] #Store minimized MSE value
    opt_params[i] = res['x'] #Store parameter values identified by optimizer

Note: For the WMA model, the weights should sum to 1 and this should be input to our optimization as a constraint. To do so, we need to define the constraint function as a dictionary and include the following in our minimization call: constraints=[{‘type’:’eq’,’fun’: WMAcon}]. The number of periods to consider cannot be optimized by this type of optimizer.

Finally, we’d like to present our results. I’ll do so by plotting the observations and all my models as well as their minimized MSE values:

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1) # Create figure
ax.set_title("Australian wine sales (kilolitres)") # Set figure title
l1 = ax.plot(time_series, color='black', linewidth=3.0, label='Observations') # Plot observations
for i in range(len(models)):
    ax.plot(time_series.index,models[i](opt_params[i]), label = models[i].__name__)
ax.legend() # Activate figure legend
plt.show()
print('The estimated MSEs for all the models are:')
for i in range(len(models)):
    print(models[i].__name__ +': '+str(min_MSEs[i]))

This snippet of code should produce this figure of all our forecasts, as well as a report of all MSEs:Figure_1

The estimated MSEs for all the models are:
SES: 133348.78
DES: 245436.67
ASM: 80684.00
MSM: 64084.48
AHW: 72422.34
MHW: 64031.19

The Multiplicative Holt Winters method appears to give the smallest MSE when applied to these data.

Fitting Hidden Markov Models Part II: Sample Python Script

This is the second part of a two-part blog series on fitting hidden Markov models (HMMs). In Part I, I explained what HMMs are, why we might want to use them to model hydro-climatological data, and the methods traditionally used to fit them. Here I will show how to apply these methods using the Python package hmmlearn using annual streamflows in the Colorado River basin at the Colorado/Utah state line (USGS gage 09163500). First, note that to use hmmlearn on a Windows machine, I had to install it on Cygwin as a Python 2.7 library.

For this example, we will assume the state each year is either wet or dry, and the distribution of annual streamflows under each state is modeled by a Gaussian distribution. More states can be considered, as well as other distributions, but we will use a two-state, Gaussian HMM here for simplicity. Since streamflow is strictly positive, it might make sense to first log-transform the annual flows at the state line so that the Gaussian models won’t generate negative streamflows, so that’s what we do here.

After installing hmmlearn, the first step is to load the Gaussian hidden Markov model class with from hmmlearn.hmm import GaussianHMM. The fit function of this class requires as inputs the number of states (n_components, here 2 for wet and dry), the number of iterations to run of the Baum-Welch algorithm described in Part I (n_iter; I chose 1000), and the time series to which the model is fit (here a column vector, Q, of the annual or log-transformed annual flows). You can also set initial parameter estimates before fitting the model and only state those which need to be initialized with the init_params argument. This is a string of characters where ‘s’ stands for startprob (the probability of being in each state at the start), ‘t’ for transmat (the probability transition matrix), ‘m’ for means (mean vector) and ‘c’ for covars (covariance matrix). As discussed in Part I it is good to test several different initial parameter estimates to prevent convergence to a local optimum. For simplicity, here I simply use default estimates, but this tutorial shows how to pass your own. I call the model I fit on line 5 model.

Among other attributes and methods, model will have associated with it the means (means_) and covariances (covars_) of the Gaussian distributions fit to each state, the state probability transition matrix (transmat_), the log-likelihood function of the model (score) and methods for simulating from the HMM (sample) and predicting the states of observed values with the Viterbi algorithm described in Part I (predict). The score attribute could be used to compare the performance of models fit with different initial parameter estimates.

It is important to note that which state (wet or dry) is assigned a 0 and which state is assigned a 1 is arbitrary and different assignments may be made with different runs of the algorithm. To avoid confusion, I choose to reorganize the vectors of means and variances and the transition probability matrix so that state 0 is always the dry state, and state 1 is always the wet state. This is done on lines 22-26 if the mean of state 0 is greater than the mean of state 1.


from hmmlearn.hmm import GaussianHMM

def fitHMM(Q, nSamples):
    # fit Gaussian HMM to Q
    model = GaussianHMM(n_components=2, n_iter=1000).fit(np.reshape(Q,[len(Q),1]))
    
    # classify each observation as state 0 or 1
    hidden_states = model.predict(np.reshape(Q,[len(Q),1]))

    # find parameters of Gaussian HMM
    mus = np.array(model.means_)
    sigmas = np.array(np.sqrt(np.array([np.diag(model.covars_[0]),np.diag(model.covars_[1])])))
    P = np.array(model.transmat_)

    # find log-likelihood of Gaussian HMM
    logProb = model.score(np.reshape(Q,[len(Q),1]))

    # generate nSamples from Gaussian HMM
    samples = model.sample(nSamples)

    # re-organize mus, sigmas and P so that first row is lower mean (if not already)
    if mus[0] > mus[1]:
        mus = np.flipud(mus)
        sigmas = np.flipud(sigmas)
        P = np.fliplr(np.flipud(P))
        hidden_states = 1 - hidden_states

    return hidden_states, mus, sigmas, P, logProb, samples

# load annual flow data for the Colorado River near the Colorado/Utah state line
AnnualQ = np.loadtxt('AnnualQ.txt')

# log transform the data and fit the HMM
logQ = np.log(AnnualQ)
hidden_states, mus, sigmas, P, logProb, samples = fitHMM(logQ, 100)

Okay great, we’ve fit an HMM! What does the model look like? Let’s plot the time series of hidden states. Since we made the lower mean always represented by state 0, we know that hidden_states == 0 corresponds to the dry state and hidden_states == 1 to the wet state.


from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np

def plotTimeSeries(Q, hidden_states, ylabel, filename):

    sns.set()
    fig = plt.figure()
    ax = fig.add_subplot(111)

    xs = np.arange(len(Q))+1909
    masks = hidden_states == 0
    ax.scatter(xs[masks], Q[masks], c='r', label='Dry State')
    masks = hidden_states == 1
    ax.scatter(xs[masks], Q[masks], c='b', label='Wet State')
    ax.plot(xs, Q, c='k')
    
    ax.set_xlabel('Year')
    ax.set_ylabel(ylabel)
    fig.subplots_adjust(bottom=0.2)
    handles, labels = plt.gca().get_legend_handles_labels()
    fig.legend(handles, labels, loc='lower center', ncol=2, frameon=True)
    fig.savefig(filename)
    fig.clf()

    return None

plt.switch_backend('agg') # turn off display when running with Cygwin
plotTimeSeries(logQ, hidden_states, 'log(Flow at State Line)', 'StateTseries_Log.png')

Wow, looks like there’s some persistence! What are the transition probabilities?


print(model.transmat_)

Running that we get the following:

[[ 0.6794469   0.3205531 ]
[ 0.34904974  0.65095026]]

When in a dry state, there is a 68% chance of transitioning to a dry state again in the next year, while in a wet state there is a 65% chance of transitioning to a wet state again in the next year.

What does the distribution of flows look like in the wet and dry states, and how do these compare with the overall distribution? Since the probability distribution of the wet and dry states are Gaussian in log-space, and each state has some probability of being observed, the overall probability distribution is a mixed, or weighted, Gaussian distribution in which the weight of each of the two Gaussian models is the unconditional probability of being in their respective state. These probabilities make up the stationary distribution, π, which is the vector solving the equation π = πP, where P is the probability transition matrix. As briefly mentioned in Part I, this can be found using the method described here: π = (1/ Σi[ei])e in which e is the eigenvector of PT corresponding to an eigenvalue of 1, and ei is the ith element of e. The overall distribution for our observations is then Y ~ π0N(μ0,σ02) + π1*N(μ1,σ12). We plot this distribution and the component distributions on top of a histogram of the log-space annual flows below.


from scipy import stats as ss

def plotDistribution(Q, mus, sigmas, P, filename):

    # calculate stationary distribution
    eigenvals, eigenvecs = np.linalg.eig(np.transpose(P))
    one_eigval = np.argmin(np.abs(eigenvals-1))
    pi = eigenvecs[:,one_eigval] / np.sum(eigenvecs[:,one_eigval])

    x_0 = np.linspace(mus[0]-4*sigmas[0], mus[0]+4*sigmas[0], 10000)
    fx_0 = pi[0]*ss.norm.pdf(x_0,mus[0],sigmas[0])

    x_1 = np.linspace(mus[1]-4*sigmas[1], mus[1]+4*sigmas[1], 10000)
    fx_1 = pi[1]*ss.norm.pdf(x_1,mus[1],sigmas[1])

    x = np.linspace(mus[0]-4*sigmas[0], mus[1]+4*sigmas[1], 10000)
    fx = pi[0]*ss.norm.pdf(x,mus[0],sigmas[0]) + \
        pi[1]*ss.norm.pdf(x,mus[1],sigmas[1])

    sns.set()
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.hist(Q, color='k', alpha=0.5, density=True)
    l1, = ax.plot(x_0, fx_0, c='r', linewidth=2, label='Dry State Distn')
    l2, = ax.plot(x_1, fx_1, c='b', linewidth=2, label='Wet State Distn')
    l3, = ax.plot(x, fx, c='k', linewidth=2, label='Combined State Distn')

    fig.subplots_adjust(bottom=0.15)
    handles, labels = plt.gca().get_legend_handles_labels()
    fig.legend(handles, labels, loc='lower center', ncol=3, frameon=True)
    fig.savefig(filename)
    fig.clf()

    return None

plotDistribution(logQ, mus, sigmas, P, 'MixedGaussianFit_Log.png')

Looks like a pretty good fit – seems like a Gaussian HMM is a decent model of log-transformed annual flows in the Colorado River at the Colorado/Utah state line. Hopefully you can find relevant applications for your work too. If so, I’d recommend reading through this hmmlearn tutorial, from which I learned how to do everything I’ve shown here.

Fitting Hidden Markov Models Part I: Background and Methods

Hydro-climatological variables often exhibit long-term persistence caused by regime-shifting behavior in the climate, such as the El Niño-Southern Oscillations (ENSO). One popular way of modeling this long-term persistence is with hidden Markov models (HMMs) [Thyer and Kuczera, 2000; Akintug and Rasmussen, 2005; Bracken et al., 2014]. What is an HMM? Recall from my five blog posts on weather generators, that the occurrence of precipitation is often modeled by a (first order) Markov model in which the probability of rain on a given day depends only on whether or not it rained on the previous day. A (first order) hidden Markov model is similar in that the climate “state” (e.g., wet or dry) at a particular time step depends only on the state from the previous time step, but the state in this case is “hidden,” i.e. not observable. Instead, we only observe a random variable (discrete or continuous) that was generated under a particular state, but we don’t know what that state was.

For example, imagine you are a doctor trying to diagnose when an individual has the flu. On any given day, this person is in one of two states: sick or healthy. These states are likely to exhibit great persistence; when the person gets the flu, he/she will likely have it for several days or weeks, and when he/she is heathy, he/she will likely stay healthy for months. However, suppose you don’t have the ability to test the individual for the flu virus and can only observe his/her temperature. Different (overlapping) distributions of body temperatures may be observed depending on whether this person is sick or healthy, but the state itself is not observed. In this case, the person’s temperature can be modeled by an HMM.

So why are HMMs useful for describing hydro-climatological variables? Let’s go back to the example of ENSO. Maybe El Niño years in a particular basin tend to be wetter than La Niña years. Normally we can observe whether or not it is an El Niño year based on SST anomalies in the tropical Pacific, but suppose we only have paleodata of tree ring widths. We can infer from the tree ring data (with some error) what the total precipitation might have been in each year of the tree’s life, but we may not know what the SST anomalies were those years. Or even if we do know the SST anomalies, maybe there is another more predictive regime-shifting teleconnection we haven’t yet discovered. In either case, we can model the total annual precipitation with an HMM.

What is the benefit of modeling precipitation in these cases with an HMM as opposed to say, an autoregressive model? Well often the year to year correlation of annual precipitation may not actually be that high, but several consecutive wet or consecutive dry years are observed [Bracken et al., 2014]. Furthermore, paleodata suggests that greater persistence (e.g. megadroughts) in precipitation is often observed than would be predicted by autoregressive models [Ault et al., 2013; Ault et al., 2014]. This is where HMMs may come in handy.

Here I will explain how to fit HMMs generally, and in Part II I will show how to apply these methods using the Python package hmmlearn. To understand how to fit HMMs, we first need to define some notation. Let Yt be the observed variable at time t (e.g., annual streamflow). The distribution of Yt depends on the state at time t, Xt (e.g., wet or dry). Let’s assume for simplicity that our observations can be modeled by Gaussian distributions. Then f(Yt | Xt = i) ~ N(μi,σi 2) and f(Yt | Xt = j) ~ N(μj,σj 2) for a two-state HMM. The state at time t, Xt, depends on the state at the previous time step, Xt-1. Let P be the state transition matrix, where each element pi,j represents the probability of transitioning from state i at time t to state j at time t+1, i.e. pij = P(Xt+1 = j | Xt = i). P is a n x n matrix where n is the number of states (e.g. 2 for wet and dry). In all Markov models (hidden or not), the unconditional probability of being in each state, π can be modeled by the equation π = πP, where π is a 1 x n vector in which each element πi represents the unconditional probability of being in state i, i.e. πi = P(Xt = i). π is also called the stationary distribution and can be calculated from P as described here. Since we have no prior information on which to condition the first set of observations, we assume the initial probability of being in each state is the stationary distribution.

In fitting a two-state Gaussian HMM, we therefore need to estimate the following vector of parameters: θ = [μ0, σ0, μ1, σ1, p00, p11]. Note p01 = 1 – p00 and p10 = 1 – p11. The most common approach to estimating these parameters is through the Baum-Welch algorithm, an application of Expectation-Maximization built off of the forward-backward algorithm. The first step of this process is to set initial estimates for each of the parameters. These estimates can be random or based on an informed prior. We then begin with the forward step, which computes the joint probability of observing the first t observations and ending up in state i at time t, given the initial parameter estimates: P(Xt = i, Y1 = y1, Y2 = y2, …, Yt = yt | θ). This is computed for all t ϵ {1, …, T}. Then in the backward step, the conditional probability of observing the remaining observations after time t given the state observed at time t is computed: P(Yt+1 = yt+1, …, YT = yT | Xt=i, θ). Using Bayes’ theorem, it can shown that the product of the forward and backward probabilities is proportional to the probability of ending up in state i at time t given all of the observations, i.e. P(Xt = i | Y1 = y1,…, YT = yT, θ). This is derived below:

1) P(X_t=i \vert Y_1=y_1,..., Y_T=y_T, \theta) = \frac{P(Y_1=y_1, ..., Y_T=y_T \vert X_t=i, \theta) P(X_t=i \vert \theta)}{P(Y_1=y_1, ..., Y_T=y_t \vert \theta)}

2) P(X_t=i \vert Y_1=y_1,..., Y_T=y_T, \theta) = \frac{P(Y_1=y_1, ..., Y_t=y_t \vert X_t=i, \theta) P(Y_{t+1} = y_{t+1}, ..., Y_T=y_T \vert X_t=i, \theta) P(X_t=i \vert \theta)}{P(Y_1=y_1, ..., Y_T=y_t \vert \theta)}

3) P(X_t=i \vert Y_1=y_1,..., Y_T=y_T, \theta) = \frac{P(X_t=i, Y_1=y_1, ..., Y_t=y_t \vert \theta) P(Y_{t+1} = y_{t+1}, ..., Y_T=y_T \vert X_t=i, \theta)}{P(Y_1=y_1, ..., Y_T=y_t \vert \theta)}

4) P(X_t=i \vert Y_1=y_1,..., Y_T=y_T, \theta) \propto P(X_t=i, Y_1=y_1, ..., Y_t=y_t \vert \theta) P(Y_{t+1}=y_{t+1}, ..., Y_T=y_T \vert X_t=i, \theta)

The first equation is Bayes’ Theorem. The second equation is derived by the conditional independence of the observations up to time t (Y1, Y2, …, Yt) and the observations after time t (Yt+1, Yt+2, …, YT), given the state at time t (Xt). The third equation is derived from the definition of conditional probability, and the fourth recognizes the denominator as a normalizing constant.

Why do we care about the probability of ending up in state i at time t given all of the observations (the left hand side of the above equations)? In fitting a HMM, our goal is to find a set of parameters, θ, that maximize this probability, i.e. the likelihood function of the state trajectories given our observations. This is therefore equivalent to maximizing the product of the forward and backward probabilities. We can maximize this product using Expectation-Maximization. Expectation-Maximization is a two-step process for maximum likelihood estimation when the likelihood function cannot be computed directly, for example, because its observations are hidden as in an HMM. The first step is to calculate the expected value of the log likelihood function with respect to the conditional distribution of X given Y and θ (the left hand side of the above equations, or proportionally, the right hand side of equation 4). The second step is to find the parameters that maximize this function. These parameter estimates are then used to re-implement the forward-backward algorithm and the process repeats iteratively until convergence or some specified number of iterations. It is important to note that the maximization step is a local optimization around the current best estimate of θ. Hence, the Baum-Welch algorithm should be run multiple times with different initial parameter estimates to increase the chances of finding the global optimum.

Another interesting question beyond fitting HMMs to observations is diagnosing which states the observations were likely to have come from given the estimated parameters. This is often performed using the Viterbi algorithm, which employs dynamic programming (DP) to find the most likely state trajectory. In this case, the “decision variables” of the DP problem are the states at each time step, Xt, and the “future value function” being optimized is the probability of observing the true trajectory, (Y1, …,YT), given those alternative possible state trajectories. For example, let the probability that the first state was k be V1,k. Then V1,k = P(X1 = k) = P(Y1 = y1 | X1 = k)πk. For future time steps, Vt,k = P(Yt = yt | Xt = k)pik*Vt-1,i where i is the state in the previous time step. Thus, the Viterbi algorithm finds the state trajectory (X1, …, XT) maximizing VT,k.

Now that you know how HMMs are fit using the Baum-Welch algorithm and decoded using the Viterbi algorithm, read Part II to see how to perform these steps in practice in Python!