# Fitting Hidden Markov Models Part I: Background and Methods

Hydro-climatological variables often exhibit long-term persistence caused by regime-shifting behavior in the climate, such as the El Niño-Southern Oscillations (ENSO). One popular way of modeling this long-term persistence is with hidden Markov models (HMMs) [Thyer and Kuczera, 2000; Akintug and Rasmussen, 2005; Bracken et al., 2014]. What is an HMM? Recall from my five blog posts on weather generators, that the occurrence of precipitation is often modeled by a (first order) Markov model in which the probability of rain on a given day depends only on whether or not it rained on the previous day. A (first order) hidden Markov model is similar in that the climate “state” (e.g., wet or dry) at a particular time step depends only on the state from the previous time step, but the state in this case is “hidden,” i.e. not observable. Instead, we only observe a random variable (discrete or continuous) that was generated under a particular state, but we don’t know what that state was.

For example, imagine you are a doctor trying to diagnose when an individual has the flu. On any given day, this person is in one of two states: sick or healthy. These states are likely to exhibit great persistence; when the person gets the flu, he/she will likely have it for several days or weeks, and when he/she is heathy, he/she will likely stay healthy for months. However, suppose you don’t have the ability to test the individual for the flu virus and can only observe his/her temperature. Different (overlapping) distributions of body temperatures may be observed depending on whether this person is sick or healthy, but the state itself is not observed. In this case, the person’s temperature can be modeled by an HMM.

So why are HMMs useful for describing hydro-climatological variables? Let’s go back to the example of ENSO. Maybe El Niño years in a particular basin tend to be wetter than La Niña years. Normally we can observe whether or not it is an El Niño year based on SST anomalies in the tropical Pacific, but suppose we only have paleodata of tree ring widths. We can infer from the tree ring data (with some error) what the total precipitation might have been in each year of the tree’s life, but we may not know what the SST anomalies were those years. Or even if we do know the SST anomalies, maybe there is another more predictive regime-shifting teleconnection we haven’t yet discovered. In either case, we can model the total annual precipitation with an HMM.

What is the benefit of modeling precipitation in these cases with an HMM as opposed to say, an autoregressive model? Well often the year to year correlation of annual precipitation may not actually be that high, but several consecutive wet or consecutive dry years are observed [Bracken et al., 2014]. Furthermore, paleodata suggests that greater persistence (e.g. megadroughts) in precipitation is often observed than would be predicted by autoregressive models [Ault et al., 2013; Ault et al., 2014]. This is where HMMs may come in handy.

Here I will explain how to fit HMMs generally, and in Part II I will show how to apply these methods using the Python package hmmlearn. To understand how to fit HMMs, we first need to define some notation. Let Yt be the observed variable at time t (e.g., annual streamflow). The distribution of Yt depends on the state at time t, Xt (e.g., wet or dry). Let’s assume for simplicity that our observations can be modeled by Gaussian distributions. Then f(Yt | Xt = i) ~ N(μi,σi 2) and f(Yt | Xt = j) ~ N(μj,σj 2) for a two-state HMM. The state at time t, Xt, depends on the state at the previous time step, Xt-1. Let P be the state transition matrix, where each element pi,j represents the probability of transitioning from state i at time t to state j at time t+1, i.e. pij = P(Xt+1 = j | Xt = i). P is a n x n matrix where n is the number of states (e.g. 2 for wet and dry). In all Markov models (hidden or not), the unconditional probability of being in each state, π can be modeled by the equation π = πP, where π is a 1 x n vector in which each element πi represents the unconditional probability of being in state i, i.e. πi = P(Xt = i). π is also called the stationary distribution and can be calculated from P as described here. Since we have no prior information on which to condition the first set of observations, we assume the initial probability of being in each state is the stationary distribution.

In fitting a two-state Gaussian HMM, we therefore need to estimate the following vector of parameters: θ = [μ0, σ0, μ1, σ1, p00, p11]. Note p01 = 1 – p00 and p10 = 1 – p11. The most common approach to estimating these parameters is through the Baum-Welch algorithm, an application of Expectation-Maximization built off of the forward-backward algorithm. The first step of this process is to set initial estimates for each of the parameters. These estimates can be random or based on an informed prior. We then begin with the forward step, which computes the probability of ending up in state i at time t given the first t observations and the initial parameter estimates: P(Xt = i |Y1 = y1, Y2 = y2, …, Yt = yt, θ). This is computed for all t ϵ {1, …, T}. Then in the backward step, the probability of observing the remaining observations after time t is computed: P(Yt+1 = yt+1, …, YT = yT | Xt, θ). From Bayes’ theorem, the probability estimates from the forward and backward steps can be combined to estimate the probability of ending up in state i at time t given all of the observations:

1) $P(X_t \vert Y_1=y_1,..., Y_T=y_T, \theta) = \frac{P(Y_1=y_1, ..., Y_T = y_t \vert X_t, \theta) P(X_t \vert \theta)}{P(Y_1=y_1, ..., Y_T=y_T \vert \theta)}$

Since Xt does not depend on future observations, P(Xt | θ) = P(Xt | Y1 = y1, …, YT = yT, θ), making the numerator of the right hand side the product of our probability estimates from the forward and backward steps.

Why do we care about the probability of ending up in state i at time t given all of the observations (the left hand side of this equation)? In fitting a HMM, our goal is to find a set of parameters, θ, that maximize this probability, i.e. the likelihood function of the state trajectories given our observations. Since the denominator on the right hand side is just a normalizing constant, our goal is therefore to maximize the numerator, or the probability estimates from the forward-backward algorithm. We can maximize this product using Expectation-Maximization.

Expectation-Maximization is a two-step process for maximum likelihood estimation when the likelihood function cannot be computed directly, for example, because its observations are hidden as in an HMM. The first step is to calculate the expected value of the log likelihood function with respect to the conditional distribution of X given Y and θ (the left hand side of equation 1, or proportionally, the numerator of the right hand side). The second step is to find the parameters that maximize this function. These parameter estimates are then used to re-implement the forward-backward algorithm and the process repeats iteratively until convergence or some specified number of iterations. It is important to note that the maximization step is a local optimization around the current best estimate of θ. Hence, the Baum-Welch algorithm should be run multiple times with different initial parameter estimates to increase the chances of finding the global optimum.

Another interesting question beyond fitting HMMs to observations is diagnosing which states the observations were likely to have come from given the estimated parameters. This is often performed using the Viterbi algorithm, which employs dynamic programming (DP) to find the most likely state trajectory. In this case, the “decision variables” of the DP problem are the states at each time step, Xt, and the “future value function” being optimized is the probability of observing the true trajectory, (Y1, …,YT), given those alternative possible state trajectories. For example, let the probability that the first state was k be V1,k. Then V1,k = P(X1 = k) = P(Y1 = y1 | X1 = k)πk. For future time steps, Vt,k = P(Yt = yt | Xt = k)pik*Vt-1,i where i is the state in the previous time step. Thus, the Viterbi algorithm finds the state trajectory (X1, …, XT) maximizing VT,k.

Now that you know how HMMs are fit using the Baum-Welch algorithm and decoded using the Viterbi algorithm, read Part II to see how to perform these steps in practice in Python!