This is the second part of a two-part blog series on fitting hidden Markov models (HMMs). In Part I, I explained what HMMs are, why we might want to use them to model hydro-climatological data, and the methods traditionally used to fit them. Here I will show how to apply these methods using the Python package hmmlearn using annual streamflows in the Colorado River basin at the Colorado/Utah state line (USGS gage 09163500). First, note that to use hmmlearn on a Windows machine, I had to install it on Cygwin as a Python 2.7 library.

For this example, we will assume the state each year is either wet or dry, and the distribution of annual streamflows under each state is modeled by a Gaussian distribution. More states can be considered, as well as other distributions, but we will use a two-state, Gaussian HMM here for simplicity. Since streamflow is strictly positive, it might make sense to first log-transform the annual flows at the state line so that the Gaussian models won’t generate negative streamflows, so that’s what we do here.

After installing hmmlearn, the first step is to load the Gaussian hidden Markov model class with `from hmmlearn.hmm import GaussianHMM`

. The `fit`

function of this class requires as inputs the number of states (*n_components*, here 2 for wet and dry), the number of iterations to run of the Baum-Welch algorithm described in Part I (*n_iter*; I chose 1000), and the time series to which the model is fit (here a column vector, Q, of the annual or log-transformed annual flows). You can also set initial parameter estimates before fitting the model and only state those which need to be initialized with the *init_params* argument. This is a string of characters where ‘s’ stands for startprob (the probability of being in each state at the start), ‘t’ for transmat (the probability transition matrix), ‘m’ for means (mean vector) and ‘c’ for covars (covariance matrix). As discussed in Part I it is good to test several different initial parameter estimates to prevent convergence to a local optimum. For simplicity, here I simply use default estimates, but this tutorial shows how to pass your own. I call the model I fit on line 5 `model`

.

Among other attributes and methods, `model`

will have associated with it the means (`means_`

) and covariances (`covars_`

) of the Gaussian distributions fit to each state, the state probability transition matrix (transmat_), the log-likelihood function of the model (`score`

) and methods for simulating from the HMM (`sample`

) and predicting the states of observed values with the Viterbi algorithm described in Part I (`predict`

). The `score`

attribute could be used to compare the performance of models fit with different initial parameter estimates.

It is important to note that which state (wet or dry) is assigned a 0 and which state is assigned a 1 is arbitrary and different assignments may be made with different runs of the algorithm. To avoid confusion, I choose to reorganize the vectors of means and variances and the transition probability matrix so that state 0 is always the dry state, and state 1 is always the wet state. This is done on lines 22-26 if the mean of state 0 is greater than the mean of state 1.

from hmmlearn.hmm import GaussianHMM def fitHMM(Q, nSamples): # fit Gaussian HMM to Q model = GaussianHMM(n_components=2, n_iter=1000).fit(np.reshape(Q,[len(Q),1])) # classify each observation as state 0 or 1 hidden_states = model.predict(np.reshape(Q,[len(Q),1])) # find parameters of Gaussian HMM mus = np.array(model.means_) sigmas = np.array(np.sqrt(np.array([np.diag(model.covars_[0]),np.diag(model.covars_[1])]))) P = np.array(model.transmat_) # find log-likelihood of Gaussian HMM logProb = model.score(np.reshape(Q,[len(Q),1])) # generate nSamples from Gaussian HMM samples = model.sample(nSamples) # re-organize mus, sigmas and P so that first row is lower mean (if not already) if mus[0] > mus[1]: mus = np.flipud(mus) sigmas = np.flipud(sigmas) P = np.fliplr(np.flipud(P)) hidden_states = 1 - hidden_states return hidden_states, mus, sigmas, P, logProb, samples # load annual flow data for the Colorado River near the Colorado/Utah state line AnnualQ = np.loadtxt('AnnualQ.txt') # log transform the data and fit the HMM logQ = np.log(AnnualQ) hidden_states, mus, sigmas, P, logProb, samples = fitHMM(logQ, 100)

Okay great, we’ve fit an HMM! What does the model look like? Let’s plot the time series of hidden states. Since we made the lower mean always represented by state 0, we know that hidden_states == 0 corresponds to the dry state and hidden_states == 1 to the wet state.

from matplotlib import pyplot as plt import seaborn as sns import numpy as np def plotTimeSeries(Q, hidden_states, ylabel, filename): sns.set() fig = plt.figure() ax = fig.add_subplot(111) xs = np.arange(len(Q))+1909 masks = hidden_states == 0 ax.scatter(xs[masks], Q[masks], c='r', label='Dry State') masks = hidden_states == 1 ax.scatter(xs[masks], Q[masks], c='b', label='Wet State') ax.plot(xs, Q, c='k') ax.set_xlabel('Year') ax.set_ylabel(ylabel) fig.subplots_adjust(bottom=0.2) handles, labels = plt.gca().get_legend_handles_labels() fig.legend(handles, labels, loc='lower center', ncol=2, frameon=True) fig.savefig(filename) fig.clf() return None plt.switch_backend('agg') # turn off display when running with Cygwin plotTimeSeries(logQ, hidden_states, 'log(Flow at State Line)', 'StateTseries_Log.png')

Wow, looks like there’s some persistence! What are the transition probabilities?

print(model.transmat_)

Running that we get the following:

[[ 0.6794469 0.3205531 ]

[ 0.34904974 0.65095026]]

When in a dry state, there is a 68% chance of transitioning to a dry state again in the next year, while in a wet state there is a 65% chance of transitioning to a wet state again in the next year.

What does the distribution of flows look like in the wet and dry states, and how do these compare with the overall distribution? Since the probability distribution of the wet and dry states are Gaussian in log-space, and each state has some probability of being observed, the overall probability distribution is a mixed, or weighted, Gaussian distribution in which the weight of each of the two Gaussian models is the unconditional probability of being in their respective state. These probabilities make up the stationary distribution, π, which is the vector solving the equation π = π**P**, where **P** is the probability transition matrix. As briefly mentioned in Part I, this can be found using the method described here: π = (1/ Σ* _{i}*[

*e*])

_{i}*N(*

**e**in which**e**is the eigenvector of**P**corresponding to an eigenvalue of 1, and^{T}*e*is the_{i}*i*element of^{th}**e**. The overall distribution for our observations is then Y ~ π_{0}*μ*,

_{0}*σ*) + π

_{0}^{2}_{1}*N(

*μ*,

_{1}*σ*). We plot this distribution and the component distributions on top of a histogram of the log-space annual flows below.

_{1}^{2}from scipy import stats as ss def plotDistribution(Q, mus, sigmas, P, filename): # calculate stationary distribution eigenvals, eigenvecs = np.linalg.eig(np.transpose(P)) one_eigval = np.argmin(np.abs(eigenvals-1)) pi = eigenvecs[:,one_eigval] / np.sum(eigenvecs[:,one_eigval]) x_0 = np.linspace(mus[0]-4*sigmas[0], mus[0]+4*sigmas[0], 10000) fx_0 = pi[0]*ss.norm.pdf(x_0,mus[0],sigmas[0]) x_1 = np.linspace(mus[1]-4*sigmas[1], mus[1]+4*sigmas[1], 10000) fx_1 = pi[1]*ss.norm.pdf(x_1,mus[1],sigmas[1]) x = np.linspace(mus[0]-4*sigmas[0], mus[1]+4*sigmas[1], 10000) fx = pi[0]*ss.norm.pdf(x,mus[0],sigmas[0]) + \ pi[1]*ss.norm.pdf(x,mus[1],sigmas[1]) sns.set() fig = plt.figure() ax = fig.add_subplot(111) ax.hist(Q, color='k', alpha=0.5, density=True) l1, = ax.plot(x_0, fx_0, c='r', linewidth=2, label='Dry State Distn') l2, = ax.plot(x_1, fx_1, c='b', linewidth=2, label='Wet State Distn') l3, = ax.plot(x, fx, c='k', linewidth=2, label='Combined State Distn') fig.subplots_adjust(bottom=0.15) handles, labels = plt.gca().get_legend_handles_labels() fig.legend(handles, labels, loc='lower center', ncol=3, frameon=True) fig.savefig(filename) fig.clf() return None plotDistribution(logQ, mus, sigmas, P, 'MixedGaussianFit_Log.png')

Looks like a pretty good fit – seems like a Gaussian HMM is a decent model of log-transformed annual flows in the Colorado River at the Colorado/Utah state line. Hopefully you can find relevant applications for your work too. If so, I’d recommend reading through this hmmlearn tutorial, from which I learned how to do everything I’ve shown here.

Can I get a copy of the dataset, ‘AnnualQ.txt’?

It’s just a text file with the annual flow (in acre-ft) at USGS gauge 09163500 from WY 1909 – WY 2013. Each year is a different row of the file.

How many columns? Because i’m getting a reshape error

One column for the one site.

Hi,

Great post ! Very neatly explained.

I am working on a dataset , where there are around 6-7 hidden states. I want to map them in the transition matrix. is it possible using hmm learn ?

I also don’t get how you mapped state 0 to be dry and state 1 to be wet?

I mean i don’t get this

***This is done on lines 22-26 if the mean of state 0 is greater than the mean of state 1 ***

Yep, you can do that with hmmlearn. Just change “n_components=2” to 6 on line 5 for a 6-state HMM. If you want to find the best fit across different numbers of states, you can use the log-likelihood to inform that (logProb on line 16). Since fitting more states requires estimating more parameters and could lead to overfitting that will increase logProb, you should penalize the addition of those parameters, e.g. by choosing the model with the lowest AIC (=2k-2lnL, where k is the number of parameters and lnL is the log-likelihood) or BIC (=ln(n)k – 2lnL, where n is the number of observations). The number of parameters is the number of means (# states) + standard deviations (# states) + transition probabilities (# states * (# states – 1)).

On line 22 I check if mu[0] > mu[1], i.e. if the mean of state 0 is greater than the mean of state 1. If that is true, I flip their order on lines 23-26 so 0 corresponds to the state with the lower mean. If you have more than 2 states and you’d like them to be in order of lowest mean to highest, you should sort rather than flip them.

Hi ,

Thanks for your reply.

Assume in the same example of yours, i have 3 states i.e wet, mixed, dry.

Now i get the transition matrix as

*** ignore the values, the probabities may not sum upto 1 ***

[[ 0.6 0.2 0.2 ]

[ 0.34. 0.34 0.32]

[ 0.06 0.34 0.6]]

So my question is how do you identify the first column as wet, second one as mixed and the third as dry. it can also be be dry, wet, mixed or any other order?. How do we differentiate ?

Are we mapping them based on the mean values ? Then how can we differentiate if mean values are same for a couple of states.

Ultimately, i need the mapping between columns of tranmat_ and hidden state labels.

The question may look dumb. excuse me.

That’s what I’m determining on lines 22-26. I use the mus vector to identify which state is wet vs. dry based on which has a higher mean. If you have a 3-state HMM and the vector mus is [3, 6, 4], for example, then state 0 corresponds to the driest state (lowest mean), state 1 corresponds to the wettest state (highest mean), and state 2 corresponds to the mixed state. So similarly your rows would be starting at time t in the driest, wettest and mixed states, in that order. And your columns would be ending at time t+1 in the driest, wettest and mixed states, in that order. That’s somewhat confusing since it doesn’t go from dry to wet, so you could re-order the state identification as I do in the loop on lines 22-26. You would instead want something like this for multiple states:

index_order = np.argsort(np.transpose(mus))

mus = mus[index_order]

sigmas = sigmas[index_order]

newP = np.zeros(np.shape(P))

state_indices = []

for i in range(len(mus)):

newP[i,:] = P[index_order[0][i],index_order]

state_indices.append(np.where(hidden_states==index_order[0][i]))

P = newP

for i in range(len(state_indices)):

hidden_states[state_indices[i]] = i

Hi Julie

First of thanks for your great post. But, I still I don’t know the job of the lines 22-26. Would you mind please explain it clearly. And if it is possible please tell me the codes that you provide in this comment, can use for more than 3 states? And I can use it or I have to modify it? Because I don’t understand it honestly.

index_order = np.argsort(np.transpose(mus))

mus = mus[index_order]

sigmas = sigmas[index_order]

newP = np.zeros(np.shape(P))

state_indices = []

for i in range(len(mus)):

newP[i,:] = P[index_order[0][i],index_order]

state_indices.append(np.where(hidden_states==index_order[0][i]))

P = newP

for i in range(len(state_indices)):

hidden_states[state_indices[i]] = i

Hi Mohammad,

What those lines are doing is defining the states so that they are in order of driest to wettest. The label of 0/1 to each state is arbitrary. For consistency, I wanted 0 to always be dry and 1 to always be wet. So if the automatic labeling was the opposite (condition on line 22), I swap the parameters so that will be true. The code that you copied for more states sorts them from driest to wettest. So if you had three states, state 0 will have the lowest mean, state 1 the middle mean, and state 2 the highest mean. Hope that clears it up.

Julie

Hi Julie,

Thanks for your post. I have 2 questions here and I appreciate it if you answer me.

1- I don’t get why you use “logQ = np.log(AnnualQ)”. I mean, why you didn’t use AnnualQ data directly.

2- I applied GaussianHMM for my data and fitted data with 3 states. The results of predicting the model are not correct and I am not sure what the problem is. I received pi and covariance as below:

pi = [1.0000000e+00, 8.85441247e-83, 3.31092446e-46]

covs = [[[4.22687301]]

[[72.41435194]]

[[72.49319852]]]

do you think the problem is related to these parameters? how can I fix this problem?

Hi Bahare. I used the log of annual flows because the flows have a lower bound of zero, but fitting a mixed Gaussian distribution could produce negative values. This probability might be very low, and negatives could be replaced with 0s if that were to happen, so it could work with the annual flows themselves. But the fit was also better when using the log of annual flows. You can experiment with that.

It looks like your model essentially fit one state since the probability of being in state 1 is 1. In this case, it sounds like 3 states is not appropriate. For the state identification, was every observation identified as state 1? How did you decide on a 3-state model?

Thanks a lot for your reply.

Yes, for the state identification, every observation was identified as state 1 and this is exactly my issue in the model.

I am using aviation trajectory data and trying to apply a gaussian hidden Markov model to identify flight modes(ascend, hold, descend). Therefore, I selected a 3-state model.

For finding the modes, I computed dy/dt for each time in a trajectory. What is your suggestion to find the appropriate number of states? I appreciate your time.

See my reply to Guruprasad above about using the AIC or BIC to determine the best number of states.

I just re-read what you are using this for. Might it be better to just try to identify a change point between ascend and hold, and again between hold and descend? You also might want to smooth your data first.

Hi,

I am trying to implement Hidden Markov model using GaussianHMM library. I am wondering how can I set my prior parameters such as startprob_prior, transmat_prior, and means_prior in your model?

And in your model, how did you get predicted sequence (model.decode()). For this, have you checked several model scores to get the best predict?

I appreciate your time

I have not tried setting prior parameters with HMMlearn, but on line 5 where I have:

`model = GaussianHMM(n_components=2, n_iter=1000).fit(np.reshape(Q,[len(Q),1]))`

it looks like you should be able to also pass startprob_prior, transmat_prior and means_weight and covars_weight in addition to n_components and n_iter. What you are passing for startprob_prior and transmat_prior are parameters of a Dirichlet prior distribution. For means_weight, you pass the mean and std. dev of a normal prior on the means, and for covers_weight, you pass the parameters of the prior distribution for the covariance. If passing covariance_type=’spherical’ or ‘diag’, the prior distribution is inverse Gamma, otherwise it is the inverse Wichart distribution. This can all be found in the API: https://hmmlearn.readthedocs.io/en/latest/api.html#hmmlearn-hmm.

I used model.predict() on line 8 instead of model.decode(). This uses the Viterbi algorithm. model.decode() can use ‘viterbi’ or ‘map’. I don’t know what is different about the ‘map’ method, to be honest. I did not test several models, but you could try different initial estimates or different priors and compare the log-likelihood, ‘logProb’, which is calculated by model.score() on line 16 above. If you compare models with different numbers of parameters, you should penalize logprob based on the number of additional parameters using AIC or BIC, as described in some of the comments above.

Hope that helps!

Hi Julia,

Thanks, for sharing this code.

I would like to have a scale counting every day here:

xs = np.arange(len(Q))+1909

Do you know, how I have to change it to get:

for example: 05.01.2020 ….. 05.30.2020?

I’m an absolute beginner in python.

I would recommend reformatting your data into a pandas time series and plotting that. See here: https://ourcodingclub.github.io/tutorials/pandas-time-series/

Hi Julie,

Thanks for the code and the explanation. Your code is well organized. After creating the model, how did you validate the model? I know you used historical data to train your model and generated samples for testing. Am I right?

since I am new in machine learning, Could you explain how we can validate an HMM model and which techniques are more suited to do that? Thanks a lot.

Hi Julie,

Thanks for the code and the explanation. Your code is well organized. After creating the model, how did you validate the model? I know you used historical data to train your model and generated samples for testing. Am I right?

since I am new in machine learning, Could you explain how we can validate an HMM model and which techniques are more suited to do that? Thanks a lot.

I built my model to only part of the record (the last 2/3), but validated it over the whole record. If it doesn’t validate well, it could be your data is non-stationary. To validate it, you can do the Viterbi classification over the whole time period and then check the Gaussian fits of the time steps classified within each state visually with a qq-plot. You can also perform a formal test for normality, such as a probability plot correlation test, Kolmogorov-Smirnov test or Shapiro-Wilk test. To test the Markov property, the lengths of spells in state i should follow a Type I geometric distribution with p=(1-p_ii) where pii is the probability of transitioning from state i to state i. So you could plot a histogram of the lengths of continuous spells in state i from the Viterbi classification compared with a geometric distribution with p=(1-pii). Similarly, for the Markov property to hold, the auto-correlation of the time series of states should decay exponentially, while the partial auto-correlation function should only be significant for 1 lag. So you could plot the acf and pacf of the state time series to check that.

Hi Julie, I am trying to learn the parameters of a HMM (with multinomial emissions) using Baum-Welch. But it doesnt seem to converge!! the following is the script with a toydata that I have created. Each time I run, I get different parameters and different from the original eg the init, trans and emission probabilities that i used to generate the sequences! any comment on that?

S = [“S1”, “S2”, ‘S3’]

n_states = len(S)

E = [i+1 for i in range(6)] ## sides of a dice

trans_prob = np.array(

[[0.7, 0.2, 0.1],

[0.1, 0.7, 0.2],

[0.1, 0.2, 0.7]]

)

start_prob = np.array([0.7, 0.2, 0.1])

### Lets assume that we are throwing 3 6-sided dices: one is fair with equel probablity of each side, and the other two are loaded

em_prob = np.array([[1/4, 3/20, 3/20, 3/20, 3/20, 3/20],

[2/15, 2/15, 2/15, 2/15, 2/15, 1/3],

[1.0/6, 1.0/6, 1.0/6, 1.0/6, 1.0/6, 1.0/6]])

model = hmm.MultinomialHMM(n_components=3)

model.startprob_ = start_prob

model.transmat_ = trans_prob

model.emissionprob_ = em_prob

print(‘*** Original parameters *********’)

print(‘p_init’)

print(model.startprob_)

print(‘p_tans ‘)

print(model.transmat_)

print(‘p_em’)

print(model.emissionprob_)

### using the above model, generate n_sequences each of length seq_length, then use this data to train another model, and call baum-weltch to learn the parameters

n_sequences = 10

seqs = []

seq_lengths = []

for i in range(n_sequences):

seq_length = 20

### generate sequence

seq = model.sample(seq_length)[0]

# print(seq)

### generate sequence

seq = model.sample(seq_length)[0]

seq_lengths.append(seq_length)

# print(seq)

if len(seqs) > 0:

seqs = np.append(seqs, seq, axis=0)

else: seqs = seq

seq_lengths = np.array(seq_lengths)

model2 = hmm.MultinomialHMM(n_components=3, n_iter=1000)

model2.fit(seqs, seq_lengths)

print(‘\n ================== After training ==============’)

print(‘initial’)

print(model2.startprob_)

print( ‘transition: ‘)

print( model2.transmat_)

print( ’emission: ‘)

print(model2.emissionprob_)

Since you are generating a different sequence from the model each time, when you fit the HMM you will get different parameter estimates. This is just random sampling variability, which is to be expected. If the estimates are extremely different, this is likely an issue of equifinality where multiple parameter combinations represent the data nearly as well. The other potential issue is the parameter estimation is stuck in a local maximum. To reduce the chances of that, I would recommend trying different initial parameter estimates and comparing logProb (the log likelihood of the model) – the parameter set with the highest log likelihood fits the best. And make sure to test the goodness of fit (often visually) of the different models you estimate to see which looks best.

Thanks for the reply.

sorry it was my bad to generate sequences each time and in fact as soon as posted my question, I realized that. I have therefore changed the model accordingly, but still it doesnt seem to give right answer. Also agree with you that the solution might not be unique, but since this is just a toy simple example and I am going to use this for much more complex dataset, I need to make sure I do everything correct. Starting with different initial parameters doesnt sound very helpful unless we know a clever way of choosing starting parameters. The following is what I have done after you suggestions but still not very satisfactory:

def initialize_components(n_components, n_emissions):

“””

Initialize start probabilities, transition and emission matrices, start for a model with a fixed number of components,

for Multinomial model with a certain number of dimensions.

“””

trans_mat = np.abs(np.random.randn(n_components, n_components))

trans_mat = (trans_mat.T / trans_mat.sum( axis=1 )).T

emission_mat = np.abs(np.random.randn(n_components, n_emissions))

emission_mat = (emission_mat.T/emission_mat.sum(axis=1)).T

# print(emission_mat.shape)

start_probs = np.abs( np.random.randn(n_components) )

start_probs /= start_probs.sum()

return (start_probs, trans_mat, emission_mat)

def generate_sequences(Pi, T, E, n_components=3, n_features=6, N=200, K=20):

“””

Use the given parameters, initiate a hmm model and generate some sequences

“””

states = []

seqs = []

seq_lengths = []

model = hmm.MultinomialHMM(n_components=3)

model.startprob_ = Pi

model.transmat_ = T

model.emissionprob_ = E

model.n_features = n_features

print(‘*** Original parameters *********’)

print(‘p_init’)

print(model.startprob_)

print(‘p_tans ‘)

print(model.transmat_)

print(‘p_emm’)

print(model.emissionprob_)

for j in range(N):

seq_length = K

### generate sequence

seq = model.sample(seq_length)[0]

z = model.sample(seq_length)[1]

# print(z)

# print(seq)

### generate sequence

seq = model.sample(seq_length)[0]

seq_lengths.append(seq_length)

states.append(z)

# print(seq)

if len(seqs) > 0:

seqs = np.append(seqs, seq, axis=0)

else:

seqs = seq

model.fit(seqs, seq_lengths)

score = model.score(seqs)

print(‘Score: ‘, score)

return(seqs, states, seq_lengths)

def train(hmm_model, Seqs, Lengths, nIteration):

“””

Train the model with given data, and find the best describing parameters (Baub-Welch)

“””

best_model = None

best_trans = None

best_emissions = None

best_pi = None

max_score = -np.inf

for i in range(nIteration):

pi, A, B = initialize_components(n_states, n_emissions)

hmm_model.emissionprob_ = B

hmm_model.transmat_ = A

hmm_model.startprob_ = pi

hmm_model.n_features_ = 6

hmm_model.fit(Seqs, Lengths)

this_score = hmm_model.score(Seqs)

if this_score > max_score:

max_score = this_score

best_model = hmm_model

best_pi = pi

best_trans = A

best_emissions = B

### used these start poing

if i % 10 == 0:

print(‘^’*50)

print(‘\niteration’, i)

print(‘best score: ‘, max_score)

# print(i , ‘\n ================== After training ==============’)

# print(model2.score(seqs))

print(max_score)

print(best_model.startprob_)

print(best_model.transmat_)

print(best_model.emissionprob_)

return(best_model)

*********************************

*********************************

S = [“S1”, “S2”, ‘S3’]

n_states = len(S)

E = [i+1 for i in range(6)] ## sides of a dice

trans_prob = np.array(

[[0.9, 0.05, 0.05],

[0.1, 0.8, 0.1],

[0.05, 0.05, 0.9]]

)

start_prob = np.array([0.8, 0.1, 0.1])

### Lets assume that we are throwing 3 6-sided dices: one is fair with equel probablity of each side, and the other two are loaded

em_prob = np.array([[1/4, 3/20, 3/20, 3/20, 3/20, 3/20],

[2/15, 2/15, 2/15, 2/15, 2/15, 1/3],

[1.0/6, 1.0/6, 1.0/6, 1.0/6, 1.0/6, 1.0/6]])

X, Z, L = generate_sequences(start_prob, trans_prob, em_prob, n_components=3, n_features=6, N=200, K=20)

model2 = hmm.MultinomialHMM(n_components=3, n_iter=200)

fitted_hmm = train(model2, X, L, 10)

Hi Hashem,

The issue still remains that the one sequence you randomly generated may actually be fit better by a different model than the true model due to sampling variability. I just ran your code and got the following output:

Original parameters

p_init

[0.8 0.1 0.1]

p_tans

[[0.9 0.05 0.05]

[0.1 0.8 0.1 ]

[0.05 0.05 0.9 ]]

p_emm

[[0.25 0.15 0.15 0.15 0.15 0.15 ]

[0.13333333 0.13333333 0.13333333 0.13333333 0.13333333 0.33333333]

[0.16666667 0.16666667 0.16666667 0.16666667 0.16666667 0.16666667]]

Score: -7134.656307667264

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

iteration 0

best score: -7130.093158931911

-7130.093158931911

[0.55342232 0.26642825 0.18014943]

[[0.40926052 0.23864853 0.35209095]

[0.1978113 0.40300539 0.39918331]

[0.51917648 0.30639327 0.17443025]]

[[0.39610989 0.09730218 0.08404591 0.15664999 0.08554425 0.18034779]

[0.10821451 0.23115766 0.05959483 0.0727108 0.23658125 0.29174095]

[0.06873274 0.13744418 0.33595719 0.21040032 0.17550099 0.07196458]]

So the fitted model actually had a higher log-likelihood (-7130) than the true model (-7134). That means the algorithm found a good fit! It just isn’t the right fit. If you repeated this several times, you might find the average parameter estimates will converge to the true values. Likewise, the larger the sample you generate, the closer your estimates should be to the true values. In terms of your process, you seem to be doing everything correctly. If nothing else, this should just show you that the parameters you’ll estimate will not be the true values, but the best fit to the observed data (assuming you’ve found the global maximum, not just a local maximum).

Dear Julie;

Thanks for sharing the code and your insight!

From what I understood, this code is based on a univariate model with only one variable; Flow. I was wondering if I can use it in a multivariate context with multiple time series. More specifically, I’m trying to to model the joint return distribution for multiple assets (stock, bond, and real estate).

Moreover, I would be grateful if you could explain a little more about the data characteristics like variable names in “AnnualQ” file. I’m asking this because I’m a bit confused about the parameters I need to fill the “Fit” function with.

Thanks again.

Hi Harold,

Great question. Yes, it can be used with multivariate data as well! Here is an example for bivariate data: https://hmmlearn.readthedocs.io/en/latest/auto_examples/plot_hmm_sampling.html#sphx-glr-auto-examples-plot-hmm-sampling-py. And a Stack Overflow question along the same lines: https://stackoverflow.com/questions/52141332/is-it-possible-to-fit-a-multivariate-gmhmm-in-hmmlearn.

AnnualQ.txt just has one column for my one-dimensional flow dataset, where each row is a different time step’s observation.

Julie

My questions have been removed!!

Hi Julie,

Thank you very much for your great post. I have a some questions. It would be really appreciated if you could help me on this.

The first one is regarding fitting the model. My data consist of 1000000 rows. Each row corresponds to a sequence with the length of 12. I don’t know how to give my data to the model. Lines 5 and 8 do not work in my case.

My second question is about setting initial parameters for the model. I would like model to improve my initial guess for parameters.

model = GaussianHMM(n_components=3, n_iter=1000,transmat_prior=[[0.95,0.04,0.01],[0.05,0.9,0.05],[0.01,0.04,0.95]],startprob_prior=[0.6,0.3,0.1],means_weight=[10,100,1000],covers_weight=[2,20,100]).fit(XXXXX)

Is the above line correct to fit the model?

It would be really appreciated if you could answer my questions.

Best regards

Hi Julie.

I asked a question. But my question has been removed.

Hi Shahab,

The questions don’t appear until we approve them, which is why it looked like your questions were removed.

To clarify, are you trying to fit a 3-state, 12-dimensional Gaussian HMM to a time series of 1 million observations? If so, the input (X) should be a (1 million x 12) numpy array: model = GaussianHMM(n_components=3, n_iter=1000).fit(X).

I haven’t specified initial conditions to hmmlearn before, but I believe your initiation of transmat_prior and startprob_prior are correct. I am confused about what means_weight and covars_weight should be, though. “means” should be 3×12 if it is a 3-state, 12-dimensional HMM, so I would think the prior on the means (means_weight) would be the same dimension, but the dostring of the source code (https://github.com/hmmlearn/hmmlearn/blob/master/lib/hmmlearn/hmm.py) suggests it is a vector of n_components (3). Similarly, “covars” would be 12x3x3, but covars_weight represents parameters of the prior on the covariance matrix (where the prior is the inverse Gamma if covariance_type=”diag” (the default)). The inverse Gamma has two parameters (alpha and beta). So I would think you need 2 parameters for each entry of covars, but again, the code suggests it is a vector of n_components. So you probably have a better understanding than I do of what those are supposed to be. Sorry I’m not much help there!

Julie

Hi Julie,

your tutorial looks great! I’d like to replicate it, but as I’m not working in the hydrology field I don’t know how to re-create the dataset AnnualQ. Would you be able to share it? Thanks!!

G

Hi Giovanni. It’s just a text file with the annual flow (in acre-ft) at USGS gauge 09163500 from WY 1909 – WY 2013. Each year is a different row of the file. The data can be obtained here: https://waterdata.usgs.gov/co/nwis/uv?site_no=09163500. WY 1909 is Oct 1908-Sept 1909 and WYT 2013 is Oct 2012-Sept 2013.

Thanks Julie!

Hi Julie

Thanks for your useful post. I’m trying to use the Multinomial HMM, but my observations has 3 dimensions. For example, for a sequence with the length of 5, I have the following observation:

Seq1=[(3,1,2),(1,1,1),(2,1,0),(0,0,1),(2,1,1)]

Each dimension of observation could get any number from 0 to 3.

It would be appreciated if you explain how can I use the HMM package to fit my data.

Thanks

Hi Shahab,

Yes, you can do that. Here is an example fitting data I generated (X1):

“`

from hmmlearn import hmm

import numpy as np

# initiate model object for 3-state Multinomial HMM

model = hmm.MultinomialHMM(n_components=3)

# assign number of elements each state can take on

model.n_features = 4

# randomly generate time series of 100 vectors of length 3 whose values are integers from 0-3

X1 = np.random.randint(0,4,[3,100])

# fit Multinomial HMM to X1

fit = model.fit(X1)

# find transition probabilities from fit

fit.transmat_

# find probability distribution of each state (i.e. probability of being 0,1,2 or 3

fit.emissionprob_

“`

Julie

Thanks Julie for your response.

Sorry, I think my question was not clear.

My data set has around 50,000 sequences of observation. Each sequence has its own length. Each observation in a sequence has three dimensions.

For example:

Seq1= [(3,1,2),(1,1,1),(2,1,0),(0,0,1),(2,1,1)] len(seq1) =5

seq2= [(0,2,1),(1,2,1),(3,3,3)] len(seq2) =3

. .

. .

. .

seq50000= [(1,1,0),(2,3,3),(2,2,0),(3,1,0)] len(seq50000)=4

X1 in your code, has 3 sequences with the length of 100, and each observation has one dimension (single int).

Best regards

No, X1 is a 3×100 array, so the first column is a 3-dimensional state at time 1, the second column is a 3-dimensional state at time 2, etc. You’ll see when you fit the model that transmat is 3×3 indicating the transition probabilities between the 3 states and emissionprob is a 3×4 matrix indicating the probability of being a 0,1,2 or 3 for each of the 3 states.

I have a question about the emission. In this case, what is the probability of P((0,2,0)|state1)?

(0,2,0) is an arbitrary observation. I mean, which row and column of the emission matrix shows the probability?

Thanks

Oh, I apologize, I think I did misunderstand. Assuming I get it now, you need to define every possible combination of three values as a state and find the transition probabilities between those univariate definitions of the states. That wouldn’t be a hidden MM, just a pure MM. So, e.g.

(0,0,0) = state 1

(0,0,1) = state 2

…

(3,3,3) = state 4^3

That’s a lot of states! You likely do not have enough data to estimate those transition probabilities. Alternatively, you could fit an n-state HMM, where n is some known number of states, and then generate the 3 values from that state from a multivariate multinomial distribution, but I’m not sure that’s actually possible with hmmlearn (see here: https://github.com/hmmlearn/hmmlearn/issues/216)

Hi Julie,

Why do you assume diagonal covariance type for the annual flow data?

The annual flow data is univariate, so the covariance is just the variance of the one variable.

Hi!

I have written a code and is afraid that the program overfit the results, therefore I want to restrict the model to a minimum variance. In the API I can read:

“min_covar (float, optional) – Floor on the diagonal of the covariance matrix to prevent overfitting. Defaults to 1e-3.”

So i thought that this feature should solve my problems.

My code

remodel = hmm.GaussianHMM(n_components=2, min_covar=5, n_iter=1000).fit(X)

for i in range(remodel.n_components):

print(“{0}th hidden state”.format(i))

print(“covariance = “, (covariance))

When I run this code the program is writing:

0th hidden state

covariance = [[2.05740878 0. 0. ]

[0. 1.83566175 0. ]

[0. 0. 2.1124605 ]]

1th hidden state

covariance = [[0.95922665 0. 0. ]

[0. 1.01140285 0. ]

[0. 0. 1.03875355]]

I don’t understand how the volatility can be lower than the floor(=5).

I’ve never used (and was not familiar with) that option, so I’m afraid I can’t help, sorry!

Hi Julie

Thank you for sharing this post and code with us. I have a question.

I want to train the HMM with two different sequences from two different CSV file. But I got this error:

fkinematics = np.genfromtxt(‘fkinematics.csv’,delimiter=’,’)

fkinematics1 = np.genfromtxt(‘fkinematics1.csv’,delimiter=’,’)

data_train = np.concatenate([fkinematics, fkinematics1])

lengths = [fkinematics, fkinematics1]

hidden_states, mus, sigmas, P, logProb, samples = fitHMM(data_train,lengths)

error:

cannot reshape array of size 25767 into shape (3681,1)

and if:

hidden_states, mus, sigmas, P, logProb, samples = fitHMM(data_train,lengths, 100)

fitHMM() takes 2 positional arguments but 3 were given

Would you mind please help me to fix this problem?

Thanks in advance

What are the dimensions of fkinematics and fkinematics1? For the second error, you are passing 3 arguments (data_train, lengths, 100) to fitHMM, which only takes in 2 arguments (Q, nSamples)

Thanks Julie for Answers to me

Honestly, I am completely new in python. fkinematics has 2176 rows and fkinematics1 has 2050. I need to use both of them for training. Would you mind please help me what should I do?

and If it is possible I have another question. If I want to use 80 percents of the data for training and 20 percents for test. How can I do that?

Thanks in advance for your time and answer

Do they have multiple columns? I don’t see how those numbers are resulting in any of the numbers that area printed. Can you explain what the data is? Is it univariate data and you want a two-state HMM? Because my code will only be suited to that and would need to be adapted for something else.

Hi Julie

First of all, thank you for your time and answer.

Yes I have multiple columns but I only used one of the columns of each file as follow:

hidden_states, mus, sigmas, P, logProb, samples = fitHMM(fkinematics[:,5], 100)

I wanted to use multiple files for training. Every file is a trajectory of the end effector of the robot. One of them perfect trajectory and the rest are some noisy trajectories. I only use one of the columns of each file( regarding the end effector). However, the problem is the specific column, that shows the trajectory, has different rows. And I wanted to have 3 and 4 states. Would you mind please help me with how to do that? And I don’t know how to choose 80% of the data for training and the rest for tests? I searched a lot however, I couldn’t manage to handle my problem! I think even it is not necessary for me to use the log to train.

fkinematics = np.genfromtxt(‘fkinematics.csv’,delimiter=’,’)

fkinematics1 = np.genfromtxt(‘fkinematics1.csv’,delimiter=’,’)

data_train = np.concatenate([fkinematics, fkinematics1])

lengths = [fkinematics[:,5], fkinematics1[:,5]]

model = GaussianHMM(n_components=3, n_iter=1000).fit(data_train,lengths)

Thanks in advance for your answer

fkinematics = np.genfromtxt(‘fkinematics.csv’,delimiter=’,’)

fkinematics1 = np.genfromtxt(‘fkinematics1.csv’,delimiter=’,’)

data_train = np.concatenate([fkinematics, fkinematics1])

lengths = [fkinematics[:,5], fkinematics1[:,5]]

model = GaussianHMM(n_components=3, n_iter=1000).fit(data_train,lengths)

The error for this code is:

ValueError: operands could not be broadcast together with shapes (1617,) (2064,)

With respect

Mohammad

I think that error is probably arising when you run np.concatenate(), or because of how you ran it. You probably want something like this:

data_all = np.concatenate((fkinematics,fkinematics1),axis=0).

To train on 80% of the data, use:

data_train = data_all[0:int(0.8*np.shape(data_all)[0]),:]

data_test = data_all[int(0.8*np.shape(data_all)[0])::,]

model = GaussianHMM(n_components=3, n_iter=1000).fit(np.reshape(data_train[:,5],[len(data_train[:,5]),1]))

hidden_states_train = model.predict(np.reshape(data_train[:,5],[len(data_train[:,5]),1]))

hidden_states_test = model.predict(np.reshape(data_test[:,5],[len(data_test[:,5]),1]))

Then make a QQ plot of all the observations in the test set classified as being in each state against a normal distribution with the parameters fit for that state to see how well it fits in testing.

Thanks, Julie. It would be very helpful and almost worked. However, This line

hidden_states_test = model.predict(np.reshape(data_test[:,5],[len(data_test[:,5]),1]))

gave this error:

too many indices for array

and it is weird. Since everything is the same with the last line. I prepare the QQ plot but because the test line didn’t work so I couldn’t check it. I think it is about dimensions? right?

But the files that I used for train and test are the same and the codes you provide the same too. Would you mind please tell me what is your opinion about that?

Thanks in advance

Mohammad

Actually, it worked. I made mistake. Could you please tell me what is the difference between ” ::,] ” and ” ,:”. when we might use every one of them?

Many thanks for your help

Mohammad

Oh, I actually should have put

data_all[int(0.8*np.shape(data_all)[0])::,:]

int(0.8*np.shape(data_all)[0]):: is pulling the last 20% of rows and ,:] is pulling all of the columns

Sorry Julie, One another question.

When I print the result of hidden_states_train, I get this

array([1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

it is not true? right?

Hi Mohammad,

It might be easier if you just share your data with me (if that’s ok) and I can play around with it. You can email me at jdq8@cornell.edu

Julie

hello Julie,

Thanks for this post. It has been a very helpful source of information for working with HMMs. Can you kindly shed some light on the following for me?

I have a data set with values in the range 0 – 100. Just like in the wet and dry states in your example, I would love to have a two state model with all values > 90 representing one state and all values <= 90 be the other state. In your code (lines 7 & 8), you mentioned something along these lines but it really doesn't clear things up for me.

please help.

Hello Julie,

Thanks for this post. It is a good source of information for working with HMMs. Can you kindly shed some light on the following:

I have a data set with values in the range 0 – 100. Just like in your example with wet and dry states, I will like to have a two state model with all values 90 be the other state. In your code, you mentioned something along these lines (lines 7 & 8) but I am not entirely clear how to do this.

please help.

Hi Tanifor,

That actually wouldn’t be a hidden Markov model because the states would then be observable. So you could directly label all the historical data and estimate the transition probabilities empirically, then fit separate distributions to the separate states. In a hidden Markov model, the distributions overlap and so it’s unclear in the middle which state they belong to, so that’s estimated probabilistically.

Julie

Hi Julie, Thanks for this post! can I have the AnnualQ.txt file please? I cannot found it! Gabry

It has one column with the (naturalized) water year flow (Oct 1-Sept 30) at USGS gauge 09163500 from Oct 1908-Sept 2013. You can access the gauged data beginning in 1951 here: https://waterdata.usgs.gov/nwis/inventory/?site_no=09163500&agency_cd=USGS.

Nice explained. Thanks.

Hi Julie,

Thanks for this very interesting post, just a question if I want to predict status in t+1 having the data available up to t how could I do it?

I found this but I am not sure if it is right….

state_sequence = model.predict(train_set)

prob_next_step = model.transmat_[state_sequence[-1], :].argmax()

That will find the most likely next state. You could get a probabilistic forecast of the flow on the next time step as well. For example:

P(log(Q_{t+1}) < log(q)) = P(X_{t+1}=0 | X_t) * F(log(q) | X=0) + P(X_{t+1}=1 | X_t) * F(log(q) | X=1)

where

P(X_{t+1}=0 | X_t) = model.transmat_[state_sequence[-1], 0]

F(log(q) | X=0) = scipy.stats.norm.cdf(q, mus[0], sigmas[0])

P(X_{t+1}=1 | X_t) = model.transmat_[state_sequence[-1], 1]

F(log(q) | X=1) = scipy.stats.norm.cdf(q, mus[1], sigmas[1])

Hello,

My input datasets contain video files in csv format. Total 9 motion, where each motion contains more than 100 csv file. I want to train 9 HMM with multiple samples. Could you please let me know how can I work with multiple samples (more than 100) to train HMM.

The documentation for hmmlearn explains how to train on multiple sequences here: https://hmmlearn.readthedocs.io/en/latest/tutorial.html#building-hmm-and-generating-samples (scroll down to “Working with multiple sequences”). Note they simply concatenate the sequences. You should be careful to do the concatenation in a way that preserves the proper state-to-state transition. See more here: https://stats.stackexchange.com/questions/95144/training-a-hidden-markov-model-multiple-training-instances. Note, I haven’t done this myself and am not sure how exactly you would know how to concatenate them properly since the state is hidden, but maybe you can understand that explanation better than I do.