Fitting Hidden Markov Models Part II: Sample Python Script

This is the second part of a two-part blog series on fitting hidden Markov models (HMMs). In Part I, I explained what HMMs are, why we might want to use them to model hydro-climatological data, and the methods traditionally used to fit them. Here I will show how to apply these methods using the Python package hmmlearn using annual streamflows in the Colorado River basin at the Colorado/Utah state line (USGS gage 09163500). First, note that to use hmmlearn on a Windows machine, I had to install it on Cygwin as a Python 2.7 library.

For this example, we will assume the state each year is either wet or dry, and the distribution of annual streamflows under each state is modeled by a Gaussian distribution. More states can be considered, as well as other distributions, but we will use a two-state, Gaussian HMM here for simplicity. Since streamflow is strictly positive, it might make sense to first log-transform the annual flows at the state line so that the Gaussian models won’t generate negative streamflows, so that’s what we do here.

After installing hmmlearn, the first step is to load the Gaussian hidden Markov model class with from hmmlearn.hmm import GaussianHMM. The fit function of this class requires as inputs the number of states (n_components, here 2 for wet and dry), the number of iterations to run of the Baum-Welch algorithm described in Part I (n_iter; I chose 1000), and the time series to which the model is fit (here a column vector, Q, of the annual or log-transformed annual flows). You can also set initial parameter estimates before fitting the model and only state those which need to be initialized with the init_params argument. This is a string of characters where ‘s’ stands for startprob (the probability of being in each state at the start), ‘t’ for transmat (the probability transition matrix), ‘m’ for means (mean vector) and ‘c’ for covars (covariance matrix). As discussed in Part I it is good to test several different initial parameter estimates to prevent convergence to a local optimum. For simplicity, here I simply use default estimates, but this tutorial shows how to pass your own. I call the model I fit on line 5 model.

Among other attributes and methods, model will have associated with it the means (means_) and covariances (covars_) of the Gaussian distributions fit to each state, the state probability transition matrix (transmat_), the log-likelihood function of the model (score) and methods for simulating from the HMM (sample) and predicting the states of observed values with the Viterbi algorithm described in Part I (predict). The score attribute could be used to compare the performance of models fit with different initial parameter estimates.

It is important to note that which state (wet or dry) is assigned a 0 and which state is assigned a 1 is arbitrary and different assignments may be made with different runs of the algorithm. To avoid confusion, I choose to reorganize the vectors of means and variances and the transition probability matrix so that state 0 is always the dry state, and state 1 is always the wet state. This is done on lines 22-26 if the mean of state 0 is greater than the mean of state 1.


from hmmlearn.hmm import GaussianHMM

def fitHMM(Q, nSamples):
    # fit Gaussian HMM to Q
    model = GaussianHMM(n_components=2, n_iter=1000).fit(np.reshape(Q,[len(Q),1]))
    
    # classify each observation as state 0 or 1
    hidden_states = model.predict(np.reshape(Q,[len(Q),1]))

    # find parameters of Gaussian HMM
    mus = np.array(model.means_)
    sigmas = np.array(np.sqrt(np.array([np.diag(model.covars_[0]),np.diag(model.covars_[1])])))
    P = np.array(model.transmat_)

    # find log-likelihood of Gaussian HMM
    logProb = model.score(np.reshape(Q,[len(Q),1]))

    # generate nSamples from Gaussian HMM
    samples = model.sample(nSamples)

    # re-organize mus, sigmas and P so that first row is lower mean (if not already)
    if mus[0] > mus[1]:
        mus = np.flipud(mus)
        sigmas = np.flipud(sigmas)
        P = np.fliplr(np.flipud(P))
        hidden_states = 1 - hidden_states

    return hidden_states, mus, sigmas, P, logProb, samples

# load annual flow data for the Colorado River near the Colorado/Utah state line
AnnualQ = np.loadtxt('AnnualQ.txt')

# log transform the data and fit the HMM
logQ = np.log(AnnualQ)
hidden_states, mus, sigmas, P, logProb, samples = fitHMM(logQ, 100)

Okay great, we’ve fit an HMM! What does the model look like? Let’s plot the time series of hidden states. Since we made the lower mean always represented by state 0, we know that hidden_states == 0 corresponds to the dry state and hidden_states == 1 to the wet state.


from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np

def plotTimeSeries(Q, hidden_states, ylabel, filename):

    sns.set()
    fig = plt.figure()
    ax = fig.add_subplot(111)

    xs = np.arange(len(Q))+1909
    masks = hidden_states == 0
    ax.scatter(xs[masks], Q[masks], c='r', label='Dry State')
    masks = hidden_states == 1
    ax.scatter(xs[masks], Q[masks], c='b', label='Wet State')
    ax.plot(xs, Q, c='k')
    
    ax.set_xlabel('Year')
    ax.set_ylabel(ylabel)
    fig.subplots_adjust(bottom=0.2)
    handles, labels = plt.gca().get_legend_handles_labels()
    fig.legend(handles, labels, loc='lower center', ncol=2, frameon=True)
    fig.savefig(filename)
    fig.clf()

    return None

plt.switch_backend('agg') # turn off display when running with Cygwin
plotTimeSeries(logQ, hidden_states, 'log(Flow at State Line)', 'StateTseries_Log.png')

Wow, looks like there’s some persistence! What are the transition probabilities?


print(model.transmat_)

Running that we get the following:

[[ 0.6794469   0.3205531 ]
[ 0.34904974  0.65095026]]

When in a dry state, there is a 68% chance of transitioning to a dry state again in the next year, while in a wet state there is a 65% chance of transitioning to a wet state again in the next year.

What does the distribution of flows look like in the wet and dry states, and how do these compare with the overall distribution? Since the probability distribution of the wet and dry states are Gaussian in log-space, and each state has some probability of being observed, the overall probability distribution is a mixed, or weighted, Gaussian distribution in which the weight of each of the two Gaussian models is the unconditional probability of being in their respective state. These probabilities make up the stationary distribution, π, which is the vector solving the equation π = πP, where P is the probability transition matrix. As briefly mentioned in Part I, this can be found using the method described here: π = (1/ Σi[ei])e in which e is the eigenvector of PT corresponding to an eigenvalue of 1, and ei is the ith element of e. The overall distribution for our observations is then Y ~ π0N(μ0,σ02) + π1*N(μ1,σ12). We plot this distribution and the component distributions on top of a histogram of the log-space annual flows below.


from scipy import stats as ss

def plotDistribution(Q, mus, sigmas, P, filename):

    # calculate stationary distribution
    eigenvals, eigenvecs = np.linalg.eig(np.transpose(P))
    one_eigval = np.argmin(np.abs(eigenvals-1))
    pi = eigenvecs[:,one_eigval] / np.sum(eigenvecs[:,one_eigval])

    x_0 = np.linspace(mus[0]-4*sigmas[0], mus[0]+4*sigmas[0], 10000)
    fx_0 = pi[0]*ss.norm.pdf(x_0,mus[0],sigmas[0])

    x_1 = np.linspace(mus[1]-4*sigmas[1], mus[1]+4*sigmas[1], 10000)
    fx_1 = pi[1]*ss.norm.pdf(x_1,mus[1],sigmas[1])

    x = np.linspace(mus[0]-4*sigmas[0], mus[1]+4*sigmas[1], 10000)
    fx = pi[0]*ss.norm.pdf(x,mus[0],sigmas[0]) + \
        pi[1]*ss.norm.pdf(x,mus[1],sigmas[1])

    sns.set()
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.hist(Q, color='k', alpha=0.5, density=True)
    l1, = ax.plot(x_0, fx_0, c='r', linewidth=2, label='Dry State Distn')
    l2, = ax.plot(x_1, fx_1, c='b', linewidth=2, label='Wet State Distn')
    l3, = ax.plot(x, fx, c='k', linewidth=2, label='Combined State Distn')

    fig.subplots_adjust(bottom=0.15)
    handles, labels = plt.gca().get_legend_handles_labels()
    fig.legend(handles, labels, loc='lower center', ncol=3, frameon=True)
    fig.savefig(filename)
    fig.clf()

    return None

plotDistribution(logQ, mus, sigmas, P, 'MixedGaussianFit_Log.png')

Looks like a pretty good fit – seems like a Gaussian HMM is a decent model of log-transformed annual flows in the Colorado River at the Colorado/Utah state line. Hopefully you can find relevant applications for your work too. If so, I’d recommend reading through this hmmlearn tutorial, from which I learned how to do everything I’ve shown here.

67 thoughts on “Fitting Hidden Markov Models Part II: Sample Python Script

  1. Hi,
    Great post ! Very neatly explained.
    I am working on a dataset , where there are around 6-7 hidden states. I want to map them in the transition matrix. is it possible using hmm learn ?
    I also don’t get how you mapped state 0 to be dry and state 1 to be wet?
    I mean i don’t get this
    ***This is done on lines 22-26 if the mean of state 0 is greater than the mean of state 1 ***

    • Yep, you can do that with hmmlearn. Just change “n_components=2” to 6 on line 5 for a 6-state HMM. If you want to find the best fit across different numbers of states, you can use the log-likelihood to inform that (logProb on line 16). Since fitting more states requires estimating more parameters and could lead to overfitting that will increase logProb, you should penalize the addition of those parameters, e.g. by choosing the model with the lowest AIC (=2k-2lnL, where k is the number of parameters and lnL is the log-likelihood) or BIC (=ln(n)k – 2lnL, where n is the number of observations). The number of parameters is the number of means (# states) + standard deviations (# states) + transition probabilities (# states * (# states – 1)).

      On line 22 I check if mu[0] > mu[1], i.e. if the mean of state 0 is greater than the mean of state 1. If that is true, I flip their order on lines 23-26 so 0 corresponds to the state with the lower mean. If you have more than 2 states and you’d like them to be in order of lowest mean to highest, you should sort rather than flip them.

      • Hi ,

        Thanks for your reply.
        Assume in the same example of yours, i have 3 states i.e wet, mixed, dry.
        Now i get the transition matrix as

        *** ignore the values, the probabities may not sum upto 1 ***
        [[ 0.6 0.2 0.2 ]
        [ 0.34. 0.34 0.32]
        [ 0.06 0.34 0.6]]

        So my question is how do you identify the first column as wet, second one as mixed and the third as dry. it can also be be dry, wet, mixed or any other order?. How do we differentiate ?

        Are we mapping them based on the mean values ? Then how can we differentiate if mean values are same for a couple of states.

        Ultimately, i need the mapping between columns of tranmat_ and hidden state labels.

        The question may look dumb. excuse me.

  2. That’s what I’m determining on lines 22-26. I use the mus vector to identify which state is wet vs. dry based on which has a higher mean. If you have a 3-state HMM and the vector mus is [3, 6, 4], for example, then state 0 corresponds to the driest state (lowest mean), state 1 corresponds to the wettest state (highest mean), and state 2 corresponds to the mixed state. So similarly your rows would be starting at time t in the driest, wettest and mixed states, in that order. And your columns would be ending at time t+1 in the driest, wettest and mixed states, in that order. That’s somewhat confusing since it doesn’t go from dry to wet, so you could re-order the state identification as I do in the loop on lines 22-26. You would instead want something like this for multiple states:

    index_order = np.argsort(np.transpose(mus))
    mus = mus[index_order]
    sigmas = sigmas[index_order]
    newP = np.zeros(np.shape(P))
    state_indices = []
    for i in range(len(mus)):
    newP[i,:] = P[index_order[0][i],index_order]
    state_indices.append(np.where(hidden_states==index_order[0][i]))

    P = newP
    for i in range(len(state_indices)):
    hidden_states[state_indices[i]] = i

    • Hi Julie
      First of thanks for your great post. But, I still I don’t know the job of the lines 22-26. Would you mind please explain it clearly. And if it is possible please tell me the codes that you provide in this comment, can use for more than 3 states? And I can use it or I have to modify it? Because I don’t understand it honestly.

      index_order = np.argsort(np.transpose(mus))
      mus = mus[index_order]
      sigmas = sigmas[index_order]
      newP = np.zeros(np.shape(P))
      state_indices = []
      for i in range(len(mus)):
      newP[i,:] = P[index_order[0][i],index_order]
      state_indices.append(np.where(hidden_states==index_order[0][i]))
      P = newP
      for i in range(len(state_indices)):
      hidden_states[state_indices[i]] = i

      • Hi Mohammad,
        What those lines are doing is defining the states so that they are in order of driest to wettest. The label of 0/1 to each state is arbitrary. For consistency, I wanted 0 to always be dry and 1 to always be wet. So if the automatic labeling was the opposite (condition on line 22), I swap the parameters so that will be true. The code that you copied for more states sorts them from driest to wettest. So if you had three states, state 0 will have the lowest mean, state 1 the middle mean, and state 2 the highest mean. Hope that clears it up.
        Julie

  3. Hi Julie,
    Thanks for your post. I have 2 questions here and I appreciate it if you answer me.
    1- I don’t get why you use “logQ = np.log(AnnualQ)”. I mean, why you didn’t use AnnualQ data directly.
    2- I applied GaussianHMM for my data and fitted data with 3 states. The results of predicting the model are not correct and I am not sure what the problem is. I received pi and covariance as below:
    pi = [1.0000000e+00, 8.85441247e-83, 3.31092446e-46]
    covs = [[[4.22687301]]
    [[72.41435194]]
    [[72.49319852]]]
    do you think the problem is related to these parameters? how can I fix this problem?

  4. Hi Bahare. I used the log of annual flows because the flows have a lower bound of zero, but fitting a mixed Gaussian distribution could produce negative values. This probability might be very low, and negatives could be replaced with 0s if that were to happen, so it could work with the annual flows themselves. But the fit was also better when using the log of annual flows. You can experiment with that.

    It looks like your model essentially fit one state since the probability of being in state 1 is 1. In this case, it sounds like 3 states is not appropriate. For the state identification, was every observation identified as state 1? How did you decide on a 3-state model?

    • Thanks a lot for your reply.
      Yes, for the state identification, every observation was identified as state 1 and this is exactly my issue in the model.
      I am using aviation trajectory data and trying to apply a gaussian hidden Markov model to identify flight modes(ascend, hold, descend). Therefore, I selected a 3-state model.
      For finding the modes, I computed dy/dt for each time in a trajectory. What is your suggestion to find the appropriate number of states? I appreciate your time.

  5. I just re-read what you are using this for. Might it be better to just try to identify a change point between ascend and hold, and again between hold and descend? You also might want to smooth your data first.

  6. Hi,
    I am trying to implement Hidden Markov model using GaussianHMM library. I am wondering how can I set my prior parameters such as startprob_prior, transmat_prior, and means_prior in your model?
    And in your model, how did you get predicted sequence (model.decode()). For this, have you checked several model scores to get the best predict?
    I appreciate your time

    • I have not tried setting prior parameters with HMMlearn, but on line 5 where I have:
      `model = GaussianHMM(n_components=2, n_iter=1000).fit(np.reshape(Q,[len(Q),1]))`
      it looks like you should be able to also pass startprob_prior, transmat_prior and means_weight and covars_weight in addition to n_components and n_iter. What you are passing for startprob_prior and transmat_prior are parameters of a Dirichlet prior distribution. For means_weight, you pass the mean and std. dev of a normal prior on the means, and for covers_weight, you pass the parameters of the prior distribution for the covariance. If passing covariance_type=’spherical’ or ‘diag’, the prior distribution is inverse Gamma, otherwise it is the inverse Wichart distribution. This can all be found in the API: https://hmmlearn.readthedocs.io/en/latest/api.html#hmmlearn-hmm.

      I used model.predict() on line 8 instead of model.decode(). This uses the Viterbi algorithm. model.decode() can use ‘viterbi’ or ‘map’. I don’t know what is different about the ‘map’ method, to be honest. I did not test several models, but you could try different initial estimates or different priors and compare the log-likelihood, ‘logProb’, which is calculated by model.score() on line 16 above. If you compare models with different numbers of parameters, you should penalize logprob based on the number of additional parameters using AIC or BIC, as described in some of the comments above.

      Hope that helps!

  7. Hi Julia,
    Thanks, for sharing this code.

    I would like to have a scale counting every day here:

    xs = np.arange(len(Q))+1909

    Do you know, how I have to change it to get:

    for example: 05.01.2020 ….. 05.30.2020?

    I’m an absolute beginner in python.

  8. Hi Julie,
    Thanks for the code and the explanation. Your code is well organized. After creating the model, how did you validate the model? I know you used historical data to train your model and generated samples for testing. Am I right?
    since I am new in machine learning, Could you explain how we can validate an HMM model and which techniques are more suited to do that? Thanks a lot.

  9. Hi Julie,
    Thanks for the code and the explanation. Your code is well organized. After creating the model, how did you validate the model? I know you used historical data to train your model and generated samples for testing. Am I right?
    since I am new in machine learning, Could you explain how we can validate an HMM model and which techniques are more suited to do that? Thanks a lot.

  10. I built my model to only part of the record (the last 2/3), but validated it over the whole record. If it doesn’t validate well, it could be your data is non-stationary. To validate it, you can do the Viterbi classification over the whole time period and then check the Gaussian fits of the time steps classified within each state visually with a qq-plot. You can also perform a formal test for normality, such as a probability plot correlation test, Kolmogorov-Smirnov test or Shapiro-Wilk test. To test the Markov property, the lengths of spells in state i should follow a Type I geometric distribution with p=(1-p_ii) where pii is the probability of transitioning from state i to state i. So you could plot a histogram of the lengths of continuous spells in state i from the Viterbi classification compared with a geometric distribution with p=(1-pii). Similarly, for the Markov property to hold, the auto-correlation of the time series of states should decay exponentially, while the partial auto-correlation function should only be significant for 1 lag. So you could plot the acf and pacf of the state time series to check that.

  11. Hi Julie, I am trying to learn the parameters of a HMM (with multinomial emissions) using Baum-Welch. But it doesnt seem to converge!! the following is the script with a toydata that I have created. Each time I run, I get different parameters and different from the original eg the init, trans and emission probabilities that i used to generate the sequences! any comment on that?

    S = [“S1”, “S2”, ‘S3’]
    n_states = len(S)
    E = [i+1 for i in range(6)] ## sides of a dice
    trans_prob = np.array(
    [[0.7, 0.2, 0.1],
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.7]]
    )
    start_prob = np.array([0.7, 0.2, 0.1])

    ### Lets assume that we are throwing 3 6-sided dices: one is fair with equel probablity of each side, and the other two are loaded
    em_prob = np.array([[1/4, 3/20, 3/20, 3/20, 3/20, 3/20],
    [2/15, 2/15, 2/15, 2/15, 2/15, 1/3],
    [1.0/6, 1.0/6, 1.0/6, 1.0/6, 1.0/6, 1.0/6]])

    model = hmm.MultinomialHMM(n_components=3)
    model.startprob_ = start_prob
    model.transmat_ = trans_prob
    model.emissionprob_ = em_prob
    print(‘*** Original parameters *********’)
    print(‘p_init’)
    print(model.startprob_)
    print(‘p_tans ‘)
    print(model.transmat_)
    print(‘p_em’)
    print(model.emissionprob_)

    ### using the above model, generate n_sequences each of length seq_length, then use this data to train another model, and call baum-weltch to learn the parameters

    n_sequences = 10
    seqs = []
    seq_lengths = []
    for i in range(n_sequences):
    seq_length = 20
    ### generate sequence
    seq = model.sample(seq_length)[0]
    # print(seq)
    ### generate sequence
    seq = model.sample(seq_length)[0]
    seq_lengths.append(seq_length)

    # print(seq)
    if len(seqs) > 0:
    seqs = np.append(seqs, seq, axis=0)
    else: seqs = seq

    seq_lengths = np.array(seq_lengths)

    model2 = hmm.MultinomialHMM(n_components=3, n_iter=1000)
    model2.fit(seqs, seq_lengths)

    print(‘\n ================== After training ==============’)

    print(‘initial’)
    print(model2.startprob_)

    print( ‘transition: ‘)
    print( model2.transmat_)
    print( ’emission: ‘)
    print(model2.emissionprob_)

  12. Since you are generating a different sequence from the model each time, when you fit the HMM you will get different parameter estimates. This is just random sampling variability, which is to be expected. If the estimates are extremely different, this is likely an issue of equifinality where multiple parameter combinations represent the data nearly as well. The other potential issue is the parameter estimation is stuck in a local maximum. To reduce the chances of that, I would recommend trying different initial parameter estimates and comparing logProb (the log likelihood of the model) – the parameter set with the highest log likelihood fits the best. And make sure to test the goodness of fit (often visually) of the different models you estimate to see which looks best.

    • Thanks for the reply.
      sorry it was my bad to generate sequences each time and in fact as soon as posted my question, I realized that. I have therefore changed the model accordingly, but still it doesnt seem to give right answer. Also agree with you that the solution might not be unique, but since this is just a toy simple example and I am going to use this for much more complex dataset, I need to make sure I do everything correct. Starting with different initial parameters doesnt sound very helpful unless we know a clever way of choosing starting parameters. The following is what I have done after you suggestions but still not very satisfactory:
      def initialize_components(n_components, n_emissions):
      “””
      Initialize start probabilities, transition and emission matrices, start for a model with a fixed number of components,
      for Multinomial model with a certain number of dimensions.
      “””

      trans_mat = np.abs(np.random.randn(n_components, n_components))
      trans_mat = (trans_mat.T / trans_mat.sum( axis=1 )).T

      emission_mat = np.abs(np.random.randn(n_components, n_emissions))

      emission_mat = (emission_mat.T/emission_mat.sum(axis=1)).T
      # print(emission_mat.shape)

      start_probs = np.abs( np.random.randn(n_components) )
      start_probs /= start_probs.sum()
      return (start_probs, trans_mat, emission_mat)

      def generate_sequences(Pi, T, E, n_components=3, n_features=6, N=200, K=20):
      “””
      Use the given parameters, initiate a hmm model and generate some sequences
      “””
      states = []
      seqs = []
      seq_lengths = []

      model = hmm.MultinomialHMM(n_components=3)
      model.startprob_ = Pi
      model.transmat_ = T
      model.emissionprob_ = E
      model.n_features = n_features
      print(‘*** Original parameters *********’)
      print(‘p_init’)
      print(model.startprob_)
      print(‘p_tans ‘)
      print(model.transmat_)
      print(‘p_emm’)
      print(model.emissionprob_)

      for j in range(N):
      seq_length = K
      ### generate sequence
      seq = model.sample(seq_length)[0]
      z = model.sample(seq_length)[1]
      # print(z)
      # print(seq)
      ### generate sequence
      seq = model.sample(seq_length)[0]
      seq_lengths.append(seq_length)
      states.append(z)
      # print(seq)
      if len(seqs) > 0:
      seqs = np.append(seqs, seq, axis=0)

      else:
      seqs = seq

      model.fit(seqs, seq_lengths)
      score = model.score(seqs)
      print(‘Score: ‘, score)
      return(seqs, states, seq_lengths)

      def train(hmm_model, Seqs, Lengths, nIteration):
      “””
      Train the model with given data, and find the best describing parameters (Baub-Welch)
      “””
      best_model = None
      best_trans = None
      best_emissions = None
      best_pi = None
      max_score = -np.inf

      for i in range(nIteration):

      pi, A, B = initialize_components(n_states, n_emissions)
      hmm_model.emissionprob_ = B
      hmm_model.transmat_ = A
      hmm_model.startprob_ = pi
      hmm_model.n_features_ = 6
      hmm_model.fit(Seqs, Lengths)
      this_score = hmm_model.score(Seqs)

      if this_score > max_score:
      max_score = this_score
      best_model = hmm_model
      best_pi = pi
      best_trans = A
      best_emissions = B
      ### used these start poing
      if i % 10 == 0:
      print(‘^’*50)
      print(‘\niteration’, i)
      print(‘best score: ‘, max_score)

      # print(i , ‘\n ================== After training ==============’)
      # print(model2.score(seqs))

      print(max_score)
      print(best_model.startprob_)
      print(best_model.transmat_)
      print(best_model.emissionprob_)
      return(best_model)
      *********************************
      *********************************
      S = [“S1”, “S2”, ‘S3’]
      n_states = len(S)
      E = [i+1 for i in range(6)] ## sides of a dice
      trans_prob = np.array(
      [[0.9, 0.05, 0.05],
      [0.1, 0.8, 0.1],
      [0.05, 0.05, 0.9]]
      )
      start_prob = np.array([0.8, 0.1, 0.1])

      ### Lets assume that we are throwing 3 6-sided dices: one is fair with equel probablity of each side, and the other two are loaded
      em_prob = np.array([[1/4, 3/20, 3/20, 3/20, 3/20, 3/20],
      [2/15, 2/15, 2/15, 2/15, 2/15, 1/3],
      [1.0/6, 1.0/6, 1.0/6, 1.0/6, 1.0/6, 1.0/6]])

      X, Z, L = generate_sequences(start_prob, trans_prob, em_prob, n_components=3, n_features=6, N=200, K=20)
      model2 = hmm.MultinomialHMM(n_components=3, n_iter=200)

      fitted_hmm = train(model2, X, L, 10)

  13. Hi Hashem,
    The issue still remains that the one sequence you randomly generated may actually be fit better by a different model than the true model due to sampling variability. I just ran your code and got the following output:

    Original parameters
    p_init
    [0.8 0.1 0.1]
    p_tans
    [[0.9 0.05 0.05]
    [0.1 0.8 0.1 ]
    [0.05 0.05 0.9 ]]
    p_emm
    [[0.25 0.15 0.15 0.15 0.15 0.15 ]
    [0.13333333 0.13333333 0.13333333 0.13333333 0.13333333 0.33333333]
    [0.16666667 0.16666667 0.16666667 0.16666667 0.16666667 0.16666667]]
    Score: -7134.656307667264
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    iteration 0
    best score: -7130.093158931911
    -7130.093158931911
    [0.55342232 0.26642825 0.18014943]
    [[0.40926052 0.23864853 0.35209095]
    [0.1978113 0.40300539 0.39918331]
    [0.51917648 0.30639327 0.17443025]]
    [[0.39610989 0.09730218 0.08404591 0.15664999 0.08554425 0.18034779]
    [0.10821451 0.23115766 0.05959483 0.0727108 0.23658125 0.29174095]
    [0.06873274 0.13744418 0.33595719 0.21040032 0.17550099 0.07196458]]

    So the fitted model actually had a higher log-likelihood (-7130) than the true model (-7134). That means the algorithm found a good fit! It just isn’t the right fit. If you repeated this several times, you might find the average parameter estimates will converge to the true values. Likewise, the larger the sample you generate, the closer your estimates should be to the true values. In terms of your process, you seem to be doing everything correctly. If nothing else, this should just show you that the parameters you’ll estimate will not be the true values, but the best fit to the observed data (assuming you’ve found the global maximum, not just a local maximum).

  14. Dear Julie;
    Thanks for sharing the code and your insight!
    From what I understood, this code is based on a univariate model with only one variable; Flow. I was wondering if I can use it in a multivariate context with multiple time series. More specifically, I’m trying to to model the joint return distribution for multiple assets (stock, bond, and real estate).
    Moreover, I would be grateful if you could explain a little more about the data characteristics like variable names in “AnnualQ” file. I’m asking this because I’m a bit confused about the parameters I need to fill the “Fit” function with.
    Thanks again.

  15. Hi Julie,
    Thank you very much for your great post. I have a some questions. It would be really appreciated if you could help me on this.

    The first one is regarding fitting the model. My data consist of 1000000 rows. Each row corresponds to a sequence with the length of 12. I don’t know how to give my data to the model. Lines 5 and 8 do not work in my case.
    My second question is about setting initial parameters for the model. I would like model to improve my initial guess for parameters.

    model = GaussianHMM(n_components=3, n_iter=1000,transmat_prior=[[0.95,0.04,0.01],[0.05,0.9,0.05],[0.01,0.04,0.95]],startprob_prior=[0.6,0.3,0.1],means_weight=[10,100,1000],covers_weight=[2,20,100]).fit(XXXXX)

    Is the above line correct to fit the model?

    It would be really appreciated if you could answer my questions.

    Best regards

  16. Hi Shahab,

    The questions don’t appear until we approve them, which is why it looked like your questions were removed.

    To clarify, are you trying to fit a 3-state, 12-dimensional Gaussian HMM to a time series of 1 million observations? If so, the input (X) should be a (1 million x 12) numpy array: model = GaussianHMM(n_components=3, n_iter=1000).fit(X).

    I haven’t specified initial conditions to hmmlearn before, but I believe your initiation of transmat_prior and startprob_prior are correct. I am confused about what means_weight and covars_weight should be, though. “means” should be 3×12 if it is a 3-state, 12-dimensional HMM, so I would think the prior on the means (means_weight) would be the same dimension, but the dostring of the source code (https://github.com/hmmlearn/hmmlearn/blob/master/lib/hmmlearn/hmm.py) suggests it is a vector of n_components (3). Similarly, “covars” would be 12x3x3, but covars_weight represents parameters of the prior on the covariance matrix (where the prior is the inverse Gamma if covariance_type=”diag” (the default)). The inverse Gamma has two parameters (alpha and beta). So I would think you need 2 parameters for each entry of covars, but again, the code suggests it is a vector of n_components. So you probably have a better understanding than I do of what those are supposed to be. Sorry I’m not much help there!

    Julie

  17. Hi Julie,
    your tutorial looks great! I’d like to replicate it, but as I’m not working in the hydrology field I don’t know how to re-create the dataset AnnualQ. Would you be able to share it? Thanks!!
    G

  18. Hi Julie

    Thanks for your useful post. I’m trying to use the Multinomial HMM, but my observations has 3 dimensions. For example, for a sequence with the length of 5, I have the following observation:

    Seq1=[(3,1,2),(1,1,1),(2,1,0),(0,0,1),(2,1,1)]

    Each dimension of observation could get any number from 0 to 3.
    It would be appreciated if you explain how can I use the HMM package to fit my data.

    Thanks

    • Hi Shahab,
      Yes, you can do that. Here is an example fitting data I generated (X1):

      “`
      from hmmlearn import hmm
      import numpy as np

      # initiate model object for 3-state Multinomial HMM
      model = hmm.MultinomialHMM(n_components=3)

      # assign number of elements each state can take on
      model.n_features = 4

      # randomly generate time series of 100 vectors of length 3 whose values are integers from 0-3
      X1 = np.random.randint(0,4,[3,100])

      # fit Multinomial HMM to X1
      fit = model.fit(X1)

      # find transition probabilities from fit
      fit.transmat_

      # find probability distribution of each state (i.e. probability of being 0,1,2 or 3
      fit.emissionprob_
      “`

      Julie

      • Thanks Julie for your response.
        Sorry, I think my question was not clear.

        My data set has around 50,000 sequences of observation. Each sequence has its own length. Each observation in a sequence has three dimensions.

        For example:

        Seq1= [(3,1,2),(1,1,1),(2,1,0),(0,0,1),(2,1,1)] len(seq1) =5
        seq2= [(0,2,1),(1,2,1),(3,3,3)] len(seq2) =3
        . .
        . .
        . .
        seq50000= [(1,1,0),(2,3,3),(2,2,0),(3,1,0)] len(seq50000)=4

        X1 in your code, has 3 sequences with the length of 100, and each observation has one dimension (single int).

        Best regards

  19. No, X1 is a 3×100 array, so the first column is a 3-dimensional state at time 1, the second column is a 3-dimensional state at time 2, etc. You’ll see when you fit the model that transmat is 3×3 indicating the transition probabilities between the 3 states and emissionprob is a 3×4 matrix indicating the probability of being a 0,1,2 or 3 for each of the 3 states.

    • I have a question about the emission. In this case, what is the probability of P((0,2,0)|state1)?
      (0,2,0) is an arbitrary observation. I mean, which row and column of the emission matrix shows the probability?

      Thanks

      • Oh, I apologize, I think I did misunderstand. Assuming I get it now, you need to define every possible combination of three values as a state and find the transition probabilities between those univariate definitions of the states. That wouldn’t be a hidden MM, just a pure MM. So, e.g.
        (0,0,0) = state 1
        (0,0,1) = state 2

        (3,3,3) = state 4^3
        That’s a lot of states! You likely do not have enough data to estimate those transition probabilities. Alternatively, you could fit an n-state HMM, where n is some known number of states, and then generate the 3 values from that state from a multivariate multinomial distribution, but I’m not sure that’s actually possible with hmmlearn (see here: https://github.com/hmmlearn/hmmlearn/issues/216)

  20. Hi!

    I have written a code and is afraid that the program overfit the results, therefore I want to restrict the model to a minimum variance. In the API I can read:
    “min_covar (float, optional) – Floor on the diagonal of the covariance matrix to prevent overfitting. Defaults to 1e-3.”
    So i thought that this feature should solve my problems.

    My code
    remodel = hmm.GaussianHMM(n_components=2, min_covar=5, n_iter=1000).fit(X)
    for i in range(remodel.n_components):
    print(“{0}th hidden state”.format(i))
    print(“covariance = “, (covariance))

    When I run this code the program is writing:
    0th hidden state
    covariance = [[2.05740878 0. 0. ]
    [0. 1.83566175 0. ]
    [0. 0. 2.1124605 ]]
    1th hidden state
    covariance = [[0.95922665 0. 0. ]
    [0. 1.01140285 0. ]
    [0. 0. 1.03875355]]

    I don’t understand how the volatility can be lower than the floor(=5).

  21. Hi Julie
    Thank you for sharing this post and code with us. I have a question.
    I want to train the HMM with two different sequences from two different CSV file. But I got this error:
    fkinematics = np.genfromtxt(‘fkinematics.csv’,delimiter=’,’)
    fkinematics1 = np.genfromtxt(‘fkinematics1.csv’,delimiter=’,’)
    data_train = np.concatenate([fkinematics, fkinematics1])
    lengths = [fkinematics, fkinematics1]
    hidden_states, mus, sigmas, P, logProb, samples = fitHMM(data_train,lengths)
    error:
    cannot reshape array of size 25767 into shape (3681,1)
    and if:
    hidden_states, mus, sigmas, P, logProb, samples = fitHMM(data_train,lengths, 100)
    fitHMM() takes 2 positional arguments but 3 were given

    Would you mind please help me to fix this problem?

    Thanks in advance

    • What are the dimensions of fkinematics and fkinematics1? For the second error, you are passing 3 arguments (data_train, lengths, 100) to fitHMM, which only takes in 2 arguments (Q, nSamples)

      • Thanks Julie for Answers to me
        Honestly, I am completely new in python. fkinematics has 2176 rows and fkinematics1 has 2050. I need to use both of them for training. Would you mind please help me what should I do?
        and If it is possible I have another question. If I want to use 80 percents of the data for training and 20 percents for test. How can I do that?
        Thanks in advance for your time and answer

      • Do they have multiple columns? I don’t see how those numbers are resulting in any of the numbers that area printed. Can you explain what the data is? Is it univariate data and you want a two-state HMM? Because my code will only be suited to that and would need to be adapted for something else.

  22. Hi Julie
    First of all, thank you for your time and answer.
    Yes I have multiple columns but I only used one of the columns of each file as follow:
    hidden_states, mus, sigmas, P, logProb, samples = fitHMM(fkinematics[:,5], 100)

    I wanted to use multiple files for training. Every file is a trajectory of the end effector of the robot. One of them perfect trajectory and the rest are some noisy trajectories. I only use one of the columns of each file( regarding the end effector). However, the problem is the specific column, that shows the trajectory, has different rows. And I wanted to have 3 and 4 states. Would you mind please help me with how to do that? And I don’t know how to choose 80% of the data for training and the rest for tests? I searched a lot however, I couldn’t manage to handle my problem! I think even it is not necessary for me to use the log to train.

    fkinematics = np.genfromtxt(‘fkinematics.csv’,delimiter=’,’)
    fkinematics1 = np.genfromtxt(‘fkinematics1.csv’,delimiter=’,’)
    data_train = np.concatenate([fkinematics, fkinematics1])
    lengths = [fkinematics[:,5], fkinematics1[:,5]]
    model = GaussianHMM(n_components=3, n_iter=1000).fit(data_train,lengths)

    Thanks in advance for your answer

  23. fkinematics = np.genfromtxt(‘fkinematics.csv’,delimiter=’,’)
    fkinematics1 = np.genfromtxt(‘fkinematics1.csv’,delimiter=’,’)
    data_train = np.concatenate([fkinematics, fkinematics1])
    lengths = [fkinematics[:,5], fkinematics1[:,5]]

    model = GaussianHMM(n_components=3, n_iter=1000).fit(data_train,lengths)

    The error for this code is:

    ValueError: operands could not be broadcast together with shapes (1617,) (2064,)

    With respect
    Mohammad

    • I think that error is probably arising when you run np.concatenate(), or because of how you ran it. You probably want something like this:

      data_all = np.concatenate((fkinematics,fkinematics1),axis=0).

      To train on 80% of the data, use:

      data_train = data_all[0:int(0.8*np.shape(data_all)[0]),:]
      data_test = data_all[int(0.8*np.shape(data_all)[0])::,]
      model = GaussianHMM(n_components=3, n_iter=1000).fit(np.reshape(data_train[:,5],[len(data_train[:,5]),1]))
      hidden_states_train = model.predict(np.reshape(data_train[:,5],[len(data_train[:,5]),1]))
      hidden_states_test = model.predict(np.reshape(data_test[:,5],[len(data_test[:,5]),1]))

      Then make a QQ plot of all the observations in the test set classified as being in each state against a normal distribution with the parameters fit for that state to see how well it fits in testing.

      • Thanks, Julie. It would be very helpful and almost worked. However, This line
        hidden_states_test = model.predict(np.reshape(data_test[:,5],[len(data_test[:,5]),1]))
        gave this error:
        too many indices for array
        and it is weird. Since everything is the same with the last line. I prepare the QQ plot but because the test line didn’t work so I couldn’t check it. I think it is about dimensions? right?
        But the files that I used for train and test are the same and the codes you provide the same too. Would you mind please tell me what is your opinion about that?
        Thanks in advance

        Mohammad

  24. Actually, it worked. I made mistake. Could you please tell me what is the difference between ” ::,] ” and ” ,:”. when we might use every one of them?

    Many thanks for your help

    Mohammad

    • Oh, I actually should have put
      data_all[int(0.8*np.shape(data_all)[0])::,:]
      int(0.8*np.shape(data_all)[0]):: is pulling the last 20% of rows and ,:] is pulling all of the columns

  25. Sorry Julie, One another question.
    When I print the result of hidden_states_train, I get this
    array([1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

    it is not true? right?

  26. hello Julie,
    Thanks for this post. It has been a very helpful source of information for working with HMMs. Can you kindly shed some light on the following for me?

    I have a data set with values in the range 0 – 100. Just like in the wet and dry states in your example, I would love to have a two state model with all values > 90 representing one state and all values <= 90 be the other state. In your code (lines 7 & 8), you mentioned something along these lines but it really doesn't clear things up for me.

    please help.

  27. Hello Julie,

    Thanks for this post. It is a good source of information for working with HMMs. Can you kindly shed some light on the following:

    I have a data set with values in the range 0 – 100. Just like in your example with wet and dry states, I will like to have a two state model with all values 90 be the other state. In your code, you mentioned something along these lines (lines 7 & 8) but I am not entirely clear how to do this.

    please help.

    • Hi Tanifor,
      That actually wouldn’t be a hidden Markov model because the states would then be observable. So you could directly label all the historical data and estimate the transition probabilities empirically, then fit separate distributions to the separate states. In a hidden Markov model, the distributions overlap and so it’s unclear in the middle which state they belong to, so that’s estimated probabilistically.
      Julie

  28. Hi Julie,

    Thanks for this very interesting post, just a question if I want to predict status in t+1 having the data available up to t how could I do it?

    I found this but I am not sure if it is right….

    state_sequence = model.predict(train_set)
    prob_next_step = model.transmat_[state_sequence[-1], :].argmax()

    • That will find the most likely next state. You could get a probabilistic forecast of the flow on the next time step as well. For example:
      P(log(Q_{t+1}) < log(q)) = P(X_{t+1}=0 | X_t) * F(log(q) | X=0) + P(X_{t+1}=1 | X_t) * F(log(q) | X=1)
      where
      P(X_{t+1}=0 | X_t) = model.transmat_[state_sequence[-1], 0]
      F(log(q) | X=0) = scipy.stats.norm.cdf(q, mus[0], sigmas[0])
      P(X_{t+1}=1 | X_t) = model.transmat_[state_sequence[-1], 1]
      F(log(q) | X=1) = scipy.stats.norm.cdf(q, mus[1], sigmas[1])

  29. Hello,
    My input datasets contain video files in csv format. Total 9 motion, where each motion contains more than 100 csv file. I want to train 9 HMM with multiple samples. Could you please let me know how can I work with multiple samples (more than 100) to train HMM.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s