Fitting Hidden Markov Models Part II: Sample Python Script

This is the second part of a two-part blog series on fitting hidden Markov models (HMMs). In Part I, I explained what HMMs are, why we might want to use them to model hydro-climatological data, and the methods traditionally used to fit them. Here I will show how to apply these methods with the Python package hmmlearn, using annual streamflows in the Colorado River basin at the Colorado/Utah state line (USGS gage 09163500). First, note that to use hmmlearn on a Windows machine, I had to install it on Cygwin as a Python 2.7 library.

For this example, we will assume the state each year is either wet or dry, and the distribution of annual streamflows under each state is modeled by a Gaussian distribution. More states and other distributions could be considered, but we will use a two-state, Gaussian HMM here for simplicity. Since streamflow is strictly positive, it makes sense to first log-transform the annual flows at the state line so that the Gaussian models cannot generate negative streamflows; that is what we do here.

After installing hmmlearn, the first step is to load the Gaussian hidden Markov model class with from hmmlearn.hmm import GaussianHMM. Fitting the model requires the number of states (n_components, here 2 for wet and dry), the number of iterations to run of the Baum-Welch algorithm described in Part I (n_iter; I chose 1000), and the time series to which the model is fit (here a column vector, Q, of the annual or log-transformed annual flows). You can also set your own initial parameter estimates before fitting the model and use the init_params argument to state which parameters hmmlearn should instead initialize itself. This is a string of characters where ‘s’ stands for startprob (the probability of being in each state at the start), ‘t’ for transmat (the probability transition matrix), ‘m’ for means (mean vector) and ‘c’ for covars (covariance matrix). As discussed in Part I, it is good to test several different initial parameter estimates to prevent convergence to a local optimum. For simplicity, here I use the default estimates, but this tutorial shows how to pass your own. I call the model I fit on line 5 model.
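
As a rough illustration of passing your own starting values, a minimal sketch might look like the following (the initial means here are made-up numbers in log space, so treat them as placeholders); leaving ‘m’ out of init_params tells hmmlearn to keep the means you set yourself and only initialize the other parameters:

model = GaussianHMM(n_components=2, n_iter=1000, init_params='stc')  # hmmlearn initializes s, t and c itself
model.means_ = np.array([[14.0], [15.0]])  # assumed initial guesses for the dry and wet log-space means
model.fit(np.reshape(Q, [len(Q), 1]))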

Among other attributes and methods, model will have associated with it the means (means_) and covariances (covars_) of the Gaussian distributions fit to each state, the state probability transition matrix (transmat_), a method returning the log-likelihood of the data under the model (score), and methods for simulating from the HMM (sample) and predicting the states of observed values with the Viterbi algorithm described in Part I (predict). The score method can be used to compare the performance of models fit with different initial parameter estimates.
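
For example, one way to guard against a local optimum is to refit the model from several different initializations and keep the fit with the highest log-likelihood. A minimal sketch (here simply varying random_state to change the default initialization, rather than supplying explicit estimates):

best_model, best_score = None, -np.inf
for seed in range(10):
    candidate = GaussianHMM(n_components=2, n_iter=1000, random_state=seed).fit(np.reshape(Q,[len(Q),1]))
    score = candidate.score(np.reshape(Q,[len(Q),1]))
    if score > best_score:  # keep the candidate with the highest log-likelihood
        best_model, best_score = candidate, score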

It is important to note that which state (wet or dry) is assigned a 0 and which state is assigned a 1 is arbitrary and different assignments may be made with different runs of the algorithm. To avoid confusion, I choose to reorganize the vectors of means and variances and the transition probability matrix so that state 0 is always the dry state, and state 1 is always the wet state. This is done on lines 22-26 if the mean of state 0 is greater than the mean of state 1.


import numpy as np
from hmmlearn.hmm import GaussianHMM

def fitHMM(Q, nSamples):
    # fit Gaussian HMM to Q
    model = GaussianHMM(n_components=2, n_iter=1000).fit(np.reshape(Q,[len(Q),1]))
    
    # classify each observation as state 0 or 1
    hidden_states = model.predict(np.reshape(Q,[len(Q),1]))

    # find parameters of Gaussian HMM
    mus = np.array(model.means_)
    sigmas = np.array(np.sqrt(np.array([np.diag(model.covars_[0]),np.diag(model.covars_[1])])))
    P = np.array(model.transmat_)

    # find log-likelihood of Gaussian HMM
    logProb = model.score(np.reshape(Q,[len(Q),1]))

    # generate nSamples from Gaussian HMM
    samples = model.sample(nSamples)

    # re-organize mus, sigmas and P so that first row is lower mean (if not already)
    if mus[0] > mus[1]:
        mus = np.flipud(mus)
        sigmas = np.flipud(sigmas)
        P = np.fliplr(np.flipud(P))
        hidden_states = 1 - hidden_states

    return hidden_states, mus, sigmas, P, logProb, samples

# load annual flow data for the Colorado River near the Colorado/Utah state line
AnnualQ = np.loadtxt('AnnualQ.txt')

# log transform the data and fit the HMM
logQ = np.log(AnnualQ)
hidden_states, mus, sigmas, P, logProb, samples = fitHMM(logQ, 100)

Okay great, we’ve fit an HMM! What does the model look like? Let’s plot the time series of hidden states. Since we made the lower mean always represented by state 0, we know that hidden_states == 0 corresponds to the dry state and hidden_states == 1 to the wet state.


from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np

def plotTimeSeries(Q, hidden_states, ylabel, filename):

    sns.set()
    fig = plt.figure()
    ax = fig.add_subplot(111)

    xs = np.arange(len(Q))+1909
    masks = hidden_states == 0
    ax.scatter(xs[masks], Q[masks], c='r', label='Dry State')
    masks = hidden_states == 1
    ax.scatter(xs[masks], Q[masks], c='b', label='Wet State')
    ax.plot(xs, Q, c='k')
    
    ax.set_xlabel('Year')
    ax.set_ylabel(ylabel)
    fig.subplots_adjust(bottom=0.2)
    handles, labels = plt.gca().get_legend_handles_labels()
    fig.legend(handles, labels, loc='lower center', ncol=2, frameon=True)
    fig.savefig(filename)
    fig.clf()

    return None

plt.switch_backend('agg') # turn off display when running with Cygwin
plotTimeSeries(logQ, hidden_states, 'log(Flow at State Line)', 'StateTseries_Log.png')

Wow, looks like there’s some persistence! What are the transition probabilities?


print(P)

Running that we get the following:

[[ 0.6794469   0.3205531 ]
 [ 0.34904974  0.65095026]]

When in a dry state, there is a 68% chance of transitioning to a dry state again in the next year, while in a wet state there is a 65% chance of transitioning to a wet state again in the next year.

What does the distribution of flows look like in the wet and dry states, and how do these compare with the overall distribution? Since the probability distributions of the wet and dry states are Gaussian in log-space, and each state has some probability of being observed, the overall probability distribution is a mixed, or weighted, Gaussian distribution in which the weight of each of the two Gaussian models is the unconditional probability of being in its respective state. These probabilities make up the stationary distribution, π, which is the vector solving the equation π = πP, where P is the probability transition matrix. As briefly mentioned in Part I, this can be found using the method described here: π = e / Σᵢ eᵢ, where e is the eigenvector of Pᵀ corresponding to an eigenvalue of 1, and eᵢ is the ith element of e. The overall distribution for our observations is then Y ~ π₀·N(μ₀, σ₀²) + π₁·N(μ₁, σ₁²). We plot this distribution and the component distributions on top of a histogram of the log-space annual flows below.
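
Before doing so, note that for a two-state chain the stationary distribution also has a simple closed form: π₀ = p₁₀/(p₀₁ + p₁₀). Plugging in the transition probabilities printed above gives π₀ ≈ 0.349/(0.321 + 0.349) ≈ 0.52 and π₁ ≈ 0.48, so in the long run the dry state is occupied slightly more often than the wet state.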


from scipy import stats as ss

def plotDistribution(Q, mus, sigmas, P, filename):

    # calculate stationary distribution
    eigenvals, eigenvecs = np.linalg.eig(np.transpose(P))
    one_eigval = np.argmin(np.abs(eigenvals-1))
    pi = eigenvecs[:,one_eigval] / np.sum(eigenvecs[:,one_eigval])

    x_0 = np.linspace(mus[0]-4*sigmas[0], mus[0]+4*sigmas[0], 10000)
    fx_0 = pi[0]*ss.norm.pdf(x_0,mus[0],sigmas[0])

    x_1 = np.linspace(mus[1]-4*sigmas[1], mus[1]+4*sigmas[1], 10000)
    fx_1 = pi[1]*ss.norm.pdf(x_1,mus[1],sigmas[1])

    x = np.linspace(mus[0]-4*sigmas[0], mus[1]+4*sigmas[1], 10000)
    fx = pi[0]*ss.norm.pdf(x,mus[0],sigmas[0]) + \
        pi[1]*ss.norm.pdf(x,mus[1],sigmas[1])

    sns.set()
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.hist(Q, color='k', alpha=0.5, density=True)
    l1, = ax.plot(x_0, fx_0, c='r', linewidth=2, label='Dry State Distn')
    l2, = ax.plot(x_1, fx_1, c='b', linewidth=2, label='Wet State Distn')
    l3, = ax.plot(x, fx, c='k', linewidth=2, label='Combined State Distn')

    fig.subplots_adjust(bottom=0.15)
    handles, labels = plt.gca().get_legend_handles_labels()
    fig.legend(handles, labels, loc='lower center', ncol=3, frameon=True)
    fig.savefig(filename)
    fig.clf()

    return None

plotDistribution(logQ, mus, sigmas, P, 'MixedGaussianFit_Log.png')

Looks like a pretty good fit – seems like a Gaussian HMM is a decent model of log-transformed annual flows in the Colorado River at the Colorado/Utah state line. Hopefully you can find relevant applications for your work too. If so, I’d recommend reading through this hmmlearn tutorial, from which I learned how to do everything I’ve shown here.

86 thoughts on “Fitting Hidden Markov Models Part II: Sample Python Script”

  1. Hi,
    Great post! Very neatly explained.
    I am working on a dataset where there are around 6-7 hidden states. I want to map them in the transition matrix. Is it possible using hmmlearn?
    I also don’t get how you mapped state 0 to be dry and state 1 to be wet. I mean I don’t get this:
    ***This is done on lines 22-26 if the mean of state 0 is greater than the mean of state 1***

    • Yep, you can do that with hmmlearn. Just change “n_components=2” to 6 on line 5 for a 6-state HMM. If you want to find the best fit across different numbers of states, you can use the log-likelihood to inform that (logProb on line 16). Since fitting more states requires estimating more parameters and will increase logProb, which could lead to overfitting, you should penalize the addition of those parameters, e.g. by choosing the model with the lowest AIC (= 2k - 2lnL, where k is the number of parameters and lnL is the log-likelihood) or BIC (= ln(n)·k - 2lnL, where n is the number of observations). The number of parameters is the number of means (# states) + standard deviations (# states) + transition probabilities (# states * (# states - 1)).
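
      As a rough sketch of that comparison (assuming you refit the model for each candidate number of states, here called nStates, and that logProb is the log-likelihood returned by fitHMM as in the post):

      k = nStates + nStates + nStates*(nStates-1)  # means + standard deviations + transition probabilities
      AIC = 2*k - 2*logProb
      BIC = np.log(len(Q))*k - 2*logProb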

      On line 22 I check if mus[0] > mus[1], i.e. if the mean of state 0 is greater than the mean of state 1. If that is true, I flip their order on lines 23-26 so 0 corresponds to the state with the lower mean. If you have more than 2 states and you’d like them to be in order of lowest mean to highest, you should sort rather than flip them.

      • Hi,

        Thanks for your reply.
        Assume in the same example of yours, I have 3 states, i.e. wet, mixed, dry.
        Now I get the transition matrix as

        *** ignore the values, the probabilities may not sum up to 1 ***
        [[ 0.6  0.2  0.2 ]
         [ 0.34 0.34 0.32]
         [ 0.06 0.34 0.6 ]]

        So my question is how do you identify the first column as wet, the second one as mixed and the third as dry? It can also be dry, wet, mixed or any other order. How do we differentiate?

        Are we mapping them based on the mean values? Then how can we differentiate if the mean values are the same for a couple of states?

        Ultimately, I need the mapping between the columns of transmat_ and the hidden state labels.

        The question may look dumb, excuse me.

  2. That’s what I’m determining on lines 22-26. I use the mus vector to identify which state is wet vs. dry based on which has a higher mean. If you have a 3-state HMM and the vector mus is [3, 6, 4], for example, then state 0 corresponds to the driest state (lowest mean), state 1 corresponds to the wettest state (highest mean), and state 2 corresponds to the mixed state. So similarly your rows would be starting at time t in the driest, wettest and mixed states, in that order. And your columns would be ending at time t+1 in the driest, wettest and mixed states, in that order. That’s somewhat confusing since it doesn’t go from dry to wet, so you could re-order the state identification as I do in the loop on lines 22-26. You would instead want something like this for multiple states:

    index_order = np.argsort(np.transpose(mus))
    mus = mus[index_order]
    sigmas = sigmas[index_order]
    newP = np.zeros(np.shape(P))
    state_indices = []
    for i in range(len(mus)):
        newP[i,:] = P[index_order[0][i],index_order]
        state_indices.append(np.where(hidden_states==index_order[0][i]))

    P = newP
    for i in range(len(state_indices)):
        hidden_states[state_indices[i]] = i

    • Hi Julie
      First of all, thanks for your great post. But I still don’t understand the job of lines 22-26. Would you please explain it more clearly? And if possible, please tell me whether the code you provide in this comment can be used for more than 3 states? Can I use it as is, or do I have to modify it? Because honestly I don’t understand it.

      index_order = np.argsort(np.transpose(mus))
      mus = mus[index_order]
      sigmas = sigmas[index_order]
      newP = np.zeros(np.shape(P))
      state_indices = []
      for i in range(len(mus)):
          newP[i,:] = P[index_order[0][i],index_order]
          state_indices.append(np.where(hidden_states==index_order[0][i]))
      P = newP
      for i in range(len(state_indices)):
          hidden_states[state_indices[i]] = i

      • Hi Mohammad,
        What those lines are doing is defining the states so that they are in order of driest to wettest. The label of 0/1 to each state is arbitrary. For consistency, I wanted 0 to always be dry and 1 to always be wet. So if the automatic labeling was the opposite (condition on line 22), I swap the parameters so that will be true. The code that you copied for more states sorts them from driest to wettest. So if you had three states, state 0 will have the lowest mean, state 1 the middle mean, and state 2 the highest mean. Hope that clears it up.
        Julie

  3. Hi Julie,
    Thanks for your post. I have 2 questions here and I appreciate it if you answer me.
    1- I don’t get why you use “logQ = np.log(AnnualQ)”. I mean, why didn’t you use the AnnualQ data directly?
    2- I applied GaussianHMM for my data and fitted data with 3 states. The results of predicting the model are not correct and I am not sure what the problem is. I received pi and covariance as below:
    pi = [1.0000000e+00, 8.85441247e-83, 3.31092446e-46]
    covs = [[[4.22687301]]
    [[72.41435194]]
    [[72.49319852]]]
    do you think the problem is related to these parameters? how can I fix this problem?

  4. Hi Bahare. I used the log of annual flows because the flows have a lower bound of zero, but fitting a mixed Gaussian distribution could produce negative values. This probability might be very low, and negatives could be replaced with 0s if that were to happen, so it could work with the annual flows themselves. But the fit was also better when using the log of annual flows. You can experiment with that.

    It looks like your model essentially fit one state since the probability of being in state 1 is 1. In this case, it sounds like 3 states is not appropriate. For the state identification, was every observation identified as state 1? How did you decide on a 3-state model?

    • Thanks a lot for your reply.
      Yes, for the state identification, every observation was identified as state 1 and this is exactly my issue in the model.
      I am using aviation trajectory data and trying to apply a Gaussian hidden Markov model to identify flight modes (ascend, hold, descend). Therefore, I selected a 3-state model.
      For finding the modes, I computed dy/dt for each time in a trajectory. What is your suggestion to find the appropriate number of states? I appreciate your time.

  5. I just re-read what you are using this for. Might it be better to just try to identify a change point between ascend and hold, and again between hold and descend? You also might want to smooth your data first.

  6. Hi,
    I am trying to implement a hidden Markov model using the GaussianHMM library. I am wondering how I can set prior parameters such as startprob_prior, transmat_prior, and means_prior in your model?
    And in your model, how did you get the predicted sequence (model.decode())? For this, have you checked several model scores to get the best prediction?
    I appreciate your time

    • I have not tried setting prior parameters with HMMlearn, but on line 5 where I have:
      `model = GaussianHMM(n_components=2, n_iter=1000).fit(np.reshape(Q,[len(Q),1]))`
      it looks like you should be able to also pass startprob_prior, transmat_prior and means_weight and covars_weight in addition to n_components and n_iter. What you are passing for startprob_prior and transmat_prior are parameters of a Dirichlet prior distribution. For means_weight, you pass the mean and std. dev of a normal prior on the means, and for covars_weight, you pass the parameters of the prior distribution for the covariance. If passing covariance_type=’spherical’ or ‘diag’, the prior distribution is inverse Gamma, otherwise it is the inverse Wishart distribution. This can all be found in the API: https://hmmlearn.readthedocs.io/en/latest/api.html#hmmlearn-hmm.

      I used model.predict() on line 8 instead of model.decode(). This uses the Viterbi algorithm. model.decode() can use ‘viterbi’ or ‘map’. I don’t know what is different about the ‘map’ method, to be honest. I did not test several models, but you could try different initial estimates or different priors and compare the log-likelihood, ‘logProb’, which is calculated by model.score() on line 16 above. If you compare models with different numbers of parameters, you should penalize logprob based on the number of additional parameters using AIC or BIC, as described in some of the comments above.

      Hope that helps!

  7. Hi Julie,
    Thanks for sharing this code.

    I would like to have a scale counting every day here:

    xs = np.arange(len(Q))+1909

    Do you know how I have to change it to get:

    for example: 05.01.2020 ….. 05.30.2020?

    I’m an absolute beginner in python.

  8. Hi Julie,
    Thanks for the code and the explanation. Your code is well organized. After creating the model, how did you validate the model? I know you used historical data to train your model and generated samples for testing. Am I right?
    Since I am new to machine learning, could you explain how we can validate an HMM model and which techniques are best suited to do that? Thanks a lot.

  10. I fit my model to only part of the record (the last 2/3), but validated it over the whole record. If it doesn’t validate well, it could be that your data are non-stationary. To validate it, you can do the Viterbi classification over the whole time period and then check the Gaussian fits of the time steps classified within each state visually with a qq-plot. You can also perform a formal test for normality, such as a probability plot correlation test, Kolmogorov-Smirnov test or Shapiro-Wilk test. To test the Markov property, the lengths of spells in state i should follow a Type I geometric distribution with p = 1 - p_ii, where p_ii is the probability of transitioning from state i to state i. So you could plot a histogram of the lengths of continuous spells in state i from the Viterbi classification compared with a geometric distribution with p = 1 - p_ii. Similarly, for the Markov property to hold, the auto-correlation of the time series of states should decay exponentially, while the partial auto-correlation function should only be significant for 1 lag. So you could plot the acf and pacf of the state time series to check that.
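
    A rough sketch of the first two of those checks (dry_spells and lengths are hypothetical names introduced here; hidden_states, mus, sigmas, P and logQ are assumed to come from fitHMM as in the post, with scipy.stats as ss and matplotlib.pyplot as plt imported as above):

    import itertools

    # QQ plot of the log-flows classified as dry (state 0) against the normal distribution fit for that state
    ss.probplot(logQ[hidden_states==0], dist='norm', sparams=(float(mus[0]), float(sigmas[0])), plot=plt)

    # lengths of consecutive dry spells from the Viterbi classification
    dry_spells = [len(list(run)) for state, run in itertools.groupby(hidden_states) if state == 0]

    # compare their histogram with a geometric distribution with p = 1 - P[0,0]
    lengths = np.arange(1, max(dry_spells)+1)
    plt.figure()
    plt.hist(dry_spells, bins=lengths, density=True)
    plt.plot(lengths, ss.geom.pmf(lengths, 1-P[0,0]))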

  11. Hi Julie, I am trying to learn the parameters of an HMM (with multinomial emissions) using Baum-Welch, but it doesn’t seem to converge! The following is the script with some toy data that I have created. Each time I run it, I get different parameters, and they are different from the original init, trans and emission probabilities that I used to generate the sequences! Any comment on that?

    S = ["S1", "S2", "S3"]
    n_states = len(S)
    E = [i+1 for i in range(6)] ## sides of a dice
    trans_prob = np.array(
        [[0.7, 0.2, 0.1],
         [0.1, 0.7, 0.2],
         [0.1, 0.2, 0.7]]
    )
    start_prob = np.array([0.7, 0.2, 0.1])

    ### Lets assume that we are throwing 3 6-sided dices: one is fair with equel probablity of each side, and the other two are loaded
    em_prob = np.array([[1/4, 3/20, 3/20, 3/20, 3/20, 3/20],
                        [2/15, 2/15, 2/15, 2/15, 2/15, 1/3],
                        [1.0/6, 1.0/6, 1.0/6, 1.0/6, 1.0/6, 1.0/6]])

    model = hmm.MultinomialHMM(n_components=3)
    model.startprob_ = start_prob
    model.transmat_ = trans_prob
    model.emissionprob_ = em_prob
    print('*** Original parameters *********')
    print('p_init')
    print(model.startprob_)
    print('p_tans ')
    print(model.transmat_)
    print('p_em')
    print(model.emissionprob_)

    ### using the above model, generate n_sequences each of length seq_length, then use this data to train another model, and call baum-weltch to learn the parameters

    n_sequences = 10
    seqs = []
    seq_lengths = []
    for i in range(n_sequences):
        seq_length = 20
        ### generate sequence
        seq = model.sample(seq_length)[0]
        # print(seq)
        ### generate sequence
        seq = model.sample(seq_length)[0]
        seq_lengths.append(seq_length)

        # print(seq)
        if len(seqs) > 0:
            seqs = np.append(seqs, seq, axis=0)
        else:
            seqs = seq

    seq_lengths = np.array(seq_lengths)

    model2 = hmm.MultinomialHMM(n_components=3, n_iter=1000)
    model2.fit(seqs, seq_lengths)

    print('\n ================== After training ==============')

    print('initial')
    print(model2.startprob_)

    print('transition: ')
    print(model2.transmat_)
    print('emission: ')
    print(model2.emissionprob_)

  12. Since you are generating a different sequence from the model each time, when you fit the HMM you will get different parameter estimates. This is just random sampling variability, which is to be expected. If the estimates are extremely different, this is likely an issue of equifinality where multiple parameter combinations represent the data nearly as well. The other potential issue is the parameter estimation is stuck in a local maximum. To reduce the chances of that, I would recommend trying different initial parameter estimates and comparing logProb (the log likelihood of the model) – the parameter set with the highest log likelihood fits the best. And make sure to test the goodness of fit (often visually) of the different models you estimate to see which looks best.

    • Thanks for the reply.
      Sorry, it was my bad to generate new sequences each time; in fact, as soon as I posted my question, I realized that. I have therefore changed the model accordingly, but it still doesn’t seem to give the right answer. I also agree with you that the solution might not be unique, but since this is just a simple toy example and I am going to use this for a much more complex dataset, I need to make sure I do everything correctly. Starting with different initial parameters doesn’t sound very helpful unless we know a clever way of choosing starting parameters. The following is what I have done after your suggestions, but it is still not very satisfactory:
      def initialize_components(n_components, n_emissions):
          """
          Initialize start probabilities, transition and emission matrices, start for a model with a fixed number of components,
          for Multinomial model with a certain number of dimensions.
          """

          trans_mat = np.abs(np.random.randn(n_components, n_components))
          trans_mat = (trans_mat.T / trans_mat.sum( axis=1 )).T

          emission_mat = np.abs(np.random.randn(n_components, n_emissions))

          emission_mat = (emission_mat.T/emission_mat.sum(axis=1)).T
          # print(emission_mat.shape)

          start_probs = np.abs( np.random.randn(n_components) )
          start_probs /= start_probs.sum()
          return (start_probs, trans_mat, emission_mat)

      def generate_sequences(Pi, T, E, n_components=3, n_features=6, N=200, K=20):
          """
          Use the given parameters, initiate a hmm model and generate some sequences
          """
          states = []
          seqs = []
          seq_lengths = []

          model = hmm.MultinomialHMM(n_components=3)
          model.startprob_ = Pi
          model.transmat_ = T
          model.emissionprob_ = E
          model.n_features = n_features
          print('*** Original parameters *********')
          print('p_init')
          print(model.startprob_)
          print('p_tans ')
          print(model.transmat_)
          print('p_emm')
          print(model.emissionprob_)

          for j in range(N):
              seq_length = K
              ### generate sequence
              seq = model.sample(seq_length)[0]
              z = model.sample(seq_length)[1]
              # print(z)
              # print(seq)
              ### generate sequence
              seq = model.sample(seq_length)[0]
              seq_lengths.append(seq_length)
              states.append(z)
              # print(seq)
              if len(seqs) > 0:
                  seqs = np.append(seqs, seq, axis=0)
              else:
                  seqs = seq

          model.fit(seqs, seq_lengths)
          score = model.score(seqs)
          print('Score: ', score)
          return(seqs, states, seq_lengths)

      def train(hmm_model, Seqs, Lengths, nIteration):
          """
          Train the model with given data, and find the best describing parameters (Baub-Welch)
          """
          best_model = None
          best_trans = None
          best_emissions = None
          best_pi = None
          max_score = -np.inf

          for i in range(nIteration):

              pi, A, B = initialize_components(n_states, n_emissions)
              hmm_model.emissionprob_ = B
              hmm_model.transmat_ = A
              hmm_model.startprob_ = pi
              hmm_model.n_features_ = 6
              hmm_model.fit(Seqs, Lengths)
              this_score = hmm_model.score(Seqs)

              if this_score > max_score:
                  max_score = this_score
                  best_model = hmm_model
                  best_pi = pi
                  best_trans = A
                  best_emissions = B
              ### used these start poing
              if i % 10 == 0:
                  print('^'*50)
                  print('\niteration', i)
                  print('best score: ', max_score)

          # print(i , '\n ================== After training ==============')
          # print(model2.score(seqs))

          print(max_score)
          print(best_model.startprob_)
          print(best_model.transmat_)
          print(best_model.emissionprob_)
          return(best_model)
      *********************************
      *********************************
      S = ["S1", "S2", "S3"]
      n_states = len(S)
      E = [i+1 for i in range(6)] ## sides of a dice
      trans_prob = np.array(
          [[0.9, 0.05, 0.05],
           [0.1, 0.8, 0.1],
           [0.05, 0.05, 0.9]]
      )
      start_prob = np.array([0.8, 0.1, 0.1])

      ### Lets assume that we are throwing 3 6-sided dices: one is fair with equel probablity of each side, and the other two are loaded
      em_prob = np.array([[1/4, 3/20, 3/20, 3/20, 3/20, 3/20],
                          [2/15, 2/15, 2/15, 2/15, 2/15, 1/3],
                          [1.0/6, 1.0/6, 1.0/6, 1.0/6, 1.0/6, 1.0/6]])

      X, Z, L = generate_sequences(start_prob, trans_prob, em_prob, n_components=3, n_features=6, N=200, K=20)
      model2 = hmm.MultinomialHMM(n_components=3, n_iter=200)

      fitted_hmm = train(model2, X, L, 10)

  13. Hi Hashem,
    The issue still remains that the one sequence you randomly generated may actually be fit better by a different model than the true model due to sampling variability. I just ran your code and got the following output:

    Original parameters
    p_init
    [0.8 0.1 0.1]
    p_tans
    [[0.9 0.05 0.05]
    [0.1 0.8 0.1 ]
    [0.05 0.05 0.9 ]]
    p_emm
    [[0.25 0.15 0.15 0.15 0.15 0.15 ]
    [0.13333333 0.13333333 0.13333333 0.13333333 0.13333333 0.33333333]
    [0.16666667 0.16666667 0.16666667 0.16666667 0.16666667 0.16666667]]
    Score: -7134.656307667264
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    iteration 0
    best score: -7130.093158931911
    -7130.093158931911
    [0.55342232 0.26642825 0.18014943]
    [[0.40926052 0.23864853 0.35209095]
    [0.1978113 0.40300539 0.39918331]
    [0.51917648 0.30639327 0.17443025]]
    [[0.39610989 0.09730218 0.08404591 0.15664999 0.08554425 0.18034779]
    [0.10821451 0.23115766 0.05959483 0.0727108 0.23658125 0.29174095]
    [0.06873274 0.13744418 0.33595719 0.21040032 0.17550099 0.07196458]]

    So the fitted model actually had a higher log-likelihood (-7130) than the true model (-7134). That means the algorithm found a good fit! It just isn’t the right fit. If you repeated this several times, you might find the average parameter estimates will converge to the true values. Likewise, the larger the sample you generate, the closer your estimates should be to the true values. In terms of your process, you seem to be doing everything correctly. If nothing else, this should just show you that the parameters you’ll estimate will not be the true values, but the best fit to the observed data (assuming you’ve found the global maximum, not just a local maximum).

  14. Dear Julie;
    Thanks for sharing the code and your insight!
    From what I understood, this code is based on a univariate model with only one variable: flow. I was wondering if I can use it in a multivariate context with multiple time series. More specifically, I’m trying to model the joint return distribution for multiple assets (stock, bond, and real estate).
    Moreover, I would be grateful if you could explain a little more about the data characteristics, like the variable names in the “AnnualQ” file. I’m asking this because I’m a bit confused about the parameters I need to fill the fit function with.
    Thanks again.

  15. Hi Julie,
    Thank you very much for your great post. I have some questions. It would be really appreciated if you could help me on this.

    The first one is regarding fitting the model. My data consists of 1,000,000 rows. Each row corresponds to a sequence with a length of 12. I don’t know how to give my data to the model. Lines 5 and 8 do not work in my case.
    My second question is about setting initial parameters for the model. I would like the model to improve my initial guess for the parameters.

    model = GaussianHMM(n_components=3, n_iter=1000,transmat_prior=[[0.95,0.04,0.01],[0.05,0.9,0.05],[0.01,0.04,0.95]],startprob_prior=[0.6,0.3,0.1],means_weight=[10,100,1000],covers_weight=[2,20,100]).fit(XXXXX)

    Is the above line correct to fit the model?

    It would be really appreciated if you could answer my questions.

    Best regards

  16. Hi Shahab,

    The questions don’t appear until we approve them, which is why it looked like your questions were removed.

    To clarify, are you trying to fit a 3-state, 12-dimensional Gaussian HMM to a time series of 1 million observations? If so, the input (X) should be a (1 million x 12) numpy array: model = GaussianHMM(n_components=3, n_iter=1000).fit(X).

    I haven’t specified initial conditions to hmmlearn before, but I believe your initiation of transmat_prior and startprob_prior are correct. I am confused about what means_weight and covars_weight should be, though. “means” should be 3×12 if it is a 3-state, 12-dimensional HMM, so I would think the prior on the means (means_weight) would be the same dimension, but the docstring of the source code (https://github.com/hmmlearn/hmmlearn/blob/master/lib/hmmlearn/hmm.py) suggests it is a vector of n_components (3). Similarly, “covars” would be 12x3x3, but covars_weight represents parameters of the prior on the covariance matrix (where the prior is the inverse Gamma if covariance_type=”diag” (the default)). The inverse Gamma has two parameters (alpha and beta). So I would think you need 2 parameters for each entry of covars, but again, the code suggests it is a vector of n_components. So you probably have a better understanding than I do of what those are supposed to be. Sorry I’m not much help there!

    Julie

  17. Hi Julie,
    your tutorial looks great! I’d like to replicate it, but as I’m not working in the hydrology field I don’t know how to re-create the dataset AnnualQ. Would you be able to share it? Thanks!!
    G

  18. Hi Julie

    Thanks for your useful post. I’m trying to use the Multinomial HMM, but my observations have 3 dimensions. For example, for a sequence with a length of 5, I have the following observation:

    Seq1=[(3,1,2),(1,1,1),(2,1,0),(0,0,1),(2,1,1)]

    Each dimension of an observation can take any value from 0 to 3.
    It would be appreciated if you explain how can I use the HMM package to fit my data.

    Thanks

    • Hi Shahab,
      Yes, you can do that. Here is an example fitting data I generated (X1):

      ```
      from hmmlearn import hmm
      import numpy as np

      # initiate model object for 3-state Multinomial HMM
      model = hmm.MultinomialHMM(n_components=3)

      # assign number of elements each state can take on
      model.n_features = 4

      # randomly generate time series of 100 vectors of length 3 whose values are integers from 0-3
      X1 = np.random.randint(0,4,[3,100])

      # fit Multinomial HMM to X1
      fit = model.fit(X1)

      # find transition probabilities from fit
      fit.transmat_

      # find probability distribution of each state (i.e. probability of being 0, 1, 2 or 3)
      fit.emissionprob_
      ```

      Julie

      • Thanks Julie for your response.
        Sorry, I think my question was not clear.

        My data set has around 50,000 sequences of observation. Each sequence has its own length. Each observation in a sequence has three dimensions.

        For example:

        Seq1= [(3,1,2),(1,1,1),(2,1,0),(0,0,1),(2,1,1)] len(seq1) =5
        seq2= [(0,2,1),(1,2,1),(3,3,3)] len(seq2) =3
        . .
        . .
        . .
        seq50000= [(1,1,0),(2,3,3),(2,2,0),(3,1,0)] len(seq50000)=4

        X1 in your code, has 3 sequences with the length of 100, and each observation has one dimension (single int).

        Best regards

  19. No, X1 is a 3×100 array, so the first column is a 3-dimensional state at time 1, the second column is a 3-dimensional state at time 2, etc. You’ll see when you fit the model that transmat is 3×3 indicating the transition probabilities between the 3 states and emissionprob is a 3×4 matrix indicating the probability of being a 0,1,2 or 3 for each of the 3 states.

    • I have a question about the emission. In this case, what is the probability of P((0,2,0)|state1)?
      (0,2,0) is an arbitrary observation. I mean, which row and column of the emission matrix shows the probability?

      Thanks

      • Oh, I apologize, I think I did misunderstand. Assuming I get it now, you need to define every possible combination of three values as a state and find the transition probabilities between those univariate definitions of the states. That wouldn’t be a hidden MM, just a pure MM. So, e.g.
        (0,0,0) = state 1
        (0,0,1) = state 2

        (3,3,3) = state 4^3
        That’s a lot of states! You likely do not have enough data to estimate those transition probabilities. Alternatively, you could fit an n-state HMM, where n is some known number of states, and then generate the 3 values from that state from a multivariate multinomial distribution, but I’m not sure that’s actually possible with hmmlearn (see here: https://github.com/hmmlearn/hmmlearn/issues/216)

  20. Hi!

    I have written some code and am afraid that the program overfits the results, therefore I want to restrict the model to a minimum variance. In the API I can read:
    “min_covar (float, optional) – Floor on the diagonal of the covariance matrix to prevent overfitting. Defaults to 1e-3.”
    So I thought that this feature should solve my problems.

    My code
    remodel = hmm.GaussianHMM(n_components=2, min_covar=5, n_iter=1000).fit(X)
    for i in range(remodel.n_components):
        print("{0}th hidden state".format(i))
        print("covariance = ", remodel.covars_[i])

    When I run this code the program writes:
    0th hidden state
    covariance = [[2.05740878 0. 0. ]
    [0. 1.83566175 0. ]
    [0. 0. 2.1124605 ]]
    1th hidden state
    covariance = [[0.95922665 0. 0. ]
    [0. 1.01140285 0. ]
    [0. 0. 1.03875355]]

    I don’t understand how the volatility can be lower than the floor(=5).

  21. Hi Julie
    Thank you for sharing this post and code with us. I have a question.
    I want to train the HMM with two different sequences from two different CSV file. But I got this error:
    fkinematics = np.genfromtxt('fkinematics.csv', delimiter=',')
    fkinematics1 = np.genfromtxt('fkinematics1.csv', delimiter=',')
    data_train = np.concatenate([fkinematics, fkinematics1])
    lengths = [fkinematics, fkinematics1]
    hidden_states, mus, sigmas, P, logProb, samples = fitHMM(data_train,lengths)
    error:
    cannot reshape array of size 25767 into shape (3681,1)
    and if:
    hidden_states, mus, sigmas, P, logProb, samples = fitHMM(data_train,lengths, 100)
    fitHMM() takes 2 positional arguments but 3 were given

    Would you mind please help me to fix this problem?

    Thanks in advance

    • What are the dimensions of fkinematics and fkinematics1? For the second error, you are passing 3 arguments (data_train, lengths, 100) to fitHMM, which only takes in 2 arguments (Q, nSamples)

      • Thanks Julie for answering me
        Honestly, I am completely new to Python. fkinematics has 2176 rows and fkinematics1 has 2050. I need to use both of them for training. Would you please help me with what I should do?
        And if it is possible, I have another question: if I want to use 80 percent of the data for training and 20 percent for testing, how can I do that?
        Thanks in advance for your time and answer

      • Do they have multiple columns? I don’t see how those numbers are resulting in any of the numbers that are printed. Can you explain what the data is? Is it univariate data and you want a two-state HMM? Because my code will only be suited to that and would need to be adapted for something else.

  22. Hi Julie
    First of all, thank you for your time and answer.
    Yes, I have multiple columns but I only used one of the columns of each file, as follows:
    hidden_states, mus, sigmas, P, logProb, samples = fitHMM(fkinematics[:,5], 100)

    I wanted to use multiple files for training. Every file is a trajectory of the end effector of the robot. One of them is a perfect trajectory and the rest are noisy trajectories. I only use one of the columns of each file (regarding the end effector). However, the problem is that the specific column that shows the trajectory has a different number of rows in each file. And I wanted to have 3 and 4 states. Would you please help me with how to do that? I also don’t know how to choose 80% of the data for training and the rest for testing. I searched a lot; however, I couldn’t manage to handle my problem! I think it is not even necessary for me to use the log to train.

    fkinematics = np.genfromtxt('fkinematics.csv', delimiter=',')
    fkinematics1 = np.genfromtxt('fkinematics1.csv', delimiter=',')
    data_train = np.concatenate([fkinematics, fkinematics1])
    lengths = [fkinematics[:,5], fkinematics1[:,5]]
    model = GaussianHMM(n_components=3, n_iter=1000).fit(data_train,lengths)

    Thanks in advance for your answer

  23. fkinematics = np.genfromtxt('fkinematics.csv', delimiter=',')
    fkinematics1 = np.genfromtxt('fkinematics1.csv', delimiter=',')
    data_train = np.concatenate([fkinematics, fkinematics1])
    lengths = [fkinematics[:,5], fkinematics1[:,5]]

    model = GaussianHMM(n_components=3, n_iter=1000).fit(data_train,lengths)

    The error for this code is:

    ValueError: operands could not be broadcast together with shapes (1617,) (2064,)

    With respect
    Mohammad

    • I think that error is probably arising when you run np.concatenate(), or because of how you ran it. You probably want something like this:

      data_all = np.concatenate((fkinematics, fkinematics1), axis=0)

      To train on 80% of the data, use:

      data_train = data_all[0:int(0.8*np.shape(data_all)[0]),:]
      data_test = data_all[int(0.8*np.shape(data_all)[0])::,]
      model = GaussianHMM(n_components=3, n_iter=1000).fit(np.reshape(data_train[:,5],[len(data_train[:,5]),1]))
      hidden_states_train = model.predict(np.reshape(data_train[:,5],[len(data_train[:,5]),1]))
      hidden_states_test = model.predict(np.reshape(data_test[:,5],[len(data_test[:,5]),1]))

      Then make a QQ plot of all the observations in the test set classified as being in each state against a normal distribution with the parameters fit for that state to see how well it fits in testing.

      • Thanks, Julie. That was very helpful and almost worked. However, this line
        hidden_states_test = model.predict(np.reshape(data_test[:,5],[len(data_test[:,5]),1]))
        gave this error:
        too many indices for array
        which is weird, since everything is the same as the previous line. I prepared the QQ plot, but because the test line didn’t work I couldn’t check it. I think it is about dimensions, right?
        But the files that I used for training and testing are the same, and the code you provided is the same too. Would you please tell me your opinion about that?
        Thanks in advance

        Mohammad

  24. Actually, it worked. I made a mistake. Could you please tell me what the difference is between "::,]" and ",:", and when we might use each of them?

    Many thanks for your help

    Mohammad

    • Oh, I actually should have put
      data_all[int(0.8*np.shape(data_all)[0])::,:]
      int(0.8*np.shape(data_all)[0]):: is pulling the last 20% of rows and ,:] is pulling all of the columns

  25. Sorry Julie, One another question.
    When I print the result of hidden_states_train, I get this
    array([1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

    This is not right, is it?

  26. hello Julie,
    Thanks for this post. It has been a very helpful source of information for working with HMMs. Can you kindly shed some light on the following for me?

    I have a data set with values in the range 0 – 100. Just like in the wet and dry states in your example, I would love to have a two state model with all values > 90 representing one state and all values <= 90 be the other state. In your code (lines 7 & 8), you mentioned something along these lines but it really doesn't clear things up for me.

    please help.

  27. Hello Julie,

    Thanks for this post. It is a good source of information for working with HMMs. Can you kindly shed some light on the following:

    I have a data set with values in the range 0 – 100. Just like in your example with wet and dry states, I would like to have a two state model with all values > 90 representing one state and all values <= 90 being the other state. In your code, you mentioned something along these lines (lines 7 & 8) but I am not entirely clear how to do this.

    please help.

    • Hi Tanifor,
      That actually wouldn’t be a hidden Markov model because the states would then be observable. So you could directly label all the historical data and estimate the transition probabilities empirically, then fit separate distributions to the separate states. In a hidden Markov model, the distributions overlap and so it’s unclear in the middle which state they belong to, so that’s estimated probabilistically.
      Julie

  28. Hi Julie,

    Thanks for this very interesting post. Just a question: if I want to predict the state at t+1 having the data available up to t, how could I do it?

    I found this but I am not sure if it is right….

    state_sequence = model.predict(train_set)
    prob_next_step = model.transmat_[state_sequence[-1], :].argmax()

    • That will find the most likely next state. You could get a probabilistic forecast of the flow on the next time step as well. For example:
      P(log(Q_{t+1}) < log(q)) = P(X_{t+1}=0 | X_t) * F(log(q) | X=0) + P(X_{t+1}=1 | X_t) * F(log(q) | X=1)
      where
      P(X_{t+1}=0 | X_t) = model.transmat_[state_sequence[-1], 0]
      F(log(q) | X=0) = scipy.stats.norm.cdf(np.log(q), mus[0], sigmas[0])
      P(X_{t+1}=1 | X_t) = model.transmat_[state_sequence[-1], 1]
      F(log(q) | X=1) = scipy.stats.norm.cdf(np.log(q), mus[1], sigmas[1])
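
      A minimal sketch of that calculation (q is an assumed flow threshold; mus, sigmas and P are the re-ordered parameters returned by fitHMM in the post, and state_sequence comes from the snippet above):

      q = 5000000  # hypothetical flow value, in the same units as AnnualQ
      last_state = state_sequence[-1]
      prob_below = P[last_state, 0]*ss.norm.cdf(np.log(q), mus[0], sigmas[0]) + \
                   P[last_state, 1]*ss.norm.cdf(np.log(q), mus[1], sigmas[1])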

  29. Hello,
    My input datasets contain video files in csv format: 9 motions in total, where each motion contains more than 100 csv files. I want to train 9 HMMs with multiple samples. Could you please let me know how I can work with multiple samples (more than 100) to train an HMM?

  30. Hi Julie,
    Thank you very much for this post! It helps a lot!
    I have been trying to use Baum-Welch HMM to fit my 1D time series data. It’s a data set possibly ranging from 0-500, and we estimated that the state-changing steps should be very similar. For example, we think if the difference between state 1 and state 2 is about 50, then it would also be about 50 for the step from state 2 to state 3, or from state 3 to state 4.
    Right now it seems that the step sizes are all different, is there a way to specify this?
    Thanks in advance!

    • Hi Anne,

      It sounds like you want to use a Bayesian estimation approach where you have priors on the parameter values that you’ll update to estimate a posterior based on the likelihood of the observed data rather than strictly using the likelihood of the observed data as I’ve done. According to issue 58 on the hmmlearn github page (https://github.com/hmmlearn/hmmlearn/issues/58), the HMM estimation is all frequentist, but you can provide a “prior” for regularization. Looking at the API for hmmlearn (https://hmmlearn.readthedocs.io/en/latest/api.html) under the GaussianHMM class, there are attributes in the __init__ method for startprob_prior, transmat_prior, means_prior and covars_prior. It sounds to me like you have an idea of the mean values of the states and could therefore define means_prior. Admittedly, I am a bit confused on how to do that as the API says it is only shape (n_components,) but that it specifies the mean and precision of a normal prior for the state means, which I would think would need to be dimension (n_components,2). Maybe you can play around with that.

      Alternatively, you can just change the initial estimates of the means for the likelihood estimation and see if the estimates are near what you expect. See how to do that under “Fixing parameters” here: https://hmmlearn.readthedocs.io/en/stable/tutorial.html. Sorry I can’t be of more help!

      Julie

  31. Hello julie,
    First, thank you for the code. I am new to HMMs; I am trying to adjust your code to n_components=3. My data is a 1106×280 dataframe.

    def fitHMM(Q, nSamples):
        # fit Gaussian HMM to Q
        model2 = GaussianHMM(n_components=3, n_iter=1000).fit(Q)

        # classify each observation as state 0, 1 or 2
        hidden_states = model2.predict(Q)

        # find parameters of Gaussian HMM
        mus = np.array(model2.means_)
        sigmas = np.array(np.sqrt(np.array([np.diag(model2.covars_[0]),np.diag(model2.covars_[1]), np.diag(model2.covars_[2])])))
        P = np.array(model2.transmat_)

        # find log-likelihood of Gaussian HMM
        logProb = model2.score(Q)

        # generate nSamples from Gaussian HMM
        samples = model2.sample(nSamples)

        # re-organize mus, sigmas and P so that first row is lower mean (if not already)
        index_order = np.argsort(np.transpose(mus))
        mus = mus[index_order]
        sigmas = sigmas[index_order]
        newP = np.zeros(np.shape(P))
        state_indices = []
        #for i in range(len(mus)):
        #    newP[i,:] = P[index_order[0][i],index_order]
        #    state_indices.append(np.where(hidden_states==index_order[0][i]))
        #P = newP
        for i in range(len(state_indices)):
            hidden_states[state_indices[i]] = i

        return hidden_states, mus, sigmas, P, logProb, samples
    Can you help me verify if my modification is correct? I couldn’t use the newP because I was getting an error saying I couldn’t broadcast an array of (280×3) into an array of (3,), and I also did not reshape Q because it was giving me an error saying it couldn’t put an array of size 309680 into an array of (1106,1).
    Can you also help me modify it to use a MultinomialHMM?
    Thank you for your time

    • Hi Acen,
      Sorry for the late reply. The issue is you need to reshape your data into a column vector instead of a row vector. Just let `Q = np.reshape(Q,[len(Q),1])` and it should work.
      Julie

      • I have tried but I get an error

        c = np.reshape(X_train, [len(X_train),1])
        c.shape

        ValueError: cannot reshape array of size 309680 into shape (1106,1)

      • What are the dimensions of X_train? It should be a row vector of the data. Is your data multivariate? This code assumes you are fitting a univariate HMM.

      • The dimensions of X_train is 1106×280. 280 are the columns of the different features and 1106 is the length of each feature. my data is multivariate.

      • What are you trying to fit an HMM to? Are you trying to fit a 3-state, 280-dimensional HMM?! 1106 observations is not enough data to be doing that.

      • Yes, I am trying to fit a 3-state, 280-dimensional HMM. That’s the data I have: 1106 samples for training. Can you help me with modifying the code for my HMM from univariate to multivariate?

      • See this Stack Overflow answer for fitting a multivariate Gaussian HMM: https://stackoverflow.com/questions/52141332/is-it-possible-to-fit-a-multivariate-gmhmm-in-hmmlearn. But I wouldn’t recommend using all 280 features. For even 1 state, that would require estimating a 280×280 covariance matrix. Since the covariance matrix is symmetric, that requires the estimation of 280*279/2 = 39,060 covariances and you only have 1106 observations! You should choose only the most informative features.

  32. Pingback: A Hidden-Markov Modeling Based Approach to Creating Synthetic Streamflow Scenarios – Hydrogen Water

  33. Hi Julie. Your code is very good!

    Do you know if I can use the same hmmlearn library for a higher-order HMM with continuous observations? I’m having trouble finding a library that matches this.

    • Yes, just change n_components from 2 to whatever order you want to fit (and order the states by increasing mean). If you don’t know what order is best, try multiple and compare their log-likelihoods. You can also penalize the log-likelihood by how many parameters you have to estimate for each HMM using the formula for AIC or BIC to prevent overfitting. See answers above for others who have asked similar questions for more details.

  34. Pingback: Parameter estimation of Hidden Markov Models using a Markov Chain Monte Carlo method in Julia – Water Programming: A Collaborative Research Blog

  35. Pingback: Exploring droughts produced by an ensemble of synthetically generated streamflow records – Water Programming: A Collaborative Research Blog

  36. Hi Julie,
    Is it possible to fit a hidden Markov model for predicting sugarcane pests? In this case, the symptoms of sugarcane pests are considered as the observed states and the sugarcane pests are considered as the hidden states.
    Please let me know if it is possible or not.

    • Typically you would observe a continuous value that would be indicative of an underlying discrete state. So if the symptoms are continuous and there aren’t confounding factors beyond pests that could explain those symptoms (e.g. if you just picked yield as your symptom, that wouldn’t be good because weather and soil conditions would likely also influence the state), you could probably assume the states are representative of different combinations of pests. This is outside my area of expertise though, so I don’t want to say for sure.
