Practical 8: Fitting a Hidden Markov Model¶

Anastasia Giachanou, Tina Shahedi

Machine Learning with Python - Utrecht Summer School

Welcome to the Hidden Markov Model practical!

This practical is built on Emmeke Aarts's practical workshop on "Extracting personalised latent dynamics using a multilevel hidden Markov model".

In this practical, we introduce a dataset and use it to fit a two-state hidden Markov model (HMM) (so not multilevel).

Explore the documentation of libraries like hmmlearn or pomegranate, which are used for building and analyzing hidden Markov models. The complete documentation for both can be accessed below:

  1. hmmlearn
  2. pomegranate

For this practical, we will use several Python libraries. The hmmlearn library is a popular choice for building Hidden Markov Models in Python. We start by ensuring that it is installed in our environment.

Learning objectives:

By the end of this practical, you will be able to:

  • Understand the core principles of Hidden Markov Models and how they apply to emotion data.
  • Fit a two-state HMM using hmmlearn to model latent mood dynamics.
  • Visualize emission distributions (probability of mood values per state) and interpret emotional states.
  • Predict and analyze hidden state sequences across individuals.
  • Compare emotional profiles and transitions across subjects using visualizations and summary statistics.
In [ ]:
!pip install hmmlearn
!pip install plotly
Collecting hmmlearn
  Downloading hmmlearn-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.9 kB)
Requirement already satisfied: numpy>=1.10 in /usr/local/lib/python3.10/dist-packages (from hmmlearn) (1.25.2)
Requirement already satisfied: scikit-learn!=0.22.0,>=0.16 in /usr/local/lib/python3.10/dist-packages (from hmmlearn) (1.2.2)
Requirement already satisfied: scipy>=0.19 in /usr/local/lib/python3.10/dist-packages (from hmmlearn) (1.11.4)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn!=0.22.0,>=0.16->hmmlearn) (1.4.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn!=0.22.0,>=0.16->hmmlearn) (3.5.0)
Downloading hmmlearn-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (161 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 161.1/161.1 kB 2.0 MB/s eta 0:00:00
Installing collected packages: hmmlearn
Successfully installed hmmlearn-0.3.2
Requirement already satisfied: plotly in /usr/local/lib/python3.10/dist-packages (5.15.0)
Requirement already satisfied: tenacity>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from plotly) (8.5.0)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from plotly) (24.1)
In [ ]:
# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import hmmlearn
import plotly.graph_objects as go
from hmmlearn import hmm
from scipy.stats import norm

Data¶

We will be working with an open access dataset from Rowland and Wenzel (2020), which is part of a study involving 125 undergraduate students from the University of Mainz in Germany. These students completed a 40-day ambulatory assessment six times a day, reporting on their affective experiences such as happiness, excitement, relaxation, satisfaction, anger, anxiety, depression, and sadness. These affective states were quantified using a visual analog slider, ranging from 0 to 100.

Before the data collection, participants were randomly assigned to either a group receiving weekly mindfulness treatment during the study or a control group. We will be working with a cleaned version of this dataset, provided by Haslbeck, Ryan, & Dablander (2023), which can be found in their OSF repository.

1. Load the dataset ('emotion_data.csv') that can be found in the folder with the practicals' datasets. You can use the function pd.read_csv(). Give the dataset the name 'emotion_data'. Then, inspect the first few rows of the data.
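A minimal sketch of this step could look as follows (assuming 'emotion_data.csv' is located in your working directory; adjust the path if the practicals' datasets are stored elsewhere):

# Load the dataset and inspect the first few rows
emotion_data = pd.read_csv('emotion_data.csv')
emotion_data.head()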

The dataset contains multiple affective states recorded over time for multiple participants. Each row in the dataset represents a unique measurement instance. Before we fit a hidden Markov model, we need to understand the structure of the data, including the variables and their relationships.

We can see a summary of the variables here:

| No | Name | Label | Options |
| --- | --- | --- | --- |
| 1 | subjno | individual code | 1-164 |
| 2 | dayno | study days | 1-40 |
| 3 | beep | daily signals | 1-6 |
| 4 | group | condition allocation | 1 = control, 2 = training |
| 5 | emo1_m | happy | 0-100 |
| 6 | emo2_m | excited | 0-100 |
| 7 | emo3_m | relaxed | 0-100 |
| 8 | emo4_m | satisfied | 0-100 |
| 9 | emo5_m | angry | 0-100 |
| 10 | emo6_m | anxious | 0-100 |
| 11 | emo7_m | depressed | 0-100 |
| 12 | emo8_m | sad | 0-100 |

Next, we need to preprocess the data to make it suitable for fitting a hidden Markov model. First, we check for missing values by applying the .isnull().sum() function to each column in the dataset:

In [ ]:
# Count of NaN values in each column
NaN_counts = emotion_data.isnull().sum()
NaN_counts
Out[ ]:
subj_id         0
dayno           0
beep            0
group           0
happy        8430
excited      8430
relaxed      8430
satisfied    8430
angry        8430
anxious      8430
depressed    8430
sad          8430
dtype: int64

In this case, we will fill the NaN values with the column means.

HMMs model temporal sequences where the order and alignment of observations matter. Removing rows with NaNs would disrupt the time series continuity, making the sequence shorter and potentially meaningless. Also, the hmmlearn library expects a full matrix of observations. If any row has a missing value, it will raise an error. Mean imputation ensures the input stays complete.

Mean imputation is a simple, fast, and model-agnostic approach that preserves the overall mean of each variable, though it shrinks the variance and is not always optimal.

In [ ]:
# Fill NaN values with the column means
emotion_data_clean = emotion_data.fillna(emotion_data.mean())

Now, we will create a plot to visualize the hidden states over time for each subject in the dataset.

2. Visualize time sequences for individual subjects separately to analyze emotional responses over time. Start by transforming your dataset from wide to long format using the melt() function from pandas, which prepares it for detailed analysis. Select only five emotions: 'happy', 'excited', 'relaxed', 'angry', 'depressed'. What does the dataframe look like now?

pandas.melt() transforms a DataFrame from wide format to long format.

  • Wide format: Each emotion is in its own column.
  • Long format: Emotions are stored in a single column (variable), and their values are stored in another (value).

Example:

Before melting (wide format):

| id | time | happy | excited | angry |
| --- | --- | --- | --- | --- |
| 1 | 1 | 0.5 | 0.8 | 0.2 |
| 1 | 2 | 0.4 | 0.6 | 0.3 |

After melting (long format):

pd.melt(df, id_vars=['id', 'time'], value_vars=['happy', 'excited', 'angry'])

| id | time | variable | value |
| --- | --- | --- | --- |
| 1 | 1 | happy | 0.5 |
| 1 | 1 | excited | 0.8 |
| 1 | 1 | angry | 0.2 |
| 1 | 2 | happy | 0.4 |
| 1 | 2 | excited | 0.6 |
| 1 | 2 | angry | 0.3 |

Why do we do this?

We want to plot how each emotion changes over time. To do this with tools like plotly, the data should be in long format, so that each row is one observation of an emotion at a specific time point.
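Applied to our data, the reshaping could look like the following sketch (the identifier column names follow the NaN-count output above; adjust them if your file differs):

# Reshape from wide to long format, keeping the identifier columns
emotion_long = pd.melt(
    emotion_data_clean,
    id_vars=['subj_id', 'dayno', 'beep', 'group'],
    value_vars=['happy', 'excited', 'relaxed', 'angry', 'depressed'],
    var_name='emotion',
    value_name='score'
)
emotion_long.head()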

3. Create a set of individual line plots to visualize emotional responses over time for a small group of subjects. Use seaborn.FacetGrid to generate separate plots for the first four subjects, making it easier to interpret each subject’s emotional patterns without clutter.
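One possible approach is sketched below, assuming the long-format DataFrame from the previous step is named emotion_long and that subject IDs are numeric starting at 1:

# Line plots of emotional responses over time for the first four subjects
subset = emotion_long[emotion_long['subj_id'] <= 4].copy()
# Combine day and beep into a single running time index (6 beeps per day)
subset['time'] = (subset['dayno'] - 1) * 6 + subset['beep']
g = sns.FacetGrid(subset, col='subj_id', col_wrap=2, height=3, aspect=2)
g.map_dataframe(sns.lineplot, x='time', y='score', hue='emotion')
g.add_legend()
plt.show()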

Setting Up the Hidden Markov Model¶

In this section, you will fit a 2-state hidden Markov model. For this, you need a DataFrame in which only the affective variables you want to use in the model are included, plus the subject ID as the first column. Please feel free to use any subset of the provided affective variables, but do make sure to select at least two affective variables.

4. Create a dataframe emotion_mHMM which has the subject ID variable in the first column, followed by only the affective variables you want to use in the hidden Markov model. We are using 'happy', 'excited', 'relaxed', 'angry', and 'depressed'.
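For example, a sketch of this selection:

# Subject ID first, followed by the selected affective variables
emotion_mHMM = emotion_data_clean[['subj_id', 'happy', 'excited',
                                   'relaxed', 'angry', 'depressed']]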

5. Set up the first set of model input arguments: m = 2 for the number of states, n_dep for the number of dependent variables (in our case this will be 5), and starting values for gamma (the initial state transition matrix, which can be defined with an np.array) and for the emission distributions (the starting means and standard deviations of the Gaussian emission distribution of each selected variable).
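A minimal sketch of these input arguments (the concrete starting values below are illustrative assumptions, not prescribed ones):

# Number of hidden states and dependent variables
m = 2
n_dep = 5

# Starting values for the state transition matrix (rows sum to 1)
start_gamma = np.array([[0.8, 0.2],
                        [0.2, 0.8]])

# Starting values for the Gaussian emission distributions:
# one (m x 2) array per dependent variable, with columns [mean, sd]
start_emiss = [
    np.array([[70, 15], [30, 15]]),  # happy
    np.array([[60, 15], [25, 15]]),  # excited
    np.array([[70, 15], [35, 15]]),  # relaxed
    np.array([[10, 10], [40, 15]]),  # angry
    np.array([[5, 10], [30, 15]]),   # depressed
]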

Next, we convert the data into the format required by hmmlearn.

hmmlearn expects the following parameter shapes:

  • means_: shape (n_components, n_features) → (2 states × 5 variables)
  • covars_: the same shape, containing the variances

So we are going to extract from our data the following:

  • Means per state → start_means
  • Variances per state → start_covars (note: variance = std²)
In [ ]:
# Convert start_emiss to the format required by hmmlearn:
# each element of start_emiss is an (m x 2) array with columns [mean, sd],
# so column 0 holds the state means and column 1 the standard deviations
start_means = np.array([emiss[:, 0] for emiss in start_emiss]).T        # (m, n_dep)
start_covars = np.array([emiss[:, 1] for emiss in start_emiss]).T ** 2  # variance = sd**2

We also drop the subj_id column from the emotion_mHMM DataFrame, since HMMs learn from the sequence of observed features rather than from subject identity. This ensures that the observations matrix consists only of the emotional state data for the Hidden Markov Model (HMM) analysis.

In [ ]:
observations = emotion_mHMM.drop(columns=['subj_id']).values
# Ensure the shape of observations
print(f"Shape of observations: {observations.shape}")
Shape of observations: (30000, 5)

Fitting a 2-State Hidden Markov Model with Custom Initialization¶

6. Initialize a Gaussian Hidden Markov Model (hmm.GaussianHMM) with n_components=m (m was already set to 2) and n_iter=500. Set the model's initial state probabilities (np.array([0.5, 0.5])), transition matrix, means, and covariances using the predefined variables (start_gamma, start_means, start_covars).
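A sketch of this initialization, assuming the starting values defined above (init_params='' tells hmmlearn not to overwrite the manually set parameters when fit() is called):

# Initialize a Gaussian HMM with custom starting values
model = hmm.GaussianHMM(
    n_components=m,
    covariance_type='diag',
    n_iter=500,
    init_params=''  # keep the manually set parameters below
)
model.startprob_ = np.array([0.5, 0.5])
model.transmat_ = start_gamma
model.means_ = start_means
model.covars_ = start_covars  # shape (m, n_dep) for 'diag' covariances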

Next, we will proceed to fit a Hidden Markov Model to this preprocessed data.

7. Fit (guess what the function is called... yes, right, it is called fit()) a 2-state hidden Markov model. That is, use the fit method of the HMM model to train it on the prepared observations data. Assign the fitted model to a variable named out_2st_emotion.
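The fitting step is then a single call (a sketch, assuming model and observations from above; note that fit() returns the fitted model itself):

# Train the 2-state HMM on the observation matrix
out_2st_emotion = model.fit(observations)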

Inspecting General Model Output¶

8. Let’s evaluate the fitted Hidden Markov Model using two key metrics. First, count how many unique individuals are included in the dataset (use .nunique() on subj_id). Second, compute the overall average log-likelihood of the model across all data points (model.score()). Print the number of subjects and the average log-likelihood to summarize these key metrics. They will help you assess which individuals the model is fitted on and how well it explains the observed emotional data.
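A possible sketch (model.score() returns the total log-likelihood, so we divide by the number of observations to obtain the average):

# Number of unique subjects in the data
n_subjects = emotion_mHMM['subj_id'].nunique()

# Average log-likelihood per observation
avg_loglik = out_2st_emotion.score(observations) / observations.shape[0]

print(f"Number of subjects: {n_subjects}")
print(f"Average log-likelihood: {avg_loglik:.3f}")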

9. Document the structure of your Hidden Markov Model. First, use the n_components attribute to find out how many hidden states were used. Next, use the .shape attribute of the observations array to determine how many affective (dependent) variables were included. Print both values to summarize the model’s configuration.
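This could look like the following sketch:

# Number of hidden states and of dependent variables in the model
print(f"Hidden states: {out_2st_emotion.n_components}")
print(f"Dependent variables: {observations.shape[1]}")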

10. Let’s explore what the Hidden Markov Model has learned. Print the state transition probability matrix using model.transmat_ to understand how likely it is to switch or stay in the same state. Store the state transition matrix from the fitted HMM model in a variable called gamma.
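A sketch of this step:

# Learned state transition probability matrix
gamma = out_2st_emotion.transmat_
print(gamma)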

11. Now, we will access the model’s learned parameters for each emotion in both hidden states. Use the model.means_ and model.covars_ attributes from the model to extract and save these parameters. Organize the emission statistics into a dictionary called emiss, so each emotion shows its mean and standard deviation in both states. Print emiss to inspect the mean and standard deviation for each dependent variable in each state. What are your conclusions? Which emotions have the most distinct means between State 1 and State 2?
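One way to organize this is sketched below, assuming the five variables listed earlier (for covariance_type='diag', hmmlearn returns covars_ as full (n_components, n_features, n_features) matrices, so we take the diagonal of each):

# Collect the per-state mean and standard deviation of each emotion
emotions = ['happy', 'excited', 'relaxed', 'angry', 'depressed']
means = out_2st_emotion.means_                                      # (2, 5)
variances = np.diagonal(out_2st_emotion.covars_, axis1=1, axis2=2)  # (2, 5)
emiss = {
    emo: {'mean': means[:, i], 'sd': np.sqrt(variances[:, i])}
    for i, emo in enumerate(emotions)
}
print(emiss)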

Visualizing the Obtained Output¶

Visualizing the transition probabilities can be very helpful, especially when dealing with a large number of states and/or dependent variables.

12. In this question, you will create a Sankey diagram to visually represent the transition probabilities between hidden emotional states in your HMM. The Sankey diagram will help you understand how frequently the model predicts staying in the same state vs switching states. Visually compare transitions from State 1 and State 2. Your diagram should include:

  1. Nodes for each state (e.g., State 1, State 2 as both source and target),
  2. Arrows (links) representing transitions from and to each state,
  3. Arrow thickness scaled to the transition probability,
  4. Labels and colors to distinguish directions (e.g., blue for staying, red for switching).

We give you a structure to help you with that.

# Step 1: Define state labels and number of states
states = [...]  # e.g., ['State 1', 'State 2']
num_states = ...

# Step 2: Prepare labels for source and target nodes
labels = ...
source = ...  # from-state indices
target = ...  # to-state indices
values = ...  # flatten the transition matrix (e.g., gamma.flatten())

# Step 3: Create the Sankey diagram
fig = go.Figure(data=[go.Sankey(
    node=dict(
        label=...,
        color=...,
        # padding, thickness, and line style (optional)
    ),
    link=dict(
        source=...,
        target=...,
        value=...,
        color=...  # Optional: different colors for stay vs switch
    )
)])

# Step 4: Customize layout
fig.update_layout(title_text="...", font_size=..., width=..., height=...)
fig.show()

13. Visualize the transition probabilities gamma saved in the gamma object. Use a heatmap to represent the probabilities of transitioning from one state to another.
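A minimal sketch with seaborn:

# Heatmap of the transition probabilities
sns.heatmap(gamma, annot=True, cmap='Blues', vmin=0, vmax=1,
            xticklabels=['State 1', 'State 2'],
            yticklabels=['State 1', 'State 2'])
plt.xlabel('To state')
plt.ylabel('From state')
plt.title('Transition probabilities')
plt.show()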

14. Visualize the emission distributions to understand the mean values of different dependent variables across states. Prepare the data by creating a DataFrame with the state, mean emission value, and dependent variable name using the emiss dictionary. Use Seaborn's barplot function to create the plot.
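A possible sketch, assuming the emiss dictionary built above:

# Long-format DataFrame of mean emission values per state and emotion
emiss_df = pd.DataFrame([
    {'State': f'State {s + 1}', 'Dep': emo, 'Mean': emiss[emo]['mean'][s]}
    for emo in emiss
    for s in range(m)
])
sns.barplot(data=emiss_df, x='Dep', y='Mean', hue='State')
plt.xlabel('Dependent variable')
plt.ylabel('Mean emission value')
plt.show()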

15. Compute the density probabilities for each emotion and state. First, define the length of the grid and generate a sequence of mood values ranging from 1 to 100. Select only those emotions that are present in the emiss dataset to ensure the accuracy of the density plots. Create a DataFrame emiss_dens with columns for state, emotion (Dep), mood value (X), and probability (Prob). Finally, using the normal distribution parameters, calculate the density probabilities for each emotion-state pair.
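This could be sketched as follows, using scipy.stats.norm (imported at the start) and the emiss dictionary:

# Grid of mood values and normal densities per emotion-state pair
grid = np.linspace(1, 100, 100)
rows = []
for emo in emiss:
    for s in range(m):
        mean, sd = emiss[emo]['mean'][s], emiss[emo]['sd'][s]
        for x, p in zip(grid, norm.pdf(grid, loc=mean, scale=sd)):
            rows.append({'State': f'State {s + 1}', 'Dep': emo,
                         'X': x, 'Prob': p})
emiss_dens = pd.DataFrame(rows)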

16. Create subplots for each emotion using seaborn.FacetGrid with specified attributes. Plot density probabilities for mood values differentiated by state using sns.lineplot. Add vertical dashed lines at the mean mood values for each state. Set y-axis limits from 0 to 0.05 for consistency, and include a legend and axis labels for mood value and density.
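One way to sketch this, assuming the emiss_dens DataFrame from the previous step:

# One density panel per emotion, colored by state
g = sns.FacetGrid(emiss_dens, col='Dep', col_wrap=3, height=3)
g.map_dataframe(sns.lineplot, x='X', y='Prob', hue='State')
# (vertical dashed lines at the state means can be added per axis with axvline)
g.set(ylim=(0, 0.05))
g.set_axis_labels('Mood value', 'Density')
g.add_legend()
plt.show()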

Obtaining the Most Likely Sequence of Hidden States¶

17. Using the predict method from hmmlearn on the fitted model and the observations derived from emotion_mHMM, obtain the most likely hidden state sequence. Save the state sequence in the object emotion_states_2st, and inspect the object.
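A sketch of this step (predict() performs Viterbi decoding by default):

# Most likely hidden state sequence
emotion_states_2st = out_2st_emotion.predict(observations)
print(emotion_states_2st[:20])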

Plotting the Inferred State Sequence Over Time¶

In this section we will plot the inferred state sequence over time for a selection of the subjects within the data. To plot the state sequences over time, a new variable denoting the beep moment needs to be added to the matrix containing the inferred states.

18. Now, create a DataFrame that includes subj_id, state, and beep moment. Filter the DataFrame to include subjects 1 to 10, and map the states to categorical labels ('state 1' and 'state 2'). Use Seaborn's 'Set2' palette for the states and FacetGrid to create horizontal bar plots for each subject, adjusting the figure size and aspect ratio for better readability. Customize the plot with axis labels, titles, and a legend.
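A simplified sketch of this step (plotting the state sequence as colored points per subject rather than horizontal bars; assumes emotion_states_2st codes the states as 0 and 1):

# Combine subject IDs, inferred states, and a running beep index
state_df = pd.DataFrame({
    'subj_id': emotion_mHMM['subj_id'].values,
    'state': np.where(emotion_states_2st == 0, 'state 1', 'state 2'),
})
state_df['beep_nr'] = state_df.groupby('subj_id').cumcount() + 1

# One row of the grid per subject, for subjects 1 to 10
subset = state_df[state_df['subj_id'].between(1, 10)]
g = sns.FacetGrid(subset, row='subj_id', height=1, aspect=8)
g.map_dataframe(sns.scatterplot, x='beep_nr', y='state',
                hue='state', palette='Set2', s=10)
g.set_axis_labels('Beep moment', 'State')
g.add_legend()
plt.show()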

End of Practical!