Practical 9: Movie review classification using Active learning¶

Anastasia Giachanou & Misja Groen

Machine Learning with Python - Utrecht Summer School

Active Learning is becoming increasingly popular in Machine Learning. Instead of training on one fixed labeled dataset, the model is trained iteratively and requests labels only for the samples it selects, so it can often reach competitive results with less labeled training data.

An Active Learning pipeline consists of a classifier and an oracle. The oracle, either an individual or a group of annotators, cleans, selects, and labels data and feeds it to the model, ensuring consistent labeling. The process starts by annotating a small subset of the dataset to train an initial model, saving the best model checkpoint, and testing it on a balanced set. After this initial evaluation, the oracle labels more samples as needed, adds them to the training set, and the cycle repeats until the model reaches acceptable performance.

In this practical, we'll dive into Active Learning using the IMDB dataset, which has 50,000 movie reviews split evenly between positive and negative sentiments. We'll explore three strategies:

  1. Simple Evaluation Study: We'll use pool-based active learning with uncertainty sampling, where the model queries the most uncertain samples and retrains iteratively. Other sampling techniques include:
  • Committee Sampling: Multiple models vote on the best data points to sample.
  • Entropy Reduction: Samples are selected based on the highest entropy scores.
  • Minimum Margin Based Sampling: Chooses data points closest to the decision boundary.
  2. Multi-annotator Pool-based Active Learning: This simulates multiple annotators with varying noise levels, using a SingleAnnotatorWrapper with probabilistic active learning. It highlights how multiple annotators impact model performance.

  3. Stream-based Active Learning: Here, we implement a stream-based approach using StreamRandomSampling and StreamProbabilisticAL, ideal for real-time decision-making as data continuously flows in.
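
The uncertainty-based criteria listed under the first strategy can be made concrete with a few lines of plain Python. This is only an illustrative sketch: the function names and the two example probability vectors are our own, and it does not use or reproduce the scikit-activeml implementation.

```python
import math

def least_confidence(probs):
    """1 - max probability: high when no class is clearly favoured."""
    return 1.0 - max(probs)

def margin(probs):
    """Gap between the two most likely classes: a small gap means high uncertainty."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

def entropy(probs):
    """Shannon entropy (natural log): largest for a uniform distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical predicted class probabilities for two reviews
confident = [0.95, 0.05]
uncertain = [0.55, 0.45]

for name, probs in [("confident", confident), ("uncertain", uncertain)]:
    print(name,
          round(least_confidence(probs), 3),
          round(margin(probs), 3),
          round(entropy(probs), 3))
```

All three scores agree that the second review is the more uncertain one, so an uncertainty-based query strategy would ask the oracle to label it first.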

We'll classify movie reviews as positive or negative using their text. This binary classification task is both significant and widely applicable in machine learning. Let's get started and see how we can implement these strategies together!

Let's get started¶

We will use the scikit-activeml library. It implements many common query strategies and is easy to use because it is built on top of scikit-learn. We'll show how it works by classifying IMDB reviews using the active learning cycle. Let's start by installing it with pip install scikit-activeml and importing the needed packages from scikit-learn and scikit-activeml. Make sure they are all installed!

In [1]:
!pip install scikit-activeml
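# Note: scikit-activeml 0.6.2 declares numpy>=1.26 and scipy>=1.11.3, so pip reports a dependency conflict for the pins below.
!pip install numpy==1.24.4 scipy==1.10.1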
!pip install matplotlib
!pip install pandas
!pip install ipympl

No need to worry if you see >/dev/null 2>&1 after an install command! It only hides the output and keeps our practical tidy :)

In [2]:
import numpy as np
import matplotlib as mlp
import matplotlib.pyplot as plt
import pandas as pd
import re
import string
import skactiveml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from skactiveml.classifier import SklearnClassifier, ParzenWindowClassifier
from skactiveml.pool import UncertaintySampling, ProbabilisticAL, RandomSampling
from skactiveml.pool.multiannotator import SingleAnnotatorWrapper
from skactiveml.stream import StreamRandomSampling, StreamProbabilisticAL
from skactiveml.utils import unlabeled_indices, labeled_indices, MISSING_LABEL, majority_vote, call_func
from skactiveml.visualization import plot_utilities, plot_decision_boundary
from collections import deque
from scipy.ndimage import gaussian_filter1d
from sklearn.manifold import TSNE

import warnings
mlp.rcParams["figure.facecolor"] = "white"
warnings.filterwarnings("ignore")

Loading the IMDB Dataset¶

We'll be using the IMDB dataset, featuring 50,000 movie reviews from the Internet Movie Database, for our experiments. Now it is time to load the dataset:

# Load the IMDB dataset
df = pd.read_csv("IMDB Dataset.csv")
ParserError: Error tokenizing data. C error: EOF inside string starting at row 16597

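When loading real-world datasets, you may encounter a ParserError like the one above. The message "EOF inside string" means that the default C parser found an opening quotation mark without a matching closing one before the end of the file. This usually points to malformed rows (for example unescaped quotes inside a review) or to an incomplete file, not to the size of the file as such. A common workaround is to use the engine='python' parameter in the read_csv function call, which can often handle such complex CSV structures, together with the on_bad_lines parameter to skip problematic lines. Keep in mind that skipped lines are dropped silently. The call looks like this: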

# Load the IMDB dataset with proper handling for encoding and skipping bad lines
df = pd.read_csv("IMDB Dataset.csv", engine="python", on_bad_lines='skip')

If you work in Google Colab, it is also worth checking the file path: mounting Google Drive ensures that the path points to the complete file. Note that a wrong path raises a FileNotFoundError rather than a ParserError, whereas an incompletely uploaded file can trigger the ParserError shown above. We will load the data using the same read_csv call:

In [3]:
df = pd.read_csv("IMDB Dataset.csv", engine="python", on_bad_lines='skip')

These methods ensure you can load the dataset effectively, even if there are issues with the CSV file's formatting.
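
To see why a single unbalanced quotation mark is a problem, here is a small standard-library sketch. It uses Python's csv module on a made-up four-line file rather than pandas on the IMDB file, and the csv module is more lenient than the pandas C parser: instead of raising an error, it returns everything after the opening quote as one field.

```python
import csv
import io

# Hypothetical mini file: the third line opens a quote that is never closed.
raw = (
    "review,sentiment\n"
    "fine film,positive\n"
    '"unterminated review,negative\n'
    "last one,positive\n"
)

rows = list(csv.reader(io.StringIO(raw)))
for row in rows:
    # Print the number of fields and the parsed fields of each row
    print(len(row), row)
```

The file has four lines, but only three rows come back, and the last one consists of a single field that has swallowed the rest of the file. This is the situation that the "EOF inside string" message describes.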

Pre-processing the Text Data¶

Pre-processing text data is a crucial step in text mining and machine learning tasks. It involves cleaning and removing noise from the text data to make it analyzable and transform it into a form that machine learning algorithms can work with effectively. There are various approaches to preprocessing text data. One approach is minimal preprocessing, which involves the most essential steps like lowercasing, punctuation removal, and whitespace normalization. In contrast, the full preprocessing approach includes additional steps such as tokenization, stop word removal, stemming, and lemmatization to thoroughly clean and prepare text data.

For more information on several common ways to deal with text data, please refer to the A Beginner's Guide to Dealing with Text Data tutorial.

In this practical, we opted for a minimal preprocessing approach without tokenization. This is because the subsequent steps involve using TF-IDF vectorization, which can handle the tokenization implicitly, and the model used (Logistic Regression) requires numerical input rather than raw text. Let's follow the next steps to see how we preprocess the text data in this case.

Text Preprocessing¶

Text preprocessing cleans the text and removes noise so that it is easier to analyze and better suited to machine learning algorithms. At the beginning of this practical, we imported two libraries that we need for this step: re and string. The re library supports regular expressions for pattern matching, and the string library provides constants such as the set of punctuation characters. The preprocess_text function defined below converts text to lowercase, removes punctuation using re.sub(), and collapses extra whitespace with re.sub().strip(). We apply this function to each review in the dataset to clean the text.

In [4]:
# Preprocess the text data
def preprocess_text(text):
    text = text.lower()  # Lowercase text
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text)  # Remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra whitespace
    return text

df['review'] = df['review'].apply(preprocess_text)

df.head()
Out[4]:
review sentiment
0 one of the other reviewers has mentioned that ... positive
1 a wonderful little production br br the filmin... positive
2 i thought this was a wonderful way to spend ti... positive
3 basically theres a family where a little boy j... negative
4 petter matteis love in the time of money is a ... positive
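
In the output above, the HTML line breaks of the raw reviews survive as the tokens "br br", because only the punctuation of each <br /> tag is removed. If you want to avoid that, one option is to drop HTML tags before removing punctuation. The function name preprocess_text_strip_html and the tag pattern below are our own illustration and are not used in the rest of this practical.

```python
import re
import string

def preprocess_text_strip_html(text):
    """Variant of preprocess_text that removes HTML tags such as <br /> first."""
    text = text.lower()                                            # Lowercase text
    text = re.sub(r'<[^>]+>', ' ', text)                           # Replace HTML tags by a space
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text)  # Remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()                       # Collapse extra whitespace
    return text

example = "A wonderful little production. <br /><br />The filming technique is unassuming."
print(preprocess_text_strip_html(example))
```

Replacing each tag by a space (rather than by nothing) keeps the words on both sides of the tag from being glued together.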

When working with large datasets, starting with a smaller subset for initial testing allows for quicker iterations and helps identify issues before scaling up. Here, we reduce the IMDB dataset to 10,000 samples using Pandas' sample method. The random_state parameter ensures reproducibility, selecting the same 10,000 samples each time.

In [5]:
# Reduce the dataset size for initial testing (10,000 samples)
df = df.sample(10000, random_state=42)

Next, we convert the sentiment labels to binary values, where 'positive' is mapped to 1 and 'negative' to 0.

In [6]:
# Convert labels to binary
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})

Let's split data into training and test sets with an 80/20 ratio using train_test_split. This results in X_train and y_train for training, and X_test and y_test for testing, ensuring that the model is trained and evaluated on separate data.

In [7]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.2, random_state=42)

TF-IDF and Vectorization¶

Once the text data is preprocessed, it needs to be converted into a numerical format that machine learning algorithms can work with. This process is known as vectorization. One of the most common methods for vectorization is the TF-IDF (Term Frequency-Inverse Document Frequency) approach.

TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). The TF-IDF value increases proportionally with the number of times a word appears in the document but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.

Term Frequency (TF): Measures how frequently a term appears in a document. It is calculated by dividing the number of times a term appears in a document by the total number of terms in that document.

$$ \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} $$

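Inverse Document Frequency (IDF): Measures how informative a term is. Terms that occur in only a few documents get a high value, while terms that occur in almost every document get a value close to zero. It is calculated by taking the logarithm of the number of documents in the corpus divided by the number of documents containing the term.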

$$ \text{IDF}(t) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing term } t} \right) $$

TF-IDF Score: The TF-IDF score is the product of the TF and IDF scores. It reflects the importance of a term in a document within the corpus.

Formula: $$ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) $$
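
The three formulas above can be checked by hand on a tiny example. The following sketch implements them literally in plain Python; the toy corpus and the function names are made up for illustration.

```python
import math

# Toy corpus (hypothetical): each document is already split into tokens.
corpus = [
    ["good", "movie"],
    ["bad", "movie"],
    ["good", "good", "plot"],
]

def tf(term, doc):
    """Term frequency: occurrences of `term` divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(N / number of documents containing `term`).
    Assumes the term occurs in at least one document."""
    n_containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    """TF-IDF score: product of TF and IDF."""
    return tf(term, doc) * idf(term, docs)

for term in ["good", "plot"]:
    print(term, round(tf_idf(term, corpus[2], corpus), 4))
```

In the third document, "good" appears twice but also occurs in another document, while "plot" appears once and nowhere else, so "plot" ends up with the higher score. Note that scikit-learn's TfidfVectorizer uses, by default, a smoothed IDF and normalizes each document vector, so its values will differ from this textbook version.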

1. To apply TF-IDF vectorization, create an instance of the class TfidfVectorizer with max_features=5000 to limit the vocabulary size. Use fit_transform() on the training data to learn the vocabulary and convert the text into TF-IDF vectors. Then, apply transform() on the test data to vectorize it using the same vocabulary.

Note: When working with text data, our features are the tokens (here, the words in the learned vocabulary).

In [8]:
vectorizer = TfidfVectorizer(max_features=5000)
X_train_vect = vectorizer.fit_transform(X_train)
X_test_vect = vectorizer.transform(X_test)
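
Why fit_transform() on the training data but only transform() on the test data? The sketch below mimics the idea with plain token counts instead of TF-IDF weights. The helper names build_vocabulary and count_vector and the example sentences are hypothetical and only serve to show the mechanics.

```python
from collections import Counter

def build_vocabulary(train_docs, max_features):
    """Keep the `max_features` most frequent tokens of the training documents."""
    counts = Counter(token for doc in train_docs for token in doc.split())
    top_terms = [term for term, _ in counts.most_common(max_features)]
    return {term: idx for idx, term in enumerate(sorted(top_terms))}

def count_vector(doc, vocabulary):
    """Count the tokens of `doc` that are in the vocabulary; unseen tokens are ignored."""
    vector = [0] * len(vocabulary)
    for token in doc.split():
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector

train_docs = ["good movie good plot", "bad movie"]
test_docs = ["great movie"]

# "fit": the vocabulary is learned from the training documents only
vocabulary = build_vocabulary(train_docs, max_features=2)

# "transform": training and test documents are mapped onto the same columns
train_vectors = [count_vector(doc, vocabulary) for doc in train_docs]
test_vectors = [count_vector(doc, vocabulary) for doc in test_docs]
print(vocabulary, train_vectors, test_vectors)
```

The word "great" only occurs in the test document, so it has no column and is ignored. This keeps the feature space identical for training and test data and prevents information from the test set from leaking into the model.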

Part A: The Active Learning Loop¶

Traditional Supervised Machine Learning¶

Normally, in a traditional (supervised) Machine Learning loop we would:

  • Collect the training data;
  • Train and test a model on the data;
  • Deploy the trained model in production.

However, whether this works depends heavily on whether we have sufficient high-quality labeled training data.

For this example we'll use logistic regression to show what such an ML loop looks like. Keep in mind this is NOT the Active Learning situation just yet.

Scikit-learn offers many easy-to-use classification algorithms. In this example, we'll use the LogisticRegression classifier. In active learning scenarios, you often start with many unlabeled samples, which creates a challenge because scikit-learn classifiers can't handle missing labels directly. To get around this, scikit-activeml provides the SklearnClassifier wrapper, which wraps a scikit-learn classifier (e.g., LogisticRegression) so that it can be fitted and queried even when only part of the data is labeled. For the traditional loop below all labels are available, so plain LogisticRegression is enough.

2. Create such a Machine Learning loop using LogisticRegression.

In [9]:
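# Convert the sparse TF-IDF matrix to a dense array and fit on the fully labeled training set
X_log, y_log = X_train_vect.toarray(), y_train
clf = LogisticRegression(random_state=0).fit(X_log,y_log)
print(clf.predict(X_log[:5]))
print(X_train[:5], y_train[:5])
# Note: this score is computed on the training data, not on the held-out test set
print("Accuracy is: ", clf.score(X_log,y_log))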
[0 0 1 1 0]
2141     i saw it tonight and fell asleep in the movieb...
46172    this is one of them movies that has a awesome ...
18558    and i do mean it if not literally after all i ...
32956    this film has a lot of raw potential the scrip...
13094    hello i normally love movies im 19 i have seen...
Name: review, dtype: object 2141     0
46172    0
18558    1
32956    1
13094    0
Name: sentiment, dtype: int64
Accuracy is:  0.92075

Initialize the active learning¶

Going from a traditional Machine Learning loop to an Active Learning environment, we'll need to make some adjustments to our loop. First, we wrap our classifier so that it can also be trained when most labels are missing.

In [10]:
clf = SklearnClassifier(LogisticRegression(max_iter=1000))

3. Create an initial set of labeled data by setting most labels to missing and randomly selecting a small subset. You can randomly select 10 samples from the training data to label initially.

The first step is to create a label array (y_train_initial) that marks all labels as missing (you can use a constant like MISSING_LABEL). In a real-world setting you would start with a large unlabeled dataset; to simulate that, we hide all the labels. After hiding them, randomly select 10 data points to serve as your initial labeled set. Copy their true labels into your new array while keeping the rest as missing.

In [11]:
# Initialize all training labels as missing
y_train_initial = np.full(y_train.shape, fill_value=MISSING_LABEL)

# Randomly select 10 indices to label
initial_idx = np.random.choice(np.arange(len(y_train)), size=10, replace=False)

# Fill in the true labels for the selected indices
y_train_initial[initial_idx] = y_train.iloc[initial_idx]
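
The same masking idea in plain Python, as a sketch independent of NumPy and scikit-activeml. Here NaN is used as a stand-in for the missing-label marker (an assumption made for this sketch), the label list is made up, and a seeded random generator makes the initial selection reproducible.

```python
import math
import random

MISSING = float("nan")  # stand-in for a missing-label marker (assumption for this sketch)

true_labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0]  # hypothetical oracle labels
rng = random.Random(42)  # fixed seed, so the same samples are revealed on every run

# Start with every label hidden, then reveal a small random subset
labels = [MISSING] * len(true_labels)
initial_idx = rng.sample(range(len(true_labels)), 3)
for i in initial_idx:
    labels[i] = true_labels[i]

# Split the indices into labeled and unlabeled ones
labeled = [i for i, y in enumerate(labels) if not math.isnan(y)]
unlabeled = [i for i, y in enumerate(labels) if math.isnan(y)]
print(f"labeled: {labeled}, unlabeled count: {len(unlabeled)}")
```

Seeding the generator is optional, but without it every run starts from a different initial labeled set, which makes accuracy curves hard to compare between runs.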

To get an idea of what such a loop looks like, we first simulate it with RandomSampling.

We have already created:

  • X_train (vectorized as X_train_vect), which we will use as input
  • y_train, which we will use as output for the classifier
  • clf, the wrapped LogisticRegression, which we will use as the classifier

The only thing still missing is a query strategy.

To start off we will use the RandomSampling strategy, which picks the samples to be labeled at random, without looking at the model. In the next lecture slides you will dive deeper into the different query strategies.

For now, qs = RandomSampling() will be sufficient.
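
Conceptually, random sampling boils down to drawing indices from the samples that have no label yet. The sketch below illustrates that idea with plain NumPy; the function random_query is hypothetical and is not how skactiveml implements RandomSampling internally.

```python
import numpy as np

def random_query(y_partial, batch_size, rng):
    """Pick `batch_size` distinct indices uniformly among the unlabeled samples."""
    unlabeled_idx = np.flatnonzero(np.isnan(y_partial))
    return rng.choice(unlabeled_idx, size=batch_size, replace=False)

rng = np.random.default_rng(42)

# Toy label array: 20 samples, three of which are already labeled.
y_partial = np.full(20, np.nan)
y_partial[[0, 5, 7]] = [1, 0, 1]

query_idx = random_query(y_partial, batch_size=4, rng=rng)
print(query_idx)
```

The already labeled positions (0, 5 and 7) can never be returned, so each query spends the labeling budget on new samples only.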

5. Now, implement 10 iterations of the Active Learning (AL) cycle. In each iteration, select 10 unlabeled samples to be labeled using Random sampling. Determine the selected samples by their indices (query_idx) and assign their labels from y_train to the initially missing labels in y_train_initial. After updating the labels, retrain the classifier on the updated training data. Finally, evaluate the classifier's performance on the test set after each iteration of the AL cycle.

In [12]:
qs = RandomSampling()

# Active learning cycle:
n_queries = 10
for i in range(n_queries):
    # Fit the classifier with current labels.
    clf.fit(X_train_vect.toarray(), y_train_initial)

    # Query the next sample(s).
    query_idx = qs.query(X=X_train_vect.toarray(), y=y_train_initial, batch_size=10)

    # Update labels based on query.
    y_train_initial[query_idx] = y_train.iloc[query_idx]

    # Evaluate the classifier fitted at the start of this iteration (before the new labels were added)
    y_pred = clf.predict(X_test_vect.toarray())
    acc = accuracy_score(y_test, y_pred)
    print(f'Simple Evaluation Iteration {i + 1}/{n_queries}, Accuracy: {acc:.4f}')
Simple Evaluation Iteration 1/10, Accuracy: 0.4995
Simple Evaluation Iteration 2/10, Accuracy: 0.5000
Simple Evaluation Iteration 3/10, Accuracy: 0.5090
Simple Evaluation Iteration 4/10, Accuracy: 0.5895
Simple Evaluation Iteration 5/10, Accuracy: 0.5085
Simple Evaluation Iteration 6/10, Accuracy: 0.6685
Simple Evaluation Iteration 7/10, Accuracy: 0.6750
Simple Evaluation Iteration 8/10, Accuracy: 0.6710
Simple Evaluation Iteration 9/10, Accuracy: 0.6755
Simple Evaluation Iteration 10/10, Accuracy: 0.6895

Part B.1 Active Learning Strategies¶

Set Up the Query Strategy¶

4. For setting up the query strategy (qs), use UncertaintySampling with entropy method, and random_state=42 to identify the most uncertain data points for the model to focus on.

In [13]:
qs = UncertaintySampling(method='entropy', random_state=42)
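
To make the entropy criterion concrete, the following sketch computes the entropy of a few made-up predicted probability vectors and picks the most uncertain ones. It illustrates the general idea only; the probabilities are invented and the helper entropy_scores is not part of skactiveml.

```python
import numpy as np

def entropy_scores(proba):
    """Shannon entropy of each row of class probabilities (higher = more uncertain)."""
    p = np.clip(proba, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)

# Hypothetical predicted probabilities for four unlabeled reviews.
proba = np.array([
    [0.95, 0.05],  # confident
    [0.50, 0.50],  # maximally uncertain
    [0.70, 0.30],
    [0.55, 0.45],
])

scores = entropy_scores(proba)
top2 = np.argsort(scores)[::-1][:2]  # indices of the two highest entropies
print(top2)
```

For two classes the entropy peaks at log(2) when both classes are equally likely, which is why the second review is selected first.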

You can explore additional classifiers and query strategies in scikit-learn and the skactiveml library; their documentation lists the available classifiers and all implemented strategies.

Because the random sampling loop above has already filled in part of y_train_initial, we need to reset it before trying the new strategy.

In [14]:
# Re-initialize all training labels as missing
y_train_initial = np.full(y_train.shape, fill_value=MISSING_LABEL)

# Randomly select 10 indices to label
initial_idx = np.random.choice(np.arange(len(y_train)), size=10, replace=False)

# Fill in the true labels for the selected indices
y_train_initial[initial_idx] = y_train.iloc[initial_idx]

Pool-based Active Learning¶

5. Now, implement 10 iterations of the Active Learning (AL) cycle. In each iteration, select 10 unlabeled samples to be labeled using uncertainty sampling. Determine the selected samples by their indices (query_idx) and assign their labels from y_train to the initially missing labels in y_train_initial. After updating the labels, retrain the classifier on the updated training data. Finally, evaluate the classifier's performance on the test set after each iteration of the AL cycle.

In [15]:
clf = SklearnClassifier(LogisticRegression(max_iter=1000))

n_queries = 10
for i in range(n_queries):
    # Uses current state of y_train_initial, which contains both labeled and missing entries.
    clf.fit(X_train_vect.toarray(), y_train_initial)

    # Queries the 10 most uncertain samples
    query_idx = qs.query(X=X_train_vect.toarray(), y=y_train_initial, clf=clf, batch_size=10)

    # Copies the true label from y_train into y_train_initial
    # Now the model can use those new labels in the next iteration
    y_train_initial[query_idx] = y_train.iloc[query_idx]

    # Evaluate the classifier fitted at the start of this iteration (before the new labels were added)
    y_pred = clf.predict(X_test_vect.toarray())
    acc = accuracy_score(y_test, y_pred)
    print(f'Simple Evaluation Iteration {i + 1}/{n_queries}, Accuracy: {acc:.4f}')
Simple Evaluation Iteration 1/10, Accuracy: 0.6365
Simple Evaluation Iteration 2/10, Accuracy: 0.5010
Simple Evaluation Iteration 3/10, Accuracy: 0.4995
Simple Evaluation Iteration 4/10, Accuracy: 0.5885
Simple Evaluation Iteration 5/10, Accuracy: 0.5965
Simple Evaluation Iteration 6/10, Accuracy: 0.6205
Simple Evaluation Iteration 7/10, Accuracy: 0.6375
Simple Evaluation Iteration 8/10, Accuracy: 0.6435
Simple Evaluation Iteration 9/10, Accuracy: 0.6295
Simple Evaluation Iteration 10/10, Accuracy: 0.6530

From the output we see that the model stays close to random guessing at first. After 0.64 on the first fit, accuracy drops to about 0.50 in iterations 2 and 3, which is chance level for a balanced binary task. From iteration 4 onward it climbs again, with a small dip at iteration 9, and it reaches roughly 65% at iteration 10. At that point the classifier has been fitted on just 100 labeled reviews (the 10 initial labels plus 9 batches of 10 queried labels). Note that in this run random sampling ended slightly higher (0.69), so with so few labels and a single random initial set the difference between the two strategies should not be over-interpreted.
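
The label counts behind these numbers can be checked with a few lines of arithmetic. In the loop above the classifier is fitted before each query, so the accuracy printed at iteration i comes from a model trained on the labels available at the start of that iteration. The variable names below are made up for this sketch.

```python
n_initial = 10    # labels revealed before the loop starts
batch_size = 10   # labels added per query
n_queries = 10    # number of active learning iterations

# Number of labels the classifier sees when it is fitted in iteration 1, 2, ..., 10.
labels_at_fit = [n_initial + batch_size * i for i in range(n_queries)]

# Number of labels available once the loop has finished.
labels_after_loop = n_initial + batch_size * n_queries

print(labels_at_fit)
print(labels_after_loop)
```

So the last printed accuracy is based on 100 labels, while 110 labels are available after the loop has finished.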

6. Retrain the classifier on the fully labeled training set and evaluate its final accuracy. This gives an approximate upper bound: the accuracy we would get if all labels were available from the beginning.

In [16]:
clf = SklearnClassifier(LogisticRegression(max_iter=1000))
# Final evaluation
clf.fit(X_train_vect.toarray(), y_train)
y_pred = clf.predict(X_test_vect.toarray())
final_acc = accuracy_score(y_test, y_pred)
print(f'Simple Evaluation Final accuracy: {final_acc:.4f}')
Simple Evaluation Final accuracy: 0.8700

Further Down The Line (Not Discussed Today)¶

Maybe you noticed that we are using the .toarray() method. It converts a sparse matrix into a dense NumPy array. Most traditional text vectorizers return a sparse matrix because most of the entries are zeros, especially in text data with thousands of possible words.
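
To get a feeling for why sparsity matters, here is a rough back-of-the-envelope comparison. All numbers are hypothetical (they are not the actual shape of X_train_vect), and the sparse estimate assumes a CSR layout with 8-byte values and 4-byte indices.

```python
import numpy as np

# Hypothetical TF-IDF matrix: 8000 reviews, 5000 vocabulary terms,
# and on average 50 non-zero terms per review.
n_rows, n_cols = 8000, 5000
nnz_per_row = 50
nnz = n_rows * nnz_per_row

# Dense storage keeps every entry as a float64.
dense_bytes = n_rows * n_cols * np.dtype(np.float64).itemsize

# Approximate CSR storage: one value (8 bytes) and one column index (4 bytes)
# per non-zero entry, plus one row pointer (4 bytes) per row.
csr_bytes = nnz * (8 + 4) + (n_rows + 1) * 4

print(f"dense:  {dense_bytes / 1e6:.1f} MB")
print(f"sparse: {csr_bytes / 1e6:.1f} MB")
```

Calling .toarray() therefore trades memory for convenience, which is fine for a subset of the data but may become a problem on the full dataset.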

It's time to visualize the IMDB reviews using t-SNE. To reduce the dimensionality of the TF-IDF vectors to 2D, we'll use t-SNE (t-Distributed Stochastic Neighbor Embedding). t-SNE is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. By converting the high-dimensional TF-IDF vectors into 2D space, we can visualize the relationships and clusters within the data. We will create a scatter plot to visualize the data.

7. Use t-SNE to reduce the dimensionality of the TF-IDF vectors to 2D. First create an object of type TSNE, then call its fit_transform() function on the training vectors.

In [24]:
tsne = TSNE(n_components=2, random_state=42)
X_train_tsne = tsne.fit_transform(X_train_vect.toarray())

8. Now that you’ve reduced the TF-IDF vectors to 2D using t-SNE, create a scatter plot to visualize the reviews: Use the first t-SNE component on the x-axis and the second on the y-axis. Color the points by sentiment label (0 for negative, 1 for positive). Add a legend, title, and axis labels to clearly interpret the results.

Hint: Use plt.scatter() with a loop over sentiment classes to color them differently. What does the plot tell you about how well t-SNE separates the sentiment classes?

In [25]:
plt.figure(figsize=(10, 7))
for sentiment in [0, 1]:
    indices = (y_train == sentiment)
    plt.scatter(X_train_tsne[indices, 0], X_train_tsne[indices, 1], label=f'Sentiment {sentiment}', alpha=0.6)

plt.legend()
plt.title('t-SNE Visualization of IMDB Reviews')
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.show()

We observe that there is heavy overlap between classes. The two sentiment classes are not clearly separated in the 2D t-SNE space. Positive and negative reviews are intermixed across much of the plot.

Multi-annotator Pool-based Active Learning¶

What Is Multi-Annotator Pool-based Active Learning?

This is a scenario where the learning algorithm selects the most informative unlabeled samples from a pool, and also decides which annotator to query for the label. This means the model answers two key questions at each step:

  • Which data point should be labeled next? (as in standard active learning)
  • Who should label it? (which annotator will give the best/most reliable answer)

In this part, we simulate a multi-annotator setting using skactiveml. We start by initializing multiple annotators with varying noise levels and generating noisy labels.

9. Initialize multiple annotators with different noise levels and generate noisy labels. Start by defining the number of annotators (n_annotators) and set it to 5. Create an array y_annot with dimensions (number of training samples, number of annotators), filled with zeros to store annotator labels. Initialize a random number generator rng with a fixed seed (e.g., 0) for reproducibility. Then, generate noise.

Summary Table

| Variable | Purpose |
|---|---|
| n_annotators | Number of simulated annotators |
| noise_levels | Reliability settings for each annotator |
| y_noise_matrix | Indicates which labels should be flipped |
| y_annot | The actual noisy labels from all annotators |
| y | The training label matrix (initially missing) |

In [26]:
# Number of annotators
n_annotators = 5

# Generate noisy labels for each annotator
y_annot = np.zeros(shape=(X_train_vect.shape[0], n_annotators), dtype=int)
rng = np.random.default_rng(seed=0)

# Creates 5 different noise levels, evenly spaced between 0.0 and 0.3 with np.linspace
# These represent the probability that each annotator flips a label incorrectly.
# Annotator 0 is perfect (0% noise), Annotator 4 is quite noisy (30%).
noise_levels = np.linspace(0.0, 0.3, num=n_annotators)

# Generate the flip indicators for all annotators at once.
# Each entry is a Bernoulli draw: 1 = flip the label (add noise), 0 = keep the true label.
# The draw has shape (n_annotators, num_samples); transposing gives (num_samples, n_annotators).
y_noise_matrix = rng.binomial(1, noise_levels[:, np.newaxis], size=(n_annotators, X_train_vect.shape[0])).T

# Apply noise to the true labels
# This line flips labels using XOR (^):
y_annot = y_noise_matrix ^ y_train.values[:, np.newaxis]

# Initialize training labels with missing values
y = np.full(shape=(X_train_vect.shape[0], n_annotators), fill_value=MISSING_LABEL)
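
A quick sanity check of this construction is to compare the empirical error rate of each simulated annotator with its noise level. The sketch below repeats the same steps on toy labels (y_true is random here, not the IMDB sentiment labels).

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_annotators = 20000, 5

# Toy binary ground truth.
y_true = rng.integers(0, 2, size=n_samples)

# Same recipe as above: one flip probability per annotator.
noise_levels = np.linspace(0.0, 0.3, num=n_annotators)
flip = rng.binomial(1, noise_levels[:, np.newaxis], size=(n_annotators, n_samples)).T

# XOR flips the label wherever the flip indicator is 1.
y_annot = flip ^ y_true[:, np.newaxis]

# Fraction of wrong labels per annotator.
error_rates = (y_annot != y_true[:, np.newaxis]).mean(axis=0)
print(np.round(error_rates, 3))
```

The error rates should come out close to 0.0, 0.075, 0.15, 0.225 and 0.3, with the first annotator making no mistakes at all.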

Why have we added the noise?

We did all this to simulate real-world human annotators.

In practice, not all annotators are perfect. Annotators may have different expertise levels, biases, or attention spans. For example, a medical intern might mislabel X-rays more often than a senior radiologist, and crowdsourced workers might guess or misunderstand complex tasks. Adding noise lets you model these imperfections so your learning system can adapt to unreliable or uncertain label sources.

10. Configure a classifier and query strategy for multi-annotator active learning. In this step, create a probabilistic classifier using ParzenWindowClassifier with an RBF kernel and gamma set to 0.1. Next, set up a probabilistic active learning strategy to select the samples to label: use ProbabilisticAL with a smoothing prior of 0.001. Since we’re working with multiple annotators, wrap the query strategy in SingleAnnotatorWrapper so it works in the multi-annotator setting.

Why are we doing this? The classifier estimates uncertainty, and the wrapped query strategy makes sure we can use it in a setup with several annotators. This allows us to actively choose which sample to label, and later you'll decide which annotator to ask.

In [27]:
# Create the classifier
clf = ParzenWindowClassifier(classes=np.unique(y_train.values), metric="rbf", metric_dict={"gamma": 0.1}, random_state=0)

# Set up the query strategy

# ProbabilisticAL scores each candidate by its expected gain in classification performance,
# estimated from the classifier's class-frequency estimates around that candidate.
sa_qs = ProbabilisticAL(random_state=0, prior=0.001)

# SingleAnnotatorWrapper makes sa_qs compatible with a multi-annotator active learning loop.
ma_qs = SingleAnnotatorWrapper(sa_qs, random_state=0)
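
To give an intuition for what a Parzen window classifier with an RBF kernel does, here is a minimal NumPy sketch of the general idea: every labeled sample votes for its class with a weight that decays with distance. This is an illustration under our own assumptions, not skactiveml's implementation; the helpers rbf_kernel and parzen_frequencies are made up.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """RBF similarity exp(-gamma * squared distance) between all rows of A and B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dist)

def parzen_frequencies(X_labeled, y_labeled, X_query, n_classes, gamma):
    """Kernel-weighted class counts for every query point."""
    K = rbf_kernel(X_query, X_labeled, gamma)   # shape (n_query, n_labeled)
    one_hot = np.eye(n_classes)[y_labeled]      # shape (n_labeled, n_classes)
    return K @ one_hot                          # shape (n_query, n_classes)

# Toy 2D data: two labeled points per class.
X_labeled = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_labeled = np.array([0, 0, 1, 1])
X_query = np.array([[0.0, 0.5], [5.0, 5.5]])

freq = parzen_frequencies(X_labeled, y_labeled, X_query, n_classes=2, gamma=0.1)
pred = freq.argmax(axis=1)
print(pred)
```

The gamma parameter controls how quickly the influence of a labeled sample fades with distance, so its value has a large effect on the predictions.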

11. Perform one iteration of the active learning cycle. In this iteration, request a batch of labels (batch_size=100) with 3 annotators per selected sample. Copy the returned annotator labels into the initially missing entries of y. After updating the labels, retrain the classifier on the updated training data, and evaluate its performance on the test set.

In [28]:
# Function to be able to index via an array of indices
idx = lambda A: (A[:, 0], A[:, 1])

# Initial fit of the classifier
clf.fit(X_train_vect.toarray(), majority_vote(y))

# Perform one active learning cycle
print("Cycle 1/1")

# Request a batch of 100 queries with 3 annotators per selected sample.
# The result is an array of index pairs like [[row, annotator], ...].
query_idx = ma_qs.query(X_train_vect.toarray(), y, batch_size=100, n_annotators_per_sample=3, clf=clf)

# Update labels
y[idx(query_idx)] = y_annot[idx(query_idx)]

# Retrain the classifier on the updated label matrix, again using majority voting across annotators to get a single label per example.
clf.fit(X_train_vect.toarray(), majority_vote(y, random_state=0))

# Evaluate the classifier on the test set
y_pred = clf.predict(X_test_vect.toarray())
acc = accuracy_score(y_test, y_pred)
print(f'Multi-annotator Iteration 1/1, Accuracy: {acc:.4f}')
Cycle 1/1
Multi-annotator Iteration 1/1, Accuracy: 0.5635
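
The two ingredients of this cell, filling the label matrix from (row, annotator) pairs and aggregating it with a majority vote, can be illustrated on a tiny example. The function majority_vote_sketch below is a simplified stand-in written for this illustration: it breaks ties in favor of the lowest class, which may differ from how skactiveml's majority_vote handles ties.

```python
import numpy as np

def majority_vote_sketch(y, classes=(0, 1)):
    """Most frequent non-missing label per row; NaN if a row has no labels."""
    y = np.asarray(y, dtype=float)
    votes = np.stack([(y == c).sum(axis=1) for c in classes], axis=1)
    winner = np.asarray(classes, dtype=float)[votes.argmax(axis=1)]
    winner[votes.sum(axis=1) == 0] = np.nan
    return winner

# Noisy labels of 3 annotators for 4 samples (toy data).
y_annot = np.array([[1, 0, 1], [0, 0, 0], [1, 1, 0], [0, 1, 1]])

# Label matrix that starts out completely missing.
y = np.full(y_annot.shape, np.nan)

# Queried (row, annotator) pairs, in the same layout as query_idx.
query_idx = np.array([[0, 0], [0, 1], [0, 2], [1, 1], [3, 0]])
rows, annotators = query_idx[:, 0], query_idx[:, 1]
y[rows, annotators] = y_annot[rows, annotators]

result = majority_vote_sketch(y)
print(result)
```

Sample 2 was never queried, so it stays missing, while sample 0 gets the label that two of its three annotators agreed on.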

Practice: Implement more iterations of the active learning cycle. Use the above code as a reference to perform 10 iterations, similar to how it was done in the pool-based active learning section.

12. Retrain the classifier on the fully labeled training set and evaluate its final accuracy.

In [29]:
# Final evaluation for multi-annotator
clf.fit(X_train_vect.toarray(), y_train)
y_pred = clf.predict(X_test_vect.toarray())
final_acc = accuracy_score(y_test, y_pred)
print(f'Multi-annotator Final accuracy: {final_acc:.4f}')
Multi-annotator Final accuracy: 0.5005

For better visualization of annotator labels, it's recommended to randomly select a smaller subset of 500 data points from the training set. This can reduce visual clutter and will be used for t-SNE visualization, similar to what we did previously.

In [30]:
# Sample a smaller subset for visualization
sample_indices = np.random.choice(X_train_vect.shape[0], 500, replace=False)
X_train_subset = X_train_vect[sample_indices].toarray()
y_train_subset = y_train.values[sample_indices]
y_annot_subset = y_annot[sample_indices]

# Use t-SNE to reduce the dimensionality of the TF-IDF vectors to 2D for visualization
tsne = TSNE(n_components=2, random_state=42)
X_train_tsne = tsne.fit_transform(X_train_subset)

13. Create a visual representation of annotator labels on the IMDB dataset.

Start by setting up a figure with subplots, one for each annotator. For each annotator, identify the correctly labeled data points (is_true) and plot them as circles with a specific color indicating the sentiment. Incorrectly labeled points are plotted as crosses.

In [31]:
# Visualize the noisy labels from each annotator using a scatter plot
fig, axes = plt.subplots(1, n_annotators, figsize=(25, 5))
for a in range(n_annotators):
    is_true = y_annot_subset[:, a] == y_train_subset
    # Correct labels: circles
    axes[a].scatter(X_train_tsne[is_true, 0], X_train_tsne[is_true, 1], c=y_annot_subset[is_true, a], s=30, marker='o', alpha=0.4, cmap='coolwarm')
    # Incorrect labels: crosses
    axes[a].scatter(X_train_tsne[~is_true, 0], X_train_tsne[~is_true, 1], c=y_annot_subset[~is_true, a], s=50, marker='x', alpha=1 , cmap='coolwarm', edgecolors='k', linewidths=1.5)
    axes[a].set_title(f'Annotator {a}', fontsize=15)
    axes[a].set_xlabel('t-SNE Dimension 1')
    axes[a].set_ylabel('t-SNE Dimension 2')

plt.show()

In this plot we see 5 annotators, one per subplot. Annotator 0 (left) is noise-free by construction (noise level 0.0), so it shows no crosses. The noise level grows from left to right, and Annotator 4 (right) has the most labeling noise (0.3).

  • Red dots = label 1 (e.g., positive sentiment or class 1)
  • Blue dots = label 0 (e.g., negative sentiment or class 0)
  • X-shaped markers = samples where the annotator flipped the correct label. These are places where the annotator disagreed with the ground truth

Part C Evaluation of Active Learning¶

Stream-based Active Learning¶

Stream-based Active Learning (AL) is an active learning strategy where data points are presented one at a time, and the learner must decide immediately whether to query the label or discard the instance.

How It Works

A data stream feeds unlabeled samples sequentially (like a real-time feed). For each new instance, the model evaluates how informative or uncertain the sample is and decides on the spot whether to:

  • Query the label (add to training)
  • Reject the sample (discard it permanently)

The model updates itself incrementally after each queried instance.
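
The decision process described above can be sketched as a small loop. The rule used here (query when the uncertainty exceeds a fixed threshold and the share of queried samples is still below a budget) is a deliberately simple illustration with made-up numbers; the strategies in skactiveml use more refined rules.

```python
def stream_decisions(uncertainties, threshold, budget):
    """Decide, in arrival order, whether to query the label of each instance."""
    decisions = []
    n_queried = 0
    for n_seen, u in enumerate(uncertainties, start=1):
        uncertain_enough = u >= threshold
        budget_left = n_queried / n_seen < budget
        query = uncertain_enough and budget_left
        n_queried += int(query)
        decisions.append(query)  # once rejected, an instance is never revisited
    return decisions

# Hypothetical uncertainty scores of six incoming samples.
decisions = stream_decisions([0.9, 0.8, 0.1, 0.7, 0.95, 0.2], threshold=0.6, budget=0.5)
print(decisions)
```

The second sample is uncertain but is still rejected, because querying it would use up the budget too quickly. This is the key difference with pool-based learning, where the sample could simply be picked later.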

In this part, we will show how stream-based active learning strategies are used and compare them to one another. For this purpose we will follow the next four steps:

  1. Set up query strategies
  2. Initialize classifier and training data
  3. Create stream-based active learning loop
  4. Calculate and track accuracy

We will divide each step into substeps for better clarity and ease of implementation. So let's start!

Set Up Query Strategies¶

14. Now it's time to set up the query strategies, i.e., StreamRandomSampling and StreamProbabilisticAL, for our stream-based active learning. Follow the steps below:

  1. Define the length of the data stream (stream_length) as 1000 samples, and use the first 1000 samples from X_train_vect and their corresponding labels from y_train.values.
  2. Initialize the query strategies with a fixed random_state=0, set training_size to 1000 and fit_clf to False, and create an empty dictionary accuracies = {} to store the accuracy results of each query strategy.

What are the query strategies that we are using?

StreamRandomSampling

  • What it does: For each incoming sample in the stream, it randomly decides whether to query the label or discard it.
  • No consideration of the model's confidence or prediction.
  • Useful to compare against smarter, model-driven strategies.

StreamProbabilisticAL

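  • What it does: Scores each incoming sample with the probabilistic gain measure that ProbabilisticAL also uses, which builds on the classifier's class-frequency estimates.
  • Queries a sample only if its score is high enough, so that the labeling budget is spent on samples where a label is expected to help.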
  • It’s a smarter, model-guided approach.
  • Helps focus labeling effort on informative samples.
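
As a baseline intuition, random sampling on a stream can be thought of as flipping a biased coin for every incoming sample. The sketch below uses a query probability of 0.1; this value is an assumption made for the illustration (it is in line with the acquisition counts reported at the end of this practical, which are about 10% of the stream), and the actual budget handling in skactiveml is more involved.

```python
import random

rng = random.Random(0)

budget = 0.1       # assumed share of samples we can afford to label
stream_size = 990  # samples processed after the 10 initial ones

# Query each incoming sample independently with probability `budget`.
queried = [rng.random() < budget for _ in range(stream_size)]
count = sum(queried)

print(f"queried {count} of {stream_size} samples")
```

The count fluctuates around 99 from run to run, and the choice ignores the model completely, which is what makes this strategy a useful reference point.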
In [17]:
stream_length = 1000
X_stream = X_train_vect.toarray()[:stream_length]
y_stream = y_train.values[:stream_length]
In [18]:
query_strategies = {
    'StreamRandomSampling': StreamRandomSampling(random_state=0),
    'StreamProbabilisticAL': StreamProbabilisticAL(random_state=0)
}

training_size = 1000


fit_clf = False # Don't automatically retrain the classifier every time a new sample is queried.
accuracies = {}

Initialize Classifier and Training Data¶

15. For each query strategy (use a for loop, something like for query_strategy_name, query_strategy in query_strategies.items()):

  1. Create a ParzenWindowClassifier with the unique classes from y_train.values. Then set up X_train_stream and y_train_stream as deques with a maximum length of training_size, and initialize them with the first 10 samples from X_stream and y_stream.
  2. Fit the classifier with this initial data.

What is deque? A deque (from collections) is like a list, but:

  • Faster for appending/removing from both ends
  • Ideal for streaming data, where old entries are discarded as new ones come in

| Tip / Object | Purpose |
|---|---|
| query_strategies.items() | Loop over each strategy (here: random and probabilistic) |
| clf = ParzenWindowClassifier | Initialize a new classifier for this strategy |
| deque(maxlen=training_size) | Sliding window of most recent training data |
| extend(X_stream[:10]) | Seed the model with initial labeled data |
| clf.fit(...) | Train model on initial small dataset |

In [19]:
for query_strategy_name, query_strategy in query_strategies.items():
    clf = ParzenWindowClassifier(classes=np.unique(y_train.values), random_state=0)

    # Initialize the training data
    X_train_stream = deque(maxlen=training_size)
    y_train_stream = deque(maxlen=training_size)

    # Initialize with the first 10 samples
    X_train_stream.extend(X_stream[:10])
    y_train_stream.extend(y_stream[:10])

    # Fit the classifier with this initial data.
    clf.fit(X_train_stream, y_train_stream)
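
The sliding-window behavior of a deque with maxlen is easy to verify on a tiny example (the window size of 3 is chosen only for readability):

```python
from collections import deque

# A window that keeps at most the 3 most recent items.
window = deque(maxlen=3)
window.extend([1, 2, 3])

# Appending to a full deque silently drops the oldest item on the other end.
window.append(4)
window.append(5)

print(list(window))  # [3, 4, 5]
```

In this practical training_size equals stream_length, so the window never actually overflows; the maxlen only starts to matter for longer streams.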

Create Stream-based Active Learning Loop¶

16. Create the stream-based active learning loop by following these steps:

  1. To track the accuracy of the predictions and the number of queried samples, set up:
    correct_classifications = []
    count = 0
    
  2. Start a loop from index 10 to the end of X_stream (the first 10 samples were used to initialize the classifier).
  3. Reshape the current sample (X_stream[t]) into a 2D array with one row.
  4. Refit the classifier with the current training data. Use clf.predict to predict the label of the current sample (X_cand) and compare it to the true label (y_cand). Then use correct_classifications.append to store the result (True if correct, False if incorrect).
  5. Ask the query strategy whether the current sample should be labeled. The call_func helper facilitates this: pass it query_strategy.query together with the candidates (X_cand), the classifier (clf), and the flags return_utilities and fit_clf. It returns the selected samples (sampled_indices) and their utilities.
  6. Create a dictionary budget_manager_param_dict to hold the utilities.
  7. Use call_func again to call the update method of query_strategy, passing the candidates, the queried indices, and budget_manager_param_dict.
  8. Add the number of newly queried samples to the count variable:
    count += len(sampled_indices)
    
  9. Update the training data by adding the current sample and its label. If the sample was queried, add its true label; otherwise, add a missing label.
In [20]:
correct_classifications = []
count = 0
for t in range(10, len(X_stream)): #`t` is the index of the current sample in the stream
    # Reshape the current sample for compatibility with the classifier's predict method, which expects a 2D array
    X_cand = X_stream[t].reshape(1, -1)
    y_cand = y_stream[t]

    # Refit the classifier and predict the current sample's label
    clf.fit(X_train_stream, y_train_stream)
    correct_classifications.append(clf.predict(X_cand)[0] == y_cand)

    # Ask the query strategy whether the current sample should be labeled
    sampled_indices, utilities = call_func(query_strategy.query, candidates=X_cand, clf=clf, return_utilities=True, fit_clf=fit_clf)

    # Create a dictionary budget_manager_param_dict
    budget_manager_param_dict = {"utilities": utilities}

    # Dynamically call the update method on `query_strategy`
    call_func(query_strategy.update, candidates=X_cand, queried_indices=sampled_indices, budget_manager_param_dict=budget_manager_param_dict)

    # Track the number of queried samples
    count += len(sampled_indices)

    # Update the training data: keep the true label only if the sample was queried
    X_train_stream.append(X_stream[t])
    y_train_stream.append(y_cand if len(sampled_indices) > 0 else clf.missing_label)

Calculate and Track Accuracy¶

We need to measure how well the classifier is performing overall.

17. Use np.mean(correct_classifications) to calculate the average accuracy. Store the correct_classifications list in the accuracies dictionary under the name of the query strategy, so you can keep track of how each strategy performed over time.

In [21]:
# Calculate the average accuracy and store the per-sample results for this query strategy
avg_accuracy = np.mean(correct_classifications)
accuracies[query_strategy_name] = correct_classifications
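
Note that the loop predicts each sample before its label can be used for training. This "test first, then train" scheme is often called prequential evaluation. The sketch below shows the bookkeeping with a trivial stand-in model that always predicts the most frequent label seen so far; the function prequential_accuracy and the toy labels are made up for this illustration.

```python
from collections import Counter

def prequential_accuracy(labels):
    """Predict each label before seeing it, then learn from it."""
    seen = Counter()
    correct = []
    for y in labels:
        # Stand-in model: most frequent label so far (0 if nothing seen yet).
        prediction = seen.most_common(1)[0][0] if seen else 0
        correct.append(prediction == y)
        seen[y] += 1  # only now does the model get to use the label
    return sum(correct) / len(correct), correct

avg_accuracy, correct = prequential_accuracy([1, 1, 0, 1, 1])
print(avg_accuracy, correct)
```

Because every prediction is made on a sample the model has not trained on yet, the running average is an honest estimate of performance on unseen data.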

Now let's run the code from beginning to end.

In [22]:
# Stream-based learning setup
stream_length = 1000
X_stream = X_train_vect.toarray()[:stream_length]
y_stream = y_train.values[:stream_length]

# Set up query strategies
query_strategies = {
    'StreamRandomSampling': StreamRandomSampling(random_state=0),
    'StreamProbabilisticAL': StreamProbabilisticAL(random_state=0)
}

training_size = 1000
fit_clf = False
accuracies = {}

for query_strategy_name, query_strategy in query_strategies.items():
    clf = ParzenWindowClassifier(classes=np.unique(y_train.values), random_state=0)

    # Initialize the training data
    X_train_stream = deque(maxlen=training_size)
    y_train_stream = deque(maxlen=training_size)

    # Initialize with the first 10 samples
    X_train_stream.extend(X_stream[:10])
    y_train_stream.extend(y_stream[:10])

    clf.fit(X_train_stream, y_train_stream)
    correct_classifications = []
    count = 0
    for t in range(10, len(X_stream)):
        # Reshape the current sample for compatibility
        X_cand = X_stream[t].reshape(1, -1)
        y_cand = y_stream[t]

        # Refit the classifier and predict the current sample's label
        clf.fit(X_train_stream, y_train_stream)
        correct_classifications.append(clf.predict(X_cand)[0] == y_cand)

        # Ask the query strategy whether to request this sample's label, then update its budget state
        sampled_indices, utilities = call_func(query_strategy.query, candidates=X_cand, clf=clf, return_utilities=True, fit_clf=fit_clf)
        budget_manager_param_dict = {"utilities": utilities}
        call_func(query_strategy.update, candidates=X_cand, queried_indices=sampled_indices, budget_manager_param_dict=budget_manager_param_dict)

        # Update the training data with new samples and labels
        X_train_stream.append(X_stream[t])
        y_train_stream.append(y_cand if len(sampled_indices) > 0 else clf.missing_label)

        # Track the number of queried samples
        count += len(sampled_indices)

    # Calculate and print the average accuracy for each query strategy
    avg_accuracy = np.mean(correct_classifications)
    print(f"Query Strategy: {query_strategy_name}, Avg Accuracy: {avg_accuracy:.4f}, Acquisition count: {count}")
    accuracies[query_strategy_name] = correct_classifications
Query Strategy: StreamRandomSampling, Avg Accuracy: 0.4889, Acquisition count: 107
Query Strategy: StreamProbabilisticAL, Avg Accuracy: 0.4838, Acquisition count: 100

The acquisition count tells you how many samples were selected for labeling during the active learning process. In our case, it means that:

  • StreamRandomSampling selected 107 samples and achieved an average accuracy of 0.4889.
  • StreamProbabilisticAL selected 100 samples and achieved an average accuracy of 0.4838.

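This count shows how many labels each strategy requested from the oracle. Both average accuracies are close to 0.5, which is roughly chance level for this binary task, so with this classifier configuration neither strategy has learned a useful model yet.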

18. Let's plot the accuracy over time for each query strategy, using a Gaussian filter.

A Gaussian filter is a technique for smoothing a noisy signal or curve. It's like saying: "Let's look at the accuracy near this point and take a weighted average of nearby values, where points closer in time count more than distant ones." In our example, the raw values are a sequence of hits and misses (one True/False per stream sample), so the curve is unreadable without smoothing. The code below uses a Gaussian with a standard deviation of 20 samples, so each point is mostly an average over the few dozen predictions around it.

In [23]:
for query_strategy_name, correct_classifications in accuracies.items():
    plt.plot(gaussian_filter1d(np.array(correct_classifications, dtype=float), 20), label=query_strategy_name)
plt.legend();
plt.xlabel('Iteration')
plt.ylabel('Accuracy')
plt.title('Accuracy over time for different query strategies')
plt.show()
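
To see what the smoothing does, here is a NumPy-only sketch that builds a Gaussian kernel by hand and applies it with a convolution. It assumes the kernel is cut off at 4 standard deviations (the default truncation of gaussian_filter1d, as far as we know) and it pads with zeros at the borders, so the values near both ends differ from what SciPy produces.

```python
import numpy as np

def gaussian_kernel(sigma, truncate=4.0):
    """Normalized Gaussian weights on the integers -radius..radius."""
    radius = int(truncate * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    weights = np.exp(-0.5 * (x / sigma) ** 2)
    return weights / weights.sum()

kernel = gaussian_kernel(sigma=20)

# A classifier that is always right gives a constant sequence of hits.
hits = np.ones(500)
smoothed = np.convolve(hits, kernel, mode="same")

print(len(kernel))              # number of neighbors that receive a weight
print(smoothed[250], smoothed[0])
```

With sigma=20 the kernel spans 80 samples on each side, but the weights fall off quickly, so the closest neighbors dominate the average.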

19. (Optional) Repeat the stream-based active learning process by adding the other strategies (e.g., FixedUncertainty, VariableUncertainty, Split, StreamDensityBasedAL, CognitiveDualQueryStrategyRan, CognitiveDualQueryStrategyFixUn, CognitiveDualQueryStrategyRanVarUn, CognitiveDualQueryStrategyVarUn, PeriodicSampling), then compare your results. Make sure to import them from skactiveml.stream beforehand!

End of Practical!