Anastasia Giachanou & Misja Groen
Machine Learning with Python - Utrecht Summer School
Active Learning is becoming increasingly popular in Machine Learning. The idea is to train a model iteratively, labeling only the most informative samples, which means you need less labeled training data to reach competitive results.
An Active Learning pipeline consists of a classifier and an oracle. The oracle, either an individual or a group, cleans, selects, and labels data and feeds it to the model, ensuring consistent labeling. The process starts with annotating a small subset of the dataset to train an initial model, saving the best model checkpoint, and testing it on a balanced test set. After this initial evaluation, the oracle labels more samples based on the model's needs, adds the new data to the training set, and the cycle repeats until the model achieves acceptable performance.
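To make this cycle concrete, here is a minimal, self-contained sketch of pool-based active learning written with plain scikit-learn on synthetic data. It is only an illustration of the loop described above, not the setup used in this practical (which uses the IMDB data and scikit-activeml).

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic pool of unlabeled data plus a held-out test set
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
labeled = list(rng.choice(len(X_pool), size=20, replace=False))   # the oracle labels a small seed set

model = LogisticRegression(max_iter=1000)
for cycle in range(5):
    model.fit(X_pool[labeled], y_pool[labeled])                   # train on what is labeled so far
    proba = model.predict_proba(X_pool)[:, 1]
    uncertainty = -np.abs(proba - 0.5)                            # samples near 0.5 are the most uncertain
    uncertainty[labeled] = -np.inf                                # never re-query already labeled samples
    query = np.argsort(uncertainty)[-10:]                         # the "oracle" labels 10 new samples per cycle
    labeled.extend(query.tolist())
    print(f"cycle {cycle + 1}: labeled={len(labeled)}, test accuracy={model.score(X_test, y_test):.3f}")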
In this practical, we'll dive into Active Learning using the IMDB dataset, which contains 50,000 movie reviews split evenly between positive and negative sentiment. We'll explore three strategies:
Pool-based Active Learning: The classic setting, in which the model repeatedly selects the most informative samples from a pool of unlabeled data (here using uncertainty sampling) and asks for their labels.
Multi-annotator Pool-based Active Learning: This simulates multiple annotators with varying noise levels, using a SingleAnnotatorWrapper with probabilistic active learning. It highlights how multiple annotators impact model performance.
Stream-based Active Learning: Here, we implement a stream-based approach using StreamRandomSampling and StreamProbabilisticAL, ideal for real-time decision-making as data continuously flows in.
We'll classify movie reviews as positive or negative using their text. This binary classification task is a classic, widely applicable machine learning problem. Let's get started and see how we can implement these strategies together!
We will use the scikit-activeml library. It implements the most important query strategies and is easy to use because it's built on top of scikit-learn. We'll show how it works by classifying IMDB reviews with the active learning cycle. Let's start by installing it with pip install scikit-activeml and importing the needed packages from scikit-learn and scikit-activeml. Make sure everything is installed!
!pip install scikit-activeml
!pip install numpy==1.24.4 scipy==1.10.1
!pip install matplotlib
!pip install pandas
!pip install ipympl
No need to worry if you see > /dev/null 2>&1 appended to install commands like these: it simply hides the installation output and keeps our practical tidy :)
import numpy as np
import matplotlib as mlp
import matplotlib.pyplot as plt
import pandas as pd
import re
import string
import skactiveml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from skactiveml.classifier import SklearnClassifier, ParzenWindowClassifier
from skactiveml.pool import UncertaintySampling, ProbabilisticAL, RandomSampling
from skactiveml.pool.multiannotator import SingleAnnotatorWrapper
from skactiveml.stream import StreamRandomSampling, StreamProbabilisticAL
from skactiveml.utils import unlabeled_indices, labeled_indices, MISSING_LABEL, majority_vote, call_func
from skactiveml.visualization import plot_utilities, plot_decision_boundary
from collections import deque
from scipy.ndimage import gaussian_filter1d
from sklearn.manifold import TSNE
import warnings
mlp.rcParams["figure.facecolor"] = "white"
warnings.filterwarnings("ignore")
We'll be using the IMDB dataset, featuring 50,000 movie reviews from the Internet Movie Database, for our experiments. Now it is time to load the dataset:
# Load the IMDB dataset
df = pd.read_csv("IMDB Dataset.csv")
ParserError: Error tokenizing data. C error: EOF inside string starting at row 16597
When loading real-world datasets, you may encounter a ParserError. This often happens when reading a large or messy CSV file into pandas with the read_csv function. One solution is to use the engine='python' parameter in the read_csv call to handle complex CSV structures, together with the on_bad_lines parameter to skip problematic lines, like this:
# Load the IMDB dataset with proper handling for encoding and skipping bad lines
df = pd.read_csv("IMDB Dataset.csv", engine="python", on_bad_lines='skip')
If you are working in Google Colab, another option is to load the data by mounting Google Drive, which ensures that the file paths are correctly mapped and helps avoid a FileNotFoundError or similar issues that might look like a ParserError. Here, we simply load the data with the read_csv parameters described above:
df = pd.read_csv("IMDB Dataset.csv", engine="python", on_bad_lines='skip')
These methods ensure you can load the dataset effectively, even if there are issues with the CSV file's formatting.
Pre-processing text data is a crucial step in text mining and machine learning tasks. It involves cleaning and removing noise from the text data to make it analyzable and transform it into a form that machine learning algorithms can work with effectively. There are various approaches to preprocessing text data. One approach is minimal preprocessing, which involves the most essential steps like lowercasing, punctuation removal, and whitespace normalization. In contrast, the full preprocessing approach includes additional steps such as tokenization, stop word removal, stemming, and lemmatization to thoroughly clean and prepare text data.
For more information on several common ways to deal with text data, please refer to the A Beginner's Guide to Dealing with Text Data tutorial.
In this practical, we opted for a minimal preprocessing approach without tokenization. This is because the subsequent steps involve using TF-IDF vectorization, which handles tokenization implicitly, and the model used (Logistic Regression) requires numerical input rather than raw text. Let's follow the next steps to see how we preprocess the text data in this case.
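For comparison, here is a sketch of what a fuller preprocessing pipeline could look like (tokenization, stop word removal, stemming). It assumes the NLTK package is available; none of this is needed for the rest of the practical, where we stick to the minimal approach.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords', quiet=True)      # stop word list used below
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def full_preprocess(text):
    tokens = re.findall(r'[a-z]+', text.lower())           # lowercase + simple tokenization
    tokens = [t for t in tokens if t not in stop_words]    # remove stop words
    return ' '.join(stemmer.stem(t) for t in tokens)       # stem and rejoin

print(full_preprocess("This movie was absolutely wonderful, I loved it!"))
# -> "movi absolut wonder love"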
Text preprocessing is a method used to clean and remove noise from text data. It makes your text easier to analyze and transforms it into a format that machine learning algorithms can handle more effectively. At the beginning of this practical, we imported two essential libraries for text preprocessing: re and string. The re library supports regular expressions for pattern matching, and the string library provides constants such as the punctuation characters. The preprocess_text function, defined below, converts text to lowercase, removes punctuation using re.sub(), and eliminates extra whitespace with re.sub().strip(). We then apply this function to each review in the dataset to clean the text.
# Preprocess the text data
def preprocess_text(text):
    text = text.lower()  # Lowercase text
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text)  # Remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra whitespace
    return text
df['review'] = df['review'].apply(preprocess_text)
df.head()
| | review | sentiment |
|---|---|---|
| 0 | one of the other reviewers has mentioned that ... | positive |
| 1 | a wonderful little production br br the filmin... | positive |
| 2 | i thought this was a wonderful way to spend ti... | positive |
| 3 | basically theres a family where a little boy j... | negative |
| 4 | petter matteis love in the time of money is a ... | positive |
When working with large datasets, starting with a smaller subset for initial testing allows for quicker iterations and helps identify issues before scaling up. Here, we reduce the IMDB dataset to 10,000 samples using Pandas' sample method. The random_state parameter ensures reproducibility, selecting the same 10,000 samples each time.
# Reduce the dataset size for initial testing (10,000 samples)
df = df.sample(10000, random_state=42)
Next, we convert the sentiment labels to binary values, where 'positive' is mapped to 1 and 'negative' to 0.
# Convert labels to binary
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
Let's split data into training and test sets with an 80/20 ratio using train_test_split. This results in X_train and y_train for training, and X_test and y_test for testing, ensuring that the model is trained and evaluated on separate data.
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.2, random_state=42)
Once the text data is preprocessed, it needs to be converted into a numerical format that machine learning algorithms can work with. This process is known as vectorization. One of the most common methods for vectorization is the TF-IDF (Term Frequency-Inverse Document Frequency) approach.
TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). The TF-IDF value increases proportionally with the number of times a word appears in the document but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
Term Frequency (TF): Measures how frequently a term appears in a document. It is calculated by dividing the number of times a term appears in a document by the total number of terms in that document.
$$ \text{TF}(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} $$
Inverse Document Frequency (IDF): Measures how important a term is. It is calculated by taking the logarithm of the number of documents in the corpus divided by the number of documents containing the term.
$$ \text{IDF}(t) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing term } t} \right) $$
TF-IDF Score: The TF-IDF score is the product of the TF and IDF scores. It reflects the importance of a term in a document within the corpus.
Formula: $$ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) $$
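As a small worked example of these formulas, consider a toy corpus we made up below. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF variant and L2-normalises the vectors, so its numbers differ slightly from this textbook version.

import numpy as np

docs = [["great", "movie", "great"], ["terrible", "movie"], ["great", "acting"]]
term = "great"

tf = docs[0].count(term) / len(docs[0])        # "great" is 2 of the 3 terms in document 0
df = sum(term in d for d in docs)              # "great" appears in 2 of the 3 documents
idf = np.log(len(docs) / df)                   # log(3 / 2)
print(f"TF = {tf:.3f}, IDF = {idf:.3f}, TF-IDF = {tf * idf:.3f}")
# TF = 0.667, IDF = 0.405, TF-IDF = 0.270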
1. To apply TF-IDF vectorization, create an instance of the class TfidfVectorizer with max_features=5000 to limit the vocabulary size. Use fit_transform() on the training data to learn the vocabulary and convert the text into TF-IDF vectors. Then, apply transform() on the test data to vectorize it using the same vocabulary.
Note: when working with text data, our features are the tokens; here, each review is represented by the TF-IDF weights of its tokens, restricted to the 5,000 most frequent terms.
vectorizer = TfidfVectorizer(max_features=5000)
X_train_vect = vectorizer.fit_transform(X_train)
X_test_vect = vectorizer.transform(X_test)
Normally, in a traditional (supervised) machine learning loop we would collect a labelled dataset, train a model on all of it, and evaluate it on a held-out test set.
However, how well this works depends highly on whether we have sufficient (high-quality) labelled training data.
For this example we'll use logistic regression to show what such an ML loop looks like. Keep in mind this is NOT the Active Learning situation just yet.
Scikit-learn offers many easy-to-use classification algorithms. In this example, we'll use the LogisticRegression classifier. Since scikit-learn classifiers can't handle missing labels directly, we'll later wrap the logistic regression model in SklearnClassifier. In active learning scenarios, you often start with many unlabeled samples, which creates a challenge. To get around this, scikit-activeml provides wrappers such as SklearnClassifier, which wrap scikit-learn estimators (e.g., LogisticRegression) and allow training and querying even when only part of the data is labeled.
2. Create such a Machine Learning loop using LogisticRegression
X_log, y_log = X_train_vect.toarray(), y_train
clf = LogisticRegression(random_state=0).fit(X_log,y_log)
print(clf.predict(X_log[:5]))
print(X_train[:5], y_train[:5])
print("Accuracy is: ", clf.score(X_log,y_log))
[0 0 1 1 0] 2141 i saw it tonight and fell asleep in the movieb... 46172 this is one of them movies that has a awesome ... 18558 and i do mean it if not literally after all i ... 32956 this film has a lot of raw potential the scrip... 13094 hello i normally love movies im 19 i have seen... Name: review, dtype: object 2141 0 46172 0 18558 1 32956 1 13094 0 Name: sentiment, dtype: int64 Accuracy is: 0.92075
Going from a traditional Machine Learning loop to an Active Learning environment, we'll need to make some adjustments. To start, we'll need to define our classifier a bit differently so that it also works with Active Learning.
clf = SklearnClassifier(LogisticRegression(max_iter=1000))
3. Create an initial set of labeled data by setting most labels to missing and randomly selecting a small subset. You can randomly select 10 samples from the training data to label initially.
The first step is to create a label array (y_train_initial) that marks all labels as missing (you can use a constant like MISSING_LABEL). In a real-world setting you would have a large unlabeled dataset; to simulate that, we remove all the labels. After removing those labels, randomly select 10 data points to serve as your initial labeled set and copy their true labels into the new array, keeping the rest as missing.
# Initialize training labels as missing
y_train_initial = np.full(y_train.shape, fill_value=MISSING_LABEL)
# Randomly select 10 indices to label
initial_idx = np.random.choice(np.arange(len(y_train)), size=10, replace=False)
# Fill in the true labels for the selected indices
y_train_initial[initial_idx] = y_train.iloc[initial_idx]
To get an idea of what such a loop looks like, we will simulate it with RandomSampling.
We have already created: X_train (vectorized as X_train_vect), which we will use as input; y_train, which we will use as the source of labels; and clf, the LogisticRegression wrapped in SklearnClassifier, which we will use as the classifier.
However, we are still missing a query strategy.
To start off we will use the RandomSampling strategy, which simply selects samples at random. In the next lecture slides you will dive deeper into the different query strategies.
For now, qs = RandomSampling() will be sufficient.
4. Now, implement 10 iterations of the Active Learning (AL) cycle. In each iteration, select 10 unlabeled samples to be labeled using random sampling. Determine the selected samples by their indices (query_idx) and assign their labels from y_train to the initially missing labels in y_train_initial. After updating the labels, retrain the classifier on the updated training data. Finally, evaluate the classifier's performance on the test set after each iteration of the AL cycle.
qs = RandomSampling()
# Active learning cycle:
n_queries = 10
for i in range(n_queries):
    # Fit the classifier with current labels.
    clf.fit(X_train_vect.toarray(), y_train_initial)
    # Query the next sample(s).
    query_idx = qs.query(X=X_train_vect.toarray(), y=y_train_initial, batch_size=10)
    # Update labels based on query.
    y_train_initial[query_idx] = y_train.iloc[query_idx]
    # Evaluate the classifier on the test set
    y_pred = clf.predict(X_test_vect.toarray())
    acc = accuracy_score(y_test, y_pred)
    print(f'Simple Evaluation Iteration {i + 1}/{n_queries}, Accuracy: {acc:.4f}')
Simple Evaluation Iteration 1/10, Accuracy: 0.4995 Simple Evaluation Iteration 2/10, Accuracy: 0.5000 Simple Evaluation Iteration 3/10, Accuracy: 0.5090 Simple Evaluation Iteration 4/10, Accuracy: 0.5895 Simple Evaluation Iteration 5/10, Accuracy: 0.5085 Simple Evaluation Iteration 6/10, Accuracy: 0.6685 Simple Evaluation Iteration 7/10, Accuracy: 0.6750 Simple Evaluation Iteration 8/10, Accuracy: 0.6710 Simple Evaluation Iteration 9/10, Accuracy: 0.6755 Simple Evaluation Iteration 10/10, Accuracy: 0.6895
5. To set up the query strategy (qs), use UncertaintySampling with the entropy method and random_state=42 to identify the most uncertain data points for the model to focus on.
qs = UncertaintySampling(method='entropy', random_state=42)
You can explore additional classifiers and query strategies available in scikit-learn and the skactiveml library for more options. Detailed information on other classifiers can be found here and all implemented strategies are listed here.
Because the previous loop has already filled in part of y_train_initial, we need to reset it before trying a new query strategy.
# Initialize training labels as missing
y_train_initial = np.full(y_train.shape, fill_value=MISSING_LABEL)
# Randomly select 10 indices to label
initial_idx = np.random.choice(np.arange(len(y_train)), size=10, replace=False)
# Fill in the true labels for the selected indices
y_train_initial[initial_idx] = y_train.iloc[initial_idx]
6. Now, implement 10 iterations of the Active Learning (AL) cycle. In each iteration, select 10 unlabeled samples to be labeled using uncertainty sampling. Determine the selected samples by their indices (query_idx) and assign their labels from y_train to the initially missing labels in y_train_initial. After updating the labels, retrain the classifier on the updated training data. Finally, evaluate the classifier's performance on the test set after each iteration of the AL cycle.
clf = SklearnClassifier(LogisticRegression(max_iter=1000))
n_queries = 10
for i in range(n_queries):
    # Uses current state of y_train_initial, which contains both labeled and missing entries.
    clf.fit(X_train_vect.toarray(), y_train_initial)
    # Queries the 10 most uncertain samples
    query_idx = qs.query(X=X_train_vect.toarray(), y=y_train_initial, clf=clf, batch_size=10)
    # Copies the true labels from y_train into y_train_initial
    # so the model can use those new labels in the next iteration
    y_train_initial[query_idx] = y_train.iloc[query_idx]
    # Evaluate the classifier on the test set
    y_pred = clf.predict(X_test_vect.toarray())
    acc = accuracy_score(y_test, y_pred)
    print(f'Simple Evaluation Iteration {i + 1}/{n_queries}, Accuracy: {acc:.4f}')
Simple Evaluation Iteration 1/10, Accuracy: 0.6365 Simple Evaluation Iteration 2/10, Accuracy: 0.5010 Simple Evaluation Iteration 3/10, Accuracy: 0.4995 Simple Evaluation Iteration 4/10, Accuracy: 0.5885 Simple Evaluation Iteration 5/10, Accuracy: 0.5965 Simple Evaluation Iteration 6/10, Accuracy: 0.6205 Simple Evaluation Iteration 7/10, Accuracy: 0.6375 Simple Evaluation Iteration 8/10, Accuracy: 0.6435 Simple Evaluation Iteration 9/10, Accuracy: 0.6295 Simple Evaluation Iteration 10/10, Accuracy: 0.6530
From the output we see that the model starts near random guessing: accuracy in the first few iterations is around 0.50, which is what you would expect for a balanced binary task at the start of active learning. From iteration 5 onward, accuracy increases fairly steadily, and by iteration 10 the model reaches roughly 65% accuracy. That is a clear improvement with only about 100 labeled points (10 queries × 10 samples per batch, plus the 10 initial labels).
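If you want to see this learning curve rather than read it off the printed numbers, a minimal sketch is to store the accuracy of each iteration in a list and plot it afterwards. This reuses X_train_vect, X_test_vect, y_train, y_test and the imports from above; the variable names y_al and acc_history are ours.

qs = UncertaintySampling(method='entropy', random_state=42)
clf = SklearnClassifier(LogisticRegression(max_iter=1000))

# Fresh label array with 10 random initial labels
y_al = np.full(y_train.shape, fill_value=MISSING_LABEL)
seed_idx = np.random.choice(np.arange(len(y_train)), size=10, replace=False)
y_al[seed_idx] = y_train.iloc[seed_idx]

acc_history = []
for i in range(10):
    clf.fit(X_train_vect.toarray(), y_al)
    query_idx = qs.query(X=X_train_vect.toarray(), y=y_al, clf=clf, batch_size=10)
    y_al[query_idx] = y_train.iloc[query_idx]
    acc_history.append(accuracy_score(y_test, clf.predict(X_test_vect.toarray())))

plt.plot(range(1, len(acc_history) + 1), acc_history, marker='o')
plt.xlabel('AL iteration')
plt.ylabel('Test accuracy')
plt.title('Learning curve of uncertainty sampling')
plt.show()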
7. Retrain the classifier on the fully labeled training set and evaluate its final accuracy. This is the upper bound we could expect if all labeled data were available from the beginning.
clf = SklearnClassifier(LogisticRegression(max_iter=1000))
# Final evaluation
clf.fit(X_train_vect.toarray(), y_train)
y_pred = clf.predict(X_test_vect.toarray())
final_acc = accuracy_score(y_test, y_pred)
print(f'Simple Evaluation Final accuracy: {final_acc:.4f}')
Simple Evaluation Final accuracy: 0.8700
Maybe you noticed that we are using the .toarray() method, which converts a sparse matrix into a dense NumPy array. Most traditional text vectorizers return a sparse matrix because most of the entries are zeros, especially for text data with thousands of possible words.
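To see the difference for yourself, you can inspect the vectorized training data; this is just a quick check, and the exact density depends on your data and the max_features setting.

# TfidfVectorizer returns a SciPy sparse matrix that stores only the non-zero entries.
print(type(X_train_vect))          # a scipy.sparse CSR matrix
print(X_train_vect.shape)          # here: (8000, 5000)
density = X_train_vect.nnz / (X_train_vect.shape[0] * X_train_vect.shape[1])
print(f"Fraction of non-zero entries: {density:.4f}")
# .toarray() materialises every zero explicitly as a dense NumPy array
X_dense = X_train_vect.toarray()
print(type(X_dense), X_dense.shape)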
It's time to visualize the IMDB reviews using t-SNE. To reduce the dimensionality of the TF-IDF vectors to 2D, we'll use t-SNE (t-Distributed Stochastic Neighbor Embedding). t-SNE is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. By converting the high-dimensional TF-IDF vectors into 2D space, we can visualize the relationships and clusters within the data. We will create a scatter plot to visualize the data.
8. Use t-SNE to reduce the dimensionality of the TF-IDF vectors to 2D. First, create an object of type TSNE and then call its fit_transform() function on the training vectors.
tsne = TSNE(n_components=2, random_state=42)
X_train_tsne = tsne.fit_transform(X_train_vect.toarray())
9. Now that you've reduced the TF-IDF vectors to 2D using t-SNE, create a scatter plot to visualize the reviews: use the first t-SNE component on the x-axis and the second on the y-axis. Color the points by sentiment label (0 for negative, 1 for positive). Add a legend, title, and axis labels to clearly interpret the results.
Hint: Use plt.scatter() with a loop over sentiment classes to color them differently. What does the plot tell you about how well t-SNE separates the sentiment classes?
plt.figure(figsize=(10, 7))
for sentiment in [0, 1]:
    indices = (y_train == sentiment)
    plt.scatter(X_train_tsne[indices, 0], X_train_tsne[indices, 1], label=f'Sentiment {sentiment}', alpha=0.6)
plt.legend()
plt.title('t-SNE Visualization of IMDB Reviews')
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.show()
We observe that there is heavy overlap between classes. The two sentiment classes are not clearly separated in the 2D t-SNE space. Positive and negative reviews are intermixed across much of the plot.
What Is Multi-Annotator Pool-based Active Learning?
This is a scenario where the learning algorithm selects the most informative unlabeled samples from a pool, and also decides which annotator to query for the label. This means the model answers two key questions at each step: which sample should be labeled next, and which annotator should provide that label?
In this part, we simulate a multi-annotator setting using skactiveml. We start by initializing multiple annotators with varying noise levels and generating noisy labels.
10. Initialize multiple annotators with different noise levels and generate noisy labels. Start by defining the number of annotators (n_annotators) and setting it to 5. Create an array y_annot with dimensions (number of training samples, number of annotators), filled with zeros, to store the annotator labels. Initialize a random number generator rng with a fixed seed (e.g., 0) for reproducibility. Then generate the noise.
Summary Table
| Variable | Purpose |
|---|---|
| n_annotators | Number of simulated annotators |
| noise_levels | Reliability settings for each annotator |
| y_noise_matrix | Indicates which labels should be flipped |
| y_annot | The actual noisy labels from all annotators |
| y | The training label matrix (initially missing) |
# Number of annotators
n_annotators = 5
# Generate noisy labels for each annotator
y_annot = np.zeros(shape=(X_train_vect.shape[0], n_annotators), dtype=int)
rng = np.random.default_rng(seed=0)
# Creates 5 different noise levels, evenly spaced between 0.0 and 0.3 with np.linspace
# These represent the probability that each annotator flips a label incorrectly.
# Annotator 0 is perfect (0% noise), Annotator 4 is quite noisy (30%).
noise_levels = np.linspace(0.0, 0.3, num=n_annotators)
# Generate noise for all annotators simultaneously.
# Each element is a random 0 or 1 indicating whether the label should be flipped
# for that annotator and sample. The output has shape (num_samples, n_annotators):
# 1 = flip the label (add noise), 0 = keep the true label.
y_noise_matrix = rng.binomial(1, noise_levels[:, np.newaxis], size=(n_annotators, X_train_vect.shape[0])).T
# Apply noise to the true labels
# This line flips labels using XOR (^):
y_annot = y_noise_matrix ^ y_train.values[:, np.newaxis]
# Initialize training labels with missing values
y = np.full(shape=(X_train_vect.shape[0], n_annotators), fill_value=MISSING_LABEL)
Why have we added the noise?
We did all this to simulate real-world human annotators.
In practice, not all annotators are perfect. Annotators may have different expertise levels, biases, or attention spans. For example, a medical intern might mislabel X-rays more often than a senior radiologist, and crowdsourced workers might guess or misunderstand complex tasks. Adding noise lets you model these imperfections so your learning system can adapt to unreliable or uncertain label sources.
11. Configure a classifier and query strategy for multi-annotator active learning. In this step, create a probabilistic classifier using ParzenWindowClassifier. Use an RBF kernel and set gamma to 0.1.
Next, set up a probabilistic active learning strategy to select the most uncertain samples for labeling. For this, use ProbabilisticAL with a smoothing prior of 0.001. Since we're working with multiple annotators, wrap your query strategy with SingleAnnotatorWrapper so it works in the multi-annotator setting.
Why are we doing this? The classifier estimates uncertainty, and the wrapped query strategy makes sure we can use it in a setup with several annotators. This allows us to actively choose which sample to label, and later you'll decide which annotator to ask.
# Create the classifier
clf = ParzenWindowClassifier(classes=np.unique(y_train.values), metric="rbf", metric_dict={"gamma": 0.1}, random_state=0)
# Set up the query strategy
# ProbabilisticAL selects the sample for which the model is most uncertain, often by entropy or margin.
sa_qs = ProbabilisticAL(random_state=0, prior=0.001)
# SingleAnnotatorWrapper makes sa_qs compatible with a multi-annotator active learning loop.
ma_qs = SingleAnnotatorWrapper(sa_qs, random_state=0)
12. Perform one iteration of the active learning cycle. In this iteration, query a batch of 100 unlabeled samples, each to be labeled by 3 annotators. Assign their labels to the initially missing labels in y. After updating the labels, retrain the classifier on the updated training data and evaluate its performance on the test set.
# Function to be able to index via an array of indices
idx = lambda A: (A[:, 0], A[:, 1])
# Initial fit of the classifier
clf.fit(X_train_vect.toarray(), majority_vote(y))
# Perform one active learning cycle
print("Cycle 1/1")
# The model selects 100 unlabeled (or partially labeled) samples; for each, it picks 3 annotators to label it
# The result is a set of index pairs like [ [row, annotator], ... ].
query_idx = ma_qs.query(X_train_vect.toarray(), y, batch_size=100, n_annotators_per_sample=3, clf=clf)
# Update labels
y[idx(query_idx)] = y_annot[idx(query_idx)]
# Retrain the classifier on the updated label matrix, again using majority voting across annotators to get a single label per example.
clf.fit(X_train_vect.toarray(), majority_vote(y, random_state=0))
# Evaluate the classifier on the test set
y_pred = clf.predict(X_test_vect.toarray())
acc = accuracy_score(y_test, y_pred)
print(f'Multi-annotator Iteration 1/1, Accuracy: {acc:.4f}')
Cycle 1/1 Multi-annotator Iteration 1/1, Accuracy: 0.5635
Practice: Implement more iterations of the active learning cycle. Use the above code as a reference to perform 10 iterations, similar to how it was done in the pool-based active learning section.
13. Retrain the classifier on the fully labeled training set and evaluate its final accuracy.
# Final evaluation for multi-annotator
clf.fit(X_train_vect.toarray(), y_train)
y_pred = clf.predict(X_test_vect.toarray())
final_acc = accuracy_score(y_test, y_pred)
print(f'Multi-annotator Final accuracy: {final_acc:.4f}')
Multi-annotator Final accuracy: 0.5005
For better visualization of annotator labels, it's recommended to randomly select a smaller subset of 500 data points from the training set. This can reduce visual clutter and will be used for t-SNE visualization, similar to what we did previously.
# Sample a smaller subset for visualization
sample_indices = np.random.choice(X_train_vect.shape[0], 500, replace=False)
X_train_subset = X_train_vect[sample_indices].toarray()
y_train_subset = y_train.values[sample_indices]
y_annot_subset = y_annot[sample_indices]
# Use t-SNE to reduce the dimensionality of the TF-IDF vectors to 2D for visualization
tsne = TSNE(n_components=2, random_state=42)
X_train_tsne = tsne.fit_transform(X_train_subset)
14. Create a visual representation of annotator labels on the IMDB dataset.
Start by setting up a figure with subplots, one for each annotator. For each annotator, identify the correctly labeled data points (is_true) and plot them as circles with a specific color indicating the sentiment. Incorrectly labeled points are plotted as crosses.
# Visualize the noisy labels from each annotator using a scatter plot
fig, axes = plt.subplots(1, n_annotators, figsize=(25, 5))
for a in range(n_annotators):
    is_true = y_annot_subset[:, a] == y_train_subset
    # Correct labels: circles
    axes[a].scatter(X_train_tsne[is_true, 0], X_train_tsne[is_true, 1], c=y_annot_subset[is_true, a], s=30, marker='o', alpha=0.4, cmap='coolwarm')
    # Incorrect labels: crosses
    axes[a].scatter(X_train_tsne[~is_true, 0], X_train_tsne[~is_true, 1], c=y_annot_subset[~is_true, a], s=50, marker='x', alpha=1, cmap='coolwarm', edgecolors='k', linewidths=1.5)
    axes[a].set_title(f'Annotator {a}', fontsize=15)
    axes[a].set_xlabel('t-SNE Dimension 1')
    axes[a].set_ylabel('t-SNE Dimension 2')
plt.show()
In this plot we see 5 annotators. Each subplot corresponds to one annotator. Annotator 0 (left) is likely noise-free or very accurate. Annotator 4 (right) has the most labeling noise.
Stream-based Active Learning (AL) is an active learning strategy where data points are presented one at a time, and the learner must decide immediately whether to query the label or discard the instance.
How It Works
A data stream feeds unlabeled samples sequentially (like a real-time feed). For each new instance:
The model evaluates how informative or uncertain the sample is. It decides on the spot: either query the true label (and pay the labeling cost) or discard the sample and move on.
In this part, we will show how stream-based active learning strategies are used and compare them to one another. For this purpose we will follow four steps: setting up the data stream and query strategies, initializing the classifier and its training window, running the stream-based active learning loop, and evaluating and comparing the results.
We will divide each step into substeps for better clarity and ease of implementation. So let's start!
15. Now it's time to set up the query strategies, i.e., StreamRandomSampling and StreamProbabilisticAL, for our stream-based active learning. For this purpose, follow the steps below:
Define the stream length (stream_length) to be 1000 samples, and use the first 1000 samples from X_train_vect and their corresponding labels from y_train.values.
Set up the two query strategies with random_state=0, set training_size to 1000 and fit_clf to False.
Then prepare a dictionary to store the accuracy results for each query strategy, using accuracies = {}.
What are those query strategies that we are using?
StreamRandomSampling: within a fixed labeling budget, it randomly decides for each incoming sample whether to query its label.
StreamProbabilisticAL: a stream-based variant of probabilistic active learning, which queries a label when the estimated gain in performance justifies spending part of the budget.
stream_length = 1000
X_stream = X_train_vect.toarray()[:stream_length]
y_stream = y_train.values[:stream_length]
query_strategies = {
'StreamRandomSampling': StreamRandomSampling(random_state=0),
'StreamProbabilisticAL': StreamProbabilisticAL(random_state=0)
}
training_size = 1000
fit_clf = False # Don't automatically retrain the classifier every time a new sample is queried.
accuracies = {}
16. For each query strategy (use a for loop, something like for query_strategy_name, query_strategy in query_strategies.items()):
Create a ParzenWindowClassifier with the unique classes from y_train.values. Then set up X_train_stream and y_train_stream as deques with a maximum length of training_size, and initialize them with the first 10 samples from X_stream and y_stream.
What is a deque? A deque (from collections) is like a list, but you can give it a maximum length: once it is full, appending a new element automatically discards the oldest one, giving a sliding window over the most recent training data (see the small example after the table below).
| Tips / Object | Purpose |
|---|---|
| query_strategies.items() | Loop over each strategy (e.g., random, entropy) |
| clf = ParzenWindowClassifier | Initialize a new classifier for this strategy |
| deque(maxlen=training_size) | Sliding window of most recent training data |
| extend(X_stream[:10]) | Seed the model with initial labeled data |
| clf.fit(...) | Train model on initial small dataset |
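As a quick illustration of the sliding-window behaviour mentioned above, here is a toy example, independent of the IMDB data.

from collections import deque

window = deque(maxlen=3)
for i in range(5):
    window.append(i)
    print(list(window))
# Prints [0], [0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4]: once full, the oldest entry is dropped.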
for query_strategy_name, query_strategy in query_strategies.items():
    clf = ParzenWindowClassifier(classes=np.unique(y_train.values), random_state=0)
    # Initialize the training data
    X_train_stream = deque(maxlen=training_size)
    y_train_stream = deque(maxlen=training_size)
    # Initialize with the first 10 samples
    X_train_stream.extend(X_stream[:10])
    y_train_stream.extend(y_stream[:10])
    # Fit the classifier with this initial data.
    clf.fit(X_train_stream, y_train_stream)
17. Create the stream-based active learning loop by following these steps:
Initialize correct_classifications = [] and count = 0.
Loop over the remaining samples of X_stream, starting at index 10 (since the first 10 samples were used to initialize the classifier). Reshape the current sample (X_stream[t]) to be a 2D array with one sample.
Use clf.predict to predict the label for the current sample (X_cand) and compare it to the true label (y_cand). Then use correct_classifications.append to record the result (True if correct, False if incorrect).
Ask the query strategy whether the current sample should be labeled, obtaining the selected indices (sampled_indices) and their associated utilities. The call_func function facilitates this process: start by defining the parameters you want to pass to the query method, namely the candidates (X_cand), the classifier (clf), and the flags return_utilities and fit_clf.
Create a dictionary budget_manager_param_dict to hold the utilities information, and use call_func to dynamically call the update method on query_strategy, passing the parameters you've defined earlier.
Track the number of queried samples with count += len(sampled_indices), and append the current sample to the training window, keeping its true label only if it was queried (otherwise use clf.missing_label).
correct_classifications = []
count = 0
for t in range(10, len(X_stream)): #`t` is the index of the current sample in the stream
    # Reshape the current sample for compatibility with the classifier's predict method, which expects a 2D array
    X_cand = X_stream[t].reshape(1, -1)
    y_cand = y_stream[t]
    # Refit the classifier and predict the current sample's label
    clf.fit(X_train_stream, y_train_stream)
    correct_classifications.append(clf.predict(X_cand)[0] == y_cand)
    # Ask the query strategy whether the current sample should be labeled
    sampled_indices, utilities = call_func(query_strategy.query, candidates=X_cand, clf=clf, return_utilities=True, fit_clf=fit_clf)
    # Create a dictionary budget_manager_param_dict with the utilities
    budget_manager_param_dict = {"utilities": utilities}
    # Dynamically call the update method on query_strategy
    call_func(query_strategy.update, candidates=X_cand, queried_indices=sampled_indices, budget_manager_param_dict=budget_manager_param_dict)
    # Track the number of queried samples
    count += len(sampled_indices)
    # Update the training data: keep the true label only if the sample was queried
    X_train_stream.append(X_stream[t])
    y_train_stream.append(y_cand if len(sampled_indices) > 0 else clf.missing_label)
We need to measure how well the classifier is performing overall.
18. Use np.mean(correct_classifications) to calculate the average accuracy. This average accuracy, along with the correct_classifications list, should be stored in the accuracies dictionary for each query strategy, so you can keep track of how each strategy performed.
# Calculate and print the average accuracy for each query strategy
avg_accuracy = np.mean(correct_classifications)
accuracies[query_strategy_name] = correct_classifications
Now let's run the code from beginning to end:
# Stream-based learning setup
stream_length = 1000
X_stream = X_train_vect.toarray()[:stream_length]
y_stream = y_train.values[:stream_length]
# Set up query strategies
query_strategies = {
'StreamRandomSampling': StreamRandomSampling(random_state=0),
'StreamProbabilisticAL': StreamProbabilisticAL(random_state=0)
}
training_size = 1000
fit_clf = False
accuracies = {}
for query_strategy_name, query_strategy in query_strategies.items():
    clf = ParzenWindowClassifier(classes=np.unique(y_train.values), random_state=0)
    # Initialize the training data
    X_train_stream = deque(maxlen=training_size)
    y_train_stream = deque(maxlen=training_size)
    # Initialize with the first 10 samples
    X_train_stream.extend(X_stream[:10])
    y_train_stream.extend(y_stream[:10])
    clf.fit(X_train_stream, y_train_stream)
    correct_classifications = []
    count = 0
    for t in range(10, len(X_stream)):
        # Reshape the current sample for compatibility
        X_cand = X_stream[t].reshape(1, -1)
        y_cand = y_stream[t]
        # Refit the classifier and predict the current sample's label
        clf.fit(X_train_stream, y_train_stream)
        correct_classifications.append(clf.predict(X_cand)[0] == y_cand)
        # Ask the query strategy whether the current sample should be labeled
        sampled_indices, utilities = call_func(query_strategy.query, candidates=X_cand, clf=clf, return_utilities=True, fit_clf=fit_clf)
        budget_manager_param_dict = {"utilities": utilities}
        call_func(query_strategy.update, candidates=X_cand, queried_indices=sampled_indices, budget_manager_param_dict=budget_manager_param_dict)
        # Update the training data with new samples and labels
        X_train_stream.append(X_stream[t])
        y_train_stream.append(y_cand if len(sampled_indices) > 0 else clf.missing_label)
        # Track the number of queried samples
        count += len(sampled_indices)
    # Calculate and print the average accuracy for each query strategy
    avg_accuracy = np.mean(correct_classifications)
    print(f"Query Strategy: {query_strategy_name}, Avg Accuracy: {avg_accuracy:.4f}, Acquisition count: {count}")
    accuracies[query_strategy_name] = correct_classifications
Query Strategy: StreamRandomSampling, Avg Accuracy: 0.4889, Acquisition count: 107 Query Strategy: StreamProbabilisticAL, Avg Accuracy: 0.4838, Acquisition count: 100
The acquisition count tells you how many samples were selected for labeling during the active learning process. In our case, StreamRandomSampling queried 107 labels and StreamProbabilisticAL queried 100 labels out of the 990 streamed samples.
This count shows how many times each strategy asked for more information to improve the model.
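Both strategies spend a labeling budget; by default they aim to query roughly 10% of the stream, which matches the roughly 100 acquisitions above. If you want to experiment with a cheaper (or more generous) labeling setting, a possible sketch is the following; the budget parameter is the knob we assume you would change, and the 5% value is arbitrary.

# Hypothetical variant: the same two strategies with a smaller labeling budget (5%).
# Re-running the stream loop above with these strategies should roughly halve the acquisition count.
query_strategies_low_budget = {
    'StreamRandomSampling (5% budget)': StreamRandomSampling(random_state=0, budget=0.05),
    'StreamProbabilisticAL (5% budget)': StreamProbabilisticAL(random_state=0, budget=0.05),
}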
19. Let's plot the accuracy over time for each query strategy, using a Gaussian filter.
A Gaussian filter is a technique for smoothing a noisy signal or curve. It's like saying: "Let's look at the accuracy near this point and take a weighted average of nearby values, where points closer in time count more than distant ones." In our example, we have the raw accuracy values over time (one 0/1 outcome per streamed sample), which is why we need to smooth them. In the plot below we use a Gaussian with sigma = 20, so each point is effectively averaged with roughly the ±50 surrounding values (weighted).
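To see what gaussian_filter1d does, here is a tiny standalone illustration; the array below is made up, while the function is the same one used in the plot code that follows.

# Smooth a noisy 0/1 sequence into a gently varying curve; sigma controls how wide
# the weighted average is (larger sigma = smoother curve).
noisy = np.array([0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1], dtype=float)
print(np.round(gaussian_filter1d(noisy, sigma=2), 2))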
for query_strategy_name, correct_classifications in accuracies.items():
    plt.plot(gaussian_filter1d(np.array(correct_classifications, dtype=float), 20), label=query_strategy_name)
plt.legend()
plt.xlabel('Iteration')
plt.ylabel('Accuracy')
plt.title('Accuracy over time for different query strategies')
plt.show()
20. (Optional) Repeat the stream-based active learning process by adding the other strategies (e.g., FixedUncertainty, VariableUncertainty, Split, StreamDensityBasedAL, CognitiveDualQueryStrategyRan, CognitiveDualQueryStrategyFixUn, CognitiveDualQueryStrategyRanVarUn, CognitiveDualQueryStrategyVarUn, PeriodicSampling), then compare your results. Make sure to import them from skactiveml.stream beforehand!
End of Practical!