Tina Shahedi, Anastasia Giachanou
Machine Learning with Python - Utrecht Summer School
In this practical, we’ll work with Active Learning using the IMDB dataset, which has 50,000 movie reviews split into positive and negative sentiments. We’ll explore three strategies:
- Simple Evaluation Study: We'll use pool-based active learning with uncertainty sampling, where the model queries the most uncertain samples and retrains iteratively.
- Multi-annotator Pool-based Active Learning: We'll simulate multiple annotators with varying noise levels, using a SingleAnnotatorWrapper with probabilistic active learning. This highlights how multiple annotators impact model performance.
- Stream-based Active Learning: We'll implement a stream-based approach using StreamRandomSampling and StreamProbabilisticAL, ideal for real-time decision-making as data continuously flows in.
We'll classify movie reviews as positive or negative using their text.
We will use the scikit-activeml library, which is built on scikit-learn. We'll show how it works by classifying IMDB reviews using the active learning cycle. Let's start by installing the library with pip install scikit-activeml and importing the needed packages from scikit-learn and scikit-activeml.
!pip install scikit-activeml > /dev/null 2>&1
# scikit-activeml 0.5.0 requires numpy>=1.26 and scipy>=1.11.3 (per pip's resolver)
!pip install "numpy>=1.26" "scipy>=1.11.3" > /dev/null 2>&1
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import re
import string
import skactiveml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from skactiveml.classifier import SklearnClassifier, ParzenWindowClassifier
from skactiveml.pool import UncertaintySampling, ProbabilisticAL, RandomSampling
from skactiveml.pool.multiannotator import SingleAnnotatorWrapper
from skactiveml.stream import StreamRandomSampling, StreamProbabilisticAL
from skactiveml.utils import unlabeled_indices, labeled_indices, MISSING_LABEL, majority_vote, call_func
from skactiveml.visualization import plot_utilities, plot_decision_boundary
from collections import deque
from scipy.ndimage import gaussian_filter1d
from sklearn.manifold import TSNE
We'll be using the IMDB dataset, featuring 50,000 movie reviews from the Internet Movie Database, for our experiments. Now it is time to load the dataset:
When loading real-world datasets, you may encounter a ParserError. This usually happens when loading a large CSV file into pandas with the read_csv function. The solution is to pass the engine='python' parameter in the read_csv call to handle complex CSV structures, and the on_bad_lines parameter to skip problematic lines, like this:
# Load the IMDB dataset with proper handling for encoding and skipping bad lines
df = pd.read_csv("IMDB Dataset.csv", engine="python", on_bad_lines='skip')
Another solution is to load the data by mounting Google Drive, which can help with issues that might lead to a ParserError.
#from google.colab import drive
#drive.mount('/content/drive', force_remount=True)
# Load the IMDB dataset
#df = pd.read_csv('/content/drive/My Drive/IMDB Dataset.csv')
df = pd.read_csv("IMDB Dataset.csv", engine="python", on_bad_lines='skip')
df.head()
| | review | sentiment |
|---|---|---|
| 91299 | I thought this movie did a down right good job... | positive |
| 91300 | Bad plot, bad dialogue, bad acting, idiotic di... | negative |
| 91301 | I am a Catholic taught in parochial elementary... | negative |
| 91302 | I'm going to have to disagree with the previou... | negative |
| 91303 | No one expects the Star Trek movies to be high... | negative |
When working with large datasets, starting with a smaller subset for initial testing helps us develop and test our code faster. Here, we reduce the IMDB dataset to 10,000 samples using pandas' sample method. Run the following code to sample part of the dataset.
# Reduce the dataset size for initial testing
df = df.sample(10000, random_state=42)
In this practical, we will work with text data. Text data needs pre-processing to bring it into a format that machines can understand, that is, to convert it into numbers. Another step is to clean the text and remove noise (terms that are not important). Pre-processing steps include lowercasing, punctuation removal, stemming, and stop-word removal.
For more information on how to work with text data, please refer to the A Beginner's Guide to Dealing with Text Data tutorial.
At the beginning of this practical, we introduced two essential libraries for text preprocessing: re and string. The re library supports regular expressions for pattern matching, and the string library provides constants like punctuation characters. The preprocess_text function, which we define below, converts text to lowercase, removes punctuation using re.sub(), and eliminates extra whitespace with re.sub().strip(). We will apply this function to each review in the dataset to clean the text.
# Preprocess the text data
def preprocess_text(text):
text = text.lower() # Lowercase text
text = re.sub(f'[{re.escape(string.punctuation)}]', '', text) # Remove punctuation
text = re.sub(r'\s+', ' ', text).strip() # Remove extra whitespace
return text
df['review'] = df['review'].apply(preprocess_text)
df.describe()
| | review | sentiment |
|---|---|---|
| count | 10000 | 10000 |
| unique | 9510 | 2 |
| top | br br back in his youth the old man had wanted... | negative |
| freq | 3 | 5045 |
Next, we convert the sentiment labels to binary values, where 'positive' is mapped to 1 and 'negative' to 0.
# Convert labels to binary
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
Let's split the data into training and test sets with an 80/20 ratio using train_test_split. This results in X_train and y_train for training, and X_test and y_test for testing, ensuring that the model is trained and evaluated on separate data.
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.2, random_state=42)
Once the text data is preprocessed, it needs to be converted into a numerical format that machine learning algorithms can work with. This process is known as vectorization. One of the most common methods for vectorization is the TF-IDF (Term Frequency-Inverse Document Frequency) approach.
TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). The TF-IDF value is high when a term appears frequently in a document but rarely across the rest of the corpus.
Term Frequency (TF): Measures how frequently a term appears in a document. It is calculated by dividing the number of times a term appears in a document by the total number of terms in that document.
$$ \text{TF}(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} $$

Inverse Document Frequency (IDF): Measures how important a term is given a collection of documents. It is calculated by taking the logarithm of the number of documents in the corpus divided by the number of documents containing the term.

$$ \text{IDF}(t) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing term } t} \right) $$

TF-IDF Score: The TF-IDF score is the product of the TF and IDF scores. It reflects the importance of a term in a document within the corpus.

Formula: $$ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) $$
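As a small worked example: if a term appears 3 times in a 100-word review, $\text{TF} = 3/100 = 0.03$; if 100 out of 10,000 reviews contain the term, $\text{IDF} = \log(10000/100) \approx 4.61$ (using the natural logarithm), so $\text{TF-IDF} \approx 0.03 \times 4.61 \approx 0.14$.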
1. To apply TF-IDF, create an instance of TfidfVectorizer() with max_features=5000. This means that we will only consider the 5,000 most frequent terms. Use fit_transform() on the training data to learn the vocabulary and convert the text into TF-IDF vectors. Then, apply transform() on the test data to vectorize it using the same vocabulary.
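A minimal sketch of step 1 could look as follows; the names X_train_vect and X_test_vect are chosen to match the variables used later in this practical.

# Vectorize the reviews with TF-IDF, keeping the 5,000 most frequent terms
vectorizer = TfidfVectorizer(max_features=5000)
X_train_vect = vectorizer.fit_transform(X_train)  # learn the vocabulary on the training data
X_test_vect = vectorizer.transform(X_test)  # reuse the same vocabulary for the test data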
In this practical, we will use the LogisticRegression classifier.
2. Create the Logistic Regression model using SklearnClassifier from skactiveml since it can handle missing labels.
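A minimal sketch of step 2; passing the classes up front and the seed random_state=42 are our own choices here.

# Wrap LogisticRegression so that it tolerates MISSING_LABEL entries in y
clf = SklearnClassifier(LogisticRegression(), classes=[0, 1], random_state=42)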
3. Create an initial small set of labeled data by first setting all labels to missing and then randomly selecting a small subset that will be labeled. You can randomly select 10 samples from the training data to label.
Start by defining y_train_initial: create an array of the same shape as y_train, filled with MISSING_LABEL, using np.full. Finally, randomly select 10 indices from the training data and assign the true labels to these initially selected indices in y_train_initial.
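One possible implementation of step 3 (using np.random.default_rng with an arbitrary seed):

# Start fully unlabeled, then reveal the true labels of 10 random samples
y_train_initial = np.full(shape=y_train.shape, fill_value=MISSING_LABEL)
rng = np.random.default_rng(42)
initial_idx = rng.choice(len(y_train), size=10, replace=False)
y_train_initial[initial_idx] = y_train.values[initial_idx]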
4. Now set up the query strategy (qs), using UncertaintySampling with the entropy method and random_state=42. The 'entropy' method measures the uncertainty of the model's predictions by calculating the entropy of the predicted class probabilities; higher entropy indicates higher uncertainty.
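In code, step 4 is a one-liner:

# Entropy-based uncertainty sampling
qs = UncertaintySampling(method='entropy', random_state=42)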
Now, implement 10 iterations of the Active Learning (AL) cycle. In each iteration, we will select 10 unlabeled samples to be labeled using uncertainty sampling.
5. Implement the 10 iterations, as sketched below. In each iteration:
- Query the 10 most uncertain unlabeled samples (query_idx).
- Assign the true labels from y_train to the corresponding missing labels in y_train_initial.
- Retrain the classifier on the updated labels and evaluate its accuracy on the test set.
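A minimal sketch of this loop (the printout is optional):

for i in range(10):
    # Query the 10 samples the current model is most uncertain about
    query_idx = qs.query(X=X_train_vect, y=y_train_initial, clf=clf, batch_size=10)
    # Simulate the oracle by copying the true labels of the queried samples
    y_train_initial[query_idx] = y_train.values[query_idx]
    # Retrain on the enlarged labeled set and evaluate on the held-out test set
    clf.fit(X_train_vect, y_train_initial)
    acc = accuracy_score(y_test, clf.predict(X_test_vect))
    print(f"Iteration {i + 1}: {len(labeled_indices(y_train_initial))} labels, accuracy = {acc:.3f}")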
6. For comparison, you can train the classifier on the fully labeled training set and evaluate its accuracy when there are no missing labels.
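A sketch of this fully supervised baseline:

# Reference model trained on all training labels
clf_full = SklearnClassifier(LogisticRegression(), classes=[0, 1], random_state=42)
clf_full.fit(X_train_vect, y_train.values)
print("Accuracy with all labels:", accuracy_score(y_test, clf_full.predict(X_test_vect)))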
Suppose we have 5 annotators to label the samples, each with a different labeling accuracy.
In this part, we simulate multi-annotator active learning. We start by initializing multiple annotators with varying noise levels and generating noisy labels.
We will use the following code to initialize multiple annotators with different noise levels and generate noisy labels. Here are the steps we will take:
- Define the number of annotators (n_annotators) and set it to 5.
- Create an array y_annot with dimensions (number of training samples, number of annotators) and fill it with zeros to store annotator labels.

# Number of annotators
n_annotators = 5
# Generate noisy labels for each annotator
y_annot = np.zeros(shape=(X_train_vect.shape[0], n_annotators), dtype=int)
rng = np.random.default_rng(seed=0)
# Noise levels
noise_levels = np.linspace(0.0, 0.3, num=n_annotators)
# Generate noise for all annotators simultaneously
y_noise_matrix = rng.binomial(1, noise_levels[:, np.newaxis], size=(n_annotators, X_train_vect.shape[0])).T
# Apply noise to the true labels
y_annot = y_noise_matrix ^ y_train.values[:, np.newaxis]
# Initialize training labels with missing values
y = np.full(shape=(X_train_vect.shape[0], n_annotators), fill_value=MISSING_LABEL)
We want to label these samples using a ParzenWindowClassifier. We query the samples using probabilistic active learning (ProbabilisticAL) and select the annotators at random using the SingleAnnotatorWrapper.
7. Create a clf object with ParzenWindowClassifier and set metric="rbf". Then pass the ParzenWindowClassifier as an argument to the single-annotator query strategy ProbabilisticAL. Then pass the single-annotator query strategy as an argument to the wrapper, also specifying the number of annotators.
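A minimal sketch of step 7, following the pattern of the scikit-activeml multi-annotator tutorial; note that in recent library versions the classifier is passed to ProbabilisticAL at query time rather than at construction, and the number of annotators is inferred from the shape of y.

# Classifier that supports partially labeled data
clf = ParzenWindowClassifier(classes=np.unique(y_train.values), metric="rbf", random_state=0)
# Single-annotator strategy, wrapped so annotators can be selected at random
sa_qs = ProbabilisticAL(random_state=0)
ma_qs = SingleAnnotatorWrapper(sa_qs, random_state=0)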
8. Perform one iteration of the active learning cycle. In this iteration, query 10 unlabeled samples to be labeled by 3 annotators. Assign their labels to the initially missing labels in y. After updating the labels, retrain the classifier on the updated training data, and evaluate its performance on the test set.
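A sketch of one such iteration, again following the multi-annotator tutorial; the query_params_dict keyword and the (sample, annotator) index pairs returned by query are version-dependent details, so treat this as a template rather than exact API.

# Aggregate the current (mostly missing) annotations and fit the classifier
y_agg = majority_vote(y, random_state=0)
clf.fit(X_train_vect, y_agg)
# Query 10 samples, each to be labeled by 3 annotators
query_idx = ma_qs.query(X_train_vect, y, batch_size=10,
                        query_params_dict={"clf": clf},
                        n_annotators_per_sample=3)
# Column 0 holds sample indices, column 1 the selected annotator indices
y[query_idx[:, 0], query_idx[:, 1]] = y_annot[query_idx[:, 0], query_idx[:, 1]]
# Retrain on the aggregated labels and evaluate on the test set
clf.fit(X_train_vect, majority_vote(y, random_state=0))
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test_vect)))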
Practice: Implement more iterations of the active learning cycle. Use the above code as a reference to perform 10 iterations, similar to how it was done in the pool-based active learning section.
In this part, we will show how stream-based active learning strategies are used and compare them to one another. For this purpose, we will follow the next four steps (9-12 below).
We will divide each step into substeps for better clarity and ease of implementation. So let's start!
9. Now it's time to set up the query strategies, i.e., StreamRandomSampling and StreamProbabilisticAL, for our stream-based active learning. For this purpose, follow the steps below (a sketch follows):
- Create the data stream from X_train_vect and the corresponding labels from y_train.values.
- Initialize both query strategies with random_state=0, set the training_size to 1000, and fit_clf to False.
- Store the accuracy results for each query strategy by using accuracies = {}.
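A sketch of step 9; converting the sparse TF-IDF matrix to a dense array makes row-by-row streaming simpler but costs memory (reduce max_features if needed), and the dictionary of strategies is our own convenience choice. The fit_clf=False flag is passed inside the loop in step 11.

# Create the data stream (dense rows are easier to index one by one)
X_stream = X_train_vect.toarray()
y_stream = y_train.values
training_size = 1000  # maximum size of the sliding training window
query_strategies = {
    "StreamRandomSampling": StreamRandomSampling(random_state=0),
    "StreamProbabilisticAL": StreamProbabilisticAL(random_state=0),
}
accuracies = {}  # will map strategy name -> accuracy results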
10. For each query strategy (see the sketch after this list):
- Initialize a ParzenWindowClassifier with the unique classes from y_train.values.
- Set up X_train_stream and y_train_stream deques with a maximum length of training_size and initialize them with the first 10 samples from X_stream and y_stream.
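Per strategy, this initialization could look as follows (the block would sit at the top of a loop over query_strategies.items()):

# Fresh classifier and sliding training window for the current strategy
clf = ParzenWindowClassifier(classes=np.unique(y_train.values), random_state=0)
X_train_stream = deque(maxlen=training_size)
X_train_stream.extend(X_stream[:10])
y_train_stream = deque(maxlen=training_size)
y_train_stream.extend(y_stream[:10])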
11. Create the stream-based active learning loop by following these steps (a sketch of the full loop follows this list):
- Initialize correct_classifications = [] and count = 0.
- Iterate over X_stream starting from index 10 (since the first 10 samples were used to initialize the classifier). Reshape the current sample (X_stream[t]) to be a 2D array with one sample.
- Use clf.predict to predict the label for the current sample (X_cand), and compare it to the true label (y_cand). Then use correct_classifications.append to record the result (True if correct, False if incorrect).
- Query the strategy for the sampled indices (sampled_indices) and their associated utilities. The call_func function facilitates this process. Start by defining the parameters you want to pass to the query method: the candidates (X_cand), the classifier (clf), and the flags return_utilities and fit_clf.
- Create budget_manager_param_dict to hold the utilities information, and use call_func to dynamically call the update method on query_strategy, passing the parameters you defined earlier.
- If the sample was queried, add it to the training window and increment the counter with count += len(sampled_indices).
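Putting the substeps together, the loop body might look like this sketch (adapted from the scikit-activeml stream tutorial; query_strategy is the strategy of the current outer-loop iteration):

correct_classifications = []
count = 0
for t in range(10, len(X_stream)):
    # Current sample as a 2D array with one row
    X_cand = X_stream[t].reshape([1, -1])
    y_cand = y_stream[t]
    # Fit on the current window and record whether the prediction is correct
    clf.fit(np.array(X_train_stream), np.array(y_train_stream))
    correct_classifications.append(clf.predict(X_cand)[0] == y_cand)
    # call_func forwards only the keyword arguments each strategy's query accepts
    sampled_indices, utilities = call_func(
        query_strategy.query, candidates=X_cand, clf=clf,
        return_utilities=True, fit_clf=False)
    # Inform the budget manager about the utility of the processed sample
    budget_manager_param_dict = {"utilities": utilities}
    call_func(query_strategy.update, candidates=X_cand,
              queried_indices=sampled_indices,
              budget_manager_param_dict=budget_manager_param_dict)
    # If the sample was queried, add it (with its true label) to the window
    if len(sampled_indices) > 0:
        X_train_stream.append(X_stream[t])
        y_train_stream.append(y_cand)
        count += len(sampled_indices)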
We need to measure how well the classifier is performing overall.
12. Use np.mean(correct_classifications) to calculate the average accuracy. This average accuracy, along with the correct_classifications list, should be stored in the accuracies dictionary for each query strategy. This will allow you to keep track of how each strategy performed.
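For example (query_strategy_name is assumed to be the loop variable holding the current strategy's name):

# Store the per-sample results and the overall average for this strategy
avg_accuracy = np.mean(correct_classifications)
accuracies[query_strategy_name] = {"avg": avg_accuracy, "curve": correct_classifications}
print(f"{query_strategy_name}: average accuracy = {avg_accuracy:.3f}, queried {count} labels")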
Now let's run the code from beginning to end.
13. Let's plot the accuracy over time for each query strategy, using a Gaussian filter.
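A sketch using gaussian_filter1d (imported earlier); the smoothing width sigma=50 is an arbitrary choice, and the structure of accuracies follows the step 12 sketch above.

plt.figure(figsize=(10, 5))
for name, results in accuracies.items():
    # Smooth the 0/1 correctness sequence so the accuracy trend is visible
    smoothed = gaussian_filter1d(np.array(results["curve"], dtype=float), sigma=50)
    plt.plot(smoothed, label=name)
plt.xlabel("Stream position")
plt.ylabel("Smoothed accuracy")
plt.legend()
plt.show()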
14. Repeat the stream-based active learning process by adding the other strategies (e.g., FixedUncertainty, VariableUncertainty, Split, StreamDensityBasedAL, CognitiveDualQueryStrategyRan, CognitiveDualQueryStrategyFixUn, CognitiveDualQueryStrategyRanVarUn, CognitiveDualQueryStrategyVarUn, PeriodicSampling), then compare your results. Make sure to import them from skactiveml.stream beforehand!
End of Practical!