Practical 9: Movie review classification using Active learning

Tina Shahedi, Anastasia Giachanou

Machine Learning with Python - Utrecht Summer School

In this practical, we’ll work with Active Learning using the IMDB dataset, which has 50,000 movie reviews split into positive and negative sentiments. We’ll explore three strategies:

  1. Simple Evaluation Study: We'll use pool-based active learning with uncertainty sampling, where the model queries the most uncertain samples and retrains iteratively.

  2. Multi-annotator Pool-based Active Learning: This simulates multiple annotators with varying noise levels, using a SingleAnnotatorWrapper with probabilistic active learning. It highlights how multiple annotators impact model performance.

  3. Stream-based Active Learning: Here, we will implement a stream-based approach using StreamRandomSampling and StreamProbabilisticAL, ideal for real-time decision-making as data continuously flows in.

We'll classify movie reviews as positive or negative using their text.

Let's get started

We will use the scikit-activeml library. This library is built on scikit-learn. We'll show how it works by classifying IMDB reviews using the active learning cycle. Let's start by installing the library with pip install scikit-activeml and importing the needed packages from scikit-learn and scikit-activeml.

In [ ]:
!pip install scikit-activeml > /dev/null 2>&1
!pip install numpy==1.24.4 scipy==1.10.1
Collecting numpy==1.24.4
  Downloading numpy-1.24.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting scipy==1.10.1
  Downloading scipy-1.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (58 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 58.9/58.9 kB 2.5 MB/s eta 0:00:00
Downloading numpy-1.24.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17.3/17.3 MB 41.2 MB/s eta 0:00:00
Downloading scipy-1.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34.4/34.4 MB 14.2 MB/s eta 0:00:00
Installing collected packages: numpy, scipy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
  Attempting uninstall: scipy
    Found existing installation: scipy 1.11.4
    Uninstalling scipy-1.11.4:
      Successfully uninstalled scipy-1.11.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pandas-stubs 2.0.3.230814 requires numpy>=1.25.0; python_version >= "3.9", but you have numpy 1.24.4 which is incompatible.
scikit-activeml 0.5.0 requires numpy>=1.26, but you have numpy 1.24.4 which is incompatible.
scikit-activeml 0.5.0 requires scipy>=1.11.3, but you have scipy 1.10.1 which is incompatible.
Successfully installed numpy-1.24.4 scipy-1.10.1
In [ ]:
import numpy as np
import matplotlib as mlp
import matplotlib.pyplot as plt
import pandas as pd
import re
import string
import skactiveml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from skactiveml.classifier import SklearnClassifier, ParzenWindowClassifier
from skactiveml.pool import UncertaintySampling, ProbabilisticAL, RandomSampling
from skactiveml.pool.multiannotator import SingleAnnotatorWrapper
from skactiveml.stream import StreamRandomSampling, StreamProbabilisticAL
from skactiveml.utils import unlabeled_indices, labeled_indices, MISSING_LABEL, majority_vote, call_func
from skactiveml.visualization import plot_utilities, plot_decision_boundary
from collections import deque
from scipy.ndimage import gaussian_filter1d
from sklearn.manifold import TSNE

Loading the IMDB Dataset

We'll be using the IMDB dataset, featuring 50,000 movie reviews from the Internet Movie Database, for our experiments. Now it is time to load the dataset:

When loading real-world datasets, you may encounter a ParserError. This usually means that the default parser of the Pandas read_csv function hit a row it could not tokenize, for example because of unbalanced quotes or an unexpected number of fields. One solution is to pass engine='python' to read_csv, which selects the slower but more feature-complete Python parser, together with the on_bad_lines parameter to skip problematic lines, like this:

# Load the IMDB dataset with the Python parser engine, skipping malformed lines
df = pd.read_csv("IMDB Dataset.csv", engine="python", on_bad_lines='skip')

Another solution is to load the data from a mounted Google Drive, which can help when the ParserError comes from the file itself, for example if it was only partially uploaded to the Colab session.

In [ ]:
#from google.colab import drive
#drive.mount('/content/drive', force_remount=True)

# Load the IMDB dataset
#df = pd.read_csv('/content/drive/My Drive/IMDB Dataset.csv')


df = pd.read_csv("IMDB Dataset.csv", engine="python", on_bad_lines='skip')
df.head()
Out[ ]:
review sentiment
91299 I thought this movie did a down right good job... positive
91300 Bad plot, bad dialogue, bad acting, idiotic di... negative
91301 I am a Catholic taught in parochial elementary... negative
91302 I'm going to have to disagree with the previou... negative
91303 No one expects the Star Trek movies to be high... negative

When working with large datasets, starting with a smaller subset for initial testing helps us to develop and test our code faster. Here, we reduce the IMDB dataset to 10,000 samples using Pandas' sample method. Run the following code to sample part of the dataset.

In [ ]:
# Reduce the dataset size for initial testing
df = df.sample(10000, random_state=42)

Pre-processing the Text Data

In this practical, we will work with text data. Text needs some pre-processing before a machine learning model can use it: ultimately, it has to be converted into numbers.

Another step is to clean the text and remove noise (terms that are not important). Common pre-processing steps include lowercasing, punctuation removal, stemming, and stop word removal.

For more information on how to work with text data, please refer to the tutorial "A Beginner's Guide to Dealing with Text Data".

Text Preprocessing

At the beginning of this practical, we imported two essential libraries for text preprocessing: re and string. The re library supports regular expressions for pattern matching, and the string library provides constants such as the set of punctuation characters. The preprocess_text function, which we define below, converts text to lowercase, removes punctuation using re.sub(), and collapses extra whitespace with re.sub().strip(). We will apply this function to each review in the dataset to clean the text.

In [ ]:
# Preprocess the text data
def preprocess_text(text):
    text = text.lower()  # Lowercase text
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text)  # Remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra whitespace
    return text

df['review'] = df['review'].apply(preprocess_text)

df.describe()
Out[ ]:
review sentiment
count 10000 10000
unique 9510 2
top br br back in his youth the old man had wanted... negative
freq 3 5045
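The top row of the summary above begins with "br br", which suggests that HTML line-break tags such as <br /> survive as stray tokens once the punctuation is stripped. A minimal sketch of one possible remedy is to remove tags before the other steps. The function name preprocess_text_no_html and the tag pattern are assumptions for illustration, not part of the practical.

```python
import re
import string


def preprocess_text_no_html(text):
    """Variant of preprocess_text that drops HTML tags first.

    Assumption: tags have the simple form <...>, as in <br />.
    """
    text = re.sub(r'<[^>]+>', ' ', text)  # Replace tags such as <br /> with a space
    text = text.lower()  # Lowercase text
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text)  # Remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse extra whitespace
    return text


sample = "Great film!<br /><br />Back in his youth, the old man..."
cleaned = preprocess_text_no_html(sample)
print(cleaned)
```

Whether this matters for the classifier depends on the vectorizer settings; without it, "br" simply becomes one more frequent term.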

Next, we convert the sentiment labels to binary values, where 'positive' is mapped to 1 and 'negative' to 0.

In [ ]:
# Convert labels to binary
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})

Let's split the data into training and test sets with an 80/20 ratio using train_test_split. This results in X_train and y_train for training, and X_test and y_test for testing, ensuring that the model is trained and evaluated on separate data.

In [ ]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.2, random_state=42)

TF-IDF and Vectorization

Once the text data is preprocessed, it needs to be converted into a numerical format that machine learning algorithms can work with. This process is known as vectorization. One of the most common methods for vectorization is the TF-IDF (Term Frequency-Inverse Document Frequency) approach.

TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). The TF-IDF value is large when a term appears many times in a document but in only a few documents of the collection.

Term Frequency (TF): Measures how frequently a term appears in a document. It is calculated by dividing the number of times a term appears in a document by the total number of terms in that document.

$$ \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} $$

Inverse Document Frequency (IDF): Measures how important a term is given a collection of documents. It is calculated by taking the logarithm of the number of documents in the corpus divided by the number of documents containing the term.

$$ \text{IDF}(t) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing term } t} \right) $$

TF-IDF Score: The TF-IDF score is the product of the TF and IDF scores. It reflects the importance of a term in a document within the corpus.

Formula: $$ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) $$
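To make the three formulas concrete, here is a small worked sketch on a toy corpus of three tokenized documents. The corpus and the helper names tf, idf, and tf_idf are made up for illustration, and the natural logarithm is assumed for the log in the IDF formula.

```python
import math

# Toy corpus: each document is a list of tokens (illustrative data)
docs = [
    "good movie good plot".split(),
    "bad movie".split(),
    "good acting".split(),
]


def tf(term, doc):
    # Occurrences of the term divided by the total number of terms in the document
    return doc.count(term) / len(doc)


def idf(term, docs):
    # Log of (number of documents / number of documents containing the term)
    n_containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / n_containing)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


score_good = tf_idf("good", docs[0], docs)  # (2/4) * ln(3/2)
score_plot = tf_idf("plot", docs[0], docs)  # (1/4) * ln(3/1)
print(round(score_good, 4), round(score_plot, 4))
```

Although "good" appears twice in the first document, "plot" receives the higher score because it occurs in only one document of the corpus. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each document vector by default, so its numbers will differ from this textbook version.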

1. To apply TF-IDF, create an instance of TfidfVectorizer() with max_features=5000. This means that we will only consider the 5,000 most frequent terms. Use fit_transform() on the training data to learn the vocabulary and convert the text into TF-IDF vectors. Then, apply transform() on the test data to vectorize it using the same vocabulary.
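The fit/transform pattern of this step can be sketched on a few toy sentences. The texts and the variable names ending in _demo are assumptions for illustration; in the practical, the same calls would be applied to X_train and X_test (the result for the training data is referred to as X_train_vect later on).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the training and test reviews (illustrative data)
train_texts = [
    "a good movie with a good plot",
    "a bad movie",
    "good acting and a bad plot",
]
test_texts = ["a surprisingly good plot"]

# Keep at most the 5,000 most frequent terms of the training corpus
vectorizer = TfidfVectorizer(max_features=5000)

# Learn the vocabulary on the training texts only, then vectorize them
X_train_demo = vectorizer.fit_transform(train_texts)

# Reuse the learned vocabulary for the test texts (no refitting)
X_test_demo = vectorizer.transform(test_texts)

print(X_train_demo.shape, X_test_demo.shape)
```

Words that occur only in the test texts, such as "surprisingly" here, are ignored because they are not part of the vocabulary learned from the training data.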

Initialize the active learning

In this practical, we will use the LogisticRegression classifier.

2. Create the Logistic Regression model using SklearnClassifier from skactiveml since it can handle missing labels.

3. Create an initial small set of labeled data by first setting all labels to missing and then randomly selecting a small subset that will be labeled. You can randomly select 10 samples from the training data to label.

Start by defining y_train_initial: use np.full to create an array of the same shape as y_train, filled with MISSING_LABEL. Then randomly select 10 indices from the training data and assign the true labels at these indices in y_train_initial.
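A minimal sketch of this initialization on a toy label array follows. It assumes that MISSING_LABEL corresponds to np.nan, as in current scikit-activeml releases, and uses np.nan directly so that the snippet runs without that library; y_train_demo and initial_idx are illustrative names.

```python
import numpy as np

# Toy stand-in for the true training labels (illustrative data)
y_train_demo = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
                         1, 0, 1, 0, 0, 1, 0, 1, 1, 0])

rng = np.random.default_rng(42)

# Start with every label missing (np.nan stands in for MISSING_LABEL here)
y_train_initial = np.full(shape=y_train_demo.shape, fill_value=np.nan)

# Reveal the true labels of 10 randomly chosen samples
initial_idx = rng.choice(len(y_train_demo), size=10, replace=False)
y_train_initial[initial_idx] = y_train_demo[initial_idx]

n_labeled = int(np.sum(~np.isnan(y_train_initial)))
print(n_labeled)
```

In the practical, y_train is a pandas Series whose index was shuffled by the sampling and splitting steps, so positional access such as y_train.values[initial_idx] is the safer way to read the true labels.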

Set Up the Query Strategy

4. Now set up the query strategy (qs) using UncertaintySampling with method='entropy' and random_state=42.

As a method we use 'entropy', which measures the uncertainty of the model's predictions by calculating the entropy of the predicted class probabilities. Higher entropy indicates higher uncertainty.
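To illustrate what the entropy criterion measures, the following sketch computes the entropy of a few hand-written probability vectors with NumPy. It is an illustration of the idea rather than the library's implementation, and the function name prediction_entropy and the example probabilities are assumptions.

```python
import numpy as np


def prediction_entropy(proba):
    """Entropy of each row of predicted class probabilities."""
    proba = np.clip(proba, 1e-12, 1.0)  # Avoid log(0)
    return -np.sum(proba * np.log(proba), axis=1)


# Predicted probabilities for three samples (illustrative values)
proba = np.array([
    [0.50, 0.50],  # Model is undecided
    [0.90, 0.10],  # Model is fairly confident
    [0.65, 0.35],  # In between
])

entropies = prediction_entropy(proba)

# Indices of the two most uncertain samples, most uncertain first
most_uncertain = np.argsort(entropies)[::-1][:2]
print(entropies.round(3), most_uncertain)
```

The undecided sample has the highest entropy (ln 2 for two classes), so it would be queried first.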

You can explore additional classifiers and query strategies available in scikit-learn and the skactiveml library for more options. Detailed information on other classifiers can be found in the scikit-learn documentation, and all implemented strategies are listed in the scikit-activeml documentation.

Pool-based Active Learning - Simple Evaluation Study

Now, implement 10 iterations of the Active Learning (AL) cycle. In each iteration, we will select 10 unlabeled samples to be labeled using uncertainty sampling.

5. Implement the 10 iterations. In each iteration:

  • Fit the logistic regression model on the training data with the missing labels
  • Run qs.query to determine the selected samples by their indices (query_idx)
  • Assign their labels from y_train to the missing labels in y_train_initial.
  • Make the predictions on the test set
  • Evaluate the classifier's performance on the test set after each iteration of the AL cycle.

6. For comparison, you can train the classifier on the fully labeled training set and evaluate its accuracy when there are no missing labels.
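The cycle described above can be sketched without scikit-activeml to show its mechanics. This is a hand-rolled version under stated assumptions: it runs on synthetic data from make_classification, fits a plain LogisticRegression on the labeled subset only (instead of wrapping it in SklearnClassifier), ranks samples by entropy with NumPy (instead of calling qs.query), and seeds the pool with five labeled samples per class so that the first fit sees both classes. All variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vectorized reviews (illustrative data)
X, y = make_classification(n_samples=600, n_features=20, random_state=42)
X_pool, X_eval, y_pool, y_eval = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(42)
labeled = np.zeros(len(y_pool), dtype=bool)
for c in np.unique(y_pool):  # Seed with 5 labeled samples per class
    labeled[rng.choice(np.flatnonzero(y_pool == c), size=5, replace=False)] = True

accuracies_demo = []
for cycle in range(10):
    # Fit on the samples that currently have a label
    model = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])

    # Rank the unlabeled samples by the entropy of their predicted probabilities
    proba = np.clip(model.predict_proba(X_pool), 1e-12, 1.0)
    entropy = -np.sum(proba * np.log(proba), axis=1)
    entropy[labeled] = -np.inf  # Never re-query labeled samples
    query_idx = np.argsort(entropy)[-10:]

    # "Ask the oracle": reveal the true labels of the queried samples
    labeled[query_idx] = True

    # Evaluate the current model on the held-out set
    accuracies_demo.append(accuracy_score(y_eval, model.predict(X_eval)))

# Baseline: train on the fully labeled pool
full_model = LogisticRegression(max_iter=1000).fit(X_pool, y_pool)
full_accuracy = accuracy_score(y_eval, full_model.predict(X_eval))
print(accuracies_demo, full_accuracy)
```

With the library, the fit, query, and label-assignment steps map onto clf.fit, qs.query with a batch size of 10, and an assignment into y_train_initial at query_idx.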

Multi-annotator Pool-based Active Learning

Suppose we have 5 annotators to label the samples. The annotators have different accuracies for labeling the samples.

In this part, we simulate multi-annotator active learning. We start by initializing multiple annotators with varying noise levels and generating noisy labels.

We will use the following code to initialize multiple annotators with different noise levels and generate noisy labels. Here are the steps we will take:

  • We start by defining a variable for the number of annotators (n_annotators) and set it to 5.
  • Create an array y_annot with dimensions (number of training samples, number of annotators) and fill it with zeros to store annotator labels.
  • Initialize a random number generator rng with a fixed seed (e.g., 0) for reproducibility.
  • Then, we generate noise levels, linearly spaced between 0.0 and 0.3, with a total of n_annotators values. Each annotator will have a different noise level. (np.linspace(0.0, 0.3, num=n_annotators)).
  • Next, we generate a matrix of the same shape as y_annot using a binomial distribution (rng.binomial). Each value is either 0 or 1, representing whether the label is flipped or not, according to the respective annotator's noise level.
  • Then we apply noise to the true labels. For this you will need the XOR operation (^) which flips the true label based on the noise matrix value (1 flips the label, 0 keeps it unchanged).
In [ ]:
# Number of annotators
n_annotators = 5

# Generate noisy labels for each annotator
y_annot = np.zeros(shape=(X_train_vect.shape[0], n_annotators), dtype=int)
rng = np.random.default_rng(seed=0)

# Noise levels
noise_levels = np.linspace(0.0, 0.3, num=n_annotators)

# Generate noise for all annotators simultaneously
y_noise_matrix = rng.binomial(1, noise_levels[:, np.newaxis], size=(n_annotators, X_train_vect.shape[0])).T

# Apply noise to the true labels
y_annot = y_noise_matrix ^ y_train.values[:, np.newaxis]

# Initialize training labels with missing values
y = np.full(shape=(X_train_vect.shape[0], n_annotators), fill_value=MISSING_LABEL)
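As a sanity check on the noise model, the following sketch repeats the same construction on synthetic true labels and measures how often each simulated annotator agrees with the truth. The array y_true, the sample count, and the simple majority vote written in NumPy are assumptions for illustration; the vote is not the library's majority_vote function.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_samples, n_annotators = 5000, 5

# Synthetic true labels (illustrative data)
y_true = rng.integers(0, 2, size=n_samples)

# Same noise construction as above: one flip probability per annotator
noise_levels = np.linspace(0.0, 0.3, num=n_annotators)
flip = rng.binomial(
    1, noise_levels[:, np.newaxis], size=(n_annotators, n_samples)
).T
y_annot_demo = flip ^ y_true[:, np.newaxis]

# Fraction of correct labels per annotator (roughly 1 - noise level)
annotator_accuracy = (y_annot_demo == y_true[:, np.newaxis]).mean(axis=0)

# Simple majority vote over the five annotators
majority = (y_annot_demo.sum(axis=1) > n_annotators / 2).astype(int)
majority_accuracy = (majority == y_true).mean()

print(annotator_accuracy.round(3), round(majority_accuracy, 3))
```

The first annotator has a noise level of 0.0 and is therefore always correct, while the last one flips about 30% of its labels. Combining the annotators by majority vote is more reliable than trusting the noisiest annotator alone.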

We want to train a ParzenWindowClassifier on the labels provided by these annotators. We select the samples with a single-annotator query strategy and choose the annotators at random using the SingleAnnotatorWrapper.

7. Create a clf object with ParzenWindowClassifier and set metric="rbf". Then pass the ParzenWindowClassifier as an argument to the single-annotator query strategy ProbabilisticAL. Finally, pass the single-annotator query strategy as an argument to the wrapper, also specifying the number of annotators.

8. Perform one iteration of the active learning cycle. In this iteration, query 10 unlabeled samples to be labeled by 3 annotators. Assign their labels to the initially missing labels in y. After updating the labels, retrain the classifier on the updated training data, and evaluate its performance on the test set.

Practice: Implement more iterations of the active learning cycle. Use the above code as a reference to perform 10 iterations, similar to how it was done in the pool-based active learning section.

Stream-based Active Learning

In this part, we will show how stream-based active learning strategies are used and compare them to one another. For this purpose, we will follow these four steps:

  1. Set Up Query Strategies
  2. Initialize Classifier and Training Data
  3. Create Stream-based Active Learning Loop
  4. Calculate and Track Accuracy

We will divide each step into substeps for better clarity and ease of implementation. So let's start!

Set Up Query Strategies

9. Now it's time to set up the query strategies, i.e., StreamRandomSampling and StreamProbabilisticAL, for our stream-based active learning. To do so, follow the steps below:

  1. Define the length of the data stream to be 1000 samples, and use the first 1000 samples from X_train_vect and their corresponding labels from y_train.values.
  2. Initialize the query strategies with a fixed random_state=0, set training_size to 1000 and fit_clf to False, and create an empty dictionary accuracies = {} to store the accuracy results of each query strategy.

Initialize Classifier and Training Data

10. For each query strategy:

  1. Create a ParzenWindowClassifier with the unique classes from y_train.values. Set up X_train_stream and y_train_stream deques with a maximum length of training_size and initialize them with the first 10 samples from X_stream and y_stream.
  2. Fit the classifier with this initial data.

Create Stream-based Active Learning Loop

11. Create the stream-based active learning loop by following these steps:

  1. To keep track of the number of queried samples and of the accuracy of the classifier's predictions, set up:
    correct_classifications = []
    count = 0
    
  2. Now start a loop from the 10th sample to the end of X_stream (since the first 10 samples were used to initialize the classifier).
  3. Reshape the current sample (X_stream[t]) to be a 2D array with one sample.
  4. Refit the classifier with the current training data. Use clf.predict for predicting the label for the current sample (X_cand), and compare it to the true label (y_cand). Then, use correct_classifications.append to append the result (True if correct, False if incorrect).
  5. Query the strategy to obtain the selected samples (sampled_indices) and their associated utilities. The call_func function facilitates this process, since it passes on only those parameters that the called method accepts. Start by defining the parameters you want to pass to the query method: the candidates (X_cand), the classifier (clf), and the flags return_utilities and fit_clf.
  6. Create a dictionary budget_manager_param_dict to hold the utilities information.
  7. Use call_func to dynamically call the update method on query_strategy, by passing the parameters which you've defined earlier.
  8. Add the number of newly queried samples to the count variable by
    count += len(sampled_indices)
    
  9. Update the training data by adding the current sample and its label. If the sample was queried, add its true label; otherwise, add a missing label.
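The structure of this loop can be sketched without scikit-activeml. In the following hand-rolled version, a random decision with a fixed budget of 10% stands in for StreamRandomSampling, a LogisticRegression fitted on the labeled part of the window stands in for ParzenWindowClassifier, and the query and update calls via call_func are replaced by a single random draw. The synthetic two-class stream, the budget value, and all variable names other than correct_classifications and count are assumptions.

```python
from collections import deque

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
stream_length, init_size, training_size, budget = 300, 10, 300, 0.1

# Synthetic stream: alternating classes, two shifted Gaussian blobs
y_stream = np.arange(stream_length) % 2
X_stream = rng.normal(size=(stream_length, 2)) + 2.0 * y_stream[:, np.newaxis]

# Sliding training window, initialized with the first 10 labeled samples
X_window = deque(X_stream[:init_size], maxlen=training_size)
y_window = deque(y_stream[:init_size].astype(float), maxlen=training_size)

correct_classifications = []
count = 0
for t in range(init_size, stream_length):
    X_cand = X_stream[t].reshape(1, -1)  # One sample as a 2D array
    y_cand = y_stream[t]

    # Refit on the samples in the window that currently have a label
    X_arr, y_arr = np.array(X_window), np.array(y_window)
    has_label = ~np.isnan(y_arr)
    clf = LogisticRegression().fit(X_arr[has_label], y_arr[has_label].astype(int))

    # Test-then-train: predict before the label is (possibly) revealed
    correct_classifications.append(bool(clf.predict(X_cand)[0] == y_cand))

    # Stand-in for the query strategy: query with probability `budget`
    queried = rng.random() < budget
    count += int(queried)

    # Store the true label if queried, otherwise a missing label (np.nan)
    X_window.append(X_stream[t])
    y_window.append(float(y_cand) if queried else np.nan)

accuracy = np.mean(correct_classifications)
print(count, round(accuracy, 3))
```

Each sample is used for evaluation before the classifier can learn from it, which is why the list of correct classifications can later be read as accuracy over time.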

Calculate and Track Accuracy

We need to measure how well the classifier is performing overall.

12. Use np.mean(correct_classifications) to calculate the average accuracy. This average accuracy, along with the correct_classifications list, should be stored in the accuracies dictionary for each query strategy. It will allow you to keep track of how each strategy performed.

Now let's run the code from beginning to end

13. Let's plot the accuracy over time for each query strategy, using a Gaussian filter.
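The smoothing step can be sketched as follows. The simulated list of correct and incorrect predictions and the value sigma=50 are assumptions for illustration; in the practical, the input would be the correct_classifications list stored for each query strategy.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(0)

# Simulated per-sample outcomes whose hit rate rises from 0.6 to 0.9
# (illustrative data)
correct = rng.random(500) < np.linspace(0.6, 0.9, 500)

# Cast to float first: with a boolean or integer input, the filter
# output would be stored in the input's dtype and lose its fractions
smoothed = gaussian_filter1d(correct.astype(float), sigma=50)

print(smoothed[:3].round(3), smoothed[-3:].round(3))
```

The smoothed array has one value per stream position and can be passed to plt.plot, with one curve per query strategy. A larger sigma gives a smoother curve but hides short-term changes.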

14. Repeat the stream-based active learning process by adding the other strategies (e.g., FixedUncertainty, VariableUncertainty, Split, StreamDensityBasedAL, CognitiveDualQueryStrategyRan, CognitiveDualQueryStrategyFixUn, CognitiveDualQueryStrategyRanVarUn, CognitiveDualQueryStrategyVarUn, PeriodicSampling), then compare your results. Make sure to import them from skactiveml.stream beforehand!

End of Practical!