Tina Shahedi, Anastasia Giachanou
Machine Learning with Python - Utrecht Summer School
In this practical, we’ll work with Active Learning using the IMDB dataset, which has 50,000 movie reviews split into positive and negative sentiments. We’ll explore three strategies:
- Simple Evaluation Study: We'll use pool-based active learning with uncertainty sampling, where the model queries the most uncertain samples and retrains iteratively.
- Multi-annotator Pool-based Active Learning: We'll simulate multiple annotators with varying noise levels, using a SingleAnnotatorWrapper with probabilistic active learning. This highlights how multiple annotators impact model performance.
- Stream-based Active Learning: We'll implement a stream-based approach using StreamRandomSampling and StreamProbabilisticAL, ideal for real-time decision-making as data continuously flows in.
We'll classify movie reviews as positive or negative using their text.
We will use the scikit-activeml library, which is built on scikit-learn. We'll show how it works by classifying IMDB reviews using the active learning cycle. Let's start by installing the library with pip install scikit-activeml and importing the needed packages from scikit-learn and scikit-activeml.
!pip install scikit-activeml > /dev/null 2>&1
# scikit-activeml 0.5.0 requires numpy>=1.26 and scipy>=1.11.3 (per pip's resolver)
!pip install "numpy>=1.26" "scipy>=1.11.3" > /dev/null 2>&1
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import re
import string
import skactiveml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from skactiveml.classifier import SklearnClassifier, ParzenWindowClassifier
from skactiveml.pool import UncertaintySampling, ProbabilisticAL, RandomSampling
from skactiveml.pool.multiannotator import SingleAnnotatorWrapper
from skactiveml.stream import StreamRandomSampling, StreamProbabilisticAL
from skactiveml.utils import unlabeled_indices, labeled_indices, MISSING_LABEL, majority_vote, call_func
from skactiveml.visualization import plot_utilities, plot_decision_boundary
from collections import deque
from scipy.ndimage import gaussian_filter1d
from sklearn.manifold import TSNE
We'll be using the IMDB dataset, featuring 50,000 movie reviews from the Internet Movie Database, for our experiments. Now it is time to load the dataset:
When loading real-world datasets, you may encounter a ParserError. This usually happens when loading a large CSV file into pandas with the read_csv function. The solution is to pass the engine='python' parameter in the read_csv call to handle complex CSV structures, and the on_bad_lines parameter to skip problematic lines, like this:
# Load the IMDB dataset with proper handling for encoding and skipping bad lines
df = pd.read_csv("IMDB Dataset.csv", engine="python", on_bad_lines='skip')
Another solution is to load the data by mounting Google Drive, which can help with issues that might lead to a ParserError.
#from google.colab import drive
#drive.mount('/content/drive', force_remount=True)
# Load the IMDB dataset
#df = pd.read_csv('/content/drive/My Drive/IMDB Dataset.csv')
df = pd.read_csv("IMDB Dataset.csv", engine="python", on_bad_lines='skip')
df.head()
| | review | sentiment |
|---|---|---|
| 91299 | I thought this movie did a down right good job... | positive |
| 91300 | Bad plot, bad dialogue, bad acting, idiotic di... | negative |
| 91301 | I am a Catholic taught in parochial elementary... | negative |
| 91302 | I'm going to have to disagree with the previou... | negative |
| 91303 | No one expects the Star Trek movies to be high... | negative |
When working with large datasets, starting with a smaller subset for initial testing helps us develop and test our code faster. Here, we reduce the IMDB dataset to 10,000 samples using pandas' sample method. Run the following code to sample part of the dataset.
# Reduce the dataset size for initial testing
df = df.sample(10000, random_state=42)
In this practical, we will work with text data. Text data needs pre-processing to bring it into a format that machines can understand, that is, to convert it into numbers. Another step is to clean the text and remove noise (terms that are not important). Pre-processing steps include lowercasing, punctuation removal, stemming, and stop-word removal.
For more information on how to work with text data, please refer to the A Beginner's Guide to Dealing with Text Data tutorial.
At the beginning of this practical, we introduced two essential libraries for text preprocessing: re and string. The re library supports regular expressions for pattern matching, and the string library provides constants like punctuation characters. The preprocess_text function, which we define below, converts text to lowercase, removes punctuation using re.sub(), and eliminates extra whitespace with re.sub().strip(). We will apply this function to each review in the dataset to clean the text.
# Preprocess the text data
def preprocess_text(text):
text = text.lower() # Lowercase text
text = re.sub(f'[{re.escape(string.punctuation)}]', '', text) # Remove punctuation
text = re.sub(r'\s+', ' ', text).strip() # Remove extra whitespace
return text
df['review'] = df['review'].apply(preprocess_text)
df.describe()
| | review | sentiment |
|---|---|---|
| count | 10000 | 10000 |
| unique | 9510 | 2 |
| top | br br back in his youth the old man had wanted... | negative |
| freq | 3 | 5045 |
Next, we convert the sentiment labels to binary values, where 'positive' is mapped to 1 and 'negative' to 0.
# Convert labels to binary
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
Let's split the data into training and test sets with an 80/20 ratio using train_test_split. This results in X_train and y_train for training, and X_test and y_test for testing, ensuring that the model is trained and evaluated on separate data.
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.2, random_state=42)
Once the text data is preprocessed, it needs to be converted into a numerical format that machine learning algorithms can work with. This process is known as vectorization. One of the most common methods for vectorization is the TF-IDF (Term Frequency-Inverse Document Frequency) approach.
TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). The TF-IDF value is high when a term appears frequently in a document but rarely across the rest of the corpus.
Term Frequency (TF): Measures how frequently a term appears in a document. It is calculated by dividing the number of times a term appears in a document by the total number of terms in that document.
$$ \text{TF}(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} $$

Inverse Document Frequency (IDF): Measures how important a term is given a collection of documents. It is calculated by taking the logarithm of the number of documents in the corpus divided by the number of documents containing the term.

$$ \text{IDF}(t) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing term } t} \right) $$

TF-IDF Score: The TF-IDF score is the product of the TF and IDF scores. It reflects the importance of a term in a document within the corpus.

Formula: $$ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) $$
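As a small worked example: if a term appears 3 times in a 100-word review, $\text{TF} = 3/100 = 0.03$; if 100 out of 10,000 reviews contain the term, $\text{IDF} = \log(10000/100) \approx 4.61$ (using the natural logarithm), so $\text{TF-IDF} \approx 0.03 \times 4.61 \approx 0.14$.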
1. To apply TF-IDF, create an instance of TfidfVectorizer() with max_features=5000. This means that we will only consider the 5,000 most frequent terms. Use fit_transform() on the training data to learn the vocabulary and convert the text into TF-IDF vectors. Then, apply transform() on the test data to vectorize it using the same vocabulary.
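A minimal sketch of step 1 could look as follows; the names X_train_vect and X_test_vect are chosen to match the variables used later in this practical.

# Vectorize the reviews with TF-IDF, keeping the 5,000 most frequent terms
vectorizer = TfidfVectorizer(max_features=5000)
X_train_vect = vectorizer.fit_transform(X_train)  # learn the vocabulary on the training data
X_test_vect = vectorizer.transform(X_test)  # reuse the same vocabulary for the test data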
In this practical, we will use the LogisticRegression classifier.
2. Create the Logistic Regression model using SklearnClassifier from skactiveml since it can handle missing labels.
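A minimal sketch of step 2; passing the classes up front and the seed random_state=42 are our own choices here.

# Wrap LogisticRegression so that it tolerates MISSING_LABEL entries in y
clf = SklearnClassifier(LogisticRegression(), classes=[0, 1], random_state=42)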
3. Create an initial small set of labeled data by first setting all labels to missing and then randomly selecting a small subset that will be labeled. You can randomly select 10 samples from the training data to label.
Start by defining y_train_initial: create an array of the same shape as y_train, filled with MISSING_LABEL, using np.full. Finally, randomly select 10 indices from the training data and assign the true labels to these initially selected indices in y_train_initial.
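One possible implementation of step 3 (using np.random.default_rng with an arbitrary seed):

# Start fully unlabeled, then reveal the true labels of 10 random samples
y_train_initial = np.full(shape=y_train.shape, fill_value=MISSING_LABEL)
rng = np.random.default_rng(42)
initial_idx = rng.choice(len(y_train), size=10, replace=False)
y_train_initial[initial_idx] = y_train.values[initial_idx]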
4. Now set up the query strategy (qs), using UncertaintySampling with the entropy method and random_state=42. The 'entropy' method measures the uncertainty of the model's predictions by calculating the entropy of the predicted class probabilities; higher entropy indicates higher uncertainty.
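In code, step 4 is a one-liner:

# Entropy-based uncertainty sampling
qs = UncertaintySampling(method='entropy', random_state=42)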
Now, implement 10 iterations of the Active Learning (AL) cycle. In each iteration, we will select 10 unlabeled samples to be labeled using uncertainty sampling.
5. Implement the 10 iterations, as sketched below. In each iteration:
- Query the 10 most uncertain unlabeled samples (query_idx).
- Assign the true labels from y_train to the corresponding missing labels in y_train_initial.
- Retrain the classifier on the updated labels and evaluate its accuracy on the test set.
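A minimal sketch of this loop (the printout is optional):

for i in range(10):
    # Query the 10 samples the current model is most uncertain about
    query_idx = qs.query(X=X_train_vect, y=y_train_initial, clf=clf, batch_size=10)
    # Simulate the oracle by copying the true labels of the queried samples
    y_train_initial[query_idx] = y_train.values[query_idx]
    # Retrain on the enlarged labeled set and evaluate on the held-out test set
    clf.fit(X_train_vect, y_train_initial)
    acc = accuracy_score(y_test, clf.predict(X_test_vect))
    print(f"Iteration {i + 1}: {len(labeled_indices(y_train_initial))} labels, accuracy = {acc:.3f}")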
6. For comparison, you can train the classifier on the fully labeled training set and evaluate its accuracy when there are no missing labels.
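A sketch of this fully supervised baseline:

# Reference model trained on all training labels
clf_full = SklearnClassifier(LogisticRegression(), classes=[0, 1], random_state=42)
clf_full.fit(X_train_vect, y_train.values)
print("Accuracy with all labels:", accuracy_score(y_test, clf_full.predict(X_test_vect)))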
Suppose we have 5 annotators to label the samples, each with a different labeling accuracy.
In this part, we simulate multi-annotator active learning. We start by initializing multiple annotators with varying noise levels and generating noisy labels.
We will use the following code to initialize multiple annotators with different noise levels and generate noisy labels. Here are the steps we will take:
- Define the number of annotators (n_annotators) and set it to 5.
- Create an array y_annot with dimensions (number of training samples, number of annotators) and fill it with zeros to store annotator labels.

# Number of annotators
n_annotators = 5
# Generate noisy labels for each annotator
y_annot = np.zeros(shape=(X_train_vect.shape[0], n_annotators), dtype=int)
rng = np.random.default_rng(seed=0)
# Noise levels
noise_levels = np.linspace(0.0, 0.3, num=n_annotators)
# Generate noise for all annotators simultaneously
y_noise_matrix = rng.binomial(1, noise_levels[:, np.newaxis], size=(n_annotators, X_train_vect.shape[0])).T
# Apply noise to the true labels
y_annot = y_noise_matrix ^ y_train.values[:, np.newaxis]
# Initialize training labels with missing values
y = np.full(shape=(X_train_vect.shape[0], n_annotators), fill_value=MISSING_LABEL)
We want to label these samples using a ParzenWindowClassifier. We query the samples using probabilistic active learning (ProbabilisticAL) and select the annotators at random using the SingleAnnotatorWrapper.
7. Create a clf object with ParzenWindowClassifier and set metric="rbf". Then pass the ParzenWindowClassifier as an argument to the single-annotator query strategy ProbabilisticAL. Then pass the single-annotator query strategy as an argument to the wrapper, also specifying the number of annotators.
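A minimal sketch of step 7, following the pattern of the scikit-activeml multi-annotator tutorial; note that in recent library versions the classifier is passed to ProbabilisticAL at query time rather than at construction, and the number of annotators is inferred from the shape of y.

# Classifier that supports partially labeled data
clf = ParzenWindowClassifier(classes=np.unique(y_train.values), metric="rbf", random_state=0)
# Single-annotator strategy, wrapped so annotators can be selected at random
sa_qs = ProbabilisticAL(random_state=0)
ma_qs = SingleAnnotatorWrapper(sa_qs, random_state=0)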
8. Perform one iteration of the active learning cycle. In this iteration, query 10 unlabeled samples to be labeled by 3 annotators. Assign their labels to the initially missing labels in y. After updating the labels, retrain the classifier on the updated training data, and evaluate its performance on the test set.
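A sketch of one such iteration, again following the multi-annotator tutorial; the query_params_dict keyword and the (sample, annotator) index pairs returned by query are version-dependent details, so treat this as a template rather than exact API.

# Aggregate the current (mostly missing) annotations and fit the classifier
y_agg = majority_vote(y, random_state=0)
clf.fit(X_train_vect, y_agg)
# Query 10 samples, each to be labeled by 3 annotators
query_idx = ma_qs.query(X_train_vect, y, batch_size=10,
                        query_params_dict={"clf": clf},
                        n_annotators_per_sample=3)
# Column 0 holds sample indices, column 1 the selected annotator indices
y[query_idx[:, 0], query_idx[:, 1]] = y_annot[query_idx[:, 0], query_idx[:, 1]]
# Retrain on the aggregated labels and evaluate on the test set
clf.fit(X_train_vect, majority_vote(y, random_state=0))
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test_vect)))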
Practice: Implement more iterations of the active learning cycle. Use the above code as a reference to perform 10 iterations, similar to how it was done in the pool-based active learning section.
In this part, we will show how stream-based active learning strategies are used and compare them to one another. For this purpose, we will follow the next four steps (9-12 below).
We will divide each step into substeps for better clarity and ease of implementation. So let's start!
9. Now it's time to set up the query strategies, i.e., StreamRandomSampling and StreamProbabilisticAL, for our stream-based active learning. For this purpose, follow the steps below (a sketch follows):
- Create the data stream from X_train_vect and the corresponding labels from y_train.values.
- Initialize both query strategies with random_state=0, set the training_size to 1000, and fit_clf to False.
- Store the accuracy results for each query strategy by using accuracies = {}.
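A sketch of step 9; converting the sparse TF-IDF matrix to a dense array makes row-by-row streaming simpler but costs memory (reduce max_features if needed), and the dictionary of strategies is our own convenience choice. The fit_clf=False flag is passed inside the loop in step 11.

# Create the data stream (dense rows are easier to index one by one)
X_stream = X_train_vect.toarray()
y_stream = y_train.values
training_size = 1000  # maximum size of the sliding training window
query_strategies = {
    "StreamRandomSampling": StreamRandomSampling(random_state=0),
    "StreamProbabilisticAL": StreamProbabilisticAL(random_state=0),
}
accuracies = {}  # will map strategy name -> accuracy results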
10. For each query strategy (see the sketch after this list):
- Initialize a ParzenWindowClassifier with the unique classes from y_train.values.
- Set up X_train_stream and y_train_stream deques with a maximum length of training_size and initialize them with the first 10 samples from X_stream and y_stream.
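Per strategy, this initialization could look as follows (the block would sit at the top of a loop over query_strategies.items()):

# Fresh classifier and sliding training window for the current strategy
clf = ParzenWindowClassifier(classes=np.unique(y_train.values), random_state=0)
X_train_stream = deque(maxlen=training_size)
X_train_stream.extend(X_stream[:10])
y_train_stream = deque(maxlen=training_size)
y_train_stream.extend(y_stream[:10])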
11. Create the stream-based active learning loop by following these steps (a sketch of the full loop follows this list):
- Initialize correct_classifications = [] and count = 0.
- Iterate over X_stream starting from index 10 (since the first 10 samples were used to initialize the classifier). Reshape the current sample (X_stream[t]) to be a 2D array with one sample.
- Use clf.predict to predict the label for the current sample (X_cand), and compare it to the true label (y_cand). Then use correct_classifications.append to record the result (True if correct, False if incorrect).
- Query the strategy for the sampled indices (sampled_indices) and their associated utilities. The call_func function facilitates this process. Start by defining the parameters you want to pass to the query method: the candidates (X_cand), the classifier (clf), and the flags return_utilities and fit_clf.
- Create budget_manager_param_dict to hold the utilities information, and use call_func to dynamically call the update method on query_strategy, passing the parameters you defined earlier.
- If the sample was queried, add it to the training window and increment the counter with count += len(sampled_indices).
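Putting the substeps together, the loop body might look like this sketch (adapted from the scikit-activeml stream tutorial; query_strategy is the strategy of the current outer-loop iteration):

correct_classifications = []
count = 0
for t in range(10, len(X_stream)):
    # Current sample as a 2D array with one row
    X_cand = X_stream[t].reshape([1, -1])
    y_cand = y_stream[t]
    # Fit on the current window and record whether the prediction is correct
    clf.fit(np.array(X_train_stream), np.array(y_train_stream))
    correct_classifications.append(clf.predict(X_cand)[0] == y_cand)
    # call_func forwards only the keyword arguments each strategy's query accepts
    sampled_indices, utilities = call_func(
        query_strategy.query, candidates=X_cand, clf=clf,
        return_utilities=True, fit_clf=False)
    # Inform the budget manager about the utility of the processed sample
    budget_manager_param_dict = {"utilities": utilities}
    call_func(query_strategy.update, candidates=X_cand,
              queried_indices=sampled_indices,
              budget_manager_param_dict=budget_manager_param_dict)
    # If the sample was queried, add it (with its true label) to the window
    if len(sampled_indices) > 0:
        X_train_stream.append(X_stream[t])
        y_train_stream.append(y_cand)
        count += len(sampled_indices)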
We need to measure how well the classifier is performing overall.
12. Use np.mean(correct_classifications) to calculate the average accuracy. This average accuracy, along with the correct_classifications list, should be stored in the accuracies dictionary for each query strategy. This will allow you to keep track of how each strategy performed.
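For example (query_strategy_name is assumed to be the loop variable holding the current strategy's name):

# Store the per-sample results and the overall average for this strategy
avg_accuracy = np.mean(correct_classifications)
accuracies[query_strategy_name] = {"avg": avg_accuracy, "curve": correct_classifications}
print(f"{query_strategy_name}: average accuracy = {avg_accuracy:.3f}, queried {count} labels")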
Now let's run the code from beginning to end.
13. Let's plot the accuracy over time for each query strategy, using a Gaussian filter.
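A sketch using gaussian_filter1d (imported earlier); the smoothing width sigma=50 is an arbitrary choice, and the structure of accuracies follows the step 12 sketch above.

plt.figure(figsize=(10, 5))
for name, results in accuracies.items():
    # Smooth the 0/1 correctness sequence so the accuracy trend is visible
    smoothed = gaussian_filter1d(np.array(results["curve"], dtype=float), sigma=50)
    plt.plot(smoothed, label=name)
plt.xlabel("Stream position")
plt.ylabel("Smoothed accuracy")
plt.legend()
plt.show()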
14. Repeat the stream-based active learning process by adding the other strategies (e.g., FixedUncertainty, VariableUncertainty, Split, StreamDensityBasedAL, CognitiveDualQueryStrategyRan, CognitiveDualQueryStrategyFixUn, CognitiveDualQueryStrategyRanVarUn, CognitiveDualQueryStrategyVarUn, PeriodicSampling), then compare your results. Make sure to import them from skactiveml.stream beforehand!
End of Practical!