!pip install scikit-activeml
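# Note: these pins conflict with scikit-activeml 0.6.2, which requires numpy>=1.26 and scipy>=1.11.3 (see the resolver error in the output below)
!pip install numpy==1.24.4 scipy==1.10.1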
!pip install matplotlib
!pip install pandas
!pip install ipympl

Requirement already satisfied: scikit-activeml in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (0.6.2)
Requirement already satisfied: joblib>=1.4.0 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from scikit-activeml) (1.5.1)
Collecting numpy>=1.26 (from scikit-activeml)
  Using cached numpy-2.3.1-cp311-cp311-win_amd64.whl.metadata (60 kB)
Collecting scipy>=1.11.3 (from scikit-activeml)
  Using cached scipy-1.16.0-cp311-cp311-win_amd64.whl.metadata (60 kB)
Requirement already satisfied: scikit-learn>=1.6.0 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from scikit-activeml) (1.7.1)
Requirement already satisfied: matplotlib>=3.7.3 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from scikit-activeml) (3.10.3)
Requirement already satisfied: iteration-utilities>=0.12.0 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from scikit-activeml) (0.13.0)
Requirement already satisfied: makefun>=1.15.3 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from scikit-activeml) (1.16.0)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from matplotlib>=3.7.3->scikit-activeml) (1.3.2)
Requirement already satisfied: cycler>=0.10 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from matplotlib>=3.7.3->scikit-activeml) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from matplotlib>=3.7.3->scikit-activeml) (4.59.0)
Requirement already satisfied: kiwisolver>=1.3.1 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from matplotlib>=3.7.3->scikit-activeml) (1.4.8)
Requirement already satisfied: packaging>=20.0 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from matplotlib>=3.7.3->scikit-activeml) (25.0)
Requirement already satisfied: pillow>=8 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from matplotlib>=3.7.3->scikit-activeml) (11.3.0)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from matplotlib>=3.7.3->scikit-activeml) (3.2.3)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from matplotlib>=3.7.3->scikit-activeml) (2.9.0.post0)
Requirement already satisfied: threadpoolctl>=3.1.0 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from scikit-learn>=1.6.0->scikit-activeml) (3.6.0)
Requirement already satisfied: six>=1.5 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from python-dateutil>=2.7->matplotlib>=3.7.3->scikit-activeml) (1.17.0)
Using cached numpy-2.3.1-cp311-cp311-win_amd64.whl (13.0 MB)
Using cached scipy-1.16.0-cp311-cp311-win_amd64.whl (38.6 MB)
Installing collected packages: numpy, scipy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.24.4
    Uninstalling numpy-1.24.4:
      Successfully uninstalled numpy-1.24.4
  Attempting uninstall: scipy
    Found existing installation: scipy 1.10.1
    Uninstalling scipy-1.10.1:
      Successfully uninstalled scipy-1.10.1
Successfully installed numpy-2.3.1 scipy-1.16.0

[notice] A new release of pip is available: 24.0 -> 25.1.1
[notice] To update, run: C:\Users\groen\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip

Collecting numpy==1.24.4
  Using cached numpy-1.24.4-cp311-cp311-win_amd64.whl.metadata (5.6 kB)
Collecting scipy==1.10.1
  Using cached scipy-1.10.1-cp311-cp311-win_amd64.whl.metadata (58 kB)
Using cached numpy-1.24.4-cp311-cp311-win_amd64.whl (14.8 MB)
Using cached scipy-1.10.1-cp311-cp311-win_amd64.whl (42.2 MB)
Installing collected packages: numpy, scipy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.3.1
    Uninstalling numpy-2.3.1:
      Successfully uninstalled numpy-2.3.1
  Attempting uninstall: scipy
    Found existing installation: scipy 1.16.0
    Uninstalling scipy-1.16.0:
      Successfully uninstalled scipy-1.16.0
Successfully installed numpy-1.24.4 scipy-1.10.1

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
scikit-activeml 0.6.2 requires numpy>=1.26, but you have numpy 1.24.4 which is incompatible.
scikit-activeml 0.6.2 requires scipy>=1.11.3, but you have scipy 1.10.1 which is incompatible.

Requirement already satisfied: matplotlib in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (3.10.3)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from matplotlib) (1.3.2)
Requirement already satisfied: cycler>=0.10 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from matplotlib) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from matplotlib) (4.59.0)
Requirement already satisfied: kiwisolver>=1.3.1 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from matplotlib) (1.4.8)
Requirement already satisfied: numpy>=1.23 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from matplotlib) (1.24.4)
Requirement already satisfied: packaging>=20.0 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from matplotlib) (25.0)
Requirement already satisfied: pillow>=8 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from matplotlib) (11.3.0)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from matplotlib) (3.2.3)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from matplotlib) (2.9.0.post0)
Requirement already satisfied: six>=1.5 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from python-dateutil>=2.7->matplotlib) (1.17.0)

Requirement already satisfied: pandas in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (2.3.1)
Requirement already satisfied: numpy>=1.23.2 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from pandas) (1.24.4)
Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from pandas) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from pandas) (2025.2)
Requirement already satisfied: six>=1.5 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)

Requirement already satisfied: ipympl in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (0.9.7)
Requirement already satisfied: ipython<10 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from ipympl) (9.4.0)
Requirement already satisfied: ipywidgets<9,>=7.6.0 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from ipympl) (8.1.7)
Requirement already satisfied: matplotlib<4,>=3.5.0 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from ipympl) (3.10.3)
Requirement already satisfied: numpy in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from ipympl) (1.24.4)
Requirement already satisfied: pillow in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from ipympl) (11.3.0)
Requirement already satisfied: traitlets<6 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from ipympl) (5.14.3)
Requirement already satisfied: colorama in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from ipython<10->ipympl) (0.4.6)
Requirement already satisfied: decorator in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from ipython<10->ipympl) (5.2.1)
Requirement already satisfied: ipython-pygments-lexers in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from ipython<10->ipympl) (1.1.1)
Requirement already satisfied: jedi>=0.16 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from ipython<10->ipympl) (0.19.2)
Requirement already satisfied: matplotlib-inline in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from ipython<10->ipympl) (0.1.7)
Requirement already satisfied: prompt_toolkit<3.1.0,>=3.0.41 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from ipython<10->ipympl) (3.0.51)
Requirement already satisfied: pygments>=2.4.0 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from ipython<10->ipympl) (2.19.2)
Requirement already satisfied: stack_data in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from ipython<10->ipympl) (0.6.3)
Requirement already satisfied: typing_extensions>=4.6 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from ipython<10->ipympl) (4.14.1)
Requirement already satisfied: comm>=0.1.3 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from ipywidgets<9,>=7.6.0->ipympl) (0.2.2)
Requirement already satisfied: widgetsnbextension~=4.0.14 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from ipywidgets<9,>=7.6.0->ipympl) (4.0.14)
Requirement already satisfied: jupyterlab_widgets~=3.0.15 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from ipywidgets<9,>=7.6.0->ipympl) (3.0.15)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from matplotlib<4,>=3.5.0->ipympl) (1.3.2)
Requirement already satisfied: cycler>=0.10 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from matplotlib<4,>=3.5.0->ipympl) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from matplotlib<4,>=3.5.0->ipympl) (4.59.0)
Requirement already satisfied: kiwisolver>=1.3.1 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from matplotlib<4,>=3.5.0->ipympl) (1.4.8)
Requirement already satisfied: packaging>=20.0 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from matplotlib<4,>=3.5.0->ipympl) (25.0)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from matplotlib<4,>=3.5.0->ipympl) (3.2.3)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from matplotlib<4,>=3.5.0->ipympl) (2.9.0.post0)
Requirement already satisfied: parso<0.9.0,>=0.8.4 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from jedi>=0.16->ipython<10->ipympl) (0.8.4)
Requirement already satisfied: wcwidth in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from prompt_toolkit<3.1.0,>=3.0.41->ipython<10->ipympl) (0.2.13)
Requirement already satisfied: six>=1.5 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from python-dateutil>=2.7->matplotlib<4,>=3.5.0->ipympl) (1.17.0)
Requirement already satisfied: executing>=1.2.0 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from stack_data->ipython<10->ipympl) (2.2.0)
Requirement already satisfied: asttokens>=2.1.0 in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from stack_data->ipython<10->ipympl) (3.0.0)
Requirement already satisfied: pure-eval in c:\users\groen\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from stack_data->ipython<10->ipympl) (0.2.3)


import numpy as np
import matplotlib as mlp
import matplotlib.pyplot as plt
import pandas as pd
import re
import string
import skactiveml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from skactiveml.classifier import SklearnClassifier, ParzenWindowClassifier
from skactiveml.pool import UncertaintySampling, ProbabilisticAL, RandomSampling
from skactiveml.pool.multiannotator import SingleAnnotatorWrapper
from skactiveml.stream import StreamRandomSampling, StreamProbabilisticAL
from skactiveml.utils import unlabeled_indices, labeled_indices, MISSING_LABEL, majority_vote, call_func
from skactiveml.visualization import plot_utilities, plot_decision_boundary
from collections import deque
from scipy.ndimage import gaussian_filter1d
from sklearn.manifold import TSNE

import warnings
mlp.rcParams["figure.facecolor"] = "white"
warnings.filterwarnings("ignore")

# Load the IMDB dataset; the Python parsing engine with on_bad_lines='skip' drops malformed rows
df = pd.read_csv("IMDB Dataset.csv", engine="python", on_bad_lines='skip')


# Preprocess the text data
def preprocess_text(text):
    text = text.lower()  # Lowercase text
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text)  # Remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra whitespace
    return text

df['review'] = df['review'].apply(preprocess_text)

df.head()
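
The raw IMDB reviews contain HTML line breaks, which is why fragments such as `br br` survive in the cleaned text shown by `df.head()`. A minimal sketch of one way to drop them before stripping punctuation follows. The names `preprocess_text_html` and `BR_TAG` are hypothetical and introduced only for illustration; they are not part of the original notebook.

```python
import re
import string

# Hypothetical pattern matching <br>, <br/> and <br /> in any letter case
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)
PUNCT = re.compile(f"[{re.escape(string.punctuation)}]")

def preprocess_text_html(text):
    """Variant of preprocess_text that removes HTML line breaks first."""
    text = BR_TAG.sub(" ", text)              # Replace break tags with a space
    text = text.lower()                       # Lowercase text
    text = PUNCT.sub("", text)                # Remove punctuation
    text = re.sub(r"\s+", " ", text).strip()  # Collapse extra whitespace
    return text

sample = "A wonderful little production.<br /><br />The filming is great!"
cleaned = preprocess_text_html(sample)
print(cleaned)
```

Whether this matters in practice depends on the vectorizer: with `max_features=5000`, a very frequent token such as `br` would likely occupy one of the feature slots.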


# Reduce the dataset size for faster experimentation (10,000 samples)
df = df.sample(10000, random_state=42)


# Convert labels to binary
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})


# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.2, random_state=42)


vectorizer = TfidfVectorizer(max_features=5000)
X_train_vect = vectorizer.fit_transform(X_train)
X_test_vect = vectorizer.transform(X_test)
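
To make the fit/transform split concrete, here is a small sketch on a toy corpus (the sentences and the `_toy` variable names are made up for illustration). The vocabulary is learned from the training texts only, and `max_features` caps the number of columns, so the test matrix has the same width as the training matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for this sketch
train_texts = [
    "great movie with great acting",
    "terrible movie with bad plot",
    "great plot and fine acting",
]
test_texts = ["bad acting and unknown words"]

vectorizer_toy = TfidfVectorizer(max_features=5)
X_train_toy = vectorizer_toy.fit_transform(train_texts)  # learns vocabulary and IDF weights
X_test_toy = vectorizer_toy.transform(test_texts)        # reuses the learned vocabulary

print(sorted(vectorizer_toy.vocabulary_))
print(X_train_toy.shape, X_test_toy.shape)
```

Words that appear only in the test texts are ignored, because they have no column in the fitted vocabulary.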


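# Baseline: fully supervised logistic regression trained on all training labels
X_log, y_log = X_train_vect.toarray(), y_train
clf = LogisticRegression(random_state=0).fit(X_log,y_log)
print(clf.predict(X_log[:5]))
print(X_train[:5], y_train[:5])
# Note: this score is computed on the training data, not on the held-out test set
print("Accuracy is: ", clf.score(X_log,y_log))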

[0 0 1 1 0]
2141     i saw it tonight and fell asleep in the movieb...
46172    this is one of them movies that has a awesome ...
18558    and i do mean it if not literally after all i ...
32956    this film has a lot of raw potential the scrip...
13094    hello i normally love movies im 19 i have seen...
Name: review, dtype: object 2141     0
46172    0
18558    1
32956    1
13094    0
Name: sentiment, dtype: int64
Accuracy is:  0.92075
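
The accuracy above is measured on the same rows the model was fitted on. The sketch below uses synthetic data from `make_classification` (an assumption made so the example is self-contained, not the IMDB data) to show how training and held-out accuracy can be reported side by side.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vectorized reviews
X_syn, y_syn = make_classification(
    n_samples=400, n_features=50, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

baseline = LogisticRegression(max_iter=1000, random_state=0).fit(X_tr, y_tr)
train_acc = baseline.score(X_tr, y_tr)  # optimistic: data seen during fitting
test_acc = baseline.score(X_te, y_te)   # estimate of generalization

print(f"Train accuracy: {train_acc:.4f}, Test accuracy: {test_acc:.4f}")
```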


clf = SklearnClassifier(LogisticRegression(max_iter=1000))


# Initialize the training labels as missing (unlabeled)
y_train_initial = np.full(y_train.shape, fill_value=MISSING_LABEL)

# Randomly select 10 indices to label
initial_idx = np.random.choice(np.arange(len(y_train)), size=10, replace=False)

# Fill in the true labels for the selected indices
y_train_initial[initial_idx] = y_train.iloc[initial_idx]
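
The label-masking step can be illustrated with plain NumPy. This sketch assumes that the missing-label marker is `np.nan`, which is the value `MISSING_LABEL` takes in scikit-activeml as far as we know; the toy labels and variable names are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])

# Start with every label hidden (assumption: np.nan marks a missing label)
y_partial = np.full(y_true.shape, np.nan)

# Reveal the labels of three randomly chosen samples
seed_idx = rng.choice(len(y_true), size=3, replace=False)
y_partial[seed_idx] = y_true[seed_idx]

n_labeled = int(np.sum(~np.isnan(y_partial)))
print(y_partial, n_labeled)
```

Passing a seeded generator, as done here, keeps the initial labeled set reproducible across runs, whereas `np.random.choice` without a seed does not.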


qs = RandomSampling()

# Active learning cycle:
n_queries = 10
for i in range(n_queries):
    # Fit the classifier with current labels.
    clf.fit(X_train_vect.toarray(), y_train_initial)

    # Query the next sample(s).
    query_idx = qs.query(X=X_train_vect.toarray(), y=y_train_initial, batch_size=10)

    # Update labels based on query.
    y_train_initial[query_idx] = y_train.iloc[query_idx]

    # Evaluate the classifier on the test set
    y_pred = clf.predict(X_test_vect.toarray())
    acc = accuracy_score(y_test, y_pred)
    print(f'Simple Evaluation Iteration {i + 1}/{n_queries}, Accuracy: {acc:.4f}')

Simple Evaluation Iteration 1/10, Accuracy: 0.4995
Simple Evaluation Iteration 2/10, Accuracy: 0.5000
Simple Evaluation Iteration 3/10, Accuracy: 0.5090
Simple Evaluation Iteration 4/10, Accuracy: 0.5895
Simple Evaluation Iteration 5/10, Accuracy: 0.5085
Simple Evaluation Iteration 6/10, Accuracy: 0.6685
Simple Evaluation Iteration 7/10, Accuracy: 0.6750
Simple Evaluation Iteration 8/10, Accuracy: 0.6710
Simple Evaluation Iteration 9/10, Accuracy: 0.6755
Simple Evaluation Iteration 10/10, Accuracy: 0.6895


qs = UncertaintySampling(method='entropy', random_state=42)
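
As a rough illustration of what an entropy criterion measures, the sketch below computes the Shannon entropy of a few made-up predicted probability rows and ranks the samples by it. This is a hand-written approximation for intuition, not the library's implementation; `entropy_utilities` is a hypothetical helper.

```python
import numpy as np

def entropy_utilities(probas):
    """Shannon entropy of each row of class probabilities."""
    p = np.clip(probas, 1e-12, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(p), axis=1)

# Invented class probabilities for four unlabeled samples
probas = np.array([
    [0.95, 0.05],
    [0.55, 0.45],
    [0.70, 0.30],
    [0.50, 0.50],
])

utilities = entropy_utilities(probas)
top2 = np.argsort(utilities)[::-1][:2]  # indices of the two most uncertain samples
print(utilities.round(3), top2)
```

Samples whose predicted probabilities are closest to 50/50 receive the highest entropy and would be queried first.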


# Initialize the training labels as missing (unlabeled)
y_train_initial = np.full(y_train.shape, fill_value=MISSING_LABEL)

# Randomly select 10 indices to label
initial_idx = np.random.choice(np.arange(len(y_train)), size=10, replace=False)

# Fill in the true labels for the selected indices
y_train_initial[initial_idx] = y_train.iloc[initial_idx]


clf = SklearnClassifier(LogisticRegression(max_iter=1000))

n_queries = 10
for i in range(n_queries):
    # Uses current state of y_train_initial, which contains both labeled and missing entries.
    clf.fit(X_train_vect.toarray(), y_train_initial)

    # Queries the 10 most uncertain samples
    query_idx = qs.query(X=X_train_vect.toarray(), y=y_train_initial, clf=clf, batch_size=10)

    # Copies the true label from y_train into y_train_initial
    # Now the model can use those new labels in the next iteration
    y_train_initial[query_idx] = y_train.iloc[query_idx]

    # Evaluate the classifier on the test set
    y_pred = clf.predict(X_test_vect.toarray())
    acc = accuracy_score(y_test, y_pred)
    print(f'Simple Evaluation Iteration {i + 1}/{n_queries}, Accuracy: {acc:.4f}')

Simple Evaluation Iteration 1/10, Accuracy: 0.6365
Simple Evaluation Iteration 2/10, Accuracy: 0.5010
Simple Evaluation Iteration 3/10, Accuracy: 0.4995
Simple Evaluation Iteration 4/10, Accuracy: 0.5885
Simple Evaluation Iteration 5/10, Accuracy: 0.5965
Simple Evaluation Iteration 6/10, Accuracy: 0.6205
Simple Evaluation Iteration 7/10, Accuracy: 0.6375
Simple Evaluation Iteration 8/10, Accuracy: 0.6435
Simple Evaluation Iteration 9/10, Accuracy: 0.6295
Simple Evaluation Iteration 10/10, Accuracy: 0.6530


clf = SklearnClassifier(LogisticRegression(max_iter=1000))
# Final evaluation
clf.fit(X_train_vect.toarray(), y_train)
y_pred = clf.predict(X_test_vect.toarray())
final_acc = accuracy_score(y_test, y_pred)
print(f'Simple Evaluation Final accuracy: {final_acc:.4f}')

Simple Evaluation Final accuracy: 0.8700


tsne = TSNE(n_components=2, random_state=42)
X_train_tsne = tsne.fit_transform(X_train_vect.toarray())
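
Running t-SNE directly on thousands of dense, 5000-dimensional TF-IDF rows can be slow. A common workaround, sketched here as an assumption rather than the notebook's method, is to compress the sparse matrix with `TruncatedSVD` first and embed the reduced vectors. The random sparse matrix stands in for `X_train_vect`.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Random sparse matrix as a stand-in for the TF-IDF features
X_sparse = sparse.random(60, 200, density=0.1, format="csr", random_state=0)

# Step 1: linear reduction that works on sparse input without densifying it
svd = TruncatedSVD(n_components=20, random_state=0)
X_reduced = svd.fit_transform(X_sparse)

# Step 2: non-linear 2D embedding of the reduced vectors
tsne_small = TSNE(n_components=2, perplexity=10, random_state=0)
X_embedded = tsne_small.fit_transform(X_reduced)

print(X_reduced.shape, X_embedded.shape)
```

The perplexity must stay below the number of samples, which is why it is lowered for this tiny example.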


plt.figure(figsize=(10, 7))
for sentiment in [0, 1]:
    indices = (y_train == sentiment)
    plt.scatter(X_train_tsne[indices, 0], X_train_tsne[indices, 1], label=f'Sentiment {sentiment}', alpha=0.6)

plt.legend()
plt.title('t-SNE Visualization of IMDB Reviews')
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.show()


# Number of annotators
n_annotators = 5

# Generate noisy labels for each annotator
y_annot = np.zeros(shape=(X_train_vect.shape[0], n_annotators), dtype=int)
rng = np.random.default_rng(seed=0)

# Creates 5 different noise levels, evenly spaced between 0.0 and 0.3 with np.linspace
# These represent the probability that each annotator flips a label incorrectly.
# Annotator 0 is perfect (0% noise), Annotator 4 is quite noisy (30%).
noise_levels = np.linspace(0.0, 0.3, num=n_annotators)

# Generate noise for all annotators simultaneously.
# The result is a matrix of shape `(num_samples, n_annotators)` whose entries are 0 or 1:
# 1 = flip the label (add noise), 0 = keep the true label.

y_noise_matrix = rng.binomial(1, noise_levels[:, np.newaxis], size=(n_annotators, X_train_vect.shape[0])).T

# Apply noise to the true labels
# This line flips labels using XOR (^):
y_annot = y_noise_matrix ^ y_train.values[:, np.newaxis]
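
A quick sanity check of the simulated annotators is to compare each annotator's empirical error rate with its intended noise level. The sketch below repeats the same construction on synthetic binary labels (the sample count and variable names are chosen for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_annot = 20000, 5

# Synthetic ground-truth labels
y_true = rng.integers(0, 2, size=n_samples)

# Flip probabilities from 0.0 (perfect) to 0.3 (noisy)
noise = np.linspace(0.0, 0.3, num=n_annot)

# One Bernoulli draw per (annotator, sample), transposed to (sample, annotator)
flips = rng.binomial(1, noise[:, np.newaxis], size=(n_annot, n_samples)).T

# XOR flips a binary label wherever the noise matrix is 1
y_noisy = flips ^ y_true[:, np.newaxis]

error_rates = (y_noisy != y_true[:, np.newaxis]).mean(axis=0)
print(error_rates.round(3))
```

With enough samples, the observed error rates should land close to `noise`, and the first annotator should make no mistakes at all.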

# Initialize training labels with missing values
y = np.full(shape=(X_train_vect.shape[0], n_annotators), fill_value=MISSING_LABEL)


# Create the classifier
clf = ParzenWindowClassifier(classes=np.unique(y_train.values), metric="rbf", metric_dict={"gamma": 0.1}, random_state=0)

# Set up the query strategy

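# ProbabilisticAL (multi-class probabilistic active learning) scores each candidate by its expected gain in classification performance,
# combining the label statistics in its neighborhood with the sample density, rather than by predictive uncertainty alone.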
sa_qs = ProbabilisticAL(random_state=0, prior=0.001)

# SingleAnnotatorWrapper makes sa_qs compatible with a multi-annotator active learning loop.
ma_qs = SingleAnnotatorWrapper(sa_qs, random_state=0)


# Function to be able to index via an array of indices
idx = lambda A: (A[:, 0], A[:, 1])

# Initial fit of the classifier
clf.fit(X_train_vect.toarray(), majority_vote(y))

# Perform one active learning cycle
print("Cycle 1/1")

# Query a batch of 100 sample-annotator pairs, requesting up to 3 annotators per selected sample.
# The result is a set of index pairs like [ [row, annotator], ... ].
query_idx = ma_qs.query(X_train_vect.toarray(), y, batch_size=100, n_annotators_per_sample=3, clf = clf)

# Update labels
y[idx(query_idx)] = y_annot[idx(query_idx)]

# Retrain the classifier on the updated label matrix, again using majority voting across annotators to get a single label per example.
clf.fit(X_train_vect.toarray(), majority_vote(y, random_state=0))
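
To show what aggregating a partially filled label matrix involves, here is a simplified NumPy sketch of majority voting for binary labels with missing entries. It is not the library's `majority_vote`: the helper `majority_vote_binary` is hypothetical, and it breaks ties toward class 0 for determinism, which is an assumption made only for this example.

```python
import numpy as np

def majority_vote_binary(y_matrix):
    """Aggregate a (n_samples, n_annotators) matrix of 0/1/nan labels."""
    y_matrix = np.asarray(y_matrix, dtype=float)
    n_pos = np.sum(y_matrix == 1, axis=1)  # nan compares as False
    n_neg = np.sum(y_matrix == 0, axis=1)
    voted = np.where(n_pos > n_neg, 1.0, 0.0)  # ties fall to class 0 (assumption)
    voted[(n_pos + n_neg) == 0] = np.nan       # no votes: label stays missing
    return voted

y_votes = np.array([
    [1, 1, 0],                 # two votes for 1, one for 0
    [np.nan, np.nan, np.nan],  # never annotated
    [0, np.nan, 1],            # tie
    [np.nan, 0, np.nan],       # single vote
])

voted = majority_vote_binary(y_votes)
print(voted)
```

Rows without any annotation remain missing, so the classifier still treats them as unlabeled.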

# Evaluate the classifier on the test set
y_pred = clf.predict(X_test_vect.toarray())
acc = accuracy_score(y_test, y_pred)
print(f'Multi-annotator Iteration 1/1, Accuracy: {acc:.4f}')

Cycle 1/1
Multi-annotator Iteration 1/1, Accuracy: 0.5635


# Final evaluation for multi-annotator
clf.fit(X_train_vect.toarray(), y_train)
y_pred = clf.predict(X_test_vect.toarray())
final_acc = accuracy_score(y_test, y_pred)
print(f'Multi-annotator Final accuracy: {final_acc:.4f}')

Multi-annotator Final accuracy: 0.5005


# Sample a smaller subset for visualization
sample_indices = np.random.choice(X_train_vect.shape[0], 500, replace=False)
X_train_subset = X_train_vect[sample_indices].toarray()
y_train_subset = y_train.values[sample_indices]
y_annot_subset = y_annot[sample_indices]

# Use t-SNE to reduce the dimensionality of the TF-IDF vectors to 2D for visualization
tsne = TSNE(n_components=2, random_state=42)
X_train_tsne = tsne.fit_transform(X_train_subset)


# Visualize the noisy labels from each annotator using a scatter plot
fig, axes = plt.subplots(1, n_annotators, figsize=(25, 5))
for a in range(n_annotators):
    is_true = y_annot_subset[:, a] == y_train_subset
    # Correct labels: circles
    axes[a].scatter(X_train_tsne[is_true, 0], X_train_tsne[is_true, 1], c=y_annot_subset[is_true, a], s=30, marker='o', alpha=0.4, cmap='coolwarm')
    # Incorrect labels: crosses
    axes[a].scatter(X_train_tsne[~is_true, 0], X_train_tsne[~is_true, 1], c=y_annot_subset[~is_true, a], s=50, marker='x', alpha=1 , cmap='coolwarm', edgecolors='k', linewidths=1.5)
    axes[a].set_title(f'Annotator {a}', fontsize=15)
    axes[a].set_xlabel('t-SNE Dimension 1')
    axes[a].set_ylabel('t-SNE Dimension 2')

plt.show()


stream_length = 1000
X_stream = X_train_vect.toarray()[:stream_length]
y_stream = y_train.values[:stream_length]


query_strategies = {
    'StreamRandomSampling': StreamRandomSampling(random_state=0),
    'StreamProbabilisticAL': StreamProbabilisticAL(random_state=0)
}

training_size = 1000


fit_clf = False # The classifier is refitted explicitly in the loop, so the query strategy does not need to fit it again.
accuracies = {}


for query_strategy_name, query_strategy in query_strategies.items():
    clf = ParzenWindowClassifier(classes=np.unique(y_train.values), random_state=0)

    # Initialize the training data
    X_train_stream = deque(maxlen=training_size)
    y_train_stream = deque(maxlen=training_size)

    # Initialize with the first 10 samples
    X_train_stream.extend(X_stream[:10])
    y_train_stream.extend(y_stream[:10])

    # Fit the classifier with this initial data.
    clf.fit(X_train_stream, y_train_stream)
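
The sliding-window behavior of `deque(maxlen=...)` is easy to see on a tiny example: once the window is full, appending a new element silently drops the oldest one. The numbers below are placeholders.

```python
from collections import deque

# A window that keeps at most the three most recent items
window = deque(maxlen=3)
window.extend([1, 2, 3])   # window is now full
window.append(4)           # oldest item (1) is evicted

print(list(window))
```

In the stream setting this means the classifier is always refitted on at most `training_size` recent samples.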

correct_classifications = []
count = 0
for t in range(10, len(X_stream)): #`t` is the index of the current sample in the stream
    # Reshape the current sample for compatibility with the classifier's predict method, which expects a 2D array
    X_cand = X_stream[t].reshape(1, -1)
    y_cand = y_stream[t]

    # Refit the classifier and predict the current sample's label
    clf.fit(X_train_stream, y_train_stream)
    correct_classifications.append(clf.predict(X_cand)[0] == y_cand)

    # Ask the query strategy whether to request a label for the current sample
    sampled_indices, utilities = call_func(query_strategy.query, candidates=X_cand, clf=clf, return_utilities=True, fit_clf=fit_clf)

    # Pass the utilities on so the budget manager can track its spending
    budget_manager_param_dict = {"utilities": utilities}

    # Update the query strategy with the candidate and the query decision
    call_func(query_strategy.update, candidates=X_cand, queried_indices=sampled_indices, budget_manager_param_dict=budget_manager_param_dict)

    # Track the number of queried samples
    count += len(sampled_indices)

    # Update the training data: keep the true label only if it was queried
    X_train_stream.append(X_stream[t])
    y_train_stream.append(y_cand if len(sampled_indices) > 0 else clf.missing_label)


# Calculate the average accuracy and store the per-sample results for this query strategy
avg_accuracy = np.mean(correct_classifications)
accuracies[query_strategy_name] = correct_classifications


# Stream-based learning setup
stream_length = 1000
X_stream = X_train_vect.toarray()[:stream_length]
y_stream = y_train.values[:stream_length]

# Set up query strategies
query_strategies = {
    'StreamRandomSampling': StreamRandomSampling(random_state=0),
    'StreamProbabilisticAL': StreamProbabilisticAL(random_state=0)
}

training_size = 1000
fit_clf = False
accuracies = {}

for query_strategy_name, query_strategy in query_strategies.items():
    clf = ParzenWindowClassifier(classes=np.unique(y_train.values), random_state=0)

    # Initialize the training data
    X_train_stream = deque(maxlen=training_size)
    y_train_stream = deque(maxlen=training_size)

    # Initialize with the first 10 samples
    X_train_stream.extend(X_stream[:10])
    y_train_stream.extend(y_stream[:10])

    clf.fit(X_train_stream, y_train_stream)
    correct_classifications = []
    count = 0
    for t in range(10, len(X_stream)):
        # Reshape the current sample for compatibility
        X_cand = X_stream[t].reshape(1, -1)
        y_cand = y_stream[t]

        # Refit the classifier and predict the current sample's label
        clf.fit(X_train_stream, y_train_stream)
        correct_classifications.append(clf.predict(X_cand)[0] == y_cand)

        # Ask the query strategy whether to label this sample, then update the strategy
        sampled_indices, utilities = call_func(query_strategy.query, candidates=X_cand, clf=clf, return_utilities=True, fit_clf=fit_clf)
        budget_manager_param_dict = {"utilities": utilities}
        call_func(query_strategy.update, candidates=X_cand, queried_indices=sampled_indices, budget_manager_param_dict=budget_manager_param_dict)

        # Update the training data with new samples and labels
        X_train_stream.append(X_stream[t])
        y_train_stream.append(y_cand if len(sampled_indices) > 0 else clf.missing_label)

        # Track the number of queried samples
        count += len(sampled_indices)

    # Calculate and print the average accuracy for each query strategy
    avg_accuracy = np.mean(correct_classifications)
    print(f"Query Strategy: {query_strategy_name}, Avg Accuracy: {avg_accuracy:.4f}, Acquisition count: {count}")
    accuracies[query_strategy_name] = correct_classifications

Query Strategy: StreamRandomSampling, Avg Accuracy: 0.4889, Acquisition count: 107
Query Strategy: StreamProbabilisticAL, Avg Accuracy: 0.4838, Acquisition count: 100
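
The acquisition counts can be turned into the fraction of stream samples that were actually labeled. The sketch below only does this arithmetic with the numbers printed above. A result near 0.1 would be consistent with a labeling budget of roughly 10%, but that budget value is an assumption about the strategies' default, not something the notebook sets explicitly.

```python
# Values taken from the notebook: stream length, seed size, printed counts
n_stream = 1000
n_seed = 10
acquisitions = {"StreamRandomSampling": 107, "StreamProbabilisticAL": 100}

# Samples that were offered to the query strategy
n_candidates = n_stream - n_seed

label_fraction = {
    name: count / n_candidates for name, count in acquisitions.items()
}

for name, frac in label_fraction.items():
    print(f"{name}: labeled fraction = {frac:.3f}")
```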


for query_strategy_name, correct_classifications in accuracies.items():
    plt.plot(gaussian_filter1d(np.array(correct_classifications, dtype=float), 20), label=query_strategy_name)
plt.legend();
plt.xlabel('Iteration')
plt.ylabel('Accuracy')
plt.title('Accuracy over time for different query strategies')
plt.show()
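
Besides Gaussian smoothing, two simple ways to summarize a stream of correct/incorrect predictions are the cumulative (prequential) accuracy and a fixed-size moving average. The sketch below shows both on a made-up sequence; the helper names are hypothetical.

```python
import numpy as np

def prequential_accuracy(correct):
    """Accuracy over all predictions seen so far, at every time step."""
    correct = np.asarray(correct, dtype=float)
    return np.cumsum(correct) / np.arange(1, len(correct) + 1)

def sliding_accuracy(correct, window):
    """Mean accuracy over each full window of consecutive predictions."""
    correct = np.asarray(correct, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(correct, kernel, mode="valid")

# Invented outcomes: True means the prediction was correct
hits = [True, False, True, True, False, True]

running = prequential_accuracy(hits)
windowed = sliding_accuracy(hits, window=3)
print(running.round(3), windowed.round(3))
```

The cumulative curve stabilizes over time, while the moving average reacts faster to changes in the stream.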

| Variable | Purpose |
| --- | --- |
| `n_annotators` | Number of simulated annotators |
| `noise_levels` | Reliability settings for each annotator |
| `y_noise_matrix` | Indicates which labels should be flipped |
| `y_annot` | The actual noisy labels from all annotators |
| `y` | The training label matrix (initially missing) |

| Object | Purpose |
| --- | --- |
| `query_strategies.items()` | Loop over each strategy (here: random and probabilistic) |
| `clf = ParzenWindowClassifier` | Initialize a new classifier for this strategy |
| `deque(maxlen=training_size)` | Sliding window of most recent training data |
| `extend(X_stream[:10])` | Seed the model with initial labeled data |
| `clf.fit(...)` | Train model on initial small dataset |

Practical 9: Movie review classification using Active learning

Let's get started

Loading the IMDB Dataset

Pre-processing the Text Data

Text Preprocessing

TF-IDF and Vectorization

Part A: The Active Learning Loop

Traditional Machine Learning Supervised

Initialize the active learning

Part B.1 Active Learning Strategies

Set Up the Query Strategy

Pool-based Active Learning

Further Down The Line (Not Discussed Today)

Multi-annotator Pool-based Active Learning

Part C Evaluation of Active Learning

Stream-based Active Learning

Set Up Query Strategies

Initialize Classifier and Training Data

Create Stream-based Active Learning Loop

Calculate and Track Accuracy

	review	sentiment
0	one of the other reviewers has mentioned that ...	positive
1	a wonderful little production br br the filmin...	positive
2	i thought this was a wonderful way to spend ti...	positive
3	basically theres a family where a little boy j...	negative
4	petter matteis love in the time of money is a ...	positive