So far we have been using a real-world dataset, IMDB. To build more intuition for Active Learning, we now switch to a synthetic two-dimensional dataset, which makes visualisation much easier.
from sklearn.datasets import make_blobs
import numpy as np
import matplotlib as mlp
import matplotlib.pyplot as plt
from matplotlib import animation
import pandas as pd
import re
import string
import skactiveml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from skactiveml.classifier import SklearnClassifier, ParzenWindowClassifier
from skactiveml.pool import UncertaintySampling, ProbabilisticAL, RandomSampling
from skactiveml.pool import MonteCarloEER, QueryByCommittee
from skactiveml.pool.multiannotator import SingleAnnotatorWrapper
from skactiveml.stream import StreamRandomSampling, StreamProbabilisticAL
from skactiveml.utils import unlabeled_indices, labeled_indices, MISSING_LABEL, majority_vote, call_func
from skactiveml.visualization import plot_utilities, plot_decision_boundary
from collections import deque
from scipy.ndimage import gaussian_filter1d
from sklearn.manifold import TSNE
import warnings
mlp.rcParams["figure.facecolor"] = "white"
warnings.filterwarnings("ignore")
random_state = np.random.RandomState(0)
# Build a dataset.
X, y_true = make_blobs(
    n_samples=200,
    n_features=2,
    centers=[[0, 1], [-3, 0.5], [-1, -1], [2, 1], [1, -0.5]],
    cluster_std=0.7,
    random_state=random_state,
)
y_true = y_true % 2
y = np.full(shape=y_true.shape, fill_value=MISSING_LABEL)
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X[:, 0], X[:, 1], c=y_true, cmap="coolwarm", edgecolor="k", s=60)
plt.title("make blobs")
plt.xlabel("Feature A")
plt.ylabel("Feature B")
plt.grid(True)
plt.colorbar(scatter, ticks=[0, 1], label="Label")
plt.show()
make_blobs draws five clusters around the given centers, and y_true % 2 folds their cluster indices into two classes. In the plot we can see a cluster in the middle, with further clusters to its left and right.
7. Build an Active Learning loop with this dataset that covers the following strategies: Uncertainty Sampling, Query By Committee, Expected Error Reduction and Probabilistic Active Learning. Use clf = ParzenWindowClassifier() as your classifier.
In each iteration, use the currently active query strategy to select 10 unlabeled samples to be labeled. The strategy returns the indices of the selected samples (query_idx); copy their labels from y_train into the corresponding, initially missing entries of y_train_initial. After updating the labels, retrain the classifier on the updated training data. Finally, evaluate the classifier's performance on the test set after each iteration of the AL cycle.
Okay, but what does this mean? What does it entail, and what can we deduce from it? Does one strategy perform better than the others, or do they all perform equally well?
The next step is to visualise this. First, I want you to try it out for yourself. scikit-activeml ships visualisation helpers (skactiveml.visualization) that can be combined with plt and animation to produce the plots.
8. Make a visualisation for each of the strategies, showing the decision boundary after acquiring x labels. BONUS: animate the plot so that it steps through each batch of 10 labels you acquired. This is a difficult assignment.
%matplotlib ipympl
"[...]"