Practical 7: Unsupervised learning¶

Anastasia Giachanou, Tina Shahedi

Machine Learning with Python - Utrecht Summer School

In this practical, we will analyse the Breast Cancer Wisconsin (Diagnostic) dataset, which is available from the scikit-learn library. This dataset contains data from Fine Needle Aspiration (FNA) of breast masses and offers a detailed examination of cell nuclei characteristics in digitized images from FNA procedures.

First, we will load the dataset and transform it into a pandas dataframe. Next, we'll apply PCA to reduce the dimensionality and then use clustering to group the observations.

After this practical, you will be able to:

  • Understand the differences between supervised and unsupervised learning
  • Implement k-means clustering and interpret its results
  • Evaluate clustering with silhouette score
  • Visualize clusters
  • Apply dimensionality reduction with PCA

First let's import the necessary libraries.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
from sklearn.model_selection import GridSearchCV
from scipy.cluster.hierarchy import dendrogram, linkage

Let's get started¶

In the first part of this practical, we will focus on the Breast Cancer Wisconsin Dataset from scikit-learn. This dataset includes 569 instances with 30 features that capture various characteristics of cell nuclei in breast mass images. The features are based on ten underlying measurements, such as radius, texture, perimeter, and area, each quantified in three ways: the mean, the standard error, and the "worst" value (the mean of the three largest values), giving 30 features in total.

1. Let's load the dataset first using the function load_breast_cancer(). We can store it in the variable data.

In [3]:
# Load the breast cancer dataset
data = load_breast_cancer()

2. Print the feature names of the Breast Cancer Dataset.

Hint: The variable data is a dictionary-like object that encompasses data, target, and additional metadata. Key components include:

  • data: A numpy array with the shape (n_samples, n_features), containing the feature matrix.
  • target: A numpy array with the shape (n_samples,), which holds the target variable.
  • feature_names: A list of feature names.
  • DESCR: A detailed description of the dataset.

You can access these attributes using dictionary-like indexing, such as data.feature_names, data.target_names, and so on. This approach allows you to explore the dataset's structure and understand the features that will be used for analysis.

In [4]:
# Print the feature_names of the dataset
print("Feature names:", data.feature_names)
Feature names: ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']

3. Explore the description of the data.

In [6]:
# Print the description of the dataset
print(data['DESCR'])
.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

:Number of Instances: 569

:Number of Attributes: 30 numeric, predictive attributes and the class

:Attribute Information:
    - radius (mean of distances from center to points on the perimeter)
    - texture (standard deviation of gray-scale values)
    - perimeter
    - area
    - smoothness (local variation in radius lengths)
    - compactness (perimeter^2 / area - 1.0)
    - concavity (severity of concave portions of the contour)
    - concave points (number of concave portions of the contour)
    - symmetry
    - fractal dimension ("coastline approximation" - 1)

    The mean, standard error, and "worst" or largest (mean of the three
    worst/largest values) of these features were computed for each image,
    resulting in 30 features.  For instance, field 0 is Mean Radius, field
    10 is Radius SE, field 20 is Worst Radius.

    - class:
            - WDBC-Malignant
            - WDBC-Benign

:Summary Statistics:

===================================== ====== ======
                                        Min    Max
===================================== ====== ======
radius (mean):                        6.981  28.11
texture (mean):                       9.71   39.28
perimeter (mean):                     43.79  188.5
area (mean):                          143.5  2501.0
smoothness (mean):                    0.053  0.163
compactness (mean):                   0.019  0.345
concavity (mean):                     0.0    0.427
concave points (mean):                0.0    0.201
symmetry (mean):                      0.106  0.304
fractal dimension (mean):             0.05   0.097
radius (standard error):              0.112  2.873
texture (standard error):             0.36   4.885
perimeter (standard error):           0.757  21.98
area (standard error):                6.802  542.2
smoothness (standard error):          0.002  0.031
compactness (standard error):         0.002  0.135
concavity (standard error):           0.0    0.396
concave points (standard error):      0.0    0.053
symmetry (standard error):            0.008  0.079
fractal dimension (standard error):   0.001  0.03
radius (worst):                       7.93   36.04
texture (worst):                      12.02  49.54
perimeter (worst):                    50.41  251.2
area (worst):                         185.2  4254.0
smoothness (worst):                   0.071  0.223
compactness (worst):                  0.027  1.058
concavity (worst):                    0.0    1.252
concave points (worst):               0.0    0.291
symmetry (worst):                     0.156  0.664
fractal dimension (worst):            0.055  0.208
===================================== ====== ======

:Missing Attribute Values: None

:Class Distribution: 212 - Malignant, 357 - Benign

:Creator:  Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian

:Donor: Nick Street

:Date: November, 1995

This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
https://goo.gl/U2Uwz2

Features are computed from a digitized image of a fine needle
aspirate (FNA) of a breast mass.  They describe
characteristics of the cell nuclei present in the image.

Separating plane described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree.  Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.

The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:

ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

.. dropdown:: References

  - W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction
    for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on
    Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
    San Jose, CA, 1993.
  - O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and
    prognosis via linear programming. Operations Research, 43(4), pages 570-577,
    July-August 1995.
  - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
    to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994)
    163-171.

4. Convert the dataset into a pandas DataFrame called df. If you transpose the dataframe (with .T), the columns are shown as rows, which makes all 30 features easier to inspect.

In [9]:
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head().T
Out[9]:
0 1 2 3 4
mean radius 17.990000 20.570000 19.690000 11.420000 20.290000
mean texture 10.380000 17.770000 21.250000 20.380000 14.340000
mean perimeter 122.800000 132.900000 130.000000 77.580000 135.100000
mean area 1001.000000 1326.000000 1203.000000 386.100000 1297.000000
mean smoothness 0.118400 0.084740 0.109600 0.142500 0.100300
mean compactness 0.277600 0.078640 0.159900 0.283900 0.132800
mean concavity 0.300100 0.086900 0.197400 0.241400 0.198000
mean concave points 0.147100 0.070170 0.127900 0.105200 0.104300
mean symmetry 0.241900 0.181200 0.206900 0.259700 0.180900
mean fractal dimension 0.078710 0.056670 0.059990 0.097440 0.058830
radius error 1.095000 0.543500 0.745600 0.495600 0.757200
texture error 0.905300 0.733900 0.786900 1.156000 0.781300
perimeter error 8.589000 3.398000 4.585000 3.445000 5.438000
area error 153.400000 74.080000 94.030000 27.230000 94.440000
smoothness error 0.006399 0.005225 0.006150 0.009110 0.011490
compactness error 0.049040 0.013080 0.040060 0.074580 0.024610
concavity error 0.053730 0.018600 0.038320 0.056610 0.056880
concave points error 0.015870 0.013400 0.020580 0.018670 0.018850
symmetry error 0.030030 0.013890 0.022500 0.059630 0.017560
fractal dimension error 0.006193 0.003532 0.004571 0.009208 0.005115
worst radius 25.380000 24.990000 23.570000 14.910000 22.540000
worst texture 17.330000 23.410000 25.530000 26.500000 16.670000
worst perimeter 184.600000 158.800000 152.500000 98.870000 152.200000
worst area 2019.000000 1956.000000 1709.000000 567.700000 1575.000000
worst smoothness 0.162200 0.123800 0.144400 0.209800 0.137400
worst compactness 0.665600 0.186600 0.424500 0.866300 0.205000
worst concavity 0.711900 0.241600 0.450400 0.686900 0.400000
worst concave points 0.265400 0.186000 0.243000 0.257500 0.162500
worst symmetry 0.460100 0.275000 0.361300 0.663800 0.236400
worst fractal dimension 0.118900 0.089020 0.087580 0.173000 0.076780

Principal Component Analysis (PCA)¶

The dataset has 30 variables, which means that it is impossible to plot all of them in one single plot. One option is to create a two-dimensional plot for every pair of variables, but that would give us far too many plots to compare.

The alternative is to use Principal Component Analysis (PCA) to reduce the data to 2 main principal components. In this way, we can generate a single 2D plot.

Feature scaling is the first step for PCA: it transforms the values of the features so that their mean is 0 and their standard deviation is 1. We use StandardScaler before applying PCA because PCA is sensitive to the scale of the features.

5. Apply the StandardScaler on the df dataframe. First create the scaler, then fit it to the dataframe with fit(), and finally transform the features (you can call the scaler.transform() function for this). Name the standardised data df_scaled.

In [10]:
# Standardize the features
scaler = StandardScaler()
scaler.fit(df)
df_scaled = scaler.transform(df)
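
As a quick sanity check (a small sketch, not part of the original exercise), you could verify that the scaled features indeed have mean approximately 0 and standard deviation approximately 1:

In [ ]:
# Sanity check: after StandardScaler, each column should have mean ~0 and std ~1
print("Means close to 0:", np.allclose(df_scaled.mean(axis=0), 0))
print("Stds close to 1:", np.allclose(df_scaled.std(axis=0), 1))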

6. Now run the PCA on the dataset by calling PCA(), which is part of scikit-learn. First create the instance of PCA (you can also set the number of components, but we will not do that here). Then fit the pca object on the dataframe using the fit_transform() function. Print the explained variance ratio of the principal components (explained_variance_ratio_ attribute). How much variance is explained by the first 2 components?

In [11]:
# Implement PCA
pca = PCA()
pca_fit = pca.fit_transform(df_scaled)

# Print the explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_
print("Explained Variance Ratio:", pca.explained_variance_ratio_)
Explained Variance Ratio: [4.42720256e-01 1.89711820e-01 9.39316326e-02 6.60213492e-02
 5.49576849e-02 4.02452204e-02 2.25073371e-02 1.58872380e-02
 1.38964937e-02 1.16897819e-02 9.79718988e-03 8.70537901e-03
 8.04524987e-03 5.23365745e-03 3.13783217e-03 2.66209337e-03
 1.97996793e-03 1.75395945e-03 1.64925306e-03 1.03864675e-03
 9.99096464e-04 9.14646751e-04 8.11361259e-04 6.01833567e-04
 5.16042379e-04 2.72587995e-04 2.30015463e-04 5.29779290e-05
 2.49601032e-05 4.43482743e-06]

The first 2 principal components explain 0.44 + 0.19 = 0.63 of the total variance.
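
As a quick check, here is a one-line sketch that sums the first two entries of the explained_variance_ratio array computed above:

In [ ]:
# Variance explained by the first two principal components together
print("PC1 + PC2 explained variance:", explained_variance_ratio[:2].sum())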

Now we will plot the principal components.

7. Visualize the first 2 PCA components of the dataset. To get access to the transformed values of the first component you can run pca_fit[:, 0], where pca_fit holds the transformed data. For the plot you can use matplotlib and create a scatter plot. Then create a second plot that adds the target labels as a color (c = data.target). What can you conclude from the plots?

Note that since this dataset also includes the target outcome, you can plot it to see whether the principal components separate the original classes.

In [12]:
colors = np.array(['blue', 'green'])
target_colors = colors[data.target]

# Plot PCA
plt.figure(figsize=(8, 6))
plt.scatter(pca_fit[:, 0], pca_fit[:, 1])
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('PCA of Breast Cancer Dataset')
plt.show()
In [ ]:
plt.figure(figsize=(8, 6))
plt.scatter(pca_fit[:, 0], pca_fit[:, 1], c=data.target)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('PCA of Breast Cancer Dataset')
plt.colorbar(label='Target')
plt.show()

While not perfectly separated, the yellow and purple dots tend to form distinct clusters. This suggests that the PCA components help distinguish the classes, and that a linear projection (like PCA) retains meaningful structure.

This kind of plot helps us:

  • Visualize class separation in high-dimensional data (in this case, yes we have the labels)
  • Assess whether linear methods (like logistic regression or SVM) might work
  • Decide if PCA is helpful as a preprocessing step for clustering or classification

Note: PCA is unsupervised, so it doesn’t use the labels when creating the projection

8. Visualize the explained variance and the cumulative variance of the principal components obtained from PCA. For the cumulative variance, you can use np.cumsum(explained_variance_ratio). Remember that pca.explained_variance_ratio_ gives you the explained variance ratio per component.

In [ ]:
# Calculate cumulative explained variance
cumulative_explained_variance = np.cumsum(explained_variance_ratio)

# Set up the matplotlib figure with two subplots side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))

# Plot for Individual explained variance
ax1.bar(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio, alpha=0.5, align='center', color='g')
ax1.set_ylabel('Explained variance ratio')
ax1.set_xlabel('Principal component index')
ax1.set_title('Individual Explained Variance')

# Plot for Cumulative explained variance
ax2.step(range(1, len(cumulative_explained_variance) + 1), cumulative_explained_variance, where='mid', color='b')
ax2.set_xlabel('Principal component index')
ax2.set_ylabel('Explained Variance')
ax2.set_title('Cumulative Explained Variance')
ax2.set_ylim([0, 1.1])  # Ensuring the y-axis goes from 0 to slightly above 1
plt.tight_layout()
# Show plot with a tight layout
plt.show()

The first two principal components explain 44.27% and 18.97% of the variance respectively, and together they account for 63.24% of the total variance. The cumulative variance plot can help us decide how many PCs we need to reach a particular level of explained variance. For example, to reach 80% explained variance, we need the first five principal components.
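
To find this number programmatically rather than by reading the plot, here is a small sketch using the cumulative_explained_variance array from the previous cell:

In [ ]:
# Smallest number of components whose cumulative explained variance reaches 80%
n_components_80 = np.argmax(cumulative_explained_variance >= 0.80) + 1
print("Components needed for 80% of the variance:", n_components_80)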

9. Let's also look at the loadings for PC1 and PC2. To get the loadings you can run pca.components_.T and keep the first 2 columns. pca.components_ has shape [n_components, n_features] and gives you values per component (rows) and feature (columns). Then extract the first 2 columns from those loadings and put them in a dataframe.

In [14]:
# Get loadings (eigenvectors)
loadings = pca.components_.T

# Extract loadings for the first two components
loadings_pc1_pc2 = loadings[:, :2]

# Create a DataFrame for better visualization
loadings_df = pd.DataFrame(loadings_pc1_pc2, index=data.feature_names, columns=['PC1', 'PC2'])
print(loadings_df.head())
                      PC1       PC2
mean radius      0.218902 -0.233857
mean texture     0.103725 -0.059706
mean perimeter   0.227537 -0.215181
mean area        0.220995 -0.231077
mean smoothness  0.142590  0.186113

The table shows the PCA loadings, which indicate how much each original feature contributes to the first two principal components. Higher absolute values mean the feature has a stronger influence on that component. A positive or negative value represents the direction of influence, not just its strength. For example, features like mean radius, mean area, and mean perimeter have high positive weights on PC1, suggesting that PC1 captures overall tumor size. PC2 is influenced more by features like mean smoothness, and it may reflect differences in texture or shape uniformity. On PC2, mean radius, mean area, and mean perimeter have negative loadings, which means they influence the second component in the opposite direction compared to features with positive loadings (like mean smoothness). These loadings help us understand what the principal components represent in terms of the original data features.
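
To see which features dominate each component numerically, here is a short sketch (not part of the original exercise) that ranks the loadings_df entries by absolute value:

In [ ]:
# Top 5 features by absolute loading on each of the first two components
print(loadings_df['PC1'].abs().sort_values(ascending=False).head())
print(loadings_df['PC2'].abs().sort_values(ascending=False).head())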

10. Let's now create a barplot showing the loadings for every feature per PC1 and PC2. You can use the dataframe that we created in the previous question.

In [15]:
plt.figure(figsize=(6, 4))
# Plot the loadings
plt.bar(loadings_df.index, loadings_df['PC1'], color='r', alpha=0.5, label='PC1')
plt.bar(loadings_df.index, loadings_df['PC2'], color='b', alpha=0.5, label='PC2')
plt.title('Loadings for PC1 and PC2')
plt.xlabel('Features')
plt.ylabel('Loadings')
plt.xticks(rotation=90)
plt.legend()

plt.tight_layout()
plt.show()

From the plot we can see that we have both positive and negative loadings. A positive loading means that higher values of the variable push the principal component score up, while a negative loading means that higher values push it down: the sign indicates the direction of the relationship and the magnitude its strength.

PC1 is heavily influenced by “mean” and “worst” size-related features. Features like mean radius, mean perimeter, mean area, worst area, and worst radius have strong positive loadings on PC1. This supports the idea that PC1 captures tumor size and shape characteristics.

Mean smoothness, mean compactness, worst smoothness, worst symmetry, etc. show positive or negative contributions to PC2. This indicates PC2 may capture differences in texture, symmetry, or variability between tumors.

We also observe that many “mean” and “worst” features (e.g., mean radius, worst radius) have similar patterns, suggesting multicollinearity in the original data — and PCA is helping reduce that.
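We can check this overlap directly; for instance, a quick sketch (an illustrative check, not part of the exercise) computing the correlation between a "mean" feature and its "worst" counterpart in the original dataframe:

In [ ]:
# Correlation between a "mean" feature and its "worst" counterpart
print("corr(mean radius, worst radius):", df['mean radius'].corr(df['worst radius']))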

11. Let's also create the scatter plot that will show the loadings for every feature per PC1 and PC2

In [21]:
# Plot the loadings
plt.figure(figsize=(12, 12))
plt.scatter(loadings_df['PC1'], loadings_df['PC2'])

# Annotate the points with feature names
for i, feature in enumerate(loadings_df.index):
    plt.annotate(feature, (loadings_df['PC1'].iloc[i], loadings_df['PC2'].iloc[i]), fontsize=8, xytext=(3, 3), textcoords='offset points')

plt.xlabel('PC1 Loadings')
plt.ylabel('PC2 Loadings')
plt.title('PCA Loadings for the First Two Principal Components')
plt.axhline(0, color='grey', linestyle='--', linewidth=0.5)
plt.axvline(0, color='grey', linestyle='--', linewidth=0.5)
plt.grid()
plt.show()

Features like mean radius, mean perimeter, mean area, worst radius, etc. are clustered on the far right. These features have high positive loadings on PC1, meaning PC1 likely captures tumor size/volume.

PC2 (Y-axis) reflects variation in texture, smoothness, or complexity. Features like smoothness error, symmetry error, and mean fractal dimension have high positive PC2 loadings. These features are located toward the top of the plot, indicating their importance for PC2.

Features like mean radius, mean perimeter, and worst radius appear close together, meaning they are highly correlated.

K-means clustering¶

Now we are going to apply some clustering!

12. Implement k-means clustering on our dataset (you can use the scaled version of the dataframe). First, create an object of KMeans() and set the parameter n_clusters to 2. Also set n_init=10; this is how many times the algorithm will run with different centroid seeds. Then fit the model on the scaled dataset. Print the centroids of the clusters using the cluster_centers_ attribute.

Note: Look on the documentation for more information on the implementation of k-means (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)

In [22]:
kmeans = KMeans(n_clusters=2, n_init= 10)
kmeans.fit(df_scaled)
print("Centroids of the clusters:", kmeans.cluster_centers_)
Centroids of the clusters: [[ 0.93362406  0.4620646   0.96730462  0.92424266  0.60116657  1.01678879
   1.11927003  1.13499581  0.59745592  0.28828783  0.82729086  0.0408053
   0.83306382  0.77658149  0.0460656   0.71559384  0.64677371  0.76877382
   0.14483925  0.45777508  1.00100618  0.48610441  1.02895252  0.96525332
   0.61097579  0.95149718  1.0377126   1.12132923  0.58638203  0.65089347]
 [-0.48299485 -0.23904142 -0.50041892 -0.47814154 -0.3110035  -0.52601873
  -0.57903569 -0.58717117 -0.30908386 -0.1491409  -0.42798514 -0.02110994
  -0.43097168 -0.40175149 -0.02383127 -0.37020054 -0.3345976  -0.39771232
  -0.07493017 -0.23682231 -0.51785387 -0.25147802 -0.53231144 -0.49935772
  -0.31607814 -0.49224121 -0.53684332 -0.58010099 -0.30335497 -0.33672889]]

These centroids represent the mean values of each feature for the two clusters. Because the data was scaled, each feature was transformed to have mean 0 and standard deviation 1, so positive values are above average for that feature and negative values are below average.
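
To interpret the centroids in the original measurement units, one option (a sketch, assuming the scaler and kmeans objects fitted above) is to map them back with the scaler's inverse transform:

In [ ]:
# Map the centroids back to the original (unscaled) feature units
centroids_original = scaler.inverse_transform(kmeans.cluster_centers_)
print(pd.DataFrame(centroids_original, columns=data.feature_names).T.head())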

13. Visualize the clusters obtained from k-means clustering. Use a color mapping for the k-means labels (kmeans.labels_)

In [23]:
colors = np.array(['blue', 'green'])
cluster_colors = colors[kmeans.labels_]

plt.figure(figsize=(8, 6))
plt.scatter(df_scaled[:, 0], df_scaled[:, 1], c = cluster_colors)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='*')
plt.title('K-means Clustering of Breast Cancer Dataset')
plt.show()

Here the different colors correspond to the 2 different clusters that we created with k-means. The red stars correspond to the cluster centroids (plotted in the same two feature dimensions).

In this case we also have the original labels, so we can see what the original classes look like as well (but usually this is not the case).

In [24]:
colors = np.array(['blue', 'green'])
target_colors = colors[data.target]
plt.figure(figsize=(8, 6))
plt.scatter( df_scaled[:, 0], df_scaled[:, 1], c=target_colors)
plt.title('Original Labels of Breast Cancer Dataset')
plt.show()

If we compare the two plots, we see that at the bottom left there are some points assigned to a different cluster from the original one

If some of those label assignments look strange to you, remember: you are looking at 30-dimensional data plotted in only two of its (scaled) feature dimensions. In this 2D view points may look close together, but in the original high-dimensional space they may be far apart.
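
A view that preserves more of the structure is to colour the 2D PCA projection from earlier by the k-means labels; here is a small sketch (not part of the original exercise), since the first two components together explain about 63% of the variance:

In [ ]:
# Show the k-means clusters in the space of the first two principal components
plt.figure(figsize=(8, 6))
plt.scatter(pca_fit[:, 0], pca_fit[:, 1], c=kmeans.labels_)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('K-means Clusters in PCA Space')
plt.show()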

14. Let's run the k-means again, trying different values for k, say from 1 to 10. In every step, add the inertia score (kmeans.inertia_) to a list.

In [25]:
inertia = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, n_init=10)
    kmeans.fit(df_scaled)
    inertia.append(kmeans.inertia_)

15. Determine the optimal number of clusters by using the elbow method (so actually plot the inertia for every k).

In [26]:
plt.figure(figsize=(8, 6))
plt.plot(range(1, 11), inertia, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia') #Inertia is a measure of how well the data points fit into their assigned clusters.
plt.show()

When we plot those values, we can see that the best k is 2.

From k = 1 to k = 2 there is a steep drop in inertia, and the biggest "bend" or elbow in the curve appears at k = 2. After k = 2, the inertia continues to decrease, but at a much slower rate.
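
To quantify the elbow, here is a quick sketch computing the relative drop in inertia at each step (using the inertia list from above):

In [ ]:
# Relative decrease in inertia when moving from k to k+1 clusters
for k in range(1, 10):
    drop = (inertia[k - 1] - inertia[k]) / inertia[k - 1]
    print(f"k={k} -> k={k + 1}: inertia drops by {drop:.1%}")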

16. Now calculate the silhouette score for the best k (k = 2).

In [30]:
kmeans = KMeans(n_clusters=2, n_init=10)
kmeans.fit(df_scaled)
labels = kmeans.labels_

# Calculate the silhouette score
silhouette_avg = silhouette_score(df_scaled, labels)
print("Silhouette Score for 3 clusters:", silhouette_avg)
Silhouette Score for 3 clusters: 0.3449740051034408

The silhouette score for the K-means clustering with 2 clusters is approximately 0.34

A general rule of thumb is that a silhouette score above 0.5 indicates a good clustering, a silhouette score below 0.25 indicates a bad clustering, and a silhouette score between 0.25 and 0.5 indicates a fair clustering.
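
As a cross-check (a sketch, assuming the same scaled data), we can also compute the silhouette score for a few other values of k and see whether k = 2 indeed scores highest:

In [ ]:
# Silhouette scores for several cluster counts (the score needs at least 2 clusters)
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10)
    km_labels = km.fit_predict(df_scaled)
    print(f"k={k}: silhouette = {silhouette_score(df_scaled, km_labels):.3f}")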

17. Now implement hierarchical clustering using scikit-learn on the Breast Cancer Wisconsin dataset. The constructor in this case is AgglomerativeClustering. Use linkage='complete'.

In [31]:
# Implementing hierarchical clustering
hc = AgglomerativeClustering(metric='euclidean', linkage='complete')
hc_labels = hc.fit_predict(df_scaled)

18. Plot the dendrogram with the dendrogram function

In [35]:
# Creating the linkage matrix for dendrogram
linked_matrix = linkage(df_scaled, method='complete')

# Plotting the dendrogram
plt.figure(figsize=(8, 6))
dendrogram(linked_matrix)
plt.title('Dendrogram for Hierarchical Clustering with Complete Linkage')
plt.ylim(0,20)
plt.tight_layout()
plt.show()

The x-axis of the dendrogram represents the samples in the data. The y-axis represents the distance between those samples. The higher the line, the more dissimilar those samples/clusters are.

Try also different linkage methods and compare the dendrograms.
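
A possible starting point (a sketch, not a prescribed solution) is to loop over a few linkage methods supported by scipy and draw one dendrogram per method:

In [ ]:
# Compare dendrograms for several linkage methods
methods = ['complete', 'average', 'ward', 'single']
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, method in zip(axes.ravel(), methods):
    dendrogram(linkage(df_scaled, method=method), ax=ax, no_labels=True)
    ax.set_title(f'{method} linkage')
plt.tight_layout()
plt.show()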

End of practical!