Anastasia Giachanou, Tina Shahedi
Machine Learning with Python - Utrecht Summer School
In this practical, we will analyse the Breast Cancer Wisconsin (Diagnostic) dataset, which is available from the scikit-learn library. This dataset contains data from Fine Needle Aspiration (FNA) of breast masses and offers a detailed examination of cell nuclei characteristics in digitized images from FNA procedures.
First, we will load the dataset and transform it into a pandas dataframe. Next, we'll apply PCA to reduce the dimensionality, and finally we will use clustering to detect groups in the data.
First let's import the necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
from sklearn.model_selection import GridSearchCV
from scipy.cluster.hierarchy import dendrogram, linkage
In the first part of this practical, we will focus on the Breast Cancer Wisconsin dataset from scikit-learn. This dataset includes 569 instances with 30 features. These features, such as radius, texture, perimeter, area, and several others, capture various characteristics of the cell nuclei in breast mass images. Each underlying measurement is quantified in three different ways: the mean, the standard error, and the "worst" value (the mean of the three largest values), giving 30 features in total.
1. Let's load the dataset first using the function load_breast_cancer(). We can store it in the variable data.
# Load the breast cancer dataset
data = load_breast_cancer()
2. Print the feature names of the Breast Cancer Dataset.
Hint: The variable data is a dictionary-like object that contains the data, the target, and additional metadata. Key components include:
- data: a numpy array with shape (n_samples, n_features), containing the feature matrix.
- target: a numpy array with shape (n_samples,), which holds the target variable.
- feature_names: a list of feature names.
- DESCR: a detailed description of the dataset.
You can access these attributes with attribute access, such as data.feature_names and data.target_names, or with dictionary-style indexing, such as data['feature_names']. This allows you to explore the dataset's structure and understand the features that will be used for the analysis.
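For example, a quick optional check of the shapes mentioned in the hint:
# Shapes of the feature matrix and the target
print(data.data.shape)
print(data.target.shape)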
# Print the feature_names of the dataset
print("Feature names:", data.feature_names)
Feature names: ['mean radius' 'mean texture' 'mean perimeter' 'mean area' 'mean smoothness' 'mean compactness' 'mean concavity' 'mean concave points' 'mean symmetry' 'mean fractal dimension' 'radius error' 'texture error' 'perimeter error' 'area error' 'smoothness error' 'compactness error' 'concavity error' 'concave points error' 'symmetry error' 'fractal dimension error' 'worst radius' 'worst texture' 'worst perimeter' 'worst area' 'worst smoothness' 'worst compactness' 'worst concavity' 'worst concave points' 'worst symmetry' 'worst fractal dimension']
3. Explore the structure and description of the data.
# Print the keys of the dataset
print(data.keys())
# Print the description of the dataset
print(data['DESCR'])
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module']) .. _breast_cancer_dataset: Breast cancer wisconsin (diagnostic) dataset -------------------------------------------- **Data Set Characteristics:** :Number of Instances: 569 :Number of Attributes: 30 numeric, predictive attributes and the class :Attribute Information: - radius (mean of distances from center to points on the perimeter) - texture (standard deviation of gray-scale values) - perimeter - area - smoothness (local variation in radius lengths) - compactness (perimeter^2 / area - 1.0) - concavity (severity of concave portions of the contour) - concave points (number of concave portions of the contour) - symmetry - fractal dimension ("coastline approximation" - 1) The mean, standard error, and "worst" or largest (mean of the three worst/largest values) of these features were computed for each image, resulting in 30 features. For instance, field 0 is Mean Radius, field 10 is Radius SE, field 20 is Worst Radius. - class: - WDBC-Malignant - WDBC-Benign :Summary Statistics: ===================================== ====== ====== Min Max ===================================== ====== ====== radius (mean): 6.981 28.11 texture (mean): 9.71 39.28 perimeter (mean): 43.79 188.5 area (mean): 143.5 2501.0 smoothness (mean): 0.053 0.163 compactness (mean): 0.019 0.345 concavity (mean): 0.0 0.427 concave points (mean): 0.0 0.201 symmetry (mean): 0.106 0.304 fractal dimension (mean): 0.05 0.097 radius (standard error): 0.112 2.873 texture (standard error): 0.36 4.885 perimeter (standard error): 0.757 21.98 area (standard error): 6.802 542.2 smoothness (standard error): 0.002 0.031 compactness (standard error): 0.002 0.135 concavity (standard error): 0.0 0.396 concave points (standard error): 0.0 0.053 symmetry (standard error): 0.008 0.079 fractal dimension (standard error): 0.001 0.03 radius (worst): 7.93 36.04 texture (worst): 12.02 49.54 perimeter (worst): 50.41 251.2 area (worst): 185.2 4254.0 smoothness (worst): 0.071 0.223 compactness (worst): 0.027 1.058 concavity (worst): 0.0 1.252 concave points (worst): 0.0 0.291 symmetry (worst): 0.156 0.664 fractal dimension (worst): 0.055 0.208 ===================================== ====== ====== :Missing Attribute Values: None :Class Distribution: 212 - Malignant, 357 - Benign :Creator: Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian :Donor: Nick Street :Date: November, 1995 This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets. https://goo.gl/U2Uwz2 Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. Separating plane described above was obtained using Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree Construction Via Linear Programming." Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method which uses linear programming to construct a decision tree. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes. The actual linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34]. 
This database is also available through the UW CS ftp server: ftp ftp.cs.wisc.edu cd math-prog/cpo-dataset/machine-learn/WDBC/ .. topic:: References - W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, volume 1905, pages 861-870, San Jose, CA, 1993. - O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4), pages 570-577, July-August 1995. - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 163-171.
4. Convert the dataset into a pandas DataFrame called df. Print the dataframe in a way that lets you see all the columns.
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head().T
 | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
mean radius | 17.990000 | 20.570000 | 19.690000 | 11.420000 | 20.290000 |
mean texture | 10.380000 | 17.770000 | 21.250000 | 20.380000 | 14.340000 |
mean perimeter | 122.800000 | 132.900000 | 130.000000 | 77.580000 | 135.100000 |
mean area | 1001.000000 | 1326.000000 | 1203.000000 | 386.100000 | 1297.000000 |
mean smoothness | 0.118400 | 0.084740 | 0.109600 | 0.142500 | 0.100300 |
mean compactness | 0.277600 | 0.078640 | 0.159900 | 0.283900 | 0.132800 |
mean concavity | 0.300100 | 0.086900 | 0.197400 | 0.241400 | 0.198000 |
mean concave points | 0.147100 | 0.070170 | 0.127900 | 0.105200 | 0.104300 |
mean symmetry | 0.241900 | 0.181200 | 0.206900 | 0.259700 | 0.180900 |
mean fractal dimension | 0.078710 | 0.056670 | 0.059990 | 0.097440 | 0.058830 |
radius error | 1.095000 | 0.543500 | 0.745600 | 0.495600 | 0.757200 |
texture error | 0.905300 | 0.733900 | 0.786900 | 1.156000 | 0.781300 |
perimeter error | 8.589000 | 3.398000 | 4.585000 | 3.445000 | 5.438000 |
area error | 153.400000 | 74.080000 | 94.030000 | 27.230000 | 94.440000 |
smoothness error | 0.006399 | 0.005225 | 0.006150 | 0.009110 | 0.011490 |
compactness error | 0.049040 | 0.013080 | 0.040060 | 0.074580 | 0.024610 |
concavity error | 0.053730 | 0.018600 | 0.038320 | 0.056610 | 0.056880 |
concave points error | 0.015870 | 0.013400 | 0.020580 | 0.018670 | 0.018850 |
symmetry error | 0.030030 | 0.013890 | 0.022500 | 0.059630 | 0.017560 |
fractal dimension error | 0.006193 | 0.003532 | 0.004571 | 0.009208 | 0.005115 |
worst radius | 25.380000 | 24.990000 | 23.570000 | 14.910000 | 22.540000 |
worst texture | 17.330000 | 23.410000 | 25.530000 | 26.500000 | 16.670000 |
worst perimeter | 184.600000 | 158.800000 | 152.500000 | 98.870000 | 152.200000 |
worst area | 2019.000000 | 1956.000000 | 1709.000000 | 567.700000 | 1575.000000 |
worst smoothness | 0.162200 | 0.123800 | 0.144400 | 0.209800 | 0.137400 |
worst compactness | 0.665600 | 0.186600 | 0.424500 | 0.866300 | 0.205000 |
worst concavity | 0.711900 | 0.241600 | 0.450400 | 0.686900 | 0.400000 |
worst concave points | 0.265400 | 0.186000 | 0.243000 | 0.257500 | 0.162500 |
worst symmetry | 0.460100 | 0.275000 | 0.361300 | 0.663800 | 0.236400 |
worst fractal dimension | 0.118900 | 0.089020 | 0.087580 | 0.173000 | 0.076780 |
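If you prefer to keep the original orientation instead of transposing, one alternative (a small optional sketch) is to tell pandas to display all columns:
# Alternative: show all 30 columns without transposing
pd.set_option('display.max_columns', None)
df.head()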
The dataset has 30 variables, which means that it is impossible to plot all of them in one single plot. One option is to create 2-dimensional plots for every pair of variables, but that would give far too many plots to compare.
The alternative is to use Principal Component Analysis (PCA) to reduce the data to two principal components. In this way, we can generate a 2-d plot.
Feature scaling is the first step before PCA; it transforms the values of the features so that their mean is 0 and their standard deviation is 1.
5. Apply the StandardScaler on the df dataframe. First create the scaler, then use the fit function to fit it to the dataframe, and then transform the features of the dataframe (you can call the scaler.transform() function for this). Name the standardised data df_scaled.
# Standardize the features
scaler = StandardScaler()
scaler.fit(df)
df_scaled = scaler.transform(df)
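Note that fitting and transforming can also be done in one step with the scaler's fit_transform() method, which gives the same result here:
# Equivalent one-step version
df_scaled = scaler.fit_transform(df)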
6. Now run PCA on the dataset by calling PCA(), which is part of scikit-learn. First create the instance of PCA (you can also set the number of components, but we will not do that here). Then fit the pca object on the dataframe using the fit_transform() function. Print the explained variance ratio of the principal components (the explained_variance_ratio_ attribute). How much variance is explained by the first 2 components?
# Implement PCA
pca = PCA()
pca_fit = pca.fit_transform(df_scaled)
# Print the explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_
print("Explained Variance Ratio:", pca.explained_variance_ratio_)
Explained Variance Ratio: [4.42720256e-01 1.89711820e-01 9.39316326e-02 6.60213492e-02 5.49576849e-02 4.02452204e-02 2.25073371e-02 1.58872380e-02 1.38964937e-02 1.16897819e-02 9.79718988e-03 8.70537901e-03 8.04524987e-03 5.23365745e-03 3.13783217e-03 2.66209337e-03 1.97996793e-03 1.75395945e-03 1.64925306e-03 1.03864675e-03 9.99096464e-04 9.14646751e-04 8.11361259e-04 6.01833567e-04 5.16042379e-04 2.72587995e-04 2.30015463e-04 5.29779290e-05 2.49601032e-05 4.43482743e-06]
The first 2 principal components explain 0.44 + 0.19 = 0.63, i.e. about 63% of the total variance.
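You can verify this directly from the stored ratios, for example:
# Total variance explained by the first two components
print(explained_variance_ratio[:2].sum())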
Now we will plot the principal components.
7. Visualize the first 2 PCA components of the dataset. To access the transformed values of the first component you can use pca_fit[:, 0], where pca_fit is the fitted object. For the plot you can use matplotlib.
Note that, since this dataset also includes the target outcome, you can colour the points by it to see whether the principal components separate the two classes.
colors = np.array(['blue', 'green'])
target_colors = colors[data.target]
# Plot PCA
plt.figure(figsize=(8, 6))
plt.scatter(pca_fit[:, 0], pca_fit[:, 1])
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('PCA of Breast Cancer Dataset')
plt.show()
plt.figure(figsize=(8, 6))
plt.scatter(pca_fit[:, 0], pca_fit[:, 1], c=data.target)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('PCA of Breast Cancer Dataset')
plt.colorbar(label='Target')
plt.show()
The scatter plot shows the data represented by the first two principal components.
8. Visualize the explained variance and the cumulative explained variance of the principal components obtained from PCA. For the cumulative variance, you can use np.cumsum(explained_variance_ratio), where explained_variance_ratio comes from pca.explained_variance_ratio_.
# Calculate cumulative explained variance
cumulative_explained_variance = np.cumsum(explained_variance_ratio)
# Set up the matplotlib figure with shared y-axis
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
# Plot for Individual explained variance
ax1.bar(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio, alpha=0.5, align='center', color='g')
ax1.set_ylabel('Explained variance ratio')
ax1.set_xlabel('Principal component index')
ax1.set_title('Individual Explained Variance')
# Plot for Cumulative explained variance
ax2.step(range(1, len(cumulative_explained_variance) + 1), cumulative_explained_variance, where='mid', color='b')
ax2.set_xlabel('Principal component index')
ax2.set_ylabel('Explained Variance')
ax2.set_title('Cumulative Explained Variance')
ax2.set_ylim([0, 1.1]) # Ensuring the y-axis goes from 0 to slightly above 1
plt.tight_layout()
# Show plot with a tight layout
plt.show()
The first two principal components explain 44.27% and 18.97% of the variance respectively, and together they account for 63.24% of the total variance. The cumulative variance plot helps us decide how many PCs we need to reach a particular amount of explained variance. For example, to reach 80% explained variance, we need the first five principal components.
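As a small optional check, you can compute this number programmatically:
# Smallest number of components whose cumulative explained variance reaches 80%
n_components_80 = np.argmax(cumulative_explained_variance >= 0.80) + 1
print(n_components_80)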
9. Let's also look at the loadings for PC1 and PC2. To get the loadings you can use pca.components_.T and take the first 2 columns. pca.components_ has shape [n_components, n_features] and gives you a value per component (rows) and feature (columns). From those loadings, extract the first 2 columns and put them in a dataframe.
# Get loadings (eigenvectors)
loadings = pca.components_.T
# Extract loadings for the first two components
loadings_pc1_pc2 = loadings[:, :2]
# Create a DataFrame for better visualization
loadings_df = pd.DataFrame(loadings_pc1_pc2, index=data.feature_names, columns=['PC1', 'PC2'])
print(loadings_df.head())
                      PC1       PC2
mean radius      0.218902 -0.233857
mean texture     0.103725 -0.059706
mean perimeter   0.227537 -0.215181
mean area        0.220995 -0.231077
mean smoothness  0.142590  0.186113
10. Let's now create a barplot showing the loadings for every feature per PC1 and PC2. You can use the dataframe that we created in the previous question.
plt.figure(figsize=(6, 4))
# Plot the loadings
plt.bar(loadings_df.index, loadings_df['PC1'], color='r', alpha=0.5, label='PC1')
plt.bar(loadings_df.index, loadings_df['PC2'], color='b', alpha=0.5, label='PC2')
plt.title('Loadings for PC1 and PC2')
plt.xlabel('Features')
plt.ylabel('Loadings')
plt.xticks(rotation=90)
plt.legend()
plt.tight_layout()
plt.show()
From the plot we can see that some loadings are positive and some are negative. A positive loading means the feature moves in the same direction as the principal component (higher feature values give higher component scores), while a negative loading means it moves in the opposite direction; the magnitude tells us how strongly the feature contributes to that component.
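If you want to see which features dominate PC1, one optional approach (not part of the exercise) is to sort the loadings by absolute value:
# Ten features with the largest absolute loading on PC1
print(loadings_df['PC1'].abs().sort_values(ascending=False).head(10))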
11. Let's also create a scatter plot showing the loadings of every feature on PC1 and PC2.
# Plot the loadings
plt.figure(figsize=(8, 8))
plt.scatter(loadings_df['PC1'], loadings_df['PC2'])
# Annotate the points with feature names
for i, feature in enumerate(loadings_df.index):
    plt.annotate(feature, (loadings_df['PC1'].iloc[i], loadings_df['PC2'].iloc[i]), fontsize=12)
plt.xlabel('PC1 Loadings')
plt.ylabel('PC2 Loadings')
plt.title('PCA Loadings for the First Two Principal Components')
plt.axhline(0, color='grey', linestyle='--', linewidth=0.5)
plt.axvline(0, color='grey', linestyle='--', linewidth=0.5)
plt.grid()
plt.show()
Now we are going to apply some clustering!
12. Implement k-means clustering on our dataset (you can use the scaled version). First, create a KMeans() object and set the parameter n_clusters to 2. Also set n_init=10; this is how many times the algorithm will run with different centroid seeds. Then fit the model on the scaled dataset. Print the centroids of the clusters using the cluster_centers_ attribute.
Note: Look at the documentation for more information on the implementation of k-means (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
kmeans = KMeans(n_clusters=2, n_init=10)
kmeans.fit(df_scaled)
print("Centroids of the clusters:", kmeans.cluster_centers_)
Centroids of the clusters: [[ 0.97397614 0.48151381 1.0066346 0.96352718 0.60925407 1.020696 1.13942935 1.16458212 0.61113855 0.25222982 0.85859633 0.04274078 0.86027888 0.8071077 0.01706063 0.69505052 0.63689512 0.77623856 0.14038222 0.41503212 1.04008365 0.50631048 1.06597067 1.00315418 0.60829274 0.95083725 1.04429844 1.14621103 0.59741617 0.62246932] [-0.48442497 -0.23948977 -0.50066826 -0.47922799 -0.30302374 -0.50766196 -0.56671617 -0.57922637 -0.30396101 -0.12545115 -0.4270387 -0.02125791 -0.42787555 -0.40142988 -0.00848542 -0.34569618 -0.31677152 -0.38607654 -0.06982168 -0.20642387 -0.51730476 -0.25182285 -0.53018015 -0.49893721 -0.3025456 -0.47291642 -0.51940106 -0.57008917 -0.29713594 -0.30959659]]
These centroids represent the mean value of each feature for the two clusters, in the standardised feature space.
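Since the centroids live in the standardised space, you can optionally map them back to the original feature units with the scaler, for example:
# Centroids expressed in the original feature units
centroids_original = scaler.inverse_transform(kmeans.cluster_centers_)
print(pd.DataFrame(centroids_original, columns=data.feature_names).round(2).T)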
13. Visualize the clusters obtained from k-means clustering.
colors = np.array(['green', 'blue'])
cluster_colors = colors[kmeans.labels_]
plt.figure(figsize=(8, 6))
plt.scatter(df_scaled[:, 0], df_scaled[:, 1], c = cluster_colors)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='*')
plt.title('K-means Clustering of Breast Cancer Dataset')
plt.show()
Here the different colors correspond to the 2 clusters that we created with k-means, plotted against the first two (scaled) features. The red stars mark the cluster centroids.
In this case we also have the original labels, so we can see what the true classes look like as well (but usually this is not the case).
colors = np.array(['blue', 'green'])
target_colors = colors[data.target]
plt.figure(figsize=(8, 6))
plt.scatter( df_scaled[:, 0], df_scaled[:, 1], c=target_colors)
plt.title('Original Labels of Breast Cancer Dataset')
plt.show()
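Because the true diagnosis is available for this dataset, an optional sanity check is to cross-tabulate the k-means labels against it (cluster numbers are arbitrary, so look at whether each cluster is dominated by one class):
# Compare cluster assignments with the true labels
print(pd.crosstab(data.target, kmeans.labels_, rownames=['true class'], colnames=['cluster']))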
14. Let's run k-means again, trying different values for k, say from 1 to 10. In every step, add the inertia score (kmeans.inertia_) to a list.
inertia = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, n_init=10)
    kmeans.fit(df_scaled)
    inertia.append(kmeans.inertia_)
15. Determine the optimal number of clusters using the elbow method (i.e., plot the inertia for every k).
plt.figure(figsize=(8, 6))
plt.plot(range(1, 11), inertia, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia') #Inertia is a measure of how well the data points fit into their assigned clusters.
plt.show()
When we plot these values, we can see that the best k is 2.
We observe a sharp decrease in inertia as the number of clusters increases from 1 to 2, which suggests that moving from a single cluster to two clusters significantly improves the fit.
After k = 2, the inertia continues to decrease, but at a slower rate.
16. Let's also calculate the silhouette score for k = 2.
kmeans = KMeans(n_clusters=2, n_init=10)
kmeans.fit(df_scaled)
labels = kmeans.labels_
# Calculate the silhouette score
silhouette_avg = silhouette_score(df_scaled, labels)
print("Silhouette Score for 2 clusters:", silhouette_avg)
Silhouette Score for 2 clusters: 0.34338224069077816
The silhouette score for the k-means clustering with 2 clusters is approximately 0.34.
A general rule of thumb is that a silhouette score above 0.5 indicates a good clustering, a score below 0.25 indicates a bad clustering, and a score between 0.25 and 0.5 indicates a fair clustering.
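If you are curious, you could also compute the silhouette score for a few other values of k and see how they compare (an optional sketch):
# Silhouette scores for k = 2, ..., 6
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10).fit_predict(df_scaled)
    print(k, round(silhouette_score(df_scaled, labels_k), 3))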
17. Now implement hierarchical clustering on the Breast Cancer Wisconsin dataset using scikit-learn. The class in this case is AgglomerativeClustering. Use linkage='complete'.
# Implementing hierarchical clustering
hc = AgglomerativeClustering(metric='euclidean', linkage='complete')
hc_labels = hc.fit_predict(df_scaled)
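To compare with k-means, you could optionally compute the silhouette score for these hierarchical labels as well:
# Silhouette score for the hierarchical clustering labels
print(silhouette_score(df_scaled, hc_labels))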
18. Plot the dendrogram using the dendrogram() function from scipy.
# Creating the linkage matrix for dendrogram
linked_matrix = linkage(df_scaled, method='complete')
# Plotting the dendrogram
plt.figure(figsize=(8, 6))
dendrogram(linked_matrix)
plt.title('Dendrogram for Hierarchical Clustering with Complete Linkage')
plt.ylim(0,20)
plt.tight_layout()
plt.show()
The x-axis of the dendrogram represents the samples in the data, and the y-axis represents the distance between them. The higher the point where two branches merge, the more dissimilar those samples/clusters are.
Try also different linkage methods and compare the dendrograms.
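A minimal sketch for that comparison, assuming you want to view a few common linkage methods side by side:
# Optional: dendrograms for several linkage methods
methods = ['complete', 'average', 'ward', 'single']
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, method in zip(axes.ravel(), methods):
    dendrogram(linkage(df_scaled, method=method), ax=ax, no_labels=True)
    ax.set_title(f'{method} linkage')
plt.tight_layout()
plt.show()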
End of practical!