Anastasia Giachanou, Tina Shahedi
Machine Learning with Python - Utrecht Summer School
In this practical, we will analyse the Breast Cancer Wisconsin (Diagnostic) dataset, which is available from the scikit-learn library. This dataset contains data from Fine Needle Aspiration (FNA) of breast masses: it offers a detailed examination of cell nuclei characteristics in digitized images from FNA procedures.
First, we will load the dataset and transform it into a pandas dataframe. Next, we will apply PCA to reduce the dimensionality, and then use clustering to detect groups in the data.
First let's import the necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import dendrogram, linkage
In the first part of this practical, we will focus on the Breast Cancer Wisconsin dataset from scikit-learn. This dataset includes 569 instances with 30 features. These features, such as radius, texture, perimeter, area, and several others, capture various characteristics of the cell nuclei in breast mass images. Each characteristic is quantified in three different ways: the mean, the standard error, and the "worst" value (the mean of the three largest values).
1. Let's load the dataset first using the function load_breast_cancer(). We can store it in the variable data.
2. Print the feature names of the Breast Cancer Dataset.
Hint: The variable data is a dictionary-like object that encompasses the data, the target, and additional metadata. Key components include:
- data: a numpy array of shape (n_samples, n_features), containing the feature matrix.
- target: a numpy array of shape (n_samples,), which holds the target variable.
- feature_names: a list of feature names.
- DESCR: a detailed description of the dataset.
You can access these attributes using dictionary-like indexing or attribute access, such as data.feature_names, data.target_names, and so on. This approach allows you to explore the dataset's structure and understand the features that will be used for analysis.
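A minimal sketch of exercises 1 and 2 might look like this (the variable name data follows the exercise text):

data = load_breast_cancer()   # Bunch object holding data, target and metadata
print(data.feature_names)     # the 30 feature names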
3. Explore the structure and description of the data.
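One possible way to explore it, as a sketch:

print(data.data.shape)     # (569, 30): samples x features
print(data.target_names)   # ['malignant' 'benign']
print(data.DESCR)          # the full dataset description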
4. Convert the dataset into a pandas DataFrame called df. Print the dataframe in a way that shows all the columns.
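A possible sketch; pd.set_option('display.max_columns', None) is one way to make pandas print every column:

df = pd.DataFrame(data.data, columns=data.feature_names)
pd.set_option('display.max_columns', None)   # do not truncate columns when printing
print(df.head())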
The dataset has 30 variables, which means it is impossible to plot all of them in a single plot. One option is to create a two-dimensional plot for every pair of variables. However, that would give us too many plots to compare.
The alternative solution is to use Principal Component Analysis (PCA) to reduce the dimensionality of the data to its first two principal components. In this way, we can generate a 2-D plot.
Feature scaling is the first step for PCA: it transforms the values of the features so that each feature has mean 0 and standard deviation 1.
5. Apply the StandardScaler on the df dataframe. First use the fit function to fit the scaler to the dataframe, then transform its features (you can call the scaler.transform() function for this). Name the standardised result df_scaled.
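A minimal sketch of the scaling step:

scaler = StandardScaler()
scaler.fit(df)                     # learn each feature's mean and standard deviation
df_scaled = scaler.transform(df)   # numpy array with mean 0 and std 1 per column

Note that df_scaled is a numpy array rather than a dataframe; scaler.fit_transform(df) would combine the two calls into one.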
6. Now run PCA on the dataset by calling PCA(), which is part of scikit-learn. First create the PCA instance (you can also set the number of components, but we will not do that here). Then fit the pca object on the dataframe using the fit_transform() function. Print the explained variance ratio of the principal components (the explained_variance_ratio_ attribute). How much variance is explained by the first two components?
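As a sketch, assuming df_scaled from the previous exercise:

pca = PCA()                              # no n_components set, so all 30 are kept
pca_fit = pca.fit_transform(df_scaled)   # sample scores on each principal component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_[:2].sum())   # variance captured by the first two PCs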
Now we will plot the principal components.
7. Visualize the first two PCA components of the dataset. To access the transformed values of the first component you can run pca_fit[:, 0], where pca_fit is the array returned by fit_transform(). For the plot you can use matplotlib.
Note that since this dataset also includes the target outcome, you can color the points by it to see whether the principal components separate the classes. If you use plt.scatter, you can add the parameter c=data.target to the scatter call.
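A sketch of the plot, assuming pca_fit from exercise 6:

plt.figure(figsize=(8, 6))
plt.scatter(pca_fit[:, 0], pca_fit[:, 1], c=data.target)   # color points by the true label
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('First two principal components')
plt.show()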
8. Visualize the explained variance and the cumulative variance of the principal components obtained from PCA. For the cumulative variance, you can use np.cumsum(explained_variance_ratio), where pca.explained_variance_ratio_ gives you the explained variance ratio.
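One way to sketch both curves in a single figure, with a bar per component and a cumulative step line on top:

explained = pca.explained_variance_ratio_
cumulative = np.cumsum(explained)

plt.figure(figsize=(8, 6))
plt.bar(range(1, len(explained) + 1), explained, label='Explained variance')
plt.step(range(1, len(cumulative) + 1), cumulative, where='mid', label='Cumulative variance')
plt.xlabel('Principal component')
plt.ylabel('Explained variance ratio')
plt.legend()
plt.show()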
9. Let's also look at the loadings for PC1 and PC2. To get the loadings you can run pca.components_.T and take the first two columns. pca.components_ has shape [n_components, n_features] and gives you values per component (rows) and feature (columns), so its transpose has one row per feature. From those loadings, extract the first two columns and put them in a dataframe.
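A sketch of building that dataframe (loadings_df is our own name for it):

loadings = pca.components_.T[:, :2]   # shape (30, 2): one row per feature, one column per PC
loadings_df = pd.DataFrame(loadings, columns=['PC1', 'PC2'], index=data.feature_names)
print(loadings_df)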
10. Let's now create a bar plot showing the loadings of every feature for PC1 and PC2. You can use the dataframe that we created in the previous question.
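As a sketch, pandas' own plotting can draw grouped bars directly from loadings_df:

loadings_df.plot(kind='bar', figsize=(12, 6))   # one pair of bars (PC1, PC2) per feature
plt.ylabel('Loading')
plt.title('Loadings per feature for PC1 and PC2')
plt.tight_layout()
plt.show()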
11. Let's also create a scatter plot that shows the loadings of every feature for PC1 and PC2.
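A sketch with matplotlib, annotating each point with its feature name:

plt.figure(figsize=(8, 6))
plt.scatter(loadings_df['PC1'], loadings_df['PC2'])
for feature, (x, y) in loadings_df.iterrows():
    plt.annotate(feature, (x, y), fontsize=8)   # label each point with the feature name
plt.xlabel('PC1 loading')
plt.ylabel('PC2 loading')
plt.show()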
Now we are going to apply some clustering!
12. Implement k-means clustering on our dataset (you can use the scaled version). First, create a KMeans() object and set the n_clusters parameter to 2. Also set n_init=10; this is how many times the algorithm will run with different centroid seeds. Then fit the model on the scaled dataset. Print the centroids of the clusters using the cluster_centers_ attribute.
Note: Look at the documentation for more information on the implementation of k-means (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
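A minimal sketch; random_state is an optional addition of ours, used only for reproducibility:

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(df_scaled)
print(kmeans.cluster_centers_)   # two centroids, one 30-dimensional point per cluster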
13. Visualize the clusters obtained from k-means clustering.
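One possible sketch, coloring the first two scaled features by the cluster assignments (kmeans.labels_), mirroring the reference plot below:

plt.figure(figsize=(8, 6))
plt.scatter(df_scaled[:, 0], df_scaled[:, 1], c=kmeans.labels_)   # color by cluster label
plt.title('K-means Clusters of Breast Cancer Dataset')
plt.show()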
In this example we also have the original labels, so we can see what the original classes look like as well (but usually this is not the case).
colors = np.array(['blue', 'green'])
target_colors = colors[data.target]   # map each label (0 = malignant, 1 = benign) to a color

plt.figure(figsize=(8, 6))
# Plot the first two scaled features, colored by the true diagnosis labels
plt.scatter(df_scaled[:, 0], df_scaled[:, 1], c=target_colors)
plt.title('Original Labels of Breast Cancer Dataset')
plt.show()
14. Let's run k-means again, trying different values for k, from 1 to 10 (for example with range(1, 11)). In every step, append the inertia score (kmeans.inertia_) to a list.
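A sketch of that loop (random_state is again an optional addition):

inertias = []
for k in range(1, 11):   # k = 1, 2, ..., 10
    kmeans_k = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans_k.fit(df_scaled)
    inertias.append(kmeans_k.inertia_)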
15. Determine the optimal number of clusters using the elbow method (i.e., plot the inertia for every k and look for the bend).
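Plotting the stored inertias, as a sketch; the "elbow" is the point where the curve stops dropping steeply:

plt.figure(figsize=(8, 6))
plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.title('Elbow method')
plt.show()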
16. To complement the elbow method, calculate the silhouette score for k = 2.
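A sketch using the silhouette_score imported above (values closer to 1 indicate better-separated clusters):

kmeans_2 = KMeans(n_clusters=2, n_init=10, random_state=42).fit(df_scaled)
print(silhouette_score(df_scaled, kmeans_2.labels_))   # silhouette for k = 2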
17. Now implement hierarchical clustering on the Breast Cancer Wisconsin dataset using scikit-learn. The constructor in this case is AgglomerativeClustering. Use linkage='complete'.
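A minimal sketch; n_clusters=2 is our assumption, matching the two diagnosis classes:

agg = AgglomerativeClustering(n_clusters=2, linkage='complete')
agg_labels = agg.fit_predict(df_scaled)   # one cluster label per sample
print(agg_labels[:10])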
18. Plot the dendrogram using the dendrogram function from scipy.cluster.hierarchy (imported above).
The x-axis of the dendrogram represents the samples in the data, and the y-axis represents the distance between them: the higher the line, the more dissimilar those samples/clusters are.
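AgglomerativeClustering does not expose the tree in a plot-ready form, so a common route is the scipy functions imported above. As a sketch:

linkage_matrix = linkage(df_scaled, method='complete')   # pairwise merge history

plt.figure(figsize=(12, 6))
dendrogram(linkage_matrix)   # truncate_mode='lastp' can help with 569 samples
plt.xlabel('Samples')
plt.ylabel('Distance')
plt.show()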
Try also different linkage methods and compare the dendrograms.
End of practical!