Practical 3: Classification¶

Anastasia Giachanou, Tina Shahedi

Machine Learning with Python - Utrecht Summer School

In this practical, we are going to work on a classification problem and apply different classification algorithms to data about loans.

The problem that we will work on today is about automating loan eligibility decisions based on customer details. In this practical we will try Logistic Regression, SVM and KNN, three very popular classification models.

As our first step, we are going to load the necessary libraries.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
from sklearn.metrics import roc_curve, roc_auc_score

Let's get started¶

First we are going to load the loan_train.csv file that contains our data. It's a dataset from Kaggle provided by the Dream Housing Finance company, which specializes in home loans across urban, semi-urban, and rural areas.

1. Load the data into a variable called loan_df. Print some basic information about the dataset to understand its content: you can print the summary statistics, the first rows, and the column information.

In [2]:
loan_df = pd.read_csv('loan_train.csv')
In [3]:
print("Head of the dataset:")
print(loan_df.head())

print("\nSummary statistics of the dataset:")
print(loan_df.describe())

print("\nInformation about the dataset:")
print(loan_df.info())
Head of the dataset:
    Loan_ID Gender Married Dependents     Education Self_Employed  \
0  LP001002   Male      No          0      Graduate            No   
1  LP001003   Male     Yes          1      Graduate            No   
2  LP001005   Male     Yes          0      Graduate           Yes   
3  LP001006   Male     Yes          0  Not Graduate            No   
4  LP001008   Male      No          0      Graduate            No   

   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0             5849                0.0         NaN             360.0   
1             4583             1508.0       128.0             360.0   
2             3000                0.0        66.0             360.0   
3             2583             2358.0       120.0             360.0   
4             6000                0.0       141.0             360.0   

   Credit_History Property_Area Loan_Status  
0             1.0         Urban           Y  
1             1.0         Rural           N  
2             1.0         Urban           Y  
3             1.0         Urban           Y  
4             1.0         Urban           Y  

Summary statistics of the dataset:
       ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
count       614.000000         614.000000  592.000000         600.00000   
mean       5403.459283        1621.245798  146.412162         342.00000   
std        6109.041673        2926.248369   85.587325          65.12041   
min         150.000000           0.000000    9.000000          12.00000   
25%        2877.500000           0.000000  100.000000         360.00000   
50%        3812.500000        1188.500000  128.000000         360.00000   
75%        5795.000000        2297.250000  168.000000         360.00000   
max       81000.000000       41667.000000  700.000000         480.00000   

       Credit_History  
count      564.000000  
mean         0.842199  
std          0.364878  
min          0.000000  
25%          1.000000  
50%          1.000000  
75%          1.000000  
max          1.000000  

Information about the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB
None

We can see all the different columns of the dataset. There are 13 columns and 614 rows in the dataset. A detailed description of the columns is presented below:

Column             Description
Loan_ID            Unique Loan ID
Gender             Male / Female
Married            Applicant married (Y/N)
Dependents         Number of dependents
Education          Applicant Education (Graduate / Under Graduate)
Self_Employed      Self employed (Y/N)
ApplicantIncome    Applicant income
CoapplicantIncome  Coapplicant income
LoanAmount         Loan amount in thousands
Loan_Amount_Term   Term of loan in months
Credit_History     Credit history meets guidelines
Property_Area      Urban / Semi Urban / Rural
Loan_Status        Loan approved (Y/N)

As we can see, not all of the columns are relevant for the prediction (e.g., the ID of the loan). The column that will serve as the classification output is Loan_Status. Before we proceed with the classification task, it is important to do some pre-processing. One of the first steps is to drop the rows with NA values and any columns that are not relevant. So, let's have a look at how many nulls we have per field:

2. Check the null counts per variable.

In [4]:
loan_df.isnull().sum()
Out[4]:
Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

From the information above, we see that there are NAs in seven variables.

3. Remove the rows with NA values and drop the Loan_ID column.

In [5]:
loan_df = loan_df.dropna()
loan_df = loan_df.drop('Loan_ID',axis=1)
# Displaying the first few rows of the new dataframe
print(loan_df.head())
print(loan_df.info())
  Gender Married Dependents     Education Self_Employed  ApplicantIncome  \
1   Male     Yes          1      Graduate            No             4583   
2   Male     Yes          0      Graduate           Yes             3000   
3   Male     Yes          0  Not Graduate            No             2583   
4   Male      No          0      Graduate            No             6000   
5   Male     Yes          2      Graduate           Yes             5417   

   CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History  \
1             1508.0       128.0             360.0             1.0   
2                0.0        66.0             360.0             1.0   
3             2358.0       120.0             360.0             1.0   
4                0.0       141.0             360.0             1.0   
5             4196.0       267.0             360.0             1.0   

  Property_Area Loan_Status  
1         Rural           N  
2         Urban           Y  
3         Urban           Y  
4         Urban           Y  
5         Urban           Y  
<class 'pandas.core.frame.DataFrame'>
Index: 480 entries, 1 to 613
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             480 non-null    object 
 1   Married            480 non-null    object 
 2   Dependents         480 non-null    object 
 3   Education          480 non-null    object 
 4   Self_Employed      480 non-null    object 
 5   ApplicantIncome    480 non-null    int64  
 6   CoapplicantIncome  480 non-null    float64
 7   LoanAmount         480 non-null    float64
 8   Loan_Amount_Term   480 non-null    float64
 9   Credit_History     480 non-null    float64
 10  Property_Area      480 non-null    object 
 11  Loan_Status        480 non-null    object 
dtypes: float64(4), int64(1), object(7)
memory usage: 48.8+ KB
None

There are 480 rows left in our data set. Now that all the NULLs are removed, we anticipate a dataset with no missing values. Let's check to confirm:

In [6]:
loan_df.isnull().sum()
Out[6]:
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
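Note that dropping the rows with missing values reduced the dataset from 614 to 480 rows. An alternative that we do not pursue in this practical is to impute the missing values instead, for example with the median for the numeric columns and the most frequent value for the categorical ones. A minimal sketch (loan_imputed is only an illustrative name; it starts again from the raw file, so loan_df is unaffected):

# Sketch of an alternative to dropna(): impute missing values instead of dropping rows.
# We reload the raw file into a separate, illustrative dataframe so loan_df is unaffected.
loan_imputed = pd.read_csv('loan_train.csv')
for col in ['LoanAmount', 'Loan_Amount_Term', 'Credit_History']:
    loan_imputed[col] = loan_imputed[col].fillna(loan_imputed[col].median())
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed']:
    loan_imputed[col] = loan_imputed[col].fillna(loan_imputed[col].mode()[0])
print(loan_imputed.isnull().sum())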

4. One of the important steps is to encode the categorical features. Try to encode the features ['Education', 'Property_Area', 'Loan_Status', 'Gender', 'Married', 'Dependents', 'Self_Employed'] with the help of the LabelEncoder class.

In [7]:
# Encoding categorical variables
label_encoder = LabelEncoder()

categorical_cols = ['Education', 'Property_Area', 'Loan_Status', 'Gender', 'Married', 'Dependents', 'Self_Employed']
loan_df[categorical_cols] = loan_df[categorical_cols].apply(LabelEncoder().fit_transform)

You can also encode categorical variables as shown below:

loan_df['Property_Area'] = label_encoder.fit_transform(loan_df['Property_Area'])
loan_df['Loan_Status'] = label_encoder.fit_transform(loan_df['Loan_Status'])
loan_df['Gender'] = label_encoder.fit_transform(loan_df['Gender'])
loan_df['Married'] = label_encoder.fit_transform(loan_df['Married'])
loan_df['Dependents'] = label_encoder.fit_transform(loan_df['Dependents'])
loan_df['Self_Employed'] = label_encoder.fit_transform(loan_df['Self_Employed'])
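If you want to avoid implying an artificial ordering between the categories of a nominal feature such as Property_Area, one-hot encoding is a common alternative to label encoding. A minimal sketch with pandas' get_dummies (loan_df_ohe is only an illustrative name; ideally you would apply this to the original string column, before label encoding):

# Sketch of an alternative: one-hot encode the nominal Property_Area feature
# so that no ordering between the three areas is implied.
loan_df_ohe = pd.get_dummies(loan_df, columns=['Property_Area'], drop_first=True)
print(loan_df_ohe.head())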

Let's check the top rows of our training dataset to confirm whether it now contains the numerical values as expected.

In [8]:
loan_df.head()
Out[8]:
   Gender  Married  Dependents  Education  Self_Employed  ApplicantIncome  \
1       1        1           1          0              0             4583
2       1        1           0          0              1             3000
3       1        1           0          1              0             2583
4       1        0           0          0              0             6000
5       1        1           2          0              1             5417

   CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History  \
1             1508.0       128.0             360.0             1.0
2                0.0        66.0             360.0             1.0
3             2358.0       120.0             360.0             1.0
4                0.0       141.0             360.0             1.0
5             4196.0       267.0             360.0             1.0

   Property_Area  Loan_Status
1              0            0
2              2            1
3              2            1
4              2            1
5              2            1

We can see that now the categorical variables have been encoded into a format that can be used by the machine learning model. Another step is to scale the numerical features.

5. Scale the numerical features (['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Credit_History']) with the help of the MinMaxScaler class. This will scale the data to a fixed range, between 0 and 1. Print the first 5 rows to see if anything changed.

In [9]:
# Feature scaling
scaler = MinMaxScaler()
loan_df[['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Credit_History']] = scaler.fit_transform(loan_df[['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Credit_History']])
In [10]:
loan_df.head()
Out[10]:
   Gender  Married  Dependents  Education  Self_Employed  ApplicantIncome  \
1       1        1           1          0              0         0.054830
2       1        1           0          0              1         0.035250
3       1        1           0          1              0         0.030093
4       1        0           0          0              0         0.072356
5       1        1           2          0              1         0.065145

   CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History  \
1           0.044567    0.201354             360.0             1.0
2           0.000000    0.096447             360.0             1.0
3           0.069687    0.187817             360.0             1.0
4           0.000000    0.223350             360.0             1.0
5           0.124006    0.436548             360.0             1.0

   Property_Area  Loan_Status
1              0            0
2              2            1
3              2            1
4              2            1
5              2            1
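Min-max scaling maps each value x to (x - min) / (max - min), so every scaled column should now lie between 0 and 1 (Loan_Amount_Term was not scaled and keeps its original values). A quick sanity check, using a small helper list of the scaled column names:

# Quick sanity check: the scaled columns should now range from 0 to 1.
scaled_cols = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Credit_History']
print(loan_df[scaled_cols].min())
print(loan_df[scaled_cols].max())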

Now, we are going to separate the dataset into input features and the target variable, and then split it into training and testing sets.

Data Splitting¶

6. Split the dataset into input features (X) and target variable (y)

In [11]:
X = loan_df.drop('Loan_Status', axis=1)
y = loan_df['Loan_Status']

7. Split the dataset into training and testing sets. Use the train_test_split function for this. Use 20% of the data as test data

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
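With only 480 rows and an imbalanced target (most loans are approved), you may also want the class proportions to be similar in the training and test sets. This can be done by passing stratify=y; a minimal variant of the split above (the _s suffixed names are only illustrative):

# Variant of the split above: stratify on y so both sets keep similar Y/N proportions.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)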

Implementing Classification Algorithms¶

For this problem, we will explore three different Classification algorithms:

  1. Logistic Regression
  2. Support Vector Machine (SVM)
  3. K-Nearest Neighbors (KNN)

Logistic Regression¶

8. Make a new variable called logistic_reg and assign to it an instance of the LogisticRegression class, which implements this model.

In [13]:
logistic_reg = LogisticRegression(max_iter=1000)

9. Fit the model to the training data with the function fit().

In [14]:
logistic_reg.fit(X_train, y_train)
Out[14]:
LogisticRegression(max_iter=1000)

Once the model is trained, we can make the predictions.

10. Use the predict function to predict the values of the test set using the logistic regression model you trained

In [15]:
y_pred_logistic = logistic_reg.predict(X_test)

The next step is to evaluate the performance of the logistic regression model.

11. Print the classification report and the confusion matrix by calling the appropriate functions (classification_report() and confusion_matrix()). Print also the accuracy and the f1-score.

Hint: If we are interested in one particular metric and we do not need the whole classification report, there are individual functions we can call (e.g., accuracy_score returns the accuracy).

In [16]:
# accuracy_logistic = accuracy_score(y_test, y_pred_logistic)
# print("\nLogistic Regression - Accuracy:", accuracy_logistic)
print("Classification Report:")
print(classification_report(y_test, y_pred_logistic))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_logistic))
print("\n")
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.39      0.56        28
           1       0.80      1.00      0.89        68

    accuracy                           0.82        96
   macro avg       0.90      0.70      0.73        96
weighted avg       0.86      0.82      0.79        96

Confusion Matrix:
[[11 17]
 [ 0 68]]


In [17]:
accuracy_logistic = accuracy_score(y_test, y_pred_logistic)
f1_logistic = f1_score(y_test, y_pred_logistic)

print("\nAccuracy for Logistic Regression:", accuracy_logistic)
print("F1-score for Logistic Regression:", f1_logistic)
Accuracy for Logistic Regression: 0.8229166666666666
F1-score for Logistic Regression: 0.888888888888889

We notice that the accuracy is 82%, which means that 82 out of 100 cases were classified correctly. We also notice that the f1-score, computed here for the positive class (approved loans), is 0.89.

The rows of the confusion matrix represent the actual classes (true labels), while the columns represent the predicted classes.

The top-left cell represents the number of true negatives (TN), the top-right cell the number of false positives (FP), the bottom-left cell the number of false negatives (FN), and the bottom-right cell the number of true positives (TP).
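You can also unpack these four counts directly from the confusion matrix; a small sketch for the logistic regression predictions:

# Unpack the confusion matrix cells: rows are true labels, columns are predictions.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_logistic).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)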

Let's see how other models perform.

SVM Classification¶

The next model that we will try is the SVM model.

12. Create an instance of the Support Vector Classifier (SVC) class using the SVC() constructor and then fit the model to the training data.

Hint: When you create the model, you can use kernel='rbf' as a parameter.

In [18]:
classifier = SVC(kernel='rbf')
classifier.fit(X_train, y_train)
Out[18]:
SVC()

13. Generate the predictions on the test data and then print the accuracy and the f1-score

In [19]:
pred_svm = classifier.predict(X_test)
accuracy_svm = accuracy_score(y_test, pred_svm)
print("Accuracy for SVM:", accuracy_svm)
f1_svm = f1_score(y_test, pred_svm)
print("F1-score for SVM:", f1_svm)
Accuracy for SVM: 0.7083333333333334
F1-score for SVM: 0.8292682926829268

We can see that SVM performs worse than Logistic Regression.
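One possible reason is that we used the SVC with its default hyperparameters. Tuning C and gamma with cross-validation, as we will do for KNN below, might improve it. A minimal sketch (param_grid_svm and grid_search_svm are illustrative names, and the value ranges are only examples):

# Sketch: tune the RBF SVM's C and gamma with 5-fold cross-validation on the f1 score.
param_grid_svm = {'C': [0.1, 1, 10, 100], 'gamma': ['scale', 0.01, 0.1, 1]}
grid_search_svm = GridSearchCV(SVC(kernel='rbf'), param_grid_svm, cv=5, scoring='f1')
grid_search_svm.fit(X_train, y_train)
print(grid_search_svm.best_params_)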

KNN¶

The last model that we will try is KNN. One of the parameters of KNN is the number of neighbors. For this model, we will optimise the number of neighbors n_neighbors. To optimise any of the parameters, we first have to create a dictionary that will hold the candidate values for the parameters.

14. Create a param_grid_knn dictionary and store the values of 3, 5, 7, 9 and 11 as the number of neighbors.

In [20]:
param_grid_knn = {'n_neighbors': [3, 5, 7, 9, 11]}

15. Call the GridSearchCV constructor to create an object grid_search_knn. As the first argument use the constructor of the KNN, that is KNeighborsClassifier(). The second argument is the param_grid_knn dictionary. To find the best parameter we have to apply cross-validation: use cv=5 and optimise for scoring='f1'.

In [21]:
grid_search_knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn, cv=5, scoring='f1')

16. Fit grid_search_knn to the training data. After fitting, it is refit on the whole training set with the best parameter found, so it can be used directly as the tuned model.

In [22]:
grid_search_knn.fit(X_train, y_train)
Out[22]:
GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': [3, 5, 7, 9, 11]}, scoring='f1')

17. Print the best parameters for the KNN using grid_search_knn.best_params_

In [23]:
print("\nBest hyperparameters for KNN:", grid_search_knn.best_params_)
Best hyperparameters for KNN: {'n_neighbors': 7}

We see that the best performance is obtained when the number of neighbors is set to 7.
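If you want to see how each candidate value performed during cross-validation, the fitted grid search stores the scores in its cv_results_ attribute; for example:

# Mean cross-validated f1 score for each candidate number of neighbors.
for params, score in zip(grid_search_knn.cv_results_['params'],
                         grid_search_knn.cv_results_['mean_test_score']):
    print(params, "-> mean CV f1: %.3f" % score)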

18. Generate the predictions of the grid_search_knn.

In [24]:
y_pred_best_knn = grid_search_knn.predict(X_test)

19. Calculate the metrics of accuracy and f1-score

In [25]:
accuracy_best_knn = accuracy_score(y_test, y_pred_best_knn)
f1_best_knn = f1_score(y_test, y_pred_best_knn)
print("Accuracy of KNN with best hyperparameters for KNN:", accuracy_best_knn)
print("F1-score of KNN with best hyperparameters for KNN:", f1_best_knn)
Accuracy of KNN with best hyperparameters for KNN: 0.7604166666666666
F1-score of KNN with best hyperparameters for KNN: 0.8535031847133758

To further evaluate the performance of our K-Nearest Neighbors (KNN) classifier, we will also output the Receiver Operating Characteristic (ROC) curve and the ROC Area Under the Curve (AUC).

20. Compute the ROC curve and ROC area under the curve. Let's do that for the KNN using the roc_curve and the roc_auc_score functions.

In [26]:
# Compute ROC curve and ROC area under the curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_best_knn)
roc_auc = roc_auc_score(y_test, y_pred_best_knn)
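Note that y_pred_best_knn contains hard 0/1 labels, so the ROC curve computed from it has only a single operating point between the two corners. A more informative curve can be obtained from the predicted probabilities of class 1; a minimal sketch (y_proba_knn is an illustrative name):

# Sketch: use class-1 probabilities instead of hard labels for a smoother ROC curve.
y_proba_knn = grid_search_knn.predict_proba(X_test)[:, 1]
fpr_p, tpr_p, thresholds_p = roc_curve(y_test, y_proba_knn)
print("AUC from probabilities:", roc_auc_score(y_test, y_proba_knn))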

21. Plot the ROC curve

In [27]:
# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='red', linestyle='--', lw=2, label='Random Guess')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.show()

Observations

Based on the ROC curve plot, the following observations can be made:

  1. ROC Curve Position: The blue ROC curve lies above the red line of random guessing, so the classifier performs better than random.

  2. AUC Value: The AUC of 0.60 means that the classifier's ability to differentiate between the classes is modest.

22. Compare the performance of K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Logistic Regression classifiers on the dataset. Use GridSearchCV for hyperparameter tuning for the KNN classifier and evaluate all models using accuracy and F1-score metrics.

In [28]:
# KNN with GridSearchCV
print("Best KNN Parameters:", grid_search_knn.best_params_)
print("Accuracy for KNN:", accuracy_best_knn)
print("F1-score for KNN:", f1_best_knn)

# SVM Classifier
accuracy_svm = accuracy_score(y_test, pred_svm)
f1_svm = f1_score(y_test, pred_svm)

print("\nAccuracy for SVM:", accuracy_svm)
print("F1-score for SVM:", f1_svm)

# Logistic Regression
accuracy_logistic = accuracy_score(y_test, y_pred_logistic)
f1_logistic = f1_score(y_test, y_pred_logistic)

print("\nAccuracy for Logistic Regression:", accuracy_logistic)
print("F1-score for Logistic Regression:", f1_logistic)

# Print Classification Reports and Confusion Matrices
print("\nKNN Classification Report:")
print(classification_report(y_test, y_pred_best_knn))
print("KNN Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_best_knn))

print("\nSVM Classification Report:")
print(classification_report(y_test, pred_svm))
print("SVM Confusion Matrix:")
print(confusion_matrix(y_test, pred_svm))

print("\nLogistic Regression Classification Report:")
print(classification_report(y_test, y_pred_logistic))
print("Logistic Regression Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_logistic))
Best KNN Parameters: {'n_neighbors': 7}
Accuracy for KNN: 0.7604166666666666
F1-score for KNN: 0.8535031847133758

Accuracy for SVM: 0.7083333333333334
F1-score for SVM: 0.8292682926829268

Accuracy for Logistic Regression: 0.8229166666666666
F1-score for Logistic Regression: 0.888888888888889

KNN Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.21      0.34        28
           1       0.75      0.99      0.85        68

    accuracy                           0.76        96
   macro avg       0.80      0.60      0.60        96
weighted avg       0.78      0.76      0.70        96

KNN Confusion Matrix:
[[ 6 22]
 [ 1 67]]

SVM Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        28
           1       0.71      1.00      0.83        68

    accuracy                           0.71        96
   macro avg       0.35      0.50      0.41        96
weighted avg       0.50      0.71      0.59        96

SVM Confusion Matrix:
[[ 0 28]
 [ 0 68]]

Logistic Regression Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.39      0.56        28
           1       0.80      1.00      0.89        68

    accuracy                           0.82        96
   macro avg       0.90      0.70      0.73        96
weighted avg       0.86      0.82      0.79        96

Logistic Regression Confusion Matrix:
[[11 17]
 [ 0 68]]
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

Here are the key observations on the performance of KNN, SVM, and Logistic Regression:

Metric      KNN       SVM       Logistic Regression
Accuracy    76.04%    70.83%    82.29%
F1-score    85.35%    82.92%    88.89%

As we can see, Logistic Regression has the highest performance. The F1-score for KNN is quite high (85.35%), indicating that it balances precision and recall well; however, its accuracy (76.04%) is lower than that of Logistic Regression.

Looking at the confusion matrices, KNN correctly classifies 67 instances of Class 1 but misclassifies 22 instances of Class 0 as Class 1, while Logistic Regression correctly classifies all 68 instances of Class 1 but misclassifies 17 instances of Class 0 as Class 1.
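If you prefer to read the comparison off a single table generated in code, you could also collect the headline metrics computed above into a small dataframe; a minimal sketch (results is an illustrative name):

# Collect the headline metrics in one dataframe for a side-by-side comparison.
results = pd.DataFrame({
    'Model': ['KNN (tuned)', 'SVM (rbf)', 'Logistic Regression'],
    'Accuracy': [accuracy_best_knn, accuracy_svm, accuracy_logistic],
    'F1-score': [f1_best_knn, f1_svm, f1_logistic],
})
print(results)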