Practical 3: Classification¶

Anastasia Giachanou, Tina Shahedi

Machine Learning with Python - Utrecht Summer School

In this practical, we are going to work on a classification problem, applying different classification algorithms to data about loans.

The problem we will work on today is automating loan eligibility decisions based on customer details. In this practical we will try Logistic Regression, SVM, and KNN, three very popular classification models.

As our first step, we are going to load the necessary libraries.

In [ ]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
from sklearn.metrics import roc_curve, roc_auc_score

Let's get started¶

First we are going to load the loan_train.csv file that contains our data. It's a dataset from Kaggle provided by Dream Housing Finance, a company that specializes in home loans across urban, semi-urban, and rural areas.

1. Load the data into a variable called loan_df. Print some of the basic information of the dataset to understand its content. You can print the summary statistics (describe()), the first rows (head()), and the basic information (info()).
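
If you want to check your approach, a minimal sketch could look like this (assuming the file is in your working directory):

In [ ]:
# Load the data and inspect it (adjust the path if the file lives elsewhere)
loan_df = pd.read_csv('loan_train.csv')

print(loan_df.describe())   # summary statistics of the numerical columns
print(loan_df.head())       # first five rows
loan_df.info()              # column types and non-null counts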

2. Check the null counts per variable.

3. Remove the rows with NA values (dropna()) and drop the Loan_ID column.
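
A sketch of these two cleaning steps:

In [ ]:
# Count missing values per column, then drop incomplete rows and the identifier column
print(loan_df.isnull().sum())

loan_df = loan_df.dropna()
loan_df = loan_df.drop(columns=['Loan_ID'])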

4. One of the important steps is to encode the categorical features. Encode the features ['Education', 'Property_Area', 'Loan_Status', 'Gender', 'Married', 'Dependents', 'Self_Employed'] with the help of the LabelEncoder class. This converts categorical data, such as strings or labels, into numerical values, which most algorithms can process more efficiently. You can use the fit_transform() function to do the conversion.
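
A minimal sketch of the encoding step, using a fresh LabelEncoder for each column (the list name categorical_cols is just a suggestion):

In [ ]:
# Encode each categorical column as integers
categorical_cols = ['Education', 'Property_Area', 'Loan_Status', 'Gender',
                    'Married', 'Dependents', 'Self_Employed']
for col in categorical_cols:
    loan_df[col] = LabelEncoder().fit_transform(loan_df[col])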

5. Scale the numerical features (['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Credit_History']) with the help of the MinMaxScaler class. This scales the data to a fixed range between 0 and 1. Print the first 5 rows to see whether anything changed (a sketch follows the explanation below).

Most machine learning algorithms do not perform well when input features have very different scales. Here's what scaling solves:

  • Imagine these two features: ApplicantIncome ranges from 0 to 100,000, while Credit_History takes only the values 0 or 1. Without scaling, models like k-NN, SVM, logistic regression, and neural networks may give more weight to ApplicantIncome just because its values are larger, even if it is not more important.
  • Algorithms that involve gradient descent (e.g., logistic regression, neural nets) converge faster and more reliably when all features are on a similar scale.
  • Models like K-Nearest Neighbors, K-Means, and SVMs rely on Euclidean distance, which gets distorted if one feature dominates due to its scale.
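
A sketch of the scaling step from exercise 5 (the list name numerical_cols is just a suggestion):

In [ ]:
# Scale the numerical columns to the [0, 1] range
numerical_cols = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Credit_History']
scaler = MinMaxScaler()
loan_df[numerical_cols] = scaler.fit_transform(loan_df[numerical_cols])

loan_df.head()  # the numerical columns should now lie between 0 and 1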

Now, we are going to separate the dataset into input features and the target variable, and then split it into training and testing sets.

Data Splitting¶

6. Split the dataset into input features (X) and target variable (y)

7. Split the dataset into training and testing sets. Use the train_test_split function for this. Use 20% of the data as test data and random_state = 42
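
One way to do both splitting steps, assuming Loan_Status is the target variable:

In [ ]:
# Separate features and target, then hold out 20% of the rows for testing
X = loan_df.drop(columns=['Loan_Status'])
y = loan_df['Loan_Status']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)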

Implementing Classification Algorithms¶

For this problem, we will explore three different Classification algorithms:

  1. Logistic Regression
  2. Support Vector Machine (SVM)
  3. K-Nearest Neighbors (KNN)

Logistic Regression¶

8. Create a variable called logistic_reg and assign it an instance of the LogisticRegression class, which implements this model. You can pass max_iter=1000 as an argument.

The max_iter parameter controls how many iterations the optimization algorithm is allowed to run while searching for the best coefficients (weights) of the logistic regression model.

9. Fit the model to the training data with the function fit().

Once the model is trained, we can make the predictions.

10. Use the predict() function to predict the labels of the test set using the logistic regression model you trained.

The next step is to evaluate the performance of the logistic regression model.

11. Print the classification report and the confusion matrix by calling the appropriate functions (classification_report() and confusion_matrix()). Also print the accuracy and the F1-score.

Hint: If we are interested in one particular metric and do not need the whole classification report, there are individual functions to call (e.g., accuracy_score() returns the accuracy).
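
A sketch covering exercises 8 to 11 (the name y_pred_lr is just a suggestion):

In [ ]:
# Create, train, and evaluate the logistic regression model
logistic_reg = LogisticRegression(max_iter=1000)
logistic_reg.fit(X_train, y_train)

y_pred_lr = logistic_reg.predict(X_test)

print(classification_report(y_test, y_pred_lr))
print(confusion_matrix(y_test, y_pred_lr))
print('Accuracy:', accuracy_score(y_test, y_pred_lr))
print('F1-score:', f1_score(y_test, y_pred_lr))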

The rows of the confusion matrix represent the actual classes (true labels), while the columns represent the predicted classes.

The top-left cell contains the number of true negatives (TN), the top-right cell the false positives (FP), the bottom-left cell the false negatives (FN), and the bottom-right cell the true positives (TP).
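
For a binary problem you can also unpack the four cells directly, which makes this layout explicit (reusing y_pred_lr from the sketch above):

In [ ]:
# Unpack the 2x2 confusion matrix: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_lr).ravel()
print('TN:', tn, 'FP:', fp, 'FN:', fn, 'TP:', tp)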

Let's see how other models perform.

SVM Classification¶

The next model that we will try is the SVM model.

12. Create an instance of the Support Vector Classifier (SVC) class using the SVC() constructor and then fit the model to the training data.

Hint: When you create the model, you can use kernel='rbf' as a parameter.

13. Generate the predictions on the test data and then print the accuracy and the f1-score. Is this model better than Logistic Regression? Why/Why not?
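
A sketch of the SVM steps (the names svm_clf and y_pred_svm are suggestions):

In [ ]:
# Train an SVM with the RBF kernel and evaluate it on the test set
svm_clf = SVC(kernel='rbf')
svm_clf.fit(X_train, y_train)

y_pred_svm = svm_clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred_svm))
print('F1-score:', f1_score(y_test, y_pred_svm))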

KNN¶

The last model that we will try is KNN. One of the parameters of KNN is the number of neighbors. For this model, we will optimise the number of neighbors n_neighbors. To optimise any of the parameters, we first have to create a dictionary that holds the candidate values for the parameters.

14. Create a dictionary (called param_grid_knn) with the key 'n_neighbors' and store the values 3, 5, 7, 9, and 11 as candidates. These are the numbers of neighbors that we will try out.

15. Call the GridSearchCV constructor to create an object called grid_search_knn. As first argument use the constructor of the KNN, that is KNeighborsClassifier(). The second argument is the param_grid_knn dictionary that we just created. To find the best parameter we have to apply cross-validation, so set the parameter cv = 5 and optimise for scoring = 'f1' (a sketch follows the explanation below).

GridSearchCV is a class from scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) that helps you find the best hyperparameters for your model by automatically trying out combinations and evaluating them using cross-validation.

Think of it like a smart assistant that:

  • Tries every combination of settings (from your parameter grid),
  • Trains the model with each setting using cross-validation,
  • Scores each version, and
  • Tells you which combination works best.
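
A sketch of exercises 14 and 15 under these conventions:

In [ ]:
# Candidate values for the number of neighbors, and the grid search object
param_grid_knn = {'n_neighbors': [3, 5, 7, 9, 11]}

grid_search_knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn,
                               cv=5, scoring='f1')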

16. We have now defined the setup, but no model is trained yet. Fit grid_search_knn to the training data to find the best parameter combination.

Once we call .fit(), grid_search_knn:

  • Finds the best hyperparameters (e.g., best n_neighbors)
  • Trains a final model on the entire X_train set using those best params
  • Stores that final model inside itself

17. Print the best parameters for the KNN using grid_search_knn.best_params_

18. Generate the predictions on X_test using the grid_search_knn.

19. Calculate the metrics of accuracy and f1-score. Is your model better or worse compared to Logistic Regression and SVM?
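
A sketch of exercises 16 to 19 (the name y_pred_knn is just a suggestion):

In [ ]:
# Run the grid search, inspect the winning parameters, and evaluate on the test set
grid_search_knn.fit(X_train, y_train)
print('Best parameters:', grid_search_knn.best_params_)

y_pred_knn = grid_search_knn.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred_knn))
print('F1-score:', f1_score(y_test, y_pred_knn))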

To further evaluate the performance of our K-Nearest Neighbors (KNN) classifier, we will also output the Receiver Operating Characteristic (ROC) curve and the ROC Area Under the Curve (AUC).

ROC is a curve that shows the trade-off between true positive rate (TPR) and false positive rate (FPR) at different classification thresholds. It helps us evaluate how well a classifier can separate the classes (especially for binary classification).

20. Compute the ROC curve and the ROC area under the curve. Let's do that for the KNN using the roc_curve and roc_auc_score functions. roc_curve() is typically used with predicted probabilities, not class labels, so the first step is to retrieve the probabilities of the positive class (grid_search_knn.predict_proba(X_test)[:, 1]). Then call the roc_curve() function and finally the roc_auc_score() function.
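
A sketch of this computation (the names y_prob_knn and auc_knn are suggestions):

In [ ]:
# Probabilities of the positive class, then the ROC curve points and the AUC
y_prob_knn = grid_search_knn.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_prob_knn)
auc_knn = roc_auc_score(y_test, y_prob_knn)
print('ROC AUC:', auc_knn)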

Remember that:

  • AUC = 1.0 Perfect model (ideal performance)
  • AUC > 0.5 Better than random (some predictive power)
  • AUC = 0.5 No skill (random guess)
  • AUC < 0.5 Worse than random (model is misleading)

21. Plot the ROC curve. What are your observations from the plot?
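
One way to plot it, reusing fpr, tpr, and auc_knn from the previous step:

In [ ]:
# Plot the ROC curve together with the diagonal of a random classifier
plt.plot(fpr, tpr, label=f'KNN (AUC = {auc_knn:.2f})')
plt.plot([0, 1], [0, 1], linestyle='--', label='Random guess')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve - KNN')
plt.legend()
plt.show()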

22. Compare the performance of the K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Logistic Regression classifiers on the dataset. Use GridSearchCV for hyperparameter tuning of the KNN classifier and evaluate all models using the accuracy and F1-score metrics. What are your conclusions about the models? Do you notice anything unusual in the SVM confusion matrix?

End of Practical!