Anastasia Giachanou, Tina Shahedi
Machine Learning with Python - Utrecht Summer School
In this practical, we are going to work on a classification problem and apply different classification algorithms to data about loans.
The problem that we will work on today is automating loan eligibility decisions based on customer details. In this practical we will try Logistic Regression, SVM and KNN, three very popular classification models.
As our first step, we are going to load the necessary libraries.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
from sklearn.metrics import roc_curve, roc_auc_score
First we are going to load the loan_train.csv file that contains our data. It's a dataset from Kaggle provided by Dream Housing Finance company, which specializes in home loans across urban to rural areas.
1. Load the data into a variable called loan_df. Print some basic information about the dataset to understand its content: the first rows, the summary statistics and the column information.
loan_df = pd.read_csv('loan_train.csv')
print("Head of the dataset:")
print(loan_df.head())
print("\nSummary statistics of the dataset:")
print(loan_df.describe())
print("\nInformation about the dataset:")
print(loan_df.info())
Head of the dataset:
Loan_ID Gender Married Dependents Education Self_Employed \
0 LP001002 Male No 0 Graduate No
1 LP001003 Male Yes 1 Graduate No
2 LP001005 Male Yes 0 Graduate Yes
3 LP001006 Male Yes 0 Not Graduate No
4 LP001008 Male No 0 Graduate No
ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term \
0 5849 0.0 NaN 360.0
1 4583 1508.0 128.0 360.0
2 3000 0.0 66.0 360.0
3 2583 2358.0 120.0 360.0
4 6000 0.0 141.0 360.0
Credit_History Property_Area Loan_Status
0 1.0 Urban Y
1 1.0 Rural N
2 1.0 Urban Y
3 1.0 Urban Y
4 1.0 Urban Y
Summary statistics of the dataset:
ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term \
count 614.000000 614.000000 592.000000 600.00000
mean 5403.459283 1621.245798 146.412162 342.00000
std 6109.041673 2926.248369 85.587325 65.12041
min 150.000000 0.000000 9.000000 12.00000
25% 2877.500000 0.000000 100.000000 360.00000
50% 3812.500000 1188.500000 128.000000 360.00000
75% 5795.000000 2297.250000 168.000000 360.00000
max 81000.000000 41667.000000 700.000000 480.00000
Credit_History
count 564.000000
mean 0.842199
std 0.364878
min 0.000000
25% 1.000000
50% 1.000000
75% 1.000000
max 1.000000
Information about the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Loan_ID 614 non-null object
1 Gender 601 non-null object
2 Married 611 non-null object
3 Dependents 599 non-null object
4 Education 614 non-null object
5 Self_Employed 582 non-null object
6 ApplicantIncome 614 non-null int64
7 CoapplicantIncome 614 non-null float64
8 LoanAmount 592 non-null float64
9 Loan_Amount_Term 600 non-null float64
10 Credit_History 564 non-null float64
11 Property_Area 614 non-null object
12 Loan_Status 614 non-null object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB
None
We can see all the different columns of the dataset. There are 13 columns and 614 rows. A detailed description of the columns is presented below:
| Column | Description |
|---|---|
| Loan_ID | Unique loan ID |
| Gender | Male/Female |
| Married | Applicant married (Y/N) |
| Dependents | Number of dependents |
| Education | Applicant education (Graduate/Not Graduate) |
| Self_Employed | Self employed (Y/N) |
| ApplicantIncome | Applicant income |
| CoapplicantIncome | Coapplicant income |
| LoanAmount | Loan amount in thousands |
| Loan_Amount_Term | Term of loan in months |
| Credit_History | Credit history meets guidelines |
| Property_Area | Urban/ Semi Urban/ Rural |
| Loan_Status | Loan approved (Y/N) |
As we can see, not all of the columns are relevant (e.g., the ID of the loan). We can also see that the natural target for classification is Loan_Status. Before we proceed with the classification task, it is important to do some pre-processing. One of the first steps is to drop the rows with NA values and any columns that are not relevant. So, let's have a look at how many nulls we have per field:
2. Check the null counts per variable.
loan_df.isnull().sum()
| Column | Null count |
|---|---|
| Loan_ID | 0 |
| Gender | 13 |
| Married | 3 |
| Dependents | 15 |
| Education | 0 |
| Self_Employed | 32 |
| ApplicantIncome | 0 |
| CoapplicantIncome | 0 |
| LoanAmount | 22 |
| Loan_Amount_Term | 14 |
| Credit_History | 50 |
| Property_Area | 0 |
| Loan_Status | 0 |
From the information above, we see that there are NAs in seven variables.
3. Remove the rows with NA values (dropna()) and drop the Loan_ID column.
loan_df = loan_df.dropna()
loan_df = loan_df.drop('Loan_ID',axis=1)
# Displaying the first few rows of the new dataframe
print(loan_df.head())
print(loan_df.info())
Gender Married Dependents Education Self_Employed ApplicantIncome \
1 Male Yes 1 Graduate No 4583
2 Male Yes 0 Graduate Yes 3000
3 Male Yes 0 Not Graduate No 2583
4 Male No 0 Graduate No 6000
5 Male Yes 2 Graduate Yes 5417
CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History \
1 1508.0 128.0 360.0 1.0
2 0.0 66.0 360.0 1.0
3 2358.0 120.0 360.0 1.0
4 0.0 141.0 360.0 1.0
5 4196.0 267.0 360.0 1.0
Property_Area Loan_Status
1 Rural N
2 Urban Y
3 Urban Y
4 Urban Y
5 Urban Y
<class 'pandas.core.frame.DataFrame'>
Index: 480 entries, 1 to 613
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 480 non-null object
1 Married 480 non-null object
2 Dependents 480 non-null object
3 Education 480 non-null object
4 Self_Employed 480 non-null object
5 ApplicantIncome 480 non-null int64
6 CoapplicantIncome 480 non-null float64
7 LoanAmount 480 non-null float64
8 Loan_Amount_Term 480 non-null float64
9 Credit_History 480 non-null float64
10 Property_Area 480 non-null object
11 Loan_Status 480 non-null object
dtypes: float64(4), int64(1), object(7)
memory usage: 48.8+ KB
None
There are 480 rows left in our data set. Now that all the NULLs are removed, we anticipate a dataset with no missing values. Let's check to confirm:
loan_df.isnull().sum()
| Column | Null count |
|---|---|
| Gender | 0 |
| Married | 0 |
| Dependents | 0 |
| Education | 0 |
| Self_Employed | 0 |
| ApplicantIncome | 0 |
| CoapplicantIncome | 0 |
| LoanAmount | 0 |
| Loan_Amount_Term | 0 |
| Credit_History | 0 |
| Property_Area | 0 |
| Loan_Status | 0 |
4. One of the important steps is to encode the categorical features. Encode the features ['Education', 'Property_Area', 'Loan_Status', 'Gender', 'Married', 'Dependents', 'Self_Employed'] with the help of the LabelEncoder class. This converts categorical data, such as strings or labels, into the numerical values that most algorithms require. You can use the function fit_transform() to do this conversion.
le = LabelEncoder()
cols_to_encode = ['Education', 'Property_Area', 'Loan_Status', 'Gender', 'Married', 'Dependents', 'Self_Employed']
for col in cols_to_encode:
    loan_df[col] = le.fit_transform(loan_df[col])
You can also encode the categorical variables without the for-loop, reusing the le encoder defined above:
loan_df['Education'] = le.fit_transform(loan_df['Education'])
loan_df['Property_Area'] = le.fit_transform(loan_df['Property_Area'])
loan_df['Loan_Status'] = le.fit_transform(loan_df['Loan_Status'])
loan_df['Gender'] = le.fit_transform(loan_df['Gender'])
loan_df['Married'] = le.fit_transform(loan_df['Married'])
loan_df['Dependents'] = le.fit_transform(loan_df['Dependents'])
loan_df['Self_Employed'] = le.fit_transform(loan_df['Self_Employed'])
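Because one encoder object is refit for every column, only the mapping of the last column remains available afterwards. The following is a minimal sketch, assuming a small hypothetical DataFrame (`toy_df`, not the loan data itself), that keeps one encoder per column so that each integer code can be traced back to its original label:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data for illustration only (hypothetical values, not the loan dataset)
toy_df = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

encoders = {}  # one fitted encoder per column
for col in toy_df.columns:
    encoders[col] = LabelEncoder()
    toy_df[col] = encoders[col].fit_transform(toy_df[col])

# classes_ holds the original labels, ordered by their integer code
mappings = {
    col: {str(label): code for code, label in enumerate(enc.classes_)}
    for col, enc in encoders.items()
}
print(mappings)
```

LabelEncoder assigns codes in sorted label order, so in this sketch N becomes 0 and Y becomes 1.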
Let's check the top rows of our training dataset to confirm whether it now contains the numerical values as expected.
loan_df.head()
| | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 0 | 0 | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | 0 | 0 |
| 2 | 1 | 1 | 0 | 0 | 1 | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | 2 | 1 |
| 3 | 1 | 1 | 0 | 1 | 0 | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | 2 | 1 |
| 4 | 1 | 0 | 0 | 0 | 0 | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | 2 | 1 |
| 5 | 1 | 1 | 2 | 0 | 1 | 5417 | 4196.0 | 267.0 | 360.0 | 1.0 | 2 | 1 |
We can see that now the categorical variables have been encoded into a format that can be used by the machine learning model. Another step is to scale the numerical features.
5. Scale the numerical features (['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Credit_History']) with the help of the MinMaxScaler class. This will scale the data to a fixed range between 0 and 1. Print the first 5 rows to see if anything changed.
Most machine learning algorithms do not perform well when input features have very different scales. Distance-based models such as KNN and SVM are especially sensitive, because features with large ranges dominate the distance computation.
# Feature scaling
num_cols = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Credit_History']
scaler = MinMaxScaler()
loan_df[num_cols] = scaler.fit_transform(loan_df[num_cols])
loan_df.head()
| | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 0 | 0 | 0.054830 | 0.044567 | 0.201354 | 360.0 | 1.0 | 0 | 0 |
| 2 | 1 | 1 | 0 | 0 | 1 | 0.035250 | 0.000000 | 0.096447 | 360.0 | 1.0 | 2 | 1 |
| 3 | 1 | 1 | 0 | 1 | 0 | 0.030093 | 0.069687 | 0.187817 | 360.0 | 1.0 | 2 | 1 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0.072356 | 0.000000 | 0.223350 | 360.0 | 1.0 | 2 | 1 |
| 5 | 1 | 1 | 2 | 0 | 1 | 0.065145 | 0.124006 | 0.436548 | 360.0 | 1.0 | 2 | 1 |
Now, we are going to separate the dataset into input features and the target variable, and then split it into training and testing sets.
6. Split the dataset into input features (X) and target variable (y)
X = loan_df.drop('Loan_Status', axis=1)
y = loan_df['Loan_Status']
7. Split the dataset into training and testing sets. Use the train_test_split function for this. Use 20% of the data as test data and random_state = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
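To see what the split produces, here is a short sketch on synthetic stand-in data (`X_demo` and `y_demo` are hypothetical, with the same 480 rows and 11 features as the cleaned dataset). It checks the resulting sizes, shows that a fixed random_state makes the split reproducible, and illustrates the optional stratify argument, which the practical itself does not use:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the cleaned data (480 rows, 11 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(480, 11))
y_demo = rng.integers(0, 2, size=480)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape)

# The same random_state gives an identical split on every run
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
print(np.array_equal(X_te, X_te2))

# Optional: stratify keeps the class proportions nearly equal in both parts
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)
print(y_tr_s.mean(), y_te_s.mean())
```

With 480 rows and test_size=0.2, 96 rows end up in the test set, which is the support total shown in the classification reports.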
For this problem, we will explore three different classification algorithms: Logistic Regression, SVM and KNN.
8. Make a new variable called logistic_reg and assign to it an instance of the LogisticRegression class, which implements this model. You can use max_iter=1000 as an argument.
The max_iter argument sets the maximum number of iterations the optimization algorithm may take when searching for the best coefficients (weights) of the logistic regression model.
logistic_reg = LogisticRegression(max_iter=1000)
9. Fit the model to the training data with the function fit().
logistic_reg.fit(X_train, y_train)
LogisticRegression(max_iter=1000)
Once the model is trained, we can make the predictions.
10. Use the predict function to predict the values of the test set using the logistic regression model you trained
y_pred_logistic = logistic_reg.predict(X_test)
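The predict function returns hard class labels. For a binary logistic regression these correspond to thresholding the predicted probability of class 1 at 0.5. The sketch below illustrates this on synthetic data (`X_demo`, `y_demo` and `model` are hypothetical names, not part of the practical):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

labels = model.predict(X_demo)        # hard class labels (0 or 1)
proba = model.predict_proba(X_demo)   # columns: P(class 0), P(class 1)

# predict() corresponds to thresholding P(class 1) at 0.5
manual_labels = (proba[:, 1] > 0.5).astype(int)
print(np.array_equal(labels, manual_labels))
```

The probabilities matter later on, because the ROC curve is built from them rather than from the hard labels.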
The next step is to evaluate the performance of the logistic regression model.
11. Print the classification report and the confusion matrix by calling the appropriate functions (classification_report() and confusion_matrix()). Also print the accuracy and the F1-score. All of these functions take the actual test labels and the predictions as arguments.
Hint: If we are interested in one particular metric and do not need the whole classification report, there are individual functions to call (e.g., accuracy_score returns the accuracy).
# accuracy_logistic = accuracy_score(y_test, y_pred_logistic)
# print("\nLogistic Regression - Accuracy:", accuracy_logistic)
print("Classification Report:")
print(classification_report(y_test, y_pred_logistic))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_logistic))
print("\n")
Classification Report:
precision recall f1-score support
0 1.00 0.39 0.56 28
1 0.80 1.00 0.89 68
accuracy 0.82 96
macro avg 0.90 0.70 0.73 96
weighted avg 0.86 0.82 0.79 96
Confusion Matrix:
[[11 17]
[ 0 68]]
accuracy_logistic = accuracy_score(y_test, y_pred_logistic)
f1_logistic = f1_score(y_test, y_pred_logistic)
print("\nAccuracy for Logistic Regression:", round(accuracy_logistic, 2))
print("F1-score for Logistic Regression:", round(f1_logistic, 2))
Accuracy for Logistic Regression: 0.82
F1-score for Logistic Regression: 0.89
We notice that the accuracy is 82%: 79 of the 96 test cases (11 + 68) were classified correctly. The F1-score for the positive class (class 1) is 0.89, while the weighted-average F1-score over both classes is 0.79.
The rows of the confusion matrix represent the actual classes (true labels), while the columns represent the predicted classes.
The top-left cell represents the number of true negatives (TN), the top-right cell represents the number of false positives (FP), the bottom-left cell represents the number of false negatives (FN), and the bottom-right cell represents the number of true positives (TP).
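The reported metrics can be recomputed by hand from these four cells. As a worked sketch, we rebuild label arrays that reproduce the logistic regression confusion matrix [[11, 17], [0, 68]] (the arrays `y_true` and `y_pred` are reconstructed for illustration, not the actual test rows) and derive accuracy, precision, recall and F1 for class 1:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Rebuild labels that reproduce the matrix [[11, 17], [0, 68]]
y_true = np.array([0] * 28 + [1] * 68)
y_pred = np.array([0] * 11 + [1] * 17 + [1] * 68)

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
```

The rounded values (0.82, 0.8, 1.0, 0.89) agree with the class 1 row and the accuracy line of the classification report.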
Let's see how other models perform.
The next model that we will try is the SVM model.
12. Create an instance of the Support Vector Classifier (SVC) class using the SVC() constructor and then fit the model to the training data.
Hint: When you create the model, you can use kernel='rbf' as a parameter.
classifier = SVC(kernel='rbf')
classifier.fit(X_train, y_train)
SVC()
13. Generate the predictions on the test data and then print the accuracy and the f1-score. Is this model better than Logistic Regression? Why/Why not?
pred_svm = classifier.predict(X_test)
accuracy_svm = accuracy_score(y_test, pred_svm)
print("Accuracy for SVM:", round(accuracy_svm, 2))
f1_svm = f1_score(y_test, pred_svm)
print("F1-score for SVM:", round(f1_svm, 2))
Accuracy for SVM: 0.71
F1-score for SVM: 0.83
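We can see that the SVM performs worse than Logistic Regression. This suggests that a linear decision boundary is sufficient for this data and that the non-linear RBF boundary brings no benefit here.
Note that the SVM accuracy (0.71) equals the share of class 1 in the test set (68 out of 96), so the model appears to predict the majority class for every case. That points to underfitting rather than to the RBF kernel overfitting complex patterns. One plausible contributor is that Loan_Amount_Term was left unscaled (values up to 480), so it dominates the distances used by the RBF kernel.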
The last model that we will try is KNN. One of the parameters of KNN is the number of neighbors, n_neighbors, and we will optimise it for this model. To optimise any parameter, we first have to create a dictionary that holds the candidate values.
14. Create a dictionary (called param_grid_knn) and store the values of 3, 5, 7, 9 and 11 as the number of neighbors. These are the number of neighbors that we will try out
param_grid_knn = {'n_neighbors': [3, 5, 7, 9, 11]}
15. Call the GridSearchCV constructor to create an object called grid_search_knn. As the first argument, pass an instance of the KNN classifier, KNeighborsClassifier(). The second argument is the param_grid_knn dictionary that we just created. To find the best parameter we apply cross-validation, so set cv=5 and optimise for scoring='f1'.
GridSearchCV is a class from scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) that helps you find the best hyperparameters for your model by automatically trying out combinations and evaluating them using cross-validation.
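Think of it like a smart assistant that tries every candidate value in the grid, scores each one with cross-validation, and keeps the best-performing setting.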
grid_search_knn = GridSearchCV(KNeighborsClassifier(),
param_grid_knn,
cv=5,
scoring='f1')
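What the grid search automates can be written out as an explicit loop. The sketch below uses cross_val_score on synthetic data (`X_demo`, `y_demo` and the other names are hypothetical) to score each candidate value with 5-fold cross-validation and keep the best one. It is a simplified illustration of the idea, not a reproduction of GridSearchCV internals:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

candidate_k = [3, 5, 7, 9, 11]
mean_f1 = {}
for k in candidate_k:
    # 5-fold cross-validation: five F1-scores, one per held-out fold
    fold_scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                  X_demo, y_demo, cv=5, scoring='f1')
    mean_f1[k] = fold_scores.mean()

# Keep the candidate with the highest mean F1-score
best_k = max(mean_f1, key=mean_f1.get)
print(mean_f1)
print("Best n_neighbors:", best_k)
```

With more than one hyperparameter the loop would need to be nested, which is where the grid search object becomes much more convenient.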
16. We have now defined the set up but no model is trained yet. Fit the grid_search_knn to the training data to find the best parameter combination.
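Once we call .fit(), grid_search_knn trains and cross-validates a KNN model for each of the five candidate values of n_neighbors, selects the value with the best mean F1-score, and (by default) refits a final model with that value on the full training set.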
grid_search_knn.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
param_grid={'n_neighbors': [3, 5, 7, 9, 11]}, scoring='f1')
17. Print the best parameters for the KNN using grid_search_knn.best_params_
print("\nBest hyperparameters for KNN:", grid_search_knn.best_params_)
Best hyperparameters for KNN: {'n_neighbors': 7}
We see that the best cross-validated F1-score is obtained when the number of neighbors is set to 7.
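Besides best_params_, a fitted grid search also stores the score of every candidate in cv_results_, which helps to judge how close the alternatives were. A minimal sketch on synthetic data (`X_demo`, `y_demo` and `search` are hypothetical names; the scores will differ from those of the loan data):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(KNeighborsClassifier(),
                      {'n_neighbors': [3, 5, 7, 9, 11]},
                      cv=5, scoring='f1')
search.fit(X_demo, y_demo)

# One row per candidate: mean and spread of the F1-score over the 5 folds
results = pd.DataFrame(search.cv_results_)[
    ['param_n_neighbors', 'mean_test_score', 'std_test_score', 'rank_test_score']]
print(results.sort_values('rank_test_score'))
print(search.best_params_, round(search.best_score_, 3))
```

If the mean scores of several candidates lie within one standard deviation of each other, the choice among them is not very robust.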
18. Generate the predictions on X_test using the grid_search_knn.
y_pred_best_knn = grid_search_knn.predict(X_test)
# best_knn_model = grid_search_knn.best_estimator_
# y_pred_best_knn = best_knn_model.predict(X_test) # Same result
19. Calculate the metrics of accuracy and f1-score. Is your model better or worse compared to Logistic Regression and SVM?
accuracy_best_knn = accuracy_score(y_test, y_pred_best_knn)
f1_best_knn = f1_score(y_test, y_pred_best_knn)
print("Accuracy of KNN with best hyperparameters for KNN:", round(accuracy_best_knn, 2))
print("F1-score of KNN with best hyperparameters for KNN:", round(f1_best_knn, 2))
Accuracy of KNN with best hyperparameters for KNN: 0.76
F1-score of KNN with best hyperparameters for KNN: 0.85
KNN performs worse than Logistic Regression and better than SVM, both in accuracy and in F1-score.
To further evaluate the performance of our K-Nearest Neighbors (KNN) classifier, we will also output the Receiver Operating Characteristic (ROC) curve and the ROC Area Under the Curve (AUC).
ROC is a curve that shows the trade-off between true positive rate (TPR) and false positive rate (FPR) at different classification thresholds. It helps us evaluate how well a classifier can separate the classes (especially for binary classification).
20. Compute the ROC curve and the ROC area under the curve. Let's do that for the KNN using the roc_curve and roc_auc_score functions. Both are meant to be used with predicted probabilities, not class labels, so the first step is to retrieve the probabilities (grid_search_knn.predict_proba(X_test)[:, 1]). Then call the roc_curve() function and finally the roc_auc_score() function.
#print(grid_search_knn.predict_proba(X_test)) #this line returns the probabilities of each case belonging to class 0 and to class 1
y_scores = grid_search_knn.predict_proba(X_test)[:, 1] # return probabilities for class 1
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
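# The AUC summarizes in a single number how well the model separates the classes.
# Pass the predicted probabilities (y_scores), not the hard class labels.
roc_auc = roc_auc_score(y_test, y_scores)
print(roc_auc)
Passing the hard predictions (y_pred_best_knn) instead would give 0.5997899159663866, which is only the balanced accuracy of the thresholded predictions and not the area under the curve computed from y_scores. The observations below quote the AUC as 0.69.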
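When roc_auc_score receives hard 0/1 predictions, the ROC curve has a single interior point and the area reduces to the balanced accuracy, (TPR + TNR) / 2. The sketch below checks this on label arrays reconstructed from the KNN confusion matrix of step 22, [[6, 22], [1, 67]] (`y_true` and `y_hard` are rebuilt for illustration, not the actual test rows):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, balanced_accuracy_score

# Labels that reproduce the KNN confusion matrix [[6, 22], [1, 67]]
y_true = np.array([0] * 28 + [1] * 68)
y_hard = np.array([0] * 6 + [1] * 22 + [0] * 1 + [1] * 67)

# AUC computed from hard labels versus balanced accuracy
auc_from_labels = roc_auc_score(y_true, y_hard)
bal_acc = balanced_accuracy_score(y_true, y_hard)
print(auc_from_labels, bal_acc)
```

Both numbers equal (67/68 + 6/28) / 2, about 0.60, which is the value obtained from the hard predictions. An AUC that reflects the whole curve needs the continuous scores from predict_proba.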
Remember that an AUC of 0.5 corresponds to random guessing and an AUC of 1.0 to a perfect classifier.
21. Plot the ROC curve
# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='red', linestyle='--', lw=2, label='Random Guess')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.show()
Observations
Based on the ROC curve plot, the following observations can be made:
ROC Curve Position: The blue ROC curve is above the red line of random guessing, so the classifier performs better than random.
AUC Value: An AUC of 0.69 means that the classifier's ability to differentiate between the classes is modest. The model captures some useful signal, but it may need improvement (e.g., better features, model tuning, or resampling if classes are imbalanced).
22. Compare the performance of the K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Logistic Regression classifiers on the dataset. Use GridSearchCV for hyperparameter tuning of the KNN classifier and evaluate all models using accuracy and F1-score. What are your conclusions about the models? Do you notice anything unusual in the SVM confusion matrix?
# KNN with GridSearchCV
print("Best KNN Parameters:", grid_search_knn.best_params_)
print("Accuracy for KNN:", round(accuracy_best_knn, 2))
print("F1-score for KNN:", round(f1_best_knn, 2))
# SVM Classifier
accuracy_svm = accuracy_score(y_test, pred_svm)
f1_svm = f1_score(y_test, pred_svm)
print("\nAccuracy for SVM:", round(accuracy_svm, 2))
print("F1-score for SVM:", round(f1_svm, 2))
# Logistic Regression
accuracy_logistic = accuracy_score(y_test, y_pred_logistic)
f1_logistic = f1_score(y_test, y_pred_logistic)
print("\nAccuracy for Logistic Regression:", round(accuracy_logistic, 2))
print("F1-score for Logistic Regression:", round(f1_logistic, 2))
# Print Classification Reports and Confusion Matrices
print("\nKNN Classification Report:")
print(classification_report(y_test, y_pred_best_knn))
print("KNN Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_best_knn))
print("\nSVM Classification Report:")
print(classification_report(y_test, pred_svm))
print("SVM Confusion Matrix:")
print(confusion_matrix(y_test, pred_svm))
print("\nLogistic Regression Classification Report:")
print(classification_report(y_test, y_pred_logistic))
print("Logistic Regression Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_logistic))
Best KNN Parameters: {'n_neighbors': 7}
Accuracy for KNN: 0.76
F1-score for KNN: 0.85
Accuracy for SVM: 0.71
F1-score for SVM: 0.83
Accuracy for Logistic Regression: 0.82
F1-score for Logistic Regression: 0.89
KNN Classification Report:
precision recall f1-score support
0 0.86 0.21 0.34 28
1 0.75 0.99 0.85 68
accuracy 0.76 96
macro avg 0.80 0.60 0.60 96
weighted avg 0.78 0.76 0.70 96
KNN Confusion Matrix:
[[ 6 22]
[ 1 67]]
SVM Classification Report:
precision recall f1-score support
0 0.00 0.00 0.00 28
1 0.71 1.00 0.83 68
accuracy 0.71 96
macro avg 0.35 0.50 0.41 96
weighted avg 0.50 0.71 0.59 96
SVM Confusion Matrix:
[[ 0 28]
[ 0 68]]
Logistic Regression Classification Report:
precision recall f1-score support
0 1.00 0.39 0.56 28
1 0.80 1.00 0.89 68
accuracy 0.82 96
macro avg 0.90 0.70 0.73 96
weighted avg 0.86 0.82 0.79 96
Logistic Regression Confusion Matrix:
[[11 17]
[ 0 68]]
/usr/local/lib/python3.11/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
Here are the key observations on the performance of KNN, SVM, and Logistic Regression:
| Metric | KNN | SVM | Logistic Regression |
|---|---|---|---|
| Accuracy | 76% | 71% | 82% |
| F1-score | 85% | 83% | 89% |
As we can see, Logistic Regression has the highest accuracy and F1-score. The F1-score for KNN looks high (85%), but it is computed for class 1 only: KNN recalls just 21% of class 0 (6 of 28), so the score mostly reflects how often it predicts the majority class.
Looking at the confusion matrix counts, we observe that Logistic Regression correctly classifies all 68 instances of class 1 but misclassifies 17 of the 28 instances of class 0 as class 1.
Another observation that needs attention is that SVM is predicting only one class: class 1 (positive) for every input, regardless of the features.
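A model that always predicts class 1 can be compared with a trivial majority-class baseline. The sketch below uses DummyClassifier on labels with the class counts from the reports (28 of class 0, 68 of class 1). The arrays are reconstructed for illustration, and the baseline is fitted on these same labels only because the sketch has no separate training set:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Test labels with the class counts from the reports: 28 of class 0, 68 of class 1
y_true = np.array([0] * 28 + [1] * 68)
X_placeholder = np.zeros((len(y_true), 1))  # features are ignored by the baseline

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent').fit(X_placeholder, y_true)
y_base = baseline.predict(X_placeholder)

acc_base = accuracy_score(y_true, y_base)
f1_base = f1_score(y_true, y_base)
macro_f1_base = f1_score(y_true, y_base, average='macro', zero_division=0)

print(np.unique(y_base, return_counts=True))
print("Accuracy:", round(acc_base, 2))
print("F1 (class 1):", round(f1_base, 2))
print("Macro F1:", round(macro_f1_base, 2))
```

The baseline reaches the same accuracy (0.71), class 1 F1-score (0.83) and macro F1-score (0.41) as the SVM report, so on this test set the SVM adds nothing over always answering "approved". The macro-averaged F1-score exposes this much more clearly than the class 1 F1-score alone.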
What to do next: