Anastasia Giachanou, Tina Shahedi
Machine Learning with Python - Utrecht Summer School
In this practical, we are going to apply different tree-based algorithms to a regression problem.
Let’s begin by importing the required libraries. We will use scikit-learn to build and evaluate the decision tree regression model. Have a look here if you want to find more information about the packages that implement tree-based methods in sklearn: for decision trees https://scikit-learn.org/stable/modules/tree.html# and for random forests https://scikit-learn.org/stable/modules/ensemble.html
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import mean_squared_error, accuracy_score, confusion_matrix, classification_report
In the first part of this practical, we will build a regression tree with a pre-defined depth. For the regression problem we will use the diabetes dataset that is part of sklearn.datasets to predict disease progression.
There are ten variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) for each of the n = 442 diabetes patients. As output we will use the disease progression one year after baseline.
1. Let's load the dataset first using the function load_diabetes(). We can store it in the variable diabetes.
2. Print the feature_names and the target of the diabetes dataset.
Hint:
The variable diabetes is a dictionary-like object that contains data, target, and additional metadata. Among others, it contains:
- data: numpy array of shape (n_samples, n_features) containing the feature matrix.
- target: numpy array of shape (n_samples,) containing the target variable.
- feature_names: list of feature names.
- DESCR: a description of the dataset.
You can access these attributes using indexing, for example, diabetes['data'] or diabetes.data, diabetes['target'] or diabetes.target, etc.
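A minimal sketch of steps 1 and 2 could look like this:
# Load the diabetes dataset into a dictionary-like Bunch object
diabetes = load_diabetes()

# Inspect the feature names and the target values
print(diabetes.feature_names)
print(diabetes.target)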
Now we want to separate the dataset into features and target variables, which is a standard preprocessing step in machine learning.
3. Create two vectors so that one of them (X_reg) contains the values of the input features (diabetes.data) and the other one (y_reg) contains the target values (diabetes.target).
The features matrix X_reg contains all the independent variables used for prediction, and the target vector y_reg contains the dependent variable which we aim to predict.
4. Now you can split the two vectors that you created before into training and test sets, using 20% of the data as test and using a random_state of 42. The random state is important for the reproducibility of the task. Similar to the previous practicals, you can use the function train_test_split() and store the result in the variables X_train_reg, X_test_reg, y_train_reg, y_test_reg.
We split our data so that part of it is used to train the model and the other part to test it. By setting the random_state parameter to 42, we guarantee that the split is reproducible.
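A possible solution for steps 3 and 4, using only the names defined in this practical:
# Separate the features (X_reg) from the target (y_reg)
X_reg = diabetes.data
y_reg = diabetes.target

# Split into training (80%) and test (20%) sets; the fixed seed makes the split reproducible
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42)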
5. Now initialise the tree regressor with limited depth. Let's use depth = 2. To initialise the tree regressor you can use the DecisionTreeRegressor constructor, and you can store it into a variable called regression_tree.
The max_depth parameter restricts the depth of the tree and prevents overfitting, so that the model does not become too complex and fail to generalise. This step initializes the tree model that will be trained and defines how deep the tree can grow.
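For example (passing random_state is an optional extra here, added so that repeated runs grow the same tree):
# Initialise a regression tree whose depth is limited to 2 levels
regression_tree = DecisionTreeRegressor(max_depth=2, random_state=42)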
6. Now fit the regression tree model on the training data using the fit function. The fit function takes two parameters, the predictors (X_train_reg) and the target (y_train_reg). After that step, you can predict the target variable values using the trained regression tree model on the test data with the predict function. Store the predicted values in a variable called y_pred_reg.
In this process, the decision tree learns to split the data in order to minimize the prediction error. After training, we use the predict method to generate predictions (y_pred_reg) for the test data (X_test_reg). This step allows us to evaluate how well the trained model performs on unseen data.
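One way to carry out this step:
# Fit the tree on the training data
regression_tree.fit(X_train_reg, y_train_reg)

# Predict disease progression for the unseen test data
y_pred_reg = regression_tree.predict(X_test_reg)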
7. Now calculate the mean squared error using the function mean_squared_error
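For instance (the variable name mse_reg is just a suggestion):
# Mean squared error between the true and predicted test targets
mse_reg = mean_squared_error(y_test_reg, y_pred_reg)
print(f"Test MSE: {mse_reg:.2f}")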
8. Visualize the regression tree using the plot_tree function. This function takes as parameters the regression tree and the feature names used for the regression. In addition, you can use the parameter filled if you want to fill the tree with color. After you visualise the tree, try to interpret it (e.g., how many leaves does it contain? what do the numbers in the nodes mean?)
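A sketch of the visualisation (the figure size is only a suggestion):
# Draw the fitted tree with feature names and colour-filled nodes
plt.figure(figsize=(12, 6))
plot_tree(regression_tree, feature_names=diabetes.feature_names, filled=True)
plt.show()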
Here we will build a second regression tree after we optimise some of its parameters.
9. Define the hyperparameter grid for grid search. Define a range of values for maximum depth (3, 5, 7), minimum samples for each split (2, 5, 10), and minimum samples per leaf (1, 2, 4). Name your variable param_grid_reg.
The param_grid_reg is actually a dictionary in which we have the names of the model parameters as keys and a set of values for each of them. For example, if you add the entry 'max_depth': [3, 5, 7] into the dictionary, this means that we will try those 3 values for the model parameter max_depth.
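The full dictionary could then be defined like this:
# Candidate values for the three hyperparameters we want to tune
param_grid_reg = {
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}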
10. We will use GridSearchCV to try different parameter combinations. First, initialize the GridSearchCV and use as estimator the regression tree from question 5. Define the number of folds for cross-validation to be 5. You also have to set the parameter param_grid to the variable param_grid_reg that we created in the previous question. Also set the parameter scoring to neg_mean_squared_error: this means that we're using the negative of the mean squared error as the scoring metric. Name this variable grid_search_reg.
By default, GridSearchCV
uses higher scores to indicate better performance, so using the negative of the mean squared error effectively turns it into a loss function to be minimized.
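Putting those pieces together, the initialisation might look as follows:
# Grid search over the regression tree with 5-fold cross-validation
grid_search_reg = GridSearchCV(estimator=regression_tree,
                               param_grid=param_grid_reg,
                               cv=5,
                               scoring='neg_mean_squared_error')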
Now we fit the grid search to the training data to find the best hyperparameters.
11. Fit the grid_search_reg on the training set.
12a. Print the best combination of hyperparameters using the best_params_
attribute of the GridSearchCV
object.
12b. Generate the predictions on the test set and calculate the MSE. To do the predictions using the best parameters you can use grid_search_reg.predict()
. Is the performance better than the first tree?
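A possible solution for steps 11 and 12 (the name y_pred_best is just a suggestion):
# 11. Run the grid search on the training data
grid_search_reg.fit(X_train_reg, y_train_reg)

# 12a. Best hyperparameter combination found by the search
print(grid_search_reg.best_params_)

# 12b. Predict with the best estimator and compute the test MSE
y_pred_best = grid_search_reg.predict(X_test_reg)
print(mean_squared_error(y_test_reg, y_pred_best))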
Let's now explore how a Random Forest is performing on our data. As a reminder, Random Forest
builds multiple decision trees during training. Each decision tree is trained on a random subset of the training data and a random subset of the features (columns).
When making predictions, each decision tree independently predicts a value for the target variable. The final prediction is then obtained by averaging (for regression) or voting (for classification) the predictions of all the individual trees.
13. The first step is to initialise the random forest regressor using the RandomForestRegressor constructor. Store it into a variable called random_forest_reg.
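For example (random_state is again an optional extra, added for reproducibility):
# Initialise the random forest regressor
random_forest_reg = RandomForestRegressor(random_state=42)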
14. Random forest has a set of parameters that can be optimised. Create a dictionary param_grid_rf_reg and set the number of estimators to (100, 200), maximum depth to (3, 5, 7), minimum samples for each split to (2, 5, 10), and minimum samples per leaf to (1, 2, 4).
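The dictionary could look like this:
# Hyperparameter grid for the random forest
param_grid_rf_reg = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}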
15. Similar to questions 10-12, initialize the grid search CV (you can use cv = 5 and scoring = neg_mean_squared_error
again) for the random forest regressor, fit to the training data and print the best parameters.
This part (when you fit to the training data) can take a few minutes to run. To speed up the process, you can remove some of the parameters in the grid search.
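Analogous to the decision tree, this step might look as follows (the name grid_search_rf_reg is our suggestion, since the practical does not fix one):
# Grid search for the random forest with 5-fold cross-validation
grid_search_rf_reg = GridSearchCV(estimator=random_forest_reg,
                                  param_grid=param_grid_rf_reg,
                                  cv=5,
                                  scoring='neg_mean_squared_error')

# Fit on the training data (this can take a few minutes)
grid_search_rf_reg.fit(X_train_reg, y_train_reg)
print(grid_search_rf_reg.best_params_)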
16. Make predictions and calculate the mean squared error. Compare it with the MSE of the simple and the optimised decision tree
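For example (y_pred_rf is a suggested name, and grid_search_rf_reg is the grid search from step 15):
# Predict with the best random forest and compute the test MSE
y_pred_rf = grid_search_rf_reg.predict(X_test_reg)
print(f"Random forest test MSE: {mean_squared_error(y_test_reg, y_pred_rf):.2f}")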
17. Finally, let's plot the importance of the features of the best random forest. To get the importance of each feature you can use the feature_importances_ attribute (this is the impurity-based importance, calculated as the mean impurity decrease within each tree). This is an attribute of scikit-learn's random forest models (RandomForestRegressor) and represents the importance of each feature in predicting the target variable. To plot the importances, you can use a bar plot.
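One way to plot the impurity-based importances, assuming the fitted grid search from step 15 (best_rf is a suggested name):
# Retrieve the best forest found by the grid search
best_rf = grid_search_rf_reg.best_estimator_

# Bar plot of the impurity-based feature importances
plt.bar(diabetes.feature_names, best_rf.feature_importances_)
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.show()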
Optional: Do the same for the feature importance based on feature permutation. For help look here: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#feature-importance-based-on-feature-permutation
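A sketch for this optional part, using sklearn's permutation_importance on the test set (the extra import is the only new ingredient; perm is a suggested name):
from sklearn.inspection import permutation_importance

# Permutation importance: how much the test score drops when a feature is shuffled
perm = permutation_importance(best_rf, X_test_reg, y_test_reg,
                              n_repeats=10, random_state=42)

# Bar plot of the mean importance over the permutation repeats
plt.bar(diabetes.feature_names, perm.importances_mean)
plt.xlabel('Feature')
plt.ylabel('Mean importance (permutation)')
plt.show()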
End of Practical!