Practical: Tree based models¶

Anastasia Giachanou, Tina Shahedi

Machine Learning with Python - Utrecht Summer School

In this practical, we are going to apply different tree-based algorithms to a regression problem.

Let’s begin by importing the required libraries. We will use scikit-learn to build and evaluate the decision tree regression model.

Have a look at the following pages if you want to find more information about the packages that implement tree-based methods in sklearn: for decision trees https://scikit-learn.org/stable/modules/tree.html# and for random forests https://scikit-learn.org/stable/modules/ensemble.html

In [ ]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import mean_squared_error, accuracy_score, confusion_matrix, classification_report

Let's get started¶

Regression tree with limited depth¶

In the first part of this practical, we will build a regression tree with a pre-defined depth. For the regression problem we will use the diabetes dataset that is part of sklearn.datasets to predict the disease progression.

There are ten variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) for each of n = 442 diabetes patients. As output we will use the disease progression one year after baseline.

1. Let's load the dataset first using the function load_diabetes(). We can store it in the variable diabetes

2. Print the feature_names and the target of the diabetes dataset

Hint: The variable diabetes is a dictionary-like object that contains data, target, and additional metadata.

Among others, it contains:

  • data: numpy array of shape (n_samples, n_features) containing the feature matrix.
  • target: numpy array of shape (n_samples,) containing the target variable.
  • feature_names: list of feature names.
  • DESCR: a description of the dataset.

You can access these attributes using indexing, for example, diabetes['data'] or diabetes.data, diabetes['target'] or diabetes.target, etc.
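If you get stuck, a minimal sketch for questions 1-2 might look like this:

In [ ]:
# Load the diabetes dataset and inspect its contents
diabetes = load_diabetes()
print(diabetes.feature_names)  # list of the ten feature names
print(diabetes.target)         # disease progression one year after baseline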

Now we want to separate the dataset into features and target variables, which is a standard preprocessing step in machine learning.

3. Create two vectors so that one of them (X_reg) contains the values of the input features (diabetes.data) and the other one (y_reg) contains the target values (diabetes.target)

The feature matrix X_reg contains all the independent variables used for prediction, and the target vector y_reg contains the dependent variable which we aim to predict.
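A possible sketch for question 3:

In [ ]:
# Separate the dataset into the feature matrix and the target vector
X_reg = diabetes.data    # shape (442, 10)
y_reg = diabetes.target  # shape (442,)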

4. Now you can split the two vectors that you created before into training and test sets, using 20% of the data as the test set and using a random_state of 42. The random state is important for the reproducibility of the task. Similar to the previous practicals, you can use the function train_test_split() and store the results into the variables X_train_reg, X_test_reg, y_train_reg, y_test_reg

We split our data so a part of it is used to train the model and the other part to test it. By setting the random_state parameter to 42, we guarantee that the split is reproducible.
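One way to do the split (question 4):

In [ ]:
# Hold out 20% of the data as a test set; random_state makes the split reproducible
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42)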

5. Now initialise the tree regressor with limited depth. Let's use max_depth = 2. To initialise the tree regressor you can use the DecisionTreeRegressor and you can store it into a variable called regression_tree

The max_depth parameter restricts the depth of the tree and helps prevent overfitting: without it, the model can become too complex and fail to generalise. This step initialises the tree model that will be trained and defines how deep the tree can grow.
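A sketch for question 5:

In [ ]:
# Initialise a regression tree that can grow at most two levels deep
regression_tree = DecisionTreeRegressor(max_depth=2)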

6. Now fit the regression tree model on the training data using the fit function. The fit function takes two parameters, the predictors (X_train_reg) and the target (y_train_reg). After that step, you can predict the target variable values using the trained regression tree model on the test data with the predict function. Store the predicted values in a variable called y_pred_reg

In this process, the decision tree learns to split the data in order to minimize the prediction error.

After training, we use the predict method to generate predictions (y_pred_reg) for the test data (X_test_reg). This step allows us to evaluate how well the trained model performs on unseen data.
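A possible sketch for question 6:

In [ ]:
# Fit the tree on the training data, then predict on the unseen test data
regression_tree.fit(X_train_reg, y_train_reg)
y_pred_reg = regression_tree.predict(X_test_reg)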

7. Now calculate the mean squared error using the function mean_squared_error
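For question 7, something along these lines should work (mse_reg is our own variable name):

In [ ]:
# Mean squared error between the true and predicted disease progression
mse_reg = mean_squared_error(y_test_reg, y_pred_reg)
print(mse_reg)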

8. Visualize the regression tree using the plot_tree function. This function takes as parameters the regression tree and the feature names used for the regression. In addition you can use the parameter filled if you want to fill the tree with color. After you visualise the tree, try to interpret it (e.g., how many leaves does it contain? what do the numbers in the nodes mean?)
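A sketch for question 8 (the figure size is just a suggestion):

In [ ]:
# Visualise the fitted tree; filled=True colours the nodes by predicted value
plt.figure(figsize=(12, 6))
plot_tree(regression_tree, feature_names=diabetes.feature_names, filled=True)
plt.show()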

Optimise the parameters of the regression tree¶

Here we will build a second regression tree after we optimise some of its parameters.

9. Define the hyperparameter grid for the grid search. Define a range of values for the maximum depth (3, 5, 7), the minimum samples for each split (2, 5, 10) and the minimum samples per leaf (1, 2, 4). Name your variable param_grid_reg

The param_grid_reg is actually a dictionary in which we can have the names of the model parameters as keys and a set of values for each of them. For example if you add the following entry into the dictionary:

'max_depth': [3, 5, 7]

then this means that we will try those 3 values for the model parameter max_depth.
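A possible grid for question 9:

In [ ]:
# Candidate values for three hyperparameters of the regression tree
param_grid_reg = {
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}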

10. We will use GridSearchCV to try different parameter combinations. First, initialize the GridSearchCV and use as estimator the regression tree from question 5. Define the number of folds for cross-validation to be 5. You also have to set the parameter param_grid to the variable param_grid_reg that we created in the previous question, and set the parameter scoring to 'neg_mean_squared_error'. This means that we're using the negative of the mean squared error as the scoring metric. Name this variable grid_search_reg

By default, GridSearchCV uses higher scores to indicate better performance, so using the negative of the mean squared error effectively turns it into a loss function to be minimized.
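A sketch for question 10:

In [ ]:
# 5-fold cross-validated grid search over the parameter grid defined above
grid_search_reg = GridSearchCV(estimator=regression_tree,
                               param_grid=param_grid_reg,
                               cv=5,
                               scoring='neg_mean_squared_error')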

Now we fit the grid search to the training data to find the best hyperparameters.

11. Fit the grid_search_reg on the training set

12a. Print the best combination of hyperparameters using the best_params_ attribute of the GridSearchCV object.

12b. Generate the predictions on the test set and calculate the MSE. To do the predictions using the best parameters you can use grid_search_reg.predict(). Is the performance better than the first tree?
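A possible sketch for questions 11-12 (y_pred_best is our own variable name):

In [ ]:
# Run the grid search, inspect the winning parameters, and evaluate on the test set
grid_search_reg.fit(X_train_reg, y_train_reg)
print(grid_search_reg.best_params_)
y_pred_best = grid_search_reg.predict(X_test_reg)  # uses the refitted best tree
print(mean_squared_error(y_test_reg, y_pred_best))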

Random Forest Regressor¶

Let's now explore how a Random Forest performs on our data. As a reminder, Random Forest builds multiple decision trees during training. Each decision tree is trained on a random subset of the training data, and at each split only a random subset of the features (columns) is considered.

When making predictions, each decision tree independently predicts a value for the target variable. The final prediction is then obtained by averaging (for regression) or voting (for classification) the predictions of all the individual trees.
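To make the averaging concrete, here is a small illustration (not part of the exercises): a fitted RandomForestRegressor stores its individual trees in the estimators_ attribute, and its prediction equals the mean of their predictions.

In [ ]:
# Illustration only: a forest's prediction is the average of its trees' predictions
rf_demo = RandomForestRegressor(n_estimators=10, random_state=42)
rf_demo.fit(X_train_reg, y_train_reg)
tree_preds = np.stack([tree.predict(X_test_reg) for tree in rf_demo.estimators_])
print(np.allclose(tree_preds.mean(axis=0), rf_demo.predict(X_test_reg)))  # True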

13. The first step is to initialise the random forest regressor using the RandomForestRegressor constructor. Store it in a variable called random_forest_reg
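A sketch for question 13 (the random_state is our own addition, for reproducibility):

In [ ]:
# Initialise the random forest regressor
random_forest_reg = RandomForestRegressor(random_state=42)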

14. Random forest has a set of parameters that can be optimised. Create a dictionary param_grid_rf_reg and set the number of estimators to (100, 200), the maximum depth to (3, 5, 7), the minimum samples for each split to (2, 5, 10) and the minimum samples per leaf to (1, 2, 4)
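A possible grid for question 14:

In [ ]:
# Candidate values for four hyperparameters of the random forest
param_grid_rf_reg = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}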

15. Similar to questions 10-12, initialize the grid search CV (you can use cv = 5 and scoring = 'neg_mean_squared_error' again) for the random forest regressor, fit it to the training data and print the best parameters.

This part (when you fit to the training data) can take a few minutes to run. To speed up the process, you can remove some of the parameters in the grid search.
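A sketch for question 15 (grid_search_rf_reg is our own variable name, and n_jobs=-1 is an optional addition that uses all CPU cores):

In [ ]:
# Cross-validated grid search for the random forest; this can take a few minutes
grid_search_rf_reg = GridSearchCV(estimator=random_forest_reg,
                                  param_grid=param_grid_rf_reg,
                                  cv=5,
                                  scoring='neg_mean_squared_error',
                                  n_jobs=-1)  # optional: parallelise across cores
grid_search_rf_reg.fit(X_train_reg, y_train_reg)
print(grid_search_rf_reg.best_params_)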

16. Make predictions and calculate the mean squared error. Compare it with the MSE of the simple and the optimised decision tree
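A possible sketch for question 16 (y_pred_rf is our own variable name):

In [ ]:
# Evaluate the tuned forest on the test set and compare with the earlier trees
y_pred_rf = grid_search_rf_reg.predict(X_test_reg)
print(mean_squared_error(y_test_reg, y_pred_rf))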

17. Finally, let's plot the feature importances of the best random forest. To get the importance of each feature you can use the feature_importances_ attribute (this is the impurity-based importance, calculated as the mean impurity decrease within each tree). This attribute of scikit-learn's random forest models (RandomForestRegressor) represents the importance of each feature in predicting the target variable. To plot the importances, you can use a bar plot.
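A sketch for question 17, assuming the grid search from question 15 (best_estimator_ is the refitted best model):

In [ ]:
# Bar plot of the impurity-based feature importances of the best forest
best_rf = grid_search_rf_reg.best_estimator_
plt.bar(diabetes.feature_names, best_rf.feature_importances_)
plt.xlabel('Feature')
plt.ylabel('Impurity-based importance')
plt.show()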

Optional: Do the same for the feature importance based on feature permutation. For help look here: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#feature-importance-based-on-feature-permutation
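For the optional part, a sketch using sklearn.inspection.permutation_importance (n_repeats=10 follows the linked example):

In [ ]:
from sklearn.inspection import permutation_importance

# Permutation importance: the drop in score when a feature's values are shuffled
result = permutation_importance(best_rf, X_test_reg, y_test_reg,
                                n_repeats=10, random_state=42)
plt.bar(diabetes.feature_names, result.importances_mean)
plt.ylabel('Mean decrease in score')
plt.show()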

End of Practical!