Anastasia Giachanou, Tina Shahedi
Machine Learning with Python - Utrecht Summer School
In our first practical, we explored our data and made some transformations, and we used Matplotlib and Seaborn to visualise the data. In this practical, we continue with the California Housing dataset and practice with Linear Regression.
Our goal is to develop a machine learning model that predicts house prices from a set of input features. Along the way, you will practise splitting and scaling data, fitting a linear regression model, evaluating it, and applying Lasso and Ridge regularization.
First, install the necessary libraries (numpy, pandas, scikit-learn, matplotlib, seaborn) with !pip install if they are not yet installed, and then import them at the start of your script.
!pip install -q numpy
!pip install -q pandas
!pip install -q scikit-learn
!pip install -q matplotlib
!pip install -q seaborn
# Importing necessary libraries
import numpy as np
import pandas as pd
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
from scipy.stats import iqr
We also import the specific functions and classes that we will need. This way, we can call them directly without prefixing them with the package name.
# Importing necessary functions
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.linear_model import LassoCV, RidgeCV
from scipy.stats import probplot
As you can see from the packages that we imported, one of them is scikit-learn (https://scikit-learn.org/stable/). scikit-learn (also known as sklearn) is a free and open-source machine learning library for Python. It includes implementations of various classification, regression and clustering algorithms, including linear regression, support-vector machines, random forests, gradient boosting and k-means, and it is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
We begin our model development by loading the 'houses_cleaned' dataset. This is the dataset after we did some preprocessing steps in the first practical.
houses_cleaned = pd.read_csv('houses_cleaned.csv')
houses_cleaned.head()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | rooms_per_household | log_median_income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.000 | 129.0 | 322.0 | 126.0 | 8.013025 | 452600.0 | near_bay | 6.984127 | 2.198671 |
| 1 | -122.22 | 37.86 | 21.0 | 5698.375 | 1106.0 | 2401.0 | 1092.5 | 8.013025 | 358500.0 | near_bay | 6.238137 | 2.198671 |
| 2 | -122.24 | 37.85 | 52.0 | 1467.000 | 190.0 | 496.0 | 177.0 | 7.257400 | 352100.0 | near_bay | 8.288136 | 2.111110 |
| 3 | -122.25 | 37.85 | 52.0 | 1274.000 | 235.0 | 558.0 | 219.0 | 5.643100 | 341300.0 | near_bay | 5.817352 | 1.893579 |
| 4 | -122.25 | 37.85 | 52.0 | 1627.000 | 280.0 | 565.0 | 259.0 | 3.846200 | 342200.0 | near_bay | 6.281853 | 1.578195 |
houses_cleaned.tail()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | rooms_per_household | log_median_income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20630 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 | inland | 5.045455 | 0.940124 |
| 20631 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 | inland | 6.114035 | 1.268861 |
| 20632 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 | 92300.0 | inland | 5.205543 | 0.993252 |
| 20633 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | inland | 5.329513 | 1.053336 |
| 20634 | -121.24 | 39.37 | 16.0 | 2785.0 | 616.0 | 1387.0 | 530.0 | 2.3886 | 89400.0 | inland | 5.254717 | 1.220417 |
houses_cleaned.describe()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | rooms_per_household | log_median_income |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 20635.000000 | 20635.000000 | 20635.000000 | 20635.000000 | 20635.000000 | 20635.000000 | 20635.000000 | 20635.000000 | 20635.000000 | 20635.000000 | 20635.000000 |
| mean | -119.569999 | 35.632412 | 28.636152 | 2441.902575 | 502.689858 | 1337.121105 | 469.066731 | 3.801266 | 205938.952435 | 5.304655 | 1.510583 |
| std | 2.003685 | 2.135918 | 12.583924 | 1397.859491 | 287.261212 | 765.561218 | 265.518130 | 1.657765 | 113192.930024 | 1.246199 | 0.342970 |
| min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | -112.080009 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 | 2.023219 | 0.405398 |
| 25% | -121.800000 | 33.930000 | 18.000000 | 1448.000000 | 296.000000 | 787.000000 | 280.000000 | 2.563100 | 119600.000000 | 4.440684 | 1.270631 |
| 50% | -118.500000 | 34.260000 | 29.000000 | 2127.000000 | 435.000000 | 1166.000000 | 409.000000 | 3.535200 | 179700.000000 | 5.229091 | 1.511869 |
| 75% | -118.010000 | 37.710000 | 37.000000 | 3148.000000 | 647.000000 | 1725.000000 | 605.000000 | 4.743700 | 264700.000000 | 6.052257 | 1.748104 |
| max | -114.310000 | 41.950000 | 52.000000 | 5698.375000 | 1173.500000 | 3132.000000 | 1092.500000 | 8.013025 | 482412.500000 | 8.469878 | 2.198671 |
This initial view helps us understand what types of columns (features) the dataset contains, such as continuous numerical values and categorical data. The only remaining preprocessing step is to encode the categorical features into a numeric format.
In our dataset, the ocean_proximity feature is categorical and needs to be encoded into a numeric format. Since there is no ordinal relationship between the different types of locations, such as 'near_bay' and 'inland', we can use One-Hot Encoding, which creates a separate binary column for each category. In this case, we'll have separate columns for 'near_bay', 'inland', 'near_ocean', and '<1h_ocean', with each row marked as 1 (true) if it belongs to that category and 0 (false) otherwise.
1. Apply One-Hot Encoding on the 'ocean_proximity' feature to transform it into a numeric format suitable for machine learning. Use the get_dummies() function that is part of pandas. Save the new dataframe in the houses_cleaned variable. Check the new columns to understand how one-hot encoding works.
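For reference, a minimal sketch of what this could look like with get_dummies():

# One-hot encode ocean_proximity: one binary column per category
houses_cleaned = pd.get_dummies(houses_cleaned, columns=['ocean_proximity'])
houses_cleaned.head()  # inspect the new ocean_proximity_* columns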
One of the key steps before splitting our data into train and test sets is to separate the target feature from the predictors and store it in its own variable. This way, we make sure that the target feature is not accidentally used as one of the predictors.
2. Separate the 'houses_cleaned' dataset into the feature set (X) and the target variable (y). The target variable is median_house_value.
We split the dataset into two parts: the independent (input) variables, and 'median_house_value' as the dependent target variable.
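As an illustration, a sketch of this separation (drop() returns a new dataframe, so houses_cleaned itself is left unchanged):

# Predictors: every column except the target
X = houses_cleaned.drop(columns=['median_house_value'])
# Target: the median house value
y = houses_cleaned['median_house_value']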
One of the most important steps when working on a machine learning problem is to split the data into training and test sets. Sklearn has a function that can do this for us.
3. First, transform the target variable into its log values, since this reduces the skew and helps linear regression perform better. Use np.log() for this transformation. Then use the train_test_split function from sklearn.model_selection to divide the data into training (80%) and testing (20%) sets (parameter test_size). Use a fixed random_state=42 for reproducibility. Confirm the split by examining the initial rows of the training set.
Tip: The function train_test_split returns 4 objects, so you can start with X_train, X_test, y_train, y_test = train_test_split()
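Putting the tip together, one possible solution could look like this (the variable names are only suggestions):

# Log-transform the target to reduce its skew
y_log = np.log(y)
# 80/20 split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y_log, test_size=0.2, random_state=42)
X_train.head()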
4. Print and verify the dimensions and total number of elements in each of the training and testing sets (X_train, X_test, y_train, y_test) after splitting. Use the attributes shape and size for this.
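A short sketch of the check, assuming the variable names from the previous step:

# shape gives (rows, columns); size gives the total number of elements
print("X_train:", X_train.shape, X_train.size)
print("X_test: ", X_test.shape, X_test.size)
print("y_train:", y_train.shape, y_train.size)
print("y_test: ", y_test.shape, y_test.size)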
Now let's apply standardization to our features. This is particularly important for regression models and regularization techniques.
5. Scale the input features using the StandardScaler().
Note: for your training data you can use the function fit_transform() which learns the scaling parameters from training data, and immediately applies them to the training data. However, for the test data you have to use transform() to apply the scaling rules learned from the training data to the test data.
Why not fit on the test data? Because that would leak information from the test set into the model, which breaks the rule of fair evaluation. The test set should simulate new, unseen data, so we should not "observe" it during preprocessing.
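A minimal sketch of this pattern (the _scaled variable names are just suggestions):

scaler = StandardScaler()
# Learn mean and standard deviation from the training data and scale it
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training-set statistics to scale the test data
X_test_scaled = scaler.transform(X_test)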
Linear regression is one of the basic algorithms for predicting a quantitative response. To refresh your memory, linear regression assumes a linear relationship between the predictors (X) and the target variable (y). One of its advantages is that the results are interpretable, meaning that it is easy to understand the model and its predictions.
6. Create a LinearRegression object and store it in the variable Linear_Regression. Once the model is created you can fit the model on the training data using the function fit(). The documentation is here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
Note: This pattern will come up in most of the models that we will use in the next practicals. First we create the model by calling its constructor (so we create a Python object, for example Linear_Regression = LinearRegression()), and then we fit the model on the training set (Linear_Regression.fit(X_train, y_train)). Once the model is fit (trained), we can use it to make predictions on the test set.
In the current question, you need to use the scaled version of X_train that we created in the previous exercise.
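As a sketch, this could look as follows, using the scaled training data from the previous exercise:

# Create the model object and fit it on the scaled training data
Linear_Regression = LinearRegression()
Linear_Regression.fit(X_train_scaled, y_train)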
If we now want to print the intercept and the coefficients, we can run the following:
print("Intercept:", Linear_Regression.intercept_)
coef_df = pd.DataFrame({
'Feature': X_train.columns,
'Coefficient': Linear_Regression.coef_
})
print(coef_df)
Can you explain the intercept and the coefficients? Which predictors influence the median price in a negative way and which in a positive way?
Now that we have the model, we can use it to make predictions on new data (the test set).
7. Make predictions on the test data and evaluate the model's performance using R-squared and MSE. For the predictions, you can use the predict() function and save the result in the y_predict variable. This function takes the X test data as a parameter, since we want to use the input features of the test data to make the predictions. The functions r2_score() and mean_squared_error() return the R-squared and MSE values respectively.
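One possible sketch, assuming the fitted model and scaled test data from the previous steps:

# Predict on the scaled test features
y_predict = Linear_Regression.predict(X_test_scaled)
# Evaluate: a higher R-squared and a lower MSE indicate a better fit
print("R-squared:", r2_score(y_test, y_predict))
print("MSE:", mean_squared_error(y_test, y_predict))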
Lasso regression is one of the regularization techniques; it can shrink some of the feature coefficients all the way to zero. We will use cross-validation to find the best lambda parameter.
8. Implement LassoCV() with a range of alphas for cross-validation to identify relevant features and penalize irrelevant ones. In this context, alpha is the same parameter as lambda. Start your code by defining a range of alphas such as alphas=np.logspace(-4, 4, 100). Also, set the number of folds to 5 (cv=5).
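One way to set this up (the name lasso_model matches the attribute used in question 10; the model itself is fitted in that question):

# Candidate regularization strengths, spread over a log scale
alphas = np.logspace(-4, 4, 100)
# 5-fold cross-validated Lasso over the candidate alphas
lasso_model = LassoCV(alphas=alphas, cv=5)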
9. Plot alpha values versus the MSEs. What do you notice?
10. Fit the lasso model on the training set and print the best alpha value. Use the attribute lasso_model.alpha_ to get the best alpha. We give you the structure of the plotting code below; you only need to fill in two lines. What does this plot tell us?
# Plot alpha values against mean squared error
plt.figure(figsize=(10, 6))
# Add the plt.plot line to plot the values of alphas versus the mean mse
# Note: mse_path_ is an array that stores the mean squared error for each alpha value and each cross-validation fold.
# plt.plot()
# use plt.axvline() to highlight the alpha value that gave the lowest mean error, change the linestyle to make it visible
# plt.axvline()
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Mean Squared Error')
plt.title('Mean Squared Error vs. Alpha for Lasso')
plt.legend(["MSE per alpha", "Optimal alpha"])
plt.grid(True)
plt.show()
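For reference, a hedged completion of the two missing lines, assuming lasso_model is the LassoCV object fitted in question 10 (mse_path_ has one row per alpha and one column per fold, so we average over the folds):

plt.plot(lasso_model.alphas_, lasso_model.mse_path_.mean(axis=1))
plt.axvline(lasso_model.alpha_, color='red', linestyle='--')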
11. Use the Lasso model that is now trained to make the predictions on the test set. Print the R-squared and the MSE values.
The second regularization method that we learned is Ridge regression. Let's also apply ridge regression to our data.
12. Construct a RidgeCV object with a range of alphas. Similar to Lasso, use 5 folds and the same alphas. Then print the optimal alpha value.
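A sketch of the Ridge version, reusing the alpha grid defined for Lasso (the name ridge_model is only a suggestion):

# 5-fold cross-validated Ridge over the same alpha grid
ridge_model = RidgeCV(alphas=alphas, cv=5)
ridge_model.fit(X_train_scaled, y_train)
print("Best alpha:", ridge_model.alpha_)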
13. Use the predict() function to generate the predictions on the test set.
14. Calculate the R-squared and the MSE for the ridge regression model
A very useful tool is to visualise the model fit and see how the predictions look compared to the actual data.
15. Create comparative visualizations to assess the model fit of Lasso and Ridge regression models.
You can use the regplot() function from seaborn. To make the plot nicer, add the argument scatter_kws={'alpha': 0.5, 'color': 'lightblue'}, and to style the regression line you can add the argument line_kws={'color': 'black', 'linestyle': '--', 'linewidth': 2.5}. Play with the parameters to see how your plot changes.
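As an illustration, a sketch for the Lasso plot (y_predict_lasso is a placeholder for your Lasso predictions; repeat the same plot with the Ridge predictions to compare the two models):

plt.figure(figsize=(8, 6))
# Scatter of actual versus predicted values, with a fitted regression line
sns.regplot(x=y_test, y=y_predict_lasso,
            scatter_kws={'alpha': 0.5, 'color': 'lightblue'},
            line_kws={'color': 'black', 'linestyle': '--', 'linewidth': 2.5})
plt.xlabel('Actual log median house value')
plt.ylabel('Predicted log median house value')
plt.title('Lasso: predicted vs. actual')
plt.show()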
End of practical!