Practical 2: Basic ML & Linear Regression¶

Anastasia Giachanou, Tina Shahedi

Machine Learning with Python - Utrecht Summer School

In our first practical, we explored the data and applied some transformations, and we used Matplotlib and Seaborn to visualise the data. In this practical, we continue with the California Housing dataset and practice with Linear Regression.

Learning goals¶

Our goal is to develop a machine learning model that predicts house prices from a set of input features. In this practical, you will learn how to:

  • Build machine learning models to solve a regression problem.
  • Split the data into training and test sets.
  • Evaluate model performance.
  • Apply Ridge and Lasso regularization.

If the necessary libraries (numpy, pandas, scikit-learn, matplotlib, seaborn) are not yet installed, install them using !pip install, and then import them at the start of your script.

In [1]:
!pip install -q numpy
!pip install -q pandas
!pip install -q scikit-learn
!pip install -q matplotlib
!pip install -q seaborn
In [ ]:
# Importing necessary libraries
import numpy as np
import pandas as pd
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
from scipy.stats import iqr

We can also import the specific functions required for some of the code that we will use. This way, we can call the functions directly without prefixing them with the package name.

In [2]:
# Importing necessary functions
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.linear_model import LassoCV, RidgeCV
from scipy.stats import probplot

As you can see from the packages we imported, one of them is scikit-learn (https://scikit-learn.org/stable/). scikit-learn (also known as sklearn) is a free and open-source machine learning library for the Python programming language. It includes implementations of various classification, regression and clustering algorithms, including linear regression, support-vector machines, random forests, gradient boosting and k-means, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

Load the data and set up¶

We begin our model development by loading the 'houses_cleaned' dataset. This is the dataset after we did some preprocessing steps in the first practical.

In [ ]:
houses_cleaned = pd.read_csv('houses_cleaned.csv')
In [ ]:
houses_cleaned.head()
Out[ ]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity rooms_per_household log_median_income
0 -122.23 37.88 41.0 880.000 129.0 322.0 126.0 8.013025 452600.0 near_bay 6.984127 2.198671
1 -122.22 37.86 21.0 5698.375 1106.0 2401.0 1092.5 8.013025 358500.0 near_bay 6.238137 2.198671
2 -122.24 37.85 52.0 1467.000 190.0 496.0 177.0 7.257400 352100.0 near_bay 8.288136 2.111110
3 -122.25 37.85 52.0 1274.000 235.0 558.0 219.0 5.643100 341300.0 near_bay 5.817352 1.893579
4 -122.25 37.85 52.0 1627.000 280.0 565.0 259.0 3.846200 342200.0 near_bay 6.281853 1.578195
In [ ]:
houses_cleaned.tail()
Out[ ]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity rooms_per_household log_median_income
20630 -121.09 39.48 25.0 1665.0 374.0 845.0 330.0 1.5603 78100.0 inland 5.045455 0.940124
20631 -121.21 39.49 18.0 697.0 150.0 356.0 114.0 2.5568 77100.0 inland 6.114035 1.268861
20632 -121.22 39.43 17.0 2254.0 485.0 1007.0 433.0 1.7000 92300.0 inland 5.205543 0.993252
20633 -121.32 39.43 18.0 1860.0 409.0 741.0 349.0 1.8672 84700.0 inland 5.329513 1.053336
20634 -121.24 39.37 16.0 2785.0 616.0 1387.0 530.0 2.3886 89400.0 inland 5.254717 1.220417
In [ ]:
houses_cleaned.describe()
Out[ ]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value rooms_per_household log_median_income
count 20635.000000 20635.000000 20635.000000 20635.000000 20635.000000 20635.000000 20635.000000 20635.000000 20635.000000 20635.000000 20635.000000
mean -119.569999 35.632412 28.636152 2441.902575 502.689858 1337.121105 469.066731 3.801266 205938.952435 5.304655 1.510583
std 2.003685 2.135918 12.583924 1397.859491 287.261212 765.561218 265.518130 1.657765 113192.930024 1.246199 0.342970
min -124.350000 32.540000 1.000000 2.000000 -112.080009 3.000000 1.000000 0.499900 14999.000000 2.023219 0.405398
25% -121.800000 33.930000 18.000000 1448.000000 296.000000 787.000000 280.000000 2.563100 119600.000000 4.440684 1.270631
50% -118.500000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.535200 179700.000000 5.229091 1.511869
75% -118.010000 37.710000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743700 264700.000000 6.052257 1.748104
max -114.310000 41.950000 52.000000 5698.375000 1173.500000 3132.000000 1092.500000 8.013025 482412.500000 8.469878 2.198671

This initial view helps in understanding the types of columns (features) the dataset contains, such as continuous numerical values, categorical data, etc. The only remaining preprocessing step is to encode the categorical features into a numeric format.

One-Hot encoding¶

In our dataset, the ocean_proximity feature is categorical and needs to be encoded into a numeric format. Since there is no ordinal relationship between different types of locations like 'near_bay', 'inland', etc., we can use One-Hot Encoding, which creates a separate binary column for each category. In this case, we'll have separate columns for 'near_bay', 'inland', 'near_ocean', and '<1h_ocean', with each row marked as 1 (true) if it belongs to that category and 0 (false) otherwise.

1. Apply One-Hot Encoding on the 'ocean_proximity' feature to transform it into a numeric format suitable for machine learning. Use the function get_dummies() that is part of pandas. Save the new dataframe in the houses_cleaned variable, and inspect the new columns to understand how one-hot encoding works.
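
For reference, a minimal sketch of this step (the exact names of the new indicator columns, e.g. ocean_proximity_near_bay, depend on the categories present in your data):

In [ ]:
# One-hot encode the categorical 'ocean_proximity' column:
# each category becomes its own 0/1 indicator column
houses_cleaned = pd.get_dummies(houses_cleaned, columns=['ocean_proximity'])

# Inspect the newly created indicator columns
houses_cleaned.head()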

Separating target and features¶

One of the key steps before splitting our data into train and test sets is to remove the target variable from the predictors and save it in a separate variable. In this way, we make sure that we do not use the target as one of the predictors.

2. Separate the 'houses_cleaned' dataset into a feature set (X) and a target variable (y). The target variable is median_house_value.

We split the dataset into two parts: independent/input variables and 'median_house_value' as the dependent target variable.
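
A minimal sketch of this separation:

In [ ]:
# Features: every column except the target
X = houses_cleaned.drop(columns=['median_house_value'])

# Target: the house value we want to predict
y = houses_cleaned['median_house_value']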

Splitting the dataset into training and test sets¶

One of the most important steps when working on a machine learning problem is to split the data into training and test sets. Sklearn has a function that does this for us.

3. First transform the target variable into its log values, since linear regression works better this way (it reduces the skew). Use np.log() for this transformation. Then use the function train_test_split from sklearn.model_selection to divide the data into training (80%) and testing (20%) sets (parameter test_size). Use a fixed random_state=42 for reproducibility. Confirm the split by examining the initial rows of the training set.

Tip: The function train_test_split returns 4 objects, so you can start with X_train, X_test, y_train, y_test = train_test_split()
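
For reference, a sketch of these steps (the intermediate name y_log is our own choice):

In [ ]:
# Log-transform the target to reduce its skew
y_log = np.log(y)

# 80% training / 20% test split, with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y_log, test_size=0.2, random_state=42)

X_train.head()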

4. Print and verify the dimensions and total number of elements in each of the training and testing sets (X_train, X_test, y_train, y_test) after splitting. Use the attributes shape and size for this.
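
For example:

In [ ]:
# shape gives (rows, columns); size gives the total number of elements
print("X_train:", X_train.shape, X_train.size)
print("X_test: ", X_test.shape, X_test.size)
print("y_train:", y_train.shape, y_train.size)
print("y_test: ", y_test.shape, y_test.size)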

Feature scaling¶

Now let's apply standardization to our features. This is particularly important for regression models and regularization techniques.

5. Scale the input features using the StandardScaler().

Note: for your training data you can use the function fit_transform(), which learns the scaling parameters from the training data and immediately applies them to it. However, for the test data you have to use transform() to apply the scaling rules learned from the training data.

Why not fit on the test data? Because that would leak information from the test set into the model, which breaks the rule of fair evaluation. The test set should simulate new, unseen data: we shouldn't "observe" it while preprocessing.
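
A minimal sketch of this pattern (the _scaled variable names are our own convention):

In [ ]:
scaler = StandardScaler()

# Learn the mean and standard deviation from the training data and scale it
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same (training-derived) scaling to the test data
X_test_scaled = scaler.transform(X_test)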

Building prediction model¶

Linear regression is one of the most basic algorithms for predicting a quantitative response. To refresh your memory, linear regression assumes a linear relationship between the predictors (X) and the target variable (y). Another advantage of linear regression is that the results are interpretable, meaning that it is easy to understand the model and its predictions.

6. Create a LinearRegression object and store it in the variable Linear_Regression. Once the model is created, you can fit it on the training data using the function fit(). The documentation is here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

Note: This pattern will come up in most of the models we use in the next practicals. First we create the model by calling the constructor (so we create an object, in Python terms; for example, Linear_Regression = LinearRegression()), and then we fit the model on the training set (Linear_Regression.fit(X_train, y_train)). Once the model is fit (trained), we can use it to make predictions on the test set.

In the current question, you need to use the scaled version of X_train which we created in the previous exercise.
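
For reference, a minimal sketch assuming the scaled features are stored in X_train_scaled:

In [ ]:
# Create the model object, then fit it on the scaled training data
Linear_Regression = LinearRegression()
Linear_Regression.fit(X_train_scaled, y_train)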

If we now want to print the intercept and the coefficients, we can run the following:

In [ ]:
print("Intercept:", Linear_Regression.intercept_)

# Pair each feature name with its learned coefficient
coef_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Coefficient': Linear_Regression.coef_
})
print(coef_df)

Can you explain the intercept and the coefficients? Which predictors influence the median price in a negative way and which in a positive way?

Prediction and evaluation¶

Now that we have the model, we can use it to make predictions on new data (the test set).

7. Make predictions on the test data and evaluate the model's performance using R-squared and MSE. For the predictions, you can use the function predict() and save the result in the y_predict variable. This function takes the X test data as a parameter, since we want to use the input features of the test data for the predictions. The functions r2_score() and mean_squared_error() return the R-squared and MSE values respectively.
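
A minimal sketch, assuming X_test_scaled from the scaling step:

In [ ]:
# Predict on the scaled test features
y_predict = Linear_Regression.predict(X_test_scaled)

# R-squared: closer to 1 is better; MSE: lower is better
print("R-squared:", r2_score(y_test, y_predict))
print("MSE:", mean_squared_error(y_test, y_predict))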

Lasso and Ridge Regularization¶

Lasso regression is one of the regularization techniques; it shrinks some of the feature coefficients to zero. We will use cross-validation to find the best lambda parameter.

8. Implement LassoCV() with a range of alphas for cross-validation to identify relevant features and penalize irrelevant ones. In this context, alpha is the same parameter as lambda. Start your code by defining a range of alphas, such as alphas=np.logspace(-4, 4, 100). Also, set the number of folds to 5 (cv=5).
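
A sketch of the setup (lasso_model is the name used in question 10 below):

In [ ]:
# Candidate regularization strengths, evenly spaced on a log scale
alphas = np.logspace(-4, 4, 100)

# LassoCV will pick the alpha with the lowest cross-validated MSE (5 folds)
lasso_model = LassoCV(alphas=alphas, cv=5)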

9. Plot alpha values versus the MSEs. What do you notice?

10. Fit the lasso model on the training set and print the best alpha value. Use the attribute lasso_model.alpha_ to get the best alpha. We give you the structure of the code; you only need to fill in two lines. What does this plot tell us?

# Plot alpha values against mean squared error
plt.figure(figsize=(10, 6))
# Add the plt.plot line to plot the values of alphas versus the mean mse
# Note: mse_path_ is an array that stores the mean squared error for each alpha value and each cross-validation fold.
# plt.plot()

# use plt.axvline() to highlight the alpha value that gave the lowest mean error, change the linestyle to make it visible
# plt.axvline()

plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Mean Squared Error')
plt.title('Mean Squared Error vs. Alpha for Lasso')
plt.legend(["MSE per alpha", "Optimal alpha"])
plt.grid(True)
plt.show()
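
For reference, one possible completion of the scaffold above, as a hedged sketch (assuming the model is fit on the scaled training data; lasso_model.alphas_ stores the alphas in the order that matches mse_path_):

In [ ]:
# Fit the cross-validated Lasso and report the best alpha
lasso_model.fit(X_train_scaled, y_train)
print("Best alpha:", lasso_model.alpha_)

plt.figure(figsize=(10, 6))
# mse_path_ has one row per alpha and one column per fold, so we
# average over the folds (axis=1) to get the mean MSE per alpha
plt.plot(lasso_model.alphas_, lasso_model.mse_path_.mean(axis=1))

# Highlight the alpha that gave the lowest mean error
plt.axvline(lasso_model.alpha_, color='red', linestyle='--')

plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Mean Squared Error')
plt.title('Mean Squared Error vs. Alpha for Lasso')
plt.legend(["MSE per alpha", "Optimal alpha"])
plt.grid(True)
plt.show()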

11. Use the Lasso model that is now trained to make the predictions on the test set. Print the R-squared and the MSE values.
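
A minimal sketch, following the same pattern as before:

In [ ]:
# Predict with the tuned Lasso and evaluate on the test set
y_predict_lasso = lasso_model.predict(X_test_scaled)
print("Lasso R-squared:", r2_score(y_test, y_predict_lasso))
print("Lasso MSE:", mean_squared_error(y_test, y_predict_lasso))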

The second regularization method that we learned is Ridge regression. Let's also apply ridge regression to our data.

12. Construct a RidgeCV object with a range of alphas. Similar to Lasso, use 5 folds and the same alphas. Then print the optimal alpha value.

13. Use the predict() function to generate the predictions on the test set.

14. Calculate the R-squared and the MSE for the ridge regression model
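
A sketch covering questions 12 to 14 (the name ridge_model is our own choice):

In [ ]:
# Cross-validated Ridge with the same alpha grid and 5 folds
ridge_model = RidgeCV(alphas=alphas, cv=5)
ridge_model.fit(X_train_scaled, y_train)
print("Best alpha:", ridge_model.alpha_)

# Predictions and evaluation on the test set
y_predict_ridge = ridge_model.predict(X_test_scaled)
print("Ridge R-squared:", r2_score(y_test, y_predict_ridge))
print("Ridge MSE:", mean_squared_error(y_test, y_predict_ridge))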

Comparing lasso and ridge regression models¶

A very useful tool is to visualise the model fit and see how the predictions look compared to the actual data.

15. Create comparative visualizations to assess the model fit of Lasso and Ridge regression models.

You can use regplot() from seaborn. To make the plot nicer, add the argument scatter_kws={'alpha': 0.5, 'color': 'lightblue'}, and to style the regression line you can add the argument line_kws={'color': 'black', 'linestyle': '--', 'linewidth': 2.5}. Play with the parameters to see how your plot changes.
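
One possible layout, as a sketch (assuming the prediction variables from the previous questions):

In [ ]:
fig, axes = plt.subplots(1, 2, figsize=(14, 6), sharey=True)

# Lasso: predicted values against actual test values
sns.regplot(x=y_test, y=y_predict_lasso, ax=axes[0],
            scatter_kws={'alpha': 0.5, 'color': 'lightblue'},
            line_kws={'color': 'black', 'linestyle': '--', 'linewidth': 2.5})
axes[0].set_title('Lasso: predicted vs. actual')

# Ridge: predicted values against actual test values
sns.regplot(x=y_test, y=y_predict_ridge, ax=axes[1],
            scatter_kws={'alpha': 0.5, 'color': 'lightblue'},
            line_kws={'color': 'black', 'linestyle': '--', 'linewidth': 2.5})
axes[1].set_title('Ridge: predicted vs. actual')

plt.show()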

End of practical!