Anastasia Giachanou, Tina Shahedi
Machine Learning with Python - Utrecht Summer School
In our first practical, we explored our data and made some transformations, and we used Matplotlib and Seaborn to visualise the data. In this practical, we continue with the California Housing dataset and practice with Linear Regression.
Our goal is to develop a machine learning model that predicts house prices from a set of input features. Along the way, you will practise splitting and scaling data, fitting a linear regression model, evaluating it, and applying Lasso and Ridge regularization.
First, install the necessary libraries (numpy, pandas, scikit-learn, matplotlib, seaborn) with !pip install if they are not yet installed, and then import them at the start of your script.
!pip install -q numpy
!pip install -q pandas
!pip install -q scikit-learn
!pip install -q matplotlib
!pip install -q seaborn
# Importing necessary libraries
import numpy as np
import pandas as pd
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
from scipy.stats import iqr
We also import the specific functions and classes that we will need. This way, we can call them directly without prefixing them with the package name.
# Importing necessary functions
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.linear_model import LassoCV, RidgeCV
from scipy.stats import probplot
As you can see from the packages that we imported, one of them is scikit-learn (https://scikit-learn.org/stable/). scikit-learn (also known as sklearn) is a free and open-source machine learning library for Python. It includes implementations of various classification, regression and clustering algorithms, including linear regression, support-vector machines, random forests, gradient boosting and k-means, and it is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
We begin our model development by loading the 'houses_cleaned' dataset. This is the dataset after we did some preprocessing steps in the first practical.
houses_cleaned = pd.read_csv('houses_cleaned.csv')
houses_cleaned.head()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | rooms_per_household | log_median_income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.000 | 129.0 | 322.0 | 126.0 | 8.013025 | 452600.0 | near_bay | 6.984127 | 2.198671 |
| 1 | -122.22 | 37.86 | 21.0 | 5698.375 | 1106.0 | 2401.0 | 1092.5 | 8.013025 | 358500.0 | near_bay | 6.238137 | 2.198671 |
| 2 | -122.24 | 37.85 | 52.0 | 1467.000 | 190.0 | 496.0 | 177.0 | 7.257400 | 352100.0 | near_bay | 8.288136 | 2.111110 |
| 3 | -122.25 | 37.85 | 52.0 | 1274.000 | 235.0 | 558.0 | 219.0 | 5.643100 | 341300.0 | near_bay | 5.817352 | 1.893579 |
| 4 | -122.25 | 37.85 | 52.0 | 1627.000 | 280.0 | 565.0 | 259.0 | 3.846200 | 342200.0 | near_bay | 6.281853 | 1.578195 |
houses_cleaned.tail()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | rooms_per_household | log_median_income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20630 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 | inland | 5.045455 | 0.940124 |
| 20631 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 | inland | 6.114035 | 1.268861 |
| 20632 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 | 92300.0 | inland | 5.205543 | 0.993252 |
| 20633 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | inland | 5.329513 | 1.053336 |
| 20634 | -121.24 | 39.37 | 16.0 | 2785.0 | 616.0 | 1387.0 | 530.0 | 2.3886 | 89400.0 | inland | 5.254717 | 1.220417 |
houses_cleaned.describe()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | rooms_per_household | log_median_income |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 20635.000000 | 20635.000000 | 20635.000000 | 20635.000000 | 20635.000000 | 20635.000000 | 20635.000000 | 20635.000000 | 20635.000000 | 20635.000000 | 20635.000000 |
| mean | -119.569999 | 35.632412 | 28.636152 | 2441.902575 | 502.689858 | 1337.121105 | 469.066731 | 3.801266 | 205938.952435 | 5.304655 | 1.510583 |
| std | 2.003685 | 2.135918 | 12.583924 | 1397.859491 | 287.261212 | 765.561218 | 265.518130 | 1.657765 | 113192.930024 | 1.246199 | 0.342970 |
| min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | -112.080009 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 | 2.023219 | 0.405398 |
| 25% | -121.800000 | 33.930000 | 18.000000 | 1448.000000 | 296.000000 | 787.000000 | 280.000000 | 2.563100 | 119600.000000 | 4.440684 | 1.270631 |
| 50% | -118.500000 | 34.260000 | 29.000000 | 2127.000000 | 435.000000 | 1166.000000 | 409.000000 | 3.535200 | 179700.000000 | 5.229091 | 1.511869 |
| 75% | -118.010000 | 37.710000 | 37.000000 | 3148.000000 | 647.000000 | 1725.000000 | 605.000000 | 4.743700 | 264700.000000 | 6.052257 | 1.748104 |
| max | -114.310000 | 41.950000 | 52.000000 | 5698.375000 | 1173.500000 | 3132.000000 | 1092.500000 | 8.013025 | 482412.500000 | 8.469878 | 2.198671 |
This initial view helps us understand what types of columns (features) the dataset contains, such as continuous numerical values and categorical data. The only remaining preprocessing step is to encode the categorical features into a numeric format.
In our dataset, the ocean_proximity feature is categorical and needs to be encoded into a numeric format. Since there is no ordinal relationship between the different types of locations, such as 'near_bay' and 'inland', we can use One-Hot Encoding, which creates a separate binary column for each category. In this case, we'll have separate columns for 'near_bay', 'inland', 'near_ocean', and '<1h_ocean', with each row marked as 1 (true) if it belongs to that category and 0 (false) otherwise.
1. Apply One-Hot Encoding on the 'ocean_proximity' feature to transform it into a numeric format suitable for machine learning. Use the get_dummies() function that is part of pandas. Save the new dataframe in the houses_cleaned variable. Check the new columns to understand how one-hot encoding works.
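For reference, a minimal sketch of what this could look like with get_dummies():

# One-hot encode ocean_proximity: one binary column per category
houses_cleaned = pd.get_dummies(houses_cleaned, columns=['ocean_proximity'])
houses_cleaned.head()  # inspect the new ocean_proximity_* columns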
One of the key steps before splitting our data into train and test sets is to separate the target feature from the predictors and store it in its own variable. This way, we make sure that the target feature is not accidentally used as one of the predictors.
2. Separate the 'houses_cleaned' dataset into the feature set (X) and the target variable (y). The target variable is median_house_value.
We split the dataset into two parts: the independent (input) variables, and 'median_house_value' as the dependent target variable.
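As an illustration, a sketch of this separation (drop() returns a new dataframe, so houses_cleaned itself is left unchanged):

# Predictors: every column except the target
X = houses_cleaned.drop(columns=['median_house_value'])
# Target: the median house value
y = houses_cleaned['median_house_value']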
One of the most important steps when working on a machine learning problem is to split the data into training and test sets. Sklearn has a function that can do this for us.
3. First, transform the target variable into its log values, since this reduces the skew and helps linear regression perform better. Use np.log() for this transformation. Then use the train_test_split function from sklearn.model_selection to divide the data into training (80%) and testing (20%) sets (parameter test_size). Use a fixed random_state=42 for reproducibility. Confirm the split by examining the initial rows of the training set.
Tip: The function train_test_split returns 4 objects, so you can start with X_train, X_test, y_train, y_test = train_test_split()
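Putting the tip together, one possible solution could look like this (the variable names are only suggestions):

# Log-transform the target to reduce its skew
y_log = np.log(y)
# 80/20 split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y_log, test_size=0.2, random_state=42)
X_train.head()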
4. Print and verify the dimensions and total number of elements in each of the training and testing sets (X_train, X_test, y_train, y_test) after splitting. Use the attributes shape and size for this.
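A short sketch of the check, assuming the variable names from the previous step:

# shape gives (rows, columns); size gives the total number of elements
print("X_train:", X_train.shape, X_train.size)
print("X_test: ", X_test.shape, X_test.size)
print("y_train:", y_train.shape, y_train.size)
print("y_test: ", y_test.shape, y_test.size)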
Now let's apply standardization to our features. This is particularly important for regression models and regularization techniques.
5. Scale the input features using the StandardScaler().
Note: for your training data you can use the function fit_transform() which learns the scaling parameters from training data, and immediately applies them to the training data. However, for the test data you have to use transform() to apply the scaling rules learned from the training data to the test data.
Why not fit on the test data? Because that would leak information from the test set into the model, which breaks the rule of fair evaluation. The test set should simulate new, unseen data, so we should not "observe" it during preprocessing.
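A minimal sketch of this pattern (the _scaled variable names are just suggestions):

scaler = StandardScaler()
# Learn mean and standard deviation from the training data and scale it
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training-set statistics to scale the test data
X_test_scaled = scaler.transform(X_test)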
Linear regression is one of the basic algorithms for predicting a quantitative response. To refresh your memory, linear regression assumes a linear relationship between the predictors (X) and the target variable (y). One of its advantages is that the results are interpretable, meaning that it is easy to understand the model and its predictions.
6. Create a LinearRegression object and store it in the variable Linear_Regression. Once the model is created you can fit the model on the training data using the function fit(). The documentation is here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
Note: This pattern will come up in most of the models that we will use in the next practicals. First we create the model by calling its constructor (so we create a Python object, for example Linear_Regression = LinearRegression()), and then we fit the model on the training set (Linear_Regression.fit(X_train, y_train)). Once the model is fit (trained), we can use it to make predictions on the test set.
In the current question, you need to use the scaled version of X_train that we created in the previous exercise.
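As a sketch, this could look as follows, using the scaled training data from the previous exercise:

# Create the model object and fit it on the scaled training data
Linear_Regression = LinearRegression()
Linear_Regression.fit(X_train_scaled, y_train)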
If we now want to print the intercept and the coefficients, we can run the following:
print("Intercept:", Linear_Regression.intercept_)
coef_df = pd.DataFrame({
'Feature': X_train.columns,
'Coefficient': Linear_Regression.coef_
})
print(coef_df)
Can you explain the intercept and the coefficients? Which predictors influence the median price in a negative way and which in a positive way?
Now that we have the model, we can use it to make predictions on new data (the test set).
7. Make predictions on the test data and evaluate the model's performance using R-squared and MSE. For the predictions, you can use the predict() function and save the result in the y_predict variable. This function takes the X test data as a parameter, since we want to use the input features of the test data to make the predictions. The functions r2_score() and mean_squared_error() return the R-squared and MSE values respectively.
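One possible sketch, assuming the fitted model and scaled test data from the previous steps:

# Predict on the scaled test features
y_predict = Linear_Regression.predict(X_test_scaled)
# Evaluate: a higher R-squared and a lower MSE indicate a better fit
print("R-squared:", r2_score(y_test, y_predict))
print("MSE:", mean_squared_error(y_test, y_predict))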
Lasso regression is one of the regularization techniques; it can shrink some of the feature coefficients all the way to zero. We will use cross-validation to find the best lambda parameter.
8. Implement LassoCV() with a range of alphas for cross-validation to identify relevant features and penalize irrelevant ones. In this context, alpha is the same parameter as lambda. Start your code by defining a range of alphas such as alphas=np.logspace(-4, 4, 100). Also, set the number of folds to 5 (cv=5).
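One way to set this up (the name lasso_model matches the attribute used in question 10; the model itself is fitted in that question):

# Candidate regularization strengths, spread over a log scale
alphas = np.logspace(-4, 4, 100)
# 5-fold cross-validated Lasso over the candidate alphas
lasso_model = LassoCV(alphas=alphas, cv=5)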
9. Plot alpha values versus the MSEs. What do you notice?
10. Fit the lasso model on the training set and print the best alpha value. Use the attribute lasso_model.alpha_ to get the best alpha. We give you the structure of the plotting code below; you only need to fill in two lines. What does this plot tell us?
# Plot alpha values against mean squared error
plt.figure(figsize=(10, 6))
# Add the plt.plot line to plot the values of alphas versus the mean mse
# Note: mse_path_ is an array that stores the mean squared error for each alpha value and each cross-validation fold.
# plt.plot()
# use plt.axvline() to highlight the alpha value that gave the lowest mean error, change the linestyle to make it visible
# plt.axvline()
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Mean Squared Error')
plt.title('Mean Squared Error vs. Alpha for Lasso')
plt.legend(["MSE per alpha", "Optimal alpha"])
plt.grid(True)
plt.show()
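For reference, a hedged completion of the two missing lines, assuming lasso_model is the LassoCV object fitted in question 10 (mse_path_ has one row per alpha and one column per fold, so we average over the folds):

plt.plot(lasso_model.alphas_, lasso_model.mse_path_.mean(axis=1))
plt.axvline(lasso_model.alpha_, color='red', linestyle='--')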
11. Use the Lasso model that is now trained to make the predictions on the test set. Print the R-squared and the MSE values.
The second regularization method that we learned is Ridge regression. Let's also apply ridge regression to our data.
12. Construct a RidgeCV object with a range of alphas. Similar to Lasso, use 5 folds and the same alphas. Then print the optimal alpha value.
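A sketch of the Ridge version, reusing the alpha grid defined for Lasso (the name ridge_model is only a suggestion):

# 5-fold cross-validated Ridge over the same alpha grid
ridge_model = RidgeCV(alphas=alphas, cv=5)
ridge_model.fit(X_train_scaled, y_train)
print("Best alpha:", ridge_model.alpha_)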
13. Use the predict() function to generate the predictions on the test set.
14. Calculate the R-squared and the MSE for the ridge regression model
A very useful tool is to visualise the model fit and see how the predictions look compared to the actual data.
15. Create comparative visualizations to assess the model fit of Lasso and Ridge regression models.
You can use the regplot() function from seaborn. To make the plot nicer, add the argument scatter_kws={'alpha': 0.5, 'color': 'lightblue'}, and to style the regression line you can add the argument line_kws={'color': 'black', 'linestyle': '--', 'linewidth': 2.5}. Play with the parameters to see how your plot changes.
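As an illustration, a sketch for the Lasso plot (y_predict_lasso is a placeholder for your Lasso predictions; repeat the same plot with the Ridge predictions to compare the two models):

plt.figure(figsize=(8, 6))
# Scatter of actual versus predicted values, with a fitted regression line
sns.regplot(x=y_test, y=y_predict_lasso,
            scatter_kws={'alpha': 0.5, 'color': 'lightblue'},
            line_kws={'color': 'black', 'linestyle': '--', 'linewidth': 2.5})
plt.xlabel('Actual log median house value')
plt.ylabel('Predicted log median house value')
plt.title('Lasso: predicted vs. actual')
plt.show()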
End of practical!