Practical 2: Basic ML & Linear Regression¶

Anastasia Giachanou, Tina Shahedi

Machine Learning with Python - Utrecht Summer School

In our first practical, we explored our data and made some transformations. We also used Matplotlib and Seaborn to visualise the data. In this practical, we will continue with the California Housing dataset and practice with Linear Regression.

Learning goals¶

Our goal is to develop a machine learning model to predict house prices from some input features. In this practical, you will learn:

  • Build machine learning models to solve a regression problem.
  • Split the data into training and test sets.
  • Evaluate model performance.
  • Apply Ridge and Lasso regularization.

Install the necessary libraries (numpy, pandas, scikit-learn, matplotlib, seaborn) if they are not yet installed using !pip install, and then import them at the start of your script.

In [1]:
!pip install -q numpy
!pip install -q pandas
!pip install -q scikit-learn
!pip install -q matplotlib
!pip install -q seaborn
In [2]:
# Importing necessary libraries
import numpy as np
import pandas as pd
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
from scipy.stats import iqr

Also, we can import the functions required for some of the code that we will use. In this way, we can use the functions directly without calling them via the package.

In [3]:
# Importing necessary functions
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.linear_model import LassoCV, RidgeCV
from scipy.stats import probplot

As you see from the packages that we imported, one of them is scikit-learn (https://scikit-learn.org/stable/). scikit-learn (also known as sklearn) is a free and open-source machine learning library for Python. It includes code for various classification, regression and clustering algorithms, including linear regression, support-vector machines, random forests, gradient boosting and k-means, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

Load the data and set up¶

We begin our model development by loading the 'houses_cleaned' dataset. This is the dataset after we did some preprocessing steps in the first practical.

In [4]:
houses_cleaned = pd.read_csv('houses_cleaned.csv')
In [5]:
houses_cleaned.head()
Out[5]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity rooms_per_household log_median_income
0 -122.23 37.88 41.0 880.000 129.0 322.0 126.0 8.013025 452600.0 near_bay 6.984127 2.198671
1 -122.22 37.86 21.0 5698.375 1106.0 2401.0 1092.5 8.013025 358500.0 near_bay 6.238137 2.198671
2 -122.24 37.85 52.0 1467.000 190.0 496.0 177.0 7.257400 352100.0 near_bay 8.288136 2.111110
3 -122.25 37.85 52.0 1274.000 235.0 558.0 219.0 5.643100 341300.0 near_bay 5.817352 1.893579
4 -122.25 37.85 52.0 1627.000 280.0 565.0 259.0 3.846200 342200.0 near_bay 6.281853 1.578195
In [6]:
houses_cleaned.tail()
Out[6]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity rooms_per_household log_median_income
20630 -121.09 39.48 25.0 1665.0 374.0 845.0 330.0 1.5603 78100.0 inland 5.045455 0.940124
20631 -121.21 39.49 18.0 697.0 150.0 356.0 114.0 2.5568 77100.0 inland 6.114035 1.268861
20632 -121.22 39.43 17.0 2254.0 485.0 1007.0 433.0 1.7000 92300.0 inland 5.205543 0.993252
20633 -121.32 39.43 18.0 1860.0 409.0 741.0 349.0 1.8672 84700.0 inland 5.329513 1.053336
20634 -121.24 39.37 16.0 2785.0 616.0 1387.0 530.0 2.3886 89400.0 inland 5.254717 1.220417
In [7]:
houses_cleaned.describe()
Out[7]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value rooms_per_household log_median_income
count 20635.000000 20635.000000 20635.000000 20635.000000 20635.000000 20635.000000 20635.000000 20635.000000 20635.000000 20635.000000 20635.000000
mean -119.569999 35.632412 28.636152 2441.902575 502.689858 1337.121105 469.066731 3.801266 205938.952435 5.304655 1.510583
std 2.003685 2.135918 12.583924 1397.859491 287.261212 765.561218 265.518130 1.657765 113192.930024 1.246199 0.342970
min -124.350000 32.540000 1.000000 2.000000 -112.080009 3.000000 1.000000 0.499900 14999.000000 2.023219 0.405398
25% -121.800000 33.930000 18.000000 1448.000000 296.000000 787.000000 280.000000 2.563100 119600.000000 4.440684 1.270631
50% -118.500000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.535200 179700.000000 5.229091 1.511869
75% -118.010000 37.710000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743700 264700.000000 6.052257 1.748104
max -114.310000 41.950000 52.000000 5698.375000 1173.500000 3132.000000 1092.500000 8.013025 482412.500000 8.469878 2.198671

This initial view helps in understanding the types of data columns (features) the dataset contains, such as continuous numerical values, categorical data, etc. The only remaining step is to encode the categorical features into a numeric format.

One-Hot encoding¶

In our dataset, the ocean_proximity feature is categorical and needs to be encoded into a numeric format. Since there is no ordinal relationship between different types of locations like 'near_bay', 'inland', etc., we can use One-Hot Encoding which creates a separate binary column for each category. In this case, we'll have separate columns for 'near_bay', 'inland', 'near_ocean', and '<1h_ocean', with each row marked as 1 (true) if it belongs to that category and 0 (false) otherwise.

1. Apply One-Hot Encoding to the 'ocean_proximity' feature to transform it into a numeric format suitable for machine learning training. Use the function get_dummies() that is part of pandas. Save the new dataframe in the houses_cleaned variable. Check the new columns to understand how the one-hot encoding works.

In [8]:
# Applying one-hot encoding to the 'ocean_proximity' column
houses_cleaned = pd.get_dummies(houses_cleaned, columns=['ocean_proximity'])

houses_cleaned.head()
Out[8]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value rooms_per_household log_median_income ocean_proximity_<1h_ocean ocean_proximity_inland ocean_proximity_near_bay ocean_proximity_near_ocean
0 -122.23 37.88 41.0 880.000 129.0 322.0 126.0 8.013025 452600.0 6.984127 2.198671 False False True False
1 -122.22 37.86 21.0 5698.375 1106.0 2401.0 1092.5 8.013025 358500.0 6.238137 2.198671 False False True False
2 -122.24 37.85 52.0 1467.000 190.0 496.0 177.0 7.257400 352100.0 8.288136 2.111110 False False True False
3 -122.25 37.85 52.0 1274.000 235.0 558.0 219.0 5.643100 341300.0 5.817352 1.893579 False False True False
4 -122.25 37.85 52.0 1627.000 280.0 565.0 259.0 3.846200 342200.0 6.281853 1.578195 False False True False
In [9]:
print(houses_cleaned.describe())
          longitude      latitude  housing_median_age   total_rooms  \
count  20635.000000  20635.000000        20635.000000  20635.000000   
mean    -119.569999     35.632412           28.636152   2441.902575   
std        2.003685      2.135918           12.583924   1397.859491   
min     -124.350000     32.540000            1.000000      2.000000   
25%     -121.800000     33.930000           18.000000   1448.000000   
50%     -118.500000     34.260000           29.000000   2127.000000   
75%     -118.010000     37.710000           37.000000   3148.000000   
max     -114.310000     41.950000           52.000000   5698.375000   

       total_bedrooms    population    households  median_income  \
count    20635.000000  20635.000000  20635.000000   20635.000000   
mean       502.689858   1337.121105    469.066731       3.801266   
std        287.261212    765.561218    265.518130       1.657765   
min       -112.080009      3.000000      1.000000       0.499900   
25%        296.000000    787.000000    280.000000       2.563100   
50%        435.000000   1166.000000    409.000000       3.535200   
75%        647.000000   1725.000000    605.000000       4.743700   
max       1173.500000   3132.000000   1092.500000       8.013025   

       median_house_value  rooms_per_household  log_median_income  
count        20635.000000         20635.000000       20635.000000  
mean        205938.952435             5.304655           1.510583  
std         113192.930024             1.246199           0.342970  
min          14999.000000             2.023219           0.405398  
25%         119600.000000             4.440684           1.270631  
50%         179700.000000             5.229091           1.511869  
75%         264700.000000             6.052257           1.748104  
max         482412.500000             8.469878           2.198671  

Separating target and features¶

One of the key steps before splitting our data into train and test sets is to remove the target feature from the predictors and save it in a separate variable. In this way, we make sure that we will not use the target feature as one of the predictors.

2. Separate the 'houses_cleaned' dataset into feature sets (X) and target variable (y). The target variable is the median_house_value

We split the dataset into two parts: independent/input variables and 'median_house_value' as the dependent target variable.

In [10]:
# Assuming 'median_house_value' is the target variable
target = 'median_house_value'

# separating the data into features and target
X = houses_cleaned.drop(target, axis=1) # axis = 1 refers to columns
y = houses_cleaned[target]

Splitting the dataset into training and test sets¶

One of the most important steps when working on a machine learning problem is to split the data into training and test sets. Sklearn has a function that can do this for us.

3. First transform the target variable into its log values, since linear regression works better this way (it reduces the skew). Use np.log() for this transformation. Then use the function train_test_split from sklearn.model_selection to divide 'houses_cleaned' into training (80%) and testing (20%) sets (parameter test_size). Use a fixed random_state=42 for reproducibility. Confirm the split by examining the initial rows of the training set.

Tip: The function train_test_split returns 4 objects, so you can start with X_train, X_test, y_train, y_test = train_test_split()

In [11]:
# Splitting the dataset into features (X) and target variable (y)
y_log_transformed = np.log(houses_cleaned['median_house_value'])

# Using 'train_test_split' to divide the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y_log_transformed, test_size=0.2, random_state=42)

# Confirming the split by examining the initial rows of the training set
print(X_train.head())
print(y_train.head())
       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
19980    -119.31     36.20                23.0       1837.0           332.0   
11231    -117.96     33.81                34.0       1416.0           277.0   
2271     -119.80     36.78                43.0       2382.0           431.0   
8658     -118.38     33.83                40.0       3070.0           570.0   
6248     -117.97     34.05                34.0       2050.0           495.0   

       population  households  median_income  rooms_per_household  \
19980      1064.0       335.0         3.1453             5.483582   
11231       980.0       284.0         4.7772             4.985915   
2271        874.0       380.0         3.5542             6.268421   
8658       1264.0       506.0         5.1626             6.067194   
6248       1832.0       465.0         2.8333             4.408602   

       log_median_income  ocean_proximity_<1h_ocean  ocean_proximity_inland  \
19980           1.421975                      False                    True   
11231           1.753919                       True                   False   
2271            1.516050                      False                    True   
8658            1.818499                       True                   False   
6248            1.343726                       True                   False   

       ocean_proximity_near_bay  ocean_proximity_near_ocean  
19980                     False                       False  
11231                     False                       False  
2271                      False                       False  
8658                      False                       False  
6248                      False                       False  
19980    11.218554
11231    12.114505
2271     11.477298
8658     12.977800
6248     11.955686
Name: median_house_value, dtype: float64

4. Print and verify the dimensions and total number of elements in each of the training and testing sets (X_train, X_test, y_train, y_test) after splitting. Use the attributes shape and size for this.

In [12]:
# Verify the dimensions
print("X_train {} and size {}".format(X_train.shape, X_train.size))
print("X_test {} and size {}".format(X_test.shape, X_test.size))
print("y_train {} and size {}".format(y_train.shape, y_train.size))
print("y_test {} and size {}".format(y_test.shape, y_test.size))
X_train (16508, 14) and size 231112
X_test (4127, 14) and size 57778
y_train (16508,) and size 16508
y_test (4127,) and size 4127

We see that we have 14 predictors. Also, there are 16,508 rows in the training set and 4,127 in the test set.

Feature scaling¶

Now let's apply standardization to our features. This is particularly important for regression models and regularization techniques.

5. Scale the input features using the StandardScaler().

Note: for your training data you can use the function fit_transform() which learns the scaling parameters from training data, and immediately applies them to the training data. However, for the test data you have to use transform() to apply the scaling rules learned from the training data to the test data.

Why not fit on test data? Because that would leak information from the test set into the model, which breaks the rule of fair evaluation. The test set should simulate new, unseen data, so we should not "observe" it during preprocessing.

In [13]:
# Define the standard scaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform both the training and test sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Building prediction model¶

Linear regression is one of the basic algorithms for predicting a quantitative response. To refresh your memory, linear regression assumes a linear relationship between the predictors (X) and the target variable (y). One of the advantages of linear regression is that the results are interpretable, meaning that it is easy to understand the model and its predictions.

6. Create a LinearRegression object and store it in the variable Linear_Regression. Once the model is created you can fit the model on the training data using the function fit(). The documentation is here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

Note: This is a pattern that will come up in most of the models that we will use in the next practicals. First we create the model by calling the constructor (so we create an object, for example Linear_Regression = LinearRegression()) and then we fit the model on the training set (Linear_Regression.fit(X_train, y_train)). Once the model is fit (trained), we can use it to do the predictions on the test set.

In the current question, you need to use the scaled version of X_train that we created in the previous exercise.

In [14]:
# Create a linear regression model
Linear_Regression = LinearRegression()

# Train the model on the scaled training data
Linear_Regression.fit(X_train_scaled, y_train)
Out[14]:
LinearRegression()

If we now want to print the intercept and the coefficients, we can run the following:

In [15]:
print("Intercept:", Linear_Regression.intercept_)

coef_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Coefficient': Linear_Regression.coef_
})
print(coef_df)
Intercept: 12.083924657893421
                       Feature  Coefficient
0                    longitude    -0.306512
1                     latitude    -0.318071
2           housing_median_age     0.047750
3                  total_rooms    -0.128552
4               total_bedrooms     0.282053
5                   population    -0.211067
6                   households     0.089760
7                median_income     0.281451
8          rooms_per_household     0.000819
9            log_median_income     0.086619
10   ocean_proximity_<1h_ocean     0.049764
11      ocean_proximity_inland    -0.078587
12    ocean_proximity_near_bay     0.013416
13  ocean_proximity_near_ocean     0.022889

Can you explain the intercept and the coefficients? Which predictors influence the median price in a negative way and which in a positive way?

The intercept (12.08) is the predicted log(median_house_value) when all features are at their mean values (because the data is standardized).

Each coefficient represents the effect of a 1 standard deviation increase in that feature, holding all others constant.

For example,

  • population = –0.21 tells us that the more people in a block, the lower the home prices, likely due to higher density or competition.
  • median_income = 0.28 tells us that wealthier areas have higher house prices.
  • total_bedrooms = 0.28 means that more bedrooms (holding the other features constant) is associated with more expensive houses.
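
To make the standardized coefficients more tangible, here is a small sketch (assuming the coef_df, scaler and X_train objects created above) that converts the median_income coefficient into an approximate percentage effect on the predicted price.

In [ ]:
# A minimal sketch: translate the standardized median_income coefficient
# into a rough percentage effect on the predicted price.
# Assumes coef_df, scaler and X_train from the cells above.
income_coef = coef_df.loc[coef_df['Feature'] == 'median_income', 'Coefficient'].iloc[0]
income_sd = scaler.scale_[list(X_train.columns).index('median_income')]

# Because the target is log-transformed, a one-standard-deviation increase
# in median_income multiplies the predicted price by roughly exp(coefficient).
print(f"1 SD of median_income ≈ {income_sd:.2f} (in the original units)")
print(f"Effect of +1 SD on price: x{np.exp(income_coef):.2f} "
      f"(≈ {100 * (np.exp(income_coef) - 1):.0f}% higher)")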

Prediction and evaluation¶

Now that we have the model, we can use it to make predictions on new data (the test set).

7. Make predictions on the test data and evaluate the model's performance using R-squared and MSE. For the predictions, you can use the function predict() and save the result into the y_pred variable. This function takes the X test data as a parameter, since we want to use the input features from the test data to make the predictions. The functions r2_score() and mean_squared_error() return the R-squared and MSE values respectively.

In [16]:
y_pred = Linear_Regression.predict(X_test_scaled)
print("R-squared for Linear_Regression:", r2_score(y_test, y_pred))
print("MSE for Linear_Regression:", mean_squared_error(y_test, y_pred))
R-squared for Linear_Regression: 0.7035479323877921
MSE for Linear_Regression: 0.09551544903125564

R-squared indicates the proportion of the total variation in the outcome that is explained by the model. The R-squared is about 0.70, which suggests a reasonably good fit. We also see that the MSE is 0.095.

Since we are predicting log house prices, this is the average squared error in log units. To understand it in real-world terms, take the square root of the MSE and then convert it back from the log scale: RMSE (log) = √0.0955 ≈ 0.309

Back-transformed error = e^0.309 ≈ 1.36

This means the predicted price is typically off by a factor of about 1.36, or around 36%.
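
The same back-of-the-envelope calculation can be done in code; this is a small sketch assuming the y_test and y_pred variables from the cells above.

In [ ]:
# A minimal sketch: turn the MSE on the log scale into a multiplicative error factor.
# Assumes y_test and y_pred from the cells above.
rmse_log = np.sqrt(mean_squared_error(y_test, y_pred))
error_factor = np.exp(rmse_log)
print(f"RMSE (log scale): {rmse_log:.3f}")
print(f"Typical multiplicative error: x{error_factor:.2f} (about {100 * (error_factor - 1):.0f}%)")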

Now we are also going to apply Lasso and Ridge regularization.

Lasso and Ridge Regularization¶

Lasso regression is one of the regularization techniques and can shrink some of the feature coefficients all the way to zero.

We will use cross-validation to find the best lambda parameter.

8. Implement LassoCV() with a range of alphas for cross-validation to identify relevant features and penalize irrelevant ones. In this context, alpha is the same parameter as lambda. Start your code by defining a range of alphas, such as alphas=np.logspace(-4, 4, 100). Also, set the number of folds to 5 (cv=5).

In [17]:
# Prepare LassoCV with a range of alphas for cross-validation
alphas=np.logspace(-4, 4, 100)
lasso_model = LassoCV(alphas=alphas, cv=5)

9. Fit the lasso model on the training set and print the best alpha value. Use the attribute lasso_model.alpha_ to get the best alpha.

In [18]:
lasso_model.fit(X_train_scaled, y_train)

# Extract the best alpha parameter
best_alpha = lasso_model.alpha_
print("Best alpha:", best_alpha)
Best alpha: 0.0001
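
With such a small optimal alpha, Lasso barely penalizes the model here, so we would not expect many coefficients to be dropped. As a quick check, the sketch below (assuming lasso_model and X_train from the cells above) lists which coefficients, if any, were shrunk exactly to zero.

In [ ]:
# A minimal sketch: inspect which coefficients Lasso shrank exactly to zero.
# Assumes lasso_model and X_train from the cells above.
lasso_coefs = pd.Series(lasso_model.coef_, index=X_train.columns)
print("Non-zero coefficients:")
print(lasso_coefs[lasso_coefs != 0].sort_values())
print("\nCoefficients shrunk to zero:")
print(list(lasso_coefs[lasso_coefs == 0].index))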

10. Plot the alpha values against the cross-validated mean squared error and mark the optimal alpha. We give you the structure of the code; you only need to fill in two lines. What does this plot tell us?

In [19]:
# Plot alpha values against mean squared error
plt.figure(figsize=(10, 6))
plt.plot(lasso_model.alphas_, lasso_model.mse_path_.mean(axis = 1), marker='o')
plt.axvline(lasso_model.alpha_, linestyle = "--")
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Mean Squared Error')
plt.title('Mean Squared Error vs. Alpha for Lasso')
plt.legend(["MSE per alpha", "Optimal alpha"])
plt.grid(True)
plt.show()

This plot shows how the mean squared error (MSE) changes depending on the value of the regularization parameter α (alpha) used in Lasso regression:

  • X-axis: Different values of alpha (on a log scale). Small values of alpha (left side) mean less regularization, allowing the model to fit the data more flexibly. Large values of alpha (right side) mean stronger regularization, forcing the model to shrink or eliminate coefficients.
  • Y-axis: Cross-validated mean squared error (MSE). Lower MSE = better predictive performance.

  • The solid blue line shows the average MSE for each alpha.

  • The dashed vertical line marks the optimal alpha, which minimizes MSE across all cross-validation folds.

The MSE is lowest when alpha is small — this means we want a model that keeps most features with only slight regularization. As alpha increases, the model becomes too simple and starts to underfit (higher MSE).
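
If you want to confirm numerically what the plot shows, this small sketch (assuming lasso_model from the cells above) locates the alpha with the lowest mean cross-validated MSE.

In [ ]:
# A minimal sketch: find the alpha with the lowest mean cross-validated MSE.
# Assumes lasso_model from the cells above.
mean_mse = lasso_model.mse_path_.mean(axis=1)
best_idx = mean_mse.argmin()
print("Alpha with lowest CV MSE:", lasso_model.alphas_[best_idx])
print("Lowest mean CV MSE:", mean_mse[best_idx])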

11. Use the Lasso model that is now trained to make the predictions on the test set. Print the R-squared and the MSE values.

In [20]:
# Predict on the test set
lasso_y_pred = lasso_model.predict(X_test_scaled)

# Calculate the Mean Squared Error and R-squared for evaluation
lasso_mse = mean_squared_error(y_test, lasso_y_pred)
lasso_r2 = r2_score(y_test, lasso_y_pred)

print("Lasso Regression MSE:", lasso_mse)
print(f"R-squared for Lasso: {lasso_r2:.10f}")
Lasso Regression MSE: 0.09551612828131519
R-squared for Lasso: 0.7035458242

The second regularization method that we learned is the Ridge regression. Let's also apply ridge regression on our data.

12. Construct a RidgeCV object with a range of alphas. As with Lasso, use 5 folds and the same alphas. Then print the optimal alpha value.

In [21]:
# Prepare RidgeCV with a range of alphas for cross-validation
ridge_model = RidgeCV(alphas = alphas, cv=5)

# Fit the RidgeCV model to find the best alpha
ridge_model.fit(X_train_scaled, y_train)

# Extract the best alpha value
optimal_alpha_ridge = ridge_model.alpha_

print("Optimal Alpha for Ridge:", optimal_alpha_ridge)
Optimal Alpha for Ridge: 3.351602650938848

13. Use the predict() function to generate the predictions on the test set.

In [22]:
# Predict on the test set
ridge_y_pred = ridge_model.predict(X_test_scaled)

14. Calculate the R-squared and the MSE for the ridge regression model

In [23]:
# Calculate the Mean Squared Error and R-squared for evaluation
ridge_mse = mean_squared_error(y_test, ridge_y_pred)
ridge_r2 = r2_score(y_test, ridge_y_pred)

print("Ridge Regression MSE:", ridge_mse)
print(f"R-squared for Ridge: {ridge_r2:.10f}")
Ridge Regression MSE: 0.09551536817719684
R-squared for Ridge: 0.7035481833

Observation:

The R-squared values are essentially the same for all three models (around 0.703). This suggests that all three models explain a similar amount of variance in the dependent variable.
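
One convenient way to see this at a glance is to collect the three models' test-set metrics in a single small table; the sketch below assumes the y_pred, lasso_y_pred and ridge_y_pred predictions computed above.

In [ ]:
# A minimal sketch: summarize the test-set performance of the three models.
# Assumes y_test, y_pred, lasso_y_pred and ridge_y_pred from the cells above.
results = pd.DataFrame({
    'Model': ['Linear Regression', 'Lasso (CV)', 'Ridge (CV)'],
    'MSE': [mean_squared_error(y_test, y_pred),
            mean_squared_error(y_test, lasso_y_pred),
            mean_squared_error(y_test, ridge_y_pred)],
    'R-squared': [r2_score(y_test, y_pred),
                  r2_score(y_test, lasso_y_pred),
                  r2_score(y_test, ridge_y_pred)],
})
print(results.round(4))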

Comparing lasso and ridge regression models¶

A very useful tool is to visualise the model fit and see how the predictions look compared to the actual data.

15. Create comparative visualizations to assess the model fit of Lasso and Ridge regression models.

You can use the regplot() function from seaborn. To make the plot nicer, add the argument scatter_kws={'alpha': 0.5, 'color': 'lightblue'}, and to style the regression line you can add the argument line_kws={'color': 'black', 'linestyle': '--', 'linewidth': 2.5}. Play with the parameters to see how your plot changes.

In [24]:
plt.figure(figsize=(10, 4))

# Lasso Regression Model Fit Plot
plt.subplot(1, 2, 1)
sns.regplot(x=y_test, y=lasso_y_pred, scatter_kws={'alpha':0.5, 'color': 'lightblue'}, line_kws={'color': 'black', 'linestyle': '--', 'linewidth': 2.5})
plt.xlabel('Actual Values', fontsize=8)
plt.ylabel('Predicted by Lasso', fontsize=8)
plt.title('Model Fit: Lasso Regression', fontsize=10)

# Ridge Regression Model Fit Plot
plt.subplot(1, 2, 2)
sns.regplot(x=y_test, y=ridge_y_pred, scatter_kws={'alpha':0.5, 'color': 'lightblue'}, line_kws={'color': 'black', 'linestyle': '--', 'linewidth': 2.5})
plt.xlabel('Actual Values', fontsize=8)
plt.ylabel('Predicted by Ridge', fontsize=8)
plt.title('Model Fit: Ridge Regression', fontsize=10)


plt.tight_layout()
plt.show()

The plots for the Lasso and Ridge regression models indicate that neither model perfectly predicts higher values, with predictions becoming less accurate as actual values increase. At first glance, both models perform similarly, without a clear advantage of one over the other.
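
To look at this underprediction of expensive houses more directly, one option is a residual plot; the sketch below assumes the y_test and ridge_y_pred variables from the cells above.

In [ ]:
# A minimal sketch: residuals (actual - predicted) against actual log prices for Ridge.
# Assumes y_test and ridge_y_pred from the cells above.
residuals = y_test - ridge_y_pred
plt.figure(figsize=(6, 4))
plt.scatter(y_test, residuals, alpha=0.3, color='lightblue')
plt.axhline(0, color='black', linestyle='--', linewidth=1.5)
plt.xlabel('Actual log(median_house_value)')
plt.ylabel('Residual (actual - predicted)')
plt.title('Ridge residuals vs. actual values')
plt.show()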

End of practical!