Practical: Data exploration and visualization¶

Anastasia Giachanou, Tina Shahedi

Machine Learning with Python - Utrecht Summer School

Welcome to the first practical of the course!

In this practical, we are going to get familiar with Python and Google Colab, and then we will do some data exploration and visualization! You can also look at the Python documentation to refresh your knowledge of programming: https://docs.python.org/3/reference/

Google Colaboratory, or "Colab" for short, allows you to write and execute Python in your browser, with:

  • Zero configuration required
  • Free access to GPUs
  • Easy sharing

Colab notebooks are Jupyter notebooks that are hosted by Colab. You can find more detailed introductions to Colab here, but we will also cover the basics.

Objectives¶

After completing your first practical you will be able to:

  • Import data and do data wrangling with pandas
  • Do data filtering
  • Explore your data
  • Plot your data with matplotlib and seaborn

Let's get started!¶

Here we are going to introduce Python and Google Colab. If you feel that you want to refresh your Python skills more, you can also complete the preparations from here: http://giachanou.com/machine-learning/#prepare.

1. Open Colab and create a new empty notebook to work with Python 3!

Go to https://colab.research.google.com/ and login with your account. Then click on "File → New notebook".

If you want to insert a new code cell below the one you are currently in, press Alt + Enter.

If you want to stop your code from running in Colab:

  • Interrupt execution by pressing ctrl + M I, or simply click the stop button.
  • Or: press ctrl + A to select all the code in that cell, then ctrl + X to cut it. The cell is now empty and can be deleted with ctrl + M D or by pressing the delete button. You can paste your code into a new code cell and adjust it.

NB: On MacBooks, use cmd instead of ctrl in shortcuts.

You are of course welcome to use a different IDE if you don't want to use Google Colab.

Necessary libraries

For this practical we are going to use the following libraries:

  • pandas: The main library for data manipulation and analysis (https://pandas.pydata.org/docs/).
  • scikit-learn: The main library for training models, preprocessing data, and evaluating results (https://scikit-learn.org/stable/)
  • numpy: Many ML and data science tools rely on NumPy’s fast, flexible ndarray for handling numerical data
  • matplotlib: The foundational plotting library in Python. Most other libraries (like Seaborn) are built on top of it (https://matplotlib.org/)
  • seaborn: Simplifies complex visualizations like box plots, violin plots, heatmaps — with beautiful styling

2. Install the necessary libraries.

Use the !pip install command and install the packages: numpy, pandas, scikit-learn, matplotlib, and seaborn. For example, to install numpy type !pip install -q numpy

Generally, you only need to install a package once on your computer and then simply import it. In Colab, however, you may need to reinstall a package after reconnecting to a runtime. Note that most of the packages we will use come preinstalled on Google Colab.

In [ ]:
!pip install -q numpy
!pip install -q pandas
!pip install -q scikit-learn
!pip install -q matplotlib
!pip install -q seaborn

3. Import the necessary packages.

The packages are now installed, but to be able to use their functions we have to import them. A common practice is to import the packages all together at the beginning of the code. However, syntactically you can also do it later in the code (but always before you use it). For example, to import numpy type import numpy as np

In [1]:
import numpy as np
import pandas as pd
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler

Load the data¶

Every problem in machine learning starts with the data. Usually, we want to get some information and understand our data.

It is possible that you have your own data that you want to explore. However, there are also many datasets available online, and we can scrape data from the web or use social media APIs. Here are some websites where you can find publicly available datasets:

  • CLARIN Resource Families
  • UCI Machine Learning Repository
  • Kaggle

For this first practical, we are going to use the California housing dataset. This dataset contains house attributes and summary statistics from the 1990 California census.

To begin working with the dataset in Google Colab, you first need to upload the 'housing.csv' file. This can be done by clicking on the 'Files' button located on the left side of the Colab interface (the icon that looks like a folder). You can either drag and drop the file or use the upload button. As an alternative, Google Drive can be mounted within Colab, allowing you to access the dataset directly from there.

4. Read the housing.csv dataset using the read_csv() function, which is part of the pandas library. Store the dataframe in a variable called houses. Check the first lines of the dataframe using head() and the last ones with the tail() function.

As you noticed, houses is a DataFrame. The DataFrame is the core data structure in the pandas library: think of it like a table or spreadsheet in Python (a tiny hand-made example is shown right after this list). It's made up of:

  • Rows: Each row is one observation or record (like a person, product, or house).
  • Columns: Each column represents a feature, variable, or attribute (like age, price, or location).
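
To make this concrete, here is a purely illustrative toy DataFrame built by hand from a dictionary of columns (it relies on the pandas import from above and is not part of the housing dataset):

In [ ]:
# Illustrative only: a small DataFrame built from a dictionary (one key per column)
toy = pd.DataFrame({
    'price': [452600.0, 358500.0, 352100.0],                # a numeric column
    'ocean_proximity': ['NEAR BAY', 'NEAR BAY', 'INLAND']   # a categorical column
})

print(toy.shape)       # (3, 2): three rows (observations), two columns (features)
print(toy['price'])    # select a single column with square brackets
print(toy.iloc[0])     # select a single row by its position
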
In [2]:
#code to mount google drive
#from google.colab import drive
#drive.mount('/content/drive')

houses = pd.read_csv("housing.csv")
In [3]:
#print the first lines of the dataset
houses.head()
Out[3]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
In [4]:
#print the last lines of the dataset
houses.tail()
Out[4]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
20635 -121.09 39.48 25.0 1665.0 374.0 845.0 330.0 1.5603 78100.0 INLAND
20636 -121.21 39.49 18.0 697.0 150.0 356.0 114.0 2.5568 77100.0 INLAND
20637 -121.22 39.43 17.0 2254.0 485.0 1007.0 433.0 1.7000 92300.0 INLAND
20638 -121.32 39.43 18.0 1860.0 409.0 741.0 349.0 1.8672 84700.0 INLAND
20639 -121.24 39.37 16.0 2785.0 616.0 1387.0 530.0 2.3886 89400.0 INLAND

Initially, we observe that our dataset contains 20,640 rows (indexed 0 to 20639) and several columns. We can also see a small sample of our data.

Displaying summary statistics¶

To examine the dataset's descriptive statistics in more detail, we can use the info() and/or describe() functions. These functions give us more information about the data types and some summary statistics.

5. Display basic information about the houses dataframe using the info() function. What can you say from this information? Which columns are numerical (e.g., integers or floats)? Are there any columns with missing data?

In [6]:
print(houses.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
None

We see that we have 10 columns and most of them consist of float data types, except for ocean_proximity, which is stored as an object (string) type in pandas. We also see that the only column with null values is total_bedrooms.

The 10 columns are the following:

  • longitude: A measure of how far west a house is; a higher value is farther west
  • latitude: A measure of how far north a house is; a higher value is farther north
  • housing_median_age: Median age of a house within a block; a lower number is a newer building
  • total_rooms: Total number of rooms within a block
  • total_bedrooms: Total number of bedrooms within a block
  • population: Total number of people residing within a block
  • households: Total number of households (a group of people residing within a home unit) for a block
  • median_income: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
  • median_house_value: Median house value for households within a block (measured in US Dollars)
  • ocean_proximity: Location of the house with respect to the ocean/sea.

6. Now use the describe() function to print the descriptive statistics of the numerical features in the dataset. What can you say regarding the range of the variables?

Some example questions that you can try to answer are:

  • What are the average numbers of rooms and bedrooms per block?
  • What is the range of the median house values in the dataset?
  • Can you identify any potential data quality issues just from this summary?
In [5]:
print(houses.describe())
          longitude      latitude  housing_median_age   total_rooms  \
count  20640.000000  20640.000000        20640.000000  20640.000000   
mean    -119.569704     35.631861           28.639486   2635.763081   
std        2.003532      2.135952           12.585558   2181.615252   
min     -124.350000     32.540000            1.000000      2.000000   
25%     -121.800000     33.930000           18.000000   1447.750000   
50%     -118.490000     34.260000           29.000000   2127.000000   
75%     -118.010000     37.710000           37.000000   3148.000000   
max     -114.310000     41.950000           52.000000  39320.000000   

       total_bedrooms    population    households  median_income  \
count    20433.000000  20640.000000  20640.000000   20640.000000   
mean       537.870553   1425.476744    499.539680       3.870671   
std        421.385070   1132.462122    382.329753       1.899822   
min          1.000000      3.000000      1.000000       0.499900   
25%        296.000000    787.000000    280.000000       2.563400   
50%        435.000000   1166.000000    409.000000       3.534800   
75%        647.000000   1725.000000    605.000000       4.743250   
max       6445.000000  35682.000000   6082.000000      15.000100   

       median_house_value  
count        20640.000000  
mean        206855.816909  
std         115395.615874  
min          14999.000000  
25%         119600.000000  
50%         179700.000000  
75%         264725.000000  
max         500001.000000  

We can see the descriptive statistics of the numerical columns. From these statistics we can see that, on average, a block has about 2,636 rooms and 538 bedrooms in total. We can also see that the maximum median house value is 500,001 and the minimum is 14,999. With this summary you can also spot values that are not expected in any of the columns (for example, a negative minimum in price could indicate an error).

Another important thing that we have to check is whether there are missing values in the dataframe. We already got an idea with the info() function, but we can also print the exact number of null values per column.

7. Check if there are missing values in the houses dataframe. You can use houses.isnull().sum(). This line of code first calls .isnull(), which checks every cell in the dataframe and returns a Boolean table filled with True/False values (True if the cell contains a missing value). Then .sum() adds up the number of True values per column.

In [7]:
# Check for missing values
print(houses.isnull().sum())
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

Firstly, we identify any missing values across the columns by applying the .isnull().sum() function to each column in the dataset, which helps us determine the total number of null entries. As seen above, all columns are complete except for 'total_bedrooms', which has 207 missing values.

We have three choices for dealing with the missing data in the 'total_bedrooms' column:

  1. Remove the missing values.
  2. Fill/impute in the missing values.
  3. Leave the missing values as is.

For this example, we will leave the data as they are and keep exploring. Depending on your problem, you can also decide to remove or impute some values. In this course we did not cover imputation techniques. If you are interested in this topic, you can check this book: https://stefvanbuuren.name/fimd/
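
Although we will keep the missing values as they are, here is a minimal sketch of what options 1 and 2 could look like in pandas (kept commented out so the dataframe is not modified; the median fill is just one simple imputation strategy among many):

In [ ]:
# Option 1 (not applied here): remove the rows where total_bedrooms is missing
# houses_dropped = houses.dropna(subset=['total_bedrooms'])

# Option 2 (not applied here): impute the missing values, e.g. with the column median
# median_bedrooms = houses['total_bedrooms'].median()
# houses_imputed = houses.copy()
# houses_imputed['total_bedrooms'] = houses_imputed['total_bedrooms'].fillna(median_bedrooms)
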

Another thing we can do is to check the number of unique values present in each column of the houses dataset.

8. Determine the number of unique values present in each column of the houses dataset. You can use the function nunique() for this. How many unique values does the ocean_proximity have?

In [8]:
houses.nunique()
Out[8]:
0
longitude 844
latitude 862
housing_median_age 52
total_rooms 5926
total_bedrooms 1923
population 3888
households 1815
median_income 12928
median_house_value 3842
ocean_proximity 5

We know that ocean_proximity is a string type, and from the previous question we see that it has 5 unique values. Let's now see how many times each value appears.

9. Calculate the frequency of each unique value in the 'ocean_proximity' column of the houses dataset. First you can take the values of the column (reminder: to access a single column from a DataFrame, we use square brackets [] with the column name in quotes) and then use the value_counts() function to count how often each unique value occurs.

Is there any value that can be problematic? For example that has very few data?

In [9]:
count_per_unique_value = houses['ocean_proximity'].value_counts()
print(count_per_unique_value)
ocean_proximity
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: count, dtype: int64

Here, we see that the value ISLAND has very few observations. So depending on the context of the problem, we can also decide to remove those 5 rows.

Remove/edit data from the dataframe¶

As we saw in the frequency counts of the ocean_proximity column, the category ISLAND appears very rarely — only 5 times out of more than 20,000 rows. In other words, it represents less than 0.03% of the dataset. Since ISLAND is extremely underrepresented and likely not important for general trends, we can safely remove those rows without significantly affecting our analysis.

10. Remove the rows for which ocean_proximity has the value ISLAND. The function that you need to remove rows from a dataframe is drop().

Here are some instructions to build this line step by step:

  • Step 1: Select rows where ocean_proximity equals "ISLAND". Can you write code to filter only the rows where ocean_proximity is "ISLAND"?
  • Step 2: Get the index of those rows. Now we want to remove those rows — and to do that, we need their index.
  • Step 3: Now drop those rows from the original DataFrame using .drop(). Add inplace=True to modify the DataFrame directly.
In [13]:
houses.drop(houses[houses['ocean_proximity'] == "ISLAND"].index, inplace=True)
# Display the DataFrame to confirm that the rows have been removed
houses['ocean_proximity'].value_counts()
Out[13]:
count
ocean_proximity
<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290

11. Sometimes we need to do some transformations to the values of the dataframe. Let's try this now. Convert the values in the ocean_proximity column by lowering the case (str.lower()) and replacing spaces with underscores (str.replace()).

In [14]:
# Converting 'ocean_proximity' to lower case and replacing spaces with underscores
houses['ocean_proximity'] = houses['ocean_proximity'].str.lower().str.replace(" ", "_")

# Displaying the 'ocean_proximity' column to confirm the changes
houses['ocean_proximity']
Out[14]:
ocean_proximity
0 near_bay
1 near_bay
2 near_bay
3 near_bay
4 near_bay
... ...
20635 inland
20636 inland
20637 inland
20638 inland
20639 inland

20635 rows × 1 columns


Add columns to the dataframe¶

Often we also want to add new columns based on the values of other ones. Here, we are going to add one more column indicating whether the median house value of a block is above or below the dataset median.

12. Add a new column in the dataframe houses called high_value which is 1 if the median house value is above the median of the dataset, otherwise 0.

Here is some help:

  • Step 1: Calculate the median of the median_house_value
  • Step 2: Compare each row to the median
  • Step 3: Convert Boolean to integer and assign to new column. Use .astype(int) to do that
In [15]:
houses['high_value'] = (houses['median_house_value'] > houses['median_house_value'].median()).astype(int)
print(houses.head())
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   

   population  households  median_income  median_house_value ocean_proximity  \
0       322.0       126.0         8.3252            452600.0        near_bay   
1      2401.0      1138.0         8.3014            358500.0        near_bay   
2       496.0       177.0         7.2574            352100.0        near_bay   
3       558.0       219.0         5.6431            341300.0        near_bay   
4       565.0       259.0         3.8462            342200.0        near_bay   

   high_value  
0           1  
1           1  
2           1  
3           1  
4           1  

Another important tool when we explore our data is to group data together and get some statistics per group. Pandas has this functionality. The function we use to create those groups is called groupby().

13. Group by ocean_proximity and calculate the mean of total_rooms for each category. What can you say about the mean of total rooms per ocean proximity category?

In [16]:
# Group by 'ocean_proximity' and calculate the mean total_rooms for each category
average_rooms_by_proximity = houses.groupby('ocean_proximity')['total_rooms'].mean()
print(average_rooms_by_proximity)
ocean_proximity
<1h_ocean     2628.343586
inland        2717.742787
near_bay      2493.589520
near_ocean    2583.700903
Name: total_rooms, dtype: float64

From those mean values we can see that the inland blocks have the highest mean total number of rooms.

Exploratory analysis (graphical)¶

We already started some exploratory analysis with the descriptive statistics. However, to get a better understanding of the data, we can also generate some plots. Among others, we can create histograms, boxplots, and correlation matrices.

To create those plots, we will use Matplotlib, a widely-used Python library for data visualization. If you want to know more about this package, see the documentation here: https://matplotlib.org/ .

Let's first see how to use Matplotlib and what its main components are.

Axes in Matplotlib¶

An Axes in Matplotlib is a single plot within a figure, with essential elements like data limits (controlled by set_xlim() and set_ylim() methods), a title (set_title()), x-label (set_xlabel()), and y-label (set_ylabel()). It's where the data, along with associated labels and ticks, is plotted.

Subplots in Matplotlib¶

Subplots allow for multiple plots (axes) to be arranged in a grid within a single figure, facilitating comparative analysis of different data aspects. You can create subplots using plt.subplots(), which will return a figure and an array of axes, accessible through indexing or row-column notation.
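
As a minimal sketch of these building blocks (using small dummy arrays rather than the housing data), a figure with two axes arranged side by side could look like this:

In [ ]:
# Illustrative only: one figure containing a 1x2 grid of axes
fig, axes = plt.subplots(1, 2, figsize=(8, 3))

x = np.arange(10)

axes[0].plot(x, x ** 2)                 # left axes: a line plot
axes[0].set_title('Left plot')
axes[0].set_xlabel('x')
axes[0].set_ylabel('x squared')
axes[0].set_xlim(0, 9)                  # control the data limits of this axes

axes[1].scatter(x, x)                   # right axes: a scatter plot
axes[1].set_title('Right plot')
axes[1].set_xlabel('x')
axes[1].set_ylabel('x')

fig.tight_layout()
plt.show()
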

Let's now create some histograms using Matplotlib.

For example, we can use the following code to create a histogram of the housing_median_age column. Note that we are using histplot from the Seaborn package to create the histogram in this case.

In [17]:
# Define a figure of size (6,4)
plt.figure(figsize=(6, 4))
# create the histogram using the histplot
sns.histplot(houses['housing_median_age'], kde=True)
# add different elements such as title and x, y labels
plt.title('Distribution of Median Age')
plt.xlabel('Median Age')
plt.ylabel('Frequency')

# Set the limits on the X and Y axes
x_lim = (0, 60)
y_lim = (0, 5000)
plt.xlim(x_lim)
plt.ylim(y_lim)

plt.show()

This histogram visualizes the frequency distribution of the 'housing_median_age' feature.

We can observe that the most common ages are 15–40 years, with frequent peaks around ages like 20, 25, and 35; these are likely common construction periods. We also see a sharp increase at age 52. This suggests that many homes hit the upper limit, possibly due to capping in the data (e.g., all homes older than 52 are grouped into one bin).

14. Create histograms for other variables in the houses dataframe. You can do them separately or place them in a grid. For example, if you want to create 4 subplots in a 2x2 grid you can run fig, axes = plt.subplots(2, 2, figsize=(12, 6)). In this case, you will need to use a for-loop over the flattened axes (for ax, column in zip(axes.flatten(), houses.columns)). What can you say about the distributions of the features?

In [18]:
n_rows = 6
n_cols = 2

# Creating the figure and subplots
fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 6, n_rows * 3))

# Flattens the 2D axes array into a 1D list so we can loop over it easily with zip().
axes = axes.flatten()

# Creating histograms for each column in the DataFrame
for ax, column in zip(axes, houses.columns):
    sns.histplot(houses[column], bins=20, ax=ax)
    ax.set_title(f'Histogram of {column}')
    ax.set_xlabel(column)
    ax.set_ylabel('Frequency')

# Hide any extra subplots
# If the number of columns is less than the number of subplots, this loop removes the extra empty subplots from the figure.
for i in range(len(houses.columns), n_rows * n_cols):
    fig.delaxes(axes[i])

# Adjusting the layout and displaying the figure
fig.tight_layout()
plt.show()

From these histograms, we can see that the data in 'housing_median_age' and 'median_house_value' are concentrated at the left end of their range.

Next, we can also plot some boxplots which are going to show whether there are any potential outliers or not.

15. Create box plots (use the boxplot function from seaborn) for the columns 'total_rooms', 'total_bedrooms', 'population'. Are there any outliers?

In [ ]:
# List of columns to check for outliers
columns_to_check = ['total_rooms', 'total_bedrooms', 'population']

# Create a box plot for each specified column
for column in columns_to_check:
    plt.figure(figsize=(8, 4))
    sns.boxplot(x=houses[column])
    plt.title(f'Box plot of {column}')
    plt.show()

Those boxplots indicate that the variables are not normally distributed. (By default, points that lie more than 1.5 times the interquartile range beyond the quartiles are drawn as individual outlier points.)

For example, we see that for the total rooms and total bedrooms there are many large outliers, which indicates that a small number of houses or blocks have unusually high room counts. There are also a few points that need to be further explored. For this practical, we will leave them as they are.

16. Create a pair plot for the features 'total_bedrooms', 'total_rooms' and 'households'. To do this, you can use the function pairplot from seaborn. This function will create a matrix of scatter plots for each pair of features.

In [19]:
features = ['total_bedrooms', 'total_rooms', 'households']

# Create the pair plot with a specified height for each plot
sns.pairplot(houses[features], height=3)
# Adjust the layout
plt.subplots_adjust(top=1)
# Display the plot
plt.show()

We created a pair plot to examine the relationship among 'total_bedrooms', 'total_rooms', and 'households'. This plot reveals a linear relationship between 'total_bedrooms' and both 'total_rooms' and 'households'. This is logical since the number of bedrooms is included in the total room count of a block and is likely influenced by the household size. From this, we infer that the number of bedrooms cannot exceed the total number of rooms and that we can use other dataset features to estimate the number of bedrooms.

Viewing Correlations¶

In many data science and machine learning tasks, understanding how your features relate to each other — i.e., their correlations — is a critical step in exploratory data analysis (EDA).

Why correlation analysis?

  • When two features are highly correlated, they often carry similar information.
  • Correlation helps decide which features are useful and which ones can be dropped or combined.
  • Some algorithms (e.g., decision trees, random forests) are less sensitive to correlated features. Others (e.g., logistic regression, SVMs, neural networks) can perform better with uncorrelated inputs, especially if you're using regularization.
  • Correlation with the target variable (e.g., house price) helps you identify predictive features. A small pairwise example is shown right after this list.
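
Before building the full correlation matrix in the next exercise, note that the correlation of a single pair of numeric columns can be computed directly (a small illustrative check; the exact value depends on the rows kept after the earlier filtering):

In [ ]:
# Spearman correlation between two columns: a single number between -1 and +1
print(houses['total_rooms'].corr(houses['households'], method='spearman'))
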

17. Investigate the correlation among the features. You can create a heatmap using the heatmap function from seaborn. To get the correlations, you can use the houses.corr() function and set the parameter method='spearman'. Check https://seaborn.pydata.org/generated/seaborn.heatmap.html

In [20]:
# houses.corr(method='spearman', numeric_only=True)
# This calculates a correlation matrix between all numeric columns in the houses DataFrame.

# method='spearman' tells pandas to use Spearman rank correlation, which:
# Measures monotonic relationships (not just linear)
# Is more robust to outliers and non-normal distributions

# numeric_only=True ensures only numeric columns are included (ignoring object types like strings).
# Result: A table of numbers from -1 to +1 showing how strongly each pair of columns is correlated.

# sns.heatmap(...)
# This uses Seaborn's heatmap function to visualize the correlation matrix.
# Parameters:
# annot=True: Shows the actual correlation values inside the heatmap cells.
# cmap='magma': Sets the color theme — 'magma' is a dark, high-contrast colormap that makes strong correlations easier to see.

sns.heatmap(houses.corr(method='spearman', numeric_only=True), annot=True, cmap='magma')
Out[20]:
<Axes: >

Values range from -1 to +1. Values that are close to +1 indicate a strong positive correlation and values close to -1 a strong negative correlation. From this plot, we see that there is high positive correlation among the features total_rooms, total_bedrooms, population and households (yellow colors in the center of the plot).

18. If your dataset contains many columns (sometimes even hundreds), it can be overwhelming to look at the full correlation matrix. Instead, focus on one feature of interest — for example, 'median_house_value'. Extract the correlations between 'median_house_value' and all other numeric features. Then, sort and display these correlations in descending order, from the strongest to the weakest.

In [21]:
# Set the size of the figure for the heatmap
plt.figure(figsize=(2, 4))

correlation_matrix = houses.corr(method='spearman', numeric_only=True)
sns.heatmap(correlation_matrix[['median_house_value']].sort_values(by='median_house_value', ascending=False), annot=True, cmap='cividis')
plt.title('Correlation of Features with Median House Value',size=10)
plt.show()

The heatmap shows that high_value and median_income are positively correlated with median_house_value, meaning higher median income usually indicates higher median house prices.

19. Utilize the matplotlib and seaborn libraries to create a scatter plot (scatterplot() function) with longitude on the x-axis and latitude on the y-axis. Investigate the influence of geographical location on housing prices by visualizing the distribution of median_house_value across different coordinates.

In [22]:
# Plotting the scatter plot for latitude and longitude
plt.figure(figsize=(8, 4))
sns.scatterplot(
    data=houses,
    x='longitude',
    y='latitude',
    size='median_house_value',
    hue='median_house_value',
    palette='magma',
    alpha=0.5)

# Customize the plot
plt.legend(title='Median House Value', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title('Median House Value by Geographical Coordinates')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()

From the plot we see that:

  • House prices vary significantly by location. The color and size of each dot represent the median house value in that area: brighter colors (light yellow/white) = higher house prices; darker colors (dark purple) = lower house prices.
  • Higher house values cluster along the coast. Look closely near longitude ≈ -118 to -122, and latitude ≈ 34 to 38 — that’s around Los Angeles, San Francisco, and coastal areas. These regions show many larger, brighter circles, indicating higher property values.
  • Lower values are inland. As you move east (right) on the x-axis (toward more inland locations), house values generally decrease.

This plot gives a strong visual confirmation that location is a major driver of housing prices. It suggests that geography should be considered as a key feature in any predictive housing model.

You can also first group the data and then create plots for the different categories.

20. Let's say you want to see how house prices differ across the proximity-to-ocean categories. You can group the data by 'ocean_proximity' and calculate the average 'median_house_value' for each category. Then visualize the distribution of 'median_house_value' per category using a box plot (the boxplot() function from seaborn can be used here).

In [23]:
# Group by 'ocean_proximity' and calculate the average median house value for each category
average_prices_by_proximity = houses.groupby('ocean_proximity')['median_house_value'].mean()
print(average_prices_by_proximity)

# Create a boxplot to visualize the results
sns.boxplot(x='ocean_proximity', y='median_house_value', data=houses, palette="Set3", hue='ocean_proximity', legend=False)
plt.title('Boxplot of Median House Value by Ocean Proximity')
plt.xlabel('Ocean Proximity')
plt.ylabel('Median House Value')
plt.show()
ocean_proximity
<1h_ocean     240084.285464
inland        124805.392001
near_bay      259212.311790
near_ocean    249433.977427
Name: median_house_value, dtype: float64

This boxplot shows how median house values vary depending on the location of the house relative to the ocean. Each box represents a different category from the ocean_proximity column.

  • NEAR BAY and NEAR OCEAN: These categories show similar distributions. Median house values are relatively high (around $250,000–$300,000), and there is a wide spread of values, indicating variety in property types or neighborhoods.
  • <1H OCEAN: This group includes houses within one hour of the ocean. Median values are slightly lower than in the direct coastal categories but still on the higher end. The distribution is broad, suggesting a mix of suburban and rural homes in this category.
  • INLAND: This category has the lowest house values overall. The median is well below $150,000. The compressed box and abundance of outliers suggest large, affordable developments with occasional high-value exceptions.

Feature Transformation¶

As we mentioned during the lecture, it is common to transform our features so they are scaled to a similar range.

Min-Max Scaling is a popular approach to normalise numerical data. It compresses all values into the range [0, 1], which is useful for algorithms that require bounded input, such as neural networks.
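For reference, Min-Max Scaling rescales each value as x_scaled = (x - x_min) / (x_max - x_min), where x_min and x_max are the minimum and maximum of the column. For median_income (minimum ≈ 0.4999 and maximum ≈ 15.0001), a value of 8.3252 becomes (8.3252 - 0.4999) / (15.0001 - 0.4999) ≈ 0.54, which matches the first row of the output below.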

21. Apply Min-Max Scaling to median_income and plot its new distribution. You can start with scaler = MinMaxScaler(), which will initialise the Min-Max scaler.

In [24]:
# Apply Min-Max Scaling to 'median_income' and store the result in a new column
scaler = MinMaxScaler()
houses['MedInc_MinMax'] = scaler.fit_transform(houses[['median_income']])

# Print the original and scaled values of 'median_income'
print(houses[['median_income', 'MedInc_MinMax']].head())

# Plot the scaled distribution of 'median_income'
plt.figure(figsize=(8, 4))

# Min-Max Scaled distribution
sns.histplot(houses['MedInc_MinMax'], kde=True, color='green')
plt.title('Min-Max Scaled Distribution of MedInc')
plt.xlabel('MedInc (Min-Max Scaled)')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()
   median_income  MedInc_MinMax
0         8.3252       0.539668
1         8.3014       0.538027
2         7.2574       0.466028
3         5.6431       0.354699
4         3.8462       0.230776

22. Now apply Standardization (StandardScaler()) to median_income and plot its new distribution.
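For reference, standardization rescales each value as z = (x - mean) / std, so the resulting column has mean 0 and standard deviation 1. Unlike Min-Max Scaling, the standardized values are not bounded to a fixed range.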

In [25]:
# define standard scaler
scaler = StandardScaler()

houses['MedInc_Standard'] = scaler.fit_transform(houses[['median_income']])

# Print the original and standardized values of 'median_income'
print(houses[['median_income', 'MedInc_Standard']].head())

# Plot the standardized distribution of 'median_income'
plt.figure(figsize=(8, 4))

# Standard Scaled distribution
sns.histplot(houses['MedInc_Standard'], kde=True, color='green')
plt.title('Standardization of MedInc')
plt.xlabel('MedInc (Standardized)')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()
   median_income  MedInc_Standard
0         8.3252         2.344450
1         8.3014         2.331923
2         7.2574         1.782425
3         5.6431         0.932756
4         3.8462        -0.013024

End of Practical.