Anastasia Giachanou, Tina Shahedi
Machine Learning with Python - Utrecht Summer School
Welcome to the first practical of the course!
In this practical, we are going to get familiar with Python and Google Colab, and then we will do some data exploration and visualization! You can also look at the Python documentation to refresh your knowledge of programming: https://docs.python.org/3/reference/
Google Colaboratory, or "Colab" for short, allows you to write and execute Python in your browser, with:
Zero configuration required
Colab notebooks are Jupyter notebooks that are hosted by Colab. You can find more detailed introductions to Colab here, but we will also cover the basics.
After completing your first practical you will be able to:
explore data with pandas
create visualizations with matplotlib and seaborn
Here we are going to introduce Python and Google Colab. If you feel that you want to refresh your Python skills more, you can also complete the preparations from here: http://giachanou.com/machine-learning/#prepare.
1. Open Colab and create a new empty notebook to work with Python 3!
Go to https://colab.research.google.com/ and login with your account. Then click on "File → New notebook".
If you want to insert a new code chunk below the cell you are currently in, press Alt + Enter.
If you want to stop your code from running in Colab:
Press ctrl + M I or simply click the stop button.
Press ctrl + A to select all the code of that particular cell, then press ctrl + X to cut the entire cell code. Now the cell is empty and can be deleted by using ctrl + M D or by pressing the delete button. You can paste your code in a new code chunk and adjust it.
NB: On MacBooks, use cmd instead of ctrl in shortcuts.
You are of course welcome to use a different IDE if you don't want to use Google Colab.
Necessary libraries
For this practical we are going to use the following libraries:
2. Install the necessary libraries.
Use the !pip install command and install the packages: numpy, pandas, scikit-learn, matplotlib, and seaborn. For example, to install numpy type !pip install -q numpy
Generally, you only need to install each package once on your computer and then simply import it when you need it. In Colab, however, you may need to reinstall a package when you reconnect to a new runtime. That said, many of the packages we will use come preinstalled on Google Colab.
!pip install -q numpy
!pip install -q pandas
!pip install -q scikit-learn
!pip install -q matplotlib
!pip install -q seaborn
3. Import the necessary packages.
The packages are now installed, but to be able to use their functions we have to import them. A common practice is to import the packages all together at the beginning of the code. However, syntactically you can also do it later in the code (but always before you use it). For example, to import numpy type import numpy as np
import numpy as np
import pandas as pd
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler
Every problem in machine learning starts with the data. Usually, we want to get some information and understand our data.
It is possible that you have your own data that you want to explore. However, there are also many publicly available datasets, and we can scrape data from the web or use social media APIs. Here are some websites where you can find publicly available datasets:
For this first practical, we are going to use the California housing dataset. This dataset contains house attributes and summary statistics from the 1990 California census.
To begin working with the dataset in Google Colab, you need to first upload the 'housing.csv' file. This can be accomplished by clicking on the 'Files' button located on the left side of the Colab interface (the icon that looks like a folder). You have the option to either drag and drop the file or use the upload button for this purpose. As an alternative, Google Drive can be mounted within Colab, allowing you to access and upload the dataset directly from there.
4. Read the housing.csv dataset using the read_csv() function, which is part of the pandas library. Store the dataframe in a variable called houses. Check the first lines of the dataframe using head() and the last ones with the tail() function.
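For example, something along these lines should work (assuming you uploaded the file as housing.csv to your Colab session):
# Read the csv file into a DataFrame
houses = pd.read_csv('housing.csv')
print(houses.head())   # first 5 rows
print(houses.tail())   # last 5 rows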
As you noticed, houses is a DataFrame. The DataFrame is the core data structure in the pandas library; think of it as a table or spreadsheet in Python.
It's made up of:
rows, each identified by an index label
columns, where each column is a pandas Series with its own name and data type
To examine the dataset's descriptive statistics in more detail, we can use the info() and/or describe() functions. These functions give information about the data types and some summary statistics.
5. Display basic information about the houses dataframe using the info() function. What can you say from this information? Which columns are numerical (e.g., integers or floats)? Are there any columns with missing data?
6. Now use the describe() function to print the descriptive statistics of the numerical features in the dataset. What can you say regarding the range of the variables?
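A possible sketch for exercises 5 and 6:
houses.info()              # column names, data types and non-null counts
print(houses.describe())   # count, mean, std, min, quartiles and max of the numeric columns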
Some example questions that you can try to answer are:
Another important thing that we have to check is whether there are missing values in the dataframe. We already have an idea from the info() function, but we can also print the exact number of null values per column.
7. Check if there are missing values in the houses dataframe. You can use houses.isnull().sum(). This line of code first calls .isnull(), which checks every cell in the dataframe and returns a Boolean table filled with True/False values (True if the cell contains a missing value). Then .sum() adds up the number of True values per column.
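For example:
# Number of missing values per column
print(houses.isnull().sum())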
We have three choices for dealing with the missing data in the 'total_bedrooms' column:
For this example, we will leave the data as they are and keep exploring our data. Depending on your problem you can also decide to remove or impute some values. In this course we do not cover imputation techniques. If you are interested in this topic you can check this book: https://stefvanbuuren.name/fimd/
Another thing we can do is to check the number of unique values present in each column of the houses dataset.
8. Determine the number of unique values present in each column of the houses dataset. You can use the function nunique() for this. How many unique values does the ocean_proximity have?
9. Calculate the frequency of each unique value in the 'ocean_proximity' column of the houses dataset. First you can take the values of the column (reminder: to access a single column from a DataFrame, we use square brackets [] with the column name in quotes) and then use the value_counts() function to count how often each unique value occurs.
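One possible solution for exercises 8 and 9:
print(houses.nunique())                           # number of unique values per column
print(houses['ocean_proximity'].value_counts())   # frequency of each category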
Is there any value that could be problematic? For example, one that has very few observations?
As we saw in the frequency counts of the ocean_proximity column, the category ISLAND appears very rarely — only 5 times out of more than 20,000 rows. In other words, it represents less than 0.03% of the dataset. Since ISLAND is extremely underrepresented and likely not important for general trends, we can safely remove those rows without significantly affecting our analysis.
10. Remove the rows for which ocean_proximity has the value ISLAND. The function that you need to remove data from a dataframe is drop().
Here are some instructions to build this line step by step:
First, find the rows where ocean_proximity equals "ISLAND". Can you write code to filter only the rows where ocean_proximity is "ISLAND"?
Then pass the index of those rows to drop(), and use inplace=True to modify the DataFrame directly.
11. Sometimes we need to do some transformation to the values of the dataframe. Let's try to do this now. Convert the values in the ocean_proximity column by lowering the case (str.lower()) and replacing spaces with underscores (str.replace()).
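One possible sketch for exercises 10 and 11 (there are other valid ways to do this):
# Exercise 10: select the rows where ocean_proximity is "ISLAND" and drop them in place
island_rows = houses[houses['ocean_proximity'] == 'ISLAND'].index
houses.drop(island_rows, inplace=True)
# Exercise 11: lower-case the categories and replace spaces with underscores
houses['ocean_proximity'] = houses['ocean_proximity'].str.lower().str.replace(' ', '_')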
Often we also want to add new columns based on the values of other ones. Here, we are going to add one more column to indicate whether the median house value is above or below the median of the dataset.
12. Add a new column in the dataframe houses called high_value which is 1 if the median house value is above the median of the dataset, otherwise 0.
Here is some help:
Compare the values in median_house_value with the median of that column. The result is a column of True/False values.
Convert these Boolean values to 1/0. You can use .astype(int) to do that.
Another important tool when we explore our data is to group data together and get some statistics per group. Pandas has this functionality. The function we use to create those groups is called groupby().
13. Group by ocean_proximity and calculate the mean of total_rooms for each category. What can you say about the mean of the total rooms per ocean proximity category?
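One possible sketch for exercises 12 and 13:
# Exercise 12: flag districts whose median house value is above the dataset median
median_value = houses['median_house_value'].median()
houses['high_value'] = (houses['median_house_value'] > median_value).astype(int)
# Exercise 13: mean of total_rooms per ocean_proximity category
print(houses.groupby('ocean_proximity')['total_rooms'].mean())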
We already started some exploratory analysis with the descriptive statistics. However, to get a better understanding of the data we can also generate some plots. Among others, we can create histograms, boxplots and correlation matrices.
To create those plots, we will use Matplotlib, a widely-used Python library for data visualization. If you want to know more about this package, see the documentation here: https://matplotlib.org/ .
Let's first see how to use Matplotlib and what its main components are.
An Axes in Matplotlib is a single plot within a figure, with essential elements like data limits (controlled by set_xlim() and set_ylim() methods), a title (set_title()), x-label (set_xlabel()), and y-label (set_ylabel()). It's where the data, along with associated labels and ticks, is plotted.
Subplots allow for multiple plots (axes) to be arranged in a grid within a single figure, facilitating comparative analysis of different data aspects. You can create subplots using plt.subplots(), which will return a figure and an array of axes, accessible through indexing or row-column notation.
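Here is a minimal example of these components in action, using made-up data just for illustration:
# A minimal figure with a single Axes
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([1, 2, 3, 4], [1, 4, 9, 16])   # plot some toy data on this Axes
ax.set_title('A single Axes')
ax.set_xlabel('x values')
ax.set_ylabel('y values')
ax.set_xlim(0, 5)
ax.set_ylim(0, 20)
plt.show()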
Let's now create some histograms using Matplotlib.
For example, we can use the following code to create a histogram of the housing_median_age column. Note that we are using the histplot function from the Seaborn package to create the histogram in this case.
# Define a figure of size (6,4)
plt.figure(figsize=(6, 4))
# create the histogram using the histplot
sns.histplot(houses['housing_median_age'], kde=True)
# add different elements such as title and x, y labels
plt.title('Distribution of Median Age')
plt.xlabel('Median Age')
plt.ylabel('Frequency')
# Set the limits on the X and Y axes
x_lim = (0, 60)
y_lim = (0, 5000)
plt.xlim(x_lim)
plt.ylim(y_lim)
plt.show()
This histogram visualizes the frequency distribution of the 'housing_median_age' feature. Which ages are the most common? Is there a sharp increase anywhere? Why?
14. Create histograms for other variables in the houses dataframe. You can create them separately or place them in a grid. For example, if you want to create 4 subplots in a 2x2 grid you can run fig, axes = plt.subplots(2, 2, figsize=(12, 6)). In this case, you will need to use a for-loop (for ax, column in zip(axes, houses.columns)). We give you the structure of the code to help you start. What can you say about the distributions of the features?
# You can start with this structure
# 1. Choose the number of rows and columns for the plot grid
n_rows = 6 # Example: 6 rows
n_cols = 2 # Example: 2 columns
# 2. Create a grid of subplots using matplotlib
# Tip: Use plt.subplots(...) and set the figure size
fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 6, n_rows * 3))
# 3. Flatten the axes array (from 2D to 1D) so it's easier to loop over
# Use .flatten() here
axes = axes.flatten()
# 4. Loop over both axes and column names using zip()
# In each loop:
# - Plot a histogram with sns.histplot(...)
# - Set the title and axis labels for each plot
# for ax, column in zip(axes, houses.columns):
#Your code goes here: Implement the rest of the for loop
# 5. OPTIONAL: If there are more axes than columns, hide the unused plots
for i in range(len(houses.columns), n_rows * n_cols):
    fig.delaxes(axes[i])  # Removes empty subplots
# 6. Use tight_layout() to fix spacing and show the figure
fig.tight_layout()
plt.show()
Next, we can also plot some boxplots which are going to show whether there are any potential outliers or not.
15. Create box plots (use the boxplot function from seaborn) for the columns 'total_rooms', 'total_bedrooms', 'population'. Are there any outliers?
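One possible way to place the three box plots side by side (one of several valid layouts):
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, column in zip(axes, ['total_rooms', 'total_bedrooms', 'population']):
    sns.boxplot(y=houses[column], ax=ax)   # one box plot per column
    ax.set_title(column)
plt.tight_layout()
plt.show()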
16. Create a pair plot for the features 'total_bedrooms', 'total_rooms' and 'households'. To do this, you can use the function pairplot from seaborn. This function will create a matrix of scatter plots for each pair of features.
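For example:
# Scatter plots for each pair of the selected features
sns.pairplot(houses[['total_bedrooms', 'total_rooms', 'households']])
plt.show()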
In many data science and machine learning tasks, understanding how your features relate to each other — i.e., their correlations — is a critical step in exploratory data analysis (EDA).
Why correlation analysis?
17. Investigate the correlation among the features. You can create a heatmap using the heatmap function from seaborn. To get the correlations, you can use the houses.corr() function and set the parameter method='spearman'. Check https://seaborn.pydata.org/generated/seaborn.heatmap.html
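A possible sketch; depending on your pandas version you may need numeric_only=True (or select the numeric columns first), since ocean_proximity is not numeric:
# Spearman correlations between the numeric columns
corr = houses.corr(method='spearman', numeric_only=True)
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Spearman correlation matrix')
plt.show()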
18. If your dataset contains many columns (sometimes even hundreds), it can be overwhelming to look at the full correlation matrix. Instead, focus on one feature of interest — for example, 'median_house_value'. Extract the correlations between 'median_house_value' and all other numeric features. Then, sort and display these correlations in descending order, from the strongest to the weakest.
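One possible solution:
corr = houses.corr(method='spearman', numeric_only=True)
# Correlations with median_house_value, from strongest to weakest
print(corr['median_house_value'].sort_values(ascending=False))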
19. Utilize the matplotlib and seaborn libraries to create a scatter plot (scatterplot() function) with longitude on the x-axis and latitude on the y-axis. Investigate the influence of geographical location on housing prices by visualizing the distribution of median_house_value across different coordinates.
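A possible sketch using seaborn's scatterplot, with the colour (hue) encoding the median house value:
plt.figure(figsize=(8, 6))
sns.scatterplot(data=houses, x='longitude', y='latitude',
                hue='median_house_value', palette='viridis', alpha=0.5)
plt.title('Median house value by location')
plt.show()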
You can also first group the data and then create the plots for the different categories.
20. Let's say you want to see how house prices differ across different proximity-to-ocean categories. You can group the data by 'ocean_proximity' and calculate the average 'median_house_value' for each category. Plot these results using a box plot (boxplot() function from seaborn can also be used here).
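One possible sketch (the box plot shows the full distribution per category; the groupby line prints the averages for comparison):
plt.figure(figsize=(8, 5))
sns.boxplot(data=houses, x='ocean_proximity', y='median_house_value')
plt.title('Median house value per ocean proximity category')
plt.show()
# Average median_house_value per category
print(houses.groupby('ocean_proximity')['median_house_value'].mean())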
As we mentioned during the lecture, it is common to transform our features so they are scaled to a similar range.
Min-Max Scaling is a popular approach to normalise numerical data. It compresses all values into the range [0, 1], which is useful for algorithms that require bounded input, such as neural networks.
21. Apply Min-Max Scaling to median_income and plot its new distribution. You can start with scaler = MinMaxScaler(), which initialises the Min-Max scaler.
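A possible sketch; the column name median_income_minmax is just an arbitrary choice for storing the scaled values:
scaler = MinMaxScaler()
# fit_transform returns a 2D array, so we flatten it before storing it in a new column
houses['median_income_minmax'] = scaler.fit_transform(houses[['median_income']]).ravel()
sns.histplot(houses['median_income_minmax'], kde=True)
plt.title('median_income after Min-Max scaling')
plt.show()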
22. Now apply standardization (StandardScaler()) to median_income and plot its new distribution.
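A similar sketch for standardization (again, median_income_std is an arbitrary column name):
scaler = StandardScaler()
houses['median_income_std'] = scaler.fit_transform(houses[['median_income']]).ravel()
sns.histplot(houses['median_income_std'], kde=True)
plt.title('median_income after standardization')
plt.show()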
End of Practical.