Anastasia Giachanou, Tina Shahedi
Machine Learning with Python - Utrecht Summer School
Welcome to the first practical of the course!
In this practical, we are going to get familiar with Python and Google Colab, and then we will do some data exploration and visualization! You can also look at the Python documentation to refresh your knowledge of programming: https://docs.python.org/3/reference/
Google Colaboratory, or "Colab" for short, allows you to write and execute Python in your browser, with zero configuration required.
Colab notebooks are Jupyter notebooks that are hosted by Colab. You can find more detailed introductions to Colab here, but we will also cover the basics.
After completing this lab you will be able to:
- work with Python in a Google Colab notebook,
- load a dataset with pandas and explore it with descriptive statistics,
- create basic visualizations with matplotlib and seaborn,
- apply simple feature scaling (Min-Max scaling and standardization).
Here we are going to introduce Python and Google Colab. If you feel that you want to refresh your Python skills more, you can also complete the preparations from here: http://giachanou.com/machine-learning/#prepare.
1. Open Colab and create a new empty notebook to work with Python 3!
Go to https://colab.research.google.com/ and login with your account. Then click on "File → New notebook".
Some useful keyboard shortcuts when working with cells:
- To run the current cell and insert a new code cell below it, press Alt + Enter.
- To stop code that is running, press Ctrl + M I or simply click the stop button.
- To move the code of a cell, press Ctrl + A to select all the code of that cell and Ctrl + X to cut it. The cell is now empty and can be deleted with Ctrl + M D or by pressing the delete button. You can then paste the code into a new code cell and adjust it.
NB: On MacBooks, use cmd instead of ctrl in these shortcuts.
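Once the notebook is open, type a line of Python in the first code cell and run it with Shift + Enter (or the play button next to the cell) to check that everything works; for example:
# A first code cell
print('Hello, Colab!')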
2. Install the necessary libraries.
Use the !pip install
command and install the packages: numpy
, pandas
, scikit-learn
, matplotlib
, and seaborn
. For example, to install numpy
type !pip install -q numpy
Generally, you only need to install a package once on your computer and can then simply import it whenever you need it. In Colab, however, you may need to reinstall a package after your runtime is reset or you reconnect. That said, many of the packages we will use come pre-installed on Google Colab.
!pip install -q numpy
!pip install -q pandas
!pip install -q scikit-learn
!pip install -q matplotlib
!pip install -q seaborn
3. Import the necessary packages.
The packages are now installed, but to be able to use their functions we have to import them. A common practice is to import the packages all together at the beginning of the code. However, syntactically you can also do it later in the code (but always before you use it). For example, to import numpy
type import numpy as np
import numpy as np
import pandas as pd
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler
Every problem in machine learning starts with the data. Usually, we first want to get some information about our data and understand them.
It is possible that you have your own data that you want to explore. However, there are also many publicly available datasets online, and we can scrape data from the web or use social media APIs.
For this first practical, we are going to use the California housing dataset. This dataset contains house attributes and summary statistics from the 1990 California census.
To begin working with the dataset in Google Colab, you need to first upload the 'housing.csv'
file. This can be accomplished by clicking on the 'Files'
button located on the left side of the Colab interface (the icon that looks like a folder). You have the option to either drag and drop the file or use the upload button for this purpose. As an alternative, Google Drive can be mounted within Colab, allowing you to access and upload the dataset directly from there.
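As a small sketch (this only works inside Colab), you can also trigger the upload dialog from code with the files helper of the google.colab package:
# Optional: upload housing.csv from code instead of the Files pane (Colab only)
from google.colab import files
uploaded = files.upload()  # opens a file picker; the file is saved in the working directory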
4. Read the housing.csv
dataset using the read_csv()
function. Store the dataframe
to a variable called houses
. Check the first lines of the dataframe using head()
and the last ones with the tail()
function
#code to mount google drive
#from google.colab import drive
#drive.mount('/content/drive')
houses = pd.read_csv("housing.csv")
#print the first lines of the dataset
houses.head()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
#print the last lines of the dataset
houses.tail()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|---|
20635 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 | INLAND |
20636 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 | INLAND |
20637 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 | 92300.0 | INLAND |
20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | INLAND |
20639 | -121.24 | 39.37 | 16.0 | 2785.0 | 616.0 | 1387.0 | 530.0 | 2.3886 | 89400.0 | INLAND |
Initially, we observe that our dataset contains 20,640 rows (indexed 0 to 20639) and 10 columns, and head() and tail() show us a small sample of the data.
To examine the dataset in more detail, we can use the info() and/or describe() functions. These functions give more information about the data types and some descriptive statistics.
5. Display basic information about the houses dataframe using the info()
function. What can you say from this information? What type of data do the columns contain?
print(houses.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
None
We see that we have 10 columns and most of them consist of float
data types, except for ocean_proximity
, which is an object of the string type in pandas
. Also, we see that the only column with null values is the total_bedrooms
.
The 10 columns are the following:
- longitude: how far west the block is (geographic coordinate)
- latitude: how far north the block is (geographic coordinate)
- housing_median_age: median age of the houses in the block
- total_rooms: total number of rooms in the block
- total_bedrooms: total number of bedrooms in the block
- population: number of people living in the block
- households: number of households in the block
- median_income: median income of the households in the block (in tens of thousands of dollars)
- median_house_value: median house value in the block (in dollars)
- ocean_proximity: location of the block with respect to the ocean (categorical)
6. Now use the describe()
function to print the descriptive statistics of the numerical features in the dataset. What can you say regarding the range of the variables?
print(houses.describe())
          longitude      latitude  housing_median_age   total_rooms
count  20640.000000  20640.000000        20640.000000  20640.000000
mean    -119.569704     35.631861           28.639486   2635.763081
std        2.003532      2.135952           12.585558   2181.615252
min     -124.350000     32.540000            1.000000      2.000000
25%     -121.800000     33.930000           18.000000   1447.750000
50%     -118.490000     34.260000           29.000000   2127.000000
75%     -118.010000     37.710000           37.000000   3148.000000
max     -114.310000     41.950000           52.000000  39320.000000

       total_bedrooms    population    households  median_income
count    20433.000000  20640.000000  20640.000000   20640.000000
mean       537.870553   1425.476744    499.539680       3.870671
std        421.385070   1132.462122    382.329753       1.899822
min          1.000000      3.000000      1.000000       0.499900
25%        296.000000    787.000000    280.000000       2.563400
50%        435.000000   1166.000000    409.000000       3.534800
75%        647.000000   1725.000000    605.000000       4.743250
max       6445.000000  35682.000000   6082.000000      15.000100

       median_house_value
count        20640.000000
mean        206855.816909
std         115395.615874
min          14999.000000
25%         119600.000000
50%         179700.000000
75%         264725.000000
max         500001.000000
We can see the descriptive statistics of the numerical columns. From these statistics we see that, on average, a block has about 2,636 rooms and 538 bedrooms in total. Also, the maximum median house value is 500,001 and the minimum is 14,999. With this summary you can also check whether any column contains unexpected values (for example, a negative minimum in a price column could indicate an error).
Another important thing that we have to check is whether there are missing values in the dataframe. We already have an idea from the info() function, but we can also print the exact number of null values per column.
7. Check if there are missing values in the houses
dataframe. You can use the houses.isnull().sum()
# Check for missing values
print(houses.isnull().sum())
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64
Firstly, we identify any missing values across the columns by applying the .isnull().sum()
function to each column in the dataset, which helps us determine the total number of null entries. As seen above, all columns are complete except for 'total_bedrooms'
, which has 207 missing values.
We have three choices for dealing with the missing data in the 'total_bedrooms' column: remove the rows that contain missing values, remove the whole column, or impute the missing values (for example, fill them with the column median). For this example, we will leave the data as they are and keep exploring; depending on your problem, you can decide to remove or impute some values. Imputation techniques are not covered in this course, but a small sketch of the three options is shown below for reference.
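A minimal sketch of what those three options could look like in pandas (none of these are applied to the houses dataframe in this practical; the variable names are just for illustration):
# Option A: drop the rows with a missing total_bedrooms value
houses_drop_rows = houses.dropna(subset=['total_bedrooms'])
# Option B: drop the whole column
houses_drop_col = houses.drop(columns=['total_bedrooms'])
# Option C: impute the missing values with the column median
houses_imputed = houses.copy()
houses_imputed['total_bedrooms'] = houses_imputed['total_bedrooms'].fillna(
    houses_imputed['total_bedrooms'].median())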
Another thing we can do is to check the number of unique values present in each column of the houses
dataset.
8. Determine the number of unique values present in each column of the houses
dataset. You can use the function nunique()
for this
houses.nunique()
longitude               844
latitude                862
housing_median_age       52
total_rooms            5926
total_bedrooms         1923
population             3888
households             1815
median_income         12928
median_house_value     3842
ocean_proximity           5
dtype: int64
We know that ocean_proximity is of string type, and from the previous question we now see that it has 5 unique values. Let's see how many times each value appears.
9. Calculate the frequency of each unique value in the 'ocean_proximity
' column of the houses
dataset. First you can take the values of the column and then use the value_counts()
function
count_per_unique_value = houses['ocean_proximity'].value_counts()
print(count_per_unique_value)
ocean_proximity
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: count, dtype: int64
Here, we see that the value ISLAND has very few observations compared to the rest of the values: only 5 rows. Removing these rows will not have any big impact, so depending on the context of the problem we can decide to drop them.
10. Remove the rows for which the ocean proximity has the value of ISLAND
. The function that you need to remove data from a dataframe is drop()
houses.drop(houses[houses['ocean_proximity'] == "ISLAND"].index, inplace=True)
# Display the DataFrame to confirm that the rows have been removed
houses
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20635 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 | INLAND |
20636 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 | INLAND |
20637 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 | 92300.0 | INLAND |
20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | INLAND |
20639 | -121.24 | 39.37 | 16.0 | 2785.0 | 616.0 | 1387.0 | 530.0 | 2.3886 | 89400.0 | INLAND |
20635 rows × 10 columns
11. Sometimes we need to do some transformation of the values of the dataframe. Let's try to do this now. Convert the values in the ocean_proximity column to lower case (str.lower()) and replace the spaces with underscores (str.replace()).
# Converting 'ocean_proximity' to lower case and replacing spaces with underscores
houses['ocean_proximity'] = houses['ocean_proximity'].str.lower().str.replace(" ", "_")
# Displaying the 'ocean_proximity' column to confirm the changes
houses['ocean_proximity']
0        near_bay
1        near_bay
2        near_bay
3        near_bay
4        near_bay
           ...
20635      inland
20636      inland
20637      inland
20638      inland
20639      inland
Name: ocean_proximity, Length: 20635, dtype: object
Often we also want to add new columns based on the values of other ones. Here, we are going to add one more column that indicates whether the median house value of a block is above or below the median of the whole dataset.
12. Add a new column in the dataframe houses
called high_value
which is 1 if the median house value is above the median of the dataset, otherwise 0.
houses['high_value'] = (houses['median_house_value'] > houses['median_house_value'].median()).astype(int)
print(houses.head())
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0
1    -122.22     37.86                21.0       7099.0          1106.0
2    -122.24     37.85                52.0       1467.0           190.0
3    -122.25     37.85                52.0       1274.0           235.0
4    -122.25     37.85                52.0       1627.0           280.0

   population  households  median_income  median_house_value ocean_proximity  \
0       322.0       126.0         8.3252            452600.0        near_bay
1      2401.0      1138.0         8.3014            358500.0        near_bay
2       496.0       177.0         7.2574            352100.0        near_bay
3       558.0       219.0         5.6431            341300.0        near_bay
4       565.0       259.0         3.8462            342200.0        near_bay

   high_value
0           1
1           1
2           1
3           1
4           1
Another important tool when we explore our data is to group the data and get some statistics per group. Pandas offers this functionality through the groupby() function.
13. Group by ocean_proximity and calculate the mean of total_rooms for each category. What can you say about the mean total number of rooms per ocean proximity category?
# Group by 'ocean_proximity' and calculate the mean total_rooms for each category
average_rooms_by_proximity = houses.groupby('ocean_proximity')['total_rooms'].mean()
print(average_rooms_by_proximity)
ocean_proximity
<1h_ocean     2628.343586
inland        2717.742787
near_bay      2493.589520
near_ocean    2583.700903
Name: total_rooms, dtype: float64
From those mean values we can see that the blocks that are INLAND have the highest mean total number of rooms.
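If you want several statistics per group at once, groupby() can be combined with agg(); a small sketch (the column choice here is just an example):
# Mean, median and count of total_rooms per ocean_proximity category
rooms_stats = houses.groupby('ocean_proximity')['total_rooms'].agg(['mean', 'median', 'count'])
print(rooms_stats)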
We already started some exploratory analysis with the descriptive statistics. However, to get a better understanding of the data we can further generate some plots. We can create among others histograms, boxplots and correlation matrices.
To create those plots, we will use Matplotlib
, a widely-used Python library for data visualization. If you want to know more about this package, see the documentation here: https://matplotlib.org/ .
Let's first see how to use Matplotlib and which are its main components.
An Axes in Matplotlib is a single plot within a figure, with essential elements like data limits (controlled by set_xlim()
and set_ylim()
methods), a title (set_title()
), x-label (set_xlabel()
), and y-label (set_ylabel()
). It's where the data, along with associated labels and ticks, is plotted.
Subplots allow for multiple plots (axes
) to be arranged in a grid within a single figure, facilitating comparative analysis of different data aspects. You can create subplots using plt.subplots()
, which will return a figure and an array of axes, accessible through indexing or row-column notation.
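As a minimal sketch of these pieces (a figure, a grid of axes, titles, labels and limits), here is one way it could look; the column choices are just examples:
# One figure with a 1x2 grid of axes
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Left axes: plot directly on the Axes object and set its elements
axes[0].hist(houses['housing_median_age'], bins=20)
axes[0].set_title('Housing median age')
axes[0].set_xlabel('Median age')
axes[0].set_ylabel('Frequency')
axes[0].set_xlim(0, 60)
# Right axes: a second plot in the same figure
axes[1].hist(houses['median_income'], bins=20)
axes[1].set_title('Median income')
axes[1].set_xlabel('Median income')
axes[1].set_ylabel('Frequency')
fig.tight_layout()
plt.show()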
Let's now create some histograms using Matplotlib.
For example, we can use the following code to create a histogram of the housing_median_age column. Note that we are using the histplot function from the Seaborn package to create the histogram in this case.
# Define a figure of size (6,4)
plt.figure(figsize=(6, 4))
# create the histogram using the histplot
sns.histplot(houses['housing_median_age'], kde=True)
# add different elements such as title and x, y labels
plt.title('Distribution of Median Age')
plt.xlabel('Median Age')
plt.ylabel('Frequency')
# Set the limits on the X and Y axes
x_lim = (0, 60)
y_lim = (0, 5000)
plt.xlim(x_lim)
plt.ylim(y_lim)
plt.show()
This histogram visualizes the frequency distribution of the 'housing_median_age
' feature.
14. Create histograms for other variables in the houses
dataframe. You can use the plt.subplots
to place them in a grid. For example, if you want to create 2 subplots in a 2X2 grid you can run fig, axes = plt.subplots(2, 2, figsize=(12, 6))
. What can you say about the distributions of the features?
n_rows = 6
n_cols = 2
# Creating the figure and subplots
fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 6, n_rows * 3))
# Flattening the axes array for easier iteration
axes = axes.flatten()
# Creating histograms for each column in the DataFrame
for ax, column in zip(axes, houses.columns):
sns.histplot(houses[column], bins=20, ax=ax)
ax.set_title(f'Histogram of {column}')
ax.set_xlabel(column)
ax.set_ylabel('Frequency')
# Hide any extra subplots
for i in range(len(houses.columns), n_rows * n_cols):
fig.delaxes(axes[i])
# Adjusting the layout and displaying the figure
fig.tight_layout()
plt.show()
From these histograms, we can see that several features (for example 'total_rooms', 'population' and 'median_house_value') are concentrated at the left end of their range, i.e., their distributions are right-skewed.
Next, we can also plot some boxplots which are going to show whether there are any potential outliers or not.
15. Create box plots (use the boxplot
function from seaborn
) for the columns 'total_rooms'
, 'total_bedrooms'
, 'population'
. Determine if there are any outliers.
# List of columns to check for outliers
columns_to_check = ['total_rooms', 'total_bedrooms', 'population']
# Create a box plot for each specified column
for column in columns_to_check:
plt.figure(figsize=(8, 4))
sns.boxplot(x=houses[column])
plt.title(f'Box plot of {column}')
plt.show()
These boxplots show that the variables are far from normally distributed: most values sit in a narrow range, and there are many points above the upper whisker that could be flagged as outliers and explored further. For this practical, we will leave them as they are.
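If you want to quantify this, seaborn's boxplot draws its whiskers at 1.5 times the interquartile range (IQR) by default; a small sketch of counting the points beyond that range for total_rooms (not applied further in this practical):
# Count how many total_rooms values fall outside the 1.5*IQR whiskers
q1 = houses['total_rooms'].quantile(0.25)
q3 = houses['total_rooms'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
n_outliers = ((houses['total_rooms'] < lower) | (houses['total_rooms'] > upper)).sum()
print(f'Potential outliers in total_rooms: {n_outliers}')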
16. Create a pair plot for the features 'total_bedrooms'
, 'total_rooms'
and 'households'
. To do this, you can use the function pairplot
from seaborn. This function will create a matrix of scatter plots for each pair of features.
features = ['total_bedrooms', 'total_rooms', 'households']
# Create the pair plot with a specified height for each plot
sns.pairplot(houses[features], height=3)
# Adjust the layout
plt.subplots_adjust(top=1)
# Display the plot
plt.show()
We created a pair plot to examine the relationship among 'total_bedrooms'
, 'total_rooms'
, and 'households'
. This plot reveals a linear relationship between 'total_bedrooms'
and both 'total_rooms'
and 'households'
. This is logical since the number of bedrooms is included in the total room count of a block and is likely influenced by the household size. From this, we infer that the number of bedrooms cannot exceed the total number of rooms and that we can use other dataset features to estimate the number of bedrooms.
In several tasks, we are also interested in seeing whether there are correlations among the features.
17. Investigate the correlation among the features. You can create a heatmap using the heatmap
function from seaborn. To get the correlations, you can use the houses.corr()
function and set the parameter method='spearman'
# Set the size of the figure for the heatmap
plt.figure(figsize=(8, 6))
# Calculate the Spearman correlation matrix and create the heatmap (we specify numeric_only=True)
sns.heatmap(houses.corr(method='spearman', numeric_only=True), annot=True, cmap='magma')
# Add a title to the heatmap
plt.title('Spearman Correlation Among Numeric Features', size=10)
# Display the heatmap
plt.show()
Values range from -1 to +1. Values that are close to +1 indicate a strong positive correlation and values close to -1 a strong negative correlation. From this plot, we see that there is high positive correlation among the features total_rooms
, total_bedrooms
, population
and households
(yellow colors in the center of the plot).
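For a single pair of columns you can also compute the coefficient directly with the corr() method of a Series; a quick sketch (the pair of columns is just an example):
# Spearman correlation between two individual columns
rho = houses['total_rooms'].corr(houses['households'], method='spearman')
print(round(rho, 3))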
18. If your dataframe has many columns (sometimes there can be hundreds of features), you can also choose to view the correlations of one feature with the rest by selecting that feature's column from the correlation matrix. Generate the correlations of median_house_value with the rest of the features. Once you do that, try to change the code so it shows the correlations in descending order (from high to low).
# Set the size of the figure for the heatmap
plt.figure(figsize=(2, 4))
correlation_matrix = houses.corr(method='spearman', numeric_only=True)
sns.heatmap(correlation_matrix[['median_house_value']].sort_values(by='median_house_value', ascending=False), annot=True, cmap='cividis')
plt.title('Correlation of Features with Median House Value',size=10)
plt.show()
The heatmap shows that high_value and median_income are positively correlated with median_house_value, meaning that a higher median income usually goes together with a higher median house value (the very high correlation of high_value is expected, since it was derived directly from median_house_value).
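With hundreds of features you may prefer to skip the heatmap entirely and simply print the sorted correlations as text; a short sketch:
# Correlations of all numeric features with median_house_value, from high to low
corr_with_target = houses.corr(method='spearman', numeric_only=True)['median_house_value']
print(corr_with_target.sort_values(ascending=False))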
19. Utilize the matplotlib
and seaborn
libraries to create a scatter plot (scatterplot()
function) with longitude
on the x-axis and latitude
on the y-axis. Investigate the influence of geographical location on housing prices by visualizing the distribution of median_house_value
across different coordinates.
# Plotting the scatter plot for latitude and longitude
plt.figure(figsize=(8, 4))
sns.scatterplot(
data=houses,
x='longitude',
y='latitude',
size='median_house_value',
hue='median_house_value',
palette='magma',
alpha=0.5)
# Customize the plot
plt.legend(title='Median House Value', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title('Median House Value by Geographical Coordinates')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
You can also first group the data and then create plots for the different categories.
20. Group the data by 'ocean_proximity
' and calculate the average 'median_house_value
' for each category. Plot these results using a box plot (boxplot()
function from seaborn
can also be used here).
# Group by 'ocean_proximity' and calculate the average median house value for each category
average_prices_by_proximity = houses.groupby('ocean_proximity')['median_house_value'].mean()
print(average_prices_by_proximity)
# Create a boxplot to visualize the results
sns.boxplot(x='ocean_proximity', y='median_house_value', data=houses, palette="Set3", hue='ocean_proximity', legend=False)
plt.title('Boxplot of Median House Value by Ocean Proximity')
plt.xlabel('Ocean Proximity')
plt.ylabel('Median House Value')
plt.show()
ocean_proximity
<1h_ocean     240084.285464
inland        124805.392001
near_bay      259212.311790
near_ocean    249433.977427
Name: median_house_value, dtype: float64
As we mentioned during the lecture, it is common to transform our features so that they are scaled to a similar range.
Min-Max scaling is a popular approach to normalise numerical data: it compresses all values into the range [0, 1], which makes it useful for algorithms that require bounded input, such as neural networks.
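Concretely, each value x is mapped to (x - min) / (max - min). A toy sketch of that arithmetic on a made-up list (not the housing data):
# Min-Max scaling by hand on a toy list
values = [2.0, 5.0, 9.0]
lo, hi = min(values), max(values)
scaled = [(v - lo) / (hi - lo) for v in values]
print(scaled)  # [0.0, 0.428..., 1.0]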
21. Apply Min-Max scaling to median_income and plot its new distribution. You can start with scaler = MinMaxScaler(), which initialises the Min-Max scaler.
# Apply Min-Max Scaling to 'MedInc'
scaler = MinMaxScaler()
houses['MedInc_MinMax'] = scaler.fit_transform(houses[['median_income']])
# Print original and scaled 'MedInc'
print(houses[['median_income', 'MedInc_MinMax']].head())
# Plot the original and scaled distribution of 'MedInc'
plt.figure(figsize=(8, 4))
# Min-Max Scaled distribution
sns.histplot(houses['MedInc_MinMax'], kde=True, color='green')
plt.title('Min-Max Scaled Distribution of MedInc')
plt.xlabel('MedInc (Min-Max Scaled)')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
   median_income  MedInc_MinMax
0         8.3252       0.539668
1         8.3014       0.538027
2         7.2574       0.466028
3         5.6431       0.354699
4         3.8462       0.230776
22. Now apply standardization (StandardScaler()) to median_income and plot its new distribution.
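For reference, standardization maps each value x to z = (x - mean) / std, so the result has mean 0 and standard deviation 1. A toy sketch of that arithmetic on made-up numbers (note that StandardScaler uses the population standard deviation, i.e. ddof=0):
# Standardization by hand on a toy list
values = np.array([2.0, 5.0, 9.0])
z = (values - values.mean()) / values.std()  # numpy's std defaults to ddof=0
print(z)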
# define standard scaler
scaler = StandardScaler()
houses['MedInc_Standard'] = scaler.fit_transform(houses[['median_income']])
# Print original and scaled 'MedInc'
print(houses[['median_income', 'MedInc_Standard']].head())
# Plot the original and scaled distribution of 'MedInc'
plt.figure(figsize=(8, 4))
# Standard Scaled distribution
sns.histplot(houses['MedInc_Standard'], kde=True, color='green')
plt.title('Standardization of MedInc')
plt.xlabel('MedInc (Standardized)')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
   median_income  MedInc_Standard
0         8.3252         2.344450
1         8.3014         2.331923
2         7.2574         1.782425
3         5.6431         0.932756
4         3.8462        -0.013024
End of Practical.