Practical: Data exploration and visualization¶

Anastasia Giachanou, Tina Shahedi

Machine Learning with Python - Utrecht Summer School

Welcome to the first practical of the course!

In this practical, we are going to get familiar with Python and Google Colab, and then we will do some data exploration and visualization! You can also look at the Python documentation to refresh your knowledge of programming: https://docs.python.org/3/reference/

Google Colaboratory, or "Colab" for short, allows you to write and execute Python in your browser, with:

  • Zero configuration required
  • Free access to GPUs
  • Easy sharing

Colab notebooks are Jupyter notebooks that are hosted by Colab. You can find more detailed introductions to Colab here, but we will also cover the basics.

Objectives¶

After completing this lab you will be able to:

  • Do data wrangling
  • Do data filtering
  • Explore your data
  • Plot your data with matplotlib and seaborn

Let's get started!¶

Here we are going to introduce Python and Google Colab. If you feel that you want to refresh your Python skills more, you can also complete the preparations from here: http://giachanou.com/machine-learning/#prepare.

1. Open Colab and create a new empty notebook to work with Python 3!

Go to https://colab.research.google.com/ and login with your account. Then click on "File → New notebook".

If you want to insert a new code chunk below the cell you are currently in, press Alt + Enter.

If you want to stop your code from running in Colab:

  • Interrupt execution by pressing ctrl + M I or simply click the stop button
  • Or: press ctrl + A to select all the code of that particular cell, then press ctrl + X to cut it. Now the cell is empty and can be deleted using ctrl + M D or by pressing the delete button. You can paste your code into a new code chunk and adjust it.

NB: On Macbooks, use cmd instead of ctrl in shortcuts.

2. Install the necessary libraries.

Use the !pip install command and install the packages: numpy, pandas, scikit-learn, matplotlib, and seaborn. For example, to install numpy type !pip install -q numpy

Generally, you only need to install a package once on your computer and then simply import it whenever you need it. In Colab, however, you may need to reinstall a package after reconnecting to a runtime. Note that many of the packages we will use come pre-installed on Google Colab.

In [ ]:
!pip install -q numpy
!pip install -q pandas
!pip install -q scikit-learn
!pip install -q matplotlib
!pip install -q seaborn

3. Import the necessary packages.

The packages are now installed, but to be able to use their functions we have to import them. A common practice is to import the packages all together at the beginning of the code. However, syntactically you can also do it later in the code (but always before you use it). For example, to import numpy type import numpy as np

In [ ]:
import numpy as np
import pandas as pd
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler

Load the data¶

Every problem in machine learning starts with the data. Usually, we first want to get some information about our data and understand them.

It is possible that you have your own data that you want to explore. However, there are also many datasets available online, and we can scrape data from the web or use social media APIs. Here are some websites where you can find publicly available datasets:

  • CLARIN Resource Families
  • UCI Machine Learning Repository
  • Kaggle

For this first practical, we are going to use the California housing dataset. This dataset contains house attributes and summary statistics from the 1990 California census.

To begin working with the dataset in Google Colab, you first need to upload the 'housing.csv' file. This can be done by clicking on the 'Files' button located on the left side of the Colab interface (the icon that looks like a folder). You can either drag and drop the file or use the upload button. As an alternative, Google Drive can be mounted within Colab, allowing you to access and upload the dataset directly from there.

4. Read the housing.csv dataset using the read_csv() function. Store the dataframe in a variable called houses. Check the first lines of the dataframe using head() and the last ones with the tail() function.

In [ ]:
#code to mount google drive
#from google.colab import drive
#drive.mount('/content/drive')

houses = pd.read_csv("housing.csv")
In [ ]:
#print the first lines of the dataset
houses.head()
Out[ ]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
In [ ]:
#print the last lines of the dataset
houses.tail()
Out[ ]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
20635 -121.09 39.48 25.0 1665.0 374.0 845.0 330.0 1.5603 78100.0 INLAND
20636 -121.21 39.49 18.0 697.0 150.0 356.0 114.0 2.5568 77100.0 INLAND
20637 -121.22 39.43 17.0 2254.0 485.0 1007.0 433.0 1.7000 92300.0 INLAND
20638 -121.32 39.43 18.0 1860.0 409.0 741.0 349.0 1.8672 84700.0 INLAND
20639 -121.24 39.37 16.0 2785.0 616.0 1387.0 530.0 2.3886 89400.0 INLAND

Initially, we observe that our dataset contains 20,640 rows (indexed 0 to 20639) and various columns. We can also see a small sample of our data.

Displaying summary statistics¶

To examine the dataset's descriptive statistics in more detail, we can use the info() and/or describe() functions. These functions give information about the data types and some summary statistics.

5. Display basic information about the houses dataframe using the info() function. What can you say from this information? What type of data do the columns contain?

In [ ]:
print(houses.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
None

We see that we have 10 columns, and most of them are of the float64 data type, except for ocean_proximity, which has the object dtype (strings in pandas). We also see that the only column with null values is total_bedrooms.

The 10 columns are the following:

  • longitude: A measure of how far west a house is; a higher value is farther west
  • latitude: A measure of how far north a house is; a higher value is farther north
  • housing_median_age: Median age of a house within a block; a lower number is a newer building
  • total_rooms: Total number of rooms within a block
  • total_bedrooms: Total number of bedrooms within a block
  • population: Total number of people residing within a block
  • households: Total number of households, a group of people residing within a home unit, for a block
  • median_income: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
  • median_house_value: Median house value for households within a block (measured in US Dollars)
  • ocean_proximity: Location of the house with respect to the ocean/sea.

6. Now use the describe() function to print the descriptive statistics of the numerical features in the dataset. What can you say regarding the range of the variables?

In [ ]:
print(houses.describe())
          longitude      latitude  housing_median_age   total_rooms  \
count  20640.000000  20640.000000        20640.000000  20640.000000   
mean    -119.569704     35.631861           28.639486   2635.763081   
std        2.003532      2.135952           12.585558   2181.615252   
min     -124.350000     32.540000            1.000000      2.000000   
25%     -121.800000     33.930000           18.000000   1447.750000   
50%     -118.490000     34.260000           29.000000   2127.000000   
75%     -118.010000     37.710000           37.000000   3148.000000   
max     -114.310000     41.950000           52.000000  39320.000000   

       total_bedrooms    population    households  median_income  \
count    20433.000000  20640.000000  20640.000000   20640.000000   
mean       537.870553   1425.476744    499.539680       3.870671   
std        421.385070   1132.462122    382.329753       1.899822   
min          1.000000      3.000000      1.000000       0.499900   
25%        296.000000    787.000000    280.000000       2.563400   
50%        435.000000   1166.000000    409.000000       3.534800   
75%        647.000000   1725.000000    605.000000       4.743250   
max       6445.000000  35682.000000   6082.000000      15.000100   

       median_house_value  
count        20640.000000  
mean        206855.816909  
std         115395.615874  
min          14999.000000  
25%         119600.000000  
50%         179700.000000  
75%         264725.000000  
max         500001.000000  

We can see the descriptive statistics of the numerical columns. From these statistics we see that, on average, a block has about 2,636 rooms and 538 bedrooms. We can also see that the maximum median house value is 500,001 and the minimum is 14,999. This summary also helps you spot unexpected values in any of the columns (for example, a negative minimum price would indicate an error).

Another important thing we have to check is whether there are missing values in the dataframe. We already got an idea from the info() function, but we can also print the exact number of null values per column.

7. Check if there are missing values in the houses dataframe. You can use houses.isnull().sum()

In [ ]:
# Check for missing values
print(houses.isnull().sum())
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

First, we identify missing values by applying .isnull().sum() to the dataframe, which gives the number of null entries per column. As seen above, all columns are complete except for 'total_bedrooms', which has 207 missing values.

We have three choices for dealing with the missing data in the 'total_bedrooms' column:

  1. Remove the missing values.
  2. Fill/impute in the missing values.
  3. Leave the missing values as is.

For this example, we will leave the data as they are and keep exploring (a small sketch of the first two options is shown below). Depending on your problem, you may decide to remove or impute some values; imputation techniques are not covered in this course.
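
Purely for reference, here is a minimal sketch of the first two options (removal and imputation) with pandas. It is not applied to houses, the median imputation shown is just one simple choice, and the variable names are only illustrative.

In [ ]:
# Sketch only: work on copies so the original dataframe stays unchanged

# Option 1: remove the rows with missing 'total_bedrooms'
houses_dropped = houses.dropna(subset=['total_bedrooms'])

# Option 2: impute the missing values, e.g. with the column median
houses_imputed = houses.copy()
houses_imputed['total_bedrooms'] = houses_imputed['total_bedrooms'].fillna(
    houses_imputed['total_bedrooms'].median())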

Another thing we can do is to check the number of unique values present in each column of the houses dataset.

8. Determine the number of unique values present in each column of the houses dataset. You can use the function nunique() for this

In [ ]:
houses.nunique()
Out[ ]:
longitude               844
latitude                862
housing_median_age       52
total_rooms            5926
total_bedrooms         1923
population             3888
households             1815
median_income         12928
median_house_value     3842
ocean_proximity           5
dtype: int64

We know that ocean_proximity is a string type, and from the previous question we see that it has 5 unique values. Let's now see how many times each value appears.

9. Calculate the frequency of each unique value in the 'ocean_proximity' column of the houses dataset. First you can take the values of the column and then use the value_counts() function

In [ ]:
count_per_unique_value = houses['ocean_proximity'].value_counts()
print(count_per_unique_value)
ocean_proximity
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: count, dtype: int64

Here, we see that the value ISLAND has very few observations. So depending on the context of the problem, we can also decide to remove those 5 rows.

Remove/edit data from the dataframe¶

As indicated, the 'ISLAND' category has very few observations compared to the rest of the values. Removing those rows will not have a big impact, so we can drop them.

10. Remove the rows for which the ocean proximity has the value ISLAND. The function you need to remove rows from a dataframe is drop()

In [ ]:
houses.drop(houses[houses['ocean_proximity'] == "ISLAND"].index, inplace=True)
# Display the DataFrame to confirm that the rows have been removed
houses
Out[ ]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
... ... ... ... ... ... ... ... ... ... ...
20635 -121.09 39.48 25.0 1665.0 374.0 845.0 330.0 1.5603 78100.0 INLAND
20636 -121.21 39.49 18.0 697.0 150.0 356.0 114.0 2.5568 77100.0 INLAND
20637 -121.22 39.43 17.0 2254.0 485.0 1007.0 433.0 1.7000 92300.0 INLAND
20638 -121.32 39.43 18.0 1860.0 409.0 741.0 349.0 1.8672 84700.0 INLAND
20639 -121.24 39.37 16.0 2785.0 616.0 1387.0 530.0 2.3886 89400.0 INLAND

20635 rows × 10 columns

11. Sometimes we need to do some transformation of the values in the dataframe. Let's try to do this now. Convert the values in the ocean_proximity column by lowering the case (str.lower()) and replacing spaces with underscores (str.replace())

In [ ]:
# Converting 'ocean_proximity' to lower case and replacing spaces with underscores
houses['ocean_proximity'] = houses['ocean_proximity'].str.lower().str.replace(" ", "_")

# Displaying the 'ocean_proximity' column to confirm the changes
houses['ocean_proximity']
Out[ ]:
0        near_bay
1        near_bay
2        near_bay
3        near_bay
4        near_bay
           ...   
20635      inland
20636      inland
20637      inland
20638      inland
20639      inland
Name: ocean_proximity, Length: 20635, dtype: object

Add columns to the dataframe¶

Often we also want to add new columns based on the values of other ones. Here, we are going to add one more column indicating whether the median house value is above or below the dataset's median.

12. Add a new column in the dataframe houses called high_value which is 1 if the median house value is above the median of the dataset, otherwise 0.

In [ ]:
houses['high_value'] = (houses['median_house_value'] > houses['median_house_value'].median()).astype(int)
print(houses.head())
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   

   population  households  median_income  median_house_value ocean_proximity  \
0       322.0       126.0         8.3252            452600.0        near_bay   
1      2401.0      1138.0         8.3014            358500.0        near_bay   
2       496.0       177.0         7.2574            352100.0        near_bay   
3       558.0       219.0         5.6431            341300.0        near_bay   
4       565.0       259.0         3.8462            342200.0        near_bay   

   high_value  
0           1  
1           1  
2           1  
3           1  
4           1  

Another important tool when we explore our data is to group data together and get some statistics per group. Pandas has this functionality. The function we use to create those groups is called groupby().

13. Group by ocean_proximity and calculate the mean of total_rooms for each category. What can you say about the mean total rooms per ocean proximity category?

In [ ]:
# Group by 'ocean_proximity' and calculate the average median house value for each category
average_prices_by_proximity = houses.groupby('ocean_proximity')['total_rooms'].mean()
print(average_prices_by_proximity)
ocean_proximity
<1h_ocean     2628.343586
inland        2717.742787
near_bay      2493.589520
near_ocean    2583.700903
Name: total_rooms, dtype: float64

From those mean values we can see that the inland blocks have the highest mean total number of rooms.

Exploratory analysis (graphical)¶

We already started some exploratory analysis with the descriptive statistics. However, to get a better understanding of the data, we can also generate some plots, among others histograms, boxplots and correlation matrices.

To create those plots, we will use Matplotlib, a widely-used Python library for data visualization. If you want to know more about this package, see the documentation here: https://matplotlib.org/ .

Let's first see how to use Matplotlib and what its main components are.

Axes in Matplotlib¶

An Axes in Matplotlib is a single plot within a figure, with essential elements like data limits (controlled by set_xlim() and set_ylim() methods), a title (set_title()), x-label (set_xlabel()), and y-label (set_ylabel()). It's where the data, along with associated labels and ticks, is plotted.
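
As a quick illustration (not one of the exercises; the values plotted here are made up), a single Axes with these methods could look like this:

In [ ]:
# One figure with a single Axes, using the set_* methods described above
fig, ax = plt.subplots(figsize=(5, 3))
ax.plot([1, 2, 3, 4], [10, 20, 15, 25])
ax.set_title('Example Axes')
ax.set_xlabel('x values')
ax.set_ylabel('y values')
ax.set_xlim(0, 5)
ax.set_ylim(0, 30)
plt.show()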

Subplots in Matplotlib¶

Subplots allow for multiple plots (axes) to be arranged in a grid within a single figure, facilitating comparative analysis of different data aspects. You can create subplots using plt.subplots(), which will return a figure and an array of axes, accessible through indexing or row-column notation.
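
For example, a minimal sketch of a 2x2 grid (empty axes, just to show the row-column indexing) could look like this:

In [ ]:
# plt.subplots returns a figure and a 2D array of axes;
# individual plots are selected with row-column indexing
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].set_title('top left')      # row 0, column 0
axes[0, 1].set_title('top right')     # row 0, column 1
axes[1, 0].set_title('bottom left')   # row 1, column 0
axes[1, 1].set_title('bottom right')  # row 1, column 1
fig.tight_layout()
plt.show()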

Let's now create some histograms using Matplotlib.

For example, we can use the following code to create a histogram of the housing_median_age column. Note that we are using histplot from the Seaborn package to create the histogram in this case.

In [ ]:
# Define a figure of size (6,4)
plt.figure(figsize=(6, 4))
# create the histogram using the histplot
sns.histplot(houses['housing_median_age'], kde=True)
# add different elements such as title and x, y labels
plt.title('Distribution of Median Age')
plt.xlabel('Median Age')
plt.ylabel('Frequency')

# Set the limits on the X and Y axes
x_lim = (0, 60)
y_lim = (0, 5000)
plt.xlim(x_lim)
plt.ylim(y_lim)

plt.show()

This histogram visualizes the frequency distribution of the 'housing_median_age' feature.

14. Create histograms for other variables in the houses dataframe. You can use plt.subplots to place them in a grid. For example, if you want to create 4 subplots in a 2x2 grid you can run fig, axes = plt.subplots(2, 2, figsize=(12, 6)). What can you say about the distributions of the features?

In [ ]:
n_rows = 6
n_cols = 2

# Creating the figure and subplots
fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 6, n_rows * 3))

# Flattening the axes array for easier iteration
axes = axes.flatten()

# Creating histograms for each column in the DataFrame
for ax, column in zip(axes, houses.columns):
    sns.histplot(houses[column], bins=20, ax=ax)
    ax.set_title(f'Histogram of {column}')
    ax.set_xlabel(column)
    ax.set_ylabel('Frequency')

# Hide any extra subplots
for i in range(len(houses.columns), n_rows * n_cols):
    fig.delaxes(axes[i])

# Adjusting the layout and displaying the figure
fig.tight_layout()
plt.show()

From these histograms, we can see that the data in 'housing_median_age' and 'median_house_value' are concentrated at the left end of their range.

Next, we can also plot some boxplots which are going to show whether there are any potential outliers or not.

15. Create box plots (use the boxplot function from seaborn) for the columns 'total_rooms', 'total_bedrooms', and 'population'. Determine if there are any outliers.

In [ ]:
# List of columns to check for outliers
columns_to_check = ['total_rooms', 'total_bedrooms', 'population']

# Create a box plot for each specified column
for column in columns_to_check:
    plt.figure(figsize=(8, 4))
    sns.boxplot(x=houses[column])
    plt.title(f'Box plot of {column}')
    plt.show()

These boxplots indicate that the variables are not normally distributed. There are also a few points beyond the whiskers (potential outliers) that would need to be explored further. For this practical, we will leave them as they are.

16. Create a pair plot for the features 'total_bedrooms', 'total_rooms' and 'households'. To do this, you can use the function pairplot from seaborn. This function will create a matrix of scatter plots for each pair of features.

In [ ]:
features = ['total_bedrooms', 'total_rooms', 'households']

# Create the pair plot with a specified height for each plot
sns.pairplot(houses[features], height=3)
# Adjust the layout
plt.subplots_adjust(top=1)
# Display the plot
plt.show()

We created a pair plot to examine the relationship among 'total_bedrooms', 'total_rooms', and 'households'. This plot reveals a linear relationship between 'total_bedrooms' and both 'total_rooms' and 'households'. This is logical since the number of bedrooms is included in the total room count of a block and is likely influenced by the household size. From this, we infer that the number of bedrooms cannot exceed the total number of rooms and that we can use other dataset features to estimate the number of bedrooms.

Viewing Correlations¶

In several tasks, we are also interested in seeing whether there are correlations among features.

17. Investigate the correlation among the features. You can create a heatmap using the heatmap function from seaborn. To get the correlations, you can use the houses.corr() function and set the parameter method='spearman'

In [ ]:
# Set the size of the figure for the heatmap
plt.figure(figsize=(8, 6))

# Calculate the Spearman correlation matrix and create the heatmap (we specify numeric_only=True)
sns.heatmap(houses.corr(method='spearman', numeric_only=True), annot=True, cmap='magma')

# Add a title to the heatmap
plt.title('Spearman Correlation Among Numeric Features', size=10)

# Display the heatmap
plt.show()

Values range from -1 to +1. Values that are close to +1 indicate a strong positive correlation and values close to -1 a strong negative correlation. From this plot, we see that there is high positive correlation among the features total_rooms, total_bedrooms, population and households (yellow colors in the center of the plot).

18. If your dataframe has many columns (sometimes there can be hundreds of features), you can also choose to view the correlations of one feature with the rest by selecting a single column from the correlation matrix. Generate the correlations of median_house_value with the rest of the features. Once you do that, try to change the code so it shows the correlations in descending order (from high to low).

In [ ]:
# Set the size of the figure for the heatmap
plt.figure(figsize=(2, 4))

correlation_matrix = houses.corr(method='spearman', numeric_only=True)
sns.heatmap(correlation_matrix[['median_house_value']].sort_values(by='median_house_value', ascending=False), annot=True, cmap='cividis')
plt.title('Correlation of Features with Median House Value',size=10)
plt.show()

The heatmap shows that high_value and median_income are positively correlated with median_house_value, meaning higher median income usually indicates higher median house prices.

19. Utilize the matplotlib and seaborn libraries to create a scatter plot (scatterplot() function) with longitude on the x-axis and latitude on the y-axis. Investigate the influence of geographical location on housing prices by visualizing the distribution of median_house_value across different coordinates.

In [ ]:
# Plotting the scatter plot for latitude and longitude
plt.figure(figsize=(8, 4))
sns.scatterplot(
    data=houses,
    x='longitude',
    y='latitude',
    size='median_house_value',
    hue='median_house_value',
    palette='magma',
    alpha=0.5)

# Customize the plot
plt.legend(title='Median House Value', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title('Median House Value by Geographical Coordinates')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()

You can also first group the data and then create plots for the different categories.

20. Group the data by 'ocean_proximity' and calculate the average 'median_house_value' for each category. Plot these results using a box plot (boxplot() function from seaborn can also be used here).

In [ ]:
# Group by 'ocean_proximity' and calculate the average median house value for each category
average_prices_by_proximity = houses.groupby('ocean_proximity')['median_house_value'].mean()
print(average_prices_by_proximity)

# Create a boxplot to visualize the results
sns.boxplot(x='ocean_proximity', y='median_house_value', data=houses, palette="Set3", hue='ocean_proximity', legend=False)
plt.title('Boxplot of Median House Value by Ocean Proximity')
plt.xlabel('Ocean Proximity')
plt.ylabel('Median House Value')
plt.show()
ocean_proximity
<1h_ocean     240084.285464
inland        124805.392001
near_bay      259212.311790
near_ocean    249433.977427
Name: median_house_value, dtype: float64

Feature Transformation¶

As we mentioned during the lecture, it is common to transform our features so they are scaled to a similar range.

Min-Max Scaling is a popular approach to normalise numerical data. It compresses all values into the range [0, 1], which is useful for algorithms that require bounded input, such as neural networks.
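
For reference, Min-Max scaling transforms each value x into (x - min) / (max - min). As a minimal sketch (purely illustrative, using a throwaway variable), the same computation can be done with plain pandas; the exercise below uses scikit-learn's MinMaxScaler, which applies this formula for its default [0, 1] range.

In [ ]:
# Manual Min-Max scaling of 'median_income' (equivalent to MinMaxScaler with the default range)
col = houses['median_income']
income_minmax_manual = (col - col.min()) / (col.max() - col.min())
print(income_minmax_manual.head())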

21. Apply Min-Max Scaling to median_income and plot its new distribution. You can start with scaler = MinMaxScaler(), which initialises the Min-Max scaler

In [ ]:
# Apply Min-Max Scaling to 'MedInc'
scaler = MinMaxScaler()
houses['MedInc_MinMax'] = scaler.fit_transform(houses[['median_income']])

# Print original and scaled 'MedInc'
print(houses[['median_income', 'MedInc_MinMax']].head())

# Plot the original and scaled distribution of 'MedInc'
plt.figure(figsize=(8, 4))

# Min-Max Scaled distribution
sns.histplot(houses['MedInc_MinMax'], kde=True, color='green')
plt.title('Min-Max Scaled Distribution of MedInc')
plt.xlabel('MedInc (Min-Max Scaled)')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()
   median_income  MedInc_MinMax
0         8.3252       0.539668
1         8.3014       0.538027
2         7.2574       0.466028
3         5.6431       0.354699
4         3.8462       0.230776

22. Now apply standardization (StandardScaler()) to median_income and plot its new distribution.

In [ ]:
# define standard scaler
scaler = StandardScaler()

houses['MedInc_Standard'] = scaler.fit_transform(houses[['median_income']])

# Print original and scaled 'MedInc'
print(houses[['median_income', 'MedInc_Standard']].head())

# Plot the original and scaled distribution of 'MedInc'
plt.figure(figsize=(8, 4))

# Standard Scaled distribution
sns.histplot(houses['MedInc_Standard'], kde=True, color='green')
plt.title('Standardization of MedInc')
plt.xlabel('MedInc (Standardization)')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()
   median_income  MedInc_Standard
0         8.3252         2.344450
1         8.3014         2.331923
2         7.2574         1.782425
3         5.6431         0.932756
4         3.8462        -0.013024

End of Practical.