Practical 10: Fairness in Machine Learning¶

Daniel Anadria, Anastasia Giachanou

Machine Learning with Python - Utrecht Summer School

In this practical, we are going to explore bias and fairness in Machine Learning!

COMPAS Recidivism¶

The COMPAS dataset contains outcomes from a proprietary tool named COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), designed to evaluate the probability of a convict committing another crime. It is utilized by judges and parole officers and is notably recognized for its discriminatory impact on African-American individuals.

Dataset source: Broward County Clerk's Office, Broward County Sheriff's Office, Florida Department of Corrections, ProPublica

We are going to use this dataset to explore some of the notions of group fairness as it relates to machine learning.


Disclaimer:

Unlike most tutorials that use the COMPAS dataset, we are not going to assess the fairness of the pre-computed COMPAS scores. Instead, we will build our own classifier based on the 'raw' data such as crime history and demographic information (thus excluding the derived COMPAS scores). This way, you will get some intuition for how such classifiers are built, where fairness problems might stem from in the development pipeline, and what can be done to address fairness in model outputs.

In algorithmic fairness, it's important to understand the context surrounding a specific applied machine learning task. Sources of bias are many, as are the degrees of freedom in choosing which disparity to focus on. Unfortunately, satisfying multiple fairness criteria at the same time is often mathematically impossible, and improving fairness typically comes at some cost to predictive performance, a tension known as the fairness-accuracy trade-off.

We do not claim to be penal system or social justice experts. The purpose of this tutorial is only to demonstrate some of the machine learning approaches to bias detection and mitigation. For this to be possible, we have to make choices about which biases are 'more important' to focus on. In reality, the values influencing what to optimize the models for are multifaceted and come from different actors. We do not claim to have 'solved fairness problems'. This would require interdisciplinary multi-agent input and would always be based on a selection of particular values.


In [ ]:
!pip install -q squarify
!pip install -q fairlearn

fairlearn (developed by Microsoft) is a toolkit for:

  • Evaluating fairness metrics
  • Mitigating bias in machine learning models
  • Producing fairness-aware models using post-processing, reductions, or constraints

As always, we start by importing the required libraries.

In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import squarify
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score, roc_curve

import fairlearn
from fairlearn.postprocessing import ThresholdOptimizer
from fairlearn.postprocessing import plot_threshold_optimizer

1. Load the COMPAS dataset from the url https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv (yes, you can use pd.read_csv() and put the link inside the parentheses) and inspect the first rows.
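One possible solution (a minimal sketch; we call the dataframe df, which matches the name used in the code later in this practical):

In [ ]:
# Load the COMPAS two-year recidivism data directly from the ProPublica repository
url = "https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv"
df = pd.read_csv(url)

# Inspect the first rows
df.head()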

The COMPAS dataset contains the following variables. We tried looking up the meaning behind each variable. Our target variable is called two_year_recid.

Some variables were used to construct other variables. For example, decile_score represents the individual's COMPAS score, the value predicting the risk of recidivism. We will omit the COMPAS score and the related features, and try to predict two-year recidivism from the remaining features.

Variable Description
id Unique identifier for each individual
name Full name of the individual
first First name of the individual
last Last name of the individual
compas_screening_date Date when the COMPAS screening was conducted
sex Sex of the individual
dob Date of birth
age Age at the time of screening
age_cat Categorical age group (e.g., less than 25, 25-45, greater than 45)
race Race/ethnicity of the individual
juv_fel_count Number of juvenile felony charges
decile_score COMPAS decile score for general recidivism risk
juv_misd_count Number of juvenile misdemeanor charges
juv_other_count Number of other juvenile charges
priors_count Number of prior offenses
days_b_screening_arrest Days between screening and arrest
c_jail_in Date of jail entry for the current charge
c_jail_out Date of jail release for the current charge
c_case_number Case number for the current charge
c_offense_date Date of the current offense
c_arrest_date Date of the current arrest
c_days_from_compas Days from COMPAS screening to the current charge
c_charge_degree Degree of the current charge (e.g., felony, misdemeanor)
c_charge_desc Description of the current charge
is_recid Indicator of whether the individual recidivated
r_case_number Case number for the recidivism charge
r_charge_degree Degree of the recidivism charge
r_days_from_arrest Days from the arrest to the recidivism charge
r_offense_date Date of the recidivism offense
r_charge_desc Description of the recidivism charge
r_jail_in Date of jail entry for the recidivism charge
r_jail_out Date of jail release for the recidivism charge
violent_recid Indicator of violent recidivism
is_violent_recid Binary indicator for violent recidivism
vr_case_number Case number for the violent recidivism charge
vr_charge_degree Degree of the violent recidivism charge
vr_offense_date Date of the violent recidivism offense
vr_charge_desc Description of the violent recidivism charge
type_of_assessment Type of COMPAS assessment conducted
decile_score.1 COMPAS decile score for violent recidivism risk
score_text Textual interpretation of the COMPAS score (e.g., Low, Medium, High)
screening_date Date of the screening assessment
v_type_of_assessment Type of violent recidivism assessment conducted
v_decile_score Decile score for violent recidivism
v_score_text Textual interpretation of the violent recidivism score
v_screening_date Date of the violent recidivism screening
in_custody Date of custody start
out_custody Date of custody end
priors_count.1 Redundant count of prior offenses
start Start day of the observation period
end End day of the observation period
event Event indicator
two_year_recid Indicator for recidivism within two years

Question (discuss this with a classmate). Before we start, reflect on the task - what does it mean to predict the risk of a person committing another crime based on (some of) these variables?

Do you think that all of the available data should be used to make the prediction?

Exploratory Data Analysis¶

We talked about Exploratory Data Analysis in the first lecture. Before building a predictive model, it's good practice to explore how the data are distributed. Remember that real-world datasets are rarely balanced, and patterns within the data reveal social realities - both justified and unjustified.

2. What is the proportion of males vs females in the dataset? (hint: value_counts() has a normalize parameter). You can also visualize the distribution of sex if you want to practice more with visualization; we will use a donut chart (pie() from matplotlib) but you can use a different type of plot. What does this plot tell us?
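One way to do this (a sketch; the donut chart is simply a pie chart with a white circle drawn over its centre):

In [ ]:
# Proportion of males vs females
sex_props = df['sex'].value_counts(normalize=True)
print(sex_props)

# Donut chart of sex
plt.pie(sex_props, labels=sex_props.index, autopct='%1.1f%%', startangle=90)
plt.gca().add_artist(plt.Circle((0, 0), 0.6, color='white'))  # the 'hole' of the donut
plt.title('Distribution of sex')
plt.show()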

Next, let's learn about the distribution of age in the dataset

3. Visualize the distribution of age by sex. (hint: you can use a violin plot or a box plot). What does this plot show?
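A possible solution with a violin plot (a box plot works just as well):

In [ ]:
# Age distribution by sex
sns.violinplot(data=df, x='sex', y='age')
plt.title('Age distribution by sex')
plt.show()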

4. What is the composition of the COMPAS dataset based on race? Print the percentage for each value of the race variable.
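A minimal sketch:

In [ ]:
# Percentage of observations per race category
print((df['race'].value_counts(normalize=True) * 100).round(2))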

5. Visualize the distribution of race using a treemap (squarify.plot).
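One way to draw the treemap (each rectangle's area is proportional to the group size):

In [ ]:
# Treemap of race
race_counts = df['race'].value_counts()
squarify.plot(sizes=race_counts.values,
              label=[f"{race}\n{count}" for race, count in race_counts.items()],
              alpha=0.8)
plt.axis('off')
plt.title('Distribution of race')
plt.show()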

6. Plot the distribution of race by sex - first using counts (frequencies), then using the log transformation of the count.

A logarithmic transformation changes the scale of the data but retains the key patterns, making it easier for us to see the within-race sex distribution.
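A sketch of both plots (counts first, then log-transformed counts; adding 1 before taking the log avoids log(0) for empty race-sex combinations):

In [ ]:
# Counts of sex within each race category
counts = df.groupby(['race', 'sex']).size().unstack(fill_value=0)

counts.plot(kind='bar')
plt.ylabel('count')
plt.show()

# Same data on a log scale
np.log(counts + 1).plot(kind='bar')
plt.ylabel('log(count + 1)')
plt.show()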

We see that for each race category, there are always more male than female observations. However, the sex imbalance varies by group.

7. Now let's consider the outcome variable - two year recidivism. What is the relationship of race to recidivism in the dataset? Start by making a bar plot of two-year recidivism by race. One suggestion is to use sns.countplot()
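One possible solution:

In [ ]:
# Counts of two-year recidivism by race
sns.countplot(data=df, x='race', hue='two_year_recid')
plt.xticks(rotation=45)
plt.show()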

8. Plot the proportion of recidivism within each race category. This will make comparing recidivism patterns between groups easier. You can use sns.barplot() for this case
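A sketch; because two_year_recid is coded 0/1, the mean per group equals the proportion of recidivists:

In [ ]:
# Proportion of two-year recidivism within each race category
recid_props = df.groupby('race')['two_year_recid'].mean().reset_index()

sns.barplot(data=recid_props, x='race', y='two_year_recid')
plt.ylabel('proportion recidivating within two years')
plt.xticks(rotation=45)
plt.show()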

Prediction of Recidivism - Naïve Baseline Approach¶

Having explored the data and keeping in mind the distribution of the categories, let's see what happens when we train a logistic regression model to predict two-year recidivism.

We have to prepare features for model input. Consider the following:

  • some features carry little predictive power, either due to being unrelated to the task or due to showing little variance (e.g. id, type_of_assessment, etc.)
  • some features may cause multicollinearity issues (e.g. age_cat vs age, or COMPAS decile scores vs the remaining features that were used to derive them)
  • some features contain text (e.g. name) and dates (e.g. in_custody)

Note. Multicollinearity occurs when two or more features in a dataset are highly correlated, meaning they provide overlapping or redundant information. This can make the model unstable and its coefficients hard to interpret.

The first two cases can be solved by removing (some of the) features. The third case could be solved through feature engineering (e.g. text vectorization in the case of names, subtracting dates to get day counts, etc.). However, we opt for the simple approach of dropping most features that aren't readily usable as model input. We make an exception for categorical features (e.g. 'race') that can be dummy-coded.

We will now prepare the input data for the model.

In [ ]:
# Select features
included_features = ['sex', 'race', 'age', 'juv_fel_count', 'juv_misd_count', 'juv_other_count', 'priors_count', 'priors_count.1', 'c_charge_degree']
X = df[included_features].copy()
y = df['two_year_recid']

# Clone race and sex labels (will be useful later)
X['race_label'] = X['race']
X['sex_label'] = X['sex']
In [ ]:
# One-hot encode categorical features
dummy_following_features = ['sex', 'race', 'c_charge_degree']
X = pd.get_dummies(X, columns=dummy_following_features)
X.shape
Out[ ]:
(7214, 18)

We will now split the dataset into training and test set. We will also save the race_label and the sex_label into a new dataframe and then remove them from the X_train and X_test.

In [ ]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Save a copy of race and sex attribute labels from the test set, we will use it later
X_test_attributes = X_test[['race_label', 'sex_label']]

# remove race and sex attribute labels from train & test sets
X_train = X_train.drop(columns=['race_label', 'sex_label'])
X_test = X_test.drop(columns=['race_label', 'sex_label'])

Let's list our final selection of model input features.

In [ ]:
# Final predictors

for i, column in enumerate(X_train.columns, start=1):
    print(f"{i}. {column}")
1. age
2. juv_fel_count
3. juv_misd_count
4. juv_other_count
5. priors_count
6. priors_count.1
7. sex_Female
8. sex_Male
9. race_African-American
10. race_Asian
11. race_Caucasian
12. race_Hispanic
13. race_Native American
14. race_Other
15. c_charge_degree_F
16. c_charge_degree_M

9. Next, fit the logistic regression model on the training data, make predictions on the test data (as we have learned), and display the classification report and confusion matrix. Open practical 3 to refresh your memory on this.
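A sketch of the baseline model (we call the fitted model log_reg and also keep the predicted probabilities; both are reused in later sketches):

In [ ]:
# Fit a logistic regression on the training data
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# Predict labels and probabilities on the test set
y_pred = log_reg.predict(X_test)
y_pred_proba = log_reg.predict_proba(X_test)[:, 1]  # probability of recidivism (class 1)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))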

An important idea in group fairness is that classification performance should not only be examined for the overall model (as done above), but on a per group basis as well. For example, imagine a model with 90% overall accuracy. That sounds excellent — until you break it down by demographic group: majority group 95%, minority group 60%

10. Compute the true positive, true negative, false positive and false negative rates for different race and sex groups. You can use X_test_attributes dataframe to calculate those.

Your first step is to create a dataframe that looks as follows (this example observation belongs to class 0 and was predicted as class 0, so it is a TN):

id sex race y_observed y_predicted baseline_prediction predicted_probability
308 Male Caucasian 0 0 TN 0.151703
... ... ... ... ... ... ...

Once you have this dataframe, you can use pivot_table(). This function groups data by one or more keys (in our case index='sex'), and then aggregates values using a function like mean, sum, or count (in our case sum). Think of it as a more flexible version of .groupby() that returns a full table instead of a grouped series or DataFrame.
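A sketch of both steps, assuming log_reg, y_pred and y_pred_proba from the previous sketch (we call the resulting dataframe comparison_df, the name used in the next exercise):

In [ ]:
# Build a per-observation comparison dataframe on the test set
comparison_df = X_test_attributes.rename(columns={'sex_label': 'sex', 'race_label': 'race'}).copy()
comparison_df['y_observed'] = y_test.values
comparison_df['y_predicted'] = y_pred
comparison_df['predicted_probability'] = y_pred_proba

def get_prediction_label(row):
    # Label each observation as TN, FN, FP or TP
    if row['y_observed'] == 0 and row['y_predicted'] == 0:
        return 'TN'
    elif row['y_observed'] == 1 and row['y_predicted'] == 0:
        return 'FN'
    elif row['y_observed'] == 0 and row['y_predicted'] == 1:
        return 'FP'
    else:
        return 'TP'

comparison_df['baseline_prediction'] = comparison_df.apply(get_prediction_label, axis=1)

# Per-race rates: counts per outcome divided by group size
counts_race = comparison_df.pivot_table(index='race', columns='baseline_prediction',
                                        aggfunc='size', fill_value=0)
rates_race = counts_race.div(counts_race.sum(axis=1), axis=0).round(2)
rates_race['support'] = counts_race.sum(axis=1)
print(rates_race)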

The values inside the cells are rates (e.g. false negative rate, true positive rate, etc.). The columns expressing model errors (FN and FP) are particularly important. We can already see that the FP rate is higher for African-Americans compared to the Caucasian group.

11. Create the same breakdown of predictions by sex. The code looks similar to the one above; this time you can use comparison_df directly, no need to recreate it.
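The same computation, grouped by sex instead of race:

In [ ]:
# Per-sex rates using the existing comparison_df
counts_sex = comparison_df.pivot_table(index='sex', columns='baseline_prediction',
                                       aggfunc='size', fill_value=0)
rates_sex = counts_sex.div(counts_sex.sum(axis=1), axis=0).round(2)
rates_sex['support'] = counts_sex.sum(axis=1)
print(rates_sex)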

When building a system to predict recidivism, which type of error is more problematic: false negatives or false positives? What do you think?

From the tables we produced above, we observe that our classifier exhibits a false positive rate gap between African-American offenders and Caucasian and Hispanic offenders (6% and 9%, respectively). Since offenders with Asian, Native American, and Other ethnicity are few in the dataset, we take the model performance on these groups with a grain of salt. We also see that there is a 9% false negative rate gap and an 11% false positive rate gap between female and male convicts.

Let's generate a detailed breakdown of the model’s performance across different racial groups

12. Make a confusion matrix for each race. You can loop through each unique race and print the classification report and confusion matrix
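A sketch using the comparison_df built earlier:

In [ ]:
# Classification report and confusion matrix per race
for race in comparison_df['race'].unique():
    subset = comparison_df[comparison_df['race'] == race]
    print(f"\n=== {race} (n={len(subset)}) ===")
    print(classification_report(subset['y_observed'], subset['y_predicted'], zero_division=0))
    print(confusion_matrix(subset['y_observed'], subset['y_predicted']))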

13. To present the results in a way that is easier to compare, we can plot the ROC curves. Plot the ROC curve for each race. You will need a for-loop to iterate over races. What can you tell from this plot?

NOTE: The ROC curves of the Native American and Asian groups should fall straight on the diagonal; an AUC of 0.5 means the classifier performs no better than chance for these groups.
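One way to plot the per-group ROC curves, using the predicted probabilities stored in comparison_df (groups with only one observed class are skipped, since the ROC curve is undefined there):

In [ ]:
# ROC curve per race
for race in comparison_df['race'].unique():
    subset = comparison_df[comparison_df['race'] == race]
    if subset['y_observed'].nunique() < 2:
        continue  # ROC/AUC is undefined when only one class is observed
    fpr, tpr, _ = roc_curve(subset['y_observed'], subset['predicted_probability'])
    auc = roc_auc_score(subset['y_observed'], subset['predicted_probability'])
    plt.plot(fpr, tpr, label=f"{race} (AUC = {auc:.2f})")

plt.plot([0, 1], [0, 1], linestyle='--', color='grey')  # chance level
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()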

In this practical, we are going to attempt to reduce the false positive rate for African-Americans.

Bias Mitigation¶

Post-Processing - Threshold Optimizer¶

Bias mitigation (fairness) techniques in machine learning can attempt to tackle the problem at three different model-building stages:

  • preprocessing: change the input
  • in-processing: change the model
  • post-processing: change the output

Consider the problem setup. We have:

  • Features: $X$
  • Target: $y \in \{0,1\}$
  • Task: we want to predict $y$ from $X$
  • Score function (here probability) $P=f(X)$
  • Decision based on a threshold $D = \mathbb{1}\{P > t\}$
  • Sensitive attribute $A \in \{a, b\}$

We are going to attempt to tackle the problem of bias by using a post-processing approach. This way, we can directly control the distribution of the outcome.

Logistic regression outputs a predicted probability of recidivism. The probability is then dichotomized: by default the threshold $t$ is set to 0.5, so if the predicted probability is lower than 50%, the person is assigned 0 (in our case 'low risk'), otherwise they are assigned 1 (in our case 'high risk').

However, we could also consider setting different thresholds for different groups in an attempt to reduce bias. We do this by using a threshold optimization approach. We will use the fairlearn library. You can explore the documentation.

14. Create two variables in which you save the values of X_train['race_African-American'] and X_test['race_African-American'].
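For example (A_train and A_test are simply the names we use in the following sketches):

In [ ]:
# Sensitive attribute: indicator of whether the individual is African-American
A_train = X_train['race_African-American']
A_test = X_test['race_African-American']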

15. Initialise the ThresholdOptimizer. The estimator can be set to the logistic regression model (the one we created before), the constraints to demographic_parity, and the objective to accuracy score. Also, set flip to True to allow flipping the decision if it improves the resulting objective (a sketch follows after the notes below).

What is ThresholdOptimizer? ThresholdOptimizer adjusts the decision thresholds per group (e.g., race or sex) to satisfy a fairness constraint like demographic parity or equalized odds. It does not retrain our model — it works on the model’s predictions.

Also, we chose demographic parity, which means that the proportion of individuals predicted positive (e.g., "will reoffend") should be the same across racial groups, regardless of true outcomes.
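A sketch of the initialisation; prefit=True is an extra assumption on our part so that the already-fitted logistic regression is not retrained:

In [ ]:
# Post-processing: learn group-specific thresholds on top of the fitted logistic regression
threshold_optimizer = ThresholdOptimizer(
    estimator=log_reg,
    constraints="demographic_parity",
    objective="accuracy_score",
    flip=True,      # allow flipping decisions if it improves the objective
    prefit=True     # assumption: reuse the already-fitted model instead of refitting it
)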

16. Then fit the optimiser on the training data and set sensitive_features to the protected attribute in the training set. Make the new predictions on the test set using the predict function and save them in a dataframe. Finally, visualize the effect of the ThresholdOptimizer with plot_threshold_optimizer().
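A sketch; we store the new decisions in a dataframe called optimized_df with the column name new_threshold_decision, matching the code further below:

In [ ]:
# Fit the optimiser on the training data, using race as the sensitive feature
threshold_optimizer.fit(X_train, y_train, sensitive_features=A_train)

# Predict on the test set with the group-specific thresholds
y_pred_opt = threshold_optimizer.predict(X_test, sensitive_features=A_test)
optimized_df = pd.DataFrame({'new_threshold_decision': y_pred_opt}, index=X_test.index)

# Visualise the thresholds chosen for each group
plot_threshold_optimizer(threshold_optimizer)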

17. Combine the comparison dataframe with the dataframe containing the new predictions.
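Since comparison_df and optimized_df (from the sketches above) share the test-set index, they can be joined directly; the result is the final_comparison_df used below:

In [ ]:
# Combine the baseline comparison dataframe with the threshold-optimized decisions
final_comparison_df = comparison_df.join(optimized_df)
final_comparison_df.head()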

Let's compare the results we obtained.

In [ ]:
# Keep the baseline predictions for all groups...
final_comparison_df["y_predicted_new"] = final_comparison_df["y_predicted"]

# ...but for African-American individuals, use the threshold-optimized decision instead.
# Assigning through .loc avoids the SettingWithCopyWarning raised by chained indexing.
mask = final_comparison_df["race"] == "African-American"
final_comparison_df.loc[mask, "y_predicted_new"] = final_comparison_df.loc[mask, "new_threshold_decision"]
In [ ]:
final_comparison_df[["y_predicted",'new_threshold_decision']].corr()
Out[ ]:
y_predicted new_threshold_decision
y_predicted 1.000000 0.722498
new_threshold_decision 0.722498 1.000000
In [ ]:
def get_new_prediction_label(row):
    if row['y_observed'] == 0 and row['y_predicted_new'] == 0:
        return 'TN'
    elif row['y_observed'] == 1 and row['y_predicted_new'] == 0:
        return 'FN'
    elif row['y_observed'] == 0 and row['y_predicted_new'] == 1:
        return 'FP'
    elif row['y_observed'] == 1 and row['y_predicted_new'] == 1:
        return 'TP'


# Apply the function to each row to create the 'updated_prediction' column
final_comparison_df['updated_prediction'] = final_comparison_df.apply(get_new_prediction_label, axis=1)

final_comparison_df.head()
Out[ ]:
sex race y_observed y_predicted baseline_prediction predicted_probability new_threshold_decision y_predicted_new updated_prediction
308 Male Caucasian 0 0 TN 0.151788 0 0 TN
381 Male African-American 0 0 TN 0.422197 0 0 TN
3238 Male African-American 1 0 FN 0.356359 0 0 FN
2312 Male African-American 1 1 TP 0.586242 1 1 TP
251 Female Other 0 0 TN 0.202685 0 0 TN
In [ ]:
# Pivot table to summarize counts
pivot_table_counts = final_comparison_df.pivot_table(index='race', columns='updated_prediction', aggfunc='size', fill_value=0)

# Calculate support values
support = pivot_table_counts.sum(axis=1)

# Normalize to get proportions (rates)
pivot_table_proportions = pivot_table_counts.div(support, axis=0)

# Add the support column to the pivot table
pivot_table_proportions['support'] = support

# Round the proportions
pivot_table_proportions = pivot_table_proportions.round(2)

print(pivot_table_proportions)
updated_prediction    FN    FP    TN    TP  support
race                                               
African-American    0.18  0.13  0.36  0.32      731
Asian               0.20  0.20  0.60  0.00        5
Caucasian           0.21  0.10  0.53  0.16      505
Hispanic            0.21  0.07  0.64  0.08      117
Native American     0.00  0.33  0.33  0.33        3
Other               0.27  0.06  0.61  0.06       82

For reference, the baseline prediction rates were:

baseline_prediction    FN    FP    TN    TP  support
race
African-American     0.15  0.16  0.33  0.36      731
Asian                0.20  0.20  0.60  0.00        5
Caucasian            0.21  0.10  0.53  0.16      505
Hispanic             0.21  0.07  0.64  0.08      117
Native American      0.00  0.33  0.33  0.33        3
Other                0.27  0.06  0.61  0.06       82

We see that our bias mitigation strategy for African-Americans has reduced the false positive rate from 16% in the naïve baseline approach to 13% in the threshold-optimized approach. The false negative rate has increased by 3%, showing that bias mitigation can sometimes come with a trade-off. The performance on the remaining groups has remained the same.

A challenge is that the Asian and Native American groups are small and therefore not well suited for statistical learning. In order to improve the performance of the classifier on these groups, one would need to consider data augmentation approaches such as SMOTE (Synthetic Minority Oversampling Technique) or other solutions for imbalanced datasets.
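As a rough illustration only (SMOTE is not part of this practical and requires the imbalanced-learn package, which is not installed above), oversampling could look something like this. Note that standard SMOTE balances the target classes; balancing small demographic groups would instead require oversampling by group.

In [ ]:
# Hypothetical sketch: oversample the minority target class in the training data
# Requires: pip install imbalanced-learn
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)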

Sources, Acknowledgments, and Additional Reading:

https://fairlearn.org/v0.5.0/api_reference/fairlearn.postprocessing.html

https://www.holisticai.com/blog/bias-mitigation-strategies-techniques-for-classification-tasks


We would like to thank Dr. Dong Nguyen of Utrecht University whose Human-Centered Machine Learning course materials have served as an inspiration.