Daniel Anadria, Anastasia Giachanou
Machine Learning with Python - Utrecht Summer School
In this practical, we are going to explore bias and fairness in Machine Learning!
The COMPAS dataset contains outcomes from a proprietary tool named COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), designed to evaluate the probability of a convict committing another crime. It is utilized by judges and parole officers and is notably recognized for its discriminatory impact on African-American individuals.
Dataset source: Broward County Clerk's Office, Broward County Sheriff's Office, Florida Department of Corrections, ProPublica
We are going to use this dataset to explore some of the notions of group fairness as it relates to machine learning.
Disclaimer:
Unlike most tutorials that use the COMPAS dataset, we are not going to assess the fairness of the pre-computed COMPAS scores. Instead, we will build our own classifier based on the 'raw' data such as crime history and demographic information (thus excluding the derived COMPAS scores). This way, you will get some intuition about how such classifiers are built, where fairness problems might stem from in the development pipeline, and what can be done to address fairness in model outputs.
In algorithmic fairness, it's important to understand the context surrounding a specific applied machine learning task. Sources of bias are many, as are the degrees of freedom in what disparity to focus on. Unfortunately, satisfying multiple fairness criteria at the same time is often mathematically impossible, and improving fairness can come at the cost of predictive performance (the fairness-accuracy trade-off).
We do not claim to be penal system or social justice experts. The purpose of this tutorial is only to demonstrate some of the machine learning approaches to bias detection and mitigation. For this to be possible, we have to make choices about which biases are 'more important' to focus on. In reality, the values influencing what to optimize the models for are multifaceted and come from different actors. We do not claim to have 'solved fairness problems'; that would require interdisciplinary, multi-agent input and would always be based on a selection of particular values.
!pip install -q squarify
!pip install -q fairlearn
fairlearn (originally developed by Microsoft) is an open-source Python toolkit for assessing the fairness of machine learning models and mitigating observed disparities.
As always, we start by importing the required libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import squarify
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score, roc_curve
import fairlearn
from fairlearn.postprocessing import ThresholdOptimizer
from fairlearn.postprocessing import plot_threshold_optimizer
1. Load the COMPAS dataset from the URL https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv (yes, you can use pd.read_csv() and put the link inside the parentheses) and inspect the first rows.
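A minimal sketch (we call the dataframe df, as in the later cells):

# Load the COMPAS dataset directly from the ProPublica repository
url = 'https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv'
df = pd.read_csv(url)

# Inspect the first rows
df.head()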
The COMPAS dataset contains the following variables. We tried looking up the meaning behind each variable. Our target variable is called two_year_recid.
Some variables were used to construct other variables. For example, decile_score represents the individual's COMPAS score, the value predicting the risk of recidivism. We will omit the COMPAS score and the related features, and try to predict two-year recidivism from the remaining features.
| Variable | Description |
|---|---|
| id | Unique identifier for each individual |
| name | Full name of the individual |
| first | First name of the individual |
| last | Last name of the individual |
| compas_screening_date | Date when the COMPAS screening was conducted |
| sex | Sex of the individual |
| dob | Date of birth |
| age | Age at the time of screening |
| age_cat | Categorical age group (e.g., less than 25, 25-45, greater than 45) |
| race | Race/ethnicity of the individual |
| juv_fel_count | Number of juvenile felony charges |
| decile_score | COMPAS decile score for general recidivism risk |
| juv_misd_count | Number of juvenile misdemeanor charges |
| juv_other_count | Number of other juvenile charges |
| priors_count | Number of prior offenses |
| days_b_screening_arrest | Days between screening and arrest |
| c_jail_in | Date of jail entry for the current charge |
| c_jail_out | Date of jail release for the current charge |
| c_case_number | Case number for the current charge |
| c_offense_date | Date of the current offense |
| c_arrest_date | Date of the current arrest |
| c_days_from_compas | Days from COMPAS screening to the current charge |
| c_charge_degree | Degree of the current charge (e.g., felony, misdemeanor) |
| c_charge_desc | Description of the current charge |
| is_recid | Indicator of whether the individual recidivated |
| r_case_number | Case number for the recidivism charge |
| r_charge_degree | Degree of the recidivism charge |
| r_days_from_arrest | Days from the arrest to the recidivism charge |
| r_offense_date | Date of the recidivism offense |
| r_charge_desc | Description of the recidivism charge |
| r_jail_in | Date of jail entry for the recidivism charge |
| r_jail_out | Date of jail release for the recidivism charge |
| violent_recid | Indicator of violent recidivism |
| is_violent_recid | Binary indicator for violent recidivism |
| vr_case_number | Case number for the violent recidivism charge |
| vr_charge_degree | Degree of the violent recidivism charge |
| vr_offense_date | Date of the violent recidivism offense |
| vr_charge_desc | Description of the violent recidivism charge |
| type_of_assessment | Type of COMPAS assessment conducted |
| decile_score.1 | COMPAS decile score for violent recidivism risk |
| score_text | Textual interpretation of the COMPAS score (e.g., Low, Medium, High) |
| screening_date | Date of the screening assessment |
| v_type_of_assessment | Type of violent recidivism assessment conducted |
| v_decile_score | Decile score for violent recidivism |
| v_score_text | Textual interpretation of the violent recidivism score |
| v_screening_date | Date of the violent recidivism screening |
| in_custody | Date of custody start |
| out_custody | Date of custody end |
| priors_count.1 | Redundant count of prior offenses |
| start | Start day of the observation period |
| end | End day of the observation period |
| event | Event indicator |
| two_year_recid | Indicator for recidivism within two years |
Question (discuss this with a classmate). Before we start, reflect on the task - what does it mean to predict the risk of a person committing another crime based on (some of) these variables?
Do you think that all of the available data should be used to make the prediction?
We talked about Exploratory Analysis in the first lecture. Before building a predictive model, it is good practice to explore how the data are distributed. Remember that real-world datasets are rarely balanced, and patterns within the data reveal social realities - both justified and unjustified.
2. What is the proportion of males vs females in the dataset?
(hint: value_counts() has a normalize parameter). You can also visualize the distribution of sex if you want to practice more with visualization; we will use a donut chart (pie() from matplotlib) but you can use a different type of plot. What does this plot tell us?
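A possible sketch; giving the pie wedges a width is one way to turn a pie chart into a donut chart:

# Proportion of males vs. females
print(df['sex'].value_counts(normalize=True))

# Donut chart of the sex distribution
sex_counts = df['sex'].value_counts()
plt.pie(sex_counts, labels=sex_counts.index, autopct='%1.1f%%', wedgeprops={'width': 0.4})
plt.title('Distribution of sex')
plt.show()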
Next, let's learn about the distribution of age in the dataset.
3. Visualize the distribution of age by sex.
(hint: you can use a violin plot or a box plot). What does this plot show?
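For example, with a violin plot (a box plot would work just as well):

# Age distribution per sex
sns.violinplot(data=df, x='sex', y='age')
plt.title('Age distribution by sex')
plt.show()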
4. What is the composition of the COMPAS dataset based on race? Print the percentage per value in the race variable.
5. Visualize the distribution of race using a treemap (squarify.plot).
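A sketch covering exercises 4 and 5:

# Percentage of observations per race category
print((df['race'].value_counts(normalize=True) * 100).round(2))

# Treemap of the race distribution
race_counts = df['race'].value_counts()
squarify.plot(sizes=race_counts.values, label=race_counts.index, alpha=0.8)
plt.axis('off')
plt.title('Distribution of race')
plt.show()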
6. Plot the distribution of race by sex - first using counts (frequencies), then using the log transformation of the count.
A logarithmic transformation changes the scale of the data but retains the key patterns, making it easier for us to see the within-race sex distribution.
We see that for each race category, there are always more male than female observations. However, the sex imbalance varies by group.
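One way to produce the two plots described above; using a logarithmic count axis is an equivalent way to apply the log transformation:

# Counts of race by sex
sns.countplot(data=df, x='race', hue='sex')
plt.title('Race by sex (counts)')
plt.xticks(rotation=45)
plt.show()

# Same plot with a logarithmic count axis
sns.countplot(data=df, x='race', hue='sex')
plt.yscale('log')
plt.title('Race by sex (log scale)')
plt.xticks(rotation=45)
plt.show()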
7. Now let's consider the outcome variable - two year recidivism. What is the relationship of race to recidivism in the dataset?
Start by making a bar plot of two-year recidivism by race. One suggestion is to use sns.countplot()
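For example:

# Counts of two-year recidivism outcomes per race
sns.countplot(data=df, x='race', hue='two_year_recid')
plt.title('Two-year recidivism by race (counts)')
plt.xticks(rotation=45)
plt.show()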
8. Plot the proportion of recidivism within each race category. This will make it easier to compare recidivism patterns between groups. You can use sns.barplot() in this case.
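Because two_year_recid is coded 0/1, plotting its mean per group gives the within-group recidivism proportion; a sketch:

# Proportion of recidivism within each race category (mean of a 0/1 outcome)
sns.barplot(data=df, x='race', y='two_year_recid')
plt.ylabel('Proportion recidivating within two years')
plt.xticks(rotation=45)
plt.show()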
Having explored the data, and keeping in mind the distribution of the categories, let's see what happens when we train a logistic regression model to predict two-year recidivism.
We have to prepare features for model input. Consider the following:
- Some features are identifiers or administrative fields with no predictive value (e.g. id, type_of_assessment, etc.)
- Some features are redundant and would introduce multicollinearity (e.g. age_cat vs age, or the COMPAS decile scores vs the remaining features that were used to derive them)
- Some features are not directly usable as model input, such as free text (e.g. name) and dates (e.g. in_custody)

Note. Multicollinearity occurs when two or more features in a dataset are highly correlated, meaning they provide overlapping or redundant information. This can confuse the model or make it unstable.
The first two cases can be solved by removing (some of) the features. The third case could be solved through feature engineering (e.g. text vectorization in the case of names, subtracting dates to get day counts, etc.). However, we opt for the simple approach of dropping most features that aren't readily usable as input. We make an exception for categorical features (e.g. `race`) that can be dummy-coded.
We will now prepare the input data for the model.
# Select features
included_features = ['sex', 'race', 'age', 'juv_fel_count', 'juv_misd_count', 'juv_other_count', 'priors_count', 'priors_count.1', 'c_charge_degree']
X = df[included_features].copy()
y = df['two_year_recid']
# Clone race and sex labels (will be useful later)
X['race_label'] = X['race']
X['sex_label'] = X['sex']
# One-hot encode categorical features
dummy_following_features = ['sex', 'race', 'c_charge_degree']
X = pd.get_dummies(X, columns=dummy_following_features)
X.shape
(7214, 18)
We will now split the dataset into training and test set. We will also save the race_label and the sex_label into a new dataframe and then remove them from the X_train and X_test.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Save a copy of race and sex attribute labels from the test set, we will use it later
X_test_attributes = X_test[['race_label', 'sex_label']]
# remove race and sex attribute labels from train & test sets
X_train = X_train.drop(columns=['race_label', 'sex_label'])
X_test = X_test.drop(columns=['race_label', 'sex_label'])
Let's list our final selection of model input features.
# Final predictors
for i, column in enumerate(X_train.columns, start=1):
print(f"{i}. {column}")
1. age
2. juv_fel_count
3. juv_misd_count
4. juv_other_count
5. priors_count
6. priors_count.1
7. sex_Female
8. sex_Male
9. race_African-American
10. race_Asian
11. race_Caucasian
12. race_Hispanic
13. race_Native American
14. race_Other
15. c_charge_degree_F
16. c_charge_degree_M
9. Next, fit the logistic regression model on the training data, make predictions on the test data (as we have learned), and display the classification report and confusion matrix. Open practical 3 to refresh your memory.
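A minimal sketch; the names logreg and y_pred are our own and are reused in later sketches:

# Fit a logistic regression model on the training data
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# Predict on the test data
y_pred = logreg.predict(X_test)

# Evaluate overall performance
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))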
An important idea in group fairness is that classification performance should not only be examined for the overall model (as done above), but on a per-group basis as well. For example, imagine a model with 90% overall accuracy. That sounds excellent, until you break it down by demographic group: 95% for the majority group, but only 60% for the minority group.
10. Compute the true positive, true negative, false positive and false negative rates for different race and sex groups. You can use X_test_attributes dataframe to calculate those.
Your first step is to create a dataframe that will look as follows (this observation belonged to class 0 and was predicted as class 0, so it is a TN):
| id | sex | race | y_observed | y_predicted | baseline_prediction | predicted_probability |
|---|---|---|---|---|---|---|
| 308 | Male | Caucasian | 0 | 0 | TN | 0.151703 |
| ... | ... | ... | ... | ... | ... | ... |
Once you have this dataframe, you can use pivot_table(). This function groups data by one or more keys (in our case index='race' or index='sex'), and then aggregates values using a function like mean, sum, or count (in our case, counting the TP/TN/FP/FN labels per group). Think of it as a more flexible version of `.groupby()` that returns a full table instead of a grouped series or DataFrame.
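A sketch of one way to build such a dataframe and the per-race rate table; the helper function name get_prediction_label is our own, and we reuse logreg and y_pred from the previous sketch:

# Build a per-observation comparison dataframe for the test set
comparison_df = pd.DataFrame({
    'sex': X_test_attributes['sex_label'].values,
    'race': X_test_attributes['race_label'].values,
    'y_observed': y_test.values,
    'y_predicted': y_pred,
    'predicted_probability': logreg.predict_proba(X_test)[:, 1]
}, index=X_test.index)

# Label each observation as TP / TN / FP / FN
def get_prediction_label(row):
    if row['y_observed'] == 0 and row['y_predicted'] == 0:
        return 'TN'
    elif row['y_observed'] == 1 and row['y_predicted'] == 0:
        return 'FN'
    elif row['y_observed'] == 0 and row['y_predicted'] == 1:
        return 'FP'
    else:
        return 'TP'

comparison_df['baseline_prediction'] = comparison_df.apply(get_prediction_label, axis=1)

# Count TP/TN/FP/FN per race and normalize to rates
counts = comparison_df.pivot_table(index='race', columns='baseline_prediction', aggfunc='size', fill_value=0)
rates = counts.div(counts.sum(axis=1), axis=0).round(2)
rates['support'] = counts.sum(axis=1)
print(rates)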
The values inside the cells are rates (e.g. false negative rate, true positive rate, etc.). The columns expressing model errors (FN and FP) are particularly important. We can already see that the FP rate is higher for African-Americans compared to the Caucasian group.
11. Create the same breakdown of predictions by sex. The code looks similar to the one above; this time you can directly use comparison_df, no need to recreate it.
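For example, reusing comparison_df:

# Count TP/TN/FP/FN per sex and normalize to rates
counts_sex = comparison_df.pivot_table(index='sex', columns='baseline_prediction', aggfunc='size', fill_value=0)
rates_sex = counts_sex.div(counts_sex.sum(axis=1), axis=0).round(2)
rates_sex['support'] = counts_sex.sum(axis=1)
print(rates_sex)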
When building a system to predict recidivism, which type of error is more problematic: false negatives or false positives? What do you think?
From the tables we produced above, we observe that our classifier exhibits a false positive rate gap between African-American offenders and Caucasian and Hispanic offenders (6% and 9%, respectively). Since offenders of Asian, Native American, and Other ethnicity are few in the dataset, we take the model performance on these groups with a grain of salt. We also see that there is a 9% false negative rate gap and an 11% false positive rate gap between female and male convicts.
Let's generate a detailed breakdown of the model’s performance across different racial groups
12. Make a confusion matrix for each race. You can loop through each unique race and print the classification report and confusion matrix
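A possible loop, again reusing comparison_df:

# Classification report and confusion matrix per race group
for race in comparison_df['race'].unique():
    subset = comparison_df[comparison_df['race'] == race]
    print(f"--- {race} (n={len(subset)}) ---")
    print(classification_report(subset['y_observed'], subset['y_predicted'], zero_division=0))
    print(confusion_matrix(subset['y_observed'], subset['y_predicted']))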
13. To get the results in a way that is easier to compare them, we can plot the ROC curves. Plot the ROC curve for each race. You will need a for-loop to iterate over races. What can you tell from this plot?
NOTE: The ROC curves of the Native American and Asian groups should fall straight on the diagonal; that is what an AUC of 0.5 implies.
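One way to draw the per-race ROC curves from the predicted probabilities stored in comparison_df (a sketch):

# ROC curve per race group, based on the predicted probabilities
plt.figure()
for race in comparison_df['race'].unique():
    subset = comparison_df[comparison_df['race'] == race]
    # the AUC is undefined if a group contains only one observed class
    if subset['y_observed'].nunique() < 2:
        continue
    fpr, tpr, _ = roc_curve(subset['y_observed'], subset['predicted_probability'])
    auc = roc_auc_score(subset['y_observed'], subset['predicted_probability'])
    plt.plot(fpr, tpr, label=f"{race} (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle='--', color='grey')  # chance level
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curves by race')
plt.legend()
plt.show()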
In this practical, we are going to attempt reducing the false positive rate of African-Americans.
Bias mitigation (= fairness) techniques in machine learning can attempt to tackle the problem at three different model building stages: pre-processing (modifying the training data before model fitting), in-processing (modifying the learning algorithm or its objective), and post-processing (modifying the predictions of an already trained model).
Consider the problem setup. We have: a trained classifier, its predictions (and predicted probabilities) on the test set, the observed outcomes, and the sensitive attributes (race and sex) of the individuals in the test set.
We are going to attempt to tackle the problem of bias by using a post-processing approach. This way, we can directly control the distribution of the outcome.
Logistic regression outputs the predicted probability of recidivism. The probability is then dichotomized: by default the threshold $t$ is set to 0.5, so if the predicted probability is lower than 50%, the person is assigned 0 (in our case 'low risk'); otherwise they are assigned 1 ('high risk').
However, we could also consider setting different thresholds for different groups in an attempt to reduce bias. We do this by using a threshold optimization approach from the fairlearn library. You can explore the documentation.
14. Create two variables in which you save the values of X_train['race_African-American'] and X_test['race_African-American'].
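For example (the names A_train and A_test are our own):

# Sensitive feature: indicator for African-American, from the train and test sets
A_train = X_train['race_African-American']
A_test = X_test['race_African-American']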
15. Initialise the ThresholdOptimizer (see the sketch below). The estimator can be set to the logistic regression model (the one we created before), the constraints to demographic_parity, and the objective to accuracy_score. Also, set flip to True to allow flipping decisions if doing so improves the result.
What is ThresholdOptimizer? ThresholdOptimizer adjusts the decision thresholds per group (e.g., race or sex) to satisfy a fairness constraint like demographic parity or equalized odds. It does not retrain our model — it works on the model’s predictions.
Also, we chose demographic parity, which means that the proportion of individuals predicted positive (e.g., "will reoffend") should be the same across racial groups, regardless of true outcomes.
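A sketch of the initialisation, assuming the fitted model from exercise 9 is called logreg; passing prefit=True is our choice, so that the already-trained model is not refit:

# Post-processing: learn group-specific thresholds under a demographic parity constraint
threshold_optimizer = ThresholdOptimizer(
    estimator=logreg,
    constraints="demographic_parity",
    objective="accuracy_score",
    flip=True,    # allow flipping decisions if that improves the result
    prefit=True   # logreg is already trained, do not refit it
)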
16. Then fit the optimiser on the training data, setting sensitive_features to the protected attribute in the train set. Make the new predictions on the test set using the predict function and save them in a dataframe. Finally, visualize the effect of the ThresholdOptimizer with plot_threshold_optimizer().
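A sketch of these steps; the names new_predictions and new_predictions_df are our own, and the random_state is only there to make the (possibly randomized) thresholded predictions reproducible:

# Fit the optimizer: it learns per-group thresholds from the training data
threshold_optimizer.fit(X_train, y_train, sensitive_features=A_train)

# Group-aware predictions on the test set
new_predictions = threshold_optimizer.predict(X_test, sensitive_features=A_test, random_state=42)
new_predictions_df = pd.DataFrame({'new_threshold_decision': new_predictions}, index=X_test.index)

# Visualize the learned group-specific thresholds
plot_threshold_optimizer(threshold_optimizer)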
17. Combine the comparison dataframe with the dataframe containing the new predictions.
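For example, since both dataframes share the test-set index, a column-wise concatenation works:

# Combine the baseline comparison dataframe with the new threshold-based predictions
final_comparison_df = pd.concat([comparison_df, new_predictions_df], axis=1)
final_comparison_df.head()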
Let's make some comparisons of the results we obtained.
final_comparison_df["y_predicted_new"] = final_comparison_df["y_predicted"]
#final_comparison_df["y_predicted_new"] = np.where(final_comparison_df['race'] == "African-American",final_comparison_df["y_predicted_new"],final_comparison_df["new_threshold_decision"])
final_comparison_df["y_predicted_new"][final_comparison_df['race'] == "African-American"] = final_comparison_df["new_threshold_decision"][final_comparison_df['race'] == "African-American"]
final_comparison_df[["y_predicted",'new_threshold_decision']].corr()
| | y_predicted | new_threshold_decision |
|---|---|---|
| y_predicted | 1.000000 | 0.722498 |
| new_threshold_decision | 0.722498 | 1.000000 |
def get_new_prediction_label(row):
if row['y_observed'] == 0 and row['y_predicted_new'] == 0:
return 'TN'
elif row['y_observed'] == 1 and row['y_predicted_new'] == 0:
return 'FN'
elif row['y_observed'] == 0 and row['y_predicted_new'] == 1:
return 'FP'
elif row['y_observed'] == 1 and row['y_predicted_new'] == 1:
return 'TP'
# Apply the function to each row to create the 'updated_prediction' column
final_comparison_df['updated_prediction'] = final_comparison_df.apply(get_new_prediction_label, axis=1)
final_comparison_df.head()
| | sex | race | y_observed | y_predicted | baseline_prediction | predicted_probability | new_threshold_decision | y_predicted_new | updated_prediction |
|---|---|---|---|---|---|---|---|---|---|
| 308 | Male | Caucasian | 0 | 0 | TN | 0.151788 | 0 | 0 | TN |
| 381 | Male | African-American | 0 | 0 | TN | 0.422197 | 0 | 0 | TN |
| 3238 | Male | African-American | 1 | 0 | FN | 0.356359 | 0 | 0 | FN |
| 2312 | Male | African-American | 1 | 1 | TP | 0.586242 | 1 | 1 | TP |
| 251 | Female | Other | 0 | 0 | TN | 0.202685 | 0 | 0 | TN |
# Pivot table to summarize counts
pivot_table_counts = final_comparison_df.pivot_table(index='race', columns='updated_prediction', aggfunc='size', fill_value=0)
# Calculate support values
support = pivot_table_counts.sum(axis=1)
# Normalize to get proportions (rates)
pivot_table_proportions = pivot_table_counts.div(support, axis=0)
# Add the support column to the pivot table
pivot_table_proportions['support'] = support
# Round the proportions
pivot_table_proportions = pivot_table_proportions.round(2)
print(pivot_table_proportions)
| Race - Updated Prediction | FN | FP | TN | TP | Support |
|---|---|---|---|---|---|
| African-American | 0.18 | 0.13 | 0.36 | 0.32 | 731 |
| Asian | 0.20 | 0.20 | 0.60 | 0.00 | 5 |
| Caucasian | 0.21 | 0.10 | 0.53 | 0.16 | 505 |
| Hispanic | 0.21 | 0.07 | 0.64 | 0.08 | 117 |
| Native American | 0.00 | 0.33 | 0.33 | 0.33 | 3 |
| Other | 0.27 | 0.06 | 0.61 | 0.06 | 82 |
For reference:
| Race - Baseline Prediction | FN | FP | TN | TP | Support |
|---|---|---|---|---|---|
| African-American | 0.15 | 0.16 | 0.33 | 0.36 | 731 |
| Asian | 0.20 | 0.20 | 0.60 | 0.00 | 5 |
| Caucasian | 0.21 | 0.10 | 0.53 | 0.16 | 505 |
| Hispanic | 0.21 | 0.07 | 0.64 | 0.08 | 117 |
| Native American | 0.00 | 0.33 | 0.33 | 0.33 | 3 |
| Other | 0.27 | 0.06 | 0.61 | 0.06 | 82 |
We see that our bias mitigation strategy for African-American offenders has resulted in a reduction of the false positive rate from 16% with the naïve baseline approach to 13% with the threshold-optimized approach. The false negative rate has increased by 3 percentage points, showing that bias mitigation can come with a trade-off. The performance on the remaining groups has remained the same.
A challenge is that the Asian and Native American groups are very small (5 and 3 observations in the test set) and therefore not well suited to statistical learning. To improve the performance of the classifier on these groups, one would need to consider data augmentation approaches such as SMOTE (Synthetic Minority Oversampling Technique) or other solutions for imbalanced datasets.
Sources, Acknowledgments, and Additional Reading:
https://fairlearn.org/v0.5.0/api_reference/fairlearn.postprocessing.html
https://www.holisticai.com/blog/bias-mitigation-strategies-techniques-for-classification-tasks
We would like to thank Dr. Dong Nguyen of Utrecht University whose Human-Centered Machine Learning course materials have served as an inspiration.