Practical 10: Fairness in Machine Learning¶

Daniel Anadria

Machine Learning with Python - Utrecht Summer School

In this practical, we are going to explore bias and fairness in Machine Learning!

COMPAS Recidivism¶

The COMPAS dataset contains outcomes from a proprietary tool named COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), designed to evaluate the probability of a convict committing another crime. It is utilized by judges and parole officers and is notably recognized for its discriminatory impact on African-American individuals.

Dataset source: Broward County Clerk's Office, Broward County Sheriff's Office, Florida Department of Corrections, ProPublica

We are going to use this dataset to explore some notions of group fairness as they relate to machine learning.


Disclaimer:

Unlike most tutorials that use the COMPAS dataset, we are not going to assess the fairness of the pre-computed COMPAS scores. Instead, we will build our own classifier based on the 'raw' data such as crime history and demographic information (thus excluding the derived COMPAS scores). This way, you will get some intuition about how such classifiers are built, where fairness problems might stem from in the development pipeline, and what can be done to address fairness in model outputs.

In algorithmic fairness, it's important to understand the context surrounding a specific applied machine learning task. Sources of bias are many, as are the degrees of freedom in which disparity to focus on. Unfortunately, satisfying multiple fairness criteria at the same time is often mathematically impossible, and improving fairness can reduce predictive performance - a phenomenon known as the fairness-accuracy trade-off.

We do not claim to be penal-system or social-justice experts. The purpose of this tutorial is only to demonstrate some of the machine learning approaches to bias detection and mitigation. For this to be possible, we have to make choices about which biases are 'more important' to focus on. In reality, the values influencing what to optimize the models for are multifaceted and come from different actors. We do not claim to have 'solved fairness problems'; that would require interdisciplinary, multi-agent input and would always be based on a selection of particular values.


In [32]:
!pip install -q squarify
!pip install -q fairlearn
In [33]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import squarify
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score, roc_curve

import fairlearn
from fairlearn.postprocessing import ThresholdOptimizer
from fairlearn.postprocessing import plot_threshold_optimizer

1. Load the COMPAS dataset from the url https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv and inspect the variables.

In [34]:
df = pd.read_csv("https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv")
df.head()
Out[34]:
id name first last compas_screening_date sex dob age age_cat race ... v_decile_score v_score_text v_screening_date in_custody out_custody priors_count.1 start end event two_year_recid
0 1 miguel hernandez miguel hernandez 2013-08-14 Male 1947-04-18 69 Greater than 45 Other ... 1 Low 2013-08-14 2014-07-07 2014-07-14 0 0 327 0 0
1 3 kevon dixon kevon dixon 2013-01-27 Male 1982-01-22 34 25 - 45 African-American ... 1 Low 2013-01-27 2013-01-26 2013-02-05 0 9 159 1 1
2 4 ed philo ed philo 2013-04-14 Male 1991-05-14 24 Less than 25 African-American ... 3 Low 2013-04-14 2013-06-16 2013-06-16 4 0 63 0 1
3 5 marcu brown marcu brown 2013-01-13 Male 1993-01-21 23 Less than 25 African-American ... 6 Medium 2013-01-13 NaN NaN 1 0 1174 0 0
4 6 bouthy pierrelouis bouthy pierrelouis 2013-03-26 Male 1973-01-22 43 25 - 45 Other ... 1 Low 2013-03-26 NaN NaN 2 0 1102 0 0

5 rows × 53 columns

The COMPAS dataset contains the following variables. We tried looking up the meaning behind each variable. Our target variable is called two_year_recid.

Some variables were used to construct other variables. For example, decile_score represents the individual's COMPAS score, the value predicting the risk of recidivism. We will omit the COMPAS score and the related features, and try to predict two-year recidivism from the remaining features.

Variable Description
id Unique identifier for each individual
name Full name of the individual
first First name of the individual
last Last name of the individual
compas_screening_date Date when the COMPAS screening was conducted
sex Sex of the individual
dob Date of birth
age Age at the time of screening
age_cat Categorical age group (e.g., less than 25, 25-45, greater than 45)
race Race/ethnicity of the individual
juv_fel_count Number of juvenile felony charges
decile_score COMPAS decile score for general recidivism risk
juv_misd_count Number of juvenile misdemeanor charges
juv_other_count Number of other juvenile charges
priors_count Number of prior offenses
days_b_screening_arrest Days between screening and arrest
c_jail_in Date of jail entry for the current charge
c_jail_out Date of jail release for the current charge
c_case_number Case number for the current charge
c_offense_date Date of the current offense
c_arrest_date Date of the current arrest
c_days_from_compas Days from COMPAS screening to the current charge
c_charge_degree Degree of the current charge (e.g., felony, misdemeanor)
c_charge_desc Description of the current charge
is_recid Indicator of whether the individual recidivated
r_case_number Case number for the recidivism charge
r_charge_degree Degree of the recidivism charge
r_days_from_arrest Days from the arrest to the recidivism charge
r_offense_date Date of the recidivism offense
r_charge_desc Description of the recidivism charge
r_jail_in Date of jail entry for the recidivism charge
r_jail_out Date of jail release for the recidivism charge
violent_recid Indicator of violent recidivism
is_violent_recid Binary indicator for violent recidivism
vr_case_number Case number for the violent recidivism charge
vr_charge_degree Degree of the violent recidivism charge
vr_offense_date Date of the violent recidivism offense
vr_charge_desc Description of the violent recidivism charge
type_of_assessment Type of COMPAS assessment conducted
decile_score.1 COMPAS decile score for violent recidivism risk
score_text Textual interpretation of the COMPAS score (e.g., Low, Medium, High)
screening_date Date of the screening assessment
v_type_of_assessment Type of violent recidivism assessment conducted
v_decile_score Decile score for violent recidivism
v_score_text Textual interpretation of the violent recidivism score
v_screening_date Date of the violent recidivism screening
in_custody Date of custody start
out_custody Date of custody end
priors_count.1 Redundant count of prior offenses
start Start day of the observation period
end End day of the observation period
event Event indicator
two_year_recid Indicator for recidivism within two years

Before we start, reflect on the task - what does it mean to predict the risk of a person committing another crime based on (some of) these variables?

Do you think that all of the available data should be used to make the prediction?

Exploratory Data Analysis¶

Before building a predictive model, it's good practice to explore how the data are distributed. Remember that real-world datasets are rarely balanced, and patterns within the data reveal social realities - both justified and unjustified.

2. What is the proportion of males vs. females in the dataset? (hint: value_counts() has a normalize parameter). You can also visualize the distribution of sex if you want to practice more with visualization; we will use a donut chart, but you can use a different type of plot.

In [35]:
df['sex'].value_counts(normalize=True)*100
Out[35]:
sex
Male      80.6626
Female    19.3374
Name: proportion, dtype: float64

You can also visualize the distribution of sex.

In [36]:
# Count the occurrences of each sex
sex_counts = df['sex'].value_counts()

# Get a colorblind-friendly palette with as many colors as there are categories in sex_counts
colors = sns.color_palette('colorblind', len(sex_counts))

# Calculate total for percentage calculation
total_count = sex_counts.sum()

# Create a donut chart with counts and percentages in the autopct, using the colorblind palette
plt.figure(figsize=(8, 8))
plt.pie(sex_counts, labels=sex_counts.index, autopct=lambda pct: f'{int(round(pct*total_count/100.0))} ({pct:.1f}%)', startangle=140, wedgeprops=dict(width=0.3), colors=colors)
plt.title('Distribution of Sex')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

Next, let's learn about the distribution of age in the dataset.

3. Visualize the distribution of age by sex. (hint: you can use a violin plot or a box plot)

In [37]:
# Create a violin plot (assign the grouping variable to hue to avoid the seaborn palette deprecation warning)
sns.violinplot(x='age', y='sex', hue='sex', data=df, palette='colorblind', legend=False)

plt.title('Age Distribution by Sex')
plt.xlabel('Age')
plt.ylabel('Sex')
plt.show()

We see that the distribution of age is similar between the sexes. Both male and female convicts tend to be young, and their numbers decrease with age. Male convicts have a longer right tail, meaning that there are some elderly men in the dataset, although not too many.
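To back this up numerically, here is a small optional check (not part of the original run) that summarizes age per sex with the columns already loaded; the larger maximum for males reflects the longer right tail.

In [ ]:
# Summary statistics of age per sex
df.groupby('sex')['age'].describe().round(1)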

4. What is the composition of the COMPAS dataset based on race? Print the percentage of each value of the race variable.

In [38]:
df['race'].value_counts(normalize=True)*100
Out[38]:
race
African-American    51.233712
Caucasian           34.017189
Hispanic             8.830053
Other                5.225950
Asian                0.443582
Native American      0.249515
Name: proportion, dtype: float64

Compare this with the 2020 U.S. Census*:

Self-identified race Percent of population
African-American 12.4%
Caucasian 61.6%
Hispanic 19%
Other 8.4%
Asian 6.0%
Native American 1.1%

* approximate values from: census.gov; the values do not add up to 100 because of the overlaps in the Hispanic class with other categories

We see that the COMPAS dataset does not reflect the population racial distribution of the United States. In particular, African-Americans are substantially overrepresented while other groups are underrepresented.

5. Visualize the distribution of race using a treemap (squarify.plot).

In [39]:
# Count the occurrences of each race
race_counts = df['race'].value_counts()

# Calculate total count for normalization
total_count = race_counts.sum()

# Generate labels with count and percentage
labels = [f'{index}\n{count} ({count/total_count:.1%})' for index, count in race_counts.items()]

# Create a treemap
plt.figure(figsize=(12, 8))
squarify.plot(sizes=race_counts, label=labels, alpha=0.8, color=sns.color_palette("colorblind", len(race_counts)))
plt.axis('off')
plt.title('Distribution of Race')
plt.show()

6. Plot the distribution of race by sex - first using counts (frequencies), then using the log transformation of the count.

In [40]:
# Create a count plot
sns.countplot(x='race', hue='sex', data=df, palette='colorblind')

plt.title('Distribution of Race by Sex')
plt.xlabel('Race')
plt.ylabel('Count')
plt.xticks(rotation=45)  # rotate x-axis labels for better readability
plt.legend(title='Sex')
plt.tight_layout()  # adjust layout to make room for the rotated x-axis labels
plt.show()
In [41]:
# Create a count plot
sns.countplot(x='race', hue='sex', data=df, palette='colorblind')

plt.title('Distribution of Race by Sex')
plt.xlabel('Race')
plt.ylabel('Log Count')
plt.xticks(rotation=45)  # rotate x-axis labels for better readability
plt.legend(title='Sex')
plt.yscale('log')  # Set y-axis to log scale
plt.tight_layout()  # adjust layout to make room for the rotated x-axis labels
plt.show()

The logarithmic transformation changes the scale of the data but retains the key patterns, making it easier to see the within-race sex distribution.

We see that for each race category, there are always more male than female observations. However, the sex imbalance varies by group.
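The varying sex imbalance can also be read off a normalized cross-tabulation; a small optional sketch using the dataframe loaded above:

In [ ]:
# Share of each sex within every race category (rows sum to 1)
pd.crosstab(df['race'], df['sex'], normalize='index').round(2)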

7. Now let's consider the outcome variable - two year recidivism. What is the relationship of race to recidivism in the dataset? Start by making a bar plot of two-year recidivism by race.

In [42]:
ax = sns.countplot(x='race', hue='two_year_recid', data=df, palette='colorblind')

# Add titles and labels
plt.title('Count of Two-Year Recidivism by Race')
plt.xlabel('Race')
plt.ylabel('Count')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability

# Add text annotations
for container in ax.containers:
    ax.bar_label(container, fmt='%.0f', label_type='edge')

# Modify legend
handles, labels = ax.get_legend_handles_labels()
new_labels = ['No', 'Yes']  # Mapping 0 to No, 1 to Yes
plt.legend(handles, new_labels, title='Two-Year Recidivism')

# Optional: plt.yscale('log') # Uncomment if log scale is desired for y-axis
plt.tight_layout()  # Adjust layout to make room for the rotated x-axis labels
plt.show()

The bar plot shows that African-Americans are indeed the largest group, and also that, relative to the other large groups, African-Americans have proportionally more recidivist cases in the dataset.

A challenge is that due to large racial imbalances in the dataset, it is hard to see the plotted values for the smaller groups such as Native Americans and Asians.

8. Plot the proportion of recidivism within each race category. This will make comparing recidivism patterns between groups easier.

In [43]:
# Calculate the proportion of recidivism within each race category
df_normalized = df.groupby('race')['two_year_recid'].value_counts(normalize=True).rename('proportion').reset_index()

# Plot the normalized proportions
ax = sns.barplot(x='race', y='proportion', hue='two_year_recid', data=df_normalized, palette='colorblind')

# Add titles and labels
plt.title('Proportion of Two-Year Recidivism by Race')
plt.xlabel('Race')
plt.ylabel('Proportion')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability

# Add text annotations
for container in ax.containers:
    ax.bar_label(container, fmt='%.2f', label_type='edge')

# Modify legend
handles, labels = ax.get_legend_handles_labels()
new_labels = ['No', 'Yes']  # Mapping 0 to No, 1 to Yes
plt.legend(handles, new_labels, title='Two-Year Recidivism', loc='upper right', bbox_to_anchor=(1.35, 1))

plt.tight_layout()  # Adjust layout to make room for the rotated x-axis labels

plt.show()

By plotting proportions, we can see that the only two groups with recidivism above 50% are African- and Native Americans.
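To attach exact numbers to this observation, the recidivism rate per group is simply the mean of the binary target within each race category; an optional check:

In [ ]:
# Proportion of two-year recidivism within each race category, in percent
(df.groupby('race')['two_year_recid'].mean() * 100).round(1).sort_values(ascending=False)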

Prediction of Recidivism - Naïve Baseline Approach¶

Having explored the data, let's see what happens when we train a logistic regression model to predict two-year recidivism.

We have to prepare features for model input. Consider the following:

  • some features carry little predictive power, either because they are unrelated to the task or because they show little variance (e.g. id, type_of_assessment, etc.)
  • some features may cause multicollinearity issues (e.g. age_cat vs age, or the COMPAS decile scores vs the remaining features used to derive them)
  • some features contain text (e.g. name) and dates (e.g. in_custody)

The first two cases can be solved by removing (some of) the features. The third case could be solved through feature engineering (e.g. text vectorization in the case of names, subtracting dates to get day counts, etc.; see the sketch below). However, we opt for the simple approach of dropping most features that aren't readily usable as model input. We make an exception for categorical features (e.g. race) that can be dummy-coded.
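As a sketch of the date-based feature engineering we are skipping (this feature is not used anywhere below), the jail-entry and jail-release dates could be turned into a length-of-stay feature, assuming the date columns parse cleanly:

In [ ]:
# Sketch only (not used below): length of stay in days for the current charge
length_of_stay = (pd.to_datetime(df['c_jail_out']) - pd.to_datetime(df['c_jail_in'])).dt.days
length_of_stay.describe()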

We will now prepare the input data for the model.

In [44]:
# Select features
included_features = ['sex', 'race', 'age', 'juv_fel_count', 'juv_misd_count', 'juv_other_count', 'priors_count', 'priors_count.1', 'c_charge_degree']
X = df[included_features].copy()
y = df['two_year_recid']

# Clone race and sex labels (will be useful later)
X['race_label'] = X['race']
X['sex_label'] = X['sex']
In [45]:
# One-hot encode categorical features
dummy_following_features = ['sex', 'race', 'c_charge_degree']
X = pd.get_dummies(X, columns=dummy_following_features)
X.shape
Out[45]:
(7214, 18)

We will now split the dataset into training and test sets. We will also save race_label and sex_label into a new dataframe and then remove them from X_train and X_test.

In [46]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Save a copy of race and sex attribute labels from the test set
X_test_attributes = X_test[['race_label', 'sex_label']]

# remove race and sex attribute labels from train & test sets
X_train = X_train.drop(columns=['race_label', 'sex_label'])
X_test = X_test.drop(columns=['race_label', 'sex_label'])

Let's list our final selection of model input features.

In [47]:
# Final predictors

for i, column in enumerate(X_train.columns, start=1):
    print(f"{i}. {column}")
1. age
2. juv_fel_count
3. juv_misd_count
4. juv_other_count
5. priors_count
6. priors_count.1
7. sex_Female
8. sex_Male
9. race_African-American
10. race_Asian
11. race_Caucasian
12. race_Hispanic
13. race_Native American
14. race_Other
15. c_charge_degree_F
16. c_charge_degree_M

9. Next fit the logistic regression model and display the classification report and confusion matrix.

In [48]:
# Initialize and fit the logistic regression model
lr_model = LogisticRegression(max_iter=800)
lr_model.fit(X_train, y_train)

# Predict on the test set
y_pred = lr_model.predict(X_test) # predictions
y_prob = lr_model.predict_proba(X_test)[:, 1]  # probabilities for the positive class

# Generate the classification report
class_report = classification_report(y_test, y_pred)

# Generate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

print("\nGlobal Classification Report:")
print(class_report)

print("\nGlobal Confusion Matrix:")
print(conf_matrix)
Global Classification Report:
              precision    recall  f1-score   support

           0       0.71      0.78      0.74       823
           1       0.66      0.58      0.62       620

    accuracy                           0.69      1443
   macro avg       0.68      0.68      0.68      1443
weighted avg       0.69      0.69      0.69      1443


Global Confusion Matrix:
[[638 185]
 [262 358]]

An important idea in group fairness is that classification performance should not only be examined for the overall model (as done above), but on a per group basis as well.
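As a side note before doing this by hand, the fairlearn library ships a MetricFrame helper that disaggregates any metric by a sensitive feature; a minimal sketch using the test-set predictions from above (in the rest of this practical we stick to the manual pivot-table approach):

In [ ]:
from fairlearn.metrics import MetricFrame, false_positive_rate, false_negative_rate
from sklearn.metrics import accuracy_score

# Overall and per-race accuracy, false positive rate, and false negative rate in one object
mf = MetricFrame(
    metrics={'accuracy': accuracy_score,
             'false_positive_rate': false_positive_rate,
             'false_negative_rate': false_negative_rate},
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=X_test_attributes['race_label'],
)
print(mf.overall)
print(mf.by_group)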

10. Compute the true positive, true negative, false positive, and false negative rates for the different race and sex groups. You can use the X_test_attributes dataframe to calculate these.

In [49]:
# Create a comparison DataFrame
comparison_df = pd.DataFrame({})

# Append sex and race information
comparison_df['sex'] = X_test_attributes['sex_label']
comparison_df['race'] = X_test_attributes['race_label']

# Append observed and predicted outcomes
comparison_df['y_observed'] = y_test
comparison_df['y_predicted'] = y_pred

# Function to determine the prediction type
def get_prediction_label(row):
    if row['y_observed'] == 0 and row['y_predicted'] == 0:
        return 'TN'
    elif row['y_observed'] == 1 and row['y_predicted'] == 0:
        return 'FN'
    elif row['y_observed'] == 0 and row['y_predicted'] == 1:
        return 'FP'
    elif row['y_observed'] == 1 and row['y_predicted'] == 1:
        return 'TP'

# Apply the function to each row to create the 'baseline prediction' column
comparison_df['baseline_prediction'] = comparison_df.apply(get_prediction_label, axis=1)

comparison_df['predicted_probability'] = y_prob
In [50]:
# Pivot table to summarize counts
pivot_table_counts = comparison_df.pivot_table(index='race', columns='baseline_prediction', aggfunc='size', fill_value=0)

# Calculate support values
support = pivot_table_counts.sum(axis=1)

# Normalize to get proportions (rates)
pivot_table_proportions = pivot_table_counts.div(support, axis=0)

# Add the support column to the pivot table
pivot_table_proportions['support'] = support

# Round the proportions
pivot_table_proportions = pivot_table_proportions.round(2)

print(pivot_table_proportions)
baseline_prediction    FN    FP    TN    TP  support
race                                                
African-American     0.15  0.16  0.33  0.36      731
Asian                0.20  0.20  0.60  0.00        5
Caucasian            0.21  0.10  0.53  0.16      505
Hispanic             0.21  0.07  0.64  0.08      117
Native American      0.00  0.33  0.33  0.33        3
Other                0.27  0.06  0.61  0.06       82

The values inside the cells are rates (e.g. the false negative rate, the true positive rate, etc.), i.e. the share of each prediction outcome within a group. The columns expressing model errors (FN and FP) are particularly important. We can already see that the FP rate is higher for African-Americans than for the Caucasian group.

11. Create the same breakdown of predictions by sex.

In [51]:
# Pivot table to summarize counts
pivot_table_counts = comparison_df.pivot_table(index='sex', columns='baseline_prediction', aggfunc='size', fill_value=0)

# Calculate support values
support = pivot_table_counts.sum(axis=1)

# Normalize to get proportions (rates)
pivot_table_proportions = pivot_table_counts.div(support, axis=0)

# Add the support column to the pivot table
pivot_table_proportions['support'] = support

# Round the proportions
pivot_table_proportions = pivot_table_proportions.round(2)

print(pivot_table_proportions)
baseline_prediction    FN    FP    TN    TP  support
sex                                                 
Female               0.26  0.04  0.62  0.08      250
Male                 0.17  0.15  0.41  0.28     1193

When building a system to predict recidivism, which type of error is more problematic - false negatives or false positives?

False negative - the model predicts low risk, the person is released from pretrial detention / given parole or probation, and the released person then reoffends.

False positive - the model predicts high risk, the person is detained / not given parole or probation, even though in reality they would not have reoffended.

The choice of what matters more is highly context-dependent. We do not claim to have the answer. In this practical, we focus on some differences and highlight a technical approach to bias mitigation.

From the tables produced above, we observe that our classifier exhibits a false positive rate gap between African-American offenders and Caucasian and Hispanic offenders (6% and 9%, respectively). Since offenders of Asian, Native American, and Other ethnicity are few in the dataset, we take the model's performance on these groups with a grain of salt. We also see a 9% false negative rate gap and an 11% false positive rate gap between female and male convicts.
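A small optional sketch to read the sex gaps directly off the previous pivot table (which at this point holds the breakdown by sex):

In [ ]:
# FN and FP gaps between the sexes, taken from the table above
fn_gap = pivot_table_proportions.loc['Female', 'FN'] - pivot_table_proportions.loc['Male', 'FN']
fp_gap = pivot_table_proportions.loc['Male', 'FP'] - pivot_table_proportions.loc['Female', 'FP']
print(f"False negative gap (Female - Male): {fn_gap:.2f}")
print(f"False positive gap (Male - Female): {fp_gap:.2f}")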

12. Make a confusion matrix for each race. You can loop through each unique race and print the classification report and confusion matrix.

In [52]:
results = pd.DataFrame({
    'true_label': y_test,
    'predicted_probability': y_prob,
    'predicted_label': y_pred,
    'race': X_test_attributes['race_label']
})


# Loop through each unique race and print the classification report and confusion matrix
races = X_test_attributes['race_label'].unique()

for race in races:
    race_results = results[results['race'] == race]
    true_labels = race_results['true_label']
    predicted_labels = race_results['predicted_label']

    class_report = classification_report(true_labels, predicted_labels)
    conf_matrix = confusion_matrix(true_labels, predicted_labels)

    print(f"\nRace: {race}")
    print("\nClassification Report:")
    print(class_report)
    print("\nConfusion Matrix:")
    print(conf_matrix)
Race: Caucasian

Classification Report:
              precision    recall  f1-score   support

           0       0.72      0.84      0.77       320
           1       0.61      0.43      0.50       185

    accuracy                           0.69       505
   macro avg       0.66      0.63      0.64       505
weighted avg       0.68      0.69      0.68       505


Confusion Matrix:
[[268  52]
 [105  80]]

Race: African-American

Classification Report:
              precision    recall  f1-score   support

           0       0.69      0.67      0.68       359
           1       0.69      0.71      0.70       372

    accuracy                           0.69       731
   macro avg       0.69      0.69      0.69       731
weighted avg       0.69      0.69      0.69       731


Confusion Matrix:
[[241 118]
 [109 263]]

Race: Other

Classification Report:
              precision    recall  f1-score   support

           0       0.69      0.91      0.79        55
           1       0.50      0.19      0.27        27

    accuracy                           0.67        82
   macro avg       0.60      0.55      0.53        82
weighted avg       0.63      0.67      0.62        82


Confusion Matrix:
[[50  5]
 [22  5]]

Race: Hispanic

Classification Report:
              precision    recall  f1-score   support

           0       0.75      0.90      0.82        83
           1       0.53      0.26      0.35        34

    accuracy                           0.72       117
   macro avg       0.64      0.58      0.59       117
weighted avg       0.69      0.72      0.68       117


Confusion Matrix:
[[75  8]
 [25  9]]

Race: Native American

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.50      0.67         2
           1       0.50      1.00      0.67         1

    accuracy                           0.67         3
   macro avg       0.75      0.75      0.67         3
weighted avg       0.83      0.67      0.67         3


Confusion Matrix:
[[1 1]
 [0 1]]

Race: Asian

Classification Report:
              precision    recall  f1-score   support

           0       0.75      0.75      0.75         4
           1       0.00      0.00      0.00         1

    accuracy                           0.60         5
   macro avg       0.38      0.38      0.38         5
weighted avg       0.60      0.60      0.60         5


Confusion Matrix:
[[3 1]
 [1 0]]

13. To make the results easier to compare, we can plot the ROC curves. Plot the ROC curve for each race.

In [53]:
# Plot ROC curves for each race
races = X_test_attributes['race_label'].unique()

plt.figure()

for race in races:
    race_results = results[results['race'] == race]
    fpr, tpr, _ = roc_curve(race_results['true_label'], race_results['predicted_probability'])
    roc_auc = roc_auc_score(race_results['true_label'], race_results['predicted_probability'])
    plt.plot(fpr, tpr, label=f'ROC curve for {race} (area = {roc_auc:.2f})')

plt.plot([0, 1], [0, 1], 'k--')  # Diagonal line
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic by Race')
plt.legend(loc='lower right')
plt.show()

NOTE: the ROC curves of the Native American and Asian groups fall essentially on the diagonal; that is what an AUC value of 0.5 implies - the classifier does no better than chance for these (very small) groups.

In this practical, we are going to attempt to reduce the false positive rate for African-Americans.

Bias Mitigation¶

Post-Processing - Threshold Optimizer¶

Bias mitigation (fairness) techniques in machine learning can tackle the problem at three different stages of the model-building pipeline:

  • preprocessing: change the input
  • in-processing: change the model
  • post-processing: change the output

Consider the problem setup. We have:

  • Features: $X$
  • Target: $y \in \{0, 1\}$
  • Task: we want to predict $y$ from $X$
  • Score function (here probability) $P=f(X)$
  • Decision based on a threshold $D = \mathbb{1}\{P > t\}$
  • Sensitive attribute $A \in \{a, b\}$

We are going to tackle the problem of bias using a post-processing approach. This way, we can directly control the distribution of the outcomes.

Logistic regression outputs a predicted probability of recidivism. The probability is then dichotomized - by default the threshold $t$ is set to 0.5: if the predicted probability is lower than 50%, the person is assigned 0 (in our case 'low risk'); otherwise they are assigned 1 ('high risk').

However, we could also consider setting different thresholds for different groups in an attempt to reduce bias. We do this by using a threshold optimization approach from the fairlearn library. You can explore the documentation.
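To make the idea of group-specific thresholds concrete before handing it over to fairlearn, here is a hand-rolled sketch; the threshold value for the African-American group is purely illustrative (the ThresholdOptimizer below searches for suitable values instead of guessing them):

In [ ]:
# Hand-rolled group-specific thresholding; 0.55 is a hypothetical, hand-picked value
example_thresholds = {'African-American': 0.55}
default_threshold = 0.5

group = X_test_attributes['race_label']
t = group.map(example_thresholds).fillna(default_threshold).to_numpy()
y_pred_custom = (y_prob > t).astype(int)
pd.Series(y_pred_custom).value_counts()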

14. Create two variables in which you save the values of X_train['race_African-American'] and X_test['race_African-American'].

In [54]:
# Protected attribute
attribute = ['race_African-American']
protected_attribute_train = X_train[attribute]
protected_attribute_test = X_test[attribute]

15. Initialize the ThresholdOptimizer. The estimator can be set to the logistic regression model, the constraints to demographic parity, and the objective to accuracy score.

In [55]:
# Initialize ThresholdOptimizer
postprocessor = ThresholdOptimizer(
  estimator=lr_model,
  constraints='demographic_parity',
  objective='accuracy_score',
  grid_size=1000,
  flip=True,
  prefit=False,
  predict_method='auto'
)

16. Then fit the optimizer on the training data, setting sensitive_features to the protected attribute from the training set. Make new predictions on the test set using the predict function and save them in a dataframe.

In [56]:
# Fit the postprocessor
postprocessor.fit(X_train, y_train, sensitive_features=protected_attribute_train)

# Predict on the test set
predictions = postprocessor.predict(X_test, sensitive_features=protected_attribute_test, random_state=42)

# Save predictions in the dictionary
predictions_dict = {}
predictions_dict['new_threshold_decision'] = predictions

print(f"Attribute :{attribute}, value={postprocessor._x_best}")

# Convert the predictions dictionary to a DataFrame for easier handling
predictions_df = pd.DataFrame(predictions_dict)
Attribute :['race_African-American'], value=0.442
In [57]:
plot_threshold_optimizer(postprocessor)

17. Combine the comparison and predictions dataframes.

In [58]:
# Combine comparison_df and predictions_df
predictions_df.index = comparison_df.index
final_comparison_df = pd.concat([comparison_df, predictions_df], axis=1)
final_comparison_df.head()
Out[58]:
sex race y_observed y_predicted baseline_prediction predicted_probability new_threshold_decision
308 Male Caucasian 0 0 TN 0.151788 0
381 Male African-American 0 0 TN 0.422197 0
3238 Male African-American 1 0 FN 0.356359 0
2312 Male African-American 1 1 TP 0.586242 1
251 Female Other 0 0 TN 0.202685 0

Let's make some comparisons between the results we obtained.

In [59]:
final_comparison_df["y_predicted_new"] = final_comparison_df["y_predicted"]
#final_comparison_df["y_predicted_new"] = np.where(final_comparison_df['race'] == "African-American",final_comparison_df["y_predicted_new"],final_comparison_df["new_threshold_decision"])
final_comparison_df["y_predicted_new"][final_comparison_df['race'] == "African-American"] = final_comparison_df["new_threshold_decision"][final_comparison_df['race'] == "African-American"]
In [60]:
final_comparison_df[["y_predicted",'new_threshold_decision']].corr()
Out[60]:
y_predicted new_threshold_decision
y_predicted 1.000000 0.722498
new_threshold_decision 0.722498 1.000000
In [61]:
# Function to determine the prediction type for the post-processed predictions
def get_updated_prediction_label(row):
    if row['y_observed'] == 0 and row['y_predicted_new'] == 0:
        return 'TN'
    elif row['y_observed'] == 1 and row['y_predicted_new'] == 0:
        return 'FN'
    elif row['y_observed'] == 0 and row['y_predicted_new'] == 1:
        return 'FP'
    elif row['y_observed'] == 1 and row['y_predicted_new'] == 1:
        return 'TP'

# Apply the function to each row to create the 'updated_prediction' column
final_comparison_df['updated_prediction'] = final_comparison_df.apply(get_updated_prediction_label, axis=1)

final_comparison_df.head()

final_comparison_df.head()
Out[61]:
sex race y_observed y_predicted baseline_prediction predicted_probability new_threshold_decision y_predicted_new updated_prediction
308 Male Caucasian 0 0 TN 0.151788 0 0 TN
381 Male African-American 0 0 TN 0.422197 0 0 TN
3238 Male African-American 1 0 FN 0.356359 0 0 FN
2312 Male African-American 1 1 TP 0.586242 1 1 TP
251 Female Other 0 0 TN 0.202685 0 0 TN
In [62]:
# Pivot table to summarize counts
pivot_table_counts = final_comparison_df.pivot_table(index='race', columns='updated_prediction', aggfunc='size', fill_value=0)

# Calculate support values
support = pivot_table_counts.sum(axis=1)

# Normalize to get proportions (rates)
pivot_table_proportions = pivot_table_counts.div(support, axis=0)

# Add the support column to the pivot table
pivot_table_proportions['support'] = support

# Round the proportions
pivot_table_proportions = pivot_table_proportions.round(2)

print(pivot_table_proportions)
updated_prediction    FN    FP    TN    TP  support
race                                               
African-American    0.18  0.13  0.36  0.32      731
Asian               0.20  0.20  0.60  0.00        5
Caucasian           0.21  0.10  0.53  0.16      505
Hispanic            0.21  0.07  0.64  0.08      117
Native American     0.00  0.33  0.33  0.33        3
Other               0.27  0.06  0.61  0.06       82

For reference:

Race - Baseline Prediction FN FP TN TP Support
African-American 0.15 0.16 0.33 0.36 731
Asian 0.20 0.20 0.60 0.00 5
Caucasian 0.21 0.10 0.53 0.16 505
Hispanic 0.21 0.07 0.64 0.08 117
Native American 0.00 0.33 0.33 0.33 3
Other 0.27 0.06 0.61 0.06 82

We see that our bias mitigation strategy for African-Americans has resulted in a reduction of the false positive rate from 16% in the naïve baseline approach to 13% in the threshold-optimized approach. The false negative rate has increased by 3%, showing that bias mitigation can come with a trade-off. The performance on the remaining groups has remained the same.
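Since the ThresholdOptimizer was fit with a demographic-parity constraint, another optional check is whether the selection-rate gap between African-Americans and the other groups shrank after post-processing; a sketch using fairlearn.metrics and the columns created above:

In [ ]:
from fairlearn.metrics import demographic_parity_difference

# Selection-rate gap between African-Americans and everyone else, before and after post-processing
is_aa = final_comparison_df['race'] == 'African-American'

gap_before = demographic_parity_difference(
    final_comparison_df['y_observed'], final_comparison_df['y_predicted'], sensitive_features=is_aa)
gap_after = demographic_parity_difference(
    final_comparison_df['y_observed'], final_comparison_df['y_predicted_new'], sensitive_features=is_aa)

print(f"Selection-rate gap before post-processing: {gap_before:.3f}")
print(f"Selection-rate gap after post-processing:  {gap_after:.3f}")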

A challenge is that the Asian and Native American groups are small and therefore not well suited to statistical learning. To improve the performance of the classifier on these groups, one would need to consider data augmentation approaches such as SMOTE (Synthetic Minority Oversampling Technique) or other solutions to imbalanced datasets.
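For completeness, a minimal sketch of how SMOTE is typically applied with the imbalanced-learn package (an assumption: pip install -q imbalanced-learn). Note that vanilla SMOTE balances the target classes; enlarging small race groups would require resampling with respect to the group variable instead.

In [ ]:
# Sketch only: oversample the minority target class in the training data with SMOTE
from imblearn.over_sampling import SMOTE

X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(y_train.value_counts(), y_train_res.value_counts(), sep='\n\n')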

Sources, Acknowledgments, and Additional Reading:

https://fairlearn.org/v0.5.0/api_reference/fairlearn.postprocessing.html

https://www.holisticai.com/blog/bias-mitigation-strategies-techniques-for-classification-tasks


We would like to thank Dr. Dong Nguyen of Utrecht University whose Human-Centered Machine Learning course materials have served as an inspiration.