The Imperative of Fairness Testing
Machine learning models increasingly make decisions affecting people’s lives—loan approvals, hiring recommendations, medical diagnoses, criminal sentencing. When these models encode biases from training data or design choices, they can perpetuate discrimination at scale. A biased hiring algorithm might systematically reject qualified candidates based on gender. A credit scoring model might unfairly penalize specific ethnic groups.
Bias detection isn’t just an ethical imperative—it’s a legal and business necessity. Regulations like the EU AI Act and proposed US legislation mandate fairness assessments. Companies face reputational damage and litigation when biased AI systems are exposed.
Types of ML Bias
1. Data Bias
Historical Bias: Training data reflects existing societal inequalities.
# Example: Historical hiring data shows gender imbalance
import pandas as pd

training_data = pd.read_csv('historical_hires.csv')
print(training_data['gender'].value_counts())
# Male: 8500 (85%)
# Female: 1500 (15%)
# A model trained on this data will learn to favor male candidates
Representation Bias: Some groups are underrepresented in training data.
# Example: Face recognition dataset
dataset_distribution = {
    'White': 0.70,
    'Asian': 0.15,
    'Black': 0.10,
    'Hispanic': 0.05
}
# The model will perform worse on underrepresented groups
Measurement Bias: Features are measured or recorded differently across groups.
# Example: Credit scores measured differently by region
# Urban: Comprehensive credit history
# Rural: Limited credit history (proxies used)
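One practical check for measurement bias is to compare how completely a feature is captured for each group. A minimal sketch, assuming a hypothetical applications.csv with region and credit_history_months columns:
import pandas as pd

applications = pd.read_csv('applications.csv')  # hypothetical file and columns

# Missing-value rate and median history length per region
coverage = applications.groupby('region')['credit_history_months'].agg(
    missing_rate=lambda s: s.isna().mean(),
    median_months='median'
)
print(coverage)
# A large gap between urban and rural rows suggests the feature is
# measured differently across groups and may act as a biased proxy.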
2. Algorithmic Bias
Aggregation Bias: A one-size-fits-all model fails to account for group differences.
# Single diabetes prediction model for all demographics
# Optimal HbA1c thresholds differ by ethnicity
# Model accuracy: 92% for Caucasians, 78% for African Americans
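Aggregation bias typically surfaces as a gap in per-group performance, so a simple per-group evaluation makes it visible. A minimal sketch, assuming a fitted classifier model, a test frame X_test with an ethnicity column, and labels y_test:
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)

# Accuracy computed separately for each demographic group
for group in X_test['ethnicity'].unique():
    mask = X_test['ethnicity'] == group
    print(f"{group}: accuracy={accuracy_score(y_test[mask], y_pred[mask]):.3f}")
# Large gaps (e.g., 0.92 vs 0.78) show that one aggregated model is not
# serving all groups equally well.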
Evaluation Bias: The test set doesn’t represent the deployment population.
# Model evaluated on data from wealthy neighborhoods
# Deployed in economically diverse areas
# Real-world performance degrades for underrepresented groups
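Comparing the group composition of the evaluation set against a sample from the deployment environment catches this early. A minimal sketch, assuming hypothetical DataFrames test_df and production_df that share an income_bracket column:
import pandas as pd

test_dist = test_df['income_bracket'].value_counts(normalize=True)
prod_dist = production_df['income_bracket'].value_counts(normalize=True)

comparison = pd.DataFrame({'test': test_dist, 'production': prod_dist}).fillna(0)
comparison['gap'] = (comparison['test'] - comparison['production']).abs()
print(comparison.sort_values('gap', ascending=False))
# Groups that are rare in the test set but common in production are the
# ones whose real-world performance the evaluation will overstate.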
Fairness Metrics
1. Demographic Parity
All groups receive positive outcomes at equal rates.
import numpy as np

def demographic_parity(y_pred, sensitive_attr):
    """
    Calculate the demographic parity difference.
    Ideally it should be close to 0.
    """
    groups = np.unique(sensitive_attr)
    positive_rates = []
    for group in groups:
        group_mask = sensitive_attr == group
        positive_rate = y_pred[group_mask].mean()
        positive_rates.append(positive_rate)
    # Maximum difference between groups
    dp_difference = max(positive_rates) - min(positive_rates)
    return {
        'demographic_parity_difference': dp_difference,
        'group_positive_rates': dict(zip(groups, positive_rates)),
        'is_fair': dp_difference < 0.1  # 10% threshold
    }
# Example usage
y_pred = model.predict(X_test)
sensitive_attr = X_test['gender']
fairness = demographic_parity(y_pred, sensitive_attr)
print(f"Demographic parity difference: {fairness['demographic_parity_difference']:.3f}")
# 0.052 → Acceptable
# 0.250 → Problematic bias
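A common ratio-based variant of the same idea is disparate impact (the “four-fifths rule” from US employment guidelines): divide the lowest group selection rate by the highest and treat values below 0.8 as a warning sign. A minimal sketch reusing y_pred and sensitive_attr from above:
def disparate_impact_ratio(y_pred, sensitive_attr):
    """Ratio of the lowest to the highest group selection rate."""
    groups = np.unique(sensitive_attr)
    rates = [y_pred[sensitive_attr == group].mean() for group in groups]
    return min(rates) / max(rates) if max(rates) > 0 else 0.0

ratio = disparate_impact_ratio(y_pred, sensitive_attr)
print(f"Disparate impact ratio: {ratio:.3f}")  # < 0.8 suggests adverse impact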
2. Equalized Odds
The true positive rate and false positive rate are equal across groups.
from sklearn.metrics import confusion_matrix

def equalized_odds(y_true, y_pred, sensitive_attr):
    """
    Calculate TPR and FPR differences between groups.
    """
    groups = np.unique(sensitive_attr)
    tpr_list, fpr_list = [], []
    for group in groups:
        group_mask = sensitive_attr == group
        tn, fp, fn, tp = confusion_matrix(
            y_true[group_mask],
            y_pred[group_mask],
            labels=[0, 1]  # keep a 2x2 matrix even if a group has only one class
        ).ravel()
        tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
        fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
        tpr_list.append(tpr)
        fpr_list.append(fpr)
    tpr_diff = max(tpr_list) - min(tpr_list)
    fpr_diff = max(fpr_list) - min(fpr_list)
    return {
        'tpr_difference': tpr_diff,
        'fpr_difference': fpr_diff,
        'group_tpr': dict(zip(groups, tpr_list)),
        'group_fpr': dict(zip(groups, fpr_list)),
        'is_fair': tpr_diff < 0.1 and fpr_diff < 0.1
    }

odds = equalized_odds(y_test, y_pred, X_test['race'])
print(f"TPR difference: {odds['tpr_difference']:.3f}")
print(f"FPR difference: {odds['fpr_difference']:.3f}")
3. Equal Opportunity
The true positive rate is equal across groups (focuses on favorable outcomes).
def equal_opportunity(y_true, y_pred, sensitive_attr):
    """Equal TPR across protected groups."""
    groups = np.unique(sensitive_attr)
    tpr_list = []
    for group in groups:
        group_mask = sensitive_attr == group
        tp = ((y_true[group_mask] == 1) & (y_pred[group_mask] == 1)).sum()
        fn = ((y_true[group_mask] == 1) & (y_pred[group_mask] == 0)).sum()
        tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
        tpr_list.append(tpr)
    tpr_diff = max(tpr_list) - min(tpr_list)
    return {
        'tpr_difference': tpr_diff,
        'group_tpr': dict(zip(groups, tpr_list)),
        'is_fair': tpr_diff < 0.1
    }
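Usage follows the same pattern as the other metrics:
opportunity = equal_opportunity(y_test, y_pred, X_test['gender'])
print(f"TPR difference: {opportunity['tpr_difference']:.3f}")
print(f"Group TPRs: {opportunity['group_tpr']}")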
Bias Detection Tools
1. AI Fairness 360 (AIF360)
IBM’s comprehensive bias detection library:
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric, ClassificationMetric
from aif360.algorithms.preprocessing import Reweighing

# Load data
dataset = BinaryLabelDataset(
    df=data,
    label_names=['hired'],
    protected_attribute_names=['gender']
)

# Check dataset bias
metric = BinaryLabelDatasetMetric(
    dataset,
    privileged_groups=[{'gender': 1}],   # Male
    unprivileged_groups=[{'gender': 0}]  # Female
)
print(f"Disparate impact: {metric.disparate_impact():.3f}")
# < 0.8 indicates bias

# Train model
model.fit(dataset.features, dataset.labels.ravel())

# Wrap predictions in a dataset copy so AIF360 can compare true vs predicted labels
test_pred_dataset = test_dataset.copy(deepcopy=True)
test_pred_dataset.labels = model.predict(test_dataset.features).reshape(-1, 1)

# Check model fairness
classified_metric = ClassificationMetric(
    test_dataset,
    test_pred_dataset,
    privileged_groups=[{'gender': 1}],
    unprivileged_groups=[{'gender': 0}]
)
print(f"Equal opportunity difference: {classified_metric.equal_opportunity_difference():.3f}")
print(f"Average odds difference: {classified_metric.average_odds_difference():.3f}")
2. Fairlearn
Microsoft’s fairness toolkit:
from fairlearn.metrics import MetricFrame, selection_rate, false_positive_rate
from fairlearn.reductions import ExponentiatedGradient, DemographicParity
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Evaluate fairness
metric_frame = MetricFrame(
    metrics={
        'accuracy': accuracy_score,
        'selection_rate': selection_rate,
        'false_positive_rate': false_positive_rate
    },
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=X_test['gender']
)
print(metric_frame.by_group)
#          accuracy  selection_rate  false_positive_rate
# gender
# Female       0.82            0.45                 0.12
# Male         0.85            0.60                 0.09
# → Selection rate disparity indicates bias

# Mitigate bias with constraints
mitigator = ExponentiatedGradient(
    estimator=LogisticRegression(),
    constraints=DemographicParity()
)
mitigator.fit(X_train, y_train, sensitive_features=X_train['gender'])
y_pred_mitigated = mitigator.predict(X_test)

# Re-evaluate
metric_frame_mitigated = MetricFrame(
    metrics={'selection_rate': selection_rate},
    y_true=y_test,
    y_pred=y_pred_mitigated,
    sensitive_features=X_test['gender']
)
print(f"Selection rate difference reduced to: {metric_frame_mitigated.difference()['selection_rate']:.3f}")
3. What-If Tool (Google)
Visual bias exploration:
from witwidget.notebook.visualization import WitWidget, WitConfigBuilder

# Configure the What-If Tool
config_builder = WitConfigBuilder(
    test_examples[:500],
    feature_names=feature_names
).set_model_name('hiring_model').set_target_feature('hired')

WitWidget(config_builder, height=800)
# Interactive visualization shows:
# - Datapoint editor
# - Performance & fairness metrics
# - Feature importance
# - Counterfactual analysis
Bias Mitigation Strategies
1. Pre-processing: Dataset Balancing
from imblearn.over_sampling import SMOTE
from aif360.algorithms.preprocessing import Reweighing

# SMOTE for class imbalance
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Reweighing for fairness
reweigher = Reweighing(
    privileged_groups=[{'gender': 1}],
    unprivileged_groups=[{'gender': 0}]
)
dataset_transformed = reweigher.fit_transform(dataset)
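Reweighing does not change labels or features; it attaches per-instance weights that the downstream estimator should consume during training. A minimal sketch, assuming a scikit-learn classifier that accepts sample_weight:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(
    dataset_transformed.features,
    dataset_transformed.labels.ravel(),
    sample_weight=dataset_transformed.instance_weights  # weights produced by Reweighing
)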
2. In-processing: Fairness Constraints
from fairlearn.reductions import GridSearch, EqualizedOdds
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Train with fairness constraints
sweep = GridSearch(
    estimator=LogisticRegression(),
    constraints=EqualizedOdds(),
    grid_size=20
)
sweep.fit(X_train, y_train, sensitive_features=X_train['gender'])

# Select the best model balancing accuracy and fairness
predictors = sweep.predictors_
for idx, predictor in enumerate(predictors):
    accuracy = accuracy_score(y_test, predictor.predict(X_test))
    fairness = equalized_odds(y_test, predictor.predict(X_test), X_test['gender'])
    print(f"Model {idx}: Accuracy={accuracy:.3f}, TPR_diff={fairness['tpr_difference']:.3f}")
3. Post-processing: Threshold Optimization
from fairlearn.postprocessing import ThresholdOptimizer

# Optimize classification thresholds per group
postprocessor = ThresholdOptimizer(
    estimator=trained_model,
    constraints='equalized_odds',
    objective='accuracy_score',
    prefit=True  # the underlying estimator is already trained
)
postprocessor.fit(X_train, y_train, sensitive_features=X_train['gender'])
y_pred_fair = postprocessor.predict(X_test, sensitive_features=X_test['gender'])

# Different thresholds per group to achieve fairness
print(postprocessor.interpolated_thresholder_.interpolation_dict)
# {0: [Threshold(operation='>', threshold=0.45)],  # Female
#  1: [Threshold(operation='>', threshold=0.60)]}  # Male
Testing Workflow
Comprehensive Fairness Test Suite
class FairnessTestSuite:
    def __init__(self, model, sensitive_attributes):
        self.model = model
        self.sensitive_attrs = sensitive_attributes

    def run_all_tests(self, X_test, y_test):
        results = {}
        y_pred = self.model.predict(X_test)
        for attr in self.sensitive_attrs:
            results[attr] = {
                'demographic_parity': demographic_parity(y_pred, X_test[attr]),
                'equalized_odds': equalized_odds(y_test, y_pred, X_test[attr]),
                'equal_opportunity': equal_opportunity(y_test, y_pred, X_test[attr])
            }
        return self.generate_report(results)

    def generate_report(self, results):
        report = []
        for attr, metrics in results.items():
            for metric_name, metric_values in metrics.items():
                if not metric_values.get('is_fair', True):
                    # The first dict entry is always the primary difference value
                    primary_diff = list(metric_values.values())[0]
                    report.append({
                        'attribute': attr,
                        'metric': metric_name,
                        'severity': 'HIGH' if primary_diff > 0.2 else 'MEDIUM',
                        'details': metric_values
                    })
        return report

# Usage
fairness_suite = FairnessTestSuite(
    model=my_model,
    sensitive_attributes=['gender', 'race', 'age_group']
)
fairness_report = fairness_suite.run_all_tests(X_test, y_test)

for issue in fairness_report:
    print(f"⚠️ {issue['severity']}: {issue['metric']} violation for {issue['attribute']}")
    print(f"   Details: {issue['details']}")
Real-World Case Studies
Case 1: COMPAS Recidivism
ProPublica’s investigation revealed racial bias in COMPAS (criminal risk assessment):
- Finding: Black defendants were mislabeled as high-risk at roughly twice the rate of white defendants
- Root cause: Historical arrest data reflected policing bias
- Fairness metric violated: False positive rate parity
- Impact: Influenced sentencing decisions for thousands
Case 2: Amazon Hiring Tool
Amazon scrapped ML recruiting tool that showed gender bias:
- Finding: Penalized resumes containing “women’s” (e.g., “women’s chess club”)
- Root cause: Trained on 10 years of male-dominated applications
- Mitigation attempted: Removed gendered words → still learned proxy features
- Outcome: System discontinued
Case 3: Healthcare Algorithm
Study found algorithm used by US hospitals exhibited racial bias:
- Finding: Black patients needed higher risk scores than white patients to qualify for the same level of care
- Root cause: Used healthcare costs as proxy for health needs (Black patients historically received less care)
- Fairness metric: Equal opportunity violation
- Fix: Replaced cost with actual health metrics
Best Practices
| Practice | Description |
|---|---|
| Test Early | Assess fairness during the data exploration phase |
| Multiple Metrics | No single metric captures all fairness notions |
| Intersectionality | Test combinations (e.g., Black women, not just race or gender); see the sketch below |
| Stakeholder Input | Involve affected communities in defining fairness |
| Document Trade-offs | Acknowledge accuracy-fairness tensions |
| Continuous Monitoring | Bias can drift as data distributions change |
| Red Team Testing | Deliberately probe for discriminatory behavior |
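The intersectionality check in the table can reuse the metrics defined earlier by building a combined group label. A minimal sketch, assuming race and gender columns in X_test:
# Combine two protected attributes into one intersectional label
intersectional = (X_test['race'].astype(str) + ' / ' + X_test['gender'].astype(str)).values

result = demographic_parity(y_pred, intersectional)
print(result['group_positive_rates'])
# Disparities can appear for specific combinations (e.g., a single
# race-gender subgroup) even when each attribute looks fair on its own.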
Conclusion
Bias detection is not a checkbox exercise but an ongoing commitment to ethical AI. As ML systems scale, so does their potential for harm. Rigorous fairness testing, multiple complementary metrics, stakeholder engagement, and transparent trade-off decisions are essential.
The future of ML testing must balance technical performance with societal impact—building systems that are not just accurate, but just.