Introduction
Testing AI and machine learning-based systems is new territory for most QA teams. Traditional approaches built around deterministic inputs and predictable outputs break down here: when the output depends on probabilities rather than explicit rules, how do you decide whether the system works correctly?
According to Gartner, by 2025 around 75% of enterprise applications will include ML components, which means every QA engineer will sooner or later face the task of testing AI/ML systems. In this article, we’ll explore the fundamental differences between ML testing and traditional software testing, along with practical approaches to quality assurance. (Related topics are covered in AI-Assisted Bug Triaging: Intelligent Defect Prioritization at Scale, AI Code Smell Detection: Finding Problems in Test Automation with ML, and AI-powered Test Generation: The Future Is Already Here.)
Why ML Testing Differs from Traditional Testing
Non-determinism
Traditional software:
def calculate_discount(price, code):
    if code == "SAVE20":
        return price * 0.8
    return price

# Test: always predictable result
assert calculate_discount(100, "SAVE20") == 80
ML system:
def predict_customer_churn(customer_data):
    # ML model returns probability
    prediction = model.predict(customer_data)
    return prediction  # 0.73 or 0.68 or 0.81?

# Test: result varies!
# How to test probability?
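One practical answer, before we go deeper: don’t assert on a single prediction, assert on properties of the output and on aggregate metrics over a fixed, versioned evaluation set. A minimal sketch (the model handle, the data-loading helper, and the 0.85 floor are assumptions for illustration):
import numpy as np

def test_churn_model_output_properties():
    # Fixed, versioned evaluation set (hypothetical helper)
    X_eval, y_eval = load_labeled_eval_set("churn_eval_v3.csv")

    probs = model.predict_proba(X_eval)[:, 1]

    # Property 1: outputs are valid probabilities
    assert np.all((probs >= 0) & (probs <= 1))

    # Property 2: aggregate quality stays above the agreed floor
    accuracy = ((probs >= 0.5).astype(int) == y_eval).mean()
    assert accuracy >= 0.85, f"Accuracy {accuracy:.3f} fell below the agreed 0.85 floor"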
Data Dependency
In ML systems, data = code. Changing the training dataset can radically change model behavior, even if the code itself hasn’t changed.
Problems:
- Data drift: production data differs from training data
- Label quality: labeling errors lead to wrong predictions
- Data bias: model reproduces data biases
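Because the dataset is effectively part of the code, it helps to treat it that way: version it and fail fast when it changes silently. A minimal sketch using a content hash (the file path and recorded hash are placeholders):
import hashlib

def dataset_fingerprint(path):
    """Content hash of a data file, recorded alongside the model version."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

EXPECTED_HASH = "3f4a..."  # recorded when the current model was trained

def test_training_data_unchanged():
    assert dataset_fingerprint("data/train_2025q1.csv") == EXPECTED_HASH, \
        "Training data changed - revalidate and retrain before deploying"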
No Explicit Business Logic
Traditional code contains explicit rules:
if age < 18:
    return "Access denied"
ML model is a black box:
# Where's the logic? In neural network weights!
prediction = neural_network.forward(input_data)
How to test what you can’t read?
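The short answer, expanded in the invariance-testing section below: you don’t read the logic, you pin down the behavior you require with property-style checks. A tiny example (the model handle and input format are illustrative):
def test_minors_are_always_denied():
    """We can't inspect the weights, but we can assert on behavior we require."""
    minor = {'age': 16, 'country': 'US', 'purchase_history': []}
    assert access_model.predict(minor) == 'denied'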
Quality Metrics Aren’t Binary
- Traditional test: PASS ✅ or FAIL ❌
- ML test: Accuracy 94.3%, Precision 0.87, Recall 0.91, F1 0.89
What threshold is acceptable? This is a business decision, not a technical question.
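One pragmatic pattern is to take that business decision out of the test code entirely: keep the thresholds in a reviewed config file that product owners sign off on, and have the tests read it. A sketch (the file name and keys are illustrative):
import yaml
from sklearn.metrics import precision_score, recall_score

def test_model_meets_agreed_quality_gates():
    # Thresholds live in version control and are owned by the business side
    with open("quality_gates.yml") as f:
        gates = yaml.safe_load(f)  # e.g. {"precision_min": 0.85, "recall_min": 0.90}

    y_pred = model.predict(X_test)
    assert precision_score(y_test, y_pred) >= gates["precision_min"]
    assert recall_score(y_test, y_pred) >= gates["recall_min"]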
Data Validation: Foundation of ML Quality
Why Data Quality Is Critical
ML Rule: Garbage in = Garbage out × 100
Bad data in traditional software can lead to an error or crash. In ML, it leads to a model that systematically makes bad decisions.
Pipeline Data Validation
Data validation stages:
# 1. Schema validation
from great_expectations import DataContext
context = DataContext()
suite = context.create_expectation_suite("ml_data_validation")
# `batch` below is a validation batch of the training data, obtained from this context

# Check data structure
batch.expect_column_to_exist("customer_age")
batch.expect_column_values_to_be_between("customer_age", min_value=18, max_value=120)
batch.expect_column_values_to_be_in_set("country", ["US", "UK", "CA", "AU"])
# 2. Data distribution checks
batch.expect_column_mean_to_be_between("purchase_amount", min_value=50, max_value=500)
batch.expect_column_stdev_to_be_between("purchase_amount", min_value=10, max_value=200)
# 3. Referential integrity
batch.expect_column_values_to_not_be_null("user_id")
batch.expect_compound_columns_to_be_unique(["user_id", "transaction_id"])
Data Drift Detection
Problem: Training data from 2023 may not reflect 2025 reality.
Solution: Continuous monitoring of data distribution
from evidently.dashboard import Dashboard
from evidently.tabs import DataDriftTab
# Compare production data with training dataset
dashboard = Dashboard(tabs=[DataDriftTab()])
dashboard.calculate(reference_data=train_df, current_data=production_df)
# Drift metrics:
# - Wasserstein distance for numerical features
# - Population Stability Index (PSI)
# - Jensen-Shannon divergence for categorical
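For reference, the Population Stability Index is simple enough to compute by hand when a full dashboard is overkill. A minimal sketch with numpy (the 10-bin split and the 0.2 alert threshold are common conventions, not hard rules):
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time (expected) and production (actual) feature sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    expected_pct = np.histogram(expected, edges)[0] / len(expected)
    actual_pct = np.histogram(actual, edges)[0] / len(actual)
    expected_pct = np.clip(expected_pct, 1e-6, None)  # avoid log(0)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))

psi = population_stability_index(train_df["purchase_amount"], production_df["purchase_amount"])
if psi > 0.2:  # rule of thumb: > 0.2 means a significant shift
    print(f"⚠️ purchase_amount drifted, PSI = {psi:.2f}")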
Drift detection example:
Feature Drift Report:
  customer_age:
    drift_detected: false
    drift_score: 0.03
  purchase_amount:
    drift_detected: true ⚠️
    drift_score: 0.47
    reason: "Mean shifted from $150 to $89"
    action: "Retrain model or investigate business change"
  device_type:
    drift_detected: true ⚠️
    drift_score: 0.62
    reason: "Mobile traffic increased from 40% to 78%"
Actions when drift detected:
- Investigate cause (business changes vs data quality issue)
- Retrain model on new data
- A/B test new model vs old
- Gradual rollout if the test is successful
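These actions are easier to enforce when the drift check itself is a scheduled job that fails loudly. A sketch of such a gate, reusing the PSI helper above (the alerting and ticketing hooks are hypothetical):
def drift_gate(reference_df, production_df, threshold=0.2):
    """Runs on a schedule; alerts and opens a retraining ticket when drift crosses the threshold."""
    drifted = []
    for column in reference_df.select_dtypes("number").columns:
        score = population_stability_index(reference_df[column], production_df[column])
        if score > threshold:
            drifted.append((column, round(score, 2)))

    if drifted:
        alert_team(f"Data drift detected: {drifted}")          # hypothetical alerting hook
        open_retraining_ticket([name for name, _ in drifted])  # hypothetical ticketing hook
    return drifted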
Label Quality Validation
Problem: If 10% of labels are wrong, the model will learn those errors.
Validation strategies:
1. Cross-validation of labeling:
# Multiple annotators label same data
from sklearn.metrics import cohen_kappa_score
annotator1_labels = [1, 0, 1, 1, 0, 1]
annotator2_labels = [1, 0, 1, 0, 0, 1]
# Kappa > 0.8 = good agreement
kappa = cohen_kappa_score(annotator1_labels, annotator2_labels)
if kappa < 0.7:
    print("⚠️ Annotators disagree! Guidelines need clarification")
2. Outlier detection in labels:
# Find suspicious labels
from cleanlab.filter import find_label_issues

# Model predicts probabilities for each class
predicted_probs = model.predict_proba(X_train)

# Cleanlab finds likely mislabeled examples
label_issues = find_label_issues(
    labels=y_train,
    pred_probs=predicted_probs,
    return_indices_ranked_by='self_confidence'
)

print(f"Found {len(label_issues)} potentially mislabeled examples")
# Manually review the top 100 and fix them
3. Active learning for quality improvement:
# Model requests labels for uncertain examples
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling
from sklearn.ensemble import RandomForestClassifier

learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    query_strategy=uncertainty_sampling,
    X_training=X_seed, y_training=y_seed  # small initial labeled set to bootstrap the learner
)

# Model selects the 100 most uncertain examples
query_idx, query_instance = learner.query(X_unlabeled, n_instances=100)

# Human labels only these 100 instead of all data
# Labeling efficiency can increase 5-10x
Model Testing and Validation
Unit Testing for ML Models
Yes, unit tests are possible even for ML!
Test data preprocessing:
import pandas as pd
import pytest

def test_feature_engineering():
    # Given
    raw_data = pd.DataFrame({
        'date': ['2025-01-01', '2025-01-02'],
        'amount': [100, 200]
    })

    # When
    features = preprocess_features(raw_data)

    # Then
    assert 'day_of_week' in features.columns
    assert 'amount_log' in features.columns
    assert features['amount_log'].iloc[0] == pytest.approx(4.605, 0.01)
    assert features['day_of_week'].iloc[0] == 2  # 2025-01-01 is a Wednesday (Monday=0)
Test model inference logic:
import numpy as np

def test_model_prediction_shape():
    # Given
    model = load_model('churn_predictor_v2.pkl')
    test_input = np.random.rand(10, 20)  # 10 samples, 20 features

    # When
    predictions = model.predict(test_input)

    # Then
    assert predictions.shape == (10,)  # One prediction per sample
    assert np.all((predictions >= 0) & (predictions <= 1))  # Valid probabilities
Test model behavior on edge cases:
def test_model_handles_missing_values():
    # Given
    model = load_model('recommender.pkl')
    input_with_nan = pd.DataFrame({
        'age': [25, np.nan, 30],
        'income': [50000, 60000, np.nan]
    })

    # When/Then
    # Model shouldn't crash on NaN
    predictions = model.predict(input_with_nan)
    assert len(predictions) == 3
    assert not np.any(np.isnan(predictions))
Integration Testing ML Pipeline
End-to-end ML pipeline test:
@pytest.mark.integration
def test_ml_pipeline_end_to_end():
    # 1. Load data
    raw_data = load_test_dataset('test_data.csv')

    # 2. Preprocess
    preprocessed = preprocessing_pipeline.transform(raw_data)
    assert preprocessed.shape[1] == 50  # Expected number of features

    # 3. Feature engineering
    features = feature_engineering_pipeline.transform(preprocessed)
    assert 'feature_interaction_1' in features.columns

    # 4. Model prediction
    predictions = model.predict(features)
    assert len(predictions) == len(raw_data)

    # 5. Postprocessing
    final_output = postprocess_predictions(predictions)
    assert final_output['confidence'].min() >= 0
    assert final_output['confidence'].max() <= 1
Model Performance Testing
Metrics for different task types:
Classification:
from sklearn.metrics import classification_report, roc_auc_score

def test_model_classification_performance():
    y_true = test_labels
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]

    # Accuracy should exceed baseline
    accuracy = (y_pred == y_true).mean()
    assert accuracy > 0.85, f"Accuracy {accuracy} below threshold"

    # AUC-ROC for ranking quality assessment
    auc = roc_auc_score(y_true, y_proba)
    assert auc > 0.90, f"AUC {auc} below threshold"

    # Check precision/recall for each class
    report = classification_report(y_true, y_pred, output_dict=True)

    # "Fraud" class is critical - high recall required
    assert report['fraud']['recall'] > 0.95, "Missing too many fraud cases!"

    # False positives are expensive - good precision needed
    assert report['fraud']['precision'] > 0.80, "Too many false fraud alerts!"
Regression:
from sklearn.metrics import mean_absolute_error, r2_score

def test_model_regression_performance():
    y_true = test_target
    y_pred = model.predict(X_test)

    # MAE within acceptable limits
    mae = mean_absolute_error(y_true, y_pred)
    assert mae < 50, f"MAE {mae} too high (avg error ${mae})"

    # R² shows model's explanatory power
    r2 = r2_score(y_true, y_pred)
    assert r2 > 0.85, f"R² {r2} - model explains too little variance"

    # MAPE for relative error
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    assert mape < 10, f"MAPE {mape}% - predictions off by {mape}% on average"
Invariance Testing
Problem: Model should be robust to minor input changes.
Invariance test examples:
def test_invariance_to_feature_order():
    """Column permutation shouldn't affect result"""
    original_pred = model.predict(X_test)

    # Shuffle column order
    shuffled_columns = X_test.sample(frac=1, axis=1)
    shuffled_pred = model.predict(shuffled_columns)

    np.testing.assert_array_almost_equal(original_pred, shuffled_pred)
def test_invariance_to_text_case():
    """NLP model shouldn't change predictions based on case"""
    texts = ["This is SPAM!", "this is spam!", "THIS IS SPAM!"]
    predictions = [spam_classifier.predict(t) for t in texts]

    # All three variants should give same result
    assert len(set(predictions)) == 1
def test_directional_expectation():
    """Increasing income should reduce churn probability"""
    base_customer = pd.DataFrame({
        'age': [30], 'income': [50000], 'tenure': [12]
    })
    base_churn_prob = model.predict_proba(base_customer)[0][1]

    # Double the income
    rich_customer = base_customer.copy()
    rich_customer['income'] = 100000
    rich_churn_prob = model.predict_proba(rich_customer)[0][1]

    # Churn probability should decrease
    assert rich_churn_prob < base_churn_prob, \
        "Higher income should reduce churn probability"
Bias Detection: Ethics and Fairness
Why Bias Is a Critical Problem
Real examples of bias in ML:
- Amazon hiring ML (2018): The model discriminated against women after being trained on ten years of resumes that came predominantly from men
- COMPAS (Criminal justice): Recidivism prediction model showed racial bias
- Apple Card (2019): The algorithm was reported to give some women credit limits up to 20x lower than their husbands’, despite shared finances and similar credit profiles
Consequences:
- Legal risks (discrimination against protected groups is prohibited by law)
- Reputational damage
- Ethical problems
- Reinforcing social inequality
Types of Bias
1. Data bias:
- Training data not representative of all groups
- Historical bias (model learns from past discrimination)
- Sampling bias (some groups underrepresented)
2. Model bias:
- Feature engineering amplifies bias
- Proxy features (ZIP code correlates with race; see the check sketched after this list)
- Optimization metric doesn’t account for fairness
3. Deployment bias:
- Model used for unintended purposes
- Feedback loops amplify bias
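The proxy-feature risk mentioned above can be screened for before training: check how well each candidate feature predicts the protected attribute on its own. A minimal sketch (the feature list, encoding, and the 0.75 AUC cutoff are illustrative assumptions):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def test_no_strong_proxies_for_gender():
    """If a single feature predicts gender far better than chance, treat it as a proxy."""
    protected = data["gender"]
    for feature in ["zip_code", "job_title_encoded", "browsing_category"]:
        auc = cross_val_score(
            RandomForestClassifier(n_estimators=50),
            data[[feature]], protected,
            cv=3, scoring="roc_auc"
        ).mean()
        assert auc < 0.75, f"'{feature}' looks like a proxy for gender (AUC = {auc:.2f})"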
Detecting Bias
Fairness metrics:
from aif360.datasets import StandardDataset
from aif360.metrics import BinaryLabelDatasetMetric, ClassificationMetric

# Prepare data with protected attribute
# (StandardDataset encodes the privileged value as 1, others as 0)
dataset = StandardDataset(
    df=data,
    label_name='approved',
    favorable_classes=[1],
    protected_attribute_names=['gender'],
    privileged_classes=[['male']]
)

privileged_groups = [{'gender': 1}]    # male after encoding
unprivileged_groups = [{'gender': 0}]  # female after encoding

# Metric 1: Statistical Parity Difference
# Should be close to 0 (equal probability of positive outcome)
metric = BinaryLabelDatasetMetric(
    dataset,
    privileged_groups=privileged_groups,
    unprivileged_groups=unprivileged_groups
)
spd = metric.statistical_parity_difference()
assert abs(spd) < 0.1, f"Statistical parity violation: {spd}"
# Strongly negative values mean the privileged group gets positive outcomes more often

# Metric 2: Equal Opportunity Difference
# True positive rate should be equal for all groups
dataset_pred = dataset.copy(deepcopy=True)
dataset_pred.labels = model.predict(dataset.features).reshape(-1, 1)

metric = ClassificationMetric(
    dataset,
    dataset_pred,
    privileged_groups=privileged_groups,
    unprivileged_groups=unprivileged_groups
)
eod = metric.equal_opportunity_difference()
assert abs(eod) < 0.1, f"Equal opportunity violation: {eod}"

# Metric 3: Disparate Impact
# Ratio of positive outcomes between groups
di = metric.disparate_impact()
assert 0.8 <= di <= 1.25, f"Disparate impact: {di} (legal threshold)"
# By 4/5 rule: ratio < 0.8 = likely discrimination
Intersectional fairness:
# Check fairness for group intersections
overall_rate = (model.predict(data) == 1).mean()

for gender in ['male', 'female']:
    for race in ['white', 'black', 'asian', 'hispanic']:
        for age_group in ['<30', '30-50', '>50']:
            subset = data[(data.gender == gender) &
                          (data.race == race) &
                          (data.age_group == age_group)]

            if len(subset) < 30:
                continue  # Insufficient data

            approval_rate = (model.predict(subset) == 1).mean()

            # Check that approval rate doesn't deviate strongly from overall
            if abs(approval_rate - overall_rate) > 0.15:
                print(f"⚠️ Bias detected for {gender}/{race}/{age_group}")
                print(f"   Approval rate: {approval_rate:.2%} vs overall {overall_rate:.2%}")
Mitigating Bias
Approaches to reducing bias:
1. Pre-processing (fix data):
from aif360.algorithms.preprocessing import Reweighing

# Reweighing: assign weights to examples to balance groups
# (0/1 values are the encoded protected attribute produced by StandardDataset)
rw = Reweighing(
    unprivileged_groups=[{'gender': 0}],
    privileged_groups=[{'gender': 1}]
)
dataset_transformed = rw.fit_transform(dataset)

# Now train model on reweighted data
model.fit(dataset_transformed.features,
          dataset_transformed.labels.ravel(),
          sample_weight=dataset_transformed.instance_weights)
2. In-processing (fair model training):
from aif360.algorithms.inprocessing import PrejudiceRemover

# Model optimizes both accuracy and fairness simultaneously
fair_model = PrejudiceRemover(
    sensitive_attr='gender',
    eta=1.0  # Trade-off between accuracy and fairness
)
# PrejudiceRemover trains on an aif360 dataset rather than raw X/y arrays
fair_model.fit(dataset)
3. Post-processing (adjust predictions):
from aif360.algorithms.postprocessing import EqOddsPostprocessing

# Adjust decision thresholds for different groups
eop = EqOddsPostprocessing(
    unprivileged_groups=[{'gender': 0}],
    privileged_groups=[{'gender': 1}]
)

# Fit on the validation set: true labels vs the model's predicted labels
eop.fit(val_dataset, val_dataset_pred)

# Apply to test predictions
fair_predictions = eop.predict(test_dataset_pred)
Testing fairness in CI/CD:
@pytest.mark.fairness
def test_model_fairness():
    """Fail build if model shows bias"""
    # Load protected test set
    test_data = load_fairness_test_set()

    for protected_attr in ['gender', 'race', 'age_group']:
        # Compute fairness metrics
        metrics = compute_fairness_metrics(
            model=model,
            data=test_data,
            protected_attribute=protected_attr
        )

        # Assert fairness thresholds
        assert abs(metrics['statistical_parity']) < 0.1, \
            f"Statistical parity violation for {protected_attr}"
        assert metrics['disparate_impact'] > 0.8, \
            f"Disparate impact violation for {protected_attr}"
        assert abs(metrics['equal_opportunity']) < 0.1, \
            f"Equal opportunity violation for {protected_attr}"
A/B Testing for ML Models
Why You Can’t Just Deploy a New Model
Problems with offline evaluation:
- Test set may not reflect production distribution
- Offline metrics don’t always correlate with business metrics
- Model may have unexpected edge case behavior
The only way to know the true quality is to test in production on real users.
Designing ML A/B Tests
Basic architecture:
import hashlib

class MLABTestFramework:
    def __init__(self, control_model, treatment_model):
        self.control = control_model
        self.treatment = treatment_model
        self.assignment_cache = {}

    def get_prediction(self, user_id, features):
        # Consistent assignment: one user always in one group
        if user_id not in self.assignment_cache:
            self.assignment_cache[user_id] = self._assign_variant(user_id)

        variant = self.assignment_cache[user_id]
        if variant == 'control':
            prediction = self.control.predict(features)
            self._log_prediction('control', user_id, prediction)
        else:
            prediction = self.treatment.predict(features)
            self._log_prediction('treatment', user_id, prediction)
        return prediction

    def _assign_variant(self, user_id):
        # Hash-based assignment for consistency
        hash_val = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
        return 'treatment' if hash_val % 100 < 50 else 'control'
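Usage is then a thin wrapper around the normal prediction call (the model file names and the feature vector are illustrative):
ab_test = MLABTestFramework(
    control_model=load_model('churn_predictor_v2.pkl'),
    treatment_model=load_model('churn_predictor_v3.pkl')
)

# The same user always hits the same variant, so downstream metrics stay comparable
prediction = ab_test.get_prediction(user_id=42, features=user_features)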
Metrics for ML A/B Tests
Multi-level metrics:
1. Model metrics (sanity checks):
# Verify treatment model works as expected
control_accuracy = evaluate_model(control_predictions, labels)
treatment_accuracy = evaluate_model(treatment_predictions, labels)

assert treatment_accuracy >= control_accuracy * 0.95, \
    "Treatment model significantly worse - stop test!"
2. User engagement metrics:
# Recommendation model example
metrics = {
    'control': {
        'click_through_rate': 0.12,
        'time_on_site': 8.5,  # minutes
        'items_viewed': 4.2
    },
    'treatment': {
        'click_through_rate': 0.14,  # +16.7% 🎉
        'time_on_site': 9.1,         # +7.1%
        'items_viewed': 4.8          # +14.3%
    }
}

# Statistical significance test
from scipy.stats import ttest_ind

control_ctr = get_user_ctr_data('control')
treatment_ctr = get_user_ctr_data('treatment')

t_stat, p_value = ttest_ind(treatment_ctr, control_ctr)

if p_value < 0.05 and treatment_ctr.mean() > control_ctr.mean():
    print("✅ Treatment shows statistically significant improvement!")
3. Business metrics (north star):
# Ultimate goal - revenue/conversions
business_metrics = {
'control': {
'revenue_per_user': 45.30,
'conversion_rate': 0.032,
'ltv_30d': 120.50
},
'treatment': {
'revenue_per_user': 48.20, # +6.4%
'conversion_rate': 0.035, # +9.4%
'ltv_30d': 125.80 # +4.4%
}
}
# Economic significance
users_per_month = 100000
revenue_lift = (48.20 - 45.30) * users_per_month
# = $290,000/month additional revenue!
Guardrail Metrics
Problem: Treatment may improve some metrics but worsen others.
guardrail_metrics = {
    'latency_p99': {
        'control': 250,    # ms
        'treatment': 280,  # ms - acceptable?
        'threshold': 300,
        'status': 'PASS'
    },
    'error_rate': {
        'control': 0.001,
        'treatment': 0.0015,
        'threshold': 0.002,
        'status': 'WARNING'  # Requires investigation
    },
    'user_complaints': {
        'control': 12,    # per week
        'treatment': 45,  # 🚨 Alarming spike!
        'threshold': 20,
        'status': 'FAIL'
    }
}

# Automatic kill switch
if guardrail_metrics['user_complaints']['status'] == 'FAIL':
    rollback_experiment('ml_model_v2')
    alert_team('Treatment model causing user complaints spike!')
Continuous Model Monitoring
Why Monitoring Is Critical
ML models “decay” over time:
- Data drift: world changes, model becomes outdated
- Concept drift: relationships between features and target change
- Upstream changes: API changes break feature generation
Without monitoring, you’ll learn about problems when users complain (too late).
Key Monitoring Metrics
1. Model performance metrics:
# Daily monitoring
daily_metrics = {
'date': '2025-10-01',
'predictions_count': 1.2M,
'avg_confidence': 0.78, # Dropped from 0.85 - warning sign
# Ground truth metrics (when labels available)
'accuracy': 0.89, # Was 0.94 a week ago
'precision': 0.85,
'recall': 0.91,
# Alerts
'alerts': [
'Accuracy dropped 5% in last 7 days',
'Average confidence declining trend'
]
}
2. Data quality metrics:
data_quality_dashboard = {
    'missing_values': {
        'age': 0.02,     # OK
        'income': 0.15,  # 🚨 Increased from 0.03
    },
    'out_of_range_values': {
        'age': 3,  # 3 cases of age > 120
    },
    'new_categorical_values': {
        'country': ['XX'],  # Unknown country code
    }
}
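A sketch of how these numbers can feed an automated daily check rather than a dashboard someone has to remember to open (it assumes each day’s record merges the performance and data-quality dicts shown above; the thresholds and alerting hook are assumptions):
def daily_model_health_check(metrics_history):
    """Compare today's metrics against a 7-day baseline and alert on degradation."""
    today = metrics_history[-1]
    baseline_accuracy = sum(day['accuracy'] for day in metrics_history[-8:-1]) / 7

    alerts = []
    if today['accuracy'] < baseline_accuracy - 0.03:
        alerts.append(f"Accuracy dropped from {baseline_accuracy:.2f} to {today['accuracy']:.2f}")
    if today['avg_confidence'] < 0.70:
        alerts.append(f"Average confidence {today['avg_confidence']:.2f} is below 0.70")
    if today['missing_values'].get('income', 0) > 0.10:
        alerts.append("Share of missing 'income' values is above 10%")

    if alerts:
        alert_team(alerts)  # hypothetical alerting hook (Slack, PagerDuty, etc.)
    return alerts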
Conclusion
Testing AI/ML systems requires a fundamentally new approach:
Key takeaways:
✅ Data quality is 80% of ML system success. Data validation is critical at all stages.
✅ Bias detection is not an optional feature, but a mandatory requirement for production ML.
✅ A/B testing is the only way to truly validate a model in production.
✅ Continuous monitoring — ML models require constant observation, not set-and-forget.
Practical recommendations:
- Automate data validation in every pipeline
- Test fairness as part of CI/CD
- Don’t trust offline metrics — test in production
- Monitor models 24/7 — they degrade over time
- Document everything — what data, what assumptions, what limitations
ML testing is not just a new skill, it’s a new discipline. QA engineers who master it now will be in demand for the next decade.