ML Pipeline Overview

Machine learning systems are fundamentally different from traditional software: instead of following explicitly programmed rules, ML models learn patterns from data. This creates unique testing challenges at every stage of the ML pipeline.

The ML Pipeline

```mermaid
graph LR
    A[Data Collection] --> B[Data Processing]
    B --> C[Feature Engineering]
    C --> D[Model Training]
    D --> E[Model Evaluation]
    E --> F[Model Deployment]
    F --> G[Monitoring]
    G -->|Data Drift| A
```

Each stage requires different testing approaches:

  • Data: Quality, completeness, bias, freshness
  • Features: Correctness, consistency, leakage detection
  • Model: Accuracy, fairness, robustness, interpretability
  • Serving: Latency, throughput, versioning, rollback

Data Quality Testing

Data is the foundation of ML — bad data produces bad models:

| Test | What to Check |
| --- | --- |
| Completeness | Missing values, null rates by feature |
| Consistency | Same entity has same representation across sources |
| Freshness | Data is recent enough for the model's use case |
| Distribution | Feature distributions match expected ranges |
| Duplicates | No unintended duplicate records |
| Labels | Training labels are accurate and consistent |
| Schema | Data matches expected schema (types, ranges) |
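Checks like these are easy to automate before any model sees the data. Below is a minimal sketch, assuming records arrive as a list of dicts; the field names, expected schema, and the 5% null-rate threshold are illustrative assumptions, not a standard.

```python
# Minimal data-quality checks over a batch of records (list of dicts).
# Schema, fields, and thresholds here are illustrative assumptions.

EXPECTED_SCHEMA = {"age": int, "income": float, "country": str}
MAX_NULL_RATE = 0.05  # flag any field with more than 5% missing values

def null_rates(records, fields):
    """Fraction of records where each field is None or absent."""
    n = len(records)
    return {f: sum(r.get(f) is None for r in records) / n for f in fields}

def schema_violations(records, schema):
    """(index, field) pairs where a present value has the wrong type."""
    bad = []
    for i, record in enumerate(records):
        for field, expected_type in schema.items():
            value = record.get(field)
            if value is not None and not isinstance(value, expected_type):
                bad.append((i, field))
    return bad

records = [
    {"age": 34, "income": 52000.0, "country": "DE"},
    {"age": None, "income": 48000.0, "country": "DE"},
    {"age": 29, "income": "n/a", "country": "FR"},  # type violation
]

rates = null_rates(records, EXPECTED_SCHEMA)
violations = schema_violations(records, EXPECTED_SCHEMA)
```

In a pipeline these checks would run as a gating step: the batch is rejected (or quarantined) when a null rate exceeds the threshold or any schema violation appears.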

Feature Engineering Testing

Features transform raw data into model inputs:

  • Feature values are within expected ranges
  • Feature computation is deterministic (same input → same output)
  • No data leakage (features do not contain target information)
  • Feature importance aligns with domain knowledge
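The determinism and leakage checks above can be encoded directly. The sketch below recomputes a feature several times to confirm determinism and probes for the crudest form of leakage, a feature identical to the label; the `debt_ratio` feature is a hypothetical example, not from the source.

```python
def debt_ratio(row):
    """Hypothetical feature: debt-to-income ratio."""
    return row["debt"] / row["income"]

def is_deterministic(feature_fn, row, runs=5):
    """Same input must always yield the same feature value."""
    outputs = {feature_fn(dict(row)) for _ in range(runs)}
    return len(outputs) == 1

def exactly_leaks_label(feature_values, labels):
    """Crude leakage probe: a feature identical to the target is a red flag."""
    return all(f == y for f, y in zip(feature_values, labels))

row = {"debt": 12_000, "income": 48_000}
```

Real leakage detection goes further (e.g. checking feature/target correlation or training-serving timestamp ordering), but an exact-match probe already catches the most embarrassing bugs.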

Model Evaluation Testing

Standard Metrics

| Metric | Use Case | Formula |
| --- | --- | --- |
| Accuracy | Balanced classes | (TP + TN) / Total |
| Precision | When false positives are costly | TP / (TP + FP) |
| Recall | When false negatives are costly | TP / (TP + FN) |
| F1 Score | Balanced precision-recall | 2 · P · R / (P + R) |
| AUC-ROC | Overall discriminative ability | Area under ROC curve |
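The first four formulas compute directly from confusion-matrix counts, which makes them easy to assert on in a test suite (the counts below are made up for illustration):

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

m = classification_metrics(tp=8, fp=2, fn=4, tn=6)
```

In practice you would pull these from a library such as scikit-learn, but having the arithmetic inline makes regression tests on metric values trivial to read.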

Beyond Accuracy

  • Slice-based evaluation: Model performance across data subgroups (by age, geography, device)
  • Edge case testing: Adversarial inputs, out-of-distribution data, boundary conditions
  • Regression testing: New model version is not worse than previous version on any metric
  • Robustness testing: Small input perturbations should not drastically change outputs
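Slice-based evaluation in particular is straightforward to implement: group labeled predictions by subgroup and compute the metric per group. A minimal sketch, slicing by device type as a hypothetical example:

```python
from collections import defaultdict

def accuracy_by_slice(examples):
    """examples: iterable of (slice_value, y_true, y_pred) triples."""
    hits, counts = defaultdict(int), defaultdict(int)
    for group, y_true, y_pred in examples:
        counts[group] += 1
        hits[group] += int(y_true == y_pred)
    return {g: hits[g] / counts[g] for g in counts}

examples = [
    ("mobile", 1, 1), ("mobile", 0, 0), ("mobile", 1, 0),
    ("desktop", 1, 1), ("desktop", 0, 0),
]
per_slice = accuracy_by_slice(examples)
```

Overall accuracy here is 0.8, which hides the weaker mobile slice; asserting a minimum accuracy per slice catches exactly the gaps that an aggregate metric conceals.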

Bias and Fairness Testing

ML models can perpetuate or amplify societal biases:

  • Demographic parity: Positive prediction rates should be similar across groups
  • Equal opportunity: True positive rates should be similar across groups
  • Calibration: Predicted probabilities should be accurate for all groups
  • Disparate impact: Adverse decision rates should not disproportionately affect protected groups

Test for bias across: race, gender, age, disability status, geographic location, socioeconomic status.
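The first two criteria reduce to comparing simple rates across groups. A sketch, assuming binary predictions and labels already grouped by a protected attribute (the groups "A" and "B" and their data are invented for illustration):

```python
def positive_rate(preds):
    """Share of positive predictions (used for demographic parity)."""
    return sum(preds) / len(preds)

def true_positive_rate(y_true, y_pred):
    """TPR among actual positives (used for equal opportunity)."""
    preds_on_positives = [p for y, p in zip(y_true, y_pred) if y == 1]
    return sum(preds_on_positives) / len(preds_on_positives)

def max_gap(values):
    """Largest pairwise difference across groups."""
    return max(values) - min(values)

# Hypothetical groups A and B
preds = {"A": [1, 1, 0, 1], "B": [1, 0, 0, 0]}
parity_gap = max_gap([positive_rate(p) for p in preds.values()])
```

A fairness test then asserts that `parity_gap` (and the analogous TPR gap) stays below a threshold chosen with domain and legal input; there is no universal number.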

Advanced ML Testing

Data Drift Monitoring

Production data changes over time:

  • Feature drift: Input feature distributions shift
  • Concept drift: The relationship between features and target changes
  • Label drift: The distribution of target values changes

Monitoring approach:

  • Statistical tests (Kolmogorov-Smirnov, Population Stability Index)
  • Distribution visualization dashboards
  • Automated alerts when drift exceeds thresholds
  • Triggered retraining pipelines
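As a concrete example, the Population Stability Index mentioned above can be computed by binning both samples on a shared grid and comparing bin frequencies. The epsilon floor and the common rule of thumb that PSI above roughly 0.2 signals meaningful drift are conventions, not laws; tune them for your data.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two numeric samples."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def bin_fractions(sample):
        counts = [0] * bins
        for x in sample:
            i = min(int((x - lo) / width), bins - 1)
            counts[i] += 1
        n = len(sample)
        # small floor so empty bins do not blow up the log term
        return [max(c / n, 1e-4) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [1, 2, 3, 4, 5] * 20
drifted = [x + 3 for x in baseline]
```

Identical distributions yield a PSI of zero, and the shifted sample scores far above the 0.2 rule of thumb, which is the condition a drift alert would fire on.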

Model Serving Testing

ML models in production face infrastructure challenges:

  • Inference latency (P50, P95, P99) under load
  • Throughput (predictions per second)
  • Model versioning and gradual rollout (canary deployment)
  • A/B testing between model versions
  • Fallback to previous model on failure
  • Batch vs. real-time inference pipelines
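Latency SLOs are usually asserted on percentiles of measured inference times. A nearest-rank percentile is enough for a load-test assertion; this is a sketch, and production monitoring systems compute these for you:

```python
import math

def percentile(latencies_ms, p):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(latencies_ms)
    rank = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[rank]

# Uniform 1..100 ms samples, purely for illustration;
# a real test would use timings captured under load.
samples = list(range(1, 101))
```

A load test then asserts, for example, `percentile(measured, 99) <= budget_ms` so a latency regression fails CI instead of surfacing in production.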

ML Security Testing

  • Adversarial attacks: inputs crafted to fool the model
  • Model extraction: preventing unauthorized copying of model behavior
  • Data poisoning: detecting tampered training data
  • Privacy: model does not memorize and leak training data (membership inference)

Hands-On Exercise

Design a test plan for a credit scoring ML model:

  1. Data quality: Verify training data completeness, check for historical bias
  2. Model accuracy: Evaluate precision, recall, and AUC on holdout test set
  3. Bias testing: Verify fair outcomes across age groups, genders, and zip codes
  4. Robustness: Test with edge cases (zero income, extremely high credit limit)
  5. Monitoring: Define drift detection metrics and retraining triggers

Solution Guide

Bias tests:

  • Calculate approval rates by gender: difference should be < 5%
  • Calculate approval rates by age group: no group should have > 2x rejection rate
  • Verify model explanation (SHAP values) does not rely on protected attributes
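The first two checks translate into direct assertions over approval decisions grouped by attribute. The group labels and decision data below are invented for illustration:

```python
def approval_rate(decisions):
    """Share of approved (1) decisions."""
    return sum(decisions) / len(decisions)

def parity_within(by_group, max_gap=0.05):
    """Gender check: approval-rate difference below 5 percentage points."""
    rates = [approval_rate(d) for d in by_group.values()]
    return (max(rates) - min(rates)) <= max_gap

def rejection_ratio_ok(by_group, max_ratio=2.0):
    """Age check: no group's rejection rate exceeds 2x another's."""
    rejection = [1 - approval_rate(d) for d in by_group.values()]
    return max(rejection) <= max_ratio * min(rejection)

by_gender = {"F": [1, 1, 0, 1, 1], "M": [1, 1, 0, 1, 1]}
by_age = {"18-30": [1, 0, 1, 1], "31-50": [1, 1, 0, 1], "51+": [1, 1, 1, 0]}
```

The SHAP check is harder to automate fully, but a cheap version asserts that no protected attribute appears among the top-k features by mean absolute SHAP value.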

Robustness tests:

  • Income = $0: model should handle gracefully, not crash
  • Credit utilization = 100%: should produce reasonable (likely low) score
  • All features at boundary values: model should not produce extreme outlier scores
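These edge cases become ordinary unit tests against the scoring function. The `credit_score` below is a deliberately simple stand-in, not a real model, included only so the tests have something to call; in practice you would load the trained model here.

```python
def credit_score(applicant):
    """Stand-in scorer for illustration; a real model would be loaded here."""
    base = 600.0
    base += min(applicant["income"], 200_000) / 1_000  # capped income contribution
    base -= applicant["utilization"] * 200             # high utilization lowers score
    return max(300.0, min(850.0, base))                # clamp to the valid score range

zero_income = credit_score({"income": 0, "utilization": 0.5})
maxed_out = credit_score({"income": 50_000, "utilization": 1.0})
```

The assertions to make are exactly the ones in the list above: no crash on zero income, scores stay inside the valid range at boundary values, and 100% utilization produces a lower score than 0%.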

Pro Tips

  1. Test data before testing models — most ML bugs are actually data bugs
  2. Monitor production model performance continuously — accuracy degrades silently without monitoring
  3. Always test for bias with real demographic data — synthetic data may not reveal real-world biases
  4. Version everything — data, features, models, and configurations must be traceable and reproducible
  5. Compare new models against baselines — a simpler model that performs nearly as well may be preferable

Key Takeaways

  1. ML testing requires testing the entire pipeline: data, features, model, serving, and monitoring
  2. Model accuracy alone is insufficient — fairness, robustness, and interpretability matter equally
  3. Data drift is the silent killer of ML models — continuous monitoring is essential
  4. ML bias testing is not optional — it has legal, ethical, and business implications