ML Pipeline Overview

Machine learning systems are fundamentally different from traditional software: instead of following explicitly programmed rules, ML models learn patterns from data. This creates unique testing challenges at every stage of the ML pipeline.

The ML Pipeline

```mermaid
graph LR
    A[Data Collection] --> B[Data Processing]
    B --> C[Feature Engineering]
    C --> D[Model Training]
    D --> E[Model Evaluation]
    E --> F[Model Deployment]
    F --> G[Monitoring]
    G -->|Data Drift| A
```

Each stage requires different testing approaches:

  • Data: Quality, completeness, bias, freshness
  • Features: Correctness, consistency, leakage detection
  • Model: Accuracy, fairness, robustness, interpretability
  • Serving: Latency, throughput, versioning, rollback

Data Quality Testing

Data is the foundation of ML — bad data produces bad models:

| Test | What to Check |
| --- | --- |
| Completeness | Missing values, null rates by feature |
| Consistency | Same entity has same representation across sources |
| Freshness | Data is recent enough for the model's use case |
| Distribution | Feature distributions match expected ranges |
| Duplicates | No unintended duplicate records |
| Labels | Training labels are accurate and consistent |
| Schema | Data matches expected schema (types, ranges) |
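Checks like these are easy to automate before any model sees the data. Below is a minimal sketch, assuming records arrive as a list of dicts; the field names, expected schema, and the 5% null-rate threshold are illustrative assumptions, not a standard.

```python
# Minimal data-quality checks over a batch of records (list of dicts).
# Schema, fields, and thresholds here are illustrative assumptions.

EXPECTED_SCHEMA = {"age": int, "income": float, "country": str}
MAX_NULL_RATE = 0.05  # flag any field with more than 5% missing values

def null_rates(records, fields):
    """Fraction of records where each field is None or absent."""
    n = len(records)
    return {f: sum(r.get(f) is None for r in records) / n for f in fields}

def schema_violations(records, schema):
    """(index, field) pairs where a present value has the wrong type."""
    bad = []
    for i, record in enumerate(records):
        for field, expected_type in schema.items():
            value = record.get(field)
            if value is not None and not isinstance(value, expected_type):
                bad.append((i, field))
    return bad

records = [
    {"age": 34, "income": 52000.0, "country": "DE"},
    {"age": None, "income": 48000.0, "country": "DE"},
    {"age": 29, "income": "n/a", "country": "FR"},  # type violation
]

rates = null_rates(records, EXPECTED_SCHEMA)
violations = schema_violations(records, EXPECTED_SCHEMA)
```

In a pipeline these checks would run as a gating step: the batch is rejected (or quarantined) when a null rate exceeds the threshold or any schema violation appears.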

Feature Engineering Testing

Features transform raw data into model inputs:

  • Feature values are within expected ranges
  • Feature computation is deterministic (same input → same output)
  • No data leakage (features do not contain target information)
  • Feature importance aligns with domain knowledge
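The determinism and leakage checks above can be encoded directly. The sketch below recomputes a feature several times to confirm determinism and probes for the crudest form of leakage, a feature identical to the label; the `debt_ratio` feature is a hypothetical example, not from the source.

```python
def debt_ratio(row):
    """Hypothetical feature: debt-to-income ratio."""
    return row["debt"] / row["income"]

def is_deterministic(feature_fn, row, runs=5):
    """Same input must always yield the same feature value."""
    outputs = {feature_fn(dict(row)) for _ in range(runs)}
    return len(outputs) == 1

def exactly_leaks_label(feature_values, labels):
    """Crude leakage probe: a feature identical to the target is a red flag."""
    return all(f == y for f, y in zip(feature_values, labels))

row = {"debt": 12_000, "income": 48_000}
```

Real leakage detection goes further (e.g. checking feature/target correlation or training-serving timestamp ordering), but an exact-match probe already catches the most embarrassing bugs.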

Model Evaluation Testing

Standard Metrics

| Metric | Use Case | Formula |
| --- | --- | --- |
| Accuracy | Balanced classes | (TP + TN) / Total |
| Precision | When false positives are costly | TP / (TP + FP) |
| Recall | When false negatives are costly | TP / (TP + FN) |
| F1 Score | Balanced precision-recall | 2 · P · R / (P + R) |
| AUC-ROC | Overall discriminative ability | Area under ROC curve |
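The first four formulas compute directly from confusion-matrix counts, which makes them easy to assert on in a test suite (the counts below are made up for illustration):

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

m = classification_metrics(tp=8, fp=2, fn=4, tn=6)
```

In practice you would pull these from a library such as scikit-learn, but having the arithmetic inline makes regression tests on metric values trivial to read.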

Beyond Accuracy

  • Slice-based evaluation: Model performance across data subgroups (by age, geography, device)
  • Edge case testing: Adversarial inputs, out-of-distribution data, boundary conditions
  • Regression testing: New model version is not worse than previous version on any metric
  • Robustness testing: Small input perturbations should not drastically change outputs
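Slice-based evaluation in particular is straightforward to implement: group labeled predictions by subgroup and compute the metric per group. A minimal sketch, slicing by device type as a hypothetical example:

```python
from collections import defaultdict

def accuracy_by_slice(examples):
    """examples: iterable of (slice_value, y_true, y_pred) triples."""
    hits, counts = defaultdict(int), defaultdict(int)
    for group, y_true, y_pred in examples:
        counts[group] += 1
        hits[group] += int(y_true == y_pred)
    return {g: hits[g] / counts[g] for g in counts}

examples = [
    ("mobile", 1, 1), ("mobile", 0, 0), ("mobile", 1, 0),
    ("desktop", 1, 1), ("desktop", 0, 0),
]
per_slice = accuracy_by_slice(examples)
```

Overall accuracy here is 0.8, which hides the weaker mobile slice; asserting a minimum accuracy per slice catches exactly the gaps that an aggregate metric conceals.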

Bias and Fairness Testing

ML models can perpetuate or amplify societal biases:

  • Demographic parity: Positive prediction rates should be similar across groups
  • Equal opportunity: True positive rates should be similar across groups
  • Calibration: Predicted probabilities should be accurate for all groups
  • Disparate impact: Adverse decision rates should not disproportionately affect protected groups

Test for bias across: race, gender, age, disability status, geographic location, socioeconomic status.
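The first two criteria reduce to comparing simple rates across groups. A sketch, assuming binary predictions and labels already grouped by a protected attribute (the groups "A" and "B" and their data are invented for illustration):

```python
def positive_rate(preds):
    """Share of positive predictions (used for demographic parity)."""
    return sum(preds) / len(preds)

def true_positive_rate(y_true, y_pred):
    """TPR among actual positives (used for equal opportunity)."""
    preds_on_positives = [p for y, p in zip(y_true, y_pred) if y == 1]
    return sum(preds_on_positives) / len(preds_on_positives)

def max_gap(values):
    """Largest pairwise difference across groups."""
    return max(values) - min(values)

# Hypothetical groups A and B
preds = {"A": [1, 1, 0, 1], "B": [1, 0, 0, 0]}
parity_gap = max_gap([positive_rate(p) for p in preds.values()])
```

A fairness test then asserts that `parity_gap` (and the analogous TPR gap) stays below a threshold chosen with domain and legal input; there is no universal number.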

Advanced ML Testing

Data Drift Monitoring

Production data changes over time:

  • Feature drift: Input feature distributions shift
  • Concept drift: The relationship between features and target changes
  • Label drift: The distribution of target values changes

Monitoring approach:

  • Statistical tests (Kolmogorov-Smirnov, Population Stability Index)
  • Distribution visualization dashboards
  • Automated alerts when drift exceeds thresholds
  • Triggered retraining pipelines
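As a concrete example, the Population Stability Index mentioned above can be computed by binning both samples on a shared grid and comparing bin frequencies. The epsilon floor and the common rule of thumb that PSI above roughly 0.2 signals meaningful drift are conventions, not laws; tune them for your data.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two numeric samples."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def bin_fractions(sample):
        counts = [0] * bins
        for x in sample:
            i = min(int((x - lo) / width), bins - 1)
            counts[i] += 1
        n = len(sample)
        # small floor so empty bins do not blow up the log term
        return [max(c / n, 1e-4) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [1, 2, 3, 4, 5] * 20
drifted = [x + 3 for x in baseline]
```

Identical distributions yield a PSI of zero, and the shifted sample scores far above the 0.2 rule of thumb, which is the condition a drift alert would fire on.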

Model Serving Testing

ML models in production face infrastructure challenges:

  • Inference latency (P50, P95, P99) under load
  • Throughput (predictions per second)
  • Model versioning and gradual rollout (canary deployment)
  • A/B testing between model versions
  • Fallback to previous model on failure
  • Batch vs. real-time inference pipelines
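Latency SLOs are usually asserted on percentiles of measured inference times. A nearest-rank percentile is enough for a load-test assertion; this is a sketch, and production monitoring systems compute these for you:

```python
import math

def percentile(latencies_ms, p):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(latencies_ms)
    rank = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[rank]

# Uniform 1..100 ms samples, purely for illustration;
# a real test would use timings captured under load.
samples = list(range(1, 101))
```

A load test then asserts, for example, `percentile(measured, 99) <= budget_ms` so a latency regression fails CI instead of surfacing in production.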

ML Security Testing

  • Adversarial attacks: inputs crafted to fool the model
  • Model extraction: preventing unauthorized copying of model behavior
  • Data poisoning: detecting tampered training data
  • Privacy: model does not memorize and leak training data (membership inference)

Hands-On Exercise

Design a test plan for a credit scoring ML model:

  1. Data quality: Verify training data completeness, check for historical bias
  2. Model accuracy: Evaluate precision, recall, and AUC on holdout test set
  3. Bias testing: Verify fair outcomes across age groups, genders, and zip codes
  4. Robustness: Test with edge cases (zero income, extremely high credit limit)
  5. Monitoring: Define drift detection metrics and retraining triggers

Solution Guide

Bias tests:

  • Calculate approval rates by gender: difference should be < 5%
  • Calculate approval rates by age group: no group should have > 2x rejection rate
  • Verify model explanation (SHAP values) does not rely on protected attributes
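The first two checks translate into direct assertions over approval decisions grouped by attribute. The group labels and decision data below are invented for illustration:

```python
def approval_rate(decisions):
    """Share of approved (1) decisions."""
    return sum(decisions) / len(decisions)

def parity_within(by_group, max_gap=0.05):
    """Gender check: approval-rate difference below 5 percentage points."""
    rates = [approval_rate(d) for d in by_group.values()]
    return (max(rates) - min(rates)) <= max_gap

def rejection_ratio_ok(by_group, max_ratio=2.0):
    """Age check: no group's rejection rate exceeds 2x another's."""
    rejection = [1 - approval_rate(d) for d in by_group.values()]
    return max(rejection) <= max_ratio * min(rejection)

by_gender = {"F": [1, 1, 0, 1, 1], "M": [1, 1, 0, 1, 1]}
by_age = {"18-30": [1, 0, 1, 1], "31-50": [1, 1, 0, 1], "51+": [1, 1, 1, 0]}
```

The SHAP check is harder to automate fully, but a cheap version asserts that no protected attribute appears among the top-k features by mean absolute SHAP value.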

Robustness tests:

  • Income = $0: model should handle gracefully, not crash
  • Credit utilization = 100%: should produce reasonable (likely low) score
  • All features at boundary values: model should not produce extreme outlier scores
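These edge cases become ordinary unit tests against the scoring function. The `credit_score` below is a deliberately simple stand-in, not a real model, included only so the tests have something to call; in practice you would load the trained model here.

```python
def credit_score(applicant):
    """Stand-in scorer for illustration; a real model would be loaded here."""
    base = 600.0
    base += min(applicant["income"], 200_000) / 1_000  # capped income contribution
    base -= applicant["utilization"] * 200             # high utilization lowers score
    return max(300.0, min(850.0, base))                # clamp to the valid score range

zero_income = credit_score({"income": 0, "utilization": 0.5})
maxed_out = credit_score({"income": 50_000, "utilization": 1.0})
```

The assertions to make are exactly the ones in the list above: no crash on zero income, scores stay inside the valid range at boundary values, and 100% utilization produces a lower score than 0%.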

Pro Tips

  1. Test data before testing models — most ML bugs are actually data bugs
  2. Monitor production model performance continuously — accuracy degrades silently without monitoring
  3. Always test for bias with real demographic data — synthetic data may not reveal real-world biases
  4. Version everything — data, features, models, and configurations must be traceable and reproducible
  5. Compare new models against baselines — a simpler model that performs nearly as well may be preferable

Key Takeaways

  1. ML testing requires testing the entire pipeline: data, features, model, serving, and monitoring
  2. Model accuracy alone is insufficient — fairness, robustness, and interpretability matter equally
  3. Data drift is the silent killer of ML models — continuous monitoring is essential
  4. ML bias testing is not optional — it has legal, ethical, and business implications