Introduction

Testing AI and machine learning systems is new territory for QA. Traditional testing approaches built on deterministic inputs and predictable outputs break down here. When system output depends on probabilities rather than explicit rules, how do you determine whether the system works correctly?

According to Gartner, by 2025, 75% of enterprise applications will use ML components. This means every QA engineer will sooner or later face the task of testing an AI/ML system. In this article, we’ll explore the fundamental differences between ML testing and traditional software testing, along with practical approaches to quality assurance.

Why ML Testing Differs from Traditional Testing

Non-determinism

Traditional software:

def calculate_discount(price, code):
    if code == "SAVE20":
        return price * 0.8
    return price

# Test: always predictable result
assert calculate_discount(100, "SAVE20") == 80

ML system:

def predict_customer_churn(customer_data):
    # ML model returns probability
    prediction = model.predict(customer_data)
    return prediction  # 0.73 or 0.68 or 0.81?

# Test: result varies!
# How to test probability?
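
One workable pattern is to test properties of the output rather than exact values: valid range, reproducibility once randomness is seeded, and aggregate behavior on a known dataset. A minimal sketch (model, customer_data, and known_churners are assumed to exist):

import numpy as np
import pytest

def test_churn_prediction_properties():
    # 1. The output must always be a valid probability
    prediction = predict_customer_churn(customer_data)
    assert 0.0 <= prediction <= 1.0

    # 2. With randomness seeded, repeated calls should agree
    np.random.seed(42)
    first = predict_customer_churn(customer_data)
    np.random.seed(42)
    second = predict_customer_churn(customer_data)
    assert first == pytest.approx(second)

    # 3. Aggregate behavior on known churners should stay within an agreed band
    predictions = [predict_customer_churn(c) for c in known_churners]
    assert np.mean(predictions) > 0.6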

Data Dependency

In ML systems, data = code. Changing the training dataset can radically change model behavior, even if the code itself hasn’t changed.

Problems:

  • Data drift: production data differs from training data
  • Label quality: labeling errors lead to wrong predictions
  • Data bias: model reproduces data biases
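
Because the dataset is effectively part of the codebase, it helps to version and fingerprint it so that any change to training data is as visible as a code diff. A minimal sketch (the expected hash constant is an assumption):

import hashlib
import pandas as pd

def dataset_fingerprint(df: pd.DataFrame) -> str:
    """Stable hash of a DataFrame so changes to training data show up in review."""
    row_hashes = pd.util.hash_pandas_object(df, index=True).values
    return hashlib.sha256(row_hashes.tobytes()).hexdigest()

# In CI: fail if the training data changed without an explicit update
# assert dataset_fingerprint(train_df) == EXPECTED_TRAIN_DATA_HASH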

No Explicit Business Logic

Traditional code contains explicit rules:

if age < 18:
    return "Access denied"

ML model is a black box:

# Where's the logic? In neural network weights!
prediction = neural_network.forward(input_data)

How to test what you can’t read?

Quality Metrics Aren’t Binary

  • Traditional test: PASS ✅ or FAIL ❌
  • ML test: Accuracy 94.3%, Precision 0.87, Recall 0.91, F1 0.89

What threshold is acceptable? This is a business decision, not a technical question.
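
One way to keep that decision with the business is to pull thresholds out of test code into a reviewable config that product owners sign off on. A minimal sketch (names and values are illustrative, and evaluate_model is an assumed helper):

# Thresholds agreed with the business, stored in config rather than hard-coded
QUALITY_THRESHOLDS = {
    'accuracy': 0.90,
    'precision': 0.85,
    'recall': 0.88,
    'f1': 0.86,
}

def test_model_meets_agreed_thresholds():
    metrics = evaluate_model(model, X_test, y_test)
    for name, threshold in QUALITY_THRESHOLDS.items():
        assert metrics[name] >= threshold, \
            f"{name} = {metrics[name]:.3f}, below the agreed minimum of {threshold}"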

Data Validation: Foundation of ML Quality

Why Data Quality Is Critical

ML Rule: Garbage in = Garbage out × 100

Bad data in traditional software can lead to an error or crash. In ML, it leads to a model that systematically makes bad decisions.

Pipeline Data Validation

Data validation stages:

# 1. Schema validation (Great Expectations; exact API varies by version)
import great_expectations as ge

# Wrap the raw DataFrame so expectation methods can be called on it directly
batch = ge.from_pandas(raw_df)  # raw_df is the incoming training/serving data

# Check data structure
batch.expect_column_to_exist("customer_age")
batch.expect_column_values_to_be_between("customer_age", min_value=18, max_value=120)
batch.expect_column_values_to_be_in_set("country", ["US", "UK", "CA", "AU"])

# 2. Data distribution checks
batch.expect_column_mean_to_be_between("purchase_amount", min_value=50, max_value=500)
batch.expect_column_stdev_to_be_between("purchase_amount", min_value=10, max_value=200)

# 3. Referential integrity
batch.expect_column_values_to_not_be_null("user_id")
batch.expect_compound_columns_to_be_unique(["user_id", "transaction_id"])

Data Drift Detection

Problem: Training data from 2023 may not reflect 2025 reality.

Solution: Continuous monitoring of data distribution

from evidently.dashboard import Dashboard
from evidently.tabs import DataDriftTab

# Compare production data with training dataset
dashboard = Dashboard(tabs=[DataDriftTab()])
dashboard.calculate(reference_data=train_df, current_data=production_df)

# Drift metrics:
# - Wasserstein distance for numerical features
# - Population Stability Index (PSI)
# - Jensen-Shannon divergence for categorical
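
The PSI mentioned above is simple enough to compute by hand when you want a dependency-free check. A sketch of the standard formulation (the 0.1/0.25 cut-offs are a common rule of thumb, not an Evidently API):

import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time (expected) and production (actual) feature."""
    edges = np.unique(np.percentile(expected, np.linspace(0, 100, bins + 1)))

    # Clip both samples into the training range so every value lands in a bin
    expected_pct = np.histogram(np.clip(expected, edges[0], edges[-1]), edges)[0] / len(expected)
    actual_pct = np.histogram(np.clip(actual, edges[0], edges[-1]), edges)[0] / len(actual)

    # Avoid log(0) / division by zero for empty bins
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)

    return np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))

# Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant drift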

Drift detection example:

Feature Drift Report:
  customer_age:
    drift_detected: false
    drift_score: 0.03

  purchase_amount:
    drift_detected: true ⚠️
    drift_score: 0.47
    reason: "Mean shifted from $150 to $89"
    action: "Retrain model or investigate business change"

  device_type:
    drift_detected: true ⚠️
    drift_score: 0.62
    reason: "Mobile traffic increased from 40% to 78%"

Actions when drift detected:

  1. Investigate cause (business changes vs data quality issue)
  2. Retrain model on new data
  3. A/B test new model vs old
  4. Gradual rollout if the test is successful

Label Quality Validation

Problem: If 10% of labels are wrong, model will learn errors.

Validation strategies:

1. Cross-validation of labeling:

# Multiple annotators label same data
from sklearn.metrics import cohen_kappa_score

annotator1_labels = [1, 0, 1, 1, 0, 1]
annotator2_labels = [1, 0, 1, 0, 0, 1]

# Kappa > 0.8 = good agreement
kappa = cohen_kappa_score(annotator1_labels, annotator2_labels)

if kappa < 0.7:
    print("⚠️ Annotators disagree! Guidelines need clarification")

2. Outlier detection in labels:

# Find suspicious labels
from cleanlab.filter import find_label_issues

# Model predicts probabilities for each class
predicted_probs = model.predict_proba(X_train)

# Cleanlab finds likely mislabeled examples
label_issues = find_label_issues(
    labels=y_train,
    pred_probs=predicted_probs,
    return_indices_ranked_by='self_confidence'
)

print(f"Found {len(label_issues)} potentially mislabeled examples")
# Manually review top-100 and fix

3. Active learning for quality improvement:

# Model requests labels for uncertain examples
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling
from sklearn.ensemble import RandomForestClassifier

learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    query_strategy=uncertainty_sampling
)

# Model selects 100 most uncertain examples
query_idx, query_instance = learner.query(X_unlabeled, n_instances=100)

# Human labels only these 100 instead of all data
# Labeling efficiency increases 5-10x

Model Testing and Validation

Unit Testing for ML Models

Yes, unit tests are possible even for ML!

Test data preprocessing:

def test_feature_engineering():
    # Given
    raw_data = pd.DataFrame({
        'date': ['2025-01-01', '2025-01-02'],
        'amount': [100, 200]
    })

    # When
    features = preprocess_features(raw_data)

    # Then
    assert 'day_of_week' in features.columns
    assert 'amount_log' in features.columns
    assert features['amount_log'].iloc[0] == pytest.approx(4.605, 0.01)
    assert features['day_of_week'].iloc[0] == 2  # Wednesday

Test model inference logic:

def test_model_prediction_shape():
    # Given
    model = load_model('churn_predictor_v2.pkl')
    test_input = np.random.rand(10, 20)  # 10 samples, 20 features

    # When
    predictions = model.predict(test_input)

    # Then
    assert predictions.shape == (10,)  # One prediction per sample
    assert np.all((predictions >= 0) & (predictions <= 1))  # Valid probabilities

Test model behavior on edge cases:

def test_model_handles_missing_values():
    # Given
    model = load_model('recommender.pkl')
    input_with_nan = pd.DataFrame({
        'age': [25, np.nan, 30],
        'income': [50000, 60000, np.nan]
    })

    # When/Then
    # Model shouldn't crash on NaN
    predictions = model.predict(input_with_nan)
    assert len(predictions) == 3
    assert not np.any(np.isnan(predictions))

Integration Testing ML Pipeline

End-to-end ML pipeline test:

@pytest.mark.integration
def test_ml_pipeline_end_to_end():
    # 1. Load data
    raw_data = load_test_dataset('test_data.csv')

    # 2. Preprocess
    preprocessed = preprocessing_pipeline.transform(raw_data)
    assert preprocessed.shape[1] == 50  # Expected number of features

    # 3. Feature engineering
    features = feature_engineering_pipeline.transform(preprocessed)
    assert 'feature_interaction_1' in features.columns

    # 4. Model prediction
    predictions = model.predict(features)
    assert len(predictions) == len(raw_data)

    # 5. Postprocessing
    final_output = postprocess_predictions(predictions)
    assert final_output['confidence'].min() >= 0
    assert final_output['confidence'].max() <= 1

Model Performance Testing

Metrics for different task types:

Classification:

from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

def test_model_classification_performance():
    y_true = test_labels
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]

    # Accuracy should exceed baseline
    accuracy = (y_pred == y_true).mean()
    assert accuracy > 0.85, f"Accuracy {accuracy} below threshold"

    # AUC-ROC for ranking quality assessment
    auc = roc_auc_score(y_true, y_proba)
    assert auc > 0.90, f"AUC {auc} below threshold"

    # Check precision/recall for each class
    report = classification_report(y_true, y_pred, output_dict=True)

    # "Fraud" class is critical - high recall required
    assert report['fraud']['recall'] > 0.95, "Missing too many fraud cases!"

    # False positives are expensive - good precision needed
    assert report['fraud']['precision'] > 0.80, "Too many false fraud alerts!"

Regression:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def test_model_regression_performance():
    y_true = test_target
    y_pred = model.predict(X_test)

    # MAE within acceptable limits
    mae = mean_absolute_error(y_true, y_pred)
    assert mae < 50, f"MAE ${mae:.2f} too high (average error exceeds the $50 target)"

    # R² shows model's explanatory power
    r2 = r2_score(y_true, y_pred)
    assert r2 > 0.85, f"R² {r2} - model explains too little variance"

    # MAPE for relative error
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    assert mape < 10, f"MAPE {mape:.1f}% - predictions off by more than 10% on average"

Invariance Testing

Problem: Model should be robust to minor input changes.

Invariance test examples:

def test_invariance_to_feature_order():
    """Column permutation shouldn't affect result"""
    original_pred = model.predict(X_test)

    # Shuffle column order
    shuffled_columns = X_test.sample(frac=1, axis=1)
    shuffled_pred = model.predict(shuffled_columns)

    np.testing.assert_array_almost_equal(original_pred, shuffled_pred)

def test_invariance_to_text_case():
    """NLP model shouldn't change predictions based on case"""
    texts = ["This is SPAM!", "this is spam!", "THIS IS SPAM!"]
    predictions = [spam_classifier.predict(t) for t in texts]

    # All three variants should give same result
    assert len(set(predictions)) == 1

def test_directional_expectation():
    """Increasing income should reduce churn probability"""
    base_customer = pd.DataFrame({
        'age': [30], 'income': [50000], 'tenure': [12]
    })

    base_churn_prob = model.predict_proba(base_customer)[0][1]

    # Double the income
    rich_customer = base_customer.copy()
    rich_customer['income'] = 100000

    rich_churn_prob = model.predict_proba(rich_customer)[0][1]

    # Churn probability should decrease
    assert rich_churn_prob < base_churn_prob, \
        "Higher income should reduce churn probability"

Bias Detection: Ethics and Fairness

Why Bias Is a Critical Problem

Real examples of bias in ML:

  • Amazon hiring ML (2018): The model discriminated against women because it was trained on historical resumes that came predominantly from men
  • COMPAS (criminal justice): A recidivism prediction model showed racial bias
  • Apple Card (2019): The algorithm reportedly gave women credit limits 10-20x lower than men with equal income

Consequences:

  • Legal risks (discrimination against protected groups is prohibited by law)
  • Reputational damage
  • Ethical problems
  • Reinforcing social inequality

Types of Bias

1. Data bias:

  • Training data not representative of all groups
  • Historical bias (model learns from past discrimination)
  • Sampling bias (some groups underrepresented)

2. Model bias:

  • Feature engineering amplifies bias
  • Proxy features (ZIP code correlates with race; see the check sketched below)
  • Optimization metric doesn’t account for fairness

3. Deployment bias:

  • Model used for unintended purposes
  • Feedback loops amplify bias
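
For the proxy-feature problem in particular, a quick check is to see how well each candidate feature predicts the protected attribute on its own; anything far above chance deserves scrutiny. A sketch (the column names and the 0.7 cut-off are assumptions):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

for feature in ['zip_code', 'device_type', 'browser_language']:
    # One-hot encode the single feature and try to predict the protected attribute from it
    X_single = pd.get_dummies(data[[feature]], columns=[feature])
    score = cross_val_score(
        LogisticRegression(max_iter=1000),
        X_single,
        data['race'],
        cv=5
    ).mean()

    if score > 0.7:  # far above chance for a roughly balanced target
        print(f"⚠️ '{feature}' may act as a proxy for the protected attribute (accuracy {score:.2f})")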

Detecting Bias

Fairness metrics:

from aif360.metrics import BinaryLabelDatasetMetric, ClassificationMetric
from aif360.datasets import StandardDataset

# Prepare data with a protected attribute
# (favorable_classes marks the positive label; everything outside
#  privileged_classes is treated as unprivileged)
dataset = StandardDataset(
    df=data,
    label_name='approved',
    favorable_classes=[1],
    protected_attribute_names=['gender'],
    privileged_classes=[['male']]
)

# Metric 1: Statistical Parity Difference
# Should be close to 0 (equal probability of a positive outcome)
# Group values must match how 'gender' is encoded in the dataset
metric = BinaryLabelDatasetMetric(
    dataset,
    privileged_groups=[{'gender': 'male'}],
    unprivileged_groups=[{'gender': 'female'}]
)
spd = metric.statistical_parity_difference()

assert abs(spd) < 0.1, f"Statistical parity violation: {spd}"
# If > 0.1: the model predicts the positive class more often for the privileged group

# Metric 2: Equal Opportunity Difference
# True positive rate should be equal for all groups
# ClassificationMetric compares two dataset objects: ground truth vs predicted labels
pred_dataset = test_dataset.copy()
pred_dataset.labels = model.predict(X_test).reshape(-1, 1)

metric = ClassificationMetric(
    test_dataset,
    pred_dataset,
    privileged_groups=[{'gender': 'male'}],
    unprivileged_groups=[{'gender': 'female'}]
)

eod = metric.equal_opportunity_difference()
assert abs(eod) < 0.1, f"Equal opportunity violation: {eod}"

# Metric 3: Disparate Impact
# Ratio of positive outcomes between groups
di = metric.disparate_impact()
assert 0.8 <= di <= 1.25, f"Disparate impact: {di} (legal threshold)"
# By 4/5 rule: ratio < 0.8 = likely discrimination
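
The same 4/5-rule check can be done framework-free with plain pandas as a sanity check (data, X, and the column names are assumptions):

# Approval rate per group based on the model's predictions
approval_by_gender = (
    data.assign(predicted=model.predict(X))
        .groupby('gender')['predicted']
        .mean()
)

disparate_impact = approval_by_gender['female'] / approval_by_gender['male']
assert disparate_impact >= 0.8, \
    f"Disparate impact {disparate_impact:.2f} violates the 4/5 rule"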

Intersectional fairness:

# Check fairness for group intersections
for gender in ['male', 'female']:
    for race in ['white', 'black', 'asian', 'hispanic']:
        for age_group in ['<30', '30-50', '>50']:
            subset = data[(data.gender == gender) &
                         (data.race == race) &
                         (data.age_group == age_group)]

            if len(subset) < 30:
                continue  # Insufficient data

            approval_rate = (model.predict(subset) == 1).mean()

            # Check that approval rate doesn't deviate strongly from overall
            overall_rate = (model.predict(data) == 1).mean()

            if abs(approval_rate - overall_rate) > 0.15:
                print(f"⚠️ Bias detected for {gender}/{race}/{age_group}")
                print(f"   Approval rate: {approval_rate:.2%} vs overall {overall_rate:.2%}")

Mitigating Bias

Approaches to reducing bias:

1. Pre-processing (fix data):

from aif360.algorithms.preprocessing import Reweighing

# Reweighing: assign weights to examples to balance groups
rw = Reweighing(
    unprivileged_groups=[{'gender': 'female'}],
    privileged_groups=[{'gender': 'male'}]
)

dataset_transformed = rw.fit_transform(dataset)

# Now train the model on the reweighted data
model.fit(dataset_transformed.features,
          dataset_transformed.labels.ravel(),
          sample_weight=dataset_transformed.instance_weights)

2. In-processing (fair model training):

from aif360.algorithms.inprocessing import PrejudiceRemover

# The model optimizes both accuracy and fairness simultaneously
fair_model = PrejudiceRemover(
    sensitive_attr='gender',
    eta=1.0  # Trade-off between accuracy and fairness
)

# AIF360 in-processing algorithms train on dataset objects, not raw X/y arrays
fair_model.fit(train_dataset)

3. Post-processing (adjust predictions):

from aif360.algorithms.postprocessing import EqOddsPostprocessing

# Adjust threshold for different groups
eop = EqOddsPostprocessing(
    unprivileged_groups=[{'gender': 'female'}],
    privileged_groups=[{'gender': 'male'}]
)

# Fit on the validation set (both arguments are AIF360 dataset objects:
# ground-truth labels and the model's predicted labels)
eop.fit(val_dataset, predictions_val)

# Apply to test predictions
fair_predictions = eop.predict(predictions_test)

Testing fairness in CI/CD:

@pytest.mark.fairness
def test_model_fairness():
    """Fail build if model shows bias"""

    # Load protected test set
    test_data = load_fairness_test_set()

    for protected_attr in ['gender', 'race', 'age_group']:
        # Compute fairness metrics
        metrics = compute_fairness_metrics(
            model=model,
            data=test_data,
            protected_attribute=protected_attr
        )

        # Assert fairness thresholds
        assert abs(metrics['statistical_parity']) < 0.1, \
            f"Statistical parity violation for {protected_attr}"

        assert metrics['disparate_impact'] > 0.8, \
            f"Disparate impact violation for {protected_attr}"

        assert abs(metrics['equal_opportunity']) < 0.1, \
            f"Equal opportunity violation for {protected_attr}"

A/B Testing for ML Models

Why You Can’t Just Deploy New Model

Problems with offline evaluation:

  • Test set may not reflect production distribution
  • Offline metrics don’t always correlate with business metrics
  • Model may have unexpected edge case behavior

Only way to know true quality = test in production on real users.

Designing ML A/B Tests

Basic architecture:

import hashlib

class MLABTestFramework:
    def __init__(self, control_model, treatment_model):
        self.control = control_model
        self.treatment = treatment_model
        self.assignment_cache = {}

    def get_prediction(self, user_id, features):
        # Consistent assignment: one user always in one group
        if user_id not in self.assignment_cache:
            self.assignment_cache[user_id] = self._assign_variant(user_id)

        variant = self.assignment_cache[user_id]

        if variant == 'control':
            prediction = self.control.predict(features)
            self._log_prediction('control', user_id, prediction)
        else:
            prediction = self.treatment.predict(features)
            self._log_prediction('treatment', user_id, prediction)

        return prediction

    def _assign_variant(self, user_id):
        # Hash-based assignment for consistency
        hash_val = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
        return 'treatment' if hash_val % 100 < 50 else 'control'
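
Wiring it up is then one call per request; the hash-based split keeps each user in the same variant even if the in-memory cache is lost (the model files and feature variable here are hypothetical):

# Example usage
ab_test = MLABTestFramework(
    control_model=load_model('churn_predictor_v1.pkl'),
    treatment_model=load_model('churn_predictor_v2.pkl')
)

prediction = ab_test.get_prediction(user_id=12345, features=customer_features)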

Metrics for ML A/B Tests

Multi-level metrics:

1. Model metrics (sanity checks):

# Verify treatment model works as expected
control_accuracy = evaluate_model(control_predictions, labels)
treatment_accuracy = evaluate_model(treatment_predictions, labels)

assert treatment_accuracy >= control_accuracy * 0.95, \
    "Treatment model significantly worse - stop test!"

2. User engagement metrics:

# Recommendation model example
metrics = {
    'control': {
        'click_through_rate': 0.12,
        'time_on_site': 8.5,  # minutes
        'items_viewed': 4.2
    },
    'treatment': {
        'click_through_rate': 0.14,  # +16.7% 🎉
        'time_on_site': 9.1,          # +7.1%
        'items_viewed': 4.8           # +14.3%
    }
}

# Statistical significance test
from scipy.stats import ttest_ind

control_ctr = get_user_ctr_data('control')
treatment_ctr = get_user_ctr_data('treatment')

t_stat, p_value = ttest_ind(treatment_ctr, control_ctr)

if p_value < 0.05 and treatment_ctr.mean() > control_ctr.mean():
    print("✅ Treatment shows statistically significant improvement!")

3. Business metrics (north star):

# Ultimate goal - revenue/conversions
business_metrics = {
    'control': {
        'revenue_per_user': 45.30,
        'conversion_rate': 0.032,
        'ltv_30d': 120.50
    },
    'treatment': {
        'revenue_per_user': 48.20,  # +6.4%
        'conversion_rate': 0.035,   # +9.4%
        'ltv_30d': 125.80          # +4.4%
    }
}

# Economic significance
users_per_month = 100000
revenue_lift = (48.20 - 45.30) * users_per_month
# = $290,000/month additional revenue!

Guardrail Metrics

Problem: Treatment may improve some metrics but worsen others.

guardrail_metrics = {
    'latency_p99': {
        'control': 250,  # ms
        'treatment': 280,  # ms - acceptable?
        'threshold': 300,
        'status': 'PASS'
    },
    'error_rate': {
        'control': 0.001,
        'treatment': 0.0015,
        'threshold': 0.002,
        'status': 'WARNING'  # Requires investigation
    },
    'user_complaints': {
        'control': 12,  # per week
        'treatment': 45,  # 🚨 Alarming spike!
        'threshold': 20,
        'status': 'FAIL'
    }
}

# Automatic kill switch
if guardrail_metrics['user_complaints']['status'] == 'FAIL':
    rollback_experiment('ml_model_v2')
    alert_team('Treatment model causing user complaints spike!')

Continuous Model Monitoring

Why Monitoring Is Critical

ML models “decay” over time:

  • Data drift: world changes, model becomes outdated
  • Concept drift: relationships between features and target change
  • Upstream changes: API changes break feature generation

Without monitoring, you’ll learn about problems when users complain (too late).

Key Monitoring Metrics

1. Model performance metrics:

# Daily monitoring
daily_metrics = {
    'date': '2025-10-01',
    'predictions_count': 1_200_000,
    'avg_confidence': 0.78,  # Dropped from 0.85 - warning sign

    # Ground truth metrics (when labels available)
    'accuracy': 0.89,  # Was 0.94 a week ago
    'precision': 0.85,
    'recall': 0.91,

    # Alerts
    'alerts': [
        'Accuracy dropped 5% in last 7 days',
        'Average confidence declining trend'
    ]
}

2. Data quality metrics:

data_quality_dashboard = {
    'missing_values': {
        'age': 0.02,      # OK
        'income': 0.15,   # 🚨 Increased from 0.03
    },
    'out_of_range_values': {
        'age': 3,  # 3 cases of age > 120
    },
    'new_categorical_values': {
        'country': ['XX'],  # Unknown country code
    }
}
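
These dashboards are most useful when a simple check turns them into alerts. A minimal sketch comparing current metrics against a baseline captured at deployment time (the 0.05 threshold and the notify_on_call_team hook are assumptions):

def check_model_health(current: dict, baseline: dict, max_drop: float = 0.05):
    """Return alert messages for metrics that degraded beyond max_drop."""
    alerts = []
    for metric in ('accuracy', 'precision', 'recall'):
        if baseline[metric] - current[metric] > max_drop:
            alerts.append(
                f"{metric} dropped from {baseline[metric]:.2f} to {current[metric]:.2f}"
            )
    return alerts

alerts = check_model_health(
    current={'accuracy': 0.89, 'precision': 0.85, 'recall': 0.91},
    baseline={'accuracy': 0.94, 'precision': 0.88, 'recall': 0.92},
)
if alerts:
    notify_on_call_team(alerts)  # hypothetical alerting hook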

Conclusion

Testing AI/ML systems requires a fundamentally new approach:

Key takeaways:

  • Data quality is 80% of ML system success; data validation is critical at every stage.
  • Bias detection is not an optional feature but a mandatory requirement for production ML.
  • A/B testing is the only way to truly validate a model in production.
  • Continuous monitoring is essential: ML models require constant observation, not set-and-forget.

Practical recommendations:

  1. Automate data validation in every pipeline
  2. Test fairness as part of CI/CD
  3. Don’t trust offline metrics — test in production
  4. Monitor models 24/7 — they degrade over time
  5. Document everything — what data, what assumptions, what limitations

ML testing is not just a new skill, it’s a new discipline. QA engineers who master it now will be in demand for the next decade.