The Test Suite Explosion Problem
Modern applications have thousands of automated tests. Running all tests on every commit is slow (hours), expensive (compute costs), and delays feedback. Yet running too few tests risks missing bugs that reach production.
Traditional approaches select tests using:
- Run all tests: Comprehensive, but slow and expensive
- Run tests for changed files only: Fast, but misses indirect dependencies
- Manual selection: Error-prone and inconsistent
Predictive test selection uses ML (as discussed in AI-powered Test Generation: The Future Is Already Here) to intelligently choose which tests to run based on code changes, historical failures, and risk analysis—cutting execution time by 60-90% while maintaining quality.
How Predictive Test Selection Works
1. Test-Code Mapping
Build a dependency graph between code and tests:
from collections import defaultdict

class CodeTestMapper:
    def __init__(self):
        # Maps each source file to the set of tests that exercise it
        self.code_test_map = defaultdict(set)
        self.test_coverage = {}

    def analyze_coverage(self, test_run_data):
        """Build mapping from coverage data"""
        for test_name, coverage_data in test_run_data.items():
            covered_files = coverage_data['files']
            for file_path in covered_files:
                self.code_test_map[file_path].add(test_name)
            self.test_coverage[test_name] = {
                'files': covered_files,
                'lines': coverage_data['lines_covered']
            }

    def get_affected_tests(self, changed_files):
        """Get tests affected by code changes"""
        affected = set()
        for file_path in changed_files:
            affected.update(self.code_test_map.get(file_path, set()))
        return list(affected)
# Usage
mapper = CodeTestMapper()
mapper.analyze_coverage(coverage_report)
changed_files = git_diff.get_modified_files()
tests_to_run = mapper.get_affected_tests(changed_files)
print(f"Run {len(tests_to_run)} tests instead of {total_tests}")
2. Failure Prediction Model
Train an ML model to predict the probability that each test will fail:
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

class TestFailurePredictor:
    def __init__(self):
        self.model = RandomForestClassifier(n_estimators=100)
        self.feature_extractor = FeatureExtractor()  # project-specific helper, not shown here

    def extract_features(self, commit, test):
        """Extract features for prediction"""
        return {
            # Code change features
            'files_changed': len(commit['files']),
            'lines_added': commit['additions'],
            'lines_deleted': commit['deletions'],
            'complexity_change': self.calculate_complexity_delta(commit),  # not shown
            # Test features
            'test_execution_time_ms': test['avg_duration'],
            'test_flakiness_score': test['flakiness'],
            'days_since_last_failure': test['days_since_failure'],
            'failure_rate_30d': test['failures_last_30_days'] / test['runs_last_30_days'],
            # Developer features
            'author_test_failure_rate': commit['author_failure_rate'],
            'commit_hour': commit['timestamp'].hour,
            'is_friday_afternoon': commit['timestamp'].weekday() == 4 and commit['timestamp'].hour >= 14,
            # Change location features
            'changes_in_test_file': test['file_path'] in commit['files'],
            'changes_in_dependencies': self.has_dependency_changes(commit, test)  # not shown
        }

    def train(self, historical_data):
        """Train on historical test outcomes"""
        features = []
        labels = []
        for commit, test, outcome in historical_data:
            feature_vector = self.extract_features(commit, test)
            features.append(list(feature_vector.values()))
            labels.append(1 if outcome == 'failed' else 0)
        self.model.fit(features, labels)

    def predict_failure_probability(self, commit, test):
        """Predict probability that test will fail"""
        features = self.extract_features(commit, test)
        feature_vector = [list(features.values())]
        probability = self.model.predict_proba(feature_vector)[0][1]
        return {
            'test': test['name'],
            'failure_probability': probability,
            'features': features
        }
# Usage
predictor = TestFailurePredictor()
predictor.train(load_test_history(days=90))

priority_tests = []
for test in all_tests:
    prediction = predictor.predict_failure_probability(current_commit, test)
    if prediction['failure_probability'] > 0.3:  # High risk
        priority_tests.append(test)
3. Test Prioritization
Rank tests by value and risk:
class TestPrioritizer:
    def __init__(self, predictor, mapper):
        self.predictor = predictor
        self.mapper = mapper

    def calculate_test_value(self, test, commit):
        """Calculate value score for test"""
        failure_prob = self.predictor.predict_failure_probability(commit, test)['failure_probability']
        code_coverage = test['line_coverage'] / total_lines  # total_lines: lines in the codebase
        bug_detection_history = test['bugs_caught_last_year']
        execution_cost = test['avg_duration_ms'] / 1000  # seconds
        # Value = (Failure Risk × Coverage × Bug History) / Cost
        value_score = (failure_prob * code_coverage * bug_detection_history) / max(execution_cost, 1)
        return value_score

    def prioritize(self, commit, time_budget_seconds):
        """Select tests to maximize value within time budget"""
        all_tests = self.mapper.get_test_catalog()
        # Calculate value for each test
        test_scores = [
            {
                'test': test,
                'value': self.calculate_test_value(test, commit),
                'duration': test['avg_duration_ms'] / 1000
            }
            for test in all_tests
        ]
        # Sort by value (descending)
        test_scores.sort(key=lambda x: x['value'], reverse=True)
        # Greedy selection within budget
        selected_tests = []
        total_time = 0
        for item in test_scores:
            if total_time + item['duration'] <= time_budget_seconds:
                selected_tests.append(item['test'])
                total_time += item['duration']
        return {
            'selected_tests': selected_tests,
            'estimated_duration': total_time,
            'coverage': len(selected_tests) / len(all_tests)
        }
# Usage
prioritizer = TestPrioritizer(predictor, mapper)
selection = prioritizer.prioritize(
    commit=current_commit,
    time_budget_seconds=600  # 10 minutes
)
print(f"Running {len(selection['selected_tests'])} highest-value tests")
print(f"Estimated time: {selection['estimated_duration']:.0f}s")
print(f"Coverage: {selection['coverage']:.1%} of test suite")
CI/CD Integration
GitHub Actions Example
name: Intelligent Test Selection

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0  # Full history for analysis

      - name: Analyze Code Changes
        id: changes
        run: |
          # Collapse the file list onto one line so it fits a single GITHUB_OUTPUT entry
          CHANGED_FILES=$(git diff --name-only ${{ github.event.before }} ${{ github.sha }} | tr '\n' ' ')
          echo "files=$CHANGED_FILES" >> $GITHUB_OUTPUT

      - name: Predict Test Selection
        id: selection
        env:
          CHANGED_FILES: ${{ steps.changes.outputs.files }}
        run: |
          python predict_tests.py \
            --changed-files "$CHANGED_FILES" \
            --time-budget 600 \
            --output selected_tests.json

      - name: Run Selected Tests
        run: |
          pytest $(cat selected_tests.json | jq -r '.tests[]')

      - name: Record Outcomes
        if: always()
        run: |
          python record_results.py \
            --commit ${{ github.sha }} \
            --results test-results.xml
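The workflow assumes a small command-line wrapper, predict_tests.py, living in the repository. A minimal sketch of what that entry point could look like, reusing the CodeTestMapper from above; the pickle path, argument names, and JSON shape are illustrative assumptions rather than a fixed interface:

# predict_tests.py -- illustrative sketch; file names and JSON shape are assumptions
import argparse
import json
import pickle

def main():
    parser = argparse.ArgumentParser(description="Select tests for the current change set")
    parser.add_argument('--changed-files', required=True,
                        help="Space-separated list of changed file paths")
    parser.add_argument('--time-budget', type=int, default=600,
                        help="Time budget for selected tests, in seconds")
    parser.add_argument('--output', default='selected_tests.json')
    args = parser.parse_args()

    changed_files = args.changed_files.split()

    # Load a previously built mapping (hypothetical pickle produced by CodeTestMapper)
    with open('models/code_test_mapper.pkl', 'rb') as f:
        mapper = pickle.load(f)

    # Coverage-based narrowing; a fuller version would also assemble a commit record
    # (additions, deletions, author, timestamp, ...) and rank via TestPrioritizer.
    tests_to_run = mapper.get_affected_tests(changed_files)

    with open(args.output, 'w') as f:
        json.dump({'tests': tests_to_run}, f)

if __name__ == '__main__':
    main()

The record_results.py step then feeds outcomes back into the training data used by TestFailurePredictor, closing the feedback loop.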
Advanced Techniques
Test Impact Analysis
import networkx as nx

class TestImpactAnalyzer:
    def __init__(self):
        self.impact_graph = nx.DiGraph()

    def build_impact_graph(self, codebase):
        """Build dependency graph"""
        # Add nodes
        for file in codebase.files:
            self.impact_graph.add_node(file.path, type='code')
        for test in codebase.tests:
            self.impact_graph.add_node(test.name, type='test')
        # Add edges (dependencies)
        for test in codebase.tests:
            for covered_file in test.coverage:
                self.impact_graph.add_edge(covered_file, test.name)
        # Add code-to-code dependencies
        for file in codebase.files:
            for imported_file in file.imports:
                self.impact_graph.add_edge(imported_file, file.path)

    def get_impacted_tests(self, changed_files):
        """Find all transitively impacted tests"""
        impacted = set()
        for changed_file in changed_files:
            # Find all reachable tests (transitive dependencies)
            reachable = nx.descendants(self.impact_graph, changed_file)
            for node in reachable:
                if self.impact_graph.nodes[node]['type'] == 'test':
                    impacted.add(node)
        return list(impacted)
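A small usage sketch, with hypothetical file names and the graph built by hand to keep the example self-contained, showing the transitive case that a plain file-to-test map misses:

# Hypothetical example: utils.py is imported by payment.py, which test_payment covers.
analyzer = TestImpactAnalyzer()
analyzer.impact_graph.add_node('utils.py', type='code')
analyzer.impact_graph.add_node('payment.py', type='code')
analyzer.impact_graph.add_node('test_payment', type='test')
analyzer.impact_graph.add_edge('utils.py', 'payment.py')      # payment.py imports utils.py
analyzer.impact_graph.add_edge('payment.py', 'test_payment')  # test_payment covers payment.py

# A change to utils.py reaches test_payment through the import edge,
# even though no test covers utils.py directly.
print(analyzer.get_impacted_tests(['utils.py']))  # ['test_payment']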
Flakiness-Aware Selection
class FlakinessFilter:
    def __init__(self, flakiness_threshold=0.1):
        self.threshold = flakiness_threshold

    def calculate_flakiness(self, test_history):
        """Calculate test flakiness score"""
        if len(test_history) < 10:
            return 0  # Not enough data
        # Count inconsistent results on same code
        flaky_instances = 0
        for commit_sha in set(test_history['commit']):
            commit_runs = test_history[test_history['commit'] == commit_sha]
            if len(commit_runs) > 1:
                outcomes = commit_runs['outcome'].unique()
                if len(outcomes) > 1:  # Different outcomes on same code
                    flaky_instances += 1
        flakiness = flaky_instances / len(set(test_history['commit']))
        return flakiness

    def should_always_run(self, test):
        """Decide if test is too flaky for intelligent selection"""
        if self.calculate_flakiness(test['history']) > self.threshold:
            return True  # Always run flaky tests to gather data
        return False
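A short sketch of how this filter might plug into the prioritizer's output, assuming the selection dict returned by TestPrioritizer.prioritize and test records carrying the 'name' and 'history' fields used above:

# Force-include flaky tests so their outcomes keep feeding the model,
# regardless of their predicted value score.
flakiness_filter = FlakinessFilter(flakiness_threshold=0.1)

selected = set(t['name'] for t in selection['selected_tests'])
for test in all_tests:  # all_tests: full catalog, as in the earlier examples
    if flakiness_filter.should_always_run(test):
        selected.add(test['name'])

print(f"Final set after flakiness filter: {len(selected)} tests")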
Metrics and Monitoring
import pandas as pd

class SelectionMetrics:
    def __init__(self):
        self.metrics = []

    def record_selection(self, commit, selected, skipped, outcomes):
        """Record selection effectiveness"""
        selected_failures = [t for t in selected if outcomes[t] == 'failed']
        skipped_failures = [t for t in skipped if outcomes[t] == 'failed']
        self.metrics.append({
            'commit': commit,
            'tests_selected': len(selected),
            'tests_skipped': len(skipped),
            # Proxy: fraction of tests skipped (by count, not wall-clock time)
            'time_saved_percent': len(skipped) / (len(selected) + len(skipped)),
            'caught_failures': len(selected_failures),
            'missed_failures': len(skipped_failures),  # False negatives
            'precision': len(selected_failures) / len(selected) if selected else 0,
            'recall': len(selected_failures) / (len(selected_failures) + len(skipped_failures)) if (selected_failures or skipped_failures) else 1.0
        })

    def get_dashboard(self):
        """Generate metrics dashboard"""
        df = pd.DataFrame(self.metrics)
        return {
            'avg_time_saved': df['time_saved_percent'].mean(),
            'avg_recall': df['recall'].mean(),  # What % of failures we catch
            'total_missed_failures': df['missed_failures'].sum(),
            'tests_per_commit': df['tests_selected'].mean()
        }
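Measuring recall requires ground truth for skipped tests, which in practice means running the full suite on a sampled subset of commits. A brief usage sketch under that assumption; the test names and outcomes are illustrative:

metrics = SelectionMetrics()

# On an audited commit the full suite ran, so outcomes exist for skipped tests too.
metrics.record_selection(
    commit='abc123',
    selected=['test_checkout', 'test_payment'],
    skipped=['test_reporting'],
    outcomes={'test_checkout': 'passed', 'test_payment': 'failed', 'test_reporting': 'passed'}
)

dashboard = metrics.get_dashboard()
print(f"Average recall: {dashboard['avg_recall']:.1%}")
print(f"Missed failures so far: {dashboard['total_missed_failures']}")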
Best Practices
| Practice | Description |
|---|---|
| Start Conservative | Begin with high recall (95%+), optimize for speed later |
| Monitor Missed Failures | Track false negatives, retrain if > 2% |
| Retrain Regularly | Update model weekly with new test outcomes |
| Always Run Critical Tests | Security and smoke tests run regardless of prediction |
| Feedback Loop | Record outcomes to improve predictions |
| Gradual Rollout | Validate on a subset of commits first |
| Explainability | Show why tests were selected or skipped |
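Several of these practices (a conservative probability threshold, an always-run set for critical tests, and explainable decisions) can be combined into a thin policy layer on top of the predictor. A minimal sketch, where the threshold value and the CRITICAL_TESTS set are assumptions for illustration rather than recommendations:

# Policy layer combining a conservative threshold with an always-run safety net.
CRITICAL_TESTS = {'test_login_smoke', 'test_payment_smoke'}  # hypothetical always-run set

def select_with_policy(predictor, current_commit, all_tests, probability_threshold=0.05):
    """Return (tests_to_run, reasons); a low threshold keeps recall high during rollout."""
    to_run = []
    reasons = {}
    for test in all_tests:
        if test['name'] in CRITICAL_TESTS:
            to_run.append(test)
            reasons[test['name']] = 'critical: always run'
            continue
        prediction = predictor.predict_failure_probability(current_commit, test)
        if prediction['failure_probability'] >= probability_threshold:
            to_run.append(test)
            reasons[test['name']] = f"predicted failure probability {prediction['failure_probability']:.2f}"
        else:
            reasons[test['name']] = 'skipped: below threshold'
    return to_run, reasons

Logging the reasons alongside test results provides the explainability called for above and makes missed-failure investigations much easier.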
Conclusion
Predictive test selection transforms CI/CD from “run everything and wait” to intelligent, fast feedback loops. By combining code analysis, ML prediction, and risk-based prioritization, teams reduce test execution time by 60-90% while catching 95%+ of failures.
The key is continuous learning: as the model observes outcomes, it improves predictions, creating a virtuous cycle of faster, smarter testing. Start conservative, monitor closely, and iterate toward optimal speed-quality balance.