TL;DR
- Use AWS Backup restore testing to automatically validate backups meet RTO/RPO targets—untested backups are not backups
- Automate DR testing with Terraform: spin up recovery infrastructure, validate functionality, tear down—pay only for test duration
- Test recovery procedures quarterly at minimum; document every step and have staff who didn’t write docs perform the restore
Best for: Teams with production workloads requiring documented recovery capabilities and compliance requirements
Skip if: You’re running stateless applications with no persistent data (just redeploy from IaC)
Read time: 15 minutes
An untested backup is an assumption, not a guarantee. Studies show that 34% of companies don’t test their backups, and of those that do, 77% have found failures during restore tests. Your DR plan is only as good as your last successful test.
For related infrastructure testing, see Terraform Testing Strategies and AWS Infrastructure Testing.
AI-Assisted Approaches
AI tools excel at generating DR test plans and analyzing recovery procedures for gaps.
Generating comprehensive DR test plans:
Create a disaster recovery test plan for a three-tier web application:
- Web tier: Auto Scaling group with ALB
- App tier: ECS Fargate services
- Data tier: RDS PostgreSQL Multi-AZ with Read Replicas
Include tests for:
1. Single AZ failure (verify automatic failover)
2. Complete region failure (verify cross-region DR)
3. Data corruption (verify point-in-time recovery)
4. Ransomware scenario (verify isolated backup recovery)
For each scenario, specify: trigger method, expected RTO/RPO, validation steps,
rollback procedure, and success criteria.
Creating Terraform DR infrastructure:
Write Terraform modules for AWS disaster recovery with:
1. Cross-region RDS replication with automated failover
2. S3 cross-region replication with versioning
3. Route 53 health checks with DNS failover
4. Lambda function triggered by CloudWatch alarm to initiate DR
Include variables for RTO/RPO targets and outputs for monitoring.
Show how to test the failover without affecting production.
Analyzing backup coverage gaps:
Review this AWS backup configuration for gaps:
Backup Plan:
- RDS: Daily snapshots, 7-day retention
- EBS: Weekly snapshots, 30-day retention
- S3: No backup (cross-region replication only)
- DynamoDB: On-demand backups before deployments
Application requirements:
- RPO: 1 hour for database, 4 hours for file storage
- RTO: 2 hours for complete recovery
- Compliance: SOC 2, requires 90-day retention
Identify gaps, compliance issues, and recommendations.
When to Use Different Testing Approaches
Testing Strategy Decision Framework
| Test Type | Frequency | What It Validates | Production Impact |
|---|---|---|---|
| Backup validation | Daily (automated) | Backups are restorable | None |
| Component failover | Monthly | Individual service recovery | Minimal (uses replicas) |
| Full DR drill | Quarterly | Complete recovery procedure | Scheduled maintenance window |
| Chaos engineering | Continuous | System resilience | Controlled blast radius |
| Tabletop exercise | Annually | Human procedures and communication | None |
RTO/RPO Strategy Matrix
| Strategy | RTO | RPO | Cost | Use Case |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | $ | Non-critical systems |
| Pilot Light | 10-30 min | Minutes | $$ | Core business systems |
| Warm Standby | Minutes | Near-zero | $$$ | Critical applications |
| Multi-Site Active | Near-zero | Near-zero | $$$$ | Mission-critical |
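The matrix above can be encoded as a simple lookup when triaging workloads. A minimal sketch — the tier names and rough thresholds come straight from the table; the exact cutoffs are illustrative:

```python
def choose_dr_strategy(rto_minutes: float, rpo_minutes: float) -> str:
    """Map RTO/RPO targets to a DR strategy tier (thresholds mirror the matrix above)."""
    if rto_minutes < 1 and rpo_minutes < 1:
        return "Multi-Site Active"   # near-zero RTO/RPO, highest cost
    if rto_minutes <= 10:
        return "Warm Standby"        # minutes RTO, near-zero RPO
    if rto_minutes <= 30:
        return "Pilot Light"         # 10-30 min RTO, minutes RPO
    return "Backup & Restore"        # hours RTO/RPO, cheapest

print(choose_dr_strategy(240, 240))  # → Backup & Restore
print(choose_dr_strategy(20, 5))     # → Pilot Light
```

Running a function like this over an inventory of workloads makes the cost conversation explicit: each tier step up is roughly an order of magnitude more spend.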
AWS Backup Restore Testing
Automated Backup Validation
# Create a restore testing plan
aws backup create-restore-testing-plan \
  --restore-testing-plan '{
    "RestoreTestingPlanName": "daily-validation",
    "ScheduleExpression": "cron(0 6 * * ? *)",
    "StartWindowHours": 1,
    "RecoveryPointSelection": {
      "Algorithm": "LATEST_WITHIN_WINDOW",
      "IncludeVaults": ["*"],
      "RecoveryPointTypes": ["SNAPSHOT"],
      "SelectionWindowDays": 1
    }
  }'

# Add RDS to the testing plan
aws backup create-restore-testing-selection \
  --restore-testing-plan-name "daily-validation" \
  --restore-testing-selection '{
    "RestoreTestingSelectionName": "rds-validation",
    "ProtectedResourceType": "RDS",
    "IamRoleArn": "arn:aws:iam::123456789012:role/BackupRestoreRole",
    "ProtectedResourceConditions": {
      "StringEquals": [
        {"Key": "aws:ResourceTag/Environment", "Value": "production"}
      ]
    }
  }'
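The same plan can be created from Python via boto3's `create_restore_testing_plan`. The sketch below builds the request payload as a pure function so it can be unit-tested before touching AWS; the plan name and schedule are carried over from the CLI example, and `create_plan` is an illustrative wrapper (pass it `boto3.client("backup")`):

```python
def build_restore_testing_plan(name: str, schedule: str, window_days: int = 1) -> dict:
    """Request payload for backup.create_restore_testing_plan."""
    return {
        "RestoreTestingPlanName": name,
        "ScheduleExpression": schedule,
        "StartWindowHours": 1,
        "RecoveryPointSelection": {
            "Algorithm": "LATEST_WITHIN_WINDOW",
            "IncludeVaults": ["*"],  # restrict to specific vault ARNs in production
            "RecoveryPointTypes": ["SNAPSHOT"],
            "SelectionWindowDays": window_days,
        },
    }

def create_plan(backup_client, name: str, schedule: str):
    # backup_client = boto3.client("backup")
    return backup_client.create_restore_testing_plan(
        RestoreTestingPlan=build_restore_testing_plan(name, schedule)
    )
```

Keeping the payload builder pure means the plan definition can be asserted in CI without AWS credentials.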
Terraform for AWS Backup
# modules/backup-testing/main.tf
resource "aws_backup_plan" "production" {
  name = "production-backup-plan"

  rule {
    rule_name         = "daily-backups"
    target_vault_name = aws_backup_vault.production.name
    schedule          = "cron(0 5 * * ? *)"

    lifecycle {
      delete_after = 90 # SOC 2 compliance
    }

    copy_action {
      destination_vault_arn = aws_backup_vault.dr_region.arn

      lifecycle {
        delete_after = 90
      }
    }
  }

  rule {
    rule_name         = "hourly-backups"
    target_vault_name = aws_backup_vault.production.name
    schedule          = "cron(0 * * * ? *)"

    lifecycle {
      delete_after = 7
    }
  }
}

resource "aws_backup_selection" "production_rds" {
  iam_role_arn = aws_iam_role.backup.arn
  name         = "production-rds"
  plan_id      = aws_backup_plan.production.id

  selection_tag {
    type  = "STRINGEQUALS"
    key   = "Backup"
    value = "true"
  }

  resources = [
    "arn:aws:rds:*:*:db:*",
    "arn:aws:rds:*:*:cluster:*"
  ]
}
# Restore testing plan
resource "aws_backup_restore_testing_plan" "validation" {
  name = "automated-restore-validation"

  recovery_point_selection {
    algorithm             = "LATEST_WITHIN_WINDOW"
    include_vaults        = ["*"]
    recovery_point_types  = ["SNAPSHOT"]
    selection_window_days = 1
  }

  schedule_expression = "cron(0 8 * * ? *)"
}
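Retention values like the `delete_after = 90` above are easy to regress during refactors. A quick sanity check can gate CI — the rule dicts here are a simplified stand-in for rules parsed out of the plan (e.g. from `terraform show -json`):

```python
def meets_retention_floor(rules: list, min_retention_days: int = 90) -> bool:
    """True if at least one backup rule retains recovery points long enough
    for the compliance floor (e.g. SOC 2's 90-day requirement)."""
    return any(rule["delete_after"] >= min_retention_days for rule in rules)

# Simplified stand-in for rules parsed from the backup plan
rules = [
    {"rule_name": "daily-backups", "delete_after": 90},
    {"rule_name": "hourly-backups", "delete_after": 7},  # short retention is fine if a longer rule exists
]
assert meets_retention_floor(rules)  # the 90-day daily rule satisfies the floor
```

Note the check requires only one rule to meet the floor: short-lived hourly backups serve RPO, while the long-retention daily rule serves compliance.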
Python Script for Backup Validation
import boto3
import time
from datetime import datetime, timedelta


class BackupValidator:
    def __init__(self, region='us-east-1'):
        self.backup = boto3.client('backup', region_name=region)
        self.rds = boto3.client('rds', region_name=region)

    def validate_rds_backup(self, db_identifier, max_age_hours=24):
        """Validate RDS backup exists and is restorable."""
        # Get the latest snapshot
        snapshots = self.rds.describe_db_snapshots(
            DBInstanceIdentifier=db_identifier,
            SnapshotType='automated'
        )['DBSnapshots']
        if not snapshots:
            return {'status': 'FAILED', 'reason': 'No snapshots found'}

        latest = max(snapshots, key=lambda x: x['SnapshotCreateTime'])
        age = datetime.now(latest['SnapshotCreateTime'].tzinfo) - latest['SnapshotCreateTime']
        if age > timedelta(hours=max_age_hours):
            return {
                'status': 'WARNING',
                'reason': f'Latest snapshot is {age.total_seconds()/3600:.1f} hours old'
            }

        # Verify the snapshot is available
        if latest['Status'] != 'available':
            return {'status': 'FAILED', 'reason': f"Snapshot status: {latest['Status']}"}

        return {
            'status': 'PASSED',
            'snapshot_id': latest['DBSnapshotIdentifier'],
            'snapshot_age_hours': age.total_seconds() / 3600
        }

    def test_restore(self, snapshot_id, test_db_identifier):
        """Perform an actual restore test, then clean up the test instance."""
        try:
            # Start the restore
            self.rds.restore_db_instance_from_db_snapshot(
                DBInstanceIdentifier=test_db_identifier,
                DBSnapshotIdentifier=snapshot_id,
                DBInstanceClass='db.t3.micro',  # Minimal for testing
                PubliclyAccessible=False,
                Tags=[{'Key': 'Purpose', 'Value': 'DR-Test'}]
            )
            # Wait for the restore to complete
            start_time = time.time()
            waiter = self.rds.get_waiter('db_instance_available')
            waiter.wait(DBInstanceIdentifier=test_db_identifier)
            restore_time = time.time() - start_time

            return {
                'status': 'PASSED',
                'restore_time_seconds': restore_time,
                'meets_rto': restore_time < 7200  # 2 hour RTO
            }
        finally:
            # Clean up the test instance even if the restore failed
            try:
                self.rds.delete_db_instance(
                    DBInstanceIdentifier=test_db_identifier,
                    SkipFinalSnapshot=True,
                    DeleteAutomatedBackups=True
                )
            except Exception:
                pass


# Usage in pytest
def test_production_db_backup():
    validator = BackupValidator()
    result = validator.validate_rds_backup('production-db')
    assert result['status'] in ['PASSED', 'WARNING'], f"Backup validation failed: {result}"
Azure Site Recovery Testing
Test Failover with Azure CLI
# Create a recovery plan
az site-recovery recovery-plan create \
  --resource-group rg-dr \
  --vault-name vault-dr \
  --name "webapp-recovery-plan" \
  --primary-zone "East US" \
  --recovery-zone "West US" \
  --failover-deployment-model "Resource"

# Run a test failover (non-disruptive)
az site-recovery recovery-plan test-failover \
  --resource-group rg-dr \
  --vault-name vault-dr \
  --name "webapp-recovery-plan" \
  --failover-direction "PrimaryToRecovery" \
  --network-id "/subscriptions/.../test-vnet"

# Validate the test environment
az vm list --resource-group rg-dr-test --output table

# Clean up the test failover
az site-recovery recovery-plan test-failover-cleanup \
  --resource-group rg-dr \
  --vault-name vault-dr \
  --name "webapp-recovery-plan"
Terraform for Azure DR
# modules/azure-dr/main.tf
resource "azurerm_recovery_services_vault" "dr" {
  name                = "vault-dr-${var.environment}"
  location            = var.dr_region
  resource_group_name = azurerm_resource_group.dr.name
  sku                 = "Standard"
  soft_delete_enabled = true
}

resource "azurerm_site_recovery_fabric" "primary" {
  name                = "fabric-primary"
  resource_group_name = azurerm_resource_group.dr.name
  recovery_vault_name = azurerm_recovery_services_vault.dr.name
  location            = var.primary_region
}

resource "azurerm_site_recovery_fabric" "secondary" {
  name                = "fabric-secondary"
  resource_group_name = azurerm_resource_group.dr.name
  recovery_vault_name = azurerm_recovery_services_vault.dr.name
  location            = var.dr_region
}

resource "azurerm_site_recovery_replication_policy" "policy" {
  name                                                 = "replication-policy"
  resource_group_name                                  = azurerm_resource_group.dr.name
  recovery_vault_name                                  = azurerm_recovery_services_vault.dr.name
  recovery_point_retention_in_minutes                  = 1440 # 24 hours
  application_consistent_snapshot_frequency_in_minutes = 60   # 1 hour RPO
}

# Backup policy for VMs
resource "azurerm_backup_policy_vm" "daily" {
  name                = "daily-backup-policy"
  resource_group_name = azurerm_resource_group.dr.name
  recovery_vault_name = azurerm_recovery_services_vault.dr.name

  backup {
    frequency = "Daily"
    time      = "23:00"
  }

  retention_daily {
    count = 30
  }

  retention_weekly {
    count    = 12
    weekdays = ["Sunday"]
  }

  retention_monthly {
    count    = 12
    weekdays = ["Sunday"]
    weeks    = ["First"]
  }
}
Automated DR Testing with Terraform
Ephemeral DR Test Environment
# dr-test/main.tf - Spin up, test, destroy
variable "run_dr_test" {
  description = "Set to true to run DR test"
  type        = bool
  default     = false
}

# Only create resources during a DR test
resource "aws_db_instance" "dr_test" {
  count = var.run_dr_test ? 1 : 0

  identifier     = "dr-test-${formatdate("YYYYMMDD", timestamp())}"
  instance_class = "db.t3.medium"

  # Restore from the latest production snapshot
  snapshot_identifier = data.aws_db_snapshot.latest.id

  vpc_security_group_ids = [aws_security_group.dr_test[0].id]
  db_subnet_group_name   = aws_db_subnet_group.dr_test[0].name

  skip_final_snapshot = true
  deletion_protection = false

  tags = {
    Purpose   = "DR-Test"
    AutoClean = "true"
  }
}

data "aws_db_snapshot" "latest" {
  db_instance_identifier = var.production_db_identifier
  most_recent            = true
}

# Test application connectivity
resource "null_resource" "validate_dr" {
  count      = var.run_dr_test ? 1 : 0
  depends_on = [aws_db_instance.dr_test]

  provisioner "local-exec" {
    command = <<-EOT
      START=$(date +%s)
      # Wait for the restored DB to be ready
      aws rds wait db-instance-available \
        --db-instance-identifier ${aws_db_instance.dr_test[0].identifier}
      # Run connectivity test
      python3 scripts/validate_dr.py \
        --endpoint ${aws_db_instance.dr_test[0].endpoint} \
        --expected-tables 50 \
        --expected-rows 1000000
      # Record RTO (elapsed time from provisioner start to validated restore)
      echo "DR Test completed. RTO: $(($(date +%s) - START)) seconds"
    EOT
  }
}

output "dr_test_results" {
  value = var.run_dr_test ? {
    db_endpoint  = aws_db_instance.dr_test[0].endpoint
    restore_time = timestamp()
    status       = "Test environment ready"
  } : null
}
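The Terraform above shells out to `scripts/validate_dr.py`, which the article doesn't show. A minimal sketch of what it might look like — the PostgreSQL target, the `psycopg2` usage, and the 1% row tolerance are all assumptions; the flags match those used in the provisioner:

```python
import argparse
import sys

def evaluate(table_count: int, row_count: int, expected_tables: int, expected_rows: int) -> bool:
    """Success criteria: all expected tables present, row count within 1% of production."""
    return table_count >= expected_tables and row_count >= expected_rows * 0.99

def count_objects(endpoint: str):
    # Assumption: a PostgreSQL target reachable with psycopg2 (driver not shown in the article)
    import psycopg2
    host, _, port = endpoint.partition(":")
    conn = psycopg2.connect(host=host, port=port or 5432, dbname="postgres")
    with conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public'")
        tables = cur.fetchone()[0]
        cur.execute("SELECT coalesce(sum(n_live_tup), 0) FROM pg_stat_user_tables")
        rows = int(cur.fetchone()[0])
    return tables, rows

def main(argv=None):
    p = argparse.ArgumentParser()
    p.add_argument("--endpoint", required=True)
    p.add_argument("--expected-tables", type=int, default=0)
    p.add_argument("--expected-rows", type=int, default=0)
    args = p.parse_args(argv)
    tables, rows = count_objects(args.endpoint)
    ok = evaluate(tables, rows, args.expected_tables, args.expected_rows)
    print(f"tables={tables} rows={rows} status={'PASSED' if ok else 'FAILED'}")
    return 0 if ok else 1

# Usage: python3 scripts/validate_dr.py --endpoint host:5432 --expected-tables 50 --expected-rows 1000000
```

Keeping `evaluate` separate from the database access means the pass/fail criteria themselves can be unit-tested without a live restore.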
GitHub Actions DR Test Workflow
name: Quarterly DR Test
on:
  schedule:
    - cron: '0 6 1 */3 *'  # First day of each quarter at 6 AM
  workflow_dispatch:
    inputs:
      test_type:
        description: 'Type of DR test'
        required: true
        default: 'backup-restore'
        type: choice
        options:
          - backup-restore
          - failover
          - full-dr
permissions:
  id-token: write  # required for OIDC role assumption
  contents: read
jobs:
  dr-test:
    runs-on: ubuntu-latest
    environment: dr-test
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.DR_TEST_ROLE_ARN }}
          aws-region: us-west-2  # DR region
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
      - name: Start DR Test
        id: start_test
        run: |
          START_TIME=$(date +%s)
          echo "start_time=$START_TIME" >> $GITHUB_OUTPUT
          echo "DR Test started at $(date)"
      - name: Deploy DR Environment
        run: |
          cd terraform/dr-test
          terraform init
          terraform apply -auto-approve -var="run_dr_test=true"
      - name: Validate Recovery
        id: validate
        run: |
          cd terraform/dr-test
          DB_ENDPOINT=$(terraform output -json dr_test_results | jq -r '.db_endpoint')
          # Test database connectivity
          python3 scripts/validate_dr.py \
            --endpoint "$DB_ENDPOINT" \
            --test-queries \
            --output-file dr-results.json
          # Calculate RTO
          END_TIME=$(date +%s)
          START_TIME=${{ steps.start_test.outputs.start_time }}
          RTO=$((END_TIME - START_TIME))
          echo "rto_seconds=$RTO" >> $GITHUB_OUTPUT
      - name: Cleanup DR Test
        if: always()
        run: |
          cd terraform/dr-test
          terraform destroy -auto-approve -var="run_dr_test=true"
      - name: Generate Report
        run: |
          cat << EOF > dr-report.md
          # DR Test Report - $(date +%Y-%m-%d)
          ## Results
          - **Test Type**: ${{ inputs.test_type || 'backup-restore' }}
          - **Actual RTO**: ${{ steps.validate.outputs.rto_seconds }} seconds
          - **Target RTO**: 7200 seconds (2 hours)
          - **Status**: $([[ ${{ steps.validate.outputs.rto_seconds }} -lt 7200 ]] && echo "PASSED" || echo "FAILED")
          ## Validation Details
          $(cat dr-results.json | jq -r '.summary')
          EOF
      - name: Upload Report
        uses: actions/upload-artifact@v4
        with:
          name: dr-test-report
          path: dr-report.md
      - name: Notify on Failure
        if: failure()
        run: |
          # Send an alert if the DR test failed
          aws sns publish \
            --topic-arn ${{ secrets.ALERTS_TOPIC_ARN }} \
            --message "DR Test FAILED. Review required."
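The shell heredoc in the report step gets awkward as checks accumulate; the same report can be rendered by a short Python step instead. A sketch — the `dr-results.json` shape is an assumption mirroring what `validate_dr.py` might emit:

```python
def render_report(test_type: str, rto_seconds: int, target_rto: int, summary: str) -> str:
    """Render the DR test report as markdown, mirroring the heredoc above."""
    status = "PASSED" if rto_seconds < target_rto else "FAILED"
    return "\n".join([
        "# DR Test Report",
        "## Results",
        f"- **Test Type**: {test_type}",
        f"- **Actual RTO**: {rto_seconds} seconds",
        f"- **Target RTO**: {target_rto} seconds",
        f"- **Status**: {status}",
        "## Validation Details",
        summary,
    ])

results = {"summary": "50 tables, 1.0M rows validated"}  # stand-in for dr-results.json
print(render_report("backup-restore", 5400, 7200, results["summary"]))
```

Moving the pass/fail logic into a function also makes it trivial to assert in CI that the report generator itself behaves correctly.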
Measuring Success
| Metric | Target | How to Track |
|---|---|---|
| Backup success rate | 100% | AWS Backup/Azure Backup dashboards |
| Restore test success rate | 100% | Automated test results |
| Actual RTO vs Target | Actual < Target | DR test measurements |
| Actual RPO vs Target | Actual < Target | Backup timestamp analysis |
| DR test frequency | Quarterly minimum | Calendar tracking |
| Documentation currency | Updated after each test | Last modified date |
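The "Actual RPO vs Target" row reduces to timestamp arithmetic: your worst-case data loss at any moment is the age of the newest restorable recovery point. A minimal sketch — in practice the backup timestamps would come from `describe_db_snapshots` or the AWS Backup API:

```python
from datetime import datetime, timedelta, timezone

def actual_rpo(backup_times: list, now: datetime) -> timedelta:
    """Worst-case data loss right now = time since the newest restorable backup."""
    return now - max(backup_times)

def meets_rpo(backup_times: list, now: datetime, target: timedelta) -> bool:
    return actual_rpo(backup_times, now) <= target

now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
backups = [now - timedelta(hours=5), now - timedelta(minutes=45)]
print(actual_rpo(backups, now))                     # → 0:45:00
print(meets_rpo(backups, now, timedelta(hours=1)))  # → True
```

Run this check continuously, not just during drills: a paused replication job or a failing backup rule silently inflates actual RPO long before the next scheduled test.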
Warning signs your DR testing isn’t working:
- Restore tests consistently skipped due to “time constraints”
- RTO measurements don’t include human coordination time
- Backups exist but nobody knows the restore procedure
- DR documentation references outdated infrastructure
- Cross-region replication lag not monitored
Conclusion
Effective backup and DR testing requires automation and regular validation:
- Automate backup validation with AWS Backup restore testing or similar tools
- Test restores regularly—quarterly full DR drills at minimum
- Measure RTO/RPO during actual tests, not estimates
- Document everything and have staff unfamiliar with procedures run tests
- Use Terraform to create ephemeral test environments cost-effectively
The key insight: untested backups are not backups. The only way to know your DR plan works is to execute it regularly. Automate what you can, but ensure human procedures are tested too.
See Also
- Terraform Testing Strategies - Infrastructure testing fundamentals
- AWS Infrastructure Testing - Broader AWS testing strategies
- Compliance Testing for IaC - Meeting DR compliance requirements
- Network Configuration Testing - Validating DR network connectivity
- Infrastructure Scalability Testing - Load testing for DR environments
