What Is Reliability Testing?

Reliability testing evaluates whether a system performs its intended function consistently over a specified period under defined conditions. A system is not truly reliable just because it passes functional tests — it must continue working correctly over time, under sustained usage, and through varying conditions.

Consider an online banking application. It may pass every functional test during a 30-minute test session. But what happens when thousands of users interact with it continuously for 72 hours? Reliability testing answers that question.

Key Reliability Metrics

Two metrics are foundational in reliability testing:

MTBF (Mean Time Between Failures) measures the average time a system operates before a failure occurs. For example, if a server experiences 3 failures over 300 hours of operation, the MTBF is 100 hours. Higher MTBF indicates greater reliability.

MTTR (Mean Time To Repair/Recover) measures the average time required to restore a system to operational status after a failure. If those 3 failures took 2, 4, and 3 hours to fix respectively, the MTTR is 3 hours. Lower MTTR indicates better recoverability.

Availability combines both metrics: Availability = MTBF / (MTBF + MTTR). Using the example above: 100 / (100 + 3) ≈ 0.971, or 97.1%. This is the percentage of time the system is operational.

| Metric | Formula | Goal | Example |
| --- | --- | --- | --- |
| MTBF | Total uptime / Number of failures | Maximize | 100 hours |
| MTTR | Total repair time / Number of failures | Minimize | 3 hours |
| Availability | MTBF / (MTBF + MTTR) | Maximize | 97.1% |
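
These formulas are simple enough to compute directly. A minimal Python sketch of the worked example above (the function names are my own, chosen for illustration):

```python
def mtbf(total_uptime_hours, n_failures):
    """Mean Time Between Failures: average uptime between failures."""
    return total_uptime_hours / n_failures

def mttr(repair_hours):
    """Mean Time To Repair: average time to restore service."""
    return sum(repair_hours) / len(repair_hours)

def availability(mtbf_hours, mttr_hours):
    """Fraction of time the system is operational."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# The server example: 3 failures over 300 hours, repairs of 2, 4, 3 hours.
print(mtbf(300, 3))          # 100.0 hours
print(mttr([2, 4, 3]))       # 3.0 hours
print(availability(100, 3))  # ≈ 0.971
```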

Types of Reliability Tests

Endurance testing runs the system under expected load for an extended period (24-72 hours) to detect memory leaks, resource exhaustion, and degradation patterns.
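
A crude sketch of the leak-detection idea in Python, using the standard library's tracemalloc to flag heap growth after repeated operations (real endurance tests run for hours against a live system and also watch RSS, file handles, and response times; the function name and threshold here are illustrative):

```python
import gc
import tracemalloc

def detect_leak(operation, iterations=1000, threshold_bytes=1_000_000):
    """Run `operation` repeatedly and report Python-heap growth.

    Returns (leak_suspected, growth_in_bytes). Growth above the
    threshold that survives garbage collection suggests objects are
    being retained across iterations, i.e. a leak pattern.
    """
    gc.collect()
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        operation()
    gc.collect()
    current, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    growth = current - baseline
    return growth > threshold_bytes, growth
```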

Stress-reliability testing combines stress testing with reliability measurement — how does failure rate change when the system operates at 80-90% capacity over time?

Operational profile testing simulates the actual distribution of user operations. If 60% of users browse, 30% add to cart, and 10% check out, the test mirrors these proportions.
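
Generating a load mix that mirrors such a profile is straightforward with weighted sampling. A small Python sketch, using the 60/30/10 split from the example above (operation names are illustrative):

```python
import random

# Operational profile from the example: 60% browse, 30% add-to-cart,
# 10% checkout.
OPERATIONS = ["browse", "add_to_cart", "checkout"]
WEIGHTS = [0.60, 0.30, 0.10]

def generate_workload(n_requests, seed=None):
    """Produce a request sequence matching the operational profile.

    A seed makes the workload reproducible across test runs.
    """
    rng = random.Random(seed)
    return rng.choices(OPERATIONS, weights=WEIGHTS, k=n_requests)
```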

What Is Recovery Testing?

Recovery testing verifies that a system can restore itself — or be restored — to a functional state after experiencing a failure. While reliability testing focuses on preventing and measuring failures, recovery testing focuses on what happens after a failure occurs.

Every system will eventually fail. The question is not whether, but how gracefully.

Common Failure Scenarios

Application crash. The server process terminates unexpectedly. Can the application restart automatically? Are in-flight transactions preserved or properly rolled back?

Power outage. The server loses power. When power returns, does the system come back up? Is data intact? Does it resume from where it left off?

Network drop. Connectivity between services is interrupted. Do services retry? Do they degrade gracefully? What happens to data in transit?

Database corruption. A disk sector fails or a write operation is interrupted. Can the database recover from its transaction log? Is data consistent?

Hardware failure. A physical component (disk, memory, NIC) fails. Does the system detect it and respond appropriately?

```mermaid
graph TD
    F[Failure Occurs] --> D{Detected?}
    D -->|Auto-detected| R[Recovery Initiated]
    D -->|Not detected| M[Manual Intervention]
    R --> A{Auto-recovery?}
    A -->|Yes| AR[Automatic Restart/Failover]
    A -->|No| MR[Manual Recovery Procedure]
    AR --> V[Verify Data Integrity]
    MR --> V
    M --> MR
    V --> S{System Functional?}
    S -->|Yes| O[Back to Operation]
    S -->|No| E[Escalation / DR Plan]
```

Failover Testing

Failover testing verifies that when a primary component fails, traffic or processing automatically switches to a standby (redundant) component. This is critical for high-availability systems.

Active-passive failover: A standby server waits idle until the primary fails. The test verifies that the standby takes over within the defined Recovery Time Objective (RTO).

Active-active failover: Multiple servers share the load. When one fails, the others absorb its traffic. The test verifies no requests are lost and performance remains acceptable.

What to measure during failover testing:

  • Failover time — How long until the standby takes over?
  • Data loss — Were any transactions lost during the switch?
  • User impact — Did users experience errors, or was the transition transparent?
  • Failback — When the primary recovers, does traffic return to it correctly?
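
Failover time can be derived from health-check logs. A small Python sketch, assuming the monitoring probe emits (timestamp_seconds, healthy) pairs (the function is illustrative, not from any monitoring tool):

```python
def failover_time(samples):
    """Compute failover duration from (timestamp, healthy) samples.

    Failover time is the gap between the first failed health check and
    the next healthy one (i.e. when the standby started answering).
    Returns None if no failure, or no recovery, was observed.
    """
    fail_start = None
    for ts, healthy in samples:
        if not healthy and fail_start is None:
            fail_start = ts
        elif healthy and fail_start is not None:
            return ts - fail_start
    return None
```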

Backup and Restore Testing

Having backups is not enough — you must test them. An untested backup is not a backup; it is a hope.

Backup and restore testing verifies:

  • Backups complete successfully within the backup window
  • Backup data is not corrupted (integrity checks)
  • Restoration works on the target environment
  • Restoration completes within the Recovery Time Objective (RTO)
  • Restored data is complete and matches the Recovery Point Objective (RPO)

| Term | Definition | Example |
| --- | --- | --- |
| RTO (Recovery Time Objective) | Maximum acceptable downtime | 4 hours |
| RPO (Recovery Point Objective) | Maximum acceptable data loss (time) | 1 hour |

If your RPO is 1 hour, you must back up at least every hour. If your RTO is 4 hours, you must be able to restore from backup within 4 hours.
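
That rule reduces to two one-line checks; a Python sketch (the function and parameter names are mine):

```python
def meets_rpo(backup_interval_minutes, rpo_minutes):
    """Data written just after one backup survives only in the next,
    so the worst-case data loss equals the backup interval."""
    return backup_interval_minutes <= rpo_minutes

def meets_rto(measured_restore_minutes, rto_minutes):
    """The restore drill's measured wall-clock time must fit the RTO."""
    return measured_restore_minutes <= rto_minutes
```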

Disaster Recovery Testing

Disaster Recovery (DR) testing goes beyond individual component failures to simulate catastrophic scenarios: entire data center outages, regional failures, or simultaneous multi-component failures.

DR tests are typically conducted quarterly or annually and involve:

  • Activating the DR plan with defined roles and procedures
  • Switching operations to the DR site
  • Running critical operations from the DR site
  • Measuring RTO and RPO compliance
  • Switching back to the primary site

Types of DR tests (from least to most disruptive):

  1. Tabletop exercise — Walk through the DR plan verbally without taking action
  2. Simulation — Simulate a disaster scenario without actually shutting down systems
  3. Parallel test — Bring up the DR site while the primary continues running
  4. Full interruption test — Actually shut down the primary site and operate from DR

Exercise: Design Reliability Test Scenarios

You are the QA lead for a critical healthcare appointment scheduling system. The system has these characteristics:

  • Web application serving 50,000 daily users
  • PostgreSQL database with patient records
  • Integration with external insurance verification APIs
  • Two-server active-passive failover configuration
  • Target availability: 99.9% (8.76 hours maximum downtime per year)
  • RTO: 30 minutes, RPO: 5 minutes

Design test scenarios covering:

Part 1: Reliability metrics. Based on the 99.9% availability target, calculate the maximum acceptable MTTR if the system experiences one failure per month (MTBF = 720 hours). Does this meet the requirements?

Part 2: Recovery scenarios. Write 5 recovery test scenarios for the following failure types:

  • Application server crash
  • Database connection pool exhaustion
  • External insurance API becoming unreachable
  • Primary server hardware failure (failover test)
  • Corrupt database backup

For each scenario, specify: the trigger, expected behavior, acceptable recovery time, and data integrity verification steps.

Part 3: DR test plan. Design a disaster recovery test plan that verifies the system can meet its RTO and RPO targets. Include the test type, prerequisites, execution steps, and success criteria.

Hint

For Part 1, use the availability formula: Availability = MTBF / (MTBF + MTTR). Solve for MTTR with Availability = 0.999 and MTBF = 720. Then compare the calculated MTTR against the RTO of 30 minutes.

For Part 2, think about what the system should do automatically versus what requires manual intervention. Healthcare systems have strict data integrity requirements — no patient data should be lost.

Solution

Part 1: Reliability Metrics

Given: Availability = 99.9%, MTBF = 720 hours

0.999 = 720 / (720 + MTTR)
720 + MTTR = 720 / 0.999
720 + MTTR ≈ 720.72
MTTR ≈ 0.72 hours ≈ 43.2 minutes

The calculated MTTR (43.2 minutes) exceeds the RTO (30 minutes). This means either the availability target or the RTO needs to be adjusted, or the system needs better auto-recovery mechanisms to reduce MTTR below 30 minutes.
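
The same rearrangement, MTTR = MTBF × (1 − A) / A, can be checked numerically in Python:

```python
def max_mttr_hours(availability_target, mtbf_hours):
    """Largest MTTR that still meets the availability target.

    Rearranged from Availability = MTBF / (MTBF + MTTR).
    """
    return mtbf_hours * (1 - availability_target) / availability_target

# Part 1 numbers: 99.9% availability target, MTBF = 720 hours.
print(max_mttr_hours(0.999, 720) * 60)  # ≈ 43.2 minutes, exceeding the 30-minute RTO
```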

Part 2: Recovery Scenarios

Scenario 1: Application server crash

  • Trigger: Kill the application process using kill -9
  • Expected behavior: Process monitor detects the crash within 30 seconds and restarts the application automatically. In-flight requests receive a retryable error. No data loss.
  • Acceptable recovery time: Under 2 minutes
  • Verification: Confirm all database transactions were either committed or rolled back. Verify no orphaned appointment records. Check that queued requests are reprocessed.

Scenario 2: Database connection pool exhaustion

  • Trigger: Simulate 500 concurrent connections exceeding the pool limit of 100
  • Expected behavior: Application returns “service temporarily unavailable” for new requests. Existing connections complete normally. Pool recovers as connections are released.
  • Acceptable recovery time: Under 5 minutes after load reduces
  • Verification: Query the database for incomplete transactions. Verify connection pool metrics return to normal levels.

Scenario 3: External insurance API unreachable

  • Trigger: Block network access to the insurance API endpoint using firewall rules
  • Expected behavior: Application uses circuit breaker pattern — after 3 failed attempts, stops calling the API and shows a user-friendly message. Appointments can still be scheduled without insurance verification (queued for later verification).
  • Acceptable recovery time: Circuit breaker opens within 30 seconds. Full recovery within 1 minute after API returns.
  • Verification: Verify queued verifications are processed when API returns. Confirm no appointments were lost.
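
The circuit-breaker behavior in this scenario can be sketched as a small class (a minimal illustration, not a production implementation; the thresholds mirror the scenario: open after 3 consecutive failures, allow a trial call after the reset timeout):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch for the insurance-API scenario.

    Opens after `max_failures` consecutive errors; calls then fail
    fast until `reset_timeout` seconds pass, when one trial call is
    allowed (the "half-open" state).
    """

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable for testing
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: calls suspended")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # success closes the circuit
        return result
```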

Scenario 4: Primary server hardware failure (failover)

  • Trigger: Shut down the primary server
  • Expected behavior: Active-passive failover activates. Standby server takes over within the RTO. DNS or load balancer redirects traffic automatically.
  • Acceptable recovery time: Under 5 minutes
  • Verification: Confirm the standby has the latest data (within RPO of 5 minutes). Test complete user workflows on the standby. Verify session continuity if using shared session storage.

Scenario 5: Corrupt database backup

  • Trigger: Intentionally corrupt a backup file and attempt restoration
  • Expected behavior: Restoration process detects corruption via checksum verification. System falls back to the previous known-good backup. Alert is generated for the operations team.
  • Acceptable recovery time: Detection within 5 minutes. Fallback to previous backup adds time — total under 30 minutes.
  • Verification: Compare record counts and checksums between restored data and the source. Run application smoke tests against the restored database.

Part 3: DR Test Plan

Test type: Parallel test (DR site activated alongside primary)

Prerequisites:

  • DR site configured with identical infrastructure
  • Database replication lag under 5 minutes (matching RPO)
  • Current runbook available to all team participants
  • Monitoring dashboards configured for DR site

Execution steps:

  1. Announce DR test to all stakeholders (scheduled maintenance window)
  2. Verify DR site database is synchronized (check replication lag)
  3. Redirect 10% of traffic to DR site using load balancer
  4. Monitor DR site performance for 30 minutes
  5. Simulate primary site failure by redirecting all traffic to DR site
  6. Measure time from redirect to full DR site operational status
  7. Run critical workflow tests: schedule appointment, verify insurance, cancel appointment
  8. Operate from DR site for 2 hours
  9. Redirect traffic back to primary site
  10. Verify data consistency between primary and DR databases

Success criteria:

  • Full switchover completes within 30 minutes (RTO)
  • Data loss is under 5 minutes of transactions (RPO)
  • All critical workflows execute successfully on DR site
  • Performance degradation is under 20% compared to primary
  • Switchback to primary completes without data loss

Reliability Testing in CI/CD

Modern teams integrate reliability testing into their delivery pipeline:

Chaos engineering (Netflix’s Chaos Monkey approach) intentionally injects failures in production or pre-production environments to test recovery mechanisms continuously. Tools like Chaos Monkey, Gremlin, and Litmus help automate failure injection.

Canary deployments serve new code to a small percentage of users, monitoring for reliability degradation before full rollout.

Health checks and readiness probes (Kubernetes liveness/readiness probes) provide continuous reliability monitoring and automatic recovery at the infrastructure level.

Key Takeaways

  • Reliability testing measures how consistently a system performs over time using MTBF, MTTR, and availability metrics
  • Recovery testing verifies the system can return to a functional state after crashes, power outages, network failures, and hardware issues
  • Failover testing ensures standby systems take over automatically within acceptable timeframes
  • Backup/restore testing must be performed regularly — an untested backup is not a backup
  • Disaster recovery testing ranges from tabletop exercises to full interruption tests
  • RTO defines maximum acceptable downtime; RPO defines maximum acceptable data loss
  • Chaos engineering brings reliability testing into the CI/CD pipeline as a continuous practice