What Is Reliability Testing?

Reliability testing evaluates whether a system performs its intended function consistently over a specified period under defined conditions. A system is not truly reliable just because it passes functional tests — it must continue working correctly over time, under sustained usage, and through varying conditions.

Consider an online banking application. It may pass every functional test during a 30-minute test session. But what happens when thousands of users interact with it continuously for 72 hours? Reliability testing answers that question.

Key Reliability Metrics

Two metrics are foundational in reliability testing:

MTBF (Mean Time Between Failures) measures the average time a system operates before a failure occurs. For example, if a server experiences 3 failures over 300 hours of operation, the MTBF is 100 hours. Higher MTBF indicates greater reliability.

MTTR (Mean Time To Repair/Recover) measures the average time required to restore a system to operational status after a failure. If those 3 failures took 2, 4, and 3 hours to fix respectively, the MTTR is 3 hours. Lower MTTR indicates better recoverability.

Availability combines both metrics: Availability = MTBF / (MTBF + MTTR). Using the example above: 100 / (100 + 3) ≈ 0.971, or 97.1%. This is the percentage of time the system is operational.

| Metric | Formula | Goal | Example |
| --- | --- | --- | --- |
| MTBF | Total uptime / Number of failures | Maximize | 100 hours |
| MTTR | Total repair time / Number of failures | Minimize | 3 hours |
| Availability | MTBF / (MTBF + MTTR) | Maximize | 97.1% |
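
These formulas are simple enough to compute directly. A minimal Python sketch of the worked example above (the function names are my own, chosen for illustration):

```python
def mtbf(total_uptime_hours, n_failures):
    """Mean Time Between Failures: average uptime between failures."""
    return total_uptime_hours / n_failures

def mttr(repair_hours):
    """Mean Time To Repair: average time to restore service."""
    return sum(repair_hours) / len(repair_hours)

def availability(mtbf_hours, mttr_hours):
    """Fraction of time the system is operational."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# The server example: 3 failures over 300 hours, repairs of 2, 4, 3 hours.
print(mtbf(300, 3))          # 100.0 hours
print(mttr([2, 4, 3]))       # 3.0 hours
print(availability(100, 3))  # ≈ 0.971
```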

Types of Reliability Tests

Endurance testing runs the system under expected load for an extended period (24-72 hours) to detect memory leaks, resource exhaustion, and degradation patterns.
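
A crude sketch of the leak-detection idea in Python, using the standard library's tracemalloc to flag heap growth after repeated operations (real endurance tests run for hours against a live system and also watch RSS, file handles, and response times; the function name and threshold here are illustrative):

```python
import gc
import tracemalloc

def detect_leak(operation, iterations=1000, threshold_bytes=1_000_000):
    """Run `operation` repeatedly and report Python-heap growth.

    Returns (leak_suspected, growth_in_bytes). Growth above the
    threshold that survives garbage collection suggests objects are
    being retained across iterations, i.e. a leak pattern.
    """
    gc.collect()
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        operation()
    gc.collect()
    current, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    growth = current - baseline
    return growth > threshold_bytes, growth
```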

Stress-reliability testing combines stress testing with reliability measurement — how does failure rate change when the system operates at 80-90% capacity over time?

Operational profile testing simulates the actual distribution of user operations. If 60% of users browse, 30% add to cart, and 10% check out, the test mirrors these proportions.
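
Generating a load mix that mirrors such a profile is straightforward with weighted sampling. A small Python sketch, using the 60/30/10 split from the example above (operation names are illustrative):

```python
import random

# Operational profile from the example: 60% browse, 30% add-to-cart,
# 10% checkout.
OPERATIONS = ["browse", "add_to_cart", "checkout"]
WEIGHTS = [0.60, 0.30, 0.10]

def generate_workload(n_requests, seed=None):
    """Produce a request sequence matching the operational profile.

    A seed makes the workload reproducible across test runs.
    """
    rng = random.Random(seed)
    return rng.choices(OPERATIONS, weights=WEIGHTS, k=n_requests)
```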

What Is Recovery Testing?

Recovery testing verifies that a system can restore itself — or be restored — to a functional state after experiencing a failure. While reliability testing focuses on preventing and measuring failures, recovery testing focuses on what happens after a failure occurs.

Every system will eventually fail. The question is not whether, but how gracefully.

Common Failure Scenarios

Application crash. The server process terminates unexpectedly. Can the application restart automatically? Are in-flight transactions preserved or properly rolled back?

Power outage. The server loses power. When power returns, does the system come back up? Is data intact? Does it resume from where it left off?

Network drop. Connectivity between services is interrupted. Do services retry? Do they degrade gracefully? What happens to data in transit?

Database corruption. A disk sector fails or a write operation is interrupted. Can the database recover from its transaction log? Is data consistent?

Hardware failure. A physical component (disk, memory, NIC) fails. Does the system detect it and respond appropriately?

```mermaid
graph TD
    F[Failure Occurs] --> D{Detected?}
    D -->|Auto-detected| R[Recovery Initiated]
    D -->|Not detected| M[Manual Intervention]
    R --> A{Auto-recovery?}
    A -->|Yes| AR[Automatic Restart/Failover]
    A -->|No| MR[Manual Recovery Procedure]
    AR --> V[Verify Data Integrity]
    MR --> V
    M --> MR
    V --> S{System Functional?}
    S -->|Yes| O[Back to Operation]
    S -->|No| E[Escalation / DR Plan]
```

Failover Testing

Failover testing verifies that when a primary component fails, traffic or processing automatically switches to a standby (redundant) component. This is critical for high-availability systems.

Active-passive failover: A standby server waits idle until the primary fails. The test verifies that the standby takes over within the defined Recovery Time Objective (RTO).

Active-active failover: Multiple servers share the load. When one fails, the others absorb its traffic. The test verifies no requests are lost and performance remains acceptable.

What to measure during failover testing:

  • Failover time — How long until the standby takes over?
  • Data loss — Were any transactions lost during the switch?
  • User impact — Did users experience errors, or was the transition transparent?
  • Failback — When the primary recovers, does traffic return to it correctly?
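
Failover time can be derived from health-check logs. A small Python sketch, assuming the monitoring probe emits (timestamp_seconds, healthy) pairs (the function is illustrative, not from any monitoring tool):

```python
def failover_time(samples):
    """Compute failover duration from (timestamp, healthy) samples.

    Failover time is the gap between the first failed health check and
    the next healthy one (i.e. when the standby started answering).
    Returns None if no failure, or no recovery, was observed.
    """
    fail_start = None
    for ts, healthy in samples:
        if not healthy and fail_start is None:
            fail_start = ts
        elif healthy and fail_start is not None:
            return ts - fail_start
    return None
```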

Backup and Restore Testing

Having backups is not enough — you must test them. An untested backup is not a backup; it is a hope.

Backup and restore testing verifies:

  • Backups complete successfully within the backup window
  • Backup data is not corrupted (integrity checks)
  • Restoration works on the target environment
  • Restoration completes within the Recovery Time Objective (RTO)
  • Restored data is complete and matches the Recovery Point Objective (RPO)

| Term | Definition | Example |
| --- | --- | --- |
| RTO (Recovery Time Objective) | Maximum acceptable downtime | 4 hours |
| RPO (Recovery Point Objective) | Maximum acceptable data loss (time) | 1 hour |

If your RPO is 1 hour, you must back up at least every hour. If your RTO is 4 hours, you must be able to restore from backup within 4 hours.
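
That rule reduces to two one-line checks; a Python sketch (the function and parameter names are mine):

```python
def meets_rpo(backup_interval_minutes, rpo_minutes):
    """Data written just after one backup survives only in the next,
    so the worst-case data loss equals the backup interval."""
    return backup_interval_minutes <= rpo_minutes

def meets_rto(measured_restore_minutes, rto_minutes):
    """The restore drill's measured wall-clock time must fit the RTO."""
    return measured_restore_minutes <= rto_minutes
```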

Disaster Recovery Testing

Disaster Recovery (DR) testing goes beyond individual component failures to simulate catastrophic scenarios: entire data center outages, regional failures, or simultaneous multi-component failures.

DR tests are typically conducted quarterly or annually and involve:

  • Activating the DR plan with defined roles and procedures
  • Switching operations to the DR site
  • Running critical operations from the DR site
  • Measuring RTO and RPO compliance
  • Switching back to the primary site

Types of DR tests (from least to most disruptive):

  1. Tabletop exercise — Walk through the DR plan verbally without taking action
  2. Simulation — Simulate a disaster scenario without actually shutting down systems
  3. Parallel test — Bring up the DR site while the primary continues running
  4. Full interruption test — Actually shut down the primary site and operate from DR

Exercise: Design Reliability Test Scenarios

You are the QA lead for a critical healthcare appointment scheduling system. The system has these characteristics:

  • Web application serving 50,000 daily users
  • PostgreSQL database with patient records
  • Integration with external insurance verification APIs
  • Two-server active-passive failover configuration
  • Target availability: 99.9% (8.76 hours maximum downtime per year)
  • RTO: 30 minutes, RPO: 5 minutes

Design test scenarios covering:

Part 1: Reliability metrics. Based on the 99.9% availability target, calculate the maximum acceptable MTTR if the system experiences one failure per month (MTBF = 720 hours). Does this meet the requirements?

Part 2: Recovery scenarios. Write 5 recovery test scenarios for the following failure types:

  • Application server crash
  • Database connection pool exhaustion
  • External insurance API becoming unreachable
  • Primary server hardware failure (failover test)
  • Corrupt database backup

For each scenario, specify: the trigger, expected behavior, acceptable recovery time, and data integrity verification steps.

Part 3: DR test plan. Design a disaster recovery test plan that verifies the system can meet its RTO and RPO targets. Include the test type, prerequisites, execution steps, and success criteria.

Hint

For Part 1, use the availability formula: Availability = MTBF / (MTBF + MTTR). Solve for MTTR with Availability = 0.999 and MTBF = 720. Then compare the calculated MTTR against the RTO of 30 minutes.

For Part 2, think about what the system should do automatically versus what requires manual intervention. Healthcare systems have strict data integrity requirements — no patient data should be lost.

Solution

Part 1: Reliability Metrics

Given: Availability = 99.9%, MTBF = 720 hours

0.999 = 720 / (720 + MTTR)
720 + MTTR = 720 / 0.999
720 + MTTR ≈ 720.72
MTTR ≈ 0.72 hours ≈ 43.2 minutes

The calculated MTTR (43.2 minutes) exceeds the RTO (30 minutes). This means either the availability target or the RTO needs to be adjusted, or the system needs better auto-recovery mechanisms to reduce MTTR below 30 minutes.
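
The same rearrangement, MTTR = MTBF × (1 − A) / A, can be checked numerically in Python:

```python
def max_mttr_hours(availability_target, mtbf_hours):
    """Largest MTTR that still meets the availability target.

    Rearranged from Availability = MTBF / (MTBF + MTTR).
    """
    return mtbf_hours * (1 - availability_target) / availability_target

# Part 1 numbers: 99.9% availability target, MTBF = 720 hours.
print(max_mttr_hours(0.999, 720) * 60)  # ≈ 43.2 minutes, exceeding the 30-minute RTO
```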

Part 2: Recovery Scenarios

Scenario 1: Application server crash

  • Trigger: Kill the application process using kill -9
  • Expected behavior: Process monitor detects the crash within 30 seconds and restarts the application automatically. In-flight requests receive a retryable error. No data loss.
  • Acceptable recovery time: Under 2 minutes
  • Verification: Confirm all database transactions were either committed or rolled back. Verify no orphaned appointment records. Check that queued requests are reprocessed.

Scenario 2: Database connection pool exhaustion

  • Trigger: Simulate 500 concurrent connections exceeding the pool limit of 100
  • Expected behavior: Application returns “service temporarily unavailable” for new requests. Existing connections complete normally. Pool recovers as connections are released.
  • Acceptable recovery time: Under 5 minutes after load reduces
  • Verification: Query the database for incomplete transactions. Verify connection pool metrics return to normal levels.

Scenario 3: External insurance API unreachable

  • Trigger: Block network access to the insurance API endpoint using firewall rules
  • Expected behavior: Application uses circuit breaker pattern — after 3 failed attempts, stops calling the API and shows a user-friendly message. Appointments can still be scheduled without insurance verification (queued for later verification).
  • Acceptable recovery time: Circuit breaker opens within 30 seconds. Full recovery within 1 minute after API returns.
  • Verification: Verify queued verifications are processed when API returns. Confirm no appointments were lost.
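
The circuit-breaker behavior in this scenario can be sketched as a small class (a minimal illustration, not a production implementation; the thresholds mirror the scenario: open after 3 consecutive failures, allow a trial call after the reset timeout):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch for the insurance-API scenario.

    Opens after `max_failures` consecutive errors; calls then fail
    fast until `reset_timeout` seconds pass, when one trial call is
    allowed (the "half-open" state).
    """

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable for testing
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: calls suspended")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # success closes the circuit
        return result
```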

Scenario 4: Primary server hardware failure (failover)

  • Trigger: Shut down the primary server
  • Expected behavior: Active-passive failover activates. Standby server takes over within the RTO. DNS or load balancer redirects traffic automatically.
  • Acceptable recovery time: Under 5 minutes
  • Verification: Confirm the standby has the latest data (within RPO of 5 minutes). Test complete user workflows on the standby. Verify session continuity if using shared session storage.

Scenario 5: Corrupt database backup

  • Trigger: Intentionally corrupt a backup file and attempt restoration
  • Expected behavior: Restoration process detects corruption via checksum verification. System falls back to the previous known-good backup. Alert is generated for the operations team.
  • Acceptable recovery time: Detection within 5 minutes. Fallback to previous backup adds time — total under 30 minutes.
  • Verification: Compare record counts and checksums between restored data and the source. Run application smoke tests against the restored database.

Part 3: DR Test Plan

Test type: Parallel test (DR site activated alongside primary)

Prerequisites:

  • DR site configured with identical infrastructure
  • Database replication lag under 5 minutes (matching RPO)
  • Current runbook available to all team participants
  • Monitoring dashboards configured for DR site

Execution steps:

  1. Announce DR test to all stakeholders (scheduled maintenance window)
  2. Verify DR site database is synchronized (check replication lag)
  3. Redirect 10% of traffic to DR site using load balancer
  4. Monitor DR site performance for 30 minutes
  5. Simulate primary site failure by redirecting all traffic to DR site
  6. Measure time from redirect to full DR site operational status
  7. Run critical workflow tests: schedule appointment, verify insurance, cancel appointment
  8. Operate from DR site for 2 hours
  9. Redirect traffic back to primary site
  10. Verify data consistency between primary and DR databases

Success criteria:

  • Full switchover completes within 30 minutes (RTO)
  • Data loss is under 5 minutes of transactions (RPO)
  • All critical workflows execute successfully on DR site
  • Performance degradation is under 20% compared to primary
  • Switchback to primary completes without data loss

Reliability Testing in CI/CD

Modern teams integrate reliability testing into their delivery pipeline:

Chaos engineering (Netflix’s Chaos Monkey approach) intentionally injects failures in production or pre-production environments to test recovery mechanisms continuously. Tools like Chaos Monkey, Gremlin, and Litmus help automate failure injection.

Canary deployments serve new code to a small percentage of users, monitoring for reliability degradation before full rollout.

Health checks and readiness probes (Kubernetes liveness/readiness probes) provide continuous reliability monitoring and automatic recovery at the infrastructure level.

Key Takeaways

  • Reliability testing measures how consistently a system performs over time using MTBF, MTTR, and availability metrics
  • Recovery testing verifies the system can return to a functional state after crashes, power outages, network failures, and hardware issues
  • Failover testing ensures standby systems take over automatically within acceptable timeframes
  • Backup/restore testing must be performed regularly — an untested backup is not a backup
  • Disaster recovery testing ranges from tabletop exercises to full interruption tests
  • RTO defines maximum acceptable downtime; RPO defines maximum acceptable data loss
  • Chaos engineering brings reliability testing into the CI/CD pipeline as a continuous practice