Reliability, Availability, and Durability

Reliability, availability, and durability are three distinct properties of a system that are often confused. Understanding their precise definitions, how they compose across service dependencies, and how to measure them with SLAs is essential for designing systems that meet business requirements.

Overview

Reliability, availability, and durability are the three pillars of system dependability, and confusing them leads to incorrect SLA commitments and architectural mistakes. Availability measures whether the system is currently operational and able to serve requests, expressed as a percentage of uptime over a period. The 'nines' notation (99.9%, 99.99%, 99.999%) is the standard industry shorthand, where each additional nine represents a 10x reduction in allowed downtime: 99.9% (three nines) allows 8.76 hours of downtime per year, 99.99% (four nines) allows 52.6 minutes, and 99.999% (five nines) allows just 5.26 minutes.
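
The downtime figures above follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name is illustrative, not taken from any library):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def allowed_downtime_hours(availability_pct: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime, in hours, that an availability target permits."""
    return (1 - availability_pct / 100) * period_hours


for target in (99.9, 99.99, 99.999):
    hours = allowed_downtime_hours(target)
    print(f"{target}% -> {hours:.3f} hours/year ({hours * 60:.1f} minutes)")
```

Passing a different `period_hours` (for example 30 * 24 for a monthly window) gives the budget for shorter measurement periods.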

Durability measures the probability that stored data will not be lost over a given period. It is fundamentally different from availability: a system can be unavailable (temporarily unable to serve requests) while still being durable (data is safe on disk, just not accessible). Amazon S3 offers 99.999999999% (eleven nines) durability, meaning that if you store 10 million objects, you can expect to lose a single object once every 10,000 years. S3 achieves this through automatic replication across multiple facilities within a region. Durability is a function of redundancy (how many copies exist), independence (are the copies on separate failure domains?), and integrity verification (are bit-rot and corruption detected?).
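
The "one object every 10,000 years" figure can be checked with expected-value arithmetic. The sketch below assumes the durability figure is an annual, per-object probability; the second function is a deliberately naive model of redundancy that assumes fully independent copies and no repair, so it only illustrates why independence matters. All names are illustrative.

```python
def expected_losses_per_year(object_count: int, durability_nines: int) -> float:
    """Expected number of objects lost per year at the given number of nines."""
    annual_loss_probability = 10.0 ** -durability_nines
    return object_count * annual_loss_probability


def naive_loss_probability(per_copy_loss: float, copies: int) -> float:
    """Chance that every copy is lost, assuming independent failures, no repair."""
    return per_copy_loss ** copies


losses = expected_losses_per_year(10_000_000, 11)
print(f"expected losses per year: {losses:.6f}")
print(f"average years between losses: {1 / losses:,.0f}")

# Hypothetical 1% annual loss chance per copy: each added copy multiplies
# the loss probability by 0.01, but only if the copies fail independently.
for copies in (1, 2, 3):
    print(copies, naive_loss_probability(0.01, copies))
```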

Reliability combines two related concepts: Mean Time Between Failures (MTBF), how long the system runs before failing, and Mean Time To Recovery (MTTR), how long it takes to restore service after a failure. Reliability is summarized here as the ratio MTBF / (MTBF + MTTR); strictly speaking, that ratio is the steady-state availability the two metrics imply, that is, the long-run fraction of time the system is operational. A system that fails once a month (MTBF = 720 hours) but recovers in 10 minutes (MTTR ≈ 0.17 hours) scores 720 / 720.17 ≈ 99.98%. The ratio can be improved by either increasing MTBF (preventing failures through better hardware, redundancy, and testing) or decreasing MTTR (faster detection, automated recovery, pre-staged failover).
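
A small sketch of the MTBF / (MTBF + MTTR) ratio, using the numbers from the paragraph above. The function name is illustrative. It also shows a useful property of the formula: halving MTTR yields the same ratio as doubling MTBF.

```python
def uptime_ratio(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


baseline = uptime_ratio(720, 10 / 60)        # fails monthly, 10 min recovery
faster_recovery = uptime_ratio(720, 5 / 60)  # same failure rate, 5 min recovery
fewer_failures = uptime_ratio(1440, 10 / 60) # half as many failures

print(f"baseline:        {baseline:.5%}")
print(f"halve MTTR:      {faster_recovery:.5%}")
print(f"double MTBF:     {fewer_failures:.5%}")
```

Because the two improvements are mathematically equivalent, the choice between them comes down to which is cheaper to achieve in practice.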

How availability composes across dependencies is critical for system design and often underestimated. For serial (sequential) dependencies, availability multiplies: if Service A (99.9%) calls Service B (99.9%), the combined availability is 99.9% * 99.9% = 99.8%, which roughly doubles the expected downtime of either service alone. A system with 10 serial dependencies each at 99.9% has a combined availability of 99.9%^10 = 99.0%, or roughly 87 hours of downtime per year. For parallel (redundant) dependencies, the system is down only when every instance is down at the same time: two instances at 99.9% each provide combined availability of 1 - (1 - 0.999)^2 = 99.9999% (six nines), provided their failures are independent. This math explains why redundancy is the primary mechanism for achieving high availability, and why minimizing the number of serial dependencies on the critical path is essential.
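
The two composition rules can be written as a pair of short functions. This is a sketch that assumes independent failures; the function names are illustrative.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """All components must be up: availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Any one component suffices: the system is down only if all are down."""
    return 1 - prod(1 - a for a in availabilities)


print(f"2 in series:    {serial(0.999, 0.999):.6f}")
print(f"2 in parallel:  {parallel(0.999, 0.999):.6f}")

# Availability erodes steadily as serial dependencies are added.
for count in (1, 2, 5, 10):
    print(f"{count:>2} serial deps at 99.9%: {serial(*[0.999] * count):.4%}")
```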

Key Points
  • Availability is uptime percentage. 99.9% (three nines) = 8.76 hours downtime/year. 99.99% (four nines) = 52.6 minutes/year. 99.999% (five nines) = 5.26 minutes/year. Each additional nine is exponentially harder and more expensive to achieve.
  • Durability is the probability data survives. S3's 11 nines (99.999999999%) means losing, on average, 1 object out of 10 million every 10,000 years. Durability is achieved through replication across independent failure domains and integrity checks to detect corruption.
  • Reliability = MTBF / (MTBF + MTTR). Reducing MTTR (faster recovery) is often more cost-effective than increasing MTBF (preventing all failures). Automated failover and pre-staged recovery procedures are key to minimizing MTTR.
  • Serial dependencies multiply availability: two 99.9% services = 99.8%. Ten 99.9% services = 99.0%. This is why keeping the number of synchronous dependencies in the critical request path small matters so much for overall system availability.
  • Parallel redundancy dramatically improves availability: two instances at 99.9% each, failing independently, provide 99.9999% combined availability (1 - (0.001)^2). This is the mathematical basis for active-active deployments and multi-AZ architectures.
  • Error budgets (from Google SRE) reframe availability as a budget to spend: a 99.9% SLO gives you 8.76 hours of downtime per year. Teams can 'spend' this budget on risky deployments, experiments, and maintenance, pausing releases when the budget is depleted.
Simple Example

The Power Grid Analogy

Consider your home electricity. Availability is whether the lights are on right now -- a 99.99% available power grid means about 52 minutes of blackouts per year. Durability is whether your electrical wiring and outlets survive long-term -- a durable installation works for decades without replacement. Reliability is the combination: the grid rarely goes down (high MTBF) and when it does, power is restored quickly (low MTTR). If you plug a critical server into two independent power feeds (parallel redundancy), the chance of a total blackout drops dramatically. But if the server requires power AND network AND cooling in series, each dependency's availability multiplies, reducing the total.
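
The analogy mixes both composition rules: power feeds in parallel, then power, network, and cooling in series. A sketch with made-up availability numbers (every figure below is a hypothetical chosen for illustration, and failures are assumed independent):

```python
from math import prod


def in_series(*parts: float) -> float:
    """Every part is required."""
    return prod(parts)


def redundant(*parts: float) -> float:
    """Any one part is enough."""
    return 1 - prod(1 - p for p in parts)


# Hypothetical availabilities, not measurements.
power_feed = 0.9999
network = 0.999
cooling = 0.9995

single_feed = in_series(power_feed, network, cooling)
dual_feed = in_series(redundant(power_feed, power_feed), network, cooling)

print(f"single power feed: {single_feed:.5%}")
print(f"dual power feeds:  {dual_feed:.5%}")
```

With these numbers the second feed helps only slightly, because the total is now capped by the weakest remaining serial dependency (the network).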

Real-World Examples

Amazon S3

The S3 Standard storage class is designed for 99.999999999% (11 nines) durability and 99.99% availability. To reach the durability target, S3 automatically stores objects redundantly across a minimum of 3 Availability Zones within a region and uses checksums to detect and repair bit-rot. The 99.99% availability target allows up to 52.6 minutes of downtime per year. Note the asymmetry: data is essentially never lost (durability), but the service may occasionally be unreachable (availability). This distinction drives architectural decisions: backups address the risk of data loss, while retries and caching address periods when the service cannot be reached.

Amazon Aurora

Aurora stores 6 copies of data across 3 AWS Availability Zones (2 copies per AZ). It tolerates the loss of up to 2 copies, for example an entire AZ, without affecting write availability, and the loss of up to 3 copies without affecting read availability. This multi-copy architecture provides both high durability (data survives AZ failures) and high availability (reads continue from surviving copies). Aurora's 99.99% availability SLA (four nines) rests on combining replication with automated failover that promotes a read replica to primary, typically in under 30 seconds, which keeps MTTR low.
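
The failure tolerance of a 6-copy layout can be explored with a small quorum model. The quorum sizes below (4 of 6 for writes, 3 of 6 for reads) come from Aurora's published storage design rather than from the text above, and the function names are illustrative. Treat this as a sketch, not a description of Aurora's implementation.

```python
COPIES_PER_AZ = 2
AZ_COUNT = 3
WRITE_QUORUM = 4  # assumption: writes need 4 of 6 copies
READ_QUORUM = 3   # assumption: reads need 3 of 6 copies


def surviving_copies(failed_azs: int = 0, extra_failed_copies: int = 0) -> int:
    """Copies left after whole-AZ failures plus additional single-copy failures."""
    total = COPIES_PER_AZ * AZ_COUNT
    return max(0, total - failed_azs * COPIES_PER_AZ - extra_failed_copies)


def can_write(copies: int) -> bool:
    return copies >= WRITE_QUORUM


def can_read(copies: int) -> bool:
    return copies >= READ_QUORUM


scenarios = {
    "healthy": surviving_copies(),
    "one AZ down": surviving_copies(failed_azs=1),
    "one AZ down + 1 copy": surviving_copies(failed_azs=1, extra_failed_copies=1),
    "two AZs down": surviving_copies(failed_azs=2),
}
for name, copies in scenarios.items():
    print(f"{name:<22} copies={copies} write={can_write(copies)} read={can_read(copies)}")
```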

Google SRE (Error Budgets)

Google's SRE teams define availability as an error budget rather than a guarantee. If a service has a 99.95% SLO, the team has 4.38 hours of allowed downtime per year. This budget is 'spent' on deployments, experiments, and planned maintenance. When the budget is nearly exhausted, the team freezes changes and focuses on reliability. This approach quantifies the trade-off between velocity (shipping features) and reliability (maintaining uptime), using availability math to make it a concrete, measurable decision.

Trade-Offs
Aspect | Description
Nines vs Cost | Each additional nine of availability roughly doubles or triples the infrastructure cost. Going from 99.9% to 99.99% might require adding a redundant instance and load balancer. Going from 99.99% to 99.999% might require multi-region active-active deployment, global load balancing, and automated failover -- a 5-10x cost increase. The business value of the additional nines must justify the engineering investment.
Durability vs Latency | Higher durability requires writing data to more replicas before acknowledging the write. S3's 11-nine durability involves writing to 3 AZs, which adds cross-AZ latency (1-2 ms). For workloads where write latency is critical, you may accept lower durability (single-AZ write acknowledgment) with asynchronous replication for eventual durability.
MTBF vs MTTR Investment | You can improve reliability by either preventing failures (increasing MTBF) or recovering faster (decreasing MTTR). Preventing all failures is asymptotically impossible and expensive. Investing in fast recovery (automated failover, pre-staged backups, blue-green deployments) often yields better reliability improvements per dollar than preventing every possible failure mode.
Simplicity vs Redundancy | Adding redundancy for availability increases operational complexity: more instances to monitor, more failover logic to test, more state synchronization to manage. A simple single-instance deployment is easier to operate but has no redundancy. The right level of redundancy depends on the availability requirement and the team's operational maturity.
Case Study

Google's Error Budget Framework for Balancing Velocity and Reliability

Scenario

Google's engineering teams faced a persistent tension between product developers (who wanted to ship features quickly) and SRE teams (who wanted to maintain reliability). Without a quantitative framework, debates about whether to deploy a risky change or prioritize reliability work were subjective and political. Some teams were too conservative, shipping slowly with unnecessarily high availability. Others shipped too aggressively, causing frequent outages.

Solution

Google introduced the error budget framework, grounded in availability math. Each service defines an SLO (e.g., 99.95% availability). The error budget is 100% - SLO = 0.05%, which translates to 4.38 hours of allowed downtime per year. SRE teams monitor real-time availability against this budget. When the budget has remaining capacity, product teams are free to deploy risky changes, run experiments, and ship features. When the budget is nearly exhausted, deployments are frozen and the team focuses exclusively on reliability improvements until the budget replenishes.
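
The budget arithmetic and the freeze rule described above can be sketched in a few lines. The class name, the `reserve_fraction` threshold, and the annual window are assumptions made for illustration; the text does not specify how "nearly exhausted" is defined.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_pct: float                    # e.g. 99.95
    period_hours: float = 365 * 24    # assumption: annual window

    @property
    def total_hours(self) -> float:
        """Downtime allowed by the SLO over the period (100% - SLO)."""
        return (1 - self.slo_pct / 100) * self.period_hours

    def remaining_hours(self, downtime_so_far_hours: float) -> float:
        return self.total_hours - downtime_so_far_hours

    def should_freeze(self, downtime_so_far_hours: float,
                      reserve_fraction: float = 0.1) -> bool:
        """Freeze deployments once less than the reserve remains (hypothetical policy)."""
        reserve = reserve_fraction * self.total_hours
        return self.remaining_hours(downtime_so_far_hours) <= reserve


budget = ErrorBudget(slo_pct=99.95)
print(f"total budget: {budget.total_hours:.2f} hours/year")
print("freeze after 1.0 h of downtime?", budget.should_freeze(1.0))
print("freeze after 4.2 h of downtime?", budget.should_freeze(4.2))
```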

Outcome

The error budget framework eliminated subjective reliability debates by making them quantitative. Teams that maintained comfortable error budgets could ship faster with less SRE oversight. Teams that burned through their budgets faced automatic deployment freezes, incentivizing investment in reliability engineering. Across Google, the framework reduced unplanned outage duration by aligning incentives: product developers could see that unreliable services slow down feature delivery because exhausted budgets mean frozen deployments. The approach has been widely adopted outside Google through the SRE book and has become an industry standard for managing the velocity-reliability trade-off.

Common Mistakes
  • Confusing availability with durability. A system can be unavailable (temporarily down) without losing any data (fully durable). S3's occasional outages do not mean your data is lost; they mean you temporarily cannot access it. Design your failure handling accordingly: retries for availability issues, backups for durability concerns.
  • Ignoring serial dependency composition. If your service calls 5 downstream services sequentially, each at 99.9%, your service's availability ceiling is 99.9%^5 = 99.5%, regardless of how reliable your own code is. Map your dependency chain and calculate the composed availability before committing to an SLA.
  • Pursuing five nines when three nines suffice. 99.999% availability allows 5.26 minutes of downtime per year and requires multi-region active-active architecture, automated failover, and 24/7 on-call. Many applications function perfectly well at 99.9% (8.76 hours/year) at a fraction of the cost.
  • Measuring availability only as uptime percentage without distinguishing between types of failures. A 5-minute outage affecting 100% of users is different from a 50-minute outage affecting 10% of users, even though both may represent similar 'nines.' Consider request-based or user-based availability metrics for more accurate measurement.
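
The last point can be made concrete by computing both metrics for the two outages it describes. The sketch assumes a 30-day window, a constant hypothetical request rate, and that any degraded minute counts as downtime in the time-based metric; all names are illustrative.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60   # 43,200
REQUESTS_PER_MINUTE = 1_000          # hypothetical, assumed constant


def time_based_availability(degraded_minutes: int,
                            period_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Counts every minute with any failures as fully down."""
    return 1 - degraded_minutes / period_minutes


def request_based_availability(failed: int, total: int) -> float:
    """Fraction of requests that succeeded."""
    return 1 - failed / total


def failed_requests(outage_minutes: int, affected_pct: int) -> int:
    return outage_minutes * REQUESTS_PER_MINUTE * affected_pct // 100


total_requests = MINUTES_PER_30_DAYS * REQUESTS_PER_MINUTE
full_outage = failed_requests(outage_minutes=5, affected_pct=100)
partial_outage = failed_requests(outage_minutes=50, affected_pct=10)

for name, minutes, failed in (("5 min, 100% of users", 5, full_outage),
                              ("50 min, 10% of users", 50, partial_outage)):
    print(f"{name}: time-based {time_based_availability(minutes):.4%}, "
          f"request-based {request_based_availability(failed, total_requests):.4%}")
```

Under these assumptions the two outages fail the same number of requests, so the request-based metric treats them as equal while the time-based metric penalizes the partial outage ten times as much.
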
Test Your Understanding

1. How much annual downtime does 99.99% availability (four nines) allow?

2. Two services are deployed in series (Service A calls Service B). Service A has 99.9% availability and Service B has 99.9% availability. What is the combined availability?

3. What is the difference between availability and durability?
