Reliability, availability, and durability are three distinct properties of a system that are often confused. Understanding their precise definitions, how they compose across service dependencies, and how to measure them with SLAs is essential for designing systems that meet business requirements.
Reliability, availability, and durability are the three pillars of system dependability, and confusing them leads to incorrect SLA commitments and architectural mistakes. Availability measures whether the system is currently operational and able to serve requests, expressed as a percentage of uptime over a period. The 'nines' notation (99.9%, 99.99%, 99.999%) is the standard industry shorthand, where each additional nine represents a 10x reduction in allowed downtime: 99.9% (three nines) allows 8.76 hours of downtime per year, 99.99% (four nines) allows 52.6 minutes, and 99.999% (five nines) allows just 5.26 minutes.
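The downtime figures above follow directly from the availability percentage. A minimal sketch of the conversion, assuming a 365-day year (the function and constant names are illustrative, not from any library):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} min/year ({minutes / 60:.2f} h/year)")
```

Each extra nine divides the unavailable fraction by ten, so the budget shrinks from about 8.76 hours to about 52.6 minutes to about 5.26 minutes.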
Durability measures the probability that stored data will not be lost over a given period. It is fundamentally different from availability: a system can be unavailable (temporarily unable to serve requests) while still being durable (data is safe on disk, just not accessible). Amazon S3 offers 99.999999999% (eleven nines) durability, meaning that if you store 10 million objects, you can expect to lose a single object once every 10,000 years. S3 achieves this through automatic replication across multiple facilities within a region. Durability is a function of redundancy (how many copies exist), independence (are the copies on separate failure domains?), and integrity verification (are bit-rot and corruption detected?).
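The "one object every 10,000 years" figure can be checked with simple expected-value arithmetic. The sketch below assumes the durability figure is an annual per-object probability and that losses are independent; the function names are hypothetical.

```python
def expected_annual_losses(object_count: int, nines: int) -> float:
    """Expected number of objects lost per year at a durability of `nines` nines."""
    # Use 10**-nines directly rather than 1 - 0.999..., which loses
    # floating-point precision at eleven nines.
    annual_loss_probability = 10 ** -nines
    return object_count * annual_loss_probability


def years_between_losses(object_count: int, nines: int) -> float:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count, nines)


print(years_between_losses(10_000_000, 11))  # roughly 10,000 years
```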
Reliability is usually described with two related metrics: Mean Time Between Failures (MTBF), how long the system runs before failing, and Mean Time To Recovery (MTTR), how long it takes to restore service after a failure. Together they give the fraction of time the system is operational, MTBF / (MTBF + MTTR), which strictly speaking is the steady-state availability rather than reliability itself. A system that fails once a month (MTBF = 720 hours) but recovers in 10 minutes (MTTR = 0.17 hours) is operational 720 / 720.17 = 99.98% of the time. This figure can be improved by either increasing MTBF (preventing failures through better hardware, redundancy, and testing) or decreasing MTTR (faster detection, automated recovery, pre-staged failover).
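The MTBF / (MTBF + MTTR) calculation above can be reproduced in a few lines. This is a sketch using the example numbers from the text; `uptime_fraction` is an illustrative name.

```python
def uptime_fraction(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# One failure per month (720 h between failures), 10 minutes to recover.
monthly_failure = uptime_fraction(mtbf_hours=720, mttr_hours=10 / 60)
print(f"{monthly_failure * 100:.2f}%")
```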
The composition of availability across dependencies is critical for system design and often underestimated. For serial (sequential) dependencies, availability multiplies: if Service A (99.9%) calls Service B (99.9%), the combined availability is 99.9% * 99.9% = 99.8%, which is roughly double the allowed downtime of either service alone. A system with 10 serial dependencies each at 99.9% has a combined availability of 99.9%^10 = 99.0%, allowing about 87.6 hours of downtime per year. For parallel (redundant) dependencies, the system is down only when every instance is down at once, so the failure probabilities multiply instead: two instances at 99.9% each provide combined availability of 1 - (1 - 0.999)^2 = 99.9999% (six nines), assuming their failures are independent. This math explains why redundancy is the primary mechanism for achieving high availability, and why minimizing the number of serial dependencies on the critical path is essential.
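The serial and parallel formulas can be sketched as two small helpers. Both assume that component failures are independent, which real systems only approximate; the function names are illustrative.

```python
from math import prod


def serial_availability(availabilities: list[float]) -> float:
    """All components must be up, so their availabilities multiply."""
    return prod(availabilities)


def parallel_availability(availabilities: list[float]) -> float:
    """The system is down only if every redundant component is down."""
    return 1 - prod(1 - a for a in availabilities)


print(serial_availability([0.999, 0.999]))    # two services in series
print(serial_availability([0.999] * 10))      # ten serial dependencies
print(parallel_availability([0.999, 0.999]))  # two redundant instances
```

The two helpers compose: a redundant pair can be passed as a single entry in a serial chain by calling `parallel_availability` first and feeding its result to `serial_availability`.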
The Power Grid Analogy
Consider your home electricity. Availability is whether the lights are on right now -- a 99.99% available power grid means about 52 minutes of blackouts per year. Durability is whether your electrical wiring and outlets survive long-term -- a durable installation works for decades without replacement. Reliability is the combination: the grid rarely goes down (high MTBF) and when it does, power is restored quickly (low MTTR). If you plug a critical server into two independent power feeds (parallel redundancy), the chance of a total blackout drops dramatically. But if the server requires power AND network AND cooling in series, each dependency's availability multiplies, reducing the total.
Amazon S3
S3 Standard is designed for 99.999999999% (11 nines) durability and 99.99% availability. To meet the durability target, S3 stores objects redundantly across a minimum of 3 Availability Zones within a region and uses checksums to detect and repair bit-rot. The 99.99% availability target allows up to 52.6 minutes of downtime per year. Note the asymmetry: data is essentially never lost (durability), but the service may occasionally be unreachable (availability). In practice, this means S3 data rarely needs a separate backup to guard against loss, while latency-sensitive readers may still want a cache or fallback for periods when the service is unreachable.
Amazon Aurora
Aurora stores 6 copies of data across 3 AWS Availability Zones (2 per AZ). It tolerates the loss of up to 2 copies, for example an entire AZ, without affecting write availability, and the loss of up to 3 copies without affecting read availability. This multi-copy architecture provides both high durability (data survives AZ failures) and high availability (reads continue from surviving copies). Aurora offers a 99.99% availability SLA (four nines) by combining replication with automated failover that promotes a read replica to primary, typically in under 30 seconds (minimizing MTTR).
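One way to reason about these tolerances is a quorum model. The sketch below assumes a write quorum of 4 of 6 copies and a read quorum of 3 of 6, which are figures from Aurora's published design rather than from the text above; the constants and function names are illustrative and not part of any AWS API.

```python
TOTAL_COPIES = 6   # 2 copies in each of 3 Availability Zones
WRITE_QUORUM = 4   # assumed: writes need 4 of 6 copies
READ_QUORUM = 3    # assumed: reads need 3 of 6 copies


def can_write(lost_copies: int) -> bool:
    """Writes succeed while at least WRITE_QUORUM copies survive."""
    return TOTAL_COPIES - lost_copies >= WRITE_QUORUM


def can_read(lost_copies: int) -> bool:
    """Reads succeed while at least READ_QUORUM copies survive."""
    return TOTAL_COPIES - lost_copies >= READ_QUORUM


for lost in range(5):
    print(f"lost={lost}: write={can_write(lost)}, read={can_read(lost)}")
```

Under these assumptions, losing a whole AZ (2 copies) leaves both reads and writes available, while losing an AZ plus one more copy leaves the data readable but not writable.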
Google SRE (Error Budgets)
Google's SRE teams define availability as an error budget rather than a guarantee. If a service has a 99.95% SLO, the team has 4.38 hours of allowed downtime per year. This budget is 'spent' on deployments, experiments, and planned maintenance. When the budget is nearly exhausted, the team freezes changes and focuses on reliability. This approach quantifies the trade-off between velocity (shipping features) and reliability (maintaining uptime), using availability math to make it a concrete, measurable decision.
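The error budget is the complement of the SLO applied to a time window. A minimal sketch, assuming a 365-day year and a hypothetical helper name:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def error_budget_hours(slo_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Allowed downtime, in hours, for an SLO over the given period."""
    return (1 - slo_percent / 100) * period_hours


print(error_budget_hours(99.95))           # yearly budget
print(error_budget_hours(99.95, 30 * 24))  # budget for a 30-day window
```

Teams often track the budget over a shorter rolling window; for a 99.95% SLO over 30 days this comes to about 0.36 hours, or roughly 21.6 minutes.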
| Aspect | Description |
|---|---|
| Nines vs Cost | As a rough rule of thumb, each additional nine of availability doubles or triples the infrastructure cost, and the jump can be steeper at the high end. Going from 99.9% to 99.99% might require adding a redundant instance and load balancer. Going from 99.99% to 99.999% might require multi-region active-active deployment, global load balancing, and automated failover, which can mean a 5-10x cost increase. The business value of the additional nines must justify the engineering investment. |
| Durability vs Latency | Higher durability requires writing data to more replicas before acknowledging the write. S3's 11-nine durability involves writing to 3 AZs, which adds cross-AZ latency (1-2ms). For workloads where write latency is critical, you may accept lower durability (single-AZ write acknowledgment) with asynchronous replication for eventual durability. |
| MTBF vs MTTR Investment | You can improve reliability by either preventing failures (increasing MTBF) or recovering faster (decreasing MTTR). Preventing all failures is asymptotically impossible and expensive. Investing in fast recovery (automated failover, pre-staged backups, blue-green deployments) often yields better reliability improvements per dollar than preventing every possible failure mode. |
| Simplicity vs Redundancy | Adding redundancy for availability increases operational complexity: more instances to monitor, more failover logic to test, more state synchronization to manage. A simple single-instance deployment is easier to operate but has no redundancy. The right level of redundancy depends on the availability requirement and the team's operational maturity. |
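The MTBF vs MTTR row can be made concrete with the MTBF / (MTBF + MTTR) formula. The sketch below uses hypothetical numbers (one failure per 720 hours, one hour to recover) to compare doubling MTBF against halving MTTR.

```python
def unavailability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is down: MTTR / (MTBF + MTTR)."""
    return mttr_hours / (mtbf_hours + mttr_hours)


baseline = unavailability(mtbf_hours=720, mttr_hours=1.0)
doubled_mtbf = unavailability(mtbf_hours=1440, mttr_hours=1.0)
halved_mttr = unavailability(mtbf_hours=720, mttr_hours=0.5)

print(f"baseline:     {baseline:.6f}")
print(f"doubled MTBF: {doubled_mtbf:.6f}")
print(f"halved MTTR:  {halved_mttr:.6f}")
```

Both changes cut downtime by the same amount, to about half the baseline, so the choice between them comes down to which is cheaper to achieve.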
Google's Error Budget Framework for Balancing Velocity and Reliability
Scenario
Google's engineering teams faced a persistent tension between product developers (who wanted to ship features quickly) and SRE teams (who wanted to maintain reliability). Without a quantitative framework, debates about whether to deploy a risky change or prioritize reliability work were subjective and political. Some teams were too conservative, shipping slowly with unnecessarily high availability. Others shipped too aggressively, causing frequent outages.
Solution
Google introduced the error budget framework, grounded in availability math. Each service defines an SLO (e.g., 99.95% availability). The error budget is 100% - SLO = 0.05%, which translates to 4.38 hours of allowed downtime per year. SRE teams monitor real-time availability against this budget. When the budget has remaining capacity, product teams are free to deploy risky changes, run experiments, and ship features. When the budget is nearly exhausted, deployments are frozen and the team focuses exclusively on reliability improvements until the budget replenishes.
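The freeze decision described above can be sketched as a small policy object. This is an illustration under stated assumptions, not Google's actual tooling: the class name, the fields, and the 10% freeze threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_percent: float           # e.g. 99.95
    period_hours: float          # length of the measurement window
    downtime_hours: float = 0.0  # downtime observed so far in the window

    @property
    def total_hours(self) -> float:
        """Total allowed downtime for the window."""
        return (1 - self.slo_percent / 100) * self.period_hours

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent, floored at zero."""
        return max(0.0, 1 - self.downtime_hours / self.total_hours)

    def deployments_allowed(self, freeze_threshold: float = 0.10) -> bool:
        """Freeze risky changes once the remaining budget drops to the threshold."""
        return self.remaining_fraction > freeze_threshold


healthy = ErrorBudget(slo_percent=99.95, period_hours=8760, downtime_hours=1.0)
nearly_spent = ErrorBudget(slo_percent=99.95, period_hours=8760, downtime_hours=4.2)
print(healthy.deployments_allowed(), nearly_spent.deployments_allowed())
```

Keeping the threshold as a parameter lets each team decide how early to freeze, since the right margin depends on how quickly its service tends to burn budget.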
Outcome
The error budget framework eliminated subjective reliability debates by making them quantitative. Teams that maintained comfortable error budgets could ship faster with less SRE oversight. Teams that burned through their budgets faced automatic deployment freezes, incentivizing investment in reliability engineering. Across Google, the framework reduced unplanned outage duration by aligning incentives: product developers could see that unreliable services slow down feature delivery because exhausted budgets mean frozen deployments. The approach has been widely adopted outside Google through the SRE book and has become an industry standard for managing the velocity-reliability trade-off.
1. How much annual downtime does 99.99% availability (four nines) allow?
2. Two services are deployed in series (Service A calls Service B). Service A has 99.9% availability and Service B has 99.9% availability. What is the combined availability?
3. What is the difference between availability and durability?