What is the difference between RPO and RTO?
Disaster recovery encompasses the strategies, processes, and infrastructure for recovering from catastrophic failures such as regional outages, data corruption, or ransomware attacks. DR planning centers on two key metrics: RPO (how much data loss is acceptable) and RTO (how much downtime is acceptable), which determine the cost and complexity of the DR strategy.
Disaster recovery (DR) is the set of policies, tools, and procedures for recovering critical technology infrastructure and data after a catastrophic event. Unlike routine failure handling (circuit breakers, retries, graceful degradation), DR addresses scenarios where an entire region, data center, or data set is compromised: natural disasters destroying a data center, region-wide cloud outages, data corruption propagating through replication, ransomware encrypting all accessible storage, or catastrophic software bugs that destroy data. DR planning is not optional for any system that stores data users depend on -- the question is not whether a disaster will occur, but when, and how prepared the organization is to recover.
DR planning revolves around two fundamental metrics. Recovery Point Objective (RPO) defines the maximum acceptable data loss, measured in time. An RPO of 1 hour means the organization accepts losing up to 1 hour of data, so the DR site must always hold a recoverable copy that is no more than 1 hour old. An RPO of zero means no data loss is acceptable, which requires synchronous replication to the DR site. Recovery Time Objective (RTO) defines the maximum acceptable downtime. An RTO of 4 hours means the system must be fully operational within 4 hours of the disaster being declared. An RTO of zero means continuous availability with no perceptible downtime, which requires an active-active multi-region deployment. RPO and RTO directly determine the cost of the DR solution: tighter objectives (less data loss, less downtime) require more expensive infrastructure and more complex procedures.
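The two metrics can be made concrete by measuring what an incident actually achieved and comparing it with the objectives. The sketch below is illustrative only: the `Incident` fields, the function names, and the timestamps are assumptions, and it simplifies by measuring data loss up to the moment the disaster is declared.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    last_good_copy: datetime     # timestamp of the newest recoverable data
    disaster_declared: datetime  # when the disaster was declared
    service_restored: datetime   # when the system was fully operational again


def achieved_rpo(incident: Incident) -> timedelta:
    """Data-loss window: everything written after the last good copy is lost."""
    return incident.disaster_declared - incident.last_good_copy


def achieved_rto(incident: Incident) -> timedelta:
    """Downtime: from the disaster being declared until service is restored."""
    return incident.service_restored - incident.disaster_declared


def meets_objectives(incident: Incident, rpo: timedelta, rto: timedelta) -> bool:
    """True only if both the data-loss and the downtime objectives were met."""
    return achieved_rpo(incident) <= rpo and achieved_rto(incident) <= rto


# Made-up example: last backup at 11:15, disaster at 12:00, restored at 15:30.
incident = Incident(
    last_good_copy=datetime(2024, 1, 1, 11, 15),
    disaster_declared=datetime(2024, 1, 1, 12, 0),
    service_restored=datetime(2024, 1, 1, 15, 30),
)
print(achieved_rpo(incident))  # 0:45:00
print(achieved_rto(incident))  # 3:30:00
print(meets_objectives(incident, rpo=timedelta(hours=1), rto=timedelta(hours=4)))  # True
```

With an RPO of 1 hour and an RTO of 4 hours, this made-up incident stays within both objectives; tightening the RPO to 30 minutes would make it a miss.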
DR strategies form a spectrum from cheapest-and-slowest to most-expensive-and-fastest. Backup and Restore is the simplest: regularly back up data to a separate region, and when disaster strikes, provision new infrastructure and restore from backups. This is the cheapest option but has the longest RTO (hours to days), and its RPO depends on backup frequency (typically hours). Pilot Light maintains a minimal standby environment with core infrastructure (database replicas, DNS entries) pre-provisioned but no compute actively running. During a disaster, compute resources are started and scaled up and traffic is rerouted, achieving an RTO of 10-30 minutes. Warm Standby runs a scaled-down but fully functional copy of the production environment that handles a small amount of traffic or none. Failover involves scaling up and rerouting, achieving an RTO of minutes. Multi-Site Active-Active runs the full application stack in multiple regions simultaneously, with traffic routed to the nearest healthy region. RTO is effectively zero because all regions are always serving traffic, but this is the most expensive option (2x+ infrastructure cost) and the most complex to implement correctly because of data replication and consistency challenges.
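One way to read this spectrum is as a lookup: walk the strategies from cheapest to most expensive and stop at the first one whose typical recovery figures fit the objectives. The following is a minimal sketch of that idea. The `Strategy` type, the function name, and every number in the table are assumptions picked to match the rough ranges above (the text gives no RPO for Pilot Light or Warm Standby), not measured values.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    typical_rto: timedelta  # illustrative worst case, not a guarantee
    typical_rpo: timedelta  # illustrative worst case, not a guarantee


# Ordered from cheapest to most expensive. All figures are assumptions.
STRATEGIES = [
    Strategy("Backup and Restore", timedelta(hours=24), timedelta(hours=6)),
    Strategy("Pilot Light", timedelta(minutes=30), timedelta(minutes=5)),
    Strategy("Warm Standby", timedelta(minutes=5), timedelta(minutes=1)),
    Strategy("Multi-Site Active-Active", timedelta(0), timedelta(seconds=5)),
]


def cheapest_strategy(rto: timedelta, rpo: timedelta) -> Optional[Strategy]:
    """Return the first (cheapest) strategy that satisfies both objectives."""
    for strategy in STRATEGIES:
        if strategy.typical_rto <= rto and strategy.typical_rpo <= rpo:
            return strategy
    return None  # nothing in this table fits, e.g. an RPO of exactly zero


choice = cheapest_strategy(rto=timedelta(hours=4), rpo=timedelta(hours=1))
print(choice.name)  # Pilot Light
```

In this table an RPO of exactly zero returns `None`, because every entry assumes asynchronous replication; meeting it would need synchronous replication, which the sketch does not model.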
DR testing is as important as DR planning. A DR plan that has never been tested should be assumed not to work when it is needed. Organizations must regularly test their DR procedures: tabletop exercises (walking through the DR plan step by step), automated failover drills (actually triggering failover to the DR site and back), and full-scale DR tests (operating entirely from the DR site for a sustained period). GitLab's 2017 data loss incident is a cautionary tale: their backup procedures had silently failed, and they discovered this only when they needed to restore. Regular testing confirms that backup procedures work, failover automation functions correctly, runbooks are accurate, and team members know their roles during a disaster.
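A scheduled sanity check on the backups themselves is a cheap first line of defense against silent failures. The sketch below only checks that the newest backup file exists, is recent, and is not suspiciously small. The function name, thresholds, and file layout are assumptions, and a check like this complements rather than replaces actually restoring a backup.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import List


def check_backups(backup_dir: Path, max_age_seconds: float,
                  min_size_bytes: int = 1) -> List[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    files = [p for p in backup_dir.glob("*") if p.is_file()]
    if not files:
        return [f"no backup files found in {backup_dir}"]

    # Only the most recently modified file is inspected.
    newest = max(files, key=lambda p: p.stat().st_mtime)
    problems = []
    age = time.time() - newest.stat().st_mtime
    if age > max_age_seconds:
        problems.append(f"{newest.name} is {age:.0f}s old, limit is {max_age_seconds:.0f}s")
    size = newest.stat().st_size
    if size < min_size_bytes:
        problems.append(f"{newest.name} is only {size} bytes")
    return problems


# Usage with a throwaway directory standing in for the backup location.
with tempfile.TemporaryDirectory() as tmp:
    backup_dir = Path(tmp)
    empty_dir_problems = check_backups(backup_dir, max_age_seconds=86400)

    dump = backup_dir / "db.dump"
    dump.write_bytes(b"x" * 1024)
    fresh_problems = check_backups(backup_dir, max_age_seconds=86400, min_size_bytes=100)

    # Pretend the backup job stopped running two days ago.
    two_days_ago = time.time() - 2 * 86400
    os.utime(dump, (two_days_ago, two_days_ago))
    stale_problems = check_backups(backup_dir, max_age_seconds=86400, min_size_bytes=100)

print(len(empty_dir_problems), len(fresh_problems), len(stale_problems))  # 1 0 1
```

Wiring the returned problem list into alerting is what turns a silent backup failure into a visible one.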
The House Insurance Analogy
DR is like home disaster insurance. RPO is how much stuff you can afford to lose: if you back up family photos to the cloud monthly (RPO = 1 month), you could lose up to a month of photos in a fire. If you back up daily (RPO = 1 day), you lose at most a day. If your photos sync to the cloud instantly (RPO = 0), you lose nothing. RTO is how quickly you need to be in a livable space again: a hotel room tonight (RTO = hours) is cheaper than owning a fully furnished second home always ready to move into (RTO = 0). The more protection you want (less data loss, faster recovery), the more you pay. Most families choose a practical balance, not maximum protection for everything.
Netflix
Netflix operates active-active across 3 AWS regions (US-East, US-West, EU-West), achieving near-zero RTO and near-zero RPO for their streaming service. All regions serve live traffic simultaneously, and Zuul (their API gateway) routes users to the nearest healthy region. If an entire AWS region fails, traffic is automatically redistributed to the remaining regions. Data is replicated asynchronously across regions using EVCache (for session data) and Cassandra (for user data), accepting seconds of RPO for the benefit of low-latency writes.
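The routing idea, sending each user to the nearest region that is still healthy, can be sketched in a few lines. This is not Zuul's actual logic. The function name, region names, and latency figures are hypothetical, and a real system would derive health and latency from continuous probes.

```python
from typing import Dict, Optional, Set


def pick_region(latency_ms: Dict[str, float], healthy: Set[str]) -> Optional[str]:
    """Return the lowest-latency region that is currently healthy, if any."""
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        return None  # every region is down; nothing to route to
    return min(candidates, key=candidates.get)


# Hypothetical latencies measured from one user to each region.
latency = {"us-east-1": 20.0, "us-west-2": 80.0, "eu-west-1": 95.0}

print(pick_region(latency, {"us-east-1", "us-west-2", "eu-west-1"}))  # us-east-1
print(pick_region(latency, {"us-west-2", "eu-west-1"}))               # us-west-2
```

When the nearest region drops out of the healthy set, the same call returns the next-nearest one, which is the redistribution behavior described above.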
Capital One
Capital One uses a pilot light DR strategy with automated failover for their banking infrastructure. Core databases are replicated to a standby region with minimal compute pre-provisioned. When a disaster is detected, automated runbooks provision compute resources, promote database replicas, update DNS, and validate connectivity -- achieving an RTO of under 15 minutes. DR drills are conducted quarterly, and every drill result feeds into an improvement backlog.
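An automated runbook of this kind amounts to an ordered list of steps that stops at the first failure and is timed against the RTO. The sketch below shows that shape under stated assumptions: the step functions are hypothetical stand-ins that always succeed, and nothing here reflects Capital One's actual tooling.

```python
import time
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_failover(steps: List[Step], rto_seconds: float) -> Dict[str, object]:
    """Run steps in order, stop at the first failure, and time the whole run."""
    start = time.monotonic()
    completed: List[str] = []
    failed_step = None
    for name, action in steps:
        if not action():
            failed_step = name
            break
        completed.append(name)
    elapsed = time.monotonic() - start
    return {
        "ok": failed_step is None,
        "failed_step": failed_step,
        "completed": completed,
        "elapsed_seconds": elapsed,
        "within_rto": failed_step is None and elapsed <= rto_seconds,
    }


# Hypothetical stand-ins; real steps would call cloud and database APIs.
def provision_compute() -> bool:
    return True


def promote_replica() -> bool:
    return True


def update_dns() -> bool:
    return True


def validate_connectivity() -> bool:
    return True


runbook: List[Step] = [
    ("provision compute", provision_compute),
    ("promote database replica", promote_replica),
    ("update DNS", update_dns),
    ("validate connectivity", validate_connectivity),
]

result = run_failover(runbook, rto_seconds=15 * 60)
print(result["ok"], result["within_rto"])  # True True
```

Recording `failed_step` and `elapsed_seconds` for every drill gives the improvement backlog something concrete to work from.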
GitLab
In January 2017, GitLab experienced a catastrophic data loss incident when an engineer accidentally deleted a production database directory. GitLab's own post-mortem concluded that none of the five backup and replication techniques it had deployed was working reliably or set up correctly. The team restored from the one usable copy, which was about 6 hours old. The process took 18 hours (RTO) and resulted in 6 hours of data loss (RPO). GitLab publicly documented the entire incident and response, leading to a complete overhaul of their DR strategy including daily backup verification, automated restore testing, and multi-region replication.
| Aspect | Description |
|---|---|
| Cost vs Recovery Speed | DR cost rises steeply as objectives tighten. Backup & Restore costs little more than backup storage. Pilot Light adds replicated data stores and minimal standby compute. Warm Standby adds a scaled-down but always-running copy of the stack. Active-Active roughly doubles infrastructure cost and adds cross-region replication complexity. For example, an organization spending $100K/month on production infrastructure might spend around $5K/month on Backup & Restore DR, or around $200K/month in total to run Active-Active. |
| RPO vs Write Latency | Zero RPO requires synchronous replication: every write must be confirmed by the DR site before being acknowledged to the client. For cross-region replication, this adds 50-200ms per write. Asynchronous replication eliminates this latency penalty but accepts a replication lag window (typically seconds to minutes) during which data could be lost. |
| Complexity vs Reliability | Active-active DR is the most reliable but also the most complex. Cross-region data consistency, conflict resolution, global load balancing, and coordinated deployments all add operational complexity. This complexity itself becomes a source of risk: misconfigured replication or incorrect failover logic can cause data loss or split-brain scenarios. |
| Testing Frequency vs Operational Risk | Frequent DR testing (for example, monthly failover drills) provides high confidence, but each drill risks production impact if it goes wrong. Infrequent testing (for example, annual) limits that exposure but allows DR procedures to become stale as the infrastructure changes. Testing frequency should match the system's criticality and its rate of infrastructure change. |
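The RPO versus write latency trade-off in the table comes down to simple arithmetic, sketched below. The function names and all figures are assumptions for illustration, and the model ignores batching, retries, and variable lag.

```python
def write_latency_ms(local_commit_ms: float, cross_region_rtt_ms: float,
                     synchronous: bool) -> float:
    """Client-visible latency of one write.

    Synchronous replication waits for the DR site to confirm, so it pays
    one cross-region round trip on top of the local commit.
    """
    return local_commit_ms + (cross_region_rtt_ms if synchronous else 0.0)


def writes_at_risk(writes_per_second: float, replication_lag_seconds: float,
                   synchronous: bool) -> float:
    """Acknowledged writes not yet on the DR site if the primary is lost now."""
    return 0.0 if synchronous else writes_per_second * replication_lag_seconds


# Assumed figures: 5 ms local commit, 80 ms cross-region round trip,
# 2,000 writes per second, 3 seconds of asynchronous replication lag.
print(write_latency_ms(5.0, 80.0, synchronous=True))    # 85.0
print(write_latency_ms(5.0, 80.0, synchronous=False))   # 5.0
print(writes_at_risk(2000.0, 3.0, synchronous=False))   # 6000.0
print(writes_at_risk(2000.0, 3.0, synchronous=True))    # 0.0
```

Under these assumed numbers, synchronous replication makes each write 80 ms slower, while asynchronous replication leaves about 6,000 acknowledged writes exposed at any moment.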
GitLab 2017 Data Loss -- The Cost of Untested Backups
Scenario
On January 31, 2017, a GitLab engineer who was troubleshooting database replication accidentally ran a destructive command (rm -rf) on the wrong server, deleting the production PostgreSQL data directory on the primary instead of the lagging secondary. When the team turned to their backup systems, they discovered a series of failures: the secondary was unusable because replication had fallen behind and then broken down, the scheduled pg_dump backups had been silently failing because of a PostgreSQL version mismatch, the S3 bucket that should have held those backups was empty, Azure disk snapshots had never been enabled for the database servers, and LVM snapshots were taken only once every 24 hours. The one usable copy was an LVM snapshot that happened to have been taken manually about 6 hours earlier.
Solution
The team restored from that one working copy: the 6-hour-old LVM snapshot. Restoration meant copying the snapshot data back to the production database servers, bringing PostgreSQL back up, and verifying data integrity. GitLab live-streamed the entire recovery process on YouTube, maintaining radical transparency. After recovery, GitLab conducted a thorough post-mortem and implemented comprehensive DR improvements: automated daily backup verification (actually restoring backups to verify they work), multiple independent replication streams, automated DR failover testing, and documented runbooks for every failure scenario.
Outcome
The restoration took 18 hours (RTO) with 6 hours of data loss (RPO). Approximately 5,000 projects, 5,000 comments, and 700 new user accounts were lost. While the data loss was painful, GitLab's transparent handling -- live-streaming recovery, publishing a detailed post-mortem, and publicly tracking their DR improvement roadmap -- earned significant community respect. The incident became one of the most cited case studies in DR planning, driving the industry message: untested backups are not backups. GitLab's subsequent DR improvements have been validated through regular automated testing.