Vetora

Synchronous vs Asynchronous Replication

Synchronous replication waits for follower acknowledgment before confirming a write, guaranteeing durability at the cost of latency. Asynchronous replication confirms writes immediately on the leader, providing lower latency but risking data loss on leader failure. Semi-synchronous mode balances the two by keeping one follower synchronous and the rest asynchronous.

Overview

The choice between synchronous and asynchronous replication is one of the most consequential decisions in distributed database design, directly affecting write latency, data durability, failover behavior, and read consistency. In synchronous replication, the leader does not confirm a write to the client until at least one follower has acknowledged receiving (and optionally applying) the change. This provides a strong durability guarantee: if the leader crashes immediately after confirming the write, the data survives on the synchronous follower. The cost is latency -- every write must wait for a network round trip to the follower and back, which can range from sub-millisecond (same rack) to 50-200ms (cross-region).

Asynchronous replication takes the opposite approach: the leader confirms the write to the client as soon as it has durably written the change to its own storage (typically the write-ahead log), without waiting for any follower to acknowledge receipt. The leader streams changes to followers in the background, and followers apply them as fast as they can. This provides the lowest possible write latency because the critical path includes only the leader's local disk write. However, asynchronous replication introduces replication lag -- a time window during which followers are behind the leader. If the leader crashes during this window, committed writes that have not yet been replicated to any follower are permanently lost. The size of this data loss window depends on replication lag, which can range from milliseconds under normal load to seconds or minutes during heavy write bursts, long-running transactions, or follower I/O saturation.
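The relationship between replication lag and the data loss window can be put into rough numbers. The following is a minimal sketch, assuming a steady write rate and that the most up-to-date follower is behind by exactly the measured lag; the function name and the 700 writes/s workload are hypothetical and chosen only for illustration.

```python
def estimated_lost_writes(writes_per_second: float, replication_lag_seconds: float) -> float:
    """Rough estimate of acknowledged writes that exist only on the leader.

    Assumes a constant write rate and that no follower has received
    anything written during the lag window.
    """
    if writes_per_second < 0 or replication_lag_seconds < 0:
        raise ValueError("rate and lag must be non-negative")
    return writes_per_second * replication_lag_seconds


# Hypothetical workload of 700 writes/s at three different lag levels.
for lag in (0.005, 0.5, 3.0):
    at_risk = estimated_lost_writes(700, lag)
    print(f"lag={lag}s -> about {at_risk:.1f} writes at risk")
```

Real write rates are bursty, so the estimate is best read as an order of magnitude rather than a guarantee.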

Semi-synchronous replication is the pragmatic middle ground and a common production configuration. In this mode, one follower is designated as the synchronous replica: the leader waits for this follower's acknowledgment before confirming writes. All other followers replicate asynchronously. This guarantees that every committed write exists on at least two nodes (leader + sync follower) while limiting the latency impact to a single network round trip. If the synchronous follower becomes unavailable, the leader can promote another follower to synchronous status (a process sometimes called semi-synchronous failover). MySQL's semi-sync replication plugin implements a variant of this pattern: the source waits for an acknowledgment from any semi-sync-enabled replica and reverts to asynchronous replication if none arrives within a timeout. PostgreSQL's synchronous_commit setting offers finer granularity. With a synchronous standby configured in synchronous_standby_names, 'remote_write' waits until the standby has written the WAL to its OS cache, 'on' (the default) waits until the standby has flushed the WAL to disk, and 'remote_apply' waits until the standby has applied the change so that it is visible to queries. 'local' waits only for the leader's own WAL flush, and 'off' does not wait even for that (fastest but least durable).
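To make the PostgreSQL levels concrete, here is a small lookup sketch that lists what a commit waits for at each synchronous_commit value. It is a model of the documented behavior, not PostgreSQL code; the `Wait` enum and `commit_waits` function are hypothetical names, and `has_sync_standby` stands in for a non-empty synchronous_standby_names.

```python
from enum import Enum


class Wait(Enum):
    LOCAL_FLUSH = "leader WAL flushed to disk"
    STANDBY_WRITE = "standby wrote WAL to its OS cache"
    STANDBY_FLUSH = "standby flushed WAL to disk"
    STANDBY_APPLY = "standby replayed WAL (visible to queries)"


# Remote waits are cumulative: each stronger level includes the weaker ones.
_REMOTE = {
    "remote_write": [Wait.STANDBY_WRITE],
    "on": [Wait.STANDBY_WRITE, Wait.STANDBY_FLUSH],
    "remote_apply": [Wait.STANDBY_WRITE, Wait.STANDBY_FLUSH, Wait.STANDBY_APPLY],
}


def commit_waits(level: str, has_sync_standby: bool) -> list[Wait]:
    """Return the events a commit waits for before it is acknowledged."""
    if level not in ("off", "local", "remote_write", "on", "remote_apply"):
        raise ValueError(f"unknown synchronous_commit level: {level}")
    if level == "off":
        return []  # acknowledged before even the local WAL flush
    waits = [Wait.LOCAL_FLUSH]
    if has_sync_standby:
        waits += _REMOTE.get(level, [])  # 'local' adds no remote wait
    return waits


for level in ("off", "local", "remote_write", "on", "remote_apply"):
    print(level, [w.name for w in commit_waits(level, has_sync_standby=True)])
```

Without a synchronous standby, every level other than 'off' collapses to a local flush only, which is why the setting alone does not make replication synchronous.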

The replication mode directly impacts failover semantics. With synchronous replication, promoting the synchronous follower after leader failure is safe because that follower has all committed writes. With asynchronous replication, the promoted follower may be behind the failed leader, meaning some committed writes are lost. This creates a fundamental tension: applications that confirmed writes to clients (showing 'order placed' or 'payment processed') may have those writes silently disappear after failover. For financial and transactional workloads, this is unacceptable, driving the choice toward synchronous or semi-synchronous replication. For analytics, logging, or social media workloads where occasional data loss is tolerable, asynchronous replication's latency advantage is worth the trade-off. Google Spanner takes the synchronous approach further, committing every write through Paxos consensus across a majority of replicas. Its TrueTime-synchronized clocks bound clock uncertainty to a small, known interval, which keeps short the extra wait needed to provide globally consistent reads and writes.
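The failover behavior described above can be reproduced with a toy in-memory model. This is a sketch under simplifying assumptions (logs are Python lists, synchronous followers receive each record before the acknowledgment, and the follower with the longest log is promoted); the `Cluster` class and its methods are hypothetical and do not correspond to any real database API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    log: list[str] = field(default_factory=list)


class Cluster:
    def __init__(self, sync_followers: int, async_followers: int):
        self.leader = Node("leader")
        self.sync = [Node(f"sync{i}") for i in range(sync_followers)]
        self.async_ = [Node(f"async{i}") for i in range(async_followers)]
        self.acked: list[str] = []  # writes confirmed to the client

    def write(self, record: str) -> None:
        self.leader.log.append(record)
        for follower in self.sync:
            follower.log.append(record)  # leader waits for these before acking
        self.acked.append(record)

    def replicate_async(self, upto: int) -> None:
        """Background streaming: async followers hold the first `upto` records."""
        for follower in self.async_:
            follower.log = self.leader.log[:upto]

    def fail_over(self) -> tuple[str, list[str]]:
        """Leader crashes; promote the most up-to-date follower."""
        new_leader = max(self.sync + self.async_, key=lambda n: len(n.log))
        lost = [r for r in self.acked if r not in new_leader.log]
        return new_leader.name, lost


def demo(sync_followers: int, async_followers: int) -> list[str]:
    cluster = Cluster(sync_followers, async_followers)
    for i in range(5):
        cluster.write(f"w{i}")
    cluster.replicate_async(upto=3)  # async followers have only w0..w2
    _, lost = cluster.fail_over()
    return lost


print(demo(0, 2))  # fully asynchronous: ['w3', 'w4'] were acknowledged but lost
print(demo(1, 1))  # semi-synchronous: []
```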

Key Points

1. Synchronous replication guarantees zero data loss on leader failure, provided the synchronous follower survives, because every confirmed write exists on at least the leader and one follower. The cost is added write latency of at least one network round trip to the synchronous follower.
2. Asynchronous replication provides minimum write latency (local disk write only) but creates a replication lag window during which data exists only on the leader. If the leader fails, committed writes in this window are permanently lost.
3. Semi-synchronous replication (one sync follower, rest async) is a common production configuration. It guarantees durability on two nodes while keeping the latency impact to one round trip, and it can survive the synchronous follower going offline by promoting another.
4. PostgreSQL offers granular synchronous_commit settings: 'off' (fire-and-forget), 'local' (leader WAL flush only), 'remote_write' (follower OS buffer), 'on' (follower WAL flushed to disk, the default), and 'remote_apply' (follower applied). This per-transaction control lets developers choose different durability-latency trade-offs for different operations.
5. Failover with async replication can lose committed writes. Applications that confirmed success to users (order confirmations, payment receipts) may see those operations disappear after failover. This is called the 'lost committed writes' problem and is the primary argument for synchronous or semi-synchronous replication.
6. Google Spanner uses synchronous Paxos replication, committing each write only after a majority of replicas acknowledge it. It pays the latency cost but guarantees globally consistent reads. TrueTime (GPS + atomic clocks) keeps the added delay manageable by providing tightly bounded clock uncertainty, which shortens the window each commit must wait out before its timestamp is safe to expose.

Simple Example

The Registered Mail vs Regular Mail Analogy

Synchronous replication is like sending a registered letter: you stand at the post office counter and wait for confirmation that the recipient signed for it before you leave. You know for certain the letter arrived, but you spent time waiting. Asynchronous replication is like dropping a regular letter in the mailbox: you walk away immediately and trust it will arrive eventually. It is faster for you, but if the mailbox catches fire before the postal worker collects the mail, your letter is lost and you would not even know. Semi-synchronous is like sending one copy registered (you wait for that confirmation) and additional copies as regular mail -- you have a guaranteed backup on at least one copy.

Real-World Examples

PostgreSQL

PostgreSQL's synchronous_commit parameter provides five durability levels ('off', 'local', 'remote_write', 'on', 'remote_apply') that can be set per transaction. Setting it to 'remote_apply' (strongest) ensures the synchronous follower has applied the WAL records before the leader confirms the commit, providing both durability and read-your-writes consistency on that follower. Setting it to 'off' (weakest) does not even wait for local WAL flush, providing maximum write throughput at the risk of losing the last few hundred milliseconds of transactions on crash. This per-transaction control lets a single database serve both payment writes (remote_apply) and log writes (off) with an appropriate trade-off for each.
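One way to apply a durability level per transaction is PostgreSQL's `SET LOCAL`, which scopes a setting to the current transaction. The sketch below only builds the SQL statements as strings so it can run without a database; the helper name `transaction_sql` and the `payments` and `access_log` tables are hypothetical, and a real application would send these statements through its database driver with proper parameter binding.

```python
ALLOWED_LEVELS = ("off", "local", "remote_write", "on", "remote_apply")


def transaction_sql(statements: list[str], synchronous_commit: str) -> list[str]:
    """Wrap statements in a transaction that uses the given durability level."""
    if synchronous_commit not in ALLOWED_LEVELS:
        raise ValueError(f"unknown synchronous_commit level: {synchronous_commit}")
    return [
        "BEGIN",
        # SET LOCAL lasts only until the end of this transaction.
        f"SET LOCAL synchronous_commit = '{synchronous_commit}'",
        *statements,
        "COMMIT",
    ]


# Durability-critical write: wait until the synchronous standby has applied it.
payment = transaction_sql(
    ["INSERT INTO payments (id, amount_cents) VALUES (1, 4999)"],
    "remote_apply",
)

# Loss-tolerant write: do not wait for any WAL flush.
log_entry = transaction_sql(
    ["INSERT INTO access_log (path) VALUES ('/checkout')"],
    "off",
)

for statement in payment + log_entry:
    print(statement + ";")
```

Because the setting is attached to the transaction rather than the table, the same table can be written at different durability levels by different code paths.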

MySQL

MySQL's semi-sync replication plugin ensures the leader waits for at least one follower to acknowledge receipt of the binlog events before confirming the transaction to the client. If no semi-sync follower acknowledges within a configurable timeout (rpl_semi_sync_master_timeout, renamed rpl_semi_sync_source_timeout in MySQL 8.0.26), MySQL automatically falls back to asynchronous replication, preventing write stalls. MySQL 5.7 added the AFTER_SYNC wait point (rpl_semi_sync_master_wait_point) and made it the default: the leader waits for the follower ACK after writing to the binlog but before committing to the storage engine, closing a small durability gap in the original implementation.
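The timeout fallback can be pictured as a small state machine. This is a behavioral sketch only, not MySQL code: the `SemiSyncSource` class is hypothetical, ACK delays are passed in as numbers instead of being measured, and the 10-second timeout is an illustrative value.

```python
from typing import Optional


class SemiSyncSource:
    def __init__(self, timeout_ms: float):
        self.timeout_ms = timeout_ms
        self.semi_sync = True  # currently waiting for replica ACKs

    def commit(self, ack_delay_ms: Optional[float]) -> tuple[str, float]:
        """Return (mode, ms the client waited on replication).

        ack_delay_ms is None when no replica is reachable.
        """
        if not self.semi_sync:
            return "async", 0.0  # already degraded: no replica wait
        if ack_delay_ms is not None and ack_delay_ms <= self.timeout_ms:
            return "semi-sync", ack_delay_ms
        # No ACK in time: this commit stalls for the full timeout,
        # then the source stops waiting for ACKs on later commits.
        self.semi_sync = False
        return "async-fallback", self.timeout_ms

    def replica_caught_up(self) -> None:
        """A semi-sync replica reconnected and caught up."""
        self.semi_sync = True


source = SemiSyncSource(timeout_ms=10_000)
print(source.commit(1.5))   # normal semi-sync commit
print(source.commit(None))  # replica unreachable: waits out the timeout
print(source.commit(None))  # now asynchronous, confirmed only on the leader
source.replica_caught_up()
print(source.commit(2.0))   # semi-sync again
```

The third commit is the risky one: it is confirmed to the client while it exists only on the leader.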

Google Spanner

Spanner uses synchronous Paxos replication for every write, distributing data across multiple zones and regions. Each write requires a majority of Paxos replicas to acknowledge before the transaction commits. TrueTime, a GPS- and atomic-clock-synchronized time API, provides globally consistent timestamps that enable lock-free read-only transactions across the entire database. The latency cost of cross-region Paxos (typically 10-15ms for intra-continental commits) is acceptable for Spanner's target workloads (financial systems, inventory management) where correctness outweighs latency.
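A majority quorum means a commit waits for the fastest majority of replicas rather than the slowest one. The sketch below estimates that wait from round-trip times; it is a simplification that ignores disk writes, leader election, and commit wait, and the function names and RTT values are hypothetical rather than Spanner measurements.

```python
def majority(replica_count: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replica_count // 2 + 1


def quorum_commit_latency(follower_rtts_ms: list[float]) -> float:
    """Time until a majority has acknowledged, counting the leader as one vote."""
    replica_count = len(follower_rtts_ms) + 1
    followers_needed = majority(replica_count) - 1
    if followers_needed == 0:
        return 0.0  # single-replica group: the leader alone is a majority
    # The commit completes when the k-th fastest follower has answered.
    return sorted(follower_rtts_ms)[followers_needed - 1]


# Hypothetical 5-replica group: one nearby follower, two in-region, one far away.
rtts = [2.0, 12.0, 14.0, 80.0]
print(majority(5))                  # 3
print(quorum_commit_latency(rtts))  # 12.0
print(max(rtts))                    # 80.0 if every replica had to acknowledge
```

The distant replica does not slow down commits as long as a majority of closer replicas stays healthy.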

Trade-Offs

Aspect	Description
Write Latency vs Durability	This is the fundamental trade-off. Synchronous replication adds one network round trip (0.5ms same-rack to 200ms cross-region) to every write but guarantees the data survives leader failure. Asynchronous replication keeps write latency at local-disk speed but risks losing recently committed writes. The right choice depends entirely on the cost of data loss for your workload.
Write Availability under Follower Failure	With fully synchronous replication, a follower failure blocks all writes because the leader cannot get the required acknowledgment. Semi-synchronous setups mitigate this either by switching to another synchronous follower or, as in MySQL, by falling back to async after a timeout, which introduces a window of reduced durability. Asynchronous replication is unaffected by follower failures -- writes continue regardless of follower state.
Read Consistency Guarantees	With PostgreSQL's remote_apply mode, a commit is not confirmed until the synchronous follower has applied it, so reads from that follower see every acknowledged write. Weaker synchronous levels guarantee only that the follower has received the data, not that it is visible yet. Asynchronous replication creates replication lag, causing read-your-writes violations when clients write to the leader and read from a lagging follower. This forces architects to implement sticky sessions, version-aware routing, or leader reads for consistency-critical paths.
Tail Latency Impact	Synchronous replication makes write tail latency (p99, p99.9) dependent on the slowest sync follower. A GC pause, disk I/O spike, or network congestion on the follower directly inflates leader write latency. Asynchronous replication isolates write latency from follower performance, producing more predictable and lower tail latency at the cost of durability guarantees.
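The tail latency row can be illustrated with a few lines of arithmetic. This sketch uses made-up, deterministic samples (a 0.5 ms local write, a follower that answers in 1 ms except for 2% of requests that hit a 50 ms pause) and a nearest-rank percentile; all numbers and names are hypothetical.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


local_ms = [0.5] * 100
follower_ms = [1.0] * 98 + [50.0] * 2  # 2% of ACKs are delayed by a pause

async_latency = local_ms  # follower is off the critical path
sync_latency = [l + f for l, f in zip(local_ms, follower_ms)]

print("async p50/p99:", percentile(async_latency, 50), percentile(async_latency, 99))
print("sync  p50/p99:", percentile(sync_latency, 50), percentile(sync_latency, 99))
```

In this sample the synchronous median rises by only the 1 ms follower round trip, while the p99 absorbs the full 50 ms pause.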

Case Study

PostgreSQL Semi-Synchronous Replication at a Payment Processor

Scenario

A payment processing company running PostgreSQL experienced a leader node crash during a peak transaction period. Their fully asynchronous replication setup meant the promoted follower was approximately 3 seconds behind the failed leader. Over 2,000 confirmed payment transactions were lost -- the application had returned 'payment successful' to merchants, but those transactions did not exist on the promoted follower. The resulting financial discrepancies required manual reconciliation and damaged customer trust.

Solution

The team reconfigured PostgreSQL to use synchronous_commit = 'remote_apply' for transactions that write to the payment tables, ensuring every committed payment existed on at least one follower before confirmation. For non-critical data (access logs, session state, analytics events), they kept synchronous_commit = 'off' to maintain high write throughput. They configured Patroni (a PostgreSQL HA manager) with a 5-second failover timeout and automatic promotion of the synchronous standby. They also added replication lag monitoring with alerts at 100ms and 500ms thresholds.

Outcome

After the reconfiguration, zero payment transactions were lost during three subsequent leader failover events (two planned, one unplanned). Payment write latency increased by 1.2ms on average (the network round trip to the synchronous standby in the same availability zone), which was well within the service's latency budget. Because synchronous_commit was set per transaction, analytics write throughput was unchanged, achieving a clean separation between durability-critical and latency-critical workloads on the same database cluster.

Common Mistakes

⚠Using fully asynchronous replication for financial or transactional data without understanding the data loss risk. If your application confirms 'payment successful' to a user, that write must be durable on at least two nodes. Async replication means those words may become a lie after a leader crash.
⚠Using fully synchronous replication with a cross-region follower. Cross-region network latency (50-200ms) is added to every write, which is rarely acceptable for interactive workloads. Use synchronous replication with a same-region or same-AZ follower, and replicate cross-region asynchronously.
⚠Not monitoring replication lag as a first-class metric. Replication lag should be dashboarded and alerted on, not discovered when users report stale data. PostgreSQL's pg_stat_replication and MySQL's SHOW REPLICA STATUS provide real-time lag metrics that should feed into your monitoring system.
⚠Assuming semi-synchronous replication eliminates all data loss risk. When the synchronous follower becomes unreachable and MySQL falls back to async mode, there is a window where writes confirmed to clients are only on the leader. If the leader crashes during this window, those writes are lost. Monitor the sync/async mode transitions as an operational alert.
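The last two monitoring points can be sketched as two small checks: one that classifies a lag measurement against alert thresholds, and one that flags transitions between semi-sync and async mode. The function names are hypothetical, the 100 ms and 500 ms thresholds are borrowed from the case study, and the inputs are assumed to come from sources such as pg_stat_replication or MySQL's semi-sync status variable, polled by your monitoring system.

```python
def classify_lag(lag_ms: float, warn_ms: float = 100, critical_ms: float = 500) -> str:
    """Map a replication lag sample to an alert level."""
    if lag_ms >= critical_ms:
        return "critical"
    if lag_ms >= warn_ms:
        return "warning"
    return "ok"


def sync_mode_transitions(semi_sync_active: list[bool]) -> list[tuple[int, str]]:
    """Find points where successive polls show a change in replication mode."""
    events = []
    for i in range(1, len(semi_sync_active)):
        before, now = semi_sync_active[i - 1], semi_sync_active[i]
        if before and not now:
            events.append((i, "fell back to async"))
        elif not before and now:
            events.append((i, "semi-sync restored"))
    return events


print([classify_lag(ms) for ms in (20, 100, 750)])
print(sync_mode_transitions([True, True, False, False, True]))
```

Alerting on the fallback event itself matters because lag can look healthy while writes are being confirmed on the leader alone.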

Related Concepts

Leader-Follower Replication
Replication Lag and Read-Your-Writes
Write-Ahead Log (WAL)
PACELC Theorem
Availability, Durability Definitions

Test Your Understanding

1. What is the primary risk of asynchronous replication during a leader failure?

2. In MySQL's semi-synchronous replication, what happens when the synchronous follower becomes unreachable?