Leader-Follower (Primary-Replica) Replication

Leader-follower replication designates one node as the writable leader and replicates changes to N read-only followers via a replication log. It is the most widely deployed replication topology, powering PostgreSQL streaming replication, MySQL binlog replication, and MongoDB replica sets.

Overview

Leader-follower replication, also called primary-replica or master-slave replication, is the most common replication architecture in production databases. A single node -- the leader (or primary) -- accepts all write operations. As it processes each write, the leader records the change in a replication log (such as PostgreSQL's write-ahead log or MySQL's binary log) and streams these log entries to one or more follower (replica) nodes. Followers apply the log entries in order, maintaining a copy of the leader's data that can serve read queries. This architecture provides a clear write-ordering guarantee because all mutations flow through a single point of serialization.
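
The flow described above can be sketched in a few lines. This is a minimal in-memory illustration, not how any real database implements it: the `Leader` and `Follower` classes, the key-value data model, and the list-based log are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Follower:
    data: dict = field(default_factory=dict)
    applied_index: int = 0  # number of log entries applied so far

    def apply(self, log: list) -> None:
        # Apply only the entries not yet seen, strictly in log order.
        for key, value in log[self.applied_index:]:
            self.data[key] = value
            self.applied_index += 1


@dataclass
class Leader:
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def write(self, key: str, value: str) -> int:
        # Record the change in the log, then apply it locally.
        self.log.append((key, value))
        self.data[key] = value
        return len(self.log)  # log position of this write


leader = Leader()
follower = Follower()
leader.write("user:1", "alice")
leader.write("user:1", "alicia")
follower.apply(leader.log)  # follower replays both entries in order
```

Because the follower replays entries in the same order the leader produced them, it ends with the same value for `user:1` that the leader holds.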

The replication log is the heart of the system. In WAL-based (physical) replication, the leader ships the exact byte-level changes to its storage files. This is efficient but ties replicas to the same storage engine version and architecture. In logical (row-based) replication, the leader ships higher-level change records -- insert row X, update row Y, delete row Z -- which are decoupled from the physical storage format. Logical replication enables heterogeneous setups (different PostgreSQL versions, or replicating to a data warehouse) at the cost of slightly higher overhead. MySQL's GTID-based binlog replication uses globally unique transaction identifiers to track replication position, making failover and replica provisioning significantly simpler than older file-position-based approaches.
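
To make the logical (row-based) case concrete, here is a rough sketch of a replica applying insert, update, and delete records. The record shape (`op`, `pk`, `row`) and the dictionary standing in for a table are hypothetical; real systems use their own wire formats.

```python
from typing import Any

ChangeRecord = dict[str, Any]


def apply_change(table: dict[int, dict], change: ChangeRecord) -> None:
    """Apply one logical change record to a table keyed by primary key."""
    op = change["op"]
    pk = change["pk"]
    if op == "insert":
        table[pk] = dict(change["row"])
    elif op == "update":
        table[pk].update(change["row"])
    elif op == "delete":
        table.pop(pk, None)
    else:
        raise ValueError(f"unknown operation: {op}")


changes = [
    {"op": "insert", "pk": 1, "row": {"name": "alice", "plan": "free"}},
    {"op": "update", "pk": 1, "row": {"plan": "pro"}},
    {"op": "insert", "pk": 2, "row": {"name": "bob", "plan": "free"}},
    {"op": "delete", "pk": 2},
]

replica_table: dict[int, dict] = {}
for change in changes:
    apply_change(replica_table, change)
```

Nothing in these records refers to pages, offsets, or file layouts, which is why a consumer with a different storage format can still apply them.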

Failover is the most operationally complex aspect of leader-follower replication. When the leader fails, the system must detect the failure (typically via heartbeat timeouts), promote a follower to become the new leader, and redirect clients to the new leader. This process introduces several risks. If the timeout is too short, transient network glitches can trigger unnecessary failovers (flapping). If the timeout is too long, the system experiences extended write unavailability. The most dangerous failure mode is split-brain: if the old leader comes back online without realizing it has been replaced, two nodes accept writes simultaneously, causing divergent data that is extremely difficult to reconcile. Production systems mitigate split-brain using fencing tokens, STONITH (Shoot The Other Node In The Head), or consensus-based leader election via tools like etcd or ZooKeeper.
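
One of the mitigations named above, fencing tokens, can be illustrated with a small sketch. It assumes each newly elected leader receives a strictly larger token than its predecessor and that the storage layer checks the token on every write; the `FencedStore` class is hypothetical.

```python
class FencedStore:
    """Storage that rejects writes carrying an outdated fencing token."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.data: dict[str, str] = {}

    def write(self, token: int, key: str, value: str) -> bool:
        # A token lower than one already seen means the writer was replaced.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.data[key] = value
        return True


store = FencedStore()
old_leader_ok = store.write(1, "balance", "100")  # original leader, token 1
new_leader_ok = store.write(2, "balance", "90")   # promoted follower, token 2
stale_ok = store.write(1, "balance", "120")       # old leader resurfaces
```

The resurfaced old leader still believes it is in charge, but its write is refused because token 1 is older than token 2, so the data does not diverge.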

Read-after-write consistency is a persistent challenge with leader-follower replication. When a client writes to the leader and then immediately reads from a follower, the follower may not have received the latest change yet, causing the client to see stale data, as if their own write had been lost. Solutions include routing reads of recently-written data to the leader (read-your-writes guarantee), replicating synchronously to at least one follower (semi-synchronous mode) and serving such reads from that follower once it has applied the change, or including a version token with reads so that followers can wait until they have caught up to the required version before responding.
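
The version-token approach might look roughly like the sketch below. It is a simplification: instead of making a lagging follower wait, it routes the read to any follower that has already caught up and falls back to the leader otherwise. The `Node` class and `route_read` function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    applied_version: int = 0  # highest log version this node has applied
    data: dict = field(default_factory=dict)


def route_read(key: str, min_version: int, followers: list, leader: Node):
    """Serve the read from a follower that has reached min_version, else the leader."""
    for follower in followers:
        if follower.applied_version >= min_version:
            return follower.name, follower.data.get(key)
    return leader.name, leader.data.get(key)


leader = Node("leader", 5, {"profile": "v5"})
lagging = Node("follower-1", 3, {"profile": "v3"})
caught_up = Node("follower-2", 5, {"profile": "v5"})

# The client wrote version 5 and passes that token with its next read.
served_by, value = route_read("profile", 5, [lagging, caught_up], leader)
```

Here the client's token (5) rules out `follower-1`, so the read is served by `follower-2` and returns the client's own write.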

Key Points
  1. All writes go through a single leader node, which serializes mutations into a replication log. This eliminates write conflicts entirely but creates a single point of failure for writes that must be addressed through failover mechanisms.
  2. Followers apply the replication log in order and serve read queries, enabling horizontal read scaling. Adding more followers increases read throughput roughly in proportion to the number of followers but does not improve write throughput, which remains bounded by the leader's capacity.
  3. WAL-based (physical) replication ships exact byte-level changes and is efficient but version-coupled. Logical (row-based) replication ships higher-level change records, enabling cross-version and cross-platform replication at slightly higher cost.
  4. Failover requires three steps: detecting the leader failure (heartbeat timeout), promoting a follower (choosing the most up-to-date replica), and redirecting clients. Each step introduces latency and risk, especially split-brain if the old leader resurfaces.
  5. Synchronous replication guarantees that at least one follower has every committed write, enabling zero-data-loss failover. Semi-synchronous mode (one sync follower, rest async) balances durability with write latency.
  6. Read-after-write consistency violations occur when clients write to the leader and immediately read from a lagging follower. Solutions include sticky sessions, leader reads for own-data, or version-aware routing.
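
The promotion step in the failover sequence above (choosing the most up-to-date replica) can be sketched as follows. The replica dictionaries and field names are assumptions for illustration; real failover tools weigh more factors than log position alone.

```python
def choose_promotion_candidate(replicas: list[dict]) -> str:
    """Return the name of the healthy replica with the highest applied log position."""
    healthy = [r for r in replicas if r["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy replica available for promotion")
    return max(healthy, key=lambda r: r["applied_position"])["name"]


replicas = [
    {"name": "replica-a", "applied_position": 1040, "healthy": True},
    {"name": "replica-b", "applied_position": 1055, "healthy": True},
    {"name": "replica-c", "applied_position": 1060, "healthy": False},
]

candidate = choose_promotion_candidate(replicas)
```

`replica-c` has the most data but is unhealthy, so the sketch picks `replica-b`. Any writes between positions 1055 and 1060 would be lost in this scenario, which is the data-loss risk that synchronous replication is meant to remove.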
Simple Example

The Head Chef and Line Cooks

Think of a restaurant kitchen where the head chef (leader) is the only one who decides each dish's recipe and writes it on the order board. Three line cooks (followers) watch the board and replicate each dish exactly as written. Customers can ask any line cook what today's special is (read), but only the head chef can change the menu (write). If the head chef calls in sick, the sous chef (most experienced line cook) takes over -- but there is a brief period of confusion while the other cooks learn who is in charge. If the head chef comes back without knowing the sous chef took over, both might try to change the menu simultaneously, leading to conflicting dishes (split-brain).

Real-World Examples

PostgreSQL

PostgreSQL uses streaming replication based on its write-ahead log (WAL). The primary continuously streams WAL records to standby servers over TCP connections. PostgreSQL supports synchronous replication (the primary waits for at least one standby to flush WAL to disk before confirming a commit) and asynchronous replication (the primary confirms immediately). The pg_stat_replication view provides real-time visibility into replication lag, replay position, and standby state. Tools like Patroni automate failover using etcd or ZooKeeper for leader election.
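
As an illustration of how lag can be derived from the positions that pg_stat_replication reports, the sketch below converts PostgreSQL's textual LSN format (two hexadecimal numbers separated by a slash) into a byte offset and subtracts. The function names are made up for this example, and the LSN values are invented; inside the database the same subtraction is available through `pg_wal_lsn_diff`.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' into an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(primary_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL the standby has not yet replayed."""
    return lsn_to_int(primary_lsn) - lsn_to_int(replay_lsn)


# Hypothetical values: the primary's current WAL position and a standby's replay_lsn.
small_lag = lag_bytes("0/3000060", "0/3000000")
# The difference is still correct when the high half of the LSN rolls over.
rollover_lag = lag_bytes("1/0", "0/FFFFFF00")
```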

MySQL

MySQL uses binary log (binlog) replication with Global Transaction Identifiers (GTIDs). The leader writes row-change events to the binlog, and followers pull these events and replay them. GTIDs assign a unique identifier to every transaction across the cluster, making it simple to determine which transactions a follower has and has not applied -- critical for automated failover. MySQL Group Replication extends this with Paxos-based consensus for automatic leader election and conflict detection, bridging toward multi-leader capabilities.
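
To show why GTIDs make it simple to see what a follower is missing, here is a rough sketch that parses two GTID sets in the `uuid:1-5:7` interval notation and subtracts one from the other. It is illustrative only: the function names are hypothetical, it expands intervals into Python sets (fine for a demo, wasteful for large ranges), and it ignores newer variants of the notation. MySQL offers the same operation server-side as `GTID_SUBTRACT()`.

```python
def parse_gtid_set(text: str) -> dict[str, set[int]]:
    """Parse 'uuid:1-5:7,uuid2:1-3' into {uuid: {transaction numbers}}."""
    result: dict[str, set[int]] = {}
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        txns = result.setdefault(uuid, set())
        for interval in intervals:
            start, _, end = interval.partition("-")
            txns.update(range(int(start), int(end or start) + 1))
    return result


def missing_transactions(leader_set: str, follower_set: str) -> dict[str, list[int]]:
    """Transactions present on the leader that the follower has not applied."""
    leader = parse_gtid_set(leader_set)
    follower = parse_gtid_set(follower_set)
    missing = {}
    for uuid, txns in leader.items():
        diff = txns - follower.get(uuid, set())
        if diff:
            missing[uuid] = sorted(diff)
    return missing


SERVER = "3e11fa47-71ca-11e1-9e33-c80aa9429562"  # example server UUID
gap = missing_transactions(f"{SERVER}:1-10", f"{SERVER}:1-7:9")
```

With file-position-based replication the same question requires translating a file name and offset from one server's binlog into another's, which is what makes GTID-based failover easier to automate.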

MongoDB

MongoDB's replica sets implement leader-follower replication with automatic failover. A replica set consists of a primary (leader) and one or more secondaries (followers). The primary records all mutations in the oplog (operations log), and secondaries tail this log to stay current. When the primary becomes unreachable, secondaries hold a Raft-inspired election to choose a new primary; with default settings, failover typically completes within about 12 seconds. MongoDB supports configurable write concern (w:1 acknowledges a write once the primary has applied it, w:majority waits until a majority of members have it) and read preference (primary, secondary, nearest) for per-operation consistency tuning.
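
The difference between the two write concerns comes down to how many acknowledgements the primary waits for. The sketch below shows only that arithmetic, under the simplifying assumption that every member votes and holds data; it does not call MongoDB, and the function names are hypothetical.

```python
def majority(voting_members: int) -> int:
    """Smallest number of members that forms a majority."""
    return voting_members // 2 + 1


def is_acknowledged(acks: int, voting_members: int, w="majority") -> bool:
    """Whether a write with `acks` acknowledgements satisfies write concern `w`."""
    required = majority(voting_members) if w == "majority" else int(w)
    return acks >= required


# Three-member replica set; only the primary has the write so far.
ack_w1 = is_acknowledged(acks=1, voting_members=3, w=1)
ack_majority = is_acknowledged(acks=1, voting_members=3, w="majority")
```

With one acknowledgement the write already satisfies w:1 but not w:majority, which needs two of three members. A write acknowledged only by the primary can be lost if that primary fails before any secondary copies it.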

Trade-Offs
| Aspect | Description |
| --- | --- |
| Write Throughput Bottleneck | All writes must pass through a single leader node, making leader capacity the ceiling for write throughput. Unlike leaderless or multi-leader systems, adding more followers does not increase write capacity. For write-heavy workloads, this bottleneck often drives teams toward sharding (partitioning data across multiple leader-follower groups) or multi-leader replication. |
| Failover Complexity and Risk | Automated failover introduces risk of split-brain (two leaders accepting writes), data loss (if the new leader is behind the old leader), and client disruption (connections must be rerouted). Manual failover avoids some risks but increases recovery time. Many production deployments use automated failover tools (Patroni, Orchestrator) to balance speed and safety. |
| Replication Lag vs Consistency | Asynchronous replication provides low write latency but allows followers to fall behind, causing stale reads. Synchronous replication ensures the sync follower has received every committed write, but adds latency to every write (one network round trip to the sync follower) and reduces write availability: if the sync follower is unreachable, writes block until it returns, a timeout falls back to asynchronous mode, or another follower takes over as the sync follower. Semi-synchronous is a practical middle ground but still has edge cases during failover. |
| Operational Overhead | Managing replication slots, monitoring lag, handling replica drift, provisioning new replicas from backups, and testing failover procedures all add operational burden. PostgreSQL replication slots prevent WAL recycling until all replicas have consumed the data, which can fill disk if a replica goes offline. These operational details often determine whether a team can run replication reliably in production. |
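
The disk-fill risk from an abandoned replication slot is easy to estimate with back-of-the-envelope arithmetic. The figures below are invented for illustration, and the helper name is hypothetical.

```python
def hours_until_disk_full(free_gib: float, wal_gib_per_hour: float) -> float:
    """Hours before retained WAL exhausts the free space, assuming a constant write rate."""
    if wal_gib_per_hour <= 0:
        raise ValueError("WAL generation rate must be positive")
    return free_gib / wal_gib_per_hour


# Hypothetical primary: 200 GiB free, generating 8 GiB of WAL per hour,
# with one replica offline and its slot still holding WAL back.
hours_left = hours_until_disk_full(free_gib=200, wal_gib_per_hour=8)
```

In this example an offline replica gives the team about a day before the primary's disk fills, which is why alerts on retained WAL size matter as much as alerts on lag.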
Case Study

GitHub's MySQL Replication and Orchestrator-Based Failover

Scenario

GitHub runs one of the largest MySQL deployments outside of hyperscale cloud providers, serving millions of developers. Their primary MySQL cluster handles all repository metadata writes (issues, pull requests, user data) with dozens of read replicas serving the high read volume. Before automated failover, a primary failure required manual intervention, causing extended write outages during incidents and on-call burnout for database engineers.

Solution

GitHub's failover is built around Orchestrator, an open-source MySQL replication topology management and failover tool created by Shlomi Noach, who continued developing it at GitHub. Orchestrator continuously monitors the replication topology, detects primary failures via configurable health checks, and automatically promotes the most up-to-date replica to primary using GTID-based replication positioning. It handles replica re-pointing and can trigger DNS updates and proxy reconfiguration as part of the failover workflow. GitHub also put a proxy layer in front of MySQL so that writes can be redirected to the new primary within seconds of detection, and separately built freno, a throttling service that slows bulk writers when replication lag grows.

Outcome

Automated failover reduced GitHub's MySQL write outage duration from 20-45 minutes (manual) to under 30 seconds. Orchestrator's topology awareness prevented split-brain by fencing the old primary before promoting a new one. The tool has been adopted by companies including Booking.com, Shopify, and Slack for their MySQL replication management, becoming a de facto standard for MySQL high availability.

Common Mistakes
  • Setting heartbeat timeouts too aggressively (e.g., 2 seconds). Short timeouts cause false failovers during transient network blips or GC pauses, leading to unnecessary disruption and potential data inconsistency. Production systems typically use 10-30 second timeouts with multiple consecutive missed heartbeats required before triggering failover.
  • Ignoring replication lag monitoring. Without monitoring pg_stat_replication (PostgreSQL) or Seconds_Behind_Master (MySQL; reported as Seconds_Behind_Source in MySQL 8.0.22 and later), teams discover replication lag only when users report stale data. Lag alerts should fire well before the lag becomes user-visible, and long-running queries on replicas should be investigated as a common cause.
  • Assuming failover is instant and lossless. Even with semi-synchronous replication, there is a window during failover where committed transactions on the old leader may not have reached the promoted follower. Applications must be designed to handle duplicate or missing writes during failover, especially for non-idempotent operations like payment processing.
  • Running all reads from followers without considering consistency requirements. While follower reads reduce leader load, reading immediately after writing causes read-after-write violations. Critical read paths (user viewing their own profile after editing, order confirmation after purchase) should route to the leader or use version-aware routing.
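
The first mistake above, overly aggressive timeouts, is usually addressed by requiring several consecutive missed heartbeats before declaring the leader dead. A minimal sketch follows, with a hypothetical `FailureDetector` class and timestamps passed in explicitly so the behaviour is deterministic; the 5-second interval and 3-miss threshold are example values.

```python
class FailureDetector:
    """Suspects the leader only after several consecutive missed heartbeats."""

    def __init__(self, interval_s: float = 5.0, misses_required: int = 3) -> None:
        self.interval_s = interval_s
        self.misses_required = misses_required
        self.last_heartbeat = 0.0

    def record_heartbeat(self, now: float) -> None:
        self.last_heartbeat = now

    def leader_suspected(self, now: float) -> bool:
        # Count whole heartbeat intervals that have passed in silence.
        missed = int((now - self.last_heartbeat) // self.interval_s)
        return missed >= self.misses_required


detector = FailureDetector(interval_s=5.0, misses_required=3)
detector.record_heartbeat(now=0.0)

after_blip = detector.leader_suspected(now=6.0)     # one missed heartbeat
after_outage = detector.leader_suspected(now=15.0)  # three missed heartbeats
```

A 6-second pause (a GC stall or a brief network blip) does not trigger failover, while 15 seconds of silence does. Raising `misses_required` trades longer write unavailability for fewer false failovers.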