Vetora logo
📏Replication

Replication Lag and Read-Your-Writes

Replication lag is the delay between a write being committed on the leader and applied on followers. It causes stale reads, read-your-writes violations, and monotonic read anomalies. Understanding and mitigating lag is essential for building correct applications on asynchronously replicated databases.

Overview

Replication lag is an inherent consequence of asynchronous replication. When a leader commits a write, it streams the change to followers, but those followers apply the change with some delay -- typically milliseconds under normal conditions, but potentially seconds or minutes during high write load, long-running transactions, schema migrations, or I/O saturation on followers. This delay creates a window during which different replicas have different versions of the data, and clients reading from lagging followers receive stale results. While eventual consistency guarantees that all replicas will converge to the same state given enough time, the application must handle the inconsistency window correctly to avoid confusing or incorrect user experiences.

The most commonly encountered anomaly is the read-your-writes violation. A user submits a form (write to leader), and the page immediately refreshes (read from a follower). If the follower has not yet received the write, the user sees the old data and believes their submission was lost. This is particularly jarring for profile updates, message sends, and order placements -- operations where users expect to see their action reflected instantly. The solution is a read-your-writes guarantee: for data that the current user has recently modified, route the read to the leader (or a synchronous follower known to have the latest data). This can be implemented via sticky sessions (route all requests from a user to the leader for a short window after a write), version tokens (include the write's position in the replication log with the response, and have followers wait until they have reached that position before responding), or application-level routing (always read 'own' data from the leader).

Monotonic read violations are a subtler problem. A user makes two sequential reads that happen to be routed to different followers with different amounts of lag. The first read hits a relatively up-to-date follower and returns recent data. The second read hits a lagging follower and returns older data. From the user's perspective, time went backward -- they saw a comment appear and then disappear, or a counter go from 42 to 38 and back to 42. The solution is monotonic reads: ensure that a given user always reads from the same replica (sticky sessions), or include a monotonic version token that prevents a replica from serving data older than what the client has already seen.

Consistent prefix reads address causal ordering violations: seeing an effect before its cause. If User A writes 'What is the score?' and User B writes 'It is 3-1' (causally dependent on seeing A's question), a third observer reading from a lagging follower might see B's answer before A's question, which is confusing. Consistent prefix reads guarantee that if a sequence of writes happens in a certain order, any reader will see them in that same order. This is harder to guarantee in partitioned or sharded databases where different shards may replicate at different speeds. Solutions include writing causally related data to the same partition or using causal consistency protocols with logical timestamps.

Key Points
  • 1Replication lag is the time between a write committing on the leader and being applied on followers. Under normal load, lag is typically sub-second, but it can spike to minutes during bulk writes, schema migrations, long-running transactions, or follower I/O saturation.
  • 2Read-your-writes violations occur when a user writes to the leader and immediately reads from a lagging follower, seeing stale data. Solutions: route own-data reads to the leader, use sticky sessions after writes, or pass version tokens so followers wait until they are caught up.
  • 3Monotonic read violations occur when sequential reads from different followers show data going 'back in time.' Solutions: pin users to a single replica via sticky sessions, or use monotonic tokens that prevent serving data older than what the client has already seen.
  • 4Consistent prefix read violations occur when causal ordering is broken: an observer sees an effect before its cause. Solutions: write causally related data to the same partition, or use logical timestamps (Lamport clocks) to enforce causal ordering across partitions.
  • 5Replication lag metrics are available in all major databases: PostgreSQL (pg_stat_replication, replay_lag), MySQL (SHOW REPLICA STATUS, Seconds_Behind_Source), MongoDB (rs.printReplicationInfo). These should be monitored and alerted on as first-class operational metrics.
  • 6Lag spikes have common causes: long-running transactions that block replay on followers, schema migrations (ALTER TABLE) that lock tables, bulk data loads saturating follower I/O, and network bandwidth saturation between leader and followers. Identifying the cause is critical for remediation.
Simple Example

The Social Media Profile Update

You update your profile picture on a social media app (write to leader). You immediately visit your profile page (read from a nearby follower). The follower has not received the update yet, so you see your old photo. Panicking, you upload the photo again. Now there are two uploads. Later, both replicas catch up and show the new photo -- but the double upload is a wasted operation caused by the lag. A read-your-writes guarantee would have routed your profile page read to the leader (or waited until the follower was caught up), showing you the new photo on the first try and avoiding the confusion.

Real-World Examples

GitHub

GitHub experienced a well-known read-your-writes problem: a developer would push code (write to the MySQL leader) and immediately view the repository page (read from a follower). If the follower had not replicated the push yet, the page showed the old commit, making developers think their push failed. GitHub solved this by implementing a read-your-writes guarantee for the pushing user: after a push, subsequent requests from that user's session are routed to the leader for a brief window (a few seconds), ensuring they see their own changes. Other users continue reading from followers with acceptable eventual consistency.

Slack

Slack uses sticky sessions to prevent monotonic read violations in channel message displays. Without stickiness, a user refreshing a channel might be routed to different read replicas with different lag, causing messages to appear and disappear between refreshes -- a confusing experience. By pinning a user's session to a specific replica (based on a consistent hash of user ID), Slack ensures each user always reads from the same follower, providing monotonic reads even though different users may see slightly different lag windows. When a user posts a message, the session temporarily routes to the leader for read-after-write consistency.

Instagram

Instagram uses version vectors for feed consistency to address the causal ordering problem. When a user posts a photo, the write generates a version token. When rendering the feed for followers, the system includes version dependencies: if post B was a reply to post A, the feed rendering system ensures post A is visible before showing post B. This prevents the confusing experience of seeing a reply before the original post, which can happen when different feed items are replicated through different shards at different speeds. The version vector overhead is modest (a few bytes per post) compared to the consistency benefit.

Trade-Offs
AspectDescription
Leader Reads vs Follower ScalingRouting reads to the leader for read-your-writes consistency eliminates the scaling benefit of read replicas for those queries. If a large percentage of reads require leader routing (highly interactive applications), the leader becomes a bottleneck. The solution is to minimize the window: route to the leader only for the few seconds after a write, then fall back to followers.
Sticky Sessions vs Load BalancingSticky sessions (pinning users to specific replicas) provide monotonic reads but can create load imbalance if user activity is unevenly distributed. A few high-activity users pinned to one replica can overload it while others are underutilized. Consistent hashing for session pinning helps distribute users more evenly, but perfect balance is unachievable.
Version Tokens vs SimplicityVersion tokens (including the replication log position with the write response and requiring followers to reach that position before serving reads) provide precise read-your-writes guarantees without routing all reads to the leader. However, they add complexity to the client-server protocol and require followers to support 'wait for position' semantics, which not all databases expose natively.
Lag Tolerance vs Application CorrectnessSome applications can tolerate significant lag (social feeds, analytics dashboards) while others cannot (financial balances, inventory counts). The cost of lag mitigation (leader reads, sync replication, version tokens) should be proportional to the business impact of stale data. Over-engineering lag mitigation for a non-critical read path wastes resources, while under-engineering it for a financial path risks incorrect behavior.
Case Study

Solving Read-Your-Writes for an E-Commerce Order System

Scenario

An e-commerce platform ran PostgreSQL with one leader and four async read replicas handling 80% of read traffic via a load balancer. After placing an order, customers were redirected to an order confirmation page. During peak traffic (Black Friday), replication lag spiked to 5-10 seconds due to high write volume. Approximately 8% of customers saw 'Order not found' on the confirmation page because their read was routed to a lagging replica. This triggered a wave of duplicate orders as confused customers re-submitted, and a flood of support tickets from customers who believed their order was lost.

Solution

The team implemented a three-part mitigation strategy. First, they added a read-your-writes guarantee using PostgreSQL's pg_last_wal_receive_lsn(): after placing an order, the write response included the WAL LSN (log sequence number) of the commit. The confirmation page request included this LSN, and the load balancer routed the read to a replica that had replayed past that position (or to the leader if no replica was caught up). Second, they reduced lag spikes by converting long-running analytics queries from the leader to a dedicated analytics replica with a higher apply_delay tolerance. Third, they added real-time replication lag monitoring (Prometheus + pg_stat_replication) with PagerDuty alerts at 1-second and 5-second thresholds.

Outcome

The 'Order not found' error rate dropped from 8% to 0.02% during the next peak traffic event. Duplicate order submissions decreased by 94%. Median replication lag decreased from 2.1 seconds to 180ms after offloading analytics queries from the leader. The WAL LSN-based routing added less than 5ms to confirmation page latency in the common case (replica already caught up) and at most 200ms in the worst case (waiting for replica to catch up). The solution was subsequently applied to all write-then-read user flows (profile updates, cart modifications, review submissions).

Common Mistakes
  • Assuming replication lag is always small. Under normal load, lag is often sub-second, leading teams to believe it is negligible. But during peak traffic, bulk imports, schema migrations, or follower maintenance, lag can spike to seconds or minutes. Design for the worst case, not the average case.
  • Not implementing read-your-writes for user-facing write-then-read flows. Every flow where a user writes data and immediately views it (profile updates, order placement, message sending) needs a read-your-writes guarantee. Without it, users will see stale data, create duplicate submissions, and file support tickets.
  • Using SHOW SLAVE STATUS / Seconds_Behind_Master as the only lag metric. This metric measures the time difference between the follower's current replay position and the leader's binlog position, but it can be misleading: a follower replaying a large transaction may show 0 lag until the transaction commits, then spike. Use multiple metrics (bytes behind, WAL position delta, wall-clock replay lag) for a complete picture.
  • Ignoring the cause of lag spikes and only treating symptoms. Routing reads to the leader during lag spikes is a valid short-term fix, but it overloads the leader if lag is sustained. Investigate root causes: long-running transactions, lock contention, I/O saturation, or network bandwidth limits. Fixing the cause is more sustainable than working around the symptom.
Related Concepts

See Replication Lag and Read-Your-Writes in action

Explore system design templates that use replication lag and read-your-writes and run traffic simulations to see how these concepts perform under real load.

Browse Templates

Observe replication lag under write load and test read-your-writes mitigations

Metrics to watch
replication_lag_msstale_read_pctread_your_writes_violations
Run Simulation
Test Your Understanding

1A user updates their profile name and immediately visits their profile page, but sees the old name. What replication consistency guarantee has been violated?

2A user refreshes a page twice and sees a message counter go from 42 to 38, then back to 42. What replication anomaly is this?

3Which PostgreSQL metric provides real-time visibility into how far behind a replica is in applying the replication stream?

Deeper Reading