Replication lag is the delay between a write being committed on the leader and applied on followers. It causes stale reads, read-your-writes violations, and monotonic read anomalies. Understanding and mitigating lag is essential for building correct applications on asynchronously replicated databases.
Replication lag is an inherent consequence of asynchronous replication. When a leader commits a write, it streams the change to followers, but those followers apply the change with some delay -- typically milliseconds under normal conditions, but potentially seconds or minutes during high write load, long-running transactions, schema migrations, or I/O saturation on followers. This delay creates a window during which different replicas have different versions of the data, and clients reading from lagging followers receive stale results. While eventual consistency guarantees that all replicas will converge to the same state given enough time, the application must handle the inconsistency window correctly to avoid confusing or incorrect user experiences.
The most commonly encountered anomaly is the read-your-writes violation. A user submits a form (write to leader), and the page immediately refreshes (read from a follower). If the follower has not yet received the write, the user sees the old data and believes their submission was lost. This is particularly jarring for profile updates, message sends, and order placements -- operations where users expect to see their action reflected instantly. The solution is a read-your-writes guarantee: for data that the current user has recently modified, route the read to the leader (or a synchronous follower known to have the latest data). This can be implemented via sticky sessions (route all requests from a user to the leader for a short window after a write), version tokens (include the write's position in the replication log with the response, and have followers wait until they have reached that position before responding), or application-level routing (always read 'own' data from the leader).
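The "route to the leader for a short window after a write" option can be sketched in a few lines. This is a minimal illustration, not a production router: the `ReadRouter` class, its method names, and the 5-second default window are all assumptions made for the example, and a real deployment would keep this state in the session or a shared store rather than in process memory.

```python
import time


class ReadRouter:
    """Sends a user's reads to the leader for a short window after that user's write.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    def __init__(self, pin_seconds=5.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds  # assumed window; tune to observed lag
        self.clock = clock              # injectable so the logic can be tested
        self._last_write = {}           # user_id -> time of that user's last write

    def record_write(self, user_id):
        """Call after a write by user_id has been committed on the leader."""
        self._last_write[user_id] = self.clock()

    def target_for_read(self, user_id):
        """Return 'leader' if the user wrote recently, otherwise 'follower'."""
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            return "leader"
        return "follower"
```

Only the user who wrote is pinned to the leader; everyone else keeps reading from followers, which limits the extra load on the leader.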
Monotonic read violations are a subtler problem. A user makes two sequential reads that happen to be routed to different followers with different amounts of lag. The first read hits a relatively up-to-date follower and returns recent data. The second read hits a lagging follower and returns older data. From the user's perspective, time went backward -- they saw a comment appear and then disappear, or a counter go from 42 to 38 and back to 42. The solution is monotonic reads: ensure that a given user always reads from the same replica (sticky sessions), or include a monotonic version token that prevents a replica from serving data older than what the client has already seen.
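The version-token approach to monotonic reads can be sketched as follows. The `Replica` and `Session` types and the integer `applied_position` are hypothetical stand-ins for whatever log position the database exposes; the sketch only shows the rule that a client never reads from a replica older than what it has already seen.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    applied_position: int  # highest replication-log position applied (assumed integer)


def pick_replica(replicas, min_position):
    """Return the first replica at or beyond min_position, or None if all are behind."""
    for replica in replicas:
        if replica.applied_position >= min_position:
            return replica
    return None


class Session:
    """Tracks the newest position this client has observed."""

    def __init__(self):
        self.seen_position = 0

    def read(self, replicas):
        replica = pick_replica(replicas, self.seen_position)
        if replica is None:
            return None  # caller would wait or fall back to the leader
        # Remember how fresh this read was so later reads cannot go backward.
        self.seen_position = max(self.seen_position, replica.applied_position)
        return replica.name
```

Once a session has read from a replica at position 42, a replica still at position 38 is skipped, so the counter in the example above could not appear to drop back to 38.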
Consistent prefix reads address causal ordering violations: seeing an effect before its cause. If User A writes 'What is the score?' and User B writes 'It is 3-1' (causally dependent on seeing A's question), a third observer reading from a lagging follower might see B's answer before A's question, which is confusing. Consistent prefix reads guarantee that if a sequence of writes happens in a certain order, any reader will see them in that same order. This is harder to guarantee in partitioned or sharded databases where different shards may replicate at different speeds. Solutions include writing causally related data to the same partition or using causal consistency protocols with logical timestamps.
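Writing causally related data to the same partition usually comes down to choosing the partition key carefully. The sketch below assumes a hypothetical `conversation_id` shared by the question and its answer, and a fixed partition count; both are illustrative choices, not part of any particular database's API.

```python
import hashlib


def partition_for(conversation_id: str, num_partitions: int) -> int:
    """Map a conversation to a partition using a stable hash.

    hashlib is used instead of the built-in hash(), which is randomized
    per process for strings and so would not be stable across servers.
    """
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Both messages carry the same key, so they land on the same partition
# and are replicated in the order they were written there.
question_partition = partition_for("match-123", 8)
answer_partition = partition_for("match-123", 8)
```

Because a single partition replicates its writes in log order, a follower of that partition cannot expose the answer before the question.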
The Social Media Profile Update
You update your profile picture on a social media app (write to leader). You immediately visit your profile page (read from a nearby follower). The follower has not received the update yet, so you see your old photo. Panicking, you upload the photo again. Now there are two uploads. Later, both replicas catch up and show the new photo -- but the double upload is a wasted operation caused by the lag. A read-your-writes guarantee would have routed your profile page read to the leader (or waited until the follower was caught up), showing you the new photo on the first try and avoiding the confusion.
GitHub
GitHub experienced a well-known read-your-writes problem: a developer would push code (write to the MySQL leader) and immediately view the repository page (read from a follower). If the follower had not replicated the push yet, the page showed the old commit, making developers think their push failed. GitHub solved this by implementing a read-your-writes guarantee for the pushing user: after a push, subsequent requests from that user's session are routed to the leader for a brief window (a few seconds), ensuring they see their own changes. Other users continue reading from followers with acceptable eventual consistency.
Slack
Slack uses sticky sessions to prevent monotonic read violations in channel message displays. Without stickiness, a user refreshing a channel might be routed to different read replicas with different lag, causing messages to appear and disappear between refreshes -- a confusing experience. By pinning a user's session to a specific replica (based on a consistent hash of user ID), Slack ensures each user always reads from the same follower, providing monotonic reads even though different users may see slightly different lag windows. When a user posts a message, the session temporarily routes to the leader for read-after-write consistency.
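One simple way to pin a user to a replica by hash is rendezvous (highest-random-weight) hashing, which behaves like consistent hashing: removing a replica only remaps the users who were pinned to it. The sketch below is an illustration of that general technique, not a description of Slack's implementation; the function names and replica labels are made up for the example.

```python
import hashlib


def _score(user_id: str, replica: str) -> int:
    """Deterministic pseudo-random weight for a (user, replica) pair."""
    digest = hashlib.sha256(f"{user_id}|{replica}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pinned_replica(user_id: str, replicas: list[str]) -> str:
    """Pick the replica with the highest score for this user."""
    if not replicas:
        raise ValueError("no replicas available")
    return max(replicas, key=lambda replica: _score(user_id, replica))
```

The same user ID always maps to the same replica while the replica set is unchanged, which is what gives monotonic reads; if the pinned replica fails, the user is remapped and may briefly see older data unless a version token is also checked.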
Instagram
Instagram uses version vectors for feed consistency to address the causal ordering problem. When a user posts a photo, the write generates a version token. When rendering the feed for followers, the system includes version dependencies: if post B was a reply to post A, the feed rendering system ensures post A is visible before showing post B. This prevents the confusing experience of seeing a reply before the original post, which can happen when different feed items are replicated through different shards at different speeds. The version vector overhead is modest (a few bytes per post) compared to the consistency benefit.
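The dependency rule described here (never show a reply before its original) can be illustrated with a much simpler check than a full version vector. The sketch below is a simplified stand-in, not Instagram's mechanism: it assumes each post is a dict with an `id` and an optional `reply_to`, that the feed lists parents before their replies, and that `replicated_ids` is the set of posts the local replica already has.

```python
def visible_posts(posts, replicated_ids):
    """Return the posts that are safe to render, preserving feed order.

    A reply is held back until its parent has been shown, so a reader
    never sees an effect before its cause.
    """
    shown = []
    shown_ids = set()
    for post in posts:
        if post["id"] not in replicated_ids:
            continue  # this replica has not received the post yet
        parent = post.get("reply_to")
        if parent is not None and parent not in shown_ids:
            continue  # parent missing or hidden: hold the reply back
        shown.append(post)
        shown_ids.add(post["id"])
    return shown
```

Checking against `shown_ids` rather than `replicated_ids` also hides replies to a hidden reply, so a whole dependent chain appears only once its root has arrived.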
| Trade-off | Description |
|---|---|
| Leader Reads vs Follower Scaling | Routing reads to the leader for read-your-writes consistency eliminates the scaling benefit of read replicas for those queries. If a large percentage of reads require leader routing (highly interactive applications), the leader becomes a bottleneck. The solution is to minimize the window: route to the leader only for the few seconds after a write, then fall back to followers. |
| Sticky Sessions vs Load Balancing | Sticky sessions (pinning users to specific replicas) provide monotonic reads but can create load imbalance if user activity is unevenly distributed. A few high-activity users pinned to one replica can overload it while others are underutilized. Consistent hashing for session pinning helps distribute users more evenly, but perfect balance is unachievable. |
| Version Tokens vs Simplicity | Version tokens (including the replication log position with the write response and requiring followers to reach that position before serving reads) provide precise read-your-writes guarantees without routing all reads to the leader. However, they add complexity to the client-server protocol and require followers to support 'wait for position' semantics, which not all databases expose natively. |
| Lag Tolerance vs Application Correctness | Some applications can tolerate significant lag (social feeds, analytics dashboards) while others cannot (financial balances, inventory counts). The cost of lag mitigation (leader reads, sync replication, version tokens) should be proportional to the business impact of stale data. Over-engineering lag mitigation for a non-critical read path wastes resources, while under-engineering it for a financial path risks incorrect behavior. |
Solving Read-Your-Writes for an E-Commerce Order System
Scenario
An e-commerce platform ran PostgreSQL with one leader and four async read replicas handling 80% of read traffic via a load balancer. After placing an order, customers were redirected to an order confirmation page. During peak traffic (Black Friday), replication lag spiked to 5-10 seconds due to high write volume. Approximately 8% of customers saw 'Order not found' on the confirmation page because their read was routed to a lagging replica. This triggered a wave of duplicate orders as confused customers re-submitted, and a flood of support tickets from customers who believed their order was lost.
Solution
The team implemented a three-part mitigation strategy. First, they added a read-your-writes guarantee based on PostgreSQL WAL positions: after placing an order, the write response included the WAL LSN (log sequence number) of the commit, read on the leader with pg_current_wal_lsn(). The confirmation page request included this LSN, and the load balancer routed the read to a replica whose replay position (pg_last_wal_replay_lsn()) had passed it, or to the leader if no replica was caught up. Second, they reduced lag spikes by moving long-running analytics queries off the leader to a dedicated analytics replica configured to tolerate a larger apply delay. Third, they added real-time replication lag monitoring (Prometheus + pg_stat_replication) with PagerDuty alerts at 1-second and 5-second thresholds.
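The routing decision in the first step can be sketched as a comparison of LSNs. PostgreSQL prints an LSN as two hexadecimal numbers separated by a slash (for example `16/B374D848`), representing the high and low 32 bits of a 64-bit byte position. The sketch assumes the required LSN was captured on the leader with `pg_current_wal_lsn()` after the commit, and that each replica's position comes from `pg_last_wal_replay_lsn()`; how those values reach the router, and the `choose_target` function itself, are hypothetical.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual PostgreSQL LSN such as '16/B374D848' to an integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def choose_target(required_lsn: str, replica_replay_lsns: dict[str, str]) -> str:
    """Return a replica that has replayed past required_lsn, else 'leader'.

    replica_replay_lsns maps a replica name to its last replayed LSN
    (assumed to be polled from each replica by the caller).
    """
    required = parse_lsn(required_lsn)
    for name, replayed in replica_replay_lsns.items():
        if parse_lsn(replayed) >= required:
            return name
    return "leader"  # no replica is caught up yet
```

For example, `choose_target("0/3000060", {"r1": "0/3000000", "r2": "0/3000100"})` selects `r2`, because `r1` has not yet replayed the commit. Comparing the parsed integers rather than the raw strings matters, since string comparison would misorder LSNs whose hexadecimal parts differ in length.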
Outcome
The 'Order not found' error rate dropped from 8% to 0.02% during the next peak traffic event. Duplicate order submissions decreased by 94%. Median replication lag decreased from 2.1 seconds to 180ms after offloading analytics queries from the leader. The WAL LSN-based routing added less than 5ms to confirmation page latency in the common case (replica already caught up) and at most 200ms in the worst case (waiting for replica to catch up). The solution was subsequently applied to all write-then-read user flows (profile updates, cart modifications, review submissions).
1. A user updates their profile name and immediately visits their profile page, but sees the old name. What replication consistency guarantee has been violated?
2. A user refreshes a page twice and sees a message counter go from 42 to 38, then back to 42. What replication anomaly is this?
3. Which PostgreSQL metric provides real-time visibility into how far behind a replica is in applying the replication stream?