
P99 & Tail Latency

P99 (99th percentile) latency is the response time that 99% of requests beat; the slowest 1 in 100 requests take at least this long. Tail latency, meaning the latency at p99, p99.9, and beyond, reveals problems that averages and medians hide. In distributed systems with fan-out, tail latency is amplified: a single slow component makes the entire request slow.

Overview

Average latency is one of the most dangerous metrics in system design because it hides the experience of your most affected users. If your average latency is 50ms but your p99 is 2 seconds, then 1 in every 100 requests takes at least 40 times longer than average. For a service handling 10,000 RPS, that means 100 requests every second take 2 seconds or more. Worse, these slow requests often cluster: the same user may hit the tail multiple times during a session, creating a consistently terrible experience for a subset of users.

Percentile latency measurements (p50, p95, p99, p99.9) give a much more accurate picture. The p50 (median) is the latency that half of all requests beat. The p95 is the worst case for the fastest 95% of requests. The p99 is the latency that the slowest 1 in 100 requests meets or exceeds. Amazon, Google, and most large tech companies set SLOs on p99 or even p99.9 latency because tail latency directly impacts revenue and user retention.
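To make the gap between the mean and the percentiles concrete, here is a minimal sketch using the nearest-rank definition of a percentile. The workload (985 requests at 30ms and 15 at 2000ms) and the helper name `percentile` are assumptions made up for illustration.

```python
import math
from statistics import mean

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical workload: 985 fast requests and 15 slow ones (milliseconds).
latencies_ms = [30] * 985 + [2000] * 15

summary = {
    "mean": mean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
    "p99.9": percentile(latencies_ms, 99.9),
}

if __name__ == "__main__":
    for name, value in summary.items():
        print(f"{name:>6}: {value} ms")
```

In this made-up data set the mean is about 60ms and both p50 and p95 are 30ms, yet p99 and p99.9 are 2000ms. Only the high percentiles expose the slow requests.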

In distributed systems, tail latency is amplified by fan-out. When a single user request requires responses from multiple backend services or shards, the overall response time is determined by the slowest component. If each of 100 shards independently has a 1% chance of taking more than 1 second, the probability that at least one shard is slow is 1 - (0.99)^100 ≈ 63.4%. This means nearly two-thirds of requests will experience tail latency from at least one shard. Google's seminal paper 'The Tail at Scale' documented this amplification effect and proposed mitigation strategies.
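The amplification arithmetic can be sketched in a few lines. The function name `prob_any_slow` is hypothetical, and the calculation assumes backends are slow independently of one another, which real systems only approximate.

```python
def prob_any_slow(fan_out: int, p_slow: float) -> float:
    """Probability that at least one of `fan_out` independent backends is slow."""
    return 1.0 - (1.0 - p_slow) ** fan_out

if __name__ == "__main__":
    # Each backend is assumed slow 1% of the time.
    for n in (1, 5, 10, 50, 100):
        print(f"fan-out {n:>3}: {prob_any_slow(n, 0.01):.1%} of requests hit a slow backend")
```

With a 1% per-backend chance of slowness, the result grows from 1% at fan-out 1 to roughly 63% at fan-out 100.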

Key mitigation techniques include hedged requests (sending a second copy of a request to another replica if the first has not answered within a short delay, then using whichever response arrives first), tied requests (enqueuing the request on two replicas at once, each copy tagged with the other's identity, so that the replica that starts executing first cancels the copy on the other), request deadlines (killing requests that exceed their time budget rather than consuming resources on a response the client has already timed out on), and canary requests (sending a request to one or two shards first and fanning out to the rest only if it succeeds). These techniques do not eliminate tail latency but prevent it from dominating system behavior.
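A hedged request can be sketched with `asyncio`. This is an illustrative client-side pattern under stated assumptions, not the implementation from any particular system: `hedged_call`, `fake_request`, and the replica names are hypothetical, and `make_request` stands in for whatever async RPC function the client already has.

```python
import asyncio

async def hedged_call(make_request, replicas, hedge_delay_s):
    """Call replicas[0]; if it has not answered within hedge_delay_s,
    also call replicas[1]. Return the first result and cancel the loser."""
    primary = asyncio.create_task(make_request(replicas[0]))
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay_s)
    if done:
        return primary.result()  # primary answered before the hedge timer fired

    backup = asyncio.create_task(make_request(replicas[1]))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop the slower copy so it does not waste resources
    return done.pop().result()

# Hypothetical backend: replica-a is having a slow moment, replica-b is healthy.
DELAYS_S = {"replica-a": 0.2, "replica-b": 0.01}
calls = []

async def fake_request(replica):
    calls.append(replica)
    await asyncio.sleep(DELAYS_S[replica])
    return replica

if __name__ == "__main__":
    winner = asyncio.run(hedged_call(fake_request, ["replica-a", "replica-b"], 0.02))
    print("answered by", winner)
```

The sketch treats the first completed task as the winner even if it failed; a production client would likely fall back to the other copy on error and would pick the hedge delay from observed latency (for example the p95).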

Key Points
  1. P99 latency means 99% of requests complete faster than this value. A p99 of 500ms means 1 in 100 requests takes at least 500ms. For a service at 10,000 RPS, that is 100 slow requests per second -- a significant number of unhappy users.
  2. Averages are misleading because latency distributions are typically long-tailed. A mean of 50ms could reflect a tight distribution (48-52ms) or a bimodal one (30ms for 99% and 2000ms for 1%). Only percentiles reveal the tail.
  3. Fan-out amplifies tail latency. If a request touches N services, each independently slow with probability p, the probability of at least one being slow is 1-(1-p)^N. With N=100 and p=0.01, 63% of requests hit a slow backend.
  4. Hedged requests mitigate tail latency by sending a backup copy of the request to another replica after a short delay. If the primary does not respond within the p95 latency, the backup is sent and the first response wins. The overhead is small (about 5% extra load) but the tail latency improvement is dramatic.
  5. Coordinated omission is a measurement error where load generators fail to account for requests delayed by queueing. If the generator waits for a response before sending the next request, a slow response delays every request that should have been sent in the meantime, so those would-be-slow requests are never measured and the true impact of tail latency is hidden.
  6. SLOs should be set on high percentiles. Google's SRE book argues for specifying latency objectives as percentiles (such as p99) rather than averages for user-facing services. An SLO like 'p99 latency under 300ms' is more meaningful than 'average latency under 100ms' because it protects the worst-affected users.
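Coordinated omission is easier to see with numbers. The sketch below assumes a closed-loop generator that intended to send one request every 10ms, recorded 99 responses of 1ms, and then stalled on one 1000ms response. The backfill step is similar in spirit to the expected-interval correction that histogram libraries such as HdrHistogram offer; the function names here are hypothetical.

```python
import math

def nearest_rank(samples, pct):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    return ordered[max(math.ceil(pct * len(ordered) / 100), 1) - 1]

def backfill_omitted(latencies_ms, expected_interval_ms):
    """For every response slower than the send interval, add the latencies
    that the requests blocked behind it would have observed."""
    corrected = []
    for latency in latencies_ms:
        corrected.append(latency)
        missing = latency - expected_interval_ms
        while missing >= expected_interval_ms:
            corrected.append(missing)
            missing -= expected_interval_ms
    return corrected

# Assumed recording: 99 fast responses and one 1000ms stall, 10ms intended interval.
recorded = [1] * 99 + [1000]
corrected = backfill_omitted(recorded, expected_interval_ms=10)

raw_p99 = nearest_rank(recorded, 99)
corrected_p99 = nearest_rank(corrected, 99)

if __name__ == "__main__":
    print(f"raw p99: {raw_p99} ms, corrected p99: {corrected_p99} ms")
```

The raw recording reports a p99 of 1ms because the stall counts as a single sample. After backfilling the 99 requests that should have been sent during the stall, the p99 rises to 990ms.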
Simple Example

The Grocery Checkout Line

Imagine a grocery store where the average checkout time is 3 minutes. Most people breeze through in 2-3 minutes. But 1 in 100 customers (the p99) takes 20 minutes because they have a price check issue or coupon problem. If you only measure the average, the store seems efficient. But that 1% of customers is furious -- and they tell their friends. Now imagine each customer needs to visit 5 departments (fan-out). The chance of hitting at least one 20-minute delay across 5 departments is 1-(0.99)^5 = 4.9%. Nearly 1 in 20 shopping trips is ruined by tail latency.

Real-World Examples

Google

Google's 'The Tail at Scale' paper (2013) described how a single search request can fan out to thousands of leaf servers. At that scale even rare outliers dominate: if each server exceeds 10ms on only 1 in 1,000 requests (a p99.9 of 10ms), a request that touches 2,000 servers waits on at least one slow server about 86% of the time (1 - 0.999^2000). The paper's mitigations include hedged requests, tied requests, and micro-partitioning to reduce per-shard variance.

Amazon DynamoDB

DynamoDB is designed to deliver consistent single-digit millisecond latency for reads and writes at any scale. To achieve this, it uses request routers that track per-partition latency statistics and route around slow partitions. Adaptive capacity rebalances hot partitions automatically, limiting the tail latency caused by sustained hot keys.

Netflix

Netflix observed that p99.9 latency in their microservice mesh was 10-100x worse than median latency. They implemented Zuul gateway timeout budgets that propagate deadline headers through the call chain, ensuring that if a downstream service is slow, the gateway cancels the request rather than letting it consume resources for a response the user has already abandoned.
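Deadline propagation of this kind can be sketched as a small helper that converts a time budget into a header and back. This is not Zuul's API: the header name `x-deadline-remaining-ms`, the `Deadline` class, and the injectable millisecond clock are all assumptions for illustration.

```python
import time

DEADLINE_HEADER = "x-deadline-remaining-ms"  # hypothetical header name

def monotonic_ms():
    """Local monotonic clock in whole milliseconds."""
    return int(time.monotonic() * 1000)

class DeadlineExceeded(Exception):
    """Raised when the request's time budget is used up."""

class Deadline:
    def __init__(self, budget_ms, clock_ms=monotonic_ms):
        self._clock_ms = clock_ms
        self._expires_at_ms = clock_ms() + int(budget_ms)

    @classmethod
    def from_headers(cls, headers, default_budget_ms, clock_ms=monotonic_ms):
        """Adopt the caller's remaining budget, or a default at the edge."""
        budget = int(headers.get(DEADLINE_HEADER, default_budget_ms))
        return cls(budget, clock_ms)

    def remaining_ms(self):
        return max(0, self._expires_at_ms - self._clock_ms())

    def outgoing_headers(self):
        """Headers for a downstream call; refuse to call if no budget is left."""
        remaining = self.remaining_ms()
        if remaining == 0:
            raise DeadlineExceeded("time budget exhausted; not calling downstream")
        return {DEADLINE_HEADER: str(remaining)}
```

Passing the remaining budget rather than an absolute timestamp avoids depending on synchronized clocks between hosts, at the cost of ignoring network transit time. Each hop would call `Deadline.from_headers` on entry and `outgoing_headers` before every downstream call.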

Trade-Offs
| Aspect | Description |
| --- | --- |
| Hedged Requests: Tail Latency vs Resource Cost | Hedging sends duplicate requests to reduce tail latency, but increases total system load. Sending a hedge after the p95 delay adds roughly 5% extra traffic. Sending after p50 adds approximately 50%. The sweet spot is hedging at p90-p95: meaningful tail improvement with minimal overhead. |
| Timeouts: Latency Control vs Error Rate | Aggressive timeouts prevent tail latency from propagating but convert slow requests into errors. A timeout of about 2x the p99 (e.g., 500ms if p99 is 250ms) balances latency control against false timeouts. Too tight, and you create artificial errors during minor slowdowns. Too loose, and you do not protect against cascading failures. |
| Measurement Precision vs Overhead | Capturing exact percentiles requires storing or sampling every request latency. HdrHistogram provides percentiles at a configurable precision with fixed memory (on the order of tens of kilobytes, depending on the configured range and precision) at a small per-record CPU cost. Sketch-based approximations (t-digest, DDSketch) trade precision for lower overhead. At very high throughput (1M+ RPS), measurement overhead itself can affect latency. |
| SLO Strictness vs Engineering Velocity | Strict p99.9 SLOs drive excellent user experience but slow down development: every feature must be tested for tail latency impact, and teams spend time optimizing rare paths. Looser SLOs (p95) allow faster iteration but risk degrading the experience for power users, who are often your most valuable customers. |
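The hedging and timeout rows reduce to simple arithmetic on a latency distribution. The sketch below uses a made-up, evenly spread sample (1ms to 100ms) so the numbers are easy to check by hand; `hedge_overhead` and `nearest_rank` are hypothetical helper names.

```python
import math

def nearest_rank(ordered, pct):
    """Nearest-rank percentile of an already sorted list."""
    return ordered[max(math.ceil(pct * len(ordered) / 100), 1) - 1]

def hedge_overhead(latencies_ms, hedge_delay_ms):
    """Fraction of requests still outstanding when the hedge timer fires,
    i.e. the extra requests sent per original request."""
    slow = sum(1 for latency in latencies_ms if latency > hedge_delay_ms)
    return slow / len(latencies_ms)

# Assumed sample: one request at each latency from 1ms to 100ms.
latencies_ms = list(range(1, 101))
ordered = sorted(latencies_ms)

overhead_by_trigger = {
    pct: hedge_overhead(latencies_ms, nearest_rank(ordered, pct))
    for pct in (50, 90, 95)
}
timeout_ms = 2 * nearest_rank(ordered, 99)  # the "about 2x p99" rule of thumb

if __name__ == "__main__":
    for pct, extra in overhead_by_trigger.items():
        print(f"hedge at p{pct}: {extra:.0%} extra requests")
    print(f"timeout: {timeout_ms} ms")
```

Hedging at the p50 of this sample duplicates half of all requests, while hedging at p95 duplicates only 5%, which matches the percentages quoted in the table.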
Case Study

Google's Hedged Requests in BigTable

Scenario

Google's BigTable serves latency-sensitive applications like web search and ads. Individual BigTable tablet servers have p99 latency of ~5ms, but when a single request fans out to 100 tablets, the overall p99 degrades to 500ms+ because the request waits for the slowest tablet. This tail latency violated SLOs for latency-critical applications that depended on multi-tablet scans.

Solution

Google implemented hedged requests at the BigTable client level. After waiting for a configurable delay (typically the p95 of observed latency, e.g., 3ms), the client sends a duplicate request to a different replica of the same tablet. The first response wins, and the slower request is cancelled. The additional load from hedging was only 2-5% because most primary requests complete before the hedge fires.

Outcome

Hedged requests reduced BigTable's effective p99 latency from ~500ms to ~15ms for fan-out-100 reads, a 33x improvement. For comparison, the benchmark published in 'The Tail at Scale' (reading 1,000 keys spread across 100 BigTable servers, hedging after a 10ms delay) reports the p99.9 latency dropping from 1,800ms to 74ms while sending about 2% more requests. The technique was so effective that it was adopted across Google's storage stack, including Megastore and Spanner. The key insight was that a small amount of redundant work (2-5% extra RPCs) could eliminate the vast majority of tail latency, because most slow requests were caused by transient issues (GC pauses, disk seeks, network jitter) that affected only one replica at a time.
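Why a few percent of extra work removes most of the tail can be illustrated with a toy Monte Carlo model. Every number below is an assumption chosen for the sketch, not a measurement from BigTable: each replica answers in 5ms, except for an independent 1% chance of a 500ms stall; a request fans out to 100 tablets; and a hedge is sent to a second replica after 10ms.

```python
import random

FAST_MS, SLOW_MS, P_SLOW = 5.0, 500.0, 0.01  # assumed per-replica behaviour
FAN_OUT, HEDGE_DELAY_MS = 100, 10.0          # assumed fan-out and hedge delay

def replica_latency(rng):
    """One replica call: usually fast, occasionally stalled."""
    return SLOW_MS if rng.random() < P_SLOW else FAST_MS

def simulate(trials, hedged, seed=42):
    """Return (fraction of requests slower than 100ms, extra backend load)."""
    rng = random.Random(seed)
    slow_requests = 0
    backend_calls = 0
    for _ in range(trials):
        worst = 0.0
        for _ in range(FAN_OUT):
            latency = replica_latency(rng)
            backend_calls += 1
            if hedged and latency > HEDGE_DELAY_MS:
                # Primary is still outstanding at the hedge delay: ask another replica.
                backup = HEDGE_DELAY_MS + replica_latency(rng)
                backend_calls += 1
                latency = min(latency, backup)
            worst = max(worst, latency)  # the request waits for its slowest tablet
        if worst > 100.0:
            slow_requests += 1
    return slow_requests / trials, backend_calls / (trials * FAN_OUT) - 1.0

plain_slow, _ = simulate(2000, hedged=False)
hedged_slow, extra_load = simulate(2000, hedged=True)

if __name__ == "__main__":
    print(f"slow requests without hedging: {plain_slow:.1%}")
    print(f"slow requests with hedging:    {hedged_slow:.1%}")
    print(f"extra backend calls:           {extra_load:.1%}")
```

In this model a hedged tablet read is slow only when both replicas stall, so the share of slow requests falls from roughly 63% to around 1% for about 1% more backend calls. The model ignores correlated slowness and queueing, both of which make real systems less forgiving.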

Common Mistakes
  • Reporting average latency to stakeholders instead of percentiles. Averages hide the tail where user pain lives. Always report p50, p95, p99, and ideally p99.9 in dashboards and SLO reviews.
  • Measuring latency only from the server side, missing network transit time and client-side rendering. End-to-end latency (from user click to visible response) is what actually matters for user experience.
  • Falling victim to coordinated omission in load tests. If your load generator waits for a response before sending the next request, it under-counts slow requests. Use open-loop generators (wrk2, Gatling) that maintain a constant request rate regardless of response time.
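  • Ignoring tail latency amplification in fan-out architectures. If a request fans out to 50 microservices that each have a p99 of 100ms, about 39% of requests wait on at least one call slower than 100ms, so the end-to-end p99 is set by each service's far tail (roughly its p99.98) and can be many times higher than 100ms. Always calculate the fan-out amplification factor during design.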
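The design-time calculation mentioned in the last item can be run in reverse: given an end-to-end percentile target and a fan-out, which per-service percentile has to meet the latency budget? A minimal sketch, assuming independent services and a hypothetical function name:

```python
def required_backend_quantile(target_quantile: float, fan_out: int) -> float:
    """Per-backend quantile q such that q ** fan_out equals the end-to-end target.
    Assumes the request waits for all backends and that they are independent."""
    return target_quantile ** (1.0 / fan_out)

if __name__ == "__main__":
    for n in (1, 10, 50, 100):
        q = required_backend_quantile(0.99, n)
        print(f"fan-out {n:>3}: each backend must meet the budget at p{q * 100:.3f}")
```

For an end-to-end p99 target with a fan-out of 50, each service has to meet the latency budget at about its p99.98, which is why per-service p99 figures alone say little about the user-visible p99.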
Test Your Understanding

1. If a request fans out to 50 independent services, each with a 1% probability of being slow (>500ms), what is the approximate probability that the overall request experiences tail latency?

2. What is coordinated omission in latency measurement?

Deeper Reading