Distributed tracing propagates a unique trace ID through every service in a request's path, recording spans (timed operations) that form a causal DAG. It answers the question 'where did the time go?' for any individual request across a microservice architecture.
Distributed tracing emerged from Google's Dapper paper (2010) and Twitter's Zipkin (2012) to solve a fundamental problem: in a system with dozens or hundreds of microservices, a single user request fans out across many services, and traditional per-service metrics and logs cannot show the end-to-end picture. Tracing assigns a globally unique trace ID at the entry point and propagates it through every downstream call via HTTP headers or gRPC metadata. Each service creates a span -- a named, timed operation with a parent reference -- forming a DAG (directed acyclic graph) that represents the entire request lifecycle.
A typical trace for an e-commerce checkout might include spans for: API gateway authentication (3ms), order service validation (8ms), inventory service reservation (12ms), payment service authorization (150ms), and notification service email (async, 50ms). A trace like this shows that the payment service dominates latency and that the inventory call is synchronous but could be parallelized with payment. In a slower instance of the same request, the trace would also show whether a retry on the payment service added latency, for example an extra 300ms.
Context propagation is the mechanism that makes tracing possible. The W3C Trace Context standard defines two headers: `traceparent` (trace-id, parent-span-id, trace-flags) and `tracestate` (vendor-specific data). When service A calls service B, it injects these headers into the outgoing request. Service B extracts them and creates a child span linked to A's span. This works across HTTP, gRPC, Kafka (via message headers), and any protocol that supports key-value metadata.
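The extract-and-inject step can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a replacement for an OpenTelemetry propagator: the names `SpanContext`, `extract`, `inject`, and `new_span_id` are hypothetical, header keys are assumed to be lowercase already, and only version `00` of `traceparent` is accepted.

```python
import re
import secrets
from typing import NamedTuple, Optional

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str  # 32 hex chars, shared by every span in the trace
    span_id: str   # 16 hex chars, identifies one span
    sampled: bool  # bit 0 of trace-flags


def extract(headers: dict) -> Optional[SpanContext]:
    """Receiving side: parse traceparent, or return None if absent or invalid."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero trace or span IDs are invalid per the spec.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def new_span_id() -> str:
    """8 random bytes rendered as 16 hex characters."""
    return secrets.token_hex(8)


def inject(ctx: SpanContext, headers: dict) -> None:
    """Calling side: write the current span's context into outgoing headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


# Service B receives a request from A, starts a child span, and calls service C.
incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
parent = extract(incoming)
child = SpanContext(parent.trace_id, new_span_id(), parent.sampled)
outgoing = {}
inject(child, outgoing)
```

The child keeps the parent's trace ID and sampled flag and changes only the span ID, which is what lets a backend stitch spans from different services into one trace.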
The economics of tracing require sampling. A service handling 100K RPS with an average trace depth of 8 spans generates 800K spans/second -- roughly 70 billion spans/day. At ~500 bytes per span, that is 35 TB/day of raw trace data. Head-based sampling (decide at the entry point, propagate the decision) reduces this to 1-5% at the cost of missing rare slow requests. Tail-based sampling (buffer all spans briefly, then decide which complete traces to keep) captures anomalies but requires an intermediate collector with significant memory. Most production systems use a hybrid: 1% head-based for baseline coverage plus tail-based capture of all errors and high-latency requests.
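The volume arithmetic above is easy to reproduce. The sketch below uses the paragraph's own figures (100K RPS, 8 spans per trace, ~500 bytes per span); the function name `daily_trace_volume` and the `sample_rate` parameter are illustrative assumptions, and decimal units (1 TB = 10^12 bytes) are used.

```python
SECONDS_PER_DAY = 86_400


def daily_trace_volume(rps, spans_per_trace, bytes_per_span, sample_rate=1.0):
    """Return (spans per day, bytes per day) at the given sampling rate."""
    spans_per_day = rps * spans_per_trace * SECONDS_PER_DAY * sample_rate
    return spans_per_day, spans_per_day * bytes_per_span


# Unsampled: every span is kept.
spans, size = daily_trace_volume(100_000, 8, 500)
print(f"{spans / 1e9:.1f} billion spans/day, {size / 1e12:.1f} TB/day")
# prints "69.1 billion spans/day, 34.6 TB/day"

# 1% head-based sampling scales both numbers linearly.
spans_1pct, size_1pct = daily_trace_volume(100_000, 8, 500, sample_rate=0.01)
print(f"{size_1pct / 1e9:.1f} GB/day at 1% sampling")
# prints "345.6 GB/day at 1% sampling"
```

The exact results (69.1 billion spans and 34.6 TB per day) round to the 70 billion and 35 TB quoted above.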
Following a Search Request
A user searches for 'running shoes'. The API gateway creates a root span and generates trace ID abc-123. It calls the search service (child span), which calls Elasticsearch (child span, 45ms) and the recommendation service (child span, 30ms) in parallel. The recommendation service calls the user-profile cache (child span, 2ms hit). Total trace duration: 82ms. The trace waterfall shows that search and recommendations ran in parallel (good), but Elasticsearch took 45ms of the 82ms total (optimization target). Without tracing, you would only see the 82ms total from the gateway metric -- no breakdown.
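As a rough sketch of how a backend might turn these spans into a waterfall, the Python below models each span with a parent reference and prints the tree with each span's share of the root duration. The durations for the gateway, Elasticsearch, recommendation, and cache spans come from the example above; the start offsets, the search-service duration, and the names `Span`, `waterfall`, and `ran_in_parallel` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: float     # offset from the start of the trace (assumed values)
    duration_ms: float


spans = [
    Span("a", None, "api-gateway", 0, 82),
    Span("b", "a", "search-service", 3, 76),   # duration assumed
    Span("c", "b", "elasticsearch", 6, 45),
    Span("d", "b", "recommendation-service", 6, 30),
    Span("e", "d", "user-profile-cache", 8, 2),
]


def waterfall(spans):
    """Return indented lines, one per span, ordered parent-first by start time."""
    children, root = {}, None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            children.setdefault(s.parent_id, []).append(s)

    lines = []

    def walk(span, depth):
        share = 100 * span.duration_ms / root.duration_ms
        lines.append(f"{'  ' * depth}{span.name}: {span.duration_ms:g}ms ({share:.0f}%)")
        for child in sorted(children.get(span.span_id, []), key=lambda c: c.start_ms):
            walk(child, depth + 1)

    walk(root, 0)
    return lines


def ran_in_parallel(a, b):
    """True if the two spans' time intervals overlap."""
    return a.start_ms < b.start_ms + b.duration_ms and b.start_ms < a.start_ms + a.duration_ms


print("\n".join(waterfall(spans)))
print("ES and recommendations overlap:", ran_in_parallel(spans[2], spans[3]))
```

With these inputs the Elasticsearch line reads `elasticsearch: 45ms (55%)`, which is the breakdown a gateway-level metric alone cannot provide.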
Google (Dapper)
Dapper instruments RPCs across Google's production fleet through shared libraries and then samples which traces to record. The paper describes a uniform rate of 1 in 1,024 traces for high-throughput services and an adaptive scheme that raises the rate for low-traffic workloads. Dapper was designed for low overhead: the paper reports that the collection daemon uses less than 0.3% of one CPU core and that trace data accounts for less than 0.01% of production network traffic. The Dapper paper established the trace/span model used by most modern systems.
Uber (Jaeger)
Uber built Jaeger (now CNCF graduated) to trace requests across 4,000+ microservices. Jaeger processes millions of spans per second using Kafka as a buffer and Elasticsearch or Cassandra as storage. Uber uses adaptive sampling that increases the sample rate for low-traffic services and decreases it for high-traffic ones, ensuring every service has sufficient trace coverage.
Shopify
Shopify uses distributed tracing across their Ruby on Rails monolith and surrounding microservices to debug Black Friday performance issues. By tracing checkout requests end-to-end, they identified that a single slow Redis call in the tax calculation path was causing cascading timeouts. The trace showed the Redis call taking 800ms due to a hot key, invisible in aggregate metrics because it affected only 0.1% of requests.
| Aspect | Description |
|---|---|
| Sampling Rate vs. Debug Coverage | Higher sampling captures more anomalies but increases storage cost linearly. 1% head-based sampling captures only about 1 in 100 errors and misses the rest. Tail-based sampling captures all errors but requires a collector buffer (typically 30-60s of spans in memory). |
| Automatic vs. Manual Instrumentation | Auto-instrumentation (OTel agents) is zero-effort but captures only framework-level spans (HTTP, DB, gRPC). Business-logic spans (e.g., 'validate_coupon') require manual instrumentation. Best practice: auto-instrument for coverage, manually instrument critical business paths. |
| Trace Depth vs. Performance Overhead | Deep traces (20+ spans) provide fine-grained visibility but add overhead: context propagation per call, span creation, and export. In latency-critical paths (<1ms), even microsecond overhead per span matters. Limit trace depth or use async export. |
| Centralized vs. Distributed Backends | Centralized backends (Jaeger with Elasticsearch) simplify querying but can become a single point of failure and a storage bottleneck. Distributed backends (Tempo with object storage) scale better, but query results are only eventually consistent. |
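The sampling-rate trade-off in the first row can be quantified. Assuming each request is sampled independently with probability p at the entry point, the chance that at least one of n occurrences of an error is traced is 1 - (1 - p)^n. The sketch below evaluates that formula; the function name `capture_probability` and the example counts are illustrative assumptions.

```python
def capture_probability(sample_rate: float, error_count: int) -> float:
    """Probability that head-based sampling keeps at least one of the error traces."""
    return 1 - (1 - sample_rate) ** error_count


# At 1% head-based sampling, an error that occurs 10 times is rarely traced at all.
for n in (10, 100, 500):
    print(f"{n} errors -> {capture_probability(0.01, n):.1%} chance of at least one trace")
```

At a 1% rate, 10 occurrences give roughly a 9.6% chance of capturing even one trace, and 100 occurrences give roughly 63.4%. A tail-based policy that keeps every error trace makes this probability 1 regardless of n, at the cost of the collector buffer.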
Pinterest Reduces MTTD by 60% with Tail-Based Sampling
Scenario
Pinterest's initial head-based sampling at 0.1% meant that rare but impactful errors in their ad serving pipeline were almost never captured in traces. During a revenue-impacting incident, engineers had metrics showing elevated error rates but no traces to diagnose the root cause.
Solution
They deployed an OpenTelemetry Collector with tail-based sampling that buffers spans for 45 seconds and keeps all error traces plus the slowest 5% of traces. Trace storage reportedly grew by only about 3x while the effective sampling rate rose from 0.1% to roughly 5%, and the new policy captured 100% of error cases.
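The keep-or-drop policy described here (all error traces plus the slowest 5%) can be sketched in plain Python. This is not the OpenTelemetry Collector's implementation or configuration: the function `select_traces`, the span dictionary fields, and the choice to approximate trace duration by the longest span are assumptions, and the function operates on a batch of spans whose buffer window has already expired rather than on a live stream.

```python
from collections import defaultdict


def select_traces(spans, slow_fraction=0.05):
    """Return the set of trace IDs to keep from one expired buffer window.

    spans: iterable of dicts with keys "trace_id", "duration_ms", "error".
    """
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    if not traces:
        return set()

    # Policy 1: keep every trace that contains at least one error span.
    keep = {tid for tid, group in traces.items() if any(s["error"] for s in group)}

    # Policy 2: keep the slowest fraction of traces in the window.
    # Trace duration is approximated by its longest span (normally the root).
    durations = {tid: max(s["duration_ms"] for s in group) for tid, group in traces.items()}
    slow_count = max(1, int(len(traces) * slow_fraction))
    keep.update(sorted(durations, key=durations.get, reverse=True)[:slow_count])
    return keep


# 20 traces of increasing duration; only trace t3 contains an error.
window = []
for i in range(1, 21):
    window.append({"trace_id": f"t{i}", "duration_ms": i * 10, "error": False})
    window.append({"trace_id": f"t{i}", "duration_ms": 5, "error": i == 3})

print(sorted(select_traces(window)))  # prints ['t20', 't3']
```

The decision is made per trace, not per span, which is why the collector has to hold every span until the window closes: an error in a late child span must still cause the earlier spans of that trace to be kept.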
Outcome
MTTD for ad-serving issues dropped from 25 minutes to 10 minutes because engineers could immediately drill from an error rate alert to a correlated trace showing the exact failing downstream dependency.
1. What is the primary advantage of tail-based sampling over head-based sampling?
2. Why is it critical that the sampling decision is propagated from parent to child spans?