Distributed Tracing

Distributed tracing propagates a unique trace ID through every service in a request's path, recording spans (timed operations) that form a causal DAG. It answers the question 'where did the time go?' for any individual request across a microservice architecture.

Overview

Distributed tracing emerged from Google's Dapper paper (2010) and Twitter's Zipkin (2012) to solve a fundamental problem: in a system with dozens or hundreds of microservices, a single user request fans out across many services, and traditional per-service metrics and logs cannot show the end-to-end picture. Tracing assigns a globally unique trace ID at the entry point and propagates it through every downstream call via HTTP headers or gRPC metadata. Each service creates a span -- a named, timed operation with a parent reference -- forming a DAG (directed acyclic graph) that represents the entire request lifecycle.

A typical trace for an e-commerce checkout might include spans for: API gateway authentication (3ms), order service validation (8ms), inventory service reservation (12ms), payment service authorization (150ms), and notification service email (async, 50ms). The trace reveals that the payment service dominates latency, that the inventory call is synchronous but could be parallelized with payment, and that a retry on the payment service added 300ms.

Context propagation is the mechanism that makes tracing possible. The W3C Trace Context standard defines two headers: `traceparent` (version, trace-id, parent-id, trace-flags) and `tracestate` (vendor-specific data). When service A calls service B, it injects these headers into the outgoing request. Service B extracts them and creates a child span whose parent is A's span. This works across HTTP, gRPC, Kafka (via message headers), and any protocol that supports key-value metadata.
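The inject/extract cycle described above can be sketched with the standard library alone. This is a minimal illustration, not a conforming implementation: the names `SpanContext`, `extract`, `start_span`, and `inject` are hypothetical, only version `00` of `traceparent` is handled, `tracestate` is ignored, and marking every new root as sampled is an assumption made for brevity.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version "00" layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str  # 32 hex chars (128 bits)
    span_id: str   # 16 hex chars (64 bits)
    flags: int     # bit 0 is the "sampled" flag


def extract(headers: dict) -> Optional[SpanContext]:
    """Read the caller's context from incoming headers, or None if absent/invalid."""
    match = _TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid in W3C Trace Context.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return SpanContext(trace_id, span_id, int(flags, 16))


def start_span(parent: Optional[SpanContext]) -> SpanContext:
    """Create a root span (new trace ID) or a child span (same trace ID, new span ID)."""
    if parent is None:
        # Assumption for this sketch: every new root is marked sampled.
        return SpanContext(secrets.token_hex(16), secrets.token_hex(8), 0x01)
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.flags)


def inject(ctx: SpanContext, headers: dict) -> None:
    """Write the current span's context into outgoing request headers."""
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{ctx.flags:02x}"


# Service A starts a trace and calls service B.
span_a = start_span(None)
outgoing = {}
inject(span_a, outgoing)

# Service B continues the same trace with its own span ID.
span_b = start_span(extract(outgoing))
```

In this sketch, service B's span shares `span_a.trace_id`, and the span ID it extracted from the header is what it would record as its parent.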

The economics of tracing require sampling. A service handling 100K RPS with an average trace depth of 8 spans generates 800K spans/second -- roughly 70 billion spans/day. At ~500 bytes per span, that is 35 TB/day of raw trace data. Head-based sampling (decide at the entry point, propagate the decision) reduces this to 1-5% at the cost of missing rare slow requests. Tail-based sampling (buffer all spans briefly, then decide which complete traces to keep) captures anomalies but requires an intermediate collector with significant memory. Most production systems use a hybrid: 1% head-based for baseline coverage plus tail-based capture of all errors and high-latency requests.
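The arithmetic above is easy to rerun for other workloads. The helper below is a back-of-the-envelope sketch; the function name and its parameters are illustrative, and it assumes a constant request rate and a fixed average span size.

```python
SECONDS_PER_DAY = 86_400


def daily_trace_volume(rps, spans_per_trace, bytes_per_span, sample_rate=1.0):
    """Return (spans per day, terabytes per day) for a steady workload."""
    spans_per_day = rps * spans_per_trace * sample_rate * SECONDS_PER_DAY
    terabytes_per_day = spans_per_day * bytes_per_span / 1e12
    return spans_per_day, terabytes_per_day


# Unsampled: the 100K RPS, 8 spans/trace, ~500 bytes/span case from the text.
spans, tb = daily_trace_volume(100_000, 8, 500)
print(f"{spans / 1e9:.1f} billion spans/day, {tb:.1f} TB/day")

# The same workload with 1% head-based sampling.
sampled_spans, sampled_tb = daily_trace_volume(100_000, 8, 500, sample_rate=0.01)
print(f"{sampled_spans / 1e9:.2f} billion spans/day, {sampled_tb:.2f} TB/day")
```

The unsampled case works out to about 69 billion spans and about 35 TB per day, matching the rounded figures above; 1% sampling brings that down to roughly a third of a terabyte per day.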

Key Points
  • A trace is identified by a 128-bit trace ID. Each span within a trace has its own span ID and a parent span ID, forming a tree. Root spans have no parent. The trace ID is propagated via W3C Trace Context headers (traceparent, tracestate).
  • Spans record: service name, operation name, start timestamp, duration, status (OK/ERROR), and arbitrary attributes (user_id, http.method, db.statement). Span events (logs attached to spans) replace traditional log lines with trace-correlated context.
  • Instrumentation can be automatic (agent/library hooks for HTTP clients, DB drivers, gRPC) or manual (developer adds spans for business logic). OpenTelemetry provides both modes for Java, Python, Go, Node.js, and .NET, among other languages.
  • Head-based sampling decides at the trace root whether to sample. It is cheap and propagates the decision downstream, but misses rare anomalies. At a 1% sample rate, an error that occurs in 1 of every 10,000 requests is captured only about once per million requests, so it may never appear in a retained trace during an incident.
  • Tail-based sampling decides after all spans arrive. A collector buffers spans for a short window (30-60s), then keeps traces matching rules (error, high latency, specific user). It captures anomalies but requires significant memory and adds infrastructure complexity.
  • Trace-metric correlation via exemplars allows clicking from a metric spike (e.g., p99 latency) directly to a sample trace that contributed to that spike. Prometheus and Grafana Mimir support exemplar storage.
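To make the span fields and the head-based decision concrete, here is a small sketch. The `Span` class and `head_sample` function are hypothetical names, not any library's API. The sampler derives its decision from the trace ID itself, which is one common way to make the decision repeatable: any service applying the same rate to the same trace ID reaches the same answer.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str                  # 32 hex chars, shared by every span in the trace
    span_id: str                   # 16 hex chars, unique per span
    parent_span_id: Optional[str]  # None for the root span
    service: str
    operation: str
    start_ms: float
    duration_ms: float
    status: str = "OK"             # "OK" or "ERROR"
    attributes: dict = field(default_factory=dict)


def head_sample(trace_id: str, rate: float) -> bool:
    """Keep the trace if the low 64 bits of its ID fall below rate * 2**64."""
    low64 = int(trace_id[-16:], 16)
    return low64 < int(rate * 2**64)


root = Span(
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    span_id="00f067aa0ba902b7",
    parent_span_id=None,
    service="api-gateway",
    operation="GET /search",
    start_ms=0.0,
    duration_ms=82.0,
    attributes={"http.method": "GET"},
)
keep = head_sample(root.trace_id, rate=0.01)
```

Because trace IDs are random, about 1% of traces pass the check at `rate=0.01`; the decision would then travel downstream in the trace-flags field rather than being recomputed.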
Simple Example

Following a Search Request

A user searches for 'running shoes'. The API gateway creates a root span and generates trace ID abc-123. It calls the search service (child span), which calls Elasticsearch (child span, 45ms) and the recommendation service (child span, 30ms) in parallel. The recommendation service calls the user-profile cache (child span, 2ms hit). Total trace duration: 82ms. The trace waterfall shows that search and recommendations ran in parallel (good), but Elasticsearch took 45ms of the 82ms total (optimization target). Without tracing, you would only see the 82ms total from the gateway metric -- no breakdown.
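The waterfall for this request can be sketched as a few records and a text renderer. The durations for Elasticsearch, the recommendation service, the cache, and the total come from the example; the start offsets and the search-service duration are assumptions chosen only so the spans nest plausibly, and the `Span`, `overlaps`, and `waterfall` names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    name: str
    parent: Optional[str]  # name of the parent span, None for the root
    start_ms: int          # offset from the start of the trace (assumed values)
    duration_ms: int

    @property
    def end_ms(self) -> int:
        return self.start_ms + self.duration_ms


TRACE = [
    Span("api-gateway", None, 0, 82),
    Span("search-service", "api-gateway", 2, 78),
    Span("elasticsearch", "search-service", 5, 45),
    Span("recommendation-service", "search-service", 5, 30),
    Span("user-profile-cache", "recommendation-service", 6, 2),
]


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans were in flight at the same time."""
    return a.start_ms < b.end_ms and b.start_ms < a.end_ms


def depth(span: Span, by_name: dict) -> int:
    """Number of ancestors between this span and the root."""
    level = 0
    while span.parent is not None:
        span = by_name[span.parent]
        level += 1
    return level


def waterfall(spans, width=41):
    """Render one text row per span, indented by depth and positioned by start time."""
    by_name = {s.name: s for s in spans}
    root = next(s for s in spans if s.parent is None)
    scale = width / root.duration_ms
    rows = []
    for s in sorted(spans, key=lambda s: s.start_ms):
        label = "  " * depth(s, by_name) + s.name
        pad = " " * int(s.start_ms * scale)
        bar = "#" * max(1, int(s.duration_ms * scale))
        rows.append(f"{label:<28}{pad}{bar} {s.duration_ms}ms")
    return rows


for row in waterfall(TRACE):
    print(row)
```

The overlap check confirms that the Elasticsearch and recommendation spans ran concurrently, and dividing the Elasticsearch duration by the root duration shows it accounts for a little over half of the request.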

Real-World Examples

Google (Dapper)

Dapper instruments nearly every RPC in Google's production fleet through shared libraries. The Dapper paper describes a uniform sampling rate of 1 in 1,024 traces for high-throughput services and an adaptive scheme that raises the rate for low-traffic services. Low overhead was a primary design goal: the paper reports that the collection daemon uses less than 0.3% of one CPU core and that trace data accounts for less than 0.01% of production network traffic. The paper established the trace/span model that most modern tracing systems follow.

Uber (Jaeger)

Uber built Jaeger (now CNCF graduated) to trace requests across 4,000+ microservices. Jaeger processes millions of spans per second using Kafka as a buffer and Elasticsearch or Cassandra as storage. Uber uses adaptive sampling that increases the sample rate for low-traffic services and decreases it for high-traffic ones, ensuring every service has sufficient trace coverage.
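The idea behind adaptive sampling can be sketched as a target-throughput rule: pick the rate that yields roughly a fixed number of sampled traces per second, whatever the traffic level. This is an illustration of the general idea only, not Jaeger's or Dapper's actual algorithm; the function name, the default target, and the clamp bounds are all assumptions.

```python
def adaptive_rate(observed_rps, target_traces_per_sec=1.0,
                  min_rate=1 / 1024, max_rate=1.0):
    """Sampling probability that yields about target_traces_per_sec sampled traces."""
    if observed_rps <= 0:
        return max_rate  # no traffic observed yet: sample everything
    rate = target_traces_per_sec / observed_rps
    return max(min_rate, min(max_rate, rate))


for rps in (0.5, 100, 1_000_000):
    print(f"{rps:>9} rps -> sample rate {adaptive_rate(rps):.6f}")
```

A low-traffic service (0.5 RPS) is sampled at 100%, a 100 RPS service at 1%, and a very busy service is held at the floor, so every service still contributes traces.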

Shopify

Shopify uses distributed tracing across their Ruby on Rails monolith and surrounding microservices to debug Black Friday performance issues. By tracing checkout requests end-to-end, they identified that a single slow Redis call in the tax calculation path was causing cascading timeouts. The trace showed the Redis call taking 800ms due to a hot key, invisible in aggregate metrics because it affected only 0.1% of requests.

Trade-Offs
| Aspect | Description |
| --- | --- |
| Sampling Rate vs. Debug Coverage | Higher sampling captures more anomalies but increases storage cost linearly. 1% head-based sampling drops about 99% of error traces. Tail-based sampling captures all errors but requires a collector buffer (typically 30-60s of spans in memory). |
| Automatic vs. Manual Instrumentation | Auto-instrumentation (OTel agents) is close to zero-effort but captures only framework-level spans (HTTP, DB, gRPC). Business-logic spans (e.g., 'validate_coupon') require manual instrumentation. Best practice: auto-instrument for coverage, manually instrument critical business paths. |
| Trace Depth vs. Performance Overhead | Deep traces (20+ spans) provide fine-grained visibility but add overhead: context propagation per call, span creation, and export. In latency-critical paths (<1ms), even microsecond overhead per span matters. Limit trace depth or use async export. |
| Centralized vs. Distributed Backends | Centralized backends (Jaeger with Elasticsearch) simplify querying but create a single point of failure and a storage bottleneck. Distributed backends (Tempo with object storage) scale better, but query results are only eventually consistent. |
Case Study

Pinterest Reduces MTTD by 60% with Tail-Based Sampling

Scenario

Pinterest's initial head-based sampling at 0.1% meant that rare but impactful errors in their ad serving pipeline were almost never captured in traces. During a revenue-impacting incident, engineers had metrics showing elevated error rates but no traces to diagnose the root cause.

Solution

They deployed an OpenTelemetry Collector with tail-based sampling that buffers spans for 45 seconds and keeps all error traces plus the slowest 5% of traces. Trace storage reportedly grew by only about 3x even though the effective sampling rate rose from 0.1% to ~5%, and 100% of error cases were captured.
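The keep rule described here (every error trace plus the slowest 5%) can be sketched as a decision over one closed buffering window. This is a simplified illustration, not the OpenTelemetry Collector's implementation: the `tail_sample` function and the span dictionary keys are hypothetical, and a real collector would also handle late spans, memory limits, and routing all spans of a trace to the same instance.

```python
from collections import defaultdict


def tail_sample(spans, slow_fraction=0.05):
    """Return the set of trace IDs to keep from one buffered window of spans.

    Each span is a dict with trace_id, start_ms, duration_ms, and status.
    """
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    if not traces:
        return set()

    def trace_duration(trace_spans):
        start = min(s["start_ms"] for s in trace_spans)
        end = max(s["start_ms"] + s["duration_ms"] for s in trace_spans)
        return end - start

    # Rule 1: keep every trace that contains at least one error span.
    keep = {
        trace_id
        for trace_id, trace_spans in traces.items()
        if any(s["status"] == "ERROR" for s in trace_spans)
    }

    # Rule 2: keep the slowest fraction of traces (at least one per window).
    durations = {tid: trace_duration(ss) for tid, ss in traces.items()}
    n_slow = max(1, int(len(traces) * slow_fraction))
    slowest = sorted(durations, key=durations.get, reverse=True)[:n_slow]
    keep.update(slowest)
    return keep


# 20 single-span traces; trace t1 is fast but failed, t20 is the slowest.
window = [
    {"trace_id": f"t{i}", "start_ms": 0, "duration_ms": i * 10,
     "status": "ERROR" if i == 1 else "OK"}
    for i in range(1, 21)
]
kept = tail_sample(window)
```

The fast-but-failed trace is retained alongside the slowest one, which is exactly the case a 0.1% head-based sampler would almost always drop.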

Outcome

MTTD (mean time to detect) for ad-serving issues dropped from 25 minutes to 10 minutes because engineers could immediately drill from an error rate alert to a correlated trace showing the exact failing downstream dependency.

Common Mistakes
  • ⚠ Not propagating trace context through async boundaries: Traces break at message queue boundaries (Kafka, SQS) because the consumer creates a new trace, losing the end-to-end picture. Inject trace context into message headers on produce and extract on consume; OpenTelemetry's Kafka instrumentation does this automatically, with the consumer span linked to the producer span to preserve causality.
  • ⚠ Using trace IDs as the sole debugging tool: Traces show timing and structure but not content -- they tell you a DB query took 500ms but not which query or why it was slow. Add semantic attributes to spans (db.statement, http.url, user.id, cache.hit) to turn traces from timing diagrams into rich debugging context, and use span events for inline logs.
  • ⚠ Storing all traces indefinitely: Trace data grows with traffic volume. At 1% sampling and 50K RPS you retain about 500 traces/second, or roughly 43 million traces/day (~1.3 billion/month), making 90-day retention a multi-TB, multi-thousand-dollar problem. Set aggressive retention policies (7-14 days for full traces, 30-90 days for error traces only) and use trace-to-metrics pipelines to derive durable aggregate data from ephemeral trace data.
  • ⚠ Sampling at the wrong layer: If each service independently decides whether to sample, a trace might be partially captured (spans from service A but not B), making it useless for identifying the slow service. Make the sampling decision at the entry point (head-based) and propagate it via the trace-flags field in W3C Trace Context so all downstream services respect the decision.
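The first and last mistakes share a fix: carry the `traceparent` value across the boundary and honor its sampled flag instead of deciding again. The sketch below models a message as a plain dict with Kafka-style headers (a list of key/bytes pairs); `produce` and `consume` are hypothetical helpers, not a client library's API. For simplicity the consumer span is made a direct child of the producer span, whereas real instrumentation may attach it as a span link.

```python
import secrets


def produce(payload: bytes, trace_id: str, span_id: str, sampled: bool) -> dict:
    """Build a message whose headers carry the producer's trace context."""
    flags = "01" if sampled else "00"
    traceparent = f"00-{trace_id}-{span_id}-{flags}"
    return {
        "value": payload,
        "headers": [("traceparent", traceparent.encode("ascii"))],
    }


def consume(message: dict) -> dict:
    """Start the consumer span, continuing the producer's trace when possible."""
    headers = dict(message["headers"])
    raw = headers.get("traceparent")
    if raw is None:
        # No context on the message: the trace breaks here and a new one starts.
        return {
            "trace_id": secrets.token_hex(16),
            "span_id": secrets.token_hex(8),
            "parent_span_id": None,
            "sampled": True,  # assumption: new roots are sampled in this sketch
        }
    _version, trace_id, parent_span_id, flags = raw.decode("ascii").split("-")
    return {
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),
        "parent_span_id": parent_span_id,
        # Respect the upstream decision instead of sampling independently.
        "sampled": bool(int(flags, 16) & 0x01),
    }


message = produce(b'{"order_id": 42}', "a" * 32, "b" * 16, sampled=False)
consumer_span = consume(message)
```

Here the consumer stays in the producer's trace and, because the upstream flag said "not sampled", records nothing for this trace, so the trace is never half-captured.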
Test Your Understanding

1. What is the primary advantage of tail-based sampling over head-based sampling?

2. Why is it critical that the sampling decision is propagated from parent to child spans?
