Observability rests on three complementary signal types: logs (discrete events), metrics (aggregated measurements), and traces (request-scoped causal chains). Together they let operators answer 'what is happening, how much, and why' in distributed systems without deploying new code.
The term 'three pillars of observability' was popularized by the CNCF and observability vendors in the late 2010s to describe the three fundamental signal types needed to understand complex distributed systems: logs, metrics, and traces. While each pillar existed long before the term was coined -- syslog dates to the 1980s, RRDtool to the 1990s, and Dapper-style tracing to 2010 -- the insight was that no single signal type is sufficient for modern microservice architectures.
Logs are discrete, timestamped records of events: 'user 12345 logged in at 14:03:07', 'query X took 347ms', 'connection pool exhausted'. They are the richest signal -- you can include arbitrary context -- but they are also the most expensive at scale. A service handling 50K RPS generating 2 log lines per request produces 100K lines/second, or ~8.6 billion lines/day. Without structure and indexing, finding the relevant needle in that haystack is impractical.
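The volume arithmetic above, together with what "structure" means for a log line, can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the helper names `log_event` and `lines_per_day` and the field names are assumptions made for this example, not part of any particular logging framework.

```python
import json
import time

def log_event(event: str, **fields) -> str:
    """Return one JSON-encoded log line with a timestamp and arbitrary context."""
    record = {"ts": time.time(), "event": event, **fields}
    return json.dumps(record, sort_keys=True)

def lines_per_day(rps: int, lines_per_request: int) -> int:
    """Log lines emitted per day at a constant request rate."""
    seconds_per_day = 24 * 60 * 60  # 86,400
    return rps * lines_per_request * seconds_per_day

# A structured line can be filtered by field instead of grepped as free text.
line = log_event("query_finished", user_id=12345, duration_ms=347)
print(line)

# 50K RPS at 2 lines per request, as in the paragraph above.
print(lines_per_day(50_000, 2))  # 8640000000
```

Emitting each line as JSON is one common way to make fields such as `duration_ms` indexable; other encodings (logfmt, protobuf) serve the same purpose.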
Metrics are pre-aggregated numeric time series: request count and error count (counters, from which rates are derived), latency p50/p95/p99 (histogram), queue depth (gauge), CPU utilization (gauge). They are extremely cheap to collect, store, and query because aggregation reduces data volume by orders of magnitude. A single metric series for 'request latency p99' costs the same whether the service handles 100 or 100K RPS. The trade-off is loss of individual request detail.
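To see why a metric's cost does not grow with traffic, consider a fixed-bucket latency histogram. The sketch below is a simplified, assumed implementation (the class name `LatencyHistogram` and the bucket bounds are invented for illustration); real metric libraries differ in detail, but the key property is the same: storage is one counter per bucket, however many requests are observed.

```python
import bisect

class LatencyHistogram:
    """Fixed-bucket histogram: memory use depends on bucket count, not traffic."""

    def __init__(self, upper_bounds_ms):
        self.bounds = sorted(upper_bounds_ms)
        # One counter per bucket, plus an overflow bucket above the last bound.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0

    def observe(self, latency_ms: float) -> None:
        # bisect_left finds the first bucket whose upper bound is >= latency_ms.
        self.counts[bisect.bisect_left(self.bounds, latency_ms)] += 1
        self.total += 1

    def quantile_upper_bound(self, q: float) -> float:
        """Upper bound of the bucket containing the q-th quantile (an estimate)."""
        target = q * self.total
        running = 0
        for i, count in enumerate(self.counts):
            running += count
            if running >= target:
                return self.bounds[i] if i < len(self.bounds) else float("inf")
        return float("inf")

hist = LatencyHistogram([50, 100, 200, 500, 1000])
for i in range(100_000):
    hist.observe(100 + (i % 100))  # synthetic latencies between 100 and 199 ms

print(len(hist.counts))                 # 6 counters, regardless of request count
print(hist.quantile_upper_bound(0.99))  # 200
```

The individual latencies are gone after `observe` returns, which is exactly the loss of per-request detail described above.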
Traces are request-scoped DAGs that follow a single operation across service boundaries. A trace for 'GET /api/orders/123' might span the API gateway, auth service, order service, inventory service, and database. Each span records the service, operation, duration, and metadata. Traces answer the questions metrics cannot: 'this request was slow -- where did the time go?' They are sampled (typically 1-10% of traffic) because storing a trace per request at scale is prohibitively expensive.
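The question "where did the time go?" can be illustrated with a toy span tree. The following sketch assumes a simplified span record (the `Span` dataclass, the service names, and the durations are all made up for this example) and assumes child spans run sequentially, so a span's own time is its duration minus that of its direct children.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str
    duration_ms: float

def self_time_by_service(spans: list[Span]) -> dict[str, float]:
    """Time each service spent in its own spans, excluding direct children.

    Assumes children run sequentially, so their durations can be subtracted.
    """
    child_total: dict[str, float] = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0.0) + s.duration_ms
    result: dict[str, float] = {}
    for s in spans:
        own = s.duration_ms - child_total.get(s.span_id, 0.0)
        result[s.service] = result.get(s.service, 0.0) + own
    return result

# Hypothetical trace for 'GET /api/orders/123'.
trace = [
    Span("a", None, "api-gateway", "GET /api/orders/123", 120.0),
    Span("b", "a", "auth-service", "verify_token", 15.0),
    Span("c", "a", "order-service", "get_order", 95.0),
    Span("d", "c", "inventory-service", "check_stock", 20.0),
    Span("e", "c", "database", "SELECT order", 60.0),
]

slowest = max(self_time_by_service(trace).items(), key=lambda kv: kv[1])
print(slowest)  # ('database', 60.0)
```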
The modern synthesis treats all three as facets of the same data. Correlation IDs (trace IDs) link a trace to its log lines and to the metrics it contributed to. OpenTelemetry provides a vendor-neutral SDK that emits all three signal types from a single instrumentation point, and backends like Grafana Tempo + Loki + Mimir or Datadog unify them in a single query experience.
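As a rough illustration of how a correlation ID ends up on log lines, the sketch below stamps every log record with the trace ID of the request being handled. It uses only Python's standard library (`contextvars` and `logging`), not the OpenTelemetry SDK; the names `current_trace_id`, `TraceIdFilter`, and `handle_request` are assumptions made for this example.

```python
import contextvars
import io
import logging
import uuid

# Holds the trace ID for the request currently being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True

buffer = io.StringIO()  # stands in for a real log sink
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

def handle_request() -> str:
    trace_id = uuid.uuid4().hex  # 32 hex characters
    token = current_trace_id.set(trace_id)
    try:
        logger.info("charging card")
        return trace_id
    finally:
        # Restore the previous value so IDs do not leak between requests.
        current_trace_id.reset(token)

tid = handle_request()
print(buffer.getvalue().strip())
```

With the ID on every line, a log backend can be queried by `trace_id` to retrieve exactly the lines that belong to one trace.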
Debugging a Slow Checkout
A dashboard (metrics) shows checkout p99 latency spiked from 200ms to 2s. An engineer queries for traces with duration > 1s and finds that 95% of slow traces have a payment-service span taking 1.8s. Drilling into the payment service's logs filtered by those trace IDs reveals 'connection pool exhausted, waited 1.7s for available connection'. The fix: increase the payment service's DB connection pool. Metrics detected the problem, traces localized it, logs explained it.
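The trace-to-log pivot in this walkthrough can be mimicked on toy data. Everything in the sketch below is hypothetical: the trace summaries, the log records, and the helper names are invented to mirror the scenario, and a real backend would answer these queries through its own query language rather than in-process lists.

```python
from collections import Counter

# Hypothetical trace summaries: (trace_id, total_ms, service of the slowest span)
traces = [
    ("t1", 1900, "payment-service"),
    ("t2", 210, "order-service"),
    ("t3", 2100, "payment-service"),
    ("t4", 1500, "inventory-service"),
    ("t5", 1950, "payment-service"),
]

# Hypothetical log records carrying the same trace IDs.
logs = [
    {"trace_id": "t1", "service": "payment-service",
     "msg": "connection pool exhausted, waited 1.7s for available connection"},
    {"trace_id": "t2", "service": "order-service", "msg": "order loaded"},
    {"trace_id": "t3", "service": "payment-service",
     "msg": "connection pool exhausted, waited 1.7s for available connection"},
]

def slow_traces(traces, threshold_ms):
    """Step 2: keep only traces slower than the threshold."""
    return [t for t in traces if t[1] > threshold_ms]

def dominant_service(slow):
    """Which service owns the slowest span most often?"""
    return Counter(service for _, _, service in slow).most_common(1)[0]

def logs_for(logs, trace_ids, service):
    """Step 3: log messages from one service, restricted to the given traces."""
    wanted = set(trace_ids)
    return [r["msg"] for r in logs if r["trace_id"] in wanted and r["service"] == service]

slow = slow_traces(traces, threshold_ms=1000)
service, count = dominant_service(slow)
messages = logs_for(logs, [t[0] for t in slow], service)
print(service, count)  # payment-service 3
print(set(messages))
```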
Google (Dapper)
Google's Dapper paper (2010) introduced the trace/span model that most modern tracing systems follow. Dapper instruments every RPC in Google's production fleet but uses adaptive sampling to record only a fraction of traces, and it correlates traces with logs via a shared request ID. The paper showed that tracing roughly 1 in 1,000 requests was sufficient to diagnose most production issues, establishing the economic model for distributed tracing.
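One common way to implement a fixed-rate sampling decision such as "1 in 1,000" is to derive it from a hash of the trace ID, so that every service seeing the same ID reaches the same verdict without coordination. The sketch below illustrates that general idea only; it is not Dapper's actual implementation, and the function name `keep_trace` and the choice of SHA-256 are assumptions.

```python
import hashlib

def keep_trace(trace_id: str, one_in: int = 1000) -> bool:
    """Head-based sampling decision derived from the trace ID.

    Hashing the ID makes the decision deterministic: any service that
    sees the same trace ID keeps or drops the trace consistently.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % one_in
    return bucket == 0

# Roughly 1 in 1,000 synthetic trace IDs should be kept.
kept = sum(keep_trace(f"trace-{i}") for i in range(200_000))
print(kept)
```

The exact count printed depends on the hash values, but it should land near 200 for 200,000 IDs.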
Netflix
Netflix's telemetry stack combines Atlas (metrics), Edgar (traces), and a centralized log aggregation system. Atlas processes over 2 billion metrics per minute with a custom in-memory time-series database. Edgar uses distributed tracing to follow requests across 1,000+ microservices. Cross-pillar correlation via trace IDs is the primary debugging workflow during incidents.
Uber
Uber built Jaeger, an open-source distributed tracing system now part of the CNCF. Jaeger handles millions of spans per second across Uber's microservice fleet and uses adaptive sampling to control cost. Uber pairs Jaeger traces with M3 (metrics) and structured logging to provide a complete observability stack, with trace IDs as the universal correlation key.
| Aspect | Description |
|---|---|
| Detail vs. Cost | Logs and traces provide per-request detail but scale with traffic volume. Metrics aggregate away individual requests but remain cheap regardless of scale. Most teams use metrics for alerting and dashboards, traces for request-level debugging, and logs as the last-resort forensic tool. |
| Sampling vs. Completeness | Traces must be sampled at high throughput. Head-based sampling (decide at ingress) is simple but misses rare slow requests. Tail-based sampling (decide after spans arrive) captures anomalies but requires buffering all spans briefly, adding infra complexity. |
| Cardinality vs. Queryability | Adding dimensions to metrics (user_id, endpoint, region) multiplies cardinality, so the series count grows exponentially with the number of dimensions. A metric with 5 dimensions of 100 values each can produce up to 100^5 = 10 billion series. High-cardinality observability (one series per user) requires specialized stores like ClickHouse or Honeycomb. |
| Vendor Lock-in vs. Integration Depth | All-in-one platforms (Datadog, New Relic) provide seamless cross-pillar correlation but create vendor lock-in. Open-source stacks (OpenTelemetry + Grafana) offer portability but require more operational investment to achieve the same UX. |
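Two rows of the table above lend themselves to short sketches: the cardinality arithmetic and the tail-based sampling decision. The code below is a simplified illustration; `max_series` and `TailSampler` are hypothetical names, and a production tail sampler would also need timeouts and memory limits for traces that never complete.

```python
import math
from collections import defaultdict

def max_series(values_per_dimension: list[int]) -> int:
    """Upper bound on time series: the product of each dimension's value count."""
    return math.prod(values_per_dimension)

class TailSampler:
    """Buffer spans per trace and decide only after the trace has finished."""

    def __init__(self, slow_ms: float):
        self.slow_ms = slow_ms
        self.pending = defaultdict(list)  # trace_id -> [(duration_ms, error)]

    def add_span(self, trace_id: str, duration_ms: float, error: bool = False) -> None:
        self.pending[trace_id].append((duration_ms, error))

    def finish(self, trace_id: str) -> bool:
        """Keep the trace if any span was slow or failed; the buffer is freed either way."""
        spans = self.pending.pop(trace_id, [])
        return any(d > self.slow_ms or err for d, err in spans)

# Cardinality: 5 dimensions with 100 values each.
print(max_series([100] * 5))  # 10000000000

# Tail-based sampling: the slow trace is kept, the fast one is dropped.
sampler = TailSampler(slow_ms=1000)
sampler.add_span("fast", 40)
sampler.add_span("fast", 120)
sampler.add_span("slow", 35)
sampler.add_span("slow", 1800)
print(sampler.finish("fast"), sampler.finish("slow"))  # False True
```

The `pending` buffer is the infrastructure cost the table refers to: every span must be held until its trace completes, whereas a head-based sampler decides at ingress and buffers nothing.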
Spotify's Migration to OpenTelemetry
Scenario
Spotify ran a proprietary tracing system across 2,000+ microservices.
Solution
Over an 18-month migration, they adopted OpenTelemetry, auto-instrumented Java and Python services via OTel agents, switched from Zipkin to Tempo as the trace backend, and correlated traces with metrics in Grafana.
Outcome
Post-migration, mean time to detection (MTTD) dropped 35% because engineers could pivot from metric alerts to correlated traces in one click. Log volume was reduced 40% by replacing verbose debug logs with trace-span attributes, saving $2M/year in log storage costs.