Three Pillars of Observability

Observability rests on three complementary signal types: logs (discrete events), metrics (aggregated measurements), and traces (request-scoped causal chains). Together they let operators answer 'what is happening, how much, and why' in distributed systems without deploying new code.

Overview

The term 'three pillars of observability' was popularized by the CNCF and observability vendors in the late 2010s to describe the three fundamental signal types needed to understand complex distributed systems: logs, metrics, and traces. While each pillar existed long before the term was coined -- syslog dates to the 1980s, RRDtool to the 1990s, and Dapper-style tracing to 2010 -- the insight was that no single signal type is sufficient for modern microservice architectures.

Logs are discrete, timestamped records of events: 'user 12345 logged in at 14:03:07', 'query X took 347ms', 'connection pool exhausted'. They are the richest signal -- you can include arbitrary context -- but they are also the most expensive at scale. A service handling 50K RPS generating 2 log lines per request produces 100K lines/second, or ~8.6 billion lines/day. Without structure and indexing, finding the relevant needle in that haystack is impractical.
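
The volume figure above can be reproduced with a few lines of arithmetic. This is a minimal sketch; the function name and the 300-byte average line size are illustrative assumptions, not figures from the text.

```python
def log_lines_per_day(requests_per_second: int, lines_per_request: int) -> int:
    """Return the number of log lines emitted in one 24-hour day."""
    seconds_per_day = 24 * 60 * 60  # 86,400
    return requests_per_second * lines_per_request * seconds_per_day


lines = log_lines_per_day(50_000, 2)
print(f"{lines:,} lines/day")  # 8,640,000,000 lines/day

# Assumed average line size (hypothetical): 300 bytes, uncompressed.
assumed_bytes_per_line = 300
terabytes_per_day = lines * assumed_bytes_per_line / 1e12
print(f"{terabytes_per_day:.2f} TB/day")  # 2.59 TB/day
```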

Metrics are pre-aggregated numeric time series: request rate (counter), error rate (counter), latency p50/p95/p99 (histogram), queue depth (gauge), CPU utilization (gauge). They are extremely cheap to collect, store, and query because aggregation reduces dimensionality by orders of magnitude. A single metric series for 'request latency p99' costs the same whether the service handles 100 or 100K RPS. The trade-off is loss of individual request detail.
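
One way to see why a latency metric costs the same at any request rate is a fixed-bucket histogram: memory depends on the number of buckets, not on the number of observations. The sketch below is illustrative only; the class name, bucket bounds, and sample values are assumptions, and it reports the upper bound of the bucket holding a quantile rather than an interpolated value.

```python
from bisect import bisect_left


class LatencyHistogram:
    """Fixed-bucket histogram: memory is O(number of buckets), not O(requests)."""

    def __init__(self, upper_bounds_ms):
        self.upper_bounds_ms = sorted(upper_bounds_ms)
        # One extra bucket catches observations above the largest bound.
        self.counts = [0] * (len(self.upper_bounds_ms) + 1)
        self.total = 0

    def observe(self, latency_ms):
        # bisect_left finds the first bucket whose upper bound is >= latency_ms.
        self.counts[bisect_left(self.upper_bounds_ms, latency_ms)] += 1
        self.total += 1

    def quantile_upper_bound(self, q):
        """Return the upper bound of the bucket that contains quantile q."""
        if self.total == 0:
            raise ValueError("no observations")
        target = q * self.total
        running = 0
        for index, count in enumerate(self.counts):
            running += count
            if running >= target:
                if index < len(self.upper_bounds_ms):
                    return self.upper_bounds_ms[index]
                break
        return float("inf")  # quantile falls in the overflow bucket


hist = LatencyHistogram([50, 100, 200, 500, 1000, 2000])
for latency_ms in [30] * 90 + [150] * 9 + [1800]:
    hist.observe(latency_ms)

print(hist.quantile_upper_bound(0.50))  # 50
print(hist.quantile_upper_bound(0.95))  # 200
print(len(hist.counts))                 # 7, however many requests arrive
```

The individual latencies are gone after `observe` returns, which is the loss of per-request detail described above.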

Traces are request-scoped DAGs that follow a single operation across service boundaries. A trace for 'GET /api/orders/123' might span the API gateway, auth service, order service, inventory service, and database. Each span records the service, operation, duration, and metadata. Traces answer the questions metrics cannot: 'this request was slow -- where did the time go?' They are sampled (typically 1-10% of traffic) because storing a trace per request at scale is prohibitively expensive.
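
To illustrate how spans answer "where did the time go?", the sketch below models a trace as a list of spans with parent links and attributes each span's self time (its duration minus its direct children) to its service. The `Span` type, the service names, and the durations are hypothetical, and the calculation assumes children run sequentially rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    operation: str
    duration_ms: float


def self_time_by_service(spans):
    """Sum, per service, each span's duration minus its direct children's durations.

    Simplifying assumption: children run one after another, so their
    durations can be subtracted from the parent without double counting.
    """
    child_total = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = child_total.get(span.parent_id, 0.0) + span.duration_ms

    result = {}
    for span in spans:
        own = max(span.duration_ms - child_total.get(span.span_id, 0.0), 0.0)
        result[span.service] = result.get(span.service, 0.0) + own
    return result


trace = [
    Span("a", None, "api-gateway", "GET /api/orders/123", 480.0),
    Span("b", "a", "auth-service", "verify_token", 20.0),
    Span("c", "a", "order-service", "get_order", 440.0),
    Span("d", "c", "inventory-service", "check_stock", 60.0),
    Span("e", "c", "database", "SELECT orders", 350.0),
]

breakdown = self_time_by_service(trace)
slowest = max(breakdown, key=breakdown.get)
print(slowest, breakdown[slowest])  # database 350.0
```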

The modern synthesis treats all three as facets of the same data. Correlation IDs (trace IDs) link a trace to its log lines and to the metrics it contributed to. OpenTelemetry provides a vendor-neutral SDK that emits all three signal types from a single instrumentation point, and backends like Grafana Tempo + Loki + Mimir or Datadog unify them in a single query experience.
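
As a rough illustration of trace-to-log correlation, the sketch below uses only Python's standard `logging` module to emit one JSON object per log line, with `trace_id` carried alongside the message. It does not use the OpenTelemetry SDK; the formatter class, logger name, and field set are assumptions chosen for the example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with fixed correlation fields."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        payload = {
            "timestamp": record.created,  # seconds since the epoch
            "level": record.levelname,
            "message": record.getMessage(),
            "service": self.service,
            # Keys passed via `extra=` become attributes on the record.
            "trace_id": getattr(record, "trace_id", None),
            "operation": getattr(record, "operation", None),
            "duration_ms": getattr(record, "duration_ms", None),
        }
        return json.dumps(payload)


def build_logger(service, stream):
    logger = logging.getLogger(f"demo.{service}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    return logger


stream = io.StringIO()  # stands in for stdout or a log shipper
log = build_logger("payment-service", stream)
log.warning(
    "connection pool exhausted",
    extra={
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
        "operation": "charge",
        "duration_ms": 1700,
    },
)

line = json.loads(stream.getvalue())
print(line["trace_id"], line["message"])
```

With the same `trace_id` on every line a request produces, a log backend can filter by the ID found on a slow trace.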

Key Points
  • Logs are high-cardinality, high-detail event records. Use structured logging (JSON) with consistent field names (trace_id, service, operation, duration_ms) to make them queryable. Unstructured printf-style logs become unusable past a few hundred RPS.
  • Metrics are pre-aggregated time series. They excel at dashboards, alerting, and capacity planning. The four golden signals (latency, traffic, errors, saturation) form the minimum viable metrics set for any service.
  • Traces follow a single request across service boundaries using context propagation (W3C Trace Context or B3 headers). Each span records service, operation, duration, status, and arbitrary tags. Head-based or tail-based sampling controls cost.
  • No single pillar is sufficient. Metrics tell you the 99th percentile is high; logs tell you which error is occurring; traces tell you which downstream service is the bottleneck for a specific slow request.
  • Correlation is the key to modern observability. Embedding trace_id in log lines, exemplar links in metrics, and service.name across all three lets operators pivot seamlessly between pillars.
  • Cost grows differently per pillar. Metrics: O(cardinality × retention). Logs: O(volume × retention). Traces: O(sample_rate × span_count × retention). Most organizations spend 60-80% of their observability budget on log storage.
Simple Example

Debugging a Slow Checkout

A dashboard (metrics) shows checkout p99 latency spiked from 200ms to 2s. An engineer queries for traces with duration > 1s and finds that 95% of slow traces have a payment-service span taking 1.8s. Drilling into the payment service's logs filtered by those trace IDs reveals 'connection pool exhausted, waited 1.7s for available connection'. The fix: increase the payment service's DB connection pool. Metrics detected the problem, traces localized it, logs explained it.
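
The same workflow can be sketched against a few in-memory records. Everything here is hypothetical (the trace IDs, span durations, and helper names); a real backend would answer each step as a query rather than a list comprehension.

```python
from collections import Counter

# Hypothetical telemetry: per-trace span durations in ms, keyed by service.
traces = [
    {"trace_id": "t1", "duration_ms": 2000,
     "spans": {"cart": 100, "payment-service": 1800, "email": 100}},
    {"trace_id": "t2", "duration_ms": 180,
     "spans": {"cart": 60, "payment-service": 90, "email": 30}},
    {"trace_id": "t3", "duration_ms": 1950,
     "spans": {"cart": 90, "payment-service": 1790, "email": 70}},
]
logs = [
    {"trace_id": "t1", "service": "payment-service",
     "message": "connection pool exhausted, waited 1.7s for available connection"},
    {"trace_id": "t2", "service": "payment-service", "message": "charge ok"},
    {"trace_id": "t3", "service": "payment-service",
     "message": "connection pool exhausted, waited 1.7s for available connection"},
]


def slow_traces(all_traces, threshold_ms):
    """Step 2: keep only traces slower than the threshold."""
    return [t for t in all_traces if t["duration_ms"] > threshold_ms]


def dominant_service(slow):
    """Count which service owns the longest span in each slow trace."""
    return Counter(max(t["spans"], key=t["spans"].get) for t in slow)


def logs_for(all_logs, trace_ids, service):
    """Step 3: pull the log lines for those trace IDs from one service."""
    return [
        entry["message"]
        for entry in all_logs
        if entry["trace_id"] in trace_ids and entry["service"] == service
    ]


slow = slow_traces(traces, threshold_ms=1000)
culprit, hits = dominant_service(slow).most_common(1)[0]
messages = logs_for(logs, {t["trace_id"] for t in slow}, culprit)

print(culprit, hits)   # payment-service 2
print(set(messages))
```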

Real-World Examples

Google (Dapper)

Google's Dapper paper (2010) introduced the trace/span model that most modern tracing systems follow. Dapper instruments RPCs across Google's production fleet and uses adaptive sampling to decide which traces to keep, correlating traces with logs via a shared request ID. The paper showed that tracing roughly 1 in 1,000 requests was sufficient to diagnose most production issues, establishing the economic model for distributed tracing.

Netflix

Netflix's telemetry platform combines Atlas (metrics), Edgar (traces), and a centralized log aggregation system. Atlas processes over 2 billion metrics per minute with a custom in-memory time-series database. Edgar uses distributed tracing to follow requests across 1,000+ microservices. Cross-pillar correlation via trace IDs is the primary debugging workflow during incidents.

Uber

Uber built Jaeger, an open-source distributed tracing system now part of the CNCF. Jaeger handles millions of spans per second across Uber's microservice fleet and uses adaptive sampling to control cost. Uber pairs Jaeger traces with M3 (metrics) and structured logging to provide a complete observability stack, with trace IDs as the universal correlation key.

Trade-Offs
| Aspect | Description |
| --- | --- |
| Detail vs. Cost | Logs and traces provide per-request detail but scale with traffic volume. Metrics aggregate away individual requests but remain cheap regardless of scale. Most teams use metrics for alerting and dashboards, traces for request-level debugging, and logs as the last-resort forensic tool. |
| Sampling vs. Completeness | Traces must be sampled at high throughput. Head-based sampling (decide at ingress) is simple but misses rare slow requests. Tail-based sampling (decide after spans arrive) captures anomalies but requires buffering all spans briefly, adding infrastructure complexity. |
| Cardinality vs. Queryability | Adding dimensions to metrics (user_id, endpoint, region) multiplies cardinality: the series count is the product of the distinct values in each dimension. A metric with 5 dimensions of 100 values each can produce up to 100^5 = 10 billion series. High-cardinality observability (one series per user) requires specialized stores like ClickHouse or Honeycomb. |
| Vendor Lock-in vs. Integration Depth | All-in-one platforms (Datadog, New Relic) provide seamless cross-pillar correlation but create vendor lock-in. Open-source stacks (OpenTelemetry + Grafana) offer portability but require more operational investment to achieve the same UX. |
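
The cardinality arithmetic in the table can be checked directly: the worst-case series count is the product of the distinct values per label. This is a small sketch; the label names are hypothetical and only the 5 × 100 shape comes from the text.

```python
from math import prod


def max_series(values_per_label):
    """Worst-case number of time series for one metric name."""
    return prod(values_per_label.values())


# Hypothetical label set: five labels with 100 distinct values each.
labels = {"endpoint": 100, "region": 100, "status": 100, "host": 100, "tenant": 100}
print(f"{max_series(labels):,}")  # 10,000,000,000

# Dropping a single label cuts the worst case by a factor of 100.
without_tenant = {name: n for name, n in labels.items() if name != "tenant"}
print(f"{max_series(without_tenant):,}")  # 100,000,000
```
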
Case Study

Spotify's Migration to OpenTelemetry

Scenario

Spotify ran a proprietary tracing system across 2,000+ microservices. The migration away from it took 18 months.

Solution

They adopted OpenTelemetry, auto-instrumenting Java and Python services via OTel agents, switching from Zipkin to Tempo as the trace backend, and correlating traces with metrics in Grafana.

Outcome

Post-migration, mean time to detection (MTTD) dropped 35% because engineers could pivot from metric alerts to correlated traces in one click. Log volume was reduced 40% by replacing verbose debug logs with trace-span attributes, saving $2M/year in log storage costs.

Common Mistakes
  • Logging without structure: Printf-style logs like 'Error processing order 12345' are impossible to query at scale. Without structured fields (order_id, error_type, service), log search degenerates into regex over terabytes of text. Use structured JSON logging with a consistent schema, including trace_id, service, operation, and error_type in every log line, enforced via a shared logging library.
  • Alerting on logs instead of metrics: Log-based alerts are expensive (scan terabytes per query), brittle (break when log format changes), and slow (minutes of lag). Metrics-based alerts fire in seconds. Extract counters from logs at write time (e.g., log-to-metric rules in Loki or Datadog), alert on metric thresholds, and use logs only for root-cause investigation after an alert fires.
  • No sampling strategy for traces: Storing 100% of traces at 50K RPS produces ~4 TB/day of span data (about 4.3 billion traces per day at roughly 1 KB each). Cost explodes, query performance degrades, and engineers end up disabling tracing entirely, losing the debugging benefit. Use head-based sampling at 1-5% for normal traffic and add tail-based sampling to always capture error and high-latency traces -- an approach intended to capture roughly 99% of the debugging value at about 5% of the cost.
  • Treating observability as an afterthought: Adding instrumentation to a 50-service system post-launch means months of PRs, inconsistent schemas, and missing coverage during the riskiest period (first launch). Bake observability into the service template, auto-instrument via OpenTelemetry agents, and make trace_id propagation a required middleware, not an opt-in library.
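
The sampling advice in the list above can be sketched as two decisions: a deterministic head-based rate keyed on the trace ID, plus a tail-based rule that always keeps error and high-latency traces. The function names, span fields, and thresholds are assumptions for illustration. Note also that a real head-based sampler decides at ingress before spans are recorded, whereas this sketch evaluates both rules on a trace that has already been buffered.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Deterministic decision: the same trace_id always gets the same answer."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def tail_keep(spans, latency_threshold_ms: float) -> bool:
    """Decision made after every span of the trace has been buffered."""
    has_error = any(span["status"] == "error" for span in spans)
    total_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    return has_error or total_ms >= latency_threshold_ms


def keep_trace(trace_id, spans, rate=0.05, latency_threshold_ms=1000):
    """Keep every anomalous trace, plus a fixed fraction of normal ones."""
    return tail_keep(spans, latency_threshold_ms) or head_sample(trace_id, rate)


fast_ok = [{"status": "ok", "start_ms": 0, "end_ms": 120}]
slow_ok = [{"status": "ok", "start_ms": 0, "end_ms": 1800}]
fast_err = [{"status": "error", "start_ms": 0, "end_ms": 90}]

print(keep_trace("trace-slow", slow_ok))   # True: over the latency threshold
print(keep_trace("trace-err", fast_err))   # True: contains an error span

# Fraction of ordinary traces kept by the head-based rule alone.
kept = sum(head_sample(f"trace-{i}", 0.05) for i in range(10_000))
print(kept / 10_000)
```

Hashing the trace ID, rather than calling a random number generator, lets every service reach the same decision for a given trace without coordination.
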
Test Your Understanding

1. Which observability pillar is best suited for real-time alerting on error rate spikes?

2. What is the primary purpose of trace sampling?
