
Log Aggregation

Log aggregation collects, indexes, and makes searchable the logs from hundreds or thousands of distributed service instances in a centralized platform. It transforms ephemeral per-container stdout into a durable, queryable forensic record.

Overview

In a distributed system with hundreds of service instances, logs are generated on ephemeral containers that may be destroyed within minutes. Without centralized aggregation, debugging requires SSH-ing into individual hosts and grepping files -- impossible at scale and useless after containers are recycled. Log aggregation solves this by collecting all logs into a durable, searchable store.

The canonical pipeline has four stages:
  • Generation: services emit structured JSON logs to stdout with consistent fields (timestamp, service, level, trace_id, message).
  • Collection: agents (Fluentd, Vector, Filebeat, OTel Collector) running on each node tail container logs and forward them.
  • Processing: a buffer (Kafka) absorbs burst traffic and a processor enriches logs (adds Kubernetes metadata, parses stack traces, redacts PII).
  • Storage: an indexed backend (Elasticsearch, Grafana Loki, ClickHouse) stores logs and serves queries.

A web UI (Kibana, Grafana) provides search, filtering, and visualization on top of the storage backend.

The choice of backend defines the architecture's trade-offs. Elasticsearch (ELK stack) creates a full-text inverted index on every field, enabling fast arbitrary queries ('show me all logs where user_id=12345 AND latency>500ms'). But this indexing is expensive: each GB of raw logs requires 1.5-3 GB of index storage, and ingestion throughput is limited by indexing speed. Loki (Grafana stack) takes the opposite approach: it indexes only a small set of labels (service, level, pod) and stores log lines as compressed chunks in object storage. Queries that match labels are fast; queries that require full-text search scan chunks and are slower. Loki is 10-20x cheaper to operate than Elasticsearch for the same log volume but less powerful for ad-hoc analysis.
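
The difference between the two indexing strategies can be illustrated with a toy model. This is a minimal sketch, not how either Elasticsearch or Loki is implemented: the sample records, the whitespace tokenizer, and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample data: three log records with two labels and a free-text line.
LOGS = [
    {"service": "orders", "level": "error", "line": "payment declined for user_id=12345"},
    {"service": "orders", "level": "info", "line": "order created for user_id=12345"},
    {"service": "search", "level": "error", "line": "timeout talking to index shard"},
]

def build_inverted_index(logs):
    """Index-everything style: map every token of every line to the IDs of logs containing it."""
    index = defaultdict(set)
    for log_id, log in enumerate(logs):
        for token in log["line"].split():
            index[token].add(log_id)
    return index

def build_label_index(logs):
    """Label-only style: index just the (service, level) pair and leave line content unindexed."""
    index = defaultdict(list)
    for log_id, log in enumerate(logs):
        index[(log["service"], log["level"])].append(log_id)
    return index

def search_inverted(index, token):
    """A single dictionary lookup answers the query, at the price of a much larger index."""
    return sorted(index.get(token, set()))

def search_labels_then_scan(label_index, logs, service, level, substring):
    """Narrow by labels first, then scan the remaining lines for the substring."""
    candidates = label_index.get((service, level), [])
    return [log_id for log_id in candidates if substring in logs[log_id]["line"]]

inverted = build_inverted_index(LOGS)
labels = build_label_index(LOGS)
print(len(inverted), "index entries (inverted) vs", len(labels), "(labels only)")
```

The inverted index grows with the vocabulary of the log content, while the label index grows only with the number of distinct label combinations. That gap is where the storage and ingestion cost difference comes from, and the scan step is where the label-only approach pays it back at query time.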

Structured logging is the foundation of effective log aggregation. Unstructured logs ('Error: failed to process order') require regex extraction at query time, which is slow and fragile. Structured JSON logs ({"level": "error", "service": "orders", "order_id": 12345, "error_type": "payment_declined", "trace_id": "abc-123"}) are parsed once at ingestion and queryable by any field. The shift from printf-style to structured logging is the single highest-leverage observability improvement a team can make.
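
The query-side difference can be sketched in a few lines. The sample lines, the regular expression, and the function names below are illustrative assumptions, not part of any particular logging backend.

```python
import json
import re

# Hypothetical sample lines in both styles.
UNSTRUCTURED = [
    "2024-01-15 14:03:07 ERROR Order 12345 payment failed: card_declined",
    "2024-01-15 14:03:09 INFO Order 12346 shipped",
]
STRUCTURED = [
    '{"level": "error", "service": "orders", "order_id": 12345, '
    '"error_type": "payment_declined", "trace_id": "abc-123"}',
    '{"level": "info", "service": "orders", "order_id": 12346, "trace_id": "abc-124"}',
]

# The pattern is coupled to the exact wording of the message.
ORDER_FAILED = re.compile(r"Order (\d+) payment failed: (\w+)")

def failed_orders_regex(lines):
    """Query-time extraction: silently returns nothing if the message wording changes."""
    return [int(m.group(1)) for line in lines if (m := ORDER_FAILED.search(line))]

def failed_orders_structured(lines):
    """Parse each line once, then filter on a named field."""
    records = (json.loads(line) for line in lines)
    return [r["order_id"] for r in records if r.get("error_type") == "payment_declined"]

print(failed_orders_regex(UNSTRUCTURED))
print(failed_orders_structured(STRUCTURED))
```

Both functions find the same order, but only the structured version keeps working when someone rewords the human-readable message.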

Cost management dominates log aggregation operations. A moderate-size microservice fleet (100 services, 10K RPS total) generates 50-100 GB of logs per day. At Elasticsearch's storage amplification (roughly 3x once index overhead and a replica are counted), that is 150-300 GB/day of indexed storage. With 30-day retention, that is 4.5-9 TB. Cloud-hosted Elasticsearch (AWS OpenSearch, Elastic Cloud) charges $0.10-0.20/GB/month, making log storage a $450-1,800/month expense for a modest fleet. Strategies to reduce cost include log level gating (INFO in production, DEBUG only when investigating), sampling verbose log sources, tiered storage (7 days hot SSD, 30 days warm HDD, archive to S3), and replacing high-volume log lines with metric counters.
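
The estimate above is simple multiplication, so it is easy to rerun with your own numbers. The function name and parameters below are illustrative, and the inputs are the ranges quoted in the paragraph, not measured prices.

```python
def monthly_storage_cost(raw_gb_per_day, amplification, retention_days, usd_per_gb_month):
    """Return (stored_gb, monthly_usd) for a steady-state log store.

    stored_gb is the volume held once the retention window is full:
    daily raw volume x storage amplification x days retained.
    """
    stored_gb = raw_gb_per_day * amplification * retention_days
    return stored_gb, stored_gb * usd_per_gb_month

# Low and high ends of the ranges quoted in the text.
low_gb, low_usd = monthly_storage_cost(50, 3, 30, 0.10)
high_gb, high_usd = monthly_storage_cost(100, 3, 30, 0.20)

print(f"stored: {low_gb / 1000:.1f}-{high_gb / 1000:.1f} TB")
print(f"cost:   ${low_usd:,.0f}-${high_usd:,.0f} per month")
```

Because every factor multiplies, halving either the daily volume or the retention window halves the bill. That is why level gating and retention policies are usually the first levers teams reach for.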

Key Points
  1. Always use structured logging (JSON). Every log line should have: timestamp, level, service, trace_id, and operation. Unstructured printf-style logs are ungrepable at scale and break parsing pipelines.
  2. Choose a backend based on query patterns. Elasticsearch: best for ad-hoc, high-cardinality queries (search by any field). Loki: best for label-based queries (search by service + level) with 10x lower cost. ClickHouse: best for analytical queries over log data.
  3. Log collection agents (Fluentd, Vector, Filebeat) run as DaemonSets on each Kubernetes node, tailing container stdout. They parse, enrich (add pod/node labels), and forward to the backend or a Kafka buffer.
  4. Include trace_id in every log line. This is the bridge between logs and distributed traces -- when a trace shows a slow span, you filter logs by trace_id to see exactly what happened during that request.
  5. Log levels (DEBUG, INFO, WARN, ERROR) control volume. Production should run at INFO or WARN. Dynamic log level adjustment (set DEBUG for one service temporarily via config) is essential for incident investigation.
  6. Cost grows linearly with volume and retention. Tiered storage (hot/warm/cold), aggressive retention policies (7-30 days for most logs), and log-to-metric extraction (count errors, don't store each error log) are essential cost controls.
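
The trace_id point can be made concrete with a small sketch of the lookup an engineer performs after spotting a slow span. The records and the function name are hypothetical; a real backend would run this filter server-side rather than in application code.

```python
import json

# Hypothetical aggregated log lines from two services, arriving out of order.
RAW_LINES = [
    '{"timestamp": "2024-01-15T14:03:07.120Z", "service": "payments", "level": "error", '
    '"trace_id": "abc-123", "message": "card_declined"}',
    '{"timestamp": "2024-01-15T14:03:07.050Z", "service": "orders", "level": "info", '
    '"trace_id": "abc-123", "message": "order_received"}',
    '{"timestamp": "2024-01-15T14:03:07.090Z", "service": "orders", "level": "info", '
    '"trace_id": "zzz-999", "message": "order_received"}',
]

def logs_for_trace(lines, trace_id):
    """Return every record for one request, across services, in time order."""
    records = [json.loads(line) for line in lines]
    matching = [r for r in records if r.get("trace_id") == trace_id]
    # ISO 8601 UTC timestamps in a single fixed format sort correctly as plain strings.
    return sorted(matching, key=lambda r: r["timestamp"])

for record in logs_for_trace(RAW_LINES, "abc-123"):
    print(record["timestamp"], record["service"], record["message"])
```

Without a shared trace_id, the same reconstruction would need guesswork based on timestamps and user identifiers, which breaks down as soon as requests overlap.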
Simple Example

From Printf to Structured Logging

Before: logger.info(f'Order {order_id} payment failed: {error}'). This produces: '2024-01-15 14:03:07 INFO Order 12345 payment failed: card_declined'. Searching for all payment failures requires regex: /payment failed/.

After: logger.info('payment_failed', order_id=12345, error_type='card_declined', amount_cents=5999, trace_id='abc-123'). This produces: {"timestamp": "2024-01-15T14:03:07Z", "level": "info", "msg": "payment_failed", "order_id": 12345, "error_type": "card_declined", "amount_cents": 5999, "trace_id": "abc-123"}.

Now you can query: error_type='card_declined' AND amount_cents > 5000 -- impossible with the unstructured version.
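
The keyword-argument call style shown above resembles what structured logging libraries offer; Python's standard logging module does not accept arbitrary keyword arguments. As a minimal sketch using only the standard library, the same JSON shape can be produced with a custom formatter. The "fields" key passed through extra, the class name, and the logger name are conventions invented for this example.

```python
import io
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc)
            .strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # "fields" is this example's own convention, attached to the record via extra=.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

def build_logger(stream):
    logger = logging.getLogger("structured_demo")
    logger.handlers.clear()   # keep repeated calls from stacking handlers
    logger.propagate = False  # do not also emit through the root logger
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger

stream = io.StringIO()  # stands in for stdout so the output can be inspected
log = build_logger(stream)
log.info("payment_failed", extra={"fields": {
    "order_id": 12345, "error_type": "card_declined",
    "amount_cents": 5999, "trace_id": "abc-123",
}})

record = json.loads(stream.getvalue())
print(record["error_type"], record["amount_cents"] > 5000)
```

In a real service the handler would write to stdout so the node-level collection agent can pick the lines up.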

Real-World Examples

Elastic / ELK Stack

The ELK stack (Elasticsearch, Logstash, Kibana) was the dominant log aggregation platform for a decade. Elasticsearch's inverted index enables sub-second full-text search across billions of log lines. Organizations like Wikipedia, Netflix, and LinkedIn run ELK clusters processing terabytes of logs daily. The trade-off is operational complexity: managing Elasticsearch clusters (shard balancing, index lifecycle, JVM tuning) is a significant operational burden.

Grafana Labs (Loki)

Grafana Loki was designed to be 'like Prometheus, but for logs': it indexes only labels, not log content. This makes it 10-20x cheaper than Elasticsearch for the same volume. Loki stores log chunks in object storage (S3, GCS) and uses a small index (BoltDB or TSDB). It integrates natively with Grafana for unified metrics+logs+traces dashboards, and it has been adopted by CNCF-heavy organizations as their default Kubernetes logging backend.

Cloudflare

Cloudflare processes over 40 million HTTP requests per second and generates petabytes of logs daily. They built a custom log pipeline using Kafka for buffering, ClickHouse for analytical storage, and a purpose-built query engine. Their key insight: most log queries are analytical (top 10 error codes, traffic by country) rather than needle-in-haystack, making a columnar store like ClickHouse more efficient than Elasticsearch.

Trade-Offs
| Aspect | Description |
|---|---|
| Index Everything (ELK) vs. Index Labels Only (Loki) | Full-text indexing (Elasticsearch) enables fast ad-hoc queries on any field but costs 2-3x in storage. Label-only indexing (Loki) is 10x cheaper but requires label-based filtering before full-text scan, making some queries slow. |
| Real-Time vs. Batch Ingestion | Real-time log shipping (Fluentd → Elasticsearch direct) provides sub-second searchability but risks overwhelming the backend during traffic spikes. Buffered ingestion (Fluentd → Kafka → Elasticsearch) absorbs spikes but adds 5-30 seconds of latency. |
| Retention Duration vs. Storage Cost | Longer retention provides better forensic capability but increases cost linearly. Most incidents are investigated within 7 days. Tiered storage (7 days hot, 30 days warm, archive to S3) balances cost and access speed. |
| Centralized vs. Per-Team Log Stores | A single centralized log store simplifies operations and enables cross-service queries. Per-team stores (each team runs their own Loki) provide isolation and cost attribution but prevent fleet-wide queries during incidents. |
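
The tiered retention scheme in the table reduces to an age check. This is a sketch of the decision logic only; the thresholds come from the text, while the function and tier names are assumptions. In practice the policy would be expressed in the backend's own lifecycle configuration rather than in application code.

```python
from datetime import date, timedelta

HOT_DAYS = 7    # kept on fast SSD
WARM_DAYS = 30  # kept on cheaper storage until this age

def storage_tier(log_date, today):
    """Pick a tier for a day's worth of logs based on its age in whole days."""
    age_days = (today - log_date).days
    if age_days < HOT_DAYS:
        return "hot"
    if age_days < WARM_DAYS:
        return "warm"
    return "archive"

today = date(2024, 1, 31)
for age in (0, 6, 7, 29, 30, 90):
    print(age, storage_tier(today - timedelta(days=age), today))
```
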
Case Study

Wise Reduces Log Costs 90% by Migrating from ELK to Loki

Scenario

Wise (formerly TransferWise) ran a self-managed Elasticsearch cluster that consumed 15 TB of SSD storage and cost $45,000/month in AWS infrastructure. The cluster required a dedicated SRE to manage shard rebalancing, index rollover, and JVM tuning.

Solution

After evaluating their query patterns, they found that 85% of log queries filtered by service + time range + level -- exactly the labels Loki indexes. They migrated to Grafana Loki backed by S3 for storage.

Outcome

Monthly cost dropped to $4,500 (90% reduction). Query latency for label-based queries was comparable; full-text search was 3-5x slower but acceptable for the 15% of queries that needed it. The dedicated SRE was reassigned to product work.

Common Mistakes
  • Logging at DEBUG level in production: DEBUG logs can produce 10-100x more volume than INFO, overwhelming the log pipeline, increasing cost, and making it harder to find important messages. Default to INFO in production and implement dynamic log level adjustment so engineers can temporarily enable DEBUG for a specific service or pod during incident investigation, then revert.
  • No log retention policy: Logs accumulate indefinitely, and after 6 months the Elasticsearch cluster needs constant capacity expansion, query performance degrades, and cost grows unbounded. Define tiered retention (7 days hot on fast SSD, 30 days warm on HDD or reduced replicas, then delete or archive to S3) and use Index Lifecycle Management (Elasticsearch) or retention policies (Loki) to automate.
  • Logging sensitive data (PII, credentials, tokens): User emails, API keys, or credit card numbers appear in logs accessible to the entire engineering team, violating GDPR, PCI-DSS, and internal security policies. Use a log processor (Vector, Fluentd filter, OTel Collector) to redact or hash sensitive fields before indexing, audit log schemas quarterly for PII leaks, and never log request/response bodies without explicit allowlisting.
  • Using string concatenation for log messages: logger.info('Processing order ' + orderId + ' for user ' + userId) allocates string objects even when the log level is disabled, impacting hot-path performance. Use structured logging with lazy evaluation (logger.info('processing_order', order_id=orderId, user_id=userId)) so fields are only serialized if the log level is enabled.
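
The redaction step from the sensitive-data item can be sketched as a small processing function. The field list, the email pattern, and the output format are assumptions for illustration; a production pipeline would normally do this in the collection agent or processor and would need a much broader set of detectors.

```python
import hashlib
import re

# Hypothetical deny-list of field names that must never be stored in clear text.
SENSITIVE_FIELDS = {"email", "api_key", "card_number"}

# Deliberately simple pattern; it will not catch every valid address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def redact(record):
    """Return a copy of the record that is safe to index."""
    clean = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS:
            # Hashing keeps values correlatable without storing them. An unsalted
            # hash of a low-entropy value (such as a card number) can be brute-forced,
            # so a keyed hash or outright removal is safer for those fields.
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]
            clean[key] = f"sha256:{digest}"
        elif isinstance(value, str):
            # Catch addresses that leaked into free-text fields.
            clean[key] = EMAIL_PATTERN.sub("[REDACTED_EMAIL]", value)
        else:
            clean[key] = value
    return clean

sample = {"email": "a@example.com", "message": "sent receipt to a@example.com", "order_id": 1}
print(redact(sample))
```

An allowlist of fields that may be indexed is stricter than the deny-list used here and fails closed when a new field appears.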
Test Your Understanding

1. What is the primary advantage of Loki over Elasticsearch for log storage?

2. Why is it critical to include trace_id in every structured log line?
