Log aggregation collects, indexes, and makes searchable the logs from hundreds or thousands of distributed service instances in a centralized platform. It transforms ephemeral per-container stdout into a durable, queryable forensic record.
The motivation is practical. Containers that generate logs may be destroyed within minutes, so without centralized aggregation, debugging means SSH-ing into individual hosts and grepping files. That approach does not scale across hundreds of instances, and it is useless once a container has been recycled and its logs are gone.
The canonical pipeline has four stages. Generation: services emit structured JSON logs to stdout with consistent fields (timestamp, service, level, trace_id, message). Collection: agents (Fluentd, Vector, Filebeat, OTel Collector) running on each node tail container logs and forward them. Processing: a buffer (Kafka) absorbs burst traffic and a processor enriches logs (add Kubernetes metadata, parse stack traces, redact PII). Storage: an indexed backend (Elasticsearch, Grafana Loki, ClickHouse) stores logs and serves queries. A web UI (Kibana, Grafana) provides search, filtering, and visualization.
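The processing stage described above can be sketched in a few lines. This is a minimal illustration, not a production processor: the `process` function, the `EMAIL_RE` pattern, and the metadata keys (`pod`, `namespace`) are hypothetical names chosen for the example, and real PII redaction would need far more than one regular expression.

```python
import json
import re

# Hypothetical pattern: treat anything shaped like an email address as PII.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def process(raw_line: str, k8s_metadata: dict) -> dict:
    """Parse one JSON log line, enrich it, and redact PII in the message."""
    record = json.loads(raw_line)
    # Enrichment: attach node-level context the service itself does not know.
    record.update(k8s_metadata)
    # Redaction: mask email addresses before the record reaches storage.
    record["message"] = EMAIL_RE.sub("[REDACTED]", record.get("message", ""))
    return record


raw = json.dumps({
    "timestamp": "2024-01-15T14:03:07Z",
    "service": "orders",
    "level": "error",
    "trace_id": "abc-123",
    "message": "payment declined for alice@example.com",
})
processed = process(raw, {"pod": "orders-7d9f", "namespace": "prod"})
print(processed["message"])  # payment declined for [REDACTED]
```

In a real pipeline this logic would typically live in the collector or a stream processor rather than in application code, so that every service gets the same enrichment and redaction rules.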
The choice of backend defines the architecture's trade-offs. Elasticsearch (ELK stack) creates a full-text inverted index on every field, enabling fast arbitrary queries ('show me all logs where user_id=12345 AND latency>500ms'). But this indexing is expensive: each GB of raw logs requires 1.5-3 GB of index storage, and ingestion throughput is limited by indexing speed. Loki (Grafana stack) takes the opposite approach: it indexes only a small set of labels (service, level, pod) and stores log lines as compressed chunks in object storage. Queries that match labels are fast; queries that require full-text search scan chunks and are slower. Loki is 10-20x cheaper to operate than Elasticsearch for the same log volume but less powerful for ad-hoc analysis.
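The label-only trade-off is easier to see in a toy model. The sketch below is an illustration of the idea, not Loki's actual implementation: `LabelIndexedStore`, `write_chunk`, and `query` are hypothetical names, and a Python dict stands in for both the label index and object storage.

```python
import zlib
from collections import defaultdict


class LabelIndexedStore:
    """Toy store: index label sets only, keep log lines in compressed chunks."""

    def __init__(self):
        # Label set -> list of compressed chunks. Log content is never indexed.
        self._chunks = defaultdict(list)

    def write_chunk(self, labels: dict, lines: list) -> None:
        key = frozenset(labels.items())
        self._chunks[key].append(zlib.compress("\n".join(lines).encode()))

    def query(self, selector: dict, contains: str = "") -> list:
        wanted = set(selector.items())
        hits = []
        for key, chunks in self._chunks.items():
            # Cheap step: skip every stream whose labels do not match.
            if not wanted <= key:
                continue
            # Expensive step: decompress and scan each matching chunk.
            for chunk in chunks:
                for line in zlib.decompress(chunk).decode().split("\n"):
                    if contains in line:
                        hits.append(line)
        return hits


store = LabelIndexedStore()
store.write_chunk(
    {"service": "orders", "level": "error"},
    ["payment declined order_id=12345", "timeout order_id=777"],
)
store.write_chunk({"service": "search", "level": "info"}, ["query served in 12ms"])

matches = store.query({"service": "orders"}, contains="12345")
print(matches)  # ['payment declined order_id=12345']
```

The cost of a query here is proportional to the number of chunks the label selector lets through, which is why a narrow selector is fast and a content search across all services is slow.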
Structured logging is the foundation of effective log aggregation. Unstructured logs ('Error: failed to process order') require regex extraction at query time, which is slow and fragile. Structured JSON logs ({"level": "error", "service": "orders", "order_id": 12345, "error_type": "payment_declined", "trace_id": "abc-123"}) are parsed once at ingestion and queryable by any field. The shift from printf-style to structured logging is the single highest-leverage observability improvement a team can make.
Cost management dominates log aggregation operations. A moderate-size microservice fleet (100 services, 10K RPS total) generates 50-100 GB of logs per day. At Elasticsearch's storage amplification (roughly 3x once index and replica are counted), that is 150-300 GB/day of indexed storage. With 30-day retention, that is 4.5-9 TB. Cloud-hosted Elasticsearch (Amazon OpenSearch Service, Elastic Cloud) charges $0.10-0.20/GB/month, making log storage a $450-1,800/month expense for a modest fleet. Strategies to reduce cost include log level gating (INFO in production, DEBUG only when investigating), sampling verbose log sources, tiered storage (7 days hot SSD, 30 days warm HDD, archive to S3), and replacing high-volume log lines with metric counters.
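The arithmetic above is simple enough to script, which makes it easy to rerun with your own volumes and prices. The function name `monthly_storage_cost` is hypothetical, and the inputs are the figures quoted in the text rather than current vendor pricing.

```python
def monthly_storage_cost(raw_gb_per_day, amplification, retention_days, usd_per_gb_month):
    """Return (stored GB at steady state, monthly storage cost in USD)."""
    stored_gb = raw_gb_per_day * amplification * retention_days
    return stored_gb, stored_gb * usd_per_gb_month


# Low end: 50 GB/day at $0.10/GB/month. High end: 100 GB/day at $0.20/GB/month.
low_gb, low_usd = monthly_storage_cost(50, 3, 30, 0.10)
high_gb, high_usd = monthly_storage_cost(100, 3, 30, 0.20)

print(f"{low_gb / 1000:.1f}-{high_gb / 1000:.1f} TB stored")
print(f"${low_usd:.0f}-{high_usd:.0f} per month")
```

With the numbers from the text this works out to 4.5-9 TB stored and about $450 to $1,800 per month at the low and high ends. Storage is only one component of the bill, since compute for indexing and querying is billed separately on most hosted offerings.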
From Printf to Structured Logging
Before: logger.info(f'Order {order_id} payment failed: {error}'). This produces: '2024-01-15 14:03:07 INFO Order 12345 payment failed: card_declined'. Searching for all payment failures requires regex: /payment failed/.

After: logger.info('payment_failed', order_id=12345, error_type='card_declined', amount_cents=5999, trace_id='abc-123'). This produces: {"timestamp": "2024-01-15T14:03:07Z", "level": "info", "msg": "payment_failed", "order_id": 12345, "error_type": "card_declined", "amount_cents": 5999, "trace_id": "abc-123"}. Now you can query: error_type='card_declined' AND amount_cents > 5000 -- impossible with the unstructured version.
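The keyword-argument call style shown above resembles libraries such as structlog. Python's standard `logging` module does not accept arbitrary keyword arguments, so the sketch below gets a similar result with a custom formatter and the `extra` parameter. The `JsonFormatter` class and the `fields` key are assumptions made for this example, and the log is written to an in-memory buffer only so the output can be inspected.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Fields passed as extra={"fields": {...}} become top-level keys.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout so the result can be read back
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("payment_failed", extra={"fields": {
    "order_id": 12345,
    "error_type": "card_declined",
    "amount_cents": 5999,
    "trace_id": "abc-123",
}})

event = json.loads(stream.getvalue())
# The structured query from the text, expressed as a plain predicate.
matches_query = event["error_type"] == "card_declined" and event["amount_cents"] > 5000
print(matches_query)  # True
```

In a service you would attach the handler to stdout instead of a `StringIO` buffer and let the collection agent pick the lines up from there.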
Elastic / ELK Stack
The ELK stack (Elasticsearch, Logstash, Kibana) was the dominant log aggregation platform for a decade. Elasticsearch's inverted index enables sub-second full-text search across billions of log lines. Organizations like Wikipedia, Netflix, and LinkedIn run ELK clusters processing terabytes of logs daily. The trade-off is operational complexity: managing Elasticsearch clusters (shard balancing, index lifecycle, JVM tuning) is a significant operational burden.
Grafana Labs (Loki)
Grafana Loki was designed to be 'like Prometheus, but for logs': it indexes only labels, not log content. This makes it 10-20x cheaper than Elasticsearch for the same volume. Loki stores log chunks in object storage (S3, GCS) and uses a small index (BoltDB or TSDB). It integrates natively with Grafana for unified metrics+logs+traces dashboards. CNCF-heavy organizations have adopted it as their default Kubernetes logging backend.
Cloudflare
Cloudflare processes over 40 million HTTP requests per second and generates petabytes of logs daily. They built a custom log pipeline using Kafka for buffering, ClickHouse for analytical storage, and a purpose-built query engine. Their key insight: most log queries are analytical (top 10 error codes, traffic by country) rather than needle-in-haystack, making a columnar store like ClickHouse more efficient than Elasticsearch.
| Aspect | Description |
|---|---|
| Index Everything (ELK) vs. Index Labels Only (Loki) | Full-text indexing (Elasticsearch) enables fast ad-hoc queries on any field but requires 1.5-3 GB of index storage per GB of raw logs. Label-only indexing (Loki) is 10-20x cheaper to operate, but queries must narrow by label before any full-text scan, so content searches over broad label sets are slow. |
| Real-Time vs. Batch Ingestion | Real-time log shipping (Fluentd → Elasticsearch direct) provides sub-second searchability but risks overwhelming the backend during traffic spikes. Buffered ingestion (Fluentd → Kafka → Elasticsearch) absorbs spikes but adds 5-30 seconds of latency. |
| Retention Duration vs. Storage Cost | Longer retention provides better forensic capability but increases cost linearly. Most incidents are investigated within 7 days. Tiered storage (7 days hot, 30 days warm, archive to S3) balances cost and access speed. |
| Centralized vs. Per-Team Log Stores | A single centralized log store simplifies operations and enables cross-service queries. Per-team stores (each team runs their own Loki) provide isolation and cost attribution but prevent fleet-wide queries during incidents. |
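
The tiered retention policy in the table can be expressed as a small routing function. This is a sketch under the 7-day and 30-day boundaries used in the text. The `storage_tier` name is hypothetical, and real backends usually implement this through index lifecycle or retention configuration rather than application code.

```python
from datetime import datetime, timedelta, timezone

# Tier boundaries taken from the example policy in the text.
HOT_DAYS = 7
WARM_DAYS = 30


def storage_tier(log_time: datetime, now: datetime) -> str:
    """Pick the storage tier for a log entry based on its age."""
    age = now - log_time
    if age <= timedelta(days=HOT_DAYS):
        return "hot"
    if age <= timedelta(days=WARM_DAYS):
        return "warm"
    return "archive"


now = datetime(2024, 2, 15, tzinfo=timezone.utc)
tiers = [storage_tier(now - timedelta(days=d), now) for d in (1, 10, 45)]
print(tiers)  # ['hot', 'warm', 'archive']
```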
Wise Reduces Log Costs 90% by Migrating from ELK to Loki
Scenario
Wise (formerly TransferWise) ran a self-managed Elasticsearch cluster that consumed 15 TB of SSD storage and cost $45,000/month in AWS infrastructure. The cluster required a dedicated SRE to manage shard rebalancing, index rollover, and JVM tuning.
Solution
After evaluating their query patterns, they found that 85% of log queries filtered by service + time range + level -- exactly the labels Loki indexes. They migrated to Grafana Loki backed by S3 for storage.
Outcome
Monthly cost dropped to $4,500 (90% reduction). Query latency for label-based queries was comparable; full-text search was 3-5x slower but acceptable for the 15% of queries that needed it. The dedicated SRE was reassigned to product work.