Production-grade observability platform: columnar hot index (24h), Parquet warm tier (90d), S3 Glacier cold archive (7y), streaming alert engine, distributed cross-tier query engine, schema-on-read parsing, and sampling for high-cardinality metrics.
The multi-tier logging platform with streaming alerting represents the state of the art in production observability — the architecture behind Datadog, Splunk Cloud, Elastic Observability, and AWS CloudWatch at their largest scale. It addresses three limitations of the standard Kafka-plus-Elasticsearch pipeline that become critical at FAANG scale: the lack of streaming alerting, the two-tier storage gap between hot and cold, and the absence of a unified query interface across storage tiers.
The first limitation is alerting latency. The standard pipeline (v1) indexes logs and pre-computes dashboard metrics but has no mechanism for streaming alerting. Detecting that the 5xx error rate for the checkout service exceeds 1 percent in a 5-minute window requires either polling Elasticsearch periodically (adding minutes of delay) or building custom aggregations from the Redis metrics cache (limited to pre-defined aggregations). In a production incident, minutes of alert delay translate directly to minutes of customer impact. The multi-tier variant adds a dedicated AlertEngine that evaluates rules on every parsed log record as it flows through the pipeline, providing sub-5-second alert latency.
The second limitation is the storage gap between hot and cold tiers. The standard pipeline has two tiers: hot (Elasticsearch, 7 days, ~$2,300/TB/month) and cold (S3, 7 years, ~$0.01/TB/month). Data transitions directly from hot to cold with no intermediate tier. This means that querying logs from 8 days ago requires a cold-tier S3 scan at 30-60 seconds latency, even though 8-day-old data is frequently accessed during incident investigations. The multi-tier variant introduces a warm tier: compressed Parquet files in S3 Standard ($23/TB/month) retaining 90 days of data with ~10-second query latency via S3 Select. This fills the gap between expensive-fast and cheap-slow.
The third limitation is the lack of a unified query interface. In the standard pipeline, engineers must know which tier contains the data they need and use different tools for each tier. Recent data requires an Elasticsearch query; historical data requires S3 Select or Athena. The multi-tier variant adds a distributed QueryEngine that accepts a single query with a time range and automatically determines which tiers to query, executes sub-queries in parallel, and merges results transparently. This is how production platforms like Datadog present a seamless search experience across petabytes of data spanning years.
Additional capabilities in the multi-tier variant include schema-on-read parsing (auto-detecting JSON, logfmt, and grok patterns without requiring upfront schema definition), sampling for high-cardinality metrics (keeping 100 percent of ERROR, 10 percent of WARN, 1 percent of DEBUG to reduce indexing cost 100x while preserving statistical accuracy), and a columnar hot index (ClickHouse or Apache Druid) that provides 10x faster aggregation queries than Elasticsearch for dashboard workloads.
The operational complexity is significantly higher than the standard pipeline: 11 components across 3 storage tiers, a streaming alert engine with sliding-window state management, a distributed query engine with cross-tier query planning, and a warm compactor managing continuous hot-to-warm and warm-to-cold data lifecycle. This architecture requires a dedicated SRE team to operate reliably. The trade-off is justified at FAANG scale where the combination of sub-second alerting, transparent cross-tier queries, and optimized storage economics directly reduces mean time to detection and mean time to resolution for production incidents.
The multi-tier logging platform uses 11 components organized into ingest, processing, alerting, query, and storage layers. The design prioritizes three properties that the standard pipeline lacks: streaming alerting (sub-5-second anomaly detection), transparent cross-tier queries (one query spans hot, warm, and cold), and 3-tier storage economics (columnar hot, Parquet warm, Glacier cold).
The ingest path begins with LogAgent, representing millions of hosts streaming batched log lines via gRPC to IngestGateway (AWS NLB). IngestGateway validates tenant headers, applies per-tenant rate limiting, and publishes to IngestStream (Kafka with 512 partitions). The API returns immediately — all downstream processing is async. At 100M lines per second, Kafka's deep buffer (100M messages) absorbs burst traffic during incidents and deployments.
ParserWorker (200 instances) consumes from IngestStream and is the central routing hub. For each log batch, it applies schema-on-read parsing: auto-detecting JSON, logfmt, and unstructured text patterns, extracting structured fields (service, level, trace_id, user_id), and enriching with metadata. It then applies sampling rules (100 percent ERROR, 10 percent WARN, 1 percent DEBUG for indexing; all data preserved in warm/cold tiers). Parsed records are routed to three destinations: (1) bulk index into HotIndexDB, (2) forward to AlertEngine for real-time rule evaluation, and (3) buffer for WarmCompactor.
HotIndexDB is a columnar store (ClickHouse or Apache Druid on OpenSearch) retaining the last 24 hours of parsed data. Columnar storage provides 10x faster aggregation queries than Elasticsearch for time-series workloads: GROUP BY service, level, minute over 24 hours completes in approximately 100ms. The trade-off is weaker full-text search (bloom filters instead of inverted indexes), which is acceptable for the 24-hour hot tier where most queries are aggregation-heavy (dashboards, alerts, SLO monitors). 6 data nodes with 128 partitions and 2 replicas.
AlertEngine (50 pods) receives parsed log records from ParserWorker and evaluates them against active rules stored in AlertDB (PostgreSQL). Rules define conditions on sliding windows: for example, '5xx rate exceeds 1 percent in a 5-minute window for service:checkout.' Each worker maintains in-memory sliding-window counters per rule per service. When a threshold is breached, AlertEngine writes the fired alert to AlertDB and triggers notifications (webhook, PagerDuty, Slack). Sub-5-second latency from log emission to alert firing.
WarmCompactor (30 workers) manages two lifecycle transitions: hot-to-warm and warm-to-cold. For hot-to-warm, it reads data older than 24 hours from HotIndexDB, compresses into columnar Parquet files organized by tenant/date/hour, and writes to WarmStore (S3 Standard, ~$23/TB/month). For warm-to-cold, it transitions data older than 90 days from WarmStore to ColdArchive (S3 Glacier Deep Archive, ~$0.00099/GB/month) via S3 lifecycle policies.
QueryEngine (20 pods, 100 threads each) is the unified search interface. It receives queries from QueryClient, determines which tiers contain relevant data based on the time range, and executes sub-queries in parallel: (1) HotIndexDB for the last 24 hours (~500ms), (2) WarmStore via S3 Select on Parquet for 1-90 days (~10s), and (3) ColdArchive via Glacier restore for 90 days to 7 years (minutes to hours). Results are merged, sorted by timestamp, and returned. Dashboard metric queries are served directly from HotIndexDB aggregations at sub-500ms latency.
Choice
Columnar store for the 24-hour hot tier instead of Elasticsearch
Rationale
Dashboard and alerting queries are primarily aggregations: error rate per service per minute, p99 latency per endpoint, throughput per region. Columnar stores process these 10x faster than Elasticsearch because they read only the columns needed for the aggregation, skipping the message text entirely. The trade-off is weaker full-text search — bloom filters instead of inverted indexes. For the 24h hot tier where most queries are aggregations, columnar is the right trade-off.
Choice
Dedicated AlertEngine evaluating rules on every parsed log record in real-time
Rationale
Polling-based alerting has inherent delay equal to the polling interval (typically 1-5 minutes). Streaming alerting evaluates every parsed log record as it flows through, detecting anomalies within seconds. This reduces mean time to detection from minutes to seconds. The cost is in-memory sliding-window state per rule per service, which requires careful memory management at 10K+ rules.
Choice
24h columnar hot, 90d Parquet warm, 7y Glacier cold
Rationale
Two tiers (hot + cold) leave a 10x cost gap and a 60x latency gap. The warm tier fills both: $23/TB/month is 100x cheaper than hot ($2,300/TB/month) and ~10s query latency is 100x faster than cold (minutes). Most incident investigations query data from 1-7 days ago, which falls squarely in the warm tier. The three-tier approach optimizes both cost and query latency for the actual access patterns.
Choice
Auto-detect log format (JSON, logfmt, grok) at parse time, no upfront schema
Rationale
In organizations with thousands of microservices, requiring all services to emit logs in a strict schema is impractical. Schema-on-read accepts any format and extracts structure at parse time. Known patterns (JSON, logfmt) are parsed automatically; unknown patterns are stored as raw text with the original message preserved. The trade-off is slightly higher parse-time CPU and occasional mis-parsing of ambiguous formats.
Choice
100% ERROR, 10% WARN, 1% DEBUG indexed; all data preserved in warm/cold
Rationale
High-cardinality fields (user_id, request_id) generate unbounded unique combinations. Indexing 100M lines per second of DEBUG logs in the hot tier is prohibitively expensive. Sampling reduces hot-tier indexing volume 100x while preserving statistical accuracy for aggregations and percentiles. All raw data is preserved in the warm and cold tiers for forensic investigation when needed.
Choice
Unified query interface spanning hot, warm, and cold tiers transparently
Rationale
Without a unified query engine, engineers must know which tier contains data for their time range and use different tools. The distributed QueryEngine accepts a single query, determines affected tiers from the time range, executes sub-queries in parallel, and merges results. This abstracts the storage complexity and provides a seamless search experience across years of data.
Target RPS
100M+ lines/sec ingest, 50K queries/sec, 10K alerts/sec
Latency (p99)
<10ms ingest, <1s hot search, <30s warm, minutes cold
Storage
~10 PB/day (hot: 24h, warm: 90d, cold: 7y)
Availability
99.9% (replicated across all tiers)
| Operation | Time | Space | Notes |
|---|---|---|---|
| Ingest log batch (POST /api/v1/logs) | O(1) Kafka publish per batch (~5ms ack) | O(B) per batch, B = batch size (~4KB) | Fire-and-forget from ingest endpoint. Kafka absorbs burst. Actual processing is async. |
| Alert rule evaluation | O(R) per parsed log record, R = matching rules | O(R * W) sliding window state, W = window buckets per rule | R = ~10-50 matching rules per service. Each evaluation updates in-memory counters. Sub-5ms per record. |
| Hot tier query (last 24h) | O(log P + S) partition pruning + columnar scan, P = partitions, S = scanned rows | O(K) result buffer | Columnar scan reads only needed columns. Aggregation queries complete in ~100ms. Full scan ~500ms. |
| Warm tier query (1-90 days) | O(F * C) per file scanned, F = Parquet files, C = column scan per file | O(K) result buffer | S3 Select pushes filtering to S3. Partition pruning by date/hour reduces files scanned. p99 ~10s. |
| Cold tier query (90d-7y) | O(restore_time + F * C) Glacier restore + Parquet scan | O(K) result buffer + temporary restore storage | Glacier restore: 5 min (expedited) to 12 hours (standard). Query after restore similar to warm tier. |
| Cross-tier query (spans multiple tiers) | O(max(hot, warm, cold)) — parallel sub-queries, latency = slowest tier | O(K_hot + K_warm + K_cold) merge buffer | QueryEngine executes sub-queries in parallel and merges results. Hot results stream first. |
Columnar hot-tier index storing the last 24 hours of parsed, enriched log data. Optimized for aggregation queries (GROUP BY service, level, minute) and time-range scans. Per-tenant partitioning for multi-tenant isolation. ParserWorker bulk-inserts at 1000 docs per batch. WarmCompactor reads and deletes data older than 24h.
128 partitions, 2 replicas, 6 data nodes. Columnar compression reduces storage 5-10x vs row-based. Bloom filter on message enables approximate text search.
Warm-tier compressed Parquet files in S3 Standard for 1-90 day retention. Organized by tenant/date/hour with Hive-compatible partition metadata. Queryable via S3 Select at approximately 10-second latency. Data transitions to ColdArchive after 90 days.
Columnar Parquet format enables scanning only needed columns. S3 Select pushes filtering to S3, reducing data transfer. Cost: ~$23/TB/month.
Cold-tier for 7-year compliance retention. Data arrives from WarmStore via S3 lifecycle policy after 90 days. Retrieval requires restore request: standard (12 hours), expedited (5 minutes at premium cost). Used for compliance audits and forensic investigations.
Cost: ~$0.00099/GB/month. Minimum storage duration: 180 days. Retrieval: 5 min (expedited) to 12 hours (standard).
Alert rule definitions: condition expressions, sliding window durations, thresholds, notification targets. AlertEngine caches rules in memory and reloads on changes. Approximately 10K rules at scale.
Indexes: idx_rules_tenant ON (tenant_id, enabled)
~10K rules. AlertEngine caches in memory, polls for changes every 30s.
History of fired alerts. Written by AlertEngine when a rule triggers. Read by dashboards and on-call UIs. Indexed by (tenant_id, fired_at) for time-range queries. Retention: 90 days.
Indexes: idx_alerts_tenant_time ON (tenant_id, fired_at DESC), idx_alerts_rule ON (rule_id, fired_at DESC)
Write-heavy during incidents. ~1K writes/sec at peak. 90-day retention with automated cleanup.
| Variant | Tier | Latency | Throughput | Cost | Complexity | Reliability |
|---|---|---|---|---|---|---|
| Naive (Direct-to-Database) | T1 | 2ms ingest, 2-30s search | ~500 lines/sec | $200/month (single RDS) | Low — 3 components, no async | 99% (single DB, no failover) |
| Kafka + Indexing Pipeline | T2 | <10ms ingest, <5s hot search | 100M lines/sec | $15,000/month (Kafka + ES + S3) | Medium — Kafka, ES, workers | 99.9% (replicated components) |
| Multi-Tier with Alerting | T3 | <10ms ingest, <1s hot, <30s warm | 100M+ lines/sec | $25,000/month (3-tier + alerting) | High — 11 components, 3 tiers | 99.9% (replicated, multi-tier) |
This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.
The hot tier primarily serves aggregation queries for dashboards and alert evaluation: error rate per service per minute, p99 latency per endpoint, throughput by region. Columnar stores (ClickHouse, Apache Druid) process these aggregations 10x faster than Elasticsearch because they read only the columns needed, skipping the large message text column entirely. At 24 hours of data, full-text search is less critical — most searches target specific services and time ranges, which columnar handles efficiently via partition pruning.
AlertEngine receives parsed log records directly from ParserWorker (not from the indexed data). Each record is evaluated against all matching active rules immediately. Rules maintain in-memory sliding-window counters: when a 5-minute window for 'service:checkout AND level:ERROR' exceeds its threshold, the alert fires instantly. The end-to-end latency is: Kafka publish (~5ms) + ParserWorker (~8ms) + AlertEngine evaluation (~5ms) = approximately 18ms from log emission to alert evaluation. The 5-second target includes buffer for Kafka consumer lag during bursts.
The warm tier (S3 Standard, $23/TB/month) stores compressed Parquet files queryable via S3 Select in approximately 10 seconds. The cold tier (S3 Glacier Deep Archive, $0.99/TB/month) stores the same data but requires a restore request (5 minutes expedited, 12 hours standard) before it can be queried. The warm tier covers 1-90 days (frequent incident investigation window), while the cold tier covers 90 days to 7 years (compliance and forensic investigations). The 100x cost difference between warm and cold makes the lifecycle transition economically essential.
Sampling is applied by log level, not by user or service. All ERROR logs are indexed at 100 percent fidelity — no sampling. WARN logs are sampled at 10 percent, and DEBUG at 1 percent for the hot tier. Importantly, all raw unsampled data is preserved in the warm and cold tiers. If an engineer needs the specific DEBUG log lines that were sampled out of the hot tier, they can query the warm tier (Parquet in S3) where the complete data exists. Sampling only affects hot-tier indexing cost, not data durability.
QueryEngine parses the time range from the query and determines which tiers to query: last 24h hits HotIndexDB, 1-90d hits WarmStore, and older than 90d hits ColdArchive. For queries spanning multiple tiers (e.g., last 48 hours), it executes sub-queries in parallel against both hot and warm tiers, then merges results by timestamp. The response streams results as they arrive — hot-tier results appear in sub-second while warm-tier results arrive seconds later. Cold-tier queries require explicit acknowledgment due to Glacier restore latency.
Datadog's logging architecture uses a similar multi-tier approach: real-time indexing for recent data (their 'Online Archives'), compressed storage for medium-term (their 'Flex Logs'), and archival for long-term retention. They use a streaming pipeline for alerting (Monitors) and a unified query interface (Log Explorer) spanning all tiers. The core patterns — tiered storage, streaming alerting, distributed query, and sampling — are the same. This template captures the essential architecture without Datadog's proprietary compression algorithms and ML-based anomaly detection.
Sign in to join the discussion.
Ready to design your own Logging Pipeline?
Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.
Open Simulator