Hard11 componentsInterview: High

Logging Pipeline — Multi-Tier with Alerting & Distributed Query

Q: Why a columnar hot index instead of Elasticsearch?

The hot tier primarily serves aggregation queries for dashboards and alert evaluation: error rate per service per minute, p99 latency per endpoint, throughput by region. Columnar stores (ClickHouse, Apache Druid) process these aggregations 10x faster than Elasticsearch because they read only the columns needed, skipping the large message text column entirely. At 24 hours of data, full-text search is less critical — most searches target specific services and time ranges, which columnar handles efficiently via partition pruning.

Q: How does the streaming AlertEngine achieve sub-5-second latency?

AlertEngine receives parsed log records directly from ParserWorker (not from the indexed data). Each record is evaluated against all matching active rules immediately. Rules maintain in-memory sliding-window counters: when a 5-minute window for 'service:checkout AND level:ERROR' exceeds its threshold, the alert fires instantly. The end-to-end latency is: Kafka publish (~5ms) + ParserWorker (~8ms) + AlertEngine evaluation (~5ms) = approximately 18ms from log emission to alert evaluation. The 5-second target includes buffer for Kafka consumer lag during bursts.

Q: How does the warm tier differ from the cold tier?

The warm tier (S3 Standard, $23/TB/month) stores compressed Parquet files queryable via S3 Select in approximately 10 seconds. The cold tier (S3 Glacier Deep Archive, $0.99/TB/month) stores the same data but requires a restore request (5 minutes expedited, 12 hours standard) before it can be queried. The warm tier covers 1-90 days (frequent incident investigation window), while the cold tier covers 90 days to 7 years (compliance and forensic investigations). The 100x cost difference between warm and cold makes the lifecycle transition economically essential.

Q: What happens when a user crosses the sampling threshold?

Sampling is applied by log level, not by user or service. All ERROR logs are indexed at 100 percent fidelity — no sampling. WARN logs are sampled at 10 percent, and DEBUG at 1 percent for the hot tier. Importantly, all raw unsampled data is preserved in the warm and cold tiers. If an engineer needs the specific DEBUG log lines that were sampled out of the hot tier, they can query the warm tier (Parquet in S3) where the complete data exists. Sampling only affects hot-tier indexing cost, not data durability.

Q: How does the distributed QueryEngine handle cross-tier queries?

QueryEngine parses the time range from the query and determines which tiers to query: last 24h hits HotIndexDB, 1-90d hits WarmStore, and older than 90d hits ColdArchive. For queries spanning multiple tiers (e.g., last 48 hours), it executes sub-queries in parallel against both hot and warm tiers, then merges results by timestamp. The response streams results as they arrive — hot-tier results appear in sub-second while warm-tier results arrive seconds later. Cold-tier queries require explicit acknowledgment due to Glacier restore latency.

Q: How does this compare to Datadog's actual architecture?

Datadog's logging architecture uses a similar multi-tier approach: real-time indexing for recent data (their 'Online Archives'), compressed storage for medium-term (their 'Flex Logs'), and archival for long-term retention. They use a streaming pipeline for alerting (Monitors) and a unified query interface (Log Explorer) spanning all tiers. The core patterns — tiered storage, streaming alerting, distributed query, and sampling — are the same. This template captures the essential architecture without Datadog's proprietary compression algorithms and ML-based anomaly detection.

Production-grade observability platform: columnar hot index (24h), Parquet warm tier (90d), S3 Glacier cold archive (7y), streaming alert engine, distributed cross-tier query engine, schema-on-read parsing, and sampling for high-cardinality metrics.

ObservabilityMulti-TierAlertingDistributed QueryFAANG-Scale

Try in Simulator

Problem Statement

The multi-tier logging platform with streaming alerting represents the state of the art in production observability — the architecture behind Datadog, Splunk Cloud, Elastic Observability, and AWS CloudWatch at their largest scale. It addresses three limitations of the standard Kafka-plus-Elasticsearch pipeline that become critical at FAANG scale: the lack of streaming alerting, the two-tier storage gap between hot and cold, and the absence of a unified query interface across storage tiers.

The first limitation is alerting latency. The standard pipeline (v1) indexes logs and pre-computes dashboard metrics but has no mechanism for streaming alerting. Detecting that the 5xx error rate for the checkout service exceeds 1 percent in a 5-minute window requires either polling Elasticsearch periodically (adding minutes of delay) or building custom aggregations from the Redis metrics cache (limited to pre-defined aggregations). In a production incident, minutes of alert delay translate directly to minutes of customer impact. The multi-tier variant adds a dedicated AlertEngine that evaluates rules on every parsed log record as it flows through the pipeline, providing sub-5-second alert latency.

The second limitation is the storage gap between hot and cold tiers. The standard pipeline has two tiers: hot (Elasticsearch, 7 days, ~$2,300/TB/month) and cold (S3, 7 years, ~$0.01/TB/month). Data transitions directly from hot to cold with no intermediate tier. This means that querying logs from 8 days ago requires a cold-tier S3 scan at 30-60 seconds latency, even though 8-day-old data is frequently accessed during incident investigations. The multi-tier variant introduces a warm tier: compressed Parquet files in S3 Standard ($23/TB/month) retaining 90 days of data with ~10-second query latency via S3 Select. This fills the gap between expensive-fast and cheap-slow.

The third limitation is the lack of a unified query interface. In the standard pipeline, engineers must know which tier contains the data they need and use different tools for each tier. Recent data requires an Elasticsearch query; historical data requires S3 Select or Athena. The multi-tier variant adds a distributed QueryEngine that accepts a single query with a time range and automatically determines which tiers to query, executes sub-queries in parallel, and merges results transparently. This is how production platforms like Datadog present a seamless search experience across petabytes of data spanning years.

Additional capabilities in the multi-tier variant include schema-on-read parsing (auto-detecting JSON, logfmt, and grok patterns without requiring upfront schema definition), sampling for high-cardinality metrics (keeping 100 percent of ERROR, 10 percent of WARN, 1 percent of DEBUG to reduce indexing cost 100x while preserving statistical accuracy), and a columnar hot index (ClickHouse or Apache Druid) that provides 10x faster aggregation queries than Elasticsearch for dashboard workloads.

The operational complexity is significantly higher than the standard pipeline: 11 components across 3 storage tiers, a streaming alert engine with sliding-window state management, a distributed query engine with cross-tier query planning, and a warm compactor managing continuous hot-to-warm and warm-to-cold data lifecycle. This architecture requires a dedicated SRE team to operate reliably. The trade-off is justified at FAANG scale where the combination of sub-second alerting, transparent cross-tier queries, and optimized storage economics directly reduces mean time to detection and mean time to resolution for production incidents.

Architecture Overview

The multi-tier logging platform uses 11 components organized into ingest, processing, alerting, query, and storage layers. The design prioritizes three properties that the standard pipeline lacks: streaming alerting (sub-5-second anomaly detection), transparent cross-tier queries (one query spans hot, warm, and cold), and 3-tier storage economics (columnar hot, Parquet warm, Glacier cold).

The ingest path begins with LogAgent, representing millions of hosts streaming batched log lines via gRPC to IngestGateway (AWS NLB). IngestGateway validates tenant headers, applies per-tenant rate limiting, and publishes to IngestStream (Kafka with 512 partitions). The API returns immediately — all downstream processing is async. At 100M lines per second, Kafka's deep buffer (100M messages) absorbs burst traffic during incidents and deployments.

ParserWorker (200 instances) consumes from IngestStream and is the central routing hub. For each log batch, it applies schema-on-read parsing: auto-detecting JSON, logfmt, and unstructured text patterns, extracting structured fields (service, level, trace_id, user_id), and enriching with metadata. It then applies sampling rules (100 percent ERROR, 10 percent WARN, 1 percent DEBUG for indexing; all data preserved in warm/cold tiers). Parsed records are routed to three destinations: (1) bulk index into HotIndexDB, (2) forward to AlertEngine for real-time rule evaluation, and (3) buffer for WarmCompactor.

HotIndexDB is a columnar store (ClickHouse or Apache Druid on OpenSearch) retaining the last 24 hours of parsed data. Columnar storage provides 10x faster aggregation queries than Elasticsearch for time-series workloads: GROUP BY service, level, minute over 24 hours completes in approximately 100ms. The trade-off is weaker full-text search (bloom filters instead of inverted indexes), which is acceptable for the 24-hour hot tier where most queries are aggregation-heavy (dashboards, alerts, SLO monitors). 6 data nodes with 128 partitions and 2 replicas.

AlertEngine (50 pods) receives parsed log records from ParserWorker and evaluates them against active rules stored in AlertDB (PostgreSQL). Rules define conditions on sliding windows: for example, '5xx rate exceeds 1 percent in a 5-minute window for service:checkout.' Each worker maintains in-memory sliding-window counters per rule per service. When a threshold is breached, AlertEngine writes the fired alert to AlertDB and triggers notifications (webhook, PagerDuty, Slack). Sub-5-second latency from log emission to alert firing.

WarmCompactor (30 workers) manages two lifecycle transitions: hot-to-warm and warm-to-cold. For hot-to-warm, it reads data older than 24 hours from HotIndexDB, compresses into columnar Parquet files organized by tenant/date/hour, and writes to WarmStore (S3 Standard, ~$23/TB/month). For warm-to-cold, it transitions data older than 90 days from WarmStore to ColdArchive (S3 Glacier Deep Archive, ~$0.00099/GB/month) via S3 lifecycle policies.

QueryEngine (20 pods, 100 threads each) is the unified search interface. It receives queries from QueryClient, determines which tiers contain relevant data based on the time range, and executes sub-queries in parallel: (1) HotIndexDB for the last 24 hours (~500ms), (2) WarmStore via S3 Select on Parquet for 1-90 days (~10s), and (3) ColdArchive via Glacier restore for 90 days to 7 years (minutes to hours). Results are merged, sorted by timestamp, and returned. Dashboard metric queries are served directly from HotIndexDB aggregations at sub-500ms latency.

Architecture Preview

Loading architecture preview...

Open in Simulator

Key Design Decisions

Columnar Hot Index (ClickHouse/Druid) vs Elasticsearch

Choice

Columnar store for the 24-hour hot tier instead of Elasticsearch

Rationale

Dashboard and alerting queries are primarily aggregations: error rate per service per minute, p99 latency per endpoint, throughput per region. Columnar stores process these 10x faster than Elasticsearch because they read only the columns needed for the aggregation, skipping the message text entirely. The trade-off is weaker full-text search — bloom filters instead of inverted indexes. For the 24h hot tier where most queries are aggregations, columnar is the right trade-off.

Streaming Alert Engine

Choice

Dedicated AlertEngine evaluating rules on every parsed log record in real-time

Rationale

Polling-based alerting has inherent delay equal to the polling interval (typically 1-5 minutes). Streaming alerting evaluates every parsed log record as it flows through, detecting anomalies within seconds. This reduces mean time to detection from minutes to seconds. The cost is in-memory sliding-window state per rule per service, which requires careful memory management at 10K+ rules.

Three Storage Tiers (Hot/Warm/Cold)

Choice

24h columnar hot, 90d Parquet warm, 7y Glacier cold

Rationale

Two tiers (hot + cold) leave a 10x cost gap and a 60x latency gap. The warm tier fills both: $23/TB/month is 100x cheaper than hot ($2,300/TB/month) and ~10s query latency is 100x faster than cold (minutes). Most incident investigations query data from 1-7 days ago, which falls squarely in the warm tier. The three-tier approach optimizes both cost and query latency for the actual access patterns.

Schema-on-Read Parsing

Choice

Auto-detect log format (JSON, logfmt, grok) at parse time, no upfront schema

Rationale

In organizations with thousands of microservices, requiring all services to emit logs in a strict schema is impractical. Schema-on-read accepts any format and extracts structure at parse time. Known patterns (JSON, logfmt) are parsed automatically; unknown patterns are stored as raw text with the original message preserved. The trade-off is slightly higher parse-time CPU and occasional mis-parsing of ambiguous formats.

Sampling for High-Cardinality Metrics

Choice

100% ERROR, 10% WARN, 1% DEBUG indexed; all data preserved in warm/cold

Rationale

High-cardinality fields (user_id, request_id) generate unbounded unique combinations. Indexing 100M lines per second of DEBUG logs in the hot tier is prohibitively expensive. Sampling reduces hot-tier indexing volume 100x while preserving statistical accuracy for aggregations and percentiles. All raw data is preserved in the warm and cold tiers for forensic investigation when needed.

Distributed Cross-Tier Query Engine

Choice

Unified query interface spanning hot, warm, and cold tiers transparently

Rationale

Without a unified query engine, engineers must know which tier contains data for their time range and use different tools. The distributed QueryEngine accepts a single query, determines affected tiers from the time range, executes sub-queries in parallel, and merges results. This abstracts the storage complexity and provides a seamless search experience across years of data.

Scale & Performance

Target RPS

100M+ lines/sec ingest, 50K queries/sec, 10K alerts/sec

Latency (p99)

<10ms ingest, <1s hot search, <30s warm, minutes cold

Storage

~10 PB/day (hot: 24h, warm: 90d, cold: 7y)

Availability

99.9% (replicated across all tiers)

Time & Space Complexity

Operation	Time	Space	Notes
Ingest log batch (POST /api/v1/logs)	O(1) Kafka publish per batch (~5ms ack)	O(B) per batch, B = batch size (~4KB)	Fire-and-forget from ingest endpoint. Kafka absorbs burst. Actual processing is async.
Alert rule evaluation	O(R) per parsed log record, R = matching rules	O(R * W) sliding window state, W = window buckets per rule	R = ~10-50 matching rules per service. Each evaluation updates in-memory counters. Sub-5ms per record.
Hot tier query (last 24h)	O(log P + S) partition pruning + columnar scan, P = partitions, S = scanned rows	O(K) result buffer	Columnar scan reads only needed columns. Aggregation queries complete in ~100ms. Full scan ~500ms.
Warm tier query (1-90 days)	O(F * C) per file scanned, F = Parquet files, C = column scan per file	O(K) result buffer	S3 Select pushes filtering to S3. Partition pruning by date/hour reduces files scanned. p99 ~10s.
Cold tier query (90d-7y)	O(restore_time + F * C) Glacier restore + Parquet scan	O(K) result buffer + temporary restore storage	Glacier restore: 5 min (expedited) to 12 hours (standard). Query after restore similar to warm tier.
Cross-tier query (spans multiple tiers)	O(max(hot, warm, cold)) — parallel sub-queries, latency = slowest tier	O(K_hot + K_warm + K_cold) merge buffer	QueryEngine executes sub-queries in parallel and merges results. Hot results stream first.

Database Schema (HLD)

hot_logs (Columnar Index — ClickHouse/Druid)

Columnar hot-tier index storing the last 24 hours of parsed, enriched log data. Optimized for aggregation queries (GROUP BY service, level, minute) and time-range scans. Per-tenant partitioning for multi-tenant isolation. ParserWorker bulk-inserts at 1000 docs per batch. WarmCompactor reads and deletes data older than 24h.

log_id STRING (UUID, primary key)tenant_id STRING (partition column for tenant isolation)service STRING (source service name, dictionary-encoded)level STRING (DEBUG/INFO/WARN/ERROR, dictionary-encoded)message STRING (log text, bloom filter indexed)timestamp DATETIME (log event time, primary sort key)trace_id STRING (distributed trace correlation)parsed_fields JSON (extracted structured fields, schema-on-read)

128 partitions, 2 replicas, 6 data nodes. Columnar compression reduces storage 5-10x vs row-based. Bloom filter on message enables approximate text search.

warm_logs (S3 Parquet)

Warm-tier compressed Parquet files in S3 Standard for 1-90 day retention. Organized by tenant/date/hour with Hive-compatible partition metadata. Queryable via S3 Select at approximately 10-second latency. Data transitions to ColdArchive after 90 days.

partition_key STRING (S3 key: tenant/date/hour/shard.parquet)tenant_id STRING (Parquet partition column)date STRING (date partition for range scans)hour STRING (hour partition for granular access)size_bytes BIGINT (Parquet file size)row_count BIGINT (number of log rows in file)

Columnar Parquet format enables scanning only needed columns. S3 Select pushes filtering to S3, reducing data transfer. Cost: ~$23/TB/month.

cold_logs (S3 Glacier Deep Archive)

Cold-tier for 7-year compliance retention. Data arrives from WarmStore via S3 lifecycle policy after 90 days. Retrieval requires restore request: standard (12 hours), expedited (5 minutes at premium cost). Used for compliance audits and forensic investigations.

archive_key STRING (S3 Glacier key: tenant/year/month/shard)tenant_id STRING (partition for tenant-scoped restores)date_range STRING (date range covered by archive file)size_bytes BIGINT (archive file size)

Cost: ~$0.00099/GB/month. Minimum storage duration: 180 days. Retrieval: 5 min (expedited) to 12 hours (standard).

alert_rules (PostgreSQL)

Alert rule definitions: condition expressions, sliding window durations, thresholds, notification targets. AlertEngine caches rules in memory and reloads on changes. Approximately 10K rules at scale.

rule_id UUID PKtenant_id VARCHAR (indexed, for tenant-scoped rule management)name VARCHAR (human-readable rule name)condition TEXT (rule condition expression, e.g., 'level:ERROR AND service:checkout')window_seconds INT (sliding window duration, e.g., 300 for 5 minutes)threshold FLOAT (threshold value, e.g., 0.01 for 1%)severity VARCHAR (info/warn/critical)enabled BOOLEAN (rule active flag)

Indexes: idx_rules_tenant ON (tenant_id, enabled)

~10K rules. AlertEngine caches in memory, polls for changes every 30s.

fired_alerts (PostgreSQL)

History of fired alerts. Written by AlertEngine when a rule triggers. Read by dashboards and on-call UIs. Indexed by (tenant_id, fired_at) for time-range queries. Retention: 90 days.

alert_id UUID PKrule_id UUID FK (references alert_rules)tenant_id VARCHAR (indexed)service VARCHAR (service that triggered the alert)fired_at TIMESTAMPTZ (indexed, when the alert fired)resolved_at TIMESTAMPTZ (when the alert resolved, nullable)severity VARCHAR (alert severity at fire time)details JSONB (alert context — window values, threshold, matched logs)

Indexes: idx_alerts_tenant_time ON (tenant_id, fired_at DESC), idx_alerts_rule ON (rule_id, fired_at DESC)

Write-heavy during incidents. ~1K writes/sec at peak. 90-day retention with automated cleanup.

Solution Comparison

Variant	Tier	Latency	Throughput	Cost	Complexity	Reliability
Naive (Direct-to-Database)	T1	2ms ingest, 2-30s search	~500 lines/sec	$200/month (single RDS)	Low — 3 components, no async	99% (single DB, no failover)
Kafka + Indexing Pipeline	T2	<10ms ingest, <5s hot search	100M lines/sec	$15,000/month (Kafka + ES + S3)	Medium — Kafka, ES, workers	99.9% (replicated components)
Multi-Tier with Alerting	T3	<10ms ingest, <1s hot, <30s warm	100M+ lines/sec	$25,000/month (3-tier + alerting)	High — 11 components, 3 tiers	99.9% (replicated, multi-tier)

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions

Why a columnar hot index instead of Elasticsearch?

The hot tier primarily serves aggregation queries for dashboards and alert evaluation: error rate per service per minute, p99 latency per endpoint, throughput by region. Columnar stores (ClickHouse, Apache Druid) process these aggregations 10x faster than Elasticsearch because they read only the columns needed, skipping the large message text column entirely. At 24 hours of data, full-text search is less critical — most searches target specific services and time ranges, which columnar handles efficiently via partition pruning.

How does the streaming AlertEngine achieve sub-5-second latency?

AlertEngine receives parsed log records directly from ParserWorker (not from the indexed data). Each record is evaluated against all matching active rules immediately. Rules maintain in-memory sliding-window counters: when a 5-minute window for 'service:checkout AND level:ERROR' exceeds its threshold, the alert fires instantly. The end-to-end latency is: Kafka publish (~5ms) + ParserWorker (~8ms) + AlertEngine evaluation (~5ms) = approximately 18ms from log emission to alert evaluation. The 5-second target includes buffer for Kafka consumer lag during bursts.

How does the warm tier differ from the cold tier?

The warm tier (S3 Standard, $23/TB/month) stores compressed Parquet files queryable via S3 Select in approximately 10 seconds. The cold tier (S3 Glacier Deep Archive, $0.99/TB/month) stores the same data but requires a restore request (5 minutes expedited, 12 hours standard) before it can be queried. The warm tier covers 1-90 days (frequent incident investigation window), while the cold tier covers 90 days to 7 years (compliance and forensic investigations). The 100x cost difference between warm and cold makes the lifecycle transition economically essential.

What happens when a user crosses the sampling threshold?

Sampling is applied by log level, not by user or service. All ERROR logs are indexed at 100 percent fidelity — no sampling. WARN logs are sampled at 10 percent, and DEBUG at 1 percent for the hot tier. Importantly, all raw unsampled data is preserved in the warm and cold tiers. If an engineer needs the specific DEBUG log lines that were sampled out of the hot tier, they can query the warm tier (Parquet in S3) where the complete data exists. Sampling only affects hot-tier indexing cost, not data durability.

How does the distributed QueryEngine handle cross-tier queries?

QueryEngine parses the time range from the query and determines which tiers to query: last 24h hits HotIndexDB, 1-90d hits WarmStore, and older than 90d hits ColdArchive. For queries spanning multiple tiers (e.g., last 48 hours), it executes sub-queries in parallel against both hot and warm tiers, then merges results by timestamp. The response streams results as they arrive — hot-tier results appear in sub-second while warm-tier results arrive seconds later. Cold-tier queries require explicit acknowledgment due to Glacier restore latency.

How does this compare to Datadog's actual architecture?

Datadog's logging architecture uses a similar multi-tier approach: real-time indexing for recent data (their 'Online Archives'), compressed storage for medium-term (their 'Flex Logs'), and archival for long-term retention. They use a streaming pipeline for alerting (Monitors) and a unified query interface (Log Explorer) spanning all tiers. The core patterns — tiered storage, streaming alerting, distributed query, and sampling — are the same. This template captures the essential architecture without Datadog's proprietary compression algorithms and ML-based anomaly detection.

Related Templates

Logging Pipeline — Naive (Direct-to-Database)Logging Pipeline — Kafka + Indexing Pipeline

Discussion

Ready to design your own Logging Pipeline?

Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.

Open Simulator