Vetora logo
Hard12 componentsInterview: High

Logging Pipeline — Tiered Observability Platform

Design a petabyte-scale logging pipeline ingesting 100M log lines/sec with tiered hot/warm/cold storage. Covers Kafka buffering, Elasticsearch indexing, S3 archival, and real-time dashboard metrics.

StreamingSearchTiered StorageObservability
Problem Statement

The logging and observability pipeline is a critical system design interview topic because every production system at scale depends on reliable log collection, search, and retention. The core challenge is ingesting an enormous volume of log data from millions of distributed agents, indexing it for fast search, and retaining it cost-efficiently for years. At companies like Datadog, Splunk, and Elastic, the logging infrastructure is often the most expensive and operationally complex system in the entire stack.

At production scale, a platform serving thousands of microservices generates 100 million log lines per second at peak. Each log line must be parsed (structured JSON or unstructured text with grok patterns), enriched with metadata (service name, host, trace ID), and indexed for full-text search. Engineers expect to search recent logs (last 7 days) in under 5 seconds and historical logs (up to 7 years for compliance) within 60 seconds. The daily data volume can reach 10 petabytes, making it economically impossible to keep all data in an expensive indexed tier.

The problem demands a tiered storage architecture where hot data (7 days) lives in a fast, indexed search engine like Elasticsearch, while cold data lives in cheap object storage like S3 in compressed columnar format. The transition between tiers must be automated, reliable, and transparent to users. Additionally, the pipeline must support real-time dashboard metrics (error rates, p99 latencies, throughput by service) without hammering the search index on every dashboard refresh.

This template models the complete observability pipeline: agent-based log collection, Kafka buffering for burst absorption, parser workers for enrichment and indexing, Elasticsearch hot tier for fast search, S3 cold tier for long-term retention, rolloff workers for automated tier transitions, and a Redis metrics cache for real-time dashboards.

Architecture Overview

The logging pipeline architecture implements a tiered storage pattern that balances search performance against storage cost at petabyte scale. The ingest path begins with log agents running on millions of hosts, streaming batched log lines via gRPC to an NLB (Network Load Balancer chosen for ultra-high-throughput L4 routing at 100K batches per second). The NLB distributes traffic to an IngestService fleet (30 pods) that validates the batch, stamps it with tenant_id and ingest_timestamp, and publishes to IngestStream (Kafka with 256 partitions for massive parallelism). The ingest endpoint returns 200 immediately, making log collection fully asynchronous.

ParserWorker (100 instances) consumes from Kafka, parses structured (JSON) and unstructured (grok/regex) log lines, enriches them with metadata such as service name and trace_id, computes sliding-window aggregations for the metrics cache, and bulk-indexes into the IndexDB (OpenSearch/Elasticsearch with 64 shards, 2 replicas, and per-tenant indices for isolation). The hot tier retains 7 days of fully indexed data, enabling sub-5-second full-text search across billions of documents using inverted indices.

The query path is separated from ingest to prevent log volume from affecting search latency. Engineers send search queries through an API Gateway (with per-tenant rate limiting) to SearchService (15 pods). For recent queries, SearchService queries the hot tier in Elasticsearch at approximately 500ms p99. For historical queries beyond 7 days, it queries ArchiveStore (S3 with Parquet files) using S3 Select at approximately 30 seconds p99. Dashboard reads hit the MetricsCache (Redis 6-node cluster) for pre-computed aggregations in approximately 2ms.

RolloffWorker (20 instances) continuously migrates expired data from the hot tier to the cold tier, compressing log data into Parquet format and writing to S3. Standard-IA storage handles the first 90 days at approximately $0.0125/GB/month, and Glacier handles long-term retention at approximately $0.004/GB/month. This tiered approach reduces storage cost by 100x compared to keeping everything in Elasticsearch.

Architecture Preview
Loading architecture preview...
Key Design Decisions
Ingest Buffering Strategy

Choice

Kafka with 256 partitions and 50M message deep buffer

Rationale

Log traffic is extremely bursty; a deployment or production incident can spike ingest volume 10x within seconds. Kafka absorbs these bursts with deep buffering while parser workers consume at a sustainable rate. Without this buffer, index workers would need to auto-scale instantly with traffic, which takes minutes and causes data loss during the ramp-up period. Kafka also provides replay capability for reprocessing after parser bug fixes.

Hot Tier Search Engine

Choice

OpenSearch/Elasticsearch with per-tenant indices

Rationale

Full-text search on structured log data (e.g., service:checkout AND level:ERROR) is Elasticsearch's core strength. The inverted index enables sub-second search across billions of documents. Per-tenant indices (logs-{tenant_id}-{date}) provide data isolation, independent retention policies, and the ability to throttle a chatty tenant without affecting others. Alternatives like ClickHouse are faster for aggregations but significantly weaker for ad-hoc full-text search.

Cold Tier Storage Format

Choice

Compressed Parquet on S3 with lifecycle policies

Rationale

Parquet is a columnar format, which means scanning only the message column of 1 TB of logs reads approximately 10 GB instead of the full terabyte, reducing S3 scan cost by 100x. S3 Select and Athena can query Parquet in-place without loading data into a database. Combined with S3 lifecycle policies that transition data to Glacier after 90 days, the total retention cost drops to approximately $0.004/GB/month, making 7-year compliance retention economically feasible.

Dashboard Metrics Layer

Choice

Pre-computed sliding-window aggregations in Redis

Rationale

Dashboard queries such as error rate, p99 latency, and throughput by service would overwhelm Elasticsearch if computed on every dashboard refresh at 50K reads per second. Instead, ParserWorker computes these aggregations as a side effect of log processing and writes them to Redis with one-minute-bucket granularity. Dashboard reads resolve in approximately 2ms from Redis rather than running 5-second Elasticsearch aggregation queries, reducing search cluster load by orders of magnitude.

Scale & Performance

Target RPS

100M log lines/s ingest, 10K search queries/s

Latency (p99)

<5s (hot search p99), <60s (cold search p99)

Storage

~10 PB/day ingest, 7-year retention in S3

Availability

99.95%

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions
Why use tiered storage instead of indexing all logs in Elasticsearch?

At 10 PB per day, keeping all data in Elasticsearch would cost approximately $500,000 per day in compute and storage. S3 storage costs roughly $23 per TB per month compared to approximately $2,300 per TB per month for Elasticsearch indexed data. Since 95% of log searches target the last 24 hours, the hot tier (7 days of indexed data) handles the vast majority of queries. Historical searches against the cold tier are slower (30-60 seconds) but occur infrequently and at dramatically lower cost. The 100x cost difference makes tiered storage essential at this scale.

How do you prevent one noisy tenant from affecting other tenants in a multi-tenant logging platform?

Multi-tenant isolation is enforced at three levels. First, rate limiting on the IngestService caps per-tenant ingest volume, preventing one chatty microservice from monopolizing Kafka throughput. Second, per-tenant Kafka partitions ensure processing isolation so that a tenant generating malformed logs that cause parser failures does not block other tenants. Third, per-tenant Elasticsearch indices (logs-{tenant_id}-{date}) provide search isolation, allowing independent resource allocation and the ability to throttle or deprioritize expensive queries from a single tenant.

How does the pipeline handle log format diversity across hundreds of services?

ParserWorker supports multiple parsing strategies. Structured logs in JSON format are parsed directly into indexed fields. Unstructured logs are processed using grok patterns (regular expressions that extract named fields from free-text log lines). Each tenant or service can register a custom parser configuration. The parser also performs enrichment, adding metadata fields like service name, host, and distributed trace ID. Unparseable log lines are stored as raw text in a catch-all field to ensure no data is dropped, and parsing failures are tracked as metrics for alerting.

What happens if the Elasticsearch hot tier falls behind on indexing?

If parser workers cannot keep up with ingest volume, the Kafka buffer absorbs the backlog. With a 50 million message capacity, the buffer can absorb approximately 8 minutes of peak traffic (100M lines per second batched into 100K messages per second). During this window, recently ingested logs are not yet searchable but are safely queued for processing. If the backlog persists, operational alerts fire and additional parser workers are scaled up. In production, sampling non-critical log levels (keep 100% of ERROR, 10% of INFO, 1% of DEBUG) is a common strategy to reduce indexing volume while preserving the most important data.

How do you implement real-time alerting on top of a logging pipeline?

This template pre-computes sliding-window metrics in Redis via ParserWorker, which covers dashboard use cases. For real-time alerting (e.g., alert if 5xx rate exceeds 1% in a 5-minute window), production systems add a stream processing layer such as Apache Flink or Kafka Streams that consumes from the IngestStream in parallel with parser workers. The stream processor evaluates windowed aggregation rules in real time and triggers alerts via a notification service. This avoids querying Elasticsearch for alerting, which would add unacceptable latency for time-critical alerts.

Related Templates

Discussion

Sign in to join the discussion.

Ready to design your own Logging Pipeline?

Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.

Open Simulator