Hard3 componentsInterview: High

Logging Pipeline — Naive (Direct-to-Database)

Q: Why does the naive approach collapse at just 500 log lines per second?

Each log line is a synchronous PostgreSQL INSERT requiring WAL write, B-tree index update, and fsync. With a 50-connection pool and approximately 2ms per INSERT, the theoretical ceiling is 25,000 inserts per second, but WAL contention, checkpoint I/O, MVCC overhead, and index maintenance reduce the real sustained ceiling to approximately 500 lines per second. Search queries that hold connections for seconds further reduce available ingest capacity. The Pipeline variant solves this by buffering in Kafka (sub-5ms ack) and processing asynchronously.

Q: Why is SQL LIKE search so slow on log data?

SQL LIKE '%keyword%' with a leading wildcard cannot use a B-tree index because the index is sorted lexicographically from the start of the string, not from an arbitrary position. PostgreSQL must read every row in the table and check the message column against the pattern. At 500 lines per second, the table accumulates 43 million rows per day. After two days, a search scans 86 million rows. The Pipeline variant uses Elasticsearch's inverted index, which pre-maps every word to the documents containing it, enabling sub-second search across billions of rows.

Q: What is the first optimization an interviewer expects?

Add a message queue (Kafka) between the agents and the database to decouple ingest from persistence. This allows the ingest endpoint to return immediately after publishing to Kafka (sub-5ms), while worker processes consume from Kafka and batch-insert into a search index (Elasticsearch) at a sustainable rate. This single change increases ingest capacity from 500 lines per second to millions and adds burst absorption during traffic spikes.

Q: Why not just add a PostgreSQL full-text index (tsvector/GIN)?

A PostgreSQL GIN index on a tsvector column would dramatically improve search performance — from O(N) sequential scan to O(log N) index lookup. However, it adds significant write amplification: every INSERT must also update the GIN index, reducing write throughput further. More importantly, PostgreSQL full-text search lacks the operational tooling, horizontal scaling, and query DSL of Elasticsearch. For a production logging system at scale, Elasticsearch is the industry-standard choice.

Q: How does the connection pool become the bottleneck?

The 50-connection pool is shared between ingest INSERTs (~2ms each) and search SELECTs (~2-30 seconds each). At 500 inserts per second, ingest alone requires approximately 1 connection-second per second (500 x 0.002s). A single search query at 10M rows holds a connection for 30 seconds. Ten concurrent searches consume 10 connections for 30 seconds each, reducing the effective pool to 40 connections for ingest. This mutual degradation between reads and writes is why the Pipeline variant separates ingest (Kafka path) from search (Elasticsearch path).

Q: Why not add a read replica to separate read and write load?

A read replica would help by directing search queries to the replica, freeing the primary's connection pool for ingest writes. However, it introduces replication lag (recently ingested logs may not be searchable for seconds), doubles storage cost, and does not solve the fundamental problems: LIKE search is still O(N) on the replica, and ingest throughput is still limited by the primary's INSERT capacity. The Pipeline variant solves both problems with Elasticsearch (fast search) and Kafka (async ingest).

The simplest logging system: every log line is a synchronous INSERT into PostgreSQL. No buffering, no indexing, no tiering. Demonstrates why direct database writes collapse under even moderate log volume.

ObservabilityBeginnerBottleneck AnalysisDirect Write

Try in Simulator

Problem Statement

Designing a centralized logging pipeline is one of the most important system design interview questions because it forces candidates to reason about extreme write-read asymmetry, tiered storage economics, and the tension between search latency and retention cost. The naive direct-to-database approach is where every candidate should start because it establishes the baseline that makes the improvements in buffered and tiered architectures measurable and concrete.

The core challenge is capturing log data from a fleet of hosts, persisting it durably, and making it searchable for debugging and incident investigation. In the naive approach, this is done with the simplest possible architecture: a single LogService that receives log lines via HTTP POST and writes each one synchronously to a PostgreSQL database. Search is implemented as SQL LIKE queries on the raw message column, which translates to a sequential full-table scan with no index assistance.

The naive approach has two fatal flaws that emerge at even modest scale. First, write throughput is capped by PostgreSQL's ability to process synchronous INSERTs. Each INSERT requires WAL write, B-tree index update, and fsync. With a 50-connection pool and approximately 2ms per INSERT at low load, the theoretical ceiling is 25,000 inserts per second, but WAL contention, checkpoint I/O, and lock contention reduce the real ceiling to approximately 500 sustained log lines per second. At this rate, the connection pool saturates and ingest latency climbs from 2ms to 50ms or more, with spillover affecting search queries that share the same connection pool.

Second, search performance degrades linearly with data volume. SQL LIKE '%keyword%' cannot use a B-tree index because the wildcard prefix defeats index seek. Every search is a sequential scan of the entire logs table. At 500 lines per second, the table accumulates approximately 43 million rows per day. After just two days, a search query must scan 86 million rows, taking 30 seconds or more. There is no full-text index, no inverted index, no bloom filter — just raw sequential I/O.

The naive approach also lacks any mechanism for tiered storage. All logs live in PostgreSQL on a single disk. When the disk fills (typically 500 GB for a standard RDS instance), old logs must be manually deleted. There is no warm tier, no cold archive, no lifecycle policy. Compliance requirements for multi-year retention are impossible to meet. The Pipeline variant (v1) solves this with S3 archival at $0.01 per GB per month.

This template makes the write ceiling and search degradation visible and quantifiable. Run the simulation at increasing log rates and watch PostgreSQL connection pool utilization climb while search latency degrades from 2 seconds to 30 seconds. The comparison with the Pipeline and Multi-Tier variants provides concrete numbers for the discussion of buffering, indexing, and tiered storage that interviewers expect.

Logging pipeline design appears in interviews at Datadog, Splunk, Elastic, AWS, Google, and any company operating at scale. Interviewers expect candidates to start with the direct-write approach, identify the write throughput ceiling and search degradation, and then propose Kafka buffering and Elasticsearch indexing as the first optimization.

Architecture Overview

The naive logging system is a three-component linear architecture: LogAgent (Client), LogService, and LogDB (PostgreSQL). There is no load balancer, no event stream, no search index, no cache, and no tiered storage. Every request flows through the same single path.

LogAgent represents the fleet of application hosts generating log data. Each agent sends individual log lines via HTTP POST to LogService. There is no batching at the agent level — each log line is a separate HTTP request, adding per-request overhead (TCP connection, HTTP headers, JSON serialization). At peak, the fleet generates approximately 1,000 log lines per second, but the system's actual ceiling is around 500 lines per second due to the database bottleneck.

LogService is a stateless REST API running on 3 ECS Fargate pods with 50 threads each, providing 150 total threads. It handles two operations: (1) log ingest via POST /api/v1/logs, which validates the payload (non-empty message, valid log level) and performs a synchronous INSERT into the logs table; and (2) log search via POST /api/v1/search, which translates the query into a SQL SELECT with LIKE '%keyword%' on the message column. The service processing time is approximately 5ms, but the total response time is dominated by the database operation — 2ms for INSERTs at low load, 500ms to 30 seconds for searches depending on table size.

LogDB is a single Amazon RDS PostgreSQL instance (db.r7g.large, 2 vCPU, 16 GB RAM) with strong consistency and no replicas. The logs table stores all log data with columns for log_id (UUID), service name, log level, raw message text, timestamp, and created_at. B-tree indexes on (service, timestamp) and (level, timestamp) support filtered queries like 'show ERROR logs from checkout service in the last hour,' but they cannot accelerate LIKE '%keyword%' on the unindexed message column. The connection pool is limited to 50 connections, shared between ingest writes and search reads.

The system has zero redundancy. A single PostgreSQL primary handles all reads and writes. If the database fails, both ingest and search stop entirely. There is no failover, no read replica, no connection pooler (like PgBouncer). The write and read workloads compete for the same resources: connection pool, buffer cache, WAL bandwidth, and disk I/O. Under load, search queries that hold connections for seconds starve the ingest path, causing log lines to be dropped or delayed.

The concrete scaling ceiling is approximately 500 log lines per second sustained. At this rate, the 50-connection pool is 80 percent utilized (assuming 2ms per INSERT), leaving minimal headroom for search queries. A single search query holding a connection for 5 seconds at 1 million rows consumes 1 of 50 connections for the duration. Ten concurrent searches consume 10 connections, reducing ingest capacity by 20 percent. This mutual degradation between reads and writes is the fundamental problem that the Pipeline variant solves with separate ingest (Kafka) and search (Elasticsearch) paths.

Architecture Preview

Loading architecture preview...

Open in Simulator

Key Design Decisions

Synchronous INSERT per Log Line

Choice

Every log line triggers a synchronous PostgreSQL INSERT before the HTTP response returns

Rationale

This is the simplest durability model — if the HTTP response is 201, the log line is on disk. No buffering, no async pipeline, no data loss window. The cost is a hard throughput ceiling around 500 lines per second. The Pipeline variant decouples ingest from persistence via Kafka, achieving 100M lines per second.

SQL LIKE for Search

Choice

Search implemented as SELECT WHERE message LIKE '%keyword%'

Rationale

SQL LIKE requires zero additional infrastructure — no Elasticsearch cluster, no inverted index maintenance. It works for small tables (under 100K rows). The cost is O(N) sequential scan time that grows linearly with table size. At 1M rows, search takes 2-5 seconds. At 10M rows, it exceeds 30 seconds. The Pipeline variant replaces this with Elasticsearch for sub-second full-text search.

Single PostgreSQL Instance

Choice

One database for all log data with no replicas or read replicas

Rationale

A single PostgreSQL instance eliminates replication lag, failover complexity, and split-brain scenarios. ACID transactions ensure every log line is durably written. The cost is a single point of failure and shared resources between ingest and search workloads, leading to mutual degradation under load.

No Buffering or Batching

Choice

No Kafka, no in-memory buffer, no batch INSERT

Rationale

Eliminating buffering simplifies the architecture to three components with zero operational overhead. The trade-off is that every log line pays the full cost of an individual INSERT (WAL write, index update, fsync). Batch INSERTs would improve throughput 10x by amortizing per-transaction overhead, and Kafka buffering would add burst absorption.

No Tiered Storage

Choice

All logs in PostgreSQL on a single disk, no archival

Rationale

Without tiered storage, there is one data location and one query interface. The cost is that storage is limited to the PostgreSQL disk (typically 500 GB for standard RDS), retention is limited to weeks or months, and per-GB cost is approximately $0.10/GB/month versus $0.004/GB/month for S3 Glacier. Multi-year compliance retention is impossible.

Scale & Performance

Target RPS

~500 sustained (DB ceiling)

Latency (p99)

2ms ingest, 2-30s search (varies with table size)

Storage

~40 MB/day at 500 lines/sec (~500 GB max on single disk)

Availability

~99% (single DB, no redundancy)

Time & Space Complexity

Operation	Time	Space	Notes
Ingest log line (POST /api/v1/logs)	O(1) INSERT + O(log N) B-tree index update	O(1) per log line (~512 bytes with indexes)	N = total rows in logs table. Index update cost grows logarithmically. Bottleneck is WAL write + fsync.
Search logs (POST /api/v1/search)	O(N) sequential scan for LIKE '%keyword%'	O(K) result buffer, K = matching rows (capped at 100)	N = total rows. Full-table scan because leading wildcard defeats B-tree index. At 10M rows, takes 30+ seconds.
Filtered search (service + time range)	O(log N + M) index seek + scan, M = rows in range	O(K) result buffer	B-tree index on (service, timestamp) enables efficient range scan. Much faster than unfiltered LIKE, but still scans M rows for keyword match.

Database Schema (HLD)

logs

Single-table log storage in PostgreSQL. Every log line is a row. INSERTed synchronously on ingest, scanned with SQL LIKE on search. B-tree indexes on (service, timestamp) and (level, timestamp) support filtered queries but cannot accelerate LIKE '%keyword%' on the message column. Table grows at approximately 40 MB per day at 500 lines per second.

log_id UUID PK (generated UUID)service VARCHAR NOT NULL (source service name, indexed)level VARCHAR NOT NULL (DEBUG/INFO/WARN/ERROR, indexed)message TEXT NOT NULL (raw log message, unindexed for LIKE)timestamp TIMESTAMPTZ NOT NULL (log event time, indexed)created_at TIMESTAMPTZ NOT NULL DEFAULT now() (DB insert time)

Indexes: idx_logs_service_ts ON (service, timestamp DESC), idx_logs_level_ts ON (level, timestamp DESC)

At 500 lines/sec with ~512 byte average row size, table grows ~40 MB/day. After 30 days, ~1.2 GB. LIKE '%keyword%' scans every row regardless of indexes.

Solution Comparison

Variant	Tier	Latency	Throughput	Cost	Complexity	Reliability
Naive (Direct-to-Database)	T1	2ms ingest, 2-30s search	~500 lines/sec	$200/month (single RDS)	Low — 3 components, no async	99% (single DB, no failover)
Kafka + Indexing Pipeline	T2	<10ms ingest, <5s hot search	100M lines/sec	$15,000/month (Kafka + ES + S3)	Medium — Kafka, ES, workers	99.9% (replicated components)
Multi-Tier with Alerting	T3	<10ms ingest, <1s hot, <30s warm	100M+ lines/sec	$25,000/month (3-tier + alerting)	High — 11 components, 3 tiers	99.9% (replicated, multi-tier)

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions

Why does the naive approach collapse at just 500 log lines per second?

Each log line is a synchronous PostgreSQL INSERT requiring WAL write, B-tree index update, and fsync. With a 50-connection pool and approximately 2ms per INSERT, the theoretical ceiling is 25,000 inserts per second, but WAL contention, checkpoint I/O, MVCC overhead, and index maintenance reduce the real sustained ceiling to approximately 500 lines per second. Search queries that hold connections for seconds further reduce available ingest capacity. The Pipeline variant solves this by buffering in Kafka (sub-5ms ack) and processing asynchronously.

Why is SQL LIKE search so slow on log data?

SQL LIKE '%keyword%' with a leading wildcard cannot use a B-tree index because the index is sorted lexicographically from the start of the string, not from an arbitrary position. PostgreSQL must read every row in the table and check the message column against the pattern. At 500 lines per second, the table accumulates 43 million rows per day. After two days, a search scans 86 million rows. The Pipeline variant uses Elasticsearch's inverted index, which pre-maps every word to the documents containing it, enabling sub-second search across billions of rows.

What is the first optimization an interviewer expects?

Add a message queue (Kafka) between the agents and the database to decouple ingest from persistence. This allows the ingest endpoint to return immediately after publishing to Kafka (sub-5ms), while worker processes consume from Kafka and batch-insert into a search index (Elasticsearch) at a sustainable rate. This single change increases ingest capacity from 500 lines per second to millions and adds burst absorption during traffic spikes.

Why not just add a PostgreSQL full-text index (tsvector/GIN)?

A PostgreSQL GIN index on a tsvector column would dramatically improve search performance — from O(N) sequential scan to O(log N) index lookup. However, it adds significant write amplification: every INSERT must also update the GIN index, reducing write throughput further. More importantly, PostgreSQL full-text search lacks the operational tooling, horizontal scaling, and query DSL of Elasticsearch. For a production logging system at scale, Elasticsearch is the industry-standard choice.

How does the connection pool become the bottleneck?

The 50-connection pool is shared between ingest INSERTs (~2ms each) and search SELECTs (~2-30 seconds each). At 500 inserts per second, ingest alone requires approximately 1 connection-second per second (500 x 0.002s). A single search query at 10M rows holds a connection for 30 seconds. Ten concurrent searches consume 10 connections for 30 seconds each, reducing the effective pool to 40 connections for ingest. This mutual degradation between reads and writes is why the Pipeline variant separates ingest (Kafka path) from search (Elasticsearch path).

Why not add a read replica to separate read and write load?

A read replica would help by directing search queries to the replica, freeing the primary's connection pool for ingest writes. However, it introduces replication lag (recently ingested logs may not be searchable for seconds), doubles storage cost, and does not solve the fundamental problems: LIKE search is still O(N) on the replica, and ingest throughput is still limited by the primary's INSERT capacity. The Pipeline variant solves both problems with Elasticsearch (fast search) and Kafka (async ingest).

Related Templates

Logging Pipeline — Kafka + Indexing Pipeline Logging Pipeline — Multi-Tier with Alerting & Distributed Query

Discussion

Ready to design your own Logging Pipeline?

Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.

Open Simulator