Vetora logo
Easy6 componentsInterview: High

Web Crawler — Producer-Consumer (Kafka Frontier)

Industry-standard web crawler using Kafka as a distributed URL frontier with parallel workers, Redis Bloom filter for URL dedup, S3 for HTML storage, and PostgreSQL for metadata. Achieves 400 pages/sec sustained with crash recovery via Kafka offset replay.

StorageKafkaRedisProducer-ConsumerBloom Filter
Problem Statement

The producer-consumer web crawler is the standard production approach for crawling at moderate to large scale. It solves the three fundamental problems that make the naive single-threaded approach unworkable: sequential throughput (5 pages/sec ceiling), lack of deduplication (infinite re-crawling via link cycles), and no crash recovery (in-memory queue lost on restart).

The key architectural shift is decoupling URL discovery from URL fetching via a durable distributed queue. Instead of a single service maintaining an in-memory queue, the system uses Kafka as a distributed URL frontier. Seed URLs and extracted links are published to Kafka topics partitioned by domain. A pool of 20 CrawlWorkers consume URLs from Kafka partitions in parallel — each worker independently fetches pages, checks dedup, stores content, and publishes extracted links back to Kafka. This producer-consumer loop is the core of the recursive crawl.

The parallelism unlocked by this architecture is transformative. Where the naive approach processes 1 URL at a time (5 pages/sec), 20 workers processing in parallel achieve 400 pages/sec — an 80x improvement. Each worker handles approximately 20 pages/sec, limited by politeness delays (1 request/sec per domain) rather than compute. Adding more workers and Kafka partitions scales throughput linearly until network or storage becomes the bottleneck.

URL deduplication is handled by a Redis Bloom filter shared across all workers. Before fetching a URL, each worker checks the Bloom filter. At 10B URLs, the filter uses approximately 10GB RAM with O(1) lookups and a 0.1% false positive rate. The false positive rate means approximately 10M URLs per 10B are incorrectly skipped — acceptable for general crawling but not for exhaustive archival. The Bloom filter rejects approximately 60% of extracted links as already-seen, saving significant fetch overhead.

Crash recovery is handled by Kafka's offset-based consumption model. Each worker tracks its position in the Kafka partition. If a worker crashes, Kafka replays uncommitted offsets to a replacement worker — no URLs are lost, no re-crawling of already-committed pages. This is a fundamental improvement over the in-memory queue, which loses all state on crash.

The architecture uses domain-based partitioning in Kafka: all URLs for a given domain are routed to the same partition, which is consumed by the same worker or worker group. This enables per-domain politeness enforcement — the worker can maintain a per-domain rate limiter ensuring no more than 1 request per second to any single domain. Without domain partitioning, concurrent workers could inadvertently DDoS a domain by crawling it in parallel.

This template is the answer interviewers expect when they ask 'How would you scale a web crawler?' The progression from naive (sequential, no dedup) to producer-consumer (parallel, Bloom filter, Kafka durability) demonstrates understanding of the fundamental scaling techniques.

Architecture Overview

The producer-consumer web crawler uses six components organized into a pipeline: SeedClient, CrawlService, UrlFrontier (Kafka), CrawlWorker pool, DedupCache (Redis), CrawlDatabase (PostgreSQL), and ContentStore (S3). The architecture separates URL discovery (CrawlService + LinkExtraction within workers) from URL fetching (CrawlWorkers) via Kafka, enabling parallel processing and crash recovery.

Seed URLs enter through SeedClient via the CrawlService API. CrawlService is a lightweight orchestration layer running on 2 ECS Fargate pods — it validates URLs, checks robots.txt compliance, and publishes seed URLs to the UrlFrontier (Kafka topic). CrawlService also serves an admin stats API that reads from CrawlDatabase. At approximately 10 RPS for seed injection, CrawlService is dramatically over-provisioned — its primary role is the admin interface, not the crawl engine.

The UrlFrontier is Amazon MSK (Managed Kafka) with 32 partitions, partitioned by domain. Each message contains a URL, domain, crawl depth from seed, and optional priority score. Kafka serves as both the URL queue (durable, partitioned) and the crash recovery mechanism (offset-based replay). At 400 URLs/sec consumption rate, Kafka is well within its throughput limits. Messages are retained for 7 days to support re-crawl scenarios.

The CrawlWorker pool is the crawl engine — 20 ECS Fargate workers consuming from Kafka partitions. Each worker processes URLs sequentially within its assigned partitions. For each URL: (1) check DedupCache for the URL hash — if seen, skip; (2) fetch the page via HTTP with politeness delay; (3) parse HTML to extract outbound links; (4) store raw HTML in ContentStore (S3); (5) write page metadata to CrawlDatabase; (6) mark the URL as seen in DedupCache; (7) publish extracted links back to UrlFrontier. Each worker processes approximately 20 pages/sec, limited by the 1 req/sec/domain politeness constraint across the domains assigned to its partitions.

DedupCache is Amazon ElastiCache for Redis running a Bloom filter for URL deduplication. Three nodes with 26 GB memory support 10B URLs with O(1) lookups. The Bloom filter is checked on every URL before fetching — approximately 60% of extracted links are rejects (already-seen URLs). Workers also cache per-domain robots.txt in this Redis cluster with 1-hour TTL.

ContentStore is Amazon S3 storing raw HTML. Each page is stored as an object keyed by URL hash. At 400 pages/sec with 50KB average page size, S3 ingests approximately 20 MB/sec (approximately 50 TB/month). S3 lifecycle policies move content to Infrequent Access after 30 days.

CrawlDatabase is Amazon RDS PostgreSQL storing crawl metadata — URL, domain, title, status code, timestamp, and outbound link count. Indexed by domain and fetched_at for freshness queries and admin dashboards. At 400 INSERTs/sec, a single primary handles the write load.

Architecture Preview
Loading architecture preview...
Request Flow — Producer-Consumer Crawl Loop

This sequence diagram shows the recursive crawl loop: workers consume URLs from Kafka, check the Bloom filter, fetch pages, store content and metadata, and publish extracted links back to Kafka. The critical insight is the parallelism — 20 workers process URLs concurrently across 32 Kafka partitions, achieving 400 pages/sec versus the naive approach's 5 pages/sec.

Loading diagram...

Step-by-Step Walkthrough

  1. 1Admin injects seed URLs via CrawlService — published to Kafka partitioned by domain
  2. 2CrawlWorker consumes the next URL from its assigned Kafka partition
  3. 3Worker checks Redis Bloom filter: if URL hash exists, skip (already crawled)
  4. 4Worker fetches the page via HTTP GET with per-domain politeness delay (1 req/sec/domain)
  5. 5Worker parses HTML and extracts all outbound links
  6. 6Worker stores raw HTML in S3 keyed by URL hash
  7. 7Worker writes page metadata to PostgreSQL (URL, domain, title, status, timestamp)
  8. 8Worker marks URL as seen in Bloom filter (BF.ADD)
  9. 9Worker publishes extracted links to Kafka for the next crawl round
  10. 10Loop repeats — 20 workers process in parallel across 32 partitions

Pseudocode

// Worker crawl loop — runs on each of 20 workers
async function workerLoop(kafkaConsumer, partition):
    for message in kafkaConsumer.consume(partition):
        url = message.value.url
        domain = message.value.domain
        url_hash = sha256(url)

        // 1. URL dedup via Bloom filter
        if await redis.bf_exists("url_bloom", url_hash):
            kafkaConsumer.commit(message.offset)
            continue  // Already crawled — skip

        // 2. Politeness delay (1 req/sec/domain)
        await rateLimiter.waitForDomain(domain)

        // 3. Fetch page
        response = await httpGet(url)  // ~200ms avg

        // 4. Parse HTML, extract links
        links = parseHtml(response.body)
            .querySelectorAll("a[href]")
            .map(a => resolveUrl(url, a.href))

        // 5. Store content and metadata
        await s3.putObject(url_hash, response.body)  // ~5ms
        await db.execute(
            "INSERT INTO pages VALUES ($1, $2, $3, $4, $5, NOW())",
            [url_hash, url, domain, response.title, response.status]
        )  // ~30ms

        // 6. Mark as seen + publish links
        await redis.bf_add("url_bloom", url_hash)
        for link in links:
            await kafka.publish("url-frontier", {
                url: link, domain: extractDomain(link),
                depth: message.value.depth + 1
            })

        kafkaConsumer.commit(message.offset)
Key Design Decisions
Kafka as URL Frontier

Choice

Durable, partitioned Kafka topic instead of in-memory queue

Rationale

Kafka provides three critical properties: (1) durability — URLs survive process crashes via disk-backed logs, (2) partitioning — domain-based partitioning enables per-domain politeness, (3) offset replay — crashed workers resume without re-crawling. At 10B URLs with 7-day retention, Kafka is more cost-effective than a database-backed priority queue.

Redis Bloom Filter for URL Dedup

Choice

Probabilistic dedup with 0.1% false positive rate

Rationale

A Bloom filter handles 10B URLs in ~10GB RAM with O(1) lookups. False positives (occasionally skipping a new URL) are acceptable — missing 0.1% of URLs is better than storing 10B URLs in a hash set requiring ~80GB RAM. Redis provides the Bloom filter as a shared data structure visible to all 20 workers.

Domain-Based Kafka Partitioning

Choice

Partition by domain so all URLs for a domain go to the same worker

Rationale

Per-domain partitioning ensures a single worker handles all URLs for a given domain, enabling per-domain rate limiting (1 req/sec/domain). Without this, 20 workers could independently crawl the same domain at 20 req/sec — an unintentional DDoS that would get the crawler IP-blocked.

S3 for Raw HTML Storage

Choice

Object storage for full page content, separate from metadata

Rationale

At 400 pages/sec x 50KB average = ~20MB/sec = ~50TB/month. S3 provides virtually unlimited storage at low cost with 11-9s durability. Storing HTML in PostgreSQL would cause TOAST overhead, vacuum costs, and table bloat. Separating content (S3) from metadata (PostgreSQL) enables independent scaling.

20-Worker Pool

Choice

20 parallel Fargate workers with 32 Kafka partitions

Rationale

Each worker processes ~20 pages/sec (limited by per-domain politeness, not CPU). 20 workers x 20 pages/sec = 400 pages/sec. 32 Kafka partitions allow up to 32 workers. The 20/32 ratio leaves headroom for scaling without repartitioning. Workers are stateless — adding more is a simple scaling operation.

Scale & Performance

Target RPS

400 pages/sec sustained

Latency (p99)

~50ms per page (parallel workers)

Storage

~50 TB/month (S3 HTML + PostgreSQL metadata)

Availability

99.9% (Kafka + Redis replicated)

Database Schema (HLD)
pages

Stores crawl metadata for every fetched page. Primary key is url_hash (SHA-256 of the URL), ensuring each URL appears exactly once. Indexed by domain + fetched_at for freshness-based re-crawl scheduling. Write-heavy at ~400 INSERTs/sec. Raw HTML is stored separately in S3.

url_hash VARCHAR PK (SHA-256 hash of URL)url TEXT NOT NULL (full URL)domain VARCHAR(255) NOT NULL (indexed for domain queries)title TEXT (parsed from HTML <title>)status_code INTEGER NOT NULL (HTTP response status)fetched_at TIMESTAMPTZ NOT NULL (indexed for freshness queries)

Indexes: pk_pages ON (url_hash), idx_pages_domain_fetched ON (domain, fetched_at DESC)

At 400 pages/sec, grows ~1B rows/month. Raw HTML in S3, not in this table. Domain + fetched_at index supports re-crawl scheduling queries.

Event Contracts
url-frontierurl-frontier

Distributed URL frontier queue. Partitioned by domain across 32 partitions for per-domain politeness enforcement. Both seed URLs and extracted outbound links are published here, forming the recursive crawl loop. Workers consume from assigned partitions and commit offsets after successful processing.

Key Schema

domain: string (partition key for per-domain ordering)

Value Schema

{ url: string, domain: string, depth: integer, priority: float (optional) }

Solution Comparison
VariantTierLatencyThroughputCostComplexityReliability
Naive (Single-Threaded BFS)T1200-2000ms per page~5 pages/sec$50/month (tiny RDS + 1 Fargate task)Low — 3 components, sequential95% (no redundancy, in-memory queue)
Producer-Consumer (Kafka Frontier)T250ms per page (parallel)400 pages/sec$2,000/month (Kafka + Redis + S3 + workers)Medium — Kafka, Bloom filter, workers99.9% (Kafka offset replay, replicated)
Distributed (Bloom + Priority Frontier)T350ms per page (parallel)1000 pages/sec$5,000/month (priority Kafka + dual Redis + S3)High — priority frontier, SimHash, DNS cache99.9% (crash-resilient, checkpointed)

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions
Why Kafka instead of a simpler queue like SQS or RabbitMQ?

Kafka provides three features critical for web crawling that SQS and RabbitMQ lack: (1) domain-based partitioning for per-domain politeness enforcement, (2) offset-based replay for crash recovery without re-crawling, and (3) log-based retention that allows the URL frontier to hold millions of URLs durably without consuming them. SQS deletes messages on consumption; Kafka retains them until the retention window expires.

How does the Bloom filter handle false positives?

At 0.1% false positive rate with 10B URLs, approximately 10M URLs are incorrectly marked as 'already seen' and skipped. For general-purpose search crawling, this is acceptable — the crawler will eventually discover these URLs via other paths. For exhaustive archival crawling (e.g., Internet Archive), the false positive rate is too high and must be supplemented with exact-match dedup using a database or hash table.

What happens when a CrawlWorker crashes mid-page-fetch?

Kafka tracks each worker's consumption offset. If a worker crashes before committing its offset, the URL is replayed to another worker in the same consumer group. The page may be fetched twice (once by the crashed worker, once by the replacement), but the Bloom filter prevents re-processing on subsequent encounters. The worst case is a duplicate entry in S3 and PostgreSQL, which is acceptable.

Why is there no content deduplication in this variant?

This variant deduplicates URLs but not content. The same article syndicated across 100 domains has 100 different URLs, so the Bloom filter treats them as 100 unique pages. Each copy is fetched, stored in S3, and indexed separately. The Distributed variant adds SimHash-based content dedup that detects near-duplicate content across different URLs, saving ~30% in storage costs.

How does per-domain politeness work with Kafka partitioning?

All URLs for a given domain are hashed to the same Kafka partition, which is consumed by the same worker. The worker maintains a per-domain timestamp tracking the last request time. Before fetching a URL, it checks if 1 second has elapsed since the last request to that domain. If not, it delays. This ensures the crawler never exceeds 1 req/sec to any single domain, regardless of how many URLs are queued for that domain.

Related Templates

Discussion

Sign in to join the discussion.

Ready to design your own Web Crawler?

Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.

Open Simulator