Industry-standard web crawler using Kafka as a distributed URL frontier with parallel workers, Redis Bloom filter for URL dedup, S3 for HTML storage, and PostgreSQL for metadata. Achieves 400 pages/sec sustained with crash recovery via Kafka offset replay.
The producer-consumer web crawler is the standard production approach for crawling at moderate to large scale. It solves the three fundamental problems that make the naive single-threaded approach unworkable: sequential throughput (5 pages/sec ceiling), lack of deduplication (infinite re-crawling via link cycles), and no crash recovery (in-memory queue lost on restart).
The key architectural shift is decoupling URL discovery from URL fetching via a durable distributed queue. Instead of a single service maintaining an in-memory queue, the system uses Kafka as a distributed URL frontier. Seed URLs and extracted links are published to Kafka topics partitioned by domain. A pool of 20 CrawlWorkers consume URLs from Kafka partitions in parallel — each worker independently fetches pages, checks dedup, stores content, and publishes extracted links back to Kafka. This producer-consumer loop is the core of the recursive crawl.
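The loop described above can be sketched in a few lines. This is a toy model under stated assumptions, not the template's implementation: `queue.Queue` stands in for the Kafka frontier, a locked `set` stands in for the shared Bloom filter, and `LINK_GRAPH` is a hypothetical in-memory link graph that replaces real HTTP fetches.

```python
import queue
import threading

# Hypothetical link graph standing in for real pages and their outbound links.
LINK_GRAPH = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/", "b.example/"],  # cycle back to the seed
    "b.example/": ["a.example/1"],
}


def crawl(seeds, num_workers=4):
    frontier = queue.Queue()      # stand-in for the Kafka URL frontier
    seen = set()                  # stand-in for the shared Bloom filter
    seen_lock = threading.Lock()
    fetched = []

    def worker():
        while True:
            url = frontier.get()
            if url is None:       # shutdown sentinel
                frontier.task_done()
                return
            with seen_lock:       # check-and-mark atomically in this toy version
                is_new = url not in seen
                seen.add(url)
            if is_new:
                fetched.append(url)                    # "fetch and store"
                for link in LINK_GRAPH.get(url, []):   # "extract links"
                    frontier.put(link)                 # publish back to the frontier
            frontier.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for seed in seeds:
        frontier.put(seed)
    frontier.join()               # returns once every published URL is handled
    for _ in threads:
        frontier.put(None)
    for t in threads:
        t.join()
    return sorted(fetched)


result = crawl(["a.example/"])
```

Despite the link cycle, each page is fetched once, because every worker consults the same dedup structure before doing any work.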
The parallelism unlocked by this architecture is transformative. Where the naive approach processes 1 URL at a time (5 pages/sec), 20 workers processing in parallel achieve 400 pages/sec — an 80x improvement. Each worker handles approximately 20 pages/sec, limited by politeness delays (1 request/sec per domain) rather than compute. Adding more workers and Kafka partitions scales throughput linearly until network or storage becomes the bottleneck.
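The throughput figures above follow from simple arithmetic, reproduced here as a quick check. The constant names are illustrative. The last two values are derived rather than stated in the text: the number of domains a worker must interleave to stay within the politeness limit, and the resulting daily page count.

```python
WORKERS = 20
PAGES_PER_WORKER_PER_SEC = 20
NAIVE_PAGES_PER_SEC = 5
MAX_REQ_PER_SEC_PER_DOMAIN = 1

# Aggregate throughput and speedup over the single-threaded crawler.
total_pages_per_sec = WORKERS * PAGES_PER_WORKER_PER_SEC
speedup = total_pages_per_sec / NAIVE_PAGES_PER_SEC

# At 1 req/sec/domain, a worker can only reach 20 pages/sec if it is
# interleaving requests across at least this many distinct domains.
min_active_domains_per_worker = PAGES_PER_WORKER_PER_SEC / MAX_REQ_PER_SEC_PER_DOMAIN

pages_per_day = total_pages_per_sec * 86_400
```

The practical implication is that a worker whose partitions hold fewer than about 20 active domains cannot reach its nominal rate, however much CPU it has.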
URL deduplication is handled by a Redis Bloom filter shared across all workers. Before fetching a URL, each worker checks the Bloom filter. At 10B URLs and a 0.1% false positive rate, an optimally sized filter needs roughly 14.4 bits per URL, or approximately 18GB of RAM, with O(1) lookups. The false positive rate means approximately 10M URLs per 10B are incorrectly skipped, which is acceptable for general crawling but not for exhaustive archival. The Bloom filter rejects approximately 60% of extracted links as already seen, saving significant fetch overhead.
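As a rough sizing check, the textbook Bloom filter formulas can be evaluated for the stated workload. This is a sketch of the standard math only: the function names are illustrative, and the memory a real Redis deployment uses may differ from this theoretical minimum.

```python
import math


def bloom_size_bits(n_items: int, fp_rate: float) -> int:
    """Optimal bit count: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))


def optimal_hash_count(n_items: int, m_bits: int) -> int:
    """Optimal number of hash functions: k = (m / n) * ln 2."""
    return max(1, round(m_bits / n_items * math.log(2)))


N_URLS = 10_000_000_000
FP_RATE = 0.001

m_bits = bloom_size_bits(N_URLS, FP_RATE)
size_gb = m_bits / 8 / 1e9                  # decimal gigabytes
bits_per_url = m_bits / N_URLS
k = optimal_hash_count(N_URLS, m_bits)
expected_false_positives = N_URLS * FP_RATE  # URLs wrongly reported as seen
```

Each additional factor of ten in accuracy costs about 4.8 more bits per URL, which is why the false positive target dominates the memory budget.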
Crash recovery is handled by Kafka's offset-based consumption model. Each worker tracks its position in the Kafka partition. If a worker crashes, Kafka replays uncommitted offsets to a replacement worker — no URLs are lost, no re-crawling of already-committed pages. This is a fundamental improvement over the in-memory queue, which loses all state on crash.
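The replay behaviour can be illustrated with a toy model of one partition. `PartitionLog` and `run_worker` are hypothetical names and do not use the Kafka client API. The sketch assumes the worker commits after processing, so a crash between the two steps causes exactly one URL to be delivered again.

```python
class PartitionLog:
    """Toy stand-in for a single Kafka partition read by one consumer group."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.committed = 0  # next offset to deliver after a restart

    def poll_from_committed(self):
        for offset in range(self.committed, len(self.messages)):
            yield offset, self.messages[offset]

    def commit(self, offset):
        self.committed = offset + 1


def run_worker(log, processed, crash_after_processing=None):
    for offset, url in log.poll_from_committed():
        processed.append(url)  # fetch, store, publish links
        if offset == crash_after_processing:
            return             # simulated crash: offset never committed
        log.commit(offset)


log = PartitionLog(["u0", "u1", "u2", "u3"])
processed = []
run_worker(log, processed, crash_after_processing=1)  # first worker dies on u1
committed_after_crash = log.committed
run_worker(log, processed)                            # replacement resumes
```

Nothing is lost, but `u1` is processed twice. This is at-least-once delivery, so the downstream writes need to tolerate repeats.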
The architecture uses domain-based partitioning in Kafka: all URLs for a given domain are routed to the same partition, which is consumed by the same worker or worker group. This enables per-domain politeness enforcement — the worker can maintain a per-domain rate limiter ensuring no more than 1 request per second to any single domain. Without domain partitioning, concurrent workers could inadvertently DDoS a domain by crawling it in parallel.
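A minimal sketch of domain-based routing follows. It assumes the partition is chosen by hashing the hostname. MD5 is used here only as a stable stand-in; the Kafka Java client's default partitioner hashes the message key with murmur2, so real partition numbers would differ. `partition_for` is an illustrative name.

```python
import hashlib
from urllib.parse import urlsplit

NUM_PARTITIONS = 32


def partition_for(url: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # urlsplit().hostname is lower-cased, so EXAMPLE.com and example.com match.
    domain = urlsplit(url).hostname or ""
    digest = hashlib.md5(domain.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


p1 = partition_for("https://example.com/a")
p2 = partition_for("https://EXAMPLE.com/b?x=1")
p3 = partition_for("https://example.com:8443/c")
```

Every URL on a host maps to the same partition regardless of path, query, or port, which is the property the per-domain rate limiter depends on.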
This template is the answer interviewers expect when they ask 'How would you scale a web crawler?' The progression from naive (sequential, no dedup) to producer-consumer (parallel, Bloom filter, Kafka durability) demonstrates understanding of the fundamental scaling techniques.
The producer-consumer web crawler uses seven components organized into a pipeline: SeedClient, CrawlService, UrlFrontier (Kafka), CrawlWorker pool, DedupCache (Redis), CrawlDatabase (PostgreSQL), and ContentStore (S3). The architecture separates URL discovery (CrawlService + LinkExtraction within workers) from URL fetching (CrawlWorkers) via Kafka, enabling parallel processing and crash recovery.
Seed URLs enter through SeedClient via the CrawlService API. CrawlService is a lightweight orchestration layer running on 2 ECS Fargate pods — it validates URLs, checks robots.txt compliance, and publishes seed URLs to the UrlFrontier (Kafka topic). CrawlService also serves an admin stats API that reads from CrawlDatabase. At approximately 10 RPS for seed injection, CrawlService is dramatically over-provisioned — its primary role is the admin interface, not the crawl engine.
The UrlFrontier is Amazon MSK (Managed Kafka) with 32 partitions, partitioned by domain. Each message contains a URL, domain, crawl depth from seed, and optional priority score. Kafka serves as both the URL queue (durable, partitioned) and the crash recovery mechanism (offset-based replay). At 400 URLs/sec consumption rate, Kafka is well within its throughput limits. Messages are retained for 7 days to support re-crawl scenarios.
The CrawlWorker pool is the crawl engine — 20 ECS Fargate workers consuming from Kafka partitions. Each worker processes URLs sequentially within its assigned partitions. For each URL: (1) check DedupCache for the URL hash — if seen, skip; (2) fetch the page via HTTP with politeness delay; (3) parse HTML to extract outbound links; (4) store raw HTML in ContentStore (S3); (5) write page metadata to CrawlDatabase; (6) mark the URL as seen in DedupCache; (7) publish extracted links back to UrlFrontier. Each worker processes approximately 20 pages/sec, limited by the 1 req/sec/domain politeness constraint across the domains assigned to its partitions.
DedupCache is Amazon ElastiCache for Redis running a Bloom filter for URL deduplication. Three nodes with 26 GB memory support 10B URLs with O(1) lookups. The Bloom filter is checked on every URL before fetching, and approximately 60% of extracted links are rejected as already seen. Workers also cache each domain's robots.txt in this Redis cluster with a 1-hour TTL.
ContentStore is Amazon S3 storing raw HTML. Each page is stored as an object keyed by URL hash. At 400 pages/sec with 50KB average page size, S3 ingests approximately 20 MB/sec (approximately 50 TB/month). S3 lifecycle policies move content to Infrequent Access after 30 days.
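The ingest and storage figures can be reproduced directly. This uses decimal units (1 KB = 1,000 bytes) and a 30-day month, both of which are assumptions about how the numbers were rounded.

```python
PAGES_PER_SEC = 400
AVG_PAGE_BYTES = 50_000     # 50 KB average page, decimal units
SECONDS_PER_DAY = 86_400

bytes_per_sec = PAGES_PER_SEC * AVG_PAGE_BYTES
mb_per_sec = bytes_per_sec / 1e6
tb_per_30_days = bytes_per_sec * SECONDS_PER_DAY * 30 / 1e12
```

The exact result is slightly above the rounded monthly figure, and it excludes any compression applied before upload.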
CrawlDatabase is Amazon RDS PostgreSQL storing crawl metadata — URL, domain, title, status code, timestamp, and outbound link count. Indexed by domain and fetched_at for freshness queries and admin dashboards. At 400 INSERTs/sec, a single primary handles the write load.
This sequence diagram shows the recursive crawl loop: workers consume URLs from Kafka, check the Bloom filter, fetch pages, store content and metadata, and publish extracted links back to Kafka. The critical insight is the parallelism — 20 workers process URLs concurrently across 32 Kafka partitions, achieving 400 pages/sec versus the naive approach's 5 pages/sec.
Step-by-Step Walkthrough
Pseudocode
// Worker crawl loop: runs on each of the 20 workers
async function workerLoop(kafkaConsumer, partition):
    for message in kafkaConsumer.consume(partition):
        url = message.value.url
        domain = message.value.domain
        url_hash = sha256(url)

        // 1. URL dedup via Bloom filter
        if await redis.bf_exists("url_bloom", url_hash):
            kafkaConsumer.commit(message.offset)
            continue  // Already crawled, skip

        // 2. Politeness delay (1 req/sec/domain)
        await rateLimiter.waitForDomain(domain)

        // 3. Fetch page
        response = await httpGet(url)  // ~200ms avg

        // 4. Parse HTML, extract title and links
        doc = parseHtml(response.body)
        links = doc.querySelectorAll("a[href]")
                   .map(a => resolveUrl(url, a.href))

        // 5. Store content and metadata
        await s3.putObject(url_hash, response.body)  // ~5ms
        // ON CONFLICT keeps the insert idempotent if Kafka replays this URL
        await db.execute(
            "INSERT INTO pages VALUES ($1, $2, $3, $4, $5, NOW()) " +
            "ON CONFLICT (url_hash) DO NOTHING",
            [url_hash, url, domain, doc.title, response.status]
        )  // ~30ms

        // 6. Mark as seen + publish links
        await redis.bf_add("url_bloom", url_hash)
        for link in links:
            await kafka.publish("url-frontier", {
                url: link, domain: extractDomain(link),
                depth: message.value.depth + 1
            })

        // 7. Commit only after every side effect has succeeded
        kafkaConsumer.commit(message.offset)
Choice
Durable, partitioned Kafka topic instead of in-memory queue
Rationale
Kafka provides three critical properties: (1) durability — URLs survive process crashes via disk-backed logs, (2) partitioning — domain-based partitioning enables per-domain politeness, (3) offset replay — crashed workers resume without re-crawling. At 10B URLs with 7-day retention, Kafka is more cost-effective than a database-backed priority queue.
Choice
Probabilistic dedup with 0.1% false positive rate
Rationale
A Bloom filter handles 10B URLs in ~18GB RAM with O(1) lookups. False positives (occasionally skipping a new URL) are acceptable: missing 0.1% of URLs is better than storing 10B URLs in a hash set requiring ~80GB RAM (8 bytes per hashed URL, before container overhead). Redis provides the Bloom filter as a shared data structure visible to all 20 workers.
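To make the trade-off concrete, here is a minimal in-process Bloom filter. It is a sketch and not the Redis module's implementation. It assumes double hashing over a SHA-256 digest to derive the bit positions, and it is sized for 1,000 URLs at roughly 14.4 bits each with 10 hash functions.

```python
import hashlib


class BloomFilter:
    def __init__(self, m_bits: int, k_hashes: int):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd step
        # Double hashing: position_i = h1 + i * h2 (mod m)
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bf = BloomFilter(m_bits=14_378, k_hashes=10)
crawled = [f"https://example.com/page/{i}" for i in range(1_000)]
for url in crawled:
    bf.add(url)

unseen = [f"https://example.org/other/{i}" for i in range(10_000)]
false_positives = sum(1 for url in unseen if url in bf)
```

A URL that was added is always reported as present, so the crawler never re-fetches by mistake. The only error mode is the occasional unseen URL reported as seen, which is the skipped-page cost discussed above.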
Choice
Partition by domain so all URLs for a domain go to the same worker
Rationale
Per-domain partitioning ensures a single worker handles all URLs for a given domain, enabling per-domain rate limiting (1 req/sec/domain). Without this, 20 workers could independently crawl the same domain at 20 req/sec — an unintentional DDoS that would get the crawler IP-blocked.
Choice
Object storage for full page content, separate from metadata
Rationale
At 400 pages/sec x 50KB average = ~20MB/sec = ~50TB/month. S3 provides virtually unlimited storage at low cost and is designed for eleven nines (99.999999999%) of durability. Storing HTML in PostgreSQL would cause TOAST overhead, vacuum costs, and table bloat. Separating content (S3) from metadata (PostgreSQL) enables independent scaling.
Choice
20 parallel Fargate workers with 32 Kafka partitions
Rationale
Each worker processes ~20 pages/sec (limited by per-domain politeness, not CPU). 20 workers x 20 pages/sec = 400 pages/sec. 32 Kafka partitions allow up to 32 workers. The 20/32 ratio leaves headroom for scaling without repartitioning. Workers are stateless — adding more is a simple scaling operation.
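The 20-worker, 32-partition ratio has one consequence worth seeing in numbers: partitions cannot be split, so they spread unevenly. The sketch below assumes a simple round-robin assignment. Kafka's real assignors differ in detail, and `assign_partitions` is an illustrative name.

```python
from collections import Counter
from typing import Dict, List


def assign_partitions(num_partitions: int, num_workers: int) -> Dict[int, List[int]]:
    """Round-robin partitions over workers; each partition has one owner."""
    assignment: Dict[int, List[int]] = {w: [] for w in range(num_workers)}
    for p in range(num_partitions):
        assignment[p % num_workers].append(p)
    return assignment


current = assign_partitions(32, 20)
sizes = Counter(len(parts) for parts in current.values())

scaled = assign_partitions(32, 40)
idle_workers = sum(1 for parts in scaled.values() if not parts)
```

With 20 workers, twelve own two partitions and eight own one, so per-worker load is not uniform. Beyond 32 workers, the extras sit idle until the topic is repartitioned.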
Target RPS: 400 pages/sec sustained
Latency (p99): ~50ms per page (parallel workers)
Storage: ~50 TB/month (S3 HTML + PostgreSQL metadata)
Availability: 99.9% (Kafka + Redis replicated)
Stores crawl metadata for every fetched page. Primary key is url_hash (SHA-256 of the URL), ensuring each URL appears exactly once. Indexed by domain + fetched_at for freshness-based re-crawl scheduling. Write-heavy at ~400 INSERTs/sec. Raw HTML is stored separately in S3.
Indexes: pk_pages ON (url_hash), idx_pages_domain_fetched ON (domain, fetched_at DESC)
At 400 pages/sec, grows ~1B rows/month. Raw HTML in S3, not in this table. Domain + fetched_at index supports re-crawl scheduling queries.
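The table and its index can be sketched as follows. SQLite is used as an in-memory stand-in for PostgreSQL so that the example runs anywhere. The column names beyond `url_hash`, `domain`, and `fetched_at` are assumptions, and the `ON CONFLICT ... DO NOTHING` clause needs SQLite 3.24 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pages (
    url_hash       TEXT PRIMARY KEY,          -- SHA-256 of the URL
    url            TEXT NOT NULL,
    domain         TEXT NOT NULL,
    title          TEXT,
    status_code    INTEGER,
    outbound_links INTEGER,
    fetched_at     TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_pages_domain_fetched ON pages (domain, fetched_at DESC);
""")


def record_page(conn, url_hash, url, domain, title, status_code, outbound_links):
    # Idempotent write: a replayed URL hits the primary key and is ignored.
    conn.execute(
        "INSERT INTO pages (url_hash, url, domain, title, status_code, outbound_links) "
        "VALUES (?, ?, ?, ?, ?, ?) ON CONFLICT (url_hash) DO NOTHING",
        (url_hash, url, domain, title, status_code, outbound_links),
    )


record_page(conn, "abc123", "https://example.com/", "example.com", "Home", 200, 12)
record_page(conn, "abc123", "https://example.com/", "example.com", "Home", 200, 12)
row_count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]

rows_per_30_days = 400 * 86_400 * 30
```

Writing the same page twice leaves a single row, which keeps the primary key guarantee intact when a worker restarts mid-batch.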
Distributed URL frontier queue. Partitioned by domain across 32 partitions for per-domain politeness enforcement. Both seed URLs and extracted outbound links are published here, forming the recursive crawl loop. Workers consume from assigned partitions and commit offsets after successful processing.
Key Schema
domain: string (partition key for per-domain ordering)
Value Schema
{ url: string, domain: string, depth: integer, priority: float (optional) }
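The key and value schemas map naturally onto a small record type. JSON encoding is an assumption, since the wire format is not specified, and `FrontierMessage` and `child_message` are illustrative names.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class FrontierMessage:
    url: str
    domain: str
    depth: int
    priority: Optional[float] = None

    def key(self) -> bytes:
        # Partition key: the domain, so one domain maps to one partition.
        return self.domain.encode("utf-8")

    def value(self) -> bytes:
        # Omit the optional priority when it is not set.
        payload = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(payload, sort_keys=True).encode("utf-8")


def child_message(parent: FrontierMessage, link: str) -> FrontierMessage:
    """Build the message for a link extracted from the parent's page."""
    return FrontierMessage(
        url=link,
        domain=urlsplit(link).hostname or "",
        depth=parent.depth + 1,
    )


seed = FrontierMessage(url="https://example.com/", domain="example.com", depth=0)
child = child_message(seed, "https://news.example.org/a")
```

Carrying `depth` in every message lets a worker enforce a maximum crawl depth without consulting any shared state.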
| Variant | Tier | Latency | Throughput | Cost | Complexity | Reliability |
|---|---|---|---|---|---|---|
| Naive (Single-Threaded BFS) | T1 | 200-2000ms per page | ~5 pages/sec | $50/month (tiny RDS + 1 Fargate task) | Low — 3 components, sequential | 95% (no redundancy, in-memory queue) |
| Producer-Consumer (Kafka Frontier) | T2 | 50ms per page (parallel) | 400 pages/sec | $2,000/month (Kafka + Redis + S3 + workers) | Medium — Kafka, Bloom filter, workers | 99.9% (Kafka offset replay, replicated) |
| Distributed (Bloom + Priority Frontier) | T3 | 50ms per page (parallel) | 1000 pages/sec | $5,000/month (priority Kafka + dual Redis + S3) | High — priority frontier, SimHash, DNS cache | 99.9% (crash-resilient, checkpointed) |
This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.
Kafka provides three features critical for web crawling that SQS and RabbitMQ lack: (1) domain-based partitioning for per-domain politeness enforcement, (2) offset-based replay for crash recovery without re-crawling, and (3) log-based retention that allows the URL frontier to hold millions of URLs durably without consuming them. SQS deletes messages on consumption; Kafka retains them until the retention window expires.
At 0.1% false positive rate with 10B URLs, approximately 10M URLs are incorrectly marked as 'already seen' and skipped. For general-purpose search crawling, this is acceptable — the crawler will eventually discover these URLs via other paths. For exhaustive archival crawling (e.g., Internet Archive), the false positive rate is too high and must be supplemented with exact-match dedup using a database or hash table.
Kafka tracks the committed consumption offset for each partition. If a worker crashes before committing its offset, the URL is replayed to another worker in the same consumer group. The page may be fetched twice (once by the crashed worker, once by the replacement), but the Bloom filter prevents re-processing on subsequent encounters. The worst case is a redundant fetch plus repeated writes: the S3 object is overwritten under the same URL-hash key, and the PostgreSQL insert hits the existing url_hash primary key, so that write must be idempotent. Both outcomes are acceptable.
This variant deduplicates URLs but not content. The same article syndicated across 100 domains has 100 different URLs, so the Bloom filter treats them as 100 unique pages. Each copy is fetched, stored in S3, and indexed separately. The Distributed variant adds SimHash-based content dedup that detects near-duplicate content across different URLs, saving ~30% in storage costs.
All URLs for a given domain are hashed to the same Kafka partition, which is consumed by the same worker. The worker maintains a per-domain timestamp tracking the last request time. Before fetching a URL, it checks if 1 second has elapsed since the last request to that domain. If not, it delays. This ensures the crawler never exceeds 1 req/sec to any single domain, regardless of how many URLs are queued for that domain.
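The last-request-timestamp approach described above can be sketched as a small class. The clock and sleep functions are injectable so that the behaviour can be checked without real waiting. The class and method names are illustrative, and the sketch assumes a single worker owns each domain.

```python
import time


class DomainRateLimiter:
    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last_request = {}  # domain -> time of the last permitted request

    def wait_for_domain(self, domain: str) -> float:
        """Block until the domain may be hit again; return the delay applied."""
        now = self.clock()
        last = self.last_request.get(domain)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self.sleep(delay)
        self.last_request[domain] = now + delay
        return delay


# Fake clock so the demonstration runs instantly.
fake_now = [100.0]
def fake_sleep(seconds):
    fake_now[0] += seconds

limiter = DomainRateLimiter(clock=lambda: fake_now[0], sleep=fake_sleep)
first = limiter.wait_for_domain("a.example")    # no prior request
fake_now[0] += 0.25                             # 250 ms later
second = limiter.wait_for_domain("a.example")   # must wait out the remainder
other = limiter.wait_for_domain("b.example")    # different domain, no wait
```

The dictionary grows with the number of distinct domains seen, so a long-running worker would likely need to evict entries older than the interval.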