The simplest web crawler: a single service fetches URLs one at a time from an in-memory FIFO queue, extracts links, and stores metadata in PostgreSQL. No dedup, no parallelism, no politeness. Demonstrates why sequential crawling caps at ~5 pages/sec.
Designing a web crawler is a classic system design interview question because it forces candidates to reason about recursion at internet scale, politeness constraints, deduplication, and crash recovery. The naive single-threaded BFS approach is where every candidate should start — it establishes the baseline that makes the improvements in distributed architectures measurable and concrete.
The core challenge is fetching pages from the public internet, extracting outbound links, and storing page content for later indexing. In the naive approach, this is done sequentially: a single CrawlService maintains an in-memory FIFO queue of URLs. It dequeues the next URL, issues an HTTP GET to fetch the page (200-2000ms round-trip depending on the server), parses the HTML to extract all outbound links, appends those links to the back of the queue, and writes page metadata (URL, title, HTTP status code, timestamp) to a PostgreSQL database. Then it moves to the next URL.
The sequential fetch loop is the fatal bottleneck. Each page fetch blocks the entire crawl for 200-2000ms. At an average of 500ms per page, the crawler processes approximately 2 pages per second. Even with optimistic 200ms fetches, throughput caps at 5 pages/sec. To crawl 1 billion pages at 5 pages/sec would take 6.3 years. The Producer-Consumer variant solves this by running 20 parallel workers consuming from a Kafka frontier, reaching 400 pages/sec — an 80x improvement.
The second critical flaw is the lack of URL deduplication. Real websites contain abundant link cycles: page A links to page B, which links to page C, which links back to page A. Without dedup, the crawler re-enqueues URLs it has already fetched, causing the queue to grow unboundedly and the crawler to revisit pages endlessly. The queue eventually consumes all available memory and the process crashes. The Producer-Consumer variant adds a Redis Bloom filter that checks each URL before enqueuing, achieving 60% dedup rate (60% of extracted links are already-seen URLs).
The third flaw is the in-memory queue itself. If the CrawlService process crashes — out-of-memory, network partition, deployment — the entire queue is lost. Every URL discovered during the crawl but not yet fetched is gone. The crawl must restart from the seed URLs with no record of progress. The Producer-Consumer variant uses Kafka's durable log with offset-based resumption: crashed workers pick up exactly where they left off.
This template makes these limitations visible and quantifiable. Run the simulation at different scales and watch the single-threaded throughput ceiling, the unbounded queue growth from link cycles, and the catastrophic state loss on crash. The comparison with the Producer-Consumer and Distributed variants provides the concrete numbers that interviewers expect.
The naive web crawler is a three-component linear architecture: SeedClient, CrawlService, and CrawlDatabase (PostgreSQL). There is no cache, no event stream, no worker pool, no deduplication layer, and no content storage beyond metadata.
All traffic enters through the SeedClient, which injects initial seed URLs via POST /api/v1/seeds. The SeedClient is the only external entry point — all subsequent URLs are discovered internally through link extraction during the crawl loop. Seed injection is low-volume (approximately 5 RPS) and is typically a one-time operation to kick off the crawl.
The CrawlService is a single-instance stateless REST API running on 1 pod with 10 threads on AWS ECS Fargate (1 vCPU, 2 GB RAM). It maintains an in-memory FIFO queue and runs a sequential BFS crawl loop. For each iteration: (1) dequeue the next URL from the in-memory queue, (2) issue an HTTP GET to fetch the page (200-2000ms round-trip), (3) parse the HTML using a simple DOM parser to extract all anchor tags with href attributes, (4) append all extracted URLs to the back of the queue without any dedup check, (5) write page metadata to CrawlDatabase via a synchronous INSERT. The service also exposes GET /api/v1/stats for crawl progress monitoring.
CrawlDatabase is a single Amazon RDS PostgreSQL instance (db.t4g.medium, 2 vCPU, 4 GB) with one table: pages. The table stores URL, domain, title, HTTP status code, and fetch timestamp. There is no unique constraint on the URL column — the same URL can appear multiple times if re-crawled via link cycles. At approximately 5 INSERTs/sec, the database is dramatically under-utilized. A single small instance handles the load with less than 1% CPU utilization.
The system has no redundancy at any layer. A single CrawlService instance means a crash halts the entire crawl. A single PostgreSQL primary means a database failure loses all crawl metadata. There is no cache, no dedup, no content storage, no politeness enforcement, and no priority scheduling. The architecture is intentionally minimal to demonstrate the baseline that motivates the Producer-Consumer and Distributed variants.
The concrete scaling ceiling is approximately 5 pages per second. At this rate, crawling 100,000 pages takes approximately 5.5 hours. Crawling 1 million pages takes 2.3 days. The in-memory queue grows at approximately 10 URLs per page (average outbound links) minus 1 URL consumed per iteration, so the queue grows by approximately 9 URLs per page crawled. After 100,000 pages, the queue contains approximately 900,000 URLs consuming roughly 100 MB of memory — manageable, but without dedup, many of those URLs are duplicates that will cause re-crawling.
This sequence diagram shows the two primary flows: seed URL injection (one-time setup) and the BFS crawl loop (repeated for every URL). The critical insight is the sequential nature — each page fetch blocks the entire crawl for 200-2000ms. The lack of dedup means extracted links are blindly re-enqueued, causing unbounded queue growth on websites with link cycles.
Step-by-Step Walkthrough
Pseudocode
// Seed injection — append to in-memory queue
async function seedUrls(urls: string[]):
for url in urls:
queue.push(url) // No dedup, no validation
return { status: 200, queued: urls.length }
// BFS crawl loop — sequential, blocking
async function crawlLoop():
while queue.length > 0:
url = queue.shift() // Dequeue front (FIFO = BFS)
// Blocking HTTP fetch — 200-2000ms
response = await httpGet(url) // THE BOTTLENECK
// Parse HTML, extract links
links = parseHtml(response.body)
.querySelectorAll("a[href]")
.map(a => a.href)
// Blindly enqueue all links — no dedup!
for link in links:
queue.push(link) // Duplicates re-enqueued
// Write metadata to PostgreSQL
await db.execute(
"INSERT INTO pages (url, domain, title, status_code, fetched_at)
VALUES ($1, $2, $3, $4, NOW())",
[url, extractDomain(url), response.title, response.status]
) // ~15ms
// Total per page: 200-2000ms fetch + ~5ms parse + ~15ms DB
// = ~220-2020ms per page = ~0.5-5 pages/secChoice
Sequential fetch — one URL at a time from an in-memory FIFO queue
Rationale
Single-threaded BFS is the simplest possible crawl strategy: no concurrency, no synchronization, no job distribution. The FIFO queue naturally implements breadth-first ordering. The cost is throughput — each fetch blocks the loop for 200-2000ms, capping the crawler at ~5 pages/sec. The Producer-Consumer variant uses 20 parallel workers to reach 400 pages/sec.
Choice
No Bloom filter, no hash set, no dedup of any kind
Rationale
Deduplication requires a data structure that scales with the number of seen URLs. A hash set of 10B URLs needs ~80GB RAM. A Bloom filter needs ~10GB. The naive approach skips dedup entirely, meaning link cycles cause unbounded queue growth and endless re-crawling. Adding a Redis Bloom filter is the first optimization candidates should propose.
Choice
URLs stored in a process-local FIFO queue — no persistence
Rationale
An in-memory queue is the simplest implementation: a list with append and pop-front. No external dependencies, no serialization, no network calls. The cost is durability — a crash loses the entire queue. The Producer-Consumer variant replaces this with Kafka, providing durable offset-based resumption.
Choice
One table, one instance, metadata only
Rationale
PostgreSQL stores crawl metadata (URL, title, status, timestamp) for progress monitoring. At ~5 writes/sec, even the smallest RDS instance is dramatically over-provisioned. Raw HTML content is not stored — only metadata. The Producer-Consumer variant adds S3 for HTML storage and expands the DB schema for richer metadata.
Choice
No robots.txt checking, no per-domain rate limiting
Rationale
Politeness requires robots.txt parsing, per-domain delay tracking, and domain-aware scheduling — significant complexity. The naive crawler skips this, which means it can DDoS a single domain with rapid-fire requests. In production, this would get the crawler IP-blocked within minutes.
Target RPS
~5 pages/sec (sequential ceiling)
Latency (p99)
200-2000ms per page fetch
Storage
~1 GB/month metadata only
Availability
~95% (single instance, no redundancy)
| Operation | Time | Space | Notes |
|---|---|---|---|
| Seed URLs (POST /api/v1/seeds) | O(K) where K = number of seed URLs | O(K) queue growth | Simple list append — no validation, no dedup check. |
| Crawl iteration (GET /api/v1/crawl/process) | O(L) where L = outbound links on the page | O(L) queue growth per iteration | Dominated by HTTP fetch latency (200-2000ms). HTML parsing is O(page_size). Queue grows by ~L-1 per iteration (L links added, 1 URL consumed). |
| Stats query (GET /api/v1/stats) | O(N) full table scan for domain-filtered queries | O(1) | No domain index — COUNT queries scan the entire pages table. |
Stores crawl metadata for every fetched page. No unique constraint on URL — the same URL can appear multiple times due to lack of deduplication. Write rate is ~5 INSERTs/sec at the single-threaded crawl speed. No indexes beyond the primary key, so domain-filtered queries require full table scans.
Indexes: pk_pages ON (id)
At 5 pages/sec, table grows ~13M rows/month. Without dedup, many rows are duplicates. No domain index means stats queries do full scans.
| Variant | Tier | Latency | Throughput | Cost | Complexity | Reliability |
|---|---|---|---|---|---|---|
| Naive (Single-Threaded BFS) | T1 | 200-2000ms per page | ~5 pages/sec | $50/month (tiny RDS + 1 Fargate task) | Low — 3 components, sequential | 95% (no redundancy, in-memory queue) |
| Producer-Consumer (Kafka Frontier) | T2 | 50ms per page (parallel) | 400 pages/sec | $2,000/month (Kafka + Redis + S3 + workers) | Medium — Kafka, Bloom filter, workers | 99.9% (Kafka offset replay, replicated) |
| Distributed (Bloom + Priority Frontier) | T3 | 50ms per page (parallel) | 1000 pages/sec | $5,000/month (priority Kafka + dual Redis + S3) | High — priority frontier, SimHash, DNS cache | 99.9% (crash-resilient, checkpointed) |
This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.
The web crawler combines recursion at scale, politeness constraints, deduplication (URL and content), priority scheduling, crash recovery, and horizontal scaling in a single problem. It forces candidates to think about BFS/DFS traversal over a graph with cycles, rate limiting per domain, Bloom filters for dedup, Kafka for durability, and priority queues for freshness. Companies like Google, Bing, DuckDuckGo, and Common Crawl ask it because it maps directly to their search infrastructure.
The single-threaded fetch loop blocks on each HTTP GET for 200-2000ms (network round-trip + server response time). At an average of 500ms per page, the maximum throughput is 2 pages/sec. With optimistic 200ms responses, it reaches ~5 pages/sec. The bottleneck is the sequential nature — while waiting for one page, no other URLs are being fetched. The Producer-Consumer variant runs 20 workers in parallel, reaching 400 pages/sec.
Without URL deduplication, the crawler re-enqueues URLs it has already fetched. A simple cycle like A to B to C to A causes all three URLs to be re-added to the queue on every pass. The queue grows unboundedly — after a few thousand pages on a typical website, the queue contains hundreds of thousands of duplicate entries. Eventually, the process runs out of memory and crashes, losing all progress.
Add URL deduplication via a hash set or Bloom filter. Before enqueuing an extracted URL, check if it has been seen before. A hash set provides exact dedup but requires O(N) memory. A Bloom filter provides probabilistic dedup with O(1) lookups in ~10GB for 10B URLs. This eliminates link cycles and bounds queue growth. The second optimization is parallelism via a producer-consumer pattern with Kafka.
The in-memory queue is lost on any process restart — crash, deployment, out-of-memory kill. After crawling 100,000 pages and discovering 900,000 pending URLs, a crash means all 900,000 URLs are gone. The crawl must restart from seed URLs with no record of which pages were already fetched (the database has metadata, but rebuilding the queue from it requires scanning all pages and re-extracting links). Kafka solves this with durable offset-based consumption.
Sign in to join the discussion.
Ready to design your own Web Crawler?
Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.
Open Simulator