Medium7 componentsInterview: High

Web Crawler — Distributed URL Frontier

Q: How does a web crawler respect robots.txt and per-domain rate limits?

Before crawling any domain, the crawler fetches and caches the domain's robots.txt file, which specifies which paths are disallowed for crawling. The DedupCache (Redis) stores parsed robots.txt rules per domain with a one-hour TTL. For rate limiting, the Kafka URL frontier is partitioned by domain, ensuring all URLs for a given domain are processed by the same worker. The worker enforces a configurable delay between requests to the same domain (typically one second), preventing the crawler from overwhelming any single site.

Q: What is a Bloom filter and why is it used for URL deduplication?

A Bloom filter is a probabilistic data structure that can test whether an element is a member of a set. It uses multiple hash functions to map each element to positions in a bit array. A lookup returns either 'definitely not in set' (no false negatives) or 'probably in set' (possible false positives). For web crawling, this means a Bloom filter will never cause the crawler to re-fetch a URL it has already seen, but it may occasionally skip a new URL — a 0.1% false positive rate at 10 billion URLs means roughly 10 million pages missed. The key advantage is memory efficiency: 10 billion URLs fit in approximately 10GB of RAM versus hundreds of gigabytes for a full hash set.

Q: How does the crawler recover from worker failures without losing URLs?

Kafka's offset-based consumption model provides automatic failure recovery. Each worker tracks its position in the Kafka partition via committed offsets. When a worker processes a URL and completes all write operations (S3, PostgreSQL, DedupCache), it commits the offset. If a worker crashes before committing, the offset remains uncommitted and Kafka re-delivers the URL to another worker. This guarantees at-least-once processing: no URLs are lost, though some may be processed twice. The DedupCache prevents duplicate storage for re-processed URLs.

Q: How would you prioritize important pages for crawling?

URL prioritization can be implemented by adding a priority score to each message in the Kafka frontier. Priority is computed based on several signals: domain authority (PageRank of the domain), page depth from seed (shallower pages are usually more important), URL structure (root pages over deep paths), and freshness requirements (news sites need more frequent re-crawling than static documentation). Workers can consume from multiple Kafka topics ordered by priority tier, processing all high-priority URLs before moving to lower tiers. This ensures that the most valuable pages are crawled first within the available throughput budget.

Q: How does the crawler handle duplicate content across different URLs?

URL deduplication (via Bloom filter) catches identical URLs but not identical content served from different URLs — the same article mirrored across 100 domains is stored 100 times. Content deduplication requires computing a fingerprint (SimHash or MinHash) of each page's content after fetching and comparing it against a database of known fingerprints. If a near-duplicate is detected, the crawler can skip storage and record a pointer to the canonical version. This adds computational cost per page and a second Redis lookup, but can reduce storage by 30-40% for large crawls that cover many content-syndication networks.

Design a distributed web crawler that fetches one billion pages per month using a Kafka-based URL frontier, Redis Bloom filter deduplication, and per-domain politeness enforcement.

DistributedKafkaDeduplicationStorage

Try in Simulator

Problem Statement

The web crawler is a classic system design interview question favored by companies like Google, Amazon, and LinkedIn because it tests a candidate's understanding of distributed task processing, data deduplication at massive scale, and polite interaction with external systems. The core problem — systematically fetching web pages, extracting links, and storing content — sounds simple, but the constraints at internet scale introduce challenges in queue management, duplicate detection, failure recovery, and storage that require careful architectural reasoning.

At production scale, a web crawler like Googlebot must fetch billions of pages per month while respecting the politeness conventions of the web. The robots.txt protocol and per-domain rate limiting (typically one request per second per domain) mean that raw throughput is constrained not by compute or bandwidth but by the need to distribute crawl pressure across millions of domains. A crawler that hammers a single domain with hundreds of concurrent requests will get blocked, damage the target site's performance, and violate widely accepted crawling norms.

Deduplication is another major challenge. As the crawler follows links across pages, it encounters the same URLs repeatedly — approximately 60% of extracted links point to already-crawled pages. Without efficient deduplication, the crawler wastes resources re-fetching known content. At a corpus of 10 billion URLs, a naive hash set would consume hundreds of gigabytes of memory. Probabilistic data structures like Bloom filters reduce this to roughly 10GB with a controlled false positive rate, trading a tiny fraction of missed pages for dramatic memory savings.

This template models the complete crawler architecture: a CrawlService for seed URL injection and admin operations, a Kafka-based URL frontier partitioned by domain for politeness enforcement, a pool of CrawlWorker pods that fetch pages and extract links, a Redis Bloom filter for URL deduplication, PostgreSQL for crawl metadata and scheduling, and S3 for raw HTML content storage. The simulation demonstrates how worker count affects crawl throughput, how Bloom filter false positive rates impact coverage, and how Kafka's partitioned architecture naturally enforces per-domain rate limiting.

Architecture Overview

The web crawler architecture follows a producer-consumer pattern with Kafka as the central URL frontier. The system operates as a recursive loop: seed URLs enter the frontier, workers consume and fetch them, extracted links are published back to the frontier for the next crawl round, and the cycle continues until the frontier is exhausted or a depth limit is reached.

The crawl begins when an administrator injects seed URLs via the CrawlService API. CrawlService validates the URLs, checks robots.txt compliance, and publishes them to the UrlFrontier (Amazon MSK / Kafka), partitioned by domain. Domain-based partitioning is a deliberate architectural choice: all URLs for a given domain land on the same Kafka partition, which means a single CrawlWorker (or worker group) handles that domain's queue. This naturally enforces per-domain politeness — the worker processes one URL at a time from that domain's partition with a configurable delay between requests.

CrawlWorker pods consume URLs from Kafka and execute a multi-step processing pipeline for each URL. First, the worker checks the DedupCache (Redis Bloom filter) to determine whether the URL has already been crawled. If the Bloom filter returns positive, the URL is skipped. If negative, the worker fetches the page via HTTP, respecting per-domain rate limits. After fetching, the worker parses the HTML to extract outbound links and page metadata (title, status code, content hash).

The worker then performs three write operations: raw HTML content is stored in ContentStore (S3) keyed by URL hash, page metadata is written to CrawlDatabase (PostgreSQL) for admin queries and freshness-based re-crawl scheduling, and the URL is marked as seen in the DedupCache. Finally, extracted outbound links are published back to the UrlFrontier, completing the recursive loop. Each link carries a depth counter that is incremented from its parent page, enabling the system to enforce a maximum crawl depth.

The architecture is naturally horizontally scalable. Adding more CrawlWorker pods and Kafka partitions increases parallelism proportionally. Kafka's offset-based consumption model provides crash recovery: if a worker dies mid-crawl, its uncommitted offsets are replayed by another worker, ensuring no URLs are lost. The 20-worker configuration sustains approximately 400 pages per second (each worker handles roughly 20 pages per second due to politeness delays), yielding one billion pages per month at steady-state throughput.

Architecture Preview

Loading architecture preview...

Open in Simulator

Key Design Decisions

URL Frontier Implementation

Choice

Kafka with domain-based partitioning for durable, ordered URL queuing

Rationale

Kafka provides durable, partitioned message storage with offset-based resumption. If a worker crashes mid-crawl, Kafka replays uncommitted offsets so no URLs are lost. Domain-based partitioning ensures all URLs for a given domain are processed by the same worker group, enabling per-domain rate limiting without external coordination. At 10 billion URLs, Kafka's log-based storage is more cost-effective and operationally simpler than a database-backed priority queue.

URL Deduplication Strategy

Choice

Redis Bloom filter with 0.1% false positive rate

Rationale

A Bloom filter stores 10 billion URLs in approximately 10GB of RAM with O(1) lookup time. The 0.1% false positive rate means roughly 10 million pages may be incorrectly skipped per full crawl — an acceptable trade-off for general-purpose web crawling. A traditional hash set storing full URL strings would require hundreds of gigabytes. Redis hosts the Bloom filter as a shared data structure visible to all workers, ensuring consistent deduplication across the entire worker fleet.

Content Storage

Choice

Amazon S3 for raw HTML with URL hash as object key

Rationale

At 400 pages per second with an average raw HTML size of 50KB, the crawler generates approximately 20MB per second or 50TB per month of content. S3 provides virtually unlimited storage at low cost with eleven nines of durability. Storing HTML content in PostgreSQL would be impractical at this volume — TOAST overhead and vacuum costs would degrade database performance severely. S3 lifecycle policies automatically tier old content to Infrequent Access storage for cost optimization.

Crawl Metadata Database

Choice

PostgreSQL with indexes on domain and fetched_at for re-crawl scheduling

Rationale

Crawl metadata (URL, domain, status code, fetch timestamp, outbound link count) requires queryable indexes for two key use cases: the admin API for monitoring crawl progress by domain, and freshness-based re-crawl scheduling that identifies URLs not crawled within a configurable window. PostgreSQL's B-tree indexes on domain and fetched_at enable efficient range queries for both cases. At 400 writes per second, a single PostgreSQL primary with read replicas handles the metadata load comfortably.

Scale & Performance

Target RPS

400 pages/s sustained (1B pages/month)

Latency (p99)

<5s per page (fetch + parse + store)

Storage

~50 TB/month (raw HTML); ~10 GB Bloom filter; ~200 GB/year metadata

Availability

99.9%

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions

How does a web crawler respect robots.txt and per-domain rate limits?

Before crawling any domain, the crawler fetches and caches the domain's robots.txt file, which specifies which paths are disallowed for crawling. The DedupCache (Redis) stores parsed robots.txt rules per domain with a one-hour TTL. For rate limiting, the Kafka URL frontier is partitioned by domain, ensuring all URLs for a given domain are processed by the same worker. The worker enforces a configurable delay between requests to the same domain (typically one second), preventing the crawler from overwhelming any single site.

What is a Bloom filter and why is it used for URL deduplication?

A Bloom filter is a probabilistic data structure that can test whether an element is a member of a set. It uses multiple hash functions to map each element to positions in a bit array. A lookup returns either 'definitely not in set' (no false negatives) or 'probably in set' (possible false positives). For web crawling, this means a Bloom filter will never cause the crawler to re-fetch a URL it has already seen, but it may occasionally skip a new URL — a 0.1% false positive rate at 10 billion URLs means roughly 10 million pages missed. The key advantage is memory efficiency: 10 billion URLs fit in approximately 10GB of RAM versus hundreds of gigabytes for a full hash set.

How does the crawler recover from worker failures without losing URLs?

Kafka's offset-based consumption model provides automatic failure recovery. Each worker tracks its position in the Kafka partition via committed offsets. When a worker processes a URL and completes all write operations (S3, PostgreSQL, DedupCache), it commits the offset. If a worker crashes before committing, the offset remains uncommitted and Kafka re-delivers the URL to another worker. This guarantees at-least-once processing: no URLs are lost, though some may be processed twice. The DedupCache prevents duplicate storage for re-processed URLs.

How would you prioritize important pages for crawling?

URL prioritization can be implemented by adding a priority score to each message in the Kafka frontier. Priority is computed based on several signals: domain authority (PageRank of the domain), page depth from seed (shallower pages are usually more important), URL structure (root pages over deep paths), and freshness requirements (news sites need more frequent re-crawling than static documentation). Workers can consume from multiple Kafka topics ordered by priority tier, processing all high-priority URLs before moving to lower tiers. This ensures that the most valuable pages are crawled first within the available throughput budget.

How does the crawler handle duplicate content across different URLs?

URL deduplication (via Bloom filter) catches identical URLs but not identical content served from different URLs — the same article mirrored across 100 domains is stored 100 times. Content deduplication requires computing a fingerprint (SimHash or MinHash) of each page's content after fetching and comparing it against a database of known fingerprints. If a near-duplicate is detected, the crawler can skip storage and record a pointer to the canonical version. This adds computational cost per page and a second Redis lookup, but can reduce storage by 30-40% for large crawls that cover many content-syndication networks.

Related Templates

Search Autocomplete Distributed Filesystem Logging Pipeline

Discussion

Ready to design your own Web Crawler?

Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.

Open Simulator