Vetora logo
Medium9 componentsInterview: High

Web Crawler — Distributed (Bloom + Priority Frontier)

Production-grade distributed crawler with priority-based URL frontier, URL dedup via Bloom filter, content dedup via SimHash fingerprints, DNS caching, and dedicated link extraction with scoring. Achieves 1000 pages/sec with ~30% content dedup savings.

StorageDistributedSimHashPriority QueueDNS CacheProduction-Grade
Problem Statement

The distributed web crawler with priority frontier and content deduplication is the production-grade architecture used by large-scale search engines crawling billions of pages. It builds on the Producer-Consumer variant by solving three additional problems: lack of priority scheduling (all URLs treated equally), lack of content deduplication (same article stored 100 times across different domains), and DNS overhead (50-100ms per page for domain resolution).

The most impactful addition is the priority-based URL frontier. In the Producer-Consumer variant, URLs are consumed in FIFO order within each Kafka partition — a new, unimportant page from a spam blog is treated with the same urgency as a stale page from a high-authority news site. The priority frontier assigns each URL a score based on domain authority, page freshness (time since last crawl), and link context (anchor text relevance, position on the source page). Workers consume higher-priority URLs first within each partition, ensuring the most valuable pages are always crawled first.

Content deduplication via SimHash addresses a major storage inefficiency. The Producer-Consumer variant deduplicates URLs but not content. The same news article syndicated to 100 aggregator sites has 100 different URLs and is stored 100 times in S3. SimHash generates a 64-bit fingerprint per page; two pages with Hamming distance less than or equal to 3 are classified as near-duplicates. Workers compute SimHash after fetching a page and check the dedup cache before storing. At scale, approximately 30% of unique-URL pages are near-duplicates of already-stored content, saving approximately 15TB/month in storage costs.

DNS caching eliminates a hidden per-page overhead. Every HTTP fetch requires resolving the domain to an IP address. DNS resolution takes 50-100ms on average — at 1000 pages/sec, that is 50-100 seconds of aggregate DNS latency per second. Caching DNS results in Redis with 1-hour TTL eliminates redundant lookups: at steady state, approximately 95% of domains have cached resolutions. The DNS cache also provides a single point for blocklisting known spam and malware domains.

The dedicated LinkExtractor service separates link parsing and scoring from page fetching. In the Producer-Consumer variant, workers parse HTML and publish extracted links inline — this ties CPU-intensive link scoring to the I/O-bound fetch loop. The dedicated LinkExtractor pool receives raw HTML from workers, parses it, scores each outbound link by domain authority and contextual relevance, and publishes scored links back to the priority frontier. This decoupling allows workers to focus on fetching and allows link scoring to scale independently.

This architecture represents the senior-level answer in system design interviews. Candidates who can articulate the priority scheduling trade-offs, explain SimHash content dedup mechanics, reason about DNS cache staleness, and discuss the link scoring pipeline demonstrate the depth of understanding expected for staff-level positions at search companies.

Architecture Overview

The distributed web crawler uses nine components organized into a priority-aware crawl pipeline: SeedClient, CrawlService, UrlFrontier (Kafka with priority), CrawlWorker pool (50 workers), LinkExtractor (10 workers), DedupCache (Redis Bloom filter + SimHash), DnsCache (Redis), CrawlDatabase (PostgreSQL), and ContentStore (S3).

Seed URLs enter through SeedClient via CrawlService, which validates URLs, checks robots.txt compliance, assigns initial priority scores based on domain authority lookup, and publishes to the UrlFrontier. CrawlService runs on 3 ECS Fargate pods for high availability and also serves admin stats including content dedup metrics, Bloom filter size, and DNS cache hit rates.

The UrlFrontier is Amazon MSK (Managed Kafka) with 64 partitions, partitioned by domain with priority headers. Each message includes URL, domain, crawl depth, priority score (0-1), and source URL. Workers consume from assigned partitions using a priority-aware consumer that processes higher-priority URLs first within each partition. The priority score combines domain authority (0.4 weight), freshness decay (0.3 weight), and link context score from LinkExtractor (0.3 weight).

The CrawlWorker pool consists of 50 ECS Fargate workers — 2.5x the Producer-Consumer variant — each processing approximately 20 pages/sec. For each URL, the worker follows a multi-step pipeline: (1) check DedupCache Bloom filter for URL dedup, (2) resolve domain via DnsCache, (3) fetch page with per-domain politeness delay, (4) compute 64-bit SimHash fingerprint, (5) check DedupCache for content near-duplicates (Hamming distance <= 3), (6) store unique HTML in ContentStore (S3), (7) write metadata to CrawlDatabase, (8) mark URL as seen and store SimHash in DedupCache, (9) pass HTML to LinkExtractor for link scoring.

The DedupCache is a 5-node Amazon ElastiCache Redis cluster with 52 GB memory serving dual purposes: URL Bloom filter (10B URLs, ~10GB, 0.1% false positive rate) and SimHash fingerprint store (10B fingerprints, ~80GB with metadata). Workers check both before storing content. The URL Bloom filter has no TTL (URLs never expire); SimHash entries also persist permanently.

The DnsCache is a 2-node ElastiCache Redis cluster with 13 GB memory caching domain-to-IP mappings with 1-hour TTL. At steady state, approximately 95% of domains are cached. Cache misses trigger actual DNS resolution (50-100ms) and backfill the cache. The single-flight pattern prevents multiple workers from simultaneously resolving the same domain. DNS entries are invalidated on repeated HTTP connection errors.

CrawlDatabase is Amazon RDS PostgreSQL with 16 partitions and 3 replicas. Two tables: pages (url_hash, url, domain, title, status_code, fetched_at, priority_score, simhash, content_dedup flag) and domains (domain, robots_txt, authority_score, last_crawled, page_count). Partitioned by crawl date for efficient freshness queries. At 1000 INSERTs/sec, the primary is moderately loaded.

ContentStore is Amazon S3 storing unique (non-duplicate) HTML. After SimHash content dedup filters out approximately 30% of pages as near-duplicates, the net storage rate is approximately 600 unique pages/sec at 50KB average = 30MB/sec = approximately 75TB/month. S3 lifecycle policy transitions to Infrequent Access at 30 days and Glacier at 90 days.

Architecture Preview
Loading architecture preview...
Request Flow — Priority Crawl with Content Dedup

This sequence diagram shows the full distributed crawl pipeline: priority-based URL consumption, dual dedup (URL Bloom filter + content SimHash), DNS caching, content storage, and scored link extraction. The critical additions over the Producer-Consumer variant are the priority ordering, SimHash content dedup (saving ~30% storage), and DNS cache (saving 50-100ms per page).

Loading diagram...

Step-by-Step Walkthrough

  1. 1Worker consumes highest-priority URL from its Kafka partition
  2. 2Check Bloom filter for URL dedup — skip if already seen (~60% rejection rate)
  3. 3Resolve domain via DNS cache — 95% hit rate saves 50-100ms per page
  4. 4Fetch page via HTTP with per-domain politeness delay (1 req/sec/domain)
  5. 5Compute 64-bit SimHash fingerprint of page content
  6. 6Check SimHash cache for content near-duplicates (Hamming distance <= 3)
  7. 7If content is unique: store HTML in S3, write metadata with content_dedup=false
  8. 8If content is near-duplicate: skip S3 storage, write metadata with content_dedup=true
  9. 9Mark URL as seen in Bloom filter and store SimHash fingerprint
  10. 10Send HTML to LinkExtractor for parsing and scoring
  11. 11LinkExtractor scores each outbound link by domain authority and context relevance
  12. 12LinkExtractor publishes scored links to priority Kafka frontier

Pseudocode

// Worker crawl loop with priority + content dedup + DNS cache
async function workerLoop(kafkaConsumer, partition):
    for message in kafkaConsumer.consumeByPriority(partition):
        url = message.value.url
        domain = message.value.domain
        url_hash = sha256(url)

        // 1. URL dedup via Bloom filter
        if await redis.bf_exists("url_bloom", url_hash):
            continue

        // 2. DNS resolution via cache
        ip = await dnsCache.get("dns:" + domain)
        if not ip:
            ip = await dnsResolve(domain)  // ~60ms
            await dnsCache.set("dns:" + domain, ip, ttl=3600)

        // 3. Fetch page with politeness
        await rateLimiter.waitForDomain(domain)
        response = await httpGet(url, ip)  // ~200ms

        // 4. Content dedup via SimHash
        fingerprint = computeSimHash(response.body)
        existingHash = await redis.get("simhash:" + fingerprint)
        isContentDuplicate = existingHash and
            hammingDistance(fingerprint, existingHash) <= 3

        // 5. Store unique content
        if not isContentDuplicate:
            await s3.putObject(url_hash, response.body)
            await redis.set("simhash:" + fingerprint, url_hash)

        // 6. Write metadata
        await db.execute(
            "INSERT INTO pages VALUES ($1,$2,$3,$4,$5,NOW(),$6,$7,$8)",
            [url_hash, url, domain, response.title,
             response.status, message.value.priority,
             fingerprint, isContentDuplicate]
        )

        // 7. Mark URL as seen
        await redis.bf_add("url_bloom", url_hash)

        // 8. Send to LinkExtractor for scoring
        await linkExtractor.submit(url, response.body)
        // LinkExtractor scores links and publishes to Kafka
        // with computed priority scores
Key Design Decisions
Priority-Based URL Frontier

Choice

Kafka with priority headers — high-value and stale URLs consumed first

Rationale

Not all URLs are equally valuable. A stale page from nytimes.com is more important than a fresh page from a random blog. Priority scoring combines domain authority (0.4), freshness decay (0.3), and link context (0.3). Workers consume highest-priority URLs first, ensuring the crawler always works on the most valuable pages. Without priority, valuable pages wait behind millions of low-value URLs.

SimHash Content Deduplication

Choice

64-bit SimHash fingerprints with Hamming distance threshold of 3

Rationale

URL dedup misses content duplicates — the same article on 100 different domains has 100 different URLs. SimHash detects near-duplicate content by comparing 64-bit fingerprints: Hamming distance <= 3 means the pages are ~95% similar. At scale, ~30% of unique-URL pages are content duplicates, saving ~15TB/month in S3 storage. The ~2% false positive rate (unique pages incorrectly classified as duplicates) is acceptable for search crawling.

Dedicated DNS Cache

Choice

Separate Redis cluster caching domain-to-IP mappings with 1-hour TTL

Rationale

DNS resolution adds 50-100ms per page. At 1000 pages/sec, this is a significant per-page overhead. Caching in Redis with 1-hour TTL achieves ~95% hit rate at steady state, eliminating DNS overhead for most fetches. Single-flight pattern prevents thundering herd on popular domain lookups. Separating DNS cache from dedup cache prevents eviction interference between the two workloads.

Dedicated LinkExtractor Service

Choice

Separate worker pool for HTML parsing, link extraction, and link scoring

Rationale

Link scoring is CPU-intensive: HTML parsing, anchor text analysis, domain authority lookup, and priority computation. Running this inline with CrawlWorkers would reduce fetch throughput. A dedicated 10-worker LinkExtractor pool processes links asynchronously, allowing CrawlWorkers to focus on I/O-bound fetching. The two pools scale independently based on their respective bottlenecks.

50-Worker Pool with 64 Kafka Partitions

Choice

2.5x more workers than the Producer-Consumer variant

Rationale

50 workers at 20 pages/sec each = 1000 pages/sec. 64 Kafka partitions provide headroom for scaling to 64 workers without repartitioning. The 50/64 ratio leaves room for burst scaling. Workers are stateless Fargate tasks — scaling up is a configuration change.

S3 Lifecycle for Cost Optimization

Choice

Standard for 30 days, Infrequent Access for 30-90 days, Glacier after 90 days

Rationale

At ~75TB/month of net content, S3 costs grow quickly. Most content is accessed within the first 30 days (indexing pipeline). After 30 days, access drops to near-zero. IA tier saves ~40% vs Standard. Glacier after 90 days saves ~70%. Annual storage cost drops from ~$25K to ~$8K with lifecycle policies.

Scale & Performance

Target RPS

1000 pages/sec sustained

Latency (p99)

~50ms per page (parallel workers)

Storage

~75 TB/month gross, ~50 TB/month after content dedup

Availability

99.9% (Kafka + Redis replicated, checkpointed)

Time & Space Complexity
OperationTimeSpaceNotes
Crawl URL (full pipeline per page)O(1) Bloom check + O(1) DNS lookup + O(page_size) SimHash + O(links) extractionO(1) per URL in Bloom filter, O(1) per SimHash fingerprintWall time dominated by HTTP fetch (~200ms) and politeness delay. SimHash computation is O(page_size) but typically <5ms for 50KB pages.
Link scoring (per extracted link)O(1) domain authority lookup + O(anchor_text_length) relevance scoringO(links_per_page) batch processingLinkExtractor processes ~10 links/ms. A page with 50 outbound links takes ~5ms to score.
Content dedup check (SimHash comparison)O(1) Redis GET + O(1) Hamming distance computationO(1) per fingerprint storedHamming distance on 64-bit integers is a single XOR + popcount instruction. Redis lookup is the bottleneck at ~1ms.
Database Schema (HLD)
pages

Crawl metadata with priority and content dedup fields. Partitioned by crawl_date for efficient freshness queries. Includes SimHash fingerprint and content_dedup flag indicating whether the page was a near-duplicate. Write-heavy at ~1000 INSERTs/sec across 16 partitions.

url_hash VARCHAR PK (SHA-256 hash of URL)url TEXT NOT NULLdomain VARCHAR(255) NOT NULL (indexed)title TEXTstatus_code INTEGER NOT NULLfetched_at TIMESTAMPTZ NOT NULL (indexed, partition key)priority_score FLOAT NOT NULL (priority at crawl time)simhash BIGINT NOT NULL (64-bit SimHash fingerprint)content_dedup BOOLEAN NOT NULL DEFAULT FALSE

Indexes: pk_pages ON (url_hash), idx_pages_domain_fetched ON (domain, fetched_at DESC), idx_pages_simhash ON (simhash)

Partitioned by fetched_at (monthly). At 1000 pages/sec, grows ~2.5B rows/month. SimHash index supports content dedup analysis queries.

domains

Domain-level metadata for priority computation and politeness enforcement. Stores cached robots.txt, domain authority score, and crawl statistics. Updated on every page crawl via upsert on page_count and last_crawled.

domain VARCHAR PK (domain name)robots_txt TEXT (cached robots.txt content)authority_score FLOAT NOT NULL DEFAULT 0.5 (0-1 scale)last_crawled TIMESTAMPTZ NOT NULLpage_count INTEGER NOT NULL DEFAULT 0

Indexes: pk_domains ON (domain), idx_domains_authority ON (authority_score DESC)

Relatively small table — ~50M unique domains at web scale. Authority score is seeded from external data (e.g., Alexa rank) and updated based on inbound link analysis.

Event Contracts
url-frontier-priorityurl-frontier-priority

Priority-partitioned URL frontier. Each message includes a priority score (0-1) computed from domain authority, freshness decay, and link context. Workers consume highest-priority URLs first within each partition. Both seed URLs and scored extracted links are published here.

Key Schema

domain: string (partition key for per-domain ordering and politeness)

Value Schema

{ url: string, domain: string, depth: integer, priority: float, source_url: string (optional) }

Solution Comparison
VariantTierLatencyThroughputCostComplexityReliability
Naive (Single-Threaded BFS)T1200-2000ms per page~5 pages/sec$50/month (tiny RDS + 1 Fargate task)Low — 3 components, sequential95% (no redundancy, in-memory queue)
Producer-Consumer (Kafka Frontier)T250ms per page (parallel)400 pages/sec$2,000/month (Kafka + Redis + S3 + workers)Medium — Kafka, Bloom filter, workers99.9% (Kafka offset replay, replicated)
Distributed (Bloom + Priority Frontier)T350ms per page (parallel)1000 pages/sec$5,000/month (priority Kafka + dual Redis + S3)High — priority frontier, SimHash, DNS cache99.9% (crash-resilient, checkpointed)

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions
How does SimHash content dedup work and what are its limitations?

SimHash converts a page's text content into a 64-bit fingerprint. Two pages with Hamming distance (number of differing bits) <= 3 are classified as near-duplicates. This detects syndicated articles, mirror sites, and trivially modified pages (different ads/navigation, same content). The ~2% false positive rate means some genuinely unique pages are incorrectly skipped. SimHash does not detect semantic duplicates (same information rewritten in different words) — that requires more expensive NLP-based similarity.

Why separate DNS cache from the dedup cache?

DNS entries have TTL (1-hour expiry) and benefit from LRU eviction. Bloom filter entries have no TTL and must never be evicted. Mixing them in the same Redis instance would cause eviction conflicts — DNS entries evicting Bloom filter data or vice versa. Separate Redis clusters allow each workload to have its own eviction policy, memory allocation, and scaling characteristics.

How does priority scoring prevent starvation of low-priority URLs?

Low-priority URLs gain priority over time through freshness decay. A URL that has not been crawled gains 0.01 priority points per day. After 30 days without crawling, even a low-authority URL accumulates 0.3 points of freshness priority. Additionally, workers reserve 10% of their capacity for the lowest-priority partition segment, ensuring no URL waits indefinitely. In practice, most URLs are crawled within 14 days.

What happens when DNS changes and the cache is stale?

DNS entries are cached with 1-hour TTL, so stale entries expire naturally within an hour. Additionally, workers invalidate DNS cache entries when they encounter repeated connection errors (3 consecutive failures) to the cached IP. This provides faster recovery for domains that migrate to new IPs. The worst case is 1 hour of failed fetches for a domain that changes DNS, after which the cache refreshes automatically.

How does this compare to Google's actual web crawling infrastructure?

Google's crawler (Googlebot) uses similar concepts at vastly larger scale: priority-based frontier with ML-driven scheduling, content dedup with multiple fingerprint algorithms, distributed DNS resolution, and politeness enforcement. Key differences include multi-datacenter distribution, page rendering via headless Chrome for JavaScript-heavy sites, and ML-based content quality scoring that influences crawl priority. This template captures the essential architecture without the Google-specific complexity.

Related Templates

Discussion

Sign in to join the discussion.

Ready to design your own Web Crawler?

Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.

Open Simulator