Vetora logo
Medium7 componentsInterview: Very High

Search Autocomplete — Trie + Top-K (In-Memory Trie)

Industry-standard search autocomplete using an in-memory trie data structure with pre-computed top-K suggestions at every node. Lookups are O(prefix_length) with sub-millisecond latency. The trie is rebuilt every 5 minutes from aggregated query counts, with Redis-based trending boosts for near-real-time freshness.

TrieRedisKafkaReal-timeSearch Autocomplete
Problem Statement

The trie-based approach to search autocomplete is the industry-standard solution used by Google, Amazon, LinkedIn, and most major search platforms. It solves the fundamental problem with the naive database approach: eliminating per-keystroke database queries by serving suggestions from an in-memory data structure with O(prefix_length) lookup time.

A trie (prefix tree) is a tree data structure where each node represents a character, and paths from root to leaf represent complete query strings. The key optimization is pre-computing the top-K (typically 10) most popular suggestions at every node during the trie build phase. When a user types 'sys', the service traverses 3 nodes (root -> 's' -> 'y' -> 's') and immediately returns the pre-stored top-10 suggestions — no sorting, no scanning, no database query. The lookup is O(L) where L is the prefix length, regardless of how many queries match that prefix.

The trie for 1 billion unique queries is approximately 10 GB in memory. Each AutocompleteService pod loads the complete trie at startup and holds it in memory. With 20 pods, the aggregate memory cost is 200 GB — the price of sub-millisecond lookups. The trie is immutable during serving: a background thread reloads a new trie every 5 minutes via atomic pointer swap, ensuring zero-lock reads on the hot path.

The trie build process runs on TrieRebuildWorker, which consumes search logs from Kafka, aggregates per-query counts in sliding time windows (1-minute, 1-hour, 1-day, all-time), writes updated counts to QueryDB (DynamoDB), and periodically reconstructs the trie. The rebuild reads all query counts, builds the trie with top-K pre-computation at every node, serializes the result, and pushes it to TrieCache (Redis) for distribution to service pods.

Between full rebuilds, a trending boost layer in Redis provides near-real-time freshness. When TrieRebuildWorker detects a query whose recent count significantly exceeds its historical average, it writes a trending boost score to TrieCache. AutocompleteService merges static trie results with trending boosts on each request — if a trending query outscores a static suggestion, it replaces it in the top-10 response. This two-layer approach (static trie + dynamic trending) balances freshness with stability.

The architecture uses 7 components: TypeaheadClient, ApiGateway, MainLB, AutocompleteService, TrieCache (Redis), QueryDB (DynamoDB), SearchLogStream (Kafka), and TrieRebuildWorker. The API Gateway authenticates and rate-limits at 250K RPS. MainLB distributes queries using round-robin — no session affinity needed because every pod has the full trie. AutocompleteService handles both autocomplete reads (80% of traffic) and search log writes to Kafka (20%).

The primary trade-off is memory cost versus latency. At 10 GB per pod with 20 pods, the trie consumes 200 GB of aggregate memory. Sharding the trie by prefix first-character would reduce per-pod memory but requires consistent-hash routing at the LB layer and creates hot shards (prefix 's' gets disproportionate traffic). The full-trie-per-pod approach with round-robin LB is simpler and avoids hot-shard problems.

Interviewers expect candidates to explain the trie data structure, discuss why pre-computing top-K at each node eliminates sorting at query time (O(L) vs O(L + M log M)), reason about the memory-latency trade-off, and explain how the Kafka streaming pipeline enables near-real-time trending without blocking the hot path.

Architecture Overview

The trie-based autocomplete system uses seven components organized into three layers: the request path (TypeaheadClient, ApiGateway, MainLB, AutocompleteService), the data stores (TrieCache/Redis, QueryDB/DynamoDB), and the async pipeline (SearchLogStream/Kafka, TrieRebuildWorker).

The request path handles 200K autocomplete queries per second at peak. TypeaheadClient sends GET /api/v1/autocomplete?q=prefix&limit=10. ApiGateway (AWS API Gateway REST) authenticates the request, validates the JWT token (~2ms), enforces per-user rate limits to prevent scraping, and routes to MainLB. MainLB (AWS ALB) round-robins to one of 20 AutocompleteService pods. The service traverses its in-memory trie from root to the node matching the prefix and returns the pre-computed top-10 suggestions — typically in under 1ms.

The trending enrichment is an optional optimization on the hot path. After the trie lookup, AutocompleteService checks TrieCache (Redis) for trending boost scores that may modify the static rankings. For prefixes matching active trends (approximately 10% of requests), the service merges static top-K with trending candidates and re-ranks by combined score. Cache hit adds ~2ms; cache miss uses static trie rankings only — still correct, just not trend-boosted. The trending check is non-blocking: if Redis is down, the static trie results are returned without degradation.

The async pipeline handles search log ingestion and trie maintenance. When a user completes a search, AutocompleteService publishes the query to SearchLogStream (Kafka, 64 partitions). TrieRebuildWorker (20 workers) consumes these logs and aggregates per-prefix counts in sliding time windows. Every 5 minutes, the worker triggers a full trie rebuild: reads all counts from QueryDB, constructs the trie with top-K pre-computation, and pushes trending boosts to TrieCache. Service pods periodically reload the trie from QueryDB via a background thread (every 5-10 minutes). The reload is atomic — the new trie replaces the old one in a single pointer swap, so serving threads never see a partially-built trie.

Horizontal scaling is straightforward: add more AutocompleteService pods for higher RPS. Each pod is self-contained with its own copy of the trie, so there are no cross-pod dependencies. The LB is round-robin with no session affinity. TrieRebuildWorker scaling is limited by QueryDB scan throughput — DynamoDB On-Demand capacity absorbs rebuild spikes automatically.

Architecture Preview
Loading architecture preview...
Request Flow — In-Memory Trie Lookup with Trending Merge

This sequence diagram traces the autocomplete flow through the trie-based system. The key difference from the naive approach is the elimination of database queries on the hot path — the trie lookup is entirely in-memory, taking less than 1ms regardless of prefix length or match count. The trending merge adds a Redis lookup (~2ms) but is optional and non-blocking.

The second flow shows the async pipeline: search logs published to Kafka, consumed by TrieRebuildWorker, aggregated into DynamoDB, and periodically used to rebuild the trie. This pipeline runs independently of the autocomplete hot path — Kafka can lag without affecting autocomplete latency.

Loading diagram...

Step-by-Step Walkthrough

  1. 1Client sends autocomplete request with prefix 'sys'. ApiGateway authenticates JWT (~2ms) and rate-limits. MainLB round-robins to one of 20 pods
  2. 2AutocompleteService traverses its in-memory trie: root -> 's' -> 'y' -> 's'. At the 'sys' node, it reads the pre-stored top-10 suggestions. Total: ~0.1ms with zero database dependency
  3. 3Service checks Redis for trending boost at key 'trend:sys'. If found, merges trending candidates with static top-10 and re-ranks by combined score. If not found (90% of cases for non-trending prefixes), returns static results directly
  4. 4When the user completes a search, the query is published to Kafka (fire-and-forget, ~5ms). TrieRebuildWorker consumes the log, aggregates counts, and periodically rebuilds the trie
  5. 5Every 5 minutes, TrieRebuildWorker reads all counts from DynamoDB, builds a new trie, and each pod reloads via atomic pointer swap. Between rebuilds, trending boosts in Redis provide near-real-time freshness

Pseudocode

// TRIE LOOKUP — sub-millisecond, zero DB dependency
function autocomplete(prefix: string, limit: number = 10):
    node = trie.root
    for char in prefix:
        node = node.children[char]
        if node is null:
            return []  // No suggestions for this prefix

    // Top-K is pre-computed at every node during trie build
    suggestions = node.topK  // Already sorted by popularity — O(1)

    // Optional: merge with trending boost from Redis
    trending = redis.get("trend:" + prefix)
    if trending:
        suggestions = merge_and_rerank(suggestions, trending.queries, trending.boost_score)

    return suggestions[:limit]

// TRIE BUILD — offline, every 5 minutes
function buildTrie(queryCounts: Map<string, int>):
    trie = new Trie()
    for (query, count) in queryCounts:
        node = trie.insert(query)
        node.count = count

    // Bottom-up top-K propagation
    propagateTopK(trie.root, K=10)
    return trie

function propagateTopK(node, K):
    candidates = []
    if node.isEndOfWord:
        candidates.append({query: node.query, score: node.count})
    for child in node.children.values():
        candidates.extend(propagateTopK(child, K))
    candidates.sort(by: score, descending: true)
    node.topK = candidates[:K]
    return candidates[:K]
Key Design Decisions
In-Memory Trie vs Redis Lookups per Request

Choice

Full trie loaded into each AutocompleteService pod's memory

Rationale

At 500K RPS, a network round-trip to Redis for every keystroke adds 1-2ms of latency and creates a dependency on Redis availability. The in-memory trie provides sub-millisecond lookups with zero network dependency for the hot path. Redis is used only for trending boosts (optional enrichment, not critical path). The trade-off is 10 GB memory per pod — with 20 pods, that is 200 GB aggregate. This is the same trade-off Google makes: their autocomplete service holds the trie in memory for every data center.

Pre-Computed Top-K at Each Trie Node

Choice

Store the top-10 suggestions at every node during the build phase

Rationale

Without pre-computed results, finding top-K suggestions for a prefix requires traversing all descendant nodes and sorting — O(M log M) per request where M is the number of queries matching the prefix. For prefix 's', M is 50 million. Pre-computing during the rebuild phase moves this cost to an offline background job and makes lookups O(L) where L is the prefix length (just traverse L nodes and read the stored result). The trade-off is memory (storing 10 suggestions at every node increases trie size by approximately 3x) and rebuild latency (building 1B-node trie with top-K propagation takes approximately 3 minutes).

Periodic Trie Rebuilds with Atomic Pointer Swap

Choice

Rebuild the full trie every 5 minutes, swap via pointer reassignment

Rationale

Updating a trie in-place under concurrent reads requires complex locking or lock-free data structures (concurrent trie implementations exist but are error-prone). Periodic rebuilds with atomic pointer swap are simpler and safer: the old trie serves requests while the new one is built in a separate memory region. When the build completes, a single pointer swap (AtomicReference.set() in Java, or equivalent) makes the new trie active. The old trie is garbage-collected after in-flight requests complete. The trade-off is a 5-minute staleness window for new queries, partially mitigated by the Redis trending boost layer.

Kafka for Search Log Ingestion

Choice

Publish completed search queries to Kafka for async aggregation

Rationale

At 200K+ completed searches per second, direct database writes would overwhelm QueryDB. Kafka buffers the log stream and enables TrieRebuildWorker to batch-aggregate counts efficiently. Kafka also provides replay capability — if the worker has a bug or needs to recompute historical trends, it can re-consume from a past offset. The 64 partitions (partitioned by query prefix) enable parallel consumption across 20 workers.

Full Trie per Pod (No Sharding)

Choice

Each pod holds the complete trie rather than a shard of it

Rationale

Sharding by prefix first-character requires consistent-hash routing at the LB layer and creates hot shards — the letter 's' accounts for approximately 15% of all English queries, while 'x' accounts for less than 0.5%. Full trie per pod with round-robin LB distributes load evenly and simplifies operations (any pod can serve any query). The cost is 10 GB memory per pod instead of ~1 GB per shard, but at $0.05/GB/hour for Fargate memory, this is approximately $500/month for 20 pods — trivial compared to engineering time for shard management.

Scale & Performance

Target RPS

200K peak (20 pods x 10K RPS each)

Latency (p99)

<1ms trie lookup + 2ms trending merge (cache hit)

Storage

~10 GB trie per pod, ~50 GB DynamoDB for counts

Availability

99.9% (multi-AZ, graceful degradation on Redis failure)

Time & Space Complexity
OperationTimeSpaceNotes
Trie lookup (autocomplete query)O(L) — traverse L nodes where L = prefix lengthO(1) — read pre-stored top-K at the target nodeSub-millisecond regardless of the number of queries matching the prefix. A 10-character prefix traverses 10 nodes — each traversal is a hash map lookup on the current node's children. Total: approximately 0.1ms for a typical query.
Trie build (offline, every 5 minutes)O(N * L) — insert N queries of average length L into the trieO(N * L) — trie nodes for all prefixes of all queriesBuilding the trie from 1B queries takes approximately 3 minutes on a 4-vCPU worker. The top-K propagation phase (bubbling up top-10 suggestions from leaves to root) adds approximately 1 minute. Total rebuild: ~4 minutes, well within the 5-minute cycle.
Trending merge (on cache hit)O(K) — merge K trending candidates with K static suggestionsO(K) — merged result setK is typically 10 (top-10 suggestions). Merging two sorted lists of 10 elements is trivial — <0.01ms. The Redis lookup dominates at ~1ms.
Kafka log aggregation (TrieRebuildWorker)O(1) per message — increment counter in hash mapO(P) — P unique prefixes in the sliding windowEach search log message increments a counter in an in-memory hash map. At 40K messages/sec across 20 workers, each worker handles 2K msg/sec — trivially fast. The hash map grows to approximately 10M entries (unique query-window combinations).
Database Schema (HLD)
query_counts (DynamoDB)

Stores aggregated per-query search counts across multiple time windows. Partition key is the 2-character prefix for even distribution; sort key is the full query string. Written by TrieRebuildWorker after aggregating Kafka search logs. Read via full table scan every 5 minutes for trie construction.

prefix STRING (partition key — first 2 chars of query)query STRING (sort key — full query string)count_1min INTEGERcount_1hr INTEGERcount_total INTEGER (all-time)updated_at STRING (ISO timestamp)

Indexes: Primary key: (prefix, query) — enables efficient range queries within a prefix

DynamoDB On-Demand capacity absorbs the bursty full-scan reads during trie rebuilds. With 1B items at ~100 bytes each, the table is approximately 100 GB. The 2-character prefix partition key distributes data across approximately 1,296 partitions (36^2 for alphanumeric), avoiding hot partitions.

trending_boosts (Redis)

Redis hash entries storing trending boost scores per prefix. Written by TrieRebuildWorker when a query's recent count exceeds its historical baseline. Read by AutocompleteService on every request for merge with static trie results. TTL 300 seconds — trends auto-expire if not sustained.

key: trend:{prefix} (Redis key)boost_score FLOAT (trending multiplier, 1.0 = no boost)trending_suggestions ARRAY (additional trending query strings)

Indexes: Redis key lookup — O(1) hash table access

Approximately 10K-50K active trending entries at any time (0.001% of all prefixes are trending). Memory footprint: ~50 MB. Read latency: <1ms (Redis in-memory hash table). 90% cache hit rate for popular trending prefixes.

Event Contracts
search_loggedsearch-logged

Emitted by AutocompleteService on every completed search query. Consumed by TrieRebuildWorker for count aggregation and trend detection. Partitioned by query_prefix for locality during aggregation.

Key Schema

query_prefix (string — first 2 characters of query)

Value Schema

{ query: string, user_id?: string, timestamp: string }

What-If Scenarios

Breaking news event causes a brand-new query to spike to 10K searches/sec

Impact

The new query does not exist in the static trie (last rebuilt 0-5 minutes ago). Users typing the trending query see no relevant suggestions for up to 5 minutes. The Kafka log captures the spike, and TrieRebuildWorker detects the trend and writes a boost to Redis within minutes, but the full trie includes the query only after the next rebuild cycle.

Mitigation

The trending boost layer in Redis partially addresses this — TrieRebuildWorker pushes trending queries to Redis as soon as they are detected (within 1-2 minutes). For sub-minute trending, the V2 streaming variant uses real-time stream processing with 10-second sliding windows.

TrieCache (Redis) goes down entirely

Impact

Trending merges fail — all requests return static trie results only. No data loss: the static trie is in-memory on each pod. Autocomplete continues working with correct but potentially stale rankings. Users do not notice unless they are specifically looking for a trending query that has not yet been incorporated into the static trie.

Mitigation

Redis Cluster with automatic failover (promote replica to primary within 15-30 seconds). Circuit breaker on the Redis call — after 3 consecutive failures, stop attempting trending lookups for 30 seconds to avoid adding latency from failed connections.

All 20 AutocompleteService pods reload the trie simultaneously

Impact

Each pod downloads the ~10 GB trie data from QueryDB simultaneously, creating a read spike of 200 GB in aggregate. DynamoDB handles this via On-Demand capacity, but pods experience 30-60 seconds of elevated GC pressure as the old trie is garbage-collected while the new one is loading. During this window, p99 latency may spike to 10-20ms (from <1ms baseline).

Mitigation

Stagger reloads across pods: pod 1 reloads at t=0, pod 2 at t=15s, pod 3 at t=30s, etc. This spreads the DynamoDB read load and GC pressure over 5 minutes instead of a single burst. The MainLB continues routing to pods that have not yet reloaded — they serve from the old (slightly stale but correct) trie.

Failure Modes & Resilience
ComponentFailureImpactMitigation
AutocompleteService (trie in memory)Out-of-memory crash during trie reload (double memory during swap)Pod crashes and restarts. MainLB detects the failed health check and stops routing to the pod. Remaining 19 pods absorb the traffic (~5% increase per pod). If multiple pods crash simultaneously (rare), capacity drops below the safe threshold.Ensure pod memory limit (16 GB) is at least 1.5x the trie size (10 GB) to accommodate the swap window. Use JVM flags (-XX:+UseG1GC -XX:MaxGCPauseMillis=50) to minimize GC pauses during reload. Pre-warm the new trie in a separate memory region before swapping.
SearchLogStream (Kafka)Kafka cluster unavailable for 30+ minutesSearch logs are lost — TrieRebuildWorker stops receiving new data. The trie becomes increasingly stale but continues serving from the last successful build. Trending detection stops — new trends are not detected until Kafka recovers. No impact on the autocomplete hot path (trie lookups are independent of Kafka).Kafka multi-AZ deployment with 3 broker replicas. If Kafka is down, AutocompleteService buffers search logs locally (up to 10K messages) and replays when Kafka recovers. Set alert on consumer lag > 5 minutes.
QueryDB (DynamoDB)Throttling during trie rebuild full scanTrie rebuild takes longer than expected (20+ minutes instead of 4 minutes). Pods continue serving from the previous trie. Trending boosts in Redis partially compensate. If throttling persists, the trie ages beyond 15 minutes and an alert fires.Use DynamoDB On-Demand capacity (auto-scales to handle scan bursts). If still throttling, reduce scan parallelism in TrieRebuildWorker (fewer concurrent reads per worker). Set RetryPolicy with exponential backoff on DynamoDB reads.
Monitoring & Alerting

Key metrics: (1) Trie lookup latency p50/p99 — should be <1ms. Alert at p99 > 5ms (indicates GC pauses during reload). (2) Trie age — time since last successful rebuild. Alert at 15 min, critical at 30 min. (3) Trending boost cache hit rate — expected 90%. Drop below 70% indicates Redis issues. (4) Kafka consumer lag — TrieRebuildWorker should be within 30 seconds of head. Alert at 5 min lag. (5) Pod memory utilization — should be 60-70% of limit (10 GB trie in 16 GB pod). Alert at 85%. (6) Autocomplete RPS per pod — should be evenly distributed across 20 pods. Variance > 20% indicates LB misconfiguration. Dashboard: Grafana with trie lookup latency histogram, trie age timeline, Kafka consumer lag, pod memory heatmap, and trending boost hit rate.

Cost Analysis

At 200K RPS peak: AutocompleteService 20 pods (8 vCPU, 16 GB each) ~$3,000/month, TrieCache Redis (4 nodes, 32 GB each) ~$1,200/month, QueryDB DynamoDB On-Demand ~$500/month, Kafka MSK (3 brokers) ~$600/month, API Gateway ~$200/month, TrieRebuildWorker 20 workers ~$400/month. Total: ~$5,900/month. The dominant cost is the AutocompleteService pods — 20 pods with 16 GB RAM each for the in-memory trie. Reducing pod count to 10 saves $1,500/month but halves capacity. The V2 streaming variant adds ML scoring (~$2,000/month) and personalization cache (~$500/month) for an additional $2,500/month.

Solution Comparison
VariantTierLatencyThroughputCostComplexityReliability
V0: Naive (Database LIKE Query)T18-15ms query, 50-200ms+ under load~10K RPS$730/monthLow99% (single DB)
V1: Trie + Top-K (In-Memory Trie)T2<1ms trie + 2ms trending merge200K RPS$2,500/monthMedium99.9% (multi-AZ)
V2: Streaming Trending + ML RankingT3<1ms trie + 5-10ms ML scoring500K RPS$8,000/monthVery High99.95% (multi-AZ, ML fallback)

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions
How does the trie handle multi-language queries (CJK, Arabic, etc.)?

The trie operates on Unicode code points, not ASCII characters. Each node in the trie can have up to 1,114,112 children (the full Unicode space), but in practice the child map uses a hash map (not a fixed-size array) to avoid allocating empty slots. CJK characters work naturally — a Chinese query like a three-character phrase traverses 3 trie nodes, each branching on a Chinese character code point. The trie size increases with alphabet size because there are more unique prefixes, but the lookup time remains O(L) where L is the prefix length in characters.

What happens if the trie rebuild fails or takes longer than 5 minutes?

AutocompleteService pods continue serving from the previous trie — the pointer swap only happens when the new trie is fully built and validated. If the rebuild worker crashes, the trie ages but remains correct (just not fresh). An alert fires if the trie age exceeds 15 minutes (3 missed rebuild cycles). The trending boost layer in Redis partially compensates for stale trie data by surfacing recent trends. In the worst case (rebuild down for hours), the system serves increasingly stale but never incorrect suggestions.

Why DynamoDB instead of PostgreSQL for query counts?

DynamoDB's On-Demand capacity mode handles the bursty read pattern of trie rebuilds. Every 5 minutes, TrieRebuildWorker scans all 1B rows — a massive burst of reads that would require careful provisioning on PostgreSQL. DynamoDB auto-scales to absorb this burst without throttling. Additionally, DynamoDB's partition key (2-character prefix) distributes writes evenly across partitions, avoiding hot partitions during real-time count updates from the Kafka consumer.

How does the trending boost layer work without modifying the trie?

TrieRebuildWorker writes trending boost entries to Redis: key = 'trend:{prefix}', value = {boost_score, trending_suggestions}. When AutocompleteService performs a trie lookup, it also checks Redis for a trending boost matching the prefix. If found, the service merges the static trie top-10 with the trending suggestions, re-scores using combined popularity + boost, and returns the merged top-10. This is an overlay — the trie data is never modified. If Redis is down, the trending check is skipped and static results are returned.

How much memory does the trie actually consume for 1B queries?

The trie has approximately 5B nodes (average query length is 5 characters, but many queries share prefixes, reducing the total). Each node stores: a character (4 bytes), a hash map of children (48 bytes average for HashMap overhead + 8 bytes per child pointer), and top-10 suggestions (10 x ~80 bytes for query string + score = ~800 bytes, but stored as references to a shared string pool). Total: approximately 10 GB for 1B queries with aggressive string interning. Without string interning, it would be 30+ GB due to duplicate substrings.

Related Templates

Discussion

Sign in to join the discussion.

Ready to design your own Search Autocomplete?

Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.

Open Simulator