Industry-standard search autocomplete using an in-memory trie data structure with pre-computed top-K suggestions at every node. Lookups are O(prefix_length) with sub-millisecond latency. The trie is rebuilt every 5 minutes from aggregated query counts, with Redis-based trending boosts for near-real-time freshness.
The trie-based approach to search autocomplete is the industry-standard solution used by Google, Amazon, LinkedIn, and most major search platforms. It solves the fundamental problem with the naive database approach: eliminating per-keystroke database queries by serving suggestions from an in-memory data structure with O(prefix_length) lookup time.
A trie (prefix tree) is a tree data structure where each node represents a character, and paths from root to leaf represent complete query strings. The key optimization is pre-computing the top-K (typically 10) most popular suggestions at every node during the trie build phase. When a user types 'sys', the service traverses 3 nodes (root -> 's' -> 'y' -> 's') and immediately returns the pre-stored top-10 suggestions — no sorting, no scanning, no database query. The lookup is O(L) where L is the prefix length, regardless of how many queries match that prefix.
The trie for 1 billion unique queries is approximately 10 GB in memory. Each AutocompleteService pod loads the complete trie at startup and holds it in memory. With 20 pods, the aggregate memory cost is 200 GB — the price of sub-millisecond lookups. The trie is immutable during serving: a background thread reloads a new trie every 5 minutes via atomic pointer swap, ensuring zero-lock reads on the hot path.
The trie build process runs on TrieRebuildWorker, which consumes search logs from Kafka, aggregates per-query counts in sliding time windows (1-minute, 1-hour, 1-day, all-time), writes updated counts to QueryDB (DynamoDB), and periodically reconstructs the trie. The rebuild reads all query counts, builds the trie with top-K pre-computation at every node, serializes the result, and pushes it to TrieCache (Redis) for distribution to service pods.
Between full rebuilds, a trending boost layer in Redis provides near-real-time freshness. When TrieRebuildWorker detects a query whose recent count significantly exceeds its historical average, it writes a trending boost score to TrieCache. AutocompleteService merges static trie results with trending boosts on each request — if a trending query outscores a static suggestion, it replaces it in the top-10 response. This two-layer approach (static trie + dynamic trending) balances freshness with stability.
The architecture uses 7 components: TypeaheadClient, ApiGateway, MainLB, AutocompleteService, TrieCache (Redis), QueryDB (DynamoDB), SearchLogStream (Kafka), and TrieRebuildWorker. The API Gateway authenticates and rate-limits at 250K RPS. MainLB distributes queries using round-robin — no session affinity needed because every pod has the full trie. AutocompleteService handles both autocomplete reads (80% of traffic) and search log writes to Kafka (20%).
The primary trade-off is memory cost versus latency. At 10 GB per pod with 20 pods, the trie consumes 200 GB of aggregate memory. Sharding the trie by prefix first-character would reduce per-pod memory but requires consistent-hash routing at the LB layer and creates hot shards (prefix 's' gets disproportionate traffic). The full-trie-per-pod approach with round-robin LB is simpler and avoids hot-shard problems.
Interviewers expect candidates to explain the trie data structure, discuss why pre-computing top-K at each node eliminates sorting at query time (O(L) vs O(L + M log M)), reason about the memory-latency trade-off, and explain how the Kafka streaming pipeline enables near-real-time trending without blocking the hot path.
The trie-based autocomplete system uses seven components organized into three layers: the request path (TypeaheadClient, ApiGateway, MainLB, AutocompleteService), the data stores (TrieCache/Redis, QueryDB/DynamoDB), and the async pipeline (SearchLogStream/Kafka, TrieRebuildWorker).
The request path handles 200K autocomplete queries per second at peak. TypeaheadClient sends GET /api/v1/autocomplete?q=prefix&limit=10. ApiGateway (AWS API Gateway REST) authenticates the request, validates the JWT token (~2ms), enforces per-user rate limits to prevent scraping, and routes to MainLB. MainLB (AWS ALB) round-robins to one of 20 AutocompleteService pods. The service traverses its in-memory trie from root to the node matching the prefix and returns the pre-computed top-10 suggestions — typically in under 1ms.
The trending enrichment is an optional optimization on the hot path. After the trie lookup, AutocompleteService checks TrieCache (Redis) for trending boost scores that may modify the static rankings. For prefixes matching active trends (approximately 10% of requests), the service merges static top-K with trending candidates and re-ranks by combined score. Cache hit adds ~2ms; cache miss uses static trie rankings only — still correct, just not trend-boosted. The trending check is non-blocking: if Redis is down, the static trie results are returned without degradation.
The async pipeline handles search log ingestion and trie maintenance. When a user completes a search, AutocompleteService publishes the query to SearchLogStream (Kafka, 64 partitions). TrieRebuildWorker (20 workers) consumes these logs and aggregates per-prefix counts in sliding time windows. Every 5 minutes, the worker triggers a full trie rebuild: reads all counts from QueryDB, constructs the trie with top-K pre-computation, and pushes trending boosts to TrieCache. Service pods periodically reload the trie from QueryDB via a background thread (every 5-10 minutes). The reload is atomic — the new trie replaces the old one in a single pointer swap, so serving threads never see a partially-built trie.
Horizontal scaling is straightforward: add more AutocompleteService pods for higher RPS. Each pod is self-contained with its own copy of the trie, so there are no cross-pod dependencies. The LB is round-robin with no session affinity. TrieRebuildWorker scaling is limited by QueryDB scan throughput — DynamoDB On-Demand capacity absorbs rebuild spikes automatically.
This sequence diagram traces the autocomplete flow through the trie-based system. The key difference from the naive approach is the elimination of database queries on the hot path — the trie lookup is entirely in-memory, taking less than 1ms regardless of prefix length or match count. The trending merge adds a Redis lookup (~2ms) but is optional and non-blocking.
The second flow shows the async pipeline: search logs published to Kafka, consumed by TrieRebuildWorker, aggregated into DynamoDB, and periodically used to rebuild the trie. This pipeline runs independently of the autocomplete hot path — Kafka can lag without affecting autocomplete latency.
Step-by-Step Walkthrough
Pseudocode
// TRIE LOOKUP — sub-millisecond, zero DB dependency
function autocomplete(prefix: string, limit: number = 10):
node = trie.root
for char in prefix:
node = node.children[char]
if node is null:
return [] // No suggestions for this prefix
// Top-K is pre-computed at every node during trie build
suggestions = node.topK // Already sorted by popularity — O(1)
// Optional: merge with trending boost from Redis
trending = redis.get("trend:" + prefix)
if trending:
suggestions = merge_and_rerank(suggestions, trending.queries, trending.boost_score)
return suggestions[:limit]
// TRIE BUILD — offline, every 5 minutes
function buildTrie(queryCounts: Map<string, int>):
trie = new Trie()
for (query, count) in queryCounts:
node = trie.insert(query)
node.count = count
// Bottom-up top-K propagation
propagateTopK(trie.root, K=10)
return trie
function propagateTopK(node, K):
candidates = []
if node.isEndOfWord:
candidates.append({query: node.query, score: node.count})
for child in node.children.values():
candidates.extend(propagateTopK(child, K))
candidates.sort(by: score, descending: true)
node.topK = candidates[:K]
return candidates[:K]Choice
Full trie loaded into each AutocompleteService pod's memory
Rationale
At 500K RPS, a network round-trip to Redis for every keystroke adds 1-2ms of latency and creates a dependency on Redis availability. The in-memory trie provides sub-millisecond lookups with zero network dependency for the hot path. Redis is used only for trending boosts (optional enrichment, not critical path). The trade-off is 10 GB memory per pod — with 20 pods, that is 200 GB aggregate. This is the same trade-off Google makes: their autocomplete service holds the trie in memory for every data center.
Choice
Store the top-10 suggestions at every node during the build phase
Rationale
Without pre-computed results, finding top-K suggestions for a prefix requires traversing all descendant nodes and sorting — O(M log M) per request where M is the number of queries matching the prefix. For prefix 's', M is 50 million. Pre-computing during the rebuild phase moves this cost to an offline background job and makes lookups O(L) where L is the prefix length (just traverse L nodes and read the stored result). The trade-off is memory (storing 10 suggestions at every node increases trie size by approximately 3x) and rebuild latency (building 1B-node trie with top-K propagation takes approximately 3 minutes).
Choice
Rebuild the full trie every 5 minutes, swap via pointer reassignment
Rationale
Updating a trie in-place under concurrent reads requires complex locking or lock-free data structures (concurrent trie implementations exist but are error-prone). Periodic rebuilds with atomic pointer swap are simpler and safer: the old trie serves requests while the new one is built in a separate memory region. When the build completes, a single pointer swap (AtomicReference.set() in Java, or equivalent) makes the new trie active. The old trie is garbage-collected after in-flight requests complete. The trade-off is a 5-minute staleness window for new queries, partially mitigated by the Redis trending boost layer.
Choice
Publish completed search queries to Kafka for async aggregation
Rationale
At 200K+ completed searches per second, direct database writes would overwhelm QueryDB. Kafka buffers the log stream and enables TrieRebuildWorker to batch-aggregate counts efficiently. Kafka also provides replay capability — if the worker has a bug or needs to recompute historical trends, it can re-consume from a past offset. The 64 partitions (partitioned by query prefix) enable parallel consumption across 20 workers.
Choice
Each pod holds the complete trie rather than a shard of it
Rationale
Sharding by prefix first-character requires consistent-hash routing at the LB layer and creates hot shards — the letter 's' accounts for approximately 15% of all English queries, while 'x' accounts for less than 0.5%. Full trie per pod with round-robin LB distributes load evenly and simplifies operations (any pod can serve any query). The cost is 10 GB memory per pod instead of ~1 GB per shard, but at $0.05/GB/hour for Fargate memory, this is approximately $500/month for 20 pods — trivial compared to engineering time for shard management.
Target RPS
200K peak (20 pods x 10K RPS each)
Latency (p99)
<1ms trie lookup + 2ms trending merge (cache hit)
Storage
~10 GB trie per pod, ~50 GB DynamoDB for counts
Availability
99.9% (multi-AZ, graceful degradation on Redis failure)
| Operation | Time | Space | Notes |
|---|---|---|---|
| Trie lookup (autocomplete query) | O(L) — traverse L nodes where L = prefix length | O(1) — read pre-stored top-K at the target node | Sub-millisecond regardless of the number of queries matching the prefix. A 10-character prefix traverses 10 nodes — each traversal is a hash map lookup on the current node's children. Total: approximately 0.1ms for a typical query. |
| Trie build (offline, every 5 minutes) | O(N * L) — insert N queries of average length L into the trie | O(N * L) — trie nodes for all prefixes of all queries | Building the trie from 1B queries takes approximately 3 minutes on a 4-vCPU worker. The top-K propagation phase (bubbling up top-10 suggestions from leaves to root) adds approximately 1 minute. Total rebuild: ~4 minutes, well within the 5-minute cycle. |
| Trending merge (on cache hit) | O(K) — merge K trending candidates with K static suggestions | O(K) — merged result set | K is typically 10 (top-10 suggestions). Merging two sorted lists of 10 elements is trivial — <0.01ms. The Redis lookup dominates at ~1ms. |
| Kafka log aggregation (TrieRebuildWorker) | O(1) per message — increment counter in hash map | O(P) — P unique prefixes in the sliding window | Each search log message increments a counter in an in-memory hash map. At 40K messages/sec across 20 workers, each worker handles 2K msg/sec — trivially fast. The hash map grows to approximately 10M entries (unique query-window combinations). |
Stores aggregated per-query search counts across multiple time windows. Partition key is the 2-character prefix for even distribution; sort key is the full query string. Written by TrieRebuildWorker after aggregating Kafka search logs. Read via full table scan every 5 minutes for trie construction.
Indexes: Primary key: (prefix, query) — enables efficient range queries within a prefix
DynamoDB On-Demand capacity absorbs the bursty full-scan reads during trie rebuilds. With 1B items at ~100 bytes each, the table is approximately 100 GB. The 2-character prefix partition key distributes data across approximately 1,296 partitions (36^2 for alphanumeric), avoiding hot partitions.
Redis hash entries storing trending boost scores per prefix. Written by TrieRebuildWorker when a query's recent count exceeds its historical baseline. Read by AutocompleteService on every request for merge with static trie results. TTL 300 seconds — trends auto-expire if not sustained.
Indexes: Redis key lookup — O(1) hash table access
Approximately 10K-50K active trending entries at any time (0.001% of all prefixes are trending). Memory footprint: ~50 MB. Read latency: <1ms (Redis in-memory hash table). 90% cache hit rate for popular trending prefixes.
Emitted by AutocompleteService on every completed search query. Consumed by TrieRebuildWorker for count aggregation and trend detection. Partitioned by query_prefix for locality during aggregation.
Key Schema
query_prefix (string — first 2 characters of query)
Value Schema
{ query: string, user_id?: string, timestamp: string }
Breaking news event causes a brand-new query to spike to 10K searches/sec
Impact
The new query does not exist in the static trie (last rebuilt 0-5 minutes ago). Users typing the trending query see no relevant suggestions for up to 5 minutes. The Kafka log captures the spike, and TrieRebuildWorker detects the trend and writes a boost to Redis within minutes, but the full trie includes the query only after the next rebuild cycle.
Mitigation
The trending boost layer in Redis partially addresses this — TrieRebuildWorker pushes trending queries to Redis as soon as they are detected (within 1-2 minutes). For sub-minute trending, the V2 streaming variant uses real-time stream processing with 10-second sliding windows.
TrieCache (Redis) goes down entirely
Impact
Trending merges fail — all requests return static trie results only. No data loss: the static trie is in-memory on each pod. Autocomplete continues working with correct but potentially stale rankings. Users do not notice unless they are specifically looking for a trending query that has not yet been incorporated into the static trie.
Mitigation
Redis Cluster with automatic failover (promote replica to primary within 15-30 seconds). Circuit breaker on the Redis call — after 3 consecutive failures, stop attempting trending lookups for 30 seconds to avoid adding latency from failed connections.
All 20 AutocompleteService pods reload the trie simultaneously
Impact
Each pod downloads the ~10 GB trie data from QueryDB simultaneously, creating a read spike of 200 GB in aggregate. DynamoDB handles this via On-Demand capacity, but pods experience 30-60 seconds of elevated GC pressure as the old trie is garbage-collected while the new one is loading. During this window, p99 latency may spike to 10-20ms (from <1ms baseline).
Mitigation
Stagger reloads across pods: pod 1 reloads at t=0, pod 2 at t=15s, pod 3 at t=30s, etc. This spreads the DynamoDB read load and GC pressure over 5 minutes instead of a single burst. The MainLB continues routing to pods that have not yet reloaded — they serve from the old (slightly stale but correct) trie.
| Component | Failure | Impact | Mitigation |
|---|---|---|---|
| AutocompleteService (trie in memory) | Out-of-memory crash during trie reload (double memory during swap) | Pod crashes and restarts. MainLB detects the failed health check and stops routing to the pod. Remaining 19 pods absorb the traffic (~5% increase per pod). If multiple pods crash simultaneously (rare), capacity drops below the safe threshold. | Ensure pod memory limit (16 GB) is at least 1.5x the trie size (10 GB) to accommodate the swap window. Use JVM flags (-XX:+UseG1GC -XX:MaxGCPauseMillis=50) to minimize GC pauses during reload. Pre-warm the new trie in a separate memory region before swapping. |
| SearchLogStream (Kafka) | Kafka cluster unavailable for 30+ minutes | Search logs are lost — TrieRebuildWorker stops receiving new data. The trie becomes increasingly stale but continues serving from the last successful build. Trending detection stops — new trends are not detected until Kafka recovers. No impact on the autocomplete hot path (trie lookups are independent of Kafka). | Kafka multi-AZ deployment with 3 broker replicas. If Kafka is down, AutocompleteService buffers search logs locally (up to 10K messages) and replays when Kafka recovers. Set alert on consumer lag > 5 minutes. |
| QueryDB (DynamoDB) | Throttling during trie rebuild full scan | Trie rebuild takes longer than expected (20+ minutes instead of 4 minutes). Pods continue serving from the previous trie. Trending boosts in Redis partially compensate. If throttling persists, the trie ages beyond 15 minutes and an alert fires. | Use DynamoDB On-Demand capacity (auto-scales to handle scan bursts). If still throttling, reduce scan parallelism in TrieRebuildWorker (fewer concurrent reads per worker). Set RetryPolicy with exponential backoff on DynamoDB reads. |
Key metrics: (1) Trie lookup latency p50/p99 — should be <1ms. Alert at p99 > 5ms (indicates GC pauses during reload). (2) Trie age — time since last successful rebuild. Alert at 15 min, critical at 30 min. (3) Trending boost cache hit rate — expected 90%. Drop below 70% indicates Redis issues. (4) Kafka consumer lag — TrieRebuildWorker should be within 30 seconds of head. Alert at 5 min lag. (5) Pod memory utilization — should be 60-70% of limit (10 GB trie in 16 GB pod). Alert at 85%. (6) Autocomplete RPS per pod — should be evenly distributed across 20 pods. Variance > 20% indicates LB misconfiguration. Dashboard: Grafana with trie lookup latency histogram, trie age timeline, Kafka consumer lag, pod memory heatmap, and trending boost hit rate.
At 200K RPS peak: AutocompleteService 20 pods (8 vCPU, 16 GB each) ~$3,000/month, TrieCache Redis (4 nodes, 32 GB each) ~$1,200/month, QueryDB DynamoDB On-Demand ~$500/month, Kafka MSK (3 brokers) ~$600/month, API Gateway ~$200/month, TrieRebuildWorker 20 workers ~$400/month. Total: ~$5,900/month. The dominant cost is the AutocompleteService pods — 20 pods with 16 GB RAM each for the in-memory trie. Reducing pod count to 10 saves $1,500/month but halves capacity. The V2 streaming variant adds ML scoring (~$2,000/month) and personalization cache (~$500/month) for an additional $2,500/month.
| Variant | Tier | Latency | Throughput | Cost | Complexity | Reliability |
|---|---|---|---|---|---|---|
| V0: Naive (Database LIKE Query) | T1 | 8-15ms query, 50-200ms+ under load | ~10K RPS | $730/month | Low | 99% (single DB) |
| V1: Trie + Top-K (In-Memory Trie) | T2 | <1ms trie + 2ms trending merge | 200K RPS | $2,500/month | Medium | 99.9% (multi-AZ) |
| V2: Streaming Trending + ML Ranking | T3 | <1ms trie + 5-10ms ML scoring | 500K RPS | $8,000/month | Very High | 99.95% (multi-AZ, ML fallback) |
This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.
The trie operates on Unicode code points, not ASCII characters. Each node in the trie can have up to 1,114,112 children (the full Unicode space), but in practice the child map uses a hash map (not a fixed-size array) to avoid allocating empty slots. CJK characters work naturally — a Chinese query like a three-character phrase traverses 3 trie nodes, each branching on a Chinese character code point. The trie size increases with alphabet size because there are more unique prefixes, but the lookup time remains O(L) where L is the prefix length in characters.
AutocompleteService pods continue serving from the previous trie — the pointer swap only happens when the new trie is fully built and validated. If the rebuild worker crashes, the trie ages but remains correct (just not fresh). An alert fires if the trie age exceeds 15 minutes (3 missed rebuild cycles). The trending boost layer in Redis partially compensates for stale trie data by surfacing recent trends. In the worst case (rebuild down for hours), the system serves increasingly stale but never incorrect suggestions.
DynamoDB's On-Demand capacity mode handles the bursty read pattern of trie rebuilds. Every 5 minutes, TrieRebuildWorker scans all 1B rows — a massive burst of reads that would require careful provisioning on PostgreSQL. DynamoDB auto-scales to absorb this burst without throttling. Additionally, DynamoDB's partition key (2-character prefix) distributes writes evenly across partitions, avoiding hot partitions during real-time count updates from the Kafka consumer.
TrieRebuildWorker writes trending boost entries to Redis: key = 'trend:{prefix}', value = {boost_score, trending_suggestions}. When AutocompleteService performs a trie lookup, it also checks Redis for a trending boost matching the prefix. If found, the service merges the static trie top-10 with the trending suggestions, re-scores using combined popularity + boost, and returns the merged top-10. This is an overlay — the trie data is never modified. If Redis is down, the trending check is skipped and static results are returned.
The trie has approximately 5B nodes (average query length is 5 characters, but many queries share prefixes, reducing the total). Each node stores: a character (4 bytes), a hash map of children (48 bytes average for HashMap overhead + 8 bytes per child pointer), and top-10 suggestions (10 x ~80 bytes for query string + score = ~800 bytes, but stored as references to a shared string pool). Total: approximately 10 GB for 1B queries with aggressive string interning. Without string interning, it would be 30+ GB due to duplicate substrings.
Sign in to join the discussion.
Ready to design your own Search Autocomplete?
Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.
Open Simulator