Design a search autocomplete system that returns top-N suggestions in under 100ms using an in-memory trie with pre-computed top-K results per node, backed by a streaming pipeline for trending query detection.
Search autocomplete is one of the most commonly asked system design interview questions at companies like Google, Amazon, and Microsoft because it tests a candidate's ability to combine data structures, caching, and real-time streaming into a cohesive low-latency system. The core challenge seems straightforward — return relevant query suggestions as the user types — but the constraints at production scale transform it into a deeply nuanced problem that touches on prefix indexing, ranking algorithms, freshness guarantees, and sub-millisecond lookup requirements.
At Google's scale, the autocomplete system handles hundreds of thousands of requests per second, with each keystroke generating a new query. Users expect suggestions to appear instantly — any latency above 100ms feels sluggish and degrades the search experience. The system must serve suggestions from a corpus of over one billion unique queries, ranked by a combination of global popularity, personal search history, and real-time trending signals. When a major news event breaks, trending queries must surface in suggestions within minutes, not hours.
The problem becomes especially interesting when you consider the data structure trade-offs. A naive approach of scanning all stored queries for prefix matches is O(N) per request and completely infeasible at scale. A trie (prefix tree) reduces lookup to O(prefix_length), but storing one billion queries in a trie requires careful memory optimization. Pre-computing top-K suggestions at each trie node eliminates the need for real-time sorting but increases memory consumption and rebuild time. Candidates must reason about these trade-offs and design a system that balances lookup speed, memory footprint, and data freshness.
This template models the complete autocomplete architecture: an API gateway for rate limiting, a load balancer distributing across stateless service pods, an in-memory trie with pre-computed top-K suggestions, a Redis cache for trending boosts, a DynamoDB store for aggregated query counts, and a Kafka-based streaming pipeline that processes search logs and periodically rebuilds the trie. The simulation reveals how trie size affects pod memory requirements, how the streaming pipeline impacts suggestion freshness, and how the two-layer approach of static trie plus dynamic trending boosts balances stability with real-time responsiveness.
The autocomplete architecture follows a two-layer design: a static in-memory trie for fast prefix lookups and a dynamic trending layer in Redis for real-time freshness. The data flow begins when a user types a character in the search box. The TypeaheadClient sends a GET request through the API Gateway, which handles authentication and per-user rate limiting to prevent scraping. The MainLB distributes requests via round-robin across 20 AutocompleteService pods, each of which holds a complete copy of the trie in memory.
The AutocompleteService performs the core lookup by traversing the in-memory trie from the root to the node matching the user's prefix. Each trie node stores the pre-computed top-10 suggestions for its prefix, so the lookup completes in O(prefix_length) time — typically under one millisecond with no sorting required at query time. For prefixes that match active trends, the service also queries the TrieCache (Redis) for trending boost scores and merges them with the static trie rankings, re-ranking the combined candidates before returning the final top-N results.
The freshness pipeline runs in parallel. Every completed search query is logged to SearchLogStream (Amazon MSK / Kafka). TrieRebuildWorker pods consume these logs, aggregate per-prefix query counts across sliding time windows (one-minute, one-hour, one-day), and write updated counts to QueryDB (DynamoDB). Every five minutes, the worker triggers a full trie rebuild: it reads all query counts from DynamoDB, constructs a new trie with pre-computed top-K at each node, and pushes trending boost scores to TrieCache. AutocompleteService pods periodically reload the rebuilt trie via an atomic pointer swap — the old trie is replaced by the new one in a single operation, ensuring zero downtime during refresh.
This architecture deliberately separates the hot read path (in-memory trie, zero network dependency) from the write path (Kafka, DynamoDB, periodic rebuild). The trie itself never changes during a serving window, which eliminates the need for concurrent read-write locking. The trending boost layer in Redis provides near-real-time responsiveness for breaking queries between full trie rebuilds, giving the system the best of both worlds: stable, fast prefix lookups with timely surfacing of trending content.
Choice
Full in-memory trie with pre-computed top-K per node on each service pod
Rationale
At 200K+ RPS, a network round-trip to an external index (Redis, Elasticsearch) for every keystroke adds 1-2ms of latency and creates a hard dependency on index availability. The in-memory trie provides sub-millisecond lookups with zero network dependency on the hot path. Each pod holds the full trie (~10GB for one billion queries), which is feasible with 16GB RAM pods and avoids the routing complexity of sharded indexes.
Choice
Periodic full rebuild every 5 minutes with atomic pointer swap
Rationale
Updating a trie in place under concurrent reads requires complex locking or lock-free data structures that are difficult to implement correctly. Periodic rebuilds with an atomic pointer swap are dramatically simpler: the new trie is built in a background thread, and a single reference assignment replaces the old trie. The five-minute window is short enough that suggestions stay fresh, and the trending boost layer in Redis bridges the gap for breaking queries.
Choice
Kafka-based log ingestion with sliding-window aggregation in TrieRebuildWorker
Rationale
At 200K+ completed searches per second, writing directly to DynamoDB would overwhelm the database. Kafka buffers the log stream and enables batch aggregation by the TrieRebuildWorker, which processes counts in sliding windows (one-minute, one-hour, one-day). Kafka also provides replay capability for reprocessing if the aggregation logic needs correction, and its partitioned architecture supports horizontal scaling of consumer workers.
Choice
Full trie on every pod with round-robin load balancing
Rationale
Sharding the trie by prefix first-character requires consistent-hash routing at the load balancer and creates hot shards — the letter 's' receives disproportionately more traffic than 'x' or 'z'. Full replication with round-robin eliminates hot-shard problems and simplifies the load balancing layer. The 10GB-per-pod memory cost is acceptable given the operational simplicity and the elimination of cross-shard query routing.
Target RPS
200,000 autocomplete queries/s (peak)
Latency (p99)
<100ms p99 (trie lookup <1ms + network overhead)
Storage
~10 GB trie per pod; ~500 GB in QueryDB (1B queries)
Availability
99.99%
This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.
A trie is purpose-built for prefix matching, which is the fundamental operation in autocomplete. An inverted index (like Elasticsearch) is optimized for full-text search with tokenization and TF-IDF scoring — it can handle prefix queries but with significantly higher overhead per lookup. A hash map provides O(1) exact-key lookup but does not support prefix matching at all without iterating over all keys. The trie provides O(prefix_length) lookups with natural prefix traversal, and pre-computing top-K at each node eliminates the need for any sorting at query time, keeping latency well under one millisecond.
Trending queries are handled through a two-layer mechanism. The first layer is the Kafka-based streaming pipeline: completed search queries flow through Kafka to the TrieRebuildWorker, which aggregates counts in sliding time windows and pushes trending boost scores to Redis. The second layer is the merge at query time: the AutocompleteService combines static trie rankings with trending boosts from Redis, re-ranking the top-N results. This means a trending query can appear in suggestions within minutes of gaining popularity, even before the next full trie rebuild occurs.
A trie storing one billion unique queries with pre-computed top-10 suggestions at each node requires approximately 10GB of RAM. The memory is dominated by the node structure overhead and the top-K suggestion lists rather than the character pointers themselves. Compressed trie variants (Patricia trie, radix tree) can reduce memory by collapsing single-child chains into single nodes, but the savings are modest relative to the top-K storage. With 16GB pods, 10GB for the trie leaves adequate headroom for GC overhead and request processing.
The atomic pointer swap design makes this safe. If a rebuild fails, the current trie continues serving without interruption — it simply contains slightly older data. The TrieRebuildWorker retries the build on its next cycle. If a rebuild takes longer than the five-minute interval, the system gracefully skips a cycle and serves from the existing trie. The trending boost layer in Redis provides freshness for breaking queries during the extended window. Monitoring alerts fire if multiple consecutive rebuilds fail, indicating a systemic issue requiring investigation.
Personalized autocomplete requires a per-user feature lookup at query time. The simplest approach is to maintain a small per-user recent query cache in Redis, keyed by user ID, storing the last 50-100 queries. At query time, the AutocompleteService merges global trie results with the user's personal query history, boosting suggestions that match their past searches. This adds approximately 2-3ms for the Redis lookup but significantly improves relevance. More sophisticated approaches use ML models that incorporate user demographics, location, and browsing context, though these add inference latency and model-serving infrastructure.
Sign in to join the discussion.
Ready to design your own Search Autocomplete?
Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.
Open Simulator