Hard10 componentsInterview: Very High

Search Autocomplete — Streaming Trending + ML Ranking

Q: How does the ML scorer handle cold-start users with no search history?

For users with no PersonalizationCache entry (new users or anonymous), the ML scorer receives an empty history feature vector. The model is trained with this case explicitly — approximately 30% of training data has empty history. Without history, the model falls back to popularity, trending, geo-relevance, and time-of-day features. This produces the same quality as V1 (global popularity ranking) for cold-start users, with personalization improving as the user accumulates search history.

Q: What happens if the ML scorer fails completely?

SuggestionService has a circuit breaker on the MLScorer gRPC call. After 5 consecutive failures or if p99 latency exceeds 15ms, the circuit opens and SuggestionService returns static trie results (no ML re-ranking) for 30 seconds before retrying. This fallback path is the same quality as V1 — personalization degrades but autocomplete continues working. The circuit breaker prevents cascading failures from MLScorer to the hot path.

Q: How is the ML model trained and updated?

The GBDT model is trained offline on click-through data: for each autocomplete session, the training data records which suggestion (if any) the user clicked. Positive labels are clicked suggestions; negative labels are displayed but not clicked. Features are computed from the trie (popularity, trending score), user history (cosine similarity), and context (location, time, device). The model is retrained weekly on the latest 30 days of click data and deployed as a binary artifact. A/B testing validates that the new model improves click-through rate before full rollout.

Q: How does TrendAggregator avoid false positives from bot traffic?

Three layers of defense: (1) ApiGateway rate-limits per-IP and per-user — bots sending more than 100 requests/second are blocked. (2) TrendAggregator applies a minimum absolute threshold — a prefix must receive at least 500 queries in 10 seconds to be considered trending, regardless of the ratio to baseline. This filters out low-volume statistical noise. (3) A blocklist of known spam prefixes (adult content, phishing terms) is checked before boosting. Despite these defenses, sophisticated bot attacks can still trigger false trends — this is an ongoing adversarial game at all major search platforms.

Q: Why top-100 candidates from the trie instead of top-1000?

Diminishing returns. The GBDT model's ability to surface relevant results decreases beyond position 100 — the popularity gap between position 100 and position 1000 is so large that even a heavily personalized signal rarely promotes a result from position 1000 to the top 10. Top-100 provides sufficient diversity for personalization while keeping the Redis payload manageable (~5 KB per prefix lookup versus ~50 KB for top-1000). At 500K lookups/sec, this difference is 2.5 GB/sec versus 25 GB/sec of Redis network traffic.

Full production-grade search autocomplete with sharded Redis trie for sub-millisecond lookups, real-time trending detection via Kafka stream processing, ML-based re-ranking using user context and history, and per-user personalization. The architecture used by Google, Amazon, and LinkedIn at scale.

RedisKafkaMLPersonalizationReal-timeSearch Autocomplete

Try in Simulator

Problem Statement

The streaming trending + ML ranking approach represents the state of the art for search autocomplete systems at companies like Google, Amazon, LinkedIn, and Twitter. It builds on the trie-based approach (V1) by adding three critical capabilities: real-time trending detection, ML-based personalized re-ranking, and per-user search history personalization.

The first limitation of the V1 approach is trending latency. The trie is rebuilt every 5 minutes, so a brand-new query — breaking news, a viral meme, a product launch — takes up to 5 minutes to appear in suggestions. The Redis trending boost layer in V1 reduces this to 1-2 minutes, but for truly real-time trending (surface within seconds), a streaming architecture is required. TrendAggregator consumes search logs from Kafka in real-time, maintains sliding-window counters at 10-second, 1-minute, and 5-minute granularities, and pushes trending boosts to TrieCache (Redis) as soon as a trend is detected. Detection threshold: a prefix's 10-second count exceeding 2x its 1-hour baseline. This surfaces breaking queries within 10-30 seconds of the trend starting.

The second limitation is personalization. The V1 trie returns the same suggestions for everyone — global popularity ranking only. A developer typing 'react' should see 'react hooks tutorial' while a chemistry student should see 'reaction kinetics formula.' The MLScorer addresses this by re-ranking the top-100 trie candidates using a gradient-boosted decision tree (GBDT) model with features: (1) global popularity score from trie, (2) trending boost score from Redis, (3) user search history similarity — cosine similarity between candidate and user's last 20 searches, (4) geographic relevance — does the candidate have local intent matching the user's location, (5) time-of-day signal — some queries are diurnal (e.g., 'breakfast restaurants' peaks in the morning), (6) device type — mobile users prefer shorter suggestions. The GBDT model inference takes ~2ms for 100 candidates — fast enough for inline scoring.

The third limitation is per-user context. The ML scorer needs user features to personalize. PersonalizationCache (Redis) stores each user's last 20 completed searches with a 30-day TTL. When a user types a prefix, SuggestionService fetches their history in parallel with the trie lookup — both are sub-millisecond Redis operations. The history is passed to the MLScorer as a feature vector. For anonymous users (no user_id), the ML scorer falls back to global ranking (equivalent to V1 behavior).

The trie itself is stored in a sharded Redis cluster rather than in-memory per pod. At 1B queries with top-100 per node, the serialized trie is approximately 40-60 GB — too large for a single pod's memory. Sharding across 8 Redis nodes (each holding ~8 GB of trie data) distributes the memory requirement. The trade-off is a network hop per lookup (~1ms) versus zero-network in-memory traversal (<0.1ms in V1). At 500K RPS, the 1ms Redis cost is acceptable and buys unlimited horizontal scaling — adding Redis nodes extends the trie capacity without changing the service code.

The architecture uses 10 components: Client, ApiGateway, SuggestionLB, SuggestionService (30 pods), TrieCache (Redis, 8 shards), PersonalizationCache (Redis, 4 nodes), MLScorer (20 workers), SearchLogStream (Kafka, 128 partitions), TrendAggregator (30 workers), TrieRebuilder (5 workers), and QueryDB (DynamoDB). The total infrastructure cost is approximately $8,000/month — 3x the V1 approach — justified by personalized suggestions that improve search click-through rate by 15-25% (a direct revenue driver for search-dependent businesses).

Interviewers asking about the advanced variant expect candidates to discuss real-time stream processing for trend detection, ML feature engineering for search ranking, the trade-off between personalization quality and latency, and the operational complexity of a 10-component system with multiple failure modes.

Architecture Overview

The streaming trending + ML ranking system uses ten main components organized into four layers: the request path (Client, ApiGateway, SuggestionLB, SuggestionService), the data stores (TrieCache/Redis sharded, PersonalizationCache/Redis, QueryDB/DynamoDB), the ML layer (MLScorer), and the async pipeline (SearchLogStream/Kafka, TrendAggregator, TrieRebuilder).

The request path handles 500K autocomplete requests per second at peak. Client sends GET /api/v2/autocomplete?q=prefix&limit=10&user_id=abc. ApiGateway authenticates, rate-limits at 600K RPS (headroom for traffic spikes), and routes to SuggestionLB. SuggestionLB uses least-connections algorithm — this accounts for variable ML scorer latency. SuggestionService (30 pods x 300 threads) orchestrates the three-phase lookup.

Phase 1 (trie lookup): SuggestionService performs HGET on the sharded TrieCache to retrieve the top-100 candidates for the given prefix. The trie is sharded across 8 Redis nodes by consistent hash on the prefix. Sub-millisecond lookup. This phase also fetches any trending boost from TrieCache (separate key pattern 'trend:{prefix}'). Phase 2 (personalization): In parallel with Phase 1, SuggestionService fetches the user's search history from PersonalizationCache via LRANGE user:{user_id}:history 0 19. This returns the last 20 searches. Cache miss (anonymous user) is a no-op. Phase 3 (ML re-ranking): SuggestionService sends the 100 candidates plus user context (history, location, device, time) to MLScorer via gRPC. MLScorer runs the GBDT model (~2ms inference), scores each candidate, and returns the re-ranked top-10. Total Phase 3 latency: 5-10ms including gRPC overhead.

The fallback path is critical for reliability. If MLScorer latency exceeds 15ms or the scorer is unhealthy (circuit breaker open), SuggestionService returns the static trie top-10 without ML re-ranking. This degrades quality (no personalization) but preserves latency (< 5ms total). If TrieCache is down (catastrophic Redis failure), the system returns empty suggestions — there is no in-memory fallback because the trie is too large for individual pods.

The async pipeline has two consumers: TrendAggregator and TrieRebuilder, both consuming from SearchLogStream (Kafka, 128 partitions). TrendAggregator runs 30 workers, each handling ~4 partitions. It maintains in-memory sliding-window counters and detects trends by comparing short-window counts against long-window baselines. When a trend is detected, it writes a boost to TrieCache — visible to the next autocomplete request. TrendAggregator also batch-writes updated counts to QueryDB every 30 seconds for long-term persistence. TrieRebuilder runs 5 workers on a 10-minute cron. It reads all counts from DynamoDB, builds the trie with top-100 per node, and bulk-writes to TrieCache using shadow-key-swap for zero-downtime updates.

PersonalizationCache handles write-behind updates: when a user completes a search (POST /api/v2/search/log), SuggestionService both publishes to Kafka (for trending) and writes to PersonalizationCache (LPUSH + LTRIM to maintain last 20). The 30-day TTL auto-expires inactive users, keeping storage bounded at approximately 20 GB for 100M users.

Architecture Preview

Loading architecture preview...

Open in Simulator

Request Flow — Three-Phase Lookup with ML Re-Ranking

This sequence diagram traces the three-phase autocomplete flow: trie lookup, personalization fetch, and ML re-ranking. Phases 1 and 2 execute in parallel (both are sub-millisecond Redis operations), and Phase 3 (ML scoring) processes the combined results. The fallback path is also shown — if the ML scorer is slow or unavailable, the system returns static trie results.

The async pipeline shows the real-time trending detection flow: search logs published to Kafka, consumed by TrendAggregator with sliding-window counters, and trending boosts pushed to TrieCache within seconds of detection.

Loading diagram...

Step-by-Step Walkthrough

1Client sends autocomplete request with prefix 'sys' and user_id. ApiGateway authenticates and rate-limits
2SuggestionService executes Phases 1 and 2 in parallel: HGET on sharded TrieCache for top-100 candidates + LRANGE on PersonalizationCache for user history. Both are sub-millisecond Redis operations
3SuggestionService also fetches trending boost from TrieCache (separate key pattern). If trending data exists for 'sys', it is included as a feature for ML scoring
4Phase 3: SuggestionService sends 100 candidates + user context (history, location, device, time) to MLScorer via gRPC. The GBDT model scores each candidate on 6 features and returns the re-ranked top-10. ~2ms inference + ~3ms network = 5ms total
5On search completion, the query is published to Kafka (for trending) and appended to PersonalizationCache (for future personalization). TrendAggregator consumes the Kafka event and updates sliding-window counters. If a trend is detected, a boost is written to TrieCache within seconds

Pseudocode

// THREE-PHASE AUTOCOMPLETE LOOKUP
async function autocomplete(prefix: string, userId?: string, location?: LatLng):
    // Phase 1 + Phase 2: parallel Redis lookups (<1ms each)
    [trieResult, trendingBoost, userHistory] = await Promise.all([
        trieCache.hget(`trie:${shard(prefix)}:${prefix}`),           // Top-100 candidates
        trieCache.get(`trend:${prefix}`),                             // Trending boost
        userId ? personalizationCache.lrange(`user:${userId}:history`, 0, 19) : []
    ])

    if (!trieResult) return []  // No suggestions for this prefix

    candidates = deserialize(trieResult)  // 100 candidates with scores

    // Phase 3: ML re-ranking (5-10ms)
    try {
        features = {
            candidates,
            trending: trendingBoost,
            history: userHistory,
            location,
            timeOfDay: new Date().getHours(),
            deviceType: detectDevice()
        }
        ranked = await mlScorer.score(features)  // gRPC call, 2ms inference
        return ranked.slice(0, 10)
    } catch (err) {
        // Fallback: circuit breaker open or ML timeout
        // Return static trie results (V1 behavior)
        return candidates.slice(0, 10).map(c => c.query)
    }

// TREND DETECTION — TrendAggregator (real-time Kafka consumer)
function onSearchLogged(event: SearchEvent):
    prefix = event.query.substring(0, 3)  // Track 3-char prefixes
    counters[prefix].window_10s.increment()
    counters[prefix].window_1min.increment()
    counters[prefix].window_5min.increment()

    // Check trending threshold every 10 seconds
    if counters[prefix].window_10s.count > 2 * counters[prefix].baseline_per_10s:
        trieCache.set(`trend:${prefix}`, {
            boost_score: counters[prefix].window_10s.count / counters[prefix].baseline_per_10s,
            trending_queries: getTopQueriesForPrefix(prefix),
            detected_at: new Date().toISOString()
        }, { ttl: 600 })  // Auto-expire in 10 minutes

Key Design Decisions

Sharded Redis Trie vs In-Memory Trie per Pod

Choice

Trie stored in a sharded 8-node Redis cluster, not in pod memory

Rationale

The V1 approach stores the full trie (10 GB for top-10 per node) in each pod's memory. The V2 approach stores top-100 per node, inflating the trie to 40-60 GB — too large for individual pods. Sharding across 8 Redis nodes distributes this storage. The cost is a 1ms network hop per lookup (vs <0.1ms in-memory), but this enables the ML scorer to receive 100 candidates instead of 10, dramatically improving re-ranking quality. The Redis cluster also provides persistence and replication, eliminating the pod-restart cold-start problem from V1.

Gradient-Boosted Decision Tree for Re-Ranking

Choice

GBDT model with 6 features, 2ms inference, inline scoring

Rationale

Deep learning models (transformers, BERT) provide superior relevance but require 50-100ms inference — too slow for inline autocomplete scoring. GBDT models inference in approximately 2ms for 100 candidates, fitting within the 100ms p99 SLO. The 6 features (popularity, trending, history similarity, geo-relevance, time-of-day, device type) capture the most impactful personalization signals. The model is trained offline on click-through data (did the user select the suggestion?) and deployed as a lightweight binary — no GPU required.

Real-Time Trend Detection via Sliding Windows

Choice

10-second, 1-minute, 5-minute sliding windows with 2x baseline threshold

Rationale

The V1 approach detects trends on a 5-minute batch cycle. For breaking events (earthquake, election result, product launch), users start searching immediately and expect relevant suggestions. TrendAggregator's 10-second sliding window detects surges within 10-30 seconds. The 2x threshold (10-second count > 2x the per-10-second average from the 1-hour window) balances sensitivity and false positive rate. Lower thresholds (1.5x) generate too many false trends from normal traffic variance.

Separate Personalization Cache

Choice

Dedicated Redis cluster for user history, separate from trie cache

Rationale

Trie data is query-centric (keyed by prefix, read on every request), while personalization data is user-centric (keyed by user_id, read only for logged-in users). Mixing them creates hot-key imbalance: prefix 'how' might get 1M reads/sec while individual user keys get 1 read/session. Separate clusters allow independent scaling, eviction policies (trie uses LRU, personalization uses LFU), and failure domains — TrieCache failure breaks autocomplete, but PersonalizationCache failure only degrades personalization (fallback to global ranking).

Shadow-Key-Swap for Zero-Downtime Trie Rebuild

Choice

Write new trie to shadow keys, RENAME atomically

Rationale

TrieRebuilder writes new trie data to keys prefixed with 'trie_new:{shard}:{prefix}' while the live trie uses 'trie:{shard}:{prefix}'. When the rebuild completes, all shadow keys are RENAMEd to live keys atomically (Redis RENAME is O(1)). This ensures SuggestionService never reads a partially-rebuilt trie. The trade-off is temporary double memory usage during the swap window (~10 seconds), requiring the Redis cluster to have 2x headroom.

gRPC for ML Scorer Communication

Choice

gRPC with Protobuf instead of REST with JSON

Rationale

MLScorer is called on every autocomplete request (400K/sec). gRPC with Protobuf reduces serialization overhead by approximately 60% compared to JSON. HTTP/2 connection multiplexing enables high throughput over fewer connections — 30 SuggestionService pods each open 4 gRPC channels to 20 MLScorer workers, totaling 120 multiplexed streams instead of 120K HTTP/1.1 connections. The 3ms round-trip includes serialization, network, and deserialization — JSON REST would add ~3ms from serialization alone.

Scale & Performance

Target RPS

500K peak (30 pods x ~17K RPS each)

Latency (p99)

<1ms trie + <1ms personalization + 5-10ms ML scoring = 15-25ms total

Storage

~60 GB trie (Redis), ~20 GB personalization (Redis), ~100 GB counts (DynamoDB)

Availability

99.95% (multi-AZ, ML fallback path, degraded-mode serving)

Time & Space Complexity

Operation	Time	Space	Notes
Trie lookup (Redis HGET)	O(1) — Redis hash table lookup by prefix key	O(K) — K=100 candidates returned	Sub-millisecond. The Redis shard is selected by consistent hash on the prefix. HGET is O(1) amortized. Payload: ~5 KB for 100 suggestions.
ML scoring (GBDT inference)	O(K * D) — K candidates, D tree depth (typically 8-12)	O(K) — feature vectors for K candidates	K=100, D=10 average tree depth, 200 trees in ensemble. Total: 100 * 10 * 200 = 200K comparisons. GBDT is optimized for batch inference — 100 candidates scored in ~2ms on a single CPU core.
Trend detection (TrendAggregator)	O(1) per message — increment counter in hash map	O(Q) — Q unique query prefixes in sliding window	Each search log message increments 3 counters (10s, 1min, 5min window). Q is approximately 10M unique prefixes. Hash map: ~500 MB per worker. 30 workers total, each handling ~4 Kafka partitions.
Trie rebuild (TrieRebuilder)	O(N * L) — N queries, L average prefix length	O(N * L) — full trie in memory during build	1B queries, L=5 average. Build takes approximately 5 minutes on 5 parallel workers. Shadow-key-swap to Redis adds approximately 2 minutes for bulk write pipeline. Total rebuild cycle: ~7 minutes within the 10-minute schedule.

Database Schema (HLD)

trie_prefix (Redis — sharded)

Sharded Redis hash storing top-100 suggestions per prefix. Each entry is a serialized list of (query_string, popularity_score) pairs. 8 Redis nodes, each holding ~8 GB of trie data. Written by TrieRebuilder every 10 minutes via shadow-key-swap. Read by SuggestionService on every autocomplete request.

key: trie:{shard_id}:{prefix} (Redis key)value: [{query: string, score: float}] x 100 (serialized array)

Indexes: Redis hash table — O(1) key lookup by prefix, Consistent hash ring for shard selection by prefix

At 1B queries with average prefix depth of 5, there are approximately 500M unique prefix entries. Distributed across 8 shards: ~62M entries per shard. Each entry averages ~100 bytes (100 suggestions compressed with string interning), totaling ~6-8 GB per shard.

user_history (Redis — PersonalizationCache)

Redis list storing each user's last 20 completed searches. Written on every search completion (LPUSH + LTRIM). Read on every autocomplete request for logged-in users. 30-day TTL auto-expires inactive users.

key: user:{user_id}:history (Redis key)value: list of last 20 search query strings (LPUSH/LTRIM)

Indexes: Redis hash table — O(1) key lookup by user_id

100M users x ~200 bytes per user (20 queries x 10 bytes avg) = ~20 GB total. LFU eviction policy ensures frequently active users stay cached while dormant users are evicted. 85% cache hit rate for logged-in users.

trending_boosts (Redis — TrieCache)

Redis hash entries storing real-time trending boost scores per prefix. Written by TrendAggregator within seconds of trend detection. Read by SuggestionService and passed to MLScorer as a feature. TTL 600s — trends auto-expire if not sustained.

key: trend:{prefix} (Redis key)boost_score FLOAT (trending multiplier, 1.0-10.0)trending_queries ARRAY (specific trending query strings)detected_at STRING (ISO timestamp of detection)

Indexes: Redis hash table — O(1) key lookup

Approximately 10K-100K active trending entries at any time. Memory footprint: ~100 MB. Updated in real-time by TrendAggregator — a new trend is visible to the next autocomplete request within seconds.

query_counts (DynamoDB)

Persistent store for aggregated query counts across multiple time windows. Written by TrendAggregator in 30-second batches. Read by TrieRebuilder via full scan every 10 minutes. On-Demand capacity absorbs rebuild spikes.

prefix STRING (partition key — first 2 chars)query STRING (sort key — full query string)count_10s INTEGER (last 10-second window)count_1min INTEGERcount_1hr INTEGERcount_total INTEGER (all-time)language STRING (detected query language)

Indexes: Primary key: (prefix, query)

1B items x ~120 bytes = ~120 GB. The 2-character prefix partition key distributes data across ~1,296 DynamoDB partitions. On-Demand capacity auto-scales for periodic full-scan reads during trie rebuild.

Event Contracts

search_logged_v2search-logged

Emitted by SuggestionService on every completed search. Two consumer groups: TrendAggregator (real-time trend detection) and TrieRebuilder (periodic count aggregation). Partitioned by query_prefix for processing locality. 128 partitions for 500K peak throughput.

Key Schema

query_prefix (string — first 2 characters of query)

Value Schema

{ query: string, user_id?: string, location?: string, timestamp: string }

What-If Scenarios

Major breaking news event — 100K users search the same query within 30 seconds

Impact

TrendAggregator detects the surge within 10 seconds (10-second sliding window count exceeds 2x baseline). Trending boost is written to TrieCache within 15 seconds. The next autocomplete request for that prefix returns the trending query in the top-10. Users see relevant suggestions within 15-30 seconds of the event — compared to 5+ minutes in V1.

Mitigation

Already handled by design. The 10-second detection window is tuned for exactly this scenario. If the event is so large that Kafka consumer lag spikes, TrendAggregator workers auto-scale based on lag metric.

ML scorer fleet goes down completely (all 20 workers crash)

Impact

Circuit breaker opens on SuggestionService within 5 seconds (after 5 consecutive gRPC failures). All autocomplete requests fall back to static trie results from Redis (no ML re-ranking, no personalization). Quality degrades to V1 level — global popularity ranking only. Latency actually improves (skip the 5-10ms ML call). Users do not see errors — they see slightly less relevant suggestions.

Mitigation

Auto-scaling replaces crashed workers within 2-3 minutes (ECS Fargate). Circuit breaker retries every 30 seconds. On recovery, ML scoring resumes automatically. Monitor ML scorer fleet health with >1 worker healthy per AZ.

PersonalizationCache Redis cluster loses a node (1 of 4)

Impact

25% of user history lookups fail — those users receive non-personalized suggestions (global ranking). The ML scorer still runs but with an empty history feature, producing V1-equivalent quality for affected users. No errors visible to users — just slightly less relevant suggestions.

Mitigation

Redis Cluster automatic failover promotes a replica within 15-30 seconds. During failover, SuggestionService catches the Redis exception and passes an empty history to MLScorer (graceful degradation). 30-day TTL means no data loss for the affected shard — history rebuilds naturally as users continue searching.

Coordinated bot attack sends 50K queries/sec for a spam term to trigger false trending

Impact

TrendAggregator detects the spike and writes a trending boost for the spam term. Autocomplete users see the spam term in their suggestions — a reputation and trust violation.

Mitigation

Three layers: (1) ApiGateway rate-limits per-IP at 100 req/sec — a single bot IP is throttled immediately. (2) TrendAggregator minimum threshold: a prefix must receive queries from at least 1,000 unique user_ids in 10 seconds to qualify as trending (not just raw query volume). (3) Spam blocklist: known spam patterns are checked before boosting. Post-incident: add the spam term to the blocklist and flush the trending boost from TrieCache.

Failure Modes & Resilience

Component	Failure	Impact	Mitigation
TrieCache (Redis — 8-node sharded cluster)	Single shard node failure	Prefixes hashed to the failed shard return empty results — approximately 12.5% of autocomplete requests return no suggestions. Other shards are unaffected. Users notice missing suggestions for specific prefixes (e.g., all prefixes starting with 'sys' if that shard fails).	Redis Cluster with 1 replica per shard. Automatic failover promotes replica within 15-30 seconds. During failover, SuggestionService retries with exponential backoff (50ms, 100ms, 200ms). After 3 retries, return empty suggestions for affected prefixes.
MLScorer (20 gRPC workers)	Latency spike from model serving issues (cold start, GC pause)	ML scoring p99 latency exceeds 15ms. Circuit breaker opens on SuggestionService. Autocomplete falls back to static trie results. Quality degrades but latency SLO is preserved. The fallback is invisible to users — they see slightly less personalized suggestions.	Pre-warm MLScorer workers (load model on startup, run inference on test data before accepting traffic). Use connection pooling on gRPC channels. Auto-scale based on p99 latency: if > 10ms, add workers.
SearchLogStream (Kafka — 128 partitions)	Kafka cluster unavailable	Search logs are lost — TrendAggregator stops detecting new trends, and TrieRebuilder stops receiving new data. Existing trie and trending boosts continue serving (stale but correct). Personalization continues working (writes go directly to PersonalizationCache, not through Kafka). Autocomplete hot path is unaffected.	Kafka multi-AZ deployment with replication factor 3. If Kafka is completely unavailable, SuggestionService buffers logs locally (10K messages per pod) and replays on recovery. TrendAggregator switches to degraded mode — serves from existing trending boosts without detecting new trends.
TrendAggregator (30 workers)	Consumer lag exceeds 5 minutes	Trending detection is delayed — new trends take 5+ minutes to surface instead of 10-30 seconds. The system degrades to V1-equivalent trending behavior. Autocomplete hot path is unaffected — existing trending boosts in Redis continue serving until TTL expiration (600s).	Auto-scale workers based on Kafka consumer lag metric. Add workers when lag > 60 seconds. Increase Kafka partition count from 128 to 256 to enable more parallel consumers. Alert at lag > 2 minutes, critical at lag > 5 minutes.

Monitoring & Alerting

Key metrics: (1) End-to-end autocomplete latency p50/p99 — should be <25ms. Alert at p99 > 60ms, critical at p99 > 100ms (SLO breach). (2) ML scorer latency p50/p99 — should be <5ms. Alert at p99 > 10ms, circuit breaker opens at p99 > 15ms. (3) Trend detection latency — time from query surge to trending boost written to Redis. Should be <30 seconds. Alert at >60 seconds. (4) Personalization cache hit rate — expected 85% for logged-in users. Drop below 70% indicates cache eviction issues. (5) TrieCache shard health — all 8 shards healthy. Alert on any shard failure. (6) Kafka consumer lag — TrendAggregator should be <30 seconds. Alert at 2 minutes. TrieRebuilder can tolerate higher lag (batch consumer). (7) ML fallback rate — percentage of requests served without ML re-ranking (circuit breaker open). Should be <1%. Alert at >5%. Dashboard: Grafana with panels for end-to-end latency histogram, ML scorer latency, trending detection timeline, Kafka consumer lag, Redis shard health, and personalization cache hit rate.

Cost Analysis

At 500K RPS peak: SuggestionService 30 pods (4 vCPU, 8 GB) ~$2,200/month, TrieCache Redis (8 nodes, 64 GB total) ~$2,400/month, PersonalizationCache Redis (4 nodes, 26 GB total) ~$800/month, MLScorer 20 workers (4 vCPU, 8 GB) ~$1,500/month, QueryDB DynamoDB On-Demand ~$600/month, Kafka MSK (6 brokers, 128 partitions) ~$1,200/month, ApiGateway ~$300/month, TrendAggregator 30 workers ~$600/month, TrieRebuilder 5 workers ~$100/month. Total: ~$9,700/month. The dominant costs are the Redis clusters (TrieCache + PersonalizationCache = $3,200/month) and the SuggestionService pods ($2,200/month). The ML investment ($1,500/month for MLScorer) is justified by a 15-25% improvement in search click-through rate, which translates to millions in revenue for search-dependent businesses.

Security Considerations

User privacy: PersonalizationCache stores per-user search history — GDPR requires right to erasure (DELETE user:{user_id}:history on request), data export (LRANGE full history), and purpose limitation (history used only for autocomplete personalization, not sold to advertisers). Search history must be encrypted at rest (Redis encryption) and in transit (TLS). ML model privacy: the GBDT model is trained on aggregated click-through data, not individual user sessions — no user-level data is embedded in the model weights. Trending manipulation: rate limiting per-IP and per-user prevents bot-driven trending attacks. Blocklist of offensive/spam terms prevents harmful content from being boosted into suggestions. Autocomplete suggestions must never expose PII from other users' searches — all suggestions come from anonymized, aggregated query popularity data.

Solution Comparison

Variant	Tier	Latency	Throughput	Cost	Complexity	Reliability
V0: Naive (Database LIKE Query)	T1	8-15ms query, 50-200ms+ under load	~10K RPS	$730/month	Low	99% (single DB)
V1: Trie + Top-K (In-Memory Trie)	T2	<1ms trie + 2ms trending merge	200K RPS	$2,500/month	Medium	99.9% (multi-AZ)
V2: Streaming Trending + ML Ranking	T3	<1ms trie + 5-10ms ML scoring	500K RPS	$8,000/month	Very High	99.95% (multi-AZ, ML fallback)

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions

How does the ML scorer handle cold-start users with no search history?

For users with no PersonalizationCache entry (new users or anonymous), the ML scorer receives an empty history feature vector. The model is trained with this case explicitly — approximately 30% of training data has empty history. Without history, the model falls back to popularity, trending, geo-relevance, and time-of-day features. This produces the same quality as V1 (global popularity ranking) for cold-start users, with personalization improving as the user accumulates search history.

What happens if the ML scorer fails completely?

SuggestionService has a circuit breaker on the MLScorer gRPC call. After 5 consecutive failures or if p99 latency exceeds 15ms, the circuit opens and SuggestionService returns static trie results (no ML re-ranking) for 30 seconds before retrying. This fallback path is the same quality as V1 — personalization degrades but autocomplete continues working. The circuit breaker prevents cascading failures from MLScorer to the hot path.

How is the ML model trained and updated?

The GBDT model is trained offline on click-through data: for each autocomplete session, the training data records which suggestion (if any) the user clicked. Positive labels are clicked suggestions; negative labels are displayed but not clicked. Features are computed from the trie (popularity, trending score), user history (cosine similarity), and context (location, time, device). The model is retrained weekly on the latest 30 days of click data and deployed as a binary artifact. A/B testing validates that the new model improves click-through rate before full rollout.

How does TrendAggregator avoid false positives from bot traffic?

Three layers of defense: (1) ApiGateway rate-limits per-IP and per-user — bots sending more than 100 requests/second are blocked. (2) TrendAggregator applies a minimum absolute threshold — a prefix must receive at least 500 queries in 10 seconds to be considered trending, regardless of the ratio to baseline. This filters out low-volume statistical noise. (3) A blocklist of known spam prefixes (adult content, phishing terms) is checked before boosting. Despite these defenses, sophisticated bot attacks can still trigger false trends — this is an ongoing adversarial game at all major search platforms.

Why top-100 candidates from the trie instead of top-1000?

Diminishing returns. The GBDT model's ability to surface relevant results decreases beyond position 100 — the popularity gap between position 100 and position 1000 is so large that even a heavily personalized signal rarely promotes a result from position 1000 to the top 10. Top-100 provides sufficient diversity for personalization while keeping the Redis payload manageable (~5 KB per prefix lookup versus ~50 KB for top-1000). At 500K lookups/sec, this difference is 2.5 GB/sec versus 25 GB/sec of Redis network traffic.

Related Templates

Search Autocomplete — Naive (Database LIKE Query)Search Autocomplete — Trie + Top-K (In-Memory Trie)Ad Click Aggregator — CQRS (Event Sourcing + Projections)

Discussion

Ready to design your own Search Autocomplete?

Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.

Open Simulator