Full production feed ranking architecture with two-tower neural retrieval for interest-based content discovery beyond the follow graph, online learning from real-time engagement signals, a 4-stage ranking funnel (candidate generation, light rank, heavy rank, diversity re-rank), and a dedicated feature store for sub-30s feature freshness. This is how TikTok and Instagram surface content from creators users have never followed.
The real-time online learning approach to feed ranking represents the state of the art used at TikTok, Instagram Explore, and YouTube Recommendations. It solves the two fundamental limitations of the V1 multi-stage ML pipeline: the inability to discover content beyond the follow graph and the inability to adapt the ranking model in real time to trending topics and shifting user interests.
The first key innovation is two-tower retrieval. The V1 approach only shows posts from users the viewer follows — the candidate set is bounded by the social graph. Two-tower retrieval breaks this constraint by computing a user interest embedding (from engagement history: topics clicked, authors engaged with, content types preferred) and a post embedding (from content features: text, image, engagement signals), then performing approximate nearest neighbor (ANN) search to find posts that match the user's interests regardless of whether they follow the author. This is conceptually how TikTok's For You page works: the algorithm surfaces content from any creator whose posts match the viewer's interest profile, not just from followed accounts.
The RetrievalService maintains an in-memory HNSW index (via FAISS) of 100M+ post embeddings (128-dimensional vectors). On each feed request, it looks up the user's embedding from FeatureStore, runs ANN search to find the 8K nearest posts by cosine similarity, and returns the candidate set. Combined with 1000-2000 follow-graph candidates from CandidateCache, the total candidate pool is approximately 10K posts — 5-10x larger than V1's candidate set, enabling much richer content discovery.
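The retrieval step described above can be illustrated with a minimal sketch. It uses a brute-force cosine scan over toy 3-dimensional vectors rather than a FAISS HNSW index, so it shows what the search returns, not how the production index finds it. The function names, post IDs, and embedding values are hypothetical.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(user_embedding, post_embeddings, k):
    """Return the k post IDs most similar to the user embedding.

    Brute-force scan, O(N) per query. An ANN index such as HNSW would
    return approximately the same set in sub-linear time.
    """
    scored = ((cosine(user_embedding, vec), post_id)
              for post_id, vec in post_embeddings.items())
    return [post_id for _, post_id in heapq.nlargest(k, scored)]

# Toy 3-dimensional embeddings (the article uses 128 dimensions).
posts = {
    "cooking_video": [0.9, 0.1, 0.0],
    "football_clip": [0.0, 0.2, 0.9],
    "baking_photo": [0.8, 0.3, 0.1],
}
user = [1.0, 0.2, 0.0]  # hypothetical user who engages with food content
top = retrieve(user, posts, k=2)
print(top)
```

Note that the follow graph never appears in the lookup: a post is returned purely because its vector lies close to the user's vector.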
The second key innovation is online learning. The V1 approach trains its ML model offline (batch training on historical data, typically weekly). Online learning continuously updates model weights from the engagement stream — when a user clicks, likes, shares, or dwells on a post, the OnlineLearner performs a stochastic gradient descent step on the ranking model's final scoring layer. This enables the model to adapt to emerging trends within minutes: if a breaking news topic suddenly generates high engagement, the model learns to boost related content before the next batch training cycle.
The 4-stage funnel adds a dedicated ReRanker as the final stage. The HeavyRanker optimizes for engagement prediction (what the user wants to click), while the ReRanker enforces what the platform wants to show: content diversity (no more than 3 consecutive posts from the same author), content discovery (at least 20% from outside the follow graph), content policy (demote flagged posts), and business objectives (boost original content, insert sponsored posts at designated slots). Separating engagement optimization from diversity/business rules is critical — mixing them in a single model creates conflicting training objectives.
The FeatureStore (Redis) is the connective tissue between online learning and ranking. It stores real-time features updated by OnlineLearner with sub-30s freshness: user embeddings, post engagement velocity, user-topic preferences, trending scores, and click-through rates. LightRanker, HeavyRanker, and RetrievalService all read from FeatureStore, ensuring every stage uses the freshest available signals.
The primary trade-offs are operational complexity (13 components, 2 ML models, online learning stability concerns), two-tower embedding freshness (post embeddings are computed at publish time and refreshed only every 6-12 hours, so they miss engagement dynamics in between), and content discovery vs social trust (surfacing 30-40% of the feed from outside the follow graph can feel intrusive to users who curated their follow list carefully).
Interviewers expect candidates to explain the two-tower architecture (separate user and post embedding computation, ANN for retrieval), discuss online learning stability (guardrails against feedback loops), reason about the 4-stage funnel design (why separate re-ranking from ML scoring), and analyze the content discovery vs social trust trade-off.
The real-time online learning architecture uses thirteen components organized into five layers: edge traffic (FeedClient, ApiGateway, FeedLB), application orchestration (FeedOrchestrator, RetrievalService), data stores (CandidateCache, FeatureStore), ranking pipeline (LightRanker, HeavyRanker, ReRanker), and online learning (EngagementStream, OnlineLearner, ModelStore).
The feed read path starts at FeedClient, passes through ApiGateway (JWT auth, rate limiting at 150K RPS, engagement signal validation) and FeedLB (round-robin across 16 FeedOrchestrator pods). FeedOrchestrator is the central coordinator — it manages the 4-stage ranking pipeline and assembles the final feed response.
Candidate generation operates in parallel across two sources. First, FeedOrchestrator retrieves follow-graph candidates from CandidateCache (Redis): 1000-2000 pre-computed post IDs from the fanout process (~2ms). Second, FeedOrchestrator calls RetrievalService for interest-based candidates: RetrievalService looks up the user's 128-dimensional interest embedding from FeatureStore (~1.5ms), runs HNSW ANN search on the in-memory post embedding index (~18ms), and returns 8K candidate post IDs ranked by cosine similarity. After deduplication, the combined candidate pool is approximately 10K posts. The parallel retrieval adds only 20ms to the critical path (max of cache lookup and ANN search).
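As a rough sketch of the parallel fetch and merge described above, the following uses a thread pool so the slower source bounds the latency. The two fetch functions are hypothetical stand-ins for the CandidateCache lookup and the RetrievalService call, and the source tag attached to each candidate is an assumption added so later stages could enforce discovery quotas.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_follow_candidates(user_id):
    # Stand-in for the CandidateCache lookup (hypothetical data).
    return ["p1", "p2", "p3"]

def fetch_interest_candidates(user_id):
    # Stand-in for the RetrievalService ANN call (hypothetical data).
    return ["p3", "p4", "p5"]

def generate_candidates(user_id):
    """Fetch both sources concurrently, then merge and deduplicate."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        follow_future = pool.submit(fetch_follow_candidates, user_id)
        interest_future = pool.submit(fetch_interest_candidates, user_id)
        follow = follow_future.result()
        interest = interest_future.result()
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    merged = list(dict.fromkeys(follow + interest))
    follow_set = set(follow)
    # Tag each candidate with its source.
    return [(pid, "follow" if pid in follow_set else "discovery")
            for pid in merged]

candidates = generate_candidates("user_42")
print(candidates)
```

A post that appears in both sources (here `p3`) is kept once and tagged as a follow-graph candidate.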
LightRanker receives the ~10K candidates and applies fast feature-based scoring. It fetches lightweight features from FeatureStore (engagement velocity, trending score) and applies a logistic regression or GBDT model with ~20 features per candidate. This stage reduces 10K to 500 candidates in approximately 30ms. The key features: recency decay, author affinity, engagement velocity, content-type preference, and user-topic match score.
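A minimal sketch of this kind of light scoring follows, assuming a 3-feature logistic regression with made-up weights and recency applied as a multiplicative half-life decay. The article's model uses about 20 features and does not specify how recency enters the score, so the weights, the half-life, and the multiplicative form are all assumptions.

```python
import heapq
import math

# Hypothetical weights for a 3-feature logistic regression.
WEIGHTS = {"author_affinity": 2.0, "engagement_velocity": 1.0, "topic_match": 1.5}
BIAS = -2.0
HALF_LIFE_HOURS = 6.0  # assumed recency half-life

def light_score(features, age_hours):
    """Logistic score multiplied by an exponential recency decay."""
    z = BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    probability = 1.0 / (1.0 + math.exp(-z))
    decay = 0.5 ** (age_hours / HALF_LIFE_HOURS)
    return probability * decay

def light_rank(candidates, keep):
    """Keep the `keep` highest-scoring candidates (10K -> 500 in the article)."""
    return heapq.nlargest(
        keep, candidates,
        key=lambda c: light_score(c["features"], c["age_hours"]))

relevant = {"author_affinity": 0.9, "engagement_velocity": 0.5, "topic_match": 0.8}
irrelevant = {"author_affinity": 0.0, "engagement_velocity": 0.1, "topic_match": 0.0}
candidates = [
    {"post_id": "old_relevant", "age_hours": 24.0, "features": relevant},
    {"post_id": "fresh_irrelevant", "age_hours": 1.0, "features": irrelevant},
    {"post_id": "fresh_relevant", "age_hours": 1.0, "features": relevant},
]
top_two = [c["post_id"] for c in light_rank(candidates, keep=2)]
print(top_two)
```

Using `heapq.nlargest` avoids a full sort when only a small fraction of the candidates survive the stage.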
HeavyRanker receives the 500 pre-filtered candidates and runs a deep neural engagement prediction model. It fetches full feature vectors from FeatureStore (~300 features per candidate: user embedding, post embedding, cross-features, real-time engagement signals). The model predicts four engagement probabilities: P(click), P(like), P(share), P(dwell>30s). A weighted combination produces the final engagement score. Batch inference on 500 candidates takes approximately 80ms. The model weights are loaded from ModelStore (S3) every 5 minutes — the latest checkpoint from OnlineLearner.
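The weighted combination can be sketched as a plain weighted sum. The weights below are hypothetical, since the article does not state the actual values; they only illustrate the idea that rarer, higher-intent actions such as shares can be weighted more heavily than clicks.

```python
# Hypothetical weights; the article does not state the actual values.
ENGAGEMENT_WEIGHTS = {"click": 1.0, "like": 2.0, "share": 4.0, "dwell_30s": 3.0}

def engagement_score(probabilities):
    """Weighted sum of the four predicted engagement probabilities."""
    return sum(ENGAGEMENT_WEIGHTS[name] * probabilities[name]
               for name in ENGAGEMENT_WEIGHTS)

score = engagement_score(
    {"click": 0.30, "like": 0.10, "share": 0.02, "dwell_30s": 0.20})
print(round(score, 2))
```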
ReRanker receives the top 50 scored posts and applies constrained reordering: maximum 3 consecutive posts from the same author, minimum 20% content from outside the follow graph, demotion of policy-flagged posts, boost for original content over reposts, and sponsored post insertion at designated positions. This is pure rule evaluation (~10ms), no ML inference.
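One way to apply the consecutive-author and discovery constraints is a greedy pass over the score-sorted list, sketched below. The article does not say which algorithm the ReRanker uses, so the greedy strategy, the post fields, and the `feed_size` parameter are assumptions; policy demotion and sponsored slots are left out for brevity.

```python
import math

def rerank(posts, feed_size, max_run=3, min_discovery=0.2):
    """Greedy constrained reordering of posts sorted by descending score.

    Each post is a dict with "id", "author", and "discovery" (True when
    the author is outside the viewer's follow graph).
    """
    remaining = list(posts)
    feed = []
    needed = math.ceil(min_discovery * feed_size)  # discovery slots required
    while remaining and len(feed) < feed_size:
        have = sum(1 for p in feed if p["discovery"])
        slots_left = feed_size - len(feed)
        # Once the open slots equal the missing quota, only discovery fits.
        must_discover = (needed - have) >= slots_left
        tail = feed[-max_run:]
        same_author_run = (len(tail) == max_run
                           and len({p["author"] for p in tail}) == 1)
        blocked = tail[0]["author"] if same_author_run else None
        choice = None
        for post in remaining:
            if must_discover and not post["discovery"]:
                continue
            if post["author"] == blocked:
                continue
            choice = post
            break
        if choice is None:
            # Constraints cannot all be met; fall back to the best remaining.
            choice = remaining[0]
        feed.append(choice)
        remaining.remove(choice)
    return feed

scored = [
    {"id": "a1", "author": "alice", "discovery": False},
    {"id": "a2", "author": "alice", "discovery": False},
    {"id": "a3", "author": "alice", "discovery": False},
    {"id": "a4", "author": "alice", "discovery": False},
    {"id": "b1", "author": "bob", "discovery": False},
    {"id": "c1", "author": "carol", "discovery": True},
]
feed_ids = [p["id"] for p in rerank(scored, feed_size=5)]
print(feed_ids)
```

In this example the fourth post by the same author is skipped to break the run, and the last slot is reserved for the only discovery post so the 20% floor holds.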
The online learning loop operates asynchronously. FeedOrchestrator publishes granular engagement signals to EngagementStream (Kafka, 32 partitions): clicks, likes, shares, dwell time, scroll position, and content exposure events. OnlineLearner (15 workers) consumes these events and performs two operations: (1) updates real-time features in FeatureStore (engagement velocity, CTR, trending score, user-topic preferences) with sub-30s latency, (2) performs online SGD on the HeavyRanker model's final scoring layer and pushes updated checkpoints to ModelStore every 5 minutes. Guardrails include maximum learning rate, gradient norm clipping, feature value bounds, and validation against a holdout set before checkpoint promotion.
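The article does not define how engagement velocity is computed. As one plausible reading, the sketch below keeps a sliding window of event timestamps per post and reports events per second. The class name and the 30-second default window (chosen to match the sub-30s freshness target) are assumptions, and events are assumed to arrive in timestamp order.

```python
from collections import deque

class EngagementVelocity:
    """Engagements per second for one post over a sliding time window."""

    def __init__(self, window_seconds=30.0):
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def record(self, timestamp):
        """Register one engagement event (timestamps must be non-decreasing)."""
        self.timestamps.append(timestamp)

    def velocity(self, now):
        # Drop events that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        return len(self.timestamps) / self.window_seconds

tracker = EngagementVelocity(window_seconds=10.0)
for t in (0.0, 5.0, 8.0, 12.0):
    tracker.record(t)
print(tracker.velocity(now=12.0))
```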
Total end-to-end feed latency: ~155ms (20ms retrieval + 30ms light rank + 80ms heavy rank + 10ms re-rank + 15ms assembly). Each component scales independently: RetrievalService by pod count (8 baseline), LightRanker by worker count (40 baseline), HeavyRanker by worker count (25 baseline), OnlineLearner by worker count (15 baseline).
Choice
Separate user and post embedding towers with ANN (HNSW) search for candidate generation
Rationale
Follow-graph-only candidates (V1) limit the feed to content from people the user follows. Two-tower retrieval computes a user interest embedding from engagement history and a post embedding from content + engagement signals, then finds nearest posts via ANN search. This discovers content from creators the user has never followed but would engage with. The two-tower architecture is efficient because user and post embeddings are computed independently: post embeddings are pre-built offline, and only the ANN search (O(log N), ~18ms for 100M posts) runs at serving time. This is the core architecture behind TikTok's For You page and Instagram Explore.
Choice
Candidate generation -> light rank -> heavy rank -> re-rank (4 stages instead of V1's 3)
Rationale
Adding a dedicated ReRanker as the 4th stage separates engagement optimization from diversity and business rules. The HeavyRanker maximizes predicted engagement (what users want to click). The ReRanker enforces platform-level objectives: content diversity, original content boost, policy compliance, and sponsored content placement. Mixing these objectives in a single ML model creates conflicting gradients during training and produces suboptimal results for both goals. Rule-based re-ranking is deterministic, debuggable, and instantly tunable without model retraining.
Choice
Continuous SGD on the scoring layer from engagement signals, with gradient clipping and holdout validation
Rationale
Batch retraining (V1 approach, weekly) cannot react to emerging trends, breaking news, or viral content until the next training cycle. Online learning updates model weights continuously — when a topic starts trending, the model adapts within minutes. The guardrails prevent instability: maximum learning rate (0.001), gradient norm clipping (1.0), feature value bounds (+/- 3 standard deviations), and mandatory validation against a 5% holdout set before promoting a checkpoint. If a checkpoint degrades holdout AUC by more than 1%, it is rejected and the previous checkpoint remains active.
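The first three guardrails can be sketched as a single guarded update on a logistic scoring layer. This is a simplified stand-in for the real final layer: it assumes the features are already standardized (so clipping at +/-3 corresponds to three standard deviations) and uses log loss, neither of which the article specifies.

```python
import math

LEARNING_RATE = 0.001   # maximum learning rate from the text
MAX_GRAD_NORM = 1.0     # gradient norm clipping threshold from the text
FEATURE_BOUND = 3.0     # clip standardized features to +/- 3 std devs

def guarded_sgd_step(weights, features, label):
    """One SGD step on a logistic scoring layer with the guardrails applied."""
    # Guardrail: bound extreme feature values.
    x = [max(-FEATURE_BOUND, min(FEATURE_BOUND, v)) for v in features]
    z = sum(w * v for w, v in zip(weights, x))
    prediction = 1.0 / (1.0 + math.exp(-z))
    # Gradient of the log loss with respect to the weights.
    grad = [(prediction - label) * v for v in x]
    # Guardrail: rescale the gradient if its L2 norm exceeds the limit.
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > MAX_GRAD_NORM:
        grad = [g * MAX_GRAD_NORM / norm for g in grad]
    # Guardrail: the fixed small learning rate caps the step size.
    return [w - LEARNING_RATE * g for w, g in zip(weights, grad)]

# An outlier feature value (10.0) is clipped before it can dominate the update.
updated = guarded_sgd_step([0.0, 0.0], [10.0, -1.0], label=1)
print(updated)
```

With clipping in place, no single event can move the weights by more than `LEARNING_RATE * MAX_GRAD_NORM` in L2 norm.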
Choice
Two Redis clusters: CandidateCache for post IDs, FeatureStore for real-time features
Rationale
CandidateCache stores pre-computed candidate post IDs with 10-minute TTL and read-heavy access. FeatureStore stores real-time features (user embeddings, engagement velocity, trending scores) with sub-30s update freshness and read-write-heavy access from OnlineLearner. Separating them prevents feature update writes from evicting candidate sets and provides independent scaling: CandidateCache needs more memory (larger value sizes), FeatureStore needs more write throughput (higher update frequency).
Choice
Approximate nearest neighbor via FAISS HNSW index on 128-dim post embeddings
Rationale
Exact nearest neighbor on 100M+ 128-dimensional embeddings requires computing cosine similarity against every embedding: O(N) per query, approximately 10 seconds. ANN (HNSW) achieves 95%+ recall in O(log N) time, returning 8K candidates in under 20ms. The 5% recall loss (missing some truly-nearest posts) is acceptable because the subsequent ranking stages (light rank, heavy rank) rescore candidates with richer features and keep only 500 of the roughly 10K retrieved, so a handful of missed neighbors rarely changes the final feed.
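Recall in this sense is the overlap between the approximate result and the exact top-k. A small helper, with hypothetical post IDs, makes the definition concrete:

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k results that the approximate search returned."""
    exact = set(exact_ids)
    if not exact:
        return 1.0
    return len(exact & set(approx_ids)) / len(exact)

exact_top = ["p1", "p2", "p3", "p4", "p5"]
approx_top = ["p1", "p2", "p3", "p5", "p9"]  # hypothetical ANN output
recall = recall_at_k(approx_top, exact_top)
print(recall)
```

Measuring this offline against a brute-force scan on a sample of queries is a common way to tune an ANN index's accuracy against its latency.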
Choice
Log not just clicks but feed position, viewport exposure time, and scroll behavior
Rationale
Position bias is a major confounder in engagement prediction: posts shown at position 1 can get on the order of 10x more clicks than posts at position 10 regardless of quality. By logging the feed position of each post and exposure events (whether the post entered the viewport), the online learner can correct for position bias during training. Without this correction, the model learns to rank already-high-ranked posts higher (a self-reinforcing feedback loop), reducing exploration and content diversity.
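The article does not name the correction method. One common option is inverse propensity weighting, sketched below under the assumption that propensities are estimated from a traffic slice where post quality is independent of position (for example, randomized ordering). The event fields and the weight cap are hypothetical.

```python
def position_propensities(logged_events):
    """Estimate relative click propensity per position from exposed impressions.

    Only valid when post quality is independent of position in the logged
    data, e.g. a small randomized-ordering slice (an assumption here).
    """
    exposures, clicks = {}, {}
    for event in logged_events:
        if not event["exposed"]:
            continue  # never entered the viewport: carries no signal
        pos = event["position"]
        exposures[pos] = exposures.get(pos, 0) + 1
        clicks[pos] = clicks.get(pos, 0) + (1 if event["clicked"] else 0)
    ctr = {pos: clicks[pos] / exposures[pos] for pos in exposures}
    top = ctr.get(1, 0.0)
    # Normalize by position 1 so its propensity is 1.0.
    return {pos: (rate / top if top > 0 else 1.0) for pos, rate in ctr.items()}

def example_weight(event, propensities, max_weight=20.0):
    """Inverse-propensity training weight for a click, capped for stability."""
    p = propensities.get(event["position"], 1.0)
    if p <= 0:
        return max_weight
    return min(max_weight, 1.0 / p)

events = (
    [{"position": 1, "exposed": True, "clicked": i < 5} for i in range(10)]
    + [{"position": 10, "exposed": True, "clicked": i < 1} for i in range(10)]
    + [{"position": 10, "exposed": False, "clicked": False}]
)
props = position_propensities(events)
print(example_weight({"position": 10}, props))
```

A click at a low-propensity position receives a larger training weight, which counteracts the tendency to reinforce whatever was already ranked first.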

| Metric | Target |
|---|---|
| Target RPS | 100K peak (60K feed reads + 25K engagement + 10K refresh + 5K retrieval) |
| Latency (p99) | ~155ms total (20ms retrieval + 30ms light rank + 80ms heavy rank + 10ms re-rank + 15ms assembly) |
| Storage | ~5 TB/year (embeddings, features, candidate cache, model checkpoints, engagement logs) |
| Availability | 99.9% (multi-AZ, Kafka replication, Redis cluster, model rollback) |

| Operation | Time | Space | Notes |
|---|---|---|---|
| Two-tower ANN retrieval | O(log N) — HNSW search on N post embeddings | O(N * D) — N embeddings x D dimensions in memory | 18ms for 100M 128-dim embeddings. 95%+ recall vs exact search. The HNSW index uses approximately 2x the raw embedding memory for graph structure. |
| Light ranking (feature-based scoring, 10K candidates) | O(K * F) — K candidates x F features | O(K) — scored candidate list | 30ms for 10K candidates with 20 features each. Logistic regression or GBDT, with no deep model inference. About 1.5x slower than V1's light ranking (30ms vs 20ms) despite 10x more candidates and 20 features instead of 10. |
| Heavy ML ranking (deep neural model, 500 candidates) | O(M * F) — M candidates x F features per candidate | O(M * F) — feature matrix for batch inference | 80ms for 500 candidates with 300 features each. 60% slower than V1's heavy ranking (200 candidates) because 2.5x more candidates pass through the light ranker due to the larger initial pool. The richer feature set (300 vs 200 features) adds 20% to inference time. |
| Re-ranking (diversity + business rules) | O(R) — R = re-rank set size (50 posts) | O(R) — reordered post list | 10ms for 50 posts. Pure rule evaluation — no ML inference. Constraint satisfaction: max consecutive same-author, min discovery percentage, policy flags. |
| Online learning (SGD update) | O(B * F) — B = mini-batch size, F = feature count | O(P) — P = model parameter count | Continuous, asynchronous. Each engagement signal triggers a gradient update on the scoring layer. Checkpoint pushed to ModelStore every 5 minutes. Total online learning throughput: 25K events/sec at peak. |
CandidateCache (Redis): pre-computed candidate post sets from the follow-graph fanout process. Each key holds 1000-2000 post IDs. Merged with interest-based candidates from RetrievalService to form the full 10K candidate pool. 6 Redis nodes, 600s TTL.
Indexes: Key-based O(1) lookup by user_id
Same structure as V1 but the candidate set is no longer the sole source of candidates — it is merged with 8K interest-based candidates from RetrievalService. The total candidate pool (10K) is 5-10x larger than V1, enabling richer content discovery at the cost of more expensive light ranking (10K candidates vs 1K).
FeatureStore (Redis): real-time feature store for ranking. Stores user embeddings (128-dim vectors), post engagement velocity, user-topic preferences, trending scores, and CTR. Updated by OnlineLearner with sub-30s freshness. Read by LightRanker, HeavyRanker, and RetrievalService.
Indexes: Key-based O(1) lookup by user_id or post_id
Separate from CandidateCache to prevent feature updates from evicting candidate sets. 4 Redis nodes with 24 GB total. OnlineLearner writes features with sub-30s freshness. User embeddings are updated on every engagement signal; post features are updated on every engagement event for that post.
ModelStore (S3): ranking model checkpoint storage. OnlineLearner writes new checkpoints every 5 minutes. HeavyRanker reads the latest validated checkpoint. Versioned with S3 object versioning for instant rollback.
Indexes: S3 prefix listing by timestamp for latest checkpoint discovery
Retains the last 48 hours of checkpoints (approximately 576 versions). If a new checkpoint degrades engagement metrics, operators roll back by updating a pointer file to the previous version. HeavyRanker polls this pointer every 5 minutes.
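The retention and pointer mechanics can be sketched with an in-memory stand-in. It is not an S3 client: the class and method names are hypothetical, and the default retention of 576 simply restates 48 hours at one checkpoint per 5 minutes (48 x 60 / 5).

```python
class CheckpointStore:
    """In-memory stand-in for versioned checkpoints plus a pointer file."""

    def __init__(self, retention=576):  # 48 h of checkpoints at one per 5 min
        self.retention = retention
        self.versions = []   # oldest first
        self.pointer = None  # version that rankers should load

    def publish(self, version):
        """Store a validated checkpoint and point rankers at it."""
        self.versions.append(version)
        # Drop checkpoints older than the retention window.
        del self.versions[:-self.retention]
        self.pointer = version

    def rollback(self):
        """Move the pointer to the version published just before the current one."""
        index = self.versions.index(self.pointer)
        if index == 0:
            raise RuntimeError("no earlier checkpoint retained")
        self.pointer = self.versions[index - 1]
        return self.pointer

store = CheckpointStore(retention=3)
for version in ["v1", "v2", "v3", "v4"]:
    store.publish(version)
print(store.rollback())
```

Because a rollback only rewrites the pointer, no checkpoint data moves; rankers pick up the change on their next poll.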
EngagementStream (Kafka): granular engagement signals from FeedOrchestrator. Includes position and exposure data for position-bias correction in online learning. Partitioned by user_id for per-user engagement ordering. 32 partitions, 25K msg/sec at peak.
Key Schema
user_id (STRING)
Value Schema
{ post_id, user_id, signal_type: 'click' | 'like' | 'share' | 'dwell' | 'expose', dwell_ms?: INTEGER, position?: INTEGER }
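The value schema and the user-keyed partitioning can be expressed as a small typed record. This is a sketch rather than the actual producer code: the validation rule that dwell events must carry `dwell_ms` is an assumption, and CRC32 is used only as an illustrative stable hash (Kafka's default Java partitioner uses murmur2 on the key bytes).

```python
import zlib
from dataclasses import dataclass
from typing import Optional

NUM_PARTITIONS = 32
SIGNAL_TYPES = {"click", "like", "share", "dwell", "expose"}

@dataclass(frozen=True)
class EngagementEvent:
    post_id: str
    user_id: str
    signal_type: str
    dwell_ms: Optional[int] = None
    position: Optional[int] = None

    def __post_init__(self):
        if self.signal_type not in SIGNAL_TYPES:
            raise ValueError(f"unknown signal_type: {self.signal_type}")
        # Assumed rule: a dwell signal is meaningless without a duration.
        if self.signal_type == "dwell" and self.dwell_ms is None:
            raise ValueError("dwell events need dwell_ms")

def partition_for(user_id: str) -> int:
    """Map a user to one of the partitions with a stable hash."""
    return zlib.crc32(user_id.encode("utf-8")) % NUM_PARTITIONS

event = EngagementEvent(post_id="p1", user_id="u1", signal_type="dwell",
                        dwell_ms=1200, position=3)
print(partition_for(event.user_id))
```

Keying by `user_id` sends all of a user's events to the same partition, which is what preserves per-user ordering.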
| Variant | Tier | Latency | Throughput | Cost | Complexity | Reliability |
|---|---|---|---|---|---|---|
| V0: Naive (Chronological + Heuristic) | T1 | 80-200ms at 100 follows, 500ms+ at 500 follows | ~100K RPS (DB-limited) | $1,130/month | Low | 99% (single DB) |
| V1: Multi-Stage ML Pipeline | T2 | ~80ms (2ms cache + 20ms light rank + 50ms heavy rank) | 100K RPS | $3,500/month | Medium | 99.9% (multi-AZ) |
| V2: Real-Time Online Learning + Multi-Tower | T3 | ~155ms (20ms retrieval + 30ms light + 80ms heavy + 10ms re-rank + 15ms assembly) | 100K RPS | $8,500/month | Very High | 99.9% (multi-AZ) |
This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.
Collaborative filtering (e.g., matrix factorization) learns user-item interactions from historical engagement and recommends items similar to what similar users engaged with. Two-tower retrieval also learns embeddings from engagement data, but the two towers (user encoder, post encoder) are trained as separate neural networks that map inputs to a shared embedding space. The key advantage is that new posts can be embedded immediately (by running the post encoder on content features) without waiting for engagement data — collaborative filtering requires historical interactions. This makes two-tower retrieval effective for fresh content (published minutes ago), while collaborative filtering struggles with cold-start items.
Five guardrails: (1) Maximum learning rate (0.001) prevents large weight updates from a single batch of engagement signals. (2) Gradient norm clipping (max norm 1.0) prevents exploding gradients from outlier engagement patterns. (3) Feature value bounds (clip at +/- 3 standard deviations) prevent extreme feature values from dominating predictions. (4) Holdout validation — every checkpoint is evaluated against a 5% holdout set; if AUC drops by >1%, the checkpoint is rejected. (5) Automatic rollback — if live engagement metrics (CTR, time-spent) drop >5% within 15 minutes of a new checkpoint, the system automatically reverts to the previous version. These guardrails trade some model freshness for stability.
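Guardrails (4) and (5) reduce to two threshold checks, sketched below. The text does not say whether the 1% and 5% drops are relative or absolute; this sketch assumes relative drops, and the function names are hypothetical.

```python
def should_promote(candidate_auc, current_auc, max_relative_drop=0.01):
    """Accept a checkpoint unless holdout AUC falls more than 1% (relative)."""
    return candidate_auc >= current_auc * (1.0 - max_relative_drop)

def should_rollback(baseline_metric, live_metric, max_relative_drop=0.05):
    """Revert if a live metric (CTR, time spent) falls more than 5% (relative)."""
    return live_metric < baseline_metric * (1.0 - max_relative_drop)

print(should_promote(candidate_auc=0.78, current_auc=0.80))
print(should_rollback(baseline_metric=0.10, live_metric=0.094))
```

The first check runs before a checkpoint is served at all; the second catches regressions that the holdout set failed to predict.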
A single model scoring 10K candidates with 300 features each would require 10K x 300 = 3M feature lookups and a batch inference on a 10K x 300 matrix: approximately 2 seconds per request, far exceeding the 500ms SLO. The funnel applies progressively more expensive processing to progressively smaller candidate sets: retrieval (10K candidates, 0 features, 20ms) -> light rank (10K candidates, 20 features, 30ms) -> heavy rank (500 candidates, 300 features, 80ms) -> re-rank (50 candidates, rules only, 10ms). Each stage costs more per candidate (about 3 microseconds in light rank versus about 160 microseconds in heavy rank) but operates on a 10-20x smaller set. The total, including about 15ms of response assembly, is approximately 155ms, roughly 13x faster than scoring everything with the heavy model.
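The arithmetic behind these figures can be checked directly. The snippet only restates numbers quoted in this document (stage sizes, feature counts, stage latencies, and the 2-second single-model estimate); the variable names are illustrative.

```python
# Feature evaluations per request (candidates x features per candidate).
single_model = 10_000 * 300           # heavy model on every candidate
funnel = 10_000 * 20 + 500 * 300      # light rank + heavy rank

# Latency budget in milliseconds, using the figures quoted in this document.
stage_ms = {"retrieval": 20, "light_rank": 30, "heavy_rank": 80,
            "re_rank": 10, "assembly": 15}
total_ms = sum(stage_ms.values())
speedup = 2_000 / total_ms            # vs ~2 s for the single model

print(single_model, funnel, total_ms, round(speedup, 1))
```

The funnel performs 350K feature evaluations instead of 3M, which is where most of the latency saving comes from.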
The re-ranker enforces a minimum 20% content-from-outside-follow-graph rule to ensure content discovery, but the other 80% comes from the user's curated follow list, preserving social trust. The 20% discovery content is scored by the HeavyRanker (so it is predicted to be engaging, not random), and the re-ranker interleaves it with follow-graph content rather than clustering it. Users can also provide explicit negative feedback ('not interested', 'show less like this') which updates the user's topic preferences in FeatureStore, reducing future surfacing of unwanted content categories.
The primary cost is memory for the embedding index. 100M posts x 128 dimensions x 4 bytes (float32) = 51.2 GB for the raw embeddings. With HNSW graph overhead the index is approximately 2x the raw size, so the total index is approximately 100 GB. Distributed across 8 RetrievalService pods with replication, each pod holds approximately 25 GB. That exceeds 16 GB of RAM per pod, so the index cannot be fully resident and must be partially memory-mapped, which can raise tail latency when a query touches pages that are not in memory. The compute cost is minimal: ANN search on HNSW is O(log N) and takes 18ms per query. At 60K queries/sec (feed reads), 8 pods handle the load with each pod processing 7.5K queries/sec. The infrastructure cost for RetrievalService is approximately $2,000/month, roughly a quarter of the $8,500/month total, justified by the improvement in content discovery.
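The sizing can be reproduced in a few lines. The replication factor of 2 is an assumption chosen because it reconciles the roughly 100 GB index with roughly 25 GB per pod across 8 pods; the text does not state it.

```python
posts = 100_000_000
dims = 128
bytes_per_float = 4                    # float32

raw_gb = posts * dims * bytes_per_float / 1e9   # raw embeddings
index_gb = raw_gb * 2                           # with HNSW graph overhead (~2x)

pods = 8
replication = 2                        # assumed; not stated in the text
per_pod_gb = index_gb * replication / pods
qps_per_pod = 60_000 / pods

print(raw_gb, index_gb, per_pod_gb, qps_per_pod)
```

Under these assumptions each pod holds about 25.6 GB, which is the figure to compare against the pod's available RAM.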