Vetora logo
Medium10 componentsInterview: Very High

Feed Ranking — Multi-Stage ML Pipeline

Industry-standard multi-stage ranking pipeline used at Meta, YouTube, and LinkedIn. Pre-computed candidate sets in Redis, a lightweight heuristic ranker (1000 to 200), and a heavy ML scorer (200 to 50). Real-time engagement feedback via Kafka. The funnel design keeps ML inference costs manageable by applying the expensive model only to pre-filtered candidates.

ML PipelineKafkaRedisMulti-StageFeed Ranking
Problem Statement

The multi-stage ML pipeline approach to feed ranking represents the industry standard architecture used at Meta (Facebook, Instagram), YouTube, LinkedIn, and most production social platforms. It solves the two fundamental problems with the naive approach: query-time ranking latency and the inability to personalize based on user behavior.

The key architectural insight is the multi-stage funnel. Instead of running a single expensive operation on all candidates, the pipeline applies progressively more expensive models to progressively smaller candidate sets: (1) Candidate generation retrieves ~1000 pre-computed post IDs from CandidateCache (Redis) in 2ms — zero database queries at feed-read time. (2) LightRankerWorker applies fast heuristic scoring (recency decay, author affinity, engagement velocity) to reduce 1000 to 200 candidates in 20ms. (3) HeavyMLWorker runs a deep neural engagement prediction model on 200 candidates with ~200 features each, producing the final top-50 in 50ms. The total pipeline latency is approximately 80ms — compared to 400-600ms for the naive approach at the same follow count.

CandidateCache is the critical optimization. Instead of querying PostDB on every feed request, a background fanout process pre-computes candidate post sets per user. When a user you follow publishes a new post, the fanout process adds that post ID to your candidate set in Redis. This converts the feed read from an O(N log N) database query to an O(1) Redis lookup — a fundamental architectural shift from pull-based (compute on read) to push-based (compute on write) feed generation.

The engagement feedback loop is the second major advancement. EngagementStream (Kafka) captures real-time user interactions: clicks, likes, shares, and dwell time. FeatureWorker consumes these events and updates engagement features in CandidateCache — engagement velocity (likes per minute), click-through rate, and trending score. These updated features are used by the next ranking pipeline invocation, enabling the system to boost trending content and demote ignored content within minutes.

The HeavyMLWorker uses a deep neural network trained on historical engagement data to predict P(click), P(like), and P(dwell>30s) for each candidate post given the viewer's features. The model uses approximately 200 features per candidate: user demographics, user engagement history, post metadata, author features, cross-features (user-author affinity, user-topic match), and real-time signals (post engagement velocity, recency). Model inference is batched — scoring 200 candidates in a single forward pass takes approximately 50ms on CPU, compared to 5ms per candidate if scored individually.

The primary trade-offs are candidate staleness (pre-computed sets can be up to 5 minutes stale for new posts), cold start for new users (no engagement history means no personalization), and echo chamber risk (the ML model optimizes for predicted engagement, which can create filter bubbles by showing users more of what they already like).

Interviewers expect candidates to explain the funnel design rationale (apply cheapest filter first), discuss the fanout-on-write trade-off (write amplification vs read latency), reason about the engagement feedback loop's freshness requirements, and analyze the cold-start problem for new users.

Architecture Overview

The multi-stage ML pipeline architecture uses ten components organized into four layers: edge traffic (FeedClient, ApiGateway, MainLB), application orchestration (FeedService), data stores (CandidateCache, PostDB), ranking pipeline (ScoringStream, LightRankerWorker, HeavyMLWorker), and engagement feedback (EngagementStream, FeatureWorker).

The feed read path begins at FeedClient, passes through ApiGateway (JWT authentication, rate limiting at 120K RPS) and MainLB (round-robin across 12 FeedService pods), and reaches FeedService. FeedService retrieves the user's pre-computed candidate set from CandidateCache (Redis) — typically 500-1000 post IDs with lightweight metadata (author_id, timestamp, like_count). This lookup takes approximately 2ms. For cache misses (10% of requests), FeedService falls back to PostDB to query the user's follow list and recent posts.

FeedService publishes the candidate set to ScoringStream (Kafka, 32 partitions, partitioned by user_id for ordering). LightRankerWorker consumes the candidates and applies fast heuristic scoring: recency decay (exponential, 6-hour half-life), author affinity (pre-computed user-user scores), engagement velocity (likes/shares per minute from real-time features), and content-type preference. The scoring formula uses ~10 features per candidate — a weighted sum computed in microseconds per candidate. 30 workers reduce 1000 to 200 candidates in under 20ms. Results are published back to ScoringStream.

HeavyMLWorker consumes the 200 pre-filtered candidates and runs a deep neural model for engagement prediction. The model uses ~200 features per candidate including user embeddings (64-dim), post embeddings (64-dim), cross-features (user-author interaction history, user-topic match), and real-time signals (post engagement velocity, recency). Batch inference on 200 candidates takes approximately 50ms. The model outputs P(click), P(like), P(share), and P(dwell>30s). A weighted combination produces the final engagement score. Top 50 posts are returned to FeedService, which fetches post content from PostDB and assembles the feed response.

The engagement feedback loop operates asynchronously. When users interact with the feed (click, like, share, dwell), FeedService publishes engagement signals to EngagementStream (Kafka, 16 partitions). FeatureWorker (10 workers) consumes these events and updates real-time features in CandidateCache: engagement velocity (rolling 30-minute window), click-through rate, and trending score. These updated features are available for the next ranking pipeline invocation, closing the feedback loop with sub-30-second freshness.

Horizontal scaling is independent per component: FeedService scales by pod count (12 baseline), LightRankerWorker scales by worker count (30 baseline), HeavyMLWorker scales by worker count (20 baseline), and Kafka scales by partition count. The most expensive component is HeavyMLWorker — ML inference on 200 candidates at 100K RPS requires significant compute. The funnel design (1000 to 200 to 50) keeps this cost manageable by ensuring the expensive model only processes pre-filtered candidates.

Architecture Preview
Loading architecture preview...
Key Design Decisions
Multi-Stage Ranking Funnel

Choice

3-stage pipeline: candidate retrieval (Redis) -> light rank (heuristic) -> heavy rank (ML)

Rationale

Running a heavy ML model on 1000+ candidates per request would take 500ms+ and cost 10x more in compute. The 3-stage funnel applies the cheapest filter first (Redis lookup: 2ms), then a fast heuristic scorer (20ms for 1000 candidates), then the expensive deep model on a much smaller set (50ms for 200 candidates). This is the industry standard at Meta, YouTube, and TikTok. Each stage reduces the candidate set by 5x, and each subsequent stage is 5-10x more expensive per candidate — the math only works as a funnel.

Pre-Computed Candidates via Fanout

Choice

Fanout-on-write to CandidateCache (Redis) instead of query-time retrieval

Rationale

The naive approach queries PostDB on every feed request, scanning thousands of rows. Pre-computing candidates converts the O(N log N) database query to an O(1) Redis lookup (~2ms). The fanout process runs when a user publishes a post — the post ID is added to every follower's candidate set in Redis. The trade-off is write amplification: a user with 1M followers triggers 1M Redis writes per post. For celebrity accounts, this is handled by a hybrid approach (fanout for users with <10K followers, pull for users with >10K followers).

Kafka for Scoring Pipeline

Choice

Amazon MSK (Kafka) for inter-stage communication with backpressure

Rationale

The ranking pipeline is multi-stage with different processing rates per stage. Kafka provides ordered, reliable event delivery between stages with consumer-side backpressure. If HeavyMLWorker falls behind (e.g., during a traffic spike), Kafka buffers the backlog without dropping requests. Each stage scales independently by adjusting consumer group size. The alternative — synchronous RPC between stages — couples their scaling and propagates latency spikes across the pipeline.

Separate Light and Heavy Rankers

Choice

LightRankerWorker (heuristic, ~10 features) + HeavyMLWorker (deep model, ~200 features)

Rationale

The light ranker uses simple features (recency, like count, author affinity) computed in microseconds per candidate. It eliminates obviously irrelevant candidates (old posts, low-affinity authors) before the expensive ML model runs. The 5x reduction in candidates (1000 to 200) translates directly to 5x lower ML inference cost and 5x lower latency for the heavy stage. Without the light ranker, the heavy model would need to process 1000 candidates at ~250ms — exceeding the 500ms p99 budget.

Real-Time Engagement Feedback via Kafka

Choice

EngagementStream (Kafka) -> FeatureWorker -> CandidateCache for sub-30s feature updates

Rationale

A post's quality changes rapidly after publishing — viral content accumulates likes/shares quickly, while low-quality content gets ignored. Real-time engagement signals (updated every few seconds via Kafka) let the ranker boost trending content and demote ignored content within minutes, not hours. Without this feedback loop, a post published 1 hour ago with 10K likes would be ranked the same as a post with 0 likes, because the candidate set was pre-computed before engagement accumulated.

Scale & Performance

Target RPS

100K peak (70K feed reads + 20K engagement + 10K refresh)

Latency (p99)

~80ms total (2ms cache + 20ms light rank + 50ms heavy rank + 10ms assembly)

Storage

~2 TB/year (candidate cache, engagement features, post content, model checkpoints)

Availability

99.9% (multi-AZ, Kafka replication, Redis cluster)

Time & Space Complexity
OperationTimeSpaceNotes
Candidate retrieval (CandidateCache)O(1) — Redis key-value lookupO(K) — K = candidate count (500-1000)2ms per request. The fundamental optimization over V0: converting O(N log N) database sort to O(1) cache lookup. The fanout process pays the write cost asynchronously.
Light ranking (heuristic scoring)O(K) — score each candidate with ~10 featuresO(K) — scored candidate list20ms for 1000 candidates. Each candidate scored with a weighted formula (microseconds per candidate). No ML inference, no feature store lookup — all features are co-located with the candidate in CandidateCache.
Heavy ML ranking (deep neural model)O(M * F) — M candidates, F features per candidateO(M * F) — feature matrix for batch inference50ms for 200 candidates with 200 features each. Batched inference amortizes model loading overhead. The 5x reduction from light ranking (1000 to 200) keeps this cost manageable.
Database Schema (HLD)
candidate_cache (Redis)

Pre-computed candidate post sets stored in Redis. Each key holds 500-1000 post IDs with lightweight metadata. Populated by the fanout process when followed users publish new posts. Read on every feed request (2ms). Also stores real-time engagement features updated by FeatureWorker.

KEY cand:{user_id} (sorted set or list)VALUE: post_ids (500-1000 post IDs with metadata)FIELD: updated_at (last fanout timestamp)TTL: 600 seconds

Indexes: Key-based O(1) lookup by user_id

The candidate cache is the critical optimization that converts O(N log N) database queries to O(1) lookups. 90% hit rate since active users refresh feeds frequently. Cache misses fall back to PostDB. Memory usage: ~50 bytes per post ID x 1000 candidates x 10M users = ~500 GB across 6 Redis nodes.

posts (DynamoDB)

Persistent store for post content, author metadata, and engagement counters. Read path: FeedService fetches post content for final feed assembly after ranking. Write path: engagement counter updates from FeatureWorker. On-Demand capacity.

post_id STRING PK (partition key)author_id STRINGcontent STRINGlike_count NUMBERcreated_at STRING (ISO timestamp)

Indexes: PK on post_id (hash key for O(1) lookups)

Post content is fetched only after ranking is complete — FeedService retrieves the top 50 post IDs from the ranking pipeline, then batch-GETs post content from PostDB. This is a simple key-value access pattern, not a range scan. At 100K feed reads/sec x 50 posts/feed = 5M point reads/sec.

Event Contracts
Scoring Requestscoring-request

Carries candidate post sets through the multi-stage ranking pipeline. FeedService produces scoring requests; LightRankerWorker consumes, scores, and republishes reduced sets; HeavyMLWorker consumes for final ML scoring. Partitioned by user_id for per-user ordering.

Key Schema

user_id (STRING)

Value Schema

{ user_id, candidate_ids: STRING[], stage: 'light' | 'heavy' }

Engagement Signalengagement-signal

Emitted on every user engagement action. FeedService produces; FeatureWorker consumes to update real-time engagement features in CandidateCache. Partitioned by post_id to aggregate per-post metrics.

Key Schema

post_id (STRING)

Value Schema

{ post_id, user_id, signal_type: 'click' | 'like' | 'share' | 'dwell', dwell_ms?: INTEGER }

Solution Comparison
VariantTierLatencyThroughputCostComplexityReliability
V0: Naive (Chronological + Heuristic)T180-200ms at 100 follows, 500ms+ at 500 follows~100K RPS (DB-limited)$1,130/monthLow99% (single DB)
V1: Multi-Stage ML PipelineT2~80ms (2ms cache + 20ms light rank + 50ms heavy rank)100K RPS$3,500/monthMedium99.9% (multi-AZ)
V2: Real-Time Online Learning + Multi-TowerT3~155ms (20ms retrieval + 30ms light + 80ms heavy + 10ms re-rank)100K RPS$8,500/monthVery High99.9% (multi-AZ)

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions
Why pre-compute candidates instead of querying the database on each feed request?

Pre-computing candidates converts the feed read from an O(N log N) database query (scanning thousands of rows, computing scores, sorting) to an O(1) Redis lookup (retrieve a pre-built list of post IDs in 2ms). At 100K feed reads/sec, eliminating the database query removes 100K expensive sort operations per second from PostgreSQL. The trade-off is write amplification during fanout: publishing a post to 10K followers requires 10K Redis writes. But writes are batched and asynchronous, while reads are latency-critical and synchronous — optimizing the read path at the expense of write complexity is the correct trade-off for a read-heavy feed system (80% reads, 10% writes).

How does the ML model handle the cold-start problem for new users?

New users have no engagement history, so the ML model's personalization features (user-author affinity, topic preference, engagement patterns) are all zero-valued. The model falls back to popularity-based ranking: global trending posts, high-engagement content, and editorially curated content fill the feed. As the user accumulates 10-20 interactions (clicks, likes, dwells), the personalization features begin to differentiate, and the feed becomes increasingly tailored. Most platforms also use an explicit onboarding flow (choose topics, follow suggested accounts) to bootstrap the cold start with initial signal.

What happens if Kafka (ScoringStream) is unavailable?

If ScoringStream is down, FeedService cannot dispatch candidates to the ranking pipeline. The system should implement a fallback path: retrieve candidates from CandidateCache and apply a simplified heuristic ranking directly in FeedService (similar to the V0 naive approach but on pre-computed candidates rather than a database query). The feed quality degrades (no ML scoring) but availability is maintained. EngagementStream downtime is less critical — engagement features in CandidateCache become stale but the ranking pipeline continues to function with slightly outdated signals.

How does the funnel design affect ranking quality?

The funnel introduces a quality risk: the light ranker might filter out a post that the heavy ML model would have ranked highly. This happens when the light ranker's heuristic features (recency, like count, affinity) fail to capture a signal that the heavy model's 200 features would detect — for example, a post from a low-affinity author on a topic the user has recently started engaging with. The mitigation is to make the light ranker slightly permissive (keep 200 out of 1000, not 50 out of 1000) and to periodically audit the overlap between light ranker output and heavy model rankings using offline evaluation.

Why Kafka instead of direct RPC between ranking stages?

Direct RPC (FeedService calls LightRanker, LightRanker calls HeavyRanker) creates tight coupling: if HeavyRanker is slow, the entire pipeline blocks, and FeedService threads are held waiting. Kafka decouples the stages — each consumes at its own rate with backpressure. If HeavyRanker falls behind during a traffic spike, Kafka buffers the backlog (millions of messages) without dropping requests or blocking upstream stages. This is critical for a system where the heaviest stage (ML inference at 50ms) is 25x slower than the lightest stage (cache lookup at 2ms).

Related Templates

Discussion

Sign in to join the discussion.

Ready to design your own Feed Ranking?

Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.

Open Simulator