Hard | 11 components | Interview: Very High

Feed Ranking — ML-Powered News Feed

Design a ranked news feed with a multi-stage ML pipeline: candidate generation, lightweight scoring, heavy neural ranking, and real-time engagement feedback.

Machine Learning | Ranking Pipeline | Real-Time Features | News Feed
Problem Statement

Feed ranking is one of the highest-value system design problems because it sits at the intersection of distributed systems, machine learning infrastructure, and product metrics. Companies like Meta, TikTok, YouTube, and LinkedIn all rely on ML-powered feed ranking as their primary engagement driver. Interviewers at these companies use this question to evaluate whether candidates can design a system that balances ranking quality against latency budgets, inference costs, and feature freshness.

At production scale, a ranked news feed serves 100,000 feed views per second at peak. Each feed view requires scoring 500 to 2,000 candidate posts from the user's follow graph and interest signals, ranking them by predicted engagement, and returning the top 50 posts within a 500ms latency budget. Running a heavy neural model on all 1,000+ candidates per request would be prohibitively expensive and too slow, so the system must use a multi-stage funnel that progressively narrows the candidate set while applying increasingly sophisticated scoring.

The key challenges include designing an efficient candidate generation strategy that pre-computes post sets for each user, building a multi-stage ranking pipeline that applies cheap filters before expensive ML models, implementing real-time feature updates so the ranker reflects engagement signals within seconds rather than hours, handling cold-start scenarios for new users with no engagement history, and managing the tension between engagement optimization and content diversity. A poorly designed feed ranking system either delivers stale or irrelevant content, or it consumes excessive compute resources that make it economically unviable at scale.

Architecture Overview

The architecture implements a three-stage ranking funnel that mirrors the production systems at Meta, YouTube, and TikTok. When a user opens their feed, FeedService retrieves a pre-computed candidate set of 500 to 1,000 post IDs from CandidateCache (Redis), which is populated by a separate fanout process from the user's follow graph. This gives O(1) candidate retrieval in approximately 2ms rather than an expensive graph traversal on every request.

The candidate set is published to ScoringStream (Kafka) where it flows through two ranking stages. The LightRankerWorker applies fast heuristic scoring using simple features like recency decay, author affinity scores, and engagement velocity, reducing 1,000 candidates to 200 in under 20ms. The HeavyMLWorker then runs a deep neural model on the 200 pre-filtered candidates, using approximately 200 features per candidate including user embeddings, post embeddings, and cross-features. This batch inference takes roughly 50ms and produces the final top-50 scored posts. FeedService assembles the response by fetching full post content from PostDB (DynamoDB).
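As a rough illustration of the light ranking stage, the sketch below combines recency decay, author affinity, and engagement velocity into one heuristic score and keeps the top 200. The weights, the six-hour half-life, and the names Candidate, light_score, and light_rank are assumptions made for this example and are not taken from the template.

```python
import heapq
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    post_id: str
    age_seconds: float
    author_affinity: float      # assumed to be precomputed in [0, 1]
    engagement_velocity: float  # assumed unit: interactions per minute


def light_score(c: Candidate, half_life_s: float = 6 * 3600) -> float:
    """Cheap heuristic score; weights and half-life are illustrative only."""
    recency = 0.5 ** (c.age_seconds / half_life_s)   # exponential decay
    velocity = math.log1p(c.engagement_velocity)     # dampen viral outliers
    return 0.5 * recency + 0.3 * c.author_affinity + 0.2 * velocity


def light_rank(candidates, keep: int = 200):
    """Return the `keep` highest-scoring candidates, best first."""
    return heapq.nlargest(keep, candidates, key=light_score)
```

heapq.nlargest avoids sorting the full candidate list, which matters little at 1,000 items but keeps the stage cheap if the candidate set grows.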

Critically, the system includes a real-time engagement feedback loop. As users interact with posts through clicks, likes, shares, and dwell time, FeedService publishes these signals to EngagementStream (Kafka). FeatureWorker consumes these events and updates engagement features in CandidateCache, including metrics like engagement velocity, click-through rate, and trending scores. These updated features are immediately available to the next ranking pipeline invocation, closing the feedback loop with feature freshness under 30 seconds. This near-real-time feature update is what allows the ranker to boost trending content and demote ignored content within minutes of publication.
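One way a worker like FeatureWorker could derive engagement velocity is with a sliding time window per post. The sketch below keeps the state in process memory to stay self-contained; the template stores features in Redis. The VelocityTracker class, the 60-second window, and the assumption that events arrive in timestamp order are all hypothetical.

```python
from collections import defaultdict, deque


class VelocityTracker:
    """Sliding-window engagement counter (in-memory stand-in for Redis)."""

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self._events = defaultdict(deque)  # post_id -> event timestamps

    def record(self, post_id: str, ts: float) -> None:
        # Assumes events for a post arrive in non-decreasing timestamp order.
        self._events[post_id].append(ts)

    def velocity(self, post_id: str, now: float) -> float:
        """Interactions per minute over the trailing window."""
        q = self._events[post_id]
        cutoff = now - self.window_s
        while q and q[0] <= cutoff:   # evict events outside the window
            q.popleft()
        return len(q) * 60.0 / self.window_s
```

A production version would also need to handle out-of-order events and expire idle posts, which this sketch leaves out.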

Key Design Decisions
Multi-Stage Ranking Funnel

Choice

Three stages: candidate retrieval, light ranker (1000 to 200), heavy ML scorer (200 to 50)

Rationale

Running the heavy neural model on all 1,000+ candidates per request would cost roughly 5x more in compute than scoring the 200 survivors, and would risk exceeding the 500ms latency budget. The funnel applies the cheapest filter first, using a heuristic scorer at 20ms to eliminate 80% of candidates before the expensive 50ms ML inference runs. This is the standard architecture at Meta, YouTube, and TikTok.

Pre-Computed Candidate Cache

Choice

Redis cache with pre-computed post sets per user from fanout process

Rationale

Computing candidates on every feed request would require querying the follow graph, fetching recent posts from all followed users, and deduplicating, easily exceeding 100ms per request. Pre-computing candidate sets and caching them in Redis reduces retrieval to 2ms on a cache hit, with a 90% hit rate. The trade-off is up to 5 minutes of staleness for newly followed accounts.
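The effect of the hit rate on average retrieval latency can be estimated with a weighted average. The sketch below uses the figures quoted above and treats 100ms as the miss cost, which is an assumption: the text gives it only as a lower bound. The function name expected_latency_ms is made up for this example.

```python
def expected_latency_ms(hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Mean retrieval latency given a cache hit rate and per-path costs."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


# 90% of requests take ~2 ms, 10% fall back to an assumed ~100 ms recompute.
mean_ms = expected_latency_ms(hit_rate=0.9, hit_ms=2.0, miss_ms=100.0)
```

Under these assumptions the mean comes out near 12ms, so misses dominate the average even at a 90% hit rate. Tail latency is set almost entirely by the miss path.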

Kafka for Scoring Pipeline

Choice

MSK (Managed Kafka) connecting ranking stages with independent scaling

Rationale

The ranking pipeline is multi-stage and each stage has different compute requirements. Kafka provides ordered, reliable delivery between stages with natural backpressure. If ML scoring falls behind during a traffic spike, Kafka buffers the backlog without dropping requests or blocking the lighter upstream stages. Each worker fleet scales independently based on its own queue depth.
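Scaling a worker fleet on queue depth can be sketched as a small sizing function: given the current backlog, pick enough workers to drain it within a target time. The function name, the per-worker throughput, the drain target, and the fleet bounds are all hypothetical; the template does not specify an autoscaling policy.

```python
import math


def desired_workers(queue_depth: int,
                    per_worker_rate: float,
                    drain_target_s: float,
                    min_workers: int = 1,
                    max_workers: int = 500) -> int:
    """Workers needed to drain `queue_depth` messages in `drain_target_s`.

    per_worker_rate is messages per second per worker (assumed measured).
    """
    needed = math.ceil(queue_depth / (per_worker_rate * drain_target_s))
    return max(min_workers, min(max_workers, needed))
```

For example, a backlog of 10,000 messages with workers that each handle 100 messages per second and a 10-second drain target calls for 10 workers.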

Real-Time Engagement Feedback Loop

Choice

Kafka-based event stream updating features in Redis, with feature freshness under 30 seconds

Rationale

A post's relevance changes rapidly after publishing. Viral content accumulates engagement quickly while low-quality content is ignored. Without real-time feature updates, the ranker would rely on stale signals and miss trending content. The feedback loop through EngagementStream and FeatureWorker ensures the ranker sees fresh engagement velocity within seconds, which is critical for news and time-sensitive content.

Scale & Performance

Target RPS

100,000 feed views per second at peak; 20,000 engagement signals per second

Latency (p99)

< 500ms end-to-end feed ranking (p99); < 100ms ML inference

Storage

52GB candidate cache (Redis); DynamoDB for post content and counters

Availability

99.99% feed availability; < 30s feature freshness

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions
Why not score all candidate posts with the heavy ML model directly?

At 100K feed views per second with 1,000 candidates each, scoring everything with the heavy model would require 100 million model inferences per second, which is prohibitively expensive in GPU and CPU cost. The multi-stage funnel applies a cheap heuristic filter first to eliminate obviously irrelevant candidates, so the heavy model scores only the top 200 per request. That cuts heavy-model inference volume by 80% (a 5x reduction, to about 20 million inferences per second) while losing minimal ranking quality, since the eliminated candidates were unlikely to rank highly anyway.
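The arithmetic behind these figures is simple enough to check directly. The sketch below uses only the numbers quoted in this answer; the constant and function names are chosen for illustration.

```python
FEED_VIEWS_PER_SEC = 100_000   # peak feed views per second
CANDIDATES_PER_VIEW = 1_000    # candidates entering the funnel
HEAVY_STAGE_INPUT = 200        # candidates surviving the light ranker


def heavy_inferences_per_sec(views_per_sec: int, scored_per_view: int) -> int:
    """Heavy-model inferences per second for a given per-view batch size."""
    return views_per_sec * scored_per_view


no_funnel = heavy_inferences_per_sec(FEED_VIEWS_PER_SEC, CANDIDATES_PER_VIEW)
with_funnel = heavy_inferences_per_sec(FEED_VIEWS_PER_SEC, HEAVY_STAGE_INPUT)

reduction_factor = no_funnel // with_funnel                        # 5
percent_saved = 100 * (no_funnel - with_funnel) // no_funnel       # 80
```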

How does the system handle cold-start users with no engagement history?

New users have no click or dwell history, so the ML model lacks personalization signals for them. The system falls back to a popularity-based ranking that surfaces globally trending content, posts from accounts the user explicitly followed during signup, and content that performs well with demographically similar users. As the user interacts with 10-20 posts, sufficient engagement data accumulates for the model to begin meaningful personalization.
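One possible way to move from the popularity fallback to personalized ranking is to blend the two scores, with a weight that grows with the user's interaction count. The linear ramp and the choice of 20 interactions as the point of full personalization are assumptions (20 is the upper end of the range mentioned above). The function names are hypothetical.

```python
def personalization_weight(n_interactions: int, full_at: int = 20) -> float:
    """Linear ramp from 0.0 (brand-new user) to 1.0 at `full_at` interactions."""
    return min(1.0, max(0, n_interactions) / full_at)


def blended_score(personalized: float, popularity: float,
                  n_interactions: int) -> float:
    """Mix the model score with a popularity score for cold-start users."""
    w = personalization_weight(n_interactions)
    return w * personalized + (1.0 - w) * popularity
```

A new user with zero interactions is ranked purely by popularity, and the model score takes over completely once the ramp reaches 1.0.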

What features does the heavy ML model use for engagement prediction?

The model uses approximately 200 features per candidate, organized into four categories: user features (embedding, historical engagement patterns, content preferences), post features (embedding, age, like count, engagement velocity), cross-features (author-viewer affinity, topic relevance, social graph proximity), and contextual features (time of day, device type, session depth). User and post embeddings are pre-computed dense vectors that capture latent preferences learned from historical interactions.
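The four feature categories can be pictured as groups that are concatenated into one input row per candidate. In the sketch below, user and contextual features are built once per request and reused across the whole batch, while post and cross-features vary per candidate. The class names, the field layout, and the concatenation order are assumptions for illustration, not the template's actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SharedFeatures:
    """Features identical for every candidate in one feed request."""
    user: List[float]      # e.g. user embedding, historical engagement
    context: List[float]   # e.g. time of day, device type, session depth


@dataclass
class CandidateFeatures:
    """Features that differ per candidate post."""
    post: List[float]      # e.g. post embedding, age, engagement velocity
    cross: List[float]     # e.g. author-viewer affinity, topic relevance


def build_batch(shared: SharedFeatures,
                candidates: List[CandidateFeatures]) -> List[List[float]]:
    """One flat feature row per candidate: user + post + cross + context."""
    return [[*shared.user, *c.post, *c.cross, *shared.context]
            for c in candidates]
```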

How does real-time engagement feedback improve ranking quality?

Without real-time feedback, the ranker relies on features computed during the last batch job, which may be hours old. A post that went viral 5 minutes ago would still show stale engagement metrics. The feedback loop updates features like engagement velocity and click-through rate in CandidateCache within 30 seconds of user interactions. This allows the ranker to boost freshly trending content and demote posts that are being ignored, which is especially important for news and time-sensitive content where relevance decays rapidly.

How does the system prevent echo chambers and filter bubbles?

A pure engagement-optimized ranker tends to show users more of what they already like, creating filter bubbles. Production systems address this with a diversity reranking stage that explicitly promotes content variety after the ML scorer produces its ranked list. This stage ensures the final feed includes a mix of content types, authors, and topics rather than clustering around the user's strongest preferences. While this template focuses on the core ranking pipeline, a production deployment would add diversity constraints as a post-processing step on the final top-50.
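A very simple form of diversity reranking is a per-author cap applied greedily to the scored list. The sketch below is one such heuristic, not the method used by any particular production system. The function name and the cap of two posts per author are assumptions.

```python
from collections import Counter
from typing import List, Tuple


def diversity_rerank(ranked: List[Tuple[str, str]],
                     max_per_author: int = 2,
                     k: int = 50) -> List[Tuple[str, str]]:
    """Greedy per-author cap.

    `ranked` holds (post_id, author_id) pairs in descending score order.
    Posts over the cap are pushed behind all posts that fit under it,
    keeping their relative order, then the list is cut to `k`.
    """
    picked, overflow = [], []
    counts = Counter()
    for post_id, author_id in ranked:
        if counts[author_id] < max_per_author:
            picked.append((post_id, author_id))
            counts[author_id] += 1
        else:
            overflow.append((post_id, author_id))
    return (picked + overflow)[:k]
```

The same pattern extends to topics or content types by counting on a different key.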
