Vetora

A/B Testing & Experimentation

A/B testing (online controlled experimentation) is the gold standard for measuring the causal impact of ML model changes on business metrics. It splits live traffic between a control (current model) and treatment (new model) to measure statistically significant differences in user behavior, revenue, or engagement.

Overview

Offline evaluation (hold-out test sets, cross-validation) tells you whether a new model is better in theory. Online A/B testing tells you whether it is better in practice. The gap between offline and online metrics is well-documented: at Netflix, offline improvements in RMSE do not always translate to increased viewing hours. At Google, only 10-20% of A/B tests show statistically significant positive results, even when the change was developed by experienced teams with strong offline metrics. This makes rigorous online experimentation the most important capability in a mature ML organization.

The mechanics of A/B testing for ML models are straightforward: randomly assign users to control (current model) or treatment (new model), serve predictions from the appropriate model, and measure the difference in business metrics (revenue, engagement, retention, etc.) over a defined experiment period. Statistical significance is determined using hypothesis testing (typically a two-sided t-test or Mann-Whitney U test for non-normal distributions). The experiment runs until it reaches the required sample size for the desired minimum detectable effect (MDE) and statistical power.
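The assignment and significance steps above can be sketched in a few lines. This is a minimal illustration, not a production design: the function names, the `experiment:user_id` hashing scheme, and the sample counts are assumptions, and a two-proportion z-test is used here in place of the t-test because it is the usual large-sample choice for a binary conversion metric.

```python
import hashlib
from math import sqrt
from statistics import NormalDist


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    # First 8 hex digits -> integer in [0, 2**32), scaled to [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "treatment" if bucket < treatment_share else "control"


def two_proportion_z_test(conv_c: int, n_c: int, conv_t: int, n_t: int):
    """Return (z statistic, two-sided p-value) for treatment vs. control rates."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical identifiers and counts, for illustration only.
    print(assign_variant("user-42", "ranker-v2"))
    z, p = two_proportion_z_test(conv_c=5_000, n_c=50_000, conv_t=5_250, n_t=50_000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

Hashing the user ID together with the experiment name keeps a user in the same variant across sessions without storing assignments, and gives different experiments independent splits.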

Sample size planning is critical and often underestimated. Required sample size grows with the inverse square of the effect size. With a 10% baseline conversion rate, 95% confidence, and 80% power, detecting a 10% relative improvement (10.0% to 11.0%) requires approximately 15,000 users per group, while detecting a 1% relative improvement (10.0% to 10.1%) requires roughly 1.4 million users per group. A 0.1% relative improvement (the scale of optimization pursued at companies with billions of events) pushes the requirement past 100 million users per group. This means small-traffic products may need weeks or months to reach significance, creating pressure to ship changes without adequate testing.
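A rough sample size calculation helps make this concrete. The sketch below uses the standard normal-approximation formula for comparing two proportions with a two-sided test; the function name and parameters are illustrative, and a real platform would typically also account for metric variance, multiple metrics, and unequal group sizes.

```python
from math import ceil, sqrt
from statistics import NormalDist


def users_per_group(baseline: float, relative_lift: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users per group for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


if __name__ == "__main__":
    for lift in (0.10, 0.01):
        n = users_per_group(baseline=0.10, relative_lift=lift)
        print(f"relative lift {lift:.0%}: {n:,} users per group")
```

Under these assumptions, a 10% baseline with a 10% relative lift works out to roughly 15,000 users per group, and the requirement grows about 100x for every 10x reduction in the effect.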

Experiment interactions are the hardest problem at scale. A company running 500 concurrent experiments (common at Google, Meta, Microsoft) must handle cases where experiments interact: model A and feature B both affect recommendations, and their combined effect is not the sum of their individual effects. Solutions include experiment layers (non-overlapping traffic segments for experiments that could interact), mutual exclusion groups, and post-hoc interaction analysis. Google's Overlapping Experiment Infrastructure (described in their 2010 paper) pioneered the layer-based approach used by most large experimentation platforms today.
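The layer idea can be illustrated with salted hashing. This is a simplified sketch and not Google's actual infrastructure: the layer names, experiment names, and bucket ranges are hypothetical. Each layer hashes the user with its own salt, so bucketing is independent across layers, while experiments inside a layer own disjoint bucket ranges.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

NUM_BUCKETS = 1000


@dataclass(frozen=True)
class Experiment:
    name: str
    start: int  # first bucket (inclusive)
    end: int    # last bucket (exclusive)


# Hypothetical configuration: experiments in one layer never share buckets.
LAYERS = {
    "ranking": [Experiment("ranker-v2", 0, 500), Experiment("ranker-v3", 500, 800)],
    "ui": [Experiment("new-snippets", 0, 1000)],
}


def bucket(user_id: str, layer: str) -> int:
    """Hash the user with a per-layer salt so layers are bucketed independently."""
    digest = hashlib.sha256(f"{layer}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) % NUM_BUCKETS


def active_experiments(user_id: str) -> dict:
    """Return at most one experiment name (or None) per layer for this user."""
    result: dict = {}
    for layer, experiments in LAYERS.items():
        b = bucket(user_id, layer)
        match: Optional[str] = next(
            (e.name for e in experiments if e.start <= b < e.end), None)
        result[layer] = match
    return result


if __name__ == "__main__":
    print(active_experiments("user-42"))
```

A user can be in one ranking experiment and one UI experiment at the same time, but never in two ranking experiments, which is the property that lets layered platforms reuse the same traffic many times over.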

Beyond classical A/B testing, modern ML experimentation uses interleaving (mixing recommendations from two models in the same result list, with each item attributed to its source model) for faster statistical convergence. Because every user sees both models, interleaving can need on the order of 100x fewer samples than A/B testing to detect a preference between two rankers. Multi-armed bandits adaptively shift traffic to the better-performing variant during the experiment, reducing regret (the cost of serving a worse variant) but complicating statistical inference.
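One common way to build the mixed list is team-draft interleaving, sketched below. The source does not say which interleaving variant any particular company uses, so treat this as an illustrative choice; the function names and item IDs are made up.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, k, rng=None):
    """Return (interleaved list, {item: 'A' or 'B'}) via team-draft interleaving."""
    rng = rng or random.Random()
    rankings = {"A": list(ranking_a), "B": list(ranking_b)}
    interleaved, source = [], {}
    picks = {"A": 0, "B": 0}
    while len(interleaved) < k:
        # The team with fewer picks goes next; ties are broken by a coin flip.
        if picks["A"] == picks["B"]:
            first = rng.choice(["A", "B"])
        else:
            first = "A" if picks["A"] < picks["B"] else "B"
        second = "B" if first == "A" else "A"
        for side in (first, second):
            # Each team contributes its best item that is not already in the list.
            item = next((d for d in rankings[side] if d not in source), None)
            if item is not None:
                interleaved.append(item)
                source[item] = side
                picks[side] += 1
                break
        else:
            break  # both rankings are exhausted
    return interleaved, source


def credit_clicks(clicked, source):
    """Count clicks per model; the model with more credited clicks wins the impression."""
    wins = {"A": 0, "B": 0}
    for item in clicked:
        if item in source:
            wins[source[item]] += 1
    return wins


if __name__ == "__main__":
    mixed, source = team_draft_interleave(
        ["a", "b", "c", "d"], ["b", "e", "a", "f"], k=4, rng=random.Random(0))
    print(mixed, credit_clicks(mixed[:1], source))
```

Because both models compete inside the same list for the same user, between-user variance largely cancels out, which is where the sensitivity gain comes from. The result is a relative preference between models, not an absolute metric value.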

Key Points

1. Offline metrics (AUC, NDCG, F1) correlate imperfectly with online business metrics. A model with higher NDCG may not increase user engagement because the test set does not capture real-world feedback loops, novelty effects, or position bias. A/B testing is the most reliable measure of real-world impact.
2. Sample size determines experiment duration. Detecting a 0.5 percentage-point improvement in click-through rate at 95% confidence and 80% power can require ~160K users per group (the worst case, with a baseline near 50%). Under-powered experiments produce false negatives (concluding a good change has no effect), and peeking at results before reaching the planned sample size produces false positives.
3. Experiment layers prevent interactions between concurrent experiments. Experiments in the same layer get exclusive traffic slices; experiments in different layers run on overlapping traffic. This enables hundreds of concurrent experiments without interference.
4. Guardrail metrics (latency, error rate, revenue, fairness) must be monitored in every experiment, not just the primary metric. A model that improves engagement by 0.5% but increases latency by 200ms is likely a net negative for user experience.
5. Interleaving (mixing results from two models in the same list) can be on the order of 100x more statistically sensitive than A/B testing because every user sees both models, which removes between-user variation from the comparison. It is a preferred method for comparing ranking and recommendation models.
6. Multi-armed bandits shift traffic toward the winning variant during the experiment, reducing opportunity cost. However, they make it harder to compute valid confidence intervals because the assignment probability changes over time. Thompson Sampling is one of the most common bandit algorithms for ML experiments.
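As an illustration of the last point, Thompson Sampling for a binary reward can be sketched with a Beta posterior per arm. The class name, arm names, and conversion rates below are hypothetical, and the uniform Beta(1, 1) prior is an assumption.

```python
import random


class BetaBernoulliThompson:
    """Thompson Sampling for binary rewards with a Beta(1, 1) prior per arm."""

    def __init__(self, arms, rng=None):
        self.rng = rng or random.Random()
        # params[arm] = [alpha, beta] = [successes + 1, failures + 1]
        self.params = {arm: [1, 1] for arm in arms}

    def choose(self):
        # Draw one plausible conversion rate per arm and play the best draw.
        samples = {arm: self.rng.betavariate(a, b)
                   for arm, (a, b) in self.params.items()}
        return max(samples, key=samples.get)

    def update(self, arm, reward):
        self.params[arm][0 if reward else 1] += 1


def simulate(true_rates, rounds, seed=0):
    """Run the bandit against simulated users; return pulls per arm."""
    rng = random.Random(seed)
    bandit = BetaBernoulliThompson(true_rates, rng)
    pulls = {arm: 0 for arm in true_rates}
    for _ in range(rounds):
        arm = bandit.choose()
        pulls[arm] += 1
        bandit.update(arm, rng.random() < true_rates[arm])
    return pulls


if __name__ == "__main__":
    # Hypothetical true conversion rates, unknown to the bandit.
    print(simulate({"control": 0.10, "treatment": 0.12}, rounds=20_000))
```

The changing assignment probabilities are visible here: as evidence accumulates, one arm receives most of the traffic, so the two groups are no longer comparable samples of equal size drawn under a fixed split.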

Simple Example

Testing a New Search Ranking Model

A search team trains a new ranking model with higher offline NDCG. They set up an A/B test: 50% of users see results from the current model (control), 50% from the new model (treatment). After 2 weeks and 500K users per group, the treatment shows +1.2% search success rate (p < 0.01) and no regression in guardrail metrics (latency unchanged, revenue +0.3%). The experiment is declared a winner and the new model is promoted to 100% traffic via the model registry.
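The launch decision in this example combines a significant primary-metric win with a check that no guardrail regressed. A minimal sketch of that rule follows; the `MetricResult` type and the guardrail p-values are invented for illustration, since the example only states p < 0.01 for the primary metric.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    name: str
    relative_change: float        # e.g. 0.012 for +1.2%
    p_value: float
    higher_is_better: bool = True


def ship_decision(primary, guardrails, alpha=0.05):
    """Return ('ship' or 'no-ship', list of reasons blocking the launch)."""
    reasons = []
    if primary.higher_is_better:
        improved = primary.relative_change > 0
    else:
        improved = primary.relative_change < 0
    if not (improved and primary.p_value < alpha):
        reasons.append(f"{primary.name}: no significant improvement")
    for g in guardrails:
        regressed = (g.relative_change < 0) if g.higher_is_better else (g.relative_change > 0)
        if regressed and g.p_value < alpha:
            reasons.append(f"{g.name}: significant regression")
    return ("ship" if not reasons else "no-ship"), reasons


if __name__ == "__main__":
    primary = MetricResult("search_success_rate", 0.012, 0.008)
    guardrails = [
        MetricResult("latency_ms", 0.0, 0.90, higher_is_better=False),  # hypothetical p-value
        MetricResult("revenue", 0.003, 0.20),                            # hypothetical p-value
    ]
    print(ship_decision(primary, guardrails))
```

A non-significant guardrail change is not proof of no harm, so real platforms often pair a rule like this with a maximum tolerated regression per guardrail.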

Real-World Examples

Google

Google runs over 10,000 search quality experiments per year, of which only 10-20% ship as launches. Their Overlapping Experiment Infrastructure uses experiment layers to run hundreds of concurrent experiments without interference. Each search query passes through multiple experiment layers (ranking, UI, ads), and the system tracks interactions between layers. A single search ranking change is typically validated through offline evaluation, interleaving, and A/B testing before launch.

Netflix

Netflix uses interleaving as a fast first-stage method for evaluating recommendation model changes because it is far more sensitive than A/B testing; Netflix has reported needing over 100x fewer users to detect a preference between two rankers. In an interleaved experiment, each user's recommendation row mixes items from the control and treatment models, and engagement with each item is attributed to its source model. A preference that would take 1M+ users to surface in an A/B test can then show up with on the order of 10K users. Because interleaving measures relative preference only, promising candidates still go through an A/B test to measure their effect on metrics such as viewing hours.

Microsoft (Bing)

Microsoft's ExP (Experimentation Platform) runs thousands of concurrent A/B tests across Bing, Office, Xbox, and Azure. ExP provides automated sample size calculation, experiment interaction detection, and a shared metric catalog with pre-computed guardrail metrics. Bing uses a staged rollout process: 1% canary, 10% pilot, 50% A/B test, 100% launch. ExP detects and alerts on metric regressions at each stage, with automatic rollback for severe regressions.

Trade-Offs

Aspect	Description
Experiment Duration vs. Sensitivity	Smaller minimum detectable effects require larger sample sizes and longer experiments. Required sample size grows with the inverse square of the effect, so if a 2-week experiment can detect a 1% effect, detecting a 0.1% effect at the same traffic needs roughly 100 times the sample, which in practice means many months of runtime or substantially more traffic. Teams must choose between detecting only large effects (fast but misses subtle improvements) and running long experiments (sensitive but slow iteration).
A/B Testing vs. Interleaving	A/B testing is simple and general-purpose but requires large sample sizes. Interleaving can be on the order of 100x more sensitive but only works for ranking/recommendation systems and cannot measure absolute metrics (only relative preference). Use interleaving for model comparison, A/B testing for measuring business metric impact.
Statistical Rigor vs. Speed	Waiting for the planned sample size (significance level 0.05, 80% power) keeps the false positive rate at its nominal level but slows iteration. Sequential testing methods (group sequential design, always-valid p-values) allow peeking at results during the experiment without inflating error rates, enabling earlier stopping for clear winners or losers.
Bandits vs. Fixed-Assignment Tests	Multi-armed bandits reduce opportunity cost by shifting traffic to the winning variant but complicate statistical inference and are not suitable for experiments where the treatment effect varies over time (novelty effects, learning effects). Fixed-assignment A/B tests are simpler to analyze and give cleaner causal estimates.

Case Study

Booking.com's Experimentation Culture

Scenario

Booking.com wanted to make data-driven decisions across all product changes, from major ML model updates to button color changes. With over 200 product teams and 1,000+ engineers, they needed an experimentation platform that enabled every team to run rigorous experiments without deep statistics expertise.

Solution

Booking.com built an in-house experimentation platform that runs 1,000+ concurrent A/B tests at any time. The platform provides automated sample size calculation, experiment interaction detection via mutual exclusion groups, guardrail metric monitoring, and automated result reports with pre-computed statistical significance. Every product change -- no matter how small -- must pass an A/B test before full rollout. The platform uses sequential testing to enable early stopping.

Outcome

The experimentation culture enabled Booking.com to iterate faster while maintaining quality: the experiment-to-launch rate stabilized at ~10% (90% of tested ideas do not improve metrics, validating the need for testing). Revenue per visitor improved 25% over 3 years through the compound effect of hundreds of small, validated improvements. The platform reduced experiment setup time from days to minutes, enabling every engineer to run experiments independently.

Common Mistakes

⚠ Peeking at experiment results and stopping early when they look significant. Checking p-values daily and stopping at the first significant result can inflate the false positive rate from the nominal 5% to 30% or more, depending on how many times you look. Use sequential testing methods (spending functions, always-valid confidence intervals) that control error rates under continuous monitoring, or pre-commit to a fixed experiment duration.
⚠ Running underpowered experiments. A test with 1,000 users per group cannot reliably detect a 0.5% effect; it will almost always show 'no significant difference', even if the treatment is genuinely better. Always compute the required sample size before starting the experiment based on baseline metric variance and minimum detectable effect.
⚠ Ignoring experiment interactions. Two experiments that both modify the recommendation feed can interfere: each looks positive alone, but together they degrade the experience. Use experiment layers or mutual exclusion groups for experiments that affect the same user surface.
⚠ Using only a single primary metric. A model that improves click-through rate but degrades downstream conversion or increases latency is a net negative. Define 2-3 primary metrics and 5-10 guardrail metrics for every experiment.
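The cost of peeking is easy to demonstrate with a simulated A/A test, where there is no true effect and every "significant" result is a false positive. The sketch below is a simplified model: it assumes equal traffic per day and treats each day's standardized treatment-minus-control difference as an independent standard normal draw. The function name and defaults are illustrative.

```python
import random
from statistics import NormalDist


def false_positive_rates(n_experiments=2000, n_looks=14, alpha=0.05, seed=0):
    """Return (rate when stopping at any significant look, rate at the final look only)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = fixed_hits = 0
    for _ in range(n_experiments):
        cumulative = 0.0
        crossed = False
        z = 0.0
        for look in range(1, n_looks + 1):
            # Under the null, each day's standardized difference is N(0, 1).
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / look ** 0.5  # z statistic on all data so far
            if abs(z) > z_crit:
                crossed = True            # a peeker would have stopped here
        peeking_hits += crossed
        fixed_hits += abs(z) > z_crit     # only the pre-committed final look counts
    return peeking_hits / n_experiments, fixed_hits / n_experiments


if __name__ == "__main__":
    peeking, fixed = false_positive_rates()
    print(f"stop at first significant look: {peeking:.1%}")
    print(f"single pre-committed analysis:  {fixed:.1%}")
```

With 14 daily looks, the rate for stopping at the first significant look lands well above the nominal 5%, while the single pre-committed analysis stays close to it. Increasing `n_looks` pushes the peeking rate higher still.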

Related Concepts

Model Serving & Inference, ML Model Registry, Feature Stores, Training Pipelines, RAG Architecture

Test Your Understanding

1. Why do offline ML metrics (AUC, NDCG) not always predict online A/B test results?

2. What is the main advantage of interleaving over traditional A/B testing for recommendation evaluation?

Deeper Reading