Why do offline ML metrics (AUC, NDCG) not always predict online A/B test results?
A/B testing (online controlled experimentation) is the gold standard for measuring the causal impact of ML model changes on business metrics. It splits live traffic between a control (current model) and treatment (new model) to measure statistically significant differences in user behavior, revenue, or engagement.
Offline evaluation (hold-out test sets, cross-validation) tells you whether a new model is better on historical data. Online A/B testing tells you whether it is better with live users. The gap between offline and online metrics is well-documented: at Netflix, offline improvements in RMSE do not always translate to increased viewing hours. At Google, only 10-20% of A/B tests show statistically significant positive results, even when the change was developed by experienced teams with strong offline metrics. This makes rigorous online experimentation one of the most important capabilities in a mature ML organization.
The mechanics of A/B testing for ML models are straightforward: randomly assign users to control (current model) or treatment (new model), serve predictions from the appropriate model, and measure the difference in business metrics (revenue, engagement, retention, etc.) over a defined experiment period. Statistical significance is determined using hypothesis testing (typically a two-sided t-test or Mann-Whitney U test for non-normal distributions). The experiment runs until it reaches the required sample size for the desired minimum detectable effect (MDE) and statistical power.
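The assignment and significance steps described above can be sketched in a few lines. This is a minimal illustration, not a production design: `assign_variant` and `two_proportion_z_test` are hypothetical helper names, the hash-based bucketing is one common way to get stable random assignment, and the test shown is a pooled two-proportion z-test for a conversion metric (swapped in here for the t-test mentioned above because it needs only the standard library).

```python
import hashlib
from statistics import NormalDist
from typing import Tuple


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    Hashing the experiment name together with the user ID keeps a user in
    the same group for the whole experiment without storing any state.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def two_proportion_z_test(conv_c: int, n_c: int, conv_t: int, n_t: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    pooled = (conv_c + conv_t) / (n_c + n_t)
    standard_error = (pooled * (1 - pooled) * (1 / n_c + 1 / n_t)) ** 0.5
    z = (p_t - p_c) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    print(assign_variant("user-42", "new_ranker_v2"))
    print(two_proportion_z_test(conv_c=1000, n_c=10000, conv_t=1100, n_t=10000))
```

With the made-up counts in the example (10.0% versus 11.0% conversion on 10,000 users each), the z statistic is about 2.3 and the two-sided p-value is about 0.02.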
Sample size planning is critical and often underestimated. Detecting a 1 percentage point absolute improvement (a 10% relative lift) on a 10% baseline conversion rate at 95% confidence and 80% power requires approximately 15,000 users per group. Required sample size grows with the inverse square of the effect, so detecting a 0.1 percentage point improvement (a 1% relative lift, common for large-scale optimizations at companies with billions of events) requires roughly 1.4 million users per group. This means small-traffic products may need weeks or months to reach significance, creating pressure to ship changes without adequate testing.
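As a rough check on these magnitudes, the standard normal-approximation formula for comparing two proportions can be coded directly. The function name `sample_size_per_group` is hypothetical, and the sketch assumes a two-sided test on a conversion rate with equal group sizes; real platforms often apply further corrections.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control: float, p_treatment: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per group to detect p_control -> p_treatment.

    Uses n = (z_{1-alpha/2} + z_{power})^2 * (var_c + var_t) / delta^2,
    the usual normal approximation for a two-sided two-proportion test.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    delta = p_treatment - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)


if __name__ == "__main__":
    # 10% baseline, absolute lift of 1 percentage point (10% -> 11%)
    print(sample_size_per_group(0.10, 0.11))
    # 10% baseline, absolute lift of 0.1 percentage point (10% -> 10.1%)
    print(sample_size_per_group(0.10, 0.101))
```

Shrinking the detectable lift by a factor of 10 multiplies the required sample by roughly 100, which is why small effects are so expensive to measure.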
Experiment interactions are the hardest problem at scale. A company running 500 concurrent experiments (common at Google, Meta, Microsoft) must handle cases where experiments interact: model A and feature B both affect recommendations, and their combined effect is not the sum of their individual effects. Solutions include experiment layers (non-overlapping traffic segments for experiments that could interact), mutual exclusion groups, and post-hoc interaction analysis. Google's Overlapping Experiment Infrastructure (described in their 2010 paper) pioneered the layer-based approach used by most large experimentation platforms today.
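The layer idea can be illustrated with a small sketch. Everything here is an assumption made for illustration (the layer names, experiment names, bucket ranges, and the helpers `bucket_for` and `experiments_for`); it is not Google's implementation. Each layer hashes the user with its own salt, so assignments are independent across layers, while experiments inside one layer own disjoint bucket ranges and are therefore mutually exclusive.

```python
import hashlib

NUM_BUCKETS = 1000

# Hypothetical configuration: layer -> [(experiment, first_bucket, last_bucket)].
# Bucket ranges within a layer must not overlap.
LAYERS = {
    "ranking": [("new_ranker", 0, 499), ("ranker_holdback", 500, 599)],
    "ui": [("compact_cards", 0, 199)],
}


def bucket_for(user_id: str, layer: str) -> int:
    """Hash the user with a layer-specific salt into one of NUM_BUCKETS."""
    digest = hashlib.sha256(f"{layer}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def experiments_for(user_id: str) -> dict:
    """Return, per layer, the single experiment this user is in (or None)."""
    active = {}
    for layer, experiments in LAYERS.items():
        bucket = bucket_for(user_id, layer)
        active[layer] = None
        for name, first, last in experiments:
            if first <= bucket <= last:
                active[layer] = name
                break
    return active


if __name__ == "__main__":
    print(experiments_for("user-42"))
```

A user can be in one ranking experiment and one UI experiment at the same time, but never in two ranking experiments, which is the property the layer design is meant to guarantee.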
Beyond classical A/B testing, modern ML experimentation uses interleaving: recommendations from two models are mixed in the same result list, and each item is attributed to its source model. Because every user sees both models side by side, interleaving converges much faster; Netflix reports that it can need roughly 100x fewer users than A/B testing for the same sensitivity. Multi-armed bandits adaptively shift traffic to the better-performing variant during the experiment, reducing regret (the cost of serving a worse variant) but complicating statistical inference.
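One widely used way to build such a mixed list is team-draft interleaving, sketched below. The source does not say which interleaving variant any particular company uses, so treat this as an illustrative choice; the function names and the click-counting rule are assumptions.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, length, rng=None):
    """Merge two rankings; return (interleaved_list, item -> 'A' or 'B')."""
    rng = rng or random.Random()
    sources = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved, team_of = [], {}
    while len(interleaved) < length:
        # The team with fewer picks goes next; ties are broken by a coin flip.
        if picks["A"] != picks["B"]:
            team = "A" if picks["A"] < picks["B"] else "B"
        else:
            team = rng.choice(["A", "B"])
        candidate = next((d for d in sources[team] if d not in team_of), None)
        if candidate is None:
            # This team's ranking is exhausted; fall back to the other one.
            team = "B" if team == "A" else "A"
            candidate = next((d for d in sources[team] if d not in team_of), None)
            if candidate is None:
                break  # both rankings exhausted
        interleaved.append(candidate)
        team_of[candidate] = team
        picks[team] += 1
    return interleaved, team_of


def credit_clicks(clicked_items, team_of):
    """Attribute each click to the model that contributed the clicked item."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in team_of:
            wins[team_of[item]] += 1
    return wins


if __name__ == "__main__":
    merged, teams = team_draft_interleave(["a", "b", "c", "d"], ["b", "e", "a", "f"], 4)
    print(merged, credit_clicks(merged[:1], teams))
```

Aggregated over many sessions, the model whose items collect more clicks is preferred. The result is a relative preference between the two models, not an absolute estimate of a business metric.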
Testing a New Search Ranking Model
A search team trains a new ranking model with higher offline NDCG. They set up an A/B test: 50% of users see results from the current model (control), 50% from the new model (treatment). After 2 weeks and 500K users per group, the treatment shows +1.2% search success rate (p < 0.01) and no regression in guardrail metrics (latency unchanged, revenue +0.3%). The experiment is declared a winner and the new model is promoted to 100% traffic via the model registry.
Google
Google runs over 10,000 search quality experiments per year, of which only 10-20% ship as launches. Their Overlapping Experiment Infrastructure uses experiment layers to run hundreds of concurrent experiments without interference. Each search query passes through multiple experiment layers (ranking, UI, ads), and the system tracks interactions between layers. A single search ranking change is typically validated through offline evaluation, interleaving, and A/B testing before launch.
Netflix
Netflix uses interleaving as the primary method for evaluating recommendation model changes because it provides 100x more statistical sensitivity than A/B testing. In an interleaved experiment, each user's recommendation row mixes items from the control and treatment models, and engagement with each item is attributed to its source model. This allows Netflix to detect a 0.1% improvement in viewing hours with just 10K users, whereas an A/B test would require 1M+ users.
Microsoft (Bing)
Microsoft's ExP (Experimentation Platform) runs thousands of concurrent A/B tests across Bing, Office, Xbox, and Azure. ExP provides automated sample size calculation, experiment interaction detection, and a shared metric catalog with pre-computed guardrail metrics. Bing uses a staged rollout process: 1% canary, 10% pilot, 50% A/B test, 100% launch. ExP detects and alerts on metric regressions at each stage, with automatic rollback for severe regressions.
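The staged rollout with rollback can be pictured as a small decision function. This is only a sketch of the control flow under stated assumptions: the function name `next_traffic_share`, the shape of `guardrail_deltas` (relative metric changes oriented so that negative means worse), and the 2% regression threshold are all hypothetical and not taken from ExP.

```python
# Traffic shares for the stages named above: canary, pilot, A/B test, launch.
STAGES = [0.01, 0.10, 0.50, 1.00]


def next_traffic_share(current_share: float, guardrail_deltas: dict,
                       max_regression: float = -0.02) -> float:
    """Decide the next traffic share for the treatment.

    guardrail_deltas maps a metric name to its relative change versus
    control, oriented so that negative values mean "worse" (assumption).
    Any guardrail below max_regression triggers a rollback to 0% traffic.
    """
    if any(delta < max_regression for delta in guardrail_deltas.values()):
        return 0.0  # severe regression: roll back
    index = STAGES.index(current_share)
    return STAGES[min(index + 1, len(STAGES) - 1)]


if __name__ == "__main__":
    print(next_traffic_share(0.01, {"revenue": 0.003, "success_rate": 0.012}))
    print(next_traffic_share(0.10, {"revenue": -0.05}))
```

In practice the promotion decision would also wait for enough data at each stage; the sketch only shows how guardrails gate the move from one stage to the next.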
| Aspect | Description |
|---|---|
| Experiment Duration vs. Sensitivity | Smaller minimum detectable effects require larger sample sizes and longer experiments. Required sample size grows with the inverse square of the effect, so if a 2-week experiment can detect a 1% effect, detecting a 0.1% effect needs roughly 100x as many users, which can mean many months unless far more traffic is allocated. Teams must choose between detecting only large effects (fast but misses subtle improvements) and running long experiments (sensitive but slow iteration). |
| A/B Testing vs. Interleaving | A/B testing is simple and general-purpose but requires large sample sizes. Interleaving is 100x more sensitive but only works for ranking/recommendation systems and cannot measure absolute metrics (only relative preference). Use interleaving for model comparison, A/B testing for measuring business metric impact. |
| Statistical Rigor vs. Speed | Waiting until the planned sample size is reached (for example, testing at p < 0.05 with 80% power) keeps the false-positive rate under control but slows iteration. Sequential testing methods (group sequential design, always-valid p-values) allow peeking at results during the experiment without inflating error rates, enabling earlier stopping for clear winners or losers. |
| Bandits vs. Fixed-Assignment Tests | Multi-armed bandits reduce opportunity cost by shifting traffic to the winning variant but complicate statistical inference and are not suitable for experiments where the treatment effect varies over time (novelty effects, learning effects). Fixed-assignment A/B tests are simpler to analyze and give cleaner causal estimates. |
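To make the bandit side of this trade-off concrete, here is a minimal sketch of Thompson sampling, one common bandit algorithm, for a binary conversion metric. The source does not name a specific algorithm, so this choice is an assumption, as are the class name `ThompsonSampler`, the `simulate` helper, and the conversion rates in the example.

```python
import random


class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over a fixed set of variants."""

    def __init__(self, arms, rng=None):
        self.rng = rng or random.Random()
        self.successes = {arm: 0 for arm in arms}
        self.failures = {arm: 0 for arm in arms}

    def choose(self):
        # Sample a plausible conversion rate per arm from its Beta posterior
        # (uniform Beta(1, 1) prior) and serve the arm with the highest draw.
        draws = {
            arm: self.rng.betavariate(self.successes[arm] + 1, self.failures[arm] + 1)
            for arm in self.successes
        }
        return max(draws, key=draws.get)

    def update(self, arm, converted):
        if converted:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1


def simulate(true_rates, rounds, seed=0):
    """Run the sampler against made-up true conversion rates; return pulls per arm."""
    rng = random.Random(seed)
    sampler = ThompsonSampler(list(true_rates), rng=rng)
    pulls = {arm: 0 for arm in true_rates}
    for _ in range(rounds):
        arm = sampler.choose()
        sampler.update(arm, rng.random() < true_rates[arm])
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    print(simulate({"control": 0.05, "treatment": 0.20}, rounds=5000))
```

As evidence accumulates, most traffic drifts to the better arm. That is what lowers regret, and it is also why the resulting group sizes are unequal and data-dependent, which complicates a standard significance test.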
Booking.com's Experimentation Culture
Scenario
Booking.com wanted to make data-driven decisions across all product changes, from major ML model updates to button color changes. With over 200 product teams and 1,000+ engineers, they needed an experimentation platform that enabled every team to run rigorous experiments without deep statistics expertise.
Solution
Booking.com built an in-house experimentation platform that runs 1,000+ concurrent A/B tests at any time. The platform provides automated sample size calculation, experiment interaction detection via mutual exclusion groups, guardrail metric monitoring, and automated result reports with pre-computed statistical significance. Every product change, no matter how small, must pass an A/B test before full rollout. The platform uses sequential testing to enable early stopping.
Outcome
The experimentation culture enabled Booking.com to iterate faster while maintaining quality: the experiment-to-launch rate stabilized at ~10% (90% of tested ideas do not improve metrics, validating the need for testing). Revenue per visitor improved 25% over 3 years through the compound effect of hundreds of small, validated improvements. The platform reduced experiment setup time from days to minutes, enabling every engineer to run experiments independently.