Model serving is the infrastructure that takes a trained ML model and exposes it as a low-latency, high-throughput prediction endpoint. At scale, serving must handle batching, GPU memory management, model versioning, and graceful rollouts -- problems that look nothing like training.
Model serving is the bridge between data science and production systems. A trained model is a static artifact -- a set of weights and a computation graph -- but serving it at production scale requires solving a distinct set of systems engineering problems: how do you maximize GPU utilization when requests arrive at irregular intervals? How do you serve a 70B-parameter model that does not fit on a single GPU? How do you roll out a new model version without downtime or accuracy regressions?
The fundamental tension in model serving is between latency and throughput. A GPU approaches peak FLOPS only when operating on large tensors, which means batching multiple requests together. But batching increases latency for individual requests because each request must wait for the batch to fill. Naive static batching (wait for N requests or T milliseconds) runs small, underfilled batches during low traffic and lets requests queue behind full batches during spikes. For autoregressive generation it has a second cost: the whole batch runs until its longest sequence finishes, so slots freed by shorter sequences sit idle. Continuous batching (also called iteration-level or in-flight batching), pioneered by Orca and adopted by vLLM and TGI, addresses this by admitting new requests into the active batch at each decode step, which is typically reported to give 2-4x higher throughput than static batching.
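The scheduling difference can be illustrated with a toy step counter. This is a minimal sketch, not how Orca, vLLM, or TGI are implemented: it assumes every active request produces one token per decode step, that all requests are already queued, and that a request's output length is known in advance. The function names are made up for this illustration.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Total decode steps when each batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Total decode steps when free slots are refilled at every iteration."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots before this step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One decode step emits one token per active request;
        # requests that reach zero remaining tokens leave the batch.
        active = [r - 1 for r in active if r - 1 > 0]
        steps += 1
    return steps

# Hypothetical workload: two long generations mixed with short ones.
lengths = [10, 1, 1, 1, 10, 1, 1, 1]
print(static_batching_steps(lengths, batch_size=4))      # 20
print(continuous_batching_steps(lengths, batch_size=4))  # 11
```

In this workload the static scheduler spends 20 steps because each batch of four is held for the full 10 steps of its longest member, while the continuous scheduler finishes in 11 because the second long request starts as soon as a slot frees up.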
For LLM inference specifically, the KV-cache is the dominant memory bottleneck. Each token in the context window requires storing key and value tensors for every attention layer. For a 70B model with a 4K context window, that is on the order of 1-2 GB of KV-cache per concurrent request in FP16 when the model uses grouped-query attention, and several times more with full multi-head attention. PagedAttention (vLLM) manages KV-cache memory like an OS manages virtual memory -- using non-contiguous pages to eliminate fragmentation -- and improves serving throughput by 2-4x compared to naive contiguous allocation.
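The arithmetic behind the KV-cache figure, and the reason paging helps, can be sketched as follows. The architecture numbers (80 layers, 8 key/value heads, head dimension 128, 2 bytes per value) are an assumption chosen to resemble a Llama-2-70B-style model, and the block size of 16 tokens and the request lengths are illustrative. The helper names are hypothetical and are not vLLM APIs.

```python
import math

def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV-cache needed for one token of context."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

def contiguous_reserved_tokens(actual_lengths, max_len):
    """Naive allocation reserves max_len token slots per request up front."""
    return len(actual_lengths) * max_len

def paged_reserved_tokens(actual_lengths, block_size):
    """Paged allocation rounds each request up to a whole number of blocks."""
    return sum(math.ceil(n / block_size) * block_size for n in actual_lengths)

# Assumed Llama-2-70B-style configuration in FP16.
per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
print(per_token)                 # 327680 bytes per token
print(per_token * 4096 / 2**30)  # 1.25 GiB for a full 4K context

# Hypothetical mix of request lengths sharing one GPU.
lengths = [100, 900, 250, 4096]
print(contiguous_reserved_tokens(lengths, max_len=4096))  # 16384
print(paged_reserved_tokens(lengths, block_size=16))      # 5376
```

Under these assumptions, contiguous allocation reserves about three times as many token slots as the requests actually use, while paged allocation wastes at most one partially filled block per request. The freed memory is what lets more requests run concurrently.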
Multi-GPU and multi-node serving uses tensor parallelism (split a single layer across GPUs) and pipeline parallelism (split layers across GPUs in stages). Tensor parallelism reduces per-token latency because all GPUs compute simultaneously, but every layer ends with a collective exchange of activations, so it requires a high-bandwidth interconnect (NVLink, at up to 900 GB/s per GPU on recent hardware, rather than PCIe at roughly 64 GB/s). Pipeline parallelism tolerates lower bandwidth because only the activations at stage boundaries cross the link, but it adds latency for individual requests, since each token still passes through every stage in sequence. At the fleet level, a serving system must also handle model routing (send requests to the right model version), autoscaling (scale GPU instances based on queue depth, not CPU), and graceful draining (finish in-flight requests before shutting down a replica).
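The two ways of splitting a model can be shown on plain Python lists. This is only a sketch of the data flow: the "shards" and "stages" run sequentially in one process here, whereas on real hardware each would live on its own GPU, and the concatenation step would be an all-gather over the interconnect. Layers are reduced to bare matrix multiplications and all function names are hypothetical.

```python
def matmul(x, w):
    """Multiply a vector x (length d) by a matrix w (d rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, num_shards):
    """Give each shard a contiguous slice of the output columns."""
    step = len(w[0]) // num_shards  # assumes the column count divides evenly
    return [[row[s * step:(s + 1) * step] for row in w] for s in range(num_shards)]

def tensor_parallel_matmul(x, w, num_shards):
    # Each shard computes its own slice of the same layer's output.
    partials = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    # Concatenation stands in for the collective that reassembles the output.
    return [v for part in partials for v in part]

def pipeline_parallel_forward(x, layers, num_stages):
    # Each stage owns a contiguous group of layers and hands its
    # activations to the next stage.
    per_stage = len(layers) // num_stages  # assumes an even split
    for s in range(num_stages):
        for w in layers[s * per_stage:(s + 1) * per_stage]:
            x = matmul(x, w)
    return x

x = [1, 2]
w = [[1, 2, 3, 4], [5, 6, 7, 8]]
print(matmul(x, w))                     # [11, 14, 17, 20]
print(tensor_parallel_matmul(x, w, 2))  # [11, 14, 17, 20]
```

The tensor-parallel result matches the single-device result, but it needs a reassembly step per layer, which is why the interconnect matters. The pipeline version only passes `x` between stages, once per stage boundary.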
Serving a Recommendation Model
An e-commerce site deploys a product recommendation model. During off-peak hours, requests trickle in at 10 RPS; during flash sales, traffic spikes to 5,000 RPS. With static batching (batch size 32, timeout 50ms), off-peak requests wait 50ms for a batch that never fills, and peak traffic queues behind full batches. Switching to continuous batching, the system processes requests individually during off-peak (no wait) and dynamically forms large batches during peak (high GPU utilization). P50 latency drops from 60ms to 15ms off-peak, and peak throughput increases 3x without adding GPUs.
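The off-peak and peak behaviour of the static policy in this example follows from simple arithmetic. The sketch below assumes evenly spaced arrivals, which real traffic is not, and looks only at the time spent forming a batch, not at model execution time. The function names are made up for illustration.

```python
def static_batch_wait_ms(rps, batch_size, timeout_ms):
    """Wait seen by the first request in a batch before the batch is dispatched."""
    # Time for the remaining batch_size - 1 requests to arrive.
    fill_ms = (batch_size - 1) / rps * 1000.0
    # The batch is dispatched when it fills or when the timeout expires.
    return min(timeout_ms, fill_ms)

def requests_per_static_batch(rps, batch_size, timeout_ms):
    """Number of requests in the batch at dispatch time."""
    arrived = 1 + int(rps * timeout_ms / 1000.0)
    return min(batch_size, arrived)

# Off-peak: 10 RPS with batch size 32 and a 50 ms timeout.
print(static_batch_wait_ms(10, 32, 50))       # 50
print(requests_per_static_batch(10, 32, 50))  # 1

# Peak: 5,000 RPS with the same settings.
print(requests_per_static_batch(5000, 32, 50))  # 32
```

At 10 RPS the batch would take about 3.1 seconds to fill, so every request pays the full 50 ms timeout and is then executed in a batch of one. At 5,000 RPS the batch fills in about 6 ms, and the bottleneck shifts to how quickly full batches can be executed.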
OpenAI
OpenAI serves GPT-4 and GPT-4o across thousands of GPUs using a custom inference stack with continuous batching, speculative decoding, and aggressive KV-cache optimization. The system handles millions of concurrent conversations, routing requests to appropriate model versions and managing GPU memory across variable-length contexts from 4K to 128K tokens. Speculative decoding uses a smaller 'draft' model to predict future tokens, which the larger model verifies in parallel, improving tokens-per-second by 2-3x for certain workloads.
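The draft-and-verify loop of speculative decoding can be sketched in its simplest greedy form. This is a toy illustration and makes no claim about OpenAI's implementation: the "models" are hypothetical functions over integer tokens, the verification loop calls the target once per position where a real system would score all drafted positions in a single batched forward pass, and the bonus token a real implementation gains when every draft token is accepted is omitted.

```python
def greedy_decode(target_next, prompt, num_tokens):
    """Baseline: one target-model call per generated token."""
    seq = list(prompt)
    for _ in range(num_tokens):
        seq.append(target_next(seq))
    return seq

def speculative_decode(target_next, draft_next, prompt, num_tokens, k=4):
    """Greedy speculative decoding; returns (sequence, number of verify passes)."""
    seq = list(prompt)
    goal = len(prompt) + num_tokens
    passes = 0
    while len(seq) < goal:
        # The cheap draft model proposes up to k tokens autoregressively.
        proposal = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft_next(seq + proposal))
        # The target model checks the proposal; counted as one verify pass.
        passes += 1
        for tok in proposal:
            expected = target_next(seq)
            seq.append(expected)  # the target's own token is always the one kept
            if expected != tok:
                break  # first mismatch: discard the rest of the draft
    return seq, passes

# Hypothetical toy models: the target counts upward modulo 10;
# the draft agrees except right after token 4.
def toy_target(seq):
    return (seq[-1] + 1) % 10

def toy_draft(seq):
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10

seq, passes = speculative_decode(toy_target, toy_draft, [0], num_tokens=12)
print(seq == greedy_decode(toy_target, [0], 12))  # True
print(passes)                                     # 4
```

The output is identical to plain greedy decoding with the target model, but it takes 4 verify passes instead of 12 sequential target steps. The gain depends entirely on how often the draft agrees with the target, which is why the speedup is workload-dependent.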
Google (Vertex AI)
Google's Vertex AI prediction service supports autoscaling from zero to thousands of TPU/GPU replicas. It uses a custom continuous batching framework optimized for Gemini models and supports traffic splitting for A/B testing model versions. Google's internal serving stack processes over 10 billion predictions per day across Search, YouTube, and Ads, using a mix of TPU v5e for throughput-optimized workloads and TPU v5p for latency-sensitive ones.
Netflix
Netflix serves hundreds of ML models for recommendations, artwork personalization, and video encoding optimization. Their Metaflow-based serving infrastructure deploys models as containers with GPU sidecar accelerators, enabling zero-downtime model updates via blue-green deployment. Each model serves 200M+ member sessions, with strict p99 latency SLOs under 50ms to avoid impacting the browse experience.
| Aspect | Description |
|---|---|
| Latency vs. Throughput | Larger batch sizes improve GPU utilization and throughput (more predictions per GPU-second) but increase tail latency because each request waits for batch peers. For real-time APIs (search, chat), optimize for latency with small batches. For offline scoring (recommendations, feed ranking), optimize for throughput with large batches. |
| Accuracy vs. Speed (Quantization) | INT8 quantization halves memory relative to FP16 and roughly doubles throughput with ~0.5% accuracy loss. INT4 quarters memory relative to FP16 but can degrade accuracy 1-3% on reasoning tasks. The decision depends on the task: sentiment classification tolerates quantization well; code generation and math reasoning are more sensitive. |
| Single Large Model vs. Ensemble / Routing | A single large model is simpler to serve but wasteful for easy queries. Model routing (send simple queries to a small model, complex ones to a large model) can reduce serving cost 60-80% but adds routing latency and complexity, and the router itself must be fast and accurate. |
| Self-Hosted vs. Managed Inference | Self-hosted (vLLM on your GPUs) gives full control over batching, caching, and cost but requires GPU operations expertise. Managed services (OpenAI API, Bedrock, Vertex AI) abstract away GPU management but add per-token cost, vendor lock-in, and less control over tail latency. |
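The quantization row can be made concrete with a minimal sketch of symmetric per-tensor INT8 quantization. Production toolchains generally use per-channel or per-group scales and calibration data, so treat this as the simplest possible variant. The function names and the sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]

def max_abs_error(weights):
    """Largest round-trip error introduced by quantization."""
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))

weights = [0.5, -1.27, 0.013, 1.0]  # hypothetical weight values
q, scale = quantize_int8(weights)
print(q)                       # integers that fit in one byte each
print(max_abs_error(weights))  # bounded by scale / 2
```

Each weight is stored in one byte instead of the two bytes FP16 needs, which is where the halved memory comes from. The round-trip error per weight is at most half the scale, and whether that error matters depends on the task, as the table notes.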
Spotify's Migration from TensorFlow Serving to Triton
Scenario
Spotify served 50+ recommendation models on TensorFlow Serving, each running on dedicated GPU instances. GPU utilization averaged 15-25% because models had bursty traffic patterns -- high during evening listening hours, low overnight -- and TF Serving's static batching could not adapt efficiently.
Solution
Spotify migrated to NVIDIA Triton Inference Server with model-level concurrent execution and dynamic batching. They co-located multiple models on the same GPU using Triton's model repository, enabling GPU sharing across models with different traffic patterns. Autoscaling was switched from CPU-based metrics to GPU utilization and request queue depth.
Outcome
Average GPU utilization increased from 20% to 65%, reducing the GPU fleet by 55% (~$4M/year savings). P99 inference latency dropped from 45ms to 28ms due to dynamic batching's ability to form optimal batch sizes. Model deployment time decreased from 45 minutes to under 5 minutes using Triton's live model reloading.