
Model Serving & Inference

Model serving is the infrastructure that takes a trained ML model and exposes it as a low-latency, high-throughput prediction endpoint. At scale, serving must handle batching, GPU memory management, model versioning, and graceful rollouts -- problems that look nothing like training.

Overview

Model serving is the bridge between data science and production systems. A trained model is a static artifact -- a set of weights and a computation graph -- but serving it at production scale requires solving a distinct set of systems engineering problems: how do you maximize GPU utilization when requests arrive at irregular intervals? How do you serve a 70B-parameter model that does not fit on a single GPU? How do you roll out a new model version without downtime or accuracy regressions?

The fundamental tension in model serving is between latency and throughput. A GPU achieves peak FLOPS only when operating on large tensors, which means batching multiple requests together. But batching increases latency for individual requests because each request must wait for the batch to fill. Naive batching (wait for N requests or T milliseconds) wastes GPU cycles on small batches during low traffic and adds unacceptable latency during spikes. Continuous batching (also called iteration-level or inflight batching), pioneered by Orca and adopted by vLLM and TGI, solves this by inserting new requests into an active batch at each decode step of autoregressive generation, achieving 2-4x higher throughput than static batching.
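The difference between the two batching strategies can be illustrated with a toy scheduler. This is a minimal sketch under simplifying assumptions that are not from any real serving engine: every request is already queued at time zero, each decode step takes one time unit regardless of how many slots are occupied, and the function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, max_batch):
    """Decode steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        # Slots freed by short sequences sit idle until the batch completes.
        steps += max(lengths[i:i + max_batch])
    return steps


def continuous_batching_steps(lengths, max_batch):
    """Decode steps when freed slots are refilled before every iteration."""
    waiting = deque(lengths)  # tokens still to generate, per queued request
    active = []               # tokens still to generate, per in-flight request
    steps = 0
    while waiting or active:
        # Admit queued requests into any free slots before this decode step.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        # One step produces one token for every active sequence;
        # sequences that reach zero remaining tokens leave the batch.
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    lengths = [1, 4, 3]  # output lengths in tokens (illustrative values)
    print(static_batching_steps(lengths, max_batch=2))      # 7
    print(continuous_batching_steps(lengths, max_batch=2))  # 4
```

In this toy case the short request frees its slot after one step, and continuous batching immediately gives that slot to the third request instead of idling until the four-token sequence completes.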

For LLM inference specifically, the KV-cache is the dominant memory bottleneck. Each token in the context window requires storing key and value tensors for every attention layer, so cache size grows linearly with both context length and the number of concurrent requests. A 70B model with a 4K context window consumes on the order of 1-2 GB of KV-cache per concurrent request in FP16 when it uses grouped-query attention, and several times more with full multi-head attention. PagedAttention (vLLM) manages KV-cache memory like an OS manages virtual memory -- using non-contiguous pages to eliminate fragmentation -- and improves serving throughput by 2-4x compared to naive contiguous allocation.
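The per-request estimate above follows from simple arithmetic, sketched below. The layer and head counts in the usage example are assumptions chosen to resemble a 70B model with grouped-query attention; substitute the values from your own model's configuration. The function names are hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens,
                   bytes_per_value=2):
    """KV-cache size for one request; bytes_per_value=2 corresponds to FP16."""
    # The leading 2 accounts for one key tensor and one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * context_tokens * bytes_per_value)


def max_concurrent_requests(vram_bytes, weight_bytes, per_request_bytes,
                            reserve_bytes=0):
    """Requests that fit after weights and a reserve for activations/overhead."""
    usable = vram_bytes - weight_bytes - reserve_bytes
    return max(0, usable // per_request_bytes)


if __name__ == "__main__":
    # Assumed config: 80 layers, 8 KV heads (grouped-query attention),
    # head dimension 128, 4096-token context, FP16 values.
    size = kv_cache_bytes(80, 8, 128, 4096)
    print(size / 2**30, "GiB per request")  # 1.25 GiB per request
```

With full multi-head attention (64 KV heads instead of 8, all else equal) the same formula gives eight times as much, which is why the attention variant matters as much as parameter count when planning capacity.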

Multi-GPU and multi-node serving uses tensor parallelism (split each layer's weight matrices across GPUs) and pipeline parallelism (assign consecutive groups of layers to different GPUs as stages). Tensor parallelism reduces per-token latency because all GPUs compute on the same token simultaneously, but it synchronizes activations within every layer and so requires a high-bandwidth interconnect (NVLink at up to 900 GB/s per GPU, not PCIe Gen5 x16 at roughly 64 GB/s per direction). Pipeline parallelism only passes activations between stages, so it tolerates lower bandwidth but increases latency for individual requests. At the fleet level, a serving system must also handle model routing (send requests to the right model version), autoscaling (scale GPU instances based on queue depth, not CPU), and graceful draining (finish in-flight requests before shutting down a replica).
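To make the tensor-parallel split concrete, here is a minimal pure-Python sketch of a column-parallel linear layer. It is illustrative only: the "shards" are plain lists standing in for GPUs, the final concatenation stands in for an all-gather collective, and all function names are hypothetical.

```python
def matmul(x, w):
    """Multiply a vector x (length k) by a matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]


def split_columns(w, num_shards):
    """Give each shard a contiguous block of the layer's output columns."""
    n = len(w[0])
    if n % num_shards != 0:
        raise ValueError("output columns must divide evenly across shards")
    width = n // num_shards
    return [[row[s * width:(s + 1) * width] for row in w]
            for s in range(num_shards)]


def tensor_parallel_matmul(x, w, num_shards):
    """Compute x @ w with the weight matrix split column-wise across shards."""
    shards = split_columns(w, num_shards)
    # On real hardware each shard's matmul runs concurrently on its own GPU.
    partial_outputs = [matmul(x, shard) for shard in shards]
    # Concatenating the partial outputs stands in for the all-gather step.
    return [value for part in partial_outputs for value in part]


if __name__ == "__main__":
    x = [1, 2]
    w = [[1, 2, 3, 4],
         [5, 6, 7, 8]]
    print(tensor_parallel_matmul(x, w, num_shards=2))  # [11, 14, 17, 20]
```

Every shard needs the full input and must exchange its output with the others for each layer, which is the communication that makes interconnect bandwidth the limiting factor.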

Key Points
  • Continuous batching inserts new requests into an active inference batch at each decode iteration, achieving 2-4x higher throughput than static batching by keeping the GPU saturated even with variable-length sequences.
  • KV-cache is the memory bottleneck for LLM serving. PagedAttention (vLLM) stores the cache in fixed-size, non-contiguous blocks, which removes external fragmentation and confines internal fragmentation to the last block of each sequence, improving memory utilization from ~50% to ~95% and enabling more concurrent requests per GPU.
  • Tensor parallelism splits each layer's weight matrices across GPUs on the same node (in practice it needs NVLink-class bandwidth), reducing per-token latency. Pipeline parallelism assigns consecutive groups of layers to different GPUs or nodes, increasing throughput but adding latency. Large multi-node deployments typically combine both.
  • Quantization (FP16 to INT8 or INT4) reduces model memory footprint by 2-4x and increases inference throughput, at the cost of roughly 0.5-3% accuracy loss depending on the bit width, model, and task. GPTQ, AWQ, and SmoothQuant are common quantization methods.
  • Model serving autoscaling should use GPU-specific metrics (queue depth, batch utilization, GPU memory pressure) rather than CPU utilization. GPU instances take 2-5 minutes to start, so predictive scaling based on traffic patterns is essential.
  • Canary deployments for models compare accuracy metrics (not just HTTP error rates) between old and new versions. A model that returns 200 OK but produces garbage predictions is a silent failure that only model-quality monitoring can catch.
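The memory saving from quantization can be seen in a small round-trip example. The sketch below uses the simplest possible scheme, symmetric per-tensor absmax rounding to INT8. It is not GPTQ, AWQ, or SmoothQuant, which add calibration and error compensation on top of this basic idea. The function names are hypothetical, and the input is assumed to be a non-empty list of floats.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works, avoid dividing by zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]  # illustrative values
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(q, scale, worst)
```

Each value now needs one byte instead of two, and the round-trip error per weight is bounded by half the scale factor. Outlier weights inflate the scale and therefore the error for every other weight, which is the problem the more sophisticated methods target.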
Simple Example

Serving a Recommendation Model

An e-commerce site deploys a product recommendation model. During off-peak hours, requests trickle in at 10 RPS; during flash sales, traffic spikes to 5,000 RPS. With static batching (batch size 32, timeout 50ms), off-peak requests wait out the full 50ms timeout for a batch that never fills (at 10 RPS a new request arrives only about every 100ms), and peak traffic queues behind full batches. Switching to continuous-style batching, where the server dispatches whatever is queued as soon as the GPU is free instead of waiting for a batch to fill, the system processes requests individually during off-peak (no wait) and dynamically forms large batches during peak (high GPU utilization). P50 latency drops from 60ms to 15ms off-peak, and peak throughput increases 3x without adding GPUs.

Real-World Examples

OpenAI

OpenAI serves GPT-4 and GPT-4o across thousands of GPUs using a custom inference stack with continuous batching, speculative decoding, and aggressive KV-cache optimization. The system handles millions of concurrent conversations, routing requests to appropriate model versions and managing GPU memory across variable-length contexts from 4K to 128K tokens. Speculative decoding uses a smaller 'draft' model to predict future tokens, which the larger model verifies in parallel, improving tokens-per-second by 2-3x for certain workloads.
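The draft-and-verify idea behind speculative decoding can be sketched in a few lines. This is a toy greedy variant with hypothetical names, not a description of OpenAI's stack: both models are stand-in callables that map a token list to the next token, and the verification loop is written sequentially even though a real system scores all drafted positions in one batched forward pass. Sampling-based variants use a rejection-sampling rule instead of exact matching.

```python
def speculative_step(target_next, draft_next, context, k):
    """Return the tokens produced by one draft-then-verify round."""
    # Draft phase: the small model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))

    # Verify phase: keep draft tokens while they match the large model.
    accepted = []
    for token in proposal:
        expected = target_next(context + accepted)
        if token != expected:
            accepted.append(expected)  # correct the first mismatch and stop
            return accepted
        accepted.append(token)

    # Every draft token matched, so the verify pass yields one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Generate tokens; the result matches plain greedy decoding with target_next."""
    output = list(prompt)
    while len(output) - len(prompt) < max_new_tokens:
        output.extend(speculative_step(target_next, draft_next, output, k))
    return output[:len(prompt) + max_new_tokens]
```

Each round emits between one and k+1 tokens for a single large-model pass, so the speedup depends on how often the draft model agrees with the large one.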

Google (Vertex AI)

Google's Vertex AI prediction service supports autoscaling from zero to thousands of TPU/GPU replicas. It uses a custom continuous batching framework optimized for Gemini models and supports traffic splitting for A/B testing model versions. Google's internal serving stack processes over 10 billion predictions per day across Search, YouTube, and Ads, using a mix of TPU v5e for throughput-optimized workloads and TPU v5p for latency-sensitive ones.

Netflix

Netflix serves hundreds of ML models for recommendations, artwork personalization, and video encoding optimization. Their Metaflow-based serving infrastructure deploys models as containers with GPU sidecar accelerators, enabling zero-downtime model updates via blue-green deployment. Each model serves 200M+ member sessions, with strict p99 latency SLOs under 50ms to avoid impacting the browse experience.

Trade-Offs
Aspect | Description
Latency vs. Throughput | Larger batch sizes improve GPU utilization and throughput (more predictions per GPU-second) but increase tail latency because each request waits for batch peers. For real-time APIs (search, chat), optimize for latency with small batches. For offline scoring (recommendations, feed ranking), optimize for throughput with large batches.
Accuracy vs. Speed (Quantization) | INT8 quantization halves memory and roughly doubles throughput with ~0.5% accuracy loss. INT4 quarters memory but can degrade accuracy 1-3% on reasoning tasks. The decision depends on the task: sentiment classification tolerates quantization well; code generation and math reasoning are more sensitive.
Single Large Model vs. Ensemble / Routing | A single large model is simpler to serve but wasteful for easy queries. Model routing (send simple queries to a small model, complex ones to a large model) can reduce serving cost 60-80% but adds routing latency and complexity, and the router itself must be fast and accurate.
Self-Hosted vs. Managed Inference | Self-hosted (vLLM on your GPUs) gives full control over batching, caching, and cost but requires GPU operations expertise. Managed services (OpenAI API, Bedrock, Vertex AI) abstract away GPU management but add per-token cost, vendor lock-in, and less control over latency tail.
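The routing savings quoted above depend on two numbers: how much traffic the small model can absorb and how much cheaper it is. The arithmetic can be sketched as follows. The function names are hypothetical and the values in the usage example (80% of queries routed to a model costing one tenth as much) are illustrative assumptions, not measurements.

```python
def blended_cost(fraction_to_small, small_cost, large_cost, router_cost=0.0):
    """Average cost per query when a router splits traffic between two models."""
    return (router_cost
            + fraction_to_small * small_cost
            + (1.0 - fraction_to_small) * large_cost)


def routing_savings(fraction_to_small, small_cost, large_cost, router_cost=0.0):
    """Fractional cost reduction versus sending every query to the large model."""
    routed = blended_cost(fraction_to_small, small_cost, large_cost, router_cost)
    return 1.0 - routed / large_cost


if __name__ == "__main__":
    # Assumed: 80% of queries are simple, small model costs 0.1x the large one.
    print(round(routing_savings(0.8, 0.1, 1.0), 2))  # 0.72
```

Under these assumptions the saving falls inside the 60-80% range, but it shrinks quickly if the router is expensive or if misrouted queries have to be retried on the large model.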
Case Study

Spotify's Migration from TensorFlow Serving to Triton

Scenario

Spotify served 50+ recommendation models on TensorFlow Serving, each running on dedicated GPU instances. GPU utilization averaged 15-25% because models had bursty traffic patterns -- high during evening listening hours, low overnight -- and TF Serving's static batching could not adapt efficiently.

Solution

Spotify migrated to NVIDIA Triton Inference Server with model-level concurrent execution and dynamic batching. They co-located multiple models on the same GPU using Triton's model repository, enabling GPU sharing across models with different traffic patterns. Autoscaling was switched from CPU-based metrics to GPU utilization and request queue depth.
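A queue-depth-based scaling rule of the kind described here might look like the following. This is a generic sketch with hypothetical names and parameters, not Spotify's actual policy or a Triton API: it sizes the fleet so that queued plus in-flight requests fit within an assumed per-replica capacity.

```python
import math


def desired_replicas(queue_depth, in_flight, per_replica_capacity,
                     min_replicas=1, max_replicas=100):
    """Replica count needed to hold current demand, clamped to fleet limits."""
    if per_replica_capacity <= 0:
        raise ValueError("per_replica_capacity must be positive")
    demand = queue_depth + in_flight
    needed = math.ceil(demand / per_replica_capacity)
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    # 120 queued + 40 in flight, each replica assumed to handle 32 at once.
    print(desired_replicas(120, 40, per_replica_capacity=32))  # 5
```

A production policy would typically smooth the input over a time window and scale down more slowly than it scales up, since GPU replicas are slow to start.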

Outcome

Average GPU utilization increased from 20% to 65%, reducing the GPU fleet by 55% (~$4M/year savings). P99 inference latency dropped from 45ms to 28ms due to dynamic batching's ability to form optimal batch sizes. Model deployment time decreased from 45 minutes to under 5 minutes using Triton's live model reloading.

Common Mistakes
  • Using CPU-based autoscaling for GPU workloads. GPU instances often show low, flat CPU usage during inference because the CPU is mostly dispatching work to the GPU. Scaling on CPU can mean the system never scales up, even while requests queue. Use GPU utilization, request queue depth, or custom batch-fill-rate metrics instead.
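  • Ignoring KV-cache memory when capacity planning. A 13B model uses ~26 GB of GPU memory for weights in FP16, and each concurrent request adds 200-500 MB of KV-cache. On a 40 GB GPU that leaves about 14 GB for the cache, which is at most ~28 requests at 500 MB each before counting activations and framework overhead. Planning for 50 concurrent requests because the model 'fits' in 40 GB of VRAM can therefore OOM once concurrency reaches the 20s.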
  • Treating model rollout like code rollout. A new model version that passes unit tests can still produce subtly wrong predictions. Always canary new models with a shadow or split-traffic deployment, comparing prediction quality metrics (AUC, NDCG, calibration) against the incumbent model before full rollout.
  • No request prioritization or fairness. Without priority queues, a burst of low-priority batch-scoring requests can starve latency-sensitive real-time predictions. Implement request classes with dedicated queue slots or preemption to protect SLOs.
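The last point, request prioritization, can be sketched with a standard-library heap. This is a minimal illustration with hypothetical class and method names. It implements strict priority only, so a real system would likely add aging or per-class quotas to keep low-priority work from starving in turn.

```python
import heapq
import itertools


class PriorityRequestQueue:
    """Queue that hands out latency-sensitive requests before batch work."""

    def __init__(self):
        self._heap = []
        # Monotonic counter breaks ties so equal-priority requests stay FIFO.
        self._counter = itertools.count()

    def submit(self, request, priority):
        """Enqueue a request; lower priority numbers are served first."""
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def next_batch(self, max_batch):
        """Pop up to max_batch requests in priority order."""
        batch = []
        while self._heap and len(batch) < max_batch:
            batch.append(heapq.heappop(self._heap)[2])
        return batch


if __name__ == "__main__":
    queue = PriorityRequestQueue()
    queue.submit("batch-score-1", priority=1)
    queue.submit("batch-score-2", priority=1)
    queue.submit("realtime-1", priority=0)
    print(queue.next_batch(max_batch=2))  # ['realtime-1', 'batch-score-1']
```

The real-time request jumps ahead of the batch-scoring requests that arrived earlier, while the two batch requests keep their arrival order relative to each other.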
Test Your Understanding

1. What is the primary advantage of continuous batching over static batching for LLM inference?

2. Why is PagedAttention important for LLM serving at scale?
