
GPU Orchestration

GPU orchestration manages the scheduling, allocation, and lifecycle of GPU resources across training and inference workloads. Unlike CPUs, GPUs are expensive (roughly $2-$4/hour to rent a single H100, $25-$35/hour for an 8-GPU node), scarce, and require topology-aware scheduling to maximize interconnect bandwidth between co-located GPUs.

Overview

GPU orchestration is the systems-level challenge of efficiently allocating the most expensive resource in modern computing. A single NVIDIA H100 GPU costs $25,000-$40,000 to purchase and $2-$4/hour to rent. An 8-GPU DGX H100 node runs $25-$35/hour. At scale, organizations operate thousands of GPUs costing millions of dollars per month, so even a 10% improvement in GPU utilization is worth hundreds of thousands of dollars per month.

The fundamental difference between GPU and CPU orchestration is topology sensitivity. GPUs are not interchangeable: the eight GPUs in a DGX node are connected to each other by NVLink (900 GB/s bidirectional per GPU on H100), while GPUs in different nodes communicate over InfiniBand (400 Gbps per link, about 50 GB/s). A distributed training job that needs 16 GPUs will perform dramatically differently depending on whether it gets 2 full 8-GPU nodes (NVLink within each node, InfiniBand only between the two) or 16 GPUs scattered two per node across 8 nodes (InfiniBand between most pairs). Topology-aware scheduling -- assigning GPUs that share high-bandwidth interconnects -- is essential for training workloads.
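As a rough illustration of topology-aware placement, the sketch below packs a job onto as few nodes as possible so that most GPU pairs communicate over NVLink. It is a simplified greedy heuristic, not the algorithm of any particular scheduler; the function name `place_job` and the node names are hypothetical, and the sketch ignores switch-level topology between nodes.

```python
from typing import Dict, Optional


def place_job(free_gpus: Dict[str, int], gpus_needed: int) -> Optional[Dict[str, int]]:
    """Greedy placement that spans as few nodes as possible.

    free_gpus maps a (hypothetical) node name to its count of free GPUs.
    Returns a {node: gpus_taken} plan, or None if the job cannot fit.
    """
    if sum(free_gpus.values()) < gpus_needed:
        return None

    # Prefer a single node that holds the whole job. Pick the tightest fit so
    # that fully free nodes stay available for larger jobs.
    fitting = [n for n, free in free_gpus.items() if free >= gpus_needed]
    if fitting:
        best = min(fitting, key=lambda n: (free_gpus[n], n))
        return {best: gpus_needed}

    # Otherwise take the nodes with the most free GPUs first, which minimizes
    # the number of nodes (and inter-node links) the job spans.
    plan: Dict[str, int] = {}
    remaining = gpus_needed
    for node in sorted(free_gpus, key=lambda n: (-free_gpus[n], n)):
        if remaining == 0:
            break
        take = min(free_gpus[node], remaining)
        if take > 0:
            plan[node] = take
            remaining -= take
    return plan


cluster = {"node-a": 8, "node-b": 8, "node-c": 2, "node-d": 2}
print(place_job(cluster, 16))  # {'node-a': 8, 'node-b': 8}
```

A production scheduler would also weigh which nodes share an InfiniBand switch; this sketch only captures the within-node preference.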

GPU sharing is another critical challenge. A single A100 (80 GB) running a 1B-parameter model for inference uses only ~4 GB of VRAM and 10-20% of compute. Without sharing, 80% or more of the GPU's compute and about 95% of its memory sit idle. Multi-Instance GPU (MIG) on A100/H100 partitions a physical GPU into up to 7 isolated instances, each with dedicated compute and memory. Time-slicing (configured through the NVIDIA GPU Operator) shares a GPU across pods using context switching, which is simpler but provides weaker isolation. For ML inference, GPU sharing can improve fleet utilization from 20-30% to 60-80%.
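To make the MIG trade-off concrete, here is a small sketch that picks the smallest MIG profile able to hold a model. The profile table lists the standard A100 80 GB profiles with nominal memory sizes (usable memory per instance is somewhat lower in practice); the helper name `smallest_profile` is hypothetical.

```python
from typing import Optional

# MIG profiles on an A100 80 GB: (name, nominal memory in GB, compute slices of 7).
A100_80GB_PROFILES = [
    ("1g.10gb", 10, 1),
    ("2g.20gb", 20, 2),
    ("3g.40gb", 40, 3),
    ("4g.40gb", 40, 4),
    ("7g.80gb", 80, 7),
]


def smallest_profile(mem_gb: float, compute_slices: int = 1) -> Optional[str]:
    """Return the first (smallest) profile satisfying both requirements."""
    for name, mem, slices in A100_80GB_PROFILES:
        if mem >= mem_gb and slices >= compute_slices:
            return name
    return None  # does not fit on a single A100 80 GB


print(smallest_profile(4))   # 1g.10gb
print(smallest_profile(14))  # 2g.20gb
```

Under these assumptions, the ~4 GB inference model from the text fits in a `1g.10gb` instance, so up to seven such models can share one physical GPU.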

Gang scheduling -- allocating all GPUs for a distributed training job atomically -- prevents deadlocks. Suppose two jobs each need 8 GPUs and only 8 are free. A naive scheduler that places workers one at a time might give 6 GPUs to the first job and 2 to the second. Neither job can start, and both are stuck holding partial allocations while waiting for the other's GPUs. Gang schedulers (Volcano, the Kubernetes Coscheduling plugin) ensure all-or-nothing allocation. Priority and preemption policies determine which workloads get GPUs when demand exceeds supply: latency-sensitive inference jobs typically preempt batch training jobs, which can resume from checkpoints.
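The difference between per-pod and all-or-nothing allocation can be sketched in a few lines. This is a toy model, not how Volcano or any real scheduler is implemented: each pod is assumed to need exactly one GPU, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def naive_schedule(free_gpus: int, pod_queue: List[str]) -> Dict[str, int]:
    """Place one-GPU pods in queue order with no notion of a job."""
    held: Counter = Counter()
    for job in pod_queue:
        if free_gpus == 0:
            break
        held[job] += 1
        free_gpus -= 1
    return dict(held)


def gang_schedule(free_gpus: int, pod_queue: List[str]) -> Dict[str, int]:
    """Admit a job only if every one of its pods can be placed at once."""
    needed = Counter(pod_queue)  # pods (GPUs) required per job
    held: Dict[str, int] = {}
    for job in dict.fromkeys(pod_queue):  # jobs in first-arrival order
        if needed[job] <= free_gpus:
            held[job] = needed[job]
            free_gpus -= needed[job]
    return held


# Two 8-GPU jobs whose pods arrive interleaved, with only 8 GPUs free.
queue = ["job-a", "job-b"] * 8
print(naive_schedule(8, queue))  # {'job-a': 4, 'job-b': 4} -> neither can run
print(gang_schedule(8, queue))   # {'job-a': 8} -> job-b waits, holding nothing
```

In the naive case all 8 GPUs are occupied yet no training progresses; in the gang case one job runs and the other stays queued without holding resources.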

Cost management strategies include spot/preemptible instances (60-90% savings for fault-tolerant training), right-sizing (matching GPU type to workload -- T4 for small inference, A100 for training, H100 for LLM workloads), and quota management (allocating GPU budgets to teams with chargeback). Fleet-level metrics like GPU-hours utilized / GPU-hours allocated reveal waste, and automated reclamation (deallocating idle GPUs after 15 minutes) prevents hoarding.
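The fleet-level metric and the idle-reclamation rule can be expressed as follows. This is a minimal sketch: the `Allocation` record and its field names are hypothetical, and a real system would derive them from telemetry rather than hold them in memory.

```python
from dataclasses import dataclass
from typing import List

IDLE_LIMIT_S = 15 * 60  # reclaim after 15 idle minutes, as described above


@dataclass
class Allocation:
    owner: str
    allocated_gpu_hours: float
    utilized_gpu_hours: float
    idle_seconds: int  # time since the allocation last did useful work


def fleet_utilization(allocs: List[Allocation]) -> float:
    """GPU-hours utilized divided by GPU-hours allocated."""
    allocated = sum(a.allocated_gpu_hours for a in allocs)
    if allocated == 0:
        return 0.0
    return sum(a.utilized_gpu_hours for a in allocs) / allocated


def to_reclaim(allocs: List[Allocation], idle_limit_s: int = IDLE_LIMIT_S) -> List[str]:
    """Owners whose allocations have been idle for at least the limit."""
    return [a.owner for a in allocs if a.idle_seconds >= idle_limit_s]


allocs = [
    Allocation("team-a", 100.0, 80.0, idle_seconds=0),
    Allocation("team-b", 100.0, 20.0, idle_seconds=20 * 60),
]
print(fleet_utilization(allocs))  # 0.5
print(to_reclaim(allocs))         # ['team-b']
```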

Key Points
  • 1. Topology-aware scheduling places distributed training jobs on GPUs with high-bandwidth interconnects (NVLink within a node, InfiniBand between nodes). Ignoring topology can reduce training throughput by 2-5x due to gradient synchronization bottlenecks.
  • 2. Gang scheduling allocates all required GPUs for a job atomically (all-or-nothing), preventing deadlocks where multiple jobs each hold partial allocations waiting for each other's resources.
  • 3. Multi-Instance GPU (MIG) partitions a single A100/H100 into up to 7 isolated instances with dedicated compute and memory. This enables multiple small inference models to share a GPU with hardware-level isolation, improving fleet utilization from 20-30% to 60-80%.
  • 4. Preemption policies allow high-priority inference workloads to reclaim GPUs from lower-priority training jobs. Training jobs must checkpoint frequently to make preemption practical (resume from checkpoint rather than restart from scratch).
  • 5. GPU memory (VRAM) is usually the binding constraint, not compute FLOPS. An 80 GB A100 can fit a 7B model in FP16 (14 GB weights + KV-cache) but will OOM trying to train a 13B model without model parallelism. Capacity planning must account for both weights and dynamic memory (activations, KV-cache, optimizer states).
  • 6. Spot/preemptible GPUs save 60-90% but require fault-tolerant workloads. Training pipelines with frequent checkpointing are good candidates; latency-sensitive inference endpoints are not (preemption causes request failures).
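The VRAM figures in the fifth point can be checked with back-of-the-envelope arithmetic. The sketch below assumes a common rule of thumb that is not stated in the text: about 16 bytes per parameter for mixed-precision training with Adam (FP16 weights and gradients, plus FP32 master weights and two optimizer moments), before counting activations. The function names are hypothetical.

```python
GB = 1e9  # decimal gigabytes, matching how GPU memory sizes are quoted


def inference_weights_gb(params: float, bytes_per_param: int = 2) -> float:
    """Weight memory for inference; 2 bytes per parameter in FP16."""
    return params * bytes_per_param / GB


def training_state_gb(params: float, bytes_per_param: int = 16) -> float:
    """Weights + gradients + Adam state under the assumed 16 bytes/param.

    Activations are excluded, so real usage is higher than this estimate.
    """
    return params * bytes_per_param / GB


def fits(required_gb: float, vram_gb: float = 80.0) -> bool:
    return required_gb <= vram_gb


print(inference_weights_gb(7e9))     # 14.0 -> fits on an 80 GB A100
print(training_state_gb(13e9))       # 208.0 -> exceeds 80 GB even before activations
print(fits(training_state_gb(13e9))) # False
```

Under this assumption, training a 13B model needs its state sharded across at least three 80 GB GPUs, which is why model parallelism or optimizer-state sharding is required.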
Simple Example

Scheduling a Training Job on a Shared Cluster

A team submits a training job requiring 16 GPUs for 8 hours. The GPU orchestrator checks topology: 2 full 8-GPU nodes on the same InfiniBand switch are available. It gang-schedules the job across both nodes, ensuring NVLink connectivity within each node and InfiniBand between them. Four hours in, a critical inference deployment needs 8 GPUs urgently. The orchestrator preempts the training job (which last checkpointed 15 minutes earlier), frees the GPUs for inference, and re-queues the training job. When 16 GPUs become available again, the training job resumes from its last checkpoint, losing only 15 minutes of compute.

Real-World Examples

Meta (Research SuperCluster)

Meta's Research SuperCluster (RSC) uses 16,000 A100 GPUs connected via InfiniBand for training large language models. A custom scheduler handles topology-aware placement, ensuring multi-node training jobs land on GPUs within the same InfiniBand fabric segment. Fair-share scheduling allocates GPU quotas across research teams, with priority boosts for time-sensitive projects. GPU utilization is tracked at 15-second granularity and idle GPUs are reclaimed after 10 minutes.

Google (Borg + TPU)

Google's Borg scheduler manages TPU pod allocations for training workloads like Gemini. TPU pods (up to 8,960 chips in a v5p pod) require atomic allocation and ICI (Inter-Chip Interconnect) topology awareness. Borg implements priority-based preemption: production serving preempts batch training, which preempts research experiments. Preempted jobs automatically resume from the latest checkpoint stored on Colossus (Google's distributed file system).

Microsoft (Azure ML)

Azure ML manages tens of thousands of GPUs for internal (Bing, Office AI) and external customers. The scheduler supports MIG partitioning for inference workloads, gang scheduling for distributed training, and a quota system with billing per GPU-second. Azure's ND H100 v5 instances use InfiniBand for multi-node training, and the scheduler preferentially co-locates pods on the same InfiniBand leaf switch to minimize communication latency.

Trade-Offs
Aspect | Description
Utilization vs. Isolation | GPU sharing (MIG, time-slicing) improves utilization from 20-30% to 60-80% but introduces noisy-neighbor risks. MIG provides hardware isolation but limits partitioning options (fixed slice sizes). Time-slicing is flexible but one pod's CUDA kernel can starve others of compute.
Scheduling Latency vs. Placement Quality | Optimal topology-aware placement requires solving a bin-packing problem (NP-hard). Simple first-fit scheduling is fast but produces suboptimal placement. Sophisticated solvers improve placement quality but add scheduling latency (seconds to minutes), which matters for autoscaling inference where pods must start quickly.
Dedicated vs. Shared Clusters | Dedicated GPU clusters per team ensure predictable performance and simplify scheduling but lead to low utilization (each team provisions for peak). Shared clusters improve utilization through statistical multiplexing but require quota management, preemption policies, and chargeback systems.
On-Premises vs. Cloud GPUs | On-premises GPUs (purchased hardware) have lower long-term cost for sustained utilization (>60%) but require capital investment, physical infrastructure, and 3-6 month lead times. Cloud GPUs offer instant scaling and no capital cost but are 2-3x more expensive over 3 years for sustained workloads.
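The on-premises versus cloud trade-off comes down to a break-even utilization, which can be estimated as below. The purchase price and rental rate are taken from the ranges in the Overview; the on-premises operating cost per hour (power, cooling, staff) is a purely hypothetical input, and the function names are illustrative.

```python
HOURS_PER_YEAR = 8760


def cloud_cost(rate_per_hour: float, utilization: float, years: float) -> float:
    """Cloud GPUs are paid for only while in use."""
    return rate_per_hour * HOURS_PER_YEAR * years * utilization


def on_prem_cost(purchase_price: float, opex_per_hour: float, years: float) -> float:
    """Owned hardware costs the same whether or not it is busy."""
    return purchase_price + opex_per_hour * HOURS_PER_YEAR * years


def break_even_utilization(purchase_price: float, opex_per_hour: float,
                           rate_per_hour: float, years: float) -> float:
    """Utilization above which owning is cheaper than renting."""
    full_cloud = rate_per_hour * HOURS_PER_YEAR * years
    return on_prem_cost(purchase_price, opex_per_hour, years) / full_cloud


# One H100 over 3 years: $30,000 purchase, $3/hour rental,
# and an assumed (hypothetical) $0.50/hour operating cost.
u = break_even_utilization(30_000, 0.50, 3.0, 3)
print(f"break-even utilization: {u:.0%}")
```

The result is sensitive to the assumed operating cost and rental rate, so treat it as a template for plugging in real quotes rather than a general conclusion.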
Case Study

Spotify's GPU Cluster Consolidation

Scenario

Spotify operated separate GPU clusters for training (Kubeflow on GKE) and inference (custom deployment on GCE VMs). The training cluster was idle 40% of the time (nights and weekends), while the inference cluster ran at 25% utilization (small models on large GPUs). Combined, the fleet ran at ~30% utilization, wasting $1.5M/year in GPU spend.

Solution

Spotify consolidated both workloads onto a single GKE cluster with the NVIDIA GPU Operator, MIG for inference workloads, and Volcano for gang-scheduling training jobs. Priority classes ensured inference pods preempted training pods when needed. Training pipelines were modified to checkpoint every 15 minutes and tolerate preemption. MIG sliced A100s into 7 instances for small inference models.

Outcome

Fleet-wide GPU utilization increased from 30% to 68%. Annual GPU spend decreased by $900K. Inference latency was unaffected because priority preemption guaranteed GPU availability for serving workloads. Training job completion times increased by ~5% due to occasional preemptions, which was acceptable given the cost savings.

Common Mistakes
  • Ignoring GPU topology when scheduling distributed training. Placing 8 GPUs across 8 different nodes (PCIe/Ethernet interconnect) instead of on 1 node (NVLink) can reduce training throughput by 3-5x. Always use topology-aware scheduling for multi-GPU training jobs.
  • No GPU memory monitoring or capacity planning. Teams request 'one GPU' without considering VRAM requirements. A model that uses 60 GB of VRAM during training (weights + optimizer states + activations) will OOM on a 40 GB A100 but work on an 80 GB A100. Require VRAM estimates in job specifications.
  • Treating GPU allocation like CPU allocation. CPU overcommit (allocating more vCPUs than physical cores) works because most workloads are I/O-bound and leave cores idle. GPU overcommit fails because ML workloads are compute-bound and keep the GPU busy, and a job that exceeds available VRAM typically fails with an out-of-memory error rather than slowing down. Treat allocated GPU memory and compute as hard limits, not soft targets.
  • No preemption or priority system. Without priorities, a batch training experiment submitted on Friday can hold GPUs all weekend, blocking a critical model retrain on Monday. Implement at least three priority levels (serving > production training > experiments) with automatic preemption.
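The three-tier priority scheme from the last point can be sketched as a victim-selection routine. This is a simplified model under stated assumptions: the tier names, the `Job` record, and `pick_victims` are hypothetical, and the sketch assumes preempted jobs checkpoint and re-queue on their own.

```python
from dataclasses import dataclass
from typing import List, Optional

# Higher number wins: serving > production training > experiments.
PRIORITY = {"serving": 3, "production-training": 2, "experiment": 1}


@dataclass
class Job:
    name: str
    tier: str
    gpus: int


def pick_victims(running: List[Job], incoming: Job, free_gpus: int) -> Optional[List[Job]]:
    """Return jobs to preempt so `incoming` fits, or None if it cannot fit.

    Only strictly lower-priority jobs are eligible, lowest tier first.
    """
    shortfall = incoming.gpus - free_gpus
    if shortfall <= 0:
        return []  # enough free GPUs, nothing to preempt
    candidates = sorted(
        (j for j in running if PRIORITY[j.tier] < PRIORITY[incoming.tier]),
        key=lambda j: PRIORITY[j.tier],
    )
    victims: List[Job] = []
    for job in candidates:
        if shortfall <= 0:
            break
        victims.append(job)
        shortfall -= job.gpus
    return victims if shortfall <= 0 else None


running = [Job("retrain", "production-training", 8), Job("sweep", "experiment", 8)]
victims = pick_victims(running, Job("api", "serving", 8), free_gpus=0)
print([v.name for v in victims])  # ['sweep']
```

Here the serving job displaces the weekend experiment and leaves the production retrain untouched, which is the behavior the priority levels are meant to guarantee.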
Test Your Understanding

1. Why is topology-aware scheduling important for distributed GPU training?

2. What problem does gang scheduling solve for GPU workloads?
