GPU orchestration manages the scheduling, allocation, and lifecycle of GPU resources across training and inference workloads. Unlike CPUs, GPUs are expensive ($2-$30/hour per GPU), scarce, and require topology-aware scheduling to maximize interconnect bandwidth between co-located GPUs.
GPU orchestration is the systems-level challenge of efficiently allocating the most expensive resource in modern computing. A single NVIDIA H100 GPU costs $25,000-$40,000 to purchase and $2-$4/hour to rent. An 8-GPU DGX H100 node rents for $25-$35/hour. At scale, organizations operate thousands of GPUs costing millions of dollars per month, so even a 10% improvement in GPU utilization translates to hundreds of thousands of dollars in monthly savings.
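The savings claim can be checked with a little arithmetic. The sketch below assumes a hypothetical fleet of 2,000 rented GPUs at $3/hour and reads "10% improvement" as a relative gain (50% to 55% utilization); the fleet size, rate, and utilization figures are illustrative assumptions, not measurements.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def monthly_fleet_cost(num_gpus: int, hourly_rate: float) -> float:
    """Rental cost of running num_gpus around the clock for one month."""
    return num_gpus * hourly_rate * HOURS_PER_MONTH


def monthly_savings(num_gpus: int, hourly_rate: float,
                    old_util: float, new_util: float) -> float:
    """Savings from doing the same useful work at higher utilization.

    Useful GPU-hours stay fixed, so the fleet (and its bill) can shrink
    by the factor old_util / new_util.
    """
    cost = monthly_fleet_cost(num_gpus, hourly_rate)
    return cost * (1 - old_util / new_util)


# Hypothetical fleet: 2,000 GPUs at $3/hour, utilization 50% -> 55%.
cost = monthly_fleet_cost(2000, 3.00)
saved = monthly_savings(2000, 3.00, old_util=0.50, new_util=0.55)
print(f"monthly cost: ${cost:,.0f}, monthly savings: ${saved:,.0f}")
```

Under these assumptions the fleet costs about $4.4M per month and the utilization gain frees roughly $400K of that.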
The fundamental difference between GPU and CPU orchestration is topology sensitivity. Eight GPUs in a DGX node are not interchangeable with eight GPUs spread across the cluster -- they are connected by NVLink (900 GB/s bidirectional on H100) within the node and by InfiniBand (400 Gbps, or 50 GB/s, per link) between nodes. A distributed training job that needs 16 GPUs will perform dramatically differently depending on whether it gets 2 full 8-GPU nodes (all-to-all NVLink within each node) or 16 GPUs scattered two per node across 8 nodes (InfiniBand between almost all pairs). Topology-aware scheduling -- assigning GPUs that share high-bandwidth interconnects -- is essential for training workloads.
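A minimal sketch of the idea, assuming the scheduler only knows how many GPUs are free on each node: a greedy policy fills the emptiest nodes first so the job spans as few nodes as possible, and a helper reports what fraction of GPU pairs end up on the same node (and so can talk over NVLink). The node names and the `place` and `same_node_pair_fraction` helpers are hypothetical; production schedulers also weigh switch locality and fragmentation.

```python
from typing import Dict, Optional


def place(free_gpus: Dict[str, int], needed: int) -> Optional[Dict[str, int]]:
    """Greedy placement: take GPUs from the nodes with the most free GPUs
    first, which minimizes the number of nodes the job spans."""
    if sum(free_gpus.values()) < needed:
        return None  # not enough capacity anywhere
    placement: Dict[str, int] = {}
    for node, free in sorted(free_gpus.items(), key=lambda kv: -kv[1]):
        if needed == 0:
            break
        take = min(free, needed)
        if take:
            placement[node] = take
            needed -= take
    return placement


def same_node_pair_fraction(placement: Dict[str, int]) -> float:
    """Fraction of GPU pairs in the job that share a node (NVLink-reachable)."""
    total = sum(placement.values())
    all_pairs = total * (total - 1) // 2
    local_pairs = sum(n * (n - 1) // 2 for n in placement.values())
    return local_pairs / all_pairs if all_pairs else 1.0


free = {"node-1": 8, "node-2": 2, "node-3": 8, "node-4": 2}
packed = place(free, 16)                              # two full nodes
scattered = {f"node-{i}": 2 for i in range(8)}        # two GPUs on each of 8 nodes
print(packed, same_node_pair_fraction(packed))
print(same_node_pair_fraction(scattered))
```

With two full nodes, 56 of the 120 GPU pairs share a node; with two GPUs per node across eight nodes, only 8 of 120 do, so nearly all traffic crosses InfiniBand.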
GPU sharing is another critical challenge. A single A100 (80 GB) running a 1B-parameter model for inference uses only ~4 GB of VRAM (about 5% of memory) and 10-20% of compute. Without sharing, 80% or more of the GPU sits idle. Multi-Instance GPU (MIG) on A100/H100 partitions a physical GPU into up to 7 isolated instances, each with dedicated compute and memory. Time-slicing (configured through the NVIDIA GPU Operator) shares a GPU across pods using context switching, which is simpler but provides weaker isolation. For ML inference, GPU sharing can improve fleet utilization from 20-30% to 60-80%.
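To make the sharing benefit concrete, here is a rough sketch that assumes each replica of the ~4 GB model gets one `1g.10gb` MIG instance (seven fit on an A100 80 GB). The replica count and helper names are illustrative assumptions.

```python
import math

A100_MEMORY_GB = 80
MIG_SLICES_PER_GPU = 7   # seven 1g.10gb instances on an A100 80 GB


def gpus_needed(replicas: int, replicas_per_gpu: int) -> int:
    """Physical GPUs required to host the given number of model replicas."""
    return math.ceil(replicas / replicas_per_gpu)


def memory_fraction_used(replicas_per_gpu: int, model_gb: float = 4.0) -> float:
    """Share of GPU memory actually occupied by model replicas."""
    return replicas_per_gpu * model_gb / A100_MEMORY_GB


replicas = 21  # hypothetical number of inference replicas
print(gpus_needed(replicas, 1), memory_fraction_used(1))                    # dedicated GPUs
print(gpus_needed(replicas, MIG_SLICES_PER_GPU),
      memory_fraction_used(MIG_SLICES_PER_GPU))                             # MIG-shared
```

Under these assumptions, 21 replicas need 21 dedicated GPUs but only 3 MIG-partitioned ones, and occupied memory per GPU rises from 5% to 35%.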
Gang scheduling -- allocating all GPUs for a distributed training job atomically -- prevents deadlocks. Suppose a job needs 8 GPUs but only 6 of the cluster's 8 are free: a naive scheduler might allocate those 6 and wait for 2 more. If a second 8-GPU job then grabs the remaining 2 as they free up, both jobs are stuck holding partial allocations and neither can start. Gang schedulers (Volcano, the Kubernetes Coscheduling plugin) ensure all-or-nothing allocation. Priority and preemption policies determine which workloads get GPUs when demand exceeds supply: latency-sensitive inference jobs typically preempt batch training jobs, which can resume from checkpoints.
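The deadlock is easy to reproduce in a toy model. The sketch below assumes one GPU per pod and an 8-GPU cluster; `schedule_pod_by_pod` and `schedule_gang` are hypothetical helpers that illustrate the two policies, not the algorithms used by Volcano or any other real scheduler.

```python
from collections import Counter
from typing import Dict, List


def schedule_pod_by_pod(free_gpus: int, pod_queue: List[str]) -> Counter:
    """Naive policy: bind each one-GPU pod as soon as any GPU is free."""
    held: Counter = Counter()
    for job in pod_queue:
        if free_gpus == 0:
            break
        held[job] += 1
        free_gpus -= 1
    return held


def schedule_gang(free_gpus: int, job_sizes: Dict[str, int]) -> Counter:
    """Gang policy: admit a job only if all of its GPUs are free at once."""
    held: Counter = Counter()
    for job, size in job_sizes.items():
        if size <= free_gpus:
            held[job] = size
            free_gpus -= size
    return held


def runnable_jobs(held: Counter, job_sizes: Dict[str, int]) -> List[str]:
    """A distributed job can start only when it holds every GPU it needs."""
    return [job for job, size in job_sizes.items() if held[job] == size]


job_sizes = {"A": 8, "B": 8}
# Illustrative arrival order: six of A's pods, then all of B's, then A's last two.
pod_queue = ["A"] * 6 + ["B"] * 8 + ["A"] * 2

naive = schedule_pod_by_pod(8, pod_queue)
gang = schedule_gang(8, job_sizes)
print(dict(naive), runnable_jobs(naive, job_sizes))
print(dict(gang), runnable_jobs(gang, job_sizes))
```

The naive policy leaves A holding 6 GPUs and B holding 2 with nothing runnable, while the gang policy starts A with all 8 and keeps B queued without holding any GPUs.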
Cost management strategies include spot/preemptible instances (60-90% savings for fault-tolerant training), right-sizing (matching GPU type to workload -- T4 for small inference, A100 for training, H100 for LLM workloads), and quota management (allocating GPU budgets to teams with chargeback). Fleet-level metrics like GPU-hours utilized / GPU-hours allocated reveal waste, and automated reclamation (deallocating idle GPUs after 15 minutes) prevents hoarding.
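As a sketch of the fleet metric and the reclamation rule, the snippet below assumes per-allocation accounting records with a hypothetical `Allocation` shape; the field names and the sample numbers are illustrative only.

```python
from dataclasses import dataclass
from typing import List

IDLE_LIMIT_S = 15 * 60  # reclaim after 15 minutes without GPU activity


@dataclass
class Allocation:
    owner: str
    allocated_hours: float   # GPU-hours reserved
    utilized_hours: float    # GPU-hours with real work running
    idle_seconds: int        # time since GPU activity was last observed


def fleet_utilization(allocs: List[Allocation]) -> float:
    """GPU-hours utilized divided by GPU-hours allocated."""
    allocated = sum(a.allocated_hours for a in allocs)
    utilized = sum(a.utilized_hours for a in allocs)
    return utilized / allocated if allocated else 0.0


def reclaim_candidates(allocs: List[Allocation],
                       idle_limit_s: int = IDLE_LIMIT_S) -> List[str]:
    """Owners whose allocations have been idle for at least the limit."""
    return [a.owner for a in allocs if a.idle_seconds >= idle_limit_s]


allocs = [
    Allocation("team-a", allocated_hours=100, utilized_hours=80, idle_seconds=0),
    Allocation("team-b", allocated_hours=100, utilized_hours=20, idle_seconds=1200),
]
print(fleet_utilization(allocs), reclaim_candidates(allocs))
```

Here the fleet ratio is 0.5, and only team-b's allocation, idle for 20 minutes, crosses the 15-minute threshold.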
Scheduling a Training Job on a Shared Cluster
A team submits a training job requiring 16 GPUs for 8 hours. The GPU orchestrator checks topology: 2 full 8-GPU nodes on the same InfiniBand switch are available. It gang-schedules the job across both nodes, ensuring NVLink connectivity within each node and InfiniBand between them. Four hours in, a critical inference deployment urgently needs 8 GPUs. The orchestrator preempts the training job, which last checkpointed 15 minutes earlier, frees its GPUs for inference, and re-queues the job. When GPUs become available again, the training job resumes from that checkpoint, losing only 15 minutes of compute.
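The preemption step in this scenario can be sketched as a victim-selection rule plus a cost estimate. The `Job` record, the priority values, and the tie-breaking rule (lowest priority first, then least lost work) are assumptions made for illustration; real orchestrators apply their own policies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    name: str
    priority: int                    # higher value = more important
    gpus: int
    minutes_since_checkpoint: float


def lost_gpu_hours(job: Job) -> float:
    """Work that must be redone if the job is preempted right now."""
    return job.gpus * job.minutes_since_checkpoint / 60


def pick_victim(running: List[Job], gpus_needed: int,
                incoming_priority: int) -> Optional[Job]:
    """Pick a lower-priority job that frees enough GPUs, preferring the
    lowest priority and, among equals, the least lost work."""
    candidates = [j for j in running
                  if j.priority < incoming_priority and j.gpus >= gpus_needed]
    if not candidates:
        return None
    return min(candidates, key=lambda j: (j.priority, lost_gpu_hours(j)))


running = [
    Job("train-16gpu", priority=1, gpus=16, minutes_since_checkpoint=15),
    Job("serve-existing", priority=10, gpus=8, minutes_since_checkpoint=0),
]
victim = pick_victim(running, gpus_needed=8, incoming_priority=10)
print(victim.name, lost_gpu_hours(victim))
```

For the scenario above, the 16-GPU training job is the only eligible victim, and preempting it 15 minutes after its last checkpoint costs 4 GPU-hours of repeated work.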
Meta (Research SuperCluster)
Meta's Research SuperCluster (RSC) uses 16,000 A100 GPUs connected via InfiniBand for training large language models. A custom scheduler handles topology-aware placement, ensuring multi-node training jobs land on GPUs within the same InfiniBand fabric segment. Fair-share scheduling allocates GPU quotas across research teams, with priority boosts for time-sensitive projects. GPU utilization is tracked at 15-second granularity and idle GPUs are reclaimed after 10 minutes.
Google (Borg + TPU)
Google's Borg scheduler manages TPU pod allocations for training workloads like Gemini. TPU pods (up to 8,960 chips in a v5p pod) require atomic allocation and ICI (Inter-Chip Interconnect) topology awareness. Borg implements priority-based preemption: production serving preempts batch training, which preempts research experiments. Preempted jobs automatically resume from the latest checkpoint stored on Colossus (Google's distributed file system).
Microsoft (Azure ML)
Azure ML manages tens of thousands of GPUs for internal (Bing, Office AI) and external customers. The scheduler supports MIG partitioning for inference workloads, gang scheduling for distributed training, and a quota system with billing per GPU-second. Azure's ND H100 v5 instances use InfiniBand for multi-node training, and the scheduler preferentially co-locates pods on the same InfiniBand leaf switch to minimize communication latency.
| Trade-off | Description |
|---|---|
| Utilization vs. Isolation | GPU sharing (MIG, time-slicing) improves utilization from 20-30% to 60-80% but introduces noisy-neighbor risks. MIG provides hardware isolation but limits partitioning options (fixed slice sizes). Time-slicing is flexible but one pod's CUDA kernel can starve others of compute. |
| Scheduling Latency vs. Placement Quality | Optimal topology-aware placement requires solving a bin-packing problem (NP-hard). Simple first-fit scheduling is fast but produces suboptimal placement. Sophisticated solvers improve placement quality but add scheduling latency (seconds to minutes), which matters for autoscaling inference where pods must start quickly. |
| Dedicated vs. Shared Clusters | Dedicated GPU clusters per team ensure predictable performance and simplify scheduling but lead to low utilization (each team provisions for peak). Shared clusters improve utilization through statistical multiplexing but require quota management, preemption policies, and chargeback systems. |
| On-Premises vs. Cloud GPUs | On-premises GPUs (purchased hardware) have lower long-term cost for sustained utilization (>60%) but require capital investment, physical infrastructure, and 3-6 month lead times. Cloud GPUs offer instant scaling and no capital cost but are 2-3x more expensive over 3 years for sustained workloads. |
Spotify's GPU Cluster Consolidation
Scenario
Spotify operated separate GPU clusters for training (Kubeflow on GKE) and inference (custom deployment on GCE VMs). The training cluster was idle 40% of the time (nights and weekends), while the inference cluster ran at 25% utilization (small models on large GPUs). Combined, the fleet ran at ~30% utilization, wasting $1.5M/year in GPU spend.
Solution
Spotify consolidated both workloads onto a single GKE cluster with the NVIDIA GPU Operator, MIG for inference workloads, and Volcano for gang-scheduling training jobs. Priority classes ensured inference pods preempted training pods when needed. Training pipelines were modified to checkpoint every 15 minutes and tolerate preemption. MIG sliced A100s into 7 instances for small inference models.
Outcome
Fleet-wide GPU utilization increased from 30% to 68%. Annual GPU spend decreased by $900K. Inference latency was unaffected because priority preemption guaranteed GPU availability for serving workloads. Training job completion times increased by ~5% due to occasional preemptions, which was acceptable given the cost savings.
1. Why is topology-aware scheduling important for distributed GPU training?
2. What problem does gang scheduling solve for GPU workloads?