1What is the primary purpose of AllReduce in data-parallel training?
ML training pipelines orchestrate the end-to-end workflow from raw data to a validated model artifact: data ingestion, preprocessing, feature engineering, distributed training, hyperparameter tuning, evaluation, and model registration. Reproducibility, idempotency, and efficient GPU utilization are the key engineering challenges.
An ML training pipeline is more than 'call model.fit()'. In production, training is a multi-stage workflow that must be reproducible, auditable, and resilient to failures. A typical pipeline includes: (1) data ingestion and validation -- pull training data, verify schema, check for distribution drift; (2) preprocessing and feature engineering -- normalize, tokenize, compute derived features; (3) training -- run distributed gradient descent across multiple GPUs; (4) evaluation -- measure metrics on hold-out sets, compare against the production baseline; (5) registration -- if the new model passes quality gates, register it in the model registry for deployment.
Distributed training is the compute-intensive core of the pipeline. Data parallelism -- the most common strategy -- replicates the model on every GPU and shards the training data across them. After each forward-backward pass, gradients are synchronized across GPUs using AllReduce (ring-reduce or tree-reduce). This scales well up to ~64 GPUs but communication overhead grows with model size: a 7B-parameter model produces 28 GB of gradients (in FP32) that must be exchanged every step. Gradient compression, mixed-precision training (FP16/BF16 forward pass, FP32 master weights), and gradient accumulation (local steps before sync) reduce communication pressure.
For models too large to fit on a single GPU (70B+ parameters), model parallelism splits the model itself across GPUs. Pipeline parallelism assigns different layers to different GPUs and pipelines micro-batches through them. Tensor parallelism splits individual layers (e.g., attention heads) across GPUs on the same node. ZeRO (DeepSpeed) partitions optimizer states, gradients, and parameters across GPUs to reduce per-GPU memory, enabling training of 100B+ models on clusters of commodity GPUs. FSDP (PyTorch) implements similar ideas natively.
Pipeline orchestration manages the DAG of stages, handles retries on transient failures (GPU errors, preemptions on spot instances), and tracks lineage. Modern orchestrators (Kubeflow Pipelines, Metaflow, Vertex AI Pipelines, Flyte) treat each stage as a containerized step with declared inputs, outputs, and resource requirements. Checkpointing is critical: a 24-hour training run that fails at hour 23 without checkpoints wastes $10,000+ in GPU compute. Best practice is to checkpoint every 15-30 minutes and implement automatic resume from the latest checkpoint.
A Nightly Retrain Pipeline
An e-commerce company retrains its search ranking model nightly. At midnight, the pipeline (1) pulls the latest 30 days of click data from the data warehouse, (2) validates that the data schema is unchanged and volume is within 20% of the previous day, (3) computes features using the feature store, (4) trains for 3 epochs on 8 GPUs using data parallelism, checkpointing every 20 minutes, (5) evaluates NDCG@10 on a held-out set and compares against the production model, (6) if NDCG improves by >0.5%, registers the new model in the registry and triggers a canary deployment. The entire pipeline takes 4 hours and costs ~$200 in GPU compute.
Meta (LLaMA)
Meta trained LLaMA 2 70B on 2,048 A100 GPUs over 1.7 million GPU-hours. The training pipeline used FSDP for model parallelism, mixed-precision BF16 training, and custom data loading with pre-tokenized data to eliminate CPU bottlenecks. Checkpoints were saved every 1,000 steps. Training failures (hardware faults, networking issues) occurred every ~2 days on average, and automatic restart from the latest checkpoint kept productive utilization above 90%.
Google (Gemini)
Google trained Gemini Ultra on a fleet of TPU v5p pods, using a combination of data parallelism, pipeline parallelism, and expert parallelism (for mixture-of-experts layers). The training pipeline included automated data quality checks, curriculum learning (progressively increasing sequence length), and continuous evaluation against internal benchmarks. Google's training infrastructure handles hardware failures at the pod level, automatically rerouting computation around failed TPU chips.
Spotify
Spotify uses Kubeflow Pipelines on GKE to orchestrate 200+ training pipelines for recommendation, search, and content understanding models. Each pipeline is defined as a DAG of containerized steps with declared GPU requirements. Spot/preemptible instances reduce cost by 60-70%, and pipelines automatically retry failed steps from the last checkpoint. Metaflow is used for rapid experimentation before promoting pipelines to Kubeflow for production scheduling.
| Aspect | Description |
|---|---|
| Data Parallelism vs. Model Parallelism | Data parallelism is simpler and scales well for models that fit on one GPU. Model parallelism (pipeline/tensor/ZeRO) is necessary for large models but adds communication complexity, can cause pipeline bubbles (idle GPU time), and requires careful tuning of micro-batch sizes and partition strategies. |
| On-Demand vs. Spot/Preemptible Instances | Spot instances are 60-90% cheaper but can be preempted with 30-120 seconds notice. Training on spot requires robust checkpointing, fast checkpoint save (<30 seconds), and automatic resume. Long training runs (days/weeks) on spot have higher failure rates, but the cost savings often justify the added complexity. |
| Retrain Frequency vs. Cost | More frequent retraining captures recent data patterns (critical for fast-changing domains like fraud) but multiplies compute cost. Daily retraining of a model costing $200/run is $73K/year. Weekly is $10K/year. The right frequency depends on how fast the data distribution shifts, measured by monitoring feature and prediction drift. |
| Custom Pipeline vs. Managed Service | Custom pipelines (Kubeflow, Metaflow, Flyte) offer flexibility and avoid vendor lock-in but require dedicated MLOps engineers to maintain. Managed services (SageMaker, Vertex AI) reduce operational overhead but constrain framework choices and charge premiums over raw GPU cost. |
LinkedIn's Productioneer Training Platform
Scenario
LinkedIn needed to retrain hundreds of ML models (feed ranking, job recommendations, ads) on schedules ranging from hourly to weekly. Manual training workflows caused frequent failures from data quality issues, inconsistent environments, and untracked dependencies. A failed training run for the feed ranking model directly impacted revenue because stale models degraded recommendation quality.
Solution
LinkedIn built Productioneer, a declarative training platform where data scientists specify the training DAG, data sources, evaluation criteria, and deployment strategy in a config file. Productioneer handles containerization, GPU scheduling, data validation, checkpointing, evaluation, and automatic model registration. It integrates with LinkedIn's feature store (Frame) for consistent feature computation and with their A/B testing platform (XLNT) for model experiments.
Outcome
Training pipeline reliability increased from 75% to 98% (measured as the percentage of scheduled training runs that complete successfully). Model iteration velocity doubled: data scientists could test a new model idea end-to-end in hours instead of days. GPU utilization improved 40% through better scheduling and bin-packing of training jobs across the shared GPU cluster.
See Training Pipelines in action
Explore system design templates that use training pipelines and run traffic simulations to see how these concepts perform under real load.
Browse Templates1What is the primary purpose of AllReduce in data-parallel training?
2Why is training on spot/preemptible GPU instances challenging?