What is important about Training Pipelines regarding "Data parallelism replicates the model on every GPU and shard..."?

Data parallelism replicates the model on every GPU and shards data across them. AllReduce synchronizes gradients after each step. This scales linearly up to ~32-64 GPUs, after which communication overhead dominates (especially for large models over slower interconnects).

What is important about Training Pipelines regarding "Mixed-precision training (FP16/BF16 compute, FP32 master wei..."?

Mixed-precision training (FP16/BF16 compute, FP32 master weights) halves GPU memory usage and doubles throughput on tensor-core GPUs (A100, H100) with minimal accuracy loss. It is now standard practice for any model training.

What is important about Training Pipelines regarding "ZeRO (DeepSpeed) and FSDP (PyTorch) partition optimizer stat..."?

ZeRO (DeepSpeed) and FSDP (PyTorch) partition optimizer states, gradients, and parameters across GPUs, enabling training of models 4-8x larger than a single GPU's memory. ZeRO Stage 3 partitions everything; Stage 1 partitions only optimizer states (lower communication overhead).

What is important about Training Pipelines regarding "Checkpointing every 15-30 minutes prevents catastrophic comp..."?

Checkpointing every 15-30 minutes prevents catastrophic compute waste. Async checkpointing (writing to storage while training continues) avoids pausing the training loop. Store checkpoints on networked storage (S3, GCS) so any node can resume.

What is important about Training Pipelines regarding "Reproducibility requires pinning: data snapshot version, cod..."?

Reproducibility requires pinning: data snapshot version, code commit hash, library versions, random seeds, and GPU count. Even with all these pinned, floating-point non-determinism across GPU architectures can cause 0.1-0.5% metric variation.

What is important about Training Pipelines regarding "Hyperparameter tuning (learning rate, batch size, weight dec..."?

Hyperparameter tuning (learning rate, batch size, weight decay) uses Bayesian optimization or population-based training to search efficiently. A single tuning run can consume 10-100x the compute of one training run, so early stopping and resource-aware scheduling are essential.

Vetora

🛠️AI / ML Infrastructure

Training Pipelines

ML training pipelines orchestrate the end-to-end workflow from raw data to a validated model artifact: data ingestion, preprocessing, feature engineering, distributed training, hyperparameter tuning, evaluation, and model registration. Reproducibility, idempotency, and efficient GPU utilization are the key engineering challenges.

Overview

An ML training pipeline is more than 'call model.fit()'. In production, training is a multi-stage workflow that must be reproducible, auditable, and resilient to failures. A typical pipeline includes: (1) data ingestion and validation -- pull training data, verify schema, check for distribution drift; (2) preprocessing and feature engineering -- normalize, tokenize, compute derived features; (3) training -- run distributed gradient descent across multiple GPUs; (4) evaluation -- measure metrics on hold-out sets, compare against the production baseline; (5) registration -- if the new model passes quality gates, register it in the model registry for deployment.

Distributed training is the compute-intensive core of the pipeline. Data parallelism -- the most common strategy -- replicates the model on every GPU and shards the training data across them. After each forward-backward pass, gradients are synchronized across GPUs using AllReduce (ring-reduce or tree-reduce). This scales well up to ~64 GPUs but communication overhead grows with model size: a 7B-parameter model produces 28 GB of gradients (in FP32) that must be exchanged every step. Gradient compression, mixed-precision training (FP16/BF16 forward pass, FP32 master weights), and gradient accumulation (local steps before sync) reduce communication pressure.

For models too large to fit on a single GPU (70B+ parameters), model parallelism splits the model itself across GPUs. Pipeline parallelism assigns different layers to different GPUs and pipelines micro-batches through them. Tensor parallelism splits individual layers (e.g., attention heads) across GPUs on the same node. ZeRO (DeepSpeed) partitions optimizer states, gradients, and parameters across GPUs to reduce per-GPU memory, enabling training of 100B+ models on clusters of commodity GPUs. FSDP (PyTorch) implements similar ideas natively.

Pipeline orchestration manages the DAG of stages, handles retries on transient failures (GPU errors, preemptions on spot instances), and tracks lineage. Modern orchestrators (Kubeflow Pipelines, Metaflow, Vertex AI Pipelines, Flyte) treat each stage as a containerized step with declared inputs, outputs, and resource requirements. Checkpointing is critical: a 24-hour training run that fails at hour 23 without checkpoints wastes $10,000+ in GPU compute. Best practice is to checkpoint every 15-30 minutes and implement automatic resume from the latest checkpoint.

Key Points

1Data parallelism replicates the model on every GPU and shards data across them. AllReduce synchronizes gradients after each step. This scales linearly up to ~32-64 GPUs, after which communication overhead dominates (especially for large models over slower interconnects).
2Mixed-precision training (FP16/BF16 compute, FP32 master weights) halves GPU memory usage and doubles throughput on tensor-core GPUs (A100, H100) with minimal accuracy loss. It is now standard practice for any model training.
3ZeRO (DeepSpeed) and FSDP (PyTorch) partition optimizer states, gradients, and parameters across GPUs, enabling training of models 4-8x larger than a single GPU's memory. ZeRO Stage 3 partitions everything; Stage 1 partitions only optimizer states (lower communication overhead).
4Checkpointing every 15-30 minutes prevents catastrophic compute waste. Async checkpointing (writing to storage while training continues) avoids pausing the training loop. Store checkpoints on networked storage (S3, GCS) so any node can resume.
5Reproducibility requires pinning: data snapshot version, code commit hash, library versions, random seeds, and GPU count. Even with all these pinned, floating-point non-determinism across GPU architectures can cause 0.1-0.5% metric variation.
6Hyperparameter tuning (learning rate, batch size, weight decay) uses Bayesian optimization or population-based training to search efficiently. A single tuning run can consume 10-100x the compute of one training run, so early stopping and resource-aware scheduling are essential.

Simple Example

A Nightly Retrain Pipeline

An e-commerce company retrains its search ranking model nightly. At midnight, the pipeline (1) pulls the latest 30 days of click data from the data warehouse, (2) validates that the data schema is unchanged and volume is within 20% of the previous day, (3) computes features using the feature store, (4) trains for 3 epochs on 8 GPUs using data parallelism, checkpointing every 20 minutes, (5) evaluates NDCG@10 on a held-out set and compares against the production model, (6) if NDCG improves by >0.5%, registers the new model in the registry and triggers a canary deployment. The entire pipeline takes 4 hours and costs ~$200 in GPU compute.

Real-World Examples

Meta (LLaMA)

Meta trained LLaMA 2 70B on 2,048 A100 GPUs over 1.7 million GPU-hours. The training pipeline used FSDP for model parallelism, mixed-precision BF16 training, and custom data loading with pre-tokenized data to eliminate CPU bottlenecks. Checkpoints were saved every 1,000 steps. Training failures (hardware faults, networking issues) occurred every ~2 days on average, and automatic restart from the latest checkpoint kept productive utilization above 90%.

Google (Gemini)

Google trained Gemini Ultra on a fleet of TPU v5p pods, using a combination of data parallelism, pipeline parallelism, and expert parallelism (for mixture-of-experts layers). The training pipeline included automated data quality checks, curriculum learning (progressively increasing sequence length), and continuous evaluation against internal benchmarks. Google's training infrastructure handles hardware failures at the pod level, automatically rerouting computation around failed TPU chips.

Spotify

Spotify uses Kubeflow Pipelines on GKE to orchestrate 200+ training pipelines for recommendation, search, and content understanding models. Each pipeline is defined as a DAG of containerized steps with declared GPU requirements. Spot/preemptible instances reduce cost by 60-70%, and pipelines automatically retry failed steps from the last checkpoint. Metaflow is used for rapid experimentation before promoting pipelines to Kubeflow for production scheduling.

Trade-Offs

Aspect	Description
Data Parallelism vs. Model Parallelism	Data parallelism is simpler and scales well for models that fit on one GPU. Model parallelism (pipeline/tensor/ZeRO) is necessary for large models but adds communication complexity, can cause pipeline bubbles (idle GPU time), and requires careful tuning of micro-batch sizes and partition strategies.
On-Demand vs. Spot/Preemptible Instances	Spot instances are 60-90% cheaper but can be preempted with 30-120 seconds notice. Training on spot requires robust checkpointing, fast checkpoint save (<30 seconds), and automatic resume. Long training runs (days/weeks) on spot have higher failure rates, but the cost savings often justify the added complexity.
Retrain Frequency vs. Cost	More frequent retraining captures recent data patterns (critical for fast-changing domains like fraud) but multiplies compute cost. Daily retraining of a model costing $200/run is $73K/year. Weekly is $10K/year. The right frequency depends on how fast the data distribution shifts, measured by monitoring feature and prediction drift.
Custom Pipeline vs. Managed Service	Custom pipelines (Kubeflow, Metaflow, Flyte) offer flexibility and avoid vendor lock-in but require dedicated MLOps engineers to maintain. Managed services (SageMaker, Vertex AI) reduce operational overhead but constrain framework choices and charge premiums over raw GPU cost.

Case Study

LinkedIn's Productioneer Training Platform

Scenario

LinkedIn needed to retrain hundreds of ML models (feed ranking, job recommendations, ads) on schedules ranging from hourly to weekly. Manual training workflows caused frequent failures from data quality issues, inconsistent environments, and untracked dependencies. A failed training run for the feed ranking model directly impacted revenue because stale models degraded recommendation quality.

Solution

LinkedIn built Productioneer, a declarative training platform where data scientists specify the training DAG, data sources, evaluation criteria, and deployment strategy in a config file. Productioneer handles containerization, GPU scheduling, data validation, checkpointing, evaluation, and automatic model registration. It integrates with LinkedIn's feature store (Frame) for consistent feature computation and with their A/B testing platform (XLNT) for model experiments.

Outcome

Training pipeline reliability increased from 75% to 98% (measured as the percentage of scheduled training runs that complete successfully). Model iteration velocity doubled: data scientists could test a new model idea end-to-end in hours instead of days. GPU utilization improved 40% through better scheduling and bin-packing of training jobs across the shared GPU cluster.

Common Mistakes

⚠No data validation before training. Feeding corrupted or schema-drifted data into training produces a model that looks good on its own test set but fails in production. Always validate data schema, volume, and distribution statistics before training begins, and fail the pipeline early if checks fail.
⚠Skipping the comparison against the production baseline. A new model that improves by 0.1% on a hold-out set may actually be worse than the current production model on real traffic due to distribution differences. Always evaluate new models against the live production model's metrics before registration.
⚠No checkpointing or inadequate checkpointing frequency. GPU training on cloud instances fails regularly (spot preemptions, hardware faults, OOM). A 12-hour training run without checkpoints wastes the full cost on failure. Checkpoint every 15-30 minutes and test the resume-from-checkpoint path regularly.
⚠Using all available data without a reproducible data snapshot. If training reads from a live table that is continuously updated, the same pipeline run on different days produces different models, making debugging impossible. Always train from a versioned, immutable data snapshot.

Related Concepts

Feature Stores GPU Orchestration ML Model Registry Model Serving & Inference A/B Testing & Experimentation

See Training Pipelines in action

Explore system design templates that use training pipelines and run traffic simulations to see how these concepts perform under real load.

Browse Templates

Simulate distributed training data throughput

Metrics to watch

training_throughput_samples_per_secgpu_utilization_pctcheckpoint_duration_msdata_loading_latency_ms

Run Simulation

Test Your Understanding

1What is the primary purpose of AllReduce in data-parallel training?

2Why is training on spot/preemptible GPU instances challenging?

Deeper Reading