Batch processing operates on bounded, finite datasets by collecting data over a period and processing it as a single unit. Frameworks like MapReduce, Spark, and Hive enable parallel computation across commodity clusters, trading latency for throughput. Batch remains the backbone of data warehousing, ML training pipelines, and large-scale analytics.
Batch processing is the oldest and most battle-tested paradigm in large-scale data engineering. The core idea is simple: accumulate data over a time window (an hour, a day, a month), then process the entire bounded dataset as a single job. The job reads all input, applies transformations or aggregations, and writes the output. Because the input is finite and known in advance, the system can optimize scheduling, retry failed tasks, and guarantee correctness through deterministic re-execution.
Google's **MapReduce** (2004) formalized batch processing for commodity clusters. A job is split into a Map phase (apply a function to each record, emit key-value pairs) and a Reduce phase (group by key and aggregate). The framework handles partitioning, scheduling, fault tolerance, and data locality. Hadoop brought MapReduce to the open-source world and became the foundation of the big data ecosystem.
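The Map, group-by-key, and Reduce steps described above can be sketched in a few lines of plain Python. This is a single-process illustration only: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are invented for this example, and a real framework would distribute each phase across machines and handle failures.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record and emit (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group all emitted values by key (the framework's job in real MapReduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Aggregate the list of values for each key."""
    return {key: reducer(key, values) for key, values in groups.items()}

def run_job(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)

# Word count, the canonical example.
def word_count_mapper(line):
    for word in line.lower().split():
        yield word, 1

def sum_reducer(key, values):
    return sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = run_job(lines, word_count_mapper, sum_reducer)
```

The user supplies only the mapper and reducer; everything in between belongs to the framework, which is what made the model easy to scale out.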
Apache **Spark** (2014) keeps intermediate data in memory as Resilient Distributed Datasets (RDDs) instead of writing it to disk between jobs, as chained MapReduce jobs must. For iterative workloads such as ML training and graph processing, which rescan the same dataset many times, this yields speedups commonly cited as 10-100x. Spark's DataFrame API and Catalyst optimizer brought SQL-like declarative queries to batch processing. Today, Spark is the dominant batch engine, running at very large scale at companies like Netflix, Apple, and Alibaba.
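A toy model helps show why caching matters for iterative jobs. The sketch below is plain Python, not Spark: `DiskDataset` and `estimate_mean` are hypothetical names, and a "full scan" stands in for reading the input from distributed storage. It counts how often the input is re-read when each iteration is a separate job versus when the data is loaded once and reused.

```python
class DiskDataset:
    """Stand-in for a dataset on distributed storage; counts full scans."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.full_scans = 0

    def scan(self):
        self.full_scans += 1
        return list(self._rows)

def estimate_mean(dataset, iterations, cache, learning_rate=0.5):
    """Estimate the mean by gradient descent, an intentionally iterative method."""
    cached_rows = dataset.scan() if cache else None  # read once, keep in memory
    estimate = 0.0
    for _ in range(iterations):
        # Without caching, every iteration re-reads the input, as chained jobs would.
        rows = cached_rows if cache else dataset.scan()
        gradient = sum(estimate - x for x in rows) / len(rows)
        estimate -= learning_rate * gradient
    return estimate

uncached = DiskDataset([2.0, 4.0, 6.0])
cached = DiskDataset([2.0, 4.0, 6.0])
result_uncached = estimate_mean(uncached, iterations=20, cache=False)
result_cached = estimate_mean(cached, iterations=20, cache=True)
```

Both runs compute the same answer, but the uncached run scans the input once per iteration while the cached run scans it once in total. The actual speedup on a cluster depends on how much of the job is I/O-bound and whether the working set fits in memory.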
Despite the rise of streaming, batch processing remains essential. ML model training, compliance reporting, data warehouse backfills, and end-of-day billing reconciliation all require processing complete, bounded datasets. The "Lambda Architecture" runs batch alongside streaming: batch provides the authoritative, complete view while streaming provides low-latency approximations.
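The way a Lambda-style serving layer combines the two views can be sketched as follows. This is a minimal illustration under stated assumptions: both views are plain dictionaries of revenue in cents keyed by category, the real-time view holds only increments since the last batch run, and `merged_view` is a hypothetical helper name.

```python
def merged_view(batch_view, realtime_view):
    """Answer queries from the batch view plus real-time increments since the last batch run."""
    merged = dict(batch_view)  # authoritative totals from the last completed batch
    for key, delta in realtime_view.items():
        merged[key] = merged.get(key, 0) + delta  # approximate, low-latency additions
    return merged

batch_view = {"books": 120_000, "toys": 45_000}   # computed by the nightly job
realtime_view = {"toys": 1_500, "games": 300}     # accumulated by the streaming path
current = merged_view(batch_view, realtime_view)
```

When the next batch run completes, its output replaces `batch_view` and the real-time increments it now covers are discarded, which is how the batch layer corrects any streaming approximation.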
Daily Revenue Report
An e-commerce company runs a nightly Spark batch job at 2 AM. The job reads the day's order events from S3 (partitioned by date: `s3://data/orders/dt=2026-06-02/`), joins with the product catalog table, aggregates revenue by category and region, and writes the result to a data warehouse table. The entire day's data (say, 500 million orders, 200 GB compressed Parquet) is processed in a single 20-minute job on a 50-node cluster. If the job fails, it is retried from scratch: it re-reads the same immutable input and overwrites the same output partition, which makes the job idempotent.
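The retry-from-scratch pattern can be sketched on a local filesystem. This is a simplified stand-in, not the Spark job itself: the function names, the JSON output file, and the sample data are all assumptions made for illustration. The key points are that the output path is derived from the date and that each run replaces the file rather than appending to it.

```python
import json
import os
import tempfile
from collections import defaultdict
from pathlib import Path

def aggregate_revenue(orders, catalog):
    """Join each order to its product category and sum revenue by (category, region)."""
    totals = defaultdict(int)
    for order in orders:
        category = catalog[order["product_id"]]
        totals[(category, order["region"])] += order["amount_cents"]
    return dict(totals)

def write_partition(output_root, dt, totals):
    """Overwrite the dt=<date> partition; os.replace swaps in the new file in one step."""
    partition = Path(output_root) / f"dt={dt}"
    partition.mkdir(parents=True, exist_ok=True)
    rows = [
        {"category": c, "region": r, "revenue_cents": v}
        for (c, r), v in sorted(totals.items())  # sorted for deterministic output
    ]
    tmp_path = partition / "part-00000.json.tmp"
    final_path = partition / "part-00000.json"
    tmp_path.write_text(json.dumps(rows, indent=2))
    os.replace(tmp_path, final_path)
    return final_path

def run_nightly_job(orders, catalog, output_root, dt):
    return write_partition(output_root, dt, aggregate_revenue(orders, catalog))

catalog = {"p1": "books", "p2": "toys"}
orders = [
    {"product_id": "p1", "region": "eu", "amount_cents": 1000},
    {"product_id": "p1", "region": "eu", "amount_cents": 2000},
    {"product_id": "p2", "region": "us", "amount_cents": 500},
]

with tempfile.TemporaryDirectory() as root:
    first = run_nightly_job(orders, catalog, root, "2026-06-02").read_bytes()
    second = run_nightly_job(orders, catalog, root, "2026-06-02").read_bytes()
    rerun_is_identical = first == second
```

Because the input is immutable and the output is a pure function of it, a retry after a crash leaves the partition in the same state as a clean first run.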
Google
Google introduced MapReduce in 2004, with building its search index as a flagship use. The web crawler writes raw pages to GFS; a MapReduce pipeline parses, tokenizes, and inverts the pages into the search index. At peak, Google ran hundreds of thousands of MapReduce jobs per day across clusters of tens of thousands of machines. MapReduce was later superseded internally by FlumeJava (whose programming model underlies Cloud Dataflow), but the batch paradigm it popularized remains central to large-scale data processing.
Netflix
Netflix runs over 100,000 Spark batch jobs daily for recommendation model training, A/B test analysis, content valuation, and billing reconciliation. Their data platform processes petabytes per day from S3 using Spark on Kubernetes (via their Genie job management platform). Batch ML training jobs iterate over the full viewing history dataset to retrain collaborative filtering models, achieving cold-start prediction improvements impossible with online learning alone.
Stripe
Stripe uses batch processing for end-of-day financial reconciliation. Every night, a batch pipeline reads all payment events, matches debits to credits, computes settlement amounts per merchant, and generates payout files. Financial accuracy is paramount -- batch processing over the complete bounded dataset eliminates the approximation risks of streaming aggregation. The pipeline is idempotent: re-running it for the same date produces identical output.
| Aspect | Description |
|---|---|
| Latency vs Throughput | Batch processing optimizes for throughput at the cost of latency. A job that processes a full day's data can achieve very high per-record throughput (millions of records/second) because it amortizes startup, shuffle, and I/O costs. But results are only available after the entire batch completes -- typically minutes to hours of delay. If you need sub-second results, batch is the wrong paradigm. |
| Simplicity vs Freshness | Batch pipelines are simpler to reason about: bounded input, deterministic output, easy to test and debug. But data freshness suffers. A dashboard powered by a nightly batch can be up to 24 hours stale. The Lambda Architecture addresses this by layering a real-time streaming view on top of a batch-computed base view, at the cost of maintaining two codepaths. |
| Resource Efficiency vs Cost | Batch jobs can use spot/preemptible instances (often 70-90% cheaper) because they tolerate node loss and can be retried. Spark tracks the lineage of each dataset and recomputes only the lost partitions or failed tasks instead of restarting the whole job. However, large batch clusters sit idle between job runs. Auto-scaling and ephemeral clusters (spin up, run job, tear down) reduce waste but add startup latency. |
| Correctness vs Complexity | Batch achieves effectively exactly-once results through idempotent overwrites: a rerun replaces the output partition rather than appending to it, so the output for a given input partition is always the same. This is much simpler than achieving exactly-once in streaming. However, late-arriving data complicates batch: records that arrive after the batch window closes are missed until the next run or a backfill. |
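
One common way to handle the late-arriving data mentioned in the table is to detect which date partitions received records after their batch ran and then rerun only those. The sketch below assumes each record carries an `event_date` and an `arrived_at` timestamp, and that the scheduler records when each partition was last processed. All names here are hypothetical.

```python
from datetime import date, datetime

def partitions_needing_backfill(records, last_run_at):
    """Return event dates whose partition received records after its last batch run."""
    stale = set()
    for record in records:
        ran_at = last_run_at.get(record["event_date"])
        # A partition that has never run is not stale; the next regular run covers it.
        if ran_at is not None and record["arrived_at"] > ran_at:
            stale.add(record["event_date"])
    return sorted(stale)

# When each date partition was last processed (nightly run at 2 AM the next day).
last_run_at = {
    date(2026, 6, 1): datetime(2026, 6, 2, 2, 0),
    date(2026, 6, 2): datetime(2026, 6, 3, 2, 0),
}
records = [
    {"event_date": date(2026, 6, 1), "arrived_at": datetime(2026, 6, 1, 23, 50)},  # on time
    {"event_date": date(2026, 6, 1), "arrived_at": datetime(2026, 6, 2, 9, 15)},   # late
    {"event_date": date(2026, 6, 2), "arrived_at": datetime(2026, 6, 2, 12, 0)},   # on time
]
to_backfill = partitions_needing_backfill(records, last_run_at)
```

Since each partition is overwritten idempotently, rerunning the job for the dates in `to_backfill` folds in the late records without double-counting the ones already processed.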
Uber's Batch ETL Migration from Hive to Spark
Scenario
Uber's data platform ran over 50,000 Hive-on-MapReduce batch jobs daily for trip analytics, driver payments, surge pricing models, and regulatory reporting. As data volumes grew past 100 PB, Hive jobs became increasingly slow -- some critical jobs exceeded their SLA windows (e.g., driver payment calculation needed to complete before 6 AM, but was routinely finishing at 8 AM).
Solution
Uber migrated their batch platform to Spark on YARN, later moving to Spark on Kubernetes. They developed an internal framework that automatically translated HiveQL queries to Spark SQL, ran shadow jobs in parallel to validate correctness, and gradually shifted production traffic. They also adopted Parquet columnar storage with Z-ordering for predicate pushdown optimization.
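The shadow-job validation step can be illustrated with a small output comparison. This is a generic sketch, not Uber's framework: `compare_outputs`, the row layout, and the floating-point tolerance are assumptions. It matches rows from the legacy and shadow jobs by key and reports rows that are missing, extra, or different.

```python
import math

def values_match(x, y, rel_tol):
    """Compare numbers with a tolerance (engines may differ in float rounding); others exactly."""
    numeric = (int, float)
    if isinstance(x, numeric) and isinstance(y, numeric):
        return math.isclose(x, y, rel_tol=rel_tol)
    return x == y

def rows_match(a, b, rel_tol):
    if a.keys() != b.keys():
        return False
    return all(values_match(a[col], b[col], rel_tol) for col in a)

def compare_outputs(legacy_rows, shadow_rows, key, rel_tol=1e-9):
    """Diff two job outputs keyed by `key`; an empty report means the outputs agree."""
    legacy = {row[key]: row for row in legacy_rows}
    shadow = {row[key]: row for row in shadow_rows}
    return {
        "missing": sorted(legacy.keys() - shadow.keys()),   # in legacy only
        "extra": sorted(shadow.keys() - legacy.keys()),     # in shadow only
        "mismatched": sorted(
            k for k in legacy.keys() & shadow.keys()
            if not rows_match(legacy[k], shadow[k], rel_tol)
        ),
    }

legacy_rows = [
    {"driver_id": 1, "payout": 10.0},
    {"driver_id": 2, "payout": 20.0},
    {"driver_id": 3, "payout": 5.0},
]
shadow_rows = [
    {"driver_id": 1, "payout": 10.0000000000001},  # float noise, within tolerance
    {"driver_id": 2, "payout": 21.0},              # real discrepancy
    {"driver_id": 4, "payout": 7.0},               # row the legacy job never produced
]
report = compare_outputs(legacy_rows, shadow_rows, key="driver_id")
```

A migration would typically shift traffic for a job only after its report stays empty across several consecutive runs.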
Outcome
Average batch job completion time dropped by 5x. The driver payment pipeline went from 4+ hours to under 45 minutes, consistently meeting its SLA. Cluster resource utilization improved by 40% because Spark's in-memory execution reduced the number of disk I/O-bound stages. At peak, the migrated platform reportedly processed over 500 PB/day.