Vetora logo
๐Ÿ“ฆData Engineering

Batch Processing

Batch processing operates on bounded, finite datasets by collecting data over a period and processing it as a single unit. Frameworks like MapReduce, Spark, and Hive enable parallel computation across commodity clusters, trading latency for throughput. Batch remains the backbone of data warehousing, ML training pipelines, and large-scale analytics.

Overview

Batch processing is the oldest and most battle-tested paradigm in large-scale data engineering. The core idea is simple: accumulate data over a time window (an hour, a day, a month), then process the entire bounded dataset as a single job. The job reads all input, applies transformations or aggregations, and writes the output. Because the input is finite and known in advance, the system can optimize scheduling, retry failed tasks, and guarantee correctness through deterministic re-execution.

Google's **MapReduce** (2004) formalized batch processing for commodity clusters. A job is split into a Map phase (apply a function to each record, emit key-value pairs) and a Reduce phase (group by key and aggregate). The framework handles partitioning, scheduling, fault tolerance, and data locality. Hadoop brought MapReduce to the open-source world and became the foundation of the big data ecosystem.

Apache **Spark** (2014) replaced MapReduce's disk-heavy shuffle with in-memory Resilient Distributed Datasets (RDDs), achieving 10-100x speedups for iterative workloads like ML training and graph processing. Spark's DataFrame API and Catalyst optimizer brought SQL-like declarative queries to batch processing. Today, Spark is the dominant batch engine, processing exabytes daily at companies like Netflix, Apple, and Alibaba.

Despite the rise of streaming, batch processing remains essential. ML model training, compliance reporting, data warehouse backfills, and end-of-day billing reconciliation all require processing complete, bounded datasets. The "Lambda Architecture" runs batch alongside streaming: batch provides the authoritative, complete view while streaming provides low-latency approximations.

Key Points
  • 1Batch jobs operate on bounded datasets -- the input is finite and known before the job starts. This enables full-dataset optimizations like sort-merge joins, global aggregations, and deterministic re-execution on failure.
  • 2MapReduce splits computation into Map (parallel per-record transformation) and Reduce (key-grouped aggregation) phases separated by a shuffle. Hadoop's implementation writes intermediate data to disk, making it reliable but slow.
  • 3Spark replaces disk-based shuffles with in-memory RDDs and a DAG execution engine. Lazy evaluation lets the Catalyst optimizer fuse operations, prune partitions, and push down predicates. Spark is typically 10-100x faster than MapReduce for iterative workloads.
  • 4Batch throughput scales linearly with cluster size: doubling the number of executors roughly halves wall-clock time for embarrassingly parallel jobs. The limiting factor is shuffle data volume, which grows with the number of distinct keys.
  • 5Idempotent output writes are critical for batch correctness. If a job fails partway through and is retried, writing to the same output partition with overwrite semantics ensures no duplicates. Spark's 'saveAsTable' with OVERWRITE mode achieves this.
  • 6Batch scheduling tools (Airflow, Dagster, Prefect) manage job dependencies as DAGs, handle retries, SLA monitoring, and backfills. A well-designed batch pipeline is a DAG where each node is an idempotent, independently retriable job.
Simple Example

Daily Revenue Report

An e-commerce company runs a nightly Spark batch job at 2 AM. The job reads the day's order events from S3 (partitioned by date: s3://data/orders/dt=2026-06-02/), joins with the product catalog table, aggregates revenue by category and region, and writes the result to a data warehouse table. The entire day's data (say, 500 million orders, 200 GB compressed Parquet) is processed in a single 20-minute job on a 50-node cluster. If the job fails, it is retried from scratch -- re-reading the same immutable input and overwriting the same output partition -- guaranteeing idempotency.

Real-World Examples

Google

Google invented MapReduce in 2004 to build its search index. The web crawler writes raw pages to GFS; a daily MapReduce job parses, tokenizes, and inverts the pages into the search index. At peak, Google ran hundreds of thousands of MapReduce jobs per day across clusters of tens of thousands of machines. MapReduce was later superseded internally by Flume (now Dataflow), but the batch paradigm remains central to Search indexing.

Netflix

Netflix runs over 100,000 Spark batch jobs daily for recommendation model training, A/B test analysis, content valuation, and billing reconciliation. Their data platform processes petabytes per day from S3 using Spark on Kubernetes (via their Genie job management platform). Batch ML training jobs iterate over the full viewing history dataset to retrain collaborative filtering models, achieving cold-start prediction improvements impossible with online learning alone.

Stripe

Stripe uses batch processing for end-of-day financial reconciliation. Every night, a batch pipeline reads all payment events, matches debits to credits, computes settlement amounts per merchant, and generates payout files. Financial accuracy is paramount -- batch processing over the complete bounded dataset eliminates the approximation risks of streaming aggregation. The pipeline is idempotent: re-running it for the same date produces identical output.

Trade-Offs
AspectDescription
Latency vs ThroughputBatch processing optimizes for throughput at the cost of latency. A job that processes a full day's data can achieve very high per-record throughput (millions of records/second) because it amortizes startup, shuffle, and I/O costs. But results are only available after the entire batch completes -- typically minutes to hours of delay. If you need sub-second results, batch is the wrong paradigm.
Simplicity vs FreshnessBatch pipelines are simpler to reason about: bounded input, deterministic output, easy to test and debug. But data freshness suffers. A dashboard powered by nightly batch is always up to 24 hours stale. The Lambda Architecture addresses this by layering a real-time streaming view on top of a batch-computed base view, at the cost of maintaining two codepaths.
Resource Efficiency vs CostBatch jobs can use spot/preemptible instances (70-90% cheaper) because they are fault-tolerant and can be retried. Spark checkpoints intermediate state and reruns only failed stages. However, large batch clusters sit idle between job runs. Auto-scaling and ephemeral clusters (spin up, run job, tear down) reduce waste but add startup latency.
Correctness vs ComplexityBatch guarantees exactly-once processing trivially through idempotent overwrites: the output for a given input partition is always the same. This is much simpler than achieving exactly-once in streaming. However, late-arriving data complicates batch: records that arrive after the batch window closes are missed until the next run or a backfill.
Case Study

Uber's Batch ETL Migration from Hive to Spark

Scenario

Uber's data platform ran over 50,000 Hive-on-MapReduce batch jobs daily for trip analytics, driver payments, surge pricing models, and regulatory reporting. As data volumes grew past 100 PB, Hive jobs became increasingly slow -- some critical jobs exceeded their SLA windows (e.g., driver payment calculation needed to complete before 6 AM, but was routinely finishing at 8 AM).

Solution

Uber migrated their batch platform to Spark on YARN, later moving to Spark on Kubernetes. They developed an internal framework that automatically translated HiveQL queries to Spark SQL, ran shadow jobs in parallel to validate correctness, and gradually shifted production traffic. They also adopted Parquet columnar storage with Z-ordering for predicate pushdown optimization.

Outcome

Average batch job completion time dropped by 5x. The driver payment pipeline went from 4+ hours to under 45 minutes, consistently meeting its SLA. Cluster resource utilization improved by 40% because Spark's in-memory execution reduced the number of disk I/O-bound stages. The migration processed over 500 PB/day at peak.

Common Mistakes
  • โš Running batch jobs without idempotent output writes. If a job fails partway and is retried, non-idempotent writes (e.g., INSERT INTO without deduplication) produce duplicate records. Always use overwrite semantics or merge-on-write with a deduplication key.
  • โš Scheduling batch jobs based on wall-clock time rather than data availability. A job that runs at 2 AM expecting yesterday's data to be complete will fail if upstream data arrives late. Use sensor-based triggers (Airflow sensors, S3 event notifications) to start jobs only when input data is ready.
  • โš Neglecting data skew in shuffle-heavy jobs. If 1% of keys contain 50% of the data, the reduce stage will be bottlenecked on a few executors. Use salting (appending random suffixes to hot keys), broadcast joins for small tables, or adaptive query execution (Spark AQE) to handle skew.
  • โš Using batch processing when the business actually needs sub-minute latency. The temptation to 'just run micro-batches every 5 minutes' leads to scheduling overhead, small-file problems, and fragile pipelines. If freshness requirements are under a few minutes, use a true streaming engine like Flink or Kafka Streams.
Related Concepts

See Batch Processing in action

Explore system design templates that use batch processing and run traffic simulations to see how these concepts perform under real load.

Browse Templates

Run batch aggregation jobs on hourly click data

Metrics to watch
job_duration_msrecords_processedshuffle_data_gboutput_freshness_lag_ms
Run Simulation
Test Your Understanding

1Why is Apache Spark typically 10-100x faster than Hadoop MapReduce for iterative workloads like ML training?

2What is the primary advantage of idempotent output writes in batch processing?

Deeper Reading