1What is data skew, and why is it problematic in distributed processing?
Data partitioning determines how records are distributed across nodes, files, or partitions in a distributed system. Shuffles are the expensive redistribution of data between partitions during operations like joins and aggregations. Understanding partitioning strategies and shuffle mechanics is essential for optimizing distributed query performance.
In distributed data processing, a dataset is divided into **partitions** (also called splits, shards, or buckets) that are processed in parallel across multiple machines. How data is partitioned determines which operations are local (fast) and which require cross-node data transfer (slow). Partitioning is the single most impactful design decision for distributed query performance.
Three partitioning strategies dominate: **Hash partitioning** applies a hash function to a key column and assigns records to partitions based on the hash value (e.g., partition = hash(user_id) % num_partitions). This ensures even distribution for uniform key distributions. **Range partitioning** divides the key space into contiguous ranges (e.g., dates January-March in partition 1, April-June in partition 2). This enables efficient range scans but can cause skew if data is not uniformly distributed. **Round-robin partitioning** assigns records to partitions in rotation, ensuring perfect balance but providing no key-based locality.
A **shuffle** occurs when an operation requires data to be reorganized by a different key than the current partitioning. For example, if data is partitioned by user_id but a query groups by country, all records for the same country must be sent to the same partition -- requiring a full cross-network data transfer. In Spark, shuffles are the boundary between stages: operations within a stage are pipelined in memory; shuffles materialize intermediate data to disk and transfer it across the network. A shuffle of 1 TB of data across 200 nodes generates massive network I/O and is often the bottleneck of the entire job.
**Data skew** is the most insidious problem in distributed processing. When one partition contains vastly more data than others (e.g., a 'null' key, a celebrity user, or a popular product), that partition becomes a bottleneck while other nodes sit idle. A perfectly parallel 10-minute job can take 2 hours if one partition has 100x more data. Mitigating skew with techniques like salted keys, broadcast joins, and Spark's Adaptive Query Execution (AQE) is a critical skill for data engineers.
Skewed Join in an Ad Analytics Pipeline
An ad analytics pipeline joins a clicks table (10 billion rows) with an advertisers table (100K rows) on advertiser_id. Without optimization, Spark hash-partitions both tables by advertiser_id and shuffles 10 billion rows across the network. Worse, one advertiser (a global brand) accounts for 30% of all clicks, causing massive skew -- one partition processes 3 billion rows while others process 50 million each. Two optimizations fix this: (1) Broadcast join -- the advertisers table is only 100K rows (~5MB), so Spark broadcasts it to every executor and the join becomes a local lookup with zero shuffle. (2) If the small table were larger, salted keys would split the hot advertiser's partition into 10 sub-partitions, spreading the 3 billion rows across 10 tasks instead of 1.
Meta (Facebook)
Meta's data warehouse processes exabytes daily using Spark and Presto. Data is partitioned by date (range partitioning) with sub-partitioning by country for their largest tables. Their largest shuffle operations (joining user activity with user profiles across 3 billion users) transfer petabytes across the network. They mitigate skew from highly active users (celebrities, bots) using salted keys and cardinality-aware partitioning that pre-splits known hot keys.
Databricks
Databricks developed Spark's Adaptive Query Execution (AQE) to handle the shuffle and skew problems automatically. AQE observes actual partition sizes at shuffle boundaries and makes runtime decisions: coalescing 1,000 tiny partitions into 50 appropriately-sized ones, splitting a skewed 10GB partition into 10 x 1GB sub-partitions, and switching from sort-merge join to broadcast join when a table turns out to be smaller than estimated. AQE reduced the need for manual tuning by 60% in their customer benchmarks.
LinkedIn's Spark-based ETL processes trillions of member activity events daily. They discovered that partitioning their member-activity table by member_id created severe skew: power users (recruiters, influencers) generated 1000x more events than average users. They solved this with dynamic repartitioning: a pre-processing step computes per-key cardinalities and assigns hot keys to multiple partitions using a salt suffix. This reduced their largest job's runtime from 8 hours to 45 minutes.
| Aspect | Description |
|---|---|
| Partition Count vs Partition Size | Too few partitions (e.g., 10 partitions for 1 TB) create large partitions that exhaust executor memory and prevent parallelism. Too many partitions (e.g., 100,000 partitions for 1 TB) create tiny partitions with excessive scheduling overhead, small file problems, and underutilized executors. The sweet spot is partitions of 128MB-1GB each. Spark's AQE can automatically coalesce small post-shuffle partitions. |
| Shuffle Volume vs Computation | Reducing shuffle data volume is almost always worth additional computation. Map-side partial aggregation (combiners) reduces the data sent to the reduce stage. Pre-filtering before a join reduces shuffle volume. Broadcast joins eliminate shuffles entirely for small tables. Every byte not shuffled saves network I/O, disk I/O, and serialization/deserialization CPU. |
| Pre-Partitioning vs Runtime Flexibility | Pre-partitioning data by a known query key (e.g., storing data bucketed by user_id) eliminates shuffle for queries on that key. But if a different query groups by product_id, a full shuffle is still needed. Pre-partitioning optimizes for known access patterns at the cost of flexibility. Iceberg's hidden partitioning and partition evolution help by allowing the partition strategy to change over time without rewriting data. |
| Skew Mitigation Complexity vs Job Performance | Techniques like salted keys, two-phase aggregation, and custom partitioners add code complexity and maintenance burden. AQE handles many skew scenarios automatically but cannot solve all cases (e.g., it does not help with skewed source data). For critical production jobs, investing in skew mitigation pays dividends; for ad-hoc exploration, the overhead may not be worth it. |
Spotify's Shuffle Optimization for Royalty Calculations
Scenario
Spotify's royalty calculation pipeline joins a massive streams table (billions of rows per day of individual song plays) with a rights-holders table (millions of rows of complex licensing agreements) on track_id. The shuffle for this join transferred 5+ TB of data across the network. Additionally, a handful of viral tracks accounted for 10% of all streams, creating severe data skew in the track_id-partitioned shuffle.
Solution
Spotify implemented a multi-pronged optimization: (1) Bucketed tables -- both the streams and rights-holders tables were pre-bucketed by track_id into 2,048 buckets, eliminating the shuffle entirely for the join (bucket-to-bucket local join). (2) For the daily incremental pipeline (only today's new streams), they could not pre-bucket, so they used a salted join: hot track_ids were detected dynamically, salted into 16 sub-keys, and the rights-holders table was replicated 16x for those keys. (3) Map-side partial aggregation reduced the stream count per track before the join.
Outcome
The shuffle volume dropped from 5 TB to under 200 GB (96% reduction). Job wall-clock time dropped from 4 hours to 35 minutes. The skewed partition that previously took 90 minutes (while all other partitions finished in 10 minutes) was eliminated by salting. Cluster cost for the royalty pipeline dropped by 70% because fewer executor-hours were needed.
See Data Partitioning & Shuffles in action
Explore system design templates that use data partitioning & shuffles and run traffic simulations to see how these concepts perform under real load.
Browse Templates1What is data skew, and why is it problematic in distributed processing?
2How does a broadcast join eliminate shuffles in Spark?