Vetora logo
🔥Partitioning

Hotspot Mitigation

Hotspot mitigation addresses the problem of one partition receiving disproportionate traffic compared to others. Hotspots arise from celebrity-problem access patterns, temporal key distribution, or poor shard key selection. Solutions include key salting, dynamic partition splitting, read caching, and application-level routing -- each trading complexity for more uniform load distribution.

Overview

In any partitioned system, the theoretical promise of uniform load distribution often collides with the reality of skewed access patterns. A hotspot forms when one partition receives significantly more read or write traffic than its peers, becoming a throughput bottleneck that limits the scalability of the entire system. The system is only as fast as its slowest partition, so a single hotspot can negate the benefits of partitioning across dozens or hundreds of otherwise healthy nodes. Hotspots are one of the most common and impactful operational problems in distributed databases, caches, and message queues.

The causes of hotspots fall into three main categories. First, the celebrity problem (also called the thundering-herd or viral-content problem): when a single key receives massively disproportionate traffic. A tweet from a celebrity with 100 million followers generates millions of read requests for a single partition key. A viral product listing on an e-commerce site concentrates all traffic on the partition owning that product ID. Second, temporal patterns: in range-partitioned systems, all writes for 'today' go to the partition holding the current time range, while historical partitions sit idle. Third, poor shard key selection: choosing a low-cardinality key (country code, status enum) or a highly correlated key (department_id in a company where 80% of employees are in engineering) creates permanent structural imbalance.

Salting is the most common write-side mitigation technique. By appending a random suffix (e.g., a number from 0-9) to the key before hashing, writes for a single logical key are scattered across multiple partitions. Instead of all writes to celebrity_tweet_123 hitting one partition, writes go to celebrity_tweet_123_0 through celebrity_tweet_123_9 across 10 partitions. The tradeoff is that reads must now query all 10 salted variants and merge the results. This is acceptable when writes are the bottleneck (which they usually are for hot keys) and reads can tolerate the fan-out overhead. Instagram used this technique to handle time-bucketed partition hotspots by adding a shard suffix to their partition keys.

Dynamic partition splitting is the system-level solution. CockroachDB detects hot ranges by monitoring per-range query rates and automatically splits them, distributing the load across two new ranges on different nodes. DynamoDB's adaptive capacity feature (introduced in 2019) isolates hot partition keys by splitting the underlying physical partition and dedicating throughput to the hot key. On the read side, caching is the first line of defense: placing a cache (Redis, Memcached) in front of the database absorbs repeated reads for hot keys before they reach the partition. Application-level routing takes this further by maintaining a list of known hot keys and routing them to dedicated read replicas or specialized caching tiers, keeping hot-key traffic entirely separate from normal-path queries.

Key Points
  • 1A hotspot occurs when one partition receives disproportionately more traffic than others, making it the bottleneck for the entire system. The total system throughput is limited by the throughput of the hottest partition, regardless of how many other partitions exist.
  • 2The celebrity problem (also called the thundering herd) is the most dramatic cause: a single key (celebrity tweet, viral product, trending topic) receives millions of reads or writes, overwhelming the partition that owns it even if the system has thousands of partitions.
  • 3Temporal hotspots affect range-partitioned systems where the partition key includes a time component. All current writes funnel to the partition holding the most recent time range, while historical partitions receive zero write traffic.
  • 4Key salting scatters writes for a hot key across multiple partitions by appending a random suffix (e.g., key_0 through key_9). Reads must fan out to all salted variants and merge results, trading read complexity for write distribution.
  • 5Dynamic partition splitting (CockroachDB load-based splitting, DynamoDB adaptive capacity) detects hot partitions at the system level and automatically subdivides them, redistributing traffic without application-level changes.
  • 6Monitoring partition-level metrics (QPS per partition, latency percentiles per shard, disk I/O per node) is essential for detecting hotspots before they cause outages. Skew ratios (max partition load / average partition load) above 2x should trigger investigation.
Simple Example

The Grocery Store Checkout Problem

A grocery store has 10 checkout lanes. Normally, customers distribute themselves evenly. But one day, a celebrity walks in and everyone rushes to lane 5 to take photos, blocking the lane completely. The other 9 lanes are empty, but the store is effectively at a standstill because lane 5 has a 200-person queue. The store manager has several options: (1) Salting -- hand out random lane tickets so fans are spread across all 10 lanes. (2) Splitting -- open 3 additional express lanes just for the celebrity crowd. (3) Caching -- put up a photo of the celebrity at the entrance so people take their selfie there instead of blocking the checkout. (4) Routing -- create a dedicated VIP lane for the celebrity, keeping normal lanes clear. Each solution trades some convenience for better overall throughput.

Real-World Examples

Twitter

Twitter faced the celebrity problem when users with tens of millions of followers (e.g., Justin Bieber, Barack Obama) posted tweets. A naive read-on-demand approach would concentrate millions of timeline reads on the partition holding the celebrity's tweet. Twitter solved this with fan-out-on-write: when a celebrity tweets, the system precomputes and writes the tweet into each follower's home timeline cache (up to a limit). This trades write amplification for read distribution -- instead of millions of reads hitting one partition, each follower reads from their own timeline partition. For users with extremely large followings (>500K), Twitter uses a hybrid approach with fan-out-on-read to avoid excessive write amplification.

Instagram

Instagram experienced time-based partition hotspots when using Cassandra with a partition key based on time buckets. All photos uploaded in the current hour were written to the same partition, creating a severe write hotspot during peak usage (evenings and weekends). Instagram solved this by switching to a compound partition key of (user_id, time_bucket_suffix) where time_bucket_suffix was a small integer (0-9) appended based on a hash of the media ID. This spread writes for any given time window across multiple partitions while still allowing efficient queries for a specific user's recent photos.

Amazon DynamoDB

DynamoDB introduced adaptive capacity in 2019 to automatically handle hot partitions. When a partition key receives traffic that exceeds the per-partition throughput limit (3,000 RCU / 1,000 WCU), DynamoDB isolates the hot item by splitting the underlying physical partition and reallocating unused capacity from underutilized partitions. This happens transparently to the application. DynamoDB also supports burst capacity, borrowing unused throughput from the previous 5 minutes to absorb short traffic spikes on a hot partition without throttling.

Trade-Offs
AspectDescription
Salting: Write Distribution vs Read Fan-OutSalting distributes writes for a hot key across N partitions by appending a random suffix, but reads for that key must now query all N salted variants and merge results. The fan-out overhead is proportional to the salt factor N. Choose N based on the write-to-read ratio: high-write/low-read hot keys benefit from aggressive salting (N=10-100), while balanced read-write keys need a lower salt factor.
Caching: Read Absorption vs StalenessPlacing a cache in front of hot-read partitions absorbs the majority of requests before they reach the database, but introduces a staleness window. Cache invalidation must be carefully designed -- a stale celebrity tweet is tolerable, but a stale inventory count could cause overselling. The cache TTL represents a direct tradeoff between read load reduction and data freshness.
Dynamic Splitting: Automation vs Split/Merge OverheadAutomatic hot-partition splitting (CockroachDB, DynamoDB) requires no application changes but introduces transient performance disruption during the split operation itself. Frequent splitting and merging (thrashing) can occur if load fluctuates rapidly. Systems need hysteresis thresholds -- split at 2x average load, merge only below 0.25x -- to prevent oscillation.
Application-Level Routing: Precision vs Maintenance BurdenMaintaining an application-level list of known hot keys and routing them to dedicated infrastructure provides the most precise control but creates operational burden. The hot-key list must be updated in real-time as traffic patterns shift. A key that was hot yesterday may not be hot today, and a new viral event can create a hot key in seconds. Automated hot-key detection systems add complexity but are essential at scale.
Case Study

Twitter's Fan-Out-on-Write Architecture for Celebrity Tweets

Scenario

When a user with 50 million followers tweeted, the naive approach of fetching the tweet on-demand would concentrate 50 million read requests on the single partition holding that tweet. During events like the Super Bowl or elections, multiple celebrities tweeting simultaneously created cascading hotspots that overwhelmed the tweet storage tier, causing timeline rendering failures for hundreds of millions of users. The P99 timeline load latency would spike from 50ms to over 5 seconds during these events.

Solution

Twitter implemented fan-out-on-write: when a user tweets, a background pipeline pushes the tweet ID into each follower's home timeline cache (stored in Redis). Each follower's timeline read hits their own cache partition, eliminating the hotspot on the tweet's source partition. For users with extremely large followings (>500K followers), Twitter uses a hybrid approach: the tweet is fanned out to followers with smaller followings, but celebrity-to-celebrity edges use fan-out-on-read to avoid writing the same tweet to millions of timeline caches. A dedicated hot-key detection service monitors tweet engagement velocity and pre-warms caches for tweets that are going viral.

Outcome

Timeline read latency dropped to a consistent sub-10ms P99 regardless of how many followers the author had. The write amplification cost was approximately 4.6x (each tweet is written to an average of 4.6 timeline caches due to follower overlap and filtering), which Twitter accepted as reasonable given the dramatic improvement in read performance. The hybrid fan-out approach reduced write amplification for celebrity accounts by 75% compared to pure fan-out-on-write.

Common Mistakes
  • Assuming uniform distribution means no hotspots. Even with perfect hash partitioning and consistent hashing, a single key that receives 1 million QPS will hotspot its partition. Uniform distribution of keys across partitions does not prevent skewed access patterns on individual keys.
  • Salting all keys instead of just hot keys. Applying a salt factor of 10 to every key in the system turns every single-key read into a 10-partition fan-out, dramatically increasing read latency and system load for the 99.9% of keys that are not hot. Salt only the keys identified as hot, or use a dynamic salting system that activates salting per-key based on traffic thresholds.
  • Using only caching without addressing the underlying hotspot. A cache absorbs reads but does not eliminate writes to the hot partition. If the hotspot is write-dominated (e.g., a high-volume event stream), caching the reads has no effect on the write bottleneck. Match the mitigation to the type of hotspot: caching for read hotspots, salting for write hotspots.
  • Failing to monitor partition-level metrics. Aggregate cluster metrics (total QPS, average latency) can mask severe per-partition skew. A cluster averaging 100ms latency might have 9 partitions at 10ms and 1 partition at 910ms. Always monitor per-partition QPS, latency percentiles, and disk I/O, and alert on skew ratios exceeding 2x.
Related Concepts

See Hotspot Mitigation in action

Explore system design templates that use hotspot mitigation and run traffic simulations to see how these concepts perform under real load.

Browse Templates

Simulate a celebrity-problem hotspot and test mitigation strategies

Metrics to watch
max_partition_qpsskew_ratiop99_latencycache_hit_rate
Run Simulation
Test Your Understanding

1What is the 'celebrity problem' in the context of database partitioning?

2How does key salting mitigate write hotspots?

3Which AWS service feature automatically detects and splits hot partitions without application changes?

Deeper Reading