1A time-series database uses range partitioning with timestamp as the partition key. What is the most likely operational problem?
Range partitioning splits data across nodes by dividing the key space into contiguous ranges, where each partition owns all keys between a lower and upper boundary. This strategy keeps data sorted within each partition and enables efficient range queries, but can lead to hotspots when keys are sequentially distributed.
Range partitioning is one of the most intuitive data distribution strategies in distributed systems. The key space is divided into contiguous, non-overlapping ranges, and each partition (also called a shard, region, or tablet depending on the system) is responsible for all keys that fall within its boundaries. For example, a user database might assign users with last names A-F to partition 1, G-L to partition 2, M-R to partition 3, and S-Z to partition 4. Each partition stores its data in sorted order, which makes range scans and ordered queries extremely efficient because all relevant data lives on a single node or a small set of adjacent nodes.
The primary advantage of range partitioning is query locality. When an application needs to retrieve all orders placed between January 1 and January 31, a range-partitioned system can route the query directly to the partition(s) holding that date range, avoiding the scatter-gather overhead that hash partitioning requires. This makes range partitioning the natural choice for time-series data, alphabetical lookups, geospatial queries (when using space-filling curves like Z-order or Hilbert curves), and any workload that frequently scans contiguous segments of the key space. Systems like HBase, Bigtable, and CockroachDB all use range partitioning as their primary data distribution strategy.
However, range partitioning has a critical weakness: hotspots. When keys are sequentially generated -- timestamps, auto-incrementing IDs, or monotonically increasing sequence numbers -- all writes concentrate on the partition that owns the current range boundary. This creates a single hot partition that becomes a throughput bottleneck while other partitions sit idle. The classic example is a time-series workload where every new event writes to the partition holding the current time range. The hot partition's disk I/O, CPU, and memory are saturated, while partitions holding historical data receive zero write traffic.
To mitigate hotspots, systems employ several strategies. CockroachDB automatically splits ranges when they exceed 512MB or experience disproportionate load, redistributing the data across nodes. HBase allows administrators to pre-split regions based on expected key distribution, and supports salting row keys (prepending a hash prefix) to scatter sequential writes. Bigtable dynamically splits and merges tablets based on size, and recommends designing row keys that distribute writes evenly -- for example, reversing a domain name (com.example instead of example.com) so that subdomains of the same site do not cluster on one tablet. The choice between range and hash partitioning often comes down to whether the workload is read-heavy with range scans (favoring range) or write-heavy with random access (favoring hash).
The Library Card Catalog
Imagine a library with four filing cabinets for its card catalog. Cabinet 1 holds cards for authors A-F, Cabinet 2 holds G-L, Cabinet 3 holds M-R, and Cabinet 4 holds S-Z. If you want to find all authors whose last name starts with 'K', you go directly to Cabinet 2 -- no need to search all four. However, if the library suddenly receives a huge donation of books from authors whose names start with 'S', Cabinet 4 becomes overloaded while the others remain half-empty. The librarian must then re-divide the ranges -- maybe splitting Cabinet 4 into S-T and U-Z -- to rebalance the load. This is exactly how range partitioning works: fast lookups for contiguous ranges, but imbalanced data distribution requires periodic boundary adjustment.
HBase
HBase partitions tables into regions, where each region holds a contiguous range of row keys sorted lexicographically. RegionServers host one or more regions, and the HBase Master assigns regions to servers. When a region exceeds a configurable size threshold (default 10GB in modern versions), it splits into two daughter regions at the midpoint of the row key range. Operators commonly pre-split tables and design row keys with salted prefixes (e.g., prepending hash(timestamp) mod 10) to prevent hotspotting on time-series workloads.
CockroachDB
CockroachDB organizes all data into ranges, each targeting 512MB by default. Ranges are automatically split when they exceed this threshold and merged when adjacent ranges shrink below a minimum size. The system also performs load-based splitting: if a range receives disproportionate query traffic, CockroachDB splits it even if it has not reached the size limit. Ranges are replicated via Raft consensus and can be rebalanced across nodes without downtime using lease transfers.
Google Bigtable
Bigtable partitions tables into tablets, each covering a contiguous range of row keys. Tablets start as a single range and are dynamically split by the tablet server when they grow beyond a configurable size (typically 100-200MB). The Bigtable master reassigns tablets across tablet servers for load balancing. Google recommends designing row keys to avoid sequential write patterns -- for example, using reversed timestamps (Long.MAX_VALUE - timestamp) so that recent data spreads across multiple tablets instead of concentrating at the end of the key space.
| Aspect | Description |
|---|---|
| Range Query Efficiency vs Write Distribution | Range partitioning excels at range scans because contiguous keys reside on the same partition, but this locality guarantee is exactly what causes hotspots when writes are sequential. Hash partitioning provides uniform write distribution but requires scatter-gather for range queries. The fundamental tension is between read locality and write balance. |
| Manual vs Automatic Boundary Management | Manual boundary selection gives operators control over data placement but requires ongoing maintenance as data volume and access patterns change. Automatic splitting (CockroachDB, Bigtable) reduces operational burden but can cause transient performance dips during split operations and may produce suboptimal boundaries if the splitting heuristic does not account for query patterns. |
| Pre-splitting vs Organic Growth | Pre-splitting avoids the initial single-partition bottleneck and is essential for high-ingest workloads, but requires upfront knowledge of the key distribution. Over-splitting wastes resources on empty partitions; under-splitting forces expensive splits under load. Organic growth (starting with one partition and splitting as needed) is simpler but creates a bootstrapping bottleneck. |
| Key Design Complexity | Mitigating hotspots in range-partitioned systems often requires complex key design -- salting, reversing, composite keys -- which adds application-level complexity and can defeat the range scan benefits that motivated choosing range partitioning in the first place. Salted keys scatter sequential data, making point lookups require multiple queries (one per salt bucket). |
HBase Time-Series Hotspotting at a Social Media Platform
Scenario
A social media platform stored user activity events in HBase with row keys of the form timestamp_userId. During peak hours, the platform ingested 500,000 events per second. Because timestamps are monotonically increasing, all writes funneled into the single region that owned the current time range, while hundreds of other regions sat idle. The hot region's RegionServer hit 100% CPU and began dropping writes, causing data loss during traffic spikes.
Solution
The team redesigned the row key to userId_timestamp, which distributed writes across regions by user ID while still allowing efficient range scans for a single user's activity history. They also pre-split the table into 256 regions based on the first two hex characters of the hashed user ID, ensuring immediate write distribution from startup. For global time-range queries (e.g., all events in the last hour), they built a secondary MapReduce pipeline that scanned all regions in parallel.
Outcome
Write throughput increased from 50,000 to 500,000 events per second with uniform distribution across RegionServers. The P99 write latency dropped from 800ms to 12ms. The tradeoff was that global time-range queries now required a full table scan, but this was acceptable because those queries ran as offline batch jobs rather than real-time requests.
See Range Partitioning in action
Explore system design templates that use range partitioning and run traffic simulations to see how these concepts perform under real load.
Browse Templates1A time-series database uses range partitioning with timestamp as the partition key. What is the most likely operational problem?
2What technique does HBase recommend to avoid write hotspots on range-partitioned tables with sequential keys?
3CockroachDB automatically splits a range when it exceeds what default size threshold?