Vetora logo
📏Partitioning

Range Partitioning

Range partitioning splits data across nodes by dividing the key space into contiguous ranges, where each partition owns all keys between a lower and upper boundary. This strategy keeps data sorted within each partition and enables efficient range queries, but can lead to hotspots when keys are sequentially distributed.

Overview

Range partitioning is one of the most intuitive data distribution strategies in distributed systems. The key space is divided into contiguous, non-overlapping ranges, and each partition (also called a shard, region, or tablet depending on the system) is responsible for all keys that fall within its boundaries. For example, a user database might assign users with last names A-F to partition 1, G-L to partition 2, M-R to partition 3, and S-Z to partition 4. Each partition stores its data in sorted order, which makes range scans and ordered queries extremely efficient because all relevant data lives on a single node or a small set of adjacent nodes.

The primary advantage of range partitioning is query locality. When an application needs to retrieve all orders placed between January 1 and January 31, a range-partitioned system can route the query directly to the partition(s) holding that date range, avoiding the scatter-gather overhead that hash partitioning requires. This makes range partitioning the natural choice for time-series data, alphabetical lookups, geospatial queries (when using space-filling curves like Z-order or Hilbert curves), and any workload that frequently scans contiguous segments of the key space. Systems like HBase, Bigtable, and CockroachDB all use range partitioning as their primary data distribution strategy.

However, range partitioning has a critical weakness: hotspots. When keys are sequentially generated -- timestamps, auto-incrementing IDs, or monotonically increasing sequence numbers -- all writes concentrate on the partition that owns the current range boundary. This creates a single hot partition that becomes a throughput bottleneck while other partitions sit idle. The classic example is a time-series workload where every new event writes to the partition holding the current time range. The hot partition's disk I/O, CPU, and memory are saturated, while partitions holding historical data receive zero write traffic.

To mitigate hotspots, systems employ several strategies. CockroachDB automatically splits ranges when they exceed 512MB or experience disproportionate load, redistributing the data across nodes. HBase allows administrators to pre-split regions based on expected key distribution, and supports salting row keys (prepending a hash prefix) to scatter sequential writes. Bigtable dynamically splits and merges tablets based on size, and recommends designing row keys that distribute writes evenly -- for example, reversing a domain name (com.example instead of example.com) so that subdomains of the same site do not cluster on one tablet. The choice between range and hash partitioning often comes down to whether the workload is read-heavy with range scans (favoring range) or write-heavy with random access (favoring hash).

Key Points
  • 1Range partitioning divides the key space into contiguous, non-overlapping segments where each partition owns all keys between a lower bound (inclusive) and upper bound (exclusive). Partition boundaries can be chosen manually or determined automatically by the system.
  • 2Data within each partition remains sorted by key, enabling efficient range scans, ordered iteration, and prefix queries without needing a secondary index. A query for 'all users with names starting with J' hits exactly one partition.
  • 3Hotspots occur when write patterns are sequential (timestamps, auto-incrementing IDs), causing all writes to funnel into the partition that owns the current end of the key space. This is the single biggest risk of range partitioning.
  • 4Automatic range splitting (used by CockroachDB at 512MB, Bigtable at configurable thresholds) subdivides large or hot partitions without manual intervention, but splitting itself requires coordination and briefly pauses writes to the affected range.
  • 5Pre-splitting is a technique where an operator defines initial partition boundaries based on expected data distribution, avoiding the performance dip that occurs when a single initial partition must be repeatedly split under load.
  • 6Range partitioning can be combined with key design strategies like salting (prepending a hash prefix), key reversal, or composite keys to mitigate hotspots while still enabling some degree of range query efficiency within each salted bucket.
Simple Example

The Library Card Catalog

Imagine a library with four filing cabinets for its card catalog. Cabinet 1 holds cards for authors A-F, Cabinet 2 holds G-L, Cabinet 3 holds M-R, and Cabinet 4 holds S-Z. If you want to find all authors whose last name starts with 'K', you go directly to Cabinet 2 -- no need to search all four. However, if the library suddenly receives a huge donation of books from authors whose names start with 'S', Cabinet 4 becomes overloaded while the others remain half-empty. The librarian must then re-divide the ranges -- maybe splitting Cabinet 4 into S-T and U-Z -- to rebalance the load. This is exactly how range partitioning works: fast lookups for contiguous ranges, but imbalanced data distribution requires periodic boundary adjustment.

Real-World Examples

HBase

HBase partitions tables into regions, where each region holds a contiguous range of row keys sorted lexicographically. RegionServers host one or more regions, and the HBase Master assigns regions to servers. When a region exceeds a configurable size threshold (default 10GB in modern versions), it splits into two daughter regions at the midpoint of the row key range. Operators commonly pre-split tables and design row keys with salted prefixes (e.g., prepending hash(timestamp) mod 10) to prevent hotspotting on time-series workloads.

CockroachDB

CockroachDB organizes all data into ranges, each targeting 512MB by default. Ranges are automatically split when they exceed this threshold and merged when adjacent ranges shrink below a minimum size. The system also performs load-based splitting: if a range receives disproportionate query traffic, CockroachDB splits it even if it has not reached the size limit. Ranges are replicated via Raft consensus and can be rebalanced across nodes without downtime using lease transfers.

Google Bigtable

Bigtable partitions tables into tablets, each covering a contiguous range of row keys. Tablets start as a single range and are dynamically split by the tablet server when they grow beyond a configurable size (typically 100-200MB). The Bigtable master reassigns tablets across tablet servers for load balancing. Google recommends designing row keys to avoid sequential write patterns -- for example, using reversed timestamps (Long.MAX_VALUE - timestamp) so that recent data spreads across multiple tablets instead of concentrating at the end of the key space.

Trade-Offs
AspectDescription
Range Query Efficiency vs Write DistributionRange partitioning excels at range scans because contiguous keys reside on the same partition, but this locality guarantee is exactly what causes hotspots when writes are sequential. Hash partitioning provides uniform write distribution but requires scatter-gather for range queries. The fundamental tension is between read locality and write balance.
Manual vs Automatic Boundary ManagementManual boundary selection gives operators control over data placement but requires ongoing maintenance as data volume and access patterns change. Automatic splitting (CockroachDB, Bigtable) reduces operational burden but can cause transient performance dips during split operations and may produce suboptimal boundaries if the splitting heuristic does not account for query patterns.
Pre-splitting vs Organic GrowthPre-splitting avoids the initial single-partition bottleneck and is essential for high-ingest workloads, but requires upfront knowledge of the key distribution. Over-splitting wastes resources on empty partitions; under-splitting forces expensive splits under load. Organic growth (starting with one partition and splitting as needed) is simpler but creates a bootstrapping bottleneck.
Key Design ComplexityMitigating hotspots in range-partitioned systems often requires complex key design -- salting, reversing, composite keys -- which adds application-level complexity and can defeat the range scan benefits that motivated choosing range partitioning in the first place. Salted keys scatter sequential data, making point lookups require multiple queries (one per salt bucket).
Case Study

HBase Time-Series Hotspotting at a Social Media Platform

Scenario

A social media platform stored user activity events in HBase with row keys of the form timestamp_userId. During peak hours, the platform ingested 500,000 events per second. Because timestamps are monotonically increasing, all writes funneled into the single region that owned the current time range, while hundreds of other regions sat idle. The hot region's RegionServer hit 100% CPU and began dropping writes, causing data loss during traffic spikes.

Solution

The team redesigned the row key to userId_timestamp, which distributed writes across regions by user ID while still allowing efficient range scans for a single user's activity history. They also pre-split the table into 256 regions based on the first two hex characters of the hashed user ID, ensuring immediate write distribution from startup. For global time-range queries (e.g., all events in the last hour), they built a secondary MapReduce pipeline that scanned all regions in parallel.

Outcome

Write throughput increased from 50,000 to 500,000 events per second with uniform distribution across RegionServers. The P99 write latency dropped from 800ms to 12ms. The tradeoff was that global time-range queries now required a full table scan, but this was acceptable because those queries ran as offline batch jobs rather than real-time requests.

Common Mistakes
  • Using monotonically increasing keys (timestamps, auto-increment IDs) as the primary partition key in a range-partitioned system. This guarantees a write hotspot on the last partition. Always design keys that distribute writes across the key space.
  • Failing to pre-split tables for high-ingest workloads. Starting with a single partition means all initial writes go to one node, creating a bottleneck that persists until the system has split enough times to distribute load -- by which point data may have been lost or delayed.
  • Assuming range partitioning and hash partitioning are mutually exclusive. Many production systems use a hybrid approach: hash the key for partition assignment but use a clustering column for sorted order within each partition (similar to Cassandra's compound primary key).
  • Ignoring partition size skew over time. Even if initial boundaries are well-chosen, data growth patterns change. A partition that starts with 10% of data may grow to hold 60% after a year. Without automatic splitting or periodic manual rebalancing, performance degrades silently.
Related Concepts

See Range Partitioning in action

Explore system design templates that use range partitioning and run traffic simulations to see how these concepts perform under real load.

Browse Templates

Visualize range partition hotspots as write patterns shift

Metrics to watch
partition_write_skewrange_query_latencysplit_operations
Run Simulation
Test Your Understanding

1A time-series database uses range partitioning with timestamp as the partition key. What is the most likely operational problem?

2What technique does HBase recommend to avoid write hotspots on range-partitioned tables with sequential keys?

3CockroachDB automatically splits a range when it exceeds what default size threshold?

Deeper Reading