Database Sharding

Explore how database sharding partitions data across multiple database instances to distribute load, increase storage capacity, and improve write throughput for large-scale applications.

Overview

Database sharding is a horizontal partitioning strategy that distributes rows of a database table across multiple independent database instances, called shards. Each shard holds a subset of the total data and operates as a fully functional database, handling reads and writes for its partition independently. The goal is to overcome the limitations of a single database instance by spreading load across multiple machines.

The most critical decision in sharding is choosing the shard key -- the column or attribute used to determine which shard holds a given row. Common strategies include hash-based sharding (applying a hash function to the shard key to uniformly distribute data), range-based sharding (assigning contiguous ranges of the shard key to each shard), and directory-based sharding (maintaining a lookup table that maps each key to its shard). Each strategy has different implications for data distribution, query routing, and shard rebalancing.
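The three routing strategies can be sketched side by side. This is a minimal illustration, not a production router: the shard count, the range boundaries, the tenant names in the directory, and the function names are all assumptions made for the example.

```python
import bisect
import hashlib

NUM_SHARDS = 4  # assumed shard count for illustration


def hash_shard(key: str) -> int:
    """Hash-based: a stable hash of the shard key, modulo the shard count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# Range-based: shard i holds keys below RANGE_UPPER_BOUNDS[i];
# the last shard takes everything at or above the final bound.
RANGE_UPPER_BOUNDS = [250_000, 500_000, 750_000]  # hypothetical boundaries


def range_shard(user_id: int) -> int:
    """Range-based: binary search over the sorted range boundaries."""
    return bisect.bisect_right(RANGE_UPPER_BOUNDS, user_id)


# Directory-based: an explicit lookup table. In a real system this would be
# a replicated service or cached table, not an in-process dict.
DIRECTORY = {"tenant-a": 0, "tenant-b": 3}  # hypothetical tenants


def directory_shard(key: str) -> int:
    """Directory-based: look the key up; unknown keys raise KeyError."""
    return DIRECTORY[key]
```

Note that the hash router uses a cryptographic digest rather than Python's built-in `hash()`, because the built-in string hash is randomized per process and would route the same key differently after a restart.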

Hash-based sharding provides excellent data distribution -- rows are evenly spread across shards -- but makes range queries expensive because data for a range of keys is scattered across all shards. Range-based sharding keeps contiguous data together (great for time-series queries or alphabetical lookups) but can create hotspots if the access pattern is skewed toward recent data. Directory-based sharding offers maximum flexibility but introduces a lookup service that becomes a potential single point of failure and latency bottleneck.
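To make the range-query cost concrete, the sketch below counts how many shards a scan over keys 100 to 199 would have to contact under each scheme. The shard count, the range width per shard, and the helper names are assumptions chosen for the example.

```python
import hashlib

NUM_SHARDS = 4        # assumed shard count
KEYS_PER_SHARD = 250  # assumed: range shard i holds keys [i*250, (i+1)*250)


def hash_shard(key: int) -> int:
    """Route by a stable hash of the key."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def range_shard(key: int) -> int:
    """Route by contiguous key range; the last shard absorbs any overflow."""
    return min(key // KEYS_PER_SHARD, NUM_SHARDS - 1)


def shards_touched(route, lo: int, hi: int) -> set:
    """Return the set of shards that hold at least one key in [lo, hi]."""
    return {route(key) for key in range(lo, hi + 1)}


range_hits = shards_touched(range_shard, 100, 199)  # a single shard
hash_hits = shards_touched(hash_shard, 100, 199)    # typically every shard
```

With range routing the whole scan lands on shard 0. With hash routing the same 100 keys are spread over the shards, so the query becomes a scatter-gather across them.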

Sharding introduces substantial complexity and should not be undertaken lightly. Cross-shard queries (joins across data on different shards) are expensive or impossible. Transactions that span multiple shards require distributed transaction protocols. Resharding (adding or removing shards) requires data migration that can take hours or days for large datasets. For these reasons, sharding is typically a last resort after vertical scaling, read replicas, caching, and query optimization have been exhausted.
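A cross-shard aggregation usually takes the form of a scatter-gather: every shard computes a partial answer, and a coordinator merges them. The sketch below does this for a "top k by follower count" query. The in-memory dictionaries stand in for real shards, and the data and function name are made up for illustration.

```python
import heapq

# Hypothetical shards: each maps user_id -> follower_count.
shards = [
    {1: 120, 5: 900, 9: 15},
    {2: 4000, 6: 30},
    {3: 75, 7: 2500, 11: 610},
]


def top_k_followers(shards, k):
    """Scatter-gather: ask each shard for its local top k, then merge."""
    partials = []
    for shard in shards:
        # Scatter step: in a real system this is one network round trip
        # per shard, so latency is bounded by the slowest shard.
        partials.extend(heapq.nlargest(k, shard.items(), key=lambda kv: kv[1]))
    # Gather step: the global top k is always contained in the union of
    # the per-shard top-k lists.
    return heapq.nlargest(k, partials, key=lambda kv: kv[1])


leaders = top_k_followers(shards, 3)  # [(2, 4000), (7, 2500), (5, 900)]
```

Even in this simple case the coordinator touches every shard and transfers up to k rows from each, which is why frequently run global queries are often served from a separate precomputed store.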

Key Points
  • 1. Sharding distributes data across multiple database instances. Each shard holds a subset of rows and operates independently, so storage capacity and write throughput scale roughly linearly with the number of shards when load is evenly distributed.
  • 2. The shard key determines data distribution. A good shard key has high cardinality, even distribution, and aligns with the most common query patterns to avoid cross-shard queries.
  • 3. Hash-based sharding provides uniform distribution but scatters range queries. Range-based sharding keeps related data together but can create hotspots. The choice depends on access patterns.
  • 4. Cross-shard operations (joins, transactions, aggregations) are the primary cost of sharding. Design your data model so that the most frequent queries are satisfied by a single shard.
  • 5. Resharding (changing the number of shards) is operationally expensive. Consistent hashing minimizes data movement during resharding by ensuring that only about 1/N of the data needs to move when adding the Nth shard.
  • 6. Many database systems offer built-in or add-on sharding: MongoDB (auto-sharding), CockroachDB (range-based), Vitess (MySQL sharding proxy), and Citus (PostgreSQL extension). These reduce the engineering burden compared to application-level sharding.
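The resharding point can be checked with a small experiment. The sketch below builds a hash ring with virtual nodes, adds a fourth shard, and compares the fraction of keys that move against naive modulo routing. The class name, shard names, virtual-node count, and key format are all assumptions for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable 64-bit hash used for both ring points and keys."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring; each shard owns several points (virtual nodes)."""

    def __init__(self, shards, vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (point, shard_name)
        for shard in shards:
            self.add(shard)

    def add(self, shard):
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{shard}#{i}"), shard))

    def lookup(self, key):
        # The first ring point at or after the key's hash owns the key,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_right(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


keys = [f"user:{i}" for i in range(10_000)]

ring = HashRing(["shard-0", "shard-1", "shard-2"])
before = {key: ring.lookup(key) for key in keys}
ring.add("shard-3")
after = {key: ring.lookup(key) for key in keys}

# Fraction of keys that changed shard: roughly 1/4 when going from 3 to 4.
ring_moved = sum(before[k] != after[k] for k in keys) / len(keys)

# Naive modulo routing for comparison: roughly 3/4 of keys change shard.
modulo_moved = sum(_hash(k) % 3 != _hash(k) % 4 for k in keys) / len(keys)
```

Every key that moves on the ring moves to the new shard, so the existing shards only give data away and never exchange it among themselves. The exact fraction varies with the number of virtual nodes.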
Simple Example

The Library Analogy

Imagine a library that has grown so large that a single building cannot hold all the books and serve all the visitors. Instead of building one enormous library (vertical scaling), the city opens multiple branch libraries (sharding). Books with authors whose last names start with A-H go to Branch 1, I-P to Branch 2, and Q-Z to Branch 3. Each branch operates independently, with its own staff and card catalog. Visitors know which branch to go to based on the author's name. This works well for looking up specific books, but becomes a problem when someone wants to find all books published in 2024 -- they must visit all three branches and combine the results (a cross-shard query).
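The analogy translates almost directly into code. In this sketch the branch ranges come from the paragraph above, while the book titles, author names, and function names are placeholders invented for the example.

```python
# Branch ranges from the analogy: the author's last-name initial is the shard key.
BRANCHES = {
    "Branch 1": ("A", "H"),
    "Branch 2": ("I", "P"),
    "Branch 3": ("Q", "Z"),
}
shelves = {name: [] for name in BRANCHES}


def branch_for(author_last_name):
    """Route by the first letter of the author's last name."""
    initial = author_last_name[0].upper()
    for name, (first, last) in BRANCHES.items():
        if first <= initial <= last:
            return name
    raise ValueError(f"no branch covers initial {initial!r}")


def add_book(title, author_last_name, year):
    shelves[branch_for(author_last_name)].append((title, author_last_name, year))


def books_by_author(author_last_name):
    """Single-branch lookup: the shard key tells us where to go."""
    branch = branch_for(author_last_name)
    return [book for book in shelves[branch] if book[1] == author_last_name]


def books_published_in(year):
    """Cross-shard query: the year is not the shard key, so visit every branch."""
    results = []
    for branch_books in shelves.values():
        results.extend(book for book in branch_books if book[2] == year)
    return results


add_book("Book One", "Adams", 2024)      # placeholder titles and authors
add_book("Book Two", "Morrison", 2019)
add_book("Book Three", "Zhang", 2024)
```

Here `books_by_author("Adams")` searches one branch, while `books_published_in(2024)` has to search all three and combine the results.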

Real-World Examples

Instagram

Instagram shards its PostgreSQL databases by user ID. All data for a given user (photos, comments, likes, followers) resides on the same shard, which means the most common queries (load a user's profile, display their feed) hit a single shard. With thousands of shards, Instagram can store billions of photos while maintaining low-latency queries. Cross-user aggregations (trending hashtags, global search) are handled by separate denormalized data stores.

Uber

Uber uses Schemaless, a custom sharding layer built on top of MySQL. Trip data is sharded by city and time range, keeping recent trips for a given city on the same shard for fast access. Historical data is migrated to cold storage shards. The shard key choice reflects Uber's primary access pattern: looking up recent trips in a specific geographic area.

Pinterest

Pinterest built a custom sharding solution on top of MySQL that distributes pin data across thousands of shards, with the shard encoded directly in each object's ID. Each shard is a standard MySQL instance, and the application layer handles shard routing. Pinterest chose application-level sharding over a database proxy to maintain full control over query routing and avoid the latency overhead of a proxy layer.

Trade-Offs
Aspect | Description
Write Scalability vs Query Flexibility | Sharding dramatically increases write throughput by distributing writes across multiple independent databases. However, queries that span multiple shards (cross-shard joins, global aggregations) become expensive or impractical, limiting the types of queries the system can efficiently support.
Data Locality vs Even Distribution | Range-based sharding keeps related data together (good for range queries) but risks uneven distribution if the key space is skewed. Hash-based sharding ensures uniform distribution but scatters related data across shards. The optimal choice depends on whether the workload is point-lookup-heavy or range-query-heavy.
Operational Complexity | Each shard is an independent database that needs monitoring, backups, schema migrations, and failover configuration. Managing hundreds of shards requires significant operational tooling and expertise. Schema changes must be applied to every shard, and a failed migration on one shard can block the entire system.
Application Complexity | The application must know how to route queries to the correct shard, handle cross-shard query aggregation, and manage distributed transactions. This routing logic becomes a critical piece of infrastructure that must be maintained, tested, and evolved alongside the database schema.
Case Study

Pinterest's Journey to Database Sharding

Scenario

Pinterest started with a single PostgreSQL database. As user growth accelerated, they migrated to a few large PostgreSQL instances with read replicas. When even vertical scaling and read replicas could not handle the write volume (billions of pins, saves, and interactions), they needed a sharding strategy. The challenge was to shard without downtime, with minimal application code changes, and in a way that could scale to thousands of shards.

Solution

Pinterest designed a custom sharding scheme using MySQL (chosen over PostgreSQL for its simpler replication). Each shard is a standard MySQL instance. A lightweight routing library in the application layer maps each object ID to a shard by reading the shard ID embedded in the ID itself: the ID format includes bits for shard ID, object type, and local sequence number. This design allows the application to determine the correct shard from the ID alone without any lookup service. Resharding is handled by splitting a shard into two new shards and replaying the binary log.
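An ID layout of this kind can be sketched with plain bit shifts. The field widths below (16 bits for the shard, 10 for the object type, 36 for the local sequence number) and the function names are assumptions chosen for illustration; the text above does not specify the actual widths.

```python
# Assumed field widths for illustration; together they fit in 62 bits.
SHARD_BITS, TYPE_BITS, LOCAL_BITS = 16, 10, 36


def encode_id(shard_id: int, type_id: int, local_id: int) -> int:
    """Pack shard, object type, and local sequence number into one integer."""
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard_id out of range")
    if not 0 <= type_id < (1 << TYPE_BITS):
        raise ValueError("type_id out of range")
    if not 0 <= local_id < (1 << LOCAL_BITS):
        raise ValueError("local_id out of range")
    return (shard_id << (TYPE_BITS + LOCAL_BITS)) | (type_id << LOCAL_BITS) | local_id


def decode_id(object_id: int):
    """Recover (shard_id, type_id, local_id) using only shifts and masks."""
    local_id = object_id & ((1 << LOCAL_BITS) - 1)
    type_id = (object_id >> LOCAL_BITS) & ((1 << TYPE_BITS) - 1)
    shard_id = (object_id >> (TYPE_BITS + LOCAL_BITS)) & ((1 << SHARD_BITS) - 1)
    return shard_id, type_id, local_id


object_id = encode_id(shard_id=1241, type_id=1, local_id=7_075_733)
shard_id, type_id, local_id = decode_id(object_id)  # (1241, 1, 7075733)
```

Because routing is a pure function of the ID, any application server can find the right shard without a network call. The trade-off is that an object's shard is fixed at creation time, so moving an object to a different shard means giving it a new ID or remapping shard IDs to hosts.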

Outcome

Pinterest scaled from a handful of database instances to over 8,000 MySQL shards, supporting hundreds of billions of pins. The sharding scheme is transparent to most application developers because the routing library handles shard selection automatically. The consistent ID format means that objects can be resolved to their shard in constant time with no network overhead. Write throughput scales linearly with the number of shards, and the system handles over a million database operations per second.

Common Mistakes
  • Choosing a low-cardinality shard key. Sharding by a column with few distinct values (country, category) creates large, uneven shards. Choose a high-cardinality key like user ID or object ID for uniform distribution.
  • Sharding too early. Sharding adds enormous complexity. If your dataset fits on a single well-tuned database instance with read replicas and proper indexing, the operational simplicity of a single database far outweighs the theoretical scalability of sharding.
  • Not co-locating related data. If your most common query needs data from multiple shards, you have chosen the wrong shard key. Analyze your access patterns first and choose a shard key that keeps frequently co-accessed data on the same shard.
  • Ignoring hotspot potential. Even with a good shard key, power-law distributions can create hotspots. A celebrity user with millions of followers generates disproportionate load on their shard. Monitor shard load distribution and have a strategy for splitting hot shards.
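Monitoring for hotspots can start with a very simple skew metric. The sketch below computes the ratio of the busiest shard's load to the mean load and flags shards above a threshold. The 1.5x threshold, the sample numbers, and the function names are assumptions for illustration, not recommended values.

```python
def load_skew(shard_loads):
    """Ratio of the busiest shard's load to the mean; 1.0 is perfectly balanced."""
    mean = sum(shard_loads) / len(shard_loads)
    return max(shard_loads) / mean


def hot_shards(shard_loads, threshold=1.5):
    """Indices of shards whose load exceeds threshold times the mean load."""
    mean = sum(shard_loads) / len(shard_loads)
    return [i for i, load in enumerate(shard_loads) if load > threshold * mean]


# Hypothetical requests-per-second per shard.
balanced = [100, 100, 100, 100]
skewed = [100, 100, 100, 500]  # shard 3 hosts a very popular account

balanced_skew = load_skew(balanced)  # 1.0
skewed_skew = load_skew(skewed)      # 2.5
candidates = hot_shards(skewed)      # [3]
```

The load values could be queries per second, storage size, or CPU utilization. Whichever signal is used, a shard that stays above the threshold is a candidate for splitting or for moving its heaviest keys elsewhere.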
Test Your Understanding

1. A social media app shards user data by user_id. A feature request requires displaying a leaderboard of the top 100 users by follower count across all users. What is the core challenge?

2. When using consistent hashing for shard routing, what happens when a new shard is added to a ring with N existing shards?