1What is the primary purpose of MVCC (Multi-Version Concurrency Control) in PostgreSQL and MySQL InnoDB?
Relational databases store data in structured tables with rows and columns, enforce ACID transactions, and use SQL for querying. With decades of maturity, battle-tested tooling, and strong consistency guarantees, they remain the default choice for most application workloads.
Relational databases are the most widely deployed data storage systems in the world, underpinning everything from small web applications to some of the largest internet platforms on the planet. The relational model, formalized by Edgar F. Codd in 1970, organizes data into relations (tables) composed of tuples (rows) and attributes (columns). Each table has a defined schema, and relationships between tables are expressed through foreign keys. This structure enables powerful declarative querying via SQL, where the developer describes what data they want and the query optimizer decides how to retrieve it.
The defining strength of relational databases is their transactional guarantees -- ACID (Atomicity, Consistency, Isolation, Durability). Atomicity ensures that multi-statement transactions either fully commit or fully roll back. Consistency enforces schema constraints, foreign key relationships, and CHECK constraints. Isolation prevents concurrent transactions from interfering with each other, implemented through Multi-Version Concurrency Control (MVCC) in both PostgreSQL and MySQL's InnoDB engine. Durability guarantees that committed transactions survive crashes via Write-Ahead Logging (WAL). These guarantees make relational databases the default choice for any workload where data correctness is non-negotiable: financial transactions, inventory management, user accounts, and order processing.
Performance in relational databases depends heavily on indexing and query optimization. B-tree indexes are the workhorse data structure, providing O(log n) lookups, range scans, and ordered traversal. PostgreSQL also supports GIN indexes for full-text search and JSONB, GiST indexes for geometric and geographic data, and BRIN indexes for large sequential datasets. The query planner analyzes table statistics to choose between sequential scans, index scans, hash joins, merge joins, and nested loops -- often making decisions that outperform hand-optimized queries. Connection pooling (via PgBouncer for PostgreSQL or ProxySQL for MySQL) is critical at scale because each database connection consumes memory for process state, and most applications need far fewer concurrent connections than concurrent requests.
PostgreSQL and MySQL represent two philosophies within the relational world. PostgreSQL prioritizes correctness, extensibility, and advanced features -- it supports custom types, window functions, recursive CTEs, partial indexes, materialized views, and JSONB for semi-structured data. MySQL prioritizes simplicity, read performance, and operational ease -- its simpler replication model (async binary log replication) made it the backbone of the early web era (LAMP stack). Both databases have converged significantly: MySQL 8.0 added CTEs, window functions, and improved JSON support, while PostgreSQL's logical replication has become more flexible. The choice between them is increasingly one of ecosystem and organizational expertise rather than fundamental capability.
The Spreadsheet with Superpowers
Think of a relational database as a collection of linked spreadsheets. One spreadsheet lists customers (id, name, email), another lists orders (id, customer_id, total, date). The customer_id column in the orders sheet is a foreign key that links back to the customers sheet. Unlike a regular spreadsheet, the database enforces rules: you cannot create an order for a customer_id that does not exist (referential integrity), you cannot insert a negative total if a CHECK constraint forbids it, and if two people try to update the same order simultaneously, MVCC ensures they do not corrupt each other's work. SQL lets you ask questions like 'show me all customers who spent more than $1,000 in the last 30 days' without telling the database how to compute it -- the query optimizer figures out the most efficient execution plan.
Instagram runs PostgreSQL at massive scale, storing user profiles, media metadata, relationships, and activity feeds. They shard PostgreSQL across thousands of nodes using application-level sharding by user ID. Their engineering team has contributed significant work to PostgreSQL connection management, using PgBouncer to handle tens of thousands of concurrent connections. Despite operating at billions of rows, Instagram chose PostgreSQL over NoSQL because the transactional guarantees and SQL query flexibility reduced application complexity.
Uber
Uber uses MySQL as its primary relational database, scaled through Vitess -- an open-source sharding middleware originally built at YouTube. Vitess handles connection pooling, query routing, shard management, and online schema migrations across thousands of MySQL instances. Uber migrated from PostgreSQL to MySQL in 2016 due to PostgreSQL's MVCC implementation causing write amplification on their update-heavy workloads, demonstrating that the PostgreSQL-vs-MySQL choice depends heavily on specific access patterns.
Shopify
Shopify operates one of the largest MySQL deployments in the world, powering millions of online stores. They use ProxySQL for connection pooling and query routing, and have built a custom sharding framework (Pods architecture) that assigns each merchant to a specific MySQL shard. Their system processes billions of database queries per day with sub-10ms p99 latency for most operations, proving that MySQL can scale to handle some of the highest-traffic e-commerce platforms.
| Aspect | Description |
|---|---|
| Schema Rigidity vs Data Integrity | A predefined schema with strong typing, constraints, and foreign keys catches data errors at write time and prevents invalid states. However, schema changes (ALTER TABLE) on large tables can be expensive and require careful migration planning. Tools like pg_repack or gh-ost (for MySQL) enable online schema changes, but they add operational complexity. |
| Vertical Scaling vs Horizontal Distribution | Single-node relational databases scale vertically -- bigger CPU, more RAM, faster NVMe. This is simpler to operate but has hardware ceilings. Horizontal sharding (Vitess, Citus) distributes data across nodes but complicates cross-shard queries and distributed transactions. For many workloads, a single well-tuned PostgreSQL instance on modern hardware (128 cores, 1 TB RAM, NVMe) handles more load than teams expect. |
| JOIN Performance at Scale | Normalized schemas require JOINs that can be expensive when tables grow to billions of rows or when data is distributed across shards. Cross-shard JOINs are particularly costly. Strategies include controlled denormalization, materialized views, read replicas for analytics queries, and moving complex aggregations to a data warehouse. |
| Connection Overhead | PostgreSQL's process-per-connection model consumes 5-10 MB per connection. Without connection pooling, a microservices architecture with 50 services each opening 20 connections can exhaust database resources. PgBouncer or application-side pooling is mandatory at scale, adding an infrastructure component to manage and monitor. |
Instagram's PostgreSQL Sharding at Billion-User Scale
Scenario
Instagram needed to scale its PostgreSQL-backed data layer from millions to billions of users while maintaining strong consistency for social features like likes, comments, and follower graphs. A single PostgreSQL instance could not handle the write throughput or storage requirements, and NoSQL alternatives would have required rewriting significant application logic that relied on SQL joins and transactions.
Solution
Instagram implemented application-level sharding by user ID, distributing users across thousands of PostgreSQL instances. Each shard is a fully independent PostgreSQL database with its own schema. Cross-shard operations (like fetching a feed that includes posts from users on different shards) are handled at the application layer by fanning out queries and merging results. They use PgBouncer for connection pooling, logical replication for schema changes, and custom tooling for shard rebalancing.
Outcome
Instagram successfully scaled PostgreSQL to support over 2 billion monthly active users. The relational model's consistency guarantees eliminated entire categories of data corruption bugs that would have required complex application-level resolution in a NoSQL system. Query development velocity remained high because engineers could use standard SQL. The sharding approach added operational complexity but was managed through extensive automation and custom tooling.
See Relational Databases (PostgreSQL, MySQL) in action
Explore system design templates that use relational databases (postgresql, mysql) and run traffic simulations to see how these concepts perform under real load.
Browse Templates1What is the primary purpose of MVCC (Multi-Version Concurrency Control) in PostgreSQL and MySQL InnoDB?
2Why is connection pooling critical for PostgreSQL in production?
3A composite B-tree index on columns (a, b, c) can efficiently support which of the following WHERE clauses?