Vetora logo
🏛️Database Families

Relational Databases (PostgreSQL, MySQL)

Relational databases store data in structured tables with rows and columns, enforce ACID transactions, and use SQL for querying. With decades of maturity, battle-tested tooling, and strong consistency guarantees, they remain the default choice for most application workloads.

Overview

Relational databases are the most widely deployed data storage systems in the world, underpinning everything from small web applications to some of the largest internet platforms on the planet. The relational model, formalized by Edgar F. Codd in 1970, organizes data into relations (tables) composed of tuples (rows) and attributes (columns). Each table has a defined schema, and relationships between tables are expressed through foreign keys. This structure enables powerful declarative querying via SQL, where the developer describes what data they want and the query optimizer decides how to retrieve it.

The defining strength of relational databases is their transactional guarantees -- ACID (Atomicity, Consistency, Isolation, Durability). Atomicity ensures that multi-statement transactions either fully commit or fully roll back. Consistency enforces schema constraints, foreign key relationships, and CHECK constraints. Isolation prevents concurrent transactions from interfering with each other, implemented through Multi-Version Concurrency Control (MVCC) in both PostgreSQL and MySQL's InnoDB engine. Durability guarantees that committed transactions survive crashes via Write-Ahead Logging (WAL). These guarantees make relational databases the default choice for any workload where data correctness is non-negotiable: financial transactions, inventory management, user accounts, and order processing.

Performance in relational databases depends heavily on indexing and query optimization. B-tree indexes are the workhorse data structure, providing O(log n) lookups, range scans, and ordered traversal. PostgreSQL also supports GIN indexes for full-text search and JSONB, GiST indexes for geometric and geographic data, and BRIN indexes for large sequential datasets. The query planner analyzes table statistics to choose between sequential scans, index scans, hash joins, merge joins, and nested loops -- often making decisions that outperform hand-optimized queries. Connection pooling (via PgBouncer for PostgreSQL or ProxySQL for MySQL) is critical at scale because each database connection consumes memory for process state, and most applications need far fewer concurrent connections than concurrent requests.

PostgreSQL and MySQL represent two philosophies within the relational world. PostgreSQL prioritizes correctness, extensibility, and advanced features -- it supports custom types, window functions, recursive CTEs, partial indexes, materialized views, and JSONB for semi-structured data. MySQL prioritizes simplicity, read performance, and operational ease -- its simpler replication model (async binary log replication) made it the backbone of the early web era (LAMP stack). Both databases have converged significantly: MySQL 8.0 added CTEs, window functions, and improved JSON support, while PostgreSQL's logical replication has become more flexible. The choice between them is increasingly one of ecosystem and organizational expertise rather than fundamental capability.

Key Points
  • 1ACID guarantees make relational databases the safest choice for transactional workloads. Atomicity prevents partial writes, isolation (via MVCC) handles concurrent access without read locks, and WAL-based durability ensures committed data survives crashes.
  • 2B-tree indexes provide O(log n) point lookups and efficient range scans, making them ideal for primary key lookups, foreign key joins, and ordered queries. Composite indexes on (a, b, c) support queries filtering on any left prefix of the key columns.
  • 3Normalized schema design (3NF) minimizes data duplication and update anomalies but requires JOIN operations that can be expensive at scale. Controlled denormalization trades storage and write complexity for faster reads, and is a common optimization for hot query paths.
  • 4Connection pooling is essential at scale because each PostgreSQL backend process consumes 5-10 MB of memory. PgBouncer in transaction mode can multiplex thousands of application connections onto a pool of 50-200 database connections.
  • 5PostgreSQL offers extensibility that blurs the line between relational and NoSQL: JSONB columns with GIN indexes, PostGIS for geospatial queries, pg_trgm for trigram-based fuzzy search, and pgvector for embedding similarity search -- all within a single transactional database.
  • 6MySQL with Vitess or ProxySQL enables horizontal read scaling through sharding and read replicas. MySQL's simpler binary-log replication makes it operationally easier to set up leader-follower topologies compared to PostgreSQL's WAL-based streaming replication.
Simple Example

The Spreadsheet with Superpowers

Think of a relational database as a collection of linked spreadsheets. One spreadsheet lists customers (id, name, email), another lists orders (id, customer_id, total, date). The customer_id column in the orders sheet is a foreign key that links back to the customers sheet. Unlike a regular spreadsheet, the database enforces rules: you cannot create an order for a customer_id that does not exist (referential integrity), you cannot insert a negative total if a CHECK constraint forbids it, and if two people try to update the same order simultaneously, MVCC ensures they do not corrupt each other's work. SQL lets you ask questions like 'show me all customers who spent more than $1,000 in the last 30 days' without telling the database how to compute it -- the query optimizer figures out the most efficient execution plan.

Real-World Examples

Instagram

Instagram runs PostgreSQL at massive scale, storing user profiles, media metadata, relationships, and activity feeds. They shard PostgreSQL across thousands of nodes using application-level sharding by user ID. Their engineering team has contributed significant work to PostgreSQL connection management, using PgBouncer to handle tens of thousands of concurrent connections. Despite operating at billions of rows, Instagram chose PostgreSQL over NoSQL because the transactional guarantees and SQL query flexibility reduced application complexity.

Uber

Uber uses MySQL as its primary relational database, scaled through Vitess -- an open-source sharding middleware originally built at YouTube. Vitess handles connection pooling, query routing, shard management, and online schema migrations across thousands of MySQL instances. Uber migrated from PostgreSQL to MySQL in 2016 due to PostgreSQL's MVCC implementation causing write amplification on their update-heavy workloads, demonstrating that the PostgreSQL-vs-MySQL choice depends heavily on specific access patterns.

Shopify

Shopify operates one of the largest MySQL deployments in the world, powering millions of online stores. They use ProxySQL for connection pooling and query routing, and have built a custom sharding framework (Pods architecture) that assigns each merchant to a specific MySQL shard. Their system processes billions of database queries per day with sub-10ms p99 latency for most operations, proving that MySQL can scale to handle some of the highest-traffic e-commerce platforms.

Trade-Offs
AspectDescription
Schema Rigidity vs Data IntegrityA predefined schema with strong typing, constraints, and foreign keys catches data errors at write time and prevents invalid states. However, schema changes (ALTER TABLE) on large tables can be expensive and require careful migration planning. Tools like pg_repack or gh-ost (for MySQL) enable online schema changes, but they add operational complexity.
Vertical Scaling vs Horizontal DistributionSingle-node relational databases scale vertically -- bigger CPU, more RAM, faster NVMe. This is simpler to operate but has hardware ceilings. Horizontal sharding (Vitess, Citus) distributes data across nodes but complicates cross-shard queries and distributed transactions. For many workloads, a single well-tuned PostgreSQL instance on modern hardware (128 cores, 1 TB RAM, NVMe) handles more load than teams expect.
JOIN Performance at ScaleNormalized schemas require JOINs that can be expensive when tables grow to billions of rows or when data is distributed across shards. Cross-shard JOINs are particularly costly. Strategies include controlled denormalization, materialized views, read replicas for analytics queries, and moving complex aggregations to a data warehouse.
Connection OverheadPostgreSQL's process-per-connection model consumes 5-10 MB per connection. Without connection pooling, a microservices architecture with 50 services each opening 20 connections can exhaust database resources. PgBouncer or application-side pooling is mandatory at scale, adding an infrastructure component to manage and monitor.
Case Study

Instagram's PostgreSQL Sharding at Billion-User Scale

Scenario

Instagram needed to scale its PostgreSQL-backed data layer from millions to billions of users while maintaining strong consistency for social features like likes, comments, and follower graphs. A single PostgreSQL instance could not handle the write throughput or storage requirements, and NoSQL alternatives would have required rewriting significant application logic that relied on SQL joins and transactions.

Solution

Instagram implemented application-level sharding by user ID, distributing users across thousands of PostgreSQL instances. Each shard is a fully independent PostgreSQL database with its own schema. Cross-shard operations (like fetching a feed that includes posts from users on different shards) are handled at the application layer by fanning out queries and merging results. They use PgBouncer for connection pooling, logical replication for schema changes, and custom tooling for shard rebalancing.

Outcome

Instagram successfully scaled PostgreSQL to support over 2 billion monthly active users. The relational model's consistency guarantees eliminated entire categories of data corruption bugs that would have required complex application-level resolution in a NoSQL system. Query development velocity remained high because engineers could use standard SQL. The sharding approach added operational complexity but was managed through extensive automation and custom tooling.

Common Mistakes
  • Not using connection pooling. Opening a new database connection for every request adds 3-10ms of overhead per query and consumes significant server memory. PgBouncer or ProxySQL should be deployed from day one in any production setup.
  • Over-normalizing schemas to satisfy theoretical ideals. While 3NF reduces data duplication, excessive normalization leads to complex multi-table JOINs on every read. Pragmatic denormalization of hot read paths (like embedding a user's display name in a posts table) is a standard optimization.
  • Missing indexes on foreign key columns. Without an index on a foreign key column, DELETE and UPDATE operations on the referenced table require full table scans to check for referencing rows. This causes unexpected slowdowns as tables grow.
  • Using SELECT * in production queries. Fetching all columns when you only need two wastes network bandwidth, prevents covering index optimizations, and makes queries fragile when schema changes add new columns. Always specify the exact columns needed.
Related Concepts

See Relational Databases (PostgreSQL, MySQL) in action

Explore system design templates that use relational databases (postgresql, mysql) and run traffic simulations to see how these concepts perform under real load.

Browse Templates

See how relational databases handle concurrent transactions under load

Metrics to watch
query_latency_p99connection_pool_utilizationtransactions_per_second
Run Simulation
Test Your Understanding

1What is the primary purpose of MVCC (Multi-Version Concurrency Control) in PostgreSQL and MySQL InnoDB?

2Why is connection pooling critical for PostgreSQL in production?

3A composite B-tree index on columns (a, b, c) can efficiently support which of the following WHERE clauses?

Deeper Reading