Database

Storage

Persistent data store supporting SQL or NoSQL models with ACID transactions, replication, sharding, and query optimization.

Overview

A Database is the persistent storage backbone of any distributed system, responsible for durably storing, indexing, and retrieving data. In Vetora's simulator, the Database component models read/write latency, connection pooling, replication lag, and the performance characteristics of different database paradigms. Choosing the right database — and configuring it properly — is one of the most consequential decisions in system design.

SQL databases (relational) organize data in tables with predefined schemas, enforce relationships through foreign keys, and support ACID transactions (Atomicity, Consistency, Isolation, Durability). PostgreSQL, MySQL, and SQL Server are the workhorses of this category. ACID guarantees are essential for financial transactions, inventory management, and any domain where data integrity is non-negotiable. The tradeoff is that strict ACID enforcement limits horizontal scalability — coordinating transactions across shards introduces distributed coordination overhead.

NoSQL databases trade schema rigidity and ACID guarantees for horizontal scalability and flexible data models. Document databases (MongoDB, Couchbase) store JSON-like documents with dynamic schemas. Key-value stores (DynamoDB, Redis) provide the simplest and fastest access pattern — lookup by key. Wide-column stores (Cassandra, HBase) handle massive write throughput across thousands of nodes. Graph databases (Neo4j, Amazon Neptune) excel at traversing relationship-heavy data. Each paradigm is optimized for specific access patterns; choosing the wrong one creates performance problems that no amount of optimization can fix.

Sharding (horizontal partitioning) distributes data across multiple database instances based on a shard key. Range-based sharding assigns key ranges to shards (user IDs 1–1M on shard 1, 1M–2M on shard 2). Hash-based sharding applies a hash function to the shard key and maps to shards, providing more even distribution. The shard key choice is critical — it must distribute data evenly, support your primary query patterns, and minimize cross-shard queries. Re-sharding (adding shards to an existing cluster) is one of the most complex operations in distributed systems.

Replication creates copies of data across multiple database instances for availability and read scaling. Synchronous replication waits for all replicas to confirm writes before acknowledging success — strong consistency but higher write latency. Asynchronous replication acknowledges writes after the primary commits, replicating to followers in the background — lower latency but introduces replication lag where followers may serve stale reads. Vetora models replication lag and shows how eventual consistency affects end-user experience.

Connection pooling is an often-overlooked but critical aspect of database performance. Each database connection consumes server memory (5–10MB in PostgreSQL) and file descriptors. Without pooling, a microservices architecture with 20 services, each opening connections on demand, can exhaust the database's connection limit. Connection pool managers like PgBouncer maintain a pool of reusable connections, multiplexing application requests across a smaller set of database connections.

When to Use

+Persistent storage of structured or semi-structured data that must survive process and server restarts
+Transactional operations requiring ACID guarantees — financial records, inventory, user accounts
+Complex queries involving joins, aggregations, and filtering across multiple dimensions
+Any system that needs durable, queryable data — virtually every production application uses at least one database

Not Recommended

-Hot data that is read far more than written — put a cache layer in front rather than querying the database for every read
-Large binary objects (images, videos, documents) — use Object Storage with database metadata pointers
-Ephemeral data like session tokens or rate limiter counters — in-memory stores like Redis are more appropriate

Key Parameters in Vetora

Parameter	Description	Typical Values
readLatencyMs	Time to execute a single-key read query. Affected by indexing, data volume, and storage type (SSD vs HDD).	1–10ms for indexed reads, 50–500ms for unindexed scans
writeLatencyMs	Time to durably commit a write operation. Includes WAL (write-ahead log) flush and any synchronous replication.	2–20ms for single writes, 10–100ms with synchronous replication
connectionPoolSize	Maximum concurrent connections from application services. Must be balanced against database server memory and CPU.	20–100 connections per service instance
replicationLagMs	Delay between a write on the primary and its availability on read replicas. Zero for synchronous replication.	10–1,000ms for async replication

Real-World Examples

PostgreSQL

Advanced open-source relational database with MVCC concurrency, JSONB support, full-text search, and PostGIS for geographic data. Used by Apple, Instagram, Spotify, and countless production systems.

Amazon DynamoDB

Fully managed NoSQL key-value and document database with single-digit millisecond performance at any scale. Supports auto-scaling, global tables for multi-region replication, and on-demand capacity mode.

Apache Cassandra

Wide-column NoSQL database designed for massive write throughput across hundreds of nodes with no single point of failure. Used by Apple (400,000+ nodes), Netflix, and Discord for time-series and event data.

Frequently Asked Questions

How do you choose between SQL and NoSQL databases?

Choose SQL (PostgreSQL, MySQL) when you need ACID transactions, complex joins and aggregations, strong consistency, and a well-defined schema. Choose NoSQL when you need horizontal scalability beyond what a single SQL server can handle, flexible schemas that evolve frequently, specific access patterns (key-value lookup, document retrieval, time-series writes), or multi-region deployment with tunable consistency. Many production systems use both — SQL for transactional data (orders, accounts) and NoSQL for high-volume data (events, metrics, session state).

What is database sharding and how does it work?

Sharding partitions data across multiple database instances (shards) based on a shard key. Hash-based sharding applies a hash function to the key for even distribution. Range-based sharding assigns key ranges to shards for efficient range queries. The shard key must distribute data evenly and support primary query patterns. Cross-shard queries (joining data across shards) are expensive and should be minimized. Adding shards (re-sharding) requires migrating data and is one of the most complex distributed systems operations.

What is replication lag and how does it affect system design?

Replication lag is the delay between a write on the primary database and its availability on read replicas. With asynchronous replication (typical for read scaling), lag ranges from 10ms to several seconds. This means a user who writes data might not see it immediately if the next read hits a replica. Solutions include read-your-writes consistency (route reads to primary for the writing user), causal consistency tokens, and synchronous replication for critical paths (at the cost of higher write latency).

Why is database connection pooling important?

Each database connection consumes 5–10MB of server memory in PostgreSQL and similar databases. Without pooling, a microservices architecture with 20 services × 10 instances × 10 connections = 2,000 connections, consuming 10–20GB of database memory for connections alone. Connection poolers like PgBouncer maintain a small pool of actual database connections (e.g., 100) and multiplex thousands of application connections through them. This reduces memory usage, speeds up connection acquisition, and prevents connection exhaustion during traffic spikes.

What are the most important database indexing strategies?

B-tree indexes are the default for most databases, supporting equality and range queries efficiently. Hash indexes provide O(1) lookups for exact matches but do not support range queries. Composite indexes cover queries on multiple columns — column order matters (leftmost prefix rule). Partial indexes only index rows matching a condition, reducing index size and maintenance cost. Covering indexes include all queried columns, enabling index-only scans that skip the table entirely. In Vetora's simulator, you can model the impact of indexing on read latency vs. write overhead.

Related Components

CacheStorage

In-memory data store that accelerates reads by serving frequently accessed data without querying the...

ServiceCompute

Application server or microservice that processes requests, runs business logic, and communicates wi...

Object StorageStorage

Scalable blob storage for unstructured data like images, videos, documents, and backups with lifecyc...

Event StreamStorage

Durable message streaming platform for pub/sub, event sourcing, and asynchronous communication betwe...

Try Database in the Simulator

Build architectures with Database and 13 other component types. Run discrete event simulations and get AI-powered feedback.

Open Playground