
Document Stores (MongoDB)

Document stores persist data as schema-flexible JSON-like documents, enabling natural mapping between application objects and database records. MongoDB, the leading document database, combines flexible schemas with rich querying, aggregation pipelines, and horizontal scaling via sharding.

Overview

Document stores represent a fundamentally different data modeling philosophy from relational databases. Instead of distributing data across normalized tables linked by foreign keys, document stores persist complete, self-contained records as JSON-like documents. Each document can have a different structure -- one user document might have an 'address' field while another does not, and nested objects and arrays are first-class citizens. This schema flexibility reduces the impedance mismatch between application objects (which are hierarchical) and database rows (which are flat), cutting the amount of object-relational mapping (ORM) code developers must write and maintain.

MongoDB, the most widely adopted document database, stores documents in BSON (Binary JSON), a binary encoding that supports additional types such as dates, ObjectIds, 128-bit decimals, and binary data. Documents are organized into collections (analogous to tables), and each document has a unique _id field. MongoDB supports rich queries on any field within a document, including nested fields and array elements, using a JSON-based query language. Indexes can be created on any field, including fields within nested objects and arrays (multikey indexes). The aggregation pipeline provides a framework for data transformation and analytics, chaining stages like $match, $group, $lookup (left outer join), $unwind, and $project so that each stage consumes the output of the previous one.
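
As a minimal sketch of how stages chain together, the pipeline below is written as plain Python data, the form in which a driver such as PyMongo would receive it through collection.aggregate(pipeline). The orders documents and their fields (customer_id, status, total) are assumptions made for illustration, and run_locally is a hypothetical pure-Python stand-in that mimics the three stages so the example runs without a server.

```python
from collections import defaultdict

# Hypothetical documents standing in for an "orders" collection.
orders = [
    {"_id": 1, "customer_id": "c1", "status": "shipped", "total": 40.0},
    {"_id": 2, "customer_id": "c2", "status": "shipped", "total": 15.0},
    {"_id": 3, "customer_id": "c1", "status": "pending", "total": 99.0},
    {"_id": 4, "customer_id": "c1", "status": "shipped", "total": 10.0},
]

# The pipeline as a driver would receive it: filter, group, then sort.
pipeline = [
    {"$match": {"status": "shipped"}},
    {"$group": {"_id": "$customer_id", "revenue": {"$sum": "$total"}}},
    {"$sort": {"revenue": -1}},
]

def run_locally(docs):
    """Pure-Python imitation of the three stages above (illustration only)."""
    revenue = defaultdict(float)
    for doc in docs:
        if doc["status"] == "shipped":                   # $match
            revenue[doc["customer_id"]] += doc["total"]  # $group with $sum
    rows = [{"_id": key, "revenue": value} for key, value in revenue.items()]
    return sorted(rows, key=lambda row: row["revenue"], reverse=True)  # $sort

result = run_locally(orders)
print(result)
```

Placing $match first is the usual design choice: it shrinks the input before the more expensive grouping stage runs.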

The critical data modeling decision in document databases is whether to embed related data within a single document or reference it across documents. Embedding (denormalization) stores related data together -- for example, embedding an array of order line items within an order document. This provides atomic reads (one query returns the complete order) and atomic writes (updating the order and its line items in one operation). However, embedding leads to data duplication when the same entity is referenced from multiple documents, and MongoDB caps each document at 16 MB. Referencing (normalization) stores related data in separate collections with ObjectId references, similar to foreign keys. This avoids duplication but requires multiple queries or $lookup aggregation stages to assemble related data, and MongoDB does not enforce referential integrity on references.
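
The two shapes can be compared side by side. This is a sketch with hypothetical field names (items, item_ids, sku, qty, unit_price), using plain dictionaries in place of collections so it runs without a database.

```python
# Embedded model: line items live inside the order document.
embedded_order = {
    "_id": "order-1",
    "customer_id": "cust-7",
    "items": [
        {"sku": "A100", "qty": 2, "unit_price": 5.0},
        {"sku": "B200", "qty": 1, "unit_price": 12.5},
    ],
}

# Referenced model: the order holds only ids; the items live elsewhere.
referenced_order = {"_id": "order-1", "customer_id": "cust-7", "item_ids": ["i1", "i2"]}
order_items = {  # stands in for a separate "order_items" collection keyed by _id
    "i1": {"_id": "i1", "sku": "A100", "qty": 2, "unit_price": 5.0},
    "i2": {"_id": "i2", "sku": "B200", "qty": 1, "unit_price": 12.5},
}

def order_total(items):
    return sum(item["qty"] * item["unit_price"] for item in items)

# Embedded: a single read already contains everything needed.
embedded_total = order_total(embedded_order["items"])

# Referenced: a second lookup step (a follow-up query or $lookup) comes first.
# The database would not flag a dangling id, so the application must cope with it.
resolved = [order_items[i] for i in referenced_order["item_ids"] if i in order_items]
referenced_total = order_total(resolved)

print(embedded_total, referenced_total)
```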

MongoDB's horizontal scaling is achieved through sharding -- distributing data across multiple servers (shards) based on a shard key. Choosing an effective shard key is critical: it should have high cardinality, distribute writes evenly, and enable the query router (mongos) to target queries to a single shard rather than scattering them across all shards. Common shard key patterns include hashed _id (even distribution but scatter-gather for range queries), compound keys like {tenant_id, created_at} (enables targeted queries within a tenant), and zone-based sharding for data locality requirements. Change streams provide a real-time feed of document changes, enabling event-driven architectures, CDC (Change Data Capture) pipelines, and real-time synchronization without polling.
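
The effect of shard key choice on write distribution can be illustrated with a small simulation. This is a simplified model under stated assumptions, not MongoDB's actual routing: four shards, fixed range boundaries, and an MD5-modulo function standing in for hashed sharding (MongoDB assigns ranges of hashed values to chunks rather than taking a modulo).

```python
import hashlib
from bisect import bisect_right
from collections import Counter

NUM_SHARDS = 4
# Upper bounds of the first three key ranges; the last shard owns everything above.
RANGE_BOUNDS = [1000, 2000, 3000]

def ranged_shard(key: int) -> int:
    """Range-based placement: the shard is chosen by where the key falls."""
    return bisect_right(RANGE_BOUNDS, key)

def hashed_shard(key: int) -> int:
    """Hash-based placement (simplified): hash the key, then pick a shard."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# New inserts with a monotonically increasing key, e.g. a counter or timestamp.
new_keys = range(3000, 4000)

ranged_counts = Counter(ranged_shard(k) for k in new_keys)
hashed_counts = Counter(hashed_shard(k) for k in new_keys)

print("ranged:", dict(ranged_counts))  # every insert lands on the last shard
print("hashed:", dict(hashed_counts))  # inserts are spread across all shards
```

The same hashing that spreads writes also scatters adjacent keys, which is why a range query on a hashed key has to visit every shard.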

Key Points
  • Schema flexibility allows documents in the same collection to have different fields. This eliminates ALTER TABLE migrations for schema changes and maps naturally to polymorphic application objects. However, schema validation rules should be used to enforce minimum data quality.
  • Embedding vs referencing is the core data modeling decision. Embed when data is read together and the embedded array is bounded (e.g., order line items). Reference when data is shared across documents or the related collection is unbounded (e.g., comments on a post).
  • The aggregation pipeline provides SQL-like analytical capabilities: $group for GROUP BY, $match for WHERE, $lookup for a left outer JOIN, $sort, $limit, and $facet for parallel pipeline branches. Complex analytics queries are expressed as a sequence of pipeline stages.
  • MongoDB provides multi-document ACID transactions across collections (since version 4.0 on replica sets, and since version 4.2 on sharded clusters). However, transactions add performance overhead, and the document model's embedding pattern often eliminates the need for multi-document transactions by keeping related data in a single document.
  • Shard key selection determines query performance and write distribution. A shard key with low cardinality creates jumbo chunks that cannot be split. A monotonically increasing shard key (like an ObjectId or a timestamp) concentrates all new writes on the shard that owns the highest key range. Hashed shard keys distribute writes evenly but force scatter-gather on range queries.
  • Change streams provide an oplog-backed real-time feed of insert, update, replace, and delete events. They enable reactive application patterns, real-time analytics, and CDC pipelines to downstream systems like Elasticsearch or data warehouses.
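
To make the change stream point concrete, here is a sketch of a CDC consumer that keeps a downstream copy in sync. The event fields used (operationType, documentKey, fullDocument, updateDescription) follow MongoDB's change event format; the sample events and the in-memory index are invented for illustration. With PyMongo the events would typically come from collection.watch(), and this sketch assumes updates touch top-level fields only (nested updates arrive as dotted paths, which it does not handle).

```python
def apply_change(index: dict, event: dict) -> None:
    """Apply one change event to a downstream copy keyed by _id."""
    key = event["documentKey"]["_id"]
    op = event["operationType"]
    if op in ("insert", "replace"):
        index[key] = dict(event["fullDocument"])
    elif op == "update":
        doc = index.setdefault(key, {"_id": key})
        description = event["updateDescription"]
        doc.update(description.get("updatedFields", {}))
        for field in description.get("removedFields", []):
            doc.pop(field, None)
    elif op == "delete":
        index.pop(key, None)

# Hypothetical events in the order a change stream would deliver them.
events = [
    {"operationType": "insert", "documentKey": {"_id": 1},
     "fullDocument": {"_id": 1, "name": "Ada", "tier": "free"}},
    {"operationType": "update", "documentKey": {"_id": 1},
     "updateDescription": {"updatedFields": {"tier": "pro"}, "removedFields": []}},
    {"operationType": "insert", "documentKey": {"_id": 2},
     "fullDocument": {"_id": 2, "name": "Bob", "tier": "free", "trial": True}},
    {"operationType": "update", "documentKey": {"_id": 2},
     "updateDescription": {"updatedFields": {}, "removedFields": ["trial"]}},
    {"operationType": "insert", "documentKey": {"_id": 3},
     "fullDocument": {"_id": 3, "name": "Cy", "tier": "free"}},
    {"operationType": "delete", "documentKey": {"_id": 3}},
]

search_index = {}
for event in events:
    apply_change(search_index, event)

print(search_index)
```
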
Simple Example

The Filing Cabinet Analogy

Think of a document store as a filing cabinet where each folder (document) contains all the information about one subject. In a relational database, information about a customer would be spread across multiple drawers -- one drawer for customer details, another for their orders, another for their addresses. To get the full picture, you would need to open multiple drawers and match customer IDs. In a document store, each customer folder contains everything: their name, email, all their addresses, and all their recent orders, nested right inside the folder. Opening one folder gives you the complete customer view instantly. The trade-off is that if the same address appears in multiple customer folders, changing it requires updating every copy.

Real-World Examples

Coinbase

Coinbase uses MongoDB Atlas for storing cryptocurrency transaction records and user portfolio data. Each user's portfolio is modeled as a document containing embedded holdings across different cryptocurrencies, enabling single-query portfolio reads. The schema flexibility allowed Coinbase to add new cryptocurrency assets without schema migrations -- each new asset type adds new fields to portfolio documents without affecting existing ones. Change streams power real-time price update notifications to users.

eBay

eBay uses MongoDB for its product catalog, which contains hundreds of millions of listings with highly variable attributes. A smartphone listing has completely different fields (screen size, battery capacity, carrier) than a dress listing (size, material, color). The document model's schema flexibility handles this variation naturally -- each listing document contains only the attributes relevant to its product category. A relational approach would require either hundreds of sparse columns or an EAV (Entity-Attribute-Value) pattern, both of which perform poorly.

Adobe

Adobe's Experience Platform uses MongoDB to store user experience profiles -- rich, hierarchical documents representing a user's interactions across Adobe products. Each profile document embeds behavioral data, preferences, and segmentation attributes. The aggregation pipeline powers real-time audience segmentation, grouping users by behavior patterns across millions of profiles. MongoDB's ability to index nested fields and array elements enables millisecond-latency lookups on deeply nested profile attributes.

Trade-Offs
Aspect | Description
Schema Flexibility vs Data Consistency | Schema-free documents allow rapid iteration and polymorphic data, but without validation rules, data quality degrades over time. Old application versions may write documents with missing or differently-typed fields. MongoDB's schema validation ($jsonSchema) partially addresses this, but enforcement is weaker than relational constraints and cannot be applied retroactively to existing documents.
Embedding vs Normalization | Embedding provides atomic reads and writes within a single document, but leads to data duplication (update anomalies) and is limited by the 16 MB document size. Normalization avoids duplication but requires multiple queries or $lookup stages (which are slower than SQL JOINs in many cases) and lacks referential integrity enforcement.
Query Flexibility vs Predictability | MongoDB supports rich queries on any field, but query performance depends entirely on index coverage. Without proper indexes, queries fall back to collection scans on potentially billions of documents. The query planner is less mature than PostgreSQL's or MySQL's, and execution plans can be less predictable, requiring careful explain() analysis.
Horizontal Scaling vs Operational Complexity | MongoDB's sharding enables horizontal write scaling, but introduces significant operational complexity: choosing and potentially changing shard keys, managing mongos routers, handling chunk balancing, and dealing with scatter-gather queries when the shard key is not in the query filter. For datasets under 1 TB, a single replica set is often simpler and sufficient.
Case Study

eBay Product Catalog -- Schema-Flexible Storage for Variable Attributes

Scenario

eBay's product catalog contains hundreds of millions of active listings spanning thousands of product categories, each with unique attributes. A relational schema with a fixed set of columns could not accommodate the attribute variation -- a laptop listing needs 'CPU speed' and 'RAM size' while a pair of shoes needs 'shoe size' and 'material.' The EAV (Entity-Attribute-Value) pattern used previously required expensive JOINs across millions of attribute rows to reconstruct a single listing, causing unacceptable query latency.

Solution

eBay adopted MongoDB to store each listing as a self-contained document with category-specific attributes embedded directly in the document. A laptop listing document contains {cpu: '2.4 GHz', ram: '16 GB'} while a shoe listing contains {size: '10', material: 'leather'} -- no wasted sparse columns and no multi-table JOINs. MongoDB's flexible indexing (wildcard indexes on attribute fields) enables efficient queries like 'find all laptops with ram >= 16 GB' without knowing the full attribute schema at index creation time.
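
The listing shape and the attribute query can be sketched as follows. Everything here is hypothetical rather than eBay's actual schema: the attributes sub-document, the field names cpu_ghz and ram_gb, and the choice to store values as numbers (a string such as '16 GB' would compare lexicographically under $gte). The key pattern {"attributes.$**": 1} is the form MongoDB uses for a wildcard index on a sub-document, and matches is a tiny local evaluator supporting only equality and $gte so the example runs without a server.

```python
# Hypothetical listings: each carries only the attributes of its own category.
listings = [
    {"_id": 1, "category": "laptop", "attributes": {"cpu_ghz": 2.4, "ram_gb": 16}},
    {"_id": 2, "category": "laptop", "attributes": {"cpu_ghz": 3.1, "ram_gb": 8}},
    {"_id": 3, "category": "shoes", "attributes": {"size": 10, "material": "leather"}},
]

# Key pattern for a wildcard index over every field under "attributes".
wildcard_index_keys = {"attributes.$**": 1}

# Query filter: laptops with at least 16 GB of RAM.
query = {"category": "laptop", "attributes.ram_gb": {"$gte": 16}}

def get_path(doc, dotted):
    """Resolve a dotted path such as 'attributes.ram_gb'; None if absent."""
    for part in dotted.split("."):
        if not isinstance(doc, dict) or part not in doc:
            return None
        doc = doc[part]
    return doc

def matches(doc, flt):
    """Minimal filter evaluator: equality and $gte only (illustration)."""
    for path, condition in flt.items():
        value = get_path(doc, path)
        if isinstance(condition, dict):
            if "$gte" in condition and (value is None or value < condition["$gte"]):
                return False
        elif value != condition:
            return False
    return True

hits = [doc["_id"] for doc in listings if matches(doc, query)]
print(hits)
```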

Outcome

Catalog query latency dropped from hundreds of milliseconds (EAV joins) to single-digit milliseconds (document reads). Adding new product categories with unique attributes became a backend configuration change rather than a database migration. The document model's natural fit for hierarchical product data eliminated thousands of lines of ORM mapping code. MongoDB's sharded cluster scaled horizontally to handle eBay's read-heavy catalog workload across billions of queries per day.

Common Mistakes
  • Treating MongoDB as a schemaless free-for-all. Without schema validation rules, documents accumulate inconsistent field names, missing required fields, and incompatible data types over time. Always define $jsonSchema validators on production collections to enforce minimum data quality.
  • Embedding unbounded arrays in documents. Embedding a user's entire comment history or all events in a session document can exceed the 16 MB document size limit and cause increasingly slow updates as the array grows. Use the bucket pattern (splitting the array across several documents, each holding a bounded number of entries) or reference separate documents for unbounded relationships.
  • Using MongoDB for highly relational data that requires frequent multi-collection JOINs. If your queries consistently require $lookup across 3-4 collections, a relational database with proper indexes will likely perform better and be simpler to reason about.
  • Ignoring index coverage for production queries. When no suitable index exists, MongoDB silently falls back to a collection scan: the query still succeeds, only more slowly. Use explain() to verify that all production queries use appropriate indexes, and enable the notablescan server parameter (notablescan=1) in development to catch missing indexes early.
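
For the first mistake above, a validator document might look like the sketch below. The $jsonSchema keywords used (bsonType, required, properties, minimum) are real, while the field names and the violations helper are hypothetical; the helper is only a rough local stand-in for the server-side check and covers just the keywords shown. With a driver, a validator like this is typically supplied when the collection is created or later through the collMod command.

```python
# Validator document in the shape MongoDB expects for collection validation.
# Field names (email, created_at, age) are hypothetical.
validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["email", "created_at"],
        "properties": {
            "email": {"bsonType": "string"},
            "age": {"bsonType": "int", "minimum": 0},
        },
    }
}

# Local mapping for the two BSON types used above (illustration only).
_PYTHON_TYPES = {"string": str, "int": int}

def violations(doc, schema):
    """Return human-readable problems; an empty list means the doc passes."""
    problems = [f"missing {name}" for name in schema.get("required", []) if name not in doc]
    for name, rule in schema.get("properties", {}).items():
        if name not in doc:
            continue
        value = doc[name]
        expected = _PYTHON_TYPES.get(rule.get("bsonType"))
        if expected is not None and type(value) is not expected:
            problems.append(f"{name} is not {rule['bsonType']}")
        elif "minimum" in rule and value < rule["minimum"]:
            problems.append(f"{name} is below {rule['minimum']}")
    return problems

schema = validator["$jsonSchema"]
ok = violations({"email": "a@example.com", "created_at": "2024-01-01", "age": 30}, schema)
bad = violations({"email": 42, "age": -1}, schema)
print(ok)
print(bad)
```
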
Test Your Understanding

1. What is the main advantage of embedding related data within a MongoDB document?

2. Why is a monotonically increasing field (like a timestamp or auto-increment ID) a poor shard key in MongoDB?
