What is important about B-Tree and B+ Tree regarding "B-tree nodes are sized to match disk pages (4KB-16KB), so ea..."?

B-tree nodes are sized to match disk pages (4KB-16KB), so each node read or write is a single I/O operation. A fanout of 100-500 means a tree indexing 1 billion keys is only 3-4 levels deep, requiring just 3-4 page reads per lookup.

What is important about B-Tree and B+ Tree regarding "All leaf nodes in a B-tree are at the same depth (the tree i..."?

All leaf nodes in a B-tree are at the same depth (the tree is perfectly balanced), guaranteeing consistent O(log n) performance regardless of data distribution, insertion order, or skew.

What is important about B-Tree and B+ Tree regarding "Write amplification in B-trees comes from rewriting entire p..."?

Write amplification in B-trees comes from rewriting entire pages for single-key modifications. Worst case during a split cascade is one page write per tree level. Buffer pools and WAL mitigate this by caching hot pages and batching writes.

What is important about B-Tree and B+ Tree regarding "Clustered indexes (InnoDB) store the actual row data in the ..."?

Clustered indexes (InnoDB) store the actual row data in the B+ tree leaf nodes, making primary key lookups single-tree traversals. Secondary indexes store primary key values in their leaves, requiring a second lookup into the clustered index.

What is important about B-Tree and B+ Tree regarding "B-trees support concurrent access through latch couplings (c..."?

B-trees support concurrent access through latch couplings (crabbing): acquire a latch on the child node before releasing the parent's latch. Modern optimistic latch coupling avoids holding parent latches during most reads, improving concurrency.

Vetora

🌳Storage Engines

B-Tree and B+ Tree

B-trees and B+ trees are self-balancing tree data structures that maintain sorted data and allow O(log n) lookups, insertions, and deletions. They are the default index structure in most relational databases because their page-oriented design aligns perfectly with how disks and SSDs read and write data in fixed-size blocks.

Overview

The B-tree, invented by Rudolf Bayer and Edward McCreight at Boeing Research Labs in 1972, is a self-balancing tree data structure designed specifically for systems that read and write large blocks of data. Unlike binary search trees where each node holds a single key, a B-tree node holds dozens to hundreds of keys sorted in order, with child pointers between them. This high fanout -- typically 100 to 500 children per node -- means that even billions of keys can be indexed in a tree just 3 or 4 levels deep. Every lookup, insertion, or deletion touches at most one node per level, yielding O(log n) performance with a very small constant factor in practice.

The page-oriented design of B-trees is what makes them the natural choice for databases. Operating systems and storage devices transfer data in fixed-size pages (commonly 4KB, 8KB, or 16KB). A B-tree node is sized to match exactly one page, so each node read or write corresponds to a single I/O operation. When a node overflows during insertion (exceeds the maximum number of keys), it splits into two nodes and propagates a median key up to the parent. When a node underflows during deletion, it merges with a sibling. These operations maintain the tree's balance guarantee: all leaf nodes are always at the same depth, ensuring consistent performance regardless of data distribution or insertion order.

The B+ tree variant, used by nearly every production database including PostgreSQL, MySQL InnoDB, and SQLite, refines the original B-tree design in two critical ways. First, data records (or pointers to data records) are stored only in leaf nodes; internal nodes contain only keys and child pointers. This maximizes the fanout of internal nodes because they do not waste space on data payloads, reducing tree height and therefore the number of I/O operations per lookup. Second, leaf nodes are linked together in a doubly-linked list, enabling efficient range scans -- once you find the starting key, you simply follow the leaf chain without revisiting internal nodes. This makes B+ trees excellent for queries with ORDER BY, BETWEEN, and inequality predicates.

Write amplification is the primary cost of B-tree-based storage. Every modification to a key potentially requires rewriting the entire page containing that key, plus any parent pages that need updating during splits. In the worst case, an insertion that triggers splits all the way to the root requires rewriting one page per tree level -- typically 3-4 pages. Database engines mitigate this through buffer pools (caching hot pages in memory), write-ahead logging (batching modifications), and delayed page splits. Despite this write overhead, B-trees remain the dominant index structure for read-heavy and mixed workloads because their read performance -- a single root-to-leaf traversal -- is extremely difficult to beat.

Key Points

1B-tree nodes are sized to match disk pages (4KB-16KB), so each node read or write is a single I/O operation. A fanout of 100-500 means a tree indexing 1 billion keys is only 3-4 levels deep, requiring just 3-4 page reads per lookup.
2B+ trees store data only in leaf nodes, maximizing internal node fanout. Leaf nodes are linked together in a doubly-linked list, enabling efficient sequential range scans without revisiting internal nodes.
3All leaf nodes in a B-tree are at the same depth (the tree is perfectly balanced), guaranteeing consistent O(log n) performance regardless of data distribution, insertion order, or skew.
4Write amplification in B-trees comes from rewriting entire pages for single-key modifications. Worst case during a split cascade is one page write per tree level. Buffer pools and WAL mitigate this by caching hot pages and batching writes.
5Clustered indexes (InnoDB) store the actual row data in the B+ tree leaf nodes, making primary key lookups single-tree traversals. Secondary indexes store primary key values in their leaves, requiring a second lookup into the clustered index.
6B-trees support concurrent access through latch couplings (crabbing): acquire a latch on the child node before releasing the parent's latch. Modern optimistic latch coupling avoids holding parent latches during most reads, improving concurrency.

Simple Example

The Library Card Catalog

Imagine a library with 10 million books. The card catalog has a master index with 26 drawers labeled A-Z. Inside each drawer, divider cards split the space into finer ranges (Aa-Am, An-Az, etc.). Behind each divider are cards pointing to even more specific sub-ranges, until you reach the individual book cards at the bottom level. To find any book, you open at most 3-4 drawers/dividers -- you never scan all 10 million cards. This is exactly how a B+ tree works: internal nodes are the dividers (they only contain keys for navigation), leaf nodes are the actual book cards (they contain the data), and the leaves are linked together so you can flip through them in order for browsing by author or title.

Real-World Examples

PostgreSQL

PostgreSQL uses B-tree as its default index type (CREATE INDEX uses B-tree unless otherwise specified). PostgreSQL's B-tree implementation supports deduplication (storing duplicate keys once with a posting list of TIDs), bottom-up deletion for reducing page splits on monotonically increasing keys (like timestamps and serial IDs), and parallel index scans that divide the leaf page chain among multiple workers for faster sequential scans of large indexes.

MySQL InnoDB

InnoDB uses a clustered B+ tree where the primary key index stores the actual row data in its leaf nodes. Every InnoDB table is a B+ tree organized by primary key. Secondary indexes store the primary key value (not a row pointer) in their leaf nodes, requiring a 'bookmark lookup' back to the clustered index. This design means primary key range scans are extremely fast (data is physically ordered), but secondary index lookups require two tree traversals.

SQLite

SQLite uses B-trees for all tables and indexes. Table B-trees (using rowid as the key) store complete row data in leaf nodes. Index B-trees store indexed columns plus the rowid. SQLite's B-tree implementation fits entirely in a single file with a page size configurable from 512 bytes to 65536 bytes (default 4096). The simplicity and reliability of SQLite's B-tree engine make it the most widely deployed database engine in the world, running on billions of devices.

Trade-Offs

Aspect	Description
Read Performance vs Write Amplification	B-trees provide excellent read performance (single root-to-leaf traversal) but suffer from write amplification because each modification rewrites an entire page. For write-heavy workloads, LSM trees can be 10-100x more efficient on writes because they convert random writes into sequential appends, though at the cost of more complex reads.
Space Efficiency vs Performance	B-tree pages are typically 50-70% full after random insertions due to page splits. This internal fragmentation wastes 30-50% of allocated space. Fillfactor settings can control initial page density, but lower fillfactors waste more space while reducing future split frequency. Periodic REINDEX operations can reclaim space but require locking.
Concurrency vs Complexity	B-trees support fine-grained concurrency through latch coupling and lock-free techniques, but the implementation complexity is significant. Page splits require coordination across multiple levels. Modern approaches like Bw-trees (used in SQL Server Hekaton) use compare-and-swap operations to avoid latches entirely, but add complexity through delta chains and garbage collection.
Sequential vs Random I/O	B-tree modifications produce random I/O because updated pages can be anywhere on disk. On HDDs, random I/O is 100x slower than sequential I/O, making B-tree writes expensive. SSDs reduce this gap (random reads are only 4-10x slower than sequential), which is one reason B-trees remain competitive against LSM trees on modern hardware.

Case Study

InnoDB Clustered Index -- How MySQL Organizes Every Table as a B+ Tree

Scenario

A high-traffic e-commerce platform storing 500 million orders in MySQL InnoDB needed sub-millisecond primary key lookups for order status checks, efficient range scans for date-based reporting, and fast joins between orders and order_items tables. The original MyISAM storage engine used heap-organized tables with separate B-tree indexes, requiring two I/O operations per lookup (one to find the row pointer in the index, one to fetch the row from the heap file).

Solution

InnoDB's clustered B+ tree design stores the complete row data in the primary key index's leaf nodes, eliminating the second I/O hop. The team chose auto-increment integers as primary keys to ensure sequential insertion at the rightmost leaf, minimizing page splits. Secondary indexes on (customer_id, order_date) were created for customer order history queries, with the understanding that these require a bookmark lookup back to the clustered index. The buffer pool was sized to hold the top 3 levels of the B+ tree (approximately 200MB for 500M rows), ensuring that lookups required at most one disk read for the leaf page.

Outcome

Primary key lookups dropped from 2 disk reads (MyISAM) to 1 disk read or 0 disk reads (when the leaf page was cached). Date-range report queries became 5x faster because the clustered index stored rows in primary key order, and the secondary index on (customer_id, order_date) provided ordered access patterns. Buffer pool hit rate exceeded 99.5% for the index's internal nodes, meaning the tree's root and branch pages were always in memory.

Common Mistakes

⚠Using UUIDs as primary keys in InnoDB. Random UUIDs cause insertions at random positions in the B+ tree, triggering excessive page splits and fragmentation. Use auto-increment integers or UUIDv7 (time-ordered) for clustered index keys to maintain sequential insertion patterns.
⚠Creating too many indexes on write-heavy tables. Each index is a separate B+ tree that must be updated on every INSERT, UPDATE, or DELETE. A table with 8 indexes requires 8 B+ tree modifications per write. Profile your actual query patterns and drop unused indexes.
⚠Ignoring the buffer pool's role in B-tree performance. A B+ tree with 4 levels only needs 3 levels cached in memory (the root and branch pages) to guarantee at most one disk read per lookup. Under-sizing the buffer pool forces repeated reads of internal nodes, degrading performance from O(1) effective disk reads to O(log n).
⚠Assuming B-trees are always the right index type. For full-text search, GIN or inverted indexes are better. For geospatial queries, R-trees or SP-GiST indexes are better. For high-cardinality set membership tests, Bloom indexes may be more appropriate. B-trees excel at equality and range queries on sortable data.

Related Concepts

LSM Tree LSM Tree vs B-Tree Write-Ahead Log (WAL)Database Sharding Relational Databases

See B-Tree and B+ Tree in action

Explore system design templates that use b-tree and b+ tree and run traffic simulations to see how these concepts perform under real load.

Browse Templates

Compare B-tree indexed reads under varying workload patterns

Metrics to watch

read_latency_mswrite_amplificationpage_split_ratethroughput_rps

Run Simulation

Test Your Understanding

1Why do B+ trees store data only in leaf nodes rather than in all nodes like original B-trees?

2A B+ tree with a fanout of 400 indexes 1 billion keys. Approximately how many levels deep is the tree?

3What happens when you use random UUIDs as the primary key in an InnoDB clustered B+ tree?

Deeper Reading