1Why do B+ trees store data only in leaf nodes rather than in all nodes like original B-trees?
B-trees and B+ trees are self-balancing tree data structures that maintain sorted data and allow O(log n) lookups, insertions, and deletions. They are the default index structure in most relational databases because their page-oriented design aligns perfectly with how disks and SSDs read and write data in fixed-size blocks.
The B-tree, invented by Rudolf Bayer and Edward McCreight at Boeing Research Labs in 1972, is a self-balancing tree data structure designed specifically for systems that read and write large blocks of data. Unlike binary search trees where each node holds a single key, a B-tree node holds dozens to hundreds of keys sorted in order, with child pointers between them. This high fanout -- typically 100 to 500 children per node -- means that even billions of keys can be indexed in a tree just 3 or 4 levels deep. Every lookup, insertion, or deletion touches at most one node per level, yielding O(log n) performance with a very small constant factor in practice.
The page-oriented design of B-trees is what makes them the natural choice for databases. Operating systems and storage devices transfer data in fixed-size pages (commonly 4KB, 8KB, or 16KB). A B-tree node is sized to match exactly one page, so each node read or write corresponds to a single I/O operation. When a node overflows during insertion (exceeds the maximum number of keys), it splits into two nodes and propagates a median key up to the parent. When a node underflows during deletion, it merges with a sibling. These operations maintain the tree's balance guarantee: all leaf nodes are always at the same depth, ensuring consistent performance regardless of data distribution or insertion order.
The B+ tree variant, used by nearly every production database including PostgreSQL, MySQL InnoDB, and SQLite, refines the original B-tree design in two critical ways. First, data records (or pointers to data records) are stored only in leaf nodes; internal nodes contain only keys and child pointers. This maximizes the fanout of internal nodes because they do not waste space on data payloads, reducing tree height and therefore the number of I/O operations per lookup. Second, leaf nodes are linked together in a doubly-linked list, enabling efficient range scans -- once you find the starting key, you simply follow the leaf chain without revisiting internal nodes. This makes B+ trees excellent for queries with ORDER BY, BETWEEN, and inequality predicates.
Write amplification is the primary cost of B-tree-based storage. Every modification to a key potentially requires rewriting the entire page containing that key, plus any parent pages that need updating during splits. In the worst case, an insertion that triggers splits all the way to the root requires rewriting one page per tree level -- typically 3-4 pages. Database engines mitigate this through buffer pools (caching hot pages in memory), write-ahead logging (batching modifications), and delayed page splits. Despite this write overhead, B-trees remain the dominant index structure for read-heavy and mixed workloads because their read performance -- a single root-to-leaf traversal -- is extremely difficult to beat.
The Library Card Catalog
Imagine a library with 10 million books. The card catalog has a master index with 26 drawers labeled A-Z. Inside each drawer, divider cards split the space into finer ranges (Aa-Am, An-Az, etc.). Behind each divider are cards pointing to even more specific sub-ranges, until you reach the individual book cards at the bottom level. To find any book, you open at most 3-4 drawers/dividers -- you never scan all 10 million cards. This is exactly how a B+ tree works: internal nodes are the dividers (they only contain keys for navigation), leaf nodes are the actual book cards (they contain the data), and the leaves are linked together so you can flip through them in order for browsing by author or title.
PostgreSQL
PostgreSQL uses B-tree as its default index type (CREATE INDEX uses B-tree unless otherwise specified). PostgreSQL's B-tree implementation supports deduplication (storing duplicate keys once with a posting list of TIDs), bottom-up deletion for reducing page splits on monotonically increasing keys (like timestamps and serial IDs), and parallel index scans that divide the leaf page chain among multiple workers for faster sequential scans of large indexes.
MySQL InnoDB
InnoDB uses a clustered B+ tree where the primary key index stores the actual row data in its leaf nodes. Every InnoDB table is a B+ tree organized by primary key. Secondary indexes store the primary key value (not a row pointer) in their leaf nodes, requiring a 'bookmark lookup' back to the clustered index. This design means primary key range scans are extremely fast (data is physically ordered), but secondary index lookups require two tree traversals.
SQLite
SQLite uses B-trees for all tables and indexes. Table B-trees (using rowid as the key) store complete row data in leaf nodes. Index B-trees store indexed columns plus the rowid. SQLite's B-tree implementation fits entirely in a single file with a page size configurable from 512 bytes to 65536 bytes (default 4096). The simplicity and reliability of SQLite's B-tree engine make it the most widely deployed database engine in the world, running on billions of devices.
| Aspect | Description |
|---|---|
| Read Performance vs Write Amplification | B-trees provide excellent read performance (single root-to-leaf traversal) but suffer from write amplification because each modification rewrites an entire page. For write-heavy workloads, LSM trees can be 10-100x more efficient on writes because they convert random writes into sequential appends, though at the cost of more complex reads. |
| Space Efficiency vs Performance | B-tree pages are typically 50-70% full after random insertions due to page splits. This internal fragmentation wastes 30-50% of allocated space. Fillfactor settings can control initial page density, but lower fillfactors waste more space while reducing future split frequency. Periodic REINDEX operations can reclaim space but require locking. |
| Concurrency vs Complexity | B-trees support fine-grained concurrency through latch coupling and lock-free techniques, but the implementation complexity is significant. Page splits require coordination across multiple levels. Modern approaches like Bw-trees (used in SQL Server Hekaton) use compare-and-swap operations to avoid latches entirely, but add complexity through delta chains and garbage collection. |
| Sequential vs Random I/O | B-tree modifications produce random I/O because updated pages can be anywhere on disk. On HDDs, random I/O is 100x slower than sequential I/O, making B-tree writes expensive. SSDs reduce this gap (random reads are only 4-10x slower than sequential), which is one reason B-trees remain competitive against LSM trees on modern hardware. |
InnoDB Clustered Index -- How MySQL Organizes Every Table as a B+ Tree
Scenario
A high-traffic e-commerce platform storing 500 million orders in MySQL InnoDB needed sub-millisecond primary key lookups for order status checks, efficient range scans for date-based reporting, and fast joins between orders and order_items tables. The original MyISAM storage engine used heap-organized tables with separate B-tree indexes, requiring two I/O operations per lookup (one to find the row pointer in the index, one to fetch the row from the heap file).
Solution
InnoDB's clustered B+ tree design stores the complete row data in the primary key index's leaf nodes, eliminating the second I/O hop. The team chose auto-increment integers as primary keys to ensure sequential insertion at the rightmost leaf, minimizing page splits. Secondary indexes on (customer_id, order_date) were created for customer order history queries, with the understanding that these require a bookmark lookup back to the clustered index. The buffer pool was sized to hold the top 3 levels of the B+ tree (approximately 200MB for 500M rows), ensuring that lookups required at most one disk read for the leaf page.
Outcome
Primary key lookups dropped from 2 disk reads (MyISAM) to 1 disk read or 0 disk reads (when the leaf page was cached). Date-range report queries became 5x faster because the clustered index stored rows in primary key order, and the secondary index on (customer_id, order_date) provided ordered access patterns. Buffer pool hit rate exceeded 99.5% for the index's internal nodes, meaning the tree's root and branch pages were always in memory.
See B-Tree and B+ Tree in action
Explore system design templates that use b-tree and b+ tree and run traffic simulations to see how these concepts perform under real load.
Browse Templates1Why do B+ trees store data only in leaf nodes rather than in all nodes like original B-trees?
2A B+ tree with a fanout of 400 indexes 1 billion keys. Approximately how many levels deep is the tree?
3What happens when you use random UUIDs as the primary key in an InnoDB clustered B+ tree?