Vetora logo
⚔️Storage Engines

LSM Tree vs B-Tree

A comprehensive comparison of the two dominant storage engine architectures: LSM trees (write-optimized, sequential I/O, compaction-based) and B-trees (read-optimized, in-place updates, page-based). Understanding this trade-off is fundamental to choosing the right database engine for a given workload.

Overview

The choice between LSM tree and B-tree storage engines is the most consequential low-level architectural decision in database system design. These two data structures represent fundamentally different philosophies: B-trees maintain a single, mutable sorted structure that is updated in place, optimizing for read performance at the cost of random write I/O; LSM trees accumulate immutable sorted runs that are merged in the background, optimizing for write throughput at the cost of read amplification. Every production database makes this choice, and understanding the trade-offs is essential for selecting the right engine for a given workload.

Write amplification -- the ratio of total bytes written to storage versus the logical bytes of user data -- is where LSM trees have a clear structural advantage. When a B-tree modifies a single key, it must rewrite the entire page (typically 4KB-16KB) containing that key, even if the actual change is a few bytes. A page split during insertion rewrites two pages plus the parent, and these writes are random I/O scattered across the disk. An LSM tree, by contrast, appends the write sequentially to a WAL and buffers it in memory. When the memtable flushes, it writes a sorted SSTable in a single sequential operation. However, LSM trees have their own write amplification from compaction: in leveled compaction, data is rewritten at each level transition, resulting in a total write amplification of approximately 10-30x over the data's lifetime. The key difference is that all of this I/O is sequential, which on HDDs is approximately 100x faster than random I/O, and on SSDs is still 3-10x faster.

Read amplification -- the number of I/O operations required to satisfy a point lookup -- favors B-trees. A B-tree lookup traverses from root to leaf, touching one page per level; with internal pages cached in the buffer pool, this typically means a single disk read for the leaf page. An LSM tree point lookup must potentially check the memtable, then each SSTable level from newest to oldest. Without optimizations, a 5-level LSM tree could require 5 disk reads per lookup. Bloom filters reduce this dramatically: with 10 bits per key (~1% false positive rate), most levels are skipped, and the expected number of disk reads drops to approximately 1.1 for a point lookup. But for range scans, LSM trees must merge-sort results from multiple SSTables, while B-trees simply follow the linked leaf chain -- a significant advantage for range-heavy workloads like analytics queries and ordered iteration.

Space amplification -- the ratio of total disk space used versus the logical size of the data -- is another critical dimension. B-trees typically maintain 65-75% page utilization after random insertions (due to page splits leaving pages half-full), resulting in approximately 1.3-1.5x space amplification. LSM trees with leveled compaction maintain around 1.1x space amplification in steady state because compaction produces tightly packed SSTables. However, during compaction itself, the system temporarily needs space for both the input SSTables and the output SSTable, which can spike to 2x the logical data size. Size-tiered compaction is worse: because it delays merging large tiers, space amplification can reach 2-10x. For cost-sensitive deployments, this difference in storage efficiency can be the deciding factor.

SSD-specific considerations add nuance to the comparison. SSDs have asymmetric read/write performance, limited write endurance (each cell can only be written a finite number of times), and internal write amplification from garbage collection. LSM trees' sequential write patterns align well with SSD internals: sequential writes minimize the SSD's internal garbage collection overhead and reduce write amplification at the device level. However, heavy compaction can cause periodic I/O spikes that interfere with foreground request latency -- a phenomenon called 'compaction stalls.' B-trees benefit from SSDs' fast random reads (SSD random reads are only 4-10x slower than sequential, compared to 100x on HDDs), which makes the B-tree's random read pattern much less costly on modern hardware. The convergence of random and sequential I/O performance on SSDs has narrowed the write throughput gap between LSM and B-tree engines, though LSM trees still hold a significant advantage for sustained write-intensive workloads.

Hybrid approaches attempt to combine the best of both architectures. The BW-tree (used in Microsoft's Hekaton in-memory OLTP engine and Azure SQL Database) uses a B-tree structure but with delta chains instead of in-place updates, converting random writes into sequential log-structured appends while maintaining B-tree read patterns. Facebook's RocksDB introduced a feature called BlobDB that stores large values in a separate log-structured blob store while keeping keys and small values in the LSM tree, reducing write amplification for workloads with large values. TiDB takes a different hybrid approach: it uses RocksDB (LSM) as the storage engine for its TiKV OLTP layer, but uses a columnar format in TiFlash for OLAP queries, routing queries to the appropriate engine based on the query pattern. These hybrid designs acknowledge that no single data structure is optimal for all access patterns, and the future of storage engines lies in adaptive, workload-aware combinations.

Key Points
  • 1Write amplification: B-trees rewrite entire pages (4-16KB) for single-key changes with random I/O. LSM trees append sequentially but incur 10-30x write amplification from leveled compaction. The critical difference is sequential vs random I/O -- sequential is 100x faster on HDDs and 3-10x faster on SSDs.
  • 2Read amplification: B-trees require a single root-to-leaf traversal (typically 1 disk read with cached internal nodes). LSM trees must potentially check multiple SSTables across levels, but Bloom filters (10 bits/key, ~1% FPR) reduce expected disk reads to approximately 1.1 per point lookup.
  • 3Space amplification: B-trees run at 65-75% page utilization (~1.4x overhead). LSM trees with leveled compaction achieve ~1.1x steady-state overhead but can spike to 2x during compaction. Size-tiered compaction can require 2-10x space overhead.
  • 4Range scan performance strongly favors B-trees. B+ tree leaf nodes are linked, so range scans follow a sequential chain. LSM trees must merge-sort results from multiple SSTables, adding CPU overhead and I/O for each level.
  • 5SSD considerations: LSM sequential writes reduce SSD garbage collection overhead and write wear. B-trees benefit from SSDs' fast random reads (only 4-10x penalty vs 100x on HDDs). The performance gap between the two architectures is smaller on SSDs than on HDDs.
  • 6Hybrid engines (BW-tree, BlobDB, TiDB's dual engine) combine B-tree read patterns with log-structured write patterns, acknowledging that neither pure approach is optimal for all workloads.
Simple Example

Filing Cabinet vs Notebook

Imagine two systems for tracking inventory changes. The B-tree approach is like a filing cabinet with folders sorted alphabetically: to update an item, you find the right folder (fast lookup), erase the old count, and write the new one (in-place update, but you touched the whole page). The LSM approach is like a notebook: you jot each change on the next blank line (fast sequential write). To find the current count for an item, you scan backward through the notebook looking for the latest entry for that item (slower read). Periodically, you rewrite the notebook by consolidating all entries per item into a clean sorted version (compaction). The filing cabinet is great for looking things up; the notebook is great for recording lots of changes quickly.

Real-World Examples

Meta (MyRocks)

Meta migrated significant portions of their MySQL fleet from InnoDB (B+ tree) to MyRocks (LSM tree backed by RocksDB) for social graph storage. The write-heavy workload -- billions of social interactions per day -- caused high write amplification in InnoDB. MyRocks achieved 50% storage savings through better compression of sequentially-written SSTables and improved write throughput by eliminating random page writes. Point read latency remained within 10% of InnoDB due to effective Bloom filter and block cache tuning.

TiDB (Hybrid Approach)

TiDB uses RocksDB (LSM tree) as the storage engine for its TiKV OLTP layer, chosen for write scalability in a distributed architecture where every Raft group maintains its own RocksDB instance. For OLAP queries, TiDB routes requests to TiFlash, which uses a columnar storage format optimized for analytical scans. This dual-engine design lets TiDB serve both write-heavy transactional workloads and read-heavy analytical workloads from the same database, with TiFlash receiving asynchronous replication from TiKV.

CockroachDB (Pebble)

CockroachDB chose an LSM tree engine (Pebble, a Go rewrite of RocksDB) over B-tree storage for its distributed SQL database. The decision was driven by write scalability: in a distributed database using Raft consensus, every write is replicated to multiple nodes, multiplying the total write volume. LSM's sequential write pattern handles this amplified write load more efficiently than B-tree random writes. Pebble uses leveled compaction with concurrent compaction workers and achieves write throughput within 90% of RocksDB while being written in pure Go for easier integration.

Trade-Offs
AspectDescription
Write Throughput vs Read Latency ConsistencyLSM trees provide 3-10x higher sustained write throughput than B-trees on the same hardware, but read tail latency (P99) can spike during heavy compaction. B-trees provide consistent, predictable read latency because there is no background merging process competing for I/O, but write throughput is limited by random I/O capacity.
Storage Efficiency vs PredictabilityLSM trees with leveled compaction achieve better steady-state storage efficiency (~1.1x vs B-tree's ~1.4x) but have unpredictable space requirements during compaction (temporary 2x spike). B-trees' space usage is more predictable: pages are allocated and freed incrementally without large batch operations. For environments where disk space is tightly provisioned, B-trees' predictability is valuable.
Sequential I/O Advantage vs SSD ConvergenceLSM trees' sequential write advantage is enormous on HDDs (100x) but diminishing on SSDs (3-10x). As NVMe SSDs become standard, the write throughput gap between LSM and B-tree engines narrows. However, LSM trees still benefit SSDs through reduced write wear (fewer erase cycles) and less internal garbage collection, extending SSD lifespan.
Operational Complexity vs SimplicityLSM trees require careful tuning of compaction strategy, rate limiting, Bloom filter sizing, memtable sizes, and level multipliers. Misconfigured compaction can cause write stalls, space blowup, or degraded read performance. B-trees are operationally simpler: key tuning parameters are buffer pool size and fillfactor. For teams without deep storage engine expertise, B-tree engines (PostgreSQL, MySQL InnoDB) are easier to operate well.
Case Study

CockroachDB's Decision: Why a Distributed SQL Database Chose LSM over B-Tree

Scenario

CockroachDB, a distributed SQL database designed for global deployments, needed a storage engine for its per-node data storage. Every write in CockroachDB is replicated via Raft consensus to at least 3 nodes, meaning the total system write volume is 3x the user write volume. The team evaluated both B-tree (BoltDB, a pure-Go B+ tree) and LSM (RocksDB/Pebble) architectures. The key requirements were: high write throughput for replicated writes, efficient range scans for SQL query processing, space-efficient storage for multi-tenant deployments, and predictable latency under mixed workloads.

Solution

CockroachDB initially used RocksDB (LSM) via CGo bindings, then built Pebble -- a pure-Go LSM tree engine -- to eliminate CGo overhead and gain full control over the storage layer. The team chose LSM over B-tree for three reasons: (1) Raft replication triples the write volume, making sequential I/O essential for write scalability; (2) LSM's sorted SSTable format compresses better than partially-filled B-tree pages, critical for multi-tenant clusters; (3) LSM's append-only write pattern produces less SSD write wear than B-tree's in-place updates, reducing hardware costs. To mitigate LSM's read amplification, Pebble uses table-level Bloom filters, an efficient block cache, and iterator pooling for range scans.

Outcome

Pebble achieved write throughput within 90% of RocksDB while eliminating CGo boundary-crossing overhead. CockroachDB's storage layer handles sustained write-heavy workloads (such as bulk data import and high-throughput OLTP) with consistent performance. The LSM engine's better compression reduced storage costs by approximately 30% compared to the BoltDB B-tree prototype. Read latency for point lookups averages under 500 microseconds with properly warmed block cache and Bloom filters, and range scan performance is competitive with B-tree engines for typical SQL query patterns.

Common Mistakes
  • Choosing a storage engine based on a single benchmark rather than your actual workload profile. A benchmark showing LSM trees are 5x faster on writes is irrelevant if your workload is 95% reads. Profile your real read/write ratio, access patterns (point vs range vs scan), and key distribution before deciding.
  • Ignoring compaction impact on tail latency. LSM tree P99 latency can spike 10-100x during heavy compaction due to I/O contention. If your SLA requires consistent sub-millisecond P99 reads, you need aggressive compaction rate limiting and I/O prioritization, or a B-tree engine may be a better fit.
  • Assuming B-trees cannot handle write-heavy workloads. With sufficient buffer pool memory, a B-tree engine can absorb high write rates by batching dirty pages and flushing them asynchronously. PostgreSQL and MySQL InnoDB serve many write-heavy applications effectively. The crossover point where LSM becomes necessary is higher than most developers assume.
  • Overlooking the operational complexity of LSM tuning. LSM engines expose dozens of tuning parameters (memtable size, L0 slowdown/stop triggers, compaction concurrency, Bloom filter bits, block cache size, compression per level). Incorrect tuning can cause write stalls, space amplification blowup, or degraded read performance. B-tree engines have fewer critical parameters (buffer pool size, checkpoint interval).
Related Concepts

See LSM Tree vs B-Tree in action

Explore system design templates that use lsm tree vs b-tree and run traffic simulations to see how these concepts perform under real load.

Browse Templates

Compare B-tree reads vs LSM writes for mixed workloads

Metrics to watch
read_latency_mswrite_latency_mswrite_amplificationspace_efficiency_pct
Run Simulation
Test Your Understanding

1Why do LSM trees generally achieve higher write throughput than B-trees?

2What is the primary mechanism LSM trees use to maintain acceptable read performance despite having data spread across multiple SSTables?

3A workload has 70% point reads, 20% range scans, and 10% writes. Which storage engine architecture is likely the better fit?

Deeper Reading