Vetora logo
📝Storage Engines

LSM Tree (Log-Structured Merge Tree)

The Log-Structured Merge Tree (LSM tree) is a write-optimized data structure that converts random writes into sequential I/O by buffering changes in an in-memory sorted structure (memtable) and periodically flushing them to immutable sorted files (SSTables) on disk. Background compaction merges these files to maintain read performance.

Overview

The Log-Structured Merge Tree, first described by Patrick O'Neil and colleagues in their 1996 paper, is a data structure designed to provide high write throughput on storage devices where sequential I/O is dramatically faster than random I/O. While B-trees modify data in place -- requiring random page writes for every update -- LSM trees convert all writes into sequential appends. This single design decision enables LSM trees to achieve 10-100x better write throughput than B-trees on traditional hard drives, and 3-10x better throughput on SSDs where the sequential-to-random performance gap is narrower.

The write path in an LSM tree is remarkably simple. When a write arrives, it is first appended to a write-ahead log (WAL) on disk for durability -- this is a sequential append and therefore fast. Simultaneously, the key-value pair is inserted into an in-memory sorted data structure called a memtable, typically implemented as a red-black tree, skip list, or adaptive radix tree. The memtable is sorted so it can later be written to disk as a sorted run. When the memtable reaches a configured size threshold (typically 64MB-256MB), it becomes immutable and a new empty memtable takes its place. The immutable memtable is then flushed to disk as a Sorted String Table (SSTable) -- a file where key-value pairs are written in sorted order. Because the memtable is already sorted, this flush is a single sequential write.

The read path is where LSM trees pay the cost of their write optimization. To find a key, the system must check the active memtable first, then the immutable memtable (if one exists), then each SSTable on disk from newest to oldest. In the worst case, every level must be checked. Bloom filters are the critical optimization that makes this practical: each SSTable includes a Bloom filter that can definitively say a key is NOT in that SSTable, allowing the reader to skip it entirely. With properly tuned Bloom filters (10 bits per key yields approximately 1% false positive rate), the system typically reads only 1-2 SSTables per point lookup instead of checking all of them.

Compaction is the background process that keeps the LSM tree healthy. Without compaction, SSTables accumulate endlessly, degrading read performance and wasting space (because superseded values in older SSTables are never reclaimed). Compaction merges multiple SSTables into fewer, larger SSTables, discarding deleted keys (tombstones) and superseded values in the process. The two dominant compaction strategies are size-tiered compaction (used by Cassandra by default), which groups SSTables of similar size and merges them when enough accumulate, and leveled compaction (used by RocksDB by default), which organizes SSTables into levels where each level is 10x larger than the previous and guarantees no overlapping key ranges within a level. Leveled compaction provides better read performance and more predictable space usage, but amplifies writes because data is rewritten with every level-to-level merge.

Key Points
  • 1Write path: append to WAL (sequential) + insert into memtable (in-memory sorted structure). This makes writes O(1) amortized with sequential I/O only, providing 10-100x better write throughput than B-trees on HDDs.
  • 2Memtable is an in-memory sorted data structure (red-black tree, skip list, or ART). When it reaches a size threshold (typically 64-256MB), it becomes immutable and flushes to disk as a sorted SSTable in a single sequential write.
  • 3Read path checks memtable, then each SSTable level from newest to oldest. Bloom filters (10 bits per key = ~1% FPR) let readers skip SSTables that definitely do not contain the requested key, reducing most reads to 1-2 SSTable checks.
  • 4Compaction merges SSTables in the background to reclaim space from deleted/superseded values and maintain read performance. Without compaction, read latency grows linearly with the number of SSTables.
  • 5Size-tiered compaction groups similar-sized SSTables and is write-friendly but can temporarily use 2x disk space during merges. Leveled compaction (RocksDB default) bounds read amplification but rewrites data at every level transition.
  • 6Deletes in LSM trees are handled by writing a tombstone marker, not by removing data immediately. The actual data is only reclaimed when compaction merges the tombstone with the original entry. This means deleted data may persist on disk until compaction reaches it.
Simple Example

The Sticky Notes System

Imagine you are recording transactions throughout the day. Instead of finding the right page in a sorted ledger for each entry (like a B-tree), you write each transaction on a sticky note and add it to a sorted stack on your desk (the memtable). When the stack gets too tall, you file the entire sorted batch into a filing cabinet drawer as one sorted bundle (SSTable flush). Periodically, you consolidate multiple drawers by merging their sorted bundles into one larger sorted file and throwing away outdated entries (compaction). To look something up, you first check your current sticky note stack, then check each drawer from newest to oldest. To avoid opening every drawer, you tape a quick-reference list on each drawer saying which letters are NOT inside (the Bloom filter).

Real-World Examples

RocksDB (Meta)

RocksDB, developed by Meta and based on Google's LevelDB, is the most widely deployed LSM tree engine. Meta uses RocksDB as the storage layer for MySQL (via MyRocks) to store social graph data, achieving 50% storage savings over InnoDB through better compression of sequentially-written SSTables. RocksDB uses leveled compaction by default with configurable level multiplier, universal compaction for write-heavy workloads, and supports column families for logical data separation within a single database instance.

Apache Cassandra

Cassandra uses an LSM tree engine for all data storage. Writes go to a commit log (WAL) and memtable, then flush to SSTables. Cassandra defaults to size-tiered compaction (STCS), which minimizes write amplification but can cause temporary space spikes during compaction. For read-heavy tables, operators switch to leveled compaction (LCS) which guarantees at most one SSTable per key per level. Cassandra's SSTable format includes per-partition Bloom filters, partition indexes, and compression metadata.

LevelDB (Google)

LevelDB is Google's original open-source LSM tree implementation, designed by Jeff Dean and Sanjay Ghemawat. It introduced the leveled compaction strategy that gives it its name: L0 holds flushed memtables with overlapping key ranges, while L1+ guarantees non-overlapping key ranges within each level. Each level is 10x the size of the previous. LevelDB powers Chrome's IndexedDB implementation and served as the foundation for RocksDB, which added features like column families, concurrent compaction, and rate limiting.

Trade-Offs
AspectDescription
Write Throughput vs Read LatencyLSM trees achieve extremely high write throughput (sequential I/O only) but reads may require checking multiple SSTables across levels. Point reads in a leveled compaction LSM with 5 levels may need to check up to 5 SSTables (mitigated by Bloom filters). B-trees provide consistently fast reads (single tree traversal) but slower writes due to random I/O.
Write Amplification vs Space AmplificationLeveled compaction rewrites data at each level transition, causing write amplification of 10-30x (each byte is written 10-30 times across its lifetime). Size-tiered compaction has lower write amplification (2-4x) but can require 2x temporary disk space during a compaction of the largest tier and may have up to 10x space amplification from duplicate entries.
Compaction I/O Impact vs Read PerformanceCompaction consumes significant disk I/O bandwidth and CPU. During heavy compaction, foreground read and write latency can spike. Rate limiting compaction preserves foreground performance but lets SSTables accumulate, degrading read latency. The compaction scheduling strategy is one of the most critical tuning decisions for LSM-based systems.
Memory Usage vs Disk ReadsLarger memtables reduce flush frequency and the total number of SSTables, improving read performance. But larger memtables consume more RAM and increase recovery time (more WAL to replay) after a crash. Bloom filters also consume memory (10 bits per key), but eliminating unnecessary SSTable reads more than justifies the overhead for most workloads.
Case Study

Meta's Migration from InnoDB to MyRocks -- LSM for Social Graph Storage

Scenario

Meta's social graph data -- billions of friendships, likes, and interactions -- was stored in MySQL InnoDB (B+ tree engine). The workload was extremely write-heavy, with new social interactions generating millions of writes per second. InnoDB's in-place update design caused high write amplification and poor compression ratios because B-tree pages are partially filled. Storage costs were growing faster than revenue, and the team needed a 2x reduction in storage footprint without sacrificing read latency for the most frequent queries.

Solution

Meta developed MyRocks, a MySQL storage engine backed by RocksDB's LSM tree. The key advantages were: (1) sequential writes from LSM eliminate random I/O, improving write throughput; (2) SSTables are written once and compacted in sorted order, achieving much better compression ratios than partially-filled B-tree pages; (3) leveled compaction bounds the number of SSTables per key range, keeping read amplification manageable. Meta used Bloom filters with 10 bits per key and block cache for hot data to maintain acceptable read latencies.

Outcome

MyRocks achieved a 50% reduction in storage space compared to InnoDB for the same data, primarily through better compression of sequentially-written SSTables. Write throughput improved by 2-3x for the social graph workload. Read latency for point lookups remained within 10% of InnoDB performance due to effective Bloom filter and block cache tuning. Meta migrated a significant fraction of its MySQL fleet from InnoDB to MyRocks, saving petabytes of SSD storage across their data centers.

Common Mistakes
  • Neglecting compaction tuning and monitoring. An LSM tree without proper compaction configuration will accumulate SSTables, causing read latency to grow linearly and space usage to balloon. Monitor SSTable count per level, compaction backlog, and P99 read latency as early warning signals.
  • Using LSM trees for read-heavy workloads with random access patterns. LSM trees are optimized for write-heavy or write-mostly workloads. If your workload is 90% reads with random key access, a B-tree index will provide more consistent and predictable latency because it requires a single tree traversal per read.
  • Under-sizing Bloom filters to save memory. Reducing Bloom filter bits-per-key from 10 to 5 doubles the false positive rate from 1% to 10%, meaning 10% of point lookups will unnecessarily read an SSTable that does not contain the key. The memory saved (a few MB) is rarely worth the read performance degradation.
  • Ignoring the impact of tombstones on range scan performance. Deletes write tombstone markers that persist until compaction reclaims them. A range scan over a key space with many deletions will encounter and skip these tombstones, degrading scan performance. Use TTL-based expiration or force compaction on heavily deleted ranges.
Related Concepts

See LSM Tree (Log-Structured Merge Tree) in action

Explore system design templates that use lsm tree (log-structured merge tree) and run traffic simulations to see how these concepts perform under real load.

Browse Templates

Benchmark LSM-tree write throughput for click ingestion

Metrics to watch
write_throughput_rpscompaction_overhead_pctread_amplificationspace_amplification
Run Simulation
Test Your Understanding

1In an LSM tree, what is the primary purpose of the memtable?

2Which compaction strategy provides better read performance but higher write amplification?

3Why do LSM trees use tombstones for deletes instead of removing data immediately?

Deeper Reading