Vetora logo
🌸Storage Engines

Bloom Filters and Cuckoo Filters

Bloom filters are space-efficient probabilistic data structures for set membership testing. They can definitively say an element is NOT in a set (zero false negatives) but may incorrectly say an element IS in the set (false positives). Cuckoo filters extend this concept with support for deletion and better space efficiency at low false positive rates.

Overview

A Bloom filter, invented by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure designed to answer one question: 'Is this element in the set?' The answer is either 'definitely not' or 'probably yes.' This asymmetry -- zero false negatives but a configurable rate of false positives -- makes Bloom filters invaluable in systems where the cost of a false positive (an unnecessary lookup) is much lower than the cost of checking every element directly. A Bloom filter representing 1 billion elements with a 1% false positive rate requires only 1.2 GB of memory, compared to storing 1 billion keys which could require 50-100 GB. This 50-100x space savings is why Bloom filters appear in virtually every large-scale storage and networking system.

The data structure is elegant in its simplicity. A Bloom filter is a bit array of m bits, initially all set to zero, paired with k independent hash functions. To add an element, compute k hash values and set the corresponding bit positions to 1. To test membership, compute the same k hash values and check if all corresponding bits are 1. If any bit is 0, the element is definitely not in the set. If all bits are 1, the element is probably in the set -- the bits may have been set by different elements whose hash values happened to map to the same positions (a false positive). The optimal parameters are well-known: for n elements and a desired false positive probability p, the optimal bit array size is m = -(n * ln(p)) / (ln(2))^2, and the optimal number of hash functions is k = (m/n) * ln(2). In practice, 10 bits per element and 7 hash functions yield approximately a 1% false positive rate, which is the most common configuration in database systems.

Counting Bloom filters address the standard Bloom filter's inability to support deletion. Instead of single bits, a counting Bloom filter uses multi-bit counters (typically 4 bits each) at each position. Adding an element increments the k counters; removing it decrements them. A position represents membership if its counter is greater than zero. The trade-off is space: a 4-bit counting Bloom filter uses 4x the memory of a standard Bloom filter. Counting Bloom filters are used in systems that need to add and remove elements dynamically, such as web cache content routers and network packet classifiers.

Cuckoo filters, introduced by Fan, Andersen, Kaminsky, and Mitzenmacher in 2014, provide a superior alternative to counting Bloom filters. A cuckoo filter stores compact fingerprints (truncated hashes) of elements in a cuckoo hash table with two candidate bucket positions per element. Lookup checks both candidate buckets for the fingerprint. Insertion uses cuckoo hashing: if both candidate buckets are full, an existing fingerprint is displaced to its alternate bucket (potentially triggering a chain of displacements). Deletion simply removes the fingerprint from its bucket. Cuckoo filters support deletion without the space overhead of counting Bloom filters, and achieve better space efficiency than Bloom filters when the target false positive rate is below approximately 3%. For example, at 1% FPR, a cuckoo filter with 12-bit fingerprints uses approximately 12.2 bits per element, competitive with a Bloom filter's 10 bits, but the cuckoo filter also supports deletion. At 0.1% FPR, cuckoo filters are significantly more space-efficient because Bloom filter size grows logarithmically with 1/FPR while cuckoo filter size grows with fingerprint size.

In storage engines, Bloom filters serve a critical role on the LSM tree read path. Each SSTable includes a Bloom filter for all keys it contains. When a point lookup arrives, the system checks each SSTable's Bloom filter before reading the SSTable from disk. If the Bloom filter says 'definitely not here,' the SSTable is skipped entirely, avoiding an expensive disk read. Without Bloom filters, a point lookup in a 5-level LSM tree might read 5 SSTables; with Bloom filters at 1% FPR, the expected number of unnecessary SSTable reads drops to approximately 0.05 (5 levels * 1% FPR). Beyond storage engines, Bloom filters are used in distributed systems for data synchronization (determining which elements one replica has that another does not), in network security (Chrome's Safe Browsing checks URLs against a Bloom filter before making a network request), and in content delivery networks (determining which edge server is likely to have a cached object without querying every server).

Key Points
  • 1Bloom filters guarantee zero false negatives (if the filter says 'not in set,' it is definitely not). False positives are configurable: 10 bits per element with 7 hash functions yields approximately 1% FPR. Testing membership is O(k) time and O(1) memory accesses.
  • 2Optimal sizing: m = -(n * ln(p)) / (ln(2))^2 bits for n elements and target FPR p. For 1% FPR, this works out to approximately 10 bits per element regardless of element size. A Bloom filter for 100 million keys at 1% FPR requires only 120 MB.
  • 3Standard Bloom filters do not support deletion because clearing a bit position could affect other elements that hash to the same position. Counting Bloom filters use multi-bit counters to support deletion at 4x the space cost.
  • 4Cuckoo filters store compact fingerprints in a cuckoo hash table, supporting both deletion and lookup. They are more space-efficient than Bloom filters at FPR below 3% and support deletion without the space overhead of counting Bloom filters.
  • 5In LSM tree storage engines, per-SSTable Bloom filters reduce point lookup disk reads from O(levels) to approximately O(1) by skipping SSTables that definitely do not contain the requested key. This is the single most impactful optimization for LSM read performance.
  • 6Bloom filters are composable: the union of two Bloom filters (bitwise OR) yields a valid Bloom filter for the union of their element sets. This enables distributed Bloom filter operations and hierarchical filter designs.
Simple Example

The Nightclub Bouncer's Guest List

Imagine a nightclub with a bouncer who has an imperfect memory system. When guests are added to the VIP list, the bouncer memorizes a few features about each name (starts with 'J,' has 5 letters, ends with a vowel). When someone arrives, the bouncer checks all the features. If any feature does not match, the person is definitely NOT on the list and is turned away (no false negatives). If all features match, the person is probably on the list and allowed in -- but occasionally someone whose name happens to match all the features by coincidence gets in without being on the list (false positive). The bouncer can add features (more hash functions / more bits) to reduce false positives, but can never eliminate them entirely. This imperfect but fast check prevents the bouncer from searching through the entire paper guest list for every arrival.

Real-World Examples

Apache Cassandra

Cassandra creates a Bloom filter for each SSTable, configured via the bloom_filter_fp_chance setting (default 0.01 = 1% FPR for SSTables, 0.1 = 10% for system tables). When a read request arrives, Cassandra checks the Bloom filter of each SSTable that might contain the partition key. A 1% FPR means 99% of SSTables that do not contain the key are skipped without a disk read. For tables with many SSTables (before compaction), this optimization can reduce disk reads by 90% or more, making point lookups fast even when data is spread across dozens of SSTables.

Google Chrome (Safe Browsing)

Chrome uses a Bloom filter (and more recently a hash-prefix-based filter with similar properties) for its Safe Browsing feature, which checks every URL the user visits against a list of known malicious URLs. The Bloom filter is stored locally on the client device, allowing instant checks without a network request for the vast majority of safe URLs. When the Bloom filter indicates a possible match (including false positives), Chrome makes a privacy-preserving API call to verify. This design checks billions of URL visits per day with minimal latency impact and minimal server load.

Bitcoin (SPV Nodes)

Bitcoin's Simplified Payment Verification (SPV) nodes use Bloom filters (BIP 37) to request only relevant transactions from full nodes without revealing exactly which addresses they control. The SPV node sends a Bloom filter of its addresses to a full node, which then sends only transactions matching the filter. The false positive rate provides a tunable privacy-performance trade-off: a higher FPR sends more irrelevant transactions (wasting bandwidth) but reveals less about which specific addresses the SPV node owns.

Trade-Offs
AspectDescription
Space Efficiency vs False Positive RateReducing the false positive rate requires more bits per element. At 10 bits/element, FPR is ~1%. At 15 bits/element, FPR drops to ~0.1%. At 5 bits/element, FPR rises to ~10%. The relationship is exponential: halving the FPR requires approximately 1.44 additional bits per element. For very low FPR requirements (below 0.1%), cuckoo filters become more space-efficient.
No Deletion vs Counting OverheadStandard Bloom filters cannot delete elements, which is a problem for dynamic sets. Counting Bloom filters support deletion but use 4x the space. Cuckoo filters support deletion with minimal space overhead but have a maximum load factor (typically 95%) after which insertions fail. The choice depends on whether the set is append-only (use Bloom) or requires deletions (use cuckoo or counting Bloom).
Memory Usage vs Disk I/O SavingsBloom filters consume memory (120MB per 100M elements at 1% FPR). In an LSM tree with hundreds of SSTables, the total Bloom filter memory can reach several GB. This memory cost must be weighed against the disk I/O savings: each avoided SSTable read saves 4-16KB of I/O and the associated latency. For most workloads, the trade-off strongly favors Bloom filters.
Hash Function Quality vs PerformanceBloom filter correctness depends on hash functions producing uniformly distributed outputs. Poor hash functions increase the effective FPR. Using k independent hash functions is theoretically optimal but computationally expensive. The double-hashing technique (deriving k hash values from two independent hashes via h_i(x) = h1(x) + i*h2(x)) provides near-optimal performance with only two hash computations, and is used by most implementations.
Case Study

Cassandra Bloom Filters -- Reducing Disk Reads by 90%+ in a Write-Heavy System

Scenario

A real-time analytics platform stored event data in Apache Cassandra with a write rate of 500,000 events per second. The event data was partitioned by device_id, with each device generating thousands of events per day. Read queries fetched all events for a specific device_id in a time range. With size-tiered compaction, each partition's data was spread across 10-30 SSTables. Without Bloom filters, every point read for a device_id would need to check all 30 SSTables, requiring 30 disk reads per query -- far too slow for the sub-10ms latency requirement.

Solution

The team configured Cassandra's Bloom filters with bloom_filter_fp_chance = 0.01 (1% FPR), meaning each SSTable's Bloom filter correctly identifies 99% of non-matching SSTables. For a query touching 30 SSTables where only 2 actually contain data for the requested device_id, the Bloom filter eliminates 27.7 of the 28 irrelevant SSTables on average (28 * 0.99), reducing disk reads from 30 to approximately 2.3. The team also tuned memtable flush size and compaction to limit SSTable count, further reducing the number of Bloom filter checks per query.

Outcome

Point read latency dropped from 45ms (30 SSTable reads) to under 5ms (2-3 SSTable reads) -- a 90% reduction in disk I/O per query. Bloom filter memory usage was approximately 1.5 GB for 12 billion keys across the cluster (10 bits per key), a small fraction of the 128 GB per-node memory. The 1% false positive rate meant approximately 1 in 100 queries performed one unnecessary SSTable read, adding negligible overhead. The team later experimented with 0.001 FPR (15 bits/key) for their hottest tables, further reducing unnecessary reads at the cost of 50% more Bloom filter memory.

Common Mistakes
  • Using Bloom filters for small sets where a hash set would work. Bloom filters' space advantage only matters when the set is large enough that storing it explicitly is expensive. For sets under 10,000 elements, a simple hash set is faster, simpler, and has zero false positives.
  • Forgetting that Bloom filters cannot be resized. Once created, the bit array size is fixed. If the number of elements grows beyond the original estimate, the FPR increases beyond the target. Solutions include using scalable Bloom filters (a chain of filters with increasing size) or recreating the filter with updated parameters.
  • Ignoring the memory cost of Bloom filters in LSM trees. Each SSTable's Bloom filter must be loaded into memory for fast access. With thousands of SSTables and 10 bits per key, Bloom filter memory can reach several GB. Monitor Bloom filter memory usage and consider higher FPR (fewer bits) for cold or infrequently-queried SSTables.
  • Assuming Bloom filters help with range queries. Bloom filters only answer point membership queries ('is key X in this set?'). They cannot answer range queries ('are there any keys between A and B?'). For range queries in LSM trees, the system must check SSTable min/max key metadata or zone maps, not Bloom filters.
Related Concepts

See Bloom Filters and Cuckoo Filters in action

Explore system design templates that use bloom filters and cuckoo filters and run traffic simulations to see how these concepts perform under real load.

Browse Templates

Reduce unnecessary disk reads with Bloom filter pre-checks

Metrics to watch
false_positive_ratedisk_read_savings_pctmemory_usage_mblookup_latency_ms
Run Simulation
Test Your Understanding

1What guarantee does a Bloom filter provide about set membership?

2Why are Bloom filters critical for LSM tree read performance?

3What advantage do cuckoo filters have over standard Bloom filters?

Deeper Reading