Vetora logo
📊Database Families

Wide-Column Stores (Cassandra, Bigtable, HBase)

Wide-column stores organize data by partition key and clustering columns, enabling fast writes and efficient range scans over massive datasets. Cassandra, Bigtable, and HBase dominate time-series, event logging, and IoT workloads where write throughput and horizontal scalability matter most.

Overview

Wide-column stores are a family of distributed databases designed for workloads that demand massive write throughput, horizontal scalability, and efficient range scans over large datasets. Unlike relational databases where tables have a fixed set of columns, wide-column stores allow each row to contain a different set of columns, and rows are organized by a partition key (which determines data placement across nodes) and optional clustering columns (which determine sort order within a partition). This two-level key structure -- partition key for distribution, clustering columns for ordering -- is the fundamental design primitive that makes wide-column stores powerful for time-series, event logging, and IoT data.

The storage engine underpinning most wide-column stores is the Log-Structured Merge-tree (LSM-tree). Writes go first to an in-memory buffer (memtable) and are simultaneously appended to a commit log for durability. When the memtable reaches a threshold size, it is flushed to disk as an immutable sorted file (SSTable). Reads merge data from the memtable and multiple SSTables. Background compaction processes periodically merge SSTables to reclaim space from deleted or overwritten data and reduce the number of files that reads must consult. This architecture makes writes extremely fast (sequential I/O to the commit log + in-memory buffer) at the cost of more complex reads (potentially consulting multiple SSTables). Bloom filters on each SSTable help skip files that definitely do not contain the requested key, reducing unnecessary disk I/O.

Apache Cassandra, the most popular open-source wide-column store, was originally designed at Facebook for inbox search and later open-sourced. Cassandra uses a peer-to-peer architecture with no single point of failure -- every node is equal, and data is replicated across multiple nodes using consistent hashing. Cassandra's defining feature is tunable consistency: for each read and write operation, you specify a consistency level (ONE, QUORUM, ALL, LOCAL_QUORUM, etc.) that determines how many replicas must acknowledge the operation. This allows developers to choose their position on the consistency-availability spectrum per query. Lightweight Transactions (LWT) provide compare-and-set operations using Paxos consensus for cases that require linearizable consistency, at the cost of significantly higher latency.

Google Bigtable, the original wide-column store (published in 2006), influenced the design of both HBase and Cassandra. Unlike Cassandra's peer-to-peer model, Bigtable uses a leader-based architecture with a master server managing tablet assignments and tablet servers storing data ranges. This centralized coordination simplifies consistency but introduces a single point of management (though not a single point of failure in practice, as the master is replicated). HBase, the open-source Bigtable implementation on Hadoop, follows the same architecture. Bigtable (as a managed Google Cloud service) provides single-digit-millisecond reads at petabyte scale with strong consistency per row, making it the backbone of many Google services including Search indexing, Google Analytics, and Google Maps. The wide-column model excels when your access patterns are known at design time and revolve around partition-key lookups with clustering-column range scans -- it is poorly suited for ad-hoc queries, aggregations across partitions, or workloads requiring multi-row transactions.

Key Points
  • 1Data is organized by partition key (determines which node stores the data via consistent hashing) and clustering columns (determines sort order within a partition). This two-level key design enables efficient range scans within a partition while distributing data evenly across nodes.
  • 2LSM-tree storage engines make writes extremely fast: data is written to an in-memory memtable and commit log (sequential I/O), then periodically flushed to immutable SSTables on disk. Reads are slower because they may need to consult multiple SSTables, mitigated by Bloom filters and key caches.
  • 3Cassandra's tunable consistency allows per-query trade-offs. QUORUM (majority of replicas must respond) provides strong-enough consistency for most workloads. ONE provides lowest latency for non-critical reads. LOCAL_QUORUM restricts the quorum to the local data center for cross-region deployments.
  • 4Compaction strategies significantly impact performance. Size-Tiered Compaction (STCS) is write-optimized but uses more disk space temporarily. Leveled Compaction (LCS) provides more predictable read performance by organizing SSTables into levels but increases write amplification.
  • 5Lightweight Transactions (LWT) in Cassandra provide linearizable consistency using Paxos consensus for compare-and-set operations. They are 4-10x slower than regular writes and should be used sparingly -- only for operations like unique username registration or conditional updates.
  • 6Data modeling in wide-column stores is query-driven: you design tables to match specific query patterns, often duplicating data across multiple tables. This is the opposite of relational normalization and requires knowing all access patterns upfront.
Simple Example

The Library Card Catalog

Imagine a massive library with books distributed across 10 buildings (nodes). Each building holds books for specific author last-name ranges (partition key). Within each building, books by the same author are sorted by publication year (clustering column). To find all books by 'Tolkien published between 1950 and 1960,' you go to the building that handles 'T' authors (partition lookup) and scan the shelf from 1950 to 1960 (range scan on clustering column) -- very fast. But to find 'all fantasy books published in 1954 across all authors,' you would need to visit all 10 buildings and scan every shelf (full cluster scan) -- extremely slow. Wide-column stores work the same way: queries within a partition are fast; queries across partitions are expensive.

Real-World Examples

Netflix

Netflix uses Apache Cassandra for storing viewing history, bookmarks, and user activity data across hundreds of terabytes. The partition key is user_id and the clustering column is timestamp, enabling efficient queries like 'get user X's last 50 views.' Netflix operates Cassandra across multiple AWS regions with LOCAL_QUORUM consistency for writes and LOCAL_ONE for reads, providing low-latency access from the nearest region. Their Cassandra deployment handles millions of writes per second during peak streaming hours.

Apple

Apple operates one of the largest known Cassandra deployments in the world, with over 150,000 nodes storing data for iCloud, Siri, Maps, and other services. The scale demonstrates Cassandra's linear horizontal scalability -- throughput grows proportionally with the number of nodes. Apple's deployment handles petabytes of data with high availability requirements, leveraging Cassandra's peer-to-peer architecture to avoid single points of failure across global data centers.

Spotify

Spotify uses Google Cloud Bigtable for storing and analyzing user event data -- play events, search queries, skip patterns, and playlist interactions. Each event is keyed by user_id with a timestamp clustering column, enabling rapid retrieval of a user's recent activity for real-time recommendations. Bigtable's strong per-row consistency and managed infrastructure let Spotify's data team focus on analytics pipelines rather than database operations, processing billions of events daily.

Trade-Offs
AspectDescription
Write Performance vs Read ComplexityLSM-tree engines provide extremely fast writes (sequential I/O) but reads may need to consult multiple SSTables from different levels. Bloom filters and key caches mitigate this, but read latency is less predictable than B-tree-based databases, especially for non-existent keys where all SSTables must be checked.
Query-Driven Modeling vs FlexibilityWide-column stores require designing tables around specific query patterns, often duplicating data across multiple tables (one per query). This delivers excellent performance for known queries but makes ad-hoc queries expensive or impossible. Adding a new query pattern often requires creating a new table and backfilling data.
Tunable Consistency vs SimplicityCassandra's per-query consistency levels provide flexibility but shift consistency decisions to application developers. Mistakes are easy: writing at QUORUM but reading at ONE can return stale data. The developer must understand the relationship between replication factor, write consistency, and read consistency to ensure the guarantees they expect.
Horizontal Scalability vs Operational OverheadAdding nodes to Cassandra increases capacity linearly, but operating a large cluster requires expertise in compaction tuning, repair scheduling, token management, and capacity planning. Managed services like DataStax Astra or Google Cloud Bigtable reduce operational burden but at higher per-operation cost.
Case Study

Netflix Viewing History -- Cassandra at Global Scale

Scenario

Netflix needed to store the complete viewing history for over 200 million subscribers, with each user averaging thousands of viewing events. The system required millisecond-latency reads for personalization (recommending what to watch next), high write throughput during peak streaming hours (when millions of users are simultaneously watching), and multi-region availability so that viewing history is accessible regardless of which AWS region serves the user. A relational database could not handle the write volume, and eventual consistency was acceptable for viewing history because a briefly-stale history does not affect the user experience.

Solution

Netflix stores viewing history in Cassandra with user_id as the partition key and timestamp as the clustering column. Each user's viewing history is stored on the same set of replicas, enabling efficient retrieval. Writes use LOCAL_QUORUM consistency to ensure durability within the local region, while reads use LOCAL_ONE for minimum latency (falling back to the nearest available replica). Cross-region replication is asynchronous, ensuring that a user's history eventually propagates to all regions. Size-Tiered Compaction handles the write-heavy workload efficiently.

Outcome

Netflix's Cassandra deployment handles millions of writes per second during peak hours with sub-10ms p99 read latency for recent viewing history. The system scales linearly by adding nodes -- doubling the cluster doubles throughput without architectural changes. During AWS region failures, viewers seamlessly fail over to another region where their viewing history is available (though potentially a few seconds behind). The query-driven data model means that the 'get recent views for user X' query consistently hits a single partition, keeping read latency predictable regardless of total data volume.

Common Mistakes
  • Designing tables with relational normalization in mind. Wide-column stores require query-driven denormalization -- one table per query pattern. Trying to normalize data and JOIN across tables results in expensive multi-partition queries that defeat the purpose of the architecture.
  • Using too few or too many clustering columns. Too few clustering columns means you cannot efficiently query sub-ranges within a partition. Too many creates excessively wide rows that are expensive to read. The clustering columns should match the predicates in your most frequent queries.
  • Ignoring partition size limits. Cassandra performs best when partitions are under 100 MB. A partition key like 'country' with billions of events creates massive partitions that cause memory pressure, slow compaction, and uneven data distribution. Use composite partition keys (country + day) to bound partition sizes.
  • Running repairs infrequently. Without regular anti-entropy repairs, deleted data (tombstones) can resurrect and replicas can diverge permanently. Cassandra requires scheduled repairs within the gc_grace_seconds window (default 10 days) to maintain data consistency across replicas.
Related Concepts

See Wide-Column Stores (Cassandra, Bigtable, HBase) in action

Explore system design templates that use wide-column stores (cassandra, bigtable, hbase) and run traffic simulations to see how these concepts perform under real load.

Browse Templates

Simulate write-heavy workloads on wide-column vs relational stores

Metrics to watch
write_throughputread_latency_p99compaction_pressure
Run Simulation
Test Your Understanding

1What is the role of the partition key in a Cassandra table?

2Why does Cassandra use an LSM-tree storage engine instead of a B-tree?

3What happens when you read with consistency level ONE and write with consistency level QUORUM in Cassandra (replication factor 3)?

Deeper Reading