Multi-region CRDT counter architecture with grow-only counters per region, HyperLogLog deduplication, and async cross-region merge via Kafka. Local writes at ~2ms, global convergence within seconds, time-bucketed aggregation for analytics.
The CRDT (Conflict-free Replicated Data Type) counter architecture is the production-grade solution for globally distributed counting at extreme scale. It solves two fundamental problems that the sharded counter approach does not address: cross-region write latency and conflict-free convergence without coordination.
The cross-region latency problem is severe. When a video goes viral globally, increment traffic arrives simultaneously from US-East, EU-West, AP-Southeast, and other regions. In the sharded counter approach, all writes must route to the single region hosting the Redis cluster. Users in EU-West experience 80-120ms of additional latency (cross-Atlantic round-trip) just to reach the counter infrastructure. Users in AP-Southeast face 150-250ms. During viral peaks at 20M inc/sec globally, this cross-region traffic creates a bandwidth bottleneck and multiplies the effective write latency by 5-10x compared to local writes.
CRDTs solve this with a mathematically elegant property: the merge operation is commutative, associative, and idempotent. A G-Counter (Grow-only Counter) is a vector of N components, one per region. Each region independently increments its own component. The global count is the sum of all components. The merge operation takes the component-wise maximum: merge(A, B)[i] = max(A[i], B[i]). This guarantees that regardless of the order in which regions exchange state, or how many times the same state is received, the counters converge to the same correct value.
The architecture places a complete counter stack in each region: ApiGateway, LoadBalancer, CounterService, LocalCounterCache (Redis), and DedupCache (HyperLogLog). Local writes complete in ~2ms — the same latency as a single-region sharded counter. The cross-region coordination happens asynchronously via MergeStream (Kafka with cross-region replication). MergeWorker consumes events from all regions, applies the CRDT merge, and writes the globally consistent count to CounterDB.
Per-user deduplication is critical for counters like 'likes' where each user should count at most once per item. The CRDT variant uses HyperLogLog (HLL) — a probabilistic data structure that estimates the cardinality of a set using only ~12KB of memory per counter key, regardless of the number of unique users. CounterService issues PFADD dedup:{key} {user_id} before incrementing. PFADD returns 1 if the user is likely new (with ~2% standard error), 0 if likely already counted. The trade-off is that ~2% of legitimate first-time users are falsely rejected — their increment is silently dropped.
Time-bucketed aggregation adds an analytics dimension. AggregationWorker reads merged counts and computes per-minute, per-hour, and per-day rollups stored in CounterDB. This enables historical queries like 'views per hour for the last 7 days' without scanning billions of raw increment events.
The CRDT counter pattern appears in interviews for global-scale systems — TikTok (videos viewed across continents), YouTube (global view counts), and large-scale IoT platforms (device event counting across edge locations). Interviewers at senior levels expect candidates to articulate why CRDTs are preferable to consensus-based approaches (Paxos, Raft) for counters, explain the G-Counter merge semantics, and reason about the HyperLogLog accuracy trade-off.
The progression from naive (single row) to sharded (single region, distributed writes) to CRDT (multi-region, zero-coordination convergence) demonstrates increasing sophistication in handling the fundamental tension between write throughput, read freshness, global availability, and data consistency.
The CRDT counter system uses 9 components organized into a region-local write path with async cross-region merge. The components are: CounterClient, ApiGateway, MainLB, CounterService, LocalCounterCache (Redis CRDT shards + merged counts), DedupCache (Redis HyperLogLog), MergeStream (Kafka), MergeWorker, AggregationWorker, and CounterDB (PostgreSQL).
The write path is fully region-local. A request enters through the regional ApiGateway (JWT auth, rate limiting at 2.5M RPS), routes to MainLB, and reaches CounterService. CounterService first checks DedupCache via PFADD to determine if this user has already been counted for this key. If the user is new (PFADD returns 1), CounterService INCRs a random sub-counter shard in LocalCounterCache and publishes a counter_increment event to MergeStream (Kafka). The entire write path completes in ~10ms, with no cross-region hop. If the user is a duplicate (PFADD returns 0), the increment is silently dropped and the response returns immediately.
LocalCounterCache serves dual purposes. It stores the region-local sharded sub-counters (counter:{region_id}:{key}:{shard_id}) for incoming writes, and it caches the last known merged global count (counter:{key}:merged) for fast approximate reads. The 16-node Redis cluster handles 2M+ INCR/sec and 1M+ GET/sec for the local region.
DedupCache is a separate Redis cluster dedicated to HyperLogLog structures. Each counter key has one HLL (dedup:{key}) that tracks all user IDs that have incremented this counter. Each HLL uses ~12KB when fully populated (2^14 registers). The 8-node cluster supports 2M+ PFADD/sec. TTL of 24 hours expires HLL state for cold counters, reclaiming memory.
MergeStream (Kafka) carries increment events both within the region (to MergeWorker) and across regions (via MirrorMaker or Kafka cross-region replication). Events are partitioned by counter_key (256 partitions) for parallel processing. The topic schema includes region_id, enabling MergeWorker to maintain the per-region CRDT vector.
MergeWorker is the CRDT convergence engine. It consumes events from MergeStream, maintains the G-Counter vector per key ({region_id: count}), and applies the merge operation (component-wise max). Every 10 seconds, it writes the merged global count and the full CRDT vector to CounterDB, and writes the merged count back to LocalCounterCache for fast approximate reads. With 60 workers processing in parallel, merge completes within seconds of the last region's update.
AggregationWorker computes time-bucketed rollups (per-minute, per-hour, per-day) from the merged counts in CounterDB. These rollups are written to a separate counter_aggregations table optimized for historical range queries. 20 workers handle the aggregation workload with 60-second processing cycles.
CounterDB (PostgreSQL) provides durable storage for two table types: the counters table (merged global counts with full CRDT vectors) and the counter_aggregations table (time-bucketed rollups). It serves exact-count reads for the 10% of traffic that requires global accuracy. 128 partitions with 3 replicas provide throughput and durability.
Choice
Each region maintains an independent counter component; merge takes component-wise max
Rationale
The G-Counter's merge operation (max per component) is commutative, associative, and idempotent. This means regions can exchange state in any order, at any frequency, with duplicates, and the result always converges to the correct count. No coordination protocol (Paxos, Raft, 2PC) is needed. The global count is simply the sum of all region components.
Choice
PFADD checks unique users per counter key at ~2% error rate using ~12KB per key
Rationale
Full set membership (hash set of user_id+key pairs) would require ~40 bytes per entry — 40GB for 1B pairs. HyperLogLog uses ~12KB per counter key regardless of cardinality, with ~2% standard error. The trade-off is that ~2% of legitimate first-time users are falsely rejected. For social media likes this is acceptable; for voting or financial systems it would not be.
Choice
Dedicated Redis clusters for HyperLogLog dedup and counter shards
Rationale
HyperLogLog structures have different memory profiles (12KB each, ~120GB total for 10M keys) and eviction needs than counter shards (~64 bytes each). Separating them prevents HLL memory pressure from evicting hot counter data and allows independent scaling of dedup capacity vs counter throughput.
Choice
MergeStream topics replicated across all regions for CRDT state exchange
Rationale
Kafka MirrorMaker provides at-least-once cross-region replication with configurable lag. Since the CRDT merge is idempotent, duplicate delivery is harmless — applying the same increment twice produces the same merged result. Kafka also provides durable replay: if a MergeWorker crashes, it can reconstruct state from the event log.
Choice
AggregationWorker computes per-minute, per-hour, per-day rollups from merged counts
Rationale
Raw per-increment data is too granular for analytics dashboards. Pre-computed rollups enable efficient historical queries ('views per hour for the last 7 days') without scanning billions of raw events. The aggregation table uses a composite key (counter_key, bucket_type, bucket_start) for efficient range scans.
Choice
All writes complete locally (~2ms); cross-region merge happens asynchronously
Rationale
Cross-region latency (50-200ms) is unacceptable on the write path for counters receiving millions of increments per second. By completing writes locally and merging asynchronously, the system provides consistent low-latency writes from any region while eventually converging to a globally correct count.
Target RPS
2M+ inc/sec global, 20M/sec peak
Latency (p99)
<5ms local writes, <15ms exact reads
Storage
~120 GB Redis (HLL) + 500 GB PostgreSQL/year
Availability
99.95% (multi-region, independent replicas)
| Operation | Time | Space | Notes |
|---|---|---|---|
| Increment with dedup (POST /api/v1/counters/{key}/increment) | O(1) PFADD + O(1) Redis INCR + O(1) Kafka publish | O(S) per counter key (S = 16 shards) + O(1) HLL per key (~12KB) | All region-local. Total write latency: ~2ms (PFADD) + ~2ms (INCR) + ~5ms (Kafka async) = ~10ms. |
| Approximate read (GET /api/v1/counters/{key}) | O(1) Redis GET from merged count cache | O(1) single key fetch | Sub-5ms latency. Reflects local + last merged global state. Cross-region staleness: 2-10 seconds. |
| Exact global read (GET /api/v1/counters/{key}/exact) | O(1) PostgreSQL SELECT by primary key | O(1) single row fetch (includes CRDT vector JSON) | ~15ms latency. Reflects fully merged global count. Staleness: 10-30 seconds. |
| CRDT merge (MergeWorker cycle) | O(R) per key where R = number of regions | O(K * R) for K active keys across R regions | Merge is component-wise max: O(R) comparisons per key. Commutative + idempotent = replay-safe. |
Region-local sharded sub-counters forming the local component of the CRDT G-Counter vector. CounterService INCRs a random shard on each deduplicated increment. MergeWorker reads all 16 shards to compute the local region's total. 48h TTL evicts cold counters.
50M active keys x 16 shards x 64 bytes = ~50GB working set per region. Distributed across 16 Redis nodes.
Per-key HyperLogLog structures for approximate unique user deduplication. PFADD adds a user ID; returns 1 if new, 0 if likely duplicate. ~2% standard error. 24h TTL expires dedup state for cold counters.
10M active keys x 12KB = ~120GB. 8 Redis nodes. PFADD is O(1) amortized.
Durable store of CRDT-merged global counter values. MergeWorker writes the merged count and the full CRDT vector (per-region components as JSON) every 10 seconds. Serves exact-count reads for analytics and billing.
Partition: counter_key
Indexes: PRIMARY KEY (counter_key)
128 partitions x 3 replicas. Eventual consistency. ~50M rows at scale.
Time-bucketed rollups computed by AggregationWorker every 60 seconds. Stores per-minute, per-hour, and per-day aggregated counts for historical analytics dashboards.
Indexes: PRIMARY KEY (counter_key, bucket_type, bucket_start), idx_bucket_range ON (bucket_type, bucket_start)
Enables efficient range queries for dashboard charts. Partitioned by bucket_type for query optimization.
| Variant | Tier | Latency | Throughput | Cost | Complexity | Reliability |
|---|---|---|---|---|---|---|
| Naive (Single PostgreSQL Row) | T1 | 50-500ms increments (lock contention) | ~2K-5K inc/sec per key | $300/month (single RDS instance) | Low — no cache, no workers, no sharding | 99% (single DB, no failover) |
| Sharded (Redis + Async Aggregation) | T2 | <10ms increments (Redis INCR) | 1M+ inc/sec per key (16 shards) | $2,500/month (Redis cluster + Kafka + DB) | Medium — shards, aggregation workers, Kafka | 99.9% (replicated Redis + DB) |
| CRDT (Multi-Region Grow-Only Counters) | T3 | <5ms local writes, <10s global convergence | 2M+ inc/sec global (region-local writes) | $8,000/month (multi-region Redis + Kafka + DB) | High — CRDT vectors, HLL dedup, cross-region merge | 99.95% (multi-region, independent replicas) |
This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.
A G-Counter (Grow-only Counter) is a vector of N integers, one per replica (region). Each region only increments its own component. The global count is the sum of all components. The merge operation takes the component-wise maximum: for each region i, merged[i] = max(local[i], remote[i]). This merge is commutative (A merge B = B merge A), associative ((A merge B) merge C = A merge (B merge C)), and idempotent (A merge A = A). These properties guarantee convergence without coordination — no matter the order or frequency of state exchange, all replicas converge to the same correct count.
Both are probabilistic data structures, but they solve different problems. A Bloom filter tests set membership ('has this user incremented this counter?') with a configurable false positive rate. HyperLogLog estimates set cardinality ('how many unique users have incremented this counter?'). For dedup, we need membership testing — HyperLogLog's PFADD happens to return a flag indicating whether the element was likely new, which serves as a membership test with ~2% error. A Bloom filter would provide lower error rates (~1%) but uses more memory per entry (~12 bytes vs HLL's amortized ~1.2 bytes for large sets). The choice depends on the accuracy-vs-memory trade-off.
Because each region operates independently, a region outage does not affect other regions. Local writes continue in all healthy regions. The offline region's counts stop incrementing but do not corrupt the global count — the CRDT vector simply stops receiving updates for that component. When the region recovers, it resumes incrementing from where it left off. MergeWorker replays any buffered Kafka events and the global count converges to include the recovered region's increments. No data is lost as long as Kafka retention covers the outage duration.
For local approximate reads: the count reflects all local increments plus the last merged global state. Local staleness is 0 (local increments are reflected immediately); cross-region staleness is Kafka replication lag (1-5 seconds) plus MergeWorker processing time (1-5 seconds), totaling 2-10 seconds. For exact reads from CounterDB: add MergeWorker DB write interval (10 seconds), totaling 12-20 seconds. For time-bucketed analytics: add AggregationWorker interval (60 seconds).
The G-Counter only supports increments. A PN-Counter (Positive-Negative Counter) maintains two G-Counters — one for increments, one for decrements — and the display count is the difference. This template models the grow-only case (views, clicks) where decrements are not needed. Extending to PN-Counters would require doubling the CRDT vector storage, adding a second dedup check for decrements, and handling the edge case where decrement state arrives before the corresponding increment (temporarily negative counts).
Raft provides strong consistency (linearizable reads) at the cost of cross-region coordination. Every write must be acknowledged by a majority of replicas, adding at least one cross-region round-trip (50-200ms) to every increment. At 2M inc/sec, this is unacceptable. CRDTs provide eventual consistency with zero coordination — every write is local (~2ms) and convergence happens asynchronously. The trade-off is staleness: cross-region counts lag by seconds instead of being immediately consistent. For view counts and likes, this staleness is invisible to users.
Sign in to join the discussion.
Ready to design your own Distributed Counter?
Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.
Open Simulator