Hard10 componentsInterview: Very High

Distributed Counter — Sharded with Async Aggregation

Q: Why 16 shards instead of more or fewer?

16 shards balance write distribution against read/aggregation cost. At 10M inc/sec peak, each shard handles 625K ops/sec — well within a single Redis core's ~1M ops/sec limit. More shards (e.g., 64) would provide more headroom but increase MGET cost during aggregation (64 keys per key per cycle). Fewer shards (e.g., 4) would leave each shard at 2.5M ops/sec — exceeding single-core capacity. Production systems use adaptive sharding: only hot keys (>1000 inc/sec) are sharded; cold keys use a single counter.

Industry-standard sharded counter architecture: each counter key is split into 16 Redis sub-counter shards. Writes INCR a random shard. An aggregator worker periodically sums shards into a master counter for fast approximate reads. A persist worker durably stores counts to the database.

StorageRedisKafkaShardingCounter

Try in Simulator

Problem Statement

The sharded counter architecture is the industry-standard solution to the hot-key write bottleneck that makes the naive single-row approach unworkable. It is the pattern used by YouTube for view counts, Instagram for like counts, and Twitter for engagement metrics. The core insight is that a single counter does not need to be a single physical storage location — it can be split into N independent sub-counters that are summed on read.

The hot-key problem is severe. A viral video receiving 1M views per second means 1M writes per second to a single counter key. A single Redis key can sustain ~1M operations per second on a single core, but the key may be hashed to a single slot in a Redis cluster, limiting throughput to one core. A single PostgreSQL row can sustain only ~2K-5K updates per second. The sharded counter pattern solves this by distributing writes across N sub-counter keys, each potentially on a different Redis node or hash slot.

The architecture splits each logical counter into 16 sub-counters in Redis: counter:{key}:0 through counter:{key}:15. When an increment arrives, CounterService hashes (item_id, random_shard_index) to select one of the 16 shards and performs an atomic INCR. Each shard sees at most 1/16th of the total write traffic — 62,500 ops/sec instead of 1M. This is well within the capacity of a single Redis core (~1M ops/sec). The write path is fast (~2ms Redis INCR) and scales linearly with the number of shards.

The trade-off is read complexity. The current count is the sum of all 16 shards, requiring an MGET of 16 keys. To avoid this per-read overhead, an AggregatorWorker periodically (every 5 seconds) sums all shards and writes the total to a MasterCache key. Read requests fetch this single master key — fast and simple, but stale by up to 5 seconds. For a video at 100K views/sec, the displayed count might be 500K behind reality, but '12.3M views' vs '12.8M views' is indistinguishable to users.

Durability comes from a separate PersistWorker that batches aggregated counts and writes to CounterDB (DynamoDB) every 30 seconds. At 1M increments/sec, per-increment DB writes would require 1M write IOPS — prohibitively expensive. Batching reduces this to ~33K writes/sec (1M active keys / 30s), a 30x cost reduction.

The architecture uses Kafka (AggStream) as a buffer between the hot write path and the async aggregation/persistence path. During viral events, Kafka absorbs burst traffic (5M message buffer) while workers process at a sustainable rate. CounterService publishes increment events to Kafka as fire-and-forget — the user response does not wait for aggregation or persistence.

This pattern appears in virtually every system design interview for counter-related questions. Interviewers expect candidates to identify the single-row bottleneck, propose sharding as the solution, reason about the approximate-vs-exact read trade-off, and explain why async aggregation is necessary for read performance. The CRDT variant extends this to multi-region deployment with conflict-free convergence.

Architecture Overview

The sharded counter system uses 10 components organized into a write-optimized pipeline with async aggregation and persistence. The components are: CounterClient, ApiGateway, MainLB, CounterService (write path), ReadService (read path), ShardCache (Redis sub-counters), MasterCache (Redis aggregated totals), AggStream (Kafka), AggregatorWorker, PersistWorker, and CounterDB (DynamoDB).

All traffic enters through the ApiGateway, which performs JWT authentication (~3ms), enforces a 1.2M RPS rate limit, and routes all /api/v1/ traffic to MainLB. The MainLB distributes traffic between CounterService (increments) and ReadService (count reads) using round-robin.

The write path is optimized for maximum throughput with minimum latency. CounterService receives POST /api/v1/counters/{key}/increment, hashes (item_id, random_shard_index) to select one of 16 sub-counter shards, and performs an atomic INCR on ShardCache (Redis). The write completes in ~2ms. No database is touched on the hot write path. CounterService then publishes an increment event to AggStream (Kafka) as fire-and-forget for downstream aggregation.

The aggregation pipeline runs continuously in the background. AggregatorWorker consumes from AggStream and, every 5 seconds per active key, reads all 16 sub-counter shards from ShardCache via MGET, computes the sum, and writes the aggregated total to MasterCache. This master count is the approximate value — stale by at most one aggregation interval (5 seconds). PersistWorker batches aggregated counts and writes to CounterDB every 30 seconds for durable storage.

The read path has two modes. Approximate reads (GET /api/v1/counters/{key}) fetch the pre-aggregated value from MasterCache in ~2ms — a single Redis GET. This is used by 95% of read traffic. Exact reads (GET /api/v1/counters/{key}/exact) query CounterDB for the durably persisted count (~15ms). The exact count lags behind by up to 30 seconds (PersistWorker batch interval).

ShardCache and MasterCache are separate Redis clusters, deliberately isolated. ShardCache handles 10M+ writes/sec (INCR on sub-counters) across a 12-node cluster. MasterCache handles 500K reads/sec (GET on aggregated totals) across a 6-node cluster. Separating them prevents write amplification on the shard path from interfering with read latency.

CounterDB (DynamoDB) provides durable storage with 64 partitions and 3 replicas. It is not on the hot path — PersistWorker writes batch every 30 seconds, and only 5% of reads (exact-count queries) hit it directly. Eventual consistency is acceptable since the primary read path uses MasterCache.

Architecture Preview

Loading architecture preview...

Open in Simulator

Key Design Decisions

Sharded Sub-Counters (16 Shards per Key)

Choice

Split each logical counter into 16 independent Redis keys, INCR a random shard per write

Rationale

A single Redis key receiving 10M INCR/sec saturates a single Redis core (~1M ops/sec per core). Sharding into 16 sub-counters distributes writes across multiple hash slots and potentially multiple cluster nodes. Each shard sees at most 625K ops/sec — well within single-core capacity. This is the YouTube view counter pattern.

Separate ShardCache and MasterCache

Choice

Dedicated Redis clusters for writes (shard INCR) and reads (aggregated GET)

Rationale

ShardCache handles 10M+ writes/sec while MasterCache handles 500K reads/sec. Separating them prevents write amplification from interfering with read latency. Each cluster is sized independently for its workload — 12 nodes for writes, 6 nodes for reads.

Approximate Reads via Periodic Aggregation

Choice

AggregatorWorker sums shards every 5 seconds into MasterCache

Rationale

Reading all 16 shards per request (MGET) would add 16 round-trips or one pipelined batch (~5ms) to every read. Pre-aggregating into MasterCache makes reads a single GET (~2ms). The trade-off is staleness: the count lags by up to 5 seconds. For display purposes ('12.3M views'), this is imperceptible.

Kafka Buffer for Burst Absorption

Choice

CounterService publishes increment events to Kafka (fire-and-forget) for async processing

Rationale

During viral events, increment traffic spikes 10x. Kafka absorbs the burst (5M message buffer) while workers process at a sustainable rate. Without Kafka, workers would need to scale instantly with traffic — expensive and slow to provision.

Batched Database Persistence (30-Second Intervals)

Choice

PersistWorker writes aggregated counts to DynamoDB every 30 seconds in batch

Rationale

At 1M increments/sec, per-increment DB writes would require 1M write IOPS — approximately $30K/day on DynamoDB on-demand pricing. Batching reduces writes to ~33K/sec (1M active keys / 30s), a 30x cost reduction. The trade-off is that exact counts lag by up to 30 seconds.

Scale & Performance

Target RPS

1M+ inc/sec sustained, 10M/sec peak

Latency (p99)

<5ms approximate reads, <10ms writes

Storage

~10 GB Redis + 500 GB DynamoDB/year

Availability

99.9% (replicated Redis + DynamoDB)

Time & Space Complexity

Operation	Time	Space	Notes
Increment (POST /api/v1/counters/{key}/increment)	O(1) Redis INCR on random shard + O(1) Kafka publish	O(S) per counter key where S = shard count (16)	No DB on the hot path. Total write latency: ~5ms (INCR) + ~5ms (Kafka, async).
Approximate read (GET /api/v1/counters/{key})	O(1) Redis GET from MasterCache	O(1) single key fetch	Sub-5ms latency. Stale by up to 5 seconds (aggregation interval).
Exact read (GET /api/v1/counters/{key}/exact)	O(1) DynamoDB GetItem by partition key	O(1) single row fetch	~15ms latency. Stale by up to 30 seconds (persist batch interval).
Aggregation (AggregatorWorker cycle)	O(S) MGET per active key, S = 16 shards	O(K) where K = active keys processed per cycle	Runs every 5 seconds. 10M active keys x 16 shards = 160M Redis reads per cycle.

Database Schema (HLD)

counter:{key}:{shard_id} (ShardCache / Redis)

Sharded sub-counters for a single logical counter key, split across 16 shards to distribute write load. CounterService INCRs a random shard on each increment; AggregatorWorker MGETs all 16 shards every 5 seconds to compute the sum. Near 100% hit rate for active keys since writes create the entries.

count INTEGER (shard counter value, INCR atomically)

10M active keys x 16 shards x 64 bytes = ~10GB working set. 24h TTL evicts cold counters.

counter:{key}:master (MasterCache / Redis)

Aggregated counter totals for fast approximate reads. AggregatorWorker writes the summed value from all 16 shards every 5 seconds. ReadService serves 95% of read traffic from this cache at sub-5ms latency.

count INTEGER (aggregated counter value)aggregated_at STRING (last aggregation timestamp)

10M active keys x 64 bytes = ~640MB working set. 24h TTL.

counters (DynamoDB)

Durable store of aggregated counter values written by PersistWorker every 30 seconds. Partition key is counter_key. Serves exact-count reads for analytics and billing where precision matters.

counter_key VARCHAR (partition key)count BIGINT (aggregated count)updated_at TIMESTAMPTZ (last persist timestamp)

Partition: counter_key

64 partitions x 3 replicas. Eventual consistency. ~33K writes/sec at peak.

Solution Comparison

Variant	Tier	Latency	Throughput	Cost	Complexity	Reliability
Naive (Single PostgreSQL Row)	T1	50-500ms increments (lock contention)	~2K-5K inc/sec per key	$300/month (single RDS instance)	Low — no cache, no workers, no sharding	99% (single DB, no failover)
Sharded (Redis + Async Aggregation)	T2	<10ms increments (Redis INCR)	1M+ inc/sec per key (16 shards)	$2,500/month (Redis cluster + Kafka + DB)	Medium — shards, aggregation workers, Kafka	99.9% (replicated Redis + DB)
CRDT (Multi-Region Grow-Only Counters)	T3	<5ms local writes, <10s global convergence	2M+ inc/sec global (region-local writes)	$8,000/month (multi-region Redis + Kafka + DB)	High — CRDT vectors, HLL dedup, cross-region merge	99.95% (multi-region, independent replicas)

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions

Why 16 shards instead of more or fewer?

16 shards balance write distribution against read/aggregation cost. At 10M inc/sec peak, each shard handles 625K ops/sec — well within a single Redis core's ~1M ops/sec limit. More shards (e.g., 64) would provide more headroom but increase MGET cost during aggregation (64 keys per key per cycle). Fewer shards (e.g., 4) would leave each shard at 2.5M ops/sec — exceeding single-core capacity. Production systems use adaptive sharding: only hot keys (>1000 inc/sec) are sharded; cold keys use a single counter.

What happens if AggregatorWorker falls behind?

If AggregatorWorker cannot keep up with the event stream, Kafka consumer lag increases. The master count in MasterCache becomes progressively staler — instead of 5 seconds, it might be 30 seconds or more behind. Read clients still get a count (the last aggregated value), but it is less fresh. Kafka's retention buffer (5M messages) absorbs temporary bursts. If lag persists, additional AggregatorWorker instances can be spun up to process in parallel.

How does the Bloom filter dedup work for likes?

For likes (one per user per item), CounterService checks a Bloom filter in ShardCache before incrementing. The Bloom filter uses ~12 bytes per entry; for 1B user-item pairs, that is ~12GB — distributed across the shard cluster. PFADD returns whether the element is likely new. False positives (~1% rate) occasionally reject a real like, but never double-count. The CRDT variant uses HyperLogLog for similar dedup with ~2% error rate.

Why not use Redis Cluster's built-in replication instead of separate caches?

Redis Cluster replication creates read replicas of the same data, but ShardCache and MasterCache store different data with different access patterns. ShardCache holds 160B sub-counter keys (10B counters x 16 shards) optimized for INCR. MasterCache holds 10B aggregated keys optimized for GET. Separating them allows independent sizing, eviction policies, and failure domains.

What is the maximum count staleness a user sees?

For approximate reads: up to 5 seconds (AggregatorWorker aggregation interval). For exact reads: up to 30 seconds (PersistWorker batch interval). In practice, approximate reads are sufficient for 95% of use cases. For a video at 100K views/sec, 5 seconds of staleness means the count is ~500K behind — showing '12.3M views' instead of '12.8M views'. This is imperceptible to users and acceptable for all display purposes.

Related Templates

Distributed Counter — Naive (Single PostgreSQL Row)Distributed Counter — CRDT (Multi-Region Grow-Only Counters)

Discussion

Ready to design your own Distributed Counter?

Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.

Open Simulator