Medium12 componentsInterview: Very High

Distributed Cache — Redis Cluster with Replication + Multi-Tier

Q: Why use both L1 in-process cache and L2 Redis Cluster?

L1 and L2 serve complementary roles. L1 (in-process LRU) provides sub-100us latency by eliminating network round-trips — but it is limited to approximately 10MB per pod (10K entries) and each pod has its own independent L1 (no sharing). L2 (Redis Cluster) provides sub-1ms latency with 78GB of shared capacity accessible by all service pods — but requires a network round-trip. Under Zipfian access patterns, the top 0.1% of keys (in L1) account for approximately 20% of traffic. Serving these in-process reduces Redis load by 20% and cuts tail latency by 10x. The remaining 80% of traffic hits L2 Redis at sub-1ms latency. Only 5% of traffic reaches the database. This two-tier model is how Meta's TAO cache and Twitter's cache infrastructure achieve their latency targets.

Q: How does automatic failover work in Redis Cluster?

Redis Cluster nodes exchange heartbeat messages (PING/PONG) every second. When a primary node fails to respond for the configured timeout (typically 5-15 seconds), the cluster marks it as PFAIL (possible failure). Once a majority of primary nodes agree on the failure (FAIL), the corresponding replica initiates election: it increments the cluster epoch, broadcasts a FAILOVER_AUTH_REQUEST to all primaries, and if it receives a majority of votes, promotes itself to primary. The new primary takes ownership of the failed primary's hash slots. CacheProxy detects the promotion via a MOVED redirect on the next request and updates its slot table. Total failover time: approximately 5 seconds under default configuration. Data loss window: only writes that reached the failed primary but had not yet replicated to the replica — typically under 1ms of writes due to asynchronous replication lag.

Q: Why Kafka for cache invalidation instead of Redis pub/sub or polling?

Three options compared: (1) Redis pub/sub is the fastest (5ms end-to-end) but fire-and-forget — if a consumer is offline during the invalidation, it misses the message permanently and serves stale data until TTL expiry (300 seconds). (2) Polling (each service queries the database for recently changed keys) creates O(services x poll_rate) database load and has inherent staleness equal to the poll interval. (3) Kafka provides durable, replayable messaging with consumer offsets. When a service pod restarts, it consumes from its last committed offset and processes all missed invalidations — no stale data, no polling overhead. Kafka end-to-end latency (50-200ms) is acceptable for cache invalidation because even a 200ms staleness window is negligible compared to the 300-second TTL. The correct choice depends on the consistency requirement: Kafka for most use cases, Redis pub/sub when sub-10ms invalidation is required and TTL provides the safety net.

Q: What is the data loss window during Redis Cluster failover?

Redis uses asynchronous replication: the primary acknowledges writes to the client before the write is replicated to the replica. If the primary fails immediately after acknowledging a write, that write is lost — it never reached the replica. Under normal conditions, replication lag is under 1ms, so the data loss window is negligible (the last ~1ms of writes). For a cache (not a primary data store), this is perfectly acceptable: lost writes simply become cache misses that are backfilled from the database on next access. If data loss must be prevented (e.g., distributed locks, rate limiter counters), Redis supports WAIT command to block until a write is replicated — at the cost of write latency equal to replication lag.

Q: How does hot key detection prevent single-shard overload?

CacheProxy maintains a Count-Min Sketch (probabilistic frequency estimator, approximately 100KB of memory) that tracks access frequency per key. Every 5 seconds, it identifies keys exceeding 1,000 ops/sec and broadcasts their identities to all AppService pods via an internal notification channel. AppService pods add these keys to their L1 in-process cache with a 5-second TTL. Once in L1, the hot key is served in-process without hitting Redis — the per-shard load from that key drops from 10,000+ ops/sec to near zero (only TTL refresh reads). This prevents the single-shard overload that occurs in V1 when a viral key maps to one shard via consistent hashing.

Production-grade distributed cache with L1 in-process LRU cache (sub-100us hits), L2 Redis Cluster with 3 primary shards and 3 replicas (automatic failover), CacheProxy for connection multiplexing, and Kafka-based invalidation for cross-service cache consistency. 500K ops/sec with 99.99% availability.

Redis ClusterMulti-Tier CacheReplicationKafkaHigh AvailabilityStorage

Try in Simulator

Problem Statement

The replicated multi-tier approach to distributed caching represents the production-grade architecture used by companies operating at massive scale — Twitter's Cache Cluster, Meta's TAO cache, Netflix's EVCache, and AWS ElastiCache for Redis in cluster mode. It solves the three remaining problems with the consistent hash sharding approach (V1): no fault tolerance (shard failure loses data), no hot key mitigation (single shard overwhelmed), and no cross-service cache consistency (stale data in other services).

The key innovation is the two-tier cache architecture. L1 is an in-process LRU cache embedded in each AppService pod. It holds the top ~10,000 keys by access frequency — approximately the top 0.1% of the keyspace. Under Zipfian access patterns (which model real-world traffic where a small number of keys receive disproportionate traffic), this 0.1% of keys accounts for approximately 20% of all cache lookups. Serving these keys in-process eliminates the network round-trip to Redis entirely, achieving sub-100 microsecond latency — 10x faster than even the fastest Redis response. The trade-off is staleness: L1 entries are evicted only when a Kafka invalidation event arrives, which takes 50-200ms after the originating database write.

The L2 tier is a 3-shard Redis Cluster with 1 replica per shard (6 Redis nodes total). Redis Cluster natively divides the keyspace into 16,384 hash slots using CRC16. Each primary shard owns approximately 5,461 slots. Unlike the V1 consistent hash approach (which uses a separate proxy to manage the ring), Redis Cluster has built-in slot routing: if a client sends a request to the wrong shard, Redis returns a MOVED redirect to the correct shard. CacheProxy handles these redirects transparently and caches the slot-to-shard mapping for direct routing.

Automatic failover is the critical availability improvement over V1. When a primary shard fails, Redis Cluster's built-in consensus protocol detects the failure via heartbeat timeout (~5 seconds) and promotes the corresponding replica to primary. No data is lost (the replica has a full copy of the primary's data, minus writes in the last replication lag window — typically under 1ms). No thundering herd occurs because the promoted replica immediately serves all requests. Compare this with V1, where shard failure loses ~33% of cached data and causes a miss storm.

Kafka-based invalidation solves the cross-service consistency problem. When Service A writes to the database and invalidates its local cache, Service B still holds the stale value in both its L1 and L2 caches. The invalidation event published to Kafka is consumed by both InvalidationWorker (which DELETEs the key from L2 Redis) and all AppService pods (which evict the key from their L1 local caches). This ensures all services see the updated value within the Kafka propagation delay (50-200ms).

CacheProxy serves as a connection multiplexer between AppService pods and Redis Cluster nodes. Without the proxy, 20 AppService pods maintaining connections to 6 Redis nodes would require 6,000+ connections. CacheProxy reduces this to approximately 1,200 by multiplexing application connections through 4 proxy pods. It also handles MOVED/ASK redirects during slot migration and implements hot key detection — monitoring per-key access frequency and broadcasting keys exceeding 1,000 ops/sec to AppService pods for L1 promotion.

Interviewers expect candidates to explain the L1/L2 trade-off (latency vs staleness), analyze the automatic failover mechanism (replica promotion, data loss window), reason about Kafka-based invalidation (why not Redis pub/sub), discuss connection multiplexing benefits, and compare hash slot routing (CRC16 mod 16384) with consistent hashing (MurmurHash3 + virtual nodes).

Architecture Overview

The replicated multi-tier architecture uses twelve components organized into five layers: traffic entry (Client, AppLB), application with L1 cache (AppService), proxy routing (CacheProxy), L2 Redis Cluster (CachePrimary_A/B/C + CacheReplica_A/B/C), data store (BackendDB), and invalidation pipeline (InvalidationStream, InvalidationWorker).

The traffic entry layer is minimal: Client sends requests to AppLB (AWS ALB), which distributes to AppService pods using round-robin at up to 1M RPS capacity.

AppService is the first caching tier. Each of the 20 pods maintains an in-process LRU cache holding approximately 10,000 entries (~10MB). On a cache read, AppService first checks L1 (sub-100us). If the key is found in L1 — which happens approximately 20% of the time under Zipfian access patterns — the value is returned immediately with no network call. L1 entries have a 5-second TTL to limit staleness and are also evicted on Kafka invalidation events. For keys not in L1, AppService sends the request to CacheProxy for L2 lookup.

CacheProxy runs on 4 pods, each with 200 threads, handling up to 600K ops/sec. It maintains the Redis Cluster slot table (queried via CLUSTER SLOTS) and routes keys to the correct primary shard based on CRC16(key) mod 16384. When a slot migrates during cluster scaling, CacheProxy handles MOVED and ASK redirects transparently — the application sees no errors. CacheProxy also implements hot key detection: it tracks per-key access frequency using a Count-Min Sketch (probabilistic data structure) and broadcasts keys exceeding 1,000 ops/sec to AppService pods for L1 promotion.

The L2 Redis Cluster consists of 6 nodes: 3 primaries (CachePrimary_A in AZ-a, CachePrimary_B in AZ-b, CachePrimary_C in AZ-c) and 3 replicas (CacheReplica_A in AZ-b, CacheReplica_B in AZ-c, CacheReplica_C in AZ-a). Each primary-replica pair is deployed in different availability zones for fault isolation. The 16,384 hash slots are distributed equally: Shard A owns slots 0-5460, Shard B owns 5461-10922, Shard C owns 10923-16383. Each shard has 26GB of memory (78GB aggregate, expandable by adding shards).

Automatic failover: when a primary fails (detected by heartbeat timeout after ~5 seconds), Redis Cluster's Raft-like consensus protocol triggers the corresponding replica to promote to primary. CacheProxy detects the promotion via a MOVED redirect on the next request and updates its slot table. The failover window is approximately 5 seconds, during which requests to the affected shard fail (AppService retries with exponential backoff). No data is lost because replicas maintain a full copy of the primary's data.

BackendDB (PostgreSQL, db.r7g.xlarge) is the source of truth. At 95% combined cache hit rate (20% L1 + 75% L2) with 500K peak RPS, BackendDB sees approximately 25K read ops/sec — handled by 2 read replicas while the primary handles writes.

The invalidation pipeline ensures cross-service consistency. When any service writes to BackendDB, it publishes a cache invalidation event to InvalidationStream (Kafka, 16 partitions). InvalidationWorker (5 Fargate workers) consumes these events and sends DELETE commands to the appropriate L2 shard. Simultaneously, all AppService pods subscribe to the invalidation topic and evict the key from their L1 local caches. End-to-end invalidation propagation: 50-200ms from database write to L1/L2 eviction across all services.

Architecture Preview

Loading architecture preview...

Open in Simulator

Request Flow — Multi-Tier Cache with Automatic Failover

This sequence diagram traces the multi-tier cache read path (L1 -> L2 -> DB) and the invalidation flow (DB write -> Kafka -> L1/L2 eviction). The critical insight is the layered latency model: L1 serves in sub-100us (no network), L2 in sub-1ms (one network hop via proxy), and DB fallback in ~20ms. The second flow shows automatic failover: when a primary fails, the replica promotes and CacheProxy reroutes transparently.

The invalidation flow demonstrates cross-service consistency: when Service A writes to the database, the invalidation event propagates via Kafka to InvalidationWorker (L2 eviction) and all AppService pods (L1 eviction), ensuring Service B sees the updated value within 50-200ms.

Loading diagram...

Step-by-Step Walkthrough

1Client sends cache read. AppService checks L1 in-process LRU. Hot keys (top 0.1%) hit L1 with sub-100us latency — no network call needed
2On L1 miss: AppService sends GET to CacheProxy. Proxy computes CRC16(key) mod 16384 to determine hash slot, routes to correct primary shard
3On L2 hit (~75%): Redis primary returns value in sub-1ms. CacheProxy returns to AppService. If key is detected as hot (>1000 ops/sec), promote to L1
4On L2 miss (~5%): AppService queries BackendDB (~12ms). Result is written to L2 via CacheProxy (backfill) and optionally promoted to L1 if hot
5Failover: if a primary crashes, the replica promotes within ~5 seconds. CacheProxy detects via MOVED redirect and updates slot table. No data loss
6Invalidation: after a DB write, an event is published to Kafka. InvalidationWorker DELETEs from L2. All AppService pods evict from L1. Propagation: 50-200ms

Pseudocode

// MULTI-TIER CACHE READ — L1 → L2 → DB
async function cacheGet(key: string): Promise<string> {
    // Tier 1: L1 in-process LRU (sub-100us)
    const l1Value = l1Cache.get(key);
    if (l1Value !== null) {
        metrics.l1Hit++;
        return l1Value;
    }

    // Tier 2: L2 Redis Cluster via CacheProxy (sub-1ms)
    const slot = crc16(key) % 16384;
    const shard = slotTable[slot];
    const l2Value = await cacheProxy.get(shard, key);
    if (l2Value !== null) {
        metrics.l2Hit++;
        // Promote to L1 if hot key detected
        if (hotKeyDetector.isHot(key)) {
            l1Cache.set(key, l2Value, { ttl: 5000 });
        }
        return l2Value;
    }

    // Tier 3: Database fallback (~12ms)
    metrics.miss++;
    const dbValue = await db.query(
        "SELECT value FROM kv_store WHERE key = $1", [key]
    );
    // Backfill L2
    await cacheProxy.set(shard, key, dbValue, 300);
    return dbValue;
}

// KAFKA INVALIDATION CONSUMER — evict from L1 and L2
async function onInvalidationEvent(event: { key: string }) {
    // Evict from L1 (all pods receive this event)
    l1Cache.delete(event.key);
    // InvalidationWorker evicts from L2 (separate consumer)
}

Data Schema (Multi-Tier Cache + Database)

The data model spans four layers: L1 in-process caches (per pod), L2 Redis Cluster (shared, replicated), PostgreSQL (source of truth), and Kafka (invalidation events). Each layer has different consistency guarantees, latency characteristics, and capacity.

The key insight is the consistency cascade: a database write triggers a Kafka invalidation event, which evicts the key from L2 Redis (via InvalidationWorker) and from all L1 caches (via AppService consumers). The staleness window is the Kafka propagation delay (50-200ms). For keys that require strong consistency, L1 is bypassed entirely.

Loading diagram...

Step-by-Step Walkthrough

1L1 in-process cache holds ~10K hot keys per pod with 5-second TTL. Independent per pod — no cross-pod sharing. Evicted by Kafka invalidation events and TTL expiry
2L2 Redis Cluster holds the full working set across 3 primary shards (16,384 hash slots). Each primary replicated to one replica in a different AZ. allkeys-lru eviction under memory pressure
3PostgreSQL kv_store is the authoritative source. Partitioned by key hash across 32 partitions. 2 read replicas serve cache-miss reads
4Kafka cache-invalidation topic carries eviction events. Consumed by InvalidationWorker (L2 DEL) and AppService pods (L1 eviction). 16 partitions for parallel consumption
5Consistency flow: DB write → Kafka publish → L2 DEL + L1 evict. Propagation: 50-200ms. Safety net: L1 TTL (5s) + L2 TTL (300s)

Pseudocode

-- PostgreSQL: Source of truth
CREATE TABLE kv_store (
    key TEXT PRIMARY KEY,
    value TEXT NOT NULL,
    updated_at TIMESTAMPTZ DEFAULT now()
) PARTITION BY HASH (key);
-- 32 partitions for balanced I/O

-- Redis Cluster: L2 distributed cache
-- 16,384 hash slots → 3 primary shards
-- Shard A: slots 0-5460     (AZ-a, replica in AZ-b)
-- Shard B: slots 5461-10922 (AZ-b, replica in AZ-c)
-- Shard C: slots 10923-16383 (AZ-c, replica in AZ-a)
-- allkeys-lru eviction, 300s TTL, async replication

-- L1 In-Process Cache: per-pod hot key cache
-- LRU with 10K max entries, 5s TTL
-- Populated by hot key detection (>1000 ops/sec)
-- Evicted by Kafka invalidation events

-- Invalidation flow (pseudocode):
-- 1. Service writes to PostgreSQL
-- 2. Service publishes {key, service_id, timestamp} to Kafka
-- 3. InvalidationWorker: DEL key from Redis Cluster
-- 4. All AppService pods: evict key from L1 LRU cache

Key Design Decisions

L1 In-Process Cache for Hot Keys

Choice

Each AppService pod maintains a local LRU cache (10K entries, 5-second TTL)

Rationale

Under Zipfian access patterns, the top 0.1% of keys generate approximately 20% of traffic. Serving these in-process eliminates the network round-trip to Redis for 20% of all cache reads, reducing p99 tail latency from ~1ms to sub-100us. The 5-second TTL limits staleness. Kafka invalidation events further reduce staleness to 50-200ms (the Kafka propagation delay). The trade-off is memory: 10K entries at 1KB each = 10MB per pod, 200MB total across 20 pods — negligible compared to the latency benefit.

Redis Cluster with Replicas (vs Independent Shards)

Choice

Redis Cluster mode with 3 primaries + 3 replicas instead of 3 independent Redis instances

Rationale

Redis Cluster provides two critical capabilities that independent shards lack: (1) automatic failover — when a primary fails, its replica promotes within ~5 seconds with no data loss, no thundering herd, and no operator intervention, and (2) built-in slot routing — MOVED/ASK redirects handle slot migration transparently during scaling. Independent shards (V1) require the proxy to detect failures and reroute traffic manually, losing ~33% of cached data in the process. The trade-off is operational complexity: Redis Cluster requires cluster management (slot migration, node addition/removal) and has constraints on multi-key operations (keys must be in the same slot).

Kafka Invalidation (vs Redis Pub/Sub)

Choice

Kafka topic for cross-service cache invalidation events

Rationale

Redis pub/sub is fire-and-forget: if an AppService pod is restarting during an invalidation, it misses the message and serves stale data indefinitely (until TTL expiry). Kafka provides durable messaging with consumer offsets — pods that restart consume from their last committed offset and process all missed invalidations. At-least-once delivery guarantees no invalidation is lost. The trade-off is latency: Kafka end-to-end is 50-200ms vs Redis pub/sub's ~5ms. For cache invalidation, this staleness window is acceptable for the vast majority of use cases.

CacheProxy for Connection Multiplexing

Choice

4 proxy pods multiplex 20 AppService pods into 1,200 Redis connections

Rationale

Without the proxy, 20 AppService pods each maintaining connection pools to 6 Redis nodes (50 connections per pool) would create 6,000 Redis connections. Redis handles connections via a single-threaded event loop — 6,000 connections add measurable overhead (memory for buffers, CPU for epoll management). CacheProxy reduces this to 4 pods x 6 nodes x 50 connections = 1,200. The proxy also centralizes hot key detection, slot table caching, and retry logic. The trade-off is one extra network hop (~0.5ms).

Hash Slot Routing (CRC16 mod 16384)

Choice

Redis Cluster native hash slots instead of consistent hashing

Rationale

Redis Cluster divides the keyspace into 16,384 fixed slots using CRC16. This is conceptually simpler than consistent hashing (no virtual nodes, no ring) and enables Redis-native features: hash tags ({tag}key forces keys to the same slot for multi-key operations), MOVED/ASK redirects for transparent slot migration, and cluster-wide SCAN across all slots. The trade-off is granularity: 16,384 slots means adding a 4th shard migrates ~4,096 slots (25%), while consistent hashing with 450 virtual nodes has finer-grained rebalancing.

Hot Key Detection via Count-Min Sketch

Choice

Probabilistic frequency tracking on CacheProxy with broadcast to AppService pods

Rationale

Tracking exact per-key frequency requires O(K) memory where K is the number of unique keys. With 10M+ keys, this is impractical. A Count-Min Sketch uses a fixed-size probabilistic data structure (~100KB) to estimate key frequency with bounded error. Keys estimated above 1,000 ops/sec are broadcast to AppService pods for L1 promotion. False positives (promoting a non-hot key to L1) waste a small amount of L1 memory. False negatives (missing a hot key) mean it stays in L2 — acceptable since L2 hit rate is already 75%.

Scale & Performance

Target RPS

500K peak (L1 absorbs 20%, L2 handles 75%, DB handles 5%)

Latency (p99)

<100us L1 hit, <1ms L2 hit, ~20ms miss (DB fallback)

Storage

78 GB L2 aggregate (3 x 26 GB shards) + 200 MB L1 aggregate

Availability

99.99% (replicated, auto-failover, multi-AZ)

Time & Space Complexity

Operation	Time	Space	Notes
L1 cache lookup (in-process LRU)	O(1) — hash map lookup	O(K) — K entries in LRU cache (typically 10,000)	Sub-100us latency. No network call. Hash map for O(1) lookup, doubly-linked list for O(1) LRU eviction. 10K entries use approximately 10MB of heap memory per pod.
L2 cache lookup (Redis Cluster via CacheProxy)	O(1) — CRC16 hash slot computation + Redis hash table GET	O(1) — single key-value pair	CRC16(key) mod 16384 computes the hash slot (~1us). CacheProxy looks up the slot-to-shard mapping (O(1) array index). Redis GET is O(1). Total: ~1ms including network hops through proxy.
Automatic failover (replica promotion)	O(N) — N primary nodes vote on failure + epoch increment	O(1) — fixed cluster metadata	Redis Cluster consensus requires majority of primaries to agree on failure. With 3 primaries, 2 must vote. Election and promotion complete in ~5 seconds. Slot table update propagates to all nodes and proxies within 1 second.
Hot key detection (Count-Min Sketch)	O(1) per key access — update d hash functions	O(w x d) — fixed-size sketch (e.g., 1024 x 5 = 5,120 counters, ~100KB)	Count-Min Sketch provides frequency estimates with bounded overcount (no undercount). False positive rate depends on sketch dimensions. At 100KB, accuracy is sufficient for detecting keys above 1,000 ops/sec threshold.

Database Schema (HLD)

L1 in-process cache (per AppService pod)

In-process LRU cache embedded in each AppService pod. Holds the top ~10,000 keys by access frequency. Populated by hot key broadcasts from CacheProxy and by organic access patterns. Evicted by Kafka invalidation events and 5-second TTL.

key STRING (cache key)value STRING (cached value)ttl INTEGER (5 seconds — short to limit staleness)access_count INTEGER (for LRU eviction ordering)

Indexes: Hash map (O(1) lookup) + doubly-linked list (LRU ordering)

Each pod has its own independent L1 — no sharing between pods. Memory footprint: ~10MB per pod (10K entries x 1KB). Under Zipfian access, L1 hit rate is approximately 20%. Staleness window: 50-200ms (Kafka invalidation propagation) or up to 5 seconds (TTL expiry), whichever is shorter.

L2 Redis Cluster (3 primaries + 3 replicas)

Three-shard Redis Cluster with 16,384 hash slots distributed equally. Each primary has one replica in a different AZ for fault isolation. allkeys-lru eviction policy under memory pressure. 300-second default TTL.

key STRING (routed by CRC16(key) mod 16384 → hash slot → shard)value STRING (cached value)TTL INTEGER (300 seconds default)

Indexes: Redis hash table per shard (O(1) GET/SET), Hash slot table (16,384 slots mapped to 3 primary shards)

Aggregate memory: 78GB (3 x 26GB). Aggregate throughput: 500K ops/sec. Automatic failover: replica promotes on primary failure (~5 seconds). Replication lag: <1ms under normal conditions. Multi-key operations require hash tags to ensure keys are in the same slot.

kv_store (PostgreSQL / BackendDB)

Source of truth for all cached data. Queried on L1+L2 cache miss. At 95% combined hit rate with 500K peak RPS, sees approximately 25K read ops/sec. Partitioned by key hash across 32 partitions.

key TEXT PKvalue TEXT NOT NULLupdated_at TIMESTAMPTZ DEFAULT now()

Indexes: PK on key (B-tree, partitioned)

Strong consistency via primary-replica setup. 2 read replicas serve cache-miss reads. Write path: primary only. Invalidation events are published to Kafka after database writes to ensure cross-service cache consistency.

cache-invalidation (Kafka topic)

Invalidation events published after database writes. Consumed by InvalidationWorker (L2 eviction) and AppService pods (L1 eviction).

key TEXT (cache key to invalidate, partition key)source_service TEXT (service that triggered the invalidation)timestamp BIGINT (event timestamp)

Indexes: Partitioned by key (16 partitions)

At-least-once delivery guarantees no invalidation is lost. Consumer deduplication by key + timestamp prevents double-eviction side effects (which are harmless anyway — evicting twice is idempotent). Retained for 24 hours for replay on consumer restart.

Event Contracts

cache_invalidationcache-invalidation

Cache invalidation events published by application services after database writes. Consumed by InvalidationWorker for L2 Redis eviction and by AppService pods for L1 in-process cache eviction.

Key Schema

key (string — the cache key to invalidate)

Value Schema

{ key: string, source_service: string, timestamp: number }

What-If Scenarios

Primary shard fails (hardware failure, OOM crash)

Impact

Redis Cluster detects failure via heartbeat timeout (~5 seconds). The corresponding replica promotes to primary. During the failover window, requests to the affected shard fail — AppService retries with exponential backoff. No data is lost (replica had a full copy). After promotion, CacheProxy updates its slot table and routes to the new primary. Total impact: ~5 seconds of elevated error rate for ~33% of keys, then full recovery.

Mitigation

Reduce heartbeat timeout from 5 seconds to 2 seconds for faster detection (trade-off: more false positives during network blips). Implement retry with jitter in AppService to avoid synchronized retry storms. Monitor replication lag continuously — if lag exceeds 10ms, alert for potential data loss on failover.

Kafka consumer lag spikes (InvalidationWorker falls behind)

Impact

Cache invalidation is delayed — L2 Redis and L1 in-process caches serve stale data for longer than the expected 50-200ms window. If lag reaches 30+ seconds, critical updates (price changes, inventory updates) are not reflected in the cache, potentially causing business logic errors (selling out-of-stock items at old prices).

Mitigation

Auto-scale InvalidationWorker pods based on consumer lag metric. Alert at lag > 5 seconds, critical at lag > 30 seconds. Increase Kafka partition count from 16 to 32 to enable more parallel consumers. Implement TTL as a safety net: even if invalidation is delayed, keys expire within 300 seconds.

L1 cache serves stale data after database write (invalidation delay)

Impact

A service writes a new value to the database and publishes an invalidation event. During the 50-200ms Kafka propagation delay, other services' L1 caches still hold the old value. For strongly consistent requirements (account balance, order status), this staleness window can cause incorrect behavior.

Mitigation

For strongly consistent keys, bypass L1 entirely and read from L2 Redis (sub-1ms). Implement per-key consistency flags: keys tagged as 'strong' skip L1, keys tagged as 'eventual' use the full L1/L2 hierarchy. The 5-second L1 TTL provides a safety net: even without Kafka invalidation, stale data is served for at most 5 seconds.

CacheProxy detects a hot key (50K ops/sec on one key)

Impact

Without hot key detection, all 50K ops/sec hit one Redis shard, saturating its CPU and increasing latency for all keys on that shard from 1ms to 10ms+. With hot key detection, CacheProxy broadcasts the key to AppService pods for L1 promotion. Within 5 seconds, all 20 pods cache the key locally, reducing Redis shard load from 50K to near zero (only periodic TTL refreshes).

Mitigation

Tune the hot key threshold: too low (100 ops/sec) promotes too many keys to L1, consuming L1 capacity; too high (10K ops/sec) allows shard overload before detection. Monitor CacheProxy's hot key broadcast rate — more than 100 keys/minute suggests the threshold is too low.

Failure Modes & Resilience

Component	Failure	Impact	Mitigation
CachePrimary (Redis Cluster primary shard)	Process crash or OOM kill	Corresponding replica promotes to primary within ~5 seconds. During failover window, requests to the affected shard's slots fail. No data loss (replica has full copy). After promotion, full recovery.	Monitor Redis memory utilization and OOM killer activity. Set maxmemory to 90% of instance memory to leave headroom. allkeys-lru eviction prevents OOM by evicting cold keys under memory pressure.
CacheProxy	All proxy pods crash simultaneously	All L2 cache operations fail — AppService cannot reach Redis Cluster. Only L1 cache (20% hit rate) continues serving. 80% of traffic falls through to BackendDB, exceeding its capacity. AppService should implement direct Redis Cluster connections as a fallback.	Run CacheProxy as a multi-AZ deployment (2 pods per AZ, minimum 4 pods). Health check interval: 3 seconds. Auto-scaling based on CPU. Implement fallback direct-to-Redis connections in AppService with embedded CRC16 routing (degraded mode without hot key detection).
InvalidationStream (Kafka)	Broker failure causing consumer lag	Invalidation events are delayed. L1 and L2 caches serve stale data beyond the expected 50-200ms window. TTL (300 seconds) provides the ultimate safety net.	MSK with 3-way replication (replication factor = 3, in-sync replicas = 2). Reduce TTL for highly mutable keys (e.g., 30 seconds for inventory counts). Monitor consumer lag and alert at >5 seconds.
AppService L1 cache	Memory pressure causes L1 eviction thrashing	L1 hit rate drops from 20% to near zero. All traffic hits L2 Redis, increasing Redis load by 25%. p99 latency increases from sub-100us to ~1ms for previously L1-cached keys. Not a critical failure — L2 absorbs the load.	Monitor per-pod L1 hit rate and eviction rate. If L1 eviction rate exceeds 10/sec, increase L1 capacity (from 10K to 20K entries) or reduce L1 TTL (from 5s to 2s) to reduce memory footprint.

Scaling Strategy

L1 scales automatically with AppService pod count — each new pod brings its own L1 cache. L2 scales by adding Redis Cluster shards: slot migration redistributes keys automatically. CacheProxy auto-scales based on CPU and request queue depth. InvalidationWorker auto-scales based on Kafka consumer lag. BackendDB scales via read replicas for cache-miss reads. The architecture scales linearly to ~2M ops/sec with 12 Redis Cluster shards (6 primary + 6 replica) without architectural changes. Beyond 2M ops/sec, L1 hit rate becomes critical — increasing L1 capacity and tuning hot key detection thresholds absorbs additional traffic before it reaches L2.

Monitoring & Alerting

Key metrics to monitor: (1) L1 hit rate per pod — should be ~20%. Drop below 10% indicates hot key detection failure or access pattern shift. (2) L2 per-shard hit rate — should be ~75%. Significant variance between shards indicates slot imbalance. (3) Replication lag per replica — should be <1ms. Lag >10ms increases data loss window on failover. (4) Invalidation consumer lag — should be <200ms. Lag >5 seconds means stale data is being served. (5) CacheProxy hot key broadcast rate — sudden spike indicates traffic pattern change. (6) Failover events — any automatic failover should trigger an alert for root cause investigation. (7) Cross-service invalidation propagation time — end-to-end from DB write to L1 eviction. Dashboard: Grafana with panels for L1/L2 hit rate time series, per-shard latency histogram, replication lag per replica, Kafka consumer lag, hot key top-10 list, failover timeline, and invalidation propagation latency. SLIs: L1 hit p99 < 200us, L2 hit p99 < 2ms, miss + backfill p99 < 30ms, combined hit rate > 93%, failover time < 10s, invalidation propagation < 500ms.

Cost Analysis

At 500K ops/sec peak: 6 Redis nodes (3 primary + 3 replica) cache.r7g.xlarge (~$1,050/month), PostgreSQL db.r7g.xlarge with 2 read replicas (~$1,050/month), ECS Fargate 20 AppService + 4 CacheProxy + 5 InvalidationWorker pods (~$1,400/month), MSK kafka.m7g.large (~$400/month), ALB (~$50/month). Total: ~$4,500/month (with replicas). Per-operation cost: $0.009/1K-ops — 7.7x cheaper than V0 and 2x cheaper than V1 at scale. The replicas double Redis cost compared to V1 ($1,050 vs $525) but provide automatic failover that eliminates the ~33% data loss on shard failure — a worthwhile trade-off for production workloads.

Security Considerations

Redis ACLs: Redis 6+ supports per-user ACLs with command-level permissions. CacheProxy connects with a dedicated user that has GET/SET/DEL permissions only (no FLUSHALL, CONFIG, DEBUG). InvalidationWorker connects with DEL-only permissions. Network isolation: all Redis nodes in private subnets, accessible only from CacheProxy and InvalidationWorker security groups. No direct access from AppService or external networks. TLS: in-transit encryption between all components (AppService -> CacheProxy -> Redis, AppService -> Kafka, InvalidationWorker -> Redis). At-rest encryption enabled on Redis via AWS KMS. Kafka topic ACLs: only authorized services can publish/consume invalidation events. Audit logging: all Redis commands logged to CloudWatch for security audit trail.

Deployment Strategy

Rolling deployment for AppService, CacheProxy, and InvalidationWorker — one pod at a time with ALB/Kafka consumer group rebalancing. Redis Cluster node upgrades use ElastiCache's online scaling: add a new replica, promote it, then remove the old node. Zero-downtime slot migration for shard additions: CLUSTER SETSLOT IMPORTING/MIGRATING with live key migration. Kafka topic partition changes performed during low-traffic windows (partition count increase is non-destructive). Canary deployment for CacheProxy: route 10% of traffic to new version, monitor L2 hit rate and hot key detection accuracy, promote to 100% after 15 minutes.

Real-World Examples

•Meta's TAO cache uses a two-tier architecture (L1 local + L2 distributed) with MySQL as the source of truth, serving billions of social graph queries per second
•Twitter's Cache Cluster uses Redis Cluster with replicas for timeline caching, handling millions of tweet lookups per second with automatic failover across data centers
•Netflix's EVCache runs on top of Memcached with zone-aware replication and an invalidation pipeline for cross-region cache consistency across AWS regions
•AWS ElastiCache for Redis in cluster mode provides the same architecture as a managed service: hash slot routing, automatic failover, and cross-AZ replication
•Pinterest's caching infrastructure uses a similar L1/L2 model with local caches on application servers and a distributed Redis cluster for shared cache state

Solution Comparison

Variant	Tier	Latency	Throughput	Cost	Complexity	Reliability
V0: Naive (Single Redis Node)	T1	<1ms hit, ~20ms miss	~10K ops/sec	$695/month	Low	99% (single node)
V1: Consistent Hash Sharding	T2	<1ms hit, ~20ms miss	100K ops/sec	$1,800/month	Medium	99.5% (3 shards, no replication)
V2: Redis Cluster + Replication + Multi-Tier	T3	<100us L1 hit, <1ms L2 hit	500K ops/sec	$4,500/month	High	99.99% (replicated, auto-failover)

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions

Why use both L1 in-process cache and L2 Redis Cluster?

L1 and L2 serve complementary roles. L1 (in-process LRU) provides sub-100us latency by eliminating network round-trips — but it is limited to approximately 10MB per pod (10K entries) and each pod has its own independent L1 (no sharing). L2 (Redis Cluster) provides sub-1ms latency with 78GB of shared capacity accessible by all service pods — but requires a network round-trip. Under Zipfian access patterns, the top 0.1% of keys (in L1) account for approximately 20% of traffic. Serving these in-process reduces Redis load by 20% and cuts tail latency by 10x. The remaining 80% of traffic hits L2 Redis at sub-1ms latency. Only 5% of traffic reaches the database. This two-tier model is how Meta's TAO cache and Twitter's cache infrastructure achieve their latency targets.

How does automatic failover work in Redis Cluster?

Redis Cluster nodes exchange heartbeat messages (PING/PONG) every second. When a primary node fails to respond for the configured timeout (typically 5-15 seconds), the cluster marks it as PFAIL (possible failure). Once a majority of primary nodes agree on the failure (FAIL), the corresponding replica initiates election: it increments the cluster epoch, broadcasts a FAILOVER_AUTH_REQUEST to all primaries, and if it receives a majority of votes, promotes itself to primary. The new primary takes ownership of the failed primary's hash slots. CacheProxy detects the promotion via a MOVED redirect on the next request and updates its slot table. Total failover time: approximately 5 seconds under default configuration. Data loss window: only writes that reached the failed primary but had not yet replicated to the replica — typically under 1ms of writes due to asynchronous replication lag.

Why Kafka for cache invalidation instead of Redis pub/sub or polling?

Three options compared: (1) Redis pub/sub is the fastest (5ms end-to-end) but fire-and-forget — if a consumer is offline during the invalidation, it misses the message permanently and serves stale data until TTL expiry (300 seconds). (2) Polling (each service queries the database for recently changed keys) creates O(services x poll_rate) database load and has inherent staleness equal to the poll interval. (3) Kafka provides durable, replayable messaging with consumer offsets. When a service pod restarts, it consumes from its last committed offset and processes all missed invalidations — no stale data, no polling overhead. Kafka end-to-end latency (50-200ms) is acceptable for cache invalidation because even a 200ms staleness window is negligible compared to the 300-second TTL. The correct choice depends on the consistency requirement: Kafka for most use cases, Redis pub/sub when sub-10ms invalidation is required and TTL provides the safety net.

What is the data loss window during Redis Cluster failover?

Redis uses asynchronous replication: the primary acknowledges writes to the client before the write is replicated to the replica. If the primary fails immediately after acknowledging a write, that write is lost — it never reached the replica. Under normal conditions, replication lag is under 1ms, so the data loss window is negligible (the last ~1ms of writes). For a cache (not a primary data store), this is perfectly acceptable: lost writes simply become cache misses that are backfilled from the database on next access. If data loss must be prevented (e.g., distributed locks, rate limiter counters), Redis supports WAIT command to block until a write is replicated — at the cost of write latency equal to replication lag.

How does hot key detection prevent single-shard overload?

CacheProxy maintains a Count-Min Sketch (probabilistic frequency estimator, approximately 100KB of memory) that tracks access frequency per key. Every 5 seconds, it identifies keys exceeding 1,000 ops/sec and broadcasts their identities to all AppService pods via an internal notification channel. AppService pods add these keys to their L1 in-process cache with a 5-second TTL. Once in L1, the hot key is served in-process without hitting Redis — the per-shard load from that key drops from 10,000+ ops/sec to near zero (only TTL refresh reads). This prevents the single-shard overload that occurs in V1 when a viral key maps to one shard via consistent hashing.

Related Templates

Distributed Cache — Naive (Single Redis Node)Distributed Cache — Consistent Hash Sharding E-Commerce Checkout — Multi-Service Saga with CQRS

Discussion

Ready to design your own Distributed Cache?

Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.

Open Simulator