Hard · 10 components · Interview: High

Real-Time Chat — Multi-Region WebSocket Mesh + CRDT Receipts

Global-scale messaging with per-region WebSocket gateways, CRDT-based delivery receipts, region-pinned Kafka partitions, and cross-region Cassandra replication. Sub-200ms delivery within a region, 99.99% availability via multi-region failover.

Multi-Region · CRDT · WebSocket · Global Scale
Problem Statement

The multi-region chat architecture exists because of a fundamental tension between two constraints: (1) the speed of light puts a floor of roughly 50-150ms on cross-region latency, and (2) users expect sub-200ms message delivery regardless of where they are on the planet. A single-region WebSocket architecture (the Kafka Fan-Out variant) delivers messages in under 100ms for users near the data center, but a user in Tokyo connecting to a gateway in Virginia experiences 150ms+ round-trip time on every WebSocket frame. For a global messaging platform with users on every continent, this is unacceptable.

The solution is to place WebSocket gateways in every major region and route messages between regions only when necessary. When two users in the same region exchange messages, the message never leaves the region — delivery is sub-50ms. When users in different regions communicate, the message is routed via a cross-region replication stream, adding 50-150ms of inter-region latency. The key architectural challenge is maintaining global consistency: which user is connected to which region (session routing), message ordering across regions (Kafka partitioning), and delivery receipt convergence (CRDT-based state).

CRDT (Conflict-free Replicated Data Type) based delivery receipts are the most technically sophisticated element of this architecture. Each message has a per-recipient receipt state vector with three fields: sent_at, delivered_at, read_at. When a recipient's device acknowledges delivery, their local Presence Service updates the receipt CRDT. These CRDTs replicate across regions and merge conflict-free — last-writer-wins per field with Lamport timestamps. There is no coordination protocol, no distributed lock, no consensus algorithm. The CRDT merge function simply takes the maximum timestamp per field, producing a consistent state regardless of the order events arrive. This is how WhatsApp handles read receipts across their global infrastructure.

The multi-region architecture also targets 99.99% availability through regional failover. If an entire AWS region goes down (rare but possible; us-east-1 has had multi-hour outages), DNS failover redirects new connections to the nearest healthy region within 30-60 seconds, and a well-implemented client completes its reconnect in 3-5 seconds once the healthy region is resolvable. Messages sent during the failover are buffered in Kafka (which has cross-region replication) and delivered once the user reconnects, so no messages are lost.

This architecture is used by WhatsApp (2B+ users, multi-region Erlang clusters), Telegram (700M+ users, custom MTProto protocol), Discord (150M+ monthly users, Elixir-based gateway fleet), and Slack (enterprise messaging with regional deployment). It appears in senior/staff-level system design interviews where candidates are expected to reason about cross-region consistency, CRDT semantics, and the operational complexity of managing replicated infrastructure.

The primary trade-off is cost and operational complexity. Multi-region deployment replicates the entire stack per region — WS Gateway fleet, Kafka cluster, Cassandra ring, Redis cluster. At 3 regions, infrastructure cost is approximately 2.5x a single-region deployment. The engineering team must understand CRDT merge semantics, cross-region Kafka replication, Cassandra multi-DC consistency levels, and global DNS failover. This complexity is justified only at the scale where user experience and availability requirements demand multi-region presence.

Architecture Overview

The multi-region chat system connects the Client to 10 backend components organized into a global mesh: Regional Load Balancer, Regional WebSocket Gateway, Global Session Router, Message Service, Presence Service (CRDT-based), Session Cache (Redis Cluster), Regional Kafka (Message Events), Cassandra (Message DB), Media Store (S3), and Notification Service.

Clients connect via WebSocket to the nearest Regional WS Gateway, determined by DNS-based geographic routing (AWS Route 53 latency-based routing). The Regional Load Balancer handles WebSocket upgrade and sticky sessions. Once connected, the WS Gateway registers the user's session in the Global Session Router, which maintains a distributed map of user_id to {region, gateway_pod_id} in the Session Cache (Redis Cluster). This map is the key to cross-region routing: when a message is sent, the Session Router tells the sending gateway which region the recipient is in.

The message send path flows from the WS Gateway to the Message Service, which performs three operations: (1) persist to Cassandra (partitioned by conversation_id with LOCAL_QUORUM write consistency for single-region durability), (2) publish to regional Kafka (message_sent topic, partitioned by conversation_id for ordering), and (3) return ACK to the sender. For intra-region delivery (sender and recipient in the same region), the WS Gateway pushes directly to the recipient's WebSocket — sub-50ms end-to-end. For cross-region delivery, the WS Gateway publishes to the cross_region_sync Kafka topic, which replicates to the destination region's Kafka cluster. The destination region's WS Gateway delivers to the recipient. Cross-region adds 50-150ms.

CRDT-based delivery receipts are managed by the Presence Service. Each message has a receipt state per recipient: {sent_at, delivered_at, read_at}. When a recipient acknowledges delivery (their WS client sends an ACK frame), the local Presence Service updates the CRDT in the Session Cache. CRDTs replicate across regions via Kafka (delivery_receipt topic). The merge function is last-writer-wins per field with Lamport timestamps: merge(a, b) = {sent_at: max(a.sent_at, b.sent_at), delivered_at: max(a.delivered_at, b.delivered_at), read_at: max(a.read_at, b.read_at)}. This guarantees convergence without coordination.
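The pseudocode later in this write-up calls a next_lamport() helper without defining it. A minimal sketch of a Lamport clock follows, assuming one in-process counter per service instance. The class name and the observe method are illustrative and are not part of the original design.

```python
import threading


class LamportClock:
    """Logical clock: a counter that only moves forward."""

    def __init__(self) -> None:
        self._time = 0
        self._lock = threading.Lock()

    def next_lamport(self) -> int:
        # Local event (for example a delivery ACK): advance and return.
        with self._lock:
            self._time += 1
            return self._time

    def observe(self, remote_time: int) -> int:
        # Remote event received: move past whichever clock is ahead.
        with self._lock:
            self._time = max(self._time, remote_time) + 1
            return self._time


clock = LamportClock()
first = clock.next_lamport()      # local event
after_remote = clock.observe(10)  # a remote region was already at 10
later = clock.next_lamport()      # local event after the remote one
```

Calling observe on every consumed receipt event is what keeps a region's later local timestamps larger than anything it has already seen from other regions.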

Presence tracking uses CRDT G-Counters. Each region maintains a local count of connected users. The global online count for a user is the sum of all regional counters (typically 0 or 1, but >1 if the user has multiple devices in different regions). Heartbeats arrive every 30 seconds via WebSocket frames. If missed, the WS Gateway emits a disconnect event after 60 seconds. Presence is eventually consistent across regions — during a network partition, a user may appear online in their region but offline in others.
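A strictly grow-only counter (G-Counter) cannot be decremented, so tracking disconnects needs a little more structure. One common option is a PN-Counter, which pairs two grow-only maps (connects and disconnects) and merges each region's slot with max. The sketch below assumes that approach; the class and method names are hypothetical, and the TTL-based expiry described above is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class PresenceCounter:
    """PN-Counter: two grow-only maps keyed by region."""
    connects: dict = field(default_factory=dict)
    disconnects: dict = field(default_factory=dict)

    def connect(self, region: str) -> None:
        self.connects[region] = self.connects.get(region, 0) + 1

    def disconnect(self, region: str) -> None:
        self.disconnects[region] = self.disconnects.get(region, 0) + 1

    def merge(self, other: "PresenceCounter") -> None:
        # Each region slot only grows, so a per-slot max is a safe merge.
        for mine, theirs in ((self.connects, other.connects),
                             (self.disconnects, other.disconnects)):
            for region, count in theirs.items():
                mine[region] = max(mine.get(region, 0), count)

    def online_devices(self) -> int:
        return sum(self.connects.values()) - sum(self.disconnects.values())


us = PresenceCounter()
eu = PresenceCounter()
us.connect("us-east-1")      # phone connects in the US
eu.connect("eu-west-1")      # laptop connects in the EU
us.merge(eu)
eu.merge(us)
eu.disconnect("eu-west-1")   # laptop goes offline
us.merge(eu)
is_online = us.online_devices() > 0
```

After the final merge both replicas agree that one device is still connected, regardless of the order in which the merges ran.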

Media messages (images, files) are stored in S3 with CloudFront CDN for edge delivery. Clients upload directly to S3 using pre-signed URLs generated by the Message Service, keeping the media upload path separate from the message delivery path. The message record contains only a media_url reference.

The Notification Service handles offline delivery. When the Session Router indicates no active session for a recipient, the message is routed to the Notification Service, which sends push notifications via FCM/APNs. When the user reconnects, unread messages are synced from Cassandra.
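The three delivery paths described above (same region, other region, offline) reduce to one lookup in the session map. The sketch below uses an in-memory dict as a stand-in for the Redis session cache; the route function and its return labels are illustrative names only.

```python
from typing import Optional

THIS_REGION = "us-east-1"

# Stand-in for the Redis session map: user_id -> {region, gateway_pod_id}.
sessions = {
    "alice": {"region": "us-east-1", "gateway_pod_id": "gw-7"},
    "bob": {"region": "eu-west-1", "gateway_pod_id": "gw-2"},
}


def route(recipient_id: str) -> str:
    session: Optional[dict] = sessions.get(recipient_id)
    if session is None:
        return "push_notification"   # offline: Notification Service
    if session["region"] == THIS_REGION:
        return "local_websocket"     # same region: direct push
    return "cross_region_sync"       # other region: Kafka sync topic


decisions = {user: route(user) for user in ("alice", "bob", "carol")}
```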

Regional failover: if a region goes down, DNS failover (Route 53 health checks) routes new connections to the nearest healthy region within 30-60 seconds. Existing users on the failed region must reconnect — client-side reconnect with exponential backoff targets the nearest healthy region via the updated DNS. Messages sent to users in the failed region are buffered in Kafka (cross-region replication ensures the destination region's Kafka has the data). When users reconnect to the new region, they register a new session and receive buffered messages. Total reconnect time: 3-5 seconds for well-implemented clients.

Cross-Region Message Delivery + CRDT Receipt Flow

This diagram traces a message from a sender in Region A to a recipient in Region B, demonstrating the cross-region routing and CRDT receipt convergence. The key insight is the separation of message delivery (Kafka cross-region sync, 50-150ms latency) from receipt acknowledgment (CRDT merge, eventually consistent). The sender in Region A sees the message marked as 'sent' immediately. When Region B delivers the message, the recipient's device sends a delivery ACK, which the local Presence Service writes as a CRDT update. This update replicates back to Region A via the delivery_receipt Kafka topic, where the sender sees the 'delivered' checkmark appear — total 200-500ms from send to delivery receipt visibility.

The CRDT merge is the critical correctness guarantee: even if receipt updates from multiple regions arrive out of order, the merge function (max per field) always produces the correct final state. There is no coordination, no distributed lock, no race condition.
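The order-independence claim can be checked directly. The sketch below applies the same three receipt updates in every possible order and collects the distinct final states; Receipt and apply_all are illustrative names, and the timestamp values are made up for the example.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Receipt:
    sent_at: int = 0
    delivered_at: int = 0
    read_at: int = 0


def merge(a: Receipt, b: Receipt) -> Receipt:
    # Per-field max: commutative, associative, idempotent.
    return Receipt(
        max(a.sent_at, b.sent_at),
        max(a.delivered_at, b.delivered_at),
        max(a.read_at, b.read_at),
    )


def apply_all(updates) -> Receipt:
    state = Receipt()
    for update in updates:
        state = merge(state, update)
    return state


updates = [Receipt(sent_at=1), Receipt(delivered_at=5), Receipt(read_at=9)]
results = {apply_all(order) for order in permutations(updates)}
# Replaying an update a second time changes nothing (idempotence).
replayed = apply_all(updates + [updates[1]])
```

All six orderings collapse to a single state, and the replayed delivery update leaves that state unchanged.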

Step-by-Step Walkthrough

  1. Sender in Region A transmits a message via their WebSocket connection to the local WS Gateway
  2. WS Gateway forwards to Message Service, which persists to Cassandra (LOCAL_QUORUM, 2 of 3 local replicas) and publishes to regional Kafka
  3. Message Service ACKs to WS Gateway, which pushes a 'sent' checkmark to the sender's WebSocket (total ~50ms)
  4. Session Router lookup reveals the recipient is in Region B. WS Gateway publishes to the cross_region_sync Kafka topic
  5. Cross-region Kafka replication delivers the message to Region B's Kafka cluster (50-150ms network latency)
  6. Region B's WS Gateway consumes the message and pushes it to the recipient's WebSocket connection
  7. Recipient's device sends a delivery ACK frame to their WS Gateway
  8. WS Gateway routes the ACK to the Presence Service, which updates the CRDT receipt state in Redis: {delivered_at: Lamport_T}
  9. Presence Service publishes the CRDT update to the delivery_receipt Kafka topic, which replicates back to Region A
  10. Region A's WS Gateway receives the receipt update and pushes a 'delivered' checkmark to the sender (total 200-500ms from initial send)

Pseudocode

// CROSS-REGION MESSAGE DELIVERY
async function sendMessage(sender_id, recipient_id, conversation_id, content):
    message = { id: uuid(), conversation_id, sender_id, content, source_region: THIS_REGION }

    // Persist + publish (parallel). The consistency level is set on the request
    // by the driver; CQL itself has no USING CONSISTENCY clause.
    await Promise.all([
        cassandra.execute("INSERT INTO messages ...", message, consistency=LOCAL_QUORUM),
        kafka.produce("chat.message_sent", key=conversation_id, value=message)
    ])

    // Determine recipient region
    session = await redis.get("session:" + recipient_id)
    if session == null:
        // Offline: hand off to the Notification Service (FCM/APNs push).
        // notificationService.notify is a placeholder name.
        await notificationService.notify(recipient_id, message)
    else if session.region == THIS_REGION:
        // Intra-region: push directly via WS Gateway
        await wsGateway.deliver(session.gateway_pod, recipient_id, message)
    else:
        // Cross-region: publish to sync topic
        await kafka.produce("chat.cross_region_sync", key=conversation_id, value={
            ...message,
            destination_region: session.region,
            lamport_timestamp: next_lamport()
        })
// CRDT DELIVERY RECEIPT
class ReceiptCRDT:
    sent_at: LamportTimestamp = 0
    delivered_at: LamportTimestamp = 0
    read_at: LamportTimestamp = 0

    function merge(other: ReceiptCRDT):
        // Last-writer-wins per field — commutative, associative, idempotent
        this.sent_at = max(this.sent_at, other.sent_at)
        this.delivered_at = max(this.delivered_at, other.delivered_at)
        this.read_at = max(this.read_at, other.read_at)
        // Guaranteed convergence regardless of merge order

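async function onDeliveryAck(message_id, recipient_id):
    key = "receipt:" + message_id + ":" + recipient_id
    receipt = await redis.hgetall(key)
    // Keep the max so a replayed ACK can never move the field backwards
    receipt.delivered_at = max(receipt.delivered_at || 0, next_lamport())
    // Write only the field that changed, leaving sent_at and read_at untouched
    await redis.hset(key, "delivered_at", receipt.delivered_at)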

    // Replicate CRDT update to all regions
    await kafka.produce("chat.delivery_receipt", key=message_id, value={
        message_id, recipient_id,
        delivered_at: receipt.delivered_at,
        source_region: THIS_REGION
    })

// CRDT MERGE ON RECEIPT (consuming from remote region)
async function onRemoteReceiptUpdate(update):
    local = await redis.hgetall("receipt:" + update.message_id + ":" + update.recipient_id)
    // Merge: max per field
    merged = {
        sent_at: Math.max(local.sent_at || 0, update.sent_at || 0),
        delivered_at: Math.max(local.delivered_at || 0, update.delivered_at || 0),
        read_at: Math.max(local.read_at || 0, update.read_at || 0)
    }
    await redis.hset("receipt:" + update.message_id + ":" + update.recipient_id, merged)
    // Notify sender if they are in this region
    await notifySenderOfReceipt(update.message_id, merged)
Multi-Region Data Model

The multi-region data model distributes state across three storage systems optimized for different access patterns and consistency requirements. Cassandra provides durable, cross-region-replicated message storage with tunable consistency per operation. Redis Cluster stores ephemeral session routing data, CRDT presence counters, and CRDT delivery receipt state — all of which are high-frequency, low-latency, and tolerate eventual consistency. Kafka provides ordered event streaming for cross-region message replication and receipt propagation.

The CRDT structures in Redis are the key innovation: receipt state and presence counters can be independently updated in any region and merged conflict-free. The merge function is encoded in the Presence Service — Redis itself just stores the data. This means Redis can be a simple key-value store; the CRDT semantics are in the application layer.


Step-by-Step Walkthrough

  1. Cassandra messages: cross-region replicated with LOCAL_QUORUM writes and LOCAL_ONE reads. Each region can serve reads locally without cross-region calls. Replication lag: 50-100ms
  2. Cassandra conversations: includes region_hint for routing optimization. Participants stored as a denormalized SET for fast membership checks
  3. Redis sessions: global routing map (user -> region + pod). TTL-based with 60-second refresh via heartbeats. Eventually consistent: stale entries cause a redirect, not message loss
  4. Redis CRDT receipts: per-message, per-recipient state with Lamport timestamps. Merge function is max() per field, implemented in the Presence Service application code. Convergence guaranteed without coordination
  5. Redis CRDT presence: one counter per user per region. Global online status = sum of all regional counters. Counters are incremented on connect, decremented on disconnect, and expire via TTL. Because a strict G-Counter is grow-only, supporting the decrement means each region writes only its own counter, or connects and disconnects are tracked as two grow-only counters (a PN-Counter)
  6. Kafka topics (not shown in ER): message_sent, cross_region_sync, and delivery_receipt provide the replication backbone connecting all regions. Events are the mechanism by which CRDT state propagates across regions

Pseudocode

-- Cassandra: Cross-region message storage
-- Write path: LOCAL_QUORUM (fast, durable within region)
-- Consistency is set per request by the driver (or with the cqlsh command
-- CONSISTENCY LOCAL_QUORUM); CQL has no USING CONSISTENCY clause.
INSERT INTO messages (conversation_id, timestamp, message_id, sender_id, content, source_region)
VALUES (?, now(), ?, ?, ?, 'us-east-1');
-- Replication to other regions: async, 50-100ms lag

-- Read path: LOCAL_ONE (fast, eventual consistency), again set per request
SELECT * FROM messages
WHERE conversation_id = ?
ORDER BY timestamp DESC
LIMIT 50;
-- Serves from local replica, no cross-region call

-- Redis: CRDT delivery receipt state
-- Write (local region)
HSET receipt:msg123:user456
    sent_at 1705305600000
    delivered_at 1705305600500
-- Publish to Kafka for cross-region propagation

-- Merge (on receiving remote update)
local = HGETALL receipt:msg123:user456
merged_delivered = MAX(local.delivered_at, remote.delivered_at)
merged_read = MAX(local.read_at, remote.read_at)
HSET receipt:msg123:user456
    delivered_at merged_delivered
    read_at merged_read
-- Convergence: max() is commutative, associative, idempotent

-- Redis: CRDT G-Counter presence
-- Connect event (increment local region counter)
INCR presence:user456:us-east-1
EXPIRE presence:user456:us-east-1 60

-- Global presence check
regions = ["us-east-1", "eu-west-1", "ap-northeast-1"]
total = SUM(GET presence:user456:{region} for region in regions)
is_online = total > 0
Key Design Decisions
Per-Region WebSocket Gateways

Choice

WS Gateway fleet deployed in each region with DNS-based routing

Rationale

WebSocket connections are latency-sensitive — every frame includes the network round-trip time. A user in Tokyo connecting to Virginia adds 150ms+ RTT to every message. Per-region gateways ensure users connect to the nearest region (sub-20ms RTT), providing sub-50ms intra-region delivery. The cost is infrastructure replication (one gateway fleet per region) and the complexity of cross-region message routing. This is the same approach WhatsApp and Telegram use.

CRDT Delivery Receipts

Choice

Last-writer-wins CRDT per receipt field with Lamport timestamps

Rationale

Delivery receipts (sent/delivered/read) must converge to the same state regardless of the order events arrive from different regions. Traditional UPDATE statements create race conditions: Region A marks 'delivered', Region B marks 'read', but if Region A's update arrives last and overwrites the whole row, the 'read' state is lost. CRDTs prevent this: the merge takes the max of each field independently (delivered_at with delivered_at, read_at with read_at), so neither update can erase the other. No distributed locks, no consensus protocol, no coordination. The trade-off is eventual consistency: receipts may take up to 500ms to converge across regions.

Region-Pinned Kafka Partitions

Choice

Kafka topics partitioned by conversation_id, pinned to the conversation's home region

Rationale

Kafka guarantees ordering within a partition. By partitioning by conversation_id and pinning to the region where the conversation was created, all messages in a conversation flow through the same regional Kafka cluster. This provides strict per-conversation ordering without cross-region coordination. Cross-region messages use a separate replication topic that preserves source ordering. Without pinning, messages from different regions could interleave on different partitions, breaking conversation ordering.
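The ordering guarantee rests on every message of a conversation mapping to the same partition. The sketch below shows the idea with a stable hash reduced modulo the partition count. Kafka's Java client hashes keys with murmur2; zlib.crc32 is used here only as a stand-in, and the conversation ID is a made-up value.

```python
import zlib

NUM_PARTITIONS = 48  # the per-region count quoted for chat.message_sent


def partition_for(conversation_id: str) -> int:
    # Stable hash of the key, reduced modulo the partition count.
    return zlib.crc32(conversation_id.encode("utf-8")) % NUM_PARTITIONS


conv = "3f2b8c1e-example-conversation"
partitions = {partition_for(conv) for _ in range(1000)}
```

Because the mapping depends on the partition count, adding partitions later changes where existing keys land, which is worth planning for before relying on per-conversation ordering.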

Cassandra with LOCAL_QUORUM Consistency

Choice

Cassandra multi-DC with LOCAL_QUORUM writes, LOCAL_ONE reads

Rationale

LOCAL_QUORUM ensures writes are durable within the local region (2 of 3 replicas acknowledge) without waiting for cross-region replication (which adds 50-100ms). Cross-region replicas are updated asynchronously. LOCAL_ONE reads accept data from any single local replica, providing fast reads with eventual consistency — acceptable for chat history (a few milliseconds of staleness is invisible to users). This combination gives sub-10ms write latency and sub-5ms read latency within a region.
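The "2 of 3 replicas" figure follows from Cassandra's quorum formula, floor(RF / 2) + 1, applied to the local datacenter's replication factor. A small sketch, assuming a local replication factor of 3 as the text implies:

```python
def quorum(replication_factor: int) -> int:
    # Cassandra quorum: floor(RF / 2) + 1
    return replication_factor // 2 + 1


local_rf = 3
acks_needed = quorum(local_rf)
tolerated_failures = local_rf - acks_needed
```

With these values a LOCAL_QUORUM write still succeeds when one local replica is down.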

Global Session Router (Not Broadcast)

Choice

Targeted routing via user->region map instead of broadcasting to all regions

Rationale

With N regions, broadcasting every message to all regions wastes N-1 cross-region hops for the majority case where sender and recipient are in the same region. The Session Router maintains a user_id -> region map, enabling targeted routing: intra-region messages skip the network, cross-region messages go to exactly one destination. The map is eventually consistent (Redis TTL) — stale entries cause a redirect hop, not message loss.

S3 + CDN for Media Messages

Choice

Direct S3 upload with CloudFront CDN for delivery

Rationale

Media messages (images: 200KB avg, videos: 5MB avg) are 100-1000x larger than text messages. Storing them inline in Cassandra would bloat the message store and degrade text message read performance. S3 provides infinite capacity at $0.023/GB/month. CloudFront CDN caches media at edge locations worldwide, delivering images in sub-50ms regardless of user location. The message record stores only a media_url reference, keeping the message path lightweight.

Scale & Performance

Target RPS

500K+ messages/sec (multi-region combined)

Latency (p99)

<50ms intra-region, <500ms cross-region

Storage

~10 TB/year (messages + media)

Availability

99.99% (multi-region failover)

Time & Space Complexity
| Operation | Time | Space | Notes |
|---|---|---|---|
| Intra-region message delivery | O(1) Cassandra write + O(1) Kafka publish + O(1) WebSocket push | O(1) per message | Sub-50ms end-to-end. All operations are single-region, no cross-region calls. |
| Cross-region message delivery | O(1) per hop + 50-150ms network latency | O(1) per message | Same as intra-region + one cross-region Kafka replication hop. Bounded by speed of light. |
| Group fan-out (N members, R regions) | O(N) total pushes, O(N/R) per region on average | O(N) recipient list | Fan-out distributed across regions. 500-member group with 3 regions: ~167 pushes per region. |
| CRDT receipt merge | O(1) per receipt field (max comparison) | O(1) per receipt (3 Lamport timestamps) | Merge is commutative, associative, idempotent. No coordination needed. |
| Session lookup (user -> region) | O(1) Redis GET | O(1) per user session | Sub-1ms. Stale result causes a redirect hop (50-150ms), not message loss. |
Database Schema (HLD)
messages (Cassandra, cross-region replicated)

Wide-column message store replicated across regions. Partition key: conversation_id. Clustering key: timestamp DESC. Write: LOCAL_QUORUM (2 of 3 local replicas). Read: LOCAL_ONE (fast, eventual consistency). Cross-region replication is asynchronous with 50-100ms lag. Each region can serve reads from its local replicas without cross-region calls.

conversation_id UUID (partition key)
timestamp TIMEUUID (clustering key DESC)
message_id UUID
sender_id UUID
content TEXT
message_type VARCHAR
media_url TEXT (nullable)
source_region VARCHAR
created_at TIMESTAMP

Indexes: Primary: (conversation_id, timestamp DESC)

Cross-region replication lag: 50-100ms. During region failure, reads from surviving regions may miss the last 50-100ms of messages — these are reconciled on the next read after replication catches up.

conversations (Cassandra)

Conversation metadata with region_hint for routing optimization. The region_hint indicates where the conversation was created and where Kafka partitions for this conversation are pinned.

conversation_id UUID PK
type VARCHAR (direct/group)
participants SET<UUID>
region_hint VARCHAR
last_message_at TIMESTAMP
created_at TIMESTAMP

region_hint is advisory — the Session Router uses it to optimize routing but falls back to broadcast if the hint is stale.

read_receipts (CRDT state in Redis)

CRDT delivery receipt state per message per recipient. Stored as Redis hashes with Lamport timestamp fields. Merge function: max() per field. Replicated across regions via the delivery_receipt Kafka topic. No database persistence needed — receipts are ephemeral (relevant only while the conversation is active).

key: receipt:{message_id}:{recipient_id}
sent_at: Lamport timestamp
delivered_at: Lamport timestamp
read_at: Lamport timestamp
lamport_clock: monotonic counter

CRDT merge: max(a.field, b.field) per field. Convergence guaranteed without coordination. TTL: 7 days (receipts older than 7 days are irrelevant).

user_sessions (Redis Cluster)

Global session routing map: user_id to {region, gateway_pod_id, connected_at}. Updated on WebSocket connect/disconnect. TTL: 60 seconds, refreshed by heartbeats. Eventually consistent across regions — stale entries cause a redirect hop (50-150ms) but not message loss.

key: session:{user_id}
region: VARCHAR
gateway_pod_id: VARCHAR
connected_at: TIMESTAMP
TTL: 60 seconds

500K+ keys at peak. Redis Cluster with 6 shards handles 500K+ ops/sec.
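The TTL-plus-heartbeat behaviour of the session map can be sketched with an in-memory stand-in for Redis. The SessionCache class below is hypothetical, and time is passed in explicitly as plain seconds so the example stays deterministic.

```python
class SessionCache:
    """In-memory stand-in for Redis keys with a TTL."""

    TTL_SECONDS = 60

    def __init__(self) -> None:
        self._entries = {}  # user_id -> (session, expires_at)

    def register(self, user_id, region, pod, now):
        session = {"region": region, "gateway_pod_id": pod, "connected_at": now}
        self._entries[user_id] = (session, now + self.TTL_SECONDS)

    def get(self, user_id, now):
        entry = self._entries.get(user_id)
        if entry is None or now >= entry[1]:
            self._entries.pop(user_id, None)  # expired entries are dropped
            return None
        return entry[0]

    def heartbeat(self, user_id, now):
        # Refresh the TTL only if the entry has not already expired.
        session = self.get(user_id, now)
        if session is None:
            return False
        self._entries[user_id] = (session, now + self.TTL_SECONDS)
        return True


cache = SessionCache()
cache.register("user456", "us-east-1", "gw-7", now=0)
refreshed = cache.heartbeat("user456", now=30)   # expiry moves to t=90
alive = cache.get("user456", now=80)             # still routable
expired = cache.get("user456", now=95)           # no heartbeat since t=30
```

A lookup that returns nothing is what sends a message down the offline path rather than to a gateway.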

Event Contracts
message_sent (topic: chat.message_sent)

Published by Message Service when a new message is persisted. Consumed by the WS Gateway (local delivery), the cross-region sync path, and the Notification Service (offline push). Partitioned by conversation_id for per-conversation ordering. 48 partitions per regional Kafka cluster.

Key Schema

conversation_id (UUID)

Value Schema

{"message_id": "uuid", "conversation_id": "uuid", "sender_id": "uuid", "content": "string", "media_url": "string|null", "recipients": ["uuid"], "source_region": "string", "created_at": "iso8601"}

cross_region_sync (topic: chat.cross_region_sync)

Cross-region message replication topic. Published by WS Gateway when the recipient is in a different region. Contains the full message payload plus source-region ordering metadata (sequence number, Lamport timestamp) for ordering preservation at the destination. Consumed by the destination region's WS Gateway.

Key Schema

conversation_id (UUID)

Value Schema

{"message_id": "uuid", "source_region": "string", "destination_region": "string", "sequence_number": "int", "lamport_timestamp": "int", "payload": {"...message fields..."}}
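The source does not spell out how the destination uses sequence_number. One plausible reading is a small per-conversation reorder buffer that holds back events until the gap before them is filled and ignores duplicates from retries. The sketch below assumes sequence numbers start at 1 and increase by 1 per conversation; ReorderBuffer is a hypothetical name.

```python
import heapq


class ReorderBuffer:
    """Releases events for one conversation in sequence_number order."""

    def __init__(self, first_sequence: int = 1) -> None:
        self._next = first_sequence
        self._pending = []  # min-heap of (sequence_number, message_id)

    def accept(self, sequence_number: int, message_id: str) -> list:
        if sequence_number < self._next:
            return []  # duplicate of something already delivered
        heapq.heappush(self._pending, (sequence_number, message_id))
        ready = []
        while self._pending and self._pending[0][0] <= self._next:
            seq, msg = heapq.heappop(self._pending)
            if seq == self._next:
                ready.append(msg)
                self._next += 1
            # seq < self._next here means a buffered duplicate: drop it
        return ready


buf = ReorderBuffer()
out1 = buf.accept(2, "m2")   # held back: 1 has not arrived
out2 = buf.accept(1, "m1")   # releases m1, then m2
out3 = buf.accept(1, "m1")   # late duplicate is ignored
```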

delivery_receipt (topic: chat.delivery_receipt)

CRDT receipt update events. Published by Presence Service when a recipient acknowledges delivery or read. Contains the updated CRDT state (Lamport timestamps per field). Replicated across regions for receipt convergence.

Key Schema

message_id (UUID)

Value Schema

{"message_id": "uuid", "recipient_id": "uuid", "sent_at": "lamport_ts", "delivered_at": "lamport_ts", "read_at": "lamport_ts", "source_region": "string"}

What-If Scenarios

Network partition between US-East and EU-West regions

Impact

Users in each region can still send and receive messages with other users in the same region. Cross-region messages are buffered in Kafka's local partition and delivered when the partition heals. Presence CRDTs diverge — users may show as offline in the remote region. Delivery receipts stop propagating cross-region until connectivity is restored. No messages are lost.

Mitigation

Kafka cross-region replication with infinite retry ensures no data loss. CRDT merge automatically resolves divergent state when the partition heals. Client UI shows a 'syncing' indicator when cross-region connectivity is degraded. Message ordering is preserved via Lamport timestamps in the cross_region_sync topic.

500-member group with members across 3 regions

Impact

Each message requires fan-out to ~167 users per region. Cross-region replication adds 50-150ms for non-local members. Total fan-out for one message: ~500 pushes + ~2 cross-region replication hops. At 100 messages/sec in the group, that is 50K pushes/sec globally. Receipt CRDTs for 500 recipients per message generate significant merge traffic.

Mitigation

Per-region fan-out parallelism: each region's WS Gateway handles only its local members. Rate limit per group (10 msgs/sec) to prevent fan-out storms. Batch receipt CRDT updates (aggregate multiple receipts into one Kafka message) to reduce cross-region traffic.
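Per-region fan-out starts by grouping a group's members by the region of their active session. The sketch below uses a synthetic 500-member group spread round-robin over three regions; the function name and the member data are illustrative.

```python
from collections import defaultdict


def group_by_region(member_regions: dict) -> dict:
    """member_regions maps user_id -> region of the active session."""
    fan_out = defaultdict(list)
    for user_id, region in member_regions.items():
        fan_out[region].append(user_id)
    return dict(fan_out)


regions = ["us-east-1", "eu-west-1", "ap-northeast-1"]
members = {f"user{i}": regions[i % 3] for i in range(500)}
plan = group_by_region(members)
sizes = sorted(len(users) for users in plan.values())
```

Each region's gateway then pushes only to its own list, which is where the ~167 pushes per region figure comes from.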

Entire region failure (e.g., us-east-1 outage)

Impact

All users connected to the failed region lose their WebSocket connections. DNS failover (Route 53) stops routing new connections to the failed region within 30-60 seconds. Users reconnect to the nearest healthy region. During the failover window (30-60s), affected users cannot send or receive messages. Messages sent to them are buffered in Kafka cross-region replication. After reconnect, buffered messages are delivered and new session is registered.

Mitigation

Client-side reconnect with exponential backoff + jitter (prevent thundering herd). Route 53 health check interval: 10 seconds, failover threshold: 3 failures. Pre-warmed capacity in surviving regions (over-provision by 50%). Cassandra cross-region replicas serve message history reads from the new region immediately.
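Exponential backoff with jitter can take several forms. The sketch below uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The base of 0.5 seconds and the cap of 30 seconds are assumptions chosen for illustration, not values from the design.

```python
import random


def reconnect_delay(attempt, base=0.5, cap=30.0, rng=None):
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    if rng is None:
        rng = random
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


rng = random.Random(42)  # seeded only so the example is repeatable
delays = [reconnect_delay(attempt, rng=rng) for attempt in range(8)]
```

Because every client draws its own random delay, reconnect attempts spread out instead of arriving in synchronized waves.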

Thundering herd: 100K users reconnect simultaneously after region recovery

Impact

The recovering region's WS Gateway fleet receives 100K WebSocket upgrade requests in a burst. Each connection triggers a Session Router registration (Redis write) and an unread message sync (Cassandra read). Combined load: 100K WS upgrades + 100K Redis writes + 100K Cassandra reads within 10-30 seconds.

Mitigation

Client-side reconnect with random jitter (spread reconnections over 30 seconds). WS Gateway connection admission control (rate limit new connections to 5K/second). Queue excess connections with retry-after header. Pre-scale WS Gateway pods during maintenance windows.
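Connection admission control at 5K connections per second is commonly implemented as a token bucket. The sketch below assumes that approach, with a burst size equal to one second of capacity; the class name is hypothetical and time is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: float) -> None:
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller would answer with a retry-after hint


bucket = TokenBucket(rate_per_sec=5000, burst=5000)
admitted_burst = sum(bucket.allow(now=0.0) for _ in range(100_000))
admitted_next = sum(bucket.allow(now=1.0) for _ in range(100_000))
```

In this simulation a burst of 100K upgrade requests is trimmed to 5K admissions in each one-second window, and the rest are told to retry.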

Failure Modes & Resilience
| Component | Failure | Impact | Mitigation |
|---|---|---|---|
| Regional WS Gateway | Pod OOM kill (50K connections lost) | 50K users disconnected simultaneously. Reconnection storm hits remaining pods. Users miss messages during the 3-5 second reconnect window; these are retrieved via Cassandra on reconnect. | Memory limits with connection shedding at 80%. Liveness probes restart before OOM. Client reconnect with jitter. Horizontal auto-scaling based on connection count. |
| Global Session Router | Redis Cluster shard failure | 1/6 of session routing data unavailable. Messages to affected users are routed to the wrong region, detected, and forwarded, adding 50-150ms extra latency. No message loss, but delivery is degraded. | Redis Cluster with replicas (6 shards x 2 replicas). Automatic failover to replica within seconds. WS Gateway maintains local in-memory session cache as fallback for the most recent 10K sessions. |
| Regional Kafka | Broker failure (1 of 3) | Partition leaders on the failed broker are reassigned (30 seconds). During reassignment, affected partitions cannot accept publishes. Message Service retries with backoff. Cross-region replication for affected partitions pauses. No data loss (RF=3). | ISR configuration: min.insync.replicas=2. Client retries with idempotent producer. Monitor under-replicated partitions and replace failed brokers within 1 hour. |
| Cassandra (Message DB) | Cross-region replication lag spike (>5 seconds) | Users in the lagging region see messages delayed by >5 seconds when scrolling through history or reconnecting. New message delivery via WebSocket is unaffected (delivered via Kafka, not Cassandra). Presence and receipts unaffected (stored in Redis, not Cassandra). | Monitor replication lag (nodetool status). Alert at 2 seconds, investigate at 5 seconds. Common causes: network throttling, compaction backlog, large batch writes. Increase cross-region bandwidth or add Cassandra nodes. |
| Media Store (S3) | Pre-signed URL generation failure | Users cannot upload new media (images, files). Text messages are unaffected. Previously uploaded media remains accessible via existing URLs and CDN cache. | Retry pre-signed URL generation with exponential backoff. Fallback: queue media uploads client-side and retry when S3 access is restored. Alert on upload failure rate > 1%. |
Scaling Strategy

Per-region horizontal scaling:

  1. WS Gateway: auto-scale based on connection count. Each pod handles 50K connections. Add pods when any pod exceeds 40K (80%). Scale down during off-peak hours for the region's timezone.
  2. Kafka: add partitions to increase throughput. Rebalance takes 30-60 seconds. Within a consumer group, at most one consumer reads a given partition, so adding consumers beyond the partition count requires adding partitions.
  3. Cassandra: add nodes to the ring for storage and write throughput. Virtual nodes (vnodes) ensure even data distribution. Auto-scaling via Amazon Keyspaces or manual ring expansion.
  4. Redis: add shards for throughput, increase node size for memory. Redis Cluster supports online resharding.
  5. Cross-region scaling: add new regions as the user base expands geographically. Each new region requires deploying the full stack (WS Gateway, Kafka, Cassandra ring member, Redis shards). DNS latency-based routing automatically directs users to the nearest region.

Typical scaling triggers: per-region connection count > 80% capacity, Kafka consumer lag > 2 seconds, Cassandra write latency p99 > 50ms.

Monitoring & Alerting

Critical metrics for multi-region chat:

  1. Per-region WebSocket connection count: alert if any region exceeds 80% capacity.
  2. Cross-region message delivery latency p99: target < 500ms; spikes indicate replication issues.
  3. Kafka cross-region replication lag: alert at 2 seconds, critical at 5 seconds.
  4. CRDT receipt convergence time: measure time from receipt update to cross-region visibility; target < 500ms.
  5. Session Router cache hit rate: should be > 99%; drops indicate session churn or Redis issues.
  6. Regional failover detection time: measure Route 53 health check to DNS update latency; target < 60 seconds.
  7. Cassandra cross-region replication lag: alert at 200ms, critical at 1 second.
  8. Reconnect success rate after failover: target > 99% within 10 seconds.

Dashboard: Grafana with per-region panels, cross-region latency heatmap, CRDT convergence histogram, and failover drill results.

Cost Analysis

Monthly cost at 500K concurrent users across 3 regions: WS Gateway fleet (20 pods/region x 3 regions, 8 vCPU/16 GB) ~$7,200/month. Kafka clusters (3 brokers/region x 3 regions, kafka.m7g.2xlarge) ~$5,400/month. Cassandra cluster (9 nodes x 3 regions, multi-DC replication) ~$8,100/month. Redis Cluster (6 shards, global) ~$2,700/month. Session Router + Message Service + Presence Service (ECS Fargate, 3 regions) ~$3,600/month. S3 storage + CloudFront CDN ~$1,500/month. Notification Service (Lambda) ~$500/month. Route 53 + networking ~$1,000/month. Total: ~$30,000/month. Cost per concurrent user: ~$0.06/month. This is below the single-region figure of ~$0.084/user: regional affinity reduces cross-region bandwidth costs, and geographic load distribution improves hardware utilization.
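The arithmetic behind the total and the per-user figure can be checked directly. The dictionary keys are illustrative labels; the dollar amounts and the 500K user count are the ones quoted above.

```python
# Monthly line items in USD, as quoted in the cost analysis above.
monthly_costs_usd = {
    "ws_gateway_fleet": 7_200,
    "kafka_clusters": 5_400,
    "cassandra_cluster": 8_100,
    "redis_cluster": 2_700,
    "router_message_presence_services": 3_600,
    "s3_and_cloudfront": 1_500,
    "notification_lambda": 500,
    "route53_and_networking": 1_000,
}

CONCURRENT_USERS = 500_000

total_usd = sum(monthly_costs_usd.values())
cost_per_user_usd = total_usd / CONCURRENT_USERS

print(f"total: ${total_usd:,}/month")
print(f"per concurrent user: ${cost_per_user_usd:.2f}/month")
```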

Security Considerations

Multi-region security considerations: (1) End-to-end encryption: this architecture is E2E-encryption ready. The message payload can be encrypted client-side (Signal protocol / Double Ratchet) before entering the delivery pipeline. The server handles routing and storage without decrypting content. Trade-off: server-side search and content moderation become impossible. (2) Cross-region data sovereignty: messages replicated to regions in different legal jurisdictions may conflict with data residency and cross-border transfer requirements (for example under GDPR). Solution: region-pinned conversations where data stays in the origin region, with cross-region routing only for delivery. (3) Transport security: WSS (TLS 1.3) for all WebSocket connections. Cross-region Kafka replication uses mutual TLS (mTLS). Cassandra inter-node encryption is enabled. (4) Authentication: JWT on WebSocket upgrade with short-lived tokens (15 minutes). Token refresh via a separate REST endpoint. Session tokens in Redis are encrypted at rest. (5) DDoS protection: AWS Shield + WAF on the Regional LB. Per-IP connection limits. Per-user message rate limits enforced at the WS Gateway. (6) Audit logging: all message metadata (sender, recipient, timestamp, region) is logged to a separate audit stream for compliance. Message content is not logged (privacy).

Deployment Strategy

Multi-region rolling deployment: (1) Deploy to one region at a time (canary region first). (2) Canary region receives 5% of new connections for 1 hour. Monitor error rates, latency, CRDT convergence. (3) If canary is healthy, deploy to remaining regions sequentially with 15-minute intervals. (4) WS Gateway deployment uses connection draining: stop accepting new connections on old pods, wait 30 seconds for in-flight messages, terminate. Users reconnect to new pods automatically. (5) Kafka configuration changes (partition count, replication factor) require a maintenance window with coordinated rollout across regions. (6) Cassandra schema changes use expand-contract: add new columns/tables in phase 1, populate in phase 2, remove old in phase 3. All phases must be backward-compatible since different regions may be on different versions during rollout. (7) Rollback: revert the canary region first, then monitor. If cross-region state is affected (CRDT schema change), perform a coordinated rollback across all regions.

Real-World Examples
  • WhatsApp — 2B+ users across global Erlang clusters with per-region connection routing and CRDT-style message state
  • Telegram — 700M+ users with custom MTProto protocol, multi-DC deployment, CRDT-inspired message ordering
  • Discord — 150M+ monthly users, Elixir-based WS gateway fleet, per-region deployment with global session routing
  • Slack — Enterprise messaging with regional deployment, cross-region message routing for distributed teams
Solution Comparison
| Variant | Tier | Latency | Throughput | Cost | Complexity | Reliability |
| --- | --- | --- | --- | --- | --- | --- |
| Naive (HTTP Polling) | T1 | 1-2s delivery delay | ~1K concurrent users | $555/month (1K users) | Low (standard HTTP, no WS) | 99% (single DB, no failover) |
| Kafka Fan-Out (WebSocket) | T2 | <100ms delivery | 10K-1M concurrent users | $2,500/month (100K users) | Medium (WS, Kafka, Redis) | 99.9% (replicated components) |
| Multi-Region Global (CRDT) | T4 | <50ms intra-region, <500ms cross-region | 500K+ concurrent users | $15,000+/month (multi-region) | Very High (CRDT, multi-region Kafka) | 99.99% (multi-region failover) |

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions
What is a CRDT and why is it used for delivery receipts?

A CRDT (Conflict-free Replicated Data Type) is a data structure that can be independently updated on multiple replicas and always converges to a consistent state when merged — without coordination. For delivery receipts, each field (sent_at, delivered_at, read_at) uses a last-writer-wins register with Lamport timestamps. The merge function is simply max() per field. If Region A has {delivered: T1} and Region B has {read: T2}, merging produces {delivered: T1, read: T2} regardless of arrival order. This eliminates the need for distributed locks or consensus protocols for receipt state.
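The per-field max() merge described above can be sketched as follows. This is an illustration under stated assumptions, not the system's actual implementation: the receipt is modeled as a plain dictionary of Lamport timestamps, and a missing field is treated as "not yet set".

```python
from typing import Dict, Optional

FIELDS = ("sent_at", "delivered_at", "read_at")
Receipt = Dict[str, Optional[int]]  # field -> Lamport timestamp, None if unset


def merge(a: Receipt, b: Receipt) -> Receipt:
    """Merge two receipt replicas by taking the larger timestamp per field."""
    merged: Receipt = {}
    for field in FIELDS:
        va, vb = a.get(field), b.get(field)
        if va is None:
            merged[field] = vb
        elif vb is None:
            merged[field] = va
        else:
            merged[field] = max(va, vb)
    return merged


region_a = {"sent_at": 1, "delivered_at": 5}
region_b = {"sent_at": 1, "read_at": 9}
print(merge(region_a, region_b))
```

Because max() is commutative, associative, and idempotent, replicas can apply updates in any order, and re-delivering the same update is harmless.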

How does regional failover work?

AWS Route 53 health checks monitor each region's WS Gateway fleet. When a region fails (consecutive health check failures exceed the threshold), Route 53 stops returning that region in DNS responses within 30-60 seconds. Clients connected to the failed region lose their WebSocket connections and attempt to reconnect; once the failed region has been withdrawn from DNS and cached records have expired, resolution returns the nearest healthy region. The client re-establishes a WebSocket connection and registers a new session. Unread messages are retrieved from Cassandra (which has cross-region replicas). Reconnect time after DNS has failed over: 3-5 seconds for well-implemented clients. Messages sent during failover are buffered in Kafka's cross-region replication stream and delivered after reconnection.
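The text does not specify the client's retry policy. One common choice, shown here purely as an assumption, is exponential backoff with full jitter, which stops thousands of clients from hitting the surviving region at the same instant. The `connect` and `sleep` callables and all parameter values are hypothetical.

```python
import random
from typing import Callable, Optional


def reconnect(
    connect: Callable[[], bool],      # hypothetical: returns True once a socket is open
    sleep: Callable[[float], None],   # injected so the loop can be tested without waiting
    max_attempts: int = 6,
    base_s: float = 0.5,
    cap_s: float = 8.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Retry connect() with capped exponential backoff and full jitter.

    Returns the number of attempts used, or raises ConnectionError.
    """
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        if connect():
            return attempt + 1
        # Full jitter: wait a random time in [0, min(cap, base * 2^attempt)].
        sleep(rng.uniform(0, min(cap_s, base_s * 2 ** attempt)))
    raise ConnectionError("no healthy region reachable")
```

In a real client, `connect` would redo DNS resolution on each attempt so that the failed-over record is picked up as soon as it is published.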

How does cross-region message ordering work?

Per-conversation ordering is maintained by Kafka's partition-level guarantees: all messages for conversation_id=X land on the same partition and are consumed in order. Within a region, this provides strict ordering. Cross-region, messages flow through the cross_region_sync topic with source-region ordering metadata (sequence number + Lamport timestamp). The destination region's consumer applies these messages in source order. Messages sent concurrently from different regions, for example during a network partition between regions, may interleave. Lamport timestamps, combined with a deterministic tie-breaker such as the source region ID for equal timestamps, let every client sort them into the same order.
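A minimal sketch of the client-side sort described above. The `Msg` fields and the use of the region name as tie-breaker are assumptions for illustration; the text only states that a sequence number and a Lamport timestamp travel with each message.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Msg:
    lamport: int   # Lamport timestamp assigned by the source region
    region: str    # source region ID, used here as the tie-breaker (assumption)
    seq: int       # per-source sequence number
    body: str


class LamportClock:
    """Standard Lamport clock: tick on local events, jump forward on receive."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        self.time += 1
        return self.time

    def observe(self, remote_time: int) -> int:
        self.time = max(self.time, remote_time) + 1
        return self.time


def display_order(msgs: Iterable[Msg]) -> List[Msg]:
    """Sort into a total order that is the same on every client."""
    return sorted(msgs, key=lambda m: (m.lamport, m.region, m.seq))
```

The resulting order is deterministic and respects causality, but it is not wall-clock order: two messages with the same Lamport timestamp are ordered by region name, not by which was typed first.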

What is the cost of multi-region deployment?

Multi-region costs approximately 2.5x as much as single-region: each region needs its own WS Gateway fleet, Kafka cluster, Redis shards, and partial Cassandra ring. Some services (Session Router, Notification Service) can share global infrastructure. At 3 regions with 500K total concurrent users, the Cost Analysis figures are: WS Gateway fleet ~$7,200/month, Kafka ~$5,400/month, Cassandra ~$8,100/month, Redis ~$2,700/month, and supporting services, storage, and networking ~$6,600/month. Total: ~$30,000/month. Cost per concurrent user: ~$0.06/month, which the Cost Analysis puts below the single-region per-user figure because regional affinity reduces cross-region bandwidth costs.

When is multi-region overkill?

Multi-region is overkill when: (1) your users are geographically concentrated (e.g., a country-specific chat app where 95% of users are in one region), (2) your scale is under 100K concurrent users (the single-region Kafka Fan-Out variant handles this at lower cost and complexity), (3) your team lacks experience with multi-region operations (CRDT semantics, cross-region Kafka, Cassandra multi-DC tuning), or (4) 99.9% availability is sufficient (single-region with multi-AZ provides this). Multi-region is justified when you have a global user base, require 99.99%+ availability, and the engineering team can manage the operational complexity.

How do CRDT presence counters handle split-brain between regions?

During a network partition between regions, each region operates independently. A user connected in Region A shows as online in Region A but may appear offline in Region B (Region B's counter for that user has not received the latest heartbeat). When the partition heals, the CRDT G-Counter merge resolves this: each region's entry is merged by taking the larger value, and the sum of the merged regional entries gives the correct global count. The eventual consistency window is bounded by the heartbeat interval (30s) + TTL (60s): at most 90 seconds of stale presence during a partition. This is acceptable for chat (WhatsApp has the same behavior during network issues).
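A G-Counter keeps one entry per region; merging takes the larger value per entry and the counter's value is the sum of the entries. The sketch below shows only that merge rule. How heartbeats map onto counter increments and how the TTL is applied are not specified in the text and are left out; the region names are placeholders.

```python
from typing import Dict

GCounter = Dict[str, int]  # region ID -> count of increments made in that region


def increment(counter: GCounter, region: str) -> GCounter:
    """Return a copy with this region's own entry bumped by one."""
    updated = dict(counter)
    updated[region] = updated.get(region, 0) + 1
    return updated


def merge(a: GCounter, b: GCounter) -> GCounter:
    """Element-wise maximum over the union of regions."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}


def value(counter: GCounter) -> int:
    """Global count: the sum of all regional entries."""
    return sum(counter.values())


# Two regions diverge during a partition, then merge when it heals.
region_a = increment(increment({}, "us-east"), "us-east")
region_b = increment({}, "eu-west")
healed = merge(region_a, region_b)
print(healed, value(healed))
```

Each region only ever increments its own entry, which is why the element-wise maximum never loses an update.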
