Medium · 8 components · Interview: Very High

Real-Time Chat — Kafka Fan-Out (WebSocket + Presence)

Industry-standard chat architecture: WebSocket for real-time push, Kafka for group message fan-out, Redis for presence tracking, Cassandra for time-series message storage. Sub-100ms delivery for online users.

Tags: WebSocket, Kafka, Real-time, Messaging

Problem Statement

The Kafka fan-out chat architecture is the standard production approach used by messaging platforms handling millions of concurrent users. It solves the two fundamental problems that make the naive polling approach unworkable at scale: (1) delivery latency — messages must arrive in under 100ms for a responsive chat experience, and (2) server load — the system must handle message delivery without generating O(N) database queries per second from polling.

The key architectural shift from naive to fan-out is replacing pull-based message delivery (client polls server) with push-based delivery (server pushes to client via WebSocket). This single change eliminates 95% of the database load from empty poll responses and reduces delivery latency from 1-2 seconds to under 50ms. The WebSocket connection is persistent — once established, either side can send data instantly without the overhead of HTTP connection setup, headers, and authentication on every message.

The second architectural challenge this variant addresses is group message fan-out. When a user sends a message to a 500-member group, the server must deliver it to all 500 members who are currently online. If the message service did this synchronously (look up each member, push via their WebSocket), a single group message would take 500 x 0.5ms = 250ms just for the push loop — blocking the service from processing other messages. Kafka solves this by decoupling send from deliver: the message service publishes once to Kafka, and a dedicated fan-out worker handles the N pushes asynchronously.

Presence tracking (online/offline status) is a third concern that becomes trivial with WebSocket connections. The WebSocket gateway emits connect/disconnect events, and a Redis-based presence service tracks status with TTL-based heartbeats. Presence is eventually consistent — a user may show as online for up to 60 seconds after disconnecting — but this is the same trade-off WhatsApp and Telegram make, and users find it acceptable.

This architecture appears in virtually every system design interview for messaging roles. Interviewers expect candidates to explain the polling-to-WebSocket transition, articulate the Kafka fan-out pattern for group messages, reason about presence consistency trade-offs, and discuss the scaling properties of each component. The specific design decisions — Cassandra for time-series message storage, Redis for ephemeral presence, Kafka partitioned by conversation_id — are the industry standard, and each has a clear rationale that candidates should be able to articulate.

The primary limitation of this variant is single-region deployment. All WebSocket connections terminate in one region, so users far from the data center experience higher connection latency (but message delivery is still fast once connected). The Multi-Region Global variant addresses this with per-region gateways and cross-region routing.

Architecture Overview

The Kafka fan-out chat system is organized as a push-based delivery pipeline built from these components: Client, API Gateway, Load Balancer, Message Service, WebSocket Service, Presence Service, Redis caches (message and presence), Cassandra message database, Kafka event stream, Fan-out Worker, and Push Worker.

Clients connect via two channels: a persistent WebSocket connection to the WS Service for real-time message delivery and presence, and REST API calls via the API Gateway for operations like message send, conversation history, and presence heartbeats. The API Gateway handles JWT authentication (~3ms), rate limiting (70K RPS cap), and routes REST requests to the Load Balancer. WebSocket upgrade requests route directly to the WS Service.

The message send path starts at the API Gateway, flows through the Load Balancer to the Message Service, which performs three operations: (1) write the message to the Message Cache (Redis — recent messages buffer), (2) persist to the Message DB (Cassandra — partitioned by conversation_id, sorted by timestamp for efficient scrollback), and (3) publish a message_sent event to the Chat Events Kafka topic. The publish completes in under 5ms, and the HTTP response returns to the sender. The message is now durably stored and queued for delivery.

For 1:1 messages, the WS Service checks whether the recipient is connected to the same gateway. If so, the message is pushed directly over the recipient's WebSocket connection, sub-50ms from send to delivery. For group messages, the Fan-out Worker consumes from Kafka, looks up all group members in the Presence Cache, identifies the online recipients, and forwards the message to the WS Service for delivery to each of them. A 500-member group requires up to 500 pushes, but the Fan-out Worker handles these asynchronously, well after the sender's send call has returned.

The Push Worker handles offline delivery. It consumes message_sent events, checks the Presence Cache for offline recipients, and sends push notifications via FCM (Android) and APNs (iOS). When the user comes back online, their client fetches unread messages via the REST API, which reads from the Message Cache (recent messages) or Message DB (older history).

Presence tracking uses Redis with TTL-based heartbeats. Connected clients send a heartbeat every 30 seconds via PUT /api/v1/presence. The Presence Service writes a Redis SET with 60-second TTL. If a heartbeat is missed, the key expires and the user shows as offline. This is eventually consistent — a user who disconnects without a clean close shows as online for up to 60 seconds. This is the same trade-off WhatsApp uses, and it avoids the complexity of distributed heartbeat coordination.
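The TTL behaviour described above can be illustrated with a small in-memory stand-in for Redis. This is only a sketch: `PresenceStore` and `FakeClock` are hypothetical names, expiry is checked lazily on lookup rather than by a server, and the 30-second and 60-second values are the ones quoted in the text.

```python
import time
from typing import Callable, Dict, Optional, Tuple

HEARTBEAT_INTERVAL_S = 30  # heartbeat period quoted in the text
PRESENCE_TTL_S = 60        # key TTL quoted in the text


class PresenceStore:
    """In-memory stand-in for Redis SETEX/GET (illustrative, not a Redis client)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # user_id -> (expires_at, gateway_pod)
        self._entries: Dict[str, Tuple[float, str]] = {}

    def heartbeat(self, user_id: str, gateway_pod: str) -> None:
        # Similar in spirit to: SETEX presence:<user_id> 60 <value>
        self._entries[user_id] = (self._clock() + PRESENCE_TTL_S, gateway_pod)

    def lookup(self, user_id: str) -> Optional[str]:
        # Similar in spirit to: GET presence:<user_id>; None means offline.
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        expires_at, gateway_pod = entry
        if self._clock() >= expires_at:
            del self._entries[user_id]  # lazy expiry
            return None
        return gateway_pod


class FakeClock:
    """Manually advanced clock so the example is deterministic."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now
```

With a fake clock, a user who heartbeats at t=0 and t=30 and then disconnects silently remains "online" until t=90, which is the up-to-60-second staleness window the text mentions.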

The architecture scales horizontally at each tier. WebSocket connections are distributed across the WS Service fleet (20 pods, 50K connections per pod = 1M concurrent connections). Kafka partitions scale message throughput (24 partitions handle 50K+ messages/sec). The Fan-out Worker pool scales with Kafka consumer count. The Message Service, Presence Service, and Push Worker scale independently via pod count.

Request Flow — WebSocket Push + Kafka Fan-Out

The WebSocket push architecture changes the message delivery model from pull (client polls) to push (server delivers). The message send path writes to Cassandra, Redis, and Kafka in parallel and returns to the sender in under 50ms. Delivery to online recipients is decoupled from the sender's request path: 1:1 messages are pushed directly over the recipient's WebSocket (~20ms), while for group messages the Fan-out Worker iterates through all online members and pushes to each one individually.

The key performance difference versus the naive approach: instead of N/2 database queries per second from polling (N users each polling every 2 seconds, most polls returning empty), the system generates database traffic only in proportion to actual message volume. At 30K messages/sec with 10M users, that is 30K writes/sec instead of 5M poll reads/sec, roughly a 167x reduction in wasted database load.
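The comparison above is simple arithmetic and can be checked with a few lines. The 2-second poll interval is an assumption inferred from the N/2 figure; the user count and message rate are the ones quoted in the text.

```python
USERS = 10_000_000
POLL_INTERVAL_S = 2.0       # assumption: implied by "N/2 queries per second"
MESSAGES_PER_SEC = 30_000


def polling_reads_per_sec(users: int, poll_interval_s: float) -> float:
    """Database reads generated when every user polls on a fixed interval."""
    return users / poll_interval_s


def load_reduction(users: int, poll_interval_s: float, messages_per_sec: float) -> float:
    """Ratio of polling read load to push-model write load."""
    return polling_reads_per_sec(users, poll_interval_s) / messages_per_sec


poll_reads = polling_reads_per_sec(USERS, POLL_INTERVAL_S)
reduction = load_reduction(USERS, POLL_INTERVAL_S, MESSAGES_PER_SEC)
print(f"poll reads/sec: {poll_reads:,.0f}, reduction: {reduction:.1f}x")
```

The ratio depends entirely on the assumed poll interval; a 1-second interval would double it.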

Step-by-Step Walkthrough

1. Sender transmits a message as a WebSocket frame to their connected WS Gateway, with no HTTP overhead and no connection setup
2. WS Gateway forwards the message to the Message Service via internal gRPC/HTTP (~3ms)
3. Message Service performs three parallel writes: (a) Redis sorted set for recent messages cache, (b) Cassandra INSERT for durable persistence, (c) Kafka produce for downstream fan-out
4. All three writes complete within 15ms (Cassandra is the slowest). Message Service ACKs back to WS Gateway, which ACKs to the sender. Total: ~50ms
5. Kafka consumer (Fan-out Worker) picks up the message_sent event, typically within 10-50ms of publish
6. Fan-out Worker checks the Presence Cache (Redis) to identify which recipients are online and which gateway pod they are connected to
7. For each online recipient, the Fan-out Worker sends the message to the appropriate WS Gateway pod, which pushes it via the recipient's WebSocket connection
8. Total end-to-end latency: ~70ms (50ms send + 20ms fan-out delivery), roughly 14-28x faster than the naive approach's 1-2 second polling delay

Pseudocode

// MESSAGE SEND PATH (Message Service)
async function sendMessage(conversation_id, sender_id, content):
    message = { id: uuid(), conversation_id, sender_id, content, created_at: now() }

    // Parallel writes — all three happen simultaneously
    await Promise.all([
        redis.zadd("conv:" + conversation_id + ":msgs", message.created_at, serialize(message)),
        cassandra.execute("INSERT INTO messages (conversation_id, timestamp, ...) VALUES (?, ?, ...)", message),
        kafka.produce("chat.message_sent", key=conversation_id, value=message)
    ])
    // Total: ~15ms (bounded by slowest write — Cassandra)
    return { status: 201, message_id: message.id }

// FAN-OUT WORKER (Kafka consumer)
async function onMessageSent(event):
    conversation = await getConversation(event.conversation_id)

    for recipient_id in conversation.participants:
        if recipient_id == event.sender_id: continue

        // Check presence (GET returns null once the TTL has expired)
        raw = await redis.get("presence:" + recipient_id)
        if raw:
            // Online: decode the stored session, then push via WebSocket
            session = deserialize(raw)
            await wsGateway.deliver(session.gateway_pod, recipient_id, event)
        else:
            // Offline — queue push notification
            await pushQueue.enqueue(recipient_id, event)

// PRESENCE HEARTBEAT (every 30 seconds)
async function handleHeartbeat(user_id):
    await redis.setex("presence:" + user_id, 60, serialize({ last_seen: now(), gateway_pod: this_pod }))
    // 60s TTL: missed heartbeat → offline after 60 seconds
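
The fan-out routing decision in the pseudocode can be exercised without any infrastructure. The sketch below is a simplified, hypothetical version: `FakePresence` stands in for the Redis presence cache, the function only returns the routing decision instead of calling a gateway or push queue, and presence lookups are issued concurrently with `asyncio.gather` rather than awaited one at a time.

```python
import asyncio
import json
from typing import Dict, List, Optional, Tuple


class FakePresence:
    """Stand-in for the presence cache: maps Redis-style keys to JSON strings."""

    def __init__(self, sessions: Dict[str, str]) -> None:
        self._sessions = sessions

    async def get(self, key: str) -> Optional[str]:
        # A real client would return None once the key's TTL has expired.
        return self._sessions.get(key)


async def fan_out(
    event: dict, participants: List[str], presence: FakePresence
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split recipients into (gateway_pod, user_id) pushes and offline user_ids."""
    recipients = [p for p in participants if p != event["sender_id"]]
    # Issue all presence lookups concurrently; results keep the input order.
    raw_sessions = await asyncio.gather(
        *(presence.get("presence:" + r) for r in recipients)
    )
    ws_pushes: List[Tuple[str, str]] = []
    offline: List[str] = []
    for recipient_id, raw in zip(recipients, raw_sessions):
        if raw is None:
            offline.append(recipient_id)  # would go to the push-notification path
        else:
            session = json.loads(raw)
            ws_pushes.append((session["gateway_pod"], recipient_id))
    return ws_pushes, offline
```

For a 500-member group, batching the lookups this way (or using a pipelined multi-key read) avoids paying one network round trip per recipient in sequence.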

Data Model

The data model uses three storage tiers, each optimized for a different access pattern. Cassandra provides durable, time-series-optimized message storage with (conversation_id, timestamp) as the partition/clustering key — ideal for 'last N messages' queries. Redis provides sub-millisecond access for the two hottest data sets: recent message cache (sorted sets) and presence state (key-value with TTL). Kafka provides ordered, durable event streaming for asynchronous fan-out.

The key insight is the separation of concerns: Cassandra handles durability, Redis handles speed, Kafka handles delivery. No single store is asked to do all three. This specialization is what enables the 10x improvement over the naive approach's single-PostgreSQL-for-everything design.

Step-by-Step Walkthrough

1. Cassandra messages table: partition key is conversation_id, clustering key is timestamp DESC, so a single partition read returns the last N messages for a conversation in sorted order
2. Cassandra conversations table: denormalized participants as a SET type, giving a fast membership check without a join table and trading storage for query speed
3. Redis message cache: sorted set per conversation, score is timestamp. ZRANGEBYSCORE returns messages in time order. ZADD on every new message (write-through). ZREMRANGEBYRANK trims to last 100 entries
4. Redis presence: simple key-value with 60s TTL. SETEX on heartbeat, GET on presence check. Expiry = offline. No explicit delete needed because the TTL handles cleanup
5. Kafka message_sent: published on every send, consumed by Fan-out Worker and Push Worker. Key is conversation_id for per-conversation ordering on each partition

Pseudocode

-- Cassandra schema
CREATE TABLE messages (
    conversation_id UUID,
    timestamp TIMEUUID,
    message_id UUID,
    sender_id UUID,
    content TEXT,
    message_type VARCHAR,
    PRIMARY KEY (conversation_id, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
-- Write: INSERT ... (append-only, O(1) with LSM-tree)
-- Read: SELECT * FROM messages WHERE conversation_id = ? ORDER BY timestamp DESC LIMIT 50
-- Cost: single partition read, ~5ms

-- Redis message cache
ZADD conv:abc123:msgs 1705305603000 '{"id":"m1","content":"Hello"}'
ZREVRANGEBYSCORE conv:abc123:msgs +inf -inf LIMIT 0 50  -- newest 50 first, ~1ms
ZREMRANGEBYRANK conv:abc123:msgs 0 -101  -- Trim to last 100

-- Redis presence
SETEX presence:user123 60 '{"last_seen":1705305603,"gateway_pod":"ws-pod-7"}'
GET presence:user123  -- ~0.5ms, returns nil if offline (TTL expired)

-- Kafka topic
kafka.produce("chat.message_sent",
    key = "conv:abc123",
    value = '{"message_id":"m1","sender_id":"u1","content":"Hello","recipients":["u2","u3"]}'
)

Key Design Decisions

WebSocket for Real-Time Push

Choice

Persistent WebSocket connections replacing HTTP polling

Rationale

WebSocket eliminates the 95% wasted poll queries from the naive approach. Instead of N/2 queries/sec from polling, the server generates database traffic only when actual messages exist. Delivery latency drops from 1-2 seconds (poll interval) to under 50ms (WebSocket frame push). The trade-off is connection statefulness — WebSocket connections must be maintained, tracked, and gracefully handled on server restarts. This is manageable with the WS Service fleet and connection registry.

Kafka for Group Message Fan-Out

Choice

Kafka topic partitioned by conversation_id with dedicated fan-out workers

Rationale

Direct push works for 1:1 messages but not for groups. A 500-member group message requires 500 WebSocket pushes. If the Message Service did this synchronously, a single message would take 500 x 0.5ms = 250ms just for fan-out. Kafka decouples send from deliver: the Message Service publishes once, the Fan-out Worker handles the N pushes asynchronously. Partitioning by conversation_id ensures ordering within a conversation without distributed coordination.
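
Why keying by conversation_id preserves ordering can be seen from how a keyed partitioner behaves: the same key always maps to the same partition, and a partition is an ordered log. The sketch below uses CRC32 purely as a stand-in hash; Kafka's default partitioner uses a different hash (murmur2), so the partition numbers here would not match a real cluster.

```python
import zlib

NUM_PARTITIONS = 24  # partition count quoted in this template


def partition_for(conversation_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition (stand-in hash, not Kafka's murmur2)."""
    return zlib.crc32(conversation_id.encode("utf-8")) % num_partitions


# Every message in a conversation carries the same key, so it lands on the
# same partition and is consumed in publish order by a single consumer.
first = partition_for("conv:abc123")
second = partition_for("conv:abc123")
print(first == second)
```

One consequence worth keeping in mind: because the mapping depends on the partition count, adding partitions later can move a conversation to a different partition, so ordering guarantees hold only while the count is stable.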

Cassandra for Message Storage

Choice

Cassandra with (conversation_id, timestamp) partition/clustering key

Rationale

Chat messages are a time-series workload: writes are append-only (INSERT), reads are range scans (last N messages in conversation). Cassandra's wide-column model with LSM-tree storage is ideal for this pattern. At 30K writes/sec, PostgreSQL would need significant write optimization, while Cassandra handles this natively. The trade-off versus DynamoDB is operational control and cost at scale.

Redis for Presence (TTL-Based)

Choice

Redis SET with 60-second TTL refreshed by 30-second heartbeats

Rationale

Presence is ephemeral (no durability needed) and extremely high-frequency (10M users x heartbeat every 30s = 333K writes/sec). Redis SET with TTL is O(1) and perfect for this pattern. The 60-second TTL means a disconnected user shows as online for at most 60 seconds — acceptable for chat. Stronger consistency would require distributed heartbeat coordination, which is overkill.

Two-Tier Message Cache

Choice

Redis for last 100 messages per conversation + Cassandra for full history

Rationale

The common case is 'open chat, see recent messages' — this should be sub-5ms. Redis stores the last 100 messages per conversation as a sorted set. Cassandra stores the full history for scrollback. 85% of message reads hit the Redis cache (recent messages). Only scrollback past 100 messages requires a Cassandra query. This two-tier approach keeps the common path fast without sacrificing unlimited history.
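
A minimal sketch of the two-tier read path, using plain Python lists in place of Redis and Cassandra. `TwoTierHistory` is a hypothetical name, and the sketch ignores the cache TTL and eviction discussed elsewhere in this template; it only shows when a read can be served from the 100-message window and when it must fall through to the full history.

```python
from typing import Dict, List

CACHE_SIZE = 100  # cached messages per conversation, as in the text


class TwoTierHistory:
    """Lists stand in for the Redis sorted set (cache) and Cassandra (store)."""

    def __init__(self) -> None:
        self.cache: Dict[str, List[dict]] = {}  # oldest first, at most CACHE_SIZE
        self.store: Dict[str, List[dict]] = {}  # oldest first, full history
        self.store_reads = 0

    def append(self, conversation_id: str, message: dict) -> None:
        # Write-through: every message goes to both tiers.
        self.store.setdefault(conversation_id, []).append(message)
        recent = self.cache.setdefault(conversation_id, [])
        recent.append(message)
        del recent[:-CACHE_SIZE]  # trim to the newest CACHE_SIZE entries

    def latest(self, conversation_id: str, limit: int = 50, offset: int = 0) -> List[dict]:
        """Return `limit` messages, newest first, skipping the `offset` newest."""
        recent = self.cache.get(conversation_id, [])
        if offset + limit <= len(recent):
            return recent[::-1][offset:offset + limit]  # served from cache
        self.store_reads += 1  # scrollback past the cached window
        return self.store.get(conversation_id, [])[::-1][offset:offset + limit]
```

Opening a chat (offset 0, limit 50) stays within the cached window, while scrolling past the 100th message triggers a read against the full history.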

No End-to-End Encryption

Choice

Messages stored in plaintext for server-side search

Rationale

End-to-end encryption (Signal protocol) prevents the server from reading message content, which blocks server-side search, message indexing, and content moderation. This template focuses on the system architecture, not the encryption layer. Adding E2E encryption is orthogonal to the fan-out pattern — the same Kafka + WebSocket delivery works regardless of whether the payload is encrypted.

Scale & Performance

Target RPS

50K+ messages/sec

Latency (p99)

<100ms delivery (online recipients)

Storage

~5 TB/year at scale

Availability

99.9% (replicated components)

Time & Space Complexity

Operation	Time	Space	Notes
Send message (POST /messages)	O(1) Cassandra write + O(1) Redis write + O(1) Kafka publish	O(1) per message	Three parallel writes, all O(1). Total latency dominated by Cassandra write (~15ms).
Deliver to online recipient (WebSocket push)	O(1) per recipient, O(N) for group of N	O(1) per push	Fan-out Worker handles O(N) group delivery asynchronously. N capped at 500 (max group size).
Presence heartbeat (PUT /presence)	O(1) Redis SET with TTL	O(1) per user	Total: 10M users x 1 heartbeat/30s = 333K ops/sec. Redis handles this natively.
Message history (GET /conversations/{id}/messages)	O(1) Redis lookup or O(log N + K) Cassandra range scan	O(K) result set	Redis hit (85%): ~1ms. Cassandra miss: ~8ms. K = requested message count (default 50).

Database Schema (HLD)

messages (Cassandra)

Wide-column message store optimized for time-series access. Partition key is conversation_id; clustering key is timestamp (DESC) for efficient 'last N messages' queries. Write-optimized: LSM-tree storage handles 30K inserts/sec without B-tree write amplification. Each row is approximately 500 bytes (message metadata + content). At a sustained 30K writes/sec, raw storage grows ~1.3 TB/day (about 39 TB per 30-day month) before replication and compaction.

conversation_id UUID (partition key)
timestamp TIMEUUID (clustering key, DESC)
message_id UUID
sender_id UUID
content TEXT
message_type VARCHAR
created_at TIMESTAMP

Indexes: Primary: (conversation_id, timestamp DESC)

Cassandra CL=LOCAL_QUORUM for writes. Range scan for 'last 50 messages' is a single partition read: O(1) seek + O(50) scan.
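
The storage growth for this table follows directly from the write rate and row size, and is worth checking by hand. The sketch below assumes the 30K writes/sec rate is sustained around the clock (a peak-only rate would give a much smaller figure) and uses decimal terabytes; the replication factor of 3 is the one used in the failure-mode discussion.

```python
WRITES_PER_SEC = 30_000
ROW_BYTES = 500
SECONDS_PER_DAY = 86_400
REPLICATION_FACTOR = 3  # RF quoted for the Cassandra cluster


def raw_tb_per_day(writes_per_sec: int, row_bytes: int) -> float:
    """Raw (pre-replication, pre-compaction) growth in decimal TB per day."""
    return writes_per_sec * row_bytes * SECONDS_PER_DAY / 1e12


per_day = raw_tb_per_day(WRITES_PER_SEC, ROW_BYTES)
per_month = per_day * 30
replicated_per_month = per_month * REPLICATION_FACTOR
print(f"{per_day:.2f} TB/day, {per_month:.1f} TB/month raw, "
      f"{replicated_per_month:.1f} TB/month across replicas")
```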

conversations (Cassandra)

Conversation metadata: participants list, type (1:1 or group), last message timestamp. Queried on conversation list and group membership lookups. Updated on every message send (last_message_at). Relatively small table — at 5M conversations, total size is under 5 GB.

conversation_id UUID PK
type VARCHAR (direct/group)
participants SET<UUID>
last_message_at TIMESTAMP
created_at TIMESTAMP

Participant set is denormalized for fast membership checks without a join table.

recent_messages (Redis sorted set)

In-memory cache of the last 100 messages per conversation. Redis sorted set keyed by conversation_id with timestamp scores. Write-through: every message send writes to both Redis and Cassandra. Read-first: poll/reconnect queries check Redis before falling back to Cassandra. TTL of 1 hour evicts inactive conversations. 85% hit rate for message reads.

key: conv:{conversation_id}:msgs
member: serialized message JSON
score: timestamp (epoch ms)

Memory: ~50 KB per conversation (100 messages x 500 bytes). At 1M active conversations: ~50 GB Redis cluster.

user_presence (Redis key-value)

Ephemeral presence state: one key per online user with 60-second TTL. Refreshed every 30 seconds by client heartbeats. When TTL expires, the user is offline. No durability needed — if Redis crashes, all users briefly show as offline until their next heartbeat re-creates the key.

key: presence:{user_id}
value: JSON with last_seen (epoch) and gateway_pod
TTL: 60 seconds

10M keys x ~100 bytes = ~1 GB memory. 333K writes/sec (heartbeats) handled by 3-node Redis Cluster.
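
The sizing figures above can be reproduced with a short calculation. The even split of writes across nodes is an assumption; real key distribution across cluster hash slots will be close to but not exactly uniform.

```python
from typing import Tuple


def presence_sizing(
    users: int, bytes_per_key: int, heartbeat_interval_s: float, nodes: int
) -> Tuple[float, float, float]:
    """Return (memory in GB, total writes/sec, writes/sec per node)."""
    memory_gb = users * bytes_per_key / 1e9
    writes_per_sec = users / heartbeat_interval_s
    # Assumes keys, and therefore writes, spread evenly across nodes.
    return memory_gb, writes_per_sec, writes_per_sec / nodes


memory_gb, total_writes, per_node_writes = presence_sizing(
    users=10_000_000, bytes_per_key=100, heartbeat_interval_s=30, nodes=3
)
print(f"{memory_gb:.1f} GB, {total_writes:,.0f} writes/sec, "
      f"{per_node_writes:,.0f} writes/sec per node")
```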

Event Contracts

message_sent (topic: chat.message_sent)

Published by Message Service when a new message is persisted. Consumed by Fan-out Worker (group delivery) and Push Worker (offline notifications). Partitioned by conversation_id for per-conversation ordering. 24-hour retention for consumer restart replay.

Key Schema

conversation_id (UUID)

Value Schema

{"message_id": "uuid", "conversation_id": "uuid", "sender_id": "uuid", "content": "string", "message_type": "string", "recipients": ["uuid"], "created_at": "iso8601"}

delivery_receipt (topic: chat.delivery_receipt)

Published when a message is delivered or read. Contains message_id, recipient_id, and status (delivered/read). Used for read receipt propagation back to the sender. Not modeled in this template but present in the Kafka topic schema for completeness.

Key Schema

message_id (UUID)

Value Schema

{"message_id": "uuid", "recipient_id": "uuid", "status": "delivered|read", "timestamp": "iso8601"}

What-If Scenarios

Viral 500-member group with 100 messages/second

Impact

Each message triggers 500 WebSocket pushes via the Fan-out Worker. At 100 msgs/sec, that is 50K pushes/sec from a single conversation. Kafka consumer lag for this partition increases. Other conversations on the same partition experience delivery delay.

Mitigation

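Implement rate limiting per group (e.g., max 10 msgs/sec per group, queue excess) and notify users of throttling in active groups. This is the only lever that reduces load from the hot conversation itself, because all of its messages map to a single partition and therefore a single consumer. Increasing Kafka partitions and adding Fan-out Worker instances reduces how many other conversations share that partition and worker, which limits the collateral delay.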
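
Per-group rate limiting is commonly implemented as a token bucket. The sketch below is one possible shape, not a prescribed implementation: `GroupRateLimiter` is a hypothetical name, state is kept in process memory (a multi-instance deployment would need shared state), and the 10 msgs/sec rate is the example value from the mitigation text.

```python
import time
from typing import Callable, Dict, Tuple


class GroupRateLimiter:
    """Token bucket per group: refills at `rate_per_sec`, holds at most `burst`."""

    def __init__(
        self,
        rate_per_sec: float = 10.0,
        burst: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._rate = rate_per_sec
        self._burst = burst
        self._clock = clock
        self._buckets: Dict[str, Tuple[float, float]] = {}  # group -> (tokens, last_seen)

    def allow(self, group_id: str) -> bool:
        now = self._clock()
        tokens, last = self._buckets.get(group_id, (self._burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self._burst, tokens + (now - last) * self._rate)
        if tokens >= 1.0:
            self._buckets[group_id] = (tokens - 1.0, now)
            return True
        self._buckets[group_id] = (tokens, now)
        return False
```

A caller would check `allow(group_id)` before publishing to Kafka and queue or reject the message when it returns False.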

WS Service pod crashes (1 of 20 pods)

Impact

50K WebSocket connections drop simultaneously. All 50K users receive a disconnect event and must reconnect. The remaining 19 pods absorb the reconnection storm. Users miss messages during the ~5 second reconnection window — these are retrieved via REST fallback on reconnect.

Mitigation

Graceful shutdown with connection draining (5-second window for in-flight messages). Client-side reconnect with exponential backoff (avoid thundering herd). Unread message sync on reconnect via GET /conversations/{id}/messages.
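
Client-side reconnect with exponential backoff usually adds random jitter so that 50K clients dropped at the same instant do not retry in lockstep. A minimal sketch of the "full jitter" variant follows; the 1-second base and 30-second cap are illustrative assumptions, not values from this template.

```python
import random
from typing import Callable


def reconnect_delay(
    attempt: int,
    base_s: float = 1.0,    # assumption: first retry ceiling
    cap_s: float = 30.0,    # assumption: maximum retry ceiling
    rng: Callable[[], float] = random.random,
) -> float:
    """Delay before reconnect attempt `attempt` (0-based), with full jitter."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    # Pick uniformly in [0, ceiling) so simultaneous disconnects spread out.
    return rng() * ceiling


for attempt in range(6):
    print(f"attempt {attempt}: wait {reconnect_delay(attempt):.2f}s")
```

The ceilings grow 1, 2, 4, 8, 16 seconds and then stay at the 30-second cap, while each individual client waits a random fraction of its ceiling.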

Redis presence cache failure

Impact

All users appear as offline. Fan-out Worker cannot determine who is online — all messages are routed to the Push Worker for notification delivery instead of WebSocket push. Users online on WebSocket do not receive real-time messages until presence cache recovers. Message sends continue (Cassandra is unaffected).

Mitigation

Redis Cluster mode with 3 replicas per shard. Fallback: WS Service maintains a local in-memory set of connected user_ids as a secondary presence source. Circuit breaker on presence lookups to avoid cascading failures.

User reconnects after 7 days offline

Impact

Client requests unread messages across potentially hundreds of conversations. Cassandra handles the range scan efficiently (partition-per-conversation), but the total data volume may be large (thousands of messages). Redis cache has evicted old conversations (1-hour TTL).

Mitigation

Paginate reconnect sync: fetch only the most recent 20 conversations, load messages on demand as the user scrolls. Implement unread count per conversation (lightweight Redis counter) to avoid scanning all conversations. Background sync for older conversations.

Failure Modes & Resilience

Component	Failure	Impact	Mitigation
Kafka (Chat Events)	Broker failure (1 of 3)	With RF=3, a single broker failure causes no data loss. Leadership for partitions led by the failed broker moves to in-sync replicas on the surviving brokers within 30 seconds. Until a new leader is elected, affected partitions have no active leader, so publishes to those partitions fail or queue and fan-out delivery for affected conversations pauses.	RF=3 ensures no data loss. With acks=all and min.insync.replicas=2, at least 2 replicas acknowledge each write. Client-side producer retries with idempotence handle transient failures. Committed consumer offsets let workers resume where they left off; delivery is at-least-once, so clients should deduplicate by message_id.
Cassandra (Message DB)	Node failure in 3-node cluster	With RF=3 and CL=LOCAL_QUORUM, a single node failure still allows reads and writes (2 of 3 nodes available). Write latency may increase slightly during hinted handoff. No data loss. Consistency maintained.	Anti-entropy repair runs daily to ensure all replicas converge. Replace failed node within 24 hours to restore full redundancy. Monitor read/write latency for degradation during reduced replica count.
WebSocket Service	Memory leak causing OOM kill	Pod is killed by Kubernetes OOM killer. All connections on that pod (up to 50K) drop simultaneously. Users must reconnect. In-flight messages may be lost for users on that pod. Other pods are unaffected.	Memory limits with gradual connection shedding (start rejecting new connections at 80% memory). Liveness probe restarts pods before OOM. Client reconnect with backoff jitter to prevent thundering herd.
Redis (Message Cache)	Cache eviction storm (working set exceeds capacity)	LRU eviction removes cached conversations. Cache hit rate drops from 85% to 50% or lower. All evicted conversations fall through to Cassandra, increasing Cassandra read load by 3-5x. Cassandra latency increases from 8ms to 20ms+ under the additional load.	Size Redis cache to hold 1.2x the expected working set. Monitor eviction rate and scale Redis nodes proactively. Single-flight loading prevents thundering herd on cache miss. Cassandra read replicas absorb the increased load.

Scaling Strategy

Horizontal scaling at each tier: (1) WS Service: add pods to handle more concurrent connections. At 50K connections/pod, each additional pod adds 50K capacity. Auto-scale based on connection count metric. (2) Kafka: add partitions to increase message throughput (at most 1 consumer per partition within a consumer group). Rebalance takes 30-60 seconds, and adding partitions changes which partition a given conversation_id maps to, so do it during low traffic. (3) Fan-out Worker: add instances up to partition count (24). Beyond that, add Kafka partitions. (4) Cassandra: add nodes to the ring for write throughput and storage. Managed auto-scaling is available via Amazon Keyspaces (for Apache Cassandra). (5) Redis: add shards to the cluster for both memory and throughput. Memory-based auto-scaling. Key scaling trigger: WS connection count approaching 80% of fleet capacity. Secondary trigger: Kafka consumer lag exceeding 5-second delivery SLO.

Monitoring & Alerting

Key metrics: (1) WebSocket connection count per pod — alert if any pod exceeds 80% of max_connections (40K/50K). (2) Kafka consumer lag per partition — alert if lag exceeds 1,000 messages (indicating fan-out workers are falling behind). (3) Message delivery latency p99 — end-to-end from send to WebSocket push, target < 100ms. (4) Presence heartbeat success rate — should be > 99.9%; drops indicate WS connection issues. (5) Redis cache hit ratio — target 85%+; drops indicate working set growth or eviction pressure. (6) Cassandra write latency p99 — target < 50ms; spikes indicate compaction or node issues. Dashboard: Grafana with Kafka lag metrics (Burrow), Redis cluster metrics (redis-exporter), Cassandra metrics (JMX exporter), and custom WebSocket connection dashboards. Distributed tracing: Jaeger traces for end-to-end message delivery timing.

Cost Analysis

Monthly cost at 100K concurrent users: ECS Fargate fleet (Message Service 12 pods, WS Service 20 pods, Presence 4 pods, Fan-out 8 workers, Push 4 workers) ~$3,500/month. MSK Kafka cluster (3 brokers, kafka.m7g.xlarge) ~$1,200/month. DynamoDB/Cassandra (on-demand, 30K writes/sec + 15K reads/sec) ~$2,000/month. ElastiCache Redis (6 nodes, message + presence caches) ~$1,500/month. API Gateway ~$200/month. Total: ~$8,400/month. Cost per concurrent user: ~$0.084/month, roughly 6.6x more cost-efficient per user than the naive approach ($555/month for 1K users, or ~$0.555 per user). The efficiency gain comes from eliminating poll waste and using purpose-built stores (Redis for cache, Cassandra for time-series, Kafka for fan-out) instead of a single PostgreSQL instance.
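
The totals above can be re-derived from the line items. The dictionary keys are illustrative labels; the dollar amounts are the estimates quoted in this section, and the naive-approach figure is the one from the Solution Comparison table.

```python
MONTHLY_COSTS_USD = {
    "ecs_fargate_fleet": 3_500,
    "msk_kafka": 1_200,
    "message_store": 2_000,
    "elasticache_redis": 1_500,
    "api_gateway": 200,
}
CONCURRENT_USERS = 100_000
NAIVE_MONTHLY_USD = 555   # naive variant, from the comparison table
NAIVE_USERS = 1_000


def cost_per_user(monthly_usd: float, users: int) -> float:
    """Monthly cost divided evenly across concurrent users."""
    return monthly_usd / users


total = sum(MONTHLY_COSTS_USD.values())
fanout_per_user = cost_per_user(total, CONCURRENT_USERS)
naive_per_user = cost_per_user(NAIVE_MONTHLY_USD, NAIVE_USERS)
efficiency_ratio = naive_per_user / fanout_per_user
print(f"total ${total:,}/month, ${fanout_per_user:.3f} per user, "
      f"{efficiency_ratio:.1f}x cheaper per user than naive")
```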

Security Considerations

Security considerations for WebSocket-based chat: (1) Authentication: JWT on WebSocket upgrade request. Once connected, the session token is validated per frame (or per connection with periodic re-auth). Token rotation requires graceful connection migration. (2) Transport encryption: WSS (WebSocket Secure) for all connections — TLS 1.3. (3) Message content: stored in plaintext in Cassandra for server-side search. E2E encryption (Signal protocol) is orthogonal to this architecture — the same delivery pipeline works with encrypted payloads, but server-side search becomes impossible. (4) Rate limiting: per-user message rate limit (e.g., 60 msgs/minute) enforced at the API Gateway and WS Service. Per-group rate limit to prevent spam flooding. (5) Content moderation: server-side content scanning before Kafka publish (requires plaintext access). Incompatible with E2E encryption. (6) Connection security: WebSocket idle timeout (5 minutes without heartbeat). Maximum message size limit (64 KB) to prevent memory abuse.

Deployment Strategy

Rolling deployment with connection draining: (1) Deploy new version to a subset of WS Service pods. (2) Stop routing new connections to old pods (remove from LB target group). (3) Existing connections on old pods remain active for a drain period (30 seconds). (4) After drain, terminate old pods. Users on drained pods reconnect automatically to new pods. Message Service, Presence Service, and workers deploy independently via standard ECS rolling updates. Kafka topic configuration changes (partition count) require a separate maintenance window due to consumer rebalance. Database migrations (Cassandra schema changes) must be backward-compatible using an expand-contract pattern: add new columns in v1, populate in v2, remove old columns in v3.

Real-World Examples

• WhatsApp: persistent client connections with Erlang-based message routing, 2B+ users
• Slack: WebSocket for real-time messages, Kafka for internal event processing
• Discord: WebSocket gateway fleet handling 10M+ concurrent connections
• Telegram: custom MTProto protocol over TCP, similar fan-out architecture

Solution Comparison

Variant	Tier	Latency	Throughput	Cost	Complexity	Reliability
Naive (HTTP Polling)	T1	1-2s delivery delay	~1K concurrent users	$555/month (1K users)	Low: standard HTTP, no WS	99% (single DB, no failover)
Kafka Fan-Out (WebSocket)	T2	<100ms delivery	10K-1M concurrent users	$8,400/month (100K users)	Medium: WS, Kafka, Redis	99.9% (replicated components)
Multi-Region Global (CRDT)	T4	<50ms intra-region, <500ms cross-region	500K+ concurrent users	$15,000+/month (multi-region)	Very High: CRDT, multi-region Kafka	99.99% (multi-region failover)

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions

Why Kafka instead of direct WebSocket push from the Message Service?

Direct push works for 1:1 messages but creates a synchronous bottleneck for group messages. A 500-member group requires 500 WebSocket pushes — doing this in the message send request path adds 250ms+ latency. Kafka decouples send from deliver: the Message Service publishes one event, returns immediately, and the Fan-out Worker handles the 500 pushes asynchronously. Kafka also provides replay capability if the fan-out worker crashes, and partitioning by conversation_id ensures per-conversation ordering.

How does presence work with 10 million concurrent users?

Each user's presence is a single Redis key (user_id -> last_seen_timestamp) with a 60-second TTL. 10M keys consume approximately 1 GB of memory. Heartbeats arrive every 30 seconds: 10M / 30 = 333K writes/sec. A Redis Cluster with 3 nodes handles this comfortably. When the TTL expires (user stopped sending heartbeats), the key disappears and the user is considered offline. This is O(1) per user — no scanning, no aggregation.

What happens when a WebSocket connection drops?

The WS Service detects the disconnection (TCP FIN or heartbeat timeout) and emits a disconnect event. Heartbeats stop, so the user's presence key expires within 60 seconds of the last heartbeat and the user then shows as offline. If the user reconnects before the key expires, the next heartbeat refreshes it and presence never flips. After reconnecting, the client calls GET /conversations/{id}/messages to sync any messages that arrived while disconnected; these are served from the Message Cache (Redis) or Message DB (Cassandra).

How does Kafka consumer lag affect message delivery?

Consumer lag is the delay between a message being published to Kafka and the Fan-out Worker consuming it. Under normal load, lag is 10-50ms (near-real-time). During traffic spikes (e.g., a viral group chat), lag can increase to 1-5 seconds as the consumer falls behind. This manifests as increased delivery latency for group messages. Mitigation: add more consumer instances (up to the number of partitions — 24 in this template). 1:1 messages are not affected by consumer lag because they bypass Kafka for direct WebSocket push.

Why separate Message Cache and Message DB?

Message Cache (Redis) stores the last 100 messages per conversation for instant retrieval — the common case when a user opens a chat. Message DB (Cassandra) stores the full unlimited history for scrollback. Without the cache, every chat open would hit Cassandra (8ms vs 1ms). The 85% cache hit rate means 85% of message reads complete in 1ms. Only scrollback past 100 messages or cold conversations require a Cassandra query. This is a classic two-tier caching pattern.

What is the single-region limitation of this architecture?

All WebSocket connections terminate in one AWS region (us-east-1). A user in Tokyo connecting to Virginia experiences ~150ms RTT for the initial connection and every WebSocket frame. While message delivery is still 'fast' (~200ms including network), it is noticeably slower than the sub-50ms experienced by users near the data center. The Multi-Region Global variant solves this with per-region gateways, reducing intra-region delivery to sub-50ms regardless of user location.

Related Templates

Real-Time Chat — Naive (HTTP Polling)
Real-Time Chat — Multi-Region Global (CRDT)
