A real-time chat system like WhatsApp delivers 500K messages/s with sub-200ms end-to-end latency using WebSocket gateways, Cassandra message storage partitioned by conversation ID, and fan-out on write for group delivery. This 6-component architecture handles 1:1 messaging, 256-member group chats, read receipts, typing indicators, and offline inbox queuing with push notifications via APNs/FCM.
Real-time chat is one of the most frequently asked system design questions at top tech companies because it combines real-time communication, persistent storage, presence management, and offline delivery into a single problem. Building a system like WhatsApp or Slack requires engineering solutions for bidirectional communication, message ordering guarantees, and delivery semantics that work reliably across unreliable mobile networks.
At WhatsApp's scale, the system handles over 100 billion messages per day across 2 billion active users. Messages must be delivered in order within a conversation, exactly once under normal conditions, and at-least-once for offline recipients who reconnect. Group chats add fan-out complexity: a single message sent to a 256-member group generates 255 delivery operations that must complete reliably without blocking the sender.
Beyond basic message delivery, a production chat system must support read receipts (delivered, read), typing indicators (real-time ephemeral signals), media attachments (images, videos, documents), end-to-end encryption, and multi-device synchronization. Each of these features introduces its own scaling challenges. Typing indicators, for example, are high-frequency signals that must not be persisted but must be delivered with low latency — a fundamentally different workload from message storage.
This template models the complete messaging architecture: WebSocket gateway for persistent connections, chat service for message routing, fan-out service for group delivery, message store with per-conversation partitioning, presence service for online status and typing indicators, and a push notification service for offline delivery. The simulation shows how fan-out strategy affects latency for large groups and how connection management scales with user count.
## How the WebSocket Gateway Manages Persistent Connections
The chat architecture centers on a WebSocket Gateway that maintains persistent bidirectional connections with clients. When a user opens the app, they establish a WebSocket connection to the gateway, which registers the connection in a distributed connection registry (Redis). This registry maps user IDs to gateway instances, enabling any service in the cluster to route messages to the correct gateway and then to the correct WebSocket. Each gateway instance handles up to 500K concurrent connections, with multiple instances behind a load balancer that uses consistent hashing to route reconnections back to the same instance when possible.
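The registry behaviour described above can be sketched in a few lines. This is a minimal in-memory stand-in for the Redis mapping, not the template's actual implementation: the class name `ConnectionRegistry` and its methods are hypothetical, and a real deployment would likely use Redis keys with expiry rather than a Python dict.

```python
from typing import Dict, Optional


class ConnectionRegistry:
    """In-memory stand-in for the Redis map of user ID -> gateway instance."""

    def __init__(self) -> None:
        self._gateway_by_user: Dict[str, str] = {}

    def register(self, user_id: str, gateway_id: str) -> None:
        # Called when a WebSocket connects; a reconnect to a different
        # gateway simply overwrites the previous entry.
        self._gateway_by_user[user_id] = gateway_id

    def unregister(self, user_id: str, gateway_id: str) -> None:
        # Only remove the entry if it still points at the closing gateway,
        # so a late disconnect cannot erase a newer connection.
        if self._gateway_by_user.get(user_id) == gateway_id:
            del self._gateway_by_user[user_id]

    def lookup(self, user_id: str) -> Optional[str]:
        # None means the user has no live connection (treated as offline).
        return self._gateway_by_user.get(user_id)


registry = ConnectionRegistry()
registry.register("alice", "gw-1")
registry.register("alice", "gw-2")     # reconnect lands on another gateway
registry.unregister("alice", "gw-1")   # late disconnect from the old gateway
```

The guarded `unregister` is one way to keep the registry consistent when a user reconnects before the old connection's close event has been processed.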
## Direct Message Delivery and Offline Inbox Queuing
For 1:1 messages, the flow is straightforward: the sender's gateway receives the message, the Chat Service persists it to the Message Store (Cassandra, partitioned by conversation ID for ordered retrieval), and then looks up the recipient's gateway in the connection registry. If the recipient is online, the message is pushed directly via their WebSocket connection with end-to-end latency of approximately 45ms. If offline, the message is queued in the recipient's offline inbox partition in Cassandra and a push notification is sent via APNs (iOS) or FCM (Android) to alert the user.
## Group Chat Fan-Out Strategy and Scaling
Group chat uses a fan-out on write strategy: when a message is sent to a group, the Fan-Out Service retrieves the group membership list and writes a delivery record for each member. For small groups (under 100 members), fan-out happens synchronously, adding negligible latency. For large groups, fan-out is performed asynchronously via a message queue to prevent the sender from experiencing latency proportional to group size. Each delivery record triggers the same online/offline routing logic as 1:1 messages, with Redis MGET batch lookups resolving gateway assignments for all group members in a single round-trip.
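The synchronous versus asynchronous split can be illustrated with a small sketch. The 100-member threshold comes from the text; the function name `fan_out`, the `deliver` callback, and the use of a `deque` as a stand-in for the message queue are assumptions made for illustration only.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

SYNC_FANOUT_LIMIT = 100  # groups under this size fan out inline (from the text)


def fan_out(
    sender_id: str,
    member_ids: List[str],
    message: str,
    deliver: Callable[[str, str], None],
    queue: Deque[Tuple[str, str]],
) -> str:
    """Deliver to every member except the sender; return the path taken."""
    recipients = [m for m in member_ids if m != sender_id]
    if len(member_ids) < SYNC_FANOUT_LIMIT:
        for member_id in recipients:
            deliver(member_id, message)       # inline: the sender waits
        return "sync"
    for member_id in recipients:
        queue.append((member_id, message))    # workers would drain this later
    return "async"


delivered: List[str] = []
queue: Deque[Tuple[str, str]] = deque()
small = fan_out("u0", ["u0", "u1", "u2"], "hi", lambda m, _: delivered.append(m), queue)
large = fan_out("u0", [f"u{i}" for i in range(256)], "hi", lambda m, _: delivered.append(m), queue)
```

In this sketch the 256-member group enqueues 255 delivery records and returns immediately, which matches the "255 delivery operations" figure given earlier.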
## Presence Detection and Ephemeral Typing Indicators
The Presence Service tracks online status and typing indicators using heartbeat-based liveness detection. Clients send heartbeats every 30 seconds; the presence service marks users as offline if two consecutive heartbeats are missed, applying a 60-second timeout window. Typing indicators are ephemeral signals that bypass the message store entirely — they flow directly from the sender's gateway to the recipient's gateway via the connection registry, with a short 5-second TTL to handle the case where the sender closes the app mid-typing. This separation ensures that high-frequency ephemeral signals do not pollute the durable message storage layer.
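A minimal sketch of the two timers, assuming the 60-second heartbeat timeout and 5-second typing TTL stated above. The `PresenceTracker` class is hypothetical, and timestamps are passed in explicitly so the behaviour is easy to follow; a real service would read a clock and likely rely on key expiry in a shared store.

```python
from typing import Dict, Tuple

HEARTBEAT_TIMEOUT_S = 60.0  # two missed 30-second heartbeats
TYPING_TTL_S = 5.0


class PresenceTracker:
    def __init__(self) -> None:
        self._last_heartbeat: Dict[str, float] = {}
        self._typing_until: Dict[Tuple[str, str], float] = {}

    def heartbeat(self, user_id: str, now: float) -> None:
        self._last_heartbeat[user_id] = now

    def is_online(self, user_id: str, now: float) -> bool:
        last = self._last_heartbeat.get(user_id)
        return last is not None and now - last < HEARTBEAT_TIMEOUT_S

    def typing(self, user_id: str, conversation_id: str, now: float) -> None:
        # Held only in memory; nothing is written to the message store.
        self._typing_until[(user_id, conversation_id)] = now + TYPING_TTL_S

    def is_typing(self, user_id: str, conversation_id: str, now: float) -> bool:
        return now < self._typing_until.get((user_id, conversation_id), 0.0)


tracker = PresenceTracker()
tracker.heartbeat("alice", now=0.0)
tracker.typing("alice", "conv-1", now=10.0)
```

Because the typing flag simply expires, a sender who closes the app mid-typing needs no explicit "stopped typing" event.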
The chat system's request flow splits into two distinct paths: message sending (write path) and real-time delivery (push path). The WebSocket Gateway is the central nervous system — it maintains persistent bidirectional connections with every online client and routes messages without polling. A distributed connection registry in Redis maps user IDs to the specific gateway instance holding their WebSocket, enabling any service to push messages to any user.
For 1:1 messages, delivery is direct: the Chat Service looks up the recipient's gateway in the connection registry and pushes the message through that gateway's WebSocket. For group chats, the Fan-Out Service resolves all group members and pushes to each member's gateway in parallel. Small groups (<100 members) use synchronous fan-out; large groups use a message queue for asynchronous delivery.
The Presence Service runs orthogonally to message delivery. Clients send heartbeat pings every 30 seconds through their WebSocket. If no heartbeat arrives for 60 seconds, the user is marked offline. Typing indicators flow through the same WebSocket channel but bypass the message store entirely: they are ephemeral signals with a 5-second TTL.
Pseudocode
```
// Message delivery: 1:1 direct push
async function sendMessage(senderId, senderName, recipientId, content):
    // 1. Persist to Cassandra (partitioned by conversation).
    //    The ID and timestamp are generated here because an INSERT does not
    //    return values produced server-side by uuid() or now().
    conversationId = resolveConversation(senderId, recipientId)
    messageId = generateUuid()
    createdAt = currentTimestamp()
    await cassandra.execute(
        "INSERT INTO messages (conversation_id, message_id, sender_id, content, created_at)
         VALUES (?, ?, ?, ?, ?)",
        [conversationId, messageId, senderId, content, createdAt]
    )  // ~8ms (quorum write, 2 of 3 replicas)

    // 2. Look up the recipient's WebSocket gateway
    gatewayId = await redis.get(`user:${recipientId}`)

    if gatewayId:
        // 3a. Online → direct push via their gateway
        await gatewayCluster[gatewayId].push(recipientId, {
            from: senderId, content, messageId, timestamp: createdAt
        })  // ~5ms (internal gRPC)
        return { status: "delivered" }
    else:
        // 3b. Offline → store in inbox + push notification
        await cassandra.execute(
            "INSERT INTO offline_inbox (user_id, message_id) VALUES (?, ?)",
            [recipientId, messageId]
        )
        await pushNotificationService.send(recipientId, {
            title: senderName, body: truncate(content, 100)
        })
        return { status: "sent" }  // delivered when they come online

// Group fan-out: parallel delivery to all online members
async function fanOutGroupMessage(groupId, message):
    memberIds = await redis.smembers(`group:${groupId}:members`)
    gateways = await redis.mget(memberIds.map(id => `user:${id}`))

    // MGET returns values in key order, so pair each member with its gateway
    // before dropping offline members (null entries).
    online = memberIds
        .map((memberId, i) => [memberId, gateways[i]])
        .filter(([memberId, gw]) => gw)

    // Push in parallel to all online members
    await Promise.all(
        online.map(([memberId, gw]) => gatewayCluster[gw].push(memberId, message))
    )  // ~15ms total (parallel, worst-case gateway)
```
Choice
WebSocket with HTTP long-polling fallback
Rationale
WebSocket provides true bidirectional communication with minimal overhead after the initial handshake. Unlike HTTP polling, it delivers messages with sub-second latency without wasted requests. The HTTP long-polling fallback handles restrictive networks (corporate proxies, certain mobile carriers) where WebSocket upgrades fail.
Choice
Cassandra with conversation ID as partition key
Rationale
Chat message access patterns are almost exclusively sequential reads within a conversation (loading message history). Cassandra's wide-column model stores all messages in a conversation within a single partition, enabling efficient range scans sorted by timestamp. The partition key ensures all messages in a conversation are co-located on the same node, minimizing read latency.
Choice
Fan-out on write with async processing for large groups
Rationale
Fan-out on write ensures that each recipient's inbox is pre-materialized, enabling fast reads when they open the app. For small groups, synchronous fan-out adds negligible latency. For large groups (100+ members), async fan-out via a message queue prevents the sender from waiting for 100+ write operations. The trade-off is higher write amplification, which is acceptable because group size is capped at 256 members, so each message produces a bounded number of extra writes.
Choice
Heartbeat-based with 30-second interval and 60-second timeout
Rationale
Heartbeat-based presence detection is simple and reliable. The 30-second interval balances battery consumption on mobile devices against presence accuracy. A 60-second timeout (two missed heartbeats) provides confidence that the user is truly offline rather than experiencing a momentary network hiccup. Typing indicators use a separate, shorter-lived signal path to avoid polluting the presence heartbeat channel.
Target RPS: 500K messages/s
Latency (p99): <200ms (message delivery)
Storage: ~10 TB/year (message history)
Availability: 99.99%
This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.
When a recipient is offline, messages are stored in their offline inbox (a per-user queue in the message store). When the user reconnects, the chat service retrieves all pending messages from the offline inbox, delivers them via the new WebSocket connection, and marks them as delivered. A push notification is also sent via APNs (iOS) or FCM (Android) to alert the user of new messages, prompting them to open the app.
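The reconnect path can be sketched as a per-user queue that is drained in arrival order. The `OfflineInbox` class and the `push` callback (which returns `True` when the WebSocket write succeeds) are hypothetical names. This sketch removes a message only after a successful push, one way of keeping undelivered messages queued if the connection drops mid-drain.

```python
from collections import defaultdict, deque
from typing import Callable, DefaultDict, Deque, List


class OfflineInbox:
    def __init__(self) -> None:
        self._pending: DefaultDict[str, Deque[str]] = defaultdict(deque)

    def enqueue(self, user_id: str, message_id: str) -> None:
        self._pending[user_id].append(message_id)

    def drain(self, user_id: str, push: Callable[[str], bool]) -> List[str]:
        """Push pending messages in arrival order; stop at the first failure."""
        delivered: List[str] = []
        pending = self._pending[user_id]
        while pending:
            if not push(pending[0]):
                break                            # connection dropped; keep the rest
            delivered.append(pending.popleft())  # remove only after a successful push
        return delivered

    def pending(self, user_id: str) -> List[str]:
        return list(self._pending[user_id])


inbox = OfflineInbox()
for message_id in ("m1", "m2", "m3"):
    inbox.enqueue("bob", message_id)

first = inbox.drain("bob", lambda m: m != "m3")   # simulated failure on m3
still_pending = inbox.pending("bob")
second = inbox.drain("bob", lambda m: True)       # next reconnect succeeds
```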
Fan-out on write duplicates each message to every group member's inbox at send time, making reads fast but writes expensive. Fan-out on read stores a single copy and fetches the group timeline at read time, making writes fast but reads expensive. Chat systems prefer fan-out on write because users read messages far more often than they send them, and the read path must be as fast as possible for a good user experience.
Message ordering is maintained per-conversation, not globally. Each message receives a monotonically increasing sequence number within its conversation partition. The server assigns sequence numbers (not clients) to prevent clock skew issues. Cassandra's clustering key on timestamp within a conversation partition guarantees ordered retrieval. For display, clients sort by the server-assigned sequence number.
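As a rough illustration of per-conversation sequencing, the sketch below hands out independent counters per conversation and shows a client restoring order by sequence number. `SequenceAssigner` is a hypothetical single-process stand-in; a real server would need an atomic counter per conversation shared across instances.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class SequenceAssigner:
    """Hands out 1, 2, 3, ... independently for each conversation."""

    def __init__(self) -> None:
        self._last: DefaultDict[str, int] = defaultdict(int)

    def assign(self, conversation_id: str) -> int:
        self._last[conversation_id] += 1
        return self._last[conversation_id]


assigner = SequenceAssigner()
# (sequence number, content) pairs as the server would stamp them
stamped: List[Tuple[int, str]] = [
    (assigner.assign("conv-a"), "hello"),
    (assigner.assign("conv-a"), "how are you?"),
]
other = assigner.assign("conv-b")   # a different conversation starts at 1

# A client that receives messages out of order restores order by sequence.
received = [stamped[1], stamped[0]]
displayed = [content for _, content in sorted(received)]
```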
Read receipts are handled as lightweight status updates: when a user reads a message, the client sends a 'read' event with the message ID and conversation ID. The chat service updates the message's delivery status (sent -> delivered -> read) and notifies the original sender via their WebSocket connection. For group chats, read receipts are batched — the client sends a single 'read up to message X' event rather than individual receipts for each message.
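The batched "read up to message X" receipt amounts to a per-reader high-water mark. The sketch below assumes messages are identified by their per-conversation sequence number; the `ReadCursors` class and its method names are hypothetical.

```python
from typing import Dict, Tuple


class ReadCursors:
    """Tracks, per (conversation, reader), the highest sequence number read."""

    def __init__(self) -> None:
        self._read_up_to: Dict[Tuple[str, str], int] = {}

    def mark_read_up_to(self, conversation_id: str, reader_id: str, seq: int) -> bool:
        """Apply a batched receipt; return True if the cursor advanced."""
        key = (conversation_id, reader_id)
        if seq <= self._read_up_to.get(key, 0):
            return False            # stale or duplicate receipt, nothing to notify
        self._read_up_to[key] = seq
        return True

    def has_read(self, conversation_id: str, reader_id: str, seq: int) -> bool:
        return seq <= self._read_up_to.get((conversation_id, reader_id), 0)


cursors = ReadCursors()
advanced = cursors.mark_read_up_to("group-1", "bob", 5)
stale = cursors.mark_read_up_to("group-1", "bob", 3)
```

Returning whether the cursor advanced gives the caller a simple signal for when the original sender needs to be notified.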
End-to-end encryption ensures that the server never has access to plaintext message content. Each user generates a public-private key pair. The sender encrypts the message with the recipient's public key before sending. The server stores and routes the encrypted ciphertext without the ability to decrypt it. For group chats, the Signal Protocol's group messaging extension uses a shared group key that is ratcheted forward with each message, providing forward secrecy.
Each WebSocket gateway instance maintains persistent TCP connections with online users, and each connection consumes a file descriptor and roughly 20-50 KB of memory. A single server with 1 million file descriptors and 64 GB of RAM can handle approximately 500K concurrent connections. At WhatsApp scale with 200M concurrent users, you need at least 400 gateway instances. The key challenge is the connection registry: when Service A needs to push a message to User B, it must know which gateway holds User B's connection. A Redis-based registry mapping user IDs to gateway instances solves this with sub-millisecond lookups, but the registry must be kept consistent as connections drop and reconnect.
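The sizing figures above can be checked with a short calculation. All inputs are the numbers quoted in the paragraph; the variable names are illustrative, and decimal units (1 GB = 1,000,000 KB) are assumed.

```python
import math

CONNECTIONS_PER_GATEWAY = 500_000
CONCURRENT_USERS = 200_000_000
MEMORY_PER_CONNECTION_KB = (20, 50)   # low and high estimates from the text

gateways_needed = math.ceil(CONCURRENT_USERS / CONNECTIONS_PER_GATEWAY)

# Connection memory per gateway at full load, in GB.
memory_gb = tuple(
    CONNECTIONS_PER_GATEWAY * kb / 1_000_000 for kb in MEMORY_PER_CONNECTION_KB
)

print(gateways_needed)   # 400
print(memory_gb)         # (10.0, 25.0)
```

Connection state therefore needs roughly 10 to 25 GB per gateway, which fits within 64 GB of RAM. The 400-instance figure is a floor with no headroom for failures or deployments.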
Multi-device sync requires each device to maintain its own read cursor within each conversation. When a message arrives, the fan-out writes a delivery record for each of the user's registered devices. Each device independently fetches messages from its cursor position forward using the conversation partition in Cassandra. Sync conflicts arise when a user reads a message on one device: the read receipt must propagate to all other devices to clear notification badges. This is handled by publishing read events to a per-user pub/sub channel that all connected devices subscribe to, adding roughly 2-3 KB/s of overhead per additional device.
At-most-once delivery (fire-and-forget) risks message loss on network failures but is simpler and lower-latency. At-least-once delivery guarantees no message loss by requiring server-side acknowledgment and client-side retry, but can produce duplicates if the ACK is lost after the server persists the message. Chat systems like WhatsApp use at-least-once delivery with client-side deduplication: each message carries a unique client-generated ID, and the recipient discards messages with IDs it has already processed. The overhead is a small in-memory set of recent message IDs per conversation, typically capped at 1,000 entries.
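Client-side deduplication with a capped set can be sketched as follows. The `RecentMessageIds` class is hypothetical, with a default capacity matching the 1,000-entry figure above; an `OrderedDict` is used so the oldest ID can be evicted first.

```python
from collections import OrderedDict


class RecentMessageIds:
    """Bounded set of recently seen message IDs for one conversation."""

    def __init__(self, capacity: int = 1000) -> None:
        self._capacity = capacity
        self._seen: "OrderedDict[str, None]" = OrderedDict()

    def first_time_seen(self, message_id: str) -> bool:
        """Return True if the message is new and should be processed."""
        if message_id in self._seen:
            return False
        self._seen[message_id] = None
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)   # evict the oldest ID
        return True


seen = RecentMessageIds(capacity=2)   # tiny capacity to show eviction
results = [seen.first_time_seen(m) for m in ["a", "b", "a", "c", "a"]]
```

With a capacity of 2, the retry of `"a"` is rejected while its ID is still in the set, but after `"c"` evicts it the same ID is accepted again. The cap therefore bounds memory at the cost of missing duplicates that arrive after many newer messages.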