Medium · 6 components · Interview: Very High

Real-Time Chat (WhatsApp)

A real-time chat system like WhatsApp delivers 500K messages/s with sub-200ms end-to-end latency using WebSocket gateways, Cassandra message storage partitioned by conversation ID, and fan-out on write for group delivery. This 6-component architecture handles 1:1 messaging, 256-member group chats, read receipts, typing indicators, and offline inbox queuing with push notifications via APNs/FCM.

Real-Time · Fan-Out · WebSocket

Problem Statement

Real-time chat is one of the most frequently asked system design questions at top tech companies because it combines real-time communication, persistent storage, presence management, and offline delivery into a single problem. Building a system like WhatsApp or Slack requires engineering solutions for bidirectional communication, message ordering guarantees, and delivery semantics that work reliably across unreliable mobile networks.

At WhatsApp's scale, the system handles over 100 billion messages per day across 2 billion active users. Messages must be delivered in order within a conversation, exactly once under normal conditions, and at-least-once for offline recipients who reconnect. Group chats add fan-out complexity: a single message sent to a 256-member group generates 255 delivery operations that must complete reliably without blocking the sender.

Beyond basic message delivery, a production chat system must support read receipts (delivered, read), typing indicators (real-time ephemeral signals), media attachments (images, videos, documents), end-to-end encryption, and multi-device synchronization. Each of these features introduces its own scaling challenges. Typing indicators, for example, are high-frequency signals that must not be persisted but must be delivered with low latency — a fundamentally different workload from message storage.

This template models the complete messaging architecture: WebSocket gateway for persistent connections, chat service for message routing, fan-out service for group delivery, message store with per-conversation partitioning, presence service for online status and typing indicators, and a push notification service for offline delivery. The simulation shows how fan-out strategy affects latency for large groups and how connection management scales with user count.

Architecture Overview

## How the WebSocket Gateway Manages Persistent Connections

The chat architecture centers on a WebSocket Gateway that maintains persistent bidirectional connections with clients. When a user opens the app, they establish a WebSocket connection to the gateway, which registers the connection in a distributed connection registry (Redis). This registry maps user IDs to gateway instances, enabling any service in the cluster to route messages to the correct gateway and then to the correct WebSocket. Each gateway instance handles up to 500K concurrent connections, with multiple instances behind a load balancer that uses consistent hashing to route reconnections back to the same instance when possible.
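
As a rough illustration of the registry described above, the sketch below uses an in-memory dictionary as a stand-in for Redis. The class and method names are hypothetical, and the "delete only if still owner" check is an assumption about how stale disconnects might be handled, not something this template specifies.

```python
from typing import Dict, Optional


class ConnectionRegistry:
    """In-memory stand-in for the Redis map of user ID -> gateway instance ID."""

    def __init__(self) -> None:
        self._gateway_by_user: Dict[str, str] = {}

    def register(self, user_id: str, gateway_id: str) -> None:
        # Called by a gateway once the WebSocket handshake succeeds.
        self._gateway_by_user[f"user:{user_id}"] = gateway_id

    def unregister(self, user_id: str, gateway_id: str) -> bool:
        # Delete only if this gateway still owns the entry, so a late
        # disconnect from an old gateway cannot erase a newer registration.
        key = f"user:{user_id}"
        if self._gateway_by_user.get(key) == gateway_id:
            del self._gateway_by_user[key]
            return True
        return False

    def lookup(self, user_id: str) -> Optional[str]:
        # None means the user has no live connection and is treated as offline.
        return self._gateway_by_user.get(f"user:{user_id}")


registry = ConnectionRegistry()
registry.register("alice", "gw-1")
registry.register("alice", "gw-2")  # alice reconnects through another gateway
stale_removed = registry.unregister("alice", "gw-1")  # late cleanup from gw-1
```

In a real deployment the compare-and-delete step would need to be atomic on the Redis side; how that is done is outside the scope of this sketch.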

## Direct Message Delivery and Offline Inbox Queuing

For 1:1 messages, the flow is straightforward: the sender's gateway receives the message, the Chat Service persists it to the Message Store (Cassandra, partitioned by conversation ID for ordered retrieval), and then looks up the recipient's gateway in the connection registry. If the recipient is online, the message is pushed directly via their WebSocket connection with end-to-end latency of approximately 45ms. If offline, the message is queued in the recipient's offline inbox partition in Cassandra and a push notification is sent via APNs (iOS) or FCM (Android) to alert the user.

## Group Chat Fan-Out Strategy and Scaling

Group chat uses a fan-out on write strategy: when a message is sent to a group, the Fan-Out Service retrieves the group membership list and writes a delivery record for each member. For small groups (under 100 members), fan-out happens synchronously, adding negligible latency. For large groups, fan-out is performed asynchronously via a message queue to prevent the sender from experiencing latency proportional to group size. Each delivery record triggers the same online/offline routing logic as 1:1 messages, with Redis MGET batch lookups resolving gateway assignments for all group members in a single round-trip.
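
The synchronous versus asynchronous split can be sketched as follows. This is a minimal model under stated assumptions: a `deque` stands in for the message queue, `deliver` is a hypothetical callback representing the online/offline routing step, and the 100-member threshold is the one quoted in the text.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

SYNC_FANOUT_LIMIT = 100  # groups under 100 members fan out inline

Delivery = Tuple[str, str]  # (member_id, message_id)


def fan_out(
    sender_id: str,
    member_ids: List[str],
    message_id: str,
    deliver: Callable[[str, str], None],
    queue: Deque[Delivery],
) -> str:
    """Create one delivery per recipient and report which path was taken."""
    recipients = [m for m in member_ids if m != sender_id]
    if len(member_ids) < SYNC_FANOUT_LIMIT:
        for member_id in recipients:
            deliver(member_id, message_id)
        return "sync"
    # Large group: enqueue and return at once so the sender is not blocked.
    queue.extend((member_id, message_id) for member_id in recipients)
    return "async"


def drain(queue: Deque[Delivery], deliver: Callable[[str, str], None]) -> int:
    """Worker loop: process queued deliveries and return how many were handled."""
    handled = 0
    while queue:
        member_id, message_id = queue.popleft()
        deliver(member_id, message_id)
        handled += 1
    return handled


delivered: List[Delivery] = []
queue: Deque[Delivery] = deque()


def record(member_id: str, message_id: str) -> None:
    delivered.append((member_id, message_id))


small = fan_out("u0", [f"u{i}" for i in range(5)], "m1", record, queue)
large = fan_out("u0", [f"u{i}" for i in range(256)], "m2", record, queue)
```

For the 256-member group the sender returns as soon as 255 deliveries are queued; a separate worker calling `drain` completes them later.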

## Presence Detection and Ephemeral Typing Indicators

The Presence Service tracks online status and typing indicators using heartbeat-based liveness detection. Clients send heartbeats every 30 seconds; the presence service marks users as offline if two consecutive heartbeats are missed, applying a 60-second timeout window. Typing indicators are ephemeral signals that bypass the message store entirely — they flow directly from the sender's gateway to the recipient's gateway via the connection registry, with a short 5-second TTL to handle the case where the sender closes the app mid-typing. This separation ensures that high-frequency ephemeral signals do not pollute the durable message storage layer.
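
The timing rules above (30-second heartbeats, 60-second offline timeout, 5-second typing TTL) can be modelled in a few lines. The class below is an illustrative sketch with hypothetical names; it takes the current time as an argument so the behaviour is easy to reason about, whereas a real service would read a clock.

```python
from typing import Dict, Tuple

HEARTBEAT_INTERVAL_S = 30.0
OFFLINE_TIMEOUT_S = 2 * HEARTBEAT_INTERVAL_S  # two missed heartbeats
TYPING_TTL_S = 5.0


class PresenceTracker:
    def __init__(self) -> None:
        self._last_heartbeat: Dict[str, float] = {}
        self._typing_until: Dict[Tuple[str, str], float] = {}

    def heartbeat(self, user_id: str, now: float) -> None:
        self._last_heartbeat[user_id] = now

    def is_online(self, user_id: str, now: float) -> bool:
        last = self._last_heartbeat.get(user_id)
        # Offline once a full timeout window passes with no heartbeat.
        return last is not None and now - last < OFFLINE_TIMEOUT_S

    def typing(self, user_id: str, conversation_id: str, now: float) -> None:
        # Nothing is persisted: only an expiry time is kept in memory.
        self._typing_until[(user_id, conversation_id)] = now + TYPING_TTL_S

    def is_typing(self, user_id: str, conversation_id: str, now: float) -> bool:
        return now < self._typing_until.get((user_id, conversation_id), 0.0)


tracker = PresenceTracker()
tracker.heartbeat("alice", now=0.0)
tracker.typing("alice", "conv-1", now=10.0)
```

If the sender closes the app mid-typing, no further `typing` call arrives and the indicator expires on its own after five seconds.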

Request Flow — Message Send & Real-Time Delivery

The chat system's request flow splits into two distinct paths: message sending (write path) and real-time delivery (push path). The WebSocket Gateway is the central nervous system — it maintains persistent bidirectional connections with every online client and routes messages without polling. A distributed connection registry in Redis maps user IDs to the specific gateway instance holding their WebSocket, enabling any service to push messages to any user.

For 1:1 messages, delivery is direct: the Chat Service looks up the recipient's gateway in the connection registry and pushes the message through that gateway's WebSocket. For group chats, the Fan-Out Service resolves all group members and pushes to each member's gateway in parallel. Small groups (<100 members) use synchronous fan-out; large groups use a message queue for asynchronous delivery.

The Presence Service runs independently of message delivery. Clients send heartbeat pings every 30 seconds through their WebSocket. If no heartbeat arrives for 60 seconds, the user is marked offline. Typing indicators flow through the same WebSocket channel but bypass the message store entirely: they are ephemeral signals with a 5-second TTL in the connection registry.


Step-by-Step Walkthrough

  1. The sender transmits a message through their persistent WebSocket connection to their assigned WebSocket Gateway instance. The gateway authenticates the message using the session token established at connection time.
  2. The WebSocket Gateway forwards the message to the Chat Service, which persists it to the Cassandra Message Store. Messages are partitioned by conversation_id for locality: all messages in a conversation live on the same Cassandra partition (~8ms quorum write).
  3. The Chat Service queries the Redis Connection Registry to find which WebSocket Gateway instance holds the recipient's connection. This is a simple key-value lookup: user:{recipientId} → gateway-instance-id (~1ms).
  4. For online recipients, the Chat Service pushes the message directly to the recipient's WebSocket Gateway, which delivers it through the recipient's WebSocket connection. End-to-end latency: ~45ms.
  5. If the recipient is offline (no entry in the connection registry), the message is stored in an offline inbox partition in Cassandra. A push notification is triggered via APNs (iOS) or FCM (Android) through the Push Notification Service.
  6. For group chats, the Fan-Out Service resolves all group members via Redis MGET (batch lookup), determines which members are online and which gateway holds each connection, then pushes in parallel to all online members' gateways.
  7. The Presence Service tracks online/offline status via WebSocket heartbeats. Clients ping every 30 seconds; 60 seconds without a heartbeat marks the user offline and removes them from the connection registry.
  8. Typing indicators bypass the message store entirely. They flow from the sender's gateway through the connection registry directly to the recipient's gateway with a 5-second TTL. Ephemeral signals need no persistence.

Pseudocode

// Message delivery: 1:1 direct push
async function sendMessage(senderId, recipientId, content):
    // 1. Persist to Cassandra (partitioned by conversation).
    //    The ID and timestamp are generated up front so later steps can use them.
    conversationId = resolveConversation(senderId, recipientId)
    messageId = uuid()
    createdAt = now()
    await cassandra.execute(
        "INSERT INTO messages (conversation_id, message_id, sender_id, content, created_at)
         VALUES (?, ?, ?, ?, ?)",
        [conversationId, messageId, senderId, content, createdAt]
    )   // ~8ms (quorum write, 2 of 3 replicas)

    // 2. Lookup recipient's WebSocket gateway
    gatewayId = await redis.get(`user:${recipientId}`)

    if gatewayId:
        // 3a. Online → direct push via their gateway
        await gatewayCluster[gatewayId].push(recipientId, {
            from: senderId, content, messageId: messageId, timestamp: createdAt
        })   // ~5ms (internal gRPC)
        return { status: "delivered" }
    else:
        // 3b. Offline → store in inbox + push notification
        await cassandra.execute(
            "INSERT INTO offline_inbox (user_id, message_id) VALUES (?, ?)",
            [recipientId, messageId]
        )
        // lookupDisplayName is a hypothetical helper, not defined in this template
        senderName = await lookupDisplayName(senderId)
        await pushNotificationService.send(recipientId, {
            title: senderName, body: truncate(content, 100)
        })
        return { status: "sent" }   // delivered when they come online

// Group fan-out: parallel delivery to all online members
async function fanOutGroupMessage(groupId, message):
    memberIds = await redis.smembers(`group:${groupId}:members`)
    // MGET returns values in key order, so pair each member with its gateway
    gateways = await redis.mget(memberIds.map(id => `user:${id}`))
    online = memberIds
        .map((memberId, i) => [memberId, gateways[i]])
        .filter(([memberId, gw]) => gw && memberId != message.from)

    // Push in parallel to all online members (offline members are skipped here)
    await Promise.all(
        online.map(([memberId, gw]) =>
            gatewayCluster[gw].push(memberId, message)
        )
    )   // ~15ms total (parallel, worst-case gateway)

Key Design Decisions
Connection Protocol

Choice

WebSocket with HTTP long-polling fallback

Rationale

WebSocket provides true bidirectional communication with minimal overhead after the initial handshake. Unlike HTTP polling, it delivers messages with sub-second latency without wasted requests. The HTTP long-polling fallback handles restrictive networks (corporate proxies, certain mobile carriers) where WebSocket upgrades fail.

Message Storage

Choice

Cassandra with conversation ID as partition key

Rationale

Chat message access patterns are almost exclusively sequential reads within a conversation (loading message history). Cassandra's wide-column model stores all messages in a conversation within a single partition, enabling efficient range scans sorted by timestamp. The partition key ensures all messages in a conversation are co-located on the same node, minimizing read latency.
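
To make the access pattern concrete, here is a small in-memory model of "one partition per conversation, rows sorted by timestamp". It is only an analogy for the Cassandra layout, not a client for it, and the class and field names are illustrative.

```python
import bisect
from collections import defaultdict
from typing import DefaultDict, List, Tuple

Row = Tuple[int, str, str]  # (created_at_ms, sender_id, content)


class ConversationStore:
    """Each conversation_id owns one list kept sorted by created_at_ms."""

    def __init__(self) -> None:
        self._partitions: DefaultDict[str, List[Row]] = defaultdict(list)

    def insert(self, conversation_id: str, created_at_ms: int,
               sender_id: str, content: str) -> None:
        # insort keeps the partition ordered, like a clustering column would.
        bisect.insort(self._partitions[conversation_id],
                      (created_at_ms, sender_id, content))

    def history(self, conversation_id: str, since_ms: int = 0,
                limit: int = 50) -> List[Row]:
        rows = self._partitions.get(conversation_id, [])
        # Binary search to the first row at or after since_ms, then scan forward.
        start = bisect.bisect_left(rows, (since_ms,))
        return rows[start:start + limit]


store = ConversationStore()
store.insert("c1", 300, "bob", "third")
store.insert("c1", 100, "alice", "first")
store.insert("c1", 200, "bob", "second")
store.insert("c2", 150, "carol", "other conversation")
c1_contents = [content for _, _, content in store.history("c1")]
```

Loading history touches a single partition and reads a contiguous slice, which is the property the partition-key choice is meant to provide.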

Fan-Out Strategy

Choice

Fan-out on write with async processing for large groups

Rationale

Fan-out on write ensures that each recipient's inbox is pre-materialized, enabling fast reads when they open the app. For small groups, synchronous fan-out adds negligible latency. For large groups (100+ members), async fan-out via a message queue prevents the sender from waiting for 100+ write operations. The trade-off is higher write amplification, which is acceptable because group size is capped (256 members in this design), so the amplification per message is bounded.

Presence Detection

Choice

Heartbeat-based with 30-second interval and 60-second timeout

Rationale

Heartbeat-based presence detection is simple and reliable. The 30-second interval balances battery consumption on mobile devices against presence accuracy. A 60-second timeout (two missed heartbeats) provides confidence that the user is truly offline rather than experiencing a momentary network hiccup. Typing indicators use a separate, shorter-lived signal path to avoid polluting the presence heartbeat channel.

Scale & Performance

Target RPS: 500K messages/s

Latency (p99): <200ms (message delivery)

Storage: ~10 TB/year (message history)

Availability: 99.99%

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions
How does WhatsApp handle message delivery when the recipient is offline?

When a recipient is offline, messages are stored in their offline inbox (a per-user queue in the message store). When the user reconnects, the chat service retrieves all pending messages from the offline inbox, delivers them via the new WebSocket connection, and marks them as delivered. A push notification is also sent via APNs (iOS) or FCM (Android) to alert the user of new messages, prompting them to open the app.
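
A minimal sketch of the reconnect drain, assuming a per-user FIFO queue and a hypothetical `send` callback that returns `False` when the WebSocket write fails. Removing a message only after a successful send is an assumption made here to keep at-least-once behaviour; the template does not spell this out.

```python
from collections import defaultdict, deque
from typing import Callable, DefaultDict, Deque, List


class OfflineInbox:
    def __init__(self) -> None:
        self._pending: DefaultDict[str, Deque[str]] = defaultdict(deque)

    def enqueue(self, user_id: str, message_id: str) -> None:
        self._pending[user_id].append(message_id)

    def drain(self, user_id: str, send: Callable[[str], bool]) -> List[str]:
        """Deliver pending messages in arrival order; keep anything unsent."""
        delivered: List[str] = []
        queue = self._pending[user_id]
        while queue:
            message_id = queue[0]
            if not send(message_id):
                break  # connection dropped again; retry on the next reconnect
            queue.popleft()  # remove only after a successful send
            delivered.append(message_id)
        return delivered

    def pending(self, user_id: str) -> List[str]:
        return list(self._pending.get(user_id, ()))


inbox = OfflineInbox()
for mid in ("m1", "m2", "m3"):
    inbox.enqueue("bob", mid)

# First reconnect: the connection fails while sending m3.
first_attempt = inbox.drain("bob", send=lambda mid: mid != "m3")
```

A second reconnect with a healthy connection picks up where the first left off.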

What is fan-out on write vs. fan-out on read for group chat?

Fan-out on write duplicates each message to every group member's inbox at send time, making reads fast but writes expensive. Fan-out on read stores a single copy and fetches the group timeline at read time, making writes fast but reads expensive. Chat systems prefer fan-out on write because users read messages far more often than they send them, and the read path must be as fast as possible for a good user experience.

How do you ensure message ordering in a distributed chat system?

Message ordering is maintained per-conversation, not globally. Each message receives a monotonically increasing sequence number within its conversation partition. The server, not the client, assigns sequence numbers so that skewed client clocks cannot reorder messages. A Cassandra clustering key on a server-assigned value (a timestamp or the sequence number itself) within the conversation partition returns messages in order. For display, clients sort by the server-assigned sequence number.
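
A sketch of server-side sequence assignment, using a per-conversation counter held in memory. The names are hypothetical; in a distributed deployment the counter would have to be an atomic increment in shared storage, which this single-process example does not attempt to show.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class SequenceAllocator:
    """Hands out 1, 2, 3, ... independently for each conversation."""

    def __init__(self) -> None:
        self._last: DefaultDict[str, int] = defaultdict(int)

    def assign(self, conversation_id: str) -> int:
        self._last[conversation_id] += 1
        return self._last[conversation_id]


def order_for_display(messages: List[Tuple[int, str]]) -> List[str]:
    # Clients sort by the server-assigned sequence, ignoring arrival order.
    return [content for _, content in sorted(messages, key=lambda m: m[0])]


allocator = SequenceAllocator()
first = (allocator.assign("c1"), "hello")
second = (allocator.assign("c1"), "how are you?")
other = (allocator.assign("c2"), "unrelated conversation")

# The network delivers the second message before the first.
displayed = order_for_display([second, first])
```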

How do you handle read receipts at scale?

Read receipts are handled as lightweight status updates: when a user reads a message, the client sends a 'read' event with the message ID and conversation ID. The chat service updates the message's delivery status (sent -> delivered -> read) and notifies the original sender via their WebSocket connection. For group chats, read receipts are batched — the client sends a single 'read up to message X' event rather than individual receipts for each message.
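
The batched "read up to message X" idea can be sketched with one cursor per reader per conversation. This assumes messages carry increasing integer sequence numbers within a conversation; the class and method names are illustrative.

```python
from typing import Dict, Tuple


class ReadCursors:
    def __init__(self) -> None:
        # (conversation_id, reader_id) -> highest sequence number read
        self._read_up_to: Dict[Tuple[str, str], int] = {}

    def mark_read_up_to(self, conversation_id: str, reader_id: str,
                        seq: int) -> bool:
        """Advance the cursor; return True if the sender should be notified."""
        key = (conversation_id, reader_id)
        if seq <= self._read_up_to.get(key, 0):
            return False  # stale or duplicate event, nothing changed
        self._read_up_to[key] = seq
        return True

    def is_read(self, conversation_id: str, reader_id: str, seq: int) -> bool:
        return seq <= self._read_up_to.get((conversation_id, reader_id), 0)


cursors = ReadCursors()
# One event covers messages 1 through 40 instead of 40 separate receipts.
advanced = cursors.mark_read_up_to("group-7", "bob", 40)
duplicate = cursors.mark_read_up_to("group-7", "bob", 35)
```

Storing a single integer per reader keeps receipt state proportional to group size rather than to message count.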

How does end-to-end encryption work in a chat system?

End-to-end encryption ensures that the server never has access to plaintext message content. Each user generates a public-private key pair. In the simplest model, the sender encrypts the message with the recipient's public key before sending; in practice, protocols such as Signal use the key pairs to agree on session keys that are ratcheted forward with each message, providing forward secrecy. The server stores and routes the encrypted ciphertext without the ability to decrypt it. For group chats, the Signal Protocol's Sender Keys scheme has each sender distribute a sender key to the other members over the pairwise encrypted channels and then ratchet that key forward with each message.

How would you explain the WebSocket gateway scaling challenge to an interviewer?

Each WebSocket gateway instance maintains persistent TCP connections with online users, and each connection consumes a file descriptor and roughly 20-50 KB of memory. A single server with 1 million file descriptors and 64 GB of RAM can handle approximately 500K concurrent connections. At WhatsApp scale with 200M concurrent users, you need at least 400 gateway instances. The key challenge is the connection registry: when Service A needs to push a message to User B, it must know which gateway holds User B's connection. A Redis-based registry mapping user IDs to gateway instances solves this with sub-millisecond lookups, but the registry must be kept consistent as connections drop and reconnect.
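
The arithmetic behind those figures, written out as a quick check. All inputs are the numbers quoted in the answer above; decimal units (1 GB = 1,000,000 KB) are an assumption made for simplicity.

```python
CONCURRENT_USERS = 200_000_000
CONNECTIONS_PER_GATEWAY = 500_000
MEM_PER_CONNECTION_KB = (20, 50)  # low and high estimates per connection
FILE_DESCRIPTOR_LIMIT = 1_000_000
RAM_GB = 64
KB_PER_GB = 1_000_000

# Ceiling division: a partially filled gateway still counts as one instance.
gateways_needed = -(-CONCURRENT_USERS // CONNECTIONS_PER_GATEWAY)

mem_low_gb = CONNECTIONS_PER_GATEWAY * MEM_PER_CONNECTION_KB[0] / KB_PER_GB
mem_high_gb = CONNECTIONS_PER_GATEWAY * MEM_PER_CONNECTION_KB[1] / KB_PER_GB

fits_in_ram = mem_high_gb <= RAM_GB
fits_in_fds = CONNECTIONS_PER_GATEWAY <= FILE_DESCRIPTOR_LIMIT

print(f"gateways needed: {gateways_needed}")
print(f"connection memory per gateway: {mem_low_gb:.0f}-{mem_high_gb:.0f} GB")
```

Connection state uses 10 to 25 GB of the 64 GB, which leaves headroom for buffers and the runtime and is one way to read the 500K-per-instance figure as a conservative target.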

How would you design multi-device message sync for a chat system in an interview?

Multi-device sync requires each device to maintain its own read cursor within each conversation. When a message arrives, the fan-out writes a delivery record for each of the user's registered devices. Each device independently fetches messages from its cursor position forward using the conversation partition in Cassandra. Sync conflicts arise when a user reads a message on one device: the read receipt must propagate to all other devices to clear notification badges. This is handled by publishing read events to a per-user pub/sub channel that all connected devices subscribe to, adding roughly 2-3 KB/s of overhead per additional device.
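
A toy model of per-device cursors for one user in one conversation. It is a sketch under simplifying assumptions: the conversation log is a Python list, sequence numbers are list positions plus one, and the pub/sub channel is reduced to returning the list of devices that would be notified. All names are hypothetical.

```python
from typing import Dict, List, Tuple


class DeviceSync:
    def __init__(self, device_ids: List[str]) -> None:
        self._log: List[Tuple[int, str]] = []  # (seq, content)
        self._cursor: Dict[str, int] = {d: 0 for d in device_ids}
        self._read_up_to: Dict[str, int] = {d: 0 for d in device_ids}

    def append(self, content: str) -> int:
        seq = len(self._log) + 1
        self._log.append((seq, content))
        return seq

    def fetch_new(self, device_id: str) -> List[Tuple[int, str]]:
        # Each device reads forward from its own cursor, independently.
        new = self._log[self._cursor[device_id]:]
        if new:
            self._cursor[device_id] = new[-1][0]
        return new

    def mark_read(self, origin_device: str, seq: int) -> List[str]:
        """Propagate a read event; return the other devices to notify."""
        notified = []
        for device_id, current in self._read_up_to.items():
            if seq > current:
                self._read_up_to[device_id] = seq
                if device_id != origin_device:
                    notified.append(device_id)
        return notified

    def unread_count(self, device_id: str) -> int:
        return len(self._log) - self._read_up_to[device_id]


sync = DeviceSync(["phone", "laptop"])
sync.append("hi")
sync.append("there")
phone_first = sync.fetch_new("phone")
phone_again = sync.fetch_new("phone")
notified = sync.mark_read("phone", seq=2)
```

Reading on the phone clears the badge on the laptop even though the laptop has not fetched the messages yet.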

What are the trade-offs between at-most-once and at-least-once delivery in a chat system?

At-most-once delivery (fire-and-forget) risks message loss on network failures but is simpler and lower-latency. At-least-once delivery guarantees no message loss by requiring server-side acknowledgment and client-side retry, but can produce duplicates if the ACK is lost after the server persists the message. Chat systems like WhatsApp use at-least-once delivery with client-side deduplication: each message carries a unique client-generated ID, and the recipient discards messages with IDs it has already processed. The overhead is a small in-memory set of recent message IDs per conversation, typically capped at 1,000 entries.
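
Client-side deduplication with a capped set of recent IDs might look like the sketch below. The eviction policy (drop the oldest ID once the cap is exceeded) is an assumption; the text only states that the set is capped at roughly 1,000 entries.

```python
from collections import OrderedDict


class RecentIdSet:
    """Remembers the most recent message IDs, evicting the oldest when full."""

    def __init__(self, capacity: int = 1000) -> None:
        self._capacity = capacity
        self._seen: "OrderedDict[str, None]" = OrderedDict()

    def first_time(self, message_id: str) -> bool:
        """Return True if the message should be processed, False if duplicate."""
        if message_id in self._seen:
            return False
        self._seen[message_id] = None
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the oldest ID
        return True

    def __len__(self) -> int:
        return len(self._seen)


seen = RecentIdSet(capacity=3)
results = [seen.first_time(mid) for mid in ("a", "a", "b", "c", "d")]
```

The cap bounds memory, at the cost that a retry arriving after its ID has been evicted would be processed again; with retries that happen within seconds this window is rarely hit.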
