Episode 9 — System Design / 9.11 — Real World System Design Problems
9.11.c Design a Chat System (WhatsApp / Messenger)
Problem Statement
Design a real-time messaging system like WhatsApp or Facebook Messenger that supports one-on-one messaging, group chats, online presence, read receipts, and media sharing.
1. Requirements
Functional Requirements
- One-on-one messaging between users
- Group messaging (up to 500 members)
- Online/offline presence indicators
- Read receipts (sent, delivered, read)
- Media sharing (images, videos, documents)
- Message history and search
- Push notifications for offline users
Non-Functional Requirements
- Real-time delivery (< 200ms for online users)
- Message ordering guaranteed within a conversation
- At-least-once delivery (no message loss)
- Support 500 million daily active users
- End-to-end encryption for one-on-one chats
- 99.99% availability
2. Capacity Estimation
Traffic
Daily active users: 500 million
Messages per user/day: 40
Total messages/day: 20 billion
Messages per second: 20B / 86,400 ~= 230,000 msgs/sec
Peak messages/sec: ~500,000 msgs/sec
Storage
Average message size: 200 bytes (text) + 100 bytes (metadata)
Daily message storage: 20B * 300 bytes = 6 TB/day
Yearly storage: 6 TB * 365 = ~2.2 PB/year
Media messages (10%): 2B messages * avg 200 KB = 400 TB/day
Connections
Concurrent connections: 500M * 0.3 (30% online at once) = 150 million
WebSocket connections per server: ~50,000
Servers needed for connections: 150M / 50K = 3,000 servers
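The back-of-envelope numbers above can be sanity-checked with a few lines of arithmetic (all inputs are the figures from this section):

```python
# Capacity-estimation sanity check using the figures from this section.
DAU = 500_000_000          # daily active users
MSGS_PER_USER = 40         # messages per user per day
SECONDS_PER_DAY = 86_400
MSG_BYTES = 300            # 200 B text + 100 B metadata

total_msgs = DAU * MSGS_PER_USER                   # 20 billion messages/day
avg_qps = total_msgs // SECONDS_PER_DAY            # ~231k msgs/sec average
daily_storage_tb = total_msgs * MSG_BYTES / 1e12   # ~6 TB/day of text
concurrent = int(DAU * 0.3)                        # 30% online -> 150M connections
servers = concurrent // 50_000                     # at 50k WebSockets/server

print(avg_qps, round(daily_storage_tb), servers)   # 231481 6 3000
```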
3. High-Level Architecture
+--------+ +-------------------+ +-------------------+
| Client |<------>| Load Balancer |<------>| Load Balancer |
| (App) | WS | (WebSocket LB) | | (HTTP LB) |
+--------+ +--------+----------+ +--------+----------+
| |
+------------+------------+ +---------+---------+
| | | |
+--------v--------+ +--------v-----v--+ +--------v--------+
| Chat Server 1 | | Chat Server 2 | | API Server |
| (WebSocket) | | (WebSocket) | | (REST/Auth) |
+--------+--------+ +--------+--------+ +-----------------+
| |
+--------v-------------------------v--------+
| Message Queue (Kafka) |
+--------+------------------+---------------+
| |
+--------v--------+ +-----v-----------+ +-----------------+
| Message Service | | Presence Service| | Push Notification|
| | | | | Service |
+--------+---------+ +--------+--------+ +--------+--------+
| | |
+--------v--------+ +--------v--------+ +---------v-------+
| Message Store | | Presence Store | | APNS / FCM |
| (Cassandra) | | (Redis) | | (Push Gateway) |
+-----------------+ +-----------------+ +-----------------+
+-----------------+ +-----------------+
| Media Service | | Group Service |
| (S3 + CDN) | | |
+-----------------+ +-----------------+
4. API Design
WebSocket Connection
WSS /ws/connect
Headers: Authorization: Bearer <token>
Client -> Server messages:
{
  "type": "send_message",
  "conversation_id": "conv_123",
  "content": "Hello!",
  "content_type": "text",
  "client_msg_id": "uuid-456"   // for idempotency
}
Server -> Client messages:
{
  "type": "new_message",
  "message_id": "msg_789",
  "conversation_id": "conv_123",
  "sender_id": "user_42",
  "content": "Hello!",
  "timestamp": 1681200000000
}
{
  "type": "receipt",
  "message_id": "msg_789",
  "status": "delivered",   // sent | delivered | read
  "timestamp": 1681200001000
}
{
  "type": "presence",
  "user_id": "user_42",
  "status": "online",   // online | offline | typing
  "last_seen": 1681200000000
}
REST API (for non-real-time operations)
GET /api/v1/conversations
Response: List of conversations with last message preview
GET /api/v1/conversations/{id}/messages?before={cursor}&limit=50
Response: Paginated message history (cursor-based)
POST /api/v1/conversations
Body: { "participant_ids": ["user_1", "user_2"] }
Response: New conversation object
POST /api/v1/media/upload
Body: multipart/form-data (file)
Response: { "media_url": "https://cdn.chat.com/media/abc123" }
POST /api/v1/groups
Body: { "name": "Project Team", "member_ids": ["u1","u2","u3"] }
Response: Group conversation object
5. Database Schema
Messages Table (Cassandra)
CREATE TABLE messages (
conversation_id UUID,
message_id TIMEUUID, -- time-ordered UUID
sender_id UUID,
content TEXT,
content_type VARCHAR, -- text, image, video, file
media_url VARCHAR,
status VARCHAR, -- sent, delivered, read
created_at TIMESTAMP,
PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
-- Partitioned by conversation_id
-- Sorted by message_id (time-ordered) descending for recent-first queries
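Cassandra's TIMEUUID is a version-1 UUID compared by its embedded timestamp. Python's `uuid.uuid1()` can illustrate the idea (Cassandra performs the time-based comparison server-side; this is only a stand-in):

```python
import uuid

# Generate three v1 (time-based) UUIDs in sequence, as TIMEUUID stand-ins.
ids = [uuid.uuid1() for _ in range(3)]

# Sorting by the embedded 60-bit timestamp recovers creation order -- the
# property the TIMEUUID clustering column gives us within each conversation
# partition (CPython guarantees strictly increasing timestamps per process).
assert sorted(ids, key=lambda u: u.time) == ids
assert all(u.version == 1 for u in ids)
```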
Conversations Table (Cassandra)
CREATE TABLE conversations (
conversation_id UUID PRIMARY KEY,
type VARCHAR, -- 'direct' or 'group'
participant_ids SET<UUID>,
group_name VARCHAR,
group_admin_id UUID,
created_at TIMESTAMP,
updated_at TIMESTAMP
);
User Conversations Index (Cassandra)
CREATE TABLE user_conversations (
user_id UUID,
updated_at TIMESTAMP,
conversation_id UUID,
last_message TEXT,
unread_count INT,
PRIMARY KEY (user_id, updated_at, conversation_id)
) WITH CLUSTERING ORDER BY (updated_at DESC);
-- Allows fetching user's conversations sorted by most recent activity
User Table (PostgreSQL)
CREATE TABLE users (
user_id UUID PRIMARY KEY,
username VARCHAR(50) UNIQUE NOT NULL,
display_name VARCHAR(100),
phone_number VARCHAR(20) UNIQUE,
avatar_url VARCHAR(500),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Why Cassandra for Messages?
Requirement Cassandra Strength
---------------------------------------------------------
High write throughput Optimized for writes
Time-range queries Clustering key on timestamp
Horizontal scaling Linear scale-out
No single point of failure Masterless architecture
Partition by conversation Natural data locality
6. Deep Dive: Message Flow (1-on-1)
User A (online) Chat Server 1 Chat Server 2 User B (online)
| | | |
|-- send_message ------------->| | |
| {to: B, text: "Hi"} | | |
| | | |
| |-- Store in Kafka ------->| |
| |-- Store in Cassandra | |
| | | |
|<-- ack {status: "sent"} ----| | |
| | | |
| |-- Lookup: which server | |
| | has User B's WS? | |
| | (Redis session store) | |
| | | |
| |-- Route to Server 2 ---->| |
| | via Kafka/Redis Pub | |
| | |-- new_message ------->|
| | | {from: A, "Hi"} |
| | | |
| | |<-- ack "delivered" ---|
| | | |
|<-- receipt "delivered" ------|<-- receipt "delivered" ---| |
| | | |
Message Delivery to Offline Users
User A sends message to User B (offline):
1. Chat Server stores message in Cassandra
2. Chat Server checks Redis: User B is OFFLINE
3. Chat Server sends to Push Notification Service
4. Push Notification Service -> APNS/FCM -> User B's device
5. When User B comes online:
a. Connects via WebSocket
b. Server fetches unread messages from Cassandra
c. Delivers all pending messages
d. Sends "delivered" receipts back to User A
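The offline path above can be sketched in-memory; here plain dicts and lists stand in for the Redis session registry, the Cassandra unread rows, and the APNS/FCM gateway:

```python
online = {}   # user_id -> server_id   (stand-in for the Redis session registry)
inbox = {}    # user_id -> pending     (stand-in for unread rows in Cassandra)
pushes = []   # stand-in for the APNS/FCM push gateway

def send(sender, recipient, text):
    inbox.setdefault(recipient, []).append((sender, text))   # 1. persist first
    if recipient in online:
        return "delivered_via_ws"        # 2. online: route to recipient's server
    pushes.append((recipient, text))     # 3-4. offline: hand off to push service
    return "queued_for_push"

def connect(user):
    online[user] = "chat-server-1"       # 5a. WebSocket established
    pending = inbox.pop(user, [])        # 5b. fetch unread messages
    return pending                       # 5c. deliver all pending messages

send("A", "B", "Hi")
print(connect("B"))   # [('A', 'Hi')]
```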
7. Deep Dive: Connection Management & Routing
Session Registry (Redis)
Key: session:{user_id}
Value: {
"server_id": "chat-server-42",
"connected_at": 1681200000,
"device_type": "ios"
}
TTL: 300 seconds (refreshed by heartbeat)
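A TTL-based registry like this is easy to model without Redis. A minimal in-memory sketch (the class and its API are illustrative, not a real client):

```python
import time

class SessionRegistry:
    """In-memory stand-in for the Redis session store with a heartbeat TTL."""
    def __init__(self, ttl=300):
        self.ttl = ttl
        self.sessions = {}   # user_id -> (server_id, expires_at)

    def heartbeat(self, user_id, server_id, now=None):
        now = time.time() if now is None else now
        self.sessions[user_id] = (server_id, now + self.ttl)   # refresh TTL

    def lookup(self, user_id, now=None):
        now = time.time() if now is None else now
        entry = self.sessions.get(user_id)
        if entry is None or entry[1] < now:   # missing or expired -> offline
            return None
        return entry[0]

reg = SessionRegistry(ttl=300)
reg.heartbeat("user_42", "chat-server-42", now=0)
print(reg.lookup("user_42", now=100))   # chat-server-42
print(reg.lookup("user_42", now=400))   # None (TTL expired, treated as offline)
```

Letting entries expire rather than deleting them explicitly means a crashed server's sessions clean themselves up.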
Cross-Server Message Routing
Option A: Redis Pub/Sub
- Each chat server subscribes to a channel: "server:{server_id}"
- To route a message, publish to the target server's channel
Option B: Kafka Topic per Server
- Each server consumes from its own topic
- Better durability and replay capability
Option C: Service Mesh (recommended at scale)
- gRPC between chat servers
- Direct server-to-server communication
- Service discovery via Consul/etcd
Heartbeat Protocol
Client sends PING every 30 seconds
Server responds with PONG
If no PING for 90 seconds: mark user offline
Client Server
|--- PING (every 30s) --->|
|<-- PONG ----------------|
| |
| (90s no PING) |
| |--- Mark user offline
| |--- Remove session from Redis
| |--- Broadcast "offline" to contacts
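The timeout sweep above can be sketched as a periodic scan over last-PING times (the contact list is a hypothetical example):

```python
last_ping = {}                                  # user_id -> last PING time (s)
contacts = {"user_42": ["user_7", "user_9"]}    # hypothetical contact lists
events = []                                     # "offline" broadcasts emitted

def ping(user, now):
    last_ping[user] = now   # server replies PONG and refreshes the timer

def sweep(now, timeout=90):
    for user, seen in list(last_ping.items()):
        if now - seen > timeout:
            del last_ping[user]                        # remove the session
            for c in contacts.get(user, []):
                events.append((c, f"{user} offline"))  # notify contacts

ping("user_42", now=0)
sweep(now=60);  print(len(events))   # 0 -- still within the 90s window
sweep(now=120); print(events)        # offline broadcast to user_7 and user_9
```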
8. Deep Dive: Group Messaging
Group "Engineering" (200 members)
User A sends "Meeting at 3pm"
Fan-out Strategy:
User A --> Chat Server --> Kafka (group topic)
|
Group Service
|
+---------------+---------------+
| | |
Server 1 Server 2 Server 3
(Users B,C,D) (Users E,F,G) (Users H,I,J)
| | |
Push to WS Push to WS Push to WS
connections connections connections
Small Groups vs Large Groups
Small groups (< 100 members):
- Fan-out on write: write message to each member's inbox
- Fast reads, higher write cost
- Acceptable for small groups
Large groups (100-500 members):
- Fan-out on read: store message once in group partition
- Members pull messages when they open the group
- Lower write cost, slightly higher read latency
Very large groups (> 500 = broadcast channels):
- Treat as a topic/channel
- CDN-like distribution
- No delivery receipts
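The three tiers above reduce to a simple threshold check when the server picks a strategy per group (thresholds taken from this section):

```python
def fanout_strategy(member_count: int) -> str:
    """Pick the hybrid strategy from this section: write fan-out for small
    groups, read fan-out for large ones, channel semantics beyond 500."""
    if member_count < 100:
        return "fan_out_on_write"    # write to every member's inbox
    if member_count <= 500:
        return "fan_out_on_read"     # store once, members pull on open
    return "broadcast_channel"       # topic-style, no delivery receipts

print(fanout_strategy(20))     # fan_out_on_write
print(fanout_strategy(200))    # fan_out_on_read
print(fanout_strategy(5000))   # broadcast_channel
```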
9. Deep Dive: Online Presence
Presence Service Architecture:
+--------+ Heartbeat +------------------+ +---------------+
| Client | -------------> | Chat Server | --> | Redis Cluster |
+--------+ every 30s | (presence logic) | | (presence |
+------------------+ | state store) |
+-------+-------+
|
Pub/Sub to friends
|
+----------+---------+
| Subscribed Servers |
| (friend's servers) |
+--------------------+
Presence State Machine
              connect                 no PING for 90s
  [offline] ----------> [online] ----------------------> [offline]
                           |
                           | no client activity for 30s
                           v
                      [idle/away]
                           |
                           | no PING for 90s
                           v
                       [offline]
Scaling Presence for 500M Users
Problem: Broadcasting presence to all friends is expensive.
Solution 1: Lazy presence
- Only fetch presence when user opens a chat
- Don't broadcast globally
Solution 2: Presence channels
- Each user subscribes to presence of their active chats only
- Not all 500 contacts, just the 5-10 currently visible
Solution 3: Batch presence updates
- Aggregate presence changes, send batch every 5 seconds
- Reduces message volume dramatically
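Solution 3 amounts to coalescing updates per user within each flush window. A minimal sketch of such a batcher:

```python
class PresenceBatcher:
    """Aggregate presence flaps and emit one update per user per flush window."""
    def __init__(self):
        self.pending = {}   # user_id -> latest status only

    def update(self, user_id, status):
        self.pending[user_id] = status   # later updates overwrite earlier ones

    def flush(self):
        batch, self.pending = self.pending, {}
        return batch   # pushed to subscribers every ~5 seconds

b = PresenceBatcher()
b.update("user_42", "online")
b.update("user_42", "typing")
b.update("user_42", "online")   # three rapid flaps collapse into one update
print(b.flush())                # {'user_42': 'online'}
```

A user flapping online/offline during one window generates a single outgoing update instead of N, which is where the volume reduction comes from.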
10. Media Sharing Flow
User A shares an image:
1. Client compresses image locally
2. Client requests presigned upload URL from API server
3. Client uploads directly to Object Store (S3)
4. Client sends message with media reference:
{ "type": "image", "media_id": "img_abc", "thumbnail_url": "..." }
5. Chat server delivers message to recipient
6. Recipient's client downloads image from CDN
Upload Flow:
Client --> API Server --> Generate presigned URL
Client --> S3 (direct upload, bypasses chat servers)
Client --> Chat Server --> Message with media_id
Download Flow:
Client --> CDN --> S3 (on cache miss)
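Real S3 presigned URLs use AWS SigV4; a toy HMAC version conveys the core idea of a time-limited, tamper-proof link (the secret, host, and URL format here are made up for illustration):

```python
import hashlib, hmac

SECRET = b"demo-secret"   # hypothetical signing key held only by the API server

def presign(key: str, expires_at: int) -> str:
    # Sign (object key, expiry) so the storage layer can verify without a DB hit.
    msg = f"{key}:{expires_at}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"https://media.example.com/{key}?expires={expires_at}&sig={sig}"

def verify(key: str, expires_at: int, sig: str, now: int) -> bool:
    if now > expires_at:
        return False   # link expired
    expected = hmac.new(SECRET, f"{key}:{expires_at}".encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)   # constant-time comparison

url = presign("img_abc", expires_at=1681203600)
print(url.startswith("https://media.example.com/img_abc?expires="))   # True
```

Because the signature binds the key and expiry, clients can upload/download directly against object storage while chat servers stay out of the media path.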
Media Processing Pipeline
S3 Upload --> Lambda/Worker:
1. Virus scan
2. Generate thumbnail (3 sizes)
3. Strip EXIF metadata (privacy)
4. Encrypt at rest
5. Store processed versions
6. Update message with media URLs
11. Scaling Considerations
Database Sharding
Messages: Partition by conversation_id
- All messages for a conversation are co-located
- Efficient range queries for message history
User data: Partition by user_id
- User conversations table sharded by user_id
- Each user's inbox is on one shard
Chat Server Scaling
Sticky sessions: Users maintain WebSocket to ONE server
Load balancing: Consistent hashing on user_id
When scaling up/down:
1. New server joins the ring
2. Affected users are gracefully disconnected
3. Clients automatically reconnect (to new server)
4. Session registry updated in Redis
Multi-Region Deployment
+------------------+ +------------------+
| US Region | | Asia Region |
| Chat Servers | <----> | Chat Servers |
| Message Store | Kafka | Message Store |
| Redis Presence | Bridge | Redis Presence |
+------------------+ +------------------+
Cross-region messages:
1. User A (US) sends to User B (Asia)
2. Message stored locally in US
3. Kafka bridge replicates to Asia cluster
4. Asia chat server delivers to User B
5. Delivery receipt flows back via bridge
End-to-End Encryption
Signal Protocol:
1. Each user has identity key pair + signed pre-keys
2. Sender encrypts with recipient's public key
3. Server CANNOT read message content
4. Group chats: with Sender Keys, the sender encrypts each message once with a group sender key and distributes that key to each member individually (rather than re-encrypting every message once per recipient)
Impact on architecture:
- Server stores encrypted blobs, not plaintext
- Search must happen client-side (or use encrypted search)
- Media also encrypted before upload
12. Key Tradeoffs
| Decision | Option A | Option B | Our Choice |
|---|---|---|---|
| Protocol | HTTP long polling | WebSockets | WebSockets |
| Message storage | SQL (PostgreSQL) | NoSQL (Cassandra) | Cassandra |
| Group fan-out | Fan-out on write | Fan-out on read | Hybrid |
| Presence broadcast | Real-time to all | Lazy fetch | Lazy + batch |
| Message ordering | Global sequence | Per-conversation order | Per-conv |
| Delivery guarantee | At-most-once | At-least-once | At-least-once |
| Media upload | Through chat server | Direct to S3 | Direct to S3 |
13. Failure Scenarios and Mitigations
Scenario Mitigation
------------------------------------------------------------------------
Chat server crash Clients reconnect to another server;
undelivered messages replayed from Kafka
Cassandra node failure Replication factor 3; reads from replica
Redis presence data loss Clients re-register on reconnect
Network partition Messages queued in Kafka; delivered when
partition heals
Message ordering violation TIMEUUID ensures ordering within partition
Duplicate messages Client-side msg_id for idempotent delivery
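The last mitigation is worth spelling out: at-least-once delivery means the client may see the same message twice after a retry, and the client-side message ID makes applying it idempotent. A minimal dedup sketch:

```python
class Deduplicator:
    """Client-side idempotent delivery: at-least-once transport may replay
    a message, so the message_id guards the apply step."""
    def __init__(self):
        self.seen = set()
        self.applied = []

    def deliver(self, message_id, content):
        if message_id in self.seen:
            return False   # duplicate: ack it again, but don't re-apply
        self.seen.add(message_id)
        self.applied.append(content)
        return True

d = Deduplicator()
print(d.deliver("msg_789", "Hello!"))   # True  -- first delivery is applied
print(d.deliver("msg_789", "Hello!"))   # False -- the retry is dropped
print(d.applied)                        # ['Hello!']
```

In production the `seen` set would be bounded (e.g. per-conversation, keyed by the `client_msg_id` from the send API), but the invariant is the same: duplicates are acknowledged, never re-applied.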
Key Takeaways
- WebSockets are essential for real-time bidirectional communication; HTTP polling adds unacceptable latency.
- Cassandra is ideal for message storage due to write-heavy workload and natural partitioning by conversation.
- Presence is expensive at scale -- use lazy fetching and batched updates rather than broadcasting to all contacts.
- Media should bypass chat servers entirely -- upload directly to object storage using presigned URLs.
- At-least-once delivery with client-side deduplication is the right tradeoff for a messaging system where losing messages is unacceptable.