Episode 9 — System Design / 9.11 — Real World System Design Problems
9.11.c Design a Chat System (WhatsApp / Messenger)
Problem Statement
Design a real-time messaging system like WhatsApp or Facebook Messenger that supports one-on-one messaging, group chats, online presence, read receipts, and media sharing.
1. Requirements
Functional Requirements
- One-on-one messaging between users
- Group messaging (up to 500 members)
- Online/offline presence indicators
- Read receipts (sent, delivered, read)
- Media sharing (images, videos, documents)
- Message history and search
- Push notifications for offline users
Non-Functional Requirements
- Real-time delivery (< 200ms for online users)
- Message ordering guaranteed within a conversation
- At-least-once delivery (no message loss)
- Support 500 million daily active users
- End-to-end encryption for one-on-one chats
- 99.99% availability
2. Capacity Estimation
Traffic
Daily active users: 500 million
Messages per user/day: 40
Total messages/day: 20 billion
Messages per second: 20B / 86,400 ~= 230,000 msgs/sec
Peak messages/sec: ~500,000 msgs/sec
Storage
Average message size: 200 bytes (text) + 100 bytes (metadata)
Daily message storage: 20B * 300 bytes = 6 TB/day
Yearly storage: 6 TB * 365 = ~2.2 PB/year
Media messages (10%): 2B messages * avg 200 KB = 400 TB/day
Connections
Concurrent connections: 500M * 0.3 (30% online at once) = 150 million
WebSocket connections per server: ~50,000
Servers needed for connections: 150M / 50K = 3,000 servers
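The back-of-envelope numbers above can be sanity-checked with a few lines of arithmetic (all inputs are the figures from this section):

```python
# Capacity-estimation sanity check using the figures from this section.
DAU = 500_000_000          # daily active users
MSGS_PER_USER = 40         # messages per user per day
SECONDS_PER_DAY = 86_400
MSG_BYTES = 300            # 200 B text + 100 B metadata

total_msgs = DAU * MSGS_PER_USER                   # 20 billion messages/day
avg_qps = total_msgs // SECONDS_PER_DAY            # ~231k msgs/sec average
daily_storage_tb = total_msgs * MSG_BYTES / 1e12   # ~6 TB/day of text
concurrent = int(DAU * 0.3)                        # 30% online -> 150M connections
servers = concurrent // 50_000                     # at 50k WebSockets/server

print(avg_qps, round(daily_storage_tb), servers)   # 231481 6 3000
```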
3. High-Level Architecture
+--------+ +-------------------+ +-------------------+
| Client |<------>| Load Balancer |<------>| Load Balancer |
| (App) | WS | (WebSocket LB) | | (HTTP LB) |
+--------+ +--------+----------+ +--------+----------+
| |
+------------+------------+ +---------+---------+
| | | |
+--------v--------+ +--------v-----v--+ +--------v--------+
| Chat Server 1 | | Chat Server 2 | | API Server |
| (WebSocket) | | (WebSocket) | | (REST/Auth) |
+--------+--------+ +--------+--------+ +-----------------+
| |
+--------v-------------------------v--------+
| Message Queue (Kafka) |
+--------+------------------+---------------+
| |
+--------v--------+ +-----v-----------+ +-----------------+
| Message Service | | Presence Service| | Push Notification|
| | | | | Service |
+--------+---------+ +--------+--------+ +--------+--------+
| | |
+--------v--------+ +--------v--------+ +---------v-------+
| Message Store | | Presence Store | | APNS / FCM |
| (Cassandra) | | (Redis) | | (Push Gateway) |
+-----------------+ +-----------------+ +-----------------+
+-----------------+ +-----------------+
| Media Service | | Group Service |
| (S3 + CDN) | | |
+-----------------+ +-----------------+
4. API Design
WebSocket Connection
WSS /ws/connect
Headers: Authorization: Bearer <token>
Client -> Server messages:
{
  "type": "send_message",
  "conversation_id": "conv_123",
  "content": "Hello!",
  "content_type": "text",
  "client_msg_id": "uuid-456"   // for idempotency
}
Server -> Client messages:
{
  "type": "new_message",
  "message_id": "msg_789",
  "conversation_id": "conv_123",
  "sender_id": "user_42",
  "content": "Hello!",
  "timestamp": 1681200000000
}
{
  "type": "receipt",
  "message_id": "msg_789",
  "status": "delivered",   // sent | delivered | read
  "timestamp": 1681200001000
}
{
  "type": "presence",
  "user_id": "user_42",
  "status": "online",   // online | offline | typing
  "last_seen": 1681200000000
}
REST API (for non-real-time operations)
GET /api/v1/conversations
Response: List of conversations with last message preview
GET /api/v1/conversations/{id}/messages?before={cursor}&limit=50
Response: Paginated message history (cursor-based)
POST /api/v1/conversations
Body: { "participant_ids": ["user_1", "user_2"] }
Response: New conversation object
POST /api/v1/media/upload
Body: multipart/form-data (file)
Response: { "media_url": "https://cdn.chat.com/media/abc123" }
POST /api/v1/groups
Body: { "name": "Project Team", "member_ids": ["u1","u2","u3"] }
Response: Group conversation object
5. Database Schema
Messages Table (Cassandra)
CREATE TABLE messages (
conversation_id UUID,
message_id TIMEUUID, -- time-ordered UUID
sender_id UUID,
content TEXT,
content_type VARCHAR, -- text, image, video, file
media_url VARCHAR,
status VARCHAR, -- sent, delivered, read
created_at TIMESTAMP,
PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
-- Partitioned by conversation_id
-- Sorted by message_id (time-ordered) descending for recent-first queries
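Cassandra's TIMEUUID is a version-1 UUID compared by its embedded timestamp. Python's `uuid.uuid1()` can illustrate the idea (Cassandra performs the time-based comparison server-side; this is only a stand-in):

```python
import uuid

# Generate three v1 (time-based) UUIDs in sequence, as TIMEUUID stand-ins.
ids = [uuid.uuid1() for _ in range(3)]

# Sorting by the embedded 60-bit timestamp recovers creation order -- the
# property the TIMEUUID clustering column gives us within each conversation
# partition (CPython guarantees strictly increasing timestamps per process).
assert sorted(ids, key=lambda u: u.time) == ids
assert all(u.version == 1 for u in ids)
```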
Conversations Table (Cassandra)
CREATE TABLE conversations (
conversation_id UUID PRIMARY KEY,
type VARCHAR, -- 'direct' or 'group'
participant_ids SET<UUID>,
group_name VARCHAR,
group_admin_id UUID,
created_at TIMESTAMP,
updated_at TIMESTAMP
);
User Conversations Index (Cassandra)
CREATE TABLE user_conversations (
user_id UUID,
updated_at TIMESTAMP,
conversation_id UUID,
last_message TEXT,
unread_count INT,
PRIMARY KEY (user_id, updated_at, conversation_id)
) WITH CLUSTERING ORDER BY (updated_at DESC);
-- Allows fetching user's conversations sorted by most recent activity
User Table (PostgreSQL)
CREATE TABLE users (
user_id UUID PRIMARY KEY,
username VARCHAR(50) UNIQUE NOT NULL,
display_name VARCHAR(100),
phone_number VARCHAR(20) UNIQUE,
avatar_url VARCHAR(500),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Why Cassandra for Messages?
Requirement Cassandra Strength
---------------------------------------------------------
High write throughput Optimized for writes
Time-range queries Clustering key on timestamp
Horizontal scaling Linear scale-out
No single point of failure Masterless architecture
Partition by conversation Natural data locality
6. Deep Dive: Message Flow (1-on-1)
User A (online) Chat Server 1 Chat Server 2 User B (online)
| | | |
|-- send_message ------------->| | |
| {to: B, text: "Hi"} | | |
| | | |
| |-- Store in Kafka ------->| |
| |-- Store in Cassandra | |
| | | |
|<-- ack {status: "sent"} ----| | |
| | | |
| |-- Lookup: which server | |
| | has User B's WS? | |
| | (Redis session store) | |
| | | |
| |-- Route to Server 2 ---->| |
| | via Kafka/Redis Pub | |
| | |-- new_message ------->|
| | | {from: A, "Hi"} |
| | | |
| | |<-- ack "delivered" ---|
| | | |
|<-- receipt "delivered" ------|<-- receipt "delivered" ---| |
| | | |
Message Delivery to Offline Users
User A sends message to User B (offline):
1. Chat Server stores message in Cassandra
2. Chat Server checks Redis: User B is OFFLINE
3. Chat Server sends to Push Notification Service
4. Push Notification Service -> APNS/FCM -> User B's device
5. When User B comes online:
a. Connects via WebSocket
b. Server fetches unread messages from Cassandra
c. Delivers all pending messages
d. Sends "delivered" receipts back to User A
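The offline path above can be sketched in-memory; here plain dicts and lists stand in for the Redis session registry, the Cassandra unread rows, and the APNS/FCM gateway:

```python
online = {}   # user_id -> server_id   (stand-in for the Redis session registry)
inbox = {}    # user_id -> pending     (stand-in for unread rows in Cassandra)
pushes = []   # stand-in for the APNS/FCM push gateway

def send(sender, recipient, text):
    inbox.setdefault(recipient, []).append((sender, text))   # 1. persist first
    if recipient in online:
        return "delivered_via_ws"        # 2. online: route to recipient's server
    pushes.append((recipient, text))     # 3-4. offline: hand off to push service
    return "queued_for_push"

def connect(user):
    online[user] = "chat-server-1"       # 5a. WebSocket established
    pending = inbox.pop(user, [])        # 5b. fetch unread messages
    return pending                       # 5c. deliver all pending messages

send("A", "B", "Hi")
print(connect("B"))   # [('A', 'Hi')]
```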
7. Deep Dive: Connection Management & Routing
Session Registry (Redis)
Key: session:{user_id}
Value: {
"server_id": "chat-server-42",
"connected_at": 1681200000,
"device_type": "ios"
}
TTL: 300 seconds (refreshed by heartbeat)
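A TTL-based registry like this is easy to model without Redis. A minimal in-memory sketch (the class and its API are illustrative, not a real client):

```python
import time

class SessionRegistry:
    """In-memory stand-in for the Redis session store with a heartbeat TTL."""
    def __init__(self, ttl=300):
        self.ttl = ttl
        self.sessions = {}   # user_id -> (server_id, expires_at)

    def heartbeat(self, user_id, server_id, now=None):
        now = time.time() if now is None else now
        self.sessions[user_id] = (server_id, now + self.ttl)   # refresh TTL

    def lookup(self, user_id, now=None):
        now = time.time() if now is None else now
        entry = self.sessions.get(user_id)
        if entry is None or entry[1] < now:   # missing or expired -> offline
            return None
        return entry[0]

reg = SessionRegistry(ttl=300)
reg.heartbeat("user_42", "chat-server-42", now=0)
print(reg.lookup("user_42", now=100))   # chat-server-42
print(reg.lookup("user_42", now=400))   # None (TTL expired, treated as offline)
```

Letting entries expire rather than deleting them explicitly means a crashed server's sessions clean themselves up.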
Cross-Server Message Routing
Option A: Redis Pub/Sub
- Each chat server subscribes to a channel: "server:{server_id}"
- To route a message, publish to the target server's channel
Option B: Kafka Topic per Server
- Each server consumes from its own topic
- Better durability and replay capability
Option C: Service Mesh (recommended at scale)
- gRPC between chat servers
- Direct server-to-server communication
- Service discovery via Consul/etcd
Heartbeat Protocol
Client sends PING every 30 seconds
Server responds with PONG
If no PING for 90 seconds: mark user offline
Client Server
|--- PING (every 30s) --->|
|<-- PONG ----------------|
| |
| (90s no PING) |
| |--- Mark user offline
| |--- Remove session from Redis
| |--- Broadcast "offline" to contacts
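The timeout sweep above can be sketched as a periodic scan over last-PING times (the contact list is a hypothetical example):

```python
last_ping = {}                                  # user_id -> last PING time (s)
contacts = {"user_42": ["user_7", "user_9"]}    # hypothetical contact lists
events = []                                     # "offline" broadcasts emitted

def ping(user, now):
    last_ping[user] = now   # server replies PONG and refreshes the timer

def sweep(now, timeout=90):
    for user, seen in list(last_ping.items()):
        if now - seen > timeout:
            del last_ping[user]                        # remove the session
            for c in contacts.get(user, []):
                events.append((c, f"{user} offline"))  # notify contacts

ping("user_42", now=0)
sweep(now=60);  print(len(events))   # 0 -- still within the 90s window
sweep(now=120); print(events)        # offline broadcast to user_7 and user_9
```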
8. Deep Dive: Group Messaging
Group "Engineering" (200 members)
User A sends "Meeting at 3pm"
Fan-out Strategy:
User A --> Chat Server --> Kafka (group topic)
|
Group Service
|
+---------------+---------------+
| | |
Server 1 Server 2 Server 3
(Users B,C,D) (Users E,F,G) (Users H,I,J)
| | |
Push to WS Push to WS Push to WS
connections connections connections
Small Groups vs Large Groups
Small groups (< 100 members):
- Fan-out on write: write message to each member's inbox
- Fast reads, higher write cost
- Acceptable for small groups
Large groups (100-500 members):
- Fan-out on read: store message once in group partition
- Members pull messages when they open the group
- Lower write cost, slightly higher read latency
Very large groups (> 500 = broadcast channels):
- Treat as a topic/channel
- CDN-like distribution
- No delivery receipts
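The three tiers above reduce to a simple threshold check when the server picks a strategy per group (thresholds taken from this section):

```python
def fanout_strategy(member_count: int) -> str:
    """Pick the hybrid strategy from this section: write fan-out for small
    groups, read fan-out for large ones, channel semantics beyond 500."""
    if member_count < 100:
        return "fan_out_on_write"    # write to every member's inbox
    if member_count <= 500:
        return "fan_out_on_read"     # store once, members pull on open
    return "broadcast_channel"       # topic-style, no delivery receipts

print(fanout_strategy(20))     # fan_out_on_write
print(fanout_strategy(200))    # fan_out_on_read
print(fanout_strategy(5000))   # broadcast_channel
```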
9. Deep Dive: Online Presence
Presence Service Architecture:
+--------+ Heartbeat +------------------+ +---------------+
| Client | -------------> | Chat Server | --> | Redis Cluster |
+--------+ every 30s | (presence logic) | | (presence |
+------------------+ | state store) |
+-------+-------+
|
Pub/Sub to friends
|
+----------+---------+
| Subscribed Servers |
| (friend's servers) |
+--------------------+
Presence State Machine
              connect                 no PING for 90s
  [offline] ----------> [online] ----------------------> [offline]
                           |
                           | no client activity for 30s
                           v
                      [idle/away]
                           |
                           | no PING for 90s
                           v
                       [offline]
Scaling Presence for 500M Users
Problem: Broadcasting presence to all friends is expensive.
Solution 1: Lazy presence
- Only fetch presence when user opens a chat
- Don't broadcast globally
Solution 2: Presence channels
- Each user subscribes to presence of their active chats only
- Not all 500 contacts, just the 5-10 currently visible
Solution 3: Batch presence updates
- Aggregate presence changes, send batch every 5 seconds
- Reduces message volume dramatically
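Solution 3 amounts to coalescing updates per user within each flush window. A minimal sketch of such a batcher:

```python
class PresenceBatcher:
    """Aggregate presence flaps and emit one update per user per flush window."""
    def __init__(self):
        self.pending = {}   # user_id -> latest status only

    def update(self, user_id, status):
        self.pending[user_id] = status   # later updates overwrite earlier ones

    def flush(self):
        batch, self.pending = self.pending, {}
        return batch   # pushed to subscribers every ~5 seconds

b = PresenceBatcher()
b.update("user_42", "online")
b.update("user_42", "typing")
b.update("user_42", "online")   # three rapid flaps collapse into one update
print(b.flush())                # {'user_42': 'online'}
```

A user flapping online/offline during one window generates a single outgoing update instead of N, which is where the volume reduction comes from.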
10. Media Sharing Flow
User A shares an image:
1. Client compresses image locally
2. Client requests presigned upload URL from API server
3. Client uploads directly to Object Store (S3)
4. Client sends message with media reference:
{ "type": "image", "media_id": "img_abc", "thumbnail_url": "..." }
5. Chat server delivers message to recipient
6. Recipient's client downloads image from CDN
Upload Flow:
Client --> API Server --> Generate presigned URL
Client --> S3 (direct upload, bypasses chat servers)
Client --> Chat Server --> Message with media_id
Download Flow:
Client --> CDN --> S3 (on cache miss)
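Real S3 presigned URLs use AWS SigV4; a toy HMAC version conveys the core idea of a time-limited, tamper-proof link (the secret, host, and URL format here are made up for illustration):

```python
import hashlib, hmac

SECRET = b"demo-secret"   # hypothetical signing key held only by the API server

def presign(key: str, expires_at: int) -> str:
    # Sign (object key, expiry) so the storage layer can verify without a DB hit.
    msg = f"{key}:{expires_at}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"https://media.example.com/{key}?expires={expires_at}&sig={sig}"

def verify(key: str, expires_at: int, sig: str, now: int) -> bool:
    if now > expires_at:
        return False   # link expired
    expected = hmac.new(SECRET, f"{key}:{expires_at}".encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)   # constant-time comparison

url = presign("img_abc", expires_at=1681203600)
print(url.startswith("https://media.example.com/img_abc?expires="))   # True
```

Because the signature binds the key and expiry, clients can upload/download directly against object storage while chat servers stay out of the media path.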
Media Processing Pipeline
S3 Upload --> Lambda/Worker:
1. Virus scan
2. Generate thumbnail (3 sizes)
3. Strip EXIF metadata (privacy)
4. Encrypt at rest
5. Store processed versions
6. Update message with media URLs
11. Scaling Considerations
Database Sharding
Messages: Partition by conversation_id
- All messages for a conversation are co-located
- Efficient range queries for message history
User data: Partition by user_id
- User conversations table sharded by user_id
- Each user's inbox is on one shard
Chat Server Scaling
Sticky sessions: Users maintain WebSocket to ONE server
Load balancing: Consistent hashing on user_id
When scaling up/down:
1. New server joins the ring
2. Affected users are gracefully disconnected
3. Clients automatically reconnect (to new server)
4. Session registry updated in Redis
Multi-Region Deployment
+------------------+ +------------------+
| US Region | | Asia Region |
| Chat Servers | <----> | Chat Servers |
| Message Store | Kafka | Message Store |
| Redis Presence | Bridge | Redis Presence |
+------------------+ +------------------+
Cross-region messages:
1. User A (US) sends to User B (Asia)
2. Message stored locally in US
3. Kafka bridge replicates to Asia cluster
4. Asia chat server delivers to User B
5. Delivery receipt flows back via bridge
End-to-End Encryption
Signal Protocol:
1. Each user has identity key pair + signed pre-keys
2. Sender encrypts with recipient's public key
3. Server CANNOT read message content
4. Group chats: with Sender Keys, the sender encrypts each message once with a group sender key and distributes that key to each member individually (rather than re-encrypting every message once per recipient)
Impact on architecture:
- Server stores encrypted blobs, not plaintext
- Search must happen client-side (or use encrypted search)
- Media also encrypted before upload
12. Key Tradeoffs
| Decision | Option A | Option B | Our Choice |
|---|---|---|---|
| Protocol | HTTP long polling | WebSockets | WebSockets |
| Message storage | SQL (PostgreSQL) | NoSQL (Cassandra) | Cassandra |
| Group fan-out | Fan-out on write | Fan-out on read | Hybrid |
| Presence broadcast | Real-time to all | Lazy fetch | Lazy + batch |
| Message ordering | Global sequence | Per-conversation order | Per-conv |
| Delivery guarantee | At-most-once | At-least-once | At-least-once |
| Media upload | Through chat server | Direct to S3 | Direct to S3 |
13. Failure Scenarios and Mitigations
Scenario Mitigation
------------------------------------------------------------------------
Chat server crash Clients reconnect to another server;
undelivered messages replayed from Kafka
Cassandra node failure Replication factor 3; reads from replica
Redis presence data loss Clients re-register on reconnect
Network partition Messages queued in Kafka; delivered when
partition heals
Message ordering violation TIMEUUID ensures ordering within partition
Duplicate messages Client-side msg_id for idempotent delivery
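The last mitigation is worth spelling out: at-least-once delivery means the client may see the same message twice after a retry, and the client-side message ID makes applying it idempotent. A minimal dedup sketch:

```python
class Deduplicator:
    """Client-side idempotent delivery: at-least-once transport may replay
    a message, so the message_id guards the apply step."""
    def __init__(self):
        self.seen = set()
        self.applied = []

    def deliver(self, message_id, content):
        if message_id in self.seen:
            return False   # duplicate: ack it again, but don't re-apply
        self.seen.add(message_id)
        self.applied.append(content)
        return True

d = Deduplicator()
print(d.deliver("msg_789", "Hello!"))   # True  -- first delivery is applied
print(d.deliver("msg_789", "Hello!"))   # False -- the retry is dropped
print(d.applied)                        # ['Hello!']
```

In production the `seen` set would be bounded (e.g. per-conversation, keyed by the `client_msg_id` from the send API), but the invariant is the same: duplicates are acknowledged, never re-applied.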
Key Takeaways
- WebSockets are essential for real-time bidirectional communication; HTTP polling adds unacceptable latency.
- Cassandra is ideal for message storage due to write-heavy workload and natural partitioning by conversation.
- Presence is expensive at scale -- use lazy fetching and batched updates rather than broadcasting to all contacts.
- Media should bypass chat servers entirely -- upload directly to object storage using presigned URLs.
- At-least-once delivery with client-side deduplication is the right tradeoff for a messaging system where losing messages is unacceptable.