Episode 9 — System Design / 9.11 — Real World System Design Problems
9.11.g Design a Notification System (Push / Email / SMS)
Problem Statement
Design a scalable notification system that delivers push notifications, emails, and SMS messages to users. The system must handle user preferences, templates, priority levels, rate limiting, and delivery guarantees.
1. Requirements
Functional Requirements
- Send notifications via push (iOS/Android), email, and SMS
- User preference management (opt-in/out per channel and type)
- Template-based notification content
- Priority levels (critical, high, normal, low)
- Scheduling (send at specific time or timezone-aware)
- Delivery tracking (sent, delivered, opened, clicked)
- Rate limiting per user (prevent notification fatigue)
- Batch notifications (e.g., daily digest)
Non-Functional Requirements
- Deliver critical notifications within 5 seconds
- Support 1 billion notifications per day
- At-least-once delivery guarantee
- 99.9% delivery success rate
- Graceful degradation when a channel provider is down
- Support 500 million registered users
2. Capacity Estimation
Traffic
Notifications per day: 1 billion
Per second: 1B / 86,400 ~= 11,600 notifications/sec
Peak (10x): ~116,000 notifications/sec
Channel breakdown:
Push: 60% = 600 million/day
Email: 30% = 300 million/day
SMS: 10% = 100 million/day
Storage
Notification record: 500 bytes (metadata + content)
Daily storage: 1B * 500 bytes = 500 GB/day
Retention (90 days): 45 TB
Template storage: ~10 MB (negligible)
User preferences: 500M users * 200 bytes = 100 GB
External Provider Limits
APNS (Apple): Bursts ok, sustained ~100K/sec
FCM (Google): 500K messages/sec
SendGrid (Email): 100K emails/sec (enterprise plan)
Twilio (SMS): 10K SMS/sec (enterprise plan)
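These estimates are simple arithmetic; a quick sanity check in Python (numbers are the ones stated above):

```python
# Back-of-envelope check of the capacity estimates above.

DAY_SECONDS = 86_400

notifications_per_day = 1_000_000_000
avg_per_sec = notifications_per_day / DAY_SECONDS    # ~11,574/sec (text rounds to ~11,600)
peak_per_sec = 10 * avg_per_sec                      # ~115,740/sec at 10x peak

record_bytes = 500
daily_storage_gb = notifications_per_day * record_bytes / 1e9    # 500 GB/day
retention_90d_tb = daily_storage_gb * 90 / 1_000                 # 45 TB over 90 days

users = 500_000_000
prefs_gb = users * 200 / 1e9                                     # 100 GB of preferences

print(f"{avg_per_sec:,.0f}/sec avg, {peak_per_sec:,.0f}/sec peak, "
      f"{daily_storage_gb:.0f} GB/day, {retention_90d_tb:.0f} TB retained, "
      f"{prefs_gb:.0f} GB prefs")
```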
3. High-Level Architecture
+-------------------+ +-------------------+
| Internal Services | | Scheduled Jobs |
| (Order, Payment, | | (Cron/Scheduler) |
| Marketing, etc.) | | |
+--------+----------+ +--------+----------+
| |
+------------+------------+
|
+-------v--------+
| Notification |
| API Service |
+-------+--------+
|
+-------v--------+
| Validation & |
| Enrichment |
| (preferences, |
| templates, |
| rate limits) |
+-------+--------+
|
+-------v--------+
| Priority Queue |
| (Kafka) |
+-------+--------+
|
+------------+------------+
| | |
+--------v---+ +-----v------+ +---v----------+
| Push Worker| | Email | | SMS Worker |
| Pool | | Worker Pool| | Pool |
+--------+---+ +-----+------+ +---+----------+
| | |
+--------v---+ +-----v------+ +---v----------+
| APNS / FCM | | SendGrid / | | Twilio / |
| | | SES | | Vonage |
+------------+ +------------+ +--------------+
\ | /
+------+----+----+------+
| |
+-------v---+ +---v-----------+
| Delivery | | Analytics |
| Tracker | | Service |
+-----------+ +---------------+
4. API Design
POST /api/v1/notifications/send
Headers: Authorization: Bearer <service-token>
Body: {
"recipient_id": "user_123",
"notification_type": "order_shipped",
"channels": ["push", "email"], // optional override
"priority": "high", // critical|high|normal|low
"template_id": "tmpl_order_shipped",
"template_data": {
"order_id": "ord_456",
"tracking_number": "1Z999AA10123456784",
"delivery_date": "2026-04-14"
},
"idempotency_key": "order_456_shipped_v1",
"schedule_at": null // null = send immediately
}
Response 202: {
"notification_id": "notif_789",
"status": "queued",
"channels_targeted": ["push", "email"]
}
POST /api/v1/notifications/batch
Body: {
"recipients": [
{ "user_id": "user_1", "template_data": { ... } },
{ "user_id": "user_2", "template_data": { ... } }
],
"notification_type": "marketing_weekly",
"template_id": "tmpl_weekly_digest",
"priority": "low",
"channels": ["email"]
}
Response 202: { "batch_id": "batch_abc", "count": 2, "status": "queued" }
GET /api/v1/notifications/{notification_id}/status
Response 200: {
"notification_id": "notif_789",
"status": "delivered",
"channels": {
"push": { "status": "delivered", "delivered_at": "..." },
"email": { "status": "opened", "opened_at": "..." }
}
}
GET /api/v1/users/{user_id}/preferences
Response 200: {
"push_enabled": true,
"email_enabled": true,
"sms_enabled": false,
"quiet_hours": { "start": "22:00", "end": "08:00", "timezone": "US/Pacific" },
"preferences": {
"order_updates": { "push": true, "email": true, "sms": false },
"marketing": { "push": false, "email": true, "sms": false },
"security_alerts": { "push": true, "email": true, "sms": true }
}
}
PUT /api/v1/users/{user_id}/preferences
Body: { ... updated preferences ... }
Response 200: Updated preferences
5. Database Schema
Notifications Table (PostgreSQL + partitioning)
CREATE TABLE notifications (
notification_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL,
notification_type VARCHAR(100) NOT NULL,
template_id VARCHAR(100),
priority VARCHAR(20) DEFAULT 'normal',
content JSONB,
idempotency_key VARCHAR(255) UNIQUE,
scheduled_at TIMESTAMP,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
) PARTITION BY RANGE (created_at);
-- Partition by month for efficient cleanup and querying
Delivery Status Table (Cassandra -- high write volume)
CREATE TABLE delivery_status (
notification_id UUID,
channel VARCHAR, -- 'push', 'email', 'sms'
status VARCHAR, -- 'queued','sent','delivered','opened','failed'
provider VARCHAR, -- 'apns','fcm','sendgrid','twilio'
provider_msg_id VARCHAR,
attempts INT,
last_attempt_at TIMESTAMP,
delivered_at TIMESTAMP,
error_message TEXT,
PRIMARY KEY (notification_id, channel)
);
Templates Table (PostgreSQL)
CREATE TABLE notification_templates (
template_id VARCHAR(100) PRIMARY KEY,
notification_type VARCHAR(100) NOT NULL,
channel VARCHAR(20) NOT NULL,
subject VARCHAR(500), -- for email
title VARCHAR(200), -- for push
body_template TEXT NOT NULL, -- mustache/handlebars
metadata JSONB,
version INTEGER DEFAULT 1,
is_active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
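The `body_template` column holds mustache/handlebars-style placeholders. A toy renderer, using a regex substitution instead of a real mustache engine (an assumption for brevity; production would use an actual template library):

```python
import re

def render(body_template: str, data: dict) -> str:
    """Replace {{key}} placeholders; unknown keys are left untouched
    so a missing field is visible rather than silently blank."""
    def substitute(match):
        return str(data.get(match.group(1), match.group(0)))
    return re.sub(r"\{\{(\w+)\}\}", substitute, body_template)

template = "Your order {{order_id}} shipped! Tracking: {{tracking_number}}"
print(render(template, {"order_id": "ord_456",
                        "tracking_number": "1Z999AA10123456784"}))
# -> Your order ord_456 shipped! Tracking: 1Z999AA10123456784
```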
User Preferences Table (PostgreSQL)
CREATE TABLE user_notification_preferences (
user_id UUID NOT NULL,
notification_type VARCHAR(100) NOT NULL,
push_enabled BOOLEAN DEFAULT TRUE,
email_enabled BOOLEAN DEFAULT TRUE,
sms_enabled BOOLEAN DEFAULT FALSE,
quiet_hours_start TIME,
quiet_hours_end TIME,
timezone VARCHAR(50) DEFAULT 'UTC',
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (user_id, notification_type)
);
Device Tokens Table (PostgreSQL)
CREATE TABLE device_tokens (
token_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL,
platform VARCHAR(20) NOT NULL, -- 'ios', 'android', 'web'
device_token VARCHAR(500) NOT NULL,
is_active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
last_used_at TIMESTAMP
);
CREATE INDEX idx_device_user ON device_tokens(user_id) WHERE is_active = TRUE;
6. Deep Dive: Notification Processing Pipeline
End-to-End Flow
Service calls API
|
v
+------+--------+
| 1. Validation | Check: valid user, valid template, valid channels
+------+--------+
|
v
+------+--------+
| 2. Idempotency| Check: has this notification been sent already?
| Check | (Redis: SET NX idempotency_key TTL 24h)
+------+--------+
|
v
+------+---------+
| 3. Preference | Fetch user preferences
| Filtering | Remove opted-out channels
+------+---------+ Check quiet hours -> defer if in quiet period
|
v
+------+---------+
| 4. Rate Limit | Check: user has not exceeded notification quota
| Check | (e.g., max 5 push/hour, 3 email/day for marketing)
+------+---------+
|
v
+------+---------+
| 5. Template | Render template with provided data
| Rendering | Localize content based on user locale
+------+---------+
|
v
+------+---------+
| 6. Enqueue | Publish to Kafka topic per priority
| to Queue | Topic: notifications.{priority}.{channel}
+------+---------+
|
v
+------+---------+
| 7. Workers | Consume from Kafka, call external providers
| Process | Retry on failure with exponential backoff
+------+---------+
|
v
+------+---------+
| 8. Track | Update delivery status
| Delivery | Process webhooks from providers
+-----------------+
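Step 2 (the idempotency check) can be sketched as follows, assuming a Redis client that supports `SET` with `NX` and `EX` (as redis-py does). `FakeRedis` is an in-memory stand-in so the example runs without a server; only the NX/EX semantics matter:

```python
import time

class FakeRedis:
    """Just enough of Redis SET NX EX for this sketch."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at or None)

    def set(self, key, value, nx=False, ex=None):
        now = time.time()
        entry = self._store.get(key)
        if entry and entry[1] is not None and entry[1] <= now:
            entry = None  # TTL elapsed: treat as absent
        if nx and entry is not None:
            return None   # NX refuses to overwrite -> caller sees a duplicate
        self._store[key] = (value, now + ex if ex else None)
        return True

IDEMPOTENCY_TTL = 24 * 3600  # 24h, matching the diagram

def claim_idempotency_key(redis_client, idempotency_key):
    """True if this is the first send with this key; False on a duplicate."""
    return redis_client.set(f"idem:{idempotency_key}", "1",
                            nx=True, ex=IDEMPOTENCY_TTL) is True
```

The first caller wins the `SET NX` and proceeds; any retry within 24 hours sees `None` and is dropped as a duplicate, which is what makes at-least-once delivery safe upstream.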
7. Deep Dive: Priority Queue System
Kafka Topic Design
Topics:
notifications.critical.push -- highest priority, dedicated consumers
notifications.critical.email
notifications.critical.sms
notifications.high.push
notifications.high.email
notifications.high.sms
notifications.normal.push
notifications.normal.email
notifications.normal.sms
notifications.low.push -- lowest priority, fewer consumers
notifications.low.email
notifications.low.sms
Consumer allocation:
Critical: 20 consumers per channel
High: 10 consumers per channel
Normal: 5 consumers per channel
Low: 2 consumers per channel
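The `notifications.{priority}.{channel}` naming scheme reduces to a small routing helper (names are illustrative; the Kafka producer itself is omitted):

```python
# Route a notification to its Kafka topic per the scheme above.
PRIORITIES = ("critical", "high", "normal", "low")
CHANNELS = ("push", "email", "sms")

def topic_for(priority: str, channel: str) -> str:
    if priority not in PRIORITIES or channel not in CHANNELS:
        raise ValueError(f"unknown priority/channel: {priority}/{channel}")
    return f"notifications.{priority}.{channel}"

print(topic_for("critical", "push"))  # -> notifications.critical.push
```

Keeping priority in the topic name (rather than a priority field in one topic) is what lets each tier get its own consumer count: Kafka consumers on `notifications.critical.*` never sit behind a backlog of low-priority marketing messages.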
Priority Processing
Critical (security alerts, OTP):
- Dedicated fast path, no batching
- Bypass rate limiting
- SLA: < 5 seconds
High (order updates, payment confirmations):
- Priority processing
- SLA: < 30 seconds
Normal (social notifications, comments):
- Standard processing
- SLA: < 5 minutes
Low (marketing, digests):
- Batch processing allowed
- Respect quiet hours strictly
- SLA: < 1 hour
8. Deep Dive: Rate Limiting for Notifications
Per-User Rate Limits
Rate Limit Rules:
+---------------+----------+----------+--------+----------+
| Type          | Push     | Email    | SMS    | Window   |
+---------------+----------+----------+--------+----------+
| Marketing     | 3/day    | 1/day    | 1/week | per type |
| Social        | 10/hr    | 5/day    | N/A    | per type |
| Transactional | 20/hr    | 10/hr    | 5/hr   | per type |
| Security/OTP  | no limit | no limit | 10/hr  | per type |
+---------------+----------+----------+--------+----------+
Global per-user limit: Max 50 push notifications per day
Redis Implementation
Key: ratelimit:notif:{user_id}:{type}:{channel}:{window}
Value: Counter
TTL: Window duration
def check_notification_rate_limit(user_id, notif_type, channel):
    """Returns (allowed, reason). One Redis counter per (user, type,
    channel, window); the TTL makes each limit reset automatically."""
    rules = get_rate_rules(notif_type, channel)
    for rule in rules:
        key = f"ratelimit:notif:{user_id}:{notif_type}:{channel}:{rule.window}"
        count = redis.incr(key)
        if count == 1:
            # First hit in this window: start the window clock.
            # Note: INCR + EXPIRE is not atomic -- wrap both in a Lua
            # script or MULTI/EXEC if a crash between the two calls
            # (leaving a counter with no TTL) is a concern.
            redis.expire(key, rule.window_seconds)
        if count > rule.limit:
            return False, f"Rate limit exceeded: {rule.limit}/{rule.window}"
    return True, None
Notification Aggregation
Problem: 50 people like your post = 50 separate notifications?
Solution: Aggregate within a time window.
Instead of:
"Alice liked your post"
"Bob liked your post"
"Charlie liked your post"
Send:
"Alice, Bob, and 48 others liked your post"
Implementation:
1. Buffer similar notifications in Redis sorted set
2. After 5-minute window or 10 notifications (whichever first):
3. Aggregate into single notification
4. Render aggregated template
5. Send once
Key: aggregate:{user_id}:{type}:{target_id}
Value: Sorted set of {actor_id, timestamp}
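The buffering steps above can be sketched with a plain dict standing in for the Redis sorted set, so the snippet runs standalone. Window and threshold match the text (5 minutes or 10 events, whichever comes first); note that a production version also needs a background timer to flush buffers whose window expires before reaching 10 events:

```python
import time

WINDOW_SECONDS = 300   # 5-minute aggregation window
MAX_BUFFERED = 10      # flush early once 10 events accumulate

buffers = {}  # key -> list of (actor_id, timestamp)

def buffer_like(user_id, target_id, actor_id, now=None):
    """Buffer a 'like' event; return an aggregated message on flush, else None."""
    now = time.time() if now is None else now
    key = f"aggregate:{user_id}:like:{target_id}"
    events = buffers.setdefault(key, [])
    events.append((actor_id, now))
    if len(events) < MAX_BUFFERED and now - events[0][1] < WINDOW_SECONDS:
        return None  # keep buffering
    actors = [actor for actor, _ in events]
    del buffers[key]
    if len(actors) == 1:
        return f"{actors[0]} liked your post"
    if len(actors) == 2:
        return f"{actors[0]} and {actors[1]} liked your post"
    return f"{actors[0]}, {actors[1]}, and {len(actors) - 2} others liked your post"
```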
9. Deep Dive: Delivery Guarantees and Retry
Retry Strategy
Exponential backoff with jitter:
Attempt 1: Immediate
Attempt 2: 30 seconds + random(0, 10)
Attempt 3: 2 minutes + random(0, 30)
Attempt 4: 10 minutes + random(0, 60)
Attempt 5: 1 hour + random(0, 300)
Attempt 6: Move to Dead Letter Queue (DLQ)
Total retry window: ~1.2 hours (72.5 minutes of base delay, plus jitter)
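The schedule above as a lookup table, with `random.uniform` providing the jitter (a common choice; the text does not mandate a distribution):

```python
import random

# (base_seconds, jitter_upper_bound) per attempt; past the table -> DLQ.
RETRY_SCHEDULE = [
    (0, 0),        # attempt 1: immediate
    (30, 10),      # attempt 2
    (120, 30),     # attempt 3
    (600, 60),     # attempt 4
    (3600, 300),   # attempt 5
]

def retry_delay(attempt: int):
    """Delay in seconds before the given attempt (1-based).
    Returns None when the attempt count is exhausted (route to DLQ)."""
    if attempt > len(RETRY_SCHEDULE):
        return None
    base, jitter = RETRY_SCHEDULE[attempt - 1]
    return base + random.uniform(0, jitter)
```

The jitter matters at this scale: without it, a provider outage would cause hundreds of thousands of retries to fire in the same second when the backoff expires.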
Provider Failover
Push: APNS (iOS) and FCM (Android/web) -- device tokens are platform-specific, so push has no cross-provider failover; failed sends retry against the same provider
Email: SendGrid (primary) -> SES (secondary) -> Mailgun (tertiary)
SMS: Twilio (primary) -> Vonage (secondary)
Failover logic:
if primary.is_healthy():
result = primary.send(notification)
else:
result = secondary.send(notification)
alert("Primary provider down, using failover")
Health check: Ping providers every 30 seconds.
Circuit breaker: Open after 5 consecutive failures.
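A minimal version of the failover-plus-circuit-breaker logic above (provider callables are hypothetical stand-ins, and half-open recovery after a cooldown is omitted for brevity):

```python
FAILURE_THRESHOLD = 5  # open the circuit after 5 consecutive failures

class CircuitBreaker:
    def __init__(self, send_fn):
        self.send_fn = send_fn
        self.consecutive_failures = 0

    @property
    def is_open(self):
        return self.consecutive_failures >= FAILURE_THRESHOLD

    def send(self, notification):
        try:
            result = self.send_fn(notification)
        except Exception:
            self.consecutive_failures += 1
            raise
        self.consecutive_failures = 0  # any success resets the counter
        return result

def send_with_failover(providers, notification):
    """Try providers in order, skipping any whose circuit is open."""
    last_error = None
    for breaker in providers:
        if breaker.is_open:
            continue
        try:
            return breaker.send(notification)
        except Exception as exc:
            last_error = exc
    raise RuntimeError("all providers failed or circuit-open") from last_error
```

Once the primary's circuit opens, traffic skips it entirely instead of paying a timeout per message; a real implementation would also re-probe the primary periodically (the "half-open" state) so it can recover.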
Webhook Processing for Delivery Status
Provider sends webhooks for status updates:
POST /webhooks/sendgrid
Body: {
"event": "delivered",
"sg_message_id": "abc123",
"timestamp": 1681200000
}
APNS note: APNS does not send webhooks; invalid-token feedback arrives
synchronously in the HTTP/2 send response (status 410, reason "Unregistered").
Processing:
1. Validate webhook signature
2. Map provider_msg_id to notification_id
3. Update delivery_status table
4. If "bounced" or "invalid_token": deactivate device token
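The four steps can be sketched for a SendGrid-style event as below. The shared-secret HMAC scheme and the in-memory lookup tables are assumptions made so the example is self-contained (SendGrid's real Event Webhook uses ECDSA public-key signature verification, and the mappings would live in the delivery_status store):

```python
import hashlib
import hmac

WEBHOOK_SECRET = b"example-shared-secret"  # assumption: shared-secret HMAC

def verify_signature(raw_body: bytes, signature: str) -> bool:
    expected = hmac.new(WEBHOOK_SECRET, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

# Hypothetical stand-ins for the real stores.
provider_msg_index = {"abc123": "notif_789"}  # provider_msg_id -> notification_id
delivery_status = {}                          # (notification_id, channel) -> status
suppressed_addresses = set()

def handle_sendgrid_event(event: dict, raw_body: bytes, signature: str) -> str:
    if not verify_signature(raw_body, signature):               # 1. validate
        return "rejected"
    notif_id = provider_msg_index.get(event["sg_message_id"])   # 2. map the id
    if notif_id is None:
        return "unknown_message"
    delivery_status[(notif_id, "email")] = event["event"]       # 3. update status
    if event["event"] in ("bounced", "invalid"):                # 4. stop retrying
        suppressed_addresses.add(event.get("email"))            #    a dead address
    return "ok"
```

For push, step 4 deactivates the row in `device_tokens` instead of suppressing an address; the shape of the handler is otherwise the same per channel.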
10. Scaling Considerations
Worker Scaling
Push workers: Auto-scale based on Kafka consumer lag
Target: lag < 1000 messages per partition
Email workers: Rate-limited by provider (100K/sec SendGrid)
Pool size matches provider limits
SMS workers: Most expensive channel, smallest pool
Strict rate limiting and priority filtering
Database Scaling
Notifications table:
- Partitioned by month (DROP PARTITION for cleanup)
- Retention: 90 days in hot storage, 1 year in cold
Delivery status:
- Cassandra: handles high write volume naturally
- TTL on records: 30 days auto-expiry
User preferences:
- Cached in Redis (99% reads, 1% writes)
- Cache TTL: 1 hour
- Invalidate on preference update
Monitoring and Alerting
Key Metrics:
- Notification throughput per channel per priority
- Delivery success rate per provider
- End-to-end latency (API call to delivery)
- Queue depth and consumer lag
- Provider error rates and types
- Rate limit hit rate
- User unsubscribe rate (feedback loop)
11. Key Tradeoffs
| Decision | Option A | Option B | Our Choice |
|---|---|---|---|
| Queue technology | RabbitMQ | Kafka | Kafka |
| Priority handling | Single queue + priority | Separate queues | Separate queues |
| Template rendering | Server-side | Client-side | Server-side |
| Delivery guarantee | At-most-once | At-least-once | At-least-once |
| Rate limit storage | Database | Redis | Redis |
| Aggregation | Client-side grouping | Server-side buffering | Server-side |
| Provider failover | Manual switch | Automatic circuit | Automatic |
12. Failure Scenarios and Mitigations
| Scenario | Mitigation |
|---|---|
| Provider outage (SendGrid) | Auto-failover to SES; retry queue |
| Kafka broker failure | Multi-broker cluster; replication factor 3 |
| Invalid device tokens | Handle APNS 410/Unregistered responses; deactivate stale tokens |
| Notification storm (bug) | Global rate limiter; kill switch per notification type |
| Template rendering failure | Fall back to plain text; alert template team |
| User timezone data missing | Default to UTC; log for correction |
| DLQ accumulation | Alerting on DLQ depth; manual review process |
Key Takeaways
- Separate queues per priority ensure critical notifications (OTP, security) are never delayed by marketing blasts.
- Provider failover with circuit breakers prevents single-provider outages from blocking all notifications.
- Rate limiting per user per type prevents notification fatigue -- the biggest cause of users disabling notifications entirely.
- Notification aggregation ("Alice and 48 others liked your post") dramatically reduces notification volume while preserving information value.
- At-least-once delivery with idempotency keys is the right guarantee -- duplicate notifications are better than missed critical alerts.