Episode 9 — System Design / 9.11 — Real World System Design Problems
9.11.g Design a Notification System (Push / Email / SMS)
Problem Statement
Design a scalable notification system that delivers push notifications, emails, and SMS messages to users. The system must handle user preferences, templates, priority levels, rate limiting, and delivery guarantees.
1. Requirements
Functional Requirements
- Send notifications via push (iOS/Android), email, and SMS
- User preference management (opt-in/out per channel and type)
- Template-based notification content
- Priority levels (critical, high, normal, low)
- Scheduling (send at specific time or timezone-aware)
- Delivery tracking (sent, delivered, opened, clicked)
- Rate limiting per user (prevent notification fatigue)
- Batch notifications (e.g., daily digest)
Non-Functional Requirements
- Deliver critical notifications within 5 seconds
- Support 1 billion notifications per day
- At-least-once delivery guarantee
- 99.9% delivery success rate
- Graceful degradation when a channel provider is down
- Support 500 million registered users
2. Capacity Estimation
Traffic
Notifications per day: 1 billion
Per second: 1B / 86,400 ~= 11,600 notifications/sec
Peak (10x): ~116,000 notifications/sec
Channel breakdown:
Push: 60% = 600 million/day
Email: 30% = 300 million/day
SMS: 10% = 100 million/day
Storage
Notification record: 500 bytes (metadata + content)
Daily storage: 1B * 500 bytes = 500 GB/day
Retention (90 days): 45 TB
Template storage: ~10 MB (negligible)
User preferences: 500M users * 200 bytes = 100 GB
External Provider Limits
APNS (Apple): Bursts ok, sustained ~100K/sec
FCM (Google): 500K messages/sec
SendGrid (Email): 100K emails/sec (enterprise plan)
Twilio (SMS): 10K SMS/sec (enterprise plan)
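These estimates are simple arithmetic; a quick sanity check in Python (numbers are the ones stated above):

```python
# Back-of-envelope check of the capacity estimates above.

DAY_SECONDS = 86_400

notifications_per_day = 1_000_000_000
avg_per_sec = notifications_per_day / DAY_SECONDS    # ~11,574/sec (text rounds to ~11,600)
peak_per_sec = 10 * avg_per_sec                      # ~115,740/sec at 10x peak

record_bytes = 500
daily_storage_gb = notifications_per_day * record_bytes / 1e9    # 500 GB/day
retention_90d_tb = daily_storage_gb * 90 / 1_000                 # 45 TB over 90 days

users = 500_000_000
prefs_gb = users * 200 / 1e9                                     # 100 GB of preferences

print(f"{avg_per_sec:,.0f}/sec avg, {peak_per_sec:,.0f}/sec peak, "
      f"{daily_storage_gb:.0f} GB/day, {retention_90d_tb:.0f} TB retained, "
      f"{prefs_gb:.0f} GB prefs")
```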
3. High-Level Architecture
+-------------------+ +-------------------+
| Internal Services | | Scheduled Jobs |
| (Order, Payment, | | (Cron/Scheduler) |
| Marketing, etc.) | | |
+--------+----------+ +--------+----------+
| |
+------------+------------+
|
+-------v--------+
| Notification |
| API Service |
+-------+--------+
|
+-------v--------+
| Validation & |
| Enrichment |
| (preferences, |
| templates, |
| rate limits) |
+-------+--------+
|
+-------v--------+
| Priority Queue |
| (Kafka) |
+-------+--------+
|
+------------+------------+
| | |
+--------v---+ +-----v------+ +---v----------+
| Push Worker| | Email | | SMS Worker |
| Pool | | Worker Pool| | Pool |
+--------+---+ +-----+------+ +---+----------+
| | |
+--------v---+ +-----v------+ +---v----------+
| APNS / FCM | | SendGrid / | | Twilio / |
| | | SES | | Vonage |
+------------+ +------------+ +--------------+
\ | /
+------+----+----+------+
| |
+-------v---+ +---v-----------+
| Delivery | | Analytics |
| Tracker | | Service |
+-----------+ +---------------+
4. API Design
POST /api/v1/notifications/send
Headers: Authorization: Bearer <service-token>
Body: {
"recipient_id": "user_123",
"notification_type": "order_shipped",
"channels": ["push", "email"], // optional override
"priority": "high", // critical|high|normal|low
"template_id": "tmpl_order_shipped",
"template_data": {
"order_id": "ord_456",
"tracking_number": "1Z999AA10123456784",
"delivery_date": "2026-04-14"
},
"idempotency_key": "order_456_shipped_v1",
"schedule_at": null // null = send immediately
}
Response 202: {
"notification_id": "notif_789",
"status": "queued",
"channels_targeted": ["push", "email"]
}
POST /api/v1/notifications/batch
Body: {
"recipients": [
{ "user_id": "user_1", "template_data": { ... } },
{ "user_id": "user_2", "template_data": { ... } }
],
"notification_type": "marketing_weekly",
"template_id": "tmpl_weekly_digest",
"priority": "low",
"channels": ["email"]
}
Response 202: { "batch_id": "batch_abc", "count": 2, "status": "queued" }
GET /api/v1/notifications/{notification_id}/status
Response 200: {
"notification_id": "notif_789",
"status": "delivered",
"channels": {
"push": { "status": "delivered", "delivered_at": "..." },
"email": { "status": "opened", "opened_at": "..." }
}
}
GET /api/v1/users/{user_id}/preferences
Response 200: {
"push_enabled": true,
"email_enabled": true,
"sms_enabled": false,
"quiet_hours": { "start": "22:00", "end": "08:00", "timezone": "US/Pacific" },
"preferences": {
"order_updates": { "push": true, "email": true, "sms": false },
"marketing": { "push": false, "email": true, "sms": false },
"security_alerts": { "push": true, "email": true, "sms": true }
}
}
PUT /api/v1/users/{user_id}/preferences
Body: { ... updated preferences ... }
Response 200: Updated preferences
5. Database Schema
Notifications Table (PostgreSQL + partitioning)
CREATE TABLE notifications (
notification_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL,
notification_type VARCHAR(100) NOT NULL,
template_id VARCHAR(100),
priority VARCHAR(20) DEFAULT 'normal',
content JSONB,
idempotency_key VARCHAR(255) UNIQUE,
scheduled_at TIMESTAMP,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
) PARTITION BY RANGE (created_at);
-- Partition by month for efficient cleanup and querying
Delivery Status Table (Cassandra -- high write volume)
CREATE TABLE delivery_status (
notification_id UUID,
channel VARCHAR, -- 'push', 'email', 'sms'
status VARCHAR, -- 'queued','sent','delivered','opened','failed'
provider VARCHAR, -- 'apns','fcm','sendgrid','twilio'
provider_msg_id VARCHAR,
attempts INT,
last_attempt_at TIMESTAMP,
delivered_at TIMESTAMP,
error_message TEXT,
PRIMARY KEY (notification_id, channel)
);
Templates Table (PostgreSQL)
CREATE TABLE notification_templates (
template_id VARCHAR(100) PRIMARY KEY,
notification_type VARCHAR(100) NOT NULL,
channel VARCHAR(20) NOT NULL,
subject VARCHAR(500), -- for email
title VARCHAR(200), -- for push
body_template TEXT NOT NULL, -- mustache/handlebars
metadata JSONB,
version INTEGER DEFAULT 1,
is_active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
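The `body_template` column holds mustache/handlebars-style placeholders. A toy renderer, using a regex substitution instead of a real mustache engine (an assumption for brevity; production would use an actual template library):

```python
import re

def render(body_template: str, data: dict) -> str:
    """Replace {{key}} placeholders; unknown keys are left untouched
    so a missing field is visible rather than silently blank."""
    def substitute(match):
        return str(data.get(match.group(1), match.group(0)))
    return re.sub(r"\{\{(\w+)\}\}", substitute, body_template)

template = "Your order {{order_id}} shipped! Tracking: {{tracking_number}}"
print(render(template, {"order_id": "ord_456",
                        "tracking_number": "1Z999AA10123456784"}))
# -> Your order ord_456 shipped! Tracking: 1Z999AA10123456784
```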
User Preferences Table (PostgreSQL)
CREATE TABLE user_notification_preferences (
user_id UUID NOT NULL,
notification_type VARCHAR(100) NOT NULL,
push_enabled BOOLEAN DEFAULT TRUE,
email_enabled BOOLEAN DEFAULT TRUE,
sms_enabled BOOLEAN DEFAULT FALSE,
quiet_hours_start TIME,
quiet_hours_end TIME,
timezone VARCHAR(50) DEFAULT 'UTC',
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (user_id, notification_type)
);
Device Tokens Table (PostgreSQL)
CREATE TABLE device_tokens (
token_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL,
platform VARCHAR(20) NOT NULL, -- 'ios', 'android', 'web'
device_token VARCHAR(500) NOT NULL,
is_active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
last_used_at TIMESTAMP
);
CREATE INDEX idx_device_user ON device_tokens(user_id) WHERE is_active = TRUE;
6. Deep Dive: Notification Processing Pipeline
End-to-End Flow
Service calls API
|
v
+------+--------+
| 1. Validation | Check: valid user, valid template, valid channels
+------+--------+
|
v
+------+--------+
| 2. Idempotency| Check: has this notification been sent already?
| Check | (Redis: SET NX idempotency_key TTL 24h)
+------+--------+
|
v
+------+---------+
| 3. Preference | Fetch user preferences
| Filtering | Remove opted-out channels
+------+---------+ Check quiet hours -> defer if in quiet period
|
v
+------+---------+
| 4. Rate Limit | Check: user has not exceeded notification quota
| Check | (e.g., max 5 push/hour, 3 email/day for marketing)
+------+---------+
|
v
+------+---------+
| 5. Template | Render template with provided data
| Rendering | Localize content based on user locale
+------+---------+
|
v
+------+---------+
| 6. Enqueue | Publish to Kafka topic per priority
| to Queue | Topic: notifications.{priority}.{channel}
+------+---------+
|
v
+------+---------+
| 7. Workers | Consume from Kafka, call external providers
| Process | Retry on failure with exponential backoff
+------+---------+
|
v
+------+---------+
| 8. Track | Update delivery status
| Delivery | Process webhooks from providers
+-----------------+
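Step 2 (the idempotency check) can be sketched as follows, assuming a Redis client that supports `SET` with `NX` and `EX` (as redis-py does). `FakeRedis` is an in-memory stand-in so the example runs without a server; only the NX/EX semantics matter:

```python
import time

class FakeRedis:
    """Just enough of Redis SET NX EX for this sketch."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at or None)

    def set(self, key, value, nx=False, ex=None):
        now = time.time()
        entry = self._store.get(key)
        if entry and entry[1] is not None and entry[1] <= now:
            entry = None  # TTL elapsed: treat as absent
        if nx and entry is not None:
            return None   # NX refuses to overwrite -> caller sees a duplicate
        self._store[key] = (value, now + ex if ex else None)
        return True

IDEMPOTENCY_TTL = 24 * 3600  # 24h, matching the diagram

def claim_idempotency_key(redis_client, idempotency_key):
    """True if this is the first send with this key; False on a duplicate."""
    return redis_client.set(f"idem:{idempotency_key}", "1",
                            nx=True, ex=IDEMPOTENCY_TTL) is True
```

The first caller wins the `SET NX` and proceeds; any retry within 24 hours sees `None` and is dropped as a duplicate, which is what makes at-least-once delivery safe upstream.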
7. Deep Dive: Priority Queue System
Kafka Topic Design
Topics:
notifications.critical.push -- highest priority, dedicated consumers
notifications.critical.email
notifications.critical.sms
notifications.high.push
notifications.high.email
notifications.high.sms
notifications.normal.push
notifications.normal.email
notifications.normal.sms
notifications.low.push -- lowest priority, fewer consumers
notifications.low.email
notifications.low.sms
Consumer allocation:
Critical: 20 consumers per channel
High: 10 consumers per channel
Normal: 5 consumers per channel
Low: 2 consumers per channel
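The `notifications.{priority}.{channel}` naming scheme reduces to a small routing helper (names are illustrative; the Kafka producer itself is omitted):

```python
# Route a notification to its Kafka topic per the scheme above.
PRIORITIES = ("critical", "high", "normal", "low")
CHANNELS = ("push", "email", "sms")

def topic_for(priority: str, channel: str) -> str:
    if priority not in PRIORITIES or channel not in CHANNELS:
        raise ValueError(f"unknown priority/channel: {priority}/{channel}")
    return f"notifications.{priority}.{channel}"

print(topic_for("critical", "push"))  # -> notifications.critical.push
```

Keeping priority in the topic name (rather than a priority field in one topic) is what lets each tier get its own consumer count: Kafka consumers on `notifications.critical.*` never sit behind a backlog of low-priority marketing messages.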
Priority Processing
Critical (security alerts, OTP):
- Dedicated fast path, no batching
- Bypass rate limiting
- SLA: < 5 seconds
High (order updates, payment confirmations):
- Priority processing
- SLA: < 30 seconds
Normal (social notifications, comments):
- Standard processing
- SLA: < 5 minutes
Low (marketing, digests):
- Batch processing allowed
- Respect quiet hours strictly
- SLA: < 1 hour
8. Deep Dive: Rate Limiting for Notifications
Per-User Rate Limits
Rate Limit Rules:
+---------------+----------+----------+--------+----------+
| Type          | Push     | Email    | SMS    | Window   |
+---------------+----------+----------+--------+----------+
| Marketing     | 3/day    | 1/day    | 1/week | per type |
| Social        | 10/hr    | 5/day    | N/A    | per type |
| Transactional | 20/hr    | 10/hr    | 5/hr   | per type |
| Security/OTP  | no limit | no limit | 10/hr  | per type |
+---------------+----------+----------+--------+----------+
Global per-user limit: Max 50 push notifications per day
Redis Implementation
Key: ratelimit:notif:{user_id}:{type}:{channel}:{window}
Value: Counter
TTL: Window duration
def check_notification_rate_limit(user_id, notif_type, channel):
    """Returns (allowed, reason). One Redis counter per (user, type,
    channel, window); the TTL makes each limit reset automatically."""
    rules = get_rate_rules(notif_type, channel)
    for rule in rules:
        key = f"ratelimit:notif:{user_id}:{notif_type}:{channel}:{rule.window}"
        count = redis.incr(key)
        if count == 1:
            # First hit in this window: start the window clock.
            # Note: INCR + EXPIRE is not atomic -- wrap both in a Lua
            # script or MULTI/EXEC if a crash between the two calls
            # (leaving a counter with no TTL) is a concern.
            redis.expire(key, rule.window_seconds)
        if count > rule.limit:
            return False, f"Rate limit exceeded: {rule.limit}/{rule.window}"
    return True, None
Notification Aggregation
Problem: 50 people like your post = 50 separate notifications?
Solution: Aggregate within a time window.
Instead of:
"Alice liked your post"
"Bob liked your post"
"Charlie liked your post"
Send:
"Alice, Bob, and 48 others liked your post"
Implementation:
1. Buffer similar notifications in Redis sorted set
2. After 5-minute window or 10 notifications (whichever first):
3. Aggregate into single notification
4. Render aggregated template
5. Send once
Key: aggregate:{user_id}:{type}:{target_id}
Value: Sorted set of {actor_id, timestamp}
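The buffering steps above can be sketched with a plain dict standing in for the Redis sorted set, so the snippet runs standalone. Window and threshold match the text (5 minutes or 10 events, whichever comes first); note that a production version also needs a background timer to flush buffers whose window expires before reaching 10 events:

```python
import time

WINDOW_SECONDS = 300   # 5-minute aggregation window
MAX_BUFFERED = 10      # flush early once 10 events accumulate

buffers = {}  # key -> list of (actor_id, timestamp)

def buffer_like(user_id, target_id, actor_id, now=None):
    """Buffer a 'like' event; return an aggregated message on flush, else None."""
    now = time.time() if now is None else now
    key = f"aggregate:{user_id}:like:{target_id}"
    events = buffers.setdefault(key, [])
    events.append((actor_id, now))
    if len(events) < MAX_BUFFERED and now - events[0][1] < WINDOW_SECONDS:
        return None  # keep buffering
    actors = [actor for actor, _ in events]
    del buffers[key]
    if len(actors) == 1:
        return f"{actors[0]} liked your post"
    if len(actors) == 2:
        return f"{actors[0]} and {actors[1]} liked your post"
    return f"{actors[0]}, {actors[1]}, and {len(actors) - 2} others liked your post"
```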
9. Deep Dive: Delivery Guarantees and Retry
Retry Strategy
Exponential backoff with jitter:
Attempt 1: Immediate
Attempt 2: 30 seconds + random(0, 10)
Attempt 3: 2 minutes + random(0, 30)
Attempt 4: 10 minutes + random(0, 60)
Attempt 5: 1 hour + random(0, 300)
Attempt 6: Move to Dead Letter Queue (DLQ)
Total retry window: ~1.2 hours (72.5 minutes of base delay, plus jitter)
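The schedule above as a lookup table, with `random.uniform` providing the jitter (a common choice; the text does not mandate a distribution):

```python
import random

# (base_seconds, jitter_upper_bound) per attempt; past the table -> DLQ.
RETRY_SCHEDULE = [
    (0, 0),        # attempt 1: immediate
    (30, 10),      # attempt 2
    (120, 30),     # attempt 3
    (600, 60),     # attempt 4
    (3600, 300),   # attempt 5
]

def retry_delay(attempt: int):
    """Delay in seconds before the given attempt (1-based).
    Returns None when the attempt count is exhausted (route to DLQ)."""
    if attempt > len(RETRY_SCHEDULE):
        return None
    base, jitter = RETRY_SCHEDULE[attempt - 1]
    return base + random.uniform(0, jitter)
```

The jitter matters at this scale: without it, a provider outage would cause hundreds of thousands of retries to fire in the same second when the backoff expires.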
Provider Failover
Push: APNS (iOS) and FCM (Android/web) -- device tokens are platform-specific, so push has no cross-provider failover; failed sends retry against the same provider
Email: SendGrid (primary) -> SES (secondary) -> Mailgun (tertiary)
SMS: Twilio (primary) -> Vonage (secondary)
Failover logic:
if primary.is_healthy():
result = primary.send(notification)
else:
result = secondary.send(notification)
alert("Primary provider down, using failover")
Health check: Ping providers every 30 seconds.
Circuit breaker: Open after 5 consecutive failures.
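A minimal version of the failover-plus-circuit-breaker logic above (provider callables are hypothetical stand-ins, and half-open recovery after a cooldown is omitted for brevity):

```python
FAILURE_THRESHOLD = 5  # open the circuit after 5 consecutive failures

class CircuitBreaker:
    def __init__(self, send_fn):
        self.send_fn = send_fn
        self.consecutive_failures = 0

    @property
    def is_open(self):
        return self.consecutive_failures >= FAILURE_THRESHOLD

    def send(self, notification):
        try:
            result = self.send_fn(notification)
        except Exception:
            self.consecutive_failures += 1
            raise
        self.consecutive_failures = 0  # any success resets the counter
        return result

def send_with_failover(providers, notification):
    """Try providers in order, skipping any whose circuit is open."""
    last_error = None
    for breaker in providers:
        if breaker.is_open:
            continue
        try:
            return breaker.send(notification)
        except Exception as exc:
            last_error = exc
    raise RuntimeError("all providers failed or circuit-open") from last_error
```

Once the primary's circuit opens, traffic skips it entirely instead of paying a timeout per message; a real implementation would also re-probe the primary periodically (the "half-open" state) so it can recover.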
Webhook Processing for Delivery Status
Provider sends webhooks for status updates:
POST /webhooks/sendgrid
Body: {
"event": "delivered",
"sg_message_id": "abc123",
"timestamp": 1681200000
}
APNS note: APNS does not send webhooks; invalid-token feedback arrives
synchronously in the HTTP/2 send response (status 410, reason "Unregistered").
Processing:
1. Validate webhook signature
2. Map provider_msg_id to notification_id
3. Update delivery_status table
4. If "bounced" or "invalid_token": deactivate device token
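The four steps can be sketched for a SendGrid-style event as below. The shared-secret HMAC scheme and the in-memory lookup tables are assumptions made so the example is self-contained (SendGrid's real Event Webhook uses ECDSA public-key signature verification, and the mappings would live in the delivery_status store):

```python
import hashlib
import hmac

WEBHOOK_SECRET = b"example-shared-secret"  # assumption: shared-secret HMAC

def verify_signature(raw_body: bytes, signature: str) -> bool:
    expected = hmac.new(WEBHOOK_SECRET, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

# Hypothetical stand-ins for the real stores.
provider_msg_index = {"abc123": "notif_789"}  # provider_msg_id -> notification_id
delivery_status = {}                          # (notification_id, channel) -> status
suppressed_addresses = set()

def handle_sendgrid_event(event: dict, raw_body: bytes, signature: str) -> str:
    if not verify_signature(raw_body, signature):               # 1. validate
        return "rejected"
    notif_id = provider_msg_index.get(event["sg_message_id"])   # 2. map the id
    if notif_id is None:
        return "unknown_message"
    delivery_status[(notif_id, "email")] = event["event"]       # 3. update status
    if event["event"] in ("bounced", "invalid"):                # 4. stop retrying
        suppressed_addresses.add(event.get("email"))            #    a dead address
    return "ok"
```

For push, step 4 deactivates the row in `device_tokens` instead of suppressing an address; the shape of the handler is otherwise the same per channel.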
10. Scaling Considerations
Worker Scaling
Push workers: Auto-scale based on Kafka consumer lag
Target: lag < 1000 messages per partition
Email workers: Rate-limited by provider (100K/sec SendGrid)
Pool size matches provider limits
SMS workers: Most expensive channel, smallest pool
Strict rate limiting and priority filtering
Database Scaling
Notifications table:
- Partitioned by month (DROP PARTITION for cleanup)
- Retention: 90 days in hot storage, 1 year in cold
Delivery status:
- Cassandra: handles high write volume naturally
- TTL on records: 30 days auto-expiry
User preferences:
- Cached in Redis (99% reads, 1% writes)
- Cache TTL: 1 hour
- Invalidate on preference update
Monitoring and Alerting
Key Metrics:
- Notification throughput per channel per priority
- Delivery success rate per provider
- End-to-end latency (API call to delivery)
- Queue depth and consumer lag
- Provider error rates and types
- Rate limit hit rate
- User unsubscribe rate (feedback loop)
11. Key Tradeoffs
| Decision | Option A | Option B | Our Choice |
|---|---|---|---|
| Queue technology | RabbitMQ | Kafka | Kafka |
| Priority handling | Single queue + priority | Separate queues | Separate queues |
| Template rendering | Server-side | Client-side | Server-side |
| Delivery guarantee | At-most-once | At-least-once | At-least-once |
| Rate limit storage | Database | Redis | Redis |
| Aggregation | Client-side grouping | Server-side buffering | Server-side |
| Provider failover | Manual switch | Automatic circuit | Automatic |
12. Failure Scenarios and Mitigations
| Scenario | Mitigation |
|---|---|
| Provider outage (SendGrid) | Auto-failover to SES; retry queue |
| Kafka broker failure | Multi-broker cluster; replication factor 3 |
| Invalid device tokens | Handle APNS 410/Unregistered responses; deactivate stale tokens |
| Notification storm (bug) | Global rate limiter; kill switch per notification type |
| Template rendering failure | Fall back to plain text; alert template team |
| User timezone data missing | Default to UTC; log for correction |
| DLQ accumulation | Alerting on DLQ depth; manual review process |
Key Takeaways
- Separate queues per priority ensure critical notifications (OTP, security) are never delayed by marketing blasts.
- Provider failover with circuit breakers prevents single-provider outages from blocking all notifications.
- Rate limiting per user per type prevents notification fatigue -- the biggest cause of users disabling notifications entirely.
- Notification aggregation ("Alice and 48 others liked your post") dramatically reduces notification volume while preserving information value.
- At-least-once delivery with idempotency keys is the right guarantee -- duplicate notifications are better than missed critical alerts.