Episode 9 — System Design / 9.11 — Real World System Design Problems

9.11.b Design a Distributed Rate Limiter

Problem Statement

Design a rate limiting service that can throttle requests across a distributed system. The service must support multiple rate limiting strategies and work across multiple data centers with minimal latency overhead.


1. Requirements

Functional Requirements

  • Limit requests per user/IP/API key based on configurable rules
  • Support multiple limiting algorithms (fixed window, sliding window, token bucket)
  • Return appropriate headers (remaining quota, reset time)
  • Support multi-tier limiting (per second, per minute, per hour)
  • Allow rule configuration per API endpoint
  • Provide a dashboard for monitoring rate limit events

Non-Functional Requirements

  • Ultra-low added latency on the request path (< 5ms overhead)
  • High availability (rate limiter failure should not block requests)
  • Consistency: briefly allowing a few requests over the limit (e.g. during sync lag) is acceptable; blocking traffic that is within the limit is not
  • Scale to 1 million+ requests per second
  • Work across distributed data centers

2. Capacity Estimation

Traffic

Total API requests:     1 million per second (peak)
Unique users:           10 million active
Rules per user:         ~5 (different endpoints, different tiers)
Rate limit checks/sec:  1 million (one per request)

Storage

Per-user counter state:  ~200 bytes (counters + timestamps)
Total state:             10M users * 200 bytes = 2 GB
Fits entirely in memory (Redis)

Network

Request to rate limiter:  ~100 bytes
Response from limiter:    ~50 bytes
Bandwidth:                1M * 150 bytes = 150 MB/s

3. High-Level Architecture

                    +-------------------+
                    |    API Gateway    |
                    +--------+----------+
                             |
                    +--------v----------+
                    | Rate Limit Module |  <-- Embedded or sidecar
                    +--------+----------+
                             |
              +--------------+--------------+
              |                             |
     +--------v--------+          +--------v--------+
     | Redis Cluster   |          | Rules Service   |
     | (Counter Store) |          | (Config Store)  |
     +--------+--------+          +-----------------+
              |                          |
     +--------v--------+          +-----v-----------+
      | Redis Replicas  |          | Rules Database  |
     | (Failover)      |          | (PostgreSQL)    |
     +-----------------+          +-----------------+
              
               +------------------------+
               | Analytics / Logging    |
               | (Kafka -> ClickHouse)  |
               +------------------------+

4. API Design

Rate Limiter Internal API

POST /api/v1/rate-limit/check
  Body: {
    "client_id": "user_12345",
    "endpoint": "/api/v1/messages",
    "method": "POST",
    "ip_address": "203.0.113.42"
  }
  Response 200 (Allowed): {
    "allowed": true,
    "remaining": 87,
    "limit": 100,
    "reset_at": 1681200060
  }
  Response 200 (Denied): {
    "allowed": false,
    "remaining": 0,
    "limit": 100,
    "reset_at": 1681200060,
    "retry_after": 42
  }

Response Headers Passed to Client

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1681200060
Retry-After: 42
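A minimal sketch of the gateway-side translation from a limiter decision to these headers. `RateLimitResult` and its fields mirror the pseudocode used throughout this section; they are assumptions for illustration, not a real library.

```python
from dataclasses import dataclass

@dataclass
class RateLimitResult:
    allowed: bool
    remaining: int
    limit: int = 100
    reset_at: int = 0
    retry_after: int = 0

def build_headers(result: RateLimitResult) -> dict:
    """Map a limiter decision onto the standard rate-limit headers."""
    headers = {
        "X-RateLimit-Limit": str(result.limit),
        "X-RateLimit-Remaining": str(result.remaining),
        "X-RateLimit-Reset": str(result.reset_at),
    }
    if not result.allowed:
        # Only denied responses carry Retry-After
        headers["Retry-After"] = str(result.retry_after)
    return headers
```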

Rule Configuration API

POST /api/v1/rate-limit/rules
  Body: {
    "rule_id": "messages_per_min",
    "endpoint_pattern": "/api/v1/messages",
    "method": "POST",
    "limit": 100,
    "window_seconds": 60,
    "algorithm": "sliding_window",
    "scope": "per_user",
    "priority": 1
  }

GET /api/v1/rate-limit/rules
  Response: List of all active rules

DELETE /api/v1/rate-limit/rules/{rule_id}
  Response 204: No Content

5. Rate Limiting Algorithms

Algorithm 1: Fixed Window Counter

Window: |-------- 60 seconds --------|-------- 60 seconds --------|
        |  count: 0 -> 1 -> ... 100  |  count: 0 -> 1 -> ...     |
        |  ALLOW    ALLOW    DENY     |  ALLOW    ALLOW            |

Key:   rate_limit:{client_id}:{endpoint}:{window_start}
Value: counter (integer)
TTL:   window_size + buffer

def fixed_window_check(client_id, endpoint, limit, window_sec):
    window_start = current_time() // window_sec * window_sec
    key = f"rate:{client_id}:{endpoint}:{window_start}"

    count = redis.incr(key)
    if count == 1:
        # First request in this window: set TTL so stale keys expire
        redis.expire(key, window_sec + 1)

    if count > limit:
        return RateLimitResult(allowed=False, remaining=0)
    return RateLimitResult(allowed=True, remaining=limit - count)

Problem: Burst at window boundary.

Window 1 (12:00:00 - 12:00:59): 100 requests at 12:00:58-59
Window 2 (12:01:00 - 12:01:59): 100 requests at 12:01:00-01

Result: 200 requests in 2 seconds! (limit was 100/min)
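The boundary burst can be reproduced with an in-memory model of the fixed-window counter (no Redis needed); the `FixedWindow` class is an illustrative stand-in, with times in seconds.

```python
from collections import defaultdict

class FixedWindow:
    """In-memory fixed-window counter, for demonstrating the boundary burst."""
    def __init__(self, limit, window_sec):
        self.limit, self.window = limit, window_sec
        self.counts = defaultdict(int)

    def allow(self, now):
        window_start = now // self.window * self.window
        self.counts[window_start] += 1
        return self.counts[window_start] <= self.limit

limiter = FixedWindow(limit=100, window_sec=60)
# 100 requests at t=58s (end of window 1), 100 more at t=60s (start of window 2)
allowed = sum(limiter.allow(58) for _ in range(100))
allowed += sum(limiter.allow(60) for _ in range(100))
print(allowed)  # 200 -- double the per-minute limit, inside 2 seconds
```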

Algorithm 2: Sliding Window Log

Store timestamp of every request in a sorted set:

ZADD   rate:{client_id}  {timestamp}  {request_id}
ZREMRANGEBYSCORE rate:{client_id} 0 {current_time - window}
ZCARD  rate:{client_id}

def sliding_window_log(client_id, limit, window_sec):
    now = current_time_ms()
    key = f"rate:{client_id}"
    window_start = now - (window_sec * 1000)

    pipe = redis.pipeline()
    pipe.zremrangebyscore(key, 0, window_start)     # Drop entries older than the window
    pipe.zadd(key, {f"{now}:{request_id()}": now})  # Unique member avoids same-ms collisions
    pipe.zcard(key)                                 # Count requests in the window
    pipe.expire(key, window_sec + 1)
    _, _, count, _ = pipe.execute()

    if count > limit:
        # Note: the denied request's entry stays in the set, so
        # aggressive retries extend the client's own lockout
        return RateLimitResult(allowed=False, remaining=0)
    return RateLimitResult(allowed=True, remaining=limit - count)

Pros: Perfectly accurate sliding window. Cons: High memory usage (stores every request timestamp).
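A back-of-envelope comparison makes the memory cost concrete. The per-entry overhead figure is a rough assumption (sorted-set score plus member plus bookkeeping), but the order of magnitude is what matters.

```python
# Worst case: every one of 10M users is at a 100-requests/min limit.
users = 10_000_000
entries_per_user = 100           # one sorted-set entry per request in the window
bytes_per_entry = 28             # assumed: score + member + overhead (rough)
log_bytes = users * entries_per_user * bytes_per_entry

counter_bytes = users * 2 * 8    # sliding window counter: two integers per user

print(f"sliding log:     {log_bytes / 1e9:.0f} GB")    # ~28 GB
print(f"window counters: {counter_bytes / 1e9:.2f} GB") # ~0.16 GB
```

Under these assumptions the log variant needs an order of magnitude more memory than the 2 GB budget from section 2, which is why the counter-based approximation wins in practice.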

Algorithm 3: Sliding Window Counter (Recommended)

Combines fixed windows with weighted overlap:

Previous window count: 84  (12:00:00 - 12:00:59)
Current window count:  15  (12:01:00 - 12:01:59)
Current position:      12:01:15 (25% into current window)
Overlap from previous: 75%

Weighted count = 84 * 0.75 + 15 = 63 + 15 = 78
Limit: 100
Result: ALLOW (78 < 100)

def sliding_window_counter(client_id, limit, window_sec):
    now = current_time()
    current_window = now // window_sec * window_sec
    previous_window = current_window - window_sec
    elapsed = now - current_window
    weight = 1 - (elapsed / window_sec)   # share of previous window still in scope

    prev_key = f"rate:{client_id}:{previous_window}"
    curr_key = f"rate:{client_id}:{current_window}"

    prev_count = int(redis.get(prev_key) or 0)
    curr_count = redis.incr(curr_key)
    if curr_count == 1:
        redis.expire(curr_key, window_sec * 2)

    weighted = prev_count * weight + curr_count

    if weighted > limit:
        return RateLimitResult(allowed=False, remaining=0)
    return RateLimitResult(allowed=True, remaining=int(limit - weighted))

Pros: Low memory, smooth rate limiting, no boundary burst. Cons: Approximation (usually within 0.003% error).
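An in-memory model (no Redis, illustrative class names) confirms the weighted counter catches the boundary burst that defeated Algorithm 1.

```python
from collections import defaultdict

class SlidingWindowCounter:
    """In-memory weighted sliding window, for demonstration only."""
    def __init__(self, limit, window_sec):
        self.limit, self.window = limit, window_sec
        self.counts = defaultdict(int)

    def allow(self, now):
        curr = now // self.window * self.window
        prev = curr - self.window
        weight = 1 - (now - curr) / self.window
        self.counts[curr] += 1
        weighted = self.counts[prev] * weight + self.counts[curr]
        if weighted > self.limit:
            self.counts[curr] -= 1   # roll back, like the Lua script in section 7
            return False
        return True

sw = SlidingWindowCounter(limit=100, window_sec=60)
for _ in range(100):
    sw.allow(58)          # fill the previous window
print(sw.allow(60))       # False -- the boundary burst is now caught
print(sw.allow(75))       # True  -- 100 * 0.75 + 1 = 76 <= 100
```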

Algorithm 4: Token Bucket

Bucket:
+----------------------------------+
|  tokens: 10  |  max: 10         |
|  refill_rate: 1 token/second    |
+----------------------------------+

Request arrives:
  - If tokens > 0: decrement, ALLOW
  - If tokens == 0: DENY

Tokens refill continuously at the configured rate.

def token_bucket(client_id, max_tokens, refill_rate):
    key = f"bucket:{client_id}"
    now = current_time()

    bucket = redis.hgetall(key)
    if not bucket:
        # Initialize bucket, consuming one token for this request
        redis.hset(key, mapping={"tokens": max_tokens - 1, "last_refill": now})
        redis.expire(key, 3600)
        return RateLimitResult(allowed=True, remaining=max_tokens - 1)

    last_refill = float(bucket["last_refill"])
    tokens = float(bucket["tokens"])

    # Refill continuously based on elapsed time
    elapsed = now - last_refill
    new_tokens = min(max_tokens, tokens + elapsed * refill_rate)

    if new_tokens >= 1:
        new_tokens -= 1
        redis.hset(key, mapping={"tokens": new_tokens, "last_refill": now})
        return RateLimitResult(allowed=True, remaining=int(new_tokens))
    return RateLimitResult(allowed=False, remaining=0)

# Note: this read-modify-write is not atomic across servers; in
# production, wrap the whole check in a Lua script (see section 7).

Pros: Allows controlled bursts, smooth rate limiting. Cons: More complex state management.
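The burst-then-refill behaviour is easy to see with an in-memory bucket (illustrative class, times in seconds, no Redis).

```python
class TokenBucket:
    """In-memory token bucket, for demonstrating burst and refill."""
    def __init__(self, max_tokens, refill_rate):
        self.max = max_tokens
        self.rate = refill_rate            # tokens per second
        self.tokens = float(max_tokens)
        self.last = 0.0

    def allow(self, now):
        # Refill first, capped at the bucket size
        self.tokens = min(self.max, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(max_tokens=10, refill_rate=1)
burst = sum(bucket.allow(0.0) for _ in range(15))
print(burst)              # 10 -- the burst drains the bucket, 5 denied
print(bucket.allow(3.0))  # True -- 3 tokens refilled after 3 seconds
```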


6. Database Schema

Rules Table (PostgreSQL)

CREATE TABLE rate_limit_rules (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name            VARCHAR(255) NOT NULL,
    endpoint_pattern VARCHAR(500) NOT NULL,
    http_method     VARCHAR(10),
    scope           VARCHAR(20) NOT NULL,  -- 'per_user', 'per_ip', 'global'
    algorithm       VARCHAR(30) NOT NULL,  -- 'sliding_window', 'token_bucket'
    limit_count     INTEGER NOT NULL,
    window_seconds  INTEGER NOT NULL,
    priority        INTEGER DEFAULT 0,
    is_active       BOOLEAN DEFAULT TRUE,
    created_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Rate Limit Events (for analytics, stored in Kafka -> ClickHouse)

CREATE TABLE rate_limit_events (
    event_id        UUID,
    client_id       VARCHAR(255),
    endpoint        VARCHAR(500),
    rule_id         UUID,
    allowed         BOOLEAN,
    current_count   INTEGER,
    limit_count     INTEGER,
    timestamp       DateTime
) ENGINE = MergeTree()
ORDER BY (timestamp, client_id);

7. Deep Dive: Redis-Based Implementation

Atomic Operations with Lua Scripts

Race conditions occur when multiple servers check and increment concurrently. Solution: use Redis Lua scripts for atomic check-and-increment.

-- sliding_window_counter.lua
local current_key = KEYS[1]
local previous_key = KEYS[2]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local weight = tonumber(ARGV[3])

local prev_count = tonumber(redis.call('GET', previous_key) or "0")
local curr_count = tonumber(redis.call('INCR', current_key))

if curr_count == 1 then
    redis.call('EXPIRE', current_key, window * 2)
end

local weighted_count = prev_count * weight + curr_count

if weighted_count > limit then
    redis.call('DECR', current_key)  -- rollback
    return {0, 0, limit}             -- denied
end

local remaining = limit - math.ceil(weighted_count)
return {1, remaining, limit}         -- allowed
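Before deploying the Lua script, its arithmetic can be unit-tested with a pure-Python translation; here `store` is a plain dict standing in for Redis, and the function name is illustrative.

```python
import math

def sliding_window_script(store, curr_key, prev_key, limit, weight):
    """Pure-Python model of sliding_window_counter.lua, for testing."""
    prev_count = store.get(prev_key, 0)
    store[curr_key] = store.get(curr_key, 0) + 1      # INCR
    weighted = prev_count * weight + store[curr_key]
    if weighted > limit:
        store[curr_key] -= 1                          # DECR rollback
        return (0, 0, limit)                          # denied
    return (1, limit - math.ceil(weighted), limit)    # allowed
```

In production the real script would run server-side so the whole check is atomic; redis-py clients can load it once with `register_script` and invoke the returned callable with `keys=` and `args=`.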

Redis Cluster Setup

+------------------+     +------------------+     +------------------+
| Redis Master 1   |     | Redis Master 2   |     | Redis Master 3   |
| Slots 0-5460     |     | Slots 5461-10922 |     | Slots 10923-16383|
+--------+---------+     +--------+---------+     +--------+---------+
         |                        |                        |
+--------+---------+     +--------+---------+     +--------+---------+
| Redis Replica 1a |     | Redis Replica 2a |     | Redis Replica 3a |
+------------------+     +------------------+     +------------------+

8. Deep Dive: Multi-Tier Rate Limiting

Tier 1 (Per Second):    Max 10 requests/second    (burst protection)
Tier 2 (Per Minute):    Max 200 requests/minute   (sustained load)
Tier 3 (Per Hour):      Max 5000 requests/hour    (quota enforcement)
Tier 4 (Per Day):       Max 50000 requests/day    (billing tier)

Evaluation order: Tier 1 -> Tier 2 -> Tier 3 -> Tier 4
ALL tiers must pass for the request to be allowed.

def multi_tier_check(client_id, endpoint):
    rules = get_rules(endpoint)  # sorted by window size ascending

    for rule in rules:
        result = sliding_window_counter(
            client_id=client_id,
            limit=rule.limit_count,
            window_sec=rule.window_seconds
        )
        if not result.allowed:
            return RateLimitResult(
                allowed=False,
                rule_violated=rule.name,
                # elapsed_in_window(): pseudocode for time already spent
                # in the violated rule's current window
                retry_after=rule.window_seconds - elapsed_in_window()
            )

    return RateLimitResult(allowed=True)

9. Deep Dive: Distributed Rate Limiting

Problem: Multiple Data Centers

User in US makes requests that hit US-East AND EU-West:

US-East rate limiter:  count = 50   (limit 100)
EU-West rate limiter:  count = 60   (limit 100)
Actual total:          110 requests (OVER LIMIT!)

Solution 1: Centralized Counter (Strong Consistency)

All data centers write to one Redis cluster.

Pros: Accurate counts
Cons: Cross-region latency (100-200ms), single region failure = global outage

Solution 2: Local Counters with Sync (Eventual Consistency)

Each DC maintains local counters.
Periodically sync (every 1-5 seconds) via gossip protocol or CRDTs.

Split the rate limit across DCs:
  - US-East: limit = 60 (60% of traffic)
  - EU-West: limit = 40 (40% of traffic)

Adjust allocations dynamically based on traffic patterns.

Pros: Low latency, resilient
Cons: Can temporarily over-allow (acceptable tradeoff)

Solution 3: Hybrid Approach (Recommended)

+------------------+          +------------------+
| US-East DC       |          | EU-West DC       |
|                  |          |                  |
| Local Redis      |  sync    | Local Redis      |
| (primary check)  |<-------->| (primary check)  |
|                  | every 5s |                  |
| Local limit: 60  |          | Local limit: 40  |
+------------------+          +------------------+
         \                           /
          +--- Global Redis ----------+ (async reconciliation)
              (source of truth)

1. Check local Redis first (< 1ms)
2. If local limit allows, ALLOW immediately
3. Async: sync local counts to global Redis every 5 seconds
4. Global Redis redistributes allocations based on actual usage
5. Allocation rebalancing: if US-East uses only 30, give EU-West 70
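Step 5's rebalancing can be sketched as a pure function: split the global limit across DCs in proportion to recent observed usage, with a small floor so an idle DC can still serve traffic. The proportional-with-floor policy and `floor_fraction` value are assumptions, not the only reasonable choice.

```python
def rebalance(global_limit, usage, floor_fraction=0.1):
    """Redistribute a global limit across DCs by observed usage."""
    total = sum(usage.values())
    floor = int(global_limit * floor_fraction)
    if total == 0:
        # No recent traffic anywhere: split evenly
        share = global_limit // len(usage)
        return {dc: share for dc in usage}
    return {dc: max(floor, int(global_limit * u / total))
            for dc, u in usage.items()}

print(rebalance(100, {"us-east": 30, "eu-west": 70}))
# {'us-east': 30, 'eu-west': 70}
```

Note the floors mean allocations can briefly sum to slightly more than the global limit -- consistent with the "slight over-allowing is acceptable" stance above.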

10. Scaling Considerations

Handling Rate Limiter Failure

Policy: FAIL OPEN (allow requests when rate limiter is down)

Rationale:
- Better to serve requests without rate limiting than to block all traffic
- Use circuit breaker pattern to detect limiter failure
- Fall back to local in-memory approximate limiting

Alternative: FAIL CLOSED (deny all) -- only for security-critical endpoints
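The fail-open policy with a simple circuit breaker might look like the sketch below; the breaker shape, `failure_threshold`, and class name are assumptions (a production breaker would also have a half-open recovery state).

```python
class FailOpenLimiter:
    """Wraps a rate-limit check; allows traffic when the limiter is failing."""
    def __init__(self, check_fn, failure_threshold=3):
        self.check_fn = check_fn
        self.threshold = failure_threshold
        self.failures = 0

    def allow(self, client_id):
        if self.failures >= self.threshold:
            return True                    # breaker open: fail open, skip the call
        try:
            result = self.check_fn(client_id)
            self.failures = 0              # healthy response resets the breaker
            return result
        except Exception:
            self.failures += 1
            return True                    # limiter error: fail open
```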

Performance Optimization

1. Local cache of rules (refresh every 30 seconds)
2. Batch counter updates (aggregate locally, flush every 100ms)
3. Pipeline Redis commands (check + increment in one round trip)
4. Connection pooling to Redis
5. Use Redis MULTI/EXEC or Lua for atomicity
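Optimization 2 (batched counter updates) can be sketched as below; the `incrby` callable stands in for `redis.incrby`, and the flush cadence (e.g. every 100ms on a timer) is omitted.

```python
from collections import Counter

class BatchedCounters:
    """Aggregate increments locally; flush one INCRBY per key."""
    def __init__(self):
        self.pending = Counter()

    def record(self, key, n=1):
        self.pending[key] += n

    def flush(self, incrby):
        for key, n in self.pending.items():
            incrby(key, n)          # one INCRBY per key instead of n INCRs
        self.pending.clear()

batch = BatchedCounters()
for _ in range(100):
    batch.record("rate:user_12345")
calls = []
batch.flush(lambda k, n: calls.append((k, n)))
print(calls)  # [('rate:user_12345', 100)] -- 100 increments, 1 call
```

The tradeoff: counts lag by up to one flush interval, so the limiter can briefly under-count -- another instance of the over-allowing tradeoff accepted in the requirements.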

Monitoring

Key Metrics:
- Rate limit check latency (p50, p95, p99)
- Cache hit ratio for rules
- Number of rate-limited requests per endpoint
- Redis memory usage and command rate
- Cross-DC sync lag

11. Key Tradeoffs

Decision              Option A                 Option B                 Our Choice
--------------------------------------------------------------------------------
Algorithm             Fixed window             Sliding window counter   Sliding window counter
Storage               Local memory             Redis cluster            Redis cluster
Failure mode          Fail open                Fail closed              Fail open
Distributed approach  Centralized counter      Local + sync             Hybrid
Atomicity             Optimistic (read-check)  Lua scripts (atomic)     Lua scripts
Granularity           Single tier              Multi-tier               Multi-tier

12. Failure Scenarios and Mitigations

Scenario                         Mitigation
------------------------------------------------------------------------
Redis node failure               Replica promotion (auto-failover), fall back to local
Network partition between DCs    Each DC operates independently with local limits
Clock skew between servers       Use Redis server time (not client time)
Hot key (one user floods)        Use Redis cluster with key-based sharding
Rule misconfiguration            Require approval workflow, gradual rollout
Thundering herd after window     Token bucket smooths out bursts

Key Takeaways

  1. Sliding window counter is the best general-purpose algorithm -- accurate, memory-efficient, and avoids boundary burst problems.
  2. Redis Lua scripts ensure atomicity without distributed locks.
  3. Fail open is the correct default -- rate limiter outage should not cascade into a full service outage.
  4. Multi-tier limiting protects against both bursts and sustained abuse.
  5. In distributed setups, eventual consistency with local counters is the pragmatic choice -- slight over-allowing beats cross-region latency.