Episode 9 — System Design / 9.11 — Real World System Design Problems
9.11.b Design a Distributed Rate Limiter
Problem Statement
Design a rate limiting service that can throttle requests across a distributed system. The service must support multiple rate limiting strategies and work across multiple data centers with minimal latency overhead.
1. Requirements
Functional Requirements
- Limit requests per user/IP/API key based on configurable rules
- Support multiple limiting algorithms (fixed window, sliding window, token bucket)
- Return appropriate headers (remaining quota, reset time)
- Support multi-tier limiting (per second, per minute, per hour)
- Allow rule configuration per API endpoint
- Provide a dashboard for monitoring rate limit events
Non-Functional Requirements
- Ultra-low latency: adds < 5 ms of overhead to the request path
- High availability (rate limiter failure should not block requests)
- Consistency: slight, temporary over-allowing is acceptable; a strictly consistent global count is not required
- Scale to 1 million+ requests per second
- Work across distributed data centers
2. Capacity Estimation
Traffic
Total API requests: 1 million per second (peak)
Unique users: 10 million active
Rules per user: ~5 (different endpoints, different tiers)
Rate limit checks/sec: 1 million (one per request)
Storage
Per-user counter state: ~200 bytes (counters + timestamps)
Total state: 10M users * 200 bytes = 2 GB
Fits entirely in memory (Redis)
Network
Request to rate limiter: ~100 bytes
Response from limiter: ~50 bytes
Bandwidth: 1M * 150 bytes = 150 MB/s
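The estimates above can be sanity-checked with quick arithmetic:

```python
# Back-of-envelope check of the capacity numbers (pure arithmetic).
users = 10_000_000
state_per_user = 200                 # bytes: counters + timestamps
total_state = users * state_per_user
print(total_state / 1e9)             # 2.0 GB -> fits in Redis memory

rps = 1_000_000
bytes_per_check = 100 + 50           # request + response
bandwidth = rps * bytes_per_check
print(bandwidth / 1e6)               # 150.0 MB/s
```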
3. High-Level Architecture
              +-------------------+
              |    API Gateway    |
              +---------+---------+
                        |
              +---------v---------+
              | Rate Limit Module |  <-- embedded or sidecar
              +---------+---------+
                        |
         +--------------+--------------+
         |                             |
+--------v--------+           +--------v--------+
|  Redis Cluster  |           |  Rules Service  |
| (Counter Store) |           | (Config Store)  |
+--------+--------+           +--------+--------+
         |                             |
+--------v--------+           +--------v--------+
| Redis Sentinel  |           | Rules Database  |
|   (Failover)    |           |  (PostgreSQL)   |
+-----------------+           +-----------------+

+-------------------+
| Analytics/Logging |
|  (Kafka -> ELK)   |
+-------------------+
4. API Design
Rate Limiter Internal API
POST /api/v1/rate-limit/check
Body: {
    "client_id": "user_12345",
    "endpoint": "/api/v1/messages",
    "method": "POST",
    "ip_address": "203.0.113.42"
}

Response 200 (Allowed): {
    "allowed": true,
    "remaining": 87,
    "limit": 100,
    "reset_at": 1681200060
}

Response 200 (Denied): {
    "allowed": false,
    "remaining": 0,
    "limit": 100,
    "reset_at": 1681200060,
    "retry_after": 42
}
Response Headers Passed to Client
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1681200060
Retry-After: 42
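Client-side handling of these headers might look like the sketch below. The `do_request` callable and the stubbed responses are hypothetical stand-ins for a real HTTP call; the point is honoring Retry-After on a 429.

```python
import time

def call_with_retry(do_request, max_retries=3):
    """Call do_request() -> (status, headers, body) and honor the
    Retry-After header on 429 responses (client-side sketch)."""
    for attempt in range(max_retries + 1):
        status, headers, body = do_request()
        if status != 429:
            return status, body
        if attempt < max_retries:
            # Sleep for the server-advertised backoff before retrying
            time.sleep(int(headers.get("Retry-After", 1)))
    return status, body

# Stubbed transport: first attempt is rate-limited, second succeeds.
responses = iter([(429, {"Retry-After": "0"}, None), (200, {}, "ok")])
print(call_with_retry(lambda: next(responses)))  # (200, 'ok')
```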
Rule Configuration API
POST /api/v1/rate-limit/rules
Body: {
    "rule_id": "messages_per_min",
    "endpoint_pattern": "/api/v1/messages",
    "method": "POST",
    "limit": 100,
    "window_seconds": 60,
    "algorithm": "sliding_window",
    "scope": "per_user",
    "priority": 1
}
GET /api/v1/rate-limit/rules
Response: List of all active rules
DELETE /api/v1/rate-limit/rules/{rule_id}
Response 204: No Content
5. Rate Limiting Algorithms
Algorithm 1: Fixed Window Counter
Window: |-------- 60 seconds --------|-------- 60 seconds --------|
        | count: 0 -> 1 -> ... 100   | count: 0 -> 1 -> ...       |
        | ALLOW  ALLOW       DENY    | ALLOW  ALLOW               |
Key: rate_limit:{client_id}:{endpoint}:{window_start}
Value: counter (integer)
TTL: window_size + buffer
def fixed_window_check(client_id, endpoint, limit, window_sec):
    window_start = current_time() // window_sec * window_sec
    key = f"rate:{client_id}:{endpoint}:{window_start}"
    count = redis.incr(key)
    if count == 1:
        redis.expire(key, window_sec + 1)  # set TTL on first increment only
    if count > limit:
        return RateLimitResult(allowed=False, remaining=0)
    return RateLimitResult(allowed=True, remaining=limit - count)
Problem: Burst at window boundary.
Window 1 (12:00:00 - 12:00:59): 100 requests at 12:00:58-59
Window 2 (12:01:00 - 12:01:59): 100 requests at 12:01:00-01
Result: 200 requests in 2 seconds! (limit was 100/min)
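A minimal in-memory simulation (fake clock, plain dict instead of Redis) makes the boundary burst concrete:

```python
# Demo of the boundary burst: 100 requests at t=58s, 100 more at t=60s.
# All 200 pass a 100-per-minute fixed window, 2 seconds apart.
counters = {}

def fixed_window_allow(now, limit=100, window=60):
    window_start = now // window * window
    counters[window_start] = counters.get(window_start, 0) + 1
    return counters[window_start] <= limit

allowed = sum(fixed_window_allow(58) for _ in range(100))   # window [0, 60)
allowed += sum(fixed_window_allow(60) for _ in range(100))  # window [60, 120)
print(allowed)  # 200 -- twice the per-minute limit
```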
Algorithm 2: Sliding Window Log
Store timestamp of every request in a sorted set:
ZADD rate:{client_id} {timestamp} {request_id}
ZREMRANGEBYSCORE rate:{client_id} 0 {current_time - window}
ZCARD rate:{client_id}
from uuid import uuid4

def sliding_window_log(client_id, limit, window_sec):
    now = current_time_ms()
    key = f"rate:{client_id}"
    window_start = now - (window_sec * 1000)
    member = f"{now}:{uuid4()}"  # unique member so concurrent requests don't collide
    pipe = redis.pipeline()
    pipe.zremrangebyscore(key, 0, window_start)  # remove entries outside the window
    pipe.zadd(key, {member: now})                # record this request
    pipe.zcard(key)                              # count requests in the window
    pipe.expire(key, window_sec + 1)
    _, _, count, _ = pipe.execute()
    if count > limit:
        return RateLimitResult(allowed=False, remaining=0)
    return RateLimitResult(allowed=True, remaining=limit - count)
Pros: Perfectly accurate sliding window. Cons: High memory usage (stores every request timestamp).
Algorithm 3: Sliding Window Counter (Recommended)
Combines fixed windows with weighted overlap:
Previous window count: 84 (12:00:00 - 12:00:59)
Current window count: 15 (12:01:00 - 12:01:59)
Current position: 12:01:15 (25% into current window)
Overlap from previous: 75%
Weighted count = 84 * 0.75 + 15 = 63 + 15 = 78
Limit: 100
Result: ALLOW (78 < 100)
def sliding_window_counter(client_id, limit, window_sec):
    now = current_time()
    current_window = now // window_sec * window_sec
    previous_window = current_window - window_sec
    elapsed = now - current_window
    weight = 1 - (elapsed / window_sec)  # fraction of the previous window still in range
    prev_key = f"rate:{client_id}:{previous_window}"
    curr_key = f"rate:{client_id}:{current_window}"
    prev_count = int(redis.get(prev_key) or 0)
    curr_count = redis.incr(curr_key)
    if curr_count == 1:
        # TTL of two windows: the key must survive as the next window's "previous"
        redis.expire(curr_key, window_sec * 2)
    weighted = prev_count * weight + curr_count
    if weighted > limit:
        return RateLimitResult(allowed=False, remaining=0)
    return RateLimitResult(allowed=True, remaining=int(limit - weighted))
Pros: Low memory, smooth rate limiting, no boundary burst. Cons: An approximation -- it assumes requests in the previous window were evenly distributed -- but the observed error rate is tiny (~0.003% of requests).
Algorithm 4: Token Bucket
Bucket:
+----------------------------------+
|  tokens: 10      |  max: 10      |
|  refill_rate: 1 token/second     |
+----------------------------------+

Request arrives:
- If tokens > 0: decrement, ALLOW
- If tokens == 0: DENY
Tokens refill continuously at the configured rate.
def token_bucket(client_id, max_tokens, refill_rate):
    # Note: this read-modify-write sequence is not atomic; production code
    # should use a Lua script (see Section 7). Assumes a Redis client
    # created with decode_responses=True.
    key = f"bucket:{client_id}"
    now = current_time()
    bucket = redis.hgetall(key)
    if not bucket:
        # First request: start with a full bucket and consume one token
        redis.hset(key, mapping={"tokens": max_tokens - 1, "last_refill": now})
        redis.expire(key, 3600)
        return RateLimitResult(allowed=True, remaining=max_tokens - 1)
    last_refill = float(bucket["last_refill"])
    tokens = float(bucket["tokens"])
    # Refill continuously based on elapsed time, capped at max_tokens
    elapsed = now - last_refill
    new_tokens = min(max_tokens, tokens + elapsed * refill_rate)
    if new_tokens >= 1:
        new_tokens -= 1
        redis.hset(key, mapping={"tokens": new_tokens, "last_refill": now})
        return RateLimitResult(allowed=True, remaining=int(new_tokens))
    return RateLimitResult(allowed=False, remaining=0)
Pros: Allows controlled bursts, smooth rate limiting. Cons: More complex state management.
6. Database Schema
Rules Table (PostgreSQL)
CREATE TABLE rate_limit_rules (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name VARCHAR(255) NOT NULL,
    endpoint_pattern VARCHAR(500) NOT NULL,
    http_method VARCHAR(10),
    scope VARCHAR(20) NOT NULL,      -- 'per_user', 'per_ip', 'global'
    algorithm VARCHAR(30) NOT NULL,  -- 'sliding_window', 'token_bucket'
    limit_count INTEGER NOT NULL,
    window_seconds INTEGER NOT NULL,
    priority INTEGER DEFAULT 0,
    is_active BOOLEAN DEFAULT TRUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Rate Limit Events (for analytics, stored in Kafka -> ClickHouse)
CREATE TABLE rate_limit_events (
    event_id UUID,
    client_id VARCHAR(255),
    endpoint VARCHAR(500),
    rule_id UUID,
    allowed BOOLEAN,
    current_count INTEGER,
    limit_count INTEGER,
    timestamp DateTime
) ENGINE = MergeTree()
ORDER BY (timestamp, client_id);
7. Deep Dive: Redis-Based Implementation
Atomic Operations with Lua Scripts
Race conditions occur when multiple servers check and increment concurrently. Solution: use Redis Lua scripts for atomic check-and-increment.
-- sliding_window_counter.lua
local current_key = KEYS[1]
local previous_key = KEYS[2]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local weight = tonumber(ARGV[3])

local prev_count = tonumber(redis.call('GET', previous_key) or "0")
local curr_count = tonumber(redis.call('INCR', current_key))
if curr_count == 1 then
    redis.call('EXPIRE', current_key, window * 2)
end

local weighted_count = prev_count * weight + curr_count
if weighted_count > limit then
    redis.call('DECR', current_key)  -- roll back the increment
    return {0, 0, limit}             -- denied
end
local remaining = limit - math.ceil(weighted_count)
return {1, remaining, limit}         -- allowed
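Invoking the script from Python might look like the sketch below. It assumes redis-py, where `register_script` returns a callable that sends EVALSHA (falling back to EVAL); the `check` wrapper is illustrative. The key/weight helper is pure and runs without Redis:

```python
import time

def window_keys_and_weight(client_id, window_sec, now=None):
    """Compute the current/previous counter keys and the previous-window
    weight that the Lua script receives (pure helper; no Redis needed)."""
    now = time.time() if now is None else now
    current = int(now // window_sec) * window_sec
    previous = current - window_sec
    weight = 1 - (now - current) / window_sec
    return f"rate:{client_id}:{current}", f"rate:{client_id}:{previous}", weight

def check(script, client_id, limit, window_sec):
    # `script` is the result of redis_client.register_script(lua_source)
    curr_key, prev_key, weight = window_keys_and_weight(client_id, window_sec)
    allowed, remaining, _ = script(keys=[curr_key, prev_key],
                                   args=[limit, window_sec, weight])
    return bool(allowed), remaining

# The pure helper at 15 s into a 60 s window (12:01:15 in the example above):
print(window_keys_and_weight("user_12345", 60, now=75))
# keys for the windows starting at t=60 and t=0, weight 0.75
```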
Redis Cluster Setup
+------------------+    +------------------+    +------------------+
|  Redis Master 1  |    |  Redis Master 2  |    |  Redis Master 3  |
|   Slots 0-5460   |    | Slots 5461-10922 |    | Slots 10923-16383|
+--------+---------+    +--------+---------+    +--------+---------+
         |                       |                       |
+--------v---------+    +--------v---------+    +--------v---------+
| Redis Replica 1a |    | Redis Replica 2a |    | Redis Replica 3a |
+------------------+    +------------------+    +------------------+
8. Deep Dive: Multi-Tier Rate Limiting
Tier 1 (Per Second): Max 10 requests/second (burst protection)
Tier 2 (Per Minute): Max 200 requests/minute (sustained load)
Tier 3 (Per Hour): Max 5000 requests/hour (quota enforcement)
Tier 4 (Per Day): Max 50000 requests/day (billing tier)
Evaluation order: Tier 1 -> Tier 2 -> Tier 3 -> Tier 4
ALL tiers must pass for request to be allowed.
def multi_tier_check(client_id, endpoint):
    rules = get_rules(endpoint)  # sorted by window size, ascending
    for rule in rules:
        result = sliding_window_counter(
            client_id=client_id,
            limit=rule.limit_count,
            window_sec=rule.window_seconds,
        )
        if not result.allowed:
            # Time remaining until the violated window rolls over
            elapsed = current_time() % rule.window_seconds
            return RateLimitResult(
                allowed=False,
                rule_violated=rule.name,
                retry_after=rule.window_seconds - elapsed,
            )
    return RateLimitResult(allowed=True)
9. Deep Dive: Distributed Rate Limiting
Problem: Multiple Data Centers
User in US makes requests that hit US-East AND EU-West:
US-East rate limiter: count = 50 (limit 100)
EU-West rate limiter: count = 60 (limit 100)
Actual total: 110 requests (OVER LIMIT!)
Solution 1: Centralized Counter (Strong Consistency)
All data centers write to one Redis cluster.
Pros: Accurate counts
Cons: Cross-region latency (100-200ms), single region failure = global outage
Solution 2: Local Counters with Sync (Eventual Consistency)
Each DC maintains local counters.
Periodically sync (every 1-5 seconds) via gossip protocol or CRDTs.
Split the rate limit across DCs:
- US-East: limit = 60 (60% of traffic)
- EU-West: limit = 40 (40% of traffic)
Adjust allocations dynamically based on traffic patterns.
Pros: Low latency, resilient
Cons: Can temporarily over-allow (acceptable tradeoff)
Solution 3: Hybrid Approach (Recommended)
+------------------+                +------------------+
|    US-East DC    |                |    EU-West DC    |
|                  |      sync      |                  |
|   Local Redis    | <------------> |   Local Redis    |
|  (primary check) |    every 5s    |  (primary check) |
|                  |                |                  |
|  Local limit: 60 |                |  Local limit: 40 |
+--------+---------+                +---------+--------+
         \                                   /
          +--------- Global Redis ----------+   (async reconciliation)
                   (source of truth)
1. Check local Redis first (< 1ms)
2. If local limit allows, ALLOW immediately
3. Async: sync local counts to global Redis every 5 seconds
4. Global Redis redistributes allocations based on actual usage
5. Allocation rebalancing: if US-East uses only 30, give EU-West 70
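Step 5's rebalancing can be sketched as pure arithmetic. The `rebalance` function and its `floor` parameter are illustrative names: each DC gets a share of the global limit proportional to its recent usage, with a minimum floor so no DC starves.

```python
def rebalance(global_limit, usage_by_dc, floor=0.1):
    """Split global_limit across DCs in proportion to observed usage,
    guaranteeing each DC at least `floor` of the global limit."""
    total = sum(usage_by_dc.values()) or 1
    min_share = global_limit * floor
    raw = {dc: max(min_share, global_limit * u / total)
           for dc, u in usage_by_dc.items()}
    # Rescale so the allocations never sum to more than the global limit
    scale = global_limit / sum(raw.values())
    return {dc: int(share * scale) for dc, share in raw.items()}

# If US-East only uses 30 of the global 100, EU-West's share grows to 70:
print(rebalance(100, {"us-east": 30, "eu-west": 70}))
```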
10. Scaling Considerations
Handling Rate Limiter Failure
Policy: FAIL OPEN (allow requests when rate limiter is down)
Rationale:
- Better to serve requests without rate limiting than to block all traffic
- Use circuit breaker pattern to detect limiter failure
- Fall back to local in-memory approximate limiting
Alternative: FAIL CLOSED (deny all) -- only for security-critical endpoints
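A minimal sketch of fail-open combined with a circuit breaker; class and parameter names are illustrative:

```python
import time

class FailOpenLimiter:
    """Fail-open wrapper around a rate-limit check: after `threshold`
    consecutive limiter errors, skip the check entirely for `cooldown`
    seconds (a minimal circuit-breaker sketch)."""

    def __init__(self, check_fn, threshold=3, cooldown=30):
        self.check_fn = check_fn      # e.g. a Redis-backed check
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.open_until = 0.0

    def allow(self, client_id):
        if time.time() < self.open_until:
            return True               # circuit open: fail open, skip the check
        try:
            allowed = self.check_fn(client_id)
            self.failures = 0         # limiter healthy again: reset breaker
            return allowed
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.open_until = time.time() + self.cooldown
            return True               # limiter down: fail open
```

A security-critical endpoint would instead return False in both failure paths (fail closed).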
Performance Optimization
1. Local cache of rules (refresh every 30 seconds)
2. Batch counter updates (aggregate locally, flush every 100ms)
3. Pipeline Redis commands (check + increment in one round trip)
4. Connection pooling to Redis
5. Use Redis MULTI/EXEC or Lua for atomicity
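Optimization 2 (batched counter updates) could be sketched as follows; `BatchedCounter` and `flush_fn` are illustrative names, and in production `flush_fn` would issue one Redis pipeline of INCRBY commands:

```python
import threading
from collections import Counter

class BatchedCounter:
    """Aggregate increments locally and flush them in one call,
    trading a little accuracy for far fewer Redis round trips."""

    def __init__(self, flush_fn):
        self.flush_fn = flush_fn
        self.pending = Counter()
        self.lock = threading.Lock()

    def incr(self, key, n=1):
        with self.lock:
            self.pending[key] += n

    def flush(self):
        # Call this every ~100 ms from a background timer
        with self.lock:
            batch, self.pending = dict(self.pending), Counter()
        if batch:
            self.flush_fn(batch)

flushed = []
bc = BatchedCounter(flushed.append)
for _ in range(5):
    bc.incr("rate:user_12345:/api/v1/messages")
bc.flush()
print(flushed)  # [{'rate:user_12345:/api/v1/messages': 5}]
```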
Monitoring
Key Metrics:
- Rate limit check latency (p50, p95, p99)
- Cache hit ratio for rules
- Number of rate-limited requests per endpoint
- Redis memory usage and command rate
- Cross-DC sync lag
11. Key Tradeoffs
| Decision | Option A | Option B | Our Choice |
|---|---|---|---|
| Algorithm | Fixed window | Sliding window counter | Sliding window counter |
| Storage | Local memory | Redis cluster | Redis |
| Failure mode | Fail open | Fail closed | Fail open |
| Distributed approach | Centralized counter | Local + sync | Hybrid |
| Atomicity | Optimistic (read-check) | Lua scripts (atomic) | Lua scripts |
| Granularity | Single tier | Multi-tier | Multi-tier |
12. Failure Scenarios and Mitigations
Scenario                        Mitigation
------------------------------  ------------------------------------------------
Redis node failure              Sentinel auto-failover; fall back to local limits
Network partition between DCs   Each DC operates independently with local limits
Clock skew between servers      Use Redis server time (not client time)
Hot key (one user floods)       Redis cluster with key-based sharding
Rule misconfiguration           Approval workflow, gradual rollout
Thundering herd after window    Token bucket smooths out bursts
Key Takeaways
- Sliding window counter is the best general-purpose algorithm -- accurate, memory-efficient, and avoids boundary burst problems.
- Redis Lua scripts ensure atomicity without distributed locks.
- Fail open is the correct default -- rate limiter outage should not cascade into a full service outage.
- Multi-tier limiting protects against both bursts and sustained abuse.
- In distributed setups, eventual consistency with local counters is the pragmatic choice -- slight over-allowing beats cross-region latency.