Episode 9 — System Design / 9.10 — Advanced Distributed Systems
9.10.d — Rate Limiting
Introduction
Rate limiting controls how many requests a client can make in a given time window. It protects your system from being overwhelmed -- whether by a buggy client, a traffic spike, or a deliberate DDoS attack. Every production API implements rate limiting, and interviewers expect you to know the algorithms and trade-offs.
1. Why Rate Limiting?
| Purpose | Example |
|---|---|
| Prevent abuse | Stop a single user from scraping your entire database |
| Protect resources | Prevent overload on expensive downstream services (DB, ML model) |
| Fair usage | Ensure no single tenant monopolizes a shared service |
| Cost control | Limit calls to expensive third-party APIs (Stripe, Twilio) |
| DDoS mitigation | Drop malicious traffic before it reaches application servers |
| Compliance | Enforce contractual API quotas (free tier: 100 req/min) |
2. Rate Limiting Algorithms
2a. Token Bucket
The most widely used algorithm (used by AWS, Stripe, and most API gateways).
+------------------------------------------------------------------------+
| TOKEN BUCKET |
| |
| Bucket capacity: 10 tokens |
| Refill rate: 2 tokens/second |
| |
| +-------+ |
| | Token | Refill: 2 tokens/sec |
| | Source | ---------+ |
| +-------+ | |
| v |
| +---------------------+ |
| | [T][T][T][T][T][ ][ ][ ][ ][ ] | Bucket (max 10) |
| +---------------------+ |
| | |
| Request arrives: |
| Token available? |
| / \ |
| YES NO |
| | | |
| Remove 1 REJECT |
| token (429 Too |
| ALLOW Many Requests) |
+------------------------------------------------------------------------+
Properties:
- Allows bursts up to bucket capacity (e.g., 10 requests at once)
- Smooths out traffic over time (refill rate = steady-state throughput)
- Simple to implement, low memory (just store: token count + last refill time)
Pseudocode (runnable Python):
import time

class TokenBucket:
    def __init__(self, capacity=10, refill_rate=2):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate      # tokens per second
        self.last_refill = time.monotonic()

    def allow_request(self):
        # Refill tokens based on elapsed time
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True     # ALLOW
        return False        # REJECT
2b. Leaky Bucket
Processes requests at a fixed rate, like water leaking from a bucket.
+------------------------------------------------------------------------+
| LEAKY BUCKET |
| |
| Requests arrive (variable rate): |
| |
| Req Req Req Req |
| | | | | |
| v v v v |
| +------------------+ |
| | QUEUE (bucket) | Capacity: 10 |
| | [R][R][R][R][ ] | If full -> REJECT |
| +--------+---------+ |
| | |
| Fixed leak rate |
| (2 requests/sec) |
| | |
| v |
| Process request |
+------------------------------------------------------------------------+
Difference from Token Bucket:
| | Token Bucket | Leaky Bucket |
|---|---|---|
| Burst handling | Allows bursts (up to bucket capacity) | Smooths all traffic to fixed rate |
| Output rate | Variable (bursty) | Constant (steady drip) |
| Best for | APIs that tolerate bursts | Services needing constant throughput |
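The queue-and-drain behavior above can be sketched in a few lines. This is an illustrative implementation (class and method names are mine, not from any library); time is passed in explicitly so the behavior is deterministic:

```python
from collections import deque

class LeakyBucket:
    """Leaky bucket: queue incoming requests, drain them at a fixed rate."""

    def __init__(self, capacity, leak_rate):
        self.capacity = capacity        # max queued requests
        self.leak_rate = leak_rate      # requests processed per second
        self.queue = deque()
        self.last_leak = 0.0

    def _leak(self, now):
        # Drain the requests the fixed-rate processor handled since last check.
        drained = int((now - self.last_leak) * self.leak_rate)
        for _ in range(min(drained, len(self.queue))):
            self.queue.popleft()
        if drained:
            self.last_leak = now

    def allow(self, now):
        self._leak(now)
        if len(self.queue) < self.capacity:
            self.queue.append(now)      # accepted: queued for processing
            return True
        return False                    # bucket full -> reject
```

With capacity 10 and a leak rate of 2/sec, a burst of 12 simultaneous requests accepts the first 10 and rejects the last 2; five seconds later the queue has fully drained and requests are accepted again.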
2c. Fixed Window Counter
Count requests in fixed time windows.
+------------------------------------------------------------------------+
| FIXED WINDOW COUNTER |
| |
| Window size: 1 minute Limit: 100 requests/window |
| |
| |<--- Window 1 --->|<--- Window 2 --->|<--- Window 3 --->| |
| | 12:00 - 12:01 | 12:01 - 12:02 | 12:02 - 12:03 | |
| | | | | |
| | count: 87 | count: 100 | count: 45 | |
| | (ALLOW) | (REJECT next) | (ALLOW) | |
| |
| PROBLEM: Boundary spike |
| |..........|..........| |
| | 70 req |100 req | |
| | (last 30s)(first 30s) |
| = 170 requests in 60 seconds! (exceeds 100/min intent) |
+------------------------------------------------------------------------+
Problem: A burst at the boundary of two windows can allow 2x the limit in a 1-minute span. This is the main weakness of fixed windows.
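A minimal fixed-window sketch (names are illustrative; `now` is injected for determinism) makes the boundary problem easy to reproduce:

```python
class FixedWindowCounter:
    """Fixed window: count requests per discrete window, reset at the boundary.
    Windows are aligned to multiples of window_size."""

    def __init__(self, limit, window_size):
        self.limit = limit
        self.window_size = window_size
        self.window = None
        self.count = 0

    def allow(self, now):
        window = int(now // self.window_size)
        if window != self.window:
            self.window = window    # new window: reset the counter
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

With a limit of 100/minute, 70 requests at t=59s and 100 more at t=61s are all allowed: 170 requests in a 2-second span, exactly the boundary spike from the diagram.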
2d. Sliding Window Log
Track the timestamp of every request. Count requests within the last N seconds.
+------------------------------------------------------------------------+
| SLIDING WINDOW LOG |
| |
| Limit: 5 requests per 60 seconds |
| |
| Request log: [12:00:15, 12:00:25, 12:00:40, 12:00:55, 12:01:05] |
| |
| New request at 12:01:10: |
| Remove entries older than 12:00:10 (60s ago) |
| Remaining: [12:00:15, 12:00:25, 12:00:40, 12:00:55, 12:01:05] |
| Count = 5 --> REJECT (already at limit) |
| |
| New request at 12:01:20: |
| Remove entries older than 12:00:20 |
| Remaining: [12:00:25, 12:00:40, 12:00:55, 12:01:05] |
| Count = 4 --> ALLOW (add 12:01:20 to log) |
+------------------------------------------------------------------------+
Trade-off: Perfectly accurate, but high memory usage (stores every timestamp).
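A sketch of the log-based approach (illustrative names; a deque gives cheap eviction from the old end). Replaying the diagram's timestamps in seconds reproduces both decisions:

```python
from collections import deque

class SlidingWindowLog:
    """Sliding window log: one timestamp per request; evict entries that
    have slid out of the window. Exact, but memory grows with request rate."""

    def __init__(self, limit, window_size):
        self.limit = limit
        self.window_size = window_size
        self.log = deque()

    def allow(self, now):
        # Evict timestamps older than (now - window).
        while self.log and self.log[0] <= now - self.window_size:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

With 5 requests logged at t = 15, 25, 40, 55, 65 (seconds past 12:00), a request at t=70 is rejected (all five are still inside the window) and a request at t=80 is allowed (t=15 has been evicted).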
2e. Sliding Window Counter (Hybrid)
Combines fixed window efficiency with sliding window accuracy.
+------------------------------------------------------------------------+
| SLIDING WINDOW COUNTER |
| |
| Limit: 100 requests/minute |
| |
| Previous window (12:00-12:01): 84 requests |
| Current window (12:01-12:02): 36 requests |
| Current time: 12:01:15 (25% into current window) |
| |
| Weighted count = (previous * overlap%) + current |
| = (84 * 0.75) + 36 |
| = 63 + 36 |
| = 99 |
| |
| 99 < 100 --> ALLOW |
| |
| Memory: Only 2 counters per window (very efficient) |
| Accuracy: ~99.9% (slight approximation at boundaries) |
+------------------------------------------------------------------------+
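The weighted-count step from the diagram reduces to a one-line helper (function name is mine, for illustration):

```python
def sliding_window_count(prev_count, curr_count, elapsed_fraction):
    """Approximate request count over the sliding window.
    elapsed_fraction: how far into the current window we are (0.0-1.0);
    the previous window contributes its still-overlapping portion."""
    return prev_count * (1.0 - elapsed_fraction) + curr_count

# Numbers from the diagram: 84 in the previous window, 36 so far,
# 25% into the current window -> 84 * 0.75 + 36 = 99
```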
Algorithm Comparison
+----------------+-----------+----------+---------+------------------+
| Algorithm | Memory | Accuracy | Burst | Implementation |
| | | | Allowed?| Complexity |
+----------------+-----------+----------+---------+------------------+
| Token Bucket | O(1) | Good | Yes | Simple |
| Leaky Bucket | O(1) | Good | No | Simple |
| Fixed Window | O(1) | Low | At edges| Simplest |
| Sliding Log | O(n) | Perfect | No | Moderate |
| Sliding Counter| O(1) | High | Partial | Moderate |
+----------------+-----------+----------+---------+------------------+
3. Distributed Rate Limiting
In a multi-server environment, rate limits must be coordinated.
WITHOUT coordination: WITH coordination (centralized):
[Server 1] limit: 100/min [Server 1] -+
[Server 2] limit: 100/min [Server 2] -+--> [Redis] limit: 100/min
[Server 3] limit: 100/min [Server 3] -+ (shared counter)
User hits all 3 servers User hits any server
= 300 requests allowed! = 100 requests total
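The over-admission above is easy to demonstrate with a toy fixed-window counter whose store is either private per server or shared (standing in for Redis). This is a single-threaded simulation; a real shared store also needs atomic increments to avoid races:

```python
class Counter:
    """Counter with a pluggable store, to contrast per-server limits
    with a shared (Redis-like) counter."""

    def __init__(self, store, limit):
        self.store = store   # dict: shared (centralized) or private (local)
        self.limit = limit

    def allow(self, key):
        n = self.store.get(key, 0)
        if n < self.limit:
            self.store[key] = n + 1
            return True
        return False

# WITHOUT coordination: each server keeps its own store.
local = [Counter({}, 100) for _ in range(3)]
allowed_local = sum(c.allow("user:42") for c in local for _ in range(100))

# WITH coordination: all three servers share one store.
shared = {}
servers = [Counter(shared, 100) for _ in range(3)]
allowed_shared = sum(c.allow("user:42") for c in servers for _ in range(100))

print(allowed_local, allowed_shared)   # 300 100
```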
Approaches
| Approach | Mechanism | Pros | Cons |
|---|---|---|---|
| Centralized (Redis) | Single Redis instance stores all counters | Accurate, consistent | Redis is SPOF, network latency |
| Local + Sync | Local counters synced periodically | Fast local decisions | Temporarily over-limit between syncs |
| Sticky sessions | Route each user to same server | Simple, no coordination | Uneven load, failover breaks it |
| Token bucket in Redis | Lua script for atomic token bucket | Atomic, accurate | Adds Redis dependency |
Redis Lua script for atomic token bucket:
-- KEYS[1] = rate limit key
-- ARGV[1] = capacity, ARGV[2] = refill_rate, ARGV[3] = now
-- ARGV values arrive as strings and GET returns a string (or false for a
-- missing key), so convert everything to numbers first.
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local tokens = tonumber(redis.call("GET", KEYS[1] .. ":tokens"))
local last = tonumber(redis.call("GET", KEYS[1] .. ":timestamp"))
-- Refill (a missing key defaults to a full bucket)
local elapsed = now - (last or now)
tokens = math.min(capacity, (tokens or capacity) + elapsed * refill_rate)
if tokens >= 1 then
  tokens = tokens - 1
  redis.call("SET", KEYS[1] .. ":tokens", tokens)
  redis.call("SET", KEYS[1] .. ":timestamp", now)
  return 1 -- ALLOW
else
  return 0 -- REJECT
end
4. Rate Limiting at Different Layers
+------------------------------------------------------------------------+
| LAYER WHAT TO LIMIT TOOL |
+------------------------------------------------------------------------+
| |
| CDN / Edge IP-based, geographic Cloudflare, AWS Shield |
| DDoS protection |
| |
| API Gateway Per-user, per-API-key Kong, AWS API GW, Envoy |
| Quota management |
| |
| Application Per-endpoint, per-user Custom middleware |
| Business logic rules |
| |
| Database Connection limits Connection pooling (PgPool) |
| Query rate limits |
| |
| Downstream APIs Respect third-party Client-side rate limiter |
| rate limits |
+------------------------------------------------------------------------+
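For the last row of the table, a client-side limiter is the same token bucket seen from the caller's side: instead of rejecting, it tells the caller how long to wait before the next outbound call. A sketch (names are mine; a real client should also honor Retry-After from 429 responses):

```python
import time

class ClientThrottle:
    """Client-side token bucket for calling a rate-limited third-party API."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # sustained calls per second
        self.capacity = capacity    # burst allowance
        self.tokens = capacity
        self.clock = clock          # injectable for testing
        self.last = clock()

    def acquire(self):
        """Return seconds to wait before the call may proceed (0.0 = go now)."""
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return 0.0
        return (1 - self.tokens) / self.rate   # time until one token refills
```

The caller sleeps for whatever `acquire()` returns, which keeps the outbound rate under the third party's limit instead of discovering it via 429s.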
5. DDoS Protection
+------------------------------------------------------------------------+
| DDoS MITIGATION LAYERS |
| |
| Layer 3/4 (Network/Transport) |
| - SYN flood, UDP flood, amplification attacks |
| - Mitigation: Cloud provider (AWS Shield, Cloudflare) |
| - Rate limit by IP, blackhole routing |
| |
| Layer 7 (Application) |
| - HTTP flood, slowloris, API abuse |
| - Mitigation: WAF (Web Application Firewall) |
| - Rate limiting, CAPTCHA, bot detection |
| |
| +---[Internet]---[CDN/WAF]---[LB]---[App]---[DB]---+ |
| | ^ | |
| | | | |
| | Most attacks filtered | |
| | here before reaching | |
| | your infrastructure | |
| +----------------------------------------------------+ |
+------------------------------------------------------------------------+
6. API Quota Management
+------------------------------------------------------------------------+
| API TIER SYSTEM (Common Pattern) |
| |
| +------------------+------------------+------------------+ |
| | FREE Tier | PRO Tier | ENTERPRISE Tier | |
| | | | | |
| | 100 req/min | 1,000 req/min | 10,000 req/min | |
| | 10,000 req/day | 100,000 req/day | Unlimited/day | |
| | No SLA | 99.9% SLA | 99.99% SLA | |
| | Shared pool | Dedicated pool | Dedicated infra | |
| | Best-effort | Priority queue | Guaranteed | |
| +------------------+------------------+------------------+ |
| |
| Response headers (industry standard): |
| X-RateLimit-Limit: 100 (total allowed) |
| X-RateLimit-Remaining: 73 (remaining in window) |
| X-RateLimit-Reset: 1620000060 (Unix timestamp of reset) |
| Retry-After: 30 (seconds until retry, on 429) |
+------------------------------------------------------------------------+
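Emitting the headers from the box above is a small middleware concern. A sketch (helper name is mine; note some newer APIs use the draft RateLimit-* header names instead of X-RateLimit-*):

```python
import time

def rate_limit_headers(limit, remaining, reset_ts, rejected=False):
    """Build standard rate-limit response headers.
    reset_ts is the Unix timestamp when the window resets."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_ts),
    }
    if rejected:
        # On a 429, tell well-behaved clients exactly when to come back.
        headers["Retry-After"] = str(max(0, reset_ts - int(time.time())))
    return headers
```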
7. Rate Limiting in System Design Interviews
When to mention rate limiting:
| Scenario | Why |
|---|---|
| Designing a public API | Prevent abuse, enforce quotas |
| Designing a social media feed | Limit posting frequency (anti-spam) |
| Designing a messaging system | Throttle message send rate |
| Designing a payment system | Throttle retries and block card-testing abuse |
| Any system with external clients | Always add rate limiting at the gateway |
Interview template:
1. WHERE: "Rate limiting at the API gateway level"
2. WHAT: "Per-user limits based on API tier (free: 100/min, pro: 1000/min)"
3. HOW: "Token bucket algorithm with Redis as the centralized counter"
4. RESPONSE: "429 status code with Retry-After header"
5. FALLBACK: "If Redis is down, fall back to local rate limiting (slightly over-limit is acceptable)"
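The fallback in step 5 is a simple wrapper pattern. A sketch with illustrative names (the two limiters stand in for a Redis-backed limiter and a per-server in-memory one):

```python
class ResilientLimiter:
    """Try the shared (Redis-backed) limiter first; if the store is
    unreachable, fall back to a per-server local limiter."""

    def __init__(self, redis_limiter, local_limiter):
        self.redis_limiter = redis_limiter
        self.local_limiter = local_limiter

    def allow(self, key):
        try:
            return self.redis_limiter.allow(key)
        except ConnectionError:
            # Each server now enforces the full limit independently, so the
            # fleet may briefly over-admit -- the acceptable trade-off above.
            return self.local_limiter.allow(key)
```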
Key Takeaways
- Token bucket is the go-to algorithm. It handles bursts gracefully and is used by most production systems.
- Fixed window has a boundary problem. Sliding window counter is the better alternative with minimal extra cost.
- Distributed rate limiting needs a shared store. Redis with Lua scripts is the standard approach.
- Rate limit at the edge first. CDN/WAF -> API Gateway -> Application -> Database.
- Always return informative headers. X-RateLimit-Remaining and Retry-After are essential for good API design.
- DDoS protection is a separate layer. Network-level attacks are handled by cloud providers and WAFs, not application-level rate limiters.
- In interviews, mention rate limiting proactively for any public-facing API. It shows production awareness.
Explain-It Challenge
Design question: You are building a rate limiter for a global API with 10 million users and 50 servers across 3 regions.
- Which algorithm do you choose and why?
- How do you handle the distributed coordination problem?
- What happens if your centralized rate-limit store (Redis) goes down?
- How do you handle different rate limits for different user tiers?
Next -> 9.10.e — Auth and Authorization