Episode 9 — System Design / 9.10 — Advanced Distributed Systems
9.10.d — Rate Limiting
Introduction
Rate limiting controls how many requests a client can make in a given time window. It protects your system from being overwhelmed -- whether by a buggy client, a traffic spike, or a deliberate DDoS attack. Every production API implements rate limiting, and interviewers expect you to know the algorithms and trade-offs.
1. Why Rate Limiting?
| Purpose | Example |
|---|---|
| Prevent abuse | Stop a single user from scraping your entire database |
| Protect resources | Prevent overload on expensive downstream services (DB, ML model) |
| Fair usage | Ensure no single tenant monopolizes a shared service |
| Cost control | Limit calls to expensive third-party APIs (Stripe, Twilio) |
| DDoS mitigation | Drop malicious traffic before it reaches application servers |
| Compliance | Enforce contractual API quotas (free tier: 100 req/min) |
2. Rate Limiting Algorithms
2a. Token Bucket
The most widely used algorithm (used by AWS, Stripe, and most API gateways).
+------------------------------------------------------------------------+
| TOKEN BUCKET |
| |
| Bucket capacity: 10 tokens |
| Refill rate: 2 tokens/second |
| |
| +-------+ |
| | Token | Refill: 2 tokens/sec |
| | Source | ---------+ |
| +-------+ | |
| v |
| +---------------------+ |
| | [T][T][T][T][T][ ][ ][ ][ ][ ] | Bucket (max 10) |
| +---------------------+ |
| | |
| Request arrives: |
| Token available? |
| / \ |
| YES NO |
| | | |
| Remove 1 REJECT |
| token (429 Too |
| ALLOW Many Requests) |
+------------------------------------------------------------------------+
Properties:
- Allows bursts up to bucket capacity (e.g., 10 requests at once)
- Smooths out traffic over time (refill rate = steady-state throughput)
- Simple to implement, low memory (just store: token count + last refill time)
Pseudocode (runnable Python):
import time

class TokenBucket:
    def __init__(self, capacity=10, refill_rate=2):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate      # tokens per second
        self.last_refill = time.monotonic()

    def allow_request(self):
        # Refill tokens based on elapsed time
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True     # ALLOW
        return False        # REJECT
2b. Leaky Bucket
Processes requests at a fixed rate, like water leaking from a bucket.
+------------------------------------------------------------------------+
| LEAKY BUCKET |
| |
| Requests arrive (variable rate): |
| |
| Req Req Req Req |
| | | | | |
| v v v v |
| +------------------+ |
| | QUEUE (bucket) | Capacity: 10 |
| | [R][R][R][R][ ] | If full -> REJECT |
| +--------+---------+ |
| | |
| Fixed leak rate |
| (2 requests/sec) |
| | |
| v |
| Process request |
+------------------------------------------------------------------------+
Difference from Token Bucket:
| | Token Bucket | Leaky Bucket |
|---|---|---|
| Burst handling | Allows bursts (up to bucket capacity) | Smooths all traffic to fixed rate |
| Output rate | Variable (bursty) | Constant (steady drip) |
| Best for | APIs that tolerate bursts | Services needing constant throughput |
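The queue-and-drain behavior above can be sketched in a few lines. This is an illustrative implementation (class and method names are mine, not from any library); time is passed in explicitly so the behavior is deterministic:

```python
from collections import deque

class LeakyBucket:
    """Leaky bucket: queue incoming requests, drain them at a fixed rate."""

    def __init__(self, capacity, leak_rate):
        self.capacity = capacity        # max queued requests
        self.leak_rate = leak_rate      # requests processed per second
        self.queue = deque()
        self.last_leak = 0.0

    def _leak(self, now):
        # Drain the requests the fixed-rate processor handled since last check.
        drained = int((now - self.last_leak) * self.leak_rate)
        for _ in range(min(drained, len(self.queue))):
            self.queue.popleft()
        if drained:
            self.last_leak = now

    def allow(self, now):
        self._leak(now)
        if len(self.queue) < self.capacity:
            self.queue.append(now)      # accepted: queued for processing
            return True
        return False                    # bucket full -> reject
```

With capacity 10 and a leak rate of 2/sec, a burst of 12 simultaneous requests accepts the first 10 and rejects the last 2; five seconds later the queue has fully drained and requests are accepted again.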
2c. Fixed Window Counter
Count requests in fixed time windows.
+------------------------------------------------------------------------+
| FIXED WINDOW COUNTER |
| |
| Window size: 1 minute Limit: 100 requests/window |
| |
| |<--- Window 1 --->|<--- Window 2 --->|<--- Window 3 --->| |
| | 12:00 - 12:01 | 12:01 - 12:02 | 12:02 - 12:03 | |
| | | | | |
| | count: 87 | count: 100 | count: 45 | |
| | (ALLOW) | (REJECT next) | (ALLOW) | |
| |
| PROBLEM: Boundary spike |
| |..........|..........| |
| | 70 req |100 req | |
| | (last 30s)(first 30s) |
| = 170 requests in 60 seconds! (exceeds 100/min intent) |
+------------------------------------------------------------------------+
Problem: A burst at the boundary of two windows can allow 2x the limit in a 1-minute span. This is the main weakness of fixed windows.
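A minimal fixed-window sketch (names are illustrative; `now` is injected for determinism) makes the boundary problem easy to reproduce:

```python
class FixedWindowCounter:
    """Fixed window: count requests per discrete window, reset at the boundary.
    Windows are aligned to multiples of window_size."""

    def __init__(self, limit, window_size):
        self.limit = limit
        self.window_size = window_size
        self.window = None
        self.count = 0

    def allow(self, now):
        window = int(now // self.window_size)
        if window != self.window:
            self.window = window    # new window: reset the counter
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

With a limit of 100/minute, 70 requests at t=59s and 100 more at t=61s are all allowed: 170 requests in a 2-second span, exactly the boundary spike from the diagram.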
2d. Sliding Window Log
Track the timestamp of every request. Count requests within the last N seconds.
+------------------------------------------------------------------------+
| SLIDING WINDOW LOG |
| |
| Limit: 5 requests per 60 seconds |
| |
| Request log: [12:00:15, 12:00:25, 12:00:40, 12:00:55, 12:01:05] |
| |
| New request at 12:01:10: |
| Remove entries older than 12:00:10 (60s ago) |
| Remaining: [12:00:15, 12:00:25, 12:00:40, 12:00:55, 12:01:05] |
| Count = 5 --> REJECT (already at limit) |
| |
| New request at 12:01:20: |
| Remove entries older than 12:00:20 |
| Remaining: [12:00:25, 12:00:40, 12:00:55, 12:01:05] |
| Count = 4 --> ALLOW (add 12:01:20 to log) |
+------------------------------------------------------------------------+
Trade-off: Perfectly accurate, but high memory usage (stores every timestamp).
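A sketch of the log-based approach (illustrative names; a deque gives cheap eviction from the old end). Replaying the diagram's timestamps in seconds reproduces both decisions:

```python
from collections import deque

class SlidingWindowLog:
    """Sliding window log: one timestamp per request; evict entries that
    have slid out of the window. Exact, but memory grows with request rate."""

    def __init__(self, limit, window_size):
        self.limit = limit
        self.window_size = window_size
        self.log = deque()

    def allow(self, now):
        # Evict timestamps older than (now - window).
        while self.log and self.log[0] <= now - self.window_size:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

With 5 requests logged at t = 15, 25, 40, 55, 65 (seconds past 12:00), a request at t=70 is rejected (all five are still inside the window) and a request at t=80 is allowed (t=15 has been evicted).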
2e. Sliding Window Counter (Hybrid)
Combines fixed window efficiency with sliding window accuracy.
+------------------------------------------------------------------------+
| SLIDING WINDOW COUNTER |
| |
| Limit: 100 requests/minute |
| |
| Previous window (12:00-12:01): 84 requests |
| Current window (12:01-12:02): 36 requests |
| Current time: 12:01:15 (25% into current window) |
| |
| Weighted count = (previous * overlap%) + current |
| = (84 * 0.75) + 36 |
| = 63 + 36 |
| = 99 |
| |
| 99 < 100 --> ALLOW |
| |
| Memory: Only 2 counters per window (very efficient) |
| Accuracy: ~99.9% (slight approximation at boundaries) |
+------------------------------------------------------------------------+
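The weighted-count step from the diagram reduces to a one-line helper (function name is mine, for illustration):

```python
def sliding_window_count(prev_count, curr_count, elapsed_fraction):
    """Approximate request count over the sliding window.
    elapsed_fraction: how far into the current window we are (0.0-1.0);
    the previous window contributes its still-overlapping portion."""
    return prev_count * (1.0 - elapsed_fraction) + curr_count

# Numbers from the diagram: 84 in the previous window, 36 so far,
# 25% into the current window -> 84 * 0.75 + 36 = 99
```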
Algorithm Comparison
+----------------+-----------+----------+---------+------------------+
| Algorithm | Memory | Accuracy | Burst | Implementation |
| | | | Allowed?| Complexity |
+----------------+-----------+----------+---------+------------------+
| Token Bucket | O(1) | Good | Yes | Simple |
| Leaky Bucket | O(1) | Good | No | Simple |
| Fixed Window | O(1) | Low | At edges| Simplest |
| Sliding Log | O(n) | Perfect | No | Moderate |
| Sliding Counter| O(1) | High | Partial | Moderate |
+----------------+-----------+----------+---------+------------------+
3. Distributed Rate Limiting
In a multi-server environment, rate limits must be coordinated.
WITHOUT coordination: WITH coordination (centralized):
[Server 1] limit: 100/min [Server 1] -+
[Server 2] limit: 100/min [Server 2] -+--> [Redis] limit: 100/min
[Server 3] limit: 100/min [Server 3] -+ (shared counter)
User hits all 3 servers User hits any server
= 300 requests allowed! = 100 requests total
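The over-admission above is easy to demonstrate with a toy fixed-window counter whose store is either private per server or shared (standing in for Redis). This is a single-threaded simulation; a real shared store also needs atomic increments to avoid races:

```python
class Counter:
    """Counter with a pluggable store, to contrast per-server limits
    with a shared (Redis-like) counter."""

    def __init__(self, store, limit):
        self.store = store   # dict: shared (centralized) or private (local)
        self.limit = limit

    def allow(self, key):
        n = self.store.get(key, 0)
        if n < self.limit:
            self.store[key] = n + 1
            return True
        return False

# WITHOUT coordination: each server keeps its own store.
local = [Counter({}, 100) for _ in range(3)]
allowed_local = sum(c.allow("user:42") for c in local for _ in range(100))

# WITH coordination: all three servers share one store.
shared = {}
servers = [Counter(shared, 100) for _ in range(3)]
allowed_shared = sum(c.allow("user:42") for c in servers for _ in range(100))

print(allowed_local, allowed_shared)   # 300 100
```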
Approaches
| Approach | Mechanism | Pros | Cons |
|---|---|---|---|
| Centralized (Redis) | Single Redis instance stores all counters | Accurate, consistent | Redis is SPOF, network latency |
| Local + Sync | Local counters synced periodically | Fast local decisions | Temporarily over-limit between syncs |
| Sticky sessions | Route each user to same server | Simple, no coordination | Uneven load, failover breaks it |
| Token bucket in Redis | Lua script for atomic token bucket | Atomic, accurate | Adds Redis dependency |
Redis Lua script for atomic token bucket:
-- KEYS[1] = rate limit key
-- ARGV[1] = capacity, ARGV[2] = refill_rate, ARGV[3] = now
-- ARGV values arrive as strings and GET returns a string (or false for a
-- missing key), so convert everything to numbers first.
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local tokens = tonumber(redis.call("GET", KEYS[1] .. ":tokens"))
local last = tonumber(redis.call("GET", KEYS[1] .. ":timestamp"))
-- Refill (a missing key defaults to a full bucket)
local elapsed = now - (last or now)
tokens = math.min(capacity, (tokens or capacity) + elapsed * refill_rate)
if tokens >= 1 then
  tokens = tokens - 1
  redis.call("SET", KEYS[1] .. ":tokens", tokens)
  redis.call("SET", KEYS[1] .. ":timestamp", now)
  return 1 -- ALLOW
else
  return 0 -- REJECT
end
4. Rate Limiting at Different Layers
+------------------------------------------------------------------------+
| LAYER WHAT TO LIMIT TOOL |
+------------------------------------------------------------------------+
| |
| CDN / Edge IP-based, geographic Cloudflare, AWS Shield |
| DDoS protection |
| |
| API Gateway Per-user, per-API-key Kong, AWS API GW, Envoy |
| Quota management |
| |
| Application Per-endpoint, per-user Custom middleware |
| Business logic rules |
| |
| Database Connection limits Connection pooling (PgPool) |
| Query rate limits |
| |
| Downstream APIs Respect third-party Client-side rate limiter |
| rate limits |
+------------------------------------------------------------------------+
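For the last row of the table, a client-side limiter is the same token bucket seen from the caller's side: instead of rejecting, it tells the caller how long to wait before the next outbound call. A sketch (names are mine; a real client should also honor Retry-After from 429 responses):

```python
import time

class ClientThrottle:
    """Client-side token bucket for calling a rate-limited third-party API."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # sustained calls per second
        self.capacity = capacity    # burst allowance
        self.tokens = capacity
        self.clock = clock          # injectable for testing
        self.last = clock()

    def acquire(self):
        """Return seconds to wait before the call may proceed (0.0 = go now)."""
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return 0.0
        return (1 - self.tokens) / self.rate   # time until one token refills
```

The caller sleeps for whatever `acquire()` returns, which keeps the outbound rate under the third party's limit instead of discovering it via 429s.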
5. DDoS Protection
+------------------------------------------------------------------------+
| DDoS MITIGATION LAYERS |
| |
| Layer 3/4 (Network/Transport) |
| - SYN flood, UDP flood, amplification attacks |
| - Mitigation: Cloud provider (AWS Shield, Cloudflare) |
| - Rate limit by IP, blackhole routing |
| |
| Layer 7 (Application) |
| - HTTP flood, slowloris, API abuse |
| - Mitigation: WAF (Web Application Firewall) |
| - Rate limiting, CAPTCHA, bot detection |
| |
| +---[Internet]---[CDN/WAF]---[LB]---[App]---[DB]---+ |
| | ^ | |
| | | | |
| | Most attacks filtered | |
| | here before reaching | |
| | your infrastructure | |
| +----------------------------------------------------+ |
+------------------------------------------------------------------------+
6. API Quota Management
+------------------------------------------------------------------------+
| API TIER SYSTEM (Common Pattern) |
| |
| +------------------+------------------+------------------+ |
| | FREE Tier | PRO Tier | ENTERPRISE Tier | |
| | | | | |
| | 100 req/min | 1,000 req/min | 10,000 req/min | |
| | 10,000 req/day | 100,000 req/day | Unlimited/day | |
| | No SLA | 99.9% SLA | 99.99% SLA | |
| | Shared pool | Dedicated pool | Dedicated infra | |
| | Best-effort | Priority queue | Guaranteed | |
| +------------------+------------------+------------------+ |
| |
| Response headers (industry standard): |
| X-RateLimit-Limit: 100 (total allowed) |
| X-RateLimit-Remaining: 73 (remaining in window) |
| X-RateLimit-Reset: 1620000060 (Unix timestamp of reset) |
| Retry-After: 30 (seconds until retry, on 429) |
+------------------------------------------------------------------------+
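Emitting the headers from the box above is a small middleware concern. A sketch (helper name is mine; note some newer APIs use the draft RateLimit-* header names instead of X-RateLimit-*):

```python
import time

def rate_limit_headers(limit, remaining, reset_ts, rejected=False):
    """Build standard rate-limit response headers.
    reset_ts is the Unix timestamp when the window resets."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_ts),
    }
    if rejected:
        # On a 429, tell well-behaved clients exactly when to come back.
        headers["Retry-After"] = str(max(0, reset_ts - int(time.time())))
    return headers
```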
7. Rate Limiting in System Design Interviews
When to mention rate limiting:
| Scenario | Why |
|---|---|
| Designing a public API | Prevent abuse, enforce quotas |
| Designing a social media feed | Limit posting frequency (anti-spam) |
| Designing a messaging system | Throttle message send rate |
| Designing a payment system | Throttle retries and block card-testing abuse |
| Any system with external clients | Always add rate limiting at the gateway |
Interview template:
1. WHERE: "Rate limiting at the API gateway level"
2. WHAT: "Per-user limits based on API tier (free: 100/min, pro: 1000/min)"
3. HOW: "Token bucket algorithm with Redis as the centralized counter"
4. RESPONSE: "429 status code with Retry-After header"
5. FALLBACK: "If Redis is down, fall back to local rate limiting (slightly over-limit is acceptable)"
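The fallback in step 5 is a simple wrapper pattern. A sketch with illustrative names (the two limiters stand in for a Redis-backed limiter and a per-server in-memory one):

```python
class ResilientLimiter:
    """Try the shared (Redis-backed) limiter first; if the store is
    unreachable, fall back to a per-server local limiter."""

    def __init__(self, redis_limiter, local_limiter):
        self.redis_limiter = redis_limiter
        self.local_limiter = local_limiter

    def allow(self, key):
        try:
            return self.redis_limiter.allow(key)
        except ConnectionError:
            # Each server now enforces the full limit independently, so the
            # fleet may briefly over-admit -- the acceptable trade-off above.
            return self.local_limiter.allow(key)
```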
Key Takeaways
- Token bucket is the go-to algorithm. It handles bursts gracefully and is used by most production systems.
- Fixed window has a boundary problem. Sliding window counter is the better alternative with minimal extra cost.
- Distributed rate limiting needs a shared store. Redis with Lua scripts is the standard approach.
- Rate limit at the edge first. CDN/WAF -> API Gateway -> Application -> Database.
- Always return informative headers. X-RateLimit-Remaining and Retry-After are essential for good API design.
- DDoS protection is a separate layer. Network-level attacks are handled by cloud providers and WAFs, not application-level rate limiters.
- In interviews, mention rate limiting proactively for any public-facing API. It shows production awareness.
Explain-It Challenge
Design question: You are building a rate limiter for a global API with 10 million users and 50 servers across 3 regions.
- Which algorithm do you choose and why?
- How do you handle the distributed coordination problem?
- What happens if your centralized rate-limit store (Redis) goes down?
- How do you handle different rate limits for different user tiers?
Next -> 9.10.e — Auth and Authorization