Episode 6 — Scaling, Reliability, Microservices & Web3 / 6.8 — Production Hardening

6.8.a — Rate Limiting

In one sentence: Rate limiting controls how many requests a client can make in a time window — it is the single most important defense against API abuse, runaway clients, and brute-force attacks in production.

Navigation: ← 6.8 Overview · 6.8.b — CORS and Secure Headers →


1. Why Rate Limiting Is Essential

Without rate limiting, a single bad actor (or a buggy client) can:

  • Exhaust server resources — CPU, memory, database connections
  • Drive up cloud costs — every request costs money (Lambda, API Gateway, DB reads)
  • Brute-force authentication — try millions of username/password combos
  • Scrape your data — download your entire database through the API
  • Degrade service for everyone — one client starving others of capacity

Rate limiting is not optional. Every production API must have it.

WITHOUT rate limiting:           WITH rate limiting:

Attacker → 10,000 req/s         Attacker → 10,000 req/s
    │                                │
    ▼                                ▼
┌──────────┐                    ┌──────────────┐
│  Server  │ ← overwhelmed      │ Rate Limiter │
│  (DOWN)  │                    │ 429 Too Many │───→ 9,900 rejected
└──────────┘                    └──────┬───────┘
                                       │ 100 allowed
                                       ▼
                                ┌──────────┐
                                │  Server  │ ← healthy
                                │  (UP)    │
                                └──────────┘

2. Rate Limiting Algorithms

2.1 Fixed Window

The simplest algorithm. Divide time into fixed windows (e.g., 1-minute blocks) and count requests per window.

Fixed Window (limit: 5 requests per minute)

Timeline:   |-------- Minute 1 --------|-------- Minute 2 --------|
Requests:    X  X  X  X  X  ✗  ✗         X  X  X
Count:       1  2  3  4  5  DENIED        1  2  3

Legend: X = allowed, ✗ = denied (429)

// Simple fixed-window counter (in-memory)
// NOTE: keys for past windows are never evicted here; a production version
// would clean up stale entries (or keep counts in Redis with a TTL)
const windowCounts = new Map();
const WINDOW_SIZE_MS = 60 * 1000; // 1 minute
const MAX_REQUESTS = 100;

function fixedWindowLimiter(clientId) {
  const now = Date.now();
  const windowKey = `${clientId}:${Math.floor(now / WINDOW_SIZE_MS)}`;

  const current = windowCounts.get(windowKey) || 0;
  if (current >= MAX_REQUESTS) {
    return { allowed: false, remaining: 0 };
  }

  windowCounts.set(windowKey, current + 1);
  return { allowed: true, remaining: MAX_REQUESTS - current - 1 };
}

Pros: Simple, low memory. Cons: Burst problem — a client can make 100 requests at 0:59 and 100 more at 1:01, getting 200 in 2 seconds.
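The boundary burst is easy to demonstrate with a stripped-down version of the counter above (a sketch: the timestamp is passed in explicitly so the window edge is visible, and the demo shrinks the window to 1 second with a limit of 5):

```javascript
// Boundary-burst demo: same fixed-window logic, time passed in explicitly
const WINDOW_MS = 1000; // 1-second windows for the demo
const LIMIT = 5;
const counts = new Map();

function allow(clientId, nowMs) {
  const key = `${clientId}:${Math.floor(nowMs / WINDOW_MS)}`;
  const current = counts.get(key) || 0;
  if (current >= LIMIT) return false;
  counts.set(key, current + 1);
  return true;
}

// 5 requests at t=999ms (end of window 0), then 5 more at t=1001ms (window 1)
let allowed = 0;
for (let i = 0; i < 5; i++) if (allow('client-a', 999)) allowed += 1;
for (let i = 0; i < 5; i++) if (allow('client-a', 1001)) allowed += 1;
console.log(allowed); // 10, twice the nominal limit, within ~2ms of real time
```

The sliding-window variants below exist precisely to close this gap.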


2.2 Sliding Window Log

Records the timestamp of every request and counts how many fall within the rolling window.

Sliding Window Log (limit: 5 per 60s, current time: T=75s)

Stored timestamps: [10, 25, 40, 55, 70]
                    ↑               ↑
                    oldest          newest

Window: [75 - 60, 75] = [15, 75]
In-window: [25, 40, 55, 70] → 4 requests → ALLOWED (0 remaining after this request)

Out-of-window: [10] → pruned

// Sliding window log
const requestLogs = new Map();
const WINDOW_MS = 60 * 1000;
const MAX_REQUESTS = 100;

function slidingWindowLog(clientId) {
  const now = Date.now();
  const windowStart = now - WINDOW_MS;

  if (!requestLogs.has(clientId)) {
    requestLogs.set(clientId, []);
  }

  const logs = requestLogs.get(clientId);

  // Remove expired entries
  while (logs.length > 0 && logs[0] < windowStart) {
    logs.shift();
  }

  if (logs.length >= MAX_REQUESTS) {
    return { allowed: false, remaining: 0 };
  }

  logs.push(now);
  return { allowed: true, remaining: MAX_REQUESTS - logs.length };
}

Pros: No burst problem — perfectly accurate. Cons: High memory (stores every timestamp). Expensive for high-volume APIs.
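The pruning step can be checked against the diagram's numbers directly (a self-contained sketch, with timestamps in seconds rather than Date.now() milliseconds):

```javascript
// Sliding-window-log pruning with the diagram's numbers (seconds)
const WINDOW_S = 60;
const logs = [10, 25, 40, 55, 70]; // stored request timestamps
const now = 75;

// Keep only entries inside the rolling window [now - 60, now]
const windowStart = now - WINDOW_S; // 15
const inWindow = logs.filter((t) => t >= windowStart);

console.log(inWindow);        // [25, 40, 55, 70] (the entry at t=10 is pruned)
console.log(inWindow.length); // 4 → under the limit of 5, so the request is allowed
```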


2.3 Sliding Window Counter

A hybrid: combines fixed-window counting with a weighted estimate from the previous window. Best of both worlds.

Sliding Window Counter (limit: 100/min)

 Prev window count: 80      Current window count: 30
 |-------- Prev --------|-------- Current --------|
                              ↑
                         We are 25% into current window

 Weighted count = (80 × 0.75) + 30 = 60 + 30 = 90
90 < 100 → ALLOWED

// Sliding window counter (memory-efficient)
const counters = new Map();
const WINDOW_MS = 60 * 1000;
const MAX_REQUESTS = 100;

function slidingWindowCounter(clientId) {
  const now = Date.now();
  const currentWindow = Math.floor(now / WINDOW_MS);
  const prevWindow = currentWindow - 1;
  const windowProgress = (now % WINDOW_MS) / WINDOW_MS;

  const prevCount = counters.get(`${clientId}:${prevWindow}`) || 0;
  const currCount = counters.get(`${clientId}:${currentWindow}`) || 0;

  // Weighted estimate
  const estimatedCount = prevCount * (1 - windowProgress) + currCount;

  if (estimatedCount >= MAX_REQUESTS) {
    return { allowed: false, remaining: 0 };
  }

  counters.set(`${clientId}:${currentWindow}`, currCount + 1);
  return {
    allowed: true,
    remaining: Math.floor(MAX_REQUESTS - estimatedCount - 1),
  };
}

Pros: Low memory, smooth rate enforcement, no burst spikes. Cons: Slightly less precise than sliding window log (but close enough for production).
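The diagram's arithmetic, checked numerically (same assumed numbers: 80 requests in the previous window, 30 in the current one, 25% of the window elapsed):

```javascript
// Weighted estimate from the sliding-window-counter diagram
const prevCount = 80;        // previous window's total
const currCount = 30;        // current window's count so far
const windowProgress = 0.25; // 25% into the current window

// The previous window contributes only its overlapping 75%
const estimated = prevCount * (1 - windowProgress) + currCount;
console.log(estimated); // 90 → below the limit of 100, so the request is allowed
```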


2.4 Token Bucket

Each client has a bucket that fills with tokens at a steady rate. Each request costs one token. If the bucket is empty, the request is denied.

Token Bucket (capacity: 10, refill: 2 tokens/sec)

Time 0:   [##########] 10 tokens — full bucket
           Request → [#########] 9 tokens

Time 0.5: [##########] 10 tokens (refilled 1, but capped at 10)
           5 requests → [#####] 5 tokens

Time 3:   [##########] 10 tokens (refilled over time, capped)
           Burst of 10 → [          ] 0 tokens
           Next request → DENIED (wait for refill)

Key: # = available token

// Token bucket implementation
class TokenBucket {
  constructor(capacity, refillRate) {
    this.capacity = capacity;        // max tokens
    this.tokens = capacity;          // current tokens
    this.refillRate = refillRate;    // tokens per second
    this.lastRefill = Date.now();
  }

  tryConsume() {
    this.refill();

    if (this.tokens < 1) {
      return { allowed: false, retryAfter: (1 / this.refillRate) };
    }

    this.tokens -= 1;
    return { allowed: true, remaining: Math.floor(this.tokens) };
  }

  refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate);
    this.lastRefill = now;
  }
}

// Usage
const buckets = new Map();

function tokenBucketLimiter(clientId) {
  if (!buckets.has(clientId)) {
    buckets.set(clientId, new TokenBucket(10, 2)); // 10 capacity, 2/sec refill
  }
  return buckets.get(clientId).tryConsume();
}

Pros: Allows controlled bursts. Smooth long-term rate. Industry standard (AWS, Stripe use it). Cons: More complex to implement. State management per client.


2.5 Leaky Bucket

Requests enter a queue (bucket) and are processed at a fixed rate. If the bucket overflows, requests are dropped.

Leaky Bucket (capacity: 5, drain rate: 1 req/sec)

Incoming: ████████████  (12 requests burst)

┌───────────┐
│ █ █ █ █ █ │ ← bucket (5 capacity)
└─────┬─────┘
      │ drains at 1/sec
      ▼
   Processed: █...█...█...█...█  (steady 1/sec)

Overflow: ███████  (7 dropped — 429 Too Many Requests)

// Leaky bucket (queue-based)
class LeakyBucket {
  constructor(capacity, drainRate) {
    this.capacity = capacity;
    this.queue = [];
    this.drainRate = drainRate; // requests per second

    // Drain the bucket at a steady rate (in production, clear this
    // interval when the bucket is discarded)
    setInterval(() => {
      if (this.queue.length > 0) {
        const request = this.queue.shift();
        request.resolve(); // process the request
      }
    }, 1000 / this.drainRate);
  }

  enqueue() {
    return new Promise((resolve, reject) => {
      if (this.queue.length >= this.capacity) {
        reject(new Error('Rate limit exceeded'));
        return;
      }
      this.queue.push({ resolve });
    });
  }
}

Pros: Perfectly smooth output rate. Prevents any burstiness downstream. Cons: Adds latency (requests wait in queue). More complex.


Algorithm Comparison

| Algorithm              | Burst Handling        | Memory | Precision  | Complexity | Best For                             |
|------------------------|-----------------------|--------|------------|------------|--------------------------------------|
| Fixed Window           | Poor (boundary burst) | Low    | Low        | Simple     | Basic APIs                           |
| Sliding Window Log     | Excellent             | High   | Exact      | Medium     | Low-volume, precision-critical       |
| Sliding Window Counter | Good                  | Low    | Near-exact | Medium     | Most production APIs                 |
| Token Bucket           | Controlled bursts     | Medium | Good       | Medium     | APIs that allow bursts (AWS, Stripe) |
| Leaky Bucket           | No bursts             | Medium | Good       | High       | Smooth-output requirements           |

3. Rate Limiting by Identity

By IP Address

// Most common — limit by IP
function getClientIP(req) {
  // Behind a load balancer / proxy, use X-Forwarded-For
  return req.headers['x-forwarded-for']?.split(',')[0]?.trim()
    || req.socket.remoteAddress;
}

Problem: NAT — thousands of users behind one corporate IP get collectively throttled.

By User (Authenticated)

// After auth middleware sets req.user
function getRateLimitKey(req) {
  if (req.user) return `user:${req.user.id}`;
  return `ip:${getClientIP(req)}`;
}

By API Key

// For third-party API consumers
function getRateLimitKey(req) {
  const apiKey = req.headers['x-api-key'];
  if (apiKey) return `key:${apiKey}`;
  return `ip:${getClientIP(req)}`;
}

4. express-rate-limit Library

The standard rate limiting middleware for Express.

import rateLimit from 'express-rate-limit';

// Global rate limiter
const globalLimiter = rateLimit({
  windowMs: 15 * 60 * 1000,  // 15 minutes
  max: 100,                   // 100 requests per window
  standardHeaders: true,      // Return rate limit info in `RateLimit-*` headers
  legacyHeaders: false,       // Disable `X-RateLimit-*` headers
  message: {
    error: 'Too many requests, please try again later.',
    retryAfter: 15 * 60,
  },
  keyGenerator: (req) => {
    return req.headers['x-forwarded-for']?.split(',')[0]?.trim()
      || req.ip;
  },
});

app.use(globalLimiter);

Endpoint-Specific Limits

// Strict limit on auth endpoints (prevent brute force)
const authLimiter = rateLimit({
  windowMs: 15 * 60 * 1000,  // 15 minutes
  max: 5,                     // Only 5 login attempts
  message: { error: 'Too many login attempts. Try again in 15 minutes.' },
  skipSuccessfulRequests: true, // Don't count successful logins
});

// Relaxed limit on read endpoints
const readLimiter = rateLimit({
  windowMs: 1 * 60 * 1000,   // 1 minute
  max: 60,                    // 60 reads per minute
});

// Very strict limit on expensive operations
const writeLimiter = rateLimit({
  windowMs: 1 * 60 * 1000,
  max: 10,                    // 10 writes per minute
});

// Apply per-route
app.post('/api/auth/login', authLimiter, loginHandler);
app.post('/api/auth/register', authLimiter, registerHandler);
app.get('/api/products', readLimiter, listProducts);
app.post('/api/orders', writeLimiter, createOrder);

5. Rate Limit Headers

Standard headers that tell clients about their rate limit status:

HTTP/1.1 200 OK
RateLimit-Limit: 100          # Max requests per window
RateLimit-Remaining: 42       # Requests left in current window
RateLimit-Reset: 1625847600   # Unix timestamp when window resets

# On 429 response:
HTTP/1.1 429 Too Many Requests
Retry-After: 30               # Seconds until the client should retry
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 1625847600

// Custom header middleware
function rateLimitHeaders(req, res, next) {
  const result = checkRateLimit(req);

  res.set('RateLimit-Limit', String(result.limit));
  res.set('RateLimit-Remaining', String(result.remaining));
  res.set('RateLimit-Reset', String(result.resetTime));

  if (!result.allowed) {
    res.set('Retry-After', String(result.retryAfter));
    return res.status(429).json({
      error: 'Rate limit exceeded',
      retryAfter: result.retryAfter,
    });
  }

  next();
}
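On the client side, these headers enable polite retries. A hypothetical helper (a sketch; fetchWithRetry is not a standard API, and the fetch function is injectable so the logic can be exercised without a network):

```javascript
// Retry on 429, honoring the server's Retry-After hint
async function fetchWithRetry(url, options = {}, maxAttempts = 3, fetchFn = fetch) {
  let res;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    res = await fetchFn(url, options);
    if (res.status !== 429) return res;

    if (attempt < maxAttempts) {
      // Retry-After is in seconds; fall back to 1s if the header is absent
      const retryAfter = Number(res.headers.get('Retry-After')) || 1;
      await new Promise((resolve) => setTimeout(resolve, retryAfter * 1000));
    }
  }
  return res; // still 429 after all attempts
}
```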

6. Redis-Based Distributed Rate Limiting

In-memory rate limiting breaks when you have multiple server instances. Redis provides shared state.

┌──────────┐     ┌──────────┐     ┌──────────┐
│ Server 1 │     │ Server 2 │     │ Server 3 │
│ (Express)│     │ (Express)│     │ (Express)│
└─────┬────┘     └─────┬────┘     └─────┬────┘
      │                │                │
      └────────────────┼────────────────┘
                       │
                       ▼
        ┌──────────────────────────┐
        │       Redis Cluster      │
        │  Shared rate limit state │
        │  "user:123" → count: 42  │
        └──────────────────────────┘

import rateLimit from 'express-rate-limit';
import RedisStore from 'rate-limit-redis';
import { createClient } from 'redis';

// Connect to Redis
const redisClient = createClient({
  url: process.env.REDIS_URL || 'redis://localhost:6379',
});
await redisClient.connect();

// Distributed rate limiter
const distributedLimiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 100,
  standardHeaders: true,
  legacyHeaders: false,

  // Use Redis as the store
  store: new RedisStore({
    sendCommand: (...args) => redisClient.sendCommand(args),
    prefix: 'rl:',  // Key prefix in Redis
  }),
});

app.use('/api/', distributedLimiter);

Redis Sliding Window with Lua Script

For precise distributed rate limiting, use a Lua script (atomic operations in Redis):

// Atomic sliding window counter in Redis
const SLIDING_WINDOW_SCRIPT = `
  local key = KEYS[1]
  local window = tonumber(ARGV[1])
  local limit = tonumber(ARGV[2])
  local now = tonumber(ARGV[3])

  -- Remove expired entries
  redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

  -- Count current entries
  local count = redis.call('ZCARD', key)

  if count < limit then
    -- Add this request
    redis.call('ZADD', key, now, now .. ':' .. math.random())
    redis.call('PEXPIRE', key, window)
    return {1, limit - count - 1}  -- allowed, remaining
  else
    return {0, 0}  -- denied, 0 remaining
  end
`;

async function slidingWindowRedis(clientId, windowMs, maxRequests) {
  const result = await redisClient.eval(SLIDING_WINDOW_SCRIPT, {
    keys: [`ratelimit:${clientId}`],
    arguments: [String(windowMs), String(maxRequests), String(Date.now())],
  });

  return {
    allowed: result[0] === 1,
    remaining: result[1],
  };
}

7. Rate Limiting in Microservices

Gateway-Level vs Service-Level

                    Gateway-Level                Service-Level
                    Rate Limiting                Rate Limiting

Client ──→ ┌────────────────┐           Client ──→ ┌──────────┐
           │  API Gateway   │                      │ Gateway  │ (pass-through)
           │  100 req/min   │                      └─────┬────┘
           └───────┬────────┘                            │
                   │                             ┌───────┼───────┐
           ┌───────┼───────┐                     ▼       ▼       ▼
           ▼       ▼       ▼                ┌────────┐ ┌──────┐ ┌──────┐
      ┌────────┐ ┌──────┐ ┌──────┐          │ Auth   │ │Orders│ │Search│
      │ Auth   │ │Orders│ │Search│          │ 20/min │ │50/min│ │200/m │
      └────────┘ └──────┘ └──────┘          └────────┘ └──────┘ └──────┘

Best practice: Both. Gateway for global limits, services for granular limits.

// API Gateway global limit
const gatewayLimiter = rateLimit({
  windowMs: 60 * 1000,
  max: 200,  // No client exceeds 200/min total
  keyGenerator: (req) => req.headers['x-api-key'] || req.ip,
});

// Per-service limits (inside each service)
// auth-service
const authServiceLimiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 10,  // 10 auth attempts per 15 min
  keyGenerator: (req) => req.headers['x-user-id'] || req.ip,
});

// search-service
const searchServiceLimiter = rateLimit({
  windowMs: 60 * 1000,
  max: 60,  // 60 searches per minute
});

8. Tiered Rate Limiting

Different users get different limits based on their plan.

// Tiered rate limiting by API key plan
const TIER_LIMITS = {
  free:       { windowMs: 60 * 1000, max: 10 },
  basic:      { windowMs: 60 * 1000, max: 100 },
  pro:        { windowMs: 60 * 1000, max: 1000 },
  enterprise: { windowMs: 60 * 1000, max: 10000 },
};

// Strictest limits for anonymous clients. Create the middleware once:
// each rateLimit() call builds a fresh store, so creating it per request
// would never accumulate counts.
const anonLimiter = rateLimit({ windowMs: 60 * 1000, max: 5 });

async function tieredRateLimiter(req, res, next) {
  const apiKey = req.headers['x-api-key'];

  if (!apiKey) {
    return anonLimiter(req, res, next);
  }

  // Look up the plan for this API key
  const plan = await getPlanForApiKey(apiKey);  // e.g., from Redis or DB
  const limits = TIER_LIMITS[plan] || TIER_LIMITS.free;

  const result = await slidingWindowRedis(
    `tier:${apiKey}`,
    limits.windowMs,
    limits.max
  );

  // Set headers
  res.set('RateLimit-Limit', String(limits.max));
  res.set('RateLimit-Remaining', String(result.remaining));

  if (!result.allowed) {
    res.set('Retry-After', String(Math.ceil(limits.windowMs / 1000)));
    return res.status(429).json({
      error: 'Rate limit exceeded',
      plan,
      limit: limits.max,
      upgradeUrl: '/pricing',
    });
  }

  next();
}

app.use('/api/', tieredRateLimiter);

9. Abuse Prevention Patterns

Exponential Backoff on Repeated Violations

// Track repeat offenders
const violations = new Map();

function abuseDetection(req, res, next) {
  const clientId = req.ip;
  const record = violations.get(clientId) || { count: 0, blockedUntil: 0 };

  // Check if currently blocked
  if (Date.now() < record.blockedUntil) {
    const retryAfter = Math.ceil((record.blockedUntil - Date.now()) / 1000);
    res.set('Retry-After', String(retryAfter));
    return res.status(429).json({
      error: 'Temporarily blocked due to repeated abuse',
      retryAfter,
    });
  }

  // Reset if enough time has passed
  if (record.lastViolation && Date.now() - record.lastViolation > 3600000) {
    violations.delete(clientId);
    return next();
  }

  next();
}

// Call this when a rate limit is hit
function recordViolation(clientId) {
  const record = violations.get(clientId) || { count: 0 };
  record.count += 1;
  record.lastViolation = Date.now();

  // Exponential backoff: 1min, 5min, 30min, 2hr, 24hr
  const backoffMinutes = [1, 5, 30, 120, 1440];
  const backoffIndex = Math.min(record.count - 1, backoffMinutes.length - 1);
  record.blockedUntil = Date.now() + backoffMinutes[backoffIndex] * 60 * 1000;

  violations.set(clientId, record);
}
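Wiring this in might look like the following (a sketch; limiterWithEscalation is a hypothetical name, and checkLimit stands in for any per-client limiter from section 2, such as fixedWindowLimiter):

```javascript
// When the base limiter denies a request, record the violation so that
// repeat offenders are blocked for progressively longer periods
function limiterWithEscalation(checkLimit, recordViolation) {
  return (req, res, next) => {
    const clientId = req.ip;
    const result = checkLimit(clientId);

    if (!result.allowed) {
      recordViolation(clientId); // extends the block on each repeat
      res.set('Retry-After', '60');
      return res.status(429).json({ error: 'Rate limit exceeded' });
    }
    next();
  };
}

// Usage (sketch):
// app.use(abuseDetection);
// app.use(limiterWithEscalation(fixedWindowLimiter, recordViolation));
```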

10. Complete Production Implementation

import express from 'express';
import rateLimit from 'express-rate-limit';
import RedisStore from 'rate-limit-redis';
import { createClient } from 'redis';

const app = express();
const redisClient = createClient({ url: process.env.REDIS_URL });
await redisClient.connect();

// --- Layer 1: Global rate limit (all routes) ---
const globalLimiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 500,
  standardHeaders: true,
  legacyHeaders: false,
  store: new RedisStore({
    sendCommand: (...args) => redisClient.sendCommand(args),
    prefix: 'rl:global:',
  }),
  message: { error: 'Global rate limit exceeded' },
});

// --- Layer 2: Auth endpoint limiter ---
const authLimiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 5,
  skipSuccessfulRequests: true,
  store: new RedisStore({
    sendCommand: (...args) => redisClient.sendCommand(args),
    prefix: 'rl:auth:',
  }),
  message: { error: 'Too many login attempts' },
});

// --- Layer 3: API-key-based tiered limiter ---
const apiLimiter = rateLimit({
  windowMs: 60 * 1000,
  max: async (req) => {
    const apiKey = req.headers['x-api-key'];
    if (!apiKey) return 10;
    const plan = await redisClient.get(`plan:${apiKey}`);
    const limits = { free: 30, basic: 100, pro: 1000, enterprise: 10000 };
    return limits[plan] || 10;
  },
  store: new RedisStore({
    sendCommand: (...args) => redisClient.sendCommand(args),
    prefix: 'rl:api:',
  }),
  keyGenerator: (req) => req.headers['x-api-key'] || req.ip,
});

// --- Layer 4: Expensive operation limiter ---
const expensiveLimiter = rateLimit({
  windowMs: 60 * 1000,
  max: 5,
  store: new RedisStore({
    sendCommand: (...args) => redisClient.sendCommand(args),
    prefix: 'rl:expensive:',
  }),
});

// Apply middleware
app.use(globalLimiter);
app.post('/api/auth/login', authLimiter);
app.post('/api/auth/register', authLimiter);
app.use('/api/', apiLimiter);
app.post('/api/export', expensiveLimiter);
app.post('/api/ai/generate', expensiveLimiter);

app.listen(3000, () => {
  console.log('Server with production rate limiting running on :3000');
});

11. Key Takeaways

  1. Rate limiting is mandatory — every production API needs it to prevent abuse, control costs, and protect availability.
  2. Choose the right algorithm — sliding window counter or token bucket cover most production needs.
  3. Distribute with Redis — in-memory counters break with multiple server instances.
  4. Layer your limits — gateway-level for global caps, service-level for granular control.
  5. Set proper headers — RateLimit-Limit, RateLimit-Remaining, Retry-After so clients can self-regulate.
  6. Tier by identity — different limits for anonymous, free, and paid users.

Explain-It Challenge

  1. Your boss says "just set a global limit of 100 requests per minute." Why is this insufficient for production?
  2. A client behind corporate NAT complains they keep getting 429 errors. What happened and how do you fix it?
  3. Explain why token bucket is preferred by AWS and Stripe over fixed-window counting.

Navigation: ← 6.8 Overview · 6.8.b — CORS and Secure Headers →