Episode 4 — Generative AI Engineering / 4.2 — Calling LLM APIs Properly
4.2.d — Rate Limits and Retries
In one sentence: LLM APIs enforce rate limits (requests per minute, tokens per minute) to manage capacity — production applications must handle 429 errors gracefully with exponential backoff, retry strategies, concurrency control, and circuit breakers to stay reliable under pressure.
Navigation: ← 4.2.c — Cost Awareness · 4.2 Overview →
1. What Are Rate Limits?
Rate limits are caps on how many API requests you can make within a time window. Providers impose them to:
- Protect infrastructure — prevent any single customer from overwhelming shared servers
- Ensure fair access — distribute capacity across all customers
- Prevent abuse — stop runaway scripts or misconfigured loops
Types of rate limits
| Limit Type | Abbreviation | What It Measures | Example |
|---|---|---|---|
| Requests Per Minute | RPM | Number of API calls | 500 RPM |
| Tokens Per Minute | TPM | Total tokens processed (input + output) | 200,000 TPM |
| Requests Per Day | RPD | Daily call volume | 10,000 RPD |
| Tokens Per Day | TPD | Daily token volume | 40,000,000 TPD |
Typical rate limits by tier (illustrative values; actual limits change over time, so check your provider's dashboard)
| Provider/Tier | RPM | TPM | Notes |
|---|---|---|---|
| OpenAI Free | 3 | 40,000 | Very restrictive |
| OpenAI Tier 1 | 500 | 200,000 | After first payment |
| OpenAI Tier 3 | 5,000 | 2,000,000 | After $100+ spend |
| OpenAI Tier 5 | 10,000 | 30,000,000 | Enterprise level |
| Anthropic Build | 50 | 40,000 | Starting tier |
| Anthropic Scale | 4,000 | 400,000 | Higher tier |
Key insight: You can hit rate limits from either RPM or TPM. Sending many small requests hits RPM; sending few large requests (with big prompts) hits TPM.
Scenario A: Hit RPM limit
501 requests with 100 tokens each = 501 RPM (over 500 limit)
Total tokens: 50,100 (well under 200K TPM)
Scenario B: Hit TPM limit
50 requests with 5,000 tokens each = 50 RPM (under 500 limit)
Total tokens: 250,000 (over 200K TPM limit)
Both result in 429 errors, but for different reasons!
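The two scenarios can be checked mechanically before a batch run. A small sketch (the `bindingLimit` helper is hypothetical; its defaults mirror the Tier 1 limits above):

```javascript
// Illustrative helper: given a batch plan, report which limit binds first.
// Default limits mirror the Tier 1 example (500 RPM, 200,000 TPM).
function bindingLimit({ requests, tokensPerRequest }, { rpm = 500, tpm = 200000 } = {}) {
  const totalTokens = requests * tokensPerRequest;
  if (requests > rpm && totalTokens > tpm) return 'both';
  if (requests > rpm) return 'rpm';
  if (totalTokens > tpm) return 'tpm';
  return 'none';
}

console.log(bindingLimit({ requests: 501, tokensPerRequest: 100 })); // Scenario A: 'rpm'
console.log(bindingLimit({ requests: 50, tokensPerRequest: 5000 })); // Scenario B: 'tpm'
```

Running a check like this against your planned batch size tells you which limit to engineer around before the 429s arrive.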
2. The 429 Error: Too Many Requests
When you exceed a rate limit, the API returns an HTTP 429 status code:
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
retry-after: 2
{
"error": {
"message": "Rate limit reached for gpt-4o. Limit: 500 requests per minute. Please try again in 1.2s.",
"type": "rate_limit_error",
"code": "rate_limit_exceeded"
}
}
Important response headers
| Header | Purpose | Example |
|---|---|---|
| retry-after | Seconds to wait before retrying | 2 |
| x-ratelimit-limit-requests | Your RPM limit | 500 |
| x-ratelimit-remaining-requests | Remaining requests this minute | 0 |
| x-ratelimit-limit-tokens | Your TPM limit | 200000 |
| x-ratelimit-remaining-tokens | Remaining tokens this minute | 0 |
| x-ratelimit-reset-requests | Time until RPM resets | 1.2s |
| x-ratelimit-reset-tokens | Time until TPM resets | 4.5s |
Always read these headers — they tell you exactly when you can retry and how much capacity remains.
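Reading the headers makes proactive throttling possible: slow down before you hit zero, not after. A sketch (the `shouldThrottle` helper and its 10% threshold are illustrative; the header names follow the x-ratelimit-* convention above):

```javascript
// Illustrative helper: decide from rate-limit headers whether to back off
// before the next request. Missing headers are treated as "no limit known".
function shouldThrottle(headers, thresholdFraction = 0.1) {
  const limit = Number(headers['x-ratelimit-limit-requests'] ?? Infinity);
  const remaining = Number(headers['x-ratelimit-remaining-requests'] ?? Infinity);
  const tokenLimit = Number(headers['x-ratelimit-limit-tokens'] ?? Infinity);
  const tokensLeft = Number(headers['x-ratelimit-remaining-tokens'] ?? Infinity);
  // Throttle when either budget (requests or tokens) drops below the threshold
  return remaining < limit * thresholdFraction ||
         tokensLeft < tokenLimit * thresholdFraction;
}

const headers = {
  'x-ratelimit-limit-requests': '500',
  'x-ratelimit-remaining-requests': '12',
  'x-ratelimit-limit-tokens': '200000',
  'x-ratelimit-remaining-tokens': '150000',
};
console.log(shouldThrottle(headers)); // true: only 12/500 requests left
```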
3. Exponential Backoff with Jitter
The standard retry strategy for 429 errors is exponential backoff with jitter — wait progressively longer between retries, with a random component to prevent "thundering herd" problems.
Why not just retry immediately?
Without backoff:
Request → 429 → Retry immediately → 429 → Retry immediately → 429 → ...
(hammering the API, making things worse)
With exponential backoff:
Request → 429 → Wait 1s → Retry → 429 → Wait 2s → Retry → 429 → Wait 4s → Retry → Success
With exponential backoff + jitter:
Request → 429 → Wait 1.3s → Retry → 429 → Wait 2.7s → Retry → Success
(random jitter prevents all clients from retrying at the same instant)
Implementation
async function callWithRetry(apiCallFn, {
maxRetries = 5,
baseDelayMs = 1000,
maxDelayMs = 60000,
retryableStatuses = [429, 500, 502, 503, 529]
} = {}) {
let lastError;
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
return await apiCallFn();
} catch (error) {
lastError = error;
// Don't retry non-retryable errors
const status = error?.status || error?.response?.status;
if (status && !retryableStatuses.includes(status)) {
throw error; // 400, 401, 403, 404 — don't retry
}
// Don't retry if we've exhausted attempts
if (attempt === maxRetries) {
throw new Error(`Failed after ${maxRetries + 1} attempts: ${error.message}`);
}
// Calculate delay with exponential backoff + jitter
const exponentialDelay = baseDelayMs * Math.pow(2, attempt);
const jitter = Math.random() * baseDelayMs; // Random 0 to baseDelay
const delay = Math.min(exponentialDelay + jitter, maxDelayMs);
// Use retry-after header if available
const retryAfter = error?.response?.headers?.['retry-after'];
const retryAfterMs = retryAfter ? parseFloat(retryAfter) * 1000 : 0;
const finalDelay = Math.max(delay, retryAfterMs);
console.warn(
`Attempt ${attempt + 1} failed (${status}). ` +
`Retrying in ${(finalDelay / 1000).toFixed(1)}s...`
);
await new Promise(resolve => setTimeout(resolve, finalDelay));
}
}
throw lastError;
}
// Usage
const response = await callWithRetry(
() => openai.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: "Hello" }],
max_tokens: 100
}),
{ maxRetries: 3, baseDelayMs: 1000 }
);
Backoff progression
Attempt 0: Immediate (first try)
Attempt 1: ~1.0-2.0s (1000 * 2^0 + jitter)
Attempt 2: ~2.0-3.0s (1000 * 2^1 + jitter)
Attempt 3: ~4.0-5.0s (1000 * 2^2 + jitter)
Attempt 4: ~8.0-9.0s (1000 * 2^3 + jitter)
Attempt 5: ~16.0-17.0s (1000 * 2^4 + jitter)
Total max wait before giving up: ~31-36 seconds
4. Which Errors to Retry
Not all errors should be retried. Retrying a 400 Bad Request wastes time and money.
| Status Code | Meaning | Retry? | Why |
|---|---|---|---|
| 400 | Bad Request | No | Your request is malformed — fix it |
| 401 | Unauthorized | No | Invalid API key — fix credentials |
| 403 | Forbidden | No | Permission denied — check access |
| 404 | Not Found | No | Wrong endpoint or model name |
| 422 | Unprocessable | No | Invalid parameters — fix request |
| 429 | Rate Limited | Yes | Temporary — wait and retry |
| 500 | Internal Server Error | Yes | Provider issue — may resolve |
| 502 | Bad Gateway | Yes | Temporary infrastructure issue |
| 503 | Service Unavailable | Yes | Provider overloaded — wait and retry |
| 529 | Overloaded | Yes | Anthropic-specific — system at capacity |
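The table can be encoded as a single predicate so the retry decision lives in one place. A sketch (the status sets mirror the table; for codes not listed, this falls back to "retry server errors, fail fast on client errors"):

```javascript
// Illustrative predicate encoding the retry table above.
const RETRYABLE = new Set([429, 500, 502, 503, 529]);
const NON_RETRYABLE = new Set([400, 401, 403, 404, 422]);

function isRetryable(status) {
  if (RETRYABLE.has(status)) return true;
  if (NON_RETRYABLE.has(status)) return false;
  // Unknown status: retry 5xx (provider-side), fail fast on 4xx (our fault)
  return status >= 500;
}

console.log(isRetryable(429)); // true
console.log(isRetryable(400)); // false
```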
5. Timeout Handling
LLM API calls can take 5-60+ seconds depending on model, prompt length, and output length. Always set timeouts.
import OpenAI from 'openai';
// Set client-level timeout
const openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY,
timeout: 30000 // 30 seconds (in milliseconds)
});
// Or per-request timeout using AbortController
async function callWithTimeout(messages, timeoutMs = 30000) {
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), timeoutMs);
try {
const response = await openai.chat.completions.create(
{
model: "gpt-4o",
messages,
max_tokens: 1000
},
{ signal: controller.signal }
);
return response;
} catch (error) {
if (error.name === 'AbortError' || error instanceof OpenAI.APIUserAbortError) {
// Note: the openai SDK wraps aborts in its own APIUserAbortError
throw new Error(`API call timed out after ${timeoutMs}ms`);
}
throw error;
} finally {
clearTimeout(timeoutId);
}
}
Timeout recommendations
| Scenario | Recommended Timeout | Why |
|---|---|---|
| Short responses (classification) | 10-15s | Should complete quickly |
| Medium responses (chat) | 30s | Standard conversational response |
| Long responses (code generation) | 60s | Complex generation takes time |
| Streaming responses | 10s initial, then per-chunk | First token should arrive quickly |
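The streaming row deserves its own pattern: race each chunk read against a timer, with a generous budget for the first token and a tighter one per chunk. Everything below is illustrative: `withChunkTimeout` is a hypothetical helper, and with a real SDK you would pass the streaming response object in place of the simulated stream:

```javascript
// Illustrative per-chunk timeout wrapper for any async-iterable stream.
function withChunkTimeout(stream, { firstChunkMs = 10000, chunkMs = 5000 } = {}) {
  return (async function* () {
    const iterator = stream[Symbol.asyncIterator]();
    let deadline = firstChunkMs;
    while (true) {
      let timerId;
      const timeout = new Promise((_, reject) => {
        timerId = setTimeout(
          () => reject(new Error(`No chunk within ${deadline}ms`)), deadline);
      });
      let result;
      try {
        // Whichever settles first wins: the next chunk or the deadline
        result = await Promise.race([iterator.next(), timeout]);
      } finally {
        clearTimeout(timerId); // avoid a stray rejection after the race settles
      }
      if (result.done) return;
      yield result.value;
      deadline = chunkMs; // after the first token, apply the per-chunk budget
    }
  })();
}
```

Usage follows the normal `for await` shape: `for await (const chunk of withChunkTimeout(stream)) { ... }`, with the loop throwing if any chunk misses its deadline.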
6. Concurrent Request Management
When processing many items in parallel, you need to limit concurrency to stay within rate limits.
Token bucket / Semaphore pattern
class RateLimiter {
constructor(maxConcurrent = 10, requestsPerMinute = 500) {
this.maxConcurrent = maxConcurrent;
this.running = 0;
this.queue = [];
this.requestTimestamps = [];
this.requestsPerMinute = requestsPerMinute;
}
async acquire() {
// Wait until we're under concurrent limit
while (this.running >= this.maxConcurrent) {
await new Promise(resolve => this.queue.push(resolve));
}
// Wait until RPM limit allows
await this.#waitForRpmSlot();
this.running++;
this.requestTimestamps.push(Date.now());
}
release() {
this.running--;
if (this.queue.length > 0) {
const next = this.queue.shift();
next();
}
}
async #waitForRpmSlot() {
const now = Date.now();
const oneMinuteAgo = now - 60000;
// Remove timestamps older than 1 minute
this.requestTimestamps = this.requestTimestamps.filter(t => t > oneMinuteAgo);
// If at RPM limit, wait until oldest request expires
if (this.requestTimestamps.length >= this.requestsPerMinute) {
const waitTime = this.requestTimestamps[0] - oneMinuteAgo + 100;
await new Promise(resolve => setTimeout(resolve, waitTime));
}
}
}
// Usage
const limiter = new RateLimiter(10, 500); // 10 concurrent, 500 RPM
async function processItem(item) {
await limiter.acquire();
try {
return await openai.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: item }],
max_tokens: 500
});
} finally {
limiter.release();
}
}
// Process 1000 items with controlled concurrency
const items = [/* ...1000 items */];
const results = await Promise.all(items.map(processItem));
Simple p-limit approach
import pLimit from 'p-limit';
const limit = pLimit(10); // Max 10 concurrent requests
const items = ['item1', 'item2', /* ...1000 items */];
const results = await Promise.all(
items.map(item =>
limit(() => openai.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: `Process: ${item}` }],
max_tokens: 200
}))
)
);
7. Circuit Breaker Pattern
When an API is consistently failing, continuing to send requests wastes resources and delays recovery. A circuit breaker stops requests temporarily and resumes after a cooldown period.
Circuit States:
   CLOSED (normal)                        OPEN (failing)
   ┌──────────┐  failures > threshold   ┌──────────┐
   │ Requests │ ───────────────────────▶│ Requests │
   │ pass     │                         │ blocked  │
   │ through  │                         │ (fast    │
   └──────────┘                         │  fail)   │
        ▲                               └──────────┘
        │                                    │
        │                           cooldown elapsed
        │                                    ▼
        │                              ┌───────────┐
        └────────── success ────────── │ HALF-OPEN │
                                       │ (test 1   │
          failure ───────────────────▶ │  request) │
          (back to OPEN)               └───────────┘
Implementation
class CircuitBreaker {
constructor({
failureThreshold = 5,
cooldownMs = 30000,
monitorWindowMs = 60000
} = {}) {
this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
this.failures = [];
this.failureThreshold = failureThreshold;
this.cooldownMs = cooldownMs;
this.monitorWindowMs = monitorWindowMs;
this.openedAt = null;
}
async call(fn) {
// If OPEN, check if cooldown has passed
if (this.state === 'OPEN') {
if (Date.now() - this.openedAt >= this.cooldownMs) {
this.state = 'HALF_OPEN';
console.log('Circuit breaker: HALF_OPEN — testing one request');
} else {
throw new Error('Circuit breaker is OPEN — request blocked');
}
}
try {
const result = await fn();
// Success — reset if HALF_OPEN
if (this.state === 'HALF_OPEN') {
this.state = 'CLOSED';
this.failures = [];
console.log('Circuit breaker: CLOSED — service recovered');
}
return result;
} catch (error) {
this.#recordFailure();
// If HALF_OPEN and failed, go back to OPEN
if (this.state === 'HALF_OPEN') {
this.state = 'OPEN';
this.openedAt = Date.now();
console.log('Circuit breaker: OPEN — service still failing');
}
// If enough failures in window, open the circuit
if (this.state === 'CLOSED' && this.#recentFailures() >= this.failureThreshold) {
this.state = 'OPEN';
this.openedAt = Date.now();
console.log(`Circuit breaker: OPEN — ${this.failureThreshold} failures in window`);
}
throw error;
}
}
#recordFailure() {
this.failures.push(Date.now());
}
#recentFailures() {
const cutoff = Date.now() - this.monitorWindowMs;
this.failures = this.failures.filter(t => t > cutoff);
return this.failures.length;
}
}
// Usage
const breaker = new CircuitBreaker({
failureThreshold: 5,
cooldownMs: 30000
});
async function safeApiCall(messages) {
return breaker.call(() =>
callWithRetry(() =>
openai.chat.completions.create({
model: "gpt-4o",
messages,
max_tokens: 1000
}),
{ maxRetries: 2 }
)
);
}
8. Complete Production Error Handling
Combining all patterns into a production-ready API wrapper:
import OpenAI from 'openai';
import pLimit from 'p-limit';
class LLMClient {
constructor({
apiKey,
maxRetries = 3,
timeoutMs = 30000,
maxConcurrent = 10,
circuitBreakerThreshold = 5,
circuitBreakerCooldownMs = 30000
} = {}) {
this.openai = new OpenAI({ apiKey, timeout: timeoutMs });
this.maxRetries = maxRetries;
this.concurrencyLimit = pLimit(maxConcurrent);
this.circuitBreaker = new CircuitBreaker({
failureThreshold: circuitBreakerThreshold,
cooldownMs: circuitBreakerCooldownMs
});
}
async complete(params) {
return this.concurrencyLimit(() =>
this.circuitBreaker.call(() =>
this.#callWithRetry(params)
)
);
}
async #callWithRetry(params) {
let lastError;
for (let attempt = 0; attempt <= this.maxRetries; attempt++) {
try {
const response = await this.openai.chat.completions.create(params);
// Warn on truncation
if (response.choices[0]?.finish_reason === 'length') {
console.warn('Response truncated — consider increasing max_tokens');
}
return response;
} catch (error) {
lastError = error;
const status = error?.status;
// Non-retryable errors
if (status && [400, 401, 403, 404, 422].includes(status)) {
throw error;
}
if (attempt < this.maxRetries) {
const delay = this.#calculateDelay(attempt, error);
console.warn(`Attempt ${attempt + 1} failed (${status}). Retrying in ${delay}ms...`);
await new Promise(r => setTimeout(r, delay));
}
}
}
throw lastError;
}
#calculateDelay(attempt, error) {
const baseDelay = 1000 * Math.pow(2, attempt);
const jitter = Math.random() * 1000;
const maxDelay = 60000;
// Respect retry-after header
const retryAfter = error?.response?.headers?.['retry-after'];
const retryAfterMs = retryAfter ? parseFloat(retryAfter) * 1000 : 0;
return Math.min(Math.max(baseDelay + jitter, retryAfterMs), maxDelay);
}
}
// Usage
const llm = new LLMClient({
apiKey: process.env.OPENAI_API_KEY,
maxRetries: 3,
timeoutMs: 30000,
maxConcurrent: 10
});
// Single request — fully protected
const response = await llm.complete({
model: "gpt-4o",
messages: [{ role: "user", content: "Hello" }],
max_tokens: 500
});
// Batch processing — concurrency-limited, with retries and circuit breaker
const items = ['item1', 'item2', /* ...hundreds of items */];
const results = await Promise.all(
items.map(item =>
llm.complete({
model: "gpt-4o",
messages: [{ role: "user", content: `Process: ${item}` }],
max_tokens: 200
}).catch(error => ({ error: error.message, item })) // Graceful per-item failure
)
);
9. Production Error Handling Checklist
Use this checklist for every LLM integration you ship:
| # | Check | Status |
|---|---|---|
| 1 | Set explicit timeout on the API client | |
| 2 | Implement exponential backoff with jitter for 429/5xx | |
| 3 | Respect retry-after header from 429 responses | |
| 4 | Do NOT retry 400/401/403/404 errors | |
| 5 | Set max_tokens explicitly (don't rely on defaults) | |
| 6 | Check finish_reason for truncation ("length") | |
| 7 | Limit concurrent requests to stay within RPM/TPM limits | |
| 8 | Implement circuit breaker for persistent outages | |
| 9 | Log every API call: model, tokens, latency, status, cost | |
| 10 | Set up alerts for error rate spikes (> 1% error rate) | |
| 11 | Have a fallback model (e.g., GPT-4o-mini if GPT-4o fails) | |
| 12 | Validate API response structure before using it | |
| 13 | Handle empty or null responses gracefully | |
| 14 | Monitor rate limit headers to proactively throttle | |
| 15 | Test error handling with simulated failures | |
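Checklist item 15 can be exercised without touching the real API. A minimal sketch: a fake endpoint fails twice with 429 and then succeeds, driven through a stand-in retry loop (same shape as the `callWithRetry` helper from section 3; `retry` and `flakyApi` are names invented for this test):

```javascript
// Minimal retry loop (stand-in for callWithRetry) for fault-injection testing.
async function retry(fn, { maxRetries = 5, baseDelayMs = 10 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt >= maxRetries || error.status !== 429) throw error;
      await new Promise(r => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
}

async function main() {
  let calls = 0;
  // Fake API: two 429s, then success
  const flakyApi = async () => {
    calls += 1;
    if (calls < 3) throw Object.assign(new Error('rate limited'), { status: 429 });
    return { ok: true };
  };
  const result = await retry(flakyApi);
  console.log(`succeeded after ${calls} calls`, result); // succeeds on the 3rd call
}
main();
```

The same fake can be pointed at your production wrapper to verify it retries 429s, gives up after `maxRetries`, and fails fast on 400s.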
10. Fallback Model Strategy
When your primary model is rate-limited or down, fall back to an alternative:
async function completeWithFallback(messages, maxTokens = 1000) {
const models = [
{ name: 'gpt-4o', timeout: 30000 },
{ name: 'gpt-4o-mini', timeout: 15000 }, // Faster, cheaper fallback
];
for (const model of models) {
try {
const response = await callWithRetry(
() => openai.chat.completions.create({
model: model.name,
messages,
max_tokens: maxTokens
}),
{ maxRetries: 2 }
);
return { response, model: model.name, fallback: model.name !== models[0].name };
} catch (error) {
console.warn(`${model.name} failed: ${error.message}. Trying next...`);
}
}
throw new Error('All models failed — service unavailable');
}
11. Key Takeaways
- Rate limits are per-minute (RPM and TPM) — you can hit either. Monitor both.
- 429 = temporary — always retry with exponential backoff + jitter. Never retry 400-level client errors.
- Respect retry-after — the header tells you exactly when to retry. Use it.
- Limit concurrency — sending 1,000 parallel requests will instantly hit rate limits. Use semaphores or p-limit.
- Circuit breakers prevent cascading failure — stop hammering a down API and recover gracefully.
- Always set timeouts — LLM calls can take 5-60+ seconds. Without timeouts, your app hangs.
- Have fallback models — when GPT-4o is overloaded, GPT-4o-mini keeps your app running.
- Log everything — model, tokens, latency, status code, cost. You can't debug what you don't log.
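The "log everything" takeaway can be as small as a wrapper that emits one structured line per call. A sketch (the field names under `usage` follow the OpenAI response shape; `loggedCall`, `label`, and the log format are arbitrary choices for illustration):

```javascript
// Illustrative logging wrapper: one structured JSON line per API call,
// covering model, token usage, latency, and outcome.
async function loggedCall(label, fn) {
  const start = Date.now();
  try {
    const response = await fn();
    console.log(JSON.stringify({
      label,
      model: response.model,
      promptTokens: response.usage?.prompt_tokens,
      completionTokens: response.usage?.completion_tokens,
      latencyMs: Date.now() - start,
      status: 'ok',
    }));
    return response;
  } catch (error) {
    console.log(JSON.stringify({
      label,
      latencyMs: Date.now() - start,
      status: error?.status ?? 'error',
      message: error.message,
    }));
    throw error; // log, then let the caller's retry/fallback logic decide
  }
}
```

Wrap every call site (`loggedCall('summarize', () => llm.complete(params))`) and the resulting log stream gives you latency percentiles, error rates, and per-feature token spend for free.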
Explain-It Challenge
- Your batch processing script sends 1,000 API requests and 40% of them fail with 429 errors. Explain what's happening and how you would redesign the script.
- Draw a timeline showing what happens when 100 clients all get a 429 error at the same time — first without jitter, then with jitter. Why does jitter matter?
- A team says "we don't need a circuit breaker — we have retries." Explain the scenario where retries alone make things worse and a circuit breaker would help.
Navigation: ← 4.2.c — Cost Awareness · 4.2 Overview →