Episode 4 — Generative AI Engineering / 4.2 — Calling LLM APIs Properly

4.2.d — Rate Limits and Retries

In one sentence: LLM APIs enforce rate limits (requests per minute, tokens per minute) to manage capacity — production applications must handle 429 errors gracefully with exponential backoff, retry strategies, concurrency control, and circuit breakers to stay reliable under pressure.

Navigation: ← 4.2.c — Cost Awareness · 4.2 Overview →


1. What Are Rate Limits?

Rate limits are caps on how many API requests you can make within a time window. Providers impose them to:

  • Protect infrastructure — prevent any single customer from overwhelming shared servers
  • Ensure fair access — distribute capacity across all customers
  • Prevent abuse — stop runaway scripts or misconfigured loops

Types of rate limits

Limit Type          | Abbreviation | What It Measures                        | Example
Requests Per Minute | RPM          | Number of API calls                     | 500 RPM
Tokens Per Minute   | TPM          | Total tokens processed (input + output) | 200,000 TPM
Requests Per Day    | RPD          | Daily call volume                       | 10,000 RPD
Tokens Per Day      | TPD          | Daily token volume                      | 40,000,000 TPD

Typical rate limits by tier

Provider/Tier   | RPM    | TPM        | Notes
OpenAI Free     | 3      | 40,000     | Very restrictive
OpenAI Tier 1   | 500    | 200,000    | After first payment
OpenAI Tier 3   | 5,000  | 2,000,000  | After $100+ spend
OpenAI Tier 5   | 10,000 | 30,000,000 | Enterprise level
Anthropic Build | 50     | 40,000     | Starting tier
Anthropic Scale | 4,000  | 400,000    | Higher tier

Key insight: You can hit rate limits from either RPM or TPM. Sending many small requests hits RPM; sending few large requests (with big prompts) hits TPM.

Scenario A: Hit RPM limit
  501 requests with 100 tokens each = 501 RPM (over 500 limit)
  Total tokens: 50,100 (well under 200K TPM)

Scenario B: Hit TPM limit
  50 requests with 5,000 tokens each = 50 RPM (under 500 limit)
  Total tokens: 250,000 (over 200K TPM limit)

Both result in 429 errors, but for different reasons!
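The arithmetic in these scenarios can be captured in a small helper. A sketch (the default limits are the Tier 1 figures from the table above; `bindingLimit` is a name invented here):

```javascript
// Given a planned batch, report which per-minute limit binds first.
// Default limits are the OpenAI Tier 1 figures used in the scenarios above.
function bindingLimit(requestsPerMinute, tokensPerRequest, limits = { rpm: 500, tpm: 200000 }) {
  const tokensPerMinute = requestsPerMinute * tokensPerRequest;
  const overRpm = requestsPerMinute > limits.rpm;
  const overTpm = tokensPerMinute > limits.tpm;
  if (overRpm && overTpm) return 'both';
  if (overRpm) return 'rpm';
  if (overTpm) return 'tpm';
  return 'ok';
}

// Scenario A: many small requests, RPM binds
console.log(bindingLimit(501, 100));   // 'rpm'  (50,100 tokens, under TPM)
// Scenario B: few large requests, TPM binds
console.log(bindingLimit(50, 5000));   // 'tpm'  (250,000 tokens, over TPM)
```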

2. The 429 Error: Too Many Requests

When you exceed a rate limit, the API returns an HTTP 429 status code:

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
retry-after: 2

{
  "error": {
    "message": "Rate limit reached for gpt-4o. Limit: 500 requests per minute. Please try again in 1.2s.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}

Important response headers

Header                         | Purpose                         | Example
retry-after                    | Seconds to wait before retrying | 2
x-ratelimit-limit-requests     | Your RPM limit                  | 500
x-ratelimit-remaining-requests | Remaining requests this minute  | 0
x-ratelimit-limit-tokens       | Your TPM limit                  | 200000
x-ratelimit-remaining-tokens   | Remaining tokens this minute    | 0
x-ratelimit-reset-requests     | Time until RPM resets           | 1.2s
x-ratelimit-reset-tokens       | Time until TPM resets           | 4.5s

Always read these headers — they tell you exactly when you can retry and how much capacity remains.
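For illustration, here is one way to turn those headers into a wait decision. This is a sketch over a plain lowercase-keyed header object, not tied to any particular HTTP client; `waitMsFromHeaders` is a name invented here:

```javascript
// Decide how long to wait before the next request, based on the
// rate-limit headers shown above.
function waitMsFromHeaders(headers) {
  // retry-after is authoritative when present (value is in seconds)
  const retryAfter = parseFloat(headers['retry-after']);
  if (!Number.isNaN(retryAfter)) return retryAfter * 1000;

  // Otherwise, if either budget is exhausted, wait for its reset.
  // Reset headers look like "1.2s" or "4.5s"; parseFloat ignores the suffix.
  const parseReset = s => (s ? parseFloat(s) * 1000 : 0);
  const outOfRequests = headers['x-ratelimit-remaining-requests'] === '0';
  const outOfTokens = headers['x-ratelimit-remaining-tokens'] === '0';
  let waitMs = 0;
  if (outOfRequests) waitMs = Math.max(waitMs, parseReset(headers['x-ratelimit-reset-requests']));
  if (outOfTokens) waitMs = Math.max(waitMs, parseReset(headers['x-ratelimit-reset-tokens']));
  return waitMs;
}

console.log(waitMsFromHeaders({ 'retry-after': '2' }));  // 2000
console.log(waitMsFromHeaders({ 'x-ratelimit-remaining-tokens': '0',
                                'x-ratelimit-reset-tokens': '4.5s' }));  // 4500
```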


3. Exponential Backoff with Jitter

The standard retry strategy for 429 errors is exponential backoff with jitter — wait progressively longer between retries, with a random component to prevent "thundering herd" problems.

Why not just retry immediately?

Without backoff:
  Request → 429 → Retry immediately → 429 → Retry immediately → 429 → ...
  (hammering the API, making things worse)

With exponential backoff:
  Request → 429 → Wait 1s → Retry → 429 → Wait 2s → Retry → 429 → Wait 4s → Retry → Success

With exponential backoff + jitter:
  Request → 429 → Wait 1.3s → Retry → 429 → Wait 2.7s → Retry → Success
  (random jitter prevents all clients from retrying at the same instant)

Implementation

async function callWithRetry(apiCallFn, {
  maxRetries = 5,
  baseDelayMs = 1000,
  maxDelayMs = 60000,
  retryableStatuses = [429, 500, 502, 503, 529]
} = {}) {
  let lastError;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await apiCallFn();
    } catch (error) {
      lastError = error;

      // Don't retry non-retryable errors
      const status = error?.status || error?.response?.status;
      if (status && !retryableStatuses.includes(status)) {
        throw error;  // 400, 401, 403, 404 — don't retry
      }

      // Don't retry if we've exhausted attempts
      if (attempt === maxRetries) {
        throw new Error(`Failed after ${maxRetries + 1} attempts: ${error.message}`);
      }

      // Calculate delay with exponential backoff + jitter
      const exponentialDelay = baseDelayMs * Math.pow(2, attempt);
      const jitter = Math.random() * baseDelayMs;  // Random 0 to baseDelay
      const delay = Math.min(exponentialDelay + jitter, maxDelayMs);

      // Use retry-after header if available
      const retryAfter = error?.response?.headers?.['retry-after'];
      const retryAfterMs = retryAfter ? parseFloat(retryAfter) * 1000 : 0;
      const finalDelay = Math.max(delay, retryAfterMs);

      console.warn(
        `Attempt ${attempt + 1} failed (${status}). ` +
        `Retrying in ${(finalDelay / 1000).toFixed(1)}s...`
      );

      await new Promise(resolve => setTimeout(resolve, finalDelay));
    }
  }

  throw lastError;
}

// Usage
const response = await callWithRetry(
  () => openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: "Hello" }],
    max_tokens: 100
  }),
  { maxRetries: 3, baseDelayMs: 1000 }
);

Backoff progression

Attempt 0: Immediate (first try)
Attempt 1: ~1.0-2.0s  (1000 * 2^0 + jitter)
Attempt 2: ~2.0-3.0s  (1000 * 2^1 + jitter)
Attempt 3: ~4.0-5.0s  (1000 * 2^2 + jitter)
Attempt 4: ~8.0-9.0s  (1000 * 2^3 + jitter)
Attempt 5: ~16.0-17.0s (1000 * 2^4 + jitter)

Total max wait before giving up: ~31-36 seconds
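The same math as `callWithRetry`, isolated as a pure function, makes these bounds easy to unit-test (the injectable `random` parameter is an addition here, for determinism):

```javascript
// Delay after failed attempt N (0-based), matching the progression above:
// base * 2^attempt, plus up to one base of jitter, capped at maxMs.
function backoffDelayMs(attempt, baseMs = 1000, maxMs = 60000, random = Math.random) {
  const exponential = baseMs * Math.pow(2, attempt);
  const jitter = random() * baseMs;
  return Math.min(exponential + jitter, maxMs);
}

// With jitter pinned to 0 and 1, the bounds match the table:
console.log(backoffDelayMs(0, 1000, 60000, () => 0)); // 1000  (attempt 1, lower bound)
console.log(backoffDelayMs(4, 1000, 60000, () => 1)); // 17000 (attempt 5, upper bound)
```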

4. Which Errors to Retry

Not all errors should be retried. Retrying a 400 Bad Request wastes time and money.

Status Code | Meaning               | Retry? | Why
400         | Bad Request           | No     | Your request is malformed — fix it
401         | Unauthorized          | No     | Invalid API key — fix credentials
403         | Forbidden             | No     | Permission denied — check access
404         | Not Found             | No     | Wrong endpoint or model name
422         | Unprocessable         | No     | Invalid parameters — fix request
429         | Rate Limited          | Yes    | Temporary — wait and retry
500         | Internal Server Error | Yes    | Provider issue — may resolve
502         | Bad Gateway           | Yes    | Temporary infrastructure issue
503         | Service Unavailable   | Yes    | Provider overloaded — wait and retry
529         | Overloaded            | Yes    | Anthropic-specific — system at capacity
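This table collapses into a small predicate you can share between your retry loop and your logging. A sketch (treating any other 5xx as retryable is a judgment call here, not something the table mandates):

```javascript
// Classify an HTTP status per the table above.
const RETRYABLE = new Set([429, 500, 502, 503, 529]);
const NON_RETRYABLE = new Set([400, 401, 403, 404, 422]);

function isRetryable(status) {
  if (status == null) return true;               // network error / timeout: retry
  if (NON_RETRYABLE.has(status)) return false;   // client errors: fix the request
  return RETRYABLE.has(status) || status >= 500; // other 5xx: assume transient
}

console.log(isRetryable(429)); // true
console.log(isRetryable(401)); // false
```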

5. Timeout Handling

LLM API calls can take 5-60+ seconds depending on model, prompt length, and output length. Always set timeouts.

import OpenAI from 'openai';

// Set client-level timeout
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  timeout: 30000  // 30 seconds (in milliseconds)
});

// Or per-request timeout using AbortController
async function callWithTimeout(messages, timeoutMs = 30000) {
  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), timeoutMs);

  try {
    const response = await openai.chat.completions.create(
      {
        model: "gpt-4o",
        messages,
        max_tokens: 1000
      },
      { signal: controller.signal }
    );
    return response;
  } catch (error) {
    // The OpenAI SDK surfaces an abort as APIUserAbortError; plain fetch uses AbortError
    if (error.name === 'AbortError' || error.name === 'APIUserAbortError') {
      throw new Error(`API call timed out after ${timeoutMs}ms`);
    }
    throw error;
  } finally {
    clearTimeout(timeoutId);
  }
}

Timeout recommendations

Scenario                         | Recommended Timeout         | Why
Short responses (classification) | 10-15s                      | Should complete quickly
Medium responses (chat)          | 30s                         | Standard conversational response
Long responses (code generation) | 60s                         | Complex generation takes time
Streaming responses              | 10s initial, then per-chunk | First token should arrive quickly

6. Concurrent Request Management

When processing many items in parallel, you need to limit concurrency to stay within rate limits.

Token bucket / Semaphore pattern

class RateLimiter {
  constructor(maxConcurrent = 10, requestsPerMinute = 500) {
    this.maxConcurrent = maxConcurrent;
    this.running = 0;
    this.queue = [];
    this.requestTimestamps = [];
    this.requestsPerMinute = requestsPerMinute;
  }

  async acquire() {
    // Wait until we're under concurrent limit
    while (this.running >= this.maxConcurrent) {
      await new Promise(resolve => this.queue.push(resolve));
    }

    // Wait until RPM limit allows
    await this.#waitForRpmSlot();

    this.running++;
    this.requestTimestamps.push(Date.now());
  }

  release() {
    this.running--;
    if (this.queue.length > 0) {
      const next = this.queue.shift();
      next();
    }
  }

  async #waitForRpmSlot() {
    // Loop rather than wait once: while we slept, other callers
    // may have refilled the window, so re-check before proceeding
    while (true) {
      const oneMinuteAgo = Date.now() - 60000;

      // Remove timestamps older than 1 minute
      this.requestTimestamps = this.requestTimestamps.filter(t => t > oneMinuteAgo);

      // Under the RPM limit: a slot is free
      if (this.requestTimestamps.length < this.requestsPerMinute) return;

      // At the limit: wait until the oldest request leaves the window (plus a 100ms buffer)
      const waitTime = this.requestTimestamps[0] - oneMinuteAgo + 100;
      await new Promise(resolve => setTimeout(resolve, waitTime));
    }
  }
}

// Usage
const limiter = new RateLimiter(10, 500);  // 10 concurrent, 500 RPM

async function processItem(item) {
  await limiter.acquire();
  try {
    return await openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: item }],
      max_tokens: 500
    });
  } finally {
    limiter.release();
  }
}

// Process 1000 items with controlled concurrency
const items = [/* ...1000 items */];
const results = await Promise.all(items.map(processItem));

Simple p-limit approach

import pLimit from 'p-limit';

const limit = pLimit(10);  // Max 10 concurrent requests

const items = ['item1', 'item2', /* ...1000 items */];

const results = await Promise.all(
  items.map(item =>
    limit(() => openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: `Process: ${item}` }],
      max_tokens: 200
    }))
  )
);
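Both limiters above count requests, but as section 1 showed, TPM can bind first. One way to budget tokens in a sliding one-minute window, sketched with an injectable clock for testability (`TokenBudget` is a name invented here; token counts must be estimated up front, for example prompt characters / 4 plus `max_tokens`, and reconciled with `response.usage` afterwards if accuracy matters):

```javascript
// Sliding-window token budget to complement the RPM limiter above.
class TokenBudget {
  constructor(tokensPerMinute, now = Date.now) {
    this.tokensPerMinute = tokensPerMinute;
    this.now = now;          // injectable clock for testing
    this.entries = [];       // { at, tokens }
  }

  #used() {
    // Drop entries older than one minute, then sum what remains
    const cutoff = this.now() - 60000;
    this.entries = this.entries.filter(e => e.at > cutoff);
    return this.entries.reduce((sum, e) => sum + e.tokens, 0);
  }

  // Milliseconds to wait before a request of `tokens` fits (0 = go now)
  waitMsFor(tokens) {
    const used = this.#used();
    if (used + tokens <= this.tokensPerMinute) return 0;
    // Find when enough old entries expire to free the needed budget
    const needed = used + tokens - this.tokensPerMinute;
    let freed = 0;
    for (const e of this.entries) {
      freed += e.tokens;
      if (freed >= needed) return e.at + 60000 - this.now();
    }
    return 60000;  // window must fully drain
  }

  record(tokens) {
    this.entries.push({ at: this.now(), tokens });
  }
}
```

Before each call: `const ms = budget.waitMsFor(estimate); if (ms > 0) await sleep(ms); budget.record(estimate);` — used alongside, not instead of, the RPM limiter.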

7. Circuit Breaker Pattern

When an API is consistently failing, continuing to send requests wastes resources and delays recovery. A circuit breaker stops requests temporarily and resumes after a cooldown period.

Circuit States:

  CLOSED (normal)                        OPEN (failing)
  ┌───────────┐  failures > threshold  ┌───────────┐
  │  Requests │ ─────────────────────▶ │  Requests │
  │   pass    │                        │  blocked  │
  │  through  │                        │ (fast     │
  └───────────┘                        │  fail)    │
        ▲                              └───────────┘
        │                                    │
        │          cooldown elapsed          │
        │                                    ▼
        │                             ┌───────────┐
        └───────── success ────────── │ HALF_OPEN │
                                      │ (test one │
          failure ──────────────────▶ │  request) │
          (back to OPEN)              └───────────┘

Implementation

class CircuitBreaker {
  constructor({
    failureThreshold = 5,
    cooldownMs = 30000,
    monitorWindowMs = 60000
  } = {}) {
    this.state = 'CLOSED';  // CLOSED, OPEN, HALF_OPEN
    this.failures = [];
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.monitorWindowMs = monitorWindowMs;
    this.openedAt = null;
  }

  async call(fn) {
    // If OPEN, check if cooldown has passed
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt >= this.cooldownMs) {
        this.state = 'HALF_OPEN';
        console.log('Circuit breaker: HALF_OPEN — testing one request');
      } else {
        throw new Error('Circuit breaker is OPEN — request blocked');
      }
    }

    try {
      const result = await fn();

      // Success — reset if HALF_OPEN
      if (this.state === 'HALF_OPEN') {
        this.state = 'CLOSED';
        this.failures = [];
        console.log('Circuit breaker: CLOSED — service recovered');
      }

      return result;
    } catch (error) {
      this.#recordFailure();

      // If HALF_OPEN and failed, go back to OPEN
      if (this.state === 'HALF_OPEN') {
        this.state = 'OPEN';
        this.openedAt = Date.now();
        console.log('Circuit breaker: OPEN — service still failing');
      }

      // If enough failures in window, open the circuit
      if (this.state === 'CLOSED' && this.#recentFailures() >= this.failureThreshold) {
        this.state = 'OPEN';
        this.openedAt = Date.now();
        console.log(`Circuit breaker: OPEN — ${this.failureThreshold} failures in window`);
      }

      throw error;
    }
  }

  #recordFailure() {
    this.failures.push(Date.now());
  }

  #recentFailures() {
    const cutoff = Date.now() - this.monitorWindowMs;
    this.failures = this.failures.filter(t => t > cutoff);
    return this.failures.length;
  }
}

// Usage
const breaker = new CircuitBreaker({
  failureThreshold: 5,
  cooldownMs: 30000
});

async function safeApiCall(messages) {
  return breaker.call(() =>
    callWithRetry(() =>
      openai.chat.completions.create({
        model: "gpt-4o",
        messages,
        max_tokens: 1000
      }),
      { maxRetries: 2 }
    )
  );
}

8. Complete Production Error Handling

Combining all patterns into a production-ready API wrapper:

import OpenAI from 'openai';
import pLimit from 'p-limit';

class LLMClient {
  constructor({
    apiKey,
    maxRetries = 3,
    timeoutMs = 30000,
    maxConcurrent = 10,
    circuitBreakerThreshold = 5,
    circuitBreakerCooldownMs = 30000
  } = {}) {
    this.openai = new OpenAI({ apiKey, timeout: timeoutMs });
    this.maxRetries = maxRetries;
    this.concurrencyLimit = pLimit(maxConcurrent);
    this.circuitBreaker = new CircuitBreaker({
      failureThreshold: circuitBreakerThreshold,
      cooldownMs: circuitBreakerCooldownMs
    });
  }

  async complete(params) {
    return this.concurrencyLimit(() =>
      this.circuitBreaker.call(() =>
        this.#callWithRetry(params)
      )
    );
  }

  async #callWithRetry(params) {
    let lastError;

    for (let attempt = 0; attempt <= this.maxRetries; attempt++) {
      try {
        const response = await this.openai.chat.completions.create(params);

        // Warn on truncation
        if (response.choices[0]?.finish_reason === 'length') {
          console.warn('Response truncated — consider increasing max_tokens');
        }

        return response;
      } catch (error) {
        lastError = error;
        const status = error?.status;

        // Non-retryable errors
        if (status && [400, 401, 403, 404, 422].includes(status)) {
          throw error;
        }

        if (attempt < this.maxRetries) {
          const delay = this.#calculateDelay(attempt, error);
          console.warn(`Attempt ${attempt + 1} failed (${status}). Retrying in ${delay}ms...`);
          await new Promise(r => setTimeout(r, delay));
        }
      }
    }

    throw lastError;
  }

  #calculateDelay(attempt, error) {
    const baseDelay = 1000 * Math.pow(2, attempt);
    const jitter = Math.random() * 1000;
    const maxDelay = 60000;

    // Respect retry-after header
    const retryAfter = error?.response?.headers?.['retry-after'];
    const retryAfterMs = retryAfter ? parseFloat(retryAfter) * 1000 : 0;

    return Math.min(Math.max(baseDelay + jitter, retryAfterMs), maxDelay);
  }
}

// Usage
const llm = new LLMClient({
  apiKey: process.env.OPENAI_API_KEY,
  maxRetries: 3,
  timeoutMs: 30000,
  maxConcurrent: 10
});

// Single request — fully protected
const response = await llm.complete({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Hello" }],
  max_tokens: 500
});

// Batch processing — concurrency-limited, with retries and circuit breaker
const items = ['item1', 'item2', /* ...hundreds of items */];
const results = await Promise.all(
  items.map(item =>
    llm.complete({
      model: "gpt-4o",
      messages: [{ role: "user", content: `Process: ${item}` }],
      max_tokens: 200
    }).catch(error => ({ error: error.message, item }))  // Graceful per-item failure
  )
);

9. Production Error Handling Checklist

Use this checklist for every LLM integration you ship:

#  | Check                                                     | Status
1  | Set explicit timeout on the API client                    | [ ]
2  | Implement exponential backoff with jitter for 429/5xx     | [ ]
3  | Respect retry-after header from 429 responses             | [ ]
4  | Do NOT retry 400/401/403/404 errors                       | [ ]
5  | Set max_tokens explicitly (don't rely on defaults)        | [ ]
6  | Check finish_reason for truncation ("length")             | [ ]
7  | Limit concurrent requests to stay within RPM/TPM limits   | [ ]
8  | Implement circuit breaker for persistent outages          | [ ]
9  | Log every API call: model, tokens, latency, status, cost  | [ ]
10 | Set up alerts for error rate spikes (> 1% error rate)     | [ ]
11 | Have a fallback model (e.g., GPT-4o-mini if GPT-4o fails) | [ ]
12 | Validate API response structure before using it           | [ ]
13 | Handle empty or null responses gracefully                 | [ ]
14 | Monitor rate limit headers to proactively throttle        | [ ]
15 | Test error handling with simulated failures               | [ ]
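Checklist items 12 and 13 deserve a concrete shape. A defensive extractor for chat-completion-style responses might look like this (`extractContent` is a name invented here for the sketch):

```javascript
// Validate structure before trusting it, and surface truncation
// instead of silently returning partial text (checklist items 12-13).
function extractContent(response) {
  const choice = response?.choices?.[0];
  if (!choice) {
    throw new Error('Malformed response: no choices array');
  }
  const content = choice.message?.content;
  if (content == null || content.trim() === '') {
    throw new Error('Empty response from model');
  }
  return { content, truncated: choice.finish_reason === 'length' };
}

// Usage with a mock response:
const ok = extractContent({ choices: [{ message: { content: 'Hi' }, finish_reason: 'stop' }] });
console.log(ok); // { content: 'Hi', truncated: false }
```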

10. Fallback Model Strategy

When your primary model is rate-limited or down, fall back to an alternative:

async function completeWithFallback(messages, maxTokens = 1000) {
  const models = [
    { name: 'gpt-4o', timeout: 30000 },
    { name: 'gpt-4o-mini', timeout: 15000 },    // Faster, cheaper fallback
  ];

  for (const model of models) {
    try {
      const response = await callWithRetry(
        () => openai.chat.completions.create({
          model: model.name,
          messages,
          max_tokens: maxTokens
        }),
        { maxRetries: 2 }
      );
      return { response, model: model.name, fallback: model.name !== models[0].name };
    } catch (error) {
      console.warn(`${model.name} failed: ${error.message}. Trying next...`);
    }
  }

  throw new Error('All models failed — service unavailable');
}

11. Key Takeaways

  1. Rate limits are per-minute (RPM and TPM) — you can hit either. Monitor both.
  2. 429 = temporary — always retry with exponential backoff + jitter. Never retry 400-level client errors.
  3. Respect retry-after — the header tells you exactly when to retry. Use it.
  4. Limit concurrency — sending 1,000 parallel requests will instantly hit rate limits. Use semaphores or p-limit.
  5. Circuit breakers prevent cascading failure — stop hammering a down API and recover gracefully.
  6. Always set timeouts — LLM calls can take 5-60+ seconds. Without timeouts, your app hangs.
  7. Have fallback models — when GPT-4o is overloaded, GPT-4o-mini keeps your app running.
  8. Log everything — model, tokens, latency, status code, cost. You can't debug what you don't log.

Explain-It Challenge

  1. Your batch processing script sends 1,000 API requests and 40% of them fail with 429 errors. Explain what's happening and how you would redesign the script.
  2. Draw a timeline showing what happens when 100 clients all get a 429 error at the same time — first without jitter, then with jitter. Why does jitter matter?
  3. A team says "we don't need a circuit breaker — we have retries." Explain the scenario where retries alone make things worse and a circuit breaker would help.

Navigation: ← 4.2.c — Cost Awareness · 4.2 Overview →