Episode 4 — Generative AI Engineering / 4.2 — Calling LLM APIs Properly
4.2.d — Rate Limits and Retries
In one sentence: LLM APIs enforce rate limits (requests per minute, tokens per minute) to manage capacity — production applications must handle 429 errors gracefully with exponential backoff, retry strategies, concurrency control, and circuit breakers to stay reliable under pressure.
Navigation: ← 4.2.c — Cost Awareness · 4.2 Overview →
1. What Are Rate Limits?
Rate limits are caps on how many API requests you can make within a time window. Providers impose them to:
- Protect infrastructure — prevent any single customer from overwhelming shared servers
- Ensure fair access — distribute capacity across all customers
- Prevent abuse — stop runaway scripts or misconfigured loops
Types of rate limits
| Limit Type | Abbreviation | What It Measures | Example |
|---|---|---|---|
| Requests Per Minute | RPM | Number of API calls | 500 RPM |
| Tokens Per Minute | TPM | Total tokens processed (input + output) | 200,000 TPM |
| Requests Per Day | RPD | Daily call volume | 10,000 RPD |
| Tokens Per Day | TPD | Daily token volume | 40,000,000 TPD |
Typical rate limits by tier (illustrative values; actual limits change over time, so check your provider's dashboard)
| Provider/Tier | RPM | TPM | Notes |
|---|---|---|---|
| OpenAI Free | 3 | 40,000 | Very restrictive |
| OpenAI Tier 1 | 500 | 200,000 | After first payment |
| OpenAI Tier 3 | 5,000 | 2,000,000 | After $100+ spend |
| OpenAI Tier 5 | 10,000 | 30,000,000 | Enterprise level |
| Anthropic Build | 50 | 40,000 | Starting tier |
| Anthropic Scale | 4,000 | 400,000 | Higher tier |
Key insight: You can hit rate limits from either RPM or TPM. Sending many small requests hits RPM; sending few large requests (with big prompts) hits TPM.
Scenario A: Hit RPM limit
501 requests with 100 tokens each = 501 RPM (over 500 limit)
Total tokens: 50,100 (well under 200K TPM)
Scenario B: Hit TPM limit
50 requests with 5,000 tokens each = 50 RPM (under 500 limit)
Total tokens: 250,000 (over 200K TPM limit)
Both result in 429 errors, but for different reasons!
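The two scenarios can be checked mechanically before a batch run. A small sketch (the `bindingLimit` helper is hypothetical; its defaults mirror the Tier 1 limits above):

```javascript
// Illustrative helper: given a batch plan, report which limit binds first.
// Default limits mirror the Tier 1 example (500 RPM, 200,000 TPM).
function bindingLimit({ requests, tokensPerRequest }, { rpm = 500, tpm = 200000 } = {}) {
  const totalTokens = requests * tokensPerRequest;
  if (requests > rpm && totalTokens > tpm) return 'both';
  if (requests > rpm) return 'rpm';
  if (totalTokens > tpm) return 'tpm';
  return 'none';
}

console.log(bindingLimit({ requests: 501, tokensPerRequest: 100 })); // Scenario A: 'rpm'
console.log(bindingLimit({ requests: 50, tokensPerRequest: 5000 })); // Scenario B: 'tpm'
```

Running a check like this against your planned batch size tells you which limit to engineer around before the 429s arrive.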
2. The 429 Error: Too Many Requests
When you exceed a rate limit, the API returns an HTTP 429 status code:
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
retry-after: 2
{
"error": {
"message": "Rate limit reached for gpt-4o. Limit: 500 requests per minute. Please try again in 1.2s.",
"type": "rate_limit_error",
"code": "rate_limit_exceeded"
}
}
Important response headers
| Header | Purpose | Example |
|---|---|---|
| retry-after | Seconds to wait before retrying | 2 |
| x-ratelimit-limit-requests | Your RPM limit | 500 |
| x-ratelimit-remaining-requests | Remaining requests this minute | 0 |
| x-ratelimit-limit-tokens | Your TPM limit | 200000 |
| x-ratelimit-remaining-tokens | Remaining tokens this minute | 0 |
| x-ratelimit-reset-requests | Time until RPM resets | 1.2s |
| x-ratelimit-reset-tokens | Time until TPM resets | 4.5s |
Always read these headers — they tell you exactly when you can retry and how much capacity remains.
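Reading the headers makes proactive throttling possible: slow down before you hit zero, not after. A sketch (the `shouldThrottle` helper and its 10% threshold are illustrative; the header names follow the x-ratelimit-* convention above):

```javascript
// Illustrative helper: decide from rate-limit headers whether to back off
// before the next request. Missing headers are treated as "no limit known".
function shouldThrottle(headers, thresholdFraction = 0.1) {
  const limit = Number(headers['x-ratelimit-limit-requests'] ?? Infinity);
  const remaining = Number(headers['x-ratelimit-remaining-requests'] ?? Infinity);
  const tokenLimit = Number(headers['x-ratelimit-limit-tokens'] ?? Infinity);
  const tokensLeft = Number(headers['x-ratelimit-remaining-tokens'] ?? Infinity);
  // Throttle when either budget (requests or tokens) drops below the threshold
  return remaining < limit * thresholdFraction ||
         tokensLeft < tokenLimit * thresholdFraction;
}

const headers = {
  'x-ratelimit-limit-requests': '500',
  'x-ratelimit-remaining-requests': '12',
  'x-ratelimit-limit-tokens': '200000',
  'x-ratelimit-remaining-tokens': '150000',
};
console.log(shouldThrottle(headers)); // true: only 12/500 requests left
```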
3. Exponential Backoff with Jitter
The standard retry strategy for 429 errors is exponential backoff with jitter — wait progressively longer between retries, with a random component to prevent "thundering herd" problems.
Why not just retry immediately?
Without backoff:
Request → 429 → Retry immediately → 429 → Retry immediately → 429 → ...
(hammering the API, making things worse)
With exponential backoff:
Request → 429 → Wait 1s → Retry → 429 → Wait 2s → Retry → 429 → Wait 4s → Retry → Success
With exponential backoff + jitter:
Request → 429 → Wait 1.3s → Retry → 429 → Wait 2.7s → Retry → Success
(random jitter prevents all clients from retrying at the same instant)
Implementation
async function callWithRetry(apiCallFn, {
maxRetries = 5,
baseDelayMs = 1000,
maxDelayMs = 60000,
retryableStatuses = [429, 500, 502, 503, 529]
} = {}) {
let lastError;
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
return await apiCallFn();
} catch (error) {
lastError = error;
// Don't retry non-retryable errors
const status = error?.status || error?.response?.status;
if (status && !retryableStatuses.includes(status)) {
throw error; // 400, 401, 403, 404 — don't retry
}
// Don't retry if we've exhausted attempts
if (attempt === maxRetries) {
throw new Error(`Failed after ${maxRetries + 1} attempts: ${error.message}`);
}
// Calculate delay with exponential backoff + jitter
const exponentialDelay = baseDelayMs * Math.pow(2, attempt);
const jitter = Math.random() * baseDelayMs; // Random 0 to baseDelay
const delay = Math.min(exponentialDelay + jitter, maxDelayMs);
// Use retry-after header if available
const retryAfter = error?.response?.headers?.['retry-after'];
const retryAfterMs = retryAfter ? parseFloat(retryAfter) * 1000 : 0;
const finalDelay = Math.max(delay, retryAfterMs);
console.warn(
`Attempt ${attempt + 1} failed (${status}). ` +
`Retrying in ${(finalDelay / 1000).toFixed(1)}s...`
);
await new Promise(resolve => setTimeout(resolve, finalDelay));
}
}
throw lastError;
}
// Usage
const response = await callWithRetry(
() => openai.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: "Hello" }],
max_tokens: 100
}),
{ maxRetries: 3, baseDelayMs: 1000 }
);
Backoff progression
Attempt 0: Immediate (first try)
Attempt 1: ~1.0-2.0s (1000 * 2^0 + jitter)
Attempt 2: ~2.0-3.0s (1000 * 2^1 + jitter)
Attempt 3: ~4.0-5.0s (1000 * 2^2 + jitter)
Attempt 4: ~8.0-9.0s (1000 * 2^3 + jitter)
Attempt 5: ~16.0-17.0s (1000 * 2^4 + jitter)
Total max wait before giving up: ~31-36 seconds
4. Which Errors to Retry
Not all errors should be retried. Retrying a 400 Bad Request wastes time and money.
| Status Code | Meaning | Retry? | Why |
|---|---|---|---|
| 400 | Bad Request | No | Your request is malformed — fix it |
| 401 | Unauthorized | No | Invalid API key — fix credentials |
| 403 | Forbidden | No | Permission denied — check access |
| 404 | Not Found | No | Wrong endpoint or model name |
| 422 | Unprocessable | No | Invalid parameters — fix request |
| 429 | Rate Limited | Yes | Temporary — wait and retry |
| 500 | Internal Server Error | Yes | Provider issue — may resolve |
| 502 | Bad Gateway | Yes | Temporary infrastructure issue |
| 503 | Service Unavailable | Yes | Provider overloaded — wait and retry |
| 529 | Overloaded | Yes | Anthropic-specific — system at capacity |
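The table can be encoded as a single predicate so the retry decision lives in one place. A sketch (the status sets mirror the table; for codes not listed, this falls back to "retry server errors, fail fast on client errors"):

```javascript
// Illustrative predicate encoding the retry table above.
const RETRYABLE = new Set([429, 500, 502, 503, 529]);
const NON_RETRYABLE = new Set([400, 401, 403, 404, 422]);

function isRetryable(status) {
  if (RETRYABLE.has(status)) return true;
  if (NON_RETRYABLE.has(status)) return false;
  // Unknown status: retry 5xx (provider-side), fail fast on 4xx (our fault)
  return status >= 500;
}

console.log(isRetryable(429)); // true
console.log(isRetryable(400)); // false
```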
5. Timeout Handling
LLM API calls can take 5-60+ seconds depending on model, prompt length, and output length. Always set timeouts.
import OpenAI from 'openai';
// Set client-level timeout
const openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY,
timeout: 30000 // 30 seconds (in milliseconds)
});
// Or per-request timeout using AbortController
async function callWithTimeout(messages, timeoutMs = 30000) {
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), timeoutMs);
try {
const response = await openai.chat.completions.create(
{
model: "gpt-4o",
messages,
max_tokens: 1000
},
{ signal: controller.signal }
);
return response;
} catch (error) {
if (error.name === 'AbortError' || error instanceof OpenAI.APIUserAbortError) {
// Note: the openai SDK wraps aborts in its own APIUserAbortError
throw new Error(`API call timed out after ${timeoutMs}ms`);
}
throw error;
} finally {
clearTimeout(timeoutId);
}
}
Timeout recommendations
| Scenario | Recommended Timeout | Why |
|---|---|---|
| Short responses (classification) | 10-15s | Should complete quickly |
| Medium responses (chat) | 30s | Standard conversational response |
| Long responses (code generation) | 60s | Complex generation takes time |
| Streaming responses | 10s initial, then per-chunk | First token should arrive quickly |
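The streaming row deserves its own pattern: race each chunk read against a timer, with a generous budget for the first token and a tighter one per chunk. Everything below is illustrative: `withChunkTimeout` is a hypothetical helper, and with a real SDK you would pass the streaming response object in place of the simulated stream:

```javascript
// Illustrative per-chunk timeout wrapper for any async-iterable stream.
function withChunkTimeout(stream, { firstChunkMs = 10000, chunkMs = 5000 } = {}) {
  return (async function* () {
    const iterator = stream[Symbol.asyncIterator]();
    let deadline = firstChunkMs;
    while (true) {
      let timerId;
      const timeout = new Promise((_, reject) => {
        timerId = setTimeout(
          () => reject(new Error(`No chunk within ${deadline}ms`)), deadline);
      });
      let result;
      try {
        // Whichever settles first wins: the next chunk or the deadline
        result = await Promise.race([iterator.next(), timeout]);
      } finally {
        clearTimeout(timerId); // avoid a stray rejection after the race settles
      }
      if (result.done) return;
      yield result.value;
      deadline = chunkMs; // after the first token, apply the per-chunk budget
    }
  })();
}
```

Usage follows the normal `for await` shape: `for await (const chunk of withChunkTimeout(stream)) { ... }`, with the loop throwing if any chunk misses its deadline.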
6. Concurrent Request Management
When processing many items in parallel, you need to limit concurrency to stay within rate limits.
Token bucket / Semaphore pattern
class RateLimiter {
constructor(maxConcurrent = 10, requestsPerMinute = 500) {
this.maxConcurrent = maxConcurrent;
this.running = 0;
this.queue = [];
this.requestTimestamps = [];
this.requestsPerMinute = requestsPerMinute;
}
async acquire() {
// Wait until we're under concurrent limit
while (this.running >= this.maxConcurrent) {
await new Promise(resolve => this.queue.push(resolve));
}
// Wait until RPM limit allows
await this.#waitForRpmSlot();
this.running++;
this.requestTimestamps.push(Date.now());
}
release() {
this.running--;
if (this.queue.length > 0) {
const next = this.queue.shift();
next();
}
}
async #waitForRpmSlot() {
const now = Date.now();
const oneMinuteAgo = now - 60000;
// Remove timestamps older than 1 minute
this.requestTimestamps = this.requestTimestamps.filter(t => t > oneMinuteAgo);
// If at RPM limit, wait until oldest request expires
if (this.requestTimestamps.length >= this.requestsPerMinute) {
const waitTime = this.requestTimestamps[0] - oneMinuteAgo + 100;
await new Promise(resolve => setTimeout(resolve, waitTime));
}
}
}
// Usage
const limiter = new RateLimiter(10, 500); // 10 concurrent, 500 RPM
async function processItem(item) {
await limiter.acquire();
try {
return await openai.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: item }],
max_tokens: 500
});
} finally {
limiter.release();
}
}
// Process 1000 items with controlled concurrency
const items = [/* ...1000 items */];
const results = await Promise.all(items.map(processItem));
Simple p-limit approach
import pLimit from 'p-limit';
const limit = pLimit(10); // Max 10 concurrent requests
const items = ['item1', 'item2', /* ...1000 items */];
const results = await Promise.all(
items.map(item =>
limit(() => openai.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: `Process: ${item}` }],
max_tokens: 200
}))
)
);
7. Circuit Breaker Pattern
When an API is consistently failing, continuing to send requests wastes resources and delays recovery. A circuit breaker stops requests temporarily and resumes after a cooldown period.
Circuit States:
   CLOSED (normal)                        OPEN (failing)
   ┌──────────┐  failures > threshold   ┌──────────┐
   │ Requests │ ───────────────────────▶│ Requests │
   │ pass     │                         │ blocked  │
   │ through  │                         │ (fast    │
   └──────────┘                         │  fail)   │
        ▲                               └──────────┘
        │                                    │
        │                           cooldown elapsed
        │                                    ▼
        │                              ┌───────────┐
        └────────── success ────────── │ HALF-OPEN │
                                       │ (test 1   │
          failure ───────────────────▶ │  request) │
          (back to OPEN)               └───────────┘
Implementation
class CircuitBreaker {
constructor({
failureThreshold = 5,
cooldownMs = 30000,
monitorWindowMs = 60000
} = {}) {
this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
this.failures = [];
this.failureThreshold = failureThreshold;
this.cooldownMs = cooldownMs;
this.monitorWindowMs = monitorWindowMs;
this.openedAt = null;
}
async call(fn) {
// If OPEN, check if cooldown has passed
if (this.state === 'OPEN') {
if (Date.now() - this.openedAt >= this.cooldownMs) {
this.state = 'HALF_OPEN';
console.log('Circuit breaker: HALF_OPEN — testing one request');
} else {
throw new Error('Circuit breaker is OPEN — request blocked');
}
}
try {
const result = await fn();
// Success — reset if HALF_OPEN
if (this.state === 'HALF_OPEN') {
this.state = 'CLOSED';
this.failures = [];
console.log('Circuit breaker: CLOSED — service recovered');
}
return result;
} catch (error) {
this.#recordFailure();
// If HALF_OPEN and failed, go back to OPEN
if (this.state === 'HALF_OPEN') {
this.state = 'OPEN';
this.openedAt = Date.now();
console.log('Circuit breaker: OPEN — service still failing');
}
// If enough failures in window, open the circuit
if (this.state === 'CLOSED' && this.#recentFailures() >= this.failureThreshold) {
this.state = 'OPEN';
this.openedAt = Date.now();
console.log(`Circuit breaker: OPEN — ${this.failureThreshold} failures in window`);
}
throw error;
}
}
#recordFailure() {
this.failures.push(Date.now());
}
#recentFailures() {
const cutoff = Date.now() - this.monitorWindowMs;
this.failures = this.failures.filter(t => t > cutoff);
return this.failures.length;
}
}
// Usage
const breaker = new CircuitBreaker({
failureThreshold: 5,
cooldownMs: 30000
});
async function safeApiCall(messages) {
return breaker.call(() =>
callWithRetry(() =>
openai.chat.completions.create({
model: "gpt-4o",
messages,
max_tokens: 1000
}),
{ maxRetries: 2 }
)
);
}
8. Complete Production Error Handling
Combining all patterns into a production-ready API wrapper:
import OpenAI from 'openai';
import pLimit from 'p-limit';
class LLMClient {
constructor({
apiKey,
maxRetries = 3,
timeoutMs = 30000,
maxConcurrent = 10,
circuitBreakerThreshold = 5,
circuitBreakerCooldownMs = 30000
} = {}) {
this.openai = new OpenAI({ apiKey, timeout: timeoutMs });
this.maxRetries = maxRetries;
this.concurrencyLimit = pLimit(maxConcurrent);
this.circuitBreaker = new CircuitBreaker({
failureThreshold: circuitBreakerThreshold,
cooldownMs: circuitBreakerCooldownMs
});
}
async complete(params) {
return this.concurrencyLimit(() =>
this.circuitBreaker.call(() =>
this.#callWithRetry(params)
)
);
}
async #callWithRetry(params) {
let lastError;
for (let attempt = 0; attempt <= this.maxRetries; attempt++) {
try {
const response = await this.openai.chat.completions.create(params);
// Warn on truncation
if (response.choices[0]?.finish_reason === 'length') {
console.warn('Response truncated — consider increasing max_tokens');
}
return response;
} catch (error) {
lastError = error;
const status = error?.status;
// Non-retryable errors
if (status && [400, 401, 403, 404, 422].includes(status)) {
throw error;
}
if (attempt < this.maxRetries) {
const delay = this.#calculateDelay(attempt, error);
console.warn(`Attempt ${attempt + 1} failed (${status}). Retrying in ${delay}ms...`);
await new Promise(r => setTimeout(r, delay));
}
}
}
throw lastError;
}
#calculateDelay(attempt, error) {
const baseDelay = 1000 * Math.pow(2, attempt);
const jitter = Math.random() * 1000;
const maxDelay = 60000;
// Respect retry-after header
const retryAfter = error?.response?.headers?.['retry-after'];
const retryAfterMs = retryAfter ? parseFloat(retryAfter) * 1000 : 0;
return Math.min(Math.max(baseDelay + jitter, retryAfterMs), maxDelay);
}
}
// Usage
const llm = new LLMClient({
apiKey: process.env.OPENAI_API_KEY,
maxRetries: 3,
timeoutMs: 30000,
maxConcurrent: 10
});
// Single request — fully protected
const response = await llm.complete({
model: "gpt-4o",
messages: [{ role: "user", content: "Hello" }],
max_tokens: 500
});
// Batch processing — concurrency-limited, with retries and circuit breaker
const items = ['item1', 'item2', /* ...hundreds of items */];
const results = await Promise.all(
items.map(item =>
llm.complete({
model: "gpt-4o",
messages: [{ role: "user", content: `Process: ${item}` }],
max_tokens: 200
}).catch(error => ({ error: error.message, item })) // Graceful per-item failure
)
);
9. Production Error Handling Checklist
Use this checklist for every LLM integration you ship:
| # | Check | Status |
|---|---|---|
| 1 | Set explicit timeout on the API client | |
| 2 | Implement exponential backoff with jitter for 429/5xx | |
| 3 | Respect retry-after header from 429 responses | |
| 4 | Do NOT retry 400/401/403/404 errors | |
| 5 | Set max_tokens explicitly (don't rely on defaults) | |
| 6 | Check finish_reason for truncation ("length") | |
| 7 | Limit concurrent requests to stay within RPM/TPM limits | |
| 8 | Implement circuit breaker for persistent outages | |
| 9 | Log every API call: model, tokens, latency, status, cost | |
| 10 | Set up alerts for error rate spikes (> 1% error rate) | |
| 11 | Have a fallback model (e.g., GPT-4o-mini if GPT-4o fails) | |
| 12 | Validate API response structure before using it | |
| 13 | Handle empty or null responses gracefully | |
| 14 | Monitor rate limit headers to proactively throttle | |
| 15 | Test error handling with simulated failures | |
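Checklist item 15 can be exercised without touching the real API. A minimal sketch: a fake endpoint fails twice with 429 and then succeeds, driven through a stand-in retry loop (same shape as the `callWithRetry` helper from section 3; `retry` and `flakyApi` are names invented for this test):

```javascript
// Minimal retry loop (stand-in for callWithRetry) for fault-injection testing.
async function retry(fn, { maxRetries = 5, baseDelayMs = 10 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt >= maxRetries || error.status !== 429) throw error;
      await new Promise(r => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
}

async function main() {
  let calls = 0;
  // Fake API: two 429s, then success
  const flakyApi = async () => {
    calls += 1;
    if (calls < 3) throw Object.assign(new Error('rate limited'), { status: 429 });
    return { ok: true };
  };
  const result = await retry(flakyApi);
  console.log(`succeeded after ${calls} calls`, result); // succeeds on the 3rd call
}
main();
```

The same fake can be pointed at your production wrapper to verify it retries 429s, gives up after `maxRetries`, and fails fast on 400s.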
10. Fallback Model Strategy
When your primary model is rate-limited or down, fall back to an alternative:
async function completeWithFallback(messages, maxTokens = 1000) {
const models = [
{ name: 'gpt-4o', timeout: 30000 },
{ name: 'gpt-4o-mini', timeout: 15000 }, // Faster, cheaper fallback
];
for (const model of models) {
try {
const response = await callWithRetry(
() => openai.chat.completions.create({
model: model.name,
messages,
max_tokens: maxTokens
}),
{ maxRetries: 2 }
);
return { response, model: model.name, fallback: model.name !== models[0].name };
} catch (error) {
console.warn(`${model.name} failed: ${error.message}. Trying next...`);
}
}
throw new Error('All models failed — service unavailable');
}
11. Key Takeaways
- Rate limits are per-minute (RPM and TPM) — you can hit either. Monitor both.
- 429 = temporary — always retry with exponential backoff + jitter. Never retry 400-level client errors.
- Respect retry-after — the header tells you exactly when to retry. Use it.
- Limit concurrency — sending 1,000 parallel requests will instantly hit rate limits. Use semaphores or p-limit.
- Circuit breakers prevent cascading failure — stop hammering a down API and recover gracefully.
- Always set timeouts — LLM calls can take 5-60+ seconds. Without timeouts, your app hangs.
- Have fallback models — when GPT-4o is overloaded, GPT-4o-mini keeps your app running.
- Log everything — model, tokens, latency, status code, cost. You can't debug what you don't log.
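The "log everything" takeaway can be as small as a wrapper that emits one structured line per call. A sketch (the field names under `usage` follow the OpenAI response shape; `loggedCall`, `label`, and the log format are arbitrary choices for illustration):

```javascript
// Illustrative logging wrapper: one structured JSON line per API call,
// covering model, token usage, latency, and outcome.
async function loggedCall(label, fn) {
  const start = Date.now();
  try {
    const response = await fn();
    console.log(JSON.stringify({
      label,
      model: response.model,
      promptTokens: response.usage?.prompt_tokens,
      completionTokens: response.usage?.completion_tokens,
      latencyMs: Date.now() - start,
      status: 'ok',
    }));
    return response;
  } catch (error) {
    console.log(JSON.stringify({
      label,
      latencyMs: Date.now() - start,
      status: error?.status ?? 'error',
      message: error.message,
    }));
    throw error; // log, then let the caller's retry/fallback logic decide
  }
}
```

Wrap every call site (`loggedCall('summarize', () => llm.complete(params))`) and the resulting log stream gives you latency percentiles, error rates, and per-feature token spend for free.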
Explain-It Challenge
- Your batch processing script sends 1,000 API requests and 40% of them fail with 429 errors. Explain what's happening and how you would redesign the script.
- Draw a timeline showing what happens when 100 clients all get a 429 error at the same time — first without jitter, then with jitter. Why does jitter matter?
- A team says "we don't need a circuit breaker — we have retries." Explain the scenario where retries alone make things worse and a circuit breaker would help.
Navigation: ← 4.2.c — Cost Awareness · 4.2 Overview →