Episode 4 — Generative AI Engineering / 4.10 — Error Handling in AI Applications
4.10.c — Retry Mechanisms
In one sentence: Not every failed LLM call should be retried — you must distinguish transient errors (network blips, rate limits, server overload) from permanent errors (bad API key, invalid input), use exponential backoff with jitter to avoid thundering herds, set strict retry limits to control cost, and sometimes retry with a modified prompt when the model's output was the problem.
Navigation: ← 4.10.b — Partial Responses and Timeouts · 4.10.d — Logging AI Requests →
1. Why Retries Are Essential for LLM Applications
LLM APIs fail far more often than traditional REST APIs. Here's why:
Traditional REST API:
- Servers are stateless, requests are fast (~50ms)
- Failures are rare (<0.01%)
- Retries are a nice-to-have
LLM API:
- Generation takes 1-60+ seconds per request
- Multiple failure modes (rate limit, timeout, malformed output, server error)
- Combined failure rate can be 2-10% in production
- Cost per call is significant ($0.001-$0.10+)
- Retries are MANDATORY for production systems
The failure math
If a single LLM call has a 5% failure rate:
- 1 call: 95% success
- 2 calls: 95% + (5% × 95%) = 99.75% success (with 1 retry)
- 3 calls: 99.99% success (with 2 retries)
For a pipeline of 3 sequential LLM calls, each with 5% failure:
- Without retries: 0.95³ = 85.7% pipeline success
- With 1 retry each: 0.9975³ = 99.25% pipeline success
Retries turn an unreliable system into a reliable one.
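The arithmetic above generalizes to any failure rate and retry budget. A small helper (the function name is ours, purely illustrative) makes the relationship explicit:

```javascript
// Probability that a pipeline of `calls` sequential LLM calls all succeed,
// given a per-call failure rate `p` and `retries` extra attempts per call.
// A single call succeeds unless all (retries + 1) attempts fail: 1 - p^(retries+1).
function pipelineSuccess(p, calls, retries = 0) {
  const perCallSuccess = 1 - Math.pow(p, retries + 1);
  return Math.pow(perCallSuccess, calls);
}

console.log(pipelineSuccess(0.05, 3, 0)); // 0.857375 (85.7% without retries)
console.log(pipelineSuccess(0.05, 3, 1)); // ~0.99252 (99.25% with 1 retry each)
```

Note the assumption baked in: failures are independent. A provider-wide outage breaks that assumption, which is why circuit breakers (Section 7) exist.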
2. When to Retry vs When NOT to Retry
The most important decision in retry logic is whether to retry. Retrying a permanent error wastes money and time.
Retryable errors (transient)
| Error Type | HTTP Status | Why Retry Works |
|---|---|---|
| Rate limit | 429 | Wait and try again — the limit resets |
| Server error | 500 | Transient infrastructure issue |
| Bad gateway | 502 | Load balancer issue — next request may hit a healthy server |
| Service unavailable | 503 | Server overloaded — will recover |
| Gateway timeout | 504 | Request took too long — may succeed on retry |
| Connection error | N/A | Network blip — often resolves quickly |
| Timeout | N/A | Server was slow — may be faster on retry |
| Request timeout | 408 | Server gave up waiting for the request — usually succeeds on retry |
| Malformed output | 200 | Model returned bad JSON — retry often fixes it |
| Truncated output | 200 | finish_reason: "length" — retry with more tokens |
Non-retryable errors (permanent)
| Error Type | HTTP Status | Why Retry Fails |
|---|---|---|
| Authentication | 401 | Bad API key — retry won't fix it |
| Forbidden | 403 | No access — retry won't fix it |
| Not found | 404 | Wrong model name — retry won't fix it |
| Bad request | 400 | Malformed request — retry sends the same bad request |
| Content policy | 400/403 | Prompt violates policy — same prompt will be rejected again |
| Context too long | 400 | Input exceeds model's context — same input will fail again |
| Insufficient quota | 402 | Out of credits — need to add funds |
Decision function
/**
* Determine if an error is retryable.
* Returns { retryable: boolean, reason: string }
*/
function isRetryable(error) {
// Network errors — always retryable
if (error.code === 'ETIMEDOUT' || error.code === 'ECONNRESET' ||
error.code === 'ECONNREFUSED' || error.code === 'ENOTFOUND' ||
error.name === 'AbortError') {
return { retryable: true, reason: 'network_error' };
}
const status = error.status || error.statusCode;
// Retryable HTTP errors
if ([429, 500, 502, 503, 504, 408].includes(status)) {
return { retryable: true, reason: `http_${status}` };
}
// Non-retryable HTTP errors
if ([400, 401, 402, 403, 404, 422].includes(status)) {
return { retryable: false, reason: `http_${status}` };
}
// Unknown errors — default to not retryable (fail fast)
return { retryable: false, reason: 'unknown_error' };
}
3. Exponential Backoff with Jitter
When you retry, you must wait between attempts. If you retry immediately, you'll hit the same rate limit or overloaded server. Exponential backoff with jitter is the standard approach.
Why exponential backoff?
Scenario: API returns 429 (rate limited)
WITHOUT backoff (immediate retry):
Attempt 1: 429 → retry immediately
Attempt 2: 429 → retry immediately (server is STILL rate limited)
Attempt 3: 429 → retry immediately (making it WORSE)
Result: All attempts fail, you wasted time and annoyed the server
WITH exponential backoff:
Attempt 1: 429 → wait 1 second
Attempt 2: 429 → wait 2 seconds
Attempt 3: 429 → wait 4 seconds
Attempt 4: Success! (rate limit window has passed)
Why jitter?
Jitter adds randomness to the delay. Without it, if 100 clients all get rate limited at the same time, they all retry at the same time (1s, 2s, 4s), causing a thundering herd.
WITHOUT jitter (thundering herd):
Client A: retry at 1.000s, 2.000s, 4.000s
Client B: retry at 1.000s, 2.000s, 4.000s
Client C: retry at 1.000s, 2.000s, 4.000s
→ All 3 clients hit the server simultaneously every time
WITH jitter (spread out):
Client A: retry at 0.700s, 2.300s, 3.100s
Client B: retry at 1.200s, 1.800s, 4.500s
Client C: retry at 0.900s, 2.900s, 3.700s
→ Load is distributed over time
Implementation
/**
* Calculate exponential backoff delay with jitter.
*
* @param {number} attempt - Current attempt number (0-based)
* @param {object} options - Configuration
* @returns {number} Delay in milliseconds
*/
function calculateBackoff(attempt, options = {}) {
const {
baseDelayMs = 1000, // Start at 1 second
maxDelayMs = 60000, // Cap at 60 seconds
jitterFactor = 0.5 // ±50% randomness
} = options;
// Exponential: 1s, 2s, 4s, 8s, 16s, ...
const exponentialDelay = baseDelayMs * Math.pow(2, attempt);
// Cap at maximum
const cappedDelay = Math.min(exponentialDelay, maxDelayMs);
// Add jitter: ±50% of the delay
const jitter = cappedDelay * jitterFactor * (Math.random() * 2 - 1);
const finalDelay = Math.max(0, cappedDelay + jitter);
return Math.round(finalDelay);
}
// Example delays for attempts 0-5 (±50% jitter):
// Attempt 0: ~1000ms (500 - 1500ms with jitter)
// Attempt 1: ~2000ms (1000 - 3000ms)
// Attempt 2: ~4000ms (2000 - 6000ms)
// Attempt 3: ~8000ms (4000 - 12000ms)
// Attempt 4: ~16000ms (8000 - 24000ms)
// Attempt 5: ~32000ms (16000 - 48000ms)
Full jitter (recommended by AWS)
/**
* "Full jitter" strategy — recommended by AWS architecture blog.
* Delay = random(0, min(cap, base * 2^attempt))
* Provides maximum spread, reducing thundering herd more than equal jitter.
*/
function fullJitterBackoff(attempt, baseDelayMs = 1000, maxDelayMs = 60000) {
const exponentialDelay = baseDelayMs * Math.pow(2, attempt);
const cappedDelay = Math.min(exponentialDelay, maxDelayMs);
return Math.round(Math.random() * cappedDelay);
}
4. Building a Robust Retry Wrapper
Here's a production-ready retry wrapper that handles all the cases discussed above.
/**
* Retry wrapper for LLM API calls with exponential backoff.
*
* Features:
* - Exponential backoff with full jitter
* - Distinguishes retryable vs non-retryable errors
* - Respects Retry-After headers
* - Configurable max retries
* - Detailed logging of each attempt
* - Returns attempt metadata
*
* @param {Function} fn - Async function to retry
* @param {object} options - Configuration
* @returns {Promise<{result: any, attempts: number, totalDelayMs: number}>}
*/
async function retryWithBackoff(fn, options = {}) {
const {
maxRetries = 3,
baseDelayMs = 1000,
maxDelayMs = 60000,
onRetry = () => {}, // Callback for each retry
isRetryableError = isRetryable // Error classifier function
} = options;
let lastError = null;
let totalDelayMs = 0;
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
const result = await fn(attempt);
return {
result,
attempts: attempt + 1,
totalDelayMs
};
} catch (error) {
lastError = error;
// Check if we should retry
const retryDecision = isRetryableError(error);
if (!retryDecision.retryable) {
console.error(`Non-retryable error (${retryDecision.reason}):`, error.message);
throw error; // Don't retry — throw immediately
}
// Check if we've exhausted retries
if (attempt >= maxRetries) {
console.error(`All ${maxRetries + 1} attempts failed.`);
break;
}
// Calculate delay
let delayMs;
// Respect Retry-After header if present (delta-seconds; capped at maxDelayMs)
const retryAfterSec = Number(error.headers?.['retry-after']);
if (Number.isFinite(retryAfterSec) && retryAfterSec >= 0) {
delayMs = Math.min(retryAfterSec * 1000, maxDelayMs);
} else {
delayMs = fullJitterBackoff(attempt, baseDelayMs, maxDelayMs);
}
totalDelayMs += delayMs;
console.warn(
`Attempt ${attempt + 1}/${maxRetries + 1} failed ` +
`(${retryDecision.reason}). Retrying in ${delayMs}ms...`
);
onRetry({
attempt,
error,
reason: retryDecision.reason,
delayMs,
totalDelayMs
});
// Wait before retrying
await new Promise(resolve => setTimeout(resolve, delayMs));
}
}
// All retries exhausted
const enhancedError = new Error(
`All ${maxRetries + 1} attempts failed. Last error: ${lastError.message}`
);
enhancedError.lastError = lastError;
enhancedError.attempts = maxRetries + 1;
enhancedError.totalDelayMs = totalDelayMs;
throw enhancedError;
}
// (uses fullJitterBackoff, defined in Section 3 above)
// Usage
const response = await retryWithBackoff(
async (attempt) => {
return await openai.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: 'Extract data...' }],
max_tokens: 2000,
temperature: 0
});
},
{
maxRetries: 3,
baseDelayMs: 1000,
maxDelayMs: 30000,
onRetry: ({ attempt, reason, delayMs }) => {
console.log(`Retry ${attempt + 1}: ${reason}, waiting ${delayMs}ms`);
}
}
);
console.log(`Success after ${response.attempts} attempt(s)`);
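One subtlety the wrapper glosses over: per the HTTP spec (RFC 9110), Retry-After may carry either delta-seconds or an HTTP-date. A defensive parser (a sketch; the helper name is ours) handles both forms and caps the wait:

```javascript
// Parse a Retry-After header value into milliseconds.
// Accepts delta-seconds ("30") or an HTTP-date; returns null if unparseable.
function parseRetryAfterMs(value, maxMs = 60000) {
  if (value == null) return null;
  const seconds = Number(value);
  if (Number.isFinite(seconds)) {
    return Math.min(Math.max(0, seconds * 1000), maxMs);
  }
  const dateMs = Date.parse(value); // e.g. "Wed, 21 Oct 2026 07:28:00 GMT"
  if (!Number.isNaN(dateMs)) {
    return Math.min(Math.max(0, dateMs - Date.now()), maxMs);
  }
  return null; // unparseable: caller falls back to computed backoff
}
```

When this returns null, fall back to fullJitterBackoff as the wrapper already does.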
5. Retrying Malformed Output (Not Just Errors)
Sometimes the API call succeeds (HTTP 200) but the output is unusable — bad JSON, wrong schema, or nonsensical content. You need to retry these too.
Retry with the same prompt
/**
* Retry LLM call when the output fails validation.
* Handles both API errors AND output validation failures.
*/
async function retryUntilValid(messages, schema, options = {}) {
const {
maxRetries = 3,
model = 'gpt-4o',
temperature = 0
} = options;
// maxTokens must be `let`, not `const`: it is doubled below on truncation
let maxTokens = options.maxTokens ?? 4096;
for (let attempt = 0; attempt <= maxRetries; attempt++) {
// Make the API call (with error-level retries handled by retryWithBackoff)
const response = await retryWithBackoff(
() => openai.chat.completions.create({
model,
messages,
max_tokens: maxTokens,
temperature,
response_format: { type: 'json_object' }
}),
{ maxRetries: 2 } // Inner retries for API errors
);
const content = response.result.choices[0].message.content;
// Check for truncation
if (response.result.choices[0].finish_reason === 'length') {
console.warn(`Attempt ${attempt + 1}: Response truncated`);
maxTokens = Math.min(maxTokens * 2, 16384);
continue; // Retry with more tokens
}
// Try to parse and validate
try {
const data = JSON.parse(content);
const validation = schema.safeParse(data);
if (validation.success) {
return {
data: validation.data,
attempts: attempt + 1,
raw: content
};
}
console.warn(
`Attempt ${attempt + 1}: Schema validation failed:`,
validation.error.issues.map(i => i.message).join(', ')
);
} catch (parseError) {
console.warn(`Attempt ${attempt + 1}: JSON parse failed:`, parseError.message);
}
// If not the last attempt, continue to retry
}
throw new Error(`Failed to get valid response after ${maxRetries + 1} attempts`);
}
Retry with modified prompt (passing errors back)
This is the most powerful technique: when the model's output fails validation, include the validation error in the next prompt so the model can correct itself.
import { z } from 'zod';
/**
* Retry with validation error feedback.
* Each retry includes the previous error so the model can self-correct.
*/
async function retryWithFeedback(baseMessages, schema, options = {}) {
const {
maxRetries = 3,
model = 'gpt-4o',
maxTokens = 4096
} = options;
let currentMessages = [...baseMessages];
let lastOutput = null;
let lastErrors = null;
for (let attempt = 0; attempt <= maxRetries; attempt++) {
const response = await openai.chat.completions.create({
model,
messages: currentMessages,
max_tokens: maxTokens,
temperature: 0,
response_format: { type: 'json_object' }
});
const content = response.choices[0].message.content;
lastOutput = content;
// Try to parse and validate
let data;
try {
data = JSON.parse(content);
} catch (e) {
lastErrors = [`JSON parse error: ${e.message}`];
if (attempt < maxRetries) {
// Add error feedback to messages
currentMessages = [
...baseMessages,
{ role: 'assistant', content },
{
role: 'user',
content: `Your response was not valid JSON. Error: ${e.message}\n\nPlease fix the JSON and respond again. Return ONLY valid JSON.`
}
];
continue;
}
break;
}
const validation = schema.safeParse(data);
if (validation.success) {
return {
data: validation.data,
attempts: attempt + 1
};
}
// Format validation errors for feedback
lastErrors = validation.error.issues.map(issue =>
`Field "${issue.path.join('.')}": ${issue.message}`
);
if (attempt < maxRetries) {
// Include the validation errors in the next prompt
const errorFeedback = lastErrors.join('\n');
currentMessages = [
...baseMessages,
{ role: 'assistant', content },
{
role: 'user',
content: `Your JSON response had validation errors:\n${errorFeedback}\n\nPlease fix these errors and respond with corrected JSON only.`
}
];
console.warn(`Attempt ${attempt + 1}: Retrying with error feedback`);
}
}
throw new Error(
`Failed after ${maxRetries + 1} attempts. Last errors: ${lastErrors?.join('; ')}`
);
}
// Usage
const UserSchema = z.object({
name: z.string().min(1, 'Name is required'),
age: z.number().int().min(0).max(150),
email: z.string().email('Must be a valid email'),
role: z.enum(['admin', 'user', 'moderator'])
});
const result = await retryWithFeedback(
[
{ role: 'system', content: 'Extract user info as JSON: {name, age, email, role}' },
{ role: 'user', content: 'Alice is 30, alice@example.com, she moderates the forum' }
],
UserSchema,
{ maxRetries: 2 }
);
console.log(result.data);
// { name: 'Alice', age: 30, email: 'alice@example.com', role: 'moderator' }
6. Cost Awareness of Retries
Every retry costs money. You must balance reliability against cost.
The cost math
Single call cost:
Input: 1,000 tokens × $2.50/1M = $0.0025
Output: 500 tokens × $10.00/1M = $0.005
Total: $0.0075 per call
With retries (worst case, all 3 retries triggered):
4 calls × $0.0075 = $0.03 per request
At 100,000 requests/day:
Without retries: $750/day
With retries (5% failure rate; avg ~1.15 calls/request once output-validation retries are counted): $862.50/day
Worst case (every request exhausts all 3 retries): $3,000/day
With feedback retries (input grows each retry):
Attempt 1: 1,000 input + 500 output = $0.0075
Attempt 2: 2,500 input + 500 output = $0.0113 (includes previous output + error)
Attempt 3: 4,000 input + 500 output = $0.0150
Total: $0.0338 per 3-attempt sequence — 4.5x the single call cost
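Assuming failures are independent with rate p and a cap of r retries, the expected number of API calls per request is a truncated geometric series, which is useful for budgeting (the function name is ours):

```javascript
// Expected API calls per request: E = 1 + p + p^2 + ... + p^r = (1 - p^(r+1)) / (1 - p)
function expectedCalls(p, maxRetries) {
  return (1 - Math.pow(p, maxRetries + 1)) / (1 - p);
}

console.log(expectedCalls(0.05, 3).toFixed(4)); // "1.0526"
```

API-error retries alone add only ~5% to cost at a 5% failure rate; the ~1.15 average above is higher because it also counts output-validation retries, whose feedback prompts grow the input on each attempt.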
Cost control strategies
/**
* Cost-aware retry wrapper.
* Tracks cost per attempt and stops if total cost exceeds budget.
*/
async function costAwareRetry(fn, options = {}) {
const {
maxRetries = 3,
maxCostCents = 10, // Maximum cost in cents per request sequence
inputCostPer1M = 2.50, // Dollars per 1M input tokens
outputCostPer1M = 10.00, // Dollars per 1M output tokens
baseDelayMs = 1000,
onRetry = () => {}
} = options;
let totalCostCents = 0;
let lastError = null;
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
const result = await fn(attempt);
// Calculate cost of this attempt
const usage = result.usage;
if (usage) {
const inputCost = (usage.prompt_tokens / 1000000) * inputCostPer1M * 100;
const outputCost = (usage.completion_tokens / 1000000) * outputCostPer1M * 100;
totalCostCents += inputCost + outputCost;
}
return { result, attempts: attempt + 1, totalCostCents };
} catch (error) {
lastError = error;
// Check cost budget before retrying
if (totalCostCents >= maxCostCents) {
console.error(`Cost budget exceeded (${totalCostCents.toFixed(2)}c / ${maxCostCents}c). Stopping retries.`);
throw error;
}
// Fail fast on permanent errors instead of looping on them
if (!isRetryable(error).retryable) {
throw error;
}
if (attempt < maxRetries) {
const delay = fullJitterBackoff(attempt, baseDelayMs);
onRetry({ attempt, error, totalCostCents, delay });
await new Promise(r => setTimeout(r, delay));
}
}
}
throw lastError;
}
7. Max Retry Limits
Setting the right number of retries is a balance between reliability and resource usage.
Recommended retry counts by error type
| Error Type | Max Retries | Reasoning |
|---|---|---|
| Rate limit (429) | 3-5 | Rate limits reset; more retries = higher success |
| Server error (500/502/503) | 2-3 | Usually resolves quickly or doesn't resolve at all |
| Timeout | 2 | If the server is slow, more retries won't help much |
| Malformed output | 2-3 | Model output varies; retries often fix it |
| Schema validation failure | 2 | With error feedback, 2 retries usually suffice |
| Connection error | 3 | Network issues often resolve within seconds |
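The table translates naturally into a policy map keyed by the reason strings that isRetryable (Section 2) returns. The budgets below are illustrative defaults mirroring the table, not canonical values:

```javascript
// Per-error-type retry budgets, keyed by the classifier's reason strings.
const RETRY_POLICY = {
  http_429:      { maxRetries: 4, baseDelayMs: 2000 }, // rate limits reset: be patient
  http_500:      { maxRetries: 3, baseDelayMs: 1000 },
  http_502:      { maxRetries: 3, baseDelayMs: 1000 },
  http_503:      { maxRetries: 3, baseDelayMs: 1000 },
  http_504:      { maxRetries: 2, baseDelayMs: 1000 }, // slow server: limited upside
  network_error: { maxRetries: 3, baseDelayMs: 500 },  // network blips resolve fast
};
const DEFAULT_POLICY = { maxRetries: 2, baseDelayMs: 1000 };

function policyFor(reason) {
  return RETRY_POLICY[reason] ?? DEFAULT_POLICY;
}
```

A retry wrapper can then look up the budget per attempt instead of using one global maxRetries.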
Circuit breaker pattern
When an API is consistently failing, stop retrying to avoid wasting resources.
/**
* Simple circuit breaker for LLM APIs.
* After N consecutive failures, "open" the circuit and fail fast.
*/
class CircuitBreaker {
constructor(options = {}) {
this.failureThreshold = options.failureThreshold || 5;
this.resetTimeoutMs = options.resetTimeoutMs || 60000;
this.consecutiveFailures = 0;
this.isOpen = false;
this.openedAt = null;
}
async call(fn) {
// Check if circuit is open
if (this.isOpen) {
const elapsed = Date.now() - this.openedAt;
if (elapsed < this.resetTimeoutMs) {
throw new Error(
`Circuit breaker is OPEN. ${Math.ceil((this.resetTimeoutMs - elapsed) / 1000)}s until retry. ` +
`(${this.consecutiveFailures} consecutive failures)`
);
}
// Time has passed — try a "half-open" request
console.log('Circuit breaker: attempting half-open request...');
}
try {
const result = await fn();
// Success — reset the circuit
this.consecutiveFailures = 0;
this.isOpen = false;
return result;
} catch (error) {
this.consecutiveFailures++;
if (this.consecutiveFailures >= this.failureThreshold) {
this.isOpen = true;
this.openedAt = Date.now();
console.error(
`Circuit breaker OPENED after ${this.consecutiveFailures} failures. ` +
`Will retry after ${this.resetTimeoutMs / 1000}s.`
);
}
throw error;
}
}
}
// Usage
const llmCircuit = new CircuitBreaker({
failureThreshold: 5,
resetTimeoutMs: 30000 // 30 seconds
});
try {
const response = await llmCircuit.call(async () => {
return await retryWithBackoff(
() => openai.chat.completions.create({ model: 'gpt-4o', messages }), // `messages` supplied by the caller
{ maxRetries: 2 }
);
});
} catch (error) {
if (error.message.includes('Circuit breaker is OPEN')) {
// Return cached response or graceful degradation
return { error: 'Service temporarily unavailable', cached: getCachedResponse() };
}
throw error;
}
8. Fallback Strategies
When all retries fail, you need a fallback plan. Don't just throw an error to the user.
/**
* LLM call with retry + fallback chain.
* Tries primary model, then fallback model, then cached response.
*/
async function llmCallWithFallback(messages, options = {}) {
const {
primaryModel = 'gpt-4o',
fallbackModel = 'gpt-4o-mini',
maxRetries = 2,
cacheKey = null
} = options;
// Attempt 1: Primary model with retries
try {
const result = await retryWithBackoff(
() => openai.chat.completions.create({
model: primaryModel,
messages,
temperature: 0
}),
{ maxRetries }
);
// Cache successful response
if (cacheKey) {
cache.set(cacheKey, result.result.choices[0].message.content);
}
return {
content: result.result.choices[0].message.content,
source: 'primary',
model: primaryModel,
attempts: result.attempts
};
} catch (primaryError) {
console.warn(`Primary model failed: ${primaryError.message}. Trying fallback...`);
}
// Attempt 2: Fallback model (cheaper, more available)
try {
const result = await retryWithBackoff(
() => openai.chat.completions.create({
model: fallbackModel,
messages,
temperature: 0
}),
{ maxRetries: 1 } // Fewer retries for fallback
);
return {
content: result.result.choices[0].message.content,
source: 'fallback',
model: fallbackModel,
attempts: result.attempts
};
} catch (fallbackError) {
console.warn(`Fallback model failed: ${fallbackError.message}. Checking cache...`);
}
// Attempt 3: Return cached response
if (cacheKey) {
const cached = cache.get(cacheKey);
if (cached) {
return {
content: cached,
source: 'cache',
model: 'cached',
attempts: 0,
stale: true
};
}
}
// Attempt 4: Graceful degradation
return {
content: null,
source: 'degraded',
model: 'none',
attempts: 0,
error: 'All models and fallbacks failed'
};
}
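The snippet above assumes a `cache` object exists. A minimal in-memory version with per-entry TTL (a sketch only; production systems would typically reach for Redis or an LRU library) matching the get/set API used above:

```javascript
// Minimal in-memory cache with per-entry TTL and lazy expiry on read.
function createCache() {
  const store = new Map();
  return {
    set(key, value, ttlMs = 3600000) { // default TTL: 1 hour
      store.set(key, { value, expiresAt: Date.now() + ttlMs });
    },
    get(key) {
      const entry = store.get(key);
      if (!entry) return null;
      if (Date.now() > entry.expiresAt) {
        store.delete(key); // expired: evict lazily
        return null;
      }
      return entry.value;
    }
  };
}

const cache = createCache();
```

Stale cached responses are often acceptable as a fallback (hence the `stale: true` flag above), but set TTLs with that trade-off in mind.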
9. Complete Production Retry System
Here's a full, production-ready retry system combining everything from this section.
import { z } from 'zod';
/**
* Production LLM call with:
* - Error classification (retryable vs permanent)
* - Exponential backoff with full jitter
* - Output validation with error feedback
* - Cost tracking
* - Fallback model support
* - Circuit breaker integration
* - Detailed logging
*/
class RobustLlmClient {
constructor(options = {}) {
this.openai = options.openai;
this.defaults = {
model: options.model || 'gpt-4o',
fallbackModel: options.fallbackModel || 'gpt-4o-mini',
maxRetries: options.maxRetries ?? 3,
maxOutputRetries: options.maxOutputRetries ?? 2,
baseDelayMs: options.baseDelayMs || 1000,
maxDelayMs: options.maxDelayMs || 60000,
timeoutMs: options.timeoutMs || 60000,
temperature: options.temperature ?? 0
};
this.circuitBreaker = new CircuitBreaker();
this.stats = { calls: 0, retries: 0, failures: 0, totalCostCents: 0 };
}
async call(messages, schema = null, options = {}) {
const config = { ...this.defaults, ...options };
this.stats.calls++;
return this.circuitBreaker.call(async () => {
// Try primary model
try {
return await this._callWithRetries(messages, schema, config, config.model);
} catch (primaryError) {
// Try fallback model
try {
console.warn(`Primary model failed, trying fallback: ${config.fallbackModel}`);
return await this._callWithRetries(messages, schema, config, config.fallbackModel);
} catch (fallbackError) {
this.stats.failures++;
throw fallbackError;
}
}
});
}
async _callWithRetries(messages, schema, config, model) {
let lastError;
let currentMessages = [...messages];
for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
try {
// Make the API call
const response = await this._makeCall(currentMessages, config, model);
const content = response.choices[0].message.content;
const finishReason = response.choices[0].finish_reason;
// Check for truncation
if (finishReason === 'length') {
throw Object.assign(new Error('Response truncated'), { retryable: true });
}
// If no schema, return raw content
if (!schema) {
return { content, model, attempts: attempt + 1, validated: false };
}
// Parse first: malformed JSON is a transient output failure, so mark it retryable
let parsed;
try {
parsed = JSON.parse(content);
} catch (e) {
throw Object.assign(new Error(`JSON parse failed: ${e.message}`), { retryable: true });
}
// Validate against schema
const validation = schema.safeParse(parsed);
if (validation.success) {
return { data: validation.data, content, model, attempts: attempt + 1, validated: true };
}
// Validation failed — retry with feedback
const errors = validation.error.issues
.map(i => `${i.path.join('.')}: ${i.message}`)
.join('\n');
if (attempt < config.maxOutputRetries) {
currentMessages = [
...messages,
{ role: 'assistant', content },
{ role: 'user', content: `Validation errors:\n${errors}\n\nFix and respond with corrected JSON only.` }
];
this.stats.retries++;
continue;
}
throw new Error(`Schema validation failed: ${errors}`);
} catch (error) {
lastError = error;
// Check if retryable
const decision = error.retryable !== undefined
? { retryable: error.retryable }
: isRetryable(error);
if (!decision.retryable || attempt >= config.maxRetries) {
throw error;
}
// Backoff
const delay = fullJitterBackoff(attempt, config.baseDelayMs, config.maxDelayMs);
this.stats.retries++;
await new Promise(r => setTimeout(r, delay));
}
}
throw lastError;
}
async _makeCall(messages, config, model) {
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), config.timeoutMs);
try {
return await this.openai.chat.completions.create(
{
model,
messages,
max_tokens: config.maxTokens || 4096,
temperature: config.temperature,
...(config.responseFormat ? { response_format: config.responseFormat } : {})
},
{ signal: controller.signal }
);
} finally {
clearTimeout(timeout);
}
}
getStats() {
return { ...this.stats };
}
}
// Usage
const client = new RobustLlmClient({
openai: new OpenAI({ apiKey: process.env.OPENAI_API_KEY }),
model: 'gpt-4o',
fallbackModel: 'gpt-4o-mini',
maxRetries: 3,
timeoutMs: 45000
});
const UserSchema = z.object({
name: z.string(),
age: z.number(),
email: z.string().email()
});
const result = await client.call(
[
{ role: 'system', content: 'Extract user info as JSON.' },
{ role: 'user', content: 'Alice, 30, alice@test.com' }
],
UserSchema
);
console.log(result.data);
// { name: 'Alice', age: 30, email: 'alice@test.com' }
console.log(`Completed in ${result.attempts} attempt(s) using ${result.model}`);
10. Key Takeaways
- Classify errors before retrying — retrying a 401 (bad API key) wastes money and time; only retry transient errors like 429, 500, 502, 503, timeouts, and malformed output.
- Use exponential backoff with full jitter — start at 1 second, double each attempt, cap at 60 seconds, and add randomness to prevent thundering herds of synchronized retries.
- Retry with error feedback for validation failures — when the model's JSON fails schema validation, include the validation errors in the next prompt so the model can self-correct.
- Track cost per retry sequence — feedback retries grow input size on each attempt; set cost budgets and monitor retry rates to prevent runaway spending.
- Have a fallback chain — primary model with retries, then fallback model (cheaper/faster), then cached response, then graceful degradation. Never let every path end in an error thrown to the user.
Explain-It Challenge
- Your team's LLM API has a 5% failure rate per call. A user request triggers a pipeline of 4 sequential LLM calls. Calculate the pipeline success rate without retries, then with 2 retries per call.
- Two approaches to retrying malformed JSON: (A) retry with the same prompt, hoping for different output, and (B) retry with validation errors included in the prompt. When is each approach appropriate?
- Your monitoring shows that 15% of requests are being retried, and 3% of those retries also fail. The average cost per retry sequence is 2.1x a single call. Is this healthy? What thresholds would trigger investigation?
Navigation: ← 4.10.b — Partial Responses and Timeouts · 4.10.d — Logging AI Requests →