Episode 4 — Generative AI Engineering / 4.10 — Error Handling in AI Applications

4.10.c — Retry Mechanisms

In one sentence: Not every failed LLM call should be retried — you must distinguish transient errors (network blips, rate limits, server overload) from permanent errors (bad API key, invalid input), use exponential backoff with jitter to avoid thundering herds, set strict retry limits to control cost, and sometimes retry with a modified prompt when the model's output was the problem.

Navigation: ← 4.10.b — Partial Responses and Timeouts · 4.10.d — Logging AI Requests →


1. Why Retries Are Essential for LLM Applications

LLM APIs fail far more often than traditional APIs do. Here's why:

Traditional REST API:
  - Servers are stateless, requests are fast (~50ms)
  - Failures are rare (<0.01%)
  - Retries are a nice-to-have

LLM API:
  - Generation takes 1-60+ seconds per request
  - Multiple failure modes (rate limit, timeout, malformed output, server error)
  - Combined failure rate can be 2-10% in production
  - Cost per call is significant ($0.001-$0.10+)
  - Retries are MANDATORY for production systems

The failure math

If a single LLM call has a 5% failure rate:
  - 1 call:   95% success
  - 2 calls:  95% + (5% × 95%) = 99.75% success (with 1 retry)
  - 3 calls:  99.99% success (with 2 retries)

For a pipeline of 3 sequential LLM calls, each with 5% failure:
  - Without retries: 0.95³ = 85.7% pipeline success
  - With 1 retry each: 0.9975³ = 99.25% pipeline success

Retries turn an unreliable system into a reliable one.
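The arithmetic above can be verified in a few lines of JavaScript (the 5% failure rate is this section's illustrative figure, not a measured value):

```javascript
// Probability of success within (retries + 1) attempts, given a per-call failure rate.
function successWithRetries(failureRate, retries) {
  return 1 - Math.pow(failureRate, retries + 1);
}

// Pipeline of n sequential calls: every call must succeed.
function pipelineSuccess(perCallSuccess, n) {
  return Math.pow(perCallSuccess, n);
}

console.log(successWithRetries(0.05, 1));                     // ≈ 0.9975
console.log(pipelineSuccess(0.95, 3));                        // ≈ 0.8574
console.log(pipelineSuccess(successWithRetries(0.05, 1), 3)); // ≈ 0.9925
```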

2. When to Retry vs When NOT to Retry

The most important decision in retry logic is whether to retry. Retrying a permanent error wastes money and time.

Retryable errors (transient)

| Error Type | HTTP Status | Why Retry Works |
|---|---|---|
| Rate limit | 429 | Wait and try again — the limit resets |
| Server error | 500 | Transient infrastructure issue |
| Bad gateway | 502 | Load balancer issue — next request may hit a healthy server |
| Service unavailable | 503 | Server overloaded — will recover |
| Gateway timeout | 504 | Request took too long — may succeed on retry |
| Request timeout | 408 | Server gave up waiting; retry may succeed |
| Connection error | N/A | Network blip — often resolves quickly |
| Timeout | N/A | Server was slow — may be faster on retry |
| Malformed output | 200 | Model returned bad JSON — retry often fixes it |
| Truncated output | 200 | finish_reason: "length" — retry with more tokens |

Non-retryable errors (permanent)

| Error Type | HTTP Status | Why Retry Fails |
|---|---|---|
| Authentication | 401 | Bad API key — retry won't fix it |
| Forbidden | 403 | No access — retry won't fix it |
| Not found | 404 | Wrong model name — retry won't fix it |
| Bad request | 400 | Malformed request — retry sends the same bad request |
| Content policy | 400/403 | Prompt violates policy — same prompt will be rejected again |
| Context too long | 400 | Input exceeds model's context — same input will fail again |
| Insufficient quota | 402 | Out of credits — need to add funds |

Decision function

/**
 * Determine if an error is retryable.
 * Returns { retryable: boolean, reason: string }
 */
function isRetryable(error) {
  // Network errors — always retryable
  if (error.code === 'ETIMEDOUT' || error.code === 'ECONNRESET' ||
      error.code === 'ECONNREFUSED' || error.code === 'ENOTFOUND' ||
      error.name === 'AbortError') {
    return { retryable: true, reason: 'network_error' };
  }

  const status = error.status || error.statusCode;

  // Retryable HTTP errors
  if ([429, 500, 502, 503, 504, 408].includes(status)) {
    return { retryable: true, reason: `http_${status}` };
  }

  // Non-retryable HTTP errors
  if ([400, 401, 402, 403, 404, 422].includes(status)) {
    return { retryable: false, reason: `http_${status}` };
  }

  // Unknown errors — default to not retryable (fail fast)
  return { retryable: false, reason: 'unknown_error' };
}

3. Exponential Backoff with Jitter

When you retry, you must wait between attempts. If you retry immediately, you'll hit the same rate limit or overloaded server. Exponential backoff with jitter is the standard approach.

Why exponential backoff?

Scenario: API returns 429 (rate limited)

WITHOUT backoff (immediate retry):
  Attempt 1: 429 → retry immediately
  Attempt 2: 429 → retry immediately (server is STILL rate limited)
  Attempt 3: 429 → retry immediately (making it WORSE)
  Result: All attempts fail, you wasted time and annoyed the server

WITH exponential backoff:
  Attempt 1: 429 → wait 1 second
  Attempt 2: 429 → wait 2 seconds
  Attempt 3: 429 → wait 4 seconds
  Attempt 4: Success! (rate limit window has passed)

Why jitter?

Jitter adds randomness to the delay. Without it, if 100 clients all get rate limited at the same time, they all retry at the same time (1s, 2s, 4s), causing a thundering herd.

WITHOUT jitter (thundering herd):
  Client A: retry at 1.000s, 2.000s, 4.000s
  Client B: retry at 1.000s, 2.000s, 4.000s
  Client C: retry at 1.000s, 2.000s, 4.000s
  → All 3 clients hit the server simultaneously every time

WITH jitter (spread out):
  Client A: retry at 0.700s, 2.300s, 3.100s
  Client B: retry at 1.200s, 1.800s, 4.500s
  Client C: retry at 0.900s, 2.900s, 3.700s
  → Load is distributed over time

Implementation

/**
 * Calculate exponential backoff delay with jitter.
 * 
 * @param {number} attempt - Current attempt number (0-based)
 * @param {object} options - Configuration
 * @returns {number} Delay in milliseconds
 */
function calculateBackoff(attempt, options = {}) {
  const {
    baseDelayMs = 1000,    // Start at 1 second
    maxDelayMs = 60000,    // Cap at 60 seconds
    jitterFactor = 0.5     // ±50% randomness
  } = options;

  // Exponential: 1s, 2s, 4s, 8s, 16s, ...
  const exponentialDelay = baseDelayMs * Math.pow(2, attempt);
  
  // Cap at maximum
  const cappedDelay = Math.min(exponentialDelay, maxDelayMs);
  
  // Add jitter: ±50% of the delay
  const jitter = cappedDelay * jitterFactor * (Math.random() * 2 - 1);
  const finalDelay = Math.max(0, cappedDelay + jitter);

  return Math.round(finalDelay);
}

// Example delays for attempts 0-5 (jitterFactor 0.5 means ±50%):
// Attempt 0: ~1000ms   (500 - 1500ms with jitter)
// Attempt 1: ~2000ms   (1000 - 3000ms)
// Attempt 2: ~4000ms   (2000 - 6000ms)
// Attempt 3: ~8000ms   (4000 - 12000ms)
// Attempt 4: ~16000ms  (8000 - 24000ms)
// Attempt 5: ~32000ms  (16000 - 48000ms)

Full jitter (recommended by AWS)

/**
 * "Full jitter" strategy — recommended by AWS architecture blog.
 * Delay = random(0, min(cap, base * 2^attempt))
 * Provides maximum spread, reducing thundering herd more than equal jitter.
 */
function fullJitterBackoff(attempt, baseDelayMs = 1000, maxDelayMs = 60000) {
  const exponentialDelay = baseDelayMs * Math.pow(2, attempt);
  const cappedDelay = Math.min(exponentialDelay, maxDelayMs);
  return Math.round(Math.random() * cappedDelay);
}

4. Building a Robust Retry Wrapper

Here's a production-ready retry wrapper that handles all the cases discussed above.

/**
 * Retry wrapper for LLM API calls with exponential backoff.
 * 
 * Features:
 * - Exponential backoff with full jitter
 * - Distinguishes retryable vs non-retryable errors
 * - Respects Retry-After headers
 * - Configurable max retries
 * - Detailed logging of each attempt
 * - Returns attempt metadata
 * 
 * @param {Function} fn - Async function to retry
 * @param {object} options - Configuration
 * @returns {Promise<{result: any, attempts: number, totalDelayMs: number}>}
 */
async function retryWithBackoff(fn, options = {}) {
  const {
    maxRetries = 3,
    baseDelayMs = 1000,
    maxDelayMs = 60000,
    onRetry = () => {},           // Callback for each retry
    isRetryableError = isRetryable // Error classifier function
  } = options;

  let lastError = null;
  let totalDelayMs = 0;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const result = await fn(attempt);
      return {
        result,
        attempts: attempt + 1,
        totalDelayMs
      };
    } catch (error) {
      lastError = error;

      // Check if we should retry
      const retryDecision = isRetryableError(error);

      if (!retryDecision.retryable) {
        console.error(`Non-retryable error (${retryDecision.reason}):`, error.message);
        throw error; // Don't retry — throw immediately
      }

      // Check if we've exhausted retries
      if (attempt >= maxRetries) {
        console.error(`All ${maxRetries + 1} attempts failed.`);
        break;
      }

      // Calculate delay
      let delayMs;

      // Respect Retry-After header if present (seconds), capped at maxDelayMs
      const retryAfterSec = parseInt(error.headers?.['retry-after'], 10);
      if (!Number.isNaN(retryAfterSec)) {
        delayMs = Math.min(retryAfterSec * 1000, maxDelayMs);
      } else {
        delayMs = fullJitterBackoff(attempt, baseDelayMs, maxDelayMs);
      }

      totalDelayMs += delayMs;

      console.warn(
        `Attempt ${attempt + 1}/${maxRetries + 1} failed ` +
        `(${retryDecision.reason}). Retrying in ${delayMs}ms...`
      );

      onRetry({
        attempt,
        error,
        reason: retryDecision.reason,
        delayMs,
        totalDelayMs
      });

      // Wait before retrying
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }

  // All retries exhausted
  const enhancedError = new Error(
    `All ${maxRetries + 1} attempts failed. Last error: ${lastError.message}`
  );
  enhancedError.lastError = lastError;
  enhancedError.attempts = maxRetries + 1;
  enhancedError.totalDelayMs = totalDelayMs;
  throw enhancedError;
}

// Helper: full jitter backoff
function fullJitterBackoff(attempt, baseDelayMs = 1000, maxDelayMs = 60000) {
  const exponentialDelay = baseDelayMs * Math.pow(2, attempt);
  const cappedDelay = Math.min(exponentialDelay, maxDelayMs);
  return Math.round(Math.random() * cappedDelay);
}

// Usage
const response = await retryWithBackoff(
  async (attempt) => {
    return await openai.chat.completions.create({
      model: 'gpt-4o',
      messages: [{ role: 'user', content: 'Extract data...' }],
      max_tokens: 2000,
      temperature: 0
    });
  },
  {
    maxRetries: 3,
    baseDelayMs: 1000,
    maxDelayMs: 30000,
    onRetry: ({ attempt, reason, delayMs }) => {
      console.log(`Retry ${attempt + 1}: ${reason}, waiting ${delayMs}ms`);
    }
  }
);

console.log(`Success after ${response.attempts} attempt(s)`);

5. Retrying Malformed Output (Not Just Errors)

Sometimes the API call succeeds (HTTP 200) but the output is unusable — bad JSON, wrong schema, or nonsensical content. You need to retry these too.

Retry with the same prompt

/**
 * Retry LLM call when the output fails validation.
 * Handles both API errors AND output validation failures.
 */
async function retryUntilValid(messages, schema, options = {}) {
  const {
    maxRetries = 3,
    model = 'gpt-4o',
    temperature = 0
  } = options;
  // let, not const: maxTokens is increased when the response comes back truncated
  let maxTokens = options.maxTokens ?? 4096;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    // Make the API call (with error-level retries handled by retryWithBackoff)
    const response = await retryWithBackoff(
      () => openai.chat.completions.create({
        model,
        messages,
        max_tokens: maxTokens,
        temperature,
        response_format: { type: 'json_object' }
      }),
      { maxRetries: 2 } // Inner retries for API errors
    );

    const content = response.result.choices[0].message.content;

    // Check for truncation
    if (response.result.choices[0].finish_reason === 'length') {
      console.warn(`Attempt ${attempt + 1}: Response truncated`);
      maxTokens = Math.min(maxTokens * 2, 16384);
      continue; // Retry with more tokens
    }

    // Try to parse and validate
    try {
      const data = JSON.parse(content);
      const validation = schema.safeParse(data);

      if (validation.success) {
        return {
          data: validation.data,
          attempts: attempt + 1,
          raw: content
        };
      }

      console.warn(
        `Attempt ${attempt + 1}: Schema validation failed:`,
        validation.error.issues.map(i => i.message).join(', ')
      );
    } catch (parseError) {
      console.warn(`Attempt ${attempt + 1}: JSON parse failed:`, parseError.message);
    }

    // If not the last attempt, continue to retry
  }

  throw new Error(`Failed to get valid response after ${maxRetries + 1} attempts`);
}

Retry with modified prompt (passing errors back)

This is the most powerful technique: when the model's output fails validation, include the validation error in the next prompt so the model can correct itself.

import { z } from 'zod';

/**
 * Retry with validation error feedback.
 * Each retry includes the previous error so the model can self-correct.
 */
async function retryWithFeedback(baseMessages, schema, options = {}) {
  const {
    maxRetries = 3,
    model = 'gpt-4o',
    maxTokens = 4096
  } = options;

  let currentMessages = [...baseMessages];
  let lastOutput = null;
  let lastErrors = null;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await openai.chat.completions.create({
      model,
      messages: currentMessages,
      max_tokens: maxTokens,
      temperature: 0,
      response_format: { type: 'json_object' }
    });

    const content = response.choices[0].message.content;
    lastOutput = content;

    // Try to parse and validate
    let data;
    try {
      data = JSON.parse(content);
    } catch (e) {
      lastErrors = [`JSON parse error: ${e.message}`];

      if (attempt < maxRetries) {
        // Add error feedback to messages
        currentMessages = [
          ...baseMessages,
          { role: 'assistant', content },
          {
            role: 'user',
            content: `Your response was not valid JSON. Error: ${e.message}\n\nPlease fix the JSON and respond again. Return ONLY valid JSON.`
          }
        ];
        continue;
      }
      break;
    }

    const validation = schema.safeParse(data);

    if (validation.success) {
      return {
        data: validation.data,
        attempts: attempt + 1
      };
    }

    // Format validation errors for feedback
    lastErrors = validation.error.issues.map(issue => 
      `Field "${issue.path.join('.')}": ${issue.message}`
    );

    if (attempt < maxRetries) {
      // Include the validation errors in the next prompt
      const errorFeedback = lastErrors.join('\n');
      
      currentMessages = [
        ...baseMessages,
        { role: 'assistant', content },
        {
          role: 'user',
          content: `Your JSON response had validation errors:\n${errorFeedback}\n\nPlease fix these errors and respond with corrected JSON only.`
        }
      ];

      console.warn(`Attempt ${attempt + 1}: Retrying with error feedback`);
    }
  }

  throw new Error(
    `Failed after ${maxRetries + 1} attempts. Last errors: ${lastErrors?.join('; ')}`
  );
}

// Usage
const UserSchema = z.object({
  name: z.string().min(1, 'Name is required'),
  age: z.number().int().min(0).max(150),
  email: z.string().email('Must be a valid email'),
  role: z.enum(['admin', 'user', 'moderator'])
});

const result = await retryWithFeedback(
  [
    { role: 'system', content: 'Extract user info as JSON: {name, age, email, role}' },
    { role: 'user', content: 'Alice is 30, alice@example.com, she moderates the forum' }
  ],
  UserSchema,
  { maxRetries: 2 }
);

console.log(result.data);
// { name: 'Alice', age: 30, email: 'alice@example.com', role: 'moderator' }

6. Cost Awareness of Retries

Every retry costs money. You must balance reliability against cost.

The cost math

Single call cost:
  Input: 1,000 tokens × $2.50/1M = $0.0025
  Output: 500 tokens × $10.00/1M = $0.005
  Total: $0.0075 per call

With retries (worst case, all 3 retries triggered):
  4 calls × $0.0075 = $0.03 per request
  
At 100,000 requests/day:
  Without retries: $750/day
  With retries (5% failure rate, avg ≈1.05 calls): ≈$790/day
  Worst case (all 3 retries on every request): $3,000/day

With feedback retries (input grows each retry):
  Attempt 1: 1,000 input + 500 output = $0.0075
  Attempt 2: 2,500 input + 500 output = $0.0113 (includes previous output + error)
  Attempt 3: 4,000 input + 500 output = $0.0150
  Total: $0.0338 per 3-attempt sequence — 4.5x the single call cost

Cost control strategies

/**
 * Cost-aware retry wrapper.
 * Tracks cost per attempt and stops if total cost exceeds budget.
 */
async function costAwareRetry(fn, options = {}) {
  const {
    maxRetries = 3,
    maxCostCents = 10,      // Maximum cost in cents per request sequence
    inputCostPer1M = 2.50,  // Dollars per 1M input tokens
    outputCostPer1M = 10.00, // Dollars per 1M output tokens
    baseDelayMs = 1000,
    onRetry = () => {}
  } = options;

  let totalCostCents = 0;
  let lastError = null;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const result = await fn(attempt);

      // Calculate cost of this attempt
      const usage = result.usage;
      if (usage) {
        const inputCost = (usage.prompt_tokens / 1000000) * inputCostPer1M * 100;
        const outputCost = (usage.completion_tokens / 1000000) * outputCostPer1M * 100;
        totalCostCents += inputCost + outputCost;
      }

      return { result, attempts: attempt + 1, totalCostCents };

    } catch (error) {
      lastError = error;

      // Failed attempts may still have consumed tokens (e.g. a truncated response);
      // count them if the thrown error carries usage info.
      const usage = error.usage;
      if (usage) {
        totalCostCents += (usage.prompt_tokens / 1000000) * inputCostPer1M * 100
                        + (usage.completion_tokens / 1000000) * outputCostPer1M * 100;
      }

      // Permanent errors: fail fast, don't retry
      if (!isRetryable(error).retryable) {
        throw error;
      }

      // Check cost budget before retrying
      if (totalCostCents >= maxCostCents) {
        console.error(`Cost budget exceeded (${totalCostCents.toFixed(2)}c / ${maxCostCents}c). Stopping retries.`);
        throw error;
      }

      if (attempt < maxRetries) {
        const delay = fullJitterBackoff(attempt, baseDelayMs);
        onRetry({ attempt, error, totalCostCents, delay });
        await new Promise(r => setTimeout(r, delay));
      }
    }
  }

  throw lastError;
}

7. Max Retry Limits

Setting the right number of retries is a balance between reliability and resource usage.

Recommended retry counts by error type

| Error Type | Max Retries | Reasoning |
|---|---|---|
| Rate limit (429) | 3-5 | Rate limits reset; more retries = higher success |
| Server error (500/502/503) | 2-3 | Usually resolves quickly or doesn't resolve at all |
| Timeout | 2 | If the server is slow, more retries won't help much |
| Malformed output | 2-3 | Model output varies; retries often fix it |
| Schema validation failure | 2 | With error feedback, 2 retries usually suffice |
| Connection error | 3 | Network issues often resolve within seconds |
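One way to apply this table in code is a small policy map keyed by the `reason` strings that `isRetryable` returns. This is a sketch; keys like `malformed_output` and `schema_validation` are assumptions to wire into your own classifier:

```javascript
// Max retries per error class, following the table above.
const RETRY_POLICY = {
  http_429: 5,            // rate limits reset over time
  http_500: 3,
  http_502: 3,
  http_503: 3,
  http_504: 2,
  network_error: 3,       // network issues often resolve within seconds
  timeout: 2,
  malformed_output: 3,    // assumed key for bad-JSON retries
  schema_validation: 2,   // assumed key for feedback retries
  default: 2
};

function maxRetriesFor(reason) {
  return RETRY_POLICY[reason] ?? RETRY_POLICY.default;
}

console.log(maxRetriesFor('http_429')); // 5
console.log(maxRetriesFor('http_999')); // 2 (falls back to default)
```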

Circuit breaker pattern

When an API is consistently failing, stop retrying to avoid wasting resources.

/**
 * Simple circuit breaker for LLM APIs.
 * After N consecutive failures, "open" the circuit and fail fast.
 */
class CircuitBreaker {
  constructor(options = {}) {
    this.failureThreshold = options.failureThreshold || 5;
    this.resetTimeoutMs = options.resetTimeoutMs || 60000;
    this.consecutiveFailures = 0;
    this.isOpen = false;
    this.openedAt = null;
  }

  async call(fn) {
    // Check if circuit is open
    if (this.isOpen) {
      const elapsed = Date.now() - this.openedAt;
      if (elapsed < this.resetTimeoutMs) {
        throw new Error(
          `Circuit breaker is OPEN. ${Math.ceil((this.resetTimeoutMs - elapsed) / 1000)}s until retry. ` +
          `(${this.consecutiveFailures} consecutive failures)`
        );
      }
      // Time has passed — try a "half-open" request
      console.log('Circuit breaker: attempting half-open request...');
    }

    try {
      const result = await fn();
      // Success — reset the circuit
      this.consecutiveFailures = 0;
      this.isOpen = false;
      return result;
    } catch (error) {
      this.consecutiveFailures++;
      
      if (this.consecutiveFailures >= this.failureThreshold) {
        this.isOpen = true;
        this.openedAt = Date.now();
        console.error(
          `Circuit breaker OPENED after ${this.consecutiveFailures} failures. ` +
          `Will retry after ${this.resetTimeoutMs / 1000}s.`
        );
      }
      
      throw error;
    }
  }
}

// Usage
const llmCircuit = new CircuitBreaker({
  failureThreshold: 5,
  resetTimeoutMs: 30000 // 30 seconds
});

// Inside a request handler:
try {
  const response = await llmCircuit.call(async () => {
    return await retryWithBackoff(
      () => openai.chat.completions.create({ model: 'gpt-4o', messages }),
      { maxRetries: 2 }
    );
  });
  return response;
} catch (error) {
  if (error.message.includes('Circuit breaker is OPEN')) {
    // Return cached response or graceful degradation
    return { error: 'Service temporarily unavailable', cached: getCachedResponse() };
  }
  throw error;
}

8. Fallback Strategies

When all retries fail, you need a fallback plan. Don't just throw an error to the user.

/**
 * LLM call with retry + fallback chain.
 * Tries primary model, then fallback model, then cached response.
 */
async function llmCallWithFallback(messages, options = {}) {
  const {
    primaryModel = 'gpt-4o',
    fallbackModel = 'gpt-4o-mini',
    maxRetries = 2,
    cacheKey = null
  } = options;

  // Attempt 1: Primary model with retries
  try {
    const result = await retryWithBackoff(
      () => openai.chat.completions.create({
        model: primaryModel,
        messages,
        temperature: 0
      }),
      { maxRetries }
    );
    
    // Cache successful response
    if (cacheKey) {
      cache.set(cacheKey, result.result.choices[0].message.content);
    }
    
    return {
      content: result.result.choices[0].message.content,
      source: 'primary',
      model: primaryModel,
      attempts: result.attempts
    };
  } catch (primaryError) {
    console.warn(`Primary model failed: ${primaryError.message}. Trying fallback...`);
  }

  // Attempt 2: Fallback model (cheaper, more available)
  try {
    const result = await retryWithBackoff(
      () => openai.chat.completions.create({
        model: fallbackModel,
        messages,
        temperature: 0
      }),
      { maxRetries: 1 } // Fewer retries for fallback
    );
    
    return {
      content: result.result.choices[0].message.content,
      source: 'fallback',
      model: fallbackModel,
      attempts: result.attempts
    };
  } catch (fallbackError) {
    console.warn(`Fallback model failed: ${fallbackError.message}. Checking cache...`);
  }

  // Attempt 3: Return cached response
  if (cacheKey) {
    const cached = cache.get(cacheKey);
    if (cached) {
      return {
        content: cached,
        source: 'cache',
        model: 'cached',
        attempts: 0,
        stale: true
      };
    }
  }

  // Attempt 4: Graceful degradation
  return {
    content: null,
    source: 'degraded',
    model: 'none',
    attempts: 0,
    error: 'All models and fallbacks failed'
  };
}

9. Complete Production Retry System

Here's a full, production-ready retry system combining everything from this section.

import { z } from 'zod';

/**
 * Production LLM call with:
 * - Error classification (retryable vs permanent)
 * - Exponential backoff with full jitter
 * - Output validation with error feedback
 * - Cost tracking
 * - Fallback model support
 * - Circuit breaker integration
 * - Detailed logging
 */
class RobustLlmClient {
  constructor(options = {}) {
    this.openai = options.openai;
    this.defaults = {
      model: options.model || 'gpt-4o',
      fallbackModel: options.fallbackModel || 'gpt-4o-mini',
      maxRetries: options.maxRetries ?? 3,
      maxOutputRetries: options.maxOutputRetries ?? 2,
      baseDelayMs: options.baseDelayMs || 1000,
      maxDelayMs: options.maxDelayMs || 60000,
      timeoutMs: options.timeoutMs || 60000,
      temperature: options.temperature ?? 0
    };
    this.circuitBreaker = new CircuitBreaker();
    this.stats = { calls: 0, retries: 0, failures: 0, totalCostCents: 0 };
  }

  async call(messages, schema = null, options = {}) {
    const config = { ...this.defaults, ...options };
    this.stats.calls++;

    return this.circuitBreaker.call(async () => {
      // Try primary model
      try {
        return await this._callWithRetries(messages, schema, config, config.model);
      } catch (primaryError) {
        // Try fallback model
        try {
          console.warn(`Primary model failed, trying fallback: ${config.fallbackModel}`);
          return await this._callWithRetries(messages, schema, config, config.fallbackModel);
        } catch (fallbackError) {
          this.stats.failures++;
          throw fallbackError;
        }
      }
    });
  }

  async _callWithRetries(messages, schema, config, model) {
    let lastError;
    let currentMessages = [...messages];

    for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
      try {
        // Make the API call
        const response = await this._makeCall(currentMessages, config, model);
        const content = response.choices[0].message.content;
        const finishReason = response.choices[0].finish_reason;

        // Check for truncation
        if (finishReason === 'length') {
          throw Object.assign(new Error('Response truncated'), { retryable: true });
        }

        // If no schema, return raw content
        if (!schema) {
          return { content, model, attempts: attempt + 1, validated: false };
        }

        // Validate against schema (a parse failure is malformed output, which
        // section 2 classifies as retryable)
        let parsed;
        try {
          parsed = JSON.parse(content);
        } catch (e) {
          throw Object.assign(new Error(`JSON parse failed: ${e.message}`), { retryable: true });
        }
        const validation = schema.safeParse(parsed);

        if (validation.success) {
          return { data: validation.data, content, model, attempts: attempt + 1, validated: true };
        }

        // Validation failed — retry with feedback
        const errors = validation.error.issues
          .map(i => `${i.path.join('.')}: ${i.message}`)
          .join('\n');

        if (attempt < config.maxOutputRetries) {
          currentMessages = [
            ...messages,
            { role: 'assistant', content },
            { role: 'user', content: `Validation errors:\n${errors}\n\nFix and respond with corrected JSON only.` }
          ];
          this.stats.retries++;
          continue;
        }

        throw new Error(`Schema validation failed: ${errors}`);

      } catch (error) {
        lastError = error;

        // Check if retryable
        const decision = error.retryable !== undefined
          ? { retryable: error.retryable }
          : isRetryable(error);

        if (!decision.retryable || attempt >= config.maxRetries) {
          throw error;
        }

        // Backoff
        const delay = fullJitterBackoff(attempt, config.baseDelayMs, config.maxDelayMs);
        this.stats.retries++;
        await new Promise(r => setTimeout(r, delay));
      }
    }

    throw lastError;
  }

  async _makeCall(messages, config, model) {
    const controller = new AbortController();
    const timeout = setTimeout(() => controller.abort(), config.timeoutMs);

    try {
      return await this.openai.chat.completions.create(
        {
          model,
          messages,
          max_tokens: config.maxTokens || 4096,
          temperature: config.temperature,
          ...(config.responseFormat ? { response_format: config.responseFormat } : {})
        },
        { signal: controller.signal }
      );
    } finally {
      clearTimeout(timeout);
    }
  }

  getStats() {
    return { ...this.stats };
  }
}

// Usage
const client = new RobustLlmClient({
  openai: new OpenAI({ apiKey: process.env.OPENAI_API_KEY }),
  model: 'gpt-4o',
  fallbackModel: 'gpt-4o-mini',
  maxRetries: 3,
  timeoutMs: 45000
});

const UserSchema = z.object({
  name: z.string(),
  age: z.number(),
  email: z.string().email()
});

const result = await client.call(
  [
    { role: 'system', content: 'Extract user info as JSON.' },
    { role: 'user', content: 'Alice, 30, alice@test.com' }
  ],
  UserSchema
);

console.log(result.data);
// { name: 'Alice', age: 30, email: 'alice@test.com' }
console.log(`Completed in ${result.attempts} attempt(s) using ${result.model}`);

10. Key Takeaways

  1. Classify errors before retrying — retrying a 401 (bad API key) wastes money and time; only retry transient errors like 429, 500, 502, 503, timeouts, and malformed output.
  2. Use exponential backoff with full jitter — start at 1 second, double each attempt, cap at 60 seconds, and add randomness to prevent thundering herds of synchronized retries.
  3. Retry with error feedback for validation failures — when the model's JSON fails schema validation, include the validation errors in the next prompt so the model can self-correct.
  4. Track cost per retry sequence — feedback retries grow input size on each attempt; set cost budgets and monitor retry rates to prevent runaway spending.
  5. Have a fallback chain — primary model with retries, then fallback model (cheaper/faster), then cached response, then graceful degradation. Never let every path end in an error thrown to the user.

Explain-It Challenge

  1. Your team's LLM API has a 5% failure rate per call. A user request triggers a pipeline of 4 sequential LLM calls. Calculate the pipeline success rate without retries, then with 2 retries per call.
  2. Two approaches to retrying malformed JSON: (A) retry with the same prompt, hoping for different output, and (B) retry with validation errors included in the prompt. When is each approach appropriate?
  3. Your monitoring shows that 15% of requests are being retried, and 3% of those retries also fail. The average cost per retry sequence is 2.1x a single call. Is this healthy? What thresholds would trigger investigation?

Navigation: ← 4.10.b — Partial Responses and Timeouts · 4.10.d — Logging AI Requests →