Episode 4 — Generative AI Engineering / 4.10 — Error Handling in AI Applications

4.10.b — Partial Responses and Timeouts

In one sentence: LLM responses can be cut short (the model hits its token limit mid-sentence) or never arrive (the network or API times out), and your application must detect both conditions, handle incomplete data gracefully, and decide whether to retry, salvage, or fail with a useful error.

Navigation: ← 4.10.a — Handling Invalid JSON · 4.10.c — Retry Mechanisms →


1. What Is a Partial Response?

A partial response is an LLM output that was cut short before the model finished generating. The API returns a 200 OK status — there's no HTTP error — but the content is incomplete. This is one of the most insidious failure modes because it looks like success at the HTTP level.

What the model was trying to generate:
  {"name": "Alice", "age": 30, "email": "alice@example.com", "address": "123 Main St"}

What you actually received (truncated at token limit):
  {"name": "Alice", "age": 30, "email": "alice@exam

Result:
  ✓ HTTP 200 OK
  ✓ Response body exists
  ✗ Content is incomplete
  ✗ JSON.parse() will throw
  ✗ Data is unusable without recovery
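
A quick sketch of why this matters in code, using the truncated body from above:

```javascript
// The HTTP layer reported success, but the body is unusable as-is
const truncated = '{"name": "Alice", "age": 30, "email": "alice@exam';

function isParseable(text) {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

console.log(isParseable(truncated));          // false
console.log(isParseable('{"name": "Alice"}')); // true
```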

2. Understanding finish_reason

Every LLM API response reports why the model stopped generating; OpenAI exposes this as the finish_reason field, and Anthropic calls it stop_reason (covered below). This is your primary signal for detecting partial responses.

OpenAI finish_reason values

finish_reason      | Meaning                                      | Is output complete?               | Action
-------------------|----------------------------------------------|-----------------------------------|---------------------------
"stop"             | Model naturally finished (hit a stop token)  | Yes                               | Normal processing
"length"           | Model hit max_tokens limit                   | No — truncated                    | Detect and handle
"content_filter"   | Content was filtered by safety system        | No — blocked                      | Log and handle
"tool_calls"       | Model wants to call a function/tool          | Partial — waiting for tool result | Execute tool and continue
null               | Still generating (streaming)                 | No — in progress                  | Wait for completion

How to check finish_reason

const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Write a detailed JSON report...' }],
  max_tokens: 500 // Could be too small for the request
});

const choice = response.choices[0];
const content = choice.message.content;
const finishReason = choice.finish_reason;

console.log('finish_reason:', finishReason);

if (finishReason === 'length') {
  console.warn('WARNING: Response was truncated! Output is incomplete.');
  // Content is cut off — do NOT try to JSON.parse() it directly
}

if (finishReason === 'content_filter') {
  console.warn('WARNING: Response was blocked by content filter.');
  // Content may be null or partial
}

if (finishReason === 'stop') {
  // Normal completion — safe to process
  console.log('Response completed normally.');
}

Anthropic (Claude) stop_reason values

// Claude uses "stop_reason" instead of "finish_reason"
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  messages: [{ role: 'user', content: 'Generate a report...' }]
});

const stopReason = response.stop_reason;

// Claude stop_reason values:
// "end_turn"     → Model finished naturally (equivalent to "stop")
// "max_tokens"   → Hit token limit (equivalent to "length")
// "stop_sequence" → Hit a custom stop sequence
// "tool_use"     → Model wants to use a tool
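
If your application talks to both providers, normalizing the two fields keeps the downstream checks identical. normalizeStopReason below is a hypothetical helper, not part of either SDK:

```javascript
// Map Claude stop_reason values onto the OpenAI finish_reason vocabulary
// (hypothetical helper; neither SDK provides this)
const CLAUDE_TO_OPENAI = {
  end_turn: 'stop',
  max_tokens: 'length',
  stop_sequence: 'stop',
  tool_use: 'tool_calls'
};

function normalizeStopReason(provider, reason) {
  if (provider === 'anthropic') return CLAUDE_TO_OPENAI[reason] ?? reason;
  return reason; // OpenAI values pass through unchanged
}

console.log(normalizeStopReason('anthropic', 'max_tokens')); // "length"
```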

3. Why Truncation Happens

Truncation occurs when the model's output would exceed the max_tokens parameter. Understanding the causes helps you prevent it.

Cause 1: max_tokens set too low

// PROBLEM: max_tokens is too small for the expected output
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: 'Generate a detailed JSON object with 20 user profiles including name, email, address, phone, and bio for each.'
  }],
  max_tokens: 200  // Way too small for 20 detailed profiles
});

// Result: truncated JSON like
// [{"name": "Alice Johnson", "email": "alice@example.com", "address": "123 Main St, Springfield, IL", "phone": "555-01

Cause 2: Model generates more than expected

// You asked for a "brief" summary but the model was verbose
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: 'Briefly summarize this article as JSON with a "summary" field'
  }],
  max_tokens: 100  // "Brief" to you, but the model may disagree
});

Cause 3: Context window exhaustion

// Input is so large that very little room remains for output
// GPT-4o: 128K context window
// If your input uses ~127,000 tokens, only ~1,000 tokens remain for output
// Depending on the provider, the request may be rejected outright or the
// output clamped to whatever space is left

const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    { role: 'system', content: longSystemPrompt },      // ~2,000 tokens
    { role: 'user', content: massiveDocument },          // ~125,000 tokens
    { role: 'user', content: 'Summarize as JSON' }       // ~5 tokens
  ],
  max_tokens: 4000  // You requested 4,000, but only ~1,000 tokens remain
});

// The actual output may be limited to ~1,000 tokens regardless of max_tokens

4. Handling Truncated JSON

When finish_reason is "length", the JSON is almost certainly broken. Here are strategies to salvage what you can.

Strategy 1: Detect and reject

The simplest approach — reject truncated responses and retry.

async function getJsonResponse(prompt, options = {}) {
  const { maxRetries = 2 } = options;
  let maxTokens = options.maxTokens ?? 4096; // let, not const: doubled on retry below

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages: [{ role: 'user', content: prompt }],
      max_tokens: maxTokens,
      temperature: 0
    });

    const choice = response.choices[0];

    if (choice.finish_reason === 'length') {
      console.warn(`Attempt ${attempt + 1}: Response truncated. ` +
        `Increasing max_tokens from ${maxTokens} to ${maxTokens * 2}`);
      
      // Retry with more tokens
      maxTokens = Math.min(maxTokens * 2, 16384); // Cap at 16K
      continue;
    }

    if (choice.finish_reason === 'content_filter') {
      throw new Error('Response blocked by content filter');
    }

    return choice.message.content;
  }

  throw new Error('Response truncated after all retry attempts');
}

Strategy 2: Salvage partial JSON

Sometimes truncated JSON contains usable data. You can attempt to close the JSON structure.

/**
 * Attempt to repair truncated JSON by closing open brackets/braces.
 * 
 * WARNING: This is a heuristic. It may produce valid JSON with
 * incomplete data. Always validate the result against your schema.
 */
function repairTruncatedJson(text) {
  // First, try to parse as-is (maybe it's fine)
  try {
    return { data: JSON.parse(text), repaired: false };
  } catch (e) {
    // Continue with repair
  }

  let repaired = text.trim();

  // If we're inside an unterminated string (odd quote count) or mid-pair
  // (trailing colon), drop back to the last complete element. Simply closing
  // an unterminated string can leave a dangling key like {"ema"}, which is
  // still invalid JSON.
  const quoteCount = (repaired.match(/"/g) || []).length;
  if (quoteCount % 2 !== 0 || /:\s*$/.test(repaired)) {
    const lastCommaIndex = repaired.lastIndexOf(',');
    if (lastCommaIndex !== -1) {
      repaired = repaired.substring(0, lastCommaIndex);
    } else if (quoteCount % 2 !== 0) {
      repaired += '"'; // No safe boundary; close the string and hope for the best
    }
  }

  // Track open braces/brackets with a stack so they close in the right order
  // (naive counting would close [{ as ]} instead of }])
  const stack = [];
  let inString = false;
  let prevChar = '';

  for (const char of repaired) {
    if (char === '"' && prevChar !== '\\') {
      inString = !inString;
    }
    if (!inString) {
      if (char === '{' || char === '[') stack.push(char);
      if (char === '}' || char === ']') stack.pop();
    }
    prevChar = char;
  }

  // Remove any trailing comma, then close open structures (innermost first)
  repaired = repaired.replace(/,\s*$/, '');

  while (stack.length > 0) {
    repaired += stack.pop() === '{' ? '}' : ']';
  }

  try {
    return { data: JSON.parse(repaired), repaired: true };
  } catch (e) {
    return { data: null, repaired: false, error: e.message };
  }
}

// Test cases
console.log(repairTruncatedJson('{"name": "Alice", "age": 30, "ema'));
// Attempts to close the truncated JSON

console.log(repairTruncatedJson('[{"id": 1}, {"id": 2}, {"id":'));
// Attempts to close the array and objects
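
The repair warning above stresses validation: a repaired object can be syntactically valid yet missing fields. A minimal required-keys check as a sketch (a schema validator is the sturdier choice in production):

```javascript
// Reject repaired objects that lost fields your application needs
function hasRequiredKeys(obj, keys) {
  return obj !== null && typeof obj === 'object' && keys.every(k => k in obj);
}

console.log(hasRequiredKeys({ name: 'Alice', age: 30 }, ['name', 'age']));          // true
console.log(hasRequiredKeys({ name: 'Alice', age: 30 }, ['name', 'age', 'email'])); // false
```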

Strategy 3: Continuation request

Ask the model to continue from where it left off.

/**
 * If the response was truncated, ask the model to continue.
 * Concatenate the parts and parse.
 */
async function getCompleteResponse(messages, maxTokens = 4096) {
  let fullContent = '';
  let attempts = 0;
  const maxAttempts = 3;

  while (attempts < maxAttempts) {
    const response = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages: attempts === 0
        ? messages
        : [
            ...messages,
            { role: 'assistant', content: fullContent },
            { role: 'user', content: 'Continue exactly where you left off. Do not repeat any content.' }
          ],
      max_tokens: maxTokens,
      temperature: 0
    });

    const choice = response.choices[0];
    fullContent += choice.message.content ?? ''; // content can be null
    attempts++;

    if (choice.finish_reason === 'stop') {
      // Model finished — try to parse the complete content
      break;
    }

    if (choice.finish_reason === 'length') {
      console.warn(`Part ${attempts} truncated, requesting continuation...`);
      continue;
    }

    break; // Other finish reasons — don't continue
  }

  return fullContent;
}

5. Network Timeouts

Network timeouts occur when the HTTP connection fails before the LLM API returns a response. This is different from truncation — you get no response at all.

Common timeout causes

Cause           | Typical Duration          | Symptom
----------------|---------------------------|---------------------------------------------------
API overload    | 30-60s wait, then timeout | Connection established but no response
Long generation | 30-120s for complex tasks | Model is still generating when timeout fires
Network issues  | Variable                  | DNS failure, connection refused, connection reset
Cold start      | 10-30s for some models    | First request to a model takes longer
Large input     | 10-30s processing time    | Huge context takes longer to process

Setting appropriate timeouts

// SDK default timeouts are long (the OpenAI Node SDK defaults to 10 minutes); tighten them for interactive use

// Using the OpenAI SDK
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  timeout: 60000, // 60 seconds, much tighter than the SDK's 10-minute default
  maxRetries: 2    // SDK-level retries
});

// Per-request timeout using AbortController
async function callWithTimeout(prompt, timeoutMs = 30000) {
  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), timeoutMs);

  try {
    const response = await openai.chat.completions.create(
      {
        model: 'gpt-4o',
        messages: [{ role: 'user', content: prompt }],
        max_tokens: 1000
      },
      { signal: controller.signal }
    );

    clearTimeout(timeoutId);
    return response;

  } catch (error) {
    clearTimeout(timeoutId);

    if (error.name === 'AbortError') {
      console.error(`Request timed out after ${timeoutMs}ms`);
      throw new Error(`LLM request timed out after ${timeoutMs}ms`);
    }

    throw error; // Re-throw non-timeout errors
  }
}

Timeout guidelines by use case

Use Case                              | Recommended Timeout   | Reasoning
--------------------------------------|-----------------------|---------------------------------------
Quick extraction (short input/output) | 15-30 seconds         | Simple task, fast response expected
General chatbot                       | 30-60 seconds         | Moderate input/output
RAG with large context                | 60-90 seconds         | Large input takes longer to process
Complex analysis                      | 90-120 seconds        | Model thinks longer for complex tasks
Code generation                       | 60-120 seconds        | Can produce long outputs
Streaming                             | 120-180 seconds total | First token should arrive within 10s
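
These guidelines can be centralized in a small lookup so call sites don't scatter magic numbers. The task names below are illustrative, not from any SDK:

```javascript
// Illustrative per-task timeout table based on the guidelines above
const TIMEOUT_MS = {
  'quick-extraction': 30000,
  'chatbot': 60000,
  'rag-query': 90000,
  'analysis': 120000,
  'code-generation': 120000
};

function getTimeoutMs(taskType, fallbackMs = 60000) {
  return TIMEOUT_MS[taskType] ?? fallbackMs;
}

console.log(getTimeoutMs('rag-query')); // 90000
console.log(getTimeoutMs('unknown'));   // 60000 (fallback)
```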

6. API-Specific Timeout Errors

Each LLM provider has different error types for timeout and overload conditions.

OpenAI errors

import OpenAI from 'openai';

try {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: prompt }]
  });
} catch (error) {
  if (error instanceof OpenAI.APIConnectionError) {
    // Network error — could not connect
    console.error('Connection failed:', error.message);
    // RETRY: Yes — transient network issue
  }
  
  if (error instanceof OpenAI.RateLimitError) {
    // 429 Too Many Requests
    console.error('Rate limited:', error.message);
    // RETRY: Yes — after delay (see retry-after header)
  }

  if (error instanceof OpenAI.APIError) {
    if (error.status === 408) {
      // Request Timeout
      console.error('Request timed out on server side');
      // RETRY: Yes — server was overloaded
    }
    if (error.status === 500) {
      // Internal Server Error
      console.error('OpenAI server error');
      // RETRY: Yes — transient server issue
    }
    if (error.status === 503) {
      // Service Unavailable
      console.error('OpenAI service overloaded');
      // RETRY: Yes — after delay
    }
  }

  if (error instanceof OpenAI.AuthenticationError) {
    // 401 Unauthorized
    console.error('Invalid API key');
    // RETRY: No — fix the API key first
  }

  if (error instanceof OpenAI.BadRequestError) {
    // 400 Bad Request
    console.error('Invalid request:', error.message);
    // RETRY: No — fix the request first
  }
}

Building a comprehensive error classifier

/**
 * Classify an LLM API error into actionable categories.
 */
function classifyError(error) {
  // Timeout / abort
  if (error.name === 'AbortError' || error.code === 'ETIMEDOUT' || error.code === 'ECONNABORTED') {
    return { type: 'timeout', retryable: true, delay: 5000 };
  }

  // Network errors
  if (error.code === 'ECONNREFUSED' || error.code === 'ENOTFOUND' || error.code === 'ENETUNREACH') {
    return { type: 'network', retryable: true, delay: 10000 };
  }

  // HTTP status-based classification
  const status = error.status || error.statusCode;
  
  if (status === 429) {
    // Rate limit — check the Retry-After header (delay in seconds)
    const retryAfter = error.headers?.['retry-after'];
    const delay = retryAfter ? parseInt(retryAfter, 10) * 1000 : 30000;
    return { type: 'rate_limit', retryable: true, delay };
  }

  if (status === 408 || status === 504) {
    return { type: 'server_timeout', retryable: true, delay: 10000 };
  }

  if (status === 500 || status === 502 || status === 503) {
    return { type: 'server_error', retryable: true, delay: 15000 };
  }

  if (status === 400) {
    return { type: 'bad_request', retryable: false, delay: 0 };
  }

  if (status === 401 || status === 403) {
    return { type: 'auth_error', retryable: false, delay: 0 };
  }

  if (status === 404) {
    return { type: 'not_found', retryable: false, delay: 0 };
  }

  // Unknown error
  return { type: 'unknown', retryable: false, delay: 0 };
}
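
One way to consume these classifications is a small retry wrapper that takes the classifier as a parameter. This is a sketch; full retry strategies (backoff, jitter) are the subject of 4.10.c:

```javascript
// Retry a call according to the classifier's verdict (sketch)
async function withClassifiedRetries(fn, classify, maxAttempts = 3) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      const { type, retryable, delay } = classify(error);
      if (!retryable || attempt >= maxAttempts) throw error;
      console.warn(`Attempt ${attempt} failed (${type}); retrying in ${delay}ms`);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}
```

Paired with classifyError, non-retryable failures such as bad_request and auth_error propagate immediately instead of burning retry attempts.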

7. Handling Timeouts in Streaming Mode

Streaming changes the timeout dynamics. Instead of waiting for the full response, you receive tokens incrementally. The timeout concern shifts from "total response time" to "time between tokens."

/**
 * Stream an LLM response with per-chunk timeout detection.
 * If no new chunk arrives within chunkTimeoutMs, abort.
 */
async function streamWithTimeout(messages, options = {}) {
  const {
    totalTimeoutMs = 120000,  // 2 minutes total
    chunkTimeoutMs = 15000,   // 15 seconds between chunks
    maxTokens = 4096,
    onChunk = () => {}
  } = options;

  const controller = new AbortController();
  const totalTimer = setTimeout(() => controller.abort(), totalTimeoutMs);

  let fullContent = '';
  let chunkTimer = null;
  let finishReason = null;

  const resetChunkTimer = () => {
    if (chunkTimer) clearTimeout(chunkTimer);
    chunkTimer = setTimeout(() => {
      console.error('No chunk received in', chunkTimeoutMs, 'ms — aborting');
      controller.abort();
    }, chunkTimeoutMs);
  };

  try {
    const stream = await openai.chat.completions.create(
      {
        model: 'gpt-4o',
        messages,
        max_tokens: maxTokens,
        stream: true
      },
      { signal: controller.signal }
    );

    resetChunkTimer(); // Start watching for first chunk

    for await (const chunk of stream) {
      resetChunkTimer(); // Reset timer on each chunk

      const delta = chunk.choices[0]?.delta?.content || '';
      fullContent += delta;
      onChunk(delta);

      if (chunk.choices[0]?.finish_reason) {
        finishReason = chunk.choices[0].finish_reason;
      }
    }

    return {
      content: fullContent,
      finishReason,
      truncated: finishReason === 'length'
    };

  } catch (error) {
    if (error.name === 'AbortError') {
      return {
        content: fullContent, // Return whatever we got so far
        finishReason: 'timeout',
        truncated: true,
        error: 'Stream timed out'
      };
    }
    throw error;
  } finally {
    clearTimeout(totalTimer);
    if (chunkTimer) clearTimeout(chunkTimer);
  }
}

// Usage
const result = await streamWithTimeout(
  [{ role: 'user', content: 'Write a report...' }],
  {
    totalTimeoutMs: 60000,
    chunkTimeoutMs: 10000,
    onChunk: (text) => process.stdout.write(text)
  }
);

if (result.truncated) {
  console.warn('\nResponse was incomplete:', result.finishReason);
}

8. Comprehensive Partial Response Handler

Putting everything together into a single function that handles all partial response scenarios.

/**
 * Complete solution for handling partial responses.
 * Detects truncation, timeouts, and content filtering.
 * Returns a standardized result with metadata.
 */
async function robustLlmCall(messages, options = {}) {
  const {
    model = 'gpt-4o',
    maxTokens = 4096,
    timeoutMs = 60000,
    maxContinuations = 2,
    temperature = 0
  } = options;

  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), timeoutMs);

  let fullContent = '';
  let totalTokensUsed = 0;
  let continuationCount = 0;
  let currentMessages = [...messages];

  try {
    while (continuationCount <= maxContinuations) {
      const response = await openai.chat.completions.create(
        {
          model,
          messages: currentMessages,
          max_tokens: maxTokens,
          temperature
        },
        { signal: controller.signal }
      );

      const choice = response.choices[0];
      const content = choice.message.content || '';
      fullContent += content;
      totalTokensUsed += response.usage?.total_tokens || 0;

      // Check finish reason
      if (choice.finish_reason === 'stop') {
        // Completed normally
        return {
          success: true,
          content: fullContent,
          finishReason: 'stop',
          totalTokens: totalTokensUsed,
          continuations: continuationCount,
          truncated: false
        };
      }

      if (choice.finish_reason === 'content_filter') {
        return {
          success: false,
          content: fullContent,
          finishReason: 'content_filter',
          totalTokens: totalTokensUsed,
          continuations: continuationCount,
          truncated: true,
          error: 'Response blocked by content filter'
        };
      }

      if (choice.finish_reason === 'length') {
        continuationCount++;
        console.warn(`Response truncated (continuation ${continuationCount}/${maxContinuations})`);

        if (continuationCount > maxContinuations) {
          // Exceeded max continuations — return what we have
          return {
            success: false,
            content: fullContent,
            finishReason: 'length',
            totalTokens: totalTokensUsed,
            continuations: continuationCount - 1,
            truncated: true,
            error: 'Response truncated after max continuations'
          };
        }

        // Request continuation
        currentMessages = [
          ...messages,
          { role: 'assistant', content: fullContent },
          { role: 'user', content: 'Continue exactly where you left off. Do not repeat any previous content.' }
        ];
        continue;
      }

      // Unknown finish reason
      return {
        success: true,
        content: fullContent,
        finishReason: choice.finish_reason,
        totalTokens: totalTokensUsed,
        continuations: continuationCount,
        truncated: false
      };
    }
  } catch (error) {
    if (error.name === 'AbortError') {
      return {
        success: false,
        content: fullContent,
        finishReason: 'timeout',
        totalTokens: totalTokensUsed,
        continuations: continuationCount,
        truncated: true,
        error: `Request timed out after ${timeoutMs}ms`
      };
    }

    return {
      success: false,
      content: fullContent,
      finishReason: 'error',
      totalTokens: totalTokensUsed,
      continuations: continuationCount,
      truncated: true,
      error: error.message
    };
  } finally {
    clearTimeout(timeoutId);
  }
}

// Usage
const result = await robustLlmCall(
  [
    { role: 'system', content: 'You are a JSON extraction API.' },
    { role: 'user', content: 'Extract all entities from this document...' }
  ],
  {
    maxTokens: 4096,
    timeoutMs: 45000,
    maxContinuations: 2
  }
);

if (result.success) {
  console.log('Complete response received');
} else {
  console.error('Incomplete response:', result.error);
  console.log('Partial content length:', result.content.length);
  // Decide: retry, salvage partial content, or return error to user
}

9. Setting max_tokens Correctly

One of the most important preventive measures is setting max_tokens appropriately.

// RULE: max_tokens should be based on your expected output size, not arbitrary

// BAD: Arbitrary small number
{ max_tokens: 100 }  // Almost certainly too small for JSON

// BAD: Maximum possible value
{ max_tokens: 128000 }  // Rejected by most models (gpt-4o caps output at 16,384 tokens), and some APIs reserve rate-limit capacity for it

// GOOD: Estimate based on expected output
const MAX_TOKENS_MAP = {
  'simple-extraction':   500,   // {"name": "...", "age": ...}
  'list-extraction':     2000,  // Array of 10-20 items
  'summary':             1000,  // A paragraph or two
  'detailed-analysis':   4000,  // Multi-section response
  'code-generation':     8000,  // Substantial code block
  'document-generation': 16000  // Long-form content
};

function getMaxTokens(taskType, safetyMultiplier = 1.5) {
  const base = MAX_TOKENS_MAP[taskType] || 4096;
  return Math.ceil(base * safetyMultiplier);
}

// Usage
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [...],
  max_tokens: getMaxTokens('list-extraction') // 3000
});

Token budget calculator

/**
 * Calculate available output tokens given input size and model limits.
 */
function calculateOutputBudget(inputTokens, modelContextWindow, options = {}) {
  const {
    safetyMargin = 100,
    minOutputTokens = 256
  } = options;

  const available = modelContextWindow - inputTokens - safetyMargin;

  if (available < minOutputTokens) {
    throw new Error(
      `Not enough room for output. Input: ${inputTokens} tokens, ` +
      `Context: ${modelContextWindow}, Available: ${available}, ` +
      `Minimum required: ${minOutputTokens}`
    );
  }

  return available;
}

// Example
const inputTokens = 120000; // Large RAG context
const budget = calculateOutputBudget(inputTokens, 128000);
console.log(`Output budget: ${budget} tokens`);
// Output budget: 7900 tokens
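
calculateOutputBudget needs an input token count. Without a tokenizer on hand, a rough character-based estimate is enough for budgeting (assumption: roughly 4 characters per token for English text; use a real tokenizer such as tiktoken when accuracy matters):

```javascript
// Rough token estimate: ~4 characters per token for typical English text.
// This is a budgeting heuristic, not an exact count.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

console.log(estimateTokens('a'.repeat(400))); // 100
```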

10. Key Takeaways

  1. Always check finish_reason — a 200 OK response does not mean the output is complete; "length" means the model was cut off mid-generation.
  2. Set max_tokens deliberately — base it on expected output size with a safety multiplier, not arbitrary values.
  3. Network timeouts and truncation are different failures — timeouts give you no response; truncation gives you partial content. Handle both.
  4. Use AbortController for per-request timeouts — the SDK timeout is a global default; per-request control lets you set tighter limits for simple tasks and longer limits for complex ones.
  5. Streaming changes the timeout model — monitor time between chunks, not just total response time; a stall mid-stream is a strong signal that something went wrong.

Explain-It Challenge

  1. Your production system returns garbage JSON 3% of the time despite using response_format: json_object. The error logs show finish_reason: "length" for every failure. What is happening and how do you fix it?
  2. A user reports that the chatbot "hangs forever." You discover the API call has no timeout. Explain the risks and propose timeout values for three different features: quick search, document summary, and code generation.
  3. Your continuation strategy (asking the model to "continue where you left off") sometimes produces duplicate content. Why does this happen and what strategies can mitigate it?

Navigation: ← 4.10.a — Handling Invalid JSON · 4.10.c — Retry Mechanisms →