Episode 4 — Generative AI Engineering / 4.10 — Error Handling in AI Applications

4.10 — Error Handling in AI Applications: Quick Revision

Compact cheat sheet. Print-friendly.

How to use this material (instructions)

  1. Skim before labs or interviews.
  2. Drill gaps — reopen README.md and 4.10.a–4.10.d.
  3. Practice with 4.10-Exercise-Questions.md.
  4. Polish answers with 4.10-Interview-Questions.md.

Core vocabulary

| Term | One-liner |
|------|-----------|
| finish_reason | API field indicating why the model stopped: "stop" (complete), "length" (truncated), "content_filter" (blocked) |
| Malformed JSON | LLM output that looks like JSON but fails JSON.parse() — trailing commas, single quotes, extra text |
| Truncation | Output cut short because max_tokens was reached; HTTP 200 but content is incomplete |
| Exponential backoff | Doubling wait time between retries: 1s → 2s → 4s → 8s → 16s |
| Jitter | Random variation added to backoff delay to prevent synchronized retries |
| Thundering herd | Multiple clients retrying at the exact same time, re-triggering the failure |
| Circuit breaker | Pattern that stops retrying after N consecutive failures; fails fast until reset |
| PII filtering | Redacting personal information (emails, SSNs, phone numbers) from logs |
| Structured logging | Machine-readable (JSON) log format with consistent fields for querying |
| Retry-After | HTTP header from the server telling you when to retry |

Five failure modes to ALWAYS handle

1. Malformed JSON        → Multi-layer parser (extract, clean, validate)
2. Truncated output      → Check finish_reason, increase max_tokens, continuation
3. Timeout               → AbortController, per-request timeout, fallback
4. Rate limit (429)      → Exponential backoff with jitter, respect Retry-After
5. Wrong schema          → Zod validation, retry with error feedback

JSON parsing strategy (order of attempts)

1. JSON.parse(text)              → Direct parse (fast path, ~80% of responses)
2. Extract from ```json ``` block → Code block extraction
3. Find first { to last }       → Boundary extraction
4. Clean common errors           → Fix trailing commas, single quotes, comments
5. Extract + clean combined      → Boundary extraction THEN cleaning
6. FAIL                          → Return error, trigger retry
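A minimal sketch of this ladder in JavaScript (the function name is illustrative, and the cleaning step here only fixes trailing commas; a production cleaner handles more of the errors listed below):

```javascript
// Multi-layer JSON recovery ladder: try cheap strategies first,
// return null only when every layer fails (caller decides on retry).
function parseModelJson(text) {
  // 1. Fast path: direct parse.
  try { return JSON.parse(text); } catch {}

  // 2. Extract from a fenced ```json ... ``` block.
  const fence = text.match(/```(?:json)?\s*([\s\S]*?)```/);
  if (fence) {
    try { return JSON.parse(fence[1]); } catch {}
  }

  // 3. Boundary extraction: first { to last }.
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start !== -1 && end > start) {
    const slice = text.slice(start, end + 1);
    try { return JSON.parse(slice); } catch {}

    // 4/5. Clean common errors (trailing commas), then parse again.
    const cleaned = slice.replace(/,\s*([}\]])/g, "$1");
    try { return JSON.parse(cleaned); } catch {}
  }

  // 6. Give up: signal failure so the caller can trigger a retry.
  return null;
}
```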

Common JSON errors from LLMs

TRAILING COMMA:    {"a": 1, "b": 2,}     → replace ,} with }
SINGLE QUOTES:     {'a': 'value'}         → replace ' with "
UNQUOTED KEYS:     {name: "Alice"}        → add quotes to keys
COMMENTS:          {"a": 1 // comment}    → strip // and /* */
EXTRA TEXT:        Here: {"a": 1}         → extract { to }
CODE BLOCKS:       ```json {"a":1} ```    → extract from fences
NaN/undefined:     {"v": NaN}             → replace with null
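The fixes above can be sketched as a best-effort cleaner. Caveat: these regexes are naive and can corrupt string values that happen to contain the same patterns, so treat this as a fallback layer after direct parsing fails, never as the first step:

```javascript
// Best-effort cleanup for the common LLM JSON errors listed above.
function cleanLlmJson(raw) {
  let s = raw;
  // Extra text / code fences: keep only first { to last }.
  const start = s.indexOf("{");
  const end = s.lastIndexOf("}");
  if (start !== -1 && end > start) s = s.slice(start, end + 1);
  s = s.replace(/\/\/[^\n]*/g, "");                     // line comments
  s = s.replace(/\/\*[\s\S]*?\*\//g, "");               // block comments
  s = s.replace(/\bNaN\b|\bundefined\b/g, "null");      // non-JSON literals
  s = s.replace(/'([^']*)'/g, '"$1"');                  // single → double quotes
  s = s.replace(/([{,]\s*)([A-Za-z_]\w*)\s*:/g, '$1"$2":'); // quote bare keys
  s = s.replace(/,\s*([}\]])/g, "$1");                  // trailing commas
  return s;
}
```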

finish_reason cheat sheet

"stop"            → Complete. Safe to process.
"length"          → TRUNCATED. Output is incomplete. Retry with more tokens.
"content_filter"  → BLOCKED. Content violated safety policy.
"tool_calls"      → Model wants to call a function. Execute and continue.
null              → Still generating (streaming).
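Dispatching on finish_reason before trusting the content can be sketched like this (assumes an OpenAI-style choice object with finish_reason and message fields; adapt to your provider's response shape):

```javascript
// Check finish_reason BEFORE parsing or shipping the content.
function checkFinishReason(choice) {
  switch (choice.finish_reason) {
    case "stop":
      return { ok: true, content: choice.message.content };
    case "length":
      // HTTP 200 but the output is incomplete: do not parse it.
      return { ok: false, retry: true, reason: "truncated" };
    case "content_filter":
      // Permanent for this prompt: retrying the same input will not help.
      return { ok: false, retry: false, reason: "blocked" };
    case "tool_calls":
      // Execute the requested tools, then continue the conversation.
      return { ok: false, retry: false, toolCalls: choice.message.tool_calls };
    default:
      return { ok: false, retry: false, reason: "unknown" };
  }
}
```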

Timeout guidelines

Quick extraction:     15-30 seconds
General chatbot:      30-60 seconds
RAG (large context):  60-90 seconds
Complex analysis:     90-120 seconds
Code generation:      60-120 seconds
Streaming total:      120-180 seconds
Streaming per-chunk:  10-15 seconds
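One way to enforce these budgets is a generic timeout wrapper around any async call. A minimal sketch (withTimeout is an illustrative name; the AbortController signal is what a fetch-based caller would pass through so the underlying request is actually cancelled, not just abandoned):

```javascript
// Wrap any promise-returning call with a hard per-request timeout.
function withTimeout(run, timeoutMs) {
  const controller = new AbortController();
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      controller.abort(); // cancels the fetch if run() forwarded the signal
      reject(new Error(`timeout after ${timeoutMs}ms`));
    }, timeoutMs);
    run(controller.signal).then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Usage (hypothetical endpoint):
// withTimeout((signal) => fetch(API_URL, { method: "POST", signal }), 30_000);
```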

Retryable vs non-retryable

RETRY (transient):
  429 Rate limit        → Wait + retry (respect Retry-After)
  500 Server error      → Retry after backoff
  502 Bad gateway       → Retry after backoff
  503 Unavailable       → Retry after backoff
  504 Gateway timeout   → Retry after backoff
  ECONNRESET/ETIMEDOUT  → Retry after backoff
  Malformed output      → Retry (same prompt or with feedback)

DO NOT RETRY (permanent):
  400 Bad request       → Fix the request
  401 Unauthorized      → Fix the API key
  403 Forbidden         → Fix permissions
  404 Not found         → Fix the model name
  402 Payment required  → Add credits
  Context too long      → Reduce input size
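The table above reduces to a small predicate. A sketch assuming errors carry an HTTP status (or a Node connection error code); real SDK error shapes vary, so adapt the field names:

```javascript
// Classify an error as transient (safe to retry) or permanent (fail fast).
const RETRYABLE_STATUS = new Set([429, 500, 502, 503, 504]);
const RETRYABLE_CODES = new Set(["ECONNRESET", "ETIMEDOUT"]);

function isRetryable(err) {
  if (err.status !== undefined) return RETRYABLE_STATUS.has(err.status);
  if (err.code !== undefined) return RETRYABLE_CODES.has(err.code);
  return false; // unknown errors: fail fast rather than loop blindly
}
```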

Exponential backoff with full jitter

// Full jitter (AWS recommended)
delay = random(0, min(cap, base * 2^attempt))

// Example delays (base=1s, cap=60s):
// Attempt 0:  0 - 1s
// Attempt 1:  0 - 2s
// Attempt 2:  0 - 4s
// Attempt 3:  0 - 8s
// Attempt 4:  0 - 16s
// Attempt 5:  0 - 32s
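A direct translation of the formula (function name and defaults are illustrative):

```javascript
// Full jitter: uniform random delay in [0, min(cap, base * 2^attempt)].
function fullJitterDelay(attempt, baseMs = 1000, capMs = 60_000) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}
```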

Recommended max retries

Rate limit:          3-5 retries
Server error:        2-3 retries
Timeout:             2 retries
Malformed output:    2-3 retries
Schema validation:   2 retries (with feedback)
Connection error:    3 retries
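Putting classification, backoff, and the retry budget together gives a small loop. A sketch assuming thrown errors expose a status field and an optional pre-parsed retryAfterMs (both assumptions; adapt to your SDK's error shape):

```javascript
// Retry loop: retry only transient statuses, honor Retry-After when
// available, otherwise back off with full jitter, give up after maxRetries.
async function withRetries(call, { maxRetries = 3, baseMs = 1000, capMs = 60_000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const retryable = [429, 500, 502, 503, 504].includes(err.status);
      if (!retryable || attempt >= maxRetries) throw err;
      const delay = err.retryAfterMs
        ?? Math.random() * Math.min(capMs, baseMs * 2 ** attempt);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```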

Retry cost math

Single call:           $0.0075
With 1 retry:          $0.0150  (2x)
With feedback retry:   $0.0188  (2.5x, input grows)
3 feedback retries:    $0.0526  (7x)

At 100K requests/day, 5% failure rate:
  No retries:          $750/day
  With simple retries: $788/day  (+5%)
  With 3 feedback:     $1,015/day (+35%)

Logging checklist

ALWAYS LOG (every call):
  ✓ Request ID (unique)
  ✓ Trace ID (links to user request)
  ✓ Timestamp
  ✓ Model (exact version, NOT alias)
  ✓ Temperature, max_tokens
  ✓ Prompt tokens, completion tokens
  ✓ Latency (ms)
  ✓ finish_reason
  ✓ Cost (calculated)
  ✓ Feature name
  ✓ Prompt version
  ✓ Error type (if any)
  ✓ Validation success/failure
  ✓ Retry attempt number

LOG WITH CAUTION (PII-filtered):
  ⚠ System prompt content
  ⚠ User message content
  ⚠ Response content
  ⚠ Validation error details

NEVER LOG:
  ✗ Raw PII (SSN, credit cards)
  ✗ API keys or secrets
  ✗ Full conversation history in production
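A sketch of a structured log entry whose content fields always pass through a PII filter first. The two regexes below only catch emails and US-format SSNs and are purely illustrative; production filtering needs a real PII detection library:

```javascript
// Naive PII redaction: emails and US SSNs only (illustrative).
function redactPii(text) {
  return text
    .replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, "[EMAIL]")
    .replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN]");
}

// Structured (JSON-per-line) log entry with consistent fields.
function logLlmCall(fields) {
  const entry = {
    timestamp: new Date().toISOString(),
    ...fields,
    // Content is opt-in and ALWAYS passes through the filter.
    userMessage: fields.userMessage ? redactPii(fields.userMessage) : undefined,
  };
  console.log(JSON.stringify(entry)); // one JSON object per line
  return entry;
}
```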

Monitoring dashboard metrics

SUCCESS RATE:    Target >98%    Alert <95%
P95 LATENCY:     Target <5s     Alert >10s
ERROR RATE:      Target <2%     Alert >5%
DAILY COST:      Track trend    Alert on 50%+ spike
RETRY RATE:      Target <10%    Alert >15%
PARSE FAIL RATE: Target <2%     Alert >5%
TRUNCATION RATE: Target <1%     Alert >3%

Alert rules

CRITICAL:
  - Error rate > 5% (over 100+ requests)
  - Rate limit count > 50 in 5 minutes
  - Circuit breaker opened

WARNING:
  - P95 latency > 10 seconds
  - Cost rate > budget threshold
  - Validation failure rate > 10%
  - Truncation rate > 3%

Privacy tiers

PRODUCTION:   Metrics only, no content, PII filtered, 90-day retention
STAGING:      Content with PII filter, 30-day retention
DEVELOPMENT:  Full content (local only), 7-day retention
AUDIT:        Request ID + hash + timestamp, 7-year retention, immutable

Fallback chain

1. Primary model (gpt-4o) + 3 retries
   ↓ all failed
2. Fallback model (gpt-4o-mini) + 1 retry
   ↓ all failed
3. Cached response (stale but available)
   ↓ no cache
4. Graceful degradation (error message to user)
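The chain can be sketched as a function that tries each tier in order. The primary, fallback, and cacheGet callables are caller-supplied placeholders (the primary would itself wrap retries, per tier 1 above):

```javascript
// Fallback chain: primary → cheaper model → stale cache → graceful error.
async function withFallbacks({ primary, fallback, cacheGet }, input) {
  try {
    return { source: "primary", value: await primary(input) };
  } catch {}
  try {
    return { source: "fallback", value: await fallback(input) };
  } catch {}
  const cached = await cacheGet(input);
  if (cached !== undefined) {
    return { source: "cache", value: cached, stale: true };
  }
  // Last resort: degrade gracefully instead of crashing.
  return {
    source: "degraded",
    value: null,
    userMessage: "Service is temporarily unavailable. Please try again.",
  };
}
```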

The golden rule

EVERY LLM API CALL = 
  try/catch 
  + finish_reason check 
  + JSON parse with recovery 
  + schema validation 
  + timeout (AbortController) 
  + retry with backoff 
  + structured logging 
  + fallback plan

Pipeline error handling

Step 1 (LLM) → checkpoint → Step 2 (LLM) → checkpoint → Step 3 (LLM)
      ↓ fail                     ↓ fail                     ↓ fail
  retry (inner)              retry (inner)              retry (inner)
      ↓ fail                     ↓ fail                     ↓ fail
  fallback model             fallback model             fallback model
      ↓ fail                     ↓ fail                     ↓ fail
  ABORT pipeline             resume from                resume from
                             checkpoint 1               checkpoint 2

Budget: track total cost across all steps. Abort if cost > threshold.
Circuit breaker: shared across pipeline. If API is down, fail entire pipeline fast.
Trace ID: links all logs from all steps for end-to-end debugging.
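Checkpointed resumption can be sketched as below. The in-memory Map stands in for durable storage, and each step function is assumed to do its own inner retries and fallbacks; on a rerun, completed steps are skipped so earlier LLM calls are not paid for twice:

```javascript
// Run steps in sequence, checkpointing each result. A rerun with the
// same checkpoint store resumes from the last completed step.
async function runPipeline(steps, checkpoints = new Map()) {
  let input = null;
  for (let i = 0; i < steps.length; i++) {
    if (checkpoints.has(i)) {        // already done in a previous run
      input = checkpoints.get(i);
      continue;
    }
    input = await steps[i](input);   // step handles its own retry/fallback
    checkpoints.set(i, input);       // persist before moving on
  }
  return input;
}
```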

End of 4.10 quick revision.