Episode 4 — Generative AI Engineering / 4.10 — Error Handling in AI Applications
4.10 — Error Handling in AI Applications: Quick Revision
Compact cheat sheet. Print-friendly.
How to use this material (instructions)
- Skim before labs or interviews.
- Drill gaps — reopen README.md → 4.10.a…4.10.d.
- Practice — 4.10-Exercise-Questions.md.
- Polish answers — 4.10-Interview-Questions.md.
Core vocabulary
| Term | One-liner |
|---|---|
| finish_reason | API field indicating why the model stopped: "stop" (complete), "length" (truncated), "content_filter" (blocked) |
| Malformed JSON | LLM output that looks like JSON but fails JSON.parse() — trailing commas, single quotes, extra text |
| Truncation | Output cut short because max_tokens was reached; HTTP 200 but content is incomplete |
| Exponential backoff | Doubling wait time between retries: 1s → 2s → 4s → 8s → 16s |
| Jitter | Random variation added to backoff delay to prevent synchronized retries |
| Thundering herd | Multiple clients retrying at the exact same time, re-triggering the failure |
| Circuit breaker | Pattern that stops retrying after N consecutive failures; fails fast until reset |
| PII filtering | Redacting personal information (emails, SSNs, phone numbers) from logs |
| Structured logging | Machine-readable (JSON) log format with consistent fields for querying |
| Retry-After | HTTP header from the server telling you when to retry |
Five failure modes to ALWAYS handle
1. Malformed JSON → Multi-layer parser (extract, clean, validate)
2. Truncated output → Check finish_reason, increase max_tokens, continuation
3. Timeout → AbortController, per-request timeout, fallback
4. Rate limit (429) → Exponential backoff with jitter, respect Retry-After
5. Wrong schema → Zod validation, retry with error feedback
JSON parsing strategy (order of attempts)
1. JSON.parse(text) → Direct parse (fast path, ~80% of responses)
2. Extract from ```json ``` block → Code block extraction
3. Find first { to last } → Boundary extraction
4. Clean common errors → Fix trailing commas, single quotes, comments
5. Extract + clean combined → Boundary extraction THEN cleaning
6. FAIL → Return error, trigger retry
Common JSON errors from LLMs
TRAILING COMMA: {"a": 1, "b": 2,} → remove the comma before } or ]
SINGLE QUOTES: {'a': 'value'} → replace ' with "
UNQUOTED KEYS: {name: "Alice"} → add quotes to keys
COMMENTS: {"a": 1 // comment} → strip // and /* */
EXTRA TEXT: Here: {"a": 1} → extract first { to last }
CODE BLOCKS: ```json {"a":1} ``` → extract from fences
NaN/undefined: {"v": NaN} → replace with null
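The cascade and the fixes above can be sketched together. This is a minimal illustration, not a full JSON repairer — the regexes are deliberately naive (e.g. the quote and comment fixes can corrupt string values that contain `'` or `//`), which is exactly why layer 6 still exists.

```typescript
type ParseResult = { ok: true; value: unknown } | { ok: false };

// Layer 4: fix the common errors listed above (naive, order-sensitive regexes).
function cleanJson(s: string): string {
  return s
    .replace(/\/\*[\s\S]*?\*\//g, "")                        // strip /* */ comments
    .replace(/\/\/[^\n]*/g, "")                              // strip // comments
    .replace(/'/g, '"')                                      // single → double quotes (naive)
    .replace(/\bNaN\b|\bundefined\b/g, "null")               // NaN/undefined → null
    .replace(/([{,]\s*)([A-Za-z_][A-Za-z0-9_]*)\s*:/g, '$1"$2":') // quote bare keys
    .replace(/,\s*([}\]])/g, "$1");                          // remove trailing commas
}

// Layer 3: boundary extraction, first "{" to last "}".
function extractBraces(s: string): string {
  const start = s.indexOf("{");
  const end = s.lastIndexOf("}");
  return start >= 0 && end > start ? s.slice(start, end + 1) : s;
}

// The six-layer cascade: first layer whose output parses wins.
function parseLlmJson(text: string): ParseResult {
  const layers: Array<(s: string) => string> = [
    (s) => s,                                                 // 1. direct parse
    (s) => s.match(/```(?:json)?\s*([\s\S]*?)```/)?.[1] ?? s, // 2. code-block extraction
    extractBraces,                                            // 3. boundary extraction
    cleanJson,                                                // 4. clean common errors
    (s) => cleanJson(extractBraces(s)),                       // 5. extract THEN clean
  ];
  for (const layer of layers) {
    try {
      return { ok: true, value: JSON.parse(layer(text)) };
    } catch {
      // this layer failed — try the next one
    }
  }
  return { ok: false };                                       // 6. caller triggers retry
}
```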
finish_reason cheat sheet
"stop" → Complete. Safe to process.
"length" → TRUNCATED. Output is incomplete. Retry with more tokens.
"content_filter" → BLOCKED. Content violated safety policy.
"tool_calls" → Model wants to call a function. Execute and continue.
null → Still generating (streaming).
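Dispatching on the field can be a plain exhaustive switch. The values are OpenAI-style; the action names on the right are illustrative, not an API:

```typescript
type FinishReason = "stop" | "length" | "content_filter" | "tool_calls" | null;
type Action = "process" | "retry_more_tokens" | "blocked" | "run_tools" | "keep_streaming";

function handleFinish(reason: FinishReason): Action {
  switch (reason) {
    case "stop":           return "process";           // complete — safe to parse
    case "length":         return "retry_more_tokens"; // truncated — raise max_tokens
    case "content_filter": return "blocked";           // do NOT retry the same content
    case "tool_calls":     return "run_tools";         // execute tools, then continue
    case null:             return "keep_streaming";    // stream still in flight
  }
}
```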
Timeout guidelines
Quick extraction: 15-30 seconds
General chatbot: 30-60 seconds
RAG (large context): 60-90 seconds
Complex analysis: 90-120 seconds
Code generation: 60-120 seconds
Streaming total: 120-180 seconds
Streaming per-chunk: 10-15 seconds
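A per-request deadline via AbortController might look like the wrapper below. The helper name, URL, and 90s budget are illustrative (the 90s matches the RAG row above); `fetch` genuinely rejects when its `signal` aborts.

```typescript
// Generic deadline wrapper: the AbortController fires at the timeout, and the
// signal cancels fetch-style calls that honor it.
function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  // Clear the timer on success AND failure so it never fires late.
  return run(controller.signal).finally(() => clearTimeout(timer));
}

// Usage sketch (placeholder URL; RAG with large context gets 60-90s):
// const res = await withTimeout(
//   (signal) => fetch("https://api.example.com/v1/chat/completions", { signal }),
//   90_000,
// );
```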
Retryable vs non-retryable
RETRY (transient):
429 Rate limit → Wait + retry (respect Retry-After)
500 Server error → Retry after backoff
502 Bad gateway → Retry after backoff
503 Unavailable → Retry after backoff
504 Gateway timeout → Retry after backoff
ECONNRESET/ETIMEDOUT → Retry after backoff
Malformed output → Retry (same prompt or with feedback)
DO NOT RETRY (permanent):
400 Bad request → Fix the request
401 Unauthorized → Fix the API key
403 Forbidden → Fix permissions
404 Not found → Fix the model name
402 Payment required → Add credits
Context too long → Reduce input size
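The table collapses into a small classifier. Using status `0` to stand for network-level errors (ECONNRESET/ETIMEDOUT) is a convention of this sketch, not a standard:

```typescript
// Transient → retry with backoff; everything else → surface and fix the request.
function isRetryable(status: number): boolean {
  if (status === 429) return true;                 // rate limit: backoff + Retry-After
  if (status >= 500 && status <= 504) return true; // 500/502/503/504: transient server errors
  if (status === 0) return true;                   // connection reset / timeout (our convention)
  return false;                                    // 400/401/402/403/404 etc.: permanent
}
```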
Exponential backoff with full jitter
// Full jitter (AWS recommended)
delay = random(0, min(cap, base * 2^attempt))
// Example delays (base=1s, cap=60s):
// Attempt 0: 0 - 1s
// Attempt 1: 0 - 2s
// Attempt 2: 0 - 4s
// Attempt 3: 0 - 8s
// Attempt 4: 0 - 16s
// Attempt 5: 0 - 32s
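A runnable sketch of the formula plus a retry loop around it. The `retryAfterMs` property on the error is an assumed convention here (e.g. parsed from a 429's Retry-After header by the caller), not a real API:

```typescript
// Full jitter (the AWS-recommended variant): uniform in [0, ceiling).
function fullJitterDelay(attempt: number, baseMs = 1_000, capMs = 60_000): number {
  return Math.random() * Math.min(capMs, baseMs * 2 ** attempt);
}

// Retry loop: a server-provided Retry-After hint takes precedence over jitter.
async function retryWithBackoff<T>(
  call: () => Promise<T>,
  { maxRetries = 3, baseMs = 1_000 } = {},
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (attempt >= maxRetries) throw err;       // budget exhausted — give up
      const hinted = (err as { retryAfterMs?: number }).retryAfterMs; // assumed convention
      const delayMs = hinted ?? fullJitterDelay(attempt, baseMs);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```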
Recommended max retries
Rate limit: 3-5 retries
Server error: 2-3 retries
Timeout: 2 retries
Malformed output: 2-3 retries
Schema validation: 2 retries (with feedback)
Connection error: 3 retries
Retry cost math
Single call: $0.0075
With 1 retry: $0.0150 (2x)
With feedback retry: $0.0188 (2.5x, input grows)
3 feedback retries: $0.0526 (7x)
At 100K requests/day, 5% failure rate:
No retries: $750/day
With simple retries: $788/day (+5%)
With 3 feedback: $1,015/day (+35%)
Logging checklist
ALWAYS LOG (every call):
✓ Request ID (unique)
✓ Trace ID (links to user request)
✓ Timestamp
✓ Model (exact version, NOT alias)
✓ Temperature, max_tokens
✓ Prompt tokens, completion tokens
✓ Latency (ms)
✓ finish_reason
✓ Cost (calculated)
✓ Feature name
✓ Prompt version
✓ Error type (if any)
✓ Validation success/failure
✓ Retry attempt number
LOG WITH CAUTION (PII-filtered):
⚠ System prompt content
⚠ User message content
⚠ Response content
⚠ Validation error details
NEVER LOG:
✗ Raw PII (SSN, credit cards)
✗ API keys or secrets
✗ Full conversation history in production
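The checklist might translate to one JSON line per call, with content passing through a PII filter first. Field names and redaction regexes below are illustrative examples, not a complete filter (real PII detection needs far more than three patterns):

```typescript
// Naive example redactions for the "log with caution" tier.
function redactPii(text: string): string {
  return text
    .replace(/\b[\w.+-]+@[\w-]+(\.[\w-]+)+\b/g, "[EMAIL]") // email addresses
    .replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN]")            // US SSN shape
    .replace(/\b(?:\d[ -]?){13,16}\b/g, "[CARD]");         // card-like digit runs
}

interface CallMetrics {
  requestId: string;
  traceId: string;
  model: string;            // exact version, NOT an alias
  temperature: number;
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
  finishReason: string;
  costUsd: number;
  errorType?: string;
}

// One machine-readable line per call; content only goes in PII-filtered.
function toLogLine(metrics: CallMetrics, userContent?: string): string {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    ...metrics,
    ...(userContent !== undefined ? { userContent: redactPii(userContent) } : {}),
  });
}
```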
Monitoring dashboard metrics
SUCCESS RATE: Target >98% | Alert <95%
P95 LATENCY: Target <5s | Alert >10s
ERROR RATE: Target <2% | Alert >5%
DAILY COST: Track trend | Alert on 50%+ spike
RETRY RATE: Target <10% | Alert >15%
PARSE FAIL RATE: Target <2% | Alert >5%
TRUNCATION RATE: Target <1% | Alert >3%
Alert rules
CRITICAL:
- Error rate > 5% (over 100+ requests)
- Rate limit count > 50 in 5 minutes
- Circuit breaker opened
WARNING:
- P95 latency > 10 seconds
- Cost rate > budget threshold
- Validation failure rate > 10%
- Truncation rate > 3%
Privacy tiers
PRODUCTION: Metrics only, no content, PII filtered, 90-day retention
STAGING: Content with PII filter, 30-day retention
DEVELOPMENT: Full content (local only), 7-day retention
AUDIT: Request ID + hash + timestamp, 7-year retention, immutable
Fallback chain
1. Primary model (gpt-4o) + 3 retries
↓ all failed
2. Fallback model (gpt-4o-mini) + 1 retry
↓ all failed
3. Cached response (stale but available)
↓ no cache
4. Graceful degradation (error message to user)
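The chain can be sketched as below. The model names come from the chain above; the in-memory cache and the retry helper are simplified placeholders (in production the retry rung would use backoff with jitter):

```typescript
// Naive retry helper: retries+1 total attempts, no backoff, for brevity.
async function retryTimes<T>(fn: () => Promise<T>, retries: number): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i <= retries; i++) {
    try { return await fn(); } catch (e) { lastErr = e; }
  }
  throw lastErr;
}

async function withFallbacks(
  prompt: string,
  callModel: (model: string, prompt: string) => Promise<string>,
  cache: Map<string, string>,
): Promise<{ source: "primary" | "fallback" | "cache" | "degraded"; text: string }> {
  try { // rung 1: primary model + 3 retries
    return { source: "primary", text: await retryTimes(() => callModel("gpt-4o", prompt), 3) };
  } catch { /* primary exhausted */ }
  try { // rung 2: fallback model + 1 retry
    return { source: "fallback", text: await retryTimes(() => callModel("gpt-4o-mini", prompt), 1) };
  } catch { /* fallback exhausted */ }
  const stale = cache.get(prompt); // rung 3: stale but available
  if (stale !== undefined) return { source: "cache", text: stale };
  // rung 4: graceful degradation
  return { source: "degraded", text: "Sorry, this feature is temporarily unavailable." };
}
```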
The golden rule
EVERY LLM API CALL =
try/catch
+ finish_reason check
+ JSON parse with recovery
+ schema validation
+ timeout (AbortController)
+ retry with backoff
+ structured logging
+ fallback plan
Pipeline error handling
Step 1 (LLM) → checkpoint → Step 2 (LLM) → checkpoint → Step 3 (LLM)
On failure, every step escalates the same way:
1. retry (inner, with backoff)
2. fallback model
3. last resort — Step 1: ABORT pipeline; Step 2: resume from checkpoint 1; Step 3: resume from checkpoint 2
Budget: track total cost across all steps. Abort if cost > threshold.
Circuit breaker: shared across pipeline. If API is down, fail entire pipeline fast.
Trace ID: links all logs from all steps for end-to-end debugging.
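A minimal breaker shared across pipeline steps might look like this; the threshold of 5 failures and 30s reset window are illustrative defaults, and timestamps are injectable for testability:

```typescript
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(private threshold = 5, private resetMs = 30_000) {}

  // Open → fail fast; after resetMs, allow one half-open probe.
  canProceed(now = Date.now()): boolean {
    if (this.openedAt === null) return true;
    if (now - this.openedAt >= this.resetMs) {
      this.openedAt = null;
      this.failures = this.threshold - 1; // one more failure re-opens immediately
      return true;
    }
    return false;
  }

  recordSuccess(): void { this.failures = 0; }

  recordFailure(now = Date.now()): void {
    this.failures++;
    if (this.failures >= this.threshold) this.openedAt = now; // trip open
  }
}
```

Because every pipeline step checks the same instance, one API outage trips the breaker once and the whole pipeline fails fast instead of burning three steps' worth of retries.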
End of 4.10 quick revision.