Episode 4 — Generative AI Engineering / 4.10 — Error Handling in AI Applications

Interview Questions: Error Handling in AI Applications

Model answers for handling invalid JSON, partial responses and timeouts, retry mechanisms, and logging AI requests.

How to use this material (instructions)

  1. Read lessons in order: README.md, then 4.10.a–4.10.d.
  2. Practice out loud — definition → example → pitfall.
  3. Pair with exercises: 4.10-Exercise-Questions.md.
  4. Quick review: 4.10-Quick-Revision.md.

Beginner (Q1–Q4)

Q1. Why do LLMs return invalid JSON, and how do you handle it?

Why interviewers ask: Tests whether you understand that LLMs are text-prediction machines, not structured-data generators, and whether you've built real systems on top of them.

Model answer:

LLMs return invalid JSON because they generate text token by token based on statistical prediction — they have no internal JSON parser validating their output as they produce it. Common failures include: wrapping JSON in conversational text ("Here's the data: {...}"), using single quotes instead of double quotes (a Python/JavaScript habit from training data), adding trailing commas (valid in JavaScript but not in JSON), leaving unquoted keys, and including comments.

To handle this, I build a multi-layer parsing strategy. Layer 1: try JSON.parse() directly. Layer 2: extract JSON from markdown code blocks (```json ... ```). Layer 3: find the outermost {...} or [...] boundaries and parse the extracted substring. Layer 4: apply cleaning transformations — remove trailing commas, fix single quotes, strip comments, add quotes to bare keys. Each layer is tried in order; the function returns the first successful parse. After parsing, I always validate against a schema (using Zod or similar) because valid JSON is not the same as correct JSON.

For prevention, I use response_format: { type: 'json_object' } on OpenAI, or json_schema for strict structured output. But even with these, I still wrap everything in error handling because responses can be truncated.


Q2. What is finish_reason and why is it important?

Why interviewers ask: Separates candidates who've read documentation from those who've handled real production failures.

Model answer:

finish_reason is a field in every LLM API response that tells you why the model stopped generating text. It's critical because an HTTP 200 response does not guarantee the output is complete or usable.

The key values are: "stop" — the model finished naturally (safe to process). "length" — the model hit the max_tokens limit and was cut off mid-generation (output is almost certainly incomplete, and JSON will be broken). "content_filter" — safety filters blocked the output (content may be null or redacted). "tool_calls" — the model wants to invoke a function (output is partial, waiting for tool results).

In production, I always check finish_reason before processing the response. If it's "length", I either retry with a higher max_tokens, attempt to salvage the partial content, or request a continuation. The most insidious bug is ignoring finish_reason: "length" — you get a 200 OK with truncated JSON, your parser crashes, and the error message says "invalid JSON" with no indication that the real problem was a token limit.
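The decision described above can be sketched as a small dispatch function. This is a minimal sketch assuming an OpenAI-style response shape (a choice object with a finish_reason string); the action names are illustrative, not an SDK API:

```javascript
// Decide how to treat a completion based on finish_reason.
// Assumes an OpenAI-style choice object; other providers use different fields.
function assessCompletion(choice) {
  switch (choice.finish_reason) {
    case 'stop':
      return { ok: true, action: 'process' };                       // complete output
    case 'length':
      return { ok: false, action: 'retry_with_higher_max_tokens' }; // truncated mid-generation
    case 'content_filter':
      return { ok: false, action: 'surface_policy_error' };         // blocked by safety filters
    case 'tool_calls':
      return { ok: true, action: 'execute_tools' };                 // partial by design
    default:
      return { ok: false, action: 'log_and_investigate' };          // unknown value
  }
}
```

The default branch matters: providers occasionally add new finish_reason values, and an unknown value should be logged rather than silently processed.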


Q3. When should you retry an LLM API call and when should you not?

Why interviewers ask: Tests your ability to classify errors and avoid wasting resources on unrecoverable failures.

Model answer:

Retry when the error is transient — something that might resolve on its own: rate limits (429 — the limit window will reset), server errors (500/502/503 — typically brief outages), network errors (connection reset, timeout — the network may recover), and malformed model output (200 with bad JSON — the model is non-deterministic and may produce valid output on retry).

Do not retry when the error is permanent — something that will fail the same way every time: authentication errors (401/403 — your API key is wrong), bad request (400 — your request is malformed), model not found (404 — wrong model name), content policy violation (prompt violates terms — same prompt will be rejected again), and context length exceeded (input too long — same input will fail again).

For retries, I use exponential backoff with full jitter: start at 1 second, double each attempt, cap at 60 seconds, and randomize to prevent thundering herds. I respect Retry-After headers when present. Max retries is typically 2-3 for server errors, up to 5 for rate limits, and 2 for malformed output (with error feedback in the prompt on subsequent attempts).
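The transient/permanent split and the per-class retry limits above can be captured in a couple of small helpers. A minimal sketch — the status list and limits mirror the answer, and `undefined` stands in for a network error with no HTTP response:

```javascript
// Classify an HTTP status as retryable (transient) or not (permanent).
function isRetryable(status) {
  if (status === 429) return true;                          // rate limit: window resets
  if ([500, 502, 503, 504].includes(status)) return true;   // transient server errors
  if (status === undefined) return true;                    // network error, no response
  return false;                                             // 400/401/403/404: permanent
}

// Retry budget differs by error class, per the limits above.
function maxRetries(status) {
  if (status === 429) return 5;          // rate limits: worth waiting out
  if (isRetryable(status)) return 3;     // server/network errors
  return 0;                              // permanent errors: fail immediately
}
```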


Q4. Why is logging more important for AI applications than traditional applications?

Why interviewers ask: Tests understanding of non-determinism and its operational implications.

Model answer:

Traditional APIs are deterministic — same input, same output. If a user reports a bug, the developer can reproduce it by replaying the same request. LLM applications are non-deterministic — the same prompt can produce different outputs on each call, and the model itself changes when the provider ships updates.

Without comprehensive logging, debugging an LLM issue is often impossible. The user says "the AI gave me wrong information," but when the developer tries the same prompt, they get a correct response. The original failure is gone forever. With logging — full request (prompt, parameters, model version) and response (content, tokens, latency, finish_reason) — you can see exactly what happened.

Beyond debugging, logs enable cost tracking (token-level attribution per feature), prompt optimization (A/B testing prompt versions with quantitative metrics), regression detection (automated alerts when error rates spike after a model update), and compliance auditing (proving what the AI actually said). Every LLM call should log: request ID, model version, temperature, token counts, latency, finish_reason, validation results, and error details. Content logging requires PII filtering in production.
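The per-call fields listed above can be assembled into one structured, PII-free log record. A sketch with illustrative field names (not a standard schema); `usage` assumes the OpenAI-style token-count shape:

```javascript
// Build a structured, content-free log entry for one LLM call.
// Field names are illustrative; usage follows the OpenAI-style shape.
function buildLogEntry({ requestId, model, temperature, usage, latencyMs,
                         finishReason, validationOk, errorType = null,
                         feature, promptVersion }) {
  return {
    requestId,
    model,                                    // exact version string, not an alias
    temperature,
    promptTokens: usage.prompt_tokens,
    completionTokens: usage.completion_tokens,
    latencyMs,
    finishReason,
    validationOk,
    errorType,
    feature,                                  // enables per-feature cost attribution
    promptVersion,                            // enables A/B comparison across versions
    timestamp: new Date().toISOString(),
  };
}
```

Because no prompt or response content appears here, this record can be retained long-term without PII filtering.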


Intermediate (Q5–Q8)

Q5. Walk me through how you'd build a production-ready JSON parser for LLM output.

Why interviewers ask: Tests practical engineering skill and understanding of the full range of LLM output failure modes.

Model answer:

I build a five-layer pipeline that tries increasingly aggressive strategies:

Layer 1: Direct parse. JSON.parse(text.trim()). Works when the model returns clean JSON. Fast path for the ~80% of responses that are already valid.

Layer 2: Code block extraction. Regex match ```json ... ``` or ``` ... ```. LLMs frequently wrap JSON in markdown fences. Extract the content and parse.

Layer 3: Boundary extraction. Find the first { and last } (or [ and ]). This handles "Here's the data: {...} Hope this helps!" patterns. Parse the extracted substring.

Layer 4: Cleaning. Apply transformations to fix common LLM errors: remove trailing commas (e.g. text.replace(/,\s*([\]}])/g, '$1')), replace single quotes with double quotes (if no double quotes are present), add quotes to unquoted keys, remove JavaScript-style comments, replace NaN/undefined with null.

Layer 5: Extraction + cleaning combined. Apply boundary extraction first, then cleaning on the extracted content. This handles cases like the model wrapping malformed JSON in text.

The function returns { success, data, strategy, error } — the strategy field tells me which layer succeeded, which is valuable for monitoring. I track strategy distribution in logs: if "direct" drops from 80% to 60%, it signals a prompt regression.

After parsing, I always validate with Zod against a schema. Parseable JSON can still have wrong fields, wrong types, or missing required properties. The parser + validator pipeline catches both syntax and semantic errors.
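The layered pipeline can be condensed into one function. A simplified sketch — it collapses layers 4 and 5 into a single cleaning pass and only fixes trailing commas, where a production version would apply the full set of cleaning transformations:

```javascript
// Layered JSON recovery: try each strategy in order, return the first
// successful parse along with the strategy name for monitoring.
function parseLlmJson(text) {
  const layers = [
    ['direct', (t) => t.trim()],
    ['code_block', (t) => {
      const m = t.match(/```(?:json)?\s*([\s\S]*?)```/);   // markdown fence
      return m ? m[1] : null;
    }],
    ['boundaries', (t) => {
      const start = t.search(/[{\[]/);                      // first { or [
      const end = Math.max(t.lastIndexOf('}'), t.lastIndexOf(']'));
      return start >= 0 && end > start ? t.slice(start, end + 1) : null;
    }],
    ['cleaned', (t) => t
      .replace(/```(?:json)?/g, '')                         // strip fences
      .replace(/,\s*([\]}])/g, '$1')                        // remove trailing commas
      .trim()],
  ];
  for (const [strategy, transform] of layers) {
    const candidate = transform(text);
    if (!candidate) continue;
    try {
      return { success: true, data: JSON.parse(candidate), strategy };
    } catch {
      // fall through to the next, more aggressive layer
    }
  }
  return { success: false, error: 'unparseable' };
}
```

The returned strategy field is what feeds the monitoring described above: a shift in its distribution is an early prompt-regression signal.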


Q6. How would you implement exponential backoff with jitter for an LLM API?

Why interviewers ask: Tests knowledge of distributed systems patterns and ability to implement them correctly.

Model answer:

Exponential backoff increases wait time between retries: 1s, 2s, 4s, 8s, 16s — calculated as base * 2^attempt, capped at a maximum (typically 60 seconds).

Jitter adds randomness to prevent thundering herds. Without jitter, if 1,000 clients get rate-limited simultaneously, they all retry at exactly 1s, then 2s, then 4s — creating synchronized spikes that re-trigger the rate limit. With jitter, retries spread across the time window.

I prefer full jitter (recommended by AWS): delay = random(0, min(cap, base * 2^attempt)). This gives maximum spread. Equal jitter (delay = exponential/2 + random(0, exponential/2)) guarantees a minimum delay but provides less spreading.

Implementation:

```javascript
// Full jitter: a uniform random delay between 0 and the exponential cap.
function fullJitterBackoff(attempt, baseMs = 1000, capMs = 60000) {
  return Math.round(Math.random() * Math.min(capMs, baseMs * Math.pow(2, attempt)));
}
```

Before retrying, I also check the Retry-After header — if the server tells me when to retry, I respect that instead of my own calculation. I also classify errors: only retry transient errors (429, 500, 502, 503, 504, network failures). Never retry 400, 401, 403, 404. The retry function tracks total delay and cost, and I set a cost ceiling to prevent expensive retry loops, especially for feedback retries where input tokens grow with each attempt.
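The Retry-After handling can be folded into the delay calculation. A sketch assuming the header carries a delay in seconds (it can also be an HTTP date, which this version doesn't handle):

```javascript
// Compute the wait before the next retry: honor Retry-After when present,
// otherwise fall back to full-jitter exponential backoff.
function nextDelayMs(attempt, retryAfterHeader, baseMs = 1000, capMs = 60000) {
  if (retryAfterHeader != null) {
    const seconds = Number(retryAfterHeader);
    if (!Number.isNaN(seconds)) return seconds * 1000;  // server knows best
  }
  return Math.round(Math.random() * Math.min(capMs, baseMs * 2 ** attempt));
}
```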


Q7. How do you handle the situation where the LLM returns valid JSON but it doesn't match your expected schema?

Why interviewers ask: Tests whether you validate beyond parsing, and whether you've dealt with the subtlety of "correct format, wrong content."

Model answer:

Valid JSON with wrong structure is actually more common than outright invalid JSON, especially with response_format: json_object enabled. The model might return {"user_name": "Alice"} when you expected {"name": "Alice"}, or {"age": "thirty"} instead of {"age": 30}, or omit required fields entirely.

My approach has three parts:

1. Schema validation with Zod. Every LLM response is validated against a strict Zod schema immediately after parsing. Zod's .safeParse() returns detailed error objects instead of throwing, letting me handle failures gracefully:

```javascript
const result = schema.safeParse(parsed);
if (!result.success) {
  // result.error.issues gives specific field-level errors
}
```

2. Retry with error feedback. When validation fails, I add the assistant's response and the validation errors to the conversation and ask the model to correct itself. This is remarkably effective — giving the model specific feedback like "Field 'age': Expected number, received string" usually produces correct output on the first retry.

3. Cost and attempt limits. Feedback retries are expensive because input grows each time (previous response + errors added to context). I cap at 2 feedback retries and track cost per sequence. If a particular prompt regularly triggers validation failures, that's a signal to improve the prompt rather than relying on retries.
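The three parts combine into a feedback-retry loop. A minimal sketch with an injected `callModel` function and a Zod-style validator (`safeParse`) — both are stand-ins rather than a real SDK, and the model call is synchronous here only for brevity:

```javascript
// Retry-with-error-feedback: on validation failure, append the bad response
// and the field-level errors to the conversation and ask for a correction.
function generateWithFeedback(callModel, schema, messages, maxFeedbackRetries = 2) {
  for (let attempt = 0; attempt <= maxFeedbackRetries; attempt++) {
    const raw = callModel(messages);
    const result = schema.safeParse(JSON.parse(raw));
    if (result.success) return result.data;
    // Feed the assistant's response plus specific errors back into context.
    messages = [
      ...messages,
      { role: 'assistant', content: raw },
      { role: 'user', content: 'Your JSON failed validation: '
          + JSON.stringify(result.error.issues) + '. Return corrected JSON only.' },
    ];
  }
  throw new Error('validation failed after feedback retries');
}
```

Note that `messages` grows on every loop iteration, which is exactly why the cost cap in part 3 is necessary.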


Q8. Design a logging strategy for a production LLM application that handles sensitive user data.

Why interviewers ask: Tests your ability to balance debugging needs with privacy and compliance requirements.

Model answer:

The core tension is: comprehensive logs are essential for debugging non-deterministic AI, but prompts often contain PII. My strategy uses tiered logging:

Tier 1: Metrics (always logged, no PII). Every call logs: request ID, model, temperature, token counts, latency, finish_reason, cost, validation success/failure, error type, feature name, prompt version. This is machine-readable JSON sent to a time-series database. Retained for 1 year.

Tier 2: Filtered content (staging + debug mode). Full request/response content with PII filter applied — regex-based redaction of emails, phone numbers, SSNs, credit card numbers. Retained for 30 days. Access restricted to the engineering team.

Tier 3: Full content (development only). Unfiltered content for local debugging. Never stored in centralized logs. 7-day retention on developer machines only.

Tier 4: Audit trail (compliance). Request ID, user ID (hashed), timestamp, model, feature, and a content hash (not the content itself). Proves what happened without storing sensitive data. Immutable storage, 7-year retention.

The PII filter is applied at the logging layer, not the application layer, so it can't be accidentally bypassed. I also implement consent-based detailed logging — if a user reports an issue, we can temporarily enable Tier 2 logging for their session with their consent.
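A sketch of the Tier-2 redaction step. The patterns below are deliberately simple illustrations — a production filter needs broader coverage (names, addresses, international formats) and ideally an NER-based pass on top of regexes:

```javascript
// Regex-based PII redaction, applied at the logging layer before content
// is written anywhere. Patterns are illustrative, not exhaustive.
function redactPii(text) {
  return text
    .replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, '[EMAIL]')
    .replace(/\b\d{3}-\d{2}-\d{4}\b/g, '[SSN]')
    .replace(/\b(?:\d[ -]?){13,16}\b/g, '[CARD]')
    .replace(/\b\d{3}[ -.]?\d{3}[ -.]?\d{4}\b/g, '[PHONE]');
}
```

Order matters: the SSN pattern (3-2-4 digits) runs before the phone pattern (3-3-4 digits) so neither clobbers the other.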


Advanced (Q9–Q11)

Q9. Design a complete error handling system for a multi-step LLM pipeline.

Why interviewers ask: Tests system design thinking — individual error handling is table stakes; orchestrating it across a pipeline is the real challenge.

Model answer:

Consider a pipeline: Document → Extraction (LLM Call 1) → Enrichment (LLM Call 2) → Classification (LLM Call 3) → Output.

Per-call error handling (inner layer): Each LLM call wraps in the RobustLlmClient — retry with backoff for transient errors, schema validation, error feedback retries, timeout handling. This is the unit of reliability.

Pipeline-level error handling (outer layer): The pipeline orchestrator handles:

  1. Step-level retries: If a step fails after all inner retries, try the step once more with a different model (GPT-4o → GPT-4o-mini).

  2. Checkpoint and resume: After each successful step, save the intermediate result. If step 3 fails, don't re-run steps 1 and 2 — resume from the last checkpoint. This saves cost and time.

  3. Partial results: If step 3 fails permanently, return the results from steps 1 and 2 with a flag indicating incomplete processing. Let the application layer decide whether partial results are acceptable.

  4. Circuit breaker: A shared circuit breaker across all pipeline stages. If the LLM API is down, fail the entire pipeline fast instead of waiting for 3 calls × 3 retries × backoff.

  5. Cost tracking: Sum the cost across all steps and retries. If the pipeline's total cost exceeds a budget (e.g., $0.50), abort and return an error rather than continuing to burn money.

  6. Observability: Each pipeline execution gets a traceId that links all LLM call logs. The dashboard shows pipeline-level success rate, cost per execution, and which steps fail most often.
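The shared circuit breaker from point 4 can be sketched as a small count-based state machine. Thresholds are illustrative; production implementations usually add a distinct half-open state that admits one probe request:

```javascript
// Minimal count-based circuit breaker shared across pipeline stages.
// Opens after N consecutive failures; allows traffic again after resetMs.
class CircuitBreaker {
  constructor(failureThreshold = 5, resetMs = 30000) {
    this.failureThreshold = failureThreshold;
    this.resetMs = resetMs;
    this.failures = 0;
    this.openedAt = null;   // null = circuit closed
  }
  canProceed(now = Date.now()) {
    if (this.openedAt === null) return true;       // closed: normal operation
    if (now - this.openedAt >= this.resetMs) {     // reset window elapsed
      this.openedAt = null;
      this.failures = 0;
      return true;
    }
    return false;                                  // open: fail fast, no API call
  }
  recordSuccess() { this.failures = 0; }
  recordFailure(now = Date.now()) {
    this.failures++;
    if (this.failures >= this.failureThreshold) this.openedAt = now;
  }
}
```

Each pipeline stage calls `canProceed()` before its LLM call and reports the outcome, so an outage trips the breaker once instead of being rediscovered by every stage's retry loop.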


Q10. Your LLM application has been running for 3 months. Using only the logs, walk me through how you'd identify and fix a performance regression.

Why interviewers ask: Tests operational maturity and ability to use observability data to drive improvements.

Model answer:

Step 1: Detect the regression. Automated alerts fire: success rate dropped from 98.5% to 94.2% over the past 24 hours. P95 latency increased from 4.2s to 8.1s.

Step 2: Correlate with changes. Query logs for what changed: Did we deploy a new prompt version? Did the model version change? (Check the logged model field — providers sometimes update models behind aliases like "gpt-4o".) Did traffic patterns change? Did error type distribution shift?

Step 3: Segment the analysis. Break down by: (a) feature — is one feature degraded or all? (b) error type — are we seeing more rate limits, parse failures, or validation failures? (c) prompt version — compare metrics for each version. (d) time of day — correlate with API provider status.

Step 4: Root cause. Suppose I find: validation failures for the "extraction" feature jumped from 1% to 6%. The prompt version didn't change, but the logged model field shows gpt-4o-2024-08-06 was updated to gpt-4o-2025-01-15. The validation errors cluster on the "category" field — the model started returning free-text categories instead of enum values.

Step 5: Fix. Update the prompt to add explicit examples of valid category values. Add a "category must be one of: [electronics, books, clothing]" instruction. Deploy as prompt v2.5. Compare v2.5 metrics against v2.4 in logs. If success rate returns to baseline, close the incident.

Step 6: Prevent recurrence. Pin the model version (gpt-4o-2025-01-15 instead of gpt-4o). Add a pre-deployment eval suite that tests the top 50 extraction examples against any new prompt version. Add an alert specifically for validation failure rate per feature.


Q11. How do you balance retry reliability against cost in a high-traffic LLM system?

Why interviewers ask: Tests engineering judgment about trade-offs in systems where every call costs real money.

Model answer:

The fundamental tension: more retries = higher reliability = higher cost. At scale, this is a real budget decision, not just engineering preference.

Quantify the trade-off. If a single call costs $0.0075 and you allow up to 3 feedback retries (with input growing ~50% per retry), the worst case per request is roughly $0.0075 + $0.0113 + $0.0169 + $0.0253 ≈ $0.061 — about 8x the single-call cost. At 100K requests/day, the baseline is $750/day. If 5% of requests fail and exhaust their feedback retries, the additional retry cost is about 5,000 × $0.053 ≈ $267/day, pushing the total to roughly $1,017/day — a ~35% cost increase.

Strategies to optimize:

  1. Tiered retry budgets per feature. Revenue-critical extraction pipeline: 3 retries allowed, $0.10 budget per sequence. Background batch processing: 1 retry, $0.02 budget. Chatbot UX: 2 retries, $0.05 budget.

  2. Cheaper model on retry. First attempt: GPT-4o. Retry with error feedback: GPT-4o-mini (20x cheaper). Often the error feedback gives enough guidance that the smaller model succeeds.

  3. Prompt investment over retry investment. If a prompt triggers >5% validation failures, improving the prompt is cheaper than retrying 5% of requests. I track "retry-adjusted cost per call" and invest in prompt optimization for the most expensive features.

  4. Circuit breaker + fast failure. When the API is persistently failing (not transient), circuit breaker prevents burning through thousands of retry budgets. Fail fast, serve cached results, retry when the circuit resets.

  5. Cost alerting. Daily cost alert threshold. If retries push cost above budget, investigate immediately rather than letting it run. Cost per retry sequence is a logged metric, so I can identify which features are most expensive to retry.
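The tiered budgets from strategy 1 reduce to a simple gate checked before every retry. A sketch using the illustrative numbers above; feature names and limits are examples, not a fixed scheme:

```javascript
// Per-feature retry budgets (numbers are the illustrative tiers above).
const RETRY_BUDGETS = {
  extraction: { maxRetries: 3, costCeilingUsd: 0.10 },  // revenue-critical
  batch:      { maxRetries: 1, costCeilingUsd: 0.02 },  // background processing
  chatbot:    { maxRetries: 2, costCeilingUsd: 0.05 },  // interactive UX
};

// Allow another retry only while both the attempt count and the
// accumulated cost of this request sequence are within budget.
function canRetry(feature, attempt, costSoFarUsd) {
  const budget = RETRY_BUDGETS[feature];
  return attempt < budget.maxRetries && costSoFarUsd < budget.costCeilingUsd;
}
```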


Quick-fire

| # | Question | One-line answer |
| --- | --- | --- |
| 1 | What does finish_reason: "length" mean? | Response was truncated — model hit max_tokens before finishing |
| 2 | Is 429 retryable? | Yes — rate limit resets after a delay |
| 3 | Is 401 retryable? | No — bad API key; retrying sends the same broken credentials |
| 4 | What is full jitter? | delay = random(0, min(cap, base * 2^attempt)) — maximum spread |
| 5 | Why log model version? | Providers update models behind aliases; exact version needed to reproduce issues |
| 6 | What should you ALWAYS check after JSON.parse? | Schema validation — valid JSON != correct JSON |
| 7 | What is a circuit breaker? | Stops retrying after N consecutive failures; resets after a timeout |
| 8 | PII in logs is dangerous because? | Compliance violations (GDPR, HIPAA), security risk if logs are breached |
| 9 | Retry-After header — what do you do? | Respect it — wait the specified time instead of your own backoff |
| 10 | Biggest risk of unlimited retries? | Runaway cost and masking permanent errors |
| 11 | What is a "continuation request"? | Sending truncated output back to the model and asking it to continue |

← Back to 4.10 — Error Handling in AI Applications (README)