Episode 4 — Generative AI Engineering / 4.4 — Structured Output in AI Systems

Interview Questions: Structured Output in AI Systems

Model answers for unstructured response problems, structured output benefits, real-world applications, and schema design.

How to use this material (instructions)

  1. Read lessons in order — README.md, then 4.4.a–4.4.d.
  2. Practice out loud — definition → example → pitfall.
  3. Pair with exercises — 4.4-Exercise-Questions.md.
  4. Quick review — 4.4-Quick-Revision.md.

Beginner (Q1–Q4)

Q1. What is structured output in the context of LLMs, and why does it matter?

Why interviewers ask: Tests whether you understand the fundamental problem that structured output solves — the gap between human-readable and machine-readable text.

Model answer:

Structured output means constraining an LLM's response to a predefined, parseable format — typically JSON — instead of free-form natural language text. By default, LLMs generate human-readable prose, which is ideal for chat interfaces but terrible for production code that needs to extract specific data fields.

The core problem: the same question yields infinitely variable text. Ask "What is the sentiment?" and you might get "It's positive," "The sentiment is positive," or "I'd classify this as a positive review." Code that tries to parse all variations using regex or string matching is fragile and inevitably breaks. Structured output solves this by making the LLM return {"sentiment": "positive", "confidence": 0.92}, which can be parsed with JSON.parse() — one line, zero ambiguity.

This matters for production because it enables reliable downstream processing, type-safe validation, consistent API contracts, meaningful error handling, and automated testing — none of which are possible with free-form text responses.
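The difference is easy to show in a few lines of JavaScript (the response strings below are invented for illustration):

```javascript
// Free-form: every phrasing needs its own parsing rule, and the list never ends.
const freeForm = "I'd classify this as a positive review.";
const match = freeForm.match(/sentiment is (\w+)/); // this phrasing isn't covered
console.log(match); // null — and null quietly propagates downstream

// Structured: one standard parser handles every valid response.
const structured = '{"sentiment": "positive", "confidence": 0.92}';
const data = JSON.parse(structured);
console.log(data.sentiment); // "positive"
```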


Q2. What format do you typically use for structured LLM output, and why?

Why interviewers ask: Tests practical knowledge — JSON vs XML vs CSV, and the reasoning behind the choice.

Model answer:

JSON is the default choice for approximately 95% of structured LLM output use cases, for several reasons:

First, native JavaScript compatibility — JSON.parse() gives you a ready-to-use object with no transformation. Since most AI application backends run on Node.js, this is a significant advantage.

Second, LLM familiarity — models like GPT-4o and Claude have seen enormous amounts of JSON in their training data. They produce valid JSON more reliably than XML or CSV, especially for nested data structures.

Third, universal tooling — every programming language, database, and API framework handles JSON natively. Validation libraries like Zod are purpose-built for JSON schemas.

Fourth, expressiveness — JSON supports nested objects, arrays, booleans, numbers, and null values, making it suitable for complex schemas. CSV is limited to flat tabular data, and XML is verbose and harder to parse.

I use CSV only when producing tabular data for spreadsheet consumption, and XML only when integrating with legacy systems that require it (some enterprise SOAP APIs). For everything else, JSON.


Q3. How do you handle the case where the LLM returns invalid or malformed JSON?

Why interviewers ask: Tests your ability to build robust systems — happy-path-only developers are a red flag.

Model answer:

I implement a multi-layer defense strategy:

Layer 1: Prompt design. I include explicit instructions like "Respond with ONLY valid JSON, no markdown fences, no explanation text." I set temperature to 0 for maximum consistency. I provide the exact schema shape in the system prompt.

Layer 2: Response sanitization. Before parsing, I handle common LLM quirks: stripping markdown code fences (```json...```), trimming whitespace, and if needed, extracting JSON from between the first { and last } in the response.

let content = response.choices[0].message.content.trim();
if (content.startsWith('```')) {
  content = content.replace(/^```(?:json)?\n?/, '').replace(/\n?```$/, '');
}

Layer 3: Parse with try/catch. JSON.parse() inside a try/catch block, so malformed JSON is caught immediately with a clear error.

Layer 4: Schema validation. After parsing, I validate the object against the expected schema — checking required fields exist, types are correct, enums contain valid values.

Layer 5: Retry with backoff. If parsing or validation fails, I retry the LLM call (typically up to 2 retries with exponential backoff). If all retries fail, I return a safe default or escalate the error.
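A minimal sketch combining layers 2–5 might look like this (parseStructuredResponse and withRetries are illustrative names; callModel is an assumed async function that returns the raw model text):

```javascript
// Layers 2-4: sanitize, parse loudly, validate required fields.
function parseStructuredResponse(raw, requiredFields = []) {
  // Layer 2: strip markdown fences and surrounding whitespace
  let content = raw.trim();
  if (content.startsWith('```')) {
    content = content.replace(/^```(?:json)?\n?/, '').replace(/\n?```$/, '');
  }
  // Layer 3: parse inside try/catch so failures are loud, not silent
  let data;
  try {
    data = JSON.parse(content);
  } catch (err) {
    throw new Error(`Malformed JSON from model: ${err.message}`);
  }
  // Layer 4: minimal schema validation — required fields must be present
  for (const field of requiredFields) {
    if (data[field] === undefined || data[field] === null) {
      throw new Error(`Missing required field: ${field}`);
    }
  }
  return data;
}

// Layer 5: retry with exponential backoff around the whole pipeline.
async function withRetries(callModel, requiredFields, maxRetries = 2) {
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return parseStructuredResponse(await callModel(), requiredFields);
    } catch (err) {
      lastError = err;
      await new Promise((r) => setTimeout(r, 2 ** attempt * 500)); // 500ms, 1s, 2s…
    }
  }
  throw lastError; // all retries exhausted — caller falls back or escalates
}
```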


Q4. What are three real-world applications of structured LLM output?

Why interviewers ask: Tests whether you can connect the concept to practical business value.

Model answer:

1. Content moderation. The LLM returns { flagged: boolean, severity: number, categories: string[], reason: string, suggestedAction: string }. This powers automated moderation pipelines: auto-approve clean content, auto-remove high-severity violations, and queue borderline cases for human review. The structured format enables dashboard analytics (flag rates by category, severity distribution) and consistent audit trails.

2. Email classification and triage. The LLM returns { intent: string, urgency: string, department: string, extractedEntities: { orderNumber, accountId } }. This automates routing — critical issues go to Slack alerts, billing questions go to the billing team, and common inquiries get auto-responses from templates. Without structured output, a human would need to read and categorize every email.

3. Invoice data extraction. The LLM returns { invoiceNumber, vendor, lineItems: [{ description, quantity, unitPrice, totalPrice }], totalAmount, currency }. This replaces manual data entry into accounting systems. The structured format enables automated validation (do line item totals match the subtotal?) and direct integration with ERP systems. A company processing 500 invoices/day can save hours of manual work.
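For the invoice case, the cross-check mentioned above can be sketched like this (field names follow the example schema; the 0.01 tolerance is an assumption to absorb rounding):

```javascript
// Validate extracted invoice data before it reaches the accounting system.
function validateInvoice(invoice) {
  const errors = [];
  // Do the line items sum to the stated total?
  const computed = invoice.lineItems.reduce(
    (sum, item) => sum + item.quantity * item.unitPrice, 0);
  if (Math.abs(computed - invoice.totalAmount) > 0.01) {
    errors.push(`Line items sum to ${computed.toFixed(2)}, but totalAmount is ${invoice.totalAmount}`);
  }
  // Does each line item's own total match quantity x unitPrice?
  for (const item of invoice.lineItems) {
    if (Math.abs(item.quantity * item.unitPrice - item.totalPrice) > 0.01) {
      errors.push(`Line item "${item.description}" total does not match quantity x unitPrice`);
    }
  }
  return { valid: errors.length === 0, errors };
}
```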


Intermediate (Q5–Q8)

Q5. Explain the difference between parsing free-form text and parsing structured output. Why is the structured approach fundamentally more reliable?

Why interviewers ask: Tests depth of understanding — specifically the infinite-variability problem and why it's unsolvable with better regex.

Model answer:

Parsing free-form text is a battle against infinite variability. An LLM can express the same information in thousands of different phrasings, and your regex or string parsing must handle every variation. This creates the regex escalation spiral — each edge case you fix introduces new patterns that need handling, and the model can always generate a phrasing you didn't anticipate. The approach fails because you're trying to convert infinitely variable text into structured data after the fact.

Structured output flips the approach: instead of generating text and then parsing it, you instruct the LLM to produce structured data directly. The LLM does the "understanding" and the "formatting" in one step. Your code then uses standard parsers (JSON.parse()) that handle 100% of valid inputs by definition.

The reliability difference is fundamental, not incremental:

  • Free-form parsing has an error floor — you can never reach zero, because there will always be phrasings you haven't seen.
  • Structured parsing has a clear binary: either it's valid JSON or it's not. Valid JSON always parses correctly. Invalid JSON always fails loudly (not silently).

Additionally, with structured output, failures are detectable (JSON parse error, missing field, wrong type) and recoverable (retry, fallback). With free-form parsing, failures are often silent — the regex doesn't match, you get null, and null quietly propagates through your system as corrupted data.


Q6. How do you design a schema for an LLM application? Walk through your process.

Why interviewers ask: Tests system design thinking and whether you approach schema design methodically or ad hoc.

Model answer:

I follow a schema-first workflow with five steps:

Step 1: Requirements gathering. Before touching a prompt, I identify all data consumers: What does the frontend display? What goes into the database? What does the analytics pipeline aggregate? What decisions does the routing logic make? Each consumer's needs become candidate fields.

Step 2: Field design. For each field, I choose: a descriptive name in the project's convention (camelCase for JavaScript), the data type (string, number, boolean, array, enum), whether it's required or optional, and any constraints (min/max, allowed values, max length). I use enums aggressively — any field with a finite set of valid values should be an enum, not a free-form string.

Step 3: Structure decisions. I choose flat vs nested based on the data relationships. I keep nesting to 2-3 levels maximum because LLMs produce more malformed JSON with deeper nesting. For schemas with 10+ fields, I group related fields into sub-objects for readability.

Step 4: Documentation and versioning. I document each field with its purpose, consumer, example value, and validation rules. I assign a version number (semantic versioning: major for breaking changes, minor for additive changes).

Step 5: Implementation and iteration. I write the prompt with the schema, build validation, write tests against sample inputs, then run against real data and refine. Edge cases from real data often reveal fields I missed or constraints that are too tight.
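Steps 2 and 3 can be made concrete with a small schema descriptor and validator, sketched here in plain JavaScript (a library like Zod would normally do this job; the ticket schema and field names are invented for illustration):

```javascript
// Hypothetical support-ticket schema illustrating step 2's choices:
// enums for finite value sets, explicit required/optional, length constraints.
const ticketSchema = {
  category:    { type: 'enum', values: ['bug', 'billing', 'feature_request', 'other'], required: true },
  urgency:     { type: 'enum', values: ['low', 'medium', 'high'], required: true },
  summary:     { type: 'string', maxLength: 200, required: true },
  orderNumber: { type: 'string', required: false }, // optional: may not be in the input
};

function validate(data, schema) {
  const errors = [];
  for (const [field, rule] of Object.entries(schema)) {
    const value = data[field];
    if (value === undefined || value === null) {
      if (rule.required) errors.push(`Missing required field: ${field}`);
      continue; // optional fields are allowed to be null
    }
    if (rule.type === 'enum' && !rule.values.includes(value)) {
      errors.push(`Invalid value for ${field}: ${value}`);
    }
    if (rule.type === 'string' && rule.maxLength && value.length > rule.maxLength) {
      errors.push(`${field} exceeds ${rule.maxLength} characters`);
    }
  }
  return errors;
}
```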


Q7. What are the tradeoffs between flat and nested schemas? When do you choose each?

Why interviewers ask: Tests nuanced understanding of schema design and its practical implications.

Model answer:

Flat schemas put all fields at the top level: { customerName, customerEmail, shippingStreet, shippingCity, billingStreet, billingCity }.

Advantages: simpler access (data.customerName), direct mapping to flat database tables, no nested null checks needed, fewer JSON tokens (fewer braces), and higher LLM reliability (less chance of mismatched braces).

Disadvantages: field name repetition (shipping vs billing prefix), no logical grouping, unwieldy at 15+ fields, and harder to pass related data to functions.

Nested schemas group related fields: { customer: { name, email }, shipping: { street, city }, billing: { street, city } }.

Advantages: logical grouping, reusable sub-objects (both addresses share the same shape), cleaner function signatures (formatAddress(data.shipping)), and better documentation.

Disadvantages: deeper access (data.customer?.name), null checks needed for each level, more tokens, and higher error rate with deep nesting.

My guideline: I use flat for schemas under 10 fields or when mapping to flat database tables. I use nested for larger schemas with clear groupings. I never exceed 3 levels of nesting — if I need deeper structure, I either flatten it or split the work across multiple LLM calls. I also consider the LLM reliability angle: if the application is high-stakes and I need near-zero parse failures, I lean flat.


Q8. How do required vs optional fields affect LLM behavior, and how do you handle them correctly?

Why interviewers ask: Tests understanding of a subtle but important interaction between schema design and LLM hallucination.

Model answer:

This is a critical schema design decision that directly impacts hallucination. If every field is marked as required, the LLM will fabricate data for fields that aren't present in the input. For example, if the input text mentions only a name and role, but the schema requires an email field, the LLM might invent "john@example.com" — a hallucinated value that looks plausible but is completely made up.

The correct approach: required fields are things the LLM can always determine from the input (e.g., sentiment of a review, classification of a support ticket). Optional fields are things that may or may not be present in the input (e.g., email address, phone number, order number).

In the prompt, I explicitly instruct: "For optional fields, return null if the information is not present in the text. Do NOT guess or fabricate values." I list which fields are required and which are optional separately.

In the code, I validate that required fields are present and non-null, and I handle optional fields with null checks throughout the downstream code:

// Required: throw if missing
if (!data.sentiment) throw new Error('Missing required field: sentiment');

// Optional: handle gracefully
const email = data.email ?? 'Not provided';

This approach prevents hallucination of optional fields while ensuring required fields are always present.


Advanced (Q9–Q11)

Q9. Design a complete structured output pipeline for a production email triage system processing 50,000 emails per day.

Why interviewers ask: Tests end-to-end system design — schema, prompt, validation, routing, error handling, monitoring, and cost awareness.

Model answer:

Schema (v1.0):

{
  intent: 'inquiry | complaint | support_request | billing | cancellation | feedback | spam | other',
  urgency: 'critical | high | medium | low',
  department: 'sales | support | billing | engineering | management',
  summary: 'string (max 200 chars)',
  customerSentiment: 'angry | frustrated | neutral | happy',
  requiresHumanReview: 'boolean',
  extractedEntities: {
    orderNumber: 'string | null',
    accountId: 'string | null',
    productMentioned: 'string | null',
    amountMentioned: 'number | null'
  }
}

Pipeline architecture:

  1. Ingestion: Emails arrive via webhook. Each is queued in a message queue (e.g., SQS) for reliable processing.

  2. Classification: Worker dequeues email, calls GPT-4o with temperature 0, structured output prompt, 2-retry wrapper with JSON sanitization and schema validation.

  3. Routing logic: Based on structured data — critical urgency goes to Slack + PagerDuty, cancellation requests go to retention team, spam goes to filter, everything else routes by department.

  4. Error handling: JSON parse failures → retry. Validation failures → retry with stricter prompt. All retries exhausted → flag for manual classification. Expected failure rate: <0.1% = ~50 emails/day needing manual classification.

  5. Cost management: 50K emails/day at ~500 input tokens + ~150 output tokens per email means ~25M input and ~7.5M output tokens daily. At GPT-4o rates (~$2.50/M input, ~$10/M output): ~$63/day for input + ~$75/day for output ≈ $140/day. With the Batch API (50% discount for non-urgent workloads): ~$70/day.

  6. Monitoring: Dashboard showing intent distribution, urgency breakdown, sentiment trends, auto-routing accuracy (sampled human review of 1% = 500 emails/day), and LLM error rate. Alert if error rate exceeds 0.5% or if "critical" emails spike above baseline.

  7. Schema versioning: Version embedded in output. Canary deploys for schema changes. Backward compatibility maintained for at least 2 weeks during transitions.
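The routing logic in step 3 might be sketched as a pure function over the classification (destination and channel names are illustrative):

```javascript
// Route an email based on its structured classification.
// Urgency is checked first so critical issues never get mis-filed.
function route(classification) {
  if (classification.urgency === 'critical') {
    return { destination: 'oncall', channels: ['slack', 'pagerduty'] };
  }
  if (classification.intent === 'cancellation') {
    return { destination: 'retention', channels: ['queue'] };
  }
  if (classification.intent === 'spam') {
    return { destination: 'spam_filter', channels: [] };
  }
  // Default: route by department from the schema's enum
  return { destination: classification.department, channels: ['queue'] };
}
```

Because every branch keys off an enum field, this function is exhaustively testable — something free-form text routing could never offer.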


Q10. How do you version and migrate LLM output schemas in a production system with multiple consumers?

Why interviewers ask: Tests real-world engineering maturity — most candidates can design a schema but few think about how it evolves over time.

Model answer:

Schema versioning follows semantic versioning adapted for data contracts:

Minor versions (1.0 → 1.1): Adding new optional fields or new enum values. These are backward-compatible — existing consumers ignore fields they don't know about, and unknown enum values can fall through to a default case.

Major versions (1.x → 2.0): Adding new required fields, removing fields, changing field types, or renaming fields. These are breaking changes that require coordinated migration.

Migration process for major versions:

Phase 1 (1 week): Deploy new code that reads both v1 and v2 formats. Include a version field _schemaVersion in the output. Transform v1 data to v2 shape when needed.

Phase 2 (canary): Route 10% of traffic to the v2 prompt. Monitor parse error rates, validation failures, and downstream behavior. Compare v1 and v2 outputs for the same inputs to catch regressions.

Phase 3 (rollout): Gradually increase v2 traffic to 50%, then 100%. Keep v1 handling code active.

Phase 4 (cleanup, 2 weeks after 100%): Remove v1 prompt and v1 handling code. Update documentation.

Key principles: Never force all consumers to update simultaneously. Include the version in every output object. Log both the schema version and the raw response for debugging. Keep the migration window short (2-4 weeks) to avoid indefinite dual-version support.
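Phase 1's dual-format reading can be sketched as a normalizer behind one interface (the v1/v2 field names here are invented for illustration):

```javascript
// Normalize any supported schema version to the v2 shape.
// Hypothetical migration: v1's flat `category` becomes v2's `intent`,
// and a new `department` field gets a default that v1 data never carried.
function normalize(output) {
  const version = output._schemaVersion ?? '1.0'; // early v1 data had no version field
  if (version.startsWith('1.')) {
    return {
      _schemaVersion: '2.0',
      intent: output.category,
      department: output.department ?? 'support',
    };
  }
  return output; // already v2 — pass through unchanged
}
```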


Q11. A product manager says "We need the AI to analyze customer feedback AND generate a response in the same call." How do you design the schema, and what are the risks?

Why interviewers ask: Tests ability to handle complex schema requirements and make architecture trade-off decisions.

Model answer:

This is a common request that creates a tension between analysis (structured, deterministic) and generation (creative, variable-length). The naive approach — one schema with both — has significant problems.

Schema design:

{
  analysis: {
    sentiment: 'positive | negative | neutral | mixed',
    category: 'bug | feature_request | praise | complaint | question',
    urgency: 'low | medium | high | critical',
    summary: 'string (max 200 chars)',
    actionRequired: 'boolean'
  },
  generatedResponse: {
    subject: 'string',
    body: 'string',
    tone: 'empathetic | professional | friendly'
  }
}

Risks:

  1. Temperature conflict. Analysis needs temperature 0 for consistency. Response generation benefits from temperature 0.5-0.7 for natural-sounding text. One call can only have one temperature setting. If you use 0, the response sounds robotic. If you use 0.7, the analysis becomes inconsistent.

  2. Token budget competition. The generated response might be 200+ tokens, which increases the total output and cost. If the model spends its "attention" on crafting a good response, the analysis quality may suffer.

  3. Failure coupling. If the generated response has a JSON formatting issue (e.g., unescaped quotes in the email body), it breaks the entire parse — including the analysis that was perfectly fine.

My recommendation: Split into two calls. Call 1: analysis only, temperature 0, strict schema. Call 2: response generation, temperature 0.5, using the structured analysis as input context. This gives you the right temperature for each task, independent failure handling, and the ability to use the analysis (from Call 1) to inform the response (in Call 2) — e.g., "The customer is angry about a billing issue, generate an empathetic response about billing."

The latency cost of two calls is usually acceptable because they can be parallelized if the response doesn't depend on the analysis, or piped sequentially when it does. The reliability gain far outweighs the extra 200ms.
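The two-call split might look like this (callModel is a hypothetical helper that sends a prompt with options and returns parsed JSON; the prompts are abbreviated):

```javascript
// Sketch of the sequential two-call design: deterministic analysis first,
// then creative generation informed by the structured analysis.
async function analyzeAndRespond(feedback, callModel) {
  // Call 1: analysis only — temperature 0, strict schema
  const analysis = await callModel({
    prompt: `Analyze this feedback and return JSON: ${feedback}`,
    temperature: 0,
  });
  // Call 2: response generation — higher temperature, analysis as context
  const response = await callModel({
    prompt: `The customer is ${analysis.sentiment} about a ${analysis.category} issue. ` +
            `Write an empathetic reply as JSON with subject and body.`,
    temperature: 0.5,
  });
  return { analysis, response };
}
```

A parse failure in call 2 now costs only a regenerated reply — the analysis from call 1 is already safely stored.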


Quick-fire

| # | Question | One-line answer |
|---|----------|-----------------|
| 1 | Default format for structured LLM output? | JSON — native to JS, best LLM reliability, universal tooling |
| 2 | Why not regex for parsing LLM responses? | Infinite variability — the model can always phrase things differently |
| 3 | Temperature for structured output? | 0 — maximum consistency for parseable responses |
| 4 | What is silent data corruption? | Wrong data, no error — e.g., regex extracts wrong value, null propagates |
| 5 | Most important data type for LLM schemas? | Enum — constrains LLM to exact values your code handles |
| 6 | Max nesting depth for LLM JSON? | 2-3 levels — deeper nesting increases malformed JSON rate |
| 7 | Required vs optional fields and hallucination? | Required fields may cause hallucination if data isn't in the input — use optional + null |
| 8 | How to handle markdown code fences in LLM JSON? | Strip them — `content.replace(/^```(?:json)?\n?/, '').replace(/\n?```$/, '')` |
| 9 | Schema versioning: adding optional field? | Minor version (1.0 → 1.1) — backward compatible |
| 10 | Schema versioning: adding required field? | Major version (1.x → 2.0) — breaking change, needs migration plan |

← Back to 4.4 — Structured Output in AI Systems (README)