Episode 4 — Generative AI Engineering / 4.4 — Structured Output in AI Systems

4.4.d — Designing Output Schemas

In one sentence: A well-designed output schema is the contract between your LLM and your application code — getting it right means thinking carefully about field names, data types, required vs optional fields, nesting depth, documentation, and versioning before you ever write the prompt.

Navigation: ← 4.4.c — Common Applications · 4.4 Overview →


1. Planning Your Schema Before Writing the Prompt

A common mistake with structured LLM output is jumping straight to the prompt without designing the schema first. Schema design is API design, and it deserves the same thought and rigor.

The wrong approach: Prompt-first

// Developer thinks: "I need to extract data from customer feedback"
// Immediately writes a prompt:

const prompt = `Analyze this feedback and return JSON with the important stuff.`;

// The LLM returns... something. Different every time.
// { "feeling": "happy" }  — one time
// { "sentiment": "positive", "score": 8 } — another time
// { "analysis": { "overall": "good", "details": "..." } } — yet another time

// Now you have three different shapes in your database.
// Your frontend breaks on half of them.

The right approach: Schema-first

// Step 1: List EVERY piece of data you need
// - What does the downstream code expect?
// - What does the database schema look like?
// - What does the frontend display?
// - What does the analytics dashboard aggregate?

// Step 2: Define the exact shape
const feedbackAnalysisSchema = {
  sentiment: 'positive | negative | neutral | mixed',  // Used by: dashboard filter
  sentimentScore: 'number (-1.0 to 1.0)',               // Used by: trend chart
  confidence: 'number (0.0 to 1.0)',                     // Used by: quality filter
  category: 'bug | feature_request | praise | complaint | question',  // Used by: routing
  summary: 'string (1-2 sentences)',                     // Used by: ticket preview
  actionRequired: 'boolean',                             // Used by: alert system
  suggestedPriority: 'low | medium | high | critical',   // Used by: queue sorting
};

// Step 3: NOW write the prompt that produces this exact shape
// Step 4: Build validation that enforces this shape
// Step 5: Build the downstream code that consumes this shape
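One way to keep Steps 3 and 4 in sync is to generate the prompt's format section from the same object the validator reads. The sketch below is only an illustration: the descriptor shape (`type`, `values`, `note`) and the helper name `buildFormatSection` are assumptions made for this example, not part of any library.

```javascript
// Hypothetical descriptor shape: { type, values?, note? }. Not a library format.
const feedbackFields = {
  sentiment: { type: 'enum', values: ['positive', 'negative', 'neutral', 'mixed'] },
  sentimentScore: { type: 'number', note: '-1.0 to 1.0' },
  actionRequired: { type: 'boolean' },
  summary: { type: 'string', note: '1-2 sentences' },
};

// Render one line per field so the prompt text always matches the schema object.
function buildFormatSection(fields) {
  const lines = Object.entries(fields).map(([name, spec]) => {
    const shape = spec.type === 'enum'
      ? spec.values.map((v) => JSON.stringify(v)).join(' | ')
      : spec.type;
    const note = spec.note ? ` (${spec.note})` : '';
    return `  "${name}": ${shape}${note}`;
  });
  return `{\n${lines.join(',\n')}\n}`;
}

const formatSection = buildFormatSection(feedbackFields);
console.log(formatSection);
```

Because the prompt text is derived rather than hand-written, adding a field to `feedbackFields` changes the prompt and (if the validator reads the same object) the validation in one place.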

Schema design checklist

Before writing the prompt, answer these questions:

[ ] Who consumes this data? (frontend, API, database, analytics, other services)
[ ] What fields does each consumer need?
[ ] What are the exact allowed values for each field?
[ ] Which fields are required vs optional?
[ ] What are the data types? (string, number, boolean, array, enum)
[ ] What are the constraints? (min/max length, ranges, patterns)
[ ] How will this schema evolve over time?
[ ] Is this schema consistent with other schemas in the system?
[ ] Can the LLM realistically produce this data from the input?

2. Field Naming Conventions

Consistent field naming is critical for maintainable systems. Pick a convention and enforce it everywhere.

camelCase — JavaScript/TypeScript standard

// camelCase: first word lowercase, subsequent words capitalized
const schema = {
  firstName: 'string',
  lastName: 'string',
  emailAddress: 'string',
  phoneNumber: 'string',
  dateOfBirth: 'string',
  isActive: 'boolean',
  totalOrderCount: 'number',
  lastLoginAt: 'string (ISO 8601)',
  shippingAddress: {
    streetLine1: 'string',
    streetLine2: 'string or null',
    cityName: 'string',
    stateCode: 'string',
    zipCode: 'string',
    countryCode: 'string',
  },
};

// Why camelCase for LLM schemas?
// 1. Idiomatic in JavaScript: parsed keys read naturally (data.firstName)
// 2. Matches your codebase conventions — no transformation needed
// 3. LLMs are very familiar with camelCase from training data

snake_case — Python/Database standard

// snake_case: all lowercase, words separated by underscores
const schema = {
  first_name: 'string',
  last_name: 'string',
  email_address: 'string',
  phone_number: 'string',
  date_of_birth: 'string',
  is_active: 'boolean',
  total_order_count: 'number',
  last_login_at: 'string (ISO 8601)',
  shipping_address: {
    street_line_1: 'string',
    street_line_2: 'string or null',
    city_name: 'string',
    state_code: 'string',
    zip_code: 'string',
    country_code: 'string',
  },
};

// When to use snake_case:
// 1. Your backend is Python (Django, FastAPI)
// 2. Your database uses snake_case columns
// 3. You want to avoid transformation between LLM output and DB

Choosing and enforcing a convention

// RULE: Pick ONE convention for all LLM schemas in your project.
// Document it. Enforce it in code reviews.

// BAD: Mixed conventions (this happens more than you'd think)
const badSchema = {
  firstName: 'string',        // camelCase
  last_name: 'string',        // snake_case
  'email-address': 'string',  // kebab-case
  PhoneNumber: 'string',      // PascalCase
};

// GOOD: Consistent camelCase throughout
const goodSchema = {
  firstName: 'string',
  lastName: 'string',
  emailAddress: 'string',
  phoneNumber: 'string',
};

// If your system needs both conventions (JS frontend + Python backend),
// transform at the boundary, not in the schema:
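// Assumes keys start with a lowercase letter. Digits are not split off:
// streetLine1 becomes street_line1, not street_line_1.
function camelToSnake(value) {
  if (Array.isArray(value)) return value.map(camelToSnake); // recurse into arrays of objects
  if (value === null || typeof value !== 'object') return value; // primitives pass through
  const result = {};
  for (const [key, child] of Object.entries(value)) {
    const snakeKey = key.replace(/[A-Z]/g, letter => `_${letter.toLowerCase()}`);
    result[snakeKey] = camelToSnake(child);
  }
  return result;
}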
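Code review is not the only way to enforce a convention; a small check can flag mixed styles automatically. The following is a rough sketch, assuming a camelCase project. The helper names `namingStyle` and `findNamingViolations` are made up for this example, and the regular expressions are deliberately simple.

```javascript
// Classify a single key by naming style.
function namingStyle(key) {
  // Also matches single lowercase words such as "name".
  if (/^[a-z][a-z0-9]*([A-Z][a-z0-9]*)*$/.test(key)) return 'camelCase';
  if (/^[a-z][a-z0-9]*(_[a-z0-9]+)+$/.test(key)) return 'snake_case';
  if (/^[a-z][a-z0-9]*(-[a-z0-9]+)+$/.test(key)) return 'kebab-case';
  if (/^[A-Z][A-Za-z0-9]*$/.test(key)) return 'PascalCase';
  return 'unknown';
}

// Walk nested plain objects and report keys that break the expected style.
function findNamingViolations(schema, expected = 'camelCase', path = '') {
  const violations = [];
  for (const [key, value] of Object.entries(schema)) {
    const fullPath = path ? `${path}.${key}` : key;
    const style = namingStyle(key);
    if (style !== expected) violations.push({ path: fullPath, style });
    if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      violations.push(...findNamingViolations(value, expected, fullPath));
    }
  }
  return violations;
}

const violations = findNamingViolations({
  firstName: 'string',
  last_name: 'string',
  'email-address': 'string',
  PhoneNumber: 'string',
  shippingAddress: { zip_code: 'string' },
});
console.log(violations);
```

Single-word keys such as `name` are classified as camelCase here, so this version suits camelCase projects; a snake_case project would need to treat single lowercase words as valid too.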

Naming best practices

// 1. Be descriptive — avoid abbreviations
// BAD:
{ fn: 'string', ln: 'string', dob: 'string', addr: 'string' }
// GOOD:
{ firstName: 'string', lastName: 'string', dateOfBirth: 'string', address: 'string' }

// 2. Use consistent prefixes for related concepts
// BAD:
{ shippingAddr: 'string', billing_address: 'string', homeLocation: 'string' }
// GOOD:
{ shippingAddress: 'string', billingAddress: 'string', homeAddress: 'string' }

// 3. Boolean fields should read as yes/no questions
// BAD:
{ active: 'boolean', spam: 'boolean', review: 'boolean' }
// GOOD:
{ isActive: 'boolean', isSpam: 'boolean', needsReview: 'boolean' }

// 4. Arrays should be plural nouns
// BAD:
{ tag: ['string'], skill: ['string'] }
// GOOD:
{ tags: ['string'], skills: ['string'] }

// 5. Dates should indicate format
// BAD:
{ created: 'string', updated: 'string' }
// GOOD:
{ createdAt: 'string (ISO 8601)', updatedAt: 'string (ISO 8601)' }
// Or with explicit format:
{ createdDate: 'YYYY-MM-DD', createdDateTime: 'YYYY-MM-DDTHH:mm:ssZ' }

3. Required vs Optional Fields

Not every field makes sense for every input. Designing required and optional fields correctly discourages the LLM from fabricating data and keeps your code from breaking on missing fields.

Why this distinction matters

// If every field is required, the LLM will MAKE UP data for fields
// that don't exist in the input:

// Input: "John, 32, engineer"
// Schema (all required): { name, age, role, email, phone, address, linkedIn }
// LLM might fabricate: { email: "john@example.com", phone: "555-1234", ... }
// These are HALLUCINATED — the input never mentioned email or phone

// Solution: Mark fields as required or optional
const schema = {
  name: 'string (REQUIRED)',
  age: 'number (REQUIRED)',
  role: 'string (REQUIRED)',
  email: 'string or null (OPTIONAL: null unless explicitly mentioned)',
  phone: 'string or null (OPTIONAL: null unless explicitly mentioned)',
  address: 'string or null (OPTIONAL: null unless explicitly mentioned)',
  linkedIn: 'string or null (OPTIONAL: null unless explicitly mentioned)',
};

Implementing required vs optional in prompts

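import OpenAI from 'openai';

// The client reads OPENAI_API_KEY from the environment by default
const openai = new OpenAI();

async function extractPersonData(text) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0,
    // JSON mode: output is syntactically valid JSON (the prompt must mention JSON)
    response_format: { type: 'json_object' },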
    messages: [
      {
        role: 'system',
        content: `Extract person data from the text. Respond with ONLY JSON.

REQUIRED fields (always include):
- "name": string
- "role": string

OPTIONAL fields (include ONLY if explicitly stated in the text, otherwise use null):
- "age": number or null
- "email": string or null
- "phone": string or null
- "company": string or null
- "location": string or null

IMPORTANT: Do NOT guess or fabricate optional fields. If the information is not in the text, use null.

Format:
{
  "name": "string",
  "role": "string",
  "age": number | null,
  "email": "string" | null,
  "phone": "string" | null,
  "company": "string" | null,
  "location": "string" | null
}`
      },
      { role: 'user', content: text },
    ],
  });

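  // Parsing can fail if the model adds prose or truncates the object
  let data;
  try {
    data = JSON.parse(response.choices[0].message.content);
  } catch (err) {
    throw new Error(`LLM returned invalid JSON: ${err.message}`);
  }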
  
  // Validate required fields
  if (!data.name || typeof data.name !== 'string') {
    throw new Error('Required field "name" is missing or invalid');
  }
  if (!data.role || typeof data.role !== 'string') {
    throw new Error('Required field "role" is missing or invalid');
  }
  
  return data;
}

// Input: "John is a senior developer"
// Output: { name: "John", role: "senior developer", age: null, email: null, phone: null, company: null, location: null }
// Note: No fabricated email or phone — they're correctly null

// Input: "Jane Doe, jane@acme.com, Lead Designer at Acme Corp, based in NYC"
// Output: { name: "Jane Doe", role: "Lead Designer", age: null, email: "jane@acme.com", phone: null, company: "Acme Corp", location: "NYC" }
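Prompt instructions lower the odds of fabricated values but cannot rule them out, so a post-check can help. The sketch below nulls out optional string fields whose value does not literally appear in the source text. The helper name `nullifyUngroundedFields` is hypothetical, and literal substring matching is an assumption that only suits verbatim values such as emails or phone numbers; it would wrongly reject legitimate normalizations.

```javascript
// Hypothetical post-check: null out optional string fields whose value
// does not literally appear in the source text (case-insensitive).
function nullifyUngroundedFields(sourceText, data, optionalFields) {
  const haystack = sourceText.toLowerCase();
  const result = { ...data }; // shallow copy so the input is not mutated
  const dropped = [];
  for (const field of optionalFields) {
    const value = result[field];
    if (typeof value !== 'string') continue; // null and numbers are left as-is
    if (!haystack.includes(value.toLowerCase())) {
      result[field] = null;
      dropped.push(field);
    }
  }
  return { result, dropped };
}

const text = 'John is a senior developer';
const llmOutput = {
  name: 'John',
  role: 'senior developer',
  email: 'john@example.com', // fabricated: not present in the text
  company: null,
};

const { result, dropped } = nullifyUngroundedFields(text, llmOutput, ['email', 'company']);
console.log(result.email, dropped);
```

Logging the `dropped` list is a cheap way to measure how often the model fabricates optional fields for a given prompt.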

Handling optional fields in downstream code

// Always handle optional fields safely in your code

// BAD: Assumes all fields exist
function createCandidateCard(data) {
  return `
    <h2>${data.name}</h2>
    <p>Email: ${data.email}</p>       <!-- Shows "null" if missing -->
    <p>Phone: ${data.phone}</p>       <!-- Shows "null" if missing -->
    <p>Location: ${data.location}</p> <!-- Shows "null" if missing -->
  `;
}

// GOOD: Handle null gracefully
function createCandidateCard(data) {
  return `
    <h2>${data.name}</h2>
    <p>Role: ${data.role}</p>
    ${data.email ? `<p>Email: ${data.email}</p>` : ''}
    ${data.phone ? `<p>Phone: ${data.phone}</p>` : ''}
    ${data.location ? `<p>Location: ${data.location}</p>` : ''}
    ${data.company ? `<p>Company: ${data.company}</p>` : ''}
  `;
}
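Models sometimes omit an optional key entirely, or return an empty string or a placeholder instead of null. A normalization step before the data reaches consumers gives every optional field one "absent" value. This is a sketch under assumptions: the helper name `normalizeOptionalFields` and the placeholder list are illustrative choices, not an established convention.

```javascript
const OPTIONAL_FIELDS = ['age', 'email', 'phone', 'company', 'location'];
const PLACEHOLDERS = ['n/a', 'unknown', 'none', 'null']; // assumed list, adjust as needed

// Give every optional field an explicit null when it is missing, blank,
// or a placeholder string. Valid falsy values such as 0 are preserved.
function normalizeOptionalFields(data, optionalFields = OPTIONAL_FIELDS) {
  const normalized = { ...data };
  for (const field of optionalFields) {
    const value = normalized[field];
    const isBlankOrPlaceholder =
      typeof value === 'string' &&
      (value.trim() === '' || PLACEHOLDERS.includes(value.trim().toLowerCase()));
    if (value === undefined || isBlankOrPlaceholder) {
      normalized[field] = null;
    }
  }
  return normalized;
}

const normalizedPerson = normalizeOptionalFields({
  name: 'John',
  role: 'senior developer',
  email: '',
  phone: 'N/A',
  age: 0,
});
console.log(normalizedPerson);
```

Note that a truthiness check like `data.age ? ... : ''` would hide a legitimate `0`; for numeric optional fields, compare against `null` explicitly.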

4. Data Types: String, Number, Boolean, Array, Enum

Choosing the right data type for each field prevents type coercion bugs and makes validation straightforward.

String

// Strings: free-form text fields
{
  name: 'string',                         // Any text
  summary: 'string (max 500 characters)',  // With length constraint
  email: 'string (email format)',          // With format constraint
  date: 'string (YYYY-MM-DD)',            // With pattern constraint
  id: 'string (UUID format)',             // With specific format
}

// When to use strings:
// - Names, descriptions, summaries, explanations
// - Dates (ISO 8601 format — safer than having LLM return Date objects)
// - IDs, codes, reference numbers
// - URLs, email addresses, phone numbers (validate format separately)

// Common pitfall: asking for a number but getting a string
// LLM returns: { "price": "$29.99" }  ← string with currency symbol
// Solution: Specify explicitly: "price": number (no currency symbol, just the number)
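String constraints such as length and format can be checked in the same style as the number and array validators in this section. The sketch below is illustrative: `validateString` and `isRealDate` are hypothetical helpers, and the patterns are intentionally loose (they check shape, not real-world validity).

```javascript
// Hypothetical helper for string constraints.
function validateString(value, { minLength, maxLength, pattern } = {}) {
  if (typeof value !== 'string') return false;
  if (minLength !== undefined && value.length < minLength) return false;
  if (maxLength !== undefined && value.length > maxLength) return false;
  if (pattern && !pattern.test(value)) return false;
  return true;
}

// Deliberately loose patterns: they check shape only.
const DATE_PATTERN = /^\d{4}-\d{2}-\d{2}$/;
const EMAIL_PATTERN = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

// A string can match YYYY-MM-DD and still name an impossible day (2024-02-31),
// so round-trip it through Date and compare with the original.
function isRealDate(value) {
  if (!validateString(value, { pattern: DATE_PATTERN })) return false;
  const parsed = new Date(`${value}T00:00:00Z`);
  return !Number.isNaN(parsed.getTime()) && parsed.toISOString().slice(0, 10) === value;
}

console.log(validateString('jane@acme.com', { pattern: EMAIL_PATTERN }));
console.log(isRealDate('2024-02-29'));
console.log(isRealDate('2024-02-31'));
```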

Number

// Numbers: numeric values
{
  age: 'number (integer, 0-150)',
  price: 'number (decimal, e.g., 29.99)',
  confidence: 'number (0.0 to 1.0)',
  score: 'number (integer, 0 to 100)',
  latitude: 'number (-90 to 90)',
  longitude: 'number (-180 to 180)',
  count: 'number (non-negative integer)',
}

// Critical: Be explicit about integer vs decimal
// "quantity": number          → LLM might return 3 or 3.5
// "quantity": integer         → Clearer, but still only a hint; validate it
// "quantity": number (whole number, no decimals) → Most explicit
// Note: JSON.parse reads 3.0 as 3, so Number.isInteger accepts it in JavaScript

// Validation example
function validateNumber(value, { min, max, integer } = {}) {
  // isFinite rejects NaN and Infinity (JSON.parse('1e999') yields Infinity)
  if (typeof value !== 'number' || !Number.isFinite(value)) return false;
  if (integer && !Number.isInteger(value)) return false;
  if (min !== undefined && value < min) return false;
  if (max !== undefined && value > max) return false;
  return true;
}

Boolean

// Booleans: true/false values
{
  isSpam: 'boolean',
  isUrgent: 'boolean',
  hasAttachment: 'boolean',
  needsReview: 'boolean',
  isApproved: 'boolean',
}

// Why booleans are better than string alternatives:
// BAD: { spam: "yes" }       → Is it "yes", "Yes", "YES", "true", "1"?
// BAD: { spam: "no" }        → Is it "no", "No", "NO", "false", "0"?
// GOOD: { isSpam: true }     → Only two possible values. No ambiguity.

// Common pitfall: LLM returns string "true" instead of boolean true
// Always validate:
function ensureBoolean(value) {
  if (typeof value === 'boolean') return value;
  if (value === 'true') return true;
  if (value === 'false') return false;
  throw new Error(`Expected boolean, got: ${value}`);
}

Array

// Arrays: lists of items
{
  tags: ['string'],                        // Array of strings
  scores: ['number'],                      // Array of numbers
  items: [{ name: 'string', qty: 'number' }],  // Array of objects
  categories: ['string (1-5 items)'],      // Array with length constraint
}

// Specify array constraints in the prompt:
// "tags": array of strings, 3 to 10 items
// "highlights": array of strings, each 1-2 sentences, maximum 5 items

// Validation example
function validateArray(value, { minLength, maxLength, itemValidator } = {}) {
  if (!Array.isArray(value)) return false;
  if (minLength !== undefined && value.length < minLength) return false;
  if (maxLength !== undefined && value.length > maxLength) return false;
  if (itemValidator && !value.every(itemValidator)) return false;
  return true;
}

// Usage
const tagsValid = validateArray(data.tags, {
  minLength: 1,
  maxLength: 15,
  itemValidator: (tag) => typeof tag === 'string' && tag.length > 0,
});

Enum — The most underrated type

// Enums: restricted set of allowed values
// This is the SINGLE MOST IMPORTANT type for structured LLM output.
// Enums constrain the LLM to specific values your code can handle.

{
  sentiment: '"positive" | "negative" | "neutral" | "mixed"',
  priority: '"low" | "medium" | "high" | "critical"',
  status: '"pending" | "approved" | "rejected"',
  category: '"bug" | "feature" | "question" | "other"',
  difficulty: '"beginner" | "intermediate" | "advanced"',
}

// Why enums are critical:
// Without enum: LLM might say "Positive", "POSITIVE", "positive!", "pos", "good"
// With enum:    LLM is far more likely to return exactly "positive", and any
//               other value is easy to detect and reject in validation

// In the prompt, be explicit about allowed values:
const prompt = `
Classify the priority. The "priority" field MUST be exactly one of: "low", "medium", "high", "critical".
Do not use any other value.
`;

// Validation
function validateEnum(value, allowedValues) {
  return allowedValues.includes(value);
}

// Usage in code: validated enums keep switch statements simple
switch (data.priority) {
  case 'critical': return escalateImmediately(data);
  case 'high':     return assignToSenior(data);
  case 'medium':   return addToQueue(data);
  case 'low':      return addToBacklog(data);
  default:
    // Should be unreachable after validation; fail loudly if a new value slips in
    throw new Error(`Unhandled priority: ${data.priority}`);
}
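Strict validation rejects near-misses such as "High" or "high!". One option is to normalize obvious variations before the exact-match check and fall back to a default otherwise. This is a sketch, not a recommendation for every system: `normalizeEnum` and `parsePriority` are hypothetical names, and the choice of `'medium'` as the fallback is an assumption.

```javascript
const PRIORITIES = ['low', 'medium', 'high', 'critical'];

// Tolerate case, surrounding whitespace and trailing punctuation,
// then require an exact match against the allowed list.
function normalizeEnum(value, allowedValues) {
  if (typeof value !== 'string') return null;
  const cleaned = value.trim().toLowerCase().replace(/[.!?]+$/, '');
  return allowedValues.includes(cleaned) ? cleaned : null;
}

// Return the normalized value, or a fallback plus a flag so callers can log it.
function parsePriority(value, fallback = 'medium') {
  const normalized = normalizeEnum(value, PRIORITIES);
  if (normalized === null) {
    return { priority: fallback, usedFallback: true };
  }
  return { priority: normalized, usedFallback: false };
}

console.log(parsePriority(' High! '));
console.log(parsePriority('urgent'));
```

Whether to fall back silently, retry the LLM call, or route the item to human review depends on how costly a wrong value is; the `usedFallback` flag keeps that decision visible to the caller.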

5. Nesting vs Flat Structures

Choosing the right level of nesting affects readability, parseability, and downstream processing.

Flat structure: Simple, easy to process

// Flat: all fields at the top level
const flatSchema = {
  customerName: 'string',
  customerEmail: 'string',
  customerPhone: 'string',
  shippingStreet: 'string',
  shippingCity: 'string',
  shippingState: 'string',
  shippingZip: 'string',
  billingStreet: 'string',
  billingCity: 'string',
  billingState: 'string',
  billingZip: 'string',
  orderTotal: 'number',
  orderDate: 'string',
};

// Pros:
// + Simple to access: data.customerName
// + Easy to map to database columns
// + No nested null checks
// + Easier for LLMs to generate (fewer braces)

// Cons:
// - Field name repetition (shipping vs billing)
// - No logical grouping
// - Gets unwieldy with many fields
// - Harder to pass groups of related fields to functions

Nested structure: Organized, grouped

// Nested: related fields grouped into objects
const nestedSchema = {
  customer: {
    name: 'string',
    email: 'string',
    phone: 'string',
  },
  shippingAddress: {
    street: 'string',
    city: 'string',
    state: 'string',
    zip: 'string',
  },
  billingAddress: {
    street: 'string',
    city: 'string',
    state: 'string',
    zip: 'string',
  },
  order: {
    total: 'number',
    date: 'string',
  },
};

// Pros:
// + Logical grouping — related fields together
// + Reusable sub-schemas (both addresses share the same shape)
// + Easier to pass groups to functions: formatAddress(data.shippingAddress)
// + Clearer documentation

// Cons:
// - Deeper access: data.customer.name vs data.customerName
// - Null checks needed: data.customer?.name
// - More tokens (extra braces and indentation)
// - Slightly harder for LLMs with deep nesting

Guidelines: When to nest, when to flatten

Use FLAT when:
  - Schema has about 10 fields or fewer
  - Fields aren't naturally grouped
  - You're mapping directly to a flat database table
  - LLM reliability is critical (fewer braces = fewer parse errors)
  - Performance matters (fewer tokens)

Use NESTED when:
  - Schema has more than 10 fields
  - Fields have clear logical groupings
  - Sub-objects are reused (e.g., address appears twice)
  - You pass groups of fields to different functions
  - Schema matches your application's object model

AVOID:
  - More than 3 levels of nesting (LLMs struggle with deep nesting)
  - Mixing flat and nested randomly
  - Nesting single fields (don't: { metadata: { version: 1 } } if version is the only field)

Practical nesting depth guide

// GOOD: 1-2 levels of nesting (LLMs handle this reliably)
{
  candidate: {
    name: 'string',
    email: 'string',
  },
  scores: {
    technical: 90,
    cultural: 85,
  },
  recommendation: 'yes',
}

// OK: 3 levels (test carefully — LLMs sometimes misplace closing braces)
{
  analysis: {
    sentiment: {
      overall: 'positive',
      aspects: [
        { topic: 'quality', score: 0.9 },
      ],
    },
  },
}

// BAD: 4+ levels (LLMs frequently produce malformed JSON at this depth)
{
  report: {
    sections: {
      analysis: {
        sentiment: {
          detailed: {
            aspects: [
              { sub: { value: 'too deep' } },
            ],
          },
        },
      },
    },
  },
}
// If you need this depth, split into multiple LLM calls or flatten.
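The depth guideline can be checked mechanically, for example in a unit test that runs whenever a schema changes. The sketch below counts levels the same way the examples above do (a flat object has 0 levels of nesting; objects inside arrays count as a level). The helper names and the default limit of 3 are assumptions for illustration.

```javascript
// Depth in objects: a flat object is 1. Arrays add no level themselves,
// but objects inside them do.
function objectDepth(value) {
  if (Array.isArray(value)) {
    return value.reduce((max, item) => Math.max(max, objectDepth(item)), 0);
  }
  if (value !== null && typeof value === 'object') {
    return 1 + Math.max(0, ...Object.values(value).map(objectDepth));
  }
  return 0; // primitives
}

// Levels of nesting below the root object.
function nestingLevels(schema) {
  return Math.max(0, objectDepth(schema) - 1);
}

function assertMaxNesting(schema, maxLevels = 3) {
  const levels = nestingLevels(schema);
  if (levels > maxLevels) {
    throw new Error(`Schema nests ${levels} levels deep; limit is ${maxLevels}`);
  }
  return levels;
}

const threeLevels = {
  analysis: {
    sentiment: {
      overall: 'positive',
      aspects: [{ topic: 'quality', score: 0.9 }],
    },
  },
};
console.log(assertMaxNesting(threeLevels));
```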

6. Schema Documentation for Team Communication

Schemas are contracts between teams. Document them clearly so everyone — frontend, backend, AI, QA — knows what to expect.

Inline documentation format

/**
 * Feedback Analysis Schema v2.1
 * 
 * Used by:
 *   - POST /api/feedback/analyze (backend)
 *   - FeedbackDashboard component (frontend)
 *   - Daily sentiment report (analytics)
 *   - Alert system (ops)
 * 
 * LLM: GPT-4o, temperature 0
 * Prompt: See prompts/feedback-analysis.txt
 * Average tokens: ~150 output tokens per analysis
 * Validation: Zod schema in schemas/feedback.ts
 */
const feedbackAnalysisSchema = {
  /**
   * Overall sentiment classification.
   * @type {string}
   * @enum ["positive", "negative", "neutral", "mixed"]
   * @required
   * @example "positive"
   */
  sentiment: 'enum',
  
  /**
   * Numerical sentiment score.
   * @type {number}
   * @range -1.0 to 1.0
   * @required
   * @example 0.85
   * @usage Trend charts on dashboard, threshold for alerts (<-0.5)
   */
  sentimentScore: 'number',
  
  /**
   * Model's confidence in the classification.
   * @type {number}
   * @range 0.0 to 1.0
   * @required
   * @example 0.92
   * @usage Entries with confidence < 0.7 are routed to human review
   */
  confidence: 'number',
  
  /**
   * Feedback category for routing.
   * @type {string}
   * @enum ["bug", "feature_request", "praise", "complaint", "question", "other"]
   * @required
   * @example "feature_request"
   * @usage Determines which Slack channel gets the notification
   */
  category: 'enum',
  
  /**
   * Brief summary of the feedback.
   * @type {string}
   * @maxLength 200
   * @required
   * @example "Customer requesting dark mode support"
   * @usage Displayed in ticket list preview
   */
  summary: 'string',
  
  /**
   * Whether immediate action is needed.
   * @type {boolean}
   * @required
   * @example false
   * @usage Triggers PagerDuty alert when true
   */
  actionRequired: 'boolean',
  
  /**
   * Specific feature or product mentioned.
   * @type {string | null}
   * @optional
   * @example "dark mode"
   * @usage Links feedback to product roadmap items
   */
  featureMentioned: 'string | null',
};

Schema as a shared reference

// Keep schemas in a shared location that all teams can reference

// File: schemas/feedback-analysis.js
export const FEEDBACK_ANALYSIS_SCHEMA = {
  version: '2.1',
  lastUpdated: '2025-06-15',
  owner: 'ai-team',
  
  fields: {
    sentiment: {
      type: 'string',
      enum: ['positive', 'negative', 'neutral', 'mixed'],
      required: true,
      description: 'Overall sentiment classification',
    },
    sentimentScore: {
      type: 'number',
      min: -1.0,
      max: 1.0,
      required: true,
      description: 'Numerical sentiment score',
    },
    confidence: {
      type: 'number',
      min: 0.0,
      max: 1.0,
      required: true,
      description: 'Classification confidence level',
    },
    category: {
      type: 'string',
      enum: ['bug', 'feature_request', 'praise', 'complaint', 'question', 'other'],
      required: true,
      description: 'Feedback category for routing',
    },
    summary: {
      type: 'string',
      maxLength: 200,
      required: true,
      description: 'Brief summary of the feedback',
    },
    actionRequired: {
      type: 'boolean',
      required: true,
      description: 'Whether immediate action is needed',
    },
    featureMentioned: {
      type: 'string',
      required: false,
      nullable: true,
      description: 'Specific feature or product mentioned',
    },
  },
};

// Frontend, backend, AI team, and QA all import this same file
// Any change is visible to everyone via version control
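A shared descriptor file becomes more useful when validation reads it directly, so the documentation and the checks cannot drift apart. The following is a rough sketch of a generic validator over the same descriptor keys used above (`type`, `enum`, `min`, `max`, `maxLength`, `required`, `nullable`). The function name `validateAgainstFields` is hypothetical, and only a subset of the fields is repeated to keep the example self-contained.

```javascript
const fields = {
  sentiment: { type: 'string', enum: ['positive', 'negative', 'neutral', 'mixed'], required: true },
  sentimentScore: { type: 'number', min: -1.0, max: 1.0, required: true },
  summary: { type: 'string', maxLength: 200, required: true },
  actionRequired: { type: 'boolean', required: true },
  featureMentioned: { type: 'string', required: false, nullable: true },
};

// Returns a list of human-readable errors; an empty list means the data is valid.
function validateAgainstFields(data, fieldSpecs) {
  const errors = [];
  for (const [name, spec] of Object.entries(fieldSpecs)) {
    const value = data[name];
    if (value === undefined) {
      if (spec.required) errors.push(`${name}: missing required field`);
      continue;
    }
    if (value === null) {
      if (!spec.nullable) errors.push(`${name}: must not be null`);
      continue;
    }
    if (typeof value !== spec.type) {
      errors.push(`${name}: expected ${spec.type}, got ${typeof value}`);
      continue;
    }
    if (spec.type === 'number' && !Number.isFinite(value)) errors.push(`${name}: not a finite number`);
    if (spec.enum && !spec.enum.includes(value)) errors.push(`${name}: not an allowed value`);
    if (spec.min !== undefined && value < spec.min) errors.push(`${name}: below ${spec.min}`);
    if (spec.max !== undefined && value > spec.max) errors.push(`${name}: above ${spec.max}`);
    if (spec.maxLength !== undefined && value.length > spec.maxLength) {
      errors.push(`${name}: longer than ${spec.maxLength} characters`);
    }
  }
  return errors;
}

const validErrors = validateAgainstFields(
  { sentiment: 'positive', sentimentScore: 0.85, summary: 'Wants dark mode', actionRequired: false, featureMentioned: null },
  fields,
);
const invalidErrors = validateAgainstFields(
  { sentiment: 'Positive', sentimentScore: 1.5, actionRequired: 'false' },
  fields,
);
console.log(validErrors, invalidErrors);
```

A schema library such as Zod covers the same ground with better error reporting; the hand-rolled version here only shows that the descriptor file carries enough information to drive validation.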

7. Versioning Schemas as Requirements Change

Schemas evolve. New fields are added, types change, values are deprecated. Without versioning, changes break consumers silently.

Why versioning matters

// Week 1: Original schema
// { sentiment: "positive"|"negative"|"neutral", confidence: number }

// Week 3: PM wants "mixed" as a sentiment option
// { sentiment: "positive"|"negative"|"neutral"|"mixed", confidence: number }
// Impact: Frontend switch statement needs a new case

// Week 5: QA wants aspect-level breakdown
// { sentiment: ..., confidence: ..., aspects: [...] }
// Impact: Database needs a new column, frontend needs a new component

// Week 8: Data team wants the score as a number, not a label
// { sentiment: ..., sentimentScore: number, confidence: ..., aspects: [...] }
// Impact: Analytics queries need updating

// Without versioning, each change is a surprise that breaks things.
// With versioning, changes are planned, communicated, and backward-compatible.

Semantic versioning for schemas

// Version format: MAJOR.MINOR (a simplified form of semantic versioning, without a PATCH number)
// MAJOR: Breaking changes (removed fields, changed types, new required fields)
// MINOR: Additive changes (new optional fields)

// v1.0 — Initial schema
const v1_0 = {
  sentiment: 'positive | negative | neutral',
  confidence: 'number',
};

// v1.1 — Added optional field (backward compatible)
const v1_1 = {
  sentiment: 'positive | negative | neutral',
  confidence: 'number',
  summary: 'string | null',  // NEW (optional)
};

// v1.2 — Added new enum value (backward compatible if consumers handle unknown)
const v1_2 = {
  sentiment: 'positive | negative | neutral | mixed',  // ADDED "mixed"
  confidence: 'number',
  summary: 'string | null',
};

// v2.0 — Breaking change (required field added, type changed)
const v2_0 = {
  sentiment: 'positive | negative | neutral | mixed',
  sentimentScore: 'number (-1 to 1)',  // NEW (required) — replaces inferred score
  confidence: 'number',
  summary: 'string',                    // Changed from nullable to required
  aspects: [{ topic: 'string', score: 'number' }],  // NEW (required)
};
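The MAJOR/MINOR rules above are mechanical enough to check in code, for example as a CI step that compares the old and new schema files. This sketch assumes each field is described as `{ type, required }` and follows the rules stated in this section; `classifySchemaChange` is a hypothetical helper, and it does not inspect enum values or nested shapes.

```javascript
const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Compare two field maps and report the version bump the change calls for.
function classifySchemaChange(oldFields, newFields) {
  const reasons = [];
  let level = 'none';
  const bump = (next, reason) => {
    reasons.push(reason);
    if (next === 'major' || level === 'none') level = next; // major always wins
  };

  for (const [name, oldSpec] of Object.entries(oldFields)) {
    if (!has(newFields, name)) {
      bump('major', `removed field "${name}"`);
      continue;
    }
    const newSpec = newFields[name];
    if (newSpec.type !== oldSpec.type) bump('major', `changed type of "${name}"`);
    if (newSpec.required && !oldSpec.required) bump('major', `"${name}" became required`);
  }
  for (const [name, newSpec] of Object.entries(newFields)) {
    if (has(oldFields, name)) continue;
    if (newSpec.required) bump('major', `added required field "${name}"`);
    else bump('minor', `added optional field "${name}"`);
  }
  return { level, reasons };
}

const v10 = {
  sentiment: { type: 'string', required: true },
  confidence: { type: 'number', required: true },
};
const v11 = { ...v10, summary: { type: 'string', required: false } };
const v20 = {
  ...v10,
  sentimentScore: { type: 'number', required: true },
  summary: { type: 'string', required: true },
  aspects: { type: 'array', required: true },
};

console.log(classifySchemaChange(v10, v11));
console.log(classifySchemaChange(v11, v20));
```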

Managing versions in production

// Include version in the schema output
async function analyzeFeedback(text) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0,
    messages: [
      {
        role: 'system',
        content: `Analyze feedback. Respond with JSON:
{
  "_schemaVersion": "2.1",
  "sentiment": "positive"|"negative"|"neutral"|"mixed",
  "sentimentScore": number (-1.0 to 1.0),
  "confidence": number (0.0 to 1.0),
  "category": "bug"|"feature_request"|"praise"|"complaint"|"question"|"other",
  "summary": "string (max 200 chars)",
  "actionRequired": boolean,
  "featureMentioned": "string or null"
}`
      },
      { role: 'user', content: text },
    ],
  });

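  let data;
  try {
    data = JSON.parse(response.choices[0].message.content);
  } catch (err) {
    throw new Error(`LLM returned invalid JSON: ${err.message}`);
  }
  data._schemaVersion = '2.1'; // Always set in code; the LLM may omit or misstate it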
  return data;
}

// Consumer code handles versioning
function processFeedbackResult(data) {
  switch (data._schemaVersion) {
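    case '1.0':
    case '1.1':
    case '1.2': {
      // Old schema: map to current format with neutral defaults
      const scoreByLabel = { positive: 0.5, negative: -0.5, neutral: 0, mixed: 0 };
      return {
        ...data,
        sentimentScore: scoreByLabel[data.sentiment] ?? 0,
        category: 'other',
        actionRequired: false,
      };
    }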
    case '2.0':
    case '2.1':
      // Current schema — use directly
      return data;
    default:
      console.warn(`Unknown schema version: ${data._schemaVersion}`);
      return data; // Best effort
  }
}

Migration strategy

// When migrating from v1 to v2:

// Step 1: Deploy code that handles BOTH v1 and v2
// Step 2: Update the LLM prompt to produce v2
// Step 3: Verify all consumers handle v2 correctly
// Step 4: After a transition period, remove v1 handling code

// Example migration:
class FeedbackAnalyzer {
  constructor(schemaVersion = '2.1') {
    this.schemaVersion = schemaVersion;
    this.prompts = {
      '1.0': `Respond with JSON: { "sentiment": "positive"|"negative"|"neutral", "confidence": number }`,
      '2.0': `Respond with JSON: { "sentiment": ..., "sentimentScore": number, "aspects": [...] }`,
      '2.1': `Respond with JSON: { full v2.1 schema }`,
    };
  }
  
  async analyze(text) {
    const prompt = this.prompts[this.schemaVersion];
    // ... LLM call with the versioned prompt
  }
}

// Gradual rollout:
// 10% of traffic → new version (canary)
// Monitor error rates
// If stable → 50% → 100%
// Remove old version after 2 weeks
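For the gradual rollout, each request needs a stable assignment so the same user does not flip between schema versions. One common approach is to hash a stable key into a bucket from 0 to 99. The sketch below is an assumption-laden illustration: `pickSchemaVersion` is a hypothetical helper, the version strings are placeholders, and the hash is a simple FNV-1a-style function over UTF-16 code units, not a cryptographic one.

```javascript
// Small deterministic string hash (FNV-1a style, 32-bit). Not cryptographic.
function hashString(input) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep as unsigned 32-bit
  }
  return hash;
}

// The same key always lands in the same bucket, so raising rolloutPercent
// only ever moves keys from stable to canary, never back.
function pickSchemaVersion(requestKey, { rolloutPercent, stable = '2.0', canary = '2.1' }) {
  const bucket = hashString(requestKey) % 100; // 0..99
  return bucket < rolloutPercent ? canary : stable;
}

for (const userId of ['user-1', 'user-2', 'user-3']) {
  console.log(userId, pickSchemaVersion(userId, { rolloutPercent: 10 }));
}
```

The chosen version can then be passed to the analyzer (for example `new FeedbackAnalyzer(version)`) and recorded alongside error metrics so the two versions can be compared.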

8. Putting It All Together: Schema Design Workflow

┌─────────────────────────────────────────────────────────────────────┐
│                SCHEMA DESIGN WORKFLOW                                │
│                                                                     │
│  1. REQUIREMENTS                                                    │
│     └─ What data do consumers need?                                 │
│     └─ What decisions will code make based on this data?            │
│     └─ What gets stored in the database?                            │
│                                                                     │
│  2. DESIGN                                                          │
│     └─ Choose naming convention (camelCase for JS)                  │
│     └─ Define fields, types, and constraints                        │
│     └─ Mark required vs optional                                    │
│     └─ Choose flat vs nested                                        │
│     └─ Use enums for every field with a finite set of values        │
│                                                                     │
│  3. DOCUMENT                                                        │
│     └─ Write field descriptions                                     │
│     └─ Note who consumes each field                                 │
│     └─ Specify validation rules                                     │
│     └─ Assign version number                                        │
│                                                                     │
│  4. IMPLEMENT                                                       │
│     └─ Write the prompt with the schema                             │
│     └─ Build validation (Zod / manual)                              │
│     └─ Write tests with sample inputs                               │
│     └─ Build error recovery (retry, fallback)                       │
│                                                                     │
│  5. ITERATE                                                         │
│     └─ Run against real data                                        │
│     └─ Identify edge cases                                          │
│     └─ Refine field constraints                                     │
│     └─ Version the changes                                          │
└─────────────────────────────────────────────────────────────────────┘

Complete example: Designing a schema from scratch

// TASK: Build a product review analysis system

// Step 1: Requirements gathering
// - Frontend needs: overall rating, pros/cons list, summary
// - Database stores: all fields for historical analysis
// - Analytics needs: numerical scores, categories
// - Alert system needs: is this a product defect report?

// Step 2: Design the schema
const reviewAnalysisSchema = {
  // Required fields
  overallSentiment: '"very_positive" | "positive" | "neutral" | "negative" | "very_negative"',
  sentimentScore: 'number (-1.0 to 1.0)',
  confidence: 'number (0.0 to 1.0)',
  summary: 'string (1-2 sentences, max 200 chars)',
  pros: ['string (1-5 items)'],
  cons: ['string (0-5 items)'],
  
  defectReported: 'boolean',  // Required: false when the review reports no defect

  // Optional fields (null unless mentioned in the review)
  productMentioned: 'string | null',
  purchaseVerified: 'boolean | null',
  
  // Aspect breakdown
  aspects: [{
    name: 'string',
    sentiment: '"positive" | "negative" | "neutral"',
    score: 'number (-1.0 to 1.0)',
  }],
};

// Step 3: Write the prompt
const systemPrompt = `You are a product review analyst. Analyze the given review.

Respond with ONLY a valid JSON object:
{
  "overallSentiment": "very_positive" | "positive" | "neutral" | "negative" | "very_negative",
  "sentimentScore": number (-1.0 to 1.0),
  "confidence": number (0.0 to 1.0),
  "summary": "string (1-2 sentences, max 200 chars)",
  "pros": ["string (list 1-5 positive points)"],
  "cons": ["string (list 0-5 negative points, empty array if none)"],
  "productMentioned": "string or null (only if a specific product is named)",
  "defectReported": boolean (true if the review reports a product defect or malfunction),
  "purchaseVerified": boolean or null (true/false only if explicitly stated),
  "aspects": [
    {
      "name": "string (e.g., quality, price, shipping, customer_service)",
      "sentiment": "positive" | "negative" | "neutral",
      "score": number (-1.0 to 1.0)
    }
  ]
}

Rules:
- Use null for optional fields when the information is not in the review
- Do not fabricate information not present in the review
- pros and cons must be specific observations from the review, not generic statements`;

// Step 4: Implement with validation
async function analyzeReview(reviewText) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0,
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: reviewText },
    ],
  });

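  let data;
  try {
    data = JSON.parse(response.choices[0].message.content);
  } catch (err) {
    throw new Error(`LLM returned invalid JSON: ${err.message}`);
  }
  
  // Validate required fields and types
  const errors = [];
  if (!['very_positive', 'positive', 'neutral', 'negative', 'very_negative'].includes(data.overallSentiment)) {
    errors.push('Invalid overallSentiment');
  }
  if (typeof data.sentimentScore !== 'number' || data.sentimentScore < -1 || data.sentimentScore > 1) {
    errors.push('Invalid sentimentScore');
  }
  if (typeof data.confidence !== 'number' || data.confidence < 0 || data.confidence > 1) {
    errors.push('Invalid confidence');
  }
  if (typeof data.summary !== 'string' || data.summary.length === 0) {
    errors.push('Invalid summary');
  }
  if (!Array.isArray(data.pros)) {
    errors.push('pros must be an array');
  }
  if (!Array.isArray(data.cons)) {
    errors.push('cons must be an array');
  }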
  
  if (errors.length > 0) {
    throw new Error(`Schema validation failed: ${errors.join(', ')}`);
  }
  
  // Add metadata
  data._schemaVersion = '1.0';
  data._analyzedAt = new Date().toISOString();
  
  return data;
}
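A frequent cause of `JSON.parse` failures is a response that wraps the object in a markdown code fence or surrounds it with a sentence of prose. A tolerant extraction step can recover many of these before falling back to a retry. The sketch below is one possible approach, with a hypothetical helper name `extractJsonObject`; it simply takes the outermost `{...}` span, so it assumes the response contains a single top-level object.

```javascript
// Extract and parse the outermost {...} span from a raw model response.
// Returns { ok: true, value } or { ok: false, error } instead of throwing.
function extractJsonObject(raw) {
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) {
    return { ok: false, error: 'No JSON object found' };
  }
  try {
    return { ok: true, value: JSON.parse(raw.slice(start, end + 1)) };
  } catch (err) {
    return { ok: false, error: err.message };
  }
}

// Build the fence marker at runtime so this snippet stays fence-safe.
const FENCE = '`'.repeat(3);
const fenced = `${FENCE}json\n{"overallSentiment": "positive", "pros": ["fast"]}\n${FENCE}`;
const chatty = 'Here is the analysis: {"overallSentiment": "negative"} Hope that helps!';
const truncated = '{"overallSentiment": "positive", ';

console.log(extractJsonObject(fenced));
console.log(extractJsonObject(chatty));
console.log(extractJsonObject(truncated));
```

Returning a result object rather than throwing makes it straightforward for the caller to decide between retrying the LLM call and using a fallback.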

9. Key Takeaways

  1. Design the schema before writing the prompt — know what every consumer needs, define the exact shape, then write the prompt to produce that shape.
  2. Pick one naming convention and stick with it — camelCase for JavaScript projects, snake_case for Python. Never mix. Document the convention.
  3. Required fields must always be present and optional fields use null. This discourages the LLM from fabricating data and keeps your code from crashing on missing fields.
  4. Enums are the most important type. They constrain the LLM to exact values your code can handle, which largely removes the "what if the model says it differently" problem once you validate against the allowed list.
  5. Keep nesting to 2-3 levels maximum — deeper nesting increases token cost and LLM error rate. Flatten or split into multiple calls for complex schemas.
  6. Document schemas with consumer, purpose, and example for each field — schemas are team contracts, not just LLM instructions.
  7. Version your schemas — additive changes are minor versions, breaking changes are major versions. Plan migrations and support old versions during transitions.

Explain-It Challenge

  1. A new team member adds a field called "dt" to the schema. Explain why this is a problem and what the field should be named instead.
  2. The product manager wants to add three new optional fields to the sentiment analysis schema. Should this be a major or minor version bump? What if one of the new fields is required?
  3. Your schema has 4 levels of nesting and the LLM sometimes produces malformed JSON. Propose two strategies to fix this without changing what data you collect.
