Episode 4 — Generative AI Engineering / 4.10 — Error Handling in AI Applications

4.10.d — Logging AI Requests

In one sentence: Every LLM API call should be logged with full context — the complete request, response, token counts, latency, model version, and parameters — because without comprehensive logging, debugging non-deterministic AI behavior is nearly impossible, optimizing prompts is guesswork, and detecting regressions takes months instead of minutes.

Navigation: ← 4.10.c — Retry Mechanisms · ↑ 4.10 Overview


1. Why Logging Is Critical for AI Applications

Traditional applications are deterministic: same input produces the same output. If something breaks, you can reproduce it. LLM applications are non-deterministic: the same input can produce different outputs, and the model itself changes when providers update it. This fundamentally changes what logging means.

Traditional API debugging:
  1. User reports bug
  2. Developer reproduces with same input
  3. Developer reads code, finds bug
  4. Fix deployed

LLM application debugging:
  1. User reports "the AI gave a wrong answer"
  2. Developer tries the same input — gets a DIFFERENT (possibly correct) answer
  3. Without logs: impossible to know what went wrong
  4. With logs: can see exact prompt, model version, temperature, and raw response
  5. Root cause found: model hallucinated a fact, prompt was ambiguous, etc.

What you lose without logging

| Capability | Without Logging | With Logging |
|---|---|---|
| Debugging | Cannot reproduce issues | Full request/response replay |
| Cost tracking | Surprise bills | Token-level cost attribution |
| Prompt optimization | Guessing what works | Data-driven A/B comparisons |
| Regression detection | Users report problems weeks later | Automated alerts within minutes |
| Compliance/audit | Cannot prove what the AI said | Full audit trail |
| Performance monitoring | Blind to latency spikes | Real-time dashboards |
| Model comparison | Subjective "feels better" | Quantitative metrics |

2. What to Log: The Complete Field List

Every LLM API call should capture these fields.

Request fields

const requestLog = {
  // Identification
  requestId: 'req_abc123',           // Unique ID for this call
  traceId: 'trace_xyz789',           // Links to parent user request
  userId: 'user_456',                // Who triggered this (anonymized)
  sessionId: 'sess_789',             // Conversation session

  // Timing
  timestamp: '2026-04-11T10:30:00Z', // ISO 8601

  // Model configuration
  model: 'gpt-4o-2024-08-06',       // Exact model version — NOT "gpt-4o"
  temperature: 0,
  topP: 1,
  maxTokens: 4096,
  responseFormat: 'json_object',
  seed: 42,                          // If using seed for reproducibility

  // Input
  systemPrompt: '...',               // The system message
  messages: [...],                   // Full message array (see privacy section)
  messageCount: 5,                   // Number of messages in conversation
  
  // Metadata
  feature: 'user-extraction',        // Which feature triggered this call
  promptVersion: 'v2.3',             // Version of the prompt template
  environment: 'production'
};

Response fields

const responseLog = {
  // Link back to request
  requestId: 'req_abc123',

  // Timing
  latencyMs: 2340,                   // Total round-trip time
  timeToFirstTokenMs: 450,           // For streaming: time until first token

  // Output
  content: '...',                    // Raw response content
  finishReason: 'stop',             // "stop", "length", "content_filter"
  
  // Token usage
  promptTokens: 1250,
  completionTokens: 340,
  totalTokens: 1590,

  // Cost (calculated)
  costCents: 0.65,                   // Calculated from token counts and pricing

  // Validation
  jsonParseSuccess: true,
  schemaValidationSuccess: true,
  validationErrors: null,

  // Retry info
  attemptNumber: 1,                  // Which attempt this was
  totalAttempts: 1,                  // How many attempts total
  retryReason: null,                 // Why we retried (if applicable)

  // Error (if any)
  error: null,
  errorType: null,                   // 'api_error', 'timeout', 'parse_error', etc.
  httpStatus: 200
};

3. Building a Structured Logger

A structured logger outputs machine-readable logs (JSON) that can be queried, aggregated, and alerted on.

/**
 * Structured logger for LLM API calls.
 * Outputs JSON logs that can be sent to any log aggregation service.
 */
class LlmLogger {
  constructor(options = {}) {
    this.serviceName = options.serviceName || 'llm-service';
    this.environment = options.environment || process.env.NODE_ENV || 'development';
    this.logFunction = options.logFunction || console.log;
    this.piiFilter = options.piiFilter || null;
    this.costConfig = options.costConfig || DEFAULT_COST_CONFIG;
  }

  /**
   * Log a complete LLM request-response cycle.
   */
  logCall(request, response, metadata = {}) {
    const log = {
      // Standard fields
      level: response.error ? 'error' : 'info',
      service: this.serviceName,
      environment: this.environment,
      timestamp: new Date().toISOString(),

      // Request
      requestId: metadata.requestId || this._generateId(),
      traceId: metadata.traceId,
      feature: metadata.feature,
      promptVersion: metadata.promptVersion,

      // Model config
      model: request.model,
      temperature: request.temperature,
      maxTokens: request.max_tokens,
      responseFormat: request.response_format?.type,

      // Input metrics (not the full content — see privacy section)
      messageCount: request.messages?.length,
      systemPromptLength: request.messages?.find(m => m.role === 'system')?.content?.length,
      userMessageLength: request.messages?.filter(m => m.role === 'user')
        .reduce((sum, m) => sum + (m.content?.length || 0), 0),

      // Output
      finishReason: response.finishReason,
      outputLength: response.content?.length,

      // Tokens
      promptTokens: response.usage?.prompt_tokens,
      completionTokens: response.usage?.completion_tokens,
      totalTokens: response.usage?.total_tokens,

      // Cost
      costCents: this._calculateCost(request.model, response.usage),

      // Performance
      latencyMs: response.latencyMs,
      timeToFirstTokenMs: response.timeToFirstTokenMs,

      // Validation
      jsonParseSuccess: response.jsonParseSuccess,
      schemaValidationSuccess: response.schemaValidationSuccess,

      // Retries
      attemptNumber: metadata.attemptNumber || 1,
      totalAttempts: metadata.totalAttempts || 1,

      // Error
      error: response.error?.message,
      errorType: response.errorType,
      httpStatus: response.httpStatus
    };

    this.logFunction(JSON.stringify(log));
    return log;
  }

  /**
   * Log full request/response content (for debugging).
   * ONLY call this when detailed debugging is needed — contains sensitive data.
   */
  logDetailedCall(request, response, metadata = {}) {
    const baseLog = this.logCall(request, response, metadata);

    const detailedLog = {
      ...baseLog,
      level: 'debug',
      // Full content — apply PII filter if configured
      messages: this.piiFilter
        ? this._filterPii(request.messages)
        : request.messages,
      responseContent: this.piiFilter
        ? this.piiFilter(response.content)
        : response.content,
      validationErrors: response.validationErrors
    };

    this.logFunction(JSON.stringify(detailedLog));
    return detailedLog;
  }

  _calculateCost(model, usage) {
    if (!usage) return null;
    const pricing = this.costConfig[model];
    if (!pricing) return null;

    const inputCost = (usage.prompt_tokens / 1000000) * pricing.inputPer1M;
    const outputCost = (usage.completion_tokens / 1000000) * pricing.outputPer1M;
    return parseFloat(((inputCost + outputCost) * 100).toFixed(4));
  }

  _filterPii(messages) {
    if (!messages) return messages;
    return messages.map(msg => ({
      ...msg,
      content: this.piiFilter(msg.content)
    }));
  }

  _generateId() {
    // Math.random is fine for log correlation; use crypto.randomUUID()
    // if collision resistance matters.
    return 'req_' + Math.random().toString(36).substring(2, 15);
  }
}

const DEFAULT_COST_CONFIG = {
  'gpt-4o': { inputPer1M: 2.50, outputPer1M: 10.00 },
  'gpt-4o-2024-08-06': { inputPer1M: 2.50, outputPer1M: 10.00 },
  'gpt-4o-mini': { inputPer1M: 0.15, outputPer1M: 0.60 },
  'claude-sonnet-4-20250514': { inputPer1M: 3.00, outputPer1M: 15.00 }
};
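The cost arithmetic is easy to sanity-check in isolation. The sketch below replicates the `_calculateCost` formula as a standalone function (pricing figures taken from `DEFAULT_COST_CONFIG` above — verify against your provider's current price list before relying on them):

```javascript
// Standalone version of the cost formula used by _calculateCost above.
function costCents(usage, pricing) {
  const inputCost = (usage.prompt_tokens / 1_000_000) * pricing.inputPer1M;
  const outputCost = (usage.completion_tokens / 1_000_000) * pricing.outputPer1M;
  return parseFloat(((inputCost + outputCost) * 100).toFixed(4));
}

// 1,250 prompt + 340 completion tokens on gpt-4o ($2.50 in / $10.00 out per 1M):
// (1250/1M)*2.50 = $0.003125; (340/1M)*10.00 = $0.0034 → $0.006525 = 0.6525 cents
const cents = costCents(
  { prompt_tokens: 1250, completion_tokens: 340 },
  { inputPer1M: 2.50, outputPer1M: 10.00 }
);
console.log(cents); // 0.6525
```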

4. Instrumenting API Calls

Wrap your LLM calls to automatically capture timing and log data.

/**
 * Instrumented LLM client that automatically logs every call.
 */
class InstrumentedLlmClient {
  constructor(openai, logger) {
    this.openai = openai;
    this.logger = logger;
  }

  async chatCompletion(request, metadata = {}) {
    const startTime = Date.now();
    let response = {};

    try {
      const result = await this.openai.chat.completions.create(request);

      response = {
        content: result.choices[0]?.message?.content,
        finishReason: result.choices[0]?.finish_reason,
        usage: result.usage,
        latencyMs: Date.now() - startTime,
        httpStatus: 200,
        jsonParseSuccess: null,
        schemaValidationSuccess: null,
        error: null,
        errorType: null
      };

      // Attempt JSON parsing if content exists (meaningful only when JSON output was requested)
      if (response.content) {
        try {
          JSON.parse(response.content);
          response.jsonParseSuccess = true;
        } catch (e) {
          response.jsonParseSuccess = false;
        }
      }

      // Log the call
      this.logger.logCall(request, response, metadata);

      return result;

    } catch (error) {
      response = {
        content: null,
        finishReason: 'error',
        usage: null,
        latencyMs: Date.now() - startTime,
        httpStatus: error.status || null,
        error: error,
        errorType: this._classifyErrorType(error)
      };

      // Log the error
      this.logger.logCall(request, response, metadata);

      throw error;
    }
  }

  _classifyErrorType(error) {
    if (error.name === 'AbortError') return 'timeout';
    if (error.status === 429) return 'rate_limit';
    if (error.status === 401) return 'auth_error';
    if (error.status === 400) return 'bad_request';
    if (error.status >= 500) return 'server_error';
    if (error.code === 'ETIMEDOUT') return 'network_timeout';
    if (error.code === 'ECONNREFUSED') return 'connection_refused';
    return 'unknown';
  }
}

// Usage
const logger = new LlmLogger({
  serviceName: 'my-ai-app',
  environment: 'production'
});

const client = new InstrumentedLlmClient(
  new OpenAI({ apiKey: process.env.OPENAI_API_KEY }),
  logger
);

// Every call is now automatically logged
const result = await client.chatCompletion(
  {
    model: 'gpt-4o',
    messages: [{ role: 'user', content: 'Hello!' }],
    temperature: 0.7
  },
  {
    feature: 'chatbot',
    promptVersion: 'v1.2',
    traceId: 'user-session-abc'
  }
);

5. Privacy Considerations: PII in Prompts

LLM prompts often contain Personally Identifiable Information (PII) — user names, emails, addresses, medical info, financial data. Logging this data creates serious privacy and compliance risks.

The problem

User message: "My name is John Smith, my SSN is 123-45-6789, 
              and I need help with my medical bill from Dr. Johnson."

If you log this verbatim:
  ✗ GDPR violation (if user is in EU)
  ✗ HIPAA violation (medical info)
  ✗ PCI DSS violation (if financial data)
  ✗ Security risk (SSN in logs)
  ✗ Liability if logs are breached

PII filtering strategies

/**
 * PII filter for log content.
 * Replaces common PII patterns with placeholders.
 */
function createPiiFilter(options = {}) {
  const {
    filterEmails = true,
    filterPhones = true,
    filterSSNs = true,
    filterCreditCards = true,
    filterNames = false,  // Hard to do reliably — opt-in only
    placeholder = '[REDACTED]'
  } = options;

  const patterns = [];

  if (filterSSNs) {
    patterns.push({
      regex: /\b\d{3}-\d{2}-\d{4}\b/g,
      replacement: `${placeholder}_SSN`
    });
  }

  if (filterCreditCards) {
    patterns.push({
      regex: /\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b/g,
      replacement: `${placeholder}_CC`
    });
  }

  if (filterEmails) {
    patterns.push({
      regex: /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g,
      replacement: `${placeholder}_EMAIL`
    });
  }

  if (filterPhones) {
    patterns.push({
      regex: /\b(\+?1[-.]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b/g,
      replacement: `${placeholder}_PHONE`
    });
  }

  return function filter(text) {
    if (!text || typeof text !== 'string') return text;
    
    let filtered = text;
    for (const { regex, replacement } of patterns) {
      filtered = filtered.replace(regex, replacement);
    }
    return filtered;
  };
}

// Usage
const piiFilter = createPiiFilter({
  filterEmails: true,
  filterPhones: true,
  filterSSNs: true,
  filterCreditCards: true
});

const logger = new LlmLogger({
  piiFilter,
  serviceName: 'my-ai-app'
});

// Test
const input = 'My email is john@example.com and SSN is 123-45-6789';
console.log(piiFilter(input));
// "My email is [REDACTED]_EMAIL and SSN is [REDACTED]_SSN"

Tiered logging strategy

/**
 * Log different levels of detail based on environment and need.
 */
const LOGGING_TIERS = {
  production: {
    // Log metadata only — no content
    logContent: false,
    logPrompt: false,
    logMetrics: true,
    logErrors: true,
    filterPii: true,
    retention: '90 days'
  },
  staging: {
    // Log content with PII filtering
    logContent: true,
    logPrompt: true,
    logMetrics: true,
    logErrors: true,
    filterPii: true,
    retention: '30 days'
  },
  development: {
    // Log everything (local only)
    logContent: true,
    logPrompt: true,
    logMetrics: true,
    logErrors: true,
    filterPii: false,
    retention: '7 days'
  },
  debug: {
    // Temporary detailed logging for specific issues
    logContent: true,
    logPrompt: true,
    logMetrics: true,
    logErrors: true,
    filterPii: true,
    retention: '24 hours'
  }
};
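To make the tiers actionable, a small helper can resolve the active tier and gate content logging. This is a sketch using a trimmed copy of the `LOGGING_TIERS` config above; the fallback-to-production behavior for unknown environments is an assumption, not something the original config mandates:

```javascript
// Trimmed copy of the LOGGING_TIERS config above (two tiers only).
const LOGGING_TIERS = {
  production:  { logContent: false, filterPii: true },
  development: { logContent: true,  filterPii: false }
};

// Resolve the tier for the current environment.
// Unknown environments fall back to the most restrictive tier (production).
function resolveTier(environment) {
  return LOGGING_TIERS[environment] || LOGGING_TIERS.production;
}

function shouldLogContent(environment) {
  return resolveTier(environment).logContent;
}

console.log(shouldLogContent('production'));  // false
console.log(shouldLogContent('development')); // true
console.log(shouldLogContent('qa-cluster'));  // false (falls back to production)
```

Defaulting to the strictest tier means a typo in a deploy config leaks less, not more.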

6. Log Storage and Retention

Where to store logs

| Storage | Best For | Cost | Query Speed |
|---|---|---|---|
| Console/stdout | Development, containerized apps | Free | N/A (ephemeral) |
| File system | Simple apps, local debugging | Low | Slow (grep) |
| Elasticsearch | Full-text search across logs | Medium | Fast |
| CloudWatch / Datadog | Cloud-native apps, dashboards | Medium-High | Fast |
| BigQuery / Athena | Long-term analytics, large scale | Low storage, pay-per-query | Medium |
| S3 / GCS | Archival, compliance | Very low | Slow (batch) |

Retention strategy

const RETENTION_POLICY = {
  // Metrics (aggregated, no PII)
  metrics: {
    resolution: '1 minute',
    retention: '1 year',
    storage: 'time-series-db'
  },

  // Request logs (metadata only)
  requestLogs: {
    fields: ['requestId', 'model', 'tokens', 'latency', 'error', 'feature'],
    retention: '90 days',
    storage: 'elasticsearch'
  },

  // Full content logs (PII-filtered)
  contentLogs: {
    fields: 'all (PII-filtered)',
    retention: '30 days',
    storage: 'elasticsearch',
    accessControl: 'engineering-team-only'
  },

  // Error logs (detailed, for debugging)
  errorLogs: {
    fields: 'all (PII-filtered)',
    retention: '90 days',
    storage: 'elasticsearch',
    alert: true
  },

  // Audit logs (compliance)
  auditLogs: {
    fields: ['requestId', 'userId', 'timestamp', 'model', 'feature', 'contentHash'],
    retention: '7 years',
    storage: 's3-glacier',
    immutable: true
  }
};

7. Monitoring Dashboards

Transform logs into real-time visibility with dashboards.

Key metrics to track

┌─────────────────────────────────────────────────────────────────────┐
│                    LLM MONITORING DASHBOARD                         │
│                                                                     │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐             │
│  │ Success Rate  │  │ Avg Latency  │  │ Daily Cost   │             │
│  │    97.3%      │  │   2.4s       │  │   $142.50    │             │
│  │   ↓ 0.5%     │  │   ↑ 0.3s     │  │   ↑ $12.30   │             │
│  └──────────────┘  └──────────────┘  └──────────────┘             │
│                                                                     │
│  Error Breakdown (last 24h):                                        │
│  ├── Rate limit (429):     1.2%  ████░░░░░░                       │
│  ├── Timeout:              0.8%  ███░░░░░░░                       │
│  ├── Server error (5xx):   0.4%  ██░░░░░░░░                       │
│  ├── Parse failure:        0.2%  █░░░░░░░░░                       │
│  └── Validation failure:   0.1%  ░░░░░░░░░░                       │
│                                                                     │
│  Latency Distribution:                                              │
│  p50: 1.8s  p90: 4.2s  p95: 6.1s  p99: 12.3s                     │
│                                                                     │
│  Token Usage (daily):                                               │
│  Input:  12.4M tokens ($31.00)                                      │
│  Output: 3.2M tokens ($32.00)                                       │
│  Total:  15.6M tokens ($63.00)                                      │
│                                                                     │
│  Top Features by Cost:                                              │
│  1. document-analysis   $45.20  (72%)                               │
│  2. chatbot             $12.80  (20%)                               │
│  3. extraction          $5.00   (8%)                                │
└─────────────────────────────────────────────────────────────────────┘

Metrics aggregation

/**
 * Aggregate LLM call metrics for dashboard display.
 */
class LlmMetricsAggregator {
  constructor() {
    this.metrics = {
      total: 0,
      success: 0,
      errors: {},
      latencies: [],
      tokenUsage: { prompt: 0, completion: 0 },
      costCents: 0,
      byFeature: {},
      byModel: {}
    };
    this.windowStart = Date.now();
  }

  record(log) {
    this.metrics.total++;

    if (!log.error) {
      this.metrics.success++;
    } else {
      const errorType = log.errorType || 'unknown';
      this.metrics.errors[errorType] = (this.metrics.errors[errorType] || 0) + 1;
    }

    if (log.latencyMs) {
      this.metrics.latencies.push(log.latencyMs);
    }

    if (log.promptTokens) {
      this.metrics.tokenUsage.prompt += log.promptTokens;
    }
    if (log.completionTokens) {
      this.metrics.tokenUsage.completion += log.completionTokens;
    }

    if (log.costCents) {
      this.metrics.costCents += log.costCents;
    }

    // Track by feature
    if (log.feature) {
      if (!this.metrics.byFeature[log.feature]) {
        this.metrics.byFeature[log.feature] = { total: 0, errors: 0, costCents: 0 };
      }
      this.metrics.byFeature[log.feature].total++;
      if (log.error) this.metrics.byFeature[log.feature].errors++;
      if (log.costCents) this.metrics.byFeature[log.feature].costCents += log.costCents;
    }
  }

  getSummary() {
    const sorted = [...this.metrics.latencies].sort((a, b) => a - b);
    const percentile = (p) => sorted[Math.floor(sorted.length * p)] || 0;

    return {
      windowMs: Date.now() - this.windowStart,
      total: this.metrics.total,
      successRate: this.metrics.total ? (this.metrics.success / this.metrics.total * 100).toFixed(1) + '%' : 'N/A',
      errors: this.metrics.errors,
      latency: {
        p50: percentile(0.5),
        p90: percentile(0.9),
        p95: percentile(0.95),
        p99: percentile(0.99),
        avg: sorted.length ? Math.round(sorted.reduce((a, b) => a + b, 0) / sorted.length) : 0
      },
      tokens: this.metrics.tokenUsage,
      costDollars: (this.metrics.costCents / 100).toFixed(2),
      byFeature: this.metrics.byFeature
    };
  }

  reset() {
    this.metrics = {
      total: 0, success: 0, errors: {}, latencies: [],
      tokenUsage: { prompt: 0, completion: 0 },
      costCents: 0, byFeature: {}, byModel: {}
    };
    this.windowStart = Date.now();
  }
}
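The nearest-rank percentile lookup inside getSummary is the easiest part to get subtly wrong, so it is worth exercising in isolation (same logic as above, extracted into a standalone function):

```javascript
// Nearest-rank percentile over a sorted copy, as used in getSummary above.
// Returns 0 for an empty window.
function percentile(latencies, p) {
  const sorted = [...latencies].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length * p)] || 0;
}

const latencies = [1200, 800, 2300, 950, 4100, 1800, 700, 3200, 1500, 2600];
console.log(percentile(latencies, 0.5)); // 1800
console.log(percentile(latencies, 0.9)); // 4100
console.log(percentile([], 0.5));        // 0 (empty window)
```

Note the numeric comparator in sort — the default lexicographic sort would order 1000 before 200.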

8. Alerting on Error Rates

Set up alerts to be notified when things go wrong before users complain.

/**
 * Alert rules for LLM monitoring.
 */
const ALERT_RULES = [
  {
    name: 'High Error Rate',
    condition: (metrics) => {
      const errorRate = 1 - (metrics.success / metrics.total);
      return metrics.total > 100 && errorRate > 0.05; // >5% error rate
    },
    severity: 'critical',
    message: (metrics) => `LLM error rate is ${((1 - metrics.success / metrics.total) * 100).toFixed(1)}% (threshold: 5%)`
  },
  {
    name: 'High Latency',
    condition: (metrics) => {
      const sorted = [...metrics.latencies].sort((a, b) => a - b);
      const p95 = sorted[Math.floor(sorted.length * 0.95)] || 0;
      return p95 > 10000; // p95 > 10 seconds
    },
    severity: 'warning',
    message: () => 'LLM p95 latency exceeded 10 seconds'
  },
  {
    name: 'Cost Spike',
    condition: (metrics) => {
      const hourlyRate = metrics.costCents / ((Date.now() - metrics.windowStart) / 3600000);
      return hourlyRate > 500; // >$5/hour
    },
    severity: 'warning',
    message: (metrics) => {
      const hourlyRate = metrics.costCents / ((Date.now() - metrics.windowStart) / 3600000);
      return `LLM cost rate: $${(hourlyRate / 100).toFixed(2)}/hour (threshold: $5/hour)`;
    }
  },
  {
    name: 'Rate Limit Storm',
    condition: (metrics) => {
      const rateLimitCount = metrics.errors['rate_limit'] || 0;
      return rateLimitCount > 50; // >50 rate limits in window
    },
    severity: 'critical',
    message: (metrics) => `${metrics.errors['rate_limit']} rate limit errors — consider throttling requests`
  },
  {
    name: 'Validation Failure Spike',
    condition: (metrics) => {
      // If >10% of successful API calls fail validation, the prompt may be broken
      const validationFailures = metrics.errors['validation'] || 0;
      return metrics.success > 50 && (validationFailures / metrics.success) > 0.10;
    },
    severity: 'warning',
    message: () => 'Schema validation failure rate >10% — check prompt or schema changes'
  }
];

/**
 * Check all alert rules against current metrics.
 * The Cost Spike rule reads windowStart, which lives on the aggregator instance,
 * so call this as checkAlerts({ ...aggregator.metrics, windowStart: aggregator.windowStart }).
 */
function checkAlerts(metrics) {
  const triggered = [];

  for (const rule of ALERT_RULES) {
    if (rule.condition(metrics)) {
      triggered.push({
        name: rule.name,
        severity: rule.severity,
        message: rule.message(metrics),
        timestamp: new Date().toISOString()
      });
    }
  }

  return triggered;
}
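Here is the rule/check pattern end to end, reduced to a single self-contained rule (thresholds mirror the High Error Rate rule above):

```javascript
// Minimal standalone version of the error-rate rule and check loop above.
const rules = [{
  name: 'High Error Rate',
  condition: (m) => m.total > 100 && (1 - m.success / m.total) > 0.05,
  severity: 'critical',
  message: (m) =>
    `LLM error rate is ${((1 - m.success / m.total) * 100).toFixed(1)}% (threshold: 5%)`
}];

function checkAlerts(metrics) {
  return rules
    .filter(rule => rule.condition(metrics))
    .map(rule => ({
      name: rule.name,
      severity: rule.severity,
      message: rule.message(metrics),
      timestamp: new Date().toISOString()
    }));
}

// 500 calls, 460 succeeded → 8% error rate → alert fires.
console.log(checkAlerts({ total: 500, success: 460 }));
// Below the 100-call minimum, small noisy samples never alert:
console.log(checkAlerts({ total: 50, success: 30 })); // []
```

The `total > 100` guard matters: with only a handful of calls in the window, a single failure can swing the error rate past any threshold.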

9. Using Logs to Improve Prompts

Logs aren't just for debugging — they're your best tool for systematic prompt improvement.

Identifying problem patterns

/**
 * Analyze logs to find prompt improvement opportunities.
 */
function analyzePromptPerformance(logs) {
  const analysis = {
    totalCalls: logs.length,
    parseFailures: [],
    validationFailures: [],
    slowResponses: [],
    expensiveCalls: [],
    truncatedResponses: []
  };

  for (const log of logs) {
    if (log.jsonParseSuccess === false) {  // null means no JSON output was expected
      analysis.parseFailures.push({
        requestId: log.requestId,
        feature: log.feature,
        promptVersion: log.promptVersion,
        outputLength: log.outputLength
      });
    }

    if (!log.schemaValidationSuccess && log.jsonParseSuccess) {
      analysis.validationFailures.push({
        requestId: log.requestId,
        feature: log.feature,
        promptVersion: log.promptVersion,
        errors: log.validationErrors
      });
    }

    if (log.latencyMs > 10000) {
      analysis.slowResponses.push({
        requestId: log.requestId,
        feature: log.feature,
        latencyMs: log.latencyMs,
        promptTokens: log.promptTokens
      });
    }

    if (log.finishReason === 'length') {
      analysis.truncatedResponses.push({
        requestId: log.requestId,
        feature: log.feature,
        completionTokens: log.completionTokens
      });
    }
  }

  // Generate recommendations
  const recommendations = [];

  const parseFailRate = analysis.parseFailures.length / logs.length;
  if (parseFailRate > 0.02) {
    recommendations.push(
      `JSON parse failure rate is ${(parseFailRate * 100).toFixed(1)}%. ` +
      `Consider using response_format: json_object or strengthening JSON instructions.`
    );
  }

  if (analysis.truncatedResponses.length > 0) {
    recommendations.push(
      `${analysis.truncatedResponses.length} responses were truncated. ` +
      `Increase max_tokens or simplify the expected output format.`
    );
  }

  // Group validation errors by field
  const fieldErrors = {};
  for (const failure of analysis.validationFailures) {
    for (const error of (failure.errors || [])) {
      const field = error.path?.join('.') || 'unknown';
      fieldErrors[field] = (fieldErrors[field] || 0) + 1;
    }
  }

  if (Object.keys(fieldErrors).length > 0) {
    const topFields = Object.entries(fieldErrors)
      .sort((a, b) => b[1] - a[1])
      .slice(0, 5);

    recommendations.push(
      `Most common validation failures by field: ` +
      topFields.map(([field, count]) => `${field} (${count})`).join(', ') +
      `. Add explicit examples of these fields in the prompt.`
    );
  }

  return { ...analysis, recommendations };
}

A/B testing prompts with logs

/**
 * Compare performance of two prompt versions using logged data.
 */
function comparePromptVersions(logs, versionA, versionB) {
  const a = logs.filter(l => l.promptVersion === versionA);
  const b = logs.filter(l => l.promptVersion === versionB);

  const analyze = (subset) => ({
    count: subset.length,
    successRate: subset.filter(l => !l.error && l.schemaValidationSuccess).length / subset.length,
    avgLatencyMs: subset.reduce((sum, l) => sum + (l.latencyMs || 0), 0) / subset.length,
    avgTokens: subset.reduce((sum, l) => sum + (l.totalTokens || 0), 0) / subset.length,
    avgCostCents: subset.reduce((sum, l) => sum + (l.costCents || 0), 0) / subset.length,
    parseFailRate: subset.filter(l => !l.jsonParseSuccess).length / subset.length,
    truncationRate: subset.filter(l => l.finishReason === 'length').length / subset.length
  });

  return {
    [versionA]: analyze(a),
    [versionB]: analyze(b),
    recommendation: null // Add statistical significance test in production
  };
}

// Usage
const comparison = comparePromptVersions(recentLogs, 'v2.3', 'v2.4');
console.table(comparison);
/*
┌──────────────────┬─────────┬─────────┐
│                  │  v2.3   │  v2.4   │
├──────────────────┼─────────┼─────────┤
│ count            │  5,420  │  5,380  │
│ successRate      │  96.2%  │  98.7%  │
│ avgLatencyMs     │  2,340  │  2,180  │
│ avgTokens        │  1,590  │  1,420  │
│ avgCostCents     │  0.59   │  0.51   │
│ parseFailRate    │  1.8%   │  0.3%   │
│ truncationRate   │  0.5%   │  0.1%   │
└──────────────────┴─────────┴─────────┘
Winner: v2.4 (better on all metrics)
*/
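The recommendation field is left null above because eyeballing two success rates is not enough. One reasonable way to fill it in — my suggestion, not prescribed by the original — is a two-proportion z-test:

```javascript
// Two-proportion z-test: is the success-rate difference between two prompt
// versions larger than sampling noise would explain?
function twoProportionZTest(successA, totalA, successB, totalB) {
  const pA = successA / totalA;
  const pB = successB / totalB;
  const pooled = (successA + successB) / (totalA + totalB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / totalA + 1 / totalB));
  const z = (pB - pA) / se;
  // |z| > 1.96 corresponds to p < 0.05 (two-tailed).
  return { z, significant: Math.abs(z) > 1.96 };
}

// Numbers from the comparison table above: 96.2% of 5,420 vs 98.7% of 5,380.
const result = twoProportionZTest(
  Math.round(0.962 * 5420), 5420,
  Math.round(0.987 * 5380), 5380
);
console.log(result.significant); // true — z ≈ 8, far beyond noise at these sample sizes
```

At these sample sizes a 2.5-point gap is unambiguous; the same gap over a few dozen requests would not be.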

10. Key Takeaways

  1. Log every LLM call with full metadata — request ID, model version, temperature, token counts, latency, finish reason, cost, and error details. Without this, debugging non-deterministic AI is impossible.
  2. Separate metrics from content — always log metrics (tokens, latency, cost); log full content only when needed and only with PII filtering.
  3. Build dashboards around success rate, latency, cost, and error breakdown — these four metrics tell you the health of your AI system at a glance.
  4. Set alerts for error rate spikes, latency increases, and cost anomalies — don't wait for users to report problems when your monitoring can catch them in minutes.
  5. Use logs to improve prompts systematically — parse failure rates, validation error patterns, and A/B comparisons replace guesswork with data-driven prompt optimization.

Explain-It Challenge

  1. A new developer asks "why can't I just use console.log for LLM debugging?" Explain three specific scenarios where structured logging catches problems that console.log would miss.
  2. Your company processes medical records through an LLM. A compliance officer asks "what is logged and who can access it?" Design a logging policy that satisfies both debugging needs and HIPAA requirements.
  3. Your prompt v2.4 has a 98.7% success rate compared to v2.3's 96.2%. Both ran on ~5,400 requests. Is this difference statistically significant or could it be noise? How would you determine this?

Navigation: ← 4.10.c — Retry Mechanisms · ↑ 4.10 Overview