Episode 4 — Generative AI Engineering / 4.10 — Error Handling in AI Applications
4.10.d — Logging AI Requests
In one sentence: Every LLM API call should be logged with full context — the complete request, response, token counts, latency, model version, and parameters — because without comprehensive logging, debugging non-deterministic AI behavior is nearly impossible, optimizing prompts is guesswork, and detecting regressions requires months instead of minutes.
Navigation: ← 4.10.c — Retry Mechanisms · ← 4.10 Overview
1. Why Logging Is Critical for AI Applications
Traditional applications are deterministic: same input produces the same output. If something breaks, you can reproduce it. LLM applications are non-deterministic: the same input can produce different outputs, and the model itself changes when providers update it. This fundamentally changes what logging means.
Traditional API debugging:
1. User reports bug
2. Developer reproduces with same input
3. Developer reads code, finds bug
4. Fix deployed
LLM application debugging:
1. User reports "the AI gave a wrong answer"
2. Developer tries the same input — gets a DIFFERENT (possibly correct) answer
3. Without logs: impossible to know what went wrong
4. With logs: can see exact prompt, model version, temperature, and raw response
5. Root cause found: model hallucinated a fact, prompt was ambiguous, etc.
What you lose without logging
| Capability | Without Logging | With Logging |
|---|---|---|
| Debugging | Cannot reproduce issues | Full request/response replay |
| Cost tracking | Surprise bills | Token-level cost attribution |
| Prompt optimization | Guessing what works | Data-driven A/B comparisons |
| Regression detection | Users report problems weeks later | Automated alerts within minutes |
| Compliance/audit | Cannot prove what the AI said | Full audit trail |
| Performance monitoring | Blind to latency spikes | Real-time dashboards |
| Model comparison | Subjective "feels better" | Quantitative metrics |
2. What to Log: The Complete Field List
Every LLM API call should capture these fields.
Request fields
const requestLog = {
// Identification
requestId: 'req_abc123', // Unique ID for this call
traceId: 'trace_xyz789', // Links to parent user request
userId: 'user_456', // Who triggered this (anonymized)
sessionId: 'sess_789', // Conversation session
// Timing
timestamp: '2026-04-11T10:30:00Z', // ISO 8601
// Model configuration
model: 'gpt-4o-2024-08-06', // Exact model version — NOT "gpt-4o"
temperature: 0,
topP: 1,
maxTokens: 4096,
responseFormat: 'json_object',
seed: 42, // If using seed for reproducibility
// Input
systemPrompt: '...', // The system message
messages: [...], // Full message array (see privacy section)
messageCount: 5, // Number of messages in conversation
// Metadata
feature: 'user-extraction', // Which feature triggered this call
promptVersion: 'v2.3', // Version of the prompt template
environment: 'production'
};
Response fields
const responseLog = {
// Link back to request
requestId: 'req_abc123',
// Timing
latencyMs: 2340, // Total round-trip time
timeToFirstTokenMs: 450, // For streaming: time until first token
// Output
content: '...', // Raw response content
finishReason: 'stop', // "stop", "length", "content_filter"
// Token usage
promptTokens: 1250,
completionTokens: 340,
totalTokens: 1590,
// Cost (calculated)
costCents: 0.59, // Calculated from token counts and pricing
// Validation
jsonParseSuccess: true,
schemaValidationSuccess: true,
validationErrors: null,
// Retry info
attemptNumber: 1, // Which attempt this was
totalAttempts: 1, // How many attempts total
retryReason: null, // Why we retried (if applicable)
// Error (if any)
error: null,
errorType: null, // 'api_error', 'timeout', 'parse_error', etc.
httpStatus: 200
};
3. Building a Structured Logger
A structured logger outputs machine-readable logs (JSON) that can be queried, aggregated, and alerted on.
/**
* Structured logger for LLM API calls.
* Outputs JSON logs that can be sent to any log aggregation service.
*/
class LlmLogger {
constructor(options = {}) {
this.serviceName = options.serviceName || 'llm-service';
this.environment = options.environment || process.env.NODE_ENV || 'development';
this.logFunction = options.logFunction || console.log;
this.piiFilter = options.piiFilter || null;
this.costConfig = options.costConfig || DEFAULT_COST_CONFIG;
}
/**
* Log a complete LLM request-response cycle.
*/
logCall(request, response, metadata = {}) {
const log = {
// Standard fields
level: response.error ? 'error' : 'info',
service: this.serviceName,
environment: this.environment,
timestamp: new Date().toISOString(),
// Request
requestId: metadata.requestId || this._generateId(),
traceId: metadata.traceId,
feature: metadata.feature,
promptVersion: metadata.promptVersion,
// Model config
model: request.model,
temperature: request.temperature,
maxTokens: request.max_tokens,
responseFormat: request.response_format?.type,
// Input metrics (not the full content — see privacy section)
messageCount: request.messages?.length,
systemPromptLength: request.messages?.find(m => m.role === 'system')?.content?.length,
userMessageLength: request.messages?.filter(m => m.role === 'user')
.reduce((sum, m) => sum + (m.content?.length || 0), 0),
// Output
finishReason: response.finishReason,
outputLength: response.content?.length,
// Tokens
promptTokens: response.usage?.prompt_tokens,
completionTokens: response.usage?.completion_tokens,
totalTokens: response.usage?.total_tokens,
// Cost
costCents: this._calculateCost(request.model, response.usage),
// Performance
latencyMs: response.latencyMs,
timeToFirstTokenMs: response.timeToFirstTokenMs,
// Validation
jsonParseSuccess: response.jsonParseSuccess,
schemaValidationSuccess: response.schemaValidationSuccess,
// Retries
attemptNumber: metadata.attemptNumber || 1,
totalAttempts: metadata.totalAttempts || 1,
// Error
error: response.error?.message,
errorType: response.errorType,
httpStatus: response.httpStatus
};
this.logFunction(JSON.stringify(log));
return log;
}
/**
* Log full request/response content (for debugging).
* ONLY call this when detailed debugging is needed — contains sensitive data.
*/
logDetailedCall(request, response, metadata = {}) {
const baseLog = this.logCall(request, response, metadata);
const detailedLog = {
...baseLog,
level: 'debug',
// Full content — apply PII filter if configured
messages: this.piiFilter
? this._filterPii(request.messages)
: request.messages,
responseContent: this.piiFilter
? this.piiFilter(response.content)
: response.content,
validationErrors: response.validationErrors
};
this.logFunction(JSON.stringify(detailedLog));
return detailedLog;
}
_calculateCost(model, usage) {
if (!usage) return null;
const pricing = this.costConfig[model];
if (!pricing) return null;
const inputCost = (usage.prompt_tokens / 1000000) * pricing.inputPer1M;
const outputCost = (usage.completion_tokens / 1000000) * pricing.outputPer1M;
return parseFloat(((inputCost + outputCost) * 100).toFixed(4));
}
_filterPii(messages) {
if (!messages) return messages;
return messages.map(msg => ({
...msg,
content: this.piiFilter(msg.content)
}));
}
_generateId() {
return 'req_' + Math.random().toString(36).substring(2, 15);
}
}
const DEFAULT_COST_CONFIG = {
'gpt-4o': { inputPer1M: 2.50, outputPer1M: 10.00 },
'gpt-4o-2024-08-06': { inputPer1M: 2.50, outputPer1M: 10.00 },
'gpt-4o-mini': { inputPer1M: 0.15, outputPer1M: 0.60 },
'claude-sonnet-4-20250514': { inputPer1M: 3.00, outputPer1M: 15.00 }
};
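As a sanity check on the cost arithmetic in _calculateCost, the same formula can be run standalone. The pricing figures are the illustrative ones from DEFAULT_COST_CONFIG above; always verify against your provider's current price sheet:

```javascript
// Standalone version of the cost formula used by LlmLogger._calculateCost.
// Pricing is per million tokens; the result is in cents, rounded to 4 decimals.
const PRICING = { 'gpt-4o': { inputPer1M: 2.50, outputPer1M: 10.00 } };

function costCents(model, usage) {
  const pricing = PRICING[model];
  if (!pricing || !usage) return null;
  const inputCost = (usage.prompt_tokens / 1_000_000) * pricing.inputPer1M;
  const outputCost = (usage.completion_tokens / 1_000_000) * pricing.outputPer1M;
  return parseFloat(((inputCost + outputCost) * 100).toFixed(4));
}

// 1,250 input + 340 output tokens on gpt-4o:
// (1250/1e6 * $2.50 + 340/1e6 * $10.00) * 100 = 0.6525 cents
const example = costCents('gpt-4o', { prompt_tokens: 1250, completion_tokens: 340 });
console.log(example); // 0.6525
```

Returning null for unknown models (rather than guessing a price) keeps cost dashboards honest when a new model version ships before the config is updated.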
4. Instrumenting API Calls
Wrap your LLM calls to automatically capture timing and log data.
/**
* Instrumented LLM client that automatically logs every call.
*/
class InstrumentedLlmClient {
constructor(openai, logger) {
this.openai = openai;
this.logger = logger;
}
async chatCompletion(request, metadata = {}) {
const startTime = Date.now();
let response = {};
try {
const result = await this.openai.chat.completions.create(request);
response = {
content: result.choices[0]?.message?.content,
finishReason: result.choices[0]?.finish_reason,
usage: result.usage,
latencyMs: Date.now() - startTime,
httpStatus: 200,
jsonParseSuccess: null,
schemaValidationSuccess: null,
error: null,
errorType: null
};
// Attempt JSON parsing if content exists
if (response.content) {
try {
JSON.parse(response.content);
response.jsonParseSuccess = true;
} catch (e) {
response.jsonParseSuccess = false;
}
}
// Log the call
this.logger.logCall(request, response, metadata);
return result;
} catch (error) {
response = {
content: null,
finishReason: 'error',
usage: null,
latencyMs: Date.now() - startTime,
httpStatus: error.status || null,
error: error,
errorType: this._classifyErrorType(error)
};
// Log the error
this.logger.logCall(request, response, metadata);
throw error;
}
}
_classifyErrorType(error) {
if (error.name === 'AbortError') return 'timeout';
if (error.status === 429) return 'rate_limit';
if (error.status === 401) return 'auth_error';
if (error.status === 400) return 'bad_request';
if (error.status >= 500) return 'server_error';
if (error.code === 'ETIMEDOUT') return 'network_timeout';
if (error.code === 'ECONNREFUSED') return 'connection_refused';
return 'unknown';
}
}
// Usage
const logger = new LlmLogger({
serviceName: 'my-ai-app',
environment: 'production'
});
const client = new InstrumentedLlmClient(
new OpenAI({ apiKey: process.env.OPENAI_API_KEY }),
logger
);
// Every call is now automatically logged
const result = await client.chatCompletion(
{
model: 'gpt-4o',
messages: [{ role: 'user', content: 'Hello!' }],
temperature: 0.7
},
{
feature: 'chatbot',
promptVersion: 'v1.2',
traceId: 'user-session-abc'
}
);
5. Privacy Considerations: PII in Prompts
LLM prompts often contain Personally Identifiable Information (PII) — user names, emails, addresses, medical info, financial data. Logging this data creates serious privacy and compliance risks.
The problem
User message: "My name is John Smith, my SSN is 123-45-6789,
and I need help with my medical bill from Dr. Johnson."
If you log this verbatim:
✗ GDPR violation (if user is in EU)
✗ HIPAA violation (medical info)
✗ PCI DSS violation (if financial data)
✗ Security risk (SSN in logs)
✗ Liability if logs are breached
PII filtering strategies
/**
* PII filter for log content.
* Replaces common PII patterns with placeholders.
*/
function createPiiFilter(options = {}) {
const {
filterEmails = true,
filterPhones = true,
filterSSNs = true,
filterCreditCards = true,
filterNames = false, // Hard to do reliably — opt-in only
placeholder = '[REDACTED]'
} = options;
const patterns = [];
if (filterSSNs) {
patterns.push({
regex: /\b\d{3}-\d{2}-\d{4}\b/g,
replacement: `${placeholder}_SSN`
});
}
if (filterCreditCards) {
patterns.push({
regex: /\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b/g,
replacement: `${placeholder}_CC`
});
}
if (filterEmails) {
patterns.push({
regex: /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g,
replacement: `${placeholder}_EMAIL`
});
}
if (filterPhones) {
patterns.push({
regex: /\b(\+?1[-.]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b/g,
replacement: `${placeholder}_PHONE`
});
}
return function filter(text) {
if (!text || typeof text !== 'string') return text;
let filtered = text;
for (const { regex, replacement } of patterns) {
filtered = filtered.replace(regex, replacement);
}
return filtered;
};
}
// Usage
const piiFilter = createPiiFilter({
filterEmails: true,
filterPhones: true,
filterSSNs: true,
filterCreditCards: true
});
const logger = new LlmLogger({
piiFilter,
serviceName: 'my-ai-app'
});
// Test
const input = 'My email is john@example.com and SSN is 123-45-6789';
console.log(piiFilter(input));
// "My email is [REDACTED]_EMAIL and SSN is [REDACTED]_SSN"
Tiered logging strategy
/**
* Log different levels of detail based on environment and need.
*/
const LOGGING_TIERS = {
production: {
// Log metadata only — no content
logContent: false,
logPrompt: false,
logMetrics: true,
logErrors: true,
filterPii: true,
retention: '90 days'
},
staging: {
// Log content with PII filtering
logContent: true,
logPrompt: true,
logMetrics: true,
logErrors: true,
filterPii: true,
retention: '30 days'
},
development: {
// Log everything (local only)
logContent: true,
logPrompt: true,
logMetrics: true,
logErrors: true,
filterPii: false,
retention: '7 days'
},
debug: {
// Temporary detailed logging for specific issues
logContent: true,
logPrompt: true,
logMetrics: true,
logErrors: true,
filterPii: true,
retention: '24 hours'
}
};
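One way to apply these tiers is a small factory that resolves the tier from the environment name and falls back to the most restrictive settings when the environment is unrecognized. This is a sketch; the tier objects are abbreviated from LOGGING_TIERS above:

```javascript
// Abbreviated tier table (see LOGGING_TIERS for the full version).
const TIERS = {
  production:  { logContent: false, filterPii: true },
  staging:     { logContent: true,  filterPii: true },
  development: { logContent: true,  filterPii: false }
};

// Unknown environments fall back to production: the safest
// (least detailed, PII-filtered) configuration wins by default.
function resolveTier(env) {
  return TIERS[env] || TIERS.production;
}

console.log(resolveTier('staging').logContent);    // true
console.log(resolveTier('qa-cluster').logContent); // false (fallback)
```

Failing closed like this matters: a typo in an environment variable should never silently enable unfiltered content logging.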
6. Log Storage and Retention
Where to store logs
| Storage | Best For | Cost | Query Speed |
|---|---|---|---|
| Console/stdout | Development, containerized apps | Free | N/A (ephemeral) |
| File system | Simple apps, local debugging | Low | Slow (grep) |
| Elasticsearch | Full-text search across logs | Medium | Fast |
| CloudWatch / Datadog | Cloud-native apps, dashboards | Medium-High | Fast |
| BigQuery / Athena | Long-term analytics, large scale | Low storage, pay-per-query | Medium |
| S3 / GCS | Archival, compliance | Very low | Slow (batch) |
Retention strategy
const RETENTION_POLICY = {
// Metrics (aggregated, no PII)
metrics: {
resolution: '1 minute',
retention: '1 year',
storage: 'time-series-db'
},
// Request logs (metadata only)
requestLogs: {
fields: ['requestId', 'model', 'tokens', 'latency', 'error', 'feature'],
retention: '90 days',
storage: 'elasticsearch'
},
// Full content logs (PII-filtered)
contentLogs: {
fields: 'all (PII-filtered)',
retention: '30 days',
storage: 'elasticsearch',
accessControl: 'engineering-team-only'
},
// Error logs (detailed, for debugging)
errorLogs: {
fields: 'all (PII-filtered)',
retention: '90 days',
storage: 'elasticsearch',
alert: true
},
// Audit logs (compliance)
auditLogs: {
fields: ['requestId', 'userId', 'timestamp', 'model', 'feature', 'contentHash'],
retention: '7 years',
storage: 's3-glacier',
immutable: true
}
};
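Retention strings like '90 days' eventually have to be enforced. A minimal sketch of parsing them into milliseconds and checking whether a record has expired; the unit names are assumptions matching the policy object above:

```javascript
// Convert retention strings like '90 days', '1 year', or '24 hours'
// into milliseconds. Only the units used in RETENTION_POLICY are handled.
const HOUR = 3600 * 1000;
const UNIT_MS = {
  hour: HOUR, hours: HOUR,
  day: 24 * HOUR, days: 24 * HOUR,
  year: 365 * 24 * HOUR, years: 365 * 24 * HOUR  // ignores leap years
};

function parseRetentionMs(retention) {
  const [count, unit] = retention.split(' ');
  if (!UNIT_MS[unit]) throw new Error(`Unknown retention unit: ${unit}`);
  return Number(count) * UNIT_MS[unit];
}

// A record is expired once its age exceeds the policy's retention window.
function isExpired(recordTimestamp, retention, now = Date.now()) {
  return now - new Date(recordTimestamp).getTime() > parseRetentionMs(retention);
}

console.log(parseRetentionMs('24 hours')); // 86400000
```

In practice the deletion itself is usually delegated to the storage layer (index lifecycle policies, S3 lifecycle rules) rather than application code, but a parser like this is useful for validating the policy config.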
7. Monitoring Dashboards
Transform logs into real-time visibility with dashboards.
Key metrics to track
┌─────────────────────────────────────────────────────────────────────┐
│ LLM MONITORING DASHBOARD │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Success Rate │ │ Avg Latency │ │ Daily Cost │ │
│ │ 97.3% │ │ 2.4s │ │ $142.50 │ │
│ │ ↓ 0.5% │ │ ↑ 0.3s │ │ ↑ $12.30 │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ Error Breakdown (last 24h): │
│ ├── Rate limit (429): 1.2% ████░░░░░░ │
│ ├── Timeout: 0.8% ███░░░░░░░ │
│ ├── Server error (5xx): 0.4% ██░░░░░░░░ │
│ ├── Parse failure: 0.2% █░░░░░░░░░ │
│ └── Validation failure: 0.1% ░░░░░░░░░░ │
│ │
│ Latency Distribution: │
│ p50: 1.8s p90: 4.2s p95: 6.1s p99: 12.3s │
│ │
│ Token Usage (daily): │
│ Input: 12.4M tokens ($31.00) │
│ Output: 3.2M tokens ($32.00) │
│ Total: 15.6M tokens ($63.00) │
│ │
│ Top Features by Cost: │
│ 1. document-analysis $45.20 (72%) │
│ 2. chatbot $12.80 (20%) │
│ 3. extraction $5.00 (8%) │
└─────────────────────────────────────────────────────────────────────┘
Metrics aggregation
/**
* Aggregate LLM call metrics for dashboard display.
*/
class LlmMetricsAggregator {
constructor() {
this.metrics = {
total: 0,
success: 0,
errors: {},
latencies: [],
tokenUsage: { prompt: 0, completion: 0 },
costCents: 0,
byFeature: {},
byModel: {}
};
this.windowStart = Date.now();
}
record(log) {
this.metrics.total++;
if (!log.error) {
this.metrics.success++;
// Count validation failures separately so alert rules
// (e.g. "Validation Failure Spike" below) can see them
if (log.schemaValidationSuccess === false) {
this.metrics.errors['validation'] = (this.metrics.errors['validation'] || 0) + 1;
}
} else {
const errorType = log.errorType || 'unknown';
this.metrics.errors[errorType] = (this.metrics.errors[errorType] || 0) + 1;
}
if (log.latencyMs) {
this.metrics.latencies.push(log.latencyMs);
}
if (log.promptTokens) {
this.metrics.tokenUsage.prompt += log.promptTokens;
}
if (log.completionTokens) {
this.metrics.tokenUsage.completion += log.completionTokens;
}
if (log.costCents) {
this.metrics.costCents += log.costCents;
}
// Track by feature
if (log.feature) {
if (!this.metrics.byFeature[log.feature]) {
this.metrics.byFeature[log.feature] = { total: 0, errors: 0, costCents: 0 };
}
this.metrics.byFeature[log.feature].total++;
if (log.error) this.metrics.byFeature[log.feature].errors++;
if (log.costCents) this.metrics.byFeature[log.feature].costCents += log.costCents;
}
}
getSummary() {
const sorted = [...this.metrics.latencies].sort((a, b) => a - b);
const percentile = (p) => sorted[Math.floor(sorted.length * p)] || 0;
return {
windowMs: Date.now() - this.windowStart,
total: this.metrics.total,
successRate: this.metrics.total ? (this.metrics.success / this.metrics.total * 100).toFixed(1) + '%' : 'N/A',
errors: this.metrics.errors,
latency: {
p50: percentile(0.5),
p90: percentile(0.9),
p95: percentile(0.95),
p99: percentile(0.99),
avg: sorted.length ? Math.round(sorted.reduce((a, b) => a + b, 0) / sorted.length) : 0
},
tokens: this.metrics.tokenUsage,
costDollars: (this.metrics.costCents / 100).toFixed(2),
byFeature: this.metrics.byFeature
};
}
reset() {
this.metrics = {
total: 0, success: 0, errors: {}, latencies: [],
tokenUsage: { prompt: 0, completion: 0 },
costCents: 0, byFeature: {}, byModel: {}
};
this.windowStart = Date.now();
}
}
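Note the indexing convention in getSummary: percentile(p) takes the element at floor(length × p) of the sorted array, a simple nearest-rank approximation with no interpolation. It can be checked in isolation:

```javascript
// Same nearest-rank percentile used by LlmMetricsAggregator.getSummary.
function percentile(latencies, p) {
  const sorted = [...latencies].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length * p)] || 0;
}

// 100 samples: 1ms, 2ms, ..., 100ms.
const samples = Array.from({ length: 100 }, (_, i) => i + 1);
console.log(percentile(samples, 0.5));  // 51 — index 50 of the sorted array
console.log(percentile(samples, 0.95)); // 96
console.log(percentile([], 0.99));      // 0 — empty window falls back to 0
```

With small windows this slightly overstates high percentiles; for production dashboards a proper quantile estimator (e.g. t-digest, as used by most metrics backends) is worth considering.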
8. Alerting on Error Rates
Set up alerts so you find out about problems before users report them.
/**
* Alert rules for LLM monitoring.
*/
const ALERT_RULES = [
{
name: 'High Error Rate',
condition: (metrics) => {
const errorRate = 1 - (metrics.success / metrics.total);
return metrics.total > 100 && errorRate > 0.05; // >5% error rate
},
severity: 'critical',
message: (metrics) => `LLM error rate is ${((1 - metrics.success / metrics.total) * 100).toFixed(1)}% (threshold: 5%)`
},
{
name: 'High Latency',
condition: (metrics) => {
const sorted = [...metrics.latencies].sort((a, b) => a - b);
const p95 = sorted[Math.floor(sorted.length * 0.95)] || 0;
return p95 > 10000; // p95 > 10 seconds
},
severity: 'warning',
message: () => 'LLM p95 latency exceeded 10 seconds'
},
{
name: 'Cost Spike',
condition: (metrics) => {
const hourlyRate = metrics.costCents / ((Date.now() - metrics.windowStart) / 3600000);
return hourlyRate > 500; // >$5/hour
},
severity: 'warning',
message: (metrics) => {
const hourlyRate = metrics.costCents / ((Date.now() - metrics.windowStart) / 3600000);
return `LLM cost rate: $${(hourlyRate / 100).toFixed(2)}/hour (threshold: $5/hour)`;
}
},
{
name: 'Rate Limit Storm',
condition: (metrics) => {
const rateLimitCount = metrics.errors['rate_limit'] || 0;
return rateLimitCount > 50; // >50 rate limits in window
},
severity: 'critical',
message: (metrics) => `${metrics.errors['rate_limit']} rate limit errors — consider throttling requests`
},
{
name: 'Validation Failure Spike',
condition: (metrics) => {
// If >10% of successful API calls fail validation, the prompt may be broken
const validationFailures = metrics.errors['validation'] || 0;
return metrics.success > 50 && (validationFailures / metrics.success) > 0.10;
},
severity: 'warning',
message: () => 'Schema validation failure rate >10% — check prompt or schema changes'
}
];
/**
* Check all alert rules against current metrics.
*/
function checkAlerts(metrics) {
const triggered = [];
for (const rule of ALERT_RULES) {
if (rule.condition(metrics)) {
triggered.push({
name: rule.name,
severity: rule.severity,
message: rule.message(metrics),
timestamp: new Date().toISOString()
});
}
}
return triggered;
}
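Note that these conditions read raw fields (total, success, latencies, windowStart), so they expect the aggregator's internal metrics object with windowStart attached, not the formatted getSummary() output. A self-contained sketch of wiring one rule through checkAlerts, with the rule copied from ALERT_RULES above:

```javascript
// Minimal re-statement of checkAlerts with a single rule, to show the
// shape of the metrics object the rules expect.
const rules = [{
  name: 'High Error Rate',
  severity: 'critical',
  condition: (m) => m.total > 100 && (1 - m.success / m.total) > 0.05,
  message: (m) => `LLM error rate is ${((1 - m.success / m.total) * 100).toFixed(1)}% (threshold: 5%)`
}];

function checkAlerts(metrics, alertRules) {
  return alertRules
    .filter((rule) => rule.condition(metrics))
    .map((rule) => ({ name: rule.name, severity: rule.severity, message: rule.message(metrics) }));
}

// 180 successes out of 200 calls = 10% error rate, above the 5% threshold.
const triggered = checkAlerts({ total: 200, success: 180 }, rules);
console.log(triggered[0].message); // "LLM error rate is 10.0% (threshold: 5%)"
```

Running a check like this on a timer (say, every minute) and forwarding triggered alerts to PagerDuty or Slack is typically all the glue needed.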
9. Using Logs to Improve Prompts
Logs aren't just for debugging — they're your best tool for systematic prompt improvement.
Identifying problem patterns
/**
* Analyze logs to find prompt improvement opportunities.
*/
function analyzePromptPerformance(logs) {
const analysis = {
totalCalls: logs.length,
parseFailures: [],
validationFailures: [],
slowResponses: [],
expensiveCalls: [],
truncatedResponses: []
};
for (const log of logs) {
if (!log.jsonParseSuccess) {
analysis.parseFailures.push({
requestId: log.requestId,
feature: log.feature,
promptVersion: log.promptVersion,
outputLength: log.outputLength
});
}
if (!log.schemaValidationSuccess && log.jsonParseSuccess) {
analysis.validationFailures.push({
requestId: log.requestId,
feature: log.feature,
promptVersion: log.promptVersion,
errors: log.validationErrors
});
}
if (log.latencyMs > 10000) {
analysis.slowResponses.push({
requestId: log.requestId,
feature: log.feature,
latencyMs: log.latencyMs,
promptTokens: log.promptTokens
});
}
if (log.finishReason === 'length') {
analysis.truncatedResponses.push({
requestId: log.requestId,
feature: log.feature,
completionTokens: log.completionTokens
});
}
}
// Generate recommendations
const recommendations = [];
const parseFailRate = analysis.parseFailures.length / logs.length;
if (parseFailRate > 0.02) {
recommendations.push(
`JSON parse failure rate is ${(parseFailRate * 100).toFixed(1)}%. ` +
`Consider using response_format: json_object or strengthening JSON instructions.`
);
}
if (analysis.truncatedResponses.length > 0) {
recommendations.push(
`${analysis.truncatedResponses.length} responses were truncated. ` +
`Increase max_tokens or simplify the expected output format.`
);
}
// Group validation errors by field
const fieldErrors = {};
for (const failure of analysis.validationFailures) {
for (const error of (failure.errors || [])) {
const field = error.path?.join('.') || 'unknown';
fieldErrors[field] = (fieldErrors[field] || 0) + 1;
}
}
if (Object.keys(fieldErrors).length > 0) {
const topFields = Object.entries(fieldErrors)
.sort((a, b) => b[1] - a[1])
.slice(0, 5);
recommendations.push(
`Most common validation failures by field: ` +
topFields.map(([field, count]) => `${field} (${count})`).join(', ') +
`. Add explicit examples of these fields in the prompt.`
);
}
return { ...analysis, recommendations };
}
A/B testing prompts with logs
/**
* Compare performance of two prompt versions using logged data.
*/
function comparePromptVersions(logs, versionA, versionB) {
const a = logs.filter(l => l.promptVersion === versionA);
const b = logs.filter(l => l.promptVersion === versionB);
const analyze = (subset) => ({
count: subset.length,
successRate: subset.filter(l => !l.error && l.schemaValidationSuccess).length / subset.length,
avgLatencyMs: subset.reduce((sum, l) => sum + (l.latencyMs || 0), 0) / subset.length,
avgTokens: subset.reduce((sum, l) => sum + (l.totalTokens || 0), 0) / subset.length,
avgCostCents: subset.reduce((sum, l) => sum + (l.costCents || 0), 0) / subset.length,
parseFailRate: subset.filter(l => !l.jsonParseSuccess).length / subset.length,
truncationRate: subset.filter(l => l.finishReason === 'length').length / subset.length
});
return {
[versionA]: analyze(a),
[versionB]: analyze(b),
recommendation: null // Add statistical significance test in production
};
}
// Usage
const comparison = comparePromptVersions(recentLogs, 'v2.3', 'v2.4');
console.table(comparison);
/*
┌──────────────────┬─────────┬─────────┐
│ │ v2.3 │ v2.4 │
├──────────────────┼─────────┼─────────┤
│ count │ 5,420 │ 5,380 │
│ successRate │ 96.2% │ 98.7% │
│ avgLatencyMs │ 2,340 │ 2,180 │
│ avgTokens │ 1,590 │ 1,420 │
│ avgCostCents │ 0.59 │ 0.51 │
│ parseFailRate │ 1.8% │ 0.3% │
│ truncationRate │ 0.5% │ 0.1% │
└──────────────────┴─────────┴─────────┘
Winner: v2.4 (better on all metrics)
*/
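The recommendation field above is left null because a raw rate difference can be sampling noise. A standard check for two success rates is a two-proportion z-test; this sketch plugs in the illustrative numbers from the table, with success counts rounded from the percentages:

```javascript
// Two-proportion z-test: is the success-rate difference between two
// prompt versions larger than sampling noise would explain?
function twoProportionZ(successA, totalA, successB, totalB) {
  const pA = successA / totalA;
  const pB = successB / totalB;
  const pooled = (successA + successB) / (totalA + totalB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / totalA + 1 / totalB));
  return (pB - pA) / se;
}

// v2.3: ~96.2% of 5,420 calls; v2.4: ~98.7% of 5,380 (from the table above).
const z = twoProportionZ(5214, 5420, 5310, 5380);
// |z| > 1.96 means significant at the 95% level; here z comes out around 8,
// far outside what noise would produce at this sample size.
console.log(z > 1.96); // true
```

With only a few hundred requests per version the same 2.5-point gap could easily be noise, which is why logging promptVersion on every call matters: it is what makes sample sizes like these available for free.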
10. Key Takeaways
- Log every LLM call with full metadata — request ID, model version, temperature, token counts, latency, finish reason, cost, and error details. Without this, debugging non-deterministic AI is impossible.
- Separate metrics from content — always log metrics (tokens, latency, cost); log full content only when needed and only with PII filtering.
- Build dashboards around success rate, latency, cost, and error breakdown — these four metrics tell you the health of your AI system at a glance.
- Set alerts for error rate spikes, latency increases, and cost anomalies — don't wait for users to report problems when your monitoring can catch them in minutes.
- Use logs to improve prompts systematically — parse failure rates, validation error patterns, and A/B comparisons replace guesswork with data-driven prompt optimization.
Explain-It Challenge
- A new developer asks "why can't I just use console.log for LLM debugging?" Explain three specific scenarios where structured logging catches problems that console.log would miss.
- Your company processes medical records through an LLM. A compliance officer asks "what is logged and who can access it?" Design a logging policy that satisfies both debugging needs and HIPAA requirements.
- Your prompt v2.4 has a 98.7% success rate compared to v2.3's 96.2%. Both ran on ~5,400 requests. Is this difference statistically significant, or could it be noise? How would you determine this?
Navigation: ← 4.10.c — Retry Mechanisms · ← 4.10 Overview