Episode 4 — Generative AI Engineering / 4.14 — Evaluating AI Systems
4.14.d — Observability and Monitoring
In one sentence: AI observability means logging every LLM call, tracing multi-step pipelines, tracking key metrics (latency, tokens, errors, hallucination rate, cost), and alerting when quality degrades — because an AI system without monitoring is a black box that fails silently.
Navigation: ← 4.14.c — Evaluating Retrieval Quality · 4.14 Overview
1. What AI Observability Means
Traditional observability has three pillars: logs, metrics, and traces. AI observability adds a fourth: evaluation.
┌──────────────────────────────────────────────────────────────────┐
│ THE FOUR PILLARS OF AI OBSERVABILITY │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ LOGS │ │ METRICS │ │ TRACES │ │ EVALUATION │ │
│ │ │ │ │ │ │ │ │ │
│ │ Every │ │ Latency │ │ Full │ │ Hallucination│ │
│ │ LLM call │ │ Token $ │ │ pipeline │ │ rate, │ │
│ │ input/ │ │ Error % │ │ step-by- │ │ confidence, │ │
│ │ output/ │ │ Through- │ │ step │ │ retrieval │ │
│ │ params │ │ put │ │ timing │ │ quality │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────────┘ │
│ │
│ Traditional ──────────────────────────────► AI-specific │
│ │
│ KEY INSIGHT: Traditional monitoring tells you IF the system │
│ is running. AI evaluation tells you if it's running CORRECTLY. │
└──────────────────────────────────────────────────────────────────┘
Why AI systems need special monitoring
| Traditional API | AI System |
|---|---|
| Returns correct data or throws an error | Returns plausible-looking data that might be wrong |
| Fails loudly (500 error) | Fails silently (confident-sounding hallucination) |
| Deterministic: same input = same output | Probabilistic: same input can produce different outputs |
| Fixed cost per call | Variable cost (depends on token count) |
| Latency is predictable | Latency varies with output length |
| Testing is straightforward | Testing requires evaluation suites |
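The last row is the crux: a traditional API test can assert exact equality, but an AI output needs a scoring function. A minimal sketch of the difference (the `scoreAnswer` helper and its fact list are illustrative, not a real evaluation library):

```javascript
// Traditional test: exact string equality. Breaks on any paraphrase.
function traditionalTest(output) {
  return output === 'The capital of France is Paris.';
}

// Eval-style test: score the output against required facts instead.
// Returns the fraction of expected facts that appear in the answer.
function scoreAnswer(answer, requiredFacts) {
  const lower = answer.toLowerCase();
  const hits = requiredFacts.filter(f => lower.includes(f.toLowerCase()));
  return hits.length / requiredFacts.length;
}

const answer = 'Paris is the capital city of France.';
console.log(traditionalTest(answer));                  // false: phrasing differs
console.log(scoreAnswer(answer, ['Paris', 'France'])); // 1: both facts present
```

Real evaluation suites replace the substring check with embedding similarity or an LLM judge, but the shape is the same: a score, not a boolean.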
2. Key Metrics to Track
The essential AI metrics dashboard
class AIMetricsCollector {
constructor() {
this.metrics = [];
}
recordCall(data) {
this.metrics.push({
timestamp: new Date(),
requestId: data.requestId,
// Performance metrics
latencyMs: data.latencyMs,
timeToFirstToken: data.timeToFirstToken,
tokensPerSecond: data.outputTokens / (data.latencyMs / 1000),
// Cost metrics
inputTokens: data.inputTokens,
outputTokens: data.outputTokens,
totalTokens: data.inputTokens + data.outputTokens,
estimatedCost: this.estimateCost(data),
// Quality metrics
confidence: data.confidence,
hallucinationScore: data.hallucinationScore,
retrievalRelevance: data.retrievalRelevance,
// Operational metrics
model: data.model,
promptVersion: data.promptVersion,
statusCode: data.statusCode,
isError: data.statusCode >= 400,
errorType: data.errorType || null,
// User metrics
userFeedback: data.userFeedback || null, // 'thumbs_up' | 'thumbs_down' | null
wasEdited: data.wasEdited || false
});
}
estimateCost(data) {
    // Pricing per 1M tokens (illustrative example rates; verify against current provider pricing)
const pricing = {
'gpt-4o': { input: 2.50, output: 10.00 },
'gpt-4o-mini': { input: 0.15, output: 0.60 },
'gpt-4.1': { input: 2.00, output: 8.00 },
'gpt-4.1-mini': { input: 0.40, output: 1.60 }
};
const rates = pricing[data.model] || pricing['gpt-4o'];
return (
(data.inputTokens / 1_000_000) * rates.input +
(data.outputTokens / 1_000_000) * rates.output
);
}
// Generate metrics summary for a time window
getSummary(windowMinutes = 60) {
const cutoff = Date.now() - windowMinutes * 60 * 1000;
const recent = this.metrics.filter(m => m.timestamp >= cutoff);
if (recent.length === 0) return { noData: true };
const avg = (arr) => arr.length > 0
? arr.reduce((a, b) => a + b, 0) / arr.length
: 0;
const p95 = (arr) => {
const sorted = [...arr].sort((a, b) => a - b);
return sorted[Math.floor(sorted.length * 0.95)] || 0;
};
return {
window: `${windowMinutes} minutes`,
totalCalls: recent.length,
callsPerMinute: recent.length / windowMinutes,
// Latency
avgLatencyMs: avg(recent.map(m => m.latencyMs)).toFixed(0),
p95LatencyMs: p95(recent.map(m => m.latencyMs)).toFixed(0),
avgTimeToFirstToken: avg(recent.map(m => m.timeToFirstToken)).toFixed(0),
// Tokens & cost
avgInputTokens: avg(recent.map(m => m.inputTokens)).toFixed(0),
avgOutputTokens: avg(recent.map(m => m.outputTokens)).toFixed(0),
totalCost: recent.reduce((sum, m) => sum + m.estimatedCost, 0).toFixed(4),
avgCostPerCall: avg(recent.map(m => m.estimatedCost)).toFixed(6),
// Errors
errorRate: (recent.filter(m => m.isError).length / recent.length).toFixed(4),
errorsByType: this.groupBy(recent.filter(m => m.isError), 'errorType'),
// Quality
avgConfidence: avg(recent.filter(m => m.confidence != null).map(m => m.confidence)).toFixed(3),
avgHallucinationScore: avg(recent.filter(m => m.hallucinationScore != null).map(m => m.hallucinationScore)).toFixed(3),
avgRetrievalRelevance: avg(recent.filter(m => m.retrievalRelevance != null).map(m => m.retrievalRelevance)).toFixed(3),
// User satisfaction
feedbackCount: recent.filter(m => m.userFeedback != null).length,
thumbsUpRate: recent.filter(m => m.userFeedback === 'thumbs_up').length /
Math.max(recent.filter(m => m.userFeedback != null).length, 1),
editRate: recent.filter(m => m.wasEdited).length / recent.length
};
}
groupBy(items, key) {
return items.reduce((groups, item) => {
const val = item[key] || 'unknown';
groups[val] = (groups[val] || 0) + 1;
return groups;
}, {});
}
}
Metric categories and their targets
| Category | Metric | What It Measures | Typical Target |
|---|---|---|---|
| Latency | Avg response time | User experience | < 2s for chat, < 10s for complex tasks |
| Latency | P95 response time | Worst-case experience | < 5s for chat |
| Latency | Time to first token | Perceived responsiveness | < 500ms |
| Cost | Cost per call | Budget management | Depends on use case |
| Cost | Daily/monthly spend | Budget tracking | Within budget |
| Cost | Tokens per call | Efficiency | Trending stable or down |
| Errors | Error rate | System reliability | < 1% |
| Errors | Rate limit hits | Capacity planning | < 0.1% |
| Quality | Hallucination rate | Answer reliability | < 5% (varies by domain) |
| Quality | Confidence score avg | System certainty | > 0.75 |
| Quality | Retrieval relevance | RAG effectiveness | > 0.70 |
| User | Thumbs up rate | User satisfaction | > 80% |
| User | Edit rate | Answer usability | < 15% |
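The targets in this table can be encoded directly as a threshold check over a metrics summary. A minimal sketch (field names follow the `getSummary()` output above; the thresholds are the table's illustrative defaults, not universal standards):

```javascript
// Illustrative targets keyed by summary field; max/min marks the healthy side.
const targets = [
  { field: 'p95LatencyMs',          max: 5000 },
  { field: 'errorRate',             max: 0.01 },
  { field: 'avgHallucinationScore', max: 0.05 },
  { field: 'avgConfidence',         min: 0.75 },
  { field: 'avgRetrievalRelevance', min: 0.70 }
];

// Returns the list of fields that violate their target.
function checkTargets(summary) {
  const violations = [];
  for (const t of targets) {
    const value = parseFloat(summary[t.field]);
    if (Number.isNaN(value)) continue; // metric not collected this window
    if (t.max != null && value > t.max) violations.push(`${t.field}=${value} > ${t.max}`);
    if (t.min != null && value < t.min) violations.push(`${t.field}=${value} < ${t.min}`);
  }
  return violations;
}

console.log(checkTargets({
  p95LatencyMs: '6200', errorRate: '0.0040',
  avgHallucinationScore: '0.020', avgConfidence: '0.810'
}));
// → [ 'p95LatencyMs=6200 > 5000' ]
```

A check like this is the bridge between a dashboard and the alerting rules in section 6.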
3. Logging: The Foundation of Observability
Every AI call should be logged with full context for debugging and evaluation.
import { randomUUID } from 'crypto';
class AILogger {
constructor(storage) {
this.storage = storage; // Could be file, database, external service
}
// Wrap any LLM call with logging
async loggedCall(callFn, metadata = {}) {
const requestId = randomUUID();
const startTime = Date.now();
const logEntry = {
requestId,
timestamp: new Date().toISOString(),
...metadata,
status: 'pending'
};
try {
// Execute the LLM call
const result = await callFn();
const endTime = Date.now();
// Complete the log entry
Object.assign(logEntry, {
status: 'success',
latencyMs: endTime - startTime,
// Input details
model: metadata.model,
messages: metadata.messages,
temperature: metadata.temperature,
promptVersion: metadata.promptVersion,
// Output details
response: result.choices?.[0]?.message?.content,
finishReason: result.choices?.[0]?.finish_reason,
// Token usage
inputTokens: result.usage?.prompt_tokens,
outputTokens: result.usage?.completion_tokens,
totalTokens: result.usage?.total_tokens,
// Request metadata
systemFingerprint: result.system_fingerprint
});
await this.storage.write(logEntry);
return { result, requestId };
} catch (error) {
Object.assign(logEntry, {
status: 'error',
latencyMs: Date.now() - startTime,
error: {
name: error.name,
message: error.message,
code: error.code || error.status,
type: error.type
}
});
await this.storage.write(logEntry);
throw error; // Re-throw so the caller handles it
}
}
}
// Example storage implementations
class ConsoleStorage {
async write(entry) {
console.log(JSON.stringify(entry));
}
}
class FileStorage {
constructor(filePath) {
this.filePath = filePath;
}
async write(entry) {
const fs = await import('fs/promises');
await fs.appendFile(this.filePath, JSON.stringify(entry) + '\n');
}
}
// Usage
import OpenAI from 'openai';
const openai = new OpenAI();
const logger = new AILogger(new FileStorage('./ai-logs.jsonl'));
const { result, requestId } = await logger.loggedCall(
() => openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0,
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'What is the capital of France?' }
]
}),
{
model: 'gpt-4o',
temperature: 0,
promptVersion: 'v2.1',
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'What is the capital of France?' }
]
}
);
console.log(`Logged as ${requestId}: ${result.choices[0].message.content}`);
What to log (and what NOT to log)
| Always Log | Never Log |
|---|---|
| Request ID, timestamp | Raw user PII (unless required & encrypted) |
| Model, temperature, params | API keys or secrets |
| Input/output token counts | Full conversation history of other users |
| Latency | Internal IP addresses |
| Error details | |
| Prompt version | |
| Finish reason | |
| Confidence scores | |
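One way to enforce the "Never Log" column is to redact entries before they reach storage. A minimal sketch (the key patterns and email masking are illustrative; production systems typically need a dedicated PII-detection step):

```javascript
// Field names that must never reach the log store (illustrative patterns).
const SENSITIVE_KEY = /(api[_-]?key|secret|token|authorization|password)/i;
const EMAIL = /[\w.+-]+@[\w-]+\.[\w.]+/g;

// Recursively redact sensitive keys and mask email-shaped strings.
function redact(value) {
  if (typeof value === 'string') return value.replace(EMAIL, '[redacted-email]');
  if (Array.isArray(value)) return value.map(redact);
  if (value && typeof value === 'object') {
    const out = {};
    for (const [k, v] of Object.entries(value)) {
      out[k] = SENSITIVE_KEY.test(k) ? '[redacted]' : redact(v);
    }
    return out;
  }
  return value;
}

const entry = {
  requestId: 'abc-123',
  apiKey: 'sk-live-xyz',
  messages: [{ role: 'user', content: 'Email me at jane@example.com' }]
};
console.log(JSON.stringify(redact(entry)));
// apiKey becomes "[redacted]"; the email becomes "[redacted-email]"
```

Wiring `redact()` into a storage class's `write()` keeps the policy in one place instead of trusting every call site.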
4. Tracing Multi-Step AI Pipelines
A RAG pipeline or agent involves multiple steps. Tracing ties every step of a request together so you can debug it end-to-end.
import { randomUUID } from 'crypto';
class AITracer {
constructor() {
this.traces = new Map();
}
startTrace(traceId, metadata = {}) {
const trace = {
traceId,
startTime: Date.now(),
metadata,
spans: [],
status: 'in_progress'
};
this.traces.set(traceId, trace);
return trace;
}
startSpan(traceId, spanName, metadata = {}) {
const trace = this.traces.get(traceId);
if (!trace) throw new Error(`Trace ${traceId} not found`);
const span = {
spanId: randomUUID(),
spanName,
startTime: Date.now(),
metadata,
status: 'in_progress'
};
trace.spans.push(span);
return span;
}
endSpan(traceId, spanId, result = {}) {
const trace = this.traces.get(traceId);
const span = trace?.spans.find(s => s.spanId === spanId);
if (!span) return;
span.endTime = Date.now();
span.durationMs = span.endTime - span.startTime;
span.status = 'completed';
span.result = result;
}
endTrace(traceId, result = {}) {
const trace = this.traces.get(traceId);
if (!trace) return;
trace.endTime = Date.now();
trace.totalDurationMs = trace.endTime - trace.startTime;
trace.status = 'completed';
trace.result = result;
return trace;
}
// Get a visual summary of the trace
getTraceSummary(traceId) {
const trace = this.traces.get(traceId);
if (!trace) return null;
const lines = [
`Trace: ${trace.traceId} (${trace.totalDurationMs || '?'}ms total)`,
`Status: ${trace.status}`,
'',
'Steps:'
];
for (const span of trace.spans) {
const pct = trace.totalDurationMs
? ((span.durationMs / trace.totalDurationMs) * 100).toFixed(1)
: '?';
lines.push(
` ${span.spanName}: ${span.durationMs || '?'}ms (${pct}%) — ${span.status}`
);
}
return lines.join('\n');
}
}
// Usage: Trace a full RAG pipeline
import { randomUUID } from 'crypto';
const tracer = new AITracer();
async function tracedRAGPipeline(query) {
const traceId = randomUUID();
tracer.startTrace(traceId, { query });
// Step 1: Embed the query
const embedSpan = tracer.startSpan(traceId, 'embed_query');
const embedding = await embedQuery(query);
tracer.endSpan(traceId, embedSpan.spanId, {
dimensions: embedding.length
});
// Step 2: Retrieve documents
const retrieveSpan = tracer.startSpan(traceId, 'retrieve_documents');
const docs = await vectorDB.search(embedding, { topK: 5 });
tracer.endSpan(traceId, retrieveSpan.spanId, {
docsRetrieved: docs.length,
topRelevance: docs[0]?.score
});
// Step 3: Build prompt
const promptSpan = tracer.startSpan(traceId, 'build_prompt');
const prompt = buildRAGPrompt(query, docs);
tracer.endSpan(traceId, promptSpan.spanId, {
promptTokens: estimateTokens(prompt)
});
// Step 4: Generate answer
const generateSpan = tracer.startSpan(traceId, 'generate_answer');
const answer = await openai.chat.completions.create({
model: 'gpt-4o',
messages: prompt,
temperature: 0
});
tracer.endSpan(traceId, generateSpan.spanId, {
outputTokens: answer.usage?.completion_tokens,
finishReason: answer.choices[0]?.finish_reason
});
// Step 5: Hallucination check
const checkSpan = tracer.startSpan(traceId, 'hallucination_check');
const hallCheck = await hallucinationDetector.check(
answer.choices[0].message.content,
docs
);
tracer.endSpan(traceId, checkSpan.spanId, {
hallucinationScore: hallCheck.score,
flaggedClaims: hallCheck.flaggedClaims?.length || 0
});
// Complete the trace
const trace = tracer.endTrace(traceId, {
answer: answer.choices[0].message.content,
confidence: hallCheck.score < 0.3 ? 'high' : 'low'
});
console.log(tracer.getTraceSummary(traceId));
/*
Trace: abc-123 (2340ms total)
Status: completed
Steps:
embed_query: 120ms (5.1%) — completed
retrieve_documents: 230ms (9.8%) — completed
build_prompt: 15ms (0.6%) — completed
generate_answer: 1450ms (62.0%) — completed
hallucination_check: 525ms (22.4%) — completed
*/
return { trace, answer: answer.choices[0].message.content };
}
5. Tools for AI Observability
Overview of popular tools
| Tool | Best For | Key Feature |
|---|---|---|
| LangSmith | LangChain-based apps | Deep chain tracing, eval suites |
| Weights & Biases | ML experiment tracking | Prompt versioning, eval dashboards |
| Helicone | API-level monitoring | Drop-in proxy, cost tracking |
| Langfuse | Open-source LLM observability | Self-hostable, traces, evals |
| Braintrust | Evaluation & logging | Eval-first approach, logging |
| Custom dashboards | Full control | Exactly what you need, nothing else |
Helicone: Drop-in API proxy
import OpenAI from 'openai';
// Helicone acts as a proxy — just change the base URL
const openai = new OpenAI({
baseURL: 'https://oai.helicone.ai/v1',
defaultHeaders: {
'Helicone-Auth': `Bearer ${process.env.HELICONE_API_KEY}`,
// Custom properties for filtering in the dashboard
'Helicone-Property-Environment': 'production',
'Helicone-Property-Feature': 'document-qa',
'Helicone-Property-PromptVersion': 'v2.1'
}
});
// Use OpenAI as normal — Helicone logs everything automatically
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: 'What is the refund policy?' }]
});
// Helicone dashboard now shows:
// - Request/response logs
// - Latency distribution
// - Token usage over time
// - Cost per feature/environment
// - Error rates
Langfuse: Open-source tracing
import { Langfuse } from 'langfuse';
const langfuse = new Langfuse({
publicKey: process.env.LANGFUSE_PUBLIC_KEY,
secretKey: process.env.LANGFUSE_SECRET_KEY,
baseUrl: process.env.LANGFUSE_BASE_URL // Self-hosted or cloud
});
async function tracedQuery(query) {
// Create a trace
const trace = langfuse.trace({
name: 'rag-query',
metadata: { query }
});
// Span for retrieval
const retrievalSpan = trace.span({ name: 'retrieval' });
const docs = await retrieveDocuments(query);
retrievalSpan.end({ output: { docCount: docs.length } });
// Generation span
const generationSpan = trace.generation({
name: 'llm-generation',
model: 'gpt-4o',
input: [{ role: 'user', content: query }]
});
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: query }]
});
generationSpan.end({
output: response.choices[0].message.content,
usage: {
promptTokens: response.usage.prompt_tokens,
completionTokens: response.usage.completion_tokens
}
});
// Score the trace
trace.score({
name: 'confidence',
value: 0.85
});
// Flush at the end
await langfuse.flushAsync();
}
Custom lightweight monitoring
// When external tools are overkill, build a simple monitor
class SimpleAIMonitor {
constructor() {
this.metrics = {
calls: 0,
errors: 0,
totalLatencyMs: 0,
totalInputTokens: 0,
totalOutputTokens: 0,
totalCost: 0,
feedbackPositive: 0,
feedbackNegative: 0,
confidenceSum: 0,
confidenceCount: 0,
startTime: Date.now()
};
this.recentErrors = [];
}
record(data) {
this.metrics.calls++;
this.metrics.totalLatencyMs += data.latencyMs || 0;
this.metrics.totalInputTokens += data.inputTokens || 0;
this.metrics.totalOutputTokens += data.outputTokens || 0;
this.metrics.totalCost += data.cost || 0;
if (data.isError) {
this.metrics.errors++;
this.recentErrors.push({
timestamp: new Date(),
error: data.errorMessage
});
// Keep only last 100 errors
if (this.recentErrors.length > 100) this.recentErrors.shift();
}
if (data.feedback === 'positive') this.metrics.feedbackPositive++;
if (data.feedback === 'negative') this.metrics.feedbackNegative++;
if (data.confidence != null) {
this.metrics.confidenceSum += data.confidence;
this.metrics.confidenceCount++;
}
}
getSnapshot() {
const m = this.metrics;
const uptimeMinutes = (Date.now() - m.startTime) / 60000;
return {
uptime: `${uptimeMinutes.toFixed(0)} minutes`,
totalCalls: m.calls,
callsPerMinute: (m.calls / uptimeMinutes).toFixed(2),
errorRate: m.calls > 0 ? (m.errors / m.calls * 100).toFixed(2) + '%' : '0%',
avgLatencyMs: m.calls > 0 ? (m.totalLatencyMs / m.calls).toFixed(0) : 0,
totalCost: `$${m.totalCost.toFixed(4)}`,
avgCostPerCall: m.calls > 0 ? `$${(m.totalCost / m.calls).toFixed(6)}` : '$0',
avgConfidence: m.confidenceCount > 0
? (m.confidenceSum / m.confidenceCount).toFixed(3)
: 'N/A',
satisfactionRate: (m.feedbackPositive + m.feedbackNegative) > 0
? (m.feedbackPositive / (m.feedbackPositive + m.feedbackNegative) * 100).toFixed(1) + '%'
: 'No feedback yet',
recentErrors: this.recentErrors.slice(-5)
};
}
}
6. Alerting on Quality Degradation
Monitoring without alerting is a dashboard no one watches. Set up alerts for critical thresholds.
class AIAlertManager {
constructor(notifier) {
this.notifier = notifier; // Slack, PagerDuty, email, etc.
this.rules = [];
this.cooldowns = new Map(); // Prevent alert spam
}
addRule(rule) {
this.rules.push({
name: rule.name,
condition: rule.condition, // Function that returns true if alert should fire
severity: rule.severity, // 'critical' | 'warning' | 'info'
cooldownMinutes: rule.cooldownMinutes || 15,
message: rule.message // Function that returns alert message
});
}
async evaluate(metrics) {
for (const rule of this.rules) {
if (rule.condition(metrics)) {
// Check cooldown
const lastFired = this.cooldowns.get(rule.name);
const now = Date.now();
if (lastFired && (now - lastFired) < rule.cooldownMinutes * 60 * 1000) {
continue; // Still in cooldown
}
// Fire alert
const message = rule.message(metrics);
await this.notifier.send({
severity: rule.severity,
rule: rule.name,
message,
metrics,
timestamp: new Date().toISOString()
});
this.cooldowns.set(rule.name, now);
}
}
}
}
// Configure alerts. Note: the rules below reference fields from both
// AIMetricsCollector.getSummary() and SimpleAIMonitor.getSnapshot();
// match each rule to whichever metrics source you pass into evaluate().
const alertManager = new AIAlertManager({
send: async (alert) => {
console.log(`[${alert.severity.toUpperCase()}] ${alert.rule}: ${alert.message}`);
// In production: send to Slack, PagerDuty, etc.
}
});
// Error rate alert
alertManager.addRule({
name: 'high_error_rate',
severity: 'critical',
cooldownMinutes: 10,
condition: (m) => parseFloat(m.errorRate) > 5,
message: (m) => `Error rate is ${m.errorRate} (threshold: 5%). ` +
`${m.totalCalls} calls in window.`
});
// Latency alert
alertManager.addRule({
name: 'high_latency',
severity: 'warning',
cooldownMinutes: 15,
condition: (m) => parseInt(m.p95LatencyMs) > 5000,
message: (m) => `P95 latency is ${m.p95LatencyMs}ms (threshold: 5000ms). ` +
`Users are experiencing slow responses.`
});
// Hallucination rate alert
alertManager.addRule({
name: 'hallucination_spike',
severity: 'critical',
cooldownMinutes: 30,
condition: (m) => parseFloat(m.avgHallucinationScore) > 0.15,
message: (m) => `Average hallucination score is ${m.avgHallucinationScore} ` +
`(threshold: 0.15). Check prompt changes, model updates, or data drift.`
});
// Cost alert
alertManager.addRule({
name: 'cost_spike',
severity: 'warning',
cooldownMinutes: 60,
condition: (m) => parseFloat(m.totalCost?.replace('$', '')) > 100,
message: (m) => `Hourly cost is ${m.totalCost} (threshold: $100). ` +
`Check for runaway loops or token-heavy prompts.`
});
// Confidence drop alert
alertManager.addRule({
name: 'low_confidence',
severity: 'warning',
cooldownMinutes: 30,
condition: (m) => m.avgConfidence !== 'N/A' && parseFloat(m.avgConfidence) < 0.6,
message: (m) => `Average confidence dropped to ${m.avgConfidence} ` +
`(threshold: 0.6). May indicate retrieval quality issues.`
});
// User satisfaction alert
alertManager.addRule({
name: 'low_satisfaction',
severity: 'warning',
cooldownMinutes: 60,
condition: (m) => {
const rate = parseFloat(m.satisfactionRate);
return !isNaN(rate) && rate < 70;
},
message: (m) => `User satisfaction is ${m.satisfactionRate} ` +
`(threshold: 70%). Investigate recent changes.`
});
// Run evaluation periodically
// const metrics = monitor.getSnapshot();
// await alertManager.evaluate(metrics);
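To close the loop, the commented-out evaluation above typically runs on a timer. A compact self-contained sketch of that wiring (the stub monitor and notifier stand in for the real instances above, and the rule loop is inlined for brevity):

```javascript
// Stub metrics source and in-memory notifier, for illustration only.
const monitor = { getSnapshot: () => ({ errorRate: '7.20%', totalCalls: 50 }) };
const sent = [];
const notifier = { send: (alert) => { sent.push(alert); } }; // real one posts to Slack, PagerDuty, etc.

const rules = [{
  name: 'high_error_rate',
  severity: 'critical',
  condition: (m) => parseFloat(m.errorRate) > 5,
  message: (m) => `Error rate ${m.errorRate} exceeds 5% over ${m.totalCalls} calls`
}];

// One evaluation tick: pull a snapshot, fire any matching rules.
function evaluateOnce() {
  const metrics = monitor.getSnapshot();
  for (const rule of rules) {
    if (rule.condition(metrics)) {
      notifier.send({ rule: rule.name, severity: rule.severity, message: rule.message(metrics) });
    }
  }
}

// In production: setInterval(evaluateOnce, 60_000) or a scheduled job.
evaluateOnce();
console.log(sent[0].message); // "Error rate 7.20% exceeds 5% over 50 calls"
```

The interval length is a tradeoff: short windows catch incidents fast but are noisy on low traffic; a 5-15 minute window is a common starting point.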
7. A/B Testing Prompts and Models in Production
Test prompt and model changes on real traffic before rolling out to everyone.
class AIABTester {
constructor() {
this.experiments = new Map();
}
createExperiment(config) {
const experiment = {
id: config.id,
name: config.name,
variants: config.variants, // [{ name, weight, config }]
metrics: new Map(), // variant -> [metric records]
startTime: Date.now(),
status: 'running'
};
// Normalize weights
const totalWeight = experiment.variants.reduce((s, v) => s + v.weight, 0);
experiment.variants = experiment.variants.map(v => ({
...v,
normalizedWeight: v.weight / totalWeight
}));
for (const variant of experiment.variants) {
experiment.metrics.set(variant.name, []);
}
this.experiments.set(config.id, experiment);
return experiment;
}
// Assign a request to a variant (deterministic by userId for consistency)
assignVariant(experimentId, userId) {
const experiment = this.experiments.get(experimentId);
if (!experiment || experiment.status !== 'running') return null;
// Simple hash-based assignment for consistency
const hash = this.simpleHash(userId + experimentId);
const normalized = (hash % 1000) / 1000; // 0-1
let cumulative = 0;
for (const variant of experiment.variants) {
cumulative += variant.normalizedWeight;
if (normalized < cumulative) {
return variant;
}
}
return experiment.variants[experiment.variants.length - 1];
}
recordMetric(experimentId, variantName, metric) {
const experiment = this.experiments.get(experimentId);
if (!experiment) return;
const variantMetrics = experiment.metrics.get(variantName);
if (variantMetrics) {
variantMetrics.push({
timestamp: Date.now(),
...metric
});
}
}
getResults(experimentId) {
const experiment = this.experiments.get(experimentId);
if (!experiment) return null;
const results = {};
for (const variant of experiment.variants) {
const metrics = experiment.metrics.get(variant.name) || [];
if (metrics.length === 0) {
results[variant.name] = { sampleSize: 0 };
continue;
}
const avg = (arr) => arr.reduce((a, b) => a + b, 0) / arr.length;
results[variant.name] = {
sampleSize: metrics.length,
avgLatencyMs: avg(metrics.map(m => m.latencyMs)).toFixed(0),
        avgConfidence: avg(metrics.map(m => m.confidence).filter(v => v != null)).toFixed(3),
        avgCost: avg(metrics.map(m => m.cost).filter(v => v != null)).toFixed(6),
errorRate: (metrics.filter(m => m.isError).length / metrics.length).toFixed(4),
satisfactionRate: metrics.filter(m => m.feedback === 'positive').length /
Math.max(metrics.filter(m => m.feedback != null).length, 1)
};
}
return {
experimentId,
name: experiment.name,
duration: `${((Date.now() - experiment.startTime) / 3600000).toFixed(1)} hours`,
results
};
}
simpleHash(str) {
let hash = 0;
for (let i = 0; i < str.length; i++) {
hash = ((hash << 5) - hash) + str.charCodeAt(i);
hash = hash & hash; // Convert to 32-bit integer
}
return Math.abs(hash);
}
}
// Usage
const abTester = new AIABTester();
abTester.createExperiment({
id: 'prompt-v3-test',
name: 'Test new system prompt v3 vs v2',
variants: [
{
name: 'control',
weight: 50,
config: {
systemPrompt: 'You are a helpful customer service assistant...',
model: 'gpt-4o'
}
},
{
name: 'treatment',
weight: 50,
config: {
systemPrompt: 'You are an expert customer service agent. Always cite sources...',
model: 'gpt-4o'
}
}
]
});
// For each request:
const variant = abTester.assignVariant('prompt-v3-test', 'user-123');
// Use variant.config to make the LLM call
// Then record the metric:
abTester.recordMetric('prompt-v3-test', variant.name, {
latencyMs: 1200,
confidence: 0.88,
cost: 0.003,
feedback: 'positive'
});
// After sufficient data:
const results = abTester.getResults('prompt-v3-test');
console.log(results);
8. Cost Monitoring and Optimization
AI systems can become expensive fast. Track and optimize costs systematically.
class AICostMonitor {
constructor() {
this.daily = new Map(); // date -> cost breakdown
}
record(data) {
const date = new Date().toISOString().split('T')[0]; // YYYY-MM-DD
if (!this.daily.has(date)) {
this.daily.set(date, {
totalCost: 0,
byModel: {},
byFeature: {},
totalInputTokens: 0,
totalOutputTokens: 0,
callCount: 0
});
}
const day = this.daily.get(date);
day.totalCost += data.cost;
day.totalInputTokens += data.inputTokens;
day.totalOutputTokens += data.outputTokens;
day.callCount++;
// By model
day.byModel[data.model] = (day.byModel[data.model] || 0) + data.cost;
// By feature
if (data.feature) {
day.byFeature[data.feature] = (day.byFeature[data.feature] || 0) + data.cost;
}
}
getDailyReport(date) {
const day = this.daily.get(date);
if (!day) return null;
return {
date,
totalCost: `$${day.totalCost.toFixed(4)}`,
callCount: day.callCount,
avgCostPerCall: `$${(day.totalCost / day.callCount).toFixed(6)}`,
totalInputTokens: day.totalInputTokens.toLocaleString(),
totalOutputTokens: day.totalOutputTokens.toLocaleString(),
byModel: Object.entries(day.byModel)
.sort((a, b) => b[1] - a[1])
.map(([model, cost]) => ({ model, cost: `$${cost.toFixed(4)}` })),
byFeature: Object.entries(day.byFeature)
.sort((a, b) => b[1] - a[1])
.map(([feature, cost]) => ({ feature, cost: `$${cost.toFixed(4)}` }))
};
}
getOptimizationSuggestions(date) {
const day = this.daily.get(date);
if (!day) return [];
const suggestions = [];
// Check if a cheaper model could work
if (day.byModel['gpt-4o'] > day.totalCost * 0.8) {
suggestions.push({
type: 'model_downgrade',
impact: 'HIGH',
suggestion: 'gpt-4o accounts for 80%+ of cost. Test gpt-4o-mini for ' +
'simpler tasks (classification, extraction). Potential 90%+ cost reduction ' +
'on those tasks.'
});
}
// Check token efficiency
const avgInputTokens = day.totalInputTokens / day.callCount;
if (avgInputTokens > 5000) {
suggestions.push({
type: 'prompt_optimization',
impact: 'MEDIUM',
suggestion: `Average input is ${avgInputTokens.toFixed(0)} tokens. ` +
`Review system prompts for verbosity. Consider prompt caching for ` +
`repeated prefixes.`
});
}
// Check output tokens
const avgOutputTokens = day.totalOutputTokens / day.callCount;
if (avgOutputTokens > 2000) {
suggestions.push({
type: 'output_optimization',
impact: 'MEDIUM',
suggestion: `Average output is ${avgOutputTokens.toFixed(0)} tokens. ` +
`Add max_tokens limit. Use structured output to constrain response length.`
});
}
return suggestions;
}
}
Cost optimization strategies
| Strategy | Savings | Tradeoff |
|---|---|---|
| Use smaller models for simple tasks | 90-95% on those tasks | Slightly lower quality |
| Prompt caching (reuse system prompt prefix) | 50-75% on cached tokens | API support required |
| Reduce system prompt length | Proportional to reduction | May reduce instruction quality |
| Set max_tokens | Prevents runaway outputs | May truncate long answers |
| Cache frequent queries | 100% on cache hits | Stale answers if data changes |
| Batch non-urgent requests | Lower per-token pricing | Increased latency |
| Structured output | Reduces unnecessary verbosity | Requires schema design |
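The "Cache frequent queries" row is cheap to prototype. A minimal in-memory sketch (the TTL and key normalization are illustrative; a production cache would usually be shared, e.g. Redis, and would wrap an awaited LLM call):

```javascript
// Wrap an answer function with a TTL cache keyed by the normalized query.
function withCache(answerFn, ttlMs = 5 * 60 * 1000) {
  const cache = new Map(); // normalized query -> { value, expiresAt }
  return (query) => {
    const key = query.trim().toLowerCase();
    const hit = cache.get(key);
    if (hit && hit.expiresAt > Date.now()) return { ...hit.value, cached: true };
    const value = answerFn(query); // in production this is the (awaited) LLM call
    cache.set(key, { value, expiresAt: Date.now() + ttlMs });
    return { ...value, cached: false };
  };
}

// Stub "LLM" that counts how many real calls were made.
let llmCalls = 0;
const answer = withCache((q) => { llmCalls++; return { text: `Answer to: ${q}` }; });

const a1 = answer('What is the refund policy?');
const a2 = answer('what is the refund policy?  '); // normalizes to the same key
console.log(llmCalls, a1.cached, a2.cached); // → 1 false true
```

The tradeoff from the table shows up directly in the TTL: a longer TTL saves more money but serves staler answers when the underlying data changes.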
9. Key Takeaways
- AI observability has four pillars: logs, metrics, traces, and evaluation. Traditional monitoring tells you if the system is running; evaluation tells you if it's running correctly.
- Log every LLM call with full input, output, parameters, token counts, and latency. This is your debugging lifeline and evaluation data source.
- Track six metric categories: latency, cost, errors, quality (hallucination, confidence), retrieval quality, and user satisfaction. Each tells a different story.
- Trace multi-step pipelines end-to-end so you can identify which step is the bottleneck or failure point in RAG and agent systems.
- Set up alerts for quality degradation — hallucination spikes, confidence drops, cost increases, and user satisfaction drops should trigger immediate investigation.
- A/B test every change — new prompts, models, and retrieval strategies should be tested on real traffic before full rollout.
- Monitor costs actively — AI costs can spike unexpectedly. Track by model, feature, and time. Use cheaper models where quality allows.
Explain-It Challenge
- Your AI system has been running for 3 months with no monitoring. The product manager says "it seems fine." Explain why you need observability and what specific failure modes they might be missing.
- You notice latency P95 jumped from 2s to 8s overnight but error rate is unchanged. Walk through your debugging process using traces.
- Your monthly AI bill doubled but call volume only increased 20%. What are the most likely causes and how would your cost monitoring dashboard help you find them?
Navigation: ← 4.14.c — Evaluating Retrieval Quality · 4.14 Overview