Episode 4 — Generative AI Engineering / 4.14 — Evaluating AI Systems

4.14.d — Observability and Monitoring

In one sentence: AI observability means logging every LLM call, tracing multi-step pipelines, tracking key metrics (latency, tokens, errors, hallucination rate, cost), and alerting when quality degrades — because an AI system without monitoring is a black box that fails silently.

Navigation: ← 4.14.c — Evaluating Retrieval Quality · 4.14 Overview


1. What AI Observability Means

Traditional observability has three pillars: logs, metrics, and traces. AI observability adds a fourth: evaluation.

┌──────────────────────────────────────────────────────────────────┐
│  THE FOUR PILLARS OF AI OBSERVABILITY                            │
│                                                                  │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────┐    │
│  │  LOGS    │  │ METRICS  │  │  TRACES  │  │  EVALUATION  │    │
│  │          │  │          │  │          │  │              │    │
│  │ Every    │  │ Latency  │  │ Full     │  │ Hallucination│    │
│  │ LLM call │  │ Token $  │  │ pipeline │  │ rate,        │    │
│  │ input/   │  │ Error %  │  │ step-by- │  │ confidence,  │    │
│  │ output/  │  │ Through- │  │ step     │  │ retrieval    │    │
│  │ params   │  │ put      │  │ timing   │  │ quality      │    │
│  └──────────┘  └──────────┘  └──────────┘  └──────────────┘    │
│                                                                  │
│  Traditional ──────────────────────────────► AI-specific         │
│                                                                  │
│  KEY INSIGHT: Traditional monitoring tells you IF the system     │
│  is running. AI evaluation tells you if it's running CORRECTLY.  │
└──────────────────────────────────────────────────────────────────┘

Why AI systems need special monitoring

| Traditional API | AI System |
|---|---|
| Returns correct data or throws an error | Returns plausible-looking data that might be wrong |
| Fails loudly (500 error) | Fails silently (confident-sounding hallucination) |
| Deterministic: same input = same output | Probabilistic: same input can produce different outputs |
| Fixed cost per call | Variable cost (depends on token count) |
| Latency is predictable | Latency varies with output length |
| Testing is straightforward | Testing requires evaluation suites |
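The "probabilistic" row is the one that breaks traditional testing intuitions. A minimal sketch of what this means in practice: sample the same prompt several times and measure how often the outputs agree. (The helper below is illustrative, not part of any library.)

```javascript
// Hypothetical helper: given several outputs sampled for the SAME input,
// measure how often they agree (share of the most common output).
// A deterministic API always scores 1.0; an LLM at non-zero temperature
// usually will not — and even "correct" answers can differ as strings.
function agreementRate(outputs) {
  if (outputs.length === 0) return 0;
  const counts = {};
  for (const out of outputs) {
    counts[out] = (counts[out] || 0) + 1;
  }
  const mostCommon = Math.max(...Object.values(counts));
  return mostCommon / outputs.length;
}

console.log(agreementRate(['Paris', 'Paris', 'Paris']));          // identical outputs: 1.0
console.log(agreementRate(['Paris', 'Paris.', 'Paris, France'])); // same meaning, different strings: below 1.0
```

This is why evaluation suites score semantic correctness rather than exact string equality.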

2. Key Metrics to Track

The essential AI metrics dashboard

class AIMetricsCollector {
  constructor() {
    this.metrics = [];
  }

  recordCall(data) {
    this.metrics.push({
      timestamp: new Date(),
      requestId: data.requestId,
      
      // Performance metrics
      latencyMs: data.latencyMs,
      timeToFirstToken: data.timeToFirstToken,
      tokensPerSecond: data.outputTokens / (data.latencyMs / 1000),
      
      // Cost metrics
      inputTokens: data.inputTokens,
      outputTokens: data.outputTokens,
      totalTokens: data.inputTokens + data.outputTokens,
      estimatedCost: this.estimateCost(data),
      
      // Quality metrics
      confidence: data.confidence,
      hallucinationScore: data.hallucinationScore,
      retrievalRelevance: data.retrievalRelevance,
      
      // Operational metrics
      model: data.model,
      promptVersion: data.promptVersion,
      statusCode: data.statusCode,
      isError: data.statusCode >= 400,
      errorType: data.errorType || null,
      
      // User metrics
      userFeedback: data.userFeedback || null, // 'thumbs_up' | 'thumbs_down' | null
      wasEdited: data.wasEdited || false
    });
  }

  estimateCost(data) {
    // Pricing per 1M tokens (example rates — verify against current provider pricing)
    const pricing = {
      'gpt-4o': { input: 2.50, output: 10.00 },
      'gpt-4o-mini': { input: 0.15, output: 0.60 },
      'gpt-4.1': { input: 2.00, output: 8.00 },
      'gpt-4.1-mini': { input: 0.40, output: 1.60 }
    };

    const rates = pricing[data.model] || pricing['gpt-4o'];
    return (
      (data.inputTokens / 1_000_000) * rates.input +
      (data.outputTokens / 1_000_000) * rates.output
    );
  }

  // Generate metrics summary for a time window
  getSummary(windowMinutes = 60) {
    const cutoff = Date.now() - windowMinutes * 60 * 1000;
    const recent = this.metrics.filter(m => m.timestamp >= cutoff);

    if (recent.length === 0) return { noData: true };

    const avg = (arr) => arr.length > 0 
      ? arr.reduce((a, b) => a + b, 0) / arr.length 
      : 0;
    
    const p95 = (arr) => {
      const sorted = [...arr].sort((a, b) => a - b);
      return sorted[Math.floor(sorted.length * 0.95)] || 0;
    };

    return {
      window: `${windowMinutes} minutes`,
      totalCalls: recent.length,
      callsPerMinute: recent.length / windowMinutes,

      // Latency
      avgLatencyMs: avg(recent.map(m => m.latencyMs)).toFixed(0),
      p95LatencyMs: p95(recent.map(m => m.latencyMs)).toFixed(0),
      avgTimeToFirstToken: avg(recent.map(m => m.timeToFirstToken)).toFixed(0),

      // Tokens & cost
      avgInputTokens: avg(recent.map(m => m.inputTokens)).toFixed(0),
      avgOutputTokens: avg(recent.map(m => m.outputTokens)).toFixed(0),
      totalCost: recent.reduce((sum, m) => sum + m.estimatedCost, 0).toFixed(4),
      avgCostPerCall: avg(recent.map(m => m.estimatedCost)).toFixed(6),

      // Errors
      errorRate: (recent.filter(m => m.isError).length / recent.length).toFixed(4),
      errorsByType: this.groupBy(recent.filter(m => m.isError), 'errorType'),

      // Quality
      avgConfidence: avg(recent.filter(m => m.confidence != null).map(m => m.confidence)).toFixed(3),
      avgHallucinationScore: avg(recent.filter(m => m.hallucinationScore != null).map(m => m.hallucinationScore)).toFixed(3),
      avgRetrievalRelevance: avg(recent.filter(m => m.retrievalRelevance != null).map(m => m.retrievalRelevance)).toFixed(3),

      // User satisfaction
      feedbackCount: recent.filter(m => m.userFeedback != null).length,
      thumbsUpRate: recent.filter(m => m.userFeedback === 'thumbs_up').length /
        Math.max(recent.filter(m => m.userFeedback != null).length, 1),
      editRate: recent.filter(m => m.wasEdited).length / recent.length
    };
  }

  groupBy(items, key) {
    return items.reduce((groups, item) => {
      const val = item[key] || 'unknown';
      groups[val] = (groups[val] || 0) + 1;
      return groups;
    }, {});
  }
}

Metric categories and their targets

| Category | Metric | What It Measures | Typical Target |
|---|---|---|---|
| Latency | Avg response time | User experience | < 2s for chat, < 10s for complex tasks |
| Latency | P95 response time | Worst-case experience | < 5s for chat |
| Latency | Time to first token | Perceived responsiveness | < 500ms |
| Cost | Cost per call | Budget management | Depends on use case |
| Cost | Daily/monthly spend | Budget tracking | Within budget |
| Cost | Tokens per call | Efficiency | Trending stable or down |
| Errors | Error rate | System reliability | < 1% |
| Errors | Rate limit hits | Capacity planning | < 0.1% |
| Quality | Hallucination rate | Answer reliability | < 5% (varies by domain) |
| Quality | Confidence score avg | System certainty | > 0.75 |
| Quality | Retrieval relevance | RAG effectiveness | > 0.70 |
| User | Thumbs up rate | User satisfaction | > 80% |
| User | Edit rate | Answer usability | < 15% |
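Targets like these can be encoded as a simple check against a metrics summary. A minimal sketch — the field names mirror the collector above, and the thresholds are the illustrative ones from the table, not universal standards:

```javascript
// Illustrative thresholds taken from the table above; tune per product.
const TARGETS = [
  { field: 'p95LatencyMs',          max: 5000, label: 'P95 latency' },
  { field: 'errorRate',             max: 0.01, label: 'Error rate' },
  { field: 'avgHallucinationScore', max: 0.05, label: 'Hallucination rate' },
  { field: 'avgConfidence',         min: 0.75, label: 'Avg confidence' },
  { field: 'thumbsUpRate',          min: 0.80, label: 'Thumbs up rate' },
  { field: 'editRate',              max: 0.15, label: 'Edit rate' }
];

// Returns the targets a summary violates. Values may arrive as strings
// (the collector above uses toFixed), so coerce before comparing.
function checkTargets(summary, targets = TARGETS) {
  const violations = [];
  for (const t of targets) {
    const value = parseFloat(summary[t.field]);
    if (Number.isNaN(value)) continue; // metric not present in this window
    if (t.max != null && value > t.max) violations.push(`${t.label}: ${value} > ${t.max}`);
    if (t.min != null && value < t.min) violations.push(`${t.label}: ${value} < ${t.min}`);
  }
  return violations;
}
```

Running `checkTargets(collector.getSummary())` on each reporting cycle gives you the raw material for the alerting rules in section 6.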

3. Logging: The Foundation of Observability

Every AI call should be logged with full context for debugging and evaluation.

import { randomUUID } from 'crypto';

class AILogger {
  constructor(storage) {
    this.storage = storage; // Could be file, database, external service
  }

  // Wrap any LLM call with logging
  async loggedCall(callFn, metadata = {}) {
    const requestId = randomUUID();
    const startTime = Date.now();

    const logEntry = {
      requestId,
      timestamp: new Date().toISOString(),
      ...metadata,
      status: 'pending'
    };

    try {
      // Execute the LLM call
      const result = await callFn();

      const endTime = Date.now();

      // Complete the log entry
      Object.assign(logEntry, {
        status: 'success',
        latencyMs: endTime - startTime,
        
        // Input details
        model: metadata.model,
        messages: metadata.messages,
        temperature: metadata.temperature,
        promptVersion: metadata.promptVersion,
        
        // Output details
        response: result.choices?.[0]?.message?.content,
        finishReason: result.choices?.[0]?.finish_reason,
        
        // Token usage
        inputTokens: result.usage?.prompt_tokens,
        outputTokens: result.usage?.completion_tokens,
        totalTokens: result.usage?.total_tokens,
        
        // Request metadata
        systemFingerprint: result.system_fingerprint
      });

      await this.storage.write(logEntry);
      return { result, requestId };

    } catch (error) {
      Object.assign(logEntry, {
        status: 'error',
        latencyMs: Date.now() - startTime,
        error: {
          name: error.name,
          message: error.message,
          code: error.code || error.status,
          type: error.type
        }
      });

      await this.storage.write(logEntry);
      throw error; // Re-throw so the caller handles it
    }
  }
}

// Example storage implementations
class ConsoleStorage {
  async write(entry) {
    console.log(JSON.stringify(entry));
  }
}

class FileStorage {
  constructor(filePath) {
    this.filePath = filePath;
  }
  async write(entry) {
    const fs = await import('fs/promises');
    await fs.appendFile(this.filePath, JSON.stringify(entry) + '\n');
  }
}

// Usage
import OpenAI from 'openai';

const openai = new OpenAI();
const logger = new AILogger(new FileStorage('./ai-logs.jsonl'));

const { result, requestId } = await logger.loggedCall(
  () => openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0,
    messages: [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: 'What is the capital of France?' }
    ]
  }),
  {
    model: 'gpt-4o',
    temperature: 0,
    promptVersion: 'v2.1',
    messages: [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: 'What is the capital of France?' }
    ]
  }
);

console.log(`Logged as ${requestId}: ${result.choices[0].message.content}`);

What to log (and what NOT to log)

| Always Log | Never Log |
|---|---|
| Request ID, timestamp | Raw user PII (unless required & encrypted) |
| Model, temperature, params | API keys or secrets |
| Input/output token counts | Full conversation history of other users |
| Latency | Internal IP addresses |
| Error details | |
| Prompt version | |
| Finish reason | |
| Confidence scores | |
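One way to honor the "never log raw PII" rule is to scrub entries just before they reach storage. A minimal sketch with two illustrative regexes — a real deployment would typically use a dedicated PII-detection library rather than hand-rolled patterns:

```javascript
// Hypothetical scrubber: masks email addresses and long digit runs
// (phone- or card-like numbers) in every string field of a log entry.
const EMAIL_RE = /[\w.+-]+@[\w-]+\.[\w.]+/g;
const DIGITS_RE = /\b\d{7,}\b/g;

function redactPII(entry) {
  const scrub = (value) => {
    if (typeof value === 'string') {
      return value.replace(EMAIL_RE, '[EMAIL]').replace(DIGITS_RE, '[NUMBER]');
    }
    if (Array.isArray(value)) return value.map(scrub);
    if (value && typeof value === 'object') {
      return Object.fromEntries(
        Object.entries(value).map(([k, v]) => [k, scrub(v)])
      );
    }
    return value; // numbers, booleans, null pass through unchanged
  };
  return scrub(entry);
}
```

In the `AILogger` above, this would slot in as `await this.storage.write(redactPII(logEntry))`.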

4. Tracing Multi-Step AI Pipelines

A RAG pipeline or agent involves multiple steps. Tracing connects all steps in a single request so you can debug end-to-end.

import { randomUUID } from 'crypto';

class AITracer {
  constructor() {
    this.traces = new Map();
  }

  startTrace(traceId, metadata = {}) {
    const trace = {
      traceId,
      startTime: Date.now(),
      metadata,
      spans: [],
      status: 'in_progress'
    };
    this.traces.set(traceId, trace);
    return trace;
  }

  startSpan(traceId, spanName, metadata = {}) {
    const trace = this.traces.get(traceId);
    if (!trace) throw new Error(`Trace ${traceId} not found`);

    const span = {
      spanId: randomUUID(),
      spanName,
      startTime: Date.now(),
      metadata,
      status: 'in_progress'
    };

    trace.spans.push(span);
    return span;
  }

  endSpan(traceId, spanId, result = {}) {
    const trace = this.traces.get(traceId);
    const span = trace?.spans.find(s => s.spanId === spanId);
    if (!span) return;

    span.endTime = Date.now();
    span.durationMs = span.endTime - span.startTime;
    span.status = 'completed';
    span.result = result;
  }

  endTrace(traceId, result = {}) {
    const trace = this.traces.get(traceId);
    if (!trace) return;

    trace.endTime = Date.now();
    trace.totalDurationMs = trace.endTime - trace.startTime;
    trace.status = 'completed';
    trace.result = result;

    return trace;
  }

  // Get a visual summary of the trace
  getTraceSummary(traceId) {
    const trace = this.traces.get(traceId);
    if (!trace) return null;

    const lines = [
      `Trace: ${trace.traceId} (${trace.totalDurationMs || '?'}ms total)`,
      `Status: ${trace.status}`,
      '',
      'Steps:'
    ];

    for (const span of trace.spans) {
      const pct = trace.totalDurationMs 
        ? ((span.durationMs / trace.totalDurationMs) * 100).toFixed(1)
        : '?';
      lines.push(
        `  ${span.spanName}: ${span.durationMs || '?'}ms (${pct}%) — ${span.status}`
      );
    }

    return lines.join('\n');
  }
}

// Usage: Trace a full RAG pipeline
import { randomUUID } from 'crypto';

const tracer = new AITracer();

async function tracedRAGPipeline(query) {
  const traceId = randomUUID();
  tracer.startTrace(traceId, { query });

  // Step 1: Embed the query
  const embedSpan = tracer.startSpan(traceId, 'embed_query');
  const embedding = await embedQuery(query);
  tracer.endSpan(traceId, embedSpan.spanId, { 
    dimensions: embedding.length 
  });

  // Step 2: Retrieve documents
  const retrieveSpan = tracer.startSpan(traceId, 'retrieve_documents');
  const docs = await vectorDB.search(embedding, { topK: 5 });
  tracer.endSpan(traceId, retrieveSpan.spanId, { 
    docsRetrieved: docs.length,
    topRelevance: docs[0]?.score 
  });

  // Step 3: Build prompt
  const promptSpan = tracer.startSpan(traceId, 'build_prompt');
  const prompt = buildRAGPrompt(query, docs);
  tracer.endSpan(traceId, promptSpan.spanId, { 
    promptTokens: estimateTokens(prompt) 
  });

  // Step 4: Generate answer
  const generateSpan = tracer.startSpan(traceId, 'generate_answer');
  const answer = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: prompt,
    temperature: 0
  });
  tracer.endSpan(traceId, generateSpan.spanId, {
    outputTokens: answer.usage?.completion_tokens,
    finishReason: answer.choices[0]?.finish_reason
  });

  // Step 5: Hallucination check
  const checkSpan = tracer.startSpan(traceId, 'hallucination_check');
  const hallCheck = await hallucinationDetector.check(
    answer.choices[0].message.content, 
    docs
  );
  tracer.endSpan(traceId, checkSpan.spanId, {
    hallucinationScore: hallCheck.score,
    flaggedClaims: hallCheck.flaggedClaims?.length || 0
  });

  // Complete the trace
  const trace = tracer.endTrace(traceId, {
    answer: answer.choices[0].message.content,
    confidence: hallCheck.score < 0.3 ? 'high' : 'low'
  });

  console.log(tracer.getTraceSummary(traceId));
  /*
  Trace: abc-123 (2340ms total)
  Status: completed

  Steps:
    embed_query:         120ms (5.1%) — completed
    retrieve_documents:  230ms (9.8%) — completed
    build_prompt:         15ms (0.6%) — completed
    generate_answer:    1450ms (62.0%) — completed
    hallucination_check: 525ms (22.4%) — completed
  */

  return { trace, answer: answer.choices[0].message.content };
}

5. Tools for AI Observability

Overview of popular tools

| Tool | Best For | Key Feature |
|---|---|---|
| LangSmith | LangChain-based apps | Deep chain tracing, eval suites |
| Weights & Biases | ML experiment tracking | Prompt versioning, eval dashboards |
| Helicone | API-level monitoring | Drop-in proxy, cost tracking |
| Langfuse | Open-source LLM observability | Self-hostable, traces, evals |
| Braintrust | Evaluation & logging | Eval-first approach, logging |
| Custom dashboards | Full control | Exactly what you need, nothing else |

Helicone: Drop-in API proxy

import OpenAI from 'openai';

// Helicone acts as a proxy — just change the base URL
const openai = new OpenAI({
  baseURL: 'https://oai.helicone.ai/v1',
  defaultHeaders: {
    'Helicone-Auth': `Bearer ${process.env.HELICONE_API_KEY}`,
    // Custom properties for filtering in the dashboard
    'Helicone-Property-Environment': 'production',
    'Helicone-Property-Feature': 'document-qa',
    'Helicone-Property-PromptVersion': 'v2.1'
  }
});

// Use OpenAI as normal — Helicone logs everything automatically
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'What is the refund policy?' }]
});

// Helicone dashboard now shows:
// - Request/response logs
// - Latency distribution
// - Token usage over time
// - Cost per feature/environment
// - Error rates

Langfuse: Open-source tracing

import { Langfuse } from 'langfuse';

const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
  baseUrl: process.env.LANGFUSE_BASE_URL // Self-hosted or cloud
});

async function tracedQuery(query) {
  // Create a trace
  const trace = langfuse.trace({
    name: 'rag-query',
    metadata: { query }
  });

  // Span for retrieval
  const retrievalSpan = trace.span({ name: 'retrieval' });
  const docs = await retrieveDocuments(query);
  retrievalSpan.end({ output: { docCount: docs.length } });

  // Generation span
  const generationSpan = trace.generation({
    name: 'llm-generation',
    model: 'gpt-4o',
    input: [{ role: 'user', content: query }]
  });

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: query }]
  });

  generationSpan.end({
    output: response.choices[0].message.content,
    usage: {
      promptTokens: response.usage.prompt_tokens,
      completionTokens: response.usage.completion_tokens
    }
  });

  // Score the trace
  trace.score({
    name: 'confidence',
    value: 0.85
  });

  // Flush at the end
  await langfuse.flushAsync();
}

Custom lightweight monitoring

// When external tools are overkill, build a simple monitor

class SimpleAIMonitor {
  constructor() {
    this.metrics = {
      calls: 0,
      errors: 0,
      totalLatencyMs: 0,
      totalInputTokens: 0,
      totalOutputTokens: 0,
      totalCost: 0,
      feedbackPositive: 0,
      feedbackNegative: 0,
      confidenceSum: 0,
      confidenceCount: 0,
      startTime: Date.now()
    };
    this.recentErrors = [];
  }

  record(data) {
    this.metrics.calls++;
    this.metrics.totalLatencyMs += data.latencyMs || 0;
    this.metrics.totalInputTokens += data.inputTokens || 0;
    this.metrics.totalOutputTokens += data.outputTokens || 0;
    this.metrics.totalCost += data.cost || 0;

    if (data.isError) {
      this.metrics.errors++;
      this.recentErrors.push({
        timestamp: new Date(),
        error: data.errorMessage
      });
      // Keep only last 100 errors
      if (this.recentErrors.length > 100) this.recentErrors.shift();
    }

    if (data.feedback === 'positive') this.metrics.feedbackPositive++;
    if (data.feedback === 'negative') this.metrics.feedbackNegative++;

    if (data.confidence != null) {
      this.metrics.confidenceSum += data.confidence;
      this.metrics.confidenceCount++;
    }
  }

  getSnapshot() {
    const m = this.metrics;
    const uptimeMinutes = (Date.now() - m.startTime) / 60000;

    return {
      uptime: `${uptimeMinutes.toFixed(0)} minutes`,
      totalCalls: m.calls,
      callsPerMinute: (m.calls / uptimeMinutes).toFixed(2),
      errorRate: m.calls > 0 ? (m.errors / m.calls * 100).toFixed(2) + '%' : '0%',
      avgLatencyMs: m.calls > 0 ? (m.totalLatencyMs / m.calls).toFixed(0) : 0,
      totalCost: `$${m.totalCost.toFixed(4)}`,
      avgCostPerCall: m.calls > 0 ? `$${(m.totalCost / m.calls).toFixed(6)}` : '$0',
      avgConfidence: m.confidenceCount > 0 
        ? (m.confidenceSum / m.confidenceCount).toFixed(3) 
        : 'N/A',
      satisfactionRate: (m.feedbackPositive + m.feedbackNegative) > 0
        ? (m.feedbackPositive / (m.feedbackPositive + m.feedbackNegative) * 100).toFixed(1) + '%'
        : 'No feedback yet',
      recentErrors: this.recentErrors.slice(-5)
    };
  }
}

6. Alerting on Quality Degradation

Monitoring without alerting is a dashboard no one watches. Set up alerts for critical thresholds.

class AIAlertManager {
  constructor(notifier) {
    this.notifier = notifier; // Slack, PagerDuty, email, etc.
    this.rules = [];
    this.cooldowns = new Map(); // Prevent alert spam
  }

  addRule(rule) {
    this.rules.push({
      name: rule.name,
      condition: rule.condition,      // Function that returns true if alert should fire
      severity: rule.severity,        // 'critical' | 'warning' | 'info'
      cooldownMinutes: rule.cooldownMinutes || 15,
      message: rule.message           // Function that returns alert message
    });
  }

  async evaluate(metrics) {
    for (const rule of this.rules) {
      if (rule.condition(metrics)) {
        // Check cooldown
        const lastFired = this.cooldowns.get(rule.name);
        const now = Date.now();

        if (lastFired && (now - lastFired) < rule.cooldownMinutes * 60 * 1000) {
          continue; // Still in cooldown
        }

        // Fire alert
        const message = rule.message(metrics);
        await this.notifier.send({
          severity: rule.severity,
          rule: rule.name,
          message,
          metrics,
          timestamp: new Date().toISOString()
        });

        this.cooldowns.set(rule.name, now);
      }
    }
  }
}

// Configure alerts
const alertManager = new AIAlertManager({
  send: async (alert) => {
    console.log(`[${alert.severity.toUpperCase()}] ${alert.rule}: ${alert.message}`);
    // In production: send to Slack, PagerDuty, etc.
  }
});

// Error rate alert
alertManager.addRule({
  name: 'high_error_rate',
  severity: 'critical',
  cooldownMinutes: 10,
  condition: (m) => parseFloat(m.errorRate) > 5,
  message: (m) => `Error rate is ${m.errorRate} (threshold: 5%). ` +
    `${m.totalCalls} calls in window.`
});

// Latency alert
alertManager.addRule({
  name: 'high_latency',
  severity: 'warning',
  cooldownMinutes: 15,
  condition: (m) => parseInt(m.p95LatencyMs) > 5000,
  message: (m) => `P95 latency is ${m.p95LatencyMs}ms (threshold: 5000ms). ` +
    `Users are experiencing slow responses.`
});

// Hallucination rate alert
alertManager.addRule({
  name: 'hallucination_spike',
  severity: 'critical',
  cooldownMinutes: 30,
  condition: (m) => parseFloat(m.avgHallucinationScore) > 0.15,
  message: (m) => `Average hallucination score is ${m.avgHallucinationScore} ` +
    `(threshold: 0.15). Check prompt changes, model updates, or data drift.`
});

// Cost alert
alertManager.addRule({
  name: 'cost_spike',
  severity: 'warning',
  cooldownMinutes: 60,
  condition: (m) => parseFloat(m.totalCost?.replace('$', '')) > 100,
  message: (m) => `Hourly cost is ${m.totalCost} (threshold: $100). ` +
    `Check for runaway loops or token-heavy prompts.`
});

// Confidence drop alert
alertManager.addRule({
  name: 'low_confidence',
  severity: 'warning',
  cooldownMinutes: 30,
  condition: (m) => m.avgConfidence !== 'N/A' && parseFloat(m.avgConfidence) < 0.6,
  message: (m) => `Average confidence dropped to ${m.avgConfidence} ` +
    `(threshold: 0.6). May indicate retrieval quality issues.`
});

// User satisfaction alert
alertManager.addRule({
  name: 'low_satisfaction',
  severity: 'warning',
  cooldownMinutes: 60,
  condition: (m) => {
    const rate = parseFloat(m.satisfactionRate);
    return !isNaN(rate) && rate < 70;
  },
  message: (m) => `User satisfaction is ${m.satisfactionRate} ` +
    `(threshold: 70%). Investigate recent changes.`
});

// Run evaluation periodically (e.g. on a timer). Note: the rules above read
// fields from both AIMetricsCollector.getSummary() (p95LatencyMs,
// avgHallucinationScore) and SimpleAIMonitor.getSnapshot() (errorRate,
// totalCost, avgConfidence, satisfactionRate), so merge both sources:
// const metrics = { ...collector.getSummary(), ...monitor.getSnapshot() };
// await alertManager.evaluate(metrics);

7. A/B Testing Prompts and Models in Production

Test prompt and model changes on real traffic before rolling out to everyone.

class AIABTester {
  constructor() {
    this.experiments = new Map();
  }

  createExperiment(config) {
    const experiment = {
      id: config.id,
      name: config.name,
      variants: config.variants, // [{ name, weight, config }]
      metrics: new Map(),        // variant -> [metric records]
      startTime: Date.now(),
      status: 'running'
    };

    // Normalize weights
    const totalWeight = experiment.variants.reduce((s, v) => s + v.weight, 0);
    experiment.variants = experiment.variants.map(v => ({
      ...v,
      normalizedWeight: v.weight / totalWeight
    }));

    for (const variant of experiment.variants) {
      experiment.metrics.set(variant.name, []);
    }

    this.experiments.set(config.id, experiment);
    return experiment;
  }

  // Assign a request to a variant (deterministic by userId for consistency)
  assignVariant(experimentId, userId) {
    const experiment = this.experiments.get(experimentId);
    if (!experiment || experiment.status !== 'running') return null;

    // Simple hash-based assignment for consistency
    const hash = this.simpleHash(userId + experimentId);
    const normalized = (hash % 1000) / 1000; // 0-1

    let cumulative = 0;
    for (const variant of experiment.variants) {
      cumulative += variant.normalizedWeight;
      if (normalized < cumulative) {
        return variant;
      }
    }

    return experiment.variants[experiment.variants.length - 1];
  }

  recordMetric(experimentId, variantName, metric) {
    const experiment = this.experiments.get(experimentId);
    if (!experiment) return;

    const variantMetrics = experiment.metrics.get(variantName);
    if (variantMetrics) {
      variantMetrics.push({
        timestamp: Date.now(),
        ...metric
      });
    }
  }

  getResults(experimentId) {
    const experiment = this.experiments.get(experimentId);
    if (!experiment) return null;

    const results = {};

    for (const variant of experiment.variants) {
      const metrics = experiment.metrics.get(variant.name) || [];

      if (metrics.length === 0) {
        results[variant.name] = { sampleSize: 0 };
        continue;
      }

      const avg = (arr) => arr.reduce((a, b) => a + b, 0) / arr.length;

      results[variant.name] = {
        sampleSize: metrics.length,
        avgLatencyMs: avg(metrics.map(m => m.latencyMs)).toFixed(0),
        avgConfidence: avg(metrics.map(m => m.confidence).filter(Boolean)).toFixed(3),
        avgCost: avg(metrics.map(m => m.cost).filter(Boolean)).toFixed(6),
        errorRate: (metrics.filter(m => m.isError).length / metrics.length).toFixed(4),
        satisfactionRate: metrics.filter(m => m.feedback === 'positive').length /
          Math.max(metrics.filter(m => m.feedback != null).length, 1)
      };
    }

    return {
      experimentId,
      name: experiment.name,
      duration: `${((Date.now() - experiment.startTime) / 3600000).toFixed(1)} hours`,
      results
    };
  }

  simpleHash(str) {
    let hash = 0;
    for (let i = 0; i < str.length; i++) {
      hash = ((hash << 5) - hash) + str.charCodeAt(i);
      hash = hash & hash; // Convert to 32-bit integer
    }
    return Math.abs(hash);
  }
}

// Usage
const abTester = new AIABTester();

abTester.createExperiment({
  id: 'prompt-v3-test',
  name: 'Test new system prompt v3 vs v2',
  variants: [
    {
      name: 'control',
      weight: 50,
      config: {
        systemPrompt: 'You are a helpful customer service assistant...',
        model: 'gpt-4o'
      }
    },
    {
      name: 'treatment',
      weight: 50,
      config: {
        systemPrompt: 'You are an expert customer service agent. Always cite sources...',
        model: 'gpt-4o'
      }
    }
  ]
});

// For each request:
const variant = abTester.assignVariant('prompt-v3-test', 'user-123');
// Use variant.config to make the LLM call
// Then record the metric:
abTester.recordMetric('prompt-v3-test', variant.name, {
  latencyMs: 1200,
  confidence: 0.88,
  cost: 0.003,
  feedback: 'positive'
});

// After sufficient data:
const results = abTester.getResults('prompt-v3-test');
console.log(results);
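Raw averages alone are not enough to declare a winner — with small samples, a difference in satisfaction rates can easily be noise. A hedged sketch of a two-proportion z-test for comparing thumbs-up rates between variants (a standalone helper, not part of `AIABTester` above):

```javascript
// Two-proportion z-test: are two observed rates plausibly different?
// Returns the z statistic; |z| > 1.96 is roughly p < 0.05 (two-sided).
function twoProportionZ(successesA, totalA, successesB, totalB) {
  const pA = successesA / totalA;
  const pB = successesB / totalB;
  const pooled = (successesA + successesB) / (totalA + totalB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / totalA + 1 / totalB));
  return (pA - pB) / se;
}

// e.g. control: 400/500 thumbs up vs treatment: 440/500 thumbs up
const z = twoProportionZ(400, 500, 440, 500);
console.log(Math.abs(z) > 1.96 ? 'significant' : 'keep collecting data');
```

For latency or cost comparisons (continuous metrics rather than rates), a t-test or a bootstrap over the recorded samples is the analogous check.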

8. Cost Monitoring and Optimization

AI systems can become expensive fast. Track and optimize costs systematically.

class AICostMonitor {
  constructor() {
    this.daily = new Map(); // date -> cost breakdown
  }

  record(data) {
    const date = new Date().toISOString().split('T')[0]; // YYYY-MM-DD

    if (!this.daily.has(date)) {
      this.daily.set(date, {
        totalCost: 0,
        byModel: {},
        byFeature: {},
        totalInputTokens: 0,
        totalOutputTokens: 0,
        callCount: 0
      });
    }

    const day = this.daily.get(date);
    day.totalCost += data.cost;
    day.totalInputTokens += data.inputTokens;
    day.totalOutputTokens += data.outputTokens;
    day.callCount++;

    // By model
    day.byModel[data.model] = (day.byModel[data.model] || 0) + data.cost;

    // By feature
    if (data.feature) {
      day.byFeature[data.feature] = (day.byFeature[data.feature] || 0) + data.cost;
    }
  }

  getDailyReport(date) {
    const day = this.daily.get(date);
    if (!day) return null;

    return {
      date,
      totalCost: `$${day.totalCost.toFixed(4)}`,
      callCount: day.callCount,
      avgCostPerCall: `$${(day.totalCost / day.callCount).toFixed(6)}`,
      totalInputTokens: day.totalInputTokens.toLocaleString(),
      totalOutputTokens: day.totalOutputTokens.toLocaleString(),
      byModel: Object.entries(day.byModel)
        .sort((a, b) => b[1] - a[1])
        .map(([model, cost]) => ({ model, cost: `$${cost.toFixed(4)}` })),
      byFeature: Object.entries(day.byFeature)
        .sort((a, b) => b[1] - a[1])
        .map(([feature, cost]) => ({ feature, cost: `$${cost.toFixed(4)}` }))
    };
  }

  getOptimizationSuggestions(date) {
    const day = this.daily.get(date);
    if (!day) return [];

    const suggestions = [];

    // Check if a cheaper model could work
    if (day.byModel['gpt-4o'] > day.totalCost * 0.8) {
      suggestions.push({
        type: 'model_downgrade',
        impact: 'HIGH',
        suggestion: 'gpt-4o accounts for 80%+ of cost. Test gpt-4o-mini for ' +
          'simpler tasks (classification, extraction). Potential 90%+ cost reduction ' +
          'on those tasks.'
      });
    }

    // Check token efficiency
    const avgInputTokens = day.totalInputTokens / day.callCount;
    if (avgInputTokens > 5000) {
      suggestions.push({
        type: 'prompt_optimization',
        impact: 'MEDIUM',
        suggestion: `Average input is ${avgInputTokens.toFixed(0)} tokens. ` +
          `Review system prompts for verbosity. Consider prompt caching for ` +
          `repeated prefixes.`
      });
    }

    // Check output tokens
    const avgOutputTokens = day.totalOutputTokens / day.callCount;
    if (avgOutputTokens > 2000) {
      suggestions.push({
        type: 'output_optimization',
        impact: 'MEDIUM',
        suggestion: `Average output is ${avgOutputTokens.toFixed(0)} tokens. ` +
          `Add max_tokens limit. Use structured output to constrain response length.`
      });
    }

    return suggestions;
  }
}

Cost optimization strategies

| Strategy | Savings | Tradeoff |
|---|---|---|
| Use smaller models for simple tasks | 90-95% on those tasks | Slightly lower quality |
| Prompt caching (reuse system prompt prefix) | 50-75% on cached tokens | API support required |
| Reduce system prompt length | Proportional to reduction | May reduce instruction quality |
| Set max_tokens | Prevents runaway outputs | May truncate long answers |
| Cache frequent queries | 100% on cache hits | Stale answers if data changes |
| Batch non-urgent requests | Lower per-token pricing | Increased latency |
| Structured output | Reduces unnecessary verbosity | Requires schema design |
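The "cache frequent queries" strategy can be as simple as a TTL map keyed on the normalized query. A minimal in-memory sketch — a production system would more likely use Redis, and possibly semantic (embedding-based) matching rather than exact-match keys:

```javascript
// Hypothetical response cache: exact-match on a normalized query, with a
// TTL to bound staleness (the tradeoff noted in the table above).
class QueryCache {
  constructor(ttlMs = 5 * 60 * 1000) {
    this.ttlMs = ttlMs;
    this.entries = new Map(); // key -> { value, expiresAt }
  }

  key(query) {
    // Cheap normalization so trivial formatting differences still hit
    return query.trim().toLowerCase().replace(/\s+/g, ' ');
  }

  get(query) {
    const entry = this.entries.get(this.key(query));
    if (!entry) return null;
    if (Date.now() > entry.expiresAt) {
      this.entries.delete(this.key(query)); // expired: evict and miss
      return null;
    }
    return entry.value;
  }

  set(query, value) {
    this.entries.set(this.key(query), {
      value,
      expiresAt: Date.now() + this.ttlMs
    });
  }
}

// Usage: consult the cache before spending tokens
// const cached = cache.get(userQuery);
// if (cached) return cached;                 // cache hit: zero cost
// const answer = await callLLM(userQuery);   // cache miss: pay once
// cache.set(userQuery, answer);
```

Cache hit rate is itself a metric worth adding to the dashboard: a falling hit rate can explain a cost increase with no change in call volume.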

9. Key Takeaways

  1. AI observability has four pillars: logs, metrics, traces, and evaluation. Traditional monitoring tells you if the system is running; evaluation tells you if it's running correctly.
  2. Log every LLM call with full input, output, parameters, token counts, and latency. This is your debugging lifeline and evaluation data source.
  3. Track six metric categories: latency, cost, errors, quality (hallucination, confidence), retrieval quality, and user satisfaction. Each tells a different story.
  4. Trace multi-step pipelines end-to-end so you can identify which step is the bottleneck or failure point in RAG and agent systems.
  5. Set up alerts for quality degradation — hallucination spikes, confidence drops, cost increases, and user satisfaction drops should trigger immediate investigation.
  6. A/B test every change — new prompts, models, and retrieval strategies should be tested on real traffic before full rollout.
  7. Monitor costs actively — AI costs can spike unexpectedly. Track by model, feature, and time. Use cheaper models where quality allows.

Explain-It Challenge

  1. Your AI system has been running for 3 months with no monitoring. The product manager says "it seems fine." Explain why you need observability and what specific failure modes they might be missing.
  2. You notice latency P95 jumped from 2s to 8s overnight but error rate is unchanged. Walk through your debugging process using traces.
  3. Your monthly AI bill doubled but call volume only increased 20%. What are the most likely causes and how would your cost monitoring dashboard help you find them?

Navigation: ← 4.14.c — Evaluating Retrieval Quality · 4.14 Overview