Episode 4 — Generative AI Engineering / 4.14 — Evaluating AI Systems
4.14.d — Observability and Monitoring
In one sentence: AI observability means logging every LLM call, tracing multi-step pipelines, tracking key metrics (latency, tokens, errors, hallucination rate, cost), and alerting when quality degrades — because an AI system without monitoring is a black box that fails silently.
Navigation: ← 4.14.c — Evaluating Retrieval Quality · 4.14 Overview
1. What AI Observability Means
Traditional observability has three pillars: logs, metrics, and traces. AI observability adds a fourth: evaluation.
┌──────────────────────────────────────────────────────────────────┐
│ THE FOUR PILLARS OF AI OBSERVABILITY │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ LOGS │ │ METRICS │ │ TRACES │ │ EVALUATION │ │
│ │ │ │ │ │ │ │ │ │
│ │ Every │ │ Latency │ │ Full │ │ Hallucination│ │
│ │ LLM call │ │ Token $ │ │ pipeline │ │ rate, │ │
│ │ input/ │ │ Error % │ │ step-by- │ │ confidence, │ │
│ │ output/ │ │ Through- │ │ step │ │ retrieval │ │
│ │ params │ │ put │ │ timing │ │ quality │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────────┘ │
│ │
│ Traditional ──────────────────────────────► AI-specific │
│ │
│ KEY INSIGHT: Traditional monitoring tells you IF the system │
│ is running. AI evaluation tells you if it's running CORRECTLY. │
└──────────────────────────────────────────────────────────────────┘
Why AI systems need special monitoring
| Traditional API | AI System |
|---|---|
| Returns correct data or throws an error | Returns plausible-looking data that might be wrong |
| Fails loudly (500 error) | Fails silently (confident-sounding hallucination) |
| Deterministic: same input = same output | Probabilistic: same input can produce different outputs |
| Fixed cost per call | Variable cost (depends on token count) |
| Latency is predictable | Latency varies with output length |
| Testing is straightforward | Testing requires evaluation suites |
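The last row is the crux: a traditional API test can assert exact equality, but an AI output needs a scoring function. A minimal sketch of the difference (the `scoreAnswer` helper and its fact list are illustrative, not a real evaluation library):

```javascript
// Traditional test: exact string equality. Breaks on any paraphrase.
function traditionalTest(output) {
  return output === 'The capital of France is Paris.';
}

// Eval-style test: score the output against required facts instead.
// Returns the fraction of expected facts that appear in the answer.
function scoreAnswer(answer, requiredFacts) {
  const lower = answer.toLowerCase();
  const hits = requiredFacts.filter(f => lower.includes(f.toLowerCase()));
  return hits.length / requiredFacts.length;
}

const answer = 'Paris is the capital city of France.';
console.log(traditionalTest(answer));                  // false: phrasing differs
console.log(scoreAnswer(answer, ['Paris', 'France'])); // 1: both facts present
```

Real evaluation suites replace the substring check with embedding similarity or an LLM judge, but the shape is the same: a score, not a boolean.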
2. Key Metrics to Track
The essential AI metrics dashboard
class AIMetricsCollector {
constructor() {
this.metrics = [];
}
recordCall(data) {
this.metrics.push({
timestamp: new Date(),
requestId: data.requestId,
// Performance metrics
latencyMs: data.latencyMs,
timeToFirstToken: data.timeToFirstToken,
tokensPerSecond: data.outputTokens / (data.latencyMs / 1000),
// Cost metrics
inputTokens: data.inputTokens,
outputTokens: data.outputTokens,
totalTokens: data.inputTokens + data.outputTokens,
estimatedCost: this.estimateCost(data),
// Quality metrics
confidence: data.confidence,
hallucinationScore: data.hallucinationScore,
retrievalRelevance: data.retrievalRelevance,
// Operational metrics
model: data.model,
promptVersion: data.promptVersion,
statusCode: data.statusCode,
isError: data.statusCode >= 400,
errorType: data.errorType || null,
// User metrics
userFeedback: data.userFeedback || null, // 'thumbs_up' | 'thumbs_down' | null
wasEdited: data.wasEdited || false
});
}
estimateCost(data) {
    // Pricing per 1M tokens (illustrative example rates; verify against current provider pricing)
const pricing = {
'gpt-4o': { input: 2.50, output: 10.00 },
'gpt-4o-mini': { input: 0.15, output: 0.60 },
'gpt-4.1': { input: 2.00, output: 8.00 },
'gpt-4.1-mini': { input: 0.40, output: 1.60 }
};
const rates = pricing[data.model] || pricing['gpt-4o'];
return (
(data.inputTokens / 1_000_000) * rates.input +
(data.outputTokens / 1_000_000) * rates.output
);
}
// Generate metrics summary for a time window
getSummary(windowMinutes = 60) {
const cutoff = Date.now() - windowMinutes * 60 * 1000;
const recent = this.metrics.filter(m => m.timestamp >= cutoff);
if (recent.length === 0) return { noData: true };
const avg = (arr) => arr.length > 0
? arr.reduce((a, b) => a + b, 0) / arr.length
: 0;
const p95 = (arr) => {
const sorted = [...arr].sort((a, b) => a - b);
return sorted[Math.floor(sorted.length * 0.95)] || 0;
};
return {
window: `${windowMinutes} minutes`,
totalCalls: recent.length,
callsPerMinute: recent.length / windowMinutes,
// Latency
avgLatencyMs: avg(recent.map(m => m.latencyMs)).toFixed(0),
p95LatencyMs: p95(recent.map(m => m.latencyMs)).toFixed(0),
avgTimeToFirstToken: avg(recent.map(m => m.timeToFirstToken)).toFixed(0),
// Tokens & cost
avgInputTokens: avg(recent.map(m => m.inputTokens)).toFixed(0),
avgOutputTokens: avg(recent.map(m => m.outputTokens)).toFixed(0),
totalCost: recent.reduce((sum, m) => sum + m.estimatedCost, 0).toFixed(4),
avgCostPerCall: avg(recent.map(m => m.estimatedCost)).toFixed(6),
// Errors
errorRate: (recent.filter(m => m.isError).length / recent.length).toFixed(4),
errorsByType: this.groupBy(recent.filter(m => m.isError), 'errorType'),
// Quality
avgConfidence: avg(recent.filter(m => m.confidence != null).map(m => m.confidence)).toFixed(3),
avgHallucinationScore: avg(recent.filter(m => m.hallucinationScore != null).map(m => m.hallucinationScore)).toFixed(3),
avgRetrievalRelevance: avg(recent.filter(m => m.retrievalRelevance != null).map(m => m.retrievalRelevance)).toFixed(3),
// User satisfaction
feedbackCount: recent.filter(m => m.userFeedback != null).length,
thumbsUpRate: recent.filter(m => m.userFeedback === 'thumbs_up').length /
Math.max(recent.filter(m => m.userFeedback != null).length, 1),
editRate: recent.filter(m => m.wasEdited).length / recent.length
};
}
groupBy(items, key) {
return items.reduce((groups, item) => {
const val = item[key] || 'unknown';
groups[val] = (groups[val] || 0) + 1;
return groups;
}, {});
}
}
Metric categories and their targets
| Category | Metric | What It Measures | Typical Target |
|---|---|---|---|
| Latency | Avg response time | User experience | < 2s for chat, < 10s for complex tasks |
| Latency | P95 response time | Worst-case experience | < 5s for chat |
| Latency | Time to first token | Perceived responsiveness | < 500ms |
| Cost | Cost per call | Budget management | Depends on use case |
| Cost | Daily/monthly spend | Budget tracking | Within budget |
| Cost | Tokens per call | Efficiency | Trending stable or down |
| Errors | Error rate | System reliability | < 1% |
| Errors | Rate limit hits | Capacity planning | < 0.1% |
| Quality | Hallucination rate | Answer reliability | < 5% (varies by domain) |
| Quality | Confidence score avg | System certainty | > 0.75 |
| Quality | Retrieval relevance | RAG effectiveness | > 0.70 |
| User | Thumbs up rate | User satisfaction | > 80% |
| User | Edit rate | Answer usability | < 15% |
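The targets in this table can be encoded directly as a threshold check over a metrics summary. A minimal sketch (field names follow the `getSummary()` output above; the thresholds are the table's illustrative defaults, not universal standards):

```javascript
// Illustrative targets keyed by summary field; max/min marks the healthy side.
const targets = [
  { field: 'p95LatencyMs',          max: 5000 },
  { field: 'errorRate',             max: 0.01 },
  { field: 'avgHallucinationScore', max: 0.05 },
  { field: 'avgConfidence',         min: 0.75 },
  { field: 'avgRetrievalRelevance', min: 0.70 }
];

// Returns the list of fields that violate their target.
function checkTargets(summary) {
  const violations = [];
  for (const t of targets) {
    const value = parseFloat(summary[t.field]);
    if (Number.isNaN(value)) continue; // metric not collected this window
    if (t.max != null && value > t.max) violations.push(`${t.field}=${value} > ${t.max}`);
    if (t.min != null && value < t.min) violations.push(`${t.field}=${value} < ${t.min}`);
  }
  return violations;
}

console.log(checkTargets({
  p95LatencyMs: '6200', errorRate: '0.0040',
  avgHallucinationScore: '0.020', avgConfidence: '0.810'
}));
// → [ 'p95LatencyMs=6200 > 5000' ]
```

A check like this is the bridge between a dashboard and the alerting rules in section 6.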
3. Logging: The Foundation of Observability
Every AI call should be logged with full context for debugging and evaluation.
import { randomUUID } from 'crypto';
class AILogger {
constructor(storage) {
this.storage = storage; // Could be file, database, external service
}
// Wrap any LLM call with logging
async loggedCall(callFn, metadata = {}) {
const requestId = randomUUID();
const startTime = Date.now();
const logEntry = {
requestId,
timestamp: new Date().toISOString(),
...metadata,
status: 'pending'
};
try {
// Execute the LLM call
const result = await callFn();
const endTime = Date.now();
// Complete the log entry
Object.assign(logEntry, {
status: 'success',
latencyMs: endTime - startTime,
// Input details
model: metadata.model,
messages: metadata.messages,
temperature: metadata.temperature,
promptVersion: metadata.promptVersion,
// Output details
response: result.choices?.[0]?.message?.content,
finishReason: result.choices?.[0]?.finish_reason,
// Token usage
inputTokens: result.usage?.prompt_tokens,
outputTokens: result.usage?.completion_tokens,
totalTokens: result.usage?.total_tokens,
// Request metadata
systemFingerprint: result.system_fingerprint
});
await this.storage.write(logEntry);
return { result, requestId };
} catch (error) {
Object.assign(logEntry, {
status: 'error',
latencyMs: Date.now() - startTime,
error: {
name: error.name,
message: error.message,
code: error.code || error.status,
type: error.type
}
});
await this.storage.write(logEntry);
throw error; // Re-throw so the caller handles it
}
}
}
// Example storage implementations
class ConsoleStorage {
async write(entry) {
console.log(JSON.stringify(entry));
}
}
class FileStorage {
constructor(filePath) {
this.filePath = filePath;
}
async write(entry) {
const fs = await import('fs/promises');
await fs.appendFile(this.filePath, JSON.stringify(entry) + '\n');
}
}
// Usage
import OpenAI from 'openai';
const openai = new OpenAI();
const logger = new AILogger(new FileStorage('./ai-logs.jsonl'));
const { result, requestId } = await logger.loggedCall(
() => openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0,
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'What is the capital of France?' }
]
}),
{
model: 'gpt-4o',
temperature: 0,
promptVersion: 'v2.1',
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'What is the capital of France?' }
]
}
);
console.log(`Logged as ${requestId}: ${result.choices[0].message.content}`);
What to log (and what NOT to log)
| Always Log | Never Log |
|---|---|
| Request ID, timestamp | Raw user PII (unless required & encrypted) |
| Model, temperature, params | API keys or secrets |
| Input/output token counts | Full conversation history of other users |
| Latency | Internal IP addresses |
| Error details | |
| Prompt version | |
| Finish reason | |
| Confidence scores | |
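One way to enforce the "Never Log" column is to redact entries before they reach storage. A minimal sketch (the key patterns and email masking are illustrative; production systems typically need a dedicated PII-detection step):

```javascript
// Field names that must never reach the log store (illustrative patterns).
const SENSITIVE_KEY = /(api[_-]?key|secret|token|authorization|password)/i;
const EMAIL = /[\w.+-]+@[\w-]+\.[\w.]+/g;

// Recursively redact sensitive keys and mask email-shaped strings.
function redact(value) {
  if (typeof value === 'string') return value.replace(EMAIL, '[redacted-email]');
  if (Array.isArray(value)) return value.map(redact);
  if (value && typeof value === 'object') {
    const out = {};
    for (const [k, v] of Object.entries(value)) {
      out[k] = SENSITIVE_KEY.test(k) ? '[redacted]' : redact(v);
    }
    return out;
  }
  return value;
}

const entry = {
  requestId: 'abc-123',
  apiKey: 'sk-live-xyz',
  messages: [{ role: 'user', content: 'Email me at jane@example.com' }]
};
console.log(JSON.stringify(redact(entry)));
// apiKey becomes "[redacted]"; the email becomes "[redacted-email]"
```

Wiring `redact()` into a storage class's `write()` keeps the policy in one place instead of trusting every call site.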
4. Tracing Multi-Step AI Pipelines
A RAG pipeline or agent involves multiple steps. Tracing ties every step of a request together so you can debug it end-to-end.
import { randomUUID } from 'crypto';
class AITracer {
constructor() {
this.traces = new Map();
}
startTrace(traceId, metadata = {}) {
const trace = {
traceId,
startTime: Date.now(),
metadata,
spans: [],
status: 'in_progress'
};
this.traces.set(traceId, trace);
return trace;
}
startSpan(traceId, spanName, metadata = {}) {
const trace = this.traces.get(traceId);
if (!trace) throw new Error(`Trace ${traceId} not found`);
const span = {
spanId: randomUUID(),
spanName,
startTime: Date.now(),
metadata,
status: 'in_progress'
};
trace.spans.push(span);
return span;
}
endSpan(traceId, spanId, result = {}) {
const trace = this.traces.get(traceId);
const span = trace?.spans.find(s => s.spanId === spanId);
if (!span) return;
span.endTime = Date.now();
span.durationMs = span.endTime - span.startTime;
span.status = 'completed';
span.result = result;
}
endTrace(traceId, result = {}) {
const trace = this.traces.get(traceId);
if (!trace) return;
trace.endTime = Date.now();
trace.totalDurationMs = trace.endTime - trace.startTime;
trace.status = 'completed';
trace.result = result;
return trace;
}
// Get a visual summary of the trace
getTraceSummary(traceId) {
const trace = this.traces.get(traceId);
if (!trace) return null;
const lines = [
`Trace: ${trace.traceId} (${trace.totalDurationMs || '?'}ms total)`,
`Status: ${trace.status}`,
'',
'Steps:'
];
for (const span of trace.spans) {
const pct = trace.totalDurationMs
? ((span.durationMs / trace.totalDurationMs) * 100).toFixed(1)
: '?';
lines.push(
` ${span.spanName}: ${span.durationMs || '?'}ms (${pct}%) — ${span.status}`
);
}
return lines.join('\n');
}
}
// Usage: Trace a full RAG pipeline
import { randomUUID } from 'crypto';
const tracer = new AITracer();
async function tracedRAGPipeline(query) {
const traceId = randomUUID();
tracer.startTrace(traceId, { query });
// Step 1: Embed the query
const embedSpan = tracer.startSpan(traceId, 'embed_query');
const embedding = await embedQuery(query);
tracer.endSpan(traceId, embedSpan.spanId, {
dimensions: embedding.length
});
// Step 2: Retrieve documents
const retrieveSpan = tracer.startSpan(traceId, 'retrieve_documents');
const docs = await vectorDB.search(embedding, { topK: 5 });
tracer.endSpan(traceId, retrieveSpan.spanId, {
docsRetrieved: docs.length,
topRelevance: docs[0]?.score
});
// Step 3: Build prompt
const promptSpan = tracer.startSpan(traceId, 'build_prompt');
const prompt = buildRAGPrompt(query, docs);
tracer.endSpan(traceId, promptSpan.spanId, {
promptTokens: estimateTokens(prompt)
});
// Step 4: Generate answer
const generateSpan = tracer.startSpan(traceId, 'generate_answer');
const answer = await openai.chat.completions.create({
model: 'gpt-4o',
messages: prompt,
temperature: 0
});
tracer.endSpan(traceId, generateSpan.spanId, {
outputTokens: answer.usage?.completion_tokens,
finishReason: answer.choices[0]?.finish_reason
});
// Step 5: Hallucination check
const checkSpan = tracer.startSpan(traceId, 'hallucination_check');
const hallCheck = await hallucinationDetector.check(
answer.choices[0].message.content,
docs
);
tracer.endSpan(traceId, checkSpan.spanId, {
hallucinationScore: hallCheck.score,
flaggedClaims: hallCheck.flaggedClaims?.length || 0
});
// Complete the trace
const trace = tracer.endTrace(traceId, {
answer: answer.choices[0].message.content,
confidence: hallCheck.score < 0.3 ? 'high' : 'low'
});
console.log(tracer.getTraceSummary(traceId));
/*
Trace: abc-123 (2340ms total)
Status: completed
Steps:
embed_query: 120ms (5.1%) — completed
retrieve_documents: 230ms (9.8%) — completed
build_prompt: 15ms (0.6%) — completed
generate_answer: 1450ms (62.0%) — completed
hallucination_check: 525ms (22.4%) — completed
*/
return { trace, answer: answer.choices[0].message.content };
}
5. Tools for AI Observability
Overview of popular tools
| Tool | Best For | Key Feature |
|---|---|---|
| LangSmith | LangChain-based apps | Deep chain tracing, eval suites |
| Weights & Biases | ML experiment tracking | Prompt versioning, eval dashboards |
| Helicone | API-level monitoring | Drop-in proxy, cost tracking |
| Langfuse | Open-source LLM observability | Self-hostable, traces, evals |
| Braintrust | Evaluation & logging | Eval-first approach, logging |
| Custom dashboards | Full control | Exactly what you need, nothing else |
Helicone: Drop-in API proxy
import OpenAI from 'openai';
// Helicone acts as a proxy — just change the base URL
const openai = new OpenAI({
baseURL: 'https://oai.helicone.ai/v1',
defaultHeaders: {
'Helicone-Auth': `Bearer ${process.env.HELICONE_API_KEY}`,
// Custom properties for filtering in the dashboard
'Helicone-Property-Environment': 'production',
'Helicone-Property-Feature': 'document-qa',
'Helicone-Property-PromptVersion': 'v2.1'
}
});
// Use OpenAI as normal — Helicone logs everything automatically
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: 'What is the refund policy?' }]
});
// Helicone dashboard now shows:
// - Request/response logs
// - Latency distribution
// - Token usage over time
// - Cost per feature/environment
// - Error rates
Langfuse: Open-source tracing
import { Langfuse } from 'langfuse';
const langfuse = new Langfuse({
publicKey: process.env.LANGFUSE_PUBLIC_KEY,
secretKey: process.env.LANGFUSE_SECRET_KEY,
baseUrl: process.env.LANGFUSE_BASE_URL // Self-hosted or cloud
});
async function tracedQuery(query) {
// Create a trace
const trace = langfuse.trace({
name: 'rag-query',
metadata: { query }
});
// Span for retrieval
const retrievalSpan = trace.span({ name: 'retrieval' });
const docs = await retrieveDocuments(query);
retrievalSpan.end({ output: { docCount: docs.length } });
// Generation span
const generationSpan = trace.generation({
name: 'llm-generation',
model: 'gpt-4o',
input: [{ role: 'user', content: query }]
});
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: query }]
});
generationSpan.end({
output: response.choices[0].message.content,
usage: {
promptTokens: response.usage.prompt_tokens,
completionTokens: response.usage.completion_tokens
}
});
// Score the trace
trace.score({
name: 'confidence',
value: 0.85
});
// Flush at the end
await langfuse.flushAsync();
}
Custom lightweight monitoring
// When external tools are overkill, build a simple monitor
class SimpleAIMonitor {
constructor() {
this.metrics = {
calls: 0,
errors: 0,
totalLatencyMs: 0,
totalInputTokens: 0,
totalOutputTokens: 0,
totalCost: 0,
feedbackPositive: 0,
feedbackNegative: 0,
confidenceSum: 0,
confidenceCount: 0,
startTime: Date.now()
};
this.recentErrors = [];
}
record(data) {
this.metrics.calls++;
this.metrics.totalLatencyMs += data.latencyMs || 0;
this.metrics.totalInputTokens += data.inputTokens || 0;
this.metrics.totalOutputTokens += data.outputTokens || 0;
this.metrics.totalCost += data.cost || 0;
if (data.isError) {
this.metrics.errors++;
this.recentErrors.push({
timestamp: new Date(),
error: data.errorMessage
});
// Keep only last 100 errors
if (this.recentErrors.length > 100) this.recentErrors.shift();
}
if (data.feedback === 'positive') this.metrics.feedbackPositive++;
if (data.feedback === 'negative') this.metrics.feedbackNegative++;
if (data.confidence != null) {
this.metrics.confidenceSum += data.confidence;
this.metrics.confidenceCount++;
}
}
getSnapshot() {
const m = this.metrics;
const uptimeMinutes = (Date.now() - m.startTime) / 60000;
return {
uptime: `${uptimeMinutes.toFixed(0)} minutes`,
totalCalls: m.calls,
callsPerMinute: (m.calls / uptimeMinutes).toFixed(2),
errorRate: m.calls > 0 ? (m.errors / m.calls * 100).toFixed(2) + '%' : '0%',
avgLatencyMs: m.calls > 0 ? (m.totalLatencyMs / m.calls).toFixed(0) : 0,
totalCost: `$${m.totalCost.toFixed(4)}`,
avgCostPerCall: m.calls > 0 ? `$${(m.totalCost / m.calls).toFixed(6)}` : '$0',
avgConfidence: m.confidenceCount > 0
? (m.confidenceSum / m.confidenceCount).toFixed(3)
: 'N/A',
satisfactionRate: (m.feedbackPositive + m.feedbackNegative) > 0
? (m.feedbackPositive / (m.feedbackPositive + m.feedbackNegative) * 100).toFixed(1) + '%'
: 'No feedback yet',
recentErrors: this.recentErrors.slice(-5)
};
}
}
6. Alerting on Quality Degradation
Monitoring without alerting is a dashboard no one watches. Set up alerts for critical thresholds.
class AIAlertManager {
constructor(notifier) {
this.notifier = notifier; // Slack, PagerDuty, email, etc.
this.rules = [];
this.cooldowns = new Map(); // Prevent alert spam
}
addRule(rule) {
this.rules.push({
name: rule.name,
condition: rule.condition, // Function that returns true if alert should fire
severity: rule.severity, // 'critical' | 'warning' | 'info'
cooldownMinutes: rule.cooldownMinutes || 15,
message: rule.message // Function that returns alert message
});
}
async evaluate(metrics) {
for (const rule of this.rules) {
if (rule.condition(metrics)) {
// Check cooldown
const lastFired = this.cooldowns.get(rule.name);
const now = Date.now();
if (lastFired && (now - lastFired) < rule.cooldownMinutes * 60 * 1000) {
continue; // Still in cooldown
}
// Fire alert
const message = rule.message(metrics);
await this.notifier.send({
severity: rule.severity,
rule: rule.name,
message,
metrics,
timestamp: new Date().toISOString()
});
this.cooldowns.set(rule.name, now);
}
}
}
}
// Configure alerts. Note: the rules below reference fields from both
// AIMetricsCollector.getSummary() and SimpleAIMonitor.getSnapshot();
// match each rule to whichever metrics source you pass into evaluate().
const alertManager = new AIAlertManager({
send: async (alert) => {
console.log(`[${alert.severity.toUpperCase()}] ${alert.rule}: ${alert.message}`);
// In production: send to Slack, PagerDuty, etc.
}
});
// Error rate alert
alertManager.addRule({
name: 'high_error_rate',
severity: 'critical',
cooldownMinutes: 10,
condition: (m) => parseFloat(m.errorRate) > 5,
message: (m) => `Error rate is ${m.errorRate} (threshold: 5%). ` +
`${m.totalCalls} calls in window.`
});
// Latency alert
alertManager.addRule({
name: 'high_latency',
severity: 'warning',
cooldownMinutes: 15,
condition: (m) => parseInt(m.p95LatencyMs) > 5000,
message: (m) => `P95 latency is ${m.p95LatencyMs}ms (threshold: 5000ms). ` +
`Users are experiencing slow responses.`
});
// Hallucination rate alert
alertManager.addRule({
name: 'hallucination_spike',
severity: 'critical',
cooldownMinutes: 30,
condition: (m) => parseFloat(m.avgHallucinationScore) > 0.15,
message: (m) => `Average hallucination score is ${m.avgHallucinationScore} ` +
`(threshold: 0.15). Check prompt changes, model updates, or data drift.`
});
// Cost alert
alertManager.addRule({
name: 'cost_spike',
severity: 'warning',
cooldownMinutes: 60,
condition: (m) => parseFloat(m.totalCost?.replace('$', '')) > 100,
message: (m) => `Hourly cost is ${m.totalCost} (threshold: $100). ` +
`Check for runaway loops or token-heavy prompts.`
});
// Confidence drop alert
alertManager.addRule({
name: 'low_confidence',
severity: 'warning',
cooldownMinutes: 30,
condition: (m) => m.avgConfidence !== 'N/A' && parseFloat(m.avgConfidence) < 0.6,
message: (m) => `Average confidence dropped to ${m.avgConfidence} ` +
`(threshold: 0.6). May indicate retrieval quality issues.`
});
// User satisfaction alert
alertManager.addRule({
name: 'low_satisfaction',
severity: 'warning',
cooldownMinutes: 60,
condition: (m) => {
const rate = parseFloat(m.satisfactionRate);
return !isNaN(rate) && rate < 70;
},
message: (m) => `User satisfaction is ${m.satisfactionRate} ` +
`(threshold: 70%). Investigate recent changes.`
});
// Run evaluation periodically
// const metrics = monitor.getSnapshot();
// await alertManager.evaluate(metrics);
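To close the loop, the commented-out evaluation above typically runs on a timer. A compact self-contained sketch of that wiring (the stub monitor and notifier stand in for the real instances above, and the rule loop is inlined for brevity):

```javascript
// Stub metrics source and in-memory notifier, for illustration only.
const monitor = { getSnapshot: () => ({ errorRate: '7.20%', totalCalls: 50 }) };
const sent = [];
const notifier = { send: (alert) => { sent.push(alert); } }; // real one posts to Slack, PagerDuty, etc.

const rules = [{
  name: 'high_error_rate',
  severity: 'critical',
  condition: (m) => parseFloat(m.errorRate) > 5,
  message: (m) => `Error rate ${m.errorRate} exceeds 5% over ${m.totalCalls} calls`
}];

// One evaluation tick: pull a snapshot, fire any matching rules.
function evaluateOnce() {
  const metrics = monitor.getSnapshot();
  for (const rule of rules) {
    if (rule.condition(metrics)) {
      notifier.send({ rule: rule.name, severity: rule.severity, message: rule.message(metrics) });
    }
  }
}

// In production: setInterval(evaluateOnce, 60_000) or a scheduled job.
evaluateOnce();
console.log(sent[0].message); // "Error rate 7.20% exceeds 5% over 50 calls"
```

The interval length is a tradeoff: short windows catch incidents fast but are noisy on low traffic; a 5-15 minute window is a common starting point.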
7. A/B Testing Prompts and Models in Production
Test prompt and model changes on real traffic before rolling out to everyone.
class AIABTester {
constructor() {
this.experiments = new Map();
}
createExperiment(config) {
const experiment = {
id: config.id,
name: config.name,
variants: config.variants, // [{ name, weight, config }]
metrics: new Map(), // variant -> [metric records]
startTime: Date.now(),
status: 'running'
};
// Normalize weights
const totalWeight = experiment.variants.reduce((s, v) => s + v.weight, 0);
experiment.variants = experiment.variants.map(v => ({
...v,
normalizedWeight: v.weight / totalWeight
}));
for (const variant of experiment.variants) {
experiment.metrics.set(variant.name, []);
}
this.experiments.set(config.id, experiment);
return experiment;
}
// Assign a request to a variant (deterministic by userId for consistency)
assignVariant(experimentId, userId) {
const experiment = this.experiments.get(experimentId);
if (!experiment || experiment.status !== 'running') return null;
// Simple hash-based assignment for consistency
const hash = this.simpleHash(userId + experimentId);
const normalized = (hash % 1000) / 1000; // 0-1
let cumulative = 0;
for (const variant of experiment.variants) {
cumulative += variant.normalizedWeight;
if (normalized < cumulative) {
return variant;
}
}
return experiment.variants[experiment.variants.length - 1];
}
recordMetric(experimentId, variantName, metric) {
const experiment = this.experiments.get(experimentId);
if (!experiment) return;
const variantMetrics = experiment.metrics.get(variantName);
if (variantMetrics) {
variantMetrics.push({
timestamp: Date.now(),
...metric
});
}
}
getResults(experimentId) {
const experiment = this.experiments.get(experimentId);
if (!experiment) return null;
const results = {};
for (const variant of experiment.variants) {
const metrics = experiment.metrics.get(variant.name) || [];
if (metrics.length === 0) {
results[variant.name] = { sampleSize: 0 };
continue;
}
const avg = (arr) => arr.reduce((a, b) => a + b, 0) / arr.length;
results[variant.name] = {
sampleSize: metrics.length,
avgLatencyMs: avg(metrics.map(m => m.latencyMs)).toFixed(0),
        avgConfidence: avg(metrics.map(m => m.confidence).filter(v => v != null)).toFixed(3),
        avgCost: avg(metrics.map(m => m.cost).filter(v => v != null)).toFixed(6),
errorRate: (metrics.filter(m => m.isError).length / metrics.length).toFixed(4),
satisfactionRate: metrics.filter(m => m.feedback === 'positive').length /
Math.max(metrics.filter(m => m.feedback != null).length, 1)
};
}
return {
experimentId,
name: experiment.name,
duration: `${((Date.now() - experiment.startTime) / 3600000).toFixed(1)} hours`,
results
};
}
simpleHash(str) {
let hash = 0;
for (let i = 0; i < str.length; i++) {
hash = ((hash << 5) - hash) + str.charCodeAt(i);
hash = hash & hash; // Convert to 32-bit integer
}
return Math.abs(hash);
}
}
// Usage
const abTester = new AIABTester();
abTester.createExperiment({
id: 'prompt-v3-test',
name: 'Test new system prompt v3 vs v2',
variants: [
{
name: 'control',
weight: 50,
config: {
systemPrompt: 'You are a helpful customer service assistant...',
model: 'gpt-4o'
}
},
{
name: 'treatment',
weight: 50,
config: {
systemPrompt: 'You are an expert customer service agent. Always cite sources...',
model: 'gpt-4o'
}
}
]
});
// For each request:
const variant = abTester.assignVariant('prompt-v3-test', 'user-123');
// Use variant.config to make the LLM call
// Then record the metric:
abTester.recordMetric('prompt-v3-test', variant.name, {
latencyMs: 1200,
confidence: 0.88,
cost: 0.003,
feedback: 'positive'
});
// After sufficient data:
const results = abTester.getResults('prompt-v3-test');
console.log(results);
8. Cost Monitoring and Optimization
AI systems can become expensive fast. Track and optimize costs systematically.
class AICostMonitor {
constructor() {
this.daily = new Map(); // date -> cost breakdown
}
record(data) {
const date = new Date().toISOString().split('T')[0]; // YYYY-MM-DD
if (!this.daily.has(date)) {
this.daily.set(date, {
totalCost: 0,
byModel: {},
byFeature: {},
totalInputTokens: 0,
totalOutputTokens: 0,
callCount: 0
});
}
const day = this.daily.get(date);
day.totalCost += data.cost;
day.totalInputTokens += data.inputTokens;
day.totalOutputTokens += data.outputTokens;
day.callCount++;
// By model
day.byModel[data.model] = (day.byModel[data.model] || 0) + data.cost;
// By feature
if (data.feature) {
day.byFeature[data.feature] = (day.byFeature[data.feature] || 0) + data.cost;
}
}
getDailyReport(date) {
const day = this.daily.get(date);
if (!day) return null;
return {
date,
totalCost: `$${day.totalCost.toFixed(4)}`,
callCount: day.callCount,
avgCostPerCall: `$${(day.totalCost / day.callCount).toFixed(6)}`,
totalInputTokens: day.totalInputTokens.toLocaleString(),
totalOutputTokens: day.totalOutputTokens.toLocaleString(),
byModel: Object.entries(day.byModel)
.sort((a, b) => b[1] - a[1])
.map(([model, cost]) => ({ model, cost: `$${cost.toFixed(4)}` })),
byFeature: Object.entries(day.byFeature)
.sort((a, b) => b[1] - a[1])
.map(([feature, cost]) => ({ feature, cost: `$${cost.toFixed(4)}` }))
};
}
getOptimizationSuggestions(date) {
const day = this.daily.get(date);
if (!day) return [];
const suggestions = [];
// Check if a cheaper model could work
if (day.byModel['gpt-4o'] > day.totalCost * 0.8) {
suggestions.push({
type: 'model_downgrade',
impact: 'HIGH',
suggestion: 'gpt-4o accounts for 80%+ of cost. Test gpt-4o-mini for ' +
'simpler tasks (classification, extraction). Potential 90%+ cost reduction ' +
'on those tasks.'
});
}
// Check token efficiency
const avgInputTokens = day.totalInputTokens / day.callCount;
if (avgInputTokens > 5000) {
suggestions.push({
type: 'prompt_optimization',
impact: 'MEDIUM',
suggestion: `Average input is ${avgInputTokens.toFixed(0)} tokens. ` +
`Review system prompts for verbosity. Consider prompt caching for ` +
`repeated prefixes.`
});
}
// Check output tokens
const avgOutputTokens = day.totalOutputTokens / day.callCount;
if (avgOutputTokens > 2000) {
suggestions.push({
type: 'output_optimization',
impact: 'MEDIUM',
suggestion: `Average output is ${avgOutputTokens.toFixed(0)} tokens. ` +
`Add max_tokens limit. Use structured output to constrain response length.`
});
}
return suggestions;
}
}
Cost optimization strategies
| Strategy | Savings | Tradeoff |
|---|---|---|
| Use smaller models for simple tasks | 90-95% on those tasks | Slightly lower quality |
| Prompt caching (reuse system prompt prefix) | 50-75% on cached tokens | API support required |
| Reduce system prompt length | Proportional to reduction | May reduce instruction quality |
| Set max_tokens | Prevents runaway outputs | May truncate long answers |
| Cache frequent queries | 100% on cache hits | Stale answers if data changes |
| Batch non-urgent requests | Lower per-token pricing | Increased latency |
| Structured output | Reduces unnecessary verbosity | Requires schema design |
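The "Cache frequent queries" row is cheap to prototype. A minimal in-memory sketch (the TTL and key normalization are illustrative; a production cache would usually be shared, e.g. Redis, and would wrap an awaited LLM call):

```javascript
// Wrap an answer function with a TTL cache keyed by the normalized query.
function withCache(answerFn, ttlMs = 5 * 60 * 1000) {
  const cache = new Map(); // normalized query -> { value, expiresAt }
  return (query) => {
    const key = query.trim().toLowerCase();
    const hit = cache.get(key);
    if (hit && hit.expiresAt > Date.now()) return { ...hit.value, cached: true };
    const value = answerFn(query); // in production this is the (awaited) LLM call
    cache.set(key, { value, expiresAt: Date.now() + ttlMs });
    return { ...value, cached: false };
  };
}

// Stub "LLM" that counts how many real calls were made.
let llmCalls = 0;
const answer = withCache((q) => { llmCalls++; return { text: `Answer to: ${q}` }; });

const a1 = answer('What is the refund policy?');
const a2 = answer('what is the refund policy?  '); // normalizes to the same key
console.log(llmCalls, a1.cached, a2.cached); // → 1 false true
```

The tradeoff from the table shows up directly in the TTL: a longer TTL saves more money but serves staler answers when the underlying data changes.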
9. Key Takeaways
- AI observability has four pillars: logs, metrics, traces, and evaluation. Traditional monitoring tells you if the system is running; evaluation tells you if it's running correctly.
- Log every LLM call with full input, output, parameters, token counts, and latency. This is your debugging lifeline and evaluation data source.
- Track six metric categories: latency, cost, errors, quality (hallucination, confidence), retrieval quality, and user satisfaction. Each tells a different story.
- Trace multi-step pipelines end-to-end so you can identify which step is the bottleneck or failure point in RAG and agent systems.
- Set up alerts for quality degradation — hallucination spikes, confidence drops, cost increases, and user satisfaction drops should trigger immediate investigation.
- A/B test every change — new prompts, models, and retrieval strategies should be tested on real traffic before full rollout.
- Monitor costs actively — AI costs can spike unexpectedly. Track by model, feature, and time. Use cheaper models where quality allows.
Explain-It Challenge
- Your AI system has been running for 3 months with no monitoring. The product manager says "it seems fine." Explain why you need observability and what specific failure modes they might be missing.
- You notice latency P95 jumped from 2s to 8s overnight but error rate is unchanged. Walk through your debugging process using traces.
- Your monthly AI bill doubled but call volume only increased 20%. What are the most likely causes and how would your cost monitoring dashboard help you find them?
Navigation: ← 4.14.c — Evaluating Retrieval Quality · 4.14 Overview