Episode 4 — Generative AI Engineering / 4.14 — Evaluating AI Systems

4.14.a — Detecting Hallucinations

In one sentence: Hallucination detection is the engineering discipline of automatically identifying when an AI system generates information that isn't grounded in its source material — using cross-referencing, consistency checking, NLI models, and human evaluation to catch fabrications before they reach users.

Navigation: ← 4.14 Overview · 4.14.b — Confidence Scores →


1. What Hallucination Detection Means in Production

In 4.1.d, you learned why LLMs hallucinate — they predict statistically plausible text, not verified facts. Now the question shifts from "why does this happen?" to "how do we catch it automatically, at scale, in real time?"

In production, hallucination detection is not a one-time check. It is a continuous pipeline that runs alongside every AI response:

┌─────────────────────────────────────────────────────────────────┐
│  HALLUCINATION DETECTION IN PRODUCTION                          │
│                                                                 │
│  User Query ──► RAG Pipeline ──► LLM Response                   │
│                                       │                         │
│                                       ▼                         │
│                              ┌─────────────────┐                │
│                              │ Detection Layer │                │
│                              │                 │                │
│                              │ 1. Source check │                │
│                              │ 2. Consistency  │                │
│                              │ 3. NLI scoring  │                │
│                              │ 4. Confidence   │                │
│                              └────────┬────────┘                │
│                                       │                         │
│                              ┌────────▼────────┐                │
│                              │  Pass?          │                │
│                              │  YES → Serve    │                │
│                              │  NO  → Fallback │                │
│                              └─────────────────┘                │
└─────────────────────────────────────────────────────────────────┘

The cost of undetected hallucinations ranges from embarrassment (chatbot gives wrong product specs) to lawsuits (medical chatbot fabricates dosages). The cost of detection is a few hundred milliseconds of latency and some compute. The math is clear.
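The serve-or-fallback gate in the diagram can be sketched as a thin wrapper around the pipeline. This is a minimal sketch, not a full implementation: `ragPipeline` and `runDetection` are hypothetical stand-ins for the RAG pipeline and detection layer covered in the rest of this section.

```javascript
// Sketch of the serve-or-fallback gate from the diagram above.
// `ragPipeline` and `runDetection` are hypothetical stand-ins for
// the components described later in this section.
async function serveWithDetection(query, ragPipeline, runDetection, maxScore = 0.3) {
  const { answer, sources } = await ragPipeline(query);
  const detection = await runDetection(answer, sources, query);

  if (detection.overallHallucinationScore <= maxScore) {
    return { status: 'SERVED', answer };              // Pass → serve
  }
  return {
    status: 'FALLBACK',                               // Fail → fallback
    answer: "I couldn't verify that answer against our documents.",
    detection                                         // keep for the review queue
  };
}
```

The key design choice is that detection runs on every response, and the fallback path preserves the detection result so flagged responses can feed the human-review queue described later.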


2. Automated Detection Method 1: Cross-Referencing with Source Documents

The most intuitive approach: compare the AI's claims against the documents it was supposed to use. If the answer says something the source documents don't say, it's likely hallucinated.

How it works

  1. The RAG pipeline retrieves source chunks.
  2. The LLM generates an answer from those chunks.
  3. A separate verification step checks whether each claim in the answer appears in the source chunks.

import OpenAI from 'openai';

const openai = new OpenAI();

// Step 1: The original RAG answer (already generated)
const ragAnswer = `The refund policy allows returns within 30 days 
of purchase. Digital products can be returned within 7 days.`;

const sourceChunks = [
  "Returns are accepted within 30 days of purchase with a valid receipt.",
  "Digital products are non-refundable after download.",
  "Shipping costs are not included in refund amounts."
];

// Step 2: Cross-reference verification
async function crossReferenceCheck(answer, sources) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0,
    messages: [
      {
        role: 'system',
        content: `You are a hallucination detector. Given an AI-generated answer 
and the source documents it was based on, identify every claim in the answer 
and check whether each claim is SUPPORTED, CONTRADICTED, or NOT_FOUND in the sources.

Return JSON:
{
  "claims": [
    {
      "claim": "the specific claim text",
      "verdict": "SUPPORTED" | "CONTRADICTED" | "NOT_FOUND",
      "evidence": "quote from source or null",
      "explanation": "why this verdict"
    }
  ],
  "overall_hallucination": true | false,
  "hallucination_score": 0.0 to 1.0
}`
      },
      {
        role: 'user',
        content: `ANSWER TO VERIFY:
${answer}

SOURCE DOCUMENTS:
${sources.map((s, i) => `[Source ${i + 1}]: ${s}`).join('\n')}`
      }
    ],
    response_format: { type: 'json_object' }
  });

  return JSON.parse(response.choices[0].message.content);
}

// Step 3: Run the check
const result = await crossReferenceCheck(ragAnswer, sourceChunks);
console.log(JSON.stringify(result, null, 2));

/*
{
  "claims": [
    {
      "claim": "Returns allowed within 30 days of purchase",
      "verdict": "SUPPORTED",
      "evidence": "Returns are accepted within 30 days of purchase with a valid receipt.",
      "explanation": "Directly supported by Source 1"
    },
    {
      "claim": "Digital products can be returned within 7 days",
      "verdict": "CONTRADICTED",
      "evidence": "Digital products are non-refundable after download.",
      "explanation": "Source 2 says digital products are non-refundable, 
                       but the answer claims a 7-day return window"
    }
  ],
  "overall_hallucination": true,
  "hallucination_score": 0.5
}
*/

Strengths and limitations

Strength                                  | Limitation
------------------------------------------|---------------------------------------------------------
Directly compares against ground truth    | Requires a second LLM call (cost + latency)
Can pinpoint exactly which claim is wrong | The verifier LLM can itself hallucinate
Works for any domain                      | Only catches hallucinations relative to provided sources
Produces explainable results              | Doesn't catch cases where sources themselves are wrong

3. Automated Detection Method 2: Consistency Checking

Ask the same question multiple ways and check if the answers agree. If the model gives contradictory answers to semantically identical questions, at least one answer is hallucinated.

import OpenAI from 'openai';

const openai = new OpenAI();

async function consistencyCheck(originalQuestion, context, numVariations = 3) {
  // Step 1: Generate question variations
  const variationResponse = await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0.7,
    messages: [
      {
        role: 'system',
        content: `Generate ${numVariations} rephrasings of the given question. 
Each rephrasing should ask the same thing but with different wording.
Return JSON: { "variations": ["q1", "q2", "q3"] }`
      },
      { role: 'user', content: originalQuestion }
    ],
    response_format: { type: 'json_object' }
  });

  const { variations } = JSON.parse(variationResponse.choices[0].message.content);
  const allQuestions = [originalQuestion, ...variations];

  // Step 2: Get answers to each variation
  const answers = await Promise.all(
    allQuestions.map(async (question) => {
      const response = await openai.chat.completions.create({
        model: 'gpt-4o',
        temperature: 0,
        messages: [
          {
            role: 'system',
            content: `Answer the question based ONLY on the provided context.
Context: ${context}`
          },
          { role: 'user', content: question }
        ]
      });
      return {
        question,
        answer: response.choices[0].message.content
      };
    })
  );

  // Step 3: Check consistency across answers
  const consistencyResponse = await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0,
    messages: [
      {
        role: 'system',
        content: `You are a consistency checker. Given multiple question-answer pairs 
that all ask the same thing differently, determine if the answers are consistent.

Return JSON:
{
  "is_consistent": true | false,
  "consistency_score": 0.0 to 1.0,
  "contradictions": ["description of each contradiction found"],
  "consensus_answer": "the most common/reliable answer if consistent"
}`
      },
      {
        role: 'user',
        content: answers
          .map((a, i) => `Q${i + 1}: ${a.question}\nA${i + 1}: ${a.answer}`)
          .join('\n\n')
      }
    ],
    response_format: { type: 'json_object' }
  });

  return JSON.parse(consistencyResponse.choices[0].message.content);
}

// Usage
const result = await consistencyCheck(
  "What is the maximum refund period?",
  "Returns are accepted within 30 days of purchase with a valid receipt."
);

console.log(result);
/*
{
  "is_consistent": true,
  "consistency_score": 0.95,
  "contradictions": [],
  "consensus_answer": "The maximum refund period is 30 days from purchase."
}
*/

When consistency checking shines

  • No source documents available (e.g., the model is answering from training data)
  • Confidence calibration — high consistency usually correlates with correctness
  • Catching model uncertainty — if the model gives different answers each time, it's unsure

When it fails

  • The model can be consistently wrong — all variations produce the same hallucination
  • Expensive: with 3 question variations, one check costs 6 LLM calls (1 to generate variations, 4 to answer them, 1 to compare)
  • Slow: adds significant latency

4. Automated Detection Method 3: NLI (Natural Language Inference) Models

NLI models are specifically trained to determine whether one piece of text entails, contradicts, or is neutral toward another piece of text. They are fast, cheap, and don't require an LLM call.

NLI Classification:
  ENTAILMENT:    "Source says X" → "Answer says X"          ✓ Supported
  CONTRADICTION: "Source says X" → "Answer says NOT X"      ✗ Hallucinated
  NEUTRAL:       "Source says X" → "Answer says Y (unrelated)" ? Uncertain

Using an NLI model for hallucination detection

// Using a lightweight NLI model via an inference API
// (e.g., Hugging Face Inference API with a cross-encoder/nli-deberta model)

async function nliHallucinationCheck(claims, sourceText) {
  const results = [];

  for (const claim of claims) {
    // Call NLI model to classify: does sourceText entail this claim?
    const response = await fetch(
      'https://api-inference.huggingface.co/models/cross-encoder/nli-deberta-v3-large',
      {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${process.env.HF_TOKEN}`,
          'Content-Type': 'application/json'
        },
        body: JSON.stringify({
          // NLI cross-encoders classify a (premise, hypothesis) pair.
          // This assumes the standard text-classification pair format;
          // the exact request shape can vary by deployment.
          inputs: { text: sourceText, text_pair: claim }
        })
      }
    );

    // The API returns { label, score } objects (possibly nested per input).
    // Label order is not guaranteed, so look scores up by label, not position.
    const scores = (await response.json()).flat();
    const byLabel = Object.fromEntries(
      scores.map(s => [s.label.toLowerCase(), s.score])
    );

    results.push({
      claim,
      entailment: byLabel.entailment || 0,
      contradiction: byLabel.contradiction || 0,
      neutral: byLabel.neutral || 0,
      verdict: getVerdict(scores)
    });
  }

  return results;
}

function getVerdict(scores) {
  // Return the highest-scoring label
  if (!Array.isArray(scores) || scores.length === 0) return 'UNKNOWN';
  const best = scores.reduce((a, b) => (b.score > a.score ? b : a));
  return best.label?.toUpperCase() || 'UNKNOWN';
}

// Alternative: Build a local pipeline with transformers.js
// This avoids API calls entirely and runs in Node.js

import { pipeline } from '@xenova/transformers';

async function localNliCheck(premise, hypothesis) {
  const classifier = await pipeline(
    'zero-shot-classification',
    'Xenova/nli-deberta-v3-small'
  );

  // Zero-shot classification runs NLI under the hood. With a bare '{}'
  // template, the candidate "label" is the claim itself, so the score
  // approximates "does the premise entail this claim?"
  const result = await classifier(premise, [hypothesis], {
    hypothesis_template: '{}'
  });

  return {
    hypothesis,
    entailmentScore: result.scores[0]
  };
}

NLI vs LLM-based verification

Factor            | NLI Model                  | LLM Verifier
------------------|----------------------------|------------------------------
Speed             | ~10-50ms per claim         | ~500-2000ms per call
Cost              | Free (local) or very cheap | $0.01-0.10 per verification
Accuracy          | Good for simple claims     | Better for nuanced claims
Explainability    | Scores only                | Can explain reasoning
Scalability       | Excellent (run locally)    | Limited by API rate limits
Complex reasoning | Struggles                  | Handles well

5. Human Evaluation: The Gold Standard

Automated methods catch most hallucinations, but human evaluation remains the gold standard for validating your detection pipeline and catching edge cases.

Sampling strategy

You cannot review every AI response. Instead, sample strategically:

function shouldSampleForReview(response) {
  const rules = [
    // Random sampling: review 2% of all responses
    () => Math.random() < 0.02,

    // Low confidence: always review
    () => response.confidence < 0.7,

    // Flagged by automated detection
    () => response.hallucinationScore > 0.3,

    // High-stakes domains
    () => response.domain === 'medical' || response.domain === 'legal',

    // New prompt versions: review 20% for first 48 hours
    () => response.promptVersion !== response.stablePromptVersion 
          && Math.random() < 0.20,

    // User-reported issues
    () => response.userFeedback === 'thumbs_down'
  ];

  return rules.some(rule => rule());
}

Labeling workflow

┌───────────────────────────────────────────────────────────────┐
│  HUMAN EVALUATION PIPELINE                                    │
│                                                               │
│  1. Sampled responses enter a review queue                    │
│  2. Reviewer sees: query, sources, AI response                │
│  3. Reviewer labels each claim:                               │
│     ✓ Correct  |  ✗ Hallucinated  |  ~ Partially correct     │
│  4. Labels are stored in evaluation database                  │
│  5. Metrics are computed weekly:                              │
│     - Hallucination rate (by category, severity, domain)      │
│     - Agreement between human and automated detection         │
│     - False positive rate of automated pipeline               │
│  6. Disagreements trigger calibration sessions                │
└───────────────────────────────────────────────────────────────┘
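Step 5's agreement metric can be computed directly from paired labels. Below is a minimal sketch of raw agreement plus Cohen's kappa (which corrects for chance agreement); the `{ human, auto }` pair shape is an assumption for illustration, not part of the pipeline above.

```javascript
// Sketch: agreement between human labels and automated verdicts (step 5).
// The pair shape ({ human, auto }) is an illustrative assumption.
function agreementMetrics(pairs) {
  const n = pairs.length;
  if (n === 0) return null;

  // Raw agreement: fraction of items where both raters gave the same label
  const agreement = pairs.filter(p => p.human === p.auto).length / n;

  // Chance agreement from each rater's marginal rate of flagging
  const pHuman = pairs.filter(p => p.human === 'HALLUCINATED').length / n;
  const pAuto = pairs.filter(p => p.auto === 'HALLUCINATED').length / n;
  const chance = pHuman * pAuto + (1 - pHuman) * (1 - pAuto);

  // Kappa corrects agreement for chance: 1 = perfect, 0 = chance level
  const kappa = chance === 1 ? 1 : (agreement - chance) / (1 - chance);
  return { agreement, kappa };
}
```

A low kappa despite decent raw agreement is the signal that triggers the calibration sessions in step 6: the automated detector may simply be mirroring the base rate rather than actually catching hallucinations.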

Building a labeling interface data model

const reviewRecord = {
  id: 'review-001',
  timestamp: '2026-04-11T14:30:00Z',
  
  // The AI interaction being reviewed
  query: 'What is the return policy for electronics?',
  retrievedSources: [
    { docId: 'policy-v3', chunk: 2, text: 'Electronics can be returned within 15 days...' }
  ],
  aiResponse: 'Electronics can be returned within 30 days of purchase.',
  
  // Automated detection results
  automatedScore: { hallucinationScore: 0.6, flaggedClaims: ['30 days'] },
  
  // Human review
  reviewer: 'reviewer-jane',
  humanLabels: [
    {
      claim: 'Electronics can be returned within 30 days',
      label: 'HALLUCINATED',
      severity: 'HIGH',
      correctInfo: 'Policy states 15 days, not 30',
      category: 'FACTUAL_ERROR'
    }
  ],
  overallVerdict: 'HALLUCINATED',
  
  // Meta: did the automated system agree?
  automatedAgreed: true
};

6. Source Attribution as Hallucination Detection

A powerful heuristic: if the model can't point to a specific source for a claim, it's probably hallucinated. Require the model to cite sources for every claim, then verify the citations.

import OpenAI from 'openai';

const openai = new OpenAI();

async function generateWithSourceAttribution(query, sources) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0,
    messages: [
      {
        role: 'system',
        content: `Answer the user's question using ONLY the provided sources. 
For every factual claim in your answer, include a citation in the format [Source N].
If you cannot find information in the sources, say "I don't have information about that."

CRITICAL: Do NOT include any information that isn't directly stated in a source.

Return JSON:
{
  "answer": "Your answer with [Source N] citations inline",
  "citations": [
    {
      "claim": "the specific claim",
      "sourceIndex": 1,
      "quote": "exact quote from source supporting this claim"
    }
  ],
  "ungrounded_statements": ["any statement you included without a source"]
}`
      },
      {
        role: 'user',
        content: `QUESTION: ${query}

SOURCES:
${sources.map((s, i) => `[Source ${i + 1}]: ${s}`).join('\n')}`
      }
    ],
    response_format: { type: 'json_object' }
  });

  const result = JSON.parse(response.choices[0].message.content);

  // Verify citations actually exist in sources
  const verifiedCitations = result.citations.map(citation => {
    const sourceText = sources[citation.sourceIndex - 1] || '';
    const quoteFound = sourceText.toLowerCase().includes(
      citation.quote.toLowerCase().substring(0, 50)
    );
    return {
      ...citation,
      verified: quoteFound,
      warning: quoteFound ? null : 'Citation quote not found in source'
    };
  });

  return {
    ...result,
    citations: verifiedCitations,
    hallucinationFlags: {
      ungroundedCount: result.ungrounded_statements.length,
      unverifiedCitations: verifiedCitations.filter(c => !c.verified).length,
      isLikelyHallucinated: result.ungrounded_statements.length > 0 
        || verifiedCitations.some(c => !c.verified)
    }
  };
}

// Usage
const sources = [
  "The company was founded in 2015 by Sarah Chen in Austin, Texas.",
  "Annual revenue reached $50 million in 2024.",
  "The company has 200 employees across 3 offices."
];

const result = await generateWithSourceAttribution(
  "Tell me about the company's history and size",
  sources
);

console.log(JSON.stringify(result, null, 2));

Why source attribution works

  1. Forces the model to ground claims — the instruction "cite your source" pushes the model toward retrieval rather than generation.
  2. Makes hallucinations visible — unverified citations are red flags.
  3. Users can verify — cited sources allow users to check the answer themselves.
  4. Creates audit trail — every claim traces back to a specific document.

7. Building a Hallucination Detection Pipeline

In production, you combine multiple detection methods into a layered pipeline:

import OpenAI from 'openai';

const openai = new OpenAI();

class HallucinationDetectionPipeline {
  constructor(config = {}) {
    this.thresholds = {
      sourceAttributionMin: config.sourceAttributionMin || 0.8,
      nliEntailmentMin: config.nliEntailmentMin || 0.7,
      consistencyMin: config.consistencyMin || 0.85,
      overallMax: config.overallMax || 0.3  // max acceptable hallucination score
    };
  }

  async detect(response, sources, query) {
    const startTime = Date.now();

    // Layer 1: Source attribution check (fast, always run)
    const attributionResult = await this.checkSourceAttribution(response, sources);

    // Layer 2: NLI check on flagged claims (medium speed, run if Layer 1 flags)
    let nliResult = { score: 0, flaggedClaims: [] };
    if (attributionResult.score > 0.2) {
      nliResult = await this.checkNLI(
        attributionResult.ungroundedClaims, 
        sources.join(' ')
      );
    }

    // Layer 3: Consistency check (slow, run only for high-risk)
    let consistencyResult = { score: 1.0, contradictions: [] };
    if (attributionResult.score > 0.4 || nliResult.score > 0.3) {
      consistencyResult = await this.checkConsistency(query, sources);
    }

    // Combine scores
    const overallScore = this.combineScores(
      attributionResult.score,
      nliResult.score,
      1 - consistencyResult.score
    );

    const latencyMs = Date.now() - startTime;

    return {
      overallHallucinationScore: overallScore,
      isHallucinated: overallScore > this.thresholds.overallMax,
      layers: {
        sourceAttribution: attributionResult,
        nli: nliResult,
        consistency: consistencyResult
      },
      recommendation: this.getRecommendation(overallScore),
      latencyMs
    };
  }

  async checkSourceAttribution(response, sources) {
    // Extract claims from the response
    const claimsResponse = await openai.chat.completions.create({
      model: 'gpt-4o-mini',
      temperature: 0,
      messages: [
        {
          role: 'system',
          content: `Extract all factual claims from this text. Return JSON:
{ "claims": ["claim 1", "claim 2", ...] }`
        },
        { role: 'user', content: response }
      ],
      response_format: { type: 'json_object' }
    });

    const { claims } = JSON.parse(claimsResponse.choices[0].message.content);

    // Check each claim against sources
    const sourcesText = sources.join(' ').toLowerCase();
    const ungroundedClaims = [];
    let groundedCount = 0;

    for (const claim of claims) {
      // Simple keyword overlap check (fast first pass)
      const claimWords = claim.toLowerCase().split(/\s+/).filter(w => w.length > 3);
      const overlapRatio = claimWords.filter(w => sourcesText.includes(w)).length 
                           / claimWords.length;
      
      if (overlapRatio < 0.5) {
        ungroundedClaims.push(claim);
      } else {
        groundedCount++;
      }
    }

    const score = claims.length > 0 
      ? ungroundedClaims.length / claims.length 
      : 0;

    return {
      totalClaims: claims.length,
      groundedClaims: groundedCount,
      ungroundedClaims,
      score // 0 = fully grounded, 1 = fully hallucinated
    };
  }

  async checkNLI(claims, sourceText) {
    // In production, use an actual NLI model here
    // This example uses an LLM as a stand-in
    const flaggedClaims = [];

    for (const claim of claims) {
      const response = await openai.chat.completions.create({
        model: 'gpt-4o-mini',
        temperature: 0,
        messages: [
          {
            role: 'system',
            content: `Determine the relationship between the premise and hypothesis.
Return JSON: { "label": "ENTAILMENT" | "CONTRADICTION" | "NEUTRAL", "confidence": 0-1 }`
          },
          {
            role: 'user',
            content: `Premise: ${sourceText}\nHypothesis: ${claim}`
          }
        ],
        response_format: { type: 'json_object' }
      });

      const result = JSON.parse(response.choices[0].message.content);
      if (result.label === 'CONTRADICTION' || result.label === 'NEUTRAL') {
        flaggedClaims.push({ claim, ...result });
      }
    }

    return {
      score: claims.length > 0 ? flaggedClaims.length / claims.length : 0,
      flaggedClaims
    };
  }

  async checkConsistency(query, sources) {
    // Generate answer twice with different phrasings
    const answers = await Promise.all([
      this.generateAnswer(query, sources),
      this.generateAnswer(`Rephrase and answer: ${query}`, sources)
    ]);

    // Compare answers
    const comparisonResponse = await openai.chat.completions.create({
      model: 'gpt-4o-mini',
      temperature: 0,
      messages: [
        {
          role: 'system',
          content: `Compare these two answers to the same question. 
Return JSON: { "score": 0-1, "contradictions": ["..."] }
Score 1 = fully consistent, 0 = fully contradictory.`
        },
        {
          role: 'user',
          content: `Answer 1: ${answers[0]}\nAnswer 2: ${answers[1]}`
        }
      ],
      response_format: { type: 'json_object' }
    });

    return JSON.parse(comparisonResponse.choices[0].message.content);
  }

  async generateAnswer(query, sources) {
    const response = await openai.chat.completions.create({
      model: 'gpt-4o',
      temperature: 0,
      messages: [
        {
          role: 'system',
          content: `Answer based only on: ${sources.join('\n')}`
        },
        { role: 'user', content: query }
      ]
    });
    return response.choices[0].message.content;
  }

  combineScores(attributionScore, nliScore, inconsistencyScore) {
    // Weighted average — attribution is the most reliable signal.
    // Note: the caller passes (1 - consistency score), so all three
    // inputs point the same way: higher = more likely hallucinated.
    return (
      attributionScore * 0.5 +
      nliScore * 0.3 +
      inconsistencyScore * 0.2
    );
  }

  getRecommendation(score) {
    if (score < 0.1) return 'SERVE — high confidence, no hallucination detected';
    if (score < 0.3) return 'SERVE_WITH_DISCLAIMER — minor concerns, add "verify" note';
    if (score < 0.6) return 'HUMAN_REVIEW — moderate hallucination risk, queue for review';
    return 'BLOCK — high hallucination risk, do not serve to user';
  }
}

// Usage
const pipeline = new HallucinationDetectionPipeline();

const result = await pipeline.detect(
  'The company was founded in 2015 in Austin and generates $50M annually.',
  [
    'The company was founded in 2015 by Sarah Chen in Austin, Texas.',
    'Annual revenue reached $50 million in 2024.'
  ],
  'Tell me about the company'
);

console.log(JSON.stringify(result, null, 2));

8. Metrics: Measuring Hallucination at Scale

Track these metrics to understand your system's hallucination behavior over time:

class HallucinationMetrics {
  constructor() {
    this.records = [];
  }

  record(detection) {
    this.records.push({
      timestamp: new Date(),
      ...detection
    });
  }

  // Core metrics
  getMetrics(timeWindowHours = 24) {
    const cutoff = Date.now() - (timeWindowHours * 60 * 60 * 1000);
    const recent = this.records.filter(r => r.timestamp >= cutoff);

    if (recent.length === 0) return null;

    const hallucinated = recent.filter(r => r.isHallucinated);
    const humanReviewed = recent.filter(r => r.humanLabel !== undefined);

    // Core rate
    const hallucinationRate = hallucinated.length / recent.length;

    // If we have human reviews, calculate detection accuracy
    let falsePositiveRate = null;
    let falseNegativeRate = null;

    if (humanReviewed.length > 0) {
      const falsePositives = humanReviewed.filter(
        r => r.isHallucinated && r.humanLabel === 'NOT_HALLUCINATED'
      );
      const falseNegatives = humanReviewed.filter(
        r => !r.isHallucinated && r.humanLabel === 'HALLUCINATED'
      );

      falsePositiveRate = falsePositives.length / humanReviewed.length;
      falseNegativeRate = falseNegatives.length / humanReviewed.length;
    }

    return {
      totalResponses: recent.length,
      hallucinationRate,
      hallucinatedCount: hallucinated.length,
      
      // Detection accuracy (requires human labels)
      falsePositiveRate,   // flagged as hallucination but wasn't
      falseNegativeRate,   // missed hallucination
      
      // Breakdown by severity
      bySeverity: {
        high: hallucinated.filter(r => r.severity === 'HIGH').length,
        medium: hallucinated.filter(r => r.severity === 'MEDIUM').length,
        low: hallucinated.filter(r => r.severity === 'LOW').length
      },
      
      // Breakdown by detection layer
      byLayer: {
        caughtByAttribution: hallucinated.filter(
          r => r.layers?.sourceAttribution?.score > 0.3
        ).length,
        caughtByNLI: hallucinated.filter(
          r => r.layers?.nli?.score > 0.3
        ).length,
        caughtByConsistency: hallucinated.filter(
          r => r.layers?.consistency?.score < 0.7
        ).length
      },
      
      // Average detection latency
      avgDetectionLatencyMs: recent.reduce(
        (sum, r) => sum + (r.latencyMs || 0), 0
      ) / recent.length
    };
  }
}

// Usage
const metrics = new HallucinationMetrics();

// After each detection
metrics.record({
  isHallucinated: true,
  severity: 'HIGH',
  layers: { /* ... */ },
  latencyMs: 450
});

// Weekly report
const report = metrics.getMetrics(168); // 7 days
console.log(`Hallucination rate: ${(report.hallucinationRate * 100).toFixed(1)}%`);
// falsePositiveRate is null until human-reviewed records exist
if (report.falsePositiveRate !== null) {
  console.log(`False positive rate: ${(report.falsePositiveRate * 100).toFixed(1)}%`);
}

Key metrics to track

Metric                | Formula                                      | Target
----------------------|----------------------------------------------|------------------------------------------------
Hallucination rate    | hallucinated / total responses               | < 5% for most apps, < 0.1% for high-stakes
False positive rate   | flagged but not hallucinated / total flagged | < 10% (too many false alarms = ignored alerts)
False negative rate   | missed hallucinations / total hallucinations | < 5% (the dangerous metric)
Detection latency     | time for detection pipeline                  | < 500ms for real-time, < 5s for async
Severity distribution | count by HIGH/MEDIUM/LOW                     | HIGH should trend toward zero
Layer effectiveness   | % caught by each detection layer             | Helps decide which layers to invest in
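These targets can be enforced in code. Here is a minimal sketch that checks a report (shaped like the output of `getMetrics` above) against the table's thresholds; the threshold values mirror the table and the `checkTargets` helper is illustrative, so tune both for your own application.

```javascript
// Sketch: check a metrics report against the targets in the table above.
// Thresholds mirror the table; `checkTargets` is an illustrative helper.
function checkTargets(report, highStakes = false) {
  const alerts = [];
  const maxHallucinationRate = highStakes ? 0.001 : 0.05;

  if (report.hallucinationRate > maxHallucinationRate) {
    alerts.push('hallucination_rate');
  }
  // False positive/negative rates are null until human labels exist
  if (report.falsePositiveRate !== null && report.falsePositiveRate > 0.10) {
    alerts.push('false_positive_rate');
  }
  if (report.falseNegativeRate !== null && report.falseNegativeRate > 0.05) {
    alerts.push('false_negative_rate');
  }
  if (report.avgDetectionLatencyMs > 500) {
    alerts.push('detection_latency');
  }
  return alerts; // empty array = all targets met
}
```

Wiring this into the weekly report turns the table from documentation into an alerting rule: any non-empty result can page the on-call engineer or open a ticket.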

9. Key Takeaways

  1. Hallucination detection is a pipeline, not a single check — combine source attribution, consistency checking, NLI models, and human review for defense in depth.
  2. Cross-referencing with source documents is the most reliable automated method — if the source doesn't say it, the model shouldn't say it.
  3. Consistency checking catches uncertain hallucinations by asking the same question multiple ways — but consistent doesn't mean correct.
  4. NLI models are fast and cheap for claim-level verification — use them as a first filter before expensive LLM-based checks.
  5. Human evaluation is the gold standard — sample strategically (low confidence, flagged responses, high-stakes domains) rather than reviewing everything.
  6. Source attribution doubles as detection — requiring citations forces grounding and makes hallucinations visible.
  7. Track false negatives aggressively — a hallucination that slips through is far more dangerous than a false alarm.

Explain-It Challenge

  1. Your product manager says "just use temperature 0 and hallucinations go away." Explain why that's wrong and what you actually need.
  2. A colleague proposes using ONLY consistency checking for hallucination detection. What's the critical flaw in this approach?
  3. Your hallucination detection pipeline has a 15% false positive rate. Users are complaining that valid answers are being blocked. How do you reduce false positives without increasing false negatives?

Navigation: ← 4.14 Overview · 4.14.b — Confidence Scores →