Episode 4 — Generative AI Engineering / 4.14 — Evaluating AI Systems
4.14.b — Confidence Scores
In one sentence: Confidence scoring attaches a numeric reliability signal to every AI output — using model self-assessment, log probabilities, and calibration — so your system can auto-approve high-confidence answers and route low-confidence answers to human review.
Navigation: ← 4.14.a — Detecting Hallucinations · 4.14.c — Evaluating Retrieval Quality →
1. What Confidence Scoring Means for AI Outputs
Traditional software either works or throws an error. AI systems exist in a gray zone: the output looks reasonable but might be wrong. Confidence scoring quantifies that uncertainty.
Traditional Software:
Input → Function → Output (deterministic; failures surface as explicit errors)
AI System WITHOUT confidence:
Input → LLM → Output (might be correct, might be hallucinated — who knows?)
AI System WITH confidence:
Input → LLM → { output, confidence: 0.92 }
if (confidence > 0.85) → auto-serve to user
if (confidence 0.5-0.85) → serve with disclaimer
if (confidence < 0.5) → route to human
Confidence scoring is the bridge between "the AI said something" and "we know how much to trust what the AI said."
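The routing logic in the diagram above can be sketched as a tiny gate (the thresholds are illustrative; section 6 builds this out into a full router with metrics):

```javascript
// Minimal confidence gate matching the diagram above (thresholds are illustrative)
function gate(output, confidence) {
  if (confidence > 0.85) return { action: 'auto-serve', output };
  if (confidence >= 0.5) return { action: 'serve-with-disclaimer', output };
  return { action: 'route-to-human', output: null };
}

console.log(gate('Founded in 2015', 0.92).action); // 'auto-serve'
console.log(gate('About $50M', 0.60).action);      // 'serve-with-disclaimer'
console.log(gate('Unsure', 0.30).action);          // 'route-to-human'
```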
2. Model-Generated Confidence: Asking the Model to Self-Assess
The simplest approach: ask the LLM to rate its own confidence. This is surprisingly effective when done correctly.
Basic self-assessment
import OpenAI from 'openai';
const openai = new OpenAI();
async function generateWithConfidence(query, context) {
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0,
messages: [
{
role: 'system',
content: `You are a helpful assistant. Answer the user's question based on
the provided context.
After your answer, assess your own confidence on a scale of 0.0 to 1.0:
- 1.0: The answer is explicitly and completely stated in the context
- 0.8-0.9: The answer is strongly supported by the context with minor inference
- 0.5-0.7: The answer requires significant inference or context is partially relevant
- 0.2-0.4: The answer is weakly supported; context is tangentially relevant
- 0.0-0.1: The context doesn't address the question at all
Return JSON:
{
"answer": "your answer",
"confidence": 0.0 to 1.0,
"confidence_reasoning": "brief explanation of your confidence level",
"source_coverage": "FULL" | "PARTIAL" | "NONE"
}`
},
{
role: 'user',
content: `Context: ${context}\n\nQuestion: ${query}`
}
],
response_format: { type: 'json_object' }
});
return JSON.parse(response.choices[0].message.content);
}
// Example: Well-supported question
const result1 = await generateWithConfidence(
'What year was the company founded?',
'Acme Corp was founded in 2015 by Jane Smith in San Francisco.'
);
console.log(result1);
/*
{
"answer": "Acme Corp was founded in 2015.",
"confidence": 0.95,
"confidence_reasoning": "The founding year is explicitly stated in the context.",
"source_coverage": "FULL"
}
*/
// Example: Poorly-supported question
const result2 = await generateWithConfidence(
'What is the company revenue?',
'Acme Corp was founded in 2015 by Jane Smith in San Francisco.'
);
console.log(result2);
/*
{
"answer": "I don't have information about Acme Corp's revenue in the provided context.",
"confidence": 0.1,
"confidence_reasoning": "The context only mentions founding details, not financial information.",
"source_coverage": "NONE"
}
*/
Multi-dimensional confidence
For complex applications, break confidence into multiple dimensions:
async function generateWithDetailedConfidence(query, context) {
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0,
messages: [
{
role: 'system',
content: `Answer the question and provide multi-dimensional confidence scores.
Return JSON:
{
"answer": "your answer",
"confidence": {
"overall": 0.0 to 1.0,
"factual_accuracy": 0.0 to 1.0,
"completeness": 0.0 to 1.0,
"source_grounding": 0.0 to 1.0,
"reasoning": "brief explanation"
}
}`
},
{
role: 'user',
content: `Context: ${context}\n\nQuestion: ${query}`
}
],
response_format: { type: 'json_object' }
});
return JSON.parse(response.choices[0].message.content);
}
const result = await generateWithDetailedConfidence(
'What are the key benefits and pricing of the enterprise plan?',
'The enterprise plan includes SSO, audit logs, and priority support. Contact sales for pricing.'
);
console.log(result);
/*
{
"answer": "The enterprise plan includes SSO, audit logs, and priority support. Pricing is not publicly listed — you need to contact sales.",
"confidence": {
"overall": 0.75,
"factual_accuracy": 0.95,
"completeness": 0.6,
"source_grounding": 0.9,
"reasoning": "Benefits are well-covered by the source, but pricing info is incomplete (only 'contact sales' available)"
}
}
*/
Limitations of self-assessment
| Issue | Explanation |
|---|---|
| Overconfidence bias | LLMs tend to report higher confidence than warranted, especially on hallucinated content |
| Not calibrated | A model saying 0.9 doesn't mean 90% accuracy without calibration |
| Sycophantic agreement | Models may report high confidence because the question implies an answer |
| Inconsistent scales | Different prompts produce different confidence scales |
3. Log Probabilities as Confidence Signals
A more objective confidence signal: log probabilities (logprobs). These are the model's internal probability assignments to each token it generates. The OpenAI API can return these directly.
What are logprobs?
When the model generates a token, it assigns a probability to every token in its vocabulary. The log probability of the chosen token tells you how confident the model was in that specific choice.
High logprob (close to 0): The model is very confident about this token
Low logprob (very negative): The model is uncertain about this token
logprob = -0.01 → probability ≈ 99% (very confident)
logprob = -0.10 → probability ≈ 90% (confident)
logprob = -0.69 → probability ≈ 50% (coin flip)
logprob = -2.30 → probability ≈ 10% (uncertain)
logprob = -4.60 → probability ≈ 1% (very uncertain)
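The conversion behind this table is just exponentiation, which makes it easy to sanity-check:

```javascript
// Convert a logprob to a probability and back
const logprobToProbability = (logprob) => Math.exp(logprob);
const probabilityToLogprob = (p) => Math.log(p);

console.log(logprobToProbability(-0.01).toFixed(2)); // 0.99
console.log(logprobToProbability(-0.69).toFixed(2)); // 0.50 (coin flip)
console.log(logprobToProbability(-2.30).toFixed(2)); // 0.10
console.log(probabilityToLogprob(0.5).toFixed(2));   // -0.69
```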
Extracting logprobs from the API
import OpenAI from 'openai';
const openai = new OpenAI();
async function getResponseWithLogprobs(query) {
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0,
logprobs: true,
top_logprobs: 5, // Return top 5 alternative tokens at each position
messages: [
{
role: 'system',
content: 'Answer the question concisely.'
},
{ role: 'user', content: query }
]
});
const content = response.choices[0].message.content;
const logprobs = response.choices[0].logprobs?.content || [];
return { content, logprobs };
}
// Analyze logprobs for confidence
function analyzeLogprobs(logprobs) {
if (logprobs.length === 0) return { averageConfidence: 0, tokens: [] };
const tokenAnalysis = logprobs.map(tokenInfo => {
const probability = Math.exp(tokenInfo.logprob); // Convert logprob to probability
const topAlternatives = tokenInfo.top_logprobs?.map(alt => ({
token: alt.token,
probability: Math.exp(alt.logprob)
})) || [];
return {
token: tokenInfo.token,
logprob: tokenInfo.logprob,
probability,
isConfident: probability > 0.8,
topAlternatives
};
});
// Overall confidence: geometric mean of token probabilities
const avgLogprob = tokenAnalysis.reduce(
(sum, t) => sum + t.logprob, 0
) / tokenAnalysis.length;
const averageConfidence = Math.exp(avgLogprob);
// Find low-confidence spans (potential hallucination points)
const lowConfidenceTokens = tokenAnalysis.filter(t => t.probability < 0.5);
return {
averageConfidence,
averageLogprob: avgLogprob,
totalTokens: tokenAnalysis.length,
lowConfidenceCount: lowConfidenceTokens.length,
lowConfidenceTokens: lowConfidenceTokens.map(t => t.token),
tokens: tokenAnalysis
};
}
// Usage
const { content, logprobs } = await getResponseWithLogprobs(
'What is the capital of France?'
);
console.log('Answer:', content);
const analysis = analyzeLogprobs(logprobs);
console.log(`Average confidence: ${(analysis.averageConfidence * 100).toFixed(1)}%`);
console.log(`Low confidence tokens: ${analysis.lowConfidenceCount}`);
// For a factual question like "capital of France",
// expect high confidence (>95%) on key tokens like "Paris"
Using logprobs for token-level hallucination detection
function detectLowConfidenceSpans(logprobs, threshold = 0.5) {
const analysis = analyzeLogprobs(logprobs);
const spans = [];
let currentSpan = null;
for (const [i, token] of analysis.tokens.entries()) {
if (token.probability < threshold) {
// Low-confidence token: open a new span or extend the current one
if (!currentSpan) {
currentSpan = { tokens: [], startIndex: i };
}
currentSpan.tokens.push({
text: token.token,
probability: token.probability,
alternatives: token.topAlternatives.slice(0, 3)
});
} else {
// High confidence — close any open span
if (currentSpan) {
spans.push(currentSpan);
currentSpan = null;
}
}
}
if (currentSpan) spans.push(currentSpan);
return {
spans,
hasLowConfidenceSpans: spans.length > 0,
riskLevel: spans.length === 0 ? 'LOW'
: spans.length <= 2 ? 'MEDIUM'
: 'HIGH'
};
}
Logprobs vs self-assessment
| Factor | Logprobs | Self-Assessment |
|---|---|---|
| Objectivity | Objective (direct from model internals) | Subjective (model's opinion of itself) |
| Granularity | Per-token confidence | Per-response confidence |
| Calibration | Closer to calibrated (but not perfect) | Often overconfident |
| Availability | Only some APIs expose logprobs | Works with any model |
| Interpretability | Hard to aggregate meaningfully | Easy to understand |
| Cost | No extra cost (same API call) | A few extra output tokens for the confidence fields |
4. Calibration: Is 90% Confidence Actually Correct 90% of the Time?
A confidence score is only useful if it's calibrated — meaning a score of 0.9 corresponds to being correct ~90% of the time. Most LLM confidence scores are not well-calibrated out of the box.
Understanding calibration
WELL-CALIBRATED:
When the model says "90% confident" → it's correct ~90% of the time
When the model says "50% confident" → it's correct ~50% of the time
OVERCONFIDENT (common for LLMs):
When the model says "90% confident" → it's correct only ~70% of the time
When the model says "50% confident" → it's correct only ~30% of the time
UNDERCONFIDENT (rare):
When the model says "50% confident" → it's correct ~80% of the time
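A toy illustration with made-up numbers: if the model reported 90% confidence on 100 answers but only 70 were correct, the calibration gap for that bin is 0.20.

```javascript
// Calibration check for a single confidence bin (made-up numbers)
const statedConfidence = 0.9; // model says "90% confident"
const correct = 70;
const total = 100;
const actualAccuracy = correct / total; // 0.7
const calibrationGap = Math.abs(actualAccuracy - statedConfidence);
console.log(calibrationGap.toFixed(2)); // 0.20 → overconfident by 20 points
```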
Building a calibration dataset
async function buildCalibrationDataset(evalQuestions) {
// evalQuestions: [{ question, context, groundTruth }]
const results = [];
for (const item of evalQuestions) {
// Get model answer with confidence
const response = await generateWithConfidence(item.question, item.context);
// Check if answer is correct (automated or human-labeled)
const isCorrect = await checkCorrectness(
response.answer,
item.groundTruth
);
results.push({
question: item.question,
modelConfidence: response.confidence,
isCorrect,
answer: response.answer,
groundTruth: item.groundTruth
});
}
return results;
}
// Compute calibration statistics
function computeCalibration(dataset) {
// Bin predictions by confidence level
const bins = [
{ range: [0.0, 0.2], label: '0-20%', predictions: [] },
{ range: [0.2, 0.4], label: '20-40%', predictions: [] },
{ range: [0.4, 0.6], label: '40-60%', predictions: [] },
{ range: [0.6, 0.8], label: '60-80%', predictions: [] },
{ range: [0.8, 1.0], label: '80-100%', predictions: [] }
];
for (const item of dataset) {
const bin = bins.find(
b => item.modelConfidence >= b.range[0] && item.modelConfidence < b.range[1]
) || bins[bins.length - 1]; // Last bin catches 1.0
bin.predictions.push(item);
}
const calibrationTable = bins.map(bin => {
const count = bin.predictions.length;
const correctCount = bin.predictions.filter(p => p.isCorrect).length;
const actualAccuracy = count > 0 ? correctCount / count : null;
const expectedAccuracy = (bin.range[0] + bin.range[1]) / 2; // bin midpoint, a common approximation of the mean confidence in the bin
return {
bin: bin.label,
count,
expectedAccuracy,
actualAccuracy,
calibrationGap: actualAccuracy !== null
? Math.abs(actualAccuracy - expectedAccuracy)
: null,
isOverconfident: actualAccuracy !== null && actualAccuracy < expectedAccuracy
};
});
// Expected Calibration Error (ECE)
const totalPredictions = dataset.length;
const ece = calibrationTable.reduce((sum, bin) => {
if (bin.count === 0) return sum;
return sum + (bin.count / totalPredictions) * (bin.calibrationGap || 0);
}, 0);
return {
calibrationTable,
ece, // Lower is better. 0 = perfectly calibrated
interpretation: ece < 0.05 ? 'Well calibrated'
: ece < 0.15 ? 'Moderately calibrated'
: 'Poorly calibrated — consider recalibration'
};
}
async function checkCorrectness(modelAnswer, groundTruth) {
// Use an LLM to check semantic equivalence
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
temperature: 0,
messages: [
{
role: 'system',
content: `Determine if the model answer is semantically correct compared to the ground truth.
Return JSON: { "isCorrect": true/false, "explanation": "..." }`
},
{
role: 'user',
content: `Model answer: ${modelAnswer}\nGround truth: ${groundTruth}`
}
],
response_format: { type: 'json_object' }
});
const result = JSON.parse(response.choices[0].message.content);
return result.isCorrect;
}
Recalibration: Platt scaling
If your model is systematically overconfident, apply a simple correction:
// Platt scaling: fit a logistic regression on calibration data
// to map raw confidence → calibrated confidence
class PlattScaler {
constructor() {
this.a = 1; // slope
this.b = 0; // intercept
}
// Fit on calibration dataset
fit(dataset) {
// dataset: [{ modelConfidence, isCorrect }]
// In production, use a proper logistic regression library
// This is a simplified gradient descent implementation
let a = 1, b = 0;
const learningRate = 0.01;
const iterations = 1000;
for (let iter = 0; iter < iterations; iter++) {
let gradA = 0, gradB = 0;
for (const item of dataset) {
const z = a * item.modelConfidence + b;
const calibrated = 1 / (1 + Math.exp(-z));
const error = calibrated - (item.isCorrect ? 1 : 0);
gradA += error * item.modelConfidence;
gradB += error;
}
a -= learningRate * (gradA / dataset.length);
b -= learningRate * (gradB / dataset.length);
}
this.a = a;
this.b = b;
}
// Apply calibration to a raw confidence score
calibrate(rawConfidence) {
const z = this.a * rawConfidence + this.b;
return 1 / (1 + Math.exp(-z));
}
}
// Usage (calibrationDataset is the output of buildCalibrationDataset above)
const scaler = new PlattScaler();
scaler.fit(calibrationDataset);
// Before: raw model confidence = 0.90 (but actually correct 70% of the time)
// After: calibrated confidence = 0.72 (now matches actual accuracy)
const calibrated = scaler.calibrate(0.90);
console.log(`Calibrated confidence: ${calibrated.toFixed(2)}`);
5. Building Confidence into Structured Outputs
Integrate confidence scores directly into your API response schemas:
import { z } from 'zod';
// Define a schema with built-in confidence
const ConfidentResponseSchema = z.object({
answer: z.string().describe('The answer to the user question'),
confidence: z.number().min(0).max(1).describe('Overall confidence score 0-1'),
confidenceBreakdown: z.object({
sourceGrounding: z.number().min(0).max(1)
.describe('How well the answer is supported by source documents'),
completeness: z.number().min(0).max(1)
.describe('How completely the question is answered'),
specificity: z.number().min(0).max(1)
.describe('How specific and precise the answer is')
}),
sources: z.array(z.object({
document: z.string(),
chunk: z.number(),
relevance: z.number().min(0).max(1)
})),
caveats: z.array(z.string())
.describe('Any limitations or uncertainties in the answer')
});
// Generate a response matching this schema
async function generateConfidentResponse(query, context, sources) {
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0,
messages: [
{
role: 'system',
content: `Answer the user's question based on the provided context.
Include confidence scores following these guidelines:
- sourceGrounding: 1.0 if every claim has a direct source quote, lower if inferring
- completeness: 1.0 if the question is fully answered, lower if information gaps exist
- specificity: 1.0 if the answer includes specific details, lower if vague
- overall confidence: weighted average of the three dimensions
List any caveats or limitations honestly.
Return JSON with exactly this shape:
{
"answer": "your answer",
"confidence": 0.0 to 1.0,
"confidenceBreakdown": { "sourceGrounding": 0.0 to 1.0, "completeness": 0.0 to 1.0, "specificity": 0.0 to 1.0 },
"sources": [{ "document": "doc name", "chunk": 0, "relevance": 0.0 to 1.0 }],
"caveats": ["any limitations or uncertainties"]
}`
},
{
role: 'user',
content: `Context:\n${context}\n\nSources:\n${
sources.map((s, i) => `[${i + 1}] ${s.document}: ${s.text}`).join('\n')
}\n\nQuestion: ${query}`
}
],
response_format: { type: 'json_object' }
});
const raw = JSON.parse(response.choices[0].message.content);
// Validate with Zod
const validated = ConfidentResponseSchema.parse(raw);
return validated;
}
6. Thresholding: Auto-Approve, Disclaimer, or Human Review
The real power of confidence scores: automated routing decisions.
class ConfidenceRouter {
constructor(config = {}) {
this.thresholds = {
autoApprove: config.autoApprove ?? 0.85, // ?? (not ||) so an explicit 0 is respected
disclaimer: config.disclaimer ?? 0.60,
humanReview: config.humanReview ?? 0.40
// Below humanReview → refuse to answer
};
this.metrics = {
autoApproved: 0,
withDisclaimer: 0,
sentToHuman: 0,
refused: 0
};
}
route(response) {
const confidence = response.confidence;
if (confidence >= this.thresholds.autoApprove) {
this.metrics.autoApproved++;
return {
action: 'AUTO_APPROVE',
response: response.answer,
metadata: { confidence, route: 'auto' }
};
}
if (confidence >= this.thresholds.disclaimer) {
this.metrics.withDisclaimer++;
return {
action: 'SERVE_WITH_DISCLAIMER',
response: response.answer,
disclaimer: 'This answer may not be fully accurate. Please verify important details.',
caveats: response.caveats,
metadata: { confidence, route: 'disclaimer' }
};
}
if (confidence >= this.thresholds.humanReview) {
this.metrics.sentToHuman++;
return {
action: 'HUMAN_REVIEW',
response: null,
fallbackMessage: 'Let me connect you with a specialist who can help with this.',
queuePayload: {
query: response.query,
aiSuggestedAnswer: response.answer,
confidence,
caveats: response.caveats,
sources: response.sources
},
metadata: { confidence, route: 'human' }
};
}
this.metrics.refused++;
return {
action: 'REFUSE',
response: null,
fallbackMessage: "I don't have enough information to answer that accurately. " +
"Could you try rephrasing or ask about something else?",
metadata: { confidence, route: 'refused' }
};
}
getStats() {
const total = Object.values(this.metrics).reduce((a, b) => a + b, 0);
return {
total,
autoApproveRate: total > 0 ? this.metrics.autoApproved / total : 0,
disclaimerRate: total > 0 ? this.metrics.withDisclaimer / total : 0,
humanReviewRate: total > 0 ? this.metrics.sentToHuman / total : 0,
refusalRate: total > 0 ? this.metrics.refused / total : 0
};
}
}
// Usage
const router = new ConfidenceRouter({
autoApprove: 0.85,
disclaimer: 0.60,
humanReview: 0.40
});
// High-confidence answer: auto-approved
const highConf = router.route({ answer: 'Founded in 2015', confidence: 0.95, caveats: [] });
console.log(highConf.action); // 'AUTO_APPROVE'
// Medium-confidence answer: served with disclaimer
const medConf = router.route({ answer: 'About $50M revenue', confidence: 0.70, caveats: ['Exact figure not in sources'] });
console.log(medConf.action); // 'SERVE_WITH_DISCLAIMER'
// Low-confidence answer: human review
const lowConf = router.route({ answer: 'Maybe 200 employees?', confidence: 0.45, caveats: ['Not directly stated'] });
console.log(lowConf.action); // 'HUMAN_REVIEW'
// Very low confidence: refused
const veryLow = router.route({ answer: 'Unknown', confidence: 0.15, caveats: ['No relevant information'] });
console.log(veryLow.action); // 'REFUSE'
// Check routing distribution
console.log(router.getStats());
Tuning thresholds
┌─────────────────────────────────────────────────────────┐
│ THRESHOLD TUNING — THE TRADEOFF │
│ │
│ Higher auto-approve threshold (e.g., 0.95): │
│ ✓ Fewer wrong answers served automatically │
│ ✗ More answers routed to human (expensive, slow) │
│ Use for: medical, legal, financial │
│ │
│ Lower auto-approve threshold (e.g., 0.70): │
│ ✓ More answers served automatically (faster, cheaper)│
│ ✗ More wrong answers reach users │
│ Use for: casual Q&A, internal tools, brainstorming │
│ │
│ MEASURE to tune: │
│ 1. Sample auto-approved responses for human review │
│ 2. Calculate error rate at each threshold │
│ 3. Plot accuracy vs automation rate │
│ 4. Pick the threshold that matches your risk budget │
└─────────────────────────────────────────────────────────┘
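The "MEASURE to tune" steps above can be sketched as a threshold sweep over a labeled sample (the data and field names here are synthetic):

```javascript
// Sweep candidate auto-approve thresholds over a labeled evaluation sample.
// Each item: { confidence, isCorrect } — e.g. from human review of sampled outputs.
function sweepThresholds(labeledSample, thresholds = [0.7, 0.8, 0.85, 0.9, 0.95]) {
  return thresholds.map(threshold => {
    const autoApproved = labeledSample.filter(s => s.confidence >= threshold);
    const errors = autoApproved.filter(s => !s.isCorrect).length;
    return {
      threshold,
      automationRate: autoApproved.length / labeledSample.length,
      errorRate: autoApproved.length > 0 ? errors / autoApproved.length : 0
    };
  });
}

// Synthetic sample: higher confidence loosely correlates with correctness
const sample = [
  { confidence: 0.96, isCorrect: true },
  { confidence: 0.91, isCorrect: true },
  { confidence: 0.88, isCorrect: false },
  { confidence: 0.82, isCorrect: true },
  { confidence: 0.75, isCorrect: false },
  { confidence: 0.60, isCorrect: false }
];
console.table(sweepThresholds(sample, [0.7, 0.85, 0.95]));
// Pick the highest automation rate whose error rate fits your risk budget
```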
7. Combining Confidence Signals
In production, combine multiple confidence signals for a more robust score:
function computeCompositeConfidence(signals) {
const {
selfAssessedConfidence, // Model's self-reported confidence (0-1)
logprobConfidence, // Average token probability (0-1)
sourceOverlap, // Fraction of claims grounded in sources (0-1)
retrievalRelevance, // Average relevance of retrieved chunks (0-1)
hallucinationScore // From hallucination detection pipeline (0-1, lower=better)
} = signals;
// Weighted combination
const weights = {
selfAssessed: 0.15, // Least reliable on its own
logprob: 0.25, // Objective but noisy
sourceOverlap: 0.30, // Most reliable for RAG
retrievalRelevance: 0.20, // Quality of inputs
hallucination: 0.10 // Penalty signal
};
const composite = (
selfAssessedConfidence * weights.selfAssessed +
logprobConfidence * weights.logprob +
sourceOverlap * weights.sourceOverlap +
retrievalRelevance * weights.retrievalRelevance +
(1 - hallucinationScore) * weights.hallucination
);
// Apply calibration if available
// const calibrated = plattScaler.calibrate(composite);
return {
composite: Math.max(0, Math.min(1, composite)),
signals: {
selfAssessed: selfAssessedConfidence,
logprob: logprobConfidence,
sourceOverlap,
retrievalRelevance,
hallucinationPenalty: hallucinationScore
},
dominantSignal: getDominantSignal(signals, weights)
};
}
function getDominantSignal(signals, weights) {
// Compare raw signal values rather than weighted contributions, so a
// low-weight signal (e.g. hallucination) isn't always reported as weakest
const values = {
selfAssessed: signals.selfAssessedConfidence,
logprob: signals.logprobConfidence,
sourceOverlap: signals.sourceOverlap,
retrievalRelevance: signals.retrievalRelevance,
hallucination: 1 - signals.hallucinationScore
};
const sorted = Object.entries(values).sort((a, b) => a[1] - b[1]);
return {
weakest: sorted[0][0],
weakestScore: sorted[0][1],
strongest: sorted[sorted.length - 1][0],
strongestScore: sorted[sorted.length - 1][1]
};
}
// Usage
const composite = computeCompositeConfidence({
selfAssessedConfidence: 0.90,
logprobConfidence: 0.85,
sourceOverlap: 0.70, // Some claims not grounded
retrievalRelevance: 0.80,
hallucinationScore: 0.20 // Mild hallucination signal
});
console.log(`Composite confidence: ${composite.composite.toFixed(2)}`);
console.log(`Weakest signal: ${composite.dominantSignal.weakest}`);
// Composite: ~0.80, weakest signal: sourceOverlap
8. Key Takeaways
- Self-assessed confidence is the easiest method — ask the model to rate its own certainty — but it's prone to overconfidence and needs calibration.
- Log probabilities provide objective, per-token confidence signals at no extra cost — low logprobs on key tokens are red flags for hallucination.
- Calibration is essential — raw confidence scores are meaningless unless you verify that 90% confidence = 90% accuracy. Build calibration datasets and measure Expected Calibration Error.
- Thresholding enables automation — auto-approve high confidence, add disclaimers for medium, route low confidence to humans. This is the business value of confidence scoring.
- Combine multiple signals for robustness — self-assessment, logprobs, source overlap, retrieval relevance, and hallucination scores together outperform any single signal.
- Tune thresholds to your risk budget — medical apps need high thresholds (0.95+), casual Q&A can tolerate lower (0.70+). Measure error rates at each threshold to find the right balance.
Explain-It Challenge
- A product manager asks "why can't the model just tell us when it's wrong?" Explain the overconfidence problem and why calibration matters.
- You have logprobs showing 99% confidence on the token "Paris" for "capital of France" but 12% confidence on the token "1987" for "company founding year." What does this tell you and what action do you take?
- Your confidence router is auto-approving 80% of responses but human reviewers find 8% of those are wrong. What do you adjust?
Navigation: ← 4.14.a — Detecting Hallucinations · 4.14.c — Evaluating Retrieval Quality →