Episode 4 — Generative AI Engineering / 4.14 — Evaluating AI Systems
4.14.b — Confidence Scores
In one sentence: Confidence scoring attaches a numeric reliability signal to every AI output — using model self-assessment, log probabilities, and calibration — so your system can auto-approve high-confidence answers and route low-confidence answers to human review.
Navigation: ← 4.14.a — Detecting Hallucinations · 4.14.c — Evaluating Retrieval Quality →
1. What Confidence Scoring Means for AI Outputs
Traditional software either works or throws an error. AI systems exist in a gray zone: the output looks reasonable but might be wrong. Confidence scoring quantifies that uncertainty.
Traditional Software:
Input → Function → Output (deterministic; failures surface as explicit errors)
AI System WITHOUT confidence:
Input → LLM → Output (might be correct, might be hallucinated — who knows?)
AI System WITH confidence:
Input → LLM → { output, confidence: 0.92 }
if (confidence > 0.85) → auto-serve to user
if (confidence 0.5-0.85) → serve with disclaimer
if (confidence < 0.5) → route to human
Confidence scoring is the bridge between "the AI said something" and "we know how much to trust what the AI said."
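The routing logic in the diagram above can be sketched as a tiny gate (the thresholds are illustrative; section 6 builds this out into a full router with metrics):

```javascript
// Minimal confidence gate matching the diagram above (thresholds are illustrative)
function gate(output, confidence) {
  if (confidence > 0.85) return { action: 'auto-serve', output };
  if (confidence >= 0.5) return { action: 'serve-with-disclaimer', output };
  return { action: 'route-to-human', output: null };
}

console.log(gate('Founded in 2015', 0.92).action); // 'auto-serve'
console.log(gate('About $50M', 0.60).action);      // 'serve-with-disclaimer'
console.log(gate('Unsure', 0.30).action);          // 'route-to-human'
```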
2. Model-Generated Confidence: Asking the Model to Self-Assess
The simplest approach: ask the LLM to rate its own confidence. This is surprisingly effective when done correctly.
Basic self-assessment
import OpenAI from 'openai';
const openai = new OpenAI();
async function generateWithConfidence(query, context) {
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0,
messages: [
{
role: 'system',
content: `You are a helpful assistant. Answer the user's question based on
the provided context.
After your answer, assess your own confidence on a scale of 0.0 to 1.0:
- 1.0: The answer is explicitly and completely stated in the context
- 0.8-0.9: The answer is strongly supported by the context with minor inference
- 0.5-0.7: The answer requires significant inference or context is partially relevant
- 0.2-0.4: The answer is weakly supported; context is tangentially relevant
- 0.0-0.1: The context doesn't address the question at all
Return JSON:
{
"answer": "your answer",
"confidence": 0.0 to 1.0,
"confidence_reasoning": "brief explanation of your confidence level",
"source_coverage": "FULL" | "PARTIAL" | "NONE"
}`
},
{
role: 'user',
content: `Context: ${context}\n\nQuestion: ${query}`
}
],
response_format: { type: 'json_object' }
});
return JSON.parse(response.choices[0].message.content);
}
// Example: Well-supported question
const result1 = await generateWithConfidence(
'What year was the company founded?',
'Acme Corp was founded in 2015 by Jane Smith in San Francisco.'
);
console.log(result1);
/*
{
"answer": "Acme Corp was founded in 2015.",
"confidence": 0.95,
"confidence_reasoning": "The founding year is explicitly stated in the context.",
"source_coverage": "FULL"
}
*/
// Example: Poorly-supported question
const result2 = await generateWithConfidence(
'What is the company revenue?',
'Acme Corp was founded in 2015 by Jane Smith in San Francisco.'
);
console.log(result2);
/*
{
"answer": "I don't have information about Acme Corp's revenue in the provided context.",
"confidence": 0.1,
"confidence_reasoning": "The context only mentions founding details, not financial information.",
"source_coverage": "NONE"
}
*/
Multi-dimensional confidence
For complex applications, break confidence into multiple dimensions:
async function generateWithDetailedConfidence(query, context) {
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0,
messages: [
{
role: 'system',
content: `Answer the question and provide multi-dimensional confidence scores.
Return JSON:
{
"answer": "your answer",
"confidence": {
"overall": 0.0 to 1.0,
"factual_accuracy": 0.0 to 1.0,
"completeness": 0.0 to 1.0,
"source_grounding": 0.0 to 1.0,
"reasoning": "brief explanation"
}
}`
},
{
role: 'user',
content: `Context: ${context}\n\nQuestion: ${query}`
}
],
response_format: { type: 'json_object' }
});
return JSON.parse(response.choices[0].message.content);
}
const result = await generateWithDetailedConfidence(
'What are the key benefits and pricing of the enterprise plan?',
'The enterprise plan includes SSO, audit logs, and priority support. Contact sales for pricing.'
);
console.log(result);
/*
{
"answer": "The enterprise plan includes SSO, audit logs, and priority support. Pricing is not publicly listed — you need to contact sales.",
"confidence": {
"overall": 0.75,
"factual_accuracy": 0.95,
"completeness": 0.6,
"source_grounding": 0.9,
"reasoning": "Benefits are well-covered by the source, but pricing info is incomplete (only 'contact sales' available)"
}
}
*/
Limitations of self-assessment
| Issue | Explanation |
|---|---|
| Overconfidence bias | LLMs tend to report higher confidence than warranted, especially on hallucinated content |
| Not calibrated | A model saying 0.9 doesn't mean 90% accuracy without calibration |
| Sycophantic agreement | Models may report high confidence because the question implies an answer |
| Inconsistent scales | Different prompts produce different confidence scales |
3. Log Probabilities as Confidence Signals
A more objective confidence signal: log probabilities (logprobs). These are the model's internal probability assignments to each token it generates. The OpenAI API can return these directly.
What are logprobs?
When the model generates a token, it assigns a probability to every token in its vocabulary. The log probability of the chosen token tells you how confident the model was in that specific choice.
High logprob (close to 0): The model is very confident about this token
Low logprob (very negative): The model is uncertain about this token
logprob = -0.01 → probability ≈ 99% (very confident)
logprob = -0.10 → probability ≈ 90% (confident)
logprob = -0.69 → probability ≈ 50% (coin flip)
logprob = -2.30 → probability ≈ 10% (uncertain)
logprob = -4.60 → probability ≈ 1% (very uncertain)
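The conversion behind this table is just exponentiation, which makes it easy to sanity-check:

```javascript
// Convert a logprob to a probability and back
const logprobToProbability = (logprob) => Math.exp(logprob);
const probabilityToLogprob = (p) => Math.log(p);

console.log(logprobToProbability(-0.01).toFixed(2)); // 0.99
console.log(logprobToProbability(-0.69).toFixed(2)); // 0.50 (coin flip)
console.log(logprobToProbability(-2.30).toFixed(2)); // 0.10
console.log(probabilityToLogprob(0.5).toFixed(2));   // -0.69
```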
Extracting logprobs from the API
import OpenAI from 'openai';
const openai = new OpenAI();
async function getResponseWithLogprobs(query) {
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0,
logprobs: true,
top_logprobs: 5, // Return top 5 alternative tokens at each position
messages: [
{
role: 'system',
content: 'Answer the question concisely.'
},
{ role: 'user', content: query }
]
});
const content = response.choices[0].message.content;
const logprobs = response.choices[0].logprobs?.content || [];
return { content, logprobs };
}
// Analyze logprobs for confidence
function analyzeLogprobs(logprobs) {
if (logprobs.length === 0) return { averageConfidence: 0, tokens: [] };
const tokenAnalysis = logprobs.map(tokenInfo => {
const probability = Math.exp(tokenInfo.logprob); // Convert logprob to probability
const topAlternatives = tokenInfo.top_logprobs?.map(alt => ({
token: alt.token,
probability: Math.exp(alt.logprob)
})) || [];
return {
token: tokenInfo.token,
logprob: tokenInfo.logprob,
probability,
isConfident: probability > 0.8,
topAlternatives
};
});
// Overall confidence: geometric mean of token probabilities
const avgLogprob = tokenAnalysis.reduce(
(sum, t) => sum + t.logprob, 0
) / tokenAnalysis.length;
const averageConfidence = Math.exp(avgLogprob);
// Find low-confidence spans (potential hallucination points)
const lowConfidenceTokens = tokenAnalysis.filter(t => t.probability < 0.5);
return {
averageConfidence,
averageLogprob: avgLogprob,
totalTokens: tokenAnalysis.length,
lowConfidenceCount: lowConfidenceTokens.length,
lowConfidenceTokens: lowConfidenceTokens.map(t => t.token),
tokens: tokenAnalysis
};
}
// Usage
const { content, logprobs } = await getResponseWithLogprobs(
'What is the capital of France?'
);
console.log('Answer:', content);
const analysis = analyzeLogprobs(logprobs);
console.log(`Average confidence: ${(analysis.averageConfidence * 100).toFixed(1)}%`);
console.log(`Low confidence tokens: ${analysis.lowConfidenceCount}`);
// For a factual question like "capital of France",
// expect high confidence (>95%) on key tokens like "Paris"
Using logprobs for token-level hallucination detection
function detectLowConfidenceSpans(logprobs, threshold = 0.5) {
const analysis = analyzeLogprobs(logprobs);
const spans = [];
let currentSpan = null;
for (const [i, token] of analysis.tokens.entries()) {
if (token.probability < threshold) {
// Low-confidence token: open a new span or extend the current one
if (!currentSpan) {
currentSpan = { tokens: [], startIndex: i };
}
currentSpan.tokens.push({
text: token.token,
probability: token.probability,
alternatives: token.topAlternatives.slice(0, 3)
});
} else {
// High confidence — close any open span
if (currentSpan) {
spans.push(currentSpan);
currentSpan = null;
}
}
}
if (currentSpan) spans.push(currentSpan);
return {
spans,
hasLowConfidenceSpans: spans.length > 0,
riskLevel: spans.length === 0 ? 'LOW'
: spans.length <= 2 ? 'MEDIUM'
: 'HIGH'
};
}
Logprobs vs self-assessment
| Factor | Logprobs | Self-Assessment |
|---|---|---|
| Objectivity | Objective (direct from model internals) | Subjective (model's opinion of itself) |
| Granularity | Per-token confidence | Per-response confidence |
| Calibration | Closer to calibrated (but not perfect) | Often overconfident |
| Availability | Only some APIs expose logprobs | Works with any model |
| Interpretability | Hard to aggregate meaningfully | Easy to understand |
| Cost | No extra cost (same API call) | A few extra output tokens for the confidence fields |
4. Calibration: Is 90% Confidence Actually Correct 90% of the Time?
A confidence score is only useful if it's calibrated — meaning a score of 0.9 corresponds to being correct ~90% of the time. Most LLM confidence scores are not well-calibrated out of the box.
Understanding calibration
WELL-CALIBRATED:
When the model says "90% confident" → it's correct ~90% of the time
When the model says "50% confident" → it's correct ~50% of the time
OVERCONFIDENT (common for LLMs):
When the model says "90% confident" → it's correct only ~70% of the time
When the model says "50% confident" → it's correct only ~30% of the time
UNDERCONFIDENT (rare):
When the model says "50% confident" → it's correct ~80% of the time
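A toy illustration with made-up numbers: if the model reported 90% confidence on 100 answers but only 70 were correct, the calibration gap for that bin is 0.20.

```javascript
// Calibration check for a single confidence bin (made-up numbers)
const statedConfidence = 0.9; // model says "90% confident"
const correct = 70;
const total = 100;
const actualAccuracy = correct / total; // 0.7
const calibrationGap = Math.abs(actualAccuracy - statedConfidence);
console.log(calibrationGap.toFixed(2)); // 0.20 → overconfident by 20 points
```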
Building a calibration dataset
async function buildCalibrationDataset(evalQuestions) {
// evalQuestions: [{ question, context, groundTruth }]
const results = [];
for (const item of evalQuestions) {
// Get model answer with confidence
const response = await generateWithConfidence(item.question, item.context);
// Check if answer is correct (automated or human-labeled)
const isCorrect = await checkCorrectness(
response.answer,
item.groundTruth
);
results.push({
question: item.question,
modelConfidence: response.confidence,
isCorrect,
answer: response.answer,
groundTruth: item.groundTruth
});
}
return results;
}
// Compute calibration statistics
function computeCalibration(dataset) {
// Bin predictions by confidence level
const bins = [
{ range: [0.0, 0.2], label: '0-20%', predictions: [] },
{ range: [0.2, 0.4], label: '20-40%', predictions: [] },
{ range: [0.4, 0.6], label: '40-60%', predictions: [] },
{ range: [0.6, 0.8], label: '60-80%', predictions: [] },
{ range: [0.8, 1.0], label: '80-100%', predictions: [] }
];
for (const item of dataset) {
const bin = bins.find(
b => item.modelConfidence >= b.range[0] && item.modelConfidence < b.range[1]
) || bins[bins.length - 1]; // Last bin catches 1.0
bin.predictions.push(item);
}
const calibrationTable = bins.map(bin => {
const count = bin.predictions.length;
const correctCount = bin.predictions.filter(p => p.isCorrect).length;
const actualAccuracy = count > 0 ? correctCount / count : null;
const expectedAccuracy = (bin.range[0] + bin.range[1]) / 2; // bin midpoint, a common approximation of the mean confidence in the bin
return {
bin: bin.label,
count,
expectedAccuracy,
actualAccuracy,
calibrationGap: actualAccuracy !== null
? Math.abs(actualAccuracy - expectedAccuracy)
: null,
isOverconfident: actualAccuracy !== null && actualAccuracy < expectedAccuracy
};
});
// Expected Calibration Error (ECE)
const totalPredictions = dataset.length;
const ece = calibrationTable.reduce((sum, bin) => {
if (bin.count === 0) return sum;
return sum + (bin.count / totalPredictions) * (bin.calibrationGap || 0);
}, 0);
return {
calibrationTable,
ece, // Lower is better. 0 = perfectly calibrated
interpretation: ece < 0.05 ? 'Well calibrated'
: ece < 0.15 ? 'Moderately calibrated'
: 'Poorly calibrated — consider recalibration'
};
}
async function checkCorrectness(modelAnswer, groundTruth) {
// Use an LLM to check semantic equivalence
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
temperature: 0,
messages: [
{
role: 'system',
content: `Determine if the model answer is semantically correct compared to the ground truth.
Return JSON: { "isCorrect": true/false, "explanation": "..." }`
},
{
role: 'user',
content: `Model answer: ${modelAnswer}\nGround truth: ${groundTruth}`
}
],
response_format: { type: 'json_object' }
});
const result = JSON.parse(response.choices[0].message.content);
return result.isCorrect;
}
Recalibration: Platt scaling
If your model is systematically overconfident, apply a simple correction:
// Platt scaling: fit a logistic regression on calibration data
// to map raw confidence → calibrated confidence
class PlattScaler {
constructor() {
this.a = 1; // slope
this.b = 0; // intercept
}
// Fit on calibration dataset
fit(dataset) {
// dataset: [{ modelConfidence, isCorrect }]
// In production, use a proper logistic regression library
// This is a simplified gradient descent implementation
let a = 1, b = 0;
const learningRate = 0.01;
const iterations = 1000;
for (let iter = 0; iter < iterations; iter++) {
let gradA = 0, gradB = 0;
for (const item of dataset) {
const z = a * item.modelConfidence + b;
const calibrated = 1 / (1 + Math.exp(-z));
const error = calibrated - (item.isCorrect ? 1 : 0);
gradA += error * item.modelConfidence;
gradB += error;
}
a -= learningRate * (gradA / dataset.length);
b -= learningRate * (gradB / dataset.length);
}
this.a = a;
this.b = b;
}
// Apply calibration to a raw confidence score
calibrate(rawConfidence) {
const z = this.a * rawConfidence + this.b;
return 1 / (1 + Math.exp(-z));
}
}
// Usage (calibrationDataset is the output of buildCalibrationDataset above)
const scaler = new PlattScaler();
scaler.fit(calibrationDataset);
// Before: raw model confidence = 0.90 (but actually correct 70% of the time)
// After: calibrated confidence = 0.72 (now matches actual accuracy)
const calibrated = scaler.calibrate(0.90);
console.log(`Calibrated confidence: ${calibrated.toFixed(2)}`);
5. Building Confidence into Structured Outputs
Integrate confidence scores directly into your API response schemas:
import { z } from 'zod';
// Define a schema with built-in confidence
const ConfidentResponseSchema = z.object({
answer: z.string().describe('The answer to the user question'),
confidence: z.number().min(0).max(1).describe('Overall confidence score 0-1'),
confidenceBreakdown: z.object({
sourceGrounding: z.number().min(0).max(1)
.describe('How well the answer is supported by source documents'),
completeness: z.number().min(0).max(1)
.describe('How completely the question is answered'),
specificity: z.number().min(0).max(1)
.describe('How specific and precise the answer is')
}),
sources: z.array(z.object({
document: z.string(),
chunk: z.number(),
relevance: z.number().min(0).max(1)
})),
caveats: z.array(z.string())
.describe('Any limitations or uncertainties in the answer')
});
// Generate a response matching this schema
async function generateConfidentResponse(query, context, sources) {
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0,
messages: [
{
role: 'system',
content: `Answer the user's question based on the provided context.
Include confidence scores following these guidelines:
- sourceGrounding: 1.0 if every claim has a direct source quote, lower if inferring
- completeness: 1.0 if the question is fully answered, lower if information gaps exist
- specificity: 1.0 if the answer includes specific details, lower if vague
- overall confidence: weighted average of the three dimensions
List any caveats or limitations honestly.
Return JSON with exactly this shape:
{
"answer": "your answer",
"confidence": 0.0 to 1.0,
"confidenceBreakdown": { "sourceGrounding": 0.0 to 1.0, "completeness": 0.0 to 1.0, "specificity": 0.0 to 1.0 },
"sources": [{ "document": "doc name", "chunk": 0, "relevance": 0.0 to 1.0 }],
"caveats": ["any limitations or uncertainties"]
}`
},
{
role: 'user',
content: `Context:\n${context}\n\nSources:\n${
sources.map((s, i) => `[${i + 1}] ${s.document}: ${s.text}`).join('\n')
}\n\nQuestion: ${query}`
}
],
response_format: { type: 'json_object' }
});
const raw = JSON.parse(response.choices[0].message.content);
// Validate with Zod
const validated = ConfidentResponseSchema.parse(raw);
return validated;
}
6. Thresholding: Auto-Approve, Disclaimer, or Human Review
The real power of confidence scores: automated routing decisions.
class ConfidenceRouter {
constructor(config = {}) {
this.thresholds = {
autoApprove: config.autoApprove ?? 0.85, // ?? (not ||) so an explicit 0 is respected
disclaimer: config.disclaimer ?? 0.60,
humanReview: config.humanReview ?? 0.40
// Below humanReview → refuse to answer
};
this.metrics = {
autoApproved: 0,
withDisclaimer: 0,
sentToHuman: 0,
refused: 0
};
}
route(response) {
const confidence = response.confidence;
if (confidence >= this.thresholds.autoApprove) {
this.metrics.autoApproved++;
return {
action: 'AUTO_APPROVE',
response: response.answer,
metadata: { confidence, route: 'auto' }
};
}
if (confidence >= this.thresholds.disclaimer) {
this.metrics.withDisclaimer++;
return {
action: 'SERVE_WITH_DISCLAIMER',
response: response.answer,
disclaimer: 'This answer may not be fully accurate. Please verify important details.',
caveats: response.caveats,
metadata: { confidence, route: 'disclaimer' }
};
}
if (confidence >= this.thresholds.humanReview) {
this.metrics.sentToHuman++;
return {
action: 'HUMAN_REVIEW',
response: null,
fallbackMessage: 'Let me connect you with a specialist who can help with this.',
queuePayload: {
query: response.query,
aiSuggestedAnswer: response.answer,
confidence,
caveats: response.caveats,
sources: response.sources
},
metadata: { confidence, route: 'human' }
};
}
this.metrics.refused++;
return {
action: 'REFUSE',
response: null,
fallbackMessage: "I don't have enough information to answer that accurately. " +
"Could you try rephrasing or ask about something else?",
metadata: { confidence, route: 'refused' }
};
}
getStats() {
const total = Object.values(this.metrics).reduce((a, b) => a + b, 0);
return {
total,
autoApproveRate: total > 0 ? this.metrics.autoApproved / total : 0,
disclaimerRate: total > 0 ? this.metrics.withDisclaimer / total : 0,
humanReviewRate: total > 0 ? this.metrics.sentToHuman / total : 0,
refusalRate: total > 0 ? this.metrics.refused / total : 0
};
}
}
// Usage
const router = new ConfidenceRouter({
autoApprove: 0.85,
disclaimer: 0.60,
humanReview: 0.40
});
// High-confidence answer: auto-approved
const highConf = router.route({ answer: 'Founded in 2015', confidence: 0.95, caveats: [] });
console.log(highConf.action); // 'AUTO_APPROVE'
// Medium-confidence answer: served with disclaimer
const medConf = router.route({ answer: 'About $50M revenue', confidence: 0.70, caveats: ['Exact figure not in sources'] });
console.log(medConf.action); // 'SERVE_WITH_DISCLAIMER'
// Low-confidence answer: human review
const lowConf = router.route({ answer: 'Maybe 200 employees?', confidence: 0.45, caveats: ['Not directly stated'] });
console.log(lowConf.action); // 'HUMAN_REVIEW'
// Very low confidence: refused
const veryLow = router.route({ answer: 'Unknown', confidence: 0.15, caveats: ['No relevant information'] });
console.log(veryLow.action); // 'REFUSE'
// Check routing distribution
console.log(router.getStats());
Tuning thresholds
┌─────────────────────────────────────────────────────────┐
│ THRESHOLD TUNING — THE TRADEOFF │
│ │
│ Higher auto-approve threshold (e.g., 0.95): │
│ ✓ Fewer wrong answers served automatically │
│ ✗ More answers routed to human (expensive, slow) │
│ Use for: medical, legal, financial │
│ │
│ Lower auto-approve threshold (e.g., 0.70): │
│ ✓ More answers served automatically (faster, cheaper)│
│ ✗ More wrong answers reach users │
│ Use for: casual Q&A, internal tools, brainstorming │
│ │
│ MEASURE to tune: │
│ 1. Sample auto-approved responses for human review │
│ 2. Calculate error rate at each threshold │
│ 3. Plot accuracy vs automation rate │
│ 4. Pick the threshold that matches your risk budget │
└─────────────────────────────────────────────────────────┘
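The "MEASURE to tune" steps above can be sketched as a threshold sweep over a labeled sample (the data and field names here are synthetic):

```javascript
// Sweep candidate auto-approve thresholds over a labeled evaluation sample.
// Each item: { confidence, isCorrect } — e.g. from human review of sampled outputs.
function sweepThresholds(labeledSample, thresholds = [0.7, 0.8, 0.85, 0.9, 0.95]) {
  return thresholds.map(threshold => {
    const autoApproved = labeledSample.filter(s => s.confidence >= threshold);
    const errors = autoApproved.filter(s => !s.isCorrect).length;
    return {
      threshold,
      automationRate: autoApproved.length / labeledSample.length,
      errorRate: autoApproved.length > 0 ? errors / autoApproved.length : 0
    };
  });
}

// Synthetic sample: higher confidence loosely correlates with correctness
const sample = [
  { confidence: 0.96, isCorrect: true },
  { confidence: 0.91, isCorrect: true },
  { confidence: 0.88, isCorrect: false },
  { confidence: 0.82, isCorrect: true },
  { confidence: 0.75, isCorrect: false },
  { confidence: 0.60, isCorrect: false }
];
console.table(sweepThresholds(sample, [0.7, 0.85, 0.95]));
// Pick the highest automation rate whose error rate fits your risk budget
```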
7. Combining Confidence Signals
In production, combine multiple confidence signals for a more robust score:
function computeCompositeConfidence(signals) {
const {
selfAssessedConfidence, // Model's self-reported confidence (0-1)
logprobConfidence, // Average token probability (0-1)
sourceOverlap, // Fraction of claims grounded in sources (0-1)
retrievalRelevance, // Average relevance of retrieved chunks (0-1)
hallucinationScore // From hallucination detection pipeline (0-1, lower=better)
} = signals;
// Weighted combination
const weights = {
selfAssessed: 0.15, // Least reliable on its own
logprob: 0.25, // Objective but noisy
sourceOverlap: 0.30, // Most reliable for RAG
retrievalRelevance: 0.20, // Quality of inputs
hallucination: 0.10 // Penalty signal
};
const composite = (
selfAssessedConfidence * weights.selfAssessed +
logprobConfidence * weights.logprob +
sourceOverlap * weights.sourceOverlap +
retrievalRelevance * weights.retrievalRelevance +
(1 - hallucinationScore) * weights.hallucination
);
// Apply calibration if available
// const calibrated = plattScaler.calibrate(composite);
return {
composite: Math.max(0, Math.min(1, composite)),
signals: {
selfAssessed: selfAssessedConfidence,
logprob: logprobConfidence,
sourceOverlap,
retrievalRelevance,
hallucinationPenalty: hallucinationScore
},
dominantSignal: getDominantSignal(signals, weights)
};
}
function getDominantSignal(signals, weights) {
// Compare raw signal values rather than weighted contributions, so a
// low-weight signal (e.g. hallucination) isn't always reported as weakest
const values = {
selfAssessed: signals.selfAssessedConfidence,
logprob: signals.logprobConfidence,
sourceOverlap: signals.sourceOverlap,
retrievalRelevance: signals.retrievalRelevance,
hallucination: 1 - signals.hallucinationScore
};
const sorted = Object.entries(values).sort((a, b) => a[1] - b[1]);
return {
weakest: sorted[0][0],
weakestScore: sorted[0][1],
strongest: sorted[sorted.length - 1][0],
strongestScore: sorted[sorted.length - 1][1]
};
}
// Usage
const composite = computeCompositeConfidence({
selfAssessedConfidence: 0.90,
logprobConfidence: 0.85,
sourceOverlap: 0.70, // Some claims not grounded
retrievalRelevance: 0.80,
hallucinationScore: 0.20 // Mild hallucination signal
});
console.log(`Composite confidence: ${composite.composite.toFixed(2)}`);
console.log(`Weakest signal: ${composite.dominantSignal.weakest}`);
// Composite: ~0.80, weakest signal: sourceOverlap
8. Key Takeaways
- Self-assessed confidence is the easiest method — ask the model to rate its own certainty — but it's prone to overconfidence and needs calibration.
- Log probabilities provide objective, per-token confidence signals at no extra cost — low logprobs on key tokens are red flags for hallucination.
- Calibration is essential — raw confidence scores are meaningless unless you verify that 90% confidence = 90% accuracy. Build calibration datasets and measure Expected Calibration Error.
- Thresholding enables automation — auto-approve high confidence, add disclaimers for medium, route low confidence to humans. This is the business value of confidence scoring.
- Combine multiple signals for robustness — self-assessment, logprobs, source overlap, retrieval relevance, and hallucination scores together outperform any single signal.
- Tune thresholds to your risk budget — medical apps need high thresholds (0.95+), casual Q&A can tolerate lower (0.70+). Measure error rates at each threshold to find the right balance.
Explain-It Challenge
- A product manager asks "why can't the model just tell us when it's wrong?" Explain the overconfidence problem and why calibration matters.
- You have logprobs showing 99% confidence on the token "Paris" for "capital of France" but 12% confidence on the token "1987" for "company founding year." What does this tell you and what action do you take?
- Your confidence router is auto-approving 80% of responses but human reviewers find 8% of those are wrong. What do you adjust?
Navigation: ← 4.14.a — Detecting Hallucinations · 4.14.c — Evaluating Retrieval Quality →