Episode 4 — Generative AI Engineering / 4.14 — Evaluating AI Systems
Interview Questions: Evaluating AI Systems
Model answers for hallucination detection, confidence scoring, retrieval quality metrics, and observability/monitoring in production AI systems.
How to use this material (instructions)
- Read lessons in order -- README.md, then 4.14.a -> 4.14.d.
- Practice out loud -- definition -> example -> pitfall.
- Pair with exercises -- 4.14-Exercise-Questions.md.
- Quick review -- 4.14-Quick-Revision.md.
Beginner (Q1-Q4)
Q1. What is hallucination detection and why does it matter in production AI?
Why interviewers ask: Tests whether you understand that deploying an LLM is only half the challenge -- continuously verifying its outputs is the other half.
Model answer:
Hallucination detection is the engineering discipline of automatically identifying when an AI system generates information that is not grounded in its source material. In a RAG pipeline, this means catching cases where the LLM fabricates facts, contradicts retrieved documents, or invents citations that do not exist.
It matters in production because LLM outputs look confident even when wrong. Traditional software fails loudly (a 500 error), but an AI system fails silently -- it produces a fluent, well-structured answer that happens to contain fabricated information. The cost ranges from embarrassment (wrong product specs) to lawsuits (fabricated medical dosages or legal citations).
The three primary automated detection methods are:
- Cross-referencing with source documents -- a second LLM call verifies each claim in the answer against the retrieved chunks
- Consistency checking -- ask the same question multiple ways and check if answers agree
- NLI (Natural Language Inference) models -- lightweight models that classify whether a source text entails, contradicts, or is neutral toward a given claim
In practice, you combine all three into a layered pipeline and supplement with human evaluation on a sampled subset of responses.
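As a concrete illustration of the cheapest layer (cross-referencing), here is a minimal keyword-overlap grounding check. This is a deliberate simplification -- production systems would use an NLI model or a verifier LLM as described above -- and the function name and stopword list are my own, not from any library:

```javascript
// Minimal sketch of a grounding check via keyword overlap (illustrative only).
// A claim whose content words mostly appear in the sources is treated as grounded.
function groundingRatio(claim, sources) {
  const stopwords = new Set(['the', 'a', 'an', 'is', 'are', 'of', 'in', 'to', 'and']);
  const words = claim.toLowerCase().match(/[a-z0-9]+/g) || [];
  const content = words.filter(w => !stopwords.has(w));
  if (content.length === 0) return 1;
  const sourceText = sources.join(' ').toLowerCase();
  const found = content.filter(w => sourceText.includes(w));
  return found.length / content.length;
}

const sources = ['Refunds are available within 30 days of purchase.'];
groundingRatio('Refunds are available within 30 days', sources); // -> 1.0
groundingRatio('Refunds include a $50 bonus voucher', sources);  // low -> flag for review
```

Keyword overlap misses paraphrases and semantic contradictions, which is exactly why the NLI and consistency layers exist on top of it.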
Q2. What is a confidence score and how does it enable routing decisions?
Why interviewers ask: Tests understanding of how to turn uncertain AI outputs into actionable engineering decisions.
Model answer:
A confidence score is a numeric value (0.0 to 1.0) attached to every AI output that quantifies how reliable the system believes the response is. Unlike traditional software where outputs are either correct or an error, AI systems exist in a gray zone -- the output looks reasonable but might be wrong. Confidence scoring quantifies that uncertainty.
The engineering value is in automated routing:
```javascript
const router = new ConfidenceRouter({
  autoApprove: 0.85, // High confidence -> serve directly
  disclaimer: 0.60,  // Medium -> serve with "please verify" note
  humanReview: 0.40  // Low -> route to human agent
  // Below 0.40 -> refuse to answer
});
```
Confidence can come from multiple signals: model self-assessment (ask the LLM to rate itself), log probabilities (the model's internal token-level certainty), source overlap (fraction of claims grounded in retrieved documents), and retrieval relevance (quality of the chunks that were retrieved).
Self-assessment alone is unreliable because LLMs are systematically overconfident -- a model reporting 90% confidence may only be correct 70% of the time. This is why calibration (measuring whether 90% confidence = 90% accuracy on a labeled dataset) and combining multiple signals are essential.
The thresholds depend on domain risk: a medical chatbot might auto-approve only above 0.95, while a casual Q&A tool might use 0.70.
Q3. What are precision@k and recall@k in retrieval evaluation?
Why interviewers ask: These are the foundational metrics for evaluating any retrieval system -- interviewers want to confirm you understand both and can reason about the trade-off between them.
Model answer:
Precision@k measures "of the k documents retrieved, how many were actually relevant?" It is calculated as:
Precision@k = (relevant documents in top-k) / k
Recall@k measures "of all relevant documents in the corpus, how many did we retrieve in the top-k?" It is calculated as:
Recall@k = (relevant documents in top-k) / (total relevant documents)
Example: You retrieve 5 documents. Positions 1, 3, 5 are relevant. There are 4 total relevant documents in the corpus.
- Precision@5 = 3/5 = 0.60 (40% of what we retrieved is noise)
- Recall@5 = 3/4 = 0.75 (we found 75% of all relevant documents)
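The worked example can be checked in a few lines of code (the document IDs here are made up for illustration):

```javascript
// Precision@k: fraction of the top-k that is relevant.
function precisionAtK(retrieved, relevant, k) {
  const topK = retrieved.slice(0, k);
  return topK.filter(d => relevant.includes(d)).length / k;
}

// Recall@k: fraction of all relevant documents found in the top-k.
function recallAtK(retrieved, relevant, k) {
  const topK = retrieved.slice(0, k);
  return topK.filter(d => relevant.includes(d)).length / relevant.length;
}

const retrieved = ['d1', 'd7', 'd3', 'd9', 'd5']; // positions 1, 3, 5 are relevant
const relevant  = ['d1', 'd3', 'd5', 'd4'];       // 4 relevant docs in the corpus
precisionAtK(retrieved, relevant, 5); // -> 0.6
recallAtK(retrieved, relevant, 5);    // -> 0.75
```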
The trade-off is critical for RAG: higher recall means the LLM has access to more of the relevant information, reducing the chance of "I don't know" or hallucinated gap-filling. Higher precision means less noise in the context, so the LLM is less likely to be confused by irrelevant chunks.
For most RAG systems, recall matters more -- missing a relevant document means the LLM cannot produce a correct answer regardless of how powerful it is. Including a few irrelevant documents is less harmful because the LLM can usually ignore them. This is why retrieval quality is called "the ceiling" for RAG answer quality.
Q4. What are the four pillars of AI observability?
Why interviewers ask: Tests whether you understand that AI monitoring goes beyond traditional DevOps -- the fourth pillar (evaluation) is what makes AI observability unique.
Model answer:
The four pillars are logs, metrics, traces, and evaluation.
Logs capture every LLM call with full context: request ID, timestamp, model, temperature, prompt version, input messages, output text, token counts, latency, finish reason, and errors. This is the debugging lifeline.
Metrics aggregate logs into dashboards: latency (avg, P95), error rate, token usage, cost per call, throughput (calls per minute). These tell you if the system is running.
Traces connect multi-step pipelines. A RAG request involves embedding, retrieval, prompt construction, generation, and hallucination checking -- each as a separate "span." Tracing shows which step is the bottleneck and where failures propagate.
Evaluation is the AI-specific pillar. It measures whether the system is running correctly: hallucination rate, confidence score distributions, retrieval quality metrics (precision@k, recall@k, MRR), and user satisfaction (thumbs up/down rate). Traditional monitoring catches crashes; evaluation catches confident-sounding wrong answers.
The key insight: a traditional API returns correct data or throws an error. An AI system returns plausible-looking data that might be wrong, at variable cost, with non-deterministic latency. Without the fourth pillar, you cannot distinguish between "working" and "working correctly."
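To make the first pillar concrete, a structured log record for one LLM call might look like this -- the field names follow the list above but are illustrative, not a standard schema:

```javascript
// One structured log record per LLM call (field names are illustrative).
const logEntry = {
  requestId: 'req-123',            // placeholder ID
  timestamp: new Date().toISOString(),
  model: 'gpt-4o-mini',
  temperature: 0.0,
  promptVersion: 'support-v12',
  inputMessages: [{ role: 'user', content: 'How do I reset my password?' }],
  outputText: 'Go to Settings, then Security, then Reset password.',
  tokens: { prompt: 412, completion: 58 },
  latencyMs: 840,
  finishReason: 'stop',
  error: null
};
```

Every field here feeds a later pillar: token counts and latency roll up into metrics, the request ID joins spans into traces, and the input/output pair is what evaluation runs against.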
Intermediate (Q5-Q8)
Q5. Compare NLI-based hallucination detection with LLM-based verification. When would you choose each?
Why interviewers ask: Tests practical engineering judgment -- picking the right tool for the trade-off between speed, cost, and accuracy.
Model answer:
NLI (Natural Language Inference) models are specialized classifiers trained to determine whether a premise entails, contradicts, or is neutral toward a hypothesis. They are small (e.g., DeBERTa-based), run locally or via a lightweight API, and produce results in 10-50ms per claim.
LLM-based verification uses a second LLM call to analyze whether claims in the answer are supported by source documents. It can explain its reasoning, handle nuanced or multi-hop claims, and produce detailed verdicts with evidence quotes. It takes 500-2000ms and costs $0.01-$0.10 per verification.
| Factor | NLI Model | LLM Verifier |
|---|---|---|
| Speed | 10-50ms per claim | 500-2000ms per call |
| Cost | Free locally or very cheap | $0.01-$0.10 per verification |
| Accuracy | Good for simple, single-hop claims | Better for nuanced, multi-hop claims |
| Explainability | Scores only (entailment/contradiction/neutral) | Can explain reasoning and quote evidence |
| Scalability | Excellent (runs locally, no rate limits) | Limited by API rate limits |
| Complex reasoning | Struggles with multi-step inference | Handles well |
When to choose NLI: Use as a fast first filter in high-volume systems. Run NLI on all responses, and escalate only flagged ones to the slower LLM verifier. This saves cost and latency on the 80%+ of responses that are clearly grounded.
When to choose LLM verification: Use for high-stakes domains (medical, legal, financial) where you need explanations for flagged claims, or when claims require multi-hop reasoning ("The refund applies because the customer is in the EU AND purchased within 14 days").
In production, use both in a layered pipeline: NLI first (fast, cheap, catches obvious issues), then LLM verification on anything NLI flags as uncertain or contradictory.
Q6. Explain MRR and NDCG. When is each metric most useful?
Why interviewers ask: Tests depth of knowledge beyond basic precision/recall -- these ranking-aware metrics are essential for production retrieval evaluation.
Model answer:
MRR (Mean Reciprocal Rank) measures how high the first relevant result appears across a set of queries. For each query, the reciprocal rank is 1 / position_of_first_relevant_result. MRR is the average across all queries.
Query 1: first relevant at position 1 -> RR = 1/1 = 1.00
Query 2: first relevant at position 4 -> RR = 1/4 = 0.25
Query 3: first relevant at position 2 -> RR = 1/2 = 0.50
MRR = (1.00 + 0.25 + 0.50) / 3 = 0.583
MRR is most useful when you only need one good result -- like a RAG system where the first relevant chunk often suffices to answer the question. It ignores the quality of results beyond the first relevant one.
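The three-query example above, computed directly (this helper takes the 1-based position of the first relevant result per query):

```javascript
// MRR: average of 1 / (position of first relevant result) across queries.
function meanReciprocalRank(firstRelevantPositions) {
  const sum = firstRelevantPositions
    .map(pos => 1 / pos)
    .reduce((a, b) => a + b, 0);
  return sum / firstRelevantPositions.length;
}

meanReciprocalRank([1, 4, 2]); // -> (1.00 + 0.25 + 0.50) / 3 ≈ 0.583
```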
NDCG (Normalized Discounted Cumulative Gain) measures the quality of the entire ranking, accounting for graded relevance (not just binary relevant/irrelevant). It rewards putting the most relevant documents at the top and penalizes burying them lower.
DCG@k = sum over positions i: relevance(i) / log2(i + 1)
NDCG@k = DCG@k / IDCG@k (IDCG is the DCG of the ideal ranking)
NDCG is most useful when ranking order matters across multiple results -- like a system that injects the top-5 chunks into the prompt and benefits from having the most relevant chunks first (since LLMs pay more attention to content at the beginning and end of context).
Practical rule: Use MRR when your system relies on the single best result. Use NDCG when your system uses multiple retrieved results and their ordering affects quality.
Q7. How do you build a calibration dataset and fix an overconfident model?
Why interviewers ask: Tests understanding of a critical production problem -- confidence scores are worthless without calibration.
Model answer:
A calibration dataset is a set of questions with verified ground-truth answers, used to measure whether the model's reported confidence matches its actual accuracy.
Building the dataset:
- Collect 200+ questions covering all categories your system handles
- For each question, get the model's answer with its confidence score
- Compare each answer against ground truth (automated or human-labeled) to determine correctness
- Bin predictions by confidence (0-20%, 20-40%, 40-60%, 60-80%, 80-100%)
- For each bin, compute actual accuracy
```javascript
// If the 80-100% confidence bin contains 180 predictions
// and 150 are correct:
const actualAccuracy = 150 / 180; // 83.3%
const expectedAccuracy = 0.90;    // midpoint of 80-100%
// Gap: model says ~90% but is only 83.3% accurate -> overconfident
```
Measuring calibration: Compute the Expected Calibration Error (ECE) -- the weighted average of the gap between expected and actual accuracy across all bins. ECE < 0.05 is well-calibrated; ECE > 0.15 is poorly calibrated and needs correction.
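A minimal ECE computation over pre-binned results might look like this (the bins are assumed to be pre-aggregated from the calibration dataset; the function name is mine):

```javascript
// ECE: weighted average gap between each bin's mean reported confidence
// and its actual accuracy, weighted by how many predictions fall in the bin.
function expectedCalibrationError(bins) {
  const total = bins.reduce((n, b) => n + b.count, 0);
  return bins.reduce(
    (ece, b) => ece + (b.count / total) * Math.abs(b.meanConfidence - b.accuracy),
    0
  );
}

const bins = [
  { meanConfidence: 0.50, accuracy: 0.48, count: 20 },
  { meanConfidence: 0.70, accuracy: 0.61, count: 60 },
  { meanConfidence: 0.90, accuracy: 0.833, count: 180 } // the overconfident bin above
];
expectedCalibrationError(bins); // ≈ 0.069 -> above the 0.05 "well-calibrated" bar
```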
Fixing overconfidence with Platt scaling: Fit a logistic regression on the calibration dataset that maps raw confidence to calibrated confidence. After fitting:
```javascript
// Before: raw confidence 0.90 (actually correct 70% of the time)
// After: calibrated = sigmoid(a * 0.90 + b) = 0.72
// Now 0.72 matches actual accuracy
const scaler = new PlattScaler();
scaler.fit(calibrationDataset);            // [{ modelConfidence, isCorrect }]
const calibrated = scaler.calibrate(0.90); // -> 0.72
```
The key insight: recalibrate whenever you change the model, prompt, or retrieval strategy -- each change shifts the relationship between reported confidence and actual accuracy.
Q8. Design a confidence-based routing system for a customer support chatbot.
Why interviewers ask: Tests practical system design -- turning confidence scores into business value through automated routing.
Model answer:
The routing system uses a ConfidenceRouter with three thresholds, yielding four actions, tuned for customer support risk:
Thresholds:
- Auto-approve (>= 0.85): Answer served directly to user. For customer support, 0.85 is appropriate because wrong answers cause frustration and support tickets, but are not life-threatening.
- Serve with disclaimer (0.60-0.85): Answer served with "This information may not be fully accurate. If this doesn't resolve your issue, please contact a human agent." Caveat list displayed.
- Human review (0.40-0.60): Answer NOT served. User sees "Let me connect you with a specialist." The AI's suggested answer and sources are queued for human agent review.
- Refuse (< 0.40): No answer generated. "I don't have enough information to help with that. Let me connect you with a team member."
```javascript
class ConfidenceRouter {
  constructor() {
    this.thresholds = { autoApprove: 0.85, disclaimer: 0.60, humanReview: 0.40 };
    this.metrics = { autoApproved: 0, withDisclaimer: 0, sentToHuman: 0, refused: 0 };
  }

  route(response) {
    const { confidence } = response;
    if (confidence >= this.thresholds.autoApprove) {
      this.metrics.autoApproved++;
      return { action: 'AUTO_APPROVE', response: response.answer };
    }
    if (confidence >= this.thresholds.disclaimer) {
      this.metrics.withDisclaimer++;
      return {
        action: 'SERVE_WITH_DISCLAIMER',
        response: response.answer,
        disclaimer: 'This may not be fully accurate. Contact support if needed.'
      };
    }
    if (confidence >= this.thresholds.humanReview) {
      this.metrics.sentToHuman++;
      return {
        action: 'HUMAN_REVIEW',
        fallbackMessage: 'Let me connect you with a specialist.',
        queuePayload: { query: response.query, suggestedAnswer: response.answer, confidence }
      };
    }
    this.metrics.refused++;
    return {
      action: 'REFUSE',
      fallbackMessage: "I don't have enough information. Let me connect you with a team member."
    };
  }
}
```
Tuning approach: Sample 5% of auto-approved responses for human review. If error rate exceeds the risk budget (e.g., 3%), raise the auto-approve threshold. Track the automation rate (percentage auto-approved) -- typically 60-75% for customer support. Below 50% means the system is too conservative; above 85% means thresholds are probably too permissive.
Combining signals: Don't rely on a single confidence source. Compute a composite from self-assessment (weight 0.15), logprobs (0.25), source overlap (0.30), and retrieval relevance (0.30). This is more robust than any individual signal.
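The composite score is a plain weighted sum; a sketch using the weights suggested above (the weights are a starting point to tune against a calibration dataset, not a standard):

```javascript
// Composite confidence from four signals, each already normalized to 0-1.
function compositeConfidence({ selfAssessment, logprobs, sourceOverlap, retrievalRelevance }) {
  return selfAssessment * 0.15
       + logprobs * 0.25
       + sourceOverlap * 0.30
       + retrievalRelevance * 0.30;
}

compositeConfidence({
  selfAssessment: 0.95,    // LLM rates itself highly (often inflated)
  logprobs: 0.80,
  sourceOverlap: 0.60,     // only 60% of claims grounded in sources
  retrievalRelevance: 0.70
}); // -> ≈ 0.73
```

Note how the low weight on self-assessment limits the damage when the model is overconfident: a 0.95 self-rating still yields a composite in the disclaimer tier here.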
Advanced (Q9-Q11)
Q9. Design a multi-layer hallucination detection pipeline for a production RAG system.
Why interviewers ask: Tests end-to-end system design ability -- combining multiple detection methods with engineering trade-offs of latency, cost, and accuracy.
Model answer:
The pipeline has three layers, each progressively more expensive and accurate. Later layers only run when earlier layers flag concerns.
Layer 1 -- Source attribution check (always runs, ~100ms):
Extract claims from the response using gpt-4o-mini. For each claim, compute keyword overlap with source documents. If overlap ratio < 0.5, flag the claim as ungrounded. This is fast and cheap -- it catches obvious hallucinations like fabricated numbers or names not present in any source.
Layer 2 -- NLI check (runs if Layer 1 flags > 20% of claims, ~200ms): Run flagged claims through an NLI model (e.g., DeBERTa cross-encoder). Classify each as ENTAILMENT, CONTRADICTION, or NEUTRAL. This catches semantic contradictions that keyword overlap misses ("15 days" vs "30 days" both contain "days" but NLI catches the contradiction).
Layer 3 -- Consistency check (runs if Layer 1 > 40% OR Layer 2 > 30%, ~1500ms): Generate the answer twice with different question phrasings. Compare for consistency. This is the most expensive but catches cases where the model is confidently hallucinating the same wrong answer.
```javascript
class HallucinationPipeline {
  async detect(response, sources, query) {
    // Layer 1: Fast source attribution (always)
    const attribution = await this.checkSourceAttribution(response, sources);

    // Layer 2: NLI (only if Layer 1 flags concerns)
    let nli = { score: 0 };
    if (attribution.score > 0.2) {
      nli = await this.checkNLI(attribution.ungroundedClaims, sources.join(' '));
    }

    // Layer 3: Consistency (only if high risk)
    let consistency = { score: 1.0 };
    if (attribution.score > 0.4 || nli.score > 0.3) {
      consistency = await this.checkConsistency(query, sources);
    }

    // Weighted combination
    const overall = attribution.score * 0.5 + nli.score * 0.3 + (1 - consistency.score) * 0.2;
    return {
      score: overall,
      isHallucinated: overall > 0.3,
      recommendation: overall < 0.1 ? 'SERVE'
        : overall < 0.3 ? 'SERVE_WITH_DISCLAIMER'
        : overall < 0.6 ? 'HUMAN_REVIEW' : 'BLOCK'
    };
  }
}
```
Metrics to track: Hallucination rate (target < 5%), false positive rate (target < 10%), false negative rate (target < 5%), and average detection latency (target < 500ms for real-time). Use human evaluation on a sampled subset (2% random + 100% of flagged responses) to measure detection accuracy and recalibrate thresholds weekly.
Q10. You are building a retrieval evaluation framework from scratch. Walk through the complete process.
Why interviewers ask: Tests the ability to set up systematic measurement infrastructure -- most teams skip this and pay for it later.
Model answer:
The framework has four components: evaluation dataset, metric implementation, evaluation runner, and reporting/alerting.
Component 1 -- Evaluation dataset (minimum 200 queries):
Build using three strategies combined:
- Manual labeling (50 queries): Domain experts write questions and label which document chunks answer them. Highest quality, covers edge cases.
- LLM-generated (100 queries): Feed each chunk to GPT-4o and ask it to generate 3 questions the chunk answers. Scale quickly, but review for quality.
- User query logs (50+ queries): Sample real production queries and have reviewers label which retrieved chunks were actually relevant. Most realistic.
Each entry: { query, relevantDocIds, expectedAnswer, relevanceScores (0-3 per doc), difficulty, category }.
Component 2 -- Metric implementation:
```javascript
class RetrievalEvaluator {
  precisionAtK(retrieved, relevant, k) {
    const topK = retrieved.slice(0, k);
    return topK.filter(d => relevant.includes(d)).length / k;
  }

  recallAtK(retrieved, relevant, k) {
    if (relevant.length === 0) return 0;
    const topK = retrieved.slice(0, k);
    return topK.filter(d => relevant.includes(d)).length / relevant.length;
  }

  reciprocalRank(retrieved, relevant) {
    for (let i = 0; i < retrieved.length; i++) {
      if (relevant.includes(retrieved[i])) return 1 / (i + 1);
    }
    return 0;
  }

  ndcgAtK(retrieved, relevanceScores, k) {
    const topK = retrieved.slice(0, k);
    let dcg = 0;
    for (let i = 0; i < topK.length; i++) {
      dcg += (relevanceScores.get(topK[i]) || 0) / Math.log2(i + 2);
    }
    const idealScores = Array.from(relevanceScores.values()).sort((a, b) => b - a).slice(0, k);
    let idcg = 0;
    for (let i = 0; i < idealScores.length; i++) {
      idcg += idealScores[i] / Math.log2(i + 2);
    }
    return idcg > 0 ? dcg / idcg : 0;
  }
}
```
Component 3 -- Evaluation runner:
Run on every change to the retrieval pipeline (new embedding model, chunking strategy, re-ranking model). Also run weekly on production data to detect drift. Compare results against baseline using A/B test methodology.
Component 4 -- Reporting and alerting:
Track metrics over time. Alert if any metric drops more than 10% from baseline. Report includes per-category breakdown (which types of queries are weakest), per-difficulty breakdown, and trends over the past 4 weeks.
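The "drops more than 10% from baseline" rule can be sketched as a small comparison helper (the function name and metric keys are illustrative):

```javascript
// Return the names of metrics whose relative drop from baseline exceeds the threshold.
function metricsNeedingAlert(baseline, current, dropThreshold = 0.10) {
  return Object.keys(baseline).filter(metric => {
    const drop = (baseline[metric] - current[metric]) / baseline[metric];
    return drop > dropThreshold;
  });
}

const baselineMetrics = { 'precision@5': 0.72, 'recall@5': 0.81, mrr: 0.66 };
const currentMetrics  = { 'precision@5': 0.70, 'recall@5': 0.68, mrr: 0.65 };
metricsNeedingAlert(baselineMetrics, currentMetrics); // -> ['recall@5'] (a ~16% drop)
```

Using a relative drop rather than an absolute one keeps the alert meaningful across metrics with very different baselines.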
End-to-end vs component evaluation: Always evaluate both. Component evaluation (retrieval only) isolates retrieval problems. End-to-end evaluation (retrieval + generation) catches issues in the full pipeline. When they disagree, the pattern tells you where to invest: good retrieval + bad answers = fix the prompt; bad retrieval + good answers = lucky but fragile.
Q11. Your production AI system's hallucination rate jumped from 5% to 18% overnight. Walk through your debugging process.
Why interviewers ask: Tests production incident response -- the ability to diagnose complex AI system failures using observability data.
Model answer:
This is a severity-critical incident. I follow a structured debugging process using the four observability pillars.
Step 1 -- Confirm the signal (5 minutes): Check if the metric is real or a measurement artifact. Did the hallucination detection pipeline itself change? Was there a scoring threshold update? Check the volume of evaluated responses -- a small sample size can cause spikes.
Step 2 -- Correlate with operational changes (10 minutes):
Check the deployment log: was a new prompt version deployed? Was the model changed or updated (provider-side model update)? Was the retrieval index rebuilt? Was a new data source ingested? Check promptVersion and model fields in the logs -- if the spike correlates exactly with a deployment, you have a likely cause.
Step 3 -- Analyze the traces (20 minutes): Pull a sample of 20-30 hallucinated responses from the spike window. For each, examine the full trace:
```javascript
// Check retrieval quality for hallucinated responses
const hallucinatedTraces = traces.filter(t => t.hallucinationScore > 0.3);
for (const trace of hallucinatedTraces.slice(0, 20)) {
  console.log('Query:', trace.query);
  console.log('Retrieval relevance:', trace.spans.retrieve.relevanceScores);
  console.log('Top doc score:', trace.spans.retrieve.topScore);
  console.log('Prompt version:', trace.metadata.promptVersion);
  console.log('Model:', trace.metadata.model);
  console.log('Answer:', trace.answer.substring(0, 200));
  console.log('---');
}
```
Look for patterns:
- All hallucinated responses have low retrieval scores: Retrieval is broken. Check embedding model, vector DB connection, index corruption.
- Retrieval is fine but answers are wrong: The LLM is not following instructions. Check for prompt regression, model change, or temperature accidentally set above 0.
- Specific category spiking: A new document was ingested with bad content (context poisoning) or a specific document was deleted that many queries depend on.
- Random distribution: The model provider may have silently updated the model (check system_fingerprint).
Step 4 -- Root cause and fix (varies): Once identified, fix the root cause: rollback the prompt, revert the index, fix the broken retrieval connection, or adjust detection thresholds. Then run the evaluation suite to confirm the fix before redeploying.
Step 5 -- Post-incident: Add a regression test for this failure mode to the evaluation dataset. Set up a tighter alert threshold if the existing one was too slow to catch the issue. Document the incident for the team.
Quick-fire
| # | Question | One-line answer |
|---|---|---|
| 1 | What is hallucination detection? | Automatically identifying when an AI generates info not grounded in its source material |
| 2 | Name three hallucination detection methods | Cross-referencing with sources, consistency checking, NLI models |
| 3 | What is a confidence score? | A 0-1 numeric value quantifying how reliable an AI output is |
| 4 | What are logprobs? | The model's internal probability assignments to each generated token -- objective confidence signal |
| 5 | What does "calibration" mean for confidence? | 90% confidence should mean ~90% actual accuracy -- measured on labeled data |
| 6 | What is Precision@k? | Fraction of retrieved documents in top-k that are relevant -- measures noise |
| 7 | What is Recall@k? | Fraction of all relevant documents that appear in top-k -- measures coverage |
| 8 | What does MRR measure? | How high the first relevant result appears -- average of reciprocal ranks across queries |
| 9 | What does NDCG measure? | Ranking quality with graded relevance -- rewards putting the best results highest |
| 10 | What are the four pillars of AI observability? | Logs, metrics, traces, evaluation -- evaluation is the AI-specific pillar |
| 11 | Why is retrieval quality called "the ceiling"? | The LLM can only work with what retrieval gives it -- wrong chunks = wrong answers |
<- Back to 4.14 -- Evaluating AI Systems (README)