4.14 — Evaluating AI Systems: Quick Revision
Compact cheat sheet. Print-friendly.
How to use this material
- Skim before labs or interviews.
- Drill gaps -- reopen README.md -> 4.14.a...4.14.d.
- Practice -- 4.14-Exercise-Questions.md.
- Polish answers -- 4.14-Interview-Questions.md.
Core vocabulary
| Term | One-liner |
|---|---|
| Hallucination | AI generates information not grounded in its source material -- looks confident but is fabricated |
| Hallucination detection | Engineering discipline of automatically catching fabricated claims in AI outputs |
| Cross-referencing | Verify each claim in the AI answer against the retrieved source documents |
| Consistency checking | Ask the same question multiple ways and check if answers agree |
| NLI (Natural Language Inference) | Lightweight model that classifies whether a premise entails, contradicts, or is neutral toward a claim |
| Confidence score | 0.0-1.0 value quantifying how reliable an AI output is believed to be |
| Self-assessment | Asking the LLM to rate its own confidence -- simple but prone to overconfidence |
| Log probabilities (logprobs) | Model's internal probability assignments per generated token -- objective confidence signal |
| Calibration | Verifying that 90% confidence = ~90% actual accuracy on a labeled dataset |
| Platt scaling | Logistic regression to remap raw confidence to calibrated confidence |
| ECE (Expected Calibration Error) | Weighted average of the gap between expected and actual accuracy across confidence bins |
| Precision@k | Fraction of retrieved docs in top-k that are relevant -- measures noise |
| Recall@k | Fraction of all relevant docs that appear in top-k -- measures coverage |
| MRR (Mean Reciprocal Rank) | Average of 1/position_of_first_relevant_result across queries -- measures ranking quality |
| NDCG (Normalized Discounted Cumulative Gain) | Measures ranking quality with graded relevance (0-3 scale, not just binary) |
| Faithfulness | Whether the answer uses only information from retrieved context (no hallucination) |
| Relevance | Whether the answer actually addresses the user's question |
| Source attribution | Requiring the model to cite specific sources for every claim -- doubles as hallucination detection |
| Observability | Logging, tracing, metrics, and evaluation -- the four pillars for AI systems |
| Trace | End-to-end record of a multi-step pipeline, with timing per span |
| Span | A single step within a trace (e.g., embed_query, retrieve, generate) |
Hallucination detection strategies
┌─────────────────────────────────────────────────────────────────┐
│ LAYERED DETECTION PIPELINE │
│ │
│ Layer 1: Source attribution (always, ~100ms) │
│ Extract claims -> keyword overlap with sources │
│ Flags obvious fabrications (names, numbers not in sources) │
│ │
│ Layer 2: NLI check (if Layer 1 flags > 20%, ~200ms) │
│ Run flagged claims through NLI model │
│ Catches semantic contradictions ("15 days" vs "30 days") │
│ │
│ Layer 3: Consistency check (if high risk, ~1500ms) │
│ Ask same question multiple ways, compare answers │
│ Catches confident-but-wrong hallucinations │
│ │
│ Combine: overall = 0.5*attribution + 0.3*nli + 0.2*consistency│
│ │
│ < 0.1 -> SERVE │
│ < 0.3 -> SERVE_WITH_DISCLAIMER │
│ < 0.6 -> HUMAN_REVIEW │
│ >= 0.6 -> BLOCK │
└─────────────────────────────────────────────────────────────────┘
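The combine-and-route step of the pipeline above can be sketched as follows, assuming each layer has already produced a 0-1 risk score (the function and field names here are illustrative, not a real API):

```javascript
// Combine per-layer hallucination risk scores and route the response.
function routeResponse({ attribution, nli, consistency }) {
  // Weighted combination from the pipeline diagram above.
  const overall = 0.5 * attribution + 0.3 * nli + 0.2 * consistency;
  if (overall < 0.1) return { overall, action: 'SERVE' };
  if (overall < 0.3) return { overall, action: 'SERVE_WITH_DISCLAIMER' };
  if (overall < 0.6) return { overall, action: 'HUMAN_REVIEW' };
  return { overall, action: 'BLOCK' };
}

// Example: mild attribution flags, small NLI signal, clean consistency check.
routeResponse({ attribution: 0.3, nli: 0.1, consistency: 0 });
// overall = 0.18 -> SERVE_WITH_DISCLAIMER
```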
Method comparison
| Method | Speed | Cost | Best For | Weakness |
|---|---|---|---|---|
| Cross-referencing | 500-2000ms | LLM call | Pinpointing which claim is wrong | Verifier LLM can itself hallucinate |
| Consistency checking | 2000-5000ms | Multiple LLM calls | Catching model uncertainty | Consistently wrong hallucinations pass |
| NLI models | 10-50ms | Free (local) | Fast first filter at scale | Struggles with complex multi-hop claims |
| Source attribution | 0ms (built into prompt) | No extra cost | Forcing grounding + audit trail | Model can fabricate citations |
| Human evaluation | Minutes-hours | Expensive | Gold standard, catches edge cases | Doesn't scale, only for sampling |
Human evaluation sampling rules
Sample a response for human review when any of the following conditions holds:
Math.random() < 0.02                                              // 2% random baseline
response.confidence < 0.7                                         // low confidence
response.hallucinationScore > 0.3                                 // flagged by detection pipeline
response.domain === 'medical' || response.domain === 'legal'      // high-stakes domain
response.promptVersion !== stableVersion && Math.random() < 0.2   // new prompt version, 20% sample
response.userFeedback === 'thumbs_down'                           // explicit negative feedback
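The sampling rules above can be folded into a single predicate. This is one possible shape, not a canonical API -- the field names mirror the rules, and the random source is injectable so the function is testable:

```javascript
// Decide whether a response should be routed to human evaluation.
function shouldSampleForHumanEval(response, stableVersion, rand = Math.random) {
  return (
    rand() < 0.02 ||                          // 2% random baseline
    response.confidence < 0.7 ||              // low confidence
    response.hallucinationScore > 0.3 ||      // flagged by detection pipeline
    response.domain === 'medical' ||          // high-stakes domains
    response.domain === 'legal' ||
    (response.promptVersion !== stableVersion && rand() < 0.2) || // new prompt, 20% sample
    response.userFeedback === 'thumbs_down'   // explicit negative feedback
  );
}
```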
Confidence score implementation
Signal sources
| Signal | Weight | Objectivity | Notes |
|---|---|---|---|
| Self-assessment | 0.15 | Subjective | Prone to overconfidence; cheapest to implement |
| Log probabilities | 0.25 | Objective | Per-token; no extra cost; not all APIs expose them |
| Source overlap | 0.30 | Objective | Most reliable for RAG; fraction of claims grounded in sources |
| Retrieval relevance | 0.20 | Objective | Quality of input chunks; bad retrieval = bad answers |
| Hallucination penalty | 0.10 | Objective | From detection pipeline; penalizes flagged content |
Composite formula
composite = selfAssessed * 0.15
+ logprob * 0.25
+ sourceOverlap * 0.30
+ retrievalRelevance * 0.20
+ (1 - hallucinationScore) * 0.10
Logprob reference
logprob = -0.01 -> probability ~99% (very confident)
logprob = -0.10 -> probability ~90% (confident)
logprob = -0.69 -> probability ~50% (coin flip)
logprob = -2.30 -> probability ~10% (uncertain)
logprob = -4.60 -> probability ~1% (very uncertain)
Convert: probability = Math.exp(logprob)
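To turn per-token logprobs into a single 0-1 confidence signal, a common approach is to average the logprobs and exponentiate -- this yields the geometric mean of the token probabilities:

```javascript
// Average per-token logprobs, then convert back to a probability.
function logprobConfidence(tokenLogprobs) {
  const avg = tokenLogprobs.reduce((a, b) => a + b, 0) / tokenLogprobs.length;
  return Math.exp(avg);
}

logprobConfidence([-0.01, -0.10, -0.05]); // ≈ 0.948 (confident)
```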
Routing thresholds by domain
| Domain | Auto-approve | Disclaimer | Human review | Refuse |
|---|---|---|---|---|
| Medical / Legal | >= 0.95 | 0.80-0.95 | 0.60-0.80 | < 0.60 |
| Customer support | >= 0.85 | 0.60-0.85 | 0.40-0.60 | < 0.40 |
| Internal tools | >= 0.75 | 0.50-0.75 | 0.30-0.50 | < 0.30 |
| Casual Q&A | >= 0.70 | 0.40-0.70 | 0.20-0.40 | < 0.20 |
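The threshold table maps naturally to a lookup plus a cascade of comparisons. The numbers below come from the table; the domain keys are illustrative:

```javascript
// Per-domain confidence thresholds (from the table above).
const THRESHOLDS = {
  medical:  { approve: 0.95, disclaimer: 0.80, review: 0.60 },
  support:  { approve: 0.85, disclaimer: 0.60, review: 0.40 },
  internal: { approve: 0.75, disclaimer: 0.50, review: 0.30 },
  casual:   { approve: 0.70, disclaimer: 0.40, review: 0.20 },
};

function route(domain, confidence) {
  const t = THRESHOLDS[domain];
  if (confidence >= t.approve) return 'AUTO_APPROVE';
  if (confidence >= t.disclaimer) return 'DISCLAIMER';
  if (confidence >= t.review) return 'HUMAN_REVIEW';
  return 'REFUSE';
}

route('support', 0.72); // 'DISCLAIMER'
```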
Calibration check
Step 1: Collect 200+ questions with ground-truth answers
Step 2: Get model confidence for each
Step 3: Bin by confidence (0-20%, 20-40%, 40-60%, 60-80%, 80-100%)
Step 4: Compute actual accuracy per bin
Step 5: ECE = weighted avg of |actual - expected| across bins
ECE < 0.05 = well calibrated
ECE > 0.15 = poorly calibrated -> apply Platt scaling
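The five calibration steps can be sketched as a single function. This is a minimal version, assuming the input is an array of `{ confidence, correct }` pairs from a labeled eval set:

```javascript
// Expected Calibration Error over equal-width confidence bins.
function expectedCalibrationError(samples, numBins = 5) {
  const bins = Array.from({ length: numBins }, () => ({ n: 0, conf: 0, acc: 0 }));
  for (const s of samples) {
    // Bin index, clamping confidence = 1.0 into the top bin.
    const i = Math.min(Math.floor(s.confidence * numBins), numBins - 1);
    bins[i].n += 1;
    bins[i].conf += s.confidence;
    bins[i].acc += s.correct ? 1 : 0;
  }
  let ece = 0;
  for (const b of bins) {
    if (b.n === 0) continue;
    // Gap between actual accuracy and average confidence, weighted by bin size.
    ece += (b.n / samples.length) * Math.abs(b.acc / b.n - b.conf / b.n);
  }
  return ece;
}
```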
Retrieval quality metrics with formulas
Precision@k
Precision@k = (relevant docs in top-k) / k
Example: k=5, relevant at positions 1,3,5 -> Precision@5 = 3/5 = 0.60
Recall@k
Recall@k = (relevant docs in top-k) / (total relevant docs in corpus)
Example: k=5, found 3 of 4 relevant -> Recall@5 = 3/4 = 0.75
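Both metrics are one-liners over a ranked list of result IDs, given a set of known-relevant IDs from a labeled eval set:

```javascript
// Precision@k: fraction of the top-k results that are relevant.
function precisionAtK(ranked, relevant, k) {
  return ranked.slice(0, k).filter((id) => relevant.has(id)).length / k;
}

// Recall@k: fraction of all relevant docs that appear in the top-k.
function recallAtK(ranked, relevant, k) {
  return ranked.slice(0, k).filter((id) => relevant.has(id)).length / relevant.size;
}

// The worked example above: relevant at positions 1, 3, 5; 4 relevant in total.
const ranked = ['a', 'x', 'b', 'y', 'c'];
const relevant = new Set(['a', 'b', 'c', 'd']);
precisionAtK(ranked, relevant, 5); // 3/5 = 0.60
recallAtK(ranked, relevant, 5);    // 3/4 = 0.75
```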
MRR
Reciprocal Rank = 1 / (position of first relevant result)
MRR = average of RR across all queries
Example:
Query 1: first relevant at position 1 -> 1/1 = 1.00
Query 2: first relevant at position 4 -> 1/4 = 0.25
Query 3: first relevant at position 2 -> 1/2 = 0.50
MRR = (1.00 + 0.25 + 0.50) / 3 = 0.583
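The three-query example above can be reproduced with a short function, assuming each query contributes the 1-based position of its first relevant result (or `null` if none was retrieved, scored as 0):

```javascript
// MRR: mean of reciprocal ranks across queries.
function meanReciprocalRank(firstRelevantPositions) {
  const rrs = firstRelevantPositions.map((pos) => (pos ? 1 / pos : 0));
  return rrs.reduce((a, b) => a + b, 0) / rrs.length;
}

meanReciprocalRank([1, 4, 2]); // (1.00 + 0.25 + 0.50) / 3 ≈ 0.583
```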
NDCG@k
Relevance scale: 0 (irrelevant), 1 (tangential), 2 (very relevant), 3 (perfect)
DCG@k = sum over i=1..k: relevance(i) / log2(i + 1)
IDCG@k = DCG of the ideal (best possible) ranking
NDCG@k = DCG@k / IDCG@k (0 to 1, higher is better)
Example ranking: [3, 1, 3, 0, 2]
DCG@5 = 3/1.0 + 1/1.58 + 3/2.0 + 0/2.32 + 2/2.58 = 5.90
Ideal: [3, 3, 2, 1, 0]
IDCG@5 = 3/1.0 + 3/1.58 + 2/2.0 + 1/2.32 + 0/2.58 = 6.32
NDCG@5 = 5.90 / 6.32 = 0.93 (93% of ideal ranking)
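A sketch matching the worked example, with one simplification: the ideal ranking is taken as the retrieved list sorted by relevance, rather than the best ranking over the whole corpus:

```javascript
// DCG over a list of graded relevance scores (0-3).
function dcg(relevances) {
  // Position i is 0-based, so the log2 discount is log2(i + 2).
  return relevances.reduce((sum, rel, i) => sum + rel / Math.log2(i + 2), 0);
}

// NDCG@k: DCG of the actual ranking divided by DCG of the ideal ordering.
function ndcgAtK(relevances, k) {
  const topK = relevances.slice(0, k);
  const ideal = [...relevances].sort((a, b) => b - a).slice(0, k);
  return dcg(topK) / dcg(ideal);
}

ndcgAtK([3, 1, 3, 0, 2], 5); // ≈ 0.93, as in the example above
```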
Metric targets
| Metric | Target | What It Tells You |
|---|---|---|
| Precision@5 | > 0.60 | Low noise in retrieved context |
| Recall@5 | > 0.70 | Good coverage of relevant docs |
| MRR | > 0.70 | First relevant result is near the top |
| NDCG@5 | > 0.75 | Overall ranking quality with graded relevance |
When metrics disagree
| Scenario | Meaning | Action |
|---|---|---|
| High precision, low recall | Few results but all relevant; missing documents | Retrieve more candidates (increase k), add hybrid search |
| Low precision, high recall | Finding everything but with lots of noise | Better re-ranking, stricter similarity threshold |
| High MRR, low NDCG | First result is great, rest are poorly ordered | Improve re-ranking for positions 2-5 |
| Low MRR, high NDCG | Good results exist but aren't ranked first | Improve initial ranking or add re-ranking stage |
Evaluation dataset sizes
| Stage | Minimum | Ideal |
|---|---|---|
| Quick sanity check | 20-50 | -- |
| Pre-deployment eval | 100-200 | 500+ |
| Ongoing monitoring | 50/week | 200/week |
| A/B test significance | 200+ per variant | 1000+ |
Observability dashboard design
The four pillars
| Logs | Metrics | Traces | Evaluation |
|---|---|---|---|
| Every LLM call | Latency (avg, P95) | Full end-to-end pipeline | Hallucination rate |
| Input/output | Token cost | Step-by-step timing per span | Confidence distribution |
| Params | Error rate | Bottleneck identification | Retrieval quality |
| Prompt version | Throughput | Error propagation | User satisfaction |
| Token counts | | | Calibration check |
Six metric categories
| Category | Key Metrics | Targets |
|---|---|---|
| Latency | Avg response, P95, time to first token | < 2s avg, < 5s P95, < 500ms TTFT |
| Cost | Cost per call, daily spend, tokens per call | Within budget, trending stable |
| Errors | Error rate, rate limit hits | < 1% errors, < 0.1% rate limits |
| Quality | Hallucination rate, confidence avg | < 5% hallucination, > 0.75 confidence |
| Retrieval | Precision@k, Recall@k, MRR | > 0.60 precision, > 0.70 recall |
| User | Thumbs up rate, edit rate | > 80% thumbs up, < 15% edits |
What to log vs what NOT to log
| Always Log | Never Log |
|---|---|
| Request ID, timestamp | Raw user PII (unless required + encrypted) |
| Model, temperature, params | API keys or secrets |
| Input/output token counts | Full conversation history of other users |
| Latency, finish reason | Internal IP addresses |
| Prompt version | |
| Confidence scores, error details | |
RAG trace spans (typical)
embed_query: ~120ms (5%) -- Embed user question
retrieve_documents: ~230ms (10%) -- Vector DB search
build_prompt: ~15ms (1%) -- Format context + instructions
generate_answer: ~1450ms (62%) -- LLM generation (usually the bottleneck)
hallucination_check: ~525ms (22%) -- Verification pipeline
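A trace is just a list of timed spans. The sketch below shows the core idea in a few lines -- it is illustrative only, not a real tracing API; in practice you would use Langfuse, LangSmith, or OpenTelemetry:

```javascript
// Minimal span-timing helper for a multi-step pipeline.
function startTrace(traceId) {
  const spans = [];
  return {
    // Run fn, recording its name and wall-clock duration as a span.
    span(name, fn) {
      const start = Date.now();
      try {
        return fn();
      } finally {
        spans.push({ traceId, name, ms: Date.now() - start });
      }
    },
    spans: () => spans,
  };
}
```

Each pipeline step gets wrapped, e.g. `trace.span('retrieve_documents', () => search(query))`, so the per-span timings above fall out of the trace record.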
Alert rules
| Rule | Metric | Threshold | Severity | Cooldown |
|---|---|---|---|---|
| High error rate | Error % | > 5% | Critical | 10 min |
| Latency spike | P95 latency | > 5000ms | Warning | 15 min |
| Hallucination spike | Avg hallucination score | > 0.15 | Critical | 30 min |
| Cost spike | Hourly cost | > budget * 2 | Warning | 60 min |
| Confidence drop | Avg confidence | < 0.60 | Warning | 30 min |
| Low satisfaction | Thumbs up rate | < 70% | Warning | 60 min |
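Alert rules like these are usually data, not code: a list of threshold definitions plus a checker that enforces cooldowns. A sketch with three of the rules above (metric names and the injected clock are illustrative):

```javascript
// Alert rules as data: metric name, threshold, severity, cooldown.
const RULES = [
  { name: 'High error rate', metric: 'errorRate', over: 0.05, severity: 'critical', cooldownMin: 10 },
  { name: 'Latency spike', metric: 'p95LatencyMs', over: 5000, severity: 'warning', cooldownMin: 15 },
  { name: 'Hallucination spike', metric: 'avgHallucinationScore', over: 0.15, severity: 'critical', cooldownMin: 30 },
];

const lastFired = new Map();

// Fire any rule whose metric exceeds its threshold, respecting cooldowns.
function checkAlerts(metrics, now = Date.now()) {
  const fired = [];
  for (const rule of RULES) {
    const last = lastFired.get(rule.name) ?? -Infinity;
    if (metrics[rule.metric] > rule.over && now - last >= rule.cooldownMin * 60_000) {
      lastFired.set(rule.name, now);
      fired.push({ name: rule.name, severity: rule.severity, value: metrics[rule.metric] });
    }
  }
  return fired;
}
```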
Observability tools
| Tool | Best For | Key Feature |
|---|---|---|
| Helicone | API-level monitoring | Drop-in proxy, zero code change |
| LangSmith | LangChain apps | Deep chain tracing, eval suites |
| Langfuse | Open-source, self-hosted | Traces, evals, full control |
| Weights & Biases | ML experiment tracking | Prompt versioning, dashboards |
| Custom dashboards | Full control | Exactly what you need |
Common gotchas
| Gotcha | What Goes Wrong | Fix |
|---|---|---|
| Overconfident self-assessment | Model says 0.95 but is correct only 70% of the time | Calibrate with labeled dataset; combine multiple signals |
| Consistency != correctness | Model hallucinations are the same across rephrasings | Consistency check alone is not enough; layer with source checking |
| NLI on complex claims | NLI model fails on multi-hop reasoning | Use NLI for simple claims, escalate complex ones to LLM verifier |
| No calibration dataset | Confidence thresholds are guesses, not data-driven | Build 200+ question dataset with ground truth; measure ECE |
| Evaluating retrieval by answer quality | Good answers mask bad retrieval (LLM compensates) | Evaluate retrieval and generation separately (component eval) |
| Stale eval dataset | Eval set doesn't cover new document types or query patterns | Refresh with production query logs monthly |
| Too many false positives | Valid answers blocked by overly aggressive hallucination detection | Tune thresholds; accept slightly higher false negatives for lower false positives |
| No prompt version logging | Can't correlate quality changes with prompt deployments | Log prompt version on every call |
| Ignoring false negatives | Focus on false positives while hallucinations slip through | Track false negative rate aggressively; sample auto-approved responses |
| NDCG with binary relevance | Using NDCG when you only have relevant/not-relevant labels | Use Precision@k and MRR for binary; NDCG needs graded (0-3) labels |
| Provider model updates | LLM provider silently updates model; quality shifts | Monitor system_fingerprint; run eval suite on schedule, not just on your changes |
| Missing cost monitoring | Bill doubles; no breakdown by model/feature to diagnose | Track cost per call, per model, per feature; set budget alerts |
Key formulas cheat sheet
Precision@k = relevant_in_top_k / k
Recall@k = relevant_in_top_k / total_relevant
MRR = mean(1 / rank_of_first_relevant)
NDCG@k = DCG@k / IDCG@k
DCG@k = sum(relevance_i / log2(i + 1))
probability = Math.exp(logprob)
ECE = sum(bin_weight * |actual_accuracy - expected_accuracy|)
Composite conf. = sum(signal_i * weight_i)
Hallucination rate = hallucinated_responses / total_responses
False positive rate = flagged_but_correct / total_flagged
False negative rate = missed_hallucinations / total_hallucinations
One-line summary
AI Evaluation = Detect hallucinations + Score confidence + Measure retrieval quality + Monitor everything in production -- continuously, not once
End of 4.14 quick revision.