4.14 — Evaluating AI Systems: Quick Revision
Compact cheat sheet. Print-friendly.
How to use this material
- Skim before labs or interviews.
- Drill gaps -- reopen README.md -> 4.14.a...4.14.d.
- Practice -- 4.14-Exercise-Questions.md.
- Polish answers -- 4.14-Interview-Questions.md.
Core vocabulary
| Term | One-liner |
|---|---|
| Hallucination | AI generates information not grounded in its source material -- looks confident but is fabricated |
| Hallucination detection | Engineering discipline of automatically catching fabricated claims in AI outputs |
| Cross-referencing | Verify each claim in the AI answer against the retrieved source documents |
| Consistency checking | Ask the same question multiple ways and check if answers agree |
| NLI (Natural Language Inference) | Lightweight model that classifies whether a premise entails, contradicts, or is neutral toward a claim |
| Confidence score | 0.0-1.0 value quantifying how reliable an AI output is believed to be |
| Self-assessment | Asking the LLM to rate its own confidence -- simple but prone to overconfidence |
| Log probabilities (logprobs) | Model's internal probability assignments per generated token -- objective confidence signal |
| Calibration | Verifying that 90% confidence = ~90% actual accuracy on a labeled dataset |
| Platt scaling | Logistic regression to remap raw confidence to calibrated confidence |
| ECE (Expected Calibration Error) | Weighted average of the gap between expected and actual accuracy across confidence bins |
| Precision@k | Fraction of retrieved docs in top-k that are relevant -- measures noise |
| Recall@k | Fraction of all relevant docs that appear in top-k -- measures coverage |
| MRR (Mean Reciprocal Rank) | Average of 1/position_of_first_relevant_result across queries -- measures ranking quality |
| NDCG (Normalized Discounted Cumulative Gain) | Measures ranking quality with graded relevance (0-3 scale, not just binary) |
| Faithfulness | Whether the answer uses only information from retrieved context (no hallucination) |
| Relevance | Whether the answer actually addresses the user's question |
| Source attribution | Requiring the model to cite specific sources for every claim -- doubles as hallucination detection |
| Observability | Logging, tracing, metrics, and evaluation -- the four pillars for AI systems |
| Trace | End-to-end record of a multi-step pipeline, with timing per span |
| Span | A single step within a trace (e.g., embed_query, retrieve, generate) |
Hallucination detection strategies
┌─────────────────────────────────────────────────────────────────┐
│ LAYERED DETECTION PIPELINE │
│ │
│ Layer 1: Source attribution (always, ~100ms) │
│ Extract claims -> keyword overlap with sources │
│ Flags obvious fabrications (names, numbers not in sources) │
│ │
│ Layer 2: NLI check (if Layer 1 flags > 20%, ~200ms) │
│ Run flagged claims through NLI model │
│ Catches semantic contradictions ("15 days" vs "30 days") │
│ │
│ Layer 3: Consistency check (if high risk, ~1500ms) │
│ Ask same question multiple ways, compare answers │
│ Catches confident-but-wrong hallucinations │
│ │
│ Combine: overall = 0.5*attribution + 0.3*nli + 0.2*consistency│
│ │
│ < 0.1 -> SERVE │
│ < 0.3 -> SERVE_WITH_DISCLAIMER │
│ < 0.6 -> HUMAN_REVIEW │
│ >= 0.6 -> BLOCK │
└─────────────────────────────────────────────────────────────────┘
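The combine-and-route step of the pipeline above can be sketched as follows, assuming each layer has already produced a 0-1 risk score (the function and field names here are illustrative, not a real API):

```javascript
// Combine per-layer hallucination risk scores and route the response.
function routeResponse({ attribution, nli, consistency }) {
  // Weighted combination from the pipeline diagram above.
  const overall = 0.5 * attribution + 0.3 * nli + 0.2 * consistency;
  if (overall < 0.1) return { overall, action: 'SERVE' };
  if (overall < 0.3) return { overall, action: 'SERVE_WITH_DISCLAIMER' };
  if (overall < 0.6) return { overall, action: 'HUMAN_REVIEW' };
  return { overall, action: 'BLOCK' };
}

// Example: mild attribution flags, small NLI signal, clean consistency check.
routeResponse({ attribution: 0.3, nli: 0.1, consistency: 0 });
// overall = 0.18 -> SERVE_WITH_DISCLAIMER
```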
Method comparison
| Method | Speed | Cost | Best For | Weakness |
|---|---|---|---|---|
| Cross-referencing | 500-2000ms | LLM call | Pinpointing which claim is wrong | Verifier LLM can itself hallucinate |
| Consistency checking | 2000-5000ms | Multiple LLM calls | Catching model uncertainty | Consistently wrong hallucinations pass |
| NLI models | 10-50ms | Free (local) | Fast first filter at scale | Struggles with complex multi-hop claims |
| Source attribution | 0ms (built into prompt) | No extra cost | Forcing grounding + audit trail | Model can fabricate citations |
| Human evaluation | Minutes-hours | Expensive | Gold standard, catches edge cases | Doesn't scale, only for sampling |
Human evaluation sampling rules
Sample a response for human review when any of the following conditions holds:
Math.random() < 0.02                                              // 2% random baseline
response.confidence < 0.7                                         // low confidence
response.hallucinationScore > 0.3                                 // flagged by detection pipeline
response.domain === 'medical' || response.domain === 'legal'      // high-stakes domain
response.promptVersion !== stableVersion && Math.random() < 0.2   // new prompt version, 20% sample
response.userFeedback === 'thumbs_down'                           // explicit negative feedback
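The sampling rules above can be folded into a single predicate. This is one possible shape, not a canonical API -- the field names mirror the rules, and the random source is injectable so the function is testable:

```javascript
// Decide whether a response should be routed to human evaluation.
function shouldSampleForHumanEval(response, stableVersion, rand = Math.random) {
  return (
    rand() < 0.02 ||                          // 2% random baseline
    response.confidence < 0.7 ||              // low confidence
    response.hallucinationScore > 0.3 ||      // flagged by detection pipeline
    response.domain === 'medical' ||          // high-stakes domains
    response.domain === 'legal' ||
    (response.promptVersion !== stableVersion && rand() < 0.2) || // new prompt, 20% sample
    response.userFeedback === 'thumbs_down'   // explicit negative feedback
  );
}
```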
Confidence score implementation
Signal sources
| Signal | Weight | Objectivity | Notes |
|---|---|---|---|
| Self-assessment | 0.15 | Subjective | Prone to overconfidence; cheapest to implement |
| Log probabilities | 0.25 | Objective | Per-token; no extra cost; not all APIs expose them |
| Source overlap | 0.30 | Objective | Most reliable for RAG; fraction of claims grounded in sources |
| Retrieval relevance | 0.20 | Objective | Quality of input chunks; bad retrieval = bad answers |
| Hallucination penalty | 0.10 | Objective | From detection pipeline; penalizes flagged content |
Composite formula
composite = selfAssessed * 0.15
+ logprob * 0.25
+ sourceOverlap * 0.30
+ retrievalRelevance * 0.20
+ (1 - hallucinationScore) * 0.10
Logprob reference
logprob = -0.01 -> probability ~99% (very confident)
logprob = -0.10 -> probability ~90% (confident)
logprob = -0.69 -> probability ~50% (coin flip)
logprob = -2.30 -> probability ~10% (uncertain)
logprob = -4.60 -> probability ~1% (very uncertain)
Convert: probability = Math.exp(logprob)
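To turn per-token logprobs into a single 0-1 confidence signal, a common approach is to average the logprobs and exponentiate -- this yields the geometric mean of the token probabilities:

```javascript
// Average per-token logprobs, then convert back to a probability.
function logprobConfidence(tokenLogprobs) {
  const avg = tokenLogprobs.reduce((a, b) => a + b, 0) / tokenLogprobs.length;
  return Math.exp(avg);
}

logprobConfidence([-0.01, -0.10, -0.05]); // ≈ 0.948 (confident)
```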
Routing thresholds by domain
| Domain | Auto-approve | Disclaimer | Human review | Refuse |
|---|---|---|---|---|
| Medical / Legal | >= 0.95 | 0.80-0.95 | 0.60-0.80 | < 0.60 |
| Customer support | >= 0.85 | 0.60-0.85 | 0.40-0.60 | < 0.40 |
| Internal tools | >= 0.75 | 0.50-0.75 | 0.30-0.50 | < 0.30 |
| Casual Q&A | >= 0.70 | 0.40-0.70 | 0.20-0.40 | < 0.20 |
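The threshold table maps naturally to a lookup plus a cascade of comparisons. The numbers below come from the table; the domain keys are illustrative:

```javascript
// Per-domain confidence thresholds (from the table above).
const THRESHOLDS = {
  medical:  { approve: 0.95, disclaimer: 0.80, review: 0.60 },
  support:  { approve: 0.85, disclaimer: 0.60, review: 0.40 },
  internal: { approve: 0.75, disclaimer: 0.50, review: 0.30 },
  casual:   { approve: 0.70, disclaimer: 0.40, review: 0.20 },
};

function route(domain, confidence) {
  const t = THRESHOLDS[domain];
  if (confidence >= t.approve) return 'AUTO_APPROVE';
  if (confidence >= t.disclaimer) return 'DISCLAIMER';
  if (confidence >= t.review) return 'HUMAN_REVIEW';
  return 'REFUSE';
}

route('support', 0.72); // 'DISCLAIMER'
```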
Calibration check
Step 1: Collect 200+ questions with ground-truth answers
Step 2: Get model confidence for each
Step 3: Bin by confidence (0-20%, 20-40%, 40-60%, 60-80%, 80-100%)
Step 4: Compute actual accuracy per bin
Step 5: ECE = weighted avg of |actual - expected| across bins
ECE < 0.05 = well calibrated
ECE > 0.15 = poorly calibrated -> apply Platt scaling
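The five calibration steps can be sketched as a single function. This is a minimal version, assuming the input is an array of `{ confidence, correct }` pairs from a labeled eval set:

```javascript
// Expected Calibration Error over equal-width confidence bins.
function expectedCalibrationError(samples, numBins = 5) {
  const bins = Array.from({ length: numBins }, () => ({ n: 0, conf: 0, acc: 0 }));
  for (const s of samples) {
    // Bin index, clamping confidence = 1.0 into the top bin.
    const i = Math.min(Math.floor(s.confidence * numBins), numBins - 1);
    bins[i].n += 1;
    bins[i].conf += s.confidence;
    bins[i].acc += s.correct ? 1 : 0;
  }
  let ece = 0;
  for (const b of bins) {
    if (b.n === 0) continue;
    // Gap between actual accuracy and average confidence, weighted by bin size.
    ece += (b.n / samples.length) * Math.abs(b.acc / b.n - b.conf / b.n);
  }
  return ece;
}
```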
Retrieval quality metrics with formulas
Precision@k
Precision@k = (relevant docs in top-k) / k
Example: k=5, relevant at positions 1,3,5 -> Precision@5 = 3/5 = 0.60
Recall@k
Recall@k = (relevant docs in top-k) / (total relevant docs in corpus)
Example: k=5, found 3 of 4 relevant -> Recall@5 = 3/4 = 0.75
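Both metrics are one-liners over a ranked list of result IDs, given a set of known-relevant IDs from a labeled eval set:

```javascript
// Precision@k: fraction of the top-k results that are relevant.
function precisionAtK(ranked, relevant, k) {
  return ranked.slice(0, k).filter((id) => relevant.has(id)).length / k;
}

// Recall@k: fraction of all relevant docs that appear in the top-k.
function recallAtK(ranked, relevant, k) {
  return ranked.slice(0, k).filter((id) => relevant.has(id)).length / relevant.size;
}

// The worked example above: relevant at positions 1, 3, 5; 4 relevant in total.
const ranked = ['a', 'x', 'b', 'y', 'c'];
const relevant = new Set(['a', 'b', 'c', 'd']);
precisionAtK(ranked, relevant, 5); // 3/5 = 0.60
recallAtK(ranked, relevant, 5);    // 3/4 = 0.75
```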
MRR
Reciprocal Rank = 1 / (position of first relevant result)
MRR = average of RR across all queries
Example:
Query 1: first relevant at position 1 -> 1/1 = 1.00
Query 2: first relevant at position 4 -> 1/4 = 0.25
Query 3: first relevant at position 2 -> 1/2 = 0.50
MRR = (1.00 + 0.25 + 0.50) / 3 = 0.583
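The three-query example above can be reproduced with a short function, assuming each query contributes the 1-based position of its first relevant result (or `null` if none was retrieved, scored as 0):

```javascript
// MRR: mean of reciprocal ranks across queries.
function meanReciprocalRank(firstRelevantPositions) {
  const rrs = firstRelevantPositions.map((pos) => (pos ? 1 / pos : 0));
  return rrs.reduce((a, b) => a + b, 0) / rrs.length;
}

meanReciprocalRank([1, 4, 2]); // (1.00 + 0.25 + 0.50) / 3 ≈ 0.583
```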
NDCG@k
Relevance scale: 0 (irrelevant), 1 (tangential), 2 (very relevant), 3 (perfect)
DCG@k = sum over i=1..k: relevance(i) / log2(i + 1)
IDCG@k = DCG of the ideal (best possible) ranking
NDCG@k = DCG@k / IDCG@k (0 to 1, higher is better)
Example ranking: [3, 1, 3, 0, 2]
DCG@5 = 3/1.0 + 1/1.58 + 3/2.0 + 0/2.32 + 2/2.58 = 5.90
Ideal: [3, 3, 2, 1, 0]
IDCG@5 = 3/1.0 + 3/1.58 + 2/2.0 + 1/2.32 + 0/2.58 = 6.32
NDCG@5 = 5.90 / 6.32 = 0.93 (93% of ideal ranking)
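A sketch matching the worked example, with one simplification: the ideal ranking is taken as the retrieved list sorted by relevance, rather than the best ranking over the whole corpus:

```javascript
// DCG over a list of graded relevance scores (0-3).
function dcg(relevances) {
  // Position i is 0-based, so the log2 discount is log2(i + 2).
  return relevances.reduce((sum, rel, i) => sum + rel / Math.log2(i + 2), 0);
}

// NDCG@k: DCG of the actual ranking divided by DCG of the ideal ordering.
function ndcgAtK(relevances, k) {
  const topK = relevances.slice(0, k);
  const ideal = [...relevances].sort((a, b) => b - a).slice(0, k);
  return dcg(topK) / dcg(ideal);
}

ndcgAtK([3, 1, 3, 0, 2], 5); // ≈ 0.93, as in the example above
```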
Metric targets
| Metric | Target | What It Tells You |
|---|---|---|
| Precision@5 | > 0.60 | Low noise in retrieved context |
| Recall@5 | > 0.70 | Good coverage of relevant docs |
| MRR | > 0.70 | First relevant result is near the top |
| NDCG@5 | > 0.75 | Overall ranking quality with graded relevance |
When metrics disagree
| Scenario | Meaning | Action |
|---|---|---|
| High precision, low recall | Few results but all relevant; missing documents | Retrieve more candidates (increase k), add hybrid search |
| Low precision, high recall | Finding everything but with lots of noise | Better re-ranking, stricter similarity threshold |
| High MRR, low NDCG | First result is great, rest are poorly ordered | Improve re-ranking for positions 2-5 |
| Low MRR, high NDCG | Good results exist but aren't ranked first | Improve initial ranking or add re-ranking stage |
Evaluation dataset sizes
| Stage | Minimum | Ideal |
|---|---|---|
| Quick sanity check | 20-50 | -- |
| Pre-deployment eval | 100-200 | 500+ |
| Ongoing monitoring | 50/week | 200/week |
| A/B test significance | 200+ per variant | 1000+ |
Observability dashboard design
The four pillars
| Logs | Metrics | Traces | Evaluation |
|---|---|---|---|
| Every LLM call | Latency (avg, P95) | Full end-to-end pipeline | Hallucination rate |
| Input/output | Token cost | Step-by-step timing per span | Confidence distribution |
| Params | Error rate | Bottleneck identification | Retrieval quality |
| Prompt version | Throughput | Error propagation | User satisfaction |
| Token counts | | | Calibration check |
Six metric categories
| Category | Key Metrics | Targets |
|---|---|---|
| Latency | Avg response, P95, time to first token | < 2s avg, < 5s P95, < 500ms TTFT |
| Cost | Cost per call, daily spend, tokens per call | Within budget, trending stable |
| Errors | Error rate, rate limit hits | < 1% errors, < 0.1% rate limits |
| Quality | Hallucination rate, confidence avg | < 5% hallucination, > 0.75 confidence |
| Retrieval | Precision@k, Recall@k, MRR | > 0.60 precision, > 0.70 recall |
| User | Thumbs up rate, edit rate | > 80% thumbs up, < 15% edits |
What to log vs what NOT to log
| Always Log | Never Log |
|---|---|
| Request ID, timestamp | Raw user PII (unless required + encrypted) |
| Model, temperature, params | API keys or secrets |
| Input/output token counts | Full conversation history of other users |
| Latency, finish reason | Internal IP addresses |
| Prompt version | |
| Confidence scores, error details | |
RAG trace spans (typical)
embed_query: ~120ms (5%) -- Embed user question
retrieve_documents: ~230ms (10%) -- Vector DB search
build_prompt: ~15ms (1%) -- Format context + instructions
generate_answer: ~1450ms (62%) -- LLM generation (usually the bottleneck)
hallucination_check: ~525ms (22%) -- Verification pipeline
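A trace is just a list of timed spans. The sketch below shows the core idea in a few lines -- it is illustrative only, not a real tracing API; in practice you would use Langfuse, LangSmith, or OpenTelemetry:

```javascript
// Minimal span-timing helper for a multi-step pipeline.
function startTrace(traceId) {
  const spans = [];
  return {
    // Run fn, recording its name and wall-clock duration as a span.
    span(name, fn) {
      const start = Date.now();
      try {
        return fn();
      } finally {
        spans.push({ traceId, name, ms: Date.now() - start });
      }
    },
    spans: () => spans,
  };
}
```

Each pipeline step gets wrapped, e.g. `trace.span('retrieve_documents', () => search(query))`, so the per-span timings above fall out of the trace record.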
Alert rules
| Rule | Metric | Threshold | Severity | Cooldown |
|---|---|---|---|---|
| High error rate | Error % | > 5% | Critical | 10 min |
| Latency spike | P95 latency | > 5000ms | Warning | 15 min |
| Hallucination spike | Avg hallucination score | > 0.15 | Critical | 30 min |
| Cost spike | Hourly cost | > budget * 2 | Warning | 60 min |
| Confidence drop | Avg confidence | < 0.60 | Warning | 30 min |
| Low satisfaction | Thumbs up rate | < 70% | Warning | 60 min |
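Alert rules like these are usually data, not code: a list of threshold definitions plus a checker that enforces cooldowns. A sketch with three of the rules above (metric names and the injected clock are illustrative):

```javascript
// Alert rules as data: metric name, threshold, severity, cooldown.
const RULES = [
  { name: 'High error rate', metric: 'errorRate', over: 0.05, severity: 'critical', cooldownMin: 10 },
  { name: 'Latency spike', metric: 'p95LatencyMs', over: 5000, severity: 'warning', cooldownMin: 15 },
  { name: 'Hallucination spike', metric: 'avgHallucinationScore', over: 0.15, severity: 'critical', cooldownMin: 30 },
];

const lastFired = new Map();

// Fire any rule whose metric exceeds its threshold, respecting cooldowns.
function checkAlerts(metrics, now = Date.now()) {
  const fired = [];
  for (const rule of RULES) {
    const last = lastFired.get(rule.name) ?? -Infinity;
    if (metrics[rule.metric] > rule.over && now - last >= rule.cooldownMin * 60_000) {
      lastFired.set(rule.name, now);
      fired.push({ name: rule.name, severity: rule.severity, value: metrics[rule.metric] });
    }
  }
  return fired;
}
```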
Observability tools
| Tool | Best For | Key Feature |
|---|---|---|
| Helicone | API-level monitoring | Drop-in proxy, zero code change |
| LangSmith | LangChain apps | Deep chain tracing, eval suites |
| Langfuse | Open-source, self-hosted | Traces, evals, full control |
| Weights & Biases | ML experiment tracking | Prompt versioning, dashboards |
| Custom dashboards | Full control | Exactly what you need |
Common gotchas
| Gotcha | What Goes Wrong | Fix |
|---|---|---|
| Overconfident self-assessment | Model says 0.95 but is correct only 70% of the time | Calibrate with labeled dataset; combine multiple signals |
| Consistency != correctness | Model hallucinations are the same across rephrasings | Consistency check alone is not enough; layer with source checking |
| NLI on complex claims | NLI model fails on multi-hop reasoning | Use NLI for simple claims, escalate complex ones to LLM verifier |
| No calibration dataset | Confidence thresholds are guesses, not data-driven | Build 200+ question dataset with ground truth; measure ECE |
| Evaluating retrieval by answer quality | Good answers mask bad retrieval (LLM compensates) | Evaluate retrieval and generation separately (component eval) |
| Stale eval dataset | Eval set doesn't cover new document types or query patterns | Refresh with production query logs monthly |
| Too many false positives | Valid answers blocked by overly aggressive hallucination detection | Tune thresholds; accept slightly higher false negatives for lower false positives |
| No prompt version logging | Can't correlate quality changes with prompt deployments | Log prompt version on every call |
| Ignoring false negatives | Focus on false positives while hallucinations slip through | Track false negative rate aggressively; sample auto-approved responses |
| NDCG with binary relevance | Using NDCG when you only have relevant/not-relevant labels | Use Precision@k and MRR for binary; NDCG needs graded (0-3) labels |
| Provider model updates | LLM provider silently updates model; quality shifts | Monitor system_fingerprint; run eval suite on schedule, not just on your changes |
| Missing cost monitoring | Bill doubles; no breakdown by model/feature to diagnose | Track cost per call, per model, per feature; set budget alerts |
Key formulas cheat sheet
Precision@k = relevant_in_top_k / k
Recall@k = relevant_in_top_k / total_relevant
MRR = mean(1 / rank_of_first_relevant)
NDCG@k = DCG@k / IDCG@k
DCG@k = sum(relevance_i / log2(i + 1))
probability = Math.exp(logprob)
ECE = sum(bin_weight * |actual_accuracy - expected_accuracy|)
Composite conf. = sum(signal_i * weight_i)
Hallucination rate = hallucinated_responses / total_responses
False positive rate = flagged_but_correct / total_flagged
False negative rate = missed_hallucinations / total_hallucinations
One-line summary
AI Evaluation = Detect hallucinations + Score confidence + Measure retrieval quality + Monitor everything in production -- continuously, not once
End of 4.14 quick revision.