Episode 4 — Generative AI Engineering / 4.14 — Evaluating AI Systems

4.14 — Evaluating AI Systems: Quick Revision

Compact cheat sheet. Print-friendly.

How to use this material (instructions)

  1. Skim before labs or interviews.
  2. Drill gaps -- reopen README.md -> 4.14.a...4.14.d.
  3. Practice -- 4.14-Exercise-Questions.md.
  4. Polish answers -- 4.14-Interview-Questions.md.

Core vocabulary

Term | One-liner
--- | ---
Hallucination | AI generates information not grounded in its source material -- looks confident but is fabricated
Hallucination detection | Engineering discipline of automatically catching fabricated claims in AI outputs
Cross-referencing | Verify each claim in the AI answer against the retrieved source documents
Consistency checking | Ask the same question multiple ways and check if answers agree
NLI (Natural Language Inference) | Lightweight model that classifies whether a premise entails, contradicts, or is neutral toward a claim
Confidence score | 0.0-1.0 value quantifying how reliable an AI output is believed to be
Self-assessment | Asking the LLM to rate its own confidence -- simple but prone to overconfidence
Log probabilities (logprobs) | Model's internal probability assignments per generated token -- objective confidence signal
Calibration | Verifying that 90% confidence = ~90% actual accuracy on a labeled dataset
Platt scaling | Logistic regression to remap raw confidence to calibrated confidence
ECE (Expected Calibration Error) | Weighted average of the gap between expected and actual accuracy across confidence bins
Precision@k | Fraction of retrieved docs in top-k that are relevant -- measures noise
Recall@k | Fraction of all relevant docs that appear in top-k -- measures coverage
MRR (Mean Reciprocal Rank) | Average of 1/position_of_first_relevant_result across queries -- measures ranking quality
NDCG (Normalized Discounted Cumulative Gain) | Measures ranking quality with graded relevance (0-3 scale, not just binary)
Faithfulness | Whether the answer uses only information from retrieved context (no hallucination)
Relevance | Whether the answer actually addresses the user's question
Source attribution | Requiring the model to cite specific sources for every claim -- doubles as hallucination detection
Observability | Logging, tracing, metrics, and evaluation -- the four pillars for AI systems
Trace | End-to-end record of a multi-step pipeline, with timing per span
Span | A single step within a trace (e.g., embed_query, retrieve, generate)

Hallucination detection strategies

┌─────────────────────────────────────────────────────────────────┐
│  LAYERED DETECTION PIPELINE                                     │
│                                                                 │
│  Layer 1: Source attribution (always, ~100ms)                   │
│    Extract claims -> keyword overlap with sources               │
│    Flags obvious fabrications (names, numbers not in sources)   │
│                                                                 │
│  Layer 2: NLI check (if Layer 1 flags > 20%, ~200ms)           │
│    Run flagged claims through NLI model                         │
│    Catches semantic contradictions ("15 days" vs "30 days")    │
│                                                                 │
│  Layer 3: Consistency check (if high risk, ~1500ms)             │
│    Ask same question multiple ways, compare answers             │
│    Catches confident-but-wrong hallucinations                   │
│                                                                 │
│  Combine: overall = 0.5*attribution + 0.3*nli + 0.2*consistency│
│                                                                 │
│  < 0.1  -> SERVE                                                │
│  < 0.3  -> SERVE_WITH_DISCLAIMER                                │
│  < 0.6  -> HUMAN_REVIEW                                         │
│  >= 0.6 -> BLOCK                                                │
└─────────────────────────────────────────────────────────────────┘
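The combine-and-route step in the diagram can be sketched as a small function. The weights (0.5/0.3/0.2) and thresholds come from the diagram; the field names are illustrative.

```javascript
// Combine layer scores (0 = clean, 1 = definitely hallucinated) and route.
function routeByHallucinationScore(scores) {
  const overall = 0.5 * scores.attribution
                + 0.3 * scores.nli
                + 0.2 * scores.consistency;
  if (overall < 0.1) return 'SERVE';
  if (overall < 0.3) return 'SERVE_WITH_DISCLAIMER';
  if (overall < 0.6) return 'HUMAN_REVIEW';
  return 'BLOCK';
}
```

Note the scores measure hallucination risk, so lower is better: a fully clean response (all zeros) serves directly, and a fully flagged one is blocked.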

Method comparison

Method | Speed | Cost | Best For | Weakness
--- | --- | --- | --- | ---
Cross-referencing | 500-2000ms | LLM call | Pinpointing which claim is wrong | Verifier LLM can itself hallucinate
Consistency checking | 2000-5000ms | Multiple LLM calls | Catching model uncertainty | Consistently wrong hallucinations pass
NLI models | 10-50ms | Free (local) | Fast first filter at scale | Struggles with complex multi-hop claims
Source attribution | 0ms (built into prompt) | No extra cost | Forcing grounding + audit trail | Model can fabricate citations
Human evaluation | Minutes-hours | Expensive | Gold standard, catches edge cases | Doesn't scale, only for sampling

Human evaluation sampling rules

// Who gets reviewed: any rule hit routes the response to a human.
function shouldReview(response, stableVersion) {
  if (Math.random() < 0.02) return true;                  // 2% random sample
  if (response.confidence < 0.7) return true;             // low confidence -> always review
  if (response.hallucinationScore > 0.3) return true;     // flagged by automated detection
  if (response.domain === 'medical' || response.domain === 'legal') return true; // high stakes
  if (response.promptVersion !== stableVersion && Math.random() < 0.2) return true; // new prompts: 20%
  if (response.userFeedback === 'thumbs_down') return true; // user-reported issues
  return false;
}

Confidence score implementation

Signal sources

Signal | Weight | Objectivity | Notes
--- | --- | --- | ---
Self-assessment | 0.15 | Subjective | Prone to overconfidence; cheapest to implement
Log probabilities | 0.25 | Objective | Per-token; no extra cost; not all APIs expose them
Source overlap | 0.30 | Objective | Most reliable for RAG; fraction of claims grounded in sources
Retrieval relevance | 0.20 | Objective | Quality of input chunks; bad retrieval = bad answers
Hallucination penalty | 0.10 | Objective | From detection pipeline; penalizes flagged content

Composite formula

composite = selfAssessed * 0.15
          + logprob      * 0.25
          + sourceOverlap * 0.30
          + retrievalRelevance * 0.20
          + (1 - hallucinationScore) * 0.10
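As code, the composite is a straight weighted sum. A minimal sketch, assuming all five signals are already normalized to 0-1 (the `logprob` field here means the logprob-derived confidence, not the raw logprob):

```javascript
// Weighted combination of the five confidence signals; weights sum to 1.0.
function compositeConfidence(s) {
  return s.selfAssessed         * 0.15
       + s.logprob              * 0.25
       + s.sourceOverlap        * 0.30
       + s.retrievalRelevance   * 0.20
       + (1 - s.hallucinationScore) * 0.10; // penalty: high score lowers confidence
}
```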

Logprob reference

logprob = -0.01  -> probability ~99%   (very confident)
logprob = -0.10  -> probability ~90%   (confident)
logprob = -0.69  -> probability ~50%   (coin flip)
logprob = -2.30  -> probability ~10%   (uncertain)
logprob = -4.60  -> probability ~1%    (very uncertain)

Convert: probability = Math.exp(logprob)
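To get a single confidence signal for a whole response, one common aggregation (an assumption here, not the only option) is to average the per-token logprobs and exponentiate, i.e. take the geometric mean of the token probabilities:

```javascript
// Average logprob -> geometric-mean token probability for the sequence.
function logprobConfidence(tokenLogprobs) {
  const avg = tokenLogprobs.reduce((sum, lp) => sum + lp, 0) / tokenLogprobs.length;
  return Math.exp(avg);
}
```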

Routing thresholds by domain

Domain | Auto-approve | Disclaimer | Human review | Refuse
--- | --- | --- | --- | ---
Medical / Legal | >= 0.95 | 0.80-0.95 | 0.60-0.80 | < 0.60
Customer support | >= 0.85 | 0.60-0.85 | 0.40-0.60 | < 0.40
Internal tools | >= 0.75 | 0.50-0.75 | 0.30-0.50 | < 0.30
Casual Q&A | >= 0.70 | 0.40-0.70 | 0.20-0.40 | < 0.20
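A hypothetical lookup implementing these per-domain thresholds (the domain keys and action names are illustrative):

```javascript
// Lower bound of each action band, per domain.
const THRESHOLDS = {
  medical:  { approve: 0.95, disclaimer: 0.80, review: 0.60 },
  support:  { approve: 0.85, disclaimer: 0.60, review: 0.40 },
  internal: { approve: 0.75, disclaimer: 0.50, review: 0.30 },
  casual:   { approve: 0.70, disclaimer: 0.40, review: 0.20 },
};

function route(domain, confidence) {
  const t = THRESHOLDS[domain];
  if (confidence >= t.approve) return 'AUTO_APPROVE';
  if (confidence >= t.disclaimer) return 'DISCLAIMER';
  if (confidence >= t.review) return 'HUMAN_REVIEW';
  return 'REFUSE';
}
```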

Calibration check

Step 1: Collect 200+ questions with ground-truth answers
Step 2: Get model confidence for each
Step 3: Bin by confidence (0-20%, 20-40%, 40-60%, 60-80%, 80-100%)
Step 4: Compute actual accuracy per bin
Step 5: ECE = weighted avg of |actual - expected| across bins
         ECE < 0.05 = well calibrated
         ECE > 0.15 = poorly calibrated -> apply Platt scaling
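A minimal ECE sketch following the steps above, assuming `items` are `{confidence, correct}` pairs and bins are the 20%-wide bands from Step 3:

```javascript
// ECE = sum over bins of (bin weight) * |bin accuracy - bin avg confidence|.
function expectedCalibrationError(items, numBins = 5) {
  const bins = Array.from({ length: numBins }, () => ({ n: 0, conf: 0, acc: 0 }));
  for (const { confidence, correct } of items) {
    const i = Math.min(numBins - 1, Math.floor(confidence * numBins)); // clamp 1.0 into top bin
    bins[i].n += 1;
    bins[i].conf += confidence;
    bins[i].acc += correct ? 1 : 0;
  }
  let ece = 0;
  for (const b of bins) {
    if (b.n === 0) continue;
    ece += (b.n / items.length) * Math.abs(b.acc / b.n - b.conf / b.n);
  }
  return ece;
}
```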

Retrieval quality metrics with formulas

Precision@k

Precision@k = (relevant docs in top-k) / k

Example: k=5, relevant at positions 1,3,5 -> Precision@5 = 3/5 = 0.60

Recall@k

Recall@k = (relevant docs in top-k) / (total relevant docs in corpus)

Example: k=5, found 3 of 4 relevant -> Recall@5 = 3/4 = 0.75

MRR

Reciprocal Rank = 1 / (position of first relevant result)
MRR = average of RR across all queries

Example:
  Query 1: first relevant at position 1 -> 1/1 = 1.00
  Query 2: first relevant at position 4 -> 1/4 = 0.25
  Query 3: first relevant at position 2 -> 1/2 = 0.50
  MRR = (1.00 + 0.25 + 0.50) / 3 = 0.583

NDCG@k

Relevance scale: 0 (irrelevant), 1 (tangential), 2 (very relevant), 3 (perfect)

DCG@k  = sum over i=1..k: relevance(i) / log2(i + 1)
IDCG@k = DCG of the ideal (best possible) ranking
NDCG@k = DCG@k / IDCG@k     (0 to 1, higher is better)

Example ranking: [3, 1, 3, 0, 2]
  DCG@5  = 3/1.0 + 1/1.58 + 3/2.0 + 0/2.32 + 2/2.58 = 5.90
  Ideal:   [3, 3, 2, 1, 0]
  IDCG@5 = 3/1.0 + 3/1.58 + 2/2.0 + 1/2.32 + 0/2.58 = 6.32
  NDCG@5 = 5.90 / 6.32 = 0.93 (93% of ideal ranking)
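The four retrieval metrics above can be sketched directly from their formulas; positions are 1-based, as in the worked examples:

```javascript
// relevantPositions: 1-based ranks of relevant docs in the result list.
function precisionAtK(relevantPositions, k) {
  return relevantPositions.filter((p) => p <= k).length / k;
}

function recallAtK(relevantPositions, k, totalRelevant) {
  return relevantPositions.filter((p) => p <= k).length / totalRelevant;
}

// firstRelevantPositions: one entry per query (rank of first relevant result).
function mrr(firstRelevantPositions) {
  return firstRelevantPositions.reduce((sum, p) => sum + 1 / p, 0)
       / firstRelevantPositions.length;
}

// relevances: graded scores (0-3) in retrieved order.
function ndcgAtK(relevances, k) {
  const dcg = (rs) => rs.slice(0, k).reduce((sum, r, i) => sum + r / Math.log2(i + 2), 0);
  const ideal = [...relevances].sort((a, b) => b - a); // best possible ordering
  return dcg(relevances) / dcg(ideal);
}
```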

Metric targets

Metric | Target | What It Tells You
--- | --- | ---
Precision@5 | > 0.60 | Low noise in retrieved context
Recall@5 | > 0.70 | Good coverage of relevant docs
MRR | > 0.70 | First relevant result is near the top
NDCG@5 | > 0.75 | Overall ranking quality with graded relevance

When metrics disagree

Scenario | Meaning | Action
--- | --- | ---
High precision, low recall | Few results but all relevant; missing documents | Retrieve more candidates (increase k), add hybrid search
Low precision, high recall | Finding everything but with lots of noise | Better re-ranking, stricter similarity threshold
High MRR, low NDCG | First result is great, rest are poorly ordered | Improve re-ranking for positions 2-5
Low MRR, high NDCG | Good results exist but aren't ranked first | Improve initial ranking or add re-ranking stage

Evaluation dataset sizes

Stage | Minimum | Ideal
--- | --- | ---
Quick sanity check | 20-50 | --
Pre-deployment eval | 100-200 | 500+
Ongoing monitoring | 50/week | 200/week
A/B test significance | 200+ per variant | 1000+

Observability dashboard design

The four pillars

LOGS              METRICS           TRACES            EVALUATION
Every LLM call    Latency           Full pipeline     Hallucination rate
  input/output      P95, avg          step-by-step      confidence dist.
  params            Token cost        timing per span   retrieval quality
  prompt version    Error rate        bottleneck ID     user satisfaction
  token counts      Throughput        error propagation calibration check

Six metric categories

Category | Key Metrics | Targets
--- | --- | ---
Latency | Avg response, P95, time to first token | < 2s avg, < 5s P95, < 500ms TTFT
Cost | Cost per call, daily spend, tokens per call | Within budget, trending stable
Errors | Error rate, rate limit hits | < 1% errors, < 0.1% rate limits
Quality | Hallucination rate, confidence avg | < 5% hallucination, > 0.75 confidence
Retrieval | Precision@k, Recall@k, MRR | > 0.60 precision, > 0.70 recall
User | Thumbs up rate, edit rate | > 80% thumbs up, < 15% edits

What to log vs what NOT to log

Always Log | Never Log
--- | ---
Request ID, timestamp | Raw user PII (unless required + encrypted)
Model, temperature, params | API keys or secrets
Input/output token counts | Full conversation history of other users
Latency, finish reason | Internal IP addresses
Prompt version |
Confidence scores, error details |

RAG trace spans (typical)

embed_query:          ~120ms  (5%)    -- Embed user question
retrieve_documents:   ~230ms  (10%)   -- Vector DB search
build_prompt:          ~15ms  (1%)    -- Format context + instructions
generate_answer:     ~1450ms  (62%)   -- LLM generation (usually the bottleneck)
hallucination_check:  ~525ms  (22%)   -- Verification pipeline

Alert rules

Rule | Metric | Threshold | Severity | Cooldown
--- | --- | --- | --- | ---
High error rate | Error % | > 5% | Critical | 10 min
Latency spike | P95 latency | > 5000ms | Warning | 15 min
Hallucination spike | Avg hallucination score | > 0.15 | Critical | 30 min
Cost spike | Hourly cost | > budget * 2 | Warning | 60 min
Confidence drop | Avg confidence | < 0.60 | Warning | 30 min
Low satisfaction | Thumbs up rate | < 70% | Warning | 60 min
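One way to apply these rules is a breach check plus a cooldown window, so a sustained breach fires once per cooldown rather than on every scrape. A minimal sketch; the rule shape and field names are assumptions:

```javascript
// rule: { op, threshold, severity, cooldownMin }; times in epoch milliseconds.
function shouldFire(rule, value, nowMs, lastFiredMs) {
  const breached = rule.op === '>' ? value > rule.threshold : value < rule.threshold;
  const coolingDown = lastFiredMs != null && nowMs - lastFiredMs < rule.cooldownMin * 60_000;
  return breached && !coolingDown;
}

const highErrorRate = { op: '>', threshold: 0.05, severity: 'critical', cooldownMin: 10 };
```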

Observability tools

Tool | Best For | Key Feature
--- | --- | ---
Helicone | API-level monitoring | Drop-in proxy, zero code change
LangSmith | LangChain apps | Deep chain tracing, eval suites
Langfuse | Open-source, self-hosted | Traces, evals, full control
Weights & Biases | ML experiment tracking | Prompt versioning, dashboards
Custom dashboards | Full control | Exactly what you need

Common gotchas

Gotcha | What Goes Wrong | Fix
--- | --- | ---
Overconfident self-assessment | Model says 0.95 but is correct only 70% of the time | Calibrate with labeled dataset; combine multiple signals
Consistency != correctness | Model hallucinations are the same across rephrasings | Consistency check alone is not enough; layer with source checking
NLI on complex claims | NLI model fails on multi-hop reasoning | Use NLI for simple claims, escalate complex ones to LLM verifier
No calibration dataset | Confidence thresholds are guesses, not data-driven | Build 200+ question dataset with ground truth; measure ECE
Evaluating retrieval by answer quality | Good answers mask bad retrieval (LLM compensates) | Evaluate retrieval and generation separately (component eval)
Stale eval dataset | Eval set doesn't cover new document types or query patterns | Refresh with production query logs monthly
Too many false positives | Valid answers blocked by overly aggressive hallucination detection | Tune thresholds; accept slightly higher false negatives for lower false positives
No prompt version logging | Can't correlate quality changes with prompt deployments | Log prompt version on every call
Ignoring false negatives | Focus on false positives while hallucinations slip through | Track false negative rate aggressively; sample auto-approved responses
NDCG with binary relevance | Using NDCG when you only have relevant/not-relevant labels | Use Precision@k and MRR for binary; NDCG needs graded (0-3) labels
Provider model updates | LLM provider silently updates model; quality shifts | Monitor system_fingerprint; run eval suite on schedule, not just on your changes
Missing cost monitoring | Bill doubles; no breakdown by model/feature to diagnose | Track cost per call, per model, per feature; set budget alerts

Key formulas cheat sheet

Precision@k        = relevant_in_top_k / k
Recall@k           = relevant_in_top_k / total_relevant
MRR                = mean(1 / rank_of_first_relevant)
NDCG@k             = DCG@k / IDCG@k
DCG@k              = sum(relevance_i / log2(i + 1))
probability        = Math.exp(logprob)
ECE                = sum(bin_weight * |actual_accuracy - expected_accuracy|)
Composite conf.    = sum(signal_i * weight_i)
Hallucination rate = hallucinated_responses / total_responses
False positive rate = flagged_but_correct / total_flagged
False negative rate = missed_hallucinations / total_hallucinations

One-line summary

AI Evaluation = Detect hallucinations + Score confidence + Measure retrieval quality + Monitor everything in production -- continuously, not once


End of 4.14 quick revision.