Episode 4 — Generative AI Engineering / 4.14 — Evaluating AI Systems
4.14 — Exercise Questions: Evaluating AI Systems
Practice questions for all four subtopics in Section 4.14. Mix of conceptual, calculation, design, and hands-on tasks.
How to use this material (instructions)
- Read lessons in order — README.md, then 4.14.a → 4.14.d.
- Answer closed-book first — then compare to the matching lesson.
- Build the code examples — run them, modify them, break them.
- Interview prep — 4.14-Interview-Questions.md.
- Quick review — 4.14-Quick-Revision.md.
4.14.a — Detecting Hallucinations (Q1–Q10)
Q1. Define hallucination detection in one sentence. How is it different from hallucination prevention?
Q2. Your RAG system answers "The refund period is 60 days" but the source document says "30 days." Which hallucination detection method would catch this most reliably? Explain why.
Q3. Explain the cross-referencing method for hallucination detection. What are its two main limitations?
Q4. In consistency checking, you ask the same question three different ways and get these answers: (a) "30 days", (b) "one month", (c) "60 days." What conclusion do you draw and what action do you take?
Q5. Explain what an NLI (Natural Language Inference) model does. What are its three output categories and what does each mean for hallucination detection?
Q6. Compare NLI-based detection vs LLM-based verification across speed, cost, accuracy, and scalability. When would you choose each?
Q7. Your hallucination detection pipeline has a false positive rate of 20%. Users are complaining that good answers are being blocked. But your false negative rate is 2% (very few hallucinations slip through). Should you adjust? How?
Q8. Design a human evaluation sampling strategy for a medical Q&A chatbot. Which responses should always be reviewed? Which can be sampled randomly?
Q9. Explain how source attribution works as a hallucination detection method. Why does requiring citations reduce hallucination?
Q10. Design exercise: Sketch a three-layer hallucination detection pipeline. For each layer, specify: what it checks, its speed, its false positive rate, and when it triggers the next layer.
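Before attempting Q10, it can help to see the shape of a layered pipeline in code. The sketch below is one possible answer, not a canonical design: the layer implementations (word-overlap, exact-match consistency) and all thresholds are illustrative stand-ins, and the third layer is a stub where an NLI model or LLM verifier would go.

```javascript
// Layer 1 (fast, high false-positive rate): what fraction of answer
// words appear in the source document?
function sourceOverlap(answer, source) {
  const sourceWords = new Set(source.toLowerCase().split(/\W+/));
  const answerWords = answer.toLowerCase().split(/\W+/).filter(Boolean);
  const hits = answerWords.filter((w) => sourceWords.has(w)).length;
  return answerWords.length ? hits / answerWords.length : 0;
}

// Layer 2 (slower): do independently sampled answers agree?
// Returns the share of answers matching the most common one.
function consistencyScore(answers) {
  const counts = {};
  for (const a of answers) counts[a] = (counts[a] || 0) + 1;
  return Math.max(...Object.values(counts)) / answers.length;
}

// Layer 3 (slowest, most accurate): an NLI model or LLM verifier
// would go here — stubbed as a placeholder.
async function llmVerify(answer, source) {
  return { verdict: "needs-human-review" };
}

// Each layer only triggers the next when its check is inconclusive.
// The 0.6 and 0.67 thresholds are illustrative, not canonical.
async function detectHallucination(answer, source, resampledAnswers) {
  if (sourceOverlap(answer, source) >= 0.6) return { risk: "LOW", layer: 1 };
  if (consistencyScore([answer, ...resampledAnswers]) >= 0.67)
    return { risk: "MEDIUM", layer: 2 };
  return { risk: "HIGH", layer: 3, ...(await llmVerify(answer, source)) };
}
```

Note how the cheap layers run first and only suspicious answers pay for the expensive check — the cost/accuracy trade-off Q6 asks about.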
4.14.b — Confidence Scores (Q11–Q20)
Q11. What is a confidence score in an AI output? Why is it more useful than a simple pass/fail check?
Q12. You ask a model to self-assess its confidence. It reports 0.95 for the answer "The company was founded in 2020." But the source document says 2015. What does this tell you about self-assessment reliability?
Q13. Explain log probabilities (logprobs) in one paragraph. How do you convert a logprob of -0.69 to a probability?
Q14. Your logprobs analysis shows high confidence (>95%) on the tokens "The", "capital", "of", "France", "is" but only 40% confidence on the token "Lyon" (the actual answer token). What does this pattern tell you?
Q15. Define calibration in the context of confidence scores. What does it mean if a system is "overconfident"?
Q16. Calculation: You have a calibration dataset of 200 questions. When the model reports confidence 0.8-1.0, it's correct 150 out of 180 times. What is the actual accuracy for this confidence bin? Is the model overconfident or underconfident?
Q17. Explain Platt scaling in simple terms. What problem does it solve?
Q18. Design a confidence-based routing system with three tiers: auto-approve, serve-with-disclaimer, and human-review. Specify thresholds for a customer support chatbot and explain your choices.
Q19. You're combining five confidence signals: self-assessment (0.15 weight), logprobs (0.25), source overlap (0.30), retrieval relevance (0.20), and hallucination penalty (0.10). Your signals are: 0.90, 0.85, 0.40, 0.80, and hallucination score 0.35. What is the composite confidence? Which signal is the weakest?
Q20. Hands-on: Write a JavaScript function that takes an OpenAI API response with logprobs: true and returns: (a) average token confidence, (b) the lowest-confidence token, and (c) a risk level (LOW/MEDIUM/HIGH).
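One possible sketch for Q20 is below. It assumes the OpenAI Chat Completions response shape when the request sets `logprobs: true`, where `choices[0].logprobs.content` is an array of `{ token, logprob }` objects; the risk thresholds (0.5 and 0.8) are illustrative choices, not standard values.

```javascript
// Summarize token-level confidence from a logprobs-enabled response.
function analyzeLogprobs(response) {
  const tokens = response.choices[0].logprobs.content;
  // Convert each logprob to a probability: p = e^logprob
  // (Q13: a logprob of -0.69 is e^-0.69 ≈ 0.50).
  const probs = tokens.map((t) => ({ token: t.token, prob: Math.exp(t.logprob) }));
  const avg = probs.reduce((sum, t) => sum + t.prob, 0) / probs.length;
  const lowest = probs.reduce((min, t) => (t.prob < min.prob ? t : min));
  // Illustrative thresholds: one very uncertain token is riskier than a
  // mediocre average (the Q14 pattern).
  const risk = lowest.prob < 0.5 ? "HIGH" : avg < 0.8 ? "MEDIUM" : "LOW";
  return { averageConfidence: avg, lowestToken: lowest, risk };
}
```

Run it against a real API response after writing your own version, and compare which token each implementation flags as lowest-confidence.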
4.14.c — Evaluating Retrieval Quality (Q21–Q30)
Q21. Why is retrieval quality called "the ceiling" for RAG answer quality? Give a concrete example.
Q22. You retrieve 5 documents for a query. Documents at positions 1, 3, and 5 are relevant. There are 4 total relevant documents in the corpus. Calculate: (a) Precision@5, (b) Recall@5, (c) Precision@3, (d) Recall@3.
Q23. Calculate the MRR for these three queries: Query 1 has first relevant result at position 1, Query 2 at position 4, Query 3 at position 2.
Q24. Explain NDCG in plain language. Why is it more informative than precision@k for ranking quality?
Q25. You're comparing two retrieval strategies: Strategy A has Precision@5=0.80 and Recall@5=0.40. Strategy B has Precision@5=0.40 and Recall@5=0.80. Which would you choose for a RAG system and why?
Q26. Name three strategies for building a retrieval evaluation dataset. What are the pros and cons of each?
Q27. Your retrieval evaluation shows Recall@5=0.60 but Recall@20=0.90. What does this tell you? What two approaches could improve Recall@5?
Q28. List five chunk quality issues that can degrade retrieval. For each, describe the symptom and the fix.
Q29. Explain the difference between component evaluation (retrieval only) and end-to-end evaluation (retrieval + generation). When do their results disagree, and what does each disagreement pattern tell you?
Q30. Design exercise: Design an A/B test comparing pure vector search vs hybrid search (vector + BM25). Specify: what metrics you track, how many queries you need, how you determine the winner, and how you handle statistical significance.
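To check your arithmetic for Q22 and Q23, the metric definitions can be written directly as code. A minimal sketch (positions are 1-based ranks of relevant results):

```javascript
// Precision@k: relevant results in the top k, divided by k.
function precisionAtK(relevantPositions, k) {
  return relevantPositions.filter((p) => p <= k).length / k;
}

// Recall@k: relevant results in the top k, divided by all relevant
// documents in the corpus.
function recallAtK(relevantPositions, k, totalRelevant) {
  return relevantPositions.filter((p) => p <= k).length / totalRelevant;
}

// MRR: mean over queries of 1 / (rank of first relevant result).
function meanReciprocalRank(firstRelevantRanks) {
  const sum = firstRelevantRanks.reduce((s, r) => s + 1 / r, 0);
  return sum / firstRelevantRanks.length;
}
```

For Q22's setup, `precisionAtK([1, 3, 5], 5)` gives 0.6 and `recallAtK([1, 3, 5], 3, 4)` gives 0.5; for Q23, `meanReciprocalRank([1, 4, 2])` gives ≈ 0.583.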
4.14.d — Observability and Monitoring (Q31–Q43)
Q31. What are the four pillars of AI observability? How does the fourth pillar (evaluation) differ from traditional monitoring?
Q32. List the six metric categories for AI system monitoring. Give one specific metric for each category.
Q33. Your AI system's P95 latency jumps from 2s to 8s, but error rate stays at 0.5%. Using traces, describe your debugging process step by step.
Q34. Why is it important to log the prompt version alongside every LLM call? Give a scenario where missing this causes a debugging nightmare.
Q35. Your hallucination detection alert fires: average hallucination score jumped from 0.08 to 0.22 overnight. List five possible root causes you would investigate.
Q36. Compare Helicone, LangSmith, Langfuse, and custom dashboards. When would you choose each?
Q37. Explain how to trace a RAG pipeline end-to-end. What are the five typical spans, and which span usually takes the most time?
Q38. Calculation: Your AI chatbot makes 50,000 calls/day using gpt-4o ($2.50/1M input, $10.00/1M output). Average call: 3,000 input tokens, 500 output tokens. What is the daily cost? What would switching 60% of calls to gpt-4o-mini ($0.15/1M input, $0.60/1M output) save per day?
Q39. Design an A/B test for a new system prompt. Specify: traffic split, minimum sample size, metrics to compare, and how you decide to ship or rollback.
Q40. Your monthly AI bill doubled from $5,000 to $10,000 but call volume only increased 20%. List four possible causes and how your cost monitoring dashboard would help you find each one.
Q41. What should you always log and what should you never log for AI calls? Why?
Q42. Design an alert rule set for a production AI system. Include at least five rules, each with: metric, threshold, severity, and cooldown period.
Q43. Hands-on: Write a JavaScript class that wraps any OpenAI API call with automatic logging of: requestId, timestamp, model, latency, input/output token counts, cost estimate, and any error details.
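The cost arithmetic in Q38 is easy to get wrong by a factor of 1,000; writing it as code makes the units explicit. This sketch uses the per-1M-token prices quoted in the question:

```javascript
// Dollars for one call, given per-1M-token pricing.
function costPerCall(inputTokens, outputTokens, pricing) {
  return (inputTokens * pricing.input + outputTokens * pricing.output) / 1e6;
}

const gpt4o = { input: 2.5, output: 10.0 };     // $/1M tokens
const gpt4oMini = { input: 0.15, output: 0.6 }; // $/1M tokens
const callsPerDay = 50_000;

// All traffic on gpt-4o: 50K × $0.0125/call = $625/day.
const allGpt4o = callsPerDay * costPerCall(3000, 500, gpt4o);

// 60% of calls moved to gpt-4o-mini: $250 + $22.50 = $272.50/day.
const mixed =
  0.4 * callsPerDay * costPerCall(3000, 500, gpt4o) +
  0.6 * callsPerDay * costPerCall(3000, 500, gpt4oMini);

const dailySavings = allGpt4o - mixed; // $352.50/day
```

The same `costPerCall` helper is a natural building block for the cost-estimate field in the Q43 logging wrapper.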
Answer Hints
| Q | Hint |
|---|---|
| Q4 | Two out of three agree (30 days / one month are equivalent). The third is inconsistent — flag for review. |
| Q7 | Adjust the threshold to reduce false positives. Trade a small increase in false negatives (2% → 4%) for a large decrease in false positives (20% → 8%). |
| Q14 | The model is certain about the question structure but uncertain about the factual answer. This is a classic hallucination risk pattern — high confidence on common tokens, low on the critical fact. |
| Q16 | Actual accuracy = 150/180 = 83.3%. The model reports 80-100% (midpoint 90%). Actual is 83.3%. The model is slightly overconfident. |
| Q19 | Composite = 0.90(0.15) + 0.85(0.25) + 0.40(0.30) + 0.80(0.20) + (1-0.35)(0.10) = 0.135 + 0.2125 + 0.12 + 0.16 + 0.065 = 0.6925. Weakest signal: source overlap (0.40). |
| Q22 | (a) P@5=3/5=0.60, (b) R@5=3/4=0.75, (c) P@3=2/3=0.67, (d) R@3=2/4=0.50 |
| Q23 | MRR = (1/1 + 1/4 + 1/2) / 3 = (1.0 + 0.25 + 0.5) / 3 = 0.583 |
| Q25 | Strategy B (higher recall) is usually better for RAG — missing relevant documents hurts answer quality more than including some irrelevant ones. |
| Q27 | Relevant documents exist but aren't ranked high enough. Improve with: (1) re-ranking, (2) hybrid search. |
| Q38 | gpt-4o daily: 50K × (3000 × $2.50/1M + 500 × $10/1M) = 50K × ($0.0075 + $0.005) = 50K × $0.0125 = $625/day. gpt-4o-mini on 60%: 30K × (3000 × $0.15/1M + 500 × $0.60/1M) = 30K × $0.00075 = $22.50. Remaining 40% gpt-4o: 20K × $0.0125 = $250. Total: $272.50 vs $625 = saving $352.50/day. |
| Q40 | (1) Prompt grew longer (check avg input tokens), (2) Model upgraded (check model breakdown), (3) Runaway loop (check call volume by feature), (4) Output tokens increased (check max_tokens). |
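The Q19 hint's weighted sum can be verified mechanically. The sketch below assumes signals are expressed so that higher is better, which is why the hallucination score is inverted (1 − score) before weighting:

```javascript
// Weighted composite of confidence signals (weights should sum to 1).
function compositeConfidence(signals) {
  return signals.reduce((sum, s) => sum + s.value * s.weight, 0);
}

const signals = [
  { name: "self-assessment", value: 0.9, weight: 0.15 },
  { name: "logprobs", value: 0.85, weight: 0.25 },
  { name: "source overlap", value: 0.4, weight: 0.3 },
  { name: "retrieval relevance", value: 0.8, weight: 0.2 },
  { name: "hallucination (inverted)", value: 1 - 0.35, weight: 0.1 },
];

const composite = compositeConfidence(signals); // ≈ 0.6925
// Weakest raw signal: source overlap (0.40).
```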