Episode 4 — Generative AI Engineering / 4.14 — Evaluating AI Systems
4.14.c — Evaluating Retrieval Quality
In one sentence: The quality of a RAG system's answers is bounded by the quality of its retrieval — if the right documents aren't retrieved, no amount of prompt engineering can produce correct answers — so you need metrics like precision@k, recall@k, MRR, and NDCG to measure and improve retrieval systematically.
Navigation: ← 4.14.b — Confidence Scores · 4.14.d — Observability and Monitoring →
1. Why Retrieval Quality Determines RAG Answer Quality
In a RAG pipeline (4.13), the LLM only sees what the retrieval system gives it. This means:
┌──────────────────────────────────────────────────────────┐
│                 THE RETRIEVAL BOTTLENECK                 │
│                                                          │
│  Good retrieval + Good LLM = Good answers       ✓        │
│  Bad retrieval  + Good LLM = Bad answers        ✗        │
│  Good retrieval + Bad LLM  = Mediocre answers   ~        │
│  Bad retrieval  + Bad LLM  = Terrible answers   ✗✗       │
│                                                          │
│  KEY INSIGHT: Retrieval is the ceiling.                  │
│  The LLM can only work with what it's given.             │
│  If relevant chunks are NOT retrieved, the LLM           │
│  will hallucinate or say "I don't know."                 │
│                                                          │
│  Improving retrieval from 60% to 90% recall often        │
│  improves answer quality MORE than switching from        │
│  GPT-4o-mini to GPT-4o.                                  │
└──────────────────────────────────────────────────────────┘
The two types of retrieval failure
| Failure Type | What Happens | Impact |
|---|---|---|
| False negative (missed relevant doc) | Relevant document exists but isn't retrieved | LLM can't answer correctly, may hallucinate |
| False positive (irrelevant doc retrieved) | Irrelevant document is retrieved and injected | LLM gets confused, answers based on wrong context |
Both are measurable, and the metrics in this section tell you exactly where your retrieval is failing.
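As a quick sketch (function and variable names here are illustrative, not from any library), both failure types can be listed directly from a retrieved list and a ground-truth set:

```javascript
// Sketch: classify retrieval failures for one query.
// `retrieved` is the ordered list of returned doc IDs;
// `relevant` is the ground-truth Set of relevant doc IDs.
function retrievalFailures(retrieved, relevant) {
  const retrievedSet = new Set(retrieved);
  return {
    // False negatives: relevant docs the retriever missed
    falseNegatives: [...relevant].filter(id => !retrievedSet.has(id)),
    // False positives: retrieved docs that are not relevant
    falsePositives: retrieved.filter(id => !relevant.has(id))
  };
}

const failures = retrievalFailures(
  ['doc-7', 'doc-3', 'doc-12'],
  new Set(['doc-7', 'doc-12', 'doc-15'])
);
// falseNegatives: ['doc-15'], falsePositives: ['doc-3']
```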
2. Core Metrics: Precision@k, Recall@k, MRR, NDCG
Precision@k — "Of the k documents retrieved, how many were relevant?"
Precision@k = (number of relevant documents in top-k results) / k
Example: You retrieve 5 documents (k=5)
Doc 1: Relevant ✓
Doc 2: Irrelevant ✗
Doc 3: Relevant ✓
Doc 4: Irrelevant ✗
Doc 5: Relevant ✓
Precision@5 = 3/5 = 0.60 (60% of retrieved docs are relevant)
Precision@3 = 2/3 = 0.67 (first 3 docs: 2 relevant)
Precision@1 = 1/1 = 1.00 (first doc is relevant)
Recall@k — "Of all relevant documents, how many did we retrieve?"
Recall@k = (number of relevant documents in top-k results) / (total relevant documents)
Example: There are 4 relevant documents in the entire corpus. You retrieve 5.
Retrieved: [Relevant, Irrelevant, Relevant, Irrelevant, Relevant]
Relevant docs found: 3 out of 4
Recall@5 = 3/4 = 0.75 (found 75% of all relevant docs)
Recall@3 = 2/4 = 0.50 (first 3 results found 2 of 4 relevant docs)
Recall@1 = 1/4 = 0.25 (first result found 1 of 4 relevant docs)
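The two worked examples above can be reproduced in a few lines (this is a minimal sketch, not the full evaluator):

```javascript
// Each ranked result marked relevant (true) or not (false),
// matching the retrieved list [R, I, R, I, R] above.
const retrieved = [true, false, true, false, true];
const totalRelevant = 4; // relevant docs in the whole corpus

const precisionAtK = k => retrieved.slice(0, k).filter(Boolean).length / k;
const recallAtK = k => retrieved.slice(0, k).filter(Boolean).length / totalRelevant;

console.log(precisionAtK(5), recallAtK(5)); // 0.6 0.75
console.log(precisionAtK(1), recallAtK(1)); // 1 0.25
```

Note how the two metrics pull in opposite directions as k grows: recall can only go up, while precision tends to fall.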
MRR (Mean Reciprocal Rank) — "How high does the first relevant result appear?"
Reciprocal Rank = 1 / (position of first relevant result)
MRR = average of reciprocal ranks across all queries
Example queries:
Query 1: First relevant at position 1 → RR = 1/1 = 1.00
Query 2: First relevant at position 3 → RR = 1/3 = 0.33
Query 3: First relevant at position 2 → RR = 1/2 = 0.50
MRR = (1.00 + 0.33 + 0.50) / 3 = 0.61
Interpretation: on average, the first relevant result appears between positions 1 and 2 (1 / 0.61 ≈ 1.6).
Higher MRR = relevant results appear earlier (better for user experience).
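The MRR arithmetic above, checked in code:

```javascript
// Position of the first relevant result for each of the three example queries
const positions = [1, 3, 2];
const mrr = positions.reduce((sum, p) => sum + 1 / p, 0) / positions.length;
console.log(mrr.toFixed(2)); // "0.61"
```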
NDCG (Normalized Discounted Cumulative Gain) — "Are the MOST relevant results ranked highest?"
NDCG goes beyond binary relevance (relevant/not relevant) to graded relevance (how relevant on a scale):
Relevance scale: 0 = not relevant, 1 = somewhat relevant, 2 = very relevant, 3 = perfect
Example retrieval for "company refund policy":
Position 1: Refund FAQ page → relevance 3 (perfect match)
Position 2: General FAQ page → relevance 1 (somewhat relevant)
Position 3: Refund policy document → relevance 3 (perfect match)
Position 4: Product pricing page → relevance 0 (not relevant)
Position 5: Return shipping info → relevance 2 (very relevant)
DCG@5 = 3/log2(2) + 1/log2(3) + 3/log2(4) + 0/log2(5) + 2/log2(6)
= 3.0 + 0.63 + 1.5 + 0.0 + 0.77
= 5.90
Ideal ranking: [3, 3, 2, 1, 0] (best possible order)
IDCG@5 = 3/log2(2) + 3/log2(3) + 2/log2(4) + 1/log2(5) + 0/log2(6)
= 3.0 + 1.89 + 1.0 + 0.43 + 0.0
= 6.32
NDCG@5 = DCG@5 / IDCG@5 = 5.90 / 6.32 = 0.93
Interpretation: 0.93 is excellent — the ranking is 93% as good as the ideal ordering.
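The DCG/IDCG arithmetic in the worked example can be verified with a one-line DCG helper:

```javascript
// DCG with linear gain: sum of relevance / log2(position + 1),
// where position is 1-indexed (hence i + 2 for a 0-indexed array).
const dcg = scores => scores.reduce((sum, rel, i) => sum + rel / Math.log2(i + 2), 0);

const ranked = [3, 1, 3, 0, 2];                   // relevance in retrieved order
const ideal = [...ranked].sort((a, b) => b - a);  // [3, 3, 2, 1, 0]
const ndcg = dcg(ranked) / dcg(ideal);

console.log(dcg(ranked).toFixed(2)); // "5.90"
console.log(dcg(ideal).toFixed(2));  // "6.32"
console.log(ndcg.toFixed(2));        // "0.93"
```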
3. Implementing Retrieval Metrics in Code
class RetrievalEvaluator {
/**
 * Eval dataset shape consumed by evaluateAll():
 * Array<{query: string, retrievedDocs: string[], relevantDocs: string[], relevanceScores?: Map<string, number>}>
 *   relevantDocs  = the ground-truth set of relevant document IDs for each query
 *   retrievedDocs = the ordered list of document IDs returned by retrieval
 */
// Precision@k for a single query
precisionAtK(retrievedDocs, relevantDocs, k) {
const topK = retrievedDocs.slice(0, k);
const relevantInTopK = topK.filter(doc => relevantDocs.includes(doc));
return relevantInTopK.length / k;
}
// Recall@k for a single query
recallAtK(retrievedDocs, relevantDocs, k) {
if (relevantDocs.length === 0) return 0;
const topK = retrievedDocs.slice(0, k);
const relevantInTopK = topK.filter(doc => relevantDocs.includes(doc));
return relevantInTopK.length / relevantDocs.length;
}
// Reciprocal Rank for a single query
reciprocalRank(retrievedDocs, relevantDocs) {
for (let i = 0; i < retrievedDocs.length; i++) {
if (relevantDocs.includes(retrievedDocs[i])) {
return 1 / (i + 1);
}
}
return 0; // No relevant doc found
}
// NDCG@k for a single query (with graded relevance)
ndcgAtK(retrievedDocs, relevanceScores, k) {
// relevanceScores: Map<docId, relevanceScore (0-3)>
const topK = retrievedDocs.slice(0, k);
// DCG
let dcg = 0;
for (let i = 0; i < topK.length; i++) {
const relevance = relevanceScores.get(topK[i]) || 0;
dcg += relevance / Math.log2(i + 2); // i+2 because log2(1)=0
}
// Ideal DCG: sort all scores descending, compute DCG
const idealScores = Array.from(relevanceScores.values())
.sort((a, b) => b - a)
.slice(0, k);
let idcg = 0;
for (let i = 0; i < idealScores.length; i++) {
idcg += idealScores[i] / Math.log2(i + 2);
}
return idcg > 0 ? dcg / idcg : 0;
}
// Aggregate metrics across all queries
evaluateAll(evalDataset, k = 5) {
const metrics = evalDataset.map(item => ({
query: item.query,
precisionAtK: this.precisionAtK(item.retrievedDocs, item.relevantDocs, k),
recallAtK: this.recallAtK(item.retrievedDocs, item.relevantDocs, k),
reciprocalRank: this.reciprocalRank(item.retrievedDocs, item.relevantDocs),
ndcgAtK: item.relevanceScores
? this.ndcgAtK(item.retrievedDocs, item.relevanceScores, k)
: null
}));
// Average across all queries
const avg = (arr) => arr.reduce((a, b) => a + b, 0) / arr.length;
return {
perQuery: metrics,
aggregated: {
meanPrecisionAtK: avg(metrics.map(m => m.precisionAtK)),
meanRecallAtK: avg(metrics.map(m => m.recallAtK)),
mrr: avg(metrics.map(m => m.reciprocalRank)),
// Only average NDCG over queries that have graded relevance scores
meanNdcgAtK: metrics.some(m => m.ndcgAtK !== null)
? avg(metrics.filter(m => m.ndcgAtK !== null).map(m => m.ndcgAtK))
: null,
k
}
};
}
}
// Usage
const evaluator = new RetrievalEvaluator();
const evalDataset = [
{
query: 'What is the refund policy?',
retrievedDocs: ['doc-7', 'doc-3', 'doc-12', 'doc-1', 'doc-9'],
relevantDocs: ['doc-7', 'doc-12', 'doc-15'], // ground truth
relevanceScores: new Map([
['doc-7', 3], ['doc-3', 0], ['doc-12', 3],
['doc-1', 0], ['doc-9', 1], ['doc-15', 2]
])
},
{
query: 'How do I cancel my subscription?',
retrievedDocs: ['doc-5', 'doc-2', 'doc-8', 'doc-11', 'doc-4'],
relevantDocs: ['doc-2', 'doc-8'],
relevanceScores: new Map([
['doc-5', 0], ['doc-2', 3], ['doc-8', 2],
['doc-11', 0], ['doc-4', 1]
])
}
];
const results = evaluator.evaluateAll(evalDataset, 5);
console.log('Aggregated metrics:', results.aggregated);
/*
{
  meanPrecisionAtK: 0.40,  // (2/5 + 2/5) / 2
  meanRecallAtK: 0.83,     // (2/3 + 2/2) / 2
  mrr: 0.75,               // (1/1 + 1/2) / 2
  meanNdcgAtK: 0.73,
  k: 5
}
(values rounded to two decimals)
*/
4. Building Evaluation Datasets for Retrieval
Metrics are only meaningful if you have a ground-truth evaluation dataset. Building one is the hardest — and most valuable — part of retrieval evaluation.
Strategy 1: Manual labeling
// Create question-answer pairs manually from your document corpus
const evalDataset = [
{
id: 'eval-001',
query: 'What is the maximum refund period for physical products?',
relevantDocIds: ['refund-policy-v3-chunk-2', 'refund-policy-v3-chunk-3'],
expectedAnswer: '30 days from purchase date',
difficulty: 'easy',
category: 'policy'
},
{
id: 'eval-002',
query: 'Can I return a digital product after downloading it?',
relevantDocIds: ['refund-policy-v3-chunk-5', 'digital-tos-chunk-1'],
expectedAnswer: 'No, digital products are non-refundable after download',
difficulty: 'easy',
category: 'policy'
},
{
id: 'eval-003',
query: 'What happens if I return a product after the refund window?',
relevantDocIds: ['refund-policy-v3-chunk-4'],
expectedAnswer: 'Late returns may be eligible for store credit at manager discretion',
difficulty: 'medium',
category: 'policy'
}
// ... 200+ more questions
];
Strategy 2: LLM-generated evaluation questions
import OpenAI from 'openai';
const openai = new OpenAI();
async function generateEvalQuestions(chunks) {
const evalPairs = [];
for (const chunk of chunks) {
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0.3,
messages: [
{
role: 'system',
content: `Given a document chunk, generate 3 questions that this chunk
can answer. For each question, extract the ground-truth answer.
Generate questions at different difficulty levels:
- Easy: answer is explicitly stated
- Medium: requires combining information from the chunk
- Hard: requires inference or understanding context
Return JSON:
{
"questions": [
{
"question": "the question",
"answer": "the ground truth answer",
"difficulty": "easy" | "medium" | "hard"
}
]
}`
},
{
role: 'user',
content: `Document: ${chunk.document}\nChunk ID: ${chunk.id}\nContent: ${chunk.text}`
}
],
response_format: { type: 'json_object' }
});
const { questions } = JSON.parse(response.choices[0].message.content);
for (const q of questions) {
evalPairs.push({
query: q.question,
relevantDocIds: [chunk.id],
expectedAnswer: q.answer,
difficulty: q.difficulty,
sourceChunk: chunk.id
});
}
}
return evalPairs;
}
// Usage: generate eval set from your chunk corpus
const chunks = [
{ id: 'chunk-001', document: 'refund-policy.md', text: 'Returns accepted within 30 days...' },
{ id: 'chunk-002', document: 'shipping-policy.md', text: 'Free shipping on orders over $50...' }
];
const evalSet = await generateEvalQuestions(chunks);
console.log(`Generated ${evalSet.length} evaluation questions`);
Strategy 3: User query logs
// Mine real user queries for evaluation data
function buildEvalFromQueryLogs(queryLogs, humanLabels) {
return queryLogs
.filter(log => humanLabels.has(log.queryId)) // Only queries with human labels
.map(log => ({
query: log.query,
retrievedDocs: log.retrievedChunkIds,
relevantDocs: humanLabels.get(log.queryId).relevantChunkIds,
expectedAnswer: humanLabels.get(log.queryId).correctAnswer,
isRealQuery: true
}));
}
How many evaluation examples do you need?
| Evaluation Stage | Minimum | Ideal | Notes |
|---|---|---|---|
| Quick sanity check | 20-50 | — | During development |
| Pre-deployment eval | 100-200 | 500+ | Cover all categories |
| Ongoing monitoring | 50/week | 200/week | Sampled from production |
| A/B test significance | 200+ per variant | 1000+ | For statistical power |
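For the ongoing-monitoring row, sampling evenly across categories matters as much as the raw count. A minimal sketch (the helper name, `category` field, and per-category quota are assumptions, not from any library):

```javascript
// Sample a fixed number of labeled production queries per category each week,
// so monitoring covers every category rather than just the most common one.
function sampleForMonitoring(labeledQueries, perCategory = 10) {
  const byCategory = new Map();
  for (const q of labeledQueries) {
    if (!byCategory.has(q.category)) byCategory.set(q.category, []);
    byCategory.get(q.category).push(q);
  }
  const sample = [];
  for (const queries of byCategory.values()) {
    // Sketch takes the first N per category; use random sampling in practice
    sample.push(...queries.slice(0, perCategory));
  }
  return sample;
}

const weeklySample = sampleForMonitoring(
  [
    { query: 'q1', category: 'policy' },
    { query: 'q2', category: 'policy' },
    { query: 'q3', category: 'billing' }
  ],
  1
);
// one query per category: 2 items
```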
5. A/B Testing Retrieval Strategies
When you want to compare two retrieval approaches (e.g., pure vector search vs hybrid search), run a structured A/B test:
class RetrievalABTest {
constructor(strategyA, strategyB, evalDataset) {
this.strategyA = strategyA;
this.strategyB = strategyB;
this.evalDataset = evalDataset;
this.evaluator = new RetrievalEvaluator();
}
async run(k = 5) {
const resultsA = [];
const resultsB = [];
for (const item of this.evalDataset) {
// Run both strategies on the same query
const [docsA, docsB] = await Promise.all([
this.strategyA.retrieve(item.query, k),
this.strategyB.retrieve(item.query, k)
]);
resultsA.push({
query: item.query,
retrievedDocs: docsA.map(d => d.id),
relevantDocs: item.relevantDocIds,
relevanceScores: item.relevanceScores
});
resultsB.push({
query: item.query,
retrievedDocs: docsB.map(d => d.id),
relevantDocs: item.relevantDocIds,
relevanceScores: item.relevanceScores
});
}
const metricsA = this.evaluator.evaluateAll(resultsA, k);
const metricsB = this.evaluator.evaluateAll(resultsB, k);
return this.compareResults(metricsA, metricsB);
}
compareResults(metricsA, metricsB) {
const a = metricsA.aggregated;
const b = metricsB.aggregated;
const comparison = {
strategyA: a,
strategyB: b,
winner: {},
summary: []
};
// Compare each metric
const metricNames = ['meanPrecisionAtK', 'meanRecallAtK', 'mrr', 'meanNdcgAtK'];
for (const metric of metricNames) {
if (a[metric] === null || b[metric] === null) continue;
const diff = b[metric] - a[metric];
const pctChange = a[metric] > 0 ? (diff / a[metric] * 100).toFixed(1) : 'N/A';
comparison.winner[metric] = diff > 0 ? 'B' : diff < 0 ? 'A' : 'TIE';
comparison.summary.push(
`${metric}: A=${a[metric].toFixed(3)} vs B=${b[metric].toFixed(3)} ` +
`(${diff > 0 ? '+' : ''}${pctChange}%) → Winner: ${comparison.winner[metric]}`
);
}
return comparison;
}
}
// Usage
const test = new RetrievalABTest(
vectorSearchStrategy, // Pure embedding-based search
hybridSearchStrategy, // Embedding + BM25 keyword search
evalDataset
);
const results = await test.run(5);
results.summary.forEach(line => console.log(line));
/*
meanPrecisionAtK: A=0.520 vs B=0.640 (+23.1%) → Winner: B
meanRecallAtK: A=0.480 vs B=0.620 (+29.2%) → Winner: B
mrr: A=0.710 vs B=0.780 (+9.9%) → Winner: B
meanNdcgAtK: A=0.650 vs B=0.740 (+13.8%) → Winner: B
*/
6. Measuring Chunk Relevance
Beyond document-level retrieval, evaluate the quality of individual chunks — the actual text fragments injected into the prompt.
import OpenAI from 'openai';
const openai = new OpenAI();
async function evaluateChunkRelevance(query, chunks) {
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0,
messages: [
{
role: 'system',
content: `You are a retrieval quality evaluator. For each chunk, rate its
relevance to the query on a 0-3 scale:
0 = Not relevant at all
1 = Tangentially relevant (same topic but doesn't answer the question)
2 = Partially relevant (contains some useful information)
3 = Highly relevant (directly answers or is essential for answering the question)
Also note if the chunk contains the specific information needed to answer the query.
Return JSON:
{
"evaluations": [
{
"chunkId": "id",
"relevanceScore": 0-3,
"containsAnswer": true/false,
"reasoning": "brief explanation"
}
],
"summary": {
"totalChunks": number,
"relevantChunks": number,
"averageRelevance": number,
"answerCoverage": "FULL" | "PARTIAL" | "NONE"
}
}`
},
{
role: 'user',
content: `Query: ${query}\n\nChunks:\n${
chunks.map((c, i) => `[Chunk ${i + 1} | ID: ${c.id}]: ${c.text}`).join('\n\n')
}`
}
],
response_format: { type: 'json_object' }
});
return JSON.parse(response.choices[0].message.content);
}
// Usage
const chunkEval = await evaluateChunkRelevance(
'What is the refund policy for electronics?',
[
{ id: 'chunk-1', text: 'Electronics purchases can be returned within 15 days with original packaging.' },
{ id: 'chunk-2', text: 'Our store is located at 123 Main Street, open 9am-5pm.' },
{ id: 'chunk-3', text: 'Refunds are processed within 5-7 business days to the original payment method.' },
{ id: 'chunk-4', text: 'We carry a wide range of Samsung and Apple products.' },
{ id: 'chunk-5', text: 'Defective electronics may be exchanged within 30 days regardless of return window.' }
]
);
console.log(chunkEval.summary);
/*
{
totalChunks: 5,
relevantChunks: 3,
averageRelevance: 1.8,
answerCoverage: "FULL"
}
*/
Chunk quality issues to watch for
| Issue | Symptom | Fix |
|---|---|---|
| Too small chunks | High recall but low relevance (lots of noise) | Increase chunk size |
| Too large chunks | Low recall (relevant info buried in irrelevant text) | Decrease chunk size, add overlap |
| Bad boundaries | Answer split across two chunks, neither sufficient alone | Improve chunking strategy (semantic chunking) |
| Stale content | Correct retrieval but outdated information | Re-index when documents update |
| Duplicate chunks | Same info retrieved multiple times, wasting context | Deduplicate before injection |
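For the last row of the table, a minimal sketch of "deduplicate before injection": drop chunks whose normalized text is identical. A real system might also use embedding similarity; exact matching on normalized text is the simplest version (the helper name is illustrative):

```javascript
// Keep only the first chunk for each normalized text key.
function dedupeChunks(chunks) {
  const seen = new Set();
  return chunks.filter(chunk => {
    const key = chunk.text.toLowerCase().replace(/\s+/g, ' ').trim();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

const deduped = dedupeChunks([
  { id: 'a', text: 'Returns accepted within 30 days.' },
  { id: 'b', text: 'returns  accepted within 30 days.' }, // duplicate after normalization
  { id: 'c', text: 'Free shipping on orders over $50.' }
]);
// keeps 'a' and 'c'
```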
7. End-to-End vs Component Evaluation
You can evaluate the RAG system at two levels. Both are necessary.
Component evaluation (retrieval only)
Measures the retrieval step in isolation. Does not involve the LLM.
async function evaluateRetrievalComponent(evalDataset, retriever, k = 5) {
const evaluator = new RetrievalEvaluator();
const results = [];
for (const item of evalDataset) {
const retrieved = await retriever.search(item.query, k);
results.push({
query: item.query,
retrievedDocs: retrieved.map(r => r.id),
relevantDocs: item.relevantDocIds
});
}
return evaluator.evaluateAll(results, k);
}
End-to-end evaluation (retrieval + generation)
Measures the final answer quality. Involves both retrieval and LLM.
// Reuses the `openai` client instantiated in the earlier snippet
async function evaluateEndToEnd(evalDataset, ragPipeline) {
const results = [];
for (const item of evalDataset) {
const ragResponse = await ragPipeline.query(item.query);
// Check answer correctness
const correctnessCheck = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0,
messages: [
{
role: 'system',
content: `Compare the AI answer to the ground truth. Return JSON:
{
"isCorrect": true/false,
"isPartiallyCorrect": true/false,
"hasHallucination": true/false,
"explanation": "brief explanation",
"score": 0.0 to 1.0
}`
},
{
role: 'user',
content: `Question: ${item.query}
AI Answer: ${ragResponse.answer}
Ground Truth: ${item.expectedAnswer}`
}
],
response_format: { type: 'json_object' }
});
const evaluation = JSON.parse(correctnessCheck.choices[0].message.content);
results.push({
query: item.query,
aiAnswer: ragResponse.answer,
groundTruth: item.expectedAnswer,
retrievedDocs: ragResponse.sources?.map(s => s.document) || [],
...evaluation
});
}
// Aggregate
const total = results.length;
return {
totalQuestions: total,
correctRate: results.filter(r => r.isCorrect).length / total,
partiallyCorrectRate: results.filter(r => r.isPartiallyCorrect).length / total,
hallucinationRate: results.filter(r => r.hasHallucination).length / total,
averageScore: results.reduce((sum, r) => sum + r.score, 0) / total,
perQuestion: results
};
}
When component and end-to-end metrics disagree
| Component (Retrieval) | End-to-End (Answer) | Diagnosis |
|---|---|---|
| Good | Good | System is working well |
| Good | Bad | LLM is the problem (prompt, model, hallucination) |
| Bad | Good | Lucky — the LLM is compensating (fragile, will fail) |
| Bad | Bad | Fix retrieval first — it's the bottleneck |
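The table above can be encoded as a tiny triage helper. The 0.7 thresholds are illustrative assumptions; pick cutoffs that match your own quality bar:

```javascript
// Map (component quality, end-to-end quality) to the diagnoses in the table.
function diagnose(retrievalRecall, answerCorrectRate, threshold = 0.7) {
  const retrievalGood = retrievalRecall >= threshold;
  const answerGood = answerCorrectRate >= threshold;
  if (retrievalGood && answerGood) return 'System is working well';
  if (retrievalGood) return 'LLM is the problem (prompt, model, hallucination)';
  if (answerGood) return 'LLM is compensating for bad retrieval (fragile)';
  return 'Fix retrieval first: it is the bottleneck';
}

console.log(diagnose(0.6, 0.85)); // the "Bad / Good" row: LLM is compensating
```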
8. Retrieval Evaluation Dashboard
Bring metrics together into a monitoring view:
class RetrievalDashboard {
constructor() {
this.history = [];
}
record(evalRun) {
this.history.push({
timestamp: new Date(),
...evalRun
});
}
getLatestReport() {
if (this.history.length === 0) return null;
const latest = this.history[this.history.length - 1];
const previous = this.history.length > 1
? this.history[this.history.length - 2]
: null;
const report = {
current: latest,
trends: {}
};
if (previous) {
const metrics = ['meanPrecisionAtK', 'meanRecallAtK', 'mrr', 'meanNdcgAtK'];
for (const metric of metrics) {
const curr = latest.aggregated?.[metric];
const prev = previous.aggregated?.[metric];
if (curr != null && prev != null) {
report.trends[metric] = {
current: curr,
previous: prev,
change: curr - prev,
pctChange: prev > 0 ? ((curr - prev) / prev * 100).toFixed(1) + '%' : 'N/A',
direction: curr > prev ? 'IMPROVING' : curr < prev ? 'DEGRADING' : 'STABLE'
};
}
}
}
return report;
}
printReport() {
const report = this.getLatestReport();
if (!report) {
console.log('No evaluation data recorded yet.');
return;
}
console.log('\n=== RETRIEVAL QUALITY REPORT ===');
console.log(`Timestamp: ${report.current.timestamp}`);
console.log(`Queries evaluated: ${report.current.perQuery?.length || 'N/A'}`);
console.log('\nMetrics:');
const agg = report.current.aggregated;
if (agg) {
console.log(` Precision@${agg.k}: ${agg.meanPrecisionAtK?.toFixed(3)}`);
console.log(` Recall@${agg.k}: ${agg.meanRecallAtK?.toFixed(3)}`);
console.log(` MRR: ${agg.mrr?.toFixed(3)}`);
if (agg.meanNdcgAtK != null) {
console.log(` NDCG@${agg.k}: ${agg.meanNdcgAtK?.toFixed(3)}`);
}
}
if (Object.keys(report.trends).length > 0) {
console.log('\nTrends vs previous:');
for (const [metric, trend] of Object.entries(report.trends)) {
console.log(` ${metric}: ${trend.pctChange} (${trend.direction})`);
}
}
console.log('================================\n');
}
}
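The dashboard reports trends; you can also turn them into alerts. A minimal sketch (the function name and the 5% threshold are assumptions; tune the threshold to the run-to-run noise of your eval set), operating on two aggregated metric objects like those produced by evaluateAll:

```javascript
// Flag any metric that dropped more than thresholdPct since the previous run.
function regressionAlerts(previous, current, thresholdPct = 5) {
  const alerts = [];
  for (const metric of ['meanPrecisionAtK', 'meanRecallAtK', 'mrr', 'meanNdcgAtK']) {
    const prev = previous[metric];
    const curr = current[metric];
    if (prev == null || curr == null || prev === 0) continue;
    const pctChange = (curr - prev) / prev * 100;
    if (pctChange < -thresholdPct) {
      alerts.push(`${metric} dropped ${Math.abs(pctChange).toFixed(1)}% (${prev} -> ${curr})`);
    }
  }
  return alerts;
}

const alerts = regressionAlerts(
  { meanPrecisionAtK: 0.60, meanRecallAtK: 0.80, mrr: 0.75, meanNdcgAtK: 0.82 },
  { meanPrecisionAtK: 0.58, meanRecallAtK: 0.68, mrr: 0.74, meanNdcgAtK: 0.81 }
);
// only recall fell more than 5%: one alert
```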
9. Key Takeaways
- Retrieval quality is the ceiling for RAG answer quality — no amount of prompt engineering compensates for retrieving the wrong documents. Fix retrieval first.
- Precision@k measures noise (how many irrelevant docs clog the context), Recall@k measures coverage (how many relevant docs you find), MRR measures ranking (is the best result first?), NDCG measures graded ranking quality.
- Build evaluation datasets early — use manual labeling for accuracy, LLM-generated questions for scale, and real user query logs for realism. You need 100-500+ examples.
- A/B test retrieval strategies rigorously — compare vector search, hybrid search, re-ranking, and different chunking strategies on the same eval dataset.
- Measure chunk relevance individually — irrelevant chunks waste context tokens and confuse the LLM.
- Evaluate both components and end-to-end — component metrics isolate retrieval problems; end-to-end metrics catch issues in the full pipeline. When they disagree, the diagnosis tells you exactly where to invest.
- Track metrics over time — retrieval quality can degrade silently as your document corpus grows or changes.
Explain-It Challenge
- Your RAG system has precision@5 of 0.90 but recall@5 of only 0.30. What does this mean in practical terms, and what would you do to fix it?
- A colleague says "just retrieve the top 20 documents to be safe." Explain why this might actually make answers worse.
- Your end-to-end answer accuracy is 85% but retrieval recall@5 is only 60%. How is this possible, and why is it dangerous?
Navigation: ← 4.14.b — Confidence Scores · 4.14.d — Observability and Monitoring →