Episode 4 — Generative AI Engineering / 4.14 — Evaluating AI Systems
4.14.c — Evaluating Retrieval Quality
In one sentence: The quality of a RAG system's answers is bounded by the quality of its retrieval — if the right documents aren't retrieved, no amount of prompt engineering can produce correct answers — so you need metrics like precision@k, recall@k, MRR, and NDCG to measure and improve retrieval systematically.
Navigation: ← 4.14.b — Confidence Scores · 4.14.d — Observability and Monitoring →
1. Why Retrieval Quality Determines RAG Answer Quality
In a RAG pipeline (4.13), the LLM only sees what the retrieval system gives it. This means:
┌──────────────────────────────────────────────────────────┐
│                 THE RETRIEVAL BOTTLENECK                 │
│                                                          │
│  Good retrieval + Good LLM = Good answers       ✓        │
│  Bad retrieval  + Good LLM = Bad answers        ✗        │
│  Good retrieval + Bad LLM  = Mediocre answers   ~        │
│  Bad retrieval  + Bad LLM  = Terrible answers   ✗✗       │
│                                                          │
│  KEY INSIGHT: Retrieval is the ceiling.                  │
│  The LLM can only work with what it's given.             │
│  If relevant chunks are NOT retrieved, the LLM           │
│  will hallucinate or say "I don't know."                 │
│                                                          │
│  Improving retrieval from 60% to 90% recall often        │
│  improves answer quality MORE than switching from        │
│  GPT-4o-mini to GPT-4o.                                  │
└──────────────────────────────────────────────────────────┘
The two types of retrieval failure
| Failure Type | What Happens | Impact |
|---|---|---|
| False negative (missed relevant doc) | Relevant document exists but isn't retrieved | LLM can't answer correctly, may hallucinate |
| False positive (irrelevant doc retrieved) | Irrelevant document is retrieved and injected | LLM gets confused, answers based on wrong context |
Both are measurable, and the metrics in this section tell you exactly where your retrieval is failing.
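As a quick sketch (function and variable names here are illustrative, not from any library), both failure types can be listed directly from a retrieved list and a ground-truth set:

```javascript
// Sketch: classify retrieval failures for one query.
// `retrieved` is the ordered list of returned doc IDs;
// `relevant` is the ground-truth Set of relevant doc IDs.
function retrievalFailures(retrieved, relevant) {
  const retrievedSet = new Set(retrieved);
  return {
    // False negatives: relevant docs the retriever missed
    falseNegatives: [...relevant].filter(id => !retrievedSet.has(id)),
    // False positives: retrieved docs that are not relevant
    falsePositives: retrieved.filter(id => !relevant.has(id))
  };
}

const failures = retrievalFailures(
  ['doc-7', 'doc-3', 'doc-12'],
  new Set(['doc-7', 'doc-12', 'doc-15'])
);
// falseNegatives: ['doc-15'], falsePositives: ['doc-3']
```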
2. Core Metrics: Precision@k, Recall@k, MRR, NDCG
Precision@k — "Of the k documents retrieved, how many were relevant?"
Precision@k = (number of relevant documents in top-k results) / k
Example: You retrieve 5 documents (k=5)
Doc 1: Relevant ✓
Doc 2: Irrelevant ✗
Doc 3: Relevant ✓
Doc 4: Irrelevant ✗
Doc 5: Relevant ✓
Precision@5 = 3/5 = 0.60 (60% of retrieved docs are relevant)
Precision@3 = 2/3 = 0.67 (first 3 docs: 2 relevant)
Precision@1 = 1/1 = 1.00 (first doc is relevant)
Recall@k — "Of all relevant documents, how many did we retrieve?"
Recall@k = (number of relevant documents in top-k results) / (total relevant documents)
Example: There are 4 relevant documents in the entire corpus. You retrieve 5.
Retrieved: [Relevant, Irrelevant, Relevant, Irrelevant, Relevant]
Relevant docs found: 3 out of 4
Recall@5 = 3/4 = 0.75 (found 75% of all relevant docs)
Recall@3 = 2/4 = 0.50 (first 3 results found 2 of 4 relevant docs)
Recall@1 = 1/4 = 0.25 (first result found 1 of 4 relevant docs)
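The two worked examples above can be reproduced in a few lines (this is a minimal sketch, not the full evaluator):

```javascript
// Each ranked result marked relevant (true) or not (false),
// matching the retrieved list [R, I, R, I, R] above.
const retrieved = [true, false, true, false, true];
const totalRelevant = 4; // relevant docs in the whole corpus

const precisionAtK = k => retrieved.slice(0, k).filter(Boolean).length / k;
const recallAtK = k => retrieved.slice(0, k).filter(Boolean).length / totalRelevant;

console.log(precisionAtK(5), recallAtK(5)); // 0.6 0.75
console.log(precisionAtK(1), recallAtK(1)); // 1 0.25
```

Note how the two metrics pull in opposite directions as k grows: recall can only go up, while precision tends to fall.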
MRR (Mean Reciprocal Rank) — "How high does the first relevant result appear?"
Reciprocal Rank = 1 / (position of first relevant result)
MRR = average of reciprocal ranks across all queries
Example queries:
Query 1: First relevant at position 1 → RR = 1/1 = 1.00
Query 2: First relevant at position 3 → RR = 1/3 = 0.33
Query 3: First relevant at position 2 → RR = 1/2 = 0.50
MRR = (1.00 + 0.33 + 0.50) / 3 = 0.61
Interpretation: on average, the first relevant result appears between positions 1 and 2 (1 / 0.61 ≈ 1.6).
Higher MRR = relevant results appear earlier (better for user experience).
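The MRR arithmetic above, checked in code:

```javascript
// Position of the first relevant result for each of the three example queries
const positions = [1, 3, 2];
const mrr = positions.reduce((sum, p) => sum + 1 / p, 0) / positions.length;
console.log(mrr.toFixed(2)); // "0.61"
```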
NDCG (Normalized Discounted Cumulative Gain) — "Are the MOST relevant results ranked highest?"
NDCG goes beyond binary relevance (relevant/not relevant) to graded relevance (how relevant on a scale):
Relevance scale: 0 = not relevant, 1 = somewhat relevant, 2 = very relevant, 3 = perfect
Example retrieval for "company refund policy":
Position 1: Refund FAQ page → relevance 3 (perfect match)
Position 2: General FAQ page → relevance 1 (somewhat relevant)
Position 3: Refund policy document → relevance 3 (perfect match)
Position 4: Product pricing page → relevance 0 (not relevant)
Position 5: Return shipping info → relevance 2 (very relevant)
DCG@5 = 3/log2(2) + 1/log2(3) + 3/log2(4) + 0/log2(5) + 2/log2(6)
= 3.0 + 0.63 + 1.5 + 0.0 + 0.77
= 5.90
Ideal ranking: [3, 3, 2, 1, 0] (best possible order)
IDCG@5 = 3/log2(2) + 3/log2(3) + 2/log2(4) + 1/log2(5) + 0/log2(6)
= 3.0 + 1.89 + 1.0 + 0.43 + 0.0
= 6.32
NDCG@5 = DCG@5 / IDCG@5 = 5.90 / 6.32 = 0.93
Interpretation: 0.93 is excellent — the ranking is 93% as good as the ideal ordering.
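The DCG/IDCG arithmetic in the worked example can be verified with a one-line DCG helper:

```javascript
// DCG with linear gain: sum of relevance / log2(position + 1),
// where position is 1-indexed (hence i + 2 for a 0-indexed array).
const dcg = scores => scores.reduce((sum, rel, i) => sum + rel / Math.log2(i + 2), 0);

const ranked = [3, 1, 3, 0, 2];                   // relevance in retrieved order
const ideal = [...ranked].sort((a, b) => b - a);  // [3, 3, 2, 1, 0]
const ndcg = dcg(ranked) / dcg(ideal);

console.log(dcg(ranked).toFixed(2)); // "5.90"
console.log(dcg(ideal).toFixed(2));  // "6.32"
console.log(ndcg.toFixed(2));        // "0.93"
```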
3. Implementing Retrieval Metrics in Code
class RetrievalEvaluator {
/**
 * Eval dataset shape consumed by evaluateAll():
 * Array<{query: string, retrievedDocs: string[], relevantDocs: string[], relevanceScores?: Map<string, number>}>
 *   relevantDocs  = the ground-truth set of relevant document IDs for each query
 *   retrievedDocs = the ordered list of document IDs returned by retrieval
 */
// Precision@k for a single query
precisionAtK(retrievedDocs, relevantDocs, k) {
const topK = retrievedDocs.slice(0, k);
const relevantInTopK = topK.filter(doc => relevantDocs.includes(doc));
return relevantInTopK.length / k;
}
// Recall@k for a single query
recallAtK(retrievedDocs, relevantDocs, k) {
if (relevantDocs.length === 0) return 0;
const topK = retrievedDocs.slice(0, k);
const relevantInTopK = topK.filter(doc => relevantDocs.includes(doc));
return relevantInTopK.length / relevantDocs.length;
}
// Reciprocal Rank for a single query
reciprocalRank(retrievedDocs, relevantDocs) {
for (let i = 0; i < retrievedDocs.length; i++) {
if (relevantDocs.includes(retrievedDocs[i])) {
return 1 / (i + 1);
}
}
return 0; // No relevant doc found
}
// NDCG@k for a single query (with graded relevance)
ndcgAtK(retrievedDocs, relevanceScores, k) {
// relevanceScores: Map<docId, relevanceScore (0-3)>
const topK = retrievedDocs.slice(0, k);
// DCG
let dcg = 0;
for (let i = 0; i < topK.length; i++) {
const relevance = relevanceScores.get(topK[i]) || 0;
dcg += relevance / Math.log2(i + 2); // i+2 because log2(1)=0
}
// Ideal DCG: sort all scores descending, compute DCG
const idealScores = Array.from(relevanceScores.values())
.sort((a, b) => b - a)
.slice(0, k);
let idcg = 0;
for (let i = 0; i < idealScores.length; i++) {
idcg += idealScores[i] / Math.log2(i + 2);
}
return idcg > 0 ? dcg / idcg : 0;
}
// Aggregate metrics across all queries
evaluateAll(evalDataset, k = 5) {
const metrics = evalDataset.map(item => ({
query: item.query,
precisionAtK: this.precisionAtK(item.retrievedDocs, item.relevantDocs, k),
recallAtK: this.recallAtK(item.retrievedDocs, item.relevantDocs, k),
reciprocalRank: this.reciprocalRank(item.retrievedDocs, item.relevantDocs),
ndcgAtK: item.relevanceScores
? this.ndcgAtK(item.retrievedDocs, item.relevanceScores, k)
: null
}));
// Average across all queries
const avg = (arr) => arr.reduce((a, b) => a + b, 0) / arr.length;
return {
perQuery: metrics,
aggregated: {
meanPrecisionAtK: avg(metrics.map(m => m.precisionAtK)),
meanRecallAtK: avg(metrics.map(m => m.recallAtK)),
mrr: avg(metrics.map(m => m.reciprocalRank)),
// Only average NDCG over queries that have graded relevance scores
meanNdcgAtK: metrics.some(m => m.ndcgAtK !== null)
? avg(metrics.filter(m => m.ndcgAtK !== null).map(m => m.ndcgAtK))
: null,
k
}
};
}
}
// Usage
const evaluator = new RetrievalEvaluator();
const evalDataset = [
{
query: 'What is the refund policy?',
retrievedDocs: ['doc-7', 'doc-3', 'doc-12', 'doc-1', 'doc-9'],
relevantDocs: ['doc-7', 'doc-12', 'doc-15'], // ground truth
relevanceScores: new Map([
['doc-7', 3], ['doc-3', 0], ['doc-12', 3],
['doc-1', 0], ['doc-9', 1], ['doc-15', 2]
])
},
{
query: 'How do I cancel my subscription?',
retrievedDocs: ['doc-5', 'doc-2', 'doc-8', 'doc-11', 'doc-4'],
relevantDocs: ['doc-2', 'doc-8'],
relevanceScores: new Map([
['doc-5', 0], ['doc-2', 3], ['doc-8', 2],
['doc-11', 0], ['doc-4', 1]
])
}
];
const results = evaluator.evaluateAll(evalDataset, 5);
console.log('Aggregated metrics:', results.aggregated);
/*
{
  meanPrecisionAtK: 0.40,  // (2/5 + 2/5) / 2
  meanRecallAtK: 0.83,     // (2/3 + 2/2) / 2
  mrr: 0.75,               // (1/1 + 1/2) / 2
  meanNdcgAtK: 0.73,
  k: 5
}
(values rounded to two decimals)
*/
4. Building Evaluation Datasets for Retrieval
Metrics are only meaningful if you have a ground-truth evaluation dataset. Building one is the hardest — and most valuable — part of retrieval evaluation.
Strategy 1: Manual labeling
// Create question-answer pairs manually from your document corpus
const evalDataset = [
{
id: 'eval-001',
query: 'What is the maximum refund period for physical products?',
relevantDocIds: ['refund-policy-v3-chunk-2', 'refund-policy-v3-chunk-3'],
expectedAnswer: '30 days from purchase date',
difficulty: 'easy',
category: 'policy'
},
{
id: 'eval-002',
query: 'Can I return a digital product after downloading it?',
relevantDocIds: ['refund-policy-v3-chunk-5', 'digital-tos-chunk-1'],
expectedAnswer: 'No, digital products are non-refundable after download',
difficulty: 'easy',
category: 'policy'
},
{
id: 'eval-003',
query: 'What happens if I return a product after the refund window?',
relevantDocIds: ['refund-policy-v3-chunk-4'],
expectedAnswer: 'Late returns may be eligible for store credit at manager discretion',
difficulty: 'medium',
category: 'policy'
}
// ... 200+ more questions
];
Strategy 2: LLM-generated evaluation questions
import OpenAI from 'openai';
const openai = new OpenAI();
async function generateEvalQuestions(chunks) {
const evalPairs = [];
for (const chunk of chunks) {
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0.3,
messages: [
{
role: 'system',
content: `Given a document chunk, generate 3 questions that this chunk
can answer. For each question, extract the ground-truth answer.
Generate questions at different difficulty levels:
- Easy: answer is explicitly stated
- Medium: requires combining information from the chunk
- Hard: requires inference or understanding context
Return JSON:
{
"questions": [
{
"question": "the question",
"answer": "the ground truth answer",
"difficulty": "easy" | "medium" | "hard"
}
]
}`
},
{
role: 'user',
content: `Document: ${chunk.document}\nChunk ID: ${chunk.id}\nContent: ${chunk.text}`
}
],
response_format: { type: 'json_object' }
});
const { questions } = JSON.parse(response.choices[0].message.content);
for (const q of questions) {
evalPairs.push({
query: q.question,
relevantDocIds: [chunk.id],
expectedAnswer: q.answer,
difficulty: q.difficulty,
sourceChunk: chunk.id
});
}
}
return evalPairs;
}
// Usage: generate eval set from your chunk corpus
const chunks = [
{ id: 'chunk-001', document: 'refund-policy.md', text: 'Returns accepted within 30 days...' },
{ id: 'chunk-002', document: 'shipping-policy.md', text: 'Free shipping on orders over $50...' }
];
const evalSet = await generateEvalQuestions(chunks);
console.log(`Generated ${evalSet.length} evaluation questions`);
Strategy 3: User query logs
// Mine real user queries for evaluation data
function buildEvalFromQueryLogs(queryLogs, humanLabels) {
return queryLogs
.filter(log => humanLabels.has(log.queryId)) // Only queries with human labels
.map(log => ({
query: log.query,
retrievedDocs: log.retrievedChunkIds,
relevantDocs: humanLabels.get(log.queryId).relevantChunkIds,
expectedAnswer: humanLabels.get(log.queryId).correctAnswer,
isRealQuery: true
}));
}
How many evaluation examples do you need?
| Evaluation Stage | Minimum | Ideal | Notes |
|---|---|---|---|
| Quick sanity check | 20-50 | — | During development |
| Pre-deployment eval | 100-200 | 500+ | Cover all categories |
| Ongoing monitoring | 50/week | 200/week | Sampled from production |
| A/B test significance | 200+ per variant | 1000+ | For statistical power |
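For the ongoing-monitoring row, sampling evenly across categories matters as much as the raw count. A minimal sketch (the helper name, `category` field, and per-category quota are assumptions, not from any library):

```javascript
// Sample a fixed number of labeled production queries per category each week,
// so monitoring covers every category rather than just the most common one.
function sampleForMonitoring(labeledQueries, perCategory = 10) {
  const byCategory = new Map();
  for (const q of labeledQueries) {
    if (!byCategory.has(q.category)) byCategory.set(q.category, []);
    byCategory.get(q.category).push(q);
  }
  const sample = [];
  for (const queries of byCategory.values()) {
    // Sketch takes the first N per category; use random sampling in practice
    sample.push(...queries.slice(0, perCategory));
  }
  return sample;
}

const weeklySample = sampleForMonitoring(
  [
    { query: 'q1', category: 'policy' },
    { query: 'q2', category: 'policy' },
    { query: 'q3', category: 'billing' }
  ],
  1
);
// one query per category: 2 items
```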
5. A/B Testing Retrieval Strategies
When you want to compare two retrieval approaches (e.g., pure vector search vs hybrid search), run a structured A/B test:
class RetrievalABTest {
constructor(strategyA, strategyB, evalDataset) {
this.strategyA = strategyA;
this.strategyB = strategyB;
this.evalDataset = evalDataset;
this.evaluator = new RetrievalEvaluator();
}
async run(k = 5) {
const resultsA = [];
const resultsB = [];
for (const item of this.evalDataset) {
// Run both strategies on the same query
const [docsA, docsB] = await Promise.all([
this.strategyA.retrieve(item.query, k),
this.strategyB.retrieve(item.query, k)
]);
resultsA.push({
query: item.query,
retrievedDocs: docsA.map(d => d.id),
relevantDocs: item.relevantDocIds,
relevanceScores: item.relevanceScores
});
resultsB.push({
query: item.query,
retrievedDocs: docsB.map(d => d.id),
relevantDocs: item.relevantDocIds,
relevanceScores: item.relevanceScores
});
}
const metricsA = this.evaluator.evaluateAll(resultsA, k);
const metricsB = this.evaluator.evaluateAll(resultsB, k);
return this.compareResults(metricsA, metricsB);
}
compareResults(metricsA, metricsB) {
const a = metricsA.aggregated;
const b = metricsB.aggregated;
const comparison = {
strategyA: a,
strategyB: b,
winner: {},
summary: []
};
// Compare each metric
const metricNames = ['meanPrecisionAtK', 'meanRecallAtK', 'mrr', 'meanNdcgAtK'];
for (const metric of metricNames) {
if (a[metric] === null || b[metric] === null) continue;
const diff = b[metric] - a[metric];
const pctChange = a[metric] > 0 ? (diff / a[metric] * 100).toFixed(1) : 'N/A';
comparison.winner[metric] = diff > 0 ? 'B' : diff < 0 ? 'A' : 'TIE';
comparison.summary.push(
`${metric}: A=${a[metric].toFixed(3)} vs B=${b[metric].toFixed(3)} ` +
`(${diff > 0 ? '+' : ''}${pctChange}%) → Winner: ${comparison.winner[metric]}`
);
}
return comparison;
}
}
// Usage
const test = new RetrievalABTest(
vectorSearchStrategy, // Pure embedding-based search
hybridSearchStrategy, // Embedding + BM25 keyword search
evalDataset
);
const results = await test.run(5);
results.summary.forEach(line => console.log(line));
/*
meanPrecisionAtK: A=0.520 vs B=0.640 (+23.1%) → Winner: B
meanRecallAtK: A=0.480 vs B=0.620 (+29.2%) → Winner: B
mrr: A=0.710 vs B=0.780 (+9.9%) → Winner: B
meanNdcgAtK: A=0.650 vs B=0.740 (+13.8%) → Winner: B
*/
6. Measuring Chunk Relevance
Beyond document-level retrieval, evaluate the quality of individual chunks — the actual text fragments injected into the prompt.
import OpenAI from 'openai';
const openai = new OpenAI();
async function evaluateChunkRelevance(query, chunks) {
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0,
messages: [
{
role: 'system',
content: `You are a retrieval quality evaluator. For each chunk, rate its
relevance to the query on a 0-3 scale:
0 = Not relevant at all
1 = Tangentially relevant (same topic but doesn't answer the question)
2 = Partially relevant (contains some useful information)
3 = Highly relevant (directly answers or is essential for answering the question)
Also note if the chunk contains the specific information needed to answer the query.
Return JSON:
{
"evaluations": [
{
"chunkId": "id",
"relevanceScore": 0-3,
"containsAnswer": true/false,
"reasoning": "brief explanation"
}
],
"summary": {
"totalChunks": number,
"relevantChunks": number,
"averageRelevance": number,
"answerCoverage": "FULL" | "PARTIAL" | "NONE"
}
}`
},
{
role: 'user',
content: `Query: ${query}\n\nChunks:\n${
chunks.map((c, i) => `[Chunk ${i + 1} | ID: ${c.id}]: ${c.text}`).join('\n\n')
}`
}
],
response_format: { type: 'json_object' }
});
return JSON.parse(response.choices[0].message.content);
}
// Usage
const chunkEval = await evaluateChunkRelevance(
'What is the refund policy for electronics?',
[
{ id: 'chunk-1', text: 'Electronics purchases can be returned within 15 days with original packaging.' },
{ id: 'chunk-2', text: 'Our store is located at 123 Main Street, open 9am-5pm.' },
{ id: 'chunk-3', text: 'Refunds are processed within 5-7 business days to the original payment method.' },
{ id: 'chunk-4', text: 'We carry a wide range of Samsung and Apple products.' },
{ id: 'chunk-5', text: 'Defective electronics may be exchanged within 30 days regardless of return window.' }
]
);
console.log(chunkEval.summary);
/*
{
totalChunks: 5,
relevantChunks: 3,
averageRelevance: 1.8,
answerCoverage: "FULL"
}
*/
Chunk quality issues to watch for
| Issue | Symptom | Fix |
|---|---|---|
| Too small chunks | High recall but low relevance (lots of noise) | Increase chunk size |
| Too large chunks | Low recall (relevant info buried in irrelevant text) | Decrease chunk size, add overlap |
| Bad boundaries | Answer split across two chunks, neither sufficient alone | Improve chunking strategy (semantic chunking) |
| Stale content | Correct retrieval but outdated information | Re-index when documents update |
| Duplicate chunks | Same info retrieved multiple times, wasting context | Deduplicate before injection |
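For the last row of the table, a minimal sketch of "deduplicate before injection": drop chunks whose normalized text is identical. A real system might also use embedding similarity; exact matching on normalized text is the simplest version (the helper name is illustrative):

```javascript
// Keep only the first chunk for each normalized text key.
function dedupeChunks(chunks) {
  const seen = new Set();
  return chunks.filter(chunk => {
    const key = chunk.text.toLowerCase().replace(/\s+/g, ' ').trim();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

const deduped = dedupeChunks([
  { id: 'a', text: 'Returns accepted within 30 days.' },
  { id: 'b', text: 'returns  accepted within 30 days.' }, // duplicate after normalization
  { id: 'c', text: 'Free shipping on orders over $50.' }
]);
// keeps 'a' and 'c'
```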
7. End-to-End vs Component Evaluation
You can evaluate the RAG system at two levels. Both are necessary.
Component evaluation (retrieval only)
Measures the retrieval step in isolation. Does not involve the LLM.
async function evaluateRetrievalComponent(evalDataset, retriever, k = 5) {
const evaluator = new RetrievalEvaluator();
const results = [];
for (const item of evalDataset) {
const retrieved = await retriever.search(item.query, k);
results.push({
query: item.query,
retrievedDocs: retrieved.map(r => r.id),
relevantDocs: item.relevantDocIds
});
}
return evaluator.evaluateAll(results, k);
}
End-to-end evaluation (retrieval + generation)
Measures the final answer quality. Involves both retrieval and LLM.
// Reuses the `openai` client instantiated in the earlier snippet
async function evaluateEndToEnd(evalDataset, ragPipeline) {
const results = [];
for (const item of evalDataset) {
const ragResponse = await ragPipeline.query(item.query);
// Check answer correctness
const correctnessCheck = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0,
messages: [
{
role: 'system',
content: `Compare the AI answer to the ground truth. Return JSON:
{
"isCorrect": true/false,
"isPartiallyCorrect": true/false,
"hasHallucination": true/false,
"explanation": "brief explanation",
"score": 0.0 to 1.0
}`
},
{
role: 'user',
content: `Question: ${item.query}
AI Answer: ${ragResponse.answer}
Ground Truth: ${item.expectedAnswer}`
}
],
response_format: { type: 'json_object' }
});
const evaluation = JSON.parse(correctnessCheck.choices[0].message.content);
results.push({
query: item.query,
aiAnswer: ragResponse.answer,
groundTruth: item.expectedAnswer,
retrievedDocs: ragResponse.sources?.map(s => s.document) || [],
...evaluation
});
}
// Aggregate
const total = results.length;
return {
totalQuestions: total,
correctRate: results.filter(r => r.isCorrect).length / total,
partiallyCorrectRate: results.filter(r => r.isPartiallyCorrect).length / total,
hallucinationRate: results.filter(r => r.hasHallucination).length / total,
averageScore: results.reduce((sum, r) => sum + r.score, 0) / total,
perQuestion: results
};
}
When component and end-to-end metrics disagree
| Component (Retrieval) | End-to-End (Answer) | Diagnosis |
|---|---|---|
| Good | Good | System is working well |
| Good | Bad | LLM is the problem (prompt, model, hallucination) |
| Bad | Good | Lucky — the LLM is compensating (fragile, will fail) |
| Bad | Bad | Fix retrieval first — it's the bottleneck |
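The table above can be encoded as a tiny triage helper. The 0.7 thresholds are illustrative assumptions; pick cutoffs that match your own quality bar:

```javascript
// Map (component quality, end-to-end quality) to the diagnoses in the table.
function diagnose(retrievalRecall, answerCorrectRate, threshold = 0.7) {
  const retrievalGood = retrievalRecall >= threshold;
  const answerGood = answerCorrectRate >= threshold;
  if (retrievalGood && answerGood) return 'System is working well';
  if (retrievalGood) return 'LLM is the problem (prompt, model, hallucination)';
  if (answerGood) return 'LLM is compensating for bad retrieval (fragile)';
  return 'Fix retrieval first: it is the bottleneck';
}

console.log(diagnose(0.6, 0.85)); // the "Bad / Good" row: LLM is compensating
```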
8. Retrieval Evaluation Dashboard
Bring metrics together into a monitoring view:
class RetrievalDashboard {
constructor() {
this.history = [];
}
record(evalRun) {
this.history.push({
timestamp: new Date(),
...evalRun
});
}
getLatestReport() {
if (this.history.length === 0) return null;
const latest = this.history[this.history.length - 1];
const previous = this.history.length > 1
? this.history[this.history.length - 2]
: null;
const report = {
current: latest,
trends: {}
};
if (previous) {
const metrics = ['meanPrecisionAtK', 'meanRecallAtK', 'mrr', 'meanNdcgAtK'];
for (const metric of metrics) {
const curr = latest.aggregated?.[metric];
const prev = previous.aggregated?.[metric];
if (curr != null && prev != null) {
report.trends[metric] = {
current: curr,
previous: prev,
change: curr - prev,
pctChange: prev > 0 ? ((curr - prev) / prev * 100).toFixed(1) + '%' : 'N/A',
direction: curr > prev ? 'IMPROVING' : curr < prev ? 'DEGRADING' : 'STABLE'
};
}
}
}
return report;
}
printReport() {
const report = this.getLatestReport();
if (!report) {
console.log('No evaluation data recorded yet.');
return;
}
console.log('\n=== RETRIEVAL QUALITY REPORT ===');
console.log(`Timestamp: ${report.current.timestamp}`);
console.log(`Queries evaluated: ${report.current.perQuery?.length || 'N/A'}`);
console.log('\nMetrics:');
const agg = report.current.aggregated;
if (agg) {
console.log(` Precision@${agg.k}: ${agg.meanPrecisionAtK?.toFixed(3)}`);
console.log(` Recall@${agg.k}: ${agg.meanRecallAtK?.toFixed(3)}`);
console.log(` MRR: ${agg.mrr?.toFixed(3)}`);
if (agg.meanNdcgAtK != null) {
console.log(` NDCG@${agg.k}: ${agg.meanNdcgAtK?.toFixed(3)}`);
}
}
if (Object.keys(report.trends).length > 0) {
console.log('\nTrends vs previous:');
for (const [metric, trend] of Object.entries(report.trends)) {
console.log(` ${metric}: ${trend.pctChange} (${trend.direction})`);
}
}
console.log('================================\n');
}
}
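The dashboard reports trends; you can also turn them into alerts. A minimal sketch (the function name and the 5% threshold are assumptions; tune the threshold to the run-to-run noise of your eval set), operating on two aggregated metric objects like those produced by evaluateAll:

```javascript
// Flag any metric that dropped more than thresholdPct since the previous run.
function regressionAlerts(previous, current, thresholdPct = 5) {
  const alerts = [];
  for (const metric of ['meanPrecisionAtK', 'meanRecallAtK', 'mrr', 'meanNdcgAtK']) {
    const prev = previous[metric];
    const curr = current[metric];
    if (prev == null || curr == null || prev === 0) continue;
    const pctChange = (curr - prev) / prev * 100;
    if (pctChange < -thresholdPct) {
      alerts.push(`${metric} dropped ${Math.abs(pctChange).toFixed(1)}% (${prev} -> ${curr})`);
    }
  }
  return alerts;
}

const alerts = regressionAlerts(
  { meanPrecisionAtK: 0.60, meanRecallAtK: 0.80, mrr: 0.75, meanNdcgAtK: 0.82 },
  { meanPrecisionAtK: 0.58, meanRecallAtK: 0.68, mrr: 0.74, meanNdcgAtK: 0.81 }
);
// only recall fell more than 5%: one alert
```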
9. Key Takeaways
- Retrieval quality is the ceiling for RAG answer quality — no amount of prompt engineering compensates for retrieving the wrong documents. Fix retrieval first.
- Precision@k measures noise (how many irrelevant docs clog the context), Recall@k measures coverage (how many relevant docs you find), MRR measures ranking (is the best result first?), NDCG measures graded ranking quality.
- Build evaluation datasets early — use manual labeling for accuracy, LLM-generated questions for scale, and real user query logs for realism. You need 100-500+ examples.
- A/B test retrieval strategies rigorously — compare vector search, hybrid search, re-ranking, and different chunking strategies on the same eval dataset.
- Measure chunk relevance individually — irrelevant chunks waste context tokens and confuse the LLM.
- Evaluate both components and end-to-end — component metrics isolate retrieval problems; end-to-end metrics catch issues in the full pipeline. When they disagree, the diagnosis tells you exactly where to invest.
- Track metrics over time — retrieval quality can degrade silently as your document corpus grows or changes.
Explain-It Challenge
- Your RAG system has precision@5 of 0.90 but recall@5 of only 0.30. What does this mean in practical terms, and what would you do to fix it?
- A colleague says "just retrieve the top 20 documents to be safe." Explain why this might actually make answers worse.
- Your end-to-end answer accuracy is 85% but retrieval recall@5 is only 60%. How is this possible, and why is it dangerous?
Navigation: ← 4.14.b — Confidence Scores · 4.14.d — Observability and Monitoring →