Episode 4 — Generative AI Engineering / 4.13 — Building a RAG Pipeline

4.13.b — Retrieval Strategies

In one sentence: The quality of your RAG answers depends entirely on the quality of retrieval — embedding queries, vector similarity search, top-k selection, cross-encoder re-ranking, and hybrid search (vector + BM25) are the tools that determine whether the LLM sees the right documents or garbage.

Navigation: <- 4.13.a RAG Workflow | 4.13.c — Prompt Construction for RAG ->


1. Why Retrieval Quality Is Everything

In a RAG pipeline, the LLM can only answer based on the chunks you provide. If retrieval returns irrelevant chunks, the LLM will either hallucinate or give a wrong answer grounded in wrong context. If retrieval returns the perfect chunks, even a mediocre LLM will give a good answer.

┌─────────────────────────────────────────────────────────────────────┐
│  RETRIEVAL QUALITY vs ANSWER QUALITY                                 │
│                                                                     │
│  Excellent retrieval + Average LLM  = Good answers                  │
│  Average retrieval   + Excellent LLM = Mediocre answers             │
│  Poor retrieval      + Excellent LLM = Bad answers (hallucination)  │
│                                                                     │
│  CONCLUSION: Invest MORE in retrieval than in the LLM.              │
└─────────────────────────────────────────────────────────────────────┘

This section covers the major retrieval strategies, from basic to advanced.


2. Embedding the User Query

The first step in retrieval is converting the user's natural-language query into a vector (embedding) so it can be compared against document chunk vectors.

Critical rule: Same embedding model

The query MUST be embedded with the exact same model used to embed the document chunks during ingestion. Different models produce incompatible vector spaces.

// WRONG — model mismatch
// Ingestion used: text-embedding-3-small
// Query uses:     text-embedding-3-large
// Result: similarity scores are meaningless

// CORRECT — same model for both
const EMBEDDING_MODEL = 'text-embedding-3-small';

// During ingestion
const docEmbedding = await openai.embeddings.create({
  model: EMBEDDING_MODEL,
  input: chunkText,
});

// During query
const queryEmbedding = await openai.embeddings.create({
  model: EMBEDDING_MODEL,
  input: userQuery,
});

Query preprocessing

Sometimes the raw user query is not the best input for embedding. Preprocessing can improve retrieval quality:

// Strategy 1: Query expansion — add context to ambiguous queries
function expandQuery(query) {
  // Short queries often lack context
  if (query.split(' ').length < 4) {
    return `Information about: ${query}`;
  }
  return query;
}

// Strategy 2: Query decomposition — split complex queries
function decomposeQuery(query) {
  // "What are the benefits and drawbacks of remote work?"
  // -> Two searches: "benefits of remote work" + "drawbacks of remote work"
  // -> Merge results
  return [query]; // In practice, use an LLM to decompose
}

// Strategy 3: Hypothetical Document Embedding (HyDE)
// Instead of embedding the question, embed a hypothetical answer
async function hydeEmbed(query) {
  // Ask the LLM to generate a hypothetical answer (without context)
  const hypothetical = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    temperature: 0,
    messages: [
      {
        role: 'system',
        content: 'Write a short paragraph that would answer the following question. Do not include caveats or hedging.',
      },
      { role: 'user', content: query },
    ],
  });

  // Embed the hypothetical answer instead of the question
  // Hypothetical answers are more similar to real document chunks than questions are
  const embedding = await openai.embeddings.create({
    model: EMBEDDING_MODEL, // must match the model used at ingestion
    input: hypothetical.choices[0].message.content,
  });

  return embedding.data[0].embedding;
}

Why HyDE works

Questions and answers live in different regions of embedding space. "What is the refund policy?" is semantically different from "Customers may return items within 30 days..." even though they are about the same topic. HyDE bridges this gap by embedding a hypothetical answer that is more similar to the actual document text.


3. Vector Similarity Search

Once the query is embedded, you compare it against all stored chunk vectors. The most common similarity metrics:

Cosine similarity

Measures the angle between two vectors. Range: -1 (opposite) to 1 (identical). Most commonly used for text embeddings.

// Cosine similarity — manual implementation
function cosineSimilarity(a, b) {
  let dotProduct = 0;
  let normA = 0;
  let normB = 0;

  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }

  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

// In practice, the vector DB does this internally at massive scale
const results = await pinecone.index('docs').query({
  vector: queryVector,
  topK: 10,
  includeMetadata: true,
  // Pinecone uses cosine similarity by default
});

Other distance metrics

Metric               Formula                          Range          Best for
Cosine similarity    dot(a,b) / (norm(a) * norm(b))   -1 to 1        Text embeddings (direction matters)
Dot product          sum(a[i] * b[i])                 -inf to +inf   When magnitude carries information
Euclidean distance   sqrt(sum((a[i]-b[i])^2))         0 to +inf      Dense numeric vectors

For OpenAI embeddings, cosine similarity is the standard choice. The embeddings are already normalized, so cosine similarity equals the dot product.
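A quick check of that equivalence with toy vectors (illustrative helper functions, not real embeddings):

```javascript
// For unit-normalized vectors, cosine similarity equals the dot product.

function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

function normalize(v) {
  const norm = Math.sqrt(dot(v, v));
  return v.map(x => x / norm);
}

function cosineSimilarity(a, b) {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

const a = normalize([0.3, -1.2, 0.8, 0.5]);
const b = normalize([0.1, -0.9, 1.1, 0.2]);

// Identical up to floating-point error
console.log(cosineSimilarity(a, b));
console.log(dot(a, b));
```

This is why vector databases can use the cheaper dot product internally for normalized embeddings without changing the ranking.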


4. Top-k Retrieval

Top-k retrieval means: return the k most similar chunks. Choosing the right k is crucial.

How k affects results

k too low (k=1):
  - May miss relevant information spread across multiple chunks
  - Single point of failure — if the top chunk is wrong, answer fails
  - Fast, cheap

k too high (k=20):
  - Includes irrelevant chunks that dilute the context
  - "Lost in the middle" — LLM ignores middle chunks
  - Uses more tokens (costs more, may exceed context window)
  - Slower

k just right (k=3 to 7 for most use cases):
  - Captures main relevant information
  - Manageable context size
  - Good balance of recall and precision

Dynamic k — adjust based on query

// Dynamic k based on query complexity
function determineTopK(query, options = {}) {
  const { minK = 3, maxK = 10, defaultK = 5 } = options;

  // Simple factual question -> fewer chunks needed
  if (query.split(' ').length < 6) return minK;

  // Complex multi-part question -> more chunks needed
  const questionWords = ['and', 'also', 'additionally', 'compare', 'versus'];
  const isComplex = questionWords.some(w => query.toLowerCase().includes(w));
  if (isComplex) return maxK;

  return defaultK;
}

// Dynamic k based on score threshold
async function retrieveWithThreshold(queryVector, options = {}) {
  const { maxK = 10, minScore = 0.7 } = options;

  const results = await vectorDB.query({
    vector: queryVector,
    topK: maxK,
    includeMetadata: true,
  });

  // Only keep chunks above the similarity threshold
  const filtered = results.matches.filter(m => m.score >= minScore);

  return filtered;
}

Score distribution analysis

Understanding the score distribution helps you set thresholds:

async function analyzeScores(queryVector) {
  const results = await vectorDB.query({
    vector: queryVector,
    topK: 20,
    includeMetadata: true,
  });

  const scores = results.matches.map(m => m.score);

  console.log('Score distribution:');
  console.log('  Top 1:', scores[0]?.toFixed(3));
  console.log('  Top 5 avg:', (scores.slice(0, 5).reduce((a, b) => a + b, 0) / 5).toFixed(3));
  console.log('  Top 10 avg:', (scores.slice(0, 10).reduce((a, b) => a + b, 0) / 10).toFixed(3));
  console.log('  Score drop-off:', (scores[0] - scores[scores.length - 1]).toFixed(3));

  // If there's a sharp drop-off after position N, k=N is likely optimal
  for (let i = 1; i < scores.length; i++) {
    const drop = scores[i - 1] - scores[i];
    if (drop > 0.1) {
      console.log(`  Sharp drop at position ${i}: ${scores[i-1].toFixed(3)} -> ${scores[i].toFixed(3)}`);
    }
  }
}

5. Re-Ranking Retrieved Chunks

Top-k vector search gives you an approximate ranking. Re-ranking refines this ranking using a more powerful model that sees both the query and the chunk text together.

Why re-rank?

Vector similarity is a rough approximation. Two texts can have similar embeddings but be about different things (semantic similarity does not equal relevance). A re-ranker reads both the query and the chunk and directly estimates relevance.

┌─────────────────────────────────────────────────────────────────────┐
│  TWO-STAGE RETRIEVAL                                                 │
│                                                                     │
│  Stage 1 — Vector Search (FAST, APPROXIMATE)                        │
│    Query vector vs 100,000 chunk vectors                            │
│    Returns: top 20 candidates ranked by cosine similarity           │
│    Speed: ~10ms                                                     │
│                                                                     │
│  Stage 2 — Cross-Encoder Re-Ranking (SLOW, PRECISE)                 │
│    Feed each (query, chunk) pair through a cross-encoder model      │
│    Returns: top 20 re-ranked by true relevance                      │
│    Speed: ~200ms for 20 pairs                                       │
│                                                                     │
│  Final: Take top 5 from re-ranked list -> inject into prompt        │
└─────────────────────────────────────────────────────────────────────┘

Cross-encoder re-ranking

A cross-encoder is a model that takes two texts as input and outputs a relevance score. Unlike embeddings (which encode query and document separately), cross-encoders see both together and can capture fine-grained relevance.

// Cross-encoder re-ranking using Cohere's Rerank API
// (One of the most popular re-ranking services)
import { CohereClient } from 'cohere-ai';

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });

async function rerankChunks(query, chunks, topN = 5) {
  const response = await cohere.rerank({
    model: 'rerank-english-v3.0',
    query: query,
    documents: chunks.map(c => c.text),
    topN: topN,  // Return only the best topN after re-ranking
  });

  // Map back to original chunks with new scores
  return response.results.map(result => ({
    ...chunks[result.index],
    originalScore: chunks[result.index].score,
    rerankedScore: result.relevanceScore,
  }));
}

// Full retrieval pipeline with re-ranking
async function retrieveAndRerank(userQuery, topK = 20, finalK = 5) {
  // Stage 1: Broad vector search
  const queryEmbedding = await embedQuery(userQuery);
  const candidates = await vectorDB.query({
    vector: queryEmbedding,
    topK: topK,  // Retrieve more candidates than we need
    includeMetadata: true,
  });

  // Stage 2: Re-rank with cross-encoder
  const reranked = await rerankChunks(
    userQuery,
    candidates.matches.map(m => ({
      text: m.metadata.text,
      score: m.score,
      metadata: m.metadata,
    }))
  );

  // Return top finalK after re-ranking
  return reranked.slice(0, finalK);
}

LLM-based re-ranking (alternative)

If you don't have access to a cross-encoder service, you can use the LLM itself to re-rank:

async function llmRerank(query, chunks) {
  const chunkList = chunks.map((c, i) => `[${i}] ${c.text.slice(0, 200)}`).join('\n\n');

  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    temperature: 0,
    messages: [
      {
        role: 'system',
        content: `You are a relevance judge. Given a query and a list of text chunks, rank the chunk indices from most to least relevant. Return ONLY a JSON object of the form {"indices": [3, 1, 7, 0, 5]}.`,
      },
      {
        role: 'user',
        content: `Query: "${query}"\n\nChunks:\n${chunkList}`,
      },
    ],
    response_format: { type: 'json_object' },
  });

  const rankedIndices = JSON.parse(response.choices[0].message.content).indices;
  return rankedIndices.map(i => chunks[i]).filter(Boolean);
}

6. Hybrid Search (Vector + Keyword/BM25)

Vector search finds semantically similar content, but it can miss exact keyword matches. BM25 (the algorithm behind traditional search engines like Elasticsearch) finds exact term matches but misses semantic relationships. Hybrid search combines both.

Why hybrid?

Query: "Error code E-4012 troubleshooting"

Vector search:
  Returns chunks about "error handling", "troubleshooting guide", "debugging steps"
  May MISS the chunk with the exact code "E-4012" if it's semantically far

BM25 keyword search:
  Returns chunks containing the literal string "E-4012"
  May MISS chunks about "resolving connectivity issues" (same topic, different words)

Hybrid (both):
  Returns BOTH the exact match AND semantically related chunks
  Best of both worlds

Implementing hybrid search

// Hybrid search with score fusion

async function hybridSearch(query, options = {}) {
  const { topK = 10, vectorWeight = 0.7, keywordWeight = 0.3 } = options;

  // 1. Vector search
  const queryEmbedding = await embedQuery(query);
  const vectorResults = await vectorDB.query({
    vector: queryEmbedding,
    topK: topK * 2,  // Get more candidates for fusion
    includeMetadata: true,
  });

  // 2. Keyword/BM25 search (using your search engine)
  const keywordResults = await searchEngine.search({
    query: query,
    limit: topK * 2,
  });

  // 3. Reciprocal Rank Fusion (RRF)
  const fusedScores = new Map();

  vectorResults.matches.forEach((result, index) => {
    const rank = index + 1;           // RRF uses 1-based ranks
    const rrf = 1 / (60 + rank);      // RRF constant k = 60 (standard)
    fusedScores.set(result.id, (fusedScores.get(result.id) || 0) + vectorWeight * rrf);
  });

  keywordResults.hits.forEach((result, index) => {
    const rank = index + 1;
    const rrf = 1 / (60 + rank);
    fusedScores.set(result.id, (fusedScores.get(result.id) || 0) + keywordWeight * rrf);
  });

  // 4. Sort by fused score and return top-k
  const sorted = [...fusedScores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topK);

  // Fetch full metadata for top results
  const topIds = sorted.map(([id]) => id);
  return await vectorDB.fetch(topIds);
}

Reciprocal Rank Fusion (RRF) explained

RRF is the standard algorithm for combining ranked lists from different sources:

RRF score for document d = sum over all ranked lists: 1 / (k + rank(d))

Where k is a constant (typically 60) that prevents high-ranked items from 
dominating the score.

Example:
  Document "refund-policy-chunk-3" appears at:
    Vector search:  rank 2  -> RRF = 1/(60+2) = 0.0161
    Keyword search: rank 1  -> RRF = 1/(60+1) = 0.0164
    Combined RRF = 0.0161 + 0.0164 = 0.0325

  Document "returns-faq-chunk-7" appears at:
    Vector search:  rank 1  -> RRF = 1/(60+1) = 0.0164
    Keyword search: not found -> RRF = 0
    Combined RRF = 0.0164

  Result: refund-policy-chunk-3 (0.0325) ranks higher than returns-faq-chunk-7 (0.0164)
  because it was found by BOTH search methods.
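The arithmetic above can be reproduced with a minimal, unweighted RRF sketch (`reciprocalRankFusion` is an illustrative helper, not a library function; document IDs are from the example):

```javascript
// Reciprocal Rank Fusion over ranked lists of document IDs.
// Uses 1-based ranks and the standard constant k = 60.

function reciprocalRankFusion(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // 1-based rank
      scores.set(id, (scores.get(id) || 0) + 1 / (k + rank));
    });
  }
  // Sort by fused score, highest first
  return [...scores.entries()].sort((a, b) => b[1] - a[1]);
}

const vectorRanking = ['returns-faq-chunk-7', 'refund-policy-chunk-3'];
const keywordRanking = ['refund-policy-chunk-3'];

const fused = reciprocalRankFusion([vectorRanking, keywordRanking]);
console.log(fused);
// refund-policy-chunk-3: 1/62 + 1/61 ≈ 0.0325 (found by both lists)
// returns-faq-chunk-7:   1/61        ≈ 0.0164
```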

Databases with built-in hybrid search

Some vector databases support hybrid search natively:

Database             Hybrid support            Notes
Pinecone             Sparse-dense vectors      Supports both vector and keyword in one query
Weaviate             Built-in BM25 + vector    Hybrid search as a first-class feature
Qdrant               Sparse + dense vectors    Native hybrid support
pgvector + pg_trgm   Manual combination        Combine vector similarity with trigram text search
Elasticsearch        kNN + BM25                Add vector search to existing keyword infrastructure

7. Handling No Results

When every returned chunk has a low similarity score, the knowledge base most likely doesn't contain relevant information for the query. Detect this case explicitly instead of passing weak context to the LLM.

async function retrieveWithFallback(queryVector, query, options = {}) {
  const { topK = 5, minScore = 0.7 } = options;

  const results = await vectorDB.query({
    vector: queryVector,
    topK: topK,
    includeMetadata: true,
  });

  const relevant = results.matches.filter(m => m.score >= minScore);

  if (relevant.length === 0) {
    // Strategy 1: Return a "no information" response
    return {
      chunks: [],
      status: 'no_results',
      message: 'No relevant documents found for this query.',
      suggestion: 'Try rephrasing your question or ask about a different topic.',
    };
  }

  if (relevant.length < 2) {
    // Strategy 2: Low confidence — flag for review
    return {
      chunks: relevant,
      status: 'low_confidence',
      message: 'Limited relevant information found. Answer may be incomplete.',
    };
  }

  return {
    chunks: relevant,
    status: 'ok',
  };
}

// In the query pipeline
async function queryRAG(userQuery) {
  const queryVector = await embedQuery(userQuery);
  const retrieval = await retrieveWithFallback(queryVector, userQuery);

  if (retrieval.status === 'no_results') {
    return {
      answer: "I don't have information about that in my knowledge base.",
      confidence: 0,
      sources: [],
      status: 'no_results',
    };
  }

  // Continue with normal RAG pipeline...
  const context = formatChunks(retrieval.chunks);
  const response = await generateAnswer(userQuery, context);

  if (retrieval.status === 'low_confidence') {
    response.confidence = Math.min(response.confidence, 0.5);
    response.warning = 'Limited source material available.';
  }

  return response;
}

Score thresholds by use case

Use case                 Min score   Rationale
Customer support         0.75        Better to say "I don't know" than give wrong info
Internal docs search     0.65        Users can verify; more permissive
Medical/Legal            0.85        High stakes; only high-confidence results
Creative/Brainstorming   0.50        Loosely related content can still inspire

8. Handling Too Many Results

The opposite problem: the query matches many chunks and you need to decide which ones to keep.

// Strategy 1: Score-based cutoff with diminishing returns
function selectChunksForContext(matches, options = {}) {
  const { maxChunks = 7, maxTokens = 3000, minScore = 0.6 } = options;

  const selected = [];
  let totalTokens = 0;

  for (const match of matches) {
    // Stop if score drops below threshold
    if (match.score < minScore) break;

    // Stop if we've hit the chunk limit
    if (selected.length >= maxChunks) break;

    // Stop if adding this chunk would exceed token budget
    const chunkTokens = estimateTokens(match.metadata.text);
    if (totalTokens + chunkTokens > maxTokens) break;

    selected.push(match);
    totalTokens += chunkTokens;
  }

  return selected;
}

// Strategy 2: Diversity-aware selection (Maximal Marginal Relevance)
function mmrSelect(candidates, options = {}) {
  const { k = 5, lambda = 0.7 } = options;
  // lambda balances relevance (1.0) vs diversity (0.0);
  // each candidate's relevance is its retrieval score

  const selected = [];
  const remaining = [...candidates];

  while (selected.length < k && remaining.length > 0) {
    let bestIdx = -1;
    let bestScore = -Infinity;

    for (let i = 0; i < remaining.length; i++) {
      const candidate = remaining[i];

      // Relevance to query
      const relevance = candidate.score;

      // Maximum similarity to any already-selected chunk
      const maxSimilarity = selected.length === 0
        ? 0
        : Math.max(...selected.map(s =>
            cosineSimilarity(candidate.embedding, s.embedding)
          ));

      // MMR score: balance relevance and diversity
      const mmrScore = lambda * relevance - (1 - lambda) * maxSimilarity;

      if (mmrScore > bestScore) {
        bestScore = mmrScore;
        bestIdx = i;
      }
    }

    selected.push(remaining[bestIdx]);
    remaining.splice(bestIdx, 1);
  }

  return selected;
}

Why Maximal Marginal Relevance (MMR)?

Without MMR, top-k retrieval often returns highly similar chunks that all say the same thing. This wastes context window space. MMR selects chunks that are both relevant to the query AND diverse from each other, maximizing the information content of the context.

Without MMR (top-5 by score):
  Chunk 1: "Remote work policy allows 3 days per week..."     score: 0.94
  Chunk 2: "Employees may work from home up to 3 days..."     score: 0.93
  Chunk 3: "Our remote work arrangement permits 3 remote..."  score: 0.91
  Chunk 4: "Remote workers are allowed three days..."          score: 0.90
  Chunk 5: "Equipment allowance for remote employees..."       score: 0.85
  -> Four chunks say the SAME thing! Wasted context.

With MMR (lambda=0.7):
  Chunk 1: "Remote work policy allows 3 days per week..."     relevance: 0.94
  Chunk 5: "Equipment allowance for remote employees..."       relevance: 0.85
  Chunk 8: "Remote work eligibility requirements..."           relevance: 0.80
  Chunk 12: "International remote work tax implications..."    relevance: 0.72
  Chunk 6: "Remote work schedule approval process..."          relevance: 0.83
  -> Five DIFFERENT aspects of remote work! Maximum information.
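The selection loop can be sketched with toy data: two near-duplicate candidates and one distinct one. Scores and the three-dimensional stand-in embeddings below are made up (real embeddings have hundreds of dimensions):

```javascript
// Toy MMR demonstration: with lambda = 0.7, the diverse "equipment" chunk
// beats the near-duplicate of the top result despite a lower raw score.

function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function mmrSelect(candidates, { k = 2, lambda = 0.7 } = {}) {
  const selected = [];
  const remaining = [...candidates];

  while (selected.length < k && remaining.length > 0) {
    let bestIdx = 0;
    let bestScore = -Infinity;

    for (let i = 0; i < remaining.length; i++) {
      const c = remaining[i];
      // Penalize similarity to anything already selected
      const maxSim = selected.length === 0
        ? 0
        : Math.max(...selected.map(s => cosineSimilarity(c.embedding, s.embedding)));
      const mmr = lambda * c.score - (1 - lambda) * maxSim;
      if (mmr > bestScore) { bestScore = mmr; bestIdx = i; }
    }

    selected.push(remaining[bestIdx]);
    remaining.splice(bestIdx, 1);
  }
  return selected;
}

const candidates = [
  { id: 'policy-a',  score: 0.94, embedding: [1.0, 0.0, 0.0] },
  { id: 'policy-b',  score: 0.93, embedding: [0.99, 0.05, 0.0] }, // near-duplicate of policy-a
  { id: 'equipment', score: 0.85, embedding: [0.0, 1.0, 0.0] },   // different topic
];

const picked = mmrSelect(candidates, { k: 2, lambda: 0.7 });
console.log(picked.map(c => c.id)); // ['policy-a', 'equipment']
```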

9. Metadata Filtering

Before or during retrieval, you can filter by metadata to narrow the search space:

// Filter by document type, date, department, etc.
const results = await vectorDB.query({
  vector: queryVector,
  topK: 10,
  filter: {
    // Only search in HR documents
    department: 'hr',
    // Only recent documents
    updatedAfter: '2024-01-01',
    // Exclude drafts
    status: 'published',
  },
  includeMetadata: true,
});

// Multi-tenant filtering
const results = await vectorDB.query({
  vector: queryVector,
  topK: 10,
  filter: {
    tenantId: currentUser.tenantId,       // Data isolation
    accessLevel: { $lte: currentUser.clearanceLevel }, // Permission check
  },
  includeMetadata: true,
});

Metadata filtering is essential for:

  • Multi-tenant applications — each customer only sees their own documents
  • Access control — users only retrieve documents they have permission to see
  • Temporal filtering — prefer recent documents over outdated ones
  • Document type filtering — search only policies, or only FAQs, etc.

10. Measuring Retrieval Quality

You cannot improve what you don't measure. Key metrics for retrieval quality:

// Evaluation metrics for retrieval

// Precision@k: Of the k retrieved chunks, what fraction is relevant?
function precisionAtK(retrieved, relevant) {
  const relevantRetrieved = retrieved.filter(r => relevant.includes(r.id));
  return relevantRetrieved.length / retrieved.length;
}

// Recall@k: Of all relevant chunks, what fraction was retrieved?
function recallAtK(retrieved, relevant) {
  const relevantRetrieved = retrieved.filter(r => relevant.includes(r.id));
  return relevantRetrieved.length / relevant.length;
}

// Mean Reciprocal Rank: How high is the first relevant result?
function mrr(retrieved, relevant) {
  for (let i = 0; i < retrieved.length; i++) {
    if (relevant.includes(retrieved[i].id)) {
      return 1 / (i + 1);
    }
  }
  return 0;
}

// Evaluation suite
async function evaluateRetrieval(testQueries) {
  const results = [];

  for (const { query, expectedChunkIds } of testQueries) {
    const queryVector = await embedQuery(query);
    const retrieved = await vectorDB.query({
      vector: queryVector,
      topK: 10,
      includeMetadata: true,
    });

    const retrievedIds = retrieved.matches.map(m => m.id);

    results.push({
      query,
      precision: precisionAtK(retrieved.matches, expectedChunkIds),
      recall: recallAtK(retrieved.matches, expectedChunkIds),
      mrr: mrr(retrieved.matches, expectedChunkIds),
      topScore: retrieved.matches[0]?.score,
    });
  }

  // Aggregate metrics
  const avgPrecision = results.reduce((a, r) => a + r.precision, 0) / results.length;
  const avgRecall = results.reduce((a, r) => a + r.recall, 0) / results.length;
  const avgMRR = results.reduce((a, r) => a + r.mrr, 0) / results.length;

  console.log(`Precision@10: ${avgPrecision.toFixed(3)}`);
  console.log(`Recall@10:    ${avgRecall.toFixed(3)}`);
  console.log(`MRR:          ${avgMRR.toFixed(3)}`);

  return results;
}

What good metrics look like

Metric        Poor    Acceptable   Good        Excellent
Precision@5   < 0.3   0.3 - 0.5    0.5 - 0.7   > 0.7
Recall@10     < 0.4   0.4 - 0.6    0.6 - 0.8   > 0.8
MRR           < 0.3   0.3 - 0.5    0.5 - 0.8   > 0.8
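A worked example of the three metrics on a single query, using hypothetical chunk IDs:

```javascript
// Retrieved IDs are in rank order; relevant IDs are the ground truth.

function precisionAtK(retrievedIds, relevantIds) {
  const hits = retrievedIds.filter(id => relevantIds.includes(id));
  return hits.length / retrievedIds.length;
}

function recallAtK(retrievedIds, relevantIds) {
  const hits = retrievedIds.filter(id => relevantIds.includes(id));
  return hits.length / relevantIds.length;
}

function mrr(retrievedIds, relevantIds) {
  for (let i = 0; i < retrievedIds.length; i++) {
    if (relevantIds.includes(retrievedIds[i])) return 1 / (i + 1);
  }
  return 0;
}

const retrieved = ['c4', 'c1', 'c9', 'c2', 'c7']; // top-5, rank order
const relevant = ['c1', 'c2', 'c3'];              // ground truth

console.log(precisionAtK(retrieved, relevant)); // 2 of 5 retrieved are relevant -> 0.4
console.log(recallAtK(retrieved, relevant));    // 2 of 3 relevant were found -> 2/3
console.log(mrr(retrieved, relevant));          // first hit at rank 2 -> 0.5
```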

11. Complete Retrieval Pipeline

Putting it all together — a production-grade retrieval function:

import OpenAI from 'openai';

const openai = new OpenAI();
const EMBEDDING_MODEL = 'text-embedding-3-small';

async function retrieve(userQuery, options = {}) {
  const {
    topK = 20,            // Initial candidates (broad)
    finalK = 5,           // Final chunks to return
    minScore = 0.65,      // Minimum similarity threshold
    useReranking = true,  // Enable cross-encoder re-ranking
    filters = {},         // Metadata filters
    maxTokens = 3000,     // Token budget for context
  } = options;

  // 1. Embed the query
  const embeddingResponse = await openai.embeddings.create({
    model: EMBEDDING_MODEL,
    input: userQuery,
  });
  const queryVector = embeddingResponse.data[0].embedding;

  // 2. Broad vector search
  const candidates = await vectorDB.query({
    vector: queryVector,
    topK: topK,
    filter: filters,
    includeMetadata: true,
  });

  // 3. Filter by minimum score
  let chunks = candidates.matches.filter(m => m.score >= minScore);

  if (chunks.length === 0) {
    return { chunks: [], status: 'no_results' };
  }

  // 4. Re-rank (optional but recommended)
  if (useReranking && chunks.length > finalK) {
    chunks = await rerankChunks(userQuery, chunks, finalK);
  }

  // 5. Select final chunks within token budget
  const selected = [];
  let tokenCount = 0;

  for (const chunk of chunks.slice(0, finalK)) {
    const chunkTokens = estimateTokens(chunk.metadata?.text || chunk.text);
    if (tokenCount + chunkTokens > maxTokens) break;
    selected.push(chunk);
    tokenCount += chunkTokens;
  }

  return {
    chunks: selected,
    status: selected.length >= 3 ? 'ok' : 'low_confidence',
    totalCandidates: candidates.matches.length,
    tokenCount,
  };
}

// Helper: estimate tokens (~4 chars per token)
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

12. Key Takeaways

  1. Retrieval quality determines answer quality — invest more in retrieval than in the LLM. Excellent retrieval with an average LLM beats poor retrieval with an excellent LLM.
  2. Always use the same embedding model for ingestion and query. Mismatched models produce meaningless similarity scores.
  3. Top-k retrieval is a starting point; re-ranking with a cross-encoder dramatically improves precision.
  4. Hybrid search (vector + BM25/keyword) catches what pure vector search misses — especially exact terms, codes, and identifiers.
  5. Handle edge cases explicitly: no results (say "I don't know"), too many results (MMR for diversity, token budget enforcement).
  6. Measure retrieval quality with precision, recall, and MRR. You cannot improve what you don't measure.

Explain-It Challenge

  1. Explain why HyDE (Hypothetical Document Embedding) can improve retrieval over embedding the raw question.
  2. A user searches for "PTO-2024-REV3" (a specific policy document ID) and gets no results from vector search. What went wrong and how do you fix it?
  3. Your RAG system returns 5 chunks that all say essentially the same thing. How do you fix this waste of context space?

Navigation: <- 4.13.a RAG Workflow | 4.13.c — Prompt Construction for RAG ->