Episode 4 — Generative AI Engineering / 4.13 — Building a RAG Pipeline
4.13.b — Retrieval Strategies
In one sentence: The quality of your RAG answers depends entirely on the quality of retrieval — embedding queries, vector similarity search, top-k selection, cross-encoder re-ranking, and hybrid search (vector + BM25) are the tools that determine whether the LLM sees the right documents or garbage.
Navigation: <- 4.13.a RAG Workflow | 4.13.c — Prompt Construction for RAG ->
1. Why Retrieval Quality Is Everything
In a RAG pipeline, the LLM can only answer based on the chunks you provide. If retrieval returns irrelevant chunks, the LLM will either hallucinate or give a wrong answer grounded in wrong context. If retrieval returns the perfect chunks, even a mediocre LLM will give a good answer.
┌─────────────────────────────────────────────────────────────────────┐
│ RETRIEVAL QUALITY vs ANSWER QUALITY │
│ │
│ Excellent retrieval + Average LLM = Good answers │
│ Average retrieval + Excellent LLM = Mediocre answers │
│ Poor retrieval + Excellent LLM = Bad answers (hallucination) │
│ │
│ CONCLUSION: Invest MORE in retrieval than in the LLM. │
└─────────────────────────────────────────────────────────────────────┘
This section covers every major retrieval strategy from basic to advanced.
2. Embedding the User Query
The first step in retrieval is converting the user's natural-language query into a vector (embedding) so it can be compared against document chunk vectors.
Critical rule: Same embedding model
The query MUST be embedded with the exact same model used to embed the document chunks during ingestion. Different models produce incompatible vector spaces.
// WRONG — model mismatch
// Ingestion used: text-embedding-3-small
// Query uses: text-embedding-3-large
// Result: similarity scores are meaningless
// CORRECT — same model for both
const EMBEDDING_MODEL = 'text-embedding-3-small';
// During ingestion
const docEmbedding = await openai.embeddings.create({
model: EMBEDDING_MODEL,
input: chunkText,
});
// During query
const queryEmbedding = await openai.embeddings.create({
model: EMBEDDING_MODEL,
input: userQuery,
});
Query preprocessing
Sometimes the raw user query is not the best input for embedding. Preprocessing can improve retrieval quality:
// Strategy 1: Query expansion — add context to ambiguous queries
function expandQuery(query) {
// Short queries often lack context
if (query.split(' ').length < 4) {
return `Information about: ${query}`;
}
return query;
}
// Strategy 2: Query decomposition — split complex queries
function decomposeQuery(query) {
// "What are the benefits and drawbacks of remote work?"
// -> Two searches: "benefits of remote work" + "drawbacks of remote work"
// -> Merge results
return [query]; // In practice, use an LLM to decompose
}
// Strategy 3: Hypothetical Document Embedding (HyDE)
// Instead of embedding the question, embed a hypothetical answer
async function hydeEmbed(query) {
// Ask the LLM to generate a hypothetical answer (without context)
const hypothetical = await openai.chat.completions.create({
model: 'gpt-4o-mini',
temperature: 0,
messages: [
{
role: 'system',
content: 'Write a short paragraph that would answer the following question. Do not include caveats or hedging.',
},
{ role: 'user', content: query },
],
});
// Embed the hypothetical answer instead of the question
// Hypothetical answers are more similar to real document chunks than questions are
const embedding = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: hypothetical.choices[0].message.content,
});
return embedding.data[0].embedding;
}
Why HyDE works
Questions and answers live in different regions of embedding space. "What is the refund policy?" is semantically different from "Customers may return items within 30 days..." even though they are about the same topic. HyDE bridges this gap by embedding a hypothetical answer that is more similar to the actual document text.
3. Vector Similarity Search
Once the query is embedded, you compare it against all stored chunk vectors. The most common similarity metrics:
Cosine similarity
Measures the angle between two vectors. Range: -1 (opposite) to 1 (identical). Most commonly used for text embeddings.
// Cosine similarity — manual implementation
function cosineSimilarity(a, b) {
let dotProduct = 0;
let normA = 0;
let normB = 0;
for (let i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
// In practice, the vector DB does this internally at massive scale
const results = await pinecone.index('docs').query({
vector: queryVector,
topK: 10,
includeMetadata: true,
// Pinecone uses cosine similarity by default
});
Other distance metrics
| Metric | Formula | Range | Best For |
|---|---|---|---|
| Cosine similarity | dot(a,b) / (norm(a) * norm(b)) | -1 to 1 | Text embeddings (direction matters) |
| Dot product | sum(a[i] * b[i]) | -inf to +inf | When magnitude carries information |
| Euclidean distance | sqrt(sum((a[i]-b[i])^2)) | 0 to +inf | Dense numeric vectors |
For OpenAI embeddings, cosine similarity is the standard choice. The embeddings are already normalized, so cosine similarity equals the dot product.
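That equivalence is easy to check: once vectors are scaled to unit length, the denominator of the cosine formula is 1, so cosine similarity and dot product coincide. A minimal sketch with toy 2-D vectors (not real embeddings):

```javascript
// For unit-length vectors, cosine similarity reduces to the dot product.
function dot(a, b) { return a.reduce((sum, x, i) => sum + x * b[i], 0); }
function norm(a) { return Math.sqrt(dot(a, a)); }
function normalize(a) { const n = norm(a); return a.map(x => x / n); }

const a = normalize([3, 4]);   // ≈ [0.6, 0.8]
const b = normalize([5, 12]);  // ≈ [0.3846, 0.9231]

const cosine = dot(a, b) / (norm(a) * norm(b));
console.log(norm(a).toFixed(3));                    // "1.000" — already unit length
console.log(Math.abs(cosine - dot(a, b)) < 1e-12);  // true — the metrics agree
```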
4. Top-k Retrieval
Top-k retrieval means: return the k most similar chunks. Choosing the right k is crucial.
How k affects results
k too low (k=1):
- May miss relevant information spread across multiple chunks
- Single point of failure — if the top chunk is wrong, answer fails
- Fast, cheap
k too high (k=20):
- Includes irrelevant chunks that dilute the context
- "Lost in the middle" — LLM ignores middle chunks
- Uses more tokens (costs more, may exceed context window)
- Slower
k just right (k=3 to 7 for most use cases):
- Captures main relevant information
- Manageable context size
- Good balance of recall and precision
Dynamic k — adjust based on query
// Dynamic k based on query complexity
function determineTopK(query, options = {}) {
const { minK = 3, maxK = 10, defaultK = 5 } = options;
// Simple factual question -> fewer chunks needed
if (query.split(' ').length < 6) return minK;
// Complex multi-part question -> more chunks needed
const questionWords = ['and', 'also', 'additionally', 'compare', 'versus'];
const isComplex = questionWords.some(w => query.toLowerCase().includes(w));
if (isComplex) return maxK;
return defaultK;
}
// Dynamic k based on score threshold
async function retrieveWithThreshold(queryVector, options = {}) {
const { maxK = 10, minScore = 0.7 } = options;
const results = await vectorDB.query({
vector: queryVector,
topK: maxK,
includeMetadata: true,
});
// Only keep chunks above the similarity threshold
const filtered = results.matches.filter(m => m.score >= minScore);
return filtered;
}
Score distribution analysis
Understanding the score distribution helps you set thresholds:
async function analyzeScores(queryVector) {
const results = await vectorDB.query({
vector: queryVector,
topK: 20,
includeMetadata: true,
});
const scores = results.matches.map(m => m.score);
console.log('Score distribution:');
console.log(' Top 1:', scores[0]?.toFixed(3));
console.log(' Top 5 avg:', (scores.slice(0, 5).reduce((a, b) => a + b, 0) / 5).toFixed(3));
console.log(' Top 10 avg:', (scores.slice(0, 10).reduce((a, b) => a + b, 0) / 10).toFixed(3));
console.log(' Score drop-off:', (scores[0] - scores[scores.length - 1]).toFixed(3));
// If there's a sharp drop-off after position N, k=N is likely optimal
for (let i = 1; i < scores.length; i++) {
const drop = scores[i - 1] - scores[i];
if (drop > 0.1) {
console.log(` Sharp drop at position ${i}: ${scores[i-1].toFixed(3)} -> ${scores[i].toFixed(3)}`);
}
}
}
5. Re-Ranking Retrieved Chunks
Top-k vector search gives you an approximate ranking. Re-ranking refines this ranking using a more powerful model that sees both the query and the chunk text together.
Why re-rank?
Vector similarity is a rough approximation. Two texts can have similar embeddings but be about different things (semantic similarity does not equal relevance). A re-ranker reads both the query and the chunk and directly estimates relevance.
┌─────────────────────────────────────────────────────────────────────┐
│ TWO-STAGE RETRIEVAL │
│ │
│ Stage 1 — Vector Search (FAST, APPROXIMATE) │
│ Query vector vs 100,000 chunk vectors │
│ Returns: top 20 candidates ranked by cosine similarity │
│ Speed: ~10ms │
│ │
│ Stage 2 — Cross-Encoder Re-Ranking (SLOW, PRECISE) │
│ Feed each (query, chunk) pair through a cross-encoder model │
│ Returns: top 20 re-ranked by true relevance │
│ Speed: ~200ms for 20 pairs │
│ │
│ Final: Take top 5 from re-ranked list -> inject into prompt │
└─────────────────────────────────────────────────────────────────────┘
Cross-encoder re-ranking
A cross-encoder is a model that takes two texts as input and outputs a relevance score. Unlike embeddings (which encode query and document separately), cross-encoders see both together and can capture fine-grained relevance.
// Cross-encoder re-ranking using Cohere's Rerank API
// (One of the most popular re-ranking services)
import { CohereClient } from 'cohere-ai';
const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });
async function rerankChunks(query, chunks) {
const response = await cohere.rerank({
model: 'rerank-english-v3.0',
query: query,
documents: chunks.map(c => c.text),
topN: 5, // Return top 5 after re-ranking
});
// Map back to original chunks with new scores
return response.results.map(result => ({
...chunks[result.index],
originalScore: chunks[result.index].score,
rerankedScore: result.relevanceScore,
}));
}
// Full retrieval pipeline with re-ranking
async function retrieveAndRerank(userQuery, topK = 20, finalK = 5) {
// Stage 1: Broad vector search
const queryEmbedding = await embedQuery(userQuery);
const candidates = await vectorDB.query({
vector: queryEmbedding,
topK: topK, // Retrieve more candidates than we need
includeMetadata: true,
});
// Stage 2: Re-rank with cross-encoder
const reranked = await rerankChunks(
userQuery,
candidates.matches.map(m => ({
text: m.metadata.text,
score: m.score,
metadata: m.metadata,
}))
);
// Return top finalK after re-ranking
return reranked.slice(0, finalK);
}
LLM-based re-ranking (alternative)
If you don't have access to a cross-encoder service, you can use the LLM itself to re-rank:
async function llmRerank(query, chunks) {
const chunkList = chunks.map((c, i) => `[${i}] ${c.text.slice(0, 200)}`).join('\n\n');
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
temperature: 0,
messages: [
{
role: 'system',
content: `You are a relevance judge. Given a query and a list of text chunks, rank the chunks by relevance to the query. Return ONLY a JSON object with an "indices" array ordered from most to least relevant, e.g., {"indices": [3, 1, 7, 0, 5]}.`,
},
{
role: 'user',
content: `Query: "${query}"\n\nChunks:\n${chunkList}`,
},
],
response_format: { type: 'json_object' },
});
const rankedIndices = JSON.parse(response.choices[0].message.content).indices;
return rankedIndices.map(i => chunks[i]).filter(Boolean);
}
6. Hybrid Search (Vector + Keyword/BM25)
Vector search finds semantically similar content, but it can miss exact keyword matches. BM25 (the algorithm behind traditional search engines like Elasticsearch) finds exact term matches but misses semantic relationships. Hybrid search combines both.
Why hybrid?
Query: "Error code E-4012 troubleshooting"
Vector search:
Returns chunks about "error handling", "troubleshooting guide", "debugging steps"
May MISS the chunk with the exact code "E-4012" if it's semantically far
BM25 keyword search:
Returns chunks containing the literal string "E-4012"
May MISS chunks about "resolving connectivity issues" (same topic, different words)
Hybrid (both):
Returns BOTH the exact match AND semantically related chunks
Best of both worlds
Implementing hybrid search
// Hybrid search with score fusion
async function hybridSearch(query, options = {}) {
const { topK = 10, vectorWeight = 0.7, keywordWeight = 0.3 } = options;
// 1. Vector search
const queryEmbedding = await embedQuery(query);
const vectorResults = await vectorDB.query({
vector: queryEmbedding,
topK: topK * 2, // Get more candidates for fusion
includeMetadata: true,
});
// 2. Keyword/BM25 search (using your search engine)
const keywordResults = await searchEngine.search({
query: query,
limit: topK * 2,
});
// 3. Reciprocal Rank Fusion (RRF)
const fusedScores = new Map();
vectorResults.matches.forEach((result, rank) => {
const id = result.id;
const rrf = 1 / (60 + rank + 1); // RRF constant k = 60; +1 converts forEach's 0-based rank to the formula's 1-based rank
fusedScores.set(id, (fusedScores.get(id) || 0) + vectorWeight * rrf);
});
keywordResults.hits.forEach((result, rank) => {
const id = result.id;
const rrf = 1 / (60 + rank + 1);
fusedScores.set(id, (fusedScores.get(id) || 0) + keywordWeight * rrf);
});
// 4. Sort by fused score and return top-k
const sorted = [...fusedScores.entries()]
.sort((a, b) => b[1] - a[1])
.slice(0, topK);
// Fetch full metadata for top results
const topIds = sorted.map(([id]) => id);
return await vectorDB.fetch(topIds);
}
Reciprocal Rank Fusion (RRF) explained
RRF is the standard algorithm for combining ranked lists from different sources:
RRF score for document d = sum over all ranked lists: 1 / (k + rank(d))
Where k is a constant (typically 60) that prevents high-ranked items from
dominating the score.
Example:
Document "refund-policy-chunk-3" appears at:
Vector search: rank 2 -> RRF = 1/(60+2) = 0.0161
Keyword search: rank 1 -> RRF = 1/(60+1) = 0.0164
Combined RRF = 0.0161 + 0.0164 = 0.0325
Document "returns-faq-chunk-7" appears at:
Vector search: rank 1 -> RRF = 1/(60+1) = 0.0164
Keyword search: not found -> RRF = 0
Combined RRF = 0.0164
Result: refund-policy-chunk-3 (0.0325) ranks higher than returns-faq-chunk-7 (0.0164)
because it was found by BOTH search methods.
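To sanity-check the arithmetic above, here is a tiny standalone RRF calculator. It uses 1-based ranks and no source weights, matching the worked example rather than the weighted `hybridSearch` sketch:

```javascript
// Minimal RRF: sum 1/(k + rank) over every ranked list a document appears in.
// rankLists is an array of { docId: rank } objects with 1-based ranks.
function rrfScores(rankLists, k = 60) {
  const scores = new Map();
  for (const ranks of rankLists) {
    for (const [id, rank] of Object.entries(ranks)) {
      // Object.entries yields string values, so coerce rank back to a number
      scores.set(id, (scores.get(id) || 0) + 1 / (k + Number(rank)));
    }
  }
  return scores;
}

const scores = rrfScores([
  { 'refund-policy-chunk-3': 2, 'returns-faq-chunk-7': 1 }, // vector search ranks
  { 'refund-policy-chunk-3': 1 },                           // keyword search ranks
]);
console.log(scores.get('refund-policy-chunk-3').toFixed(4)); // "0.0325"
console.log(scores.get('returns-faq-chunk-7').toFixed(4));   // "0.0164"
```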
Databases with built-in hybrid search
Some vector databases support hybrid search natively:
| Database | Hybrid Support | Notes |
|---|---|---|
| Pinecone | Sparse-dense vectors | Supports both vector and keyword in one query |
| Weaviate | Built-in BM25 + vector | Hybrid search as a first-class feature |
| Qdrant | Sparse vectors + dense | Native hybrid support |
| pgvector + pg_trgm | Manual combination | Combine vector similarity with trigram text search |
| Elasticsearch | kNN + BM25 | Add vector search to existing keyword infrastructure |
7. Handling No Results
What happens when the vector database returns only chunks with low similarity scores? This usually means the knowledge base does not contain relevant information for the query, and the pipeline should say so rather than pass weak context to the LLM.
async function retrieveWithFallback(queryVector, query, options = {}) {
const { topK = 5, minScore = 0.7 } = options;
const results = await vectorDB.query({
vector: queryVector,
topK: topK,
includeMetadata: true,
});
const relevant = results.matches.filter(m => m.score >= minScore);
if (relevant.length === 0) {
// Strategy 1: Return a "no information" response
return {
chunks: [],
status: 'no_results',
message: 'No relevant documents found for this query.',
suggestion: 'Try rephrasing your question or ask about a different topic.',
};
}
if (relevant.length < 2) {
// Strategy 2: Low confidence — flag for review
return {
chunks: relevant,
status: 'low_confidence',
message: 'Limited relevant information found. Answer may be incomplete.',
};
}
return {
chunks: relevant,
status: 'ok',
};
}
// In the query pipeline
async function queryRAG(userQuery) {
const queryVector = await embedQuery(userQuery);
const retrieval = await retrieveWithFallback(queryVector, userQuery);
if (retrieval.status === 'no_results') {
return {
answer: "I don't have information about that in my knowledge base.",
confidence: 0,
sources: [],
status: 'no_results',
};
}
// Continue with normal RAG pipeline...
const context = formatChunks(retrieval.chunks);
const response = await generateAnswer(userQuery, context);
if (retrieval.status === 'low_confidence') {
response.confidence = Math.min(response.confidence, 0.5);
response.warning = 'Limited source material available.';
}
return response;
}
Score thresholds by use case
| Use Case | Min Score | Rationale |
|---|---|---|
| Customer support | 0.75 | Better to say "I don't know" than give wrong info |
| Internal docs search | 0.65 | Users can verify; more permissive |
| Medical/Legal | 0.85 | High stakes; only high-confidence results |
| Creative/Brainstorming | 0.50 | Loosely related content can still inspire |
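One way to wire the table into code is a lookup that feeds the `minScore` option of a retrieval function. The key names below are illustrative, not part of any API:

```javascript
// Hypothetical mapping from use case to similarity threshold (values from the table above)
const MIN_SCORE_BY_USE_CASE = {
  customer_support: 0.75,
  internal_docs: 0.65,
  medical_legal: 0.85,
  brainstorming: 0.5,
};

function minScoreFor(useCase) {
  // Fall back to a middle-of-the-road default for unknown use cases
  return MIN_SCORE_BY_USE_CASE[useCase] ?? 0.7;
}

console.log(minScoreFor('medical_legal')); // 0.85
console.log(minScoreFor('unknown_case'));  // 0.7
```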
8. Handling Too Many Results
The opposite problem: the query matches many chunks and you need to decide which ones to keep.
// Strategy 1: Score-based cutoff with diminishing returns
function selectChunksForContext(matches, options = {}) {
const { maxChunks = 7, maxTokens = 3000, minScore = 0.6 } = options;
const selected = [];
let totalTokens = 0;
for (const match of matches) {
// Stop if score drops below threshold
if (match.score < minScore) break;
// Stop if we've hit the chunk limit
if (selected.length >= maxChunks) break;
// Stop if adding this chunk would exceed token budget
const chunkTokens = estimateTokens(match.metadata.text);
if (totalTokens + chunkTokens > maxTokens) break;
selected.push(match);
totalTokens += chunkTokens;
}
return selected;
}
// Strategy 2: Diversity-aware selection (Maximal Marginal Relevance)
// Note: each candidate must carry its embedding vector (request vectors from the DB) so chunks can be compared to each other
function mmrSelect(candidates, options = {}) {
const { k = 5, lambda = 0.7 } = options;
// lambda balances relevance (1.0) vs diversity (0.0)
const selected = [];
const remaining = [...candidates];
while (selected.length < k && remaining.length > 0) {
let bestIdx = -1;
let bestScore = -Infinity;
for (let i = 0; i < remaining.length; i++) {
const candidate = remaining[i];
// Relevance to query
const relevance = candidate.score;
// Maximum similarity to any already-selected chunk
const maxSimilarity = selected.length === 0
? 0
: Math.max(...selected.map(s =>
cosineSimilarity(candidate.embedding, s.embedding)
));
// MMR score: balance relevance and diversity
const mmrScore = lambda * relevance - (1 - lambda) * maxSimilarity;
if (mmrScore > bestScore) {
bestScore = mmrScore;
bestIdx = i;
}
}
selected.push(remaining[bestIdx]);
remaining.splice(bestIdx, 1);
}
return selected;
}
Why Maximal Marginal Relevance (MMR)?
Without MMR, top-k retrieval often returns highly similar chunks that all say the same thing. This wastes context window space. MMR selects chunks that are both relevant to the query AND diverse from each other, maximizing the information content of the context.
Without MMR (top-5 by score):
Chunk 1: "Remote work policy allows 3 days per week..." score: 0.94
Chunk 2: "Employees may work from home up to 3 days..." score: 0.93
Chunk 3: "Our remote work arrangement permits 3 remote..." score: 0.91
Chunk 4: "Remote workers are allowed three days..." score: 0.90
Chunk 5: "Equipment allowance for remote employees..." score: 0.85
-> Four chunks say the SAME thing! Wasted context.
With MMR (lambda=0.7):
Chunk 1: "Remote work policy allows 3 days per week..." relevance: 0.94
Chunk 5: "Equipment allowance for remote employees..." relevance: 0.85
Chunk 8: "Remote work eligibility requirements..." relevance: 0.80
Chunk 12: "International remote work tax implications..." relevance: 0.72
Chunk 6: "Remote work schedule approval process..." relevance: 0.83
-> Five DIFFERENT aspects of remote work! Maximum information.
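The effect above can be reproduced on toy data. This sketch compresses the MMR loop and uses made-up 2-D "embeddings" (real embeddings have hundreds of dimensions) so the near-duplicate suppression is visible:

```javascript
// Compact MMR demo: two near-duplicate chunks and one different one.
function cos(a, b) {
  const dot = a[0] * b[0] + a[1] * b[1];
  return dot / (Math.hypot(...a) * Math.hypot(...b));
}

function mmr(candidates, k, lambda = 0.7) {
  const selected = [], remaining = [...candidates];
  while (selected.length < k && remaining.length > 0) {
    let bestIdx = 0, bestScore = -Infinity;
    remaining.forEach((c, i) => {
      // Penalize similarity to chunks we have already picked
      const maxSim = selected.length === 0 ? 0
        : Math.max(...selected.map(s => cos(c.embedding, s.embedding)));
      const score = lambda * c.score - (1 - lambda) * maxSim;
      if (score > bestScore) { bestScore = score; bestIdx = i; }
    });
    selected.push(remaining.splice(bestIdx, 1)[0]);
  }
  return selected.map(c => c.id);
}

const candidates = [
  { id: 'policy-a', score: 0.94, embedding: [1, 0] },       // near-duplicate of policy-b
  { id: 'policy-b', score: 0.93, embedding: [0.99, 0.14] },
  { id: 'equipment', score: 0.85, embedding: [0, 1] },      // different topic
];
console.log(mmr(candidates, 2)); // ['policy-a', 'equipment'] — the duplicate is skipped
```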
9. Metadata Filtering
Before or during retrieval, you can filter by metadata to narrow the search space:
// Filter by document type, date, department, etc.
const results = await vectorDB.query({
vector: queryVector,
topK: 10,
filter: {
// Only search in HR documents
department: 'hr',
// Only recent documents
updatedAfter: '2024-01-01',
// Exclude drafts
status: 'published',
},
includeMetadata: true,
});
// Multi-tenant filtering
const results = await vectorDB.query({
vector: queryVector,
topK: 10,
filter: {
tenantId: currentUser.tenantId, // Data isolation
accessLevel: { $lte: currentUser.clearanceLevel }, // Permission check
},
includeMetadata: true,
});
Metadata filtering is essential for:
- Multi-tenant applications — each customer only sees their own documents
- Access control — users only retrieve documents they have permission to see
- Temporal filtering — prefer recent documents over outdated ones
- Document type filtering — search only policies, or only FAQs, etc.
10. Measuring Retrieval Quality
You cannot improve what you don't measure. Key metrics for retrieval quality:
// Evaluation metrics for retrieval
// Precision@k: Of the k retrieved chunks, what fraction is relevant?
function precisionAtK(retrieved, relevant) {
if (retrieved.length === 0) return 0; // guard against empty result sets
const relevantRetrieved = retrieved.filter(r => relevant.includes(r.id));
return relevantRetrieved.length / retrieved.length;
}
// Recall@k: Of all relevant chunks, what fraction was retrieved?
function recallAtK(retrieved, relevant) {
const relevantRetrieved = retrieved.filter(r => relevant.includes(r.id));
return relevantRetrieved.length / relevant.length;
}
// Mean Reciprocal Rank: How high is the first relevant result?
function mrr(retrieved, relevant) {
for (let i = 0; i < retrieved.length; i++) {
if (relevant.includes(retrieved[i].id)) {
return 1 / (i + 1);
}
}
return 0;
}
// Evaluation suite
async function evaluateRetrieval(testQueries) {
const results = [];
for (const { query, expectedChunkIds } of testQueries) {
const queryVector = await embedQuery(query);
const retrieved = await vectorDB.query({
vector: queryVector,
topK: 10,
includeMetadata: true,
});
const retrievedIds = retrieved.matches.map(m => m.id);
results.push({
query,
precision: precisionAtK(retrieved.matches, expectedChunkIds),
recall: recallAtK(retrieved.matches, expectedChunkIds),
mrr: mrr(retrieved.matches, expectedChunkIds),
topScore: retrieved.matches[0]?.score,
});
}
// Aggregate metrics
const avgPrecision = results.reduce((a, r) => a + r.precision, 0) / results.length;
const avgRecall = results.reduce((a, r) => a + r.recall, 0) / results.length;
const avgMRR = results.reduce((a, r) => a + r.mrr, 0) / results.length;
console.log(`Precision@10: ${avgPrecision.toFixed(3)}`);
console.log(`Recall@10: ${avgRecall.toFixed(3)}`);
console.log(`MRR: ${avgMRR.toFixed(3)}`);
return results;
}
What good metrics look like
| Metric | Poor | Acceptable | Good | Excellent |
|---|---|---|---|---|
| Precision@5 | < 0.3 | 0.3 - 0.5 | 0.5 - 0.7 | > 0.7 |
| Recall@10 | < 0.4 | 0.4 - 0.6 | 0.6 - 0.8 | > 0.8 |
| MRR | < 0.3 | 0.3 - 0.5 | 0.5 - 0.8 | > 0.8 |
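To make the thresholds concrete, here is a toy run of the three metric functions (redefined inline so the snippet is self-contained). Relevant set: {a, c}; retrieved order: [b, a, d]:

```javascript
// Standalone copies of the retrieval metrics for a worked example.
function precisionAtK(retrieved, relevant) {
  if (retrieved.length === 0) return 0;
  return retrieved.filter(r => relevant.includes(r.id)).length / retrieved.length;
}
function recallAtK(retrieved, relevant) {
  return retrieved.filter(r => relevant.includes(r.id)).length / relevant.length;
}
function mrr(retrieved, relevant) {
  for (let i = 0; i < retrieved.length; i++) {
    if (relevant.includes(retrieved[i].id)) return 1 / (i + 1);
  }
  return 0;
}

const retrieved = [{ id: 'b' }, { id: 'a' }, { id: 'd' }];
const relevant = ['a', 'c'];
console.log(precisionAtK(retrieved, relevant)); // 1 relevant of 3 retrieved ≈ 0.333
console.log(recallAtK(retrieved, relevant));    // 1 of 2 relevant found -> 0.5
console.log(mrr(retrieved, relevant));          // first hit at position 2 -> 0.5
```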
11. Complete Retrieval Pipeline
Putting it all together — a production-grade retrieval function:
import OpenAI from 'openai';
const openai = new OpenAI();
const EMBEDDING_MODEL = 'text-embedding-3-small';
async function retrieve(userQuery, options = {}) {
const {
topK = 20, // Initial candidates (broad)
finalK = 5, // Final chunks to return
minScore = 0.65, // Minimum similarity threshold
useReranking = true, // Enable cross-encoder re-ranking
filters = {}, // Metadata filters
maxTokens = 3000, // Token budget for context
} = options;
// 1. Embed the query
const embeddingResponse = await openai.embeddings.create({
model: EMBEDDING_MODEL,
input: userQuery,
});
const queryVector = embeddingResponse.data[0].embedding;
// 2. Broad vector search
const candidates = await vectorDB.query({
vector: queryVector,
topK: topK,
filter: filters,
includeMetadata: true,
});
// 3. Filter by minimum score
let chunks = candidates.matches.filter(m => m.score >= minScore);
if (chunks.length === 0) {
return { chunks: [], status: 'no_results' };
}
// 4. Re-rank (optional but recommended)
if (useReranking && chunks.length > finalK) {
// rerankChunks expects objects with a .text field, so map from the vector DB's metadata
chunks = await rerankChunks(
userQuery,
chunks.map(m => ({ text: m.metadata.text, score: m.score, metadata: m.metadata }))
);
}
// 5. Select final chunks within token budget
const selected = [];
let tokenCount = 0;
for (const chunk of chunks.slice(0, finalK)) {
const chunkTokens = estimateTokens(chunk.metadata?.text || chunk.text);
if (tokenCount + chunkTokens > maxTokens) break;
selected.push(chunk);
tokenCount += chunkTokens;
}
return {
chunks: selected,
status: selected.length >= 3 ? 'ok' : 'low_confidence',
totalCandidates: candidates.matches.length,
tokenCount,
};
}
// Helper: estimate tokens (~4 chars per token)
function estimateTokens(text) {
return Math.ceil(text.length / 4);
}
12. Key Takeaways
- Retrieval quality determines answer quality — invest more in retrieval than in the LLM. Excellent retrieval with an average LLM beats poor retrieval with an excellent LLM.
- Always use the same embedding model for ingestion and query. Mismatched models produce meaningless similarity scores.
- Top-k retrieval is a starting point; re-ranking with a cross-encoder dramatically improves precision.
- Hybrid search (vector + BM25/keyword) catches what pure vector search misses — especially exact terms, codes, and identifiers.
- Handle edge cases explicitly: no results (say "I don't know"), too many results (MMR for diversity, token budget enforcement).
- Measure retrieval quality with precision, recall, and MRR. You cannot improve what you don't measure.
Explain-It Challenge
- Explain why HyDE (Hypothetical Document Embedding) can improve retrieval over embedding the raw question.
- A user searches for "PTO-2024-REV3" (a specific policy document ID) and gets no results from vector search. What went wrong and how do you fix it?
- Your RAG system returns 5 chunks that all say essentially the same thing. How do you fix this waste of context space?
Navigation: <- 4.13.a RAG Workflow | 4.13.c — Prompt Construction for RAG ->