4.13 — Building a RAG Pipeline: Quick Revision
Compact cheat sheet. Print-friendly.
How to use this material
- Skim before labs or interviews.
- Drill gaps -- reopen README.md -> 4.13.a...4.13.d.
- Practice -- 4.13-Exercise-Questions.md.
- Polish answers -- 4.13-Interview-Questions.md.
Core vocabulary
| Term | One-liner |
|---|---|
| RAG | Retrieval-Augmented Generation -- combine retrieval (vector DB) with generation (LLM) to answer from real documents |
| Retrieval | Finding relevant document chunks for a given query using vector similarity search |
| Chunking | Splitting documents into smaller pieces (~500 tokens) with overlap for embedding and retrieval |
| Chunk overlap | Repeating 10-20% of text between consecutive chunks so information at boundaries is not lost |
| Embedding | Converting text into a high-dimensional vector (e.g., 1536 floats) that captures semantic meaning |
| Top-k | Returning the k most similar chunks from the vector database |
| Re-ranking | Using a cross-encoder to re-score retrieved chunks for true relevance after initial vector search |
| Cross-encoder | A model that scores a (query, document) pair together -- slower but more accurate than embeddings |
| Context injection | Inserting retrieved chunks into the LLM prompt so the model answers from documents, not training data |
| Hybrid search | Combining vector similarity search (semantic) with BM25 keyword search (lexical) |
| BM25 | Term-frequency keyword search algorithm (like traditional search engines) |
| RRF | Reciprocal Rank Fusion -- algorithm to merge ranked lists: score(d) = sum(1/(k + rank(d))) |
| HyDE | Hypothetical Document Embedding -- embed a hypothetical answer instead of the question for better retrieval |
| MMR | Maximal Marginal Relevance -- selects chunks that are both relevant AND diverse from each other |
| Faithfulness | Whether the answer uses only information from retrieved context (no hallucination) |
| Source attribution | Tracing every claim in the answer back to a specific document and chunk |
| Confidence score | 0.0-1.0 value indicating how well the context supports the answer |
| Lost in the middle | LLMs pay less attention to content in the middle of long contexts |
| Sandwich ordering | Placing most-relevant chunks at the start and end of context (where attention is highest) |
| Token budget | Allocating context window space across system prompt, context, query, and output |
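The chunking and chunk-overlap terms above can be illustrated with a minimal sketch. For simplicity it measures characters rather than tokens (a real pipeline would count tokens with a tokenizer); `chunkText` and its default sizes are illustrative assumptions, not part of the material above.

```typescript
// Split text into fixed-size chunks with overlap, so information at a
// chunk boundary appears in two consecutive chunks.
// Sizes are in characters here; production code would measure tokens.
function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
  if (overlap >= chunkSize) throw new Error("overlap must be < chunkSize");
  const chunks: string[] = [];
  const step = chunkSize - overlap; // advance by size minus overlap
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached
  }
  return chunks;
}
```

With the defaults, a 1,200-character document yields three chunks, and the last 50 characters of each chunk repeat as the first 50 of the next.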
RAG pipeline flow
INGESTION (offline, runs once or on schedule)
==============================================
Documents (PDF, MD, HTML, TXT)
        |
        v
Clean & parse (remove noise, extract text)
        |
        v
Chunk (500 tokens, 50 overlap)
        |
        v
Embed each chunk (text-embedding-3-small)
        |
        v
Store in vector DB (+ metadata)
QUERY (runtime, per user request)
==============================================
User Query
|
v
Step 1: Embed query (SAME model as ingestion -- critical!)
|
v
Step 2: Vector DB search (top-k candidates)
        |
        +---> [Optional] Re-rank with cross-encoder
        |
        v
Step 3: Build prompt (system msg + context + query)
|
v
Step 4: LLM generate (temperature 0, JSON mode)
|
v
Parse JSON -> Zod validate -> Return { answer, confidence, sources }
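Step 2 (vector search) reduces to cosine similarity between the query vector and the stored chunk vectors. A minimal in-memory sketch, assuming a tiny `StoredChunk` shape of our own invention; a production system would use a real vector DB:

```typescript
interface StoredChunk { id: string; vector: number[]; text: string }

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the k stored chunks most similar to the query vector.
function topK(query: number[], store: StoredChunk[], k: number): StoredChunk[] {
  return [...store]
    .map(c => ({ c, score: cosine(query, c.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(s => s.c);
}
```

The "same model" rule in Step 1 exists precisely because this comparison is only meaningful when query and chunk vectors come from the same embedding space.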
RAG vs fine-tuning
| Factor | Fine-Tuning | RAG |
|---|---|---|
| Knowledge freshness | Frozen at training time | Updated by re-indexing |
| Source attribution | Cannot cite sources | Cites exact document + chunk |
| Cost to update | $100s-$1000s per training run | Pennies per re-index |
| Hallucination | Model may still hallucinate | Grounded in retrieved docs |
| Setup complexity | Training data, GPU, pipeline | Vector DB, embedding pipeline |
| Latency | Single model call | Embed + DB query + model call |
| Best for | Style, format, jargon | Specific facts, knowledge Q&A |
Rule of thumb: Use RAG for knowledge tasks. Use fine-tuning for style/format. Use both when you need domain-specific style AND factual grounding.
Retrieval strategies comparison
| Strategy | How It Works | Strengths | Weaknesses |
|---|---|---|---|
| Vector search | Embed query, cosine similarity against stored chunks | Semantic understanding, fast | Misses exact terms (error codes, IDs) |
| BM25 / Keyword | Term-frequency matching | Exact matches, no embedding needed | No semantic understanding |
| Hybrid (vector + BM25) | Run both, merge with RRF | Best of both worlds | More complex, two search systems |
| Two-stage (vector + re-rank) | Broad vector search (top-20), cross-encoder re-rank (top-5) | High precision | Extra latency (~200ms for re-ranking) |
| HyDE | Embed a hypothetical answer instead of the question | Bridges question-answer gap | Extra LLM call, may introduce bias |
| MMR selection | Balance relevance and diversity among retrieved chunks | No duplicate/redundant context | Slightly lower max relevance |
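The MMR strategy from the table can be sketched as a greedy loop: at each step, pick the candidate with the best trade-off between query relevance and similarity to chunks already selected. `mmrSelect` and the lambda weighting below are illustrative assumptions:

```typescript
// Maximal Marginal Relevance: greedily pick chunks that are relevant
// to the query but dissimilar to chunks already selected.
//   relevance[i] -- query similarity of candidate i
//   sim(i, j)    -- pairwise similarity between candidates i and j
function mmrSelect(
  relevance: number[],
  sim: (i: number, j: number) => number,
  k: number,
  lambda = 0.7, // 1.0 = pure relevance, 0.0 = pure diversity
): number[] {
  const selected: number[] = [];
  const remaining = new Set(relevance.map((_, i) => i));
  while (selected.length < k && remaining.size > 0) {
    let best = -1, bestScore = -Infinity;
    for (const i of remaining) {
      const maxSim = selected.length
        ? Math.max(...selected.map(j => sim(i, j)))
        : 0;
      const score = lambda * relevance[i] - (1 - lambda) * maxSim;
      if (score > bestScore) { bestScore = score; best = i; }
    }
    selected.push(best);
    remaining.delete(best);
  }
  return selected;
}
```

Note how a near-duplicate of an already-selected chunk is penalized even if its raw relevance is higher than other candidates.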
Top-k guide
k = 1-3 Simple factual lookup
k = 3-7 Standard RAG Q&A (most common, start with k=5)
k = 10-20 Complex, multi-aspect questions
k = 20-50 Broad retrieval for re-ranking pipeline
Too low: miss relevant info, single point of failure.
Too high: irrelevant chunks dilute context, "lost in the middle", more tokens/cost.
Score thresholds by use case
| Use Case | Min Score | Rationale |
|---|---|---|
| Customer support | 0.75 | Better to say "I don't know" than give wrong info |
| Internal docs | 0.65 | Users can verify; more permissive |
| Medical / Legal | 0.85 | High stakes; only high-confidence results |
| Creative / Brainstorming | 0.50 | Loosely related content can still help |
Prompt construction patterns
1. System message structure
[Role] "You are a document assistant for Acme Corp."
[Rules] "Answer ONLY from the provided CONTEXT."
[Fallback] "If context is insufficient, say I don't know, confidence 0."
[Output format] "Return JSON: { answer, confidence, sources, gaps }"
[Context] "CONTEXT: [Source 1: file.md, Chunk 0] ..."
2. Anti-hallucination layers
Layer 1: "Answer ONLY from the provided CONTEXT. Do NOT use training data."
Layer 2: "If context is insufficient -> confidence: 0, say I don't know."
Layer 3: Require { answer, confidence, sources } -> forces source tracking.
Layer 4: Negative examples: "Do NOT say 'typically' or 'generally'."
Layer 5: Post-generation validation: verify cited sources exist in context.
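Layer 5 (post-generation validation) can be as simple as a set-membership check: every source the model cites must be one of the chunks actually injected into the prompt. A sketch, with `validateSources` as a hypothetical helper:

```typescript
interface Source { document: string; chunk: number }

// After generation, verify that every cited source corresponds to a
// chunk that was actually injected into the prompt.
function validateSources(
  cited: Source[],
  injected: Source[],
): { valid: boolean; unknown: Source[] } {
  const known = new Set(injected.map(s => `${s.document}#${s.chunk}`));
  const unknown = cited.filter(s => !known.has(`${s.document}#${s.chunk}`));
  return { valid: unknown.length === 0, unknown };
}
```

A failed check is a strong hallucination signal: the model invented a citation, so the answer should be retried or rejected.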
3. Context injection methods
| Method | Best For |
|---|---|
| System message (recommended) | Most RAG apps -- highest priority, clean separation |
| User message | Flexible, works with any model |
| Multi-turn | Natural conversation, but uses more tokens |
4. Chunk formatting
[Source 1: employee-handbook.pdf, Chunk 12]
New employees receive 15 PTO days per year.
---
[Source 2: pto-policy.md, Chunk 3]
After 5 years, PTO increases to 20 days.
Always: label chunks with [Source N: filename] + use --- separators.
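The labelling and separator convention above is easy to apply mechanically; a possible sketch (`formatContext` and its `RetrievedChunk` shape are assumed names):

```typescript
interface RetrievedChunk { filename: string; index: number; text: string }

// Label each chunk with [Source N: filename, Chunk i] and join with
// --- separators, matching the format shown above.
function formatContext(chunks: RetrievedChunk[]): string {
  return chunks
    .map((c, n) => `[Source ${n + 1}: ${c.filename}, Chunk ${c.index}]\n${c.text}`)
    .join("\n---\n");
}
```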
5. Chunk ordering strategies
| Strategy | When to Use |
|---|---|
| Relevance first | Default for most RAG apps |
| Sandwich | Long contexts -- place best chunks at start AND end |
| Chronological | Policy changes, versioned docs, audit trails |
Document QA system architecture
INGESTION
=========
Load docs (PDF, MD, TXT)
    |
    v
Clean & parse
    |
    v
Chunk (500 tok, 50 overlap)
    |
    v
Embed each chunk
    |
    v
Store in vector DB (vector + metadata)

QUERY
=====
User query
    |
    v
Embed query (same model!)
    |
    v
Vector DB search (top-k)
    |
    v
[Optional] Re-rank
    |
    v
Build prompt
    |
    v
LLM generate (temp 0, JSON)
    |
    v
Parse JSON
    |
    v
Zod validate
    |
    v
{ answer, confidence, sources }
Zod schema
import { z } from "zod";

const AnswerSchema = z.object({
  answer: z.string().min(1),
  confidence: z.number().min(0).max(1),
  sources: z.array(z.object({
    document: z.string(),
    chunk: z.number().int().min(0),
    relevance: z.number().min(0).max(1),
  })),
  gaps: z.array(z.string()).optional(),
});
Error handling chain
Embedding fails -> retry with backoff, fallback error response
Empty retrieval -> return confidence: 0, "I don't have that info"
LLM rate limited -> retry with exponential backoff (1s, 2s, 4s)
Invalid JSON -> extract JSON from text, retry LLM call
Zod rejects -> retry (up to 2x), then safe fallback response
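The "Invalid JSON -> extract JSON from text" step might look like the following sketch: try strict parsing first, then fall back to the outermost {...} span, which handles prose or markdown fences wrapped around the JSON. `extractJson` is an illustrative helper:

```typescript
// Models sometimes wrap JSON in prose or markdown fences. Try strict
// parsing first, then fall back to the first {...} span in the text.
function extractJson(raw: string): unknown | null {
  try {
    return JSON.parse(raw);
  } catch {
    const start = raw.indexOf("{");
    const end = raw.lastIndexOf("}");
    if (start === -1 || end <= start) return null;
    try {
      return JSON.parse(raw.slice(start, end + 1));
    } catch {
      return null; // caller retries the LLM call
    }
  }
}
```

A `null` result feeds the retry branch above; a successful parse still goes through Zod validation before anything is returned to the user.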
Token budget calculation
GPT-4o context window: 128,000 tokens
- System prompt (instructions, rules): 1,200
- Max output reservation: 4,000
- Safety margin: 500
- User query (typical): 80
--------
Available for RAG context: 122,220 tokens
With 500-token chunks: ~244 chunks could fit
In practice: 5-10 high-quality chunks is optimal (2,500-5,000 tokens)
Key insight: Quality of context beats quantity. 5 perfect chunks > 50 marginal chunks.
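The budget arithmetic above is plain subtraction; wrapping it in a function keeps the reservations explicit. A sketch with assumed field names:

```typescript
// Compute how many tokens remain for retrieved context after the
// fixed reservations, mirroring the budget above.
function contextBudget(opts: {
  contextWindow: number; // model limit, e.g. 128,000 for GPT-4o
  systemPrompt: number;  // instructions, rules
  maxOutput: number;     // reserved for the model's answer
  safetyMargin: number;  // slack for token-count estimation error
  query: number;         // typical user query size
}): number {
  const { contextWindow, systemPrompt, maxOutput, safetyMargin, query } = opts;
  return contextWindow - systemPrompt - maxOutput - safetyMargin - query;
}
```

Plugging in the numbers above returns 122,220, matching the worked calculation.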
Evaluation metrics
Layer 1 -- Retrieval quality
| Metric | What It Measures | Target |
|---|---|---|
| Precision@k | Fraction of retrieved chunks that are relevant | > 0.6 |
| Recall@k | Fraction of all relevant chunks that were retrieved | > 0.7 |
| MRR | How high is the first relevant result? | > 0.7 |
Layer 2 -- Generation quality
| Metric | What It Measures |
|---|---|
| Faithfulness | Does the answer use only info from context? (no hallucination) |
| Relevance | Does the answer address the user's question? |
| Completeness | Does the answer cover all aspects the context supports? |
Layer 3 -- End-to-end quality
| Metric | What It Measures |
|---|---|
| Answer correctness | Compare against ground-truth answers |
| Source accuracy | Do cited sources actually support the claims? |
| Confidence calibration | Does confidence predict actual correctness? |
| Failure handling | Does system refuse when context is insufficient? |
RRF formula
RRF score for document d = sum over all ranked lists: 1 / (k + rank(d))
k = 60 (standard constant)
Example:
Doc appears at vector rank 2, keyword rank 1:
RRF = 1/(60+2) + 1/(60+1) = 0.0161 + 0.0164 = 0.0325
Doc appears at vector rank 1 only:
RRF = 1/(60+1) + 0 = 0.0164
-> First doc ranks higher (found by BOTH methods)
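The worked example above can be reproduced with a small merge function (an illustrative sketch; ranks are 1-based in the formula, so a 0-based array index i contributes 1/(k + i + 1)):

```typescript
// Reciprocal Rank Fusion: merge ranked lists of document ids.
// Each input array is one ranked list (best first); k = 60 is the
// standard constant. Returns [id, score] pairs, best first.
function rrfMerge(rankings: string[][], k = 60): [string, number][] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((doc, i) => {
      scores.set(doc, (scores.get(doc) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]);
}
```

With vector ranking ["A", "B"] and keyword ranking ["B"], doc B scores 1/62 + 1/61 ≈ 0.0325 and outranks A at 1/61 ≈ 0.0164, reproducing the example.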
Common gotchas
| Gotcha | What Goes Wrong | Fix |
|---|---|---|
| Model mismatch | Ingestion and query use different embedding models | Always use the SAME model constant |
| No "context only" rule | LLM mixes training data with context | Explicit prompt: "Answer ONLY from CONTEXT" |
| Chunks too large | Irrelevant text dilutes answers | 200-500 token chunks with overlap |
| Chunks too small | Fragmented context, incoherent answers | Break at paragraph/sentence boundaries |
| No overlap | Info lost at chunk boundaries | 10-20% overlap between consecutive chunks |
| No fallback | System hallucinates when docs lack the answer | "If insufficient context -> confidence: 0, say I don't know" |
| Stale index | Docs updated but embeddings are old | Automated re-indexing pipeline |
| Lost in the middle | LLM ignores middle chunks in long contexts | Limit to 5-10 chunks; use sandwich ordering |
| No source labels | LLM cannot cite specific sources | Label every chunk with [Source N: filename] |
| No separators | LLM blends info across chunk boundaries | Use --- between chunks |
| Context after query | LLM starts answering before reading context | Put CONTEXT before the user question |
| Confidence > 1 | Zod rejects, retry wastes time | Clear confidence guidelines in prompt (0.0-1.0) |
| Same-topic chunks | 5 chunks say the same thing, wasted context | Use MMR for diversity-aware selection |
| Exact term search | Vector search misses error codes, IDs, SKUs | Use hybrid search (vector + BM25) |
| Multi-turn ambiguity | "What about for contractors?" retrieves wrong docs | Contextualize follow-up queries before retrieval |
Security checklist
[ ] Prompt injection: Sanitize input, separate instruction/data channels
[ ] Multi-tenant leakage: Filter by tenant_id in EVERY vector DB query
[ ] PII in context: PII detection before injecting into LLM prompts
[ ] Context poisoning: Validate documents during ingestion
[ ] Knowledge probing: Generic "no info available" (don't confirm/deny)
[ ] Authoritative answers: Disclaimers + confidence thresholds + human-in-loop
One-line summary
RAG = Embed query -> Retrieve relevant chunks from vector DB -> Inject into prompt as context -> Generate grounded, structured answer with sources
End of 4.13 quick revision.