Episode 4 — Generative AI Engineering / 4.13 — Building a RAG Pipeline

4.13 — Building a RAG Pipeline: Quick Revision

Compact cheat sheet. Print-friendly.

How to use this material (instructions)

  1. Skim before labs or interviews.
  2. Drill gaps -- reopen README.md -> 4.13.a...4.13.d.
  3. Practice -- 4.13-Exercise-Questions.md.
  4. Polish answers -- 4.13-Interview-Questions.md.

Core vocabulary

Term | One-liner
RAG | Retrieval-Augmented Generation -- combine retrieval (vector DB) with generation (LLM) to answer from real documents
Retrieval | Finding relevant document chunks for a given query using vector similarity search
Chunking | Splitting documents into smaller pieces (~500 tokens) with overlap for embedding and retrieval
Chunk overlap | Repeating 10-20% of text between consecutive chunks so information at boundaries is not lost
Embedding | Converting text into a high-dimensional vector (e.g., 1536 floats) that captures semantic meaning
Top-k | Returning the k most similar chunks from the vector database
Re-ranking | Using a cross-encoder to re-score retrieved chunks for true relevance after initial vector search
Cross-encoder | A model that scores a (query, document) pair together -- slower but more accurate than embeddings
Context injection | Inserting retrieved chunks into the LLM prompt so the model answers from documents, not training data
Hybrid search | Combining vector similarity search (semantic) with BM25 keyword search (lexical)
BM25 | Term-frequency keyword search algorithm (like traditional search engines)
RRF | Reciprocal Rank Fusion -- algorithm to merge ranked lists: score(d) = sum(1/(k + rank(d)))
HyDE | Hypothetical Document Embedding -- embed a hypothetical answer instead of the question for better retrieval
MMR | Maximal Marginal Relevance -- selects chunks that are both relevant AND diverse from each other
Faithfulness | Whether the answer uses only information from retrieved context (no hallucination)
Source attribution | Tracing every claim in the answer back to a specific document and chunk
Confidence score | 0.0-1.0 value indicating how well the context supports the answer
Lost in the middle | LLMs pay less attention to content in the middle of long contexts
Sandwich ordering | Placing the most relevant chunks at the start and end of context (where attention is highest)
Token budget | Allocating context window space across system prompt, context, query, and output
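
The chunking and overlap terms above can be sketched in code. A minimal sketch (function name is illustrative; "tokens" are approximated by whitespace-separated words here, whereas a real pipeline would use the embedding model's tokenizer):

```typescript
// Fixed-size chunking with overlap: consecutive chunks share `overlap` words
// so information at chunk boundaries is not lost.
function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
  if (overlap >= chunkSize) throw new Error("overlap must be < chunkSize");
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = chunkSize - overlap; // advance less than chunkSize to create overlap
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // final chunk reached
  }
  return chunks;
}
```

With the defaults (500-word chunks, 50-word overlap), a 1000-word document yields chunks starting at words 0, 450, and 900.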

RAG pipeline flow

INGESTION (offline, runs once or on schedule)
==============================================

  Documents -----> Clean & Parse -----> Chunk -----> Embed Each -----> Store in
  (PDF, MD,        (remove noise,       (500 tokens,  (text-embedding-   Vector DB
   HTML, TXT)       extract text)        50 overlap)   3-small)          (+ metadata)


QUERY (runtime, per user request)
==============================================

  User Query
      |
      v
  Step 1: Embed query (SAME model as ingestion -- critical!)
      |
      v
  Step 2: Vector DB search (top-k candidates)
      |                     |
      |                     +---> [Optional] Re-rank with cross-encoder
      |                     |
      v                     v
  Step 3: Build prompt (system msg + context + query)
      |
      v
  Step 4: LLM generate (temperature 0, JSON mode)
      |
      v
  Parse JSON -> Zod validate -> Return { answer, confidence, sources }
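
Step 2 of the flow above can be made concrete with a brute-force sketch. Real systems delegate this to a vector DB; the names here are illustrative, and the point is only the cosine-similarity top-k math:

```typescript
// Brute-force top-k vector search over an in-memory store.
type Stored = { id: string; vector: number[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function topK(query: number[], store: Stored[], k = 5): { id: string; score: number }[] {
  return store
    .map((s) => ({ id: s.id, score: cosine(query, s.vector) }))
    .sort((x, y) => y.score - x.score) // highest similarity first
    .slice(0, k);
}
```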

RAG vs fine-tuning

Factor | Fine-Tuning | RAG
Knowledge freshness | Frozen at training time | Updated by re-indexing
Source attribution | Cannot cite sources | Cites exact document + chunk
Cost to update | $100s-$1000s per training run | Pennies per re-index
Hallucination | Model may still hallucinate | Grounded in retrieved docs
Setup complexity | Training data, GPU, pipeline | Vector DB, embedding pipeline
Latency | Single model call | Embed + DB query + model call
Best for | Style, format, jargon | Specific facts, knowledge Q&A

Rule of thumb: Use RAG for knowledge tasks. Use fine-tuning for style/format. Use both when you need domain-specific style AND factual grounding.


Retrieval strategies comparison

Strategy | How It Works | Strengths | Weaknesses
Vector search | Embed query, cosine similarity against stored chunks | Semantic understanding, fast | Misses exact terms (error codes, IDs)
BM25 / Keyword | Term-frequency matching | Exact matches, no embedding needed | No semantic understanding
Hybrid (vector + BM25) | Run both, merge with RRF | Best of both worlds | More complex, two search systems
Two-stage (vector + re-rank) | Broad vector search (top-20), cross-encoder re-rank (top-5) | High precision | Extra latency (~200ms for re-ranking)
HyDE | Embed a hypothetical answer instead of the question | Bridges question-answer gap | Extra LLM call, may introduce bias
MMR selection | Balance relevance and diversity among retrieved chunks | No duplicate/redundant context | Slightly lower max relevance
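
MMR selection from the table above, as a greedy sketch. Assumption: `relevance` scores and a pairwise `similarity` function are precomputed (e.g., from embeddings); names are illustrative:

```typescript
// Maximal Marginal Relevance: greedily pick chunks that are relevant to the
// query but dissimilar to chunks already selected.
type Candidate = { id: string; relevance: number };

function mmrSelect(
  candidates: Candidate[],
  similarity: (a: string, b: string) => number, // similarity between two candidates
  k: number,
  lambda = 0.7 // 1.0 = pure relevance, 0.0 = pure diversity
): string[] {
  const selected: string[] = [];
  const pool = [...candidates];
  while (selected.length < k && pool.length > 0) {
    let bestIdx = 0, bestScore = -Infinity;
    for (let i = 0; i < pool.length; i++) {
      const maxSim = selected.length
        ? Math.max(...selected.map((s) => similarity(pool[i].id, s)))
        : 0; // nothing selected yet: no diversity penalty
      const score = lambda * pool[i].relevance - (1 - lambda) * maxSim;
      if (score > bestScore) { bestScore = score; bestIdx = i; }
    }
    selected.push(pool.splice(bestIdx, 1)[0].id);
  }
  return selected;
}
```

With two near-duplicate high-relevance chunks, MMR keeps one and picks a more diverse chunk instead of the duplicate.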

Top-k guide

k = 1-3    Simple factual lookup
k = 3-7    Standard RAG Q&A (most common, start with k=5)
k = 10-20  Complex, multi-aspect questions
k = 20-50  Broad retrieval for re-ranking pipeline

Too low: miss relevant info, single point of failure. Too high: irrelevant chunks dilute context, "lost in the middle", more tokens/cost.


Score thresholds by use case

Use Case | Min Score | Rationale
Customer support | 0.75 | Better to say "I don't know" than give wrong info
Internal docs | 0.65 | Users can verify; more permissive
Medical / Legal | 0.85 | High stakes; only high-confidence results
Creative / Brainstorming | 0.50 | Loosely related content can still help

Prompt construction patterns

1. System message structure

[Role]           "You are a document assistant for Acme Corp."
[Rules]          "Answer ONLY from the provided CONTEXT."
[Fallback]       "If context is insufficient, say I don't know, confidence 0."
[Output format]  "Return JSON: { answer, confidence, sources, gaps }"
[Context]        "CONTEXT: [Source 1: file.md, Chunk 0] ..."

2. Anti-hallucination layers

Layer 1: "Answer ONLY from the provided CONTEXT. Do NOT use training data."
Layer 2: "If context is insufficient -> confidence: 0, say I don't know."
Layer 3: Require { answer, confidence, sources } -> forces source tracking.
Layer 4: Negative examples: "Do NOT say 'typically' or 'generally'."
Layer 5: Post-generation validation: verify cited sources exist in context.

3. Context injection methods

Method | Best For
System message (recommended) | Most RAG apps -- highest priority, clean separation
User message | Flexible, works with any model
Multi-turn | Natural conversation, but uses more tokens

4. Chunk formatting

[Source 1: employee-handbook.pdf, Chunk 12]
New employees receive 15 PTO days per year.

---

[Source 2: pto-policy.md, Chunk 3]
After 5 years, PTO increases to 20 days.

Always: label chunks with [Source N: filename] + use --- separators.
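
The formatting rule above can be sketched as a small helper (type and function names are illustrative):

```typescript
// Format retrieved chunks into a CONTEXT block with [Source N: filename]
// labels and --- separators between chunks.
type Chunk = { document: string; chunkIndex: number; text: string };

function buildContext(chunks: Chunk[]): string {
  return chunks
    .map((c, i) => `[Source ${i + 1}: ${c.document}, Chunk ${c.chunkIndex}]\n${c.text}`)
    .join("\n\n---\n\n");
}
```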

5. Chunk ordering strategies

Strategy | When to Use
Relevance first | Default for most RAG apps
Sandwich | Long contexts -- place best chunks at start AND end
Chronological | Policy changes, versioned docs, audit trails
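
Sandwich ordering can be sketched as follows: given chunks already sorted best-first, alternate them between the front and back of the context so the strongest chunks sit where attention is highest (the function name is illustrative):

```typescript
// Sandwich ordering: rank 1 goes first, rank 2 goes last, rank 3 second,
// rank 4 second-to-last, and so on -- the weakest chunks end up in the middle.
function sandwichOrder<T>(sortedBestFirst: T[]): T[] {
  const front: T[] = [];
  const back: T[] = [];
  sortedBestFirst.forEach((chunk, i) => {
    if (i % 2 === 0) front.push(chunk); // ranks 1, 3, 5... fill the start
    else back.unshift(chunk);           // ranks 2, 4, 6... fill the end
  });
  return [...front, ...back];
}
```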

Document QA system architecture

INGESTION                              QUERY
=========                              =====
Load docs (PDF, MD, TXT)              User query
    |                                      |
    v                                      v
Clean & parse                          Embed query (same model!)
    |                                      |
    v                                      v
Chunk (500 tok, 50 overlap)            Vector DB search (top-k)
    |                                      |
    v                                      v
Embed each chunk                       [Optional] Re-rank
    |                                      |
    v                                      v
Store in Vector DB                     Build prompt
(vector + metadata)                        |
                                           v
                                       LLM generate (temp 0, JSON)
                                           |
                                           v
                                       Parse JSON
                                           |
                                           v
                                       Zod validate
                                           |
                                           v
                                       { answer, confidence, sources }

Zod schema

import { z } from "zod";

const RagAnswerSchema = z.object({
  answer:     z.string().min(1),              // non-empty answer text
  confidence: z.number().min(0).max(1),       // 0.0-1.0 support score
  sources:    z.array(z.object({
    document:  z.string(),                    // source filename
    chunk:     z.number().int().min(0),       // chunk index within the document
    relevance: z.number().min(0).max(1),
  })),
  gaps:       z.array(z.string()).optional(), // info the context lacked, if any
});

Error handling chain

Embedding fails    -> retry with backoff, fallback error response
Empty retrieval    -> return confidence: 0, "I don't have that info"
LLM rate limited   -> retry with exponential backoff (1s, 2s, 4s)
Invalid JSON       -> extract JSON from text, retry LLM call
Zod rejects        -> retry (up to 2x), then safe fallback response
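
The "retry with exponential backoff" step can be sketched as a generic wrapper. The `sleep` parameter is injectable so tests can run without real delays (names are illustrative):

```typescript
// Retry an async operation with exponential backoff: 1s, 2s, 4s by default.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms))
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Back off before the next attempt: baseDelay * 2^attempt
      if (attempt < maxRetries) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError; // all attempts exhausted
}
```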

Token budget calculation

GPT-4o context window:                    128,000 tokens
  - System prompt (instructions, rules):    1,200
  - Max output reservation:                 4,000
  - Safety margin:                            500
  - User query (typical):                      80
                                          --------
Available for RAG context:               122,220 tokens

With 500-token chunks: ~244 could fit
In practice: 5-10 high-quality chunks is optimal (2,500-5,000 tokens)

Key insight: Quality of context beats quantity. 5 perfect chunks > 50 marginal chunks.
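
The budget arithmetic above, as a tiny helper (names are illustrative):

```typescript
// Subtract fixed reservations from the context window to get the space
// available for RAG context chunks.
function contextBudget(opts: {
  contextWindow: number;
  systemPrompt: number;
  maxOutput: number;
  safetyMargin: number;
  userQuery: number;
}): number {
  const { contextWindow, systemPrompt, maxOutput, safetyMargin, userQuery } = opts;
  return contextWindow - systemPrompt - maxOutput - safetyMargin - userQuery;
}
```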


Evaluation metrics

Layer 1 -- Retrieval quality

Metric | What It Measures | Target
Precision@k | Fraction of retrieved chunks that are relevant | > 0.6
Recall@k | Fraction of all relevant chunks that were retrieved | > 0.7
MRR | How high is the first relevant result? | > 0.7
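
The Layer 1 metrics can be sketched as follows. `retrieved` is the ranked list the system returned; `relevant` is the ground-truth set of relevant chunk IDs (names are illustrative):

```typescript
// Precision@k: of the top-k results, what fraction is relevant?
function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const top = retrieved.slice(0, k);
  return top.filter((id) => relevant.has(id)).length / k;
}

// Recall@k: of all relevant chunks, what fraction appears in the top-k?
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const top = retrieved.slice(0, k);
  return top.filter((id) => relevant.has(id)).length / relevant.size;
}

// Reciprocal rank of the first relevant result; MRR averages this over queries.
function reciprocalRank(retrieved: string[], relevant: Set<string>): number {
  const idx = retrieved.findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}
```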

Layer 2 -- Generation quality

Metric | What It Measures
Faithfulness | Does the answer use only info from context? (no hallucination)
Relevance | Does the answer address the user's question?
Completeness | Does the answer cover all aspects the context supports?

Layer 3 -- End-to-end quality

Metric | What It Measures
Answer correctness | Compare against ground-truth answers
Source accuracy | Do cited sources actually support the claims?
Confidence calibration | Does confidence predict actual correctness?
Failure handling | Does the system refuse when context is insufficient?

RRF formula

RRF score for document d = sum over all ranked lists: 1 / (k + rank(d))

k = 60 (standard constant)

Example:
  Doc appears at vector rank 2, keyword rank 1:
    RRF = 1/(60+2) + 1/(60+1) = 0.0161 + 0.0164 = 0.0325

  Doc appears at vector rank 1 only:
    RRF = 1/(60+1) + 0 = 0.0164

  -> First doc ranks higher (found by BOTH methods)
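
The formula and worked example above can be reproduced in a few lines (the function name is illustrative). `rankedLists` holds one ranked list of document IDs per retrieval method, best first:

```typescript
// Reciprocal Rank Fusion: each list contributes 1/(k + rank) per document,
// so documents found by multiple methods accumulate a higher score.
function rrfMerge(rankedLists: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((doc, idx) => {
      const rank = idx + 1; // ranks are 1-based
      scores.set(doc, (scores.get(doc) ?? 0) + 1 / (k + rank));
    });
  }
  return scores;
}
```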

Common gotchas

Gotcha | What Goes Wrong | Fix
Model mismatch | Ingestion and query use different embedding models | Always use the SAME model constant
No "context only" rule | LLM mixes training data with context | Explicit prompt: "Answer ONLY from CONTEXT"
Chunks too large | Irrelevant text dilutes answers | 200-500 token chunks with overlap
Chunks too small | Fragmented context, incoherent answers | Break at paragraph/sentence boundaries
No overlap | Info lost at chunk boundaries | 10-20% overlap between consecutive chunks
No fallback | System hallucinates when docs lack the answer | "If insufficient context -> confidence: 0, say I don't know"
Stale index | Docs updated but embeddings are old | Automated re-indexing pipeline
Lost in the middle | LLM ignores middle chunks in long contexts | Limit to 5-10 chunks; use sandwich ordering
No source labels | LLM cannot cite specific sources | Label every chunk with [Source N: filename]
No separators | LLM blends info across chunk boundaries | Use --- between chunks
Context after query | LLM starts answering before reading context | Put CONTEXT before the user question
Confidence > 1 | Zod rejects, retry wastes time | Clear confidence guidelines in prompt (0.0-1.0)
Same-topic chunks | 5 chunks say the same thing, wasted context | Use MMR for diversity-aware selection
Exact term search | Vector search misses error codes, IDs, SKUs | Use hybrid search (vector + BM25)
Multi-turn ambiguity | "What about for contractors?" retrieves wrong docs | Contextualize follow-up queries before retrieval

Security checklist

[ ] Prompt injection:       Sanitize input, separate instruction/data channels
[ ] Multi-tenant leakage:   Filter by tenant_id in EVERY vector DB query
[ ] PII in context:         PII detection before injecting into LLM prompts
[ ] Context poisoning:      Validate documents during ingestion
[ ] Knowledge probing:      Generic "no info available" (don't confirm/deny)
[ ] Authoritative answers:  Disclaimers + confidence thresholds + human-in-loop

One-line summary

RAG = Embed query -> Retrieve relevant chunks from vector DB -> Inject into prompt as context -> Generate grounded, structured answer with sources


End of 4.13 quick revision.