Episode 4 — Generative AI Engineering / 4.13 — Building a RAG Pipeline

4.13 — Building a RAG Pipeline: Quick Revision

Compact cheat sheet. Print-friendly.

How to use this material (instructions)

  1. Skim before labs or interviews.
  2. Drill gaps -- reopen README.md -> 4.13.a...4.13.d.
  3. Practice -- 4.13-Exercise-Questions.md.
  4. Polish answers -- 4.13-Interview-Questions.md.

Core vocabulary

Term | One-liner
RAG | Retrieval-Augmented Generation -- combine retrieval (vector DB) with generation (LLM) to answer from real documents
Retrieval | Finding relevant document chunks for a given query using vector similarity search
Chunking | Splitting documents into smaller pieces (~500 tokens) with overlap for embedding and retrieval
Chunk overlap | Repeating 10-20% of text between consecutive chunks so information at boundaries is not lost
Embedding | Converting text into a high-dimensional vector (e.g., 1536 floats) that captures semantic meaning
Top-k | Returning the k most similar chunks from the vector database
Re-ranking | Using a cross-encoder to re-score retrieved chunks for true relevance after initial vector search
Cross-encoder | A model that scores a (query, document) pair together -- slower but more accurate than embeddings
Context injection | Inserting retrieved chunks into the LLM prompt so the model answers from documents, not training data
Hybrid search | Combining vector similarity search (semantic) with BM25 keyword search (lexical)
BM25 | Term-frequency keyword search algorithm (like traditional search engines)
RRF | Reciprocal Rank Fusion -- algorithm to merge ranked lists: score(d) = sum(1/(k + rank(d)))
HyDE | Hypothetical Document Embedding -- embed a hypothetical answer instead of the question for better retrieval
MMR | Maximal Marginal Relevance -- selects chunks that are both relevant AND diverse from each other
Faithfulness | Whether the answer uses only information from retrieved context (no hallucination)
Source attribution | Tracing every claim in the answer back to a specific document and chunk
Confidence score | 0.0-1.0 value indicating how well the context supports the answer
Lost in the middle | LLMs pay less attention to content in the middle of long contexts
Sandwich ordering | Placing the most relevant chunks at the start and end of context (where attention is highest)
Token budget | Allocating context window space across system prompt, context, query, and output
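
The chunking and overlap terms above can be sketched in code. A minimal sketch (function name is illustrative; "tokens" are approximated by whitespace-separated words here, whereas a real pipeline would use the embedding model's tokenizer):

```typescript
// Fixed-size chunking with overlap: consecutive chunks share `overlap` words
// so information at chunk boundaries is not lost.
function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
  if (overlap >= chunkSize) throw new Error("overlap must be < chunkSize");
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = chunkSize - overlap; // advance less than chunkSize to create overlap
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // final chunk reached
  }
  return chunks;
}
```

With the defaults (500-word chunks, 50-word overlap), a 1000-word document yields chunks starting at words 0, 450, and 900.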

RAG pipeline flow

INGESTION (offline, runs once or on schedule)
==============================================

  Documents -----> Clean & Parse -----> Chunk -----> Embed Each -----> Store in
  (PDF, MD,        (remove noise,       (500 tokens,  (text-embedding-   Vector DB
   HTML, TXT)       extract text)        50 overlap)   3-small)          (+ metadata)


QUERY (runtime, per user request)
==============================================

  User Query
      |
      v
  Step 1: Embed query (SAME model as ingestion -- critical!)
      |
      v
  Step 2: Vector DB search (top-k candidates)
      |                     |
      |                     +---> [Optional] Re-rank with cross-encoder
      |                     |
      v                     v
  Step 3: Build prompt (system msg + context + query)
      |
      v
  Step 4: LLM generate (temperature 0, JSON mode)
      |
      v
  Parse JSON -> Zod validate -> Return { answer, confidence, sources }
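
Step 2 of the flow above can be made concrete with a brute-force sketch. Real systems delegate this to a vector DB; the names here are illustrative, and the point is only the cosine-similarity top-k math:

```typescript
// Brute-force top-k vector search over an in-memory store.
type Stored = { id: string; vector: number[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function topK(query: number[], store: Stored[], k = 5): { id: string; score: number }[] {
  return store
    .map((s) => ({ id: s.id, score: cosine(query, s.vector) }))
    .sort((x, y) => y.score - x.score) // highest similarity first
    .slice(0, k);
}
```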

RAG vs fine-tuning

Factor | Fine-Tuning | RAG
Knowledge freshness | Frozen at training time | Updated by re-indexing
Source attribution | Cannot cite sources | Cites exact document + chunk
Cost to update | $100s-$1000s per training run | Pennies per re-index
Hallucination | Model may still hallucinate | Grounded in retrieved docs
Setup complexity | Training data, GPU, pipeline | Vector DB, embedding pipeline
Latency | Single model call | Embed + DB query + model call
Best for | Style, format, jargon | Specific facts, knowledge Q&A

Rule of thumb: Use RAG for knowledge tasks. Use fine-tuning for style/format. Use both when you need domain-specific style AND factual grounding.


Retrieval strategies comparison

Strategy | How It Works | Strengths | Weaknesses
Vector search | Embed query, cosine similarity against stored chunks | Semantic understanding, fast | Misses exact terms (error codes, IDs)
BM25 / Keyword | Term-frequency matching | Exact matches, no embedding needed | No semantic understanding
Hybrid (vector + BM25) | Run both, merge with RRF | Best of both worlds | More complex, two search systems
Two-stage (vector + re-rank) | Broad vector search (top-20), cross-encoder re-rank (top-5) | High precision | Extra latency (~200ms for re-ranking)
HyDE | Embed a hypothetical answer instead of the question | Bridges question-answer gap | Extra LLM call, may introduce bias
MMR selection | Balance relevance and diversity among retrieved chunks | No duplicate/redundant context | Slightly lower max relevance
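
MMR selection from the table above, as a greedy sketch. Assumption: `relevance` scores and a pairwise `similarity` function are precomputed (e.g., from embeddings); names are illustrative:

```typescript
// Maximal Marginal Relevance: greedily pick chunks that are relevant to the
// query but dissimilar to chunks already selected.
type Candidate = { id: string; relevance: number };

function mmrSelect(
  candidates: Candidate[],
  similarity: (a: string, b: string) => number, // similarity between two candidates
  k: number,
  lambda = 0.7 // 1.0 = pure relevance, 0.0 = pure diversity
): string[] {
  const selected: string[] = [];
  const pool = [...candidates];
  while (selected.length < k && pool.length > 0) {
    let bestIdx = 0, bestScore = -Infinity;
    for (let i = 0; i < pool.length; i++) {
      const maxSim = selected.length
        ? Math.max(...selected.map((s) => similarity(pool[i].id, s)))
        : 0; // nothing selected yet: no diversity penalty
      const score = lambda * pool[i].relevance - (1 - lambda) * maxSim;
      if (score > bestScore) { bestScore = score; bestIdx = i; }
    }
    selected.push(pool.splice(bestIdx, 1)[0].id);
  }
  return selected;
}
```

With two near-duplicate high-relevance chunks, MMR keeps one and picks a more diverse chunk instead of the duplicate.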

Top-k guide

k = 1-3    Simple factual lookup
k = 3-7    Standard RAG Q&A (most common, start with k=5)
k = 10-20  Complex, multi-aspect questions
k = 20-50  Broad retrieval for re-ranking pipeline

Too low: miss relevant info, single point of failure. Too high: irrelevant chunks dilute context, "lost in the middle", more tokens/cost.


Score thresholds by use case

Use Case | Min Score | Rationale
Customer support | 0.75 | Better to say "I don't know" than give wrong info
Internal docs | 0.65 | Users can verify; more permissive
Medical / Legal | 0.85 | High stakes; only high-confidence results
Creative / Brainstorming | 0.50 | Loosely related content can still help

Prompt construction patterns

1. System message structure

[Role]           "You are a document assistant for Acme Corp."
[Rules]          "Answer ONLY from the provided CONTEXT."
[Fallback]       "If context is insufficient, say I don't know, confidence 0."
[Output format]  "Return JSON: { answer, confidence, sources, gaps }"
[Context]        "CONTEXT: [Source 1: file.md, Chunk 0] ..."

2. Anti-hallucination layers

Layer 1: "Answer ONLY from the provided CONTEXT. Do NOT use training data."
Layer 2: "If context is insufficient -> confidence: 0, say I don't know."
Layer 3: Require { answer, confidence, sources } -> forces source tracking.
Layer 4: Negative examples: "Do NOT say 'typically' or 'generally'."
Layer 5: Post-generation validation: verify cited sources exist in context.

3. Context injection methods

Method | Best For
System message (recommended) | Most RAG apps -- highest priority, clean separation
User message | Flexible, works with any model
Multi-turn | Natural conversation, but uses more tokens

4. Chunk formatting

[Source 1: employee-handbook.pdf, Chunk 12]
New employees receive 15 PTO days per year.

---

[Source 2: pto-policy.md, Chunk 3]
After 5 years, PTO increases to 20 days.

Always: label chunks with [Source N: filename] + use --- separators.
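
The formatting rule above can be sketched as a small helper (type and function names are illustrative):

```typescript
// Format retrieved chunks into a CONTEXT block with [Source N: filename]
// labels and --- separators between chunks.
type Chunk = { document: string; chunkIndex: number; text: string };

function buildContext(chunks: Chunk[]): string {
  return chunks
    .map((c, i) => `[Source ${i + 1}: ${c.document}, Chunk ${c.chunkIndex}]\n${c.text}`)
    .join("\n\n---\n\n");
}
```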

5. Chunk ordering strategies

Strategy | When to Use
Relevance first | Default for most RAG apps
Sandwich | Long contexts -- place best chunks at start AND end
Chronological | Policy changes, versioned docs, audit trails
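
Sandwich ordering can be sketched as follows: given chunks already sorted best-first, alternate them between the front and back of the context so the strongest chunks sit where attention is highest (the function name is illustrative):

```typescript
// Sandwich ordering: rank 1 goes first, rank 2 goes last, rank 3 second,
// rank 4 second-to-last, and so on -- the weakest chunks end up in the middle.
function sandwichOrder<T>(sortedBestFirst: T[]): T[] {
  const front: T[] = [];
  const back: T[] = [];
  sortedBestFirst.forEach((chunk, i) => {
    if (i % 2 === 0) front.push(chunk); // ranks 1, 3, 5... fill the start
    else back.unshift(chunk);           // ranks 2, 4, 6... fill the end
  });
  return [...front, ...back];
}
```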

Document QA system architecture

INGESTION                              QUERY
=========                              =====
Load docs (PDF, MD, TXT)              User query
    |                                      |
    v                                      v
Clean & parse                          Embed query (same model!)
    |                                      |
    v                                      v
Chunk (500 tok, 50 overlap)            Vector DB search (top-k)
    |                                      |
    v                                      v
Embed each chunk                       [Optional] Re-rank
    |                                      |
    v                                      v
Store in Vector DB                     Build prompt
(vector + metadata)                        |
                                           v
                                       LLM generate (temp 0, JSON)
                                           |
                                           v
                                       Parse JSON
                                           |
                                           v
                                       Zod validate
                                           |
                                           v
                                       { answer, confidence, sources }

Zod schema

import { z } from "zod";

const RagAnswerSchema = z.object({
  answer:     z.string().min(1),              // non-empty answer text
  confidence: z.number().min(0).max(1),       // 0.0-1.0 support score
  sources:    z.array(z.object({
    document:  z.string(),                    // source filename
    chunk:     z.number().int().min(0),       // chunk index within the document
    relevance: z.number().min(0).max(1),
  })),
  gaps:       z.array(z.string()).optional(), // info the context lacked, if any
});

Error handling chain

Embedding fails    -> retry with backoff, fallback error response
Empty retrieval    -> return confidence: 0, "I don't have that info"
LLM rate limited   -> retry with exponential backoff (1s, 2s, 4s)
Invalid JSON       -> extract JSON from text, retry LLM call
Zod rejects        -> retry (up to 2x), then safe fallback response
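
The "retry with exponential backoff" step can be sketched as a generic wrapper. The `sleep` parameter is injectable so tests can run without real delays (names are illustrative):

```typescript
// Retry an async operation with exponential backoff: 1s, 2s, 4s by default.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms))
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Back off before the next attempt: baseDelay * 2^attempt
      if (attempt < maxRetries) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError; // all attempts exhausted
}
```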

Token budget calculation

GPT-4o context window:                    128,000 tokens
  - System prompt (instructions, rules):    1,200
  - Max output reservation:                 4,000
  - Safety margin:                            500
  - User query (typical):                      80
                                          --------
Available for RAG context:               122,220 tokens

With 500-token chunks: ~244 could fit
In practice: 5-10 high-quality chunks is optimal (2,500-5,000 tokens)

Key insight: Quality of context beats quantity. 5 perfect chunks > 50 marginal chunks.
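
The budget arithmetic above, as a tiny helper (names are illustrative):

```typescript
// Subtract fixed reservations from the context window to get the space
// available for RAG context chunks.
function contextBudget(opts: {
  contextWindow: number;
  systemPrompt: number;
  maxOutput: number;
  safetyMargin: number;
  userQuery: number;
}): number {
  const { contextWindow, systemPrompt, maxOutput, safetyMargin, userQuery } = opts;
  return contextWindow - systemPrompt - maxOutput - safetyMargin - userQuery;
}
```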


Evaluation metrics

Layer 1 -- Retrieval quality

Metric | What It Measures | Target
Precision@k | Fraction of retrieved chunks that are relevant | > 0.6
Recall@k | Fraction of all relevant chunks that were retrieved | > 0.7
MRR | How high is the first relevant result? | > 0.7
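
The Layer 1 metrics can be sketched as follows. `retrieved` is the ranked list the system returned; `relevant` is the ground-truth set of relevant chunk IDs (names are illustrative):

```typescript
// Precision@k: of the top-k results, what fraction is relevant?
function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const top = retrieved.slice(0, k);
  return top.filter((id) => relevant.has(id)).length / k;
}

// Recall@k: of all relevant chunks, what fraction appears in the top-k?
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const top = retrieved.slice(0, k);
  return top.filter((id) => relevant.has(id)).length / relevant.size;
}

// Reciprocal rank of the first relevant result; MRR averages this over queries.
function reciprocalRank(retrieved: string[], relevant: Set<string>): number {
  const idx = retrieved.findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}
```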

Layer 2 -- Generation quality

Metric | What It Measures
Faithfulness | Does the answer use only info from context? (no hallucination)
Relevance | Does the answer address the user's question?
Completeness | Does the answer cover all aspects the context supports?

Layer 3 -- End-to-end quality

Metric | What It Measures
Answer correctness | Compare against ground-truth answers
Source accuracy | Do cited sources actually support the claims?
Confidence calibration | Does confidence predict actual correctness?
Failure handling | Does the system refuse when context is insufficient?

RRF formula

RRF score for document d = sum over all ranked lists: 1 / (k + rank(d))

k = 60 (standard constant)

Example:
  Doc appears at vector rank 2, keyword rank 1:
    RRF = 1/(60+2) + 1/(60+1) = 0.0161 + 0.0164 = 0.0325

  Doc appears at vector rank 1 only:
    RRF = 1/(60+1) + 0 = 0.0164

  -> First doc ranks higher (found by BOTH methods)
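
The formula and worked example above can be reproduced in a few lines (the function name is illustrative). `rankedLists` holds one ranked list of document IDs per retrieval method, best first:

```typescript
// Reciprocal Rank Fusion: each list contributes 1/(k + rank) per document,
// so documents found by multiple methods accumulate a higher score.
function rrfMerge(rankedLists: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((doc, idx) => {
      const rank = idx + 1; // ranks are 1-based
      scores.set(doc, (scores.get(doc) ?? 0) + 1 / (k + rank));
    });
  }
  return scores;
}
```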

Common gotchas

Gotcha | What Goes Wrong | Fix
Model mismatch | Ingestion and query use different embedding models | Always use the SAME model constant
No "context only" rule | LLM mixes training data with context | Explicit prompt: "Answer ONLY from CONTEXT"
Chunks too large | Irrelevant text dilutes answers | 200-500 token chunks with overlap
Chunks too small | Fragmented context, incoherent answers | Break at paragraph/sentence boundaries
No overlap | Info lost at chunk boundaries | 10-20% overlap between consecutive chunks
No fallback | System hallucinates when docs lack the answer | "If insufficient context -> confidence: 0, say I don't know"
Stale index | Docs updated but embeddings are old | Automated re-indexing pipeline
Lost in the middle | LLM ignores middle chunks in long contexts | Limit to 5-10 chunks; use sandwich ordering
No source labels | LLM cannot cite specific sources | Label every chunk with [Source N: filename]
No separators | LLM blends info across chunk boundaries | Use --- between chunks
Context after query | LLM starts answering before reading context | Put CONTEXT before the user question
Confidence > 1 | Zod rejects, retry wastes time | Clear confidence guidelines in prompt (0.0-1.0)
Same-topic chunks | 5 chunks say the same thing, wasted context | Use MMR for diversity-aware selection
Exact term search | Vector search misses error codes, IDs, SKUs | Use hybrid search (vector + BM25)
Multi-turn ambiguity | "What about for contractors?" retrieves wrong docs | Contextualize follow-up queries before retrieval

Security checklist

[ ] Prompt injection:       Sanitize input, separate instruction/data channels
[ ] Multi-tenant leakage:   Filter by tenant_id in EVERY vector DB query
[ ] PII in context:         PII detection before injecting into LLM prompts
[ ] Context poisoning:      Validate documents during ingestion
[ ] Knowledge probing:      Generic "no info available" (don't confirm/deny)
[ ] Authoritative answers:  Disclaimers + confidence thresholds + human-in-loop

One-line summary

RAG = Embed query -> Retrieve relevant chunks from vector DB -> Inject into prompt as context -> Generate grounded, structured answer with sources


End of 4.13 quick revision.