Episode 4 — Generative AI Engineering / 4.13 — Building a RAG Pipeline
4.13.a — RAG Workflow
In one sentence: Retrieval-Augmented Generation (RAG) is a 4-step pipeline — receive a user query, retrieve relevant document chunks from a vector database, inject them into the prompt as context, and generate a structured answer — giving the LLM access to up-to-date, domain-specific knowledge without fine-tuning.
Navigation: <- 4.13 Overview | 4.13.b — Retrieval Strategies ->
1. What Is RAG?
Retrieval-Augmented Generation (RAG) is an architecture pattern that combines two capabilities:
- Retrieval — finding relevant information from a knowledge base (documents, databases, APIs)
- Generation — using an LLM to synthesize a natural-language answer from that information
The core insight: instead of hoping the LLM "knows" the answer from its training data, you give it the answer in the prompt and ask it to synthesize and format a response.
Traditional LLM:
User asks question -> LLM answers from training data -> May hallucinate
RAG:
User asks question -> System retrieves relevant documents ->
Documents injected into prompt -> LLM answers from documents ->
Answer is grounded in real sources
Why "Retrieval-Augmented"?
The term was introduced in a 2020 paper by Lewis et al. (Facebook AI Research). The key idea:
- Retrieval — the system retrieves external knowledge at inference time
- Augmented — this retrieved knowledge augments (enhances) the prompt
- Generation — the LLM generates the final answer using the augmented prompt
RAG is not a model architecture — it is a system architecture. You can build RAG with any LLM (GPT-4o, Claude, Llama, Gemini) and any retrieval system (Pinecone, Chroma, pgvector, Elasticsearch).
2. The 4-Step RAG Pipeline
Every RAG system follows the same fundamental workflow:
┌────────────────────────────────────────────────────────────────────────────┐
│ THE 4 STEPS OF RAG │
│ │
│ Step 1: USER QUERY │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ "What is our company's policy on remote work?" │ │
│ └───────────────────────────────┬─────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Step 2: RETRIEVE RELEVANT CHUNKS │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Embed query -> Vector similarity search -> Top-k chunks │ │
│ │ │ │
│ │ [Chunk 1: "Remote work policy updated Jan 2025..."] score: 0.94 │ │
│ │ [Chunk 2: "Employees may work remotely up to..."] score: 0.91 │ │
│ │ [Chunk 3: "Remote work equipment allowance..."] score: 0.87 │ │
│ └───────────────────────────────┬─────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Step 3: INJECT INTO PROMPT AS CONTEXT │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ System: "Answer ONLY based on the provided context..." │ │
│ │ Context: [Chunk 1] [Chunk 2] [Chunk 3] │ │
│ │ User: "What is our company's policy on remote work?" │ │
│ └───────────────────────────────┬─────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Step 4: GENERATE STRUCTURED ANSWER │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ { │ │
│ │ "answer": "Employees may work remotely up to 3 days per week...",│ │
│ │ "confidence": 0.92, │ │
│ │ "sources": ["remote-policy.md#chunk-1", "hr-handbook.md#chunk-7"]│ │
│ │ } │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────────────┘
Step 1: User Query
The user submits a natural-language question. This could come from a chat interface, an API endpoint, a search bar, or an automated system.
const userQuery = "What is our company's policy on remote work?";
Step 2: Retrieve Relevant Chunks
The query is embedded into a vector and compared against all document chunks in the vector database. The most similar chunks (top-k) are returned.
import OpenAI from 'openai';
const openai = new OpenAI();
// Embed the user query
const embeddingResponse = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: userQuery,
});
const queryVector = embeddingResponse.data[0].embedding;
// Search the vector database for similar chunks
// (vectorDB is your vector store client — e.g., a Pinecone index or Chroma collection)
const results = await vectorDB.query({
vector: queryVector,
topK: 5,
includeMetadata: true,
});
// results.matches contains the most relevant document chunks
Step 3: Inject Into Prompt as Context
The retrieved chunks are formatted and inserted into the LLM prompt alongside instructions for how to use them.
const contextChunks = results.matches.map((match, i) =>
`[Source ${i + 1}: ${match.metadata.filename}, Chunk ${match.metadata.chunkIndex}]\n${match.metadata.text}`
).join('\n\n');
const messages = [
{
role: 'system',
content: `You are a helpful assistant. Answer the user's question based ONLY on the provided context. If the context doesn't contain enough information, say "I don't have enough information to answer that."
Return JSON: { "answer": "...", "confidence": 0.0-1.0, "sources": ["..."] }
CONTEXT:
${contextChunks}`
},
{
role: 'user',
content: userQuery,
}
];
Step 4: Generate Structured Answer
The LLM processes the prompt (which now includes the relevant documents) and generates a grounded answer.
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0,
messages,
response_format: { type: 'json_object' },
});
const answer = JSON.parse(response.choices[0].message.content);
// { answer: "...", confidence: 0.92, sources: [...] }
3. Why RAG Over Fine-Tuning?
This is one of the most important architectural decisions in AI engineering. Here is a detailed comparison:
Fine-tuning
Fine-tuning modifies the model's weights by training on your custom data. The knowledge becomes part of the model itself.
Your Data -> Training Process -> Custom Model Weights
User Query -> Custom Model -> Answer (from learned weights)
RAG
RAG keeps the model unchanged and provides relevant data at inference time through the prompt.
Your Data -> Chunk + Embed + Store (once)
User Query -> Retrieve Chunks -> Inject into Prompt -> Standard Model -> Answer
Head-to-head comparison
| Factor | Fine-Tuning | RAG |
|---|---|---|
| Knowledge freshness | Frozen at training time | Updated by re-indexing documents |
| Source attribution | Cannot cite sources (knowledge is in weights) | Can cite exact document and chunk |
| Cost to update | Expensive re-training ($100s-$1000s) | Re-embed changed documents (pennies) |
| Hallucination control | Model may still hallucinate | Grounded in retrieved documents |
| Setup complexity | Need training data, GPU, training pipeline | Need vector DB, embedding pipeline |
| Latency | Single model call | Embedding + DB query + model call |
| Data privacy | Data baked into model weights | Data stays in your database |
| Best for | Teaching model a new style or format | Giving model access to specific facts |
When to use each
USE RAG WHEN:
- You need answers grounded in specific documents
- Information changes frequently (policies, docs, products)
- Source attribution is required
- You need to control exactly what knowledge the model accesses
- Data privacy requires keeping documents in your infrastructure
USE FINE-TUNING WHEN:
- You need the model to adopt a specific writing style
- You need the model to follow complex output formats consistently
- You have a very specialized vocabulary or domain jargon
- Latency is critical and you can't afford retrieval overhead
USE BOTH (RAG + Fine-Tuned Model) WHEN:
- You need domain-specific style AND factual grounding
- Example: a medical chatbot with clinical writing style AND patient records
The overwhelming trend in production
Most production AI systems use RAG, not fine-tuning, for knowledge tasks. The reasons:
- Models keep getting better — base models like GPT-4o and Claude 4 are already excellent at following instructions and formatting output. Fine-tuning for format is increasingly unnecessary.
- Knowledge changes — fine-tuned knowledge is frozen. RAG knowledge is updated by re-indexing.
- Accountability — RAG provides audit trails. You can trace every answer to its source document.
- Cost — fine-tuning costs hundreds to thousands of dollars per training run. RAG costs pennies per query.
4. RAG Architecture — Deep Dive
A production RAG system has two distinct pipelines:
Ingestion Pipeline (Offline)
The ingestion pipeline runs once (or on a schedule) to prepare your documents for retrieval.
┌───────────────────────────────────────────────────────────────────────┐
│ INGESTION PIPELINE │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
│ │ Load │──►│ Clean │──►│ Chunk │──►│ Embed Each │ │
│ │ Documents│ │ & Parse │ │ (split │ │ Chunk │ │
│ │ (PDF, │ │ (remove │ │ into │ │ (text -> │ │
│ │ MD, │ │ noise, │ │ 500- │ │ 1536-dim │ │
│ │ HTML, │ │ extract│ │ 1000 │ │ vector) │ │
│ │ TXT) │ │ text) │ │ token │ │ │ │
│ │ │ │ │ │ pieces) │ │ │ │
│ └──────────┘ └──────────┘ └──────────┘ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Store in │ │
│ │ Vector DB │ │
│ │ (vector + │ │
│ │ metadata: │ │
│ │ filename, │ │
│ │ chunk index, │ │
│ │ original text) │ │
│ └──────────────────┘ │
└───────────────────────────────────────────────────────────────────────┘
// Ingestion pipeline — conceptual code
async function ingestDocuments(documents) {
const chunks = [];
for (const doc of documents) {
// 1. Load and clean
const text = await loadDocument(doc.path); // PDF -> text, HTML -> text, etc.
const cleanText = cleanDocument(text); // Remove headers, footers, noise
// 2. Chunk
const docChunks = chunkText(cleanText, {
chunkSize: 500, // ~500 tokens per chunk
chunkOverlap: 50, // 50-token overlap for continuity
});
// 3. Embed each chunk
const embeddings = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: docChunks.map(c => c.text),
});
// 4. Prepare for storage
docChunks.forEach((chunk, i) => {
chunks.push({
id: `${doc.id}-chunk-${i}`,
values: embeddings.data[i].embedding,
metadata: {
text: chunk.text,
filename: doc.filename,
chunkIndex: i,
totalChunks: docChunks.length,
source: doc.path,
},
});
});
}
// 5. Store in vector DB
await vectorDB.upsert(chunks);
console.log(`Ingested ${chunks.length} chunks from ${documents.length} documents`);
}
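The ingestion code above calls a `chunkText` helper that is not part of any library. A minimal sketch of what it might look like — splitting on whitespace as a rough stand-in for tokens, which is an assumption; a real pipeline would count tokens with a tokenizer such as tiktoken:

```javascript
// Minimal chunkText sketch. Words stand in for tokens here — a real
// pipeline would measure chunkSize/chunkOverlap with an actual tokenizer.
function chunkText(text, { chunkSize = 500, chunkOverlap = 50 } = {}) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  const step = chunkSize - chunkOverlap; // advance by size minus overlap
  for (let start = 0; start < words.length; start += step) {
    const piece = words.slice(start, start + chunkSize);
    chunks.push({ text: piece.join(' ') });
    if (start + chunkSize >= words.length) break; // final chunk reached
  }
  return chunks;
}
```

The key design point is the sliding window: each chunk starts `chunkSize - chunkOverlap` words after the previous one, so the last `chunkOverlap` words of one chunk reappear at the start of the next — which is what prevents a sentence from being lost across a chunk boundary.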
Query Pipeline (Runtime)
The query pipeline runs for every user question.
┌───────────────────────────────────────────────────────────────────────┐
│ QUERY PIPELINE │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
│ │ User │──►│ Embed │──►│ Retrieve │──►│ Build Prompt │ │
│ │ Query │ │ Query │ │ Top-k │ │ (system msg + │ │
│ │ │ │ (same │ │ Chunks │ │ context + │ │
│ │ │ │ model │ │ from │ │ user query) │ │
│ │ │ │ as │ │ Vector │ │ │ │
│ │ │ │ ingest)│ │ DB │ │ │ │
│ └──────────┘ └──────────┘ └──────────┘ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ LLM Generate │ │
│ │ + Validate │ │
│ │ (Zod schema) │ │
│ │ │ │
│ │ -> { answer, │ │
│ │ confidence, │ │
│ │ sources[] } │ │
│ └──────────────────┘ │
└───────────────────────────────────────────────────────────────────────┘
// Query pipeline — conceptual code
async function queryRAG(userQuery) {
// 1. Embed the query (SAME model as ingestion — critical!)
const queryEmbedding = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: userQuery,
});
// 2. Retrieve top-k chunks
const results = await vectorDB.query({
vector: queryEmbedding.data[0].embedding,
topK: 5,
includeMetadata: true,
});
// 3. Build the prompt
const context = results.matches
.map((m, i) => `[Source ${i + 1}: ${m.metadata.filename}]\n${m.metadata.text}`)
.join('\n\n---\n\n');
// 4. Generate answer
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0,
messages: [
{
role: 'system',
content: `Answer the user's question based ONLY on the provided context.
Return JSON: { "answer": "...", "confidence": 0.0-1.0, "sources": ["..."] }
If the context doesn't answer the question, set confidence to 0 and answer "I don't have enough information."
CONTEXT:
${context}`,
},
{ role: 'user', content: userQuery },
],
response_format: { type: 'json_object' },
});
return JSON.parse(response.choices[0].message.content);
}
5. Advantages of RAG
Advantage 1: Up-to-date information
LLMs have a knowledge cutoff — they only know what was in their training data. RAG solves this by retrieving current documents at query time.
Without RAG:
"What is our Q3 2025 revenue?"
-> LLM: "I don't have access to that information" (or worse, hallucinates a figure)
With RAG:
"What is our Q3 2025 revenue?"
-> Retrieve: quarterly-report-Q3-2025.pdf, chunk 4
-> LLM: "Q3 2025 revenue was $4.2M, a 15% increase over Q2." (from document)
Advantage 2: Source attribution
RAG naturally provides source tracking — you know exactly which document and chunk the answer came from.
// Every answer includes traceable sources
{
answer: "The maximum PTO carryover is 5 days per year.",
confidence: 0.95,
sources: [
{ document: "employee-handbook-2025.pdf", chunk: 42, page: 18 },
{ document: "pto-policy-update.md", chunk: 3, page: 1 }
]
}
// Users can click to verify. Auditors can trace every claim.
Advantage 3: Reduced hallucination
When the LLM is instructed to answer ONLY from the provided context, it has much less opportunity to hallucinate.
Without RAG:
LLM draws from billions of parameters -> any plausible-sounding text can emerge
With RAG:
LLM draws from 3-5 specific document chunks -> constrained to real content
+ "If the context doesn't answer, say I don't know" -> graceful failure
Advantage 4: No model training required
RAG works with any off-the-shelf LLM. No training data preparation, no GPU time, no model management.
Advantage 5: Data governance and access control
Documents stay in your infrastructure. You can control which users access which documents by filtering vector DB queries.
// Role-based document access in RAG
const results = await vectorDB.query({
vector: queryEmbedding,
topK: 5,
filter: {
department: user.department, // Only their department's docs
classification: { $ne: 'top-secret' } // Exclude classified docs
},
});
6. RAG vs Other Patterns
┌───────────────────────────────────────────────────────────────────┐
│ KNOWLEDGE PATTERNS COMPARED │
│ │
│ Prompt Stuffing: Manually paste docs into prompt │
│ + Simple │
│ - Doesn't scale, manual work │
│ │
│ Fine-Tuning: Train model on your data │
│ + Fast inference │
│ - Expensive, frozen knowledge, no sources │
│ │
│ RAG: Retrieve docs dynamically per query │
│ + Fresh data, source attribution │
│ - Extra latency, retrieval quality matters │
│ │
│ RAG + Fine-Tuning: Fine-tune for style, RAG for facts │
│ + Best of both worlds │
│ - Most complex to build and maintain │
│ │
│ Tool Use / Agents: Model calls APIs to get real-time data │
│ + Real-time data, multi-step reasoning │
│ - Complex, unpredictable, harder to debug │
└───────────────────────────────────────────────────────────────────┘
7. Complete Conceptual Walkthrough
Let's trace a single query through the entire RAG pipeline to solidify the concepts.
Scenario: A company has an internal documentation chatbot. An employee asks: "How many vacation days do new employees get?"
Phase 1: Ingestion (already completed)
The company's HR documents were previously ingested:
employee-handbook.pdf (85 pages) -> 340 chunks -> 340 embeddings stored
pto-policy.md (3 pages) -> 12 chunks -> 12 embeddings stored
onboarding-guide.md (5 pages) -> 20 chunks -> 20 embeddings stored
Total: 372 chunks in the vector database, each with its embedding vector and metadata.
Phase 2: Query processing
Step 1 — Query arrives:
"How many vacation days do new employees get?"
Step 2 — Embed the query:
text-embedding-3-small("How many vacation days do new employees get?")
-> [0.023, -0.041, 0.089, ...] (1536-dimensional vector)
Step 3 — Vector similarity search:
Compare query vector against all 372 chunk vectors
Cosine similarity scores:
pto-policy.md#chunk-2: 0.94 "New employees receive 15 PTO days..."
employee-handbook.pdf#chunk-87: 0.91 "Vacation accrual begins on start date..."
employee-handbook.pdf#chunk-89: 0.88 "After 5 years, PTO increases to 20 days..."
onboarding-guide.md#chunk-5: 0.82 "During onboarding, discuss PTO policy..."
pto-policy.md#chunk-8: 0.79 "PTO requests must be submitted 2 weeks..."
Top 3 chunks retrieved (top-k = 3)
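The similarity scores in Step 3 come from cosine similarity between the query vector and each stored chunk vector. A minimal sketch of that computation over plain number arrays — real vector databases do this internally (often over normalized vectors, where cosine reduces to a dot product):

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|). Higher means the two
// embeddings point in more similar directions.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every stored chunk against the query and keep the top k
function topK(queryVector, chunks, k = 3) {
  return chunks
    .map(c => ({ ...c, score: cosineSimilarity(queryVector, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

This brute-force scan works for 372 chunks; at larger scale, vector databases use approximate nearest-neighbor indexes (e.g., HNSW) to avoid comparing against every vector.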
Step 4 — Build the prompt:
System: "You are an HR assistant. Answer based ONLY on the provided context..."
Context: [chunk-2 text] [chunk-87 text] [chunk-89 text]
User: "How many vacation days do new employees get?"
Step 5 — LLM generates:
{
"answer": "New employees receive 15 PTO (vacation) days per year.
Vacation accrual begins on the employee's start date.
After 5 years of service, PTO increases to 20 days per year.",
"confidence": 0.95,
"sources": [
"pto-policy.md (chunk 2)",
"employee-handbook.pdf (chunk 87)",
"employee-handbook.pdf (chunk 89)"
]
}
Step 6 — Validate and return:
Zod schema validates the JSON structure -> pass
Confidence > threshold (0.7) -> show to user
Sources included -> user can click to verify
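Step 6's validation gate can be sketched without any library. The lesson uses Zod, which would express the same checks declaratively as a schema; this hand-rolled version is just a stand-in to show what is being checked (the 0.7 threshold is the example value from above):

```javascript
const CONFIDENCE_THRESHOLD = 0.7; // below this, don't show the answer

// Hand-rolled stand-in for a Zod schema check on the LLM's JSON output
function validateRagAnswer(parsed) {
  const shapeOk =
    typeof parsed === 'object' && parsed !== null &&
    typeof parsed.answer === 'string' &&
    typeof parsed.confidence === 'number' &&
    parsed.confidence >= 0 && parsed.confidence <= 1 &&
    Array.isArray(parsed.sources) &&
    parsed.sources.every(s => typeof s === 'string');
  if (!shapeOk) return { ok: false, reason: 'schema' };
  if (parsed.confidence < CONFIDENCE_THRESHOLD) {
    return { ok: false, reason: 'low-confidence' };
  }
  return { ok: true, value: parsed };
}
```

Validating before returning matters because LLM output is untrusted input: even with `response_format: { type: 'json_object' }`, the model can omit fields or return a confidence outside 0-1.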
Why this works
- The answer is directly grounded in real documents
- The employee can verify by clicking the source links
- If the documents change (PTO policy updated to 20 days), re-index and the answer automatically updates
- If the question is about something NOT in the documents, the system says "I don't have enough information" instead of hallucinating
8. Common RAG Pitfalls
| Pitfall | What Goes Wrong | How to Avoid |
|---|---|---|
| Wrong embedding model | Query embedded with different model than documents | ALWAYS use the same embedding model for ingestion and query |
| Chunks too large | Retrieved chunk has too much irrelevant text, dilutes the answer | Use 200-500 token chunks with overlap |
| Chunks too small | Context is fragmented, LLM can't form coherent answer | Ensure chunks are semantically complete paragraphs |
| No overlap | Important information split across chunk boundary | Use 10-20% overlap between consecutive chunks |
| Too few results (k too low) | Missing relevant information | Start with k=5, measure recall, increase if needed |
| Too many results (k too high) | Irrelevant chunks dilute context, "lost in the middle" | Re-rank results and use only top relevant chunks |
| No "I don't know" | System hallucinates when docs don't have the answer | Explicit instruction + confidence score + threshold |
| Stale index | Documents updated but embeddings not re-generated | Build re-indexing pipeline triggered on document changes |
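The two k-related pitfalls can be mitigated with a simple post-retrieval filter: retrieve generously, then keep only chunks above a score floor, capped at a maximum. This is a sketch — production systems often use a dedicated re-ranker model instead, and the `minScore` value is an assumption you would tune per embedding model:

```javascript
// Keep only matches above a minimum similarity score, up to maxChunks.
// Retrieving with a generous topK and filtering here guards against both
// "k too low" (missing info) and "k too high" (diluted context).
function filterMatches(matches, { minScore = 0.75, maxChunks = 5 } = {}) {
  return matches
    .filter(m => m.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, maxChunks);
}
```

Applied to the walkthrough scores above, a 0.85 floor would keep the three PTO-relevant chunks and drop the two weaker matches before they reach the prompt.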
9. Key Takeaways
- RAG is a 4-step pipeline: query -> retrieve -> inject context -> generate. Each step has its own engineering challenges.
- RAG beats fine-tuning for knowledge tasks because it provides fresh data, source attribution, and cheaper updates.
- Two pipelines: ingestion (offline, prepares documents) and query (runtime, serves answers).
- Source attribution is a first-class feature — every answer should trace back to its source documents.
- "Answer ONLY from context" is the critical instruction that prevents the LLM from using its training data.
- The quality of retrieval directly determines the quality of the answer — garbage in, garbage out.
Explain-It Challenge
- Your manager asks "why can't we just fine-tune the model on our docs instead of building this retrieval thing?" — explain the trade-offs.
- A colleague says "RAG is just prompt stuffing with extra steps." — explain the key difference that makes RAG scalable.
- Walk through what happens if the vector database returns irrelevant chunks — how does that affect the final answer?
Navigation: <- 4.13 Overview | 4.13.b — Retrieval Strategies ->