Episode 4 — Generative AI Engineering / 4.13 — Building a RAG Pipeline

4.13.a — RAG Workflow

In one sentence: Retrieval-Augmented Generation (RAG) is a 4-step pipeline — receive a user query, retrieve relevant document chunks from a vector database, inject them into the prompt as context, and generate a structured answer — giving the LLM access to up-to-date, domain-specific knowledge without fine-tuning.

Navigation: <- 4.13 Overview | 4.13.b — Retrieval Strategies ->


1. What Is RAG?

Retrieval-Augmented Generation (RAG) is an architecture pattern that combines two capabilities:

  1. Retrieval — finding relevant information from a knowledge base (documents, databases, APIs)
  2. Generation — using an LLM to synthesize a natural-language answer from that information

The core insight: instead of hoping the LLM "knows" the answer from its training data, you give it the answer in the prompt and ask it to synthesize and format a response.

Traditional LLM:
  User asks question -> LLM answers from training data -> May hallucinate

RAG:
  User asks question -> System retrieves relevant documents -> 
  Documents injected into prompt -> LLM answers from documents -> 
  Answer is grounded in real sources

Why "Retrieval-Augmented"?

The term was introduced in a 2020 paper by Lewis et al. (Facebook AI Research). The key idea:

  • Retrieval — the system retrieves external knowledge at inference time
  • Augmented — this retrieved knowledge augments (enhances) the prompt
  • Generation — the LLM generates the final answer using the augmented prompt

RAG is not a model architecture — it is a system architecture. You can build RAG with any LLM (GPT-4o, Claude, Llama, Gemini) and any retrieval system (Pinecone, Chroma, pgvector, Elasticsearch).


2. The 4-Step RAG Pipeline

Every RAG system follows the same fundamental workflow:

┌────────────────────────────────────────────────────────────────────────────┐
│                         THE 4 STEPS OF RAG                                  │
│                                                                            │
│  Step 1: USER QUERY                                                        │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  "What is our company's policy on remote work?"                     │   │
│  └───────────────────────────────┬─────────────────────────────────────┘   │
│                                  │                                         │
│                                  ▼                                         │
│  Step 2: RETRIEVE RELEVANT CHUNKS                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  Embed query -> Vector similarity search -> Top-k chunks            │   │
│  │                                                                     │   │
│  │  [Chunk 1: "Remote work policy updated Jan 2025..."]  score: 0.94  │   │
│  │  [Chunk 2: "Employees may work remotely up to..."]    score: 0.91  │   │
│  │  [Chunk 3: "Remote work equipment allowance..."]      score: 0.87  │   │
│  └───────────────────────────────┬─────────────────────────────────────┘   │
│                                  │                                         │
│                                  ▼                                         │
│  Step 3: INJECT INTO PROMPT AS CONTEXT                                     │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  System: "Answer ONLY based on the provided context..."             │   │
│  │  Context: [Chunk 1] [Chunk 2] [Chunk 3]                            │   │
│  │  User: "What is our company's policy on remote work?"               │   │
│  └───────────────────────────────┬─────────────────────────────────────┘   │
│                                  │                                         │
│                                  ▼                                         │
│  Step 4: GENERATE STRUCTURED ANSWER                                        │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  {                                                                  │   │
│  │    "answer": "Employees may work remotely up to 3 days per week...",│   │
│  │    "confidence": 0.92,                                              │   │
│  │    "sources": ["remote-policy.md#chunk-1", "hr-handbook.md#chunk-7"]│   │
│  │  }                                                                  │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────────────────────────────┘

Step 1: User Query

The user submits a natural-language question. This could come from a chat interface, an API endpoint, a search bar, or an automated system.

const userQuery = "What is our company's policy on remote work?";

Step 2: Retrieve Relevant Chunks

The query is embedded into a vector and compared against the stored chunk embeddings in the vector database. The most similar chunks (top-k) are returned.

import OpenAI from 'openai';

const openai = new OpenAI();

// Embed the user query
const embeddingResponse = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: userQuery,
});
const queryVector = embeddingResponse.data[0].embedding;

// Search the vector database for similar chunks
const results = await vectorDB.query({
  vector: queryVector,
  topK: 5,
  includeMetadata: true,
});

// results.matches contains the most relevant document chunks

Step 3: Inject Into Prompt as Context

The retrieved chunks are formatted and inserted into the LLM prompt alongside instructions for how to use them.

const contextChunks = results.matches.map((match, i) => 
  `[Source ${i + 1}: ${match.metadata.filename}, Chunk ${match.metadata.chunkIndex}]\n${match.metadata.text}`
).join('\n\n');

const messages = [
  {
    role: 'system',
    content: `You are a helpful assistant. Answer the user's question based ONLY on the provided context. If the context doesn't contain enough information, say "I don't have enough information to answer that."

CONTEXT:
${contextChunks}`
  },
  {
    role: 'user',
    content: userQuery,
  }
];

Step 4: Generate Structured Answer

The LLM processes the prompt (which now includes the relevant documents) and generates a grounded answer.

const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  temperature: 0,
  messages,
  response_format: { type: 'json_object' },
});

const answer = JSON.parse(response.choices[0].message.content);
// { answer: "...", confidence: 0.92, sources: [...] }

3. Why RAG Over Fine-Tuning?

This is one of the most important architectural decisions in AI engineering. Here is a detailed comparison:

Fine-tuning

Fine-tuning modifies the model's weights by training on your custom data. The knowledge becomes part of the model itself.

Your Data -> Training Process -> Custom Model Weights
User Query -> Custom Model -> Answer (from learned weights)

RAG

RAG keeps the model unchanged and provides relevant data at inference time through the prompt.

Your Data -> Chunk + Embed + Store (once)
User Query -> Retrieve Chunks -> Inject into Prompt -> Standard Model -> Answer

Head-to-head comparison

Factor                 | Fine-Tuning                                   | RAG
-----------------------|-----------------------------------------------|---------------------------------------
Knowledge freshness    | Frozen at training time                       | Updated by re-indexing documents
Source attribution     | Cannot cite sources (knowledge is in weights) | Can cite exact document and chunk
Cost to update         | Expensive re-training ($100s-$1000s)          | Re-embed changed documents (pennies)
Hallucination control  | Model may still hallucinate                   | Grounded in retrieved documents
Setup complexity       | Need training data, GPU, training pipeline    | Need vector DB, embedding pipeline
Latency                | Single model call                             | Embedding + DB query + model call
Data privacy           | Data baked into model weights                 | Data stays in your database
Best for               | Teaching model a new style or format          | Giving model access to specific facts

When to use each

USE RAG WHEN:
  - You need answers grounded in specific documents
  - Information changes frequently (policies, docs, products)
  - Source attribution is required
  - You need to control exactly what knowledge the model accesses
  - Data privacy requires keeping documents in your infrastructure

USE FINE-TUNING WHEN:
  - You need the model to adopt a specific writing style
  - You need the model to follow complex output formats consistently
  - You have a very specialized vocabulary or domain jargon
  - Latency is critical and you can't afford retrieval overhead

USE BOTH (RAG + Fine-Tuned Model) WHEN:
  - You need domain-specific style AND factual grounding
  - Example: a medical chatbot with clinical writing style AND patient records

The overwhelming trend in production

Most production AI systems use RAG, not fine-tuning, for knowledge tasks. The reasons:

  1. Models keep getting better — base models like GPT-4o and Claude 4 are already excellent at following instructions and formatting output. Fine-tuning for format is increasingly unnecessary.
  2. Knowledge changes — fine-tuned knowledge is frozen. RAG knowledge is updated by re-indexing.
  3. Accountability — RAG provides audit trails. You can trace every answer to its source document.
  4. Cost — fine-tuning costs hundreds to thousands of dollars per training run. RAG costs pennies per query.

4. RAG Architecture — Deep Dive

A production RAG system has two distinct pipelines:

Ingestion Pipeline (Offline)

The ingestion pipeline runs once (or on a schedule) to prepare your documents for retrieval.

┌───────────────────────────────────────────────────────────────────────┐
│                       INGESTION PIPELINE                               │
│                                                                       │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────────────┐  │
│  │  Load     │──►│  Clean   │──►│  Chunk   │──►│  Embed Each      │  │
│  │  Documents│   │  & Parse │   │  (split  │   │  Chunk           │  │
│  │  (PDF,    │   │  (remove │   │   into   │   │  (text ->        │  │
│  │   MD,     │   │   noise, │   │   500-   │   │   1536-dim       │  │
│  │   HTML,   │   │   extract│   │   1000   │   │   vector)        │  │
│  │   TXT)    │   │   text)  │   │   token  │   │                  │  │
│  │           │   │          │   │   pieces) │   │                  │  │
│  └──────────┘   └──────────┘   └──────────┘   └────────┬─────────┘  │
│                                                         │            │
│                                                         ▼            │
│                                                ┌──────────────────┐  │
│                                                │  Store in        │  │
│                                                │  Vector DB       │  │
│                                                │  (vector +       │  │
│                                                │   metadata:      │  │
│                                                │   filename,      │  │
│                                                │   chunk index,   │  │
│                                                │   original text) │  │
│                                                └──────────────────┘  │
└───────────────────────────────────────────────────────────────────────┘
// Ingestion pipeline — conceptual code
async function ingestDocuments(documents) {
  const chunks = [];

  for (const doc of documents) {
    // 1. Load and clean
    const text = await loadDocument(doc.path);    // PDF -> text, HTML -> text, etc.
    const cleanText = cleanDocument(text);         // Remove headers, footers, noise

    // 2. Chunk
    const docChunks = chunkText(cleanText, {
      chunkSize: 500,        // ~500 tokens per chunk
      chunkOverlap: 50,      // 50-token overlap for continuity
    });

    // 3. Embed each chunk
    const embeddings = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: docChunks.map(c => c.text),
    });

    // 4. Prepare for storage
    docChunks.forEach((chunk, i) => {
      chunks.push({
        id: `${doc.id}-chunk-${i}`,
        values: embeddings.data[i].embedding,
        metadata: {
          text: chunk.text,
          filename: doc.filename,
          chunkIndex: i,
          totalChunks: docChunks.length,
          source: doc.path,
        },
      });
    });
  }

  // 5. Store in vector DB
  await vectorDB.upsert(chunks);
  console.log(`Ingested ${chunks.length} chunks from ${documents.length} documents`);
}
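
The conceptual code above relies on a hypothetical chunkText helper. Here is a minimal sketch that approximates token counts with whitespace-split words — a real chunker would use an actual tokenizer (e.g. tiktoken) and respect paragraph boundaries, so treat this as illustrative only:

```javascript
// Minimal sketch of the hypothetical chunkText helper used above.
// Words stand in for tokens; chunkSize and chunkOverlap are counted in words.
function chunkText(text, { chunkSize = 500, chunkOverlap = 50 } = {}) {
  if (chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be smaller than chunkSize');
  }
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  const step = chunkSize - chunkOverlap; // advance by size minus overlap
  for (let start = 0; start < words.length; start += step) {
    chunks.push({ text: words.slice(start, start + chunkSize).join(' ') });
    if (start + chunkSize >= words.length) break; // final window reached
  }
  return chunks;
}
```

Each chunk repeats the last chunkOverlap words of its predecessor, which is what prevents a sentence from being lost across a chunk boundary.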

Query Pipeline (Runtime)

The query pipeline runs for every user question.

┌───────────────────────────────────────────────────────────────────────┐
│                        QUERY PIPELINE                                  │
│                                                                       │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────────────┐  │
│  │  User     │──►│  Embed   │──►│  Retrieve │──►│  Build Prompt    │  │
│  │  Query    │   │  Query   │   │  Top-k   │   │  (system msg +   │  │
│  │           │   │  (same   │   │  Chunks  │   │   context +      │  │
│  │           │   │   model  │   │  from    │   │   user query)    │  │
│  │           │   │   as     │   │  Vector  │   │                  │  │
│  │           │   │   ingest)│   │  DB      │   │                  │  │
│  └──────────┘   └──────────┘   └──────────┘   └────────┬─────────┘  │
│                                                         │            │
│                                                         ▼            │
│                                                ┌──────────────────┐  │
│                                                │  LLM Generate    │  │
│                                                │  + Validate      │  │
│                                                │  (Zod schema)    │  │
│                                                │                  │  │
│                                                │  -> { answer,    │  │
│                                                │     confidence,  │  │
│                                                │     sources[] }  │  │
│                                                └──────────────────┘  │
└───────────────────────────────────────────────────────────────────────┘
// Query pipeline — conceptual code
async function queryRAG(userQuery) {
  // 1. Embed the query (SAME model as ingestion — critical!)
  const queryEmbedding = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: userQuery,
  });

  // 2. Retrieve top-k chunks
  const results = await vectorDB.query({
    vector: queryEmbedding.data[0].embedding,
    topK: 5,
    includeMetadata: true,
  });

  // 3. Build the prompt
  const context = results.matches
    .map((m, i) => `[Source ${i + 1}: ${m.metadata.filename}]\n${m.metadata.text}`)
    .join('\n\n---\n\n');

  // 4. Generate answer
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0,
    messages: [
      {
        role: 'system',
        content: `Answer the user's question based ONLY on the provided context.
Return JSON: { "answer": "...", "confidence": 0.0-1.0, "sources": ["..."] }
If the context doesn't answer the question, set confidence to 0 and answer "I don't have enough information."

CONTEXT:
${context}`,
      },
      { role: 'user', content: userQuery },
    ],
    response_format: { type: 'json_object' },
  });

  return JSON.parse(response.choices[0].message.content);
}

5. Advantages of RAG

Advantage 1: Up-to-date information

LLMs have a knowledge cutoff — they only know what was in their training data. RAG solves this by retrieving current documents at query time.

Without RAG:
  "What is our Q3 2025 revenue?"
  -> LLM: "I don't have access to that information" (or worse, hallucinates one)

With RAG:
  "What is our Q3 2025 revenue?"
  -> Retrieve: quarterly-report-Q3-2025.pdf, chunk 4
  -> LLM: "Q3 2025 revenue was $4.2M, a 15% increase over Q2." (from document)

Advantage 2: Source attribution

RAG naturally provides source tracking — you know exactly which document and chunk the answer came from.

// Every answer includes traceable sources
{
  answer: "The maximum PTO carryover is 5 days per year.",
  confidence: 0.95,
  sources: [
    { document: "employee-handbook-2025.pdf", chunk: 42, page: 18 },
    { document: "pto-policy-update.md", chunk: 3, page: 1 }
  ]
}

// Users can click to verify. Auditors can trace every claim.

Advantage 3: Reduced hallucination

When the LLM is instructed to answer ONLY from the provided context, it has much less opportunity to hallucinate.

Without RAG:
  LLM draws from billions of parameters -> any plausible-sounding text can emerge

With RAG:
  LLM draws from 3-5 specific document chunks -> constrained to real content
  + "If the context doesn't answer, say I don't know" -> graceful failure

Advantage 4: No model training required

RAG works with any off-the-shelf LLM. No training data preparation, no GPU time, no model management.

Advantage 5: Data governance and access control

Documents stay in your infrastructure. You can control which users access which documents by filtering vector DB queries.

// Role-based document access in RAG
const results = await vectorDB.query({
  vector: queryEmbedding,
  topK: 5,
  filter: {
    department: user.department,         // Only their department's docs
    classification: { $ne: 'top-secret' } // Exclude classified docs
  },
});

6. RAG vs Other Patterns

┌───────────────────────────────────────────────────────────────────┐
│                   KNOWLEDGE PATTERNS COMPARED                      │
│                                                                   │
│  Prompt Stuffing:    Manually paste docs into prompt              │
│                      + Simple                                     │
│                      - Doesn't scale, manual work                 │
│                                                                   │
│  Fine-Tuning:        Train model on your data                     │
│                      + Fast inference                             │
│                      - Expensive, frozen knowledge, no sources    │
│                                                                   │
│  RAG:                Retrieve docs dynamically per query          │
│                      + Fresh data, source attribution             │
│                      - Extra latency, retrieval quality matters   │
│                                                                   │
│  RAG + Fine-Tuning:  Fine-tune for style, RAG for facts          │
│                      + Best of both worlds                        │
│                      - Most complex to build and maintain         │
│                                                                   │
│  Tool Use / Agents:  Model calls APIs to get real-time data      │
│                      + Real-time data, multi-step reasoning       │
│                      - Complex, unpredictable, harder to debug    │
└───────────────────────────────────────────────────────────────────┘

7. Complete Conceptual Walkthrough

Let's trace a single query through the entire RAG pipeline to solidify the concepts.

Scenario: A company has an internal documentation chatbot. An employee asks: "How many vacation days do new employees get?"

Phase 1: Ingestion (already completed)

The company's HR documents were previously ingested:

employee-handbook.pdf (85 pages) -> 340 chunks -> 340 embeddings stored
pto-policy.md (3 pages) -> 12 chunks -> 12 embeddings stored
onboarding-guide.md (5 pages) -> 20 chunks -> 20 embeddings stored

Total: 372 chunks in the vector database, each with its embedding vector and metadata.

Phase 2: Query processing

Step 1 — Query arrives:
  "How many vacation days do new employees get?"

Step 2 — Embed the query:
  text-embedding-3-small("How many vacation days do new employees get?")
  -> [0.023, -0.041, 0.089, ...] (1536-dimensional vector)

Step 3 — Vector similarity search:
  Compare query vector against all 372 chunk vectors
  Cosine similarity scores:
    pto-policy.md#chunk-2:          0.94  "New employees receive 15 PTO days..."
    employee-handbook.pdf#chunk-87: 0.91  "Vacation accrual begins on start date..."
    employee-handbook.pdf#chunk-89: 0.88  "After 5 years, PTO increases to 20 days..."
    onboarding-guide.md#chunk-5:    0.82  "During onboarding, discuss PTO policy..."
    pto-policy.md#chunk-8:          0.79  "PTO requests must be submitted 2 weeks..."

  Top 3 chunks retrieved (top-k = 3)
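
The scores in Step 3 are cosine similarities between the query vector and each stored chunk vector. A minimal sketch of the computation:

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|).
// 1.0 means the vectors point in the same direction; 0 means unrelated.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

In practice the vector database computes this (often via an approximate nearest-neighbor index) rather than scanning every chunk, but the ranking it produces is the same.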

Step 4 — Build the prompt:
  System: "You are an HR assistant. Answer based ONLY on the provided context..."
  Context: [chunk-2 text] [chunk-87 text] [chunk-89 text]
  User: "How many vacation days do new employees get?"

Step 5 — LLM generates:
  {
    "answer": "New employees receive 15 PTO (vacation) days per year. 
               Vacation accrual begins on the employee's start date. 
               After 5 years of service, PTO increases to 20 days per year.",
    "confidence": 0.95,
    "sources": [
      "pto-policy.md (chunk 2)",
      "employee-handbook.pdf (chunk 87)",
      "employee-handbook.pdf (chunk 89)"
    ]
  }

Step 6 — Validate and return:
  Zod schema validates the JSON structure -> pass
  Confidence > threshold (0.7) -> show to user
  Sources included -> user can click to verify
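
The Step 6 checks can be sketched as follows. The walkthrough uses a Zod schema; this dependency-free version has the same shape, and the 0.7 threshold comes from the walkthrough (validateRagAnswer is an illustrative name, not a library function):

```javascript
// Structural + confidence validation for the step-5 answer object.
const CONFIDENCE_THRESHOLD = 0.7; // threshold from step 6

function validateRagAnswer(raw) {
  const structureOk =
    raw !== null && typeof raw === 'object' &&
    typeof raw.answer === 'string' && raw.answer.length > 0 &&
    typeof raw.confidence === 'number' &&
    raw.confidence >= 0 && raw.confidence <= 1 &&
    Array.isArray(raw.sources);
  if (!structureOk) return { ok: false, reason: 'invalid-structure' };
  if (raw.confidence < CONFIDENCE_THRESHOLD) {
    return { ok: false, reason: 'low-confidence' };
  }
  return { ok: true, data: raw };
}
```

A failed check should route to a fallback ("I don't have enough information...") rather than showing an unvalidated answer to the user.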

Why this works

  • The answer is directly grounded in real documents
  • The employee can verify by clicking the source links
  • If the documents change (PTO policy updated to 20 days), re-index and the answer automatically updates
  • If the question is about something NOT in the documents, the system says "I don't have enough information" instead of hallucinating

8. Common RAG Pitfalls

Pitfall                       | What Goes Wrong                                                  | How to Avoid
------------------------------|------------------------------------------------------------------|------------------------------------------------------------
Wrong embedding model         | Query embedded with a different model than the documents         | ALWAYS use the same embedding model for ingestion and query
Chunks too large              | Retrieved chunk has too much irrelevant text, dilutes the answer | Use 200-500 token chunks with overlap
Chunks too small              | Context is fragmented, LLM can't form a coherent answer          | Ensure chunks are semantically complete paragraphs
No overlap                    | Important information split across a chunk boundary              | Use 10-20% overlap between consecutive chunks
Too few results (k too low)   | Missing relevant information                                     | Start with k=5, measure recall, increase if needed
Too many results (k too high) | Irrelevant chunks dilute context, "lost in the middle"           | Re-rank results and use only the top relevant chunks
No "I don't know"             | System hallucinates when docs don't have the answer              | Explicit instruction + confidence score + threshold
Stale index                   | Documents updated but embeddings not re-generated                | Build a re-indexing pipeline triggered on document changes
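
A common mitigation for the "k too high" pitfall is to trim retrieved matches by score before building the prompt. A minimal sketch — selectContext is an illustrative name, and minScore/maxChunks are starting-point values to tune against your own data, not fixed rules:

```javascript
// Keep only strong matches, best first, capped at maxChunks,
// so weak chunks never dilute the prompt context.
function selectContext(matches, { minScore = 0.75, maxChunks = 3 } = {}) {
  return matches
    .filter(m => m.score >= minScore)   // drop weakly related chunks
    .sort((a, b) => b.score - a.score)  // highest similarity first
    .slice(0, maxChunks);               // cap context size
}
```

Production systems often replace the score filter with a dedicated re-ranking model, but even this simple cut-off guards against the "lost in the middle" effect.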

9. Key Takeaways

  1. RAG is a 4-step pipeline: query -> retrieve -> inject context -> generate. Each step has its own engineering challenges.
  2. RAG beats fine-tuning for knowledge tasks because it provides fresh data, source attribution, and cheaper updates.
  3. Two pipelines: ingestion (offline, prepares documents) and query (runtime, serves answers).
  4. Source attribution is a first-class feature — every answer should trace back to its source documents.
  5. "Answer ONLY from context" is the critical instruction that prevents the LLM from using its training data.
  6. The quality of retrieval directly determines the quality of the answer — garbage in, garbage out.

Explain-It Challenge

  1. Your manager asks "why can't we just fine-tune the model on our docs instead of building this retrieval thing?" — explain the trade-offs.
  2. A colleague says "RAG is just prompt stuffing with extra steps." — explain the key difference that makes RAG scalable.
  3. Walk through what happens if the vector database returns irrelevant chunks — how does that affect the final answer?

Navigation: <- 4.13 Overview | 4.13.b — Retrieval Strategies ->