Episode 4 — Generative AI Engineering / 4.13 — Building a RAG Pipeline
4.13.c — Prompt Construction for RAG
In one sentence: The way you inject retrieved context into the LLM prompt determines whether the model answers from your documents or from its training data — proper prompt construction includes system message structure, "answer ONLY from context" instructions, chunk ordering, source attribution, and explicit anti-hallucination guardrails.
Navigation: <- 4.13.b Retrieval Strategies | 4.13.d — Building Document QA System ->
1. Why Prompt Construction Matters in RAG
Retrieval gives you the right documents. Prompt construction tells the LLM how to use them. A bad prompt can make the LLM ignore the context entirely and answer from training data, defeating the entire purpose of RAG.
┌─────────────────────────────────────────────────────────────────────┐
│ SAME CONTEXT, DIFFERENT PROMPTS, DIFFERENT RESULTS │
│ │
│ Context: "Our refund policy is 14 days with receipt." │
│ │
│ Bad prompt: "What is the refund policy?" │
│ -> LLM may use training data: "Most companies offer 30 days..." │
│ │
│ Good prompt: "Based ONLY on the following context, answer │
│ the question. Context: [Our refund policy is 14 days with │
│ receipt.] Question: What is the refund policy?" │
│ -> LLM uses context: "The refund policy is 14 days with receipt." │
│ │
│ LESSON: The prompt must FORCE the LLM to use the context. │
└─────────────────────────────────────────────────────────────────────┘
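The "good prompt" pattern in the box can be captured as a tiny helper. A minimal sketch — the function name and parameters are illustrative, not part of any library:

```javascript
// Wrap a question so the model is instructed to answer only from the context.
function buildGroundedPrompt(context, question) {
  return `Based ONLY on the following context, answer the question.
If the context does not contain the answer, say so.

Context: ${context}

Question: ${question}`;
}

const prompt = buildGroundedPrompt(
  'Our refund policy is 14 days with receipt.',
  'What is the refund policy?'
);
```

The rest of this section expands this one-liner into full system messages with rules, output formats, and citations — but the core move is the same: instructions first, context second, question last.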
2. System Message Structure for RAG
The system message is the foundation of your RAG prompt. It must accomplish four things:
- Define the assistant's role
- Establish the "answer from context only" rule
- Define the output format
- Provide instructions for edge cases
Basic RAG system message
const systemMessage = `You are a helpful document assistant for Acme Corp.
RULES:
1. Answer the user's question based ONLY on the provided CONTEXT below.
2. If the CONTEXT does not contain enough information to answer the question, respond with: "I don't have enough information in my knowledge base to answer that question."
3. Do NOT use any knowledge from your training data. Only use the CONTEXT.
4. Always cite which source(s) your answer comes from.
5. If the answer spans multiple sources, synthesize the information and cite all relevant sources.
OUTPUT FORMAT:
Return a JSON object with this exact structure:
{
"answer": "Your answer here, written in clear natural language",
"confidence": 0.0 to 1.0 (how confident you are based on the context quality),
"sources": ["source1.md (chunk N)", "source2.pdf (chunk M)"]
}
CONFIDENCE GUIDELINES:
- 1.0: The context directly and completely answers the question
- 0.7-0.9: The context mostly answers the question with minor gaps
- 0.4-0.6: The context partially addresses the question
- 0.1-0.3: The context is tangentially related
- 0.0: The context does not address the question at all
CONTEXT:
{context_placeholder}`;
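On the application side, the confidence field is only useful if the caller acts on it. A sketch of one way to route the parsed JSON response — the thresholds mirror the guidelines above, and `FALLBACK_MESSAGE` is an illustrative choice, not a standard:

```javascript
// Sketch: route the parsed JSON response using the confidence guidelines above.
const FALLBACK_MESSAGE =
  "I don't have enough information in my knowledge base to answer that question.";

function routeAnswer(response) {
  if (response.confidence >= 0.7) {
    // Context directly or mostly answers the question: show as-is.
    return { display: response.answer, sources: response.sources };
  }
  if (response.confidence >= 0.4) {
    // Partial answer: show it, but flag possible gaps.
    return {
      display: `${response.answer}\n\n(Note: this answer may be incomplete.)`,
      sources: response.sources,
    };
  }
  // Tangential or no coverage: fall back rather than risk a hallucination.
  return { display: FALLBACK_MESSAGE, sources: [] };
}

const routed = routeAnswer({
  answer: 'New employees receive 15 PTO days [Source 1].',
  confidence: 0.9,
  sources: ['employee-handbook.pdf'],
});
```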
Advanced system message with role-specific behavior
function buildSystemMessage(config) {
const {
role = 'document assistant',
company = 'our organization',
contextChunks = [],
allowPartialAnswers = true,
requireSources = true,
outputFormat = 'json',
} = config;
const formattedContext = contextChunks.map((chunk, i) =>
`--- SOURCE ${i + 1}: ${chunk.metadata.filename} (Chunk ${chunk.metadata.chunkIndex}) ---
${chunk.text}
--- END SOURCE ${i + 1} ---`
).join('\n\n');
return `You are a ${role} for ${company}.
CORE RULES:
1. Answer ONLY from the provided CONTEXT sections below. Do NOT use your training knowledge.
2. ${allowPartialAnswers
? 'If the context only partially answers the question, provide what you can and note what is missing.'
: 'If the context does not fully answer the question, say "I don\'t have enough information."'}
3. ${requireSources
? 'ALWAYS cite sources using the format [Source N] where N matches the source number below.'
: 'Provide a direct answer without source citations.'}
4. Never fabricate information, statistics, dates, or claims not present in the context.
5. If multiple sources provide conflicting information, note the conflict and cite both sources.
${outputFormat === 'json' ? `OUTPUT FORMAT:
Return valid JSON:
{
"answer": "string - natural language answer with [Source N] citations",
"confidence": "number - 0.0 to 1.0",
"sources": ["array of source references used"]
}` : 'Respond in clear, natural language with inline source citations.'}
CONTEXT:
${formattedContext}
Remember: You are a RETRIEVAL system. Your job is to find and present information FROM the context, not to generate new knowledge.`;
}
3. Injecting Context Into the Prompt
There are several ways to structure the context within the prompt. Each has trade-offs.
Method 1: Context in system message (recommended)
Place the context in the system message, with the user message containing only the question.
const messages = [
{
role: 'system',
content: `You are a document assistant. Answer ONLY from the provided context.
CONTEXT:
[Source 1: employee-handbook.pdf, Chunk 12]
New employees receive 15 paid time off (PTO) days per calendar year.
PTO accrues monthly at a rate of 1.25 days per month.
[Source 2: pto-policy.md, Chunk 3]
After 5 years of continuous employment, PTO increases to 20 days per year.
PTO requests must be submitted at least 2 weeks in advance.
Return JSON: { "answer": "...", "confidence": 0.0-1.0, "sources": [...] }`,
},
{
role: 'user',
content: 'How many vacation days do new employees get?',
},
];
Why this works: The system message has the highest priority in most LLMs. Context placed here is less likely to be ignored. The user message stays clean and simple.
Method 2: Context in user message
Place everything in the user message. Useful when you don't want a persistent system message.
const messages = [
{
role: 'system',
content: 'You are a document assistant. Answer ONLY from the provided context. Return JSON: { "answer": "...", "confidence": 0.0-1.0, "sources": [...] }',
},
{
role: 'user',
content: `CONTEXT:
[Source 1: employee-handbook.pdf, Chunk 12]
New employees receive 15 paid time off (PTO) days per calendar year.
[Source 2: pto-policy.md, Chunk 3]
After 5 years of continuous employment, PTO increases to 20 days per year.
QUESTION:
How many vacation days do new employees get?`,
},
];
Method 3: Context as a separate message (multi-turn style)
Use a dedicated message role for context. Some teams use a "fake" assistant message or multiple user messages.
const messages = [
{
role: 'system',
content: 'You are a document assistant. Answer ONLY from the context provided in the conversation. Return JSON: { "answer": "...", "confidence": 0.0-1.0, "sources": [...] }',
},
{
role: 'user',
content: `Here are the relevant documents for my question:
[Source 1: employee-handbook.pdf, Chunk 12]
New employees receive 15 paid time off (PTO) days per calendar year.
[Source 2: pto-policy.md, Chunk 3]
After 5 years of continuous employment, PTO increases to 20 days per year.`,
},
{
role: 'assistant',
content: 'I have reviewed the provided documents. Please ask your question.',
},
{
role: 'user',
content: 'How many vacation days do new employees get?',
},
];
Comparison
| Method | Pros | Cons |
|---|---|---|
| System message | Highest priority, clean user message | Long system messages can be expensive per turn |
| User message | Flexible, works with any model | Context may be treated as user input (safety filters) |
| Multi-turn | Natural conversation flow | More tokens (extra messages), can confuse some models |
Recommendation: Use the system message method for most RAG applications. It provides the clearest separation of instructions, context, and query.
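The three methods can be expressed as one factory, which makes it easy to A/B test them against your own model. A sketch, assuming the OpenAI-style chat message format used throughout this section:

```javascript
// Sketch: build a messages array for each of the three injection methods.
function buildMessages(method, context, question) {
  const rules = 'You are a document assistant. Answer ONLY from the provided context.';
  switch (method) {
    case 'system': // Method 1: context lives in the system message
      return [
        { role: 'system', content: `${rules}\n\nCONTEXT:\n${context}` },
        { role: 'user', content: question },
      ];
    case 'user': // Method 2: context and question share the user message
      return [
        { role: 'system', content: rules },
        { role: 'user', content: `CONTEXT:\n${context}\n\nQUESTION:\n${question}` },
      ];
    case 'multi-turn': // Method 3: context as its own conversation turn
      return [
        { role: 'system', content: rules },
        { role: 'user', content: `Here are the relevant documents:\n${context}` },
        { role: 'assistant', content: 'I have reviewed the documents. Please ask your question.' },
        { role: 'user', content: question },
      ];
    default:
      throw new Error(`Unknown method: ${method}`);
  }
}
```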
4. Handling Multiple Chunks
When you have multiple retrieved chunks, how you format and present them matters significantly.
Formatting chunks with clear boundaries
function formatChunksForPrompt(chunks) {
return chunks.map((chunk, index) => {
const sourceLabel = `[Source ${index + 1}: ${chunk.metadata.filename}` +
(chunk.metadata.page ? `, Page ${chunk.metadata.page}` : '') +
(chunk.metadata.chunkIndex !== undefined ? `, Chunk ${chunk.metadata.chunkIndex}` : '') +
`]`;
return `${sourceLabel}
${chunk.text.trim()}`;
}).join('\n\n---\n\n');
}
// Result:
// [Source 1: employee-handbook.pdf, Page 18, Chunk 12]
// New employees receive 15 paid time off (PTO) days per calendar year.
// PTO accrues monthly at a rate of 1.25 days per month.
//
// ---
//
// [Source 2: pto-policy.md, Chunk 3]
// After 5 years of continuous employment, PTO increases to 20 days per year.
// PTO requests must be submitted at least 2 weeks in advance.
Why separators matter
Without clear separators, the LLM may blend information from adjacent chunks:
BAD (no separators):
"New employees receive 15 PTO days per year. After 5 years, PTO
increases to 20 days. The cafeteria is open from 7am to 3pm."
-> LLM might say: "New employees get 15-20 PTO days and can eat
at the cafeteria until 3pm" (blended, confused)
GOOD (with separators):
[Source 1: pto-policy.md]
New employees receive 15 PTO days per year.
---
[Source 2: pto-policy.md]
After 5 years, PTO increases to 20 days.
---
[Source 3: facilities-guide.md]
The cafeteria is open from 7am to 3pm.
-> LLM correctly separates the information
5. Chunk Ordering — Most Relevant First vs Last
Research on LLM attention patterns shows the "lost in the middle" effect: models pay more attention to information at the beginning and end of the context, with reduced attention to the middle.
Strategy 1: Most relevant first (recommended for most cases)
function orderChunksRelevanceFirst(chunks) {
// Already sorted by relevance score (highest first)
return chunks;
}
// Context layout:
// [Most relevant chunk] <- LLM pays strong attention (beginning)
// [2nd most relevant]
// [3rd most relevant] <- LLM pays weaker attention (middle)
// [4th most relevant]
// [Least relevant chunk] <- LLM pays moderate attention (end)
Strategy 2: Sandwich pattern (relevant at start AND end)
function orderChunksSandwich(chunks) {
  if (chunks.length <= 2) return chunks;
  const sorted = [...chunks].sort((a, b) => b.score - a.score);
  // Place the two highest-scoring chunks at the boundaries, where
  // attention is strongest; everything else goes in the middle.
  const [top, secondBest, ...middle] = sorted;
  return [top, ...middle, secondBest];
}
// Context layout:
// [MOST relevant] <- Strong attention (beginning)
// [3rd most relevant]
// [4th most relevant] <- Weaker attention (middle)
// [5th most relevant]
// [2nd MOST relevant] <- Strong attention (end)
Strategy 3: Chronological (for time-sensitive documents)
function orderChunksChronological(chunks) {
return [...chunks].sort((a, b) => {
const dateA = new Date(a.metadata.date || 0);
const dateB = new Date(b.metadata.date || 0);
return dateA - dateB; // Oldest first, newest last (most recent gets end-of-context attention)
});
}
When to use each strategy
| Strategy | Best For |
|---|---|
| Relevance first | Most RAG applications, factual Q&A |
| Sandwich | Long contexts where middle content might be lost |
| Chronological | Policy changes, versioned documents, audit trails |
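To see the three strategies side by side, here is a sketch using toy chunks; the `score` and `metadata.date` fields follow the shapes used in the functions above:

```javascript
// Toy chunks, already sorted by relevance score (highest first).
const chunks = [
  { text: 'A', score: 0.9, metadata: { date: '2024-03-01' } },
  { text: 'B', score: 0.8, metadata: { date: '2021-06-15' } },
  { text: 'C', score: 0.5, metadata: { date: '2023-01-10' } },
  { text: 'D', score: 0.3, metadata: { date: '2022-11-05' } },
];

// Relevance first: leave the retriever's order untouched.
const relevanceFirst = chunks.map(c => c.text); // ['A', 'B', 'C', 'D']

// Sandwich: best chunk at the start, second-best at the end.
const sorted = [...chunks].sort((a, b) => b.score - a.score);
const sandwich = [sorted[0], ...sorted.slice(2), sorted[1]].map(c => c.text); // ['A', 'C', 'D', 'B']

// Chronological: oldest first, so the newest chunk gets end-of-context attention.
const chronological = [...chunks]
  .sort((a, b) => new Date(a.metadata.date) - new Date(b.metadata.date))
  .map(c => c.text); // ['B', 'D', 'C', 'A']
```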
6. Source Attribution in the Prompt
Source attribution is a core feature of RAG. The prompt must instruct the LLM how to cite sources.
Explicit citation instructions
const systemMessage = `You are a document assistant.
CITATION RULES:
- Cite every factual claim using [Source N] format matching the source numbers below.
- If a single sentence uses information from multiple sources, cite all of them: [Source 1][Source 3].
- Place citations at the end of the relevant sentence, before the period.
- If you cannot find a source for a claim, do NOT include that claim.
EXAMPLE:
Context includes [Source 1] about PTO and [Source 2] about benefits.
Good: "New employees receive 15 PTO days [Source 1] and full health benefits [Source 2]."
Bad: "New employees receive 15 PTO days and full health benefits." (no citations)
Bad: "New employees typically receive 10-15 PTO days." (information not in context)
CONTEXT:
${formattedContext}
Return JSON:
{
"answer": "Answer with [Source N] inline citations",
"confidence": 0.0-1.0,
"sources": [
{ "id": 1, "document": "filename", "chunk": N, "used": true/false }
]
}`;
Validating source attribution
After generation, verify that the LLM actually cited real sources:
function validateSources(answer, availableSources) {
// Extract cited source numbers from the answer
const citedPattern = /\[Source (\d+)\]/g;
const citedNumbers = [];
let match;
while ((match = citedPattern.exec(answer.answer)) !== null) {
citedNumbers.push(parseInt(match[1]));
}
const uniqueCited = [...new Set(citedNumbers)];
// Check: are all cited sources real?
const invalidCitations = uniqueCited.filter(n => n > availableSources.length || n < 1);
if (invalidCitations.length > 0) {
console.warn(`Invalid source citations: ${invalidCitations.join(', ')}`);
}
// Check: did the answer cite at least one source?
if (uniqueCited.length === 0 && answer.confidence > 0) {
console.warn('Answer has confidence > 0 but no source citations');
}
return {
citedSources: uniqueCited,
invalidCitations,
hasCitations: uniqueCited.length > 0,
allCitationsValid: invalidCitations.length === 0,
};
}
7. Preventing the Model from Using Training Data
This is the most critical aspect of RAG prompt engineering. Without explicit instructions, the LLM will freely mix its training knowledge with the provided context — destroying the reliability and traceability that RAG provides.
The problem
Context: "Our company was founded in 2019."
Question: "When was the company founded and who founded it?"
Without guardrails:
"The company was founded in 2019 by John Smith."
-> "John Smith" came from training data (or was hallucinated), not the context!
With guardrails:
"The company was founded in 2019 [Source 1]. The provided documents
do not mention who founded the company."
-> Correctly limits to context and flags the gap.
Anti-hallucination prompt patterns
// Pattern 1: Strict "context only" instruction
const strict = `CRITICAL RULE: You must answer ONLY using information explicitly
stated in the CONTEXT below. If the context does not contain the information
needed to answer the question, you MUST say "This information is not available
in the provided documents." Do NOT use any knowledge from your training data.`;
// Pattern 2: Explicit "what to do when you don't know"
const withFallback = `RULES:
- Answer from the CONTEXT below.
- If the context does not contain the answer: set confidence to 0 and answer
"I don't have this information in my knowledge base."
- If the context partially answers the question: answer what you can, list
what's missing, and set confidence between 0.3-0.6.
- NEVER guess, infer beyond what's stated, or fill gaps with general knowledge.`;
// Pattern 3: Verification step
const withVerification = `After writing your answer, verify each claim:
- For each sentence in your answer, identify which SOURCE it came from.
- If a sentence cannot be traced to a specific source, REMOVE it.
- Include only claims that are directly supported by the context.`;
// Pattern 4: Negative examples (show what NOT to do)
const withNegativeExamples = `ANTI-PATTERNS (do NOT do these):
- Do NOT add context or background information not in the provided documents.
- Do NOT say "typically" or "generally" — only state what the documents say.
- Do NOT complete partial information with assumptions.
- Do NOT answer questions about topics not covered in the context.
Example of what NOT to do:
Context: "Meeting room A seats 10 people."
Question: "How many people does meeting room B seat?"
WRONG: "Meeting room B typically seats 8-12 people."
CORRECT: "The provided documents only mention meeting room A (seats 10).
No information about meeting room B is available."`;
Combining patterns for maximum safety
function buildSafeRAGPrompt(context, options = {}) {
const { strictness = 'high' } = options;
const strictnessRules = {
low: 'Prefer information from the context, but you may supplement with general knowledge if clearly labeled.',
medium: 'Answer primarily from the context. If you use any external knowledge, explicitly mark it as "[External knowledge, not from documents]".',
high: 'Answer ONLY from the provided context. Do NOT use any training knowledge. If the context is insufficient, say so.',
};
return `You are a document question-answering system.
${strictnessRules[strictness]}
CONTEXT:
${context}
OUTPUT FORMAT (JSON):
{
"answer": "Answer with inline [Source N] citations",
"confidence": 0.0-1.0,
"sources": ["list of sources used"],
"gaps": ["list of information the user asked about that was NOT found in context"]
}
The "gaps" field is critical — it tells the user what they need to find elsewhere.`;
}
8. Token Budget Management in RAG Prompts
The system prompt, context, and user query all share the same context window. You need to budget tokens carefully.
function buildPromptWithBudget(userQuery, chunks, options = {}) {
const {
modelContextWindow = 128000, // GPT-4o
maxOutputTokens = 4000,
systemPromptTokens = 800, // Measured, not estimated
safetyMargin = 500,
} = options;
// Calculate available tokens for context
const queryTokens = estimateTokens(userQuery);
const availableForContext = modelContextWindow
- maxOutputTokens
- systemPromptTokens
- queryTokens
- safetyMargin;
console.log(`Token budget: ${availableForContext} tokens for context`);
// Greedily fill context with chunks (most relevant first)
const selectedChunks = [];
let usedTokens = 0;
for (const chunk of chunks) {
const chunkTokens = estimateTokens(chunk.text);
if (usedTokens + chunkTokens > availableForContext) {
// If the chunk doesn't fit, try truncating it
const remainingTokens = availableForContext - usedTokens;
if (remainingTokens > 100) { // Only if meaningful space remains
const truncatedText = chunk.text.slice(0, remainingTokens * 4); // ~4 chars per token
selectedChunks.push({ ...chunk, text: truncatedText, truncated: true });
}
break;
}
selectedChunks.push(chunk);
usedTokens += chunkTokens;
}
console.log(`Selected ${selectedChunks.length}/${chunks.length} chunks (${usedTokens} tokens)`);
return {
selectedChunks,
usedTokens,
availableTokens: availableForContext,
utilization: (usedTokens / availableForContext * 100).toFixed(1) + '%',
};
}
function estimateTokens(text) {
return Math.ceil(text.length / 4);
}
Token budget breakdown example
Model: GPT-4o (128K context window)
Max output: 4,000 tokens
System prompt: 800 tokens (instructions, rules, format)
User query: 50 tokens
Safety margin: 500 tokens
Available for context: 128,000 - 4,000 - 800 - 50 - 500 = 122,650 tokens
With 500-token chunks: up to 245 chunks could fit
With 1000-token chunks: up to 122 chunks could fit
In practice: 5-10 high-quality chunks (2,500-10,000 tokens) is usually better
than 100 marginally relevant chunks.
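The arithmetic in the breakdown can be checked directly with the same numbers:

```javascript
// Recompute the token budget from the breakdown above.
const contextWindow = 128000; // GPT-4o
const maxOutput = 4000;
const systemPromptTokens = 800;
const queryTokens = 50;
const safetyMargin = 500;

const availableForContext =
  contextWindow - maxOutput - systemPromptTokens - queryTokens - safetyMargin;
console.log(availableForContext); // 122650

console.log(Math.floor(availableForContext / 500)); // 245 chunks of ~500 tokens
console.log(Math.floor(availableForContext / 1000)); // 122 chunks of ~1000 tokens
```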
9. Multi-Turn RAG Conversations
When the user asks follow-up questions, you need to maintain conversation context while still performing fresh retrieval.
async function multiTurnRAG(userQuery, conversationHistory, options = {}) {
const { maxHistoryTurns = 3 } = options;
// 1. Contextualize the query using conversation history
// "What about the salary?" -> "What is the salary for the remote work position?"
const contextualizedQuery = await contextualizeQuery(userQuery, conversationHistory);
// 2. Retrieve based on the contextualized query
const chunks = await retrieve(contextualizedQuery);
// 3. Build prompt with history + new context
const trimmedHistory = conversationHistory.slice(-maxHistoryTurns * 2); // Keep last N turns
const messages = [
{
role: 'system',
content: `You are a document assistant. Answer from the provided CONTEXT only.
CONTEXT:
${formatChunksForPrompt(chunks)}
Return JSON: { "answer": "...", "confidence": 0.0-1.0, "sources": [...] }`,
},
...trimmedHistory,
{
role: 'user',
content: userQuery, // Use original query (not contextualized) for natural conversation
},
];
return await generateAnswer(messages);
}
// Use a small, fast LLM to rewrite ambiguous queries
async function contextualizeQuery(query, history) {
if (history.length === 0) return query;
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
temperature: 0,
messages: [
{
role: 'system',
content: `Given the conversation history, rewrite the user's latest query to be self-contained.
If the query is already clear, return it unchanged. Return ONLY the rewritten query, nothing else.`,
},
{
role: 'user',
content: `Conversation history:
${history.map(m => `${m.role}: ${m.content}`).join('\n')}
Latest query: "${query}"
Rewritten query:`,
},
],
});
return response.choices[0].message.content.trim();
}
Why query contextualization matters
Turn 1: "What is the remote work policy?"
-> Retrieves remote work policy chunks
-> "Employees may work remotely up to 3 days per week..."
Turn 2: "What about for contractors?"
-> Without contextualization: embeds "What about for contractors?"
-> Retrieves generic contractor info (WRONG)
-> With contextualization: rewrites to "What is the remote work policy for contractors?"
-> Retrieves contractor-specific remote work policy (CORRECT)
10. Complete Prompt Construction Pipeline
Putting all the pieces together:
import OpenAI from 'openai';
const openai = new OpenAI();
async function constructRAGPrompt(userQuery, retrievedChunks, options = {}) {
const {
role = 'document assistant',
company = 'Acme Corp',
strictness = 'high',
outputFormat = 'json',
maxContextTokens = 4000,
chunkOrdering = 'relevance_first',
} = options;
// 1. Order chunks
let orderedChunks;
switch (chunkOrdering) {
case 'sandwich':
orderedChunks = orderChunksSandwich(retrievedChunks);
break;
case 'chronological':
orderedChunks = orderChunksChronological(retrievedChunks);
break;
case 'relevance_first':
default:
orderedChunks = retrievedChunks; // Already sorted by relevance
}
// 2. Select chunks within token budget
const { selectedChunks } = buildPromptWithBudget(
userQuery,
orderedChunks,
{ maxContextTokens }
);
// 3. Format chunks with source labels
const formattedContext = formatChunksForPrompt(selectedChunks);
// 4. Build system message
const systemContent = `You are a ${role} for ${company}.
RULES:
1. Answer the user's question based ONLY on the CONTEXT provided below.
2. Do NOT use your training knowledge. Only use the CONTEXT.
3. Cite sources using [Source N] notation matching the source numbers in the context.
4. If the context does not contain the answer, set confidence to 0 and say so.
5. Never fabricate information not present in the context.
${outputFormat === 'json' ? `OUTPUT FORMAT (valid JSON):
{
"answer": "Your answer with [Source N] citations",
"confidence": 0.0 to 1.0,
"sources": ["source references used"],
"gaps": ["information asked about but not found in context"]
}` : 'Respond in natural language with [Source N] inline citations.'}
CONTEXT:
${formattedContext}`;
// 5. Build messages array
const messages = [
{ role: 'system', content: systemContent },
{ role: 'user', content: userQuery },
];
return {
messages,
metadata: {
chunksUsed: selectedChunks.length,
totalChunksAvailable: retrievedChunks.length,
estimatedPromptTokens: estimateTokens(systemContent + userQuery),
},
};
}
// Helper functions
function formatChunksForPrompt(chunks) {
return chunks.map((chunk, i) => {
const source = chunk.metadata?.filename || `Document ${i + 1}`;
const chunkNum = chunk.metadata?.chunkIndex ?? '';
const page = chunk.metadata?.page ? `, Page ${chunk.metadata.page}` : '';
return `[Source ${i + 1}: ${source}${chunkNum !== '' ? `, Chunk ${chunkNum}` : ''}${page}]
${chunk.text || chunk.metadata?.text}`;
}).join('\n\n---\n\n');
}
function orderChunksSandwich(chunks) {
if (chunks.length <= 2) return chunks;
const sorted = [...chunks].sort((a, b) => (b.score || 0) - (a.score || 0));
const top = sorted[0];
const second = sorted[1];
const rest = sorted.slice(2);
return [top, ...rest, second];
}
function orderChunksChronological(chunks) {
return [...chunks].sort((a, b) => {
const dateA = new Date(a.metadata?.date || 0);
const dateB = new Date(b.metadata?.date || 0);
return dateA - dateB;
});
}
function buildPromptWithBudget(query, chunks, options = {}) {
const { maxContextTokens = 4000 } = options;
const selected = [];
let tokens = 0;
for (const chunk of chunks) {
const t = estimateTokens(chunk.text || chunk.metadata?.text || '');
if (tokens + t > maxContextTokens) break;
selected.push(chunk);
tokens += t;
}
return { selectedChunks: selected, usedTokens: tokens };
}
function estimateTokens(text) {
return Math.ceil((text || '').length / 4);
}
11. Common Prompt Construction Mistakes
| Mistake | Problem | Fix |
|---|---|---|
| No "context only" instruction | LLM mixes training data with context | Add explicit "answer ONLY from context" rule |
| Context after the question | LLM may start answering before reading context | Put context BEFORE the question |
| No source labels on chunks | LLM cannot cite specific sources | Label every chunk with [Source N: filename] |
| No instructions for "I don't know" | LLM hallucinates when context is insufficient | Explicit fallback instruction + confidence 0 |
| Chunks without separators | LLM blends information across boundaries | Use --- or similar separators between chunks |
| Too much context | "Lost in the middle" effect, wasted tokens | Limit to 5-10 high-quality chunks |
| No output format specification | LLM returns inconsistent formats | Define exact JSON schema in the prompt |
| Forgetting multi-turn context | Follow-up queries retrieve wrong chunks | Contextualize queries using conversation history |
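Several rows of this table can be caught mechanically before the prompt ever reaches the model. A heuristic sketch — the substring checks assume the conventions used throughout this section and will need adjusting for other prompt styles:

```javascript
// Heuristic lint of a RAG system prompt against the mistakes table above.
function lintRAGPrompt(systemContent) {
  const issues = [];
  if (!/ONLY/.test(systemContent)) {
    issues.push('Missing "answer ONLY from context" rule');
  }
  if (!/\[Source \d+/.test(systemContent)) {
    issues.push('No [Source N] labels on chunks');
  }
  if (!systemContent.includes('---')) {
    issues.push('No separators between chunks');
  }
  if (!/(don'?t have|not available|insufficient)/i.test(systemContent)) {
    issues.push('No fallback instruction for unanswerable questions');
  }
  return issues;
}

const good = `Answer ONLY from the CONTEXT. If the context is insufficient, say so.
CONTEXT:
[Source 1: handbook.pdf]
New employees receive 15 PTO days.
---
[Source 2: policy.md]
PTO increases after 5 years.`;

const badIssues = lintRAGPrompt('What is the refund policy?');
```

Running checks like these in CI keeps prompt regressions from silently reintroducing the mistakes above.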
12. Key Takeaways
- The prompt must FORCE the LLM to use the context — without explicit "answer ONLY from context" instructions, the LLM will freely use training data.
- System message is the best place for context — it has the highest priority and cleanest separation from the user query.
- Label and separate chunks clearly — use [Source N: filename] labels and --- separators so the LLM can cite sources accurately.
- Chunk ordering matters — most relevant first is the default; the sandwich pattern helps with long contexts.
- Source attribution must be enforced — include citation instructions, format examples, and post-generation validation.
- Budget tokens carefully — system prompt + context + query + output must all fit in the context window.
- Multi-turn conversations need query contextualization — rewrite ambiguous follow-up questions to be self-contained before retrieval.
Explain-It Challenge
- A developer puts the context AFTER the user question in the prompt. The LLM sometimes answers without reading the context. Explain what is happening and how to fix it.
- Your RAG system is answering questions about topics not covered in the documents — where is the information coming from and how do you stop it?
- Design a prompt template for a medical RAG system where incorrect information could be dangerous. What extra safeguards would you add?
Navigation: <- 4.13.b Retrieval Strategies | 4.13.d — Building Document QA System ->