Episode 4 — Generative AI Engineering / 4.1 — How LLMs Actually Work
4.1.b — Context Window
In one sentence: The context window is the fixed-size "memory" of an LLM — the maximum number of tokens (prompt + response combined) the model can process in a single request — and understanding it is critical because everything beyond this limit is silently dropped.
Navigation: ← 4.1.a — Tokens · 4.1.c — Sampling & Temperature →
1. What Is the Context Window?
When you send a message to an LLM, the model doesn't have infinite memory. It can only "see" a fixed number of tokens at a time — this is the context window. Think of it as a sliding window of text that the model reads all at once.
┌──────────────────────────────────────────────────────┐
│                    CONTEXT WINDOW                    │
│                 (e.g., 128K tokens)                  │
│                                                      │
│ ┌──────────────────────────────────────────────────┐ │
│ │ System Prompt        │ ~500-2000 tokens          │ │
│ ├──────────────────────┼───────────────────────────┤ │
│ │ Conversation History │ Grows with each turn      │ │
│ │ (all previous msgs)  │                           │ │
│ ├──────────────────────┼───────────────────────────┤ │
│ │ Current User Message │ Your latest input         │ │
│ ├──────────────────────┼───────────────────────────┤ │
│ │ Model's Response     │ Generated tokens          │ │
│ │ (being generated)    │ (counts against window)   │ │
│ └──────────────────────┴───────────────────────────┘ │
│                                                      │
│ EVERYTHING must fit in the window. If it doesn't,    │
│ older messages are truncated or the request fails.   │
└──────────────────────────────────────────────────────┘
Critical insight: Both the input (prompt) and the output (model's response) share the same context window. If your prompt uses 120K tokens of a 128K window, the model can only generate 8K tokens of response.
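This trade-off is easy to express in code. A minimal sketch (the constant and function names are illustrative, not any particular SDK's API):

```javascript
// How much room is left for the response in a 128K window?
const CONTEXT_WINDOW = 128000;

function maxOutputTokens(promptTokens) {
  // Whatever the prompt doesn't consume is the ceiling for generation
  return Math.max(0, CONTEXT_WINDOW - promptTokens);
}

console.log(maxOutputTokens(120000)); // 8000 — a 120K prompt leaves only 8K
console.log(maxOutputTokens(128000)); // 0 — no room to respond at all
```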
2. Context Window Sizes Across Models
| Model | Context Window | Approximate Pages of Text |
|---|---|---|
| GPT-3.5 Turbo | 16,385 tokens | ~25 pages |
| GPT-4 | 8,192 tokens | ~12 pages |
| GPT-4 Turbo | 128,000 tokens | ~200 pages |
| GPT-4o | 128,000 tokens | ~200 pages |
| GPT-4o mini | 128,000 tokens | ~200 pages |
| Claude 3.5 Sonnet | 200,000 tokens | ~300 pages |
| Claude 3 Opus | 200,000 tokens | ~300 pages |
| Claude 4 Sonnet | 200,000 tokens | ~300 pages |
| Gemini 1.5 Pro | 2,000,000 tokens | ~3000 pages |
| Llama 3.1 405B | 128,000 tokens | ~200 pages |
Max output tokens are usually a separate, smaller limit:
| Model | Max Output Tokens |
|---|---|
| GPT-4o | 16,384 tokens |
| Claude 3.5/4 Sonnet | 8,192 tokens (default), up to 64K |
| Gemini 1.5 Pro | 8,192 tokens (default) |
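Because the context window and the max-output limit are separate constraints, it's worth checking both before sending a request. A hedged sketch (the function and its parameters are hypothetical helpers, not a real client API):

```javascript
// Validate a request against both limits before calling the API.
function validateRequest(promptTokens, maxOutputTokens, contextWindow, modelMaxOutput) {
  if (maxOutputTokens > modelMaxOutput) {
    throw new Error(
      `max_tokens ${maxOutputTokens} exceeds the model's output limit ${modelMaxOutput}`
    );
  }
  if (promptTokens + maxOutputTokens > contextWindow) {
    throw new Error(
      `prompt (${promptTokens}) + output (${maxOutputTokens}) exceeds window ${contextWindow}`
    );
  }
  return true;
}

// 120K prompt + 8K output fits a 128K window and a 16,384 output cap
validateRequest(120000, 8000, 128000, 16384); // true
```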
3. Why Prompts Get Cut
When a conversation exceeds the context window, something must give. Different systems handle this differently:
Strategy 1: Truncation (most common)
Messages: [system, user1, assistant1, user2, assistant2, ..., user50, assistant50]
Window: Can only fit messages from user30 onward
Result: Messages before user30 are SILENTLY DROPPED.
The model has NO memory of the early conversation.
It may contradict things it said 20 messages ago.
Strategy 2: API rejection
HTTP 400: "This model's maximum context length is 128000 tokens.
Your messages resulted in 135420 tokens."
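One way an application can respond to this rejection is to drop the oldest non-system message and retry. A sketch under stated assumptions: `callLLM` and `isContextLengthError` are hypothetical stand-ins for your API client and its error shape.

```javascript
// On a context-length error, trim the oldest message after the system
// prompt and try again, down to system prompt + latest user message.
async function chatWithRetry(messages, callLLM, isContextLengthError) {
  const msgs = [...messages];
  while (msgs.length > 2) {
    try {
      return await callLLM(msgs);
    } catch (err) {
      if (!isContextLengthError(err)) throw err; // unrelated failure
      msgs.splice(1, 1); // drop the oldest non-system message
    }
  }
  return callLLM(msgs); // final attempt; let any error propagate
}
```

Note this is the bluntest possible recovery; the summarization strategy below preserves more of the conversation's meaning.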
Strategy 3: Summarization (application-level)
Your app detects the conversation is getting long:
1. Take the oldest messages
2. Ask the LLM to summarize them
3. Replace the original messages with the summary
4. Continue the conversation with the summary as context
4. Token Budgeting
Token budgeting is the practice of planning exactly how your context window will be used. This is essential for production AI applications.
// Token budget for a RAG-powered chatbot
const CONTEXT_WINDOW = 128000; // GPT-4o

const budget = {
  systemPrompt: 1500,        // Fixed: instructions, persona, rules
  retrievedDocs: 8000,       // Variable: RAG context (4 chunks × 2000)
  conversationHistory: 4000, // Variable: recent messages
  currentUserMessage: 500,   // Variable: user's question
  reservedForOutput: 4000,   // Reserved: model's response
  safetyMargin: 1000,        // Buffer: prevent edge-case overflow
};

const totalBudget = Object.values(budget).reduce((a, b) => a + b, 0);
console.log(`Using ${totalBudget} of ${CONTEXT_WINDOW} tokens`);
// Using 19000 of 128000 tokens — plenty of headroom

// In a production system, you'd dynamically adjust:
function calculateRemainingBudget(systemTokens, historyTokens, userTokens) {
  const used = systemTokens + historyTokens + userTokens;
  const reserved = 4000; // For output
  const available = CONTEXT_WINDOW - used - reserved;
  // This is how many tokens of RAG context you can inject
  return Math.max(0, available);
}
Budget allocation strategy
Priority Order (what to keep when space is tight):
1. System prompt — ALWAYS keep (defines behavior)
2. Current user message — ALWAYS keep (what they just asked)
3. Output reservation — ALWAYS keep (model needs room to respond)
4. Recent history — Keep last 2-4 exchanges (immediate context)
5. RAG documents — Fill remaining space (most relevant first)
6. Older history — First to be trimmed or summarized
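The priority order above can be sketched as an assembly function. Assumptions: `countTokens` is a stand-in for a real tokenizer, and `ragChunks` arrives pre-sorted by relevance (most relevant first).

```javascript
// Fill the window in priority order: system prompt and user message are
// non-negotiable, recent history comes next, RAG chunks fill the rest.
function assembleContext(systemPrompt, userMessage, ragChunks, recentHistory,
                         windowSize, outputReserve, countTokens) {
  let budget = windowSize - outputReserve
    - countTokens(systemPrompt) - countTokens(userMessage);

  const history = [];
  // Walk backwards so the most recent exchanges are kept first
  for (let i = recentHistory.length - 1; i >= 0; i--) {
    const t = countTokens(recentHistory[i]);
    if (t > budget) break;
    budget -= t;
    history.unshift(recentHistory[i]);
  }

  const docs = [];
  // Then fill the remaining space with RAG chunks, most relevant first
  for (const chunk of ragChunks) {
    const t = countTokens(chunk);
    if (t > budget) break;
    budget -= t;
    docs.push(chunk);
  }

  return { systemPrompt, docs, history, userMessage };
}
```

A real system would also cap history at a fixed number of exchanges (per item 4 above) rather than letting it crowd out RAG context entirely.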
5. What Happens at the Edges of the Context Window
The "Lost in the Middle" Problem
Research shows that LLMs pay more attention to information at the beginning and end of the context window, and less attention to information in the middle. This is called the "lost in the middle" phenomenon (Liu et al., 2023).
Attention Distribution:
HIGH ██████████████████████████████████████████████████ Beginning
██████████████████
██████████████
██████████ Middle
████████████
██████████████████
██████████████████████████████████
HIGH ██████████████████████████████████████████████████ End
Implication: Put the MOST IMPORTANT information at the
beginning (system prompt) and end (current question) of
the context. Less critical info can go in the middle.
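One practical application: when injecting many retrieved documents, interleave them so the highest-ranked ones land at the edges of the context rather than in the middle. A minimal sketch, assuming `docs` is already sorted by relevance (most relevant first):

```javascript
// Place the best documents at the beginning AND end of the context,
// where recall is strongest; weaker documents go in the middle.
function orderForLongContext(docs) {
  const front = [];
  const back = [];
  docs.forEach((doc, i) => {
    // Alternate: rank 0 → front, rank 1 → back, rank 2 → front, ...
    (i % 2 === 0 ? front : back).push(doc);
  });
  return [...front, ...back.reverse()];
}

orderForLongContext(['d1', 'd2', 'd3', 'd4', 'd5']);
// → ['d1', 'd3', 'd5', 'd4', 'd2'] — d1 and d2 sit at the extremes
```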
Long context ≠ good context
Just because a model supports 200K tokens doesn't mean you should use all of them. The figures below are illustrative rather than benchmarked, but the trend is well documented:
| Context Length | Recall Accuracy | Latency | Cost |
|---|---|---|---|
| 1K tokens | ~99% | Fast | Low |
| 10K tokens | ~95% | Fast | Low |
| 50K tokens | ~85% | Medium | Medium |
| 100K tokens | ~75% | Slow | High |
| 200K tokens | ~65% | Very slow | Very high |
Best practice: Use the minimum context necessary for the task. More context often means worse performance, higher latency, and higher cost.
6. Managing Context in Multi-Turn Conversations
Every API call in a chat application sends the entire conversation history. This means costs and token usage grow with every message:
Turn 1: [system] + [user1] = ~2000 tokens
Turn 2: [system] + [user1] + [asst1] + [user2] = ~3500 tokens
Turn 3: [system] + [user1] + [asst1] + [user2] + [asst2] + [user3] = ~5000 tokens
...
Turn 50: [system] + [all 50 exchanges] = ~50,000 tokens
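Because every call resends the full history, the *cumulative* input tokens you are billed for grow quadratically with conversation length, even though each turn only adds a roughly constant amount. A simplified model (one flat token count per exchange):

```javascript
// Cumulative input tokens sent across a whole conversation.
// Turn t resends the system prompt plus all t exchanges so far.
function totalInputTokens(systemTokens, tokensPerExchange, turns) {
  let total = 0;
  for (let t = 1; t <= turns; t++) {
    total += systemTokens + t * tokensPerExchange;
  }
  return total;
}

// 50 turns with a 1.5K system prompt and ~1K tokens per exchange:
console.log(totalInputTokens(1500, 1000, 50)); // 1350000 — ~1.35M billed input tokens
```

This quadratic growth is the main cost argument for the trimming and summarization strategies below.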
Conversation management strategies
// Strategy 1: Sliding window — keep only the most recent messages
function trimConversation(messages, maxTokens) {
  const systemMsg = messages[0]; // Always keep the system prompt
  const rest = messages.slice(1);
  let totalTokens = countTokens(systemMsg);
  const kept = [];
  // Walk backwards, keeping the most recent messages that still fit
  for (let i = rest.length - 1; i >= 0; i--) {
    const msgTokens = countTokens(rest[i]);
    if (totalTokens + msgTokens > maxTokens) break;
    totalTokens += msgTokens;
    kept.unshift(rest[i]);
  }
  return [systemMsg, ...kept]; // system prompt stays first
}
// Strategy 2: Summarize old messages
// (llm.chat is assumed to return the assistant's reply as a string)
async function summarizeOldMessages(oldMessages, llm) {
  const summary = await llm.chat([
    { role: 'system', content: 'Summarize this conversation concisely.' },
    { role: 'user', content: JSON.stringify(oldMessages) }
  ]);
  return {
    role: 'system',
    content: `Previous conversation summary: ${summary}`
  };
}
7. Context Window vs Training Data
A common confusion: the context window is not the model's "memory" of its training data.
| | Context Window | Training Data |
|---|---|---|
| What | Text in the current API call | Text the model learned from during training |
| Size | 128K-2M tokens | Trillions of tokens |
| Persists? | No — reset every API call | Yes — baked into model weights |
| You control it? | Yes — you choose what to include | No — decided by the model creator |
| Updatable? | Yes — change the prompt | No — requires retraining |
The context window is like a whiteboard — you write on it, the model reads it, then it's erased. Training data is like the model's education — it's permanent knowledge (but may be outdated or wrong).
8. Key Takeaways
- The context window is the total tokens (input + output) a model can handle in one request.
- Both prompt and response share the context window — reserve space for the output.
- When exceeded, messages are silently dropped or the API rejects the request.
- Token budgeting is essential in production — plan how every token is used.
- More context ≠ better results — models lose accuracy in the middle of long contexts.
- Conversation history grows linearly — implement trimming or summarization for multi-turn apps.
Explain-It Challenge
- A product manager asks "why can't the chatbot remember what I said 30 minutes ago?" — explain using context windows.
- You have a 128K token window. Your system prompt is 2K, you want 8K for output, and each RAG chunk is 500 tokens. How many chunks can you fit?
- Why does putting important instructions at the beginning AND end of a long prompt work better than burying them in the middle?
Navigation: ← 4.1.a — Tokens · 4.1.c — Sampling & Temperature →