Episode 4 — Generative AI Engineering / 4.1 — How LLMs Actually Work

4.1.b — Context Window

In one sentence: The context window is the fixed-size "memory" of an LLM — the maximum number of tokens (prompt + response combined) the model can process in a single request — and understanding it is critical because everything beyond this limit is silently dropped.

Navigation: ← 4.1.a — Tokens · 4.1.c — Sampling & Temperature →


1. What Is the Context Window?

When you send a message to an LLM, the model doesn't have infinite memory. It can only "see" a fixed number of tokens at a time — this is the context window. Think of it as a window of text the model reads all at once: anything outside it simply doesn't exist for the model.

┌──────────────────────────────────────────────────────┐
│                    CONTEXT WINDOW                     │
│                  (e.g., 128K tokens)                  │
│                                                       │
│  ┌─────────────────────────────────────────────────┐ │
│  │ System Prompt         │  ~500-2000 tokens        │ │
│  ├───────────────────────┤                          │ │
│  │ Conversation History  │  Grows with each turn    │ │
│  │ (all previous msgs)   │                          │ │
│  ├───────────────────────┤                          │ │
│  │ Current User Message  │  Your latest input       │ │
│  ├───────────────────────┤                          │ │
│  │ Model's Response      │  Generated tokens        │ │
│  │ (being generated)     │  (counts against window) │ │
│  └─────────────────────────────────────────────────┘ │
│                                                       │
│  EVERYTHING must fit in the window. If it doesn't,   │
│  older messages are truncated or the request fails.   │
└──────────────────────────────────────────────────────┘

Critical insight: Both the input (prompt) and the output (model's response) share the same context window. If your prompt uses 120K tokens of a 128K window, the model can only generate 8K tokens of response.
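
To make the arithmetic concrete, here is a minimal sketch. It assumes a hypothetical countTokens() helper (the same assumption the later examples in this section make); a real application would use the provider's tokenizer, such as tiktoken for OpenAI models.

// CONTEXT_WINDOW and countTokens() are assumptions for illustration
const CONTEXT_WINDOW = 128000; // e.g. GPT-4o

function roomForResponse(promptMessages) {
  // Tokens already consumed by the system prompt, history, and current message
  const promptTokens = promptMessages.reduce((sum, m) => sum + countTokens(m), 0);

  // Whatever is left is the hard ceiling on how long the response can be
  return Math.max(0, CONTEXT_WINDOW - promptTokens);
}

// A 120K-token prompt in a 128K window leaves at most 8K tokens for the answer.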


2. Context Window Sizes Across Models

Model              | Context Window    | Approximate Pages of Text
-------------------|-------------------|--------------------------
GPT-3.5 Turbo      | 16,385 tokens     | ~25 pages
GPT-4              | 8,192 tokens      | ~12 pages
GPT-4 Turbo        | 128,000 tokens    | ~200 pages
GPT-4o             | 128,000 tokens    | ~200 pages
GPT-4o mini        | 128,000 tokens    | ~200 pages
Claude 3.5 Sonnet  | 200,000 tokens    | ~300 pages
Claude 3 Opus      | 200,000 tokens    | ~300 pages
Claude 4 Sonnet    | 200,000 tokens    | ~300 pages
Gemini 1.5 Pro     | 2,000,000 tokens  | ~3000 pages
Llama 3.1 405B     | 128,000 tokens    | ~200 pages

Max output tokens are usually a separate, smaller limit:

Model                | Max Output Tokens
---------------------|----------------------------------
GPT-4o               | 16,384 tokens
Claude 3.5/4 Sonnet  | 8,192 tokens (default), up to 64K
Gemini 1.5 Pro       | 8,192 tokens (default)
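
In practice you set the output limit per request with a parameter such as max_tokens. A minimal sketch with the OpenAI Node SDK (parameter names and defaults vary by provider and SDK version, so treat this as illustrative):

import OpenAI from 'openai';

const client = new OpenAI(); // Reads OPENAI_API_KEY from the environment

const completion = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Explain context windows in three bullets.' }],
  max_tokens: 500, // Hard cap on the response; generation stops once it is reached
});

console.log(completion.choices[0].message.content);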

3. Why Prompts Get Cut

When a conversation exceeds the context window, something must give. Different systems handle this differently:

Strategy 1: Truncation (most common)

Messages: [system, user1, assistant1, user2, assistant2, ..., user50, assistant50]
Window:   Can only fit messages from user30 onward

Result: Messages before user30 are SILENTLY DROPPED.
        The model has NO memory of the early conversation.
        It may contradict things it said 20 messages ago.

Strategy 2: API rejection

HTTP 400: "This model's maximum context length is 128000 tokens.
           Your messages resulted in 135420 tokens."
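
If you would rather recover than fail, catch the rejection and retry with a trimmed conversation. A rough sketch, assuming an abstract llm.chat() client and an HTTP 400 status on overflow (the exact error shape differs per provider, so check your SDK):

async function chatWithOverflowRetry(llm, messages) {
  try {
    return await llm.chat(messages);
  } catch (err) {
    if (err.status === 400) {
      // Assumed overflow case: keep the system prompt plus only the most
      // recent messages, then retry once (assumes a long conversation)
      const trimmed = [messages[0], ...messages.slice(-6)];
      return await llm.chat(trimmed);
    }
    throw err; // Anything else is a genuine error
  }
}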

Strategy 3: Summarization (application-level)

Your app detects the conversation is getting long:
  1. Take the oldest messages
  2. Ask the LLM to summarize them
  3. Replace the original messages with the summary
  4. Continue the conversation with the summary as context

4. Token Budgeting

Token budgeting is the practice of planning exactly how your context window will be used. This is essential for production AI applications.

// Token budget for a RAG-powered chatbot
const CONTEXT_WINDOW = 128000;  // GPT-4o

const budget = {
  systemPrompt:       1500,   // Fixed: instructions, persona, rules
  retrievedDocs:      8000,   // Variable: RAG context (4 chunks × 2000)
  conversationHistory: 4000,  // Variable: recent messages
  currentUserMessage:  500,   // Variable: user's question
  reservedForOutput:   4000,  // Reserved: model's response
  safetyMargin:        1000,  // Buffer: prevent edge-case overflow
};

const totalBudget = Object.values(budget).reduce((a, b) => a + b, 0);
console.log(`Using ${totalBudget} of ${CONTEXT_WINDOW} tokens`);
// Using 19000 of 128000 tokens — plenty of headroom

// In a production system, you'd dynamically adjust:
function calculateRemainingBudget(systemTokens, historyTokens, userTokens) {
  const used = systemTokens + historyTokens + userTokens;
  const reserved = 4000; // For output
  const available = CONTEXT_WINDOW - used - reserved;
  
  // This is how many tokens of RAG context you can inject
  return Math.max(0, available);
}
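
For example, with a 1,500-token system prompt, 4,000 tokens of history, and a 500-token question, the function above leaves a large budget for retrieved documents:

const ragBudget = calculateRemainingBudget(1500, 4000, 500);
console.log(ragBudget); // 128000 - 6000 - 4000 = 118000 tokens for RAG context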

Budget allocation strategy

Priority Order (what to keep when space is tight):

1. System prompt       — ALWAYS keep (defines behavior)
2. Current user message — ALWAYS keep (what they just asked)
3. Output reservation  — ALWAYS keep (model needs room to respond)
4. Recent history      — Keep last 2-4 exchanges (immediate context)
5. RAG documents       — Fill remaining space (most relevant first)
6. Older history       — First to be trimmed or summarized
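
A minimal sketch of this priority order in code. countTokens(), the message shapes, and the chunk shapes are assumptions carried over from the earlier examples; rankedChunks is assumed to be sorted most relevant first.

function assembleContext(systemMsg, userMsg, history, rankedChunks, windowSize) {
  const OUTPUT_RESERVE = 4000; // 3. Always keep room for the response

  // 1 + 2. The system prompt and current user message are always included
  let remaining = windowSize - OUTPUT_RESERVE
                - countTokens(systemMsg) - countTokens(userMsg);

  // 4. Recent history: the last few exchanges, newest first, while they fit
  const recentHistory = [];
  for (const msg of history.slice(-8).reverse()) {
    const t = countTokens(msg);
    if (t > remaining) break;
    recentHistory.unshift(msg);
    remaining -= t;
  }

  // 5. RAG documents: fill whatever space is left, most relevant first
  const docs = [];
  for (const chunk of rankedChunks) {
    const t = countTokens(chunk);
    if (t > remaining) break;
    docs.push(chunk);
    remaining -= t;
  }

  // 6. Older history is simply dropped in this sketch
  //    (section 6 shows summarizing it instead)
  return { systemMsg, docs, recentHistory, userMsg };
}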

5. What Happens at the Edges of the Context Window

The "Lost in the Middle" Problem

Research shows that LLMs pay more attention to information at the beginning and end of the context window, and less attention to information in the middle. This is called the "lost in the middle" phenomenon.

Attention Distribution:

  HIGH ██████████████████████████████████████████████████  Beginning
       ██████████████████
       ██████████████
       ██████████                                          Middle
       ████████████
       ██████████████████
       ██████████████████████████████████
  HIGH ██████████████████████████████████████████████████  End

Implication: Put the MOST IMPORTANT information at the
beginning (system prompt) and end (current question) of
the context. Less critical info can go in the middle.
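
A small sketch of that placement in practice (the role layout, document contents, and restated constraint are placeholders for illustration):

const retrievedDocs = ['...chunk 1...', '...chunk 2...'];   // Placeholder RAG chunks
const question = 'Which plan includes priority support?';   // Placeholder user question

const messages = [
  // Beginning (high-attention zone): system prompt with the critical rules
  { role: 'system', content: 'You are a support assistant. Answer only from the provided documents.' },

  // Middle (lower-attention zone): bulk context such as retrieved documents
  { role: 'user', content: `Reference documents:\n${retrievedDocs.join('\n---\n')}` },

  // End (high-attention zone): the actual question, restating the key constraint
  { role: 'user', content: `${question}\n\nRemember: answer only from the documents above.` }
];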

Long context ≠ good context

Just because a model supports 200K tokens doesn't mean you should use all of them:

Context Length | Recall Accuracy | Latency   | Cost
---------------|-----------------|-----------|----------
1K tokens      | ~99%            | Fast      | Low
10K tokens     | ~95%            | Fast      | Low
50K tokens     | ~85%            | Medium    | Medium
100K tokens    | ~75%            | Slow      | High
200K tokens    | ~65%            | Very slow | Very high

Best practice: Use the minimum context necessary for the task. More context often means worse performance, higher latency, and higher cost.


6. Managing Context in Multi-Turn Conversations

Every API call in a chat application sends the entire conversation history. This means costs and token usage grow with every message:

Turn 1:  [system] + [user1]                          = ~2000 tokens
Turn 2:  [system] + [user1] + [asst1] + [user2]      = ~3500 tokens
Turn 3:  [system] + [user1] + [asst1] + [user2] + [asst2] + [user3] = ~5000 tokens
...
Turn 50: [system] + [all 50 exchanges]                = ~50,000 tokens

Conversation management strategies

// Strategy 1: Sliding window — keep only the last N messages
function trimConversation(messages, maxTokens) {
  const systemMsg = messages[0]; // Always keep system prompt
  let remaining = messages.slice(1);
  
  let totalTokens = countTokens(systemMsg); // countTokens: your tokenizer of choice
  const recent = [];
  
  // Walk backwards, keeping the most recent messages that still fit
  for (let i = remaining.length - 1; i >= 0; i--) {
    const msgTokens = countTokens(remaining[i]);
    if (totalTokens + msgTokens > maxTokens) break;
    totalTokens += msgTokens;
    recent.unshift(remaining[i]); // Prepend so chronological order is preserved
  }
  
  return [systemMsg, ...recent]; // System prompt first, then the recent history
}

// Strategy 2: Summarize old messages
async function summarizeOldMessages(oldMessages, llm) {
  const summary = await llm.chat([
    { role: 'system', content: 'Summarize this conversation concisely.' },
    { role: 'user', content: JSON.stringify(oldMessages) }
  ]);
  
  return {
    role: 'system',
    content: `Previous conversation summary: ${summary}`
  };
}
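
A sketch of wiring the two strategies together (the thresholds are illustrative, not recommendations):

async function manageHistory(messages, llm) {
  if (messages.length <= 12) return messages; // Short conversation: send as-is

  // Summarize everything except the system prompt and the last 8 messages
  const oldMessages = messages.slice(1, -8);
  const summaryMsg = await summarizeOldMessages(oldMessages, llm);

  // New shape: [system, summary, ...recent messages]
  return [messages[0], summaryMsg, ...messages.slice(-8)];
}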

7. Context Window vs Training Data

A common confusion: the context window is not the model's "memory" of its training data.

                 | Context Window                    | Training Data
-----------------|-----------------------------------|---------------------------------------------
What             | Text in the current API call      | Text the model learned from during training
Size             | 128K-2M tokens                    | Trillions of tokens
Persists?        | No — reset every API call         | Yes — baked into model weights
You control it?  | Yes — you choose what to include  | No — decided by the model creator
Updatable?       | Yes — change the prompt           | No — requires retraining

The context window is like a whiteboard — you write on it, the model reads it, then it's erased. Training data is like the model's education — it's permanent knowledge (but may be outdated or wrong).


8. Key Takeaways

  1. The context window is the total tokens (input + output) a model can handle in one request.
  2. Both prompt and response share the context window — reserve space for the output.
  3. When exceeded, messages are silently dropped or the API rejects the request.
  4. Token budgeting is essential in production — plan how every token is used.
  5. More context ≠ better results — models lose accuracy in the middle of long contexts.
  6. Conversation history grows linearly — implement trimming or summarization for multi-turn apps.

Explain-It Challenge

  1. A product manager asks "why can't the chatbot remember what I said 30 minutes ago?" — explain using context windows.
  2. You have a 128K token window. Your system prompt is 2K, you want 8K for output, and each RAG chunk is 500 tokens. How many chunks can you fit?
  3. Why does putting important instructions at the beginning AND end of a long prompt work better than burying them in the middle?

Navigation: ← 4.1.a — Tokens · 4.1.c — Sampling & Temperature →