Episode 4 — Generative AI Engineering / 4.2 — Calling LLM APIs Properly
4.2.b — Token Budgeting
In one sentence: The context window is a fixed-size container shared by input and output — token budgeting is the practice of carefully allocating tokens across system prompts, conversation history, retrieved documents, and reserved output space to prevent overflow and control costs.
Navigation: ← 4.2.a — Message Roles · 4.2.c — Cost Awareness →
1. The Fundamental Constraint: Context Window = Input + Output
Every LLM has a context window — the maximum number of tokens it can handle in a single API call. This window is shared between your input (everything you send) and the output (everything the model generates).
┌───────────────────── Context Window (e.g. 128K tokens) ──────────────────────┐
│                                                                              │
│  ┌───────────────────────────────────────────┐  ┌──────────────────────────┐ │
│  │ INPUT TOKENS                              │  │ OUTPUT TOKENS            │ │
│  │                                           │  │                          │ │
│  │ System prompt                             │  │ Model's response         │ │
│  │ + Few-shot examples                       │  │ (controlled by           │ │
│  │ + Conversation history                    │  │  max_tokens param)       │ │
│  │ + RAG documents                           │  │                          │ │
│  │ + Current user message                    │  │                          │ │
│  │ + Special tokens (overhead)               │  │                          │ │
│  │                                           │  │                          │ │
│  └───────────────────────────────────────────┘  └──────────────────────────┘ │
│                                                                              │
│               input_tokens + output_tokens <= context_window                 │
└──────────────────────────────────────────────────────────────────────────────┘
If your input consumes 127,000 tokens in a 128K window, the model has only 1,000 tokens left for its response — roughly 750 words. If input exceeds the window entirely, the API returns an error.
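In code, the constraint is a one-liner; a minimal sketch (the helper name is illustrative):

```javascript
// How many tokens remain for the model's response?
function outputRoom(contextWindow, inputTokens) {
  // If input alone exceeds the window, the API rejects the call outright
  return Math.max(0, contextWindow - inputTokens);
}

console.log(outputRoom(128000, 127000)); // 1000 — roughly 750 words of response
console.log(outputRoom(128000, 130000)); // 0 — this request would fail
```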
2. The max_tokens Parameter
The max_tokens parameter is one of the most misunderstood settings in LLM APIs. It controls the maximum number of output tokens the model can generate. It does not limit input size.
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Explain quantum computing." }
  ],
  max_tokens: 200  // Model will generate AT MOST 200 output tokens
});
Key facts about max_tokens
| Fact | Detail |
|---|---|
| Controls output only | Does not affect how many input tokens you can send |
| Upper bound, not target | Model may stop earlier if it finishes its thought |
| Truncation on hit | If the model needs 500 tokens but max_tokens is 200, the response is cut off mid-sentence |
| Default varies by provider | OpenAI defaults vary by model; Anthropic requires you to set it |
| Affects cost | You pay for actual output tokens, not max_tokens (but a higher limit allows more generation) |
What happens when output is truncated
When the model's response hits the max_tokens limit, the response is cut off and the API tells you why:
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Write a 2000-word essay on AI." }],
  max_tokens: 100  // Way too short for a 2000-word essay
});
console.log(response.choices[0].finish_reason);
// "length" — hit max_tokens limit (truncated!)
// "stop" — model finished naturally (complete response)
console.log(response.choices[0].message.content);
// "Artificial intelligence (AI) is a branch of computer science that
// aims to create systems capable of performing tasks that would
// normally require human intelligence. The field has evolved
// significantly since its..." — CUT OFF
The finish_reason values
| Value | Meaning | Action |
|---|---|---|
| `"stop"` | Model finished naturally | Response is complete |
| `"length"` | Hit max_tokens limit | Response is truncated — increase max_tokens or continue in next call |
| `"content_filter"` | Content was filtered | Response may be incomplete due to safety filter |
| `"tool_calls"` | Model wants to call a tool | Handle the tool call |
Always check finish_reason in production — a "length" response means the user got an incomplete answer.
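One common recovery for a `"length"` finish is to ask the model to continue where it stopped. A sketch with the API call injected as a function so the loop can be exercised without a live client; `callModel`, `maxRounds`, and the continuation prompt wording are illustrative, not part of any SDK:

```javascript
// Keep requesting continuations until the model finishes naturally ("stop")
// or we give up. `callModel(messages)` is assumed to return an object shaped
// like one OpenAI choice: { finish_reason, message: { content } }.
async function completeWithContinuation(callModel, messages, maxRounds = 3) {
  let fullText = '';
  let convo = [...messages];
  for (let round = 0; round < maxRounds; round++) {
    const choice = await callModel(convo);
    fullText += choice.message.content;
    if (choice.finish_reason !== 'length') return fullText; // done (or filtered)
    // Feed the partial answer back and ask the model to pick up where it left off
    convo = [...convo,
      { role: 'assistant', content: choice.message.content },
      { role: 'user', content: 'Continue exactly where you left off.' }
    ];
  }
  return fullText; // best effort after maxRounds truncations
}
```

With the real client, `callModel` would wrap `openai.chat.completions.create(...)` and return `response.choices[0]`.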
3. Token Budget Formula
The core formula for token budgeting:
available_output = context_window - input_tokens - safety_margin
Where:
input_tokens = system_prompt + few_shot_examples + conversation_history
               + rag_documents + current_user_message + special_token_overhead
Practical example: GPT-4o (128K window)
Context window:           128,000 tokens
System prompt:              1,500 tokens (fixed)
Few-shot examples:          1,000 tokens (fixed, 3 examples)
Conversation history:       8,000 tokens (variable, grows per turn)
RAG documents:             15,000 tokens (variable, depends on retrieval)
Current user message:         500 tokens (variable)
Special token overhead:       100 tokens (message formatting)
Safety margin:              1,000 tokens (buffer for estimation errors)
                          ─────────
Total input:               27,100 tokens
Available for output:     128,000 - 27,100 = 100,900 tokens
Set max_tokens to:          4,096 (typical response cap)
Actual headroom:           96,804 tokens unused (comfortable)
When it gets tight: Claude with heavy RAG
Context window:           200,000 tokens
System prompt:              2,000 tokens
RAG documents:            180,000 tokens (100 large documents)
User message:                 500 tokens
Safety margin:              1,000 tokens
                          ─────────
Total input:              183,500 tokens
Available for output:     200,000 - 183,500 = 16,500 tokens
Set max_tokens to:          8,192
WARNING: Only 8,308 tokens of headroom — any increase in
RAG documents or history could push you over the limit!
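Both worked examples above reduce to the same formula; a small helper makes that explicit (names are illustrative):

```javascript
// Apply the budget formula: what's left for output after input and margin?
function availableOutput({ contextWindow, inputParts, safetyMargin = 1000 }) {
  const inputTokens = inputParts.reduce((sum, n) => sum + n, 0);
  return contextWindow - inputTokens - safetyMargin;
}

// GPT-4o example: 1,500 + 1,000 + 8,000 + 15,000 + 500 + 100 = 26,100 input tokens
const gpt4o = availableOutput({
  contextWindow: 128000,
  inputParts: [1500, 1000, 8000, 15000, 500, 100]
});
console.log(gpt4o); // 100900 — comfortable room for a 4,096-token response

// Claude example: tight fit with heavy RAG
const claude = availableOutput({
  contextWindow: 200000,
  inputParts: [2000, 180000, 500]
});
console.log(claude);        // 16500
console.log(claude - 8192); // 8308 tokens of headroom after reserving max_tokens
```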
4. Dynamic Budget Allocation
In real applications, token counts change with every request. A static budget only works for simple one-shot calls. Multi-turn chatbots need dynamic allocation.
Strategy: Priority-based allocation
import { encoding_for_model } from 'tiktoken';

function allocateTokenBudget({
  contextWindow = 128000,
  systemPrompt,
  fewShotExamples = [],
  conversationHistory = [],
  ragDocuments = [],
  userMessage,
  maxOutputTokens = 4096,
  safetyMargin = 1000
}) {
  const enc = encoding_for_model('gpt-4o');

  // Count fixed allocations
  const systemTokens = enc.encode(systemPrompt).length;
  const userTokens = enc.encode(userMessage).length;
  const fewShotTokens = fewShotExamples.reduce(
    (sum, ex) => sum + enc.encode(ex.input).length + enc.encode(ex.output).length, 0
  );

  // Overhead: ~4 tokens per message for role markers and formatting
  const overheadPerMessage = 4;
  const fixedOverhead = (2 + fewShotExamples.length * 2) * overheadPerMessage;
  const fixedTokens = systemTokens + userTokens + fewShotTokens
    + fixedOverhead + maxOutputTokens + safetyMargin;
  let remainingBudget = contextWindow - fixedTokens;

  // Priority 1: Recent conversation history (last N turns)
  const includedHistory = [];
  const reversedHistory = [...conversationHistory].reverse();
  let historyTotal = 0;
  for (const msg of reversedHistory) {
    const msgTokens = enc.encode(msg.content).length + overheadPerMessage;
    if (historyTotal + msgTokens > remainingBudget * 0.4) break; // Max 40% of remaining
    includedHistory.unshift(msg);
    historyTotal += msgTokens;
  }
  remainingBudget -= historyTotal;

  // Priority 2: RAG documents (most relevant first)
  const includedDocs = [];
  let ragTotal = 0;
  for (const doc of ragDocuments) { // Assumed pre-sorted by relevance
    const docTokens = enc.encode(doc.content).length;
    if (ragTotal + docTokens > remainingBudget) break;
    includedDocs.push(doc);
    ragTotal += docTokens;
  }

  enc.free();
  return {
    systemPrompt,
    fewShotExamples,
    conversationHistory: includedHistory,
    ragDocuments: includedDocs,
    userMessage,
    maxOutputTokens,
    totalInputTokens: contextWindow - remainingBudget + ragTotal - maxOutputTokens - safetyMargin,
    headroom: remainingBudget - ragTotal
  };
}
5. Overflow Prevention Strategies
When conversations grow long or RAG retrieval returns large documents, you risk exceeding the context window. Here are the primary strategies:
Strategy 1: Sliding window (drop oldest messages)
function slidingWindow(messages, maxHistoryTokens, enc) {
  // Keep system message + trim history from the beginning
  const system = messages[0]; // Always keep system
  const history = messages.slice(1);
  const trimmed = [];
  let tokenCount = 0;
  // Walk backwards — keep most recent messages
  for (let i = history.length - 1; i >= 0; i--) {
    const msgTokens = enc.encode(history[i].content).length;
    if (tokenCount + msgTokens > maxHistoryTokens) break;
    trimmed.unshift(history[i]);
    tokenCount += msgTokens;
  }
  return [system, ...trimmed];
}
Pros: Simple, preserves recent context. Cons: Loses early context — model forgets the beginning of the conversation.
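The walk-backwards loop generalizes to any "greedy fit under a budget" problem. A self-contained sketch, using a word count as a crude stand-in for a tokenizer (both helper names are illustrative):

```javascript
// Greedy fit: take items from the end (most recent first) until the budget
// is exhausted. `cost(item)` returns the token count of one item.
function fitFromEnd(items, budget, cost) {
  const kept = [];
  let total = 0;
  for (let i = items.length - 1; i >= 0; i--) {
    const c = cost(items[i]);
    if (total + c > budget) break;
    kept.unshift(items[i]);
    total += c;
  }
  return kept;
}

// Word-count stand-in for a real tokenizer, for illustration only
const wordCost = (msg) => msg.content.split(/\s+/).length;

const history = [
  { role: 'user', content: 'first question here' },    // 3 "tokens"
  { role: 'assistant', content: 'first answer here' }, // 3
  { role: 'user', content: 'second question here' },   // 3
  { role: 'assistant', content: 'second answer here' } // 3
];
// Budget of 7: the latest turn (6 "tokens") fits; a third message would overflow
console.log(fitFromEnd(history, 7, wordCost).length); // 2
```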
Strategy 2: Summarize older messages
async function summarizeHistory(oldMessages, openai) {
  const historyText = oldMessages
    .map(m => `${m.role}: ${m.content}`)
    .join('\n');
  const summary = await openai.chat.completions.create({
    model: "gpt-4o-mini", // Use cheaper model for summarization
    messages: [
      { role: "system", content: "Summarize this conversation in under 200 words. Preserve key facts, decisions, and user preferences." },
      { role: "user", content: historyText }
    ],
    max_tokens: 300
  });
  return {
    role: "system",
    content: `Previous conversation summary: ${summary.choices[0].message.content}`
  };
}

// Usage: replace old messages with a summary
async function manageHistory(messages, maxTokens, enc, openai) {
  const totalTokens = messages.reduce(
    (sum, m) => sum + enc.encode(m.content).length, 0
  );
  if (totalTokens <= maxTokens) return messages;
  const system = messages[0];
  const recentCount = 6; // Keep last 3 turns (6 messages)
  const recent = messages.slice(-recentCount);
  const old = messages.slice(1, -recentCount);
  if (old.length === 0) return messages; // Nothing old enough to summarize
  const summaryMsg = await summarizeHistory(old, openai);
  return [system, summaryMsg, ...recent];
}
Pros: Preserves key information from the entire conversation. Cons: Summarization costs tokens/money and may lose details.
Strategy 3: Truncate RAG documents
function truncateDocuments(documents, maxTotalTokens, enc) {
  const truncated = [];
  let totalTokens = 0;
  for (const doc of documents) {
    const docTokens = enc.encode(doc.content).length;
    if (totalTokens + docTokens <= maxTotalTokens) {
      truncated.push(doc);
      totalTokens += docTokens;
    } else {
      // Partial inclusion — truncate this document to fit
      const remainingBudget = maxTotalTokens - totalTokens;
      if (remainingBudget > 100) { // Only include if meaningful content fits
        const tokens = enc.encode(doc.content).slice(0, remainingBudget);
        // tiktoken's decode() returns UTF-8 bytes, so convert them to a string
        const partialContent = new TextDecoder().decode(enc.decode(tokens));
        truncated.push({ ...doc, content: partialContent + '...[truncated]' });
      }
      break;
    }
  }
  return truncated;
}
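If tiktoken isn't available, the same idea can be approximated with the 4-characters-per-token rule from Section 6; a rough sketch (token boundaries won't line up exactly, and the helper name is illustrative):

```javascript
// Approximate document truncation using 1 token ≈ 4 characters (English text)
function truncateByChars(documents, maxTotalTokens) {
  const kept = [];
  let usedTokens = 0;
  for (const doc of documents) {
    const docTokens = Math.ceil(doc.content.length / 4);
    if (usedTokens + docTokens <= maxTotalTokens) {
      kept.push(doc);
      usedTokens += docTokens;
    } else {
      // Partial inclusion, measured in characters instead of tokens
      const remainingChars = (maxTotalTokens - usedTokens) * 4;
      if (remainingChars > 400) { // only keep a meaningful chunk (~100 tokens)
        kept.push({ ...doc, content: doc.content.slice(0, remainingChars) + '...[truncated]' });
      }
      break;
    }
  }
  return kept;
}

const docs = [
  { id: 1, content: 'a'.repeat(400) },  // ≈100 estimated tokens
  { id: 2, content: 'b'.repeat(4000) }  // ≈1,000 estimated tokens
];
console.log(truncateByChars(docs, 300).length); // 2 — second doc partially included
```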
Strategy 4: Pre-request validation
function validateRequest(messages, contextWindow, maxOutputTokens, enc) {
  const inputTokens = messages.reduce(
    (sum, m) => sum + enc.encode(m.content).length + 4, 0 // +4 for overhead per message
  );
  const totalRequired = inputTokens + maxOutputTokens;
  if (totalRequired > contextWindow) {
    return {
      valid: false,
      inputTokens,
      overflow: totalRequired - contextWindow,
      message: `Request would use ${totalRequired} tokens but window is ${contextWindow}. ` +
        `Reduce input by ${totalRequired - contextWindow} tokens.`
    };
  }
  return {
    valid: true,
    inputTokens,
    headroom: contextWindow - totalRequired,
    utilizationPercent: ((totalRequired / contextWindow) * 100).toFixed(1)
  };
}
6. Token Counting in Practice
Using tiktoken (OpenAI models)
import { encoding_for_model } from 'tiktoken';

function countMessageTokens(messages, model = 'gpt-4o') {
  const enc = encoding_for_model(model);
  let totalTokens = 0;
  // Every message has overhead for role formatting
  // For gpt-4o: each message costs 3 extra tokens + 1 for role name
  const tokensPerMessage = 4;
  for (const message of messages) {
    totalTokens += tokensPerMessage;
    totalTokens += enc.encode(message.content).length;
    if (message.name) {
      totalTokens += enc.encode(message.name).length;
    }
  }
  totalTokens += 3; // Every reply is primed with <|start|>assistant<|message|>
  enc.free();
  return totalTokens;
}

// Usage
const messages = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Hello, how are you?" }
];
console.log(countMessageTokens(messages));
// Approximately 22 tokens
Quick estimation without libraries
function estimateTokens(text) {
  // Rule of thumb: 1 token ≈ 4 characters for English
  return Math.ceil(text.length / 4);
}

function estimateMessagesTokens(messages) {
  const contentTokens = messages.reduce(
    (sum, m) => sum + estimateTokens(m.content), 0
  );
  const overhead = messages.length * 4 + 3; // Per-message overhead + reply primer
  return contentTokens + overhead;
}
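A quick sanity check of the estimator against the two-message example from the tiktoken section (the estimator is restated here so the snippet stands alone):

```javascript
// Character-based estimate: 1 token ≈ 4 characters
const estimateTokens = (text) => Math.ceil(text.length / 4);

const sample = [
  { role: 'system', content: 'You are a helpful assistant.' }, // 28 chars → 7
  { role: 'user', content: 'Hello, how are you?' }             // 19 chars → 5
];
const contentTokens = sample.reduce((sum, m) => sum + estimateTokens(m.content), 0); // 12
const total = contentTokens + sample.length * 4 + 3; // + per-message overhead and reply primer
console.log(total); // 23 — close to the exact tiktoken count (~22)
```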
Using the API response (post-hoc counting)
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [...],
  max_tokens: 1000
});

// The API tells you exactly how many tokens were used
console.log(response.usage);
// {
//   prompt_tokens: 156,      ← Input tokens (what you sent)
//   completion_tokens: 89,   ← Output tokens (what the model generated)
//   total_tokens: 245        ← Sum of both
// }
7. Common max_tokens Settings by Use Case
| Use Case | Recommended max_tokens | Reasoning |
|---|---|---|
| Classification (yes/no) | 5-10 | Only need a single word |
| JSON extraction | 200-500 | Structured, predictable length |
| Short answers (Q&A) | 300-500 | Concise responses |
| Chatbot response | 1,000-2,000 | Conversational paragraphs |
| Code generation | 2,000-4,000 | Functions can be lengthy |
| Long-form writing | 4,000-8,000 | Essays, articles |
| Document summarization | 500-2,000 | Depends on desired summary length |
Cost implication: Setting max_tokens: 16000 when you only need 200 doesn't cost more (you pay for actual tokens generated), but it allows the model to ramble. Tight limits encourage concise output and prevent runaway responses.
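One way to keep these caps consistent across a codebase is a single lookup; the values mirror the table above, and the map and function names are illustrative:

```javascript
// Per-use-case output caps, mirroring the table above
const MAX_TOKENS_BY_USE_CASE = {
  classification: 10,
  json_extraction: 500,
  short_answer: 500,
  chatbot: 2000,
  code_generation: 4000,
  long_form: 8000,
  summarization: 2000
};

function maxTokensFor(useCase, fallback = 1000) {
  // Fall back to a conservative default for unknown use cases
  return MAX_TOKENS_BY_USE_CASE[useCase] ?? fallback;
}

console.log(maxTokensFor('classification')); // 10
console.log(maxTokensFor('unknown_case'));   // 1000
```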
8. Advanced: Token Budget Monitoring Dashboard
In production, track these metrics:
┌────────────────────────────────────────────────────────┐
│  Token Budget Dashboard                                │
│                                                        │
│  Avg input tokens/request:      2,340  ▓▓▓▓▓░░░░░      │
│  Avg output tokens/request:       680  ▓▓░░░░░░░░      │
│  Avg total tokens/request:      3,020  ▓▓▓░░░░░░░      │
│  Context utilization:            2.4%  (healthy)       │
│                                                        │
│  p95 input tokens:             12,800  ▓▓▓▓▓▓▓░░░      │
│  p99 input tokens:             45,200  ▓▓▓▓▓▓▓▓▓░      │
│  Max observed:                 98,400  ▓▓▓▓▓▓▓▓▓▓      │
│                                                        │
│  Truncation events (length):     1.2%  (target < 2%)   │
│  Overflow errors:               0.01%  (target < 0.1%) │
│  History summarizations:         8.4%  (normal)        │
│                                                        │
│  Budget alerts:                                        │
│  [!]  3 requests hit 90%+ utilization in last hour     │
│  [OK] No overflow errors in last 24h                   │
└────────────────────────────────────────────────────────┘
Track and alert on:
| Metric | Alert Threshold | Action |
|---|---|---|
| Truncation rate (finish_reason: "length") | > 2% | Increase max_tokens or reduce input |
| Context overflow errors | > 0.1% | Implement more aggressive history trimming |
| Average context utilization | > 80% of window | You're close to limits — optimize inputs |
| p99 input tokens near window | > 90% of window | Long conversations need summarization |
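The truncation-rate alert in the first row can be tracked with a small in-process counter; a sketch (in production you would emit these metrics to your monitoring backend instead):

```javascript
// Rolling counters for the finish_reason-based alert above
class TokenBudgetMonitor {
  constructor(truncationAlertRate = 0.02) { // alert target: 2%
    this.total = 0;
    this.truncated = 0;
    this.alertRate = truncationAlertRate;
  }
  record(finishReason) {
    this.total += 1;
    if (finishReason === 'length') this.truncated += 1;
  }
  truncationRate() {
    return this.total === 0 ? 0 : this.truncated / this.total;
  }
  shouldAlert() {
    return this.truncationRate() > this.alertRate;
  }
}

const monitor = new TokenBudgetMonitor();
for (let i = 0; i < 97; i++) monitor.record('stop');
for (let i = 0; i < 3; i++) monitor.record('length');
console.log(monitor.truncationRate()); // 0.03
console.log(monitor.shouldAlert());    // true — above the 2% target
```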
9. Key Takeaways
- Context window = input + output — they share a fixed budget. Every token of input leaves one less token for output.
- `max_tokens` controls output only — it caps how much the model can generate, not how much you can send.
- Always check `finish_reason` — `"length"` means the response was cut off mid-sentence.
- Budget formula: `available_output = context_window - input_tokens - safety_margin`.
- Dynamic allocation is essential for multi-turn conversations — use sliding windows, summarization, or document truncation to stay within limits.
- Count tokens before sending — pre-validate requests to prevent overflow errors and wasted API calls.
Explain-It Challenge
- A product manager asks: "Why can't the chatbot just remember the whole conversation forever?" Explain context window limits and token budgeting in non-technical terms.
- Your chatbot works perfectly for 5-turn conversations but breaks at 20+ turns with truncated responses. Diagnose the problem and propose a fix.
- Calculate the token budget for a RAG system using GPT-4o (128K window) with: 1,200-token system prompt, 3 few-shot examples at 400 tokens each, up to 20 turns of history averaging 300 tokens each, and a desired 4,096-token max output. How many tokens are available for RAG documents?
Navigation: ← 4.2.a — Message Roles · 4.2.c — Cost Awareness →