Episode 4 — Generative AI Engineering / 4.2 — Calling LLM APIs Properly
4.2.b — Token Budgeting
In one sentence: The context window is a fixed-size container shared by input and output — token budgeting is the practice of carefully allocating tokens across system prompts, conversation history, retrieved documents, and reserved output space to prevent overflow and control costs.
Navigation: ← 4.2.a — Message Roles · 4.2.c — Cost Awareness →
1. The Fundamental Constraint: Context Window = Input + Output
Every LLM has a context window — the maximum number of tokens it can handle in a single API call. This window is shared between your input (everything you send) and the output (everything the model generates).
┌───────────────────── Context Window (e.g. 128K tokens) ──────────────────────┐
│                                                                              │
│  ┌───────────────────────────────────────────┐  ┌──────────────────────────┐ │
│  │ INPUT TOKENS                              │  │ OUTPUT TOKENS            │ │
│  │                                           │  │                          │ │
│  │ System prompt                             │  │ Model's response         │ │
│  │ + Few-shot examples                       │  │ (controlled by           │ │
│  │ + Conversation history                    │  │  max_tokens param)       │ │
│  │ + RAG documents                           │  │                          │ │
│  │ + Current user message                    │  │                          │ │
│  │ + Special tokens (overhead)               │  │                          │ │
│  │                                           │  │                          │ │
│  └───────────────────────────────────────────┘  └──────────────────────────┘ │
│                                                                              │
│               input_tokens + output_tokens <= context_window                 │
└──────────────────────────────────────────────────────────────────────────────┘
If your input consumes 127,000 tokens in a 128K window, the model has only 1,000 tokens left for its response — roughly 750 words. If input exceeds the window entirely, the API returns an error.
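In code, the constraint is a one-liner; a minimal sketch (the helper name is illustrative):

```javascript
// How many tokens remain for the model's response?
function outputRoom(contextWindow, inputTokens) {
  // If input alone exceeds the window, the API rejects the call outright
  return Math.max(0, contextWindow - inputTokens);
}

console.log(outputRoom(128000, 127000)); // 1000 — roughly 750 words of response
console.log(outputRoom(128000, 130000)); // 0 — this request would fail
```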
2. The max_tokens Parameter
The max_tokens parameter is one of the most misunderstood settings in LLM APIs. It controls the maximum number of output tokens the model can generate. It does not limit input size.
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Explain quantum computing." }
  ],
  max_tokens: 200  // Model will generate AT MOST 200 output tokens
});
Key facts about max_tokens
| Fact | Detail |
|---|---|
| Controls output only | Does not affect how many input tokens you can send |
| Upper bound, not target | Model may stop earlier if it finishes its thought |
| Truncation on hit | If the model needs 500 tokens but max_tokens is 200, the response is cut off mid-sentence |
| Default varies by provider | OpenAI defaults vary by model; Anthropic requires you to set it |
| Affects cost | You pay for actual output tokens, not max_tokens (but a higher limit allows more generation) |
What happens when output is truncated
When the model's response hits the max_tokens limit, the response is cut off and the API tells you why:
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Write a 2000-word essay on AI." }],
  max_tokens: 100  // Way too short for a 2000-word essay
});
console.log(response.choices[0].finish_reason);
// "length" — hit max_tokens limit (truncated!)
// "stop" — model finished naturally (complete response)
console.log(response.choices[0].message.content);
// "Artificial intelligence (AI) is a branch of computer science that
// aims to create systems capable of performing tasks that would
// normally require human intelligence. The field has evolved
// significantly since its..." — CUT OFF
The finish_reason values
| Value | Meaning | Action |
|---|---|---|
| `"stop"` | Model finished naturally | Response is complete |
| `"length"` | Hit max_tokens limit | Response is truncated — increase max_tokens or continue in next call |
| `"content_filter"` | Content was filtered | Response may be incomplete due to safety filter |
| `"tool_calls"` | Model wants to call a tool | Handle the tool call |
Always check finish_reason in production — a "length" response means the user got an incomplete answer.
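One common recovery for a `"length"` finish is to ask the model to continue where it stopped. A sketch with the API call injected as a function so the loop can be exercised without a live client; `callModel`, `maxRounds`, and the continuation prompt wording are illustrative, not part of any SDK:

```javascript
// Keep requesting continuations until the model finishes naturally ("stop")
// or we give up. `callModel(messages)` is assumed to return an object shaped
// like one OpenAI choice: { finish_reason, message: { content } }.
async function completeWithContinuation(callModel, messages, maxRounds = 3) {
  let fullText = '';
  let convo = [...messages];
  for (let round = 0; round < maxRounds; round++) {
    const choice = await callModel(convo);
    fullText += choice.message.content;
    if (choice.finish_reason !== 'length') return fullText; // done (or filtered)
    // Feed the partial answer back and ask the model to pick up where it left off
    convo = [...convo,
      { role: 'assistant', content: choice.message.content },
      { role: 'user', content: 'Continue exactly where you left off.' }
    ];
  }
  return fullText; // best effort after maxRounds truncations
}
```

With the real client, `callModel` would wrap `openai.chat.completions.create(...)` and return `response.choices[0]`.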
3. Token Budget Formula
The core formula for token budgeting:
available_output = context_window - input_tokens - safety_margin
Where:
input_tokens = system_prompt + few_shot_examples + conversation_history
               + rag_documents + current_user_message + special_token_overhead
Practical example: GPT-4o (128K window)
Context window:           128,000 tokens
System prompt:              1,500 tokens (fixed)
Few-shot examples:          1,000 tokens (fixed, 3 examples)
Conversation history:       8,000 tokens (variable, grows per turn)
RAG documents:             15,000 tokens (variable, depends on retrieval)
Current user message:         500 tokens (variable)
Special token overhead:       100 tokens (message formatting)
Safety margin:              1,000 tokens (buffer for estimation errors)
                          ─────────
Total input:               27,100 tokens
Available for output:     128,000 - 27,100 = 100,900 tokens
Set max_tokens to:          4,096 (typical response cap)
Actual headroom:           96,804 tokens unused (comfortable)
When it gets tight: Claude with heavy RAG
Context window:           200,000 tokens
System prompt:              2,000 tokens
RAG documents:            180,000 tokens (100 large documents)
User message:                 500 tokens
Safety margin:              1,000 tokens
                          ─────────
Total input:              183,500 tokens
Available for output:     200,000 - 183,500 = 16,500 tokens
Set max_tokens to:          8,192
WARNING: Only 8,308 tokens of headroom — any increase in
RAG documents or history could push you over the limit!
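Both worked examples above reduce to the same formula; a small helper makes that explicit (names are illustrative):

```javascript
// Apply the budget formula: what's left for output after input and margin?
function availableOutput({ contextWindow, inputParts, safetyMargin = 1000 }) {
  const inputTokens = inputParts.reduce((sum, n) => sum + n, 0);
  return contextWindow - inputTokens - safetyMargin;
}

// GPT-4o example: 1,500 + 1,000 + 8,000 + 15,000 + 500 + 100 = 26,100 input tokens
const gpt4o = availableOutput({
  contextWindow: 128000,
  inputParts: [1500, 1000, 8000, 15000, 500, 100]
});
console.log(gpt4o); // 100900 — comfortable room for a 4,096-token response

// Claude example: tight fit with heavy RAG
const claude = availableOutput({
  contextWindow: 200000,
  inputParts: [2000, 180000, 500]
});
console.log(claude);        // 16500
console.log(claude - 8192); // 8308 tokens of headroom after reserving max_tokens
```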
4. Dynamic Budget Allocation
In real applications, token counts change with every request. A static budget only works for simple one-shot calls. Multi-turn chatbots need dynamic allocation.
Strategy: Priority-based allocation
import { encoding_for_model } from 'tiktoken';

function allocateTokenBudget({
  contextWindow = 128000,
  systemPrompt,
  fewShotExamples = [],
  conversationHistory = [],
  ragDocuments = [],
  userMessage,
  maxOutputTokens = 4096,
  safetyMargin = 1000
}) {
  const enc = encoding_for_model('gpt-4o');

  // Count fixed allocations
  const systemTokens = enc.encode(systemPrompt).length;
  const userTokens = enc.encode(userMessage).length;
  const fewShotTokens = fewShotExamples.reduce(
    (sum, ex) => sum + enc.encode(ex.input).length + enc.encode(ex.output).length, 0
  );

  // Overhead: ~4 tokens per message for role markers and formatting
  const overheadPerMessage = 4;
  const fixedOverhead = (2 + fewShotExamples.length * 2) * overheadPerMessage;
  const fixedTokens = systemTokens + userTokens + fewShotTokens
    + fixedOverhead + maxOutputTokens + safetyMargin;
  let remainingBudget = contextWindow - fixedTokens;

  // Priority 1: Recent conversation history (last N turns)
  const includedHistory = [];
  const reversedHistory = [...conversationHistory].reverse();
  let historyTotal = 0;
  for (const msg of reversedHistory) {
    const msgTokens = enc.encode(msg.content).length + overheadPerMessage;
    if (historyTotal + msgTokens > remainingBudget * 0.4) break; // Max 40% of remaining
    includedHistory.unshift(msg);
    historyTotal += msgTokens;
  }
  remainingBudget -= historyTotal;

  // Priority 2: RAG documents (most relevant first)
  const includedDocs = [];
  let ragTotal = 0;
  for (const doc of ragDocuments) { // Assumed pre-sorted by relevance
    const docTokens = enc.encode(doc.content).length;
    if (ragTotal + docTokens > remainingBudget) break;
    includedDocs.push(doc);
    ragTotal += docTokens;
  }

  enc.free();
  return {
    systemPrompt,
    fewShotExamples,
    conversationHistory: includedHistory,
    ragDocuments: includedDocs,
    userMessage,
    maxOutputTokens,
    totalInputTokens: contextWindow - remainingBudget + ragTotal - maxOutputTokens - safetyMargin,
    headroom: remainingBudget - ragTotal
  };
}
5. Overflow Prevention Strategies
When conversations grow long or RAG retrieval returns large documents, you risk exceeding the context window. Here are the primary strategies:
Strategy 1: Sliding window (drop oldest messages)
function slidingWindow(messages, maxHistoryTokens, enc) {
  // Keep system message + trim history from the beginning
  const system = messages[0]; // Always keep system
  const history = messages.slice(1);
  const trimmed = [];
  let tokenCount = 0;
  // Walk backwards — keep most recent messages
  for (let i = history.length - 1; i >= 0; i--) {
    const msgTokens = enc.encode(history[i].content).length;
    if (tokenCount + msgTokens > maxHistoryTokens) break;
    trimmed.unshift(history[i]);
    tokenCount += msgTokens;
  }
  return [system, ...trimmed];
}
Pros: Simple, preserves recent context. Cons: Loses early context — model forgets the beginning of the conversation.
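The walk-backwards loop generalizes to any "greedy fit under a budget" problem. A self-contained sketch, using a word count as a crude stand-in for a tokenizer (both helper names are illustrative):

```javascript
// Greedy fit: take items from the end (most recent first) until the budget
// is exhausted. `cost(item)` returns the token count of one item.
function fitFromEnd(items, budget, cost) {
  const kept = [];
  let total = 0;
  for (let i = items.length - 1; i >= 0; i--) {
    const c = cost(items[i]);
    if (total + c > budget) break;
    kept.unshift(items[i]);
    total += c;
  }
  return kept;
}

// Word-count stand-in for a real tokenizer, for illustration only
const wordCost = (msg) => msg.content.split(/\s+/).length;

const history = [
  { role: 'user', content: 'first question here' },    // 3 "tokens"
  { role: 'assistant', content: 'first answer here' }, // 3
  { role: 'user', content: 'second question here' },   // 3
  { role: 'assistant', content: 'second answer here' } // 3
];
// Budget of 7: the latest turn (6 "tokens") fits; a third message would overflow
console.log(fitFromEnd(history, 7, wordCost).length); // 2
```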
Strategy 2: Summarize older messages
async function summarizeHistory(oldMessages, openai) {
  const historyText = oldMessages
    .map(m => `${m.role}: ${m.content}`)
    .join('\n');
  const summary = await openai.chat.completions.create({
    model: "gpt-4o-mini", // Use cheaper model for summarization
    messages: [
      { role: "system", content: "Summarize this conversation in under 200 words. Preserve key facts, decisions, and user preferences." },
      { role: "user", content: historyText }
    ],
    max_tokens: 300
  });
  return {
    role: "system",
    content: `Previous conversation summary: ${summary.choices[0].message.content}`
  };
}

// Usage: replace old messages with a summary
async function manageHistory(messages, maxTokens, enc, openai) {
  const totalTokens = messages.reduce(
    (sum, m) => sum + enc.encode(m.content).length, 0
  );
  if (totalTokens <= maxTokens) return messages;
  const system = messages[0];
  const recentCount = 6; // Keep last 3 turns (6 messages)
  const recent = messages.slice(-recentCount);
  const old = messages.slice(1, -recentCount);
  if (old.length === 0) return messages; // Nothing old enough to summarize
  const summaryMsg = await summarizeHistory(old, openai);
  return [system, summaryMsg, ...recent];
}
Pros: Preserves key information from the entire conversation. Cons: Summarization costs tokens/money and may lose details.
Strategy 3: Truncate RAG documents
function truncateDocuments(documents, maxTotalTokens, enc) {
  const truncated = [];
  let totalTokens = 0;
  for (const doc of documents) {
    const docTokens = enc.encode(doc.content).length;
    if (totalTokens + docTokens <= maxTotalTokens) {
      truncated.push(doc);
      totalTokens += docTokens;
    } else {
      // Partial inclusion — truncate this document to fit
      const remainingBudget = maxTotalTokens - totalTokens;
      if (remainingBudget > 100) { // Only include if meaningful content fits
        const tokens = enc.encode(doc.content).slice(0, remainingBudget);
        // tiktoken's decode() returns UTF-8 bytes, so convert them to a string
        const partialContent = new TextDecoder().decode(enc.decode(tokens));
        truncated.push({ ...doc, content: partialContent + '...[truncated]' });
      }
      break;
    }
  }
  return truncated;
}
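If tiktoken isn't available, the same idea can be approximated with the 4-characters-per-token rule from Section 6; a rough sketch (token boundaries won't line up exactly, and the helper name is illustrative):

```javascript
// Approximate document truncation using 1 token ≈ 4 characters (English text)
function truncateByChars(documents, maxTotalTokens) {
  const kept = [];
  let usedTokens = 0;
  for (const doc of documents) {
    const docTokens = Math.ceil(doc.content.length / 4);
    if (usedTokens + docTokens <= maxTotalTokens) {
      kept.push(doc);
      usedTokens += docTokens;
    } else {
      // Partial inclusion, measured in characters instead of tokens
      const remainingChars = (maxTotalTokens - usedTokens) * 4;
      if (remainingChars > 400) { // only keep a meaningful chunk (~100 tokens)
        kept.push({ ...doc, content: doc.content.slice(0, remainingChars) + '...[truncated]' });
      }
      break;
    }
  }
  return kept;
}

const docs = [
  { id: 1, content: 'a'.repeat(400) },  // ≈100 estimated tokens
  { id: 2, content: 'b'.repeat(4000) }  // ≈1,000 estimated tokens
];
console.log(truncateByChars(docs, 300).length); // 2 — second doc partially included
```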
Strategy 4: Pre-request validation
function validateRequest(messages, contextWindow, maxOutputTokens, enc) {
  const inputTokens = messages.reduce(
    (sum, m) => sum + enc.encode(m.content).length + 4, 0 // +4 for overhead per message
  );
  const totalRequired = inputTokens + maxOutputTokens;
  if (totalRequired > contextWindow) {
    return {
      valid: false,
      inputTokens,
      overflow: totalRequired - contextWindow,
      message: `Request would use ${totalRequired} tokens but window is ${contextWindow}. ` +
        `Reduce input by ${totalRequired - contextWindow} tokens.`
    };
  }
  return {
    valid: true,
    inputTokens,
    headroom: contextWindow - totalRequired,
    utilizationPercent: ((totalRequired / contextWindow) * 100).toFixed(1)
  };
}
6. Token Counting in Practice
Using tiktoken (OpenAI models)
import { encoding_for_model } from 'tiktoken';

function countMessageTokens(messages, model = 'gpt-4o') {
  const enc = encoding_for_model(model);
  let totalTokens = 0;
  // Every message has overhead for role formatting
  // For gpt-4o: each message costs 3 extra tokens + 1 for role name
  const tokensPerMessage = 4;
  for (const message of messages) {
    totalTokens += tokensPerMessage;
    totalTokens += enc.encode(message.content).length;
    if (message.name) {
      totalTokens += enc.encode(message.name).length;
    }
  }
  totalTokens += 3; // Every reply is primed with <|start|>assistant<|message|>
  enc.free();
  return totalTokens;
}

// Usage
const messages = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Hello, how are you?" }
];
console.log(countMessageTokens(messages));
// Approximately 22 tokens
Quick estimation without libraries
function estimateTokens(text) {
  // Rule of thumb: 1 token ≈ 4 characters for English
  return Math.ceil(text.length / 4);
}

function estimateMessagesTokens(messages) {
  const contentTokens = messages.reduce(
    (sum, m) => sum + estimateTokens(m.content), 0
  );
  const overhead = messages.length * 4 + 3; // Per-message overhead + reply primer
  return contentTokens + overhead;
}
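A quick sanity check of the estimator against the two-message example from the tiktoken section (the estimator is restated here so the snippet stands alone):

```javascript
// Character-based estimate: 1 token ≈ 4 characters
const estimateTokens = (text) => Math.ceil(text.length / 4);

const sample = [
  { role: 'system', content: 'You are a helpful assistant.' }, // 28 chars → 7
  { role: 'user', content: 'Hello, how are you?' }             // 19 chars → 5
];
const contentTokens = sample.reduce((sum, m) => sum + estimateTokens(m.content), 0); // 12
const total = contentTokens + sample.length * 4 + 3; // + per-message overhead and reply primer
console.log(total); // 23 — close to the exact tiktoken count (~22)
```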
Using the API response (post-hoc counting)
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [...],
  max_tokens: 1000
});

// The API tells you exactly how many tokens were used
console.log(response.usage);
// {
//   prompt_tokens: 156,      ← Input tokens (what you sent)
//   completion_tokens: 89,   ← Output tokens (what the model generated)
//   total_tokens: 245        ← Sum of both
// }
7. Common max_tokens Settings by Use Case
| Use Case | Recommended max_tokens | Reasoning |
|---|---|---|
| Classification (yes/no) | 5-10 | Only need a single word |
| JSON extraction | 200-500 | Structured, predictable length |
| Short answers (Q&A) | 300-500 | Concise responses |
| Chatbot response | 1,000-2,000 | Conversational paragraphs |
| Code generation | 2,000-4,000 | Functions can be lengthy |
| Long-form writing | 4,000-8,000 | Essays, articles |
| Document summarization | 500-2,000 | Depends on desired summary length |
Cost implication: Setting max_tokens: 16000 when you only need 200 doesn't cost more (you pay for actual tokens generated), but it allows the model to ramble. Tight limits encourage concise output and prevent runaway responses.
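One way to keep these caps consistent across a codebase is a single lookup; the values mirror the table above, and the map and function names are illustrative:

```javascript
// Per-use-case output caps, mirroring the table above
const MAX_TOKENS_BY_USE_CASE = {
  classification: 10,
  json_extraction: 500,
  short_answer: 500,
  chatbot: 2000,
  code_generation: 4000,
  long_form: 8000,
  summarization: 2000
};

function maxTokensFor(useCase, fallback = 1000) {
  // Fall back to a conservative default for unknown use cases
  return MAX_TOKENS_BY_USE_CASE[useCase] ?? fallback;
}

console.log(maxTokensFor('classification')); // 10
console.log(maxTokensFor('unknown_case'));   // 1000
```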
8. Advanced: Token Budget Monitoring Dashboard
In production, track these metrics:
┌────────────────────────────────────────────────────────┐
│  Token Budget Dashboard                                │
│                                                        │
│  Avg input tokens/request:      2,340  ▓▓▓▓▓░░░░░      │
│  Avg output tokens/request:       680  ▓▓░░░░░░░░      │
│  Avg total tokens/request:      3,020  ▓▓▓░░░░░░░      │
│  Context utilization:            2.4%  (healthy)       │
│                                                        │
│  p95 input tokens:             12,800  ▓▓▓▓▓▓▓░░░      │
│  p99 input tokens:             45,200  ▓▓▓▓▓▓▓▓▓░      │
│  Max observed:                 98,400  ▓▓▓▓▓▓▓▓▓▓      │
│                                                        │
│  Truncation events (length):     1.2%  (target < 2%)   │
│  Overflow errors:               0.01%  (target < 0.1%) │
│  History summarizations:         8.4%  (normal)        │
│                                                        │
│  Budget alerts:                                        │
│  [!]  3 requests hit 90%+ utilization in last hour     │
│  [OK] No overflow errors in last 24h                   │
└────────────────────────────────────────────────────────┘
Track and alert on:
| Metric | Alert Threshold | Action |
|---|---|---|
| Truncation rate (finish_reason: "length") | > 2% | Increase max_tokens or reduce input |
| Context overflow errors | > 0.1% | Implement more aggressive history trimming |
| Average context utilization | > 80% of window | You're close to limits — optimize inputs |
| p99 input tokens near window | > 90% of window | Long conversations need summarization |
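The truncation-rate alert in the first row can be tracked with a small in-process counter; a sketch (in production you would emit these metrics to your monitoring backend instead):

```javascript
// Rolling counters for the finish_reason-based alert above
class TokenBudgetMonitor {
  constructor(truncationAlertRate = 0.02) { // alert target: 2%
    this.total = 0;
    this.truncated = 0;
    this.alertRate = truncationAlertRate;
  }
  record(finishReason) {
    this.total += 1;
    if (finishReason === 'length') this.truncated += 1;
  }
  truncationRate() {
    return this.total === 0 ? 0 : this.truncated / this.total;
  }
  shouldAlert() {
    return this.truncationRate() > this.alertRate;
  }
}

const monitor = new TokenBudgetMonitor();
for (let i = 0; i < 97; i++) monitor.record('stop');
for (let i = 0; i < 3; i++) monitor.record('length');
console.log(monitor.truncationRate()); // 0.03
console.log(monitor.shouldAlert());    // true — above the 2% target
```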
9. Key Takeaways
- Context window = input + output — they share a fixed budget. Every token of input leaves one less token for output.
- `max_tokens` controls output only — it caps how much the model can generate, not how much you can send.
- Always check `finish_reason` — `"length"` means the response was cut off mid-sentence.
- Budget formula: `available_output = context_window - input_tokens - safety_margin`.
- Dynamic allocation is essential for multi-turn conversations — use sliding windows, summarization, or document truncation to stay within limits.
- Count tokens before sending — pre-validate requests to prevent overflow errors and wasted API calls.
Explain-It Challenge
- A product manager asks: "Why can't the chatbot just remember the whole conversation forever?" Explain context window limits and token budgeting in non-technical terms.
- Your chatbot works perfectly for 5-turn conversations but breaks at 20+ turns with truncated responses. Diagnose the problem and propose a fix.
- Calculate the token budget for a RAG system using GPT-4o (128K window) with: 1,200-token system prompt, 3 few-shot examples at 400 tokens each, up to 20 turns of history averaging 300 tokens each, and a desired 4,096-token max output. How many tokens are available for RAG documents?
Navigation: ← 4.2.a — Message Roles · 4.2.c — Cost Awareness →