Episode 4 — Generative AI Engineering / 4.1 — How LLMs Actually Work

4.1.a — Tokens and Tokenization

In one sentence: Every piece of text you send to an LLM is broken into tokens — small chunks of characters — which are the fundamental unit of input, output, cost, and context limits in every LLM-based system.

Navigation: ← 4.1 Overview · 4.1.b — Context Window →


1. What Is a Token?

A token is the smallest unit of text that a language model processes. When you type "Hello, world!" into ChatGPT or call the OpenAI API, the model doesn't see letters or words — it sees tokens. A token might be a whole word, part of a word, a single character, or even a punctuation mark.

Input:  "Hello, world!"
Tokens: ["Hello", ",", " world", "!"]
Count:  4 tokens

Input:  "Tokenization is fundamental"
Tokens: ["Token", "ization", " is", " fundamental"]
Count:  4 tokens

Input:  "I love JavaScript"
Tokens: ["I", " love", " JavaScript"]
Count:  3 tokens

Notice that "Tokenization" is split into two tokens: "Token" and "ization". Common words like "love" or "JavaScript" are single tokens, while rare or compound words get split into subword pieces. This is by design.
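To make the splitting concrete, here is a toy greedy tokenizer in JavaScript. The vocabulary is invented for illustration, and real tokenizers use learned merge rules rather than longest-match lookup, but the subword-splitting effect is the same:

```javascript
// Toy greedy tokenizer: at each position, take the longest vocabulary
// entry that matches; fall back to a single character if nothing does.
// The vocabulary below is invented for illustration.
const vocab = new Set(["Token", "ization", " is", " fundamental", "I", " love", " JavaScript"]);

function greedyTokenize(text, vocab) {
  const tokens = [];
  let i = 0;
  while (i < text.length) {
    let match = null;
    // Try the longest possible substring first.
    for (let len = text.length - i; len >= 1; len--) {
      const candidate = text.slice(i, i + len);
      if (vocab.has(candidate)) { match = candidate; break; }
    }
    // Fall back to a single character: tokenization never fails.
    if (match === null) match = text[i];
    tokens.push(match);
    i += match.length;
  }
  return tokens;
}

console.log(greedyTokenize("Tokenization is fundamental", vocab));
// ["Token", "ization", " is", " fundamental"]
```

The single-character fallback is why there are no "unknown word" errors: any string can always be reduced to pieces the vocabulary contains.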


2. Why Everything Becomes Tokens

Language models are mathematical systems. They can't process raw text — they need numbers. Tokenization is the bridge:

Text → Tokens → Token IDs (numbers) → Model processes numbers → Output Token IDs → Tokens → Text

"Hello world" → ["Hello", " world"] → [15496, 1917] → [Model] → [2061] → ["How"] → "How"

Every operation in an LLM happens in token space, not character space or word space. This has enormous practical implications:

  1. Billing — You pay per token, not per word or character. OpenAI charges separately for input tokens and output tokens.
  2. Context limits — The context window is measured in tokens, not characters. GPT-4o offers a 128K-token window; recent Claude models offer 200K.
  3. Speed — Generation speed is measured in tokens per second. Streaming delivers one token at a time.
  4. Prompt design — A poorly written prompt wastes tokens. A well-designed prompt maximizes information per token.
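The text → IDs → text pipeline can be sketched in a few lines. The four-entry vocabulary and its IDs below are made up; real models map 50K-200K tokens:

```javascript
// Minimal sketch of the text → tokens → IDs pipeline.
// The vocabulary and IDs here are invented for illustration.
const tokenToId = new Map([["Hello", 0], [",", 1], [" world", 2], ["!", 3]]);
const idToToken = new Map([...tokenToId].map(([tok, id]) => [id, tok]));

// Tokens → numbers the model can actually process.
function encode(tokens) {
  return tokens.map(tok => tokenToId.get(tok));
}

// Numbers → text, for reading the model's output.
function decode(ids) {
  return ids.map(id => idToToken.get(id)).join("");
}

const ids = encode(["Hello", ",", " world", "!"]);
console.log(ids);         // [0, 1, 2, 3]
console.log(decode(ids)); // "Hello, world!"
```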

3. How Tokenizers Work: BPE (Byte Pair Encoding)

The most common tokenization algorithm is Byte Pair Encoding (BPE), used by GPT models, Claude, and most modern LLMs. Here's how it works conceptually:

Step 1: Start with individual characters

Vocabulary: [a, b, c, d, ..., z, A, B, ..., Z, 0, ..., 9, !, ?, ., ...]

Step 2: Find the most frequent pair and merge

Training text: "the cat sat on the mat the cat"

Most frequent pair: "a" + "t" → "at"  (4 occurrences: "cat" ×2, "sat", "mat")
New vocabulary: [..., "at"]
Text becomes: "t h e c at s at o n t h e m at t h e c at"

Next most frequent: "t" + "h" → "th"
New vocabulary: [..., "at", "th"]
Text becomes: "th e c at s at o n th e m at th e c at"

Then: "th" + "e" → "the"
New vocabulary: [..., "at", "th", "the"]
Text becomes: "the c at s at o n the m at the c at"

Step 3: Repeat thousands of times

After tens of thousands of merges, common words become single tokens, common subwords become tokens, and rare strings stay as individual characters or small pieces.

Final result:
"the"         → 1 token  (very common → merged early)
"cat"         → 1 token  (common word)
"JavaScript"  → 1 token  (common in training data)
"defenestrate"→ 3 tokens (rare → "def", "en", "estrate")
"asdfghjkl"   → 9 tokens (random → each character is a token)

Why BPE is brilliant

  • Common words = 1 token → efficient
  • Rare words = multiple tokens → still representable
  • Any text can be tokenized → no "unknown word" errors
  • Multilingual → works across languages (though English is usually more efficient)
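The merge loop above can be sketched in JavaScript. This toy trainer counts every adjacent pair, spaces included, so its exact merge order may differ slightly from a hand-worked example; it also skips the byte-level input, merge ranks, and pre-tokenization that production BPE uses:

```javascript
// Toy BPE trainer: start from single characters, then repeatedly merge
// the most frequent adjacent pair. Simplified for illustration only.
function trainBPE(text, numMerges) {
  let tokens = [...text]; // step 1: individual characters (spaces included)
  const merges = [];
  for (let m = 0; m < numMerges; m++) {
    // Count every adjacent pair of tokens.
    const counts = new Map();
    for (let i = 0; i < tokens.length - 1; i++) {
      const pair = tokens[i] + "\u0000" + tokens[i + 1];
      counts.set(pair, (counts.get(pair) || 0) + 1);
    }
    // Pick the most frequent pair (must occur at least twice).
    let best = null;
    let bestCount = 1;
    for (const [pair, count] of counts) {
      if (count > bestCount) { best = pair; bestCount = count; }
    }
    if (best === null) break; // nothing left worth merging
    const [a, b] = best.split("\u0000");
    merges.push(a + b);
    // Step 2: apply the merge everywhere in the token stream.
    const next = [];
    for (let i = 0; i < tokens.length; i++) {
      if (i < tokens.length - 1 && tokens[i] === a && tokens[i + 1] === b) {
        next.push(a + b);
        i++; // skip the second half of the merged pair
      } else {
        next.push(tokens[i]);
      }
    }
    tokens = next;
  }
  return { tokens, merges };
}

const { merges } = trainBPE("the cat sat on the mat the cat", 3);
console.log(merges); // ["at", "th", "the"]
```

Run with tens of thousands of merges on a large corpus instead of three merges on one sentence, and the learned vocabulary starts to look like a real tokenizer's.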

4. Token Counts: Rules of Thumb

For English text with GPT-style tokenizers:

| Approximation | Example |
| --- | --- |
| 1 token ≈ 4 characters | "Hello" = ~1.25 tokens |
| 1 token ≈ ¾ of a word | 100 words ≈ 75 tokens |
| 1 word ≈ 1.3 tokens | 1000 words ≈ 1300 tokens |
| 1 page of text ≈ 500-800 tokens | A typical A4 page |
| 1 line of code ≈ 10-20 tokens | const x = arr.filter(n => n > 0); ≈ 14 tokens |
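Both rules of thumb are easy to encode as quick estimators. These are rough, English-only heuristics; use a real tokenizer when the count matters for billing:

```javascript
// Two quick estimators based on common rules of thumb.
function estimateTokensByChars(text) {
  return Math.ceil(text.length / 4); // 1 token ≈ 4 characters
}

function estimateTokensByWords(text) {
  const words = text.trim().split(/\s+/).length;
  return Math.ceil(words * 1.3); // 1 word ≈ 1.3 tokens
}

const sample = "Tokenization is the first step in every LLM pipeline.";
console.log(estimateTokensByChars(sample)); // 14
console.log(estimateTokensByWords(sample)); // 12
```

When the two estimates disagree, the truth is usually somewhere in between; both drift badly on code, JSON, and non-English text.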

Non-English text is often less efficient:

English:  "Hello"         → 1 token
Spanish:  "Hola"          → 1 token
Japanese: "こんにちは"     → 3-5 tokens (each character may be a token)
Arabic:   "مرحبا"         → 3-5 tokens
Emoji:    "😀"            → 1-2 tokens

5. Tokenizer Differences Across Models

Different models use different tokenizers with different vocabulary sizes:

| Model Family | Tokenizer | Vocabulary Size | Notes |
| --- | --- | --- | --- |
| GPT-4o / GPT-4o mini | o200k_base | ~200,000 tokens | Most efficient for code and multilingual text |
| GPT-3.5 | cl100k_base | 100,256 tokens | |
| GPT-2 / GPT-3 | r50k_base | 50,257 tokens | Older, less efficient |
| Claude 3/3.5/4 | Custom BPE | ~100K+ tokens | Anthropic's proprietary tokenizer |
| Llama 3 | Custom BPE | 128,256 tokens | Meta's tokenizer |

Key insight: The same text produces different token counts with different models. Always use the correct tokenizer when counting tokens for a specific API.


6. Counting Tokens in Practice

Using the tiktoken library (OpenAI models)

import { encoding_for_model } from 'tiktoken';

const enc = encoding_for_model('gpt-4o');
const tokens = enc.encode('Hello, how are you?');
console.log(tokens);        // Uint32Array of token IDs (exact values depend on the model's tokenizer)
console.log(tokens.length); // 6 tokens

// Decode back to text (decode() returns UTF-8 bytes, not a string)
const text = new TextDecoder().decode(enc.decode(tokens));
console.log(text); // "Hello, how are you?"

enc.free(); // Free the encoder when done (WASM resource)

Quick estimation without a library

// Rule of thumb: ~4 characters per token for English
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

console.log(estimateTokens('Hello, how are you?')); // ~5 (actual: 6)

Using the OpenAI Tokenizer web tool

Visit the OpenAI Tokenizer playground to see how text is split into tokens; each token is highlighted in a different color.


7. Why Token Awareness Matters for Engineers

Cost control

GPT-4o pricing (example):
  Input:  $2.50 per 1M tokens
  Output: $10.00 per 1M tokens

A chatbot with 2000-token system prompt:
  2000 tokens × $2.50/1M = $0.005 per request (just for system prompt)
  × 100,000 users per day = $500/day just for system prompts

Optimize the system prompt from 2000 to 500 tokens:
  500 tokens × $2.50/1M = $0.00125 per request
  × 100,000 users per day = $125/day
  
  Savings: $375/day = $11,250/month
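The cost arithmetic above can be packaged as a small helper. The $2.50 per 1M input tokens is the example figure from this section, not a live rate:

```javascript
// Daily cost of sending a fixed system prompt with every request.
const PRICE_PER_MILLION_INPUT = 2.5; // USD per 1M input tokens (example figure)

function dailySystemPromptCost(promptTokens, requestsPerDay) {
  // Multiply before dividing to keep the arithmetic exact for round numbers.
  return (promptTokens * PRICE_PER_MILLION_INPUT * requestsPerDay) / 1_000_000;
}

console.log(dailySystemPromptCost(2000, 100_000)); // 500  ($/day)
console.log(dailySystemPromptCost(500, 100_000));  // 125  ($/day)
```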

Context window management

Claude context window: 200,000 tokens

If your RAG pipeline retrieves 50 documents × 2000 tokens each = 100,000 tokens
  + System prompt: 1,000 tokens
  + User message: 500 tokens
  + Reserved for output: 4,000 tokens
  = 105,500 tokens used out of 200,000

That leaves 94,500 tokens of headroom. But if you retrieve 120 documents:
  120 × 2000 = 240,000 tokens → EXCEEDS context window → request fails
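A sketch of this budget check in code. The 200K limit mirrors the example; substitute your model's actual context window:

```javascript
// Token-budget check for a RAG request: documents + prompts + output reserve.
const CONTEXT_LIMIT = 200_000; // example: a 200K-token context window

function fitsInContext({ docTokens, systemTokens, userTokens, outputReserve }) {
  const total = docTokens + systemTokens + userTokens + outputReserve;
  return { total, fits: total <= CONTEXT_LIMIT, headroom: CONTEXT_LIMIT - total };
}

// 50 documents × 2000 tokens each, plus prompts and output reserve:
console.log(fitsInContext({ docTokens: 50 * 2000, systemTokens: 1000, userTokens: 500, outputReserve: 4000 }));
// { total: 105500, fits: true, headroom: 94500 }

// 120 documents blow the budget:
console.log(fitsInContext({ docTokens: 120 * 2000, systemTokens: 1000, userTokens: 500, outputReserve: 4000 }).fits);
// false
```

Checking the budget before sending the request is cheaper than handling the API error after; a pipeline can drop or truncate documents until the total fits.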

Prompt engineering

// BAD: Wastes tokens with verbose instruction
"I would really appreciate it if you could please take a moment to
summarize the following text in a concise manner, making sure to
include only the most important points..."
→ ~30 tokens for the instruction alone

// GOOD: Direct and efficient
"Summarize this text in 3 bullet points:"
→ ~8 tokens for the instruction

8. Special Tokens

Every tokenizer includes special tokens that are not regular text but control the model's behavior:

<|endoftext|>     → Signals end of a document/conversation
<|im_start|>      → Start of a message (chat format)
<|im_end|>        → End of a message (chat format)
<|system|>        → System message marker
<|user|>          → User message marker
<|assistant|>     → Assistant message marker

These special tokens are added automatically by the API — you don't type them. But they still consume space in your context window: a multi-turn conversation with 20 messages carries roughly 40 special tokens of overhead just from message formatting.
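A rough way to account for this overhead when budgeting a conversation. This assumes about 2 special tokens per message (a start and an end marker), as in the estimate above; real chat templates vary by model:

```javascript
// Total conversation size = message content plus per-message formatting overhead.
function conversationTokens(messageContentTokens, overheadPerMessage = 2) {
  // messageContentTokens: pre-counted token length of each message body
  return messageContentTokens.reduce(
    (sum, contentTokens) => sum + contentTokens + overheadPerMessage,
    0
  );
}

// Three messages of 120, 85, and 200 content tokens:
console.log(conversationTokens([120, 85, 200])); // 411 (405 content + 6 overhead)
```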


9. Common Gotchas

| Gotcha | Explanation |
| --- | --- |
| Spaces are tokens | A leading space matters: " Hello" tokenizes differently from "Hello" |
| Code is expensive | Code uses many short tokens (brackets, operators, etc.) — more tokens per line than prose |
| JSON is expensive | {"key": "value"} has many structural tokens (braces, colons, quotes) |
| Numbers are tricky | "1000" might be 1 token, but "1000000" might be 2-3 tokens |
| Newlines are tokens | Every \n is typically its own token |
| Repeated text | Repeating the same word 100 times costs ~100 tokens (no compression) |

10. Key Takeaways

  1. Tokens are the fundamental unit — billing, context limits, speed, and prompt design all operate in token space.
  2. BPE tokenization splits text into subword pieces — common words are single tokens, rare words are multiple tokens.
  3. 1 token ≈ 4 characters in English — use this for quick estimation.
  4. Different models use different tokenizers — always count tokens with the right tokenizer for your target model.
  5. Token efficiency directly impacts cost — a poorly designed prompt can waste thousands of dollars at scale.

Explain-It Challenge

  1. A user asks "why did my 10-page PDF break the API?" — explain using tokens and context windows.
  2. Why does the same sentence cost more to process in Japanese than in English?
  3. If GPT-4o costs $2.50 per million input tokens, how much does a 500-word system prompt cost per API call?

Navigation: ← 4.1 Overview · 4.1.b — Context Window →