Episode 4 — Generative AI Engineering / 4.1 — How LLMs Actually Work

4.1.a — Tokens and Tokenization

In one sentence: Every piece of text you send to an LLM is broken into tokens — small chunks of characters — which are the fundamental unit of input, output, cost, and context limits in every LLM-based system.

Navigation: ← 4.1 Overview · 4.1.b — Context Window →


1. What Is a Token?

A token is the smallest unit of text that a language model processes. When you type "Hello, world!" into ChatGPT or call the OpenAI API, the model doesn't see letters or words — it sees tokens. A token might be a whole word, part of a word, a single character, or even a punctuation mark.

Input:  "Hello, world!"
Tokens: ["Hello", ",", " world", "!"]
Count:  4 tokens

Input:  "Tokenization is fundamental"
Tokens: ["Token", "ization", " is", " fundamental"]
Count:  4 tokens

Input:  "I love JavaScript"
Tokens: ["I", " love", " JavaScript"]
Count:  3 tokens

Notice that "Tokenization" is split into two tokens: "Token" and "ization". Common words like "love" or "JavaScript" are single tokens, while rare or compound words get split into subword pieces. This is by design.
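To make the splitting concrete, here is a toy greedy tokenizer in JavaScript. The vocabulary is invented for illustration, and real tokenizers use learned merge rules rather than longest-match lookup, but the subword-splitting effect is the same:

```javascript
// Toy greedy tokenizer: at each position, take the longest vocabulary
// entry that matches; fall back to a single character if nothing does.
// The vocabulary below is invented for illustration.
const vocab = new Set(["Token", "ization", " is", " fundamental", "I", " love", " JavaScript"]);

function greedyTokenize(text, vocab) {
  const tokens = [];
  let i = 0;
  while (i < text.length) {
    let match = null;
    // Try the longest possible substring first.
    for (let len = text.length - i; len >= 1; len--) {
      const candidate = text.slice(i, i + len);
      if (vocab.has(candidate)) { match = candidate; break; }
    }
    // Fall back to a single character: tokenization never fails.
    if (match === null) match = text[i];
    tokens.push(match);
    i += match.length;
  }
  return tokens;
}

console.log(greedyTokenize("Tokenization is fundamental", vocab));
// ["Token", "ization", " is", " fundamental"]
```

The single-character fallback is why there are no "unknown word" errors: any string can always be reduced to pieces the vocabulary contains.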


2. Why Everything Becomes Tokens

Language models are mathematical systems. They can't process raw text — they need numbers. Tokenization is the bridge:

Text → Tokens → Token IDs (numbers) → Model processes numbers → Output Token IDs → Tokens → Text

"Hello world" → ["Hello", " world"] → [15496, 1917] → [Model] → [2061] → ["How"] → "How"

Every operation in an LLM happens in token space, not character space or word space. This has enormous practical implications:

  1. Billing — You pay per token, not per word or character. OpenAI charges separately for input tokens and output tokens.
  2. Context limits — The context window is measured in tokens, not characters. GPT-4o offers a 128K-token window; recent Claude models offer 200K.
  3. Speed — Generation speed is measured in tokens per second. Streaming delivers one token at a time.
  4. Prompt design — A poorly written prompt wastes tokens. A well-designed prompt maximizes information per token.
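The text → IDs → text pipeline can be sketched in a few lines. The four-entry vocabulary and its IDs below are made up; real models map 50K-200K tokens:

```javascript
// Minimal sketch of the text → tokens → IDs pipeline.
// The vocabulary and IDs here are invented for illustration.
const tokenToId = new Map([["Hello", 0], [",", 1], [" world", 2], ["!", 3]]);
const idToToken = new Map([...tokenToId].map(([tok, id]) => [id, tok]));

// Tokens → numbers the model can actually process.
function encode(tokens) {
  return tokens.map(tok => tokenToId.get(tok));
}

// Numbers → text, for reading the model's output.
function decode(ids) {
  return ids.map(id => idToToken.get(id)).join("");
}

const ids = encode(["Hello", ",", " world", "!"]);
console.log(ids);         // [0, 1, 2, 3]
console.log(decode(ids)); // "Hello, world!"
```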

3. How Tokenizers Work: BPE (Byte Pair Encoding)

The most common tokenization algorithm is Byte Pair Encoding (BPE), used by GPT models, Claude, and most modern LLMs. Here's how it works conceptually:

Step 1: Start with individual characters

Vocabulary: [a, b, c, d, ..., z, A, B, ..., Z, 0, ..., 9, !, ?, ., ...]

Step 2: Find the most frequent pair and merge

Training text: "the cat sat on the mat the cat"

Most frequent pair: "a" + "t" → "at"  (4 occurrences: "cat" ×2, "sat", "mat")
New vocabulary: [..., "at"]
Text becomes: "t h e c at s at o n t h e m at t h e c at"

Next most frequent: "t" + "h" → "th"
New vocabulary: [..., "at", "th"]
Text becomes: "th e c at s at o n th e m at th e c at"

Then: "th" + "e" → "the"
New vocabulary: [..., "at", "th", "the"]
Text becomes: "the c at s at o n the m at the c at"

Step 3: Repeat thousands of times

After tens of thousands of merges, common words become single tokens, common subwords become tokens, and rare strings stay as individual characters or small pieces.

Final result:
"the"         → 1 token  (very common → merged early)
"cat"         → 1 token  (common word)
"JavaScript"  → 1 token  (common in training data)
"defenestrate"→ 3 tokens (rare → "def", "en", "estrate")
"asdfghjkl"   → 9 tokens (random → each character is a token)

Why BPE is brilliant

  • Common words = 1 token → efficient
  • Rare words = multiple tokens → still representable
  • Any text can be tokenized → no "unknown word" errors
  • Multilingual → works across languages (though English is usually more efficient)
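The merge loop above can be sketched in JavaScript. This toy trainer counts every adjacent pair, spaces included, so its exact merge order may differ slightly from a hand-worked example; it also skips the byte-level input, merge ranks, and pre-tokenization that production BPE uses:

```javascript
// Toy BPE trainer: start from single characters, then repeatedly merge
// the most frequent adjacent pair. Simplified for illustration only.
function trainBPE(text, numMerges) {
  let tokens = [...text]; // step 1: individual characters (spaces included)
  const merges = [];
  for (let m = 0; m < numMerges; m++) {
    // Count every adjacent pair of tokens.
    const counts = new Map();
    for (let i = 0; i < tokens.length - 1; i++) {
      const pair = tokens[i] + "\u0000" + tokens[i + 1];
      counts.set(pair, (counts.get(pair) || 0) + 1);
    }
    // Pick the most frequent pair (must occur at least twice).
    let best = null;
    let bestCount = 1;
    for (const [pair, count] of counts) {
      if (count > bestCount) { best = pair; bestCount = count; }
    }
    if (best === null) break; // nothing left worth merging
    const [a, b] = best.split("\u0000");
    merges.push(a + b);
    // Step 2: apply the merge everywhere in the token stream.
    const next = [];
    for (let i = 0; i < tokens.length; i++) {
      if (i < tokens.length - 1 && tokens[i] === a && tokens[i + 1] === b) {
        next.push(a + b);
        i++; // skip the second half of the merged pair
      } else {
        next.push(tokens[i]);
      }
    }
    tokens = next;
  }
  return { tokens, merges };
}

const { merges } = trainBPE("the cat sat on the mat the cat", 3);
console.log(merges); // ["at", "th", "the"]
```

Run with tens of thousands of merges on a large corpus instead of three merges on one sentence, and the learned vocabulary starts to look like a real tokenizer's.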

4. Token Counts: Rules of Thumb

For English text with GPT-style tokenizers:

| Approximation | Example |
| --- | --- |
| 1 token ≈ 4 characters | "Hello" = ~1.25 tokens |
| 1 token ≈ ¾ of a word | 100 words ≈ 75 tokens |
| 1 word ≈ 1.3 tokens | 1000 words ≈ 1300 tokens |
| 1 page of text ≈ 500-800 tokens | A typical A4 page |
| 1 line of code ≈ 10-20 tokens | const x = arr.filter(n => n > 0); ≈ 14 tokens |
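Both rules of thumb are easy to encode as quick estimators. These are rough, English-only heuristics; use a real tokenizer when the count matters for billing:

```javascript
// Two quick estimators based on common rules of thumb.
function estimateTokensByChars(text) {
  return Math.ceil(text.length / 4); // 1 token ≈ 4 characters
}

function estimateTokensByWords(text) {
  const words = text.trim().split(/\s+/).length;
  return Math.ceil(words * 1.3); // 1 word ≈ 1.3 tokens
}

const sample = "Tokenization is the first step in every LLM pipeline.";
console.log(estimateTokensByChars(sample)); // 14
console.log(estimateTokensByWords(sample)); // 12
```

When the two estimates disagree, the truth is usually somewhere in between; both drift badly on code, JSON, and non-English text.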

Non-English text is often less efficient:

English:  "Hello"         → 1 token
Spanish:  "Hola"          → 1 token
Japanese: "こんにちは"     → 3-5 tokens (each character may be a token)
Arabic:   "مرحبا"         → 3-5 tokens
Emoji:    "😀"            → 1-2 tokens

5. Tokenizer Differences Across Models

Different models use different tokenizers with different vocabulary sizes:

| Model Family | Tokenizer | Vocabulary Size | Notes |
| --- | --- | --- | --- |
| GPT-4o / GPT-4o mini | o200k_base | ~200,000 tokens | Most efficient for code and multilingual text |
| GPT-3.5 | cl100k_base | 100,256 tokens | |
| GPT-2 / GPT-3 | r50k_base | 50,257 tokens | Older, less efficient |
| Claude 3/3.5/4 | Custom BPE | ~100K+ tokens | Anthropic's proprietary tokenizer |
| Llama 3 | Custom BPE | 128,256 tokens | Meta's tokenizer |

Key insight: The same text produces different token counts with different models. Always use the correct tokenizer when counting tokens for a specific API.


6. Counting Tokens in Practice

Using the tiktoken library (OpenAI models)

import { encoding_for_model } from 'tiktoken';

const enc = encoding_for_model('gpt-4o');
const tokens = enc.encode('Hello, how are you?');
console.log(tokens);        // Uint32Array of token IDs (exact values depend on the model's tokenizer)
console.log(tokens.length); // 6 tokens

// Decode back to text (decode() returns UTF-8 bytes, not a string)
const text = new TextDecoder().decode(enc.decode(tokens));
console.log(text); // "Hello, how are you?"

enc.free(); // Free the encoder when done (WASM resource)

Quick estimation without a library

// Rule of thumb: ~4 characters per token for English
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

console.log(estimateTokens('Hello, how are you?')); // ~5 (actual: 6)

Using the OpenAI Tokenizer web tool

Visit the OpenAI Tokenizer playground to see how text is split into tokens; each token is highlighted in a different color.


7. Why Token Awareness Matters for Engineers

Cost control

GPT-4o pricing (example):
  Input:  $2.50 per 1M tokens
  Output: $10.00 per 1M tokens

A chatbot with 2000-token system prompt:
  2000 tokens × $2.50/1M = $0.005 per request (just for system prompt)
  × 100,000 users per day = $500/day just for system prompts

Optimize the system prompt from 2000 to 500 tokens:
  500 tokens × $2.50/1M = $0.00125 per request
  × 100,000 users per day = $125/day
  
  Savings: $375/day = $11,250/month
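The cost arithmetic above can be packaged as a small helper. The $2.50 per 1M input tokens is the example figure from this section, not a live rate:

```javascript
// Daily cost of sending a fixed system prompt with every request.
const PRICE_PER_MILLION_INPUT = 2.5; // USD per 1M input tokens (example figure)

function dailySystemPromptCost(promptTokens, requestsPerDay) {
  // Multiply before dividing to keep the arithmetic exact for round numbers.
  return (promptTokens * PRICE_PER_MILLION_INPUT * requestsPerDay) / 1_000_000;
}

console.log(dailySystemPromptCost(2000, 100_000)); // 500  ($/day)
console.log(dailySystemPromptCost(500, 100_000));  // 125  ($/day)
```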

Context window management

Claude context window: 200,000 tokens

If your RAG pipeline retrieves 50 documents × 2000 tokens each = 100,000 tokens
  + System prompt: 1,000 tokens
  + User message: 500 tokens
  + Reserved for output: 4,000 tokens
  = 105,500 tokens used out of 200,000

That leaves 94,500 tokens of headroom. But if you retrieve 120 documents:
  120 × 2000 = 240,000 tokens → EXCEEDS context window → request fails
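A sketch of this budget check in code. The 200K limit mirrors the example; substitute your model's actual context window:

```javascript
// Token-budget check for a RAG request: documents + prompts + output reserve.
const CONTEXT_LIMIT = 200_000; // example: a 200K-token context window

function fitsInContext({ docTokens, systemTokens, userTokens, outputReserve }) {
  const total = docTokens + systemTokens + userTokens + outputReserve;
  return { total, fits: total <= CONTEXT_LIMIT, headroom: CONTEXT_LIMIT - total };
}

// 50 documents × 2000 tokens each, plus prompts and output reserve:
console.log(fitsInContext({ docTokens: 50 * 2000, systemTokens: 1000, userTokens: 500, outputReserve: 4000 }));
// { total: 105500, fits: true, headroom: 94500 }

// 120 documents blow the budget:
console.log(fitsInContext({ docTokens: 120 * 2000, systemTokens: 1000, userTokens: 500, outputReserve: 4000 }).fits);
// false
```

Checking the budget before sending the request is cheaper than handling the API error after; a pipeline can drop or truncate documents until the total fits.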

Prompt engineering

// BAD: Wastes tokens with verbose instruction
"I would really appreciate it if you could please take a moment to
summarize the following text in a concise manner, making sure to
include only the most important points..."
→ ~30 tokens for the instruction alone

// GOOD: Direct and efficient
"Summarize this text in 3 bullet points:"
→ ~8 tokens for the instruction

8. Special Tokens

Every tokenizer includes special tokens that are not regular text but control the model's behavior:

<|endoftext|>     → Signals end of a document/conversation
<|im_start|>      → Start of a message (chat format)
<|im_end|>        → End of a message (chat format)
<|system|>        → System message marker
<|user|>          → User message marker
<|assistant|>     → Assistant message marker

These special tokens are added automatically by the API — you don't type them. But they still consume space in your context window: a multi-turn conversation with 20 messages carries roughly 40 special tokens of overhead just from message formatting.
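A rough way to account for this overhead when budgeting a conversation. This assumes about 2 special tokens per message (a start and an end marker), as in the estimate above; real chat templates vary by model:

```javascript
// Total conversation size = message content plus per-message formatting overhead.
function conversationTokens(messageContentTokens, overheadPerMessage = 2) {
  // messageContentTokens: pre-counted token length of each message body
  return messageContentTokens.reduce(
    (sum, contentTokens) => sum + contentTokens + overheadPerMessage,
    0
  );
}

// Three messages of 120, 85, and 200 content tokens:
console.log(conversationTokens([120, 85, 200])); // 411 (405 content + 6 overhead)
```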


9. Common Gotchas

| Gotcha | Explanation |
| --- | --- |
| Spaces are tokens | A leading space matters: " Hello" tokenizes differently from "Hello" |
| Code is expensive | Code uses many short tokens (brackets, operators, etc.) — more tokens per line than prose |
| JSON is expensive | {"key": "value"} has many structural tokens (braces, colons, quotes) |
| Numbers are tricky | "1000" might be 1 token, but "1000000" might be 2-3 tokens |
| Newlines are tokens | Every \n is typically its own token |
| Repeated text | Repeating the same word 100 times costs ~100 tokens (no compression) |

10. Key Takeaways

  1. Tokens are the fundamental unit — billing, context limits, speed, and prompt design all operate in token space.
  2. BPE tokenization splits text into subword pieces — common words are single tokens, rare words are multiple tokens.
  3. 1 token ≈ 4 characters in English — use this for quick estimation.
  4. Different models use different tokenizers — always count tokens with the right tokenizer for your target model.
  5. Token efficiency directly impacts cost — a poorly designed prompt can waste thousands of dollars at scale.

Explain-It Challenge

  1. A user asks "why did my 10-page PDF break the API?" — explain using tokens and context windows.
  2. Why does the same sentence cost more to process in Japanese than in English?
  3. If GPT-4o costs $2.50 per million input tokens, how much does a 500-word system prompt cost per API call?

Navigation: ← 4.1 Overview · 4.1.b — Context Window →