Episode 4 — Generative AI Engineering / 4.1 — How LLMs Actually Work
4.1 — How LLMs Actually Work: Quick Revision
Compact cheat sheet. Print-friendly.
How to use this material (instructions)
- Skim before labs or interviews.
- Drill gaps — reopen README.md → 4.1.a…4.1.e.
- Practice — 4.1-Exercise-Questions.md.
- Polish answers — 4.1-Interview-Questions.md.
Core vocabulary
| Term | One-liner |
|---|---|
| Token | Smallest text unit an LLM processes (~4 chars in English) |
| BPE | Byte Pair Encoding — tokenizer algorithm that merges frequent pairs |
| Context window | Max tokens (input + output) per API call |
| Temperature | Controls probability distribution sharpness (0=deterministic, 1=default, >1=creative) |
| Top-p | Nucleus sampling — sample only from the smallest set of tokens whose cumulative probability reaches p |
| Top-k | Only sample from top k most probable tokens |
| Hallucination | Model generates plausible but factually wrong text |
| Greedy decoding | Always pick highest-probability token (temperature 0) |
| Seed | Parameter for reproducible outputs (best-effort) |
Token math
1 token ≈ 4 characters (English)
1 token ≈ 0.75 words
100 words ≈ 130 tokens
1 page ≈ 500-800 tokens
1 line of code ≈ 10-20 tokens
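These rules of thumb can be wrapped in a quick estimator (a sketch only; for exact counts use the model's actual tokenizer, e.g. tiktoken):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate from the cheat-sheet rules of thumb (English)."""
    by_chars = len(text) / 4            # 1 token ~ 4 characters
    by_words = len(text.split()) / 0.75  # 1 token ~ 0.75 words
    # take the larger of the two heuristics for a safer (higher) estimate
    return int(max(by_chars, by_words))
```

Useful for budget checks before a call; never for billing-accurate counts.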
Context window sizes
| Model | Window | Max Output |
|---|---|---|
| GPT-4o | 128K | 16K |
| Claude 3.5/4 Sonnet | 200K | 8K (up to 64K) |
| Gemini 1.5 Pro | 2M | 8K |
| Llama 3.1 | 128K | varies |
Temperature cheat sheet
temp 0 → Deterministic (JSON, extraction, classification)
temp 0.1-0.3 → Very focused (factual Q&A, data processing)
temp 0.5-0.7 → Balanced (chatbots, summaries)
temp 0.8-1.0 → Natural (conversation, content writing)
temp 1.2-1.5 → Creative (poetry, brainstorming)
temp >1.5 → Chaotic (rarely useful)
Sampling parameters
```js
// Structured extraction (deterministic)
{ temperature: 0, top_p: 1 }
// General chatbot
{ temperature: 0.7, top_p: 0.9 }
// Creative writing
{ temperature: 1.0, top_p: 0.95, frequency_penalty: 0.5 }
// Code generation
{ temperature: 0.2, top_p: 0.95 }
```
Rule: Adjust temperature OR top_p, not both.
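How these two knobs reshape the distribution can be sketched in pure Python (a toy illustration; real inference applies this to the model's logits inside the decoding loop):

```python
import math
import random

def sample_token(logits, temperature=1.0, top_p=1.0, rng=None):
    """Toy sampler: temperature-scaled softmax, then nucleus (top-p) filter."""
    rng = rng or random.Random()
    if temperature == 0:  # greedy decoding: always the most likely token
        return max(logits, key=logits.get)
    # softmax with temperature: lower temp sharpens, higher temp flattens
    scaled = {t: math.exp(l / temperature) for t, l in logits.items()}
    total = sum(scaled.values())
    probs = {t: v / total for t, v in scaled.items()}
    # nucleus filter: keep the smallest set whose cumulative mass >= top_p
    kept, mass = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    tokens, weights = zip(*kept)
    return rng.choices(tokens, weights=weights, k=1)[0]
```

With a very small top_p the nucleus collapses to the single most likely token, which is why tuning both knobs at once gives confusing, compounded effects.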
Hallucination
WHY: Model predicts plausible text, not verified facts
Same mechanism as creativity — just unwanted in factual tasks
TYPES:
Factual → Wrong facts stated confidently
Fabricated → Made-up citations, papers, URLs
Intrinsic → Contradicts the input you gave it
Extrinsic → Adds unsupported claims to summaries
REDUCE:
1. RAG → Ground in retrieved documents
2. Temperature 0 → Less creative divergence
3. "Say I don't know" instruction
4. Output validation (schema + programmatic checks)
5. Constrained format (JSON with specific fields)
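Steps 4 and 5 combine naturally: parse the constrained JSON and enforce required fields before trusting a response. A minimal sketch (the field names and schema here are hypothetical):

```python
import json

# hypothetical schema: required field name -> expected Python type
REQUIRED_FIELDS = {"name": str, "year": int}

def validate_extraction(raw: str) -> dict:
    """Parse model output as JSON and enforce required fields and types."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], ftype):
            raise ValueError(f"wrong type for {field}")
    return data
```

A failed validation is a signal to retry or fall back, not something to silently pass downstream.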
Context window management
Token Budget (128K window):
System prompt: 1,500 (fixed)
Output reservation: 4,000 (fixed)
Safety margin: 1,000 (fixed)
Current message: 500 (variable)
Conversation history: 8,000 (trim if needed)
RAG documents: 113,000 (fill with ranked chunks)
Priority (what to keep when tight):
1. System prompt — ALWAYS
2. Current message — ALWAYS
3. Output reserve — ALWAYS
4. Recent history — Last 2-4 turns
5. RAG documents — Most relevant first
6. Older history — First to trim/summarize
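The budget and priority rules above can be sketched as a packing function (illustrative; per-item token counts are assumed precomputed, and history/chunks are lists of those counts):

```python
def pack_context(window, system, output_reserve, margin, current,
                 history, rag_chunks):
    """Fill the window by priority: fixed costs first, then the last
    4 history turns (newest first), then RAG chunks in relevance order."""
    budget = window - system - output_reserve - margin - current
    if budget < 0:
        raise ValueError("fixed costs exceed the window")
    kept_history = []
    for turn in reversed(history[-4:]):  # most recent turns first
        if turn > budget:
            break
        kept_history.insert(0, turn)     # restore chronological order
        budget -= turn
    kept_rag = []
    for chunk in rag_chunks:             # assumed pre-ranked by relevance
        if chunk > budget:
            break
        kept_rag.append(chunk)
        budget -= chunk
    return kept_history, kept_rag, budget
```

Anything that does not fit is trimmed in reverse priority order, matching the list above.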
Deterministic vs probabilistic
DETERMINISTIC: same input → same output (temp 0, seed)
PROBABILISTIC: same input → different outputs (temp > 0)
Use deterministic for: JSON, classification, extraction, testing, pipelines
Use probabilistic for: creative writing, conversation, brainstorming
LOGGING IS ESSENTIAL — non-deterministic systems need full request/response logs
PIN MODEL VERSIONS — "gpt-4o-2024-08-06" not "gpt-4o"
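Both habits can live in one thin wrapper (a sketch; `call_model` stands in for whatever API client you actually use):

```python
import json
import time
import uuid

def logged_call(call_model, request: dict, log: list):
    """Log the full request and response of every model call, with a
    request id and timestamp, so non-deterministic outputs can be audited."""
    entry = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        # request should carry a PINNED model version, temperature, seed, etc.
        "request": request,
    }
    response = call_model(**request)
    entry["response"] = response
    log.append(json.loads(json.dumps(entry)))  # force JSON-serializable
    return response
```

The round-trip through `json.dumps` fails fast if anything non-serializable sneaks into the log.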
LLM pipeline flow
Text → Tokenizer → Token IDs → Transformer → Probability Distribution
→ Sampling (temp, top-p) → Selected Token → Append → Repeat
→ Stop condition → Detokenize → Text output
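The loop above can be sketched end to end with a toy "model" (illustrative only; a real transformer forward pass replaces `next_token_logits`, and a tokenizer handles the text↔ID steps):

```python
def generate(prompt_ids, next_token_logits, max_new=10, stop_id=None):
    """Greedy decoding loop: predict, pick argmax, append, repeat until
    the stop token or the max-new-tokens limit."""
    ids = list(prompt_ids)
    for _ in range(max_new):
        logits = next_token_logits(ids)  # distribution over the vocab
        tok = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        if tok == stop_id:
            break
        ids.append(tok)
    return ids
```

Swapping the argmax for a sampler (temperature, top-p) is the only change needed for non-greedy decoding.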
Common gotchas
| Gotcha | Why |
|---|---|
| JSON costs more tokens than plain text | Braces, colons, quotes are each tokens |
| Non-English text costs more | Less efficient tokenization |
| Spaces affect tokenization | " Hello" ≠ "Hello" in token space |
| Context includes output | Reserve tokens for the model's response |
| More context ≠ better | Lost-in-the-middle reduces accuracy |
| temp 0 doesn't fix hallucination | Model still predicts plausible wrong facts |
End of 4.1 quick revision.