Episode 4 — Generative AI Engineering / 4.1 — How LLMs Actually Work

4.1 — How LLMs Actually Work: Quick Revision

Compact cheat sheet. Print-friendly.

How to use this material (instructions)

  1. Skim before labs or interviews.
  2. Drill gaps — reopen README.md and sections 4.1.a–4.1.e.
  3. Practice: 4.1-Exercise-Questions.md.
  4. Polish answers: 4.1-Interview-Questions.md.

Core vocabulary

| Term | One-liner |
|------|-----------|
| Token | Smallest text unit an LLM processes (~4 chars in English) |
| BPE | Byte Pair Encoding — tokenizer algorithm that merges frequent pairs |
| Context window | Max tokens (input + output) per API call |
| Temperature | Controls probability distribution sharpness (0 = deterministic, 1 = default, >1 = creative) |
| Top-p | Nucleus sampling — sample only from the smallest set of tokens whose cumulative probability reaches p |
| Top-k | Sample only from the k most probable tokens |
| Hallucination | Model generates plausible but factually wrong text |
| Greedy decoding | Always pick the highest-probability token (temperature 0) |
| Seed | Parameter for reproducible outputs (best-effort) |

Token math

1 token  ≈ 4 characters (English)
1 token  ≈ 0.75 words
100 words ≈ 130 tokens
1 page   ≈ 500-800 tokens
1 line of code ≈ 10-20 tokens
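The heuristics above can be wrapped in a rough estimator — a sketch only; for exact counts use the model's own tokenizer (e.g. tiktoken for OpenAI models):

```python
def estimate_tokens(text: str) -> int:
    """Estimate token count from the two rules of thumb above:
    ~4 characters per token and ~0.75 words per token (English)."""
    by_chars = len(text) / 4
    by_words = len(text.split()) / 0.75
    # Average the two estimates; round to nearest whole token.
    return int((by_chars + by_words) / 2 + 0.5)

print(estimate_tokens("The quick brown fox jumps over the lazy dog."))  # ≈ 12
```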

Context window sizes

| Model | Window | Max Output |
|-------|--------|------------|
| GPT-4o | 128K | 16K |
| Claude 3.5/4 Sonnet | 200K | 8K (up to 64K) |
| Gemini 1.5 Pro | 2M | 8K |
| Llama 3.1 | 128K | varies |

Temperature cheat sheet

temp 0       → Deterministic (JSON, extraction, classification)
temp 0.1-0.3 → Very focused (factual Q&A, data processing)
temp 0.5-0.7 → Balanced (chatbots, summaries)
temp 0.8-1.0 → Natural (conversation, content writing)
temp 1.2-1.5 → Creative (poetry, brainstorming)
temp >1.5    → Chaotic (rarely useful)

Sampling parameters

// Structured extraction (deterministic)
{ temperature: 0, top_p: 1 }

// General chatbot
{ temperature: 0.7, top_p: 0.9 }

// Creative writing
{ temperature: 1.0, top_p: 0.95, frequency_penalty: 0.5 }

// Code generation
{ temperature: 0.2, top_p: 0.95 }

Rule: Adjust temperature OR top_p, not both.
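Top-k and top-p are just filters applied to the distribution before sampling. A toy sampler showing both (illustrative only — real implementations renormalise and operate on logits inside the model):

```python
import random

def sample_token(probs, top_k=None, top_p=None, seed=None):
    """Sample a token after optional top-k and top-p (nucleus) filtering.
    probs: dict mapping token -> probability (should sum to ~1)."""
    rng = random.Random(seed)
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        items = items[:top_k]                    # keep only the k most probable
    if top_p is not None:
        kept, cum = [], 0.0
        for tok, p in items:
            kept.append((tok, p))
            cum += p
            if cum >= top_p:                     # smallest set covering top_p mass
                break
        items = kept
    tokens, weights = zip(*items)                # sample among survivors
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zzz": 0.05}
print(sample_token(probs, top_p=0.8, seed=42))   # "zzz" and "cat" are filtered out
```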


Hallucination

WHY:  Model predicts plausible text, not verified facts
      Same mechanism as creativity — just unwanted in factual tasks

TYPES:
  Factual       → Wrong facts stated confidently
  Fabricated    → Made-up citations, papers, URLs
  Intrinsic     → Contradicts the input you gave it
  Extrinsic     → Adds unsupported claims to summaries

REDUCE:
  1. RAG            → Ground in retrieved documents
  2. Temperature 0  → Less creative divergence
  3. "Say I don't know" instruction
  4. Output validation (schema + programmatic checks)
  5. Constrained format (JSON with specific fields)
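Reduction tactics 4 and 5 combined, as a minimal sketch: constrain the output to JSON with specific fields, then validate programmatically before trusting it. The field names here ("name", "year") are hypothetical examples:

```python
import json

# Hypothetical schema: each required field mapped to its expected type.
REQUIRED = {"name": str, "year": int}

def validate_llm_output(raw: str) -> dict:
    """Parse model output; reject invalid JSON and missing/mistyped fields."""
    data = json.loads(raw)                       # raises ValueError on invalid JSON
    for field, ftype in REQUIRED.items():
        if not isinstance(data.get(field), ftype):
            raise ValueError(f"bad or missing field: {field}")
    return data

print(validate_llm_output('{"name": "BPE", "year": 1994}'))
```

A failed validation is a signal to retry, re-prompt, or escalate — never to pass the output downstream.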

Context window management

Token Budget (128K window):
  System prompt:       1,500  (fixed)
  Output reservation:  4,000  (fixed)
  Safety margin:       1,000  (fixed)
  Current message:       500  (variable)
  Conversation history: 8,000 (trim if needed)
  RAG documents:     113,000  (fill with ranked chunks)

Priority (what to keep when tight):
  1. System prompt     — ALWAYS
  2. Current message   — ALWAYS
  3. Output reserve    — ALWAYS
  4. Recent history    — Last 2-4 turns
  5. RAG documents     — Most relevant first
  6. Older history     — First to trim/summarize
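The priority order above can be applied mechanically: fill a fixed budget top-down and stop when it runs out. A sketch, with the budget numbers from the table treated as illustrative estimates:

```python
def build_context(budget, system, current, history, rag_chunks,
                  output_reserve=4000, safety_margin=1000):
    """system/current: (text, token_count). history and rag_chunks: lists of
    (text, token_count), newest turn / most relevant chunk first."""
    used = system[1] + current[1] + output_reserve + safety_margin
    kept_history = []
    for turn, tokens in history[:4]:             # keep at most the last 4 turns
        if used + tokens > budget:
            break
        kept_history.append(turn)
        used += tokens
    kept_rag = []
    for chunk, tokens in rag_chunks:             # most relevant chunks first
        if used + tokens > budget:
            break
        kept_rag.append(chunk)
        used += tokens
    return kept_history, kept_rag, used
```

Older history falls off first (it is never even offered to the loop), matching priority 6.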

Deterministic vs probabilistic

DETERMINISTIC:  same input → same output (temp 0, seed)
PROBABILISTIC:  same input → different outputs (temp > 0)

Use deterministic for: JSON, classification, extraction, testing, pipelines
Use probabilistic for: creative writing, conversation, brainstorming

LOGGING IS ESSENTIAL — non-deterministic systems need full request/response logs
PIN MODEL VERSIONS — "gpt-4o-2024-08-06" not "gpt-4o"

LLM pipeline flow

Text → Tokenizer → Token IDs → Transformer → Probability Distribution
       → Sampling (temp, top-p) → Selected Token → Append → Repeat
       → Stop condition → Detokenize → Text output
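The loop above, as a toy: a lookup table stands in for the transformer, greedy argmax stands in for sampling, and all probabilities are made up:

```python
# Toy "model": maps the last token to a probability distribution over next tokens.
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.7, "dog": 0.3},
    "a":   {"cat": 0.5, "dog": 0.5},
    "cat": {"<eos>": 1.0},
    "dog": {"<eos>": 1.0},
}

def generate(max_tokens=10):
    tokens = ["<s>"]
    while len(tokens) < max_tokens:
        dist = NEXT[tokens[-1]]                  # "transformer" → distribution
        token = max(dist, key=dist.get)          # greedy decoding (temp 0)
        if token == "<eos>":                     # stop condition
            break
        tokens.append(token)                     # append and repeat
    return " ".join(tokens[1:])                  # "detokenize"

print(generate())                                # → "the cat"
```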

Common gotchas

| Gotcha | Why |
|--------|-----|
| JSON costs more tokens than plain text | Braces, colons, quotes are each tokens |
| Non-English text costs more | Less efficient tokenization |
| Spaces affect tokenization | " Hello" ≠ "Hello" in token space |
| Context includes output | Reserve tokens for the model's response |
| More context ≠ better | Lost-in-the-middle reduces accuracy |
| temp 0 doesn't fix hallucination | Model still predicts plausible wrong facts |

End of 4.1 quick revision.