Episode 4 — Generative AI Engineering / 4.1 — How LLMs Actually Work

Interview Questions: How LLMs Actually Work

Model answers for tokens, context windows, sampling/temperature, hallucination, and deterministic vs probabilistic outputs.

How to use this material (instructions)

  1. Read lessons in order: README.md, then 4.1.a → 4.1.e.
  2. Practice out loud — definition → example → pitfall.
  3. Pair with exercises: 4.1-Exercise-Questions.md.
  4. Quick review: 4.1-Quick-Revision.md.

Beginner (Q1–Q4)

Q1. What is a token and why does it matter?

Why interviewers ask: Tests if you understand the fundamental unit of LLM operation — critical for cost estimation, prompt design, and API usage.

Model answer:

A token is the smallest unit of text an LLM processes — typically a word, subword piece, or character. Tokenization converts text into numerical IDs the model can process. Common words like "the" are single tokens; rare words like "defenestrate" are split into multiple subword tokens via algorithms like BPE (Byte Pair Encoding).

Tokens matter because everything is measured in tokens: billing (per input/output token), context window limits, generation speed, and prompt efficiency. A rough approximation for English: 1 token ≈ 4 characters or ~0.75 words. Different models use different tokenizers, so the same text produces different token counts across providers.
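The ~4-characters rule can be sketched as a quick estimator. This is a heuristic only; accurate counts come from the provider's own tokenizer (e.g. tiktoken for OpenAI models):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English text using the ~4 chars/token rule.

    A back-of-the-envelope heuristic only; real counts vary by model and
    tokenizer, so use the provider's tokenizer when accuracy matters.
    """
    return max(1, round(len(text) / 4))

print(estimate_tokens("The quick brown fox jumps over the lazy dog."))  # ~11
```

Useful for quick cost estimates before a request; never for enforcing hard context limits.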


Q2. What is the context window?

Why interviewers ask: Directly impacts how you design conversations, RAG systems, and manage cost.

Model answer:

The context window is the maximum number of tokens an LLM can process in a single API call — including both the input (prompt, history, documents) and the output (model's response). For example, GPT-4o offers 128K tokens, recent Claude models 200K, and Gemini 1.5 Pro up to 2M (exact figures vary by model version).

When exceeded, older messages are silently truncated or the API rejects the request. In production, you manage this with token budgeting — allocating tokens for system prompt, conversation history, RAG documents, and reserved output. A critical subtlety: models show reduced recall for information in the middle of long contexts ("lost in the middle" problem), so important information should be placed at the beginning or end.
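A minimal sketch of the truncation side of token budgeting. The message list and the `count_tokens` counter here are placeholders, not a real provider API:

```python
def fit_history(messages, max_tokens, count_tokens):
    """Drop the oldest messages until the history fits the token budget.

    `messages` is a list of message strings, oldest first; `count_tokens`
    is any per-message token counter you supply (a placeholder here).
    """
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > max_tokens:
        kept.pop(0)  # drop the oldest message first, preserving recency
    return kept
```

Production systems usually summarize dropped turns instead of discarding them outright, but the budget check is the same.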


Q3. What is temperature in LLM APIs?

Why interviewers ask: Essential for configuring AI behavior correctly per use case.

Model answer:

Temperature controls the randomness of the model's output by reshaping the probability distribution over next tokens. At temperature 0 (greedy decoding), the model always picks the highest-probability token — output is nearly deterministic. At temperature 1.0 (a common API default), the model samples from the natural distribution. Above 1.0, the distribution is flattened, increasing randomness and creativity.

Use temperature 0 for structured output (JSON extraction, classification), 0.5-0.7 for balanced chatbots, and 0.8-1.2 for creative writing. In production, you also have top-p (nucleus sampling) which cuts off the tail of unlikely tokens, and frequency/presence penalties to reduce repetition. Best practice: adjust temperature or top-p, not both.
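Under the hood, temperature divides the logits before the softmax. A small sketch of that reshaping, using illustrative values rather than real model logits:

```python
import math

def apply_temperature(logits, temperature):
    """Convert logits to a probability distribution at a given temperature.

    Lower temperature sharpens the distribution; higher flattens it.
    Requires temperature > 0 (temperature 0 corresponds to plain argmax).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(apply_temperature(logits, 0.5))  # sharper: top token dominates
print(apply_temperature(logits, 2.0))  # flatter: probabilities closer together
```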


Q4. Why do LLMs hallucinate?

Why interviewers ask: Every production AI system must handle hallucination — this tests your understanding of the root cause and mitigation strategies.

Model answer:

LLMs hallucinate because they are next-token prediction machines, not fact-retrieval systems. They generate text that is statistically plausible given the prompt, not factually verified. The model doesn't "know" facts — it predicts what text looks like it should come next based on patterns learned during training. When patterns are strong (common facts), output is usually correct. When patterns are weak (rare facts, recent events), the model generates confident-sounding fiction.

Primary mitigation strategies: (1) RAG — ground answers in retrieved documents, (2) temperature 0 — reduce creative divergence, (3) explicit instructions to say "I don't know," (4) output validation — verify structured claims programmatically, (5) constrained output — JSON schemas limit free-form generation.
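Strategy (4), output validation, can be as simple as checking that numeric claims in the answer actually appear in the retrieved sources. A crude, hypothetical grounding check:

```python
import re

def numbers_grounded(answer: str, sources: list[str]) -> bool:
    """Flag answers whose numeric claims don't appear in any source document.

    A deliberately crude sketch of output validation: every number in the
    answer must occur verbatim in at least one retrieved source.
    """
    combined = " ".join(sources)
    for num in re.findall(r"\d+(?:\.\d+)?", answer):
        if num not in combined:
            return False
    return True

sources = ["Revenue grew 12% to 4.5 billion in 2024."]
print(numbers_grounded("Revenue grew 12% in 2024.", sources))  # True
print(numbers_grounded("Revenue grew 15% in 2024.", sources))  # False
```

Real validators are more sophisticated (entity matching, NLI-based entailment checks), but the principle is the same: verify claims programmatically instead of trusting fluent output.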


Intermediate (Q5–Q8)

Q5. How does tokenization affect cost and system design?

Why interviewers ask: Evaluates your ability to think about AI engineering at scale — cost optimization is a real production concern.

Model answer:

Tokenization directly impacts four pillars of AI system design:

Cost: APIs charge per token. A verbose 2,000-token system prompt costs $0.005/call at $2.50/1M tokens. At 100K calls/day, that's $500/day. Optimizing to 500 tokens saves $375/day ($11K/month). JSON is expensive — structural tokens (braces, colons, quotes) add overhead. Non-English text tokenizes less efficiently.

Latency: Output generation speed is measured in tokens/second. More output tokens = longer response time. Streaming mitigates perceived latency.

Context management: Token-expensive prompts leave less room for conversation history and RAG documents within the fixed context window.

Design decisions: Token awareness informs choices like: short vs detailed system prompts, how many RAG chunks to retrieve, when to summarize conversation history, and which language to use for multilingual systems.
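The cost arithmetic above can be captured in a one-line helper. The prices and call volumes are the example figures from this answer, not current provider pricing:

```python
def daily_prompt_cost(prompt_tokens, calls_per_day, usd_per_million_tokens):
    """Daily input cost attributable to the system prompt alone."""
    return prompt_tokens * usd_per_million_tokens * calls_per_day / 1_000_000

before = daily_prompt_cost(2_000, 100_000, 2.50)  # 500.0 USD/day
after = daily_prompt_cost(500, 100_000, 2.50)     # 125.0 USD/day
print(before - after)                             # 375.0 USD/day saved
```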


Q6. Explain the "lost in the middle" problem and how you design around it.

Why interviewers ask: Tests practical knowledge of a well-documented LLM limitation that affects RAG and long-context applications.

Model answer:

Research shows LLMs pay disproportionate attention to information at the beginning and end of the context, with reduced recall for content in the middle. In a 100K token prompt, a fact buried at position 50K may be effectively ignored despite being "in context."

Design implications: (1) System prompt (beginning) — always the instructions and persona. (2) Current user question (end) — recency bias helps here. (3) Most relevant RAG documents — place near the beginning or end, not in the middle. (4) Use minimum context necessary — 10K focused tokens often outperform 100K unfocused tokens. (5) Repeat critical instructions — state key rules in the system prompt AND near the end. (6) Chunk and rank RAG results — inject only the top 3-5 most relevant chunks rather than 50 marginally relevant ones.
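Point (3), placing the most relevant documents at the edges, can be sketched as a reordering of ranked chunks. This is a simple heuristic, not a tuned algorithm:

```python
def order_for_recall(chunks_ranked):
    """Arrange best-first ranked chunks so the strongest land at the edges.

    Even indices (0-based) go to the front, odd indices to the back
    (reversed), pushing the weakest chunks toward the middle of the context.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(order_for_recall(["c1", "c2", "c3", "c4", "c5"]))
# ['c1', 'c3', 'c5', 'c4', 'c2'] -- top-ranked c1 and c2 sit at the edges
```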


Q7. What is the difference between temperature, top-p, and top-k?

Why interviewers ask: Shows depth of understanding beyond "temperature = creativity slider."

Model answer:

All three control which token gets selected from the probability distribution, but they work differently:

Temperature reshapes the entire distribution. Low temperature sharpens it (high-probability tokens become even more dominant). High temperature flattens it (low-probability tokens get a bigger chance). At 0, it's greedy — always the top token.

Top-p (nucleus sampling) doesn't reshape the distribution — it truncates it. Sort tokens by probability, include tokens until their cumulative probability reaches p. If top-p=0.9, only the tokens that make up the top 90% of probability mass are eligible. The tail of unlikely tokens is cut off.

Top-k is the simplest: only consider the top k tokens regardless of probability mass. Less adaptive than top-p — if one token has 99% probability, top-k=50 still considers 49 near-impossible tokens.

In practice, temperature + top-p are the standard pair. Most providers don't expose top-k. Recommendation: adjust one at a time — combining temperature and top-p changes can produce unpredictable results.
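The truncation behavior of top-p versus top-k can be sketched over a toy distribution (illustrative probabilities only):

```python
def top_k_filter(probs, k):
    """Keep only the k highest-probability tokens, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {tok: p / total for tok, p in ranked}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability >= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cum += prob
        if cum >= p:
            break
    total = sum(pr for _, pr in kept)
    return {tok: pr / total for tok, pr in kept}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zzz": 0.05}
print(top_p_filter(probs, 0.9))  # cuts the unlikely tail: "zzz" is dropped
print(top_k_filter(probs, 2))    # keeps exactly two tokens: "the" and "a"
```

Note how top-p adapts to the shape of the distribution while top-k keeps a fixed count regardless.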


Q8. How do you build a testable system on top of a non-deterministic model?

Why interviewers ask: A practical engineering question that separates people who've built real AI products from those who've only experimented.

Model answer:

Key strategies:

1. Deterministic configuration: Use temperature: 0 and, where the provider supports it, a fixed seed parameter for all critical paths (extraction, classification, structured output). This gives near-deterministic outputs.

2. Pin model versions: Use exact versions like gpt-4o-2024-08-06 instead of aliases that auto-update.

3. Comprehensive logging: Log every LLM call with full input, output, model version, parameters, token counts, and latency. This allows post-hoc debugging even when outputs aren't reproducible.

4. Output validation, not output equality: Instead of asserting the exact string, validate structure and semantics — expect(output).toMatchJsonSchema(schema) rather than expect(output).toBe(exactString).

5. Evaluation suites: Build eval datasets with expected outcomes. Score model responses on correctness, format compliance, and safety. Run evals when changing models, prompts, or parameters.

6. Fallback mechanisms: When the model returns unexpected output, have retry logic, format correction, or graceful degradation to a default response.
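Point 4 in code: a validator that checks structure and semantics rather than an exact string. The `name`/`age` schema here is invented for illustration:

```python
import json

def validate_extraction(raw_output: str) -> dict:
    """Validate structure instead of asserting an exact output string.

    Hypothetical schema: the model must return JSON with a string `name`
    and a non-negative integer `age`. Raises ValueError on any violation,
    which a caller can use to trigger retries or graceful fallbacks.
    """
    data = json.loads(raw_output)  # malformed JSON fails here with an exception
    if not isinstance(data.get("name"), str):
        raise ValueError("name must be a string")
    age = data.get("age")
    if not isinstance(age, int) or age < 0:
        raise ValueError("age must be a non-negative integer")
    return data

print(validate_extraction('{"name": "Ada", "age": 36}'))
```

The same validator then doubles as the assertion in your test suite: any output that parses and satisfies the schema passes, regardless of phrasing.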


Advanced (Q9–Q11)

Q9. Design a token budget strategy for a production RAG chatbot.

Why interviewers ask: Tests system design thinking — balancing competing demands within a fixed resource constraint.

Model answer:

Given a 128K token window (GPT-4o):

Fixed allocations:
  System prompt:       1,500 tokens  (persona, rules, output format)
  Output reservation:  4,000 tokens  (max response length)
  Safety margin:       1,000 tokens  (prevent edge-case overflow)
  Subtotal fixed:      6,500 tokens

Dynamic allocations (fill in priority order):
  1. Current user message:  ~500 tokens (variable, measured per request)
  2. Recent history:        ~8,000 tokens (last 3-4 exchanges)
  3. RAG documents:         REMAINING tokens (most relevant first)

Implementation: ragBudget = 128,000 - 6,500 - userTokens - historyTokens. If history grows too large, summarize older messages. For RAG, retrieve more chunks than needed, rank by relevance, then greedily fill the budget with top chunks until the token budget is exhausted.

Monitor: track average utilization. If you're consistently under 50% utilization, you have room to add more context. If you're hitting 90%+, implement more aggressive history pruning.
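The allocation above can be sketched directly. The chunk list and token counter are placeholders for a real retriever and tokenizer:

```python
CONTEXT_WINDOW = 128_000
FIXED = 1_500 + 4_000 + 1_000  # system prompt + output reservation + safety margin

def rag_budget(user_tokens: int, history_tokens: int) -> int:
    """Tokens left for RAG documents after fixed and dynamic allocations."""
    return CONTEXT_WINDOW - FIXED - user_tokens - history_tokens

def fill_budget(ranked_chunks, budget, count_tokens):
    """Greedily take the most relevant chunks until the budget is exhausted."""
    chosen, used = [], 0
    for chunk in ranked_chunks:  # assumed ranked best-first by relevance
        cost = count_tokens(chunk)
        if used + cost > budget:
            break
        chosen.append(chunk)
        used += cost
    return chosen

print(rag_budget(500, 8_000))  # 113000 tokens left for RAG documents
```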


Q10. How would you evaluate whether a model's hallucination rate is acceptable for a given use case?

Why interviewers ask: Tests your ability to make quantitative engineering decisions about AI reliability.

Model answer:

Step 1: Define ground truth. Create an evaluation dataset — 200+ examples with known correct answers. For QA: question + verified answer. For extraction: document + expected fields.

Step 2: Measure hallucination rate. Run the model on all examples. Classify each output as: correct, partially correct, hallucinated, or refused. Calculate: hallucination_rate = hallucinated / total.

Step 3: Categorize by severity. Not all hallucinations are equal. A wrong date is less severe than a fabricated medical dosage. Weight the hallucination rate by severity.

Step 4: Compare against threshold. Set an acceptable rate based on the domain: creative writing (tolerate high), customer support (< 5%), medical/legal (< 0.1%, with human-in-the-loop for everything).

Step 5: A/B test mitigations. Test each strategy (RAG, lower temperature, better prompts, output validation) and measure the hallucination rate reduction. Ship the configuration that meets the threshold.

Step 6: Monitor in production. Sample production outputs for human review. Track user feedback signals (thumbs down, corrections). Alert when hallucination rate drifts above threshold.
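Steps 2 and 3 combined, as a sketch. The labels, categories, and severity weights are invented for illustration:

```python
def weighted_hallucination_rate(results, severity):
    """Severity-weighted hallucination rate over an eval dataset.

    `results` maps example id to (label, category); hypothetical labels are
    'correct', 'partial', 'hallucinated', 'refused'. `severity` weights
    hallucination categories (a fabricated dosage outweighs a wrong date).
    """
    total = len(results)
    weighted = sum(
        severity.get(category, 1.0)
        for label, category in results.values()
        if label == "hallucinated"
    )
    return weighted / total if total else 0.0

results = {
    "q1": ("correct", None),
    "q2": ("hallucinated", "wrong_date"),
    "q3": ("hallucinated", "fabricated_dosage"),
    "q4": ("correct", None),
}
severity = {"wrong_date": 0.25, "fabricated_dosage": 5.0}
print(weighted_hallucination_rate(results, severity))  # (0.25 + 5.0) / 4 = 1.3125
```

Compare this weighted score, not the raw rate, against your domain threshold: two trivial errors can matter less than one severe fabrication.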


Q11. Explain how the same mechanism that causes hallucination also enables creativity, and how this affects AI system design.

Why interviewers ask: Tests deep conceptual understanding and ability to apply it to real architecture decisions.

Model answer:

Hallucination and creativity are the same mechanism — the model's ability to generate novel text that wasn't in the input. When writing a poem, novel text is "creative." When answering a factual question, novel text is "hallucination." The underlying computation is identical: the model predicts statistically likely continuations.

This has profound design implications:

You can't eliminate hallucination without eliminating creativity. Temperature 0 reduces both. Constraining output format (JSON schemas) reduces both. Grounding in retrieved documents reduces both.

Architecture must match the task: For factual tasks, constrain the model: temperature 0, RAG grounding, structured output, validation. For creative tasks, liberate the model: higher temperature, open-ended prompts, multiple completions.

Hybrid systems often need both: a RAG pipeline (grounded, factual) that feeds into a response generator (creative, natural-sounding). The pipeline ensures factual accuracy; the generator ensures good UX. The key is architecting the boundary — what's allowed to be creative (phrasing) and what must be grounded (facts, numbers, citations).


Quick-fire

| # | Question | One-line answer |
|---|----------|-----------------|
| 1 | 1 token ≈ how many characters? | ~4 characters in English |
| 2 | What does temperature 0 do? | Greedy decoding — always picks the highest-probability token |
| 3 | Context window includes output? | Yes — input + output share the same window |
| 4 | Can you eliminate hallucination? | No — only reduce it (RAG, temp 0, validation, human review) |
| 5 | What is top-p = 0.9? | Nucleus sampling — only sample from tokens covering 90% of probability mass |

← Back to 4.1 — How LLMs Actually Work (README)