Episode 4 — Generative AI Engineering / 4.1 — How LLMs Actually Work

4.1 — Exercise Questions: How LLMs Actually Work

Practice questions for all five subtopics in Section 4.1: a mix of conceptual, calculation, and hands-on tasks.

How to use this material (instructions)

  1. Read lessons in order: README.md first, then 4.1.a through 4.1.e.
  2. Answer closed-book first — then compare to the matching lesson.
  3. Use the OpenAI Tokenizer — experiment with tokenization hands-on.
  4. Interview prep: 4.1-Interview-Questions.md.
  5. Quick review: 4.1-Quick-Revision.md.

4.1.a — Tokens and Tokenization (Q1–Q10)

Q1. Define what a token is in one sentence. Why don't LLMs process raw text directly?

Q2. The word "Tokenization" is typically split into two tokens: "Token" and "ization". Why does the tokenizer split it instead of keeping it as one token?

Q3. Estimate the token count for this paragraph (assume 1 token ≈ 4 characters): "JavaScript is a high-level, interpreted programming language that is one of the core technologies of the World Wide Web." How close is your estimate to the actual count?

Q4. Explain BPE (Byte Pair Encoding) in 3 steps. Why is it better than word-level or character-level tokenization?

Q5. Why does the same sentence in Japanese use more tokens than in English? What are the cost implications?

Q6. Calculation: GPT-4o costs $2.50 per 1M input tokens. Your chatbot has a 1,500-token system prompt and averages 500-token user messages. You get 200,000 API calls per day. What is the daily cost just for input tokens?
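The arithmetic behind Q6 can be checked with a few lines of Python (all figures are the ones stated in the question, not current pricing):

```python
# Daily input-token cost for the chatbot described in Q6.
PRICE_PER_1M_INPUT = 2.50        # USD per 1M input tokens (from the question)
SYSTEM_PROMPT_TOKENS = 1_500
USER_MESSAGE_TOKENS = 500
CALLS_PER_DAY = 200_000

tokens_per_call = SYSTEM_PROMPT_TOKENS + USER_MESSAGE_TOKENS   # 2,000
daily_tokens = tokens_per_call * CALLS_PER_DAY                 # 400,000,000
daily_cost = daily_tokens / 1_000_000 * PRICE_PER_1M_INPUT
print(f"${daily_cost:,.2f}/day")  # → $1,000.00/day
```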

Q7. Name three things that are measured in tokens (not characters or words) when working with LLM APIs.

Q8. What are special tokens like <|endoftext|> used for? Do they consume tokens from your context window?

Q9. Why is JSON more expensive in tokens than plain text for conveying the same information? Give an example.

Q10. Hands-on: Use the OpenAI tokenizer tool (or tiktoken library) to tokenize these three strings and compare token counts: (a) "Hello world", (b) "こんにちは世界", (c) "console.log('Hello world');"
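Before reaching for the real tokenizer in Q10, the ≈4-characters-per-token rule of thumb from Q3 can be sanity-checked in plain Python. This is only a heuristic for English text; the tiktoken library gives exact counts:

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~1 token per 4 characters (English text only)."""
    return round(len(text) / 4)

paragraph = ("JavaScript is a high-level, interpreted programming language "
             "that is one of the core technologies of the World Wide Web.")
print(len(paragraph), estimate_tokens(paragraph))  # 120 characters → ~30 tokens
```

The heuristic breaks down badly for Japanese, code, and JSON, which is exactly what comparing the three strings in Q10 demonstrates.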


4.1.b — Context Window (Q11–Q20)

Q11. Define context window and explain what is included in it (hint: it's not just the prompt).

Q12. GPT-4o has a 128K token context window. Your system prompt is 2,000 tokens, conversation history is 10,000 tokens, and you want 4,000 tokens for the response. How many tokens are available for RAG documents?
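The remaining-budget calculation in Q12 is simple subtraction, but writing it out makes the budgeting habit explicit:

```python
CONTEXT_WINDOW = 128_000        # GPT-4o, per the question
system_prompt = 2_000
conversation_history = 10_000
reserved_for_response = 4_000

rag_budget = (CONTEXT_WINDOW - system_prompt
              - conversation_history - reserved_for_response)
print(rag_budget)  # → 112000
```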

Q13. What happens when a conversation exceeds the context window? Name two strategies applications use to handle this.

Q14. Explain the "lost in the middle" problem. Where should you place the most important information in a long prompt?

Q15. Why does using a 200K token context not always produce better results than using 10K tokens?

Q16. In a multi-turn chatbot, token usage grows with every message because the entire history is resent each time. If the average turn adds 500 tokens, roughly how many tokens does the 40th API call send?

Q17. Describe the sliding window strategy for managing conversation history. What is its biggest downside?
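A minimal sketch of the sliding-window strategy from Q17. Token counts here are assumed to be precomputed per message; a real implementation would count them with a tokenizer:

```python
def trim_history(messages, budget):
    """Drop the oldest messages until the history fits the token budget.

    `messages` is a list of (text, token_count) tuples, oldest first.
    The downside: anything trimmed away is forgotten entirely.
    """
    kept = list(messages)
    while kept and sum(t for _, t in kept) > budget:
        kept.pop(0)  # discard the oldest turn
    return kept

history = [("turn 1", 500), ("turn 2", 500), ("turn 3", 500), ("turn 4", 500)]
print(trim_history(history, budget=1_200))  # keeps only the two newest turns
```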

Q18. What is the difference between context window and training data? Which one do you control as an engineer?

Q19. Token budgeting: Design a token budget for a RAG chatbot using Claude (200K window). Allocate tokens for: system prompt, retrieved documents, conversation history, user message, response, and safety margin.

Q20. A user says "the AI forgot what I said 10 minutes ago." Is this a bug? Explain what happened.


4.1.c — Sampling & Temperature (Q21–Q30)

Q21. After the model computes probabilities for all possible next tokens, what is sampling?

Q22. Explain what happens when you set temperature to 0. Is the output truly deterministic?

Q23. Compare temperature 0.2 vs temperature 1.2 — what does each produce, and when would you use each?

Q24. Explain top-p (nucleus sampling) with an example. If top-p = 0.9 and the top 3 tokens have probabilities 70%, 15%, and 8%, which tokens are included?
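The nucleus cut-off in Q24 can be sketched directly. The probabilities are the ones given in the question; the token names and the remainder bucket are made up for illustration:

```python
def nucleus(token_probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p. `token_probs` must be sorted highest-probability first."""
    kept, cumulative = [], 0.0
    for token, p in token_probs:
        kept.append(token)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

probs = [("Paris", 0.70), ("France", 0.15), ("the", 0.08), ("other", 0.07)]
print(nucleus(probs, top_p=0.9))  # → ['Paris', 'France', 'the']
```

Note that the third token is needed: 70% + 15% = 85% is still below the 90% threshold.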

Q25. What is top-k sampling? Why is it less commonly used than top-p in production APIs?

Q26. Why should you adjust either temperature or top-p, not both at the same time?

Q27. You're building a JSON extraction API that must return consistent, parseable output. What temperature do you set and why?

Q28. What do frequency_penalty and presence_penalty do? When would you use each?

Q29. Write the OpenAI API configuration (model, temperature, top_p, etc.) for each use case: (a) extracting dates from legal documents, (b) writing marketing taglines, (c) a general Q&A chatbot.
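One plausible shape of an answer to Q29, as plain request dicts. The parameter names match the OpenAI Chat Completions API; the specific values are judgment calls, not canonical answers:

```python
# Hypothetical settings per use case; tune to your own evaluation results.
configs = {
    # (a) Legal date extraction: consistency matters more than variety.
    "date_extraction": {"model": "gpt-4o", "temperature": 0, "seed": 42},
    # (b) Marketing taglines: encourage diverse, surprising phrasing.
    "taglines": {"model": "gpt-4o", "temperature": 1.1, "presence_penalty": 0.5},
    # (c) General Q&A: a balanced default.
    "qa_chatbot": {"model": "gpt-4o", "temperature": 0.7, "top_p": 1.0},
}
```

Notice each config adjusts temperature or top_p, not both, per Q26.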

Q30. Prediction: Given temperature 0 and the prompt "The capital of France is", what token will the model pick? What about temperature 2.0?


4.1.d — Hallucination (Q31–Q40)

Q31. Define hallucination in the context of LLMs. Why is "lying" the wrong word?

Q32. Explain the fundamental reason LLMs hallucinate. What is the model actually doing when it generates a wrong fact?

Q33. Name four types of hallucination and give an example of each: factual, fabricated citation, intrinsic, and extrinsic.

Q34. Why is hallucination more dangerous in medical/legal domains than in creative writing?

Q35. How does RAG (Retrieval-Augmented Generation) reduce hallucination? Write a pseudocode example.
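A skeleton of the shape an answer to Q35 might take. The retriever here is a hard-coded stub standing in for real embedding search over a vector store:

```python
def retrieve(query: str) -> list[str]:
    # Stub: a real system would embed `query` and search a vector store.
    return ["Refund policy: 14 days with receipt."]

def build_rag_prompt(query: str) -> str:
    docs = retrieve(query)
    # Grounding instruction: answer only from the retrieved context,
    # which is what reduces hallucination.
    return (
        "Answer ONLY from the context below. If the answer is not in the "
        "context, say you don't know.\n\nContext:\n"
        + "\n".join(docs)
        + f"\n\nQuestion: {query}"
    )

prompt = build_rag_prompt("What is the refund policy?")
```

The prompt would then be sent to the model; the retrieved policy text, not the model's training data, becomes the source of truth.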

Q36. Your chatbot told a customer that the refund policy is "30 days, no questions asked." The actual policy is "14 days with receipt." How would you prevent this?

Q37. Does temperature 0 eliminate hallucination? Why or why not?

Q38. List five strategies for reducing hallucination in production systems.

Q39. A lawyer used ChatGPT for legal research and submitted fake case citations to court. What went wrong, and how should AI-assisted research be done instead?

Q40. Why is hallucination actually the same mechanism as creativity? How does this affect your approach to different use cases?


4.1.e — Deterministic vs Probabilistic (Q41–Q50)

Q41. Define deterministic and probabilistic systems. Give one non-AI example of each.

Q42. Why are LLMs inherently probabilistic? What makes them different from a traditional if/else program?

Q43. How do temperature 0 and the seed parameter help achieve near-deterministic outputs?

Q44. Name five use cases where you need deterministic LLM behavior and explain why.

Q45. Name three use cases where probabilistic behavior is actually desirable.

Q46. Your data pipeline uses GPT-4o to extract prices from invoices. Sometimes it returns "$12.50" and sometimes "12.50". How do you fix this inconsistency?

Q47. A QA engineer says "I can't write tests for AI features because the output keeps changing." What do you recommend?

Q48. What is the n parameter in the OpenAI API? How does it relate to cost?

Q49. Why should you pin model versions in production instead of using aliases like "gpt-4o"?

Q50. Design exercise: Write a production logging schema for LLM API calls that would allow you to reproduce and debug any AI output.


Answer Hints

| Q | Hint |
|---|------|
| Q3 | ~120 characters ÷ 4 ≈ 30 tokens (actual varies by tokenizer) |
| Q6 | (1,500 + 500) × 200,000 = 400M input tokens × $2.50/1M = $1,000/day |
| Q12 | 128,000 − 2,000 − 10,000 − 4,000 = 112,000 tokens available |
| Q16 | System (2,000) + 40 turns × 500 × 2 (user + assistant) = 42,000 tokens |
| Q22 | Greedy decoding; nearly deterministic, but GPU floating-point variance remains |
| Q24 | 70 + 15 + 8 = 93% > 90%, so all three are included |
| Q27 | Temperature 0 — output must be consistent and parseable |
| Q37 | No — temperature 0 reduces randomness, but the model still predicts plausible-sounding wrong facts |
| Q46 | Temperature 0 + explicit format instruction + output validation |

← Back to 4.1 — How LLMs Actually Work (README)