Episode 4 — Generative AI Engineering / 4.2 — Calling LLM APIs Properly

4.2 — Exercise Questions: Calling LLM APIs Properly

Practice questions for all four subtopics in Section 4.2: a mix of conceptual, calculation, coding, and design tasks.

How to use this material

  1. Read the lessons in order: README.md, then 4.2.a through 4.2.d.
  2. Answer closed-book first — then compare to the matching lesson.
  3. Try the code examples — run them against real APIs to build muscle memory.
  4. Interview prep: 4.2-Interview-Questions.md.
  5. Quick review: 4.2-Quick-Revision.md.

4.2.a — Message Roles (Q1–Q10)

Q1. Name the three message roles in the OpenAI Chat Completions API and describe the purpose of each in one sentence.

Q2. What is the difference between how OpenAI and Anthropic handle the system message? Write the message structure for both providers.

Q3. Explain why LLM APIs are stateless. What does this mean for multi-turn conversations?

Q4. You have a chatbot that "forgets" the user's name after 2 messages. The developer sends only the system prompt and the latest user message each time. What is wrong and how do you fix it?

Q5. Write a messages array that uses few-shot examples to teach the model to extract email addresses from text and return them as a JSON array.

Q6. What happens if you put two user messages in a row without an assistant message between them? Does the API reject it or produce unexpected behavior?

Q7. A system prompt says: "Be extremely detailed and thorough" and also "Keep all responses under 50 words." What problem does this create and how would you fix it?

Q8. Why are few-shot examples placed as assistant messages rather than included in the system prompt? What advantage does this give?

Q9. Write a buildMessages() function (JavaScript) that takes a system prompt, an array of few-shot examples, conversation history, and a user message, then returns a properly ordered messages array.

Q10. Your system prompt is 2,000 tokens long and includes persona, rules, output format, scope limits, and safety instructions. A colleague says "just put it all in the user message." Give three reasons why the system role is better than the user role for instructions.
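As a shape reference for the questions above (especially Q2), here is the same conversation expressed for both providers. This is a sketch following common usage of the OpenAI Chat Completions and Anthropic Messages APIs; verify field names and model identifiers against current documentation.

```javascript
// OpenAI Chat Completions: the system prompt is the first message
// in the `messages` array.
const openaiPayload = {
  model: "gpt-4o",
  messages: [
    { role: "system", content: "You are a concise assistant." },
    { role: "user", content: "What is a token?" },
  ],
};

// Anthropic Messages API: the system prompt is a top-level `system`
// field, and `messages` holds only user/assistant turns.
// Note: Anthropic requires `max_tokens` on every request.
const anthropicPayload = {
  model: "claude-3-haiku-20240307",
  max_tokens: 1024,
  system: "You are a concise assistant.",
  messages: [
    { role: "user", content: "What is a token?" },
  ],
};
```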


4.2.b — Token Budgeting (Q11–Q20)

Q11. Write the formula for calculating available output tokens given a context window size. What are all the components that consume input tokens?

Q12. What does the max_tokens parameter control? Does it limit input, output, or both?

Q13. What is finish_reason: "length" and why should you always check it in production?

Q14. Calculation: GPT-4o has a 128K token context window. Your request has: system prompt (1,500 tokens), 3 few-shot examples (400 tokens each), 15 turns of history (average 300 tokens per message), current user message (200 tokens), and you want max_tokens set to 4,096. Calculate the total input tokens and remaining headroom.

Q15. Explain the sliding window strategy for managing conversation history. What is its biggest downside?

Q16. Explain the summarization strategy for managing conversation history. When is it better than a sliding window?

Q17. Write a function that validates whether a set of messages will fit within a given context window, returning the utilization percentage and any warnings.

Q18. Your RAG application retrieves 10 documents averaging 3,000 tokens each. The context window is 128K tokens and your other inputs consume 5,000 tokens. Should you include all 10 documents? What if max_tokens is 4,096?

Q19. What are "special tokens" and how much overhead do they add per message? Why does this matter for token budget calculations?

Q20. A user reports that your chatbot gives incomplete answers after long conversations. The responses end mid-sentence. Diagnose the issue using token budgeting concepts.


4.2.c — Cost Awareness (Q21–Q32)

Q21. Write the cost calculation formula for a single LLM API call. Why are output tokens more expensive than input tokens?

Q22. Calculation: You use GPT-4o ($2.50/$10.00 per 1M tokens). A single request has 2,000 input tokens and 500 output tokens. What is the cost? What about 100,000 such requests per day?

Q23. Calculation: Your system prompt is 1,200 tokens. You make 200,000 API calls per day using GPT-4o. How much does the system prompt alone cost per month? If you optimize it to 400 tokens, how much do you save?

Q24. Explain prompt caching and how it reduces costs. Name one provider that supports it and the approximate discount.

Q25. What is model routing and how can it reduce costs by 50-70%? Give an example of which tasks go to cheap models vs expensive models.

Q26. A developer writes this system prompt: "I would really appreciate it if you could please carefully and thoroughly analyze the following text and provide a comprehensive summary." Rewrite it to be token-efficient. Estimate the token savings.

Q27. Why is batching multiple items into a single API call cheaper than making one call per item? Calculate the system prompt savings for processing 100 items in batches of 10 vs individually.

Q28. Your AI feature costs $30,000/month. The breakdown is: 40% from a large system prompt, 30% from conversation history, 20% from model responses, 10% from RAG documents. Propose three optimizations and estimate the savings from each.

Q29. Compare the monthly cost of running a classification task (500 input tokens, 10 output tokens) at 50,000 requests/day on: (a) GPT-4o, (b) GPT-4o-mini, (c) Claude 3 Haiku. Which would you choose and why?

Q30. When should you use GPT-4o vs GPT-4o-mini? Give three task categories for each with reasoning.

Q31. Write a CostTracker class that records API calls, tracks spending by model and feature, and alerts when daily spending exceeds a threshold.

Q32. Scenario: A startup goes from 100 to 100,000 daily users in 3 months. Their AI feature costs $0.02 per user interaction with an average of 5 interactions per user per day. What is their monthly AI cost at each stage? At what point should they start optimizing?
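The cost questions above share one formula. A minimal sketch using the GPT-4o prices quoted in Q22 ($2.50 input / $10.00 output per 1M tokens):

```javascript
// Cost of one call given per-million-token prices.
function callCost(inputTokens, outputTokens, inPricePerM, outPricePerM) {
  return (inputTokens * inPricePerM + outputTokens * outPricePerM) / 1_000_000;
}

const perCall = callCost(2000, 500, 2.50, 10.00); // $0.005 in + $0.005 out
console.log(perCall);           // 0.01
console.log(perCall * 100_000); // ~ $1,000/day at 100K requests
```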


4.2.d — Rate Limits and Retries (Q33–Q44)

Q33. What are RPM and TPM? Give a scenario where you'd hit each one independently.

Q34. What does HTTP status code 429 mean? What header tells you when to retry?

Q35. Explain exponential backoff with jitter in plain English. Why is the jitter important?

Q36. Which HTTP status codes should you retry and which should you NOT retry? Explain why for each category.

Q37. Write a callWithRetry() function that implements exponential backoff with jitter, respects the retry-after header, and only retries appropriate error codes.

Q38. What is the thundering herd problem? How does jitter solve it?

Q39. You need to process 5,000 items through the GPT-4o API. Your rate limit is 500 RPM. How long will it take at minimum? Write code that processes all items while respecting the rate limit.

Q40. Explain the circuit breaker pattern. Draw the three states and transitions between them.

Q41. A team says "we have retries so we don't need a circuit breaker." Describe a scenario where retries make things worse and a circuit breaker would help.

Q42. What timeout should you set for LLM API calls? Why is "no timeout" dangerous?

Q43. Write a complete LLMClient class that combines: (a) retries with backoff, (b) concurrency limiting, (c) circuit breaker, (d) timeout, (e) fallback model.

Q44. Production checklist: List 10 things you should verify before shipping an LLM API integration to production. Cover error handling, monitoring, cost, and reliability.
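Several of the questions above (Q35–Q37) revolve around one pattern. A minimal sketch of exponential backoff with full jitter, where `callApi` is a hypothetical request function assumed to throw errors carrying a `status` code and optional `headers`:

```javascript
// Retry only transient errors: rate limits and server-side failures.
const RETRYABLE = new Set([429, 500, 502, 503, 504]);

async function callWithRetry(callApi, maxRetries = 5, baseMs = 1000) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await callApi();
    } catch (err) {
      // Give up on non-retryable codes (e.g. 400, 401) or exhausted budget.
      if (attempt >= maxRetries || !RETRYABLE.has(err.status)) throw err;
      // A server-provided retry-after (seconds) takes precedence;
      // the header name is an assumption about the error shape.
      const retryAfterMs = Number(err.headers?.["retry-after"]) * 1000;
      // Full jitter: wait a random time up to the exponential cap.
      const cap = baseMs * 2 ** attempt;
      const delay = retryAfterMs > 0 ? retryAfterMs : Math.random() * cap;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

The random delay (rather than a fixed one) is what prevents the thundering-herd behavior asked about in Q38.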


Answer Hints

| Q | Hint |
|-----|------|
| Q4 | Must send full conversation history with each API call (stateless) |
| Q7 | Conflicting instructions — model can't satisfy both. Remove one or add conditional logic |
| Q14 | System (1,500) + few-shot (3 x 400 = 1,200) + history (15 x 2 x 300 = 9,000) + user (200) + overhead (~80) = 11,980 input tokens. Headroom = 128,000 - 11,980 - 4,096 = 111,924 tokens |
| Q18 | 10 x 3,000 = 30,000 tokens. 30,000 + 5,000 + 4,096 = 39,096. Fits easily in 128K; include all 10 |
| Q20 | Conversation history fills the context window, leaving too few tokens for output. finish_reason is likely "length" |
| Q22 | Input: 2,000 x $2.50/1M = $0.005. Output: 500 x $10.00/1M = $0.005. Total: $0.01. Daily: $1,000 |
| Q23 | 1,200 x 200K = 240M tokens/day x $2.50/1M = $600/day = $18,000/month. After: $6,000/month. Savings: $12,000/month |
| Q29 | 50K req/day = 1.5M req/month = 750M input + 15M output tokens. At list prices (GPT-4o $2.50/$10.00, mini $0.15/$0.60, Haiku $0.25/$1.25 per 1M): GPT-4o ~$2,025/month, GPT-4o-mini ~$121.50/month, Haiku ~$206.25/month. Choose mini or Haiku for classification |
| Q32 | Stage 1: 100 users x 5 x $0.02 = $10/day = $300/month. Stage 2: 100,000 x 5 x $0.02 = $10,000/day = $300,000/month. Optimize before scaling |
| Q35 | Each retry waits longer (1s, 2s, 4s, 8s...). Jitter adds randomness so clients don't all retry simultaneously |
| Q39 | 5,000 items at 500 RPM = minimum 10 minutes. Use p-limit or a token bucket to throttle |
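The Q39 hint mentions throttling to stay under the rate limit. A minimal sequential pacer, assuming a hypothetical `processItem` function that wraps the real API call:

```javascript
// Process items at no more than `rpm` requests per minute by
// spacing calls out to one per interval (500 RPM = one per 120ms).
// Sequential sketch; for concurrency, pair this with a limiter
// such as p-limit as the hint suggests.
async function processAll(items, processItem, rpm = 500) {
  const intervalMs = 60_000 / rpm;
  const results = [];
  for (const item of items) {
    const started = Date.now();
    results.push(await processItem(item));
    const elapsed = Date.now() - started;
    if (elapsed < intervalMs) {
      await new Promise((resolve) => setTimeout(resolve, intervalMs - elapsed));
    }
  }
  return results;
}
```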

← Back to 4.2 — Calling LLM APIs Properly (README)