Episode 4 — Generative AI Engineering / 4.8 — Streaming Responses
4.8 — Exercise Questions: Streaming Responses
Practice questions for all three subtopics in Section 4.8. Mix of conceptual, code-based, and hands-on tasks.
How to use this material (instructions)
- Read lessons in order — README.md, then 4.8.a -> 4.8.c.
- Answer closed-book first — then compare to the matching lesson.
- Build the code exercises — streaming is best learned by doing.
- Interview prep — 4.8-Interview-Questions.md.
- Quick review — 4.8-Quick-Revision.md.
4.8.a — Streaming Tokens (Q1–Q12)
Q1. Explain the difference between a streaming and a non-streaming (buffered) LLM API call. If the total response time is the same, why does streaming feel faster?
Q2. What are Server-Sent Events (SSE)? Why is SSE preferred over WebSockets for LLM streaming? Name two advantages.
Q3. Write the single parameter you add to an OpenAI chat.completions.create call to enable streaming. What changes about the return type?
Q4. In an OpenAI streaming response, the token content lives in chunk.choices[0].delta.content, not chunk.choices[0].message.content. Why does the API use delta instead of message?
Q5. Compare the streaming event structures of OpenAI and Anthropic. Name at least three differences in how they deliver tokens.
Q6. Hands-on: Write a complete Node.js script that streams a response from the OpenAI API and prints each token to the console as it arrives. Use for await...of.
Q7. What is an async iterator in JavaScript? How does for await...of relate to Symbol.asyncIterator?
Q8. Hands-on: Create a mock stream using an async generator function (async function*) that yields fake tokens with a 50ms delay between each. Use it to test a streaming UI without hitting a real API.
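A minimal sketch of what Q8 asks for, using an async generator (the sample text and default 50ms delay are illustrative):

```javascript
// Mock stream: yields fake tokens with a configurable delay, so a
// streaming UI can be tested without hitting a real API.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function* mockStream(text = "Hello from the mock stream", delayMs = 50) {
  for (const word of text.split(" ")) {
    await sleep(delayMs);
    yield word + " "; // emit one "token" at a time, like delta.content
  }
}

// Consume it exactly like a real streaming response:
async function run() {
  let full = "";
  for await (const token of mockStream("a b c", 5)) {
    full += token;
  }
  return full;
}
```

Because the generator implements `Symbol.asyncIterator`, the same `for await...of` loop works unchanged when you later swap in a real SDK stream.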
Q9. Explain the raw SSE format. What does a line like data: {"choices":[{"delta":{"content":"Hello"}}]} represent? What does data: [DONE] signal?
Q10. When parsing raw SSE from a fetch response using ReadableStream, why do you need a buffer to accumulate partial data? What edge case does this handle?
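One way to handle the edge case in Q10 — a sketch of a buffering parser where `createSSEParser` is an illustrative name, not a library API:

```javascript
// Parses raw SSE text that may arrive split mid-event across network
// chunks. Events are delimited by a blank line (\n\n); anything after the
// last complete delimiter stays in the buffer until the next chunk arrives.
function createSSEParser() {
  let buffer = "";
  return function feed(chunk) {
    buffer += chunk;
    const events = [];
    let idx;
    while ((idx = buffer.indexOf("\n\n")) !== -1) {
      const raw = buffer.slice(0, idx);
      buffer = buffer.slice(idx + 2); // keep the incomplete tail
      for (const line of raw.split("\n")) {
        if (line.startsWith("data: ")) events.push(line.slice(6));
      }
    }
    return events;
  };
}
```

Feeding it `'data: hello\n\nda'` returns one event and holds `'da'` in the buffer; the next chunk `'ta: world\n\n'` completes the split event.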
Q11. How do you get token usage (prompt_tokens, completion_tokens) from a streaming OpenAI response? What parameter must you set?
Q12. Write a function that wraps a streaming API call and measures both time-to-first-token and total response time. Return both metrics alongside the full content.
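A sketch of the Q12 wrapper: it works over any async-iterable token stream, so it can be tested with the mock from Q8 before pointing it at a real API (the demo generator and function names are illustrative):

```javascript
// Wraps a token stream and measures time-to-first-token (TTFT) and total
// response time, returning both metrics alongside the accumulated content.
async function measureStream(stream) {
  const start = Date.now();
  let firstTokenAt = null;
  let content = "";
  for await (const token of stream) {
    if (firstTokenAt === null) firstTokenAt = Date.now();
    content += token;
  }
  return {
    content,
    timeToFirstTokenMs: firstTokenAt === null ? null : firstTokenAt - start,
    totalTimeMs: Date.now() - start,
  };
}

// Tiny demo stream standing in for a real API response:
async function* demoStream() {
  yield "Hello, ";
  yield "world";
}
```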
4.8.b — Progressive Rendering (Q13–Q24)
Q13. What is progressive rendering in the context of AI applications? How does it differ from traditional page rendering?
Q14. In React, why must you use setResponse(prev => prev + token) (functional updater) instead of setResponse(response + token) when appending streaming tokens?
Q15. Hands-on: Write a custom React hook called useStreamingChat that manages: (a) a messages array, (b) a streaming boolean, (c) an error state, (d) a sendMessage function, and (e) a cancelStream function.
Q16. Explain the auto-scroll problem in streaming chat UIs. Why should you only auto-scroll when the user is "near the bottom"? Write the condition check.
Q17. What is the typewriter effect and when would you choose it over raw token streaming? What is the main trade-off?
Q18. Hands-on: Write a useTypewriter hook that takes a streamedText string and returns a displayedText string that reveals characters one at a time with a configurable delay. Include a "catch-up" mechanism for when the streamed text gets too far ahead.
Q19. Explain the incomplete markdown problem during streaming. Give a specific example where "Here are **three" causes rendering issues, and describe three solutions.
Q20. Code challenge: Write a component that detects if the streaming content is inside an unclosed markdown code block (a dangling ``` fence) and temporarily closes it for rendering.
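The core of Q20 can be sketched as a pure helper (the simple fence-counting approach; it deliberately ignores longer fence runs and inline code spans):

```javascript
// Counts triple-backtick fences; an odd count means the streamed text is
// currently inside an unclosed code block, so we append a closing fence
// before handing the text to the markdown renderer.
const FENCE = "`".repeat(3);

function closeOpenCodeFence(text) {
  const fences = (text.match(/`{3}/g) || []).length;
  return fences % 2 === 1 ? text + "\n" + FENCE : text;
}
```

A component would call this on every streamed update, then pass the result to its markdown renderer; once the model emits the real closing fence, the count becomes even and the text passes through untouched.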
Q21. Describe how the Vercel AI SDK simplifies streaming in Next.js. How many lines of code does a basic streaming chat take with useChat and streamText?
Q22. Why can rapid streaming (50+ tokens/second) cause React performance issues? Describe the requestAnimationFrame batching technique to solve this.
Q23. How should you handle accessibility in streaming UIs? Name the aria-* attributes you should use and explain what each does.
Q24. Hands-on (Next.js): Write a complete Next.js App Router route handler (app/api/chat/route.js) that accepts a POST request with messages and returns a streaming SSE response from OpenAI.
4.8.c — Improving UX in AI Applications (Q25–Q37)
Q25. Define perceived latency vs actual latency. Give an example where two AI features have identical actual latency but very different perceived latency.
Q26. What is Time-to-First-Token (TTFT)? Name the five components that contribute to TTFT and identify which one typically dominates.
Q27. Calculation: A request has 5,000 input tokens. The model prefill rate is 2,500 tokens/second. Network round-trip is 100ms. What is the approximate TTFT?
Q28. Name three strategies for reducing TTFT in a production application. For each, explain the mechanism and estimated impact.
Q29. Describe the three-phase loading strategy for AI responses: (a) immediate feedback phase, (b) streaming phase, (c) completion phase. What UI element belongs in each phase?
Q30. Hands-on: Build a skeleton screen component that shows animated placeholder lines while waiting for the AI response. The skeleton should transition smoothly to streaming content when the first token arrives.
Q31. Name six scenarios where you should NOT stream the LLM response. For each, explain why buffered is better.
Q32. Write a shouldStream(config) decision function that takes parameters like expectedTokens, isUserFacing, needsValidation, and outputFormat, and returns true or false.
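One possible shape for the Q32 decision function — the rules follow the scenarios listed in Q31, and the 50-token threshold comes from that hint:

```javascript
// Stream only when a human is watching, the response is long enough to
// benefit, and the output does not need to be complete before use.
function shouldStream({ expectedTokens, isUserFacing, needsValidation, outputFormat }) {
  if (!isUserFacing) return false;           // batch/background jobs: buffer
  if (needsValidation) return false;         // must validate the full output first
  if (outputFormat === "json") return false; // partial JSON is unparseable
  if (expectedTokens < 50) return false;     // short replies gain no UX value
  return true;
}
```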
Q33. Explain the hidden cost of cancellation in LLM streaming. When a user clicks "Stop generating" at token 50 of a 500-token response, what actually happens on the LLM side? How are you billed?
Q34. Hands-on: Implement cancellation with AbortController in a streaming fetch call. The code should: (a) cancel on button click, (b) keep partial content visible, (c) handle the AbortError gracefully.
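A self-contained sketch of the Q34 pattern, with a local mock stream standing in for the fetch body (in a real app the same AbortController signal would be passed to fetch):

```javascript
// Cancellation with AbortController: partial content is preserved and the
// AbortError is swallowed rather than surfaced as a user-facing error.
const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function* slowTokens(signal) {
  for (const t of ["one ", "two ", "three ", "four "]) {
    await wait(20);
    if (signal.aborted) {
      // fetch rejects with a DOMException named "AbortError"; we mimic that
      const err = new Error("The operation was aborted");
      err.name = "AbortError";
      throw err;
    }
    yield t;
  }
}

async function streamWithCancel() {
  const controller = new AbortController();
  let partial = "";
  setTimeout(() => controller.abort(), 50); // e.g. user clicks "Stop generating"
  try {
    for await (const token of slowTokens(controller.signal)) {
      partial += token;
    }
  } catch (err) {
    if (err.name !== "AbortError") throw err; // real errors still propagate
    // AbortError: keep partial content visible, no error banner
  }
  return partial;
}
```

Note, per Q33: this stops the client from reading, but the provider may still generate (and bill for) the remaining tokens server-side.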
Q35. What is the difference between a pre-stream error and a mid-stream error? How should the UI handle each differently?
Q36. Design a UX metrics dashboard for an AI chat application. List at least six metrics, their target values, and alert thresholds.
Q37. Name five common UX anti-patterns in AI applications and describe the fix for each.
Answer Hints
| Q | Hint |
|---|---|
| Q1 | Same total time, but TTFT drops from 10s to ~200ms with streaming |
| Q2 | SSE is one-way (server->client), simpler than WebSockets, built-in reconnection, works through proxies |
| Q4 | delta means "the change since last chunk" — only the new token, not the accumulated text |
| Q5 | OpenAI uses choices[0].delta.content; Anthropic uses typed events (content_block_delta with delta.text); Anthropic has lifecycle events (message_start, content_block_start/stop, message_stop); OpenAI uses [DONE] sentinel |
| Q10 | Network chunks may split in the middle of an SSE event; the buffer accumulates partial data until a complete \n\n-delimited event is available |
| Q11 | Set stream_options: { include_usage: true } — usage appears in the final chunk |
| Q14 | Closure captures stale response value; functional updater always uses the latest state |
| Q16 | container.scrollHeight - container.scrollTop - container.clientHeight < 100 |
| Q19 | "**three" is an unclosed bold tag; solutions: (1) markdown lib that handles partial, (2) debounced rendering, (3) raw text while streaming / markdown after |
| Q22 | Each setState triggers a re-render; at 50 tok/s that is 50 re-renders/second; requestAnimationFrame batches to ~16ms intervals (60fps) |
| Q27 | Prefill: 5000/2500 = 2s. Network: 100ms. Total TTFT ~ 2.1s (plus ~100ms for API overhead) |
| Q31 | Short responses (<50 tokens), JSON output, binary decisions, batch processing, function/tool calls, validation-required output |
| Q33 | The model continues generating all 500 tokens; you are billed for all output tokens; client-side cancellation does not stop server-side generation |
| Q35 | Pre-stream: show error + retry, no partial content. Mid-stream: keep partial content visible, show error banner below it, offer retry |
<- Back to 4.8 — Streaming Responses (README)