Episode 4 — Generative AI Engineering / 4.9 — Combining Streaming with Structured Data
4.9 — Exercise Questions: Combining Streaming with Structured Data
Practice questions for all three subtopics in Section 4.9. Mix of conceptual, design, and hands-on tasks.
How to use this material
- Read lessons in order — README.md, then 4.9.a -> 4.9.c.
- Answer closed-book first — then compare to the matching lesson.
- Build the code — many questions require writing actual implementations.
- Interview prep — 4.9-Interview-Questions.md.
- Quick review — 4.9-Quick-Revision.md.
4.9.a — Streaming Conversational Text (Q1–Q10)
Q1. Define streaming in the context of LLM APIs. What is the fundamental difference between streaming and non-streaming responses?
Q2. What is Time to First Token (TTFT)? Why is it the most important latency metric for streaming UX? Give typical TTFT values for GPT-4o and Claude.
Q3. Name three scenarios where streaming conversational text is the right choice, and three scenarios where you should NOT stream and instead wait for the complete response.
Q4. Write a JavaScript function using the OpenAI SDK that streams a response and returns the accumulated full text. The function should accept a userMessage parameter and an onToken callback.
Q5. Explain the accumulate-while-streaming pattern. Why do you always need to build the full response alongside the real-time stream?
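For checking your Q5 answer: a minimal sketch of the accumulate-while-streaming pattern. It uses a stand-in async generator in place of a real SDK stream, and `streamChunks`/`onToken` are illustrative names, not part of any SDK.

```javascript
// Stand-in for an SDK stream: any async iterable of text chunks works here.
async function* streamChunks() {
  for (const chunk of ["Hello", ", ", "world", "!"]) yield chunk;
}

// Forward each chunk to the UI callback while accumulating the full text.
async function streamAndAccumulate(onToken) {
  const parts = [];
  for await (const chunk of streamChunks()) {
    onToken(chunk);    // real-time channel: render immediately
    parts.push(chunk); // accumulation channel: keep for history and logging
  }
  return parts.join(""); // full response for Phase 2, storage, etc.
}
```

The key point the sketch shows: the stream is consumed exactly once, and both channels (live rendering and the accumulated copy) are fed from that single pass.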
Q6. A user on a slow 3G mobile connection reports that your streaming chat shows text appearing in "large jumps" rather than smooth word-by-word flow. What is happening technically, and how can you improve the experience?
Q7. Write code to implement a streaming conversation manager that maintains message history across multiple turns. Each assistant reply should be streamed in real time and then appended to the history.
Q8. Why does string concatenation in a tight loop (fullText += content) have performance implications for long responses? What is the better alternative? Write code for both approaches.
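A compact reference sketch for the two approaches Q8 asks about. Note that modern engines such as V8 mitigate repeated concatenation with rope-like internal string representations, so the gap is workload-dependent; the array-join form simply has predictable linear cost.

```javascript
// Approach 1: repeated concatenation. In the worst case each += copies the
// whole string so far, so n chunks can approach O(n^2) character copies.
function accumulateConcat(chunks) {
  let fullText = "";
  for (const c of chunks) fullText += c;
  return fullText;
}

// Approach 2: collect parts, join once at the end. O(n) total copying.
function accumulateJoin(chunks) {
  const parts = [];
  for (const c of chunks) parts.push(c);
  return parts.join("");
}
```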
Q9. Implement abort/cancellation support for a streaming response. When the user clicks "stop generating," the stream should be terminated, and partial text should be preserved.
Q10. Hands-on: Build an Express.js endpoint that streams an LLM response to the browser using Server-Sent Events (SSE). Include proper headers, error handling, and a completion event.
4.9.b — Returning Structured JSON After Generation (Q11–Q22)
Q11. Explain the two-phase response pattern in one paragraph. What problem does it solve?
Q12. Describe the single-call delimiter pattern. What delimiter would you choose, and why might the model forget to include it?
Q13. Write a complete implementation of the delimiter-based pattern: prompt design, streaming with delimiter detection, and JSON extraction after the delimiter.
Q14. What are the edge cases in delimiter detection during streaming? What happens if the delimiter arrives split across two chunks (e.g., "---JS" in one chunk, "ON---" in the next)?
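One way to handle the split-delimiter edge case in Q14: hold back a tail of up to `DELIMITER.length - 1` characters at the end of each chunk, since that tail could be the start of a delimiter completed by the next chunk. A sketch (the delimiter string is an illustrative choice):

```javascript
const DELIMITER = "---JSON---";

// Returns a feed(chunk) function that yields { text, json }:
// text safe to render now, json the accumulated post-delimiter content (or null).
function createDelimiterSplitter() {
  let buffer = "";
  let afterDelimiter = null; // null until the delimiter has been seen

  return function feed(chunk) {
    if (afterDelimiter !== null) {
      afterDelimiter += chunk; // everything after the delimiter is JSON
      return { text: "", json: afterDelimiter };
    }
    buffer += chunk;
    const idx = buffer.indexOf(DELIMITER);
    if (idx !== -1) {
      afterDelimiter = buffer.slice(idx + DELIMITER.length);
      const text = buffer.slice(0, idx);
      buffer = "";
      return { text, json: afterDelimiter };
    }
    // Emit all but a tail that could be the start of a split delimiter.
    const safeLen = Math.max(0, buffer.length - (DELIMITER.length - 1));
    const text = buffer.slice(0, safeLen);
    buffer = buffer.slice(safeLen);
    return { text, json: null };
  };
}
```

This handles the "---JS" / "ON---" split from the question: the first chunk's ambiguous tail stays buffered, and the delimiter is matched once the second chunk arrives.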
Q15. Compare sequential two-call vs parallel two-call patterns. When would you choose each? Give a concrete scenario for each choice.
Q16. Cost calculation: Your application processes 200,000 requests per day. Each request has a 50-token user message and a 200-token system prompt. The conversational response averages 400 tokens and the structured JSON averages 150 tokens. Calculate the daily cost difference between the single-call pattern and the two-call sequential pattern using GPT-4o pricing ($2.50/1M input, $10/1M output).
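One way to structure the Q16 arithmetic. The token figures restate the question's assumptions; the ~10-token delimiter overhead in the single call and the re-sent context in call 2 are modelling assumptions you should state in your own answer.

```javascript
const PRICE_IN = 2.50 / 1e6;   // $ per input token (GPT-4o)
const PRICE_OUT = 10.00 / 1e6; // $ per output token
const REQUESTS = 200_000;      // per day

// Single-call pattern: prompt + message in; conversational text,
// delimiter (~10 tokens assumed), and JSON out.
const singleIn = 200 + 50;
const singleOut = 400 + 10 + 150;
const singleDaily = REQUESTS * (singleIn * PRICE_IN + singleOut * PRICE_OUT);

// Two-call sequential: call 2 re-sends the prompt, the message, and the
// 400-token reply as context, then produces the 150-token JSON.
const call1In = 200 + 50, call1Out = 400;
const call2In = 200 + 50 + 400, call2Out = 150;
const twoCallDaily = REQUESTS *
  ((call1In + call2In) * PRICE_IN + (call1Out + call2Out) * PRICE_OUT);

console.log(singleDaily.toFixed(0), twoCallDaily.toFixed(0)); // ≈ 1245 vs 1550
```

So the two-call pattern costs roughly $300/day more at this volume, almost entirely from re-sending the conversational reply as input to the second call.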
Q17. Explain the "stream then parse" pattern. How does it differ from the delimiter pattern? Write code that streams an entire response, then extracts JSON from markdown code blocks after the stream completes.
Q18. How does OpenAI function calling (tool use) combined with streaming solve the dual-purpose problem? Write code that streams conversational text AND accumulates function call arguments simultaneously.
Q19. Write a production-ready service class that tries the delimiter pattern first, and if JSON extraction fails, falls back to a second structured API call. Include retry logic.
Q20. A junior developer suggests: "Just use response_format: { type: 'json_object' } and stream it." Explain why this does NOT solve the problem of providing both conversational text and structured data.
Q21. Design the prompt engineering for a single-call approach where the model must: (a) explain its analysis conversationally, (b) output a delimiter, (c) output valid JSON matching a specific schema. What are the risks, and how do you mitigate them?
Q22. Hands-on: Implement the Anthropic Claude tool use pattern for combined streaming and structured output. The model should stream a text explanation and simultaneously generate structured data through a tool call.
4.9.c — Separating UI from System Outputs (Q23–Q35)
Q23. What is the dual-channel architecture problem? Why is it an architecture problem, not just a parsing problem?
Q24. Draw (in ASCII or describe) the data flow for an event-based AI response pipeline that serves: (a) the browser (streamed text), (b) a database (structured data), (c) an analytics service (usage metrics), and (d) a Slack bot (high-priority alerts).
Q25. Write a complete EventEmitter-based pipeline in Node.js that emits typed events (stream:token, stream:end, structure:complete, response:error) and has four independent consumers subscribing to different events.
Q26. Compare SSE (Server-Sent Events) vs WebSockets for dual-channel AI responses. When would you choose each? What are the trade-offs for reconnection, browser support, and HTTP/2 compatibility?
Q27. Write a multi-channel SSE server that sends different SSE event types for text tokens, structured data, and status updates. Include the client-side parsing code.
Q28. Implement a ConnectionManager class that tracks active streaming connections, handles client disconnects, and exposes metrics (active streams, peak concurrent, total requests).
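A starting-point sketch for Q28's `ConnectionManager`, tracking only the three metrics the question names. Wiring it to real SSE responses (registering on request, disconnecting on the socket's `close` event) is left to your implementation.

```javascript
class ConnectionManager {
  constructor() {
    this.active = new Map(); // connection id -> cleanup callback
    this.peak = 0;
    this.total = 0;
  }
  register(id, onDisconnect) {
    this.active.set(id, onDisconnect);
    this.total += 1;
    this.peak = Math.max(this.peak, this.active.size);
  }
  disconnect(id) {
    const cleanup = this.active.get(id);
    if (cleanup) cleanup(); // e.g. abort the upstream LLM stream
    this.active.delete(id);
  }
  metrics() {
    return {
      activeStreams: this.active.size,
      peakConcurrent: this.peak,
      totalRequests: this.total,
    };
  }
}
```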
Q29. How do you test dual-channel architectures? Write a test harness that captures both the text stream and structured data events from a mock SSE response, then validates each channel independently.
Q30. Your application serves 50,000 concurrent users. Each user interaction creates two LLM calls. You are rate-limited to 10,000 requests/minute. Design a priority queue that ensures streaming calls (user-facing) are always processed before structured extraction calls (system-facing).
Q31. Explain the Response Router pattern. How does it differ from the event emitter pattern? When would you use conditional routing (only route to certain consumers based on response content)?
Q32. Write Express.js middleware that composes a dual-channel pipeline: (1) stream text to the client, (2) extract structured data, (3) log metrics and finalize. Each step should be a separate middleware function.
Q33. A user reports: "The chat text appeared instantly, but the insights panel took 3 more seconds to load." Explain the timing issue. Propose two architectural changes to reduce the perceived delay.
Q34. Design exercise: You are building a customer support AI. The streamed text goes to the customer. The structured data includes: sentiment (for routing), category (for logging), and suggested_actions (for the support agent's dashboard). Design the complete architecture including the prompt, the separation strategy, and the event routing.
Q35. Scaling exercise: Your event-based pipeline currently runs in a single Node.js process. Traffic is growing to 100K concurrent users. How do you scale horizontally? Consider: stream state management, event distribution across processes, and structured extraction load balancing.
Answer Hints
| Q | Hint |
|---|---|
| Q2 | TTFT for GPT-4o: ~200-500ms; Claude: ~300-700ms. Users perceive responsiveness from TTFT, not total time |
| Q3 | Stream: chatbot, long explanations, creative writing. Do NOT stream: short responses, JSON for display, error messages |
| Q5 | You need the full text for: conversation history, logging, structured extraction in Phase 2, database storage |
| Q8 | Array of parts + .join('') at end is O(n); repeated += creates intermediate strings at each step |
| Q14 | Use a buffer that checks for partial delimiter matches at the end of each chunk |
| Q16 | Single: ~250 input + ~560 output per req. Two-call: ~250+650 input + ~400+150 output. At 200K/day: single ~$1,300/day, two-call ~$1,575/day |
| Q20 | response_format: json_object forces the ENTIRE response to be JSON — no conversational text at all |
| Q26 | SSE: simpler, auto-reconnect, HTTP/2 native. WebSocket: bidirectional, cancel mid-stream, real-time collaboration |
| Q30 | Use two queues with weighted scheduling (e.g., 80% capacity for streaming, 20% for extraction) |
| Q33 | The structured extraction is a second API call that starts after the stream completes. Solutions: parallel calls, or optimistic UI with client-side extraction |
| Q35 | Redis pub/sub for event distribution; sticky sessions or stream state in Redis; separate worker pool for extraction |
<- Back to 4.9 — Combining Streaming with Structured Data (README)