Episode 4 — Generative AI Engineering / 4.9 — Combining Streaming with Structured Data

Interview Questions: Combining Streaming with Structured Data

Model answers for streaming conversational text, two-phase response patterns, delimiter-based separation, dual API calls, and event-based architectures for separating UI from system outputs.

How to use this material (instructions)

  1. Read lessons in order: README.md, then 4.9.a -> 4.9.c.
  2. Practice out loud — definition -> example -> pitfall.
  3. Pair with exercises: 4.9-Exercise-Questions.md.
  4. Quick review: 4.9-Quick-Revision.md.

Beginner (Q1–Q4)

Q1. Why would you stream an LLM response instead of waiting for the complete response?

Why interviewers ask: Tests understanding of a fundamental UX pattern in AI applications — nearly every production chatbot uses streaming.

Model answer:

Streaming delivers LLM tokens to the user as they are generated, rather than waiting for the entire response. The key metric is Time to First Token (TTFT) — typically 200-500ms — versus total generation time, which might be 5-15 seconds for a long response. Without streaming, the user stares at a spinner for the entire generation time. With streaming, they start reading as soon as the first token arrives, typically within half a second.

Streaming is essential for conversational AI because it makes the interaction feel responsive and natural. Users perceive the application as faster even though total generation time is identical. However, streaming is not always appropriate — very short responses (< 50 tokens), structured data for programmatic consumption, and error messages should typically be delivered atomically.

The implementation uses Server-Sent Events (SSE) or WebSockets on the transport layer, and the LLM SDK's streaming mode (stream: true) on the API layer. On the server, you iterate over an async stream and forward each chunk to the client.
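That server loop can be sketched minimally as follows. The token stream here is a simulated stand-in for the SDK's stream: true response (the real iterable comes from the provider's client library); the SSE framing itself is the standard data: ...\n\n format.

```javascript
// Format one chunk as an SSE data frame.
function sseFrame(text) {
  return `data: ${JSON.stringify({ text })}\n\n`;
}

// Simulated token stream — in production this is the LLM SDK's
// `stream: true` async iterable.
async function* tokenStream() {
  for (const t of ['Hello', ', ', 'world', '!']) yield t;
}

// Forward each chunk to a writable HTTP response as it arrives.
async function forwardStream(res) {
  res.setHeader?.('Content-Type', 'text/event-stream');
  for await (const token of tokenStream()) {
    res.write(sseFrame(token));
  }
  res.write('data: [DONE]\n\n'); // conventional end-of-stream sentinel
  res.end?.();
}
```

On the client, the browser's EventSource API (or a fetch reader) consumes these frames and appends each token to the visible message.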


Q2. What is the "two-phase response pattern" and when do you use it?

Why interviewers ask: Tests whether you understand the core architectural pattern for serving both humans and systems from a single LLM interaction.

Model answer:

The two-phase response pattern solves a common production problem: the user needs streamed conversational text (for good UX), but the system also needs structured JSON (for databases, analytics, downstream APIs). These two needs conflict — you cannot JSON.parse() a half-streamed sentence, and you cannot show raw JSON to a user.

The pattern works in two phases: Phase 1 streams human-readable text to the UI in real time. Phase 2 extracts or generates structured JSON for the system. Phase 2 can happen several ways:

  1. Single call with delimiter — the model outputs conversational text, then a delimiter, then JSON. Cheapest but least reliable.
  2. Sequential two calls — first call streams text, second call extracts structured data from the text using response_format: { type: "json_object" }. Most reliable.
  3. Parallel two calls — both calls fire simultaneously with the same user input but different system prompts. Fastest but structured data may not match the conversational text exactly.

You use this pattern whenever an AI feature needs to simultaneously serve a human reader AND feed structured data into the system — which is the majority of production AI features.
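The sequential two-call variant (option 2 above) can be sketched as follows. streamLLM and callLLM are hypothetical stand-ins for real SDK calls, so only the control flow is meaningful: stream text first, then extract structured data from that text.

```javascript
// Phase 1 stand-in: a real call would stream tokens from the provider.
async function streamLLM(prompt, onToken) {
  for (const t of ['I recommend ', 'the MacBook Pro.']) onToken(t);
  return 'I recommend the MacBook Pro.';
}

// Phase 2 stand-in: with response_format json_object, the provider
// returns pure JSON.
async function callLLM(prompt, options) {
  return '{"recommendation":"MacBook Pro"}';
}

async function twoPhaseRespond(userMessage, onToken) {
  // Phase 1: stream conversational text to the UI.
  const text = await streamLLM(userMessage, onToken);
  // Phase 2: extract structured data FROM the streamed text,
  // which keeps the two outputs consistent.
  const raw = await callLLM(
    `Extract a JSON object from this answer:\n${text}`,
    { response_format: { type: 'json_object' }, temperature: 0 }
  );
  return { text, data: JSON.parse(raw) };
}
```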


Q3. What is Time to First Token (TTFT) and why does it matter for streaming?

Why interviewers ask: Tests knowledge of the key performance metric that streaming optimizes for, and understanding of perceived vs actual latency.

Model answer:

Time to First Token (TTFT) is the duration from when the API request is sent to when the first token of the response is received by the client. Typical values: GPT-4o ~200-500ms, Claude ~300-700ms.

TTFT matters because it determines perceived latency. Human perception research shows that users judge responsiveness based on when they first see activity, not when the full response arrives. A 10-second generation with 200ms TTFT feels fast because the user starts reading immediately. The same 10-second generation without streaming feels slow because the user waits the entire time.

In practice, TTFT is affected by: model size (larger models have higher TTFT), prompt length (more input tokens = longer processing before first output), server load (queuing delays), and network latency. You should monitor TTFT as a production metric separate from total generation time, and set alerting thresholds appropriate for your UX requirements (typically < 1 second for consumer applications).


Q4. Why can't you just use response_format: { type: "json_object" } with streaming to get both conversational text and structured data?

Why interviewers ask: This is a common misconception. Tests whether the candidate has actually tried to build a system that serves both humans and machines.

Model answer:

response_format: { type: "json_object" } forces the entire model output to be valid JSON. You get structured data, but you lose the conversational text entirely. The response will be something like {"recommendation": "MacBook Pro", "reason": "..."} — which is not a natural reading experience for a user.

You could stream this JSON, but streaming partial JSON is not meaningful to a human. The user would see {"recommen then dation": "Mac appearing token by token — nonsensical from a UX perspective.

The fundamental issue is that conversational text and structured JSON are different formats for different consumers. Conversational text is optimized for human readability — paragraphs, natural language, emphasis, examples. Structured JSON is optimized for machine parseability — predictable keys, typed values, no ambiguity. You need both, and response_format: { type: "json_object" } can only give you one.

This is why the two-phase pattern exists: use streaming without response_format for the human-facing text, then use a separate call or post-processing for the machine-facing JSON.


Intermediate (Q5–Q8)

Q5. Compare the single-call delimiter pattern vs the two-call pattern for combining streaming with structured data.

Why interviewers ask: Tests practical knowledge of the cost, reliability, and latency tradeoffs — a real production decision.

Model answer:

Single-call delimiter pattern: Instruct the model to output conversational text, then a delimiter (e.g., ---JSON---), then valid JSON. Stream the text portion to the user, accumulate the JSON portion silently, parse it after the stream completes.

  • Pros: One API call (cheapest), lowest latency, all data from a single generation (maximum consistency).
  • Cons: The model might forget the delimiter, output invalid JSON, or mix JSON into the conversational text. You cannot use response_format since the response is mixed format. Delimiter detection during streaming requires careful buffering to handle split delimiters across chunks.

Two-call pattern: First call streams conversational text. Second call sends the user message (and optionally the first response) to extract structured JSON with response_format: { type: "json_object" }.

  • Pros: Each call is specialized — streaming call uses natural temperature (0.7), extraction call uses temperature 0 with enforced JSON format. Much more reliable JSON. Clear separation of concerns.
  • Cons: Two API calls (higher cost, ~7-23% more depending on variant). Additional latency for the sequential variant. For the parallel variant, structured data may not exactly match conversational text.

Decision criteria: Use single-call for high-volume, cost-sensitive applications where occasional extraction failures are acceptable (with fallback). Use two-call for reliability-critical applications, especially in domains like medical, legal, or financial where the structured data drives important decisions.
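The "careful buffering" caveat for the single-call pattern can be sketched as follows. The key detail is holding back a small tail of the buffer that could be the start of a delimiter split across chunks; the ---JSON--- delimiter matches the example above, and the chunking is simulated.

```javascript
const DELIM = '---JSON---';

function createDelimiterSplitter(onText) {
  let buffer = '';
  let pastDelimiter = false;
  let jsonPart = '';
  return {
    push(chunk) {
      if (pastDelimiter) { jsonPart += chunk; return; }
      buffer += chunk;
      const idx = buffer.indexOf(DELIM);
      if (idx !== -1) {
        onText(buffer.slice(0, idx)); // flush remaining user-visible text
        jsonPart = buffer.slice(idx + DELIM.length);
        pastDelimiter = true;
        buffer = '';
      } else {
        // Emit everything except a tail that could be a split delimiter.
        const safe = buffer.length - (DELIM.length - 1);
        if (safe > 0) {
          onText(buffer.slice(0, safe));
          buffer = buffer.slice(safe);
        }
      }
    },
    // Caller should catch a parse failure here and fall back to a
    // dedicated extraction call.
    end() { return JSON.parse(jsonPart.trim()); }
  };
}
```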


Q6. How do you implement an event-based architecture for dual-purpose AI responses?

Why interviewers ask: Tests architectural thinking and ability to design systems that separate concerns cleanly.

Model answer:

An event-based architecture uses an event emitter (or message bus) to decouple the AI response pipeline from its consumers. The pipeline emits typed events at each stage; consumers subscribe only to the events they care about.

Event types in a typical pipeline:

  • request:start — request received with metadata
  • stream:token — individual token generated (UI subscribes)
  • stream:end — text streaming complete
  • structure:complete — structured JSON extracted (database subscribes)
  • response:complete — all phases done (analytics subscribes)
  • response:error — failure at any stage (monitoring subscribes)

Implementation: In Node.js, extend EventEmitter. The pipeline class has a process() method that runs the streaming phase (emitting stream:token for each token), then the structure phase (emitting structure:complete), then emits response:complete. Each consumer registers with pipeline.on('event', handler).

Benefits: (1) Adding a new consumer (e.g., Slack notifications) requires zero changes to the pipeline — just add a new event listener. (2) Fault isolation — if the database write fails, the UI stream is unaffected. (3) Testability — each consumer can be tested independently with mock events. (4) The event log forms a natural audit trail.

This pattern scales well. For multi-process deployments, replace the in-process EventEmitter with Redis pub/sub or a message queue like RabbitMQ.


Q7. How do you handle failures in a two-phase response system?

Why interviewers ask: Production systems must handle partial failures gracefully. This tests resilience engineering.

Model answer:

In a two-phase system, failures can occur at multiple points, and each needs specific handling:

Phase 1 failure (streaming): The stream might break mid-response due to network issues, rate limits, or server errors. Strategy: (1) Preserve partial text — whatever was streamed to the user is already visible. (2) Show an error indicator to the user. (3) Retry from scratch if appropriate, or offer a "regenerate" button. (4) Log the partial response for debugging.

Phase 2 failure (structured extraction): The conversational text was delivered successfully, but the structured JSON extraction fails. Strategy: (1) The user experience is NOT affected — they got their answer. (2) Return a degraded result with partial: true and data: null. (3) Retry the extraction call with exponential backoff. (4) If all retries fail, queue for async extraction later. (5) If the data is needed for immediate display (like a sidebar), show a "loading failed" state with a retry button.

JSON parse failure (single-call pattern): The model output something that looks like JSON but is not valid. Strategy: (1) Try multiple extraction methods — delimiter-based, code block regex, brace-matching. (2) If all fail, fall back to a dedicated structured call. (3) Log the raw output for prompt engineering improvements.

Key principle: The user-facing stream is the highest priority. If only one phase can succeed, it should be Phase 1. The system-facing structured data can be retried, queued, or degraded without the user noticing.
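The layered extraction fallbacks for the parse-failure case can be sketched as a single function: try the delimiter, then a fenced code block, then brace matching, and return null if all fail (at which point the caller falls back to a dedicated structured call). The delimiter name is illustrative.

```javascript
function tryParse(s) {
  try { return JSON.parse(s); } catch { return null; }
}

function extractJson(raw, delimiter = '---JSON---') {
  // 1. Delimiter-based: everything after the delimiter should be JSON.
  const di = raw.indexOf(delimiter);
  if (di !== -1) {
    const parsed = tryParse(raw.slice(di + delimiter.length).trim());
    if (parsed) return parsed;
  }
  // 2. Fenced code block: ```json ... ``` (fence built from chars to
  // avoid literal backtick runs in this listing).
  const FENCE = '`'.repeat(3);
  const fenceRe = new RegExp(FENCE + '(?:json)?\\s*([\\s\\S]*?)' + FENCE);
  const fence = raw.match(fenceRe);
  if (fence) {
    const parsed = tryParse(fence[1].trim());
    if (parsed) return parsed;
  }
  // 3. Brace matching: first '{' to last '}'.
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start !== -1 && end > start) {
    const parsed = tryParse(raw.slice(start, end + 1));
    if (parsed) return parsed;
  }
  return null; // all methods failed — log raw output, use fallback call
}
```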


Q8. How do you choose between SSE and WebSockets for streaming AI responses?

Why interviewers ask: Tests practical knowledge of transport protocols for real-time AI applications.

Model answer:

Server-Sent Events (SSE):

  • Unidirectional (server -> client only)
  • Built on HTTP — works with load balancers, proxies, CDNs, and HTTP/2 natively
  • Automatic reconnection built into the browser EventSource API
  • Named event types (event: text, event: structured) for channel separation
  • Simple to implement — just HTTP with special headers
  • Best for: Standard chat applications where the user sends a message and receives a streamed response. Most AI applications fit this pattern.

WebSockets:

  • Bidirectional (client <-> server)
  • Persistent connection — lower per-message overhead
  • No built-in reconnection (must implement yourself)
  • Requires WebSocket-aware infrastructure (some load balancers need special configuration)
  • Best for: Applications that need real-time bidirectional communication: user can cancel mid-stream, send additional context while receiving, collaborative editing, or multiplayer features.

Decision rule: Start with SSE unless you have a concrete bidirectional requirement. SSE covers 90%+ of AI streaming use cases, is simpler to implement and debug, and has better infrastructure compatibility. Move to WebSockets only when you need the client to send messages during an active stream (e.g., "stop generating," "refine this part," real-time collaboration).
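The named-event channel separation mentioned above can be sketched on the server side as follows; res stands in for any writable Node HTTP response, and the text/structured event names follow the convention in this section.

```javascript
// Format a named SSE event: the `event:` line lets the browser route
// each frame to a specific listener.
function sseEvent(event, data) {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

function sendDualResponse(res, tokens, structured) {
  res.writeHead?.(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    Connection: 'keep-alive',
  });
  // Channel 1: conversational tokens for the UI.
  for (const t of tokens) res.write(sseEvent('text', { token: t }));
  // Channel 2: structured payload for the system, on a separate event.
  res.write(sseEvent('structured', structured));
  res.end?.();
}
```

On the browser side, new EventSource(url).addEventListener('text', handler) receives only the text channel, and a second listener on 'structured' handles the machine-facing payload.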


Advanced (Q9–Q11)

Q9. Design a production architecture for an AI system that serves 50,000 concurrent users, where each interaction produces both streamed text and structured data.

Why interviewers ask: Tests system design at scale — combining AI, real-time streaming, and structured data processing.

Model answer:

Architecture overview:

Client (Browser)
    |
    | SSE / WebSocket
    v
Load Balancer (sticky sessions for active streams)
    |
    v
API Servers (Node.js cluster, 8-16 instances)
    |
    ├── Stream Manager (tracks active connections per process)
    |
    ├── Priority Queue ──→ LLM API calls
    |       |
    |       ├── High priority: Streaming calls (user-facing)
    |       └── Low priority: Structured extraction calls
    |
    ├── Redis Pub/Sub ──→ Event distribution across processes
    |
    └── Worker Pool ──→ Structured extraction (separate from stream handlers)
            |
            ├── Database writes (findings, conversation history)
            ├── Analytics pipeline (usage, performance)
            └── Alert service (urgent findings → Slack/PagerDuty)

Key design decisions:

1. Rate limit management: At 50K concurrent users, you may have up to ~100K LLM API calls in flight (one stream + one extraction each). Separate rate limit pools: 70% capacity for streaming (user-facing, latency-sensitive), 30% for extraction (system-facing, can be delayed). When the extraction queue is full, enqueue for async processing rather than blocking.

2. Stream state: Active streams must be tracked to handle disconnects. Use a per-process ConnectionManager with a Redis-backed global view. When a client disconnects, abort the LLM stream immediately to stop token generation and reduce cost.

3. Extraction backpressure: If extraction calls pile up, implement circuit-breaking: after N consecutive failures or queue depth > threshold, temporarily switch to single-call delimiter pattern (cheaper, faster, but less reliable).

4. Monitoring: Track TTFT (p50, p95, p99), stream completion rate, extraction success rate, queue depth, and cost per request. Alert on TTFT p95 > 2s or extraction success rate < 95%.
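The backpressure switch in point 3 can be sketched as a small circuit breaker: after N consecutive failures, or when the queue is too deep, route new requests through the cheaper single-call delimiter pattern until extraction recovers. The thresholds here are illustrative.

```javascript
class ExtractionCircuit {
  constructor({ maxFailures = 5, maxQueueDepth = 100 } = {}) {
    this.maxFailures = maxFailures;
    this.maxQueueDepth = maxQueueDepth;
    this.consecutiveFailures = 0;
    this.queueDepth = 0; // updated by the queue as jobs enter/leave
  }
  // A success resets the failure streak; a failure extends it.
  recordResult(ok) {
    this.consecutiveFailures = ok ? 0 : this.consecutiveFailures + 1;
  }
  // True => use the single-call delimiter pattern for now.
  shouldFallBack() {
    return this.consecutiveFailures >= this.maxFailures ||
           this.queueDepth > this.maxQueueDepth;
  }
}
```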


Q10. How would you migrate a system from a simple "stream then parse" approach to a full event-based pipeline without breaking existing functionality?

Why interviewers ask: Tests incremental architecture evolution — a critical skill for production systems.

Model answer:

Phase 1: Wrap existing code in events (0 risk)

Keep the existing stream-then-parse logic unchanged. Add an EventEmitter wrapper that emits events at the existing code's natural boundaries:

// Existing: streamAndParse(message)
// New: wraps it and emits events
pipeline.on('stream:token', existingTokenHandler); // Same as before
pipeline.on('response:complete', existingCompleteHandler); // Same as before

The existing UI consumer subscribes to the same events it was already handling. Zero behavior change, but now you have an event bus.

Phase 2: Add new consumers (low risk)

New features (analytics, database writes, Slack alerts) subscribe to the existing events. They are additive — if they fail, the existing stream is unaffected. Each new consumer is behind a feature flag.

Phase 3: Separate extraction into its own phase (medium risk)

Move the "parse JSON from the streamed text" logic into an event-driven phase that emits structure:complete. Add a fallback: if parsing fails, make a dedicated API call. The UI consumer is untouched. The database consumer now listens to structure:complete instead of parsing the full text itself.

Phase 4: Independent scaling (higher complexity)

Move the extraction phase to a worker pool. Stream events flow through Redis pub/sub. The extraction worker subscribes to stream:end, performs extraction, and publishes structure:complete. This decouples the streaming server from the extraction workload.

Key principle: At each phase, the user-facing streaming behavior is unchanged. New capabilities are added alongside, never replacing. Feature flags control rollout. Rollback is always possible by disabling the flag and reverting to the previous phase.


Q11. Explain how you would design a system where the structured data from an LLM response needs to be consistent with the streamed text, even under failure conditions.

Why interviewers ask: Consistency between human-facing and machine-facing outputs is a subtle but critical production requirement — tests deep understanding.

Model answer:

The consistency problem: If the conversational text says "I recommend the MacBook Pro at $1,999" but the structured JSON says {"product": "Dell XPS", "price": "$1,599"}, the user and the system have conflicting information. This can happen with parallel two-call patterns (independent generations) or when extraction fails and falls back to a different method.

Solution 1: Sequential extraction (highest consistency)

Always extract structured data FROM the conversational text, never independently. The second call receives the streamed text as context and is instructed to extract data from it, not generate new data. Consistency is guaranteed because the structured data is derived from the text.

Downside: additional latency (must wait for stream to complete before extracting).

Solution 2: Single-call with validation

Use the delimiter pattern (single call). Both the text and JSON come from the same generation, so they are inherently consistent. Add validation: programmatically check that key claims in the text match the JSON (e.g., product names, prices mentioned in text exist in the JSON array).

If validation fails, flag the response and trigger a re-extraction from the text.

Solution 3: Consistency hash

Generate a content hash of key fields mentioned in the text (extracted via regex or a lightweight pass). Compare with the structured data. If they diverge beyond a threshold, regenerate the structured data from the text (sequential extraction).

Failure handling: If the stream succeeds but extraction fails, log the inconsistency. Show the user their conversational text (which is already delivered). Queue the extraction for retry. Never show structured data that hasn't been validated against the text. Use a "pending" state in the UI for the structured panel until extraction completes and validates.

Key principle: The conversational text is the source of truth (the user has already seen it). All structured data must be derivable from and consistent with that text. Design the extraction to be a function of the text, not an independent generation.
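The validation step in Solution 2 can be sketched as a check that key string fields in the structured data actually appear in the conversational text. The field list here is illustrative; a real validator would be driven by the response schema.

```javascript
function validateConsistency(text, data, fields = ['product', 'price']) {
  const mismatches = [];
  for (const field of fields) {
    const value = data[field];
    // Flag any string field whose value never appears in the text.
    if (typeof value === 'string' && !text.includes(value)) {
      mismatches.push(field);
    }
  }
  return { consistent: mismatches.length === 0, mismatches };
}
```

On a mismatch, the response is flagged and the structured data is regenerated from the text (sequential extraction), never shown as-is.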


Quick-fire

| # | Question | One-line answer |
|---|----------|-----------------|
| 1 | What is TTFT? | Time to First Token — the key metric for streaming UX, typically 200-700ms |
| 2 | Stream or wait for a 20-token response? | Wait — streaming adds visual noise for a near-instant response |
| 3 | Can you JSON.parse a half-streamed response? | No — partial JSON is invalid. Stream text to humans; parse JSON after completion |
| 4 | Single-call or two-call for reliability? | Two-call — second call uses response_format: { type: "json_object" } for guaranteed valid JSON |
| 5 | SSE or WebSocket for a standard chatbot? | SSE — simpler, auto-reconnect, HTTP/2 compatible, sufficient for unidirectional streaming |

<- Back to 4.9 — Combining Streaming with Structured Data (README)