Episode 4 — Generative AI Engineering / 4.8 — Streaming Responses

Interview Questions: Streaming Responses

Model answers for streaming tokens, progressive rendering, and AI UX improvements.

How to use this material (instructions)

  1. Read lessons in order: README.md, then 4.8.a -> 4.8.c.
  2. Practice out loud — definition -> example -> pitfall.
  3. Pair with exercises: 4.8-Exercise-Questions.md.
  4. Quick review: 4.8-Quick-Revision.md.

Beginner (Q1–Q4)

Q1. What is streaming in the context of LLM APIs, and why does it matter?

Why interviewers ask: Tests if you understand the fundamental UX problem streaming solves and the protocol it uses.

Model answer:

Streaming means the LLM API sends tokens incrementally as the model generates them, rather than waiting for the complete response. This is implemented using Server-Sent Events (SSE) — a standard HTTP mechanism for server-to-client push. You enable it by setting stream: true in the API call.

Streaming matters because LLM responses take 3-15 seconds for the full output. Without streaming, users stare at a spinner the entire time — well into the "abandonment" zone (>3s). With streaming, the first token arrives in 200-500ms, and users can start reading immediately. The total response time is identical, but the perceived latency drops dramatically. This is why every major AI chat product (ChatGPT, Claude, Gemini) uses streaming — it transforms a frustrating wait into a responsive "typing" experience.


Q2. How do you consume a streaming response in Node.js?

Why interviewers ask: Tests practical ability to work with the SDK and async iteration patterns.

Model answer:

Using the official SDK, the streaming response is an async iterable. You consume it with for await...of:

```javascript
const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello' }],
  stream: true,
});

let fullResponse = '';
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || '';
  fullResponse += content;
  process.stdout.write(content);
}
```

Key details: (1) Content lives in delta.content, not message.content; delta means "just the new token." (2) The first chunk includes delta.role: "assistant". (3) The final chunk has finish_reason: "stop" and an empty delta. (4) For token usage, set stream_options: { include_usage: true } — usage data arrives in the last chunk.

Without the SDK (raw fetch), you read from response.body.getReader(), decode with TextDecoder, and manually parse SSE format by splitting on \n\n and extracting data: lines.
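That manual parsing step can be sketched as a small pure helper (an illustrative sketch, not an SDK function; it assumes the standard `data:` line format and OpenAI's `[DONE]` sentinel):

```javascript
// Minimal SSE parser: takes a decoded text buffer, returns parsed JSON
// payloads plus any incomplete trailing event to carry into the next read.
function parseSSE(buffer) {
  const events = [];
  const parts = buffer.split('\n\n');       // SSE events are blank-line delimited
  const remainder = parts.pop();            // last part may be an incomplete event
  for (const part of parts) {
    for (const line of part.split('\n')) {
      if (!line.startsWith('data: ')) continue;
      const data = line.slice('data: '.length);
      if (data === '[DONE]') continue;      // OpenAI's end-of-stream sentinel
      events.push(JSON.parse(data));
    }
  }
  return { events, remainder };
}
```

Carrying `remainder` forward matters because a network read can end mid-event; appending it to the next decoded chunk keeps the parser correct across arbitrary chunk boundaries.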


Q3. What is the difference between SSE and WebSockets? Why do LLM APIs use SSE?

Why interviewers ask: Tests understanding of the underlying protocol choice and its engineering trade-offs.

Model answer:

SSE (Server-Sent Events) is a one-way protocol: the server pushes events to the client over a standard HTTP connection. WebSockets is a bidirectional protocol that upgrades the HTTP connection to a persistent two-way channel.

LLM APIs use SSE because:

  1. One-way data flow is sufficient — the client sends a single request (the prompt), and the server sends many events (the tokens). There is no need for bidirectional communication during a single generation.
  2. Simpler infrastructure — SSE works over standard HTTP, compatible with proxies, CDNs, load balancers, and firewalls without special configuration. WebSockets require upgrade handling and can be blocked by corporate proxies.
  3. Built-in reconnection — SSE has automatic reconnection if the connection drops. WebSocket reconnection must be implemented manually.
  4. Lower overhead — no handshake upgrade, no frame encoding, no ping/pong keep-alive.

WebSockets would only be necessary if the client needed to send data to the server during the stream (e.g., real-time collaborative editing), which is not the case for LLM generation.


Q4. How do you display streaming tokens in a React component?

Why interviewers ask: Tests ability to connect backend streaming to frontend rendering — a core AI engineering skill.

Model answer:

The fundamental pattern is: (1) fetch the streaming endpoint, (2) read from response.body.getReader(), (3) parse SSE events, (4) append each token to React state using a functional updater.

```javascript
const [response, setResponse] = useState('');

// Inside the streaming function:
const content = parsedChunk.choices[0]?.delta?.content;
if (content) {
  setResponse(prev => prev + content); // Functional updater is critical
}
```

The functional updater (prev => prev + content) is essential because the closure captured at function creation time would hold a stale reference to response. Without it, each token would overwrite instead of append.

For production, I'd extract this into a custom useStreamingChat hook that manages messages, streaming state, error state, and cancellation via AbortController. For even faster development, the Vercel AI SDK provides a useChat hook that reduces the entire implementation to ~30 lines.
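The stale-closure failure mode can be demonstrated without React, using a mock state setter (a simplified model; React's real setState also batches, but the capture rule is the same, and all names here are illustrative):

```javascript
// Mock of React's state setter: accepts a value or an updater function.
let state = '';
const setState = (next) => {
  state = typeof next === 'function' ? next(state) : next;
};

// Simulate a single render: the component body runs once and captures
// that render's state value, like `response` inside the component.
function render() {
  const captured = state;
  return {
    // BAD: reads the captured value, which never updates mid-stream
    appendStale: (token) => setState(captured + token),
    // GOOD: the functional updater always receives the latest state
    appendFresh: (token) => setState((prev) => prev + token),
  };
}

// Tokens arrive faster than re-renders, so both calls share one render's closure.
let handlers = render();
handlers.appendStale('Hel');
handlers.appendStale('lo'); // overwrites: state ends up 'lo', not 'Hello'
const staleResult = state;

state = '';
handlers = render();
handlers.appendFresh('Hel');
handlers.appendFresh('lo'); // appends correctly
const freshResult = state;
```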


Intermediate (Q5–Q8)

Q5. How do you handle markdown rendering during streaming?

Why interviewers ask: Tests awareness of a real production challenge — partial markdown is syntactically invalid and causes UI flickering.

Model answer:

LLMs often respond with markdown (bold, code blocks, lists). During streaming, you receive incomplete markdown: "Here are **three" has an unclosed bold tag. Rendering this as markdown causes visual flickering as tags open and close across chunks.

Four solutions, in order of preference:

  1. Use a tolerant markdown renderer — Libraries like react-markdown handle incomplete markdown gracefully, treating unclosed tags as plain text until they close. This is the simplest fix.

  2. Debounced markdown rendering — Only re-render markdown every 100ms instead of on every token. This reduces flicker and improves performance.

  3. Code block detection — The worst case is unclosed code blocks (```). Count backtick-fence occurrences; if odd, temporarily append a closing fence for rendering.

  4. Phase-based rendering — Show raw text (with whitespace preserved) during streaming, then render full markdown only after the stream completes. This eliminates all flickering but sacrifices in-stream formatting.

In production, I combine #1 and #3: a tolerant renderer with special handling for code blocks.
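Solution #3 can be sketched as a small helper that balances fences before handing partial text to the renderer (an illustrative sketch; the heuristic assumes standard triple-backtick fences at the start of their own lines):

```javascript
// Append a temporary closing fence while a code block is still open,
// so the markdown renderer never sees an unterminated code block.
const FENCE = '`'.repeat(3); // triple backtick, built up to avoid nesting it literally here
function closeOpenFences(partialMarkdown) {
  // Count fence lines seen so far; an odd count means a block is open.
  const matches = partialMarkdown.match(new RegExp('^' + FENCE, 'gm')) || [];
  return matches.length % 2 === 1
    ? partialMarkdown + '\n' + FENCE
    : partialMarkdown;
}
```

The helper is idempotent on balanced input, so it is safe to call on every render pass during streaming.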


Q6. What is time-to-first-token (TTFT) and how do you optimize it?

Why interviewers ask: Tests understanding of the most important AI UX metric and practical optimization strategies.

Model answer:

TTFT is the time from the user submitting a request to the first token appearing on screen. It is composed of: network latency + server processing + API setup + model prefill time + client rendering.

Model prefill time dominates for long prompts — the model must process all input tokens before generating the first output token. A 100-token prompt might have 300ms TTFT; a 20,000-token RAG prompt might have 2-3 seconds TTFT.

Optimization strategies:

  1. Shorter system prompts (high impact) — Every unnecessary token in the system prompt adds prefill time on every request. Optimize aggressively.
  2. Fewer/shorter RAG chunks (high impact) — Retrieve only the top 3-5 most relevant chunks rather than flooding the context.
  3. Faster/smaller models (high impact) — gpt-4o-mini has roughly 50% faster TTFT than gpt-4o. Choose based on task complexity.
  4. Prompt caching (high impact) — Both OpenAI and Anthropic support caching repeated prompt prefixes, dramatically reducing prefill time for the cached portion.
  5. Edge deployment / connection reuse (medium impact) — Reduce network latency by deploying closer to users and reusing HTTP connections.

I target TTFT < 500ms for interactive features and alert if it exceeds 2 seconds.
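Measuring TTFT client-side is straightforward once the stream is an async iterable; a sketch (`measureTTFT` is a hypothetical helper name, and any token source with the same shape works):

```javascript
// Measure time-to-first-token and total generation time for a token stream.
async function measureTTFT(stream) {
  const start = performance.now();
  let ttft = null;
  let tokens = 0;
  for await (const _ of stream) {
    if (ttft === null) ttft = performance.now() - start; // latency to first token
    tokens++;
  }
  return { ttft, tokens, total: performance.now() - start };
}
```

Emitting `ttft` as a metric per request is what makes the "alert if it exceeds 2 seconds" policy enforceable.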


Q7. When should you NOT stream an LLM response?

Why interviewers ask: Tests nuance — knowing when NOT to apply a technique is as important as knowing the technique.

Model answer:

Streaming adds complexity and is not always beneficial:

  1. JSON / structured output — Partial JSON is invalid. You cannot parse {"name": "Joh until the complete object arrives. Buffer and display only when complete.

  2. Short responses (< 50 tokens) — The response arrives in under 1 second anyway. Streaming overhead (SSE setup, chunked parsing) adds latency without UX benefit.

  3. Binary decisions (yes/no, approve/reject) — A single word does not benefit from incremental display.

  4. Pipeline-internal calls — Intermediate LLM calls in a chain (e.g., "summarize this, then classify the summary") are consumed programmatically, not displayed to users.

  5. Validation-required output — If you must validate or moderate the response before showing it (e.g., content safety check), streaming bypasses that gate.

  6. Function/tool calling — Tool call arguments must be complete before execution. Streaming partial JSON arguments is unusable.

My rule of thumb: stream when the output is user-facing free-form text longer than 50 tokens. Everything else gets buffered.
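The buffering rule for structured output (case #1) is simple in code, assuming the stream is an async iterable of delta strings (`bufferJsonStream` is an illustrative name):

```javascript
// Buffer-then-parse for structured output: never parse partial JSON.
async function bufferJsonStream(deltas) {
  let buffer = '';
  for await (const fragment of deltas) {
    buffer += fragment; // '{"name": "Joh' is invalid JSON until complete
  }
  return JSON.parse(buffer); // parse exactly once, at the end
}
```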


Q8. How do you handle cancellation in a streaming AI feature?

Why interviewers ask: Tests understanding of the full cancellation lifecycle, including a commonly overlooked billing implication.

Model answer:

Cancellation involves three layers:

Frontend: Use AbortController with the fetch request. When the user clicks "Stop generating," call controller.abort(). Catch the AbortError in the streaming loop and keep partial content visible — don't discard what has already been shown.

Backend: Listen for the close event on the request (req.on('close', ...)). When detected, stop writing to the response and break out of the streaming loop. This saves server resources but does NOT stop the LLM.

LLM API layer (the catch): Current LLM APIs do not support true cancellation. When you abort the client connection, the model continues generating the full response on the provider's side. You are billed for all output tokens the model generates, not just the ones you received. For a 500-token response cancelled at token 50, you still pay for 500 output tokens.

Mitigation: Set a reasonable max_tokens to cap worst-case cost. For user-initiated "stop," the billing impact is usually small. For programmatic cancellation (e.g., content filter), max_tokens is your primary defense.
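The frontend layer of this can be sketched without a network by modeling the stream as an async iterable; in a real app the same `signal` would also be passed to `fetch`, and the names here are illustrative:

```javascript
// User-initiated cancellation: stop consuming on abort, but keep the
// partial content that was already rendered instead of discarding it.
async function streamWithCancel(tokens, signal) {
  let partial = '';
  try {
    for await (const token of tokens) {
      if (signal.aborted) {
        throw new DOMException('Aborted', 'AbortError'); // mirror fetch's abort error
      }
      partial += token;
    }
  } catch (err) {
    if (err.name !== 'AbortError') throw err; // real errors still propagate
    // AbortError: fall through, returning what the user already saw
  }
  return partial;
}
```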


Advanced (Q9–Q11)

Q9. Design the streaming architecture for a production AI chat application.

Why interviewers ask: Tests system design ability — combining streaming, error handling, UX, and infrastructure.

Model answer:

Frontend architecture:

User Input → useStreamingChat hook → fetch() with AbortController
  → ReadableStream reader → SSE parser → state updates
  → React components: MessageList, StreamingIndicator, ErrorBanner
  → Performance: requestAnimationFrame batching for 60fps

Backend architecture:

Express/Next.js route → validate input → rate limit check
  → OpenAI/Anthropic SDK stream call → SSE writer
  → Client disconnect detection → graceful cleanup
  → Metrics emission (TTFT, total time, tokens, errors)

Key design decisions:

  1. SSE over the wire — Server pushes typed events (token, usage, error, done) to the client.
  2. Three-phase loading — Thinking dots (0-TTFT) -> streaming text with cursor (TTFT-completion) -> formatted markdown (after completion).
  3. Cancellation — AbortController on client, req.on('close') on server, max_tokens to cap cost.
  4. Error handling — Pre-stream errors show error + retry. Mid-stream errors show partial content + error banner + retry. Rate limits trigger backoff + retry with user feedback.
  5. Monitoring — Track TTFT, total time, error rate, cancellation rate, token usage per request. Alert on TTFT > 2s or error rate > 5%.
  6. Resilience — Automatic retry with exponential backoff (max 3 attempts). Chunk timeout (30s between chunks = stale stream). Client-side reconnection on network failure.
  7. Accessibility — aria-live="polite" on message container, aria-busy during streaming, screen-reader hints for dynamic content.
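The server-side SSE writer that pushes the typed events above can be sketched like this (an Express-style sketch over Node's response API; event names and structure are the design choice described above, not a library API):

```javascript
// Minimal SSE writer over a Node/Express response object.
function sseWriter(res) {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    Connection: 'keep-alive',
  });
  return {
    send(event, data) {
      // SSE wire format: a named event plus a JSON data line, blank-line terminated
      res.write(`event: ${event}\ndata: ${JSON.stringify(data)}\n\n`);
    },
    done() {
      res.write('event: done\ndata: {}\n\n');
      res.end();
    },
  };
}
```

Typed events let the client route tokens, usage, and errors to different handlers instead of overloading a single `data:` payload.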

Q10. How do you ensure streaming performance at scale — 10,000 concurrent users?

Why interviewers ask: Tests understanding of the infrastructure challenges unique to streaming at scale.

Model answer:

Streaming presents unique scaling challenges because each active stream holds an open HTTP connection for the duration of the response (5-30 seconds). At 10,000 concurrent users, that is 10,000 open connections simultaneously.

Server-side:

  1. Event-loop-based servers (Node.js, Go) are ideal — they handle thousands of concurrent connections without thread-per-connection overhead. Avoid thread-based servers (traditional Java/PHP) for streaming endpoints.
  2. Connection limits — Default Node.js maxConnections may need increasing. Monitor file descriptor usage.
  3. Memory per connection — Each open stream consumes memory for the response buffer, SSE writer, and request context. At 10K connections, budget ~50-100KB per connection = 500MB-1GB just for stream state.
  4. Horizontal scaling — Multiple server instances behind a load balancer. Use sticky sessions or stateless design (each stream is a single request, no cross-request state needed).

Proxy/CDN layer:

  1. Disable response buffering — Nginx, Cloudflare, and other proxies buffer responses by default, which defeats streaming. Set X-Accel-Buffering: no in the response header and configure proxy pass with proxy_buffering off.
  2. Increase proxy timeouts — Default timeouts (60s) may kill long-running streams. Set proxy_read_timeout to 120-300s.
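An Nginx location for a streaming endpoint might look like this (a sketch; the path and upstream name are assumptions, and the app itself would still send X-Accel-Buffering: no for proxies it cannot configure):

```nginx
# Sketch: Nginx config for an SSE streaming route (hypothetical path/upstream).
location /api/chat {
    proxy_pass http://app_upstream;
    proxy_buffering off;          # deliver SSE chunks immediately, do not buffer
    proxy_cache off;
    proxy_read_timeout 300s;      # allow long-running streams past the 60s default
    proxy_http_version 1.1;       # needed for chunked streaming to the upstream
    proxy_set_header Connection '';
}
```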

LLM API layer:

  1. Rate limits — LLM providers impose per-minute request and token limits. At 10K concurrent users, you likely need multiple API keys, a request queue, or an enterprise agreement.
  2. Connection pooling — Reuse HTTP connections to the LLM API to avoid TCP handshake overhead.

Client-side:

  1. requestAnimationFrame batching — At 50 tokens/second per stream, React state updates must be batched to avoid 50 re-renders/second.
  2. Virtualized message list — For long conversations, only render visible messages to avoid DOM bloat.
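The batching idea in #1 can be sketched as a token buffer that coalesces many tokens into one state update per frame (a sketch; `schedule` would be requestAnimationFrame in the browser, injected here so the pattern is testable, and `createTokenBatcher` is an illustrative name):

```javascript
// Coalesce rapidly arriving tokens into one flush per animation frame.
function createTokenBatcher(onFlush, schedule) {
  let pending = '';
  let scheduled = false;
  return function push(token) {
    pending += token;
    if (scheduled) return; // at most one flush per frame, however many tokens arrive
    scheduled = true;
    schedule(() => {
      scheduled = false;
      const batch = pending;
      pending = '';
      onFlush(batch); // a single setState call for the whole batch
    });
  };
}
```

At 50 tokens/second this caps React at one re-render per frame, with `onFlush` being the `setResponse(prev => prev + batch)` call.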

Q11. Compare the streaming developer experience across OpenAI, Anthropic, and the Vercel AI SDK. When would you choose each?

Why interviewers ask: Tests breadth of experience with different providers and ability to make pragmatic technology choices.

Model answer:

OpenAI SDK:

  • stream: true returns an async iterable of ChatCompletionChunk objects.
  • Content is in chunk.choices[0].delta.content.
  • Token usage requires stream_options: { include_usage: true }.
  • Cancellation via AbortController in the request options.
  • Best when: you are OpenAI-exclusive and want full control over stream processing.

Anthropic SDK:

  • .stream() method returns a typed event stream with lifecycle events (message_start, content_block_delta, message_stop).
  • Content is in events where type === 'content_block_delta', accessed via event.delta.text.
  • Provides .on('text', callback) for event-based consumption.
  • await stream.finalMessage() gives the complete message after stream ends.
  • Best when: you are Anthropic-exclusive and want structured event handling with typed events.

Vercel AI SDK:

  • Provider-agnostic: works with OpenAI, Anthropic, Google, and others through a unified adapter layer.
  • Backend: streamText() returns a StreamTextResult with .toDataStreamResponse() for instant SSE endpoint.
  • Frontend: useChat() hook handles messages, streaming state, cancellation, and input management in ~10 lines.
  • Best when: you want maximum development speed, multi-provider flexibility, or are building a Next.js application.

My recommendation: Start with the Vercel AI SDK for most projects — it cuts development time by 80% and makes provider switching trivial. Drop down to the raw SDK only when you need provider-specific features (Anthropic tool use patterns, OpenAI fine-tuning integration) or need custom stream transformation logic that the abstraction does not support.


Quick-fire

| # | Question | One-line answer |
|---|----------|-----------------|
| 1 | What parameter enables streaming in OpenAI? | stream: true |
| 2 | What protocol does LLM streaming use? | Server-Sent Events (SSE) over HTTP |
| 3 | Where is the token content in an OpenAI chunk? | chunk.choices[0].delta.content |
| 4 | Where is the token content in an Anthropic event? | event.delta.text (when event.type === 'content_block_delta') |
| 5 | What signals the end of an OpenAI stream? | data: [DONE] or finish_reason: "stop" |
| 6 | What is TTFT? | Time-to-first-token — time from request to first visible token |
| 7 | Target TTFT for good UX? | < 500ms |
| 8 | How to cancel a stream in the browser? | AbortController + signal on the fetch request |
| 9 | Does cancelling a stream stop the model? | No — the model continues generating; you are billed for all tokens |
| 10 | Should you stream JSON responses? | No — partial JSON is invalid; buffer and parse when complete |
| 11 | React updater pattern for streaming? | setState(prev => prev + token) — functional updater avoids stale closures |

<- Back to 4.8 — Streaming Responses (README)