Episode 4 — Generative AI Engineering / 4.8 — Streaming Responses

4.8 — Streaming Responses: Quick Revision

Compact cheat sheet. Print-friendly.

How to use this material (instructions)

  1. Skim before labs or interviews.
  2. Drill gaps — reopen README.md -> 4.8.a...4.8.c.
  3. Practice: 4.8-Exercise-Questions.md.
  4. Polish answers: 4.8-Interview-Questions.md.

Core vocabulary

Term                    One-liner
----                    ---------
Streaming               Delivering LLM tokens incrementally as generated, not waiting for the full response
SSE                     Server-Sent Events — one-way HTTP protocol for server-to-client push
Chunk                   A single streaming event containing one or more tokens
Delta                   The new content in a chunk (only the change, not the accumulated text)
TTFT                    Time-to-first-token — most important UX metric for AI features
Perceived latency       How fast the app feels to the user (vs actual latency)
Progressive rendering   Displaying content as it arrives instead of waiting for completion
AbortController         Browser/Node.js API for cancelling fetch requests and streams
Async iterator          JavaScript interface (Symbol.asyncIterator) consumed with for await...of
Typewriter effect       Smooth character-by-character reveal of streamed text

Streaming at a glance

NON-STREAMING:  Request ──── [10s waiting] ──── Full response
STREAMING:      Request ── [0.2s] ── token ── token ── token ── done

Same total time. Different perceived latency.
TTFT drops from 10s → 200ms.

Enable streaming

// OpenAI
const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello' }],
  stream: true,                                      // <-- this
  stream_options: { include_usage: true },            // <-- for token counts
});

// Anthropic
const stream = await anthropic.messages.stream({     // <-- .stream() method
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  messages: [{ role: 'user', content: 'Hello' }],
});

Consume streams

// OpenAI — for await...of
for await (const chunk of stream) {
  const token = chunk.choices[0]?.delta?.content || '';
  process.stdout.write(token);
}

// Anthropic — for await...of with event types
for await (const event of stream) {
  if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
    process.stdout.write(event.delta.text);
  }
}

// Anthropic — event-based
stream.on('text', (text) => process.stdout.write(text));
const final = await stream.finalMessage();

Chunk structure

OpenAI chunk:
  chunk.choices[0].delta.content     → the token text
  chunk.choices[0].finish_reason     → null | "stop" | "length" | "content_filter"
  chunk.usage                        → { prompt_tokens, completion_tokens } (final chunk only)

Anthropic events (in order):
  message_start       → metadata, input_tokens
  content_block_start → block begins
  content_block_delta → the token text (event.delta.text)
  content_block_stop  → block ends
  message_delta       → stop_reason, output_tokens
  message_stop        → stream complete

SSE format

data: {"choices":[{"delta":{"content":"Hello"}}]}\n\n
data: {"choices":[{"delta":{"content":" world"}}]}\n\n
data: {"choices":[{"delta":{},"finish_reason":"stop"}]}\n\n
data: [DONE]\n\n
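A minimal sketch of parsing the SSE lines above on the client. The function name is an assumption; a production parser would also buffer partial lines across network chunks before splitting on newlines.

```javascript
// Parse one raw SSE line from an OpenAI-style stream.
// Returns the delta text, or null when the line carries no
// content (comments, blank lines, empty deltas, the [DONE] sentinel).
function parseSseLine(line) {
  if (!line.startsWith('data: ')) return null;  // ignore comments/blank lines
  const payload = line.slice('data: '.length).trim();
  if (payload === '[DONE]') return null;        // end-of-stream sentinel
  const json = JSON.parse(payload);
  return json.choices?.[0]?.delta?.content ?? null;
}
```

Usage: `parseSseLine('data: {"choices":[{"delta":{"content":"Hello"}}]}')` yields `'Hello'`; the final `finish_reason` chunk and `data: [DONE]` both yield `null`.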

React streaming pattern

// Functional updater — MUST use this for streaming
setResponse(prev => prev + token);   // Correct
setResponse(response + token);       // BUG: stale closure

// Custom hook shape
const { messages, isStreaming, error, sendMessage, cancelStream } =
  useStreamingChat('/api/chat');

// Vercel AI SDK — simplest approach
const { messages, input, handleInputChange, handleSubmit, isLoading, stop } =
  useChat();

Cancellation

// Frontend
const controller = new AbortController();
fetch('/api/chat', { signal: controller.signal });
controller.abort(); // Cancel

// Backend
req.on('close', () => { /* client disconnected */ });

// IMPORTANT: LLM API continues generating after cancel
// You are billed for ALL output tokens, not just received ones
// Mitigation: set max_tokens to cap worst-case cost

TTFT optimization

TTFT = Network + Server + API setup + Model prefill + Client render
                                       ^^^^^^^^^^^
                                    This dominates for long prompts

REDUCE TTFT:
  1. Shorter system prompts        (fewer tokens to prefill)
  2. Fewer RAG chunks              (less input = faster prefill)
  3. Smaller/faster models         (gpt-4o-mini ~50% faster TTFT)
  4. Prompt caching                (skip prefill for cached prefix)
  5. Edge deployment               (reduce network latency)
  6. Connection keep-alive         (skip TCP handshake)
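To verify these optimizations actually help, measure TTFT in the client. A sketch (the `withTtft` name is an assumption) that wraps any token stream — both SDKs return async iterables — and reports the time to the first yielded item:

```javascript
// Wrap any async-iterable token stream and report the time from
// call to the first yielded item — a simple TTFT probe.
async function* withTtft(stream, report) {
  const start = Date.now();
  let first = true;
  for await (const chunk of stream) {
    if (first) {
      report(Date.now() - start);  // TTFT in ms
      first = false;
    }
    yield chunk;                   // pass chunks through untouched
  }
}
```

Usage: `for await (const chunk of withTtft(stream, ms => metrics.record('ttft', ms))) { ... }`.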

When to stream vs buffer

STREAM when:
  - User-facing text response
  - Expected length > 50 tokens
  - No validation required before display

BUFFER when:
  - JSON / structured output
  - Short responses (< 50 tokens)
  - Binary decisions (yes/no)
  - Pipeline-internal / batch processing
  - Function/tool calling
  - Validation-required output
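The two lists above can be condensed into a decision helper — a hypothetical sketch, with field names invented for illustration:

```javascript
// Hypothetical decision helper condensing the stream-vs-buffer lists.
function shouldStream({ userFacing, expectedTokens, structured, needsValidation }) {
  if (structured || needsValidation) return false; // JSON/tools/validated output: buffer
  return Boolean(userFacing) && expectedTokens > 50;
}
```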

Three-phase loading

Phase 1: Immediate feedback (0 → TTFT)
  → Thinking dots / skeleton screen
  → Shows the system received the request

Phase 2: Streaming (TTFT → completion)
  → Tokens appearing with blinking cursor
  → User reads along as content generates

Phase 3: Complete (after last token)
  → Full markdown rendering
  → Action buttons (copy, regenerate, thumbs up/down)

Markdown during streaming

PROBLEM: Incomplete markdown flickers
  Token stream: "Here are **three"  → unclosed bold tag

SOLUTIONS (pick one):
  1. Tolerant markdown lib (react-markdown handles partial gracefully)
  2. Debounced rendering (re-render every 100ms, not every token)
  3. Code block detection (count ```, if odd, temporarily close)
  4. Raw text while streaming, markdown after complete
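Solution 3 can be sketched in a few lines — count the code fences in the partial text and, if the count is odd, temporarily close the open block before rendering (function name is an assumption):

```javascript
// Three backticks, built with repeat() so this snippet's own
// fencing stays intact.
const FENCE = '`'.repeat(3);

// If the streamed text so far contains an odd number of fences,
// a code block is still open: append a closing fence so the
// markdown parser never sees an unterminated block.
function closeOpenFences(partial) {
  const count = partial.split(FENCE).length - 1;
  return count % 2 === 1 ? partial + '\n' + FENCE : partial;
}
```

Apply it to the accumulated text on every render; once the real closing fence arrives, the count becomes even and the text passes through unchanged.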

Performance

PROBLEM: 50 tokens/sec = 50 setState() calls/sec = 50 re-renders/sec

FIX: requestAnimationFrame batching
  - Buffer tokens in a ref
  - Flush to state once per animation frame (one frame ≈ 16ms at 60fps)
  - Caps re-renders at the display refresh rate and coalesces bursty chunks into one update

ALSO:
  - Virtualize long message lists
  - Memoize message components
  - Debounce markdown rendering
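The rAF batching fix can be sketched framework-free. The scheduler is injectable (requestAnimationFrame in the browser, anything callback-based in tests); the function name and shape are assumptions:

```javascript
// Push tokens into a buffer; flush them to state at most once per
// scheduled frame, no matter how many tokens arrive in between.
function createTokenBatcher(flush, schedule = requestAnimationFrame) {
  let buffer = '';
  let scheduled = false;
  return function push(token) {
    buffer += token;
    if (scheduled) return;    // a flush is already queued for this frame
    scheduled = true;
    schedule(() => {
      scheduled = false;
      const text = buffer;
      buffer = '';
      flush(text);            // e.g. setResponse(prev => prev + text)
    });
  };
}
```

In a React component, keep the batcher in a ref and pass the functional state updater as `flush`.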

Error handling

PRE-STREAM ERROR:   → Show error message + retry button
MID-STREAM ERROR:   → Keep partial content + error banner + retry
RATE LIMIT (429):   → Backoff + retry + show wait indicator
TIMEOUT (no chunks): → "Response interrupted" + retry
CONTENT FILTER:     → finish_reason: "content_filter" + graceful message

ALWAYS: Exponential backoff, max 3 retries, keep partial content
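A sketch of that retry policy (names are assumptions). Keeping partial content is the caller's job — retrying re-issues the request, it does not clear what is already on screen:

```javascript
// Exponential backoff: at most `retries` retries, doubling delay,
// surfacing the last error if every attempt fails.
async function withBackoff(fn, { retries = 3, baseMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err;         // out of retries
      const delay = baseMs * 2 ** attempt;       // 500ms, 1s, 2s, ...
      await new Promise(r => setTimeout(r, delay));
    }
  }
}
```

For 429s, prefer the server's Retry-After header over the computed delay when it is present.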

UX metrics targets

Metric            Target       Alert
------            ------       -----
TTFT              < 500ms      > 2s
Total time        < 10s        > 20s
Display rate      > 30 tok/s   < 10 tok/s
Error rate        < 1%         > 5%
Cancel rate       < 15%        > 30%
Empty response    < 0.5%       > 2%

Common anti-patterns

Anti-pattern                     Fix
------------                     ---
Spinner for 10+ seconds          Stream with progressive rendering
No cancel button                 Always show "Stop generating"
Streaming JSON                   Buffer JSON, show when complete
No error recovery                Error banner + retry + keep partial
Auto-scroll over user reading    Only scroll if user is near bottom
No submit feedback               Disable input immediately on submit

Infrastructure checklist

[ ] Disable proxy buffering (X-Accel-Buffering: no)
[ ] Increase proxy timeouts (120-300s)
[ ] Set SSE headers (Content-Type: text/event-stream, Cache-Control: no-cache)
[ ] Handle client disconnect (req.on('close'))
[ ] Set max_tokens to cap cost
[ ] Monitor TTFT, error rate, cancellation rate
[ ] requestAnimationFrame batching on frontend
[ ] aria-live="polite" + aria-busy for accessibility
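The SSE header items from the checklist can be applied in one place — a sketch where `res` is any Node/Express-style response exposing setHeader, and the function name is an assumption:

```javascript
// Set the checklist's SSE headers on an outgoing response.
function writeSseHead(res) {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');
  res.setHeader('X-Accel-Buffering', 'no'); // disable nginx proxy buffering
}
```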

End of 4.8 quick revision.