Episode 4 — Generative AI Engineering / 4.8 — Streaming Responses

4.8 — Streaming Responses: Quick Revision

Compact cheat sheet. Print-friendly.

How to use this material (instructions)

  1. Skim before labs or interviews.
  2. Drill gaps — reopen README.md -> 4.8.a...4.8.c.
  3. Practice: 4.8-Exercise-Questions.md.
  4. Polish answers: 4.8-Interview-Questions.md.

Core vocabulary

Term                    One-liner
----                    ---------
Streaming               Delivering LLM tokens incrementally as generated, not waiting for the full response
SSE                     Server-Sent Events — one-way HTTP protocol for server-to-client push
Chunk                   A single streaming event containing one or more tokens
Delta                   The new content in a chunk (only the change, not the accumulated text)
TTFT                    Time-to-first-token — most important UX metric for AI features
Perceived latency       How fast the app feels to the user (vs actual latency)
Progressive rendering   Displaying content as it arrives instead of waiting for completion
AbortController         Browser/Node.js API for cancelling fetch requests and streams
Async iterator          JavaScript interface (Symbol.asyncIterator) consumed with for await...of
Typewriter effect       Smooth character-by-character reveal of streamed text

Streaming at a glance

NON-STREAMING:  Request ──── [10s waiting] ──── Full response
STREAMING:      Request ── [0.2s] ── token ── token ── token ── done

Same total time. Different perceived latency.
TTFT drops from 10s → 200ms.

Enable streaming

// OpenAI
const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello' }],
  stream: true,                                      // <-- this
  stream_options: { include_usage: true },            // <-- for token counts
});

// Anthropic
const stream = await anthropic.messages.stream({     // <-- .stream() method
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  messages: [{ role: 'user', content: 'Hello' }],
});

Consume streams

// OpenAI — for await...of
for await (const chunk of stream) {
  const token = chunk.choices[0]?.delta?.content || '';
  process.stdout.write(token);
}

// Anthropic — for await...of with event types
for await (const event of stream) {
  if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
    process.stdout.write(event.delta.text);
  }
}

// Anthropic — event-based
stream.on('text', (text) => process.stdout.write(text));
const final = await stream.finalMessage();

Chunk structure

OpenAI chunk:
  chunk.choices[0].delta.content     → the token text
  chunk.choices[0].finish_reason     → null | "stop" | "length" | "content_filter"
  chunk.usage                        → { prompt_tokens, completion_tokens } (final chunk only)

Anthropic events (in order):
  message_start       → metadata, input_tokens
  content_block_start → block begins
  content_block_delta → the token text (event.delta.text)
  content_block_stop  → block ends
  message_delta       → stop_reason, output_tokens
  message_stop        → stream complete

SSE format

data: {"choices":[{"delta":{"content":"Hello"}}]}\n\n
data: {"choices":[{"delta":{"content":" world"}}]}\n\n
data: {"choices":[{"delta":{},"finish_reason":"stop"}]}\n\n
data: [DONE]\n\n
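A minimal sketch of parsing the SSE lines above on the client. The function name is an assumption; a production parser would also buffer partial lines across network chunks before splitting on newlines.

```javascript
// Parse one raw SSE line from an OpenAI-style stream.
// Returns the delta text, or null when the line carries no
// content (comments, blank lines, empty deltas, the [DONE] sentinel).
function parseSseLine(line) {
  if (!line.startsWith('data: ')) return null;  // ignore comments/blank lines
  const payload = line.slice('data: '.length).trim();
  if (payload === '[DONE]') return null;        // end-of-stream sentinel
  const json = JSON.parse(payload);
  return json.choices?.[0]?.delta?.content ?? null;
}
```

Usage: `parseSseLine('data: {"choices":[{"delta":{"content":"Hello"}}]}')` yields `'Hello'`; the final `finish_reason` chunk and `data: [DONE]` both yield `null`.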

React streaming pattern

// Functional updater — MUST use this for streaming
setResponse(prev => prev + token);   // Correct
setResponse(response + token);       // BUG: stale closure

// Custom hook shape
const { messages, isStreaming, error, sendMessage, cancelStream } =
  useStreamingChat('/api/chat');

// Vercel AI SDK — simplest approach
const { messages, input, handleInputChange, handleSubmit, isLoading, stop } =
  useChat();

Cancellation

// Frontend
const controller = new AbortController();
fetch('/api/chat', { signal: controller.signal });
controller.abort(); // Cancel

// Backend
req.on('close', () => { /* client disconnected */ });

// IMPORTANT: LLM API continues generating after cancel
// You are billed for ALL output tokens, not just received ones
// Mitigation: set max_tokens to cap worst-case cost

TTFT optimization

TTFT = Network + Server + API setup + Model prefill + Client render
                                       ^^^^^^^^^^^
                                    This dominates for long prompts

REDUCE TTFT:
  1. Shorter system prompts        (fewer tokens to prefill)
  2. Fewer RAG chunks              (less input = faster prefill)
  3. Smaller/faster models         (gpt-4o-mini ~50% faster TTFT)
  4. Prompt caching                (skip prefill for cached prefix)
  5. Edge deployment               (reduce network latency)
  6. Connection keep-alive         (skip TCP handshake)
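To verify these optimizations actually help, measure TTFT in the client. A sketch (the `withTtft` name is an assumption) that wraps any token stream — both SDKs return async iterables — and reports the time to the first yielded item:

```javascript
// Wrap any async-iterable token stream and report the time from
// call to the first yielded item — a simple TTFT probe.
async function* withTtft(stream, report) {
  const start = Date.now();
  let first = true;
  for await (const chunk of stream) {
    if (first) {
      report(Date.now() - start);  // TTFT in ms
      first = false;
    }
    yield chunk;                   // pass chunks through untouched
  }
}
```

Usage: `for await (const chunk of withTtft(stream, ms => metrics.record('ttft', ms))) { ... }`.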

When to stream vs buffer

STREAM when:
  - User-facing text response
  - Expected length > 50 tokens
  - No validation required before display

BUFFER when:
  - JSON / structured output
  - Short responses (< 50 tokens)
  - Binary decisions (yes/no)
  - Pipeline-internal / batch processing
  - Function/tool calling
  - Validation-required output
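The two lists above can be condensed into a decision helper — a hypothetical sketch, with field names invented for illustration:

```javascript
// Hypothetical decision helper condensing the stream-vs-buffer lists.
function shouldStream({ userFacing, expectedTokens, structured, needsValidation }) {
  if (structured || needsValidation) return false; // JSON/tools/validated output: buffer
  return Boolean(userFacing) && expectedTokens > 50;
}
```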

Three-phase loading

Phase 1: Immediate feedback (0 → TTFT)
  → Thinking dots / skeleton screen
  → Shows the system received the request

Phase 2: Streaming (TTFT → completion)
  → Tokens appearing with blinking cursor
  → User reads along as content generates

Phase 3: Complete (after last token)
  → Full markdown rendering
  → Action buttons (copy, regenerate, thumbs up/down)

Markdown during streaming

PROBLEM: Incomplete markdown flickers
  Token stream: "Here are **three"  → unclosed bold tag

SOLUTIONS (pick one):
  1. Tolerant markdown lib (react-markdown handles partial gracefully)
  2. Debounced rendering (re-render every 100ms, not every token)
  3. Code block detection (count ```, if odd, temporarily close)
  4. Raw text while streaming, markdown after complete
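Solution 3 can be sketched in a few lines — count the code fences in the partial text and, if the count is odd, temporarily close the open block before rendering (function name is an assumption):

```javascript
// Three backticks, built with repeat() so this snippet's own
// fencing stays intact.
const FENCE = '`'.repeat(3);

// If the streamed text so far contains an odd number of fences,
// a code block is still open: append a closing fence so the
// markdown parser never sees an unterminated block.
function closeOpenFences(partial) {
  const count = partial.split(FENCE).length - 1;
  return count % 2 === 1 ? partial + '\n' + FENCE : partial;
}
```

Apply it to the accumulated text on every render; once the real closing fence arrives, the count becomes even and the text passes through unchanged.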

Performance

PROBLEM: 50 tokens/sec = 50 setState() calls/sec = 50 re-renders/sec

FIX: requestAnimationFrame batching
  - Buffer tokens in a ref
  - Flush to state once per animation frame (one frame ≈ 16ms at 60fps)
  - Caps re-renders at the display refresh rate and coalesces bursty chunks into one update

ALSO:
  - Virtualize long message lists
  - Memoize message components
  - Debounce markdown rendering
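The rAF batching fix can be sketched framework-free. The scheduler is injectable (requestAnimationFrame in the browser, anything callback-based in tests); the function name and shape are assumptions:

```javascript
// Push tokens into a buffer; flush them to state at most once per
// scheduled frame, no matter how many tokens arrive in between.
function createTokenBatcher(flush, schedule = requestAnimationFrame) {
  let buffer = '';
  let scheduled = false;
  return function push(token) {
    buffer += token;
    if (scheduled) return;    // a flush is already queued for this frame
    scheduled = true;
    schedule(() => {
      scheduled = false;
      const text = buffer;
      buffer = '';
      flush(text);            // e.g. setResponse(prev => prev + text)
    });
  };
}
```

In a React component, keep the batcher in a ref and pass the functional state updater as `flush`.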

Error handling

PRE-STREAM ERROR:   → Show error message + retry button
MID-STREAM ERROR:   → Keep partial content + error banner + retry
RATE LIMIT (429):   → Backoff + retry + show wait indicator
TIMEOUT (no chunks): → "Response interrupted" + retry
CONTENT FILTER:     → finish_reason: "content_filter" + graceful message

ALWAYS: Exponential backoff, max 3 retries, keep partial content
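A sketch of that retry policy (names are assumptions). Keeping partial content is the caller's job — retrying re-issues the request, it does not clear what is already on screen:

```javascript
// Exponential backoff: at most `retries` retries, doubling delay,
// surfacing the last error if every attempt fails.
async function withBackoff(fn, { retries = 3, baseMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err;         // out of retries
      const delay = baseMs * 2 ** attempt;       // 500ms, 1s, 2s, ...
      await new Promise(r => setTimeout(r, delay));
    }
  }
}
```

For 429s, prefer the server's Retry-After header over the computed delay when it is present.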

UX metrics targets

Metric            Target       Alert
------            ------       -----
TTFT              < 500ms      > 2s
Total time        < 10s        > 20s
Display rate      > 30 tok/s   < 10 tok/s
Error rate        < 1%         > 5%
Cancel rate       < 15%        > 30%
Empty response    < 0.5%       > 2%

Common anti-patterns

Anti-pattern                     Fix
------------                     ---
Spinner for 10+ seconds          Stream with progressive rendering
No cancel button                 Always show "Stop generating"
Streaming JSON                   Buffer JSON, show when complete
No error recovery                Error banner + retry + keep partial
Auto-scroll over user reading    Only scroll if user is near bottom
No submit feedback               Disable input immediately on submit

Infrastructure checklist

[ ] Disable proxy buffering (X-Accel-Buffering: no)
[ ] Increase proxy timeouts (120-300s)
[ ] Set SSE headers (Content-Type: text/event-stream, Cache-Control: no-cache)
[ ] Handle client disconnect (req.on('close'))
[ ] Set max_tokens to cap cost
[ ] Monitor TTFT, error rate, cancellation rate
[ ] requestAnimationFrame batching on frontend
[ ] aria-live="polite" + aria-busy for accessibility
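The SSE header items from the checklist can be applied in one place — a sketch where `res` is any Node/Express-style response exposing setHeader, and the function name is an assumption:

```javascript
// Set the checklist's SSE headers on an outgoing response.
function writeSseHead(res) {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');
  res.setHeader('X-Accel-Buffering', 'no'); // disable nginx proxy buffering
}
```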

End of 4.8 quick revision.