Episode 4 — Generative AI Engineering / 4.8 — Streaming Responses
4.8 — Streaming Responses: Quick Revision
Compact cheat sheet. Print-friendly.
How to use this material (instructions)
- Skim before labs or interviews.
- Drill gaps — reopen README.md->4.8.a...4.8.c.
- Practice — 4.8-Exercise-Questions.md.
- Polish answers — 4.8-Interview-Questions.md.
Core vocabulary
| Term | One-liner |
|---|---|
| Streaming | Delivering LLM tokens incrementally as generated, not waiting for full response |
| SSE | Server-Sent Events — one-way HTTP protocol for server-to-client push |
| Chunk | A single streaming event containing one or more tokens |
| Delta | The new content in a chunk (only the change, not the accumulated text) |
| TTFT | Time-to-first-token — most important UX metric for AI features |
| Perceived latency | How fast the app feels to the user (vs actual latency) |
| Progressive rendering | Displaying content as it arrives instead of waiting for completion |
| AbortController | Browser/Node.js API for cancelling fetch requests and streams |
| Async iterator | JavaScript interface (Symbol.asyncIterator) consumed with for await...of |
| Typewriter effect | Smooth character-by-character reveal of streamed text |
Streaming at a glance
NON-STREAMING: Request ──── [10s waiting] ──── Full response
STREAMING: Request ── [0.2s] ── token ── token ── token ── done
Same total time. Different perceived latency.
TTFT drops from 10s → 200ms.
Enable streaming
// OpenAI
const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello' }],
  stream: true,                            // <-- this
  stream_options: { include_usage: true }, // <-- for token counts
});
// Anthropic
const stream = await anthropic.messages.stream({ // <-- .stream() method
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  messages: [{ role: 'user', content: 'Hello' }],
});
Consume streams
// OpenAI — for await...of
for await (const chunk of stream) {
  const token = chunk.choices[0]?.delta?.content || '';
  process.stdout.write(token);
}
// Anthropic — for await...of with event types
for await (const event of stream) {
  if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
    process.stdout.write(event.delta.text);
  }
}
// Anthropic — event-based
stream.on('text', (text) => process.stdout.write(text));
const final = await stream.finalMessage();
Chunk structure
OpenAI chunk:
chunk.choices[0].delta.content → the token text
chunk.choices[0].finish_reason → null | "stop" | "length" | "content_filter"
chunk.usage → { prompt_tokens, completion_tokens } (final chunk only)
Anthropic events (in order):
message_start → metadata, input_tokens
content_block_start → block begins
content_block_delta → the token text (event.delta.text)
content_block_stop → block ends
message_delta → stop_reason, output_tokens
message_stop → stream complete
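The event sequence above can be folded into final text without the SDK. A minimal sketch using hand-built event objects whose shapes mirror the fields listed above:

```javascript
// Sketch: fold an Anthropic-style event sequence into text + metadata.
// Plain function, no SDK needed; event shapes follow the list above.
function accumulateEvents(events) {
  const result = { text: '', stopReason: null, outputTokens: 0 };
  for (const event of events) {
    switch (event.type) {
      case 'content_block_delta':
        if (event.delta.type === 'text_delta') result.text += event.delta.text;
        break;
      case 'message_delta':
        result.stopReason = event.delta.stop_reason;
        result.outputTokens = event.usage?.output_tokens ?? 0;
        break;
    }
  }
  return result;
}

// Hand-built event sequence for illustration:
const events = [
  { type: 'message_start' },
  { type: 'content_block_start' },
  { type: 'content_block_delta', delta: { type: 'text_delta', text: 'Hello' } },
  { type: 'content_block_delta', delta: { type: 'text_delta', text: ' world' } },
  { type: 'content_block_stop' },
  { type: 'message_delta', delta: { stop_reason: 'end_turn' }, usage: { output_tokens: 2 } },
  { type: 'message_stop' },
];
const final = accumulateEvents(events);
// final.text === 'Hello world', final.stopReason === 'end_turn'
```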
SSE format
data: {"choices":[{"delta":{"content":"Hello"}}]}\n\n
data: {"choices":[{"delta":{"content":" world"}}]}\n\n
data: {"choices":[{"delta":{},"finish_reason":"stop"}]}\n\n
data: [DONE]\n\n
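If you consume the raw HTTP stream instead of an SDK, you parse these `data:` lines yourself. A minimal sketch, assuming one JSON payload per `data:` line and the OpenAI chunk shape shown above:

```javascript
// Sketch: parse raw SSE text into delta strings (OpenAI-style payloads).
function parseSSE(raw) {
  const deltas = [];
  for (const line of raw.split('\n')) {
    if (!line.startsWith('data: ')) continue;   // skip blank lines / comments
    const payload = line.slice('data: '.length);
    if (payload === '[DONE]') break;            // end-of-stream sentinel
    const json = JSON.parse(payload);
    const token = json.choices?.[0]?.delta?.content;
    if (token) deltas.push(token);              // final chunk has an empty delta
  }
  return deltas;
}

const raw =
  'data: {"choices":[{"delta":{"content":"Hello"}}]}\n\n' +
  'data: {"choices":[{"delta":{"content":" world"}}]}\n\n' +
  'data: {"choices":[{"delta":{},"finish_reason":"stop"}]}\n\n' +
  'data: [DONE]\n\n';
// parseSSE(raw) → ['Hello', ' world']
```

In production the stream can split a `data:` line across network reads, so a real parser buffers until it sees the blank-line terminator.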
React streaming pattern
// Functional updater — MUST use this for streaming
setResponse(prev => prev + token); // Correct
setResponse(response + token); // BUG: stale closure
// Custom hook shape
const { messages, isStreaming, error, sendMessage, cancelStream } =
  useStreamingChat('/api/chat');

// Vercel AI SDK — simplest approach
const { messages, input, handleInputChange, handleSubmit, isLoading, stop } =
  useChat();
Cancellation
// Frontend
const controller = new AbortController();
fetch('/api/chat', { signal: controller.signal });
controller.abort(); // Cancel
// Backend
req.on('close', () => { /* client disconnected */ });
// IMPORTANT: the LLM API may keep generating after you cancel the connection
// You are billed for all tokens generated server-side, not just those received
// Mitigation: set max_tokens to cap worst-case cost
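The cancellation wiring can be exercised without a network call. A minimal sketch showing how `abort()` notifies anything holding the signal; `fetch` reacts to the same event internally:

```javascript
// Sketch: AbortController without a network call.
const controller = new AbortController();
let cancelled = false;

// Anything holding the signal can register cleanup logic.
controller.signal.addEventListener('abort', () => { cancelled = true; });

controller.abort();   // e.g. user clicked "Stop generating"
// controller.signal.aborted === true, cancelled === true
```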
TTFT optimization
TTFT = Network + Server + API setup + Model prefill + Client render
                                      ^^^^^^^^^^^^^
                                      This dominates for long prompts
REDUCE TTFT:
1. Shorter system prompts (fewer tokens to prefill)
2. Fewer RAG chunks (less input = faster prefill)
3. Smaller/faster models (gpt-4o-mini ~50% faster TTFT)
4. Prompt caching (skip prefill for cached prefix)
5. Edge deployment (reduce network latency)
6. Connection keep-alive (skip TCP handshake)
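TTFT is easy to measure client-side: timestamp the first chunk of the stream. A sketch using a fake async generator in place of an SDK stream object:

```javascript
// Sketch: measure TTFT for any async-iterable stream.
async function measureTTFT(stream) {
  const start = Date.now();
  let ttft = null;
  for await (const chunk of stream) {
    if (ttft === null) ttft = Date.now() - start;   // first token arrived
    // ...render chunk...
  }
  return ttft;
}

// Fake stream standing in for an SDK stream: first token after ~30ms.
async function* fakeStream() {
  await new Promise((r) => setTimeout(r, 30));
  yield 'a'; yield 'b'; yield 'c';
}

const ttft = await measureTTFT(fakeStream());
// ttft is roughly 30 (milliseconds) or more
```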
When to stream vs buffer
STREAM when:
- User-facing text response
- Expected length > 50 tokens
- No validation required before display
BUFFER when:
- JSON / structured output
- Short responses (< 50 tokens)
- Binary decisions (yes/no)
- Pipeline-internal / batch processing
- Function/tool calling
- Validation-required output
Three-phase loading
Phase 1: Immediate feedback (0 → TTFT)
→ Thinking dots / skeleton screen
→ Shows the system received the request
Phase 2: Streaming (TTFT → completion)
→ Tokens appearing with blinking cursor
→ User reads along as content generates
Phase 3: Complete (after last token)
→ Full markdown rendering
→ Action buttons (copy, regenerate, thumbs up/down)
Markdown during streaming
PROBLEM: Incomplete markdown flickers
Token stream: "Here are **three" → unclosed bold tag
SOLUTIONS (pick one):
1. Tolerant markdown lib (react-markdown handles partial gracefully)
2. Debounced rendering (re-render every 100ms, not every token)
3. Code block detection (count ```, if odd, temporarily close)
4. Raw text while streaming, markdown after complete
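Solution 3 is only a few lines. A sketch that counts fence markers and temporarily closes an odd one; the marker is built with `repeat()` so this snippet stays paste-safe:

```javascript
// Sketch of solution 3: if the streamed text contains an odd number of
// fence markers (three backticks), append a temporary closing fence.
const FENCE = '`'.repeat(3);   // three backticks, built indirectly on purpose

function closeOpenFences(partial) {
  const count = partial.split(FENCE).length - 1;   // fence markers seen so far
  return count % 2 === 1 ? partial + '\n' + FENCE : partial;
}

const patched = closeOpenFences('Code:\n' + FENCE + 'js\nconsole.log(1)');
// patched now ends with a temporary closing fence
```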
Performance
PROBLEM: 50 tokens/sec = 50 setState() calls/sec = 50 re-renders/sec
FIX: requestAnimationFrame batching
- Buffer tokens in a ref
- Flush to state at most once per animation frame (~16ms)
- Coalesces bursty token arrivals into single renders; re-renders capped at the display refresh rate
ALSO:
- Virtualize long message lists
- Memoize message components
- Debounce markdown rendering
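The batching fix above can be sketched as a small helper. `setState` is a stand-in for a React state setter, and `requestAnimationFrame` falls back to `setTimeout` so the sketch also runs outside a browser:

```javascript
// Sketch: buffer incoming tokens, flush at most once per animation frame.
const raf = globalThis.requestAnimationFrame ?? ((cb) => setTimeout(cb, 16));

function makeTokenBatcher(setState) {
  let buffer = '';
  let scheduled = false;
  return function push(token) {
    buffer += token;
    if (scheduled) return;              // one flush per frame, not per token
    scheduled = true;
    raf(() => {
      setState((prev) => prev + buffer);  // functional updater, as above
      buffer = '';
      scheduled = false;
    });
  };
}

// Usage with a fake setState:
let state = '';
const push = makeTokenBatcher((fn) => { state = fn(state); });
push('Hel'); push('lo');   // two tokens arrive, one flush is scheduled
```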
Error handling
PRE-STREAM ERROR: → Show error message + retry button
MID-STREAM ERROR: → Keep partial content + error banner + retry
RATE LIMIT (429): → Backoff + retry + show wait indicator
TIMEOUT (no chunks): → "Response interrupted" + retry
CONTENT FILTER: → finish_reason: "content_filter" + graceful message
ALWAYS: Exponential backoff, max 3 retries, keep partial content
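The retry policy can be wrapped in one helper. A sketch with a tiny `baseDelayMs` so the example runs fast; use around 1000ms in production:

```javascript
// Sketch: exponential backoff with a retry cap.
async function withRetries(fn, { maxRetries = 3, baseDelayMs = 10 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err;       // give up after max retries
      const delay = baseDelayMs * 2 ** attempt;   // 10, 20, 40ms...
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}

// Usage: fails twice (e.g. 429s), then succeeds.
let calls = 0;
const result = await withRetries(async () => {
  calls++;
  if (calls < 3) throw new Error('429');
  return 'ok';
});
// result === 'ok', calls === 3
```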
UX metrics targets
| Metric | Target | Alert |
|---|---|---|
| TTFT | < 500ms | > 2s |
| Total time | < 10s | > 20s |
| Display rate | > 30 tok/s | < 10 tok/s |
| Error rate | < 1% | > 5% |
| Cancel rate | < 15% | > 30% |
| Empty response | < 0.5% | > 2% |
Common anti-patterns
| Anti-pattern | Fix |
|---|---|
| Spinner for 10+ seconds | Stream with progressive rendering |
| No cancel button | Always show "Stop generating" |
| Streaming JSON | Buffer JSON, show when complete |
| No error recovery | Error banner + retry + keep partial |
| Auto-scroll over user reading | Only scroll if user is near bottom |
| No submit feedback | Disable input immediately on submit |
Infrastructure checklist
[ ] Disable proxy buffering (X-Accel-Buffering: no)
[ ] Increase proxy timeouts (120-300s)
[ ] Set SSE headers (Content-Type: text/event-stream, Cache-Control: no-cache)
[ ] Handle client disconnect (req.on('close'))
[ ] Set max_tokens to cap cost
[ ] Monitor TTFT, error rate, cancellation rate
[ ] requestAnimationFrame batching on frontend
[ ] aria-live="polite" + aria-busy for accessibility
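The SSE headers and disconnect handling from the checklist, in a framework-free `node:http` sketch (an Express handler would set the same headers):

```javascript
// Sketch: minimal Node SSE endpoint applying the checklist headers.
import http from 'node:http';

const server = http.createServer((req, res) => {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'X-Accel-Buffering': 'no',         // tell nginx not to buffer this response
  });
  req.on('close', () => res.end());    // client disconnected: stop writing
  res.write('data: {"delta":"Hello"}\n\n');
  res.write('data: [DONE]\n\n');
  res.end();
});

// server.listen(3000) in a real app; bind an ephemeral port for local testing.
```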
End of 4.8 quick revision.