Episode 4 — Generative AI Engineering / 4.9 — Combining Streaming with Structured Data
4.9 — Combining Streaming with Structured Data: Quick Revision
Compact cheat sheet. Print-friendly.
How to use this material (instructions)
- Skim before labs or interviews.
- Drill gaps — reopen README.md->4.9.a...4.9.c.
- Practice — 4.9-Exercise-Questions.md.
- Polish answers — 4.9-Interview-Questions.md.
Core vocabulary
| Term | One-liner |
|---|---|
| Streaming | Sending LLM tokens to the client as they are generated, rather than waiting for the full response |
| TTFT | Time to First Token — perceived latency metric, typically 200-700ms |
| Two-phase pattern | Stream text to humans (Phase 1), then extract/return structured JSON for systems (Phase 2) |
| Delimiter pattern | Model outputs text, then a delimiter (---JSON---), then JSON in a single call |
| Stream then parse | Stream the full response, then extract JSON from code blocks or structured sections post-completion |
| Dual-channel | Architecture separating UI-facing (streamed text) from system-facing (structured JSON) outputs |
| SSE | Server-Sent Events — HTTP-based server-to-client streaming protocol |
| Accumulate-while-streaming | Building the full response string alongside the real-time stream for later use |
| Response router | Pattern that conditionally routes AI response data to different system consumers |
The core question
"Who consumes this response?"
Human only → Stream conversational text
Machine only → Return structured JSON (no streaming needed)
Both → Two-phase pattern (stream text + extract/return JSON)
Pattern comparison
SINGLE CALL + DELIMITER
Cost: Lowest (1 API call)
Latency: Lowest
Reliability: Medium (model may forget delimiter or produce invalid JSON)
Best for: High-volume, cost-sensitive, acceptable failure rate
TWO CALLS - SEQUENTIAL
Cost: Highest (~23% more)
Latency: Highest (stream time + extraction time)
Reliability: Highest (response_format enforces valid JSON)
Best for: Reliability-critical (medical, legal, financial)
TWO CALLS - PARALLEL
Cost: Medium (~7% more)
Latency: Medium (max of both calls)
Reliability: High
Best for: Speed + reliability, when structured data can be independent
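The sequential and single-call variants are shown in code later in this sheet, but the parallel variant is not, so here is a minimal sketch. `streamText` and `extractData` are hypothetical helpers wrapping the streaming call and the structured call; both receive only the original message, which is why the JSON is generated independently of the streamed text (hence the "validate" caveat in the gotchas).

```javascript
// Sketch of the parallel two-call pattern: both calls start immediately,
// so total latency is max(call1, call2) rather than call1 + call2.
// streamText / extractData are placeholder names, not real SDK functions.
async function handleBoth(message, { streamText, extractData }) {
  const [text, data] = await Promise.all([
    streamText(message),   // user-facing stream (Phase 1)
    extractData(message),  // structured JSON, run concurrently (Phase 2)
  ]);
  return { text, data };
}
```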
Streaming basics
// Core pattern: accumulate while streaming
const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [...],
  stream: true
});

const parts = [];
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content); // Stream to user
    parts.push(content);           // Accumulate
  }
}
const fullText = parts.join(''); // Complete response
Delimiter pattern
// System prompt instructs: text, then ---JSON---, then JSON
let fullText = '';
let jsonStarted = false;
let jsonBuffer = '';
const delimiter = '---JSON---';

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (!content) continue;
  fullText += content;
  if (!jsonStarted && fullText.includes(delimiter)) {
    jsonStarted = true;
    jsonBuffer = fullText.split(delimiter)[1] || '';
  } else if (jsonStarted) {
    jsonBuffer += content;
  } else {
    onToken(content); // Stream text to user
  }
}
const data = JSON.parse(jsonBuffer.trim());
// Caveat: this sketch can leak a partial delimiter (e.g. "---JS") to the
// user when the delimiter is split across chunks, and any text before the
// delimiter in the triggering chunk never reaches onToken. Production code
// should buffer a short tail before emitting (see "Common gotchas").
Two-call pattern
// Call 1: Stream text (temperature 0.7, no response_format)
const text = await streamToUser(message);

// Call 2: Extract structured data (temperature 0, response_format)
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    { role: 'system', content: 'Extract structured data...' },
    { role: 'user', content: message },
    { role: 'assistant', content: text } // Feed streamed text as context
  ],
  response_format: { type: 'json_object' },
  temperature: 0
});
const data = JSON.parse(response.choices[0].message.content);
SSE (Server-Sent Events) essentials
// Server: Express
res.setHeader('Content-Type', 'text/event-stream');
res.setHeader('Cache-Control', 'no-cache');
res.setHeader('Connection', 'keep-alive');
// Send typed events
res.write(`event: text\ndata: ${JSON.stringify({ content })}\n\n`);
res.write(`event: structured\ndata: ${JSON.stringify({ data })}\n\n`);
// Client: Parse SSE
const reader = response.body.getReader();
// Parse lines starting with "event:" and "data:"
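The client-side parsing hinted at above can be sketched as a small pure function. This is a simplification of SSE framing (frames separated by a blank line, `event:` and `data:` fields), not a spec-complete parser: it ignores `id:`, comment lines, multi-line data, and partial frames held over between reads.

```javascript
// Minimal SSE frame parser (sketch): splits buffered SSE text into
// { event, data } records matching the typed events sent by the server.
function parseSSE(buffer) {
  const events = [];
  for (const frame of buffer.split('\n\n')) {
    let event = 'message'; // SSE default when no event: field is present
    let data = '';
    for (const line of frame.split('\n')) {
      if (line.startsWith('event:')) event = line.slice(6).trim();
      else if (line.startsWith('data:')) data += line.slice(5).trim();
    }
    if (data) events.push({ event, data: JSON.parse(data) });
  }
  return events;
}
```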
Event-based architecture
// Pipeline emits typed events
pipeline.emit('stream:token', { requestId, content });
pipeline.emit('stream:end', { requestId, fullText });
pipeline.emit('structure:complete', { requestId, data });
pipeline.emit('response:error', { requestId, error });
// Independent consumers subscribe
pipeline.on('stream:token', sendToUI);
pipeline.on('structure:complete', saveToDatabase);
pipeline.on('structure:complete', checkForAlerts);
pipeline.on('response:error', notifyMonitoring);
When to stream vs when NOT to
| Stream | Do NOT stream |
|---|---|
| Long conversational responses | Short responses (< 50 tokens) |
| Explanations, analysis, essays | Structured JSON for display |
| Creative writing | Error messages |
| Multi-paragraph answers | Action confirmations ("Deleted!") |
| User-facing chat | Background batch processing |
Transport choice
SSE (Server-Sent Events)
✓ Simple (HTTP-based)
✓ Auto-reconnect (browser EventSource API)
✓ HTTP/2 compatible
✓ Named event types for channel separation
✗ Unidirectional only (server → client)
USE FOR: 90% of AI streaming apps
WebSocket
✓ Bidirectional (client ↔ server)
✓ Lower per-message overhead
✗ No auto-reconnect (DIY)
✗ Needs WebSocket-aware infrastructure
USE FOR: Cancel mid-stream, real-time collaboration, gaming
Failure handling cheat sheet
Stream fails mid-response:
→ Preserve partial text
→ Show error indicator
→ Offer "regenerate" button
→ Log partial response
JSON extraction fails (delimiter pattern):
→ Try code block regex
→ Try brace-matching
→ Fall back to dedicated structured call
→ Queue for async extraction
Structured call fails (two-call pattern):
→ User already got their text (UX unaffected)
→ Retry with exponential backoff
→ Show "loading" state for structured panel
→ Queue for later processing
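The first two fallbacks in the "JSON extraction fails" chain above (code-block regex, then brace-matching) can be sketched as one helper. This is an illustrative implementation, not a library function; a `null` return means the caller should fall through to a dedicated structured call or queue the text for async extraction.

```javascript
// Sketch: salvage JSON from a text response after delimiter parsing fails.
function tryExtractJSON(text) {
  // 1. Look for a fenced JSON code block (regex avoids literal backtick runs)
  const block = text.match(/`{3}json\s*([\s\S]*?)`{3}/);
  if (block) {
    try { return JSON.parse(block[1]); } catch {}
  }
  // 2. Brace-matching: take the span from the first '{' to the last '}'
  const start = text.indexOf('{');
  const end = text.lastIndexOf('}');
  if (start !== -1 && end > start) {
    try { return JSON.parse(text.slice(start, end + 1)); } catch {}
  }
  return null; // caller falls back to a dedicated structured call
}
```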
Cost quick math
Single call (delimiter):
Input: user_tokens + system_tokens
Output: text_tokens + delimiter + json_tokens
= 1 API call
Two calls (sequential):
Call 1: user + system → text
Call 2: user + system + text → json
= ~23% more expensive (text sent as input to call 2)
Two calls (parallel):
Call 1: user + system → text
Call 2: user + system → json
= ~7% more expensive (no shared context)
At 100K requests/day with GPT-4o:
Single: ~$773/day
Sequential: ~$950/day
Parallel: ~$825/day
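The token accounting above can be reproduced with a small helper. All inputs (token counts, per-token prices) are placeholders the caller supplies, not real GPT-4o pricing; the structure mirrors the three formulas: single call, sequential (streamed text fed back as input to call 2), and parallel (two independent calls).

```javascript
// Sketch: relative cost of the three patterns for one request.
function patternCosts({ userTok, sysTok, textTok, jsonTok, inPrice, outPrice }) {
  const single =
    (userTok + sysTok) * inPrice + (textTok + jsonTok) * outPrice;
  const sequential =
    (userTok + sysTok) * inPrice + textTok * outPrice +          // call 1: text
    (userTok + sysTok + textTok) * inPrice + jsonTok * outPrice; // call 2: text re-sent as input
  const parallel =
    (userTok + sysTok) * inPrice + textTok * outPrice +          // call 1: text
    (userTok + sysTok) * inPrice + jsonTok * outPrice;           // call 2: independent
  return { single, sequential, parallel };
}
```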
Architecture decision tree
Concurrent users?
< 100 → Simple SSE + sequential extraction
100-10K → SSE + event pipeline + connection management
> 10K → WebSocket + message queue + worker pool
Bidirectional needed?
No → SSE
Yes → WebSocket
Structured data reliability?
Nice to have → Single call + delimiter
Important → Two calls + fallback
Mission critical → Two calls + validation + retry + human review
Common gotchas
| Gotcha | Why |
|---|---|
| `response_format: json` kills conversational text | Forces entire output to be JSON — no natural language |
| Delimiter split across chunks | Buffer and check for partial delimiter matches |
| String concatenation in tight loop | Use array + join() for long responses |
| Streaming short responses | Adds visual noise for < 1 second generation |
| Not handling stream abort | Wastes tokens/money when user navigates away |
| Assuming structured data matches text | Parallel calls generate independently — validate |
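The "delimiter split across chunks" gotcha can be handled by holding back any tail of the buffered text that could still grow into the delimiter. A sketch of that check (it assumes the caller has already verified the full delimiter is not yet in the buffer):

```javascript
// Sketch: only emit text that cannot be a prefix of the delimiter.
// Returns { emit, hold }: emit is safe to show the user now; hold is the
// tail that might complete into the delimiter on the next chunk.
function safeEmit(buffer, delimiter) {
  // Find the longest suffix of buffer that is a prefix of the delimiter
  const max = Math.min(buffer.length, delimiter.length - 1);
  for (let len = max; len > 0; len--) {
    const tail = buffer.slice(buffer.length - len);
    if (delimiter.startsWith(tail)) {
      return { emit: buffer.slice(0, buffer.length - len), hold: tail };
    }
  }
  return { emit: buffer, hold: '' };
}
```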
End of 4.9 quick revision.