Episode 4 — Generative AI Engineering / 4.9 — Combining Streaming with Structured Data
4.9 — Combining Streaming with Structured Data: Quick Revision
Compact cheat sheet. Print-friendly.
How to use this material (instructions)
- Skim before labs or interviews.
- Drill gaps — reopen README.md->4.9.a...4.9.c.
- Practice — 4.9-Exercise-Questions.md.
- Polish answers — 4.9-Interview-Questions.md.
Core vocabulary
| Term | One-liner |
|---|---|
| Streaming | Sending LLM tokens to the client as they are generated, rather than waiting for the full response |
| TTFT | Time to First Token — perceived latency metric, typically 200-700ms |
| Two-phase pattern | Stream text to humans (Phase 1), then extract/return structured JSON for systems (Phase 2) |
| Delimiter pattern | Model outputs text, then a delimiter (---JSON---), then JSON in a single call |
| Stream then parse | Stream the full response, then extract JSON from code blocks or structured sections post-completion |
| Dual-channel | Architecture separating UI-facing (streamed text) from system-facing (structured JSON) outputs |
| SSE | Server-Sent Events — HTTP-based server-to-client streaming protocol |
| Accumulate-while-streaming | Building the full response string alongside the real-time stream for later use |
| Response router | Pattern that conditionally routes AI response data to different system consumers |
The core question
"Who consumes this response?"
Human only → Stream conversational text
Machine only → Return structured JSON (no streaming needed)
Both → Two-phase pattern (stream text + extract/return JSON)
Pattern comparison
SINGLE CALL + DELIMITER
Cost: Lowest (1 API call)
Latency: Lowest
Reliability: Medium (model may forget delimiter or produce invalid JSON)
Best for: High-volume, cost-sensitive, acceptable failure rate
TWO CALLS - SEQUENTIAL
Cost: Highest (~23% more)
Latency: Highest (stream time + extraction time)
Reliability: Highest (response_format enforces valid JSON)
Best for: Reliability-critical (medical, legal, financial)
TWO CALLS - PARALLEL
Cost: Medium (~7% more)
Latency: Medium (max of both calls)
Reliability: High
Best for: Speed + reliability, when structured data can be independent
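The sequential and single-call variants are shown in code later in this sheet, but the parallel variant is not, so here is a minimal sketch. `streamText` and `extractData` are hypothetical helpers wrapping the streaming call and the structured call; both receive only the original message, which is why the JSON is generated independently of the streamed text (hence the "validate" caveat in the gotchas).

```javascript
// Sketch of the parallel two-call pattern: both calls start immediately,
// so total latency is max(call1, call2) rather than call1 + call2.
// streamText / extractData are placeholder names, not real SDK functions.
async function handleBoth(message, { streamText, extractData }) {
  const [text, data] = await Promise.all([
    streamText(message),   // user-facing stream (Phase 1)
    extractData(message),  // structured JSON, run concurrently (Phase 2)
  ]);
  return { text, data };
}
```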
Streaming basics
// Core pattern: accumulate while streaming
const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [...],
  stream: true
});

const parts = [];
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content); // Stream to user
    parts.push(content);           // Accumulate
  }
}
const fullText = parts.join(''); // Complete response
Delimiter pattern
// System prompt instructs: text, then ---JSON---, then JSON
let fullText = '';
let jsonStarted = false;
let jsonBuffer = '';
const delimiter = '---JSON---';

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (!content) continue;
  fullText += content;
  if (!jsonStarted && fullText.includes(delimiter)) {
    jsonStarted = true;
    jsonBuffer = fullText.split(delimiter)[1] || '';
  } else if (jsonStarted) {
    jsonBuffer += content;
  } else {
    onToken(content); // Stream text to user
  }
}
const data = JSON.parse(jsonBuffer.trim());
// Caveat: this sketch can leak a partial delimiter (e.g. "---JS") to the
// user when the delimiter is split across chunks, and any text before the
// delimiter in the triggering chunk never reaches onToken. Production code
// should buffer a short tail before emitting (see "Common gotchas").
Two-call pattern
// Call 1: Stream text (temperature 0.7, no response_format)
const text = await streamToUser(message);

// Call 2: Extract structured data (temperature 0, response_format)
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    { role: 'system', content: 'Extract structured data...' },
    { role: 'user', content: message },
    { role: 'assistant', content: text } // Feed streamed text as context
  ],
  response_format: { type: 'json_object' },
  temperature: 0
});
const data = JSON.parse(response.choices[0].message.content);
SSE (Server-Sent Events) essentials
// Server: Express
res.setHeader('Content-Type', 'text/event-stream');
res.setHeader('Cache-Control', 'no-cache');
res.setHeader('Connection', 'keep-alive');
// Send typed events
res.write(`event: text\ndata: ${JSON.stringify({ content })}\n\n`);
res.write(`event: structured\ndata: ${JSON.stringify({ data })}\n\n`);
// Client: Parse SSE
const reader = response.body.getReader();
// Parse lines starting with "event:" and "data:"
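The client-side parsing hinted at above can be sketched as a small pure function. This is a simplification of SSE framing (frames separated by a blank line, `event:` and `data:` fields), not a spec-complete parser: it ignores `id:`, comment lines, multi-line data, and partial frames held over between reads.

```javascript
// Minimal SSE frame parser (sketch): splits buffered SSE text into
// { event, data } records matching the typed events sent by the server.
function parseSSE(buffer) {
  const events = [];
  for (const frame of buffer.split('\n\n')) {
    let event = 'message'; // SSE default when no event: field is present
    let data = '';
    for (const line of frame.split('\n')) {
      if (line.startsWith('event:')) event = line.slice(6).trim();
      else if (line.startsWith('data:')) data += line.slice(5).trim();
    }
    if (data) events.push({ event, data: JSON.parse(data) });
  }
  return events;
}
```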
Event-based architecture
// Pipeline emits typed events
pipeline.emit('stream:token', { requestId, content });
pipeline.emit('stream:end', { requestId, fullText });
pipeline.emit('structure:complete', { requestId, data });
pipeline.emit('response:error', { requestId, error });
// Independent consumers subscribe
pipeline.on('stream:token', sendToUI);
pipeline.on('structure:complete', saveToDatabase);
pipeline.on('structure:complete', checkForAlerts);
pipeline.on('response:error', notifyMonitoring);
When to stream vs when NOT to
| Stream | Do NOT stream |
|---|---|
| Long conversational responses | Short responses (< 50 tokens) |
| Explanations, analysis, essays | Structured JSON for display |
| Creative writing | Error messages |
| Multi-paragraph answers | Action confirmations ("Deleted!") |
| User-facing chat | Background batch processing |
Transport choice
SSE (Server-Sent Events)
✓ Simple (HTTP-based)
✓ Auto-reconnect (browser EventSource API)
✓ HTTP/2 compatible
✓ Named event types for channel separation
✗ Unidirectional only (server → client)
USE FOR: 90% of AI streaming apps
WebSocket
✓ Bidirectional (client ↔ server)
✓ Lower per-message overhead
✗ No auto-reconnect (DIY)
✗ Needs WebSocket-aware infrastructure
USE FOR: Cancel mid-stream, real-time collaboration, gaming
Failure handling cheat sheet
Stream fails mid-response:
→ Preserve partial text
→ Show error indicator
→ Offer "regenerate" button
→ Log partial response
JSON extraction fails (delimiter pattern):
→ Try code block regex
→ Try brace-matching
→ Fall back to dedicated structured call
→ Queue for async extraction
Structured call fails (two-call pattern):
→ User already got their text (UX unaffected)
→ Retry with exponential backoff
→ Show "loading" state for structured panel
→ Queue for later processing
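The first two fallbacks in the "JSON extraction fails" chain above (code-block regex, then brace-matching) can be sketched as one helper. This is an illustrative implementation, not a library function; a `null` return means the caller should fall through to a dedicated structured call or queue the text for async extraction.

```javascript
// Sketch: salvage JSON from a text response after delimiter parsing fails.
function tryExtractJSON(text) {
  // 1. Look for a fenced JSON code block (regex avoids literal backtick runs)
  const block = text.match(/`{3}json\s*([\s\S]*?)`{3}/);
  if (block) {
    try { return JSON.parse(block[1]); } catch {}
  }
  // 2. Brace-matching: take the span from the first '{' to the last '}'
  const start = text.indexOf('{');
  const end = text.lastIndexOf('}');
  if (start !== -1 && end > start) {
    try { return JSON.parse(text.slice(start, end + 1)); } catch {}
  }
  return null; // caller falls back to a dedicated structured call
}
```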
Cost quick math
Single call (delimiter):
Input: user_tokens + system_tokens
Output: text_tokens + delimiter + json_tokens
= 1 API call
Two calls (sequential):
Call 1: user + system → text
Call 2: user + system + text → json
= ~23% more expensive (text sent as input to call 2)
Two calls (parallel):
Call 1: user + system → text
Call 2: user + system → json
= ~7% more expensive (no shared context)
At 100K requests/day with GPT-4o:
Single: ~$773/day
Sequential: ~$950/day
Parallel: ~$825/day
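The token accounting above can be reproduced with a small helper. All inputs (token counts, per-token prices) are placeholders the caller supplies, not real GPT-4o pricing; the structure mirrors the three formulas: single call, sequential (streamed text fed back as input to call 2), and parallel (two independent calls).

```javascript
// Sketch: relative cost of the three patterns for one request.
function patternCosts({ userTok, sysTok, textTok, jsonTok, inPrice, outPrice }) {
  const single =
    (userTok + sysTok) * inPrice + (textTok + jsonTok) * outPrice;
  const sequential =
    (userTok + sysTok) * inPrice + textTok * outPrice +          // call 1: text
    (userTok + sysTok + textTok) * inPrice + jsonTok * outPrice; // call 2: text re-sent as input
  const parallel =
    (userTok + sysTok) * inPrice + textTok * outPrice +          // call 1: text
    (userTok + sysTok) * inPrice + jsonTok * outPrice;           // call 2: independent
  return { single, sequential, parallel };
}
```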
Architecture decision tree
Concurrent users?
< 100 → Simple SSE + sequential extraction
100-10K → SSE + event pipeline + connection management
> 10K → WebSocket + message queue + worker pool
Bidirectional needed?
No → SSE
Yes → WebSocket
Structured data reliability?
Nice to have → Single call + delimiter
Important → Two calls + fallback
Mission critical → Two calls + validation + retry + human review
Common gotchas
| Gotcha | Why |
|---|---|
| `response_format: json` kills conversational text | Forces entire output to be JSON — no natural language |
| Delimiter split across chunks | Buffer and check for partial delimiter matches |
| String concatenation in tight loop | Use array + join() for long responses |
| Streaming short responses | Adds visual noise for < 1 second generation |
| Not handling stream abort | Wastes tokens/money when user navigates away |
| Assuming structured data matches text | Parallel calls generate independently — validate |
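The "delimiter split across chunks" gotcha can be handled by holding back any tail of the buffered text that could still grow into the delimiter. A sketch of that check (it assumes the caller has already verified the full delimiter is not yet in the buffer):

```javascript
// Sketch: only emit text that cannot be a prefix of the delimiter.
// Returns { emit, hold }: emit is safe to show the user now; hold is the
// tail that might complete into the delimiter on the next chunk.
function safeEmit(buffer, delimiter) {
  // Find the longest suffix of buffer that is a prefix of the delimiter
  const max = Math.min(buffer.length, delimiter.length - 1);
  for (let len = max; len > 0; len--) {
    const tail = buffer.slice(buffer.length - len);
    if (delimiter.startsWith(tail)) {
      return { emit: buffer.slice(0, buffer.length - len), hold: tail };
    }
  }
  return { emit: buffer, hold: '' };
}
```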
End of 4.9 quick revision.