Episode 4 — Generative AI Engineering / 4.8 — Streaming Responses
4.8.c — Improving UX in AI Applications
In one sentence: Great AI UX goes beyond streaming tokens — it means minimizing perceived latency through fast time-to-first-token, intelligent loading states, skeleton screens, streaming indicators, knowing when not to stream, graceful cancellation, and robust error handling, so AI features feel instant and reliable.
Navigation: <- 4.8.b — Progressive Rendering | 4.8 Overview ->
1. Perceived Latency vs Actual Latency
The most important insight in AI UX: users don't measure time — they measure how time feels. Two API calls that both take 8 seconds can feel completely different depending on what the user sees during those 8 seconds.
ACTUAL LATENCY: Both scenarios take 8 seconds total
SCENARIO A — High perceived latency:
[Click] ──── [Spinner for 8s] ──── [Full text appears]
User feeling: "This is slow. Is it broken?"
SCENARIO B — Low perceived latency:
[Click] ── [Skeleton 0.1s] ── [First token 0.3s] ── [Text streaming 7.7s]
User feeling: "Wow, that was fast!"
The difference: ZERO seconds. Same total time. Different UX.
Research on perceived latency
| Time | User Perception |
|---|---|
| < 100ms | Feels instant |
| 100ms - 300ms | Noticeable but not annoying |
| 300ms - 1s | User notices the delay |
| 1s - 3s | User loses focus, needs feedback |
| 3s - 10s | User patience erodes quickly |
| > 10s | User likely abandons the task |
LLM responses typically take 3-15 seconds end to end. Without streaming, every AI feature lands in the "patience erodes" to "abandon" range. With streaming, the first token arrives in 200-500ms — a delay users notice but readily tolerate.
2. Time-to-First-Token (TTFT)
Time-to-first-token is the single most important metric for AI application UX. It measures the time from when the user submits their request to when the first token appears on screen.
What affects TTFT
TTFT = Network latency (client → server)
+ Server processing time
+ API call setup time
+ Model prefill time (processing the input prompt)
+ Network latency (server → client)
+ Client rendering time
Typical breakdown:
Network (round trip): 50-200ms
Server processing: 10-50ms
API setup: 20-100ms
Model prefill: 100-2000ms ← This dominates for long prompts
Client rendering: 5-20ms
─────────────────────────────────────
Total TTFT: 200ms - 2.5s
How prompt length affects TTFT
The model must process (prefill) all input tokens before generating the first output token. Longer prompts = slower TTFT:
| Input Size | Approximate TTFT |
|---|---|
| 100 tokens (short question) | 200-400ms |
| 1,000 tokens (with system prompt) | 300-600ms |
| 5,000 tokens (with RAG context) | 500-1,200ms |
| 20,000 tokens (large document) | 1,000-2,500ms |
| 100,000 tokens (max context) | 3,000-8,000ms |
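The table above can be folded into a back-of-envelope estimator. A minimal sketch — both constants below are illustrative assumptions, not measured values; calibrate them against your own TTFT measurements:

```javascript
// Rough TTFT model: fixed overhead + linear prefill cost.
const FIXED_OVERHEAD_MS = 250;        // network + server + API setup (assumed)
const PREFILL_MS_PER_1K_TOKENS = 60;  // assumed prefill rate

function estimateTTFT(inputTokens) {
  return FIXED_OVERHEAD_MS + (inputTokens / 1000) * PREFILL_MS_PER_1K_TOKENS;
}

// e.g. estimateTTFT(5000) → 550 (ms), in line with the table's RAG row
```

A linear model like this is good enough to answer questions such as "how much TTFT do we buy back by cutting the RAG context in half?" before running real benchmarks.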
Measuring TTFT
async function measureTTFT(messages) {
const startTime = performance.now();
let firstTokenTime = null;
const stream = await openai.chat.completions.create({
model: 'gpt-4o',
messages,
stream: true,
});
for await (const chunk of stream) {
if (!firstTokenTime && chunk.choices[0]?.delta?.content) {
firstTokenTime = performance.now();
const ttft = firstTokenTime - startTime;
console.log(`TTFT: ${ttft.toFixed(0)}ms`);
}
}
const totalTime = performance.now() - startTime;
console.log(`Total time: ${totalTime.toFixed(0)}ms`);
return {
ttft: firstTokenTime ? firstTokenTime - startTime : null,
totalTime,
};
}
Optimizing TTFT
| Strategy | Impact | How |
|---|---|---|
| Shorter system prompts | High | Every token in the prompt adds prefill time |
| Fewer RAG chunks | High | Retrieve only the top 3-5 most relevant chunks |
| Smaller/faster models | High | gpt-4o-mini has ~50% faster TTFT than gpt-4o |
| Edge deployment | Medium | Reduce network round-trip time |
| Connection keep-alive | Medium | Reuse HTTP connections to avoid TCP handshake |
| Prompt caching | High | Anthropic/OpenAI cache repeated prompt prefixes |
| Speculative decoding | Emerging | Draft-model decoding speeds up token generation; helps throughput more than TTFT |
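Of these strategies, trimming retrieved context is often the cheapest win. A minimal sketch — the chunk shape (`{ text, score }`) and the 4-characters-per-token heuristic are assumptions, not a real tokenizer:

```javascript
// Keep only the top-k highest-scoring chunks, then cap the total
// token budget using a rough 4-chars-per-token approximation.
function trimRagContext(chunks, { topK = 5, maxTokens = 4000 } = {}) {
  const ranked = [...chunks].sort((a, b) => b.score - a.score).slice(0, topK);
  const selected = [];
  let budget = maxTokens;
  for (const chunk of ranked) {
    const approxTokens = Math.ceil(chunk.text.length / 4);
    if (approxTokens > budget) break; // stop before blowing the budget
    selected.push(chunk);
    budget -= approxTokens;
  }
  return selected;
}
```

Dropping from 20 retrieved chunks to the top 5 routinely cuts prefill time by more than half — often with no measurable loss in answer quality, since low-scoring chunks rarely change the response.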
3. Loading States: The First 200ms
The gap between the user clicking "Send" and the first token arriving is critical. What happens in this gap determines whether the app feels "fast" or "broken."
Three-phase loading strategy
function StreamingMessage({ isWaiting, isStreaming, content }) {
// Phase 1: Immediate feedback (0-50ms after click)
// Show that the request was received
if (isWaiting && !content) {
return (
<div className="message assistant">
<div className="thinking-indicator">
<ThinkingDots />
<span className="thinking-label">Thinking...</span>
</div>
</div>
);
}
// Phase 2: First token arrived, streaming in progress
if (isStreaming && content) {
return (
<div className="message assistant">
<div className="streaming-content">
{content}
<span className="streaming-cursor">|</span>
</div>
</div>
);
}
// Phase 3: Stream complete
return (
<div className="message assistant">
<div className="complete-content">
<ReactMarkdown>{content}</ReactMarkdown>
</div>
</div>
);
}
Animated thinking indicator
function ThinkingDots() {
return (
<div className="thinking-dots" aria-label="AI is thinking">
<div className="dot" />
<div className="dot" />
<div className="dot" />
</div>
);
}
.thinking-dots {
display: flex;
gap: 4px;
padding: 8px 0;
}
.thinking-dots .dot {
width: 8px;
height: 8px;
background: #9ca3af;
border-radius: 50%;
animation: thinking-bounce 1.4s ease-in-out infinite;
}
.thinking-dots .dot:nth-child(1) { animation-delay: 0s; }
.thinking-dots .dot:nth-child(2) { animation-delay: 0.2s; }
.thinking-dots .dot:nth-child(3) { animation-delay: 0.4s; }
@keyframes thinking-bounce {
0%, 80%, 100% { transform: translateY(0); }
40% { transform: translateY(-6px); }
}
4. Skeleton Screens for AI Content
Skeleton screens show a placeholder of the expected content shape before it loads. For AI content, this means showing a rough outline of where the response will appear:
function AIResponseSkeleton({ expectedLength = 'medium' }) {
const lineWidths = {
short: ['75%', '45%'],
medium: ['90%', '100%', '80%', '60%'],
long: ['100%', '95%', '88%', '100%', '72%', '90%', '45%'],
};
const widths = lineWidths[expectedLength] || lineWidths.medium;
return (
<div className="skeleton-container" aria-busy="true" aria-label="Loading response">
{widths.map((width, i) => (
<div
key={i}
className="skeleton-line"
style={{ width }}
/>
))}
</div>
);
}
.skeleton-container {
padding: 16px;
display: flex;
flex-direction: column;
gap: 10px;
}
.skeleton-line {
height: 14px;
background: linear-gradient(
90deg,
#e5e7eb 25%,
#f3f4f6 50%,
#e5e7eb 75%
);
background-size: 200% 100%;
animation: shimmer 1.5s ease-in-out infinite;
border-radius: 4px;
}
@keyframes shimmer {
0% { background-position: 200% 0; }
100% { background-position: -200% 0; }
}
Transitioning from skeleton to streamed content
function SmartLoadingMessage({ isWaiting, isStreaming, content }) {
const [showSkeleton, setShowSkeleton] = useState(true);
useEffect(() => {
// Fade out skeleton when first content arrives
if (content && content.length > 0) {
// Small delay for smooth transition
const timer = setTimeout(() => setShowSkeleton(false), 100);
return () => clearTimeout(timer);
}
if (!isWaiting && !isStreaming) {
setShowSkeleton(true); // Reset for next message
}
}, [content, isWaiting, isStreaming]);
if (isWaiting && showSkeleton) {
return <AIResponseSkeleton />;
}
return (
<div className={`message-content ${isStreaming ? 'streaming' : 'complete'}`}>
{content}
{isStreaming && <span className="streaming-cursor">|</span>}
</div>
);
}
5. Streaming Indicators and Progress Feedback
Users need continuous feedback that the system is working. Beyond the blinking cursor, consider these patterns:
Token count / word count indicator
function StreamingStatus({ content, isStreaming, startTime }) {
if (!isStreaming) return null;
const wordCount = content.split(/\s+/).filter(Boolean).length;
const elapsed = ((Date.now() - startTime) / 1000).toFixed(1);
return (
<div className="streaming-status">
<span className="pulse-indicator" />
<span>{wordCount} words</span>
<span className="separator">|</span>
<span>{elapsed}s</span>
</div>
);
}
"Stop generating" button (like ChatGPT)
function StopButton({ isStreaming, onStop }) {
if (!isStreaming) return null;
return (
<button
onClick={onStop}
className="stop-generating-btn"
aria-label="Stop generating response"
>
<svg width="16" height="16" viewBox="0 0 16 16" fill="currentColor">
<rect x="3" y="3" width="10" height="10" rx="1" />
</svg>
Stop generating
</button>
);
}
Progress estimation
For known-length tasks (summarize to N sentences, answer in M points), you can estimate progress:
function StreamingProgress({ content, expectedItems, isStreaming }) {
if (!isStreaming) return null;
// Count completed items (e.g., numbered list items)
const completedItems = (content.match(/^\d+\./gm) || []).length;
const progress = Math.min(completedItems / expectedItems, 1);
return (
<div className="progress-container">
<div className="progress-bar">
<div
className="progress-fill"
style={{ width: `${progress * 100}%` }}
/>
</div>
<span className="progress-label">
{completedItems} of {expectedItems} points
</span>
</div>
);
}
6. When NOT to Stream
Streaming is not always the right choice. Sometimes it adds complexity without UX benefit:
Cases where buffered responses are better
| Scenario | Why Not Stream |
|---|---|
| Short responses (< 50 tokens) | Response completes in < 1 second anyway; streaming overhead adds latency |
| JSON/structured output | Partial JSON is invalid and unparseable; you need the complete object |
| Binary decisions (yes/no, approve/reject) | A single word doesn't benefit from streaming |
| Batch processing | Backend pipelines process results programmatically, not for display |
| Function/tool calling | Tool calls must be complete before execution; streaming partial function args is unusable |
| Validation-required output | If you must validate before showing (e.g., content moderation), streaming bypasses the check |
| Multi-step chains | Intermediate LLM calls in a chain are not user-facing |
Decision framework
function shouldStream(config) {
const {
expectedTokens, // Estimated response length
isUserFacing, // Will the output be displayed to a user?
needsValidation, // Must the output be validated before showing?
outputFormat, // 'text', 'json', 'binary_decision'
isPartOfChain, // Is this an intermediate step in a pipeline?
} = config;
// Don't stream if:
if (!isUserFacing) return false; // Background processing
if (needsValidation) return false; // Must validate first
if (outputFormat === 'json') return false; // Need complete JSON
if (outputFormat === 'binary_decision') return false; // Too short
if (isPartOfChain) return false; // Intermediate step
if (expectedTokens < 50) return false; // Too short to benefit
// Do stream if:
return true;
}
Hybrid approach: stream text but collect structured data
async function hybridStream(messages) {
// For responses that contain both text and structured data:
// Stream the text portion for UX, then extract structured data at the end
const stream = await openai.chat.completions.create({
model: 'gpt-4o',
messages,
stream: true,
});
let fullContent = '';
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) {
fullContent += content;
// Render the streamed text as it arrives
// (updateUI is an app-specific placeholder, not a library function)
updateUI(content);
}
}
// After stream completes, extract any structured data from the full response
const structuredData = parseStructuredData(fullContent); // app-specific, e.g. extract a trailing JSON block
return { text: fullContent, data: structuredData };
}
7. Cancellation Handling
Users should always be able to stop a streaming response. This requires coordination between the frontend and backend:
Frontend cancellation with AbortController
function useCancellableStream() {
const [isStreaming, setIsStreaming] = useState(false);
const controllerRef = useRef(null);
const startStream = useCallback(async (messages, onToken) => {
// Cancel any existing stream
controllerRef.current?.abort();
// Create new controller
controllerRef.current = new AbortController();
setIsStreaming(true);
try {
const response = await fetch('/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ messages }),
signal: controllerRef.current.signal,
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
while (true) {
const { done, value } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
const lines = buffer.split('\n\n');
buffer = lines.pop();
for (const line of lines) {
if (!line.startsWith('data: ')) continue;
const raw = line.slice(6);
if (raw === '[DONE]') return;
const data = JSON.parse(raw);
if (data.content) onToken(data.content);
}
}
} catch (error) {
if (error.name === 'AbortError') {
console.log('Stream cancelled by user');
// Partial response is kept — user can see what was generated
} else {
throw error;
}
} finally {
setIsStreaming(false);
}
}, []);
const cancelStream = useCallback(() => {
controllerRef.current?.abort();
setIsStreaming(false);
}, []);
return { startStream, cancelStream, isStreaming };
}
Backend: detecting client disconnect
// Express.js middleware for handling disconnects
app.post('/api/chat', async (req, res) => {
res.setHeader('Content-Type', 'text/event-stream');
res.setHeader('Cache-Control', 'no-cache');
let clientDisconnected = false;
req.on('close', () => {
clientDisconnected = true;
// Note: the OpenAI Node SDK can abort an in-flight stream (break out
// of the for-await loop, or call stream.controller.abort()). We stop
// sending to the client either way; tokens the model generated before
// the abort are still billed.
});
const stream = await openai.chat.completions.create({
model: 'gpt-4o',
messages: req.body.messages,
stream: true,
});
for await (const chunk of stream) {
if (clientDisconnected) {
// Stop forwarding; breaking out of the loop closes the stream
// iterator, which aborts the upstream request in the Node SDK.
// Tokens generated before the abort still count toward billing.
break;
}
const content = chunk.choices[0]?.delta?.content;
if (content) {
res.write(`data: ${JSON.stringify({ content })}\n\n`);
}
}
if (!clientDisconnected) {
res.write('data: [DONE]\n\n');
res.end();
}
});
The hidden cost of cancellation
IMPORTANT: Cancelling on the client side does not necessarily stop the model from generating.
User clicks "Stop" at token 50 of a 500-token response:
- Client: stops reading the stream ✓
- Server: stops forwarding chunks once it detects the disconnect ✓
- LLM API: may keep generating up to all 500 tokens ✗
- Billing: you may pay for every output token generated ✗
Whether generation actually halts when the connection drops varies by
provider and SDK version. For cost planning, assume the worst case: the
model runs to completion (or to max_tokens) regardless of client cancellation.
Mitigation: Use max_tokens to cap the maximum possible cost.
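The `max_tokens` cap bounds the worst case even when cancellation can't stop generation. A quick sketch of the resulting cost ceiling — the per-million-token rate is a placeholder, not a current price; check your provider's pricing page:

```javascript
// Worst-case output cost if a cancelled stream runs to max_tokens anyway.
// pricePerMillionOutputTokens is a placeholder — use your provider's rate.
function worstCaseOutputCost(maxTokens, pricePerMillionOutputTokens) {
  return (maxTokens / 1_000_000) * pricePerMillionOutputTokens;
}

// With max_tokens: 500 and an assumed $10 per 1M output tokens,
// an abandoned response costs at most $0.005.
```

Set `max_tokens` in the `chat.completions.create` call to whatever your UI actually needs; an unbounded response is both a cost risk and a UX risk.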
8. Error Handling During Streams
Streaming introduces a class of errors that don't exist with buffered responses: errors that occur after partial content has been delivered.
Error taxonomy for streaming
| Error Type | When | User Impact | Handling Strategy |
|---|---|---|---|
| Pre-stream error | Before first token | No content shown | Show error message, allow retry |
| Mid-stream network error | During streaming | Partial content visible | Show warning, keep partial content, offer retry |
| Mid-stream API error | During streaming | Partial content visible | Append error indicator, offer retry |
| Rate limit (429) | Before or during | May have partial content | Backoff + retry, show wait indicator |
| Content filter | During streaming | Partial content may be shown | Replace or append warning |
| Timeout | No chunks for N seconds | Partial or no content | Show timeout message, offer retry |
Comprehensive error handling component
function useRobustStreaming() {
const [state, setState] = useState({
messages: [],
isStreaming: false,
error: null,
retryCount: 0,
});
const sendMessage = useCallback(async (userMessage, maxRetries = 2) => {
setState(prev => ({
...prev,
isStreaming: true,
error: null,
}));
let attempt = 0;
while (attempt <= maxRetries) {
try {
const response = await fetch('/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ message: userMessage }),
});
if (response.status === 429) {
const retryAfter = parseInt(response.headers.get('retry-after') || '5', 10);
setState(prev => ({
...prev,
error: `Rate limited. Retrying in ${retryAfter}s...`,
}));
await new Promise(r => setTimeout(r, retryAfter * 1000));
attempt++;
continue;
}
if (!response.ok) {
throw new Error(`Server error: ${response.status}`);
}
// Process the stream
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
let lastChunkTime = Date.now(); // last activity timestamp (useful for a "stalled" UI state)
let content = '';
// Set up a chunk timeout (no data for 30s = stale stream)
const chunkTimeoutMs = 30000;
while (true) {
// Read with timeout
const readPromise = reader.read();
// NB: production code should capture and clear this timer after the
// race — otherwise a stale timeout later fires as an unhandled rejection
const timeoutPromise = new Promise((_, reject) =>
setTimeout(() => reject(new Error('Chunk timeout')), chunkTimeoutMs)
);
const { done, value } = await Promise.race([readPromise, timeoutPromise]);
if (done) break;
lastChunkTime = Date.now();
buffer += decoder.decode(value, { stream: true });
const lines = buffer.split('\n\n');
buffer = lines.pop();
for (const line of lines) {
if (!line.startsWith('data: ')) continue;
const raw = line.slice(6);
if (raw === '[DONE]') break;
const data = JSON.parse(raw);
if (data.type === 'error') {
throw new Error(data.message);
}
if (data.content) {
content += data.content;
setState(prev => ({
...prev,
error: null,
// In a real app, merge `content` into the last assistant
// message here so the UI updates on every chunk
}));
}
}
}
// Success — exit retry loop
setState(prev => ({
...prev,
isStreaming: false,
error: null,
retryCount: 0,
}));
return content;
} catch (error) {
attempt++;
if (attempt > maxRetries) {
setState(prev => ({
...prev,
isStreaming: false,
error: `Failed after ${maxRetries + 1} attempts: ${error.message}`,
retryCount: attempt,
}));
return null;
}
// Exponential backoff
const backoff = Math.min(1000 * Math.pow(2, attempt), 10000);
setState(prev => ({
...prev,
error: `Error: ${error.message}. Retrying in ${backoff / 1000}s... (attempt ${attempt + 1}/${maxRetries + 1})`,
}));
await new Promise(r => setTimeout(r, backoff));
}
}
}, []);
return { ...state, sendMessage };
}
Error UI patterns
function StreamingError({ error, onRetry, partialContent }) {
if (!error) return null;
return (
<div className="streaming-error" role="alert">
{/* If we have partial content, show it with an error banner */}
{partialContent && (
<div className="partial-content">
{partialContent}
<div className="truncation-indicator">
[Response interrupted]
</div>
</div>
)}
<div className="error-banner">
<svg className="error-icon" viewBox="0 0 20 20" fill="currentColor">
<path fillRule="evenodd" d="M10 18a8 8 0 100-16 8 8 0 000 16zM8.707 7.293a1 1 0 00-1.414 1.414L8.586 10l-1.293 1.293a1 1 0 101.414 1.414L10 11.414l1.293 1.293a1 1 0 001.414-1.414L11.414 10l1.293-1.293a1 1 0 00-1.414-1.414L10 8.586 8.707 7.293z" />
</svg>
<span className="error-message">{error}</span>
{onRetry && (
<button onClick={onRetry} className="retry-btn">
Retry
</button>
)}
</div>
</div>
);
}
9. Production UX Patterns: Putting It All Together
Pattern 1: ChatGPT-style interface
function ProductionChatUI() {
const {
messages,
isStreaming,
error,
sendMessage,
cancelStream,
} = useStreamingChat();
const [input, setInput] = useState('');
const messagesEndRef = useRef(null);
// Auto-scroll
useEffect(() => {
messagesEndRef.current?.scrollIntoView({ behavior: 'smooth' });
}, [messages]);
return (
<div className="chat-layout">
{/* Message history */}
<div className="messages-area">
{messages.length === 0 && (
<div className="empty-state">
<h2>How can I help you today?</h2>
<div className="suggestion-chips">
<button onClick={() => sendMessage('Explain streaming APIs')}>
Explain streaming APIs
</button>
<button onClick={() => sendMessage('Write a React component')}>
Write a React component
</button>
</div>
</div>
)}
{messages.map((msg, i) => (
<MessageBubble
key={i}
message={msg}
isStreaming={
isStreaming &&
msg.role === 'assistant' &&
i === messages.length - 1
}
/>
))}
{error && (
<StreamingError
error={error}
onRetry={() => {
const lastUserMsg = [...messages].reverse().find(m => m.role === 'user');
if (lastUserMsg) sendMessage(lastUserMsg.content);
}}
/>
)}
<div ref={messagesEndRef} />
</div>
{/* Input area */}
<div className="input-area">
{isStreaming && (
<button onClick={cancelStream} className="stop-btn">
Stop generating
</button>
)}
<form onSubmit={(e) => {
e.preventDefault();
if (input.trim() && !isStreaming) {
sendMessage(input.trim());
setInput('');
}
}}>
<textarea
value={input}
onChange={(e) => setInput(e.target.value)}
onKeyDown={(e) => {
if (e.key === 'Enter' && !e.shiftKey) {
e.preventDefault();
e.target.form.requestSubmit();
}
}}
placeholder="Message..."
disabled={isStreaming}
rows={1}
/>
<button type="submit" disabled={!input.trim() || isStreaming}>
Send
</button>
</form>
</div>
</div>
);
}
Pattern 2: Inline AI assist (like GitHub Copilot or Notion AI)
function InlineAIAssist({ context, onInsert }) {
const [suggestion, setSuggestion] = useState('');
const [isGenerating, setIsGenerating] = useState(false);
const generate = async () => {
setSuggestion('');
setIsGenerating(true);
try {
const response = await fetch('/api/assist', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ context }),
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
let content = '';
while (true) {
const { done, value } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
const lines = buffer.split('\n\n');
buffer = lines.pop();
for (const line of lines) {
if (!line.startsWith('data: ')) continue;
const raw = line.slice(6);
if (raw === '[DONE]') break;
const data = JSON.parse(raw);
if (data.content) {
content += data.content;
setSuggestion(content);
}
}
}
} finally {
setIsGenerating(false);
}
};
return (
<div className="inline-assist">
{!suggestion && !isGenerating && (
<button onClick={generate} className="generate-btn">
AI Assist
</button>
)}
{isGenerating && !suggestion && <ThinkingDots />}
{suggestion && (
<div className="suggestion-preview">
<div className="suggestion-text">
{suggestion}
{isGenerating && <span className="streaming-cursor">|</span>}
</div>
{!isGenerating && (
<div className="suggestion-actions">
<button onClick={() => onInsert(suggestion)}>Insert</button>
<button onClick={() => setSuggestion('')}>Discard</button>
<button onClick={generate}>Regenerate</button>
</div>
)}
</div>
)}
</div>
);
}
10. Measuring and Monitoring UX Metrics
In production, track these metrics to ensure your AI features meet UX standards:
// Metrics collection for AI UX
class AIUXMetrics {
constructor() {
this.metrics = [];
}
trackInteraction(data) {
const metric = {
timestamp: Date.now(),
...data,
};
this.metrics.push(metric);
this.report(metric);
}
report(metric) {
// Send to your analytics service
fetch('/api/analytics', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(metric),
}).catch(() => {}); // Non-blocking
}
}
const aiMetrics = new AIUXMetrics();
// Track in your streaming hook
function useTrackedStreaming() {
const sendMessage = useCallback(async (message) => {
const startTime = performance.now();
let firstTokenTime = null;
let tokenCount = 0;
let wasCancelled = false;
// ... streaming logic ...
// On first token:
firstTokenTime = performance.now();
// On each token:
tokenCount++;
// On complete or cancel:
aiMetrics.trackInteraction({
event: 'ai_response',
ttft: firstTokenTime ? firstTokenTime - startTime : null,
totalTime: performance.now() - startTime,
tokenCount,
wasCancelled,
model: 'gpt-4o',
promptLength: message.length,
});
}, []);
}
Key UX metrics dashboard
| Metric | Target | Alert Threshold |
|---|---|---|
| Time to first token (TTFT) | < 500ms | > 2s |
| Total response time | < 10s for typical queries | > 20s |
| Tokens per second | > 30 tok/s display rate | < 10 tok/s |
| Error rate | < 1% | > 5% |
| Cancellation rate | < 15% | > 30% (indicates frustration) |
| Retry rate | < 5% | > 10% |
| Empty response rate | < 0.5% | > 2% |
| User satisfaction (thumbs up/down) | > 80% positive | < 60% positive |
11. Common UX Anti-patterns in AI Applications
| Anti-pattern | Problem | Fix |
|---|---|---|
| Spinner for 10+ seconds | Users abandon; no indication of progress | Use streaming |
| No cancel button | Users feel trapped waiting | Always provide "Stop generating" |
| Streaming JSON | Partial JSON flashes on screen, looks broken | Buffer JSON, only show when complete |
| No error recovery | One error = dead end | Show error + retry button + keep partial content |
| Identical loading for all tasks | Short tasks feel artificially slow | Adjust loading UI based on expected duration |
| Flash of skeleton then instant content | Skeleton shown for < 200ms, feels glitchy | Only show skeleton if TTFT > 300ms |
| Auto-scroll while user is reading | Annoying when user scrolls up to re-read | Only auto-scroll if user is near the bottom |
| No feedback on submit | User double-clicks, sends duplicate | Disable input immediately, show visual confirmation |
| Streaming text reflows layout | Page jumps as content grows | Reserve space, use fixed-width containers |
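The auto-scroll fix from the table reduces to a small guard: only follow the stream if the user is already at (or near) the bottom. The 100px threshold below is an arbitrary assumption — tune it to your layout:

```javascript
// True if the user hasn't scrolled up to re-read earlier content.
// Works on any element with scrollHeight/scrollTop/clientHeight.
function isNearBottom(el, thresholdPx = 100) {
  return el.scrollHeight - el.scrollTop - el.clientHeight <= thresholdPx;
}

// Call this after appending each streamed chunk.
function autoScrollIfFollowing(el) {
  if (isNearBottom(el)) {
    el.scrollTop = el.scrollHeight; // keep pinned to the newest tokens
  }
}
```

Because the check runs before each scroll, the moment the user drags the scrollbar up, auto-scroll silently disengages — and re-engages as soon as they return to the bottom.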
12. Key Takeaways
- Perceived latency > actual latency — streaming doesn't make the model faster, but it makes the experience feel 10x faster.
- Time-to-first-token (TTFT) is the most important metric — optimize prompt length, model choice, and network latency to minimize it.
- Use three-phase loading: immediate feedback (thinking dots) -> streaming text (with cursor) -> complete response (with markdown rendering).
- Don't stream everything — JSON, short responses, binary decisions, and pipeline-internal calls should use buffered responses.
- Always provide cancellation — both UX ("Stop generating" button) and technical (AbortController).
- Handle mid-stream errors gracefully — keep partial content visible, show an error banner, and offer retry.
- Monitor UX metrics in production — TTFT, cancellation rate, and error rate reveal real user experience quality.
Explain-It Challenge
- A product manager asks "why do we need streaming? Can't we just make the model faster?" Explain the difference between actual and perceived latency with a concrete example.
- Your AI feature works great for short questions but users complain about long RAG-powered responses feeling slow. What is the root cause and how would you improve the UX?
- A developer wants to stream a JSON response to show a progress bar as fields fill in. Explain why this is problematic and suggest a better approach.
Navigation: <- 4.8.b — Progressive Rendering | 4.8 Overview ->