Episode 4 — Generative AI Engineering / 4.8 — Streaming Responses
4.8.c — Improving UX in AI Applications
In one sentence: Great AI UX goes beyond streaming tokens — it means minimizing perceived latency through fast time-to-first-token, intelligent loading states, skeleton screens, streaming indicators, knowing when not to stream, graceful cancellation, and robust error handling, so AI features feel instant and reliable.
Navigation: <- 4.8.b — Progressive Rendering | 4.8 Overview ->
1. Perceived Latency vs Actual Latency
The most important insight in AI UX: users don't measure time — they measure how time feels. Two API calls that both take 8 seconds can feel completely different depending on what the user sees during those 8 seconds.
ACTUAL LATENCY: Both scenarios take 8 seconds total
SCENARIO A — High perceived latency:
[Click] ──── [Spinner for 8s] ──── [Full text appears]
User feeling: "This is slow. Is it broken?"
SCENARIO B — Low perceived latency:
[Click] ── [Skeleton 0.1s] ── [First token 0.3s] ── [Text streaming 7.7s]
User feeling: "Wow, that was fast!"
The difference: ZERO seconds. Same total time. Different UX.
Research on perceived latency
| Time | User Perception |
|---|---|
| < 100ms | Feels instant |
| 100ms - 300ms | Noticeable but not annoying |
| 300ms - 1s | User notices the delay |
| 1s - 3s | User loses focus, needs feedback |
| 3s - 10s | User patience erodes quickly |
| > 10s | User likely abandons the task |
LLM responses typically take 3-15 seconds end to end. Without streaming, every AI feature lands in the "patience erodes" to "abandon" range. With streaming, the first token arrives in 200-500ms — a delay users notice but readily tolerate.
2. Time-to-First-Token (TTFT)
Time-to-first-token is the single most important metric for AI application UX. It measures the time from when the user submits their request to when the first token appears on screen.
What affects TTFT
TTFT = Network latency (client → server)
+ Server processing time
+ API call setup time
+ Model prefill time (processing the input prompt)
+ Network latency (server → client)
+ Client rendering time
Typical breakdown:
Network (round trip): 50-200ms
Server processing: 10-50ms
API setup: 20-100ms
Model prefill: 100-2000ms ← This dominates for long prompts
Client rendering: 5-20ms
─────────────────────────────────────
Total TTFT: 200ms - 2.5s
How prompt length affects TTFT
The model must process (prefill) all input tokens before generating the first output token. Longer prompts = slower TTFT:
| Input Size | Approximate TTFT |
|---|---|
| 100 tokens (short question) | 200-400ms |
| 1,000 tokens (with system prompt) | 300-600ms |
| 5,000 tokens (with RAG context) | 500-1,200ms |
| 20,000 tokens (large document) | 1,000-2,500ms |
| 100,000 tokens (max context) | 3,000-8,000ms |
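The table above can be folded into a back-of-envelope estimator. A minimal sketch — both constants below are illustrative assumptions, not measured values; calibrate them against your own TTFT measurements:

```javascript
// Rough TTFT model: fixed overhead + linear prefill cost.
const FIXED_OVERHEAD_MS = 250;        // network + server + API setup (assumed)
const PREFILL_MS_PER_1K_TOKENS = 60;  // assumed prefill rate

function estimateTTFT(inputTokens) {
  return FIXED_OVERHEAD_MS + (inputTokens / 1000) * PREFILL_MS_PER_1K_TOKENS;
}

// e.g. estimateTTFT(5000) → 550 (ms), in line with the table's RAG row
```

A linear model like this is good enough to answer questions such as "how much TTFT do we buy back by cutting the RAG context in half?" before running real benchmarks.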
Measuring TTFT
async function measureTTFT(messages) {
const startTime = performance.now();
let firstTokenTime = null;
const stream = await openai.chat.completions.create({
model: 'gpt-4o',
messages,
stream: true,
});
for await (const chunk of stream) {
if (!firstTokenTime && chunk.choices[0]?.delta?.content) {
firstTokenTime = performance.now();
const ttft = firstTokenTime - startTime;
console.log(`TTFT: ${ttft.toFixed(0)}ms`);
}
}
const totalTime = performance.now() - startTime;
console.log(`Total time: ${totalTime.toFixed(0)}ms`);
return {
ttft: firstTokenTime ? firstTokenTime - startTime : null,
totalTime,
};
}
Optimizing TTFT
| Strategy | Impact | How |
|---|---|---|
| Shorter system prompts | High | Every token in the prompt adds prefill time |
| Fewer RAG chunks | High | Retrieve only the top 3-5 most relevant chunks |
| Smaller/faster models | High | gpt-4o-mini has ~50% faster TTFT than gpt-4o |
| Edge deployment | Medium | Reduce network round-trip time |
| Connection keep-alive | Medium | Reuse HTTP connections to avoid TCP handshake |
| Prompt caching | High | Anthropic/OpenAI cache repeated prompt prefixes |
| Speculative decoding | Emerging | Draft-model decoding speeds up token generation; helps throughput more than TTFT |
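Of these strategies, trimming retrieved context is often the cheapest win. A minimal sketch — the chunk shape (`{ text, score }`) and the 4-characters-per-token heuristic are assumptions, not a real tokenizer:

```javascript
// Keep only the top-k highest-scoring chunks, then cap the total
// token budget using a rough 4-chars-per-token approximation.
function trimRagContext(chunks, { topK = 5, maxTokens = 4000 } = {}) {
  const ranked = [...chunks].sort((a, b) => b.score - a.score).slice(0, topK);
  const selected = [];
  let budget = maxTokens;
  for (const chunk of ranked) {
    const approxTokens = Math.ceil(chunk.text.length / 4);
    if (approxTokens > budget) break; // stop before blowing the budget
    selected.push(chunk);
    budget -= approxTokens;
  }
  return selected;
}
```

Dropping from 20 retrieved chunks to the top 5 routinely cuts prefill time by more than half — often with no measurable loss in answer quality, since low-scoring chunks rarely change the response.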
3. Loading States: The First 200ms
The gap between the user clicking "Send" and the first token arriving is critical. What happens in this gap determines whether the app feels "fast" or "broken."
Three-phase loading strategy
function StreamingMessage({ isWaiting, isStreaming, content }) {
// Phase 1: Immediate feedback (0-50ms after click)
// Show that the request was received
if (isWaiting && !content) {
return (
<div className="message assistant">
<div className="thinking-indicator">
<ThinkingDots />
<span className="thinking-label">Thinking...</span>
</div>
</div>
);
}
// Phase 2: First token arrived, streaming in progress
if (isStreaming && content) {
return (
<div className="message assistant">
<div className="streaming-content">
{content}
<span className="streaming-cursor">|</span>
</div>
</div>
);
}
// Phase 3: Stream complete
return (
<div className="message assistant">
<div className="complete-content">
<ReactMarkdown>{content}</ReactMarkdown>
</div>
</div>
);
}
Animated thinking indicator
function ThinkingDots() {
return (
<div className="thinking-dots" aria-label="AI is thinking">
<div className="dot" />
<div className="dot" />
<div className="dot" />
</div>
);
}
.thinking-dots {
display: flex;
gap: 4px;
padding: 8px 0;
}
.thinking-dots .dot {
width: 8px;
height: 8px;
background: #9ca3af;
border-radius: 50%;
animation: thinking-bounce 1.4s ease-in-out infinite;
}
.thinking-dots .dot:nth-child(1) { animation-delay: 0s; }
.thinking-dots .dot:nth-child(2) { animation-delay: 0.2s; }
.thinking-dots .dot:nth-child(3) { animation-delay: 0.4s; }
@keyframes thinking-bounce {
0%, 80%, 100% { transform: translateY(0); }
40% { transform: translateY(-6px); }
}
4. Skeleton Screens for AI Content
Skeleton screens show a placeholder of the expected content shape before it loads. For AI content, this means showing a rough outline of where the response will appear:
function AIResponseSkeleton({ expectedLength = 'medium' }) {
const lineWidths = {
short: ['75%', '45%'],
medium: ['90%', '100%', '80%', '60%'],
long: ['100%', '95%', '88%', '100%', '72%', '90%', '45%'],
};
const widths = lineWidths[expectedLength] || lineWidths.medium;
return (
<div className="skeleton-container" aria-busy="true" aria-label="Loading response">
{widths.map((width, i) => (
<div
key={i}
className="skeleton-line"
style={{ width }}
/>
))}
</div>
);
}
.skeleton-container {
padding: 16px;
display: flex;
flex-direction: column;
gap: 10px;
}
.skeleton-line {
height: 14px;
background: linear-gradient(
90deg,
#e5e7eb 25%,
#f3f4f6 50%,
#e5e7eb 75%
);
background-size: 200% 100%;
animation: shimmer 1.5s ease-in-out infinite;
border-radius: 4px;
}
@keyframes shimmer {
0% { background-position: 200% 0; }
100% { background-position: -200% 0; }
}
Transitioning from skeleton to streamed content
function SmartLoadingMessage({ isWaiting, isStreaming, content }) {
const [showSkeleton, setShowSkeleton] = useState(true);
useEffect(() => {
// Fade out skeleton when first content arrives
if (content && content.length > 0) {
// Small delay for smooth transition
const timer = setTimeout(() => setShowSkeleton(false), 100);
return () => clearTimeout(timer);
}
if (!isWaiting && !isStreaming) {
setShowSkeleton(true); // Reset for next message
}
}, [content, isWaiting, isStreaming]);
if (isWaiting && showSkeleton) {
return <AIResponseSkeleton />;
}
return (
<div className={`message-content ${isStreaming ? 'streaming' : 'complete'}`}>
{content}
{isStreaming && <span className="streaming-cursor">|</span>}
</div>
);
}
5. Streaming Indicators and Progress Feedback
Users need continuous feedback that the system is working. Beyond the blinking cursor, consider these patterns:
Token count / word count indicator
function StreamingStatus({ content, isStreaming, startTime }) {
if (!isStreaming) return null;
const wordCount = content.split(/\s+/).filter(Boolean).length;
const elapsed = ((Date.now() - startTime) / 1000).toFixed(1);
return (
<div className="streaming-status">
<span className="pulse-indicator" />
<span>{wordCount} words</span>
<span className="separator">|</span>
<span>{elapsed}s</span>
</div>
);
}
"Stop generating" button (like ChatGPT)
function StopButton({ isStreaming, onStop }) {
if (!isStreaming) return null;
return (
<button
onClick={onStop}
className="stop-generating-btn"
aria-label="Stop generating response"
>
<svg width="16" height="16" viewBox="0 0 16 16" fill="currentColor">
<rect x="3" y="3" width="10" height="10" rx="1" />
</svg>
Stop generating
</button>
);
}
Progress estimation
For known-length tasks (summarize to N sentences, answer in M points), you can estimate progress:
function StreamingProgress({ content, expectedItems, isStreaming }) {
if (!isStreaming) return null;
// Count completed items (e.g., numbered list items)
const completedItems = (content.match(/^\d+\./gm) || []).length;
const progress = Math.min(completedItems / expectedItems, 1);
return (
<div className="progress-container">
<div className="progress-bar">
<div
className="progress-fill"
style={{ width: `${progress * 100}%` }}
/>
</div>
<span className="progress-label">
{completedItems} of {expectedItems} points
</span>
</div>
);
}
6. When NOT to Stream
Streaming is not always the right choice. Sometimes it adds complexity without UX benefit:
Cases where buffered responses are better
| Scenario | Why Not Stream |
|---|---|
| Short responses (< 50 tokens) | Response completes in < 1 second anyway; streaming overhead adds latency |
| JSON/structured output | Partial JSON is invalid and unparseable; you need the complete object |
| Binary decisions (yes/no, approve/reject) | A single word doesn't benefit from streaming |
| Batch processing | Backend pipelines process results programmatically, not for display |
| Function/tool calling | Tool calls must be complete before execution; streaming partial function args is unusable |
| Validation-required output | If you must validate before showing (e.g., content moderation), streaming bypasses the check |
| Multi-step chains | Intermediate LLM calls in a chain are not user-facing |
Decision framework
function shouldStream(config) {
const {
expectedTokens, // Estimated response length
isUserFacing, // Will the output be displayed to a user?
needsValidation, // Must the output be validated before showing?
outputFormat, // 'text', 'json', 'binary_decision'
isPartOfChain, // Is this an intermediate step in a pipeline?
} = config;
// Don't stream if:
if (!isUserFacing) return false; // Background processing
if (needsValidation) return false; // Must validate first
if (outputFormat === 'json') return false; // Need complete JSON
if (outputFormat === 'binary_decision') return false; // Too short
if (isPartOfChain) return false; // Intermediate step
if (expectedTokens < 50) return false; // Too short to benefit
// Do stream if:
return true;
}
Hybrid approach: stream text but collect structured data
async function hybridStream(messages) {
// For responses that contain both text and structured data:
// Stream the text portion for UX, then extract structured data at the end
const stream = await openai.chat.completions.create({
model: 'gpt-4o',
messages,
stream: true,
});
let fullContent = '';
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) {
fullContent += content;
// Render the streamed text as it arrives
// (updateUI is an app-specific placeholder, not a library function)
updateUI(content);
}
}
// After stream completes, extract any structured data from the full response
const structuredData = parseStructuredData(fullContent); // app-specific, e.g. extract a trailing JSON block
return { text: fullContent, data: structuredData };
}
7. Cancellation Handling
Users should always be able to stop a streaming response. This requires coordination between the frontend and backend:
Frontend cancellation with AbortController
function useCancellableStream() {
const [isStreaming, setIsStreaming] = useState(false);
const controllerRef = useRef(null);
const startStream = useCallback(async (messages, onToken) => {
// Cancel any existing stream
controllerRef.current?.abort();
// Create new controller
controllerRef.current = new AbortController();
setIsStreaming(true);
try {
const response = await fetch('/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ messages }),
signal: controllerRef.current.signal,
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
while (true) {
const { done, value } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
const lines = buffer.split('\n\n');
buffer = lines.pop();
for (const line of lines) {
if (!line.startsWith('data: ')) continue;
const raw = line.slice(6);
if (raw === '[DONE]') return;
const data = JSON.parse(raw);
if (data.content) onToken(data.content);
}
}
} catch (error) {
if (error.name === 'AbortError') {
console.log('Stream cancelled by user');
// Partial response is kept — user can see what was generated
} else {
throw error;
}
} finally {
setIsStreaming(false);
}
}, []);
const cancelStream = useCallback(() => {
controllerRef.current?.abort();
setIsStreaming(false);
}, []);
return { startStream, cancelStream, isStreaming };
}
Backend: detecting client disconnect
// Express.js middleware for handling disconnects
app.post('/api/chat', async (req, res) => {
res.setHeader('Content-Type', 'text/event-stream');
res.setHeader('Cache-Control', 'no-cache');
let clientDisconnected = false;
req.on('close', () => {
clientDisconnected = true;
// Note: the OpenAI Node SDK can abort an in-flight stream (break out
// of the for-await loop, or call stream.controller.abort()). We stop
// sending to the client either way; tokens the model generated before
// the abort are still billed.
});
const stream = await openai.chat.completions.create({
model: 'gpt-4o',
messages: req.body.messages,
stream: true,
});
for await (const chunk of stream) {
if (clientDisconnected) {
// Stop forwarding; breaking out of the loop closes the stream
// iterator, which aborts the upstream request in the Node SDK.
// Tokens generated before the abort still count toward billing.
break;
}
const content = chunk.choices[0]?.delta?.content;
if (content) {
res.write(`data: ${JSON.stringify({ content })}\n\n`);
}
}
if (!clientDisconnected) {
res.write('data: [DONE]\n\n');
res.end();
}
});
The hidden cost of cancellation
IMPORTANT: Cancelling on the client side does not necessarily stop the model from generating.
User clicks "Stop" at token 50 of a 500-token response:
- Client: stops reading the stream ✓
- Server: stops forwarding chunks once it detects the disconnect ✓
- LLM API: may keep generating up to all 500 tokens ✗
- Billing: you may pay for every output token generated ✗
Whether generation actually halts when the connection drops varies by
provider and SDK version. For cost planning, assume the worst case: the
model runs to completion (or to max_tokens) regardless of client cancellation.
Mitigation: Use max_tokens to cap the maximum possible cost.
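The `max_tokens` cap bounds the worst case even when cancellation can't stop generation. A quick sketch of the resulting cost ceiling — the per-million-token rate is a placeholder, not a current price; check your provider's pricing page:

```javascript
// Worst-case output cost if a cancelled stream runs to max_tokens anyway.
// pricePerMillionOutputTokens is a placeholder — use your provider's rate.
function worstCaseOutputCost(maxTokens, pricePerMillionOutputTokens) {
  return (maxTokens / 1_000_000) * pricePerMillionOutputTokens;
}

// With max_tokens: 500 and an assumed $10 per 1M output tokens,
// an abandoned response costs at most $0.005.
```

Set `max_tokens` in the `chat.completions.create` call to whatever your UI actually needs; an unbounded response is both a cost risk and a UX risk.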
8. Error Handling During Streams
Streaming introduces a class of errors that don't exist with buffered responses: errors that occur after partial content has been delivered.
Error taxonomy for streaming
| Error Type | When | User Impact | Handling Strategy |
|---|---|---|---|
| Pre-stream error | Before first token | No content shown | Show error message, allow retry |
| Mid-stream network error | During streaming | Partial content visible | Show warning, keep partial content, offer retry |
| Mid-stream API error | During streaming | Partial content visible | Append error indicator, offer retry |
| Rate limit (429) | Before or during | May have partial content | Backoff + retry, show wait indicator |
| Content filter | During streaming | Partial content may be shown | Replace or append warning |
| Timeout | No chunks for N seconds | Partial or no content | Show timeout message, offer retry |
Comprehensive error handling component
function useRobustStreaming() {
const [state, setState] = useState({
messages: [],
isStreaming: false,
error: null,
retryCount: 0,
});
const sendMessage = useCallback(async (userMessage, maxRetries = 2) => {
setState(prev => ({
...prev,
isStreaming: true,
error: null,
}));
let attempt = 0;
while (attempt <= maxRetries) {
try {
const response = await fetch('/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ message: userMessage }),
});
if (response.status === 429) {
const retryAfter = parseInt(response.headers.get('retry-after') || '5', 10);
setState(prev => ({
...prev,
error: `Rate limited. Retrying in ${retryAfter}s...`,
}));
await new Promise(r => setTimeout(r, retryAfter * 1000));
attempt++;
continue;
}
if (!response.ok) {
throw new Error(`Server error: ${response.status}`);
}
// Process the stream
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
let lastChunkTime = Date.now(); // last activity timestamp (useful for a "stalled" UI state)
let content = '';
// Set up a chunk timeout (no data for 30s = stale stream)
const chunkTimeoutMs = 30000;
while (true) {
// Read with timeout
const readPromise = reader.read();
// NB: production code should capture and clear this timer after the
// race — otherwise a stale timeout later fires as an unhandled rejection
const timeoutPromise = new Promise((_, reject) =>
setTimeout(() => reject(new Error('Chunk timeout')), chunkTimeoutMs)
);
const { done, value } = await Promise.race([readPromise, timeoutPromise]);
if (done) break;
lastChunkTime = Date.now();
buffer += decoder.decode(value, { stream: true });
const lines = buffer.split('\n\n');
buffer = lines.pop();
for (const line of lines) {
if (!line.startsWith('data: ')) continue;
const raw = line.slice(6);
if (raw === '[DONE]') break;
const data = JSON.parse(raw);
if (data.type === 'error') {
throw new Error(data.message);
}
if (data.content) {
content += data.content;
setState(prev => ({
...prev,
error: null,
// In a real app, merge `content` into the last assistant
// message here so the UI updates on every chunk
}));
}
}
}
// Success — exit retry loop
setState(prev => ({
...prev,
isStreaming: false,
error: null,
retryCount: 0,
}));
return content;
} catch (error) {
attempt++;
if (attempt > maxRetries) {
setState(prev => ({
...prev,
isStreaming: false,
error: `Failed after ${maxRetries + 1} attempts: ${error.message}`,
retryCount: attempt,
}));
return null;
}
// Exponential backoff
const backoff = Math.min(1000 * Math.pow(2, attempt), 10000);
setState(prev => ({
...prev,
error: `Error: ${error.message}. Retrying in ${backoff / 1000}s... (attempt ${attempt + 1}/${maxRetries + 1})`,
}));
await new Promise(r => setTimeout(r, backoff));
}
}
}, []);
return { ...state, sendMessage };
}
Error UI patterns
function StreamingError({ error, onRetry, partialContent }) {
if (!error) return null;
return (
<div className="streaming-error" role="alert">
{/* If we have partial content, show it with an error banner */}
{partialContent && (
<div className="partial-content">
{partialContent}
<div className="truncation-indicator">
[Response interrupted]
</div>
</div>
)}
<div className="error-banner">
<svg className="error-icon" viewBox="0 0 20 20" fill="currentColor">
<path fillRule="evenodd" d="M10 18a8 8 0 100-16 8 8 0 000 16zM8.707 7.293a1 1 0 00-1.414 1.414L8.586 10l-1.293 1.293a1 1 0 101.414 1.414L10 11.414l1.293 1.293a1 1 0 001.414-1.414L11.414 10l1.293-1.293a1 1 0 00-1.414-1.414L10 8.586 8.707 7.293z" />
</svg>
<span className="error-message">{error}</span>
{onRetry && (
<button onClick={onRetry} className="retry-btn">
Retry
</button>
)}
</div>
</div>
);
}
9. Production UX Patterns: Putting It All Together
Pattern 1: ChatGPT-style interface
function ProductionChatUI() {
const {
messages,
isStreaming,
error,
sendMessage,
cancelStream,
} = useStreamingChat();
const [input, setInput] = useState('');
const messagesEndRef = useRef(null);
// Auto-scroll
useEffect(() => {
messagesEndRef.current?.scrollIntoView({ behavior: 'smooth' });
}, [messages]);
return (
<div className="chat-layout">
{/* Message history */}
<div className="messages-area">
{messages.length === 0 && (
<div className="empty-state">
<h2>How can I help you today?</h2>
<div className="suggestion-chips">
<button onClick={() => sendMessage('Explain streaming APIs')}>
Explain streaming APIs
</button>
<button onClick={() => sendMessage('Write a React component')}>
Write a React component
</button>
</div>
</div>
)}
{messages.map((msg, i) => (
<MessageBubble
key={i}
message={msg}
isStreaming={
isStreaming &&
msg.role === 'assistant' &&
i === messages.length - 1
}
/>
))}
{error && (
<StreamingError
error={error}
onRetry={() => {
const lastUserMsg = [...messages].reverse().find(m => m.role === 'user');
if (lastUserMsg) sendMessage(lastUserMsg.content);
}}
/>
)}
<div ref={messagesEndRef} />
</div>
{/* Input area */}
<div className="input-area">
{isStreaming && (
<button onClick={cancelStream} className="stop-btn">
Stop generating
</button>
)}
<form onSubmit={(e) => {
e.preventDefault();
if (input.trim() && !isStreaming) {
sendMessage(input.trim());
setInput('');
}
}}>
<textarea
value={input}
onChange={(e) => setInput(e.target.value)}
onKeyDown={(e) => {
if (e.key === 'Enter' && !e.shiftKey) {
e.preventDefault();
e.target.form.requestSubmit();
}
}}
placeholder="Message..."
disabled={isStreaming}
rows={1}
/>
<button type="submit" disabled={!input.trim() || isStreaming}>
Send
</button>
</form>
</div>
</div>
);
}
Pattern 2: Inline AI assist (like GitHub Copilot or Notion AI)
function InlineAIAssist({ context, onInsert }) {
const [suggestion, setSuggestion] = useState('');
const [isGenerating, setIsGenerating] = useState(false);
const generate = async () => {
setSuggestion('');
setIsGenerating(true);
try {
const response = await fetch('/api/assist', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ context }),
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
let content = '';
while (true) {
const { done, value } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
const lines = buffer.split('\n\n');
buffer = lines.pop();
for (const line of lines) {
if (!line.startsWith('data: ')) continue;
const raw = line.slice(6);
if (raw === '[DONE]') break;
const data = JSON.parse(raw);
if (data.content) {
content += data.content;
setSuggestion(content);
}
}
}
} finally {
setIsGenerating(false);
}
};
return (
<div className="inline-assist">
{!suggestion && !isGenerating && (
<button onClick={generate} className="generate-btn">
AI Assist
</button>
)}
{isGenerating && !suggestion && <ThinkingDots />}
{suggestion && (
<div className="suggestion-preview">
<div className="suggestion-text">
{suggestion}
{isGenerating && <span className="streaming-cursor">|</span>}
</div>
{!isGenerating && (
<div className="suggestion-actions">
<button onClick={() => onInsert(suggestion)}>Insert</button>
<button onClick={() => setSuggestion('')}>Discard</button>
<button onClick={generate}>Regenerate</button>
</div>
)}
</div>
)}
</div>
);
}
10. Measuring and Monitoring UX Metrics
In production, track these metrics to ensure your AI features meet UX standards:
// Metrics collection for AI UX
class AIUXMetrics {
constructor() {
this.metrics = [];
}
trackInteraction(data) {
const metric = {
timestamp: Date.now(),
...data,
};
this.metrics.push(metric);
this.report(metric);
}
report(metric) {
// Send to your analytics service
fetch('/api/analytics', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(metric),
}).catch(() => {}); // Non-blocking
}
}
const aiMetrics = new AIUXMetrics();
// Track in your streaming hook
function useTrackedStreaming() {
const sendMessage = useCallback(async (message) => {
const startTime = performance.now();
let firstTokenTime = null;
let tokenCount = 0;
let wasCancelled = false;
// ... streaming logic ...
// On first token:
firstTokenTime = performance.now();
// On each token:
tokenCount++;
// On complete or cancel:
aiMetrics.trackInteraction({
event: 'ai_response',
ttft: firstTokenTime ? firstTokenTime - startTime : null,
totalTime: performance.now() - startTime,
tokenCount,
wasCancelled,
model: 'gpt-4o',
promptLength: message.length,
});
}, []);
}
Key UX metrics dashboard
| Metric | Target | Alert Threshold |
|---|---|---|
| Time to first token (TTFT) | < 500ms | > 2s |
| Total response time | < 10s for typical queries | > 20s |
| Tokens per second | > 30 tok/s display rate | < 10 tok/s |
| Error rate | < 1% | > 5% |
| Cancellation rate | < 15% | > 30% (indicates frustration) |
| Retry rate | < 5% | > 10% |
| Empty response rate | < 0.5% | > 2% |
| User satisfaction (thumbs up/down) | > 80% positive | < 60% positive |
11. Common UX Anti-patterns in AI Applications
| Anti-pattern | Problem | Fix |
|---|---|---|
| Spinner for 10+ seconds | Users abandon; no indication of progress | Use streaming |
| No cancel button | Users feel trapped waiting | Always provide "Stop generating" |
| Streaming JSON | Partial JSON flashes on screen, looks broken | Buffer JSON, only show when complete |
| No error recovery | One error = dead end | Show error + retry button + keep partial content |
| Identical loading for all tasks | Short tasks feel artificially slow | Adjust loading UI based on expected duration |
| Flash of skeleton then instant content | Skeleton shown for < 200ms, feels glitchy | Only show skeleton if TTFT > 300ms |
| Auto-scroll while user is reading | Annoying when user scrolls up to re-read | Only auto-scroll if user is near the bottom |
| No feedback on submit | User double-clicks, sends duplicate | Disable input immediately, show visual confirmation |
| Streaming text reflows layout | Page jumps as content grows | Reserve space, use fixed-width containers |
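The auto-scroll fix from the table reduces to a small guard: only follow the stream if the user is already at (or near) the bottom. The 100px threshold below is an arbitrary assumption — tune it to your layout:

```javascript
// True if the user hasn't scrolled up to re-read earlier content.
// Works on any element with scrollHeight/scrollTop/clientHeight.
function isNearBottom(el, thresholdPx = 100) {
  return el.scrollHeight - el.scrollTop - el.clientHeight <= thresholdPx;
}

// Call this after appending each streamed chunk.
function autoScrollIfFollowing(el) {
  if (isNearBottom(el)) {
    el.scrollTop = el.scrollHeight; // keep pinned to the newest tokens
  }
}
```

Because the check runs before each scroll, the moment the user drags the scrollbar up, auto-scroll silently disengages — and re-engages as soon as they return to the bottom.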
12. Key Takeaways
- Perceived latency > actual latency — streaming doesn't make the model faster, but it makes the experience feel 10x faster.
- Time-to-first-token (TTFT) is the most important metric — optimize prompt length, model choice, and network latency to minimize it.
- Use three-phase loading: immediate feedback (thinking dots) -> streaming text (with cursor) -> complete response (with markdown rendering).
- Don't stream everything — JSON, short responses, binary decisions, and pipeline-internal calls should use buffered responses.
- Always provide cancellation — both UX ("Stop generating" button) and technical (AbortController).
- Handle mid-stream errors gracefully — keep partial content visible, show an error banner, and offer retry.
- Monitor UX metrics in production — TTFT, cancellation rate, and error rate reveal real user experience quality.
Explain-It Challenge
- A product manager asks "why do we need streaming? Can't we just make the model faster?" Explain the difference between actual and perceived latency with a concrete example.
- Your AI feature works great for short questions but users complain about long RAG-powered responses feeling slow. What is the root cause and how would you improve the UX?
- A developer wants to stream a JSON response to show a progress bar as fields fill in. Explain why this is problematic and suggest a better approach.
Navigation: <- 4.8.b — Progressive Rendering | 4.8 Overview ->