Episode 4 — Generative AI Engineering / 4.8 — Streaming Responses

4.8.c — Improving UX in AI Applications

In one sentence: Great AI UX goes beyond just streaming tokens — it requires optimizing perceived latency through time-to-first-token, intelligent loading states, skeleton screens, streaming indicators, knowing when not to stream, graceful cancellation, and robust error handling to make AI features feel instant and reliable.

Navigation: <- 4.8.b — Progressive Rendering | 4.8 Overview ->


1. Perceived Latency vs Actual Latency

The most important insight in AI UX: users don't measure time — they measure how time feels. Two API calls that both take 8 seconds can feel completely different depending on what the user sees during those 8 seconds.

ACTUAL LATENCY:  Both scenarios take 8 seconds total

SCENARIO A — High perceived latency:
  [Click] ──── [Spinner for 8s] ──── [Full text appears]
  User feeling: "This is slow. Is it broken?"

SCENARIO B — Low perceived latency:
  [Click] ── [Skeleton 0.1s] ── [First token 0.3s] ── [Text streaming 7.7s]
  User feeling: "Wow, that was fast!"

The difference: ZERO seconds. Same total time. Different UX.

Research on perceived latency

Time           User Perception
< 100ms        Feels instant
100ms - 300ms  Noticeable but not annoying
300ms - 1s     User notices the delay
1s - 3s        User loses focus, needs feedback
3s - 10s       User patience erodes quickly
> 10s          User likely abandons the task

LLM responses typically take 3-15 seconds for the full response. Without streaming, every AI feature falls squarely in the "abandon" zone. With streaming, the first token arrives in 200-500ms — well within the "noticeable but not annoying" zone.
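The perception bands above can be encoded as a small helper for logging or alerting on measured latencies. A sketch (the band labels are our own shorthand, not a standard taxonomy):

```javascript
// Perception bands from the table above, encoded for logging/alerting.
function classifyLatency(ms) {
  if (ms < 100) return 'instant';
  if (ms < 300) return 'noticeable but not annoying';
  if (ms < 1000) return 'noticeable delay';
  if (ms < 3000) return 'needs feedback';
  if (ms < 10000) return 'patience eroding';
  return 'likely abandoned';
}

console.log(classifyLatency(250));  // 'noticeable but not annoying' (streamed first token)
console.log(classifyLatency(8000)); // 'patience eroding' (buffered full response)
```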


2. Time-to-First-Token (TTFT)

Time-to-first-token is the single most important metric for AI application UX. It measures the time from when the user submits their request to when the first token appears on screen.

What affects TTFT

TTFT = Network latency (client → server)
     + Server processing time
     + API call setup time
     + Model prefill time (processing the input prompt)
     + Network latency (server → client)
     + Client rendering time

Typical breakdown:
  Network (round trip):     50-200ms
  Server processing:        10-50ms
  API setup:                20-100ms
  Model prefill:            100-2000ms  ← This dominates for long prompts
  Client rendering:         5-20ms
  ─────────────────────────────────────
  Total TTFT:               200ms - 2.5s

How prompt length affects TTFT

The model must process (prefill) all input tokens before generating the first output token. Longer prompts = slower TTFT:

Input Size                         Approximate TTFT
100 tokens (short question)        200-400ms
1,000 tokens (with system prompt)  300-600ms
5,000 tokens (with RAG context)    500-1,200ms
20,000 tokens (large document)     1,000-2,500ms
100,000 tokens (max context)       3,000-8,000ms
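As a rough mental model, TTFT behaves like a fixed overhead plus a linear prefill term. The sketch below encodes that; `overheadMs` and `prefillTokensPerSec` are illustrative assumptions, not measured values, so calibrate them against your own TTFT logs:

```javascript
// Rough model of the prefill effect: TTFT ~= fixed overhead + input tokens
// processed at a prefill rate. Both defaults are illustrative assumptions.
function estimateTTFT(inputTokens, { overheadMs = 150, prefillTokensPerSec = 15000 } = {}) {
  const prefillMs = (inputTokens / prefillTokensPerSec) * 1000;
  return Math.round(overheadMs + prefillMs);
}

console.log(`1k-token prompt:  ~${estimateTTFT(1000)}ms`);
console.log(`20k-token prompt: ~${estimateTTFT(20000)}ms`);
```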

Measuring TTFT

import OpenAI from 'openai';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function measureTTFT(messages) {
  const startTime = performance.now();
  let firstTokenTime = null;

  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages,
    stream: true,
  });

  for await (const chunk of stream) {
    if (!firstTokenTime && chunk.choices[0]?.delta?.content) {
      firstTokenTime = performance.now();
      const ttft = firstTokenTime - startTime;
      console.log(`TTFT: ${ttft.toFixed(0)}ms`);
    }
  }

  const totalTime = performance.now() - startTime;
  console.log(`Total time: ${totalTime.toFixed(0)}ms`);

  return {
    ttft: firstTokenTime ? firstTokenTime - startTime : null,
    totalTime,
  };
}

Optimizing TTFT

Strategy                Impact  How
Shorter system prompts  High    Every token in the prompt adds prefill time
Fewer RAG chunks        High    Retrieve only the top 3-5 most relevant chunks
Smaller/faster models   High    gpt-4o-mini has ~50% faster TTFT than gpt-4o
Edge deployment         Medium  Reduce network round-trip time
Connection keep-alive   Medium  Reuse HTTP connections to avoid TCP handshake
Prompt caching          High    Anthropic/OpenAI cache repeated prompt prefixes
Speculative decoding    Future  New model architectures for faster first tokens

3. Loading States: The First 200ms

The gap between the user clicking "Send" and the first token arriving is critical. What happens in this gap determines whether the app feels "fast" or "broken."

Three-phase loading strategy

function StreamingMessage({ isWaiting, isStreaming, content }) {
  // Phase 1: Immediate feedback (0-50ms after click)
  // Show that the request was received
  if (isWaiting && !content) {
    return (
      <div className="message assistant">
        <div className="thinking-indicator">
          <ThinkingDots />
          <span className="thinking-label">Thinking...</span>
        </div>
      </div>
    );
  }

  // Phase 2: First token arrived, streaming in progress
  if (isStreaming && content) {
    return (
      <div className="message assistant">
        <div className="streaming-content">
          {content}
          <span className="streaming-cursor">|</span>
        </div>
      </div>
    );
  }

  // Phase 3: Stream complete
  return (
    <div className="message assistant">
      <div className="complete-content">
        <ReactMarkdown>{content}</ReactMarkdown>
      </div>
    </div>
  );
}

Animated thinking indicator

function ThinkingDots() {
  return (
    <div className="thinking-dots" aria-label="AI is thinking">
      <div className="dot" />
      <div className="dot" />
      <div className="dot" />
    </div>
  );
}

.thinking-dots {
  display: flex;
  gap: 4px;
  padding: 8px 0;
}

.thinking-dots .dot {
  width: 8px;
  height: 8px;
  background: #9ca3af;
  border-radius: 50%;
  animation: thinking-bounce 1.4s ease-in-out infinite;
}

.thinking-dots .dot:nth-child(1) { animation-delay: 0s; }
.thinking-dots .dot:nth-child(2) { animation-delay: 0.2s; }
.thinking-dots .dot:nth-child(3) { animation-delay: 0.4s; }

@keyframes thinking-bounce {
  0%, 80%, 100% { transform: translateY(0); }
  40% { transform: translateY(-6px); }
}

4. Skeleton Screens for AI Content

Skeleton screens show a placeholder of the expected content shape before it loads. For AI content, this means showing a rough outline of where the response will appear:

function AIResponseSkeleton({ expectedLength = 'medium' }) {
  const lineWidths = {
    short: ['75%', '45%'],
    medium: ['90%', '100%', '80%', '60%'],
    long: ['100%', '95%', '88%', '100%', '72%', '90%', '45%'],
  };

  const widths = lineWidths[expectedLength] || lineWidths.medium;

  return (
    <div className="skeleton-container" aria-busy="true" aria-label="Loading response">
      {widths.map((width, i) => (
        <div
          key={i}
          className="skeleton-line"
          style={{ width }}
        />
      ))}
    </div>
  );
}

.skeleton-container {
  padding: 16px;
  display: flex;
  flex-direction: column;
  gap: 10px;
}

.skeleton-line {
  height: 14px;
  background: linear-gradient(
    90deg,
    #e5e7eb 25%,
    #f3f4f6 50%,
    #e5e7eb 75%
  );
  background-size: 200% 100%;
  animation: shimmer 1.5s ease-in-out infinite;
  border-radius: 4px;
}

@keyframes shimmer {
  0% { background-position: 200% 0; }
  100% { background-position: -200% 0; }
}

Transitioning from skeleton to streamed content

function SmartLoadingMessage({ isWaiting, isStreaming, content }) {
  const [showSkeleton, setShowSkeleton] = useState(true);

  useEffect(() => {
    // Fade out skeleton when first content arrives
    if (content && content.length > 0) {
      // Small delay for smooth transition
      const timer = setTimeout(() => setShowSkeleton(false), 100);
      return () => clearTimeout(timer);
    }
    if (!isWaiting && !isStreaming) {
      setShowSkeleton(true); // Reset for next message
    }
  }, [content, isWaiting, isStreaming]);

  if (isWaiting && showSkeleton) {
    return <AIResponseSkeleton />;
  }

  return (
    <div className={`message-content ${isStreaming ? 'streaming' : 'complete'}`}>
      {content}
      {isStreaming && <span className="streaming-cursor">|</span>}
    </div>
  );
}

5. Streaming Indicators and Progress Feedback

Users need continuous feedback that the system is working. Beyond the blinking cursor, consider these patterns:

Token count / word count indicator

function StreamingStatus({ content, isStreaming, startTime }) {
  if (!isStreaming) return null;

  const wordCount = content.split(/\s+/).filter(Boolean).length;
  const elapsed = ((Date.now() - startTime) / 1000).toFixed(1);

  return (
    <div className="streaming-status">
      <span className="pulse-indicator" />
      <span>{wordCount} words</span>
      <span className="separator">|</span>
      <span>{elapsed}s</span>
    </div>
  );
}

"Stop generating" button (like ChatGPT)

function StopButton({ isStreaming, onStop }) {
  if (!isStreaming) return null;

  return (
    <button
      onClick={onStop}
      className="stop-generating-btn"
      aria-label="Stop generating response"
    >
      <svg width="16" height="16" viewBox="0 0 16 16" fill="currentColor">
        <rect x="3" y="3" width="10" height="10" rx="1" />
      </svg>
      Stop generating
    </button>
  );
}

Progress estimation

For known-length tasks (summarize to N sentences, answer in M points), you can estimate progress:

function StreamingProgress({ content, expectedItems, isStreaming }) {
  if (!isStreaming) return null;

  // Count completed items (e.g., numbered list items)
  const completedItems = (content.match(/^\d+\./gm) || []).length;
  const progress = Math.min(completedItems / expectedItems, 1);

  return (
    <div className="progress-container">
      <div className="progress-bar">
        <div
          className="progress-fill"
          style={{ width: `${progress * 100}%` }}
        />
      </div>
      <span className="progress-label">
        {completedItems} of {expectedItems} points
      </span>
    </div>
  );
}

6. When NOT to Stream

Streaming is not always the right choice. Sometimes it adds complexity without UX benefit:

Cases where buffered responses are better

Scenario                                   Why Not Stream
Short responses (< 50 tokens)              Response completes in < 1 second anyway; streaming overhead adds latency
JSON/structured output                     Partial JSON is invalid and unparseable; you need the complete object
Binary decisions (yes/no, approve/reject)  A single word doesn't benefit from streaming
Batch processing                           Backend pipelines process results programmatically, not for display
Function/tool calling                      Tool calls must be complete before execution; streaming partial function args is unusable
Validation-required output                 If you must validate before showing (e.g., content moderation), streaming bypasses the check
Multi-step chains                          Intermediate LLM calls in a chain are not user-facing

Decision framework

function shouldStream(config) {
  const {
    expectedTokens,     // Estimated response length
    isUserFacing,       // Will the output be displayed to a user?
    needsValidation,    // Must the output be validated before showing?
    outputFormat,       // 'text', 'json', 'binary_decision'
    isPartOfChain,      // Is this an intermediate step in a pipeline?
  } = config;

  // Don't stream if:
  if (!isUserFacing) return false;          // Background processing
  if (needsValidation) return false;        // Must validate first
  if (outputFormat === 'json') return false; // Need complete JSON
  if (outputFormat === 'binary_decision') return false; // Too short
  if (isPartOfChain) return false;          // Intermediate step
  if (expectedTokens < 50) return false;    // Too short to benefit

  // Do stream if:
  return true;
}

Hybrid approach: stream text but collect structured data

async function hybridStream(messages) {
  // For responses that contain both text and structured data:
  // Stream the text portion for UX, then extract structured data at the end

  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages,
    stream: true,
  });

  let fullContent = '';

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) {
      fullContent += content;
      // Display streamed text in UI
      updateUI(content);
    }
  }

  // After stream completes, extract any structured data from the full response
  const structuredData = parseStructuredData(fullContent);
  return { text: fullContent, data: structuredData };
}

7. Cancellation Handling

Users should always be able to stop a streaming response. This requires coordination between the frontend and backend:

Frontend cancellation with AbortController

function useCancellableStream() {
  const [isStreaming, setIsStreaming] = useState(false);
  const controllerRef = useRef(null);

  const startStream = useCallback(async (messages, onToken) => {
    // Cancel any existing stream
    controllerRef.current?.abort();

    // Create new controller
    controllerRef.current = new AbortController();
    setIsStreaming(true);

    try {
      const response = await fetch('/api/chat', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ messages }),
        signal: controllerRef.current.signal,
      });

      const reader = response.body.getReader();
      const decoder = new TextDecoder();
      let buffer = '';

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split('\n\n');
        buffer = lines.pop();

        for (const line of lines) {
          if (!line.startsWith('data: ')) continue;
          const raw = line.slice(6);
          if (raw === '[DONE]') return;

          const data = JSON.parse(raw);
          if (data.content) onToken(data.content);
        }
      }
    } catch (error) {
      if (error.name === 'AbortError') {
        console.log('Stream cancelled by user');
        // Partial response is kept — user can see what was generated
      } else {
        throw error;
      }
    } finally {
      setIsStreaming(false);
    }
  }, []);

  const cancelStream = useCallback(() => {
    controllerRef.current?.abort();
    setIsStreaming(false);
  }, []);

  return { startStream, cancelStream, isStreaming };
}

Backend: detecting client disconnect

// Express.js route for handling client disconnects
// (assumes `app` is an Express app and `openai` an initialized OpenAI client)
app.post('/api/chat', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');

  let clientDisconnected = false;
  req.on('close', () => {
    clientDisconnected = true;
  });

  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: req.body.messages,
    stream: true,
  });

  for await (const chunk of stream) {
    if (clientDisconnected) {
      // The client is gone: stop forwarding, and abort the upstream request.
      // The openai-node SDK exposes the underlying AbortController; without
      // the abort, the model may keep generating and those tokens still cost.
      stream.controller.abort();
      break;
    }

    const content = chunk.choices[0]?.delta?.content;
    if (content) {
      res.write(`data: ${JSON.stringify({ content })}\n\n`);
    }
  }

  if (!clientDisconnected) {
    res.write('data: [DONE]\n\n');
    res.end();
  }
});

The hidden cost of cancellation

IMPORTANT: Cancelling on the client side does NOT, by itself, stop the model from generating.

User clicks "Stop" at token 50 of a 500-token response:
  - Client: stops reading the stream ✓
  - Server: stops forwarding chunks to the client ✓
  - LLM API: keeps generating unless the cancellation is propagated upstream ✗
  - Billing: tokens generated before the abort takes effect are still charged ✗

Propagate cancellation all the way to the provider (abort the upstream API
request, not just the client fetch), and check your provider's documentation:
billing behavior for a cancelled stream is provider-dependent.

Mitigation: Use max_tokens to cap the maximum possible cost.
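To make the mitigation concrete: the worst-case spend of any single request is bounded by the cap. A minimal sketch (the $10/1M figure is a placeholder price, not any model's real rate):

```javascript
// Worst-case output spend for one request: whatever happens to a cancelled
// stream, billing cannot exceed max_tokens. Price argument is a placeholder.
function worstCaseOutputCost(maxTokens, pricePerMillionOutputTokens) {
  return (maxTokens / 1_000_000) * pricePerMillionOutputTokens;
}

const cap = worstCaseOutputCost(1024, 10); // 1,024-token cap at $10 / 1M tokens
console.log(`Worst-case output cost: $${cap.toFixed(4)}`);
```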

8. Error Handling During Streams

Streaming introduces a class of errors that don't exist with buffered responses: errors that occur after partial content has been delivered.

Error taxonomy for streaming

Error Type                When                     User Impact                   Handling Strategy
Pre-stream error          Before first token       No content shown              Show error message, allow retry
Mid-stream network error  During streaming         Partial content visible       Show warning, keep partial content, offer retry
Mid-stream API error      During streaming         Partial content visible       Append error indicator, offer retry
Rate limit (429)          Before or during         May have partial content      Backoff + retry, show wait indicator
Content filter            During streaming         Partial content may be shown  Replace or append warning
Timeout                   No chunks for N seconds  Partial or no content         Show timeout message, offer retry

Comprehensive error handling component

function useRobustStreaming() {
  const [state, setState] = useState({
    messages: [],
    isStreaming: false,
    error: null,
    retryCount: 0,
  });

  const sendMessage = useCallback(async (userMessage, maxRetries = 2) => {
    setState(prev => ({
      ...prev,
      isStreaming: true,
      error: null,
    }));

    let attempt = 0;

    while (attempt <= maxRetries) {
      try {
        const response = await fetch('/api/chat', {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({ message: userMessage }),
        });

        if (response.status === 429) {
          const retryAfter = parseInt(response.headers.get('retry-after') || '5', 10);
          setState(prev => ({
            ...prev,
            error: `Rate limited. Retrying in ${retryAfter}s...`,
          }));
          await new Promise(r => setTimeout(r, retryAfter * 1000));
          attempt++;
          continue;
        }

        if (!response.ok) {
          throw new Error(`Server error: ${response.status}`);
        }

        // Process the stream
        const reader = response.body.getReader();
        const decoder = new TextDecoder();
        let buffer = '';
        let lastChunkTime = Date.now();
        let content = '';

        // Set up a chunk timeout (no data for 30s = stale stream)
        const chunkTimeoutMs = 30000;

        while (true) {
          // Read with a per-chunk timeout; clear the timer once data arrives
          // so cancelled timers don't accumulate across iterations
          let timeoutId;
          const timeoutPromise = new Promise((_, reject) => {
            timeoutId = setTimeout(
              () => reject(new Error('Chunk timeout')),
              chunkTimeoutMs
            );
          });

          const { done, value } = await Promise.race([reader.read(), timeoutPromise]);
          clearTimeout(timeoutId);
          if (done) break;

          lastChunkTime = Date.now();
          buffer += decoder.decode(value, { stream: true });
          const lines = buffer.split('\n\n');
          buffer = lines.pop();

          for (const line of lines) {
            if (!line.startsWith('data: ')) continue;
            const raw = line.slice(6);
            if (raw === '[DONE]') break;

            const data = JSON.parse(raw);

            if (data.type === 'error') {
              throw new Error(data.message);
            }

            if (data.content) {
              content += data.content;
              setState(prev => ({
                ...prev,
                error: null,
                // Update the assistant message in place
              }));
            }
          }
        }

        // Success — exit retry loop
        setState(prev => ({
          ...prev,
          isStreaming: false,
          error: null,
          retryCount: 0,
        }));
        return content;

      } catch (error) {
        attempt++;

        if (attempt > maxRetries) {
          setState(prev => ({
            ...prev,
            isStreaming: false,
            error: `Failed after ${maxRetries + 1} attempts: ${error.message}`,
            retryCount: attempt,
          }));
          return null;
        }

        // Exponential backoff
        const backoff = Math.min(1000 * Math.pow(2, attempt), 10000);
        setState(prev => ({
          ...prev,
          error: `Error: ${error.message}. Retrying in ${backoff / 1000}s... (attempt ${attempt + 1}/${maxRetries + 1})`,
        }));
        await new Promise(r => setTimeout(r, backoff));
      }
    }
  }, []);

  return { ...state, sendMessage };
}

Error UI patterns

function StreamingError({ error, onRetry, partialContent }) {
  if (!error) return null;

  return (
    <div className="streaming-error" role="alert">
      {/* If we have partial content, show it with an error banner */}
      {partialContent && (
        <div className="partial-content">
          {partialContent}
          <div className="truncation-indicator">
            [Response interrupted]
          </div>
        </div>
      )}

      <div className="error-banner">
        <svg className="error-icon" viewBox="0 0 20 20" fill="currentColor">
          <path fillRule="evenodd" d="M10 18a8 8 0 100-16 8 8 0 000 16zM8.707 7.293a1 1 0 00-1.414 1.414L8.586 10l-1.293 1.293a1 1 0 101.414 1.414L10 11.414l1.293 1.293a1 1 0 001.414-1.414L11.414 10l1.293-1.293a1 1 0 00-1.414-1.414L10 8.586 8.707 7.293z" />
        </svg>
        <span className="error-message">{error}</span>
        {onRetry && (
          <button onClick={onRetry} className="retry-btn">
            Retry
          </button>
        )}
      </div>
    </div>
  );
}

9. Production UX Patterns: Putting It All Together

Pattern 1: ChatGPT-style interface

function ProductionChatUI() {
  const {
    messages,
    isStreaming,
    error,
    sendMessage,
    cancelStream,
  } = useStreamingChat();

  const [input, setInput] = useState('');
  const messagesEndRef = useRef(null);

  // Auto-scroll
  useEffect(() => {
    messagesEndRef.current?.scrollIntoView({ behavior: 'smooth' });
  }, [messages]);

  return (
    <div className="chat-layout">
      {/* Message history */}
      <div className="messages-area">
        {messages.length === 0 && (
          <div className="empty-state">
            <h2>How can I help you today?</h2>
            <div className="suggestion-chips">
              <button onClick={() => sendMessage('Explain streaming APIs')}>
                Explain streaming APIs
              </button>
              <button onClick={() => sendMessage('Write a React component')}>
                Write a React component
              </button>
            </div>
          </div>
        )}

        {messages.map((msg, i) => (
          <MessageBubble
            key={i}
            message={msg}
            isStreaming={
              isStreaming &&
              msg.role === 'assistant' &&
              i === messages.length - 1
            }
          />
        ))}

        {error && (
          <StreamingError
            error={error}
            onRetry={() => {
              const lastUserMsg = [...messages].reverse().find(m => m.role === 'user');
              if (lastUserMsg) sendMessage(lastUserMsg.content);
            }}
          />
        )}

        <div ref={messagesEndRef} />
      </div>

      {/* Input area */}
      <div className="input-area">
        {isStreaming && (
          <button onClick={cancelStream} className="stop-btn">
            Stop generating
          </button>
        )}

        <form onSubmit={(e) => {
          e.preventDefault();
          if (input.trim() && !isStreaming) {
            sendMessage(input.trim());
            setInput('');
          }
        }}>
          <textarea
            value={input}
            onChange={(e) => setInput(e.target.value)}
            onKeyDown={(e) => {
              if (e.key === 'Enter' && !e.shiftKey) {
                e.preventDefault();
                e.target.form.requestSubmit();
              }
            }}
            placeholder="Message..."
            disabled={isStreaming}
            rows={1}
          />
          <button type="submit" disabled={!input.trim() || isStreaming}>
            Send
          </button>
        </form>
      </div>
    </div>
  );
}

Pattern 2: Inline AI assist (like GitHub Copilot or Notion AI)

function InlineAIAssist({ context, onInsert }) {
  const [suggestion, setSuggestion] = useState('');
  const [isGenerating, setIsGenerating] = useState(false);

  const generate = async () => {
    setSuggestion('');
    setIsGenerating(true);

    try {
      const response = await fetch('/api/assist', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ context }),
      });

      const reader = response.body.getReader();
      const decoder = new TextDecoder();
      let buffer = '';
      let content = '';

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split('\n\n');
        buffer = lines.pop();

        for (const line of lines) {
          if (!line.startsWith('data: ')) continue;
          const raw = line.slice(6);
          if (raw === '[DONE]') break;
          const data = JSON.parse(raw);
          if (data.content) {
            content += data.content;
            setSuggestion(content);
          }
        }
      }
    } finally {
      setIsGenerating(false);
    }
  };

  return (
    <div className="inline-assist">
      {!suggestion && !isGenerating && (
        <button onClick={generate} className="generate-btn">
          AI Assist
        </button>
      )}

      {isGenerating && !suggestion && <ThinkingDots />}

      {suggestion && (
        <div className="suggestion-preview">
          <div className="suggestion-text">
            {suggestion}
            {isGenerating && <span className="streaming-cursor">|</span>}
          </div>
          {!isGenerating && (
            <div className="suggestion-actions">
              <button onClick={() => onInsert(suggestion)}>Insert</button>
              <button onClick={() => setSuggestion('')}>Discard</button>
              <button onClick={generate}>Regenerate</button>
            </div>
          )}
        </div>
      )}
    </div>
  );
}

10. Measuring and Monitoring UX Metrics

In production, track these metrics to ensure your AI features meet UX standards:

// Metrics collection for AI UX
class AIUXMetrics {
  constructor() {
    this.metrics = [];
  }

  trackInteraction(data) {
    const metric = {
      timestamp: Date.now(),
      ...data,
    };
    this.metrics.push(metric);
    this.report(metric);
  }

  report(metric) {
    // Send to your analytics service
    fetch('/api/analytics', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(metric),
    }).catch(() => {}); // Non-blocking
  }
}

const aiMetrics = new AIUXMetrics();

// Track in your streaming hook
function useTrackedStreaming() {
  const sendMessage = useCallback(async (message) => {
    const startTime = performance.now();
    let firstTokenTime = null;
    let tokenCount = 0;
    let wasCancelled = false;

    // ... streaming logic ...
    // On first token:
    firstTokenTime = performance.now();

    // On each token:
    tokenCount++;

    // On complete or cancel:
    aiMetrics.trackInteraction({
      event: 'ai_response',
      ttft: firstTokenTime ? firstTokenTime - startTime : null,
      totalTime: performance.now() - startTime,
      tokenCount,
      wasCancelled,
      model: 'gpt-4o',
      promptLength: message.length,
    });
  }, []);

  return { sendMessage };
}

Key UX metrics dashboard

Metric                              Target                     Alert Threshold
Time to first token (TTFT)          < 500ms                    > 2s
Total response time                 < 10s for typical queries  > 20s
Tokens per second                   > 30 tok/s display rate    < 10 tok/s
Error rate                          < 1%                       > 5%
Cancellation rate                   < 15%                      > 30% (indicates frustration)
Retry rate                          < 5%                       > 10%
Empty response rate                 < 0.5%                     > 2%
User satisfaction (thumbs up/down)  > 80% positive             < 60% positive
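A sketch of how the dashboard could drive alerting. The threshold values come from the table above; the config shape and the three status levels are our own invention:

```javascript
// Thresholds from the dashboard table; 'ok' / 'warn' / 'alert' levels and
// the config shape are inventions of this sketch, not a standard.
const thresholds = {
  ttftMs:       { target: 500,  alert: 2000, higherIsWorse: true },
  errorRate:    { target: 0.01, alert: 0.05, higherIsWorse: true },
  tokensPerSec: { target: 30,   alert: 10,   higherIsWorse: false },
};

function evaluateMetric(name, value) {
  const t = thresholds[name];
  const breached = t.higherIsWorse ? value > t.alert : value < t.alert;
  const onTarget = t.higherIsWorse ? value <= t.target : value >= t.target;
  return breached ? 'alert' : onTarget ? 'ok' : 'warn';
}

console.log(evaluateMetric('ttftMs', 350));     // 'ok': meets target
console.log(evaluateMetric('ttftMs', 1200));    // 'warn': between target and alert
console.log(evaluateMetric('tokensPerSec', 8)); // 'alert': below the display-rate floor
```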

11. Common UX Anti-patterns in AI Applications

Anti-pattern                            Problem                                       Fix
Spinner for 10+ seconds                 Users abandon; no indication of progress      Use streaming
No cancel button                        Users feel trapped waiting                    Always provide "Stop generating"
Streaming JSON                          Partial JSON flashes on screen, looks broken  Buffer JSON, only show when complete
No error recovery                       One error = dead end                          Show error + retry button + keep partial content
Identical loading for all tasks         Short tasks feel artificially slow            Adjust loading UI based on expected duration
Flash of skeleton then instant content  Skeleton shown for < 200ms, feels glitchy     Delay the skeleton; show it only after ~300ms with no first token
Auto-scroll while user is reading       Annoying when user scrolls up to re-read      Only auto-scroll if user is near the bottom
No feedback on submit                   User double-clicks, sends duplicate           Disable input immediately, show visual confirmation
Streaming text reflows layout           Page jumps as content grows                   Reserve space, use fixed-width containers
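The auto-scroll fix from the table reduces to one pure check: scroll only when the user is already near the bottom. A sketch to wire into your scroll container's onScroll handler (`threshold` is an arbitrary choice):

```javascript
// "Near bottom" check for streaming auto-scroll: skip the scroll when the
// user has scrolled up to re-read earlier content.
function isNearBottom(scrollTop, clientHeight, scrollHeight, threshold = 80) {
  return scrollHeight - (scrollTop + clientHeight) <= threshold;
}

// In a React scroll handler (hypothetical refs):
//   const el = messagesAreaRef.current;
//   if (isNearBottom(el.scrollTop, el.clientHeight, el.scrollHeight)) {
//     messagesEndRef.current?.scrollIntoView({ behavior: 'smooth' });
//   }

console.log(isNearBottom(900, 600, 1500)); // true:  pinned to the bottom
console.log(isNearBottom(0, 600, 1500));   // false: user scrolled up
```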

12. Key Takeaways

  1. Perceived latency > actual latency — streaming doesn't make the model faster, but it makes the experience feel 10x faster.
  2. Time-to-first-token (TTFT) is the most important metric — optimize prompt length, model choice, and network latency to minimize it.
  3. Use three-phase loading: immediate feedback (thinking dots) -> streaming text (with cursor) -> complete response (with markdown rendering).
  4. Don't stream everything — JSON, short responses, binary decisions, and pipeline-internal calls should use buffered responses.
  5. Always provide cancellation — both UX ("Stop generating" button) and technical (AbortController).
  6. Handle mid-stream errors gracefully — keep partial content visible, show an error banner, and offer retry.
  7. Monitor UX metrics in production — TTFT, cancellation rate, and error rate reveal real user experience quality.

Explain-It Challenge

  1. A product manager asks "why do we need streaming? Can't we just make the model faster?" Explain the difference between actual and perceived latency with a concrete example.
  2. Your AI feature works great for short questions but users complain about long RAG-powered responses feeling slow. What is the root cause and how would you improve the UX?
  3. A developer wants to stream a JSON response to show a progress bar as fields fill in. Explain why this is problematic and suggest a better approach.

Navigation: <- 4.8.b — Progressive Rendering | 4.8 Overview ->