Episode 4 — Generative AI Engineering / 4.19 — Multi-Agent Architecture Concerns

4.19.a — Increased Latency

In one sentence: Every agent in a multi-agent pipeline adds its own LLM call time, turning a sub-second single-call experience into a multi-second (or multi-minute) pipeline -- and understanding where that time goes is the first step to making it acceptable.

Navigation: <- 4.19 Overview | 4.19.b -- Higher Operational Cost ->


1. Why Multi-Agent Systems Are Slow

A single LLM call typically takes 200ms to 2 seconds depending on the model, input size, and output length. In a multi-agent system, you are making multiple LLM calls -- and if they are sequential, the latency adds up.

SINGLE CALL:
  User -> [LLM Call: 800ms] -> Response
  Total: 800ms

3-AGENT SEQUENTIAL PIPELINE:
  User -> [Agent A: 800ms] -> [Agent B: 1200ms] -> [Agent C: 600ms] -> Response
  Total: 800 + 1200 + 600 = 2,600ms (2.6 seconds)

5-AGENT SEQUENTIAL PIPELINE:
  User -> [A: 800ms] -> [B: 1200ms] -> [C: 600ms] -> [D: 900ms] -> [E: 700ms] -> Response
  Total: 800 + 1200 + 600 + 900 + 700 = 4,200ms (4.2 seconds)

Each agent in the pipeline typically needs to:

  1. Prepare its prompt (assemble context, format input from previous agent)
  2. Send the request to the LLM API (network round-trip)
  3. Wait for the model to generate the full response (compute time)
  4. Parse the response (extract structured output, validate)
  5. Pass results to the next agent

Steps 2 and 3 dominate the total time. The model's generation time scales linearly with output token count -- a 500-token response takes roughly 5x longer to generate than a 100-token response.
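That scaling can be sketched with a back-of-envelope estimator. The throughput and overhead constants below are invented for illustration, not vendor figures; measure your own model's tokens/second before relying on them.

```javascript
// Rough model of where a call's time goes: a fixed network round-trip
// (step 2) plus generation time proportional to output length (step 3).
function estimateCallMs({ outputTokens, tokensPerSecond = 100, networkOverheadMs = 150 }) {
  const generationMs = (outputTokens / tokensPerSecond) * 1000; // step 3 dominates
  return Math.round(networkOverheadMs + generationMs);          // step 2 is roughly fixed
}

estimateCallMs({ outputTokens: 100 }); // -> 1150 (ms)
estimateCallMs({ outputTokens: 500 }); // -> 5150 (ms) -- generation scales ~linearly
```

The fixed overhead is why the ratio is "roughly" 5x rather than exactly: the shorter the output, the larger the share taken by the round-trip.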


2. Measuring Latency Across the Pipeline

You cannot optimize what you do not measure. Every production multi-agent system needs per-agent timing.

// Utility: wrap any async function with timing
async function withTiming(label, fn) {
  const start = performance.now();
  const result = await fn();
  const elapsed = performance.now() - start;
  console.log(`[${label}] ${elapsed.toFixed(0)}ms`);
  return { result, elapsed };
}

// Example: 3-agent pipeline with timing
async function runPipeline(userInput) {
  const timings = {};
  const pipelineStart = performance.now();

  // Agent 1: Classify intent
  const { result: classification, elapsed: t1 } = await withTiming(
    'Agent 1: Classify',
    () => classifyIntent(userInput)
  );
  timings.classify = t1;

  // Agent 2: Research
  const { result: research, elapsed: t2 } = await withTiming(
    'Agent 2: Research',
    () => conductResearch(classification)
  );
  timings.research = t2;

  // Agent 3: Generate response
  const { result: response, elapsed: t3 } = await withTiming(
    'Agent 3: Respond',
    () => generateResponse(research)
  );
  timings.respond = t3;

  const totalElapsed = performance.now() - pipelineStart;
  timings.total = totalElapsed;
  timings.overhead = totalElapsed - t1 - t2 - t3; // non-LLM time

  console.log('\n--- Pipeline Timing Summary ---');
  console.log(`  Classify:  ${timings.classify.toFixed(0)}ms`);
  console.log(`  Research:  ${timings.research.toFixed(0)}ms`);
  console.log(`  Respond:   ${timings.respond.toFixed(0)}ms`);
  console.log(`  Overhead:  ${timings.overhead.toFixed(0)}ms`);
  console.log(`  TOTAL:     ${timings.total.toFixed(0)}ms`);

  return { response, timings };
}

Output example:

[Agent 1: Classify] 342ms
[Agent 2: Research] 1847ms
[Agent 3: Respond]  923ms

--- Pipeline Timing Summary ---
  Classify:  342ms
  Research:  1847ms
  Respond:   923ms
  Overhead:  12ms
  TOTAL:     3124ms

The timing breakdown immediately shows that Agent 2 (Research) is the bottleneck. This is where optimization effort should focus.
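Picking the bottleneck out of a timings object like the one above can be automated. A small sketch:

```javascript
// Return the [name, ms] pair of the slowest agent stage, ignoring the
// aggregate 'total' and 'overhead' entries.
function slowestStage(timings) {
  return Object.entries(timings)
    .filter(([name]) => name !== 'total' && name !== 'overhead')
    .sort(([, a], [, b]) => b - a)[0];
}

slowestStage({ classify: 342, research: 1847, respond: 923, overhead: 12, total: 3124 });
// -> ['research', 1847]
```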


3. Latency Math: Sequential vs Parallel

Sequential latency

When agents run one after another (each depends on the previous output):

Latency_sequential = T_agent1 + T_agent2 + T_agent3 + ... + T_agentN

This is the worst case and the most common pattern, because downstream agents typically need upstream results.

Parallel latency

When agents can run simultaneously (they don't depend on each other's output):

Latency_parallel = max(T_agent1, T_agent2, T_agent3, ..., T_agentN)

This is the best case. You only pay the time of the slowest agent.

Mixed (real-world) latency

Most pipelines are a mix -- some stages sequential, some parallel:

MIXED PIPELINE EXAMPLE:
                        +-> [Agent B1: 800ms] --+
  [Agent A: 500ms] ---->                         +--> [Agent C: 600ms] -> Done
                        +-> [Agent B2: 1200ms] --+

  Latency = T_A + max(T_B1, T_B2) + T_C
          = 500 + max(800, 1200) + 600
          = 500 + 1200 + 600
          = 2,300ms

  vs fully sequential:
          = 500 + 800 + 1200 + 600
          = 3,100ms

  Savings from parallelization: 800ms (26% faster)
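The three formulas condense into one helper. This is a sketch: a pipeline is modeled as an array of stages, where a plain number is a sequential step and a nested array is a parallel fan-out.

```javascript
// Sequential step: add its time. Parallel group: pay only the slowest member.
const pipelineMs = (stages) =>
  stages.reduce(
    (sum, stage) => sum + (Array.isArray(stage) ? Math.max(...stage) : stage),
    0
  );

pipelineMs([500, [800, 1200], 600]); // -> 2300 (the mixed example above)
pipelineMs([500, 800, 1200, 600]);   // -> 3100 (fully sequential)
```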

Code: parallel agent execution

// Sequential (slow)
async function sequentialPipeline(input) {
  const a = await agentA(input);      // Wait for A
  const b1 = await agentB1(a);        // Wait for B1
  const b2 = await agentB2(a);        // Wait for B2
  const c = await agentC(b1, b2);     // Wait for C
  return c;
}
// Total time: T_A + T_B1 + T_B2 + T_C

// Parallel where possible (fast)
async function parallelPipeline(input) {
  const a = await agentA(input);               // Must wait for A

  // B1 and B2 can run in parallel -- they both only need A's output
  const [b1, b2] = await Promise.all([
    agentB1(a),
    agentB2(a),
  ]);

  const c = await agentC(b1, b2);             // Must wait for both B's
  return c;
}
// Total time: T_A + max(T_B1, T_B2) + T_C

4. Strategies to Reduce Latency

Strategy 1: Parallelize independent agents

As shown above, identify agents whose inputs don't depend on each other and run them with Promise.all().

BEFORE (sequential):
  [Sentiment: 400ms] -> [Summary: 900ms] -> [Keywords: 300ms]
  Total: 1,600ms

AFTER (parallel where possible):
  If Sentiment, Summary, and Keywords all only need the original input:
  Promise.all([Sentiment, Summary, Keywords])
  Total: max(400, 900, 300) = 900ms
  Savings: 44%

Strategy 2: Use smaller/faster models for simple agents

Not every agent needs GPT-4o or Claude Sonnet. A classification agent that picks from 5 categories can use a smaller, faster model.

// Agent configuration with model selection based on task complexity
const agentConfigs = {
  classifier: {
    model: 'gpt-4o-mini',        // Fast, cheap -- classification is simple
    maxTokens: 50,                // Short output
    expectedLatency: '200-400ms',
  },
  researcher: {
    model: 'gpt-4o',             // Complex reasoning needs a powerful model
    maxTokens: 2000,
    expectedLatency: '1-3s',
  },
  formatter: {
    model: 'gpt-4o-mini',        // Reformatting is mechanical
    maxTokens: 500,
    expectedLatency: '300-600ms',
  },
};
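One way to wire such a config table into the pipeline is to build each provider request from it. This is a sketch: the config is abridged so the snippet stands alone, and the request field names assume the OpenAI chat API shape.

```javascript
// Abridged copy of the config table above, kept local for self-containment.
const agentConfigs = {
  classifier: { model: 'gpt-4o-mini', maxTokens: 50 },
  researcher: { model: 'gpt-4o', maxTokens: 2000 },
  formatter: { model: 'gpt-4o-mini', maxTokens: 500 },
};

// Build the request an agent role should send. Keeping "which model and
// limits" separate from "make the call" keeps the routing easy to test.
function buildRequest(role, messages) {
  const config = agentConfigs[role];
  if (!config) throw new Error(`Unknown agent role: ${role}`);
  return { model: config.model, max_tokens: config.maxTokens, messages };
}
```

Usage: pass the result straight to your client, e.g. `openai.chat.completions.create(buildRequest('classifier', messages))`.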

Strategy 3: Cache repeated calls

If the same agent gets the same (or very similar) input multiple times, cache the result.

const cache = new Map();

async function cachedAgent(agentFn, input, ttlMs = 300000) {
  const cacheKey = JSON.stringify(input);

  if (cache.has(cacheKey)) {
    const { result, timestamp } = cache.get(cacheKey);
    if (Date.now() - timestamp < ttlMs) {
      console.log('[Cache HIT] Skipping LLM call');
      return result;
    }
    cache.delete(cacheKey); // Expired
  }

  console.log('[Cache MISS] Calling LLM...');
  const result = await agentFn(input);
  cache.set(cacheKey, { result, timestamp: Date.now() });
  return result;
}

// Usage
const classification = await cachedAgent(classifyIntent, userMessage);
// Second call with same message returns instantly from cache
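The cache above only hits on byte-identical input. A cheap way to also catch trivially "similar" inputs is to normalize the key before lookup; this is a sketch, and production systems often go further with embedding-based similarity.

```javascript
// Collapse case and whitespace so near-duplicate messages share a cache key.
function normalizedKey(input) {
  const text = typeof input === 'string' ? input : JSON.stringify(input);
  return text.trim().toLowerCase().replace(/\s+/g, ' ');
}

normalizedKey('  What is my   BILL? '); // -> 'what is my bill?'
```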

Strategy 4: Reduce output token count

Output generation is the slowest part of an LLM call. Constrain agents to produce only what's needed.

// BAD: Agent generates a verbose explanation
const prompt = `Classify this customer message into a category.
Message: "${userMessage}"`;
// Output: "Based on my analysis, this message appears to be about billing
// because the customer mentions charges and payments. The category is: billing"
// ~40 tokens, ~400ms

// GOOD: Agent generates minimal structured output
const prompt = `Classify this customer message. Reply with ONLY the category name.
Categories: billing, technical, account, general
Message: "${userMessage}"`;
// Output: "billing"
// ~1 token, ~150ms

Strategy 5: Stream the final agent's response

Even if intermediate agents must complete fully, you can stream the last agent's output to the user so they see results appearing immediately.

async function streamingPipeline(userInput) {
  // Intermediate agents run to completion (no streaming)
  const classification = await classifyIntent(userInput);
  const research = await conductResearch(classification);

  // Final agent streams to the user
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: 'Generate a helpful response using this research.' },
      { role: 'user', content: JSON.stringify(research) },
    ],
    stream: true,
  });

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content || '';
    process.stdout.write(content); // User sees tokens as they arrive
  }
}

Strategy 6: Set timeouts and fallbacks

Protect the user experience with timeouts. If an agent is too slow, fall back to a simpler approach.

async function withTimeout(fn, timeoutMs, fallbackFn) {
  let timer;
  try {
    return await Promise.race([
      fn(),
      new Promise((_, reject) => {
        timer = setTimeout(() => reject(new Error(`Timeout after ${timeoutMs}ms`)), timeoutMs);
      }),
    ]);
  } catch (error) {
    // Note: any failure from fn() -- not only a timeout -- triggers the fallback.
    console.warn(`Agent failed or timed out: ${error.message}. Using fallback.`);
    return fallbackFn();
  } finally {
    clearTimeout(timer); // Otherwise a pending timer keeps the Node process alive
  }
}

// Usage: give the research agent 3 seconds, then fall back to a simpler lookup
const research = await withTimeout(
  () => conductResearch(classification),
  3000,
  () => simpleLookup(classification) // Faster but less thorough
);

5. Latency Budgets for User-Facing Applications

Different applications have different latency tolerances. Define a latency budget before designing your pipeline.

+---------------------------------------------------------------+
|  APPLICATION TYPE          |  ACCEPTABLE LATENCY  |  AGENTS   |
|----------------------------+----------------------+-----------|
|  Chat / conversational     |  < 2 seconds         |  1-2 max  |
|  Search / Q&A              |  < 3 seconds         |  2-3 max  |
|  Document processing       |  < 10 seconds        |  3-5 okay |
|  Background / async job    |  < 60 seconds        |  5+ okay  |
|  Batch processing          |  Minutes acceptable  |  No limit |
+---------------------------------------------------------------+

Setting a latency budget

const LATENCY_BUDGET = {
  total: 3000,        // 3 seconds max for user-facing Q&A
  classify: 400,      // Budget for classification agent
  retrieve: 500,      // Budget for retrieval (not an LLM call, but still costs time)
  research: 1500,     // Budget for research agent (the heavy one)
  respond: 500,       // Budget for response generation
  overhead: 100,      // Network, parsing, etc.
};

// Validate that the budget adds up
const budgetTotal = Object.entries(LATENCY_BUDGET)
  .filter(([key]) => key !== 'total') // skip 'total' by name, not position
  .reduce((sum, [, value]) => sum + value, 0);

// This is a design-time check, not a runtime check
console.assert(
  budgetTotal <= LATENCY_BUDGET.total,
  `Budget overflows: ${budgetTotal}ms > ${LATENCY_BUDGET.total}ms`
);

Monitoring budget compliance in production

async function monitoredPipeline(input) {
  const timings = {};

  const classify = await withTiming('classify', () => classifyIntent(input));
  timings.classify = classify.elapsed;
  if (classify.elapsed > LATENCY_BUDGET.classify) {
    console.warn(`[BUDGET EXCEEDED] classify: ${classify.elapsed}ms > ${LATENCY_BUDGET.classify}ms`);
  }

  const research = await withTiming('research', () => conductResearch(classify.result));
  timings.research = research.elapsed;
  if (research.elapsed > LATENCY_BUDGET.research) {
    console.warn(`[BUDGET EXCEEDED] research: ${research.elapsed}ms > ${LATENCY_BUDGET.research}ms`);
  }

  const respond = await withTiming('respond', () => generateResponse(research.result));
  timings.respond = respond.elapsed;

  const total = Object.values(timings).reduce((a, b) => a + b, 0);
  if (total > LATENCY_BUDGET.total) {
    console.error(`[TOTAL BUDGET EXCEEDED] ${total.toFixed(0)}ms > ${LATENCY_BUDGET.total}ms`);
    // Log to monitoring system, trigger alert
  }

  return respond.result;
}

6. When Latency Makes Multi-Agent Impractical

Sometimes multi-agent architecture simply cannot meet the latency requirements. Recognize these situations early.

  Real-time chat (< 1s)
    Why multi-agent fails:  Even 2 sequential LLM calls exceed the budget
    What to do instead:     Single well-crafted prompt with structured output

  Autocomplete / typeahead
    Why multi-agent fails:  Needs < 200ms response
    What to do instead:     Pre-computed suggestions or a single small model

  Interactive UI with live updates
    Why multi-agent fails:  Users expect instant feedback
    What to do instead:     Single call + streaming; defer complex analysis to background

  High-throughput API (1000+ req/s)
    Why multi-agent fails:  Multiplied latency = multiplied concurrent connections
    What to do instead:     Single call or batch processing

  Mobile apps with poor connectivity
    Why multi-agent fails:  Each round-trip adds network latency on top of compute
    What to do instead:     Minimize API calls; do more client-side

Decision flowchart

  Is your latency budget > 3 seconds?
       |                    |
      YES                  NO
       |                    |
  Multi-agent is     Can you parallelize agents?
  viable if needed.       |            |
                        YES           NO
                         |             |
                  Budget = max(agents)  Budget = sum(agents)
                  Might still work.    Probably too slow.
                         |             Use single call or
                  Test and measure.    reduce agent count.
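The same decision can be written down as code. A sketch of the flowchart's logic, fed with estimated per-agent call times:

```javascript
// Decide whether a multi-agent design can fit the latency budget.
function multiAgentViable({ budgetMs, agentTimesMs, canParallelize }) {
  const effectiveMs = canParallelize
    ? Math.max(...agentTimesMs)                // parallel: pay only the slowest
    : agentTimesMs.reduce((a, b) => a + b, 0); // sequential: pay the sum
  return { effectiveMs, viable: effectiveMs <= budgetMs };
}

multiAgentViable({ budgetMs: 2000, agentTimesMs: [800, 1200, 600], canParallelize: false });
// -> { effectiveMs: 2600, viable: false } -- too slow; cut agents or go single-call
multiAgentViable({ budgetMs: 2000, agentTimesMs: [800, 1200, 600], canParallelize: true });
// -> { effectiveMs: 1200, viable: true } -- but still test and measure
```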

7. Latency Benchmarking Example

A complete benchmarking script to compare single-call vs multi-agent latency:

import OpenAI from 'openai';

const openai = new OpenAI();

// Helper: time an LLM call
async function timedCall(label, messages, model = 'gpt-4o') {
  const start = performance.now();
  const response = await openai.chat.completions.create({
    model,
    messages,
    temperature: 0,
  });
  const elapsed = performance.now() - start;
  return {
    label,
    elapsed,
    content: response.choices[0].message.content,
    tokens: response.usage,
  };
}

// Approach 1: Single call does everything
async function singleCallApproach(article) {
  const start = performance.now();
  const result = await timedCall('Single Call', [
    {
      role: 'system',
      content: `Analyze this article. Return JSON with:
        { "sentiment": "positive|negative|neutral",
          "summary": "2 sentence summary",
          "keywords": ["keyword1", "keyword2", ...] }`,
    },
    { role: 'user', content: article },
  ]);
  return { ...result, totalElapsed: performance.now() - start };
}

// Approach 2: Three agents in sequence
async function multiAgentSequential(article) {
  const start = performance.now();

  const sentiment = await timedCall('Agent: Sentiment', [
    { role: 'system', content: 'Return ONLY: positive, negative, or neutral.' },
    { role: 'user', content: article },
  ], 'gpt-4o-mini');

  const summary = await timedCall('Agent: Summary', [
    { role: 'system', content: 'Summarize in exactly 2 sentences.' },
    { role: 'user', content: article },
  ]);

  const keywords = await timedCall('Agent: Keywords', [
    { role: 'system', content: 'Return 5 keywords as a JSON array of strings.' },
    { role: 'user', content: article },
  ], 'gpt-4o-mini');

  return {
    agents: [sentiment, summary, keywords],
    totalElapsed: performance.now() - start,
  };
}

// Approach 3: Three agents in parallel
async function multiAgentParallel(article) {
  const start = performance.now();

  const [sentiment, summary, keywords] = await Promise.all([
    timedCall('Agent: Sentiment', [
      { role: 'system', content: 'Return ONLY: positive, negative, or neutral.' },
      { role: 'user', content: article },
    ], 'gpt-4o-mini'),
    timedCall('Agent: Summary', [
      { role: 'system', content: 'Summarize in exactly 2 sentences.' },
      { role: 'user', content: article },
    ]),
    timedCall('Agent: Keywords', [
      { role: 'system', content: 'Return 5 keywords as a JSON array of strings.' },
      { role: 'user', content: article },
    ], 'gpt-4o-mini'),
  ]);

  return {
    agents: [sentiment, summary, keywords],
    totalElapsed: performance.now() - start,
  };
}

// Run benchmark
async function benchmark() {
  const article = `Artificial intelligence continues to transform industries
    worldwide. Recent advances in large language models have enabled new
    applications in healthcare, finance, and education. However, concerns
    about bias, hallucination, and job displacement remain significant
    challenges that the industry must address.`;

  console.log('=== LATENCY BENCHMARK ===\n');

  const single = await singleCallApproach(article);
  console.log(`Single call:        ${single.totalElapsed.toFixed(0)}ms`);

  const sequential = await multiAgentSequential(article);
  console.log(`Multi-agent (seq):  ${sequential.totalElapsed.toFixed(0)}ms`);
  sequential.agents.forEach((a) =>
    console.log(`  ${a.label}: ${a.elapsed.toFixed(0)}ms`)
  );

  const parallel = await multiAgentParallel(article);
  console.log(`Multi-agent (par):  ${parallel.totalElapsed.toFixed(0)}ms`);
  parallel.agents.forEach((a) =>
    console.log(`  ${a.label}: ${a.elapsed.toFixed(0)}ms`)
  );

  console.log('\n--- Summary ---');
  console.log(`Single call:       ${single.totalElapsed.toFixed(0)}ms`);
  console.log(`Sequential agents: ${sequential.totalElapsed.toFixed(0)}ms (+${((sequential.totalElapsed / single.totalElapsed - 1) * 100).toFixed(0)}%)`);
  console.log(`Parallel agents:   ${parallel.totalElapsed.toFixed(0)}ms (+${((parallel.totalElapsed / single.totalElapsed - 1) * 100).toFixed(0)}%)`);
}

benchmark();

Typical output:

=== LATENCY BENCHMARK ===

Single call:        1,247ms
Multi-agent (seq):  2,891ms
  Agent: Sentiment: 312ms
  Agent: Summary:   1,623ms
  Agent: Keywords:  428ms
Multi-agent (par):  1,680ms
  Agent: Sentiment: 298ms
  Agent: Summary:   1,598ms
  Agent: Keywords:  411ms

--- Summary ---
Single call:       1,247ms
Sequential agents: 2,891ms (+132%)
Parallel agents:   1,680ms (+35%)

8. Key Takeaways

  1. Sequential multi-agent latency = sum of all agent call times. A 3-agent pipeline is roughly 3x slower than a single call.
  2. Parallel execution reduces latency to the slowest agent -- use Promise.all() for independent agents.
  3. Measure every agent individually -- you cannot optimize what you do not measure. The bottleneck is often one specific agent.
  4. Use smaller/faster models for simple agents -- classification and formatting do not need the most powerful model.
  5. Cache, constrain output tokens, and stream the final response -- these compound into significant latency reductions.
  6. Set latency budgets up front -- know your user-facing constraints before designing the pipeline.
  7. If the latency budget is < 1-2 seconds, multi-agent is likely impractical -- use a single well-crafted prompt instead.

Explain-It Challenge

  1. Your team built a 4-agent pipeline that takes 6 seconds. The PM says users are abandoning the feature. Walk through your latency diagnosis and optimization plan.
  2. Why does streaming the final agent help the user experience even though the total pipeline time is unchanged?
  3. A colleague proposes replacing one GPT-4o agent with GPT-4o-mini to reduce latency. What questions would you ask before approving this change?

Navigation: <- 4.19 Overview | 4.19.b -- Higher Operational Cost ->