Episode 4 — Generative AI Engineering / 4.19 — Multi Agent Architecture Concerns
4.19.a — Increased Latency
In one sentence: Every agent in a multi-agent pipeline adds its own LLM call time, turning a sub-second single-call experience into a multi-second (or multi-minute) pipeline -- and understanding where that time goes is the first step to making it acceptable.
Navigation: <- 4.19 Overview | 4.19.b -- Higher Operational Cost ->
1. Why Multi-Agent Systems Are Slow
A single LLM call typically takes 200ms to 2 seconds depending on the model, input size, and output length. In a multi-agent system, you are making multiple LLM calls -- and if they are sequential, the latency adds up.
SINGLE CALL:
User -> [LLM Call: 800ms] -> Response
Total: 800ms
3-AGENT SEQUENTIAL PIPELINE:
User -> [Agent A: 800ms] -> [Agent B: 1200ms] -> [Agent C: 600ms] -> Response
Total: 800 + 1200 + 600 = 2,600ms (2.6 seconds)
5-AGENT SEQUENTIAL PIPELINE:
User -> [A: 800ms] -> [B: 1200ms] -> [C: 600ms] -> [D: 900ms] -> [E: 700ms] -> Response
Total: 800 + 1200 + 600 + 900 + 700 = 4,200ms (4.2 seconds)
Each agent in the pipeline typically needs to:
1. Prepare its prompt (assemble context, format input from previous agent)
2. Send the request to the LLM API (network round-trip)
3. Wait for the model to generate the full response (compute time)
4. Parse the response (extract structured output, validate)
5. Pass results to the next agent
Steps 2 and 3 dominate the total time. The model's generation time scales linearly with output token count -- a 500-token response takes roughly 5x longer to generate than a 100-token response.
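That relationship can be sketched as a back-of-the-envelope latency model. The ~300ms fixed overhead and ~100 output tokens/second defaults below are illustrative assumptions, not measured values -- benchmark your own model and provider.

```javascript
// Rough per-call latency estimate: a fixed overhead (network round-trip plus
// time-to-first-token) plus generation time proportional to output length.
// Both default figures are assumptions for illustration only.
function estimateCallMs(outputTokens, { overheadMs = 300, tokensPerSec = 100 } = {}) {
  return overheadMs + (outputTokens / tokensPerSec) * 1000;
}

estimateCallMs(100); // 1300ms
estimateCallMs(500); // 5300ms -- generation time dominates as output grows
```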
2. Measuring Latency Across the Pipeline
You cannot optimize what you do not measure. Every production multi-agent system needs per-agent timing.
// Utility: wrap any async function with timing
async function withTiming(label, fn) {
const start = performance.now();
const result = await fn();
const elapsed = performance.now() - start;
console.log(`[${label}] ${elapsed.toFixed(0)}ms`);
return { result, elapsed };
}
// Example: 3-agent pipeline with timing
async function runPipeline(userInput) {
const timings = {};
const pipelineStart = performance.now();
// Agent 1: Classify intent
const { result: classification, elapsed: t1 } = await withTiming(
'Agent 1: Classify',
() => classifyIntent(userInput)
);
timings.classify = t1;
// Agent 2: Research
const { result: research, elapsed: t2 } = await withTiming(
'Agent 2: Research',
() => conductResearch(classification)
);
timings.research = t2;
// Agent 3: Generate response
const { result: response, elapsed: t3 } = await withTiming(
'Agent 3: Respond',
() => generateResponse(research)
);
timings.respond = t3;
const totalElapsed = performance.now() - pipelineStart;
timings.total = totalElapsed;
timings.overhead = totalElapsed - t1 - t2 - t3; // non-LLM time
console.log('\n--- Pipeline Timing Summary ---');
console.log(` Classify: ${timings.classify.toFixed(0)}ms`);
console.log(` Research: ${timings.research.toFixed(0)}ms`);
console.log(` Respond: ${timings.respond.toFixed(0)}ms`);
console.log(` Overhead: ${timings.overhead.toFixed(0)}ms`);
console.log(` TOTAL: ${timings.total.toFixed(0)}ms`);
return { response, timings };
}
Output example:
[Agent 1: Classify] 342ms
[Agent 2: Research] 1847ms
[Agent 3: Respond] 923ms
--- Pipeline Timing Summary ---
Classify: 342ms
Research: 1847ms
Respond: 923ms
Overhead: 12ms
TOTAL: 3124ms
The timing breakdown immediately shows that Agent 2 (Research) is the bottleneck. This is where optimization effort should focus.
3. Latency Math: Sequential vs Parallel
Sequential latency
When agents run one after another (each depends on the previous output):
Latency_sequential = T_agent1 + T_agent2 + T_agent3 + ... + T_agentN
This is the worst case and the most common pattern, because downstream agents typically need upstream results.
Parallel latency
When agents can run simultaneously (they don't depend on each other's output):
Latency_parallel = max(T_agent1, T_agent2, T_agent3, ..., T_agentN)
This is the best case. You only pay the time of the slowest agent.
Mixed (real-world) latency
Most pipelines are a mix -- some stages sequential, some parallel:
MIXED PIPELINE EXAMPLE:
+-> [Agent B1: 800ms] --+
[Agent A: 500ms] ----> +--> [Agent C: 600ms] -> Done
+-> [Agent B2: 1200ms] --+
Latency = T_A + max(T_B1, T_B2) + T_C
= 500 + max(800, 1200) + 600
= 500 + 1200 + 600
= 2,300ms
vs fully sequential:
= 500 + 800 + 1200 + 600
= 3,100ms
Savings from parallelization: 800ms (26% faster)
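The arithmetic above can be captured in a small helper. The stage-plan representation here is my own sketch: each stage is a group of agent durations that run in parallel, and stages run one after another.

```javascript
// Total pipeline latency: sum over sequential stages of each stage's
// slowest agent. A stage with one entry is just a sequential step.
function pipelineLatency(stages) {
  return stages.reduce((total, group) => total + Math.max(...group), 0);
}

// Mixed pipeline from the diagram: A, then B1 and B2 in parallel, then C
pipelineLatency([[500], [800, 1200], [600]]); // 2300ms
// The same agents fully sequential
pipelineLatency([[500], [800], [1200], [600]]); // 3100ms
```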
Code: parallel agent execution
// Sequential (slow)
async function sequentialPipeline(input) {
const a = await agentA(input); // Wait for A
const b1 = await agentB1(a); // Wait for B1
const b2 = await agentB2(a); // Wait for B2
const c = await agentC(b1, b2); // Wait for C
return c;
}
// Total time: T_A + T_B1 + T_B2 + T_C
// Parallel where possible (fast)
async function parallelPipeline(input) {
const a = await agentA(input); // Must wait for A
// B1 and B2 can run in parallel -- they both only need A's output
const [b1, b2] = await Promise.all([
agentB1(a),
agentB2(a),
]);
const c = await agentC(b1, b2); // Must wait for both B's
return c;
}
// Total time: T_A + max(T_B1, T_B2) + T_C
4. Strategies to Reduce Latency
Strategy 1: Parallelize independent agents
As shown above, identify agents whose inputs don't depend on each other and run them with Promise.all().
BEFORE (sequential):
[Sentiment: 400ms] -> [Summary: 900ms] -> [Keywords: 300ms]
Total: 1,600ms
AFTER (parallel where possible):
If Sentiment, Summary, and Keywords all only need the original input:
Promise.all([Sentiment, Summary, Keywords])
Total: max(400, 900, 300) = 900ms
Savings: 44%
Strategy 2: Use smaller/faster models for simple agents
Not every agent needs GPT-4o or Claude Sonnet. A classification agent that picks from 5 categories can use a smaller, faster model.
// Agent configuration with model selection based on task complexity
const agentConfigs = {
classifier: {
model: 'gpt-4o-mini', // Fast, cheap -- classification is simple
maxTokens: 50, // Short output
expectedLatency: '200-400ms',
},
researcher: {
model: 'gpt-4o', // Complex reasoning needs a powerful model
maxTokens: 2000,
expectedLatency: '1-3s',
},
formatter: {
model: 'gpt-4o-mini', // Reformatting is mechanical
maxTokens: 500,
expectedLatency: '300-600ms',
},
};
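One way to put that configuration table to work is a small dispatcher that looks up the model and token cap per agent. This is a sketch, not a library API: `callAgent` is a hypothetical helper, and the client is injected so the routing logic can be exercised without a live API key (in production you would pass `new OpenAI()`).

```javascript
// Hypothetical dispatcher: routes each agent call through its config entry.
// `client` is any OpenAI-SDK-compatible client; `configs` is the
// agentConfigs object defined above.
async function callAgent(client, configs, name, messages) {
  const config = configs[name];
  if (!config) throw new Error(`Unknown agent: ${name}`);
  const response = await client.chat.completions.create({
    model: config.model,
    max_tokens: config.maxTokens,
    messages,
  });
  return response.choices[0].message.content;
}

// Usage: await callAgent(new OpenAI(), agentConfigs, 'classifier', messages);
```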
Strategy 3: Cache repeated calls
If the same agent gets the same (or very similar) input multiple times, cache the result.
const cache = new Map();
async function cachedAgent(agentFn, input, ttlMs = 300000) {
const cacheKey = JSON.stringify(input);
if (cache.has(cacheKey)) {
const { result, timestamp } = cache.get(cacheKey);
if (Date.now() - timestamp < ttlMs) {
console.log('[Cache HIT] Skipping LLM call');
return result;
}
cache.delete(cacheKey); // Expired
}
console.log('[Cache MISS] Calling LLM...');
const result = await agentFn(input);
cache.set(cacheKey, { result, timestamp: Date.now() });
return result;
}
// Usage
const classification = await cachedAgent(classifyIntent, userMessage);
// Second call with same message returns instantly from cache
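Note that the cache above only hits on byte-identical input. A cheap way to widen it slightly is to normalize the key first -- a heuristic sketch only; matching genuinely "very similar" inputs requires embedding-based semantic caching, which is beyond this section's scope.

```javascript
// Fold case and collapse whitespace so trivially different phrasings of the
// same input share one cache entry. Use in place of JSON.stringify(input)
// when building cacheKey above.
function normalizeKey(input) {
  const s = typeof input === 'string' ? input : JSON.stringify(input);
  return s.trim().replace(/\s+/g, ' ').toLowerCase();
}

normalizeKey('  Hello   World ') === normalizeKey('hello world'); // true
```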
Strategy 4: Reduce output token count
Output generation is the slowest part of an LLM call. Constrain agents to produce only what's needed.
// BAD: Agent generates a verbose explanation
const prompt = `Classify this customer message into a category.
Message: "${userMessage}"`;
// Output: "Based on my analysis, this message appears to be about billing
// because the customer mentions charges and payments. The category is: billing"
// ~40 tokens, ~400ms
// GOOD: Agent generates minimal structured output
const prompt = `Classify this customer message. Reply with ONLY the category name.
Categories: billing, technical, account, general
Message: "${userMessage}"`;
// Output: "billing"
// ~1 token, ~150ms
Strategy 5: Stream the final agent's response
Even if intermediate agents must complete fully, you can stream the last agent's output to the user so they see results appearing immediately.
async function streamingPipeline(userInput) {
// Intermediate agents run to completion (no streaming)
const classification = await classifyIntent(userInput);
const research = await conductResearch(classification);
// Final agent streams to the user
const stream = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [
{ role: 'system', content: 'Generate a helpful response using this research.' },
{ role: 'user', content: JSON.stringify(research) },
],
stream: true,
});
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content || '';
process.stdout.write(content); // User sees tokens as they arrive
}
}
Strategy 6: Set timeouts and fallbacks
Protect the user experience with timeouts. If an agent is too slow, fall back to a simpler approach.
async function withTimeout(fn, timeoutMs, fallbackFn) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timeout after ${timeoutMs}ms`)), timeoutMs);
  });
  try {
    return await Promise.race([fn(), timeout]);
  } catch (error) {
    console.warn(`Agent failed or timed out: ${error.message}. Using fallback.`);
    return fallbackFn();
  } finally {
    clearTimeout(timer); // avoid leaving the timer pending after fn() settles
  }
}
// Usage: give the research agent 3 seconds, then fall back to a simpler lookup
const research = await withTimeout(
() => conductResearch(classification),
3000,
() => simpleLookup(classification) // Faster but less thorough
);
5. Latency Budgets for User-Facing Applications
Different applications have different latency tolerances. Define a latency budget before designing your pipeline.
+---------------------------------------------------------------+
| APPLICATION TYPE | ACCEPTABLE LATENCY | AGENTS |
|----------------------------+----------------------+-----------|
| Chat / conversational | < 2 seconds | 1-2 max |
| Search / Q&A | < 3 seconds | 2-3 max |
| Document processing | < 10 seconds | 3-5 okay |
| Background / async job | < 60 seconds | 5+ okay |
| Batch processing | Minutes acceptable | No limit |
+---------------------------------------------------------------+
Setting a latency budget
const LATENCY_BUDGET = {
total: 3000, // 3 seconds max for user-facing Q&A
classify: 400, // Budget for classification agent
retrieve: 500, // Budget for retrieval (not an LLM call, but still costs time)
research: 1500, // Budget for research agent (the heavy one)
respond: 500, // Budget for response generation
overhead: 100, // Network, parsing, etc.
};
// Validate that the per-stage budgets fit within the total
const budgetTotal = Object.entries(LATENCY_BUDGET)
  .filter(([key]) => key !== 'total') // skip 'total' by name, not by position
  .reduce((sum, [, v]) => sum + v, 0);
// This is a design-time check, not a runtime check
console.assert(
budgetTotal <= LATENCY_BUDGET.total,
`Budget overflows: ${budgetTotal}ms > ${LATENCY_BUDGET.total}ms`
);
Monitoring budget compliance in production
async function monitoredPipeline(input) {
const timings = {};
const classify = await withTiming('classify', () => classifyIntent(input));
timings.classify = classify.elapsed;
if (classify.elapsed > LATENCY_BUDGET.classify) {
console.warn(`[BUDGET EXCEEDED] classify: ${classify.elapsed}ms > ${LATENCY_BUDGET.classify}ms`);
}
const research = await withTiming('research', () => conductResearch(classify.result));
timings.research = research.elapsed;
if (research.elapsed > LATENCY_BUDGET.research) {
console.warn(`[BUDGET EXCEEDED] research: ${research.elapsed}ms > ${LATENCY_BUDGET.research}ms`);
}
const respond = await withTiming('respond', () => generateResponse(research.result));
timings.respond = respond.elapsed;
const total = Object.values(timings).reduce((a, b) => a + b, 0);
if (total > LATENCY_BUDGET.total) {
console.error(`[TOTAL BUDGET EXCEEDED] ${total.toFixed(0)}ms > ${LATENCY_BUDGET.total}ms`);
// Log to monitoring system, trigger alert
}
return respond.result;
}
6. When Latency Makes Multi-Agent Impractical
Sometimes multi-agent architecture simply cannot meet the latency requirements. Recognize these situations early.
| Scenario | Why Multi-Agent Fails | What to Do Instead |
|---|---|---|
| Real-time chat (< 1s) | Even 2 sequential LLM calls exceed budget | Single well-crafted prompt with structured output |
| Autocomplete / typeahead | Needs < 200ms response | Pre-computed suggestions or a single small model |
| Interactive UI with live updates | Users expect instant feedback | Single call + streaming; defer complex analysis to background |
| High-throughput API (1000+ req/s) | Multiplied latency = multiplied concurrent connections | Single call or batch processing |
| Mobile apps with poor connectivity | Each round-trip adds network latency on top of compute | Minimize API calls; do more client-side |
Decision flowchart
Is your latency budget > 3 seconds?
| |
YES NO
| |
Multi-agent is Can you parallelize agents?
viable if needed. | |
YES NO
| |
Budget = max(agents) Budget = sum(agents)
Might still work. Probably too slow.
| Use single call or
Test and measure. reduce agent count.
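The flowchart can be encoded as a design-time check. This is a sketch: the 3-second threshold and the verdict strings mirror the chart above and are illustrative, not fixed rules.

```javascript
// Rough viability verdict given the latency budget (ms), expected per-agent
// timings (ms), and whether the agents can run in parallel.
function multiAgentViability(budgetMs, agentTimingsMs, canParallelize) {
  if (budgetMs > 3000) return 'viable';
  if (!canParallelize) {
    // Sequential agents pay the full sum -- almost always over a tight budget
    return 'probably too slow -- use a single call or reduce agent count';
  }
  // Parallel agents pay only the slowest one
  return Math.max(...agentTimingsMs) <= budgetMs
    ? 'might work -- test and measure'
    : 'too slow -- use a single call or reduce agent count';
}
```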
7. Latency Benchmarking Example
A complete benchmarking script to compare single-call vs multi-agent latency:
import OpenAI from 'openai';
const openai = new OpenAI();
// Helper: time an LLM call
async function timedCall(label, messages, model = 'gpt-4o') {
const start = performance.now();
const response = await openai.chat.completions.create({
model,
messages,
temperature: 0,
});
const elapsed = performance.now() - start;
return {
label,
elapsed,
content: response.choices[0].message.content,
tokens: response.usage,
};
}
// Approach 1: Single call does everything
async function singleCallApproach(article) {
const start = performance.now();
const result = await timedCall('Single Call', [
{
role: 'system',
content: `Analyze this article. Return JSON with:
{ "sentiment": "positive|negative|neutral",
"summary": "2 sentence summary",
"keywords": ["keyword1", "keyword2", ...] }`,
},
{ role: 'user', content: article },
]);
return { ...result, totalElapsed: performance.now() - start };
}
// Approach 2: Three agents in sequence
async function multiAgentSequential(article) {
const start = performance.now();
const sentiment = await timedCall('Agent: Sentiment', [
{ role: 'system', content: 'Return ONLY: positive, negative, or neutral.' },
{ role: 'user', content: article },
], 'gpt-4o-mini');
const summary = await timedCall('Agent: Summary', [
{ role: 'system', content: 'Summarize in exactly 2 sentences.' },
{ role: 'user', content: article },
]);
const keywords = await timedCall('Agent: Keywords', [
{ role: 'system', content: 'Return 5 keywords as a JSON array of strings.' },
{ role: 'user', content: article },
], 'gpt-4o-mini');
return {
agents: [sentiment, summary, keywords],
totalElapsed: performance.now() - start,
};
}
// Approach 3: Three agents in parallel
async function multiAgentParallel(article) {
const start = performance.now();
const [sentiment, summary, keywords] = await Promise.all([
timedCall('Agent: Sentiment', [
{ role: 'system', content: 'Return ONLY: positive, negative, or neutral.' },
{ role: 'user', content: article },
], 'gpt-4o-mini'),
timedCall('Agent: Summary', [
{ role: 'system', content: 'Summarize in exactly 2 sentences.' },
{ role: 'user', content: article },
]),
timedCall('Agent: Keywords', [
{ role: 'system', content: 'Return 5 keywords as a JSON array of strings.' },
{ role: 'user', content: article },
], 'gpt-4o-mini'),
]);
return {
agents: [sentiment, summary, keywords],
totalElapsed: performance.now() - start,
};
}
// Run benchmark
async function benchmark() {
const article = `Artificial intelligence continues to transform industries
worldwide. Recent advances in large language models have enabled new
applications in healthcare, finance, and education. However, concerns
about bias, hallucination, and job displacement remain significant
challenges that the industry must address.`;
console.log('=== LATENCY BENCHMARK ===\n');
const single = await singleCallApproach(article);
console.log(`Single call: ${single.totalElapsed.toFixed(0)}ms`);
const sequential = await multiAgentSequential(article);
console.log(`Multi-agent (seq): ${sequential.totalElapsed.toFixed(0)}ms`);
sequential.agents.forEach((a) =>
console.log(` ${a.label}: ${a.elapsed.toFixed(0)}ms`)
);
const parallel = await multiAgentParallel(article);
console.log(`Multi-agent (par): ${parallel.totalElapsed.toFixed(0)}ms`);
parallel.agents.forEach((a) =>
console.log(` ${a.label}: ${a.elapsed.toFixed(0)}ms`)
);
console.log('\n--- Summary ---');
console.log(`Single call: ${single.totalElapsed.toFixed(0)}ms`);
console.log(`Sequential agents: ${sequential.totalElapsed.toFixed(0)}ms (+${((sequential.totalElapsed / single.totalElapsed - 1) * 100).toFixed(0)}%)`);
console.log(`Parallel agents: ${parallel.totalElapsed.toFixed(0)}ms (+${((parallel.totalElapsed / single.totalElapsed - 1) * 100).toFixed(0)}%)`);
}
benchmark();
Typical output:
=== LATENCY BENCHMARK ===
Single call: 1247ms
Multi-agent (seq): 2891ms
  Agent: Sentiment: 312ms
  Agent: Summary: 1623ms
  Agent: Keywords: 428ms
Multi-agent (par): 1680ms
  Agent: Sentiment: 298ms
  Agent: Summary: 1598ms
  Agent: Keywords: 411ms
--- Summary ---
Single call: 1247ms
Sequential agents: 2891ms (+132%)
Parallel agents: 1680ms (+35%)
8. Key Takeaways
- Sequential multi-agent latency = sum of all agent call times. A 3-agent pipeline is roughly 3x slower than a single call.
- Parallel execution reduces latency to the slowest agent -- use Promise.all() for independent agents.
- Measure every agent individually -- you cannot optimize what you do not measure. The bottleneck is often one specific agent.
- Use smaller/faster models for simple agents -- classification and formatting do not need the most powerful model.
- Cache, constrain output tokens, and stream the final response -- these compound into significant latency reductions.
- Set latency budgets up front -- know your user-facing constraints before designing the pipeline.
- If the latency budget is < 1-2 seconds, multi-agent is likely impractical -- use a single well-crafted prompt instead.
Explain-It Challenge
- Your team built a 4-agent pipeline that takes 6 seconds. The PM says users are abandoning the feature. Walk through your latency diagnosis and optimization plan.
- Why does streaming the final agent help the user experience even though the total pipeline time is unchanged?
- A colleague proposes replacing one GPT-4o agent with GPT-4o-mini to reduce latency. What questions would you ask before approving this change?
Navigation: <- 4.19 Overview | 4.19.b -- Higher Operational Cost ->