Episode 4 — Generative AI Engineering / 4.19 — Multi Agent Architecture Concerns

4.19.e — When Not to Use Multi-Agent

In one sentence: Multi-agent architecture is a powerful tool, not a default choice -- and the mark of an experienced AI engineer is knowing when a single well-crafted prompt outperforms an elaborate pipeline of specialized agents.

Navigation: <- 4.19.d Managing Shared State | 4.19 Exercise Questions ->


1. The Importance of Logging Each Pipeline Step

Before deciding whether to use a multi-agent architecture, you need visibility into what is actually happening. Logging every pipeline step is what makes informed architectural decisions possible.

// Minimal pipeline logger -- use this BEFORE optimizing or refactoring
class PipelineLogger {
  constructor(pipelineName) {
    this.pipelineName = pipelineName;
    this.steps = [];
  }

  logStep(stepName, input, output, durationMs, tokenCount) {
    const entry = {
      step: stepName,
      inputPreview: JSON.stringify(input).slice(0, 200),
      outputPreview: JSON.stringify(output).slice(0, 200),
      durationMs,
      tokenCount,
      timestamp: new Date().toISOString(),
    };
    this.steps.push(entry);
    console.log(`[${this.pipelineName}] ${stepName}: ${durationMs}ms, ${tokenCount} tokens`);
  }

  summary() {
    const totalDuration = this.steps.reduce((sum, s) => sum + s.durationMs, 0);
    const totalTokens = this.steps.reduce((sum, s) => sum + s.tokenCount, 0);
    console.log(`\n=== Pipeline Summary: ${this.pipelineName} ===`);
    console.log(`Steps: ${this.steps.length}`);
    console.log(`Total duration: ${totalDuration}ms`);
    console.log(`Total tokens: ${totalTokens}`);
    console.log(`Avg duration/step: ${(totalDuration / this.steps.length).toFixed(0)}ms`);
    this.steps.forEach((s) => {
      const pct = ((s.durationMs / totalDuration) * 100).toFixed(1);
      console.log(`  ${s.step}: ${s.durationMs}ms (${pct}%)`);
    });
    return { totalDuration, totalTokens, stepCount: this.steps.length };
  }
}

The reason logging comes first: you cannot make good architectural decisions without data. If you don't know how long each step takes, how many tokens each agent uses, and what quality each agent produces, you are guessing.
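One way to wire the logger in, sketched under the assumption that each pipeline stage is an async function. `loggedStep` is a hypothetical helper, not part of any framework:

```javascript
// Hypothetical helper: times one async step and records it in a
// PipelineLogger-style object. In real code `tokenCount` would come from
// the API response (e.g. usage.total_tokens); here the caller passes it in.
async function loggedStep(logger, stepName, input, fn, tokenCount = 0) {
  const start = Date.now();
  const output = await fn(input);
  logger.logStep(stepName, input, output, Date.now() - start, tokenCount);
  return output;
}
```

Each stage then becomes a one-liner like `await loggedStep(logger, 'classify', text, classifyIntent, usage.total_tokens)`, and `logger.summary()` at the end shows which steps dominate latency and token spend.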


2. Simple Tasks Don't Need Agents

Many tasks that get built as multi-agent pipelines could be done in a single LLM call -- or even without an LLM at all.

Tasks that do NOT need multi-agent

| Task | Why Single Call (or No LLM) Works | Anti-Pattern |
|---|---|---|
| Classify intent into 5 categories | One prompt with examples. Done. | 3-agent pipeline: "classifier -> validator -> router" |
| Extract name + email from text | Single prompt with JSON output | "Extraction agent -> Validation agent -> Format agent" |
| Summarize a short article | One prompt: "Summarize in 3 bullet points" | "Chunker agent -> Summarizer agent -> Merger agent" |
| Format a date string | new Date(input).toISOString() -- no LLM needed | "Parsing agent -> Format agent" |
| Look up a FAQ answer | Keyword search in a database | "Classifier agent -> Retriever agent -> Generator agent" for a static FAQ |
| Validate JSON structure | JSON.parse() + schema validation | "Validation agent" that calls an LLM to check JSON |

The "do you actually need an LLM?" test

BEFORE adding any agent, ask:

  1. Can this be done with regular code?
     (string operations, regex, database query, API call)
     YES -> Don't use an LLM at all. Code is faster, cheaper, deterministic.

  2. Can this be done with a single LLM call?
     (one well-crafted prompt with clear instructions)
     YES -> Use a single call. Simpler, faster, cheaper.

  3. Does this require genuinely different reasoning steps
     that a single prompt cannot handle?
     YES -> Consider multi-agent.
     NO  -> Single call is sufficient.
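To make question 1 concrete, here are two rows from the table above solved with plain code rather than an LLM. The function names and the expected contact shape (name + email) are illustrative:

```javascript
// Format a date string: deterministic, instant, free -- no LLM needed.
function toIso(dateString) {
  const d = new Date(dateString);
  if (Number.isNaN(d.getTime())) throw new Error(`Unparseable date: ${dateString}`);
  return d.toISOString();
}

// Validate JSON structure: JSON.parse() plus a few field checks -- no LLM needed.
function isValidContact(jsonString) {
  try {
    const obj = JSON.parse(jsonString);
    return typeof obj.name === 'string' && typeof obj.email === 'string';
  } catch {
    return false;
  }
}
```

Both run in microseconds, cost nothing, and never hallucinate.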

3. Single Well-Crafted Prompt vs Multi-Agent Overkill

A single prompt can do a surprising amount of work if written well. The key is structured output and clear instructions.

Example: Article analysis

// OVER-ENGINEERED: 4-agent pipeline
async function overEngineeredAnalysis(article) {
  const sentiment = await sentimentAgent(article);     // Agent 1: 400ms, $0.001
  const summary = await summaryAgent(article);          // Agent 2: 900ms, $0.005
  const keywords = await keywordAgent(article);         // Agent 3: 300ms, $0.001
  const report = await reportAgent(sentiment, summary, keywords); // Agent 4: 600ms, $0.004
  return report;
  // Total: 2,200ms, $0.011, 4 LLM calls, complex debugging
}

// RIGHT-SIZED: Single call
// (assumes the official client is set up: import OpenAI from 'openai'; const openai = new OpenAI();)
async function singleCallAnalysis(article) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `Analyze the following article. Return JSON with exactly this structure:
{
  "sentiment": "positive" | "negative" | "neutral",
  "sentimentScore": number between -1 and 1,
  "summary": "2-3 sentence summary",
  "keywords": ["keyword1", "keyword2", ...up to 5],
  "readingLevel": "basic" | "intermediate" | "advanced"
}`,
      },
      { role: 'user', content: article },
    ],
    temperature: 0,
    response_format: { type: 'json_object' },
  });
  return JSON.parse(response.choices[0].message.content);
  // Total: 800ms, $0.004, 1 LLM call, trivial debugging
}

When does single-call break down?

Single call works when:
  - The total task fits in one prompt's context window
  - The model can handle all reasoning in one pass
  - Output is a single structured response
  - No external tool calls needed between reasoning steps

Single call breaks down when:
  - Task requires SEQUENTIAL reasoning (step 2 depends on step 1's exact output)
  - Different steps need DIFFERENT models (code gen vs classification)
  - Steps involve EXTERNAL actions (API calls, database queries, file operations)
  - Context is too large for one call (must chunk and process separately)
  - Task requires ITERATION (try something, evaluate, retry)
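The first two breakdown cases can be illustrated with a flow that genuinely needs separate steps: an external lookup sits between two reasoning passes. The three step functions are injected stubs here; in real code the first and last would be LLM calls and the middle one a database or API query:

```javascript
// A flow a single call cannot collapse: step 2 is an external action that
// depends on step 1's exact output, and step 3 depends on step 2's result.
async function orderStatusFlow(userMessage, { extractOrderId, lookupOrder, draftReply }) {
  const { orderId } = await extractOrderId(userMessage); // LLM: pull the ID out of free text
  const order = await lookupOrder(orderId);              // external: hit the order database
  return draftReply(userMessage, order);                 // LLM: write the answer from real data
}
```

No single prompt can query your order database mid-generation; the moment an external action sits between reasoning steps, a pipeline is justified.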

4. Decision Framework: When to Add vs Remove Agents

The "Agent Justification Test"

Before adding any agent to a pipeline, it must pass all three criteria:

+-------------------------------------------------------------------+
|  AGENT JUSTIFICATION TEST                                         |
|                                                                   |
|  For each proposed agent, answer ALL THREE:                       |
|                                                                   |
|  1. NECESSITY: Does this agent do something that CANNOT be        |
|     done by the previous agent in the same call?                  |
|     (Different model? External tool? Different context?)          |
|                                                                   |
|  2. VALUE: Does adding this agent MEASURABLY improve output       |
|     quality compared to the simpler approach?                     |
|     (Run A/B test. "Probably better" is not good enough.)         |
|                                                                   |
|  3. COST-BENEFIT: Does the improvement justify the added          |
|     latency, cost, and debugging complexity?                      |
|     (A 2% quality improvement that adds 3 seconds and $0.02       |
|      per request is rarely worth it.)                             |
|                                                                   |
|  If ANY answer is NO -> Don't add the agent.                      |
+-------------------------------------------------------------------+

Decision tree

Do you need an LLM at all?
    |
    +-- NO  -> Use regular code.
    |
   YES
    |
Can a single LLM call handle it?
    |
    +-- YES -> Use a single call.
    |
    NO -- why not?
    |
    +-- Needs tool calls between steps        -> Multi-agent justified
    |
    +-- Needs different models/contexts       -> Multi-agent justified
        for different subtasks
    |
How many agents?
    |
Start with 2. Add more ONLY if measured quality improves.

The "subtraction test"

For an existing multi-agent pipeline, try removing each agent one at a time:

// Test: what happens if we skip the "tone checker" agent?
async function withoutToneChecker(userInput) {
  const classification = await classifyIntent(userInput);
  const research = await conductResearch(classification);
  const response = await generateResponse(research);
  // SKIP: const checked = await checkTone(response);
  return response;
}

// Run both versions on 100 test cases, compare quality
async function subtractionTest() {
  const testCases = loadTestCases(); // 100 representative inputs

  const withAgent = [];
  const withoutAgent = [];

  for (const tc of testCases) {
    withAgent.push(await fullPipeline(tc.input));
    withoutAgent.push(await withoutToneChecker(tc.input));
  }

  // Compare quality (have a human or judge LLM rate both)
  const comparison = await compareOutputQuality(withAgent, withoutAgent, testCases);

  console.log('=== Subtraction Test: Tone Checker ===');
  console.log(`With agent:    avg quality ${comparison.withAvg}/10`);
  console.log(`Without agent: avg quality ${comparison.withoutAvg}/10`);
  console.log(`Quality drop:  ${(comparison.withAvg - comparison.withoutAvg).toFixed(2)}`);
  console.log(`Latency saved: ~${comparison.latencySaved}ms per request`);
  console.log(`Cost saved:    ~$${comparison.costSaved.toFixed(4)} per request`);

  if (comparison.withAvg - comparison.withoutAvg < 0.5) {
    console.log('\n>>> RECOMMENDATION: Remove this agent. Quality difference is negligible.');
  }
}
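One possible shape for the `compareOutputQuality` helper referenced above (latency and cost fields omitted for brevity). The `rateQuality` judge is injected; in practice it would be a human review or a judge-LLM call:

```javascript
// Averages per-output quality ratings for both pipeline variants.
// `rateQuality(input, output)` returns a 0-10 score; injecting it keeps
// this testable without a judge LLM in the loop.
async function compareOutputQuality(withAgent, withoutAgent, testCases, rateQuality) {
  let withSum = 0;
  let withoutSum = 0;
  for (let i = 0; i < testCases.length; i++) {
    withSum += await rateQuality(testCases[i].input, withAgent[i]);
    withoutSum += await rateQuality(testCases[i].input, withoutAgent[i]);
  }
  return {
    withAvg: +(withSum / testCases.length).toFixed(2),
    withoutAvg: +(withoutSum / testCases.length).toFixed(2),
  };
}
```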

5. The YAGNI Principle Applied to AI Architecture

YAGNI: You Aren't Gonna Need It. Don't build complexity for hypothetical future requirements.

YAGNI violations in AI architecture

| Over-Engineering | YAGNI Alternative | When to Upgrade |
|---|---|---|
| 5-agent pipeline "for flexibility" | Single call that works now | When single call demonstrably fails |
| Agent orchestrator framework | Simple async/await chain | When you have > 3 pipelines to manage |
| Dynamic agent routing | Hardcoded if/else for 3 intents | When you have > 10 intents with different flows |
| Shared vector database for all agents | Pass context directly | When context exceeds single-call limits |
| Kubernetes-deployed agent microservices | Functions in one file | When you have separate teams per agent |
| Custom agent communication protocol | Function return values | When agents run on different machines |
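The "hardcoded if/else for 3 intents" row deserves emphasis, because it is the one engineers most often skip past. A sketch with illustrative intent names and trivial stand-in handlers:

```javascript
// Illustrative handlers -- in real code these would kick off the actual flows.
const handleBilling = (msg) => `billing team: ${msg}`;
const handleShipping = (msg) => `shipping team: ${msg}`;
const handleGeneral = (msg) => `general queue: ${msg}`;

// Three intents need exactly this much routing machinery and no more.
function routeIntent(intent) {
  if (intent === 'billing') return handleBilling;
  if (intent === 'shipping') return handleShipping;
  return handleGeneral; // everything else falls through to the default flow
}
```

Every branch is greppable, steppable in a debugger, and covered by a three-line unit test; none of that is true of a dynamic routing framework.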

The complexity ladder

START HERE (simplest that works):

  Level 0: No LLM
    Regular code, database queries, templates.
    Latency: <10ms | Cost: $0 | Debugging: Trivial

  Level 1: Single LLM call
    One well-crafted prompt with structured output.
    Latency: 200ms-2s | Cost: $0.001-$0.01 | Debugging: Easy

  Level 2: Single LLM call + tool use
    One LLM call that can invoke tools (search, calculator, API).
    Latency: 1-5s | Cost: $0.005-$0.03 | Debugging: Moderate

  Level 3: Simple sequential pipeline (2-3 agents)
    Each agent has a clear, distinct role.
    Latency: 2-8s | Cost: $0.01-$0.05 | Debugging: Moderate

  Level 4: Parallel + sequential pipeline (3-5 agents)
    Some agents run in parallel, complex state management.
    Latency: 3-15s | Cost: $0.02-$0.10 | Debugging: Hard

  Level 5: Dynamic multi-agent with routing
    Agent count and flow determined at runtime.
    Latency: 5-60s | Cost: $0.05-$1.00 | Debugging: Very hard

RULE: Start at Level 0. Move up ONLY when the current level
      demonstrably fails to meet quality requirements.

6. Case Studies: Over-Engineered vs Right-Sized Solutions

Case Study 1: Customer support chatbot

Over-engineered version (6 agents):

User message
  -> [Intent Classifier Agent]
    -> [Entity Extractor Agent]
      -> [Knowledge Retriever Agent]
        -> [Response Drafter Agent]
          -> [Tone Checker Agent]
            -> [Safety Filter Agent]
              -> Response to user

Latency: 8-12 seconds
Cost: $0.045 per request
Result: Users abandoned the chat before getting a response

Right-sized version (1 call + 1 validation):

User message
  -> [Single LLM call with system prompt containing:
      - Intent classification instructions
      - Entity extraction instructions
      - Knowledge base context (via RAG retrieval, not an agent)
      - Response guidelines + tone requirements
      - Safety rules]
  -> [Programmatic safety check (regex + blocklist, no LLM)]
  -> Response to user

Latency: 1.5-2.5 seconds
Cost: $0.008 per request
Result: Same quality, 5x faster, 5x cheaper

Lesson: The 6 agents were doing work that one well-prompted model could do in a single pass. The "Tone Checker" and "Safety Filter" added latency without measurable quality improvement over system-prompt instructions.
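The programmatic safety check from the right-sized version could be sketched like this. The blocklist phrases and regexes are placeholders, not a production safety policy:

```javascript
// Regex + blocklist safety check: no LLM call, sub-millisecond, free.
const BLOCKLIST = ['send me your password', 'wire transfer immediately'];
const PATTERNS = [
  /\b\d{3}-\d{2}-\d{4}\b/,   // shaped like a US SSN
  /\b(?:\d[ -]?){13,16}\b/,  // shaped like a payment card number
];

function passesSafetyCheck(response) {
  const lower = response.toLowerCase();
  if (BLOCKLIST.some((phrase) => lower.includes(phrase))) return false;
  if (PATTERNS.some((re) => re.test(response))) return false;
  return true;
}
```

A check like this runs in microseconds where the "Safety Filter Agent" took a full LLM round trip.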

Case Study 2: Legal document review

Under-engineered version (1 call):

Full 80-page contract
  -> [Single LLM call: "Review this contract and identify all risks"]
  -> Response

Problems:
  - Contract exceeds context window (needs chunking)
  - Single prompt cannot handle the complexity
  - Output is unstructured and misses important clauses
  - No way to verify which sections were actually analyzed

Right-sized version (4 agents):

Full 80-page contract
  -> [Chunking Agent: splits into sections, identifies structure]
  -> [Risk Analyzer Agent (per chunk, parallelized): identifies risks in each section]
  -> [Cross-Reference Agent: checks for contradictions between sections]
  -> [Report Generator: compiles structured risk report with citations]

Latency: 45-90 seconds (acceptable for document review)
Cost: $0.35 per document (justified at $500/document price point)
Result: Structured, verifiable, thorough analysis

Lesson: The single call literally could not handle this task. The document exceeds context limits, and the analysis requires cross-referencing between sections. Multi-agent is genuinely necessary here.
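The per-chunk parallel step in the pipeline above could be sketched as a batched fan-out. `analyzeChunk` is injected (the risk-analyzer LLM call in real code), and the batch size guards against provider rate limits:

```javascript
// Fan out chunk analysis in batches, preserving document order.
async function analyzeAllChunks(chunks, analyzeChunk, concurrency = 5) {
  const results = [];
  for (let i = 0; i < chunks.length; i += concurrency) {
    const batch = chunks.slice(i, i + concurrency);
    // Promise.all resolves in input order, so results stay aligned with chunks.
    results.push(...(await Promise.all(batch.map((c) => analyzeChunk(c)))));
  }
  return results; // one risk list per chunk, in document order
}
```

Parallelizing the per-chunk work is what keeps the 45-90 second latency from being several minutes.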

Case Study 3: Email categorization

Over-engineered version (3 agents):

Email
  -> [Classifier Agent: categorize email]
  -> [Priority Agent: assign priority]
  -> [Router Agent: determine which team handles it]

Latency: 3 seconds per email
Cost: $0.015 per email
At 50,000 emails/day: $750/day

Right-sized version (1 call):

Email
  -> [Single LLM call: "Classify this email. Return JSON:
      { category, priority, team }"]

Latency: 0.8 seconds per email
Cost: $0.004 per email
At 50,000 emails/day: $200/day
Savings: $550/day = $16,500/month

Even more right-sized version (no LLM for most emails):

Email
  -> [Rule-based filter: known senders, keywords, patterns]
    -> 70% of emails categorized by rules (0ms, $0)
    -> 30% of emails sent to single LLM call ($0.004 each)

Effective cost: $0.0012 per email average
At 50,000 emails/day: $60/day
Savings vs 3-agent: $690/day = $20,700/month
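The rules-first flow above might be sketched like this; the two rules and the injected `llmClassify` fallback are illustrative:

```javascript
// Rule-based filter first; only unmatched emails pay for an LLM call.
const RULES = [
  { test: (e) => /invoice|receipt/i.test(e.subject),
    result: { category: 'billing', priority: 'normal', team: 'finance' } },
  { test: (e) => /unsubscribe/i.test(e.body),
    result: { category: 'marketing', priority: 'low', team: 'none' } },
];

async function categorizeEmail(email, llmClassify) {
  for (const rule of RULES) {
    if (rule.test(email)) return { ...rule.result, source: 'rules' }; // ~0ms, $0
  }
  return { ...(await llmClassify(email)), source: 'llm' }; // the ~30% the rules miss
}
```

Tagging each result with its `source` also gives you the data to verify that the rules really do absorb most of the traffic.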

7. Summary: Multi-Agent Is a Tool, Not a Goal

+-------------------------------------------------------------------+
|  THE MULTI-AGENT DECISION SUMMARY                                 |
+-------------------------------------------------------------------+
|                                                                   |
|  Multi-agent is JUSTIFIED when:                                   |
|  [x] Task requires sequential reasoning across distinct steps     |
|  [x] Different steps need different models or tools               |
|  [x] Context exceeds single-call limits                           |
|  [x] Quality measurably improves over single-call approach        |
|  [x] Latency and cost are acceptable for the use case             |
|                                                                   |
|  Multi-agent is OVERKILL when:                                    |
|  [x] A single well-crafted prompt produces equivalent quality     |
|  [x] The task can be solved without an LLM at all                 |
|  [x] The added latency exceeds user tolerance                     |
|  [x] The cost increase is not justified by quality improvement    |
|  [x] You're adding agents "because we might need them later"      |
|                                                                   |
|  GOLDEN RULES:                                                    |
|  1. Start simple. Single call first.                              |
|  2. Add agents only when single call demonstrably fails.          |
|  3. Measure everything: quality, latency, cost.                   |
|  4. Remove agents that don't measurably improve output.           |
|  5. The best architecture is the simplest one that works.         |
|                                                                   |
+-------------------------------------------------------------------+

8. Key Takeaways

  1. Simple tasks do not need agents. Classification, extraction, summarization of short text -- these are single-call tasks. Using agents for them is wasteful.
  2. A single well-crafted prompt is often enough. Before building a pipeline, try writing one excellent prompt with structured output. You might be surprised by the quality.
  3. Apply the Agent Justification Test to every proposed agent: is it necessary, does it measurably help, and does the cost-benefit work out? If not, cut it.
  4. YAGNI applies to AI architecture. Don't build for hypothetical complexity. Start at the simplest level that works and move up only when forced to.
  5. The subtraction test is powerful. Remove an agent, measure quality. If quality barely drops, that agent was not earning its keep.
  6. Log everything to make informed decisions. You cannot decide whether to simplify or complexify without data on latency, cost, and quality at each step.
  7. The goal is solving the user's problem, not building an impressive architecture. Users care about fast, accurate, affordable answers -- not how many agents you deployed.

Explain-It Challenge

  1. A junior engineer proposes a 5-agent pipeline for a task. You suspect a single call would suffice. How do you diplomatically and empirically demonstrate this?
  2. Your multi-agent pipeline has been running for 6 months. How do you evaluate whether each agent is still earning its keep?
  3. A product manager says "we need AI for this feature." Walk through your decision process from "no LLM at all" up to "multi-agent pipeline."

Navigation: <- 4.19.d Managing Shared State | 4.19 Exercise Questions ->