Episode 4 — Generative AI Engineering / 4.19 — Multi Agent Architecture Concerns

4.19 — Multi-Agent Architecture Concerns: Quick Revision

Compact cheat sheet. Print-friendly.

How to use this material (instructions)

  1. Skim before labs or interviews.
  2. Drill gaps -- reopen README.md then 4.19.a...4.19.e.
  3. Practice -- 4.19-Exercise-Questions.md.
  4. Polish answers -- 4.19-Interview-Questions.md.

Core vocabulary

Sequential pipeline:    Agents run one after another; latency = sum of all agent latencies
Parallel pipeline:      Independent agents run simultaneously; latency = that of the slowest
Latency budget:         Maximum allowed response time for a use case
Cost multiplication:    Each agent = separate LLM call with its own token charges
Error propagation:      Wrong output from agent A corrupts all downstream agents
Trace ID:               Unique identifier linking all agent calls in a single request
Shared state:           Data that flows between agents in a pipeline
Immutable state:        State that cannot be modified after creation; prevents corruption
Pipeline state object:  Single object accumulating all agent outputs
Early termination:      Skipping expensive agents when a cheap classifier deems them unnecessary
Subtraction test:       Removing an agent to see if quality drops; if not, remove it
YAGNI:                  "You Aren't Gonna Need It" -- don't build for hypothetical needs

Latency math

Sequential:  Latency = T_A + T_B + T_C
Parallel:    Latency = max(T_A, T_B, T_C)
Mixed:       Latency = T_A + max(T_B1, T_B2) + T_C

Example (mixed):
  A=400ms, B1=800ms, B2=1100ms, C=500ms
  Total = 400 + max(800,1100) + 500 = 2000ms
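The three formulas can be sketched as small helpers (TypeScript, since the notes already assume a `Promise.all`-style environment; durations are the illustrative numbers from the example above):

```typescript
// Latency of a purely sequential pipeline: stage durations add up.
function sequentialLatency(stagesMs: number[]): number {
  return stagesMs.reduce((sum, t) => sum + t, 0);
}

// Latency of a parallel fan-out: the slowest branch dominates.
function parallelLatency(branchesMs: number[]): number {
  return Math.max(...branchesMs);
}

// Mixed pipeline: sequential stages, each stage being either a single
// agent or a parallel group of agents.
function mixedLatency(stages: number[][]): number {
  return stages.reduce((sum, group) => sum + parallelLatency(group), 0);
}

// Example from the notes: A=400ms, then B1 || B2 (800/1100ms), then C=500ms.
const total = mixedLatency([[400], [800, 1100], [500]]);
console.log(total); // 2000
```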

Latency reduction strategies

1. Parallelize independent agents      (Promise.all)
2. Use smaller models for simple tasks  (gpt-4o-mini for classification)
3. Cache repeated calls                 (same input = cached output)
4. Constrain output tokens              (fewer tokens = faster generation)
5. Stream final agent response          (perceived latency drops)
6. Add timeouts with fallbacks          (protect user experience)
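Strategies 1 and 6 combine naturally: run independent agents concurrently and wrap each call in a timeout that resolves to a fallback. A minimal sketch; `callAgent` is a hypothetical stand-in for a real LLM call:

```typescript
// Hypothetical stand-in for a real agent/LLM call.
async function callAgent(name: string, delayMs: number): Promise<string> {
  await new Promise((resolve) => setTimeout(resolve, delayMs));
  return `${name}: ok`;
}

// Strategy 6: if the agent exceeds its budget, resolve to a fallback
// instead of blocking the whole pipeline.
function withTimeout<T>(p: Promise<T>, ms: number, fallback: T): Promise<T> {
  const timeout = new Promise<T>((resolve) =>
    setTimeout(() => resolve(fallback), ms)
  );
  return Promise.race([p, timeout]);
}

// Strategy 1: independent agents run concurrently; total latency is
// roughly that of the slowest (or its timeout, whichever is smaller).
async function runParallel(): Promise<string[]> {
  return Promise.all([
    withTimeout(callAgent("sentiment", 10), 200, "sentiment: fallback"),
    withTimeout(callAgent("entities", 300), 50, "entities: fallback"),
  ]);
}
```

Note that `Promise.race` does not cancel the losing call; in production you would also abort the underlying request (e.g. via `AbortController`) to stop paying for it.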

Latency budgets

Chat / conversational:     < 2 seconds     (1-2 agents max)
Search / Q&A:              < 3 seconds     (2-3 agents max)
Document processing:       < 10 seconds    (3-5 agents okay)
Background / async:        < 60 seconds    (5+ agents okay)
Batch processing:          Minutes         (no agent limit)

Cost math

Single call cost:
  = (input_tokens * input_price) + (output_tokens * output_price)

Pipeline cost:
  = SUM of all agents' individual costs

Context accumulation:
  Agent B input = B's prompt + A's output
  Agent C input = C's prompt + A's output + B's output
  Input tokens grow at each stage!

Cost reference (approximate)

GPT-4o:       $2.50/1M input,  $10.00/1M output
GPT-4o-mini:  $0.15/1M input,  $0.60/1M output
Ratio:        mini is ~17x cheaper than 4o
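Using the approximate prices above, the context-accumulation effect can be made concrete: each agent's input is its own prompt plus all upstream outputs, so input cost grows at every stage. A sketch with illustrative token counts:

```typescript
interface AgentUsage {
  promptTokens: number;  // this agent's own prompt
  outputTokens: number;  // tokens it generates
}

// Approximate GPT-4o prices per token (from the reference table above).
const INPUT_PRICE = 2.5 / 1_000_000;
const OUTPUT_PRICE = 10.0 / 1_000_000;

// Sequential pipeline cost where each agent also receives every
// upstream agent's output as input (context accumulation).
function pipelineCost(agents: AgentUsage[]): number {
  let accumulated = 0; // upstream output tokens carried forward
  let cost = 0;
  for (const a of agents) {
    const inputTokens = a.promptTokens + accumulated;
    cost += inputTokens * INPUT_PRICE + a.outputTokens * OUTPUT_PRICE;
    accumulated += a.outputTokens;
  }
  return cost;
}

// Illustrative 3-agent pipeline: note how B and C pay for A's (and B's) output.
const cost = pipelineCost([
  { promptTokens: 500, outputTokens: 300 }, // A
  { promptTokens: 400, outputTokens: 400 }, // B: input = 400 + 300
  { promptTokens: 300, outputTokens: 200 }, // C: input = 300 + 700
]);
```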

Cost optimization strategies

1. Cheaper models for simple agents     (classification, formatting -> mini)
2. Cache repeated agent calls           (hash input -> stored output)
3. Minimize tokens per agent            (short prompts, constrained output)
4. Batch independent items              (100 classifications in 1 call)
5. Early termination                    (skip expensive agents for simple queries)
6. Monitor per-agent costs              (identify which agent eats the budget)
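Early termination (strategy 5) can be as simple as a cheap gate in front of the expensive path. In this sketch `classifyComplexity` is a crude word-count heuristic standing in for a small-model classification call, and both agent functions are hypothetical stubs:

```typescript
type Complexity = "simple" | "complex";

// Stand-in for a cheap classifier (e.g. a gpt-4o-mini call); here just a
// crude word-count heuristic for illustration.
function classifyComplexity(query: string): Complexity {
  return query.trim().split(/\s+/).length > 20 ? "complex" : "simple";
}

// Hypothetical stubs for the two paths.
async function cheapAgent(query: string): Promise<string> {
  return `cheap answer to: ${query}`;
}
async function expensivePipeline(query: string): Promise<string> {
  return `expensive answer to: ${query}`;
}

// Strategy 5: skip the multi-agent pipeline when the gate says it's
// unnecessary; the classifier's cost is tiny compared to the savings.
async function answer(query: string): Promise<string> {
  if (classifyComplexity(query) === "simple") {
    return cheapAgent(query);
  }
  return expensivePipeline(query);
}
```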

Debugging

WHY IT'S HARD:
  - LLM errors are SILENT (no exceptions, just wrong text)
  - Errors PROPAGATE (Agent A mistake -> Agent B wrong -> Agent C wrong)
  - Non-deterministic (same input can fail intermittently)

BLAME ISOLATION (walk backwards):
  1. Is Agent C's output wrong GIVEN its input?  YES -> fix C
  2. Is Agent B's output wrong GIVEN its input?  YES -> fix B
  3. Is Agent A's output wrong GIVEN its input?  YES -> fix A
  4. All individually correct? -> Bug is in the HANDOFF

MANDATORY LOG FIELDS PER AGENT:
  traceId, agentName, timestamp, input (full), output (full),
  model, params, tokenUsage, durationMs, status, error (if any)
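A sketch of the log record and a minimal in-memory tracer (field names follow the list above; everything else, including the class name, is illustrative):

```typescript
interface AgentTraceEntry {
  traceId: string;    // links all agent calls in one request
  agentName: string;
  timestamp: string;
  input: unknown;     // full input, not a truncated summary
  output: unknown;    // full output
  model: string;
  params: Record<string, unknown>;
  tokenUsage: { input: number; output: number };
  durationMs: number;
  status: "ok" | "error";
  error?: string;
}

// Minimal tracer: one entry per agent call, filterable by traceId so a
// whole request can be replayed for blame isolation (walk backwards).
class Tracer {
  private entries: AgentTraceEntry[] = [];

  log(entry: AgentTraceEntry): void {
    this.entries.push(entry);
  }

  forTrace(traceId: string): AgentTraceEntry[] {
    return this.entries.filter((e) => e.traceId === traceId);
  }
}
```

A real system would persist entries (file, DB, or an observability backend) instead of keeping them in memory.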

Tracing tools

Console logs:     Quick debugging, no setup, no persistence
Custom tracer:    Full control, fits your infra, medium effort
LangSmith:        Hosted dashboard, LangChain ecosystem, low effort
OpenTelemetry:    Enterprise, existing observability stack, high effort

Shared state

PATTERNS:
  Pipeline state object:  Single object accumulates all agent outputs
  Context passing:        Each agent receives only specific parameters

RULES:
  1. Use IMMUTABLE state  (Object.freeze + spread operator)
  2. Each agent RETURNS its contribution (never mutates state)
  3. Runner MERGES contributions into new state
  4. Parallel agents: collect independently, merge AFTER all complete
  5. VALIDATE state schema after each agent
  6. PERSIST state for long-running or expensive pipelines
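Rules 1 through 3 in code: each agent returns only its contribution, and the runner merges it into a new frozen state object. A sketch; the agent body is a stand-in for a real LLM call:

```typescript
interface PipelineState {
  readonly [key: string]: unknown;
}

// Rule 3: the runner merges a contribution into a NEW state object;
// rule 1: the result is frozen, so no agent can mutate it later.
function mergeState(
  state: PipelineState,
  contribution: Record<string, unknown>
): PipelineState {
  return Object.freeze({ ...state, ...contribution });
}

// Rule 2: an agent RETURNS its contribution; it never writes to state.
function summarizerAgent(state: PipelineState): Record<string, unknown> {
  const doc = String(state["document"] ?? "");
  return { summary: doc.slice(0, 40) }; // stand-in for a real LLM call
}

const s0: PipelineState = Object.freeze({
  document: "Multi-agent pipelines multiply latency and cost.",
});
const s1 = mergeState(s0, summarizerAgent(s0));
```

For parallel agents (rule 4), collect each contribution independently and apply `mergeState` once per contribution only after all of them have resolved.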

State anti-patterns

Global mutable state         ->  Immutable state + merge
Implicit dependencies        ->  Schema validation at each step
Oversized state              ->  Each agent reads only what it needs
No versioning                ->  Save snapshots at each stage
No error fields in state     ->  Include error fields, validate before each step

When NOT to use multi-agent

DON'T USE MULTI-AGENT WHEN:
  - Single prompt produces equivalent quality
  - Task needs no LLM at all (code, regex, DB lookup)
  - Latency budget < 1-2 seconds
  - Cost increase not justified by quality gain
  - Building for hypothetical future needs (YAGNI)

COMPLEXITY LADDER (start at 0, go up only when forced):
  Level 0:  No LLM (regular code)
  Level 1:  Single LLM call
  Level 2:  LLM + tool use
  Level 3:  2-3 sequential agents
  Level 4:  3-5 agents (parallel + sequential)
  Level 5:  Dynamic multi-agent routing

Agent Justification Test

For each proposed agent, answer ALL THREE:
  1. NECESSITY:    Can't be done by previous agent in same call?
  2. VALUE:        Measurably improves output quality?
  3. COST-BENEFIT: Improvement justifies latency + cost + complexity?

  If ANY answer is NO -> Don't add the agent.

Subtraction test

For each agent in an existing pipeline:
  1. Remove the agent
  2. Run 100+ test cases
  3. Measure quality difference
  If quality barely drops -> Remove the agent permanently

Golden rules

1. Start simple. Single call first.
2. Add agents only when single call demonstrably fails.
3. Measure EVERYTHING: quality, latency, cost.
4. Remove agents that don't measurably improve output.
5. The best architecture is the simplest one that works.
6. Multi-agent is a TOOL, not a GOAL.
7. Log every pipeline step from day one.

Quick mental model

Multi-agent concerns:

  Latency    = N agents means N x slower (sequential)
  Cost       = N agents means N x token charges
  Debugging  = which of N agents caused the error?
  State      = data flowing between N agents needs management
  Necessity  = do you actually need N agents? (usually not)

  The NUMBER of agents is directly proportional to
  latency, cost, and debugging difficulty.
  Keep N as small as possible.

End of 4.19 quick revision.