Episode 4 — Generative AI Engineering / 4.19 — Multi-Agent Architecture Concerns
4.19 — Multi-Agent Architecture Concerns: Quick Revision
Compact cheat sheet. Print-friendly.
How to use this material (instructions)
- Skim before labs or interviews.
- Drill gaps -- reopen README.md, then 4.19.a...4.19.e.
- Practice -- 4.19-Exercise-Questions.md.
- Polish answers -- 4.19-Interview-Questions.md.
Core vocabulary
| Term | One-liner |
|---|---|
| Sequential pipeline | Agents run one after another; latency = sum of all |
| Parallel pipeline | Independent agents run simultaneously; latency = max (slowest) |
| Latency budget | Maximum allowed response time for a use case |
| Cost multiplication | Each agent = separate LLM call with its own token charges |
| Error propagation | Wrong output from agent A corrupts all downstream agents |
| Trace ID | Unique identifier linking all agent calls in a single request |
| Shared state | Data that flows between agents in a pipeline |
| Immutable state | State that cannot be modified after creation; prevents corruption |
| Pipeline state object | Single object accumulating all agent outputs |
| Early termination | Skipping expensive agents when a cheap classifier deems them unnecessary |
| Subtraction test | Removing an agent to see if quality drops; if not, remove it |
| YAGNI | "You Aren't Gonna Need It" -- don't build for hypothetical needs |
Latency math
Sequential: Latency = T_A + T_B + T_C
Parallel: Latency = max(T_A, T_B, T_C)
Mixed: Latency = T_A + max(T_B1, T_B2) + T_C
Example (mixed):
A=400ms, B1=800ms, B2=1100ms, C=500ms
Total = 400 + max(800,1100) + 500 = 2000ms
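The latency formulas above can be sketched as plain functions. The timings are the illustrative numbers from the example, not real measurements:

```javascript
// Latency of a sequential pipeline: the sum of all stage durations (ms).
const sequentialLatency = (stages) => stages.reduce((sum, t) => sum + t, 0);

// Latency of a parallel fan-out: the slowest branch dominates.
const parallelLatency = (branches) => Math.max(...branches);

// Mixed pipeline from the example: A, then B1 || B2, then C.
const mixed = sequentialLatency([400, parallelLatency([800, 1100]), 500]);
console.log(mixed); // 2000
```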
Latency reduction strategies
1. Parallelize independent agents (Promise.all)
2. Use smaller models for simple tasks (gpt-4o-mini for classification)
3. Cache repeated calls (same input = cached output)
4. Constrain output tokens (fewer tokens = faster generation)
5. Stream final agent response (perceived latency drops)
6. Add timeouts with fallbacks (protect user experience)
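Strategies 1 and 6 can be sketched together: run independent agents with Promise.all and guard each call with a timeout fallback via Promise.race. `callAgent` is a hypothetical stand-in for a real LLM client call; the delays and return values are illustrative:

```javascript
// Hypothetical agent call: resolves with `result` after `ms` milliseconds.
const callAgent = (name, ms, result) =>
  new Promise((resolve) => setTimeout(() => resolve(result), ms));

// Race the real call against a fallback that fires at the timeout.
const withTimeout = (promise, ms, fallback) =>
  Promise.race([
    promise,
    new Promise((resolve) => setTimeout(() => resolve(fallback), ms)),
  ]);

async function run() {
  // B1 and B2 are independent, so total wait is max(B1, B2), not the sum.
  const [b1, b2] = await Promise.all([
    withTimeout(callAgent("B1", 50, "summary"), 1000, "B1 fallback"),
    withTimeout(callAgent("B2", 80, "keywords"), 1000, "B2 fallback"),
  ]);
  return [b1, b2];
}

run().then(console.log); // ["summary", "keywords"]
```

If a branch exceeds its timeout, the user gets the fallback instead of a stalled response, which is usually the better trade for interactive use cases.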
Latency budgets
Chat / conversational: < 2 seconds (1-2 agents max)
Search / Q&A: < 3 seconds (2-3 agents max)
Document processing: < 10 seconds (3-5 agents okay)
Background / async: < 60 seconds (5+ agents okay)
Batch processing: Minutes (no agent limit)
Cost math
Single call cost:
= (input_tokens * input_price) + (output_tokens * output_price)
Pipeline cost:
= SUM of all agents' individual costs
Context accumulation:
Agent B input = B's prompt + A's output
Agent C input = C's prompt + A's output + B's output
Input tokens grow at each stage!
Cost reference (approximate)
GPT-4o: $2.50/1M input, $10.00/1M output
GPT-4o-mini: $0.15/1M input, $0.60/1M output
Ratio: mini is ~17x cheaper than 4o
Cost optimization strategies
1. Cheaper models for simple agents (classification, formatting -> mini)
2. Cache repeated agent calls (hash input -> stored output)
3. Minimize tokens per agent (short prompts, constrained output)
4. Batch independent items (100 classifications in 1 call)
5. Early termination (skip expensive agents for simple queries)
6. Monitor per-agent costs (identify which agent eats the budget)
Debugging
WHY IT'S HARD:
- LLM errors are SILENT (no exceptions, just wrong text)
- Errors PROPAGATE (Agent A mistake -> Agent B wrong -> Agent C wrong)
- Non-deterministic (same input can fail intermittently)
BLAME ISOLATION (walk backwards):
1. Is Agent C's output wrong GIVEN its input? YES -> fix C
2. Is Agent B's output wrong GIVEN its input? YES -> fix B
3. Is Agent A's output wrong GIVEN its input? YES -> fix A
4. All individually correct? -> Bug is in the HANDOFF
MANDATORY LOG FIELDS PER AGENT:
traceId, agentName, timestamp, input (full), output (full),
model, params, tokenUsage, durationMs, status, error (if any)
Tracing tools
Console logs: Quick debugging, no setup, no persistence
Custom tracer: Full control, fits your infra, medium effort
LangSmith: Hosted dashboard, LangChain ecosystem, low effort
OpenTelemetry: Enterprise, existing observability stack, high effort
Shared state
PATTERNS:
Pipeline state object: Single object accumulates all agent outputs
Context passing: Each agent receives only specific parameters
RULES:
1. Use IMMUTABLE state (Object.freeze + spread operator)
2. Each agent RETURNS its contribution (never mutates state)
3. Runner MERGES contributions into new state
4. Parallel agents: collect independently, merge AFTER all complete
5. VALIDATE state schema after each agent
6. PERSIST state for long-running or expensive pipelines
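Rules 1-3 can be sketched directly with Object.freeze and the spread operator. The agent and its output are illustrative:

```javascript
// Rule 1: state is immutable from the start.
const initialState = Object.freeze({ query: "summarize the report" });

// Rule 2: an agent never mutates state; it returns only its contribution.
const summarizerAgent = (state) => ({ summary: `summary of: ${state.query}` });

// Rule 3: the runner merges contributions into a brand-new frozen state.
const merge = (state, contribution) =>
  Object.freeze({ ...state, ...contribution });

const next = merge(initialState, summarizerAgent(initialState));
console.log(Object.isFrozen(next));          // true
console.log(initialState.summary);           // undefined -- original untouched
```

For parallel agents (rule 4), each branch returns its contribution independently and the runner calls `merge` once after all branches complete.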
State anti-patterns
| Anti-Pattern | Fix |
|---|---|
| Global mutable state | Immutable state + merge |
| Implicit dependencies | Schema validation at each step |
| Oversized state | Each agent reads only what it needs |
| No versioning | Save snapshots at each stage |
| No error in state | Include error fields, validate before each step |
When NOT to use multi-agent
DON'T USE MULTI-AGENT WHEN:
- Single prompt produces equivalent quality
- Task needs no LLM at all (code, regex, DB lookup)
- Latency budget < 1-2 seconds
- Cost increase not justified by quality gain
- Building for hypothetical future needs (YAGNI)
COMPLEXITY LADDER (start at 0, go up only when forced):
Level 0: No LLM (regular code)
Level 1: Single LLM call
Level 2: LLM + tool use
Level 3: 2-3 sequential agents
Level 4: 3-5 agents (parallel + sequential)
Level 5: Dynamic multi-agent routing
Agent Justification Test
For each proposed agent, answer ALL THREE:
1. NECESSITY: Can't be done by previous agent in same call?
2. VALUE: Measurably improves output quality?
3. COST-BENEFIT: Improvement justifies latency + cost + complexity?
If ANY answer is NO -> Don't add the agent.
Subtraction test
For each agent in an existing pipeline:
1. Remove the agent
2. Run 100+ test cases
3. Measure quality difference
If quality barely drops -> Remove the agent permanently
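The subtraction test can be sketched as a small harness. `runPipeline`, `score`, and the 0.01 removal threshold are hypothetical stand-ins for your real pipeline, quality metric, and tolerance:

```javascript
// Score a pipeline with and without one candidate agent over a test set.
// `runPipeline(agents, testCase)` and `score(output, testCase)` are
// hypothetical: supply your own pipeline runner and quality metric.
function subtractionTest(agents, candidate, testCases, runPipeline, score) {
  const without = agents.filter((a) => a !== candidate);
  let withSum = 0;
  let withoutSum = 0;
  for (const tc of testCases) {
    withSum += score(runPipeline(agents, tc), tc);
    withoutSum += score(runPipeline(without, tc), tc);
  }
  const delta = (withSum - withoutSum) / testCases.length;
  // If quality barely drops, the candidate is a removal candidate.
  return { delta, removable: delta < 0.01 };
}
```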
Golden rules
1. Start simple. Single call first.
2. Add agents only when single call demonstrably fails.
3. Measure EVERYTHING: quality, latency, cost.
4. Remove agents that don't measurably improve output.
5. The best architecture is the simplest one that works.
6. Multi-agent is a TOOL, not a GOAL.
7. Log every pipeline step from day one.
Quick mental model
Multi-agent concerns:
Latency = N agents means N x slower (sequential)
Cost = N agents means N x token charges
Debugging = which of N agents caused the error?
State = data flowing between N agents needs management
Necessity = do you actually need N agents? (usually not)
Latency, cost, and debugging difficulty all grow in direct
proportion to the NUMBER of agents.
Keep N as small as possible.
End of 4.19 quick revision.