Episode 4 — Generative AI Engineering / 4.19 — Multi Agent Architecture Concerns
4.19.c — Debugging Across Agents
In one sentence: When a multi-agent pipeline produces a wrong answer, finding which agent caused the problem is like debugging a game of telephone -- errors propagate, compound, and disguise themselves, making structured logging, trace IDs, and pipeline visualization not optional but essential.
Navigation: <- 4.19.b Higher Operational Cost | 4.19.d -- Managing Shared State ->
1. Why Debugging Multi-Agent Systems Is Hard
In a single LLM call, debugging is straightforward: you see the prompt, you see the output, and if the output is wrong, you adjust the prompt. In a multi-agent pipeline, the complexity explodes.
SINGLE CALL DEBUGGING:
Prompt -> [LLM] -> Output
If Output is wrong, check Prompt. Done.
MULTI-AGENT DEBUGGING:
Input -> [Agent A] -> intermediate_1 -> [Agent B] -> intermediate_2 -> [Agent C] -> Output
          |                          |                        |
   Was A's output           Did B misinterpret         Did C fail on
     correct?                  A's output?               its own?
If Output is wrong, which agent caused it?
- Agent A produced wrong intermediate result?
- Agent B received correct input but produced wrong output?
- Agent C received wrong input (from B) but would have been fine otherwise?
- The combination of individually "okay" outputs produced a bad final result?
The three debugging nightmares
Nightmare 1: Error propagation. Agent A makes a subtle mistake (classifies "billing complaint" as "technical issue"). Agent B researches technical solutions. Agent C writes a confident response about resetting the router. The user gets a useless answer, and the bug is in Agent A, not Agent C.
Nightmare 2: No single point of failure. Each agent's output is individually reasonable, but the combination is wrong. Agent A correctly extracts a date, Agent B correctly converts the timezone, Agent C correctly formats the response -- but A and B assumed different date formats, so the final date is wrong.
Nightmare 3: Non-determinism. The pipeline worked in testing but fails in production. Because LLMs are probabilistic, the same input can produce different intermediate outputs, some of which trigger downstream failures. The bug is intermittent and hard to reproduce.
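Nightmare 2 can be made concrete in a few lines. In this sketch (all names are illustrative, and the LLM steps are hard-coded), each agent is locally correct, yet the handoff produces the wrong date because A emits MM/DD/YYYY and B parses DD/MM/YYYY:

```javascript
// Agent A extracts a date and emits it in US format (MM/DD/YYYY).
// (In reality this is an LLM call; hard-coded here for illustration.)
function extractDate(text) {
  return '03/04/2025'; // A means March 4th
}

// Agent B parses the handoff assuming DD/MM/YYYY -- a plausible
// default that was never agreed on with A.
function parseHandoffDate(dateStr) {
  const [day, month, year] = dateStr.split('/').map(Number);
  return new Date(Date.UTC(year, month - 1, day));
}

const handoff = extractDate('meeting on March 4th, 2025');
const parsed = parseHandoffDate(handoff);
// A meant March 4th; B decoded April 3rd. Neither agent "failed" on its own.
console.log(parsed.toISOString().slice(0, 10)); // 2025-04-03
```

Tested in isolation, both functions pass; only the undocumented handoff contract is broken.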
2. Error Propagation: The Corruption Chain
When Agent A's output feeds Agent B's input, a mistake in A corrupts the entire downstream pipeline. Unlike traditional software where errors often throw exceptions, LLM errors are silent -- the agent produces plausible-looking but wrong text.
ERROR PROPAGATION EXAMPLE:
User: "What's the refund policy for orders over $100?"
Agent A (Intent Classifier):
Expected: { intent: "policy_inquiry", topic: "refunds" }
Actual: { intent: "order_status", topic: "orders" } <-- WRONG
(Looks valid! No error thrown. Just the wrong classification.)
Agent B (Knowledge Retriever):
Receives: { intent: "order_status", topic: "orders" }
Retrieves: Order tracking documentation <-- CORRECT for its input
(Agent B did its job perfectly -- but it received bad input.)
Agent C (Response Generator):
Receives: Order tracking docs + user's refund question
Output: "To check your order status, please visit..." <-- WRONG ANSWER
(Agent C combined the docs with the question. Looks confident and polished.)
User sees a helpful-looking but completely wrong response.
Bug is in Agent A. But it LOOKS like Agent C's fault.
The amplification problem
Small errors early in the pipeline get amplified by later agents, because each agent treats its input as authoritative.
Error severity through the pipeline:
Agent A: slight misclassification (3/10 severity)
|
v
Agent B: retrieves wrong knowledge base (6/10 severity)
|
v
Agent C: generates confidently wrong answer (9/10 severity)
|
v
User: receives wrong information, loses trust (10/10 impact)
One small error -> cascading failure -> user harm
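One way to break this chain is to validate every handoff against an explicit contract, so a corrupted intermediate result fails loudly at the boundary instead of three agents later. A minimal sketch, with an illustrative schema and field names:

```javascript
// Contract for the classifier's output: allowed values per field.
// (Field names and intents are illustrative.)
const CLASSIFIER_CONTRACT = {
  intent: ['policy_inquiry', 'order_status', 'refund', 'technical', 'greeting'],
  topic: 'string',
};

// Throw at the handoff if the output violates the contract.
function validateHandoff(agentName, output, contract) {
  for (const [field, rule] of Object.entries(contract)) {
    const value = output[field];
    if (Array.isArray(rule) && !rule.includes(value)) {
      throw new Error(`${agentName}: field "${field}" has unexpected value "${value}"`);
    }
    if (rule === 'string' && typeof value !== 'string') {
      throw new Error(`${agentName}: field "${field}" must be a string`);
    }
  }
  return output;
}

// A well-formed handoff passes through unchanged:
validateHandoff('classifier', { intent: 'refund', topic: 'refunds' }, CLASSIFIER_CONTRACT);

// A corrupted handoff is caught at the boundary, not three agents later:
try {
  validateHandoff('classifier', { intent: 'banana', topic: 'refunds' }, CLASSIFIER_CONTRACT);
} catch (e) {
  console.log(e.message); // classifier: field "intent" has unexpected value "banana"
}
```

This does not catch semantic errors (a plausible but wrong intent), but it eliminates an entire class of silent format corruption.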
3. The "Blame Game": Which Agent Caused the Wrong Output?
When a user reports a bad response, you need a systematic approach to identify the responsible agent. Random guessing wastes time.
Step-by-step blame isolation
BLAME ISOLATION PROTOCOL:
1. Capture the exact final output that was wrong.
2. Retrieve the FULL pipeline trace (all agent inputs and outputs).
3. Walk BACKWARDS through the pipeline:
a. Is Agent C's output wrong GIVEN its input?
YES -> Bug is in Agent C (or its prompt).
NO -> Agent C is fine. Check Agent B.
b. Is Agent B's output wrong GIVEN its input?
YES -> Bug is in Agent B.
NO -> Agent B is fine. Check Agent A.
c. Is Agent A's output wrong GIVEN its input (the original user message)?
YES -> Bug is in Agent A.
NO -> All agents are individually correct.
The bug is in the HANDOFF (data format, missing context, etc.)
4. Fix the identified agent or handoff.
5. Re-run the SAME input through the pipeline to verify the fix.
Code: automated blame checker
async function diagnoseFailure(pipelineTrace) {
console.log('=== PIPELINE DIAGNOSIS ===\n');
// Walk backwards through agents
const agents = [...pipelineTrace.agents].reverse();
for (const agent of agents) {
console.log(`--- Checking ${agent.name} ---`);
console.log(` Input: ${JSON.stringify(agent.input).slice(0, 200)}...`);
console.log(` Output: ${JSON.stringify(agent.output).slice(0, 200)}...`);
// Ask a judge LLM: "Given this input, is this output reasonable?"
const judgment = await evaluateAgentOutput({
agentName: agent.name,
agentPrompt: agent.systemPrompt,
input: agent.input,
output: agent.output,
});
console.log(` Judgment: ${judgment.verdict}`);
console.log(` Reason: ${judgment.reason}`);
if (judgment.verdict === 'INCORRECT') {
console.log(`\n>>> ROOT CAUSE: ${agent.name} produced incorrect output.`);
console.log(`>>> Fix: Review ${agent.name}'s system prompt or model.`);
return agent.name;
}
}
console.log('\n>>> All agents individually correct.');
console.log('>>> Root cause likely in data format or handoff between agents.');
return 'HANDOFF_ISSUE';
}
4. Logging Strategies: The Foundation of Debugging
Trace IDs: Linking all agents in a request
Every request to the pipeline must get a unique trace ID that follows it through every agent. Without this, you cannot reconstruct what happened.
import { randomUUID } from 'crypto';
function createTraceContext(userInput) {
return {
traceId: randomUUID(),
startTime: Date.now(),
userInput,
agentLogs: [],
};
}
async function loggedAgentCall(trace, agentName, agentFn, input) {
const agentStart = Date.now();
const log = {
traceId: trace.traceId,
agentName,
input: JSON.parse(JSON.stringify(input)), // Deep copy to freeze state
timestamp: new Date().toISOString(),
};
try {
const output = await agentFn(input);
log.output = JSON.parse(JSON.stringify(output));
log.status = 'success';
log.durationMs = Date.now() - agentStart;
log.tokenUsage = output._tokenUsage || null; // If available from API
return output;
} catch (error) {
log.error = { message: error.message, stack: error.stack };
log.status = 'error';
log.durationMs = Date.now() - agentStart;
throw error;
} finally {
trace.agentLogs.push(log);
// Also persist to your logging system
console.log(JSON.stringify(log, null, 2));
}
}
Using the logging system in a pipeline
async function tracedPipeline(userInput) {
const trace = createTraceContext(userInput);
try {
const classification = await loggedAgentCall(
trace, 'classifier', classifyIntent, userInput
);
const research = await loggedAgentCall(
trace, 'researcher', conductResearch, classification
);
const response = await loggedAgentCall(
trace, 'responder', generateResponse, research
);
trace.status = 'success';
trace.totalDurationMs = Date.now() - trace.startTime;
return response;
} catch (error) {
trace.status = 'error';
trace.error = error.message;
trace.totalDurationMs = Date.now() - trace.startTime;
throw error;
} finally {
// Persist the full trace
await persistTrace(trace);
}
}
What to log for each agent
+-----------------------------------------------------------------------+
| MANDATORY LOG FIELDS PER AGENT CALL |
+-----------------------------------------------------------------------+
| Field | Why |
|---------------------+-------------------------------------------------|
| traceId | Link all agents in a single user request |
| agentName | Which agent ran |
| timestamp | When it ran |
| input (full) | Exactly what the agent received |
| systemPrompt | The prompt used (for prompt version tracking) |
| model | Which LLM model was called |
| modelParams | temperature, max_tokens, etc. |
| output (full) | Exactly what the agent returned |
| tokenUsage | Input tokens, output tokens |
| durationMs | How long the call took |
| status | success / error / timeout |
| error (if any) | Error message and stack trace |
+-----------------------------------------------------------------------+
5. Visualizing Agent Pipelines
When debugging, a visual representation of the pipeline makes it much easier to spot problems.
Text-based pipeline visualization
function visualizePipelineTrace(trace) {
console.log(`\nPipeline Trace: ${trace.traceId}`);
console.log(`User Input: "${trace.userInput.slice(0, 80)}..."`);
console.log(`Status: ${trace.status} | Total: ${trace.totalDurationMs}ms\n`);
const agents = trace.agentLogs;
const maxNameLen = Math.max(...agents.map((a) => a.agentName.length));
for (let i = 0; i < agents.length; i++) {
const a = agents[i];
const name = a.agentName.padEnd(maxNameLen);
const status = a.status === 'success' ? '[OK]' : '[FAIL]';
const duration = `${a.durationMs}ms`.padStart(7);
console.log(` ${status} ${name} ${duration} tokens: ${a.tokenUsage?.total || '?'}`);
// Show abbreviated output
const outputPreview = JSON.stringify(a.output).slice(0, 60);
console.log(` ${' '.repeat(maxNameLen + 7)} -> ${outputPreview}...`);
if (i < agents.length - 1) {
console.log(` ${' '.repeat(maxNameLen + 7)} |`);
console.log(` ${' '.repeat(maxNameLen + 7)} v`);
}
}
console.log(`\nFinal Output: "${JSON.stringify(agents[agents.length - 1]?.output).slice(0, 100)}..."`);
}
Example output:
Pipeline Trace: a1b2c3d4-e5f6-7890-abcd-ef1234567890
User Input: "What's the refund policy for orders over $100?..."
Status: success | Total: 3124ms
[OK] classifier 342ms tokens: 420
-> {"intent":"policy_inquiry","topic":"refunds"}...
|
v
[OK] researcher 1847ms tokens: 2800
-> {"documents":[{"title":"Refund Policy","content...
|
v
[OK] responder 923ms tokens: 2300
-> "Our refund policy for orders over $100 allows...
Final Output: "Our refund policy for orders over $100 allows you to return items within 30 d..."
ASCII pipeline diagram generator
function drawPipelineDiagram(agents, parallelGroups = []) {
  // Inner width: wide enough for the longest agent name plus brackets,
  // so the right border stays aligned for any name length.
  const inner = Math.max(19, ...agents.map((a) => a.length + 6));
  const row = (text) => '|' + text.padEnd(inner) + '|\n';
  let diagram = '+--- Pipeline Flow ' + '-'.repeat(inner - 18) + '+\n';
  diagram += row('');
  diagram += row(' [User Input]');
  diagram += row('      |');
  for (let i = 0; i < agents.length; i++) {
    const parallelGroup = parallelGroups.find((g) => g.includes(i));
    if (parallelGroup && parallelGroup[0] === i) {
      // First member of a parallel group: draw the whole fan-out once
      diagram += row('      v');
      diagram += row('   +--+--+');
      for (const idx of parallelGroup) {
        diagram += row('   |' + agents[idx].padEnd(7) + '|');
      }
      diagram += row('   +--+--+');
    } else if (!parallelGroup) {
      // Sequential agent
      diagram += row('      v');
      diagram += row(' [' + agents[i] + ']');
    }
    // Later members of a parallel group were already drawn above.
  }
  diagram += row('      |');
  diagram += row(' [Final Output]');
  diagram += row('');
  diagram += '+' + '-'.repeat(inner) + '+\n';
  return diagram;
}
6. Tracing Tools
Custom structured tracing
For many teams, a custom tracing solution built on your existing logging infrastructure is sufficient and gives you full control.
// Structured trace storage with query capability
class TraceStore {
constructor() {
this.traces = new Map();
}
save(trace) {
this.traces.set(trace.traceId, {
...trace,
savedAt: new Date().toISOString(),
});
  }
  // Look up a single trace by ID (used by replay tooling)
  get(traceId) {
    return this.traces.get(traceId);
  }
// Find traces where a specific agent failed
findFailedTraces(agentName) {
const results = [];
for (const trace of this.traces.values()) {
const agentLog = trace.agentLogs.find(
(a) => a.agentName === agentName && a.status === 'error'
);
if (agentLog) results.push(trace);
}
return results;
}
// Find slow traces
findSlowTraces(thresholdMs) {
const results = [];
for (const trace of this.traces.values()) {
if (trace.totalDurationMs > thresholdMs) results.push(trace);
}
return results.sort((a, b) => b.totalDurationMs - a.totalDurationMs);
}
// Get aggregate stats per agent
agentStats() {
const stats = {};
for (const trace of this.traces.values()) {
for (const log of trace.agentLogs) {
if (!stats[log.agentName]) {
stats[log.agentName] = {
calls: 0, errors: 0, totalDuration: 0, durations: [],
};
}
stats[log.agentName].calls++;
if (log.status === 'error') stats[log.agentName].errors++;
stats[log.agentName].totalDuration += log.durationMs;
stats[log.agentName].durations.push(log.durationMs);
}
}
// Calculate percentiles
for (const [name, s] of Object.entries(stats)) {
s.durations.sort((a, b) => a - b);
s.p50 = s.durations[Math.floor(s.durations.length * 0.5)];
s.p95 = s.durations[Math.floor(s.durations.length * 0.95)];
s.p99 = s.durations[Math.floor(s.durations.length * 0.99)];
s.errorRate = ((s.errors / s.calls) * 100).toFixed(2) + '%';
delete s.durations; // Clean up raw data
}
return stats;
}
}
// Usage
const store = new TraceStore();
// After each pipeline run:
store.save(trace);
// Debugging:
const failedClassifications = store.findFailedTraces('classifier');
const slowPipelines = store.findSlowTraces(5000); // > 5 seconds
const stats = store.agentStats();
console.table(stats);
LangSmith integration pattern
LangSmith (by LangChain) provides hosted tracing for LLM applications. Here is the integration pattern:
// LangSmith integration concept (simplified)
// In production, use the actual LangSmith SDK
import { Client } from 'langsmith';
const langsmith = new Client();
async function tracedAgentCall(runName, agentFn, input, parentRunId) {
const run = await langsmith.createRun({
name: runName,
run_type: 'chain',
inputs: { input },
parent_run_id: parentRunId,
});
try {
const output = await agentFn(input);
await langsmith.updateRun(run.id, {
outputs: { output },
end_time: new Date(),
});
return output;
} catch (error) {
await langsmith.updateRun(run.id, {
error: error.message,
end_time: new Date(),
});
throw error;
}
}
Tool comparison
+------------------------------------------------------------------+
| TRACING TOOL COMPARISON |
+------------------------------------------------------------------+
| Tool | Best For | Effort | Cost |
|-----------------+-----------------------+------------+-----------|
| Console logs | Quick debugging | Minimal | Free |
| Custom tracer | Full control, | Medium | Free |
| (as shown) | fits your infra | | |
| LangSmith       | LangChain ecosystem,  | Low        | Freemium  |
| | hosted dashboard | | |
| OpenTelemetry | Enterprise, existing | High | Varies |
| | observability stack | | |
| Arize Phoenix | ML-specific tracing, | Medium | Free/OSS |
| | evaluations | | |
+------------------------------------------------------------------+
7. Reproducing Bugs in Multi-Agent Systems
The hardest part of debugging LLM systems is that they are non-deterministic. The same input might succeed 9 out of 10 times and fail once. Here are strategies to make bugs reproducible.
Strategy 1: Log everything, replay from logs
// Save full pipeline state to recreate the exact scenario
async function replayFromTrace(traceId, traceStore) {
const trace = traceStore.get(traceId);
console.log(`Replaying trace ${traceId}`);
console.log(`Original user input: "${trace.userInput}"`);
console.log(`Original time: ${trace.agentLogs[0].timestamp}`);
console.log();
for (const agentLog of trace.agentLogs) {
console.log(`--- Replaying ${agentLog.agentName} ---`);
console.log(`Original input: ${JSON.stringify(agentLog.input).slice(0, 100)}`);
console.log(`Original output: ${JSON.stringify(agentLog.output).slice(0, 100)}`);
// Re-run the agent with the SAME input
const newOutput = await callAgent(agentLog.agentName, agentLog.input);
console.log(`Replayed output: ${JSON.stringify(newOutput).slice(0, 100)}`);
const matches = JSON.stringify(newOutput) === JSON.stringify(agentLog.output);
console.log(`Output matches: ${matches}`);
if (!matches) {
console.log(`>>> DIVERGENCE at ${agentLog.agentName}`);
console.log(`>>> This agent produced different output on replay.`);
console.log(`>>> This is expected with LLMs -- check if the new output is also wrong.`);
}
console.log();
}
}
Strategy 2: Use temperature 0 and seed for reproducibility
// For debugging, pin the model parameters to make output reproducible
const DEBUG_CONFIG = {
temperature: 0,
seed: 42, // Same seed = same output (best effort, not guaranteed)
};
async function debugAgentCall(agentPrompt, input) {
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [
{ role: 'system', content: agentPrompt },
{ role: 'user', content: JSON.stringify(input) },
],
...DEBUG_CONFIG,
});
// Log the system_fingerprint for reproducibility tracking
console.log(`System fingerprint: ${response.system_fingerprint}`);
return response.choices[0].message.content;
}
Strategy 3: Agent-level unit testing
Test each agent in isolation with known inputs and expected outputs.
// Test each agent independently
async function testClassifier() {
const testCases = [
{ input: 'I want a refund', expected: { intent: 'refund' } },
{ input: 'How do I reset my password?', expected: { intent: 'technical' } },
{ input: 'Hi there!', expected: { intent: 'greeting' } },
];
let passed = 0;
for (const tc of testCases) {
const result = await classifyIntent(tc.input);
const match = result.intent === tc.expected.intent;
console.log(` ${match ? 'PASS' : 'FAIL'}: "${tc.input}" -> ${result.intent} (expected: ${tc.expected.intent})`);
if (match) passed++;
}
console.log(`\nClassifier: ${passed}/${testCases.length} passed`);
}
// Test the full pipeline with integration tests
async function testFullPipeline() {
const testCases = [
{
input: 'What is your refund policy?',
expectedContains: ['refund', 'return', 'days'],
expectedNotContains: ['reset password', 'technical support'],
},
];
for (const tc of testCases) {
const result = await tracedPipeline(tc.input);
const containsExpected = tc.expectedContains.every((word) =>
result.toLowerCase().includes(word)
);
const avoidsBadContent = tc.expectedNotContains.every((word) =>
!result.toLowerCase().includes(word)
);
console.log(`Input: "${tc.input}"`);
console.log(`Contains expected terms: ${containsExpected}`);
console.log(`Avoids bad content: ${avoidsBadContent}`);
console.log(`Result: ${containsExpected && avoidsBadContent ? 'PASS' : 'FAIL'}`);
}
}
8. Key Takeaways
- Multi-agent debugging is fundamentally harder than single-call debugging. Errors propagate silently -- a wrong classification in Agent A produces a confidently wrong final answer.
- Trace IDs are non-negotiable. Every request needs a unique ID that links all agent calls together, or you cannot reconstruct what happened.
- Log full inputs and outputs for every agent. You need to see exactly what each agent received and produced to identify the faulty one.
- Walk backwards from the bad output to isolate blame. Check each agent: "Given this input, is this output correct?" The first agent that fails this test is your root cause.
- Pipeline visualization makes patterns visible. A text-based or graphical trace view reveals bottlenecks, errors, and unexpected data flow at a glance.
- Non-determinism makes reproduction hard. Use temperature 0 + seed for debugging, log everything for replay, and test agents in isolation with fixed test cases.
- Invest in tracing infrastructure early. Retrofitting logging into a complex multi-agent system is painful. Build it in from day one.
Explain-It Challenge
- A customer reports a wrong answer. You pull the trace ID and see 4 agent logs. Walk through your debugging process step by step.
- Agent B produces a correct output 95% of the time but fails 5% of the time with identical input. How do you debug an intermittent failure in a non-deterministic system?
- Your team is debating whether to use LangSmith or build custom tracing. What factors would influence your recommendation?
Navigation: <- 4.19.b Higher Operational Cost | 4.19.d -- Managing Shared State ->