Episode 4 — Generative AI Engineering / 4.16 — Agent Design Patterns

Interview Questions: Agent Design Patterns

Model answers for Planner-Executor, Researcher-Writer, Critic-Refiner, Router agents, choosing the right pattern, and combining patterns.

How to use this material (instructions)

  1. Read lessons in order -- README.md, then 4.16.a -> 4.16.d.
  2. Practice out loud -- definition -> example -> pitfall.
  3. Pair with exercises -- 4.16-Exercise-Questions.md.
  4. Quick review -- 4.16-Quick-Revision.md.

Beginner (Q1-Q4)

Q1. What are agent design patterns and why do they matter?

Why interviewers ask: Tests whether you understand that multi-agent systems are not ad-hoc -- they follow proven architectural templates, just like software design patterns.

Model answer:

Agent design patterns are reusable architectural templates for dividing complex AI work across multiple specialized agents. They are the multi-agent equivalent of software design patterns (like MVC or Observer). Instead of building one monolithic agent that tries to do everything, you compose agents with clearly separated responsibilities.

The four foundational patterns are:

| Pattern | Division of Labor | Best For |
| --- | --- | --- |
| Planner-Executor | One agent plans, another executes | Complex multi-step tasks with dependencies |
| Researcher-Writer | One agent gathers facts, another synthesizes | Reports, summaries, content grounded in data |
| Critic-Refiner | One agent evaluates, another improves | Iterative quality improvement (code, writing, data) |
| Router | One agent classifies intent and dispatches | Multi-capability systems handling diverse requests |

Why they matter:

  1. Separation of concerns -- each agent has one job, one optimized system prompt, and one temperature setting. A research agent at temperature 0 finds facts precisely; a writing agent at temperature 0.4 produces readable prose. Combining both in one prompt compromises both.
  2. Auditability -- you can inspect each agent's output independently. If the final report has an error, you trace it to the Researcher (wrong fact) or the Writer (hallucinated a fact).
  3. Composability -- patterns stack. A Router can dispatch to a Planner-Executor, which uses a Researcher-Writer, whose output passes through a Critic-Refiner. Real production systems layer these.

// Without a pattern: one giant prompt doing everything
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    { role: "system", content: "Research, plan, execute, write, and review..." },
    { role: "user", content: task },
  ],
});
// Problems: no auditability, no specialization, poor quality

// With the Researcher-Writer pattern: separation of concerns
const facts = await runResearcher(topic);     // temperature 0, thorough
const report = await runWriter(facts, config); // temperature 0.4, polished
// Benefits: each agent is optimized for its job, facts are inspectable

Q2. Explain the Planner-Executor pattern. When would you use it over a single-prompt approach?

Why interviewers ask: The Planner-Executor is the most common multi-agent pattern. Understanding it demonstrates knowledge of task decomposition, dependency management, and structured plan execution.

Model answer:

The Planner-Executor pattern splits complex AI work into two agents:

  1. Planner Agent -- receives a high-level task and produces a structured plan (JSON with steps, tool names, parameters, dependencies). It thinks but does not act.
  2. Executor Agent -- takes the plan and carries out each step in dependency order, calling tools and capturing results. It acts but does not re-plan.

This mirrors how humans solve complex problems: you think before you act. A single prompt trying to both plan and execute a 10-step task often loses track of where it is, skips steps, or conflates planning with execution.

// Planner produces structured JSON
async function runPlanner(task) {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    temperature: 0,
    response_format: { type: "json_object" },
    messages: [
      { role: "system", content: `You are a Planning Agent. Break the task into steps.
Output JSON: { "steps": [{ "step_number": 1, "action": "tool_name",
  "parameters": {}, "depends_on": [], "description": "..." }] }` },
      { role: "user", content: task },
    ],
  });
  return JSON.parse(response.choices[0].message.content);
}

// Executor runs each step in dependency order
async function runExecutor(plan) {
  const results = {};
  for (const step of plan.steps) {
    const depsOk = step.depends_on.every((d) => results[d]?.status === "success");
    if (!depsOk) { results[step.step_number] = { status: "skipped" }; continue; }
    try {
      const output = await tools[step.action](step.parameters);
      results[step.step_number] = { status: "success", output };
    } catch (err) {
      results[step.step_number] = { status: "failed", error: err.message };
    }
  }
  return results;
}

When to use Planner-Executor over a single prompt:

| Condition | Single Prompt | Planner-Executor |
| --- | --- | --- |
| Task is 1-2 steps | Sufficient | Overkill |
| Task requires 5+ tools in sequence | Fails (loses track) | Excels |
| Steps have dependencies (step 3 needs step 2) | Unreliable | Dependency graph ensures correctness |
| You need auditability (log each step) | No visibility | Full plan and per-step results logged |
| Steps can run in parallel | Not possible | Independent steps run concurrently |

The key trade-off is that the Planner call adds latency and cost (one extra LLM call). For simple tasks, that overhead is not justified. For complex multi-step tasks (data pipelines, code migrations, deployment workflows), it is essential.
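
The Executor above dispatches through a `tools` map keyed by action name. A minimal sketch of such a registry -- the tool names (`fetch_data`, `summarize`) and their bodies are invented placeholders, not part of any real API:

```javascript
// Minimal tool registry the Executor can dispatch against. The tool names
// and implementations are illustrative; real entries would wrap API clients,
// database calls, LLM calls, and so on.
const tools = {
  fetch_data: async ({ url }) => {
    const res = await fetch(url); // global fetch (Node 18+)
    if (!res.ok) throw new Error(`fetch_data failed: HTTP ${res.status}`);
    return await res.json();
  },
  summarize: async ({ text, maxWords = 50 }) => {
    // Placeholder: a real implementation would call an LLM
    return text.split(/\s+/).slice(0, maxWords).join(" ");
  },
};
```

Keeping the registry as plain functions keyed by string name is what lets the Planner refer to tools by name in its JSON plan.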


Q3. What is the Researcher-Writer pattern and how does it prevent hallucination?

Why interviewers ask: Hallucination is one of the biggest production risks with LLMs. The Researcher-Writer pattern is a direct architectural solution, and understanding it shows awareness of grounding techniques.

Model answer:

The Researcher-Writer pattern separates information gathering from information synthesis:

  1. Researcher Agent -- uses tools (web search, RAG, APIs, databases) to gather raw facts. Returns structured JSON with facts, sources, statistics, and gaps. Runs at temperature 0 for factual precision.
  2. Writer Agent -- receives ONLY the Researcher's output and synthesizes it into polished content. Runs at temperature 0.3-0.5 for readable prose.

The hallucination prevention mechanism is the grounding constraint in the Writer's system prompt:

const WRITER_SYSTEM_PROMPT = `You are a professional writer.

CRITICAL RULE: Use ONLY the facts provided in the research data below.
Do NOT add any information from your training data.
If the research has gaps, acknowledge them -- do NOT fill gaps with made-up facts.
Cite sources using [Source Name] format.`;

This works because:

  1. Separation forces grounding -- the Writer has no tools. Its only source of truth is the Researcher's output. Without this separation, a single "research and write" prompt blends tool results with training data, making hallucination invisible.
  2. Auditability -- you can compare the Writer's output against the Researcher's facts. If the Writer mentions a statistic that is not in the research, you detect the hallucination programmatically.
  3. Temperature isolation -- the Researcher at temperature 0 retrieves facts precisely. The Writer at temperature 0.4 produces good prose. A single prompt forces one temperature for both, compromising either accuracy or readability.

// Validate Writer output against Researcher facts
function detectHallucination(writerOutput, researchData) {
  const researchFacts = researchData.facts.map((f) => f.fact.toLowerCase());
  const stats = researchData.key_statistics.map((s) => s.value);

  // Extract numbers from Writer output and check they exist in research
  const numbersInOutput = writerOutput.match(/\d+[\d,.]*%?/g) || [];
  const ungrounded = numbersInOutput.filter(
    (num) => !stats.some((s) => s.includes(num))
  );

  return {
    potentialHallucinations: ungrounded,
    isClean: ungrounded.length === 0,
  };
}
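
To make the heuristic concrete, here is a condensed, self-contained form of the number-grounding check with hypothetical sample data -- the statistics and sentences are invented for illustration:

```javascript
// Condensed form of the number-grounding check: flag any number in the
// Writer's output that does not appear among the research statistics.
function ungroundedNumbers(writerOutput, statValues) {
  const numbers = writerOutput.match(/\d+[\d,.]*%?/g) || [];
  return numbers.filter((num) => !statValues.some((s) => s.includes(num)));
}

// Hypothetical research statistics
const stats = ["42%", "2024"];

console.log(ungroundedNumbers("Adoption grew 42% during 2024 surveys.", stats)); // []
console.log(ungroundedNumbers("Adoption grew 57% during 2024 surveys.", stats)); // ["57%"]
```

This is a coarse heuristic -- it catches fabricated statistics but not fabricated prose claims, which need a stronger check (e.g., an LLM-based fact comparison).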

Q4. What is the Router pattern and how does it differ from function calling?

Why interviewers ask: Routers and function calling look similar on the surface -- both involve dispatching to different handlers. Distinguishing them demonstrates understanding of system architecture at two different levels.

Model answer:

The Router pattern uses a lightweight routing agent to classify user intent and dispatch the entire request to a specialized agent. Each specialized agent has its own system prompt, temperature, tools, and optionally its own model.

Function calling lets a single agent decide which tools to invoke within its own conversation loop.

They operate at different levels of the architecture:

Router Pattern                          Function Calling
      |                                       |
      v                                       v
+-----------+                          +--------------+
| Router    | classifies intent        | Code Agent   | uses tool calling
|           |-- "code_help" ---------->|              | to call run_code(),
|           |                          |              | lint_code(), etc.
+-----------+                          +--------------+

The Router decides WHICH agent handles the request.
Function calling decides WHICH tools the chosen agent uses.
| Aspect | Router Pattern | Function Calling |
| --- | --- | --- |
| What decides | Router classifies intent, dispatches to an agent | LLM picks which tool to call |
| Granularity | Entire conversation flow | Individual tool calls within one conversation |
| System prompt | Different system prompt per handler | One system prompt for all tools |
| Temperature | Different temperature per handler | One temperature setting |
| Model | Can use different models per handler | Same model for all tool calls |

// Router: classify and dispatch to specialized agent
async function route(message) {
  const { intent } = await classifyIntent(message); // "code_help"

  const handlers = {
    code_help:        { model: "gpt-4o", temperature: 0.2, prompt: "You are an expert coder..." },
    creative_writing: { model: "gpt-4o", temperature: 0.9, prompt: "You are a creative writer..." },
    data_analysis:    { model: "gpt-4o", temperature: 0,   prompt: "You analyze datasets..." },
    general:          { model: "gpt-4o", temperature: 0.3, prompt: "You are a helpful assistant..." },
  };

  const handler = handlers[intent] || handlers.general;
  return await callAgent(message, handler);
}

// Function calling: one agent picks tools dynamically
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages,
  tools: [searchTool, calculatorTool, codeTool], // Agent picks which to call
});

They are complementary, not competing. In production, the Router dispatches to a specialized agent, and that agent uses function calling internally to invoke its tools. The Router selects the agent; function calling selects the tools within that agent.
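
`classifyIntent` is used above but not shown. One possible sketch, using the same JSON-mode approach as the other examples -- the label set, the prompt wording, and the explicit `openai` client parameter are assumptions for illustration:

```javascript
// Hypothetical intent classifier. A guard normalizes the model's answer so
// an invented label can never reach the dispatch table.
const INTENTS = ["code_help", "creative_writing", "data_analysis", "general"];

function normalizeIntent(raw) {
  return INTENTS.includes(raw) ? raw : "general";
}

async function classifyIntent(openai, message) {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini", // cheap model: classification is a simple task
    temperature: 0,       // routing must be deterministic
    response_format: { type: "json_object" },
    messages: [
      { role: "system", content: `Classify the user's intent as one of: ${INTENTS.join(", ")}.
Output JSON: { "intent": "...", "confidence": 0.0 }` },
      { role: "user", content: message },
    ],
  });
  const parsed = JSON.parse(response.choices[0].message.content);
  return { intent: normalizeIntent(parsed.intent), confidence: parsed.confidence };
}
```

The normalization step matters in production: without it, a single off-list label from the model would bypass every handler.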


Intermediate (Q5-Q8)

Q5. How do you choose the right agent design pattern for a given task?

Why interviewers ask: Pattern selection is a design judgment skill. Picking the wrong pattern leads to over-engineering or under-engineering, both of which cost time and money.

Model answer:

The choice depends on three factors: task structure, quality requirements, and request diversity. I use a decision tree:

Question 1: Does the system handle MULTIPLE TYPES of requests?
  YES -> Start with a ROUTER to classify and dispatch
  NO  -> Continue

Question 2: Does the task require GATHERING information from external sources?
  YES -> Use RESEARCHER-WRITER (gather then synthesize)
  NO  -> Continue

Question 3: Does the task require MULTIPLE SEQUENTIAL STEPS with dependencies?
  YES -> Use PLANNER-EXECUTOR (decompose then execute)
  NO  -> Continue

Question 4: Does the output need ITERATIVE QUALITY IMPROVEMENT?
  YES -> Use CRITIC-REFINER (generate, critique, refine loop)
  NO  -> A single LLM call or simple agent is sufficient

Applied to real scenarios:

| Scenario | Primary Pattern | Why |
| --- | --- | --- |
| Customer support chatbot handling billing, tech, returns | Router | Multiple distinct request types needing different handling |
| "Analyze this CSV and produce a trend report" | Planner-Executor | Multi-step pipeline: load, clean, analyze, chart, report |
| "Write a market analysis of the EV industry" | Researcher-Writer | Must gather current data before writing |
| "Generate a legal contract from these terms" | Critic-Refiner | High-stakes output needs iterative quality improvement |
| "Research competitors, then write and polish a report" | Researcher-Writer + Critic-Refiner | Gather facts, synthesize, then iteratively improve |

// Decision function
function selectPattern(task) {
  if (task.requestTypes > 1)           return "router";
  if (task.needsExternalData)          return "researcher-writer";
  if (task.steps > 3 && task.hasDeps)  return "planner-executor";
  if (task.qualityThreshold >= 8)      return "critic-refiner";
  return "single-call";
}

Important: always start with the simplest pattern that solves the problem. A single LLM call is cheaper, faster, and easier to debug than any multi-agent pattern. Upgrade to a pattern only when the single call produces insufficient results.


Q6. Walk through the Critic-Refiner loop. What are the two exit conditions and why do you need both?

Why interviewers ask: Tests understanding of iterative improvement systems, termination conditions, and cost management -- critical for production deployments where each iteration costs money.

Model answer:

The Critic-Refiner pattern creates an iterative improvement loop with three agents:

  1. Generator -- produces the initial draft (temperature 0.7 for creative first attempt)
  2. Critic -- evaluates the draft against a rubric and returns structured feedback with scores, issues, and suggestions (temperature 0 for consistent evaluation)
  3. Refiner -- takes the draft plus the Critic's feedback and produces an improved version (temperature 0.3 for controlled improvement)

The loop: Generator -> Critic -> Refiner -> Critic -> Refiner -> ... until exit.

async function criticRefinerLoop(task, config = {}) {
  const { qualityThreshold = 8, maxIterations = 3 } = config;

  let content = await generate(task);

  for (let i = 1; i <= maxIterations; i++) {
    const feedback = await critique(content); // Returns { overall_score, issues[], strengths[] }

    // EXIT CONDITION 1: Quality threshold met
    if (feedback.overall_score >= qualityThreshold) {
      return { content, score: feedback.overall_score, iterations: i };
    }

    content = await refine(content, feedback);
  }

  // EXIT CONDITION 2: Max iterations reached (safety net)
  return { content, iterations: maxIterations, maxReached: true };
}

Why you need BOTH exit conditions:

| Exit Condition | Purpose | What Happens Without It |
| --- | --- | --- |
| Quality threshold (score >= 8) | Happy path -- stop when output is good enough | The loop always runs to max, wasting money on already-good content |
| Max iterations (e.g., 3) | Safety net -- prevent infinite loops | A harsh Critic that never gives 8+ creates an infinite loop burning tokens forever |

Cost analysis:

1 iteration  = 2 LLM calls (critique + refine)     ~$0.025
2 iterations = 4 LLM calls                          ~$0.050
3 iterations = 6 LLM calls                          ~$0.075

Generator    = 1 LLM call                           ~$0.020

Total (3 max iterations) = 7 LLM calls              ~$0.095
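
The arithmetic above can be captured in a small helper. The per-call prices are the illustrative figures from this breakdown, not real API pricing:

```javascript
// Back-of-envelope cost model for the loop above. Each iteration is one
// critique call plus one refine call; prices are illustrative assumptions.
function estimateLoopCost(iterations, { generatorCost = 0.02, costPerCall = 0.0125 } = {}) {
  return generatorCost + iterations * 2 * costPerCall;
}

console.log(estimateLoopCost(3).toFixed(3)); // "0.095"
```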

Diminishing returns is the third implicit exit condition. If the score improves from 7.0 to 7.2 between iterations 2 and 3, that 0.2-point improvement does not justify another $0.025. Detect and stop early:

function shouldContinue(history) {
  const scores = history.filter((h) => h.type === "critique").map((h) => h.score);
  if (scores.length < 2) return true;
  const delta = scores[scores.length - 1] - scores[scores.length - 2];
  return delta >= 1.0; // Stop if improvement < 1 point
}

Q7. How would you combine multiple agent design patterns in a single system?

Why interviewers ask: Real production systems rarely use one pattern in isolation. The ability to compose patterns demonstrates architectural thinking and practical experience.

Model answer:

The four patterns are composable building blocks. Each pattern solves one architectural concern, and you layer them to solve complex problems. The key insight is that each pattern's output becomes the next pattern's input.

Example: "Research the top 5 JavaScript frameworks and write a polished comparison blog post"

This task needs: routing (classify the request type), research (gather current facts), writing (synthesize), and quality improvement (polish). The full pipeline:

User Message
    |
    v
[ROUTER] --> classifies as "research_and_write"
    |
    v
[PLANNER-EXECUTOR] --> creates a 3-step plan:
    |
    |  Step 1: Research (Researcher agent)
    |  Step 2: Write   (Writer agent)
    |  Step 3: Polish  (Critic-Refiner loop)
    |
    v
[RESEARCHER] --> gathers facts about React, Vue, Angular, Svelte, Solid
    |             (web search, RAG, API calls)
    |             Returns: structured JSON with facts, stats, comparisons
    v
[WRITER] --> synthesizes facts into a blog post draft
    |         (uses ONLY Researcher's facts, cites sources)
    v
[CRITIC-REFINER LOOP] --> iteratively improves the draft
    |  Iteration 1: Critic scores 6/10 (missing code examples)
    |  Refiner adds code examples
    |  Iteration 2: Critic scores 8/10 (ready to publish)
    v
Final polished blog post
async function researchAndWritePipeline(query) {
  // Phase 1: Research
  const research = await runResearcher(query);
  const validation = validateResearch(research);
  if (!validation.valid) throw new Error(`Insufficient research: ${validation.issues}`);

  // Phase 2: Write
  const draft = await runWriter(research, { format: "blog post", audience: "developers" });

  // Phase 3: Polish via Critic-Refiner
  const polished = await criticRefinerLoop(draft, {
    qualityThreshold: 8,
    maxIterations: 3,
    criteria: ["accuracy", "completeness", "examples", "readability"],
  });

  return {
    finalContent: polished.content,
    research,         // Audit trail: what facts were the Writer given?
    iterations: polished.iterations,
    finalScore: polished.score,
  };
}

// Wrap in a Router for a multi-capability system
async function routeRequest(message) {
  const { intent } = await classifyIntent(message);

  const pipelines = {
    research_and_write: researchAndWritePipeline,
    data_analysis:      dataAnalysisPipeline,      // Planner-Executor
    code_review:        codeReviewPipeline,         // Critic-Refiner
    quick_question:     singleCallHandler,          // No pattern needed
  };

  return await (pipelines[intent] || pipelines.quick_question)(message);
}
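
`validateResearch` is called in the pipeline but not defined here. One plausible sketch, assuming the Researcher's facts carry a `source` field and treating the thresholds as tunable:

```javascript
// One possible validateResearch: check the Researcher returned enough
// grounded material before the Writer runs. Field names and thresholds
// are assumptions based on the research JSON shape described earlier.
function validateResearch(research, { minFacts = 3, minSources = 2 } = {}) {
  const issues = [];
  const facts = research.facts || [];
  if (facts.length < minFacts) {
    issues.push(`only ${facts.length} facts (need ${minFacts})`);
  }
  const sources = new Set(facts.map((f) => f.source).filter(Boolean));
  if (sources.size < minSources) {
    issues.push(`only ${sources.size} distinct sources (need ${minSources})`);
  }
  return { valid: issues.length === 0, issues };
}
```

Validating between pipeline stages is what makes the structured handoffs useful: a thin research result fails fast instead of producing a thin report.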

Composition rules:

| Rule | Rationale |
| --- | --- |
| Router goes at the front | It classifies and dispatches before any other pattern runs |
| Planner-Executor is the orchestrator | It breaks the task into steps that may invoke other patterns |
| Researcher-Writer is a pipeline stage | Research step feeds into a write step |
| Critic-Refiner goes at the end | It polishes whatever the upstream patterns produced |

Q8. What temperature settings would you use for each role in the four patterns, and why?

Why interviewers ask: Temperature is a subtle but high-impact configuration choice. Using the wrong temperature for a role degrades output quality in ways that are hard to debug.

Model answer:

Temperature controls the randomness of the LLM's output. Low temperature (0-0.2) produces precise, deterministic responses. High temperature (0.7-1.0) produces creative, varied responses. Each role in the four patterns has an optimal temperature based on its task:

| Pattern | Role | Temperature | Reasoning |
| --- | --- | --- | --- |
| Planner-Executor | Planner | 0 | Plans must be deterministic and reproducible. Creative planning produces inconsistent step counts and unreliable dependency graphs. |
| Planner-Executor | Executor | 0 | Execution must be precise. The Executor follows the plan exactly -- no creativity needed. |
| Researcher-Writer | Researcher | 0 | Research must be factual and precise. High temperature causes the Researcher to "creatively" rephrase facts, introducing subtle inaccuracies. |
| Researcher-Writer | Writer | 0.3-0.5 | Writing needs some creativity for readable prose, varied sentence structure, and engaging narrative. Temperature 0 produces flat, robotic writing. |
| Critic-Refiner | Generator | 0.7 | The first draft benefits from creative exploration. A wider range of ideas gives the Critic more to work with. |
| Critic-Refiner | Critic | 0 | Evaluation must be consistent and reproducible. A Critic at temperature 0.7 gives the same content a 5 one run and a 9 the next. |
| Critic-Refiner | Refiner | 0.3 | Refinement needs controlled creativity -- enough to rephrase and improve, not so much that it rewrites sections the Critic praised. |
| Router | Router | 0 | Classification must be deterministic. The same input should always route to the same handler. |
| Router | Code handler | 0-0.2 | Code generation needs precision and correctness. |
| Router | Creative handler | 0.8-1.0 | Creative writing benefits from variety and expressiveness. |
| Router | Data handler | 0 | Analysis must be precise and reproducible. |

const agentConfigs = {
  planner:    { temperature: 0,   model: "gpt-4o" },
  executor:   { temperature: 0,   model: "gpt-4o" },
  researcher: { temperature: 0,   model: "gpt-4o" },
  writer:     { temperature: 0.4, model: "gpt-4o" },
  generator:  { temperature: 0.7, model: "gpt-4o" },
  critic:     { temperature: 0,   model: "gpt-4o" },
  refiner:    { temperature: 0.3, model: "gpt-4o" },
  router:     { temperature: 0,   model: "gpt-4o-mini" }, // Cheap model, deterministic
};

The general principle: agents that judge, plan, or route use temperature 0. Agents that create use 0.3-0.7. Agents that need maximum creativity use 0.7-1.0. When in doubt, use temperature 0 -- consistency is almost always more valuable than creativity in production systems.
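
One way to apply the table in code is a generic runner that pulls model and temperature from a config map such as `agentConfigs` above. This is a sketch, not a fixed API -- the map is passed in explicitly, and the signature is an assumption:

```javascript
// Hypothetical generic runner: one function serves every role, with the
// per-role model and temperature supplied by a config map.
async function runRole(openai, configs, role, systemPrompt, userContent) {
  const config = configs[role];
  if (!config) throw new Error(`Unknown role: ${role}`); // fail fast, before any API call
  const response = await openai.chat.completions.create({
    model: config.model,
    temperature: config.temperature,
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: userContent },
    ],
  });
  return response.choices[0].message.content;
}
```

Centralizing the settings this way keeps temperature choices auditable in one place instead of scattered across agent implementations.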


Advanced (Q9-Q11)

Q9. A step in your Planner-Executor pipeline fails. Compare the three failure-handling strategies and explain when you would use each.

Why interviewers ask: Failure handling is where production systems diverge from demos. This tests your ability to design resilient systems that degrade gracefully instead of crashing.

Model answer:

When step N of a plan fails, there are three strategies: skip, retry, and re-plan. Each is appropriate for different failure types.

Strategy 1: Skip dependent steps. Mark the failed step and all steps that depend on it as "skipped." Continue executing independent steps.

// Skip dependents
for (const step of plan.steps) {
  const depsOk = step.depends_on.every((d) => results[d]?.status === "success");
  if (!depsOk) {
    results[step.step_number] = { status: "skipped" };
    continue;
  }
  // ... execute step ...
}

When to use: The failed step is non-critical. Example: step 5 (generate chart) fails, but step 6 (write report) can still produce a text-only report. The system delivers partial value rather than nothing.

Strategy 2: Retry with backoff. Retry the failed step up to N times with exponential backoff.

async function executeWithRetry(step, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await tools[step.action](step.parameters);
    } catch (error) {
      if (attempt === maxRetries) throw error;
      await new Promise((r) => setTimeout(r, 1000 * Math.pow(2, attempt)));
    }
  }
}

When to use: The failure is transient -- network timeout, rate limit, temporary API outage. The same call will likely succeed on the next attempt.

Strategy 3: Re-plan around the failure. Send the failed step, the error, and all completed results back to the Planner. Ask it to create a new plan that achieves the original goal using alternative tools or approaches.

async function replanOnFailure(originalPlan, failedStep, error, completedResults) {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    temperature: 0,
    response_format: { type: "json_object" },
    messages: [
      { role: "system", content: PLANNER_SYSTEM_PROMPT },
      { role: "user", content: `Step ${failedStep.step_number} failed: ${error}.
Completed results: ${JSON.stringify(completedResults)}.
Create a NEW plan to achieve: "${originalPlan.task_summary}".
Do NOT repeat completed steps. Work around the failure.` },
    ],
  });
  return JSON.parse(response.choices[0].message.content);
}

When to use: The failure is structural -- the tool does not exist, the approach is wrong, or the data format is incompatible. Retrying will not help. The Planner needs to find an alternative path.

Decision framework:

| Failure Type | Example | Strategy |
| --- | --- | --- |
| Transient error | Network timeout, rate limit | Retry (up to 3 times) |
| Non-critical step | Chart generation fails | Skip dependents, continue |
| Structural failure | Tool does not exist, wrong API | Re-plan with alternatives |
| Critical path failure | Data loading fails (all steps depend on it) | Re-plan or abort |
| Repeated failure after retries | API is down permanently | Re-plan or abort with partial results |

In production, combine all three in a cascading strategy: retry first (cheap), skip if retries fail and the step is non-critical (fast), re-plan if the step is critical and alternatives exist (expensive but effective).
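
The cascade can be sketched as a single step handler. The `execute`, `replan`, and `isCritical` callbacks are injected placeholders standing in for the real tool layer:

```javascript
// Sketch of the cascading strategy: retry first (cheap), then skip if the
// step is non-critical, then re-plan as a last resort.
async function handleStep(step, { execute, replan, isCritical, maxRetries = 3, backoffMs = 1000 }) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return { status: "success", output: await execute(step) };
    } catch (error) {
      if (attempt < maxRetries) {
        // Transient failures: wait with exponential backoff, then retry
        await new Promise((r) => setTimeout(r, backoffMs * 2 ** attempt));
        continue;
      }
      // Retries exhausted: skip non-critical steps, re-plan critical ones
      if (!isCritical(step)) return { status: "skipped", error: error.message };
      return { status: "replanned", newPlan: await replan(step, error) };
    }
  }
}
```

The ordering encodes the cost hierarchy: retries cost only latency, skipping costs partial output, and re-planning costs an extra Planner call.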


Q10. Design a production system that uses all four agent design patterns together. Walk through the architecture.

Why interviewers ask: Tests end-to-end architectural thinking at the highest level. A complete answer covers pattern composition, data flow, error handling, cost management, and observability.

Model answer:

Scenario: An enterprise AI assistant that handles diverse employee requests -- code help, report generation, data analysis, and general questions.

Architecture:

                         ┌──────────────┐
                         │  User Input  │
                         └──────┬───────┘
                                │
                                v
                    ┌───────────────────────┐
          Tier 1:   │   ROUTER (gpt-4o-mini)│  Classifies intent
                    │   temperature: 0      │  Cost: ~$0.00015/call
                    └───────────┬───────────┘
                                │
              ┌─────────────────┼──────────────────┐
              │                 │                   │
              v                 v                   v
     ┌────────────┐   ┌────────────────┐   ┌────────────────┐
     │ Code Help  │   │ Report Gen     │   │ Quick Question │
     │ (single    │   │ (multi-pattern │   │ (single call)  │
     │  agent)    │   │  pipeline)     │   │                │
     └────────────┘   └───────┬────────┘   └────────────────┘
                              │
              Tier 2:         v
                    ┌───────────────────────┐
                    │ PLANNER-EXECUTOR      │  Orchestrates the pipeline
                    │ Step 1: Research      │
                    │ Step 2: Write         │
                    │ Step 3: Polish        │
                    └───────────┬───────────┘
                                │
              Tier 3:    ┌──────┴──────┐
                         v             v
               ┌──────────────┐ ┌──────────────┐
               │ RESEARCHER   │ │ WRITER       │
               │ temp: 0      │ │ temp: 0.4    │
               │ tools: search│ │ no tools     │
               └──────┬───────┘ └──────┬───────┘
                      │                │
              Tier 4: │                v
                      │      ┌──────────────────┐
                      │      │ CRITIC-REFINER   │  Iterative polish
                      │      │ threshold: 8/10  │
                      │      │ max iterations: 3│
                      │      └──────────────────┘
                      │                │
                      v                v
               ┌──────────────────────────┐
               │   Final Response +       │
               │   Metadata + Audit Trail │
               └──────────────────────────┘

Implementation skeleton:

// Tier 1: Router
async function handleRequest(userMessage) {
  const { intent, confidence } = await classifyIntent(userMessage);

  if (confidence < 0.5) return await askForClarification(userMessage);

  const pipelines = {
    report_generation: reportPipeline,
    code_help:         codeHelpPipeline,
    data_analysis:     dataAnalysisPipeline,
    quick_question:    singleCallHandler,
  };

  const startTime = Date.now();
  const result = await (pipelines[intent] || singleCallHandler)(userMessage);

  // Log metrics
  logMetrics({ intent, confidence, latencyMs: Date.now() - startTime });
  return result;
}

// Tier 2-4: Report generation pipeline
async function reportPipeline(query) {
  // Planner creates the execution plan
  const plan = await runPlanner(query);

  // Execute: Research phase (Researcher-Writer)
  let research = await runResearcher(query);
  const validation = validateResearch(research);
  if (!validation.valid) {
    research = await runResearcher(query, { deeper: true }); // Retry with more depth
  }

  // Execute: Write phase
  const draft = await runWriter(research, { format: "report", audience: "management" });

  // Execute: Polish phase (Critic-Refiner)
  const final = await criticRefinerLoop(draft, {
    qualityThreshold: 8,
    maxIterations: 3,
  });

  return {
    content: final.content,
    auditTrail: { research, draftLength: draft.length, finalScore: final.score },
  };
}

Cost budget per request type:

| Route | LLM Calls | Estimated Cost | Latency |
| --- | --- | --- | --- |
| Quick question | 1 (Router) + 1 (answer) | ~$0.005 | 2-3s |
| Code help | 1 (Router) + 1 (agent with tools) | ~$0.01 | 3-5s |
| Report generation | 1 (Router) + 1 (Planner) + 1 (Researcher) + 1 (Writer) + 4-6 (Critic-Refiner) | ~$0.15 | 20-40s |

Safety and observability:

  • maxIterations on every loop (agent loop, Critic-Refiner loop, re-plan loop)
  • Per-request token budget with forced early termination
  • All handoffs are structured JSON -- inspectable at every stage
  • Router logs intent distribution, confidence histograms, and fallback rates
  • Critic-Refiner logs score progression to detect quality drift over time

Q11. Your Critic-Refiner loop produces inconsistent scores -- the same content gets 5/10 on one run and 9/10 on the next. How do you diagnose and fix this?

Why interviewers ask: Critic consistency is one of the most subtle and dangerous failure modes in production Critic-Refiner systems. Diagnosing it requires understanding of temperature, prompt specificity, and evaluation methodology.

Model answer:

Inconsistent Critic scores mean the loop either exits too early (lucky high score) or runs too many iterations (unlucky low score). This wastes money and produces unpredictable quality. Diagnosis follows a systematic checklist:

Diagnosis step 1: Check the temperature.

The most common cause. If the Critic runs at temperature > 0, the model's sampling randomness produces different scores for the same content.

// BAD: temperature 0.5 causes score variance
const critic = { model: "gpt-4o", temperature: 0.5 };

// GOOD: temperature 0 for deterministic evaluation
const critic = { model: "gpt-4o", temperature: 0 };

Fix: Set the Critic to temperature 0. This alone often resolves the issue.

Diagnosis step 2: Check the rubric specificity.

Vague criteria produce subjective scores. "Is the writing good?" is evaluated differently each time. "Does every claim cite a source? Score 1 if <50% cited, 5 if 50-80%, 8 if 80-95%, 10 if 100%" is evaluated consistently.

// BAD: vague rubric
const criticPrompt = "Rate the content quality from 1-10.";

// GOOD: explicit anchored rubric
const criticPrompt = `Rate each criterion 1-10 using these anchors:

ACCURACY:
  10: Every claim is verifiable and correctly sourced
  8:  All claims correct, 1-2 missing citations
  5:  Some claims unverified, no major errors
  3:  Multiple unverified claims
  1:  Contains factual errors

COMPLETENESS:
  10: Covers all aspects of the topic exhaustively
  8:  Covers main points, minor omissions
  5:  Covers basics, missing important aspects
  3:  Superficial treatment
  1:  Barely addresses the topic`;

Diagnosis step 3: Run a consistency test.

async function testCriticConsistency(content, runs = 10) {
  const scores = [];
  for (let i = 0; i < runs; i++) {
    const feedback = await critique(content);
    scores.push(feedback.overall_score);
  }

  const avg = scores.reduce((a, b) => a + b) / scores.length;
  const variance = scores.reduce((a, b) => a + (b - avg) ** 2, 0) / scores.length;
  const min = Math.min(...scores);
  const max = Math.max(...scores);

  console.log(`Scores: ${scores.join(", ")}`);
  console.log(`Range: ${min}-${max}, Avg: ${avg.toFixed(1)}, Variance: ${variance.toFixed(2)}`);

  // Target: variance < 0.5, range <= 1 point
  return { scores, avg, variance, range: max - min, consistent: variance < 0.5 };
}

Diagnosis step 4: Check for positional bias.

Some models score content differently based on length or position in the prompt. Test with the same content presented in different formats.

Fix summary:

| Root Cause | Fix | Impact |
| --- | --- | --- |
| Temperature > 0 | Set temperature to 0 | Eliminates sampling randomness |
| Vague rubric | Add numbered criteria with score anchors | Reduces interpretation variance |
| Missing structured output | Force JSON output format | Prevents score drift in free-text |
| Model inconsistency | Use a stronger model for the Critic | Better instruction following |
| Prompt injection from content | Separate content in a delimited block | Prevents content from influencing scoring |

After fixing, re-run the consistency test. Target: variance < 0.5, range <= 1 point across 10 runs. If the Critic cannot achieve this, consider a majority-vote approach -- run the Critic 3 times and take the median score.
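
The median-vote fallback can be sketched as a wrapper around the Critic call. `critique` is injected, and assumed to return an object with `overall_score`:

```javascript
// Sketch of the median-vote fallback: run the Critic N times and keep the
// run whose score is the median, damping residual randomness.
async function medianCritique(content, critique, runs = 3) {
  const results = [];
  for (let i = 0; i < runs; i++) results.push(await critique(content));
  const sorted = [...results].sort((a, b) => a.overall_score - b.overall_score);
  // Median for odd N; for even N this takes the lower-middle run
  return sorted[Math.floor((runs - 1) / 2)];
}
```

Returning the whole median run (not just the score) keeps the matching issues and suggestions available for the Refiner.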


Quick-fire

| # | Question | One-line answer |
| --- | --- | --- |
| 1 | Name the four agent design patterns | Planner-Executor, Researcher-Writer, Critic-Refiner, Router |
| 2 | What does the Planner output? | Structured JSON plan with steps, tool names, parameters, and dependencies |
| 3 | What prevents the Writer from hallucinating? | "Use ONLY the facts provided" instruction + no tool access |
| 4 | What are the two Critic-Refiner exit conditions? | Quality threshold met OR max iterations reached |
| 5 | Why use a cheap model for the Router? | Classification is simple; GPT-4o-mini is 20x cheaper and fast enough |
| 6 | Router vs function calling -- what level do they operate at? | Router selects the agent; function calling selects the tools within that agent |
| 7 | What temperature for the Critic? | 0 -- evaluation must be deterministic and consistent |
| 8 | What temperature for the Writer? | 0.3-0.5 -- needs some creativity for readable prose |
| 9 | When do you re-plan vs retry a failed step? | Retry for transient errors; re-plan for structural failures |
| 10 | How do you detect diminishing returns in the Critic-Refiner loop? | Stop if score improvement between iterations is < 1 point |
| 11 | What is the key composability insight? | Router dispatches, Planner-Executor orchestrates, Researcher-Writer gathers and synthesizes, Critic-Refiner polishes -- they stack |

<- Back to 4.16 -- Agent Design Patterns (README)