Episode 4 — Generative AI Engineering / 4.18 — Building a Simple Multi-Agent Workflow
Interview Questions: Building a Simple Multi-Agent Workflow
Model answers for pipeline design, sequential agent coordination, validation between steps, error handling in pipelines, and real-world multi-agent patterns.
How to use this material (instructions)
- Read lessons in order -- README.md, then 4.18.a through 4.18.d.
- Practice out loud -- definition, example, pitfall.
- Pair with exercises -- 4.18-Exercise-Questions.md.
- Quick review -- 4.18-Quick-Revision.md.
Beginner (Q1-Q4)
Q1. What is a multi-agent pipeline and why would you use one instead of a single LLM call?
Why interviewers ask: Tests whether you understand the core architectural motivation behind multi-agent workflows. Separates candidates who have memorized the concept from those who can reason about when decomposition is genuinely valuable.
Model answer:
A multi-agent pipeline is an architecture where a complex AI task is decomposed into multiple specialized agents connected in sequence (or in parallel), with each agent responsible for a single sub-task and passing its validated output to the next agent as input. Think of it as an assembly line: one worker inspects raw material, the next shapes it, the next paints it. Each worker is specialized and only needs to understand their own job and the handoff protocol.
You would use a multi-agent pipeline instead of a single LLM call when a task involves genuinely different reasoning steps that benefit from separate, focused prompts. A single prompt trying to analyze a dating profile, rewrite the bio, AND generate conversation starters at the same time produces inconsistent, unfocused results because the context is overloaded. A multi-agent pipeline lets each agent have a short, focused prompt, a specific output schema, and an independently tunable temperature.
The key advantages are: focused prompts (each agent does one thing well), independent testing (test Agent 2 without running Agent 1), clear failure boundaries (you know exactly which agent failed), flexible model selection (use GPT-4o for analysis, GPT-4o-mini for formatting), and independent iteration (improve one agent without risking the others).
However, the tradeoff is latency (each agent adds a round-trip to the API), cost (each agent has its own token charges), and complexity (more moving parts to maintain). Use multi-agent only when a single prompt demonstrably cannot produce the required quality.
Q2. What are schema contracts between agents and why are Zod schemas used for this?
Why interviewers ask: Tests understanding of a critical production concern -- how data integrity is maintained between autonomous processing steps.
Model answer:
A schema contract is a formal definition of what data one agent produces and what the next agent expects to receive. It is the "agreement" at the boundary between two agents. Without it, Agent 1 could return overallScore: "high" (a string) when Agent 2 expects overallScore: 7 (a number), and the pipeline would silently produce garbage.
Zod schemas are ideal for these contracts for four reasons. First, runtime validation -- Zod validates the actual data at execution time, not just at compile time. When Agent 1's JSON output is parsed, schema.parse(parsed) immediately throws if any field is missing, the wrong type, or out of range. Second, TypeScript inference -- z.infer<typeof MySchema> gives you full type safety in your code without duplicating type definitions. Third, self-documenting -- the schema IS the documentation; reading a Zod schema tells you exactly what shape the data takes. Fourth, expressive constraints -- Zod supports .min(), .max(), .enum(), .refine(), and nested objects, letting you enforce business rules like "overallScore must be 1-10" or "strengths array must have at least 1 element."
import { z } from 'zod';
// Agent 1 must produce this exact shape
const Agent1OutputSchema = z.object({
strengths: z.array(z.object({
category: z.enum(["bio", "interests", "photos"]),
description: z.string().min(10),
impactScore: z.number().min(1).max(10),
})).min(1),
overallScore: z.number().min(1).max(10),
summary: z.string().min(20),
});
// Validate immediately after Agent 1 runs
const validated = Agent1OutputSchema.parse(agent1RawOutput);
// If this line passes, Agent 2 can trust the data
The critical principle is fail fast: validate immediately after each agent, not three agents later when the final output is wrong and you cannot trace the root cause.
Q3. Explain the Single Responsibility Principle as it applies to agent design. How do you decide agent boundaries?
Why interviewers ask: Tests software engineering fundamentals applied to AI systems. The ability to decompose tasks well is what separates working pipelines from tangled messes.
Model answer:
The Single Responsibility Principle (SRP) says each agent should have one job and one reason to change. In a dating profile pipeline, this means one agent analyzes the profile, a separate agent rewrites the bio, and a third agent generates conversation starters. Each agent has a focused system prompt, a specific output schema, and can be tested, improved, or swapped independently.
To decide where to draw agent boundaries, I ask five questions:
- Can this step be tested independently? If I can write unit tests for the analysis step without running the writing step, they should be separate agents.
- Does this step need different context? The analyzer needs the raw profile; the writer needs the analysis plus the original bio. Different context needs suggest different agents.
- Might I want a different model for this step? Analysis might work fine with GPT-4o-mini, but creative writing might need GPT-4o. Separate agents enable this.
- Does this step have a clearly different output schema? Analysis produces scores and categories; writing produces prose. Different schemas, different agents.
- Would combining this with the next step make the prompt too long? If the system prompt exceeds ~500 words of instructions, it is probably doing too much.
The anti-patterns are equally important. Too much per agent: one agent trying to analyze, rewrite, AND generate -- this is just a monolithic prompt with extra steps. Too little per agent: separate agents for "read name," "read age," "read bio" -- unnecessary overhead. Overlapping responsibilities: Agent 1 analyzes and partially rewrites, Agent 2 re-analyzes and generates -- unclear ownership causes bugs.
Q4. What are the main pipeline architecture patterns and when do you choose each?
Why interviewers ask: Tests breadth of knowledge. Real-world systems use different patterns depending on data dependencies, and interviewers want to see that you know more than just "chain agents in a line."
Model answer:
There are four main patterns.
Sequential pipeline is the most common. Each agent depends on the previous agent's output. The Hinge profile pipeline is sequential: Profile Analyzer feeds Bio Improver feeds Conversation Starter Generator. Latency is the sum of all agent times. Use this when each step genuinely needs the prior step's output.
const step1 = await analyzerAgent(input);
const step2 = await writerAgent(step1);
const step3 = await generatorAgent(step2);
Parallel pipeline runs independent agents simultaneously on the same input, then combines results. Use this when agents do not depend on each other -- for example, running sentiment analysis, keyword extraction, and language detection on the same text at the same time.
const [sentiment, keywords, language] = await Promise.all([
sentimentAgent(input),
keywordAgent(input),
languageAgent(input),
]);
Fan-out/fan-in has a planner agent split work into sub-tasks, worker agents process each sub-task in parallel, and a merger agent combines the results. Use this for dynamic decomposition -- for example, a document with five sections where each section is summarized independently.
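The fan-out/fan-in control flow can be sketched with stubbed agents. In a real pipeline the planner, worker, and merger would each be LLM calls; the function names and the `Section` shape here are illustrative assumptions:

```typescript
// Fan-out/fan-in sketch: planner splits work, workers run in parallel,
// merger combines. All three agents are hypothetical local stubs.
type Section = { title: string; text: string };

async function plannerAgent(doc: string): Promise<Section[]> {
  // Stub: split the document into sections to summarize
  return doc.split("\n\n").map((text, i) => ({ title: `Section ${i + 1}`, text }));
}

async function workerAgent(section: Section): Promise<string> {
  // Stub: summarize one section
  return `${section.title}: ${section.text.slice(0, 40)}`;
}

async function mergerAgent(summaries: string[]): Promise<string> {
  // Stub: combine per-section summaries into one document summary
  return summaries.join("\n");
}

async function fanOutFanIn(doc: string): Promise<string> {
  const sections = await plannerAgent(doc);                       // fan-out: plan sub-tasks
  const summaries = await Promise.all(sections.map(workerAgent)); // workers run in parallel
  return mergerAgent(summaries);                                  // fan-in: merge results
}
```

Note the latency shape this produces: planner time + max worker time + merger time, matching the table below.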
Conditional (router) pipeline has a router agent that decides which pipeline branch to follow based on the input. Use this when different input types need different processing paths -- for example, routing customer messages to billing, shipping, or technical support pipelines.
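A minimal router sketch, with the router and branch pipelines stubbed (in practice the router would be an LLM classification call; the keyword regexes and branch names are assumptions for illustration):

```typescript
// Conditional (router) pipeline sketch: one agent picks the branch,
// then only that branch runs. All agents here are hypothetical stubs.
type Route = "billing" | "shipping" | "technical";

async function routerAgent(message: string): Promise<Route> {
  // Stub classifier standing in for an LLM routing call
  if (/refund|invoice|charge/i.test(message)) return "billing";
  if (/delivery|package|tracking/i.test(message)) return "shipping";
  return "technical";
}

async function billingPipeline(message: string) { return { route: "billing", message }; }
async function shippingPipeline(message: string) { return { route: "shipping", message }; }
async function technicalPipeline(message: string) { return { route: "technical", message }; }

async function routedPipeline(message: string) {
  const route = await routerAgent(message); // router decides the branch
  switch (route) {
    case "billing": return billingPipeline(message);
    case "shipping": return shippingPipeline(message);
    case "technical": return technicalPipeline(message);
  }
}
```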
| Pattern | Latency | Use When |
|---|---|---|
| Sequential | Sum of all agents | Each step needs previous output |
| Parallel | Max of all agents | Agents are independent |
| Fan-out/fan-in | Planner + max parallel + merger | Dynamic sub-task decomposition |
| Conditional | Router + selected branch | Different inputs need different paths |
Intermediate (Q5-Q8)
Q5. Walk through how data flows between agents in a multi-agent pipeline. What is selective context and why does it matter?
Why interviewers ask: Tests practical understanding of the most important design decision in pipeline architecture -- what data each agent receives. Poor data flow design wastes tokens and creates hidden coupling.
Model answer:
There are three data flow patterns between agents.
Direct pass-through sends only the previous agent's output to the next agent. Agent 2 sees only what Agent 1 produced, and Agent 3 sees only what Agent 2 produced. This is simple but means later agents lose access to the original input.
Accumulated context sends the original input plus all previous outputs to each agent. Agent 3 receives the raw profile, Agent 1's analysis, AND Agent 2's improved bio. This gives maximum context but grows token usage at each step and can confuse agents with too much information.
Selective context is the recommended pattern. Each agent receives only the specific fields it needs from previous steps. In the Hinge pipeline, Agent 2 (Bio Improver) does not receive Agent 1's entire analysis object. It receives only the weaknesses, improvement tips, tone analysis, and personality description -- the fields that directly inform rewriting.
// Selective context for Agent 2
const agent2Input = {
originalBio: profile.bio,
interests: profile.interests,
name: profile.name,
analysis: {
weaknesses: analysis.weaknesses, // specific field
improvementTips: analysis.improvementTips, // specific field
toneAnalysis: analysis.toneAnalysis, // specific field
profilePersonality: analysis.profilePersonality,
},
};
Selective context matters for three reasons. First, token efficiency -- each unnecessary field costs input tokens at every API call. Second, agent focus -- an agent that receives only what it needs produces better output than one drowning in irrelevant context. Third, reduced coupling -- if Agent 1's schema changes (adding a new field), Agent 2 is unaffected as long as the fields it consumes remain unchanged.
The tradeoff is that you must carefully design what each agent needs upfront. This is a design cost worth paying because it prevents the "pass everything and hope for the best" approach that creates brittle, expensive pipelines.
Q6. How do you handle errors in a multi-agent pipeline? Compare the three main failure strategies.
Why interviewers ask: Error handling is where prototypes become production systems. Tests whether you have thought beyond the happy path.
Model answer:
Multi-agent pipelines face five error types, roughly ordered from easiest to hardest to detect: LLM API errors (network failures, rate limits, timeouts), empty responses (LLM returns null), JSON parse errors (LLM returns text instead of JSON), Zod validation errors (valid JSON but wrong structure), and semantic errors (valid structure but nonsensical content -- the hardest because no automated check catches them).
The three failure strategies are:
Fail fast stops the entire pipeline on the first error and reports exactly which agent failed and why. This is the recommended default for data pipelines because it never returns bad data.
try {
const step1 = await runAgent1(input);
const step2 = await runAgent2(step1);
const step3 = await runAgent3(step2);
return { success: true, result: step3 };
} catch (error) {
// Assumes each runAgentN tags thrown errors with the failing agent's name
return { success: false, failedAt: error.agentName, error: error.message };
}
Fail with partial results stops the pipeline but returns whatever completed successfully. Useful during development and debugging, and for pipelines where partial output has value -- for example, having the analysis but not the rewritten bio is still useful.
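A sketch of the partial-results strategy, using hypothetical stub agents (Agent 3 is made to fail here so the partial output is visible):

```typescript
// Fail-with-partial-results sketch: accumulate each step's output, and on
// failure return whatever completed. The runAgentN stubs are hypothetical.
async function runAgent1(input: string) { return { analysis: `analysis of ${input}` }; }
async function runAgent2(prev: object) { return { improvedBio: "rewritten bio" }; }
async function runAgent3(prev: object): Promise<object> { throw new Error("model timeout"); }

async function pipelineWithPartialResults(input: string) {
  const completed: Record<string, unknown> = {};
  try {
    completed.analysis = await runAgent1(input);
    completed.improvedBio = await runAgent2(completed.analysis as object);
    completed.starters = await runAgent3(completed.improvedBio as object);
    return { success: true, completed };
  } catch (error) {
    // Return whatever finished before the failure
    return { success: false, completed, error: (error as Error).message };
  }
}
```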
Fail with fallback substitutes a safe, pre-defined response when an agent fails and continues the pipeline. This is for user-facing products where returning something is better than returning nothing. Fallbacks must pass the same Zod schema as the real agent so downstream agents can still work.
let step1;
try {
step1 = await runAgent1(input);
} catch (error) {
step1 = getAgent1Fallback(input); // must match Agent1OutputSchema
}
| Strategy | When to Use | Trade-off |
|---|---|---|
| Fail fast | Data pipelines, accuracy-critical | No result on failure |
| Partial results | Development, debugging, partial value | Incomplete output |
| Fallback | User-facing products, availability-critical | Quality may degrade |
In production, I typically combine these: fail fast for the first agent (if analysis fails, nothing downstream makes sense), with fallbacks for later agents (if conversation starters fail, the improved bio alone is still useful).
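The combined strategy can be sketched as follows, with hypothetical stub agents (the rewrite agent is made to fail here so the fallback path is exercised):

```typescript
// Combined strategy sketch: fail fast on the analyzer, fall back on later
// agents. Agent and fallback names are illustrative assumptions.
async function runAnalyzer(profile: string) { return { weaknesses: ["too short"] }; }
async function runBioImprover(analysis: object): Promise<string> { throw new Error("rate limited"); }
function bioFallback(profile: string) { return profile; } // safe default: keep the original bio

async function combinedStrategyPipeline(profile: string) {
  // Fail fast: if analysis fails, let the error propagate --
  // nothing downstream makes sense without it
  const analysis = await runAnalyzer(profile);

  // Fallback: a failed rewrite degrades quality, not availability
  let improvedBio: string;
  try {
    improvedBio = await runBioImprover(analysis);
  } catch {
    improvedBio = bioFallback(profile);
  }
  return { analysis, improvedBio };
}
```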
Q7. Explain the validation feedback retry strategy. When does it work and when does it not?
Why interviewers ask: Tests nuanced understanding of retry strategies. The validation feedback approach is a distinctive technique in multi-agent systems that shows the candidate understands the unique nature of LLM errors.
Model answer:
Validation feedback retry is a strategy where, when an agent's output fails Zod validation, the specific validation errors are fed back to the LLM as a follow-up message, asking it to correct its output. The key insight is that an LLM can read an error like "Field 'overallScore' must be a number but received string" and fix it on the next attempt -- the model understands its own mistake when told what it was.
async function runAgentWithValidationFeedback(agent, input, maxRetries = 3) {
let messages = [
{ role: "system", content: agent.systemPrompt },
{ role: "user", content: JSON.stringify(input) },
];
for (let attempt = 1; attempt <= maxRetries; attempt++) {
const raw = await callLLM(agent.model, messages);
try {
// Parse inside the try so JSON parse errors are also fed back and retried
const parsed = JSON.parse(raw);
return agent.outputSchema.parse(parsed);
} catch (err) {
if (attempt < maxRetries) {
// Feed the exact error back to the LLM
const feedback = err instanceof z.ZodError
? err.issues.map(i => `"${i.path.join(".")}": ${i.message}`).join("\n")
: `Invalid JSON: ${err.message}`;
messages.push(
{ role: "assistant", content: raw },
{ role: "user", content: `Validation errors:\n${feedback}\nFix and respond with valid JSON.` }
);
}
}
}
throw new Error(`Failed after ${maxRetries} attempts`);
}
This strategy works well for Zod validation errors (missing fields, wrong types, invalid enums, out-of-range values) because the LLM can read the structured error and directly fix the output format. It also works for JSON parse errors where the LLM wrapped JSON in markdown code blocks.
This strategy does NOT work for API rate limit errors (HTTP 429) or server errors (HTTP 500) because those have nothing to do with the LLM's output quality -- the model never even ran. For those, use exponential backoff (wait and retry without adding messages). It also does not help with semantic errors (the LLM returned valid JSON with correct types but factually wrong content) because there is no Zod error to feed back.
The tradeoff is token cost: each validation feedback retry adds messages to the conversation history, consuming more input tokens. A 3-retry validation feedback loop can use 2-3x the tokens of the original call. But this is usually worth it because it has a much higher success rate than blind retries for schema issues.
Q8. How would you implement a reusable pipeline runner that works for any set of agents?
Why interviewers ask: Tests the ability to abstract common patterns into reusable infrastructure. Shows software engineering maturity beyond one-off scripts.
Model answer:
A reusable pipeline runner needs three components: an agent definition format, a single-agent executor with JSON parsing and Zod validation, and a pipeline orchestrator that chains agents together with logging and error handling.
The agent definition is a simple object specifying the agent's name, system prompt, output schema, model, and temperature:
function createAgent({ name, systemPrompt, outputSchema, model = "gpt-4o", temperature = 0.7 }) {
return { name, systemPrompt, outputSchema, model, temperature };
}
The single-agent executor handles the LLM call, JSON parsing (including extracting JSON from markdown code blocks), and Zod validation:
async function runAgent(agent, input) {
const response = await client.chat.completions.create({
model: agent.model,
temperature: agent.temperature,
messages: [
{ role: "system", content: agent.systemPrompt },
{ role: "user", content: JSON.stringify(input) },
],
});
const raw = response.choices[0].message.content;
if (!raw) throw new Error(`${agent.name} returned empty response`);
let parsed;
try {
parsed = JSON.parse(raw);
} catch {
const match = raw.match(/```(?:json)?\s*([\s\S]*?)```/);
if (match) parsed = JSON.parse(match[1].trim());
else throw new Error(`${agent.name} returned invalid JSON`);
}
return agent.outputSchema.parse(parsed);
}
The pipeline orchestrator chains agents sequentially, logging each step. The critical design decision is the contextBuilder callback, which lets callers customize what data each agent receives -- enabling selective context without hardcoding data flow into the runner:
async function runPipeline(agents, initialInput, contextBuilder) {
let currentData = initialInput;
const results = {};
for (let i = 0; i < agents.length; i++) {
const agent = agents[i];
const agentInput = contextBuilder
? contextBuilder(i, agent.name, currentData, results, initialInput)
: currentData;
const output = await runAgent(agent, agentInput);
results[agent.name] = output;
currentData = output;
}
return { results, finalOutput: currentData };
}
This design follows the Open/Closed Principle: the runner is open for extension (any agents, any schemas, any data flow via contextBuilder) but closed for modification (the core loop, JSON parsing, and validation logic never change). Adding a new pipeline means defining new agents and a new context builder -- the runner itself stays the same.
Advanced (Q9-Q11)
Q9. Design a production-ready multi-agent pipeline with graceful degradation. Explain each degradation level.
Why interviewers ask: Tests the ability to build systems that are resilient in production, not just in demos. Graceful degradation is a hallmark of senior engineering.
Model answer:
A production pipeline should never return nothing when something is possible. The graceful degradation ladder has five levels, each trading quality for reliability:
Level 1 -- Full pipeline with primary model (GPT-4o). This is the happy path. All agents run with the best model, full validation, full quality. If any agent fails after retries (including validation feedback retry), drop to Level 2.
Level 2 -- Full pipeline with cheaper model (GPT-4o-mini). Same agents, same schemas, same prompts, but using a faster, cheaper model. Quality is slightly lower but the pipeline still runs end-to-end. This catches cases where the primary model is rate-limited or experiencing outages.
Level 3 -- Simplified pipeline (fewer agents, simpler prompts). Merge agents where possible. Instead of three separate agents, run one agent with a combined prompt. Quality drops further but latency and cost drop significantly.
Level 4 -- Rule-based heuristics (no LLM at all). Use code logic to produce a valid output. For the profile pipeline: check bio length, count interests, score based on simple rules. The output is valid Zod-schema-compliant data, just not AI-generated.
function ruleBasedAnalysis(profile) {
const strengths = [];
if (profile.bio.length > 100) {
strengths.push({
category: "bio",
description: "Bio has good length with sufficient detail",
impactScore: 6,
});
}
if (profile.interests.length >= 4) {
strengths.push({
category: "interests",
description: "Good variety of interests listed",
impactScore: 5,
});
}
// ... build a complete, schema-valid response
return ProfileAnalysisSchema.parse(result);
}
Level 5 -- Static fallback (hardcoded safe default). A pre-built, schema-valid response stored as a constant. Generic but guaranteed to work. Use as the absolute last resort.
The implementation cascades through levels using try/catch:
async function resilientPipeline(input) {
try { return await fullPipeline(input, "gpt-4o"); } catch {}
try { return await fullPipeline(input, "gpt-4o-mini"); } catch {}
try { return await simplifiedPipeline(input); } catch {}
try { return ruleBasedAnalysis(input); } catch {}
return staticFallback;
}
The critical requirement: every level must produce output that passes the same Zod schema validation. Downstream consumers should not need to know which level ran. The pipeline metadata should include a degradationLevel field for monitoring so you can alert when Level 4+ is firing frequently.
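The cascade with a `degradationLevel` field can be sketched like this. The per-level functions are hypothetical stubs; levels 1 and 2 are made to fail so the run settles at level 3:

```typescript
// Degradation cascade with metadata: try each level in order and record
// which one produced the result, so monitoring can alert on deep fallbacks.
async function level1(input: string): Promise<string> { throw new Error("gpt-4o outage"); }
async function level2(input: string): Promise<string> { throw new Error("mini rate limited"); }
async function level3(input: string): Promise<string> { return `simplified result for ${input}`; }

async function resilientPipelineWithMetadata(input: string) {
  const levels = [level1, level2, level3];
  for (let i = 0; i < levels.length; i++) {
    try {
      return { degradationLevel: i + 1, result: await levels[i](input) };
    } catch {
      // fall through to the next, cheaper level
    }
  }
  return { degradationLevel: levels.length + 1, result: "static fallback" };
}
```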
Q10. You need to process 10,000 images through a 3-agent SEO pipeline. Design the batch processing system, addressing concurrency, error isolation, and rate limiting.
Why interviewers ask: Tests the ability to take a single-item pipeline and scale it to production batch workloads. This is the gap between demo code and real systems.
Model answer:
The batch system needs three components: a concurrency-limited executor, per-item error isolation, and API rate limiting.
Concurrency control prevents overwhelming the API. Process images in batches of configurable size (e.g., 5 concurrent pipelines). Each batch runs with Promise.allSettled -- not Promise.all -- because allSettled reports individual failures without aborting the entire batch:
async function batchProcess(images, concurrency = 5) {
const results = [];
const errors = [];
for (let i = 0; i < images.length; i += concurrency) {
const batch = images.slice(i, i + concurrency);
const batchResults = await Promise.allSettled(
batch.map(image => runImagePipeline(image))
);
batchResults.forEach((result, index) => {
if (result.status === "fulfilled") {
results.push(result.value);
} else {
errors.push({
imageIndex: i + index,
filename: batch[index].filename,
error: result.reason.message,
});
}
});
console.log(`Progress: ${Math.min(i + concurrency, images.length)}/${images.length}`);
}
return {
successful: results,
failed: errors,
summary: {
total: images.length,
succeeded: results.length,
failed: errors.length,
successRate: (results.length / images.length * 100).toFixed(1) + "%",
},
};
}
Error isolation means one image's failure never affects another image's processing. Promise.allSettled handles this at the batch level. Within each pipeline, the fail-fast strategy is appropriate -- if Agent 1 fails for one image, skip that image rather than using a fallback that produces low-quality SEO metadata.
Rate limiting prevents HTTP 429 errors. Implement a token-bucket rate limiter that each callAgent function checks before making an API call:
// Simplified limiter: refills the full bucket at most once per second
class RateLimiter {
constructor(maxPerSecond) {
this.maxPerSecond = maxPerSecond;
this.tokens = maxPerSecond;
this.lastRefill = Date.now();
}
async acquire() {
while (this.tokens <= 0) {
const elapsed = Date.now() - this.lastRefill;
if (elapsed >= 1000) {
this.tokens = this.maxPerSecond;
this.lastRefill = Date.now();
}
await new Promise(r => setTimeout(r, 50));
}
this.tokens--;
}
}
const limiter = new RateLimiter(5); // 5 API calls/second
For 10,000 images with 3 agents each, that is 30,000 API calls. At 5 calls/second, that is about 100 minutes. I would also add: a progress reporter, a checkpoint system that saves completed results to disk (so you can resume after a crash), and a final summary report with success rate, total cost, and average processing time per image.
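The checkpoint idea can be sketched with two small helpers; the file name and checkpoint shape are assumptions. `batchProcess` would call `saveCheckpoint` after each batch and start its loop at `loadCheckpoint().nextIndex`:

```typescript
// Checkpoint sketch: persist completed results after each batch so a
// crashed run can resume instead of reprocessing 10,000 images.
import { writeFileSync, readFileSync, existsSync } from "node:fs";

const CHECKPOINT_FILE = "batch-checkpoint.json"; // hypothetical path

function saveCheckpoint(nextIndex: number, results: unknown[]) {
  // Write atomically enough for a sketch; production code might write
  // to a temp file and rename
  writeFileSync(CHECKPOINT_FILE, JSON.stringify({ nextIndex, results }));
}

function loadCheckpoint(): { nextIndex: number; results: unknown[] } {
  if (!existsSync(CHECKPOINT_FILE)) return { nextIndex: 0, results: [] };
  return JSON.parse(readFileSync(CHECKPOINT_FILE, "utf8"));
}
```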
Q11. Compare the Hinge profile pipeline and the ImageKit SEO pipeline. What is the domain-agnostic pattern they share, and how would you apply it to a completely different domain?
Why interviewers ask: Tests the ability to abstract patterns from specific examples and apply them to new problems. This is the difference between following a tutorial and understanding the underlying architecture.
Model answer:
Despite being completely different domains (dating apps vs. image asset management), both pipelines follow the same three-stage pattern: Analyze, Transform, Generate.
| Stage | Hinge Pipeline | ImageKit Pipeline | Abstract Pattern |
|---|---|---|---|
| Agent 1 | Profile Analyzer (scores, strengths, weaknesses) | Metadata Extractor (subjects, colors, mood) | Analyze: Extract structured understanding from raw input |
| Agent 2 | Bio Improver (rewrite based on analysis) | SEO Optimizer (titles, descriptions, keywords) | Transform: Produce improved/optimized content from analysis |
| Agent 3 | Conversation Starter Generator (creative openers) | Tag Categorizer (organized, scored tags) | Generate: Create final deliverables from transformed data |
Both pipelines also share: Zod validation at every agent boundary, selective context (each agent receives only specific fields), a callAgent utility that standardizes JSON parsing and validation, temperature tuning per agent (though with different strategies), and a final output schema that aggregates all intermediate results plus pipeline metadata.
To apply this pattern to a new domain, say a code review pipeline:
- Agent 1 (Analyze): Code Quality Analyzer -- receives raw code, produces structured analysis: complexity scores, pattern violations, test coverage gaps. Temperature: 0.5 (factual analysis).
- Agent 2 (Transform): Improvement Suggester -- receives the analysis, produces specific refactoring suggestions with before/after code snippets. Temperature: 0.7 (creative but grounded).
- Agent 3 (Generate): Review Summary Writer -- receives suggestions, produces a human-readable review document with prioritized action items. Temperature: 0.8 (readable prose).
The pattern is: define input schema, define output schema for each agent, write focused system prompts, choose temperature per agent based on how analytical vs. creative the task is, validate at every boundary, and assemble the final output.
The temperature strategy differs by domain: the Hinge pipeline increases monotonically (0.7, 0.8, 0.9) because each step is progressively more creative. The ImageKit pipeline uses a non-monotonic pattern (0.5, 0.7, 0.6) because extraction is purely analytical, SEO needs some creativity, and tagging needs coverage but consistency. The code review pipeline would likely use an increasing pattern similar to Hinge. The lesson is that temperature strategy is domain-dependent, but the three-stage architecture is domain-agnostic.
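The three code review agents above could be declared with the `createAgent` shape from Q8. Everything here is a hypothetical sketch -- the names, abbreviated prompts, and model choices are assumptions:

```typescript
// Hypothetical agent definitions for a code review pipeline, following
// the createAgent pattern: defaults first, caller config overrides.
function createAgent(config: { name: string; systemPrompt: string; model?: string; temperature?: number }) {
  return { model: "gpt-4o", temperature: 0.7, ...config };
}

const codeReviewAgents = [
  createAgent({
    name: "code-quality-analyzer",
    systemPrompt: "Analyze the code. Return JSON with complexity scores, pattern violations, and coverage gaps.",
    model: "gpt-4o-mini", // factual analysis can use the cheaper model
    temperature: 0.5,
  }),
  createAgent({
    name: "improvement-suggester",
    systemPrompt: "Given the analysis, return JSON refactoring suggestions with before/after snippets.",
    temperature: 0.7,
  }),
  createAgent({
    name: "review-summary-writer",
    systemPrompt: "Given the suggestions, return JSON with a readable review and prioritized action items.",
    temperature: 0.8,
  }),
];
```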
Quick-fire
| # | Question | One-line answer |
|---|---|---|
| 1 | What is the core principle of multi-agent pipelines? | Decompose complex tasks into specialized, single-responsibility agents connected by validated data contracts |
| 2 | What validates data between agents? | Zod schemas -- runtime validation, TypeScript inference, self-documenting |
| 3 | What is selective context? | Each agent receives only the specific fields it needs, not the entire accumulated state |
| 4 | Sequential pipeline latency? | Sum of all agent latencies |
| 5 | Why different temperatures per agent? | Analytical tasks need lower temperature (consistency); creative tasks need higher (diversity) |
| 6 | What does Promise.allSettled give you over Promise.all? | Reports individual failures without aborting the entire batch |
| 7 | What is validation feedback retry? | Feeding Zod errors back to the LLM so it can self-correct its output format |
| 8 | Name the five levels of the graceful degradation ladder | Full model, cheaper model, simplified pipeline, rule-based heuristics, static fallback |
| 9 | What are the four pipeline patterns? | Sequential, parallel, fan-out/fan-in, conditional (router) |
| 10 | How do you decide if a task needs multi-agent? | Start with single prompt; add agents only when single prompt demonstrably fails on test cases |
Navigation: <- 4.18 Exercise Questions | 4.18 Quick Revision ->