Episode 4 — Generative AI Engineering / 4.16 — Agent Design Patterns

4.16.c — Critic-Refiner Pattern

In one sentence: The Critic-Refiner pattern creates an iterative improvement loop — a Generator produces initial output, a Critic evaluates its quality and identifies specific issues, and a Refiner improves the output based on the criticism — repeating until a quality threshold is met or a maximum iteration count is reached.

Navigation: ← 4.16.b Researcher-Writer · 4.16.d — Router Agents →


1. What Is the Critic-Refiner Pattern?

The Critic-Refiner pattern is a self-improving loop where AI output is evaluated and refined iteratively:

  1. Generator — produces initial output (a draft, code, analysis, or any content).
  2. Critic — evaluates the output against quality criteria and produces specific, actionable feedback (not just "this is bad" but "paragraph 2 lacks a supporting example, the conclusion contradicts the introduction").
  3. Refiner — takes the original output plus the Critic's feedback and produces an improved version.
  4. Loop — the improved version goes back to the Critic. This continues until the Critic says the output is good enough or a maximum number of iterations is reached.

This pattern models the real-world process of drafting, reviewing, and revising — the same loop that writers, programmers, and designers use daily.

┌────────────────────────────────────────────────────────────────┐
│                  CRITIC-REFINER LOOP                           │
│                                                                │
│  User Task: "Write a technical blog post about caching"        │
│       │                                                        │
│       ▼                                                        │
│  ┌──────────────┐                                              │
│  │  GENERATOR   │  Produces initial draft                      │
│  │              │  (temperature 0.7 for creative first draft)  │
│  └──────┬───────┘                                              │
│         │                                                      │
│         ▼  draft v1                                            │
│  ┌──────────────┐                                              │
│  │  CRITIC      │  Evaluates quality                           │
│  │              │  Returns:                                    │
│  │              │  - score: 6/10                               │
│  │              │  - issues: ["no code examples",              │
│  │              │    "intro too long", "missing                │
│  │              │    performance comparison"]                  │
│  └──────┬───────┘                                              │
│         │                                                      │
│    score < threshold?                                          │
│    YES ──► continue loop                                       │
│         │                                                      │
│         ▼  feedback                                            │
│  ┌──────────────┐                                              │
│  │  REFINER     │  Improves based on criticism                 │
│  │              │  Produces draft v2                           │
│  └──────┬───────┘                                              │
│         │                                                      │
│         ▼  draft v2                                            │
│  ┌──────────────┐                                              │
│  │  CRITIC      │  Re-evaluates                                │
│  │              │  score: 8/10                                 │
│  │              │  issues: ["comparison table could            │
│  │              │    be more detailed"]                        │
│  └──────┬───────┘                                              │
│         │                                                      │
│    score >= threshold (8)?                                     │
│    YES ──► exit loop                                           │
│         │                                                      │
│         ▼                                                      │
│  Final Output: draft v2 (quality: 8/10)                        │
└────────────────────────────────────────────────────────────────┘
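The control flow in the diagram can be sketched independently of any model. This is a minimal skeleton, not the full implementation shown in section 6: the generate, critique, and refine steps are passed in as functions, and the demo below uses plain functions instead of LLM calls (all names are illustrative).

```javascript
// Model-agnostic skeleton of the loop. generateFn/critiqueFn/refineFn are
// parameters, so the same control flow works with LLM calls or, as in the
// demo below, plain functions.
async function runCriticRefinerLoop(generateFn, critiqueFn, refineFn, options = {}) {
  const { threshold = 8, maxIterations = 3 } = options;
  let output = await generateFn();
  for (let i = 1; i <= maxIterations; i++) {
    const feedback = await critiqueFn(output);
    if (feedback.score >= threshold) {
      return { output, score: feedback.score, iterations: i }; // happy path
    }
    output = await refineFn(output, feedback); // address the feedback, loop again
  }
  const final = await critiqueFn(output); // safety net: max iterations reached
  return { output, score: final.score, iterations: maxIterations };
}

// Toy demo: the "critic" scores a draft by its section count and names the
// next missing section; the "refiner" appends it.
const SECTIONS = ['intro', 'examples', 'comparison', 'conclusion'];
const demo = await runCriticRefinerLoop(
  async () => ['intro'],
  async (draft) => ({ score: 5 + draft.length, missing: SECTIONS[draft.length] }),
  async (draft, feedback) => [...draft, feedback.missing],
  { threshold: 8, maxIterations: 5 }
);
console.log(demo.iterations, demo.score); // → 3 8
```

Even this toy version exhibits the two exit paths discussed later: the early return when the score clears the threshold, and the hard ceiling on iterations.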

2. When to Use This Pattern

| Use This Pattern When | Do NOT Use When |
| --- | --- |
| Output quality matters more than speed | Real-time responses needed (latency-critical) |
| First-draft quality is typically insufficient | Simple factual queries (no iteration needed) |
| Clear quality criteria exist (rubric, checklist) | Quality is purely subjective (no measurable criteria) |
| Cost of bad output is high (legal, medical, public-facing) | Budget is tight (each iteration = more API calls) |
| You want consistent quality across many outputs | One-shot tasks where "good enough" is fine |

Real-world use cases

  • Code review and improvement — generate code, critique for bugs/style/performance, refine
  • Writing quality — generate article, critique for clarity/accuracy/engagement, refine
  • Data quality — generate data transformation, critique for correctness/completeness, refine
  • Prompt optimization — generate prompt, critique for ambiguity/token efficiency, refine
  • Translation quality — translate text, critique for accuracy/fluency/cultural fit, refine
  • Email drafting — generate email, critique for tone/professionalism/clarity, refine

3. The Generator: Producing Initial Output

The Generator is the simplest agent — it produces the first draft based on the user's request.

const GENERATOR_SYSTEM_PROMPT = `You are a skilled content generator. Produce high-quality first drafts based on the user's request.

Focus on:
- Completeness — cover all aspects of the topic
- Structure — use clear headings and logical flow
- Accuracy — stick to facts you are confident about
- Clarity — write for the target audience

Do not self-critique or hedge. Produce the best draft you can in one pass.`;

4. The Critic: Evaluating Quality

The Critic is the most important agent in this pattern. A good Critic provides specific, actionable feedback — not vague opinions.

Critic system prompt

const CRITIC_SYSTEM_PROMPT = `You are a strict Quality Critic. You evaluate content against specific criteria and provide actionable feedback.

EVALUATION CRITERIA:
1. Accuracy — Are all facts correct? Any hallucinations or unsupported claims?
2. Completeness — Does the content cover the topic thoroughly? Any missing aspects?
3. Structure — Is the content logically organized? Good headings, flow, transitions?
4. Clarity — Is the language clear and appropriate for the audience?
5. Examples — Are there sufficient concrete examples and code samples?
6. Actionability — Can the reader apply what they learned?

SCORING:
- Rate each criterion from 1-10.
- Provide an overall score (1-10).
- A score of 8+ means the content is ready to publish.
- A score below 8 means it needs improvement.

OUTPUT FORMAT (strict JSON):
{
  "overall_score": <1-10>,
  "criteria_scores": {
    "accuracy": <1-10>,
    "completeness": <1-10>,
    "structure": <1-10>,
    "clarity": <1-10>,
    "examples": <1-10>,
    "actionability": <1-10>
  },
  "issues": [
    {
      "severity": "critical | major | minor",
      "location": "Where in the content (section/paragraph)",
      "issue": "What is wrong",
      "suggestion": "Specific improvement to make"
    }
  ],
  "strengths": ["What the content does well"],
  "ready_to_publish": <true|false>
}

RULES:
- Be specific. "The writing is unclear" is bad feedback. "Paragraph 3 uses jargon ('memoization') without defining it for the beginner audience" is good feedback.
- Be constructive. Every issue must have a concrete suggestion for improvement.
- Be honest. Do not inflate scores. If the content is mediocre, say so.
- Focus on the most impactful issues first (critical > major > minor).`;

Why the Critic must be strict

A lenient Critic defeats the purpose of the loop. If the Critic gives 9/10 to mediocre content, the loop exits immediately and no improvement happens. Tips for maintaining Critic strictness:

  1. Define clear criteria — numbered rubric with specific expectations
  2. Require structured output — JSON forces the Critic to be specific
  3. Set a realistic threshold — 8/10 is a good default (demanding but achievable)
  4. Include severity levels — helps the Refiner prioritize
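Tip 4 can be made concrete with a small helper that orders the Critic's issues by severity before they are handed to the Refiner. A sketch (the helper name and the cap are my own; the issue shape matches the Critic's JSON output format above):

```javascript
// Sorts the Critic's issues so critical problems reach the Refiner first,
// and caps the list so the Refiner prompt stays focused.
const SEVERITY_RANK = { critical: 0, major: 1, minor: 2 };

function prioritizeIssues(issues, maxIssues = 10) {
  return [...issues] // copy so the original feedback object is untouched
    .sort((a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity])
    .slice(0, maxIssues);
}

const ordered = prioritizeIssues([
  { severity: 'minor', issue: 'comparison table could be more detailed' },
  { severity: 'critical', issue: 'missing code examples' },
  { severity: 'major', issue: 'cache invalidation not addressed' },
]);
console.log(ordered.map((i) => i.severity)); // → [ 'critical', 'major', 'minor' ]
```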

5. The Refiner: Improving Based on Feedback

The Refiner receives the original content AND the Critic's feedback, and produces an improved version.

const REFINER_SYSTEM_PROMPT = `You are a Content Refiner. You receive content along with specific criticism and produce an improved version.

RULES:
1. Address EVERY issue raised by the Critic, starting with critical issues.
2. Preserve strengths identified by the Critic — do not break what already works.
3. Make minimal changes — do not rewrite sections that have no issues.
4. If the Critic suggests adding examples, add real, practical examples.
5. If the Critic identifies factual errors, correct them (or remove the claim if unsure).
6. Return the COMPLETE improved content, not just the changes.

Produce the improved content directly — no meta-commentary about what you changed.`;

6. Full Implementation: Iterative Content Improvement

import OpenAI from 'openai';

const openai = new OpenAI();

// ─────────────────────────────────────────────────────────
// AGENT DEFINITIONS
// ─────────────────────────────────────────────────────────

async function generate(task) {
  console.log('\n=== GENERATOR ===');
  console.log(`Task: "${task}"\n`);

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0.7,
    messages: [
      {
        role: 'system',
        content: `You are a skilled technical writer. Write a comprehensive, well-structured piece on the given topic. Include code examples in JavaScript where relevant. Target audience: intermediate developers.`,
      },
      { role: 'user', content: task },
    ],
  });

  const content = response.choices[0].message.content;
  console.log(`  Generated: ${content.length} characters`);
  return content;
}

async function critique(content, criteria) {
  console.log('\n=== CRITIC ===');

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0,
    response_format: { type: 'json_object' },
    messages: [
      {
        role: 'system',
        content: `You are a strict Quality Critic. Evaluate the content against these criteria and return structured feedback.

CRITERIA: ${criteria.join(', ')}

Return JSON:
{
  "overall_score": <1-10>,
  "criteria_scores": { <criterion>: <1-10>, ... },
  "issues": [
    {
      "severity": "critical|major|minor",
      "location": "...",
      "issue": "...",
      "suggestion": "..."
    }
  ],
  "strengths": ["..."],
  "ready_to_publish": <boolean>
}

Be strict. 8+ means genuinely excellent. Most first drafts score 5-7.`,
      },
      {
        role: 'user',
        content: `Evaluate this content:\n\n${content}`,
      },
    ],
  });

  const feedback = JSON.parse(response.choices[0].message.content);
  console.log(`  Score: ${feedback.overall_score}/10`);
  console.log(`  Issues: ${feedback.issues.length}`);
  console.log(`  Ready: ${feedback.ready_to_publish}`);

  if (feedback.issues.length > 0) {
    console.log('  Top issues:');
    feedback.issues.slice(0, 3).forEach((issue) => {
      console.log(`    [${issue.severity}] ${issue.issue}`);
    });
  }

  return feedback;
}

async function refine(content, feedback) {
  console.log('\n=== REFINER ===');

  const issueList = feedback.issues
    .map((i) => `- [${i.severity}] ${i.location}: ${i.issue} → Suggestion: ${i.suggestion}`)
    .join('\n');

  const strengthList = feedback.strengths.map((s) => `- ${s}`).join('\n');

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0.3,
    messages: [
      {
        role: 'system',
        content: `You are a Content Refiner. You receive content and specific feedback. Produce an improved version that addresses all issues while preserving strengths. Return ONLY the improved content.`,
      },
      {
        role: 'user',
        content: `ORIGINAL CONTENT:
${content}

CRITIC SCORE: ${feedback.overall_score}/10

ISSUES TO FIX:
${issueList}

STRENGTHS TO PRESERVE:
${strengthList}

Produce the improved version:`,
      },
    ],
  });

  const improved = response.choices[0].message.content;
  console.log(`  Refined: ${improved.length} characters`);
  return improved;
}

// ─────────────────────────────────────────────────────────
// CRITIC-REFINER LOOP
// ─────────────────────────────────────────────────────────

async function criticRefinerLoop(task, config = {}) {
  const {
    qualityThreshold = 8,
    maxIterations = 3,
    criteria = ['accuracy', 'completeness', 'structure', 'clarity', 'examples', 'actionability'],
  } = config;

  console.log('========================================');
  console.log('  CRITIC-REFINER LOOP');
  console.log('========================================');
  console.log(`Task: ${task}`);
  console.log(`Threshold: ${qualityThreshold}/10`);
  console.log(`Max iterations: ${maxIterations}`);
  console.log(`Criteria: ${criteria.join(', ')}`);

  // Track iteration history for analysis
  const history = [];

  // Step 1: Generate initial draft
  let content = await generate(task);
  history.push({ iteration: 0, type: 'generated', length: content.length });

  // Step 2: Iterative critique-refine loop
  for (let iteration = 1; iteration <= maxIterations; iteration++) {
    console.log(`\n--- Iteration ${iteration}/${maxIterations} ---`);

    // Critique
    const feedback = await critique(content, criteria);
    history.push({
      iteration,
      type: 'critique',
      score: feedback.overall_score,
      issueCount: feedback.issues.length,
      ready: feedback.ready_to_publish,
    });

    // Check if quality threshold is met
    if (feedback.overall_score >= qualityThreshold || feedback.ready_to_publish) {
      console.log(`\nQuality threshold met! Score: ${feedback.overall_score}/10`);
      return {
        content,
        finalScore: feedback.overall_score,
        iterations: iteration,
        history,
        finalFeedback: feedback,
      };
    }

    // Refine
    content = await refine(content, feedback);
    history.push({ iteration, type: 'refined', length: content.length });
  }

  // Max iterations reached — return best effort
  const finalFeedback = await critique(content, criteria);
  console.log(`\nMax iterations reached. Final score: ${finalFeedback.overall_score}/10`);

  return {
    content,
    finalScore: finalFeedback.overall_score,
    iterations: maxIterations,
    history,
    finalFeedback,
    maxIterationsReached: true,
  };
}

// ─────────────────────────────────────────────────────────
// RUN
// ─────────────────────────────────────────────────────────

const result = await criticRefinerLoop(
  'Write a technical blog post explaining caching strategies for web applications (browser cache, CDN, server-side, database query cache) with JavaScript examples',
  {
    qualityThreshold: 8,
    maxIterations: 3,
    criteria: ['accuracy', 'completeness', 'structure', 'clarity', 'examples', 'actionability'],
  }
);

console.log('\n=== FINAL RESULT ===');
console.log(`Score: ${result.finalScore}/10`);
console.log(`Iterations: ${result.iterations}`);
console.log(`Content length: ${result.content.length} characters`);
console.log('\nIteration history:');
result.history.forEach((h) => {
  if (h.type === 'critique') {
    console.log(`  Iteration ${h.iteration}: Score ${h.score}/10, ${h.issueCount} issues`);
  }
});

Expected output

========================================
  CRITIC-REFINER LOOP
========================================
Task: Write a technical blog post explaining caching strategies...
Threshold: 8/10
Max iterations: 3
Criteria: accuracy, completeness, structure, clarity, examples, actionability

=== GENERATOR ===
Task: "Write a technical blog post..."
  Generated: 4200 characters

--- Iteration 1/3 ---

=== CRITIC ===
  Score: 6/10
  Issues: 5
  Ready: false
  Top issues:
    [critical] Missing code examples for CDN and database caching
    [major] No comparison table between caching strategies
    [major] Cache invalidation not addressed

=== REFINER ===
  Refined: 5800 characters

--- Iteration 2/3 ---

=== CRITIC ===
  Score: 8/10
  Issues: 1
  Ready: true

Quality threshold met! Score: 8/10

=== FINAL RESULT ===
Score: 8/10
Iterations: 2
Content length: 5800 characters

7. Self-Reflection Pattern

A special case of the Critic-Refiner pattern is self-reflection, where a single agent critiques its own output. Instead of separate Critic and Refiner agents, one agent generates, reflects, and improves.

async function selfReflect(task, maxIterations = 3) {
  console.log('\n=== SELF-REFLECTION AGENT ===');

  const messages = [
    {
      role: 'system',
      content: `You are an AI agent that generates content and then critically reflects on it to improve.

PROCESS:
1. First, produce your best attempt at the task.
2. Then, critically analyze your output: What is good? What is weak? What is missing?
3. Produce an improved version addressing your self-critique.
4. Repeat until you are satisfied.

FORMAT each response as:
## Draft
[your content]

## Self-Critique
- Strengths: [list]
- Weaknesses: [list]
- Missing: [list]
- Score: X/10

## Improved Version
[improved content — only if score < 8]`,
    },
    { role: 'user', content: task },
  ];

  let finalContent = '';

  for (let i = 1; i <= maxIterations; i++) {
    console.log(`\n--- Reflection Round ${i} ---`);

    const response = await openai.chat.completions.create({
      model: 'gpt-4o',
      temperature: 0.4,
      messages,
    });

    const output = response.choices[0].message.content;
    messages.push({ role: 'assistant', content: output });

    // Check if the agent is satisfied (score >= 8 and no "Improved Version")
    const scoreMatch = output.match(/Score:\s*(\d+)\/10/);
    const score = scoreMatch ? parseInt(scoreMatch[1]) : 0;

    console.log(`  Self-assessed score: ${score}/10`);

    if (score >= 8) {
      // Extract the final draft content
      const draftMatch = output.match(/## Draft\n([\s\S]*?)(?=\n## Self-Critique)/);
      const improvedMatch = output.match(/## Improved Version\n([\s\S]*?)$/);
      finalContent = (improvedMatch ? improvedMatch[1] : draftMatch?.[1]) || output;
      console.log('  Self-reflection complete — satisfied with quality.');
      break;
    }

    // Ask for another round
    messages.push({
      role: 'user',
      content: 'Continue improving based on your self-critique. Follow the same format.',
    });
  }

  return finalContent;
}

Self-reflection vs separate agents

| Aspect | Self-Reflection | Separate Critic + Refiner |
| --- | --- | --- |
| LLM calls | 1-3 calls (one agent) | 2-6 calls (multiple agents) |
| Cost | Lower (fewer calls) | Higher (more calls) |
| Critique quality | Model may be lenient on its own work | Separate Critic can be configured to be strict |
| Best for | Quick iteration, low-stakes content | High-stakes content requiring rigorous review |
| Risk | "Blind spot" — model can't see its own flaws | Critic may find issues the Generator would miss |

8. Critic-Refiner for Code Review

One of the most powerful applications is automated code review and improvement.

async function codeReviewLoop(code, requirements) {
  const CODE_CRITIC_PROMPT = `You are a senior code reviewer. Evaluate this code against best practices.

CHECK FOR:
1. Correctness — Does it handle edge cases? Any bugs?
2. Security — SQL injection, XSS, input validation?
3. Performance — O(n^2) when O(n) is possible? Memory leaks?
4. Readability — Clear names, comments where needed, consistent style?
5. Error handling — Are errors caught and handled gracefully?
6. Testing — Is the code testable? Are there obvious test cases?

Return JSON:
{
  "overall_score": <1-10>,
  "issues": [
    { "severity": "critical|major|minor", "line": "...", "issue": "...", "fix": "..." }
  ],
  "strengths": ["..."],
  "ready_for_production": <boolean>
}`;

  const CODE_REFINER_PROMPT = `You are a senior developer. Refactor the code to address all review feedback. 
Return ONLY the improved code — no explanations.`;

  let currentCode = code;

  for (let i = 1; i <= 3; i++) {
    console.log(`\n--- Code Review Round ${i} ---`);

    // Critique
    const critiqueResponse = await openai.chat.completions.create({
      model: 'gpt-4o',
      temperature: 0,
      response_format: { type: 'json_object' },
      messages: [
        { role: 'system', content: CODE_CRITIC_PROMPT },
        { role: 'user', content: `Requirements: ${requirements}\n\nCode:\n\`\`\`javascript\n${currentCode}\n\`\`\`` },
      ],
    });

    const review = JSON.parse(critiqueResponse.choices[0].message.content);
    console.log(`  Score: ${review.overall_score}/10`);
    console.log(`  Issues: ${review.issues.length}`);

    if (review.ready_for_production || review.overall_score >= 9) {
      console.log('  Code is production-ready!');
      return { code: currentCode, review, iterations: i };
    }

    // Refine
    const issueList = review.issues
      .map((issue) => `[${issue.severity}] ${issue.line}: ${issue.issue} -> Fix: ${issue.fix}`)
      .join('\n');

    const refineResponse = await openai.chat.completions.create({
      model: 'gpt-4o',
      temperature: 0,
      messages: [
        { role: 'system', content: CODE_REFINER_PROMPT },
        {
          role: 'user',
          content: `Original code:\n\`\`\`javascript\n${currentCode}\n\`\`\`\n\nReview feedback:\n${issueList}\n\nProduce improved code:`,
        },
      ],
    });

    currentCode = refineResponse.choices[0].message.content
      .replace(/^```javascript\n?/, '')
      .replace(/\n?```$/, '');

    console.log(`  Refined: ${currentCode.length} characters`);
  }

  return { code: currentCode, iterations: 3, maxReached: true };
}

// Usage
const reviewResult = await codeReviewLoop(
  `
function fetchUsers(db, name) {
  const query = "SELECT * FROM users WHERE name = '" + name + "'";
  const result = db.query(query);
  return result;
}`,
  'Fetch users by name from database. Must be secure, handle errors, and return typed results.'
);

After iteration 1, the Critic would flag the SQL injection vulnerability. The Refiner would produce parameterized queries. After iteration 2, the Critic might flag missing error handling. The Refiner would add try/catch. By iteration 3, the code is production-ready.
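For illustration only, the function after those rounds might look something like the sketch below. This is an assumed end state, not actual model output, and the `db.query(sql, params)` call with `$1` placeholders assumes a Postgres-style client such as node-postgres:

```javascript
// Assumed refined version: parameterized query (fixes the injection),
// input validation, and explicit error handling.
async function fetchUsers(db, name) {
  if (typeof name !== 'string' || name.length === 0) {
    throw new TypeError('name must be a non-empty string');
  }
  try {
    // Parameterized query: user input is never concatenated into the SQL
    const result = await db.query('SELECT * FROM users WHERE name = $1', [name]);
    return result.rows ?? [];
  } catch (err) {
    throw new Error(`fetchUsers failed: ${err.message}`);
  }
}
```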


9. Configuring the Loop

Choosing the quality threshold

| Content Type | Recommended Threshold | Why |
| --- | --- | --- |
| Internal notes | 6/10 | Speed matters more than polish |
| Blog posts | 7-8/10 | Good quality, doesn't need perfection |
| Documentation | 8/10 | Accuracy and clarity are critical |
| Legal/medical | 9/10 | Errors have serious consequences |
| Production code | 8-9/10 | Bugs are expensive to fix later |
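One way to keep these thresholds out of the loop code is a small lookup, so the choice reads as a named policy decision rather than a magic number. A sketch (keys and values mirror the table; the ranges are collapsed to a single number):

```javascript
// Quality threshold per content type, per the table above.
const QUALITY_THRESHOLDS = {
  internal_notes: 6,
  blog_post: 7.5,
  documentation: 8,
  legal_medical: 9,
  production_code: 8.5,
};

function thresholdFor(contentType) {
  const threshold = QUALITY_THRESHOLDS[contentType];
  if (threshold === undefined) {
    throw new Error(`No quality threshold defined for: ${contentType}`);
  }
  return threshold;
}
```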

Choosing max iterations

1 iteration  = 3 LLM calls (generate + critique + refine)
2 iterations = 5 LLM calls (generate + critique + refine + critique + refine)
3 iterations = 7 LLM calls (max recommended for most use cases)

Note: the implementation in section 6 runs one extra critique after the final refine to report a final score, so 3 full iterations cost 8 calls there.

Cost per LLM call (GPT-4o, ~2000 tokens of input and ~2000 of output):
  Input:  ~2000 tokens × $2.50/1M = $0.005
  Output: ~2000 tokens × $10.00/1M = $0.020
  Per call: ~$0.025
  3 iterations (7 calls): ~$0.175

For high-volume applications, consider:
  - 1 iteration for 80% of requests (good enough)
  - 2 iterations for edge cases (Critic flagged critical issues)
  - 3 iterations maximum (diminishing returns beyond this)
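The tiered policy above can be sketched as a function that derives the iteration budget from the first critique instead of using one fixed maximum. The cutoffs here are illustrative assumptions:

```javascript
// Pick an iteration budget from the first critique's score and severity mix.
function iterationBudget(firstFeedback) {
  const hasCritical = firstFeedback.issues.some((i) => i.severity === 'critical');
  if (hasCritical) return 3;                       // critical issues earn the full budget
  if (firstFeedback.overall_score >= 7) return 1;  // good enough after one pass
  return 2;                                        // middling drafts get one extra pass
}

console.log(iterationBudget({ overall_score: 8, issues: [] }));                         // → 1
console.log(iterationBudget({ overall_score: 5, issues: [{ severity: 'critical' }] })); // → 3
```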

Diminishing returns detection

function shouldContinueIterating(history) {
  if (history.length < 2) return true;

  const lastTwo = history.filter((h) => h.type === 'critique').slice(-2);
  if (lastTwo.length < 2) return true;

  const scoreDelta = lastTwo[1].score - lastTwo[0].score;

  // If improvement between last two iterations is < 1 point, stop
  if (scoreDelta < 1) {
    console.log(`  Diminishing returns: score improved only ${scoreDelta} points. Stopping.`);
    return false;
  }

  return true;
}

10. Design Considerations

Preventing infinite loops

Always have TWO exit conditions:

  1. Quality threshold met (the happy path)
  2. Maximum iterations reached (the safety net)

A diminishing-returns check (section 9) makes a useful optional third:

const MAX_ITERATIONS = 3;        // Hard ceiling
const QUALITY_THRESHOLD = 8;     // Exit when score >= this
const MIN_IMPROVEMENT = 0.5;     // Optional: exit if improvement per iteration < this

Critic consistency

The Critic should produce consistent scores. Run the same content through the Critic multiple times — if scores vary by more than 1 point, the Critic's prompt needs refinement.

// Test Critic consistency
async function testCriticConsistency(content, criteria, runs = 5) {
  const scores = [];
  for (let i = 0; i < runs; i++) {
    const feedback = await critique(content, criteria);
    scores.push(feedback.overall_score);
  }
  const avg = scores.reduce((a, b) => a + b) / scores.length;
  const variance = scores.reduce((a, b) => a + (b - avg) ** 2, 0) / scores.length;
  console.log(`Scores: ${scores.join(', ')}`);
  console.log(`Average: ${avg.toFixed(1)}, Variance: ${variance.toFixed(2)}`);
  // Variance > 1.0 suggests the Critic is too inconsistent
}

Combining with other patterns

The Critic-Refiner loop is often the final stage in a pipeline:

Planner-Executor → produces raw output
Researcher-Writer → produces a draft report
  → Critic-Refiner → polishes the draft to publication quality
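In code, the hand-off can be expressed as a loop variant that starts from an existing draft instead of generating one, so any upstream pattern's output can be polished. A sketch: `critiqueFn`/`refineFn` would wrap the LLM calls from section 6, while the demo below substitutes plain functions.

```javascript
// Polishing loop for a draft produced elsewhere (e.g. by a Researcher-Writer
// pipeline). Same two exit conditions: threshold met, or budget exhausted.
async function polishDraft(draft, critiqueFn, refineFn, options = {}) {
  const { threshold = 8, maxIterations = 2 } = options;
  let content = draft;
  let score = null;
  for (let i = 1; i <= maxIterations; i++) {
    const feedback = await critiqueFn(content);
    score = feedback.overall_score;
    if (score >= threshold) break; // already good enough
    content = await refineFn(content, feedback);
  }
  return { content, score }; // score is from the most recent critique
}

// Toy demo: the critic approves once the draft reaches three characters.
const polished = await polishDraft(
  'd',
  async (c) => ({ overall_score: c.length >= 3 ? 8 : 5 }),
  async (c) => c + 'x',
  { threshold: 8, maxIterations: 3 }
);
console.log(polished.content, polished.score); // → dxx 8
```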

11. Key Takeaways

  1. The Critic-Refiner pattern implements iterative improvement — generate, critique, refine, repeat. It models the human draft-review-revise cycle.
  2. The Critic must be strict and specific — vague feedback like "needs improvement" is useless. Good feedback names the exact location, issue, and suggested fix.
  3. Two exit conditions prevent infinite loops — quality threshold (happy path) and max iterations (safety net). Monitor for diminishing returns.
  4. Self-reflection is a lighter alternative — one agent critiques its own work. Cheaper but less rigorous than separate Critic and Refiner agents.
  5. Code review is a killer use case — the Critic finds bugs, security issues, and style violations; the Refiner produces improved code. Multiple rounds catch more issues than a single pass.
  6. Cost scales linearly with iterations — each iteration is 2 LLM calls (critique + refine). Most tasks converge in 2-3 iterations. Diminishing returns detection saves money.

Explain-It Challenge

  1. A team lead asks "why not just ask GPT to write a better article in one shot instead of doing this loop?" Explain the advantage of iterative improvement with a specific Critic.
  2. Your Critic gives the same content a score of 6 on one run and 9 on another run. What is wrong, and how do you fix it?
  3. You are building a legal document generator. Choose the quality threshold, max iterations, and explain your reasoning. What happens if you set the threshold too low? Too high?

Navigation: ← 4.16.b Researcher-Writer · 4.16.d — Router Agents →