Episode 4 — Generative AI Engineering / 4.3 — Prompt Engineering Fundamentals
4.3.c — Chain-of-Thought Prompting
In one sentence: Chain-of-thought (CoT) prompting asks the model to show its reasoning step by step before giving a final answer, which dramatically improves accuracy on math, logic, and multi-step problems — but adds cost and doesn't help simple tasks.
Navigation: ← 4.3.b — Few-Shot Examples · 4.3.d — Output Formatting Instructions →
1. What Is Chain-of-Thought Prompting?
Normally, when you ask an LLM a question, it jumps straight to the answer. Chain-of-thought (CoT) prompting tells the model to think through the problem step by step before producing the final answer.
WITHOUT CoT (direct answer):
Q: "If a store has 23 apples and sells 7 in the morning and
9 in the afternoon, how many are left?"
A: "14" ← WRONG (model jumped to the answer and skipped the morning sales)
WITH CoT (step-by-step reasoning):
Q: "If a store has 23 apples and sells 7 in the morning and
9 in the afternoon, how many are left? Think step by step."
A: "The store starts with 23 apples.
In the morning, 7 are sold: 23 - 7 = 16 apples remain.
In the afternoon, 9 are sold: 16 - 9 = 7 apples remain.
Answer: 7" ← CORRECT (model worked through each step)
The key insight: by forcing the model to generate intermediate reasoning tokens, each step provides context for the next step. The model is less likely to skip ahead and make errors because each calculation is explicitly written out.
2. The "Let's Think Step by Step" Technique
The simplest form of CoT is appending a single phrase to your prompt. This is called zero-shot CoT because it requires no examples — just the magic phrase.
The classic trigger phrase
Prompt: "How many words are in the sentence 'The quick brown fox jumps
over the lazy dog'? Let's think step by step."
Model response:
"Let me count each word:
1. The
2. quick
3. brown
4. fox
5. jumps
6. over
7. the
8. lazy
9. dog
There are 9 words in the sentence."
Variations that work
| Trigger Phrase | When to Use |
|---|---|
| "Let's think step by step." | General-purpose; the original and most studied |
| "Think through this carefully before answering." | When you want reasoning but less verbose output |
| "Break this down into steps." | When the task has clear sequential steps |
| "Show your work." | Math and calculation problems |
| "Reason through each part before giving your final answer." | Multi-part analysis |
| "First, identify the key information. Then, solve the problem." | When you want to separate analysis from solution |
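These trigger phrases can be centralized in a small lookup so prompts pick one by task type. A minimal sketch — the `withCoT` helper and its category names are hypothetical, not part of any SDK:

```javascript
// Hypothetical helper: append a CoT trigger phrase chosen by task type.
// Phrases mirror the table above; unknown types fall back to the
// general-purpose trigger.
const COT_TRIGGERS = {
  general: "Let's think step by step.",
  math: 'Show your work.',
  sequential: 'Break this down into steps.',
  analysis: 'Reason through each part before giving your final answer.',
};

function withCoT(prompt, taskType = 'general') {
  const trigger = COT_TRIGGERS[taskType] ?? COT_TRIGGERS.general;
  return `${prompt.trim()} ${trigger}`;
}

console.log(withCoT('What is 15% of 340?', 'math'));
// → What is 15% of 340? Show your work.
```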
Structured CoT prompt
For more control over the reasoning process, give the model a reasoning template:
const systemPrompt = `You are a math tutoring assistant.
When solving any problem, follow this EXACT structure:
GIVEN: List the known facts from the problem
FIND: State what we need to calculate
STEPS:
Step 1: [first calculation with explanation]
Step 2: [next calculation with explanation]
... (as many steps as needed)
ANSWER: [final answer with units]
Never skip the GIVEN/FIND/STEPS/ANSWER structure.`;
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0,
messages: [
{ role: 'system', content: systemPrompt },
{
role: 'user',
content: 'A train travels at 60 mph for 2.5 hours, then at 80 mph for 1.5 hours. What is the total distance?'
},
],
});
Expected output:
GIVEN: Speed 1 = 60 mph, Time 1 = 2.5 hours, Speed 2 = 80 mph, Time 2 = 1.5 hours
FIND: Total distance traveled
STEPS:
Step 1: Calculate distance for the first leg.
Distance = Speed × Time = 60 × 2.5 = 150 miles
Step 2: Calculate distance for the second leg.
Distance = Speed × Time = 80 × 1.5 = 120 miles
Step 3: Add both distances.
Total = 150 + 120 = 270 miles
ANSWER: 270 miles
3. Few-Shot Chain-of-Thought
You can combine few-shot examples with chain-of-thought by including reasoning traces in your examples. This is often more reliable than zero-shot CoT because the model sees exactly how you want the reasoning structured.
const prompt = `Determine if the customer qualifies for free shipping.
Rules: Free shipping if order total > $50 OR customer is a premium member.
Order: Total $45, Regular member
Reasoning: Order total is $45. $45 is not greater than $50. Customer is Regular (not Premium). Neither condition is met.
Qualifies: No
Order: Total $35, Premium member
Reasoning: Order total is $35. $35 is not greater than $50. However, customer is a Premium member. The second condition (Premium member) IS met.
Qualifies: Yes
Order: Total $72, Regular member
Reasoning: Order total is $72. $72 is greater than $50. The first condition (order > $50) IS met.
Qualifies: Yes
Order: Total $${orderTotal}, ${membershipType} member
Reasoning:`;
The model now produces reasoning that follows the exact pattern shown in the examples — checking each condition explicitly before reaching a conclusion.
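A small builder keeps the worked examples fixed while the final order varies. This is a sketch: `buildShippingPrompt` is a hypothetical helper, and only two of the three examples above are included to keep it short:

```javascript
// Worked examples from the prompt above (two of three, for brevity).
const SHIPPING_EXAMPLES = `Order: Total $45, Regular member
Reasoning: Order total is $45. $45 is not greater than $50. Customer is Regular (not Premium). Neither condition is met.
Qualifies: No

Order: Total $72, Regular member
Reasoning: Order total is $72. $72 is greater than $50. The first condition (order > $50) IS met.
Qualifies: Yes`;

// Build the few-shot CoT prompt for a new, unanswered order.
function buildShippingPrompt(orderTotal, membershipType) {
  return `Determine if the customer qualifies for free shipping.
Rules: Free shipping if order total > $50 OR customer is a premium member.

${SHIPPING_EXAMPLES}

Order: Total $${orderTotal}, ${membershipType} member
Reasoning:`;
}

const prompt = buildShippingPrompt(60, 'Premium');
console.log(prompt.endsWith('Reasoning:')); // → true
```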
4. When Chain-of-Thought Helps
CoT significantly improves performance on tasks that require multi-step reasoning — where the model needs to work through intermediate steps to reach the correct answer.
Math and arithmetic
WITHOUT CoT:
Q: "What is 15% of 340?"
A: "50.1" ← WRONG (plausible-looking, but a computation error)
WITH CoT:
Q: "What is 15% of 340? Show your work."
A: "15% means 15/100 = 0.15
0.15 × 340 = 51
Answer: 51" ← Correct
Logic and conditional reasoning
WITHOUT CoT:
Q: "If all engineers know Python, and Alice is an engineer,
and Bob knows Python, is Bob necessarily an engineer?"
A: "Yes" ← WRONG
WITH CoT:
Q: "If all engineers know Python, and Alice is an engineer,
and Bob knows Python, is Bob necessarily an engineer?
Think step by step."
A: "Premise 1: All engineers know Python (Engineer → Python)
Premise 2: Alice is an engineer → Alice knows Python
Premise 3: Bob knows Python
The question asks: Does Bob knowing Python mean he's an engineer?
Premise 1 says engineers know Python, but it does NOT say that
ONLY engineers know Python. Bob could be a data scientist who
also knows Python.
Answer: No, Bob is not necessarily an engineer." ← CORRECT
Multi-step planning
WITH CoT for debugging:
Q: "This function should return the sum of even numbers in an array,
but it returns 0 for [1,2,3,4,5,6]. Find the bug.
function sumEvens(arr) {
let sum = 0;
for (let i = 0; i < arr.length; i++) {
if (arr[i] % 2 === 0) sum += i;
}
return sum;
}
Think step by step."
A: "Let me trace through the execution:
i=0: arr[0]=1, 1%2=1 (odd), skip
i=1: arr[1]=2, 2%2=0 (even), sum += i → sum += 1 (not the value!)
i=2: arr[2]=3, 3%2=1 (odd), skip
i=3: arr[3]=4, 4%2=0 (even), sum += i → sum += 3 (not the value!)
i=4: arr[4]=5, 5%2=1 (odd), skip
i=5: arr[5]=6, 6%2=0 (even), sum += i → sum += 5 (not the value!)
Total: 1+3+5 = 9 (not 0 as stated, but still wrong)
Bug: Line 'sum += i' adds the INDEX instead of the VALUE.
Fix: Change to 'sum += arr[i]'
Correct sum: 2+4+6 = 12"
Comparison and decision-making
const systemPrompt = `You are a technology advisor.
When asked to compare options, structure your reasoning as:
CRITERIA: List the comparison criteria
ANALYSIS: Evaluate each option against each criterion
TRADE-OFFS: What you gain and lose with each option
RECOMMENDATION: Your pick and why`;
const userMessage = `Should we use PostgreSQL or MongoDB for a
social media application with user profiles, posts, comments,
and a complex notification system?`;
5. When Chain-of-Thought Does NOT Help
CoT is not a universal improvement. For some tasks, it adds cost and latency without improving quality — and can even make things worse.
Simple factual lookup
WITHOUT CoT (fast, correct):
Q: "What is the capital of Japan?"
A: "Tokyo"
WITH CoT (wasteful):
Q: "What is the capital of Japan? Think step by step."
A: "Japan is a country in East Asia. The capital of a country is
the city where the central government is located. For Japan,
this city is Tokyo, which has been the capital since 1868.
Answer: Tokyo"
→ Same answer, 10x more tokens, no accuracy improvement
Direct extraction
WITHOUT CoT (correct, efficient):
Q: "Extract all email addresses from: 'Contact us at
info@example.com or support@example.com'"
A: "info@example.com, support@example.com"
WITH CoT (wasteful):
Q: "Extract all email addresses from this text. Think step by step."
A: "Let me scan the text for email patterns...
An email address has the format name@domain.
I see 'info@example.com' — this matches the email pattern.
I also see 'support@example.com' — this also matches.
There are no other @ symbols in the text.
Answer: info@example.com, support@example.com"
→ Same answer, 5x more tokens
Classification of clear-cut inputs
Clear positive review: "Best product ever! 10/10 would buy again!"
→ CoT adds no value. The classification is obvious.
Ambiguous review: "Works fine for the price, but I expected more."
→ CoT might help decide between neutral and negative.
Tasks where the model is already near-perfect
If the model already gets 98% accuracy without CoT, adding CoT might improve it to 99% but at 3-5x the token cost. The ROI may not justify it.
Summary: When to use and when to skip
┌─────────────────────────────────────────────────────────────────┐
│ CoT DECISION GUIDE │
│ │
│ USE CoT when: SKIP CoT when: │
│ ✓ Math / calculations ✗ Simple factual questions │
│ ✓ Logic / reasoning ✗ Direct extraction │
│ ✓ Multi-step problems ✗ Clear-cut classification │
│ ✓ Debugging code ✗ Translation │
│ ✓ Comparison / analysis ✗ Formatting / rewriting │
│ ✓ Decision-making ✗ Token budget is very tight │
│ ✓ Ambiguous classification ✗ Latency is critical │
└─────────────────────────────────────────────────────────────────┘
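The decision guide can be encoded as a simple routing check. This is a rough heuristic sketch, not a production classifier — the category names are assumptions, and a real system would classify incoming requests upstream:

```javascript
// Task categories the guide above marks as benefiting from CoT.
const COT_TASKS = new Set([
  'math',
  'logic',
  'multi-step',
  'debugging',
  'comparison',
  'decision-making',
  'ambiguous-classification',
]);

// Returns true when the request should get a CoT-style prompt.
function shouldUseCoT(taskCategory) {
  return COT_TASKS.has(taskCategory);
}

console.log(shouldUseCoT('math'));       // → true
console.log(shouldUseCoT('extraction')); // → false
```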
6. Visible Reasoning vs Hidden Reasoning
This is an important conceptual distinction that every AI engineer needs to understand.
Visible reasoning (what you see)
When you use CoT prompting, the model produces reasoning tokens that you can read. You can see the logic, verify the steps, and catch errors. This is visible reasoning.
Visible reasoning example:
"Step 1: The order was placed on Jan 5.
Step 2: Today is Jan 20.
Step 3: 20 - 5 = 15 days have passed.
Step 4: The return policy is 14 days.
Step 5: 15 > 14, so the return window has expired.
Decision: Return denied."
You can READ each step and verify it.
Hidden reasoning (what some models do internally)
Some models (like OpenAI's o1, o3, and similar "reasoning" models) perform chain-of-thought internally — they reason before producing the response, but you don't see the reasoning steps. You only see the final answer.
Hidden reasoning example:
[Internal: model performs multi-step reasoning — you cannot see this]
Visible output:
"The return window has expired. Return denied."
You get the ANSWER but not the STEPS.
Why this matters for engineers
┌─────────────────────────────────────────────────────────────────┐
│ VISIBLE vs HIDDEN REASONING — IMPLICATIONS │
│ │
│ VISIBLE (standard CoT): │
│ + You can verify/debug the reasoning │
│ + Users can see WHY the model reached its conclusion │
│ + You can catch logical errors in the steps │
│ - Uses more output tokens (costs more, slower) │
│ - Reasoning might be post-hoc rationalization, not real logic │
│ │
│ HIDDEN (reasoning models like o1): │
│ + Often higher accuracy on complex problems │
│ + Cleaner output (just the answer) │
│ - You CANNOT verify the reasoning steps │
│ - Harder to debug when the answer is wrong │
│ - You're trusting the model's internal process │
│ - Different pricing model (reasoning tokens cost money too) │
└─────────────────────────────────────────────────────────────────┘
HIGH-LEVEL AWARENESS: Reasoning models
Models like OpenAI's o1/o3 and similar reasoning models are designed to think before they respond. As an engineer, here is what you need to know at a high level:
- They exist — some models are specifically trained to reason internally before answering.
- You typically don't need to add "think step by step" to these models — they already do it.
- They often outperform standard models on math, code, and logic — but at higher cost and latency.
- The reasoning is hidden — you cannot inspect it, which changes your debugging approach.
- Don't depend on hidden reasoning for critical decisions — if you need to verify the logic, use visible CoT with a standard model instead.
This section is intentionally high-level. The goal is awareness, not deep expertise. You will encounter these models in practice, and you should know they work differently from standard prompting.
7. CoT in Production: Cost Implications
Chain-of-thought reasoning generates more output tokens, which directly increases cost and latency.
Token cost analysis
WITHOUT CoT:
Input: 50 tokens (question)
Output: 5 tokens (answer: "The return is denied.")
Total: 55 tokens
WITH CoT:
Input: 60 tokens (question + "think step by step")
Output: 150 tokens (reasoning + answer)
Total: 210 tokens
Cost multiplier: ~4x more tokens per request
AT SCALE (100,000 requests/day, GPT-4o):
Without CoT:
Input: 5M tokens × $2.50/1M = $12.50
Output: 0.5M tokens × $10/1M = $5.00
Daily: $17.50
With CoT:
Input: 6M tokens × $2.50/1M = $15.00
Output: 15M tokens × $10/1M = $150.00
Daily: $165.00
Difference: ~$147.50/day = ~$4,425/month MORE
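The arithmetic above can be checked with a small cost function. The per-million prices are the GPT-4o rates assumed in this example — verify them against current pricing before relying on the numbers:

```javascript
// Daily cost in dollars, given request volume and per-request token
// counts. Prices are the assumed GPT-4o rates from the example above
// ($2.50 / $10.00 per 1M input / output tokens).
const PRICE_PER_1M = { input: 2.5, output: 10.0 };

function dailyCostUSD(requestsPerDay, inputTokens, outputTokens) {
  const inputCost = (requestsPerDay * inputTokens / 1e6) * PRICE_PER_1M.input;
  const outputCost = (requestsPerDay * outputTokens / 1e6) * PRICE_PER_1M.output;
  return inputCost + outputCost;
}

console.log(dailyCostUSD(100_000, 50, 5));   // → 17.5  (without CoT)
console.log(dailyCostUSD(100_000, 60, 150)); // → 165   (with CoT)
```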
Latency impact
Output tokens are generated sequentially (one at a time).
WITHOUT CoT:
5 output tokens × ~20ms per token = ~100ms generation time
WITH CoT:
150 output tokens × ~20ms per token = ~3,000ms generation time
User perceives: 0.1 seconds vs 3 seconds
Production strategy: Selective CoT
Don't apply CoT to every request. Use it only when it adds value:
async function processRequest(question, complexity) {
const systemPrompt = complexity === 'complex'
? `Answer the question. Think step by step before giving your final answer.
Structure your response as:
REASONING: [your step-by-step analysis]
ANSWER: [your final answer]`
: `Answer the question concisely.`;
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0,
messages: [
{ role: 'system', content: systemPrompt },
{ role: 'user', content: question },
],
});
return response.choices[0].message.content;
}
// Simple question — no CoT needed
await processRequest("What is the capital of France?", "simple");
// → "Paris"
// Complex question — CoT helps
await processRequest(
"If I invest $10,000 at 7% annual return compounded monthly, how much will I have after 5 years?",
"complex"
);
// → REASONING: [detailed calculation steps]
// ANSWER: $14,176.25
Extracting just the answer from CoT output
When you use CoT but only need the final answer for downstream processing:
async function solveWithCoT(question) {
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0,
messages: [{
role: 'user',
content: `${question}
Think step by step, then provide your final answer after "FINAL ANSWER:".`
}],
});
const fullResponse = response.choices[0].message.content;
// Extract just the final answer
const answerMatch = fullResponse.match(/FINAL ANSWER:\s*(.+)/);
const answer = answerMatch ? answerMatch[1].trim() : fullResponse;
return {
reasoning: fullResponse, // Store for debugging
answer: answer, // Use for downstream processing
};
}
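The parsing step is pure string work, so it can be pulled out of the API call and unit-tested on its own:

```javascript
// Extract the text after "FINAL ANSWER:"; fall back to the full
// response when the marker is missing (same logic as above).
function extractFinalAnswer(fullResponse) {
  const match = fullResponse.match(/FINAL ANSWER:\s*(.+)/);
  return match ? match[1].trim() : fullResponse;
}

const sample = 'Step 1: 150 miles.\nStep 2: 120 miles.\nFINAL ANSWER: 270 miles';
console.log(extractFinalAnswer(sample)); // → 270 miles
```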
8. CoT Pitfalls and Limitations
Pitfall 1: Post-hoc rationalization
The model may generate reasoning that sounds logical but isn't what actually determined the answer. The model might "decide" the answer first, then construct reasoning to support it.
WARNING: The reasoning steps are generated tokens, not a window into
the model's actual computation. The model might produce:
"Step 1: The number 7 is prime.
Step 2: 7 × 3 = 21
Step 3: 21 + 4 = 25
Answer: 25"
Even if the answer is correct, the steps might be a plausible narrative
rather than the actual path the model's neural network took.
Pitfall 2: Reasoning errors compound
If the model makes an error in an early step, all subsequent steps build on that error:
Q: "What is 37 × 28?"
A: "Step 1: 37 × 20 = 730 ← Wrong (should be 740)
Step 2: 37 × 8 = 296 ← Correct
Step 3: 730 + 296 = 1026 ← Wrong because Step 1 was wrong
Answer: 1026" ← Final answer is wrong
Correct answer: 740 + 296 = 1036
Pitfall 3: Overthinking simple problems
CoT can make the model overthink and second-guess straightforward tasks:
Q: "Is 4 greater than 3? Think step by step."
A: "Let me consider this carefully.
4 is a positive integer.
3 is also a positive integer.
On the number line, 4 is to the right of 3.
Numbers to the right are greater.
However, I should also consider if there are special cases...
In standard mathematics, 4 > 3.
But could there be a trick? The question seems straightforward.
Answer: Yes, 4 is greater than 3."
→ 80+ tokens for a question that needs 1 token ("Yes")
Pitfall 4: Inconsistent reasoning format
Without a template, the model may structure its reasoning differently each time, making it hard to parse programmatically.
Fix: Always provide a reasoning template (GIVEN/FIND/STEPS/ANSWER or similar).
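This is the payoff of a fixed template: the output becomes machine-parseable. A minimal parser sketch for the GIVEN/FIND/STEPS/ANSWER structure from section 2 (`parseStructuredCoT` is a hypothetical helper name):

```javascript
// Pull single-line fields out of a GIVEN/FIND/STEPS/ANSWER response.
// Returns null for any label the model omitted.
function parseStructuredCoT(text) {
  const grab = (label) => {
    const m = text.match(new RegExp(`${label}:\\s*([^\\n]+)`));
    return m ? m[1].trim() : null;
  };
  return { given: grab('GIVEN'), find: grab('FIND'), answer: grab('ANSWER') };
}

const parsed = parseStructuredCoT(
  'GIVEN: Speed 60 mph, Time 2.5 hours\nFIND: Distance\nSTEPS:\nStep 1: 60 × 2.5 = 150 miles\nANSWER: 150 miles'
);
console.log(parsed.answer); // → 150 miles
```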
9. Combining CoT with Other Techniques
CoT + Few-Shot (most powerful combination)
const prompt = `Determine if the code has a security vulnerability.
Code: user_input = request.params.id; db.query("SELECT * FROM users WHERE id = " + user_input);
Reasoning: The code concatenates user input directly into an SQL query string. The variable user_input comes from request.params.id, which is user-controlled. An attacker could input "1 OR 1=1" to retrieve all records, or "1; DROP TABLE users" to destroy data. This is SQL injection.
Verdict: VULNERABLE — SQL Injection
Fix: Use parameterized queries: db.query("SELECT * FROM users WHERE id = $1", [user_input])
Code: const token = jwt.sign(payload, process.env.SECRET_KEY);
Reasoning: The code signs a JWT using a secret key from environment variables. The secret key is not hardcoded in the source code. The jwt.sign function is a standard operation. No user input is directly concatenated. The security depends on the SECRET_KEY being strong and kept secret, but the code pattern itself is correct.
Verdict: SAFE — Standard JWT signing pattern
Code: ${userCode}
Reasoning:`;
CoT + Output formatting
const prompt = `Analyze the API response for errors.
Think through each field, then provide your analysis in this JSON format:
{
"has_errors": boolean,
"errors": [{"field": string, "issue": string, "severity": "high"|"medium"|"low"}],
"reasoning": string
}
API Response to analyze:
${apiResponse}`;
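Downstream code can then validate the model's JSON before trusting it. A minimal shape-check sketch — the field names come from the format specified in the prompt, and `parseAnalysis` is a hypothetical helper:

```javascript
// Parse and shape-check the model's JSON analysis. Throws when the
// response doesn't match the contract from the prompt.
function parseAnalysis(raw) {
  const data = JSON.parse(raw);
  if (typeof data.has_errors !== 'boolean' || !Array.isArray(data.errors)) {
    throw new Error('Response does not match the expected schema');
  }
  return data;
}

const result = parseAnalysis(
  '{"has_errors": true, "errors": [{"field": "price", "issue": "negative value", "severity": "high"}], "reasoning": "Price cannot be negative."}'
);
console.log(result.errors.length); // → 1
```

In practice you may also need to strip markdown code fences from the raw response before calling `JSON.parse`, since models sometimes wrap JSON output in them.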
10. Key Takeaways
- Chain-of-thought prompting asks the model to reason step by step before answering, improving accuracy on complex tasks.
- The simplest trigger is "Let's think step by step" — but structured templates give more consistent results.
- CoT helps with math, logic, debugging, comparison, and multi-step problems.
- CoT does NOT help with simple factual lookups, direct extraction, or clear-cut classification — and wastes tokens on these tasks.
- Visible reasoning (standard CoT) lets you verify the steps; hidden reasoning (reasoning models) does not.
- CoT costs more — typically 3-5x more output tokens per request. Use it selectively.
- Few-shot + CoT is the most powerful combination — show the model both the reasoning pattern and the output format.
- CoT reasoning may be post-hoc rationalization, not the model's actual computation — don't trust the steps blindly.
Explain-It Challenge
- A colleague asks: "Why would I pay 4x more tokens just to make the model 'think'? The answer is the same either way." Explain when CoT changes the answer and when it doesn't.
- Your team uses CoT for every LLM call in production. The monthly API bill is $15,000. You estimate that only 30% of requests actually benefit from CoT. What do you propose, and how much could you save?
- A reasoning model gives you the wrong answer to a logic problem, but you can't see its reasoning steps. How do you debug this compared to debugging a standard model with visible CoT?
Navigation: ← 4.3.b — Few-Shot Examples · 4.3.d — Output Formatting Instructions →