Episode 4 — Generative AI Engineering / 4.3 — Prompt Engineering Fundamentals

4.3.b — Few-Shot Examples

In one sentence: Few-shot prompting means including example input-output pairs in your prompt so the model learns the pattern you want by seeing it demonstrated, rather than having it described in words.

Navigation: ← 4.3.a — Writing Clear Instructions · 4.3.c — Chain-of-Thought →


1. What Is Few-Shot Learning?

In prompt engineering, "shots" refer to how many examples you include in the prompt before asking the model to perform the actual task. This terminology comes from machine learning research, but in practice it simply means "show before you ask."

┌─────────────────────────────────────────────────────────────────┐
│                        THE SHOT SPECTRUM                        │
│                                                                 │
│  ZERO-SHOT         ONE-SHOT           FEW-SHOT                  │
│  (no examples)     (1 example)        (2-10+ examples)          │
│                                                                 │
│  "Classify this    "Example:          "Example 1:               │
│   as positive      'Great!' →         'Great!' → positive       │
│   or negative"     positive.          'Terrible' → negative     │
│                    Now classify:      'It was okay' → neutral   │
│                    'Awful service'"   Now classify:             │
│                                       'Awful service'"          │
│                                                                 │
│  Model guesses     Model sees the     Model sees the pattern    │
│  the format and    pattern once       multiple times → more     │
│  task definition                      reliable outputs          │
└─────────────────────────────────────────────────────────────────┘

Zero-shot: No examples, just instructions

const zeroShotPrompt = `Classify the following customer review as 
"positive", "negative", or "neutral".

Review: "The product arrived on time but the packaging was damaged."
Classification:`;

The model uses its general training to interpret the task. Works well for common, well-understood tasks. Can be inconsistent for ambiguous or nuanced tasks.

One-shot: One example

const oneShotPrompt = `Classify the following customer review as 
"positive", "negative", or "neutral".

Example:
Review: "Absolutely love this product! Best purchase I've made."
Classification: positive

Now classify:
Review: "The product arrived on time but the packaging was damaged."
Classification:`;

One example shows the model the exact format you want. The model now knows: (a) what a "classification" looks like, (b) that the answer should be a single word, (c) that it should be lowercase.

Few-shot: Multiple examples

const fewShotPrompt = `Classify the following customer review as 
"positive", "negative", or "neutral".

Example 1:
Review: "Absolutely love this product! Best purchase I've made."
Classification: positive

Example 2:
Review: "Terrible quality. Broke after one day. Want a refund."
Classification: negative

Example 3:
Review: "It's fine. Does what it says. Nothing special."
Classification: neutral

Example 4:
Review: "Good product but shipping took forever."
Classification: neutral

Now classify:
Review: "The product arrived on time but the packaging was damaged."
Classification:`;

Multiple examples show the model edge cases and boundary conditions. Example 4 is particularly important — it shows a mixed review classified as "neutral," which teaches the model how to handle ambiguity.


2. How Examples Force Pattern Learning

When you include examples, the model isn't just "reading" them — it is detecting patterns across the examples and applying those patterns to the new input. This is called in-context learning.

WHAT THE MODEL DETECTS FROM YOUR EXAMPLES:
┌──────────────────────────────────────────────────┐
│ Pattern 1: Input format                          │
│   Every input starts with "Review:" and is a     │
│   quoted sentence.                               │
│                                                  │
│ Pattern 2: Output format                         │
│   Every output is a single lowercase word        │
│   on the line after "Classification:"            │
│                                                  │
│ Pattern 3: Label set                             │
│   Only three possible values: positive,          │
│   negative, neutral. Nothing else.               │
│                                                  │
│ Pattern 4: Decision boundary                     │
│   Mixed sentiment → neutral (Example 4)          │
│   Strong language → positive or negative         │
│   Mild language → neutral                        │
│                                                  │
│ Pattern 5: Tone mapping                          │
│   "love" → positive, "terrible" → negative       │
│   "fine" → neutral, "good but..." → neutral      │
└──────────────────────────────────────────────────┘

This is why few-shot is so powerful: you are programming the model's behavior through data, not through instructions. The model generalizes from your examples to handle new inputs.


3. Designing Effective Examples

Not all examples are equally useful. Poorly chosen examples can actually make the model worse. Here are the principles for designing effective few-shot examples:

Principle 1: Cover the label space

If you have 3 possible outputs (positive, negative, neutral), include at least one example of each. If the model never sees a "neutral" example, it will be biased toward the labels it did see.

BAD (missing label):
  Example 1: "Love it!" → positive
  Example 2: "Great quality" → positive
  Example 3: "Hate it" → negative
  // No "neutral" example → model rarely outputs "neutral"

GOOD (all labels covered):
  Example 1: "Love it!" → positive
  Example 2: "Hate it" → negative
  Example 3: "It's okay" → neutral

Principle 2: Include edge cases

The most valuable examples are the ambiguous ones — the cases where the model is most likely to get it wrong without guidance.

const examples = [
  // Clear cases (easy — model probably gets these right anyway)
  { input: 'Best product ever!', output: 'positive' },
  { input: 'Total garbage, want refund.', output: 'negative' },
  
  // Edge cases (hard — this is where examples add real value)
  { input: 'Good product but overpriced.', output: 'neutral' },
  { input: 'Not bad, actually.', output: 'positive' },  // Negation flips the sentiment
  { input: 'I expected more for this price.', output: 'negative' },  // Subtle negativity
  { input: 'Works as described.', output: 'neutral' },  // Factual, not emotional
];

Principle 3: Match the real distribution

If 70% of your real inputs are positive, 20% negative, and 10% neutral, your examples should roughly reflect that — or at minimum not be wildly skewed. If you give 5 negative examples and 1 positive, the model may develop a bias toward negative classifications.
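One way to keep the mix honest is to build the example list from a labeled bank according to a target distribution. A minimal sketch (the `sampleByDistribution` helper and the `{ input, output }` bank shape are illustrative assumptions, not a library API):

```javascript
// Pick examples so their label mix roughly matches a target distribution
// (e.g., the mix you observe in real traffic), while guaranteeing at
// least one example per label so no label is silently missing.
function sampleByDistribution(bank, distribution, total) {
  const picked = [];
  for (const [label, fraction] of Object.entries(distribution)) {
    const count = Math.max(1, Math.round(total * fraction));
    picked.push(...bank.filter((ex) => ex.output === label).slice(0, count));
  }
  return picked;
}
```

With a 70/20/10 target and `total = 5`, this yields roughly four positive, one negative, and one neutral example: skewed toward positive like the real data, but never silent on a label.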

Principle 4: Keep examples representative

Examples should look like the actual inputs the model will encounter. If real customer reviews are 1-3 sentences long, don't use paragraph-length examples.

BAD (unrepresentative):
  Example: "After extensive deliberation and careful consideration of 
  all the factors involved, including price, quality, customer service 
  responsiveness, and overall brand reputation, I have concluded that 
  this product meets my requirements."
  → positive

GOOD (representative of real data):
  Example: "Good quality, fast shipping. Happy with my purchase."
  → positive

Principle 5: Use diverse phrasing

Avoid examples that are too similar — the model might learn a superficial pattern (like "any review with 'love' is positive") instead of the real task.

BAD (too similar):
  "I love this product" → positive
  "I love this item" → positive
  "I love this purchase" → positive
  // Model learns: "love" → positive (too narrow)

GOOD (diverse phrasing):
  "I love this product" → positive
  "Exceeded my expectations completely" → positive
  "Solid build quality, very impressed" → positive
  // Model learns: general positive sentiment → positive

4. Example Formatting: Consistency Is Everything

The format of your examples matters as much as their content. The model learns the pattern of your formatting, so inconsistent formatting teaches the model to be inconsistent.

Use consistent delimiters

Pick a delimiter style and stick with it across all examples:

STYLE 1: Label-colon format
  Input: "Great product!"
  Sentiment: positive

STYLE 2: Arrow format
  "Great product!" → positive

STYLE 3: Bracketed format
  Review: "Great product!"
  [SENTIMENT]: positive

STYLE 4: Structured blocks
  ---
  Review: "Great product!"
  Classification: positive
  ---

Do not mix styles. If example 1 uses arrows and example 2 uses colons, the model's output format becomes unpredictable.

Input-output pairs should be clearly separated

// Clear separation between examples
const prompt = `Extract the person's name and age from the text.

===
Text: "John Smith, a 34-year-old engineer from Boston."
Name: John Smith
Age: 34
===
Text: "Meet Dr. Sarah Chen (age 45), head of research."
Name: Sarah Chen
Age: 45
===
Text: "The award was given to 28-year-old Maria Garcia."
Name: Maria Garcia
Age: 28
===
Text: "${userInput}"
Name:`;

The === delimiter makes it crystal clear where each example starts and ends. The model knows to start generating after the last Name:.
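A reliable way to keep delimiters identical across examples is to render the example block from data rather than typing it by hand. A sketch (`buildExtractionPrompt` is a hypothetical helper, assuming examples stored as `{ text, fields }` objects):

```javascript
// Render every example with exactly the same "===" delimiter and field
// layout, then append the user input with a trailing cue so the model
// starts generating at the first field.
function buildExtractionPrompt(instruction, examples, userInput) {
  const renderFields = (fields) =>
    Object.entries(fields).map(([key, value]) => `${key}: ${value}`).join('\n');

  const body = examples
    .map((ex) => `Text: "${ex.text}"\n${renderFields(ex.fields)}`)
    .join('\n===\n');

  const cue = Object.keys(examples[0].fields)[0]; // e.g., "Name"
  return `${instruction}\n\n===\n${body}\n===\nText: "${userInput}"\n${cue}:`;
}
```

Because every example flows through the same template, formatting drift between examples becomes impossible by construction.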

Formatting for JSON output

const prompt = `Convert the product description to a structured JSON object.

Example:
Input: "Nike Air Max 90, men's running shoe, size 10, $129.99"
Output: {"brand": "Nike", "model": "Air Max 90", "category": "running shoe", "gender": "men", "size": "10", "price": 129.99}

Example:
Input: "Adidas Ultraboost 22, women's training shoe, size 8, $189.99"  
Output: {"brand": "Adidas", "model": "Ultraboost 22", "category": "training shoe", "gender": "women", "size": "8", "price": 189.99}

Example:
Input: "New Balance 574, unisex casual shoe, size 11, $79.99"
Output: {"brand": "New Balance", "model": "574", "category": "casual shoe", "gender": "unisex", "size": "11", "price": 79.99}

Now convert:
Input: "${userInput}"
Output:`;

5. When Few-Shot Works Better Than Detailed Instructions

Few-shot is not always the best approach. Here is a decision framework:

When to use few-shot

Scenario                          | Why Few-Shot Helps
----------------------------------|----------------------------------------------------------------
Format is complex                 | Showing the output format is clearer than describing it
Task is nuanced                   | Edge cases are easier to demonstrate than explain
Classification with custom labels | Examples define the label set and boundaries
Data transformation               | Input-to-output mapping is best shown by example
Consistent style                  | Examples set a tone/voice that instructions struggle to capture
Non-obvious patterns              | The pattern is intuitive when seen but hard to articulate

When instructions alone are better

Scenario                  | Why Instructions Are Enough
--------------------------|----------------------------------------------------------
Simple, well-known tasks  | "Translate to French" — the model already knows how
Tasks requiring reasoning | Chain-of-thought instructions > examples for math/logic
Very long outputs         | Examples of long outputs waste too many tokens
Highly variable inputs    | If inputs vary wildly, examples may not be representative
Token-constrained         | Each example costs tokens — instructions are more compact

The hybrid approach (usually best)

The strongest prompts combine instructions AND examples:

const prompt = `You are a data extraction system.
Extract the company name and funding amount from the news headline.

Rules:
- Company name should be the official name (not abbreviation)
- Funding amount should be in millions, as a number
- If funding amount is not mentioned, use null
- Currency should always be converted to USD

Example 1:
Headline: "Stripe raises $600M at $95B valuation"
Result: {"company": "Stripe", "funding_millions": 600}

Example 2:
Headline: "OpenAI in talks for new funding round"
Result: {"company": "OpenAI", "funding_millions": null}

Example 3:
Headline: "European fintech Klarna secures €800M"
Result: {"company": "Klarna", "funding_millions": 880}

Now extract:
Headline: "${headline}"
Result:`;

The instructions handle the rules (currency conversion, null handling). The examples handle the format and edge cases (no amount mentioned, non-USD currency). Together, they cover more ground than either alone.


6. Token Cost of Few-Shot vs Instruction-Only

Every example you add consumes tokens from your context window and increases per-request cost. Understanding the tradeoff is essential for production systems.

Token cost comparison

INSTRUCTION-ONLY APPROACH:
  System prompt: ~200 tokens (rules and format description)
  User message:  ~50 tokens (the input to classify)
  Total input:   ~250 tokens per request

FEW-SHOT APPROACH (5 examples):
  System prompt: ~100 tokens (brief rules)
  Examples:      ~400 tokens (5 examples × ~80 tokens each)
  User message:  ~50 tokens (the input to classify)
  Total input:   ~550 tokens per request

COST DIFFERENCE:
  At $2.50 per 1M input tokens (GPT-4o):
  Instruction-only: $0.000625 per request
  Few-shot:         $0.001375 per request
  
  At 1M requests/month:
  Instruction-only: $625/month
  Few-shot:         $1,375/month
  Difference:       $750/month

When the extra cost is worth it

Instruction-only accuracy: 85%
Few-shot accuracy:         95%

If each misclassification costs you $1 in manual review:
  Instruction-only: 150,000 errors × $1 = $150,000/month in error cost
  Few-shot:          50,000 errors × $1 = $50,000/month in error cost
  
  Extra token cost:  $750/month
  Error cost savings: $100,000/month
  
  ROI: Pay $750 more in tokens, save $100,000 in errors
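The arithmetic above can be packaged as a break-even check you can rerun whenever prices or accuracy numbers change. A sketch (the function names and parameter shape are illustrative):

```javascript
// Input-token cost of one request at a given per-million-token price.
function inputCost(tokens, pricePerMillion) {
  return (tokens / 1_000_000) * pricePerMillion;
}

// Compare extra monthly token spend against monthly error-cost savings.
function fewShotBreakEven({ requestsPerMonth, baseTokens, fewShotTokens,
                            pricePerMillion, baseAccuracy, fewShotAccuracy,
                            costPerError }) {
  const extraTokenCost = requestsPerMonth *
    (inputCost(fewShotTokens, pricePerMillion) - inputCost(baseTokens, pricePerMillion));
  const errorSavings = requestsPerMonth * (fewShotAccuracy - baseAccuracy) * costPerError;
  return { extraTokenCost, errorSavings, worthIt: errorSavings > extraTokenCost };
}
```

Plugging in the numbers above (1M requests, 250 vs 550 tokens, $2.50 per 1M tokens, 85% vs 95% accuracy, $1 per error) reproduces the $750 extra spend against roughly $100,000 in savings.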

Optimizing few-shot token usage

// WASTEFUL: Verbose examples
const wasteful = `
Example:
The following customer review was submitted on January 15, 2024, 
by a verified purchaser who bought the product through our online 
store. The review text reads: "Absolutely wonderful product! I 
can't recommend it enough to anyone looking for a high-quality item."
After careful analysis, this review has been classified as: positive
`;

// EFFICIENT: Concise examples
const efficient = `
"Wonderful product! Highly recommend." → positive
`;

// Same signal, ~80% fewer tokens

The diminishing returns curve

Number of examples vs accuracy improvement (typical pattern):

  0 examples (zero-shot):  ~80% accuracy
  1 example  (one-shot):   ~87% accuracy  (+7%)
  3 examples (few-shot):   ~93% accuracy  (+6%)
  5 examples (few-shot):   ~95% accuracy  (+2%)
  10 examples:             ~96% accuracy  (+1%)
  20 examples:             ~96.5% accuracy (+0.5%)

The first 3 examples provide the most value.
Beyond 5-7 examples, returns diminish sharply.
Beyond 10, you're mostly wasting tokens.
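These numbers are typical, not universal. The honest way to pick a shot count is to measure it on a labeled dev set. A sketch of such a harness (`accuracyByShotCount` is hypothetical; `classify` would wrap your actual model call):

```javascript
// Measure dev-set accuracy at several example counts so the shot count
// is chosen from data, not guesswork. `classify(input, shots)` is any
// async function, e.g. a wrapper around your LLM call.
async function accuracyByShotCount(classify, exampleBank, devSet, counts) {
  const results = {};
  for (const n of counts) {
    const shots = exampleBank.slice(0, n);
    let correct = 0;
    for (const item of devSet) {
      if ((await classify(item.input, shots)) === item.output) correct++;
    }
    results[n] = correct / devSet.length;
  }
  return results;
}
```

Run it with counts like `[0, 1, 3, 5, 10]` and stop adding examples where the curve flattens.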

7. Practical Examples

Example 1: Text classification

async function classifySupport(message) {
  const prompt = `Classify the customer support message into exactly one category.

Categories: billing, technical, shipping, account, other

"I can't log into my account" → account
"My credit card was charged twice" → billing
"The app crashes when I open settings" → technical
"Where is my package?" → shipping
"Can you sponsor our event?" → other
"I forgot my password and can't reset it" → account
"The tracking number doesn't work" → shipping

Message: "${message}"
Category:`;

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0,
    messages: [{ role: 'user', content: prompt }],
    max_tokens: 10,
  });

  return response.choices[0].message.content.trim();
}

// Usage
const category = await classifySupport("I was charged $50 but my plan is $30");
console.log(category); // "billing"

Example 2: Data extraction

async function extractContactInfo(text) {
  const prompt = `Extract contact information from the text.

Text: "Reach out to John Smith at john@example.com or call 555-0123"
Result: {"name": "John Smith", "email": "john@example.com", "phone": "555-0123"}

Text: "For inquiries, email support@company.io"
Result: {"name": null, "email": "support@company.io", "phone": null}

Text: "Contact Dr. Sarah Lee (555-9876) for appointments"
Result: {"name": "Sarah Lee", "email": null, "phone": "555-9876"}

Text: "${text}"
Result:`;

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0,
    messages: [{ role: 'user', content: prompt }],
  });

  return JSON.parse(response.choices[0].message.content.trim());
}

Example 3: Text transformation / formatting

async function normalizeAddress(rawAddress) {
  const prompt = `Normalize the address to a standard US format.

Input: "123 main st apt 4b new york ny 10001"
Output: "123 Main St, Apt 4B, New York, NY 10001"

Input: "456 Oak Avenue, Suite 200, San Francisco California 94102"
Output: "456 Oak Ave, Suite 200, San Francisco, CA 94102"

Input: "789 ELM BOULEVARD LOS ANGELES CA 90001"
Output: "789 Elm Blvd, Los Angeles, CA 90001"

Input: "1010 pine street #302 chicago illinois"
Output: "1010 Pine St, #302, Chicago, IL"

Input: "${rawAddress}"
Output:`;

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0,
    messages: [{ role: 'user', content: prompt }],
    max_tokens: 100,
  });

  return response.choices[0].message.content.trim().replace(/^"/, '').replace(/"$/, '');
}

Example 4: Structured summarization

async function summarizeForSlack(article) {
  const systemPrompt = `You write concise Slack summaries for an engineering team.

Example article: "React 19 introduces a new compiler that automatically 
memoizes components, eliminating the need for useMemo and useCallback 
in most cases. The compiler analyzes component render behavior and 
optimizes re-renders at build time."
Summary:
*React 19 Compiler* — Auto-memoizes components at build time. No more manual useMemo/useCallback for most cases. :rocket:

Example article: "A critical vulnerability (CVE-2024-1234) was found 
in Express.js versions below 4.18.3. The vulnerability allows request 
smuggling through malformed headers. Patch immediately."
Summary:
:rotating_light: *Express.js CVE-2024-1234* — Request smuggling via malformed headers. Affects <4.18.3. Patch NOW.`;

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0.3,
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: `Summarize this:\n${article}` },
    ],
  });

  return response.choices[0].message.content.trim();
}

8. Advanced Few-Shot Techniques

Dynamic example selection

Instead of hard-coding examples, select the most relevant examples for each input. This is especially useful when you have a large bank of examples but can only fit a few in the prompt.

// Example bank (could be stored in a database or vector store)
const exampleBank = [
  { input: 'My order is late', output: 'shipping', category: 'delivery' },
  { input: 'Refund not received', output: 'billing', category: 'payment' },
  { input: 'App won\'t open on Android', output: 'technical', category: 'mobile' },
  { input: 'Can\'t change my email', output: 'account', category: 'profile' },
  { input: 'Charged wrong amount', output: 'billing', category: 'payment' },
  { input: 'Package arrived damaged', output: 'shipping', category: 'delivery' },
  { input: 'Login page shows error', output: 'technical', category: 'auth' },
  // ... hundreds more
];

async function classifyWithDynamicExamples(userMessage) {
  // Step 1: Select the 3 most relevant examples 
  // (in production, use embedding similarity for this)
  const selectedExamples = selectMostRelevant(exampleBank, userMessage, 3);
  
  // Step 2: Build prompt with selected examples
  const exampleText = selectedExamples
    .map(ex => `"${ex.input}" → ${ex.output}`)
    .join('\n');
  
  const prompt = `Classify the support message into: billing, technical, shipping, account, other

${exampleText}

"${userMessage}" →`;

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0,
    messages: [{ role: 'user', content: prompt }],
    max_tokens: 10,
  });

  return response.choices[0].message.content.trim();
}
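`selectMostRelevant` is left undefined above; in production you would rank by embedding similarity, but even a naive lexical version illustrates the idea (word-overlap scoring, purely a sketch):

```javascript
// Score each bank example by how many words it shares with the user
// message, then return the top k. Embedding similarity is the
// production-grade replacement for this word-overlap heuristic.
function selectMostRelevant(bank, userMessage, k) {
  const words = (s) => new Set(s.toLowerCase().match(/[a-z0-9']+/g) ?? []);
  const query = words(userMessage);
  return bank
    .map((ex) => ({ ex, score: [...words(ex.input)].filter((w) => query.has(w)).length }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((scored) => scored.ex);
}
```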

Negative examples (showing what NOT to do)

Sometimes showing wrong outputs helps the model avoid common mistakes:

Extract ONLY the person's name. Do not include titles or suffixes.

CORRECT:
"Dr. James Wilson, PhD" → James Wilson
"Ms. Emily Chen" → Emily Chen
"Captain Robert Brown Jr." → Robert Brown

INCORRECT (do not do this):
"Dr. James Wilson, PhD" → Dr. James Wilson, PhD  ← WRONG: includes title and suffix
"Ms. Emily Chen" → Ms. Emily Chen              ← WRONG: includes title

Now extract:
"Professor Anna Martinez, MD" →

Chain-of-thought in few-shot (covered more in 4.3.c)

You can combine few-shot with reasoning traces:

Determine if the customer is eligible for a refund.

Order: Placed Jan 5, requesting refund Jan 20. Policy: 14-day window.
Reasoning: Order placed Jan 5. Refund requested Jan 20. Days elapsed: 15. Policy allows 14 days. 15 > 14.
Decision: NOT ELIGIBLE

Order: Placed Mar 10, requesting refund Mar 18. Policy: 14-day window.
Reasoning: Order placed Mar 10. Refund requested Mar 18. Days elapsed: 8. Policy allows 14 days. 8 < 14.
Decision: ELIGIBLE

Order: Placed ${orderDate}, requesting refund ${refundDate}. Policy: 14-day window.
Reasoning:

9. Common Pitfalls

Pitfall                          | Problem                                       | Solution
---------------------------------|-----------------------------------------------|----------------------------------------------------------------
Too few examples                 | Model doesn't see the pattern clearly         | Use at least 3 examples, covering all output labels
All examples look the same       | Model learns a superficial pattern            | Diversify phrasing, length, and complexity
Missing edge cases               | Model fails on ambiguous inputs               | Include at least one ambiguous/boundary example
Inconsistent formatting          | Model output format is unpredictable          | Use identical delimiters and structure in every example
Examples too long                | Wastes tokens, less room for the actual input | Keep examples concise — just enough to show the pattern
Biased label distribution        | Model over-predicts the most common label     | Balance examples across all labels (or match real distribution)
Stale examples                   | Examples don't match current data patterns    | Review and update examples quarterly
Examples contradict instructions | Model doesn't know which to follow            | Ensure examples perfectly follow the rules you've stated
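Several of these pitfalls can be caught mechanically before a prompt ever reaches a model. A sketch of a pre-flight check (the `lintExamples` helper is illustrative, not a library):

```javascript
// Pre-flight checks on a few-shot example set.
// Returns a list of human-readable warnings (empty = looks OK).
function lintExamples(examples, labels) {
  const warnings = [];
  if (examples.length < 3) {
    warnings.push('fewer than 3 examples');
  }
  const seen = new Set(examples.map((ex) => ex.output));
  for (const label of labels) {
    if (!seen.has(label)) warnings.push(`label "${label}" has no example`);
  }
  for (const ex of examples) {
    if (!labels.includes(ex.output)) {
      warnings.push(`example output "${ex.output}" is not a valid label`);
    }
  }
  return warnings;
}
```

Wiring a check like this into CI keeps example sets honest as they evolve.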

10. Key Takeaways

  1. Zero-shot = no examples (relies on instructions), one-shot = 1 example, few-shot = multiple examples.
  2. Examples teach by demonstration — the model detects patterns across your examples and applies them to new inputs.
  3. 3-5 well-chosen examples typically provide 90%+ of the accuracy benefit; beyond 10, returns diminish sharply.
  4. Cover all output labels, include edge cases, and use diverse phrasing.
  5. Consistent formatting is critical — use identical delimiters and structure across all examples.
  6. Few-shot excels at classification, extraction, formatting, and any task where the pattern is easier to show than describe.
  7. Instructions + examples together are stronger than either alone.
  8. Token cost of examples matters at scale — measure accuracy vs cost to find the optimal number.

Explain-It Challenge

  1. Your zero-shot sentiment classifier gets 80% accuracy. You add 3 examples and accuracy jumps to 93%. Where do the 3 examples add the most value — on the easy cases or the ambiguous ones? Why?
  2. A colleague adds 20 examples to their prompt and says "more examples = better." Explain the diminishing returns curve and the token cost implications.
  3. You're building a data extraction prompt and notice the model always returns the right fields but sometimes uses the wrong format (e.g., "March 15, 2024" instead of "2024-03-15"). How would you use few-shot examples to fix this?

Navigation: ← 4.3.a — Writing Clear Instructions · 4.3.c — Chain-of-Thought →