Episode 4 — Generative AI Engineering / 4.3 — Prompt Engineering Fundamentals
4.3.b — Few-Shot Examples
In one sentence: Few-shot prompting means including example input-output pairs in your prompt so the model learns the pattern you want by seeing it demonstrated, rather than having it described in words.
Navigation: ← 4.3.a — Writing Clear Instructions · 4.3.c — Chain-of-Thought →
1. What Is Few-Shot Learning?
In prompt engineering, "shots" refer to how many examples you include in the prompt before asking the model to perform the actual task. This terminology comes from machine learning research, but in practice it simply means "show before you ask."
┌─────────────────────────────────────────────────────────────────┐
│ THE SHOT SPECTRUM │
│ │
│ ZERO-SHOT ONE-SHOT FEW-SHOT │
│ (no examples) (1 example) (2-10+ examples) │
│ │
│ "Classify this "Example: "Example 1: │
│ as positive 'Great!' → 'Great!' → positive │
│ or negative" positive. 'Terrible' → negative │
│ Now classify: 'It was okay' → neutral │
│ 'Awful service'" Now classify: │
│ 'Awful service'" │
│ │
│ Model guesses Model sees the Model sees the pattern │
│ the format and pattern once multiple times → more │
│ task definition reliable outputs │
└─────────────────────────────────────────────────────────────────┘
Zero-shot: No examples, just instructions
const zeroShotPrompt = `Classify the following customer review as
"positive", "negative", or "neutral".
Review: "The product arrived on time but the packaging was damaged."
Classification:`;
The model relies on its general training to interpret the task. This works well for common, well-understood tasks but can be inconsistent on ambiguous or nuanced ones.
One-shot: One example
const oneShotPrompt = `Classify the following customer review as
"positive", "negative", or "neutral".
Example:
Review: "Absolutely love this product! Best purchase I've made."
Classification: positive
Now classify:
Review: "The product arrived on time but the packaging was damaged."
Classification:`;
One example shows the model the exact format you want. The model now knows: (a) what a "classification" looks like, (b) that the answer should be a single word, (c) that it should be lowercase.
Few-shot: Multiple examples
const fewShotPrompt = `Classify the following customer review as
"positive", "negative", or "neutral".
Example 1:
Review: "Absolutely love this product! Best purchase I've made."
Classification: positive
Example 2:
Review: "Terrible quality. Broke after one day. Want a refund."
Classification: negative
Example 3:
Review: "It's fine. Does what it says. Nothing special."
Classification: neutral
Example 4:
Review: "Good product but shipping took forever."
Classification: neutral
Now classify:
Review: "The product arrived on time but the packaging was damaged."
Classification:`;
Multiple examples show the model edge cases and boundary conditions. Example 4 is particularly important — it shows a mixed review classified as "neutral," which teaches the model how to handle ambiguity.
2. How Examples Force Pattern Learning
When you include examples, the model isn't just "reading" them — it is detecting patterns across the examples and applying those patterns to the new input. This is called in-context learning.
WHAT THE MODEL DETECTS FROM YOUR EXAMPLES:
┌─────────────────────────────────────────────────┐
│ Pattern 1: Input format │
│ Every input starts with "Review:" and is a │
│ quoted sentence. │
│ │
│ Pattern 2: Output format │
│ Every output is a single lowercase word │
│ on the line after "Classification:" │
│ │
│ Pattern 3: Label set │
│ Only three possible values: positive, │
│ negative, neutral. Nothing else. │
│ │
│ Pattern 4: Decision boundary │
│ Mixed sentiment → neutral (Example 4) │
│ Strong language → positive or negative │
│ Mild language → neutral │
│ │
│ Pattern 5: Tone mapping │
│ "love" → positive, "terrible" → negative │
│ "fine" → neutral, "good but..." → neutral │
└─────────────────────────────────────────────────┘
This is why few-shot is so powerful: you are programming the model's behavior through data, not through instructions. The model generalizes from your examples to handle new inputs.
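"Programming through data" can be made concrete with a small helper that assembles a few-shot prompt from an array of example pairs, so every example follows one template. This is a sketch; the function name and template shape are illustrative, not from any library:

```javascript
// Build a few-shot prompt from an array of { input, output } pairs.
// Applying the same template to every example gives the model one
// consistent pattern to detect, rather than mixed formats.
function buildFewShotPrompt(instruction, examples, newInput) {
  const shots = examples
    .map(ex => `Review: "${ex.input}"\nClassification: ${ex.output}`)
    .join('\n\n');
  return `${instruction}\n\n${shots}\n\nNow classify:\nReview: "${newInput}"\nClassification:`;
}

// Usage
const prompt = buildFewShotPrompt(
  'Classify the review as "positive", "negative", or "neutral".',
  [
    { input: 'Absolutely love this product!', output: 'positive' },
    { input: 'Terrible quality. Broke after one day.', output: 'negative' },
    { input: "It's fine. Nothing special.", output: 'neutral' },
  ],
  'The product arrived on time but the packaging was damaged.'
);
```

Centralizing the template in one function also makes it trivial to add, remove, or reorder examples without risking formatting drift.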
3. Designing Effective Examples
Not all examples are equally useful. Poorly chosen examples can actually make the model worse. Here are the principles for designing effective few-shot examples:
Principle 1: Cover the label space
If you have 3 possible outputs (positive, negative, neutral), include at least one example of each. If the model never sees a "neutral" example, it will be biased toward the labels it did see.
BAD (missing label):
Example 1: "Love it!" → positive
Example 2: "Great quality" → positive
Example 3: "Hate it" → negative
// No "neutral" example → model rarely outputs "neutral"
GOOD (all labels covered):
Example 1: "Love it!" → positive
Example 2: "Hate it" → negative
Example 3: "It's okay" → neutral
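Label coverage is easy to check mechanically before a prompt ships. A minimal sketch (function name is illustrative) that reports which labels an example set is missing:

```javascript
// Return every expected label that no example demonstrates.
function findMissingLabels(examples, labels) {
  const seen = new Set(examples.map(ex => ex.output));
  return labels.filter(label => !seen.has(label));
}

// Usage: the "BAD" example set above has no neutral example
const missing = findMissingLabels(
  [
    { input: 'Love it!', output: 'positive' },
    { input: 'Great quality', output: 'positive' },
    { input: 'Hate it', output: 'negative' },
  ],
  ['positive', 'negative', 'neutral']
);
console.log(missing); // ['neutral']
```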
Principle 2: Include edge cases
The most valuable examples are the ambiguous ones — the cases where the model is most likely to get it wrong without guidance.
const examples = [
// Clear cases (easy — model probably gets these right anyway)
{ input: 'Best product ever!', output: 'positive' },
{ input: 'Total garbage, want refund.', output: 'negative' },
// Edge cases (hard — this is where examples add real value)
{ input: 'Good product but overpriced.', output: 'neutral' },
{ input: 'Not bad, actually.', output: 'positive' }, // Double negative = positive
{ input: 'I expected more for this price.', output: 'negative' }, // Subtle negativity
{ input: 'Works as described.', output: 'neutral' }, // Factual, not emotional
];
Principle 3: Match the real distribution
If 70% of your real inputs are positive, 20% negative, and 10% neutral, your examples should roughly reflect that — or at minimum not be wildly skewed. If you give 5 negative examples and 1 positive, the model may develop a bias toward negative classifications.
Principle 4: Keep examples representative
Examples should look like the actual inputs the model will encounter. If real customer reviews are 1-3 sentences long, don't use paragraph-length examples.
BAD (unrepresentative):
Example: "After extensive deliberation and careful consideration of
all the factors involved, including price, quality, customer service
responsiveness, and overall brand reputation, I have concluded that
this product meets my requirements."
→ positive
GOOD (representative of real data):
Example: "Good quality, fast shipping. Happy with my purchase."
→ positive
Principle 5: Use diverse phrasing
Avoid examples that are too similar — the model might learn a superficial pattern (like "any review with 'love' is positive") instead of the real task.
BAD (too similar):
"I love this product" → positive
"I love this item" → positive
"I love this purchase" → positive
// Model learns: "love" → positive (too narrow)
GOOD (diverse phrasing):
"I love this product" → positive
"Exceeded my expectations completely" → positive
"Solid build quality, very impressed" → positive
// Model learns: general positive sentiment → positive
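Phrasing overlap between examples can also be screened mechanically. A rough sketch using Jaccard similarity on word sets; the 0.5 threshold is an arbitrary assumption you should tune for your data:

```javascript
// Jaccard similarity between the word sets of two example inputs.
function jaccard(a, b) {
  const wordsA = new Set(a.toLowerCase().split(/\W+/).filter(Boolean));
  const wordsB = new Set(b.toLowerCase().split(/\W+/).filter(Boolean));
  const intersection = [...wordsA].filter(w => wordsB.has(w)).length;
  const union = new Set([...wordsA, ...wordsB]).size;
  return union === 0 ? 0 : intersection / union;
}

// Flag pairs of examples whose phrasing overlaps too much; high overlap
// invites the model to learn surface cues instead of the real task.
function findSimilarPairs(examples, threshold = 0.5) {
  const pairs = [];
  for (let i = 0; i < examples.length; i++) {
    for (let j = i + 1; j < examples.length; j++) {
      if (jaccard(examples[i].input, examples[j].input) >= threshold) {
        pairs.push([examples[i].input, examples[j].input]);
      }
    }
  }
  return pairs;
}
```

Running this over the "BAD" set above would flag the near-duplicate "I love this..." inputs, while the "GOOD" set passes cleanly.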
4. Example Formatting: Consistency Is Everything
The format of your examples matters as much as their content. The model learns the pattern of your formatting, so inconsistent formatting teaches the model to be inconsistent.
Use consistent delimiters
Pick a delimiter style and stick with it across all examples:
STYLE 1: Label-colon format
Input: "Great product!"
Sentiment: positive
STYLE 2: Arrow format
"Great product!" → positive
STYLE 3: Bracketed format
Review: "Great product!"
[SENTIMENT]: positive
STYLE 4: Structured blocks
---
Review: "Great product!"
Classification: positive
---
Do not mix styles. If example 1 uses arrows and example 2 uses colons, the model's output format becomes unpredictable.
Input-output pairs should be clearly separated
// Clear separation between examples
const prompt = `Extract the person's name and age from the text.
===
Text: "John Smith, a 34-year-old engineer from Boston."
Name: John Smith
Age: 34
===
Text: "Meet Dr. Sarah Chen (age 45), head of research."
Name: Sarah Chen
Age: 45
===
Text: "The award was given to 28-year-old Maria Garcia."
Name: Maria Garcia
Age: 28
===
Text: "${userInput}"
Name:`;
The === delimiter makes it crystal clear where each example starts and ends. The model knows to start generating after the last Name:.
Formatting for JSON output
const prompt = `Convert the product description to a structured JSON object.
Example:
Input: "Nike Air Max 90, men's running shoe, size 10, $129.99"
Output: {"brand": "Nike", "model": "Air Max 90", "category": "running shoe", "gender": "men", "size": "10", "price": 129.99}
Example:
Input: "Adidas Ultraboost 22, women's training shoe, size 8, $189.99"
Output: {"brand": "Adidas", "model": "Ultraboost 22", "category": "training shoe", "gender": "women", "size": "8", "price": 189.99}
Example:
Input: "New Balance 574, unisex casual shoe, size 11, $79.99"
Output: {"brand": "New Balance", "model": "574", "category": "casual shoe", "gender": "unisex", "size": "11", "price": 79.99}
Now convert:
Input: "${userInput}"
Output:`;
5. When Few-Shot Works Better Than Detailed Instructions
Few-shot is not always the best approach. Here is a decision framework:
When to use few-shot
| Scenario | Why Few-Shot Helps |
|---|---|
| Format is complex | Showing the output format is clearer than describing it |
| Task is nuanced | Edge cases are easier to demonstrate than explain |
| Classification with custom labels | Examples define the label set and boundaries |
| Data transformation | Input-to-output mapping is best shown by example |
| Consistent style | Examples set a tone/voice that instructions struggle to capture |
| Non-obvious patterns | The pattern is intuitive when seen but hard to articulate |
When instructions alone are better
| Scenario | Why Instructions Are Enough |
|---|---|
| Simple, well-known tasks | "Translate to French" — the model already knows how |
| Tasks requiring reasoning | Chain-of-thought instructions > examples for math/logic |
| Very long outputs | Examples of long outputs waste too many tokens |
| Highly variable inputs | If inputs vary wildly, examples may not be representative |
| Token-constrained | Each example costs tokens — instructions are more compact |
The hybrid approach (usually best)
The strongest prompts combine instructions AND examples:
const prompt = `You are a data extraction system.
Extract the company name and funding amount from the news headline.
Rules:
- Company name should be the official name (not abbreviation)
- Funding amount should be in millions, as a number
- If funding amount is not mentioned, use null
- Currency should always be converted to USD
Example 1:
Headline: "Stripe raises $600M at $95B valuation"
Result: {"company": "Stripe", "funding_millions": 600}
Example 2:
Headline: "OpenAI in talks for new funding round"
Result: {"company": "OpenAI", "funding_millions": null}
Example 3:
Headline: "European fintech Klarna secures €800M"
Result: {"company": "Klarna", "funding_millions": 880}
Now extract:
Headline: "${headline}"
Result:`;
The instructions handle the rules (currency conversion, null handling). The examples handle the format and edge cases (no amount mentioned, non-USD currency). Together, they cover more ground than either alone.
6. Token Cost of Few-Shot vs Instruction-Only
Every example you add consumes tokens from your context window and increases per-request cost. Understanding the tradeoff is essential for production systems.
Token cost comparison
INSTRUCTION-ONLY APPROACH:
System prompt: ~200 tokens (rules and format description)
User message: ~50 tokens (the input to classify)
Total input: ~250 tokens per request
FEW-SHOT APPROACH (5 examples):
System prompt: ~100 tokens (brief rules)
Examples: ~400 tokens (5 examples × ~80 tokens each)
User message: ~50 tokens (the input to classify)
Total input: ~550 tokens per request
COST DIFFERENCE:
At $2.50 per 1M input tokens (GPT-4o):
Instruction-only: $0.000625 per request
Few-shot: $0.001375 per request
At 1M requests/month:
Instruction-only: $625/month
Few-shot: $1,375/month
Difference: $750/month
When the extra cost is worth it
Instruction-only accuracy: 85%
Few-shot accuracy: 95%
If each misclassification costs you $1 in manual review:
Instruction-only: 150,000 errors × $1 = $150,000/month in error cost
Few-shot: 50,000 errors × $1 = $50,000/month in error cost
Extra token cost: $750/month
Error cost savings: $100,000/month
ROI: Pay $750 more in tokens, save $100,000 in errors
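The same arithmetic can be packaged as a small break-even calculator. Every number below is an assumption carried over from the worked example; substitute your own measurements:

```javascript
// Total monthly cost of a prompting strategy: token spend plus the cost
// of handling its errors. All parameter values here are assumptions.
function monthlyCost({ inputTokens, accuracy, requests, pricePerMTokens, costPerError }) {
  const tokenCost = (inputTokens / 1e6) * pricePerMTokens * requests;
  const errorCost = (1 - accuracy) * requests * costPerError;
  return tokenCost + errorCost;
}

const shared = { requests: 1_000_000, pricePerMTokens: 2.5, costPerError: 1 };
const instructionOnly = monthlyCost({ inputTokens: 250, accuracy: 0.85, ...shared });
const fewShot = monthlyCost({ inputTokens: 550, accuracy: 0.95, ...shared });

console.log(Math.round(instructionOnly)); // 150625 ($625 tokens + $150,000 errors)
console.log(Math.round(fewShot)); // 51375 ($1,375 tokens + $50,000 errors)
```

The takeaway matches the figures above: when errors carry real cost, the few-shot prompt's extra token spend is a rounding error next to the error savings.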
Optimizing few-shot token usage
// WASTEFUL: Verbose examples
const wasteful = `
Example:
The following customer review was submitted on January 15, 2024,
by a verified purchaser who bought the product through our online
store. The review text reads: "Absolutely wonderful product! I
can't recommend it enough to anyone looking for a high-quality item."
After careful analysis, this review has been classified as: positive
`;
// EFFICIENT: Concise examples
const efficient = `
"Wonderful product! Highly recommend." → positive
`;
// Same signal, ~80% fewer tokens
The diminishing returns curve
Number of examples vs accuracy improvement (typical pattern):
0 examples (zero-shot): ~80% accuracy
1 example (one-shot): ~87% accuracy (+7%)
3 examples (few-shot): ~93% accuracy (+6%)
5 examples (few-shot): ~95% accuracy (+2%)
10 examples: ~96% accuracy (+1%)
20 examples: ~96.5% accuracy (+0.5%)
The first 3 examples provide the most value.
Beyond 5-7 examples, returns diminish sharply.
Beyond 10, you're mostly wasting tokens.
7. Practical Examples
Example 1: Text classification
async function classifySupport(message) {
const prompt = `Classify the customer support message into exactly one category.
Categories: billing, technical, shipping, account, other
"I can't log into my account" → account
"My credit card was charged twice" → billing
"The app crashes when I open settings" → technical
"Where is my package?" → shipping
"Can you sponsor our event?" → other
"I forgot my password and can't reset it" → account
"The tracking number doesn't work" → shipping
Message: "${message}"
Category:`;
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0,
messages: [{ role: 'user', content: prompt }],
max_tokens: 10,
});
return response.choices[0].message.content.trim();
}
// Usage
const category = await classifySupport("I was charged $50 but my plan is $30");
console.log(category); // "billing"
Example 2: Data extraction
async function extractContactInfo(text) {
const prompt = `Extract contact information from the text.
Text: "Reach out to John Smith at john@example.com or call 555-0123"
Result: {"name": "John Smith", "email": "john@example.com", "phone": "555-0123"}
Text: "For inquiries, email support@company.io"
Result: {"name": null, "email": "support@company.io", "phone": null}
Text: "Contact Dr. Sarah Lee (555-9876) for appointments"
Result: {"name": "Sarah Lee", "email": null, "phone": "555-9876"}
Text: "${text}"
Result:`;
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0,
messages: [{ role: 'user', content: prompt }],
});
// NOTE: JSON.parse throws if the model emits malformed JSON; wrap in try/catch in production
return JSON.parse(response.choices[0].message.content.trim());
}
Example 3: Text transformation / formatting
async function normalizeAddress(rawAddress) {
const prompt = `Normalize the address to a standard US format.
Input: "123 main st apt 4b new york ny 10001"
Output: "123 Main St, Apt 4B, New York, NY 10001"
Input: "456 Oak Avenue, Suite 200, San Francisco California 94102"
Output: "456 Oak Ave, Suite 200, San Francisco, CA 94102"
Input: "789 ELM BOULEVARD LOS ANGELES CA 90001"
Output: "789 Elm Blvd, Los Angeles, CA 90001"
Input: "1010 pine street #302 chicago illinois"
Output: "1010 Pine St, #302, Chicago, IL"
Input: "${rawAddress}"
Output:`;
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0,
messages: [{ role: 'user', content: prompt }],
max_tokens: 100,
});
return response.choices[0].message.content.trim().replace(/^"|"$/g, '');
}
Example 4: Structured summarization
async function summarizeForSlack(article) {
const systemPrompt = `You write concise Slack summaries for an engineering team.
Example article: "React 19 introduces a new compiler that automatically
memoizes components, eliminating the need for useMemo and useCallback
in most cases. The compiler analyzes component render behavior and
optimizes re-renders at build time."
Summary:
*React 19 Compiler* — Auto-memoizes components at build time. No more manual useMemo/useCallback for most cases. :rocket:
Example article: "A critical vulnerability (CVE-2024-1234) was found
in Express.js versions below 4.18.3. The vulnerability allows request
smuggling through malformed headers. Patch immediately."
Summary:
:rotating_light: *Express.js CVE-2024-1234* — Request smuggling via malformed headers. Affects <4.18.3. Patch NOW.`;
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0.3,
messages: [
{ role: 'system', content: systemPrompt },
{ role: 'user', content: `Summarize this:\n${article}` },
],
});
return response.choices[0].message.content.trim();
}
8. Advanced Few-Shot Techniques
Dynamic example selection
Instead of hard-coding examples, select the most relevant examples for each input. This is especially useful when you have a large bank of examples but can only fit a few in the prompt.
// Example bank (could be stored in a database or vector store)
const exampleBank = [
{ input: 'My order is late', output: 'shipping', category: 'delivery' },
{ input: 'Refund not received', output: 'billing', category: 'payment' },
{ input: 'App won\'t open on Android', output: 'technical', category: 'mobile' },
{ input: 'Can\'t change my email', output: 'account', category: 'profile' },
{ input: 'Charged wrong amount', output: 'billing', category: 'payment' },
{ input: 'Package arrived damaged', output: 'shipping', category: 'delivery' },
{ input: 'Login page shows error', output: 'technical', category: 'auth' },
// ... hundreds more
];
async function classifyWithDynamicExamples(userMessage) {
// Step 1: Select the 3 most relevant examples
// (in production, use embedding similarity for this)
const selectedExamples = selectMostRelevant(exampleBank, userMessage, 3);
// Step 2: Build prompt with selected examples
const exampleText = selectedExamples
.map(ex => `"${ex.input}" → ${ex.output}`)
.join('\n');
const prompt = `Classify the support message into: billing, technical, shipping, account, other
${exampleText}
"${userMessage}" →`;
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0,
messages: [{ role: 'user', content: prompt }],
max_tokens: 10,
});
return response.choices[0].message.content.trim();
}
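The `selectMostRelevant` helper above is left undefined. As a stand-in, a naive word-overlap ranking might look like the sketch below; in production, embedding similarity is the better choice, as the comment in the code notes:

```javascript
// Naive relevance scoring: rank bank entries by how many words they
// share with the incoming message, and keep the top k. A placeholder
// for embedding-based similarity search, not a production ranker.
function selectMostRelevant(bank, message, k) {
  const messageWords = new Set(message.toLowerCase().split(/\W+/).filter(Boolean));
  return bank
    .map(ex => ({
      ex,
      score: ex.input.toLowerCase().split(/\W+/).filter(w => messageWords.has(w)).length,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ ex }) => ex);
}

// Usage: shared words push the most relevant examples to the top
const top = selectMostRelevant(
  [
    { input: 'My order is late', output: 'shipping' },
    { input: 'Refund not received', output: 'billing' },
    { input: 'Package arrived damaged', output: 'shipping' },
  ],
  'My package never arrived',
  2
);
console.log(top[0].input); // "Package arrived damaged"
```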
Negative examples (showing what NOT to do)
Sometimes showing wrong outputs helps the model avoid common mistakes:
Extract ONLY the person's name. Do not include titles or suffixes.
CORRECT:
"Dr. James Wilson, PhD" → James Wilson
"Ms. Emily Chen" → Emily Chen
"Captain Robert Brown Jr." → Robert Brown
INCORRECT (do not do this):
"Dr. James Wilson, PhD" → Dr. James Wilson, PhD ← WRONG: includes title and suffix
"Ms. Emily Chen" → Ms. Emily Chen ← WRONG: includes title
Now extract:
"Professor Anna Martinez, MD" →
Chain-of-thought in few-shot (covered more in 4.3.c)
You can combine few-shot with reasoning traces:
Determine if the customer is eligible for a refund.
Order: Placed Jan 5, requesting refund Jan 20. Policy: 14-day window.
Reasoning: Order placed Jan 5. Refund requested Jan 20. Days elapsed: 15. Policy allows 14 days. 15 > 14.
Decision: NOT ELIGIBLE
Order: Placed Mar 10, requesting refund Mar 18. Policy: 14-day window.
Reasoning: Order placed Mar 10. Refund requested Mar 18. Days elapsed: 8. Policy allows 14 days. 8 < 14.
Decision: ELIGIBLE
Order: Placed ${orderDate}, requesting refund ${refundDate}. Policy: 14-day window.
Reasoning:
9. Common Pitfalls
| Pitfall | Problem | Solution |
|---|---|---|
| Too few examples | Model doesn't see the pattern clearly | Use at least 3 examples, covering all output labels |
| All examples look the same | Model learns a superficial pattern | Diversify phrasing, length, and complexity |
| Missing edge cases | Model fails on ambiguous inputs | Include at least one ambiguous/boundary example |
| Inconsistent formatting | Model output format is unpredictable | Use identical delimiters and structure in every example |
| Examples too long | Wastes tokens, less room for the actual input | Keep examples concise — just enough to show the pattern |
| Biased label distribution | Model over-predicts the most common label | Balance examples across all labels (or match real distribution) |
| Stale examples | Examples don't match current data patterns | Review and update examples quarterly |
| Examples contradict instructions | Model doesn't know which to follow | Ensure examples perfectly follow the rules you've stated |
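Several of these pitfalls can be caught mechanically before a prompt ships. A minimal lint for the "inconsistent formatting" row: check that every rendered example line matches one delimiter pattern. The arrow-format regex below is an assumption; adapt it to whichever style and label set you chose:

```javascript
// Return every example line that does NOT match the expected
// arrow-format pattern, so mixed styles are caught before deployment.
function checkExampleFormat(exampleLines) {
  const arrowFormat = /^".+" → (positive|negative|neutral)$/;
  return exampleLines.filter(line => !arrowFormat.test(line));
}

const badLines = checkExampleFormat([
  '"Great product!" → positive',
  'Input: "Hate it" | negative', // mixed style: gets flagged
  '"It\'s okay" → neutral',
]);
console.log(badLines); // ['Input: "Hate it" | negative']
```

Running a check like this in CI whenever the example bank changes also guards against the "stale examples" and "examples contradict instructions" rows drifting in unnoticed.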
10. Key Takeaways
- Zero-shot = no examples (relies on instructions), one-shot = 1 example, few-shot = multiple examples.
- Examples teach by demonstration — the model detects patterns across your examples and applies them to new inputs.
- 3-5 well-chosen examples typically provide 90%+ of the accuracy benefit; beyond 10, returns diminish sharply.
- Cover all output labels, include edge cases, and use diverse phrasing.
- Consistent formatting is critical — use identical delimiters and structure across all examples.
- Few-shot excels at classification, extraction, formatting, and any task where the pattern is easier to show than describe.
- Instructions + examples together are stronger than either alone.
- Token cost of examples matters at scale — measure accuracy vs cost to find the optimal number.
Explain-It Challenge
- Your zero-shot sentiment classifier gets 80% accuracy. You add 3 examples and accuracy jumps to 93%. Where do the 3 examples add the most value — on the easy cases or the ambiguous ones? Why?
- A colleague adds 20 examples to their prompt and says "more examples = better." Explain the diminishing returns curve and the token cost implications.
- You're building a data extraction prompt and notice the model always returns the right fields but sometimes uses the wrong format (e.g., "March 15, 2024" instead of "2024-03-15"). How would you use few-shot examples to fix this?
Navigation: ← 4.3.a — Writing Clear Instructions · 4.3.c — Chain-of-Thought →