Episode 4 — Generative AI Engineering / 4.3 — Prompt Engineering Fundamentals

Interview Questions: Prompt Engineering Fundamentals

Model answers for clear instructions, few-shot examples, chain-of-thought, and output formatting in production AI systems.

How to use this material (instructions)

  1. Read lessons in order: README.md, then 4.3.a–4.3.d.
  2. Practice out loud — definition → example → pitfall.
  3. Pair with exercises: 4.3-Exercise-Questions.md.
  4. Quick review: 4.3-Quick-Revision.md.

Beginner (Q1–Q4)

Q1. What is prompt engineering and why does it matter?

Why interviewers ask: Tests whether you understand that prompt design is a core engineering skill, not just "talking to AI."

Model answer:

Prompt engineering is the practice of designing and refining the text instructions you send to an LLM to produce reliable, useful output. It matters because the same model can produce wildly different results depending on the prompt — a vague prompt gets a generic response, while a specific prompt with clear instructions, examples, and format specifications gets exactly what you need.

In production systems, prompt engineering directly impacts: (1) Output quality — clear instructions reduce errors and hallucination, (2) Reliability — format specifications make outputs parseable by code, (3) Cost — efficient prompts use fewer tokens, and (4) Maintainability — well-structured prompts are easier to debug and iterate on. The four core techniques are: writing clear instructions, providing few-shot examples, using chain-of-thought reasoning, and specifying output format.


Q2. What makes a good prompt? What are the key elements?

Why interviewers ask: Tests your ability to systematically construct prompts rather than using trial-and-error.

Model answer:

A good prompt addresses six dimensions: (1) Role — who the model should be ("You are a senior JavaScript developer"), (2) Task — what specific action to perform ("Review this code for security vulnerabilities"), (3) Context — relevant background and constraints ("The code handles user authentication"), (4) Format — what shape the output should take ("Return a JSON object with..."), (5) Tone — how it should sound ("Professional but accessible"), (6) Boundaries — what it should NOT do ("Do not suggest rewriting the entire codebase").

A good prompt is also testable — you can run it 10 times and get consistent, correct results. I follow the "Do over Don't" principle: instead of saying "Don't be vague," say "Be specific — include numbers, names, and dates." And I always version prompts in code, treating them as production artifacts with the same rigor as any other source code.


Q3. What is the difference between zero-shot, one-shot, and few-shot prompting?

Why interviewers ask: Fundamental terminology that every AI engineer should know. Also tests understanding of when to use each.

Model answer:

These terms describe how many examples you include in your prompt before asking the model to perform a task.

Zero-shot: No examples — only instructions. The model relies entirely on its training to interpret the task. Example: "Classify this review as positive or negative: 'Great product!'". Best for simple, well-understood tasks where the model already knows what to do.

One-shot: One example shown before the actual task. Example: "'Love it!' → positive. Now classify: 'Terrible quality.'". The single example establishes the exact output format and decision pattern.

Few-shot: Multiple examples (typically 3-7). Best for tasks with nuance, custom label sets, or complex output formats. The model detects patterns across examples and generalizes to new inputs.

The key tradeoff is accuracy vs token cost. Few-shot is typically the most accurate but costs more tokens. The first 3 examples provide about 90% of the accuracy benefit — beyond 7-10, returns diminish sharply. In production, I combine instructions (for rules) with 3-5 targeted examples (for format and edge cases).
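
Mechanically, the only difference between the three variants is how many labeled examples are interpolated ahead of the new input. A minimal sketch — the review data and the `buildFewShotPrompt` helper are illustrative, not a fixed API:

```javascript
// Sketch: a few-shot classification prompt built from labeled examples.
// Passing an empty array gives zero-shot; one entry gives one-shot.
const EXAMPLES = [
  { text: "Love it!", label: "positive" },
  { text: "Terrible quality.", label: "negative" },
  { text: "Arrived on time, works fine.", label: "positive" },
];

function buildFewShotPrompt(examples, input) {
  const instruction =
    "Classify the review as positive or negative. Respond with a single word.";
  const shots = examples
    .map((ex) => `Review: "${ex.text}"\nLabel: ${ex.label}`)
    .join("\n\n");
  return `${instruction}\n\n${shots}\n\nReview: "${input}"\nLabel:`;
}

const prompt = buildFewShotPrompt(EXAMPLES, "Broke after one day.");
console.log(prompt);
```

Note that every example uses the identical `Review:`/`Label:` layout — the consistency of the format is what the model latches onto, as much as the labels themselves.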


Q4. How do you ensure an LLM returns valid JSON?

Why interviewers ask: JSON output is the most common production requirement. This tests practical prompt engineering and defensive coding skills.

Model answer:

I use a defense-in-depth approach with three layers:

Layer 1 — Prompt instructions: Include explicit instructions like "Respond ONLY with valid JSON. No explanation. No markdown code fences." I also specify the exact schema with key names, types, and null-handling rules. Few-shot examples demonstrating the exact JSON format reinforce this.

Layer 2 — API-level enforcement: When using OpenAI, I set response_format: { type: 'json_object' } or, even better, response_format: { type: 'json_schema', json_schema: { strict: true, schema: ... } }. This guarantees structurally valid JSON at the API level.

Layer 3 — Code validation: Even with the first two layers, I always validate the parsed result. I check for required fields, correct data types (e.g., that age is a number, not the string "30"), and business rules. I also strip markdown code fences defensively before parsing, because models occasionally add them.

I always use temperature 0 for JSON output — randomness and structured parsing do not mix. And I implement retry logic: if parsing fails, I retry up to 2 times before logging the failure and falling back gracefully.
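
Layer 3 can be sketched as follows — the fence-stripping regex, the `name`/`age` schema, and the helper names are illustrative, not a fixed API:

```javascript
// Sketch of the code-validation layer: strip stray markdown fences,
// parse, then check required fields and types before trusting the result.
function stripCodeFences(raw) {
  // Models occasionally wrap JSON in markdown fences despite instructions.
  return raw.replace(/^`{3}(?:json)?\s*/i, "").replace(/\s*`{3}$/, "").trim();
}

function parseAndValidate(raw) {
  let data;
  try {
    data = JSON.parse(stripCodeFences(raw));
  } catch {
    return { ok: false, error: "invalid JSON" };
  }
  // Type checks catch the classic failure where age comes back as the string "30".
  if (typeof data.name !== "string" || typeof data.age !== "number") {
    return { ok: false, error: "schema mismatch" };
  }
  return { ok: true, data };
}

console.log(parseAndValidate('{"name": "Ada", "age": 30}'));   // passes
console.log(parseAndValidate('{"name": "Ada", "age": "30"}')); // fails type check
```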


Intermediate (Q5–Q8)

Q5. When should you use few-shot examples vs detailed instructions? How do you decide?

Why interviewers ask: Tests your ability to make practical tradeoffs in prompt design — a sign of real-world experience.

Model answer:

The decision depends on the task complexity and the type of knowledge you're transferring:

Use few-shot examples when: (1) the output format is complex and easier to show than describe (e.g., specific JSON structures), (2) the task involves classification with custom labels — examples define the decision boundaries better than rules, (3) the task has subtle patterns that are intuitive when demonstrated but hard to articulate (e.g., "this is the right tone for our brand"), (4) you need consistent style across outputs.

Use instructions when: (1) the task is well-known and the model already understands it (e.g., "Translate to French"), (2) the task requires reasoning rather than pattern matching — chain-of-thought instructions outperform examples for math and logic, (3) the output is long — example outputs waste too many tokens, (4) you're token-constrained — instructions are more compact than examples.

In practice, I almost always use both: instructions define the rules (null handling, currency conversion, date formats), and 3-5 examples demonstrate the format and edge cases. The instructions handle the "what," the examples handle the "how it looks." I measure accuracy on a held-out test set and add examples only when they improve accuracy enough to justify the extra token cost.


Q6. Explain chain-of-thought prompting. When does it help and when is it wasteful?

Why interviewers ask: Tests depth of understanding beyond surface-level prompting knowledge, and ability to make cost-conscious engineering decisions.

Model answer:

Chain-of-thought (CoT) prompting instructs the model to generate intermediate reasoning steps before producing a final answer. The simplest form is appending "Let's think step by step" to the prompt. For more control, I use structured templates like GIVEN/FIND/STEPS/ANSWER.

CoT helps when the task requires multi-step reasoning: math calculations, logic problems, code debugging, comparative analysis, and policy/eligibility decisions. It works because each generated reasoning token provides context for the next step — the model is less likely to skip steps and make errors. Research shows CoT can improve accuracy by 20-40% on reasoning tasks.

CoT is wasteful for: simple factual lookups ("What is the capital of France?"), direct extraction (pulling emails from text), clear-cut classification, and translation. For these tasks, the answer doesn't require intermediate reasoning, and CoT just generates 5-10x more tokens for the same result.

The cost implication is significant: CoT typically generates 100-200 output tokens vs 5-20 without it. At $10/1M output tokens, that is a 10-40x increase in output token cost per request. In production, I implement selective CoT — a routing layer that applies CoT only to requests classified as complex. This can reduce monthly bills by 50-70% compared to blanket CoT application.
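
A selective-CoT router can be a very cheap gate in front of the expensive prompt. A sketch, assuming a keyword heuristic stands in for a real complexity classifier (production routers are usually a small classifier or a cheap model call):

```javascript
// Sketch: route requests so only "complex" ones pay for reasoning tokens.
// The keyword list and length threshold are illustrative assumptions.
const REASONING_HINTS = /\b(calculate|compare|why|debug|eligib|prove|step)/i;

function routePrompt(userRequest) {
  const complex =
    REASONING_HINTS.test(userRequest) || userRequest.length > 400;
  // Append the CoT trigger only when the request looks like it needs reasoning.
  const suffix = complex ? "\n\nLet's think step by step." : "";
  return { complex, prompt: userRequest + suffix };
}

console.log(routePrompt("What is the capital of France?").complex);               // false
console.log(routePrompt("Compare these loan offers and calculate interest.").complex); // true
```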


Q7. How do you handle output formatting failures in production?

Why interviewers ask: Tests your ability to build resilient systems on top of non-deterministic models — a key production AI engineering skill.

Model answer:

I treat formatting failures as expected events, not exceptions. My approach has four parts:

Prevention: Temperature 0 for all structured output. Explicit format instructions in the system prompt. Few-shot examples showing the exact format. API-level format enforcement (response_format) when available. The "Respond ONLY with..." pattern to suppress wrapper text.

Detection: Every LLM response goes through a validation pipeline before reaching application logic. For JSON: try-catch on JSON.parse(), then schema validation (required fields, correct types, valid enum values). For structured text: regex validation of expected delimiters and sections.

Recovery: Automatic retry with up to 2 additional attempts. Between retries, I can add a more explicit instruction like "Your previous response was not valid JSON. Return ONLY a JSON object." If retries fail, I log the raw response for debugging and return a graceful fallback — a default value, a "sorry, try again" message, or routing to a human.

Monitoring: I track parsing success rate as a key metric. If it drops below 99.5%, I investigate — it usually means the prompt needs adjustment or the model was updated. I log every raw LLM response alongside the parsed result for post-hoc debugging.
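
The detection-plus-recovery loop can be sketched as below, with `callModel` standing in for a real API call (the mock simulates one malformed response before a valid one):

```javascript
// Sketch: retry a model call with an added corrective instruction on
// parse failure, then fall back gracefully. `callModel` is a stand-in.
function extractWithRetry(callModel, prompt, maxRetries = 2) {
  let messages = [{ role: "user", content: prompt }];
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const raw = callModel(messages);
    try {
      return { ok: true, data: JSON.parse(raw), attempts: attempt + 1 };
    } catch {
      // Feed the bad response back with an explicit correction.
      messages = messages.concat([
        { role: "assistant", content: raw },
        { role: "user", content: "Your previous response was not valid JSON. Return ONLY a JSON object." },
      ]);
    }
  }
  return { ok: false, data: null }; // graceful fallback; log the raw response here
}

// Mock model that fails once, then complies.
let calls = 0;
const mock = () => (++calls === 1 ? "Sure! Here it is: {..." : '{"status": "ok"}');
console.log(extractWithRetry(mock, "Extract status as JSON."));
// → { ok: true, data: { status: 'ok' }, attempts: 2 }
```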


Q8. What is the "Do vs Don't" principle in prompt engineering?

Why interviewers ask: Tests understanding of a nuanced but practical pattern that significantly improves prompt quality.

Model answer:

The principle is: "Do" instructions are stronger than "Don't" instructions. When you tell the model "Don't use technical jargon," it knows what to avoid but must guess the alternative. When you say "Write at an 8th-grade reading level," it knows exactly what level to target.

"Don't" instructions define a negative space (everything except X), which is ambiguous. "Do" instructions define a positive target, which is specific. Examples: "Don't be verbose" → "Keep responses under 50 words." "Don't make up information" → "If uncertain, say 'I'm not sure about this.'" "Don't include extra text" → "Respond ONLY with the JSON object."

In practice, the best approach combines both: the "Do" instruction states the desired behavior, and the "Don't" instruction explicitly rules out common failure modes. For example: "DO: Base every claim on the provided documents and cite your source. DON'T: Fabricate statistics, invent citations, or state facts not found in the documents." The "Do" gives direction; the "Don't" closes specific loopholes the model might exploit.

This principle extends to system prompts generally — I structure them with explicit DO and DON'T sections, with the DO section always coming first because it's more important.
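
A hypothetical system prompt skeleton using this structure — the rules themselves are example content, not a prescribed set:

```javascript
// Sketch: a system prompt with the DO section first, the DON'T section
// closing specific loopholes. All rule text here is illustrative.
const SYSTEM_PROMPT = `You are a support assistant for an e-commerce store.

DO:
- Base every claim on the provided documents and cite the source.
- Keep responses under 100 words, at an 8th-grade reading level.
- If uncertain, say "I'm not sure about this."

DON'T:
- Fabricate statistics or invent citations.
- Discuss topics unrelated to orders, shipping, or returns.`;

console.log(SYSTEM_PROMPT.indexOf("DO:") < SYSTEM_PROMPT.indexOf("DON'T:")); // true
```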


Advanced (Q9–Q11)

Q9. Design a prompt engineering strategy for a production data extraction pipeline processing 100K documents per day.

Why interviewers ask: Tests system design thinking — balancing accuracy, cost, latency, and reliability at scale.

Model answer:

Architecture:

Documents → Pre-processing → LLM Extraction → Validation → Post-processing → Database
                                  ↑                 |
                            Prompt config        Retry queue (failures)

Prompt design: System prompt with strict JSON schema, "Respond ONLY" enforcement, and 2-3 carefully chosen few-shot examples. I'd use dynamic example selection — for each document type (invoice, receipt, contract), select examples of that type. Temperature 0, response_format: json_schema with strict mode.

Cost optimization: At 100K docs/day with ~500 tokens input per doc: 50M input tokens × $2.50/1M = $125/day. Output ~100 tokens: 10M × $10/1M = $100/day. Total: ~$225/day = ~$6,750/month. I'd optimize by: (1) minimizing prompt size (concise examples, trim unnecessary instructions), (2) using a smaller model (GPT-4o-mini) for simpler documents and routing complex ones to GPT-4o, (3) batching where the API supports it.

Reliability: Three-layer validation (prompt + API format + code validation). Retry queue for parsing failures with an escalation path to a more explicit prompt. Circuit breaker if failure rate exceeds 2%. Daily accuracy evaluation against a 200-document gold standard dataset.

Monitoring: Track per-document: extraction success rate, field-level accuracy, latency, token usage, cost. Alert on: success rate < 99%, average latency > 3s, daily cost > 110% of baseline.

Iteration: Version all prompts in code. A/B test prompt changes on the gold standard before deploying. Review the retry queue weekly to identify systematic failures and adjust prompts accordingly.


Q10. How would you evaluate and iterate on prompt quality systematically?

Why interviewers ask: Tests whether you approach prompt engineering as a rigorous engineering practice rather than ad hoc experimentation.

Model answer:

I use a four-stage evaluation framework:

Stage 1 — Build an evaluation dataset. Create 100-200+ examples with known correct outputs. Split into development set (for iteration) and held-out test set (for final evaluation). Include edge cases, adversarial inputs, and representative samples from production traffic.

Stage 2 — Define metrics. For classification: accuracy, precision, recall, F1 per class. For extraction: field-level accuracy (exact match), type correctness. For formatting: parse success rate, schema compliance rate. For generation: human evaluation on a 1-5 Likert scale for quality dimensions (relevance, accuracy, completeness, tone).

Stage 3 — Iterate on the development set. Change one thing at a time (prompt wording, number of examples, CoT vs direct). Run the full dev set after each change. Track which changes improved which metrics. Common iteration cycle: run → identify failure patterns → add instructions or examples targeting the failure → re-run.

Stage 4 — Validate on the held-out test set. Only run the test set for final validation, never during iteration (to avoid overfitting to the test set). Compare the new prompt version against the baseline. Deploy only if there's a statistically significant improvement.

Ongoing: Sample 1-5% of production traffic for human review. Track metric drift over time. Rerun the evaluation suite whenever the model is updated or the prompt is changed. Treat evaluation datasets as living documents — add new failure cases as they're discovered in production.
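
Stages 2 and 3 can be wired into a small harness. A sketch where `runPrompt` stands in for the real model call (here a trivial rule-based mock, so the numbers are toy values):

```javascript
// Sketch: run one prompt variant over a labeled dev set and report
// parse success rate and classification accuracy.
function evaluate(runPrompt, devSet) {
  let parsed = 0, correct = 0;
  for (const { input, expected } of devSet) {
    const raw = runPrompt(input);
    try {
      const out = JSON.parse(raw);
      parsed++;
      if (out.label === expected) correct++;
    } catch {
      // A parse failure counts against parseRate and accuracy alike.
    }
  }
  return {
    parseRate: parsed / devSet.length,
    accuracy: correct / devSet.length,
  };
}

const devSet = [
  { input: "Great product!", expected: "positive" },
  { input: "Broke in a day.", expected: "negative" },
];
const mockRun = (text) =>
  JSON.stringify({ label: /great|love/i.test(text) ? "positive" : "negative" });
console.log(evaluate(mockRun, devSet)); // → { parseRate: 1, accuracy: 1 }
```

Running this after each single-variable prompt change gives the comparable per-variant metrics the iteration loop needs.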


Q11. A team is spending $15,000/month on LLM API calls. The prompts use chain-of-thought on every request, 10 few-shot examples per prompt, and GPT-4o for all tasks. How would you reduce costs without sacrificing quality?

Why interviewers ask: Tests your ability to audit and optimize a real production AI system — a skill that distinguishes senior engineers.

Model answer:

I'd audit the system across four dimensions:

1. Selective CoT (estimated savings: 30-50%): Analyze request types. Simple lookups, clear-cut classification, and direct extraction don't benefit from CoT. Implement a complexity router: classify each request as simple or complex, and only apply CoT to complex requests. If 60% of requests are simple, removing CoT from those saves ~40% of output tokens.

2. Optimize few-shot examples (estimated savings: 15-25%): Measure accuracy with 10, 7, 5, and 3 examples on the evaluation set. Typically, 3-5 well-chosen examples perform within 1-2% of 10 examples. Reducing from 10 to 4 examples saves ~360 input tokens per request. At 500K requests/month: 180M fewer input tokens = $450/month savings. Also switch to dynamic example selection to get better accuracy with fewer examples.
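
The savings arithmetic checks out, assuming ~60 tokens per example (the figure implied by the answer's ~360-token reduction):

```javascript
// Reproduces the example-reduction estimate: dropping 6 examples of
// ~60 tokens each, at 500K requests/month and $2.50 per 1M input tokens.
const TOKENS_PER_EXAMPLE = 60; // illustrative average
const removedPerRequest = (10 - 4) * TOKENS_PER_EXAMPLE; // ~360 tokens
const monthlyTokensSaved = removedPerRequest * 500_000;  // 180M
const monthlySavings = (monthlyTokensSaved / 1e6) * 2.5; // dollars

console.log(removedPerRequest, monthlyTokensSaved, monthlySavings); // 360 180000000 450
```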

3. Model tiering (estimated savings: 40-60%): Not every request needs GPT-4o. Route simple tasks (sentiment classification, entity extraction, formatting) to GPT-4o-mini (which is ~20x cheaper). Reserve GPT-4o for complex tasks (multi-step reasoning, nuanced analysis, long-form generation). If 50% of requests can use the mini model, that is a massive cost reduction.

4. Prompt compression (estimated savings: 10-15%): Audit system prompts for redundancy and verbosity. Replace paragraph instructions with concise bullet points. Remove examples that don't measurably improve accuracy. Trim "just in case" instructions that address edge cases that never occur.

Combined estimated savings: $7,000-$10,000/month (47-67% reduction). I'd implement changes incrementally, measuring quality metrics after each change to ensure no degradation. The evaluation dataset is the guardrail — if a change drops accuracy below the threshold, it's rolled back.


Quick-fire

| # | Question | One-line answer |
| --- | --- | --- |
| 1 | What is the most important phrase for JSON output? | "Respond ONLY with valid JSON" |
| 2 | How many few-shot examples give the best ROI? | 3-5 examples (90%+ of accuracy benefit) |
| 3 | When does CoT help? | Multi-step reasoning — math, logic, debugging, comparison |
| 4 | What temperature for structured output? | Temperature 0 — always |
| 5 | "Don't be vague" — how to improve this? | "Be specific — include numbers, names, and dates" |
| 6 | What goes in system vs user prompt? | System: stable rules/persona. User: per-request task/data |
| 7 | Best defense for JSON formatting? | Three layers: prompt + API response_format + code validation |
| 8 | Zero-shot vs few-shot — when is zero-shot fine? | Simple, well-known tasks (translate, summarize, common QA) |
| 9 | Why is CoT expensive? | 5-10x more output tokens generated for reasoning steps |
| 10 | What makes examples effective? | Diverse, cover all labels, include edge cases, consistent format |

← Back to 4.3 — Prompt Engineering Fundamentals (README)