Episode 4 — Generative AI Engineering / 4.2 — Calling LLM APIs Properly

Interview Questions: Calling LLM APIs Properly

Model answers for message roles, token budgeting, cost awareness, and rate limits/retries.

How to use this material (instructions)

  1. Read lessons in order: README.md, then 4.2.a through 4.2.d.
  2. Practice out loud — definition → example → pitfall.
  3. Pair with exercises: 4.2-Exercise-Questions.md.
  4. Quick review: 4.2-Quick-Revision.md.

Beginner (Q1–Q4)

Q1. What are message roles in LLM APIs and why do they matter?

Why interviewers ask: Tests foundational understanding of how LLM APIs work — every AI feature is built on this structure.

Model answer:

LLM APIs accept a messages array where each message has a role and content. The three roles are:

System — persistent instructions that shape the model's behavior: persona, rules, output format, and safety constraints. It sits at the top of the messages array and has elevated priority during generation. OpenAI includes it in the messages array; Anthropic uses a separate top-level system parameter.

User — human input or application-constructed prompts. In production, user messages are often built programmatically, combining the actual user question with retrieved context (RAG documents).

Assistant — the model's previous responses. These are included to maintain conversation context across API calls (since the API is stateless). Assistant messages are also used for few-shot examples — you inject example responses to teach the model the exact format you want.

The key insight is that the API is stateless — you must send the entire conversation (system + all user/assistant turns) with every call. The model has no memory between requests.
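To make the statelessness concrete, here is a minimal sketch in Python; the `build_messages` helper is illustrative, not part of any SDK:

```python
def build_messages(system_prompt, history, user_input):
    """Assemble the full messages array sent on every call.

    The API is stateless, so prior user/assistant turns must be
    replayed each time; `history` is a list of (role, content) pairs.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for role, content in history:
        messages.append({"role": role, "content": content})
    messages.append({"role": "user", "content": user_input})
    return messages

payload = build_messages(
    "You are a concise support assistant.",
    [("user", "Hi"), ("assistant", "Hello! How can I help?")],
    "Where is my order?",
)
```

The same structure maps directly onto the OpenAI `messages` parameter; for Anthropic, you would pull the first element out into the top-level `system` parameter instead.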


Q2. What does max_tokens control and what happens when a response is truncated?

Why interviewers ask: Misunderstanding max_tokens is one of the most common API mistakes. This tests practical API knowledge.

Model answer:

max_tokens controls the maximum number of output tokens the model can generate. It does not limit input size — that's bounded by the context window. It's an upper bound, not a target; the model may stop earlier if it naturally completes its response.

When the output reaches the max_tokens limit, the response is cut off mid-sentence and the API sets finish_reason: "length". A complete response returns finish_reason: "stop". In production, you should always check finish_reason — a "length" response means the user got an incomplete answer. You then need to increase max_tokens, instruct the model to be more concise, or implement a continuation mechanism.

The budget formula is: available_output = context_window - input_tokens - safety_margin. For GPT-4o (128K window), if your input is 10,000 tokens, you theoretically have 118,000 tokens for output, but in practice you set max_tokens to a sensible limit (500-4,096) to control cost and response length.
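The budget formula and the finish_reason check can be sketched as follows (the function names are illustrative, not from any SDK):

```python
def available_output(context_window, input_tokens, safety_margin=0):
    """Upper bound on output tokens the model could still generate."""
    return context_window - input_tokens - safety_margin

def is_truncated(finish_reason):
    """True when the response hit the max_tokens limit mid-generation."""
    return finish_reason == "length"

# GPT-4o: 128K window, 10,000-token input -> 118,000 tokens theoretically free
budget = available_output(128_000, 10_000)
```

In practice you would set `max_tokens` well below `budget` (e.g. 500-4,096) and alert whenever `is_truncated` fires more than a small fraction of the time.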


Q3. Why are output tokens more expensive than input tokens?

Why interviewers ask: Tests deeper understanding of LLM architecture beyond surface-level API usage.

Model answer:

Output tokens cost 3-5x more because of the fundamental difference in how input and output are processed.

Input processing is parallelized — the model processes all input tokens simultaneously through the transformer's self-attention layers. Thanks to this parallelism, a 10,000-token prompt does not take roughly 10x longer to process than a 1,000-token prompt.

Output generation is sequential — the model generates one token at a time, each requiring a full forward pass through the network. Each new token depends on all previous tokens (autoregressive generation). Generating 500 output tokens requires 500 sequential forward passes. This makes output generation much more compute-intensive per token.

This pricing differential has practical design implications: optimizing output length (concise instructions, tight max_tokens, structured output formats) has higher cost impact per token saved than optimizing input length.


Q4. What is a 429 error and how should you handle it?

Why interviewers ask: Every production LLM integration hits rate limits eventually. This tests production readiness.

Model answer:

HTTP 429 Too Many Requests means you've exceeded the API's rate limit — either RPM (requests per minute) or TPM (tokens per minute). The response includes a retry-after header indicating how long to wait.

The standard handling strategy is exponential backoff with jitter:

  1. First retry after ~1 second
  2. Second retry after ~2 seconds
  3. Third retry after ~4 seconds
  4. Each delay has a random jitter component to prevent the "thundering herd" problem (where all clients retry at the exact same time)

Crucially, you should only retry retryable errors (429, 500, 502, 503). Never retry client errors like 400 (bad request) or 401 (unauthorized) — those indicate bugs in your code, not temporary conditions.

In production, you should also implement: concurrency limiting (cap parallel requests), circuit breakers (stop sending requests during prolonged outages), timeouts (prevent indefinite hangs), and fallback models (switch to GPT-4o-mini if GPT-4o is overloaded).
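The backoff schedule and retry filter above can be sketched as follows; `send()` is a stand-in for the real HTTP call, and the helper names are illustrative:

```python
import random
import time

RETRYABLE = {429, 500, 502, 503}

def backoff_delay(attempt, base=1.0, cap=30.0):
    """Exponential delay (base, 2*base, 4*base, ...) capped, with jitter."""
    return min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.5)

def call_with_retries(send, max_retries=3, base=1.0):
    """`send()` returns (status, body); retry only retryable statuses."""
    for attempt in range(max_retries + 1):
        status, body = send()
        if status < 400:
            return body
        if status not in RETRYABLE or attempt == max_retries:
            raise RuntimeError(f"request failed with HTTP {status}")
        time.sleep(backoff_delay(attempt, base))
```

A production version would also honor the `retry-after` header when the provider sends one, rather than relying on the computed delay alone.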


Intermediate (Q5–Q8)

Q5. Design a token budget strategy for a production RAG chatbot.

Why interviewers ask: Tests system design thinking — balancing competing demands within a fixed resource constraint.

Model answer:

Given GPT-4o's 128K token window, I'd use priority-based dynamic allocation:

Fixed allocations:
  System prompt:       1,500 tokens  (persona, rules, output format)
  Output reservation:  4,096 tokens  (max response length)
  Safety margin:       1,000 tokens  (prevent edge-case overflow)
  Subtotal fixed:      6,596 tokens

Dynamic allocations (fill in priority order):
  1. Current user message:  ~200-500 tokens (measured per request)
  2. Recent history:        last 3-5 turns (~3,000-6,000 tokens)
  3. RAG documents:         REMAINING budget (most relevant first)

The formula is: ragBudget = 128,000 - 6,596 - userTokens - historyTokens. I'd count tokens before sending using tiktoken to prevent overflow. If history grows too large, I'd summarize older turns using a cheaper model (GPT-4o-mini). For RAG, I'd retrieve more chunks than needed, rank by relevance, then greedily fill the budget with top chunks until exhausted.

I'd monitor: average context utilization, truncation rate (finish_reason: "length"), and p99 input token count. If utilization consistently exceeds 80%, I'd implement more aggressive history pruning.
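The budget arithmetic and the greedy chunk fill can be sketched as below, assuming chunk token counts are pre-computed (e.g. with tiktoken) and chunks arrive already ranked by relevance:

```python
def rag_budget(user_tokens, history_tokens, window=128_000, fixed=6_596):
    """Tokens left for retrieved documents after fixed + dynamic costs."""
    return window - fixed - user_tokens - history_tokens

def fill_budget(chunks, budget):
    """Greedily pack relevance-ranked (tokens, text) chunks into the budget."""
    picked, used = [], 0
    for tokens, text in chunks:
        if used + tokens <= budget:
            picked.append(text)
            used += tokens
    return picked
```

Note the greedy fill skips an oversized chunk and keeps going, so a later, smaller chunk can still use the remaining budget.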


Q6. How would you reduce LLM API costs by 50% or more without significantly impacting quality?

Why interviewers ask: Real production concern — evaluates whether you can think about AI engineering economically.

Model answer:

I'd apply a layered optimization strategy:

1. Model routing (biggest impact — 40-60% savings): Route simple tasks (classification, short Q&A, extraction) to GPT-4o-mini ($0.15/1M) instead of GPT-4o ($2.50/1M). A classifier determines complexity — typically 60-70% of requests can go to the cheaper model. For a typical workload this alone saves 50%+.

2. Prompt caching (20-30% savings on input): Enable provider-level prompt caching for the system prompt. Anthropic gives 90% off cached tokens; OpenAI gives 50% off. Since the system prompt is sent with every request, this compounds at scale.

3. Prompt compression (10-20% savings): Audit system prompts for verbosity. A 1,500-token prompt can often be reduced to 500 tokens without losing effectiveness. At 100K calls/day on GPT-4o, this saves ~$7,500/month.

4. Batching (5-10% savings): For batch processing tasks, group 5-10 items per API call instead of one call per item. This amortizes the system prompt cost.

5. Response length control (10-15% savings on output): Explicitly constrain output length in both the prompt ("answer in 1-2 sentences") and max_tokens. Since output is 3-5x more expensive, shorter responses have high impact.

Combined, these strategies routinely achieve 50-70% cost reduction with minimal quality impact.
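To see why routing dominates, here is a quick blended-price calculation using the input prices quoted above (the exact savings depend on your traffic mix, so treat the 65% figure as an assumption):

```python
def blended_input_price(frac_mini, mini=0.15, full=2.50):
    """Average $ per 1M input tokens when `frac_mini` of traffic
    goes to the cheaper model; prices are per the tiers above."""
    return frac_mini * mini + (1 - frac_mini) * full

baseline = blended_input_price(0.0)   # everything on GPT-4o
routed = blended_input_price(0.65)    # 65% routed to GPT-4o-mini
savings = 1 - routed / baseline       # ~61% saved on input tokens
```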


Q7. Explain the circuit breaker pattern and when you'd use it for LLM APIs.

Why interviewers ask: Tests production engineering maturity — understanding failure modes beyond simple retries.

Model answer:

A circuit breaker is a state machine with three states that prevents cascading failure when an API is consistently failing:

CLOSED (normal) — requests pass through normally. Failures are counted. If failures exceed a threshold within a time window (e.g., 5 failures in 60 seconds), the circuit "trips" to OPEN.

OPEN (blocking) — all requests are immediately rejected without calling the API. This prevents wasting resources on a failing service. After a cooldown period (e.g., 30 seconds), the circuit transitions to HALF_OPEN.

HALF_OPEN (testing) — one test request is allowed through. If it succeeds, the circuit returns to CLOSED. If it fails, the circuit goes back to OPEN.

For LLM APIs, I'd use a circuit breaker when: (1) the provider has a prolonged outage — retries keep failing and accumulate cost/latency, (2) your rate limits are exhausted — continued requests just generate more 429s, (3) cascading failures — your app sends retries that overload the provider further.

Without a circuit breaker, 1,000 requests all retrying 3 times against a failing API become 4,000 requests — making the overload worse. With a circuit breaker, after 5 failures the other 995 requests fail fast and you can route them to a fallback model.
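A minimal circuit breaker matching the three states above might look like this sketch; the injectable clock exists purely so the state machine is testable without real waiting:

```python
import time

class CircuitBreaker:
    """CLOSED -> OPEN after `threshold` failures within `window` seconds;
    OPEN -> HALF_OPEN after `cooldown`; one probe request decides the rest."""

    def __init__(self, threshold=5, window=60.0, cooldown=30.0,
                 clock=time.monotonic):
        self.threshold, self.window, self.cooldown = threshold, window, cooldown
        self.clock = clock
        self.state = "CLOSED"
        self.failures = []       # timestamps of recent failures
        self.opened_at = None

    def allow(self):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.cooldown:
                self.state = "HALF_OPEN"
                return True      # the single test request
            return False
        if self.state == "HALF_OPEN":
            return False         # probe already in flight
        return True              # CLOSED: normal operation

    def record_success(self):
        self.state = "CLOSED"
        self.failures.clear()

    def record_failure(self):
        now = self.clock()
        if self.state == "HALF_OPEN":
            self.state, self.opened_at = "OPEN", now
            return
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.state, self.opened_at = "OPEN", now
```

In the LLM case, the `allow() == False` branch is where you would fail fast or hand the request to the fallback model.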


Q8. How do you handle few-shot examples efficiently when token budget is tight?

Why interviewers ask: Tests practical trade-off thinking between prompt effectiveness and resource constraints.

Model answer:

Few-shot examples are effective but expensive — each example adds ~100-300 tokens (user + assistant pair). With 3 examples, that's 300-900 tokens sent with every request. Strategies to handle this:

1. Dynamic few-shot selection: Instead of sending the same 3 examples every time, select examples relevant to the current input. Embed your example library, find the closest match to the user's input, and include only the most relevant 1-2 examples. This improves quality AND reduces tokens.

2. Move patterns to system prompt: If all examples follow the same pattern, describe the pattern in the system prompt instead: "Extract fields and return JSON with keys: product, price, currency" can replace 3 examples in some cases.

3. Cache examples in the prompt prefix: With provider caching (Anthropic, OpenAI), place few-shot examples right after the system prompt so they're included in the cached prefix. You pay full price once, then 50-90% off for subsequent calls.

4. Graduated few-shot: Use 3 examples for new users or novel inputs, but reduce to 0-1 examples for common patterns where the model already performs well.

5. Separate example model: For complex tasks, use a first LLM call to generate the "format template" once, then use that as a single example for all subsequent calls.

The key trade-off: fewer examples save tokens but may reduce format compliance. Measure compliance rate at each level (0, 1, 2, 3 examples) and pick the minimum that meets your target.
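Dynamic few-shot selection (strategy 1) reduces to a nearest-neighbor lookup over pre-computed embeddings. A toy sketch with hand-made 2-d vectors standing in for real embedding-API output:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def select_examples(query_vec, library, k=2):
    """Pick the k examples whose embeddings are closest to the query;
    `library` is a list of (vector, example) pairs."""
    ranked = sorted(library, key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    return [example for _, example in ranked[:k]]
```

In production the vectors would come from an embedding model and the examples would be full user/assistant message pairs injected into the prompt.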


Advanced (Q9–Q11)

Q9. Design a complete error handling and reliability layer for a production LLM API integration.

Why interviewers ask: Tests the full breadth of production engineering knowledge for AI systems.

Model answer:

I'd build a layered reliability stack with five components:

Layer 1 — Request validation (pre-flight). Before sending any API call, count tokens using tiktoken and verify: (a) input fits within context window minus max_tokens, (b) max_tokens is set appropriately for the task, (c) messages array is properly formatted with alternating roles. Reject invalid requests early.

Layer 2 — Timeout and cancellation. Set per-request timeouts (10-60s based on expected response length). Use AbortController for cancellation. For streaming responses, set a timeout for the first chunk, then a per-chunk timeout.

Layer 3 — Retry with exponential backoff. Retry 429, 500, 502, 503 errors up to 3 times. Exponential delay (1s, 2s, 4s) with random jitter. Respect retry-after header. Never retry 400-level client errors.

Layer 4 — Concurrency control. Use a semaphore/p-limit to cap concurrent requests (typically 10-50). Track in-flight request count against RPM/TPM limits. Proactively throttle based on rate limit response headers (x-ratelimit-remaining-*).

Layer 5 — Circuit breaker with fallback. If error rate exceeds threshold (5 failures in 60s), circuit opens. All new requests immediately fail or route to a fallback model (GPT-4o-mini). After cooldown, test one request. On success, resume primary model.

Cross-cutting: Observability. Log every call with: model, input/output tokens, latency, status code, finish_reason, cost, feature name, user ID. Alert on: error rate > 1%, truncation rate > 2%, p95 latency > 30s, daily cost > budget.

Post-response validation. Check finish_reason for truncation. Validate response structure (JSON schema validation if expecting structured output). Handle empty or malformed responses with a retry or fallback.
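Layer 1 (pre-flight validation) can be sketched as below. The 4-characters-per-token estimator is a crude stand-in for a real tokenizer such as tiktoken, and the function names are illustrative:

```python
def estimate_tokens(text):
    """Rough stand-in for tiktoken: ~4 characters per token in English."""
    return max(1, len(text) // 4)

def validate_request(messages, max_tokens, context_window=128_000):
    """Pre-flight checks; returns a list of problems (empty means OK)."""
    problems = []
    if not messages or messages[0]["role"] != "system":
        problems.append("first message should be the system prompt")
    roles = [m["role"] for m in messages[1:]]
    for a, b in zip(roles, roles[1:]):
        if a == b:
            problems.append("user/assistant roles should alternate")
            break
    input_tokens = sum(estimate_tokens(m["content"]) for m in messages)
    if input_tokens + max_tokens > context_window:
        problems.append("input + max_tokens exceeds context window")
    return problems
```

Rejecting a malformed or oversized request here is far cheaper than burning a real API call on a guaranteed failure.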


Q10. How would you build a cost-aware model routing system?

Why interviewers ask: Tests advanced system design — optimizing cost while maintaining quality across a fleet of models.

Model answer:

The routing system classifies incoming requests by complexity and routes them to the cheapest model that can handle them adequately.

Architecture:

Request → Classifier → Router → Model → Post-validator → Response
                                  ↓ (if validation fails)
                               Upgrade to stronger model and retry

Classifier options (trade-off: cost of classification vs savings from routing):

  1. Rule-based (zero cost): Input length, keyword detection, feature flag. Simple but coarse. Example: < 200 characters and no code → mini model.

  2. Embedding similarity (low cost): Embed the request, compare to clusters of "simple" vs "complex" labeled examples. Cost: ~$0.0001 per classification.

  3. LLM classifier (moderate cost): Use GPT-4o-mini to classify request complexity (simple/medium/complex) for $0.0001-0.001 per classification. Best accuracy but adds latency.

Routing tiers:

| Tier | Model | Cost (input/output per 1M) | Use when |
| --- | --- | --- | --- |
| Tier 1 | GPT-4o-mini | $0.15 / $0.60 | Classification, simple Q&A, extraction |
| Tier 2 | GPT-4o | $2.50 / $10.00 | Summarization, code gen, nuanced questions |
| Tier 3 | Claude Opus | $15.00 / $75.00 | Complex reasoning, multi-step analysis |

Post-validation loop: After getting a response, validate quality (format compliance, confidence score, length adequacy). If the cheap model fails validation, automatically retry with the next tier. Track upgrade rates per task type to improve the classifier over time.

Monitoring: Track per-tier accuracy, latency, cost, and upgrade rate. If a task type has > 10% upgrade rate from Tier 1, reclassify it as Tier 2. The goal is to maximize Tier 1 usage while keeping quality acceptable.

Expected savings: 50-70% compared to routing everything to the best model, with < 2% quality degradation on well-classified tasks.
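The rule-based classifier plus the upgrade-on-failed-validation loop can be sketched like this; the rules, tier names, and callables are illustrative assumptions, not a real routing API:

```python
TIERS = ["gpt-4o-mini", "gpt-4o", "claude-opus"]  # cheapest first

def classify(request):
    """Zero-cost rule-based tier pick: coarse but free (rules illustrative)."""
    if len(request) < 200 and "```" not in request:
        return 0
    if any(k in request.lower() for k in ("prove", "multi-step", "analyze")):
        return 2
    return 1

def route(request, call_model, validate):
    """Try the classified tier; on failed validation, upgrade and retry."""
    for tier in range(classify(request), len(TIERS)):
        answer = call_model(TIERS[tier], request)
        if validate(answer):
            return TIERS[tier], answer
    return TIERS[-1], answer   # best effort from the strongest model
```

Logging which tier each request finally lands on gives you exactly the per-task upgrade rate the monitoring section calls for.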


Q11. A system processes 1 million API calls per day and costs are growing 20% month-over-month. Design a comprehensive cost control strategy.

Why interviewers ask: Tests strategic thinking at scale — combining technical optimization with process and governance.

Model answer:

At 1M calls/day, every optimization multiplies. I'd attack this on four fronts:

1. Visibility (Week 1). You can't optimize what you can't see. Instrument every API call with: model, tokens (input/output), cost, feature, user_id, latency. Build dashboards showing: cost by feature, cost by model, cost per user, token efficiency (useful output per dollar). Identify the top 3 cost drivers — they're usually 80% of the bill.

2. Immediate wins (Weeks 2-3).

  • Prompt audit: Review the top 5 features' system prompts. Typical savings: 30-50% token reduction without quality loss. At 1M calls, even 500 tokens saved = 500M tokens/day.
  • Enable prompt caching: System prompts are sent 1M times/day. Caching saves 50-90% on those tokens.
  • Set appropriate max_tokens: Audit actual output lengths. If p95 output is 300 tokens but max_tokens is 4096, reduce it to 500. This doesn't save money directly (you pay for actual tokens) but prevents runaway responses.
  • Batch where possible: Background processing tasks (classification, extraction) should batch 5-10 items per call.

3. Model routing (Weeks 3-4). Implement the tiered routing system. Analyze the 1M calls: typically 60-70% are simple enough for mini/flash models. Even routing just 50% to GPT-4o-mini saves ~90% on those requests.

4. Governance (Ongoing).

  • Per-feature budgets: Each team gets a monthly token budget. Alerts at 80%, hard caps at 100%.
  • Cost-per-user limits: Prevent power users from consuming disproportionate resources.
  • Approval process for new features: Every new LLM-powered feature must include a cost estimate based on projected volume.
  • Quarterly optimization reviews: Re-evaluate model routing, prompt efficiency, and caching effectiveness.

Expected impact: 50-70% cost reduction within 4-6 weeks. The 20% month-over-month growth should flatten or reverse.


Quick-fire

| # | Question | One-line answer |
| --- | --- | --- |
| 1 | System message goes in which position? | First in the messages array (or top-level param for Anthropic) |
| 2 | Are LLM APIs stateful or stateless? | Stateless — send full history every call |
| 3 | What does max_tokens limit? | Output tokens only, not input |
| 4 | finish_reason: "length" means? | Response was truncated — hit max_tokens |
| 5 | Output costs ___ more than input? | 3-5x more per token |
| 6 | HTTP 429 means? | Rate limited — too many requests |
| 7 | Should you retry a 401 error? | No — fix your API key |
| 8 | What is jitter in backoff? | Random delay to prevent thundering herd |
| 9 | Few-shot examples use which role? | assistant (and user) messages |
| 10 | Circuit breaker OPEN state does what? | Blocks all requests (fast fail) |

← Back to 4.2 — Calling LLM APIs Properly (README)