Episode 4 — Generative AI Engineering / 4.2 — Calling LLM APIs Properly
4.2 — Calling LLM APIs Properly: Quick Revision
Compact cheat sheet. Print-friendly.
How to use this material
- Skim before labs or interviews.
- Drill gaps — reopen README.md → 4.2.a…4.2.d.
- Practice — 4.2-Exercise-Questions.md.
- Polish answers — 4.2-Interview-Questions.md.
Core Vocabulary
| Term | One-liner |
|---|---|
| system role | Message that sets persona, rules, format, and safety constraints |
| user role | Human input or application-constructed prompt |
| assistant role | Model responses; also used for few-shot examples |
| messages array | Ordered list of {role, content} objects sent to the API |
| max_tokens | Controls maximum output tokens (not input) |
| finish_reason | API response field: "stop" (complete) or "length" (truncated) |
| context window | Max tokens (input + output) per API call |
| token budget | Planned allocation of tokens across prompt components |
| RPM / TPM | Requests Per Minute / Tokens Per Minute (rate limits) |
| 429 | HTTP status: Too Many Requests (rate limited) |
| exponential backoff | Retry with increasing delays: 1s, 2s, 4s, 8s... |
| jitter | Random delay added to backoff to prevent thundering herd |
| circuit breaker | Pattern that stops requests to a failing service |
| prompt caching | Provider feature that discounts repeated prompt prefixes |
| model routing | Directing requests to cheap/expensive models by complexity |
Message Roles Summary
messages: [
{ role: "system", content: "..." }, ← Instructions, persona, rules (first)
{ role: "user", content: "..." }, ← Few-shot input example
{ role: "assistant", content: "..." }, ← Few-shot output example
{ role: "user", content: "..." }, ← Conversation turn 1
{ role: "assistant", content: "..." }, ← Model's response turn 1
{ role: "user", content: "..." } ← Current question (model responds to this)
]
OpenAI vs Anthropic
// OpenAI — system inside messages array
{ messages: [{ role: "system", content: "..." }, { role: "user", content: "..." }] }
// Anthropic — system is a top-level parameter
{ system: "...", messages: [{ role: "user", content: "..." }] }
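The structural difference above can be bridged with a small helper when moving a prompt between providers. This is an illustrative sketch; the function name and shape are mine, not part of either SDK:

```javascript
// Convert an OpenAI-style messages array (system inside the array)
// into Anthropic's shape (system as a top-level string).
function toAnthropicShape(openaiMessages) {
  const system = openaiMessages
    .filter((m) => m.role === "system")
    .map((m) => m.content)
    .join("\n");
  const messages = openaiMessages.filter((m) => m.role !== "system");
  return { system, messages };
}
```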
System prompt best practices
DO: Use clear sections (PERSONA, RULES, FORMAT, SCOPE)
DO: Use imperative language ("Respond in JSON." not "It would be nice if...")
DO: Keep under 800 tokens for most apps
DON'T: Contradict yourself ("Be detailed" + "Under 50 words")
DON'T: Put instructions in user messages
DON'T: Send two user messages in a row
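A sketch of a sectioned system prompt following the DOs above; the persona and rules are illustrative, not from the source:

```javascript
// Illustrative sectioned system prompt; the headings mirror the
// PERSONA / RULES / FORMAT / SCOPE structure recommended above.
const systemPrompt = `
PERSONA: You are a concise support assistant for an internet provider.
RULES:
- Respond in JSON.
- Never invent account details.
FORMAT: { "answer": string, "needs_human": boolean }
SCOPE: Billing and connectivity questions only; refuse anything else.
`.trim();
```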
API Call Template (JavaScript)
// OpenAI
import OpenAI from 'openai';
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY, timeout: 30000 });
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [
{ role: "system", content: systemPrompt },
...conversationHistory,
{ role: "user", content: userMessage }
],
max_tokens: 1000,
temperature: 0.7
});
const text = response.choices[0].message.content;
const finishReason = response.choices[0].finish_reason; // "stop" or "length"
const usage = response.usage; // { prompt_tokens, completion_tokens, total_tokens }
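A cheap guard against silent truncation: check finish_reason before trusting the text. A minimal sketch; the helper name is mine:

```javascript
// Inspect an OpenAI-style chat completion and flag truncated output.
// Returns the text plus a `truncated` flag the caller can act on
// (e.g. retry with a higher max_tokens, or warn the user).
function readCompletion(response) {
  const choice = response.choices[0];
  return {
    text: choice.message.content,
    truncated: choice.finish_reason === "length",
  };
}
```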
Token Budget Formula
available_output = context_window - input_tokens - safety_margin
input_tokens = system_prompt + few_shot_examples + conversation_history
+ rag_documents + user_message + special_token_overhead
Budget template (128K window):
System prompt: 1,500 (fixed)
Few-shot examples: 1,000 (fixed)
Output reservation: 4,096 (fixed)
Safety margin: 1,000 (fixed)
Current message: 500 (variable)
Conversation history: 8,000 (trim if needed)
RAG documents: 111,904 (fill remaining budget with ranked chunks)
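The budget arithmetic above can be encoded directly. A minimal sketch assuming a 128,000-token window and the fixed reservations listed; the function and parameter names are mine:

```javascript
// Compute how many tokens remain for RAG documents after the fixed
// reservations and variable inputs are subtracted from the window.
function ragBudget({ contextWindow = 128000, systemPrompt = 1500,
                     fewShot = 1000, outputReservation = 4096,
                     safetyMargin = 1000, currentMessage = 500,
                     history = 8000 } = {}) {
  return contextWindow - systemPrompt - fewShot - outputReservation
       - safetyMargin - currentMessage - history;
}
```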
Overflow strategies
1. Sliding window — drop oldest messages (simple, loses early context)
2. Summarization — summarize old messages with cheap model (preserves info, costs extra)
3. Document truncation — include fewer/shorter RAG chunks
4. Pre-validation — count tokens BEFORE sending, reject if over budget
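Strategy 1 (sliding window) can be sketched as follows. The token counter is injected because real counting needs a tokenizer (e.g. tiktoken); the function names are mine:

```javascript
// Drop the oldest messages until the history fits the budget.
// `countTokens` maps one message to its token count — injected so
// any tokenizer (or a test stub) can be used.
function slidingWindow(messages, maxTokens, countTokens) {
  const kept = [...messages];
  let total = kept.reduce((sum, m) => sum + countTokens(m), 0);
  while (kept.length > 1 && total > maxTokens) {
    total -= countTokens(kept.shift()); // remove oldest first
  }
  return kept;
}
```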
Cost Formulas
Cost per request = (input_tokens x input_price/1M) + (output_tokens x output_price/1M)
Daily cost = requests_per_day x cost_per_request
System prompt tax = system_tokens x daily_calls x input_price / 1,000,000
Pricing reference (per 1M tokens)
| Model | Input | Output |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3 Haiku | $0.25 | $1.25 |
| Gemini 1.5 Flash | $0.075 | $0.30 |
Quick cost estimates
1K input + 300 output on GPT-4o:
(1000 x $2.50 + 300 x $10.00) / 1,000,000 = $0.0055/request
x 100K calls/day = $550/day = $16,500/month
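The estimate above can be reproduced with a small function (prices passed in per 1M tokens; the function name is mine):

```javascript
// Cost of one request, with prices expressed per 1M tokens,
// matching the pricing table above.
function requestCost(inputTokens, outputTokens, inputPricePerM, outputPricePerM) {
  return (inputTokens * inputPricePerM + outputTokens * outputPricePerM) / 1e6;
}
```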
Cost optimization levers
1. Model routing — send simple tasks to mini/flash (50-70% savings)
2. Prompt caching — discount on repeated prefixes (20-30% savings)
3. Prompt compression — shorter system prompts (10-20% savings)
4. Batching — group items per call (5-10% savings)
5. Output limits — tighter max_tokens + concise instructions
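Lever 1 (model routing) in its simplest form is a threshold on an estimated complexity signal. The heuristic below is purely illustrative — real routers use classifiers or task metadata:

```javascript
// Route by a crude complexity heuristic: prompts that ask for
// reasoning, or very long prompts, go to the expensive model;
// everything else goes to the cheap one. Signals are illustrative.
function routeModel(prompt) {
  const needsReasoning = /\b(why|explain|analyze|compare)\b/i.test(prompt);
  const long = prompt.length > 2000;
  return needsReasoning || long ? "gpt-4o" : "gpt-4o-mini";
}
```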
Rate Limit Handling
429 Too Many Requests → Exponential backoff + jitter
Attempt 0: immediate (first try)
Attempt 1: wait ~1.0-2.0s
Attempt 2: wait ~2.0-3.0s
Attempt 3: wait ~4.0-5.0s
(cap at 60s max delay)
delay = min(baseDelay * 2^(attempt - 1) + random(0, baseDelay), maxDelay)   // attempt ≥ 1
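The schedule above as a function. Retry attempts are 1-indexed (attempt 0 is the original call), and the random source is injectable so tests are deterministic; names are mine:

```javascript
// Delay before retry `attempt` (1, 2, 3, ...): exponential growth
// from baseMs, plus jitter in [0, baseMs), capped at maxMs.
// `rng` returns a float in [0, 1); injectable for deterministic tests.
function backoffDelay(attempt, baseMs = 1000, maxMs = 60000, rng = Math.random) {
  const exp = baseMs * 2 ** (attempt - 1);
  return Math.min(exp + rng() * baseMs, maxMs);
}
```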
Which errors to retry
RETRY: 429 (rate limit), 500, 502, 503, 529
DON'T RETRY: 400, 401, 403, 404, 422
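The classification above as a predicate (529 is the overload status some providers return; the function name is mine):

```javascript
// Retry transient server and rate-limit errors; never retry
// client errors — those need a code fix, not another attempt.
const RETRYABLE = new Set([429, 500, 502, 503, 529]);
function isRetryable(status) {
  return RETRYABLE.has(status);
}
```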
Response headers to check
retry-after → seconds to wait
x-ratelimit-remaining-requests → RPM remaining
x-ratelimit-remaining-tokens → TPM remaining
Circuit Breaker States
CLOSED ──(failures > threshold)──→ OPEN ──(cooldown elapsed)──→ HALF_OPEN
HALF_OPEN ──(test request succeeds)──→ CLOSED
HALF_OPEN ──(test request fails)──→ OPEN
CLOSED: Normal operation, requests pass through
OPEN: All requests blocked (fast fail), use fallback model
HALF_OPEN: Allow ONE test request to check recovery
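A minimal breaker implementing the three states. The clock is injected so the cooldown is testable; the class shape is mine — a sketch, not a production implementation:

```javascript
// Minimal circuit breaker: trips OPEN after `threshold` consecutive
// failures, moves to HALF_OPEN once `cooldownMs` has elapsed, and a
// single success there closes it again. `now` is injectable for tests.
class CircuitBreaker {
  constructor({ threshold = 3, cooldownMs = 30000, now = Date.now } = {}) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
    this.failures = 0;
    this.state = "CLOSED";
    this.openedAt = 0;
  }
  canRequest() {
    if (this.state === "OPEN" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "HALF_OPEN"; // allow one test request through
    }
    return this.state !== "OPEN";
  }
  recordSuccess() {
    this.failures = 0;
    this.state = "CLOSED";
  }
  recordFailure() {
    this.failures += 1;
    if (this.state === "HALF_OPEN" || this.failures >= this.threshold) {
      this.state = "OPEN";
      this.openedAt = this.now();
    }
  }
}
```

When the breaker is OPEN, the checklist's fallback model is the natural place to send traffic instead of fast-failing the user.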
Production Checklist
[ ] Set timeout on API client (10-60s based on use case)
[ ] Implement exponential backoff + jitter for 429/5xx
[ ] Respect retry-after header
[ ] Don't retry 400/401/403/404
[ ] Set max_tokens explicitly
[ ] Check finish_reason for truncation
[ ] Limit concurrent requests (p-limit or semaphore)
[ ] Implement circuit breaker for persistent failures
[ ] Log: model, tokens, latency, status, cost, feature
[ ] Alert on error rate > 1%, truncation > 2%, cost > budget
[ ] Have fallback model ready
[ ] Validate response structure before using
[ ] Count tokens before sending to prevent overflow
[ ] Track cost per feature, per model, per user
[ ] Pin model versions in production
Common Gotchas
| Gotcha | Why |
|---|---|
| API is stateless | Must send full conversation history every call |
| max_tokens limits output only | Does not cap input tokens |
| finish_reason: "length" = truncated | User gets incomplete answer |
| System prompt sent every request | Compounds cost at scale |
| Output tokens cost 3-5x more | Optimize response length for cost |
| 429 can be RPM or TPM | Small frequent calls hit RPM; large calls hit TPM |
| No timeout = app hangs | LLM calls can take 60+ seconds |
| Retrying 400 errors = waste | Client errors need code fixes, not retries |
| Two user messages in a row | Some APIs reject; others behave unexpectedly |
End of 4.2 quick revision.