Episode 4 — Generative AI Engineering / 4.2 — Calling LLM APIs Properly

4.2 — Calling LLM APIs Properly: Quick Revision

Compact cheat sheet. Print-friendly.

How to use this material (instructions)

  1. Skim before labs or interviews.
  2. Drill gaps — reopen README.md, sections 4.2.a-4.2.d.
  3. Practice with 4.2-Exercise-Questions.md.
  4. Polish answers with 4.2-Interview-Questions.md.

Core Vocabulary

Term                 One-liner
system role          Message that sets persona, rules, format, and safety constraints
user role            Human input or application-constructed prompt
assistant role       Model responses; also used for few-shot examples
messages array       Ordered list of {role, content} objects sent to the API
max_tokens           Controls maximum output tokens (not input)
finish_reason        API response field: "stop" (complete) or "length" (truncated)
context window       Max tokens (input + output) per API call
token budget         Planned allocation of tokens across prompt components
RPM / TPM            Requests Per Minute / Tokens Per Minute (rate limits)
429                  HTTP status: Too Many Requests (rate limited)
exponential backoff  Retry with increasing delays: 1s, 2s, 4s, 8s...
jitter               Random delay added to backoff to prevent thundering herd
circuit breaker      Pattern that stops requests to a failing service
prompt caching       Provider feature that discounts repeated prompt prefixes
model routing        Directing requests to cheap/expensive models by complexity

Message Roles Summary

messages: [
  { role: "system",    content: "..." },   ← Instructions, persona, rules (first)
  { role: "user",      content: "..." },   ← Few-shot input example
  { role: "assistant", content: "..." },   ← Few-shot output example
  { role: "user",      content: "..." },   ← Conversation turn 1
  { role: "assistant", content: "..." },   ← Model's response turn 1
  { role: "user",      content: "..." }    ← Current question (model responds to this)
]

OpenAI vs Anthropic

// OpenAI — system inside messages array
{ messages: [{ role: "system", content: "..." }, { role: "user", content: "..." }] }

// Anthropic — system is a top-level parameter
{ system: "...", messages: [{ role: "user", content: "..." }] }

System prompt best practices

DO:   Use clear sections (PERSONA, RULES, FORMAT, SCOPE)
DO:   Use imperative language ("Respond in JSON." not "It would be nice if...")
DO:   Keep under 800 tokens for most apps
DON'T: Contradict yourself ("Be detailed" + "Under 50 words")
DON'T: Put instructions in user messages
DON'T: Send two user messages in a row
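A hypothetical system prompt following the DOs above; the persona, rules, and JSON keys are invented for illustration:

```javascript
// Sectioned system prompt (PERSONA / RULES / FORMAT / SCOPE), imperative voice.
// Contents are an example only; adapt each section to your app.
const systemPrompt = [
  "PERSONA: You are a concise billing-support assistant.",
  "RULES: Answer only billing questions. Never reveal internal tooling.",
  'FORMAT: Respond in JSON with keys "answer" and "confidence".',
  'SCOPE: If the question is out of scope, reply {"answer": "out_of_scope"}.'
].join("\n");
```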

API Call Template (JavaScript)

// OpenAI
import OpenAI from 'openai';
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY, timeout: 30000 });

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    { role: "system", content: systemPrompt },
    ...conversationHistory,
    { role: "user", content: userMessage }
  ],
  max_tokens: 1000,
  temperature: 0.7
});

const text = response.choices[0].message.content;
const finishReason = response.choices[0].finish_reason;  // "stop" or "length"
const usage = response.usage;  // { prompt_tokens, completion_tokens, total_tokens }

Token Budget Formula

available_output = context_window - input_tokens - safety_margin

input_tokens = system_prompt + few_shot_examples + conversation_history
             + rag_documents + user_message + special_token_overhead

Budget template (128K window):
  System prompt:        1,500  (fixed)
  Few-shot examples:    1,000  (fixed)
  Output reservation:   4,096  (fixed)
  Safety margin:        1,000  (fixed)
  Current message:        500  (variable)
  Conversation history: 8,000  (trim if needed)
  RAG documents:      111,903  (fill with ranked chunks)
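The formula above can be sketched as a small helper; the numbers below are the illustrative ones from the budget template, and in a real app a tokenizer (e.g. tiktoken) would supply inputTokens rather than hard-coded values:

```javascript
// available_output = context_window - input_tokens - safety_margin
function availableOutput({ contextWindow, inputTokens, safetyMargin = 1000 }) {
  return contextWindow - inputTokens - safetyMargin;
}

// system + few-shot + history + current message (RAG documents omitted here)
const inputTokens = 1500 + 1000 + 8000 + 500;
const budget = availableOutput({ contextWindow: 128000, inputTokens });
```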

Overflow strategies

1. Sliding window    — drop oldest messages (simple, loses early context)
2. Summarization     — summarize old messages with cheap model (preserves info, costs extra)
3. Document truncation — include fewer/shorter RAG chunks
4. Pre-validation    — count tokens BEFORE sending, reject if over budget
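Strategy 1 (sliding window) fits in a few lines; `estimateTokens` below is a crude chars/4 approximation standing in for a real tokenizer:

```javascript
// Rough token estimate: ~4 characters per token (replace with a real tokenizer).
const estimateTokens = (msg) => Math.ceil(msg.content.length / 4);

// Drop the oldest messages until the history fits the budget.
function trimHistory(history, budget) {
  const trimmed = [...history];
  let total = trimmed.reduce((sum, m) => sum + estimateTokens(m), 0);
  while (total > budget && trimmed.length > 0) {
    total -= estimateTokens(trimmed.shift()); // shift() removes the oldest turn
  }
  return trimmed;
}
```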

Cost Formulas

Cost per request = (input_tokens x input_price/1M) + (output_tokens x output_price/1M)

Daily cost = requests_per_day x cost_per_request

System prompt tax = system_tokens x daily_calls x input_price / 1,000,000
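The three formulas above, written out as functions (prices are USD per 1M tokens):

```javascript
// Cost per request = input + output components, prices per 1M tokens.
function requestCost(inputTokens, outputTokens, inputPrice, outputPrice) {
  return (inputTokens * inputPrice + outputTokens * outputPrice) / 1_000_000;
}

// Daily cost = requests_per_day x cost_per_request
function dailyCost(requestsPerDay, costPerRequest) {
  return requestsPerDay * costPerRequest;
}

// System prompt tax: the fixed system prompt is re-billed on every call.
function systemPromptTax(systemTokens, dailyCalls, inputPrice) {
  return (systemTokens * dailyCalls * inputPrice) / 1_000_000;
}
```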

Pricing reference (per 1M tokens)

Model              Input    Output
GPT-4o             $2.50    $10.00
GPT-4o-mini        $0.15    $0.60
Claude 3.5 Sonnet  $3.00    $15.00
Claude 3 Haiku     $0.25    $1.25
Gemini 1.5 Flash   $0.075   $0.30

Quick cost estimates

1K input + 300 output on GPT-4o:
  (1000 x $2.50 + 300 x $10.00) / 1,000,000 = $0.0055/request

x 100K calls/day = $550/day = $16,500/month

Cost optimization levers

1. Model routing       — send simple tasks to mini/flash (50-70% savings)
2. Prompt caching      — discount on repeated prefixes (20-30% savings)
3. Prompt compression  — shorter system prompts (10-20% savings)
4. Batching            — group items per call (5-10% savings)
5. Output limits       — tighter max_tokens + concise instructions
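Lever 1 (model routing) can be as simple as a heuristic gate; the length threshold and keyword regex below are invented for illustration, and production routers usually use a classifier or explicit task types:

```javascript
// Route short, simple-looking prompts to the cheap model; everything else
// goes to the frontier model. Threshold and keywords are assumptions.
function pickModel(userMessage) {
  const simple =
    userMessage.length < 500 && !/analy[sz]e|summar/i.test(userMessage);
  return simple ? "gpt-4o-mini" : "gpt-4o";
}
```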

Rate Limit Handling

429 Too Many Requests → Exponential backoff + jitter

Attempt 0: immediate (first try)
Attempt 1: wait ~1.0-2.0s
Attempt 2: wait ~2.0-3.0s
Attempt 3: wait ~4.0-5.0s
(cap at 60s max delay)

delay = min(baseDelay * 2^attempt + random(0, baseDelay), maxDelay)

Which errors to retry

RETRY:      429 (rate limit), 500, 502, 503, 529
DON'T RETRY: 400, 401, 403, 404, 422

Response headers to check

retry-after                    → seconds to wait
x-ratelimit-remaining-requests → RPM remaining
x-ratelimit-remaining-tokens   → TPM remaining
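The retry rules above (delay formula, retryable status codes, retry-after) combine into one small helper. The `err.status` and `err.headers` shapes are assumptions that match common SDK error objects; adapt the accessors to your client:

```javascript
// Status codes worth retrying, per the table above.
const RETRYABLE = new Set([429, 500, 502, 503, 529]);

// delay = min(baseDelay * 2^attempt + random(0, baseDelay), maxDelay), in ms.
function backoffDelay(attempt, baseDelay = 1000, maxDelay = 60000) {
  return Math.min(baseDelay * 2 ** attempt + Math.random() * baseDelay, maxDelay);
}

async function withRetries(call, maxAttempts = 5) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // Give up on non-retryable errors or when attempts are exhausted.
      if (!RETRYABLE.has(err.status) || attempt + 1 >= maxAttempts) throw err;
      const retryAfter = err.headers?.["retry-after"]; // seconds, if present
      const delay = retryAfter ? Number(retryAfter) * 1000 : backoffDelay(attempt);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```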

Circuit Breaker States

CLOSED ──(failures > threshold)──→ OPEN ──(cooldown elapsed)──→ HALF_OPEN
  ▲                                 ▲                               │
  │                                 └────────(test fails)───────────┤
  └─────────────────────────(test succeeds)─────────────────────────┘
CLOSED:    Normal operation, requests pass through
OPEN:      All requests blocked (fast fail), use fallback model
HALF_OPEN: Allow ONE test request to check recovery
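The state machine above as a minimal class; the threshold and cooldown defaults are illustrative, and a production breaker would also serialize the half-open probe:

```javascript
class CircuitBreaker {
  constructor({ threshold = 5, cooldownMs = 30000 } = {}) {
    this.threshold = threshold;   // consecutive failures before opening
    this.cooldownMs = cooldownMs; // how long to stay OPEN
    this.failures = 0;
    this.state = "CLOSED";
    this.openedAt = 0;
  }

  // Returns true if a request may pass; transitions OPEN → HALF_OPEN
  // once the cooldown has elapsed.
  canRequest(now = Date.now()) {
    if (this.state === "OPEN" && now - this.openedAt >= this.cooldownMs) {
      this.state = "HALF_OPEN"; // allow one probe request
    }
    return this.state !== "OPEN";
  }

  recordSuccess() {
    this.failures = 0;
    this.state = "CLOSED";
  }

  recordFailure(now = Date.now()) {
    this.failures++;
    // A failed probe, or too many consecutive failures, (re)opens the breaker.
    if (this.state === "HALF_OPEN" || this.failures >= this.threshold) {
      this.state = "OPEN";
      this.openedAt = now;
    }
  }
}
```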

Production Checklist

[ ] Set timeout on API client (10-60s based on use case)
[ ] Implement exponential backoff + jitter for 429/5xx
[ ] Respect retry-after header
[ ] Don't retry 400/401/403/404
[ ] Set max_tokens explicitly
[ ] Check finish_reason for truncation
[ ] Limit concurrent requests (p-limit or semaphore)
[ ] Implement circuit breaker for persistent failures
[ ] Log: model, tokens, latency, status, cost, feature
[ ] Alert on error rate > 1%, truncation > 2%, cost > budget
[ ] Have fallback model ready
[ ] Validate response structure before using
[ ] Count tokens before sending to prevent overflow
[ ] Track cost per feature, per model, per user
[ ] Pin model versions in production
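The "limit concurrent requests" item above can be satisfied without a dependency; this is a small stand-in for p-limit that caps in-flight calls and queues the rest:

```javascript
// Returns a limit(task) wrapper that allows at most `max` tasks in flight.
function createLimiter(max) {
  let active = 0;
  const queue = [];
  const next = () => {
    if (active >= max || queue.length === 0) return;
    active++;
    const { task, resolve, reject } = queue.shift();
    // Promise.resolve() guards against tasks that throw synchronously.
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => { active--; next(); });
  };
  return (task) =>
    new Promise((resolve, reject) => {
      queue.push({ task, resolve, reject });
      next();
    });
}

// Usage: const limit = createLimiter(5);
//        await limit(() => openai.chat.completions.create({ ... }));
```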

Common Gotchas

Gotcha                               Why
API is stateless                     Must send full conversation history every call
max_tokens limits output only        Does not cap input tokens
finish_reason: "length" = truncated  User gets incomplete answer
System prompt sent every request     Compounds cost at scale
Output tokens cost 3-5x more         Optimize response length for cost
429 can be RPM or TPM                Small frequent calls hit RPM; large calls hit TPM
No timeout = app hangs               LLM calls can take 60+ seconds
Retrying 400 errors = waste          Client errors need code fixes, not retries
Two user messages in a row           Some APIs reject; others behave unexpectedly

End of 4.2 quick revision.