Episode 4 — Generative AI Engineering / 4.2 — Calling LLM APIs Properly
4.2.c — Cost Awareness
In one sentence: LLM APIs charge per token — with output tokens costing 3-5x more than input tokens — so understanding pricing models, calculating costs accurately, and applying optimization strategies (caching, compression, model routing) is essential for building AI products that don't bankrupt you at scale.
Navigation: ← 4.2.b — Token Budgeting · 4.2.d — Rate Limits and Retries →
1. How LLM Pricing Works
LLM providers charge based on tokens processed, split into two categories:
- Input tokens — everything you send (system prompt, history, documents, user message)
- Output tokens — everything the model generates (the response)
Output tokens are always more expensive than input tokens because generation requires sequential computation (one token at a time), while input processing is parallelized.
Cost = (input_tokens x input_price_per_token) + (output_tokens x output_price_per_token)
Example:
Input: 2,000 tokens x ($2.50 / 1,000,000) = $0.005
Output: 500 tokens x ($10.00 / 1,000,000) = $0.005
Total per request: $0.01
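The formula is easy to wrap in a helper. A minimal sketch (model names and prices are the illustrative early-2025 figures used in this section; always verify against the provider's current price sheet):

```javascript
// Illustrative per-1M-token prices; verify against the provider's price sheet.
const PRICES = {
  'gpt-4o': { input: 2.50, output: 10.00 },
  'gpt-4o-mini': { input: 0.15, output: 0.60 }
};

// Dollar cost of one request under the formula above.
function estimateCost(model, inputTokens, outputTokens) {
  const p = PRICES[model];
  if (!p) throw new Error(`No pricing entry for model: ${model}`);
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}

console.log(estimateCost('gpt-4o', 2000, 500)); // 0.01
```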
2. Current Pricing Table (Major Models)
Prices are per 1 million tokens (as of early 2025 — always check provider websites for latest pricing):
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window | Output Multiplier |
|---|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 128K | 4x |
| GPT-4o-mini | $0.15 | $0.60 | 128K | 4x |
| GPT-4 Turbo | $10.00 | $30.00 | 128K | 3x |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200K | 5x |
| Claude 3 Haiku | $0.25 | $1.25 | 200K | 5x |
| Claude 3 Opus | $15.00 | $75.00 | 200K | 5x |
| Gemini 1.5 Pro | $1.25 | $5.00 | 2M | 4x |
| Gemini 1.5 Flash | $0.075 | $0.30 | 1M | 4x |
| Llama 3.1 70B (via API) | $0.50-0.90 | $0.50-0.90 | 128K | 1x |
Key observations:
- Output is 3-5x more expensive than input for the major proprietary models (hosted open-source endpoints often charge the same rate for both)
- "Mini" and "Flash" models are 10-20x cheaper than flagship models
- Open-source models via hosted APIs are cheapest but vary by provider
- Prices have been dropping rapidly: GPT-4o input is roughly 90% cheaper than the original GPT-4 ($30 per 1M input tokens at launch)
3. Cost Calculation: Worked Examples
Example 1: Simple chatbot request
System prompt: 200 tokens (input)
User message: 100 tokens (input)
Model response: 300 tokens (output)
Using GPT-4o ($2.50 / $10.00 per 1M tokens):
Input cost: 300 tokens (200 + 100) x $2.50 / 1,000,000 = $0.00075
Output cost: 300 tokens x $10.00 / 1,000,000 = $0.003
Total: $0.00375 per request
Example 2: Multi-turn conversation (10 turns deep)
System prompt: 200 tokens
Conversation history (9 turns): 4,500 tokens (avg 250 tokens each x 2 x 9)
Current user message: 150 tokens
Model response: 400 tokens
Using GPT-4o:
Input: 4,850 tokens x $2.50 / 1,000,000 = $0.012125
Output: 400 tokens x $10.00 / 1,000,000 = $0.004
Total: $0.016125 per request
NOTE: Turn 10 costs over 3x more than Turn 1 (~$0.016 vs ~$0.005) because of accumulated history!
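The compounding is easy to reproduce with a quick sketch (token counts mirror the averages assumed in the example above):

```javascript
// GPT-4o list prices per 1M tokens (illustrative; check current pricing).
const INPUT_PRICE = 2.50;
const OUTPUT_PRICE = 10.00;

// Cost of turn N in a conversation where every prior turn is resent as input.
// Assumes: 200-token system prompt, prior turns averaging ~250 tokens per
// message (2 messages per turn), 150-token user message, 400-token reply.
function turnCost(turn) {
  const system = 200, userMsg = 150, reply = 400;
  const history = (turn - 1) * 500; // 2 messages x ~250 tokens per prior turn
  const inputTokens = system + history + userMsg;
  return (inputTokens * INPUT_PRICE + reply * OUTPUT_PRICE) / 1_000_000;
}

console.log(turnCost(1).toFixed(6));  // "0.004875"
console.log(turnCost(10).toFixed(6)); // "0.016125"
```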
Example 3: RAG application
System prompt: 1,500 tokens
Retrieved documents: 15,000 tokens (5 chunks x 3,000 tokens each)
User question: 100 tokens
Model response: 800 tokens
Using Claude 3.5 Sonnet ($3.00 / $15.00 per 1M tokens):
Input: 16,600 tokens x $3.00 / 1,000,000 = $0.0498
Output: 800 tokens x $15.00 / 1,000,000 = $0.012
Total: $0.0618 per request
4. Cost at Scale: Why Pennies Become Thousands
Individual API calls seem cheap. At scale, costs compound dramatically.
Daily volume scenarios (using GPT-4o)
| Scenario | Calls/Day | Avg Input | Avg Output | Daily Cost | Monthly Cost |
|---|---|---|---|---|---|
| Small chatbot | 1,000 | 500 | 300 | $4.25 | $127.50 |
| Medium SaaS feature | 10,000 | 1,500 | 500 | $87.50 | $2,625 |
| High-traffic app | 100,000 | 2,000 | 600 | $1,100 | $33,000 |
| Enterprise platform | 1,000,000 | 3,000 | 800 | $15,500 | $465,000 |
Small chatbot:
1,000 x (500 x $2.50/1M + 300 x $10.00/1M) = $1.25 + $3.00 = $4.25/day
Enterprise platform:
1,000,000 x (3,000 x $2.50/1M + 800 x $10.00/1M) = $7,500 + $8,000 = $15,500/day
= $465,000/month!
The system prompt tax
Your system prompt is sent with every single request. At scale, this fixed cost adds up:
System prompt: 1,500 tokens
Daily API calls: 100,000
System prompt cost alone (GPT-4o input):
1,500 x 100,000 = 150,000,000 input tokens/day
150M x $2.50/1M = $375/day = $11,250/month
Optimize to 500 tokens:
500 x 100,000 = 50,000,000 input tokens/day
50M x $2.50/1M = $125/day = $3,750/month
Savings: $7,500/month from ONE optimization
5. Why Bad Prompts Waste Money
Verbose instructions
// BAD: ~75 tokens of instructions
const expensivePrompt = `I would really appreciate it if you could please
take a moment to carefully analyze the following text and then provide me
with a comprehensive yet concise summary that captures all of the key
points and main ideas while keeping the summary short enough to be easily
digestible for a busy reader who doesn't have time for long summaries.`;
// GOOD: ~15 tokens, same result
const cheapPrompt = `Summarize in 3 bullet points:`;
Savings: ~60 tokens per request. At 100K calls/day = 6M tokens saved = $15/day = $450/month.
Unnecessary context
// BAD: Sending full document when only a section is relevant
const messages = [
{ role: "system", content: "Answer questions about the document." },
{ role: "user", content: `${fullDocument}\n\nQuestion: What is the return policy?` }
// fullDocument = 50,000 tokens, but return policy is in ONE paragraph
];
// GOOD: Send only the relevant section
const messages = [
{ role: "system", content: "Answer questions about the document." },
{ role: "user", content: `${relevantSection}\n\nQuestion: What is the return policy?` }
// relevantSection = 500 tokens
];
Savings: 49,500 tokens per request. At GPT-4o input pricing: $0.124/request saved.
Redundant conversation history
// BAD: Sending entire 50-turn history for a simple follow-up
// 50 turns x ~500 tokens = 25,000 tokens of history
// GOOD: Summarize old history, keep last 5 turns
// Summary: 300 tokens + 5 turns x 500 = 2,800 tokens
// Savings: 22,200 tokens per request
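A minimal sketch of the keep-recent-turns-plus-summary pattern. The `summarize` argument is a placeholder for a summarization request to a cheap model, not a specific SDK call:

```javascript
// Keep the last N turns verbatim and fold everything older into one summary
// message. `summarize` stands in for a call to a cheap model (e.g. a
// mini/flash tier); it receives the old messages and returns a short string.
async function compactHistory(messages, summarize, keepTurns = 5) {
  const keep = keepTurns * 2; // one turn = one user + one assistant message
  if (messages.length <= keep) return messages;
  const old = messages.slice(0, messages.length - keep);
  const recent = messages.slice(messages.length - keep);
  const summary = await summarize(old);
  return [
    { role: "system", content: `Summary of earlier conversation: ${summary}` },
    ...recent
  ];
}
```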
6. Cost Optimization Strategies
Strategy 1: Prompt caching
Many providers now support prompt caching — if you send the same system prompt repeatedly, the provider caches the KV computation and charges a reduced rate.
// Anthropic prompt caching
const response = await anthropic.messages.create({
model: "claude-3-5-sonnet-20241022",
max_tokens: 1024,
system: [
{
type: "text",
text: longSystemPrompt,
cache_control: { type: "ephemeral" } // Cache this!
}
],
messages: [{ role: "user", content: userMessage }]
});
// First call: full price
// Subsequent calls (within cache TTL): ~90% cheaper for cached tokens
| Provider | Cache Discount | Cache TTL |
|---|---|---|
| Anthropic | 90% off cached input | 5 minutes |
| OpenAI | 50% off cached input | ~5-10 minutes |
| Google (Gemini) | 75% off cached input | Configurable |
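Rough savings math for the Anthropic case. This sketch assumes cache reads are billed at 0.1x the base input price and cache writes at a 1.25x premium (the published rates at time of writing; verify both multipliers against current docs):

```javascript
// Daily cost of a repeated system prompt, cached vs uncached.
// Assumed rates: cache reads at 0.1x base input price, cache writes at 1.25x.
function cachedPromptCost({ promptTokens, callsPerDay, pricePerM, hitRate = 0.95 }) {
  const perCall = (promptTokens / 1_000_000) * pricePerM;
  const uncached = callsPerDay * perCall;
  const cached = callsPerDay *
    (hitRate * perCall * 0.10 + (1 - hitRate) * perCall * 1.25);
  return { uncached, cached, savings: uncached - cached };
}

const r = cachedPromptCost({ promptTokens: 1500, callsPerDay: 100_000, pricePerM: 3.00 });
// At a 95% hit rate, the cached prompt costs roughly 16% of the uncached one
// (~$71/day vs $450/day at Claude 3.5 Sonnet input pricing).
```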
Strategy 2: Model routing (use cheaper models when possible)
async function routeToModel(userMessage) {
// Simple heuristic: short, simple questions go to cheap model
const isSimple = userMessage.length < 200 && !userMessage.includes('code');
if (isSimple) {
// GPT-4o-mini: $0.15 / $0.60 per 1M tokens
return openai.chat.completions.create({
model: "gpt-4o-mini",
messages: [{ role: "user", content: userMessage }],
max_tokens: 500
});
} else {
// GPT-4o: $2.50 / $10.00 per 1M tokens (17x more expensive)
return openai.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: userMessage }],
max_tokens: 2000
});
}
}
Advanced routing uses a classifier (could be another cheap LLM call or a rule-based system) to decide which model handles each request:
┌──────────┐ ┌──────────────┐ ┌──────────────┐
│ User │────▶│ Router │────▶│ GPT-4o-mini │ 70% of requests ($0.15/1M)
│ Request │ │ (classify) │ └──────────────┘
└──────────┘ │ │ ┌──────────────┐
│ │────▶│ GPT-4o │ 25% of requests ($2.50/1M)
│ │ └──────────────┘
│ │ ┌──────────────┐
│ │────▶│ Claude Opus │ 5% of requests ($15.00/1M)
└──────────────┘ └──────────────┘
Blended cost: much lower than routing everything to the best model
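The blended input price for the split in the diagram can be checked directly (the 70/25/5 split is the illustrative one above, not a recommendation):

```javascript
// Weighted-average input price (per 1M tokens) for a traffic split.
function blendedPrice(routes) {
  return routes.reduce((sum, r) => sum + r.share * r.pricePerM, 0);
}

const blended = blendedPrice([
  { model: 'gpt-4o-mini', share: 0.70, pricePerM: 0.15 },
  { model: 'gpt-4o', share: 0.25, pricePerM: 2.50 },
  { model: 'claude-3-opus', share: 0.05, pricePerM: 15.00 }
]);
console.log(blended.toFixed(2)); // "1.48" vs 15.00 if everything went to the premium model
```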
Strategy 3: Prompt compression
Reduce token usage by making prompts more efficient without losing meaning:
// Before compression: ~80 tokens
const verbose = `Please analyze the following customer review and determine
whether the overall sentiment expressed by the customer is positive,
negative, or neutral. Consider the tone, specific words used, and the
overall impression the customer conveys in their review.`;
// After compression: ~20 tokens
const compressed = `Classify sentiment as positive/negative/neutral:`;
// Savings: 60 tokens/request x 100K requests/day = 6M tokens/day = $15/day
Strategy 4: Batching API calls
Process multiple items in a single API call instead of one call per item:
// BAD: 100 separate API calls (100x overhead from system prompt)
for (const review of reviews) {
await openai.chat.completions.create({
messages: [
{ role: "system", content: classificationPrompt }, // 500 tokens, sent 100 times
{ role: "user", content: review }
]
});
}
// Total system prompt tokens: 500 x 100 = 50,000
// GOOD: Batch 10 reviews per call (10 calls instead of 100)
const chunk = (arr, size) => Array.from({ length: Math.ceil(arr.length / size) }, (_, i) => arr.slice(i * size, (i + 1) * size));
const batches = chunk(reviews, 10);
for (const batch of batches) {
await openai.chat.completions.create({
messages: [
{ role: "system", content: classificationPrompt }, // 500 tokens, sent 10 times
{ role: "user", content: `Classify each review:\n${batch.map((r, i) => `${i+1}. ${r}`).join('\n')}` }
]
});
}
// Total system prompt tokens: 500 x 10 = 5,000 (90% reduction)
Strategy 5: Response length control
// BAD: No length constraint — model may generate 2000 tokens when 200 suffice
{ max_tokens: 4096, messages: [...] }
// GOOD: Constrain output length in both the prompt and max_tokens
{
max_tokens: 300,
messages: [
{ role: "system", content: "Answer in 1-2 sentences maximum." },
{ role: "user", content: question }
]
}
7. Cost Tracking Implementation
class CostTracker {
constructor() {
this.calls = [];
this.pricing = {
'gpt-4o': { input: 2.50, output: 10.00 },
'gpt-4o-mini': { input: 0.15, output: 0.60 },
'claude-3-5-sonnet': { input: 3.00, output: 15.00 },
'claude-3-haiku': { input: 0.25, output: 1.25 }
};
}
recordCall(model, inputTokens, outputTokens, metadata = {}) {
const prices = this.pricing[model];
if (!prices) throw new Error(`Unknown model: ${model}`);
const inputCost = (inputTokens / 1_000_000) * prices.input;
const outputCost = (outputTokens / 1_000_000) * prices.output;
const record = {
timestamp: new Date().toISOString(),
model,
inputTokens,
outputTokens,
inputCost,
outputCost,
totalCost: inputCost + outputCost,
...metadata
};
this.calls.push(record);
return record;
}
getDailySummary() {
const today = new Date().toISOString().split('T')[0];
const todayCalls = this.calls.filter(c => c.timestamp.startsWith(today));
return {
totalCalls: todayCalls.length,
totalCost: todayCalls.reduce((sum, c) => sum + c.totalCost, 0).toFixed(4),
totalInputTokens: todayCalls.reduce((sum, c) => sum + c.inputTokens, 0),
totalOutputTokens: todayCalls.reduce((sum, c) => sum + c.outputTokens, 0),
avgCostPerCall: todayCalls.length ? (todayCalls.reduce((sum, c) => sum + c.totalCost, 0) / todayCalls.length).toFixed(6) : '0',
byModel: this.#groupByModel(todayCalls)
};
}
#groupByModel(calls) {
const groups = {};
for (const call of calls) {
if (!groups[call.model]) groups[call.model] = { calls: 0, cost: 0 };
groups[call.model].calls++;
groups[call.model].cost += call.totalCost;
}
return groups;
}
}
// Usage
const tracker = new CostTracker();
const response = await openai.chat.completions.create({ ... });
tracker.recordCall(
'gpt-4o',
response.usage.prompt_tokens,
response.usage.completion_tokens,
{ userId: 'user123', feature: 'chat' }
);
console.log(tracker.getDailySummary());
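One natural extension (a sketch, not part of any SDK) is gating each request on today's spend, with a hard stop at the budget and a fallback to the cheap tier before that:

```javascript
// Pick a model based on today's spend: hard-stop at the daily budget, and
// degrade to the cheap tier past 80% of it. Thresholds and model names
// are illustrative.
function chooseModel(spentTodayUSD, dailyBudgetUSD) {
  if (spentTodayUSD >= dailyBudgetUSD) {
    throw new Error('Daily LLM budget exhausted');
  }
  return spentTodayUSD >= 0.8 * dailyBudgetUSD ? 'gpt-4o-mini' : 'gpt-4o';
}
```

Feed it `Number(tracker.getDailySummary().totalCost)` before each call.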
8. When to Use Cheaper Models
Not every task requires the most powerful (and expensive) model. Here's a decision framework:
| Task | Recommended Model Tier | Why |
|---|---|---|
| Classification (sentiment, category) | Mini/Flash | Simple pattern matching |
| Short factual Q&A | Mini/Flash | Well within capabilities |
| JSON extraction | Mini/Flash to Standard | Depends on schema complexity |
| Summarization | Standard | Needs good language understanding |
| Code generation | Standard to Premium | Depends on complexity |
| Complex reasoning | Premium (GPT-4o, Opus) | Needs maximum capability |
| Multi-step analysis | Premium | Chain-of-thought requires capacity |
| Creative writing | Standard | Good enough for most cases |
Cost comparison for classification task
10,000 classification requests/day
Average: 500 input tokens, 10 output tokens per request
GPT-4o: 10K x (500 x $2.50/1M + 10 x $10.00/1M) = $12.50 + $1.00 = $13.50/day
GPT-4o-mini: 10K x (500 x $0.15/1M + 10 x $0.60/1M) = $0.75 + $0.06 = $0.81/day
Savings: $12.69/day = $381/month by using mini for classification
Accuracy difference: typically < 2% for simple classification tasks
9. Real-World Cost Example: Production AI SaaS
A SaaS company running an AI-powered customer support system:
Feature breakdown:
Auto-categorization: 50,000 calls/day → GPT-4o-mini ($38/day)
Response drafting: 20,000 calls/day → GPT-4o ($220/day)
Sentiment analysis: 50,000 calls/day → GPT-4o-mini ($38/day)
Complex escalation: 5,000 calls/day → Claude Opus ($150/day)
Summarization: 10,000 calls/day → GPT-4o ($110/day)
Daily total: $556/day
Monthly total: $16,680/month
BEFORE optimization (everything on GPT-4o):
135,000 calls/day x avg $0.015/call = $2,025/day = $60,750/month
AFTER model routing:
$556/day = $16,680/month
Savings: $44,070/month (72% reduction!)
10. Key Takeaways
- Output tokens cost 3-5x more than input tokens — optimizing response length is high-impact.
- System prompts are a hidden tax — they're sent with every request. At 100K calls/day, a 1,000-token prompt costs $250/day on GPT-4o.
- Use cheaper models for simple tasks — GPT-4o-mini at $0.15/1M input is 17x cheaper than GPT-4o. For classification and extraction, accuracy is often comparable.
- Prompt caching can reduce input costs by 50-90% for repeated system prompts.
- Track costs per feature, per model, per user — you can't optimize what you don't measure.
- Batching reduces the per-request overhead of system prompts.
Explain-It Challenge
- A startup founder says "LLM API calls are only fractions of a cent — cost isn't worth worrying about." Using a concrete scenario with 50,000 daily users, show them why they're wrong.
- Your company's AI feature costs $30,000/month. Propose three specific optimizations and estimate the savings from each.
- Why are output tokens more expensive than input tokens? Explain the technical reason (hint: think about how the model processes input in parallel vs generating output sequentially).
Navigation: ← 4.2.b — Token Budgeting · 4.2.d — Rate Limits and Retries →