Episode 4 — Generative AI Engineering / 4.2 — Calling LLM APIs Properly
4.2.c — Cost Awareness
In one sentence: LLM APIs charge per token — with output tokens costing 3-5x more than input tokens — so understanding pricing models, calculating costs accurately, and applying optimization strategies (caching, compression, model routing) is essential for building AI products that don't bankrupt you at scale.
Navigation: ← 4.2.b — Token Budgeting · 4.2.d — Rate Limits and Retries →
1. How LLM Pricing Works
LLM providers charge based on tokens processed, split into two categories:
- Input tokens — everything you send (system prompt, history, documents, user message)
- Output tokens — everything the model generates (the response)
Output tokens are always more expensive than input tokens because generation requires sequential computation (one token at a time), while input processing is parallelized.
Cost = (input_tokens x input_price_per_token) + (output_tokens x output_price_per_token)
Example:
Input: 2,000 tokens x ($2.50 / 1,000,000) = $0.005
Output: 500 tokens x ($10.00 / 1,000,000) = $0.005
Total per request: $0.01
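The formula is easy to wrap in a helper. A minimal sketch (model names and prices are the illustrative early-2025 figures used in this section; always verify against the provider's current price sheet):

```javascript
// Illustrative per-1M-token prices; verify against the provider's price sheet.
const PRICES = {
  'gpt-4o': { input: 2.50, output: 10.00 },
  'gpt-4o-mini': { input: 0.15, output: 0.60 }
};

// Dollar cost of one request under the formula above.
function estimateCost(model, inputTokens, outputTokens) {
  const p = PRICES[model];
  if (!p) throw new Error(`No pricing entry for model: ${model}`);
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}

console.log(estimateCost('gpt-4o', 2000, 500)); // 0.01
```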
2. Current Pricing Table (Major Models)
Prices are per 1 million tokens (as of early 2025 — always check provider websites for latest pricing):
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window | Output Multiplier |
|---|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 128K | 4x |
| GPT-4o-mini | $0.15 | $0.60 | 128K | 4x |
| GPT-4 Turbo | $10.00 | $30.00 | 128K | 3x |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200K | 5x |
| Claude 3 Haiku | $0.25 | $1.25 | 200K | 5x |
| Claude 3 Opus | $15.00 | $75.00 | 200K | 5x |
| Gemini 1.5 Pro | $1.25 | $5.00 | 2M | 4x |
| Gemini 1.5 Flash | $0.075 | $0.30 | 1M | 4x |
| Llama 3.1 70B (via API) | $0.50-0.90 | $0.50-0.90 | 128K | 1x |
Key observations:
- Output is 3-5x more expensive than input for the major proprietary models (hosted open-source endpoints often charge the same rate for both)
- "Mini" and "Flash" models are 10-20x cheaper than flagship models
- Open-source models via hosted APIs are cheapest but vary by provider
- Prices have been dropping rapidly: GPT-4o input is roughly 90% cheaper than the original GPT-4 ($30 per 1M input tokens at launch)
3. Cost Calculation: Worked Examples
Example 1: Simple chatbot request
System prompt: 200 tokens (input)
User message: 100 tokens (input)
Model response: 300 tokens (output)
Using GPT-4o ($2.50 / $10.00 per 1M tokens):
Input cost: 300 tokens (200 + 100) x $2.50 / 1,000,000 = $0.00075
Output cost: 300 tokens x $10.00 / 1,000,000 = $0.003
Total: $0.00375 per request
Example 2: Multi-turn conversation (10 turns deep)
System prompt: 200 tokens
Conversation history (9 turns): 4,500 tokens (avg 250 tokens each x 2 x 9)
Current user message: 150 tokens
Model response: 400 tokens
Using GPT-4o:
Input: 4,850 tokens x $2.50 / 1,000,000 = $0.012125
Output: 400 tokens x $10.00 / 1,000,000 = $0.004
Total: $0.016125 per request
NOTE: Turn 10 costs over 3x more than Turn 1 (~$0.016 vs ~$0.005) because of accumulated history!
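The compounding is easy to reproduce with a quick sketch (token counts mirror the averages assumed in the example above):

```javascript
// GPT-4o list prices per 1M tokens (illustrative; check current pricing).
const INPUT_PRICE = 2.50;
const OUTPUT_PRICE = 10.00;

// Cost of turn N in a conversation where every prior turn is resent as input.
// Assumes: 200-token system prompt, prior turns averaging ~250 tokens per
// message (2 messages per turn), 150-token user message, 400-token reply.
function turnCost(turn) {
  const system = 200, userMsg = 150, reply = 400;
  const history = (turn - 1) * 500; // 2 messages x ~250 tokens per prior turn
  const inputTokens = system + history + userMsg;
  return (inputTokens * INPUT_PRICE + reply * OUTPUT_PRICE) / 1_000_000;
}

console.log(turnCost(1).toFixed(6));  // "0.004875"
console.log(turnCost(10).toFixed(6)); // "0.016125"
```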
Example 3: RAG application
System prompt: 1,500 tokens
Retrieved documents: 15,000 tokens (5 chunks x 3,000 tokens each)
User question: 100 tokens
Model response: 800 tokens
Using Claude 3.5 Sonnet ($3.00 / $15.00 per 1M tokens):
Input: 16,600 tokens x $3.00 / 1,000,000 = $0.0498
Output: 800 tokens x $15.00 / 1,000,000 = $0.012
Total: $0.0618 per request
4. Cost at Scale: Why Pennies Become Thousands
Individual API calls seem cheap. At scale, costs compound dramatically.
Daily volume scenarios (using GPT-4o)
| Scenario | Calls/Day | Avg Input | Avg Output | Daily Cost | Monthly Cost |
|---|---|---|---|---|---|
| Small chatbot | 1,000 | 500 | 300 | $4.25 | $127.50 |
| Medium SaaS feature | 10,000 | 1,500 | 500 | $87.50 | $2,625 |
| High-traffic app | 100,000 | 2,000 | 600 | $1,100 | $33,000 |
| Enterprise platform | 1,000,000 | 3,000 | 800 | $15,500 | $465,000 |
Small chatbot:
1,000 x (500 x $2.50/1M + 300 x $10.00/1M) = $1.25 + $3.00 = $4.25/day
Enterprise platform:
1,000,000 x (3,000 x $2.50/1M + 800 x $10.00/1M) = $7,500 + $8,000 = $15,500/day
= $465,000/month!
The system prompt tax
Your system prompt is sent with every single request. At scale, this fixed cost adds up:
System prompt: 1,500 tokens
Daily API calls: 100,000
System prompt cost alone (GPT-4o input):
1,500 x 100,000 = 150,000,000 input tokens/day
150M x $2.50/1M = $375/day = $11,250/month
Optimize to 500 tokens:
500 x 100,000 = 50,000,000 input tokens/day
50M x $2.50/1M = $125/day = $3,750/month
Savings: $7,500/month from ONE optimization
5. Why Bad Prompts Waste Money
Verbose instructions
// BAD: ~75 tokens of instructions
const expensivePrompt = `I would really appreciate it if you could please
take a moment to carefully analyze the following text and then provide me
with a comprehensive yet concise summary that captures all of the key
points and main ideas while keeping the summary short enough to be easily
digestible for a busy reader who doesn't have time for long summaries.`;
// GOOD: ~15 tokens, same result
const cheapPrompt = `Summarize in 3 bullet points:`;
Savings: ~60 tokens per request. At 100K calls/day = 6M tokens saved = $15/day = $450/month.
Unnecessary context
// BAD: Sending full document when only a section is relevant
const messages = [
{ role: "system", content: "Answer questions about the document." },
{ role: "user", content: `${fullDocument}\n\nQuestion: What is the return policy?` }
// fullDocument = 50,000 tokens, but return policy is in ONE paragraph
];
// GOOD: Send only the relevant section
const messages = [
{ role: "system", content: "Answer questions about the document." },
{ role: "user", content: `${relevantSection}\n\nQuestion: What is the return policy?` }
// relevantSection = 500 tokens
];
Savings: 49,500 tokens per request. At GPT-4o input pricing: $0.124/request saved.
Redundant conversation history
// BAD: Sending entire 50-turn history for a simple follow-up
// 50 turns x ~500 tokens = 25,000 tokens of history
// GOOD: Summarize old history, keep last 5 turns
// Summary: 300 tokens + 5 turns x 500 = 2,800 tokens
// Savings: 22,200 tokens per request
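A minimal sketch of the keep-recent-turns-plus-summary pattern. The `summarize` argument is a placeholder for a summarization request to a cheap model, not a specific SDK call:

```javascript
// Keep the last N turns verbatim and fold everything older into one summary
// message. `summarize` stands in for a call to a cheap model (e.g. a
// mini/flash tier); it receives the old messages and returns a short string.
async function compactHistory(messages, summarize, keepTurns = 5) {
  const keep = keepTurns * 2; // one turn = one user + one assistant message
  if (messages.length <= keep) return messages;
  const old = messages.slice(0, messages.length - keep);
  const recent = messages.slice(messages.length - keep);
  const summary = await summarize(old);
  return [
    { role: "system", content: `Summary of earlier conversation: ${summary}` },
    ...recent
  ];
}
```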
6. Cost Optimization Strategies
Strategy 1: Prompt caching
Many providers now support prompt caching — if you send the same system prompt repeatedly, the provider caches the KV computation and charges a reduced rate.
// Anthropic prompt caching
const response = await anthropic.messages.create({
model: "claude-3-5-sonnet-20241022",
max_tokens: 1024,
system: [
{
type: "text",
text: longSystemPrompt,
cache_control: { type: "ephemeral" } // Cache this!
}
],
messages: [{ role: "user", content: userMessage }]
});
// First call: full price
// Subsequent calls (within cache TTL): ~90% cheaper for cached tokens
| Provider | Cache Discount | Cache TTL |
|---|---|---|
| Anthropic | 90% off cached input | 5 minutes |
| OpenAI | 50% off cached input | ~5-10 minutes |
| Google (Gemini) | 75% off cached input | Configurable |
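Rough savings math for the Anthropic case. This sketch assumes cache reads are billed at 0.1x the base input price and cache writes at a 1.25x premium (the published rates at time of writing; verify both multipliers against current docs):

```javascript
// Daily cost of a repeated system prompt, cached vs uncached.
// Assumed rates: cache reads at 0.1x base input price, cache writes at 1.25x.
function cachedPromptCost({ promptTokens, callsPerDay, pricePerM, hitRate = 0.95 }) {
  const perCall = (promptTokens / 1_000_000) * pricePerM;
  const uncached = callsPerDay * perCall;
  const cached = callsPerDay *
    (hitRate * perCall * 0.10 + (1 - hitRate) * perCall * 1.25);
  return { uncached, cached, savings: uncached - cached };
}

const r = cachedPromptCost({ promptTokens: 1500, callsPerDay: 100_000, pricePerM: 3.00 });
// At a 95% hit rate, the cached prompt costs roughly 16% of the uncached one
// (~$71/day vs $450/day at Claude 3.5 Sonnet input pricing).
```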
Strategy 2: Model routing (use cheaper models when possible)
async function routeToModel(userMessage) {
// Simple heuristic: short, simple questions go to cheap model
const isSimple = userMessage.length < 200 && !userMessage.includes('code');
if (isSimple) {
// GPT-4o-mini: $0.15 / $0.60 per 1M tokens
return openai.chat.completions.create({
model: "gpt-4o-mini",
messages: [{ role: "user", content: userMessage }],
max_tokens: 500
});
} else {
// GPT-4o: $2.50 / $10.00 per 1M tokens (17x more expensive)
return openai.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: userMessage }],
max_tokens: 2000
});
}
}
Advanced routing uses a classifier (could be another cheap LLM call or a rule-based system) to decide which model handles each request:
┌──────────┐ ┌──────────────┐ ┌──────────────┐
│ User │────▶│ Router │────▶│ GPT-4o-mini │ 70% of requests ($0.15/1M)
│ Request │ │ (classify) │ └──────────────┘
└──────────┘ │ │ ┌──────────────┐
│ │────▶│ GPT-4o │ 25% of requests ($2.50/1M)
│ │ └──────────────┘
│ │ ┌──────────────┐
│ │────▶│ Claude Opus │ 5% of requests ($15.00/1M)
└──────────────┘ └──────────────┘
Blended cost: much lower than routing everything to the best model
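The blended input price for the split in the diagram can be checked directly (the 70/25/5 split is the illustrative one above, not a recommendation):

```javascript
// Weighted-average input price (per 1M tokens) for a traffic split.
function blendedPrice(routes) {
  return routes.reduce((sum, r) => sum + r.share * r.pricePerM, 0);
}

const blended = blendedPrice([
  { model: 'gpt-4o-mini', share: 0.70, pricePerM: 0.15 },
  { model: 'gpt-4o', share: 0.25, pricePerM: 2.50 },
  { model: 'claude-3-opus', share: 0.05, pricePerM: 15.00 }
]);
console.log(blended.toFixed(2)); // "1.48" vs 15.00 if everything went to the premium model
```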
Strategy 3: Prompt compression
Reduce token usage by making prompts more efficient without losing meaning:
// Before compression: ~80 tokens
const verbose = `Please analyze the following customer review and determine
whether the overall sentiment expressed by the customer is positive,
negative, or neutral. Consider the tone, specific words used, and the
overall impression the customer conveys in their review.`;
// After compression: ~20 tokens
const compressed = `Classify sentiment as positive/negative/neutral:`;
// Savings: 60 tokens/request x 100K requests/day = 6M tokens/day = $15/day
Strategy 4: Batching API calls
Process multiple items in a single API call instead of one call per item:
// BAD: 100 separate API calls (100x overhead from system prompt)
for (const review of reviews) {
await openai.chat.completions.create({
messages: [
{ role: "system", content: classificationPrompt }, // 500 tokens, sent 100 times
{ role: "user", content: review }
]
});
}
// Total system prompt tokens: 500 x 100 = 50,000
// GOOD: Batch 10 reviews per call (10 calls instead of 100)
const chunk = (arr, size) => Array.from({ length: Math.ceil(arr.length / size) }, (_, i) => arr.slice(i * size, (i + 1) * size));
const batches = chunk(reviews, 10);
for (const batch of batches) {
await openai.chat.completions.create({
messages: [
{ role: "system", content: classificationPrompt }, // 500 tokens, sent 10 times
{ role: "user", content: `Classify each review:\n${batch.map((r, i) => `${i+1}. ${r}`).join('\n')}` }
]
});
}
// Total system prompt tokens: 500 x 10 = 5,000 (90% reduction)
Strategy 5: Response length control
// BAD: No length constraint — model may generate 2000 tokens when 200 suffice
{ max_tokens: 4096, messages: [...] }
// GOOD: Constrain output length in both the prompt and max_tokens
{
max_tokens: 300,
messages: [
{ role: "system", content: "Answer in 1-2 sentences maximum." },
{ role: "user", content: question }
]
}
7. Cost Tracking Implementation
class CostTracker {
constructor() {
this.calls = [];
this.pricing = {
'gpt-4o': { input: 2.50, output: 10.00 },
'gpt-4o-mini': { input: 0.15, output: 0.60 },
'claude-3-5-sonnet': { input: 3.00, output: 15.00 },
'claude-3-haiku': { input: 0.25, output: 1.25 }
};
}
recordCall(model, inputTokens, outputTokens, metadata = {}) {
const prices = this.pricing[model];
if (!prices) throw new Error(`Unknown model: ${model}`);
const inputCost = (inputTokens / 1_000_000) * prices.input;
const outputCost = (outputTokens / 1_000_000) * prices.output;
const record = {
timestamp: new Date().toISOString(),
model,
inputTokens,
outputTokens,
inputCost,
outputCost,
totalCost: inputCost + outputCost,
...metadata
};
this.calls.push(record);
return record;
}
getDailySummary() {
const today = new Date().toISOString().split('T')[0];
const todayCalls = this.calls.filter(c => c.timestamp.startsWith(today));
return {
totalCalls: todayCalls.length,
totalCost: todayCalls.reduce((sum, c) => sum + c.totalCost, 0).toFixed(4),
totalInputTokens: todayCalls.reduce((sum, c) => sum + c.inputTokens, 0),
totalOutputTokens: todayCalls.reduce((sum, c) => sum + c.outputTokens, 0),
avgCostPerCall: todayCalls.length ? (todayCalls.reduce((sum, c) => sum + c.totalCost, 0) / todayCalls.length).toFixed(6) : '0',
byModel: this.#groupByModel(todayCalls)
};
}
#groupByModel(calls) {
const groups = {};
for (const call of calls) {
if (!groups[call.model]) groups[call.model] = { calls: 0, cost: 0 };
groups[call.model].calls++;
groups[call.model].cost += call.totalCost;
}
return groups;
}
}
// Usage
const tracker = new CostTracker();
const response = await openai.chat.completions.create({ ... });
tracker.recordCall(
'gpt-4o',
response.usage.prompt_tokens,
response.usage.completion_tokens,
{ userId: 'user123', feature: 'chat' }
);
console.log(tracker.getDailySummary());
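One natural extension (a sketch, not part of any SDK) is gating each request on today's spend, with a hard stop at the budget and a fallback to the cheap tier before that:

```javascript
// Pick a model based on today's spend: hard-stop at the daily budget, and
// degrade to the cheap tier past 80% of it. Thresholds and model names
// are illustrative.
function chooseModel(spentTodayUSD, dailyBudgetUSD) {
  if (spentTodayUSD >= dailyBudgetUSD) {
    throw new Error('Daily LLM budget exhausted');
  }
  return spentTodayUSD >= 0.8 * dailyBudgetUSD ? 'gpt-4o-mini' : 'gpt-4o';
}
```

Feed it `Number(tracker.getDailySummary().totalCost)` before each call.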
8. When to Use Cheaper Models
Not every task requires the most powerful (and expensive) model. Here's a decision framework:
| Task | Recommended Model Tier | Why |
|---|---|---|
| Classification (sentiment, category) | Mini/Flash | Simple pattern matching |
| Short factual Q&A | Mini/Flash | Well within capabilities |
| JSON extraction | Mini/Flash to Standard | Depends on schema complexity |
| Summarization | Standard | Needs good language understanding |
| Code generation | Standard to Premium | Depends on complexity |
| Complex reasoning | Premium (GPT-4o, Opus) | Needs maximum capability |
| Multi-step analysis | Premium | Chain-of-thought requires capacity |
| Creative writing | Standard | Good enough for most cases |
Cost comparison for classification task
10,000 classification requests/day
Average: 500 input tokens, 10 output tokens per request
GPT-4o: 10K x (500 x $2.50/1M + 10 x $10.00/1M) = $12.50 + $1.00 = $13.50/day
GPT-4o-mini: 10K x (500 x $0.15/1M + 10 x $0.60/1M) = $0.75 + $0.06 = $0.81/day
Savings: $12.69/day = $381/month by using mini for classification
Accuracy difference: typically < 2% for simple classification tasks
9. Real-World Cost Example: Production AI SaaS
A SaaS company running an AI-powered customer support system:
Feature breakdown:
Auto-categorization: 50,000 calls/day → GPT-4o-mini ($38/day)
Response drafting: 20,000 calls/day → GPT-4o ($220/day)
Sentiment analysis: 50,000 calls/day → GPT-4o-mini ($38/day)
Complex escalation: 5,000 calls/day → Claude Opus ($150/day)
Summarization: 10,000 calls/day → GPT-4o ($110/day)
Daily total: $556/day
Monthly total: $16,680/month
BEFORE optimization (everything on GPT-4o):
135,000 calls/day x avg $0.015/call = $2,025/day = $60,750/month
AFTER model routing:
$556/day = $16,680/month
Savings: $44,070/month (72% reduction!)
10. Key Takeaways
- Output tokens cost 3-5x more than input tokens — optimizing response length is high-impact.
- System prompts are a hidden tax — they're sent with every request. At 100K calls/day, a 1,000-token prompt costs $250/day on GPT-4o.
- Use cheaper models for simple tasks — GPT-4o-mini at $0.15/1M input is 17x cheaper than GPT-4o. For classification and extraction, accuracy is often comparable.
- Prompt caching can reduce input costs by 50-90% for repeated system prompts.
- Track costs per feature, per model, per user — you can't optimize what you don't measure.
- Batching reduces the per-request overhead of system prompts.
Explain-It Challenge
- A startup founder says "LLM API calls are only fractions of a cent — cost isn't worth worrying about." Using a concrete scenario with 50,000 daily users, show them why they're wrong.
- Your company's AI feature costs $30,000/month. Propose three specific optimizations and estimate the savings from each.
- Why are output tokens more expensive than input tokens? Explain the technical reason (hint: think about how the model processes input in parallel vs generating output sequentially).
Navigation: ← 4.2.b — Token Budgeting · 4.2.d — Rate Limits and Retries →