Episode 4 — Generative AI Engineering / 4.2 — Calling LLM APIs Properly

4.2.c — Cost Awareness

In one sentence: LLM APIs charge per token — with output tokens costing 3-5x more than input tokens — so understanding pricing models, calculating costs accurately, and applying optimization strategies (caching, compression, model routing) are essential for building AI products that don't bankrupt you at scale.

Navigation: ← 4.2.b — Token Budgeting · 4.2.d — Rate Limits and Retries →


1. How LLM Pricing Works

LLM providers charge based on tokens processed, split into two categories:

  • Input tokens — everything you send (system prompt, history, documents, user message)
  • Output tokens — everything the model generates (the response)

Output tokens are always more expensive than input tokens because generation requires sequential computation (one token at a time), while input processing is parallelized.

Cost = (input_tokens x input_price_per_token) + (output_tokens x output_price_per_token)

Example:
  Input:  2,000 tokens x ($2.50 / 1,000,000) = $0.005
  Output:   500 tokens x ($10.00 / 1,000,000) = $0.005
  Total per request: $0.01
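The formula above is a one-liner worth wrapping in a helper so you can plug in any model's rates. A minimal sketch (the prices and token counts below are the illustrative GPT-4o figures from this section, not live pricing):

```javascript
// Compute the dollar cost of one request from token counts and per-1M-token prices.
function requestCost(inputTokens, outputTokens, inputPricePer1M, outputPricePer1M) {
  const inputCost = (inputTokens / 1_000_000) * inputPricePer1M;
  const outputCost = (outputTokens / 1_000_000) * outputPricePer1M;
  return inputCost + outputCost;
}

// The worked example: 2,000 input + 500 output tokens at $2.50 / $10.00 per 1M
console.log(requestCost(2000, 500, 2.50, 10.00)); // ≈ 0.01
```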

2. Current Pricing Table (Major Models)

Prices are per 1 million tokens (as of early 2025 — always check provider websites for latest pricing):

Model                      Input (per 1M)   Output (per 1M)   Context Window   Output Multiplier
GPT-4o                     $2.50            $10.00            128K             4x
GPT-4o-mini                $0.15            $0.60             128K             4x
GPT-4 Turbo                $10.00           $30.00            128K             3x
Claude 3.5 Sonnet          $3.00            $15.00            200K             5x
Claude 3 Haiku             $0.25            $1.25             200K             5x
Claude 3 Opus              $15.00           $75.00            200K             5x
Gemini 1.5 Pro             $1.25            $5.00             2M               4x
Gemini 1.5 Flash           $0.075           $0.30             1M               4x
Llama 3.1 70B (via API)    $0.50-0.90       $0.50-0.90        128K             1x

Key observations:

  • Output is 3-5x more expensive than input across all providers
  • "Mini" and "Flash" models are 10-20x cheaper than flagship models
  • Open-source models via hosted APIs are cheapest but vary by provider
  • Prices have been dropping rapidly: GPT-4o's input price ($2.50/1M) is over 90% lower than the original GPT-4's ($30/1M)

3. Cost Calculation: Worked Examples

Example 1: Simple chatbot request

System prompt:    200 tokens  (input)
User message:     100 tokens  (input)
Model response:   300 tokens  (output)

Using GPT-4o ($2.50 / $10.00 per 1M tokens):
  Input cost:  300 tokens x $2.50 / 1,000,000 = $0.00075
  Output cost: 300 tokens x $10.00 / 1,000,000 = $0.003
  Total: $0.00375 per request

Example 2: Multi-turn conversation (10 turns deep)

System prompt:                   200 tokens
Conversation history (9 turns): 4,500 tokens  (avg 250 tokens each x 2 x 9)
Current user message:             150 tokens
Model response:                   400 tokens

Using GPT-4o:
  Input:  4,850 tokens x $2.50 / 1,000,000 = $0.012125
  Output:   400 tokens x $10.00 / 1,000,000 = $0.004
  Total: $0.016125 per request

NOTE: Turn 10 costs over 3x as much as Turn 1 (~$0.005), purely because of accumulated history!
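The history growth can be modeled directly. A sketch of per-turn cost, assuming a 200-token system prompt, ~250 tokens per message, and a 400-token response at GPT-4o rates (defaults are illustrative and differ slightly from the worked example, which used a 150-token final message):

```javascript
const INPUT_PRICE = 2.50;   // $ per 1M input tokens (GPT-4o)
const OUTPUT_PRICE = 10.00; // $ per 1M output tokens

// Estimated cost of turn N: every completed prior turn adds one user
// and one assistant message to the history that must be re-sent.
function turnCost(turn, { systemTokens = 200, msgTokens = 250, outputTokens = 400 } = {}) {
  const historyTokens = (turn - 1) * 2 * msgTokens;
  const inputTokens = systemTokens + historyTokens + msgTokens; // + current user message
  return (inputTokens * INPUT_PRICE + outputTokens * OUTPUT_PRICE) / 1_000_000;
}

console.log(turnCost(1).toFixed(6));  // ≈ 0.005125
console.log(turnCost(10).toFixed(6)); // ≈ 0.016375 — over 3x turn 1
```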

Example 3: RAG application

System prompt:         1,500 tokens
Retrieved documents:  15,000 tokens  (5 chunks x 3,000 tokens each)
User question:           100 tokens
Model response:          800 tokens

Using Claude 3.5 Sonnet ($3.00 / $15.00 per 1M tokens):
  Input:  16,600 tokens x $3.00 / 1,000,000 = $0.0498
  Output:    800 tokens x $15.00 / 1,000,000 = $0.012
  Total: $0.0618 per request

4. Cost at Scale: Why Pennies Become Thousands

Individual API calls seem cheap. At scale, costs compound dramatically.

Daily volume scenarios (using GPT-4o)

Scenario               Calls/Day    Avg Input   Avg Output   Daily Cost   Monthly Cost
Small chatbot              1,000         500          300        $4.25        $127.50
Medium SaaS feature       10,000       1,500          500       $87.50      $2,625
High-traffic app         100,000       2,000          600    $1,100        $33,000
Enterprise platform    1,000,000       3,000          800   $15,500       $465,000

Small chatbot:
  1,000 x (500 x $2.50/1M + 300 x $10.00/1M) = $1.25 + $3.00 = $4.25/day

Enterprise platform:
  1,000,000 x (3,000 x $2.50/1M + 800 x $10.00/1M) = $7,500 + $8,000 = $15,500/day
                                                                        = $465,000/month!

The system prompt tax

Your system prompt is sent with every single request. At scale, this fixed cost adds up:

System prompt: 1,500 tokens
Daily API calls: 100,000

System prompt cost alone (GPT-4o input):
  1,500 x 100,000 = 150,000,000 input tokens/day
  150M x $2.50/1M = $375/day = $11,250/month

Optimize to 500 tokens:
  500 x 100,000 = 50,000,000 input tokens/day
  50M x $2.50/1M = $125/day = $3,750/month

Savings: $7,500/month from ONE optimization
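The tax above is simple arithmetic worth keeping as a helper so you can rerun it with your own prompt size and traffic (the default price is the GPT-4o input rate used in this section):

```javascript
// Daily cost of the system prompt alone: promptTokens re-sent on every call.
const systemPromptDailyCost = (promptTokens, callsPerDay, pricePer1M = 2.50) =>
  (promptTokens * callsPerDay / 1_000_000) * pricePer1M;

console.log(systemPromptDailyCost(1500, 100_000)); // 375  ($/day)
console.log(systemPromptDailyCost(500, 100_000));  // 125  ($/day)
```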

5. Why Bad Prompts Waste Money

Verbose instructions

// BAD: 120+ tokens of instructions
const expensivePrompt = `I would really appreciate it if you could please
take a moment to carefully analyze the following text and then provide me
with a comprehensive yet concise summary that captures all of the key
points and main ideas while keeping the summary short enough to be easily
digestible for a busy reader who doesn't have time for long summaries.`;

// GOOD: ~15 tokens, same result
const cheapPrompt = `Summarize in 3 bullet points:`;

Savings: ~105 tokens per request. At 100K calls/day = 10.5M tokens saved = $26/day = $780/month.

Unnecessary context

// BAD: Sending full document when only a section is relevant
const messages = [
  { role: "system", content: "Answer questions about the document." },
  { role: "user", content: `${fullDocument}\n\nQuestion: What is the return policy?` }
  // fullDocument = 50,000 tokens, but return policy is in ONE paragraph
];

// GOOD: Send only the relevant section
const messages = [
  { role: "system", content: "Answer questions about the document." },
  { role: "user", content: `${relevantSection}\n\nQuestion: What is the return policy?` }
  // relevantSection = 500 tokens
];

Savings: 49,500 tokens per request. At GPT-4o input pricing: $0.124/request saved.

Redundant conversation history

// BAD: Sending entire 50-turn history for a simple follow-up
// 50 turns x ~500 tokens = 25,000 tokens of history

// GOOD: Summarize old history, keep last 5 turns
// Summary: 300 tokens + 5 turns x 500 = 2,800 tokens
// Savings: 22,200 tokens per request
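One way to implement the summarize-and-keep-recent pattern sketched above. This is a minimal sketch: `summarize` is a hypothetical helper that would itself call a cheap model (e.g. GPT-4o-mini) to condense the old turns.

```javascript
// Keep the last N turns verbatim; fold everything older into a summary message.
// `summarize` is a placeholder — in practice it would be a cheap LLM call.
async function trimHistory(messages, { keepTurns = 5, summarize } = {}) {
  const keep = keepTurns * 2; // each turn = one user + one assistant message
  if (messages.length <= keep) return messages;

  const old = messages.slice(0, messages.length - keep);
  const recent = messages.slice(-keep);
  const summary = await summarize(old);

  return [
    { role: "system", content: `Summary of earlier conversation: ${summary}` },
    ...recent
  ];
}
```

On a 50-turn conversation with `keepTurns: 5`, this sends ~11 messages instead of 100, matching the token savings estimated above.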

6. Cost Optimization Strategies

Strategy 1: Prompt caching

Many providers now support prompt caching — if you send the same system prompt repeatedly, the provider caches the KV computation and charges a reduced rate.

// Anthropic prompt caching
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: longSystemPrompt,
      cache_control: { type: "ephemeral" }  // Cache this!
    }
  ],
  messages: [{ role: "user", content: userMessage }]
});

// First call: full price
// Subsequent calls (within cache TTL): ~90% cheaper for cached tokens

Provider     Cache Discount            Cache TTL
Anthropic    90% off cached input      5 minutes
OpenAI       50% off cached input      ~5-10 minutes
Google       75% off cached input      Configurable
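To estimate what caching buys you, blend the discounted and full prices by your expected cache hit rate. A sketch (the discount and hit-rate values are assumptions, and cache-write surcharges, which some providers add, are ignored):

```javascript
// Effective per-1M input price for the cached portion of a prompt,
// given the provider's cache discount and the fraction of calls that hit the cache.
function effectiveCachedPrice(basePrice, discount, hitRate) {
  const cachedPrice = basePrice * (1 - discount);
  return hitRate * cachedPrice + (1 - hitRate) * basePrice;
}

// Anthropic-style 90% discount, 95% of calls hitting the cache, $3.00/1M base:
console.log(effectiveCachedPrice(3.00, 0.90, 0.95)); // ≈ 0.435 ($/1M)
```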

Strategy 2: Model routing (use cheaper models when possible)

async function routeToModel(userMessage) {
  // Simple heuristic: short, simple questions go to cheap model
  const isSimple = userMessage.length < 200 && !userMessage.includes('code');

  if (isSimple) {
    // GPT-4o-mini: $0.15 / $0.60 per 1M tokens
    return openai.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: userMessage }],
      max_tokens: 500
    });
  } else {
    // GPT-4o: $2.50 / $10.00 per 1M tokens (17x more expensive)
    return openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: userMessage }],
      max_tokens: 2000
    });
  }
}

Advanced routing uses a classifier (could be another cheap LLM call or a rule-based system) to decide which model handles each request:

┌───────────┐     ┌──────────────┐     ┌───────────────┐
│  User     │────▶│   Router     │────▶│  GPT-4o-mini  │  70% of requests ($0.15/1M)
│  Request  │     │  (classify)  │     └───────────────┘
└───────────┘     │              │     ┌───────────────┐
                  │              │────▶│  GPT-4o       │  25% of requests ($2.50/1M)
                  │              │     └───────────────┘
                  │              │     ┌───────────────┐
                  │              │────▶│  Claude Opus  │  5% of requests ($15.00/1M)
                  └──────────────┘     └───────────────┘

Blended input cost: 0.70 x $0.15 + 0.25 x $2.50 + 0.05 x $15.00 = $1.48 per 1M tokens, far below sending every request to the premium model

Strategy 3: Prompt compression

Reduce token usage by making prompts more efficient without losing meaning:

// Before compression: ~80 tokens
const verbose = `Please analyze the following customer review and determine
whether the overall sentiment expressed by the customer is positive,
negative, or neutral. Consider the tone, specific words used, and the
overall impression the customer conveys in their review.`;

// After compression: ~20 tokens
const compressed = `Classify sentiment as positive/negative/neutral:`;

// Savings: 60 tokens/request x 100K requests/day = 6M tokens/day = $15/day

Strategy 4: Batching API calls

Process multiple items in a single API call instead of one call per item:

// BAD: 100 separate API calls (100x overhead from system prompt)
for (const review of reviews) {
  await openai.chat.completions.create({
    messages: [
      { role: "system", content: classificationPrompt },  // 500 tokens, sent 100 times
      { role: "user", content: review }
    ]
  });
}
// Total system prompt tokens: 500 x 100 = 50,000

// GOOD: Batch 10 reviews per call (10 calls instead of 100)
const batches = chunk(reviews, 10);
for (const batch of batches) {
  await openai.chat.completions.create({
    messages: [
      { role: "system", content: classificationPrompt },  // 500 tokens, sent 10 times
      { role: "user", content: `Classify each review:\n${batch.map((r, i) => `${i+1}. ${r}`).join('\n')}` }
    ]
  });
}
// Total system prompt tokens: 500 x 10 = 5,000 (90% reduction)

Strategy 5: Response length control

// BAD: No length constraint — model may generate 2000 tokens when 200 suffice
{ max_tokens: 4096, messages: [...] }

// GOOD: Constrain output length in both the prompt and max_tokens
{
  max_tokens: 300,
  messages: [
    { role: "system", content: "Answer in 1-2 sentences maximum." },
    { role: "user", content: question }
  ]
}

7. Cost Tracking Implementation

class CostTracker {
  constructor() {
    this.calls = [];
    this.pricing = {
      'gpt-4o':        { input: 2.50, output: 10.00 },
      'gpt-4o-mini':   { input: 0.15, output: 0.60 },
      'claude-3-5-sonnet': { input: 3.00, output: 15.00 },
      'claude-3-haiku':    { input: 0.25, output: 1.25 }
    };
  }

  recordCall(model, inputTokens, outputTokens, metadata = {}) {
    const prices = this.pricing[model];
    if (!prices) throw new Error(`Unknown model: ${model}`);

    const inputCost = (inputTokens / 1_000_000) * prices.input;
    const outputCost = (outputTokens / 1_000_000) * prices.output;

    const record = {
      timestamp: new Date().toISOString(),
      model,
      inputTokens,
      outputTokens,
      inputCost,
      outputCost,
      totalCost: inputCost + outputCost,
      ...metadata
    };

    this.calls.push(record);
    return record;
  }

  getDailySummary() {
    const today = new Date().toISOString().split('T')[0];
    const todayCalls = this.calls.filter(c => c.timestamp.startsWith(today));

    return {
      totalCalls: todayCalls.length,
      totalCost: todayCalls.reduce((sum, c) => sum + c.totalCost, 0).toFixed(4),
      totalInputTokens: todayCalls.reduce((sum, c) => sum + c.inputTokens, 0),
      totalOutputTokens: todayCalls.reduce((sum, c) => sum + c.outputTokens, 0),
      avgCostPerCall: todayCalls.length
        ? (todayCalls.reduce((sum, c) => sum + c.totalCost, 0) / todayCalls.length).toFixed(6)
        : '0.000000',
      byModel: this.#groupByModel(todayCalls)
    };
  }

  #groupByModel(calls) {
    const groups = {};
    for (const call of calls) {
      if (!groups[call.model]) groups[call.model] = { calls: 0, cost: 0 };
      groups[call.model].calls++;
      groups[call.model].cost += call.totalCost;
    }
    return groups;
  }
}

// Usage
const tracker = new CostTracker();

const response = await openai.chat.completions.create({ ... });
tracker.recordCall(
  'gpt-4o',
  response.usage.prompt_tokens,
  response.usage.completion_tokens,
  { userId: 'user123', feature: 'chat' }
);

console.log(tracker.getDailySummary());

8. When to Use Cheaper Models

Not every task requires the most powerful (and expensive) model. Here's a decision framework:

Task                                   Recommended Model Tier    Why
Classification (sentiment, category)   Mini/Flash                Simple pattern matching
Short factual Q&A                      Mini/Flash                Well within capabilities
JSON extraction                        Mini/Flash to Standard    Depends on schema complexity
Summarization                          Standard                  Needs good language understanding
Code generation                        Standard to Premium       Depends on complexity
Complex reasoning                      Premium (GPT-4o, Opus)    Needs maximum capability
Multi-step analysis                    Premium                   Chain-of-thought requires capacity
Creative writing                       Standard                  Good enough for most cases

Cost comparison for classification task

10,000 classification requests/day
Average: 500 input tokens, 10 output tokens per request

GPT-4o:       10K x (500 x $2.50/1M + 10 x $10.00/1M) = $12.50 + $0.10 = $12.60/day
GPT-4o-mini:  10K x (500 x $0.15/1M + 10 x $0.60/1M)  = $0.75 + $0.006 = $0.76/day

Savings: $11.84/day = $355/month by using mini for classification
Accuracy difference: typically < 2% for simple classification tasks

9. Real-World Cost Example: Production AI SaaS

A SaaS company running an AI-powered customer support system:

Feature breakdown:
  Auto-categorization:   50,000 calls/day  → GPT-4o-mini  ($38/day)
  Response drafting:     20,000 calls/day  → GPT-4o       ($220/day)
  Sentiment analysis:    50,000 calls/day  → GPT-4o-mini  ($38/day)
  Complex escalation:     5,000 calls/day  → Claude Opus  ($150/day)
  Summarization:         10,000 calls/day  → GPT-4o       ($110/day)

Daily total:  $556/day
Monthly total: $16,680/month

BEFORE optimization (everything on GPT-4o):
  135,000 calls/day x avg $0.015/call = $2,025/day = $60,750/month

AFTER model routing:
  $556/day = $16,680/month

Savings: $44,070/month (72% reduction!)

10. Key Takeaways

  1. Output tokens cost 3-5x more than input tokens — optimizing response length is high-impact.
  2. System prompts are a hidden tax — they're sent with every request. At 100K calls/day, a 1,000-token prompt costs $250/day on GPT-4o.
  3. Use cheaper models for simple tasks — GPT-4o-mini at $0.15/1M input is 17x cheaper than GPT-4o. For classification and extraction, accuracy is often comparable.
  4. Prompt caching can reduce input costs by 50-90% for repeated system prompts.
  5. Track costs per feature, per model, per user — you can't optimize what you don't measure.
  6. Batching reduces the per-request overhead of system prompts.

Explain-It Challenge

  1. A startup founder says "LLM API calls are only fractions of a cent — cost isn't worth worrying about." Using a concrete scenario with 50,000 daily users, show them why they're wrong.
  2. Your company's AI feature costs $30,000/month. Propose three specific optimizations and estimate the savings from each.
  3. Why are output tokens more expensive than input tokens? Explain the technical reason (hint: think about how the model processes input in parallel vs generating output sequentially).

Navigation: ← 4.2.b — Token Budgeting · 4.2.d — Rate Limits and Retries →