Episode 4 — Generative AI Engineering / 4.1 — How LLMs Actually Work

4.1.e — Deterministic vs Probabilistic Outputs

In one sentence: LLMs are fundamentally probabilistic systems — the same input can produce different outputs — but you can push them toward deterministic behavior using temperature 0 and seed parameters, which is critical for building reliable, testable production systems.

Navigation: ← 4.1.d — Hallucination · 4.1 Overview


1. The Core Distinction

Deterministic systems

A deterministic system always produces the same output for the same input. There is no randomness.

// Deterministic: same input → same output, every time
function add(a, b) {
  return a + b;
}
add(2, 3); // Always 5. Always. Forever.

// Deterministic: database query
SELECT * FROM users WHERE id = 42;
// Returns the same row every time (assuming no changes)

Probabilistic systems

A probabilistic system can produce different outputs for the same input. There is inherent randomness.

// Probabilistic: same prompt → potentially different responses
Prompt: "Write a greeting"

Run 1: "Hello! How can I help you today?"
Run 2: "Hi there! What can I do for you?"
Run 3: "Hey! Welcome — how may I assist you?"
Run 4: "Hello! How can I help you today?"  ← might repeat, might not

Each run samples from a probability distribution.
The output is not guaranteed to be the same.

Why LLMs are inherently probabilistic

The core operation of an LLM is sampling from a probability distribution. Even if "Paris" has 92% probability as the next token after "The capital of France is", the other 8% of the time the model might pick a different token. This randomness is by design — it's what makes language models capable of diverse, natural-sounding text.
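
A toy sketch of this distinction, using a made-up three-token distribution (the real model does this over a vocabulary of ~100k tokens):

```javascript
// Hypothetical next-token distribution after "The capital of France is"
const dist = [
  { token: ' Paris', p: 0.92 },
  { token: ' the',   p: 0.05 },
  { token: ' a',     p: 0.03 },
];

// Greedy decoding: deterministic, always the argmax
function greedy(dist) {
  return dist.reduce((best, t) => (t.p > best.p ? t : best)).token;
}

// Sampling: probabilistic, picks each token proportionally to its probability
function sample(dist) {
  let r = Math.random();
  for (const { token, p } of dist) {
    if (r < p) return token;
    r -= p;
  }
  return dist[dist.length - 1].token;
}

console.log(greedy(dist));  // always " Paris"
console.log(sample(dist));  // usually " Paris", occasionally " the" or " a"
```

Run `sample` a hundred times and you will see " Paris" roughly 92 times; run `greedy` a hundred times and you will see it exactly 100 times.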


2. Making LLMs More Deterministic

Temperature 0

Setting temperature: 0 makes the model use greedy decoding — it always picks the highest-probability token. No sampling occurs.

const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  temperature: 0,      // Greedy: always pick the most likely token
  messages: [{ role: 'user', content: 'What is 2 + 2?' }],
});
// Almost always returns "4" (or "2 + 2 = 4")

Is temperature 0 truly deterministic? Almost, but not 100%. Due to floating-point arithmetic differences across hardware (GPUs process in parallel, and order can vary), there can be extremely rare non-determinism even at temperature 0. In practice, for 99.9%+ of calls, temperature 0 gives the same output.
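
To see why temperature 0 collapses to greedy decoding, here is a toy softmax-with-temperature over made-up logits. Dividing by T = 0 itself would be a division by zero, which is why APIs special-case temperature 0 as "pick the argmax"; low T values show the trend:

```javascript
// Softmax with temperature: divide logits by T before normalizing.
// As T shrinks toward 0, the distribution concentrates on the highest logit.
function softmaxWithTemperature(logits, T) {
  const scaled = logits.map((z) => z / T);
  const max = Math.max(...scaled);                  // subtract max for numerical stability
  const exps = scaled.map((z) => Math.exp(z - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

const logits = [4.0, 2.0, 1.0]; // e.g. "Paris", "the", "a"

console.log(softmaxWithTemperature(logits, 1.0));   // ≈ [0.84, 0.11, 0.04]: sampling has real choices
console.log(softmaxWithTemperature(logits, 0.01));  // ≈ [1, 0, 0]: effectively greedy
```

Higher temperatures do the opposite, flattening the distribution so that unlikely tokens get picked more often.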

The seed parameter (OpenAI)

OpenAI provides a seed parameter for reproducibility:

const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  temperature: 0,
  seed: 42,            // Fixed seed for reproducibility
  messages: [{ role: 'user', content: 'Classify this sentiment: "I love it"' }],
});

// The response includes a system_fingerprint
console.log(response.system_fingerprint); // "fp_abc123..."
// Same seed + same fingerprint = same output (best effort)

Important: The seed provides best-effort determinism. If OpenAI updates the model's infrastructure (system fingerprint changes), the output may change even with the same seed.


3. When You Need Deterministic Outputs

| Use Case | Why Determinism Matters | Settings |
| --- | --- | --- |
| JSON extraction | Downstream code parses the output; different formats break | temp: 0 |
| Classification | "positive"/"negative"/"neutral" must be consistent for same input | temp: 0, seed |
| Testing | Test assertions need predictable outputs | temp: 0, seed |
| Data pipelines | Same document → same extracted fields every time | temp: 0 |
| Caching | If output varies, cache is useless | temp: 0 |
| Compliance/audit | Must be able to reproduce the exact output | temp: 0, seed, log everything |
| Cost estimation | Token count varies with different outputs | temp: 0 |

4. When You Want Probabilistic Outputs

| Use Case | Why Randomness Helps | Settings |
| --- | --- | --- |
| Creative writing | Diversity and novelty are the goal | temp: 0.8-1.2 |
| Brainstorming | Generate many different ideas | temp: 1.0, n: 5 |
| Conversation | Repeating the same response feels robotic | temp: 0.7 |
| A/B testing prompts | Need varied outputs to compare quality | temp: 0.5+ |
| Generating alternatives | "Give me 3 versions of this email" | temp: 0.8 |
| Overcoming local optima | Greedy decoding can miss better phrasings | temp: 0.3-0.5 |

5. The Reproducibility Challenge in Production

Problem: You can't easily debug non-deterministic systems

Bug report: "The AI classified my product as 'electronics' yesterday 
             but 'home appliance' today."

If you used temperature > 0:
  - You can't reproduce the original output
  - You can't verify if it was correct
  - You can't compare old vs new behavior
  - The user's experience is inconsistent

Solution: Log everything

// Production logging for AI calls
async function callLLM(messages, config) {
  const startTime = Date.now();
  
  const response = await openai.chat.completions.create({
    ...config,                      // any extra options first, so the explicit keys win
    model: config.model,
    messages,
    temperature: config.temperature,
    seed: config.seed,
  });

  // Log the COMPLETE request and response
  await logger.info('llm_call', {
    requestId: generateId(),
    timestamp: new Date().toISOString(),
    model: config.model,
    temperature: config.temperature,
    seed: config.seed,
    inputMessages: messages,
    inputTokens: response.usage.prompt_tokens,
    outputTokens: response.usage.completion_tokens,
    output: response.choices[0].message.content,
    systemFingerprint: response.system_fingerprint,
    latencyMs: Date.now() - startTime,
    finishReason: response.choices[0].finish_reason,
  });

  return response;
}

Solution: Use deterministic settings for critical paths

// Non-critical: chatbot greeting (variation is fine)
const greeting = await callLLM(messages, {
  temperature: 0.7,  // Some personality variation
});

// Critical: extracting billing amounts from invoices
const extraction = await callLLM(messages, {
  temperature: 0,    // Must be consistent
  seed: 42,          // Reproducible
});

// Critical: content moderation decision
const moderation = await callLLM(messages, {
  temperature: 0,    // Same content must get same verdict
});

6. The n Parameter: Multiple Completions

Some APIs let you request multiple completions in a single call. Each uses independent sampling:

const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Suggest a project name for a dating app' }],
  temperature: 1.0,
  n: 5,  // Generate 5 different completions
});

response.choices.forEach((choice, i) => {
  console.log(`Option ${i + 1}: ${choice.message.content}`);
});
// Option 1: "HeartLink"
// Option 2: "Spark"  
// Option 3: "ConnectHer"
// Option 4: "VibeMatch"
// Option 5: "SoulSync"

Cost note: You pay for all n completions. n: 5 costs 5x the output tokens, while the input prompt is billed only once.
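
Back-of-the-envelope math for this (the prices below are placeholders, not real rates; check your provider's current rate card):

```javascript
// Rough cost model for an n-completions call: input tokens are billed once,
// output tokens are billed for every completion.
function estimateCost({ inputTokens, avgOutputTokens, n, inputPricePer1M, outputPricePer1M }) {
  const inputCost = (inputTokens / 1e6) * inputPricePer1M;
  const outputCost = ((n * avgOutputTokens) / 1e6) * outputPricePer1M;
  return inputCost + outputCost;
}

// Hypothetical prices: $2.50 / 1M input tokens, $10 / 1M output tokens
const one  = estimateCost({ inputTokens: 200, avgOutputTokens: 50, n: 1, inputPricePer1M: 2.5, outputPricePer1M: 10 });
const five = estimateCost({ inputTokens: 200, avgOutputTokens: 50, n: 5, inputPricePer1M: 2.5, outputPricePer1M: 10 });
// five has 5x the output cost of one, but the input cost is unchanged
```

So for short prompts with long outputs, n: 5 approaches 5x the price; for long prompts with short outputs, the multiplier is much smaller.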


7. Determinism Across Model Versions

Even with temperature: 0 and seed, outputs can change when:

  1. Model version updates — gpt-4o-2024-08-06 may give different outputs than gpt-4o-2024-05-13
  2. Infrastructure changes — Different GPU hardware or software updates
  3. API changes — System prompt formatting or tokenizer updates
  4. Fine-tuning — If you fine-tune and deploy a new version

Production strategy:

// Pin to a specific model version, not the alias
const MODEL = 'gpt-4o-2024-08-06';  // ✓ Pinned
// const MODEL = 'gpt-4o';          // ✗ Alias — may change under you

// Version your prompts alongside your code
const SYSTEM_PROMPT_V3 = `You are a classifier...`;

// When updating models or prompts, run evaluation suites
// to catch regressions before deploying
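
The evaluation-suite idea can be sketched minimally. This assumes a hypothetical golden set and a `classify` function standing in for your temp-0 LLM call:

```javascript
// Golden set: inputs paired with the expected deterministic (temp: 0) outputs,
// captured from the currently deployed model version.
const goldenSet = [
  { input: 'I love it', expected: 'positive' },
  { input: 'Terrible.', expected: 'negative' },
];

// `classify` is your temp-0 LLM call (hypothetical here).
async function runEvalSuite(classify) {
  const failures = [];
  for (const { input, expected } of goldenSet) {
    const actual = await classify(input);
    if (actual !== expected) failures.push({ input, expected, actual });
  }
  return failures; // deploy the new model/prompt only if failures.length === 0
}
```

Run the suite against the candidate model version before switching the pin; any non-empty failure list is a regression to investigate.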

8. Key Takeaways

  1. LLMs are inherently probabilistic — the same input can produce different outputs.
  2. Temperature 0 + seed provide near-deterministic outputs for most practical purposes.
  3. Use deterministic settings for JSON extraction, classification, testing, and any pipeline where consistency matters.
  4. Use probabilistic settings for creative tasks, conversation, and generating diverse options.
  5. Log everything — in a non-deterministic system, you need complete records to debug issues.
  6. Pin model versions in production — model updates can change behavior even with the same settings.

Explain-It Challenge

  1. A QA engineer says "I can't write tests for AI features because the output keeps changing." What do you recommend?
  2. Your data pipeline extracts prices from receipts using GPT-4o. Sometimes it returns "$12.50" and sometimes "12.50". How do you fix this?
  3. Why is temperature: 0 not truly 100% deterministic, and does it matter in practice?

Navigation: ← 4.1.d — Hallucination · 4.1 Overview