Episode 4 — Generative AI Engineering / 4.1 — How LLMs Actually Work
4.1.e — Deterministic vs Probabilistic Outputs
In one sentence: LLMs are fundamentally probabilistic systems — the same input can produce different outputs — but you can push them toward deterministic behavior using temperature 0 and seed parameters, which is critical for building reliable, testable production systems.
Navigation: ← 4.1.d — Hallucination · 4.1 Overview
1. The Core Distinction
Deterministic systems
A deterministic system always produces the same output for the same input. There is no randomness.
// Deterministic: same input → same output, every time
function add(a, b) {
return a + b;
}
add(2, 3); // Always 5. Always. Forever.
// Deterministic: database query
SELECT * FROM users WHERE id = 42;
// Returns the same row every time (assuming no changes)
Probabilistic systems
A probabilistic system can produce different outputs for the same input. There is inherent randomness.
// Probabilistic: same prompt → potentially different responses
Prompt: "Write a greeting"
Run 1: "Hello! How can I help you today?"
Run 2: "Hi there! What can I do for you?"
Run 3: "Hey! Welcome — how may I assist you?"
Run 4: "Hello! How can I help you today?" ← might repeat, might not
Each run samples from a probability distribution.
The output is not guaranteed to be the same.
Why LLMs are inherently probabilistic
The core operation of an LLM is sampling from a probability distribution. Even if "Paris" has 92% probability as the next token after "The capital of France is", the other 8% of the time the model might pick a different token. This randomness is by design — it's what makes language models capable of diverse, natural-sounding text.
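Sampling from a token distribution can be sketched in a few lines. This is a toy illustration, not a real model: the tokens and logits below are made up, and `softmax`/`sampleToken` are hypothetical helper names.

```javascript
// Toy sketch of next-token sampling (hypothetical logits, not a real model).
// Softmax turns raw logits into a probability distribution over tokens.
function softmax(logits) {
  const max = Math.max(...logits); // subtract max for numerical stability
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Draw one token at random, weighted by its probability.
function sampleToken(tokens, logits) {
  const probs = softmax(logits);
  let r = Math.random();
  for (let i = 0; i < tokens.length; i++) {
    r -= probs[i];
    if (r <= 0) return tokens[i];
  }
  return tokens[tokens.length - 1];
}

// "The capital of France is" → "Paris" dominates, but isn't guaranteed
const tokens = ['Paris', 'a', 'the', 'located'];
const logits = [5.0, 1.5, 1.0, 0.5];
sampleToken(tokens, logits); // usually "Paris", occasionally another token
```

Run it a few hundred times and "Paris" appears roughly in proportion to its probability; the remaining runs pick something else. That residual randomness is the whole story of this section.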
2. Making LLMs More Deterministic
Temperature 0
Setting temperature: 0 makes the model use greedy decoding — it always picks the highest-probability token. No sampling occurs.
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0, // Greedy: always pick the most likely token
messages: [{ role: 'user', content: 'What is 2 + 2?' }],
});
// Almost always returns "4" (or "2 + 2 = 4")
Is temperature 0 truly deterministic? Almost, but not 100%. Due to floating-point arithmetic differences across hardware (GPUs process in parallel, and order can vary), there can be extremely rare non-determinism even at temperature 0. In practice, for 99.9%+ of calls, temperature 0 gives the same output.
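The floating-point issue is easy to demonstrate at a tiny scale. This is not the actual GPU mechanism, just an illustration that addition is order-sensitive, which is why parallel reductions on different hardware can disagree in the last bits:

```javascript
// Floating-point addition is not associative: the grouping matters.
const leftToRight = (0.1 + 0.2) + 0.3; // 0.6000000000000001
const rightToLeft = 0.1 + (0.2 + 0.3); // 0.6
console.log(leftToRight === rightToLeft); // false
```

When thousands of such operations run in parallel with nondeterministic ordering, two logits that are nearly tied can swap ranks, and greedy decoding then picks a different "most likely" token.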
The seed parameter (OpenAI)
OpenAI provides a seed parameter for reproducibility:
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0,
seed: 42, // Fixed seed for reproducibility
messages: [{ role: 'user', content: 'Classify this sentiment: "I love it"' }],
});
// The response includes a system_fingerprint
console.log(response.system_fingerprint); // "fp_abc123..."
// Same seed + same fingerprint = same output (best effort)
Important: The seed provides best-effort determinism. If OpenAI updates the model's infrastructure (system fingerprint changes), the output may change even with the same seed.
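One way to act on this, as a sketch: record the `system_fingerprint` alongside your results, and treat a fingerprint change as a signal that cached outputs or baselines may no longer reproduce. The `isReproducible` helper is hypothetical, not part of any SDK.

```javascript
// Hypothetical helper: same seed + same system_fingerprint → the API
// should (best effort) return the same output. A different fingerprint
// means the backend changed, so previous outputs may no longer reproduce.
function isReproducible(previousFingerprint, response) {
  return response.system_fingerprint === previousFingerprint;
}

const recorded = 'fp_abc123';
isReproducible(recorded, { system_fingerprint: 'fp_abc123' }); // true
isReproducible(recorded, { system_fingerprint: 'fp_def456' }); // false — rerun evals
```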
3. When You Need Deterministic Outputs
| Use Case | Why Determinism Matters | Settings |
|---|---|---|
| JSON extraction | Downstream code parses the output; different formats break | temp: 0 |
| Classification | "positive"/"negative"/"neutral" must be consistent for same input | temp: 0, seed |
| Testing | Test assertions need predictable outputs | temp: 0, seed |
| Data pipelines | Same document → same extracted fields every time | temp: 0 |
| Caching | If output varies, cache is useless | temp: 0 |
| Compliance/audit | Must be able to reproduce the exact output | temp: 0, seed, log everything |
| Cost estimation | Token count varies with different outputs | temp: 0 |
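The caching row deserves a concrete sketch. With `temperature: 0`, the same request should produce the same output, so responses can be keyed by a hash of the request. This is a minimal in-memory sketch with hypothetical helper names (`cacheKey`, `cachedCompletion`), not a production cache:

```javascript
// Minimal caching sketch: only sensible with deterministic settings,
// because at temperature > 0 a cached answer isn't "the" answer.
import { createHash } from 'node:crypto';

const cache = new Map();

function cacheKey(model, messages, temperature) {
  return createHash('sha256')
    .update(JSON.stringify({ model, messages, temperature }))
    .digest('hex');
}

async function cachedCompletion(client, model, messages) {
  const key = cacheKey(model, messages, 0);
  if (cache.has(key)) return cache.get(key); // hit: no API call, no cost
  const response = await client.chat.completions.create({
    model,
    temperature: 0, // deterministic settings make the cache meaningful
    messages,
  });
  const text = response.choices[0].message.content;
  cache.set(key, text);
  return text;
}
```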
4. When You Want Probabilistic Outputs
| Use Case | Why Randomness Helps | Settings |
|---|---|---|
| Creative writing | Diversity and novelty are the goal | temp: 0.8-1.2 |
| Brainstorming | Generate many different ideas | temp: 1.0, n: 5 |
| Conversation | Repeating the same response feels robotic | temp: 0.7 |
| A/B testing prompts | Need varied outputs to compare quality | temp: 0.5+ |
| Generating alternatives | "Give me 3 versions of this email" | temp: 0.8 |
| Overcoming local optima | Greedy decoding can miss better phrasings | temp: 0.3-0.5 |
5. The Reproducibility Challenge in Production
Problem: You can't easily debug non-deterministic systems
Bug report: "The AI classified my product as 'electronics' yesterday
but 'home appliance' today."
If you used temperature > 0:
- You can't reproduce the original output
- You can't verify if it was correct
- You can't compare old vs new behavior
- The user's experience is inconsistent
Solution: Log everything
// Production logging for AI calls
async function callLLM(messages, config) {
const startTime = Date.now();
const response = await openai.chat.completions.create({
...config, // spread first so the explicit fields below always win
model: config.model,
messages,
temperature: config.temperature,
seed: config.seed,
});
// Log the COMPLETE request and response
await logger.info('llm_call', {
requestId: generateId(),
timestamp: new Date().toISOString(),
model: config.model,
temperature: config.temperature,
seed: config.seed,
inputMessages: messages,
inputTokens: response.usage.prompt_tokens,
outputTokens: response.usage.completion_tokens,
output: response.choices[0].message.content,
systemFingerprint: response.system_fingerprint,
latencyMs: Date.now() - startTime,
finishReason: response.choices[0].finish_reason,
});
return response;
}
Solution: Use deterministic settings for critical paths
// Non-critical: chatbot greeting (variation is fine)
const greeting = await callLLM(messages, {
temperature: 0.7, // Some personality variation
});
// Critical: extracting billing amounts from invoices
const extraction = await callLLM(messages, {
temperature: 0, // Must be consistent
seed: 42, // Reproducible
});
// Critical: content moderation decision
const moderation = await callLLM(messages, {
temperature: 0, // Same content must get same verdict
});
6. The n Parameter: Multiple Completions
Some APIs let you request multiple completions in a single call. Each uses independent sampling:
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: 'Suggest a project name for a dating app' }],
temperature: 1.0,
n: 5, // Generate 5 different completions
});
response.choices.forEach((choice, i) => {
console.log(`Option ${i + 1}: ${choice.message.content}`);
});
// Option 1: "HeartLink"
// Option 2: "Spark"
// Option 3: "ConnectHer"
// Option 4: "VibeMatch"
// Option 5: "SoulSync"
Cost note: You pay output tokens for every completion — n: 5 costs roughly 5x the output tokens. The input prompt is billed once per request.
7. Determinism Across Model Versions
Even with temperature: 0 and seed, outputs can change when:
- Model version updates — gpt-4o-2024-08-06 may give different outputs than gpt-4o-2024-05-13
- Infrastructure changes — Different GPU hardware or software updates
- API changes — System prompt formatting or tokenizer updates
- Fine-tuning — If you fine-tune and deploy a new version
Production strategy:
// Pin to a specific model version, not the alias
const MODEL = 'gpt-4o-2024-08-06'; // ✓ Pinned
// const MODEL = 'gpt-4o'; // ✗ Alias — may change under you
// Version your prompts alongside your code
const SYSTEM_PROMPT_V3 = `You are a classifier...`;
// When updating models or prompts, run evaluation suites
// to catch regressions before deploying
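The evaluation-suite idea can be sketched as a simple regression gate. Everything here is hypothetical (`runEvalSuite`, the `classify` function, the cases): a minimal shape, not a full eval framework.

```javascript
// Hypothetical regression gate: run a fixed eval set against a pinned model
// and fail if accuracy drops below the previously recorded baseline.
async function runEvalSuite(classify, cases, baselineAccuracy) {
  let correct = 0;
  for (const { input, expected } of cases) {
    const output = await classify(input); // use temp 0 + seed for reproducibility
    if (output === expected) correct++;
  }
  const accuracy = correct / cases.length;
  if (accuracy < baselineAccuracy) {
    throw new Error(`Regression: accuracy ${accuracy} < baseline ${baselineAccuracy}`);
  }
  return accuracy;
}
```

Run this in CI whenever you change the model pin or a prompt version; a thrown error blocks the deploy, which is exactly the point.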
8. Key Takeaways
- LLMs are inherently probabilistic — the same input can produce different outputs.
- Temperature 0 + seed provide near-deterministic outputs for most practical purposes.
- Use deterministic settings for JSON extraction, classification, testing, and any pipeline where consistency matters.
- Use probabilistic settings for creative tasks, conversation, and generating diverse options.
- Log everything — in a non-deterministic system, you need complete records to debug issues.
- Pin model versions in production — model updates can change behavior even with the same settings.
Explain-It Challenge
- A QA engineer says "I can't write tests for AI features because the output keeps changing." What do you recommend?
- Your data pipeline extracts prices from receipts using GPT-4o. Sometimes it returns "$12.50" and sometimes "12.50". How do you fix this?
- Why is temperature: 0 not truly 100% deterministic, and does it matter in practice?
Navigation: ← 4.1.d — Hallucination · 4.1 Overview