Episode 4 — Generative AI Engineering / 4.1 — How LLMs Actually Work
4.1.c — Sampling: Temperature and Randomness Control
In one sentence: After the model computes probabilities for every possible next token, sampling parameters like temperature, top-p, and top-k determine which token is actually chosen — controlling the tradeoff between predictable, focused outputs and creative, diverse ones.
Navigation: ← 4.1.b — Context Window · 4.1.d — Hallucination →
1. How Token Selection Works
When the model processes your prompt, it doesn't generate text directly. It computes a probability distribution over its entire vocabulary (50,000-200,000 tokens). Every single token gets a probability score — then ONE token must be selected.
Prompt: "The capital of France is"
Model's probability distribution for the next token:
" Paris" → 92.3%
" Lyon" → 2.1%
" the" → 1.8%
" located" → 0.9%
" not" → 0.4%
" a" → 0.3%
... (200,000 other tokens with tiny probabilities)
SAMPLING decides which one gets picked.
Without sampling (greedy decoding): always pick the highest probability token. "Paris" every time. Deterministic but potentially boring and repetitive.
With sampling: randomly select from the distribution, weighted by probabilities. "Paris" most of the time, but occasionally "Lyon" or something surprising.
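The two strategies can be sketched in a few lines of JavaScript, using a toy version of the distribution above (the helper names are illustrative, not any real API):

```javascript
// Toy next-token distribution from the "capital of France" example.
const dist = [
  { token: ' Paris',   p: 0.923 },
  { token: ' Lyon',    p: 0.021 },
  { token: ' the',     p: 0.018 },
  { token: ' located', p: 0.009 },
];

// Greedy decoding: always pick the highest-probability token.
function greedy(dist) {
  return dist.reduce((best, t) => (t.p > best.p ? t : best)).token;
}

// Sampling: pick a token at random, weighted by its probability.
function sample(dist) {
  let r = Math.random() * dist.reduce((s, t) => s + t.p, 0);
  for (const t of dist) {
    r -= t.p;
    if (r <= 0) return t.token;
  }
  return dist[dist.length - 1].token; // guard against floating-point rounding
}

console.log(greedy(dist)); // always " Paris"
console.log(sample(dist)); // usually " Paris", occasionally something else
```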
2. Temperature
Temperature is the single most important sampling parameter. It controls how "sharp" or "flat" the probability distribution is before sampling.
How it works mathematically
Temperature modifies the logits (raw scores before softmax) by dividing them by the temperature value:
adjusted_probability = softmax(logits / temperature)
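A minimal sketch of that formula, showing how dividing the logits by the temperature sharpens or flattens the resulting distribution (toy logits, hypothetical helper name):

```javascript
// softmax(logits / T): lower T sharpens the distribution, higher T flattens it.
function softmaxWithTemperature(logits, temperature) {
  const scaled = logits.map(l => l / temperature);
  const max = Math.max(...scaled);                 // subtract max for numerical stability
  const exps = scaled.map(l => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

const logits = [5.0, 2.0, 1.5]; // e.g. " Paris", " Lyon", " the"
console.log(softmaxWithTemperature(logits, 0.2)); // sharp: top token near 1.0
console.log(softmaxWithTemperature(logits, 1.0)); // moderate spread
console.log(softmaxWithTemperature(logits, 2.0)); // flat: probabilities closer together
```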
Temperature = 0 (or very close to 0)
Probability distribution:
" Paris" → 99.99% ████████████████████████████████████████
" Lyon" → 0.005%
" the" → 0.003%
Everything else → ~0%
Result: ALWAYS picks "Paris". Dividing by zero is undefined, so in practice APIs treat temperature 0 as greedy decoding: always take the argmax.
Same input → Same output every time (though minor nondeterminism can still slip in from GPU floating-point math).
Use when: You need consistent, predictable, factual responses. JSON generation, data extraction, classification, code generation.
Temperature = 0.7 (moderate)
Probability distribution:
" Paris" → 85% ████████████████████████████████████
" Lyon" → 5% ██
" the" → 4% ██
" located" → 2% █
" not" → 1.5% █
Others → 2.5% █
Result: Usually picks "Paris", sometimes surprises.
Good balance of accuracy and variety.
Use when: General conversation, creative writing with some constraints, brainstorming with guardrails.
Temperature = 1.0 (default for most models)
Probability distribution:
" Paris" → 70% ████████████████████████████
" Lyon" → 8% ███
" the" → 7% ███
" located" → 5% ██
" not" → 3% █
Others → 7% ███
Result: Frequently picks "Paris", but meaningful chance of
other tokens. More diverse and creative output.
Temperature = 1.5+ (high creativity)
Probability distribution:
" Paris" → 35% ██████████████
" Lyon" → 12% █████
" the" → 11% ████
" located" → 9% ████
" not" → 8% ███
Others → 25% ██████████
Result: Much more random. Might say "Paris" or might say
something unexpected. Higher chance of incoherence.
Use when: Creative writing, poetry, brainstorming, generating diverse options. Rarely used above 1.5.
Temperature cheat sheet
| Temperature | Behavior | Use Cases |
|---|---|---|
| 0 | Deterministic, always same output | JSON extraction, classification, code |
| 0.1-0.3 | Very focused, minimal variation | Factual Q&A, data processing |
| 0.5-0.7 | Balanced, slight creativity | General chatbots, summaries |
| 0.8-1.0 | Default, natural variation | Conversation, content writing |
| 1.2-1.5 | Creative, diverse, risky | Poetry, brainstorming, fiction |
| >1.5 | Chaotic, often incoherent | Almost never useful in production |
3. Top-p (Nucleus Sampling)
Top-p (also called nucleus sampling) is an alternative way to control randomness. Instead of adjusting the probability distribution, it limits which tokens are eligible to be sampled.
How it works: Sort tokens by probability. Include tokens until their cumulative probability reaches p. Only sample from those tokens.
top_p = 0.9
Tokens sorted by probability:
" Paris" → 70% cumulative: 70% ✓ included
" Lyon" → 8% cumulative: 78% ✓ included
" the" → 7% cumulative: 85% ✓ included
" located" → 5% cumulative: 90% ✓ included (reaches 90%)
" not" → 3% cumulative: 93% ✗ EXCLUDED
Everything else ✗ EXCLUDED
Only " Paris", " Lyon", " the", " located" can be picked.
The long tail of low-probability tokens is cut off.
Top-p values
| top_p | Effect |
|---|---|
| 0.1 | Only the very top tokens (extremely focused) |
| 0.5 | Top half of probability mass |
| 0.9 | Default — includes most reasonable tokens, cuts off noise |
| 0.95 | Slight filtering of very unlikely tokens |
| 1.0 | No filtering — all tokens eligible |
Temperature vs Top-p
Temperature: Reshapes the ENTIRE distribution (makes it sharper or flatter)
Top-p: Cuts off the TAIL of the distribution (removes unlikely tokens)
Both reduce randomness, but in different ways.
Most APIs default to: temperature=1.0, top_p=1.0
Best practice: Adjust either temperature or top-p, not both at the same time. Adjusting both can produce unpredictable results.
4. Top-k Sampling
Top-k is simpler than top-p: only consider the top k most probable tokens, regardless of their probability mass.
top_k = 3
Tokens sorted by probability:
" Paris" → 70% ✓ (rank 1)
" Lyon" → 8% ✓ (rank 2)
" the" → 7% ✓ (rank 3)
" located" → 5% ✗ (rank 4 — excluded)
Everything else ✗ excluded
Only the top 3 tokens can be selected.
Limitation: top-k is less adaptive than top-p. If one token has 99% probability, top-k=50 still considers 49 nearly-impossible tokens. Top-p would only include the one dominant token.
Top-k is rarely used in production APIs — most providers (OpenAI, Anthropic) use temperature + top-p. Some open-source models expose top-k.
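For comparison with top-p, the top-k filter is a one-liner (toy distribution again):

```javascript
// Top-k: keep only the k highest-probability tokens, ignoring probability mass.
function topKFilter(dist, k) {
  return [...dist].sort((a, b) => b.p - a.p).slice(0, k);
}

const dist = [
  { token: ' Paris',   p: 0.70 },
  { token: ' Lyon',    p: 0.08 },
  { token: ' the',     p: 0.07 },
  { token: ' located', p: 0.05 },
];
console.log(topKFilter(dist, 3).map(t => t.token)); // [ ' Paris', ' Lyon', ' the' ]
```

Note that k is fixed: if " Paris" had 99% probability, topKFilter would still keep three tokens, while a top-p filter with p=0.9 would keep only " Paris".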
5. Other Sampling Parameters
Frequency penalty
Reduces the probability of tokens in proportion to how many times they have already appeared in the output. Discourages verbatim repetition.
{
"frequency_penalty": 0.5 // Range: -2.0 to 2.0
// Positive: less repetition
// Negative: more repetition (rare to use)
}
Presence penalty
Reduces the probability of tokens that have appeared at all (regardless of frequency). Encourages the model to talk about new topics.
{
"presence_penalty": 0.5 // Range: -2.0 to 2.0
// Positive: more likely to introduce new topics
// 0: no penalty
}
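A sketch of how the two penalties combine, following the logit adjustment described in OpenAI's documentation (the helper name is illustrative): the frequency penalty scales with the token's count, while the presence penalty is a flat one-time deduction.

```javascript
// Adjust a token's logit based on how often it has already appeared:
//   logit' = logit - count * frequencyPenalty - (count > 0 ? 1 : 0) * presencePenalty
function penalize(logit, count, frequencyPenalty, presencePenalty) {
  return logit - count * frequencyPenalty - (count > 0 ? 1 : 0) * presencePenalty;
}

console.log(penalize(2.0, 0, 0.5, 0.5)); // 2.0 — never seen: no penalty
console.log(penalize(2.0, 1, 0.5, 0.5)); // 1.0 — seen once: both penalties apply
console.log(penalize(2.0, 3, 0.5, 0.5)); // 0.0 — frequency penalty scales with count
```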
Stop sequences
Tell the model to stop generating when it produces a specific string:
{
"stop": ["\n\n", "END", "---"]
// Model stops as soon as it generates any of these
}
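Conceptually, the API truncates the output at the first stop sequence it generates. A client-side sketch of the same idea (hypothetical helper; real APIs apply stops server-side as tokens stream out, and the stop string itself is not included in the result):

```javascript
// Truncate text at the earliest occurrence of any stop sequence.
function applyStops(text, stops) {
  let cut = text.length;
  for (const s of stops) {
    const i = text.indexOf(s);
    if (i !== -1 && i < cut) cut = i;
  }
  return text.slice(0, cut);
}

console.log(applyStops('First paragraph.\n\nSecond paragraph.', ['\n\n', 'END']));
// "First paragraph."
```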
6. Practical API Examples
OpenAI
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: 'Write a haiku about coding' }],
temperature: 0.8, // Moderate creativity
top_p: 1, // No nucleus filtering
max_tokens: 100, // Limit output length
frequency_penalty: 0.3, // Reduce repetition
presence_penalty: 0.1, // Slight topic diversity
});
Anthropic (Claude)
const response = await anthropic.messages.create({
model: 'claude-sonnet-4-6',
max_tokens: 1024,
temperature: 0, // Deterministic for structured output
messages: [{ role: 'user', content: 'Extract the names from this text: ...' }],
});
Production presets
// Preset: Structured data extraction (JSON, parsing)
const EXTRACTION_CONFIG = { temperature: 0, top_p: 1 };
// Preset: General Q&A chatbot
const CHATBOT_CONFIG = { temperature: 0.7, top_p: 0.9 };
// Preset: Creative writing
const CREATIVE_CONFIG = { temperature: 1.0, top_p: 0.95, frequency_penalty: 0.5 };
// Preset: Code generation
const CODE_CONFIG = { temperature: 0.2, top_p: 0.95 };
7. Key Takeaways
- The model outputs a probability distribution over all tokens — sampling decides which one is picked.
- Temperature reshapes the distribution: 0 = deterministic, 1 = natural, >1 = creative/risky.
- Top-p cuts off the long tail of unlikely tokens — 0.9 is a good default.
- Adjust either temperature or top-p, not both simultaneously.
- Temperature 0 is essential for structured outputs (JSON, data extraction, classification).
- Frequency/presence penalties reduce repetition and encourage topic diversity.
Explain-It Challenge
- A product manager asks "why does the chatbot sometimes give different answers to the same question?" — explain temperature.
- You're building a JSON extraction API. What temperature do you use and why?
- What's the difference between getting creative outputs via high temperature vs high top-p?
Navigation: ← 4.1.b — Context Window · 4.1.d — Hallucination →