Temperature & Sampling: Controlling Randomness in LLM Outputs


You build a code generation feature. Users describe what they want, and the LLM writes the code. During testing, you notice something strange: the same prompt sometimes produces correct Python and sometimes produces subtly broken Python. One run gives you a clean for-loop. The next run gives you a list comprehension that mishandles edge cases. Same prompt, different outputs, inconsistent quality.

You check your API call. Temperature is set to 1.0 - the default. You are letting the model roll dice on every single token in your code output. For a creative writing tool, this variance might be desirable. For code generation, it is a reliability problem hiding behind a single parameter.

Understanding sampling is the difference between building AI features that are consistently useful and ones that are unpredictably brilliant or broken.

What happens after the model computes

When an LLM processes your prompt, the final layer outputs a vector of raw scores (logits) - one score for every token in the vocabulary. For a vocabulary of 100K tokens, you get 100K numbers. These scores are not probabilities yet. They need to be converted through softmax into a probability distribution that sums to 1.
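To make the logits-to-probabilities step concrete, here is a minimal softmax sketch in plain Python. The four-token vocabulary and its logit values are made up for illustration; a real model would produce one score per vocabulary entry.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a toy 4-token vocabulary.
toy_logits = [4.0, 2.0, 1.0, 0.5]
probs = softmax(toy_logits)
print([round(p, 3) for p in probs])
```

The token with the largest logit always ends up with the largest probability; softmax changes the scale, not the ranking.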

The question is: how do you pick the next token from this distribution? You have several strategies, and temperature is the most important control.

graph TD
  A["Logits: raw scores for 100K tokens"] --> B["Apply Temperature"]
  B --> C["Softmax → Probability Distribution"]
  C --> D["Apply top-k filter"]
  D --> E["Apply top-p (nucleus) filter"]
  E --> F["Sample from remaining tokens"]
  F --> G["Selected token"]

  style A fill:#F1EFE8,stroke:#888780,color:#444441
  style B fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style C fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style D fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style E fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style F fill:#FAEEDA,stroke:#854F0B,color:#633806
  style G fill:#FAEEDA,stroke:#854F0B,color:#633806

Temperature: sharpening or flattening the distribution

Temperature is a scalar that divides the logits before softmax. The formula is:

P(token_i) = exp(logit_i / T) / sum(exp(logit_j / T))

Where T is temperature. Here is what different values do:

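Temperature = 0 (or near 0): As T approaches 0, the distribution becomes extremely sharp and the highest-scoring token takes nearly all the probability mass. At exactly 0 the formula would divide by zero, so implementations treat it as a special case and always pick the most likely token, which makes the model deterministic in principle. This is called greedy decoding.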

Temperature = 1.0: The logits are left unscaled, so the model samples from exactly the distribution it learned in training. This is the “natural” randomness level.

Temperature > 1.0: The distribution flattens. Lower-probability tokens get boosted. The model becomes more “creative” but also more likely to produce nonsense.
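The effect of the three regimes can be checked with a small sketch of the formula above. The logit values are hypothetical, and treating T = 0 as an argmax is an assumption about how a sampler would special-case it, not a description of any particular API.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / T; T == 0 is treated as greedy (argmax)."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Dividing by zero is undefined, so put all mass on the top token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical values
for t in (0, 0.7, 1.0, 1.5):
    print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])
```

As T rises, the top token's share shrinks and the tail grows, but the ranking of tokens never changes.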

graph LR
  subgraph t0["Temperature = 0"]
      A1["'Paris': 99%"]
      A2["'Lyon': 0.5%"]
      A3["'the': 0.3%"]
      A4["others: 0.2%"]
  end
  subgraph t07["Temperature = 0.7"]
      B1["'Paris': 78%"]
      B2["'Lyon': 8%"]
      B3["'the': 5%"]
      B4["others: 9%"]
  end
  subgraph t15["Temperature = 1.5"]
      C1["'Paris': 35%"]
      C2["'Lyon': 18%"]
      C3["'the': 14%"]
      C4["others: 33%"]
  end

  style A1 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style B1 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style C1 fill:#FAEEDA,stroke:#854F0B,color:#633806
  style A2 fill:#F1EFE8,stroke:#888780,color:#444441
  style A3 fill:#F1EFE8,stroke:#888780,color:#444441
  style A4 fill:#F1EFE8,stroke:#888780,color:#444441
  style B2 fill:#F1EFE8,stroke:#888780,color:#444441
  style B3 fill:#F1EFE8,stroke:#888780,color:#444441
  style B4 fill:#F1EFE8,stroke:#888780,color:#444441
  style C2 fill:#F1EFE8,stroke:#888780,color:#444441
  style C3 fill:#F1EFE8,stroke:#888780,color:#444441
  style C4 fill:#F1EFE8,stroke:#888780,color:#444441

The intuition

Think of temperature as a confidence dial. Low temperature means “pick what you are most confident about.” High temperature means “consider alternatives you are less sure about.” At zero, the model never takes risks. At high values, it takes risks constantly.

Top-k sampling: limiting the candidate pool

Top-k sampling restricts the model to only consider the k most probable tokens at each step. All other tokens get their probability set to zero, and the remaining probabilities are renormalized.

  • top-k = 1: Identical to greedy decoding (temperature 0)
  • top-k = 50: Only the top 50 tokens are considered, regardless of how much probability mass they cover
  • top-k = 100000: Effectively no filtering

The problem with top-k is that it uses a fixed number regardless of the distribution shape. Sometimes the model is very confident (top 3 tokens hold 95% of probability), and sometimes it is uncertain (top 50 tokens each have roughly 2% probability). A fixed k does not adapt.
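A minimal sketch of the top-k filter, working on a list of probabilities. The distribution is invented for illustration.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Indices of the k largest probabilities.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

probs = [0.5, 0.3, 0.1, 0.06, 0.04]  # hypothetical distribution
filtered = top_k_filter(probs, 2)
print([round(p, 3) for p in filtered])
```

Note that k = 2 keeps exactly two tokens here whether they hold 80% of the mass or 20% of it, which is the rigidity described above.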

Top-p (nucleus sampling): adaptive filtering

Top-p sampling (also called nucleus sampling) adapts where top-k cannot. Instead of fixing the number of candidates, it fixes the cumulative probability mass. You set p = 0.9, and the model considers the smallest set of most-probable tokens whose cumulative probability reaches 90%.

  • If the model is confident, this might be just 2-3 tokens
  • If the model is uncertain, this might be 50+ tokens
  • The filter adapts to the shape of the distribution at each step

top-p = 0.1: Very conservative. Keeps only the smallest set of top tokens whose combined probability reaches 10%, which is usually just 1-2 tokens and often only the single most likely one.

top-p = 0.9: The standard “good default.” Cuts off the long tail of unlikely tokens while preserving meaningful alternatives.

top-p = 1.0: No filtering. All tokens remain in the candidate pool.
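The adaptive behavior is easy to see in a small sketch. The two distributions are hypothetical, and using an inclusive cutoff (cumulative mass greater than or equal to p) is an implementation choice made here; real samplers may differ at the boundary.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:  # inclusive cutoff: an assumption of this sketch
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]

confident = [0.92, 0.05, 0.01, 0.01, 0.01]  # hypothetical: one clear winner
uncertain = [0.25, 0.22, 0.20, 0.18, 0.15]  # hypothetical: no clear winner

for name, dist in (("confident", confident), ("uncertain", uncertain)):
    survivors = sum(1 for q in top_p_filter(dist, 0.9) if q > 0)
    print(name, "keeps", survivors, "tokens")
```

With the same p = 0.9, the confident distribution keeps a single token while the uncertain one keeps all five.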

Combining parameters

In practice, these parameters are combined. The typical API call lets you set temperature, top-p, and sometimes top-k together. The exact order varies by implementation, but a common sequence is:

  1. Logits are divided by temperature
  2. Softmax converts to probabilities
  3. Top-k filter removes all but the top k tokens
  4. Top-p filter removes tokens beyond the cumulative probability threshold
  5. Final sampling from whatever remains
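The five steps can be put together in one function. This is a sketch of the sequence listed above, not how any specific provider implements it; the function name, argument names, and toy logits are all hypothetical.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=1.0, rng=random):
    """Temperature -> softmax -> top-k -> top-p -> sample. Returns a token index."""
    if top_k is not None and top_k < 1:
        raise ValueError("top_k must be at least 1")
    if temperature == 0:
        # Greedy decoding: skip sampling and take the argmax.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Steps 1-2: scale the logits, then convert to probabilities.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Step 3: keep only the k most probable tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    # Step 4: keep the smallest prefix whose renormalized mass reaches top_p.
    kept_mass = sum(probs[i] for i in order)
    candidates, cumulative = [], 0.0
    for i in order:
        candidates.append(i)
        cumulative += probs[i] / kept_mass
        if cumulative >= top_p:
            break
    # Step 5: sample from the survivors (choices() treats weights as relative).
    weights = [probs[i] for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

toy_logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical values
rng = random.Random(0)  # fixed seed so the demo is repeatable
print([sample_next_token(toy_logits, temperature=0.8, top_k=3, top_p=0.95, rng=rng)
       for _ in range(10)])
```

Passing a seeded `random.Random` makes this toy sampler repeatable, which is the same idea behind the seed parameters some APIs expose.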

Most APIs recommend setting either temperature or top-p, not both aggressively. OpenAI’s API reference, for example, generally recommends altering temperature or top_p but not both.

Where it breaks: common mistakes

Mistake 1: Temperature 0 for all tasks. Engineers who want determinism set temperature to 0 everywhere. This makes the model repetitive and unable to generate varied content when needed. It also does not guarantee identical outputs - floating point arithmetic differences across hardware can still cause variation.

Mistake 2: High temperature for “better” answers. Higher temperature does not mean better creativity. It means more randomness. Push it far enough past 1.0 and most models drift into incoherent text, though the exact threshold varies by model. The sweet spot for creative tasks is typically 0.7-0.9.

Mistake 3: Ignoring sampling for batch tasks. If you are classifying 10,000 documents, temperature 0 gives you results that are reproducible in practice. Temperature 0.7 means each document might get a different classification on re-run. For batch/evaluation workloads, determinism matters.

Mistake 4: Not accounting for compounding randomness. Each token is a separate random draw, conditioned on every token sampled before it, so one early deviation changes everything that follows. For a 500-token response, you are making 500 random choices. Even small per-token variance compounds into significant output-level variance. A 5% per-token deviation rate means your 500-token output is almost certainly different from run to run.
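The arithmetic behind that last claim is a single exponent. The sketch below assumes a constant per-token match rate, which is a simplification, since real match rates vary from step to step.

```python
def prob_identical_run(per_token_match, num_tokens):
    """Chance that every token matches a reference run, assuming a
    constant per-token match rate (a simplifying assumption)."""
    return per_token_match ** num_tokens

# A 5% per-token deviation rate is a 95% per-token match rate.
p = prob_identical_run(0.95, 500)
print(f"Chance of an identical 500-token output: {p:.2e}")
```

The result is far below one in a billion, so at that deviation rate two identical 500-token runs are effectively impossible.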

Real-world parameter choices

| Use case | Temperature | Top-p | Reasoning |
| --- | --- | --- | --- |
| Code generation | 0 - 0.2 | 0.95 | Correctness over creativity |
| Factual Q&A | 0 - 0.3 | 0.9 | Minimize hallucination risk |
| Summarization | 0.3 - 0.5 | 0.9 | Some paraphrasing variety |
| Creative writing | 0.7 - 0.9 | 0.95 | Varied, interesting prose |
| Brainstorming | 0.9 - 1.2 | 0.95 | Maximum diversity of ideas |
| Classification | 0 | 1.0 | Deterministic, reproducible |
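One way to keep these choices out of scattered call sites is a small preset lookup. The sketch below picks a single value from inside each temperature range above; the specific picks, the key names, and the helper function are illustrative assumptions, not recommendations from any provider.

```python
# Presets drawn from the ranges in the table; each temperature is one
# point chosen from inside its range (an assumption of this sketch).
SAMPLING_PRESETS = {
    "code_generation": {"temperature": 0.1, "top_p": 0.95},
    "factual_qa": {"temperature": 0.2, "top_p": 0.9},
    "summarization": {"temperature": 0.4, "top_p": 0.9},
    "creative_writing": {"temperature": 0.8, "top_p": 0.95},
    "brainstorming": {"temperature": 1.0, "top_p": 0.95},
    "classification": {"temperature": 0.0, "top_p": 1.0},
}

def sampling_params(use_case):
    """Return a copy of the preset so callers cannot mutate the table."""
    if use_case not in SAMPLING_PRESETS:
        raise ValueError(f"unknown use case: {use_case!r}")
    return dict(SAMPLING_PRESETS[use_case])

print(sampling_params("code_generation"))
```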

Other sampling parameters worth knowing

Frequency penalty (0 to 2): Reduces the probability of tokens that have already appeared in the output, in proportion to how many times they have appeared. Prevents repetitive loops. Useful for long-form generation.

Presence penalty (0 to 2): Applies a flat penalty to any token that has appeared at all, regardless of frequency. Encourages the model to introduce new topics.
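The difference between the two penalties is clearest as a logit adjustment. The sketch below uses the subtractive form that OpenAI has described for its API (count times frequency penalty, plus a flat presence penalty for any token seen at least once); other providers may compute penalties differently, and the function and argument names here are hypothetical.

```python
from collections import Counter

def apply_penalties(logits, generated_token_ids,
                    frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the logits of tokens that already appear in the output."""
    counts = Counter(generated_token_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        # Frequency term grows with each repeat; presence term is flat.
        adjusted[token_id] -= count * frequency_penalty + presence_penalty
    return adjusted

logits = [1.0, 1.0, 1.0]  # hypothetical 3-token vocabulary
history = [0, 0, 1]       # token 0 appeared twice, token 1 once
print(apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.2))
```

Token 0 is penalized more than token 1 because of the frequency term, while token 2, which never appeared, is untouched.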

Max tokens: Hard cap on output length. The model stops generating when it hits this limit or produces a stop token. Always set this to prevent runaway generation in production.

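Stop sequences: Strings that terminate generation immediately when produced. Useful for structured outputs - stop at \n\n for single-paragraph answers, or at } for a flat JSON object. Check whether your API includes the stop string in the returned text (many omit it), and note that a bare } would cut nested JSON short.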
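The behavior is simple to emulate on already-generated text, which can help when testing downstream parsing. This sketch drops the stop string itself, which is an assumption; confirm what your provider returns.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest stop sequence; the stop string is dropped."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

raw = "First paragraph.\n\nSecond paragraph."
print(truncate_at_stop(raw, ["\n\n"]))
```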

How to apply this in practice

For production APIs: Start with temperature 0 for deterministic features (classification, extraction, code). Use 0.3-0.5 for tasks needing slight variety (summarization, rephrasing). Reserve 0.7+ for explicitly creative features.

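For evaluation and testing: Default to temperature 0 so that a change in scores reflects a change in your prompt or model rather than sampling noise. The catch is that the eval must match production: if your eval passes at temperature 0 but the production endpoint uses 0.7, you are testing a different system. In that case, evaluate at the production temperature and aggregate over several samples per prompt.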

For A/B testing outputs: Generate multiple completions at moderate temperature (0.7) and use a separate scoring step (at temperature 0) to pick the best one. This “best-of-N” approach gives you diversity in generation and consistency in selection.

For streaming UX: Higher temperature makes streaming feel more natural because the model produces less predictable, more human-like token sequences. Temperature 0 streaming can feel robotic because it always picks the obvious next word.

FAQ

Q: If I set temperature to 0, will I always get the exact same output?

Not necessarily. Temperature 0 means greedy decoding (always pick the highest probability token), but floating point operations on different hardware can produce slightly different logit values. Some providers offer a “seed” parameter, but it typically gives best-effort rather than guaranteed reproducibility. Even then, model updates can change outputs. If you need guaranteed identical outputs, cache responses.
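A response cache can be as small as a dictionary keyed on everything that influences the output. This is a minimal in-memory sketch; `call_model` is a hypothetical callable standing in for your API client, and a production version would need persistence and invalidation when the model version changes.

```python
import hashlib
import json

_cache = {}

def cache_key(model, prompt, params):
    """Hash the model name, prompt, and sampling params into a stable key."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_completion(model, prompt, params, call_model):
    """Return the stored response if present; otherwise call and store."""
    key = cache_key(model, prompt, params)
    if key not in _cache:
        _cache[key] = call_model(model, prompt, params)
    return _cache[key]
```

Including the sampling parameters in the key matters: the same prompt at a different temperature is a different request.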

Q: What is the relationship between temperature and hallucination?

Higher temperature increases hallucination risk because lower-probability tokens (which include factually incorrect completions) become more likely to be selected. But temperature 0 does not eliminate hallucinations - it just deterministically picks the most probable completion, which can still be factually wrong. The model’s confidence is not calibrated to truth.

Q: Should I use top-p or temperature? When do I use both?

For most applications, tuning temperature alone is sufficient. Top-p is useful when you want adaptive filtering - it shines for open-ended generation where the model’s confidence varies significantly between steps. Using both is fine if you understand the interaction: temperature shapes the distribution, then top-p truncates the tail. Avoid extreme values of both simultaneously.

Interview questions

Q: You are building a customer-facing chatbot. Users complain that responses are repetitive and robotic. What sampling parameters would you adjust, and what tradeoffs do you accept?

Increase temperature from 0 to 0.4-0.6 for response variety. Add a small frequency penalty (0.3-0.5) to reduce word repetition. The tradeoff: slightly higher hallucination risk and less reproducible outputs. Mitigate by keeping factual retrieval separate from response generation - keep the retrieval and fact-extraction steps deterministic (temperature 0 wherever an LLM is involved), and generate the user-facing response at moderate temperature.

Q: Your classification pipeline uses an LLM to categorize support tickets into 5 categories. Results are inconsistent between runs. Diagnose and fix.

The pipeline likely uses non-zero temperature, causing the model to sometimes pick alternative (incorrect) classifications when probabilities are close. Fix: set temperature to 0, set max tokens to a small value (just enough for the category label), use a stop sequence after the category name. For borderline cases where the model is genuinely uncertain, implement confidence thresholds by examining logprobs and routing low-confidence tickets to human review.
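The confidence-threshold step might look like the sketch below. It assumes you have already extracted a log probability for each candidate category label from the API's logprobs output; the dictionary shape, the 0.8 threshold, and the `human_review` route name are all hypothetical.

```python
import math

def route_ticket(label_logprobs, threshold=0.8):
    """Pick the most likely label, or route to review if confidence is low.

    label_logprobs: dict mapping category name -> log probability.
    Returns (route, confidence) where confidence is a plain probability.
    """
    label, logprob = max(label_logprobs.items(), key=lambda kv: kv[1])
    confidence = math.exp(logprob)  # convert log probability to probability
    if confidence >= threshold:
        return label, confidence
    return "human_review", confidence

clear = {"billing": math.log(0.9), "bug": math.log(0.1)}
borderline = {"billing": math.log(0.55), "bug": math.log(0.45)}
print(route_ticket(clear))
print(route_ticket(borderline))
```

The threshold is a tuning knob: raising it sends more tickets to humans in exchange for fewer wrong automatic classifications.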

Q: Explain the tradeoff between temperature and top-p. When would you use one vs the other vs both?

Temperature rescales every logit by the same factor, which sharpens or flattens the whole distribution without removing any token from it. Top-p adaptively truncates based on cumulative probability. Temperature alone is sufficient for most tasks. Top-p is valuable when you want the model to self-regulate: in high-confidence steps it considers few tokens, in low-confidence steps it explores more. Using both lets you shape (temperature) and then truncate (top-p), which is useful for creative tasks where you want controlled variety without the risk of truly random tokens from the deep tail.