Temperature & Sampling: Controlling Randomness in LLM Outputs
You build a code generation feature. Users describe what they want, and the LLM writes the code. During testing, you notice something strange: the same prompt sometimes produces correct Python and sometimes produces subtly broken Python. One run gives you a clean for-loop. The next run gives you a list comprehension that mishandles edge cases. Same prompt, different outputs, inconsistent quality.
You check your API call. Temperature is set to 1.0 - the default. You are letting the model roll dice on every single token in your code output. For a creative writing tool, this variance might be desirable. For code generation, it is a reliability problem hiding behind a single parameter.
Understanding sampling is the difference between building AI features that are consistently useful and ones that are unpredictably brilliant or broken.
What happens after the model computes
When an LLM processes your prompt, the final layer outputs a vector of raw scores (logits) - one score for every token in the vocabulary. For a vocabulary of 100K tokens, you get 100K numbers. These scores are not probabilities yet. They need to be converted through softmax into a probability distribution that sums to 1.
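As a minimal sketch of that conversion, the snippet below applies softmax to a toy 4-token vocabulary. The logit values are made up for illustration; a real model emits one score per vocabulary entry.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the max keeps exp() from overflowing; it does not change the result.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 4-token vocabulary with made-up logits.
logits = [4.0, 2.0, 1.0, 0.5]
probs = softmax(logits)
print([round(p, 3) for p in probs])
```

The highest logit ends up with the largest probability, but every token keeps a non-zero share.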
The question is: how do you pick the next token from this distribution? You have several strategies, and temperature is the most important control.
graph TD
A["Logits: raw scores for 100K tokens"] --> B["Apply Temperature"]
B --> C["Softmax → Probability Distribution"]
C --> D["Apply top-k filter"]
D --> E["Apply top-p (nucleus) filter"]
E --> F["Sample from remaining tokens"]
F --> G["Selected token"]
style A fill:#F1EFE8,stroke:#888780,color:#444441
style B fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style C fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style D fill:#E1F5EE,stroke:#0F6E56,color:#085041
style E fill:#E1F5EE,stroke:#0F6E56,color:#085041
style F fill:#FAEEDA,stroke:#854F0B,color:#633806
style G fill:#FAEEDA,stroke:#854F0B,color:#633806
Temperature: sharpening or flattening the distribution
Temperature is a scalar that divides the logits before softmax. The formula is:
P(token_i) = exp(logit_i / T) / sum(exp(logit_j / T))
Where T is temperature. Here is what different values do:
Temperature = 0 (or near 0): The distribution becomes extremely sharp. As T approaches 0, the highest-scoring token takes nearly all the probability mass. Because dividing by exactly 0 is undefined, implementations typically treat T = 0 as a special case and always pick the most likely token, which makes the model deterministic. This is called greedy decoding.
Temperature = 1.0: The logits are used unchanged, so the model samples from the probabilities it learned in training. This is the “natural” randomness level.
Temperature > 1.0: The distribution flattens. Lower-probability tokens get boosted. The model becomes more “creative” but also more likely to produce nonsense.
graph LR
subgraph t0["Temperature near 0"]
A1["'Paris': 99%"]
A2["'Lyon': 0.5%"]
A3["'the': 0.3%"]
A4["others: 0.2%"]
end
subgraph t07["Temperature = 0.7"]
B1["'Paris': 78%"]
B2["'Lyon': 8%"]
B3["'the': 5%"]
B4["others: 9%"]
end
subgraph t15["Temperature = 1.5"]
C1["'Paris': 35%"]
C2["'Lyon': 18%"]
C3["'the': 14%"]
C4["others: 33%"]
end
style A1 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style B1 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style C1 fill:#FAEEDA,stroke:#854F0B,color:#633806
style A2 fill:#F1EFE8,stroke:#888780,color:#444441
style A3 fill:#F1EFE8,stroke:#888780,color:#444441
style A4 fill:#F1EFE8,stroke:#888780,color:#444441
style B2 fill:#F1EFE8,stroke:#888780,color:#444441
style B3 fill:#F1EFE8,stroke:#888780,color:#444441
style B4 fill:#F1EFE8,stroke:#888780,color:#444441
style C2 fill:#F1EFE8,stroke:#888780,color:#444441
style C3 fill:#F1EFE8,stroke:#888780,color:#444441
style C4 fill:#F1EFE8,stroke:#888780,color:#444441
The intuition
Think of temperature as a confidence dial. Low temperature means “pick what you are most confident about.” High temperature means “consider alternatives you are less sure about.” At zero, the model never takes risks. At high values, it takes risks constantly.
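To see the dial in action, here is a small sketch of temperature-scaled softmax. The logits are invented, and handling T = 0 as an argmax special case is an assumption about how implementations typically avoid dividing by zero.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return token probabilities after dividing logits by the temperature."""
    if temperature <= 0:
        # Dividing by zero is undefined, so treat T = 0 as greedy decoding:
        # all probability mass goes to the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0, 0.5]  # made-up scores for a 4-token vocabulary
for t in (0, 0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, the top token's share shrinks and the tail tokens gain probability, which is the flattening described above.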
Top-k sampling: limiting the candidate pool
Top-k sampling restricts the model to only consider the k most probable tokens at each step. All other tokens get their probability set to zero, and the remaining probabilities are renormalized.
- top-k = 1: Identical to greedy decoding (temperature 0)
- top-k = 50: Only the top 50 tokens are considered, regardless of how much probability mass they cover
- top-k = 100000: Effectively no filtering
The problem with top-k is that it uses a fixed number regardless of the distribution shape. Sometimes the model is very confident (top 3 tokens hold 95% of probability), and sometimes it is uncertain (top 50 tokens each have roughly 2% probability). A fixed k does not adapt.
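A minimal top-k filter over an already-computed probability list might look like this. The input probabilities are illustrative, and breaking ties by position is an arbitrary choice of this sketch.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, and renormalize."""
    # Indices of the k largest probabilities (ties broken by position).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

probs = [0.5, 0.25, 0.15, 0.10]
filtered = top_k_filter(probs, k=2)
print([round(p, 3) for p in filtered])
```

Note that k stays at 2 no matter how the mass is spread, which is exactly the rigidity described above.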
Top-p (nucleus sampling): adaptive filtering
Top-p sampling (also called nucleus sampling) is smarter. Instead of fixing the number of candidates, it fixes the cumulative probability mass. You set p = 0.9, and the model considers the smallest set of tokens whose cumulative probability exceeds 90%.
- If the model is confident, this might be just 2-3 tokens
- If the model is uncertain, this might be 50+ tokens
- The filter adapts to the shape of the distribution at each step
top-p = 0.1: Very conservative. Keeps only the most likely tokens needed to reach 10% of the probability mass, which is usually just 1-2 tokens.
top-p = 0.9: The standard “good default.” Cuts off the long tail of unlikely tokens while preserving meaningful alternatives.
top-p = 1.0: No filtering. All tokens remain in the candidate pool.
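The adaptive behavior is easier to see in code. This sketch keeps tokens in descending order until their cumulative probability reaches p; whether the boundary test is "reaches" or "strictly exceeds" varies between implementations and is an assumption here. Both example distributions are made up.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break  # the nucleus now covers at least p of the mass
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]

confident = [0.90, 0.05, 0.03, 0.02]  # one token dominates
uncertain = [0.30, 0.28, 0.22, 0.20]  # mass is spread out

for name, dist in (("confident", confident), ("uncertain", uncertain)):
    survivors = sum(1 for q in top_p_filter(dist, 0.85) if q > 0)
    print(name, "keeps", survivors, "tokens")
```

With the same p, the confident distribution keeps a single token while the uncertain one keeps all four.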
Combining parameters
In practice, these parameters are combined. The typical API call lets you set temperature, top-p, and sometimes top-k together. A common order of application (the exact order can differ between implementations) is:
- Logits are divided by temperature
- Softmax converts to probabilities
- Top-k filter removes all but the top k tokens
- Top-p filter removes tokens beyond the cumulative probability threshold
- Final sampling from whatever remains
Most APIs recommend setting either temperature or top-p, not both aggressively. OpenAI’s API reference, for example, generally recommends altering temperature or top_p but not both.
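Putting the steps together, a single-token sampler could be sketched as follows. The function name and defaults are hypothetical, and measuring the top-p threshold against the pre-filter probabilities (rather than renormalizing after top-k) is one of several choices real implementations make.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=1.0, rng=random):
    """Pick one token index using temperature, then top-k, then top-p."""
    if temperature <= 0:
        # Greedy decoding: no randomness at all.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Steps 1-2: temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Step 3: top-k keeps only the k most probable candidates.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]

    # Step 4: top-p keeps the smallest prefix whose cumulative mass reaches top_p.
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Step 5: sample from the survivors; random.choices normalizes the weights.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]

logits = [4.0, 2.0, 1.0, 0.5]  # made-up scores
rng = random.Random(0)         # fixed seed so the sketch is repeatable
picks = [sample_next_token(logits, temperature=0.8, top_k=3, top_p=0.95, rng=rng)
         for _ in range(1000)]
print({token: picks.count(token) for token in sorted(set(picks))})
```

Passing a seeded `random.Random` makes the sketch repeatable locally; hosted APIs do not necessarily expose an equivalent control.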
Where it breaks: common mistakes
Mistake 1: Temperature 0 for all tasks. Engineers who want determinism set temperature to 0 everywhere. This makes the model repetitive and unable to generate varied content when needed. It also does not guarantee identical outputs - floating point arithmetic differences across hardware can still cause variation.
Mistake 2: High temperature for “better” answers. Higher temperature does not mean better creativity. It means more randomness. As temperature climbs well past 1.0, output from many models degrades into incoherent text, and the exact threshold depends on the model. The sweet spot for creative tasks is typically 0.7-0.9.
Mistake 3: Ignoring sampling for batch tasks. If you are classifying 10,000 documents, temperature 0 gives you reproducible results. Temperature 0.7 means each document might get a different classification on re-run. For batch/evaluation workloads, determinism matters.
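Mistake 4: Not accounting for compounding randomness. Each token is a separate random draw, and every draw is conditioned on the tokens chosen before it, so one early deviation changes everything that follows. For a 500-token response, you are making 500 random choices. Even small per-token variance compounds into significant output-level variance. If each token has a 5% chance of deviating, the chance that all 500 tokens match a previous run is roughly 0.95^500, on the order of 10^-11, so your output is almost certainly different from run to run.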
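The compounding arithmetic is quick to check. This back-of-the-envelope calculation assumes a fixed 5% per-token deviation rate and treats tokens as independent, which is a simplification, since in practice one deviation shifts every later distribution.

```python
# Hypothetical: each token has a 5% chance of differing from the previous run.
per_token_deviation = 0.05
tokens = 500

# Probability that every one of the 500 tokens matches the earlier run.
p_identical = (1 - per_token_deviation) ** tokens
print(f"P(identical {tokens}-token output) = {p_identical:.2e}")

# For comparison, a much smaller 0.1% deviation rate over the same length.
p_identical_small = (1 - 0.001) ** tokens
print(f"P(identical at 0.1% deviation) = {p_identical_small:.2f}")
```

Even at a 0.1% deviation rate, a sizable fraction of 500-token outputs would differ between runs.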
Real-world parameter choices
| Use case | Temperature | Top-p | Reasoning |
|---|---|---|---|
| Code generation | 0 - 0.2 | 0.95 | Correctness over creativity |
| Factual Q&A | 0 - 0.3 | 0.9 | Minimize hallucination risk |
| Summarization | 0.3 - 0.5 | 0.9 | Some paraphrasing variety |
| Creative writing | 0.7 - 0.9 | 0.95 | Varied, interesting prose |
| Brainstorming | 0.9 - 1.2 | 0.95 | Maximum diversity of ideas |
| Classification | 0 | 1.0 | Deterministic, reproducible |
Other sampling parameters worth knowing
Frequency penalty (typically 0 to 2; OpenAI’s API accepts -2.0 to 2.0): Reduces the probability of tokens in proportion to how many times they have already appeared in the output. Prevents repetitive loops. Useful for long-form generation.
Presence penalty (same range): Applies a flat penalty to any token that has appeared at all, regardless of frequency. Encourages the model to introduce new topics.
Max tokens: Hard cap on output length. The model stops generating when it hits this limit or produces a stop token. Always set this to prevent runaway generation in production.
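Stop sequences: Strings that terminate generation immediately when produced. Useful for structured outputs - stop at \n\n for single-paragraph answers, or } for a flat JSON object. Two caveats: many APIs omit the stop string from the returned text, so you may need to append the closing brace yourself, and a } stop cuts nested JSON at the first inner closing brace.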
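For the frequency and presence penalties above, one common formulation subtracts the penalty directly from the logits before sampling. The sketch below follows that form; the function name is hypothetical, and providers may implement the details differently.

```python
from collections import Counter

def apply_penalties(logits, generated_token_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the logits of tokens that already appear in the output so far."""
    counts = Counter(generated_token_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        # Frequency penalty grows with each repeat; presence penalty is a one-time flat cost.
        adjusted[token_id] -= count * frequency_penalty + presence_penalty
    return adjusted

logits = [2.0, 2.0, 2.0]
history = [0, 0, 0, 1]  # token 0 appeared three times, token 1 once, token 2 never
adjusted = apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.3)
print([round(x, 2) for x in adjusted])
```

The token repeated three times is pushed down the most, the token seen once takes a smaller hit, and the unseen token is untouched.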
How to apply this in practice
For production APIs: Start with temperature 0 for deterministic features (classification, extraction, code). Use 0.3-0.5 for tasks needing slight variety (summarization, rephrasing). Reserve 0.7+ for explicitly creative features.
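For evaluation and testing: Use temperature 0 for regression tests and comparisons, where you need the same input to give the same output from run to run. If production runs at a different setting, such as 0.7, also evaluate at that setting by sampling several completions per input and looking at the spread. An eval that passes at temperature 0 says little about an endpoint that runs at 0.7; you are testing a different system.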
For A/B testing outputs: Generate multiple completions at moderate temperature (0.7) and use a separate scoring step (at temperature 0) to pick the best one. This “best-of-N” approach gives you diversity in generation and consistency in selection.
For streaming UX: Temperature does not change how tokens are streamed, only which tokens are chosen. A moderate temperature can make streamed responses feel more natural because the wording is less predictable, while temperature 0 can feel robotic because it always picks the obvious next word.
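The best-of-N approach described above reduces to a few lines once generation and scoring are abstracted away. In this sketch `generate` and `score` are hypothetical callables: in a real system the first would call an LLM at moderate temperature and the second would be a deterministic scorer. The fake versions exist only to make the example runnable.

```python
import random

def best_of_n(prompt, generate, score, n=5):
    """Generate n candidates and return the one with the highest score."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Hypothetical stand-ins for an LLM call and a scoring step.
rng = random.Random(42)

def fake_generate(prompt):
    endings = ["short", "a medium answer", "a much longer detailed answer"]
    return prompt + " " + rng.choice(endings)

def fake_score(text):
    return len(text)  # toy heuristic: longer counts as better

best = best_of_n("Q:", fake_generate, fake_score, n=10)
print(best)
```

The cost scales linearly with N, so this pattern trades extra generation calls for a better chance that one candidate is good.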
FAQ
Q: If I set temperature to 0, will I always get the exact same output?
Not necessarily. Temperature 0 means greedy decoding (always pick the highest probability token), but floating point operations on different hardware can produce slightly different logit values, which can flip the choice when two tokens are nearly tied. Some providers offer a “seed” parameter, but it is usually best-effort rather than a guarantee. Even then, model updates can change outputs. If you need guaranteed identical outputs, cache responses.
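A response cache can be as simple as a dictionary keyed by everything that influences the output. This is a minimal in-memory sketch; `call_model` is a hypothetical function standing in for your provider's client, and a production cache would need persistence and invalidation on model changes.

```python
import hashlib
import json

_cache = {}

def cache_key(model, prompt, params):
    """Stable key built from everything that influences the output."""
    payload = json.dumps({"model": model, "prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_completion(model, prompt, params, call_model):
    """Return a stored response if present; otherwise call the model once and store it."""
    key = cache_key(model, prompt, params)
    if key not in _cache:
        _cache[key] = call_model(model, prompt, params)
    return _cache[key]
```

Including the model name and sampling parameters in the key means a change to either is treated as a new request rather than served stale output.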
Q: What is the relationship between temperature and hallucination?
Higher temperature increases hallucination risk because lower-probability tokens (which include factually incorrect completions) become more likely to be selected. But temperature 0 does not eliminate hallucinations - it just deterministically picks the most probable completion, which can still be factually wrong. The model’s confidence is not calibrated to truth.
Q: Should I use top-p or temperature? When do I use both?
For most applications, tuning temperature alone is sufficient. Top-p is useful when you want adaptive filtering - it shines for open-ended generation where the model’s confidence varies significantly between steps. Using both is fine if you understand the interaction: temperature shapes the distribution, then top-p truncates the tail. Avoid extreme values of both simultaneously.
Interview questions
Q: You are building a customer-facing chatbot. Users complain that responses are repetitive and robotic. What sampling parameters would you adjust, and what tradeoffs do you accept?
Increase temperature from 0 to 0.4-0.6 for response variety. Add a small frequency penalty (0.3-0.5) to reduce word repetition. The tradeoff: slightly higher hallucination risk and less reproducible outputs. Mitigate by keeping factual retrieval separate from response generation - retrieve at temperature 0, generate responses at moderate temperature.
Q: Your classification pipeline uses an LLM to categorize support tickets into 5 categories. Results are inconsistent between runs. Diagnose and fix.
The pipeline likely uses non-zero temperature, causing the model to sometimes pick alternative (incorrect) classifications when probabilities are close. Fix: set temperature to 0, set max tokens to a small value (just enough for the category label), use a stop sequence after the category name. For borderline cases where the model is genuinely uncertain, implement confidence thresholds by examining logprobs and routing low-confidence tickets to human review.
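The confidence-threshold idea can be sketched as follows, assuming you can obtain a log probability for each candidate label (for example, from the logprobs of the first output token). The label names, values, and 0.8 threshold are all made up for illustration.

```python
import math

def route_ticket(label_logprobs, threshold=0.8):
    """Return (label, confidence, needs_review) from per-label log probabilities."""
    # Convert log probabilities back to probabilities and normalize over the labels.
    probs = {label: math.exp(lp) for label, lp in label_logprobs.items()}
    total = sum(probs.values())
    probs = {label: p / total for label, p in probs.items()}
    label = max(probs, key=probs.get)
    confidence = probs[label]
    return label, confidence, confidence < threshold

# Made-up log probabilities for each category label.
clear = {"billing": math.log(0.95), "bug": math.log(0.03), "other": math.log(0.02)}
borderline = {"billing": math.log(0.48), "bug": math.log(0.45), "other": math.log(0.07)}

print(route_ticket(clear))
print(route_ticket(borderline))
```

The borderline ticket still gets a label, but the flag lets the pipeline send it to human review instead of trusting a near coin-flip.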
Q: Explain the tradeoff between temperature and top-p. When would you use one vs the other vs both?
Temperature rescales every logit by the same factor, which sharpens or flattens the whole distribution without removing any token. Top-p adaptively truncates based on cumulative probability. Temperature alone is sufficient for most tasks. Top-p is valuable when you want the model to self-regulate: in high-confidence steps it considers few tokens, in low-confidence steps it explores more. Using both lets you shape (temperature) and then truncate (top-p), which is useful for creative tasks where you want controlled variety without the risk of truly random tokens from the deep tail.