Temperature and top-p, visibly
Move the dials. Watch a probability distribution collapse, flatten, or get its tail trimmed off.
❂ Primer
Skip if you already know the theory; the interactive is right below.
When a model generates the next token, it doesn't return a single answer. It returns a probability distribution over its entire vocabulary — roughly 100,000 numbers that sum to 1. The "response" you see is a weighted die roll on top of those numbers.
Temperature sharpens or flattens that distribution. Divide every logit by T before the softmax: as T approaches 0 the model becomes deterministic (greedy, always the top token), at T=1 it samples from the raw distribution, and at T=2 the curve flattens enough that rare tokens become plausible. Top-p (also called nucleus sampling) is a cutoff: keep only the smallest set of tokens whose cumulative probability exceeds p, and throw the rest away before sampling.
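A minimal sketch of both knobs in Python, assuming a toy logit vector rather than the output of a real model; the function name and the numbers are illustrative only.

import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=None):
    """Toy next-token sampler: temperature scaling, then nucleus (top-p) truncation."""
    rng = rng or np.random.default_rng()
    if temperature <= 0:
        # Greedy limit: as T -> 0, always pick the highest-scoring token.
        return int(np.argmax(logits))

    # Temperature: divide every logit by T, then softmax.
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Top-p: keep the smallest set of tokens whose cumulative probability
    # reaches p; zero out the tail and renormalize before sampling.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cumulative, top_p) + 1)]
    nucleus = np.zeros_like(probs)
    nucleus[keep] = probs[keep]
    nucleus /= nucleus.sum()

    # Weighted die roll over what's left.
    return int(rng.choice(len(probs), p=nucleus))

With the toy logits [4.0, 2.5, 1.0, 0.2], temperature=0.2 returns index 0 almost every time, while temperature=2.0 with top_p=1.0 spreads samples across all four indices.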
The distributions below aren't invented — they're hand-calibrated to the shape real models actually produce. Low-entropy prompts (a factual question, a well-known quote) concentrate most mass on one or two tokens. Open-ended prompts spread it out. The sliders below let you see how each knob reshapes a fixed starting distribution.
▶ Try it
⁂ Notes from the bench
What to watch for, why it matters, and the one thing that usually surprises people.
Things to try
1. T=0 on "The capital of France is"
Slide temperature to 0 on the capital-of-France prompt. The bar chart collapses to a single bar. This is why factual questions at low temperature feel "reliable": the model has nowhere to go but its single most likely token, which here happens to be the right answer.
2. T=2 on any prompt
Crank temperature to 2. Watch the bars level out. Sample a few tokens — you'll get a mix of reasonable and weird. This is the "creativity" regime, and also the "hallucination" regime. They're the same setting.
3. Top-p at 0.3 on the open-ended prompt
Pick the "Once upon a time…" prompt, leave T at 1, slide top-p down to 0.3. Most tokens get struck through — they're out of the nucleus. This is why top-p often produces better-feeling samples than temperature alone: it bounds how weird the model can get, even at higher T.
4. Entropy as a thermometer
The entropy readout next to the chart measures how "spread out" the distribution is, in bits. Factual prompts sit near 0.1 bits. Creative prompts at T=1 might be 3–4 bits. At max T with top-p=1 you'll see the entropy approach log₂(vocab), which is pure noise. Entropy is the single number that captures whether a given decoding step is confident or risky; the calculation is sketched just after this list.
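The readout above is just Shannon entropy in bits. A minimal sketch, assuming the probabilities have already been through the temperature and top-p steps; the numbers are toy values, not the demo's actual distributions.

import numpy as np

def entropy_bits(probs):
    """Shannon entropy of a probability distribution, in bits.

    0 bits: all mass on one token (a collapsed, T=0-style distribution).
    log2(len(probs)) bits: perfectly uniform, the pure-noise ceiling.
    """
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # 0 * log(0) is treated as 0
    return float(-(p * np.log2(p)).sum())

# A peaked "capital of France" shape vs. a flat open-ended one.
print(entropy_bits([0.97, 0.02, 0.01]))        # ~0.22 bits
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # exactly 2 bits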
What you actually ship with
Most APIs let you set temperature (0–1 or 0–2) and top-p (0–1). Reasoning and coding APIs default to very low T for a reason: you want the one right continuation, not a creative variant. Creative-writing APIs go the other way.
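As a rough illustration of how the two knobs usually appear in a request: temperature and top_p are the conventional parameter names in chat-completion-style APIs, but exact names, ranges, and defaults vary by provider, so treat this as a sketch with placeholder model and prompt values.

# Hypothetical request bodies; verify parameter names against your provider's docs.
factual_request = {
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "The capital of France is"}],
    "temperature": 0.0,   # greedy: the one right continuation
    "top_p": 1.0,
}

creative_request = {
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "Once upon a time"}],
    "temperature": 1.2,   # flatter distribution, more variety
    "top_p": 0.9,         # but trim the long, weird tail
}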
A model at T=0 isn't being thoughtful. It's being greedy. Greedy and sampled are both fine ways to answer; they just aren't the same answer.
In a line
Hand-calibrated distributions for five prompts, reshaped live with softmax-over-temperature and nucleus truncation. Includes a sample button for stochastic rolls and an entropy readout for the shape of each decision.
Other experiments
- Exp 001
How a sentence becomes tokens
- Exp 003
What does this prompt actually cost?
- Exp 004
Tokens per second
- Exp 005
How far should the model think?
- Exp 006
Neural language vs a Markov chain
- Exp 007
What each token looks at
- Exp 008
Words in space
- Exp 009
The injection arena
- Exp 010
AI or human?
- Exp 011
Context Tetris
- Exp 012
Magnet flip