The Loop  ·  Issue 017

The Loop

A field journal of the AI frontier — for engineers who ship.

  Lab bench

Experiment №002
filed Apr 21, 2026

explainer

Filed under

  • #sampling
  • #temperature
  • #top-p
  • #softmax

Temperature and top-p, visibly

Move the dials. Watch a probability distribution collapse, flatten, or get its tail trimmed off.

  Primer

Skip if you already know the theory; the interactive is right below.

When a model generates the next token, it doesn't return a single answer. It returns a probability distribution over its entire vocabulary — roughly 100,000 numbers that sum to 1. The "response" you see is a weighted die roll on top of those numbers.

Temperature sharpens or flattens that distribution. Divide every logit by T before the softmax: as T approaches 0 the distribution collapses onto the top token (greedy, deterministic), at T=1 you sample from the model's raw distribution, and at T=2 the curve flattens until rare tokens become plausible. Top-p (also called nucleus sampling) is a cutoff: keep only the smallest set of tokens whose cumulative probability reaches p, throw the rest away, renormalize, and sample from what's left.
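
Here's a minimal sketch of both knobs in code, assuming a toy list of logits invented for illustration (nothing below comes from a real model):

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_p=1.0, rng=None):
    """Temperature-then-nucleus sampling over a vector of next-token logits."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float)

    if temperature <= 0:
        # Limit case: greedy decoding, always the single most likely token.
        return int(np.argmax(logits))

    # Temperature: divide every logit by T, then softmax.
    scaled = logits / temperature
    scaled -= scaled.max()                        # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()

    # Top-p: keep the smallest set of tokens whose cumulative mass reaches p.
    order = np.argsort(probs)[::-1]               # most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    nucleus = order[:cutoff]

    # Renormalize inside the nucleus, then roll the weighted die.
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))

# Five made-up logits standing in for a 100,000-token vocabulary.
logits = [5.0, 3.2, 2.9, 1.0, -1.0]
print(sample_token(logits, temperature=0.7, top_p=0.9))
```

Lowering temperature concentrates mass on the first entry; lowering top_p trims the tail before the draw. That is exactly the reshaping the widget below animates.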

The distributions below aren't invented — they're hand-calibrated to the shape real models actually produce. Low-entropy prompts (a factual question, a well-known quote) concentrate most mass on one or two tokens. Open-ended prompts spread it out. The sliders below let you see how each knob reshapes a fixed starting distribution.

▶  Try it

[Interactive: five prompt distributions, reshaped live by the temperature and top-p sliders]

  Notes from the bench

What to watch for, why it matters, and the one thing that usually surprises people.

Things to try

1. T=0 on "The capital of France is"

Slide temperature to 0 on the capital-of-France prompt. The bar chart collapses to a single bar. This is why factual questions at low temperature feel "reliable": the model has nowhere to go but its top-ranked token, which for a prompt like this is the right answer.

2. T=2 on any prompt

Crank temperature to 2. Watch the bars level out. Sample a few tokens — you'll get a mix of reasonable and weird. This is the "creativity" regime, and also the "hallucination" regime. They're the same setting.

3. Top-p at 0.3 on the open-ended prompt

Pick the "Once upon a time…" prompt, leave T at 1, slide top-p down to 0.3. Most tokens get struck through — they're out of the nucleus. This is why top-p often produces better-feeling samples than temperature alone: it bounds how weird the model can get, even at higher T.

4. Entropy as a thermometer

The entropy readout next to the chart measures how "spread out" the distribution is, in bits. Factual prompts sit near 0.1 bits. Creative prompts at T=1 might be 3–4 bits. At max T with top-p=1, entropy approaches log₂(vocab), which is pure noise. Entropy is the one number that summarizes whether a given step is a confident call or a risky roll; the sketch after this list shows how it's computed.
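
The entropy number itself is a one-liner. A quick sketch with made-up distributions shaped like the factual and open-ended prompts above (illustrative numbers, not the experiment's actual data):

```python
import numpy as np

def entropy_bits(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero entries."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Peaked, "capital of France"-style distribution: almost all mass on one token.
factual = [0.985, 0.010, 0.003, 0.002]
# Flatter, "Once upon a time"-style distribution at T=1.
creative = [0.18, 0.15, 0.12, 0.10, 0.09, 0.08, 0.08, 0.07, 0.07, 0.06]

print(f"factual : {entropy_bits(factual):.2f} bits")   # ~0.13 bits
print(f"creative: {entropy_bits(creative):.2f} bits")  # ~3.2 bits
# A uniform roll over a ~100,000-token vocabulary would be log2(100000) ≈ 16.6 bits.
```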

What you actually ship with

Most APIs let you set temperature (0–1 or 0–2) and top-p (0–1). Reasoning and coding APIs default to very low T for a reason: you want the one right continuation, not a creative variant. Creative-writing APIs go the other way.
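
For concreteness, here is where the two knobs sit in one common request shape (an OpenAI-style chat completion; the model name, parameter ranges, and defaults vary by provider, so treat this as an assumption about one SDK rather than a universal interface):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Low temperature + full nucleus: the "one right continuation" configuration.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name; use whatever you actually ship with
    messages=[{"role": "user", "content": "The capital of France is"}],
    temperature=0.0,      # collapse toward the top token
    top_p=1.0,            # no tail trimming
)
print(response.choices[0].message.content)
```

A common creative-writing setup goes the other way: temperature up toward 1 or above, top_p around 0.9 so the tail stays trimmed even as the curve flattens.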

A model at T=0 isn't being thoughtful. It's being greedy. The deterministic answer and the sampled one are both fine answers; they just aren't the same answer.

In a line

Hand-calibrated distributions for five prompts, reshaped live with softmax-over-temperature and nucleus truncation. Includes a sample button for stochastic rolls and an entropy readout for the shape of each decision.

Other experiments

  1. Exp 001 · How a sentence becomes tokens
  2. Exp 003 · What does this prompt actually cost?
  3. Exp 004 · Tokens per second
  4. Exp 005 · How far should the model think?
  5. Exp 006 · Neural language vs a Markov chain
  6. Exp 007 · What each token looks at
  7. Exp 008 · Words in space
  8. Exp 009 · The injection arena
  9. Exp 010 · AI or human?
  10. Exp 011 · Context Tetris
  11. Exp 012 · Magnet flip