The Loop  ·  Issue 017

A field journal of the AI frontier — for engineers who ship.

  Lab bench

Experiment №005
filed Apr 21, 2026

explainer

Filed under

  • #reasoning
  • #thinking
  • #budget
  • #o1
  • #extended-thinking

How far should the model think?

A single math problem at rising reasoning budgets — watch accuracy climb, then plateau into elaboration.

  Primer

Skip if you already know the theory; the interactive is right below.

Reasoning models (o1, o3, Claude with extended thinking, Gemini 2.5) spend a hidden token budget on internal deliberation before producing a visible answer. Their APIs let you cap that budget — and the shape of the answer changes as you turn the dial.
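On Anthropic's API the cap is a literal request parameter; OpenAI's o-series exposes a coarser reasoning_effort preset ("low" / "medium" / "high") instead. A minimal sketch of the former, assuming the anthropic Python SDK; the model id and the prompt are placeholders, substitute your own:

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id; substitute yours
        max_tokens=8000,                   # must exceed the thinking budget
        thinking={"type": "enabled", "budget_tokens": 4096},  # the dial
        messages=[{"role": "user", "content": "A right triangle has legs 5 and 12. Hypotenuse?"}],
    )

    # The response arrives as typed content blocks: reasoning first, then the answer.
    for block in resp.content:
        if block.type == "thinking":
            print("[thinking]", block.thinking[:80])
        elif block.type == "text":
            print("[answer]", block.text)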

Below, a single geometry problem at rising budgets. The outputs are illustrative but faithful to the pattern: zero budget guesses, a small budget coheres, a medium budget solves it, and a large budget mostly repeats the solution while verifying it.

▶  Try it


  Notes from the bench

What to watch for, why it matters, and the one thing that usually surprises people.

The shape of the curve

Plot accuracy against budget and you usually see a knee, not a line. Below the knee the model can't get the answer at all. Above it, extra budget buys verification, elaboration, and edge-case enumeration — but rarely a different final number. The interesting question isn't "how much thinking do I want" but "where's the knee for this kind of problem".
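One way to make "where's the knee" concrete: treat it as the smallest budget whose measured accuracy sits within a tolerance of the best you observed. A minimal sketch; the numbers are illustrative, not measurements:

    # Measured (budget, accuracy) pairs from a sweep; values here are made up.
    points = [(0, 0.05), (1_000, 0.40), (4_000, 0.90), (16_000, 0.92), (64_000, 0.92)]

    def knee(points, tol=0.02):
        """Smallest budget within `tol` of the best observed accuracy."""
        best = max(acc for _, acc in points)
        return min(budget for budget, acc in points if acc >= best - tol)

    print(knee(points))  # 4000: past here, budget buys verification, not correctness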

Things to notice

Zero-budget outputs can look confident. They can also be very wrong. A model with no thinking budget will still commit to an answer — usually a plausible-sounding number that fails any actual check.

The first correct trace is often the cheapest. If 4K tokens of reasoning got you to the right answer, 16K and 64K probably won't improve correctness. They'll improve confidence calibration — the model will hedge less — but you pay linearly and get a logarithmic return.
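"Pay linearly" is easy to underestimate, because on most APIs thinking tokens bill as output tokens. A back-of-envelope sketch; the per-million-token price is an assumption, substitute your model's rate:

    PRICE_PER_MTOK = 15.00  # assumed output-token price, $ per million tokens

    for budget in (4_000, 16_000, 64_000):
        cost = budget / 1_000_000 * PRICE_PER_MTOK
        print(f"{budget:>6} thinking tokens = ${cost:.2f} per call")

At these assumed rates the 64K call costs 16x the 4K call for, typically, the same final number.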

Overthinking is a real failure mode. At the largest budget, the model here enumerates alternative configurations and second-guesses a problem that was already solved. You can see this in real traces too — the reasoning starts well, then the model finds something to worry about and burns tokens without moving the answer.

The one move that pays

Pick a representative problem from your workload. Run it at 1K, 4K, 16K, 64K. The smallest budget that reliably gets the right answer is your production setting. Every token past it is lighting money on fire.
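As a harness, that whole procedure is a short script. A sketch assuming the anthropic Python SDK; PROBLEM, EXPECTED, and the model id are stand-ins for your workload:

    import anthropic

    client = anthropic.Anthropic()

    PROBLEM = "A right triangle has legs 5 and 12. What is the hypotenuse?"  # stand-in
    EXPECTED = "13"                                                          # stand-in

    def check_answer(text: str) -> bool:
        return EXPECTED in text  # swap in your real verifier: unit test, exact match, ...

    def pass_rate(budget: int, trials: int = 5) -> float:
        hits = 0
        for _ in range(trials):
            resp = client.messages.create(
                model="claude-sonnet-4-20250514",  # assumed model id
                max_tokens=budget + 2_000,         # headroom for the visible answer
                thinking={"type": "enabled", "budget_tokens": budget},
                messages=[{"role": "user", "content": PROBLEM}],
            )
            answer = "".join(b.text for b in resp.content if b.type == "text")
            hits += check_answer(answer)
        return hits / trials

    # Anthropic floors budget_tokens at 1,024, so the sweep starts there; very
    # large budgets may exceed your model's max output length, so adjust the top end.
    for budget in (1_024, 4_096, 16_384, 65_536):
        print(f"{budget:>6}: {pass_rate(budget):.0%}")

The smallest budget that hits your reliability bar is the production setting; pin it, and re-run the sweep whenever you change models.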

In a line

Slider over precomputed model outputs at 0 / 1K / 4K / 16K / 64K thinking tokens. Shows the knee in the accuracy curve and the shape of overthinking.

Other experiments

  1. Exp 001 · How a sentence becomes tokens
  2. Exp 002 · Temperature and top-p, visibly
  3. Exp 003 · What does this prompt actually cost?
  4. Exp 004 · Tokens per second
  5. Exp 006 · Neural language vs a Markov chain
  6. Exp 007 · What each token looks at
  7. Exp 008 · Words in space
  8. Exp 009 · The injection arena
  9. Exp 010 · AI or human?
  10. Exp 011 · Context Tetris
  11. Exp 012 · Magnet flip