The Loop  ·  Issue 017

A field journal of the AI frontier — for engineers who ship.

  Lab bench

Experiment №005
filed Apr 21, 2026

explainer

Filed under

  • #reasoning
  • #thinking
  • #budget
  • #o1
  • #extended-thinking

How far should the model think?

A single math problem at rising reasoning budgets — watch accuracy climb, then plateau into elaboration.

  Primer

Skip if you already know the theory; the interactive is right below.

Reasoning models (o1, o3, Claude with extended thinking, Gemini 2.5) spend a hidden token budget on internal deliberation before producing a visible answer. Their APIs let you cap that budget — and the shape of the answer changes as you turn the dial.
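On Anthropic's API the cap is a literal request parameter; OpenAI's o-series exposes a coarser reasoning_effort preset ("low" / "medium" / "high") instead. A minimal sketch of the former, assuming the anthropic Python SDK; the model id and the prompt are placeholders, substitute your own:

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id; substitute yours
        max_tokens=8000,                   # must exceed the thinking budget
        thinking={"type": "enabled", "budget_tokens": 4096},  # the dial
        messages=[{"role": "user", "content": "A right triangle has legs 5 and 12. Hypotenuse?"}],
    )

    # The response arrives as typed content blocks: reasoning first, then the answer.
    for block in resp.content:
        if block.type == "thinking":
            print("[thinking]", block.thinking[:80])
        elif block.type == "text":
            print("[answer]", block.text)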

Below, a single geometry problem at rising budgets. The outputs are illustrative but faithful to the pattern: zero budget guesses, a small budget coheres, a medium budget solves it, and a large budget mostly repeats the solution while verifying it.

▶  Try it


  Notes from the bench

What to watch for, why it matters, and the one thing that usually surprises people.

The shape of the curve

Plot accuracy against budget and you usually see a knee, not a line. Below the knee the model can't get the answer at all. Above it, extra budget buys verification, elaboration, and edge-case enumeration — but rarely a different final number. The interesting question isn't "how much thinking do I want" but "where's the knee for this kind of problem".
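One way to make "where's the knee" concrete: treat it as the smallest budget whose measured accuracy sits within a tolerance of the best you observed. A minimal sketch; the numbers are illustrative, not measurements:

    # Measured (budget, accuracy) pairs from a sweep; values here are made up.
    points = [(0, 0.05), (1_000, 0.40), (4_000, 0.90), (16_000, 0.92), (64_000, 0.92)]

    def knee(points, tol=0.02):
        """Smallest budget within `tol` of the best observed accuracy."""
        best = max(acc for _, acc in points)
        return min(budget for budget, acc in points if acc >= best - tol)

    print(knee(points))  # 4000: past here, budget buys verification, not correctness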

Things to notice

Zero-budget outputs can look confident. They can also be very wrong. A model with no thinking budget will still commit to an answer — usually a plausible-sounding number that fails any actual check.

The first correct trace is often the cheapest. If 4K tokens of reasoning got you to the right answer, 16K and 64K probably won't improve correctness. They'll improve confidence calibration — the model will hedge less — but you pay linearly and get a logarithmic return.
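"Pay linearly" is easy to underestimate, because on most APIs thinking tokens bill as output tokens. A back-of-envelope sketch; the per-million-token price is an assumption, substitute your model's rate:

    PRICE_PER_MTOK = 15.00  # assumed output-token price, $ per million tokens

    for budget in (4_000, 16_000, 64_000):
        cost = budget / 1_000_000 * PRICE_PER_MTOK
        print(f"{budget:>6} thinking tokens = ${cost:.2f} per call")

At these assumed rates the 64K call costs 16x the 4K call for, typically, the same final number.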

Overthinking is a real failure mode. At the largest budget, the model here enumerates alternative configurations and second-guesses a problem that was already solved. You can see this in real traces too — the reasoning starts well, then the model finds something to worry about and burns tokens without moving the answer.

The one move that pays

Pick a representative problem from your workload. Run it at 1K, 4K, 16K, 64K. The smallest budget that reliably gets the right answer is your production setting. Every token past it is lighting money on fire.
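As a harness, that whole procedure is a short script. A sketch assuming the anthropic Python SDK; PROBLEM, EXPECTED, and the model id are stand-ins for your workload:

    import anthropic

    client = anthropic.Anthropic()

    PROBLEM = "A right triangle has legs 5 and 12. What is the hypotenuse?"  # stand-in
    EXPECTED = "13"                                                          # stand-in

    def check_answer(text: str) -> bool:
        return EXPECTED in text  # swap in your real verifier: unit test, exact match, ...

    def pass_rate(budget: int, trials: int = 5) -> float:
        hits = 0
        for _ in range(trials):
            resp = client.messages.create(
                model="claude-sonnet-4-20250514",  # assumed model id
                max_tokens=budget + 2_000,         # headroom for the visible answer
                thinking={"type": "enabled", "budget_tokens": budget},
                messages=[{"role": "user", "content": PROBLEM}],
            )
            answer = "".join(b.text for b in resp.content if b.type == "text")
            hits += check_answer(answer)
        return hits / trials

    # Anthropic floors budget_tokens at 1,024, so the sweep starts there; very
    # large budgets may exceed your model's max output length, so adjust the top end.
    for budget in (1_024, 4_096, 16_384, 65_536):
        print(f"{budget:>6}: {pass_rate(budget):.0%}")

The smallest budget that hits your reliability bar is the production setting; pin it, and re-run the sweep whenever you change models.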

In a line

Slider over precomputed model outputs at 0 / 1K / 4K / 16K / 64K thinking tokens. Shows the knee in the accuracy curve and the shape of overthinking.

Other experiments

  1. Exp 001 · How a sentence becomes tokens
  2. Exp 002 · Temperature and top-p, visibly
  3. Exp 003 · What does this prompt actually cost?
  4. Exp 004 · Tokens per second
  5. Exp 006 · Neural language vs a Markov chain
  6. Exp 007 · What each token looks at
  7. Exp 008 · Words in space
  8. Exp 009 · The injection arena
  9. Exp 010 · AI or human?
  10. Exp 011 · Context Tetris
  11. Exp 012 · Magnet flip