§ News
By AI Blog Editor
Apr 25, 2026 · 10 min read
GPT-5.5 "Spud" — twice the price, split benchmarks, and a polite request to start your prompts over
OpenAI shipped GPT-5.5 on April 23, doubled the API price, and split the benchmark trophies with Anthropic. The line buried in the prompting guide is the one that should worry you — "treat it as a new model family to tune for, not a drop-in replacement."
On April 23, 2026, OpenAI shipped GPT-5.5. Internal codename, per multiple reports: Spud. Naming your frontier model after a potato is a choice, and it is the second-funniest thing about this launch. The funniest is the price.
GPT-5.5 costs $5 per million input tokens and $30 per million output tokens — exactly double GPT-5.4's $2.50 and $15. The Pro tier lands at $30 input / $180 output. OpenAI's own framing is that the effective cost increase, once you account for fewer tokens per task, is "about 20%." That sentence is doing a lot of work.
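It is worth making that framing explicit. A back-of-envelope sketch (our arithmetic, not OpenAI's — the only assumption is that cost per task scales linearly with tokens used) shows what the "about 20%" claim implies about token usage:

```python
# Back-of-envelope check on the "effective cost increase is about 20%" framing.
# Assumption (ours): cost per task = rate * tokens, so the effective
# multiplier is price_ratio * token_ratio.

price_ratio = 5.00 / 2.50          # GPT-5.5 vs GPT-5.4 input rate: exactly 2x
claimed_effective_increase = 0.20  # OpenAI's "about 20%" framing

# Solve 2.0 * t = 1.2 for t, the tokens-per-task ratio the claim implies.
implied_token_ratio = (1 + claimed_effective_increase) / price_ratio
print(f"Implied tokens per task vs 5.4: {implied_token_ratio:.0%}")
```

In other words, the framing only holds if GPT-5.5 finishes the same task in roughly 40% fewer tokens than 5.4. Whether your workload sees that reduction is an empirical question, not a rate-card one.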
Two days later, on April 25, the API became broadly available, and Simon Willison published a read of OpenAI's accompanying prompting guide. The line he flagged — and it deserves flagging — is OpenAI advising developers to treat 5.5 as "a new model family to tune for, not a drop-in replacement." Translation: every prompt you wrote against 5.4 is now technical debt.
What you actually get for double the money
A frontier model, but a strangely shaped one. GPT-5.5 is natively omnimodal across text, images, audio, and video, has a 1 million-token context window, and is built explicitly for agentic workflows — multi-tool coordination, long-horizon planning, autonomous task execution. The benchmarks, where they exist as numbers anyone can independently verify, split two ways.
GPT-5.5 wins decisively on long-context retrieval, terminal-and-OS automation, and the kinds of multi-step reasoning that vendors love to put in agentic-coding demos. It loses, also decisively, on SWE-Bench Pro — the benchmark closest to "fix a real bug in a real repo." Opus 4.7 still wins that one by nearly six points, and Anthropic's Mythos Preview, which we covered when it landed, is reportedly higher again. The simplest read: OpenAI closed the long-context gap and stayed second on the part of the work that looks most like real engineering.
So the marketing claim — "new class of intelligence" — is half-defensible. There is a real long-context win and a real terminal-automation win. There is no general-purpose dethroning. Whoever is paying the API bill should care about both halves.
The "tune for" tax
This is the part of the launch that vendors do not like to put on the slide.
OpenAI's prompting guide, in plain English, tells developers: do not assume your existing prompts will transfer. Start from the smallest prompt that preserves the product contract. Then re-tune reasoning effort, verbosity, tool descriptions, and output format. Willison's read of this is generous — "interesting that they're advocating starting prompts from scratch" — but the implication for anyone running 5.4 in production is harsher: the migration has a cost, and that cost shows up before you see any of the headline gains.
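One way to take that advice seriously is to treat migration as an eval problem rather than a find-and-replace. A minimal sketch of that loop is below — `call_model` is a stand-in for your actual API client, and the prompts and golden set are illustrative, not OpenAI's:

```python
# Sketch of a prompt-migration check, per the guide's advice: start from the
# smallest prompt that preserves the product contract, then compare it against
# the legacy 5.4-tuned prompt on your own golden set. Everything named here
# (call_model, EVAL_SET, the prompts) is a placeholder for illustration.

def call_model(system_prompt: str, user_input: str) -> str:
    # Placeholder: wire this to your real chat client before trusting scores.
    return user_input.strip().lower()

EVAL_SET = [
    # (input, expected_output) pairs drawn from your product's contract
    ("  REFUND Order #123  ", "refund order #123"),
    ("Cancel SUBSCRIPTION", "cancel subscription"),
]

LEGACY_PROMPT = "…2,000 tokens of 5.4-era instructions…"
MINIMAL_PROMPT = "Normalize the request to lowercase, trimmed text."

def score(system_prompt: str) -> float:
    """Exact-match accuracy of a prompt over the golden set."""
    hits = sum(call_model(system_prompt, x) == y for x, y in EVAL_SET)
    return hits / len(EVAL_SET)

# Re-add tuning (reasoning effort, verbosity, tool descriptions) only where
# the minimal prompt measurably underperforms the legacy one.
print(f"legacy={score(LEGACY_PROMPT):.0%}  minimal={score(MINIMAL_PROMPT):.0%}")
```

The point of the sketch is the order of operations: measure the stripped prompt first, then earn each instruction back with a number attached, rather than porting 2,000 tokens of accumulated 5.4 folklore wholesale.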
There is recent precedent for taking this seriously. Anthropic's April 23 Claude Code postmortem found that one line added to the system prompt — "keep text between tool calls to ≤25 words" — cost three percentage points across their evals. Three percent is the gap between adjacent model tiers. It is the kind of regression a frontier lab spends a quarter trying to close. If a one-line system-prompt nudge can do that on Claude, the idea that you can lift-and-shift a 5.4-tuned prompt onto a fully retrained base model and expect the published benchmark numbers is, charitably, optimistic.
The cost of "double the price plus rewrite every prompt" is not the API rate card. It is the engineer-week per prompt, multiplied by the number of prompts in your stack. "Tune for" is a sentence that costs $500 a head.
What is genuinely new
Two things, separated from the price-and-PR theatre.
Long-context retrieval that actually works at the advertised window. Graphwalks BFS at 1M tokens — 45.4% versus 5.4's 9.4% — is the result that is hardest to fake. Long-context retrieval has been a benchmark embarrassment for the last two model generations: vendors quote million-token windows and then quietly fall to 20% accuracy past 200K. A 5x improvement on graph-traversal at the long end is the kind of thing that lets you stop pretending the context window is real and start using it.
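To make concrete what Graphwalks-style BFS actually measures, here is a toy reconstruction (ours, not OpenAI's harness): serialize a graph as edge lines to pad the context, then ask for every node exactly two hops from a source, graded by exact set match against ground truth:

```python
# Toy Graphwalks-style probe (our reconstruction, not OpenAI's benchmark).
# The edge list is what fills the context window at benchmark scale; the
# model's answer would be parsed into a set and compared against `answer`.
import random

random.seed(0)
N = 200
edges = [(random.randrange(N), random.randrange(N)) for _ in range(600)]

# This serialized edge list is the long-context payload.
prompt = "\n".join(f"{a} -> {b}" for a, b in edges)

def bfs_frontier(edges, source, depth):
    """Ground truth: nodes first reached at exactly `depth` hops from source."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
    seen, frontier = {source}, {source}
    for _ in range(depth):
        frontier = {b for a in frontier for b in adj.get(a, []) if b not in seen}
        seen |= frontier
    return frontier

answer = bfs_frontier(edges, source=0, depth=2)
print(f"context ~{len(prompt)} chars, {len(answer)} nodes at depth 2")
```

The reason this task is hard to fake is the grading: there is exactly one correct set, every member of it is determined by edges scattered across the whole context, and partial retrieval shows up immediately as a wrong answer rather than a plausible one.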
The first fully retrained base model since GPT-4.5. Earlier 5.x releases were post-training improvements over the same foundation. 5.5 is a new pre-training run, which is why OpenAI is willing to say "treat this as a new family." The code-named "Spud" base, NVIDIA's GB200/GB300 optimisation, and the >20% throughput gain at matched latency all point to the same thing: this is the first frontier model whose unit economics were designed around Blackwell-class hardware from the start. If you believe Jensen Huang's 35x cost-per-token claim — and you should believe roughly a third of it — that is the structural justification for OpenAI's "effective costs are only 20% higher" framing. Doubling the rate card and shipping 20%+ throughput gains is not the same trade as just doubling the rate card. It is, however, very easy to mistake one for the other when the invoice arrives.
What to watch
Three things over the next quarter will tell you whether GPT-5.5 lands as a generational shift or as a price hike with a benchmark deck stapled to it.
- Whether the long-context wins survive contact with real workloads. Graphwalks and MRCR v2 are designed benchmarks. The honest test is whether RAG pipelines and agent harnesses with 500K-token working sets actually start retrieving the right thing more often. If they do, the price-double is justified by capability. If they don't, it is a price-double.
- Where Anthropic responds. Opus 4.7 came out April 20, three days before GPT-5.5. Anthropic almost certainly knew the numbers it would face and shipped anyway. The interesting question is whether the next Opus revision lands on the gap GPT-5.5 just opened on long-context, or whether Anthropic concedes that lane and presses harder on SWE-Bench Pro and HLE — the ones it still wins.
- How many shops actually retune. The "new family to tune for" line is OpenAI's polite way of saying don't ship 5.5 with a copy-pasted 5.4 prompt and then complain that it underperforms. The shops that take that seriously will pay the migration tax and see most of the gains. The ones that don't will renew their API contract at double the price and write a Reddit post explaining that the new model is worse.
The thing to sit with is the framing OpenAI has chosen: a new pre-training run, a doubled price, and an instruction to start over. Each of those is a defensible choice on its own. Bundled, they amount to a bet that customers will pay more, work more, and end up happier than they were a week ago. We will know in about ninety days whether that bet pays out, or whether Spud turns out to be a sentence that costs five figures a quarter.
* * *
Thanks for reading. If a line here was useful — or plainly wrong — the comments are below and the newsletter has your back.
Elsewhere in this issue
- 01 · News · A trillion-dollar Anthropic — the number that lives only on Forge · Apr 27, 2026
- 02 · News · Decoupled DiLoCo — Google teaches frontier training to survive a bad fibre and a dead chip · Apr 26, 2026
- 03 · News · Three bugs in the harness — Anthropic's Claude Code postmortem, and the system prompt that cost 3% · Apr 24, 2026
Letters
Arguments, corrections, questions. Anonymous comments allowed; be kind, be specific.