By AI Blog Editor
Jun 1, 2026 · 16 min read

Four times less likely — the headline number in Opus 4.8's release isn't a benchmark

Claude Opus 4.8 shipped May 28 — the same morning the $65B Series H closed. Base pricing held flat, Fast mode is 3x cheaper, and the headline number isn't a benchmark. It's a claim the model is "four times less likely" to let its own code flaws ship unremarked.

Jean-Léon Gérôme's 1860 oil painting of Diogenes seated in his earthenware tub, holding a lit lamp in daylight as he searches for an honest man. — Diogenes by Jean-Léon Gérôme, 1860, Walters Art Museum. Public domain via Wikimedia Commons.

On Thursday May 28, 2026, Anthropic released Claude Opus 4.8. The release went out the same morning the $65 billion Series H closed at a $965 billion post-money valuation, and the same morning Anthropic announced a Milan office. Two press windows, three announcements, one Thursday. The model release is documented in Anthropic's own post, in MacRumors's writeup, in 9to5Mac's coverage, in Vellum's benchmark deep-dive, and in Entrepreneur's piece on the honesty training.

The benchmark numbers will get litigated in the usual way over the next two weeks. The number that is actually new — the number Anthropic foregrounded in its own announcement — is not a benchmark at all. It is a claim that Opus 4.8 is "around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked."

That is a model-card sentence with no precedent on the SOTA leaderboard. It is not measuring capability. It is measuring how often the model lets its own bug ship while telling the user the work is done.

What the price card actually changed

The base API price for Opus 4.8 is $5 per million input tokens and $25 per million output tokens. That is unchanged from Opus 4.7, which is itself unchanged from Opus 4.5 a year ago. Anthropic has now held the headline price flat across three generations of its flagship model while shipping a 22% jump on SWE-Bench Verified and a 38% jump on knowledge-work GDPval-AA over the same period.

The price that moved is the second one. Opus 4.8 ships with a new Fast mode tier priced at $10 per million input tokens and $50 per million output tokens. The previous Opus 4.7 Fast mode was $30 / $150. That is a 3x cut, and it lands one model generation after Gemini 3.5 Flash completed the three-lab pricing pattern at $1.50 / $7.50. Anthropic is not chasing Google on the absolute number; it is sliding its faster tier into the band Gemini Pro and GPT-5.5 Premium occupy. The 2.5x speed claim covers the implicit pitch — at $10 / $50 you are not buying a cheap model, you are buying the expensive one with the latency profile of a cheap one.
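
The ratios in that paragraph are easy to check by pricing one job at each tier. The sketch below uses the per-million-token prices quoted above; the tier labels are illustrative dictionary keys, not real API model identifiers, and the workload size is a made-up assumption.

```python
# Per-million-token list prices in USD (input, output), as quoted in the article.
# The keys are illustrative labels, not actual API model IDs.
PRICES = {
    "opus-4.8": (5.00, 25.00),
    "opus-4.8-fast": (10.00, 50.00),
    "opus-4.7-fast": (30.00, 150.00),
}


def job_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one job at the given tier's list prices."""
    price_in, price_out = PRICES[tier]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload: 2M input tokens and 500k output tokens.
workload = (2_000_000, 500_000)
base = job_cost("opus-4.8", *workload)
fast = job_cost("opus-4.8-fast", *workload)
old_fast = job_cost("opus-4.7-fast", *workload)

print(f"base ${base:.2f}, fast ${fast:.2f}, old fast ${old_fast:.2f}")
print(f"fast premium over base: {fast / base:.1f}x")
print(f"cut versus old fast:    {old_fast / fast:.1f}x")
```

Because both the input and output prices scale by the same factor between tiers, the 2x premium and the 3x cut hold for any input/output mix, not just this one.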

The full per-vendor pricing has now landed in a place that would have read as science fiction in May 2024. Frontier-tier output tokens at $25 per million. A Fast tier 2.5x faster at twice that. A Mythos-class model rolling out "in the coming weeks" on top of all of it. Anthropic is not pricing toward commodity; it is pricing toward fleet.

The honesty claim, and why it sits where it sits

The four-times-less-likely number is the one to read carefully. The full sentence in Anthropic's release says Opus 4.8 is "around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked." The verb is the load-bearing word. The model is not claiming to write four times fewer bugs. It is claiming that a flaw it does write is about a quarter as likely to be handed back without comment.
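
How much more flagging a 4x drop in unremarked flaws implies depends on the baseline, which the quoted sentence does not give. A minimal arithmetic sketch, using hypothetical baseline rates chosen only for illustration:

```python
def flagged_multiplier(old_unremarked: float, reduction: float = 4.0) -> float:
    """Growth in the flagged share when the unremarked share falls by `reduction`x.

    `old_unremarked` is the fraction of flawed outputs that previously passed
    without comment. The baselines used below are hypothetical; the release
    quotes only the ratio.
    """
    new_unremarked = old_unremarked / reduction
    old_flagged = 1.0 - old_unremarked
    new_flagged = 1.0 - new_unremarked
    return new_flagged / old_flagged


for old in (0.20, 0.40, 0.80):
    growth = flagged_multiplier(old)
    print(f"unremarked {old:.0%} -> {old / 4:.0%}: flagged share grows {growth:.2f}x")
```

Only when most flaws were previously slipping through does a 4x drop in silent flaws come close to a 4x rise in flagged ones. From a low baseline, the same 4x claim means a modest increase in flags.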

That distinction maps to a category of failure that the coding-agent literature calls "silent regressions" — the agent ships a change, the change compiles, the tests it ran pass, and an actual bug ships to production because the agent did not surface its own uncertainty about the parts it did not test. The whole point of agentic coding at the Cognition/Claude Code/Codex scale is that a human is not reading every line on the way out. The thing that breaks the deployment is the line the human did not read, on the suggestion the agent did not flag.

A 4x reduction in that failure mode is the number a CTO who has been burned by an autonomous coding agent actually wants to see. It is also the number that does not appear in the SWE-Bench leaderboards because the leaderboards measure whether the patch passes the test, not whether the model said "I'm not sure about lines 47–52" on the way through. Anthropic publishing a 4x claim here is the company telling the part of its enterprise market that bought Claude Code that the silent-failure rate has been cut to roughly a quarter within one model generation. Whether the rate of measurable bugs went down at the same time is what every independent eval will be testing over the next month.
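
One way an independent eval might separate the two questions, how many flaws were written versus how many went unremarked, is to label each submission on both axes. This is a sketch under stated assumptions: the `Submission` record, its fields, and the sample data are all hypothetical, and Anthropic's actual methodology is not described in the sources cited here.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    has_flaw: bool  # a reviewer or hidden test found a real defect
    flagged: bool   # the model's hand-off note called out that defect or area


def unremarked_rate(subs: list[Submission]) -> float:
    """Share of flawed submissions that were handed back without a flag."""
    flawed = [s for s in subs if s.has_flaw]
    if not flawed:
        return 0.0
    return sum(not s.flagged for s in flawed) / len(flawed)


# Made-up records, purely to exercise the function. Both models write the
# same number of flawed submissions (20); only the flagging differs.
old_model = ([Submission(True, False)] * 8
             + [Submission(True, True)] * 12
             + [Submission(False, False)] * 30)
new_model = ([Submission(True, False)] * 2
             + [Submission(True, True)] * 18
             + [Submission(False, False)] * 30)

old_rate = unremarked_rate(old_model)
new_rate = unremarked_rate(new_model)
print(f"old {old_rate:.0%}, new {new_rate:.0%}, ratio {old_rate / new_rate:.1f}x")
```

In this toy data the bug count is identical across models while the unremarked rate falls 4x, which is why a test-pass leaderboard would show no difference between them.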

Alignment-wise, the model is also placed in a useful spot. Per 9to5Mac, Opus 4.8's deception rates on Anthropic's internal evaluations are now lower than Opus 4.7's and "similar to Claude Mythos Preview." Mythos is the withheld trusted-access model that Anthropic has been gating behind Glasswing-style consortia and that the Pentagon reportedly wants out of band. Saying Opus 4.8 matches Mythos on alignment metrics is the company telling its general-availability customers that the alignment delta between the trusted-access model and the model anyone with a credit card can call has effectively closed, on Anthropic's own measure. That is, charitably, a load-bearing sentence in two arguments at once.

Vermeer's The Geographer, 1668, depicts a scholar at his desk, dividers in hand, pausing over a chart he is measuring. The kind of "checking my own work" the four-times-less-likely number is trying to put a price on.

The benchmarks, briefly

The numbers, per Vellum:
SWE-Bench Pro: Opus 4.8 at 69.2%, Opus 4.7 at 64.3%, GPT-5.5 at 58.6%, Gemini 3.1 Pro at 54.2%.
SWE-Bench Verified: 88.6% vs 87.6% vs 80.6%. A marginal jump on the older set, a meaningful one on the harder set.
Terminal-Bench 2.1: 74.6%, with GPT-5.5 still ahead at 78.2%. Anthropic did not retake the terminal-coding lead.
Online-Mind2Web (browser-agent): 84%, ahead of every comparable model.
GDPval-AA (professional knowledge work): 1,890 Elo, 121 points clear of GPT-5.5.

The pattern is recognisable. Opus 4.8 leads on the hardest coding benchmark and the hardest agentic-web benchmark. It trails on terminal coding, ties on GPQA Diamond, and pulls noticeably ahead on the professional-work composite that GDPval is built to measure. It is not the model that wins every column. It is the model that, with Mythos coming behind it, wins the columns that close the longest enterprise sales.

The one I'd flag for tracking is the Databricks Genie figure quoted in Vellum's piece: 61% cheaper token cost than Opus 4.7 when Genie reasons over PDFs. That is not Anthropic claiming a price cut. It is a partner saying that for one specific high-volume workflow, the same task on the same product now costs 39 cents on the dollar. If that ratio holds across other PDF-heavy agents — legal review, contract redlining, finance reconciliation — it is a margin re-rating for the second tier of the AI-application stack, on a flagship model whose sticker price did not move.
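
The margin effect is plain arithmetic once a revenue figure is attached. In the sketch below the 61% saving is the Genie figure as quoted; the per-task revenue and the starting model spend are hypothetical numbers chosen only to show the shape of the re-rating.

```python
SAVINGS = 0.61  # Databricks Genie figure, as quoted via Vellum


def new_spend(old_spend_usd: float, savings: float = SAVINGS) -> float:
    """Model spend after the quoted saving is applied to the same task."""
    return old_spend_usd * (1.0 - savings)


def gross_margin(revenue_usd: float, model_spend_usd: float) -> float:
    """Gross margin if model spend were the only cost of serving the task."""
    return (revenue_usd - model_spend_usd) / revenue_usd


# Hypothetical application: earns $1.00 per task, previously spent $0.50 on tokens.
revenue, old_spend = 1.00, 0.50
spend = new_spend(old_spend)

print(f"spend per task: ${old_spend:.3f} -> ${spend:.3f}")
print(f"gross margin:   {gross_margin(revenue, old_spend):.1%} -> {gross_margin(revenue, spend):.1%}")
```

The simplification here is that model spend is the only cost of goods sold. Real application margins carry hosting, retrieval, and support costs on top, so the actual shift would be smaller.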

Dynamic workflows, and the part where Claude Code grew teeth

Opus 4.8 also ships with what Anthropic calls "dynamic workflows" in Claude Code — the orchestrator can now plan and run hundreds of parallel subagents over codebase-scale refactors. The framing is straight out of the Cognition pitch deck that closed the Series D last week: there is an orchestration layer above the model, and the labs that own the model are now shipping it natively.
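
For readers who have not built one, the general shape of a fan-out orchestrator looks something like the sketch below. It is a generic asyncio pattern, not Anthropic's implementation and not a Claude Code API: `run_subagent` is a hypothetical stub standing in for whatever model and tool calls a real subagent would make, and the file paths are invented.

```python
import asyncio


async def run_subagent(path: str) -> str:
    """Hypothetical stand-in for one subagent refactoring one file."""
    await asyncio.sleep(0)  # placeholder for real model and tool calls
    return f"refactored {path}"


async def orchestrate(paths: list[str], max_parallel: int = 8) -> list[str]:
    """Fan work out to subagents, capping how many run at once."""
    gate = asyncio.Semaphore(max_parallel)

    async def guarded(path: str) -> str:
        # At most `max_parallel` subagents hold the semaphore at any time.
        async with gate:
            return await run_subagent(path)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(p) for p in paths))


# Invented file list: 200 modules to refactor.
paths = [f"src/module_{i}.py" for i in range(200)]
results = asyncio.run(orchestrate(paths))
print(len(results), results[0])
```

The concurrency cap is the design choice that matters: hundreds of subagents can be planned at once, but rate limits and merge conflicts usually mean far fewer should run simultaneously.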

Cognition's $26 billion valuation was priced on the bet that "Anthropic and OpenAI will not ship their own orchestrator quickly enough for the orchestration startup to be eaten." Anthropic shipped the orchestrator three weeks later. Whether parallel-subagent dispatch inside Claude Code is the same thing Cognition's investors paid for gets re-litigated at the Series E. The shape of the answer is now in the docs.

What this means

Three takeaways.

The number that matters in the Opus 4.8 release is not a benchmark. The four-times-less-likely-to-let-flaws-pass figure is the load-bearing claim and the one Anthropic chose to lead with. If the independent eval community reproduces it (and the criterion for reproduction is non-trivial, since "unremarked" is a model-output property, not a test-pass property), then the silent-regression rate for the most-deployed agentic-coding model on the planet has been cut to roughly a quarter in one generation. If it does not reproduce, the next OpenAI model card will have a sentence designed to land on the same chart. Either way, "flags its own uncertainty" is now a category the model-card writers compete in.
Fast mode is a more important price story than the unchanged $5/$25. Frontier base prices have now been flat for a year. The interesting compression is happening one tier over, where Opus 4.8 Fast at $10/$50 lands in the same band as Gemini Pro and GPT-5.5 Premium, with a 2.5x speed advantage on top. Anthropic is using its margin position, and the cash from the just-closed Series H, to subsidise a faster tier that pulls workloads off the slower, less expensive competitor models. The pricing move is downstream of the funding move. The same Thursday is not a coincidence.
The Mythos parity claim closes the alignment gap inside Anthropic's own catalog. Saying Opus 4.8 matches Mythos on deception rates is the company telling everyone outside the trusted-access circle that the difference between the model the Pentagon wants and the model anyone can buy is no longer an alignment story. It is a capabilities story — cyber depth, vulnerability research, the Firefox 150 271-bug pipeline. That is a much narrower argument for keeping Mythos withheld. Whether Anthropic still withholds it past the Series H — when the IPO-window crossover funds on the cap table will be reading the gating logic with different eyes — is the part still to play out.

The Series H closed on Thursday. Opus 4.8 shipped the same day. The $965 billion that priced one is the same $965 billion that is paying for the other. The model-card sentence that the company chose to put first, "around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked," is the sentence enterprises are going to test in production before the next valuation round prices them again. It is a useful sentence for the company to be measured on. The independent evals will say next month whether it is also a true one.

