Anthropic Releases Claude Opus 4.8 — The Only Model to Ace Every Super-Agent Benchmark, and It Costs 61% Less to Tell You Your Code Is Wrong

🤚 The Open-Palm Upgrade

In a move that surprised absolutely no one who has been watching Anthropic ship models like a bakery ships croissants, the company announced Claude Opus 4.8 on May 28th — an incremental but decidedly pointed upgrade to their flagship reasoning model.

The headline numbers are, as is tradition, designed to make competitors quietly close their browser tabs:

  • Four times better at catching flaws in written code compared to Opus 4.7 — which means your AI reviewer now has the editorial disposition of a tenured professor who has seen too many undergrad submissions
  • The only model to complete every case end-to-end on the Super-Agent benchmark, beating both its predecessor and GPT-5.5 at cost parity
  • 84% on Online-Mind2Web for browser-agent testing — a benchmark most humans couldn’t pass if you gave them the answer key
  • The highest score on the Legal Agent Benchmark, and the first model to break 10% on the all-pass standard, which sounds modest until you realize no other model has managed it at all
  • 61% cheaper token cost than Opus 4.7 for multimodal reasoning over unstructured content — because apparently making the intelligence more affordable is now a competitive sport

The pricing structure remains mercifully unchanged at $5 per million input tokens and $25 per million output tokens, while the new “fast mode” runs at 2.5× speed for double the price, which is still 3× cheaper than the previous fast tier. The model is available immediately via the API under claude-opus-4-8.

👐 The Two-Handed Feature Parade

Beyond the raw benchmark improvements, Opus 4.8 arrives with three features that suggest Anthropic is no longer building a chatbot but rather a middle manager with a Ph.D. and a taste for parallelism.

Effort Control lets users dial how much computational effort Claude applies to a given task — higher settings for deep thinking, lower for speed. It’s essentially a throttle for intelligence, which is the kind of feature you invent when your model is smart enough to overthink a grocery list. Available on all plans.

Dynamic Workflows — currently in research preview — allows Claude to orchestrate hundreds of parallel subagents within Claude Code, enabling codebase-scale migrations involving hundreds of thousands of lines. This is the feature that will make engineering managers either ecstatic or deeply existential, depending on how they feel about their job security. Available for Enterprise, Team, and Max plans.

And then there’s the Messages API update, which now supports system entries mid-conversation. This sounds mundane until you realize it means you can redirect Claude’s behavior mid-task without breaking prompt caching — effectively giving developers the ability to change the mission briefing while the agent is already in the field.

🌿 The Gentle Awakening

What is perhaps most interesting about Opus 4.8 is not what it does better, but what it refuses to pretend it knows. Anthropic’s alignment assessment reveals the model has reached “new highs on measures of prosocial traits” with substantially lower rates of misaligned behavior — matching the Claude Mythos Preview, which is the model the U.S. government is currently using to hunt for vulnerabilities in national infrastructure.

The model now proactively flags its own uncertainties and avoids unfounded claims. In other words, Claude Opus 4.8 is the first AI model with imposter syndrome that is actually well-calibrated. It knows what it doesn’t know, and it will tell you about it — a trait that puts it ahead of approximately 94% of LinkedIn thought leaders.

Anthropic also teased that lower-cost models with Opus-comparable capabilities are coming, alongside the general availability of Project Glasswing’s Mythos-class models “within weeks.” For those keeping score, Mythos is the model that found 10,000 vulnerabilities in a month during its limited cybersecurity deployment. Making that generally available is either an act of extraordinary public service or the most ambitious penetration test in history.

👑 The Gold-Leaf Reckoning

The release lands against the backdrop of Anthropic’s $65 billion Series H at a $965 billion post-money valuation — a number that would be absurd in any context except the one where your AI model is simultaneously the best coder, the best lawyer, and the most honest conversationalist in every room it enters.

What Opus 4.8 represents is not a leap but a tightening. Every edge is sharper, every cost is lower, every alignment metric is higher. The model doesn’t hallucinate less because it was punished for hallucinating — it hallucinates less because it was taught to say “I don’t know,” which is the most expensive sentence in Silicon Valley.

The competitive implications are straightforward: OpenAI now has to match not just the capability, but the honesty. Google has to explain why Gemini still confidently generates wrong answers at higher cost. And every enterprise CTO who delayed their AI integration is now looking at a model that catches four times more code bugs, runs legal workflows no other model can complete, and costs less than it did last month.

The arms race continues. But the ammunition just got cheaper, smarter, and slightly more polite about telling you when you’re wrong.

“The model said ‘I don’t know’ and the board called it a breakthrough. We have reached the point where intellectual humility is a premium feature, and it is priced per million tokens.” — The Slap of Wisdom Model Evaluation Bureau, currently benchmarking its own relevance