May 29, 2026AI10 min read

Claude Opus 4.8: benchmarks, new features, and what changed

Published May 29, 202610 min read

Every AI model confidently tells you it's right. Claude Opus 4.8 is the first one that tells you when it's wrong — and backs that claim with a measurable number: 0% flawed results passed silently. That's not a roadmap item. It shipped on May 28, 2026.

This post covers everything in the release: the benchmark numbers, the four features that landed alongside the model, what the competition looks like right now, and what actually matters if you're building with it.

What shipped on May 28

Four things landed together with the model.

Dynamic Workflows (Claude Code, research preview). Claude can now plan a task, spin up hundreds of parallel subagents in a single session, and verify its own output before handing results back. The headline use case is codebase-scale migrations: kickoff to merge, using your existing test suite as the bar. Previously this kind of work required a team over multiple days. Now it's a single prompt.

Effort control. A new dial alongside the model selector lets you choose how much thinking Claude applies per task. Opus 4.8 defaults to High, which Anthropic describes as the best overall balance of quality and token cost. You can push it to Extra or Max for harder problems; Low for fast, high-volume tasks where depth isn't the priority.

  EFFORT LEVELS
  ─────────────────────────────────────────────────
  Low     Fast responses, minimal reasoning
  High    Default: best quality/cost tradeoff
  Extra   More thinking passes, better on hard tasks
  Max     Maximum reasoning budget, highest token use
  ─────────────────────────────────────────────────
  In Claude Code: xhigh maps to Extra

Fast Mode, 3 times cheaper. Fast Mode already ran at 2.5 times the speed of standard. The new pricing brings it to $10 per million input tokens and $50 per million output tokens, down from what was effectively 3 times higher cost on Opus 4.7. For high-volume inference pipelines, this is a meaningful cost reduction with no model change required.

System entries in the Messages API. Developers can now inject updated instructions mid-task via system entries in the messages array, without breaking the prompt cache or routing the update through a user turn. Small surface-area change, significant for complex agentic pipelines where instructions need to evolve as the task progresses.

The benchmark numbers

The generation-over-generation jump on math is the most striking number in this release.

  OPUS 4.8 vs OPUS 4.7
  ──────────────────────────────────────────────────────
  Benchmark                  Opus 4.7    Opus 4.8    Delta
  ──────────────────────────────────────────────────────
  USAMO 2026 (math)          69.3%       96.7%       +27.4
  SWE-bench Pro (coding)     64.3%       69.2%       +4.9
  SWE-bench Verified         87.6%       88.6%       +1.0
  GraphWalks @ 1M tokens     40.3%       68.1%       +27.8
  Overconfidence rate        Baseline    10x lower   N/A
  Flawed results passed      ~15%        0%          first
  ──────────────────────────────────────────────────────

A 27-point improvement on competition-level mathematics in a single generation is not an incremental update. USAMO problems require sustained multi-step reasoning with no tolerance for errors that compound. Getting from 69% to 96% means the model's approach to hard reasoning changed substantially, not just got tuned at the margins.

The long-context jump is equally significant for practical use. GraphWalks at 1 million tokens measures whether the model can retrieve and reason over relevant information across a very large context window. Going from 40.3% to 68.1% means RAG pipelines, document analysis, and large codebase tasks all get materially more reliable.

How it compares to the rest of the field

  SWE-BENCH VERIFIED: CODING BENCHMARK
  ───────────────────────────────────────────────────────────
  Model               Score    Output $/M    Context
  ───────────────────────────────────────────────────────────
  GPT-5.5             88.7%    $30           128K
  Claude Opus 4.8     88.6%    $25           200K
  DeepSeek V4-Pro     80.6%    $0.87         1M
  Gemini 3.5 Flash    ~65%     $9            1M
  ───────────────────────────────────────────────────────────
  Gemini 3.5 Pro (Jun 2026) benchmarks not yet published.

On SWE-bench Verified, GPT-5.5 (Apr 23) and Opus 4.8 (May 28) are essentially tied: 88.7% vs 88.6%. The practical difference comes down to price and context. Opus 4.8 costs $25/M output vs GPT-5.5's $30, and offers a 200K context window versus 128K. DeepSeek V4-Pro is the biggest surprise: 80.6% on SWE-bench Verified at just $0.87/M output with a 1M context window, making it the strongest open-weights option for teams where cost and context matter more than the last 8 percentage points. Gemini 3.5 Flash (GA May 19) leads on agentic tasks (MCP Atlas: 83.6%, Terminal-Bench: 76.2%) but trails on raw coding accuracy. Gemini 3.5 Pro, due in June, may shift that.

On BenchLM's composite leaderboard, Opus 4.8 scores 93 and GPT-5.5 scores 91. Claude leads on long-context retrieval and honesty metrics; GPT-5.5 holds a narrow edge on SWE-bench Verified (88.7% vs 88.6%) and Terminal-Bench agentic tasks.

The honesty numbers: why they matter more than the benchmarks

The coding leaderboard is useful. The honesty metrics are more interesting.

Opus 4.8's system card reports three numbers that don't appear in standard benchmark suites:

0%: rate of uncritically reporting flawed results. First model in the Claude line to hit this.
3.7%: rate of failing to flag important events to the user. Down from roughly 15% on Opus 4.7.
10x: reduction in overconfidence compared to Opus 4.7.

These numbers matter if you're running agentic workflows unsupervised. A model that silently passes broken results is a liability in any pipeline where you're not manually reviewing output. The difference between a 15% silent-failure rate and 0% is the difference between a tool you can trust with a cron job and one you can't let out of your sight.

The practical implication: Opus 4.8 is more useful in autonomous contexts precisely because it will stop and tell you something is wrong rather than confidently completing the task incorrectly. That's a different kind of reliability from raw benchmark accuracy.

Pricing

  OPUS 4.8 PRICING
  ─────────────────────────────────────────────
  Mode          Input $/M     Output $/M
  ─────────────────────────────────────────────
  Standard      $5            $25
  Fast Mode     $10           $50
  ─────────────────────────────────────────────
  Same as Opus 4.7. Fast Mode is 3x cheaper
  than the equivalent mode on previous models.

Standard pricing is unchanged from Opus 4.7. The notable move is Fast Mode: the same 2.5 times speed multiplier now comes at a significantly lower cost. If you were avoiding Fast Mode on previous models because of price, the calculation changes with 4.8.

What's coming next

Anthropic signalled two things in the release notes.

Lower-cost models with Opus-level capability. The goal is to bring Opus 4.8's reasoning quality down into smaller, cheaper model tiers. Think Sonnet-class pricing with Opus-class outputs on hard problems.

Mythos-class models for general availability. Mythos is Anthropic's internal frontier model, held back while cybersecurity safeguards are finalised. The release notes say this is weeks away. When it ships, Opus 4.8 won't be the ceiling anymore.

Who should update today

Engineers running Claude Code on large codebases. Dynamic Workflows is in research preview now. If you have migrations, multi-file refactors, or test suite generation tasks at scale, this is worth testing immediately. The parallel subagent architecture changes what's possible in a single session.

Teams running high-volume Fast Mode inference. The 3x cost reduction applies without any model change. Lower bills, same speed.

Anyone doing large-context retrieval. The jump from 40.3% to 68.1% on GraphWalks at 1 million tokens is the biggest practical improvement for RAG pipelines and document analysis workloads.

Builders who care about agentic reliability. 0% uncritical flaw reporting changes what you can trust an autonomous workflow to do without supervision. The capability gains compound when the model knows what it doesn't know.

Model ID: claude-opus-4-8. Available now on claude.ai, the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.