Anthropic released Claude Fable 5 on June 9, 2026 — a Mythos-class model made generally available. Launch benchmarks: 95% SWE-bench Verified, ~80% SWE-bench Pro. We ran four first-party eval rounds against Opus 4.8 in Claude Code within 24 hours. Here are the receipts, the pricing math, and the routing guide.
TL;DR: Anthropic released Claude Fable 5 on June 9, 2026 — a Mythos-class model made safe for general availability, while the unrestricted Claude Mythos 5 stays gated behind trusted-access programs. Launch numbers lead agentic coding: 95.0% SWE-bench Verified and ~80% SWE-bench Pro versus GPT-5.5's 58.6%. Pricing is $10/$50 per million tokens (input/output), 1M context, 128K max output — double Opus 4.8's standard rate. We didn't stop at the model card: within 24 hours we ran four head-to-head eval rounds against Opus 4.8 inside Claude Code, with published JSON receipts. The short version: Fable 5 wins on constraint precision and hard clean reasoning; Opus 4.8 keeps winning on situational judgment, accessibility craft, and speed. Routing guide below.
Claude Fable 5 is Anthropic's new flagship, released June 9, 2026. The model ID is claude-fable-5, and it ships as the default model in Claude Code. The unusual part is the lineage: Fable 5 is a Mythos-class model — per Anthropic, it shares its underlying capabilities with Claude Mythos 5, a model the company has kept out of general availability. Fable 5 is the version with safety classifiers attached; Mythos 5 itself remains limited to approved Project Glasswing and trusted-access customers.
Anthropic's launch post says Fable 5 exceeds every model the company has previously made generally available, with the lead widening as tasks get longer and more complex. That's a testable claim, and the launch-window numbers back the agentic-coding half of it convincingly.
Three facts to anchor on:
- claude-fable-5: a drop-in string swap in the API and the default in Claude Code.

The launch-window figures, all vendor-claimed until third parties reproduce them:
| Benchmark | Fable 5 | Field |
|---|---|---|
| SWE-bench Verified | 95.0% | — |
| SWE-bench Pro | ~80% | GPT-5.5: 58.6% · Gemini 3.1 Pro: 54.2% |
| CursorBench (max effort) | 72.9% | — |
| FrontierCode (Diamond + Main) | leads both subsets | — |
The SWE-bench Pro gap is the headline. A 21-point lead over GPT-5.5 on a contamination-resistant agentic-coding benchmark is not a rounding error — it's a generation gap, if it holds up under independent reproduction. For how these models stack across the rest of the field, see the frontier model landscape and the FrankX models tracker.
Benchmarks tell you what a model does on someone else's tasks. We wanted to know what it does on ours — so within 24 hours of release we ran four head-to-head rounds against Opus 4.8 inside Claude Code itself, using the Model Arena harness: same prompt to both models as parallel subagents, ground truth fixed before dispatch, objective tasks verified mechanically, subjective tasks judged blind with shuffled labels, and a JSON receipt published for every run.
| Round | Card | Result |
|---|---|---|
| 1 — Capability | Logic, coding, repo-grounding, voice writing | Correctness parity. Fable 5 was the only model to respect every output-format and length constraint. |
| 2 — Behavioral stress | Governance traps, prompt injection, lying docs, contradictory specs | Fable 5 took it 3–2, but the split was the finding: Fable aced constraint stacks; Opus flagged a governance-gated edit Fable executed silently. |
| 3 — Hard capability | Harder reasoning (no tools), parser build, live agentic repo work | Fable 5, 2-2-0 — including the first correctness failure on record: Opus answered a hard no-tools reasoning task confidently wrong in 2.7 seconds. |
| 4 — Premium work samples | Real component build in a live repo + agentic skill authoring | 1–1 split. Opus built the more rigorously accessible component; Fable authored the sharper system doc. |
Four findings that survive all four rounds:
Every claim above traces to a receipt: the raw JSON for all four rounds is in the arena runs directory, and the methodology lives on the Model Arena research page. The caveats are part of the result: n=1 per task, Claude-family blind judge, and everything measured model-in-harness — the configuration we actually operate, not raw API behavior.
Fable 5 costs $10 input / $50 output per million tokens, with batch at $5/$25. That's exactly double Opus 4.8's standard $5/$25 — and identical to Opus 4.8's fast-mode pricing. So the routing question isn't "is Fable 5 better" — it's "which tasks justify 2×":
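To make the 2× concrete, here is a minimal sketch of the per-request arithmetic, using only the rates quoted above. The function name, the dictionary keys, and the example token counts are illustrative assumptions, not part of any Anthropic SDK.

```python
# Rates in USD per million tokens, as quoted in this article.
# The keys are illustrative labels, not official model IDs.
RATES = {
    "fable-5": {"input": 10.0, "output": 50.0},
    "fable-5-batch": {"input": 5.0, "output": 25.0},
    "opus-4.8": {"input": 5.0, "output": 25.0},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the quoted per-million rates."""
    rate = RATES[model]
    total = input_tokens * rate["input"] + output_tokens * rate["output"]
    return total / 1_000_000


# Hypothetical agentic turn: 200K tokens of context in, 8K tokens out.
fable = request_cost("fable-5", 200_000, 8_000)
opus = request_cost("opus-4.8", 200_000, 8_000)
print(f"Fable 5: ${fable:.2f}  Opus 4.8: ${opus:.2f}  ratio: {fable / opus:.1f}x")
```

Because both the input and output rates double, the ratio stays at 2× regardless of the input/output mix; what changes with your workload is the absolute gap per request.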
| Task shape | Route | Why |
|---|---|---|
| Agentic pipelines feeding schemas, tools, other agents | Fable 5 | Measured constraint precision; the SWE-bench Pro lead is exactly this shape |
| Long-horizon coding (multi-hour, multi-file) | Fable 5 | Anthropic's "lead widens with task length" claim + 95% SWE-bench Verified |
| Ambiguous or possibly-wrong specs; gate-sensitive contexts | Opus 4.8 | It pushes back and flags gates; Fable executes agreeably |
| Deep single-shot prose a human reads | Opus 4.8 standard ($5/$25) | Half the price; style verdicts are contested anyway |
| Bulk fan-out, classification, low-stakes extraction | Haiku / cheaper tiers | Don't pay flagship rates for commodity calls |
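The routing table can be encoded as a small lookup so the decision lives in code rather than in habit. This is a sketch under assumptions: the `TaskShape` enum, the `route` function, and the `gate_sensitive` flag are hypothetical names, and only `claude-fable-5` is a model ID stated in this article. The other values are placeholder labels.

```python
from enum import Enum


class TaskShape(Enum):
    AGENTIC_PIPELINE = "agentic pipeline feeding schemas, tools, other agents"
    LONG_HORIZON_CODING = "long-horizon coding"
    AMBIGUOUS_SPEC = "ambiguous or possibly-wrong spec"
    HUMAN_READ_PROSE = "deep single-shot prose a human reads"
    BULK_FANOUT = "bulk fan-out, classification, low-stakes extraction"


# Only "claude-fable-5" is a model ID given in the article. The other
# values are placeholder labels: replace them with your real model IDs.
ROUTES = {
    TaskShape.AGENTIC_PIPELINE: "claude-fable-5",
    TaskShape.LONG_HORIZON_CODING: "claude-fable-5",
    TaskShape.AMBIGUOUS_SPEC: "opus-4.8-placeholder",
    TaskShape.HUMAN_READ_PROSE: "opus-4.8-placeholder",
    TaskShape.BULK_FANOUT: "haiku-placeholder",
}


def route(shape: TaskShape, gate_sensitive: bool = False) -> str:
    """Pick a model label for a task shape.

    Assumption: gate-sensitive work overrides the task shape and goes to
    the Opus route, mirroring the 'gate-sensitive contexts' table row.
    """
    if gate_sensitive:
        return ROUTES[TaskShape.AMBIGUOUS_SPEC]
    return ROUTES[shape]


print(route(TaskShape.LONG_HORIZON_CODING))
print(route(TaskShape.AGENTIC_PIPELINE, gate_sensitive=True))
```

Treating gate sensitivity as an override rather than as another task shape is a design choice: a pipeline task can be both agentic and gated, and the table suggests the gate should win.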
You don't need an eval platform. The Claude Code Agent tool accepts a per-spawn model override, which makes the CLI itself the harness: dispatch the same task to fable and opus subagents in one parallel block, verify objective tasks with shipped asserts, judge subjective ones with a blind non-contestant model, and write a JSON receipt. The full pattern — task-design rules, the dispatch-verify-judge-receipt loop, and the eval-stack doctrine (arena rounds natively, prompt regression in promptfoo, tracing in Langfuse only once an app serves real users) — is documented in the open-source arena harness README.
The principle that matters more than the tooling: fix ground truth before dispatch, and never promote a claim on a single round. A leaderboard you can't audit is marketing with decimals.
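The dispatch-verify-judge-receipt loop can be sketched in a few lines. This is not the arena harness itself: `run_objective`, `blind`, and the stub contestants are hypothetical stand-ins, the real subagent calls are replaced by plain functions, and dispatch is sequential here rather than parallel.

```python
import json
import random
from typing import Callable, Dict, Tuple

# Hypothetical contestant type: takes a prompt, returns the model's text.
Model = Callable[[str], str]


def run_objective(task_id: str, prompt: str, check: Callable[[str], bool],
                  contestants: Dict[str, Model]) -> dict:
    """Send one prompt to every contestant and verify each output mechanically."""
    outputs = {name: run(prompt) for name, run in contestants.items()}
    passed = {name: bool(check(text)) for name, text in outputs.items()}
    return {"task": task_id, "prompt": prompt, "outputs": outputs, "passed": passed}


def blind(outputs: Dict[str, str],
          rng: random.Random) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Shuffle contestant order and relabel as A, B, ... so a judge sees no names."""
    names = list(outputs)
    rng.shuffle(names)
    key = {chr(ord("A") + i): name for i, name in enumerate(names)}
    blinded = {label: outputs[name] for label, name in key.items()}
    return blinded, key


# Ground truth is fixed here, before any contestant is called.
def check_sum(text: str) -> bool:
    return text.strip() == "4"


# Stub contestants standing in for real subagent calls (made-up outputs).
contestants = {
    "contestant_1": lambda prompt: "4",
    "contestant_2": lambda prompt: "The answer is 4.",
}

receipt = run_objective("arith-001", "Reply with only the number: 2 + 2",
                        check_sum, contestants)
blinded, key = blind(receipt["outputs"], random.Random(0))
receipt["blind_key"] = key  # withheld from the judge, stored for audit
print(json.dumps(receipt, indent=2))
```

The blinded dictionary is what a judge model would see for subjective tasks. The label-to-name key goes only into the receipt, so anyone auditing the run can reconstruct which response belonged to which contestant.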
If your workload is agentic (coding agents, tool pipelines, long-horizon tasks), switching to Fable 5 is worth it, and the swap is a model-string change. Run your own evals first (an afternoon, not a sprint), because the 2× price only pays for itself where constraint precision and task length actually bind. If your workload is judgment-heavy review or human-read prose, Opus 4.8 at half the price remains the honest default. Either way, enforce output contracts in structure, not in trust: Round 4 showed every model's discipline bends under load.
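Enforcing an output contract in structure means the pipeline rejects a malformed response instead of assuming the model complied. A minimal sketch using only the standard library follows; the contract itself (a `verdict` of `pass` or `fail` plus a list of `reasons`) and the `parse_contract` name are hypothetical examples, not a format from the arena harness.

```python
import json

# Hypothetical contract: exactly these fields, with these types.
REQUIRED_FIELDS = {"verdict": str, "reasons": list}
ALLOWED_VERDICTS = {"pass", "fail"}


def parse_contract(raw: str) -> dict:
    """Parse model output; raise ValueError if it violates the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")
    missing = set(REQUIRED_FIELDS) - set(data)
    extra = set(data) - set(REQUIRED_FIELDS)
    if missing or extra:
        raise ValueError(f"missing fields: {sorted(missing)}, extra fields: {sorted(extra)}")
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(data[field], expected):
            raise ValueError(f"field {field!r} must be of type {expected.__name__}")
    if data["verdict"] not in ALLOWED_VERDICTS:
        raise ValueError(f"verdict must be one of {sorted(ALLOWED_VERDICTS)}")
    return data


ok = parse_contract('{"verdict": "pass", "reasons": ["all asserts green"]}')
print(ok["verdict"])

try:
    parse_contract('Sure! Here is the JSON: {"verdict": "pass"}')
except ValueError as err:
    print(f"rejected: {err}")
```

A response wrapped in conversational text fails at the first check, which is the point: the caller can retry or escalate instead of passing a malformed payload downstream.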
Claude Fable 5 is Anthropic's flagship model released June 9, 2026 — a Mythos-class model made safe for general availability. It shares underlying capabilities with the restricted Claude Mythos 5, ships with safety classifiers, and is the default model in Claude Code. The model ID is claude-fable-5.
Fable 5 and Mythos 5 share the same underlying capabilities but differ in access. Fable 5 is generally available with safety classifiers attached. Mythos 5 has safeguards lifted in some areas and is limited to approved Project Glasswing and trusted-access customers.
Launch-window figures: 95.0% on SWE-bench Verified, ~80% on SWE-bench Pro (vs 58.6% for GPT-5.5 and 54.2% for Gemini 3.1 Pro), 72.9% on CursorBench at max effort, and the lead on both FrontierCode subsets. Treat these as vendor-claimed until independently reproduced.
Fable 5 costs $10 per million input tokens and $50 per million output tokens, with batch pricing at $5/$25. That's double Opus 4.8's standard rate and equal to Opus 4.8 fast mode.
Fable 5's context window is 1M input tokens, with a 128K max output per request.
On agentic work, our four first-party eval rounds say Fable 5 is the better model: it led on constraint precision, output discipline, and hard clean reasoning. Opus 4.8 stayed ahead on situational judgment (flagging gated edits, pushing back on contradictory specs), accessibility craft in real component work, and speed. Neither dominates everything; route by task shape.
To run your own comparison inside Claude Code, dispatch the same task to subagents with different model overrides, verify objective tasks with asserts you wrote before dispatch, judge subjective tasks blind with a non-contestant model, and record a receipt. The open-source harness pattern is at the Model Arena.
Analysis by Frank — AI Architect at Oracle's EMEA AI Center of Excellence, building agentic systems and publishing every eval receipt. Launch facts validated against Anthropic's announcement, TechCrunch, VentureBeat, CNBC, and heise. First-party eval data from the Starlight Model Arena, receipts in the open repo. Vendor-claimed figures are marked as such.