Intelligence Dispatches · June 10, 2026 · 9 min read

Claude Fable 5: Benchmarks, Pricing, and What Four Day-One Evals Actually Show

Anthropic released Claude Fable 5 on June 9, 2026 — a Mythos-class model made generally available. Launch benchmarks: 95% SWE-bench Verified, ~80% SWE-bench Pro. We ran four first-party eval rounds against Opus 4.8 in Claude Code within 24 hours. Here are the receipts, the pricing math, and the routing guide.

Frank

AI Architect & Creator

Former Oracle AI architect · helped build Oracle's AI CoE

TL;DR: Anthropic released Claude Fable 5 on June 9, 2026: a Mythos-class model made safe for general availability, while the unrestricted Claude Mythos 5 stays gated behind trusted-access programs. The launch numbers lead the field on agentic coding: 95.0% SWE-bench Verified and ~80% SWE-bench Pro, versus 58.6% for GPT-5.5. Pricing is $10/$50 per million tokens (input/output), double Opus 4.8's standard rate, with a 1M-token context window and 128K max output. We didn't stop at the model card: within 24 hours we ran four head-to-head eval rounds against Opus 4.8 inside Claude Code, with published JSON receipts. The short version: Fable 5 wins on constraint precision and hard clean reasoning; Opus 4.8 keeps winning on situational judgment, accessibility craft, and speed. Routing guide below.

What Is Claude Fable 5?

Claude Fable 5 is Anthropic's new flagship, released June 9, 2026. The model ID is claude-fable-5, and it ships as the default model in Claude Code. The unusual part is the lineage: Fable 5 is a Mythos-class model — per Anthropic, it shares its underlying capabilities with Claude Mythos 5, a model the company has kept out of general availability. Fable 5 is the version with safety classifiers attached; Mythos 5 itself remains limited to approved Project Glasswing and trusted-access customers.

Anthropic's launch post says Fable 5 exceeds every model the company has previously made generally available, with the lead widening as tasks get longer and more complex. That's a testable claim, and the launch-window numbers back the agentic-coding half of it convincingly.

Three facts to anchor on:

Model ID: claude-fable-5 — a drop-in string swap in the API and the default in Claude Code.
Context: 1M input tokens, 128K max output per request.
Position: the generally available ceiling of the Claude line, sitting above Opus 4.8 for agentic work.

What Are the Launch Benchmarks?

The launch-window figures, all vendor-claimed until third parties reproduce them:

Benchmark	Fable 5	Field
SWE-bench Verified	95.0%	—
SWE-bench Pro	~80%	GPT-5.5: 58.6% · Gemini 3.1 Pro: 54.2%
CursorBench (max effort)	72.9%	—
FrontierCode (Diamond + Main)	leads both subsets	—

The SWE-bench Pro gap is the headline. A lead of roughly 21 points over GPT-5.5 on a contamination-resistant agentic-coding benchmark is not a rounding error. If it holds up under independent reproduction, it is a generation gap. For how these models stack across the rest of the field, see the frontier model landscape and the FrankX models tracker.
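The gap arithmetic is easy to check. The sketch below uses only the vendor-claimed SWE-bench Pro figures from the table; since Fable 5's score is quoted as "~80%", it is entered as 80.0 and every gap is approximate. The function name `lead_over` is ours, purely for illustration.

```python
# Launch-window SWE-bench Pro scores as quoted in the table (vendor-claimed).
# Fable 5 is listed as "~80%", so 80.0 is an approximation and so is every gap.
SWE_BENCH_PRO = {
    "Fable 5": 80.0,
    "GPT-5.5": 58.6,
    "Gemini 3.1 Pro": 54.2,
}


def lead_over(scores: dict[str, float], leader: str) -> dict[str, float]:
    """Return the leader's percentage-point lead over every other model."""
    top = scores[leader]
    return {
        name: round(top - score, 1)
        for name, score in scores.items()
        if name != leader
    }


gaps = lead_over(SWE_BENCH_PRO, "Fable 5")
for name, gap in gaps.items():
    print(f"Fable 5 leads {name} by about {gap} points")
```

These are percentage-point differences, not relative improvements, which is the sense in which the article says "21-point".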

What Did Our Own Evals Find? (Fable 5 vs Opus 4.8, Four Receipted Rounds)

Benchmarks tell you what a model does on someone else's tasks. We wanted to know what it does on ours — so within 24 hours of release we ran four head-to-head rounds against Opus 4.8 inside Claude Code itself, using the Model Arena harness: same prompt to both models as parallel subagents, ground truth fixed before dispatch, objective tasks verified mechanically, subjective tasks judged blind with shuffled labels, and a JSON receipt published for every run.
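For readers who want to picture the "judged blind with shuffled labels" step, here is a minimal sketch. It is not the Model Arena harness itself: the function names, the "Response A/B" labels, and the placeholder outputs are all assumptions made for illustration.

```python
import random


def blind_pair(
    outputs: dict[str, str], rng: random.Random
) -> tuple[dict[str, str], dict[str, str]]:
    """Hide model names behind shuffled neutral labels.

    Returns (labelled, key). `labelled` is what the judge sees;
    `key` maps each neutral label back to the model that wrote it.
    """
    models = list(outputs)
    rng.shuffle(models)  # randomize which model becomes "Response A"
    labels = [f"Response {chr(ord('A') + i)}" for i in range(len(models))]
    key = dict(zip(labels, models))
    labelled = {label: outputs[model] for label, model in key.items()}
    return labelled, key


def unblind(verdict_label: str, key: dict[str, str]) -> str:
    """Translate the judge's chosen label back to a model name."""
    return key[verdict_label]


# Placeholder texts; in a real round these come from the two subagents.
outputs = {"fable": "draft text from one model", "opus": "draft text from the other"}
labelled, key = blind_pair(outputs, random.Random())

# The judge only ever sees `labelled`; `key` stays with the harness.
print(sorted(labelled))
```

Keeping `key` out of the judge's prompt is the whole point: the verdict is recorded against a label and only mapped back to a model afterwards.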

Round	Card	Result
1 — Capability	Logic, coding, repo-grounding, voice writing	Correctness parity. Fable 5 was the only model to respect every output-format and length constraint.
2 — Behavioral stress	Governance traps, prompt injection, lying docs, contradictory specs	Fable 5 took it 3–2, but the split was the finding: Fable aced constraint stacks; Opus flagged a governance-gated edit Fable executed silently.
3 — Hard capability	Harder reasoning (no tools), parser build, live agentic repo work	Fable 5, 2-2-0 — including the first correctness failure on record: Opus answered a hard no-tools reasoning task confidently wrong in 2.7 seconds.
4 — Premium work samples	Real component build in a live repo + agentic skill authoring	1–1 split. Opus built the more rigorously accessible component; Fable authored the sharper system doc.

Four findings that survive all four rounds:

Fable 5's measurable edge is output discipline. Across stacked word counts, format contracts, and "output only" rules, it was the most compliant model in every round — 7/7 on a script-verified constraint stack where Opus failed. In agentic pipelines where outputs feed schemas, tools, and other agents, that discipline is a capability, not a nicety.
Opus 4.8 is the judgment instrument. It flagged a governance-gated edit that Fable 5, the default model, executed without comment; it led with the contradiction in an impossible spec; and it was faster, with fewer tool calls, on agentic tasks. It also keeps leaking preambles past strict output contracts: five violations across seven structured-output tasks.
Discipline degrades under load — for every model. Round 4's heavy work samples produced Fable 5's first contract violation. The operational lesson: enforce output contracts structurally (schemas, forced tool outputs), and let model discipline be the second line of defense, not the first.
Style is contested. The blind judge preferred Opus's prose in Round 1 and Fable's in Round 3. Single-judge, n=1 verdicts are not routing evidence — don't reorganize your writing stack over launch-week vibes.
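Enforcing an output contract structurally can be as simple as refusing anything that is not a bare, well-formed object. The sketch below uses only the Python standard library; the `CONTRACT` fields and the `parse_strict` name are hypothetical, not taken from our harness.

```python
import json

# Hypothetical contract: field names and types are illustrative assumptions.
CONTRACT = {"verdict": str, "score": int, "reasons": list}


def parse_strict(raw: str, contract: dict[str, type]) -> dict:
    """Accept only a bare JSON object whose keys and types match the contract."""
    # json.loads raises JSONDecodeError (a ValueError) on a leading preamble
    # or on trailing prose after the object.
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != set(contract):
        raise ValueError(f"key mismatch: {sorted(set(data) ^ set(contract))}")
    for field, expected in contract.items():
        if not isinstance(data[field], expected):
            raise ValueError(f"{field} should be {expected.__name__}")
    return data


clean = parse_strict('{"verdict": "pass", "score": 7, "reasons": []}', CONTRACT)
print(clean["verdict"])

try:
    parse_strict('Sure, here is the JSON: {"verdict": "pass"}', CONTRACT)
except ValueError as err:
    print("rejected:", type(err).__name__)
```

A gate like this turns a leaked preamble into a hard failure the pipeline can retry, instead of a silent corruption downstream.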

Every claim above traces to a receipt: the raw JSON for all four rounds is in the arena runs directory, and the methodology lives on the Model Arena research page. The caveats are part of the result: n=1 per task, Claude-family blind judge, and everything measured model-in-harness — the configuration we actually operate, not raw API behavior.

What's the Pricing — and the Routing Math?

Fable 5 costs $10 input / $50 output per million tokens, with batch at $5/$25. That's exactly double Opus 4.8's standard $5/$25 — and identical to Opus 4.8's fast-mode pricing. So the routing question isn't "is Fable 5 better" — it's "which tasks justify 2×":

Task shape	Route	Why
Agentic pipelines feeding schemas, tools, other agents	Fable 5	Measured constraint precision; the SWE-bench Pro lead is exactly this shape
Long-horizon coding (multi-hour, multi-file)	Fable 5	Anthropic's "lead widens with task length" claim + 95% SWE-bench Verified
Ambiguous or possibly-wrong specs; gate-sensitive contexts	Opus 4.8	It pushes back and flags gates; Fable executes agreeably
Deep single-shot prose a human reads	Opus 4.8 standard ($5/$25)	Half the price; style verdicts are contested anyway
Bulk fan-out, classification, low-stakes extraction	Haiku / cheaper tiers	Don't pay flagship rates for commodity calls
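To make the 2× concrete, here is the per-task arithmetic at the list prices quoted above. The task size (200K input tokens, 20K output tokens) is a made-up example, and the break-even figure assumes Fable 5 succeeds on the first attempt, which is an assumption rather than a measured result.

```python
# Per-million-token list prices in USD, as quoted in this article.
PRICES = {
    "Fable 5": {"input": 10.0, "output": 50.0},
    "Opus 4.8 (standard)": {"input": 5.0, "output": 25.0},
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request at list price."""
    price = PRICES[model]
    return (
        input_tokens * price["input"] + output_tokens * price["output"]
    ) / 1_000_000


# Hypothetical task size, chosen only for illustration.
INPUT_TOKENS, OUTPUT_TOKENS = 200_000, 20_000

fable_cost = cost_usd("Fable 5", INPUT_TOKENS, OUTPUT_TOKENS)
opus_cost = cost_usd("Opus 4.8 (standard)", INPUT_TOKENS, OUTPUT_TOKENS)

# If Fable 5 gets it right first time, Opus 4.8 can afford this many
# attempts on average before Fable 5 becomes the cheaper route.
breakeven_attempts = fable_cost / opus_cost

print(f"Fable 5:  ${fable_cost:.2f} per task")
print(f"Opus 4.8: ${opus_cost:.2f} per task")
print(f"Break-even: {breakeven_attempts:.1f} Opus attempts per task")
```

Because both input and output prices double, the ratio stays at 2 whatever the token mix; only the absolute premium per task changes.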

How Do You Run Your Own Fable 5 Evals in Claude Code?

You don't need an eval platform. The Claude Code Agent tool accepts a per-spawn model override, which makes the CLI itself the harness: dispatch the same task to fable and opus subagents in one parallel block, verify objective tasks with shipped asserts, judge subjective ones with a blind non-contestant model, and write a JSON receipt. The full pattern — task-design rules, the dispatch-verify-judge-receipt loop, and the eval-stack doctrine (arena rounds natively, prompt regression in promptfoo, tracing in Langfuse only once an app serves real users) — is documented in the open-source arena harness README.

The principle that matters more than the tooling: fix ground truth before dispatch, and never promote a claim on a single round. A leaderboard you can't audit is marketing with decimals.
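One way to make "fix ground truth before dispatch" auditable is to fingerprint the expected answer first and store that fingerprint in the receipt. The sketch below is our illustration, not the arena harness's actual receipt format: the field names, the exact-match check, and the placeholder outputs are all assumptions.

```python
import hashlib
import json


def freeze_ground_truth(expected: str) -> str:
    """Fingerprint the expected answer before any model is called."""
    return hashlib.sha256(expected.encode("utf-8")).hexdigest()


def build_receipt(
    task_id: str, expected: str, fingerprint: str, outputs: dict[str, str]
) -> dict:
    """Score outputs against ground truth that must match the frozen fingerprint."""
    if freeze_ground_truth(expected) != fingerprint:
        raise ValueError("ground truth changed after it was frozen")
    results = {
        model: {"output": out, "passed": out.strip() == expected}
        for model, out in outputs.items()
    }
    return {
        "task_id": task_id,
        "ground_truth_sha256": fingerprint,
        "n": 1,  # one run per task: a single round is not a promotable claim
        "results": results,
    }


# Step 1: fix the expected answer and its fingerprint.
expected = "42"
fingerprint = freeze_ground_truth(expected)

# Step 2: dispatch. These outputs are invented placeholders, not eval data.
outputs = {"model_a": "42", "model_b": "41"}

# Step 3: verify mechanically and write the receipt.
receipt = build_receipt("example-task", expected, fingerprint, outputs)
print(json.dumps(receipt, indent=2, sort_keys=True))
```

Committing the fingerprint before dispatch lets a reader confirm afterwards that the grading target was not adjusted to fit the outputs.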

Should You Switch to Fable 5?

If your workload is agentic — coding agents, tool pipelines, long-horizon tasks — yes, and the swap is a model-string change. Run your own evals first (an afternoon, not a sprint), because the 2× price only pays for itself where constraint precision and task length actually bind. If your workload is judgment-heavy review or human-read prose, Opus 4.8 at half the price remains the honest default. And either way: enforce output contracts in structure, not in trust. Round 4 showed every model's discipline bends under load.

FAQ

What is Claude Fable 5?

Claude Fable 5 is Anthropic's flagship model released June 9, 2026 — a Mythos-class model made safe for general availability. It shares underlying capabilities with the restricted Claude Mythos 5, ships with safety classifiers, and is the default model in Claude Code. The model ID is claude-fable-5.

What is the difference between Claude Fable 5 and Claude Mythos 5?

Same underlying capabilities, different access. Fable 5 is generally available with safety classifiers attached. Mythos 5 has safeguards lifted in some areas and is limited to approved Project Glasswing and trusted-access customers.

What are Claude Fable 5's benchmark scores?

Launch-window figures: 95.0% on SWE-bench Verified, ~80% on SWE-bench Pro (vs 58.6% for GPT-5.5 and 54.2% for Gemini 3.1 Pro), 72.9% on CursorBench at max effort, and the lead on both FrontierCode subsets. Treat these as vendor-claimed until independently reproduced.

How much does Claude Fable 5 cost?

$10 per million input tokens and $50 per million output tokens, with batch pricing at $5/$25. That's double Opus 4.8's standard rate and equal to Opus 4.8 fast mode.

What is Fable 5's context window?

1M input tokens with a 128K max output per request.

Is Fable 5 better than Opus 4.8?

On agentic work, our four first-party eval rounds say yes — Fable 5 led on constraint precision, output discipline, and hard clean reasoning. Opus 4.8 stayed ahead on situational judgment (flagging gated edits, pushing back on contradictory specs), accessibility craft in real component work, and speed. Neither dominates everything; route by task shape.

How can I test Fable 5 against other models myself?

Inside Claude Code, dispatch the same task to subagents with different model overrides, verify objective tasks with asserts you wrote before dispatch, judge subjective tasks blind with a non-contestant model, and record a receipt. The open-source harness pattern is at the Model Arena.

Analysis by Frank — AI Architect at Oracle's EMEA AI Center of Excellence, building agentic systems and publishing every eval receipt. Launch facts validated against Anthropic's announcement, TechCrunch, VentureBeat, CNBC, and heise. First-party eval data from the Starlight Model Arena, receipts in the open repo. Vendor-claimed figures are marked as such.
