OpenAI's GPT-5.5 leads GDPval at 84.9%, OSWorld at 78.7%, and Tau2 Telecom at 98% — at double the price of GPT-5.4. Technical breakdown with verified benchmarks, pricing, and what it means for builders.
TL;DR: OpenAI released GPT-5.5 (internal codename "Spud") on April 23, 2026 as its new flagship. The headline numbers are agentic: 84.9% on GDPval (knowledge work across 44 occupations), 78.7% on OSWorld-Verified (autonomous computer use), and 98.0% on Tau2 Telecom (customer-service workflows) without prompt tuning. Long-context reasoning at 512K–1M tokens jumped from 36.6% on GPT-5.4 to 74.0%. The catch: pricing doubled to $5/$30 per million tokens. The upside: it's a drop-in API replacement for GPT-5.4, so the migration itself is cheap. Here's what actually matters for builders.
GPT-5.5 is OpenAI's new flagship, replacing GPT-5.4 at the top of the line. The model id string is gpt-5.5, and OpenAI positions it squarely as an agentic model — built for long-horizon tasks where the model plans, calls tools, makes decisions, and keeps going for minutes or hours without hand-holding.
Three things make this release worth a builder's attention:
It set the agentic benchmark pace. GDPval at 84.9%, OSWorld-Verified at 78.7%, and Tau2-bench Telecom at 98.0% are the strongest knowledge-work and computer-use numbers OpenAI has published. These are the benchmarks that map to real automation, not trivia recall.
Long context finally works. Long-context reasoning at 512K–1M tokens went from 36.6% on GPT-5.4 to 74.0% on GPT-5.5. That's the single largest jump in the GPT-5 series, and it's the difference between "accepts a big prompt" and "uses it."
It costs more, not less. GPT-5.5 lists at exactly 2x GPT-5.4 on both sides of the meter — $5 input, $30 output per million tokens. OpenAI's argument is that better token efficiency offsets the higher rate. That argument is testable, and I get into it below.
Numbers below are from OpenAI's announcement cross-referenced against independent coverage (Vellum, The Decoder, Interesting Engineering) and benchmark aggregators. Where sources disagree or the figure is vendor-reported, I flag it explicitly. Treat anything labeled vendor-claimed as a marketing-conditions number, not an independently reproduced one.
| Benchmark | What it measures | GPT-5.5 | GPT-5.4 |
|---|---|---|---|
| GDPval | Knowledge work across 44 occupations | 84.9% | — |
| OSWorld-Verified | Autonomous real-computer operation | 78.7% | — |
| Tau2-bench Telecom | Complex customer-service workflows | 98.0% | — |
| Terminal-Bench 2.0 | Command-line agentic workflows | 82.7% | 75.1% |
| Long context (512K–1M) | Retrieval + reasoning over long inputs | 74.0% | 36.6% |
| AIME 2025 | Competition math | 93.6% | — |
| SWE-bench Verified | Real GitHub issue resolution | ~82.6–88.7% (vendor-claimed, varies by source) | — |
| ARC-AGI-2 | Abstract reasoning | ~85% (vendor-claimed, read skeptically) | — |
Two honest caveats. First, SWE-bench Verified scores at this tier should be read with heavy skepticism — every frontier lab has plausibly trained on or adjacent to this data, and reported figures for GPT-5.5 range from roughly 82.6% to 88.7% depending on the source and test conditions. On the harder SWE-bench Pro, GPT-5.5 lands at 58.6%, which is the more useful signal. Second, the ARC-AGI-2 figure floating around (~85%) is unusually high for an abstract-reasoning benchmark and I could not pin it to a single authoritative methodology; I'd wait for an independent run before quoting it as fact.
The cleaner competitive read comes from GDPval-AA, the Artificial Analysis arena version of GDPval. There, GPT-5.5 scores 1769 Elo — strong, and ahead of Gemini 3.1 Pro's 1314, but behind Claude Opus 4.8's 1890 (released roughly five weeks later, on May 28). GPT-5.5's most durable lead is on terminal-agent benchmarks, where it keeps a narrow edge even against Opus 4.8.
This is the comparison most builders actually care about. The frontier in mid-2026 is a three-way race, and the answer is genuinely "it depends on the task."
| Capability | GPT-5.5 | Claude Opus 4.8 | Gemini 3.1 Pro |
|---|---|---|---|
| GDPval-AA (Elo) | 1769 | 1890 | 1314 |
| OSWorld (computer use) | 78.7% | 83.4% | — |
| SWE-bench Pro | 58.6% | 69.2% | 54.2% |
| Terminal-agent tasks | narrow lead | strong | — |
| GPQA Diamond | — | 93.6% | — |
| Released | Apr 23, 2026 | May 28, 2026 | earlier 2026 |
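A few takeaways that hold up across sources: Opus 4.8 leads on GDPval-AA (1890 vs 1769 Elo), OSWorld (83.4% vs 78.7%), and SWE-bench Pro (69.2% vs 58.6%); GPT-5.5 keeps a narrow edge on terminal-agent tasks; and Gemini 3.1 Pro trails both on GDPval-AA (1314 Elo) and SWE-bench Pro (54.2%).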
If you're choosing today, the honest framing is: Opus 4.8 for deep coding and knowledge synthesis, GPT-5.5 for terminal-agent and computer-use automation (though Opus 4.8 posts the higher OSWorld score, 83.4% vs 78.7%, so benchmark computer use on your own tasks), and route by task rather than picking one winner. I broke down the Opus side of this in the Claude Opus 4.8 analysis.
GPT-5.5 doubled the per-token price of GPT-5.4. This is the most contested part of the release.
| Model | Input / 1M | Cached input / 1M | Output / 1M |
|---|---|---|---|
| GPT-5.5 | $5.00 | $0.50 | $30.00 |
| GPT-5.5 Pro | $30.00 | — | $180.00 |
| GPT-5.4 (prior) | $2.50 | — | $15.00 |
| Claude Opus 4.8 (reference) | $5.00 | — | $25.00 |
The output price is the one to watch: $30 per million output tokens is steep for a default agentic model, and GPT-5.5 Pro at $180 output is firmly in "reserve for the hardest research-grade problems" territory.
OpenAI's counter-argument is token efficiency. The company says GPT-5.5 uses significantly fewer tokens to complete the same Codex tasks than GPT-5.4, and Artificial Analysis reported roughly 40% fewer output tokens in its own framework. If that holds for your workload, the effective cost gap narrows — a 2x rate against 40% fewer tokens is roughly a 1.2x effective increase on output-heavy agentic runs, not 2x. But "roughly 40% fewer on Codex tasks" is not a universal guarantee, and on output-light, input-heavy workloads (long-context retrieval, document analysis) you pay the full 2x with less efficiency offset. Measure your own token mix before assuming the efficiency narrative covers you.
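To make that arithmetic concrete, here is a minimal sketch of the effective cost multiplier for a given token mix. It uses the list prices from the table above and the roughly 40% output-token reduction as a default. The function name and parameters are illustrative, and it assumes input token counts stay the same across models, which may not hold for your workload.

```python
def effective_cost_multiplier(
    input_tokens: float,
    output_tokens: float,
    output_reduction: float = 0.40,
    old_prices: tuple[float, float] = (2.50, 15.00),  # GPT-5.4 input/output, USD per 1M
    new_prices: tuple[float, float] = (5.00, 30.00),  # GPT-5.5 input/output, USD per 1M
) -> float:
    """Ratio of GPT-5.5 cost to GPT-5.4 cost for one workload.

    input_tokens / output_tokens describe the workload as measured on GPT-5.4.
    output_reduction is the assumed fraction of output tokens GPT-5.5 saves.
    """
    old_in, old_out = old_prices
    new_in, new_out = new_prices
    old_cost = input_tokens * old_in + output_tokens * old_out
    if old_cost == 0:
        raise ValueError("workload must contain at least one token")
    # Assumption: only output tokens shrink; input volume is unchanged.
    new_cost = input_tokens * new_in + output_tokens * (1 - output_reduction) * new_out
    return new_cost / old_cost


# Output-only workload: the 2x rate is offset down to about 1.2x.
print(round(effective_cost_multiplier(0, 1_000_000), 2))
# Input-heavy workload (1M in, 10K out): close to the full 2x.
print(round(effective_cost_multiplier(1_000_000, 10_000), 2))
```

Plugging in your own measured input/output split and your own observed reduction, rather than the 40% default, shows quickly which side of that range you are on.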
GPT-5.5 directly replaces GPT-5.4, but the flagship most readers will be migrating from is the older GPT-5.2 (released December 11, 2025). Here's the delta that matters.
| Dimension | GPT-5.2 / 5.2 Pro | GPT-5.5 |
|---|---|---|
| Positioning | "Most capable work model" | Agentic, long-horizon autonomy |
| Context window | 400K (272K input + 128K output) | Up to 1M via API; 400K class in Codex |
| Modalities | Text + image (no native audio) | Text, image, audio, video in one unified architecture (vendor framing) |
| Long-context reasoning | weaker | 74.0% at 512K–1M (vs 36.6% on 5.4) |
| Output price / 1M | $14 (base) | $30 (base), $180 (Pro) |
| Knowledge cutoff | late 2025 | December 2025 |
Two things to correct from the common framing. GPT-5.2 did not ship native audio — it was a text-and-image model, and the modality story is genuinely new at the 5.5 line. And on context: the published 1M window comes with real-world caveats. There are open reports of GPT-5.5 surfacing a ~258K effective window in Codex despite the 400K/1M published figures, and of catalog mismatches that can bypass auto-compaction. If you're building on the long-context promise, validate the effective window in your actual harness rather than trusting the spec sheet.
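One way to validate the effective window is to binary-search for the largest prompt size at which a recall check still passes. The sketch below is a hedged illustration, not an OpenAI-provided procedure: `probe` is a hypothetical callable you would implement against your own harness (for example, plant a fact at the start of an n-token prompt and check that the model returns it). It also assumes the check is monotonic, meaning that once it fails at some size it fails at every larger size.

```python
from typing import Callable


def find_effective_window(
    probe: Callable[[int], bool],
    low: int,
    high: int,
    tolerance: int = 1_000,
) -> int:
    """Largest prompt size (in tokens) at which probe() still succeeds.

    Returns 0 if the probe fails even at `low`, and `high` if it never fails.
    The result is accurate to within `tolerance` tokens.
    """
    if not probe(low):
        return 0
    if probe(high):
        return high
    # Invariant: probe(low) is True and probe(high) is False.
    while high - low > tolerance:
        mid = (low + high) // 2
        if probe(mid):
            low = mid
        else:
            high = mid
    return low


# Stand-in probe for demonstration only: pretends recall breaks above 258K tokens.
def fake_probe(n_tokens: int) -> bool:
    return n_tokens <= 258_000


print(find_effective_window(fake_probe, low=8_000, high=1_000_000))
```

Each real probe call costs a full long-context request, so the logarithmic number of steps matters: searching 8K to 1M at 1K resolution takes about ten probes.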
On the API surface, GPT-5.5 is a drop-in replacement for GPT-5.4 — it supports the same prompt caching, hosted tools, tool search, and compaction features. There's no painful migration. The work is in re-tuning cost expectations and validating context behavior, not rewriting integrations.
Native real-time voice is a related but separate story: OpenAI shipped the gpt-realtime-2 family on May 7 with GPT-5-class reasoning, a 128K context (up from 32K), and parallel mid-conversation tool calls. If voice agents are your use case, that's the model line to evaluate, priced at $32/$64 per million audio input/output tokens — not GPT-5.5 itself.
GPT-5.5 is the first OpenAI flagship genuinely tuned for long-horizon autonomy, and the OSWorld (78.7%) and Terminal-Bench 2.0 (82.7%) numbers back that up. If you're running Codex-style loops or computer-use automation, this is the model to benchmark against your current stack. OpenAI also reports a roughly 60% drop in hallucination rate versus GPT-5.4, which matters more for autonomous loops than for single-turn chat — an agent that hallucinates a file path or API parameter mid-task wastes the whole run.
The token-efficiency claim is the lever to pull. Agentic loops that converge sooner and "dig themselves in less when they're wrong" are worth real money at $30/1M output. Instrument your runs: measure tokens-to-completion on GPT-5.5 versus your incumbent, not just per-token price.
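A minimal sketch of that instrumentation follows, assuming you already log summed input and output tokens per agent run along with whether the task completed. The `AgentRun` record and the `gpt-5.4` key are illustrative names, not part of any SDK; the prices are the list prices quoted in this article.

```python
from dataclasses import dataclass

# USD per 1M tokens as (input, output), taken from the pricing table in this article.
PRICES = {"gpt-5.5": (5.00, 30.00), "gpt-5.4": (2.50, 15.00)}


@dataclass
class AgentRun:
    model: str
    input_tokens: int   # summed over every call in the agent loop
    output_tokens: int  # summed over every call in the agent loop
    succeeded: bool     # did the task actually complete?


def run_cost(run: AgentRun) -> float:
    """Dollar cost of one run at list price (ignores cached-input discounts)."""
    price_in, price_out = PRICES[run.model]
    return (run.input_tokens * price_in + run.output_tokens * price_out) / 1_000_000


def cost_per_success(runs: list[AgentRun], model: str) -> float:
    """Total spend on `model` divided by the number of tasks it completed."""
    mine = [r for r in runs if r.model == model]
    successes = sum(r.succeeded for r in mine)
    if successes == 0:
        return float("inf")
    # Failed runs still count toward spend: a wasted run is part of the price.
    return sum(run_cost(r) for r in mine) / successes
```

Dividing by successes rather than by runs is the design choice that matters here: a model that converges sooner and fails less often can come out cheaper per completed task even at a higher per-token rate.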
The 36.6% → 74.0% jump on long-context reasoning is the most practically useful improvement in the release. Loading an entire codebase, a full research corpus, or a long document set into one prompt now produces answers that actually reflect the whole input. But validate the effective window in your harness — the Codex reports of a shrunken real window are a caution, not a footnote.
Do the math before you switch your default. At 2x the rate, GPT-5.5 only pays off if your workload is output-heavy enough to capture the token-efficiency savings. For high-volume, output-light tasks, GPT-5.4 or a Sonnet-class model may still be the better economic choice.
Route by task. The right architecture in mid-2026 is a router that sends terminal-agent and computer-use work to GPT-5.5, deep coding to Opus 4.8, and high-throughput simple tasks to a cheaper tier, rather than a single-model bet. I've written about how the broader frontier is fragmenting in the Microsoft MAI frontier models breakdown.
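The routing idea can be sketched as a plain lookup table. Only `gpt-5.5` is a model id stated in this article; the other two strings are placeholders to swap for whatever ids your providers actually expose, and the task categories are an assumed taxonomy rather than a standard one.

```python
from enum import Enum


class TaskKind(Enum):
    TERMINAL_AGENT = "terminal_agent"
    COMPUTER_USE = "computer_use"
    DEEP_CODING = "deep_coding"
    SIMPLE_BULK = "simple_bulk"


# Placeholder ids (assumptions): replace with the real model ids you use.
OPUS_MODEL = "opus-4.8-placeholder"
CHEAP_MODEL = "cheap-tier-placeholder"

ROUTES = {
    TaskKind.TERMINAL_AGENT: "gpt-5.5",
    TaskKind.COMPUTER_USE: "gpt-5.5",
    TaskKind.DEEP_CODING: OPUS_MODEL,
    TaskKind.SIMPLE_BULK: CHEAP_MODEL,
}


def pick_model(kind: TaskKind, default: str = CHEAP_MODEL) -> str:
    """Return the model id for a task kind, falling back to the cheap tier."""
    return ROUTES.get(kind, default)


print(pick_model(TaskKind.TERMINAL_AGENT))
```

Keeping the table as data rather than branching logic means a re-benchmark only changes one mapping entry, not the call sites.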
GPT-5.5 is a real step forward on the metrics that matter for automation: knowledge work, computer use, terminal agents, and long context. It is not a clean sweep of the frontier — Claude Opus 4.8 leads on GDPval-AA and SWE-bench Pro, and the price doubled. The token-efficiency argument is plausible but conditional on your workload. Treat the SWE-bench Verified and ARC-AGI-2 headline numbers with skepticism, lean on GDPval-AA and OSWorld as the cleaner signals, and benchmark tokens-to-completion before you migrate. For agentic and computer-use builders, it earns its place in the routing table. For everyone else, it's a "measure, then decide" release.
Whether GPT-5.5 beats Claude Opus 4.8 depends on the task. Opus 4.8 leads on GDPval-AA (1890 vs 1769 Elo), OSWorld computer use (83.4% vs 78.7%), and SWE-bench Pro (69.2% vs 58.6%). GPT-5.5 keeps a narrow edge on terminal-agent benchmarks. For deep coding and knowledge synthesis, Opus 4.8 is stronger; for terminal-style autonomous loops, GPT-5.5 is competitive or ahead. Note that Opus 4.8 shipped about five weeks later, so the comparison is OpenAI's April flagship against Anthropic's late-May one.
GPT-5.5 costs $5 per million input tokens and $30 per million output tokens, with cached input at $0.50. That's exactly 2x GPT-5.4's $2.50/$15. The GPT-5.5 Pro tier is $30 input / $180 output. OpenAI argues the higher rate is offset by roughly 40% better output-token efficiency on Codex-style tasks (per Artificial Analysis), but that offset depends on your workload being output-heavy.
OpenAI publishes up to a 1M-token context window via the API, with a 400K-class configuration in Codex (272K input + 128K output in the standard setup). There are open reports of the effective window appearing smaller (~258K) in Codex despite the published figures, and of catalog mismatches that can bypass auto-compaction — so validate the effective window in your own harness before relying on it.
GPT-5.5 itself processes text, image, audio, and video in a unified architecture (OpenAI's framing). For production real-time voice agents, the relevant model line is the separately released gpt-realtime-2 family (shipped May 7) with GPT-5-class reasoning, a 128K context, and parallel mid-conversation tool calls, priced at $32/$64 per million audio input/output tokens.
Migrating from GPT-5.4 does not require rewriting integrations. GPT-5.5 is a drop-in API replacement and supports the same features: prompt caching, hosted tools, tool search, and compaction. The real work is re-tuning cost expectations (2x rate) and validating long-context behavior.
Be cautious with the SWE-bench Verified and ARC-AGI-2 headline numbers. Reported SWE-bench Verified figures for GPT-5.5 range from about 82.6% to 88.7% depending on the source, and frontier labs may have trained on or adjacent to this data. The harder SWE-bench Pro (58.6%) is a more honest signal. The ~85% ARC-AGI-2 figure is unusually high and I couldn't tie it to a single authoritative methodology, so wait for an independent run before quoting it.
Analysis by Frank — former Oracle AI architect who helped build Oracle's AI Center of Excellence, now building agentic systems independently and making music with AI. Published June 5, 2026 with benchmarks cross-referenced against OpenAI's announcement and independent coverage; vendor-claimed figures flagged where they could not be independently verified.