Alibaba's Qwen3.7-Max lands May 2026 with a 1M-token context, $2.50/$7.50 pricing, SWE-Bench Pro 60.6, Terminal-Bench 69.7, and a verified 35-hour autonomous run. Technical breakdown with sourced benchmarks, what changed vs Qwen3.6, and what it means for builders.
TL;DR: Alibaba released Qwen3.7-Max (qwen3.7-max) on May 19, 2026, announcing it the next day at the Alibaba Cloud Summit. It's a closed-weight, API-only reasoning agent with a 1M-token context window (65K max output) priced at $2.50 input / $7.50 output per million, with a 90% cached-input discount ($0.25/M). It scores 56.6 on the Artificial Analysis Intelligence Index v4.0 — the highest a Chinese model has ever placed, sitting in the global top 5. The agentic numbers are the headline: SWE-Bench Pro 60.6, Terminal-Bench 2.0 69.7, and a documented 35-hour autonomous CUDA-kernel optimization run that hit a 10x speedup. Here's what's verifiable and what isn't.
Qwen3.7-Max is Alibaba's new flagship, the top of the Qwen3.7 line and the successor to Qwen3.6-Max-Preview. The Qwen team frames it as "the agent frontier" — a model built less for chat and more for long-horizon, tool-driven execution. It is a reasoning model: it produces an internal chain of thought, plans, checks its own work, and corrects course before committing to an answer.
Three things define this release:
It's an agent model first. The pitch isn't a few points of MMLU; it's how long and how reliably the model can drive a tool loop without a human in it. The flagship demo is a 35-hour autonomous run, not a chat transcript.
It's closed-weight. This is the important asterisk for anyone who associates Qwen with open weights. Qwen3.7-Max is API-only — no Hugging Face checkpoint, no GGUF, no Ollama. The open-weight Qwen story continues, but not at the frontier tier (more on that below).
It's a genuine top-5 placement. At 56.6 on the Artificial Analysis Intelligence Index v4.0, it's the strongest Chinese model the index has ranked, in the same conversation as Claude Opus 4.7 and GPT-5.5 — at roughly half the price of the Western frontier.
A note on naming: the repo slug here is qwen3-max, but the actual current flagship is Qwen3.7-Max. If you came looking for a plain "Qwen3-Max," this is the model that supersedes it.
A sourcing note, because it matters more than usual here. Qwen3.7-Max's architecture is not fully disclosed — Alibaba has not published parameter counts or the MoE expert configuration, and independent outlets only place it "within the broader Mixture-of-Experts frontier" without confirmed numbers. So treat architecture as undisclosed, not assumed. The benchmark figures below are drawn from Alibaba's own technical materials, Artificial Analysis, and independent coverage from DataCamp and VentureBeat. Where a number comes only from Alibaba's own comparison tables, I mark it vendor-claimed.
| Benchmark | Qwen3.7-Max | What it measures | Source confidence |
|---|---|---|---|
| AA Intelligence Index v4.0 | 56.6 | Composite of 10 evals (GDPval-AA, GPQA, HLE, SciCode, Terminal-Bench Hard, etc.) | Artificial Analysis (independent) |
| SWE-Bench Pro | 60.6 | Harder, contamination-resistant coding | Vendor table, AA-aligned |
| SWE-Bench Verified | 80.4 | Real GitHub issue resolution | Vendor table |
| Terminal-Bench 2.0 (Terminus) | 69.7 | Agentic terminal/CLI workflows | Vendor table |
| MCP-Atlas | 76.4 | Tool-use / MCP orchestration | Vendor-claimed |
| GPQA Diamond | 92.4 | Graduate-level science Q&A | Vendor table |
| HMMT 2026 Feb | 97.1 | Competition mathematics | Vendor-claimed |
| LiveCodeBench | 91.6 | Live competitive coding | Vendor table |
| MMLU-Pro | 89.6 | Broad knowledge, harder MMLU | Vendor table |
| Humanity's Last Exam | 41.4 | Frontier multidisciplinary reasoning | AA-aligned |
| SciCode | 53.5 | Research-grade scientific coding | Vendor table |
| Apex (reasoning) | 44.5 | Hard multi-step reasoning | Vendor-claimed |
Two rows deserve more than a line.
SWE-Bench Pro at 60.6 is where Qwen3.7-Max actually leads its peer group. Alibaba's table puts it ahead of Kimi K2.6 Thinking (59.5) and DeepSeek V4 Pro Max (59.0) on this harder, contamination-resistant coding benchmark. On the easier, more saturated SWE-Bench Verified, it does not lead — 80.4 trails Opus 4.6 Max (80.8) and DeepSeek V4 Pro Max (80.6) by a hair. That's an honest split: it's strong where the benchmark is hard, mid-pack where the benchmark is near saturation.
HMMT 2026 Feb at 97.1 is a near-ceiling competition-math score, and Alibaba's table puts it at the top of the group. But it's vendor-reported and the kind of narrow olympiad number that's easy to over-read. Treat it as a real signal about multi-step reasoning rigor, not a verdict on general intelligence.
One thing I'm deliberately not putting in the table: AIME 2026 and ARC-AGI-2. I couldn't find published Qwen3.7-Max numbers for either from a source I trust, so I'm leaving them out rather than inventing rows.
This is the demo that earns the "agent frontier" framing, and it's the most concrete thing in the release. In Alibaba's technical writeup, Qwen3.7-Max ran a 35-hour autonomous CUDA-kernel optimization task on T-Head ZW-M890 PPUs. Over that run it executed 1,158 distinct tool calls, performed 432 kernel evaluations, diagnosed compilation failures, and iteratively improved the code to a 10x geometric-mean speedup over the SGLang Triton reference.
The numbers that make this interesting aren't the speedup — they're the duration and the comparison. On the same task under identical conditions, Alibaba reports GLM 5.1 reached 7.3x, Kimi K2.6 reached 5.0x, DeepSeek V4 Pro reached 3.3x, and the prior Qwen3.6-Plus managed only 1.1x. On KernelBench L3, Qwen3.7-Max produced working accelerated kernels for 96% of scenarios versus 48% for Qwen3.6-Plus. The model was reportedly still finding meaningful improvements after 30+ hours — which is the actual claim worth caring about: long-horizon autonomy that stays productive, not just stable.
Two caveats. First, this is a vendor demo on Alibaba's own silicon and benchmark harness — directionally credible, but not independently reproduced. Second, the same release describes the model autonomously flagging "1,618 hacking cases" during an 86-hour RL training session and adding 13 heuristic rules to its own loop. That's a self-monitoring-for-reward-hacking claim, and it's exactly the sort of vendor narrative to read as marketing until a third party verifies it.
Where Qwen3.7-Max sits against the June 2026 frontier. Two honesty notes up front: Alibaba's launch tables benchmark against Opus 4.6 Max, not the newer Claude Opus 4.8 that shipped May 28 — so the Opus column below mixes generations and should be read as a rough position, not a head-to-head. And cross-vendor benchmark numbers always carry harness differences.
| Capability | Qwen3.7-Max | Claude Opus 4.8 | GPT-5.5 | DeepSeek V4 Pro |
|---|---|---|---|---|
| AA Intelligence Index | 56.6 | 61.4 | — | — |
| SWE-Bench Pro | 60.6 | 69.2 | ~58.6 | 59.0 |
| SWE-Bench Verified | 80.4 | 88.6 | ~82 | 80.6 |
| Terminal-Bench 2.0 | 69.7 | 74.6 (2.1) | 78.2 (2.1) | 67.9 |
| GPQA Diamond | 92.4 | 93.6 | — | — |
| Input / output (per 1M) | $2.50 / $7.50 | $5 / $25 | ~$5 / ~$30 | varies |
Read that table carefully: on the aggregate and on the hardest coding (SWE-Bench Pro/Verified), Opus 4.8 is clearly ahead. Qwen3.7-Max's case isn't "it's the smartest model" — it's "it's a top-5 model at roughly half the price of the Western frontier, and it leads its peer group on the hard agentic-coding benchmarks where the leaders aren't." That's a real and defensible position. For the full cross-model picture, see the FrankX models tracker, and for the open-weight Chinese alternative, the DeepSeek V4 breakdown.
One cost wrinkle the per-token price hides: Artificial Analysis observed Qwen3.7-Max generating roughly 97 million tokens across its evaluation suite, far above the ~24M median. It's a verbose reasoner. The headline $7.50 output price is cheap, but a verbose model burns more output tokens per task — so the effective cost-per-task gap versus a terser model is smaller than the rate card suggests. Budget on tokens-per-task from your own evals, not on the sticker.
| Model | Input / 1M | Output / 1M | Cached input | Notes |
|---|---|---|---|---|
| Qwen3.7-Max | $2.50 | $7.50 | $0.25 | 90% cache discount; 1M context |
| Claude Opus 4.8 | $5.00 | $25.00 | — | Western frontier leader |
| GPT-5.5 | ~$5.00 | ~$30.00 | — | Terminal/computer-use workhorse |
| Qwen3.7-Plus | $0.40 | $1.60 | — | Multimodal, cheaper sibling (June 2, 2026) |
Two things stand out. First, the 90% cached-input discount ($2.50 → $0.25) is meaningful for agent workloads, which re-send large, stable context (codebase, instructions, tool schemas) on every turn. If your prompt prefix is cacheable and stable, your effective input cost collapses. Second, note the sibling: Qwen3.7-Plus shipped June 2, 2026 at $0.40/$1.60 with text, image, and video input — a cheaper, multimodal model. Max is the pure-text, pure-reasoning flagship; Plus is the budget multimodal option. If your workload is vision-heavy or cost-sensitive and doesn't need the absolute top of the reasoning ladder, Plus is the one to evaluate first.
Third-party routing can be even cheaper — OpenRouter has listed Qwen3.7-Max as low as $1.25/$3.75 — but verify the provider's context limit and rate caps before you commit a production agent to a reseller's pricing.
The jump from the Qwen3.6 generation is mostly about horizon and harness, not raw single-shot intelligence:
| Area | Qwen3.6-Max-Preview | Qwen3.7-Max |
|---|---|---|
| Context window | ~262K | 1M tokens |
| Max output | smaller | 65,536 tokens |
| KernelBench L3 success | 48% (3.6-Plus) | 96% |
| CUDA-kernel demo speedup | 1.1x (3.6-Plus) | 10x |
| Long-horizon autonomy | hours-scale | 35-hour productive runs |
| Native thinking | partial | full extended-thinking mode |
| External harnesses | limited | Claude Code, OpenClaw, Qwen Code, custom |
The cross-harness point is the underrated one. Alibaba says it's the same backbone whether you drive it through Anthropic's Claude Code, OpenClaw, Qwen Code, or your own tool-use framework — and exposes a native Anthropic Messages-compatible protocol, so it's a near-drop-in for code already written against Claude's API. For teams already standardized on the Anthropic protocol, that lowers the switching cost to "change the base URL and model string and re-run your evals." That's a deliberate land-grab on Anthropic's developer ergonomics, and it's smart.
This is where Qwen's reputation and Qwen3.7-Max's reality diverge, so be precise. Qwen3.7-Max is closed-weight and API-only. There are no published weights, no GGUF, no Ollama image — the only way to run it today is through Alibaba Cloud Model Studio (DashScope) or a reseller. If your requirement is on-prem, air-gapped, or self-hosted inference, Qwen3.7-Max does not satisfy it.
The open-weight Qwen story is still alive — just one tier down. The prior generation followed a consistent pattern: the Max flagship stayed closed, while smaller dense and MoE variants (the 27B dense and 35B-A3B MoE in Qwen3.6) shipped open under Apache 2.0 with 256K-class context. Multiple outlets expect Qwen3.7-equivalent open-weight variants to follow in the June–July 2026 window on the same cadence — but Alibaba has not confirmed this, so treat it as expectation, not roadmap.
The practical takeaway: if you want a self-hostable Qwen today, you're on the Qwen3.6 open-weight models or waiting for the 3.7 open releases. If you want the frontier agent, you're on the API and you're accepting closed weights. For genuinely open frontier reasoning, the DeepSeek V4 analysis is the more relevant comparison.
This is the model's home turf. If you're building something that runs a tool loop for hours — migrations, optimization sweeps, research agents, batch refactors — Qwen3.7-Max is built for it, priced for it, and has the cache discount to make re-sending stable context cheap. Give it the full task in one well-specified first turn, lean on the 1M context to hold the whole working set, and let it run. The 35-hour demo is a demo, but the underlying claim — productive autonomy past 30 hours — is the differentiator worth testing on your own workload.
The Anthropic Messages-compatible protocol makes Qwen3.7-Max one of the lowest-friction non-Anthropic models to trial. Point your Claude Code or custom harness at the Alibaba endpoint, swap the model string, and run your existing eval suite. At half the input price and a quarter of the output price of Opus 4.8, even a modest pass rate on your evals can change your routing math for the cost-tolerant slice of your traffic.
The honest framing: Qwen3.7-Max is not the model you reach for when a silent error is expensive — Opus 4.8 leads the hard-coding and aggregate benchmarks, and you pay the premium to buy down error risk. Qwen3.7-Max is the model you reach for when the task is genuinely hard and high-volume, the cost-of-error is bounded, and you can verify output against tests. Route the expensive-failure work to Opus, the high-volume agentic execution to Qwen3.7-Max, and the multimodal or budget tier to Qwen3.7-Plus. Match the model to the task's cost-of-error, not to the leaderboard.
Re-baseline your token budgets. The verbosity that showed up in Artificial Analysis's 97M-token run is real, and it means your output-token spend and your latency will both run higher than the per-token price implies. Measure tokens-per-task on your own traffic before you size the bill.
No, not on aggregate. Opus 4.8 leads the Artificial Analysis Intelligence Index (61.4 vs 56.6) and the hardest coding benchmarks (SWE-Bench Pro 69.2 vs 60.6, SWE-Bench Verified 88.6 vs 80.4). Qwen3.7-Max's advantage is price and peer-group leadership: it's a top-5 model at roughly half Opus's input price and a quarter of its output price, and it leads its own peer group (Kimi K2.6, DeepSeek V4 Pro) on hard agentic-coding benchmarks. Choose Opus when silent errors are expensive; choose Qwen3.7-Max for high-volume, verifiable agentic work.
$2.50 per million input tokens and $7.50 per million output tokens, with a 90% cached-input discount that drops cached input to $0.25/M. That's roughly half the input price and a quarter of the output price of Claude Opus 4.8. Note that it's a verbose reasoner — Artificial Analysis measured ~97M tokens across its eval suite versus a ~24M median — so effective cost-per-task is higher than the rate card alone suggests. Third-party routers like OpenRouter have listed lower rates ($1.25/$3.75).
No. Qwen3.7-Max is closed-weight and API-only — accessible only through Alibaba Cloud Model Studio (DashScope) and resellers. There are no published weights, GGUF files, or Ollama images. Qwen's open-weight releases continue at the smaller-model tier (the Qwen3.6 27B dense and 35B-A3B MoE shipped Apache 2.0), and Qwen3.7-equivalent open variants are expected mid-2026 — but that's unconfirmed by Alibaba.
1 million input tokens and 65,536 max output tokens — up from ~262K context on Qwen3.6-Max-Preview. It's a text-only reasoning model; for multimodal input (image, video), the cheaper Qwen3.7-Plus sibling is the one to use.
It's Alibaba's flagship demo: Qwen3.7-Max autonomously optimized CUDA kernels over 35 hours, making 1,158 tool calls and 432 kernel evaluations to reach a 10x geometric-mean speedup, reportedly still improving past 30 hours. It's a vendor demo on Alibaba's own hardware and harness — directionally credible and consistent with the model's long-horizon design, but not independently reproduced. Treat the specific numbers as vendor-claimed.
The Artificial Analysis Intelligence Index (56.6), pricing, context window, and release date are independently sourced. The SWE-Bench, LiveCodeBench, GPQA, and MMLU-Pro figures align with Artificial Analysis but several come from Alibaba's own comparison tables. HMMT 97.1, MCP-Atlas 76.4, Apex 44.5, and the entire 35-hour-run narrative are vendor-claimed — credible but not yet third-party reproduced. Architecture (parameters, MoE config) is undisclosed; I did not assume it.
Analysis by Frank — former Oracle AI architect who helped build Oracle's AI Center of Excellence, now building agentic systems independently and making music with AI. Published June 5, 2026 with benchmarks validated against Alibaba's official materials, Artificial Analysis, DataCamp, and VentureBeat. Architecture is undisclosed and several agentic figures are vendor-claimed — both are marked as such throughout.
Step-by-step guide to setting up ACOS, creating your first agent, and shipping real products with AI.
Start buildingDownload AI architecture templates, multi-agent blueprints, and prompt engineering patterns.
Browse templatesConnect with creators and architects shipping AI products. Weekly office hours, shared resources, direct access.
Join the circleRead on FrankX.AI — AI Architecture, Music & Creator Intelligence
Weekly field notes on AI systems, production patterns, and builder strategy.
Anthropic's Opus 4.8 lands 41 days after 4.7 with the same $5/$25 pricing, SWE-Bench Pro 69.2%, GDPval-AA 1890, dynamic workflows, and cheaper fast mode. Technical breakdown with verified benchmarks, what changed, and what it means for builders.
Read articleOpenAI's GPT-5.5 leads GDPval at 84.9%, OSWorld at 78.7%, and Tau2 Telecom at 98% — at double the price of GPT-5.4. Technical breakdown with verified benchmarks, pricing, and what it means for builders.
Read articlexAI's Grok 4.3 scores 53 on the Artificial Analysis Intelligence Index, lifts GDPval-AA to 1500 Elo, ships a 1M context window with always-on reasoning, and cuts price ~40% to $1.25/$2.50. Technical breakdown with verified benchmarks and what it means for builders.
Read article