Intelligence DispatchesJune 5, 202614 min read

Qwen3.7-Max: Alibaba's Agent Flagship Cracks the Global Top 5 — and Runs for 35 Hours

Alibaba's Qwen3.7-Max lands May 2026 with a 1M-token context, $2.50/$7.50 pricing, SWE-Bench Pro 60.6, Terminal-Bench 69.7, and a verified 35-hour autonomous run. Technical breakdown with sourced benchmarks, what changed vs Qwen3.6, and what it means for builders.

Frank

AI Architect & Creator

Former Oracle AI architect · helped build Oracle's AI CoE

Share Share

Qwen3.7-Max: Alibaba's Agent Flagship Cracks the Global Top 5 — and Runs for 35 Hours

TL;DR: Alibaba released Qwen3.7-Max (qwen3.7-max) on May 19, 2026, announcing it the next day at the Alibaba Cloud Summit. It's a closed-weight, API-only reasoning agent with a 1M-token context window (65K max output) priced at $2.50 input / $7.50 output per million, with a 90% cached-input discount ($0.25/M). It scores 56.6 on the Artificial Analysis Intelligence Index v4.0 — the highest a Chinese model has ever placed, sitting in the global top 5. The agentic numbers are the headline: SWE-Bench Pro 60.6, Terminal-Bench 2.0 69.7, and a documented 35-hour autonomous CUDA-kernel optimization run that hit a 10x speedup. Here's what's verifiable and what isn't.

What Is Qwen3.7-Max?

Qwen3.7-Max is Alibaba's new flagship, the top of the Qwen3.7 line and the successor to Qwen3.6-Max-Preview. The Qwen team frames it as "the agent frontier" — a model built less for chat and more for long-horizon, tool-driven execution. It is a reasoning model: it produces an internal chain of thought, plans, checks its own work, and corrects course before committing to an answer.

Three things define this release:

It's an agent model first. The pitch isn't a few points of MMLU; it's how long and how reliably the model can drive a tool loop without a human in it. The flagship demo is a 35-hour autonomous run, not a chat transcript.
It's closed-weight. This is the important asterisk for anyone who associates Qwen with open weights. Qwen3.7-Max is API-only — no Hugging Face checkpoint, no GGUF, no Ollama. The open-weight Qwen story continues, but not at the frontier tier (more on that below).
It's a genuine top-5 placement. At 56.6 on the Artificial Analysis Intelligence Index v4.0, it's the strongest Chinese model the index has ranked, in the same conversation as Claude Opus 4.7 and GPT-5.5 — at roughly half the price of the Western frontier.

A note on naming: the repo slug here is qwen3-max, but the actual current flagship is Qwen3.7-Max. If you came looking for a plain "Qwen3-Max," this is the model that supersedes it.

What Are the Verified Benchmarks?

A sourcing note, because it matters more than usual here. Qwen3.7-Max's architecture is not fully disclosed — Alibaba has not published parameter counts or the MoE expert configuration, and independent outlets only place it "within the broader Mixture-of-Experts frontier" without confirmed numbers. So treat architecture as undisclosed, not assumed. The benchmark figures below are drawn from Alibaba's own technical materials, Artificial Analysis, and independent coverage from DataCamp and VentureBeat. Where a number comes only from Alibaba's own comparison tables, I mark it vendor-claimed.

Benchmark	Qwen3.7-Max	What it measures	Source confidence
AA Intelligence Index v4.0	56.6	Composite of 10 evals (GDPval-AA, GPQA, HLE, SciCode, Terminal-Bench Hard, etc.)	Artificial Analysis (independent)
SWE-Bench Pro	60.6	Harder, contamination-resistant coding	Vendor table, AA-aligned
SWE-Bench Verified	80.4	Real GitHub issue resolution	Vendor table
Terminal-Bench 2.0 (Terminus)	69.7	Agentic terminal/CLI workflows	Vendor table
MCP-Atlas	76.4	Tool-use / MCP orchestration	Vendor-claimed
GPQA Diamond	92.4	Graduate-level science Q&A	Vendor table
HMMT 2026 Feb	97.1	Competition mathematics	Vendor-claimed
LiveCodeBench	91.6	Live competitive coding	Vendor table
MMLU-Pro	89.6	Broad knowledge, harder MMLU	Vendor table
Humanity's Last Exam	41.4	Frontier multidisciplinary reasoning	AA-aligned
SciCode	53.5	Research-grade scientific coding	Vendor table
Apex (reasoning)	44.5	Hard multi-step reasoning	Vendor-claimed

Two rows deserve more than a line.

SWE-Bench Pro at 60.6 is where Qwen3.7-Max actually leads its peer group. Alibaba's table puts it ahead of Kimi K2.6 Thinking (59.5) and DeepSeek V4 Pro Max (59.0) on this harder, contamination-resistant coding benchmark. On the easier, more saturated SWE-Bench Verified, it does not lead — 80.4 trails Opus 4.6 Max (80.8) and DeepSeek V4 Pro Max (80.6) by a hair. That's an honest split: it's strong where the benchmark is hard, mid-pack where the benchmark is near saturation.

HMMT 2026 Feb at 97.1 is a near-ceiling competition-math score, and Alibaba's table puts it at the top of the group. But it's vendor-reported and the kind of narrow olympiad number that's easy to over-read. Treat it as a real signal about multi-step reasoning rigor, not a verdict on general intelligence.

One thing I'm deliberately not putting in the table: AIME 2026 and ARC-AGI-2. I couldn't find published Qwen3.7-Max numbers for either from a source I trust, so I'm leaving them out rather than inventing rows.

What Is the 35-Hour Autonomous Run?

This is the demo that earns the "agent frontier" framing, and it's the most concrete thing in the release. In Alibaba's technical writeup, Qwen3.7-Max ran a 35-hour autonomous CUDA-kernel optimization task on T-Head ZW-M890 PPUs. Over that run it executed 1,158 distinct tool calls, performed 432 kernel evaluations, diagnosed compilation failures, and iteratively improved the code to a 10x geometric-mean speedup over the SGLang Triton reference.

The numbers that make this interesting aren't the speedup — they're the duration and the comparison. On the same task under identical conditions, Alibaba reports GLM 5.1 reached 7.3x, Kimi K2.6 reached 5.0x, DeepSeek V4 Pro reached 3.3x, and the prior Qwen3.6-Plus managed only 1.1x. On KernelBench L3, Qwen3.7-Max produced working accelerated kernels for 96% of scenarios versus 48% for Qwen3.6-Plus. The model was reportedly still finding meaningful improvements after 30+ hours — which is the actual claim worth caring about: long-horizon autonomy that stays productive, not just stable.

Two caveats. First, this is a vendor demo on Alibaba's own silicon and benchmark harness — directionally credible, but not independently reproduced. Second, the same release describes the model autonomously flagging "1,618 hacking cases" during an 86-hour RL training session and adding 13 heuristic rules to its own loop. That's a self-monitoring-for-reward-hacking claim, and it's exactly the sort of vendor narrative to read as marketing until a third party verifies it.

How Does It Compare to Opus 4.8, GPT-5.5, and DeepSeek V4?

Where Qwen3.7-Max sits against the June 2026 frontier. Two honesty notes up front: Alibaba's launch tables benchmark against Opus 4.6 Max, not the newer Claude Opus 4.8 that shipped May 28 — so the Opus column below mixes generations and should be read as a rough position, not a head-to-head. And cross-vendor benchmark numbers always carry harness differences.

Capability	Qwen3.7-Max	Claude Opus 4.8	GPT-5.5	DeepSeek V4 Pro
AA Intelligence Index	56.6	61.4	—	—
SWE-Bench Pro	60.6	69.2	~58.6	59.0
SWE-Bench Verified	80.4	88.6	~82	80.6
Terminal-Bench 2.0	69.7	74.6 (2.1)	78.2 (2.1)	67.9
GPQA Diamond	92.4	93.6	—	—
Input / output (per 1M)	$2.50 / $7.50	$5 / $25	~$5 / ~$30	varies

Read that table carefully: on the aggregate and on the hardest coding (SWE-Bench Pro/Verified), Opus 4.8 is clearly ahead. Qwen3.7-Max's case isn't "it's the smartest model" — it's "it's a top-5 model at roughly half the price of the Western frontier, and it leads its peer group on the hard agentic-coding benchmarks where the leaders aren't." That's a real and defensible position. For the full cross-model picture, see the FrankX models tracker, and for the open-weight Chinese alternative, the DeepSeek V4 breakdown.

One cost wrinkle the per-token price hides: Artificial Analysis observed Qwen3.7-Max generating roughly 97 million tokens across its evaluation suite, far above the ~24M median. It's a verbose reasoner. The headline $7.50 output price is cheap, but a verbose model burns more output tokens per task — so the effective cost-per-task gap versus a terser model is smaller than the rate card suggests. Budget on tokens-per-task from your own evals, not on the sticker.

What's the Pricing?

Model	Input / 1M	Output / 1M	Cached input	Notes
Qwen3.7-Max	$2.50	$7.50	$0.25	90% cache discount; 1M context
Claude Opus 4.8	$5.00	$25.00	—	Western frontier leader
GPT-5.5	~$5.00	~$30.00	—	Terminal/computer-use workhorse
Qwen3.7-Plus	$0.40	$1.60	—	Multimodal, cheaper sibling (June 2, 2026)

Two things stand out. First, the 90% cached-input discount ($2.50 → $0.25) is meaningful for agent workloads, which re-send large, stable context (codebase, instructions, tool schemas) on every turn. If your prompt prefix is cacheable and stable, your effective input cost collapses. Second, note the sibling: Qwen3.7-Plus shipped June 2, 2026 at $0.40/$1.60 with text, image, and video input — a cheaper, multimodal model. Max is the pure-text, pure-reasoning flagship; Plus is the budget multimodal option. If your workload is vision-heavy or cost-sensitive and doesn't need the absolute top of the reasoning ladder, Plus is the one to evaluate first.

Third-party routing can be even cheaper — OpenRouter has listed Qwen3.7-Max as low as $1.25/$3.75 — but verify the provider's context limit and rate caps before you commit a production agent to a reseller's pricing.

What Changed vs Qwen3.6?

The jump from the Qwen3.6 generation is mostly about horizon and harness, not raw single-shot intelligence:

Area	Qwen3.6-Max-Preview	Qwen3.7-Max
Context window	~262K	1M tokens
Max output	smaller	65,536 tokens
KernelBench L3 success	48% (3.6-Plus)	96%
CUDA-kernel demo speedup	1.1x (3.6-Plus)	10x
Long-horizon autonomy	hours-scale	35-hour productive runs
Native thinking	partial	full extended-thinking mode
External harnesses	limited	Claude Code, OpenClaw, Qwen Code, custom

The cross-harness point is the underrated one. Alibaba says it's the same backbone whether you drive it through Anthropic's Claude Code, OpenClaw, Qwen Code, or your own tool-use framework — and exposes a native Anthropic Messages-compatible protocol, so it's a near-drop-in for code already written against Claude's API. For teams already standardized on the Anthropic protocol, that lowers the switching cost to "change the base URL and model string and re-run your evals." That's a deliberate land-grab on Anthropic's developer ergonomics, and it's smart.

What About Open Weights and Self-Hosting?

This is where Qwen's reputation and Qwen3.7-Max's reality diverge, so be precise. Qwen3.7-Max is closed-weight and API-only. There are no published weights, no GGUF, no Ollama image — the only way to run it today is through Alibaba Cloud Model Studio (DashScope) or a reseller. If your requirement is on-prem, air-gapped, or self-hosted inference, Qwen3.7-Max does not satisfy it.

The open-weight Qwen story is still alive — just one tier down. The prior generation followed a consistent pattern: the Max flagship stayed closed, while smaller dense and MoE variants (the 27B dense and 35B-A3B MoE in Qwen3.6) shipped open under Apache 2.0 with 256K-class context. Multiple outlets expect Qwen3.7-equivalent open-weight variants to follow in the June–July 2026 window on the same cadence — but Alibaba has not confirmed this, so treat it as expectation, not roadmap.

The practical takeaway: if you want a self-hostable Qwen today, you're on the Qwen3.6 open-weight models or waiting for the 3.7 open releases. If you want the frontier agent, you're on the API and you're accepting closed weights. For genuinely open frontier reasoning, the DeepSeek V4 analysis is the more relevant comparison.

What Does It Mean for Builders?

For agentic and long-horizon work

This is the model's home turf. If you're building something that runs a tool loop for hours — migrations, optimization sweeps, research agents, batch refactors — Qwen3.7-Max is built for it, priced for it, and has the cache discount to make re-sending stable context cheap. Give it the full task in one well-specified first turn, lean on the 1M context to hold the whole working set, and let it run. The 35-hour demo is a demo, but the underlying claim — productive autonomy past 30 hours — is the differentiator worth testing on your own workload.

For teams already on Claude's API

The Anthropic Messages-compatible protocol makes Qwen3.7-Max one of the lowest-friction non-Anthropic models to trial. Point your Claude Code or custom harness at the Alibaba endpoint, swap the model string, and run your existing eval suite. At half the input price and a quarter of the output price of Opus 4.8, even a modest pass rate on your evals can change your routing math for the cost-tolerant slice of your traffic.

For cost-conscious routing

The honest framing: Qwen3.7-Max is not the model you reach for when a silent error is expensive — Opus 4.8 leads the hard-coding and aggregate benchmarks, and you pay the premium to buy down error risk. Qwen3.7-Max is the model you reach for when the task is genuinely hard and high-volume, the cost-of-error is bounded, and you can verify output against tests. Route the expensive-failure work to Opus, the high-volume agentic execution to Qwen3.7-Max, and the multimodal or budget tier to Qwen3.7-Plus. Match the model to the task's cost-of-error, not to the leaderboard.

One caution

Re-baseline your token budgets. The verbosity that showed up in Artificial Analysis's 97M-token run is real, and it means your output-token spend and your latency will both run higher than the per-token price implies. Measure tokens-per-task on your own traffic before you size the bill.

FAQ

Is Qwen3.7-Max better than Claude Opus 4.8?

No, not on aggregate. Opus 4.8 leads the Artificial Analysis Intelligence Index (61.4 vs 56.6) and the hardest coding benchmarks (SWE-Bench Pro 69.2 vs 60.6, SWE-Bench Verified 88.6 vs 80.4). Qwen3.7-Max's advantage is price and peer-group leadership: it's a top-5 model at roughly half Opus's input price and a quarter of its output price, and it leads its own peer group (Kimi K2.6, DeepSeek V4 Pro) on hard agentic-coding benchmarks. Choose Opus when silent errors are expensive; choose Qwen3.7-Max for high-volume, verifiable agentic work.

How much does Qwen3.7-Max cost?

$2.50 per million input tokens and $7.50 per million output tokens, with a 90% cached-input discount that drops cached input to $0.25/M. That's roughly half the input price and a quarter of the output price of Claude Opus 4.8. Note that it's a verbose reasoner — Artificial Analysis measured ~97M tokens across its eval suite versus a ~24M median — so effective cost-per-task is higher than the rate card alone suggests. Third-party routers like OpenRouter have listed lower rates ($1.25/$3.75).

Is Qwen3.7-Max open source or open weight?

No. Qwen3.7-Max is closed-weight and API-only — accessible only through Alibaba Cloud Model Studio (DashScope) and resellers. There are no published weights, GGUF files, or Ollama images. Qwen's open-weight releases continue at the smaller-model tier (the Qwen3.6 27B dense and 35B-A3B MoE shipped Apache 2.0), and Qwen3.7-equivalent open variants are expected mid-2026 — but that's unconfirmed by Alibaba.

What's the context window and max output?

1 million input tokens and 65,536 max output tokens — up from ~262K context on Qwen3.6-Max-Preview. It's a text-only reasoning model; for multimodal input (image, video), the cheaper Qwen3.7-Plus sibling is the one to use.

What is the 35-hour autonomous run, and is it verified?

It's Alibaba's flagship demo: Qwen3.7-Max autonomously optimized CUDA kernels over 35 hours, making 1,158 tool calls and 432 kernel evaluations to reach a 10x geometric-mean speedup, reportedly still improving past 30 hours. It's a vendor demo on Alibaba's own hardware and harness — directionally credible and consistent with the model's long-horizon design, but not independently reproduced. Treat the specific numbers as vendor-claimed.

Which numbers here are verified vs vendor-claimed?

The Artificial Analysis Intelligence Index (56.6), pricing, context window, and release date are independently sourced. The SWE-Bench, LiveCodeBench, GPQA, and MMLU-Pro figures align with Artificial Analysis but several come from Alibaba's own comparison tables. HMMT 97.1, MCP-Atlas 76.4, Apex 44.5, and the entire 35-hour-run narrative are vendor-claimed — credible but not yet third-party reproduced. Architecture (parameters, MoE config) is undisclosed; I did not assume it.

Analysis by Frank — former Oracle AI architect who helped build Oracle's AI Center of Excellence, now building agentic systems independently and making music with AI. Published June 5, 2026 with benchmarks validated against Alibaba's official materials, Artificial Analysis, DataCamp, and VentureBeat. Architecture is undisclosed and several agentic figures are vendor-claimed — both are marked as such throughout.

Get Started

Build your first AI system

Step-by-step guide to setting up ACOS, creating your first agent, and shipping real products with AI.

Start building

Templates & Blueprints

Production-ready architecture

Download AI architecture templates, multi-agent blueprints, and prompt engineering patterns.

Browse templates

Inner Circle

Join the builder community

Connect with creators and architects shipping AI products. Weekly office hours, shared resources, direct access.

Join the circle

Stay in the intelligence loop

Weekly field notes on AI systems, production patterns, and builder strategy.

Continue Reading

Intelligence Dispatches14 min read

Claude Opus 4.8: A Modest Bump That Quietly Tops the Leaderboard

Anthropic's Opus 4.8 lands 41 days after 4.7 with the same $5/$25 pricing, SWE-Bench Pro 69.2%, GDPval-AA 1890, dynamic workflows, and cheaper fast mode. Technical breakdown with verified benchmarks, what changed, and what it means for builders.

Read article

Intelligence Dispatches12 min read

GPT-5.5 ("Spud"): What Actually Changed and Why It Matters

OpenAI's GPT-5.5 leads GDPval at 84.9%, OSWorld at 78.7%, and Tau2 Telecom at 98% — at double the price of GPT-5.4. Technical breakdown with verified benchmarks, pricing, and what it means for builders.

Read article

Intelligence Dispatches13 min read

Grok 4.3: xAI Trades the Crown for the Price Tag

xAI's Grok 4.3 scores 53 on the Artificial Analysis Intelligence Index, lifts GDPval-AA to 1500 Elo, ships a 1M context window with always-on reasoning, and cuts price ~40% to $1.25/$2.50. Technical breakdown with verified benchmarks and what it means for builders.

Read article

Intelligence DispatchesJune 5, 202614 min read

Qwen3.7-Max: Alibaba's Agent Flagship Cracks the Global Top 5 — and Runs for 35 Hours

Frank

AI Architect & Creator

Former Oracle AI architect · helped build Oracle's AI CoE

Share Share

Qwen3.7-Max: Alibaba's Agent Flagship Cracks the Global Top 5 — and Runs for 35 Hours

What Is Qwen3.7-Max?

Three things define this release:

It's an agent model first. The pitch isn't a few points of MMLU; it's how long and how reliably the model can drive a tool loop without a human in it. The flagship demo is a 35-hour autonomous run, not a chat transcript.
It's closed-weight. This is the important asterisk for anyone who associates Qwen with open weights. Qwen3.7-Max is API-only — no Hugging Face checkpoint, no GGUF, no Ollama. The open-weight Qwen story continues, but not at the frontier tier (more on that below).
It's a genuine top-5 placement. At 56.6 on the Artificial Analysis Intelligence Index v4.0, it's the strongest Chinese model the index has ranked, in the same conversation as Claude Opus 4.7 and GPT-5.5 — at roughly half the price of the Western frontier.

A note on naming: the repo slug here is qwen3-max, but the actual current flagship is Qwen3.7-Max. If you came looking for a plain "Qwen3-Max," this is the model that supersedes it.

What Are the Verified Benchmarks?

Benchmark	Qwen3.7-Max	What it measures	Source confidence
AA Intelligence Index v4.0	56.6	Composite of 10 evals (GDPval-AA, GPQA, HLE, SciCode, Terminal-Bench Hard, etc.)	Artificial Analysis (independent)
SWE-Bench Pro	60.6	Harder, contamination-resistant coding	Vendor table, AA-aligned
SWE-Bench Verified	80.4	Real GitHub issue resolution	Vendor table
Terminal-Bench 2.0 (Terminus)	69.7	Agentic terminal/CLI workflows	Vendor table
MCP-Atlas	76.4	Tool-use / MCP orchestration	Vendor-claimed
GPQA Diamond	92.4	Graduate-level science Q&A	Vendor table
HMMT 2026 Feb	97.1	Competition mathematics	Vendor-claimed
LiveCodeBench	91.6	Live competitive coding	Vendor table
MMLU-Pro	89.6	Broad knowledge, harder MMLU	Vendor table
Humanity's Last Exam	41.4	Frontier multidisciplinary reasoning	AA-aligned
SciCode	53.5	Research-grade scientific coding	Vendor table
Apex (reasoning)	44.5	Hard multi-step reasoning	Vendor-claimed

Two rows deserve more than a line.

What Is the 35-Hour Autonomous Run?

How Does It Compare to Opus 4.8, GPT-5.5, and DeepSeek V4?

Capability	Qwen3.7-Max	Claude Opus 4.8	GPT-5.5	DeepSeek V4 Pro
AA Intelligence Index	56.6	61.4	—	—
SWE-Bench Pro	60.6	69.2	~58.6	59.0
SWE-Bench Verified	80.4	88.6	~82	80.6
Terminal-Bench 2.0	69.7	74.6 (2.1)	78.2 (2.1)	67.9
GPQA Diamond	92.4	93.6	—	—
Input / output (per 1M)	$2.50 / $7.50	$5 / $25	~$5 / ~$30	varies

What's the Pricing?

Model	Input / 1M	Output / 1M	Cached input	Notes
Qwen3.7-Max	$2.50	$7.50	$0.25	90% cache discount; 1M context
Claude Opus 4.8	$5.00	$25.00	—	Western frontier leader
GPT-5.5	~$5.00	~$30.00	—	Terminal/computer-use workhorse
Qwen3.7-Plus	$0.40	$1.60	—	Multimodal, cheaper sibling (June 2, 2026)

What Changed vs Qwen3.6?

The jump from the Qwen3.6 generation is mostly about horizon and harness, not raw single-shot intelligence:

Area	Qwen3.6-Max-Preview	Qwen3.7-Max
Context window	~262K	1M tokens
Max output	smaller	65,536 tokens
KernelBench L3 success	48% (3.6-Plus)	96%
CUDA-kernel demo speedup	1.1x (3.6-Plus)	10x
Long-horizon autonomy	hours-scale	35-hour productive runs
Native thinking	partial	full extended-thinking mode
External harnesses	limited	Claude Code, OpenClaw, Qwen Code, custom

What About Open Weights and Self-Hosting?