Gemini 3.5 Pro is still in limited Vertex preview as of June 2026 — no model card, no benchmarks, no pricing. Here's the verifiable picture: what Flash already proved, what Google has committed to, and what to wait for at GA.
TL;DR: As of June 5, 2026, Gemini 3.5 Pro is not generally available. Google announced it at I/O on May 19, 2026, but it remains in limited Vertex AI preview for select enterprise customers — no model card, no published benchmarks, no pricing tier. The model ID is gemini-3.5-pro. Google has positioned it as its strongest agentic and coding model, targeting a 2M-token context window and "Deep Think" reasoning. Sundar Pichai told the I/O audience to "give us until next month." So this is an honest pre-GA brief: what Gemini 3.5 Flash already proved, what Google has actually committed to, and what numbers to wait for. Where a figure isn't confirmed, I say so.
No. Gemini 3.5 Pro is not generally available. This is the single most important thing to get right, because half the "Gemini 3.5 Pro benchmarks" content circulating right now projects Flash's numbers onto a model whose model card does not yet exist.
Here's the verifiable timeline:
As of this writing (June 5), nothing has changed that status. There is no spec sheet, no benchmark card, no pricing tier, and no general API access for Gemini 3.5 Pro. If you see a "94% on benchmark X" claim for 3.5 Pro right now, it is either extrapolated from Gemini 3.1 Pro / 3.5 Flash or invented. Treat it as such.
So the useful question isn't "how good is 3.5 Pro" — nobody outside Google can answer that yet. It's "what's the evidence base, and what should I watch for at GA."
Flash is the reason 3.5 Pro is interesting. The whole point of a "Pro" tier is that it sits above Flash — so Flash's GA numbers set the floor for what Pro has to clear.
Gemini 3.5 Flash shipped May 19, 2026 (gemini-3.5-flash), and the headline was genuinely unusual: a Flash-tier model leading the previous Pro tier on agentic benchmarks. Verified specs from Google's model card and independent coverage:
| Spec / Benchmark | Gemini 3.5 Flash | Source basis |
|---|---|---|
| Context window | 1M tokens | Google model card |
| Max output | 64K tokens | Google model card |
| Modalities | Text, image, audio, video, PDF in | Google model card |
| Terminal-Bench 2.1 | 76.2% | Google / independent |
| MCP Atlas | 83.6% | Google / independent |
| GDPval-AA | 1656 Elo | |
| CharXiv Reasoning | 84.2% | |
| Pricing (in / out) | $1.50 / $9.00 per 1M | Google API pricing |
| Cached input | $0.15 per 1M (90% off) | Google API pricing |
The thing to internalize: Flash already beats Gemini 3.1 Pro (Google's February 2026 flagship) on Terminal-Bench 2.1, MCP Atlas, and GDPval-AA. That's the bar 3.5 Pro is built to exceed. If Pro merely matched Flash on agentic coding, the tier wouldn't justify itself — so Google is effectively committing to a model that pushes past 76% Terminal-Bench and 83% MCP Atlas, with Deep Think reasoning layered on top.
Where Flash still trails: pure abstract reasoning. Flash scores 72.1% on ARC-AGI-2 versus Gemini 3.1 Pro's verified 77.1%. That 5-point reasoning gap is exactly the territory a Deep Think-equipped Pro model is designed to reclaim.
Stripping out the speculation, here's what Google itself has stated about Gemini 3.5 Pro — framed as targets and positioning, not measured results:
Model ID: gemini-3.5-pro (visible in Vertex preview).

Every one of those is a vendor-stated target. None is an independently reproduced benchmark, because the model card doesn't exist yet. I'm flagging that explicitly because the whole value of this piece is not pretending otherwise.
For a sense of the lineage these targets sit on, Gemini 3.1 Pro (the current shipped Pro, GA February 2026) is the honest baseline:
| Gemini 3.1 Pro (verified, shipped) | Value |
|---|---|
| Context window | 1M tokens |
| ARC-AGI-2 | 77.1% |
| GPQA Diamond | 94.3% |
| SWE-bench Verified | 80.6% |
| MMMU-Pro | 80.5% |
| Pricing (in / out) | $2 / $12 per 1M (tiered: $4 / $18 above 200K) |
3.5 Pro is the model that's supposed to beat that line while adding Deep Think and (targeted) 2M context. Until GA, that's the most defensible way to think about it.
This is the table everyone wants, so here it is with a hard rule: Gemini 3.5 Pro's column is marked "TBD at GA," not filled with guesses. The other models' numbers are verified from their GA releases and independent trackers.
| Benchmark | Gemini 3.5 Pro | Gemini 3.5 Flash | Claude Opus 4.8 | GPT-5.5 | Grok 4.3 |
|---|---|---|---|---|---|
| ARC-AGI-2 | TBD at GA | 72.1% | 75.8% | 85.0% | not published |
| SWE-bench Pro | TBD at GA | — | 69.2% | 58.6% | — |
| Terminal-Bench 2.1 | TBD at GA | 76.2% | 74.6% | 78.2% | — |
| MCP Atlas | TBD at GA | 83.6% | — | — | — |
| GPQA Diamond | TBD at GA | — | 93.6% | 93.5% | — |
| GDPval-AA (Elo) | TBD at GA | 1656 | 1890 | ~1769 | — |
| Context window | 2M (target) | 1M | 1M | 922K | 1M |
Reading this honestly:
The interesting strategic question at GA: does 3.5 Pro chase GPT-5.5's ARC-AGI-2 crown via Deep Think, or does Google double down on the agentic-coding + multimodal + 2M-context lane where it's already differentiated? My read is the latter — Flash's numbers tell you where this generation's engineering went.
For the live, continuously-updated cross-model view, see the 2026 model landscape on AI Ops. For the two models currently setting the bar Pro has to clear, see the Claude Opus 4.8 breakdown.
Unconfirmed. No pricing tier has been published for 3.5 Pro. The only honest statement: it will be announced at GA.
That said, Google's Pro-tier pricing has been remarkably stable, so the shape is predictable even if the number isn't:
| Model | Input / 1M | Output / 1M | Notes |
|---|---|---|---|
| Gemini 3.5 Flash | $1.50 | $9.00 | Cached $0.15; verified |
| Gemini 3.1 Pro | $2.00 | $12.00 | Tiered to $4/$18 above 200K; verified |
| Gemini 3.5 Pro | TBD | TBD | Announced at GA; expect tiered, context-length-dependent |
If history holds, expect context-length-dependent tiered pricing (a higher rate above a long-context threshold, the way 3.1 Pro jumps at 200K) and a likely premium over 3.1 Pro for the 2M window and Deep Think. But I'm not going to put a fake dollar figure in a table. Wait for the model card.
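The verified Flash row is enough to estimate per-request cost today. The following is a minimal sketch, not Google's billing logic: the function name is hypothetical, and it assumes that cached tokens are the portion of the input billed at the cached rate while the rest is billed at the standard rate. Gemini 3.5 Pro is deliberately absent because it has no published price.

```python
# Verified Gemini 3.5 Flash prices from the table above (USD per 1M tokens).
FLASH_INPUT = 1.50
FLASH_CACHED_INPUT = 0.15
FLASH_OUTPUT = 9.00


def flash_request_cost(input_tokens: int, output_tokens: int,
                       cached_tokens: int = 0) -> float:
    """Estimate the cost of one Gemini 3.5 Flash request in USD.

    Assumption (not from Google's docs): cached_tokens is the part of
    input_tokens served from cache and billed at the cached rate; the
    remainder is billed at the full input rate.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * FLASH_INPUT
        + cached_tokens * FLASH_CACHED_INPUT
        + output_tokens * FLASH_OUTPUT
    )
    return round(total / 1_000_000, 6)
```

Under these assumptions, a 100K-token prompt with 2K output tokens comes to about $0.168 uncached and about $0.06 when 80K of the prompt is cached, which is why cache hit rate matters more than the headline input price for repetitive agent loops.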
For most teams, today, the answer is Flash — because it's the only one of the two you can actually deploy. But the routing logic at GA is straightforward:
Use Gemini 3.5 Flash when:
Wait for Gemini 3.5 Pro when:
The honest framing: Flash already absorbed most of what used to require Pro. 3.5 Pro has to earn its slot on the hardest reasoning and the longest context, not on general agentic coding, where Flash already leads the prior Pro tier. That's a higher bar than a normal Pro release, and it's why the GA benchmarks matter more than usual.
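That decision rule can be written down as a small predicate. This is a sketch under stated assumptions: the `Workload` fields, the `pro_is_ga` flag, and the "wait-for-pro-ga" sentinel are all hypothetical names, and the only hard numbers used are Flash's verified 1M-token window and the two model IDs mentioned in this article.

```python
from dataclasses import dataclass

FLASH_CONTEXT_TOKENS = 1_000_000  # verified Flash context window

FLASH = "gemini-3.5-flash"
PRO = "gemini-3.5-pro"            # preview-only as of this writing
WAIT = "wait-for-pro-ga"          # hypothetical sentinel, not a model ID


@dataclass
class Workload:
    max_context_tokens: int
    needs_hard_reasoning: bool = False  # e.g. ARC-AGI-2-style abstract tasks


def pick_gemini_tier(w: Workload, pro_is_ga: bool = False) -> str:
    """Pick a Gemini tier for a workload, following the article's framing."""
    exceeds_flash = w.max_context_tokens > FLASH_CONTEXT_TOKENS
    # Pro only earns its slot on the hardest reasoning or the longest
    # context, and only once it can actually be called.
    if pro_is_ga and (exceeds_flash or w.needs_hard_reasoning):
        return PRO
    # The prompt cannot fit in Flash and Pro is not callable yet.
    if exceeds_flash:
        return WAIT
    return FLASH
```

With `pro_is_ga=False`, everything that fits in 1M tokens lands on Flash, which matches the "Flash today" answer above. Flipping the flag at GA changes the routing without touching call sites.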
A few practical takeaways while we wait:
Don't architect on a model you can't call. If you're building agent pipelines now, build on Gemini 3.5 Flash (GA, priced, documented) or a confirmed competitor — not on 3.5 Pro promises. Swap Pro in at GA if its measured numbers justify the cost delta over Flash. Many workloads won't need it.
Plan for context-length pricing tiers. Gemini Pro pricing jumps above a threshold (200K on 3.1 Pro). If your prompts straddle that boundary, your cost model needs the tiered rate, not the headline rate. Budget for the worst-case tier.
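To make the tier boundary concrete, here is a small sketch using the Gemini 3.1 Pro rates cited in this article. The function names are hypothetical, and one modeling choice is an assumption worth checking against Google's pricing page: the whole request (input and output) is billed at the higher rate once the prompt exceeds 200K tokens.

```python
# Gemini 3.1 Pro rates cited in this article (USD per 1M tokens).
BASE_RATES = (2.00, 12.00)   # (input, output) at or below the threshold
LONG_RATES = (4.00, 18.00)   # (input, output) above the threshold
THRESHOLD_TOKENS = 200_000


def tiered_cost(prompt_tokens: int, output_tokens: int) -> float:
    """Cost of one request in USD under threshold-based tiered pricing.

    Assumption: crossing the threshold reprices the entire request,
    not just the tokens above 200K.
    """
    long_context = prompt_tokens > THRESHOLD_TOKENS
    in_rate, out_rate = LONG_RATES if long_context else BASE_RATES
    total = prompt_tokens * in_rate + output_tokens * out_rate
    return round(total / 1_000_000, 6)


def worst_case_budget(max_prompt_tokens: int, max_output_tokens: int,
                      requests: int) -> float:
    """Budget in USD assuming every request hits the largest expected size."""
    per_request = tiered_cost(max_prompt_tokens, max_output_tokens)
    return round(per_request * requests, 2)
```

Under that assumption, `tiered_cost(200_000, 1_000)` is $0.412 while `tiered_cost(200_001, 1_000)` is about $0.818: one extra prompt token roughly doubles the bill. That cliff is the reason to budget with `worst_case_budget` rather than the headline rate when prompts sit near the boundary.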
Watch ARC-AGI-2 and SWE-bench Pro at GA specifically. Those are where Gemini has historically trailed Opus and GPT-5.5. If 3.5 Pro with Deep Think closes the ARC-AGI-2 gap to GPT-5.5's 85.0%, that's a real shift. If it lands near 3.1 Pro's 77.1%, the story stays "multimodal + context + price," not "reasoning crown."
Multimodal is the durable edge. Across the frontier, Gemini's consistent differentiator is native text/image/audio/video/PDF in one model. If your product is video- or audio-heavy, the Gemini line is worth tracking regardless of where the reasoning benchmarks land.
Route, don't standardize. The June 2026 frontier has no single winner — Opus 4.8 leads aggregate intelligence and coding, GPT-5.5 leads abstract reasoning, Grok 4.3 leads on price, Gemini leads on multimodal + context. A routing layer that sends each task to the right tier beats betting the whole stack on one model. This is exactly the case for treating models as interchangeable infrastructure rather than a religion. For how Microsoft is entering this same fight with its own full-stack play, see the Microsoft MAI frontier models breakdown.
No, Gemini 3.5 Pro is not generally available. As of June 5, 2026, it is in limited Vertex AI preview for select enterprise customers only. There is no public model card, no published benchmarks, and no general API pricing. Google announced it at I/O on May 19, 2026, with GA targeted for sometime in June 2026. Sundar Pichai's phrasing was "give us until next month," with no committed date.
The model ID is gemini-3.5-pro, visible in the Vertex AI preview. The API name should stay the same at GA.
Google is targeting a 2M-token context window, double Gemini 3.5 Flash's 1M. This is a stated target, not a confirmed spec, until the GA model card lands. For comparison, the currently shipped Gemini 3.1 Pro has a verified 1M-token window.
Unconfirmed. Pricing will be announced at GA. Google's Pro tier has historically used context-length-dependent tiered pricing — Gemini 3.1 Pro runs $2/$12 per 1M, rising to $4/$18 above 200K tokens. Expect a similar tiered structure, likely with a premium for the larger context window and Deep Think. Any specific dollar figure circulating now is speculation.
Unknown — it hasn't been benchmarked publicly. What's verified: as of late May 2026, Claude Opus 4.8 leads aggregate intelligence (Artificial Analysis Index 61.4) and SWE-bench Pro (69.2%), while GPT-5.5 leads ARC-AGI-2 (85.0%). Gemini 3.5 Flash already beats the prior Gemini Pro tier on agentic coding (Terminal-Bench 2.1 76.2%, MCP Atlas 83.6%). Where 3.5 Pro lands against Opus and GPT-5.5 is precisely the open question at GA.
If you need to ship now, use Flash — it's GA, priced, and documented, and it already leads the previous Pro tier on agentic coding at $1.50/$9 per 1M. Wait for Pro only if your workload hits Flash's ceiling on hard reasoning, needs the 2M context window, or involves heavy video/audio multimodal reasoning. For most agentic-coding workloads, Flash is the pragmatic pick today.
ARC-AGI-2 and SWE-bench Pro — the two areas where Gemini has historically trailed Opus and GPT-5.5. Also watch whether Deep Think reasoning numbers are reported separately, and whether independent trackers (Artificial Analysis, llm-stats, LMArena) reproduce Google's claimed figures before you trust them in production.
Analysis by Frank — former Oracle AI architect who helped build Oracle's AI Center of Excellence, now building agentic systems independently and making music with AI. Written June 5, 2026, while Gemini 3.5 Pro is still in preview. Verified figures are sourced to Google's GA releases, model cards, and independent trackers; everything about 3.5 Pro itself is marked as preview-stage and will be updated when the GA model card lands.