For frontier training in 2026, what are the real alternatives to Blackwell/Rubin?

TPU v7 Ironwood (if you're inside Google Cloud or Anthropic), Trainium 3 UltraServers (if you're inside AWS), and AMD MI355X / MI400 Helios (if you have ROCm engineers and want second-source leverage). Everyone else is rebadging NVIDIA.

What's the lowest-cost-per-Mtoken inference option in 2026?

It depends on the model and latency target. Cerebras and Groq lead throughput-per-dollar on Llama; d-Matrix Corsair leads perf/watt (38 TOPS/W); hosted inference providers (DeepInfra, Together, Fireworks) win on raw $/Mtoken for open weights, often under $0.20/M.

Should I buy a DGX Spark or a Ryzen AI Max+ 395 for local inference?

If you're CUDA-native and want fast TTFT on >70B models, DGX Spark ($3,999-$4,699). If you want the best $/GB-of-unified-memory and Windows compatibility, Strix Halo at $3,999. Mac Studio M4 Ultra remains the developer default for sustained local workflows.

Are neuromorphic and photonic chips production-ready or still research?

Neuromorphic (Loihi 2, NorthPole) wins on specific sparse/event-driven workloads but isn't a general LLM substrate. Photonic interconnect (Lightmatter Passage M1000 → 1.6 Tbps/fiber with Qualcomm SerDes, Mar 2026; Celestial AI acquired by Marvell Dec 2025) is shipping at the interconnect layer — photonic compute is still mostly research.

Is the CUDA moat actually breakable this decade?

Only via the dependency layer (Triton, MLIR, torch.compile, vLLM portability), not via a single rival chip. NVIDIA knows this and is investing heavily in Python DSLs (the CuTe DSL in CUTLASS 4.0, cuTile, Warp). The most credible threat is "CUDA-optional" becoming the default in PyTorch and JAX, not "CUDA-replaced."


AI Chips & Silicon Landscape

Post-Blackwell hardware across training, inference, and edge

TL;DR

NVIDIA holds ~80% of AI accelerator revenue in 2026, but Google TPU v7 Ironwood, AWS Trainium 3, and inference specialists (Groq, Cerebras, d-Matrix, Etched) are absorbing meaningful share. HBM4 supply, not transistors, is the binding constraint through late 2026.

Updated 2026-06-22

34 sources validated


~80%

NVIDIA share of AI accelerator revenue (2026)

Silicon Analysts / TrendForce

1.44 EF

FP4 compute in a single GB200 NVL72 rack (72 GPUs, 13.4 TB HBM3e)

NVIDIA GB200 NVL72 datasheet

44.6%

Projected 2026 YoY growth in custom ASIC shipments vs 16.1% for merchant GPUs

TrendForce

500K+

Trainium 2 chips in AWS Project Rainier for Anthropic — 5x prior Claude training compute

AWS / SemiAnalysis

Training-Class Silicon

The frontier-training tier is a four-horse race in compute and a one-horse race in software. Blackwell shipped at scale through 2025; Rubin entered production in late 2025 and was formally launched at GTC 2026. AMD's MI350 is the first credible head-to-head on memory capacity; MI400 (2026) is the first credible head-to-head on rack-scale interconnect. Intel exited the standalone training GPU race entirely.

NVIDIA Blackwell B200 / GB200 NVL72

Reference

B200: 192 GB HBM3e, 8 TB/s bandwidth, 20 PFLOPS sparse FP4. GB200 NVL72 rack: 72 B200 + 36 Grace CPUs, 13.4 TB unified HBM3e, 576 TB/s aggregate bandwidth, 1.44 EF FP4. MLPerf Inference v5.0: 3.4x per-GPU and 30x per-system gain on Llama 3.1 405B vs H200.
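The rack-level figures follow from the per-GPU numbers by plain multiplication. A minimal Python sketch, using only the values quoted above (function and variable names are illustrative):

```python
# Per-GPU and rack figures as quoted in the text above.
GPUS_PER_RACK = 72
B200_FP4_PFLOPS = 20      # sparse FP4, per GPU
B200_HBM_BW_TB_S = 8      # TB/s, per GPU
RACK_HBM_TB = 13.4        # rack total quoted in the text


def rack_totals(gpus: int, pflops_per_gpu: float, bw_per_gpu: float) -> dict:
    """Aggregate per-GPU peak figures to rack level (simple sums, no efficiency loss)."""
    return {
        "fp4_exaflops": gpus * pflops_per_gpu / 1000,  # 1 EF = 1000 PF
        "hbm_bw_tb_s": gpus * bw_per_gpu,
    }


totals = rack_totals(GPUS_PER_RACK, B200_FP4_PFLOPS, B200_HBM_BW_TB_S)
hbm_gb_per_gpu = RACK_HBM_TB * 1000 / GPUS_PER_RACK  # implied HBM per GPU in the rack

print(f"FP4 compute: {totals['fp4_exaflops']:.2f} EF")
print(f"HBM bandwidth: {totals['hbm_bw_tb_s']} TB/s")
print(f"Implied HBM per GPU: {hbm_gb_per_gpu:.0f} GB")
```

The compute and bandwidth totals reproduce the quoted 1.44 EF and 576 TB/s. The implied per-GPU HBM comes out slightly below the 192 GB quoted for a standalone B200; the source does not explain the difference.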

NVIDIA Rubin / Vera Rubin (GTC 2026)

Shipping 2026

Rubin GPU: 336B transistors, 288 GB HBM4, 50 PFLOPS. Vera CPU: 88 Olympus cores (Armv9.2). Rubin NVL72 rack: 3.6 EF NVFP4 inference, 2.5 EF training — 10x inference throughput/watt vs Blackwell per NVIDIA. Production confirmed at GTC 2026; Rubin Ultra in 2H 2027.

AMD MI350X / MI355X & MI400 (Helios)

Credible #2

MI350X (CDNA 4, 2025): 288 GB HBM3e, 8 TB/s, 4x gen-on-gen AI compute and up to 35x inference uplift per AMD. MI400 (2026): 432 GB HBM4, 19.6 TB/s, 20 PFLOPS FP8, UALink scale to 72 GPUs in the Helios rack with Zen 6 EPYC + Vulcano NICs. ROCm 7 is the closest thing to a real challenger that CUDA has had.
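One way to read the bandwidth figures in this section: in single-stream decoding, each generated token requires streaming roughly all model weights from memory once, so memory bandwidth divided by weight size gives a rough ceiling on tokens per second. The sketch below applies that rule of thumb to the per-chip bandwidth numbers quoted above. The 70B-parameter model and 1 byte per parameter (FP8) are illustrative assumptions, and the bound ignores KV-cache traffic, batching, and multi-chip sharding.

```python
def decode_ceiling_tok_s(bandwidth_tb_s: float, params_billion: float,
                         bytes_per_param: float) -> float:
    """Rough upper bound on single-stream decode speed for a memory-bound model.

    Assumes every token reads all weights once; ignores KV cache and batching.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / weight_bytes


# Memory bandwidth per chip as quoted in the text (TB/s).
chips = {"B200": 8.0, "MI350X": 8.0, "MI400": 19.6}

# Hypothetical workload: 70B parameters stored at 1 byte each (FP8).
for name, bw in chips.items():
    print(f"{name}: ~{decode_ceiling_tok_s(bw, 70, 1):.0f} tok/s ceiling")
```

Real systems land well below this ceiling for a single stream and well above it in aggregate once requests are batched, which is why bandwidth and capacity are quoted side by side.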

Intel Gaudi 3 → Jaguar Shores

Falling behind

Gaudi 3 (2024): 128 GB HBM2e, 3.7 TB/s. Cheaper than H100 but slower, and built on older-generation HBM2e memory. Falcon Shores cancelled Jan 2025, repurposed as internal test chip. Successor Jaguar Shores (rack-scale, HBM4/4E) slips to 2026+. Intel has effectively conceded the training-GPU market for this cycle.

Hyperscaler Custom Silicon

The story of 2025-26 is that the top four hyperscalers each ship credible silicon at scale and are absorbing the workloads they care most about. TrendForce projects ASIC shipment growth at 44.6% in 2026 versus 16.1% for merchant GPUs.
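To see what those growth rates imply for unit mix, the sketch below applies them to a starting split. The 44.6% and 16.1% rates are the TrendForce projections cited above; the 30/70 starting split between ASICs and merchant GPUs is a purely hypothetical placeholder, not a figure from the source.

```python
ASIC_GROWTH = 0.446   # projected 2026 YoY shipment growth, custom ASICs (TrendForce)
GPU_GROWTH = 0.161    # projected 2026 YoY shipment growth, merchant GPUs (TrendForce)


def asic_share_after_one_year(asic_share_start: float) -> float:
    """Return the ASIC unit share after one year of growth at the projected rates."""
    asic = asic_share_start * (1 + ASIC_GROWTH)
    gpu = (1 - asic_share_start) * (1 + GPU_GROWTH)
    return asic / (asic + gpu)


growth_ratio = ASIC_GROWTH / GPU_GROWTH
hypothetical_start = 0.30  # assumed starting ASIC unit share, for illustration only

print(f"ASIC growth is {growth_ratio:.2f}x merchant GPU growth")
print(f"ASIC unit share: {hypothetical_start:.0%} -> "
      f"{asic_share_after_one_year(hypothetical_start):.1%}")
```

Under that assumed starting point, a growth gap of this size moves unit share by only a few points in a single year; the shift compounds if the gap persists.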

Google TPU v7 Ironwood

Strongest ASIC

192 GB HBM3e per chip (6x Trillium), 7.37 TB/s, 4,614 FP8 TFLOPS. SuperPod: 9,216 chips, 1.2 TB/s ICI bidirectional. 2x perf/watt vs Trillium. Anthropic and Google both committed multi-billion-dollar capacity. SemiAnalysis calls TPU v7 "the 900lb gorilla in the room."
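Multiplying the per-chip figures by the pod size gives a sense of SuperPod scale. This is peak arithmetic only, using the numbers quoted above; sustained throughput on real workloads will be lower.

```python
# Per-chip and pod-size figures as quoted in the text above.
CHIPS_PER_SUPERPOD = 9_216
FP8_TFLOPS_PER_CHIP = 4_614
HBM_GB_PER_CHIP = 192

pod_fp8_exaflops = CHIPS_PER_SUPERPOD * FP8_TFLOPS_PER_CHIP / 1e6  # TFLOPS -> EFLOPS
pod_hbm_petabytes = CHIPS_PER_SUPERPOD * HBM_GB_PER_CHIP / 1e6     # GB -> PB (decimal)

print(f"Peak FP8 per SuperPod: {pod_fp8_exaflops:.1f} EF")
print(f"Total HBM per SuperPod: {pod_hbm_petabytes:.2f} PB")
```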

AWS Trainium 2 / 3 / 4

Anthropic-aligned

Trainium 2: 96 GB HBM per chip, NeuronLink scaling to 64 chips; ~500K chips in Project Rainier (Indiana, Oct 2025), a cluster dedicated exclusively to Anthropic's Claude training. Trainium 3 (Dec 2025): 3nm, 2.52 PFLOPS FP8, 144 GB HBM3e, +40% energy efficiency. Trainium 4 announced with NVLink Fusion (hybrid Trainium + NVIDIA clusters).

Microsoft Maia 100 / 200

Catching up

Maia 100: TSMC N5, ~820mm², CoWoS-S, 64 GB HBM2E, 1.8 TB/s, 500W provisioned. 16xRx16 tensor unit, MX format support, 4.8 Tbps all-gather. Powers a subset of Azure OpenAI inference. Maia 200 succeeds it; Microsoft remains the most NVIDIA-dependent hyperscaler by ratio.

Meta MTIA v2

Recommender-focused

TSMC 5nm, 421 mm², 1.35 GHz, 128 GB LPDDR5 @ 204.8 GB/s, 3x perf vs MTIA v1. Focused on Meta's recommendation/ranking workloads, not LLM training. Six-chip cadence over two years signals serious commitment to in-house silicon for the workloads that matter most to Meta's P&L.

Inference Specialists

Inference economics is the bloodiest battleground. Hosted token prices fell ~80% between early 2025 and early 2026. The architectural splits are real: SRAM-only (Groq, Cerebras), wafer-scale (Cerebras), digital in-memory compute (d-Matrix), transformer-only ASIC (Etched), and reconfigurable dataflow (SambaNova).

Groq LPU

Speed king

230 MiB on-die SRAM per LPU — needs hundreds of chips to host Llama 3.3 70B. Delivers ~350 tok/s on 70B, sub-100ms TTFT. Pricing $0.05-$0.59/M input. The "fast first token" play; bandwidth is the moat, capacity is the constraint.
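The "hundreds of chips" figure can be checked with a back-of-the-envelope calculation: divide the model's weight footprint by the SRAM per chip. The sketch below counts weights only and ignores KV cache, activations, and any replication, so it is a lower bound. The two weight precisions are assumptions for illustration; the source does not say which precision Groq deploys.

```python
SRAM_PER_LPU_BYTES = 230 * 2**20  # 230 MiB on-die SRAM per LPU, as quoted above


def lpus_needed(params: int, bytes_per_param: int) -> int:
    """Minimum LPUs whose combined SRAM holds the weights (ceiling division)."""
    weight_bytes = params * bytes_per_param
    return -(-weight_bytes // SRAM_PER_LPU_BYTES)


PARAMS_70B = 70_000_000_000

# Precision choices below are assumptions for illustration.
print("8-bit weights: ", lpus_needed(PARAMS_70B, 1), "LPUs")
print("16-bit weights:", lpus_needed(PARAMS_70B, 2), "LPUs")
```

Either way the answer lands in the hundreds, which is the capacity constraint the card describes.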

Cerebras WSE-3 / CS-3

Throughput king

Wafer-scale: 46,225 mm², 4 trillion transistors, 900,000 cores, 44 GB on-die SRAM, 21 PB/s memory bandwidth, 125 PFLOPS peak. CS-3 system scales to 1.2 PB external memory; trains models up to 24T parameters. Demonstrated 2,500+ tok/s on Llama 3.3 70B; with speculative decoding, 4,000 tok/s.

SambaNova SN40L

Density play

Reconfigurable dataflow, 102B transistors per socket, 640 BF16 TFLOPS, 520 MB on-chip SRAM. 16 SN40L sockets serve a 671B-param model at 198 tok/s per user. SambaNova claims 40x perf/area vs Groq and 10x vs Cerebras on Llama 3.1 70B.

d-Matrix Corsair

Perf/watt leader

Digital In-Memory Compute (DIMC) on TSMC 6nm. Two-chip card: 9.6 PFLOPS MXINT4 / 2.4 PFLOPS MXINT8. 256 GB LPDDR5 pool + 2 GB on-chip SRAM. 38 TOPS/W — best published perf/watt for transformer inference. Full production Q4 2025; $275M raise Nov 2025.

Etched Sohu

Concentrated bet

Transformer-only ASIC — fixed-function attention silicon. Claim: 8-chip Sohu server delivers 500K+ tok/s on Llama 70B, equivalent to ~160 H100s. $500M raise at $5B valuation. The biggest bet on "transformers won." Catastrophic risk if architecture moves; massive payoff if it doesn't.

Tenstorrent Blackhole p150

Open-stack

TSMC 6nm, ~600mm². 120 Tensix cores + 768 RISC-V cores (one of the largest shipping RISC-V designs). 32 GB GDDR6 @ 512 GB/s, 664 BLOCKFP8 TFLOPS, 300W. 4x 800G Ethernet QSFP-DD for chip-to-chip (3.2 Tbps without switches). $1,299-$1,399 — open-stack play led by Jim Keller.

Edge & Personal Compute

Local AI got real in 2025-26. A $4K desktop now hosts 128 GB unified memory and runs 120B-param models. The split: Apple owns latency-sensitive client AI via Neural Engine + tight software; NVIDIA + AMD compete on "AI workstation in a box"; Qualcomm gets the Windows-on-Arm laptop story.

NVIDIA DGX Spark (Project Digits)

CUDA-native

GB10 Superchip: 20-core Arm Grace CPU + Blackwell GPU on package, 128 GB unified memory. $3,999 launch / $4,699 post-supply-shock. CUDA-native; ~1.6s TTFT on 120B models in NVIDIA's testing.

AMD Ryzen AI Max+ 395 (Strix Halo)

Best $/GB

16C/32T Zen 5, integrated Radeon GPU plus a 50 TOPS NPU. 128 GB unified memory at $3,999 (undercuts DGX Spark by $700). Faster than DGX Spark in single-batch llama.cpp via the Vulkan backend; 6-7s TTFT on 120B (vs NVIDIA's 1.6s with CUDA). Runs Windows 11.
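The trade-off between the two boxes is easier to see as price per GB of unified memory against time to first token. A small sketch using the figures quoted above; the DGX Spark price is taken as the $4,699 post-supply-shock figure (the one the $700 gap implies), and 6.5 s is an assumed midpoint of the quoted 6-7s range.

```python
from dataclasses import dataclass


@dataclass
class Box:
    name: str
    price_usd: int
    memory_gb: int
    ttft_s: float  # time to first token on a 120B model, as quoted in the text


# Prices and TTFT figures as quoted above; 6.5 s is the midpoint of "6-7s".
boxes = [
    Box("DGX Spark", 4_699, 128, 1.6),
    Box("Ryzen AI Max+ 395", 3_999, 128, 6.5),
]

for b in boxes:
    print(f"{b.name}: ${b.price_usd / b.memory_gb:.2f}/GB, TTFT {b.ttft_s}s")

price_gap = boxes[0].price_usd - boxes[1].price_usd
ttft_ratio = boxes[1].ttft_s / boxes[0].ttft_s
print(f"Price gap: ${price_gap}; TTFT ratio: {ttft_ratio:.1f}x")
```

At the $3,999 launch price the DGX Spark would match Strix Halo on $/GB, so the ranking depends on which price is actually available.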

Apple M4 Ultra / Pro / Max

Best client utilization

16-core Neural Engine, 38 TOPS sustained. Geekbench AI: 2-3x faster than Snapdragon X Elite NPU in INT8 and FP16. Unified memory architecture (up to 192 GB on Ultra-class Mac Studio) makes Macs the default local-inference machine for many developers.

Qualcomm Snapdragon X / X2 Elite Extreme

Windows-on-Arm

X Elite Hexagon NPU: 45 TOPS. X2 Elite Extreme: 80 TOPS, +50% multicore over Gen 1, Geekbench AI ~88,919 vs ~52,000 on M4. The Windows-on-Arm story; raw NPU TOPS lead but software stack still maturing relative to Apple's CoreML.

Key Findings

NVIDIA holds ~80% AI accelerator share by revenue in 2026 — but custom ASIC shipment growth (44.6%) is projected to nearly triple merchant GPU growth (16.1%), per TrendForce.

HBM4 supply is the binding 2026 constraint, not transistors. SK hynix, Samsung, and Micron only enter mass production in 2026; NVIDIA is targeting 11 Gb/s pin speeds that vendors struggle to yield (SemiAnalysis).

TSMC N3 is the new universal node: Blackwell→Rubin moves 4NP→3NP, AMD on N3 for MI350/MI400 AID+MID tiles (XCD on N2), TPU v7 on N3E, Trainium 3 on N3P. AI is now the majority of N3 demand.

Project Rainier (Oct 2025) deployed ~500K Trainium 2 chips on a 1,200-acre Indiana campus exclusively for Anthropic — 5x the compute used for prior Claude generations. The largest non-NVIDIA training cluster in production.

Inference economics collapsed: hosted token prices dropped ~80% from early 2025 to early 2026. Cerebras serves Llama 3.3 70B at 2,100 tok/s for $0.85/$1.20 per M tokens; Groq at $0.05-$0.59 per M.
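To turn per-million-token prices into a bill, multiply token counts by price and divide by one million. The sketch below does this for a hypothetical daily workload (the 50M input / 10M output token volumes are made up for illustration). It reads the Cerebras pair as $0.85 input and $1.20 output, the usual ordering for such pairs, and treats Groq's range as input pricing, as stated in the Groq card. The last line applies the market-wide ~80% drop to the Cerebras bill purely as an illustration; the source does not give Cerebras's earlier prices.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of a workload given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical daily workload: 50M input tokens, 10M output tokens.
IN_TOK, OUT_TOK = 50_000_000, 10_000_000

cerebras = request_cost_usd(IN_TOK, OUT_TOK, 0.85, 1.20)

# Groq's quoted range is for input pricing only, so price input tokens alone.
groq_input_low = request_cost_usd(IN_TOK, 0, 0.05, 0.0)
groq_input_high = request_cost_usd(IN_TOK, 0, 0.59, 0.0)

# An ~80% price drop means today's bill is ~20% of the earlier one.
bill_before_80pct_drop = cerebras / (1 - 0.80)

print(f"Cerebras: ${cerebras:.2f}/day")
print(f"Groq, input only: ${groq_input_low:.2f} to ${groq_input_high:.2f}/day")
print(f"Same bill before an 80% drop: ${bill_before_80pct_drop:.2f}/day")
```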

Intel exited the standalone AI training chip race: Falcon Shores cancelled (Jan 2025) and repurposed as internal test chip; Gaudi line retired. Jaguar Shores (rack-scale, 2026+) is the reboot — but Intel cedes this cycle entirely.

CUDA's moat is software, not silicon. 4M+ developers, 3,000+ optimized apps, 19 years of cuDNN/cuBLAS/NCCL/Nsight. Real threat isn't a rival chip — it's OpenAI Triton + MLIR enabling write-once, run-anywhere kernels.

Research Transparency

Limitations

• Market-share figures (NVIDIA 75-80%, ASIC 44.6% growth) are analyst projections (TrendForce, Silicon Analysts); vendor revenue disclosures lag and reported segments don't always align with accelerator-only TAM.
• MLPerf results cited are best-case vendor-submitted configurations with optimized stacks; real-world workload performance can deviate 30-50%.
• Inference pricing changes monthly; figures cited are accurate as of Jan-May 2026 but the floor keeps falling.
• Photonic and neuromorphic deployment claims for 2026 are partly forward-looking; only interconnect-layer photonics (Lightmatter Passage) has documented hyperscaler customer integration.

What We Don't Know

• Actual yield rates on HBM4 at NVIDIA's targeted 11 Gb/s pin speed: SK hynix, Samsung, and Micron all guide to mass production but won't publish yields.
• Anthropic's internal split between Trainium 2 (Project Rainier) and TPU v7 (Google Cloud) for Claude training and inference workloads going forward.
• Whether Etched Sohu's transformer-only bet survives an architecture shift (e.g., to state-space models or Mamba-class hybrids at scale).

Evidence Grade: Grade A (Peer-reviewed / meta-analyses)

Frequently Asked Questions

Both. Revenue share is still ~75-80%, but the unit-shipment growth rate is now lower than custom ASIC growth. Hyperscalers absorb their captive inference (TPU for Google/Anthropic, Trainium for AWS/Anthropic, Maia for Microsoft/OpenAI), while NVIDIA keeps the frontier-training default and almost all merchant external demand.