Google DeepMind (GA)

Gemma 4

Google’s open-weight flagship: a frontier-tier 31B model that runs on a single GPU, now under the Apache 2.0 license.


Context: 256K
Max output: not listed
Input /1M tokens: Open
Output /1M tokens: Open

Best for

  • Local-first / privacy-sensitive products (on-prem, healthcare, legal, finance)
  • Cost-conscious agentic systems (26B A4B MoE + vLLM)
  • Commercial fine-tuning under a clean license

Watch out

The academic benchmark gains (AIME 89.2%, GPQA 84.3%) come largely from Google’s own evaluations, so treat them as vendor claims until they are independently reproduced. Qwen3.7-Max narrowly outscores it on pure reasoning.

For creators: run the 12B locally for offline multimodal work (image and audio on a 16GB laptop), and the 31B for on-prem RAG and coding where data can’t leave your machine.
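Keeping data on your own machine can be as simple as talking to a local Ollama server. The sketch below uses only the Python standard library and Ollama's default local endpoint (`http://localhost:11434/api/generate`). The model tag `gemma4:12b` is a hypothetical placeholder; check `ollama list` for the tag you actually pulled.

```python
import json
import urllib.request

# Ollama's default local HTTP endpoint; requests never leave localhost.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Hypothetical model tag (assumption): replace with the Gemma 4 tag shown by `ollama list`.
MODEL_TAG = "gemma4:12b"


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming /api/generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = MODEL_TAG, timeout: float = 300.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # With stream=False, Ollama returns one JSON object whose "response" field holds the text.
    return body["response"]
```

With a server running (`ollama serve`) and the model pulled, `generate("Summarize this clause in one sentence: <clause text>")` returns the completion as a string. Disabling streaming keeps the sketch short; interactive apps would normally stream tokens instead.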

Benchmarks

LMArena Elo: 1452
MMLU-Pro: 85.2
GPQA Diamond: 84.3
LiveCodeBench: 80
AIME 2026: 89.2

Capabilities

  • Apache 2.0 license (replaces the custom Gemma Terms used through Gemma 3)
  • 31B dense flagship runs in ~18GB VRAM at Q4 on a single 24GB consumer GPU
  • 26B A4B: Gemma's first MoE (~3.8B active, ~15.6GB at int4), built for agentic throughput
  • Encoder-free 12B multimodal (native audio) runs on a 16GB laptop
  • E2B (~2GB) and E4B (~8GB) tiers for edge and on-device use
  • Up to 256K context; runs via Ollama, llama.cpp, LM Studio, vLLM, and Hugging Face
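The memory and throughput claims above follow from back-of-the-envelope arithmetic. This sketch counts weight storage only (parameters × bits / 8) and uses the common rule of thumb of roughly 2 FLOPs per active parameter per generated token. Both are simplifying assumptions: KV cache, activations, and runtime overhead are ignored, and real quantized checkpoints usually run somewhat larger than the raw estimate because they store scale factors and keep some tensors at higher precision.

```python
# Back-of-the-envelope arithmetic for the Gemma 4 sizes quoted above.
# Assumptions: weights only (no KV cache or runtime overhead), and a forward pass
# costs about 2 FLOPs per *active* parameter per token.


def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes) at a given bit width."""
    return params_billion * bits_per_param / 8


def active_fraction(active_billion: float, total_billion: float) -> float:
    """Share of an MoE model's parameters used for each token."""
    return active_billion / total_billion


def flops_per_token(active_params_billion: float) -> float:
    """Approximate forward-pass FLOPs per generated token (2 x active params)."""
    return 2 * active_params_billion * 1e9


# Raw 4-bit weight sizes; the quoted ~18GB and ~15.6GB sit above these estimates.
for name, params in [("31B dense @ 4-bit", 31), ("26B A4B @ 4-bit", 26)]:
    print(f"{name}: ~{weight_gb(params, 4):.1f} GB of weights")

# The MoE touches only ~3.8B of its 26B parameters per token.
print(f"26B A4B active share: {active_fraction(3.8, 26):.1%}")

# Per-token compute of the dense 31B relative to the 26B A4B MoE.
print(f"Dense 31B vs 26B A4B compute per token: {flops_per_token(31) / flops_per_token(3.8):.1f}x")
```

Under these assumptions the dense 31B does about eight times more compute per token than the MoE, which is the basis for pairing the 26B A4B with a high-throughput server like vLLM for agentic workloads. Memory, however, still scales with the MoE's full 26B parameters, since all experts must stay resident.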


