Accelerating Gemma 4: faster inference with multi-token prediction drafters


TLDR

  • Google releases MTP drafters for Gemma 4, delivering up to 3x inference speedup via speculative decoding with zero output quality degradation.

Key Takeaways

  • MTP drafters pair a lightweight drafter model with the heavy Gemma 4 target: the drafter proposes tokens and the target verifies them in parallel, yielding up to a 3x throughput gain.
  • Drafters share the target model’s KV cache and activations, avoiding redundant context recomputation and keeping overhead low.
  • For E2B and E4B edge models, an efficient clustering technique in the embedder addresses the final logit bottleneck on constrained hardware.
  • Batch sizes of 4-8 unlock ~2.2x speedup on the 26B MoE model on Apple Silicon; similar gains appear on Nvidia A100.
  • Available now under Apache 2.0 via Hugging Face, Kaggle, Ollama, vLLM, SGLang, MLX, and Google AI Edge Gallery.
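The draft-then-verify loop described above can be sketched in miniature. This is a toy greedy variant over a hand-rolled vocabulary, not the real Gemma 4 stack: `draft_next` and `target_next` are hypothetical stand-ins for cheap drafter and expensive target forward passes, and the "parallel" verify is written as a loop for clarity.

```python
# Toy sketch of greedy speculative decoding (not the actual Gemma 4 code):
# a cheap drafter proposes k tokens, the target checks them in one batched
# pass and accepts the longest matching prefix, then adds one token itself.

def draft_next(ctx):
    # Cheap drafter: naive last-token rule over a tiny toy vocabulary.
    return {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}.get(ctx[-1], "mat")

def target_next(ctx):
    # Expensive target: same rule, except after "on the" it prefers "mat".
    if ctx[-2:] == ["on", "the"]:
        return "mat"
    return {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}.get(ctx[-1], "mat")

def speculative_step(ctx, k=4):
    # 1) Drafter proposes k tokens autoregressively (cheap, sequential).
    draft, tmp = [], list(ctx)
    for _ in range(k):
        t = draft_next(tmp)
        tmp.append(t)
        draft.append(t)
    # 2) Target verifies all k positions (one parallel pass in a real
    #    system; a loop here). Accept the longest matching prefix.
    accepted, tmp = [], list(ctx)
    for t in draft:
        if target_next(tmp) != t:
            break
        accepted.append(t)
        tmp.append(t)
    # 3) The target's own prediction at the first mismatch comes free,
    #    so every step emits at least one token.
    accepted.append(target_next(tmp))
    return accepted

print(speculative_step(["the"], k=4))  # drafter agrees 4 times, +1 free token
```

When the drafter and target agree, one verify cycle emits up to k+1 tokens; on total disagreement it still emits one, which is why correctness (and output quality) is unaffected.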

Hacker News Comment Review

  • Commenters broadly validated speculative decoding as a sound technique, likening it to CPU branch prediction: the draft model proposes, and the main model verifies the proposals cheaply in a single forward pass.
  • Real-world testers on Ollama with MLX saw roughly 2x gains on 31B for coding, but noted that smaller Gemma 4 models often lose gains because draft-model validation overhead eats the speedup, and heavy quantization hurts acceptance rate.
  • A recurring thread contrasted Gemma’s token efficiency against Qwen: Gemma often finishes tasks in a fraction of the time even when Qwen scores marginally higher on benchmarks, making wall-clock speed a meaningful differentiator for agentic workloads.
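The commenters' observation that quantization and drafter overhead can erase the gains follows from the standard back-of-envelope model for speculative decoding (this is the textbook analysis, not figures from the article): assume each drafted token is accepted independently with probability `a`, the drafter costs `c` target-forward-pass equivalents per token, and the draft length is `k`.

```python
# Standard expected-speedup model for speculative decoding (illustrative
# numbers, not benchmarks from the article). Per-token acceptance
# probability a, draft length k, drafter cost c relative to one target pass.

def expected_speedup(a, k, c):
    # Expected tokens per verify cycle: accepted prefix plus the target's
    # own free token, E = sum_{i=0..k} a^i = (1 - a**(k+1)) / (1 - a).
    tokens = (1 - a ** (k + 1)) / (1 - a)
    cost = k * c + 1  # k drafter steps + one parallel target pass
    return tokens / cost

# High acceptance with a cheap drafter: a solid win.
print(round(expected_speedup(a=0.8, k=4, c=0.05), 2))  # → 2.8
# Lower acceptance (e.g. heavy quantization skewing the drafter): the
# same overhead now eats most of the gain.
print(round(expected_speedup(a=0.4, k=4, c=0.05), 2))  # → 1.37
```

This also matches the batch-size note above: verification cost amortizes better when the hardware has spare parallel capacity to check several drafted positions at once.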

Notable Comments

  • @msp26: Fitting 31B dense + vision projector + MTP drafter into 24 GB VRAM is tight; tip from replies: use --no-mmproj-offload in llama.cpp to keep the multimodal projector in system RAM.
  • @zdw: MTP support PR for llama.cpp (covering Qwen first, Gemma 4 expected soon) is close to merge, signaling broad local-inference ecosystem uptake.
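The VRAM tip from @msp26's thread amounts to one extra flag on the server invocation. A sketch, assuming a llama.cpp `llama-server` setup with placeholder model and projector paths:

```shell
# Keep the multimodal (vision) projector in system RAM rather than VRAM,
# freeing room on a 24 GB card for the dense weights plus the MTP drafter.
# gemma.gguf and mmproj.gguf are placeholder file names.
llama-server -m gemma.gguf --mmproj mmproj.gguf --no-mmproj-offload
```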
