Google releases MTP drafters for Gemma 4, delivering up to 3x inference speedup via speculative decoding with zero output quality degradation.
Key Takeaways
MTP drafters pair a lightweight drafter model with the heavy Gemma 4 target; the drafter proposes tokens, the target verifies them in parallel, yielding up to 3x throughput (a minimal sketch of the loop follows this list).
Drafters share the target model’s KV cache and activations, avoiding redundant context recomputation and keeping overhead low.
For the E2B and E4B edge models, an efficient clustering technique in the embedder addresses the final-logit bottleneck, the projection of hidden states onto the full vocabulary that dominates on constrained hardware (a hedged sketch of one possible mechanism also follows the list).
Batch sizes of 4-8 unlock ~2.2x speedup on the 26B MoE model on Apple Silicon; similar gains appear on Nvidia A100.
Available now under Apache 2.0 via Hugging Face, Kaggle, Ollama, vLLM, SGLang, MLX, and Google AI Edge Gallery.
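
To make the propose/verify mechanics concrete, here is a minimal greedy-decoding sketch in Python. The stubs `draft_next` and `target_logits_batch` are hypothetical stand-ins, not the Gemma 4 API; the part that matters is the accept/verify loop, where the target scores every drafted position in one batched forward pass and substitutes its own prediction at the first disagreement.

```python
# Minimal sketch of greedy speculative decoding. The two model stubs are
# toy stand-ins (hypothetical, not the Gemma 4 API); only the loop matters.
import numpy as np

VOCAB = 100
rng = np.random.default_rng(0)

def draft_next(ctx):
    # Hypothetical cheap drafter: returns one greedy next token.
    return int(rng.integers(0, VOCAB))

def target_logits_batch(ctx, proposed):
    # Hypothetical target: logits for all len(proposed) + 1 next-token
    # positions, computed in ONE forward pass.
    return rng.normal(size=(len(proposed) + 1, VOCAB))

def speculative_step(ctx, k=4):
    # 1. Drafter proposes k tokens autoregressively (cheap).
    proposed = []
    for _ in range(k):
        proposed.append(draft_next(ctx + proposed))
    # 2. Target verifies all k positions in parallel (one expensive pass).
    logits = target_logits_batch(ctx, proposed)
    accepted = []
    for i, tok in enumerate(proposed):
        target_tok = int(logits[i].argmax())
        if target_tok == tok:
            accepted.append(tok)          # target agrees: keep draft token
        else:
            accepted.append(target_tok)   # disagree: correct and stop early
            return ctx + accepted
    # All k accepted: the same pass yields one bonus token for free.
    accepted.append(int(logits[k].argmax()))
    return ctx + accepted

ctx = [1, 2, 3]
for _ in range(3):
    ctx = speculative_step(ctx)
print(ctx)  # grows by 1 to k+1 tokens per target forward pass
```

With real models the accepted sequence is token-for-token identical to what the target alone would have produced under greedy decoding, which is why the speedup comes with no quality loss.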
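The announcement does not spell out the embedder clustering method; one plausible reading, sketched below purely as an assumption, is a two-stage clustered softmax that scores a small set of clusters first and then only the winning cluster's tokens, shrinking the final matmul from |V| columns to roughly |C| + |V|/|C|.

```python
# Hedged sketch of a two-stage clustered softmax: ONE possible reading of
# the "clustering in the embedder" claim, not the documented Gemma 4 method.
import numpy as np

V, C, D = 50_000, 250, 64                 # vocab, clusters, hidden size (toy)
rng = np.random.default_rng(0)
cluster_of = rng.integers(0, C, size=V)   # token id -> cluster id
W_cluster = rng.normal(size=(D, C))       # stage 1: score C clusters
W_token = rng.normal(size=(D, V))         # full (tied) embedder weights

def clustered_argmax(h):
    # Stage 1: a small C-wide matmul picks the most likely cluster.
    c = int((h @ W_cluster).argmax())
    # Stage 2: score only that cluster's ~V/C member tokens.
    members = np.flatnonzero(cluster_of == c)
    return int(members[(h @ W_token[:, members]).argmax()])

print(clustered_argmax(rng.normal(size=D)))
```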
Hacker News Comment Review
Commenters broadly validated speculative decoding as a sound technique, drawing an analogy to CPU branch prediction: the draft model speculates ahead, and the main model verifies the speculation cheaply in a single forward pass.
Real-world testers on Ollama with MLX saw roughly 2x gains on the 31B model for coding, but noted that smaller Gemma 4 models often lose the benefit because the overhead of running and validating the draft model eats the speedup, and that heavy quantization hurts the acceptance rate (a back-of-envelope model of that tradeoff follows).
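
A rough cost model explains why the acceptance rate dominates. With a per-token acceptance probability alpha, draft length k, and a drafter costing a fraction c of one target forward pass, the expected speedup follows the standard speculative-decoding estimate; the numbers below are illustrative assumptions, not measured Gemma 4 figures.

```python
# Back-of-envelope speedup model; alpha, k, c values are illustrative only.
def expected_speedup(alpha, k, c):
    # Expected tokens per target pass: 1 + alpha + alpha^2 + ... + alpha^k.
    tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    # Relative cost per step: one target pass plus k drafter passes.
    return tokens / (1 + k * c)

for alpha in (0.9, 0.7, 0.5, 0.3):
    print(f"alpha={alpha:.1f}: {expected_speedup(alpha, k=4, c=0.05):.2f}x")
```

At alpha = 0.9 this yields about 3.4x, but at alpha = 0.3 (plausible under heavy quantization) it collapses to about 1.2x, matching the commenters' experience with small, heavily quantized models.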
A recurring thread contrasted Gemma’s token efficiency against Qwen: Gemma often finishes tasks in a fraction of the time even when Qwen scores marginally higher on benchmarks, making wall-clock speed a meaningful differentiator for agentic workloads.
Notable Comments
@msp26: Fitting the 31B dense model + vision projector + MTP drafter into 24 GB of VRAM is tight; a tip from the replies: pass --no-mmproj-offload to llama.cpp to keep the multimodal projector in system RAM (example invocation below).
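
A hedged example of that setup; the model filenames are hypothetical, while -m, --mmproj, and --no-mmproj-offload are real llama.cpp flags:

```
# Filenames are placeholders; point these at your local GGUF files.
llama-server -m gemma-4-31b-q4_k_m.gguf \
    --mmproj gemma-4-mmproj.gguf \
    --no-mmproj-offload   # keep the vision projector in system RAM
```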
@zdw: An MTP support PR for llama.cpp (covering Qwen first, with Gemma 4 expected soon) is close to merging, signaling broad uptake across the local-inference ecosystem.