Recent open-weight LLMs (Gemma 4, DeepSeek V4, ZAYA1, Laguna XS.2) are converging on architectural techniques that cut KV-cache memory and attention cost for long-context and agent workloads.
Key Takeaways
Gemma 4 E2B and E4B use cross-layer KV sharing: only the first ~15-24 layers compute KV projections, and later layers reuse those cached keys and values, saving ~2.7-6 GB at 128K context in bfloat16.
Per-layer embeddings (PLE) in the Gemma 4 E-series separate embedding capacity from transformer-stack compute, so the active parameter count stays near the smaller figure while the total parameter count is higher.
KV sharing is an approximation that reduces model capacity, though the NeurIPS 2024 cross-layer attention paper reports minimal impact on the small models it tested.
No ablation comparing Gemma 4 E2B to a plain 2.3B or 5.1B dense model is publicly available, leaving the PLE tradeoff unquantified.
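To make the KV-sharing arithmetic concrete, here is a minimal sketch of how cache size scales with the number of layers that store their own keys and values. The function names and every configuration value (30 layers, 15 KV-computing layers, 1 KV head, head dimension 256) are illustrative assumptions, not Gemma 4's published configuration. The sketch also assumes every layer uses full global attention, so it will not reproduce the savings figures quoted above.

```python
def kv_cache_bytes(num_kv_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence.

    bytes_per_elem defaults to 2, the size of a bfloat16 value.
    """
    # The factor of 2 covers the separate K and V tensors stored per layer.
    return 2 * num_kv_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def sharing_savings_bytes(total_layers, kv_layers, num_kv_heads, head_dim,
                          seq_len, bytes_per_elem=2):
    """Cache bytes saved when only `kv_layers` of `total_layers` store K/V."""
    full = kv_cache_bytes(total_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem)
    shared = kv_cache_bytes(kv_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem)
    return full - shared


if __name__ == "__main__":
    # Hypothetical configuration, chosen only to illustrate the scaling.
    saved = sharing_savings_bytes(
        total_layers=30, kv_layers=15, num_kv_heads=1, head_dim=256,
        seq_len=128 * 1024,
    )
    print(f"{saved / 2**30:.3f} GiB saved")  # prints "1.875 GiB saved"
```

Because the cache grows linearly in the number of KV-storing layers, the saving is simply the fraction of layers that reuse earlier keys and values, multiplied by the unshared cache size.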