Orthrus-Qwen3: up to 7.8× tokens/forward on Qwen3, identical output distribution


TLDR

  • The paper introduces Orthrus, a dual-view diffusion framework that injects parallel decoding into frozen Qwen3 AR models while provably preserving the output distribution.

Key Takeaways

  • Reported speedups of 4.25x (1.7B), 5.20x (4B), and 5.36x (8B) on Qwen3 backbones; the peak claim is 7.8x tokens per forward pass.
  • Output is strictly lossless: an AR head verifies tokens proposed by the diffusion head and accepts the longest matching prefix, preserving the exact predictive distribution.
  • Only 16% of parameters are fine-tuned; the base LLM stays frozen, and both views share a single KV cache with O(1) memory overhead.
  • Outperforms speculative decoding methods EAGLE-3 and DFlash on token acceptance rate, especially at longer context lengths.
  • Unlike diffusion LLMs (e.g., Fast-dLLM-v2), Orthrus shows no accuracy drop on MATH-500 at a ~6x speedup over the Qwen3-8B baseline.
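The longest-matching-prefix acceptance rule above can be sketched in a few lines; all names and shapes here are illustrative, not taken from the Orthrus repo, and greedy matching is shown (losslessness under sampling would need speculative rejection sampling instead):

```python
def accept_longest_prefix(draft_tokens, ar_argmax_tokens):
    """Illustrative draft-then-verify acceptance (hypothetical names).

    draft_tokens:     K tokens proposed in one step by the diffusion head.
    ar_argmax_tokens: K+1 tokens; entry i is the AR head's greedy
                      prediction given the prefix draft_tokens[:i].
    Returns the tokens that plain greedy AR decoding would have emitted.
    """
    accepted = []
    for draft, ar in zip(draft_tokens, ar_argmax_tokens):
        if draft != ar:
            # First mismatch: substitute the AR token, which is exactly
            # what vanilla AR decoding would have produced here.
            accepted.append(ar)
            return accepted
        accepted.append(draft)
    # All K drafts matched; the AR pass also yields one bonus token.
    accepted.append(ar_argmax_tokens[len(draft_tokens)])
    return accepted
```

For example, `accept_longest_prefix([5, 7, 9], [5, 7, 2, 4])` accepts the matching prefix `[5, 7]` and corrects the mismatch, returning `[5, 7, 2]`; a fully matching draft yields all K tokens plus the bonus token, which is how more than one token per forward pass is gained without changing the output.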

Hacker News Comment Review

  • Discussion is minimal; a co-author confirmed the core mechanism: trainable diffusion attention is injected per layer, and K=32 tokens are projected in parallel, then AR-verified.
  • Community interest focused on broader model support, specifically Qwen3 32B and quantized variants, which are not yet addressed in the repo.
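Putting the confirmed mechanism together, a toy decode loop (K-token draft block, greedy AR verification, shared sequence state) might look like the sketch below. Every function name here is hypothetical; the toy draft/verify functions just count upward so the loop is runnable end to end:

```python
K = 32  # draft block size mentioned by the co-author

def decode(prompt, draft_fn, ar_greedy_fn, max_len=64, eos=-1):
    """Toy dual-view loop: draft K tokens, verify, advance (hypothetical API)."""
    seq = list(prompt)
    while len(seq) < max_len:
        drafts = draft_fn(seq, K)           # diffusion view: K tokens in one pass
        checks = ar_greedy_fn(seq, drafts)  # AR view: K+1 greedy predictions
        for d, a in zip(drafts, checks):
            if d == a:
                seq.append(d)               # draft survives verification
            else:
                seq.append(a)               # correct at first mismatch, stop
                break
        else:
            seq.append(checks[len(drafts)])  # bonus token on full acceptance
        if seq[-1] == eos:
            break
    return seq[:max_len]

# Toy heads for demonstration: the "model" always continues the integers,
# so every draft is accepted and each step advances by K + 1 tokens.
def count_drafts(seq, k):
    return [seq[-1] + 1 + i for i in range(k)]

def count_checks(seq, drafts):
    return [seq[-1] + 1 + i for i in range(len(drafts) + 1)]
```

The speedup per step equals the accepted length, so the reported 4-5x figures correspond to roughly 4-5 drafts surviving verification on average.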

Notable Comments

  • @ilaksh: Asked about Qwen3 32B and quantization compatibility, pointing to real deployment gaps not covered by the current model zoo.

Original | Discuss on HN