Orthrus-Qwen3: up to 7.8× tokens/forward on Qwen3, identical output distribution


TLDR

  • The paper introduces Orthrus, a dual-view diffusion framework that injects parallel decoding into frozen Qwen3 AR models while provably preserving the output distribution.

Key Takeaways

  • Reported speedups of 4.25x (1.7B), 5.20x (4B), and 5.36x (8B) on Qwen3 backbones; the peak claim is 7.8x tokens per forward pass.
  • Output is strictly lossless: an AR head verifies tokens proposed by the diffusion head and accepts the longest matching prefix, preserving the exact predictive distribution.
  • Only 16% of parameters are fine-tuned; the base LLM stays frozen, and both views share a single KV cache with O(1) memory overhead.
  • Outperforms speculative decoding methods EAGLE-3 and DFlash on token acceptance rate, especially at longer context lengths.
  • Unlike diffusion LLMs (e.g., Fast-dLLM-v2), Orthrus shows no accuracy drop on MATH-500 at a ~6x speedup over the Qwen3-8B baseline.
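The longest-matching-prefix acceptance rule above can be sketched in a few lines; all names and shapes here are illustrative, not taken from the Orthrus repo, and greedy matching is shown (losslessness under sampling would need speculative rejection sampling instead):

```python
def accept_longest_prefix(draft_tokens, ar_argmax_tokens):
    """Illustrative draft-then-verify acceptance (hypothetical names).

    draft_tokens:     K tokens proposed in one step by the diffusion head.
    ar_argmax_tokens: K+1 tokens; entry i is the AR head's greedy
                      prediction given the prefix draft_tokens[:i].
    Returns the tokens that plain greedy AR decoding would have emitted.
    """
    accepted = []
    for draft, ar in zip(draft_tokens, ar_argmax_tokens):
        if draft != ar:
            # First mismatch: substitute the AR token, which is exactly
            # what vanilla AR decoding would have produced here.
            accepted.append(ar)
            return accepted
        accepted.append(draft)
    # All K drafts matched; the AR pass also yields one bonus token.
    accepted.append(ar_argmax_tokens[len(draft_tokens)])
    return accepted
```

For example, `accept_longest_prefix([5, 7, 9], [5, 7, 2, 4])` accepts the matching prefix `[5, 7]` and corrects the mismatch, returning `[5, 7, 2]`; a fully matching draft yields all K tokens plus the bonus token, which is how more than one token per forward pass is gained without changing the output.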

Hacker News Comment Review

  • Discussion is minimal; a co-author confirmed the core mechanism: trainable diffusion attention is injected per layer, and K=32 tokens are projected in parallel, then AR-verified.
  • Community interest focused on broader model support, specifically Qwen3 32B and quantized variants, which are not yet addressed in the repo.
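Putting the confirmed mechanism together, a toy decode loop (K-token draft block, greedy AR verification, shared sequence state) might look like the sketch below. Every function name here is hypothetical; the toy draft/verify functions just count upward so the loop is runnable end to end:

```python
K = 32  # draft block size mentioned by the co-author

def decode(prompt, draft_fn, ar_greedy_fn, max_len=64, eos=-1):
    """Toy dual-view loop: draft K tokens, verify, advance (hypothetical API)."""
    seq = list(prompt)
    while len(seq) < max_len:
        drafts = draft_fn(seq, K)           # diffusion view: K tokens in one pass
        checks = ar_greedy_fn(seq, drafts)  # AR view: K+1 greedy predictions
        for d, a in zip(drafts, checks):
            if d == a:
                seq.append(d)               # draft survives verification
            else:
                seq.append(a)               # correct at first mismatch, stop
                break
        else:
            seq.append(checks[len(drafts)])  # bonus token on full acceptance
        if seq[-1] == eos:
            break
    return seq[:max_len]

# Toy heads for demonstration: the "model" always continues the integers,
# so every draft is accepted and each step advances by K + 1 tokens.
def count_drafts(seq, k):
    return [seq[-1] + 1 + i for i in range(k)]

def count_checks(seq, drafts):
    return [seq[-1] + 1 + i for i in range(len(drafts) + 1)]
```

The speedup per step equals the accepted length, so the reported 4-5x figures correspond to roughly 4-5 drafts surviving verification on average.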

Notable Comments

  • @ilaksh: Asked about Qwen3 32B and quantization compatibility, pointing to real deployment gaps not covered by the current model zoo.

Original | Discuss on HN