TurboQuant compresses the vectors behind AI workloads (KV caches, embeddings) to 2-4 bits using a random rotation and a single fixed codebook, with no per-block metadata overhead.
Key Takeaways
Random rotation maps any input’s coordinates to a near-Gaussian distribution, so a codebook built once for that distribution can be reused across every input (illustrated in the first sketch after this list).
Production quantizers (GPTQ, AWQ, KIVI, KVQuant) store a per-block scale and zero-point; advertised 3-bit schemes are effectively 4-5 bits once that metadata is counted (see the second sketch after this list).
High-dimensional concentration: the coordinates of a randomly rotated unit vector in d dimensions are approximately N(0, 1/d), clustering at the ±1/√d scale; this is the geometric property the rotation trick exploits.
No training, calibration, or per-input scale factors required; the codebook is designed once offline for the post-rotation Gaussian.
Targets KV caches, embeddings, and attention keys – the dominant memory bottleneck in LLM inference at scale.
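Below is a minimal sketch of the rotation-plus-fixed-codebook mechanism, assuming a dense QR-based random rotation and the classic 2-bit Lloyd-Max codebook for a unit Gaussian. Both are illustrative stand-ins, not TurboQuant's actual transform or codebook construction (fast structured rotations are typically used in practice).

```python
# Minimal sketch (assumptions: dense QR rotation, 2-bit Lloyd-Max Gaussian
# codebook). Not TurboQuant's actual construction, just the core idea.
import numpy as np

rng = np.random.default_rng(0)
d = 1024

# Random orthogonal rotation. Real systems use fast structured transforms;
# dense QR of a Gaussian matrix is the simplest stand-in.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# An adversarial input for per-block scaling: all energy in one coordinate.
x = np.zeros(d)
x[0] = 1.0

# After rotation, coordinates of a unit vector are approximately N(0, 1/d),
# i.e. they cluster at the +/-1/sqrt(d) scale regardless of the input.
z = Q @ x
print(z.std() * np.sqrt(d))  # ~= 1.0

# One fixed 4-level (2-bit) codebook, designed once offline for the
# post-rotation Gaussian: the classic Lloyd-Max levels for N(0, 1),
# rescaled by 1/sqrt(d). No per-input or per-block scale is stored.
codebook = np.array([-1.510, -0.4528, 0.4528, 1.510]) / np.sqrt(d)

# Quantize each coordinate to the nearest codeword (2 bits per coordinate).
idx = np.abs(z[:, None] - codebook[None, :]).argmin(axis=1)
z_hat = codebook[idx]

# Decode by inverting the shared rotation (reproducible from a seed).
x_hat = Q.T @ z_hat
print(np.linalg.norm(x - x_hat))  # ~0.34 for a unit-norm input at 2 bits/dim
```

The point of the adversarial input is that a per-block scale/zero-point scheme sees wildly uneven statistics, while after rotation every coordinate follows the same near-Gaussian distribution, so one offline codebook suffices.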
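A second sketch gives a back-of-envelope calculation for the metadata point above, assuming an fp16 scale and fp16 zero-point per quantization group (a common configuration, not any specific method's defaults):

```python
# Effective bit-width once per-group metadata is amortized. The fp16
# scale/zero-point (32 bits total) and group sizes are illustrative
# assumptions, not any specific quantizer's defaults.
def effective_bits(payload_bits: int, group_size: int, meta_bits: int = 32) -> float:
    """Payload bits plus per-group scale + zero-point spread over the group."""
    return payload_bits + meta_bits / group_size

for g in (128, 64, 32, 16):
    print(f"3-bit, group size {g:>3}: {effective_bits(3, g):.2f} bits/weight")
# -> 3.25, 3.50, 4.00, 5.00
```

At group sizes of 32 or 16, an advertised 3 bits lands at 4-5 effective bits; that is the overhead a metadata-free design avoids.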
Hacker News Comment Review
Two prior works dispute TurboQuant’s claimed novelty. EDEN (NeurIPS 21, ICML 22) introduced post-rotation distribution-aware quantization first and reportedly achieves better accuracy via optimal scale derivations.
The RaBitQ authors allege in a public technical note that several of TurboQuant’s runtime and recall benchmarks do not reproduce from the released code under the paper’s stated setup.
A current llama.cpp fork runs 5-10x slower than vanilla inference (M1 Max, Qwen3 35B MoE); the math is sound, but productionization is not close.
Notable Comments
@amitport: TurboQuant is a restricted version of EDEN and lacks its optimal scale derivations; according to a new arXiv note, the gap makes TurboQuant considerably less accurate.
@mskkm: OpenReview now carries explicit allegations that TurboQuant knowingly misrepresented RaBitQ’s results; the benchmark numbers reportedly do not reproduce from the released code.