TurboQuant: A First-Principles Walkthrough

TLDR

  • TurboQuant compresses AI vectors (KV caches, embeddings) to 2-4 bits using random rotation and a single fixed codebook, with no per-block metadata overhead.

Key Takeaways

  • Random rotation maps any input’s coordinates to a near-Gaussian distribution, so a codebook built once for that distribution can be reused across every input.
  • Production quantizers (GPTQ, AWQ, KIVI, KVQuant) store per-block scale and zero-point; advertised 3-bit schemes are effectively 4-5 bits once metadata is counted.
  • High-dimensional concentration: the coordinates of a random unit vector in d dimensions concentrate near zero with typical magnitude 1/√d, the geometric property the rotation trick exploits.
  • No training, calibration, or per-input scale factors required; the codebook is designed once offline for the post-rotation Gaussian.
  • Targets KV caches, embeddings, and attention keys – the dominant memory bottleneck in LLM inference at scale.
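
The rotation trick in the first and third bullets can be sketched in a few lines of NumPy. This is a toy demonstration, not the paper's implementation; the QR-based rotation, dimensions, and outlier pattern are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024

# A deliberately outlier-heavy input: a few huge coordinates,
# the rest tiny (the pattern that breaks naive per-tensor scaling).
x = np.zeros(d)
x[:8] = 100.0
x[8:] = rng.normal(0.0, 0.01, d - 8)

# Random orthogonal rotation: QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
y = Q @ x

# Rotation preserves the norm exactly, and the rotated coordinates
# concentrate around zero with typical magnitude ||x|| / sqrt(d),
# i.e. they resemble samples from a single known Gaussian.
print(np.linalg.norm(y), np.linalg.norm(x))     # equal norms
print(y.std(), np.linalg.norm(x) / np.sqrt(d))  # close
print(np.abs(y).max(), np.abs(x).max())         # outliers smoothed away
```

Because every rotated input lands near the same Gaussian shape, quantizer decision levels can be designed once for that distribution and reused, which is what removes the per-block metadata.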

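The metadata overhead in the second bullet is simple amortization arithmetic. A hedged sketch, where the block sizes and fp16 scale/zero-point widths are common choices rather than the exact settings of any one quantizer:

```python
def effective_bits(code_bits: float, block_size: int,
                   scale_bits: int = 16, zero_point_bits: int = 16) -> float:
    """Bits actually spent per element once per-block metadata is amortized.

    The fp16 scale/zero-point widths are illustrative assumptions.
    """
    return code_bits + (scale_bits + zero_point_bits) / block_size

# A "3-bit" scheme with an fp16 scale + fp16 zero-point per block:
print(effective_bits(3, 128))  # 3.25
print(effective_bits(3, 64))   # 3.5
print(effective_bits(3, 32))   # 4.0
```

Smaller blocks (usually needed for acceptable accuracy) push the true cost toward 4-5 bits, which is the gap a metadata-free fixed codebook avoids.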
Hacker News Comment Review

  • Two prior works dispute TurboQuant’s novelty: EDEN (NeurIPS 21, ICML 22) introduced post-rotation distribution-aware quantization first and reportedly achieves better accuracy via optimal scale derivations.
  • The RaBitQ authors allege in a public technical note that several of TurboQuant’s runtime and recall benchmarks do not reproduce from the released code under the paper’s stated setup.
  • Current inference on a llama.cpp fork runs 5-10x slower than vanilla (M1 Max, Qwen3 35B MoE); the math is sound, but productionization is not close.

Notable Comments

  • @amitport: TurboQuant is a restricted version of EDEN and lacks its optimal scale derivations; per a new arXiv note, that gap makes TurboQuant considerably less accurate.
  • @mskkm: OpenReview now carries explicit allegations that TurboQuant knowingly misrepresented RaBitQ’s results; benchmark numbers reportedly do not reproduce from released code.