KV Cache Is Becoming the Memory Hierarchy of Inference


TLDR

  • KV cache has expanded beyond GPU-local storage into a multi-tier hierarchy spanning host DRAM, distributed pools, and RDMA transfer, with real prefill cost implications for long-running agents.
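To make the prefill cost concrete, here is a rough arithmetic sketch using the figures cited in the takeaways (context growing from 12K to 180K tokens over 30 turns, and roughly 3.8 GB of KV cache per 100K tokens). The linear growth shape, the decimal-GB convention, and the function names are assumptions for illustration, not measurements from the source.

```python
def context_lengths(start_tokens, end_tokens, turns):
    """Context length per turn, assuming linear growth (an illustrative shape)."""
    step = (end_tokens - start_tokens) / (turns - 1)
    return [round(start_tokens + i * step) for i in range(turns)]


def prefill_tokens(lengths, reuse_prefix):
    """Total tokens prefilled across all turns.

    Without reuse, every turn prefills its whole context again.
    With reuse, a turn only prefills the tokens added since the previous turn.
    """
    if not reuse_prefix:
        return sum(lengths)
    total = lengths[0]
    for prev, cur in zip(lengths, lengths[1:]):
        total += cur - prev
    return total


# Figures taken from the takeaways; everything else is assumed.
lengths = context_lengths(12_000, 180_000, 30)
cold = prefill_tokens(lengths, reuse_prefix=False)  # cache miss on every turn
warm = prefill_tokens(lengths, reuse_prefix=True)   # prefix reused on every turn

# Per-token KV footprint implied by "3.8 GB per 100K tokens" (decimal GB assumed).
kv_bytes_per_token = 3_800_000_000 // 100_000
kv_bytes_final_turn = kv_bytes_per_token * lengths[-1]

print(f"prefill tokens without reuse: {cold:,}")
print(f"prefill tokens with reuse:    {warm:,}")
print(f"KV bytes per token:           {kv_bytes_per_token:,}")
print(f"KV bytes at final turn:       {kv_bytes_final_turn:,}")
```

Under these assumptions, a session that loses its cached prefix on every turn prefills an order of magnitude more tokens than one that keeps it, which is why placement and routing of the cache matter as much as its size.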

Key Takeaways

  • A 50-turn agentic loop (OpenClaw/Hermes-style) can force repeated prefill of unchanged context if prefix alignment breaks or the next request lands on a different worker.
  • vLLM x Mooncake traces show context growing from 12K to 180K+ tokens by turn 30, with a 131:1 input-to-output ratio, so repeated prefill, not decode, is the cost driver.
  • A 100K-token Kimi K2.5 FP8 context uses roughly 3.8 GB of KV cache (about 38 KB per token); at scale this becomes a working-set scheduling problem across GPU HBM, host DRAM, and remote tiers.
  • LMCache CacheBlend targets agentic workflows where shared documents reappear at non-prefix positions, breaking standard prefix caching; NVIDIA Dynamo and SGLang address routing and scheduler policy.
  • GB200 NVL72 delivers 3.13x throughput over B200 on Kimi K2.5 via rack-scale NVLink; AMD MI355X reached a 7.7x improvement in 25 days through software changes (vLLM + AITER MLA), closing some of the gap.
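
The prefix-alignment point in the takeaways can be illustrated with a toy prefix cache. This is a minimal sketch of hash-chained block matching, assuming a tiny block size and hypothetical helper names (`block_hashes`, `cached_prefix_blocks`); it is not the implementation used by vLLM, LMCache, or any other system named above. Because each block's key depends on every token before it, the same document placed behind a different leading block gets no cache hits.

```python
import hashlib

BLOCK = 4  # tokens per block; deliberately tiny for illustration


def block_hashes(tokens, block=BLOCK):
    """Chained hashes over full blocks: each key depends on all earlier tokens."""
    hashes, parent = [], b""
    full = len(tokens) - len(tokens) % block  # ignore a trailing partial block
    for i in range(0, full, block):
        chunk = ",".join(map(str, tokens[i:i + block])).encode()
        parent = hashlib.sha256(parent + b"|" + chunk).digest()
        hashes.append(parent)
    return hashes


def cached_prefix_blocks(cache, tokens):
    """Count leading blocks whose chained hash is already in the cache."""
    hits = 0
    for h in block_hashes(tokens):
        if h not in cache:
            break
        hits += 1
    return hits


# Hypothetical token IDs standing in for a system prompt and a shared document.
system = [1, 2, 3, 4]
document = [10, 11, 12, 13, 14, 15, 16, 17]
turn1 = system + document + [20, 21, 22, 23]
cache = set(block_hashes(turn1))

# Same prefix with new tokens appended: all earlier blocks are reused.
turn2 = turn1 + [30, 31, 32, 33]
hits_append = cached_prefix_blocks(cache, turn2)

# Same document behind a different leading block: the chain breaks at block 0.
turn3 = [9, 9, 9, 9] + document + [20, 21, 22, 23]
hits_shifted = cached_prefix_blocks(cache, turn3)

print(hits_append, hits_shifted)
```

The second case is the situation the CacheBlend takeaway describes: the document's tokens are unchanged, but a strictly prefix-keyed cache cannot recognize them at a new position.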

Hacker News Comment Review

  • No substantive HN discussion yet.
