An IEEE/ACM MICRO paper proposes Stratum, a system that combines Mono3D DRAM, near-memory processing (NMP), and a GPU to serve MoE LLMs with up to 8.29x higher decoding throughput and 7.66x better energy efficiency than GPU baselines.
Key Takeaways
Mono3D DRAM connects to a logic die through hybrid bonding (1 μm pitch, ~5x finer than HBM TSVs), giving that die higher internal bandwidth than HBM provides and enabling stronger near-memory processing (NMP) without embedding logic in the DRAM cells.
Stratum introduces in-memory tiering: DRAM layers with lower access latency hold hot experts and slower layers hold cold experts, with placement guided by a lightweight topic classifier that predicts each request's topic.
A silicon interposer connects the Mono3D DRAM+logic stack to the GPU, keeping dense compute on the GPU while offloading expert weight fetches to NMP.
A cross-stack evaluation (device, circuit, algorithm, system) reports up to 8.29x higher decoding throughput and 7.66x better energy efficiency than state-of-the-art GPU-HBM baselines across multiple MoE benchmarks.
Target workloads include models like DeepSeek-V3 (671B) and Kimi-K2 (1T), where MoE expert weight volume saturates HBM bandwidth during decoding.
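The tiering idea in the takeaways above can be illustrated with a minimal Python sketch. It is not the paper's placement algorithm: the greedy policy, the `Tier` class, the tier names, the latencies, the capacities, and the per-expert hotness scores are all hypothetical stand-ins, with the hotness scores playing the role of what a topic classifier might predict for incoming requests.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    """One memory tier. All fields are illustrative, not values from the paper."""
    name: str
    latency_ns: float  # hypothetical access latency of this tier
    capacity: int      # hypothetical number of experts the tier can hold


def place_experts(hotness, tiers):
    """Greedy placement: the hottest experts fill the lowest-latency tiers first.

    hotness maps an expert id to a predicted activation share.
    Returns a dict mapping each expert id to a tier name.
    """
    if len(hotness) > sum(t.capacity for t in tiers):
        raise ValueError("not enough total capacity for all experts")
    ranked = iter(sorted(hotness, key=hotness.get, reverse=True))
    placement = {}
    for tier in sorted(tiers, key=lambda t: t.latency_ns):
        for _ in range(tier.capacity):
            expert = next(ranked, None)
            if expert is None:  # every expert has been placed
                return placement
            placement[expert] = tier.name
    return placement


def expected_latency(hotness, placement, tiers):
    """Hotness-weighted average access latency under a given placement."""
    latency = {t.name: t.latency_ns for t in tiers}
    total = sum(hotness.values())
    return sum(hotness[e] * latency[placement[e]] for e in hotness) / total


if __name__ == "__main__":
    tiers = [Tier("fast", 10.0, 2), Tier("slow", 30.0, 4)]
    hotness = {"e0": 0.05, "e1": 0.40, "e2": 0.10, "e3": 0.30, "e4": 0.15}
    placement = place_experts(hotness, tiers)
    print(placement)
    print(expected_latency(hotness, placement, tiers))
```

With these made-up numbers, the two hottest experts land in the fast tier, so most accesses pay the lower latency. A real system would also have to account for mispredicted topics and the cost of moving experts between tiers, which this sketch ignores.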
Hacker News Comment Review
One commenter notes Mono3D DRAM-based NMP likely generalizes well beyond LLM workloads to memory-bound traditional computing, which the paper does not explore.