antirez built DwarfStar 4, a focused local inference runtime for DeepSeek v4 Flash, in one week, calling DeepSeek v4 Flash the first local model he'd trust for serious work.
Key Takeaways
DS4 targets DeepSeek v4 Flash with a 2/8-bit asymmetric quantization recipe and runs in roughly 80-96GB of RAM on high-end Macs or DGX Spark hardware (a sketch of the quantization idea follows this list).
antirez frames the project as model-agnostic in the long term: DS4 will track whichever open-weights model is the best "practically fast" option, with planned variants like ds4-coding, ds4-legal, and ds4-medical.
Roadmap includes quality benchmarks, a bundled coding agent, CI hardware, more backend ports, and both serial and parallel distributed inference.
Vector steering is a first-class feature, enabling freer, less restricted interaction with the model than typical inference setups allow (see the steering sketch after this list).
antirez explicitly argues: “AI is too critical to be just a provided service.”
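The post states the 2/8-bit asymmetric recipe only at headline level, so the following is a generic sketch of what asymmetric 2-bit block quantization looks like, not DS4's actual code. The block size of 32, the float scale/min header, and the omission of the 8-bit path (presumably reserved for quality-sensitive tensors) are all assumptions.

```c
/* Minimal sketch of asymmetric 2-bit block quantization. NOT DS4's
 * actual layout: block size, header format, and names are guesses.
 * Each block of 32 floats becomes a float scale, a float minimum,
 * and 32 two-bit codes packed four per byte. */
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK 32

typedef struct {
    float scale;              /* (max - min) / 3: 2 bits give 4 levels */
    float min;                /* asymmetric zero point */
    uint8_t codes[BLOCK / 4]; /* 4 two-bit codes per byte */
} q2_block;

static void q2_quantize(const float *x, q2_block *b) {
    float lo = x[0], hi = x[0];
    for (int i = 1; i < BLOCK; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    b->min = lo;
    b->scale = (hi - lo) / 3.0f;
    if (b->scale == 0.0f) b->scale = 1.0f; /* flat block: avoid div by zero */
    for (int i = 0; i < BLOCK / 4; i++) b->codes[i] = 0;
    for (int i = 0; i < BLOCK; i++) {
        int q = (int)roundf((x[i] - lo) / b->scale);
        if (q < 0) q = 0;
        if (q > 3) q = 3;
        b->codes[i / 4] |= (uint8_t)(q << ((i % 4) * 2));
    }
}

static float q2_dequantize(const q2_block *b, int i) {
    int q = (b->codes[i / 4] >> ((i % 4) * 2)) & 3;
    return b->min + b->scale * (float)q;
}

int main(void) {
    float x[BLOCK], err = 0.0f;
    q2_block b;
    for (int i = 0; i < BLOCK; i++) x[i] = sinf((float)i); /* demo data */
    q2_quantize(x, &b);
    for (int i = 0; i < BLOCK; i++)
        err += fabsf(x[i] - q2_dequantize(&b, i));
    printf("mean abs round-trip error: %f\n", err / BLOCK);
    return 0;
}
```

Note that with this naive layout the two float headers cost 8 bytes per 32 weights on top of 8 bytes of codes, i.e. about 4 bits per weight effective; real recipes shrink headers (half-precision scales) or enlarge blocks to get closer to the nominal 2 bits.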
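The post doesn't spell out how DS4 implements steering either. Assuming the standard activation-steering technique, where a precomputed direction is added to the hidden state at a chosen layer during the forward pass, the core operation is just the following; all names here are illustrative, not DS4's API:

```c
/* Minimal sketch of activation ("vector") steering, assuming the usual
 * technique: add a scaled steering vector to the hidden state at one
 * layer before it flows onward. hidden, steer, alpha, and dim are
 * illustrative names, not DS4's actual interface. */
#include <stddef.h>

static void apply_steering(float *hidden,      /* hidden state, length dim */
                           const float *steer, /* precomputed steering vector */
                           float alpha,        /* strength; sign flips direction */
                           size_t dim) {
    for (size_t i = 0; i < dim; i++)
        hidden[i] += alpha * steer[i];
}
```

Steering vectors are commonly derived as the difference of mean activations over contrasting prompt sets. The relevant point for antirez's argument is that this hook lives inside the forward pass, which is exactly the layer a hosted API never lets you touch.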
Hacker News Comment Review
Commenters largely back the quality claim: multiple independent testers found DS4/DeepSeek v4 Pro competitive with Claude Sonnet at coding, though Opus still leads on harder benchmarks per Kilo agent tests.
The model-specific-runtime-versus-llama.cpp debate is live: critics note duplicated effort as PRs land against both codebases, while supporters point to the focused Metal/CUDA/ROCm backend design and tighter single-model optimization (a sketch of what such a backend interface might look like follows this list).
AMD ROCm support exists on a separate branch, but the maintainers lack the hardware to test it directly; Strix Halo users with 128GB of unified RAM are watching, though there are no confirmed results yet.
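Neither the post nor the thread shows DS4 internals, so the following is only a guess at what a "focused backend design" could mean in C: one small vtable per device, with kernels free to hard-code the single model's shapes and quantization layout. Every name here is hypothetical.

```c
/* Hypothetical sketch of a single-model backend interface: one struct
 * of function pointers that Metal, CUDA, and ROCm ports each fill in.
 * Not DS4's published API; all names are invented for illustration. */
#include <stddef.h>

typedef struct ds4_backend {
    const char *name;             /* "metal", "cuda", "rocm", "cpu" */
    void *(*alloc)(size_t bytes); /* device buffer allocation */
    void  (*free_buf)(void *buf);
    /* fused dequant + matmul for the one quant layout the model uses */
    void  (*matmul_q2)(void *out, const void *w_q2, const void *x,
                       int rows, int cols);
    void  (*synchronize)(void);   /* wait for queued device work */
} ds4_backend;
```

Hard-coding one model's tensor shapes and quant format into each kernel is the supporters' optimization argument; the duplicated per-backend work is precisely the critics' maintenance worry.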
Notable Comments
@simonw: Confirmed a painless setup on a 128GB M5; the model fits in ~80GB of RAM with strong coding and tool-execution performance.
@0xbadcafebee: Raises a sustainability concern: building a model-specific engine risks obsolescence and splits contributor attention with llama.cpp.