whichllm auto-detects your GPU, CPU, and RAM, then ranks local LLMs by real benchmark scores merged from multiple leaderboards, not just by what fits in VRAM.
Key Takeaways
Scoring merges LiveBench, Artificial Analysis, Aider, Chatbot Arena ELO, and Open LLM Leaderboard with confidence-weighted dampening.
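A minimal sketch of how confidence-weighted merging could work. The five sources come from the takeaway above, but the scores, confidence weights, and function name below are illustrative assumptions, not whichllm's actual values:

```python
# Illustrative confidence-weighted merge; scores are assumed normalized
# to 0-100 per source beforehand. All numbers here are made up.
SOURCES = {
    "livebench": (62.0, 0.9),
    "artificial_analysis": (58.5, 0.8),
    "aider": (41.0, 0.7),
    "chatbot_arena_elo": (71.3, 0.85),   # ELO rescaled to 0-100 first
    "open_llm_leaderboard": (55.0, 0.6),
}

def merged_score(sources: dict[str, tuple[float, float]]) -> float:
    """Confidence-weighted mean: low-confidence sources are dampened
    rather than dropped, so one noisy leaderboard can't dominate."""
    num = sum(score * conf for score, conf in sources.values())
    den = sum(conf for _, conf in sources.values())
    return num / den if den else 0.0

print(f"merged: {merged_score(SOURCES):.1f}")  # merged: 58.4
```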
Recency-aware lineage demotion prevents stale 2024 leaderboard scores from outranking current-generation models.
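One plausible shape for lineage demotion, assuming a multiplicative per-year decay applied to scores recorded before a family's current generation shipped; the decay rate, dates, and function name are hypothetical:

```python
from datetime import date

def demote_stale(score: float, scored_on: date, latest_gen_release: date,
                 decay_per_year: float = 0.85) -> float:
    """Demote a benchmark score recorded before the model family's
    current generation shipped; fresh scores pass through untouched."""
    if scored_on >= latest_gen_release:
        return score
    years_stale = (latest_gen_release - scored_on).days / 365.25
    return score * decay_per_year ** years_stale

# A mid-2024 score for a family whose current generation shipped mid-2025
print(demote_stale(70.0, date(2024, 6, 1), date(2025, 6, 1)))  # ~59.5
```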
VRAM estimation covers weights + GQA KV cache + activation + overhead; MoE models rank speed on active params, quality on total.
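The named components translate into a standard back-of-the-envelope formula. The helper below is a sketch with an illustrative 8B/Q4 configuration, not whichllm's estimator:

```python
def estimate_vram_gib(params_b: float, bytes_per_param: float,
                      n_layers: int, n_kv_heads: int, head_dim: int,
                      context: int, kv_bytes: int = 2,
                      activation_gib: float = 1.0,
                      overhead: float = 1.1) -> float:
    weights = params_b * 1e9 * bytes_per_param
    # GQA: cache K and V for the KV heads only, not every attention head
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * context * kv_bytes
    # For MoE, size weights on total params; tokens/sec tracks active params
    return ((weights + kv_cache) / 2**30 + activation_gib) * overhead

# Roughly Llama-3-8B-shaped at Q4 (~0.5 bytes/param), 8k context
print(f"{estimate_vram_gib(8, 0.5, 32, 8, 128, 8192):.1f} GiB")  # 6.3 GiB
```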
whichllm plan does the reverse lookup, telling you which GPU you need for a target model and context length; --json enables pipeline use with Ollama or jq.
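A hedged pipeline sketch: the --json flag is documented above, but the output schema assumed here (a ranked JSON array of objects with a "name" field) is a guess:

```python
import json
import subprocess

# Assumes `whichllm --json` prints a ranked JSON array to stdout and that
# each entry carries a "name" field; the schema is an assumption.
out = subprocess.run(["whichllm", "--json"],
                     capture_output=True, text=True, check=True)
ranked = json.loads(out.stdout)
for model in ranked[:3]:
    print(model.get("name"))  # e.g. pipe into `ollama pull <name>`
```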
Evidence is graded across five levels (direct, variant, base, interpolated, self-reported), and cross-family score inheritance is rejected when parameter counts diverge by more than 2x.
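The five grades and the 2x guard map naturally onto a small enum and a ratio check; the names and API below are hypothetical, not whichllm's code:

```python
from enum import IntEnum

class Evidence(IntEnum):
    DIRECT = 5         # this exact model was benchmarked
    VARIANT = 4        # a quant or finetune of it was benchmarked
    BASE = 3           # its base model was benchmarked
    INTERPOLATED = 2   # estimated from sibling models
    SELF_REPORTED = 1  # vendor-published numbers only

def may_inherit(src_params_b: float, dst_params_b: float) -> bool:
    """Reject cross-family score inheritance when parameter counts
    diverge by more than 2x in either direction."""
    ratio = max(src_params_b, dst_params_b) / min(src_params_b, dst_params_b)
    return ratio <= 2.0

print(may_inherit(7, 13))  # True  (1.86x apart)
print(may_inherit(7, 70))  # False (10x apart)
```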
Hacker News Comment Review
Commenters flagged llmfit, a Go-based direct competitor; the main differentiators are that whichllm is written in Python and adds benchmark-aware ranking rather than pure fit detection.
The fixed context-length assumption in VRAM estimation is a real gap: sliding window attention models like Mistral use substantially less KV cache at 32k context than the README implies.
Early users on Apple Silicon report stale Qwen 2.5 recommendations even though Qwen 3.x runs fine on their hardware, suggesting that HuggingFace data freshness or the ranking logic lags behind model releases.
Notable Comments
@Jasssss: VRAM estimation does not account for sliding window attention, so KV cache sizing is likely overstated for Mistral-class models at long context.
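The commenter's point in numbers: with sliding window attention the cache only holds the last `window` tokens, not the full context. The sketch below assumes a Mistral-7B-v0.1-like config (32 layers, 8 KV heads, 128 head dim, 4k window); the helper itself is illustrative:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context: int, window: int | None = None,
                 kv_bytes: int = 2) -> float:
    # With SWA, only the most recent `window` tokens stay in the cache
    cached = min(context, window) if window else context
    return 2 * n_layers * n_kv_heads * head_dim * cached * kv_bytes / 2**30

print(kv_cache_gib(32, 8, 128, 32768))               # full attention: 4.0 GiB
print(kv_cache_gib(32, 8, 128, 32768, window=4096))  # SWA:            0.5 GiB
```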