IBM’s Granite 4.1 8B dense model outperforms its own 32B MoE (9B active) predecessor across ArenaHard, BFCL V3, GSM8K, and coding benchmarks, and is Apache 2.0 licensed.
Key Takeaways
Four-stage RL pipeline caught a mid-training regression where RLHF improved chat quality but dropped GSM8K and DeepMind-Math scores; a dedicated math RL stage recovered the loss and surpassed the baselines (a gating sketch follows the takeaways below).
Five distinct pre-training phases shifted the data mix from 59% CommonCrawl toward heavy math/code/CoT blends, totaling 15 trillion tokens, plus a curated set of 4.1M fine-tuning samples (see the mix-schedule sketch below).
An LLM-as-Judge pass filtered every fine-tuning sample across six dimensions; hallucinations, false premises, and incorrect computations triggered automatic rejection with no partial credit (see the filter sketch below).
512K context was achieved via staged extension (32K to 128K to 512K) with model merging at each step to preserve short-context performance; RULER scores degrade gradually rather than sharply (see the merge sketch below).
3B model hits 87.0 on GSM8K and 60.8 on BFCL V3, making it viable for edge deployment; it caps at 128K context, not 512K.
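
To make the staged-RL takeaway concrete, here is a minimal sketch of a benchmark-gated training loop: after each stage, held-out suites are re-scored, and a regression on any tracked benchmark triggers a dedicated recovery stage before the pipeline continues. All stage names, the tolerance, and the stubbed training/eval calls are illustrative assumptions, not IBM's actual pipeline.

```python
import random

STAGES = ["sft_warmup", "rlhf_chat", "math_rl", "final_polish"]  # assumed names
TRACKED = ["gsm8k", "deepmind_math"]

def run_stage(model, stage):
    # Stub: a real implementation would run an RL stage (e.g. PPO/GRPO) here.
    return model + [stage]

def evaluate(model):
    # Stub: a real implementation would score held-out benchmark suites.
    return {b: random.uniform(60, 90) for b in TRACKED}

def train_with_gates(model, tolerance=0.5):
    baseline = evaluate(model)
    for stage in STAGES:
        model = run_stage(model, stage)
        scores = evaluate(model)
        # Gate: any tracked benchmark dropping past tolerance triggers a
        # targeted recovery stage, mirroring the dedicated math RL stage
        # described above.
        regressed = [b for b in TRACKED if scores[b] < baseline[b] - tolerance]
        if regressed:
            model = run_stage(model, "recovery_" + regressed[0])
            scores = evaluate(model)
        baseline = {b: max(baseline[b], scores[b]) for b in TRACKED}
    return model

model = train_with_gates([])
```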
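For the phased data mix, the mechanism is weighted sampling over sources with weights that change per phase. In the sketch below, only the 59% CommonCrawl share of the first phase comes from the article; every other weight, the phase boundaries, and the source names are made-up placeholders that show the mechanism, not IBM's actual schedule (the article describes five phases; three are shown).

```python
import random
from itertools import islice

PHASE_MIXES = [
    # (token budget, {source: sampling weight}) -- placeholder values
    (3e12, {"commoncrawl": 0.59, "code": 0.20, "math": 0.06, "other": 0.15}),
    (3e12, {"commoncrawl": 0.40, "code": 0.30, "math": 0.15, "other": 0.15}),
    (3e12, {"code": 0.35, "math": 0.25, "cot": 0.25, "commoncrawl": 0.15}),
]

def sample_source(mix):
    """Pick the source of the next training document by mix weight."""
    return random.choices(list(mix), weights=list(mix.values()), k=1)[0]

def training_stream():
    # Walk the phases in order, emitting source labels until each
    # phase's token budget is spent (documents assumed ~4K tokens).
    for budget, mix in PHASE_MIXES:
        seen = 0.0
        while seen < budget:
            yield sample_source(mix)
            seen += 4096

print(list(islice(training_stream(), 10)))  # first few sampled sources
```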
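The "no partial credit" judge gate reduces to an all-or-nothing predicate: a sample survives only if it passes every dimension. The six dimension names beyond the three rejection triggers the article mentions are assumptions, and judge_one is a stub standing in for a real judge-model call.

```python
DIMENSIONS = [
    "hallucination", "false_premise", "incorrect_computation",
    "instruction_following", "completeness", "style",  # last three assumed
]

def judge_one(sample: dict, dimension: str) -> bool:
    """Stub: a real implementation would prompt a judge LLM on this
    dimension and parse a strict pass/fail verdict."""
    return True

def keep(sample: dict) -> bool:
    # All-or-nothing: a single failed dimension rejects the sample outright.
    return all(judge_one(sample, d) for d in DIMENSIONS)

dataset = [{"prompt": "...", "response": "..."}]
curated = [s for s in dataset if keep(s)]
```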
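Finally, the staged context extension pairs each long-context fine-tune with a merge back toward the previous checkpoint so short-context skill is retained. The article says "model merging" without naming a method; the linear per-tensor interpolation and alpha value below are assumptions, and extend_context is a stub for the actual long-context training step.

```python
import torch

def merge_state_dicts(base_sd, extended_sd, alpha=0.5):
    # Linear interpolation per tensor: alpha=1.0 keeps only the
    # long-context fine-tune, alpha=0.0 only the short-context base.
    return {k: (1 - alpha) * base_sd[k] + alpha * extended_sd[k]
            for k in base_sd}

def extend_context(model_sd, target_len):
    """Stub: a real step would rescale positional encodings (e.g. RoPE
    base) and continue training on documents up to target_len."""
    return {k: v.clone() for k, v in model_sd.items()}

sd = {"layer.weight": torch.randn(4, 4)}            # toy checkpoint
for target in (128_000, 512_000):                   # 32K -> 128K -> 512K
    extended = extend_context(sd, target)
    sd = merge_state_dicts(sd, extended, alpha=0.5) # preserve short-context skill
```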
Hacker News Comment Review
Early hands-on testers found the 8B genuinely useful on commodity hardware but noted that Qwen3 35B still leads for local general use; a more recent training-data cutoff is cited as Granite’s practical edge.
Commenters flagged that all benchmark comparisons are IBM’s own self-reported results run on IBM’s own harness; the article offers no third-party validation beyond noting that the absolute numbers look plausible.
There is interest in granite-vision-4.1-4b as a potential sleeper for table and semantic key-value extraction if benchmarks hold against frontier models.
Notable Comments
@cbg0: Flags granite-vision-4.1-4b as potentially strong for table and semantic k:v extraction at that parameter count.
@Havoc: Notes both IBM and Mistral are pivoting away from MoE while larger SOTA models stick with it; an 8B Q6 vibe check showed a clinical tone suited to data processing.