Granite 4.1: IBM's 8B Model Matching 32B MoE


TLDR

  • IBM’s Granite 4.1 8B dense model outperforms its own 32B MoE (9B active) predecessor across ArenaHard, BFCL V3, GSM8K, and coding benchmarks, and is released under Apache 2.0.

Key Takeaways

  • Four-stage RL pipeline caught a mid-training regression where RLHF improved chat quality but dropped GSM8K and DeepMind-Math scores; a dedicated math RL stage recovered the losses and surpassed the baselines.
  • Five distinct pre-training phases shifted the data mix from 59% CommonCrawl to heavy math/code/CoT blends, totaling 15 trillion tokens, plus a curated set of 4.1M fine-tuning samples.
  • An LLM-as-judge filter scored every fine-tuning sample across six dimensions; hallucinations, false premises, and incorrect computations triggered automatic rejection with no partial credit (a minimal sketch of this style of filter follows this list).
  • 512K context achieved via staged extension (32K to 128K to 512K) with model merging at each step to preserve short-context performance; RULER scores degrade gradually, not sharply (see the merging sketch after this list).
  • The 3B model hits 87.0 on GSM8K and 60.8 on BFCL V3, making it viable for edge deployment; it caps at 128K context, not 512K.
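
The article does not reveal IBM’s actual judge prompt or the six dimension names, but a minimal sketch of this style of all-or-nothing filter, assuming an OpenAI-compatible judge endpoint, could look like the following. The dimension list, prompt, and judge model name are illustrative assumptions, not IBM’s pipeline.

```python
# Hypothetical LLM-as-judge sample filter; dimensions, prompt, and model are assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint serving the judge model

DIMENSIONS = [  # six illustrative dimensions; the article does not list IBM's
    "hallucination", "false_premise", "incorrect_computation",
    "instruction_following", "coherence", "completeness",
]

JUDGE_PROMPT = """Evaluate the assistant response below.
For each dimension, answer strictly "pass" or "fail" and reply as a JSON object with these keys:
{dims}

[PROMPT]
{prompt}

[RESPONSE]
{response}
"""

def keep_sample(prompt: str, response: str, judge_model: str = "judge-model") -> bool:
    """Return True only if the sample passes every dimension (no partial credit)."""
    msg = JUDGE_PROMPT.format(dims=json.dumps(DIMENSIONS), prompt=prompt, response=response)
    out = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": msg}],
        temperature=0.0,
    )
    try:
        verdicts = json.loads(out.choices[0].message.content)
    except json.JSONDecodeError:
        return False  # unparseable judgment -> reject conservatively
    if not isinstance(verdicts, dict):
        return False
    return all(verdicts.get(d) == "pass" for d in DIMENSIONS)

# Usage: filtered = [s for s in samples if keep_sample(s["prompt"], s["response"])]
```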
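
For the staged context extension, the article only says checkpoints are merged at each step to preserve short-context performance. One common way to do that is element-wise linear interpolation of parameter tensors; the sketch below assumes PyTorch state_dicts with matching keys, and the interpolation weight and file names are placeholders rather than IBM’s recipe.

```python
# Minimal checkpoint-merging sketch: interpolate a short-context and a long-context
# checkpoint. This is an assumption about how such a merge could work, not IBM's method.
import torch

def merge_state_dicts(short_ctx_path: str, long_ctx_path: str, alpha: float = 0.5) -> dict:
    """Return alpha * long-context + (1 - alpha) * short-context for every parameter."""
    short_sd = torch.load(short_ctx_path, map_location="cpu")
    long_sd = torch.load(long_ctx_path, map_location="cpu")
    merged = {}
    for name, short_w in short_sd.items():
        long_w = long_sd[name]  # assumes identical keys and shapes in both checkpoints
        merged[name] = (1.0 - alpha) * short_w + alpha * long_w
    return merged

# Usage with hypothetical filenames, repeated at each extension step (32K->128K, 128K->512K):
# merged_sd = merge_state_dicts("granite_32k.pt", "granite_128k_ext.pt", alpha=0.5)
# model.load_state_dict(merged_sd)
```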

Hacker News Comment Review

  • Early hands-on testers found the 8B genuinely useful on commodity hardware but noted that Qwen3 35B still leads for local general use; commenters cited Granite’s more recent training data as its practical edge.
  • Commenters flagged that all benchmark comparisons are IBM’s own self-reported results using their own harness, and that the article lacks third-party validation beyond noting that the absolute numbers look plausible.
  • There is interest in granite-vision-4.1-4b as a potential sleeper for table and semantic key-value extraction, if its benchmarks hold up against frontier models.

Notable Comments

  • @cbg0: Flags granite-vision-4.1-4b as potentially strong for table and semantic k:v extraction at that parameter count.
  • @Havoc: Notes that both IBM and Mistral are pivoting away from MoE while larger SOTA models stick with it; an 8B Q6 vibe check showed a clinical tone well suited to data processing.

Original | Discuss on HN