Laguna XS.2 and M.1


TLDR

  • Poolside releases Laguna M.1 (225B-A23B MoE) and open-weight Laguna XS.2 (33B-A3B MoE, Apache 2.0), both agentic coding models trained from scratch for long-horizon tasks.

Key Takeaways

  • Laguna XS.2 (33B total, 3B activated) scores 44.5% on SWE-bench Pro and 30.1% on Terminal-Bench 2.0; weights are downloadable now under Apache 2.0, with an NVFP4 variant included for Blackwell hardware.
  • Laguna M.1 (225B-A23B, 30T tokens, 6,144 Hopper GPUs) scores 46.9% on SWE-bench Pro and 40.7% on Terminal-Bench 2.0; it is available only through a closed API, free during the research preview.
  • Poolside is shipping the same ACP (Agent Client Protocol) server harness used internally for agent RL training and evaluation alongside the model weights.
  • Both models were trained entirely in-house using the Titan codebase, the Muon optimizer, and async on-policy RL; synthetic data makes up ~13% of the XS.2 pre-training mix (4.4T+ synthetic tokens across the family).
  • The AutoMixer framework runs ~60 proxy models per sweep to optimize data mixture proportions, balancing code, math, STEM, and common sense without manual heuristics.
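
The "33B-A3B" and "225B-A23B" labels can be read as total versus activated parameters. A minimal sketch of that arithmetic follows, assuming the usual MoE reading in which "activated" means parameters used per token; the dictionary layout and function name are illustrative, not anything from Poolside.

```python
# Parameter counts in billions, as quoted in the takeaways above.
# The dict layout and key names are illustrative assumptions.
MODELS = {
    "Laguna XS.2": {"total_b": 33, "active_b": 3},
    "Laguna M.1": {"total_b": 225, "active_b": 23},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters that are activated."""
    return active_b / total_b


fractions = {name: active_fraction(**p) for name, p in MODELS.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of parameters activated")
```

Both models land at roughly a tenth of their parameters activated, so the headline sizes overstate per-token compute by about 10x.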

Hacker News Comment Review

  • Commenters broadly noted that Qwen3.6 35B-A3B beats Laguna XS.2 on Terminal-Bench 2.0 (51.5 vs 30.1) and also clearly outscores the much larger M.1 (225B-A23B, 40.7), raising questions about what Poolside’s model training investment actually buys over frontier open-weight competitors.
  • The decision to co-release the ACP harness was treated as the genuinely differentiated move: it’s the same runtime exercised in production RL rather than a demo wrapper, which is rare among lab releases.
  • Early testers reported fast inference and strong ACP spec adherence via the “pool” agent in Zed, though benchmark-vs-real-use gaps remain an open question given the Terminal-Bench numbers.
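
To make the size of the Terminal-Bench 2.0 gaps concrete, here is a small sketch that computes the point differences from the scores quoted in this summary. The helper name `gap_vs` is hypothetical, and the numbers are the commenter-reported figures, not independently verified results.

```python
# Terminal-Bench 2.0 scores (percent) as quoted in this summary.
TERMINAL_BENCH_2 = {
    "Qwen3.6 35B-A3B": 51.5,
    "Laguna M.1": 40.7,
    "Laguna XS.2": 30.1,
}


def gap_vs(baseline: str, scores: dict[str, float]) -> dict[str, float]:
    """Percentage-point lead of `baseline` over every other model."""
    base = scores[baseline]
    return {
        name: round(base - score, 1)
        for name, score in scores.items()
        if name != baseline
    }


gaps = gap_vs("Qwen3.6 35B-A3B", TERMINAL_BENCH_2)
for name, points in gaps.items():
    print(f"Qwen3.6 35B-A3B leads {name} by {points} points")
```

On these figures the lead is 21.4 points over XS.2 and 10.8 points over M.1.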

Notable Comments

  • @vijgaurav: “Most labs dump the model and make you figure out the agent layer yourself” – shipping the RL-exercised harness is the distinguishing detail.
  • @jaen: Side-by-side Terminal-Bench 2.0 comparison shows Qwen3.6 35B-A3B at 51.5 vs XS.2 at 30.1, a gap that persists even against M.1 at 40.7.
