SANA-WM, a 2.6B open-source world model for 1-minute 720p video


TLDR

  • An NVLabs paper introduces SANA-WM, a 2.6B open-source world model that generates 60-second 720p video with 6-DoF camera control on a single GPU.

Key Takeaways

  • Hybrid Linear Attention pairs frame-wise Gated DeltaNet with periodic full softmax attention, keeping memory flat across minute-long sequences where an all-softmax stack runs out of memory (a minimal sketch of the interleaving pattern follows this list).
  • Dual-Branch Camera Control combines a coarse global pose branch with a fine pixel-aligned geometric branch to follow metric 6-DoF camera trajectories (see the second sketch after this list).
  • Two-stage pipeline: a 2.6B backbone handles long rollout; a separate 17B long-video refiner sharpens texture and late-window quality.
  • Trained on ~213K public video clips in 15 days on 64 H100s; a distilled variant denoises a 60-second 720p clip in 34 seconds on a single RTX 5090 with NVFP4 quantization.
  • Claims 36x higher throughput than prior open-source baselines, with visual quality comparable to industrial models such as LingBot-World and HY-WorldPlay.
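
The bullet above only names the attention scheme, so here is a minimal, hedged sketch of the interleaving idea: most blocks use a constant-memory linear-attention mixer (a simple stand-in for Gated DeltaNet, not the paper's implementation), and every Nth block falls back to full softmax attention. The ratio, dimensions, and module internals are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearAttention(nn.Module):
    """Stand-in for Gated DeltaNet: O(T) kernelized attention that keeps a
    fixed-size (D x D) running state instead of a T x T score matrix."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        q = F.elu(self.q(x)) + 1                  # positive feature map
        k = F.elu(self.k(x)) + 1
        v = self.v(x)
        kv = torch.einsum("btd,bte->bde", k, v)   # constant-size state
        z = k.sum(dim=1)                          # normalizer, (B, D)
        out = torch.einsum("btd,bde->bte", q, kv)
        return out / (torch.einsum("btd,bd->bt", q, z).unsqueeze(-1) + 1e-6)


class HybridBlock(nn.Module):
    def __init__(self, dim: int, heads: int, use_softmax: bool):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.use_softmax = use_softmax
        self.mixer = (
            nn.MultiheadAttention(dim, heads, batch_first=True)
            if use_softmax
            else LinearAttention(dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        if self.use_softmax:
            h, _ = self.mixer(h, h, h, need_weights=False)
        else:
            h = self.mixer(h)
        return x + h


class HybridBackbone(nn.Module):
    """Every `softmax_every`-th block uses full attention; the rest stay
    linear, so memory grows roughly linearly with sequence length."""

    def __init__(self, dim=512, heads=8, depth=12, softmax_every=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [
                HybridBlock(dim, heads, use_softmax=(i + 1) % softmax_every == 0)
                for i in range(depth)
            ]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for blk in self.blocks:
            x = blk(x)
        return x


if __name__ == "__main__":
    tokens = torch.randn(1, 2048, 512)      # long sequence of frame tokens
    print(HybridBackbone()(tokens).shape)   # torch.Size([1, 2048, 512])
```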
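Similarly, a hedged sketch of a two-branch camera-conditioning scheme: one branch embeds the per-frame global pose (coarse), the other projects per-pixel ray geometry into a pixel-aligned feature map (fine). The ray input, additive fusion, and tensor shapes are assumptions for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn


class DualBranchCameraControl(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        # Coarse branch: a flattened 3x4 world-to-camera matrix per frame
        # becomes one global conditioning vector per frame.
        self.pose_mlp = nn.Sequential(
            nn.Linear(12, dim), nn.SiLU(), nn.Linear(dim, dim)
        )
        # Fine branch: a 6-channel per-pixel ray embedding (e.g. Pluecker
        # coordinates) is projected to a pixel-aligned feature map.
        self.ray_proj = nn.Conv2d(6, dim, kernel_size=1)

    def forward(self, latents, poses, rays):
        # latents: (B, T, D, H, W) video latents
        # poses:   (B, T, 12)      flattened extrinsics per frame
        # rays:    (B, T, 6, H, W) per-pixel ray embeddings
        B, T, D, H, W = latents.shape
        coarse = self.pose_mlp(poses)[..., None, None]           # (B, T, D, 1, 1)
        fine = self.ray_proj(rays.flatten(0, 1)).view(B, T, D, H, W)
        return latents + coarse + fine                           # additive fusion


if __name__ == "__main__":
    ctrl = DualBranchCameraControl()
    z = torch.randn(1, 8, 512, 16, 28)       # 8 latent frames
    pose = torch.randn(1, 8, 12)
    ray = torch.randn(1, 8, 6, 16, 28)
    print(ctrl(z, pose, ray).shape)          # torch.Size([1, 8, 512, 16, 28])
```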

Hacker News Comment Review

  • The model is not yet publicly downloadable: both the GitHub repo and the website's download button are inactive, and commenters are already asking whether it will fit on a 24 GB RTX 4090.
  • Some commenters see open-sourcing as a compounding advantage, while others push back that leading labs should instead release stronger models like Seedance 2.0 or Veo 3.

Notable Comments

  • @Fischgericht: Download link absent from GitHub and website; asks explicitly about RTX 4090 24GB support.

Original | Discuss on HN