An NVLabs paper introduces SANA-WM, a 2.6B-parameter open-source world model that generates 60-second 720p video with 6-DoF camera control on a single GPU.
Key Takeaways
Hybrid Linear Attention pairs frame-wise Gated DeltaNet with periodic softmax attention, keeping memory flat across minute-long sequences where an all-softmax stack runs out of memory.
Dual-Branch Camera Control uses a coarse global pose branch plus a fine pixel-aligned geometric branch to follow metric 6-DoF trajectories.
Two-stage pipeline: a 2.6B backbone handles long rollout; a separate 17B long-video refiner sharpens texture and late-window quality.
Trained on ~213K public video clips in 15 days on 64 H100s; distilled variant denoises a 60s 720p clip in 34s on a single RTX 5090 with NVFP4 quantization.
Claims 36x higher throughput than prior open-source baselines, with visual quality comparable to industrial models such as LingBot-World and HY-WorldPlay.
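The memory argument behind the hybrid layout can be made concrete with a toy calculation. The paper's exact layer schedule and state sizes are not public, so the interleaving period and the constant recurrent-state size below are illustrative assumptions; the sketch only shows why cached attention state grows with sequence length for the few softmax layers but stays constant for the linear (Gated DeltaNet-style) layers.

```python
def layer_schedule(num_layers: int, period: int) -> list[str]:
    """Assumed hybrid pattern: every `period`-th layer uses softmax
    attention; all other layers use linear attention with a fixed-size
    recurrent state. The real schedule in SANA-WM may differ."""
    return ["softmax" if (i + 1) % period == 0 else "linear"
            for i in range(num_layers)]

def cached_state_tokens(schedule: list[str], seq_len: int,
                        linear_state: int = 1) -> int:
    """Total tokens of attention state held at sequence length `seq_len`.
    Softmax layers cache all past keys/values (grows with seq_len);
    linear layers carry a constant-size state (`linear_state` is a
    stand-in unit, not a published number)."""
    return sum(seq_len if kind == "softmax" else linear_state
               for kind in schedule)

# Toy comparison at a 10,000-token rollout, 20 layers, softmax every 5th:
hybrid = layer_schedule(20, 5)            # 4 softmax + 16 linear layers
print(cached_state_tokens(hybrid, 10_000))        # 40016
print(cached_state_tokens(["softmax"] * 20, 10_000))  # 200000
```

Under these assumptions the hybrid stack's cache grows with only the 4 softmax layers, roughly a 5x reduction here, while the all-softmax stack's cache scales with every layer, which is the flat-memory behavior the takeaway describes.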
Hacker News Comment Review
The model is not yet publicly downloadable: both the GitHub repo and the website's download button are inactive, and commenters are already asking whether it will run on an RTX 4090 (24GB).
Some commenters see open-sourcing as a compounding advantage, while others push back that leading labs should instead release stronger models on the level of Seedance 2.0 or Veo 3.
Notable Comments
@Fischgericht: Notes the download link is absent from both GitHub and the website, and asks explicitly about RTX 4090 (24GB) support.