The paper introduces Stable Audio 3, a family of latent diffusion models for fast, variable-length audio generation and editing, with open weights for the small and medium sizes.
Key Takeaways
Three model sizes (small, medium, large) built on a semantic-acoustic autoencoder that compresses audio into a compact latent space preserving fidelity and semantic structure.
Variable-length generation avoids the cost of full-length inference for short sounds; inpainting enables targeted edits and continuation of existing recordings.
Adversarial post-training accelerates inference and improves quality, cutting required diffusion steps while boosting prompt adherence.
Generates audio in under 2 s on an H200 GPU and in a few seconds on a MacBook Pro M4; the small and medium weights are released with full training and inference code.
Training data consists of licensed and Creative Commons material, addressing a persistent legal concern in generative audio models.
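
To make the variable-length and inpainting takeaways above more concrete, here is a minimal sketch of how a latent-space model might size its latent sequence to the requested duration and mark a region for regeneration. It is not the paper's implementation: the sample rate, the downsampling factor, the 60-second window, and the helper names `latent_frames`, `inpaint_mask`, and `blend` are all illustrative assumptions, and the model's actual inpainting conditioning may work differently.

```python
import math

# All constants are illustrative assumptions, not values from the paper.
SAMPLE_RATE = 44_100   # audio samples per second (assumed)
DOWNSAMPLE = 2_048     # audio samples per latent frame (assumed)


def latent_frames(seconds: float) -> int:
    """Number of latent frames needed to cover `seconds` of audio."""
    return math.ceil(seconds * SAMPLE_RATE / DOWNSAMPLE)


def inpaint_mask(total_seconds: float, edit_start: float, edit_end: float) -> list[int]:
    """Per-frame mask: 1 = regenerate this latent frame, 0 = keep the source frame."""
    n = latent_frames(total_seconds)
    lo = math.floor(edit_start * SAMPLE_RATE / DOWNSAMPLE)
    hi = min(n, math.ceil(edit_end * SAMPLE_RATE / DOWNSAMPLE))
    return [1 if lo <= i < hi else 0 for i in range(n)]


def blend(source: list, generated: list, mask: list[int]) -> list:
    """Take generated frames where the mask is 1 and source frames elsewhere."""
    return [g if m else s for s, g, m in zip(source, generated, mask)]


if __name__ == "__main__":
    short = latent_frames(2.0)    # latent length for a 2-second sound
    full = latent_frames(60.0)    # latent length for a hypothetical 60-second window
    print(f"2 s -> {short} frames, 60 s -> {full} frames")

    mask = inpaint_mask(total_seconds=10.0, edit_start=4.0, edit_end=6.0)
    print(f"{sum(mask)} of {len(mask)} frames marked for regeneration")
```

Under these assumed constants, a 2-second sound needs 44 latent frames instead of the 1,292 frames of a 60-second window, which is where the savings from variable-length inference come from. Continuing an existing recording is the same masking idea with the masked region placed after the end of the source audio.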