Microsoft’s VibeVoice family covers ASR (7B, 60-minute single-pass with diarization) and TTS (0.5B streaming, ~300 ms latency), built on a next-token-diffusion + LLM backbone at a 7.5 Hz token rate.
Key Takeaways
ASR-7B accepts up to 60 minutes of continuous audio within 64K tokens, avoiding the chunk-boundary speaker drift that breaks most production pipelines.
Outputs structured Who/When/What: joint diarization, timestamps, and ASR in one pass; customizable hotwords improve accuracy on domain-specific vocabulary.
Realtime-0.5B targets deployment with ~300ms first-audible latency and streaming text input; fine-tuning code and vLLM inference are both available.
Core architecture uses Acoustic and Semantic continuous speech tokenizers at 7.5 Hz, dramatically reducing sequence length for long-form audio without losing fidelity.
TTS-1.5B (90-min, 4-speaker, ICLR 2026 Oral) was publicly pulled after deepfake misuse, and its weights remain disabled; the ASR model, meanwhile, is now integrated into Hugging Face Transformers with vLLM support.
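The 7.5 Hz token rate is what makes the 60-minutes-in-64K claim add up, and it is easy to sanity-check with arithmetic. A minimal sketch (the 50 Hz comparison rate below is an illustrative assumption for a conventional codec, not a VibeVoice number):

```python
def audio_tokens(minutes: float, hz: float) -> int:
    """Tokens emitted by a frame-rate-`hz` tokenizer for `minutes` of audio."""
    return int(minutes * 60 * hz)

# VibeVoice's 7.5 Hz tokenizers: one hour fits comfortably in a 64K context.
vibevoice = audio_tokens(60, 7.5)
print(vibevoice, vibevoice < 64_000)    # 27000 True

# A hypothetical 50 Hz codec (assumed rate, for contrast) would already
# overflow the same context on a single hour of audio.
codec_50hz = audio_tokens(60, 50)
print(codec_50hz, codec_50hz < 64_000)  # 180000 False
```

At 27K tokens per hour, the full hour plus transcript output stays well inside the 64K window, which is why no chunking is needed.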
Hacker News Comment Review
The “open-source” framing drew sharp pushback: the training code is proprietary and unreleased, making this open-weight only, as documented in GitHub issue #102.
Practitioners report ASR hallucinations, slow inference, and underperforming multilingual quality relative to the headline claims, especially for STT use cases.
Mistral’s Voxtral was raised as a competitive alternative: reportedly stronger quality, and small enough to run in-browser on WebGPU, raising the bar for what “better” looks like here.
Notable Comments
@vicchenai: single-pass diarization eliminates the Whisper + pyannote two-step and its chunk-boundary speaker-continuity failures, a concrete workflow win regardless of benchmark numbers.
@pluc: points to Kevin Beaumont’s cybersecurity research on the repo and its author, adding important context on the original TTS removal beyond the official safety statement.