Microsoft has released VibeVoice, an open-source voice AI framework that gives developers and researchers access to frontier-quality speech recognition and synthesis that is free to use and can run entirely on-premises. Having passed 31,000 GitHub stars in recent days, VibeVoice is emerging as one of the most ambitious open-source voice AI projects to date.
What Is VibeVoice?
VibeVoice is a family of open-source voice AI models developed by Microsoft Research, encompassing both Text-to-Speech (TTS) and Automatic Speech Recognition (ASR). The framework centers on a core innovation: continuous speech tokenizers operating at an ultra-low frame rate of just 7.5 Hz, which preserve audio fidelity while sharply reducing the compute needed to process long sequences.
The framework employs a next-token diffusion architecture, using a Large Language Model to understand textual context and dialogue flow, combined with a diffusion head to generate high-fidelity acoustic details. This design enables VibeVoice to handle conversational speech that previous open-source models struggled with.
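To make the 7.5 Hz figure concrete, the short sketch below works out how many tokenizer frames each advertised audio length corresponds to. Only the 7.5 Hz rate and the durations come from the article; the function name and the 50 Hz comparison rate are illustrative assumptions, not values taken from the VibeVoice project.

```python
# Frame rate stated in the article.
FRAME_RATE_HZ = 7.5

# Hypothetical higher rate, used only as a point of comparison (an assumption,
# not a figure from the VibeVoice project).
COMPARISON_RATE_HZ = 50.0


def frames_for(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


if __name__ == "__main__":
    durations = [
        ("Realtime TTS (~10 min)", 10),
        ("ASR (60 min)", 60),
        ("TTS (90 min)", 90),
    ]
    for label, minutes in durations:
        low = frames_for(minutes)
        high = frames_for(minutes, COMPARISON_RATE_HZ)
        print(f"{label}: {low:,} frames at 7.5 Hz vs {high:,} at 50 Hz")
```

At 7.5 Hz, a full 90-minute session comes to 40,500 frames, which helps explain how hour-scale audio can be handled as a single sequence.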
Three Model Variants
VibeVoice comes in three flavors, each targeting different use cases:
VibeVoice-ASR-7B
The flagship speech-to-text model accepts up to 60 minutes of continuous audio in a single pass and produces structured transcriptions that capture who spoke, when they spoke, and what they said. Unlike conventional ASR models that slice audio into short chunks, often losing global context, VibeVoice-ASR maintains consistent speaker tracking and semantic coherence across the entire hour.
Key features include:
- Natively multilingual across 50+ languages
- Custom hotword support for domain-specific terminology
- Joint ASR, diarization, and timestamping in a single pass
- Available via Hugging Face and integrated into Transformers v5.3
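One way to picture a transcription that combines speaker, timestamp, and content is as a list of segments. The sketch below is a hypothetical data structure for illustration only: the `Segment` fields, the helper functions, and the sample dialogue are assumptions, not the actual output schema of VibeVoice-ASR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One transcript entry (hypothetical schema, not VibeVoice's own)."""
    speaker: str   # who
    start: float   # when: start time in seconds
    end: float     # when: end time in seconds
    text: str      # what


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS, dropping fractions of a second."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render(segments: List[Segment]) -> str:
    """Produce a readable transcript ordered by start time."""
    ordered = sorted(segments, key=lambda seg: seg.start)
    return "\n".join(
        f"[{format_timestamp(seg.start)}-{format_timestamp(seg.end)}] "
        f"{seg.speaker}: {seg.text}"
        for seg in ordered
    )


def talk_time(segments: List[Segment]) -> dict:
    """Total speaking time in seconds per speaker."""
    totals = {}
    for seg in segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end - seg.start)
    return totals


if __name__ == "__main__":
    # Made-up sample data for demonstration.
    sample = [
        Segment("Speaker 1", 0.0, 4.0, "Welcome back to the show."),
        Segment("Speaker 2", 4.0, 10.0, "Thanks, glad to be here."),
        Segment("Speaker 1", 10.0, 12.5, "Let's get started."),
    ]
    print(render(sample))
    print(talk_time(sample))
```

Keeping all three fields on each segment means downstream tasks such as per-speaker summaries or subtitle generation need no separate diarization or alignment step.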
VibeVoice-TTS-1.5B
A long-form text-to-speech model that synthesizes conversational speech up to 90 minutes long in a single pass, maintaining speaker consistency and semantic coherence throughout. It supports up to 4 distinct speakers in a single conversation with natural turn-taking. The model was accepted as an Oral presentation at ICLR 2026.
VibeVoice-Realtime-0.5B
A lightweight real-time TTS model with only 500 million parameters, achieving a first-audible latency of approximately 300 milliseconds. It is designed for streaming text input and robust long-form generation (around 10 minutes), and is light enough to deploy on consumer hardware.
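First-audible latency is the time between requesting synthesis and receiving the first chunk of audio. The sketch below shows one way to measure it for any streaming source. The `fake_streaming_tts` generator is a hypothetical stand-in with a simulated delay, not the VibeVoice-Realtime API, and the chunk size is arbitrary.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_streaming_tts(text: str, startup_s: float = 0.05,
                       chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical streaming synthesizer: waits, then yields audio chunks."""
    time.sleep(startup_s)        # simulated delay before the first audio
    for _ in range(chunks):
        yield b"\x00" * 320      # placeholder bytes standing in for audio


def first_chunk_latency(stream: Iterable[bytes]) -> Tuple[float, bytes]:
    """Return (seconds until the first chunk arrived, the first chunk)."""
    start = time.perf_counter()
    first = next(iter(stream))   # blocks until the source produces audio
    return time.perf_counter() - start, first


if __name__ == "__main__":
    latency, chunk = first_chunk_latency(fake_streaming_tts("Hello there"))
    print(f"first audio after {latency * 1000:.0f} ms ({len(chunk)} bytes)")
```

Because a generator does no work until its first item is requested, the timer here covers the full wait for the first chunk, which is the quantity a listener actually experiences.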
The Responsible AI Angle
VibeVoice has not been without controversy. In September 2025, Microsoft removed the VibeVoice-TTS code from the repository after discovering instances where it was being used in ways inconsistent with the project’s stated intent. High-quality synthetic speech carries obvious risks for deepfakes and disinformation.
“We do not recommend using VibeVoice in commercial or real-world applications without further testing and development,” the project README now states. “This model is intended for research and development purposes only. Please use responsibly.”
Despite these concerns, the ASR components remain fully available and have found traction in legitimate use cases — including Vibing, a voice-powered input method app for macOS and Windows built on VibeVoice-ASR, released just days ago.
Microsoft’s Open-Source Voice Strategy
VibeVoice represents a deliberate move by Microsoft to establish itself as a major player in open-source voice AI, a space that has historically been dominated by proprietary offerings from companies such as ElevenLabs, OpenAI, and Google. By releasing both ASR and TTS capabilities under an open research framework, Microsoft is enabling the academic and developer community to advance the state of speech AI without commercial restrictions.
The integration into Hugging Face Transformers means any developer can now add state-of-the-art speech recognition to an application with a few lines of code. vLLM inference support brings production-grade serving performance to VibeVoice-ASR, making it viable for real-world deployment at scale.
“We open-sourced VibeVoice-TTS, a long-form multi-speaker text-to-speech model that can synthesize speech up to 90 minutes long with up to 4 distinct speakers.” — Microsoft Research
With VibeVoice, Microsoft is demonstrating that frontier-quality voice AI doesn’t require sending data to third-party servers. For enterprises with strict data sovereignty requirements — financial services, healthcare, government — an open-source, on-premises voice AI stack is an increasingly attractive alternative to proprietary APIs.
The project is available on GitHub, with models on Hugging Face. A live playground is also available for trying VibeVoice-ASR without any installation.