Microsoft has released VibeVoice, a comprehensive open-source framework for frontier voice AI that includes state-of-the-art automatic speech recognition (ASR) and text-to-speech (TTS) models. The framework represents a significant step forward in accessible voice AI technology.
Introducing VibeVoice-ASR
VibeVoice-ASR is a unified speech-to-text model capable of handling 60-minute long-form audio in a single pass. Unlike conventional ASR systems that slice audio into short chunks, often losing global context, VibeVoice-ASR processes continuous audio while generating structured transcriptions containing speaker identification, timestamps, and content.
The model supports over 50 languages natively and allows users to provide customized hotwords to guide the recognition process. This is particularly useful for domain-specific content involving technical terminology, names, or specialized vocabulary.
Key Features of VibeVoice-ASR
The ASR system produces rich transcription output with three key components:
- Who: Speaker diarization identifies different speakers in the conversation
- When: Precise timestamps for each spoken segment
- What: Accurate transcription of the spoken content
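The who/when/what structure above can be sketched as a simple record type. The field names below are hypothetical, chosen for illustration rather than taken from VibeVoice-ASR's actual output schema:

```python
from dataclasses import dataclass

# Hypothetical segment structure mirroring the who/when/what output;
# the real field names in VibeVoice-ASR's transcriptions may differ.
@dataclass
class Segment:
    speaker: str   # who
    start: float   # when (seconds)
    end: float
    text: str      # what

def format_segment(seg: Segment) -> str:
    """Render one segment as a readable transcript line."""
    return f"[{seg.start:07.2f}-{seg.end:07.2f}] {seg.speaker}: {seg.text}"

segments = [
    Segment("Speaker 1", 0.0, 4.2, "Welcome to the show."),
    Segment("Speaker 2", 4.2, 9.8, "Thanks for having me."),
]
for seg in segments:
    print(format_segment(seg))
```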
The model leverages an ultra-low frame rate of 7.5 Hz through continuous speech tokenizers (Acoustic and Semantic), which efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences.
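A quick back-of-envelope calculation shows why the 7.5 Hz rate matters for long-form audio: a full hour of speech compresses to a sequence short enough to fit in a single context window.

```python
# At 7.5 tokens per second, one hour of audio becomes a modest sequence.
frame_rate_hz = 7.5
minutes = 60
tokens = int(frame_rate_hz * minutes * 60)
print(tokens)  # 27000
```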
VibeVoice-TTS: 90-Minute Speech Generation
VibeVoice-TTS can synthesize speech up to 90 minutes long in a single pass, maintaining speaker consistency and semantic coherence throughout. The system supports up to 4 distinct speakers in a single conversation with natural turn-taking.
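A multi-speaker conversation like this is naturally expressed as a script of labeled turns. The helper below uses a "Speaker N: text" convention as a sketch; the exact input format VibeVoice-TTS expects may differ, so check the model documentation before relying on it:

```python
# Illustrative helper for building a multi-speaker script; the
# "Speaker N:" line format is an assumption, not the confirmed API.
def build_script(turns: list[tuple[int, str]], max_speakers: int = 4) -> str:
    speakers = {s for s, _ in turns}
    if len(speakers) > max_speakers:
        raise ValueError(f"VibeVoice-TTS supports up to {max_speakers} speakers")
    return "\n".join(f"Speaker {s}: {text}" for s, text in turns)

script = build_script([(1, "Did you see the results?"), (2, "Yes, impressive!")])
print(script)
```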
The TTS model employs a next-token diffusion framework, using a Large Language Model to understand textual context and dialogue flow, combined with a diffusion head to generate high-fidelity acoustic details.
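The next-token diffusion loop can be sketched conceptually: at each step the language-model backbone produces a hidden state from the context, and a diffusion head iteratively refines noise toward that state to yield a continuous acoustic frame. The toy stand-ins below are purely illustrative and bear no resemblance to the actual VibeVoice implementation:

```python
import random

def llm_step(context):
    """Stand-in for the LLM backbone: maps context to a hidden state."""
    return [random.random() for _ in range(4)]

def diffusion_head(hidden, steps=5):
    """Stand-in for the diffusion head: denoise toward the hidden state."""
    x = [random.gauss(0, 1) for _ in hidden]
    for t in range(steps):
        alpha = (t + 1) / steps
        x = [(1 - alpha) * xi + alpha * hi for xi, hi in zip(x, hidden)]
    return x

def generate(n_frames=3):
    context, frames = [], []
    for _ in range(n_frames):
        h = llm_step(context)
        frame = diffusion_head(h)   # continuous acoustic latent
        frames.append(frame)
        context.append(frame)       # autoregressive conditioning
    return frames

frames = generate()
print(len(frames), len(frames[0]))  # 3 4
```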
VibeVoice-Realtime: Lightweight Streaming TTS
For applications requiring real-time response, VibeVoice-Realtime is a lightweight 0.5-billion-parameter model that supports streaming text input and produces its first audible output in approximately 300 milliseconds. This makes it suitable for interactive applications where responsiveness is critical.
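Time-to-first-audio is the metric that matters for a streaming model like this. The sketch below shows how one might measure it against a streaming TTS interface; `stream_tts` is a dummy stand-in, not the VibeVoice-Realtime API:

```python
import time

def stream_tts(text):
    """Fake streaming synthesizer: yields audio chunks as they are 'ready'."""
    for word in text.split():
        time.sleep(0.01)       # simulated per-chunk synthesis work
        yield b"\x00" * 320    # fake 20 ms chunk of 16-bit PCM

start = time.perf_counter()
chunks = stream_tts("hello streaming world")
first = next(chunks)           # block only until the first audible chunk
latency_ms = (time.perf_counter() - start) * 1000
print(f"first-chunk latency: {latency_ms:.0f} ms")
```

The key design point is that playback can begin as soon as the first chunk arrives, rather than after the full utterance is synthesized.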
Integration with Transformers
Starting from Transformers release v5.3.0, VibeVoice-ASR is directly available through the Hugging Face Transformers library, enabling seamless integration into existing machine learning pipelines. Developers can now use Microsoft’s speech recognition model with just a few lines of code.
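Under the Transformers pipeline API, usage might look like the sketch below. The model id `microsoft/VibeVoice-ASR` is an assumption for illustration; consult the model card on Hugging Face for the exact identifier and loading options:

```python
def transcribe(audio_path: str):
    """Run speech recognition through the Transformers pipeline API.

    The model id below is assumed for illustration; the pipeline task
    name "automatic-speech-recognition" is standard in Transformers.
    """
    from transformers import pipeline  # requires a recent Transformers release
    asr = pipeline("automatic-speech-recognition", model="microsoft/VibeVoice-ASR")
    return asr(audio_path)

# Usage (downloads model weights on first call):
# result = transcribe("meeting.wav")
# print(result["text"])
```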
Open Source Commitment
Microsoft has released the fine-tuning code for VibeVoice-ASR, allowing researchers and developers to adapt the model to their specific needs. The company has also added vLLM support for faster, more efficient inference.
As of March 2026, VibeVoice-ASR has been downloaded over 100,000 times from Hugging Face, demonstrating strong community interest in open-source voice AI solutions.
Technical Innovation
The core innovation in VibeVoice lies in its use of continuous speech tokenizers operating at 7.5 Hz. This ultra-low frame rate stands apart from conventional speech tokenizers, which typically operate at tens of frames per second and therefore consume far more memory and compute when processing long sequences.
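The savings can be made concrete with a rough comparison. Assuming a conventional tokenizer running at 50 Hz (a common rate for discrete speech tokenizers, though actual baselines vary), the sequence-length reduction for one hour of audio is:

```python
# Sequence lengths for one hour of audio at two frame rates.
seconds = 60 * 60
low_rate_frames = 7.5 * seconds      # VibeVoice tokenizer
typical_frames = 50 * seconds        # assumed conventional tokenizer
ratio = typical_frames / low_rate_frames
print(f"{ratio:.1f}x fewer frames at 7.5 Hz")
```

Since attention cost grows with sequence length, a shorter sequence translates directly into lower memory use and faster processing of hour-long inputs.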
The next-token diffusion approach combines the reasoning capabilities of large language models with the generation quality of diffusion models, resulting in natural-sounding speech that captures conversational dynamics and emotional nuances.
VibeVoice represents Microsoft’s commitment to advancing the state of open-source voice AI, providing researchers and developers with powerful tools to build applications ranging from transcription services to interactive voice assistants.