Microsoft has released VibeVoice, an ambitious open-source voice AI framework that brings together speech recognition (ASR) and text-to-speech (TTS) capabilities in a single, unified research platform — and it’s already turning heads in the AI community.
What Is VibeVoice?
VibeVoice is a family of open-source frontier voice AI models developed by Microsoft. It currently includes three distinct models:
- VibeVoice-ASR-7B: A 7-billion parameter automatic speech recognition model capable of processing up to 60 minutes of continuous audio in a single pass.
- VibeVoice-TTS-1.5B: A 1.5-billion parameter text-to-speech model that can synthesize speech up to 90 minutes long with up to 4 distinct speakers.
- VibeVoice-Realtime-0.5B: A lightweight real-time TTS model with streaming text input support, now with experimental multilingual voices in 9 languages and 11 distinct English style voices.
60-Minute Single-Pass Transcription
Traditional ASR systems typically chop audio into short chunks — often 30 seconds or less — which causes them to lose critical global context: speaker identity drifts, mid-sentence topics shift, and semantic continuity breaks down. VibeVoice-ASR solves this by accepting up to 60 minutes of continuous audio within a 64K token context window, maintaining consistent speaker tracking and semantic coherence throughout an entire hour of content.
The model jointly performs speech recognition, speaker diarization, and timestamping, outputting a structured transcript that clearly indicates who said what and when — a capability that normally requires chaining together multiple specialized tools.
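The post doesn't show the exact output schema, but the "who said what and when" structure is easy to picture. Below is a toy illustration with hypothetical field names (`speaker`, `start`, `end`, `text` are assumptions, not VibeVoice's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One diarized segment: a hypothetical shape, not VibeVoice's real schema."""
    speaker: str   # e.g. "Speaker 1"
    start: float   # seconds from the start of the audio
    end: float
    text: str

def render(segments: list[Segment]) -> str:
    """Render segments as a readable 'who said what and when' transcript."""
    lines = []
    for s in segments:
        lines.append(f"[{s.start:07.2f}-{s.end:07.2f}] {s.speaker}: {s.text}")
    return "\n".join(lines)

demo = [
    Segment("Speaker 1", 0.0, 4.2, "Welcome to the quarterly earnings call."),
    Segment("Speaker 2", 4.2, 9.8, "Thanks. Let's start with revenue."),
]
print(render(demo))
```

A single model emitting this structure directly is what replaces the usual chain of ASR, diarization, and alignment tools.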
Multilingual Support Across 50+ Languages
VibeVoice-ASR is natively multilingual, with support for over 50 languages; Microsoft has published the full language distribution on GitHub. The model also supports customized hotwords: users can inject domain-specific terminology, proper names, or technical jargon to significantly boost recognition accuracy on specialized content such as financial calls, medical dictation, or legal proceedings.
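The post doesn't document the hotword-injection interface, so the sketch below uses a hypothetical `hotwords` keyword argument purely for illustration; check the model card for the real API:

```python
def transcribe_with_hotwords(audio_path: str, hotwords: list[str]):
    """Transcribe with domain terms injected.

    NOTE: `hotwords` is a hypothetical keyword argument used for
    illustration only; consult the microsoft/VibeVoice-ASR model card
    for the actual hotword-injection interface.
    """
    from transformers import pipeline  # lazy import: needs a VibeVoice-aware release

    asr = pipeline("automatic-speech-recognition", model="microsoft/VibeVoice-ASR")
    # Domain-specific terms (tickers, drug names, case citations, ...)
    return asr(audio_path, hotwords=hotwords)

# Example hotword list for an earnings-call transcript:
finance_terms = ["EBITDA", "basis points", "QoQ", "free cash flow"]
```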
Technical Architecture: Continuous Tokenizers at 7.5 Hz
A core innovation in VibeVoice is its use of continuous speech tokenizers — both Acoustic and Semantic — operating at an ultra-low frame rate of just 7.5 Hz. This dramatically reduces the sequence length of audio while preserving fidelity, making long-form processing computationally tractable.
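The arithmetic behind that claim is worth spelling out. At 7.5 tokens per second, an hour of audio fits comfortably inside a 64K context window; the 50 Hz figure below for a conventional neural audio codec is an assumption used only for comparison:

```python
# Tokens needed to represent audio at a given tokenizer frame rate.
def tokens_for(minutes: float, frame_rate_hz: float) -> int:
    return int(minutes * 60 * frame_rate_hz)

vibevoice = tokens_for(60, 7.5)    # VibeVoice's 7.5 Hz tokenizer
typical   = tokens_for(60, 50.0)   # a typical ~50 Hz neural codec (assumption)

print(vibevoice)  # 27000 tokens -> well under a 64K (65536) context window
print(typical)    # 180000 tokens -> nearly 3x a 64K window
```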
VibeVoice employs a next-token diffusion framework that leverages a Large Language Model (LLM) to understand textual context and dialogue flow, combined with a diffusion head to generate high-fidelity acoustic details. This hybrid approach is what allows the system to maintain both semantic accuracy and natural prosody across long passages.
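The hybrid loop can be sketched conceptually: at each step an LLM turns the running context into a conditioning signal, and a diffusion head denoises a latent toward the next acoustic frame. Everything below is a toy stand-in (the real components, dimensions, and update rules are not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def llm_step(context: np.ndarray) -> np.ndarray:
    """Stand-in for the LLM: map running context to a conditioning vector."""
    return np.tanh(context.mean(axis=0))

def diffusion_head(cond: np.ndarray, steps: int = 8) -> np.ndarray:
    """Stand-in diffusion head: iteratively denoise a noise latent
    toward an acoustic frame, guided by the conditioning vector."""
    x = rng.normal(size=cond.shape)       # start from pure noise
    for t in range(steps, 0, -1):
        x = x + (cond - x) / t            # crude denoising step toward cond
    return x

# Generate a short sequence of acoustic frames autoregressively.
dim, n_frames = 16, 4
context = rng.normal(size=(1, dim))
frames = []
for _ in range(n_frames):
    cond = llm_step(context)              # dialogue context -> conditioning
    frame = diffusion_head(cond)          # acoustic detail via denoising
    frames.append(frame)
    context = np.vstack([context, frame[None, :]])  # feed back as next token
```

The division of labor is the point: the LLM handles long-range semantics and turn-taking, while the diffusion head handles fine acoustic detail per frame.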
Hugging Face Transformers Integration
On March 6, 2026, Microsoft announced that VibeVoice-ASR was merged into the main Hugging Face Transformers release (v5.3.0). Any developer can now load and use VibeVoice-ASR directly through the familiar Transformers pipeline in just a few lines of code, with no additional setup for existing Hugging Face users.
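Those few lines might look like the sketch below, which assumes the model loads through the standard automatic-speech-recognition pipeline; the exact output fields may differ from plain ASR models:

```python
def transcribe(audio_path: str):
    """Load VibeVoice-ASR through the standard Transformers ASR pipeline
    and transcribe one long-form audio file.

    A minimal sketch: assumes a Transformers release (v5.3.0+) that
    ships VibeVoice-ASR support.
    """
    from transformers import pipeline  # lazy import: the model download is large

    asr = pipeline(
        "automatic-speech-recognition",
        model="microsoft/VibeVoice-ASR",
    )
    return asr(audio_path)

# Usage (requires the model weights and an audio file):
# result = transcribe("earnings_call.wav")
# print(result["text"])
```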
The team has also released vLLM inference support for even faster server-side transcription, and fine-tuning code is available for those who want to specialize the model on their own domains.
The TTS Story: From Open to Closed
Microsoft originally open-sourced VibeVoice-TTS in August 2025, showcasing a model capable of synthesizing 90-minute multi-speaker speeches. However, in September 2025, the team removed the TTS code from the public repository after discovering it was being used in ways inconsistent with the project’s stated intent — a cautionary tale about the dual-use nature of voice synthesis technology.
The Realtime-0.5B model remains available, including experimental voices across nine languages.
How to Try VibeVoice
VibeVoice-ASR is available on Hugging Face at microsoft/VibeVoice-ASR, with an interactive playground at aka.ms/vibevoice-asr. For developers wanting to integrate it into their apps, the Transformers pipeline makes it as simple as installing the latest version of the library and loading the model by name.
Why It Matters
Voice AI has traditionally been dominated by closed, API-only services from Google, AssemblyAI, and others. Microsoft's decision to open-source a genuinely competitive speech recognition model, with a 7B-parameter architecture that rivals proprietary systems, is a significant moment for developers who need on-premise transcription, custom fine-tuning, or cost-effective processing at scale. With the Hugging Face integration, it is now within reach of essentially every machine learning practitioner.
Featured image: Abstract audio waveform visualization representing voice AI technology.