The enterprise voice AI market is heating up with a surprising new entrant. Mistral AI has released Voxtral TTS, what it calls the first frontier-quality, open-weight text-to-speech model designed specifically for enterprise use. And here is the kicker: they are giving away the full model weights for free.
Every major competitor in the voice AI space operates a proprietary, API-first business: enterprises rent the voice, they do not own it. Mistral is taking a fundamentally different approach. Companies can download Voxtral TTS, run it on their own servers or even on a smartphone, and never send a single audio frame to a third party.
The voice AI market is enormous. It crossed 22 billion dollars globally in 2026, with the voice AI agents segment alone projected to reach 47.5 billion dollars by 2034. Mistral’s bet is that the future of enterprise voice AI will be shaped by whoever gives companies the most control over their voice infrastructure.
Technical Specifications
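Voxtral TTS is billed as a 3-billion-parameter model that can fit on a laptop and runs six times faster than real-time speech. The architecture comprises three components: a 3.4-billion-parameter transformer decoder backbone, a 390-million-parameter flow-matching acoustic transformer, and a 300-million-parameter neural audio codec developed in-house. Taken together, the listed components come to roughly 4.1 billion parameters, so the headline 3-billion figure appears to be a rounded size class rather than an exact count.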
When quantized for inference, it requires roughly 3 GB of RAM. It can run on a typical laptop or smartphone, and even on older hardware it still operates in real time, with a time-to-first-audio of just 90 milliseconds.
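To make these figures concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above (the three component sizes and the six-times-real-time speed). The function names are hypothetical, and the bit widths are illustrative assumptions, since the article does not say which quantization scheme produces the roughly 3 GB figure. The memory estimate covers weights only and ignores activations, caches, and runtime overhead.

```python
# Back-of-the-envelope arithmetic for the figures quoted in the article.
# Function names and the chosen bit widths are illustrative assumptions,
# not part of any Mistral API or published specification.

# Component sizes as listed in the article (number of parameters).
COMPONENTS = {
    "transformer_decoder_backbone": 3_400_000_000,
    "flow_matching_acoustic_transformer": 390_000_000,
    "neural_audio_codec": 300_000_000,
}


def total_parameters(components):
    """Sum the parameter counts of all listed components."""
    return sum(components.values())


def weight_memory_gb(num_params, bits_per_param):
    """Weights-only footprint in decimal gigabytes.

    Ignores activations, caches, and runtime overhead, so real usage is higher.
    """
    return num_params * bits_per_param / 8 / 1e9


def synthesis_seconds(audio_seconds, speedup=6.0):
    """Compute time needed to generate audio at `speedup` times real time."""
    return audio_seconds / speedup


if __name__ == "__main__":
    total = total_parameters(COMPONENTS)
    print(f"Total listed parameters: {total / 1e9:.2f} billion")
    for bits in (16, 8, 4):  # assumed bit widths, for comparison only
        print(f"{bits}-bit weights: {weight_memory_gb(total, bits):.2f} GB")
    print(f"One minute of speech: {synthesis_seconds(60):.1f} s of compute")
```

Under these assumptions, the quoted figure of roughly 3 GB falls between the 4-bit and 8-bit weights-only estimates, and one minute of speech would take about ten seconds to generate at six times real time.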
Multilingual Capabilities
The model supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. Perhaps most remarkably, it demonstrates zero-shot cross-lingual voice adaptation without explicit training for that task.
In practical terms, you can feed the model 10 seconds of a French-accented voice, type a prompt in German, and the model will generate German speech that sounds like the original speaker, complete with their natural accent and vocal characteristics.
Beating ElevenLabs
Mistral is not being shy about which competitor it intends to displace. In human evaluations conducted by the company, Voxtral TTS achieved a 62.8 percent listener preference rate against ElevenLabs Flash v2.5 on flagship voices and a 69.9 percent preference rate in voice customization tasks.
ElevenLabs remains widely regarded as the benchmark for raw voice quality. However, ElevenLabs operates as a closed platform with tiered subscription pricing that scales from around 5 dollars per month at the starter level to over 1,300 dollars per month for business plans, and it does not release model weights.
Mistral’s pitch is that enterprises should not have to choose between quality and control. At scale, the economics of an open-weight model are dramatically more favorable.
Voxtral TTS completes Mistral’s enterprise AI stack. The company has been aggressively assembling building blocks including its Forge customization platform, AI Studio production infrastructure, and the Voxtral Transcribe speech-to-text model. Voxtral TTS gives enterprises a complete speech-to-speech pipeline they can run end-to-end without relying on any external provider.