VoxCPM2: Tokenizer-Free TTS That Speaks 30 Languages With Studio Quality

Most text-to-speech systems work by converting text into discrete tokens, then generating audio from those tokens. It’s a reasonable approach, but it introduces friction — tokens don’t capture every nuance of human speech, and the tokenization process itself can introduce artifacts. VoxCPM2, released this week by the OpenBMB lab at Tsinghua University, takes a fundamentally different approach: it does away with discrete tokenization entirely, generating continuous speech representations end-to-end via a diffusion autoregressive architecture.

Tokenizer-Free: Why It Matters

The key innovation in VoxCPM2 is its tokenizer-free architecture. By bypassing discrete tokenization, the model preserves the full continuous nature of speech — the subtle rhythms, the micro-expressions in voice, the things that make one speaker’s voice distinct from another. The result is synthesis that sounds more natural and more expressive than tokenized alternatives, especially for emotional and stylistic variation.

The architecture is built on a MiniCPM-4 backbone and uses a diffusion process for high-quality generation. The model was trained on over 2 million hours of multilingual speech data — a scale that gives it the coverage to handle 30 languages with genuine fluency.

30 Languages, 48kHz Studio Quality

VoxCPM2 supports 30 languages including Arabic, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, and Vietnamese. It also handles nine Chinese dialects: Sichuanese, Cantonese, Wu, Northeastern, Henan, Shaanxi, Shandong, Tianjin, and Southern Min.

Audio output is 48kHz studio quality — a significant step up from the 16kHz that many open TTS models output. This is achieved through AudioVAE V2’s asymmetric encode/decode design with built-in super-resolution.
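The sample-rate claim is easy to picture on disk. As a quick illustration (not VoxCPM2 code), here is a standard-library sketch that writes one second of mono 16-bit audio at 48 kHz; the sine tone and the helper name `save_wav_48k` are stand-ins of our own, not part of the model's API:

```python
import math
import struct
import wave

def save_wav_48k(samples, path, rate=48000):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)        # 16-bit samples
        f.setframerate(rate)     # 48,000 frames per second
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        f.writeframes(frames)

# One second of a 440 Hz sine tone as a stand-in for model output.
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 48000) for t in range(48000)]
save_wav_48k(tone, "tone_48k.wav")
```

At 48 kHz, one second of audio is 48,000 frames — three times the temporal resolution of a 16kHz model's output.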

Voice Design: Create Voices From Natural Language

One of VoxCPM2’s most impressive features is Voice Design — the ability to create an entirely new synthetic voice from a natural-language description. You don’t need a reference audio file. You just describe what you want (gender, age, tone, emotion, pace), and the model synthesizes a voice that matches. This opens up creative use cases that would previously have required hiring voice actors.
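As a rough sketch of how a description-driven voice might be requested — note that the `voxcpm` package name, the model ID, and especially the `voice_description` keyword are our assumptions for illustration, not a confirmed VoxCPM2 API:

```python
# Hypothetical sketch: the package name, model ID, and `voice_description`
# keyword below are assumed for illustration; consult the official VoxCPM2
# repository for the real interface.
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM2")  # model ID assumed
wav = model.generate(
    text="Thanks for calling. How can I help you today?",
    voice_description="a warm, middle-aged female voice, calm and unhurried",
)
```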

Controllable Voice Cloning

For applications that need to clone a specific voice, VoxCPM2 offers two modes. Controllable Cloning lets you upload a reference clip and then use natural language instructions to steer the style — faster delivery, more cheerful tone, etc. — while preserving the original speaker’s timbre. Ultimate Cloning goes further: provide both the reference audio and its exact transcript, and the model reproduces every vocal nuance — timbre, rhythm, emotion, and style.
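The two modes might look something like the following in code — the package name and keyword arguments here are assumptions patterned after OpenBMB's earlier VoxCPM release, not a verified VoxCPM2 interface:

```python
# Hypothetical sketch: `voxcpm`, the model ID, and the generate() keywords
# are assumed, modeled on the first VoxCPM release; check the official
# repository for the actual VoxCPM2 API.
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM2")  # model ID assumed

# Controllable Cloning: reference clip plus a natural-language style instruction.
styled = model.generate(
    text="Welcome back! Let's pick up where we left off.",
    prompt_wav_path="reference_speaker.wav",
    instruction="cheerful tone, slightly faster delivery",  # keyword assumed
)

# Ultimate Cloning: reference clip plus its exact transcript for full fidelity.
faithful = model.generate(
    text="Welcome back! Let's pick up where we left off.",
    prompt_wav_path="reference_speaker.wav",
    prompt_text="The exact words spoken in the reference clip.",
)
```

The practical distinction: Controllable Cloning trades some fidelity for steerability, while Ultimate Cloning uses the transcript to lock onto the reference recording as closely as possible.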

Real-Time Performance on Consumer Hardware

Perhaps most impressively, VoxCPM2 achieves real-time factor (RTF) as low as ~0.3 on a single NVIDIA RTX 4090. With the dedicated Nano-vLLM inference engine, that drops to ~0.13 — meaning you can synthesize faster than real-time on consumer hardware.
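Real-time factor is simply the ratio of wall-clock synthesis time to the duration of the audio produced, so values below 1.0 mean faster-than-real-time generation. A quick sanity check of the numbers above:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = synthesis time divided by duration of audio produced.
    Below 1.0, the model generates audio faster than it plays back."""
    return synthesis_seconds / audio_seconds

# At RTF ~0.3, a 10-second clip takes about 3 seconds to synthesize...
print(real_time_factor(3.0, 10.0))   # → 0.3
# ...and at ~0.13 with Nano-vLLM, roughly 1.3 seconds.
print(real_time_factor(1.3, 10.0))
```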

Fully Open Source and Commercial-Ready

VoxCPM2 is released under the Apache-2.0 license — weights and code are both available, free for commercial use. Demo pages are live on Hugging Face and ModelScope. The Python API is clean and straightforward — five lines of code from import to saved audio file.
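The "five lines" claim maps to something like the sketch below; the `voxcpm` package name, model ID, and call signature are assumptions based on the first VoxCPM release rather than a confirmed VoxCPM2 API:

```python
# Hypothetical five-line usage sketch; package name, model ID, and sample
# rate are assumed, modeled on the earlier VoxCPM release.
import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM2")            # model ID assumed
wav = model.generate(text="VoxCPM2 speaks thirty languages.")
sf.write("hello.wav", wav, 48000)                            # 48 kHz output
```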

VoxCPM2 represents a significant step forward for open-source speech synthesis. If you’ve been looking for a reason to explore what’s possible with modern open-source TTS, VoxCPM2 might be it.