Text-to-speech technology has come a long way. But most TTS systems, even commercial ones, still rely on a process called tokenization: converting text into discrete tokens that the model processes before generating audio. This introduces friction, loses nuance, and often produces speech that sounds technically correct but emotionally flat. VoxCPM2, the latest release from the OpenBMB research team, takes a fundamentally different approach: it does away with tokenization entirely, generating continuous speech representations directly from text through a diffusion autoregressive architecture.
The result is 48kHz studio-quality speech synthesis across 30 languages, with features like Voice Design (creating entirely new voices from text descriptions alone), Controllable Voice Cloning (taking a short reference clip and steering its emotion, pace, and expression), and Ultimate Cloning (reproducing every vocal nuance with near-perfect fidelity). All of it is open-source and commercially usable under Apache-2.0.
Why “Tokenizer-Free” Matters
Traditional TTS systems work by first converting text into a sequence of discrete tokens, a kind of compressed text representation. The model then generates audio from these tokens. The problem is that this conversion process inevitably loses information. Prosody, emotional coloring, subtle rhythm variations: these are often casualties of tokenization.
VoxCPM2 sidesteps this entirely. Built on a MiniCPM-4 backbone, it uses a diffusion autoregressive architecture that operates entirely in the latent space of AudioVAE V2. The four-stage pipeline (LocEnc → TSLM → RALM → LocDiT) enables the model to generate highly natural and expressive speech without the information bottleneck of discrete tokenization.
30 Languages, One Model
VoxCPM2 supports 30 languages out of the box:
Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, and Vietnamese.
It also supports 9 Chinese dialects: Sichuanese, Cantonese, Wu, Northeastern, Henan, Shaanxi, Shandong, Tianjin, and Minnan.
No language tag is needed; the model automatically infers the language from the input text.
Voice Design: Create a Voice from a Description
Perhaps the most striking feature is Voice Design: the ability to create an entirely new synthetic voice using nothing but a natural-language description. Want a “young woman with a gentle, sweet voice”? A “deep, authoritative male news anchor”? A “cheerful teenager with a slight Southern accent”? You describe it, and VoxCPM2 generates it. No reference audio needed, no voice actor required.
This is done by including a description in parentheses at the start of your input text:
(A young woman, gentle and sweet voice) Hello, welcome to our platform!
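The prompt format is simple enough to build programmatically. The helper below is a hypothetical convenience function (not part of the VoxCPM API) that just assembles the parenthesized description and the text to speak:

```python
def voice_design_prompt(description: str, text: str) -> str:
    """Build a Voice Design input: a parenthesized natural-language
    voice description, followed by the text to be spoken.
    Helper name is illustrative, not part of the VoxCPM API."""
    return f"({description}) {text}"

prompt = voice_design_prompt(
    "A young woman, gentle and sweet voice",
    "Hello, welcome to our platform!",
)
print(prompt)
# (A young woman, gentle and sweet voice) Hello, welcome to our platform!
```

The resulting string is passed as the `text` argument to `model.generate()`, exactly as in the Getting Started example below.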
Controllable Voice Cloning
The Controllable Cloning feature lets you take a short reference audio clip and clone the voice, then apply style guidance, controlling the emotion, pace, and expression while preserving the original speaker’s timbre.
For the highest-fidelity cloning, Ultimate Cloning takes both the reference audio and its exact transcript, reproducing every vocal detail: timbre, rhythm, emotion, and speaking style.
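A cloning call might look like the sketch below. Note the hedge: the parameter names `prompt_wav_path` and `prompt_text` follow the original VoxCPM's `generate()` API, and VoxCPM2's interface may differ, so treat this as an assumption and check the official docs. The function wrapper and its name are our own illustration:

```python
def clone_voice(text, ref_wav, ref_transcript=None, out_path="cloned.wav"):
    """Hypothetical wrapper around VoxCPM cloning.

    ASSUMPTION: `prompt_wav_path` / `prompt_text` follow VoxCPM v1's
    generate() signature; verify against the VoxCPM2 documentation.
    - ref_wav only            -> controllable cloning (timbre from clip,
                                  style steered by a parenthesized description)
    - ref_wav + ref_transcript -> Ultimate Cloning (exact transcript given,
                                  every vocal detail reproduced)
    """
    from voxcpm import VoxCPM
    import soundfile as sf

    model = VoxCPM.from_pretrained("openbmb/VoxCPM2")
    wav = model.generate(
        text=text,
        prompt_wav_path=ref_wav,      # short clip of the target speaker
        prompt_text=ref_transcript,   # exact transcript of the clip, if known
        cfg_value=2.0,
        inference_timesteps=10,
    )
    sf.write(out_path, wav, model.tts_model.sample_rate)
    return out_path
```

For controllable cloning, you would pass a style description in the `text` itself, e.g. `clone_voice("(Excited, fast-paced) This is amazing!", "speaker.wav")`.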
48kHz Studio Quality
Most TTS systems output 16kHz or 24kHz audio. VoxCPM2 outputs 48kHz studio-quality audio natively, with no external upsampler required.
Real-Time Performance
VoxCPM2 achieves an RTF (Real-Time Factor) as low as ~0.3 on an NVIDIA RTX 4090. With Nano-VLLM acceleration, RTF drops to ~0.13, fast enough for real-time streaming even with longer texts.
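RTF is simply synthesis time divided by the duration of the audio produced, so values below 1.0 mean faster than real time. A quick sketch of what those numbers imply in practice:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of audio produced.
    Below 1.0 means the model generates speech faster than it plays."""
    return synthesis_seconds / audio_seconds

# At RTF 0.3, a 10-second clip takes ~3 s to synthesize.
print(real_time_factor(3.0, 10.0))   # 0.3

# At RTF 0.13 (with Nano-VLLM), the same clip takes ~1.3 s.
print(real_time_factor(1.3, 10.0))
```

Put differently, RTF 0.13 means roughly 7–8 seconds of audio generated per second of compute, which is why streaming use cases become practical.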
Benchmarks
On the Seed-TTS-eval benchmark, VoxCPM2 achieves:
- English WER: 1.84% with SIM: 75.3%
- Chinese WER: 0.97% with SIM: 79.5%
These results are competitive with leading systems such as MegaTTS3, DiTAR, CosyVoice3, Seed-TTS, and MiniMax-Speech.
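For readers less familiar with the metrics: WER (word error rate) is the word-level edit distance between the recognized transcript and the reference, divided by the reference length, while SIM measures speaker similarity. A minimal sketch of the standard WER computation (our own illustration, not the Seed-TTS-eval harness):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance (substitutions,
    insertions, deletions) divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six: WER = 1/6 ~ 16.7%
print(wer("hello world this is a test", "hello world this is test"))
```

Against that yardstick, an English WER of 1.84% means the synthesized speech is transcribed almost verbatim by the evaluation ASR model.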
Getting Started
VoxCPM2 can be installed with a single pip command:
pip install voxcpm
Usage is straightforward:
from voxcpm import VoxCPM
import soundfile as sf

model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)

wav = model.generate(
    text="(A young woman, gentle and sweet voice) Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)

sf.write("demo.wav", wav, model.tts_model.sample_rate)
The project is available on Hugging Face and ModelScope, and the full documentation is at voxcpm.readthedocs.io.
For developers building multilingual applications, content creators who need voice variety without licensing headaches, and businesses that need high-quality speech synthesis at reasonable cost, VoxCPM2 is worth serious attention.