Mistral AI is expanding its Voxtral model family with its first text-to-speech model.
The launch comes amid intensifying competition in the fast-growing AI voice market, with Voxtral TTS pitched as an alternative to models from competitors including OpenAI and ElevenLabs.
The Paris-based startup unveiled the system on Thursday. The 4-billion-parameter model is designed for enterprise deployment across voice assistants, customer support and sales engagement tools.
Unlike many rival offerings, Voxtral TTS has been released with open weights, allowing organizations to run the model on their own infrastructure rather than relying on third-party APIs.
The model supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi and Arabic.
Mistral said the model is lightweight enough to operate on consumer hardware, including laptops, smartphones and edge devices, while maintaining what it describes as “frontier-quality” performance. The company positions this as a key differentiator for enterprises seeking greater control over data, cost and customization.
Another key feature, Mistral said, is voice adaptability. The model can replicate a speaker’s voice using just a few seconds of reference audio, capturing not only tone but also accent, intonation and emotion.
“Our model excels at both contextual understanding and speaker modeling: capturing how a specific person naturally speaks,” Mistral wrote in a blog post. “With its compact size, low cost and latency and easy adaptability, Voxtral TTS gives full control and customization for enterprises looking to own their voice AI stack.”
Voxtral TTS can also carry a voice across languages, for example generating English speech with a French accent, based on a short reference prompt.
In human evaluations, Mistral said Voxtral TTS matched or outperformed competing systems on naturalness, exceeding ElevenLabs' lower-latency models while reaching parity with its more advanced offerings on lifelike interaction.
The launch builds on Mistral’s earlier release of speech-to-text models and signals a broader push toward multimodal AI systems.