Mistral’s new open-source text-to-speech model, built on a 3-billion-parameter architecture, runs directly on smartphones and smartwatches — moving voice cloning away from expensive cloud APIs and into users’ hands for the first time at this performance level.
PARIS - French AI company Mistral released Voxtral TTS on Thursday, March 26, 2026. The new open-source text-to-speech model is built to run locally on edge devices including smartwatches, smartphones, and laptops, according to the company, and it completes a voice AI pipeline that now covers both speech transcription and audio generation.
The launch puts Mistral in direct competition with ElevenLabs, Deepgram, and OpenAI in the fast-growing voice AI market — but with one distinct structural difference. Where rivals depend on cloud infrastructure for high-fidelity voice cloning, Voxtral TTS processes audio entirely on-device.
That difference matters more than it may appear. Cloud-dependent voice models transmit audio to remote servers for processing, which adds latency and raises privacy concerns for enterprise customers in healthcare, legal services, and finance. Voxtral TTS removes that transmission step entirely.
“Our customers have been asking for a speech model,” Pierre Stock, Vice President of Science Operations at Mistral, told TechCrunch. “The cost of it is a fraction of anything else on the market, but it offers state-of-the-art performance.”
Built on Ministral 3B, the company’s small-parameter edge model, Voxtral TTS achieves a time-to-first-audio of 90 milliseconds for a 10-second, 500-character sample and a real-time factor of 6x, meaning it generates a 10-second clip in roughly 1.7 seconds, according to Mistral’s official release.
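The relationship between those figures can be checked with a few lines of arithmetic. The sketch below assumes "6x" means the model produces audio six times faster than it plays back (some sources define real-time factor the other way around, as processing time divided by audio length), and it assumes output can be streamed so that playback starts at the time-to-first-audio mark. The function name is illustrative and not part of any Mistral API.

```python
def generation_seconds(audio_seconds: float, real_time_factor: float) -> float:
    """Time needed to synthesize a clip when the model runs
    `real_time_factor` times faster than playback speed."""
    if real_time_factor <= 0:
        raise ValueError("real_time_factor must be positive")
    return audio_seconds / real_time_factor


clip_seconds = 10.0  # clip length quoted in the article
speedup = 6.0        # "6x" real-time factor, read as a speed-up (assumption)
ttfa_seconds = 0.090 # time-to-first-audio quoted in the article

total = generation_seconds(clip_seconds, speedup)
print(f"Full clip synthesized in about {total:.2f} s")
print(f"First audio available after about {ttfa_seconds * 1000:.0f} ms")
```

Under those assumptions the full 10-second clip takes about 1.67 seconds to synthesize, in line with the rough figure quoted above, while a listener would hear the first audio long before synthesis finishes.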
Voice cloning from under 5 seconds of audio
The model clones a custom voice from less than 5 seconds of source audio, capturing accent, inflection, and speech irregularities. Critically, it maintains voice consistency when switching between the nine languages it supports: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.
That cross-language voice retention is aimed squarely at dubbing and real-time translation use cases. Existing proprietary tools typically lose voice characteristics when switching languages — requiring separate voice profile generation per language, which increases cost and processing time.
Voxtral TTS is released under the Apache 2.0 license, meaning developers can download model weights, modify the architecture, and deploy commercially without licensing fees. That mirrors the strategy Mistral has used to challenge closed AI systems from OpenAI and Google in the large language model space.
A complete pipeline, not just another model
The release is not standalone. In February 2026, Mistral launched Voxtral Transcribe 2, a pair of speech-to-text models — one for batch processing and one for real-time transcription across 13 languages — priced at $0.003 per minute and $0.006 per minute respectively, according to the company. Voxtral TTS now completes the other half: audio output.
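To put the per-minute prices in perspective, the following sketch estimates the bill for a hypothetical workload of 1,000 hours of audio. It assumes flat per-minute pricing with no volume discounts, free tier, or minimum charges, none of which are described in the source, and the function name is illustrative only.

```python
from decimal import Decimal

# Per-minute prices in USD, as reported in the article.
BATCH_PER_MIN = Decimal("0.003")
REALTIME_PER_MIN = Decimal("0.006")


def transcription_cost(hours: int, price_per_minute: Decimal) -> Decimal:
    """Cost in USD of transcribing `hours` of audio at a flat per-minute price."""
    return Decimal(hours * 60) * price_per_minute


hours = 1000  # hypothetical workload, not a figure from the article
print(f"Batch:     ${transcription_cost(hours, BATCH_PER_MIN):.2f}")
print(f"Real-time: ${transcription_cost(hours, REALTIME_PER_MIN):.2f}")
```

At those rates, 1,000 hours of audio would come to $180 for batch transcription and $360 for real-time transcription.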
Together, the two product lines form an end-to-end on-device voice pipeline. A developer building a voice assistant, a call center AI, or a real-time translation tool can now handle both input and output using Mistral’s open-weight models — without routing sensitive audio through a third-party cloud.
Stock said the company’s longer-term goal is a multimodal platform spanning audio, text, and image, both as input and output, which he described as enabling “way more information” for agentic AI systems.
What remains unanswered publicly is the exact memory and compute threshold required to run Voxtral TTS smoothly on lower-end Android devices and older smartphone hardware. The model is benchmarked on premium edge devices, but its real-world performance across the broad Android ecosystem, where many users in markets like India operate on devices with 3 GB of RAM or less, has not been independently verified. That is the gap the developer community is now actively testing.
The text-to-speech market is projected to reach $26 billion by 2028, and ElevenLabs is reportedly approaching a $3 billion valuation on a cloud-first model. Mistral is betting the next competitive shift will be won at the hardware level — not in the data center.

