Mistral Launches Voxtral TTS: Shifts Voice Cloning From Cloud To Smartphones

Mistral’s new open-source text-to-speech model, built on a 3-billion-parameter architecture, runs directly on smartphones and smartwatches — moving voice cloning away from expensive cloud APIs and into users’ hands for the first time at this performance level.

PARIS — French AI company Mistral released Voxtral TTS on Thursday, March 26, 2026, a new open-source text-to-speech model built to run locally on edge devices including smartwatches, smartphones, and laptops, according to the company, completing a full voice AI pipeline that now covers both speech transcription and audio generation.

The launch puts Mistral in direct competition with ElevenLabs, Deepgram, and OpenAI in the fast-growing voice AI market — but with one distinct structural difference. Where rivals depend on cloud infrastructure for high-fidelity voice cloning, Voxtral TTS processes audio entirely on-device.

That difference matters more than it may appear. Cloud-dependent voice models transmit audio to remote servers for processing, which raises both latency and privacy concerns for enterprise customers in healthcare, legal services, and finance. Voxtral TTS removes that transmission step entirely.

“Our customers have been asking for a speech model,” Pierre Stock, Vice President of Science Operations at Mistral, told TechCrunch. “The cost of it is a fraction of anything else on the market, but it offers state-of-the-art performance.”

Built on Ministral 3B, the company’s small-parameter edge model, Voxtral TTS achieves a time-to-first-audio of 90 milliseconds for a 10-second, 500-character sample and a real-time factor of 6x, meaning it generates speech about six times faster than playback speed, or a 10-second clip in roughly 1.6 seconds, according to Mistral’s official release.
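The relationship between the quoted speedup and the quoted generation time is simple division, sketched below. The function names are illustrative and not part of any Mistral tooling; the only inputs are the figures reported above (a 10-second clip, a 6x factor, and roughly 1.6 seconds of generation time).

```python
def synthesis_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock time to generate a clip when synthesis runs
    `speedup` times faster than real-time playback."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def implied_speedup(audio_seconds: float, wall_seconds: float) -> float:
    """Speedup implied by generating `audio_seconds` of audio
    in `wall_seconds` of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


clip = 10.0  # seconds of audio, as in the reported benchmark
print(round(synthesis_seconds(clip, 6.0), 2))  # 1.67
print(round(implied_speedup(clip, 1.6), 2))    # 6.25
```

A 6x factor works out to about 1.67 seconds for a 10-second clip, and 1.6 seconds implies a factor of about 6.25x, so the two reported figures agree to within rounding.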

In a post on X on March 26, 2026, the account Chubby (@kimmonismus) summarized the release: “Mistral AI released Voxtral TTS, a 3-billion-parameter text-to-speech model with open weights that the company says outperformed ElevenLabs Flash v2.5 in human preference tests roughly 63% of the time on standard voices and nearly 70% on voice customization.”

Voice cloning from under 5 seconds of audio

The model clones a custom voice from fewer than 5 seconds of source audio, capturing accent, inflection, and speech irregularities. Critically, it maintains voice consistency when switching between the nine languages it supports: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.

That cross-language voice retention is aimed squarely at dubbing and real-time translation use cases. Existing proprietary tools typically lose voice characteristics when switching languages — requiring separate voice profile generation per language, which increases cost and processing time.

Voxtral TTS is released under the Apache 2.0 license, meaning developers can download model weights, modify the architecture, and deploy commercially without licensing fees. That mirrors the strategy Mistral has used to challenge closed AI systems from OpenAI and Google in the large language model space.

A complete pipeline, not just another model

The release is not standalone. In February 2026, Mistral launched Voxtral Transcribe 2, a pair of speech-to-text models — one for batch processing and one for real-time transcription across 13 languages — priced at $0.003 per minute and $0.006 per minute respectively, according to the company. Voxtral TTS now completes the other half: audio output.
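For a sense of scale, the per-minute prices quoted above can be turned into a cost for a given audio volume. This is a minimal sketch that uses only the two reported prices; the dictionary keys, the function name, and the 1,000-hour workload are illustrative assumptions, and actual billing terms may differ.

```python
from decimal import Decimal

# Reported Voxtral Transcribe 2 prices in USD per minute of audio.
PRICE_PER_MINUTE_USD = {
    "batch": Decimal("0.003"),
    "realtime": Decimal("0.006"),
}


def transcription_cost(minutes: int, mode: str) -> Decimal:
    """Cost in USD to transcribe `minutes` of audio in the given mode."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return PRICE_PER_MINUTE_USD[mode] * Decimal(minutes)


hours = 1_000  # hypothetical monthly call-center volume
minutes = hours * 60
print(transcription_cost(minutes, "batch"))     # 180.000
print(transcription_cost(minutes, "realtime"))  # 360.000
```

At these rates, 1,000 hours of audio would cost about $180 in batch mode and $360 in real-time mode.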

Together, the two product lines form an end-to-end on-device voice pipeline. A developer building a voice assistant, a call center AI, or a real-time translation tool can now handle both input and output using Mistral’s open-weight models — without routing sensitive audio through a third-party cloud.

Stock said the company’s longer-term goal is a multimodal platform spanning audio, text, and image, both as input and output, which he described as enabling “way more information” for agentic AI systems.

What Mistral has not stated publicly is the exact memory and compute threshold required to run Voxtral TTS smoothly on lower-end Android devices and older smartphone hardware. While the model is benchmarked on premium edge devices, real-world performance across the broad Android ecosystem, where many users in markets like India operate on devices with 3 GB of RAM or less, has not been independently verified. That is the gap the developer community is now actively testing.
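A rough back-of-the-envelope calculation shows why the 3 GB figure matters. The sketch below estimates only the storage needed for the weights of a 3-billion-parameter model at different numeric precisions. The precision options are assumptions chosen for illustration, not formats Mistral has announced for Voxtral TTS, and the estimate ignores activations, audio buffers, and the operating system's own memory use.

```python
PARAMS = 3_000_000_000  # parameter count reported for the model


def weight_gigabytes(num_params: int, bits_per_param: int) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


# Hypothetical precisions; which ones the model actually ships in is not stated.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_gigabytes(PARAMS, bits):.1f} GB")
# 16-bit: 6.0 GB
# 8-bit: 3.0 GB
# 4-bit: 1.5 GB
```

Under these assumptions, the weights alone would exceed the memory of a 3 GB device at 16-bit precision and would fill it at 8-bit, which is why independent testing on low-end hardware is the open question.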

The text-to-speech market is projected to reach $26 billion by 2028, and ElevenLabs is reportedly approaching a $3 billion valuation on a cloud-first model. Mistral is betting the next competitive shift will be won at the hardware level — not in the data center.
