Voicebox by Meta
Overview
Voicebox is a generative AI model for speech that generalizes, with state-of-the-art performance, to tasks it was not specifically trained for. Unlike existing speech synthesizers, it can be trained on diverse, unstructured data without requiring carefully labeled inputs.
Voicebox uses a new approach called Flow Matching, Meta's latest advance in non-autoregressive generative models, which can learn the highly non-deterministic mapping between text and speech.
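Voicebox's training code is not public, so the following is only a minimal sketch of the general Flow Matching objective, assuming the conditional optimal-transport path from Lipman et al.: a point on the path is x_t = (1 - (1 - sigma_min)·t)·x0 + t·x1, and the model regresses the target velocity x1 - (1 - sigma_min)·x0. The names `ot_path`, `target_velocity`, `fm_loss`, `euler_sample`, and the velocity predictor `model` are hypothetical. The real Voicebox also conditions the model on text and masked context audio and operates on acoustic features; this sketch omits all of that and uses plain lists.

```python
# Minimal sketch of the (conditional) Flow Matching objective, ASSUMING the
# optimal-transport probability path of Lipman et al. Vectors are plain lists.
# `model` is a hypothetical velocity predictor v(x, t); Voicebox's real model
# also conditions on text and masked context audio, which is omitted here.
from typing import Callable, List

Vector = List[float]
VelocityModel = Callable[[Vector, float], Vector]

SIGMA_MIN = 1e-5  # assumed small noise floor at t = 1


def ot_path(x0: Vector, x1: Vector, t: float, sigma_min: float = SIGMA_MIN) -> Vector:
    """Point at time t on the path from noise x0 (t=0) toward data x1 (t=1)."""
    return [(1 - (1 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector, sigma_min: float = SIGMA_MIN) -> Vector:
    """Time derivative of ot_path; constant in t for this straight-line path."""
    return [b - (1 - sigma_min) * a for a, b in zip(x0, x1)]


def fm_loss(model: VelocityModel, x0: Vector, x1: Vector, t: float,
            sigma_min: float = SIGMA_MIN) -> float:
    """Mean squared error between predicted and target velocity at time t."""
    xt = ot_path(x0, x1, t, sigma_min)
    pred = model(xt, t)
    target = target_velocity(x0, x1, sigma_min)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(target)


def euler_sample(model: VelocityModel, x0: Vector, steps: int = 10) -> Vector:
    """Generate a sample by integrating dx/dt = model(x, t) from t=0 to t=1."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = model(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    noise, data = [0.5, -1.0, 2.0], [1.0, 3.0, -2.0]
    # A perfect "oracle" for this single pair predicts the exact target velocity.
    oracle = lambda x, t: target_velocity(noise, data)
    print(fm_loss(oracle, noise, data, t=0.3))  # 0.0
    print(euler_sample(oracle, noise))          # close to `data`
```

In standard Flow Matching training, x0 is drawn from Gaussian noise, x1 is a training example, and t is sampled uniformly from [0, 1]; the straight-line path keeps the regression target simple.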
Voicebox can produce high-quality audio clips in a wide variety of styles and synthesize speech in six languages. It can also perform noise removal, content editing, style conversion, and diverse sample generation.
One of Voicebox's main advantages is that it can modify any part of a given audio sample, not just continue from the end of a clip. This makes it versatile enough for tasks such as in-context text-to-speech synthesis, cross-lingual style transfer, speech denoising and editing, and diverse speech sampling.
Additionally, Voicebox outperforms existing state-of-the-art speech models on word error rate and audio similarity metrics. While Voicebox is not currently available to the public due to potential risks of misuse, Meta has shared audio samples and a research paper detailing its approach and results.
This breakthrough in generative AI for speech is promising because of its potential applications, such as helping people communicate and customizing voices for virtual assistants.
Top alternatives
- ElevenLabs AI Voice Generator (v2 released 3 days ago; free, paid plans from $3/mo): Scribe v2 Realtime offers real-time streaming via WebSockets with partial and committed transcripts, while Scribe v1 is positioned for high-accuracy file transcription rather than live use. Latency is much lower, about 150 ms for Scribe v2 Realtime, whereas Scribe v1 is not optimized for real time; both cover 99 languages. New segmentation controls include manual commit and built-in Voice Activity Detection with tunable silence and sensitivity thresholds. The client workflow is simpler, with single-use tokens and SDKs for microphone streaming or server-side chunking. Live-ingest support is broader, including PCM at 8–48 kHz and μ-law, suitable for telephony and varied capture pipelines. Pricing, limits, and concurrency guidance are listed separately for Scribe v2 Realtime; Scribe v1 has its own concurrency rules.
- Speechma (released 10 months ago; no pricing listed). Review by Timeship (Oct 19, 2025): "Great AI voices, though still monotonous and robotic, or rather too neutral, with almost zero voice inflections! Excellent for reading news, articles, essays, and any nonfiction books. BTW, computers should be allowed to TALK to us for free, like in the Star Trek TV series. Our future grandchildren will laugh at us for 'paying' to use this everyday option, biting the hook of 'monthly subscriptions' like gullible fish and then getting up to 1,000 words per month under the so-called Pro version ;-) This is crazy!"
- Released 2 years ago; 100% free. User comment (translated from Romanian): "Superb, and free; it works great, and in English it sounds perfect."
- Released 4 months ago; free, paid plans from $9.9/mo. User comment: "At last, we have a voice AI hub that works like OpenRouter."
- Transforms any article into podcast-quality audio instantly with a single click (v1.3.6 released 8 months ago; from $4.99/mo).
- Released 3 years ago; from $19.99/mo. User comment: "I didn't get to the voices. I don't give my credit card information up front, so I clicked away as soon as I saw that. It's a shame, too; the pricing structure looked great."
