Voicebox by Meta

Generate versatile speech with state-of-the-art AI

June 16, 2023

Overview

Voicebox is a generative AI model for speech that can generalize to tasks it was not specifically trained for with state-of-the-art performance. Unlike existing speech synthesizers, it can be trained on diverse, unstructured data without requiring carefully labeled inputs.

Voicebox uses a new approach called Flow Matching, Meta's latest advancement in non-autoregressive generative models, which can learn a highly non-deterministic mapping between text and speech.

Voicebox can produce high-quality audio clips in a vast variety of styles and can synthesize speech across six languages, as well as perform noise removal, content editing, style conversion, and diverse sample generation.

One of the main advantages of Voicebox is its ability to modify any part of a given sample, not just the end of an audio clip it is given. This makes it highly versatile and suitable for tasks such as in-context text-to-speech synthesis, cross-lingual style transfer, speech denoising and editing, and diverse speech sampling.

Additionally, Voicebox outperforms existing state-of-the-art speech models on word error rate and audio similarity metrics. While Voicebox is not currently available to the public due to potential risks of misuse, Meta has shared audio samples and a research paper detailing its approach and results.

This breakthrough in generative AI for speech is exciting as it has potential applications in helping people communicate and customize voices for virtual assistants.

Releases

Initial release

June 16, 2023

Initial release of Voicebox by Meta.

Pricing

Pricing model: Free

Paid options from: Free

Top alternatives

ElevenLabs AI Voice Generator

Create lifelike AI voices for compelling storytelling.

Text to speech

115,703
643
4.0

v2 released 3d ago
Free + from $3/mo
40,779 www.elevenlabs.io

ElevenLabs AI Voice Generator — v2

Scribe v2 Realtime offers real-time streaming via WebSockets with partial and committed transcripts, while Scribe v1 is positioned for high-accuracy file transcription rather than live use.

Much lower latency: about 150 ms for Scribe v2 Realtime, versus Scribe v1, which is not optimized for real time. Both cover 99 languages.

New control over segmentation with manual commit and built-in Voice Activity Detection, including tunable silence and sensitivity thresholds.

Simpler client workflow with single-use tokens and SDKs for microphone streaming or server-side chunking.

Broader live-ingest support, including PCM at 8–48 kHz and μ-law, suitable for telephony and varied capture pipelines.

Updated pricing and limits called out separately for Scribe v2 Realtime, plus distinct concurrency guidance, whereas Scribe v1 has its own concurrency rules.

Speechma

Transform text to speech with 400+ premium AI voices

Text to speech

63,005
38
3.7

Released 10mo ago
No pricing
12,493 speechma.com

Timeship

🙏 60 karma

Oct 19, 2025

@Speechma

Great AI voices, though still monotonous and robotic, or rather too neutral, with almost zero voice inflections! Excellent to read news, articles, essays, and any nonfiction books. BTW, computers should be allowed to TALK to us for free, like in the Star Trek TV series. Our future grandchildren will laugh at us for "paying" to use this everyday option, biting the hook to "monthly subscriptions" like gullible fish and then getting "up to 1,000 words per month" under the so-called Pro version ;-) This is crazy!

Free Text-To-Speech

Text transformed into customizable spoken output.

Text to speech

54,054
1,324
4.0

Released 2y ago
100% Free

jon doe

🙏 124 karma

Jul 18, 2024

@Free Text-To-Speech

Superb, it's free, works great, and in English it sounds ideal.

VoiSpark

Create human-like voices for content with AI

Text to speech

46,163
39
2.4

Released 4mo ago
Free + from $9.9/mo
41,520 voispark.com

Hank Darren

🙏 18 karma

Jul 6, 2025

@VoiSpark

At last, we have a voice AI hub that works like OpenRouter.

Read-this.ai

Transform any articles into podcast-quality audio instantly with just a click.

Text to speech

31,425
56
2.9

v1.3.6 released 8mo ago
From $4.99/mo
24,394 read-this.ai

Andrei

🛠️ 37 tools 🙏 3,872 karma

Mar 12, 2025

@Read-this.ai

Great video!

Audioread.com

Listen to any text in your podcast app or browser.

Text to speech

29,967
930
2.4

Released 3y ago
From $19.99/mo

David Marshall

🙏 25 karma

Sep 14, 2023

@Audioread.com

I didn't get to the voices. I don't give my credit card information up front. I clicked away as soon as I saw that. It's a shame too, the pricing structure looked great.


Reviews

2.0

Average from 1 rating.

5 stars: 0

4 stars: 0

3 stars: 0

2 stars: 1

1 star: 0

Pros and Cons

Pros

Generative model

Generalizes to untrained tasks

Trains on diverse data

Doesn't require labeled inputs

Uses Flow Matching

High-quality audio clips

Operates in six languages

Performs noise removal

Performs content editing

Performs style conversion

Does diverse sample generation

Can modify any sample part

In-context text-to-speech synthesis

Performs cross-lingual style transfer

Performs speech denoising

Performs speech editing

Performs diverse speech sampling

Outperforms other models

Superior word error rate

Superior audio similarity metrics

Versatile across tasks

Significant potential applications

Style transfer capability

Audio editing functionality

Large data scale training

Trains on unstructured data

Effective model classifier

Potential virtual assistant voices

Fast performance

Effective for in-wild data

Potential for synthetic data generation

Trains on multilingual benchmarks

Cons

Not available to public

Potential for misuse

Requires a lot of data

Limited to six languages

20 times slower than Vall-E

Depends on Flow Matching

Doesn't support task-specific training

Currently lacks public API

Lacks verification functionality

No open-source code

Q&A

What are the key features of Voicebox by Meta?

Voicebox by Meta is a generative AI model for speech that uses a new approach called Flow Matching. It can train on diverse, unstructured data without requiring carefully labeled inputs. It can produce high-quality audio clips in a variety of styles and synthesize speech across six languages. Other features include noise removal, content editing, style conversion, and diverse sample generation. Unlike existing models, it can modify any part of a given sample, not just the end, making it versatile across different tasks.

What does the Flow Matching approach utilized by Voicebox entail?

Flow Matching is Meta's latest advancement in non-autoregressive generative models. It lets the model learn a highly non-deterministic mapping between text and speech. This matters because Voicebox can learn from varied speech data without those variations having to be carefully labeled, so it can be trained on more diverse data at a larger scale.
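
As a rough illustration of the idea, the sketch below computes a single training example for a flow-matching objective. It assumes the straight-line (optimal-transport) conditional path commonly used in the Flow Matching literature; the text above does not say which path Voicebox uses. `SIGMA_MIN`, `predict`, and all function names are hypothetical, and plain lists of floats stand in for audio features.

```python
import random

SIGMA_MIN = 1e-5  # assumed small constant; not specified in the text above


def interpolate(x0, x1, t, sigma_min=SIGMA_MIN):
    """Point at time t on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - (1.0 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1, sigma_min=SIGMA_MIN):
    """Time derivative of interpolate(); the regression target, constant in t."""
    return [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]


def flow_matching_loss(predict, x1, rng):
    """Mean squared error for one sample, given a hypothetical model predict(x_t, t)."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise sample
    t = rng.random()                        # random time in [0, 1)
    x_t = interpolate(x0, x1, t)
    v_pred = predict(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)
```

Because the model regresses a velocity for all positions at once instead of predicting one token after another, generation is non-autoregressive: sampling integrates the learned velocity field from noise to data.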

In what languages can Voicebox synthesize speech?

Voicebox can synthesize speech in six languages: English, French, Spanish, German, Polish, and Portuguese.

How does Voicebox perform in terms of word error rate and audio similarity metrics compared to existing models?

Voicebox outperforms the current state-of-the-art English model, VALL-E, in terms of both intelligibility and audio similarity. It achieves a 1.9 percent word error rate versus VALL-E's 5.9 percent (lower is better), and an audio similarity score of 0.681 compared to VALL-E's 0.580 (higher is better). Furthermore, for cross-lingual style transfer, Voicebox reduces the average word error rate from 10.9 percent to 5.2 percent, and improves audio similarity from 0.335 to 0.481.
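
For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The sketch below is a minimal version under that standard definition; it is not Meta's evaluation code, and the function names are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a lower-is-better metric improved."""
    return (before - after) / before
```

With the cross-lingual figures quoted above, `relative_reduction(10.9, 5.2)` comes to roughly 0.52, meaning the average word error rate is cut by about half.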

What makes Voicebox different from traditional speech synthesizers?

Traditional speech synthesizers require specific training for each task using carefully prepared data and they can only modify the end part of an audio clip. Conversely, Voicebox can learn from raw audio and an accompanying transcription. It is capable of modifying any part of a given sample and doesn't require carefully labeled inputs. This difference allows for greater versatility across a wider range of tasks and data sources.

How can Voicebox modify any part of a given audio sample?

Along with producing outputs from scratch, Voicebox can modify existing samples. The model can learn to predict a speech segment by analyzing the surrounding speech and the transcript of the segment. Given this learning, it can apply it to generate or modify audio in any part of a recording without having to recreate the entire input.
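
The infilling setup described above can be sketched as follows. This is only an illustration of the data flow, with plain floats standing in for audio frames; `make_infill_inputs` and `splice` are hypothetical names, and the sketch assumes the regenerated span has the same length as the one it replaces, which a real editing system need not require.

```python
def make_infill_inputs(frames, start, end, fill=0.0):
    """Hide frames[start:end] and return (mask, context, target)."""
    if not 0 <= start < end <= len(frames):
        raise ValueError("span must satisfy 0 <= start < end <= len(frames)")
    mask = [start <= i < end for i in range(len(frames))]
    # The model sees the surrounding audio, with the masked span blanked out.
    context = [fill if m else f for f, m in zip(frames, mask)]
    # During training, the hidden span is what the model learns to predict.
    target = frames[start:end]
    return mask, context, target


def splice(frames, start, end, generated):
    """Replace only frames[start:end], leaving the rest of the recording untouched."""
    if len(generated) != end - start:
        raise ValueError("generated span must match the masked length in this sketch")
    return frames[:start] + list(generated) + frames[end:]
```

For example, `splice(frames, 1, 3, new_frames)` swaps out two frames in the middle of a recording while every frame outside the span is returned unchanged.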

Is Voicebox available for public use?

No. Voicebox is not currently available to the public due to potential risks of misuse.

What are the potential applications of Voicebox?

Potential applications of Voicebox are wide-ranging. Its in-context text-to-speech synthesis could potentially bring speech to people who are unable to speak or allow people to customize the voices of non-player characters and virtual assistants. Its ability to perform cross-lingual style transfer could help people communicate naturally in different languages. Voicebox's abilities in speech denoising and editing could ease the process of cleaning up and editing audio. In terms of diverse speech sampling, it could generate synthetic data to better train a speech assistant model.

What data was Voicebox trained on?

Voicebox was trained on more than 50,000 hours of recorded speech and transcripts from public domain audiobooks in six languages: English, French, Spanish, German, Polish, and Portuguese.

Can Voicebox perform speech denoising and editing?

Yes, Voicebox's in-context learning enables it to generate speech to seamlessly edit segments within audio recordings. It can resynthesize the portion of speech corrupted by short-duration noise or replace misspoken words without having to re-record the entire speech.

How does Voicebox handle diverse speech sampling?

Voicebox is able to generate speech that is more representative of how people talk in the real world and across the six languages it functions in. This could, in the future, be used to generate synthetic data to help better train a speech assistant model.

Can Voicebox perform in-context text-to-speech synthesis?

Yes, using an input audio sample just two seconds in length, Voicebox can match the sample's audio style and use it for text-to-speech generation.

Does Voicebox have the ability to perform cross-lingual style transfer?

Yes. Given a sample of speech and a passage of text in English, French, German, Spanish, Polish, or Portuguese, Voicebox can produce a reading of the text in that language in the style of the sample, even when the sample and the text are in different languages.

How does Voicebox handle content editing and style conversion?

Voicebox handles content editing and style conversion by leveraging its ability to modify any part of a given sample. It can regenerate a corrupted segment of the speech or replace misspoken words, effectively performing content editing. However, the specifics of how Voicebox performs style conversion are not mentioned.

How efficient is Voicebox compared to existing models?

Voicebox is up to 20 times faster than VALL-E, the current state-of-the-art model it is compared against.

Can Voicebox create outputs from scratch?

Yes, Voicebox can create outputs from scratch. It also has the ability to generate text-to-speech in a vast variety of styles which makes it highly versatile.

What measures are being taken to avoid misuse of Voicebox?

To avoid misuse of Voicebox, Meta is not making the Voicebox model or code publicly available. A classifier has been built that can distinguish between authentic speech and audio generated with Voicebox to mitigate possible future risks.

What makes Voicebox suitable for tasks such as in-context text-to-speech synthesis, cross-lingual style transfer, speech denoising, and editing?

Voicebox can modify any part of a given sample and not just the end, making it suitable for various tasks. Its ability to handle noise removal, content editing, style conversion, and diverse sample generation further increases its suitability for tasks such as in-context text-to-speech synthesis, cross-lingual style transfer, speech denoising, and editing.

What is the impact of Voicebox on synthetic speech recognition?

Voicebox can generate synthetic data that helps in training speech assistant models. Results show that speech recognition models trained on Voicebox-generated synthetic speech perform almost as well as models trained on real speech. There is only 1 percent error rate degradation with Voicebox compared to 45 to 70 percent degradation with synthetic speech from previous text-to-speech models.

What potential risks have been identified with Voicebox technology?

The main risk identified with Voicebox, as with many generative AI systems, is misuse. Specific types of risk are not described in the available information.
