Text to speech

taaft.com/text-to-speech 272,336 subscribers

There are 8 AI tools for Text to speech.

Number of models: 107

Also used for Text to speech 8

anyAInow

Every top AI model. No subscription.

Released 9d ago
From $5.00

351
4
Spritefy

Generate game-ready assets with AI in seconds.

3,749 spritefy.com

🇩🇪 Germany
Released 1mo ago
Free + from $12

4,203
9
PodcastorAI

Turn any content into engaging podcasts instantly.

Yawin Lin

🙏 3 karma

Jun 10, 2026

@PodcastorAI

As a student, I regularly work with research papers and long PDFs. PodcastorAI has been a helpful way to turn that content into something I can listen to while commuting or walking. I like that it creates a structured podcast rather than simply reading the text aloud. The dialogue format is surprisingly engaging and makes dense material easier to get through. I still rely on the original documents for deeper study, but for review and knowledge retention, it's been a genuinely useful tool.

Released 1mo ago
Free + from $9.9/mo

649
11
3.5
CiteVerse v2.0

Real-time AI-powered Scripture display and note taker that responds to voice.

16,405 citeverse.live

Released 2mo ago
Free + from $22.50/mo

17,338
10
4.6
Trainn AI v1.4

Turn one screen recording into videos, interactive tours, and product guides.

6,449 trainn.co

Released 4mo ago
Free + from $19/mo

7,838
10
5.0
Whimsy v2.0

A personalized audio story gift starring any child, delivered instantly by email.

🇺🇸 United States
Released 4mo ago
Free + from $19.99

5,640
34
5.0
KikiVoice

Clone any voice in seconds with 99% similarity.

CatCat01

🙏 7 karma

Jan 26, 2026

@KikiVoice

KikiVoice does a great job with voice cloning — the results sound very natural and close to the original, and there’s no sign-up required.

Released 6mo ago
100% Free

8,342
101
4.2
FastlyConvert

Transcribe audio & video with Whisper. Export TXT/SRT/VTT. Auto-delete 24h.

Released 6mo ago
Free + from $14.99/mo

1,105
3
3.6
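
The FastlyConvert listing above mentions TXT, SRT, and VTT export. As a rough illustration of how the two subtitle formats differ, here is a minimal sketch that renders the same segments both ways. The function names and the `(start, end, text)` segment shape are assumptions made for this example; they are not taken from FastlyConvert or Whisper.

```python
def format_timestamp(seconds: float, fmt: str = "srt") -> str:
    """Format an offset as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (WebVTT)."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    # SRT separates milliseconds with a comma, WebVTT with a period.
    sep = "," if fmt == "srt" else "."
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{ms:03d}"


def to_srt(segments):
    """Render (start, end, text) tuples as numbered SRT cues."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(blocks)


def to_vtt(segments):
    """Render (start, end, text) tuples as a WebVTT document."""
    lines = ["WEBVTT", ""]  # WebVTT files start with this header line
    for start, end, text in segments:
        start_ts = format_timestamp(start, "vtt")
        end_ts = format_timestamp(end, "vtt")
        lines.extend([f"{start_ts} --> {end_ts}", text, ""])
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [(0.0, 1.5, "Hello"), (1.5, 3.0, "world")]
    print(to_srt(demo))
    print(to_vtt(demo))
```

The only differences shown here are the `WEBVTT` header, the cue numbering, and the millisecond separator; styling and positioning features of either format are out of scope.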

Related Tasks

Speech to speech 1 0

Speech to image 1 0

Models 107

Gen 4 Qwen

Qwen Audio 3.0 TTS

By Alibaba

Qwen Audio 3.0 TTS is a production-oriented text-to-speech model built on a 12.5 Hz speech tokenizer and a five-stage progressive training pipeline. It supports zero-shot voice cloning, 16 languages, 20 Chinese dialect regions, and one-pass synthesis of up to 3 minutes, with strong robustness to noisy or degraded reference audio.

🔊Text to speech 🗣️Voice cloning 🎙️Voiceovers 🎧Audiobooks

New · Audio

Released 7d ago
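
To put the 12.5 Hz tokenizer rate and the 3-minute limit from the Qwen Audio 3.0 TTS entry into perspective, the sketch below estimates how many tokenizer frames one maximum-length pass would span. It assumes one frame per tokenizer step; the listing does not say how many codebooks or tokens each frame carries, so this is a frame count only, not a token count.

```python
import math


def tokenizer_frames(duration_s: float, frame_rate_hz: float = 12.5) -> int:
    """Estimate the number of tokenizer frames covering `duration_s` seconds.

    Assumption: one frame per tokenizer step at `frame_rate_hz`.
    """
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    return math.ceil(duration_s * frame_rate_hz)


if __name__ == "__main__":
    # 3 minutes is the one-pass limit quoted in the listing.
    print(tokenizer_frames(3 * 60))
```
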
Gen 4 GPT

GPT Live 1 mini

By OpenAI

GPT-Live-1 mini is a smaller full-duplex voice model that listens and speaks simultaneously for natural, real-time conversation. It delegates web search or agentic tasks to a frontier model in the background and powers ChatGPT Voice for Free users.

🎙Voice chatting 🔊Text to speech 🤖Task automation 🎤Voice assistants 🔎Search 🌍Translations

New · Audio

Released 19d ago
Gen 4 GPT

GPT Live 1

By OpenAI

GPT-Live-1 is a full-duplex voice model that listens and speaks simultaneously for natural, real-time conversation. It delegates web search, reasoning, or agentic work to a frontier model in the background while keeping the conversation flowing, and powers ChatGPT Voice for Go, Plus, and Pro users.

🎙Voice chatting 🔊Text to speech 🤖Task automation 🎤Voice assistants 🔎Search 🌍Translations

New · Audio

Released 19d ago
Gen 4

Sume Avatar 1.0

By Sume

Sume Avatar 1.0 is a multi-agent orchestration system exposed as a single avatar video model. It routes a script, an avatar choice, and an optional product image across specialized image, audio, and video models to produce consistent talking-head videos up to 60 seconds long.

🎥Videos 🔊Text to speech 🎥Product videos 👤Avatars 📹UGC videos

New · Video

Released 26d ago
Gen 4 Fish

S2.1 Pro

By Fish Audio

S2.1 Pro is a neural text-to-speech model that converts text into speech across 83 languages with automatic language detection. It supports inline bracket tags for emotion and paralinguistic control, multi-speaker dialogue, and instant voice cloning from a reference audio sample. It streams audio in real time with a time to first audio of about 70 to 90 ms and outputs MP3, WAV, PCM, or Opus.

🔊Text to speech 🗣️Voice cloning 🎙️Voiceovers 🎧Audiobooks

New · Audio

Released 1mo ago
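
The S2.1 Pro entry quotes a time to first audio of about 70 to 90 ms. One way to check a figure like this from the client side is to time the arrival of the first non-empty chunk of a streaming response. The sketch below does that for any iterable of byte chunks; `fake_stream` is a hypothetical stand-in for a real streaming client and is not part of any Fish Audio API.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_audio(chunks: Iterable[bytes]) -> Tuple[float, bytes]:
    """Return (seconds until the first non-empty chunk, that chunk)."""
    start = time.perf_counter()
    for chunk in chunks:
        if chunk:  # skip empty keep-alive chunks
            return time.perf_counter() - start, chunk
    raise RuntimeError("stream ended before any audio arrived")


def fake_stream(delay_s: float = 0.08, n_chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stream: waits `delay_s`, then yields silent PCM chunks."""
    time.sleep(delay_s)
    for _ in range(n_chunks):
        yield b"\x00" * 320


if __name__ == "__main__":
    latency, first = time_to_first_audio(fake_stream())
    print(f"first {len(first)} bytes after {latency * 1000:.0f} ms")
```

A client-side measurement like this includes network and request overhead, so it would normally read higher than a server-side figure.
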
Gen 4 Seed

Seed Audio 1.0

By ByteDance

Seed Audio 1.0 is ByteDance's universal audio generation model that creates voice, music, sound effects, and ambient soundscapes from text prompts. It supports zero-shot voice cloning from short audio references, multi-character dialogue generation in a single pass, and cross-lingual synthesis without fine-tuning. It is accessible via the Volcano Engine API.

🔊Advanced audio generation 🔊Text to speech 🗣️Voice cloning 🎶Music generation

New · Audio

Released 1mo ago
Gen 3

ZONOS2

By Zyphra AI

ZONOS2 is Zyphra’s open-source real-time text-to-speech model with MoE architecture, high-fidelity zero-shot voice cloning, and multilingual expressive speech generation.

🔊Text to speech 🗣️Voice cloning 🎙️Voiceovers

New · Multimodal

Released 1mo ago
Gen 4 Gemini

Gemini 3.5 Live Translate

By Google DeepMind

Gemini Audio is Google DeepMind’s closed-source native audio model family for low-latency live dialogue, controllable speech generation, audio understanding, and voice-first applications.

🎧Audio translation 🔊Text to speech 🗣️Voice cloning 🗒Transcription

New · Audio

Released 1mo ago
Gen 3 Nemotron

Nemotron Labs Audex 30B A3B

By NVIDIA

Nemotron-Labs-Audex-30B-A3B is a unified audio-text large language model from NVIDIA built on a 30B parameter Mixture-of-Experts backbone with 3B active parameters. It extends a strong text-only reasoning model with an audio encoder and discrete audio token vocabulary, enabling audio understanding, speech recognition, speech translation, text-to-speech, audio generation, and speech-to-speech interaction while preserving the backbone's reasoning, knowledge, and long-context abilities.

🗣️Speech to speech 🔊Text to speech 🎧Audio translation 🗒Transcription 🔍Advanced reasoning

New · Multimodal

Released 1mo ago
Gen 4

Higgs Audio v3 TTS

By Boson AI

Higgs Audio v3 TTS is Boson AI’s text-to-speech model for expressive conversational voice agents across 100+ languages with zero-shot voice cloning and inline speech controls.

🔊Text to speech 🗣️Voice cloning 🎙️Voiceovers 🗣Dialogue generation

New · Audio

Released 1mo ago
Gen 4

Miso TTS 8B

By Miso Labs

MisoTTS is Miso Labs’ open-weight 8B text-and-audio-conditioned speech generation model for expressive, context-aware, emotive TTS and dialogue voice output.

🔊Text to speech 🗣️Voice cloning 🎙️Voiceovers 🗣Dialogue generation

New · Audio

Released 1mo ago
Gen 3 MAI

MAI Voice 2 Flash

By Microsoft

MAI-Voice-2-Flash is Microsoft AI’s upcoming lower-cost, ultra-efficient variant of MAI-Voice-2 for speech generation.

🔊Text to speech 🗣️Voice cloning 🎙️Voiceovers

New · Multimodal

Released 1mo ago
Gen 3 MAI

MAI Voice 2

By Microsoft

MAI-Voice-2 is Microsoft AI’s speech generation model for natural-sounding voice output across 15 languages with short-sample voice adaptation.

🔊Text to speech 🗣️Voice cloning 🎙️Voiceovers 🎧Audiobooks

New · Multimodal

Released 1mo ago
Gen 4

Phonon

By Gradium

Phonon is Gradium’s private-beta 100M-parameter on-device text-to-speech model for low-latency, offline, privacy-sensitive voice generation.

🔊Text to speech 🗣️Voice cloning

New · Video

Released 2mo ago
Gen 3

Realtime TTS 2

By Inworld AI

Realtime TTS-2 is Inworld AI’s realtime conversational text-to-speech model. It is built for live voice interaction rather than narration, with conversational awareness from prior audio turns, natural-language voice direction, crosslingual voice identity across 100+ languages, and prompt-based voice design.

🔊Text to speech 🗣️Voice cloning 🎙️Voiceovers

New · Multimodal

Released 2mo ago
Gen 3

Sonic 3.5

By Cartesia

Sonic 3.5 is Cartesia’s fastest and most natural text-to-speech model, built for low-latency conversational voice generation across 42 languages.

🔊Text to speech 🗣️Voice cloning 🎙️Voiceovers 🗣Dialogue generation

New · Multimodal

Released 2mo ago
Gen 3

p video avatar

By Pruna AI

p-video-avatar is Pruna’s talking-head video generation model for creating speaking avatar videos from a single portrait image. It takes either a text script or an audio file, then generates a realistic head-and-shoulders speaking video, with support for multiple voices, languages, and 720p or 1080p output.

🎥Video avatars 🔊Text to speech 🎤Lip sync videos 🎨Portrait animation

New · Multimodal

Released 2mo ago
Gen 3

sarashina 2.2 tts

By SB Intuitions

sarashina2.2-tts is SB Intuitions’ Japanese-centric large-language-model-based text-to-speech system. It supports Japanese and English, is designed for high pronunciation accuracy, naturalness, and stability across diverse speaking styles, and includes zero-shot voice generation.

🔊Text to speech 🗣️Voice cloning 🎙️Voiceovers

Multimodal

Released 3mo ago
Gen 3

Soniox Text to Speech

By Soniox

Soniox Text-to-Speech is Soniox’s multilingual TTS model and API for precise, low-latency speech generation. It is built for production voice systems, supports 60+ languages, and emphasizes accurate pronunciation, faithful reading of structured text like emails and phone numbers, natural code-switching, and streaming output for real-time voice apps.

🔊Text to speech 🎙️Voiceovers 🎤Voice agents

Multimodal

Released 3mo ago
Gen 3

OpenRouter Text to Speech

By OpenRouter

OpenRouter TTS is OpenRouter’s unified text-to-speech interface for accessing multiple speech-generation models through one API layer. It standardizes voice generation across providers, supporting streaming audio output, customizable voices, and multimodal workflows without provider-specific integration complexity.

🔊Text to speech 🎙Voice chatting 🎙️Voiceovers

Multimodal

Released 3mo ago
Gen 3

StepAudio 2.5 TTS

By StepFun

StepAudio 2.5 TTS is StepFun’s contextual text-to-speech model with performance-oriented vocal control. It combines global and inline context guidance with zero-shot voice cloning so generated speech can follow broader style instructions as well as local delivery details, rather than just reading text flatly.

🔊Text to speech 🗣️Voice cloning 🗣️Dialect simulation

Multimodal

Released 3mo ago
Gen 4 MAI

MAI Voice 1

By Microsoft

MAI-Voice-1 is Microsoft’s top-tier text-to-speech model for natural, expressive voice generation. It is built to preserve clarity, intent, speaker identity, emotional nuance, and pacing across long-form speech, and supports custom voice creation from only a few seconds of audio. Microsoft positions it for voice experiences, voice agents, and expressive spoken content at high speed and low cost.

🔊Text to speech 🎤Voice changing 🗣️Voice cloning 🎧Audiobooks 🔊Advanced audio generation

Audio

Released 3mo ago
Gen 3 MOSS

MOSS TTS Nano

By OpenMOSS

MOSS-TTS-Nano is an open-source multilingual tiny speech generation model from MOSI.AI and OpenMOSS. With only 0.1B parameters, it is built for real-time TTS, can run directly on CPU without a GPU, and keeps deployment simple enough for local demos, web serving, and lightweight product integration.

🔊Text to speech 🗣️Voice cloning 🌐Multilingual communication

Multimodal

Released 3mo ago
Gen 4

Pocket TTS Spokenword

By danneauxs

Pocket-TTS-Spokenword is an enhanced version of Kyutai’s Pocket TTS built for emotionally expressive audiobook generation from plain text. It adds AI emotion analysis, smart text chunking, voice adaptation, and voice cloning, while staying lightweight enough to run on CPU-only systems without requiring a GPU.

🔊Text to speech 🗣️Voice cloning 🎧Audiobooks

Audio

Released 3mo ago
Gen 3

VoxCPM2

By OpenBMB

VoxCPM2 is OpenBMB’s open-source tokenizer-free multilingual text-to-speech model for natural speech generation, voice design, and controllable voice cloning. It is a 2B-parameter model trained on over 2 million hours of speech, supports 30 languages, and produces 48 kHz studio-quality audio with real-time streaming capability.

🔊Text to speech 🎤Voice changing 🗣️Voice cloning 🔊Audio

Multimodal

Released 3mo ago
Gen 3

OmniVoice

By Xiaomi

OmniVoice is a multilingual zero-shot text-to-speech model built for voice cloning, voice design, and general speech synthesis at massive language scale. It supports more than 600 languages, uses a diffusion language model-style architecture, and is positioned for high-quality speech generation with fast inference.

🔊Text to speech 🎤Voice changing 🗣️Voice cloning 🌐Multilingual communication

Multimodal

Released 3mo ago
Gen 4 LongCat

LongCat AudioDiT 3.5B

By Meituan

LongCat-AudioDiT-3.5B is Meituan LongCat’s diffusion-based text-to-speech model built directly in waveform latent space rather than mel-spectrogram space. It is designed for high-fidelity speech generation and zero-shot voice cloning, supports Chinese and English, and is positioned as a top-performing open model on the Seed benchmark for speaker similarity and intelligibility.

🔊Text to speech 🗣️Voice cloning 🔊Voice enhancement

Audio

Released 3mo ago
Gen 3

Voxtral TTS

By Mistral AI

Voxtral TTS is Mistral’s new open-source text-to-speech model for building voice agents and enterprise speech applications. According to TechCrunch, it supports 9 languages, can clone a voice from under 5 seconds of audio, preserves accents and speaking style, and is optimized for real-time use on edge devices like phones, laptops, and wearables.

🔊Text to speech 🗣️Voice cloning 🎙️Voiceovers 🎤Voice agents

Audio

Released 4mo ago
Gen 4

Lightning v3

By Smallest AI

Lightning is Smallest.ai’s low-latency text-to-speech system for real-time voice agents, voiceovers, and voice cloning.

🔊Text to speech 🗣️Voice cloning 🎙️Voiceovers 🎤Voice agents

Audio

Released 4mo ago
Gen 4 MOSS

MOSS TTS Local Transformer v1.5

By OpenMOSS

MOSS-TTS-Local-Transformer-v1.5 is a 5B-parameter text-to-speech model supporting 31 languages with zero-shot voice cloning, long-form speech generation, token-level duration control, Pinyin/IPA pronunciation control, code-switching, and 48 kHz stereo audio output via MOSS-Audio-Tokenizer-v2.

🔊Text to speech 🗣️Voice cloning 🎙️Voiceovers 🎧Audiobooks

Audio

Released 4mo ago
Gen 3 MiMo

Xiaomi MiMo V2 TTS

By Xiaomi

MiMo-V2-TTS is Xiaomi’s large-scale speech synthesis model built for expressive agent voice, aiming for natural, emotionally aware speech.

🔊Text to speech 🎤Voice changing 🎙️Voiceovers 🎤Singing

Audio

Released 4mo ago
Gen 4

TADA 1B

By Hume AI

TADA-1B is a unified speech-language model checkpoint that aligns text tokens and speech representations 1-to-1 for fast, reliable text-to-speech generation.

🔊Text to speech 🗣️Voice cloning 🗣️Speech to speech 🎙️Voiceovers

Audio

Released 4mo ago
Gen 3

TADA 3B ML

By Hume AI

TADA-3B-ml is a multilingual TADA checkpoint built for fast, reliable speech generation using the same 1-to-1 text-acoustic alignment framework.

🔊Text to speech 🗣️Voice cloning 🗣️Speech to speech 🎙️Voiceovers 🌐Multilingual communication

Multimodal

Released 4mo ago
Gen 3

Qwen3 TTS

By Alibaba

Qwen3-TTS is a speech generation model family designed for high-quality, human-like TTS with voice cloning and natural-language control over voice style.

🔊Text to speech 🎤Voice changing 🗣️Voice cloning 🎙️Voiceovers

Audio

Released 5mo ago
Gen 3 GPT

GPT Realtime 1.5

By OpenAI

gpt-realtime-1.5 is OpenAI’s flagship real-time voice model for audio-in, audio-out use cases like voice agents and customer support, with support for text, audio, and image inputs and text and audio outputs.

🎙Voice chatting 🔊Text to speech 🎤Voice assistants

Multimodal

Released 5mo ago
Gen 2

CSM

By Sesame AI Labs

A conversational speech generation model that generates audio codes from text and audio inputs for dialogue-style speech output.

🔊Text to speech

Coding

Released 5mo ago
Gen 3 Seed

Seed 2.0

By ByteDance

Seed 2.0 is a ByteDance Seed language model for Doubao. No reliable, detailed public technical description of its architecture, context length, or training data was available when this entry was written.

🎬Video editing 🔊Text to speech 📰News analysis

Multimodal

Released 5mo ago
Gen 3 Seedream

Seedream 5.0 Lite

By ByteDance

Seedream 5.0 Lite is ByteDance Seed's multimodal image generator with deep reasoning and built-in web search. It is built for precise text-to-image generation and image editing that follow complex, real-time instructions with tight layout and style control.

🖼️Image generation 🔊Text to speech 🖌️Image editing 🖼️Logos

Image

Released 5mo ago
Gen 4

Zyphra

By Zyphra AI

Zonos-v0.1 is Zyphra’s open-weight text-to-speech family, two 1.6B models trained on 200k+ hours of multilingual speech, offering expressive, real-time TTS and high-quality voice cloning.

🔊Text to speech 🗣️Voice cloning

Audio

Released 5mo ago
Gen 4

LuxTTS

By ysharma3501

A ZipVoice-based voice cloning TTS model that generates 48 kHz speech at up to 150x real time and fits in about 1 GB of VRAM for local, high-quality synthesis.

🔊Text to speech 🗣️Voice cloning

Audio

Released 5mo ago
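
To make the LuxTTS figures concrete, the sketch below converts the quoted "up to 150x real time" speedup and 48 kHz sample rate into compute time and sample counts for a given audio length. The 150x value is the listing's best-case claim, so the results are lower bounds on synthesis time rather than measurements.

```python
def synthesis_time_s(audio_s: float, speedup: float = 150.0) -> float:
    """Seconds of compute needed for `audio_s` seconds of audio at `speedup`x real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_s / speedup


def sample_count(audio_s: float, sample_rate_hz: int = 48_000) -> int:
    """Number of samples per channel in `audio_s` seconds at `sample_rate_hz`."""
    return round(audio_s * sample_rate_hz)


if __name__ == "__main__":
    one_hour = 60 * 60
    print(synthesis_time_s(one_hour))  # best-case seconds for one hour of audio
    print(sample_count(one_hour))      # samples generated for that hour
```
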
Gen 3

MOVA

By OpenMOSS

An open-source foundation model that jointly generates video and audio in one pass, achieving tightly synchronized lip movements and environment-aware sound effects.

🎥Videos 🔊Text to speech 🎵Music 🎬Animations

Video

Released 5mo ago
Gen 4 Qwen

Qwen3 ForcedAligner 0.6B

By Alibaba

Multilingual forced alignment model that aligns speech and transcripts in 11 languages, predicting timestamps for arbitrary units in up to 5 minutes of audio with accuracy surpassing previous end-to-end aligners.

🗒Transcription 🔊Text to speech 🌐Text translation 🔍SEO content

Audio

Released 5mo ago
Gen 4 MiniMax

Minimax Speech 2.8 Turbo

By MiniMax

Latency-optimized sibling of Speech-2.8-HD, trading a bit of ultimate fidelity for faster, cheaper generation while keeping multilingual, emotional, voice-cloning strengths.

🔊Text to speech 🗣️Voice cloning

Audio

Released 6mo ago
Gen 4 MiniMax

Minimax Speech 2.8 HD

By MiniMax

High-definition MiniMax TTS model focused on studio-grade, multilingual speech, rich emotion control, interjections and voice cloning for premium voiceovers and production audio.

🔊Text to speech 🗣️Voice cloning

Audio

Released 6mo ago
Gen 2

D4RT

By Google

D4RT is DeepMind’s unified 4D scene reconstruction and tracking model that turns ordinary videos into a fast, queryable representation of 3D geometry and motion, solving tracking, depth and pose up to hundreds of times faster than prior work.

🌍3D images 🔊Text to speech 🎮Game creation 🎬Video editing

Coding

Released 6mo ago
Gen 4 Qwen

Qwen3 TTS 12Hz 1.7B CustomVoice

By Alibaba

Qwen’s open text-to-speech model supporting multilingual speech generation with custom voice capability.

🔊Text to speech 🗣️Voice cloning

Audio

Released 6mo ago
Gen 3

Manzano

By Apple

Manzano is Apple’s unified multimodal model that shares a hybrid vision tokenizer for both image understanding and text-to-image generation, using one autoregressive LLM plus a diffusion decoder to reach state-of-the-art unified performance.

🖼️Image generation 🔊Text to speech 🔍SEO content 🎮Game creation

Multimodal

Released 6mo ago
Gen 7

Chroma 1.0

By Flash Labs

FlashLabs Chroma 1.0 is a real-time spoken dialogue model that interleaves text and audio tokens to enable sub-second, end-to-end conversations with personalized voice cloning and high speaker similarity.

🔊Text to speech 🗣️Voice cloning

Text

Released 6mo ago
Gen 4 FLUX

FLUX.2 [klein] 4B

By Black Forest Labs

FLUX.2-klein-4B is Black Forest Labs’ 4B-parameter rectified-flow image model, unifying fast text-to-image and image-editing with multi-reference support, distilled for sub-second generation on consumer GPUs under Apache 2.0.

🖼️Image generation 🔊Text to speech 🔍SEO content 🖌️Image editing

Image

Released 6mo ago
Gen 7 Seed

Seed Prover 1.5

By ByteDance

Seed-Prover 1.5 is ByteDance Seed’s formal theorem-proving model for Lean, trained with agentic RL and test-time scaling to solve most undergraduate and many graduate-level competition problems.

📚Academic research 🔊Text to speech 🎨NFT art

Text

Released 6mo ago