Text to speech
taaft.com/text-to-speech
261,552 subscribers
There are 6 AI tools for Text to speech.
Also used for Text to speech 6
-
Turn any content into engaging podcasts instantly.
GrapePine🙏 · 4 karma · Jun 9, 2026
@PodcastorAI I really love the Pet Studio section of PodcastorAI. I often get ideas for my podcasts but don’t want to show my face or do the voiceover myself. When I discovered this tool, I uploaded pictures of my furry friends and created a host, and honestly, watching them interact was just adorable and hilarious! The speed is a bit slow, but that’s okay; I can just hang out and pet my cats while it runs, haha.
Released 4d ago · Free + from $9.9/mo · 282104.7
-
Real-time AI-powered Scripture display and note taker that responds to voice.
14,298 · citeverse.live · Released 30d ago · Free + from $22.50/mo · 14,97175.0
-
Turn one screen recording into videos, interactive tours, and product guides.
6,138 · trainn.co · Released 2mo ago · Free + from $19/mo · 7,489105.0
-
A personalized audio story gift starring any child, delivered instantly by email.
-
Clone any voice in seconds with 99% similarity.
KikiVoice does a great job with voice cloning; the results sound very natural and close to the original, and there’s no sign-up required.
Released 4mo ago · #20 in Trending · 6,245964.5
-
Transcribe audio & video with Whisper. Export TXT/SRT/VTT. Auto-delete 24h.
Released 4mo ago · Free + from $14.99/mo · 1,07633.6
Models 96
-
Gemini Audio is Google DeepMind’s closed-source native audio model family for low-latency live dialogue, controllable speech generation, audio understanding, and voice-first applications.
New · Audio · Released 2d ago
-
By Boson AI
Higgs Audio v3 TTS is Boson AI’s text-to-speech model for expressive conversational voice agents across 100+ languages with zero-shot voice cloning and inline speech controls.
New · Audio · Released 7d ago
-
By Miso Labs
MisoTTS is Miso Labs’ open-weight 8B text-and-audio-conditioned speech generation model for expressive, context-aware, emotive TTS and dialogue voice output.
New · Audio · Released 8d ago
-
By Microsoft
MAI-Voice-2-Flash is Microsoft AI’s upcoming lower-cost, ultra-efficient variant of MAI-Voice-2 for speech generation.
New · Multimodal · Released 9d ago
-
By Microsoft
MAI-Voice-2 is Microsoft AI’s speech generation model for natural-sounding voice output across 15 languages with short-sample voice adaptation.
New · Multimodal · Released 9d ago
-
By Gradium
Phonon is Gradium’s private-beta 100M-parameter on-device text-to-speech model for low-latency, offline, privacy-sensitive voice generation.
New · Video · Released 16d ago
-
By Inworld AI
Realtime TTS-2 is Inworld AI’s realtime conversational text-to-speech model. It is built for live voice interaction rather than narration, with conversational awareness from prior audio turns, natural-language voice direction, crosslingual voice identity across 100+ languages, and prompt-based voice design.
New · Multimodal · Released 1mo ago
-
By Pruna AI
p-video-avatar is Pruna’s talking-head video generation model for creating speaking avatar videos from a single portrait image. It takes either a text script or an audio file, then generates a realistic head-and-shoulders speaking video, with support for multiple voices, languages, and 720p or 1080p output.
New · Multimodal · Released 1mo ago
-
sarashina2.2-tts is SB Intuitions’ Japanese-centric large-language-model-based text-to-speech system. It supports Japanese and English, is designed for high pronunciation accuracy, naturalness, and stability across diverse speaking styles, and includes zero-shot voice generation.
New · Multimodal · Released 1mo ago
-
By Soniox
Soniox Text-to-Speech is Soniox’s multilingual TTS model and API for precise, low-latency speech generation. It is built for production voice systems, supports 60+ languages, and emphasizes accurate pronunciation, faithful reading of structured text like emails and phone numbers, natural code-switching, and streaming output for real-time voice apps.
New · Multimodal · Released 1mo ago
-
By OpenRouter
OpenRouter TTS is OpenRouter’s unified text-to-speech interface for accessing multiple speech-generation models through one API layer. It standardizes voice generation across providers, supporting streaming audio output, customizable voices, and multimodal workflows without provider-specific integration complexity.
New · Multimodal · Released 1mo ago
-
By StepFun
StepAudio 2.5 TTS is StepFun’s contextual text-to-speech model with performance-oriented vocal control. It combines global and inline context guidance with zero-shot voice cloning so generated speech can follow broader style instructions as well as local delivery details, rather than just reading text flatly.
New · Multimodal · Released 1mo ago
-
By Microsoft
MAI-Voice-1 is Microsoft’s top-tier text-to-speech model for natural, expressive voice generation. It is built to preserve clarity, intent, speaker identity, emotional nuance, and pacing across long-form speech, and supports custom voice creation from only a few seconds of audio. Microsoft positions it for voice experiences, voice agents, and expressive spoken content at high speed and low cost.
New · Audio · Released 1mo ago
-
By OpenMOSS
MOSS-TTS-Nano is an open-source multilingual tiny speech generation model from MOSI.AI and OpenMOSS. With only 0.1B parameters, it is built for real-time TTS, can run directly on CPU without a GPU, and keeps deployment simple enough for local demos, web serving, and lightweight product integration.
New · Multimodal · Released 2mo ago
-
By danneauxs
Pocket-TTS-Spokenword is an enhanced version of Kyutai’s Pocket TTS built for emotionally expressive audiobook generation from plain text. It adds AI emotion analysis, smart text chunking, voice adaptation, and voice cloning, while staying lightweight enough to run on CPU-only systems without requiring a GPU.
New · Audio · Released 2mo ago
-
By OpenBMB
VoxCPM2 is OpenBMB’s open-source tokenizer-free multilingual text-to-speech model for natural speech generation, voice design, and controllable voice cloning. It is a 2B-parameter model trained on over 2 million hours of speech, supports 30 languages, and produces 48 kHz studio-quality audio with real-time streaming capability.
New · Multimodal · Released 2mo ago
-
By Xiaomi
OmniVoice is a multilingual zero-shot text-to-speech model built for voice cloning, voice design, and general speech synthesis at massive language scale. It supports more than 600 languages, uses a diffusion language model-style architecture, and is positioned for high-quality speech generation with fast inference.
New · Multimodal · Released 2mo ago
-
By Meituan
LongCat-AudioDiT-3.5B is Meituan LongCat’s diffusion-based text-to-speech model built directly in waveform latent space rather than mel-spectrogram space. It is designed for high-fidelity speech generation and zero-shot voice cloning, supports Chinese and English, and is positioned as a top-performing open model on the Seed benchmark for speaker similarity and intelligibility.
New · Audio · Released 2mo ago
-
By Mistral AI
Voxtral TTS is Mistral’s new open-source text-to-speech model for building voice agents and enterprise speech applications. According to TechCrunch, it supports 9 languages, can clone a voice from under 5 seconds of audio, preserves accents and speaking style, and is optimized for real-time use on edge devices like phones, laptops, and wearables.
New · Audio · Released 2mo ago
-
By Smallest AI
Lightning is Smallest.ai’s low-latency text-to-speech system for real-time voice agents, voiceovers, and voice cloning.
New · Audio · Released 2mo ago
-
By Xiaomi
MiMo-V2-TTS is Xiaomi’s large-scale speech synthesis model built for expressive agent voice, aiming for natural, emotionally aware speech.
New · Audio · Released 2mo ago
-
By Hume AI
TADA-1B is a unified speech-language model checkpoint that aligns text tokens and speech representations 1-to-1 for fast, reliable text-to-speech generation.
Audio · Released 3mo ago
-
By Hume AI
TADA-3B-ml is a multilingual TADA checkpoint built for fast, reliable speech generation using the same 1-to-1 text-acoustic alignment framework.
Multimodal · Released 3mo ago
-
By Alibaba
Qwen3-TTS is a speech generation model family designed for high-quality, human-like TTS with voice cloning and natural-language control over voice style.
Audio · Released 3mo ago
-
By OpenAI
gpt-realtime-1.5 is OpenAI’s flagship real-time voice model for audio-in, audio-out use cases like voice agents and customer support, with support for text, audio, and image inputs and text and audio outputs.
Multimodal · Released 3mo ago
-
Conversational speech generation model that generates audio codes from text and audio inputs for dialogue-style speech output.
Coding · Released 3mo ago
-
By ByteDance
Seed 2.0 is described publicly only as a new ByteDance Seed language model for Doubao; no reliable, detailed public technical description of its architecture, context length, or training data is available yet.
Multimodal · Released 3mo ago
-
By ByteDance
Seedream 5.0 Lite is ByteDance Seed's multimodal image generator with deep reasoning and built-in web search, built for precise text-to-image generation and image editing that follow complex, real-time instructions with tight layout and style control.
Image · Released 4mo ago
-
By Zyphra AI
Zonos-v0.1 is Zyphra’s open-weight text-to-speech family, two 1.6B models trained on 200k+ hours of multilingual speech, offering expressive, real-time TTS and high-quality voice cloning.
Audio · Released 4mo ago
-
By ysharma3501
ZipVoice-based voice cloning TTS that generates 48 kHz speech at up to 150x real time, fitting in about 1 GB of VRAM for local, high-quality synthesis.
Audio · Released 4mo ago
-
By OpenMOSS
Open-source foundation model that jointly generates video and audio in one pass, achieving tightly synchronized lip movements and environment-aware sound effects.
Video · Released 4mo ago
-
By Alibaba
Multilingual forced alignment model that aligns speech and transcripts in 11 languages, predicting timestamps for arbitrary units in up to 5 minutes of audio with accuracy surpassing previous end-to-end aligners.
Audio · Released 4mo ago
-
By MiniMax
Latency-optimized sibling of Speech-2.8-HD, trading some fidelity for faster, cheaper generation while keeping its multilingual, emotional, and voice-cloning strengths.
Audio · Released 4mo ago
-
By MiniMax
High-definition MiniMax TTS model focused on studio-grade, multilingual speech, rich emotion control, interjections, and voice cloning for premium voiceovers and production audio.
Audio · Released 4mo ago
-
By Google
D4RT is DeepMind’s unified 4D scene reconstruction and tracking model that turns ordinary videos into a fast, queryable representation of 3D geometry and motion, solving tracking, depth and pose up to hundreds of times faster than prior work.
Coding · Released 4mo ago
-
By Alibaba
Qwen’s open text-to-speech model supporting multilingual speech generation with custom voice capability.
Audio · Released 4mo ago
-
By Apple
Manzano is Apple’s unified multimodal model that shares a hybrid vision tokenizer for both image understanding and text-to-image generation, using one autoregressive LLM plus a diffusion decoder to reach state-of-the-art unified performance.
Multimodal · Released 4mo ago
-
By Flash Labs
FlashLabs Chroma 1.0 is a real-time spoken dialogue model that interleaves text and audio tokens to enable sub-second, end-to-end conversations with personalized voice cloning and high speaker similarity.
Text · Released 4mo ago
-
FLUX.2-klein-4B is Black Forest Labs’ 4B-parameter rectified-flow image model, unifying fast text-to-image and image-editing with multi-reference support, distilled for sub-second generation on consumer GPUs under Apache 2.0.
Image · Released 4mo ago
-
By ByteDance
Seed-Prover 1.5 is ByteDance Seed’s formal theorem-proving model for Lean, trained with agentic RL and test-time scaling to solve most undergraduate and many graduate-level competition problems.
Text · Released 4mo ago
-
By Kyutai
Kyutai TTS 1.6B is Kyutai's open-source streaming text-to-speech model for English and French, using delayed streams modeling to start speaking before the full text is read, enabling ultra-low-latency, high-quality voices for assistants and real-time apps.
Audio · Released 5mo ago
-
By Alibaba
Qwen3-TTS-VC-Flash is Qwen’s VoiceClone voice-conversion model that clones any speaker from about 3 seconds of audio, then revoices speech in that identity across 10 languages with low word-error rates.
Audio · Released 5mo ago
-
By Alibaba
Qwen3-TTS-VD-Flash is Alibaba Qwen's voice-design TTS model that creates fully custom voices from natural-language instructions, letting users control timbre, rhythm, emotion and persona for expressive, multilingual speech via the Qwen API.
Audio · Released 5mo ago
-
By OpenBMB
VoxCPM1.5 is OpenBMB’s tokenizer-free TTS model that generates expressive, context-aware speech and realistic zero-shot voice clones in Chinese and English, with real-time streaming and open-source weights that support full and LoRA fine-tuning.
Audio · Released 5mo ago
-
By Microsoft
VibeVoice is Microsoft’s open-source frontier TTS framework that turns long text into expressive multi-speaker conversational audio, generating podcast-style speech with natural turn-taking in English and Mandarin.
Audio · Released 6mo ago
-
Kling Video 2.6 is Kling AI's latest video model that natively generates video plus dialogue, music and sound effects in one step, turning text or images into 5-10 second 1080p clips with tightly synced audio-visual storytelling for creators and advertisers.
Video · Released 6mo ago
-
By Pixverse
PixVerse V5.5 is PixVerse’s audio-visual text- and image-to-video model that generates 5-10 s 1080p multi-shot clips with native speech, music and SFX, improved motion stability, and multi-shot camera control for story-driven, lip-synced short videos.
Video · Released 6mo ago
-
FLUX.2 [dev] is the open-weight, guidance-distilled FLUX.2 variant for non-commercial use, derived from FLUX.2 [pro] and designed to keep similar quality and prompt adherence while being efficient to run and fine-tune.
Image · Released 6mo ago
-
By Nari Labs
Dia2 is an open-source streaming dialogue TTS model that generates speech in real time from partial text, supports audio conditioning for natural back-and-forth conversations, and ships 1B and 2B checkpoints under Apache 2.0.
Audio · Released 6mo ago
-
By Supertone
Supertonic is Supertone’s lightning-fast, on-device text-to-speech system. It runs via ONNX Runtime entirely locally, uses a lightweight 66M-parameter model, and can generate speech up to hundreds of times faster than real time on consumer hardware, including small devices like a Raspberry Pi.
Audio · Released 6mo ago
Devices 4
-
The Beanie · Smart Phone · Sabi · Apr, 2026 · Announced · N/A
A non-invasive knit beanie that decodes internal speech into text using a dense array of ~70,000–100,000 fabric-embedded dry biosensors p...
-
MemoMind One · Smart Glasses · MemoMind · May 28, 2026 · Announced · $599.00
Camera-free AI smart glasses with dual-eye waveguide microLED display (green monochrome), integrated Harman Kardon-tuned stereo speakers,...
-
AIY Voice Kit · Smart Speaker · Google · Apr 16, 2018 · Discontinued · $49.99
The AIY Voice Kit from Google is a do-it-yourself intelligent speaker that lets you build your own natural language processor and connect...
-
OpenHome DevKit · Smart Speaker · OpenHome Technologies · Mar 11, 2026 · Available · $200.00
The OpenHome DevKit is an open-source voice AI development platform that lets developers build custom AI-powered smart speakers and voice...
