
LongCat AudioDiT 3.5B

By Meituan
LongCat-AudioDiT-3.5B is a non-autoregressive diffusion TTS model that generates speech directly in waveform latent space using a Wav-VAE and a diffusion backbone. It is built for high-quality text-to-speech and voice cloning with a simpler pipeline than multi-stage acoustic approaches. According to the model card, it improves on prior results on the Seed benchmark, reaching speaker-similarity (SIM) scores of 0.818 on Seed-ZH, 0.786 on the English set, and 0.797 on Seed-Hard, while remaining competitive on intelligibility. It supports standard TTS and voice cloning from a prompt audio clip, and is released under the MIT license.
Released: March 30, 2026

Overview

LongCat-AudioDiT-3.5B is Meituan LongCat’s diffusion-based text-to-speech model built directly in waveform latent space rather than mel-spectrogram space. It is designed for high-fidelity speech generation and zero-shot voice cloning, supports Chinese and English, and is positioned as a top-performing open model on the Seed benchmark for speaker similarity and intelligibility.

About Meituan

Meituan is a technology-driven retail company based in Beijing, founded in March 2010. It operates a platform that digitises local goods and services, from food delivery to travel bookings, with the mission “We help people eat better, live better.”

Website: meituan.com

Tools using LongCat AudioDiT 3.5B

No tools found for this model yet.

Last updated: March 31, 2026