LongCat AudioDiT 3.5B

LongCat-AudioDiT-3.5B is a non-autoregressive diffusion TTS model that generates speech directly in waveform latent space using a Wav-VAE and a diffusion backbone. It targets high-quality text-to-speech and voice cloning with a simpler pipeline than multi-stage acoustic approaches. According to the model card, it improves on prior results on the Seed benchmark, reaching speaker-similarity (SIM) scores of 0.818 on Seed-ZH, 0.786 on Seed-EN, and 0.797 on Seed-Hard, while remaining competitive on intelligibility. It supports standard TTS and voice cloning from a prompt audio clip, and is released under the MIT license.
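
For readers unfamiliar with the SIM figures quoted above: speaker similarity in zero-shot TTS evaluation is commonly reported as the cosine similarity between a speaker embedding of the prompt audio and one of the generated audio. The sketch below illustrates only that arithmetic. It assumes the embeddings have already been extracted by some speaker-verification model; the vectors and variable names are hypothetical toy values, not output from LongCat-AudioDiT or its evaluation scripts.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (assumption): real ones would come from a
# speaker-verification model applied to the prompt and generated audio.
prompt_embedding = [0.2, 0.7, 0.1, 0.5]
generated_embedding = [0.25, 0.6, 0.05, 0.55]

sim = cosine_similarity(prompt_embedding, generated_embedding)
print(f"SIM = {sim:.3f}")
```

A score closer to 1 means the generated voice is closer to the prompt speaker in embedding space. Benchmark scores are typically averaged over many test utterances.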

Overview

LongCat-AudioDiT-3.5B is Meituan LongCat’s diffusion-based text-to-speech model built directly in waveform latent space rather than mel-spectrogram space. It is designed for high-fidelity speech generation and zero-shot voice cloning, supports Chinese and English, and is positioned as a top-performing open model on the Seed benchmark for speaker similarity and intelligibility.

🔊Text to speech 🗣️Voice cloning 🔊Voice enhancement

About Meituan

Meituan is a technology-driven retail company based in Beijing, founded in March 2010. It operates a platform that digitises local goods and services, from food delivery to travel bookings, with the mission “We help people eat better, live better.”

Website: meituan.com

Last updated: March 31, 2026
