Multilingual speech and text translation made easy.
SeamlessM4T is a foundational multimodal model for speech translation that enables high-quality translation between different languages. Its primary purpose is to facilitate effortless communication through both speech and text.

With the increasing interconnectedness of our world and the abundance of multilingual content available, the ability to understand and communicate in any language is becoming more important than ever.

SeamlessM4T supports five translation tasks: automatic speech recognition for nearly 100 languages, speech-to-text translation for nearly 100 input and output languages, speech-to-speech translation for nearly 100 input languages and 35 output languages (including English), text-to-text translation for nearly 100 languages, and text-to-speech translation for nearly 100 input languages and 35 output languages (including English).
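The supported tasks described above can be summarized as a small lookup table keyed by input and output modality. The Python sketch below is only an illustration of that breakdown: the `Task` class, the `TASKS` table, and the helper functions are hypothetical names chosen for this example and are not part of the SeamlessM4T API. The short task codes are used here simply as convenient labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One supported task (illustrative structure, not the model's API)."""
    name: str
    source: str       # input modality: "speech" or "text"
    target: str       # output modality: "speech" or "text"
    translates: bool  # False for ASR, which transcribes without translating


# Task table taken from the description above; the codes are just labels.
TASKS = {
    "ASR": Task("automatic speech recognition", "speech", "text", False),
    "S2TT": Task("speech-to-text translation", "speech", "text", True),
    "S2ST": Task("speech-to-speech translation", "speech", "speech", True),
    "T2TT": Task("text-to-text translation", "text", "text", True),
    "T2ST": Task("text-to-speech translation", "text", "speech", True),
}


def tasks_for(source: str, target: str) -> list[str]:
    """Return the codes of tasks matching the given input/output modalities."""
    return [
        code
        for code, task in TASKS.items()
        if task.source == source and task.target == target
    ]


def speech_output_tasks() -> list[str]:
    """Tasks that produce speech, which the text says cover 35 output languages."""
    return [code for code, task in TASKS.items() if task.target == "speech"]


if __name__ == "__main__":
    print(tasks_for("speech", "text"))
    print(speech_output_tasks())
```

Grouping the tasks this way makes the asymmetry in the description easier to see: every task accepts nearly 100 input languages, while the two tasks that produce speech are limited to 35 output languages.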

Unlike existing systems that only cover a fraction of the world's languages, SeamlessM4T addresses the challenges of limited language coverage and the reliance on separate subsystems by providing a unified multilingual model.

It aims to bridge the gap between low- and mid-resource languages and high-resource languages, improving performance for both groups. Furthermore, SeamlessM4T can implicitly recognize the source language without the need for a separate language identification model.

The development of SeamlessM4T builds upon previous advancements made by Meta and others, such as the No Language Left Behind (NLLB) machine translation model, which supports 200 languages, and the Universal Speech Translator for Hokkien, a language without a widely used writing system.

SeamlessM4T is built on the multitask UnitY model architecture, which directly generates translated text and speech and supports automatic speech recognition as well as text-to-text, text-to-speech, speech-to-text, and speech-to-speech translation.

Pros and Cons


Pros

Supports nearly 100 languages
Includes speech-to-speech translation
Text-to-text and text-to-speech translations
Implicit source language recognition
Single unified multilingual model
Improved performance on high-resource languages
Addresses low-resource language limitations
Improves mid-resource language translation
Built on multitask UnitY model
Enhanced by fairseq2 toolkit
Supports wide variety of translation tasks
Effortless communication through speech and text
No need for separate language identification
Covers universal speech translator concept
Open-source release under CC BY-NC 4.0
Released metadata of large translation dataset
Unified model for all translation tasks
Built using modern PyTorch ecosystem
Lightweight, easily composable toolkit
Direct generation of translated text and speech
Automatic speech recognition built in
Improved training stability
Redesigned fairseq for more efficiency
High-quality end-to-end data mining
Extensive language and modality coverage
SONAR for multilingual similarity search
Teacher-student approach for embedding space extension
433,000 hours of speech-text aligned training data
State-of-the-art performance across multiple tasks
Toxicity and bias management mechanisms
Significant toxicity reduction on speech translations
Gender bias quantification in translation
Improved robustness against background noises
Better performance on speaker variations
Reduced toxicity and enhanced safety
Speech-to-text translation improvements
Demonstrates state-of-the-art results
Significant improvement for low-resource languages
Strong performance on high-resource languages
Easily integrable into existing systems


Cons

Supports nearly 100 languages, not the 200 covered by NLLB
Limited speech-to-speech translation languages
Dependent on fairseq2
Designed for specific UnitY architecture
Possible mistranscription and bias
Doesn't handle speech-to-speech well
Requires text-to-text for accuracy
Doesn't handle background noises well
May need constant improvements

