MM1

By Apple
MM1 is Apple’s research program for building capable vision–language models with a transparent, reproducible recipe. A high-quality image encoder feeds a decoder-only language model through a lightweight vision–language connector, so the system can “look and reason” over images and text in a single pass. Rather than chasing size alone, MM1 emphasizes the training mixture: large volumes of image–caption pairs are blended with interleaved sequences where text and images appear together, plus pure text to strengthen language fluency. The work shows that mixture ratios, image resolution, and the number of visual tokens often matter more than connector design or raw parameter count, especially for reading small text, following layouts, and reasoning over diagrams and charts. With light instruction tuning, MM1 follows multimodal prompts, reasons over multiple images, and supports few-shot, in-context prompting. As a research line, it is meant to clarify which ingredients actually move the needle for practical OCR, document understanding, and visual QA, and to provide a clean foundation others can adapt for assistants, analytics, and developer tools.
Released: March 14, 2024
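
The architecture described above can be summarized in a few lines of code. The sketch below is illustrative only, not Apple’s implementation: the module names, sizes, and the pooling-style connector are assumptions standing in for MM1’s actual components (a ViT-style image encoder, a vision–language connector, and a decoder-only LLM).

import torch
import torch.nn as nn

class Connector(nn.Module):
    # Projects vision features to the LM width and pools them to a fixed
    # number of "visual tokens" (a pooling-style connector; sizes are made up).
    def __init__(self, vision_dim=1024, lm_dim=2048, num_visual_tokens=144):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(num_visual_tokens)
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, patch_feats):                      # (B, patches, vision_dim)
        pooled = self.pool(patch_feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(pooled)                         # (B, visual_tokens, lm_dim)

class ToyMM1Style(nn.Module):
    # Decoder-only LM that consumes visual tokens spliced into the text sequence.
    def __init__(self, vocab=32000, lm_dim=2048):
        super().__init__()
        self.vision_encoder = nn.Linear(3 * 14 * 14, 1024)       # stand-in for a ViT
        self.connector = Connector(1024, lm_dim)
        self.text_embed = nn.Embedding(vocab, lm_dim)
        block = nn.TransformerEncoderLayer(lm_dim, nhead=8, batch_first=True)
        self.lm = nn.TransformerEncoder(block, num_layers=2)     # stand-in for the LLM
        self.lm_head = nn.Linear(lm_dim, vocab)

    def forward(self, image_patches, text_ids):
        vis = self.connector(self.vision_encoder(image_patches))  # visual tokens
        txt = self.text_embed(text_ids)                            # text token embeddings
        seq = torch.cat([vis, txt], dim=1)                         # one joint sequence
        return self.lm_head(self.lm(seq))

model = ToyMM1Style()
logits = model(torch.randn(1, 576, 3 * 14 * 14), torch.randint(0, 32000, (1, 16)))
print(logits.shape)   # torch.Size([1, 160, 32000]) -> 144 visual + 16 text positions

The point the sketch captures is that image features become tokens in the same sequence the language model decodes, rather than being reached through a separate cross-attention path.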

Overview

MM1 is Apple Research’s multimodal LLM blueprint: a vision encoder feeding a decoder-only text model through a vision–language connector, pretrained on a balanced mix of image–caption, interleaved image–text, and text-only data. It highlights how data quality, interleaving, and image resolution, not just scale, drive strong OCR, document and chart reasoning, and grounded visual answers.
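
To make the “balanced mix” point concrete, the snippet below sketches a mixture-weighted data sampler. The ratios are placeholders chosen for illustration, not the values reported in the MM1 paper; the idea it demonstrates is that caption, interleaved, and text-only sources are drawn according to fixed mixture weights rather than raw dataset size.

import random
from collections import Counter

# Mixture weights are illustrative placeholders, not the paper's reported values.
MIXTURE = {
    "image_caption": 0.45,   # images paired with short captions
    "interleaved":   0.45,   # documents where text and images alternate
    "text_only":     0.10,   # plain text, to keep language fluency strong
}

def sample_source(rng=random):
    # Pick which data source the next pretraining example is drawn from.
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=1)[0]

# Simulate 10,000 draws to see the realized proportions.
print(Counter(sample_source() for _ in range(10_000)))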

About Apple

Industry: Technology, Information and Media
Company Size: 12000
Location: Cupertino, California, US
Website: apple.com

Tools using MM1

No tools found for this model yet.

Last updated: February 12, 2026