MM1

By Apple
MM1 is Apple’s research program for building capable vision–language models with a transparent, reproducible recipe. A high-quality image encoder feeds a decoder-only language model through a lightweight vision–language connector, so the system can “look and reason” over images and text in a single pass. Rather than chasing size alone, MM1 emphasizes the training mixture: large volumes of image–caption pairs are blended with interleaved sequences where text and images appear together, plus pure text to strengthen language fluency. The work shows that mixture ratios, image resolution, and the number of visual tokens often matter more than connector design or raw parameter count, especially for reading small text, following layouts, and reasoning over diagrams and charts. With light instruction tuning, MM1 follows multimodal prompts, reasons over multiple images, and supports few-shot, in-context prompting. As a research line, it is meant to clarify which ingredients actually move the needle for practical OCR, document understanding, and visual QA, and to provide a clean foundation others can adapt for assistants, analytics, and developer tools.
Released: March 14, 2024
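
The architecture described above can be summarized in a few lines of code. The sketch below is illustrative only, not Apple’s implementation: the module names, sizes, and the pooling-style connector are assumptions standing in for MM1’s actual components (a ViT-style image encoder, a vision–language connector, and a decoder-only LLM).

import torch
import torch.nn as nn

class Connector(nn.Module):
    # Projects vision features to the LM width and pools them to a fixed
    # number of "visual tokens" (a pooling-style connector; sizes are made up).
    def __init__(self, vision_dim=1024, lm_dim=2048, num_visual_tokens=144):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(num_visual_tokens)
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, patch_feats):                      # (B, patches, vision_dim)
        pooled = self.pool(patch_feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(pooled)                         # (B, visual_tokens, lm_dim)

class ToyMM1Style(nn.Module):
    # Decoder-only LM that consumes visual tokens spliced into the text sequence.
    def __init__(self, vocab=32000, lm_dim=2048):
        super().__init__()
        self.vision_encoder = nn.Linear(3 * 14 * 14, 1024)       # stand-in for a ViT
        self.connector = Connector(1024, lm_dim)
        self.text_embed = nn.Embedding(vocab, lm_dim)
        block = nn.TransformerEncoderLayer(lm_dim, nhead=8, batch_first=True)
        self.lm = nn.TransformerEncoder(block, num_layers=2)     # stand-in for the LLM
        self.lm_head = nn.Linear(lm_dim, vocab)

    def forward(self, image_patches, text_ids):
        vis = self.connector(self.vision_encoder(image_patches))  # visual tokens
        txt = self.text_embed(text_ids)                            # text token embeddings
        seq = torch.cat([vis, txt], dim=1)                         # one joint sequence
        return self.lm_head(self.lm(seq))

model = ToyMM1Style()
logits = model(torch.randn(1, 576, 3 * 14 * 14), torch.randint(0, 32000, (1, 16)))
print(logits.shape)   # torch.Size([1, 160, 32000]) -> 144 visual + 16 text positions

The point the sketch captures is that image features become tokens in the same sequence the language model decodes, rather than being reached through a separate cross-attention path.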

Overview

MM1 is Apple Research’s multimodal LLM blueprint: a vision encoder feeding a decoder-only text model through a vision–language connector, pretrained on a balanced mix of image–caption, interleaved image–text, and text-only data. It highlights how data quality, interleaving, and image resolution, not just scale, drive strong OCR, document and chart reasoning, and grounded visual answers.
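
To make the “balanced mix” point concrete, the snippet below sketches a mixture-weighted data sampler. The ratios are placeholders chosen for illustration, not the values reported in the MM1 paper; the idea it demonstrates is that caption, interleaved, and text-only sources are drawn according to fixed mixture weights rather than raw dataset size.

import random
from collections import Counter

# Mixture weights are illustrative placeholders, not the paper's reported values.
MIXTURE = {
    "image_caption": 0.45,   # images paired with short captions
    "interleaved":   0.45,   # documents where text and images alternate
    "text_only":     0.10,   # plain text, to keep language fluency strong
}

def sample_source(rng=random):
    # Pick which data source the next pretraining example is drawn from.
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=1)[0]

# Simulate 10,000 draws to see the realized proportions.
print(Counter(sample_source() for _ in range(10_000)))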

About Apple

Industry: Technology, Information and Media
Company Size: 12000
Location: Cupertino, California, US
Website: apple.com

Tools using MM1

No tools found for this model yet.

Last updated: February 12, 2026