MM1.5
With a light layer of instruction tuning, MM1.5 follows multimodal prompts more reliably and grounds its explanations in specific regions of a page or frame. It delivers clearer step-by-step reasoning over tables, charts, and diagrams, improves robustness to screenshot-style UI layouts, and remains a transparent, reproducible recipe aimed at showing which ingredients actually move practical performance. The result is a cleaner path to assistants and tools that can genuinely “look, read, and reason” over documents, dashboards, and everyday visual QA.
Overview
MM1.5 is Apple Research’s refinement of the MM1 multimodal recipe. It keeps MM1’s architecture (a vision encoder feeding a language-model decoder) but upgrades data curation, image resolution, and multi-image/document training, yielding stronger OCR, layout understanding, chart/diagram reasoning, and more grounded visual answers.
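To make the multi-image/document training concrete, here is a minimal sketch of how a multi-image prompt for document QA might be structured. The message schema (typed content parts interleaving images and text) is an assumption modeled on common vision-language chat formats, not MM1.5’s actual API; the function and field names are hypothetical.

```python
# Hypothetical sketch: building a multi-image prompt for a
# vision-language model. The schema is assumed, not MM1.5's real API.

def build_multimodal_prompt(question: str, image_paths: list[str]) -> dict:
    """Interleave image references with a text question as a list of
    typed content parts, the common pattern for multi-image VLM input."""
    parts = [{"type": "image", "path": p} for p in image_paths]
    parts.append({"type": "text", "text": question})
    return {"role": "user", "content": parts}

prompt = build_multimodal_prompt(
    "Which quarter shows the highest revenue in this chart?",
    ["report_page1.png", "report_page2.png"],
)
# prompt["content"] now holds two image parts followed by one text part.
```

A grounded answer from such a model would then reference a specific part of a specific page (for example, a chart region on page 2) rather than the document as a whole.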
