Overview
MM1.5 is Apple Research’s refinement of the MM1 multimodal recipe. It keeps MM1’s architecture (a vision encoder feeding a decoder-only language model) but upgrades data curation, image resolution, and multi-image/document training, yielding stronger OCR, layout understanding, chart and diagram reasoning, and more grounded visual answers.
Description
With a light layer of instruction tuning, MM1.5 follows multimodal prompts more reliably and grounds its explanations in specific regions of a page or frame. It delivers clearer step-by-step reasoning over tables, charts, and diagrams, improves robustness on UI screenshots, and remains a transparent, reproducible recipe aimed at showing which training ingredients actually move practical performance. The result is a cleaner path to assistants and tools that can genuinely “look, read, and reason” across documents, dashboards, and everyday visual QA.
