SWE-bench Multimodal

A variant of SWE-bench in which issues include screenshots, diagrams, and other visual context. It tests multimodal software-engineering ability.

Coding · Multimodal · Accuracy · Max 100.0% · Released Oct 2024

Results

Models scored

Top: Deepseek 3.2, 70.2%

Median: 70.2%

Best results

Top primary scores; one row per model.

Each dot is one model result; the line traces the running best score.

Not enough data to plot a trend yet.
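
The "running best score" line described above can be read as a cumulative maximum over results ordered by evaluation date. The sketch below is only an illustration under that assumption, not the site's actual plotting code; the function name `running_best` and the `(eval_date, score)` tuple layout are hypothetical, and the only data point taken from this page is the 70.2% result dated 01 Dec 2025.

```python
from datetime import date
from itertools import accumulate

# Hypothetical input layout: (eval_date, score) pairs.
# Only this single entry comes from the results table on this page.
results = [(date(2025, 12, 1), 70.2)]

def running_best(results):
    """Return (eval_date, best_score_so_far) pairs in chronological order."""
    ordered = sorted(results)  # sort by date, then by score
    # accumulate(..., max) yields the cumulative maximum of the scores
    best = accumulate((score for _, score in ordered), max)
    return [(d, b) for (d, _), b in zip(ordered, best)]

print(running_best(results))
```

With a single result the running best is just that one point, which is consistent with the note that there is not yet enough data to plot a trend.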

Showing all configurations, including non-primary alternates.

#	Model	Score	Conditions	Eval date	Source	Flags
1	Deepseek 3.2	70.2%	—	01 Dec 2025	Paper	Primary Verified