SWE-bench Multimodal

Variant of SWE-bench where issues include screenshots, diagrams and other visual context. Tests multimodal software-engineering ability.

Tags: Coding, Multimodal · Metric: Accuracy · Max: 100.0% · Released: Oct 2024
Results: 1
Models scored: 1
Top: Deepseek 3.2 (70.2%)
Median: 70.2%

Best results

Top primary scores; one row per model.

Frontier over time

Each dot is one model result; the line traces the running best score.
Not enough data to plot a trend yet.
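
The "running best score" described above can be sketched as a cumulative maximum over results sorted by evaluation date. This is a minimal illustration, not the site's actual plotting code: the function name `running_best` and the tuple layout are assumptions, and the only data point used is the single result listed on this page.

```python
from datetime import date
from typing import Iterable, List, Tuple

# Hypothetical record layout: (eval date, model name, score in percent).
Result = Tuple[date, str, float]


def running_best(results: Iterable[Result]) -> List[Tuple[date, float]]:
    """Return (eval date, best score so far) for each result, in date order."""
    frontier: List[Tuple[date, float]] = []
    best = float("-inf")
    for eval_date, _model, score in sorted(results, key=lambda r: r[0]):
        best = max(best, score)  # the frontier never decreases
        frontier.append((eval_date, best))
    return frontier


# The one result currently listed for this benchmark.
results = [(date(2025, 12, 1), "Deepseek 3.2", 70.2)]
frontier = running_best(results)
print(frontier)
```

With a single result the frontier is a single point, which is consistent with the page reporting that there is not enough data to plot a trend.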

All results

Showing all configurations, including non-primary alternates.
#  Model         Score  Conditions  Eval date    Source  Flags
1  Deepseek 3.2  70.2%              01 Dec 2025  Paper   Primary, Verified