OSWorld

369 real desktop tasks across Ubuntu, Windows and macOS apps. Agents act through screenshots + mouse/keyboard.

Tags: Agentic, Multimodal
Metric: accuracy (max 100.0%)
Released: Apr 2024

Results: 7
Models scored: 6
Top: GPT 5.4 (75.0%)
Median: 66.3%

Best results

Top primary scores; one row per model.

Frontier over time

Each dot is one model result; the line traces the running best score.
[Chart: best score over time. Y-axis: score, 0.0 to 100.0. X-axis: Sep 2025 to Mar 2026.]
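The running-best line can be reproduced from the dated scores in the All results table on this page. The following is a minimal sketch, not the site's actual plotting code: the `results` list and the `running_best` helper are illustrative names, and the tie-break for results that share an eval date (lower score first) is an assumption.

```python
from datetime import date
from itertools import accumulate

# (eval date, model, score in %) copied from the results table on this page.
results = [
    (date(2025, 9, 29), "Claude Sonnet 4.5", 61.4),
    (date(2025, 10, 15), "Claude Haiku 4.5", 50.7),
    (date(2025, 11, 24), "Claude Opus 4.5", 66.3),
    (date(2026, 2, 17), "Claude Sonnet 4.6", 72.5),
    (date(2026, 3, 5), "GPT 5.3 Codex", 74.0),
    (date(2026, 3, 5), "GPT 5.4", 75.0),
]


def running_best(rows):
    """Return (eval date, best score so far) pairs in chronological order."""
    # Assumption: same-day results are ordered by score, lowest first.
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    scores = [r[2] for r in ordered]
    # accumulate(..., max) yields the running maximum of the score sequence.
    return [(r[0], best) for r, best in zip(ordered, accumulate(scores, max))]


frontier = running_best(results)
for day, best in frontier:
    print(day.isoformat(), best)
```

A result below the current best, such as Claude Haiku 4.5 at 50.7%, still appears as a dot but leaves the line flat at the earlier 61.4%.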

All results

Showing one canonical row per model.
#  Model              Score  Conditions  Eval date     Source         Flags
1  GPT 5.4            75.0%  -           Mar 5, 2026   self reported  primary
2  GPT 5.3 Codex      74.0%  -           Mar 5, 2026   self reported  primary
3  Claude Sonnet 4.6  72.5%  -           Feb 17, 2026  self reported  primary
4  Claude Opus 4.5    66.3%  -           Nov 24, 2025  self reported  primary
5  Claude Sonnet 4.5  61.4%  -           Sep 29, 2025  self reported  primary
6  Claude Haiku 4.5   50.7%  -           Oct 15, 2025  self reported  primary
7  Claude Haiku 4.5   50.7%  -           Oct 15, 2025  self reported  primary
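The summary figures on this page (7 results, 6 models scored, top 75.0%, median 66.3%) can be checked against the rows above. This is a small sketch with illustrative variable names. It suggests, as an inference rather than anything the page documents, that the median is taken over all seven result rows and not over the six distinct models.

```python
from statistics import median

# (model, score in %) for every row listed in the results table.
rows = [
    ("GPT 5.4", 75.0),
    ("GPT 5.3 Codex", 74.0),
    ("Claude Sonnet 4.6", 72.5),
    ("Claude Opus 4.5", 66.3),
    ("Claude Sonnet 4.5", 61.4),
    ("Claude Haiku 4.5", 50.7),
    ("Claude Haiku 4.5", 50.7),
]

n_results = len(rows)                        # number of result rows
n_models = len({name for name, _ in rows})   # number of distinct model names
top_name, top_score = max(rows, key=lambda r: r[1])

# Median over every result row (duplicate model rows included).
median_all_rows = median(score for _, score in rows)

# Median over the best score of each distinct model, for comparison.
best_per_model = {}
for name, score in rows:
    best_per_model[name] = max(score, best_per_model.get(name, score))
median_per_model = round(median(best_per_model.values()), 1)

print(n_results, n_models, top_name, top_score)
print(median_all_rows, median_per_model)
```

Only the all-rows median reproduces the 66.3% shown in the summary; the per-model median comes out higher.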