TAAFT

OSWorld

369 real desktop tasks across Ubuntu, Windows and macOS apps. Agents act through screenshots + mouse/keyboard.

Agentic · Multimodal · Accuracy (max 100.0%) · Released Apr 2024

Results: 7
Models scored: 6
Top: GPT 5.4 (75.0%)
Median: 66.3%
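The page does not say how the median is computed. As a rough check, the sketch below takes the scores from the "All results" table and compares two plausible readings: a median over all 7 result rows, and a median over the best score of each of the 6 models. The function names are hypothetical and the two readings are assumptions, not the site's documented method.

```python
from statistics import median

# Scores copied from the "All results" table on this page
# (7 result rows covering 6 distinct models).
results = [
    ("GPT 5.4", 75.0),
    ("GPT 5.3 Codex", 74.0),
    ("Claude Sonnet 4.6", 72.5),
    ("Claude Opus 4.5", 66.3),
    ("Claude Sonnet 4.5", 61.4),
    ("Claude Haiku 4.5", 50.7),
    ("Claude Haiku 4.5", 50.7),
]


def median_over_results(rows):
    """Median across every result row, duplicates included (assumed reading 1)."""
    return median(score for _, score in rows)


def median_over_models(rows):
    """Median across the best score per model (assumed reading 2)."""
    best = {}
    for model, score in rows:
        best[model] = max(score, best.get(model, score))
    return median(best.values())


if __name__ == "__main__":
    print("median over results:", median_over_results(results))
    print("median over models:", round(median_over_models(results), 1))
```

Only the per-result reading reproduces the 66.3% shown above, which suggests the headline median counts all 7 results rather than the 6 distinct models.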

Best results

Top primary scores; one row per model.

Frontier over time

Each dot is one model result; the line traces the running best score.
[Chart: best score over time. Y-axis: score, 0 to 100. X-axis: evaluation date, Sep 2025 to Mar 2026.]
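The running best line can be reproduced from the results table with a cumulative maximum over results sorted by evaluation date. The sketch below is a minimal illustration using the scores and dates listed on this page. The function name `running_best` and the tie-break for results sharing a date (lower score first) are assumptions, not the site's documented behavior.

```python
from datetime import date

# (model, score, evaluation date) copied from the "All results" table.
results = [
    ("GPT 5.4", 75.0, date(2026, 3, 5)),
    ("GPT 5.3 Codex", 74.0, date(2026, 3, 5)),
    ("Claude Sonnet 4.6", 72.5, date(2026, 2, 17)),
    ("Claude Opus 4.5", 66.3, date(2025, 11, 24)),
    ("Claude Sonnet 4.5", 61.4, date(2025, 9, 29)),
    ("Claude Haiku 4.5", 50.7, date(2025, 10, 15)),
]


def running_best(rows):
    """Return (date, best score so far) for each result in date order."""
    frontier = []
    best = float("-inf")
    # Assumption: same-day results are ordered by ascending score.
    for _model, score, day in sorted(rows, key=lambda r: (r[2], r[1])):
        best = max(best, score)
        frontier.append((day, best))
    return frontier


if __name__ == "__main__":
    for day, best in running_best(results):
        print(day.isoformat(), best)
```

A result below the current best, such as Claude Haiku 4.5 at 50.7%, still appears as a dot but leaves the line flat at the earlier 61.4%.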

All results

Showing one canonical row per model.
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|-------|-------|------------|-----------|--------|-------|
| 1 | GPT 5.4 | 75.0% | | 05 Mar 2026 | Self-reported | Primary |
| 2 | GPT 5.3 Codex | 74.0% | | 05 Mar 2026 | Self-reported | Primary |
| 3 | Claude Sonnet 4.6 | 72.5% | | 17 Feb 2026 | Self-reported | Primary |
| 4 | Claude Opus 4.5 | 66.3% | | 24 Nov 2025 | Self-reported | Primary |
| 5 | Claude Sonnet 4.5 | 61.4% | | 29 Sep 2025 | Self-reported | Primary |
| 6 | Claude Haiku 4.5 | 50.7% | | 15 Oct 2025 | Self-reported | Primary |
| 7 | Claude Haiku 4.5 | 50.7% | | 15 Oct 2025 | Self-reported | Primary |