TAAFT

OSWorld

369 real desktop tasks across Ubuntu, Windows and macOS apps. Agents observe the desktop through screenshots and act through mouse and keyboard input.

Agentic · Multimodal · accuracy · Max 100.0% · Released Apr 2024
Results: 7
Models scored: 6
Top: GPT 5.4 (75.0%)
Median: 66.3%

Best results

Top primary scores; one row per model.

Frontier over time

Each dot is one model result; the line traces the running best score.
[Chart: Best score over time. Y-axis: score, 0.00 to 100.0. X-axis: date, Sep 2025 to Mar 2026.]
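The running best score described above is a cumulative maximum over results ordered by evaluation date. The following is a minimal sketch of that calculation, assuming the chart is drawn from the rows flagged primary under All results (the Sep 2025 to Mar 2026 axis range suggests this, but the page does not say so). The names `primary_results` and `running_best` are illustrative and do not come from the page.

```python
from datetime import date

# Primary results as (eval date, model, score in percent), copied from the
# "All results" listing. Assumption: the chart uses only these rows.
primary_results = [
    (date(2025, 9, 29), "Claude Sonnet 4.5", 61.4),
    (date(2025, 10, 15), "Claude Haiku 4.5", 50.7),
    (date(2025, 11, 24), "Claude Opus 4.5", 66.3),
    (date(2026, 2, 17), "Claude Sonnet 4.6", 72.5),
    (date(2026, 3, 5), "GPT 5.3 Codex", 74.0),
    (date(2026, 3, 5), "GPT 5.4", 75.0),
]


def running_best(results):
    """Return one (date, best score so far) pair per distinct date, in date order."""
    frontier = []
    best = float("-inf")
    # Sort by date, then by score, so same-day results are folded in ascending order.
    for day, _model, score in sorted(results, key=lambda r: (r[0], r[2])):
        best = max(best, score)
        if frontier and frontier[-1][0] == day:
            frontier[-1] = (day, best)  # keep a single point per date
        else:
            frontier.append((day, best))
    return frontier


frontier = running_best(primary_results)
for day, best in frontier:
    print(day.isoformat(), f"{best:.1f}%")
```

A result that scores below the current best, such as Claude Haiku 4.5 in Oct 2025, still appears as a dot but leaves the line flat at the earlier best.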

All results

Showing all configurations, including non-primary alternates.
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.8 | 83.4% | 0-shot · CoT · agentic | May 28, 2026 | self reported | |
| 2 | Gemini 3.5 Flash | 78.4% | 0-shot · CoT · agentic | May 19, 2026 | self reported | |
| 3 | GPT 5.4 | 75.0% | | Mar 5, 2026 | self reported | primary |
| 4 | GPT 5.3 Codex | 74.0% | | Mar 5, 2026 | self reported | primary |
| 5 | Claude Sonnet 4.6 | 72.5% | | Feb 17, 2026 | self reported | primary |
| 6 | MiniMax M3 | 70.1% | 0-shot · CoT · agentic | Jun 1, 2026 | self reported | |
| 7 | Claude Opus 4.5 | 66.3% | | Nov 24, 2025 | self reported | primary |
| 8 | Claude Sonnet 4.5 | 61.4% | | Sep 29, 2025 | self reported | primary |
| 9 | Claude Haiku 4.5 | 50.7% | | Oct 15, 2025 | self reported | primary |
| 10 | Claude Haiku 4.5 | 50.7% | | Oct 15, 2025 | self reported | primary |
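The summary figures near the top of the page (7 results, 6 models scored, top GPT 5.4 at 75.0%, median 66.3%) match what one gets from the rows flagged primary alone, which would explain why the higher-scoring non-primary rows are not reported as the top. This is an inference from the numbers, not something the page states. A minimal sketch of the check, with `primary_rows` and `summary` as illustrative names:

```python
from statistics import median

# Rows flagged "primary" in the table above, as (model, score in percent).
# Claude Haiku 4.5 is listed twice in the source, so it is kept twice here.
primary_rows = [
    ("GPT 5.4", 75.0),
    ("GPT 5.3 Codex", 74.0),
    ("Claude Sonnet 4.6", 72.5),
    ("Claude Opus 4.5", 66.3),
    ("Claude Sonnet 4.5", 61.4),
    ("Claude Haiku 4.5", 50.7),
    ("Claude Haiku 4.5", 50.7),
]

scores = [score for _model, score in primary_rows]
summary = {
    "results": len(primary_rows),                                # every primary row
    "models_scored": len({model for model, _s in primary_rows}),  # distinct model names
    "top": max(primary_rows, key=lambda row: row[1]),            # highest-scoring row
    "median": median(scores),                                    # middle of 7 sorted scores
}
print(summary)
```

With the duplicate row counted, the result count is 7 while the distinct model count is 6, which agrees with the page's summary.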