OSWorld
369 real desktop tasks spanning Ubuntu, Windows, and macOS applications. Agents observe the desktop through screenshots and act with mouse and keyboard input.
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | GPT 5.4 | 75.0% | — | Mar 5, 2026 | self reported | primary |
| 2 | GPT 5.3 Codex | 74.0% | — | Mar 5, 2026 | self reported | primary |
| 3 | Claude Sonnet 4.6 | 72.5% | — | Feb 17, 2026 | self reported | primary |
| 4 | Claude Opus 4.5 | 66.3% | — | Nov 24, 2025 | self reported | primary |
| 5 | Claude Sonnet 4.5 | 61.4% | — | Sep 29, 2025 | self reported | primary |
| 6 | Claude Haiku 4.5 | 50.7% | — | Oct 15, 2025 | self reported | primary |
