AI benchmarks
Browse benchmarks tracking how AI models perform on reasoning, coding, multimodal and knowledge tasks. Each benchmark has its own leaderboard with the latest results from frontier models.
Cross-benchmark model leaderboard
Compare how every tracked model ranks across the headline benchmarks in one matrix.
BFCL v3
Evaluates function/tool-calling correctness across single, parallel, multi-turn and irrelevance-detection scenarios.
OSWorld
369 real desktop tasks across Ubuntu, Windows and macOS apps. Agents act through screenshots + mouse/keyboard.
Tau2-Bench Telecom
Multi-turn customer-service agent benchmark in a telecom domain: the model must take real tool actions while a simulated customer pushes back on incomplete or wrong answers.
Terminal-Bench Hard
The hardest split of Terminal-Bench: agents must complete real CLI tasks (debugging, system admin, multi-step automation) inside a sandboxed terminal.
