AI benchmarks

Browse benchmarks tracking how AI models perform on reasoning, coding, multimodal and knowledge tasks. Each benchmark has its own leaderboard with the latest results from frontier models.

Benchmarks

Cross-benchmark model leaderboard

Compare how every tracked model ranks across the headline benchmarks in one matrix.

BFCL v3

Berkeley Function-Calling Leaderboard v3

Evaluates function/tool-calling correctness across single, parallel, multi-turn and irrelevance-detection scenarios.

Agentic · Text · 2 results

Last eval 28 Apr 2025

GeneBench-Pro

GeneBench-Pro: Evaluating Multistage Statistical Reasoning in Genomics, Quantitative Biology, and Translational Biomedicine

A 129-problem benchmark testing whether AI agents can perform realistic, multi-stage scientific analyses in genomics, quantitative biology, and translational biomedicine. Each pr…

Agentic · Text · 11 results

Last eval 30 Jun 2026

OSWorld

369 real desktop tasks across Ubuntu, Windows and macOS apps. Agents observe the screen through screenshots and act through mouse and keyboard input.

Agentic · Multimodal · 7 results

Last eval 05 Mar 2026

Tau2-Bench Telecom

τ²-Bench Telecom

Multi-turn customer-service agent benchmark in a telecom domain: the model must take real tool actions while a simulated customer pushes back on incomplete or wrong answers.

Agentic · Text · 14 results

Last eval 12 Jun 2026

Terminal-Bench 2.1

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Terminal-Bench 2.1 evaluates AI agents on 89 hard, realistic tasks in command-line terminal environments inspired by real workflows, spanning domains such as compiling code, trai…

Agentic · Text · 5 results

Last eval 30 Jun 2026

Terminal-Bench Hard

The hardest split of Terminal-Bench: agents must complete real CLI tasks (debugging, system admin, multi-step automation) inside a sandboxed terminal.

Agentic · Text · 26 results

Last eval 12 Jun 2026

ViralBench

ViralBench: The First AI Marketing Benchmark

ViralBench evaluates AI models on their ability to autonomously generate viral TikTok content in the fitness space, running each model twice daily in an agentic loop with tools f…

Agentic · Multimodal · 3 results

Last eval 26 Jun 2026