AI benchmarks
Browse benchmarks tracking how AI models perform on reasoning, coding, multimodal and knowledge tasks. Each benchmark has its own leaderboard with the latest results from frontier models.
Cross-benchmark model leaderboard
Compare how every tracked model ranks across the headline benchmarks in one matrix.
AIME 2024
30 problems from the 2024 AIME I and II contests. The standard high-school competition math eval until AIME 2025 superseded it as the primary signal.
AIME 2025
30 problems from the 2025 AIME I and II contests. High-school competition math with integer answers 0-999; valuable post-cutoff signal for 2024-trained models.
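Because every AIME answer is an integer from 0 to 999, scoring reduces to exact match. The following is a minimal sketch of that check, not the harness used by any particular leaderboard; the function name `score_aime` and the choice to count malformed or out-of-range predictions as wrong are assumptions made for illustration.

```python
def score_aime(predictions, answers):
    """Return the fraction of problems answered correctly by exact integer match.

    predictions: list of model outputs, one per problem (hypothetical format:
                 an int, or None when no answer could be parsed).
    answers:     list of reference answers, each an int in 0..999.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one problem is required")

    correct = 0
    for pred, gold in zip(predictions, answers):
        if not 0 <= gold <= 999:
            raise ValueError(f"reference answer out of AIME range: {gold}")
        # Assumption: anything that is not an in-range integer counts as wrong.
        is_int = isinstance(pred, int) and not isinstance(pred, bool)
        if is_int and 0 <= pred <= 999 and pred == gold:
            correct += 1
    return correct / len(answers)
```

For a full 30-problem set, multiplying the result by 100 gives a percentage of the kind a leaderboard would typically show.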
GSM8K (Saturated)
8.5k grade-school math word problems requiring 2-8 step arithmetic reasoning. Saturated by all frontier models; mostly useful as a smoke test today.
MATH (Saturated)
12.5k competition mathematics problems (AMC, AIME, USAMO style). Reported as overall % or split by Level 1-5 difficulty. The "easy" levels are now saturated; Level 5 still discri…
MATH-500
500-question subset of MATH popularised by OpenAI's o-series releases. Reported widely as the standard 'MATH' number on modern leaderboards.
USAMO 2025
Six proof-based problems from the 2025 USAMO. Graded out of 42 (7 points per problem) by expert judges.
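The 42-point total follows from six problems at 7 points each. As a rough sketch of how per-problem judge scores could be aggregated into a total and a percentage, assuming integer scores per problem (the names `usamo_total` and `usamo_percent` are hypothetical, not taken from any leaderboard's code):

```python
NUM_PROBLEMS = 6
POINTS_PER_PROBLEM = 7
MAX_SCORE = NUM_PROBLEMS * POINTS_PER_PROBLEM  # 42


def usamo_total(problem_scores):
    """Sum six per-problem scores, each an integer from 0 to 7."""
    if len(problem_scores) != NUM_PROBLEMS:
        raise ValueError(f"expected {NUM_PROBLEMS} scores, got {len(problem_scores)}")
    for score in problem_scores:
        if not 0 <= score <= POINTS_PER_PROBLEM:
            raise ValueError(f"score out of range 0..{POINTS_PER_PROBLEM}: {score}")
    return sum(problem_scores)


def usamo_percent(problem_scores):
    """Express the total as a percentage of the 42-point maximum."""
    return 100 * usamo_total(problem_scores) / MAX_SCORE
```

For example, `usamo_total([7, 0, 3, 7, 1, 0])` gives 18 of 42 points.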
