AI benchmarks

Browse benchmarks tracking how AI models perform on reasoning, coding, multimodal and knowledge tasks. Each benchmark has its own leaderboard with the latest results from frontier models.

Benchmarks

Cross-benchmark model leaderboard

Compare how every tracked model ranks across the headline benchmarks in one matrix.


AIME 2024

American Invitational Mathematics Examination 2024

30 problems from the 2024 AIME I and II contests. It was the standard high-school competition math eval until AIME 2025 superseded it as the primary signal.

Last eval: 03 Apr 2026

AIME 2025

American Invitational Mathematics Examination 2025

30 problems from the 2025 AIME I and II contests. High-school competition math with integer answers from 0 to 999; a valuable post-cutoff signal for models trained in 2024.

Trinity Large Thinking

96.3%

Last eval: 05 May 2026

GSM8K (Saturated)

Grade School Math 8K

8.5k grade-school math word problems requiring 2-8 step arithmetic reasoning. Saturated by all frontier models; mostly useful as a smoke test today.

Last eval: 03 Apr 2026

MATH (Saturated)

MATH (Hendrycks)

12.5k competition mathematics problems (AMC, AIME, USAMO style). Reported as overall % or split by Level 1-5 difficulty. The "easy" levels are now saturated; Level 5 still discri…

Last eval: 03 Apr 2026