AI benchmarks
Browse benchmarks tracking how AI models perform on reasoning, coding, multimodal and knowledge tasks. Each benchmark has its own leaderboard with the latest results from frontier models.
Cross-benchmark model leaderboard
Compare how every tracked model ranks across the headline benchmarks in one matrix.
IFBench
Measures how reliably a model follows complex multi-constraint instructions, a known weak spot for many otherwise strong models.
IFEval
Verifiable instruction following: 25 instruction types whose compliance can be checked deterministically (e.g. word counts, formats).
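To make "checked deterministically" concrete, here is a minimal sketch of what such checks might look like. It is not IFEval's actual code: the function names (`check_max_words`, `check_valid_json`, `check_bullet_count`, `compliance_rate`) and the three constraints chosen are illustrative assumptions, picked only because they can be verified without a judge model.

```python
import json
import re
from typing import Callable, List


def check_max_words(response: str, limit: int) -> bool:
    """Word-count constraint: whitespace-separated tokens must not exceed limit."""
    return len(response.split()) <= limit


def check_valid_json(response: str) -> bool:
    """Format constraint: the whole response must parse as JSON."""
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return False
    return True


def check_bullet_count(response: str, expected: int) -> bool:
    """Format constraint: exactly `expected` lines start with '-' or '*'."""
    bullets = [ln for ln in response.splitlines() if re.match(r"^\s*[-*]\s+", ln)]
    return len(bullets) == expected


def compliance_rate(response: str, checks: List[Callable[[str], bool]]) -> float:
    """Fraction of checks the response satisfies (hypothetical scoring rule)."""
    return sum(check(response) for check in checks) / len(checks)


response = "- one\n- two\n- three"
checks = [
    lambda r: check_max_words(r, 10),
    lambda r: check_bullet_count(r, 3),
    check_valid_json,
]
rate = compliance_rate(response, checks)  # two of the three checks pass
```

Because each check is a pure function of the response text, the same output always receives the same verdict, which is what distinguishes this style of evaluation from grading by a human or a judge model.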
MGSM
250 grade-school math problems from GSM8K translated into 10 typologically diverse languages. Tests cross-lingual mathematical reasoning.
RULER 128k
Synthetic long-context evaluation suite covering needle-in-a-haystack retrieval, multi-key retrieval and multi-hop tracing at a 128k-token context length.
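As a rough illustration of the needle-in-a-haystack idea, the sketch below hides one fact inside repeated filler text and scores an answer by how many expected values it contains. It is a simplified assumption, not RULER's implementation: `build_haystack` and `retrieval_score` are hypothetical helpers, and context size is counted in filler sentences rather than tokens.

```python
from typing import List


def build_haystack(needle: str, filler: str, n_sentences: int, depth: float) -> str:
    """Insert `needle` into `n_sentences` copies of `filler`.

    `depth` is a fraction in [0, 1]: 0 places the needle at the start,
    1 places it at the end.
    """
    sentences = [filler] * n_sentences
    position = int(depth * n_sentences)
    sentences.insert(position, needle)
    return " ".join(sentences)


def retrieval_score(answer: str, expected_values: List[str]) -> float:
    """Fraction of expected values that appear verbatim in the model's answer."""
    found = sum(1 for value in expected_values if value in answer)
    return found / len(expected_values)


context = build_haystack(
    needle="The magic number for alpha is 7481.",
    filler="The grass is green.",
    n_sentences=100,
    depth=0.5,
)
# A multi-key variant would hide several needles and expect several values back.
score = retrieval_score("The number for alpha is 7481.", ["7481", "9920"])
```

Sweeping `depth` and the haystack length is the usual way to see where retrieval starts to degrade as the context grows.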
