AI benchmarks

Browse benchmarks tracking how AI models perform on reasoning, coding, multimodal and knowledge tasks. Each benchmark has its own leaderboard with the latest results from frontier models.

Benchmarks

Cross-benchmark model leaderboard

Compare how every tracked model ranks across the headline benchmarks in one matrix.


IFBench

Instruction Following Benchmark

Measures how reliably a model follows complex multi-constraint instructions, a known weak spot for many otherwise strong models.

Language Text 5 results

Last eval 12 Jun 2026

IFEval

Instruction-Following Eval

Verifiable instruction-following: ~25 instruction types whose compliance can be checked deterministically (e.g. word counts, formats).

Language Text 12 results

Top results

Claude Sonnet 3.7 (Thinking)

Last eval 24 Feb 2026
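To make the idea of deterministically checkable instructions concrete, here is a minimal sketch in Python. It is not the IFEval implementation: the function names (`check_max_words`, `check_valid_json`, `check_all_lowercase`, `compliance_rate`) and the three instruction types are assumptions chosen for illustration, loosely modelled on the "word counts, formats" examples in the description above.

```python
import json
from typing import Callable, List


def check_max_words(response: str, max_words: int) -> bool:
    """Hypothetical check: the response has at most `max_words` whitespace-separated words."""
    return len(response.split()) <= max_words


def check_valid_json(response: str) -> bool:
    """Hypothetical check: the whole response parses as JSON."""
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return False
    return True


def check_all_lowercase(response: str) -> bool:
    """Hypothetical check: the response contains no uppercase letters."""
    return response == response.lower()


def compliance_rate(response: str, checks: List[Callable[[str], bool]]) -> float:
    """Fraction of instruction checks that the response satisfies."""
    results = [check(response) for check in checks]
    return sum(results) / len(results)


# Illustrative set of instructions attached to one prompt (an assumption, not IFEval data).
checks = [
    lambda r: check_max_words(r, 5),
    check_all_lowercase,
    check_valid_json,
]

if __name__ == "__main__":
    print(compliance_rate('{"a": 1}', checks))
    print(compliance_rate("Hello world", checks))
```

Because every check is a pure function of the response text, the same response always receives the same score, with no human or model judge involved.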

MGSM

Multilingual Grade School Math

GSM8K problems translated into 10 typologically diverse languages. Tests cross-lingual mathematical reasoning.

Language Text 5 results

Last eval 03 Apr 2026

RULER 128k

RULER (128k context)

Synthetic long-context evaluation suite measuring needle-in-a-haystack retrieval, multi-key retrieval and variable tracing across 128k-token contexts.

Language Text 1 result

Top results

Nemotron 3 Super

88.3%

Last eval 03 Apr 2026
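As a rough illustration of the needle-in-a-haystack idea, the sketch below builds a synthetic context with one target sentence buried at a chosen depth and scores an answer by exact substring match. It is a simplified assumption, not the RULER code: `build_haystack` and `score_retrieval` are hypothetical helpers, and length is counted in filler sentences rather than tokens.

```python
def build_haystack(filler: str, needle: str, n_sentences: int, depth: float) -> str:
    """Return a context of `n_sentences` filler sentences with `needle`
    inserted at relative position `depth` (0.0 = start, 1.0 = end)."""
    sentences = [filler] * n_sentences
    position = int(depth * n_sentences)
    sentences.insert(position, needle)
    return " ".join(sentences)


def score_retrieval(model_answer: str, expected: str) -> float:
    """Return 1.0 if the expected value appears in the answer, else 0.0."""
    return 1.0 if expected in model_answer else 0.0


if __name__ == "__main__":
    # Hypothetical filler and needle text, chosen only for illustration.
    context = build_haystack(
        filler="The sky is blue.",
        needle="The magic number is 7481.",
        n_sentences=10,
        depth=0.5,
    )
    question = "What is the magic number?"
    prompt = context + "\n\n" + question
    # A real run would send `prompt` to a model; here the answer is hard-coded.
    print(score_retrieval("The magic number is 7481.", "7481"))
```

Sweeping `depth` and `n_sentences` gives a grid of retrieval scores by position and context length, which is the usual way such results are reported.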
