TAAFT

AI benchmarks

Browse benchmarks tracking how AI models perform on reasoning, coding, multimodal and knowledge tasks. Each benchmark has its own leaderboard with the latest results from frontier models.

62 benchmarks · 8 categories · 671 results recorded · 136 models scored

Cross-benchmark model leaderboard

Compare how every tracked model ranks across the headline benchmarks in one matrix.


Aider Polyglot

Aider Polyglot Coding Benchmark

225 hard Exercism programming exercises across 6 languages (C++, Go, Java, JavaScript, Python, Rust). Measures whole-file edit accuracy under a realistic agentic-coding harness.

Coding · Text · 12 results
Top results:
1. Claude Opus 4.5: 89.4%
2. GPT 5.1: 88.0%
3. GPT 5 (Thinking): 88.0%
Last eval: 24 Nov 2025

CodeForces ELO

CodeForces Live Contest ELO

Live competitive-programming Elo rating inferred from a model's performance on recent Codeforces rounds. Reported as a Codeforces rating.

Coding · Text · 12 results
Top results:
1. Deepseek V4 Pro: 3,206.0 rating
2. o4 mini: 2,719.0 rating
3. o3: 2,706.0 rating
Last eval: 24 Apr 2026
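
For readers unfamiliar with Elo-style ratings, a rating gap can be read as an expected head-to-head score. The sketch below uses the classic Elo logistic formula with a 400-point scale. It is an illustration only: Codeforces applies its own rating procedure, and how this leaderboard infers a model's rating is not described here. The function name `expected_score` is hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Classic Elo expected score of A against B (1 = win, 0.5 = draw, 0 = loss)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Ratings taken from the first two entries in the top results above.
gap_example = expected_score(3206.0, 2719.0)
print(f"{gap_example:.3f}")
```

Under this classic formula, equal ratings give 0.5, and the two players' expected scores always sum to 1.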

HumanEval (Saturated)

HumanEval (pass@1)

OpenAI's 164 hand-written Python programming problems with unit tests. The original code-LLM benchmark; now saturated and broadly considered contaminated in modern training corpo…

Coding · Text · 14 results
Top results:
1. GPT-4o: 90.2%
2. Llama 3.3: 88.4%
3. Claude Haiku 3.5: 88.1%
Last eval: 03 Apr 2026
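
The "pass@1" in the benchmark name refers to the pass@k metric: the probability that at least one of k sampled solutions passes all unit tests. A minimal sketch of the unbiased estimator introduced with HumanEval is shown below. It assumes n samples were generated per problem and c of them passed. Whether each result listed here was computed this way, or from a single greedy sample, is not stated on this page.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples per problem, c of them pass all unit tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random
    # size-k subset of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# For k = 1 the estimate reduces to the plain pass fraction c / n.
print(pass_at_k(n=10, c=3, k=1))
```

The benchmark-level score is then the mean of this value over all problems.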

HumanEval+ (Saturated)

HumanEval+ (EvalPlus)

HumanEval with substantially expanded test cases (~80x more) to catch wrong-but-passing solutions.

Coding · Text · 2 results
Top results:
1. Phi 4 reasoning plus: 92.3%
2. WizardCoder: 64.6%
Last eval: 08 Jul 2025
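
To illustrate what a "wrong-but-passing" solution looks like, the toy sketch below checks one buggy function against a small test set and then against an expanded one. The task, the function `buggy_median`, and every test case are hypothetical and not taken from HumanEval or HumanEval+.

```python
def buggy_median(values):
    """Hypothetical model output: correct for odd-length lists only."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def passes(func, tests):
    """Return True if func matches the expected value on every (arg, expected) pair."""
    return all(func(arg) == expected for arg, expected in tests)


# A sparse test set that happens to contain only odd-length inputs.
base_tests = [([3, 1, 2], 2), ([5], 5)]
# An expanded test set that also covers even-length inputs.
extra_tests = base_tests + [([1, 2, 3, 4], 2.5), ([2, 4], 3)]

print(passes(buggy_median, base_tests))   # True
print(passes(buggy_median, extra_tests))  # False
```

The same solution counts as correct under the sparse tests and incorrect under the expanded ones, which is why scores tend to drop when test coverage grows.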

LiveCodeBench

Continuously refreshed competitive-programming problems sourced from LeetCode, AtCoder, and Codeforces after the model's knowledge cutoff. Designed to stay contamination-free.

Coding · Text · 27 results
Top results:
1. Deepseek V4 Pro: 93.5%
2. Kimi K2.6: 89.6%
3. Kimi K2.5: 85.0%
Last eval: 24 Apr 2026
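
The contamination-avoidance idea described above amounts to a date filter: only problems published after a model's knowledge cutoff are scored. A minimal sketch follows, assuming each problem carries a release date. The field names, problem ids, and dates are hypothetical and do not reflect LiveCodeBench's actual data format.

```python
from datetime import date


def contamination_free(problems, knowledge_cutoff):
    """Keep only problems published strictly after the model's knowledge cutoff."""
    return [p for p in problems if p["released"] > knowledge_cutoff]


# Hypothetical problem records for illustration.
problems = [
    {"id": "example-a", "released": date(2024, 3, 1)},
    {"id": "example-b", "released": date(2025, 1, 15)},
]

kept = contamination_free(problems, knowledge_cutoff=date(2024, 6, 30))
print([p["id"] for p in kept])  # ['example-b']
```

Because the cutoff differs per model, two models may end up being scored on different problem subsets.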

MBPP (Saturated)

Mostly Basic Python Problems

974 short Python problems with test cases. One of the original code-LLM benchmarks; saturated by frontier models.

Coding · Text · 5 results
Top results:
1. Nemotron 3 Super: 78.4%
2. Code Llama: 66.7%
3. Mixtral 8x7B: 60.7%
Last eval: 03 Apr 2026

SWE-bench Multimodal

Variant of SWE-bench where issues include screenshots, diagrams and other visual context. Tests multimodal software-engineering ability.

Coding · Multimodal · 1 result
Top results:
1. Deepseek 3.2: 70.2%
Last eval: 01 Dec 2025

SWE-bench Verified

500 manually validated GitHub issues from popular Python repos. Models must produce a patch that passes the hidden test suite. The current standard for "real software engineering…

Coding · Text · 50 results
Top results:
1. Claude Opus 4.7: 87.6%
2. Claude Opus 4.5: 80.9%
3. Claude Opus 4.6: 80.8%
Last eval: 27 Apr 2026
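
The percentage reported for this kind of benchmark can be read as a resolved rate: the share of issues whose patch makes the hidden tests pass. The sketch below assumes a per-issue record of test outcomes and treats an issue as resolved only if every required test passed. The record layout, status strings, and function names are hypothetical and are not the official SWE-bench harness.

```python
def is_resolved(test_results, required_tests):
    """An issue counts as resolved only if every required hidden test passed."""
    return all(test_results.get(name) == "passed" for name in required_tests)


def resolved_rate(issues):
    """Percentage of issues resolved, given (test_results, required_tests) pairs."""
    if not issues:
        raise ValueError("no issues to score")
    resolved = sum(is_resolved(results, required) for results, required in issues)
    return 100.0 * resolved / len(issues)


# Hypothetical outcomes for two issues after applying a model's patches.
issues = [
    ({"test_fix": "passed", "test_existing": "passed"}, ["test_fix", "test_existing"]),
    ({"test_fix": "passed", "test_existing": "failed"}, ["test_fix", "test_existing"]),
]
print(resolved_rate(issues))  # 50.0
```

A test missing from the results is treated as not passed, so a patch that prevents the suite from running does not count as resolved.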