HumanEval

HumanEval (pass@1)

OpenAI's set of 164 hand-written Python programming problems, each with unit tests. The original code-LLM benchmark; now saturated and widely considered contaminated, since its problems appear in modern training corpora.
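For reference, pass@k is usually computed with the unbiased estimator from the paper that introduced HumanEval: draw n samples per problem, count the c samples that pass every unit test, and take 1 - C(n-c, k) / C(n, k), averaged over problems. The sketch below is a minimal illustration only, not the scoring code behind any result on this page; the names `pass_at_k` and `benchmark_pass_at_k` and the sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples passing all unit tests, k: budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(counts, k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs."""
    counts = list(counts)
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)


# Hypothetical per-problem counts: (samples drawn, samples passing).
example_counts = [(32, 8), (32, 32), (32, 0), (32, 16)]
example_score = benchmark_pass_at_k(example_counts, k=1)
```

With k = 1 the estimate reduces to c / n, the fraction of samples that pass, so drawing more samples per problem (for example the n=32 condition listed for one result) lowers the variance of the score without changing what it measures.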

Coding · Text · Pass@k · Max 100.0% · Released Jul 2021 · Saturated · Possibly contaminated

Results: 14
Models scored: 13
Top: 90.2% (GPT-4o)
Median: 72.6%

Best results

Top primary scores; one row per model.

Frontier over time

Each dot is one model result; the line traces the running best score.
[Figure: Best score over time. Y-axis: score, 0 to 100; x-axis: evaluation date.]
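The running-best line described here is a cumulative maximum over results ordered by evaluation date. A minimal sketch follows, assuming each result is a (date, score) pair; the helper name `running_best` is hypothetical, and the sample rows are a subset copied from the All results table on this page.

```python
from datetime import date


def running_best(results):
    """Return (date, best score so far) for results sorted by date."""
    best = float("-inf")
    frontier = []
    for eval_date, score in sorted(results):
        best = max(best, score)
        frontier.append((eval_date, best))
    return frontier


# Subset of rows from the results table; the Mistral Small 3 row is left
# out because its evaluation date appears garbled on the page.
sample_results = [
    (date(2023, 7, 19), 29.9),   # LLaMA 2
    (date(2023, 8, 1), 73.2),    # WizardCoder
    (date(2023, 8, 1), 67.8),    # Code Llama
    (date(2023, 9, 1), 30.5),    # Mistral 7B
    (date(2024, 10, 22), 88.1),  # Claude Haiku 3.5
    (date(2024, 12, 6), 88.4),   # Llama 3.3
    (date(2025, 4, 16), 90.2),   # GPT-4o
    (date(2026, 4, 3), 79.4),    # Nemotron 3 Super
]

frontier = running_best(sample_results)
```

A later result that scores below the current best, such as the April 2026 entry in this subset, adds a dot to the chart but leaves the line flat.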

All results

Showing one canonical row per model.
#   Model             Score  Conditions            Eval date    Source         Flags
1   GPT-4o            90.2%  -                     16 Apr 2025  Self-reported  Primary
2   Llama 3.3         88.4%  0-shot · Pass@1       06 Dec 2024  Self-reported  Primary
3   Claude Haiku 3.5  88.1%  0-shot                22 Oct 2024  Self-reported  Primary
4   Claude Opus 3     84.9%  0-shot                22 Oct 2024  Self-reported  Primary
5   Mistral Small 3   84.8%  Pass@1                30 Dec 0025  Self-reported  Primary
6   Nemotron 3 Super  79.4%  0-shot · pass@1 n=32  03 Apr 2026  Self-reported  Primary
7   WizardCoder       73.2%  -                     01 Aug 2023  Paper          Primary
8   Pixtral 12B       72.0%  Pass@1                10 Oct 2024  Self-reported  Primary
9   Code Llama        67.8%  -                     01 Aug 2023  Paper          Primary
10  Mixtral 8x7B      40.2%  -                     01 Dec 2023  Paper          Primary
11  Mixtral 8x7B      40.2%  -                     08 Jan 2024  Self-reported  Primary
12  Mistral 7B        30.5%  -                     01 Sep 2023  Paper          Primary
13  LLaMA 2           29.9%  0-shot                19 Jul 2023  Paper          Primary · Verified
14  Gemma 2           17.7%  Pass@1                25 Feb 2025  Self-reported  Primary
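The summary figures near the top of the page (results, models scored, top score, median) can be recomputed from the table above. The sketch below is a rough check using only the model names and scores as listed; the variable names are hypothetical.

```python
from statistics import median

# (model, score in percent) copied from the results table.
rows = [
    ("GPT-4o", 90.2), ("Llama 3.3", 88.4), ("Claude Haiku 3.5", 88.1),
    ("Claude Opus 3", 84.9), ("Mistral Small 3", 84.8),
    ("Nemotron 3 Super", 79.4), ("WizardCoder", 73.2), ("Pixtral 12B", 72.0),
    ("Code Llama", 67.8), ("Mixtral 8x7B", 40.2), ("Mixtral 8x7B", 40.2),
    ("Mistral 7B", 30.5), ("LLaMA 2", 29.9), ("Gemma 2", 17.7),
]

scores = [score for _, score in rows]
num_results = len(rows)                          # every row counts
num_models = len({model for model, _ in rows})   # distinct model names
top_model, top_score = max(rows, key=lambda r: r[1])
median_score = round(median(scores), 1)          # mean of the two middle rows
```

The median here is taken over all 14 rows, which reproduces the page's 72.6%. Counting Mixtral 8x7B once instead would leave 13 scores and a middle value of 73.2%, so the page's median appears to be computed per result rather than per model.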