
HumanEval

HumanEval (pass@1)

OpenAI's 164 hand-written Python programming problems, each with unit tests. It was the original code-LLM benchmark; it is now saturated and widely assumed to have leaked into modern training corpora.

Coding · Text · pass@k · Max 100.0% · Released Jul 2021 · Saturated · Possibly contaminated
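The pass@k metric named above can be made concrete with a short sketch. It uses the unbiased estimator from the paper that introduced HumanEval (Chen et al., 2021): for a problem with n generated samples of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical, and this page does not state how each listed score was actually computed, so treat this as an illustration rather than the evaluation code behind any row.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no passing sample. comb returns 0 when
    # n - c < k, so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Mean pass@k over problems; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n, so a condition such as "pass@1 n=32" can be read as the average fraction of 32 samples per problem that pass.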
Results: 14
Models scored: 13
Top: GPT-4o (90.2%)
Median: 72.6%
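These summary figures can be checked against the rows listed under "All results". The sketch below assumes the median is taken over all 14 result rows (not over the 13 distinct models) and that "models scored" counts distinct model names; both are assumptions about how the page computes them, though they reproduce the figures shown. The variable names are illustrative only.

```python
from statistics import median

# (model, score in percent) copied from the "All results" listing.
rows = [
    ("GPT-4o", 90.2), ("Llama 3.3", 88.4), ("Claude Haiku 3.5", 88.1),
    ("Claude Opus 3", 84.9), ("Mistral Small 3", 84.8),
    ("Nemotron 3 Super", 79.4), ("WizardCoder", 73.2), ("Pixtral 12B", 72.0),
    ("Code Llama", 67.8), ("Mixtral 8x7B", 40.2), ("Mixtral 8x7B", 40.2),
    ("Mistral 7B", 30.5), ("LLaMA 2", 29.9), ("Gemma 2", 17.7),
]

result_count = len(rows)                       # every row is one result
model_count = len({name for name, _ in rows})  # Mixtral 8x7B appears twice
top_model, top_score = max(rows, key=lambda r: r[1])
# With an even number of rows, median() averages the two middle scores.
median_score = round(median(score for _, score in rows), 1)

print(result_count, model_count, top_model, top_score, median_score)
```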

Best results

Top primary scores; one row per model.

Frontier over time

Each dot is one model result; the line traces the running best score.
[Chart: Best score over time. Vertical axis: score, 0 to 100; horizontal axis: evaluation date.]
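The running best line described above is a cumulative maximum over results ordered by evaluation date. A minimal sketch, using only the five 2023 rows from the results listing and assuming ties on date keep their input order (the page does not say how it breaks ties):

```python
from datetime import date
from itertools import accumulate

# (eval date, model, score in percent): a subset of the listed results,
# deliberately given out of order to show the sort step.
results = [
    (date(2023, 9, 1), "Mistral 7B", 30.5),
    (date(2023, 7, 19), "LLaMA 2", 29.9),
    (date(2023, 8, 1), "Code Llama", 67.8),
    (date(2023, 12, 1), "Mixtral 8x7B", 40.2),
    (date(2023, 8, 1), "WizardCoder", 73.2),
]

# Sort by date only; sorted() is stable, so same-day rows keep input order.
ordered = sorted(results, key=lambda r: r[0])

# accumulate(..., max) yields the best score seen so far at each point.
running_best = list(accumulate((score for _, _, score in ordered), max))

for (day, model, score), best in zip(ordered, running_best):
    print(day.isoformat(), model, score, best)
```

Each dot on the chart corresponds to a `score` value and the line to the matching `running_best` value, which only ever stays flat or rises.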

All results

Showing one canonical row per model.
#  | Model            | Score | Conditions            | Eval date    | Source        | Flags
---+------------------+-------+-----------------------+--------------+---------------+------------------
1  | GPT-4o           | 90.2% |                       | Apr 16, 2025 | self reported | primary
2  | Llama 3.3        | 88.4% | 0-shot · Pass@1       | Dec 6, 2024  | self reported | primary
3  | Claude Haiku 3.5 | 88.1% | 0-shot                | Oct 22, 2024 | self reported | primary
4  | Claude Opus 3    | 84.9% | 0-shot                | Oct 22, 2024 | self reported | primary
5  | Mistral Small 3  | 84.8% | Pass@1                | Dec 30, 0025 | self reported | primary
6  | Nemotron 3 Super | 79.4% | 0-shot · pass@1 n=32  | Apr 3, 2026  | self reported | primary
7  | WizardCoder      | 73.2% |                       | Aug 1, 2023  | paper         | primary
8  | Pixtral 12B      | 72.0% | Pass@1                | Oct 10, 2024 | self reported | primary
9  | Code Llama       | 67.8% |                       | Aug 1, 2023  | paper         | primary
10 | Mixtral 8x7B     | 40.2% |                       | Dec 1, 2023  | paper         | primary
11 | Mixtral 8x7B     | 40.2% |                       | Jan 8, 2024  | self reported | primary
12 | Mistral 7B       | 30.5% |                       | Sep 1, 2023  | paper         | primary
13 | LLaMA 2          | 29.9% | 0-shot                | Jul 19, 2023 | paper         | primary, verified
14 | Gemma 2          | 17.7% | Pass@1                | Feb 25, 2025 | self reported | primary