HumanEval

HumanEval (pass@1)

OpenAI's 164 hand-written Python programming problems, each with unit tests. The original code-LLM benchmark; now saturated, and its problems are broadly considered to have leaked into modern training corpora.

Coding · Text · Pass@k · Max 100.0% · Released Jul 2021 · Saturated · Possibly contaminated
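For readers unfamiliar with the Pass@k metric named above, here is a minimal sketch of the unbiased estimator described in the HumanEval paper: draw n samples per problem, count the c samples that pass every unit test, and estimate the chance that at least one of k samples would pass. The function names `pass_at_k` and `benchmark_score` are illustrative, and nothing here is claimed to match how any score on this page was actually computed.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: samples generated, c: samples passing all unit tests, k: budget.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed,
    # so the estimate becomes 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(per_problem: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)


if __name__ == "__main__":
    # Hypothetical counts for three problems, 32 samples each.
    counts = [(32, 32), (32, 8), (32, 0)]
    print(f"pass@1 = {benchmark_score(counts, k=1):.3f}")
```

With k=1 the estimate reduces to c/n per problem, which is presumably what a condition such as "pass@1 n=32" refers to: 32 samples per problem, averaged.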


Results

Models scored

Top: GPT-4o, 90.2%

Median: 72.6%
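As a quick consistency check, the 72.6% median can be reproduced from the scores listed under All results. The sketch below assumes the median is taken over all 14 listed rows (including both Mixtral 8x7B rows); that assumption is inferred from the numbers, not stated by the page.

```python
from statistics import median

# Scores copied from the "All results" table, rows 1 to 14.
scores = [90.2, 88.4, 88.1, 84.9, 84.8, 79.4, 73.2,
          72.0, 67.8, 40.2, 40.2, 30.5, 29.9, 17.7]

# With an even number of rows, the median is the mean of the two
# middle values (73.2 and 72.0 here).
row_median = median(scores)

if __name__ == "__main__":
    print(f"{row_median:.1f}%")
```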

Best results

Top primary scores; one row per model.

[Bar chart of the ten highest primary scores, ranging from 90.2% down to 40.2%. The same scores appear with model names under All results.]

Frontier over time

Each dot is one model result; the line traces the running best score.
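The running-best line can be sketched as a cumulative maximum over results ordered by evaluation date. The example uses a handful of rows from the All results table; the function name `running_best` and the tuple layout are illustrative assumptions, not the page's actual plotting code.

```python
from datetime import date
from itertools import accumulate

# (eval date, model, score) for a subset of rows from "All results".
results = [
    (date(2025, 4, 16), "GPT-4o", 90.2),
    (date(2024, 12, 6), "Llama 3.3", 88.4),
    (date(2024, 10, 22), "Claude Haiku 3.5", 88.1),
    (date(2023, 8, 1), "WizardCoder", 73.2),
    (date(2023, 8, 1), "Code Llama", 67.8),
    (date(2023, 9, 1), "Mistral 7B", 30.5),
    (date(2023, 7, 19), "LLaMA 2", 29.9),
]


def running_best(rows):
    """Return (date, best score so far) pairs in chronological order."""
    ordered = sorted(rows, key=lambda r: r[0])  # stable sort by date
    best = accumulate((score for _, _, score in ordered), max)
    return [(d, b) for (d, _, _), b in zip(ordered, best)]


if __name__ == "__main__":
    for day, best in running_best(results):
        print(day.isoformat(), best)
```

A result below the current best (Mistral 7B at 30.5% here) still gets a dot, but leaves the line flat.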

All results

Showing one canonical row per model.

#	Model	Score	Conditions	Eval date	Source	Flags
1	GPT-4o	90.2%	—	16 Apr 2025	Self-reported	Primary
2	Llama 3.3	88.4%	0-shot · Pass@1	06 Dec 2024	Self-reported	Primary
3	Claude Haiku 3.5	88.1%	0-shot	22 Oct 2024	Self-reported	Primary
4	Claude Opus 3	84.9%	0-shot	22 Oct 2024	Self-reported	Primary
5	Mistral Small 3	84.8%	Pass@1	30 Dec 2025	Self-reported	Primary
6	Nemotron 3 Super	79.4%	0-shot · pass@1 n=32	03 Apr 2026	Self-reported	Primary
7	WizardCoder	73.2%	—	01 Aug 2023	Paper	Primary
8	Pixtral 12B	72.0%	Pass@1	10 Oct 2024	Self-reported	Primary
9	Code Llama	67.8%	—	01 Aug 2023	Paper	Primary
10	Mixtral 8x7B	40.2%	—	01 Dec 2023	Paper	Primary
11	Mixtral 8x7B	40.2%	—	08 Jan 2024	Self-reported	Primary
12	Mistral 7B	30.5%	—	01 Sep 2023	Paper	Primary
13	LLaMA 2	29.9%	0-shot	19 Jul 2023	Paper	Primary Verified
14	Gemma 2	17.7%	Pass@1	25 Feb 2025	Self-reported	Primary
