HumanEval
A set of 164 hand-written Python programming problems from OpenAI, each with unit tests. It was the original code-LLM benchmark; it is now saturated, and its problems are broadly considered to have leaked into modern training corpora.
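
As a rough illustration of what "problems with unit tests" means in practice, the sketch below scores one completion by functional correctness. The problem text, the `check` function, and the `passes` helper are hypothetical stand-ins, not taken from the benchmark. A real harness would run untrusted model output in a sandbox with a timeout rather than calling `exec` directly.

```python
# Hypothetical problem in the HumanEval style: a prompt (signature + docstring),
# a model completion (the function body), and unit tests that call the function.
PROMPT = 'def add(a, b):\n    """Return the sum of a and b."""\n'
COMPLETION = "    return a + b\n"
TESTS = (
    "def check(candidate):\n"
    "    assert candidate(2, 3) == 5\n"
    "    assert candidate(-1, 1) == 0\n"
)


def passes(prompt: str, completion: str, tests: str, entry_point: str) -> bool:
    """Return True if prompt + completion passes every assertion in tests."""
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)  # define the candidate function
        exec(tests, namespace)                # define check()
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        # Any failed assertion, syntax error, or runtime error counts as a failure.
        return False


print(passes(PROMPT, COMPLETION, TESTS, "add"))            # True
print(passes(PROMPT, "    return a - b\n", TESTS, "add"))  # False
```

A problem counts as solved only if every assertion passes, so a completion that is almost right scores the same as one that does not run at all.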
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | GPT-4o | 90.2% | — | Apr 16, 2025 | self reported | primary |
| 2 | Llama 3.3 | 88.4% | 0-shot · pass@1 | Dec 6, 2024 | self reported | primary |
| 3 | Claude Haiku 3.5 | 88.1% | 0-shot | Oct 22, 2024 | self reported | primary |
| 4 | Claude Opus 3 | 84.9% | 0-shot | Oct 22, 2024 | self reported | primary |
| 5 | Mistral Small 3 | 84.8% | pass@1 | Dec 30, 2025 | self reported | primary |
| 6 | Nemotron 3 Super | 79.4% | 0-shot · pass@1 n=32 | Apr 3, 2026 | self reported | primary |
| 7 | WizardCoder | 73.2% | — | Aug 1, 2023 | paper | primary |
| 8 | Pixtral 12B | 72.0% | pass@1 | Oct 10, 2024 | self reported | primary |
| 9 | Code Llama | 67.8% | — | Aug 1, 2023 | paper | primary |
| 10 | Mixtral 8x7B | 40.2% | — | Dec 1, 2023 | paper | primary |
| 11 | Mixtral 8x7B | 40.2% | — | Jan 8, 2024 | self reported | primary |
| 12 | Mistral 7B | 30.5% | — | Sep 1, 2023 | paper | primary |
| 13 | LLaMA 2 | 29.9% | 0-shot | Jul 19, 2023 | paper | primary verified |
| 14 | Gemma 2 | 17.7% | pass@1 | Feb 25, 2025 | self reported | primary |
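
Several rows list pass@1 as a condition, and one gives a sample count (n=32). As a minimal sketch of what such a number can mean, the code below implements the unbiased pass@k estimator from the original HumanEval paper: draw n samples per problem, count the c that pass the unit tests, and estimate 1 - C(n-c, k) / C(n, k). The function names and the sample counts are hypothetical, and the table does not say whether every listed score was computed this way.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # C(n - c, k) / C(n, k) is the probability that k samples chosen without
    # replacement are all incorrect; comb returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(correct_counts: list[int], n: int, k: int = 1) -> float:
    """Average the per-problem estimates over all problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of passing samples for three problems, 32 samples each.
print(pass_at_k(32, 8, 1))                 # 0.25
print(benchmark_score([32, 16, 0], n=32))  # 0.5
```

For k=1 the estimate reduces to c/n, the fraction of samples that pass, so a larger n lowers the variance of the reported score without changing what it measures.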
