HumanEval+
HumanEval with a substantially expanded test suite (roughly 80x more test cases) to catch solutions that pass the original tests but are incorrect.
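As a minimal sketch of why a larger test suite matters, the toy example below shows a flawed solution that passes a small set of inputs but fails once more inputs are added. The task, the function names, and the inputs are all hypothetical and are not taken from the benchmark. Checking the candidate against a reference solution is an assumption made to keep the example short.

```python
def is_sorted_reference(xs):
    """Reference solution (hypothetical task): is the list non-decreasing?"""
    return all(a <= b for a, b in zip(xs, xs[1:]))


def is_sorted_buggy(xs):
    """Hypothetical flawed solution: only compares the first and last elements."""
    return len(xs) < 2 or xs[0] <= xs[-1]


def passes(candidate, inputs):
    """A candidate passes if it agrees with the reference on every input."""
    return all(candidate(x) == is_sorted_reference(x) for x in inputs)


# A small, hand-written style test set (illustrative only).
base_inputs = [[1, 2, 3], [3, 2, 1], []]

# Extra inputs that probe cases the small set misses (illustrative only).
extra_inputs = [[1, 3, 2], [2, 1, 3], [1, 1, 1], [5]]

base_ok = passes(is_sorted_buggy, base_inputs)
plus_ok = passes(is_sorted_buggy, base_inputs + extra_inputs)

print("passes small test set:   ", base_ok)
print("passes expanded test set:", plus_ok)
```

The flawed function agrees with the reference on every input in the small set, so it would be counted as correct there. The input `[1, 3, 2]` in the expanded set exposes the bug.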
Best results
Frontier over time
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | Phi 4 reasoning plus | 92.3% | — | Jul 8, 2025 | self reported | primary |
| 2 | WizardCoder | 64.6% | — | May 27, 2025 | paper | primary |
