HumanEval+
HumanEval with a substantially expanded test suite (roughly 80x more test cases), designed to catch solutions that pass the original tests but are actually incorrect.
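As a rough illustration of why a larger test suite matters, the toy sketch below runs one deliberately buggy solution against a small base suite and then against an expanded one. The function `buggy_median`, the helper `passes`, and all test inputs are hypothetical and chosen only for this example; they are not taken from the benchmark or its harness.

```python
def buggy_median(xs):
    """Hypothetical wrong solution: mishandles even-length lists."""
    s = sorted(xs)
    return s[len(s) // 2]


def passes(solution, tests):
    """Return True if the solution matches the expected output on every test."""
    return all(solution(inp) == expected for inp, expected in tests)


# A small base suite (illustrative only): odd-length inputs hide the bug.
base_tests = [([3, 1, 2], 2), ([5], 5)]

# Extra cases (illustrative only) that exercise even-length inputs.
extra_tests = [([1, 2, 3, 4], 2.5), ([2, 2], 2)]

base_ok = passes(buggy_median, base_tests)
plus_ok = passes(buggy_median, base_tests + extra_tests)

print("passes base suite:    ", base_ok)
print("passes expanded suite:", plus_ok)
```

The same solution counts as correct under the small suite and as incorrect once the extra inputs are added, which is the kind of wrong-but-passing case the expanded tests are meant to expose.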
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | Phi 4 reasoning plus | 92.3% | — | 08 Jul 2025 | Self-reported | Primary |
| 2 | WizardCoder | 64.6% | — | 27 May 2025 | Paper | Primary |
