MBPP
A benchmark of 974 short Python programming problems, each paired with test cases. It is one of the original code-LLM benchmarks and is now saturated by frontier models.
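As a rough illustration of what "problems with test cases" means in practice, the sketch below checks a candidate solution against assert-style tests. The helper name `passes_tests` and the toy `add` problem are hypothetical and not taken from the dataset. A real harness would also need sandboxing and timeouts, which are omitted here.

```python
def passes_tests(candidate_src: str, tests: list[str]) -> bool:
    """Return True if the candidate code runs and every test assertion holds.

    Hypothetical helper for illustration only: no sandbox, no timeout.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # define the candidate's functions
        for test in tests:
            exec(test, namespace)        # each test is an `assert ...` line
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False
    return True


# Toy problem (made up for this example, not an MBPP item).
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]

print(passes_tests(good, tests))
print(passes_tests(bad, tests))
```

A problem counts as solved only if all of its tests pass; the benchmark score is the fraction of problems solved.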
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | Nemotron 3 Super | 78.4% | 3-shot · pass@1 n=32 | 03 Apr 2026 | Self-reported | Primary |
| 2 | Code Llama | 66.7% | pass@10 | 01 Aug 2023 | Self-reported | Primary |
| 3 | Mixtral 8x7B | 60.7% | — | 08 Jan 2024 | Self-reported | Primary |
| 4 | Code Llama | 41.4% | pass@1 | 01 Aug 2023 | Paper | Primary |
| 5 | Gemma 2 | 29.6% | 3-shot | 25 Feb 2025 | Self-reported | Primary |
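The Conditions column mixes pass@1 and pass@10, which are not directly comparable: pass@k gives credit if any of k samples passes, so it rises with k. A minimal sketch of the commonly used unbiased pass@k estimator follows, assuming n samples are drawn per problem and c of them pass. Whether each row above was computed this way is not stated in the table, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    subset of k samples contains at least one correct sample.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every subset of size k has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With n=32 samples and c=8 correct, pass@1 is simply the fraction correct.
print(pass_at_k(32, 8, 1))   # 0.25
# The same samples give a higher estimate at k=10.
print(pass_at_k(32, 8, 10))
```

The benchmark-level figure is the mean of this per-problem estimate over all problems. A condition such as "pass@1 n=32" would then correspond to k=1 with 32 samples per problem, under the assumption stated above.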
