MBPP
974 short Python programming problems (Mostly Basic Python Problems), each paired with test cases. One of the earliest code-generation benchmarks for LLMs; now saturated by frontier models.
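To make "problems with test cases" concrete, here is a minimal sketch of how a single problem might be scored: the candidate solution is executed, then each assert-style test is run against it. The record below is a hypothetical MBPP-style example, not an actual benchmark item, and the field names `text` and `test_list` and the helper `passes` are assumptions made for illustration.

```python
# Hypothetical MBPP-style record; the task and field names are illustrative only.
problem = {
    "text": "Write a function to return the sum of the squares of a list of integers.",
    "test_list": [
        "assert sum_of_squares([1, 2, 3]) == 14",
        "assert sum_of_squares([]) == 0",
        "assert sum_of_squares([-2]) == 4",
    ],
}


def passes(candidate_source: str, tests: list[str]) -> bool:
    """Return True if the candidate code satisfies every assert-style test."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function
        for test in tests:
            exec(test, namespace)  # each test is a bare assert statement
    except Exception:
        # Any error (syntax error, wrong answer, runtime failure) counts as a fail.
        return False
    return True


good = "def sum_of_squares(xs):\n    return sum(x * x for x in xs)\n"
bad = "def sum_of_squares(xs):\n    return sum(xs)\n"

print(passes(good, problem["test_list"]))
print(passes(bad, problem["test_list"]))
```

A real evaluation harness would run untrusted model output in a sandbox with time and memory limits; plain `exec` is used here only to keep the sketch short.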
Best results
Frontier over time
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | Nemotron 3 Super | 78.4% | 3-shot · pass@1 n=32 | Apr 3, 2026 | self-reported | primary |
| 2 | Code Llama | 66.7% | pass@10 | Aug 1, 2023 | self-reported | primary |
| 3 | Mixtral 8x7B | 60.7% | — | Jan 8, 2024 | self-reported | primary |
| 4 | Code Llama | 41.4% | pass@1 | Aug 1, 2023 | paper | primary |
| 5 | Gemma 2 | 29.6% | 3-shot | Feb 25, 2025 | self-reported | primary |
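
The Conditions column mixes pass@1 and pass@10, which measure different things and are not directly comparable. As a rough sketch, the commonly used unbiased pass@k estimator draws `n` samples per problem, counts the `c` that pass all tests, and computes 1 - C(n-c, k) / C(n, k). Whether each result listed above was computed this way is an assumption; the table does not say. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    n: samples generated for the problem
    c: samples among the n that passed every test
    k: sample budget being scored (k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With n=32 samples and c=8 passing, pass@1 reduces to c / n.
print(pass_at_k(n=32, c=8, k=1))  # 0.25
# The same samples score higher under a larger budget.
print(pass_at_k(n=32, c=8, k=10))
```

The benchmark score is the mean of this per-problem value over all problems, so a "pass@1 n=32" entry averages 32 samples per problem rather than relying on a single generation.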
