MBPP

Mostly Basic Python Problems

974 short Python problems with test cases. One of the original code-LLM benchmarks; saturated by frontier models.

Coding · Text · Pass@k · Max 100.0% · Released Aug 2021 · Saturated · Possibly contaminated

Results

Models scored

Top score: 78.4% (Nemotron 3 Super)

Median score: 60.7%

Best results

Top primary scores; one row per model.

Chart: each dot is one model result; the line traces the running best score.

Showing one canonical row per model.

#	Model	Score	Conditions	Eval date	Source	Flags
1	Nemotron 3 Super	78.4%	3-shot · pass@1 n=32	03 Apr 2026	Self-reported	Primary
2	Code Llama	66.7%	pass@10	01 Aug 2023	Self-reported	Primary
3	Mixtral 8x7B	60.7%	—	08 Jan 2024	Self-reported	Primary
4	Code Llama	41.4%	pass@1	01 Aug 2023	Paper	Primary
5	Gemma 2	29.6%	3-shot	25 Feb 2025	Self-reported	Primary
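
The Conditions column mixes pass@1 (in one case estimated from n=32 samples) and pass@10, so the rows are not directly comparable. As a rough illustration of what these labels mean, here is a minimal sketch of the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the tests. Whether each listed result used exactly this estimator is an assumption; the function names and the sample counts in the usage lines are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # With fewer than k failing samples, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset is all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs, one pair per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts: three problems, 32 samples each, with 8, 32, and 0 passes.
example_results = [(32, 8), (32, 32), (32, 0)]
score_at_1 = benchmark_pass_at_k(example_results, k=1)
score_at_10 = benchmark_pass_at_k(example_results, k=10)
```

For a fixed set of samples, pass@10 is never lower than pass@1, which is one reason the two Code Llama rows above differ so widely.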