MMLU-Pro

A harder, more reasoning-focused replacement for MMLU. Each question offers 10 answer choices instead of 4, and the question set is curated to remove trivially answerable items.
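To make the effect of the larger answer set concrete, the sketch below computes the accuracy expected from uniform random guessing. It assumes every question has exactly one correct answer and the full number of choices stated above; `chance_accuracy` is a hypothetical helper name, not part of any benchmark tooling.

```python
from fractions import Fraction


def chance_accuracy(num_choices: int) -> Fraction:
    """Expected accuracy of uniform random guessing on single-answer questions."""
    if num_choices < 1:
        raise ValueError("num_choices must be positive")
    return Fraction(1, num_choices)


# Assumption: 4 choices per question for MMLU, 10 for MMLU-Pro.
mmlu_floor = chance_accuracy(4)
mmlu_pro_floor = chance_accuracy(10)

print(f"MMLU chance floor:     {float(mmlu_floor):.0%}")      # 25%
print(f"MMLU-Pro chance floor: {float(mmlu_pro_floor):.0%}")  # 10%
```

Under these assumptions the guessing floor drops from 25% to 10%, so a given score on MMLU-Pro reflects less luck than the same score on a 4-choice test.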

Knowledge · Text · accuracy · Max 100.0% · Released Jun 2024
Results: 23
Models scored: 22
Top: GPT OSS 120B (90.0%)
Median: 79.9%
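The summary figures can be checked against the scores in the results table. The sketch below is only a verification aid: the list is copied by hand from that table, and the page does not state how it computes its median, so treating it as the median over all listed rows is an assumption.

```python
from statistics import median

# Primary scores (percent) copied from the results table, in rank order.
scores = [
    90.0, 86.1, 85.2, 85.0, 85.0, 85.0, 84.0, 82.2,
    80.5, 80.1, 79.9, 79.9, 78.9, 78.3, 78.0, 76.0,
    75.9, 75.7, 74.3, 69.6, 68.9, 66.3, 41.6,
]

print("results:", len(scores))     # 23
print("top:", max(scores))         # 90.0
print("median:", median(scores))   # 79.9
```

With an odd number of rows the median is the middle (12th) value, which matches the 79.9% shown.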

Best results

Top primary scores; one row per model.

Frontier over time

Each dot is one model result; the line traces the running best score.
[Chart: best score over time. Y-axis: score, 0 to 100. X-axis: evaluation date, Oct 2024 to Apr 2026.]
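The "running best" line described above is a cumulative maximum over results ordered by evaluation date. A minimal sketch, using a handful of rows copied from the results table; `running_best` is a hypothetical helper, and the page's actual plotting code is not shown in the source.

```python
from datetime import date
from itertools import accumulate

# A few (eval date, score) pairs copied from the results table.
results = [
    (date(2024, 10, 22), 41.6),  # Claude Haiku 3.5
    (date(2024, 12, 6), 68.9),   # Llama 3.3
    (date(2024, 12, 26), 75.9),  # DeepSeek V3
    (date(2025, 1, 21), 84.0),   # DeepSeek-R1
    (date(2025, 4, 5), 82.2),    # Llama 4 Behemoth
    (date(2025, 8, 5), 90.0),    # GPT OSS 120B
    (date(2026, 2, 24), 86.1),   # Qwen 3.5 27B
]


def running_best(points):
    """Return (date, best score so far) for each result in date order."""
    ordered = sorted(points)  # sort by date, then by score
    best = accumulate((score for _, score in ordered), max)
    return [(day, top) for (day, _), top in zip(ordered, best)]


frontier = running_best(results)
for day, top in frontier:
    print(day.isoformat(), top)
```

In this sample the frontier stays flat at 84.0 through the Apr 2025 result and at 90.0 after Aug 2025, because later results that score below the current best do not move the line.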

All results

Showing one canonical row per model.
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | GPT OSS 120B | 90.0% | CoT | Aug 5, 2025 | self reported | primary |
| 2 | Qwen 3.5 27B | 86.1% | | Feb 24, 2026 | third party | primary, verified |
| 3 | Gemma 4 | 85.2% | CoT | Apr 3, 2026 | self reported | primary |
| 4 | DeepSeek V3.2 Exp | 85.0% | CoT | Sep 29, 2025 | self reported | primary |
| 5 | DeepSeek 3.2 | 85.0% | | Dec 1, 2025 | paper | primary |
| 6 | DeepSeek V3.1 Terminus | 85.0% | | Sep 22, 2025 | self reported | primary |
| 7 | DeepSeek-R1 | 84.0% | CoT | Jan 21, 2025 | paper | primary |
| 8 | Llama 4 Behemoth | 82.2% | | Apr 5, 2025 | self reported | primary |
| 9 | Llama 4 Maverick | 80.5% | | Apr 5, 2025 | self reported | primary |
| 10 | Seed 1.5 | 80.1% | 0-shot · CoT | Jan 22, 2025 | self reported | primary |
| 11 | Grok 3 | 79.9% | | Feb 19, 2025 | self reported | primary |
| 12 | Grok 3 | 79.9% | | Feb 19, 2025 | self reported | primary |
| 13 | Grok 3 mini | 78.9% | | Feb 19, 2025 | self reported | primary |
| 14 | Nemotron 3 Nano | 78.3% | | Dec 15, 2025 | self reported | primary |
| 15 | Gemma 3 | 78.0% | | May 20, 2025 | self reported | primary |
| 16 | Phi 4 reasoning plus | 76.0% | | Jul 8, 2025 | self reported | primary |
| 17 | DeepSeek V3 | 75.9% | | Dec 26, 2024 | paper | primary |
| 18 | Nemotron 3 Super | 75.7% | 5-shot · CoT | Apr 3, 2026 | self reported | primary |
| 19 | Llama 4 Scout | 74.3% | | Apr 5, 2025 | self reported | primary |
| 20 | Command A | 69.6% | | Apr 7, 2025 | paper | primary |
| 21 | Llama 3.3 | 68.9% | 5-shot · CoT | Dec 6, 2024 | self reported | primary |
| 22 | Mistral Small 3 | 66.3% | 5-shot · CoT | Jan 30, 2025 | self reported | primary |
| 23 | Claude Haiku 3.5 | 41.6% | 0-shot · CoT | Oct 22, 2024 | self reported | primary |
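
The table lists Grok 3 twice with identical values (ranks 11 and 12), which is consistent with 23 results against 22 models scored. The page does not say how it picks the canonical row for a model, so the sketch below rests on an assumption: keep the highest-scoring row per model name. `canonical_rows` is a hypothetical helper for illustration only.

```python
# Sample rows copied from the results table (model name and score in percent).
rows = [
    {"model": "Llama 4 Maverick", "score": 80.5},
    {"model": "Grok 3", "score": 79.9},
    {"model": "Grok 3", "score": 79.9},
    {"model": "Grok 3 mini", "score": 78.9},
]


def canonical_rows(rows):
    """Keep one row per model: the first row seen with the highest score."""
    best = {}
    for row in rows:
        current = best.get(row["model"])
        if current is None or row["score"] > current["score"]:
            best[row["model"]] = row
    # Present the survivors from highest to lowest score.
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)


deduped = canonical_rows(rows)
for rank, row in enumerate(deduped, start=1):
    print(rank, row["model"], f'{row["score"]}%')
```

Applied to the full table under this assumption, the duplicate Grok 3 row would collapse and the list would hold 22 rows, one per model.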