TAAFT

AI model leaderboard

Every tracked model, ranked across eight headline benchmarks. The Intelligence Index is the average of each model's normalized scores on the benchmarks it has been scored on.
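The index values in the table can be reproduced as a plain arithmetic mean of the scores shown in each row, which suggests the displayed scores are already on the normalized 0 to 100 scale. A minimal sketch, assuming that, plus two further assumptions inferred from the table rather than stated by it: models scored on fewer than three benchmarks get no index, and the mean is rounded half up to one decimal (GPT 5.1's four scores average exactly 85.45 and are shown as 85.5). `intelligence_index` and `coverage` are hypothetical helper names.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional, Sequence

TOTAL_BENCHMARKS = 8  # MMLU-Pro, GPQA Diamond, HLE, AIME 2025, SWE-bench Verified, LiveCodeBench, MMMU, AA-LCR
MIN_SCORES = 3        # assumption inferred from the table: no index is shown below 3/8


def intelligence_index(scores: Sequence[str]) -> Optional[Decimal]:
    """Mean of the available scores, rounded half up to one decimal place.

    Scores are passed as strings so the Decimal arithmetic stays exact."""
    if len(scores) < MIN_SCORES:
        return None
    total = sum((Decimal(s) for s in scores), Decimal(0))
    mean = total / len(scores)
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def coverage(scores: Sequence[str]) -> str:
    """Number of benchmarks scored, out of the eight headline benchmarks."""
    return f"{len(scores)}/{TOTAL_BENCHMARKS}"


# Scores copied from rows of the table on this page.
rows = {
    "GPT 5.2 Thinking": ["92.4", "100.0", "80.0"],
    "GPT 5.1": ["88.1", "94.6", "74.9", "84.2"],
    "Claude Sonnet 4": ["75.4", "70.5", "72.7", "74.4"],
    "GPT-4o": ["53.6", "69.1"],
}

for model, scores in rows.items():
    index = intelligence_index(scores)
    shown = "n/a" if index is None else str(index)
    print(f"{model}: {shown} ({coverage(scores)})")
```

Running it prints 90.8 (3/8), 85.5 (4/8) and 73.3 (4/8) for the first three models, matching the table, and n/a (2/8) for GPT-4o, which has no index. Decimal is used instead of float because an average such as 73.25 is a true tie, and binary floating point would not reliably round it the way the table does.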

Best overall: 90.8 Intelligence Index
Best at knowledge: 94.4 GPQA Diamond
Best at math: 100.0 AIME 2025
Best at coding: 87.6 SWE-bench Verified
Best at multimodal: 84.2 MMMU
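Each "best at" figure is the top score on one benchmark among the models that have a score there. A minimal sketch of that selection, using the MMLU-Pro and SWE-bench Verified values from the capability scatter on this page; because that is only a subset of the table, its winners differ from the figures above. `best_at` and `SCATTER_SCORES` are hypothetical names, not part of any site API.

```python
from typing import Dict, Optional, Tuple

# MMLU-Pro and SWE-bench Verified scores copied from the capability scatter on
# this page; a model with no score on a benchmark simply has no key for it.
SCATTER_SCORES: Dict[str, Dict[str, float]] = {
    "Qwen 3.6 27B": {"MMLU-Pro": 86.2, "SWE-bench Verified": 77.2},
    "Claude Haiku 3.5": {"MMLU-Pro": 41.6, "SWE-bench Verified": 40.6},
    "DeepSeek V3": {"MMLU-Pro": 75.9, "SWE-bench Verified": 42.0},
    "DeepSeek-R1": {"MMLU-Pro": 84.0, "SWE-bench Verified": 49.2},
    "Qwen 3.5 27B": {"MMLU-Pro": 86.1, "SWE-bench Verified": 72.4},
}


def best_at(models: Dict[str, Dict[str, float]], benchmark: str) -> Optional[Tuple[str, float]]:
    """Return (model, score) for the highest score on `benchmark`.

    Models without a score on that benchmark are skipped; returns None if no
    model has one. On a tie, the model listed first wins (max keeps the first)."""
    scored = [(name, scores[benchmark]) for name, scores in models.items() if benchmark in scores]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[1])


for benchmark in ("MMLU-Pro", "SWE-bench Verified", "GPQA Diamond"):
    print(benchmark, best_at(SCATTER_SCORES, benchmark))
```

Here GPQA Diamond returns None because none of these models carries a GPQA score in the sample, which is how a category with no scored models would be handled.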

Models × benchmarks

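Each cell is the model's best primary score on that benchmark. The last two values in a row are the Intelligence Index and coverage, the number of the eight benchmarks the model has been scored on (for example, 4/8).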
# Model MMLU-Pro GPQA Diamond Humanity's Last Exam AIME 2025 SWE-bench Verified LiveCodeBench MMMU AA-LCR Intelligence Index Coverage
1 GPT 5.1 88.1% 94.6% 74.9% 84.2% 85.5 4/8
2 GPT 5 (Thinking) 85.7% 24.8% 94.6% 74.9% 84.2% 72.8 5/8
3 o3 83.3% 20.3% 88.9% 69.1% 82.9% 68.9 5/8
4 Qwen 3.6 27B 86.2% 87.8% 24.0% 77.2% 83.9% 82.9% 73.7 6/8
5 o4 mini 92.7% 68.1% 81.6% 80.8 3/8
6 Claude Opus 4.5 87.0% 80.9% 80.7% 82.9 3/8
7 Claude Sonnet 4.5 83.4% 87.0% 77.2% 77.8% 81.4 4/8
8 o1 78.0% 8.12% 79.2% 48.9% 77.6% 58.4 5/8
9 Qwen 3.5 122B A10B 86.6% 72.0% 76.9% 78.5 3/8
10 Llama 4 Behemoth 82.2% 73.7% 49.4% 76.1% 70.4 4/8
11 GPT 4.1 66.3% 55.0% 75.0% 65.4 3/8
12 Claude Sonnet 3.7 (Thinking) 78.2% 62.3% 75.0% 71.8 3/8
13 Claude Sonnet 4 75.4% 70.5% 72.7% 74.4% 73.3 4/8
14 GPT 5 77.8% 6.30% 61.9% 52.8% 74.4% 54.6 5/8
15 Seed 1.5 80.1% 65.0% 73.9% 73.0 3/8
16 Llama 4 Maverick 80.5% 69.8% 43.4% 73.4% 66.8 4/8
17 Claude Haiku 4.5 73.0% 80.7% 73.3% 73.2% 75.1 4/8
18 Grok 3 79.9% 75.4% 57.0% 73.2% 71.4 4/8
19 Gemini 2.5 Flash-Lite 64.6% 5.10% 49.8% 31.6% 33.7% 72.9% 43.0 6/8
20 Claude Sonnet 3.7 62.3% 62.3% 71.8% 65.5 3/8
21 Llama 4 Scout 74.3% 57.2% 32.8% 69.4% 58.4 4/8
22 Grok 3 mini 78.9% 66.2% 41.5% 69.4% 64.0 4/8
23 GPT-4o 53.6% 69.1%
24 Pixtral Large 64.0%
25 Pixtral 12B 52.0%
26 Claude Haiku 3.5 41.6% 65.0% 40.6% 49.1 3/8
27 Claude Opus 3 50.4%
28 Claude Opus 4.6 91.3% 80.8%
29 Claude Opus 4.7 94.2% 46.9% 87.6% 76.2 3/8
30 Claude Sonnet 4.6 89.9% 33.2% 79.6% 67.6 3/8
31 Command A 69.6% 50.8%
32 Deepseek 3.2 85.0% 82.4% 40.8% 93.1% 73.1% 83.3% 76.3 6/8
33 DeepSeek 3.2 Speciale 30.6% 96.0%
34 DeepSeek V3 75.9% 59.1% 42.0% 59.0 3/8
35 DeepSeek V3.1 Terminus 85.0% 80.7% 21.7% 88.4% 74.9% 70.1 5/8
36 DeepSeek V3.2 Exp 85.0% 79.9% 89.3% 67.8% 74.1% 79.2 5/8
37 Deepseek V4 Pro 93.5%
38 DeepSeek-R1 84.0% 71.5% 70.0% 49.2% 68.7 4/8
39 Devstral 2 72.2%
40 Gemini 2.5 Flash (Thinking) 82.8% 11.0% 72.0% 60.4% 56.6 4/8
41 Gemini 2.5 Pro 84.0% 18.8% 86.7% 63.8% 70.4% 64.7 5/8
42 Gemini 2.5 Pro (Thinking) 86.4% 21.6% 88.0% 59.6% 63.9 4/8
43 Gemini 3 Deep Think 93.8% 41.0%
44 Gemini 3 Flash 90.4% 78.0%
45 Gemini 3 Flash (Thinking) 90.4% 33.7% 95.2% 78.0% 74.3 4/8
46 Gemini 3 Pro 91.9% 37.5% 95.0% 76.2% 75.2 4/8
47 Gemini 3.1 Pro 94.3% 44.4% 80.6% 73.1 3/8
48 Gemma 3 78.0% 72.6%
49 Gemma 4 85.2% 84.3% 80.0% 83.2 3/8
50 GLM 4.6 81.0% 17.2% 93.9% 68.0% 82.8% 68.6 5/8
51 GLM 5 86.0% 77.8%
52 GLM 5.2 91.2% 40.5%
53 GLM-5.1 86.2% 31.0%
54 GPT 5.1 Thinking 88.1% 94.6%
55 GPT 5.2 Pro 93.2%
56 GPT 5.2 Thinking 92.4% 100.0% 80.0% 90.8 3/8
57 GPT 5.3 Codex 92.6% 56.8%
58 GPT 5.4 92.8% 57.7%
59 GPT 5.4 Mini 88.0%
60 GPT 5.4 Nano 82.8%
61 GPT 5.4 Pro 94.4%
62 GPT 5.5 93.6% 41.4%
63 GPT 5.5 Instant 81.2%
64 GPT OSS 120B 90.0% 80.1%
65 GPT-4 Turbo 50.4%
66 Grok 3 Think 84.6% 93.3% 79.4% 85.8 3/8
67 Grok 4 87.5% 25.4% 91.7% 79.0% 70.9 4/8
68 Grok 4 Heavy 88.4% 44.4% 100.0% 79.4% 78.1 4/8
69 Grok Code Fast 1 70.8%
70 Kimi K2 Instruct 75.1% 49.5% 65.8% 53.7% 61.0 4/8
71 Kimi K2.5 76.8% 85.0%
72 Kimi K2.6 90.5% 54.0% 80.2% 89.6% 78.6 4/8
73 Kimi K2.7 Code 89.6% 32.8% 66.3% 62.9 3/8
74 Llama 3.1 Nemotron Ultra 76.0%
75 Llama 3.2 32.8%
76 Llama 3.3 68.9% 50.5%
77 Magistral Medium 70.8% 64.9% 50.3% 62.0 3/8
78 MiMo V2.5 Pro 48.0% 78.9%
79 MiniMax M2.5 80.2%
80 Mistral Large 3 43.9% 34.4%
81 Mistral Medium 3.5 77.6%
82 Mistral Small 3 66.3%
83 Muse Spark 89.5% 42.8% 77.4% 69.9 3/8
84 Nemotron 3 78.3% 75.0% 89.1% 38.8% 68.3% 69.9 5/8
85 Nemotron 3 Nano 78.3% 75.0% 89.1% 68.3% 77.7 4/8
86 Nemotron 3 Super 75.7% 60.0%
87 Nova Lite 42.0%
88 Nova Micro 40.0%
89 Nova Premier 42.4%
90 Nova Pro 46.9%
91 Opus 4.1 Thinking 80.9% 74.5%
92 Phi 4 reasoning plus 76.0% 69.3% 78.0% 74.4 3/8
93 Qwen 3.5 27B 86.1% 85.5% 72.4% 81.3 3/8
94 Qwen 3.5 35B A3B 84.2% 69.2%
95 Qwen3 235B A22B 81.5% 70.7%
96 Qwen3 30B A3B 65.8% 70.9% 62.6% 66.4 3/8
97 Qwen3 Coder 67.0%
98 Qwen3-235B-A22B 81.5% 70.7%
99 Qwen3-30B-A3B 65.8%
100 R1 1776 71.5% 70.0%
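The row order above appears to follow MMMU, highest first (the top row's last score, 84.2, matches the "Best at multimodal" figure), with models that have no score in that column listed afterwards in alphabetical order. A minimal sketch of that kind of column sort, assuming unscored models always go last regardless of sort direction; `sort_rows` is a hypothetical helper, and the demo sorts on the Intelligence Index column using values from the table.

```python
from typing import List, Optional, Tuple

Row = Tuple[str, Optional[float]]  # (model name, value in the sort column or None)


def sort_rows(rows: List[Row], descending: bool = True) -> List[Row]:
    """Scored rows first, ordered by value; unscored rows after, alphabetically."""
    scored = [row for row in rows if row[1] is not None]
    unscored = [row for row in rows if row[1] is None]
    # sorted() is stable (also with reverse=True), so equal values keep input order.
    scored = sorted(scored, key=lambda row: row[1], reverse=descending)
    # Case-insensitive name order for models with no value in this column.
    unscored = sorted(unscored, key=lambda row: row[0].casefold())
    return scored + unscored


# Intelligence Index values copied from the table; None means no index is shown.
rows: List[Row] = [
    ("GPT 5.1", 85.5),
    ("Pixtral Large", None),
    ("GPT 5.2 Thinking", 90.8),
    ("Command A", None),
    ("Claude Haiku 3.5", 49.1),
    ("Grok 3 Think", 85.8),
]

for name, value in sort_rows(rows):
    print(f"{name}: {'n/a' if value is None else value}")
```

Keeping unscored rows in their own group means a missing score is never treated as zero, so a model scored on one benchmark cannot sink below models that were actually measured.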

Capability scatter

Each dot is a model, positioned by its scores on two benchmarks; dot size reflects how many of the eight headline benchmarks the model has been scored on.
Capability scatter: MMLU-Pro vs SWE-bench Verified (both axes 0 to 100)
Qwen 3.6 27B: MMLU-Pro 86.2, SWE-bench Verified 77.2
Claude Haiku 3.5: MMLU-Pro 41.6, SWE-bench Verified 40.6
Deepseek 3.2: MMLU-Pro 85.0, SWE-bench Verified 73.1
DeepSeek V3: MMLU-Pro 75.9, SWE-bench Verified 42.0
DeepSeek V3.2 Exp: MMLU-Pro 85.0, SWE-bench Verified 67.8
DeepSeek-R1: MMLU-Pro 84.0, SWE-bench Verified 49.2
Nemotron 3: MMLU-Pro 78.3, SWE-bench Verified 38.8
Qwen 3.5 27B: MMLU-Pro 86.1, SWE-bench Verified 72.4
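The scatter encodes coverage as dot size. A minimal sketch of one common way to do that, assuming marker area (not radius) should be proportional to the number of benchmarks scored; the page does not say which scaling it uses, and `BASE_RADIUS`, `marker_radius` and `build_points` are hypothetical. Scores and coverage counts are copied from this page.

```python
import math
from typing import List, NamedTuple, Tuple


class ScatterPoint(NamedTuple):
    name: str
    mmlu_pro: float            # one plotted axis
    swe_bench_verified: float  # the other plotted axis
    benchmarks_scored: int     # coverage out of 8, from the table
    radius: float              # marker radius in hypothetical pixel units


BASE_RADIUS = 3.0  # hypothetical: radius for a model scored on a single benchmark


def marker_radius(benchmarks_scored: int) -> float:
    # Radius grows with the square root of coverage, so marker *area*
    # grows linearly with the number of benchmarks scored.
    return BASE_RADIUS * math.sqrt(benchmarks_scored)


def build_points(rows: List[Tuple[str, float, float, int]]) -> List[ScatterPoint]:
    return [ScatterPoint(name, x, y, n, marker_radius(n)) for name, x, y, n in rows]


# (name, MMLU-Pro, SWE-bench Verified, benchmarks scored), from this page.
data = [
    ("Qwen 3.6 27B", 86.2, 77.2, 6),
    ("DeepSeek-R1", 84.0, 49.2, 4),
    ("Claude Haiku 3.5", 41.6, 40.6, 3),
]

for p in build_points(data):
    print(f"{p.name}: ({p.mmlu_pro}, {p.swe_bench_verified}) radius={p.radius:.2f}")
```

Scaling the radius linearly instead would make a 6/8 model's marker six times as wide, and 36 times the area, of a 1/8 model's, which overstates the difference in coverage.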