AI model leaderboard

Every tracked model ranked across the headline benchmarks. The Intelligence Index averages each model's normalized scores; click any benchmark column header to sort by it.
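As a rough sketch of how such an index could be computed: the figures in the table are consistent with a plain mean of each model's available benchmark percentages, shown only when at least three of the eight benchmarks are scored. That rule is inferred from the published rows, not documented here, and it treats the percentages as already normalized. The names `intelligence_index` and `MIN_BENCHMARKS` are hypothetical.

```python
from typing import Optional, Sequence

# Assumption: inferred from the table (models with two scores show no index).
MIN_BENCHMARKS = 3


def intelligence_index(scores: Sequence[Optional[float]]) -> Optional[float]:
    """Mean of the available scores, or None if too few are present."""
    present = [s for s in scores if s is not None]
    if len(present) < MIN_BENCHMARKS:
        return None
    return sum(present) / len(present)


# Column order: MMLU-Pro, GPQA Diamond, Humanity's Last Exam, AIME 2025,
# SWE-bench Verified, LiveCodeBench, MMMU, AA-LCR.
gpt_5_1 = [None, 88.1, None, 94.6, 74.9, None, 84.2, None]
gpt_4o = [None, 53.6, None, None, None, None, 69.1, None]

print(intelligence_index(gpt_5_1))  # close to the listed 85.5
print(intelligence_index(gpt_4o))   # None: only two benchmarks are scored
```

Averaging only the benchmarks a model has been scored on means two index values are not directly comparable when they rest on different benchmark subsets, which is why the coverage count next to each index matters.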

Best overall: GPT 5.2 Thinking (90.8 Intelligence Index)
Best at knowledge: GPT 5.4 Pro (94.4 GPQA Diamond)
Best at math: GPT 5.2 Thinking (100.0 AIME 2025)
Best at coding: Claude Opus 4.7 (87.6 SWE-bench Verified)
Best at multimodal: GPT 5.1 (84.2 MMMU)

Models × benchmarks

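Each cell shows the model's best primary score on that benchmark; a dash marks a benchmark with no recorded score. The Intelligence Index column gives the index followed by the number of headline benchmarks scored out of 8 (for example, "85.5 4/8"). On the live page, color intensity reflects the normalized score and clicking a column header sorts by that column.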

#	Model	MMLU-Pro	GPQA Diamond	Humanity's Last Exam	AIME 2025	SWE-bench Verified	LiveCodeBench	MMMU	AA-LCR	Intelligence Index
1	GPT 5.1	—	88.1%	—	94.6%	74.9%	—	84.2%	—	85.5 4/8
2	GPT 5 (Thinking)	—	85.7%	24.8%	94.6%	74.9%	—	84.2%	—	72.8 5/8
3	o3	—	83.3%	20.3%	88.9%	69.1%	—	82.9%	—	68.9 5/8
4	Qwen 3.6 27B	86.2%	87.8%	24.0%	—	77.2%	83.9%	82.9%	—	73.7 6/8
5	o4 mini	—	—	—	92.7%	68.1%	—	81.6%	—	80.8 3/8
6	Claude Opus 4.5	—	87.0%	—	—	80.9%	—	80.7%	—	82.9 3/8
7	Claude Sonnet 4.5	—	83.4%	—	87.0%	77.2%	—	77.8%	—	81.4 4/8
8	o1	—	78.0%	8.12%	79.2%	48.9%	—	77.6%	—	58.4 5/8
9	Qwen 3.5 122B A10B	—	86.6%	—	—	72.0%	—	76.9%	—	78.5 3/8
10	Llama 4 Behemoth	82.2%	73.7%	—	—	—	49.4%	76.1%	—	70.4 4/8
11	GPT 4.1	—	66.3%	—	—	55.0%	—	75.0%	—	65.4 3/8
12	Claude Sonnet 3.7 (Thinking)	—	78.2%	—	—	62.3%	—	75.0%	—	71.8 3/8
13	Claude Sonnet 4	—	75.4%	—	70.5%	72.7%	—	74.4%	—	73.3 4/8
14	GPT 5	—	77.8%	6.30%	61.9%	52.8%	—	74.4%	—	54.6 5/8
15	Seed 1.5	80.1%	65.0%	—	—	—	—	73.9%	—	73.0 3/8
16	Llama 4 Maverick	80.5%	69.8%	—	—	—	43.4%	73.4%	—	66.8 4/8
17	Claude Haiku 4.5	—	73.0%	—	80.7%	73.3%	—	73.2%	—	75.1 4/8
18	Grok 3	79.9%	75.4%	—	—	—	57.0%	73.2%	—	71.4 4/8
19	Gemini 2.5 Flash-Lite	—	64.6%	5.10%	49.8%	31.6%	33.7%	72.9%	—	43.0 6/8
20	Claude Sonnet 3.7	—	62.3%	—	—	62.3%	—	71.8%	—	65.5 3/8
21	Llama 4 Scout	74.3%	57.2%	—	—	—	32.8%	69.4%	—	58.4 4/8
22	Grok 3 mini	78.9%	66.2%	—	—	—	41.5%	69.4%	—	64.0 4/8
23	GPT-4o	—	53.6%	—	—	—	—	69.1%	—	—
24	Pixtral Large	—	—	—	—	—	—	64.0%	—	—
25	Pixtral 12B	—	—	—	—	—	—	52.0%	—	—
26	Claude Haiku 3.5	41.6%	65.0%	—	—	40.6%	—	—	—	49.1 3/8
27	Claude Opus 3	—	50.4%	—	—	—	—	—	—	—
28	Claude Opus 4.6	—	91.3%	—	—	80.8%	—	—	—	—
29	Claude Opus 4.7	—	94.2%	46.9%	—	87.6%	—	—	—	76.2 3/8
30	Claude Sonnet 4.6	—	89.9%	33.2%	—	79.6%	—	—	—	67.6 3/8
31	Claude Sonnet 5	—	—	43.2%	—	—	—	—	—	—
32	Command A	69.6%	50.8%	—	—	—	—	—	—	—
33	DeepSeek 3.2	85.0%	82.4%	40.8%	93.1%	73.1%	83.3%	—	—	76.3 6/8
34	DeepSeek 3.2 Speciale	—	—	30.6%	96.0%	—	—	—	—	—
35	DeepSeek V3	75.9%	59.1%	—	—	42.0%	—	—	—	59.0 3/8
36	DeepSeek V3.1 Terminus	85.0%	80.7%	21.7%	88.4%	—	74.9%	—	—	70.1 5/8
37	DeepSeek V3.2 Exp	85.0%	79.9%	—	89.3%	67.8%	74.1%	—	—	79.2 5/8
38	DeepSeek V4 Pro	—	—	—	—	—	93.5%	—	—	—
39	DeepSeek-R1	84.0%	71.5%	—	70.0%	49.2%	—	—	—	68.7 4/8
40	Devstral 2	—	—	—	—	72.2%	—	—	—	—
41	Gemini 2.5 Flash (Thinking)	—	82.8%	11.0%	72.0%	60.4%	—	—	—	56.6 4/8
42	Gemini 2.5 Pro	—	84.0%	18.8%	86.7%	63.8%	70.4%	—	—	64.7 5/8
43	Gemini 2.5 Pro (Thinking)	—	86.4%	21.6%	88.0%	59.6%	—	—	—	63.9 4/8
44	Gemini 3 Deep Think	—	93.8%	41.0%	—	—	—	—	—	—
45	Gemini 3 Flash	—	90.4%	—	—	78.0%	—	—	—	—
46	Gemini 3 Flash (Thinking)	—	90.4%	33.7%	95.2%	78.0%	—	—	—	74.3 4/8
47	Gemini 3 Pro	—	91.9%	37.5%	95.0%	76.2%	—	—	—	75.2 4/8
48	Gemini 3.1 Pro	—	94.3%	44.4%	—	80.6%	—	—	—	73.1 3/8
49	Gemma 3	78.0%	72.6%	—	—	—	—	—	—	—
50	Gemma 4	85.2%	84.3%	—	—	—	80.0%	—	—	83.2 3/8
51	GLM 4.6	—	81.0%	17.2%	93.9%	68.0%	82.8%	—	—	68.6 5/8
52	GLM 5	—	86.0%	—	—	77.8%	—	—	—	—
53	GLM 5.2	—	91.2%	40.5%	—	—	—	—	—	—
54	GLM-5.1	—	86.2%	31.0%	—	—	—	—	—	—
55	GPT 5.1 Thinking	—	88.1%	—	94.6%	—	—	—	—	—
56	GPT 5.2 Pro	—	93.2%	—	—	—	—	—	—	—
57	GPT 5.2 Thinking	—	92.4%	—	100.0%	80.0%	—	—	—	90.8 3/8
58	GPT 5.3 Codex	—	92.6%	—	—	56.8%	—	—	—	—
59	GPT 5.4	—	92.8%	—	—	57.7%	—	—	—	—
60	GPT 5.4 Mini	—	88.0%	—	—	—	—	—	—	—
61	GPT 5.4 Nano	—	82.8%	—	—	—	—	—	—	—
62	GPT 5.4 Pro	—	94.4%	—	—	—	—	—	—	—
63	GPT 5.5	—	93.6%	41.4%	—	—	—	—	—	—
64	GPT 5.5 Instant	—	—	—	81.2%	—	—	—	—	—
65	GPT OSS 120B	90.0%	80.1%	—	—	—	—	—	—	—
66	GPT-4 Turbo	—	50.4%	—	—	—	—	—	—	—
67	Grok 3 Think	—	84.6%	—	93.3%	—	79.4%	—	—	85.8 3/8
68	Grok 4	—	87.5%	25.4%	91.7%	—	79.0%	—	—	70.9 4/8
69	Grok 4 Heavy	—	88.4%	44.4%	100.0%	—	79.4%	—	—	78.1 4/8
70	Grok Code Fast 1	—	—	—	—	70.8%	—	—	—	—
71	Kimi K2 Instruct	—	75.1%	—	49.5%	65.8%	53.7%	—	—	61.0 4/8
72	Kimi K2.5	—	—	—	—	76.8%	85.0%	—	—	—
73	Kimi K2.6	—	90.5%	54.0%	—	80.2%	89.6%	—	—	78.6 4/8
74	Kimi K2.7 Code	—	89.6%	32.8%	—	—	—	—	66.3%	62.9 3/8
75	Llama 3.1 Nemotron Ultra	—	76.0%	—	—	—	—	—	—	—
76	Llama 3.2	—	32.8%	—	—	—	—	—	—	—
77	Llama 3.3	68.9%	50.5%	—	—	—	—	—	—	—
78	Magistral Medium	—	70.8%	—	64.9%	—	50.3%	—	—	62.0 3/8
79	MiMo V2.5 Pro	—	—	48.0%	—	78.9%	—	—	—	—
80	MiniMax M2.5	—	—	—	—	80.2%	—	—	—	—
81	Mistral Large 3	—	43.9%	—	—	—	34.4%	—	—	—
82	Mistral Medium 3.5	—	—	—	—	77.6%	—	—	—	—
83	Mistral Small 3	66.3%	—	—	—	—	—	—	—	—
84	Muse Spark	—	89.5%	42.8%	—	77.4%	—	—	—	69.9 3/8
85	Nemotron 3	78.3%	75.0%	—	89.1%	38.8%	68.3%	—	—	69.9 5/8
86	Nemotron 3 Nano	78.3%	75.0%	—	89.1%	—	68.3%	—	—	77.7 4/8
87	Nemotron 3 Super	75.7%	60.0%	—	—	—	—	—	—	—
88	Nova Lite	—	42.0%	—	—	—	—	—	—	—
89	Nova Micro	—	40.0%	—	—	—	—	—	—	—
90	Nova Premier	—	—	—	—	42.4%	—	—	—	—
91	Nova Pro	—	46.9%	—	—	—	—	—	—	—
92	Opus 4.1 Thinking	—	80.9%	—	—	74.5%	—	—	—	—
93	Phi 4 reasoning plus	76.0%	69.3%	—	78.0%	—	—	—	—	74.4 3/8
94	Qwen 3.5 27B	86.1%	85.5%	—	—	72.4%	—	—	—	81.3 3/8
95	Qwen 3.5 35B A3B	—	84.2%	—	—	69.2%	—	—	—	—
96	Qwen3 235B A22B	—	—	—	81.5%	—	70.7%	—	—	—
97	Qwen3 30B A3B	—	65.8%	—	70.9%	—	62.6%	—	—	66.4 3/8
98	Qwen3 Coder	—	—	—	—	67.0%	—	—	—	—
99	Qwen3-235B-A22B	—	—	—	81.5%	—	70.7%	—	—	—
100	Qwen3-30B-A3B	—	65.8%	—	—	—	—	—	—	—
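The rows above are tab-separated, so they can be loaded and re-sorted offline. The following is a minimal sketch, assuming the missing marker is the dash character U+2014 and that the last cell is either that dash or an index followed by a coverage fraction such as "85.5 4/8". The helper names (`parse_percent`, `parse_index`, `parse_row`, `sort_by_column`) are hypothetical and not part of the site.

```python
from typing import Optional

MISSING = "\u2014"  # the dash used for unscored cells


def parse_percent(cell: str) -> Optional[float]:
    """Turn '88.1%' into 88.1 and the missing marker into None."""
    cell = cell.strip()
    if not cell or cell == MISSING:
        return None
    return float(cell.rstrip("%"))


def parse_index(cell: str) -> tuple[Optional[float], Optional[int]]:
    """Split '85.5 4/8' into (85.5, 4); a missing cell gives (None, None)."""
    cell = cell.strip()
    if not cell or cell == MISSING:
        return None, None
    value, coverage = cell.split()
    scored, _total = coverage.split("/")
    return float(value), int(scored)


def parse_row(line: str) -> dict:
    """Parse one tab-separated leaderboard row into a dict."""
    fields = line.split("\t")
    index, scored = parse_index(fields[-1])
    return {
        "rank": int(fields[0]),
        "model": fields[1],
        "scores": [parse_percent(c) for c in fields[2:-1]],
        "index": index,
        "scored": scored,
    }


def sort_by_column(rows: list[dict], col: int) -> list[dict]:
    """Sort descending by one benchmark column, unscored models last."""
    return sorted(
        rows,
        key=lambda r: (r["scores"][col] is None, -(r["scores"][col] or 0.0)),
    )


# Three rows copied from the table, with tabs written as escapes.
rows = [
    parse_row("1\tGPT 5.1\t—\t88.1%\t—\t94.6%\t74.9%\t—\t84.2%\t—\t85.5 4/8"),
    parse_row("29\tClaude Opus 4.7\t—\t94.2%\t46.9%\t—\t87.6%\t—\t—\t—\t76.2 3/8"),
    parse_row("23\tGPT-4o\t—\t53.6%\t—\t—\t—\t—\t69.1%\t—\t—"),
]

SWE_BENCH = 4  # position of SWE-bench Verified among the eight benchmark columns
ranked = sort_by_column(rows, SWE_BENCH)
print([r["model"] for r in ranked])
```

Sorting on a tuple whose first element flags a missing score keeps unscored models at the bottom regardless of sort direction, which is one plausible way to treat the dashes; the page's own behavior is not stated.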

Capability scatter

Each dot is a model. Position shows two-axis capability; size reflects how many headline benchmarks the model has been scored on.
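The description above can be sketched as a small data transformation. This is an illustration only: it assumes a model is plotted only when it has a score on both chosen axes, and that dot size is the raw count of scored benchmarks. Neither detail is stated by the page, and `Point` and `scatter_points` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Point(NamedTuple):
    model: str
    x: float
    y: float
    size: int  # number of headline benchmarks the model is scored on


def scatter_points(
    models: dict[str, list[Optional[float]]], x_col: int, y_col: int
) -> list[Point]:
    """Build one point per model that has a score on both axes."""
    points = []
    for name, scores in models.items():
        x, y = scores[x_col], scores[y_col]
        if x is None or y is None:
            continue  # assumption: models missing either axis are not drawn
        size = sum(s is not None for s in scores)
        points.append(Point(name, x, y, size))
    return points


# Scores copied from the table; column order matches the table header.
models = {
    "GPT 5.1": [None, 88.1, None, 94.6, 74.9, None, 84.2, None],
    "Claude Opus 4.7": [None, 94.2, 46.9, None, 87.6, None, None, None],
    "GPT-4o": [None, 53.6, None, None, None, None, 69.1, None],
}

GPQA_DIAMOND, SWE_BENCH = 1, 4
for point in scatter_points(models, GPQA_DIAMOND, SWE_BENCH):
    print(point)
```

With GPQA Diamond on the X axis and SWE-bench Verified on the Y axis, GPT-4o is dropped because it has no SWE-bench Verified score.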
