GSM8K

Grade School Math 8K

8.5k grade-school math word problems requiring 2-8 step arithmetic reasoning. Saturated by all frontier models; mostly useful as a smoke test today.
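
The page reports accuracy but does not describe how answers are scored. A common convention, sketched below, compares the last number in a model's output with the value after the `####` marker that closes each GSM8K reference solution. The function names `final_number` and `accuracy` and the sample strings are hypothetical illustrations, not the procedure behind any score listed here.

```python
import re

# Matches integers and decimals, optionally signed, with thousands separators.
NUMBER = r"-?[\d,]*\.?\d+"

def final_number(text):
    """Return the final numeric answer in `text` as a float, or None."""
    # GSM8K reference solutions end with "#### <answer>"; prefer that marker.
    marked = re.search(r"####\s*(" + NUMBER + ")", text)
    if marked:
        raw = marked.group(1)
    else:
        # Assumption: otherwise the last number in the text is the answer.
        numbers = re.findall(NUMBER, text)
        if not numbers:
            return None
        raw = numbers[-1]
    return float(raw.replace(",", ""))

def accuracy(predictions, references):
    """Percentage of predictions whose final number equals the reference's."""
    correct = 0
    for pred, ref in zip(predictions, references):
        answer = final_number(pred)
        if answer is not None and answer == final_number(ref):
            correct += 1
    return 100.0 * correct / len(references)

# Invented sample data, for illustration only.
references = [
    "48 / 2 = 24 clips in May. 48 + 24 = 72 clips in total. #### 72",
    "5 * 2 = 10 dollars. #### 10",
]
predictions = [
    "April: 48. May: 48 / 2 = 24. Total: 72 clips.",
    "The answer is 12.",
]
print(accuracy(predictions, references))
```

Exact match on the final number is strict: a correct chain of reasoning with a mis-copied last line counts as wrong, which is one reason scores obtained under different prompting conditions are hard to compare directly.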

Math · Text · Accuracy · Max 100.0% · Released Oct 2021 · Saturated · Possibly contaminated
Results: 12
Models scored: 11
Top: Claude Opus 3 (95.0%)
Median: 84.2%

Best results

Top primary scores; one row per model.
[Chart: rank and score labels that survived extraction: #2 94.8%, #3 94.5%, #5 92.0%, #7 77.7%, #10 56.8%]

Frontier over time

Each dot is one model result; the line traces the running best score.
[Chart: best score over time; y-axis 0.0 to 100.0, x-axis Jul 2023 to Apr 2026]
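
The running best described above is a cumulative maximum over results sorted by evaluation date. A minimal sketch follows, assuming a handful of rows copied from the results table; the page does not state which results feed its chart, and the function name `running_best` is hypothetical.

```python
from datetime import date

# (model, score in %, eval date) for a few rows of the results table.
results = [
    ("Claude Sonnet 3.5", 96.4, date(2024, 6, 20)),
    ("GPT 3.5", 57.1, date(2023, 3, 14)),
    ("Nova Pro", 94.8, date(2024, 12, 3)),
    ("Claude 2", 88.0, date(2023, 7, 11)),
    ("GPT-4", 92.0, date(2024, 3, 4)),
    ("Gemini Ultra", 94.4, date(2023, 12, 6)),
]

def running_best(rows):
    """Return (eval_date, best_so_far) pairs in chronological order."""
    best = float("-inf")
    frontier = []
    for _, score, when in sorted(rows, key=lambda row: row[2]):
        best = max(best, score)  # the line only moves up on a new record
        frontier.append((when, best))
    return frontier

for when, best in running_best(results):
    print(when.isoformat(), best)
```

Each result contributes one point; the frontier value changes only when a result beats every earlier one, so later but lower scores (GPT-4 and Nova Pro in this sample) leave the line flat.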

All results

Showing all configurations, including non-primary alternates.
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 3.5 | 96.4% | 0-shot · CoT · standard | 20 Jun 2024 | Self-reported | |
| 2 | Claude Opus 3 | 95.0% | 0-shot · CoT | 22 Oct 2024 | Self-reported | Primary |
| 3 | Nova Pro | 94.8% | 0-shot · CoT | 03 Dec 2024 | Self-reported | Primary |
| 4 | Nova Lite | 94.5% | 0-shot · CoT | 03 Dec 2024 | Self-reported | Primary |
| 5 | Gemini Ultra | 94.4% | 0-shot · standard | 06 Dec 2023 | Self-reported | |
| 6 | Nova Micro | 92.3% | 0-shot · CoT | 03 Dec 2024 | Self-reported | Primary |
| 7 | GPT-4 | 92.0% | 5-shot · CoT | 04 Mar 2024 | Self-reported | Primary |
| 8 | Gemini 1.5 | 90.8% | 11-shot · standard | 01 May 2024 | Self-reported | |
| 9 | Nemotron 3 Super | 90.7% | 8-shot | 03 Apr 2026 | Self-reported | Primary |
| 10 | Claude Haiku 3 | 88.9% | 0-shot · CoT · standard | 04 Mar 2024 | Self-reported | |
| 11 | Claude 2 | 88.0% | 0-shot · CoT · standard | 11 Jul 2023 | Self-reported | |
| 12 | Gemini 1.5 Flash | 86.2% | 11-shot · standard | 01 May 2024 | Self-reported | |
| 13 | Llama 3.2 | 77.7% | 8-shot · CoT | 25 Sep 2024 | Self-reported | Primary |
| 14 | Mixtral 8x7B | 74.4% | | 01 Dec 2023 | Self-reported | Primary |
| 15 | Mixtral 8x7B | 74.4% | | 08 Jan 2024 | Self-reported | Primary |
| 16 | GPT 3.5 | 57.1% | 5-shot · standard | 14 Mar 2023 | Self-reported | |
| 17 | LLaMA 2 | 56.8% | 8-shot | 19 Jul 2023 | Paper | Primary, Verified |
| 18 | LLaMA 2 70B | 56.8% | 8-shot | 11 Jul 2023 | Paper | |
| 19 | Mistral 7B | 52.2% | | 01 Sep 2023 | Self-reported | Primary |
| 20 | Gemma 2 | 23.9% | 5-shot · Maj@1 | 25 Feb 2025 | Self-reported | Primary |