GSM8K
8.5k grade-school math word problems, each requiring 2 to 8 steps of arithmetic reasoning. Frontier models have saturated the benchmark, so today it is mostly useful as a smoke test.
Best results
Frontier over time
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 3.5 | 96.4% | 0-shot · CoT · standard | 20 Jun 2024 | Self-reported | |
| 2 | Claude Opus 3 | 95.0% | 0-shot · CoT | 22 Oct 2024 | Self-reported | Primary |
| 3 | Nova Pro | 94.8% | 0-shot · CoT | 03 Dec 2024 | Self-reported | Primary |
| 4 | Nova Lite | 94.5% | 0-shot · CoT | 03 Dec 2024 | Self-reported | Primary |
| 5 | Gemini Ultra | 94.4% | 0-shot · standard | 06 Dec 2023 | Self-reported | |
| 6 | Nova Micro | 92.3% | 0-shot · CoT | 03 Dec 2024 | Self-reported | Primary |
| 7 | GPT-4 | 92.0% | 5-shot · CoT | 04 Mar 2024 | Self-reported | Primary |
| 8 | Gemini 1.5 | 90.8% | 11-shot · standard | 01 May 2024 | Self-reported | |
| 9 | Nemotron 3 Super | 90.7% | 8-shot | 03 Apr 2026 | Self-reported | Primary |
| 10 | Claude Haiku 3 | 88.9% | 0-shot · CoT · standard | 04 Mar 2024 | Self-reported | |
| 11 | Claude 2 | 88.0% | 0-shot · CoT · standard | 11 Jul 2023 | Self-reported | |
| 12 | Gemini 1.5 Flash | 86.2% | 11-shot · standard | 01 May 2024 | Self-reported | |
| 13 | Llama 3.2 | 77.7% | 8-shot · CoT | 25 Sep 2024 | Self-reported | Primary |
| 14 | Mixtral 8x7B | 74.4% | — | 01 Dec 2023 | Self-reported | Primary |
| 15 | Mixtral 8x7B | 74.4% | — | 08 Jan 2024 | Self-reported | Primary |
| 16 | GPT 3.5 | 57.1% | 5-shot · standard | 14 Mar 2023 | Self-reported | |
| 17 | LLaMA 2 | 56.8% | 8-shot | 19 Jul 2023 | Paper | Primary · Verified |
| 18 | LLaMA 2 70B | 56.8% | 8-shot | 11 Jul 2023 | Paper | |
| 19 | Mistral 7B | 52.2% | — | 01 Sep 2023 | Self-reported | Primary |
| 20 | Gemma 2 | 23.9% | 5-shot · Maj@1 | 25 Feb 2025 | Self-reported | Primary |
