MGSM
MGSM (Multilingual Grade School Math) consists of 250 GSM8K problems translated into 10 typologically diverse languages. It tests cross-lingual mathematical reasoning.
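This page does not say how the percentages are computed. As a rough illustration only, the sketch below assumes a common convention: exact match on the final number in each response, with accuracy averaged equally across languages. The names `final_number` and `mgsm_score`, and the sample data, are hypothetical and not taken from any evaluation harness.

```python
import re
from typing import Dict, List, Optional, Tuple


def final_number(text: str) -> Optional[str]:
    """Return the last number found in a response, without thousands separators."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def mgsm_score(
    responses: Dict[str, List[str]],
    targets: Dict[str, List[str]],
) -> Tuple[float, Dict[str, float]]:
    """Assumed scoring rule: per-language exact-match accuracy, then an unweighted mean."""
    per_language: Dict[str, float] = {}
    for lang, outputs in responses.items():
        gold = targets[lang]
        correct = 0
        for output, answer in zip(outputs, gold):
            predicted = final_number(output)
            # Compare numerically so "18" and "18.0" count as the same answer.
            if predicted is not None and float(predicted) == float(answer):
                correct += 1
        per_language[lang] = correct / len(gold)
    overall = sum(per_language.values()) / len(per_language)
    return overall, per_language


# Hypothetical responses for two languages (illustrative data only).
sample_responses = {
    "es": ["La respuesta es 18.", "Son 5 manzanas"],
    "de": ["Die Antwort ist 42"],
}
sample_targets = {"es": ["18", "6"], "de": ["42"]}
overall, per_language = mgsm_score(sample_responses, sample_targets)
print(overall, per_language)
```

Under this assumed rule every language carries equal weight. Reported scores may also depend on the prompt setup (0-shot, 8-shot, chain of thought), so rows with different conditions are not strictly comparable.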
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | Llama 3.3 | 91.1% | 0-shot | Dec 6, 2024 | self-reported | primary |
| 2 | Claude Opus 3 | 90.7% | 0-shot | Oct 22, 2024 | self-reported | primary |
| 3 | GPT-4o | 90.5% | not specified | Apr 16, 2025 | self-reported | primary |
| 4 | Nemotron 3 Super | 87.5% | 8-shot | Apr 3, 2026 | self-reported | primary |
| 5 | Llama 3.2 | 58.2% | 0-shot · CoT | Oct 25, 2024 | self-reported | primary |
