IFEval
Verifiable instruction following: about 25 instruction types whose compliance can be checked deterministically by a program (e.g. word counts, output formats).
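To make "checked deterministically" concrete, here is a minimal sketch of what such checkers could look like. The function names (`check_min_words`, `check_valid_json`, `check_no_commas`) and the two aggregate scores are illustrative assumptions, not the benchmark's actual code, and this page does not state which aggregate its Score column uses.

```python
import json


def check_min_words(response: str, min_words: int) -> bool:
    """Hypothetical checker: at least `min_words` whitespace-separated words."""
    return len(response.split()) >= min_words


def check_valid_json(response: str) -> bool:
    """Hypothetical checker: the entire response parses as JSON."""
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return False
    return True


def check_no_commas(response: str) -> bool:
    """Hypothetical checker: the response contains no comma characters."""
    return "," not in response


def instruction_accuracy(results: list[list[bool]]) -> float:
    """Fraction of individual instructions followed, pooled over all prompts."""
    flat = [ok for prompt in results for ok in prompt]
    return sum(flat) / len(flat)


def prompt_accuracy(results: list[list[bool]]) -> float:
    """Fraction of prompts where every attached instruction was followed."""
    return sum(all(prompt) for prompt in results) / len(results)


# Two made-up prompts, each carrying two instructions.
first = "one two three four five"
second = "{not json}"
results = [
    [check_min_words(first, 5), check_no_commas(first)],
    [check_valid_json(second), check_no_commas(second)],
]

print(instruction_accuracy(results))
print(prompt_accuracy(results))
```

Because each checker is a pure function of the response text, the same response always receives the same verdict, so no human or model judge is needed. The stricter per-prompt aggregate can differ noticeably from the per-instruction one, which is one reason scores reported under different conditions are hard to compare directly.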
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | Qwen 3.7 Max | 94.3% | 0-shot · CoT · standard | May 20, 2026 | self reported | |
| 2 | Claude Sonnet 3.7 (Thinking) | 93.2% | — | Feb 24, 2025 | self reported | primary |
| 3 | Nova Pro | 92.1% | 0-shot | Dec 3, 2024 | self reported | primary |
| 4 | Llama 3.3 | 92.1% | — | Dec 6, 2024 | self reported | primary |
| 5 | Command A | 90.9% | — | Apr 7, 2025 | self reported | primary |
| 6 | Claude Sonnet 3.7 | 90.8% | — | Feb 24, 2025 | self reported | primary |
| 7 | Nova Lite | 89.7% | 0-shot | Dec 3, 2024 | self reported | primary |
| 8 | Seed 1.5 | 89.5% | 0-shot · CoT | Jan 22, 2025 | self reported | primary |
| 9 | Claude Sonnet 3.5 | 87.8% | 0-shot · standard | Oct 22, 2024 | self reported | |
| 10 | Nova Micro | 87.2% | 0-shot | Dec 3, 2024 | self reported | primary |
| 11 | GPT 4.1 | 87.0% | — | Apr 14, 2025 | self reported | primary |
| 12 | Mistral Small 3 | 82.9% | — | Jan 30, 2025 | self reported | primary |
| 13 | Llama 3.2 | 77.4% | — | Sep 25, 2025 | self reported | primary |
| 14 | Claude Haiku 3 | 77.2% | 0-shot · standard | Oct 22, 2024 | self reported | |
| 15 | Qwen 3.5 27B | 76.5% | — | Feb 24, 2026 | third party | primary verified |
