IFBench
Measures how reliably a model follows complex multi-constraint instructions, a known weak spot for many otherwise strong models.
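To make "multi-constraint" concrete, here is a minimal sketch of how a single response could be checked against several programmatically verifiable constraints at once. It is a generic illustration only, not IFBench's actual harness or scoring rule: the constraint types, the helper names (`max_words`, `must_include`, `bullet_count`, `check_response`, `all_followed`), and the sample response are all hypothetical.

```python
import re
from typing import Callable, Dict

# A constraint is any function that inspects a response and returns pass/fail.
# Every checker below is a made-up example, not one taken from IFBench.
Constraint = Callable[[str], bool]


def max_words(limit: int) -> Constraint:
    """Pass if the response has at most `limit` whitespace-separated tokens."""
    return lambda text: len(text.split()) <= limit


def must_include(keyword: str) -> Constraint:
    """Pass if `keyword` appears anywhere in the response (case-insensitive)."""
    return lambda text: keyword.lower() in text.lower()


def bullet_count(n: int) -> Constraint:
    """Pass if the response contains exactly `n` lines starting with '- ' or '* '."""
    pattern = re.compile(r"^[ \t]*[-*] ", flags=re.MULTILINE)
    return lambda text: len(pattern.findall(text)) == n


def check_response(text: str, constraints: Dict[str, Constraint]) -> Dict[str, bool]:
    """Run every constraint and report the result of each one by name."""
    return {name: check(text) for name, check in constraints.items()}


def all_followed(results: Dict[str, bool]) -> bool:
    """Strict view: the instruction counts as followed only if nothing failed."""
    return all(results.values())


# Hypothetical instruction: "Summarize in exactly three bullets,
# in at most 20 words, and mention latency."
constraints = {
    "max_20_words": max_words(20),
    "mentions_latency": must_include("latency"),
    "three_bullets": bullet_count(3),
}

response = "- Latency fell sharply.\n- Memory use stayed flat.\n- Accuracy improved."
results = check_response(response, constraints)
print(results, all_followed(results))
```

Reporting each constraint separately, as well as the strict all-or-nothing result, shows why this kind of task is hard: a response can satisfy most constraints and still fail the instruction as a whole.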
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | Qwen 3.5 122B A10B | 76.1% | — | Apr 24, 2026 | third party | primary verified |
| 2 | Qwen 3.5 35B A3B | 70.2% | — | Feb 15, 2025 | third party | primary verified |
