IFBench

Instruction Following Benchmark

Measures how reliably a model follows complex multi-constraint instructions, a known weak spot for many otherwise strong models.

Language · Text · Accuracy · Max 100.0% · Released Jun 2025

Results

Models scored

Top score: 76.1% (Qwen 3.5 122B A10B)

Median score: 70.2%

Best results

Top primary scores; one row per model.

Chart: each dot is one model result; the line traces the running best score.

Showing one canonical row per model.

#	Model	Score	Conditions	Eval date	Source	Flags
1	Qwen 3.5 122B A10B	76.1%	—	24 Apr 2026	Third-party	Primary Verified
2	Nemotron 3	71.5%	standard	15 Dec 2025	Self-reported	Primary
3	Qwen 3.5 35B A3B	70.2%	—	15 Feb 2025	Third-party	Primary Verified
4	Kimi K2.7 Code	63.1%	0-shot · standard	12 Jun 2026	Third-party	Primary Verified
5	Trinity Large Thinking	52.3%	0-shot · standard	01 Apr 2026	Self-reported	Primary
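
The summary figures and the running-best line can be recomputed from the table rows. The following is a minimal sketch, not the site's own method: the `rows` list is copied by hand from the table, and the helper names (`parse_date`, `running_best`) are hypothetical. It assumes the median is taken over the five canonical rows and that the running best is ordered by eval date.

```python
from datetime import date
from statistics import median

# Rows copied by hand from the table above: (model, score in %, eval date).
rows = [
    ("Qwen 3.5 122B A10B", 76.1, "24 Apr 2026"),
    ("Nemotron 3", 71.5, "15 Dec 2025"),
    ("Qwen 3.5 35B A3B", 70.2, "15 Feb 2025"),
    ("Kimi K2.7 Code", 63.1, "12 Jun 2026"),
    ("Trinity Large Thinking", 52.3, "01 Apr 2026"),
]

# Explicit month map so parsing does not depend on the system locale.
MONTHS = {m: i + 1 for i, m in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])}

def parse_date(text):
    """Parse a 'DD Mon YYYY' string as used in the Eval date column."""
    day, month, year = text.split()
    return date(int(year), MONTHS[month], int(day))

def running_best(rows):
    """Return (eval date, best score so far) pairs in chronological order."""
    best = float("-inf")
    trace = []
    for _, score, when in sorted(rows, key=lambda r: parse_date(r[2])):
        best = max(best, score)
        trace.append((when, best))
    return trace

median_score = median(score for _, score, _ in rows)
top_model, top_score, _ = max(rows, key=lambda r: r[1])
trace = running_best(rows)

print(f"Top: {top_model} at {top_score}%")
print(f"Median: {median_score}%")
for when, best in trace:
    print(f"{when}: running best {best}%")
```

Under these assumptions the computed top score and median agree with the summary figures in the Results section.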