BFCL v3
The Berkeley Function Calling Leaderboard (BFCL) v3 evaluates function-calling (tool-calling) correctness across single-call, parallel-call, multi-turn, and irrelevance-detection scenarios.
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | Qwen3 235B A22B | 70.8% | — | Apr 28, 2025 | self-reported | primary |
| 2 | Qwen3 30B A3B | 69.1% | — | Apr 28, 2025 | self-reported | primary |
