τ²-Bench Telecom

Multi-turn customer-service agent benchmark in a telecom domain: the model must take real tool actions while a simulated customer pushes back on incomplete or wrong answers.

Agentic · Text · Accuracy · Max 100.0% · Released Jun 2025
Results: 12
Models scored: 12
Top: Gemini 3.1 Pro (99.3%)
Median: 93.5%
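
The summary figures can be checked against the results table. The sketch below assumes they are computed over the 12 rows flagged Primary, and that the displayed median is rounded half-up to one decimal place; both are assumptions, since the page does not state its method. The name `PRIMARY_SCORES` is illustrative only.

```python
from decimal import Decimal, ROUND_HALF_UP
from statistics import median

# Scores (%) of the 12 rows flagged Primary in the results table.
PRIMARY_SCORES = {
    "Gemini 3.1 Pro": Decimal("99.3"),
    "GPT 5.4": Decimal("98.9"),
    "Claude Opus 4.5": Decimal("98.2"),
    "GPT 5.5": Decimal("98.0"),
    "Claude Sonnet 4.6": Decimal("97.9"),
    "GPT 5 (Thinking)": Decimal("96.7"),
    "Gemini 3 Flash (Thinking)": Decimal("90.2"),
    "Claude Haiku 4.5": Decimal("83.0"),
    "Gemini 2.5 Flash (Thinking)": Decimal("79.5"),
    "Qwen 3.5 27B": Decimal("79.0"),
    "Gemini 2.5 Pro (Thinking)": Decimal("77.8"),
    "GPT 5": Decimal("38.6"),
}

# Highest-scoring primary result.
top_model = max(PRIMARY_SCORES, key=PRIMARY_SCORES.get)
top_score = PRIMARY_SCORES[top_model]

# With 12 values, the median is the mean of the 6th and 7th sorted scores.
raw_median = median(PRIMARY_SCORES.values())

# Assumed display rule: round half-up to one decimal place.
shown_median = raw_median.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

print(f"Results: {len(PRIMARY_SCORES)}")
print(f"Top: {top_model} ({top_score}%)")
print(f"Median: {shown_median}% (unrounded {raw_median})")
```

Under these assumptions the two middle scores are 96.7 and 90.2, so the unrounded median is 93.45, which matches the 93.5% shown once rounded.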

Best results

Top primary scores; one row per model.

Frontier over time

Each dot is one model result; the line traces the running best score.
[Chart: best score over time. Y-axis: score, 0 to 100. X-axis: eval date, Aug 2025 to Apr 2026.]
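
The "running best" line described above is a cumulative maximum over results ordered by eval date. The following is a minimal sketch of that calculation, assuming the chart plots only the rows flagged Primary in the results table (the axis range of Aug 2025 to Apr 2026 suggests this, but the page does not confirm it). The names `RESULTS` and `running_best` are hypothetical.

```python
from datetime import date

# (model, score in %, eval date) for the rows flagged Primary in the results table.
RESULTS = [
    ("Gemini 3.1 Pro", 99.3, date(2026, 2, 19)),
    ("GPT 5.4", 98.9, date(2026, 3, 5)),
    ("Claude Opus 4.5", 98.2, date(2025, 10, 24)),
    ("GPT 5.5", 98.0, date(2026, 4, 23)),
    ("Claude Sonnet 4.6", 97.9, date(2026, 2, 17)),
    ("GPT 5 (Thinking)", 96.7, date(2025, 8, 7)),
    ("Gemini 3 Flash (Thinking)", 90.2, date(2025, 12, 17)),
    ("Claude Haiku 4.5", 83.0, date(2025, 10, 15)),
    ("Gemini 2.5 Flash (Thinking)", 79.5, date(2025, 12, 17)),
    ("Qwen 3.5 27B", 79.0, date(2026, 2, 24)),
    ("Gemini 2.5 Pro (Thinking)", 77.8, date(2025, 12, 17)),
    ("GPT 5", 38.6, date(2025, 8, 7)),
]


def running_best(results):
    """Return one (eval_date, best_score_so_far) pair per result, in date order."""
    frontier = []
    best = float("-inf")
    # Sort by date, then by score, so same-day results are handled deterministically.
    for _model, score, day in sorted(results, key=lambda r: (r[2], r[1])):
        best = max(best, score)
        frontier.append((day, best))
    return frontier


frontier = running_best(RESULTS)
for day, best in frontier:
    print(f"{day.isoformat()}  {best:.1f}%")
```

Each input row corresponds to one dot on the chart; the second element of each returned pair is the height of the frontier line at that date, which never decreases.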

All results

Showing all configurations, including non-primary alternates.
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|-------|-------|------------|-----------|--------|-------|
| 1 | Gemini 3.1 Pro | 99.3% | CoT | 19 Feb 2026 | Self-reported | Primary |
| 2 | GPT 5.4 | 98.9% | | 05 Mar 2026 | Self-reported | Primary |
| 3 | Claude Opus 4.5 | 98.2% | | 24 Oct 2025 | Self-reported | Primary |
| 4 | GPT 5.5 | 98.0% | CoT | 23 Apr 2026 | Self-reported | Primary |
| 5 | Claude Sonnet 4.6 | 97.9% | | 17 Feb 2026 | Self-reported | Primary |
| 6 | GPT 5 (Thinking) | 96.7% | | 07 Aug 2025 | Self-reported | Primary |
| 7 | Gemini 3 Flash (Thinking) | 90.2% | | 17 Dec 2025 | Self-reported | Primary |
| 8 | Claude Haiku 4.5 | 83.0% | | 15 Oct 2025 | Self-reported | Primary |
| 9 | Gemini 2.5 Flash (Thinking) | 79.5% | | 17 Dec 2025 | Self-reported | Primary |
| 10 | Qwen 3.5 27B | 79.0% | | 24 Feb 2026 | Third-party | Primary, Verified |
| 11 | Gemini 2.5 Pro (Thinking) | 77.8% | | 17 Dec 2025 | Self-reported | Primary |
| 12 | Claude Sonnet 3.5 | 62.6% | 0-shot · standard | 22 Oct 2024 | Self-reported | |
| 13 | GPT 5 | 38.6% | | 07 Aug 2025 | Self-reported | Primary |
| 14 | Claude Haiku 3 | 18.2% | 0-shot · standard | 22 Oct 2024 | Self-reported | |