τ²-Bench Telecom

Multi-turn customer-service agent benchmark in a telecom domain: the model must take real tool actions while a simulated customer pushes back on incomplete or wrong answers.

Agentic · Text · Accuracy (max 100.0%) · Released Jun 2025

Results: 14
Models scored: 14
Top score: 99.3% (Gemini 3.1 Pro)
Median score: 92.5%

Best results

Top primary scores; one row per model.

Frontier over time

Each dot is one model result; the line traces the running best score.
[Chart: best score over time. Y-axis: score, 0 to 100; x-axis: Aug 2025 to Jun 2026.]

All results

Showing one canonical row per model.
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|-------|-------|------------|-----------|--------|-------|
| 1 | Gemini 3.1 Pro | 99.3% | CoT | 19 Feb 2026 | Self-reported | Primary |
| 2 | GPT 5.4 | 98.9% | | 05 Mar 2026 | Self-reported | Primary |
| 3 | Claude Opus 4.5 | 98.2% | | 24 Oct 2025 | Self-reported | Primary |
| 4 | GPT 5.5 | 98.0% | CoT | 23 Apr 2026 | Self-reported | Primary |
| 5 | Claude Sonnet 4.6 | 97.9% | | 17 Feb 2026 | Self-reported | Primary |
| 6 | GPT 5 (Thinking) | 96.7% | | 07 Aug 2025 | Self-reported | Primary |
| 7 | Trinity Large Thinking | 94.7% | 0-shot · standard | 01 Apr 2026 | Self-reported | Primary |
| 8 | Gemini 3 Flash (Thinking) | 90.2% | | 17 Dec 2025 | Self-reported | Primary |
| 9 | Kimi K2.7 Code | 90.1% | 0-shot · agentic | 12 Jun 2026 | Third-party | Primary, Verified |
| 10 | Claude Haiku 4.5 | 83.0% | | 15 Oct 2025 | Self-reported | Primary |
| 11 | Gemini 2.5 Flash (Thinking) | 79.5% | | 17 Dec 2025 | Self-reported | Primary |
| 12 | Qwen 3.5 27B | 79.0% | | 24 Feb 2026 | Third-party | Primary, Verified |
| 13 | Gemini 2.5 Pro (Thinking) | 77.8% | | 17 Dec 2025 | Self-reported | Primary |
| 14 | GPT 5 | 38.6% | | 07 Aug 2025 | Self-reported | Primary |
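
The summary figures and the frontier line can be recomputed from the rows above. The following is a minimal sketch, not the site's actual method: the names `RESULTS` and `running_best` are hypothetical, and the tie-breaking rule for results sharing an eval date (higher score first) is an assumption. The median of the 14 scores is the midpoint of 90.2 and 94.7, that is 92.45, which is consistent with the 92.5% shown on the page.

```python
from datetime import date
from statistics import median

# (model, score in %, eval date), copied from the results table.
RESULTS = [
    ("Gemini 3.1 Pro", 99.3, date(2026, 2, 19)),
    ("GPT 5.4", 98.9, date(2026, 3, 5)),
    ("Claude Opus 4.5", 98.2, date(2025, 10, 24)),
    ("GPT 5.5", 98.0, date(2026, 4, 23)),
    ("Claude Sonnet 4.6", 97.9, date(2026, 2, 17)),
    ("GPT 5 (Thinking)", 96.7, date(2025, 8, 7)),
    ("Trinity Large Thinking", 94.7, date(2026, 4, 1)),
    ("Gemini 3 Flash (Thinking)", 90.2, date(2025, 12, 17)),
    ("Kimi K2.7 Code", 90.1, date(2026, 6, 12)),
    ("Claude Haiku 4.5", 83.0, date(2025, 10, 15)),
    ("Gemini 2.5 Flash (Thinking)", 79.5, date(2025, 12, 17)),
    ("Qwen 3.5 27B", 79.0, date(2026, 2, 24)),
    ("Gemini 2.5 Pro (Thinking)", 77.8, date(2025, 12, 17)),
    ("GPT 5", 38.6, date(2025, 8, 7)),
]


def running_best(results):
    """Return (eval_date, best_so_far) at each point where the best score improves."""
    frontier = []
    best = float("-inf")
    # Walk results in date order; on a shared date, consider the higher score first
    # (an assumed tie-break, the page does not state one).
    for _, score, when in sorted(results, key=lambda r: (r[2], -r[1])):
        if score > best:
            best = score
            frontier.append((when, best))
    return frontier


if __name__ == "__main__":
    scores = [score for _, score, _ in RESULTS]
    top = max(RESULTS, key=lambda r: r[1])
    print("Top:", top[0], top[1])
    print("Median:", median(scores))
    for when, best in running_best(RESULTS):
        print(when.isoformat(), best)
```

Under these assumptions the frontier only moves three times (GPT 5 (Thinking), Claude Opus 4.5, Gemini 3.1 Pro), since every other result lands below the best score already recorded by its eval date.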