TAAFT

Terminal-Bench Hard

The hardest split of Terminal-Bench: agents must complete real CLI tasks (debugging, system admin, multi-step automation) inside a sandboxed terminal.

Agentic · Text · Accuracy · Max 100.0% · Released Apr 2025

Results: 25
Models scored: 25
Top: GPT 5.5 (82.7%)
Median: 57.0%

Best results

Top primary scores; one row per model.

Frontier over time

Each dot is one model result; the line traces the running best score.
[Chart: Best score over time. Y-axis: score, 0 to 100. X-axis: eval date, Feb 2025 to Apr 2026.]
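A minimal sketch of how such a frontier line can be traced, assuming it is a plain running maximum over results sorted by eval date (the page does not say how the chart is computed). The rows are a subset of this page's results table; the function name `running_best` is hypothetical.

```python
from datetime import date
from itertools import accumulate

# (eval date, model, score in %) for a subset of rows from the results table.
results = [
    (date(2026, 4, 23), "GPT 5.5", 82.7),
    (date(2026, 2, 5), "GPT 5.3 Codex", 77.3),
    (date(2025, 9, 30), "GLM 4.6", 75.9),
    (date(2026, 3, 5), "GPT 5.4", 75.1),
    (date(2025, 11, 24), "Claude Opus 4.5", 59.3),
    (date(2025, 8, 5), "Opus 4.1 Thinking", 43.3),
    (date(2025, 5, 22), "Claude Sonnet 4", 35.5),
]


def running_best(rows):
    """Return (eval_date, best_score_so_far) pairs in chronological order."""
    ordered = sorted(rows, key=lambda row: row[0])
    # accumulate(..., max) yields the running maximum of the score sequence.
    best = accumulate((score for _, _, score in ordered), max)
    return [(day, top) for (day, _, _), top in zip(ordered, best)]


frontier = running_best(results)
for day, top in frontier:
    print(day.isoformat(), top)
```

On this subset the line only steps up when a new result beats every earlier one, so GLM 4.6 (75.9%) holds the frontier until GPT 5.3 Codex (77.3%).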

All results

Showing one canonical row per model.
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | GPT 5.5 | 82.7% | CoT | 23 Apr 2026 | Self-reported | Primary |
| 2 | GPT 5.3 Codex | 77.3% | CoT | 05 Feb 2026 | Self-reported | Primary |
| 3 | GLM 4.6 | 75.9% | | 30 Sep 2025 | Self-reported | Primary |
| 4 | GPT 5.4 | 75.1% | | 05 Mar 2026 | Self-reported | Primary |
| 5 | Claude Opus 4.7 | 69.4% | | 16 Apr 2026 | Self-reported | Primary |
| 6 | Gemini 3.1 Pro | 68.5% | CoT | 19 Feb 2026 | Self-reported | Primary |
| 7 | Kimi K2.6 | 66.7% | CoT | 20 Apr 2026 | Self-reported | Primary |
| 8 | Claude Opus 4.6 | 65.4% | | 05 Feb 2026 | Self-reported | Primary |
| 9 | GPT 5.2 Codex | 64.0% | | 18 Dec 2025 | Self-reported | Primary |
| 10 | GPT 5.4 Mini | 60.0% | CoT | 17 Mar 2026 | Self-reported | Primary |
| 11 | Claude Opus 4.5 | 59.3% | | 24 Nov 2025 | Self-reported | Primary |
| 12 | Claude Sonnet 4.6 | 59.1% | | 17 Feb 2026 | Self-reported | Primary |
| 13 | MiniMax M2.7 | 57.0% | 0-shot · CoT | 18 Mar 2026 | Self-reported | Primary |
| 14 | GLM 5 | 56.2% | CoT | 12 Feb 2026 | Self-reported | Primary |
| 15 | Gemini 3 Pro | 54.2% | CoT | 18 Nov 2025 | Self-reported | Primary |
| 16 | Qwen 3.5 122B A10B | 49.4% | | 24 Apr 2026 | Third-party | Primary, Verified |
| 17 | Gemini 3 Flash (Thinking) | 47.6% | | 17 Dec 2025 | Self-reported | Primary |
| 18 | Deepseek 3.2 | 46.4% | | 01 Dec 2025 | Paper | Primary, Verified |
| 19 | GPT 5.4 Nano | 46.3% | CoT | 17 Mar 2026 | Self-reported | Primary |
| 20 | Opus 4.1 Thinking | 43.3% | CoT | 05 Aug 2025 | Self-reported | Primary |
| 21 | Qwen 3.5 27B | 41.6% | | 24 Feb 2026 | Third-party | Primary, Verified |
| 22 | Qwen 3.5 35B A3B | 40.5% | | 15 Feb 2025 | Third-party | Primary, Verified |
| 23 | Claude Sonnet 4 | 35.5% | | 22 May 2025 | Self-reported | Primary |
| 24 | Gemini 2.5 Pro (Thinking) | 32.6% | | 17 Dec 2025 | Self-reported | Primary |
| 25 | Gemini 2.5 Flash (Thinking) | 16.9% | | 17 Dec 2025 | Self-reported | Primary |
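
The summary figures near the top of the page (25 models scored, top 82.7%, median 57.0%) can be rechecked from the Score column above. This is a small sketch using Python's standard library; it assumes the page's median is the ordinary statistical median over the 25 canonical rows, which the page does not state explicitly.

```python
from statistics import median

# Score column of the results table, in rank order (percent).
scores = [
    82.7, 77.3, 75.9, 75.1, 69.4,
    68.5, 66.7, 65.4, 64.0, 60.0,
    59.3, 59.1, 57.0, 56.2, 54.2,
    49.4, 47.6, 46.4, 46.3, 43.3,
    41.6, 40.5, 35.5, 32.6, 16.9,
]

models_scored = len(scores)
top_score = max(scores)
# With an odd number of rows the median is the middle (13th) score.
median_score = median(scores)

print(models_scored, top_score, median_score)  # 25 82.7 57.0
```

The median lands on the 13th row (MiniMax M2.7), so twelve models score above it and twelve below.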