Terminal-Bench Hard

The hardest split of Terminal-Bench: agents must complete real CLI tasks (debugging, system admin, multi-step automation) inside a sandboxed terminal.

Agentic · Text · Accuracy · Max 100.0% · Released Apr 2025

Results: 25
Models scored: 25
Top: GPT 5.5 (82.7%)
Median: 57.0%
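
The summary figures above appear to be derived from the rows flagged Primary in the results listing. A minimal sketch in Python, assuming the count, top score, and median are all computed over those 25 primary rows (an inference from the listed data, not a documented methodology; the variable names are illustrative):

```python
from statistics import median

# Scores (%) of the 25 rows flagged "Primary" in the results listing.
# Assumption: non-primary alternates are excluded from the summary stats.
primary_scores = [
    82.7, 77.3, 75.9, 75.1, 69.4, 68.5, 66.7, 65.4, 64.0, 60.0,
    59.3, 59.1, 57.0, 56.2, 54.2, 49.4, 47.6, 46.4, 46.3, 43.3,
    41.6, 40.5, 35.5, 32.6, 16.9,
]

models_scored = len(primary_scores)    # number of primary rows
top_score = max(primary_scores)        # best primary score
median_score = median(primary_scores)  # middle value of the sorted scores

print(models_scored, top_score, median_score)  # 25 82.7 57.0
```

Under that assumption the sketch agrees with the figures shown: 25 models, a top score of 82.7%, and a median of 57.0%.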

Best results

Top primary scores; one row per model.
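
One way to read "top primary scores; one row per model" is as a filter followed by a per-model maximum. The sketch below illustrates that reading on a handful of rows taken from the results listing; the dictionary layout and the `primary` field name are hypothetical, and the site's actual selection logic is not documented here.

```python
# A few rows from the results listing; "primary" mirrors the Flags column.
rows = [
    {"model": "GPT 5.5", "score": 82.7, "primary": True},
    {"model": "Gemini 3.5 Flash", "score": 76.2, "primary": False},
    {"model": "GLM 4.6", "score": 75.9, "primary": True},
    {"model": "Claude Opus 4.8", "score": 74.6, "primary": False},
    {"model": "GPT 5.4", "score": 75.1, "primary": True},
]

# Keep only primary rows, and at most one (the highest) score per model.
best = {}
for row in rows:
    if not row["primary"]:
        continue
    if row["model"] not in best or row["score"] > best[row["model"]]:
        best[row["model"]] = row["score"]

# Rank models by score, highest first.
ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```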

Frontier over time

Each dot is one model result; the line traces the running best score.
[Chart: best score over time. Y-axis: score, 0.00 to 100.0. X-axis: Feb 2025 to Apr 2026.]
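
The running best score is a cumulative maximum over results ordered by evaluation date. A minimal sketch, using a few rows from the results listing as sample input; whether the chart includes non-primary alternates is not stated, so the selection of rows here is only illustrative:

```python
from datetime import date
from itertools import accumulate

# (eval date, score %) pairs taken from the results listing, in no particular order.
results = [
    (date(2026, 4, 23), 82.7),   # GPT 5.5
    (date(2025, 9, 30), 75.9),   # GLM 4.6
    (date(2025, 5, 22), 35.5),   # Claude Sonnet 4
    (date(2025, 11, 18), 54.2),  # Gemini 3 Pro
    (date(2026, 2, 5), 77.3),    # GPT 5.3 Codex
    (date(2025, 8, 5), 43.3),    # Opus 4.1 Thinking
    (date(2025, 11, 24), 59.3),  # Claude Opus 4.5
]

# Order the results chronologically, then take the cumulative maximum of the scores.
results_sorted = sorted(results)
dates = [d for d, _ in results_sorted]
frontier = list(accumulate((s for _, s in results_sorted), max))

for d, best in zip(dates, frontier):
    print(d.isoformat(), best)
```

Each dot in the chart corresponds to one `(date, score)` pair, and the line corresponds to `frontier`: it only moves up when a new result beats every earlier one, which is why the two Nov 2025 results leave it flat at 75.9.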

All results

Showing all configurations, including non-primary alternates.
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|-------|-------|------------|-----------|--------|-------|
| 1 | GPT 5.5 | 82.7% | CoT | 23 Apr 2026 | Self-reported | Primary |
| 2 | GPT 5.3 Codex | 77.3% | CoT | 05 Feb 2026 | Self-reported | Primary |
| 3 | Gemini 3.5 Flash | 76.2% | 0-shot · CoT · agentic | 19 May 2026 | Self-reported | |
| 4 | GLM 4.6 | 75.9% | | 30 Sep 2025 | Self-reported | Primary |
| 5 | GPT 5.4 | 75.1% | | 05 Mar 2026 | Self-reported | Primary |
| 6 | Claude Opus 4.8 | 74.6% | 0-shot · CoT · agentic | 28 May 2026 | Self-reported | |
| 7 | Qwen 3.7 Max | 69.7% | 0-shot · CoT · agentic | 20 May 2026 | Self-reported | |
| 8 | Claude Opus 4.7 | 69.4% | | 16 Apr 2026 | Self-reported | Primary |
| 9 | Composer 2.5 | 69.3% | 0-shot · CoT · agentic | 18 May 2026 | Self-reported | |
| 10 | Gemini 3.1 Pro | 68.5% | CoT | 19 Feb 2026 | Self-reported | Primary |
| 11 | Kimi K2.6 | 66.7% | CoT | 20 Apr 2026 | Self-reported | Primary |
| 12 | MiniMax M3 | 66.0% | 0-shot · CoT · agentic | 01 Jun 2026 | Self-reported | |
| 13 | Claude Opus 4.6 | 65.4% | | 05 Feb 2026 | Self-reported | Primary |
| 14 | GPT 5.2 Codex | 64.0% | | 18 Dec 2025 | Self-reported | Primary |
| 15 | GPT 5.4 Mini | 60.0% | CoT | 17 Mar 2026 | Self-reported | Primary |
| 16 | Claude Opus 4.5 | 59.3% | | 24 Nov 2025 | Self-reported | Primary |
| 17 | Claude Sonnet 4.6 | 59.1% | | 17 Feb 2026 | Self-reported | Primary |
| 18 | MiniMax M2.7 | 57.0% | 0-shot · CoT | 18 Mar 2026 | Self-reported | Primary |
| 19 | GLM 5 | 56.2% | CoT | 12 Feb 2026 | Self-reported | Primary |
| 20 | Gemini 3 Pro | 54.2% | CoT | 18 Nov 2025 | Self-reported | Primary |
| 21 | Qwen 3.5 122B A10B | 49.4% | | 24 Apr 2026 | Third-party | Primary, Verified |
| 22 | Gemini 3 Flash (Thinking) | 47.6% | | 17 Dec 2025 | Self-reported | Primary |
| 23 | Deepseek 3.2 | 46.4% | | 01 Dec 2025 | Paper | Primary, Verified |
| 24 | GPT 5.4 Nano | 46.3% | CoT | 17 Mar 2026 | Self-reported | Primary |
| 25 | Opus 4.1 Thinking | 43.3% | CoT | 05 Aug 2025 | Self-reported | Primary |
| 26 | Qwen 3.5 27B | 41.6% | | 24 Feb 2026 | Third-party | Primary, Verified |
| 27 | Qwen 3.5 35B A3B | 40.5% | | 15 Feb 2025 | Third-party | Primary, Verified |
| 28 | Claude Sonnet 4 | 35.5% | | 22 May 2025 | Self-reported | Primary |
| 29 | Gemini 2.5 Pro (Thinking) | 32.6% | | 17 Dec 2025 | Self-reported | Primary |
| 30 | Gemini 2.5 Flash (Thinking) | 16.9% | | 17 Dec 2025 | Self-reported | Primary |