Terminal-Bench Hard
The hardest split of Terminal-Bench: agents must complete real CLI tasks (debugging, system admin, multi-step automation) inside a sandboxed terminal.
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | GPT 5.5 | 82.7% | CoT | 23 Apr 2026 | Self-reported | Primary |
| 2 | GPT 5.3 Codex | 77.3% | CoT | 05 Feb 2026 | Self-reported | Primary |
| 3 | GLM 4.6 | 75.9% | — | 30 Sep 2025 | Self-reported | Primary |
| 4 | GPT 5.4 | 75.1% | — | 05 Mar 2026 | Self-reported | Primary |
| 5 | Claude Opus 4.7 | 69.4% | — | 16 Apr 2026 | Self-reported | Primary |
| 6 | Gemini 3.1 Pro | 68.5% | CoT | 19 Feb 2026 | Self-reported | Primary |
| 7 | Kimi K2.6 | 66.7% | CoT | 20 Apr 2026 | Self-reported | Primary |
| 8 | Claude Opus 4.6 | 65.4% | — | 05 Feb 2026 | Self-reported | Primary |
| 9 | GPT 5.2 Codex | 64.0% | — | 18 Dec 2025 | Self-reported | Primary |
| 10 | GPT 5.4 Mini | 60.0% | CoT | 17 Mar 2026 | Self-reported | Primary |
| 11 | Claude Opus 4.5 | 59.3% | — | 24 Nov 2025 | Self-reported | Primary |
| 12 | Claude Sonnet 4.6 | 59.1% | — | 17 Feb 2026 | Self-reported | Primary |
| 13 | MiniMax M2.7 | 57.0% | 0-shot · CoT | 18 Mar 2026 | Self-reported | Primary |
| 14 | GLM 5 | 56.2% | CoT | 12 Feb 2026 | Self-reported | Primary |
| 15 | Gemini 3 Pro | 54.2% | CoT | 18 Nov 2025 | Self-reported | Primary |
| 16 | Qwen 3.5 122B A10B | 49.4% | — | 24 Apr 2026 | Third-party | Primary, Verified |
| 17 | Gemini 3 Flash (Thinking) | 47.6% | — | 17 Dec 2025 | Self-reported | Primary |
| 18 | DeepSeek 3.2 | 46.4% | — | 01 Dec 2025 | Paper | Primary, Verified |
| 19 | GPT 5.4 Nano | 46.3% | CoT | 17 Mar 2026 | Self-reported | Primary |
| 20 | Claude Opus 4.1 (Thinking) | 43.3% | CoT | 05 Aug 2025 | Self-reported | Primary |
| 21 | Qwen 3.5 27B | 41.6% | — | 24 Feb 2026 | Third-party | Primary, Verified |
| 22 | Qwen 3.5 35B A3B | 40.5% | — | 15 Feb 2025 | Third-party | Primary, Verified |
| 23 | Claude Sonnet 4 | 35.5% | — | 22 May 2025 | Self-reported | Primary |
| 24 | Gemini 2.5 Pro (Thinking) | 32.6% | — | 17 Dec 2025 | Self-reported | Primary |
| 25 | Gemini 2.5 Flash (Thinking) | 16.9% | — | 17 Dec 2025 | Self-reported | Primary |
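
One way to read the table is as a frontier over time: the best score reported up to each evaluation date. The sketch below shows one possible way to compute that from a handful of rows copied from the table. The names `RESULTS`, `parse_date`, and `frontier` are illustrative and not part of any Terminal-Bench tooling. It orders rows by eval date only and ignores differences in conditions and source, so treat the output as a rough reading rather than a like-for-like comparison.

```python
from datetime import date

# Subset of rows from the table: (model, score in percent, eval date).
RESULTS = [
    ("GPT 5.5", 82.7, "23 Apr 2026"),
    ("GPT 5.3 Codex", 77.3, "05 Feb 2026"),
    ("GLM 4.6", 75.9, "30 Sep 2025"),
    ("GPT 5.4", 75.1, "05 Mar 2026"),
    ("Claude Opus 4.7", 69.4, "16 Apr 2026"),
    ("Claude Opus 4.5", 59.3, "24 Nov 2025"),
    ("Claude Sonnet 4", 35.5, "22 May 2025"),
]

# Month abbreviations as they appear in the table (locale-independent lookup).
MONTHS = {
    name: number
    for number, name in enumerate(
        "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1
    )
}


def parse_date(text):
    """Turn a table date such as '05 Feb 2026' into a datetime.date."""
    day, month, year = text.split()
    return date(int(year), MONTHS[month], int(day))


def frontier(results):
    """Return the rows that set a new best score, in chronological order."""
    # Sort by date; on the same date, put the higher score first so that
    # only that day's best result can enter the frontier.
    ordered = sorted(results, key=lambda r: (parse_date(r[2]), -r[1]))
    best = float("-inf")
    out = []
    for model, score, when in ordered:
        if score > best:
            best = score
            out.append((model, score, when))
    return out


if __name__ == "__main__":
    for model, score, when in frontier(RESULTS):
        print(f"{when}  {model}  {score:.1f}%")
```

For this subset, the frontier runs from Claude Sonnet 4 through GLM 4.6 and GPT 5.3 Codex to GPT 5.5. Later but lower-scoring entries such as GPT 5.4 do not appear in it.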
