Terminal-Bench Hard

The hardest split of Terminal-Bench: agents must complete real CLI tasks (debugging, system admin, multi-step automation) inside a sandboxed terminal.

Agentic · Text · Accuracy · Max 100.0% · Released Apr 2025

Results

Models scored: 26

Top: GPT 5.5 (82.7%)

Median: 56.6%
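As a quick sanity check on the summary figures, the sketch below recomputes the count, top score, and median from the 26 canonical scores listed under "All results". The variable names are illustrative only, and the use of a plain median over canonical rows is an assumption about how the page derives its figure, not a documented method.

```python
from statistics import median

# Primary scores (in %) copied from the "All results" table, one per model.
scores = [
    82.7, 77.3, 75.9, 75.1, 69.4, 68.5, 66.7, 65.4, 64.0, 60.0,
    59.3, 59.1, 57.0, 56.2, 54.2, 49.4, 47.6, 46.4, 46.3, 44.7,
    43.3, 41.6, 40.5, 35.5, 32.6, 16.9,
]

model_count = len(scores)
top_score = max(scores)
# With an even number of rows, the median is the mean of the two middle
# values (57.0 and 56.2 here); round to one decimal to match the page.
median_score = round(median(scores), 1)

print(model_count, top_score, median_score)
```

Under that assumption the recomputed median agrees with the 56.6% shown in the summary.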

Best results

Top primary scores; one row per model.

Top ten scores: 82.7%, 77.3%, 75.9%, 75.1%, 69.4%, 68.5%, 66.7%, 65.4%, 64.0%, 60.0%.

Frontier over time

Each dot is one model result; the line traces the running best score.
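The running best score described above can be sketched as a cumulative maximum over results sorted by evaluation date. This is a minimal illustration, assuming the line is simply the best score seen so far; the function name `running_best` is hypothetical, and the sample uses only a handful of rows from the results table, so its frontier differs from the full chart.

```python
from datetime import date

def running_best(results):
    """Return (eval_date, best_so_far) pairs in chronological order."""
    frontier = []
    best = float("-inf")
    for eval_date, score in sorted(results):
        best = max(best, score)  # the line only ever moves up
        frontier.append((eval_date, best))
    return frontier

# A subset of rows from the results table: (eval date, score in %).
sample = [
    (date(2026, 4, 23), 82.7),   # GPT 5.5
    (date(2025, 5, 22), 35.5),   # Claude Sonnet 4
    (date(2025, 8, 5), 43.3),    # Opus 4.1 Thinking
    (date(2025, 11, 24), 59.3),  # Claude Opus 4.5
    (date(2025, 12, 1), 46.4),   # Deepseek 3.2
    (date(2026, 2, 5), 77.3),    # GPT 5.3 Codex
]

frontier = running_best(sample)
for eval_date, best in frontier:
    print(eval_date.isoformat(), best)
```

In this subset, the 46.4% result from 01 Dec 2025 appears as a dot below the line, because the running best was already 59.3% at that point.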

All results

Showing one canonical row per model.

#	Model	Score	Conditions	Eval date	Source	Flags
1	GPT 5.5	82.7%	CoT	23 Apr 2026	Self-reported	Primary
2	GPT 5.3 Codex	77.3%	CoT	05 Feb 2026	Self-reported	Primary
3	GLM 4.6	75.9%	—	30 Sep 2025	Self-reported	Primary
4	GPT 5.4	75.1%	—	05 Mar 2026	Self-reported	Primary
5	Claude Opus 4.7	69.4%	—	16 Apr 2026	Self-reported	Primary
6	Gemini 3.1 Pro	68.5%	CoT	19 Feb 2026	Self-reported	Primary
7	Kimi K2.6	66.7%	CoT	20 Apr 2026	Self-reported	Primary
8	Claude Opus 4.6	65.4%	—	05 Feb 2026	Self-reported	Primary
9	GPT 5.2 Codex	64.0%	—	18 Dec 2025	Self-reported	Primary
10	GPT 5.4 Mini	60.0%	CoT	17 Mar 2026	Self-reported	Primary
11	Claude Opus 4.5	59.3%	—	24 Nov 2025	Self-reported	Primary
12	Claude Sonnet 4.6	59.1%	—	17 Feb 2026	Self-reported	Primary
13	MiniMax M2.7	57.0%	0-shot · CoT	18 Mar 2026	Self-reported	Primary
14	GLM 5	56.2%	CoT	12 Feb 2026	Self-reported	Primary
15	Gemini 3 Pro	54.2%	CoT	18 Nov 2025	Self-reported	Primary
16	Qwen 3.5 122B A10B	49.4%	—	24 Apr 2026	Third-party	Primary Verified
17	Gemini 3 Flash (Thinking)	47.6%	—	17 Dec 2025	Self-reported	Primary
18	Deepseek 3.2	46.4%	—	01 Dec 2025	Paper	Primary Verified
19	GPT 5.4 Nano	46.3%	CoT	17 Mar 2026	Self-reported	Primary
20	Kimi K2.7 Code	44.7%	0-shot · agentic	12 Jun 2026	Third-party	Primary Verified
21	Opus 4.1 Thinking	43.3%	CoT	05 Aug 2025	Self-reported	Primary
22	Qwen 3.5 27B	41.6%	—	24 Feb 2026	Third-party	Primary Verified
23	Qwen 3.5 35B A3B	40.5%	—	15 Feb 2026	Third-party	Primary Verified
24	Claude Sonnet 4	35.5%	—	22 May 2025	Self-reported	Primary
25	Gemini 2.5 Pro (Thinking)	32.6%	—	17 Dec 2025	Self-reported	Primary
26	Gemini 2.5 Flash (Thinking)	16.9%	—	17 Dec 2025	Self-reported	Primary
