Terminal-Bench Hard

The hardest split of Terminal-Bench: agents must complete real command-line tasks (debugging, system administration, multi-step automation) inside a sandboxed terminal.

Agentic · Text · Accuracy · Max 100.0% · Released Apr 2025

Results

Models scored

Top: GPT 5.5 (82.7%)

Median: 57.0%

Best results

Top primary scores; one row per model.

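GPT 5.5	82.7%
GPT 5.3 Codex	77.3%
GLM 4.6	75.9%
GPT 5.4	75.1%
Claude Opus 4.7	69.4%
Gemini 3.1 Pro	68.5%
Kimi K2.6	66.7%
Claude Opus 4.6	65.4%
GPT 5.2 Codex	64.0%
GPT 5.4 Mini	60.0%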

Frontier over time

Each dot is one model result; the line traces the running best score.
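As a rough sketch of how a running-best line like this can be derived, the Python below sorts a handful of rows from the results table by eval date and tracks the highest score seen so far. The function name `running_best` and the choice of rows are illustrative assumptions; the page does not say which results feed the chart or how ties are handled.

```python
from datetime import date

# (eval date, model, score %) copied from a subset of the results table.
results = [
    (date(2026, 3, 5), "GPT 5.4", 75.1),
    (date(2025, 5, 22), "Claude Sonnet 4", 35.5),
    (date(2026, 4, 23), "GPT 5.5", 82.7),
    (date(2025, 9, 30), "GLM 4.6", 75.9),
    (date(2025, 8, 5), "Opus 4.1 Thinking", 43.3),
    (date(2026, 2, 5), "GPT 5.3 Codex", 77.3),
    (date(2025, 11, 18), "Gemini 3 Pro", 54.2),
]


def running_best(rows):
    """Return (date, model, score, best_so_far) tuples in chronological order."""
    frontier = []
    best = float("-inf")
    for when, model, score in sorted(rows):  # tuples sort by date first
        best = max(best, score)
        frontier.append((when, model, score, best))
    return frontier


frontier = running_best(results)

# Models whose score equals the running best at their own eval date,
# i.e. the points where the frontier line steps up (ties would also match).
record_setters = [model for _, model, score, best in frontier if score == best]

for when, model, score, best in frontier:
    print(f"{when.isoformat()}  {model:<18} {score:5.1f}%  best so far {best:5.1f}%")
```

In this subset, Gemini 3 Pro and GPT 5.4 fall below the best score already on record at their eval dates, so they would appear as dots under the line rather than on it.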

All results

Showing all configurations, including non-primary alternates.

#	Model	Score	Conditions	Eval date	Source	Flags
1	GPT 5.5	82.7%	CoT	23 Apr 2026	Self-reported	Primary
2	GPT 5.3 Codex	77.3%	CoT	05 Feb 2026	Self-reported	Primary
3	Gemini 3.5 Flash	76.2%	0-shot · CoT · agentic	19 May 2026	Self-reported
4	GLM 4.6	75.9%	—	30 Sep 2025	Self-reported	Primary
5	GPT 5.4	75.1%	—	05 Mar 2026	Self-reported	Primary
6	Claude Opus 4.8	74.6%	0-shot · CoT · agentic	28 May 2026	Self-reported
7	Qwen 3.7 Max	69.7%	0-shot · CoT · agentic	20 May 2026	Self-reported
8	Claude Opus 4.7	69.4%	—	16 Apr 2026	Self-reported	Primary
9	Composer 2.5	69.3%	0-shot · CoT · agentic	18 May 2026	Self-reported
10	Gemini 3.1 Pro	68.5%	CoT	19 Feb 2026	Self-reported	Primary
11	Kimi K2.6	66.7%	CoT	20 Apr 2026	Self-reported	Primary
12	MiniMax M3	66.0%	0-shot · CoT · agentic	01 Jun 2026	Self-reported
13	Claude Opus 4.6	65.4%	—	05 Feb 2026	Self-reported	Primary
14	GPT 5.2 Codex	64.0%	—	18 Dec 2025	Self-reported	Primary
15	GPT 5.4 Mini	60.0%	CoT	17 Mar 2026	Self-reported	Primary
16	Claude Opus 4.5	59.3%	—	24 Nov 2025	Self-reported	Primary
17	Claude Sonnet 4.6	59.1%	—	17 Feb 2026	Self-reported	Primary
18	MiniMax M2.7	57.0%	0-shot · CoT	18 Mar 2026	Self-reported	Primary
19	GLM 5	56.2%	CoT	12 Feb 2026	Self-reported	Primary
20	Gemini 3 Pro	54.2%	CoT	18 Nov 2025	Self-reported	Primary
21	Qwen 3.5 122B A10B	49.4%	—	24 Apr 2026	Third-party	Primary Verified
22	Gemini 3 Flash (Thinking)	47.6%	—	17 Dec 2025	Self-reported	Primary
23	DeepSeek 3.2	46.4%	—	01 Dec 2025	Paper	Primary Verified
24	GPT 5.4 Nano	46.3%	CoT	17 Mar 2026	Self-reported	Primary
25	Opus 4.1 Thinking	43.3%	CoT	05 Aug 2025	Self-reported	Primary
26	Qwen 3.5 27B	41.6%	—	24 Feb 2026	Third-party	Primary Verified
27	Qwen 3.5 35B A3B	40.5%	—	15 Feb 2025	Third-party	Primary Verified
28	Claude Sonnet 4	35.5%	—	22 May 2025	Self-reported	Primary
29	Gemini 2.5 Pro (Thinking)	32.6%	—	17 Dec 2025	Self-reported	Primary
30	Gemini 2.5 Flash (Thinking)	16.9%	—	17 Dec 2025	Self-reported	Primary
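The summary figures can be cross-checked against this table. The sketch below assumes that the median is taken only over rows flagged Primary, which the page does not state. Under that assumption the table reproduces the 57.0% median and the ten scores listed under "Best results". Variable names are illustrative.

```python
from statistics import median

# (model, score %, has "Primary" flag) transcribed from the results table.
rows = [
    ("GPT 5.5", 82.7, True),
    ("GPT 5.3 Codex", 77.3, True),
    ("Gemini 3.5 Flash", 76.2, False),
    ("GLM 4.6", 75.9, True),
    ("GPT 5.4", 75.1, True),
    ("Claude Opus 4.8", 74.6, False),
    ("Qwen 3.7 Max", 69.7, False),
    ("Claude Opus 4.7", 69.4, True),
    ("Composer 2.5", 69.3, False),
    ("Gemini 3.1 Pro", 68.5, True),
    ("Kimi K2.6", 66.7, True),
    ("MiniMax M3", 66.0, False),
    ("Claude Opus 4.6", 65.4, True),
    ("GPT 5.2 Codex", 64.0, True),
    ("GPT 5.4 Mini", 60.0, True),
    ("Claude Opus 4.5", 59.3, True),
    ("Claude Sonnet 4.6", 59.1, True),
    ("MiniMax M2.7", 57.0, True),
    ("GLM 5", 56.2, True),
    ("Gemini 3 Pro", 54.2, True),
    ("Qwen 3.5 122B A10B", 49.4, True),
    ("Gemini 3 Flash (Thinking)", 47.6, True),
    ("Deepseek 3.2", 46.4, True),
    ("GPT 5.4 Nano", 46.3, True),
    ("Opus 4.1 Thinking", 43.3, True),
    ("Qwen 3.5 27B", 41.6, True),
    ("Qwen 3.5 35B A3B", 40.5, True),
    ("Claude Sonnet 4", 35.5, True),
    ("Gemini 2.5 Pro (Thinking)", 32.6, True),
    ("Gemini 2.5 Flash (Thinking)", 16.9, True),
]

# Keep only rows carrying the Primary flag (assumed basis for the summary).
primary = [(model, score) for model, score, is_primary in rows if is_primary]

top_model, top_score = max(primary, key=lambda r: r[1])
median_score = median(score for _, score in primary)
top10 = sorted(primary, key=lambda r: r[1], reverse=True)[:10]

print(len(primary), top_model, top_score, median_score)
# prints: 25 GPT 5.5 82.7 57.0
```

Taking the median over all 30 rows instead gives a value between 59.3 and 60.0, which does not match the 57.0% shown in the summary, so the Primary-only reading is the one consistent with the page.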
