Terminal-Bench 2.1

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Terminal-Bench 2.1 evaluates AI agents on 89 hard, realistic tasks in command-line terminal environments inspired by real workflows, spanning domains such as compiling code, training models, and setting up servers. Each task features a unique environment, a human-written reference solution, and comprehensive automated tests for verification; frontier models currently score under 65%.

Agentic · Text · Accuracy · Max 100.0% · Released Jan 2026


Results

Models scored: 5

Top score: 88.8% (GPT-5.6 Sol)

Median score: 82.5%

Best results

Top primary scores, one row per model: 88.8%, 84.3%, 82.5%, 81.0%, 80.4%.

Frontier over time

Each dot is one model result; the line traces the running best score.
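The running best score traced by that line is a cumulative maximum over results ordered by evaluation date. A minimal Python sketch follows, using the dated rows from the results table. The function name `running_best` is hypothetical, the order of same-day results is an assumption, and GLM 5.2 is left out because it has no eval date; the page's actual plotting logic is not shown in the source.

```python
from datetime import date
from itertools import accumulate

# (model, score in %, eval date) copied from the results table.
# GLM 5.2 is omitted here because the table lists no eval date for it.
results = [
    ("Claude Sonnet 5", 80.4, date(2026, 6, 30)),
    ("GPT-5.6 Sol", 88.8, date(2026, 6, 26)),
    ("GPT 5.6 Terra", 84.3, date(2026, 6, 26)),
    ("GPT 5.6 Luna", 82.5, date(2026, 6, 26)),
]

def running_best(rows):
    """Return (eval date, best score so far) pairs in date order."""
    # sorted() is stable, so same-day results keep their input order
    # (an assumption: the true within-day order is unknown).
    ordered = sorted(rows, key=lambda r: r[2])
    best_so_far = accumulate((r[1] for r in ordered), max)
    return [(r[2], b) for r, b in zip(ordered, best_so_far)]

frontier = running_best(results)
for day, best in frontier:
    print(day.isoformat(), f"{best:.1f}%")
```

Because the highest score appears on the earliest date, the traced line stays flat at 88.8% for the later result.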

All results

Showing one canonical row per model.

#	Model	Score	Conditions	Eval date	Source	Flags
1	GPT-5.6 Sol	88.8%	agentic	26 Jun 2026	Self-reported	Primary
2	GPT 5.6 Terra	84.3%	agentic	26 Jun 2026	Self-reported	Primary
3	GPT 5.6 Luna	82.5%	agentic	26 Jun 2026	Self-reported	Primary
4	GLM 5.2	81.0%	0-shot · standard	—	Self-reported	Primary
5	Claude Sonnet 5	80.4%	agentic	30 Jun 2026	Self-reported	Primary
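
The top and median figures reported under Results can be checked against the table above. The short Python sketch below recomputes them from the five listed scores; the variable names are illustrative only, and it assumes the table shows every scored model.

```python
from statistics import median

# (model, score in %) copied from the results table above.
rows = [
    ("GPT-5.6 Sol", 88.8),
    ("GPT 5.6 Terra", 84.3),
    ("GPT 5.6 Luna", 82.5),
    ("GLM 5.2", 81.0),
    ("Claude Sonnet 5", 80.4),
]

# Highest-scoring row and the median of all listed scores.
top_model, top_score = max(rows, key=lambda r: r[1])
median_score = median(score for _, score in rows)

print(f"Top: {top_model} at {top_score:.1f}%")
print(f"Median: {median_score:.1f}%")
```

With an odd number of rows the median is simply the middle score, here the third-ranked model's result.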
