
Terminal-Bench 2.1

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Terminal-Bench 2.1 evaluates AI agents on 89 hard, realistic tasks in command-line terminal environments inspired by real workflows, spanning domains such as compiling code, training models, and setting up servers. Each task features a unique environment, a human-written reference solution, and comprehensive automated tests for verification; frontier models currently score under 65%.

Agentic · Text · Accuracy · Max 100.0% · Released Jan 2026
Results: 1
Models scored: 1
Top: GLM 5.2 (81.0%)
Median: 81.0%

Best results

Top primary scores; one row per model.
1 GLM 5.2 81.0%

Frontier over time

Each dot is one model result; the line traces the running best score.
Not enough data to plot a trend yet.
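
As a rough illustration of the "running best score" that the frontier line traces, here is a minimal sketch. It assumes results are already sorted by evaluation date; the function name `running_best` and the example scores are hypothetical and are not taken from this leaderboard.

```python
from itertools import accumulate


def running_best(scores):
    """Return the best score seen so far at each position.

    Assumes `scores` is ordered chronologically (oldest result first).
    """
    return list(accumulate(scores, max))


# Hypothetical scores for illustration only, not real leaderboard data.
example_scores = [52.0, 48.5, 61.0, 59.0, 81.0]
print(running_best(example_scores))  # [52.0, 52.0, 61.0, 61.0, 81.0]
```

With a single result, as on this page, the running best is just that one score, so there is no trend to draw.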

All results

Showing one canonical row per model.
# | Model | Score | Conditions | Eval date | Source | Flags
1 | GLM 5.2 | 81.0% | 0-shot · standard | (not listed) | Self-reported | Primary