Terminal-Bench 2.1
Terminal-Bench 2.1 evaluates AI agents on 89 hard, realistic tasks in terminal environments inspired by real workflows, spanning domains such as compiling code, training models, and setting up servers. Each task has its own environment, a human-written reference solution, and comprehensive automated tests for verification; frontier models currently score under 65%.
Best results
Frontier over time
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | GLM 5.2 | 81.0% | 0-shot · standard | — | Self-reported | Primary |
