Aider Polyglot
225 hard Exercism programming exercises across 6 languages (C++, Go, Java, JavaScript, Python, Rust). Measures whole-file edit accuracy under a realistic agentic-coding harness.
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.5 | 89.4% | — | 24 Nov 2025 | Self-reported | Primary |
| 2 | GPT 5.1 | 88.0% | 0-shot · CoT | 13 Nov 2025 | Self-reported | Primary |
| 3 | GPT 5 (Thinking) | 88.0% | — | 07 Aug 2025 | Self-reported | Primary |
| 4 | o3 (High) | 81.3% | — | 16 Apr 2025 | Self-reported | Primary |
| 5 | Claude Sonnet 4.5 | 78.8% | — | 24 Nov 2025 | Self-reported | Primary |
| 6 | Gemini 2.5 Pro | 74.0% | — | 17 Jun 2025 | Third-party | Primary Verified |
| 7 | o4 mini (high) | 68.9% | — | 16 Apr 2025 | Self-reported | Primary |
| 8 | o1 (High) | 64.4% | — | 16 Apr 2025 | Self-reported | Primary |
| 9 | Qwen3 235B A22B | 61.8% | Pass@2 | 28 Apr 2025 | Self-reported | Primary |
| 10 | GPT 4.1 | 52.0% | — | 14 Apr 2025 | Self-reported | Primary |
| 11 | Gemini 2.5 Flash-Lite | 26.7% | — | 26 Sep 2025 | Self-reported | Primary |
| 12 | GPT 5 | 26.7% | — | 07 Aug 2025 | Self-reported | Primary |
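
To make the Score column easier to read against the 225-exercise total, here is a minimal sketch of how a percentage could be derived from a count of solved exercises. It assumes a score is simply the share of exercises whose tests pass, rounded to one decimal; the function name `pass_rate` is hypothetical, and individual reports may average over runs or use other conditions (for example Pass@2).

```python
TOTAL_EXERCISES = 225  # benchmark size stated in the description


def pass_rate(solved: int, total: int = TOTAL_EXERCISES) -> float:
    """Return the percentage of exercises solved, rounded to one decimal.

    Assumption: a score is solved / total; this is an illustration,
    not the harness's own scoring code.
    """
    if not 0 <= solved <= total:
        raise ValueError("solved must be between 0 and total")
    return round(100 * solved / total, 1)


if __name__ == "__main__":
    # Under this assumption, 198 solved exercises give 88.0
    # and 60 solved exercises give 26.7.
    print(pass_rate(198))
    print(pass_rate(60))
```

Read this way, one exercise is worth roughly 0.44 percentage points, so gaps smaller than that between two rows cannot come from a single run on the full set.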
