DROP
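DROP (Discrete Reasoning Over Paragraphs) is a reading-comprehension benchmark that requires discrete operations (addition, counting, sorting) over passages. It is mostly saturated by frontier models: the top two reported scores in the table below exceed 90%.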
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | Seed 1.5 | 93.0% | — | Jan 22, 2025 | self-reported | primary |
| 2 | Command A | 91.1% | — | Apr 7, 2025 | self-reported | primary |
| 3 | Nova Pro | 85.4% | 6-shot · CoT | Dec 3, 2024 | self-reported | primary |
| 4 | GPT-4o | 83.4% | — | Apr 16, 2025 | self-reported | primary |
| 5 | Claude Opus 3 | 83.1% | 3-shot · CoT | Oct 22, 2024 | self-reported | primary |
| 6 | Nova Lite | 80.2% | 6-shot · CoT | Dec 3, 2024 | self-reported | primary |
| 7 | Nova Micro | 79.3% | 6-shot · CoT | Dec 3, 2024 | self-reported | primary |
| 8 | Gemma 2 | 52.0% | 3-shot | Feb 25, 2025 | self-reported | primary |
