DROP
Reading-comprehension benchmark (Discrete Reasoning Over Paragraphs) whose questions require discrete operations such as addition, counting, and sorting over a passage. Frontier models have mostly saturated it.
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | Seed 1.5 | 93.0% | — | 22 Jan 2025 | Self-reported | Primary |
| 2 | Command A | 91.1% | — | 07 Apr 2025 | Self-reported | Primary |
| 3 | Nova Pro | 85.4% | 6-shot · CoT | 03 Dec 2024 | Self-reported | Primary |
| 4 | GPT-4o | 83.4% | — | 16 Apr 2025 | Self-reported | Primary |
| 5 | Claude Opus 3 | 83.1% | 3-shot · CoT | 22 Oct 2024 | Self-reported | Primary |
| 6 | Gemini Ultra | 82.4% | 0-shot · standard | 06 Dec 2023 | Self-reported | |
| 7 | Nova Lite | 80.2% | 6-shot · CoT | 03 Dec 2024 | Self-reported | Primary |
| 8 | Nova Micro | 79.3% | 6-shot · CoT | 03 Dec 2024 | Self-reported | Primary |
| 9 | Claude Haiku 3 | 78.4% | 3-shot · standard | 04 Mar 2024 | Self-reported | |
| 10 | Gemini 1.5 Flash | 78.4% | 0-shot · standard | 01 May 2024 | Self-reported | |
| 11 | Gemini 1.5 Pro | 74.9% | 0-shot · standard | 01 May 2024 | Self-reported | |
| 12 | GPT 3.5 | 64.1% | 3-shot · standard | 14 Mar 2023 | Self-reported | |
| 13 | Gemma 2 | 52.0% | 3-shot | 25 Feb 2025 | Self-reported | Primary |
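The table can also be read chronologically: a result is on the frontier if it beats every score with an earlier or equal eval date. The sketch below is one way to compute that view from the rows above. The `RESULTS` list, the `MONTHS` map, and the `frontier` helper are illustrative names, not part of any leaderboard tooling. It also assumes the scores are directly comparable, which is only roughly true here because the conditions differ (0-shot, 3-shot, and 6-shot, with and without CoT).

```python
from datetime import date

# (model, score in %, eval date) copied from the table above.
RESULTS = [
    ("Seed 1.5", 93.0, "22 Jan 2025"),
    ("Command A", 91.1, "07 Apr 2025"),
    ("Nova Pro", 85.4, "03 Dec 2024"),
    ("GPT-4o", 83.4, "16 Apr 2025"),
    ("Claude Opus 3", 83.1, "22 Oct 2024"),
    ("Gemini Ultra", 82.4, "06 Dec 2023"),
    ("Nova Lite", 80.2, "03 Dec 2024"),
    ("Nova Micro", 79.3, "03 Dec 2024"),
    ("Claude Haiku 3", 78.4, "04 Mar 2024"),
    ("Gemini 1.5 Flash", 78.4, "01 May 2024"),
    ("Gemini 1.5 Pro", 74.9, "01 May 2024"),
    ("GPT 3.5", 64.1, "14 Mar 2023"),
    ("Gemma 2", 52.0, "25 Feb 2025"),
]

# Explicit month map so parsing does not depend on the system locale.
MONTHS = {m: i + 1 for i, m in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split())}


def parse_date(text):
    """Parse a 'DD Mon YYYY' string such as '03 Dec 2024'."""
    day, month, year = text.split()
    return date(int(year), MONTHS[month], int(day))


def frontier(results):
    """Return the entries that set a new best score, oldest first."""
    # Sort by date, then by descending score, so that among results
    # sharing an eval date only the highest can enter the frontier.
    ordered = sorted(results, key=lambda r: (parse_date(r[2]), -r[1]))
    best = float("-inf")
    out = []
    for model, score, when in ordered:
        if score > best:
            best = score
            out.append((model, score, when))
    return out


for model, score, when in frontier(RESULTS):
    print(f"{when}  {model:<14} {score:.1f}%")
```

Under these assumptions, five entries form the frontier: GPT 3.5, Gemini Ultra, Claude Opus 3, Nova Pro, and Seed 1.5. Later results that score below the running best, such as Command A and GPT-4o, do not appear.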
