TAAFT

DROP

Discrete Reasoning Over Paragraphs

Reading-comprehension benchmark requiring discrete operations (addition, counting, sorting) over passages. Mostly saturated by frontier models.

Reasoning · Text · F1 · Max 100.0% · Released Mar 2019 · Saturated · Possibly contaminated
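For readers unfamiliar with the F1 metric listed above, the following is a minimal sketch of a bag-of-tokens F1 between a predicted and a reference answer. It is a simplified illustration, not the official DROP evaluator, which applies additional answer normalization; the function name `token_f1` and the whitespace tokenization are assumptions made for this example.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Simplified bag-of-tokens F1 (hypothetical helper, not the official scorer)."""
    # Assumption: lowercase and split on whitespace; real evaluators normalize more.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction that matches the reference token for token scores 1.0, and a prediction that omits part of the reference is penalized through recall.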
Results: 8
Models scored: 8
Top: 93.0% (Seed 1.5)
Median: 83.3%
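The top and median figures can be reproduced from the eight scores listed under All results. The sketch below uses `Decimal` so that the midpoint of the two middle scores is exact; rounding half up to one decimal place is an assumption about how the page displays the value.

```python
from decimal import Decimal, ROUND_HALF_UP
from statistics import median

# Scores copied from the All results table, one per model.
scores = [Decimal(s) for s in (
    "93.0", "91.1", "85.4", "83.4", "83.1", "80.2", "79.3", "52.0",
)]

top = max(scores)
# With an even number of scores, the median is the mean of the two middle ones.
mid = median(scores)
# Assumption: the page rounds half up to one decimal place.
shown = mid.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

print(f"Top: {top}%  Median: {mid}% (shown as {shown}%)")
```

The exact median is the midpoint of 83.1 and 83.4, that is 83.25, which matches the displayed 83.3% under the rounding assumption.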

Best results

Top primary scores; one row per model.
1 · Seed 1.5 · 93.0%
2 · Command A · 91.1%
3 · Nova Pro · 85.4%
4 · GPT-4o · 83.4%
6 · Nova Lite · 80.2%
8 · Gemma 2 · 52.0%

Frontier over time

Each dot is one model result; the line traces the running best score.
[Figure: Best score over time. Vertical axis: score, 0 to 100. Horizontal axis: eval date, Oct 2024 to Apr 2025.]
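The running best line described above can be sketched as a cumulative maximum over the results ordered by eval date. The dates and scores are taken from the All results table; the variable names and the ordering of same-day results are assumptions made for this example.

```python
from datetime import date
from itertools import accumulate

# (eval date, model, score) copied from the All results table.
results = [
    (date(2025, 1, 22), "Seed 1.5", 93.0),
    (date(2025, 4, 7), "Command A", 91.1),
    (date(2024, 12, 3), "Nova Pro", 85.4),
    (date(2025, 4, 16), "GPT-4o", 83.4),
    (date(2024, 10, 22), "Claude Opus 3", 83.1),
    (date(2024, 12, 3), "Nova Lite", 80.2),
    (date(2024, 12, 3), "Nova Micro", 79.3),
    (date(2025, 2, 25), "Gemma 2", 52.0),
]

# Order the dots chronologically; the sort is stable, so same-day results
# keep their table order (an arbitrary but harmless choice).
results.sort(key=lambda r: r[0])

# The frontier is the cumulative maximum of the scores seen so far.
running_best = list(accumulate((score for _, _, score in results), max))

for (day, model, score), best in zip(results, running_best):
    print(f"{day}  {model:<14} {score:5.1f}  running best {best:5.1f}")
```

On this data the frontier rises from 83.1 to 85.4 in December 2024 and to 93.0 in January 2025, then stays flat, since no later result exceeds Seed 1.5.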

All results

Showing one canonical row per model.

| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|-------|-------|------------|-----------|--------|-------|
| 1 | Seed 1.5 | 93.0% | | 22 Jan 2025 | Self-reported | Primary |
| 2 | Command A | 91.1% | | 07 Apr 2025 | Self-reported | Primary |
| 3 | Nova Pro | 85.4% | 6-shot · CoT | 03 Dec 2024 | Self-reported | Primary |
| 4 | GPT-4o | 83.4% | | 16 Apr 2025 | Self-reported | Primary |
| 5 | Claude Opus 3 | 83.1% | 3-shot · CoT | 22 Oct 2024 | Self-reported | Primary |
| 6 | Nova Lite | 80.2% | 6-shot · CoT | 03 Dec 2024 | Self-reported | Primary |
| 7 | Nova Micro | 79.3% | 6-shot · CoT | 03 Dec 2024 | Self-reported | Primary |
| 8 | Gemma 2 | 52.0% | 3-shot | 25 Feb 2025 | Self-reported | Primary |