
DROP

Discrete Reasoning Over Paragraphs

Reading-comprehension benchmark requiring discrete operations (addition, counting, sorting) over passages. Mostly saturated by frontier models.

Reasoning · Text · Metric: F1 (max 100.0%) · Released Mar 2019 · Saturated · Possibly contaminated
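To make the F1 metric above concrete, here is a minimal sketch of a token-overlap F1 between a predicted answer and a gold answer. It is a simplification and an assumption on our part: the benchmark's official scorer applies additional answer normalization and handles numbers and multi-span answers specially, none of which is reproduced here. The function name `token_f1` is hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Simplified bag-of-tokens F1 (assumption: not the official DROP scorer)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


exact = token_f1("7", "7")               # identical answers
partial = token_f1("3 touchdowns", "3")  # extra token lowers precision
wrong = token_f1("5", "7")               # no shared tokens
```

An exact match scores 1.0, a wrong number scores 0.0, and an answer with extra tokens lands in between, which is why F1 is more forgiving than exact match on span-style answers.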
8 results · 8 models scored · Top: Seed 1.5 (93.0%) · Median: 83.3%
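The summary figures can be checked against the per-model scores listed on this page. The sketch below assumes the eight primary scores are the full population; the variable names are illustrative only.

```python
from statistics import median

# Primary scores (percent) as listed in the results table on this page.
scores = {
    "Seed 1.5": 93.0,
    "Command A": 91.1,
    "Nova Pro": 85.4,
    "GPT-4o": 83.4,
    "Claude Opus 3": 83.1,
    "Nova Lite": 80.2,
    "Nova Micro": 79.3,
    "Gemma 2": 52.0,
}

top_model = max(scores, key=scores.get)
top_score = scores[top_model]
# With an even number of models, the median is the mean of the two middle scores.
median_score = median(scores.values())

print(f"Top: {top_model} ({top_score}%)")
print(f"Median: {median_score:.2f}%")
```

With eight models the median is the mean of the fourth and fifth scores (83.4 and 83.1), that is 83.25, which the page displays as 83.3%.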

Best results

Top primary scores; one row per model.
Rank 1: 93.0% · Rank 2: 91.1% · Rank 3: 85.4% · Rank 4: 83.4% · Rank 6: 80.2% · Rank 8: 52.0%

Frontier over time

Each dot is one model result; the line traces the running best score.
[Chart: best score over time; y-axis 0 to 100, x-axis Oct 2024 to Apr 2025]
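The running-best line described above can be reproduced from the evaluation dates and scores listed on this page. This is a minimal sketch, assuming results are ordered by eval date and that same-day results keep their listed order; it is not the site's own plotting code.

```python
from datetime import date
from itertools import accumulate

# (eval date, model, score in percent), taken from the results table on this page.
results = [
    (date(2025, 1, 22), "Seed 1.5", 93.0),
    (date(2025, 4, 7), "Command A", 91.1),
    (date(2024, 12, 3), "Nova Pro", 85.4),
    (date(2025, 4, 16), "GPT-4o", 83.4),
    (date(2024, 10, 22), "Claude Opus 3", 83.1),
    (date(2024, 12, 3), "Nova Lite", 80.2),
    (date(2024, 12, 3), "Nova Micro", 79.3),
    (date(2025, 2, 25), "Gemma 2", 52.0),
]

# Sort chronologically; Python's sort is stable, so same-day rows keep their order.
results.sort(key=lambda row: row[0])

# The frontier at each point is the maximum score seen so far.
running_best = list(accumulate((score for _, _, score in results), max))

for (eval_date, model, score), best in zip(results, running_best):
    print(f"{eval_date}  {model:<14} {score:5.1f}%  running best {best:5.1f}%")
```

On this data the frontier moves only twice after the first result: from 83.1% to 85.4% in Dec 2024, then to 93.0% in Jan 2025. Later results fall below the line.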

All results

Showing one canonical row per model.
#  Model          Score  Conditions    Eval date     Source         Flags
1  Seed 1.5       93.0%  -             Jan 22, 2025  self reported  primary
2  Command A      91.1%  -             Apr 7, 2025   self reported  primary
3  Nova Pro       85.4%  6-shot · CoT  Dec 3, 2024   self reported  primary
4  GPT-4o         83.4%  -             Apr 16, 2025  self reported  primary
5  Claude Opus 3  83.1%  3-shot · CoT  Oct 22, 2024  self reported  primary
6  Nova Lite      80.2%  6-shot · CoT  Dec 3, 2024   self reported  primary
7  Nova Micro     79.3%  6-shot · CoT  Dec 3, 2024   self reported  primary
8  Gemma 2        52.0%  3-shot        Feb 25, 2025  self reported  primary