BIG-Bench Hard

The 23-task subset of BIG-Bench where prior LLMs underperformed humans. Mix of logical, algorithmic, and language-understanding tasks.

Reasoning · Text · Accuracy · Max 100.0% · Released Oct 2022
Results: 6
Models scored: 6
Top: 91.6% (Seed 1.5)
Median: 84.6%
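The top and median figures can be checked against the six scores listed under All results. The sketch below is a minimal illustration, assuming the page's median is the plain median over the one-row-per-model scores; the variable names are hypothetical and not part of the site.

```python
from statistics import median

# Scores (in %) copied from the "All results" table, one row per model.
scores = {
    "Seed 1.5": 91.6,
    "Nova Pro": 86.9,
    "Claude Opus 3": 86.8,
    "Nova Lite": 82.4,
    "Nova Micro": 79.5,
    "LLaMA 2": 51.2,
}

# Model with the highest score.
top_model = max(scores, key=scores.get)
top_score = scores[top_model]

# With an even number of rows, the median is the mean of the two middle
# values; round to one decimal to match the page's precision.
median_score = round(median(scores.values()), 1)

print(f"Top: {top_score}% ({top_model})")
print(f"Median: {median_score}%")
```

Under that assumption the two middle scores are 86.8% and 82.4%, whose mean matches the 84.6% median shown on the page.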

Best results

Top primary scores; one row per model.
[Chart: top primary scores by rank. Labeled ranks: #1 91.6%, #2 86.9%, #4 82.4%, #6 51.2%]

Frontier over time

Each dot is one model result; the line traces the running best score.
[Chart: Best score over time. y-axis: score, 0.0 to 100.0; x-axis: Jul 2023 to Jan 2025]
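The running best line described here can be sketched from the rows under All results. This is a minimal illustration, assuming the frontier is a cumulative maximum over results ordered by eval date; how the site actually computes it is not stated, and `running_best` is a hypothetical helper.

```python
from datetime import date

# (model, score in %, eval date) copied from the "All results" table.
results = [
    ("Seed 1.5", 91.6, date(2025, 1, 22)),
    ("Nova Pro", 86.9, date(2024, 12, 3)),
    ("Claude Opus 3", 86.8, date(2024, 10, 22)),
    ("Nova Lite", 82.4, date(2024, 12, 3)),
    ("Nova Micro", 79.5, date(2024, 12, 3)),
    ("LLaMA 2", 51.2, date(2023, 7, 19)),
]


def running_best(rows):
    """Return (eval_date, best_score_so_far) for each result in date order."""
    frontier = []
    best = float("-inf")
    # Sort by date, then by score, so same-day results have a stable order.
    for _model, score, day in sorted(rows, key=lambda r: (r[2], r[1])):
        best = max(best, score)
        frontier.append((day, best))
    return frontier


for day, best in running_best(results):
    print(day.isoformat(), f"{best}%")
```

Under this assumption the line only moves when a new result beats every earlier one, so the three results dated 03 Dec 2024 raise it just once, from 86.8% to 86.9%.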

All results

Showing one canonical row per model.
#  Model          Score  Conditions    Eval date    Source         Flags
1  Seed 1.5       91.6%  -             22 Jan 2025  Self-reported  Primary
2  Nova Pro       86.9%  3-shot · CoT  03 Dec 2024  Self-reported  Primary
3  Claude Opus 3  86.8%  3-shot · CoT  22 Oct 2024  Self-reported  Primary
4  Nova Lite      82.4%  3-shot · CoT  03 Dec 2024  Self-reported  Primary
5  Nova Micro     79.5%  3-shot · CoT  03 Dec 2024  Self-reported  Primary
6  LLaMA 2        51.2%  3-shot        19 Jul 2023  Paper          Primary, Verified