BIG-Bench Hard

A 23-task subset of BIG-Bench on which earlier language models did not outperform the average human rater. The tasks are a mix of logical, algorithmic, and language-understanding problems.

Reasoning · Text · Accuracy · Max 100.0% · Released Oct 2022

Results: 6
Models scored: 6
Top: 91.6% (Seed 1.5)
Median: 84.6%

Best results

Top primary scores; one row per model.
# Score
1 91.6%
2 86.9%
4 82.4%
6 51.2%

Frontier over time

Each dot is one model result; the line traces the running best score.
[Chart: best score over time. Vertical axis: score, 0 to 100. Horizontal axis: Jul 2023 to Jan 2025.]
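The running best line described above can be sketched in a few lines of Python. This is only an illustration: it assumes the chart is drawn from the six results flagged Primary in the All results table, which the page does not state, and the `running_best` helper is a hypothetical name.

```python
from datetime import date

# Primary results copied from the "All results" table: (eval date, model, score %).
# Assumption: the chart plots these six rows; the page does not say which rows it uses.
results = [
    (date(2023, 7, 19), "LLaMA 2", 51.2),
    (date(2024, 10, 22), "Claude Opus 3", 86.8),
    (date(2024, 12, 3), "Nova Pro", 86.9),
    (date(2024, 12, 3), "Nova Lite", 82.4),
    (date(2024, 12, 3), "Nova Micro", 79.5),
    (date(2025, 1, 22), "Seed 1.5", 91.6),
]


def running_best(rows):
    """Return (date, best-score-so-far) pairs in chronological order."""
    frontier = []
    best = float("-inf")
    # Sort by date; on equal dates, put the higher score first.
    for when, _model, score in sorted(rows, key=lambda r: (r[0], -r[2])):
        best = max(best, score)
        frontier.append((when, best))
    return frontier


frontier = running_best(results)
for when, best in frontier:
    print(when.isoformat(), f"{best:.1f}%")
```

Each dot corresponds to one input row, and the line only steps upward when a new result beats every earlier one.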

All results

Showing all configurations, including non-primary alternates.
# | Model | Score | Conditions | Eval date | Source | Flags
1 | Claude Sonnet 3.5 | 93.1% | 3-shot · CoT · standard | 20 Jun 2024 | Self-reported |
2 | Seed 1.5 | 91.6% | | 22 Jan 2025 | Self-reported | Primary
3 | Gemini 1.5 | 89.2% | 3-shot · CoT · standard | 01 May 2024 | Self-reported |
4 | Nova Pro | 86.9% | 3-shot · CoT | 03 Dec 2024 | Self-reported | Primary
5 | Claude Opus 3 | 86.8% | 3-shot · CoT | 22 Oct 2024 | Self-reported | Primary
6 | Gemini 1.5 Flash | 85.5% | 3-shot · CoT · standard | 01 May 2024 | Self-reported |
7 | Gemini Ultra | 83.6% | 3-shot · CoT · standard | 06 Dec 2023 | Self-reported |
8 | Nova Lite | 82.4% | 3-shot · CoT | 03 Dec 2024 | Self-reported | Primary
9 | Nova Micro | 79.5% | 3-shot · CoT | 03 Dec 2024 | Self-reported | Primary
10 | Claude Haiku 3 | 73.7% | 3-shot · CoT · standard | 04 Mar 2024 | Self-reported |
11 | LLaMA 2 | 51.2% | 3-shot | 19 Jul 2023 | Paper | Primary, Verified
12 | LLaMA 2 70B | 51.2% | 3-shot | 11 Jul 2023 | Paper |
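The summary figures near the top of the page (6 models scored, top 91.6%, median 84.6%) appear to be computed over the rows flagged Primary rather than over all twelve configurations. A minimal sketch of that calculation follows, under that assumption; the variable names are illustrative only.

```python
from statistics import median

# Scores (%) of the rows flagged "Primary" in the table above.
# Assumption: the page's summary tiles are computed over exactly these rows.
primary_scores = {
    "Seed 1.5": 91.6,
    "Nova Pro": 86.9,
    "Claude Opus 3": 86.8,
    "Nova Lite": 82.4,
    "Nova Micro": 79.5,
    "LLaMA 2": 51.2,
}

top_model = max(primary_scores, key=primary_scores.get)
summary = {
    "models_scored": len(primary_scores),
    "top": (top_model, primary_scores[top_model]),
    # With an even number of rows, the median is the mean of the middle two.
    "median": round(median(primary_scores.values()), 1),
}
print(summary)
```

With six rows the median is the average of the third and fourth scores (86.8 and 82.4), which matches the 84.6% shown on the page. The highest score in the full table, 93.1% for Claude Sonnet 3.5, is not flagged Primary and so does not count toward the Top figure.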