BIG-Bench Hard

A 23-task subset of BIG-Bench on which earlier language models failed to outperform the average human rater. The tasks mix logical, algorithmic, and language-understanding problems.

Reasoning · Text · Accuracy · Max 100.0% · Released Oct 2022

Results

Models scored

Top: Seed 1.5, 91.6% (primary results only)

Median: 84.6% (primary results only)

Best results

Top primary scores; one row per model.

Seed 1.5: 91.6%
Nova Pro: 86.9%
Claude Opus 3: 86.8%
Nova Lite: 82.4%
Nova Micro: 79.5%
LLaMA 2: 51.2%

Chart: each dot is one model result; the line traces the running best score.
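The running-best line can be reproduced from the eval dates and scores in the results table. The following is a minimal sketch, not the page's own plotting code. It assumes the chart includes every configuration (primary and non-primary) and orders results by eval date. The `results` list and the `running_best` helper are illustrative names.

```python
from datetime import date

# (model, score in percent, eval date), copied from the results table.
results = [
    ("Claude Sonnet 3.5", 93.1, date(2024, 6, 20)),
    ("Seed 1.5", 91.6, date(2025, 1, 22)),
    ("Gemini 1.5", 89.2, date(2024, 5, 1)),
    ("Nova Pro", 86.9, date(2024, 12, 3)),
    ("Claude Opus 3", 86.8, date(2024, 10, 22)),
    ("Gemini 1.5 Flash", 85.5, date(2024, 5, 1)),
    ("Gemini Ultra", 83.6, date(2023, 12, 6)),
    ("Nova Lite", 82.4, date(2024, 12, 3)),
    ("Nova Micro", 79.5, date(2024, 12, 3)),
    ("Claude Haiku 3", 73.7, date(2024, 3, 4)),
    ("LLaMA 2", 51.2, date(2023, 7, 19)),
    ("LLaMA 2 70B", 51.2, date(2023, 7, 11)),
]


def running_best(rows):
    """Return (eval_date, best_score_so_far) pairs in chronological order."""
    best = float("-inf")
    trace = []
    # Assumption: results are ordered by eval date; ties keep table order.
    for _model, score, when in sorted(rows, key=lambda r: r[2]):
        best = max(best, score)
        trace.append((when, best))
    return trace


trace = running_best(results)
for when, best in trace:
    print(when.isoformat(), f"{best:.1f}%")
```

Under these assumptions the line starts at 51.2% in July 2023, steps up to 83.6% and then 89.2%, and stays at 93.1% from 20 Jun 2024 onward, because no later result exceeds that score.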

Showing all configurations, including non-primary alternates.

#	Model	Score	Conditions	Eval date	Source	Flags
1	Claude Sonnet 3.5	93.1%	3-shot · CoT · standard	20 Jun 2024	Self-reported
2	Seed 1.5	91.6%	—	22 Jan 2025	Self-reported	Primary
3	Gemini 1.5	89.2%	3-shot · CoT · standard	01 May 2024	Self-reported
4	Nova Pro	86.9%	3-shot · CoT	03 Dec 2024	Self-reported	Primary
5	Claude Opus 3	86.8%	3-shot · CoT	22 Oct 2024	Self-reported	Primary
6	Gemini 1.5 Flash	85.5%	3-shot · CoT · standard	01 May 2024	Self-reported
7	Gemini Ultra	83.6%	3-shot · CoT · standard	06 Dec 2023	Self-reported
8	Nova Lite	82.4%	3-shot · CoT	03 Dec 2024	Self-reported	Primary
9	Nova Micro	79.5%	3-shot · CoT	03 Dec 2024	Self-reported	Primary
10	Claude Haiku 3	73.7%	3-shot · CoT · standard	04 Mar 2024	Self-reported
11	LLaMA 2	51.2%	3-shot	19 Jul 2023	Paper	Primary, Verified
12	LLaMA 2 70B	51.2%	3-shot	11 Jul 2023	Paper
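The page does not say how its summary figures are derived, but they appear to be computed over the rows flagged Primary rather than over every configuration. The sketch below checks that reading. It assumes the Primary filter, and the `rows` list and variable names are illustrative. Only the model, score, and flags columns are copied from the table.

```python
from statistics import median

# (model, score in percent, flags), copied from the results table.
rows = [
    ("Claude Sonnet 3.5", 93.1, set()),
    ("Seed 1.5", 91.6, {"Primary"}),
    ("Gemini 1.5", 89.2, set()),
    ("Nova Pro", 86.9, {"Primary"}),
    ("Claude Opus 3", 86.8, {"Primary"}),
    ("Gemini 1.5 Flash", 85.5, set()),
    ("Gemini Ultra", 83.6, set()),
    ("Nova Lite", 82.4, {"Primary"}),
    ("Nova Micro", 79.5, {"Primary"}),
    ("Claude Haiku 3", 73.7, set()),
    ("LLaMA 2", 51.2, {"Primary", "Verified"}),
    ("LLaMA 2 70B", 51.2, set()),
]

# Assumption: the summary uses only rows carrying the Primary flag.
primary = [(model, score) for model, score, flags in rows if "Primary" in flags]

top_model, top_score = max(primary, key=lambda r: r[1])
median_score = median(score for _model, score in primary)

print(f"Primary models: {len(primary)}")
print(f"Top: {top_model} ({top_score:.1f}%)")
print(f"Median: {median_score:.1f}%")
```

With six primary rows, the median is the mean of the two middle scores (82.4% and 86.8%). That matches the 84.6% median and the 91.6% top score shown in the summary, even though the highest score across all configurations is 93.1%.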