BIG-Bench Hard
The 23-task subset of BIG-Bench on which earlier language models scored below the average human rater. The tasks mix logical, algorithmic, and language-understanding problems.
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 3.5 | 93.1% | 3-shot · CoT · standard | 20 Jun 2024 | Self-reported | |
| 2 | Seed 1.5 | 91.6% | — | 22 Jan 2025 | Self-reported | Primary |
| 3 | Gemini 1.5 | 89.2% | 3-shot · CoT · standard | 01 May 2024 | Self-reported | |
| 4 | Nova Pro | 86.9% | 3-shot · CoT | 03 Dec 2024 | Self-reported | Primary |
| 5 | Claude Opus 3 | 86.8% | 3-shot · CoT | 22 Oct 2024 | Self-reported | Primary |
| 6 | Gemini 1.5 Flash | 85.5% | 3-shot · CoT · standard | 01 May 2024 | Self-reported | |
| 7 | Gemini Ultra | 83.6% | 3-shot · CoT · standard | 06 Dec 2023 | Self-reported | |
| 8 | Nova Lite | 82.4% | 3-shot · CoT | 03 Dec 2024 | Self-reported | Primary |
| 9 | Nova Micro | 79.5% | 3-shot · CoT | 03 Dec 2024 | Self-reported | Primary |
| 10 | Claude Haiku 3 | 73.7% | 3-shot · CoT · standard | 04 Mar 2024 | Self-reported | |
| 11 | LLaMA 2 | 51.2% | 3-shot | 19 Jul 2023 | Paper | Primary Verified |
| 12 | LLaMA 2 70B | 51.2% | 3-shot | 11 Jul 2023 | Paper | |
