ARC Challenge

AI2 Reasoning Challenge (Challenge set)

The hard subset of a grade-school science multiple-choice benchmark. Frontier models have saturated it, but it is still included in many evaluation harnesses.

Knowledge | Text | accuracy | Max 100.0% | Released Mar 2018 | Saturated | Possibly contaminated
Results: 10
Models scored: 9
Top: Gemma 2, 554.0%
Median: 91.3%
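The median figure appears to be taken over all ten result rows rather than over the nine distinct models. A minimal sketch that checks this, assuming the scores are exactly those listed in the results table on this page (the variable names are illustrative only):

```python
from statistics import median

# Scores in percent, copied from the ten rows of the results table.
scores = [554.0, 96.4, 96.1, 94.8, 92.4, 90.2, 78.6, 59.7, 59.7, 55.6]

# With an even number of values, the median is the mean of the two
# middle values after sorting (here 90.2 and 92.4).
median_score = median(scores)

print(f"{median_score:.1f}%")  # prints 91.3%
```

Because the median depends only on the middle of the sorted list, the size of the top entry does not affect it.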

Best results

Top primary scores; one row per model.
Rank 1: 554.0%
Rank 4: 94.8%
Rank 5: 92.4%
Rank 7: 78.6%
Rank 10: 55.6%

Frontier over time

Each dot is one model result; the line traces the running best score.
[Chart: Best score over time. Y-axis from 0.00 to 100.0; x-axis from Sep 2023 to Apr 2026.]
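The running best score described above can be sketched as a cumulative maximum over results sorted by evaluation date. This is a hedged illustration, not the site's actual implementation: the rows are copied from the results table on this page, `running_best` is a hypothetical helper, and the optional `max_score` cutoff is an assumption added so that scores above the benchmark's stated maximum of 100.0% can be left out of the line.

```python
from datetime import date

# (model, score in percent, eval date), copied from the results table.
RESULTS = [
    ("Gemma 2", 554.0, date(2025, 2, 25)),
    ("Claude Opus 3", 96.4, date(2024, 10, 22)),
    ("Nemotron 3 Super", 96.1, date(2026, 4, 3)),
    ("Nova Pro", 94.8, date(2024, 12, 3)),
    ("Nova Lite", 92.4, date(2024, 12, 3)),
    ("Nova Micro", 90.2, date(2024, 12, 3)),
    ("Llama 3.2", 78.6, date(2024, 10, 22)),
    ("Mixtral 8x7B", 59.7, date(2023, 12, 1)),
    ("Mixtral 8x7B", 59.7, date(2024, 1, 8)),
    ("Mistral 7B", 55.6, date(2023, 9, 1)),
]


def running_best(results, max_score=None):
    """Return (eval_date, best_so_far) pairs in chronological order.

    If max_score is given, scores above it are skipped. That cutoff is
    an assumption of this sketch, not documented site behavior.
    """
    frontier = []
    best = float("-inf")
    for _, score, when in sorted(results, key=lambda row: row[2]):
        if max_score is not None and score > max_score:
            continue
        best = max(best, score)
        frontier.append((when, best))
    return frontier


frontier = running_best(RESULTS, max_score=100.0)
for when, best in frontier:
    print(when.isoformat(), best)
```

Calling `running_best(RESULTS)` without the cutoff keeps every row, so the line then ends at the highest score listed in the table.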

All results

Showing one canonical row per model.
#  | Model            | Score  | Conditions | Eval date    | Source        | Flags
1  | Gemma 2          | 554.0% |            | Feb 25, 2025 | self reported | primary
2  | Claude Opus 3    | 96.4%  | 25-shot    | Oct 22, 2024 | self reported | primary
3  | Nemotron 3 Super | 96.1%  | 25-shot    | Apr 3, 2026  | self reported | primary
4  | Nova Pro         | 94.8%  | 0-shot     | Dec 3, 2024  | self reported | primary
5  | Nova Lite        | 92.4%  | 0-shot     | Dec 3, 2024  | self reported | primary
6  | Nova Micro       | 90.2%  | 0-shot     | Dec 3, 2024  | self reported | primary
7  | Llama 3.2        | 78.6%  | 0-shot     | Oct 22, 2024 | self reported | primary
8  | Mixtral 8x7B     | 59.7%  |            | Dec 1, 2023  | self reported | primary
9  | Mixtral 8x7B     | 59.7%  |            | Jan 8, 2024  | self reported | primary
10 | Mistral 7B       | 55.6%  |            | Sep 1, 2023  | self reported | primary