GeneBench-Pro
A 129-problem benchmark testing whether AI agents can perform realistic, multi-stage scientific analyses in genomics, quantitative biology, and translational biomedicine. Each problem provides a messy, synthetically generated dataset with known ground truth and requires the agent to navigate ambiguous judgment calls, choose the correct analysis path, and arrive at a graded numerical answer. The best model, GPT-5.6 Sol, scores 28.7% (31.5% in Pro mode) as of release.
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | GPT-5.6 Sol (Pro) | 31.5% | agentic | 30 Jun 2026 | Self-reported | |
| 2 | GPT-5.6 Sol (max) | 28.7% | agentic | 30 Jun 2026 | Self-reported | Primary |
| 3 | GPT-5.6 Terra (Pro) | 28.5% | agentic | 30 Jun 2026 | Self-reported | |
| 4 | GPT-5.6 Luna (Pro) | 23.6% | agentic | 30 Jun 2026 | Self-reported | |
| 5 | GPT-5.6 Terra (max) | 23.3% | agentic | 30 Jun 2026 | Self-reported | Primary |
| 6 | GPT-5.5 (Pro) | 20.5% | agentic | 30 Jun 2026 | Self-reported | |
| 7 | GPT-5.6 Luna (max) | 16.5% | agentic | 30 Jun 2026 | Self-reported | Primary |
| 8 | GPT-5.4 (Pro) | 16.3% | agentic | 30 Jun 2026 | Self-reported | |
| 9 | Claude Opus 4.8 | 16.0% | — | 30 Jun 2026 | Self-reported | Primary |
| 10 | GPT-5.5 (xhigh) | 12.0% | agentic | 30 Jun 2026 | Self-reported | Primary |
| 11 | GPT-5.4 (xhigh) | 8.9% | agentic | 30 Jun 2026 | Self-reported | Primary |
| 12 | GPT-5.2 (Pro) | 8.5% | agentic | 30 Jun 2026 | Self-reported | |
| 13 | Gemini 3.5 Flash | 8.1% | — | 30 Jun 2026 | Self-reported | Primary |
| 14 | GPT-5.2 (xhigh) | 4.9% | agentic | 30 Jun 2026 | Self-reported | Primary |
| 15 | GLM 5.2 | 4.6% | — | 30 Jun 2026 | Self-reported | Primary |
| 16 | Gemini 3.1 Pro | 3.1% | — | 30 Jun 2026 | Self-reported | Primary |
| 17 | DeepSeek V4 Pro | 2.4% | — | 30 Jun 2026 | Self-reported | Primary |
