GeneBench-Pro
A 129-problem benchmark testing whether AI agents can perform realistic, multi-stage scientific analyses in genomics, quantitative biology, and translational biomedicine. Each problem provides a messy, synthetically generated dataset with known ground truth and requires the agent to navigate ambiguous judgment calls, choose the correct analysis path, and arrive at a graded numerical answer. As of release, the best model (GPT-5.6 Sol) scores 28.7% (31.5% in Pro mode).
All results
| # | Model | Score | Conditions | Eval date | Source | Flags |
|---|---|---|---|---|---|---|
| 1 | GPT-5.6 Sol (max) | 28.7% | agentic | 30 Jun 2026 | Self-reported | Primary |
