GeneBench-Pro

GeneBench-Pro: Evaluating Multistage Statistical Reasoning in Genomics, Quantitative Biology, and Translational Biomedicine

GeneBench-Pro is a 129-problem benchmark that tests whether AI agents can perform realistic, multi-stage scientific analyses in genomics, quantitative biology, and translational biomedicine. Each problem provides a messy, synthetically generated dataset with known ground truth and requires the agent to navigate ambiguous judgment calls, choose the correct analysis path, and arrive at a graded numerical answer. As of release, the best model (GPT-5.6 Sol) scores 28.7% (31.5% in Pro mode).

Agentic | Text | Accuracy | Max 100.0% | Released Jun 2026


Results

Models scored: 11

Top score: 28.7% (GPT-5.6 Sol (max))

Median: 8.90%

Best results

Top primary scores; one row per model.

[Bar chart: the ten highest primary scores, ranging from 28.7% down to 3.10%. The per-model values appear in the All results table.]

Frontier over time

Each dot is one model result; the line traces the running best score.

All data points share one date (30 Jun 2026), so there is no trend to plot.

All results

Showing one canonical row per model.

#	Model	Score	Conditions	Eval date	Source	Flags
1	GPT-5.6 Sol (max)	28.7%	agentic	30 Jun 2026	Self-reported	Primary
2	GPT-5.6 Terra (max)	23.3%	agentic	30 Jun 2026	Self-reported	Primary
3	GPT-5.6 Luna (max)	16.5%	agentic	30 Jun 2026	Self-reported	Primary
4	Claude Opus 4.8	16.0%	—	30 Jun 2026	Self-reported	Primary
5	GPT-5.5 (xhigh)	12.0%	agentic	30 Jun 2026	Self-reported	Primary
6	GPT-5.4 (xhigh)	8.90%	agentic	30 Jun 2026	Self-reported	Primary
7	Gemini 3.5 Flash	8.10%	—	30 Jun 2026	Self-reported	Primary
8	GPT-5.2 (xhigh)	4.90%	agentic	30 Jun 2026	Self-reported	Primary
9	GLM 5.2	4.60%	—	30 Jun 2026	Self-reported	Primary
10	Gemini 3.1 Pro	3.10%	—	30 Jun 2026	Self-reported	Primary
11	DeepSeek V4 Pro	2.40%	—	30 Jun 2026	Self-reported	Primary
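
The summary figures in the Results section (top score and median) can be re-derived from the table. The following is a minimal sketch, assuming the eleven canonical rows are transcribed by hand into a Python list; the variable names are illustrative and not part of any GeneBench-Pro tooling.

```python
from statistics import median

# Scores (%) transcribed by hand from the canonical rows of the table.
# This list is an assumption of the sketch, not an official data export.
results = [
    ("GPT-5.6 Sol (max)", 28.7),
    ("GPT-5.6 Terra (max)", 23.3),
    ("GPT-5.6 Luna (max)", 16.5),
    ("Claude Opus 4.8", 16.0),
    ("GPT-5.5 (xhigh)", 12.0),
    ("GPT-5.4 (xhigh)", 8.9),
    ("Gemini 3.5 Flash", 8.1),
    ("GPT-5.2 (xhigh)", 4.9),
    ("GLM 5.2", 4.6),
    ("Gemini 3.1 Pro", 3.1),
    ("DeepSeek V4 Pro", 2.4),
]

# Top entry: the row with the highest score.
top_model, top_score = max(results, key=lambda row: row[1])

# Median over all canonical rows; with 11 rows this is the 6th-ranked score.
median_score = median(score for _, score in results)

print(f"Models scored: {len(results)}")
print(f"Top: {top_model} at {top_score}%")
print(f"Median: {median_score}%")
```

With an odd number of rows the median is simply the middle-ranked score, which is why it coincides with a single model's result rather than an average of two.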
