LifeSciBench measures what agents know. Here's what they can do.
OpenAI just published the most rigorous life science AI benchmark ever built. It tells you whether your agent can think like a computational biologist. It doesn't tell you whether it can act like one.
Yesterday OpenAI published LifeSciBench: 750 tasks, 173 PhD-level scientists, 453 independent reviewers, 19,020 rubric criteria. The most serious attempt yet to measure whether AI can contribute to real biological research. It's worth reading. And when you read it carefully, it reveals a gap it can't close.
Every task in LifeSciBench follows the same shape: scientific prompt, attached artifacts, free-text answer. Evidence Handling asks models to reconcile papers and experimental records. Design, Optimization & Prediction asks them to reason about optimization choices. Scientific Communication asks them to write clearly for expert reviewers. These are real tasks that practicing scientists spend real time on — and every one of them is a task a capable model can attempt with an API key and a PDF. No computation runs anywhere. No cluster needed.
The benchmark says so itself. In the limitations section: "Real research is iterative: scientists gather new evidence, revise hypotheses, design follow-up experiments, and adapt their plans as results emerge. The next step is to connect benchmark performance to deployment studies in live research workflows."
That next step is the one nobody has measured yet.
A computational chemist's actual workday is mostly not document reasoning. It's submitting a relative binding free energy (RBFE) network (a TYK2 campaign, say, with a full perturbation network across the ligand series), then watching which replicas fail, requeuing on different nodes, checking whether lambda windows converged, pulling free-energy values when they do, and ranking hits against experiment. The design judgment (which scaffold to prioritize, how to interpret the uncertainty) is maybe an hour of work. The execution is everything around it.
When we ran TYK2 on Clusterra, an agent could submit the full network as a parallel job array, handle spot interruptions automatically, and return a ranked hit list overnight. The ΔΔG mean unsigned error was 0.38 kcal/mol against the experimental reference set — within the expected accuracy for a 5 ns production run — at roughly $33 on A10G spot. The agent's contribution wasn't just planning the campaign. It was running it at a scale a single machine can't reach.
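For readers unfamiliar with the metric, mean unsigned error is the average absolute difference between predicted and experimental ΔΔG values. A minimal Python sketch follows; the four values are hypothetical placeholders chosen for illustration, not the TYK2 results, and the function name is ours.

```python
from statistics import mean


def mean_unsigned_error(predicted, experimental):
    """Mean unsigned error between paired ddG values, in kcal/mol."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return mean(abs(p - e) for p, e in zip(predicted, experimental))


# Hypothetical ddG values (kcal/mol) for four edges; NOT the TYK2 data.
predicted = [-1.20, 0.45, 0.80, -0.30]
experimental = [-0.90, 0.10, 1.30, -0.55]

mue = mean_unsigned_error(predicted, experimental)
print(f"MUE = {mue:.2f} kcal/mol")
```

The same calculation applies whether the campaign has four edges or four hundred; only the cost of producing the predicted column changes.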
On a single GPU VM, a free-energy perturbation campaign runs one edge at a time. The science is identical. The time isn't: what completes overnight in parallel takes days sequentially. More importantly, some workloads can't run without the right substrate at all. Temperature replica exchange MD (T-REMD), the standard method for sampling conformational space in flexible targets, requires multiple GPUs running in synchronized lock-step, with coordinated swap steps between replicas. On a single-GPU machine it doesn't run slowly; it doesn't run at all. The same is true for RELION multi-GPU refinement in cryo-EM. You can have the most capable agent in the world designing the right experiment, and if the execution layer underneath it doesn't exist, the experiment doesn't happen.
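To see why the swap steps force replicas to stay synchronized, consider the textbook Metropolis criterion for exchanging two neighboring temperature replicas: the decision needs both replicas' potential energies from the same step. The Python sketch below illustrates that criterion only. It is not the GROMACS implementation, and the temperatures and energies are made-up values.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def swap_probability(temp_i, temp_j, energy_i, energy_j):
    """Acceptance probability min(1, exp[(beta_i - beta_j) * (U_i - U_j)])."""
    beta_i = 1.0 / (K_B * temp_i)
    beta_j = 1.0 / (K_B * temp_j)
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    # Skip the exponential when the swap is certain; this also avoids overflow.
    return 1.0 if delta >= 0 else math.exp(delta)


def attempt_swap(temp_i, temp_j, energy_i, energy_j, rng=random):
    """Return True if the two replicas should exchange configurations."""
    return rng.random() < swap_probability(temp_i, temp_j, energy_i, energy_j)


# Illustrative values only: replicas at 300 K and 310 K, energies in kcal/mol.
p = swap_probability(300.0, 310.0, -1000.0, -998.0)
print(f"swap acceptance probability: {p:.3f}")
```

Because the criterion compares energies from the same simulation step, neither replica can run ahead of its neighbor between exchange attempts.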
This isn't a corner case. Recursive AI published results last week showing their automated research system reached state-of-the-art on the NanoGPT Speedrun, but only because they had eight GPUs to parallelize the search across. The ideas weren't the limiting factor. The hardware was. The same holds for computational biology: the agent isn't the ceiling. The substrate is.
LifeSciBench is measuring from the top down: given a capable model, how well can it reason about life science problems? That's exactly the right question for evaluating models. It's not the right question for evaluating what agents can accomplish in a working lab. The missing benchmark measures from the bottom up: given a cluster and an agent, how much science can you run per dollar, overnight, without a researcher watching?
There are real numbers here waiting to be established. How many RBFE edges can an agent run in twelve hours, at what cost, at what accuracy? What's the parallel speedup on cryo-EM refinement across GPU counts? What's the cost per Ångström of resolution improvement, on spot, in an unattended overnight run? None of these appear in LifeSciBench because you can't evaluate them with a rubric and a PDF — you need a cluster to run them.
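The quantities named above reduce to simple arithmetic once a run has finished. Here is a rough Python sketch of how they might be computed. The class, the function names, and every input number are hypothetical, chosen only to make the arithmetic easy to follow; none of them are measured results.

```python
from dataclasses import dataclass


@dataclass
class OvernightRun:
    """Hypothetical record of one unattended RBFE run."""
    edges_completed: int
    wall_hours: float
    total_cost_usd: float

    def edges_per_hour(self):
        return self.edges_completed / self.wall_hours

    def cost_per_edge(self):
        return self.total_cost_usd / self.edges_completed


def parallel_speedup(single_gpu_hours, multi_gpu_hours, n_gpus):
    """Return (speedup, parallel efficiency) for a multi-GPU run."""
    speedup = single_gpu_hours / multi_gpu_hours
    return speedup, speedup / n_gpus


def cost_per_angstrom(cost_usd, resolution_before, resolution_after):
    """Dollars spent per angstrom of resolution improvement."""
    improvement = resolution_before - resolution_after
    if improvement <= 0:
        raise ValueError("resolution did not improve")
    return cost_usd / improvement


# All inputs below are made-up illustration values.
run = OvernightRun(edges_completed=24, wall_hours=12.0, total_cost_usd=36.0)
speedup, efficiency = parallel_speedup(40.0, 12.5, n_gpus=4)

print(f"{run.edges_per_hour():.1f} edges/hour at ${run.cost_per_edge():.2f} per edge")
print(f"speedup {speedup:.1f}x, efficiency {efficiency:.0%}")
print(f"${cost_per_angstrom(10.0, 6.0, 4.0):.2f} per angstrom gained")
```

The arithmetic is trivial. The hard part is producing the inputs, which takes a cluster rather than a rubric.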
We're building toward those numbers. The TYK2 result is a start: $33, 0.38 kcal/mol MUE, overnight on an agent-submitted job array. We've done the same for cryo-ET subtomogram averaging on EMPIAR-10164 — 3.99 Å resolution on a four-GPU spot cluster at about $2.30 — and for GROMACS T-REMD at 1365 ns/day/replica on a single node. These aren't formal benchmarks yet. They're data points of the kind LifeSciBench can't produce.
LifeSciBench is the reasoning benchmark. The execution benchmark doesn't exist yet. If you're a computational chemist or cryo-EM scientist who wants to start generating those numbers in your own AWS account, visit clusterra.cloud or email hello@clusterra.cloud.