2026-06-20

LifeSciBench measures what agents know. Here's what they can do.

OpenAI just published the most rigorous AI benchmark for life science research. All 750 tasks are answerable with an API key and a PDF. Here's what agents can do when there's a cluster underneath them instead.

It's Tuesday morning. You have a TYK2 RBFE campaign to run: a full perturbation network across your ligand series. You submit it. Your workstation starts on the first edge. By Thursday you have enough results to see a pattern. By Friday most of the edges have finished. The Monday meeting is tight. This is not a bad outcome. This is just what happens when edges run one at a time.

On Clusterra, those same edges run as a parallel job array: every edge simultaneously, on spot GPU nodes that scale up for the run and drop to zero when it's done. The full TYK2 campaign (18 alchemical edge-legs across 9 perturbations, each running independently on its own spot GPU) completed the next morning. Run the same 18 legs sequentially on a single machine and you're looking at about three days: 18 legs at roughly 4 hours each is 72 hours. ΔΔG mean unsigned error: 0.38 kcal/mol against the experimental reference set, within the expected range for a 5 ns production run. Cost: roughly $33 on A10G spot. The agent submitted it, handled spot interruptions, and the ranked hit list was there in the morning. That's not a demo. That's the run.
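Here is the arithmetic behind that comparison as a minimal Python sketch. The 9 perturbations, 18 legs, roughly 4 hours per leg, and roughly $33 total come from the run described above. Everything else is an assumption made for illustration: the split of each perturbation into a complex leg and a solvent leg, and the names `Leg`, `enumerate_legs`, and `wall_clock_hours`, none of which are Clusterra's API. The sketch ignores queue time, node startup, and spot interruptions, so the parallel figure is a lower bound rather than a prediction.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Leg:
    perturbation: int   # index of the ligand-pair edge in the network
    environment: str    # assumed split: "complex" or "solvent"


def enumerate_legs(n_perturbations, environments=("complex", "solvent")):
    """One independent job per (perturbation, environment) pair."""
    return [Leg(p, env) for p in range(n_perturbations) for env in environments]


def wall_clock_hours(n_legs, hours_per_leg, n_workers):
    """Idealized wall-clock time: legs run in waves of n_workers at a time."""
    return ceil(n_legs / n_workers) * hours_per_leg


HOURS_PER_LEG = 4.0   # rough per-leg runtime quoted in the post
TOTAL_COST_USD = 33.0  # rough total quoted in the post

legs = enumerate_legs(9)
sequential = wall_clock_hours(len(legs), HOURS_PER_LEG, n_workers=1)
parallel = wall_clock_hours(len(legs), HOURS_PER_LEG, n_workers=len(legs))

# GPU-hours are the same either way; only the wall-clock time changes.
gpu_hours = len(legs) * HOURS_PER_LEG
# Implied rate, assuming the quoted total is all GPU time (an assumption).
cost_per_gpu_hour = TOTAL_COST_USD / gpu_hours

print(f"legs: {len(legs)}")
print(f"sequential: {sequential:.0f} h, parallel: {parallel:.0f} h")
print(f"GPU-hours: {gpu_hours:.0f}, implied rate: ${cost_per_gpu_hour:.2f}/GPU-hour")
```

The point of the sketch is that parallelism changes the wall-clock time, not the compute bill: 72 GPU-hours are spent in both cases, and only the number of workers decides whether that takes one wave or eighteen.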

The difference between these two outcomes is not the model, not the force field, not the researcher's judgment. It's whether the substrate underneath the agent can run the jobs in parallel or not.

OpenAI published LifeSciBench this week — 750 tasks, 173 PhD-level scientists, 19,020 rubric criteria, the most serious attempt yet to measure whether AI can contribute to real biological research. Every one of those tasks follows the same structure: scientific prompt, attached artifacts, free-text answer. A capable model can attempt all of them with an API key and a PDF. No computation runs anywhere.

Their own limitations section says it clearly: "Real research is iterative: scientists gather new evidence, revise hypotheses, design follow-up experiments, and adapt their plans as results emerge. The next step is to connect benchmark performance to deployment studies in live research workflows."

That next step requires a cluster. LifeSciBench can't take it, because a deployment study in a live workflow can't be scored with a rubric and a PDF.

LifeSciBench is the reasoning benchmark, and it's excellent at that. The execution question is different: what can an agent actually run, in parallel, overnight, and at what cost? A rubric can't answer that. Only running the jobs can.

If your FEP campaign is running sequentially on a workstation, Clusterra is the shortest path to parallel. The TYK2 template is live. Email hello@clusterra.cloud — we'll spin up your cluster and run your first campaign together, same day.