2026-06-03

Reproduce TYK2 RBFE in your own AWS in an afternoon

A copy-paste quickstart: take the canonical TYK2 ligand set, plan a 9-edge OpenFE perturbation network, fan it out across A10G spot GPUs with checkpoint-resume, and gather a per-edge ΔΔG table — all in your own AWS account, in an afternoon, for ~$33. Three templates (openfe-plan-network → openfe-rbfe-run → openfe-gather) wrap stock OpenFE 1.11.1 / OpenMM; nothing about the science is proprietary. You should land MUE ~0.38 kcal/mol on 9 edges — parity within noise vs published FEP+, read as a single-replicate sanity check, not a benchmark claim. Soft pitch at the end; mostly this is a recipe.

This is the hands-on companion to our TYK2 RBFE case study. That post argued the operating model. This one is the recipe: the exact three-stage chain, the commands, the runtimes, and the cost — so a skeptical comp chemist can reproduce the numbers in their own AWS account and decide for themselves.

There's no proprietary science here. The whole thing is stock OpenFE 1.11.1 driving OpenMM with Hamiltonian replica exchange — the same openfe plan-rbfe-network / openfe quickrun / openfe gather you'd run by hand. What the three Clusterra templates add is the plumbing around the science: the network→Slurm-array fan-out, spot checkpoint-resume, GPU right-sizing, per-edge cost stamping, and the gather — running on managed Slurm in your AWS account, so you never stand up a cluster. If you already have an HPC engineer and a working OpenFE/Slurm setup, you can run the identical science on that. This post is the fast path.

1. What you'll get, and what it costs

One ΔΔG table for the canonical 10-ligand TYK2 series: a 9-edge minimal-spanning network, per-edge ΔΔG(i→j) vs experiment.
~$33 of AWS spot spend, end to end (~$3.70 per edge, where an edge is its complex leg plus its solvent leg).
An afternoon of wall-clock. The plan and gather steps are minutes on CPU; the GPU edge-legs run in parallel, so 18 single-GPU jobs finish in roughly one long edge's wall time (~4–5 h with spot interruptions), not 18× that.
All in your own account. The network plan, every per-edge result JSON, the cost rollup, and the provenance land on your EFS / in your AWS — not in someone else's tenant.
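
The cost and wall-clock figures above are simple arithmetic, and a short sketch makes them easy to re-derive for a different network size. It assumes a 4.5 h leg (the midpoint of the ~4–5 h quoted for the slowest legs), so the serial number is an upper bound, not a measurement.

```python
# Back-of-envelope check of the figures quoted in this section.
TOTAL_USD = 33.0        # campaign total from the post
EDGES = 9               # minimal spanning network over 10 ligands
LEGS_PER_EDGE = 2       # one complex leg + one solvent leg
LEG_WALL_HOURS = 4.5    # assumption: midpoint of the ~4-5 h slowest-leg wall time

legs = EDGES * LEGS_PER_EDGE
usd_per_edge = TOTAL_USD / EDGES
usd_per_leg = TOTAL_USD / legs
serial_hours = legs * LEG_WALL_HOURS   # if the legs ran one after another

print(f"{legs} legs, ${usd_per_edge:.2f}/edge, ${usd_per_leg:.2f}/leg")
print(f"serial upper bound ~{serial_hours:.0f} h vs parallel ~{LEG_WALL_HOURS} h")
```

Run in parallel, the campaign takes about as long as its slowest leg, which is why the whole thing fits in an afternoon.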

Expected accuracy, stated honestly up front: MUE 0.38 kcal/mol [95% CI 0.23–0.55], RMSE 0.46, Pearson R 0.85 [−0.29–0.98] on 9 edges, single replicate. That is parity within experimental noise with published TYK2 FEP+ (MUE / RMSE / R: 0.75 / 0.93 / 0.89) and open-source OpenFE (MUE / RMSE: 0.75 / 0.94), not a method that beats them. At n=1 on 9 edges this is a sanity check that the pipeline is wired correctly. More on why in §7.

2. Prereqs

The TYK2 ligand set. Use the canonical set that ships in OpenFE's rbfe_tutorial (also mirrored in OpenFreeEnergy/openfe-benchmarks): 10 ligands — ejm_31, ejm_42, ejm_43, ejm_46, ejm_47, ejm_48, ejm_50 (EJMECH series) and jmc_23, jmc_27, jmc_28 (JMC series) — plus the prepared TYK2 receptor PDB. You need:
a single multi-molecule ligand SDF with valid 3D conformers and correct protonation/tautomer states (OpenFE does not protonate ligands — wrong states silently corrupt every downstream ΔΔG);
the receptor PDB (protonated, capped, loops modeled).

Partial charges are optional on the SDF — the plan step regenerates AM1-BCC at plan time for one consistent charge model across the network, which is what you want.

An AWS account you control.
A Clusterra managed cluster in that account (this is what runs the three templates on Slurm). Honest caveat: the templates are Clusterra's packaging — the same three openfe commands run on any plain OpenFE + Slurm setup. If you have one, the protocol below maps to it one-to-one; you're just supplying your own array submission and spot-resume wiring. Clusterra removes the cluster-ops burden and ships the spot-resume / BYOC management already wired.

One-time env build. All three templates share a single micromamba env materialized on EFS at /mnt/efs/_openfe-env/v1.11.1-cu126 (openfe=1.11.1, cuda-version=12.6). The first template to run pays a ~5–10 min micromamba create; everything after reuses it. The CUDA pin matters: the default resolve pulls CUDA 13.3, whose PTX the host driver (max CUDA 13.2) rejects with CUDA_ERROR_UNSUPPORTED_PTX_VERSION. The cuda-version=12.6 pin is the fix — don't drop it.

3. Step 1 — plan the network (`openfe-plan-network`)

What it does. Atom-maps every ligand pair (Kartograf, the OpenFE 1.11 default), scores edges with LOMAP, and builds a minimal spanning network over those scores: 9 edges connecting the 10 ligands, emitted as 18 transformation JSONs (one complex + one solvent leg per edge). Each JSON is a self-contained alchemical transformation that openfe quickrun executes independently — these are your Slurm fan-out units. It also writes a manifest.txt (one JSON per line, line N = array task N) and edge_count.txt.
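
To see why 10 ligands give exactly 9 edges and 18 legs, here is a toy spanning-tree sketch. It is not OpenFE's implementation: the pair scores are random stand-ins for LOMAP scores, and `spanning_edges` is a hypothetical helper using a plain Kruskal-style union-find.

```python
import itertools
import random

LIGANDS = ["ejm_31", "ejm_42", "ejm_43", "ejm_46", "ejm_47",
           "ejm_48", "ejm_50", "jmc_23", "jmc_27", "jmc_28"]

def spanning_edges(nodes, scores):
    """Greedily keep the best-scoring edges that join two separate components."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    chosen = []
    for (a, b), _score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:               # adding this edge creates no cycle
            parent[root_a] = root_b
            chosen.append((a, b))
    return chosen

rng = random.Random(0)
# Assumption: random toy scores, NOT real LOMAP scores.
toy_scores = {pair: rng.random() for pair in itertools.combinations(LIGANDS, 2)}

edges = spanning_edges(LIGANDS, toy_scores)
legs = 2 * len(edges)                      # complex + solvent leg per edge
print(len(edges), legs)                    # 9 18
```

Whatever the scores, a spanning tree over N ligands has N − 1 edges. Only the choice of which edges changes.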

Inputs / params:

ligands → your TYK2 SDF
protein → the TYK2 receptor PDB
network_type = minimal_spanning (default; fewest edges, cheapest)
mapper = kartograf (default)
production_ns = 5.0, equilibration_ns = 1.0 (OpenFE defaults; baked into every transformation JSON at plan time)

Under the hood the template runs:

# plan_settings.yaml carries the mapper and network topology.
# --n-protocol-repeats 1 keeps one repeat per transformation; repeats fan out as array tasks.
openfe plan-rbfe-network \
  -M ligands.sdf \
  -p receptor.pdb \
  -s plan_settings.yaml \
  -o network/ \
  --n-protocol-repeats 1 \
  -n 8

--n-protocol-repeats 1 is deliberate: it keeps each emitted transformation a single repeat so replicates become independent array tasks downstream (better spot packing) rather than being serialized inside one quickrun. The CLI has no flag for per-window MD length, so the template patches production_length / equilibration_length directly into each transformation JSON after planning.

Runtime / hardware: CPU only (atom mapping + graph optimization), minutes. Runs on the cpu partition, 8 cores, 16 GB.

Check: cat network/../edge_count.txt should read 18. The file counts transformation legs (two per edge), so 9 edges give 18. If it reads 0, the atom mapper failed on your series; inspect stdout.log (usual culprit: a scaffold hop that LOMAP scores ~0 and the spanning tree drops; for scaffold-hopping series use network_type=minimal_redundant).

4. (Optional) smoke-test first — validate cheaply before spending GPU hours

Before you commit ~$33 of GPU time, prove the pipeline runs end to end for cents. Two ways:

Re-plan a SMOKE network. Run Step 1 with tiny MD lengths — production_ns=0.02, equilibration_ns=0.01 — so each edge-leg finishes in minutes. This validates the full plan → run → gather chain (manifest, array fan-out, result JSONs, gather) on real chemistry. Results below ~1 ns are not scientifically meaningful — this checks plumbing, not numbers.
Single-edge smoke. Run one stock OpenFE transformation JSON through openfe quickrun on a single GPU. Expected wall: ~10–30 min on an A10G once the env is built (first run on a fresh tenant adds the ~5–10 min env build). Cost: ~$0.25–0.50. PASS = a result.json with a finite numeric estimate field. Watch for exit 127 / "no extractable OCI layers" (someone reverted to a broken OpenFE container tag — the micromamba pattern is the fix) and CUDA error: out of memory (you passed a larger transformation than the ~25k-atom TYK2 fixture).

Either way you've confirmed CUDA, the env, and the array wiring before any real spend.
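
The PASS criterion for the smoke test (a finite numeric estimate) is easy to script. The sketch below is one way to check it. It assumes the estimate sits under a top-level `estimate` key, either as a bare number or as a serialized quantity with a `magnitude` field, so inspect your own result.json before relying on it; `estimate_is_finite` is a hypothetical helper, not part of OpenFE.

```python
import json
import math

def estimate_is_finite(result_text):
    """Return True if the result JSON carries a finite numeric estimate."""
    estimate = json.loads(result_text).get("estimate")
    # Assumption: the estimate may be serialized as {"magnitude": ..., "unit": ...}.
    if isinstance(estimate, dict):
        estimate = estimate.get("magnitude")
    if isinstance(estimate, bool) or not isinstance(estimate, (int, float)):
        return False
    return math.isfinite(estimate)

# Illustrative payloads, not real OpenFE output.
ok = '{"estimate": {"magnitude": -1.02, "unit": "kilocalorie_per_mole"}}'
failed = '{"estimate": null}'
print(estimate_is_finite(ok), estimate_is_finite(failed))
```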

5. Step 2 — fan out the edges (`openfe-rbfe-run`)

What it does. This is the GPU-heavy core. It emits a Slurm array — one task per transformation JSON — where task N reads line N of manifest.txt and runs:

openfe quickrun "$TX_JSON" -d "$TASK_OUT" -o "result_${N}.json" --resume

Set the array range with #SBATCH --array=1-18 (set edge_count to match edge_count.txt). RBFE at this scale is embarrassingly parallel: each edge-leg is an independent single-GPU OpenMM run. The 11 λ-windows are replica-exchanged within one GPU, so there's no multi-node MPI and no gang scheduling. (A ~30k-atom RBFE leg is single-GPU by design: a system this small gains little from spreading one OpenMM simulation across GPUs.)
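
The "task N reads line N" rule is the whole fan-out contract, so it is worth seeing in isolation. This sketch assumes Slurm's standard SLURM_ARRAY_TASK_ID variable and a 1-indexed manifest. The file names and the `transformation_for_task` helper are hypothetical, not the template's code.

```python
import os

def transformation_for_task(manifest_lines, task_id):
    """Return the transformation JSON on 1-indexed line `task_id` of the manifest."""
    if not 1 <= task_id <= len(manifest_lines):
        raise IndexError(f"task {task_id} is outside 1..{len(manifest_lines)}")
    return manifest_lines[task_id - 1].strip()

# Hypothetical manifest contents: 18 lines, one transformation JSON per line.
manifest = [f"network/transformations/tx_{i:02d}.json\n" for i in range(1, 19)]

# Slurm sets SLURM_ARRAY_TASK_ID inside each array task; default to 1 off-cluster.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))
print(transformation_for_task(manifest, task_id))
```

Failing loudly on an out-of-range task ID catches the common mistake of an array range that no longer matches the manifest after a re-plan.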

Spot checkpoint-resume: the part you'd otherwise build. Every leg runs on A10G spot. --resume reuses the cached ProtocolDAG and restarts from the last HREX checkpoint after a reclaim, so an eviction costs at most the in-flight exchange iteration, not the job. In the case-study run, several legs were reclaimed mid-flight and resumed to clean completion without losing checkpointed work; their wall times stretched to 4–5 h across interruptions. This is in the template, not something you configure per run.
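
The resume behavior is easier to reason about with a toy model. The sketch below is not OpenFE's checkpoint format or the template's code. It only illustrates the idea that a checkpoint written after every completed iteration limits the loss from a reclaim to the iteration that was in flight. All names are hypothetical.

```python
import json
import os
import tempfile

class Preempted(Exception):
    """Stands in for a spot reclaim."""

def run_leg(checkpoint_path, n_iterations, preempt_at=None):
    """Run toy iterations, resuming from the checkpoint if one exists."""
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            done = json.load(fh)["completed_iterations"]
    for iteration in range(done, n_iterations):
        if preempt_at is not None and iteration == preempt_at:
            raise Preempted(f"reclaimed during iteration {iteration}")
        # ... one exchange iteration of real work would happen here ...
        with open(checkpoint_path, "w") as fh:
            json.dump({"completed_iterations": iteration + 1}, fh)
    return n_iterations - done   # iterations executed in this attempt

with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "checkpoint.json")
    try:
        run_leg(ckpt, n_iterations=10, preempt_at=6)   # reclaimed mid-run
    except Preempted:
        pass
    executed_after_resume = run_leg(ckpt, n_iterations=10)

print(executed_after_resume)   # 4: only the remaining iterations are re-run
```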

GPU right-sizing (measured, not guessed). nvidia-smi sampling on every leg: complex legs peak ~3.7–4.5 GB VRAM at 95–100% GPU utilization; solvent legs ~0.5 GB. So the A10G's 24 GB is heavily over-provisioned and the cheapest A10G instance (g5.xlarge, ~$0.49/hr spot) is the right node; and because the compute is saturated, packing multiple legs per GPU wouldn't help. The template stamps each task's AWS cost independently, so per-edge cost rolls up to the campaign total.
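
The right-sizing argument can be checked with a few lines of arithmetic on the numbers above. The implied GPU-hours figure assumes the whole ~$33 is g5.xlarge spot time at ~$0.49/hr, which ignores the CPU steps and storage, so treat it as a rough consistency check only.

```python
A10G_VRAM_GB = 24.0
COMPLEX_PEAK_GB = 4.5      # upper end of the measured 3.7-4.5 GB
SOLVENT_PEAK_GB = 0.5
SPOT_USD_PER_HOUR = 0.49   # g5.xlarge spot price quoted in the post
TOTAL_USD = 33.0
LEGS = 18

complex_fraction = COMPLEX_PEAK_GB / A10G_VRAM_GB
solvent_fraction = SOLVENT_PEAK_GB / A10G_VRAM_GB
# Assumption: the entire bill is GPU spot time.
implied_gpu_hours = TOTAL_USD / SPOT_USD_PER_HOUR
implied_hours_per_leg = implied_gpu_hours / LEGS

print(f"complex legs use {complex_fraction:.0%} of VRAM, solvent legs {solvent_fraction:.0%}")
print(f"~{implied_gpu_hours:.0f} GPU-hours total, ~{implied_hours_per_leg:.1f} h per leg")
```

The implied per-leg time sits just under the 4–5 h quoted for the slowest legs, so the price and the runtimes are consistent with each other.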

GPU type note. The shipped template defaults to --gres=gpu:l40s:1; the case-study run pinned A10G (g5.xlarge) spot, which is what the measured VRAM/util numbers above say is correctly sized. Set the gres to whatever A10G/L40S capacity you have — the science is identical; A10G is the cost-optimal choice for this workload.

Protocol (baked at plan time, runs here): OpenFE 1.11.1 RelativeHybridTopologyProtocol, OpenMM 8.4.0 + openmmtools 0.26.0 — Hamiltonian replica exchange (repex), 11 λ-windows, 1 ns equilibration + 5 ns production per window, HMR (H mass 3.0 amu) → 4 fs timestep, PME, 0.9 nm cutoff, 0.15 M NaCl. Force fields: OpenFF Sage 2.2.1 (ligands), ff14SB (protein), TIP3P water, AM1-BCC charges. The template exports OPENMM_DEFAULT_PLATFORM=CUDA explicitly so a mis-provisioned node fails loud instead of silently dropping to the ~100×-slower CPU platform.
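
The protocol settings translate into a concrete amount of sampling. A quick sketch, using only the values listed above for the neutral-ligand case:

```python
WINDOWS = 11          # lambda windows per leg
EQUIL_NS = 1.0        # equilibration per window
PROD_NS = 5.0         # production per window
TIMESTEP_FS = 4.0     # enabled by hydrogen mass repartitioning
LEGS = 18

ns_per_window = EQUIL_NS + PROD_NS
ns_per_leg = WINDOWS * ns_per_window
ns_campaign = LEGS * ns_per_leg
steps_per_window = round(ns_per_window * 1_000_000 / TIMESTEP_FS)  # 1 ns = 1e6 fs

print(f"{ns_per_leg:.0f} ns per leg, {ns_campaign:.0f} ns for the campaign")
print(f"{steps_per_window:,} MD steps per window")
```

That is roughly 1.2 µs of aggregate simulation for the 9-edge network, which is why falling back to the CPU platform is not an option.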

Runtime / cost: ~4–5 h wall for the slowest legs (with interruptions); the whole 18-leg array completes in roughly that, in parallel, for ~$33 total. Default sbatch: gpu partition, 8 CPU, 32 GB, --time=24:00:00 (size --time for the worst edge — charge-changing edges auto-expand to 22 λ-windows / 20 ns and take ~4× the wall; the canonical TYK2 set is all neutral).

6. Step 3 — gather (`openfe-gather`)

What it does. Collects the per-edge result JSONs into the campaign-level table:

openfe gather <results_dir> --report ddg --allow-partial -o results.tsv

--report ddg → per-edge ΔΔG(i→j) + uncertainty (the table you want here)
--report dg → per-ligand absolute ΔG via network MLE — requires ≥2 replicates; on a single replicate it refuses with Every edge must have at least two simulation repeats
--report raw → every individual repeat (per-leg MBAR uncertainty, convergence debugging)
--allow-partial lets a campaign with a few preempted edges still produce a table; check gather_stderr.log for skipped edges before trusting it

Runtime / hardware: CPU only (MBAR + graph MLE), seconds to minutes. Output is Cinnabar-ready TSV — drop it straight into a ΔΔG-vs-experiment plot and compute MUE / RMSE against the known TYK2 affinities.

7. What you should see

Your per-edge ΔΔG table (single replicate) should land close to this:

Edge	ΔΔG calc (kcal/mol)	ΔΔG exp (kcal/mol)	error (calc − exp, kcal/mol)
ejm_31 → ejm_46	−1.0	−1.77	+0.77
ejm_31 → ejm_47	+0.1	−0.16	+0.26
ejm_31 → ejm_48	+0.8	+0.54	+0.26
ejm_31 → ejm_50	+0.2	+0.56	−0.36
ejm_42 → ejm_43	+1.4	+1.52	−0.12
ejm_42 → ejm_50	−0.0	+0.80	−0.80
ejm_46 → jmc_28	+0.3	+0.33	−0.03
jmc_23 → jmc_28	+0.4	+0.72	−0.32
jmc_27 → jmc_28	+0.8	+0.30	+0.50

MUE 0.38 [0.23–0.55] · RMSE 0.46 [0.27–0.61] · Pearson R 0.85 [−0.29–0.98] (n=9 edges, 95% bootstrap CIs). Every point falls inside the ±1 kcal/mol band. Per-leg MBAR uncertainties (--report raw) run 0.1–0.5 kcal/mol on complex legs, ~0.1 on solvent legs.
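
The point estimates can be recomputed directly from the table. The sketch below does that and adds a plain percentile bootstrap over edges for the MUE. The bootstrap scheme (2,000 resamples, fixed seed, edges resampled with replacement) is an assumption and will not necessarily reproduce the intervals quoted above, which may have been generated differently.

```python
import math
import random

# Values copied from the table above (kcal/mol).
CALC = [-1.0, 0.1, 0.8, 0.2, 1.4, 0.0, 0.3, 0.4, 0.8]
EXP = [-1.77, -0.16, 0.54, 0.56, 1.52, 0.80, 0.33, 0.72, 0.30]

def mue(calc, exp):
    return sum(abs(c - e) for c, e in zip(calc, exp)) / len(calc)

def rmse(calc, exp):
    return math.sqrt(sum((c - e) ** 2 for c, e in zip(calc, exp)) / len(calc))

def pearson(calc, exp):
    mc, me = sum(calc) / len(calc), sum(exp) / len(exp)
    cov = sum((c - mc) * (e - me) for c, e in zip(calc, exp))
    var_c = sum((c - mc) ** 2 for c in calc)
    var_e = sum((e - me) ** 2 for e in exp)
    return cov / math.sqrt(var_c * var_e)

def bootstrap_ci(stat, calc, exp, n_boot=2000, seed=0):
    """95% percentile interval from resampling edges with replacement."""
    rng = random.Random(seed)
    n = len(calc)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(stat([calc[i] for i in idx], [exp[i] for i in idx]))
    values.sort()
    return values[int(0.025 * n_boot)], values[int(0.975 * n_boot) - 1]

mue_lo, mue_hi = bootstrap_ci(mue, CALC, EXP)
print(f"MUE {mue(CALC, EXP):.2f}  RMSE {rmse(CALC, EXP):.2f}  R {pearson(CALC, EXP):.2f}")
print(f"MUE 95% bootstrap interval: [{mue_lo:.2f}, {mue_hi:.2f}]")
```

Point the same functions at your own results.tsv values to compare your run against this table.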

Read this honestly — it is what makes the recipe trustworthy:

This is a single-replicate sanity check, not a benchmark claim. It says the pipeline is wired correctly and produces sane numbers. It does not say OpenFE-on-your-AWS beats FEP+.
MUE 0.38 vs FEP+ 0.75 is parity within noise, not an improvement. Experimental error on the reference ΔΔG is ~0.4 kcal/mol — about the size of the MUE itself. With 9 points you cannot statistically distinguish 0.38 from 0.75.
The R is effectively undetermined. Pearson R = 0.85 sounds strong, but at n=9 its bootstrap CI is [−0.29, 0.98]. Report it for completeness, not as evidence.
The 9 edges are a favorable subset. A minimal spanning tree picks the easiest (highest-similarity) perturbations and contains no cycles, so there is no cycle-closure (hysteresis) self-consistency check in this run, and the MUE is expected to look flattering relative to a full 24-edge network.
Scope: RBFE/FEP is for congeneric series only. TYK2 is the easy, well-trodden benchmark, with ligands in the lineage these force fields were tuned against — no claim about novel chemistry, charge-changing perturbations, cofactors, or harder targets.
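
The "no cycles" caveat can be made concrete by counting independent cycles, which for a connected network is edges − ligands + 1. The sketch below applies that to the 9-edge tree and to a 24-edge network over the same 10 ligands, then shows what a closure check looks like on one cycle. The three ΔΔG values in that cycle are invented for illustration and are not from any run.

```python
def independent_cycles(n_edges, n_ligands, n_components=1):
    """Cycle rank of a network: edges - nodes + connected components."""
    return n_edges - n_ligands + n_components

def cycle_closure(ddgs_around_cycle):
    """Sum of ddG values around a closed cycle; ideally zero."""
    return sum(ddgs_around_cycle)

print(independent_cycles(9, 10))    # 0: a spanning tree has nothing to close
print(independent_cycles(24, 10))   # 15 independent closure checks

# Hypothetical cycle A -> B -> C -> A with made-up ddG values (kcal/mol).
hypothetical_cycle = [0.8, -0.5, -0.1]
print(f"closure error: {cycle_closure(hypothetical_cycle):.2f} kcal/mol")
```

A nonzero closure error flags hysteresis or poor convergence on at least one edge of the cycle, which is the QC signal a spanning tree cannot provide.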

If you want an actual accuracy statement, run the same chain at n=3 with a redundant network (for cycle-closure QC) — roughly 3× this run, ~$70–100 on the same spot. --report dg then gives you the per-ligand ΔG MLE and the ranking "money plot."

8. Where Clusterra fits

If you have a platform team and a working OpenFE/Slurm setup, everything above runs on it — the science is open and the commands are stock. What you'd be building yourself is the part around the openfe calls: the network→array fan-out, the spot-reclaim checkpoint-resume, the GPU right-sizing, the per-edge cost stamping, the gather — and the in-account record where the plan, the ΔΔG table, the costs, and the provenance accrete as one queryable campaign history instead of scattered job folders.

That's what the three templates package, running managed Slurm in your AWS account (BYOC): you supply the ligands and the receptor, we remove the cluster-ops. Pilot is $4K, then $1,500/mo BYOC.

If you'd rather we scope it against your own series and your account, book a pilot scoping call.