2026-06-02

A RELION cryo-EM 3D classification, on spot, in your own AWS

The canonical RELION cryo-EM benchmark (105,247 Plasmodium falciparum 80S ribosome particles from EMPIAR-10028, 3D classification into 6 classes over 25 iterations) refined on a single commodity A10G GPU in 3 h 18 m for ~$1.65 of the customer's own AWS spot, with no HPC engineer, no data egress, and the run history and cost stamped in their account. The same job on 4× A10G finishes in 1 h 36 m (2.07× scaling), run as single-node MPI inside our containerized Slurm. Wall-clock lands right alongside published V100 and RTX 3090 numbers for the identical job, and the reconstruction converges to 9.65 Å. The headline is n=2 (~0.5% spread) and single-node only; both limits are scoped honestly below.

Companion to the OpenFE RBFE post — same operating model (managed Slurm, customer's AWS, spot economics), a different top-priority workload.

TL;DR (June 2 2026)

  • Dataset: EMPIAR-10028, the Plasmodium falciparum 80S ribosome — the dataset the RELION team and every GPU vendor benchmark on. 105,247 pre-extracted particles, box 360 px, 1.34 Å/px. We run the canonical benchmark job verbatim: relion_refine 3D classification, K=6, 25 iterations.
  • Image: relion 5.0 (CUDA 12.6) via Apptainer; apptainer pull on the node, no build host needed.
  • 1× A10G: 197 m 51 s (g5 spot), exit 0, converged to 9.65 Å, peak GPU memory 21.5 GB / 23 GB.
  • 4× A10G: 95 m 38 s with mpirun -n 5 (one coordinator plus 4 worker ranks mapped 1:1 to 4 A10Gs inside one pod). 2.07× speedup (sub-linear, as the published scaling curves predict).
  • Cost: ~$1.65 per refinement on a right-sized single-A10G spot node; ~$4 on the 4-GPU box. The single GPU is cheaper per result; the 4-GPU box is faster wall-clock. Pick by urgency.
  • vs CPU: the same job is ~6× slower per iteration on a ~20-core CPU node — GPU is both faster and cheaper. (in-house baseline, same environment; full 5-iteration run completing — see Results)
  • Spot: all runs were on spot; this session's runs completed without interruption. The platform requeues on reclaim, but a plain requeue restarts RELION from iteration 0; checkpoint-resume (--continue) is a known gap, not yet exercised (see The operating model).
  • n: headline 1× A10G is n=2 (198 m 51 s / 198 m 56 s — ~0.5% spread); a 3rd rep is running. Scaling and CPU runs n=1.

Setup

Target. The Plasmodium falciparum 80S ribosome (Wong et al., eLife 2014). Deliberately the canonical, well-behaved benchmark — the right fixture for proving the managed pipeline produces a correct, literature-grade reconstruction, not a claim about solving hard novel structures. The whole point of EMPIAR-10028 is that there are published wall-clock numbers for the identical job on other GPUs, so our A10G number slots into a recognized table.

Data. We download the MRC-LMB relion_benchmark package (47 GiB, pre-extracted particles + the emd_2660.map reference) — not the 1.2 TB raw EMPIAR deposit. So we run classification directly, no motion-correction / CTF / picking / extraction pipeline. 105,247 particles, box 360, 1.34 Å/px, staged once to the cluster's EFS.

Job (verbatim MRC-LMB benchmark).

relion_refine --i Particles/shiny_2sets.star --ref emd_2660.map:mrc \
  --firstiter_cc --ini_high 60 --ctf --ctf_corrected_ref \
  --iter 25 --tau2_fudge 4 --particle_diameter 360 --K 6 \
  --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 \
  --offset_range 5 --offset_step 2 --sym C1 --norm --scale \
  --random_seed 0 --pool 30 --j 6 --gpu --scratch_dir <local NVMe>

--scratch_dir on the instance's local NVMe is load-bearing — RELION copies the particle stack off EFS once, then does its many random reads against fast local disk.
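Because the scratch path matters this much, a pre-flight check before submitting can be worth having. The following is a minimal sketch, not part of the template described here: `pick_scratch`, the candidate mount points, and the 47 GiB threshold (borrowed from the benchmark package size as a rough stand-in for the particle stack) are all illustrative assumptions.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable, Optional

GIB = 1024 ** 3


def pick_scratch(candidates: Iterable[str], needed_bytes: int) -> Optional[Path]:
    """Return the first existing directory with at least needed_bytes free."""
    for cand in candidates:
        path = Path(cand)
        if path.is_dir() and shutil.disk_usage(path).free >= needed_bytes:
            return path
    return None


if __name__ == "__main__":
    # Hypothetical mount points; substitute whatever the instance actually exposes.
    candidates = ["/mnt/nvme0", "/local_scratch", tempfile.gettempdir()]
    scratch = pick_scratch(candidates, 47 * GIB)
    if scratch is None:
        print("no local scratch with enough free space; refusing to fall back to EFS")
    else:
        print(f"--scratch_dir {scratch}")
```

Returning `None` instead of silently falling back to shared storage keeps a misconfigured node from quietly turning the run into an EFS-bound one.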

Results

Wall-clock — same dataset, same job definition

| Run | Hardware / source | Wall-clock | n |
|---|---|---|---|
| 1× A10G (this run) | g5 spot | 198 m 51 s / 198 m 56 s (mean 198.9 m) | 2 (3rd running) |
| 4× A10G (this run) | g5.12xlarge spot | 95 m 38 s | 1 |
| CPU (this run) | ~20-core spot | ~6× slower/iter (5-iter run completing) | 1 |
| Published results | | | |
| 1× Tesla V100 | MRC-LMB (RELION 3) | 3 h 06 m | 1 |
| 4× Tesla V100 | MRC-LMB (RELION 3) | 1 h 12 m | 1 |
| 4× RTX 3090 | linuxvixion (RELION 3.1) | 43 m | 1 |

A single A10G (198 min) lands just behind a single V100 (186 min), about 6% slower, which is where it should be: the A10G is a slightly slower card for this FP32-heavy workload. There was no published A10G number for this job until now, so this fills a real gap.

Read the comparison honestly: these are the same dataset and same job definition (K=6, 25 iter, box 360), but the published numbers are RELION 2/3 and ours is RELION 5.0 — whose GPU code path differs. This is "same canonical benchmark, modern stack," not a controlled version-to-version delta. We make no claim that the A10G/V100 gap is a clean hardware ratio.

Scaling: 1 → 4 GPUs = 2.07×

Wall-clock on the identical job (K=6, 25 iter, box 360): RELION 5 on Clusterra A10G spot vs published RELION 2/3 V100 and RTX 3090 numbers

Four times the GPUs buys 2.07× the throughput — sub-linear, and consistent with the published curves (V100 1→4 was 2.6×, RTX 3090 2→4 was 1.65×). RELION's 3D classification alternates a GPU-bound expectation step with a partly CPU-bound maximization step and per-iteration disk I/O, so adding GPUs has diminishing returns. This is the honest shape of the workload, and it drives the cost story below.
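The scaling figures can be re-derived from the wall-clock numbers quoted in this post. The sketch below uses the rounded 3 h 18 m headline and the 95 m 38 s four-GPU run, plus the published V100 pair. The Amdahl's-law estimate at the end is our own illustrative addition, not something the runs measured: it assumes a fixed serial fraction and perfect scaling of the rest, which RELION's mix of GPU, CPU, and I/O phases only approximates.

```python
def minutes(h: float = 0, m: float = 0, s: float = 0) -> float:
    """Convert an h/m/s wall-clock into minutes."""
    return h * 60 + m + s / 60


def speedup(t_one: float, t_many: float) -> float:
    return t_one / t_many


def efficiency(s: float, n_gpus: int) -> float:
    """Fraction of ideal linear scaling actually achieved."""
    return s / n_gpus


def amdahl_parallel_fraction(s: float, n_gpus: int) -> float:
    """Solve S = 1 / ((1 - p) + p / N) for the parallel fraction p."""
    return (1 - 1 / s) / (1 - 1 / n_gpus)


runs = {
    "A10G (this post)": (minutes(h=3, m=18), minutes(m=95, s=38)),
    "V100 (published)": (minutes(h=3, m=6), minutes(h=1, m=12)),
}

for name, (t1, t4) in runs.items():
    s = speedup(t1, t4)
    print(
        f"{name}: speedup {s:.2f}x, "
        f"efficiency {efficiency(s, 4):.0%}, "
        f"Amdahl parallel fraction ~{amdahl_parallel_fraction(s, 4):.2f}"
    )
```

Under those assumptions roughly two thirds of the single-GPU wall-clock behaves as parallelizable, which is one way to read why a fourth GPU buys so much less than the second.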

Correctness — a real reconstruction, not fast noise

PyMOL isosurface of the dominant 3D class (class 3, 21.8% of particles, 9.65 Å) — the reconstructed Plasmodium 80S ribosome, 15 Å low-pass

RELION 3D classification convergence: current resolution improves from 60.3 Å at iteration 1 to 9.65 Å at iteration 25 on a single A10G

The run converged from 60 Å at iteration 1 to 9.65 Å at iteration 25, producing six 3D class volumes and the per-particle class assignments. The reconstructed density is the recognizable 80S ribosome (rendered above). This is the proof that the managed pipeline yields a correct result — the speed numbers only matter because the science is right.
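A convergence curve like the one described can be pulled out of RELION's per-iteration model files. This is a minimal sketch under stated assumptions: it relies on each `*_itNNN_model.star` file carrying a `_rlnCurrentResolution` key-value entry, the file-name pattern depends on the `--o` prefix used (not shown in this post), and the `SAMPLE` text is a hand-written fragment rather than output from these runs.

```python
import re
from pathlib import Path
from typing import Dict

_RES = re.compile(r"^_rlnCurrentResolution\s+([0-9.eE+-]+)\s*$", re.MULTILINE)
_ITER = re.compile(r"_it(\d+)_model\.star$")


def current_resolution(star_text: str) -> float:
    """Extract the current resolution (in Angstrom) from model.star text."""
    match = _RES.search(star_text)
    if match is None:
        raise ValueError("no _rlnCurrentResolution entry found")
    return float(match.group(1))


def resolution_by_iteration(out_dir: str) -> Dict[int, float]:
    """Map iteration number to resolution for every model file in out_dir."""
    curve: Dict[int, float] = {}
    for path in sorted(Path(out_dir).glob("*_it*_model.star")):
        m = _ITER.search(path.name)
        if m:  # skips any file whose name does not end in _itNNN_model.star
            curve[int(m.group(1))] = current_resolution(path.read_text())
    return curve


# Hand-written fragment for illustration only.
SAMPLE = """
data_model_general

_rlnReferenceDimensionality    3
_rlnCurrentResolution          9.650000
_rlnCurrentImageSize           120
"""

if __name__ == "__main__":
    print(current_resolution(SAMPLE))
```

A regex is enough for a single key-value entry; reading the per-class tables would call for a proper STAR parser instead.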

The operating model: what you actually get

One Slurm submit, GPU passthrough, no build host. The RELION 5 CUDA image is pulled with apptainer pull directly onto the worker (cap-pods can't build, but they can pull), and apptainer exec --nv exposes the host A10G + CUDA. No HPC engineer assembled this; it's a template.

Single-node MPI, inside containerized Slurm — and it just runs. Multi-GPU RELION is MPI: mpirun -n 5 spawns one coordinator + four workers, each pinned to one of the four A10Gs in a single g5.12xlarge. Making OpenMPI launch cleanly inside a containerized slurmd took exactly two things: tell OpenMPI not to reach for Slurm's launcher (--mca plm ^slurm, so it forks ranks locally in the pod), and bind the scratch path so OpenMPI's session directory exists. That's the whole trick. We make no multi-node MPI claim — RELION 5's container is single-node by design, and at this scale single-node multi-GPU is the dominant, recommended path anyway. The differentiation here is managed + spot + in-account, not architecture; AWS PCS and ParallelCluster run RELION too, and we don't pretend otherwise.
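The launch line can be assembled programmatically so the rank count always tracks the GPU count. This sketch only rearranges the flags listed under Reproduce it; the function name, the image file name `relion.sif`, and the placeholder RELION arguments are hypothetical.

```python
import os
import shlex
from typing import List, Sequence


def relion_mpi_command(
    n_gpus: int,
    relion_args: Sequence[str],
    scratch: str,
    image: str = "relion.sif",  # hypothetical SIF name
) -> List[str]:
    """Build the apptainer + mpirun argv for a single-node multi-GPU run."""
    if n_gpus < 1:
        raise ValueError("need at least one GPU")
    mpirun = [
        "mpirun",
        "--mca", "plm", "^slurm",  # fork ranks locally, not via Slurm's launcher
        "--mca", "ras", "^slurm",
        "--oversubscribe",
        "--bind-to", "none",
        "-np", str(n_gpus + 1),    # one coordinator rank plus one worker per GPU
        "relion_refine_mpi", *relion_args, "--gpu",
    ]
    # Binding the scratch path lets OpenMPI's session directory exist in the container.
    return ["apptainer", "exec", "--nv", "--bind", scratch, image, *mpirun]


if __name__ == "__main__":
    cmd = relion_mpi_command(
        4,
        ["--i", "Particles/shiny_2sets.star", "--K", "6", "--iter", "25"],
        scratch=os.environ.get("TMPDIR", "/tmp"),
    )
    print(shlex.join(cmd))
```

Keeping the command as a list until the last moment avoids shell-quoting surprises with `^slurm`.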

Right-sized GPU, measured not guessed — and the answer surprised us. nvidia-smi sampling every 30 s showed the 1× A10G run peaking at 21.5 GB of the card's 23 GB at 100% utilization (37% average — that's the GPU/CPU alternation, not starvation). The consequence is the opposite of our OpenFE FEP workload, where the A10G's VRAM was massively over-provisioned: here, the A10G's 24 GB class is the right floor — a 16 GB card would OOM on this box-360 / K=6 job. Same platform, different workload, different correct instance — because we measure it.

GPU utilization and VRAM sampled every 30 s over the 1× A10G run: VRAM square-waves to a 21.5 GB peak during each expectation step while utilization averages 37% (the GPU/CPU alternation), with an initial lull during the one-time particle copy to local NVMe
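For readers who want to reproduce this kind of measurement, one way to collect the samples is `nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv,noheader,nounits -l 30` redirected to a file. The sketch below summarizes such a log; it is an assumption about how the sampling could be done, not the exact tooling used for the run, and the three sample rows are synthetic values chosen for illustration.

```python
import csv
import io
from typing import Iterable, NamedTuple


class GpuSummary(NamedTuple):
    samples: int
    peak_mem_mib: int
    mean_util_pct: float
    max_util_pct: int


def summarize(lines: Iterable[str]) -> GpuSummary:
    """Summarize rows of 'timestamp, utilization.gpu, memory.used' (nounits)."""
    utils, mems = [], []
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) != 3:
            continue  # tolerate blank or truncated lines from an interrupted run
        utils.append(int(row[1]))
        mems.append(int(row[2]))  # memory.used is reported in MiB
    if not utils:
        raise ValueError("no samples found")
    return GpuSummary(len(utils), max(mems), sum(utils) / len(utils), max(utils))


# Synthetic rows, not data from the benchmark run.
SAMPLE_LOG = """\
2026/06/02 10:00:00.000, 0, 450
2026/06/02 10:00:30.000, 100, 21500
2026/06/02 10:01:00.000, 11, 9000
"""

if __name__ == "__main__":
    print(summarize(io.StringIO(SAMPLE_LOG)))
```

Peak memory, not the average, is the number that decides the instance type: a card sized to the mean would run out of memory on the first expectation step.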

Spot — and an honest gap. Every run was on spot, and they completed cleanly without interruption this session (g5 spot was scarce enough that some replicates waited in the queue for capacity, but none were reclaimed mid-run). We're not going to dress that into a resilience story it didn't earn. What we will flag is the gap a reclaim would expose: the current RELION template relies on plain Slurm requeue, which restarts RELION from iteration 0 — a late reclaim would be expensive. RELION supports --continue <optimiser.star> to resume from the last checkpointed iteration, exactly the spot-safety we already ship for OpenFE (quickrun --resume). Wiring that into the RELION template is a known, queued improvement — not something this run validated.
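What the queued improvement might look like in a job wrapper: on every (re)start, look for the newest optimiser checkpoint and switch to `--continue` if one exists. This is a hypothetical sketch, not the shipped template. It assumes RELION's `<prefix>_itNNN_optimiser.star` naming, and which runtime flags must be repeated alongside `--continue` is an assumption to verify against the RELION documentation.

```python
import re
from pathlib import Path
from typing import List, Optional, Sequence

_OPT = re.compile(r"_it(\d+)_optimiser\.star$")


def latest_optimiser(out_dir: str) -> Optional[Path]:
    """Return the optimiser file with the highest iteration number, if any."""
    best, best_iter = None, -1
    for path in Path(out_dir).glob("*_it*_optimiser.star"):
        m = _OPT.search(path.name)
        if m and int(m.group(1)) > best_iter:
            best, best_iter = path, int(m.group(1))
    return best


def relion_args(
    out_dir: str,
    fresh_args: Sequence[str],
    runtime_args: Sequence[str],
) -> List[str]:
    """Choose between a fresh start and a resume.

    fresh_args: the job definition (--i, --ref, --K, --iter, ...).
    runtime_args: flags assumed to be repeated on resume (--gpu, --j, ...).
    """
    checkpoint = latest_optimiser(out_dir)
    if checkpoint is None:
        return [*fresh_args, *runtime_args]
    return ["--continue", str(checkpoint), *runtime_args]
```

On a first launch the output directory is empty and the job starts from the full argument list; after a reclaim and requeue the same wrapper would pick up the last completed iteration instead of iteration 0.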

Cost, and the crossover that matters to a small lab.

| Config | Instance | ~Spot $/hr | Wall-clock | ~$/refinement |
|---|---|---|---|---|
| 1× A10G | g5.xlarge (right-sized) | ~$0.50 | 3.30 h | ~$1.65 |
| 4× A10G | g5.12xlarge | ~$2.5 | 1.59 h | ~$4.0 |

The single A10G is ~2.4× cheaper per refinement; the 4-GPU box is ~2× faster wall-clock. For a lab with no standing cluster, the default is one A10G — you only pay the multi-GPU premium when turnaround urgency justifies it. (Note: our headline runs landed on a larger g5.4xlarge/g5.8xlarge than needed because of how the CPU request maps to instance selection; the ~$1.65 above is normalized to the right-sized g5.xlarge, which is the template default going forward.)
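The cost figures are simple products of the approximate spot prices and wall-clock hours quoted above, which makes them easy to recompute when spot prices move. A small sketch using only those numbers; the dictionary layout and function name are ours.

```python
def refinement_cost(spot_usd_per_hr: float, wall_hours: float) -> float:
    """Cost of one refinement: hourly spot price times wall-clock hours."""
    return spot_usd_per_hr * wall_hours


# Approximate prices and wall-clocks as quoted in the cost table.
configs = {
    "1x A10G (g5.xlarge)": {"price": 0.50, "hours": 3.30},
    "4x A10G (g5.12xlarge)": {"price": 2.5, "hours": 1.59},
}

costs = {name: refinement_cost(c["price"], c["hours"]) for name, c in configs.items()}
for name, usd in costs.items():
    print(f"{name}: ~${usd:.2f} per refinement")

cost_ratio = costs["4x A10G (g5.12xlarge)"] / costs["1x A10G (g5.xlarge)"]
time_ratio = configs["1x A10G (g5.xlarge)"]["hours"] / configs["4x A10G (g5.12xlarge)"]["hours"]
print(f"4-GPU box: ~{cost_ratio:.1f}x the cost for ~{time_ratio:.1f}x the speed")
```

Spot prices vary by region and hour, so the inputs are the part to update; the crossover logic stays the same.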

"Why not just install RELION and run this on my own spot?"

You can — RELION is free and AWS rents the GPUs. What you'd be assembling: a working RELION 5 CUDA image and the apptainer pull path; a Slurm scheduler that provisions A10G spot on demand and tears it down; the mpirun --mca plm ^slurm incantation to make multi-GPU MPI launch inside a container; NVMe scratch routing so EFS isn't your bottleneck; spot-reclaim requeue (and checkpoint-resume so it isn't wasted); GPU right-sizing measured rather than guessed; and the per-job cost stamped in your own account. None of it is exotic. All of it is a week you don't get back, every time you onboard a new workload — which is the actual product.

Honest scope (what this run is and isn't)

  • Easy benchmark, by design. EMPIAR-10028 is the cryo-EM equivalent of TYK2 for FEP — a well-behaved canonical fixture. It proves the pipeline yields correct, fast, literature-grade results; it does not claim we solve hard, heterogeneous, or novel structures.
  • Single-node only. RELION 5's container has no multi-node MPI. We validate single-node multi-GPU MPI; we do not claim (and the workload doesn't need) tightly-coupled multi-node.
  • Version caveat. Published comparison numbers are RELION 2/3; ours is RELION 5.0. Same dataset/job-def, different stack — not a controlled hardware delta.
  • n. Headline 1× A10G is n=2 (a 3rd rep is running); scaling, CPU, and the resolution figure are n=1.

Reproduce it

  • Dataset: ftp://ftp.mrc-lmb.cam.ac.uk/pub/scheres/relion_benchmark.tar.gz (47 GiB; EMPIAR-10028 pre-extracted particles + emd_2660.map).
  • Image: relion 5.0 CUDA 12.6, pulled to a cluster-cached SIF.
  • Jobs (clusde74): 2983 (1× A10G headline), 2986 (4× A10G), 2990/2991 (1× A10G reps 2–3), 2987 (CPU baseline).
  • MPI launch: mpirun --mca plm ^slurm --mca ras ^slurm --oversubscribe --bind-to none -np <N+1> relion_refine_mpi ... --gpu inside apptainer exec --nv --bind $TMPDIR.

What's next

  • A real-world target where there's no published number to hide behind — e.g. a membrane protein (TRPM8, EMPIAR-11233) — to show the managed pipeline on data that isn't the easy fixture.
  • --continue spot-resume wired into the RELION template, so a reclaim resumes instead of restarting.
  • The full SPA pipeline as one chained campaign (Class3D → Refine3D → postprocess), the way the OpenFE post runs plan → fan-out → gather.