One Cluster for Drug Discovery: Why Computational Biology Teams Shouldn't Manage Three Systems
Drug discovery teams run Nextflow, GROMACS, AlphaFold, and ML training on two or three separate systems. Here's how a single managed Slurm cluster handles all of them, at roughly a third of the cost of AWS HealthOmics.
If you lead computational biology at a Series A–B drug discovery startup, your infrastructure probably looks something like this:
- Nextflow pipelines (genomics, transcriptomics) running on AWS Batch or Seqera Platform
- Molecular dynamics (GROMACS, AMBER, OpenMM) on AWS ParallelCluster or Rescale
- ML workloads (AlphaFold, ESMFold, property prediction) on EC2 instances or SageMaker
Three systems. Three billing models. Three sets of credentials. Three failure modes. And one computational scientist — probably you — managing all of it.
This is the "fragmented HPC" problem, and it's the #1 operational pain point for 15–50 person drug discovery teams. Not because each system is bad — Batch is fine for Nextflow, ParallelCluster works for MPI — but because the combined operational overhead is crushing.
The Real Cost of Fragmentation
It's not just the AWS bill. It's the invisible tax:
Ops overhead: Each system has its own upgrade cycle, security patches, network configuration, and monitoring. ParallelCluster alone requires managing AMIs, VPC subnets, shared storage, and Slurm configuration. Multiply by three systems.
Context switching: Your GROMACS simulation fails at 2 AM. You debug it in ParallelCluster's Slurm logs. Then a Nextflow pipeline fails — different logs, different system, different mental model. Then a SageMaker training job runs out of GPU memory — yet another dashboard.
Cost opacity: You get three separate AWS bills. Attributing compute cost to "Project X — lead optimization" requires manually cross-referencing Batch task IDs, ParallelCluster job IDs, and SageMaker training job ARNs. Nobody does this accurately.
Team friction: The wet-lab biologist who needs to run a quick AlphaFold prediction has to learn SageMaker. The computational chemist who wants to add a Nextflow pre-processing step has to learn Batch. Knowledge silos form around infrastructure, not science.
What If It Was One Cluster?
Clusterra runs all of these workloads on a single managed Slurm cluster. Same scheduler. Same console. Same cost tracking. Same credentials.
Here's what that looks like:
Nextflow Pipelines
Same nf-core pipelines you already run — just point at Clusterra's Slurm executor:
// nextflow.config
profiles {
    clusterra {
        // Submit every process as a Slurm job to the CPU partition
        process.executor = 'slurm'
        process.queue    = 'cpu-workers'
    }
}
nextflow run nf-core/sarek -profile clusterra --input samples.csv --outdir results
Karpenter provisions the right instance types automatically. Scale-to-zero when done.
Molecular Dynamics (MPI)
GROMACS and AMBER are native Slurm workloads. Multi-node MPI works out of the box:
#!/bin/bash
#SBATCH --job-name=md-simulation
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --partition=cpu-workers
#SBATCH --time=48:00:00
srun gmx_mpi mdrun -deffnm production -nsteps 50000000
No separate ParallelCluster deployment. No separate AMI management. Same cluster, same Slurm scheduler.
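Long runs like the 48-hour job above are the ones most likely to be interrupted if they land on Spot capacity. Here is a minimal sketch of a resume-aware launch. It assumes the job restarts in the same working directory on shared storage and that interrupted jobs are requeued; neither is confirmed here for Clusterra. `mdrun_args` is a hypothetical helper name, while `-cpi` is GROMACS's standard flag for continuing from a checkpoint:

```shell
#!/bin/bash
# Sketch: build mdrun arguments that resume from a checkpoint when one exists.
# Assumption: a requeued job restarts in the same directory on shared storage.

mdrun_args() {
  local deffnm="$1"
  local args="-deffnm ${deffnm}"
  # With -deffnm, mdrun writes its checkpoint to <deffnm>.cpt;
  # -cpi tells it to continue from that file.
  if [ -f "${deffnm}.cpt" ]; then
    args="${args} -cpi ${deffnm}.cpt"
  fi
  printf '%s\n' "${args}"
}

# Inside the sbatch script, the launch line would become:
#   srun gmx_mpi mdrun $(mdrun_args production) -nsteps 50000000
```

With `#SBATCH --requeue` added to the header, the same script can restart after an interruption without edits.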
GPU Workloads (AlphaFold, ML Training)
Submit GPU jobs via Slurm's native GRES scheduling:
#!/bin/bash
#SBATCH --job-name=alphafold-predict
#SBATCH --partition=gpu-workers
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
# Database and model flags (--data_dir, --max_template_date, ...) omitted for brevity
python run_alphafold.py --fasta_paths=target.fasta --output_dir=./results
Karpenter provisions GPU nodes (g5, p4d, etc.) on Spot when available. Scale to zero when no GPU jobs are queued.
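With CPU, MPI, and GPU jobs all going through one scheduler, per-project usage can in principle be read from Slurm's own accounting instead of three billing exports. The sketch below assumes accounting is enabled on the cluster and that jobs were submitted with `--account=<project>`. The helper name `cpu_hours_by_account`, the account names, and the start date are illustrative, and this is not a description of how the Clusterra console computes cost.

```shell
#!/bin/bash
# Sketch: sum CPU-hours per Slurm account from parsable sacct output.
# Expected input on stdin: one job per line, formatted "account|cpu_seconds".

cpu_hours_by_account() {
  awk -F'|' '
    $1 != "" { secs[$1] += $2 }                      # accumulate seconds per account
    END { for (a in secs) printf "%s %.1f\n", a, secs[a] / 3600 }
  ' | sort
}

# On a cluster with accounting enabled (date is an example):
#   sacct -a -X -n -P -S 2025-01-01 --format=Account,CPUTimeRAW | cpu_hours_by_account
```

This counts CPU time only. GPU usage would need the `AllocTRES` field and a per-instance price table, both left out here.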
Cost Comparison: Unified Cluster vs Three Systems
For a typical drug discovery team running ~$10,000/month in compute across all workloads:
| System | Monthly Cost | Notes |
|---|---|---|
| 3-system setup | | |
| Seqera + Batch (Nextflow) | ~$4,000 | Seqera compute markup + Batch |
| ParallelCluster (GROMACS) | ~$3,500 | Controller + On-Demand nodes |
| SageMaker (AlphaFold/ML) | ~$3,000 | ML instance premium |
| Subtotal | ~$10,500 | Plus ops time across 3 systems |
| Clusterra (unified) | | |
| SaaS (Spot + 10% fee) | ~$3,500 | All workloads on one cluster |
| Savings | ~$7,000/month (67%) | Plus recovered engineering time |
The savings come from three sources:
1. Spot pricing passed through (vs Seqera/SageMaker markups)
2. No idle infrastructure (scale-to-zero vs ParallelCluster controller always-on)
3. No platform fees for separate orchestration layers
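The table's arithmetic can be checked in a few lines of shell. The figures are the approximate monthly numbers from the table; the variable names are only labels chosen for this sketch.

```shell
#!/bin/bash
# Approximate monthly figures from the comparison table (USD).
seqera_batch=4000
parallelcluster=3500
sagemaker=3000
clusterra=3500

subtotal=$(( seqera_batch + parallelcluster + sagemaker ))
savings=$(( subtotal - clusterra ))
# Shell arithmetic is integer-only, so awk does the rounded percentage.
percent=$(awk -v s="$savings" -v t="$subtotal" 'BEGIN { printf "%.0f", s / t * 100 }')

echo "subtotal=${subtotal} savings=${savings} percent=${percent}"
```

Swapping in your own monthly numbers for the four variables gives a like-for-like estimate for your team.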
vs AWS HealthOmics
AWS HealthOmics is trying to solve a similar problem — unified compute for multi-modal biotech. But:
| | HealthOmics | Clusterra |
|---|---|---|
| Compute | Proprietary "omics instances" (~3x EC2) | AWS Spot pass-through |
| MPI | Experimental (GROMACS via Nextflow DSL) | Native Slurm MPI |
| Non-containerized | No | Yes |
| GPU | Limited instance selection | Any EC2 GPU type via GRES |
| Lock-in | Complete (proprietary instances) | K8s-portable |
| Pricing (16 vCPU, 32 GB) | ~$1.20/hr | ~$0.33/hr |
HealthOmics costs more than 3x as much for comparable workloads (~$1.20/hr vs ~$0.33/hr in the 16 vCPU, 32 GB example), and it doesn't support native MPI, which is essential for multi-node molecular dynamics.
vs Rescale
Rescale solves the same unified-platform problem, but for enterprises:
| | Rescale | Clusterra |
|---|---|---|
| Target | Large pharma, 100+ people | Series A–B, 15–50 people |
| Pricing | Custom quotes, $50k–$500k/year | $9k/year BYOC or usage-based SaaS |
| Min deal | $50k+ ACV | Start for $5 |
| Setup | Weeks (enterprise onboarding) | Minutes (self-serve) |
If you can afford Rescale, it's a great platform. If you're a 20-person startup, it's not an option.
The Slurm Advantage
Why Slurm as the unified scheduler? Because computational biology teams already know it.
Most comp bio PhDs learned Slurm in grad school. The commands are muscle memory: sbatch, squeue, scancel, sinfo. Moving from an academic cluster to Clusterra is frictionless — same workflow, better infrastructure.
And Slurm natively handles the diversity of workloads in drug discovery:
- Batch jobs: Nextflow tasks, data processing scripts
- MPI jobs: Multi-node molecular dynamics, quantum chemistry
- GPU jobs: Deep learning training, structure prediction
- Array jobs: Docking screens, parameter sweeps
- Interactive: Jupyter notebooks via srun --pty
No other single scheduler handles all of these as well.
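As one illustration of the array-job case, a docking screen can fan out one ligand per array task. This is a sketch under assumptions: `ligands.txt` (one ligand file name per line), the helper `ligand_for_task`, and the array size are hypothetical, and the docking command is left as a placeholder because no specific docking tool is named here.

```shell
#!/bin/bash
#SBATCH --job-name=dock-screen
#SBATCH --partition=cpu-workers
#SBATCH --array=1-1000%50        # 1000 tasks, at most 50 running at once
#SBATCH --cpus-per-task=4
#SBATCH --time=01:00:00

# Print line N (1-based) of a ligand list file.
ligand_for_task() {
  sed -n "${2}p" "$1"
}

# Slurm sets SLURM_ARRAY_TASK_ID for each array task; default to 1 for a dry run.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

if [ -f ligands.txt ]; then
  ligand=$(ligand_for_task ligands.txt "$task_id")
  # Placeholder: replace this echo with your docking command.
  echo "task ${task_id}: would dock ${ligand}"
fi
```

Submitted once with `sbatch`, this becomes up to 1000 independent tasks that show up in `squeue` under a single job ID.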
What You Give Up
Transparency matters. Here's what a unified Clusterra cluster doesn't give you (yet):
- Data Studios (Seqera): Interactive Jupyter/RStudio environments connected to pipeline outputs. We're working on this.
- Enterprise compliance: SOC2, HIPAA, GxP certifications. Not yet available — if you're in a regulated environment, this may be a blocker.
- AWS support SLA: Clusterra is a startup. AWS HealthOmics and AWS Parallel Computing Service (PCS) come with AWS support tiers.
- Pre-built pipeline gallery: Seqera's nf-core integration is more polished. You bring your own pipelines to Clusterra.
Who This Is For
If you're a computational biology team that:
- Runs 2+ of: Nextflow, GROMACS/AMBER, AlphaFold/ML
- Spends $3k–$30k/month on compute
- Has 1–3 people managing infrastructure
- Wants to spend time on science, not ops
Then consolidating to a single managed Slurm cluster saves money and engineering time.
Try it: Sign up at clusterra.cloud. Run a Nextflow pipeline, then submit a GROMACS job — same cluster, same console, same bill.
Built by the former Product Manager for AWS Batch and AWS Parallel Computing Service.
Managing multiple compute systems for drug discovery? We'd love to hear about your setup — hello@clusterra.cloud.