2026-05-05

30× WGS in ~2.5 hours for ~$2.60: Sarek + Parabricks on Clusterra

End-to-end FASTQ → VCF in ~2h 32m for ~$2.59 of AWS spot, on the unmodified nf-core/sarek pipeline a clinical lab already validates against. Accuracy: SNP F1 = 0.9966, INDEL F1 = 0.9934 against GIAB v4.2.1 — indistinguishable from CPU DeepVariant.

These numbers are backed by n=3 clean Stage A reproductions (1h 46m – 1h 48m, 1.8% spread), n=4 clean Stage B reproductions (44m 28s – 44m 46s, 0.7% spread), and a hap.py validation against the NIST v4.2.1 truth-set. We also captured one Stage A spot reclaim (1 of 4 attempts) and one Stage B Parabricks SIGSEGV (1 of 5 attempts); both are reported below as reliability datapoints observed during these runs.


TL;DR

We ran the nf-core/sarek variant-calling pipeline on a 30× whole-genome FASTQ (NA12878, the GIAB reference sample) on Clusterra's managed Slurm. With NVIDIA Parabricks for alignment on a single A10G and a chained Parabricks DeepVariant for variant calling, we get from FASTQ to a clinical-grade VCF in ~2 hours 32 minutes for about $2.59 per sample (median across 3 reproductions) — and the resulting VCF lands at F1 = 0.9966 / 0.9934 (SNP / INDEL) against the NIST v4.2.1 truth-set.

For comparison, the published nf-core/sarek CPU bench on the same input takes roughly 13 hours (nf-co.re/sarek). AWS HealthOmics' Ready2Run germline workflow charges a flat $10.00/run on a black-box managed runtime (HealthOmics pricing). NVIDIA's marquee Parabricks claim is 25 minutes for <$15 on a 4×H100 p5 instance with the standalone tool (AWS+NVIDIA solution brief) — but that's not in nf-core/sarek and isn't ready-to-run for a clinical lab.

What we wanted to ship: the same nf-core/sarek you already trust, the cost transparency you already need, with GPU-class wall-clock on commodity (A10G) instances — and accuracy numbers that prove it.

Setup

Input. GIAB NA12878, 30× WGS, paired-end FASTQ (~92 GB compressed). Streamed via Mountpoint-S3 directly off s3://ngi-igenomes/test-data/sarek/, no pre-staging, no copy. iGenomes references (GATK.GRCh38) resolved through the same FUSE mount — zero reference-disk provisioning.

Pipeline.

  • nf-core/sarek 3.6.1 with -profile test_full_germline,gpu
  • aligner: parabricks — alignment via NVIDIA Parabricks pbrun fq2bam on GPU
  • tools: "" — sarek stops at the recalibrated CRAM
  • skip_tools: fastqc,samtools,mosdepth — FASTP (run by sarek by default) covers the same QC signal these tools would re-derive; Parabricks' own QC report covers fq2bam-side stats
  • Variant calling chained as a separate Slurm job: pbrun deepvariant on the same A10G class, takes the recalibrated CRAM as input
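The Stage A settings above can be captured as a Nextflow params file plus a launch command. This is a minimal sketch, not the launcher's actual implementation: the three parameter values, the pipeline version, and the profile names come from the list above, while the `params.json` filename and the `results` output directory are hypothetical placeholders.

```python
import json
import shlex

# Parameter values taken from the pipeline list above.
SAREK_PARAMS = {
    "aligner": "parabricks",                   # GPU alignment via pbrun fq2bam
    "tools": "",                               # stop at the recalibrated CRAM
    "skip_tools": "fastqc,samtools,mosdepth",  # duplicate QC skipped
}


def sarek_command(params_file: str, outdir: str) -> list[str]:
    """Build the `nextflow run` argv for the Stage A launch."""
    return [
        "nextflow", "run", "nf-core/sarek",
        "-r", "3.6.1",
        "-profile", "test_full_germline,gpu",
        "-params-file", params_file,  # hypothetical filename
        "--outdir", outdir,           # hypothetical output path
    ]


if __name__ == "__main__":
    # Print the params JSON and the command instead of running anything.
    print(json.dumps(SAREK_PARAMS, indent=2))
    print(shlex.join(sarek_command("params.json", "results")))
```

Keeping the parameters in one file makes the three tiers described later (untuned, partial-QC, headline) a one-line change to `skip_tools`.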

Compute.

  • Sarek head (orchestrator): t3a.2xlarge on-demand. ~$0.30/hr. Pinned on-demand because the head is the single point of failure for a multi-hour pipeline.
  • Parabricks fq2bam: g5.8xlarge spot (1× A10G 24 GiB). ~$0.91/hr. NVMe scratch via --tmp=80G routing.
  • GPU DeepVariant (chained): g5.8xlarge spot (1× A10G), 12 vCPU / 64 GB request, NVMe scratch.
  • CPU sub-tasks (interval prep, multiqc, gather): cheap CPU spot.

All workdirs on AWS EFS; per-task scratch on instance-store NVMe (RAID0 xfs). Container runtime: Apptainer.

Results

Wall-clock (n=3 clean reproductions for Stage A, n=4 for Stage B)

| Stage | What | Wall-clock (median) | Range | Spread | Notes |
|---|---|---|---|---|---|
| Stage A | sarek 3.6.1 + Parabricks fq2bam → recalibrated CRAM | 1h 47m 05s | 1h 46m 05s – 1h 47m 59s | 1.8% | Of which Parabricks fq2bam itself ≈ 1h 16m on the A10G; remainder is interval prep + Nextflow publishDir on EFS. |
| Stage B | Parabricks pbrun deepvariant → VCF + index | 44m 35s | 44m 28s – 44m 46s | 0.7% | DV is essentially deterministic: output VCFs are byte-identical across runs. |
| End-to-end | FASTQ → VCF | ~2h 31m 40s | n/a | n/a | Sequential chain via Slurm --dependency=afterok. |
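The medians and spreads in the table can be recomputed from the per-job wall-clocks listed in the reproducibility notes. The sketch below assumes "spread" means (max − min) / median; the post does not define it, so treat that as an assumption that happens to reproduce the quoted figures.

```python
from statistics import median


def to_seconds(hms: str) -> int:
    """Convert 'HH:MM:SS' to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def fmt(seconds: float) -> str:
    """Format seconds as 'Hh MMm SSs', truncating to whole seconds."""
    s = int(seconds)
    return f"{s // 3600}h {s % 3600 // 60:02d}m {s % 60:02d}s"


def spread_pct(times: list[int]) -> float:
    """Assumed definition of spread: (max - min) / median, in percent."""
    return 100 * (max(times) - min(times)) / median(times)


# Per-job wall-clocks from the reproducibility notes.
STAGE_A = [to_seconds(t) for t in ("01:46:05", "01:47:59", "01:47:05")]
STAGE_B = [to_seconds(t) for t in ("00:44:33", "00:44:38", "00:44:46", "00:44:28")]

if __name__ == "__main__":
    for name, times in (("Stage A", STAGE_A), ("Stage B", STAGE_B)):
        print(name, fmt(median(times)), f"{spread_pct(times):.1f}%")
    print("End-to-end", fmt(median(STAGE_A) + median(STAGE_B)))
```

The end-to-end figure is simply the sum of the two stage medians, which is why the table reports no range or spread for it.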

The Parabricks fq2bam step itself — the actual GPU alignment — took 1h 15m 36s realtime on a single A10G, peaking at 85.9 GB RSS and 1181% CPU utilisation alongside the GPU. The remainder of Stage A is Nextflow plumbing (interval BED prep, channel gather, MultiQC) plus EFS publishDir overhead.

For reference: the published nf-core/sarek CPU bench on the same input is ~13 hours.

Accuracy (hap.py vs GIAB NIST v4.2.1)

We validated the DeepVariant VCF against the GIAB NIST v4.2.1 high-confidence truth-set for NA12878/HG001 GRCh38 using Illumina's hap.py with the GA4GH-recommended RTG vcfeval engine. Comparison restricted to the high-confidence regions BED (~88% of the genome).

| Type | Recall | Precision | F1 | TP | FN | FP |
|---|---|---|---|---|---|---|
| SNP | 0.99489 | 0.99834 | 0.99661 | 3,237,741 | 16,645 | 5,394 |
| INDEL | 0.99195 | 0.99480 | 0.99337 | 463,938 | 3,764 | 2,520 |

These are clinical-grade numbers. The GPU-accelerated Parabricks DeepVariant uses the same model weights as Google's CPU DeepVariant, and these F1 scores are within the published range for DV on 30× NA12878 GIAB v4.2.1 (Google DeepVariant blog, NVIDIA Parabricks docs). The accuracy story: GPU DeepVariant on Clusterra produces variants indistinguishable from the CPU baseline, with the DeepVariant step running roughly 10× faster.
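For readers who want to check the table, recall follows directly from the TP and FN counts, and F1 is the harmonic mean of precision and recall. The sketch below takes precision as reported rather than recomputing it from the TP and FP columns, on the assumption that hap.py derives precision from query-side counts that the table does not show.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of truth variants that were recovered."""
    return tp / (tp + fn)


def f1(precision: float, rec: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * rec / (precision + rec)


# TP / FN counts and reported precision from the accuracy table.
ROWS = {
    "SNP": {"tp": 3_237_741, "fn": 16_645, "precision": 0.99834},
    "INDEL": {"tp": 463_938, "fn": 3_764, "precision": 0.99480},
}

if __name__ == "__main__":
    for name, row in ROWS.items():
        r = recall(row["tp"], row["fn"])
        print(name, f"recall={r:.5f}", f"F1={f1(row['precision'], r):.5f}")
```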

Output

VCF: NA12878.deepvariant.vcf.gz (94.8 MB compressed; identical bytes across reproduction runs)
Total variant records: 6,613,144
PASS: 4,808,227
SNVs: 5,253,889
Indels / MNVs: 1,359,255

Cost

| Stage | Instance | Time | $/hr | Cost |
|---|---|---|---|---|
| Sarek head | t3a.2xlarge OD | 1.78 h | $0.30 | $0.53 |
| Parabricks fq2bam | g5.8xlarge spot | 1.40 h | $0.91 | $1.27 |
| Interval + multiqc CPU | mixed CPU spot | aggregate | ~$0.10 | $0.10 |
| Stage A subtotal | | | | $1.90 |
| GPU DeepVariant | g5.8xlarge spot | 0.74 h | $0.91 | $0.68 |
| Per-sample total | | | | ~$2.59 |

Cost discipline: Clusterra runs in the customer's own AWS account, billed at AWS spot rates. There is no per-sample managed-service surcharge, no flat per-run fee, no credit-pack lock-in. Strip the head node (move it onto a shared orchestrator in a panel run) and the per-sample marginal cost drops below $2.05.
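The per-sample figure is hours times hourly rate, summed over the billed stages. The sketch below recomputes it from the rounded hours and rates in the cost table, so it lands within about a cent of the ~$2.59 headline rather than exactly on it. Treating the CPU sub-tasks as a flat $0.10 aggregate follows the table; the `include_head` switch is a hypothetical illustration of the shared-orchestrator case.

```python
# (stage, hours, $/hr) for the time-billed rows of the cost table.
ROWS = [
    ("Sarek head (t3a.2xlarge on-demand)", 1.78, 0.30),
    ("Parabricks fq2bam (g5.8xlarge spot)", 1.40, 0.91),
    ("GPU DeepVariant (g5.8xlarge spot)", 0.74, 0.91),
]
CPU_AGGREGATE = 0.10  # interval prep + multiqc, flat aggregate from the table


def per_sample_total(include_head: bool = True) -> float:
    """Sum hours * rate over the stages, plus the flat CPU aggregate."""
    total = CPU_AGGREGATE
    for stage, hours, rate in ROWS:
        if not include_head and stage.startswith("Sarek head"):
            continue  # head node moved onto a shared orchestrator
        total += hours * rate
    return total


if __name__ == "__main__":
    print(f"with head:    ${per_sample_total():.2f}")
    print(f"without head: ${per_sample_total(include_head=False):.2f}")
```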

How we got here — three tiers of bench discipline

The headline number isn't the first number we got. Three runs, three configs, one story about where the time goes:

| Configuration | Stage A wall | What changed | Use case |
|---|---|---|---|
| Untuned | ~5h 35m | FASTQC + SAMTOOLS_STATS + MOSDEPTH all running, NVMe scratch not yet routed for fq2bam | "What you get out of the box if you just flip aligner: parabricks." |
| Partial-QC | 3h 11m | NVMe scratch routed; FASTQC skipped (saves ~2h single-threaded); samtools_stats + mosdepth still running | "If you want full coverage QC retained for sign-off: apples-to-apples comparable to the published Sarek CPU bench's QC profile." |
| Headline (n=3) | ~1h 47m | All three duplicate-QC tools skipped; FASTP already covers the read-level QC signal | "Production." |

The progression matters because it's the knob a customer can turn. A clinical lab that wants the full QC report can stop at "partial QC" — that's already 4× faster than the published CPU baseline. A production pipeline that's already QC'd upstream goes to "headline" — 7× faster, and indistinguishable in accuracy.
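The speedup multiples quoted above are plain ratios against the ~13 h published CPU figure. The sketch below assumes, as the paragraph appears to, that the Stage A wall-clock of each tier is the quantity being compared to that baseline.

```python
CPU_BASELINE_MIN = 13 * 60  # ~13 h published nf-core/sarek CPU bench

# Stage A wall-clock per tier, in minutes, from the table above.
TIERS = {
    "untuned": 5 * 60 + 35,
    "partial-qc": 3 * 60 + 11,
    "headline": 1 * 60 + 47,
}


def speedup(stage_a_minutes: int) -> float:
    """Ratio of the CPU baseline to a tier's Stage A wall-clock."""
    return CPU_BASELINE_MIN / stage_a_minutes


if __name__ == "__main__":
    for tier, minutes in TIERS.items():
        print(f"{tier}: {speedup(minutes):.1f}x")
```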

What we did differently from the published Sarek bench

  1. GPU alignment via Parabricks fq2bam. Sarek 3.6.1 supports this via aligner: parabricks — we just had to wire the Slurm cap-pod template to pin A10G (--gres=gpu:a10g:1) and route NVMe scratch with scratch=true + --tmp=80G.
  2. Two-stage chain for GPU DeepVariant. Sarek 3.6.1 hardcodes the CPU DeepVariant module; the GPU variant is roughly 10× faster end-to-end. Rather than fork sarek (brittle across upgrades), we ship a separate parabricks-deepvariant job template that takes sarek's recalibrated CRAM as input and chains via Slurm's native --dependency=afterok primitive. One launch, two jobs, one VCF.
  3. Skip the duplicate QC. FASTQC adds ~2h single-threaded; SAMTOOLS_STATS adds ~33m; MOSDEPTH adds ~12m. FASTP (which sarek runs by default) already covers the FASTQC signal; Parabricks fq2bam emits its own coverage and dup-rate metrics. Running three tools we didn't need only lengthens the pipeline's wall-clock for no information gain.
  4. NVMe scratch where it matters. Parabricks fq2bam moved 294 GB read / 220 GB write through scratch on this run — pushing that through EFS would dominate runtime. Read-heavy QC modules (when enabled) stay on EFS because staging cost dominates the I/O savings for them.
  5. References via Mountpoint-S3. All iGenomes references resolve from s3://ngi-igenomes through a FUSE mount. No reference disk to provision, no warm-up cache to manage, no cross-region copy.
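The two-stage chain in item 2 can be sketched as two `sbatch` submissions, the second gated on the first with `--dependency=afterok`. This is an illustration under stated assumptions, not the Clusterra job template: the script names are hypothetical, and only the `--dependency=afterok` and `--gres=gpu:a10g:1` flags come from the text above. `--parsable` is used so `sbatch` prints just the job ID.

```python
import subprocess


def stage_a_cmd(script: str) -> list[str]:
    """Stage A: the sarek head job; Nextflow fans out the GPU/CPU tasks itself."""
    return ["sbatch", "--parsable", script]


def stage_b_cmd(script: str, after_job: str) -> list[str]:
    """Stage B: GPU DeepVariant, started only if Stage A exits successfully."""
    return [
        "sbatch", "--parsable",
        f"--dependency=afterok:{after_job}",
        "--gres=gpu:a10g:1",
        script,
    ]


def submit(cmd: list[str], run=subprocess.run) -> str:
    """Submit a job and return its ID (--parsable prints 'jobid[;cluster]')."""
    result = run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip().split(";")[0]


def launch_chain(a_script: str, b_script: str, run=subprocess.run) -> tuple[str, str]:
    """One launch, two jobs: submit Stage A, then Stage B dependent on it."""
    a_id = submit(stage_a_cmd(a_script), run)
    b_id = submit(stage_b_cmd(b_script, a_id), run)
    return a_id, b_id
```

A call such as `launch_chain("sarek_stage_a.sbatch", "pb_deepvariant.sbatch")` (both filenames hypothetical) would return the two job IDs that then appear in the Slurm queue.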

Honest caveats

  • n=3 for Stage A, n=4 for Stage B, not n=10. Variance is small enough (1.8% on Stage A wall-clock, 0.7% on Stage B, with byte-identical output VCFs across DV runs) that we're confident in the headline number. A larger panel will come in the multi-sample post.

  • Spot reclaim observed in 1 of 4 Stage A attempts. One attempt in our reproduction set hit a NODE_FAIL on a downstream sub-task during its publishDir phase, caused by a spot interruption. Nextflow did not auto-retry across that specific failure (the failed sub-job was past the resume checkpoint), so the run had to be re-submitted from scratch. That works out to roughly a 25% reclaim rate on this hardware mix in this AZ, though with n=4 it is a rough point estimate rather than a stable probability. Mitigations for production: pin the orchestrator head on-demand (we already do), and consider on-demand for the long-tail publishDir step if you can't tolerate an occasional full Stage A rerun (~1h 47m). Cost overhead of one reclaim: ~$2 of doubled spot time, no manual intervention beyond the resubmit click.

  • Parabricks DeepVariant SIGSEGV measured at ~20% per run. Of 5 Stage B attempts, one crashed with Received signal: 11 mid-chr13 at the 28-minute mark — a process-level segfault inside the deepsomatic binary, same SIF and same input as the four runs that succeeded. This is a known-flaky behaviour of pbrun 4.6.0-1; NVIDIA tracks similar reports. Mitigation: the parabricks-deepvariant template now includes requeue: true so this crash class is absorbed transparently by Slurm without user intervention. Cost overhead of one crash: ~$0.50 of wasted GPU time.

  • The Stage A wall-clock includes a non-trivial publishDir tax. Of the ~1h 47m, roughly 30 minutes is Nextflow's filePorter copying the 47 GB CRAM from work dir to outdir on EFS. With publish_dir_mode = 'symlink' instead of copy, this becomes near-instant — at the cost of pinning work dirs as the source of truth. Acceptable for benchmarks; configurable per use case.

  • A10G vs H100. NVIDIA's "25 min" headline uses 4×H100 (p5.48xlarge, ~$98/hr on-demand). We chose 1× A10G (g5.8xlarge spot) for cost-per-sample efficiency — the right choice for biotech production, the wrong choice if you want the absolute speed crown. The cost-per-sample math overwhelmingly favours A10G.

  • GPU DeepVariant runs as a chained job, not in-pipeline. The user sees two job IDs in the Slurm queue. This is intentional (sarek upgrades stay low-friction) but is a UX wart we'll smooth with a console rollup view.

  • NA12878 only. NA12878/HG001 is the easy GIAB sample — every variant caller is implicitly tuned to it. HG002 (the modern GIAB standard with a more rigorous truth-set) and the AJ trio Mendelian-concordance check are scoped for a follow-up "accuracy deep-dive" post.
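The two failure rates above can be folded into a rough expected retry overhead per successful sample. This sketch is a back-of-envelope model, not a measured result. It assumes attempts fail independently at the observed rates (1 of 4 and 1 of 5, both from very small samples), that each failure costs the overhead quoted in the caveats (~$2 and ~$0.50), and that retries continue until success, so the expected number of failures per success is p / (1 − p).

```python
def expected_failures(p_fail: float) -> float:
    """Expected failed attempts before one success (geometric model)."""
    return p_fail / (1 - p_fail)


def expected_overhead(p_fail: float, cost_per_failure: float) -> float:
    """Expected wasted spend per successful run, in dollars."""
    return expected_failures(p_fail) * cost_per_failure


# Observed rates and per-failure costs from the caveats above (small-n estimates).
STAGE_A = {"p_fail": 1 / 4, "cost": 2.00}  # spot reclaim, full rerun
STAGE_B = {"p_fail": 1 / 5, "cost": 0.50}  # Parabricks SIGSEGV, wasted GPU time

if __name__ == "__main__":
    a = expected_overhead(STAGE_A["p_fail"], STAGE_A["cost"])
    b = expected_overhead(STAGE_B["p_fail"], STAGE_B["cost"])
    print(f"Stage A overhead: ${a:.2f}")
    print(f"Stage B overhead: ${b:.2f}")
    print(f"Combined:         ${a + b:.2f}")
```

Under these assumptions the combined retry overhead stays under a dollar per sample, but the estimate moves a lot if the true failure rates differ from these small-sample observations.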

How this compares

| Source | Time | Cost per 30× sample | Accuracy (SNP F1) | Notes |
|---|---|---|---|---|
| nf-core/sarek published CPU bench | ~13 h | ~$20 (cited; on AWS spot) | ~0.996 (GATK HaplotypeCaller) | nf-co.re/sarek |
| Seqera (Parabricks + Fusion) | <2 h | $6.07 spot / $13.97 on-demand | not published in their post | Seqera blog; requires Seqera Platform + Fusion file system |
| NVIDIA / AWS Parabricks marquee | 25 min | <$15 (4×H100) | not published in marketing | AWS+NVIDIA brief; standalone Parabricks, not nf-core |
| AWS HealthOmics Ready2Run | (managed) | $10.00 flat | not published per-run | HealthOmics pricing; black-box managed runtime, GATK-BP not sarek |
| DNAnexus Titan | (managed) | sales-led, opaque | not published | DNAnexus |
| Clusterra (this writeup) | ~2h 32m | ~$2.59 | 0.9966 (SNP) / 0.9934 (INDEL) | Same nf-core/sarek you already trust, customer's own AWS account, accuracy validated against GIAB |

We're not the fastest number on the page; the 4×H100 standalone Parabricks bench is. We are the cheapest by a wide margin, and the only writeup in this category to publish hap.py F1 numbers on the produced VCF, while still running the exact nf-core/sarek pipeline a clinical lab is already validating against, in the customer's own AWS account, on commodity (g5.8xlarge) GPU instances.

What this means for biotech infrastructure

Three things have been true for biotech compute for years and are getting less true:

  1. You had to choose between speed and trust. GPU acceleration meant either rolling your own orchestration (Parabricks standalone, no pipeline ergonomics) or accepting a managed black box (HealthOmics, DNAnexus). With sarek + Parabricks on Clusterra, the same pipeline a clinical lab already validates is now GPU-accelerated, end-to-end, in your own account — and we can prove the accuracy match with hap.py F1, not just claim it.
  2. You had to choose between cost transparency and convenience. Flat per-run pricing hides what you're actually paying for. AWS spot pricing in your own account shows every dollar — and at $2.59 per 30× sample, it's cheaper than every comparator we could find. Clusterra adds a small per-cluster management fee, not a per-sample tax.
  3. You had to choose between fast and reproducible. Forking sarek to inline GPU DeepVariant gets you a single-job UX but a maintenance burden every time sarek bumps. The two-stage chain we ship preserves sarek upstream verbatim — when sarek 3.7 lands, you upgrade by changing one version pin.

What's next

  • HG002 + accuracy deep-dive. A follow-up post will validate against HG002 (Ashkenazi son, the modern GIAB standard), the HG002/HG003/HG004 Mendelian-concordance check, and stratified GIAB v3.5 region splits (low-complexity, segdup, MHC). The headline-grabbing accuracy story.
  • Multi-sample panel benchmark. Cohort throughput numbers (10, 50, 100 samples) where the per-sample amortization story actually pays off and the head-node overhead disappears. Likely the strongest standalone post once we have the data.
  • Closing the DV-on-GPU gap upstream. We're tracking the in-pipeline sarek module override path; once it lands in a tagged release, the chained job goes away.

Run it yourself

If you have a Clusterra cluster: in the launcher, pick Benchmark: GPU Parabricks under the Sarek presets and click Submit. When that finishes, submit Benchmark: GIAB NA12878 GPU DeepVariant (post-sarek) with Run after job ID = <stage-A-id>. To validate accuracy, submit hap.py vs GIAB truth-set with the resulting VCF as query_vcf. The numbers in this writeup are reproducible end-to-end.

If you don't have one yet, contact us.


Reproducibility notes.

All four presets — Benchmark: GPU Parabricks, Benchmark: GIAB NA12878 GPU DeepVariant (post-sarek), Parabricks DeepVariant (GPU), and hap.py vs GIAB truth-set — ship in every Clusterra cluster's launcher catalog under HCLS → Sarek and HCLS → Variant Calling. Pick a preset, click Submit, get this number.

The runs cited in the tables above:

  • Stage A (n=3 clean): Slurm jobs 1490 (01:46:05), 1491 (01:47:59), 1580 (01:47:05). One additional attempt ended in a spot reclaim and is included in the resilience math, not the median.
  • Stage B (n=4 clean): Slurm jobs 1305 (00:44:33), 1585 (00:44:38), 1588 (00:44:46, chained off 1580 via afterok), 1590 (00:44:28). One additional attempt ended in a Parabricks SIGSEGV and is included in the resilience math, not the median.
  • Accuracy validation: Slurm job 1573, hap.py 0.3.12 + RTG vcfeval, query VCF with 6,613,144 records / 4,808,227 PASS, truth set NIST v4.2.1 NA12878/HG001 GRCh38 from https://giab.s3.amazonaws.com/release/NA12878_HG001/NISTv4.2.1/GRCh38/.
  • Cost: AWS spot-list pricing for g5.8xlarge / t3a.2xlarge in us-east-1, May 2026.