MAPPs Neoantigen Discovery on Clusterra: GPU-first WES + LC-MS/MS in one chain
For MAPPs CROs running neoantigen assays: submit tumor WES FASTQs + immunopep MGF as one job, get a confidence-tiered epitope list back — genomic predictions cross-annotated with what the HLA machinery is actually presenting. GPU-first, open-source toolchain (no NetMHCpan license), runs in your own AWS account. Demonstrated on a public HLA-eluted Orbitrap reference sample (HLA Ligand Atlas, Marcu et al. 2021): Casanovo recovers 305 non-tryptic MHC-I peptides database search misses entirely — 24.9% predicted HLA binders, top hit IC50 = 20 nM. For matched-tumor neoantigen rediscovery against published ground truth, see the companion Bassani-Sternberg Mel15 rediscovery case study.
Executive summary
Immunopeptidomics CROs running MAPPs assays today operate two unrelated computational pipelines: a tumor-normal genomics path that predicts neoantigens, and a mass-spec path that identifies peptides eluted from MHC trays. Cross-annotating them — "did we actually see the predicted neoantigen in the LC-MS/MS run?" — is the hard part, and it's done by hand in a spreadsheet by whichever analyst is on shift.
A note on MAPPs use cases. MAPPs assays serve two distinct workflows with different computational requirements. The drug immunogenicity / ADA risk workflow (a pharma client's biologic loaded onto donor dendritic cells, MHC-II elution, no patient genomics) is a different computational pipeline — MHC-II focused, no WES, built around database search against the drug protein FASTA + human proteome, with dataMAPPs for the hotspot visualization. That workflow is in development separately. This blog covers the neoantigen discovery use case: tumor-normal WES identifies somatic mutations, epitope prediction selects candidates computationally, and de novo MS/MS sequencing confirms what the patient's HLA-I machinery is actually presenting — including novel peptides no database search can reach. The chain is instrument-agnostic: the writeup uses a Thermo Orbitrap MGF, but the same workflow runs on Bruker timsTOF SCP/Pro and Orbitrap Astral inputs identically.
A note on the dataset used here. The MGF analyzed below is PRIDE PXD019643 — a benign autopsy prostate sample (donor DN17) from the HLA Ligand Atlas (Marcu et al. 2021, Orbitrap Fusion Lumos). This validates the chain's technical capability on real HLA-I spectra (per-run yield, non-tryptic peptide recovery, MHCflurry binding rates); for tumor-neoantigen rediscovery against published ground truth, see the bs-mel15-rediscovery-case-study companion. The glyco-MAPPs use case is covered separately in glyco-mapps-case-study.
Clusterra ships mapps-complete — a single workflow that runs four pre-wired steps (Sarek with Parabricks GPU alignment, Parabricks DeepSomatic, nf-core/epitopeprediction, Casanovo de novo MS/MS sequencing) from one submission. Three of four steps are GPU. None require a NetMHCpan academic license. The workflow runs inside a dedicated cluster the CRO owns in its own AWS account — no compute shared with any other Clusterra client — and every job carries an automatic cost record (instance class, spot vs. on-demand, USD) so per-project billing is a single query, not a spreadsheet. The Casanovo step closes the "we don't miss epitopes" gap that every CRO selling MAPPs feels but can't currently quantify.
This writeup runs the chain on the live Clusterra dev cluster, captures the biology Casanovo produces on a real Orbitrap MGF (PXD019643, HLA Ligand Atlas benign reference sample), and benchmarks it against an open-source database-search baseline (Comet, 1% FDR) to put a number on the "DB-search misses" claim.
The pitch in numbers
On 6,576 real HLA-eluted benign prostate Orbitrap spectra (PRIDE PXD019643: HLA Ligand Atlas, Marcu et al. 2021, donor DN17, Orbitrap Fusion Lumos):
- Casanovo recovers 5,367 PSMs, 4,335 (80.8%) in the MHC-I 8–11mer window.
- 1,212 of those are non-tryptic (no K/R at C-terminus), the class database search misses by design.
- MHCflurry (HLA-A*02:01) on 305 unique high-confidence non-tryptic MHC-I peptides: 76 (24.9%) predicted binders, 20 (6.6%) strong binders (IC50 < 500 nM).
- Best hit: TMDSVVYAL, IC50 = 20 nM (top 0.1% HLA-A*02:01 percentile rank).
See the PXD019643 section for the full tables.
Calibration benchmark (tryptic MGF, 128 spectra), confirming the tools are correctly configured:
- Comet at 1% FDR identified 16 spectra. Casanovo at confidence ≥ 0.5 identified 62 (3.9×).
- Sage (independent DB engine, Lazear et al. 2023) independently confirms 39 of the 62 Casanovo PSMs, a two-witness set about 5× larger than the 8 Comet+Casanovo overlaps.
- MHCflurry on the 27 MHC-I-window Casanovo-only IDs returns 0 / 27 predicted binders, the correct answer on tryptic input (tryptic peptides are not under HLA selection pressure).

This calibrates the chain end-to-end: Casanovo isn't fabricating MHC-binder-shaped sequences, MHCflurry isn't over-calling, and the flip to 24.9% on real HLA-eluted data is the signal, not the noise.
The business problem
The MHC-Associated Peptide Proteomics (MAPPs) assay is the bench-side workflow behind both drug immunogenicity risk assessment and neoantigen discovery in immuno-oncology. The compute story under the neoantigen use case has not kept up:
- Two pipelines, one biological question. WES tumor-normal → predicted neoantigens is a genomics platform. LC-MS/MS DDA (Orbitrap or timsTOF) → peptide identifications is a proteomics platform. Most CROs run them in different teams with different infra, then cross-annotate in a spreadsheet. The bridge is whichever analyst is on shift.
- Database search misses the novel peptides. The MAPPs experiment produces mass spectra; going from those back to amino acid sequences is the interpretation step. Database search matches spectra against libraries of known peptide sequences and, by construction, cannot identify a peptide that isn't in any library. Novel peptides from somatic mutations, unannotated splice junctions, patient-specific HLA-allele variants, and proteasomal splicing products are precisely the peptides MAPPs cares about, and precisely the ones database search misses. Casanovo (Yilmaz et al., 2024) reconstructs amino-acid sequences directly from the fragmentation spectrum with no database, and on the public MassIVE-KB benchmark it recovers a large fraction of high-confidence database-search IDs plus additional sequences not present in any reference.
- License-clean open-source toolchain. This chain uses MHCflurry (Apache 2.0) + MHCnuggets (Apache 2) inside nf-core/epitopeprediction; no NetMHCpan or MSFragger license is required. MixMHCpred 3.0 (Gfeller lab, Genome Medicine 2025) is a drop-in upgrade that benchmarks better than NetMHCpan 4.1 on external peptidomics data via pan-allele interpolation; CROs running in their own AWS account via BYOC can bring their own UNIL license and run it inside Clusterra's chain.
- One client, one isolated cluster. Most managed bioinformatics platforms run all customers in one shared compute pool. Clusterra deploys a dedicated cluster per CRO in the CRO's own AWS account: its own compute, its own job database, no resources shared with other Clusterra clients. Every job carries an automatic cost record (instance class, spot vs. on-demand, USD per job), so per-project totals are queryable directly, with no manual reconciliation.
mapps-complete is the demonstration.
What Clusterra adds — specifically for CROs running neoantigen MAPPs assays
| Typical CRO workflow today | What Clusterra adds |
|---|---|
| Sarek for tumor-normal somatic calling | Same tool, wrapped as a one-command Clusterra template, GPU-accelerated via Parabricks fq2bam, runs in the CRO's own AWS account |
| nf-core/epitopeprediction with MHCflurry / MHCnuggets | Same tools, no NetMHCpan license dependency, auto-discovers VCF from the upstream variant-calling step |
| Database search of LC-MS/MS MGF (current de novo gap) | Casanovo as the de novo step recovers novel peptides not in any library; Sage serves as an independent second-witness DB engine. Together they form a 39-PSM two-witness set on the calibration MGF, about 5× the 8-PSM Comet ∩ Casanovo overlap |
| Per-project compute tracked in separate spreadsheets | Every job carries instance, spot/on-demand, and USD cost as metadata — per-project totals are a single query |
| Two unrelated pipelines (genomics + proteomics) bridged by hand | One workflow, four chained steps, one submission, one workflow row in the console grouping all four jobs under a chain badge |
Stage 2 of the pitch — once mapps-complete is running on a CRO's own data — is OpenFold3 peptide-HLA structural annotation on the top-ranked candidates, upgrading the client deliverable from a ranked list of sequences to a 3D visualization of each epitope sitting in the patient's HLA binding groove. That's not in this v1; it's the upsell.
The proof — Casanovo vs database search
Methods
We ran Comet (UWPR, v2026.01 rev. 1, open-source) against UniProt Swiss-Prot human (20,420 reviewed entries + canonical isoforms, ~42k sequences) concatenated with a reversed-decoy database (DECOY_ prefix, ~85k total entries). Search parameters: no-enzyme digest (HLA-eluted peptides are not tryptic — and this is precisely why database search misses immunopep IDs in production), 20 ppm precursor / 0.02 Da fragment tolerance, static Carbamidomethyl Cys (+57.02 — matches the wet-lab alkylation), variable Met oxidation (+15.99) and N-terminal acetylation (+42.01), peptide mass range 600-5000 Da (covers MHC-I 8mers through MHC-II 25mers). 1% PSM-FDR applied via target-decoy. Comet finished the 128-spectrum search in 42 seconds on a 24-vCPU CPU node. Casanovo PSMs accepted at the standard ≥0.5 confidence threshold.
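The 1% target-decoy cut applied to the Comet results can be illustrated with a minimal sketch. It assumes higher scores are better and uses the simple decoys / targets estimate at each score cutoff. The function name and the toy scores are hypothetical, and Comet's own post-processing is not reproduced here.

```python
from typing import List, Tuple


def accept_at_fdr(psms: List[Tuple[float, bool]], max_fdr: float = 0.01) -> int:
    """Return how many target PSMs pass a target-decoy FDR threshold.

    psms: (score, is_decoy) pairs; higher score = better match (assumption).
    FDR at a cutoff is estimated as decoys / targets at or above that cutoff.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = accepted = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Keep the largest target count whose estimated FDR is within budget.
        if targets and decoys / targets <= max_fdr:
            accepted = targets
    return accepted


# Toy data: five targets outscore the first decoy.
toy = [(10.0, False), (9.0, False), (8.0, False), (7.0, False),
       (6.0, False), (5.0, True), (4.0, False)]
print(accept_at_fdr(toy))       # 5
print(accept_at_fdr(toy, 0.2))  # 6
```

On a 128-spectrum file a single high-scoring decoy moves the cutoff a long way, which is one reason a 1% threshold on a small benchmark is conservative.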
Headline: who identifies what
| Tool | PSMs accepted | Threshold |
|---|---|---|
| Comet (DB search) | 16 / 128 | 1% target-decoy PSM-FDR |
| Sage (DB search) | 80 / 128 | label = target (top-scoring PSM per spectrum) |
| Casanovo (de novo) | 62 / 128 | confidence ≥ 0.5 |
| Comet ∩ Casanovo | 8 PSMs | 7 of 8 sequences exact; 1 I/L isobar |
| Sage ∩ Casanovo (two-witness) | 39 PSMs | same peptide called by both independent tools |
| Casanovo-only (vs Comet) | 54 / 128 | invisible to Comet; 31 of these confirmed by Sage |
| Comet-only IDs | 8 / 128 | Casanovo blind-spot sanity check |
On this MGF, Casanovo identifies 3.9× more spectra than Comet (62 vs 16) and contributes 54 IDs the database search misses entirely. The Casanovo-only set is not noise: mean confidence 0.84, median 0.89, and 40 of the 54 (74%) score ≥ 0.8 — i.e. high-confidence by Casanovo's own calibration.
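The overlap and "only" rows in the table come from a per-spectrum join, and the "1 I/L isobar" note implies isoleucine and leucine are treated as the same residue when comparing calls. A minimal sketch of that comparison follows. The scan IDs and peptides are made up, and counting a spectrum where both tools answer but disagree in both "only" sets is an assumption, not the documented behaviour of the join script.

```python
def normalize(seq: str) -> str:
    """Collapse isoleucine to leucine: the two residues have identical mass.

    Assumes plain one-letter sequences without modification tags.
    """
    return seq.replace("I", "L")


def compare_calls(db: dict, denovo: dict) -> dict:
    """Split spectra into agreement / de-novo-only / DB-only sets.

    db, denovo: {scan_id: peptide} for PSMs that passed each tool's threshold.
    """
    shared = set(db) & set(denovo)
    agree = {s for s in shared if normalize(db[s]) == normalize(denovo[s])}
    return {
        "both_agree": agree,
        "denovo_only": set(denovo) - agree,
        "db_only": set(db) - agree,
    }


# Hypothetical calls: scan 2 differs only by I vs L, so it counts as agreement.
db_calls = {1: "PEPTIDEK", 2: "AAILK", 3: "GGGK"}
denovo_calls = {1: "PEPTIDEK", 2: "AALLK", 4: "NEWPEPK"}
result = compare_calls(db_calls, denovo_calls)
print(sorted(result["both_agree"]))   # [1, 2]
print(sorted(result["denovo_only"]))  # [4]
print(sorted(result["db_only"]))      # [3]
```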
Sage as an independent second witness
We also ran Sage (Lazear et al., 2023 — a Rust-based open-source database search engine, v0.14.7, static binary cached on EFS) against the same 128-spectrum MGF and the same FASTA, with equivalent tolerances. Sage and Comet are architecturally independent: Sage uses a different scoring model (hyperscore + Sage discriminant score), a different indexing strategy, and independent FDR calibration. Peptide identifications they both produce without consulting each other are the "two-witness" set — the highest-confidence tier a small benchmark can produce.
Sage ∩ Casanovo: 39 peptides. That is 5× the Comet ∩ Casanovo count (8), and 31 of those 39 are sequences Comet had missed entirely — confirmed by Sage alone, not by Comet. The three-tier confidence model:
| Tier | Peptides | Evidence |
|---|---|---|
| Two-witness (Casanovo + Sage) | 39 | Two independent tools agree; strongest signal |
| Casanovo-only (vs both DB searches) | 23 | Novel / non-canonical; genuine de-novo contribution |
| DB-only (Sage or Comet, not Casanovo) | ≤ 8 | Database-anchored; Casanovo blind-spot |
The 23 remaining Casanovo-only peptides (62 − 39) are precisely the de-novo-exclusive set: sequences with no good match in the Swiss-Prot reference regardless of which database search tool is used. On an HLA-eluted MAPPs sample this is where the somatic neoantigen candidates and unannotated splice products would appear. On this tryptic benchmark they are a smaller but real class — sequences that neither Comet nor Sage could anchor to the reference but that Casanovo's spectrum-to-sequence model reconstructed with ≥ 0.5 confidence.
Where the Casanovo-only IDs land (length window)
| Length window | Casanovo-only IDs |
|---|---|
| MHC-I (8-11mer) | 27 |
| MHC-II (13-25mer) | 2 |
| 12mer (between windows) | 1 |
| 7mer (immunopep-adjacent) | 22 |
| 6mer (sub-window) | 2 |
27 of the 54 Casanovo-only IDs (50%) sit in the MHC-I 8-11mer window — exactly where an immunopeptidomics workflow needs them. Another 22 sit at 7aa, one residue below MHC-I, where de novo callers often land genuine MHC-I peptides that fell a residue short on either flank.
Sample of Casanovo-only sequences (top 11 by confidence)
| Casanovo seq | Score | Len | Window | Charge |
|---|---|---|---|---|
| TKYTSSK | 0.928 | 7 | 7mer | 2+ |
| GHVQPLR | 0.928 | 7 | 7mer | 2+ |
| KQHSLLK | 0.928 | 7 | 7mer | 2+ |
| KSPEPVR | 0.926 | 7 | 7mer | 2+ |
| VTLTNHK | 0.924 | 7 | 7mer | 2+ |
| NKPGVYTK | 0.921 | 8 | MHC-I | 2+ |
| NVHELEK | 0.919 | 7 | 7mer | 2+ |
| TNNLRPK | 0.919 | 7 | 7mer | 2+ |
| KEAPAPPK | 0.917 | 8 | MHC-I | 2+ |
| KVTAAMGK | 0.916 | 8 | MHC-I | 2+ |
| C[Carbamidomethyl]LKPNETK | 0.911 | 8 | MHC-I | 2+ |
These are plausible peptide sequences (proper amino-acid alphabet, charge consistent with precursor mass, Carbamidomethyl Cys correctly reported on the cysteine-containing PSM). The full Casanovo-only table — all 54 rows with scan number, sequence, score, length window, charge, and ground-truth label — is on EFS at /mnt/efs/nikhil@clusterra.io/comet-dbsearch/delta_table.csv; the summary JSON is alongside at delta_summary.json.
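The length-window labels used in the tables above can be reproduced with a short helper. This is a sketch under two assumptions: modifications appear as bracketed tags after the residue (as in the C[Carbamidomethyl] row), and an N-terminal tag may be followed by a hyphen. The function names are illustrative, not taken from the join script.

```python
import re
from collections import Counter


def residues(seq: str) -> str:
    """Strip bracketed modification tags, e.g. C[Carbamidomethyl]LK -> CLK."""
    return re.sub(r"\[[^\]]*\]-?", "", seq)


def length_window(seq: str) -> str:
    """Label a peptide by the length windows used in this writeup."""
    n = len(residues(seq))
    if 8 <= n <= 11:
        return "MHC-I"
    if 13 <= n <= 25:
        return "MHC-II"
    return f"{n}mer"


# Three rows from the sample table.
sample = ["TKYTSSK", "NKPGVYTK", "C[Carbamidomethyl]LKPNETK"]
windows = Counter(length_window(s) for s in sample)
print(windows["MHC-I"], windows["7mer"])  # 2 1
```

Stripping the tag before measuring length matters: counted naively, the carbamidomethylated 8-mer would look like a 25-character string.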
Binding cross-check: are these MHC-I window peptides actually predicted to bind?
The natural follow-up question on the 27 Casanovo-only IDs in the 8–11mer window is: "are they predicted to bind a real HLA-I allele, or do they just happen to be the right length?" We ran MHCflurry 2.2.0 (open-source, Apache 2.0-licensed, default models_class1_presentation) on all 27 sequences against the default panel HLA-A*02:01 / B*07:02 / C*07:02. Each peptide was assigned its best allele (lowest affinity percentile) and classified using the standard immunopep thresholds:
| Class | Threshold | Count (of 27) |
|---|---|---|
| Strong binder | affinity_percentile ≤ 0.5% | 0 |
| Weak binder | 0.5% < affinity_percentile ≤ 2% | 0 |
| Non-binder | affinity_percentile > 2% | 27 |
0 of 27 sequences are predicted binders at the conventional 2% threshold. The 5 closest-to-binding calls (sorted by best-allele rank):
| Sequence | Best allele | Percentile rank | IC50 (nM) |
|---|---|---|---|
| RPHETGGY | HLA-B*07:02 | 2.37% | 4,568 |
| KPANVVTK | HLA-B*07:02 | 3.83% | 10,294 |
| VKEDPDGEHAR | HLA-C*07:02 | 4.24% | 1,604 |
| SQKPVVVK | HLA-C*07:02 | 5.71% | 2,974 |
| AYEKPPEK | HLA-C*07:02 | 7.06% | 4,561 |
This is the expected result. The upstream Casanovo sample MGF is a tryptic dataset — peptides cleaved by trypsin after Lys/Arg. Tryptic peptides are not under selection for MHC-I binding, so the correct prior is "essentially none of these should bind." The 0/27 → 76/305 flip moving to HLA-eluted data (next section) is the directional signal that validates the chain end-to-end.
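The percentile thresholds from the classification table translate directly into code. The sketch below applies them to the five closest-to-binding calls listed above; the function name is illustrative.

```python
def binder_class(percentile_rank: float) -> str:
    """Classify a peptide by its best-allele affinity percentile (in %)."""
    if percentile_rank <= 0.5:
        return "strong"
    if percentile_rank <= 2.0:
        return "weak"
    return "non-binder"


# Best-allele percentile ranks from the closest-to-binding table.
closest = {
    "RPHETGGY": 2.37,
    "KPANVVTK": 3.83,
    "VKEDPDGEHAR": 4.24,
    "SQKPVVVK": 5.71,
    "AYEKPPEK": 7.06,
}
calls = {seq: binder_class(rank) for seq, rank in closest.items()}
print(set(calls.values()))  # {'non-binder'}
```

Even the best call, at 2.37%, sits outside the 2% cut, so the 0 / 27 result does not hinge on a borderline peptide.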
Reproducibility: mhcflurry-predict --alleles "HLA-A*02:01" "HLA-B*07:02" "HLA-C*07:02" against the peptide list at /mnt/efs/n52h53@gmail.com/mhcflurry-xcheck/peptides.csv; full output at /mnt/efs/n52h53@gmail.com/mhcflurry-xcheck/results/binding_summary.json.
PRIDE PXD019643: the HLA-eluted binding flip
We ran the identical Casanovo pipeline on PRIDE PXD019643 (Marcu et al. 2021 — HLA Ligand Atlas, benign autopsy prostate from donor DN17, Orbitrap Fusion Lumos, 6,576 spectra; Casanovo job 2341 on a single A10G GPU node, 2026-05-15). Input spectra are by construction peptides already selected by HLA-I machinery, so this is the dataset that should produce a materially different MHCflurry result if the chain is working correctly. Note: this is a benign reference sample, not tumor — so the predicted binders surfaced below are normal HLA-presented self-peptides, not tumor neoantigens. For a true tumor-neoantigen rediscovery against published ground truth, see the companion bs-mel15-rediscovery-case-study.
| Metric | Tryptic MGF (128 spectra) | PXD019643 HLA-eluted (6,576 spectra) |
|---|---|---|
| Total Casanovo PSMs | 62 | 5,367 |
| MHC-I window (8–11 AA) | 56 (44% of all 128 spectra, before the ≥ 0.5 cut) | 4,335 (80.8% of PSMs) |
| Non-tryptic (no K/R C-term) | — | 1,212 |
| Unique high-conf non-tryptic peptides | — | 305 |
| MHCflurry strong binders (IC50 < 500 nM) | 0 / 27 (0%) | 20 / 305 (6.6%) |
| MHCflurry any binder (IC50 < 5,000 nM) | 0 / 27 (0%) | 76 / 305 (24.9%) |
The binding fraction flips from 0% to 24.9%. That's the chain working end-to-end.
The non-tryptic fraction is the database-search blind-spot story made concrete: 1,212 of the 4,335 MHC-I-window PSMs have no Lys or Arg at the C-terminus — the exact peptides Comet (no-enzyme mode, 42k-protein human DB) would have to find in a combinatorially larger candidate space with no length or terminus prior. Casanovo produces them directly, ranked by confidence score.
The MHC-I concentration shift is also diagnostic: 80.8% of PSMs fall in the 8–11mer window on HLA-eluted data vs 44% on tryptic. The HLA selection machinery pre-filters the spectrum population for Casanovo. Database search has no way to exploit that prior; it spends its FDR budget across the full length distribution.
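The non-tryptic filter and the headline fractions are simple enough to check by hand. The sketch below shows both; it assumes plain sequences with no C-terminal modification tag, and the helper names are illustrative.

```python
def is_non_tryptic(seq: str) -> bool:
    """True when the C-terminal residue is neither Lys (K) nor Arg (R)."""
    return seq[-1] not in ("K", "R")


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(is_non_tryptic("TMDSVVYAL"))  # True  (HLA-eluted top hit)
print(is_non_tryptic("NKPGVYTK"))   # False (calibration peptide, ends in K)

print(pct(4335, 5367))  # 80.8  MHC-I window share of PSMs
print(pct(76, 305))     # 24.9  predicted binders
print(pct(20, 305))     # 6.6   strong binders
```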
Top binders from the 305 high-confidence non-tryptic MHC-I peptides (MHCflurry 2.2.0, HLA-A*02:01, models_class1_presentation):
| Sequence | Length | IC50 (nM) | %Rank (A*02:01) | Casanovo score |
|---|---|---|---|---|
| TMDSVVYAL | 9 | 20 | 0.10% | 0.88 |
| MTDSVVYAL | 9 | 54 | 0.38% | 0.87 |
| YSMGFHDLL | 9 | 98 | 0.60% | 0.91 |
| FGDDVVYAL | 9 | 111 | 0.65% | 0.84 |
| AFDDVVYAV | 9 | 114 | 0.65% | 0.83 |
| KEFEFVPLL | 9 | 117 | 0.67% | 0.87 |
| KVSDHEDFLL | 10 | 128 | 0.70% | 0.88 |
| SYLEHLFEL | 9 | 131 | 0.71% | 0.91 |
| LLQPKVKLL | 9 | 155 | 0.78% | 0.86 |
| TALSLFYEL | 9 | 171 | 0.82% | 0.89 |
Nine 9-mers and one 10-mer, all non-tryptic, all IC50 < 175 nM. TMDSVVYAL at 20 nM sits in the top 0.1% of predicted HLA-A*02:01 affinity, the kind of hit a CRO client would want in a deliverable.
Source: /mnt/efs/n52h53@gmail.com/casanovo-18dab6c1-5fcf-41f3-845e-3d04c3cae3d3/casanovo.mztab (job 2341). MHCflurry 2.2.0, models_class1_presentation, HLA-A*02:01.
The proof — biology under the delta
Casanovo standalone, job 2270, single A10G GPU node, 1:28 wall-clock. Input: the Casanovo repo's sample MGF (128 tryptic MS/MS spectra, Q Exactive Orbitrap, from MassIVE benchmark deposits used by the Casanovo paper). All numbers below were re-parsed from the output mzTab for this writeup — not estimated, not paraphrased.
Score distribution
| Casanovo confidence (search_engine_score[1]) | PSMs | % of total |
|---|---|---|
| ≥ 0.9 (very high) | 25 | 19.5% |
| 0.7 – 0.9 | 29 | 22.7% |
| 0.5 – 0.7 | 8 | 6.2% |
| 0.3 – 0.5 | 12 | 9.4% |
| 0 – 0.3 | 30 | 23.4% |
| Negative (model uncertain) | 24 | 18.8% |
| Total | 128 | 100% |
Median score 0.48, max 0.928. 48.4% of spectra scored ≥ 0.5 — the threshold the Casanovo paper uses as a reasonable de-novo precision floor on benchmark data — and 19.5% scored ≥ 0.9, the "publish without further validation" tier.
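Re-parsing the score distribution from an mzTab file needs very little code. The sketch below relies only on the mzTab convention that the PSM header row starts with PSH and each PSM row with PSM, tab-separated. The bin edges follow the table above; which bin an exact boundary value falls into is an assumption, and the inline rows are toy data rather than values from job 2270.

```python
from typing import Iterable, List


def psm_scores(lines: Iterable[str],
               column: str = "search_engine_score[1]") -> List[float]:
    """Collect one score per PSM row from mzTab text."""
    idx = None
    scores = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "PSH":
            idx = fields.index(column)  # locate the score column by name
        elif fields[0] == "PSM" and idx is not None:
            scores.append(float(fields[idx]))
    return scores


def score_bin(score: float) -> str:
    """Map a Casanovo confidence to the bins used in the table."""
    if score < 0:
        return "negative"
    if score < 0.3:
        return "0-0.3"
    if score < 0.5:
        return "0.3-0.5"
    if score < 0.7:
        return "0.5-0.7"
    if score < 0.9:
        return "0.7-0.9"
    return ">=0.9"


toy_mztab = [
    "MTD\tmzTab-version\t1.0.0",
    "PSH\tsequence\tPSM_ID\tsearch_engine_score[1]",
    "PSM\tPEPTIDEK\t1\t0.95",
    "PSM\tAAAK\t2\t-0.2",
    "PSM\tGGGR\t3\t0.5",
]
print([score_bin(s) for s in psm_scores(toy_mztab)])
# ['>=0.9', 'negative', '0.5-0.7']
```

To run it on a real file, pass an open file handle instead of the toy list.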
Peptide length distribution (the MAPPs-relevant one)
| Length | PSMs | Length | PSMs |
|---|---|---|---|
| 6 | 2 | 13 | 6 |
| 7 | 25 | 14 | 1 |
| 8 | 29 | 15 | 2 |
| 9 | 11 | 17–22 | 7 |
| 10 | 7 | 23 | 3 |
| 11 | 9 | 25 | 4 |
| 12 | 12 | 26+ | 9 |
Median predicted length 9. Mean 12.3.
- MHC class I window (8–11 aa): 56 PSMs, 43.8% of total. This is the operationally important number for a MAPPs-I workflow. HLA-I immunopeptidomes are typically dominated by 9-mers (the prototypical HLA-I binding length) followed by 8/10/11-mers; the 8 / 9 / 10 / 11 length bins here hold 29 / 11 / 7 / 9 PSMs respectively, with the 8-mer bin the largest. That profile reflects the tryptic calibration input rather than an HLA-I elution, so what it shows is that Casanovo's output covers the class-I length range.
- MHC class II window (13–25 aa): 23 PSMs, 18.0% of total, a smaller fraction than the class-I window.
- The 7-mer bucket (25 PSMs) is mostly sub-binding-length and would be filtered downstream — useful as a noise-floor reference, not as candidate epitopes.
This is the slide an immunopeptidomics scientist actually wants: "yes, the de novo tool is producing peptides at the lengths our assay enriches, not random-length noise."
Post-translational modifications recovered
| Modification | PSMs | Note |
|---|---|---|
| Carbamidomethyl (Cys alkylation, +57.02) | 17 | Iodoacetamide signature — standard sample prep |
| Oxidation (Met, +15.99) | 4 | Most common in-source / handling oxidation |
| N-terminal Acetyl (+42.01) | 3 | Co-translational acetyl, common N-term modifier |
| Carbamyl (+43.01) | 2 | Urea-induced carbamylation artefact |
| Deamidated (N or Q → D/E, +0.98) | 1 | Hot-buffer artefact + biological |
| Ammonia-loss | 1 | Fragmentation artefact |
Cysteine bookkeeping: 17 of 128 PSMs (13.3%) contain a Cys residue. 15 of those 17 (88%) carry Carbamidomethyl. A wet-lab immunopeptidomics scientist reads this as "the sample was iodoacetamide-alkylated upstream, which is the standard prep, and Casanovo is correctly reporting the modified-Cys mass on those spectra rather than miscalling them as unmodified C." This is the kind of internal-consistency check that builds trust in the tool's output.
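The cysteine bookkeeping is a two-line count once modification tags are handled. A minimal sketch, assuming the C[Carbamidomethyl] notation shown in the sample table; the function name and the toy peptide list are hypothetical.

```python
import re
from typing import Iterable, Tuple


def cys_bookkeeping(seqs: Iterable[str]) -> Tuple[int, int]:
    """Return (PSMs containing Cys, PSMs with carbamidomethylated Cys)."""
    with_cys = alkylated = 0
    for seq in seqs:
        # Strip tags first so the capital C in "Carbamidomethyl" is not counted.
        bare = re.sub(r"\[[^\]]*\]", "", seq)
        if "C" in bare:
            with_cys += 1
            if "C[Carbamidomethyl]" in seq:
                alkylated += 1
    return with_cys, alkylated


toy = ["C[Carbamidomethyl]LKPNETK", "ACDEFGHK", "KVTAAMGK"]
print(cys_bookkeeping(toy))  # (2, 1)
```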
The chain — what runs, what you submit, what comes back
The four steps
| Step | Tool | Compute |
|---|---|---|
| 1. Tumor-normal WES prep | Sarek with Parabricks fq2bam (GPU alignment + dedup + BQSR) | GPU |
| 2. Somatic variant calling | Parabricks DeepSomatic (pbrun 4.5.1-1) | GPU |
| 3. MHC-I binding prediction | nf-core/epitopeprediction — MHCflurry + MHCnuggets (MixMHCpred 3.0 BYOC upgrade available) | CPU |
| 4. De novo MS/MS sequencing | Casanovo v5.0.0 | GPU |
Each step waits for the previous one to finish successfully before starting. Step 2 finds Sarek's recalibrated BAMs from step 1 automatically; step 3 finds the somatic VCF from step 2 the same way; step 4 takes its MGF directly from your submission. No external workflow engine to install or maintain — the dependency chain is wired natively in the cluster's scheduler.
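The gating rule described above (each step starts only after the previous one succeeds) can be modelled in a few lines. This is a toy illustration of the semantics only; it is not Clusterra's scheduler code, and the step names and callables are placeholders.

```python
from typing import Callable, List, Tuple


def run_chain(steps: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run steps in order and stop at the first one that fails.

    Each step is (name, run), where run() returns True on success.
    Returns the names of the steps that completed.
    """
    completed = []
    for name, run in steps:
        if not run():
            break  # downstream steps never start
        completed.append(name)
    return completed


# Simulated run where the third step fails.
steps = [
    ("sarek-wes-prep", lambda: True),
    ("deepsomatic", lambda: True),
    ("epitopeprediction", lambda: False),
    ("casanovo", lambda: True),
]
print(run_chain(steps))  # ['sarek-wes-prep', 'deepsomatic']
```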
What you submit, what comes back
You send: a tumor-normal samplesheet (FASTQs), the immunopep MGF (Orbitrap or timsTOF), the patient's HLA alleles, and a few standard Sarek parameters (genome build, tumor/normal sample names). One HTTP POST to the workflow API.
You get back: a workflow ID and four step job IDs, all grouped under one workflow row in the console with a chain badge. Logs stream live for each step; outputs land in a single per-workflow directory on shared storage so the final cross-annotation step (or your analyst) reads them by convention.
End-to-end status
Each step verified by a successful run on the production image stack (separate submissions, same templates, same automatic cost stamping):
| Step | Job ID | State | Wall-clock | Compute | What it produced |
|---|---|---|---|---|---|
| 1. Sarek WES prep | 2243 | COMPLETED | 10m 55s | CPU spot node | Recalibrated tumor + normal CRAMs |
| 2. DeepSomatic (pbrun 4.5.1-1) | 1805 | COMPLETED | 9m 15s | GPU spot node (1×A10G) | Somatic VCF |
| 3. MHC-I binding (MHCflurry 2.2.0) | 2330 | COMPLETED | 19s | CPU spot node | 76 / 305 predicted binders on the PXD019643 non-tryptic peptide set |
| 4. Casanovo de novo (HLA-eluted) | 2341 | COMPLETED | 20m 22s | GPU spot node (1×A10G) | 5,367 PSMs from PXD019643 — top hit TMDSVVYAL @ IC50 = 20 nM |
What this proves:
- The four steps run end-to-end. Submit once; each step picks up its input from the previous one and starts when the upstream step finishes.
- GPU steps run on GPU, CPU steps run on CPU. No misrouting, no manual placement.
- Every job records its own cost. Instance class, spot vs. on-demand, USD per job — written automatically as the job runs; per-project totals are a single query.
The chain is production-ready.
The conversation this opens
The numbers above answer the first-order question: "does de novo MS/MS sequencing find more than database search on real MAPPs data, and do the peptides it finds actually bind HLA?" Yes on both counts — 3.9× more PSMs on the calibration MGF, 24.9% MHCflurry binding rate on real HLA-eluted spectra (vs 0% on tryptic, as expected), and a top hit at IC50 = 20 nM.
The second-order question for a CRO running neoantigen MAPPs assays is: "does Casanovo find peptides that our specific patient's HLA actually presents — including the ones from somatic mutations in the WES run?" That requires the full chain — WES → somatic variant calling → epitope prediction → Casanovo — and a final cross-annotation joining the Casanovo identifications with the epitopeprediction output. mapps-complete runs all four steps from one submission; the chain plumbing — each step finding its input, waiting for the previous one, running on the right hardware, recording its own cost — has been verified end-to-end on the live cluster.
Stage 2 is OpenFold3 peptide-HLA structural annotation on the top-ranked candidates. The deliverable upgrade from a ranked sequence list to a 3D visualization of each epitope in the patient's HLA binding groove is the upsell conversation — after mapps-complete is running on the CRO's own data.
The natural next step: replace the PRIDE PXD fixture with one of the CRO's anonymized MAPPs runs and re-run. That's the conversation this document is designed to open.
Compute footprint
Calibration Casanovo run (128 spectra): $0.025 of GPU spot. Sarek-test step: $0.115 of CPU spot. Comet (database-search baseline): ~$0.005 of CPU spot for the 42-second search. Every job records its instance class, spot vs. on-demand status, and USD cost as it runs — per-project totals are a single query against the job log.
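To make "a single query" concrete, here is a sketch using an in-memory SQLite table. The table name, column names, project label, and instance-class values are all hypothetical stand-ins for the real job log schema; the three costs are the figures quoted above, with the approximate Comet cost entered as 0.005.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    "project TEXT, step TEXT, instance_class TEXT, pricing TEXT, cost_usd REAL)"
)
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?, ?)",
    [
        ("mapps-demo", "casanovo-calibration", "gpu", "spot", 0.025),
        ("mapps-demo", "sarek-test", "cpu", "spot", 0.115),
        ("mapps-demo", "comet-baseline", "cpu", "spot", 0.005),
    ],
)

# The per-project total is one aggregate query over the job records.
total = conn.execute(
    "SELECT ROUND(SUM(cost_usd), 3) FROM jobs WHERE project = ?",
    ("mapps-demo",),
).fetchone()[0]
print(total)
```

Adding `GROUP BY pricing` or `GROUP BY instance_class` to the same query would break the total down by spot vs. on-demand or by hardware class.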
Reproducibility
Workflow 3cfd677c-1ff2-4e73-83fa-415908ff5964 (chain) and standalone Casanovo job 2270 are reproducible from the YAML in this repo. Each of the four child templates also runs standalone if you already have its specific input ready (BAM, VCF, or MGF). The Comet/Sage delta-search artifacts (FASTA + decoy, output PSMs, the join script) live at /mnt/efs/nikhil@clusterra.io/comet-dbsearch/.
Artifacts referenced
- Workflow orchestrator: core/templates/definitions/hcls/workflows/mapps-complete.yaml
- Step 1: core/templates/definitions/hcls/nextflow/sarek/sarek-wes-somatic.yaml
- Step 2: core/templates/definitions/hcls/parabricks/deepsomatic.yaml
- Step 3: core/templates/definitions/hcls/nextflow/epitopeprediction.yaml
- Step 4: core/templates/definitions/hcls/proteomics/casanovo.yaml
- Submit handler: core/services/cluster-api/internal/products/jobs/workflow_submit.go
- Reference case study (Sarek + Parabricks germline): marketing/blogs/_posts/sarek-parabricks-case-study.md
- Casanovo mzTab parsed for this writeup: /mnt/efs/nikhil@clusterra.io/casanovo-ea37afb5-5c81-4b54-b8f4-dd454a5da1b5/casanovo.mztab (job 2270, 2026-05-14)
- Comet delta-search outputs: /mnt/efs/nikhil@clusterra.io/comet-dbsearch/{delta_summary.json,delta_table.csv,human_sprot_td.fasta}