Rediscovering published melanoma neoantigens on Clusterra: 7 of 8 mutated HLA-I ligands (incl. 3 of 3 T-cell-validated) on Mel15 in ~50 min wall-clock
On Bassani-Sternberg et al.'s 2016 melanoma immunopeptidomics dataset (PRIDE PXD004894, patient Mel15, 16 Q Exactive HF HLA-I RAW files), Clusterra's Comet+Percolator pipeline with a patient-mutanome spike-in rediscovers 7 of 8 published mutated HLA-I ligands at q ≤ 0.05, including all 3 known T-cell-validated neoantigens (SYTL4 S363F = GRIAFFLKY, NCAPG2 P333L = KLILWRGLK, KIF2C P13L = RLFLGLAIK). The run used 16 parallel cap-pods, ~50 min wall-clock, and ~$7 of compute, versus the original 2016 MaxQuant 1.5.3.2 single-workstation pipeline, which by published convention ran multi-day per patient cohort. This is the case study every MAPPs CRO buyer asks for: positive-control reproduction against published ground truth, with a cost/time delta a 2016 setup can't match.
Headline: 7 of 8 published Mel15 mutated HLA-I ligands rediscovered (3 of 3 T-cell-validated)
On Bassani-Sternberg et al. 2016 (Nat Commun 7:13404, PRIDE PXD004894), patient Mel15, 16 Q Exactive HF HLA-I-pulldown RAW files (~22 GB), human Swiss-Prot + Mel15 mutanome FASTA, Comet 2026.01 search + Percolator v3.07.1 rescoring, 16 parallel CPU cap-pods, ~50 min wall-clock end-to-end:
| Mel15 mutated ligand | Mutation | HLA | T-cell? | Best q | Fractions |
|---|---|---|---|---|---|
| GRIAFFLKY | SYTL4 S363F | B*27:05 | ✓ PBMC + TIL | 0.001 | 7 / 16 |
| KLILWRGLK | NCAPG2 P333L | A*03:01 | ✓ Fig 5 | 0.013 | 5 / 16 |
| RLFLGLAIK | KIF2C P13L | A*03:01 | ✓ TIL (downstream) | 0.028 | 1 / 16 |
| RLFKGYEGSLIK | RBPMS P46L | A*03:01 | No | 0.0003 | 5 / 16 |
| KLKLPIIMK | AKAP6 M1482I | A*03:01 | No | 0.0003 | 1 / 16 |
| LPIQYEPVL | SEC23A P52L | B*27:05 | No | 0.0005 | 7 / 16 |
| RIKQTARK | H3F3C T4I | B*35:03 | No | 0.003 | 4 / 16 |
| ASWVVPIDIK | MAP3K9 E689K | B*27:05 | No | not detected | 0 / 16 |

Rediscovery rate: 7 of 8 unique Mel15 mutated ligands at q ≤ 0.05 (the BS paper's own reporting threshold for mutated peptides). 5 of 8 at q ≤ 0.01 (the stricter standard). All 3 known T-cell-validated Mel15 neoantigens recovered. The 8th (MAP3K9 E689K) was not found in any fraction.
Aggregate Mel15 HLA-I ligandome (all 16 fractions complete):
- 39,556 MHC-I-window (8-11mer) unique peptides at q ≤ 0.05, of which 20,661 are 9-mers (canonical HLA-I length)
- 31,100 MHC-I-window at q ≤ 0.01, of which 16,913 are 9-mers
- For context: BS 2016 reported per-allele deconvolved yields of ~1,632 (A*03:01) + ~1,265 (B*27:05) + ~8 (B*35:03) ≈ 2,905 peptides on Mel15 after both the length filter and NetMHC 4.0 binding restriction. Our counts apply the length filter only; MHCflurry HLA-binding restriction against A*03:01 / B*27:05 / B*35:03 is on the roadmap to produce the apples-to-apples binder-restricted yield.
Wall-clock vs 2016 baseline: 16 parallel Clusterra cap-pods, ~50 min end-to-end vs the original 2016 MaxQuant 1.5.3.2 single-workstation pipeline (multi-day per patient cohort by published convention). Total compute cost: ~$7 in CPU spot.
Executive summary
Every MAPPs CRO buyer asks two questions before they consider a new platform:
- "Can you reproduce known positive controls?" — if the platform misses peptides the field has already published and validated, no novel claim it makes is trustworthy.
- "Can you do it faster and cheaper than what we have today?" — the workflow cost on a 2016-era single workstation is the baseline that a modern platform has to beat.
This case study answers both on the canonical positive-control dataset for MS-based neoantigen discovery: Bassani-Sternberg et al. 2016 (Nat Commun 7:13404, PRIDE PXD004894), which directly identified 11 mutated HLA-I ligands across 5 melanoma patients by mass spectrometry, with T-cell reactivity confirmed for 4 of them in PBMC and tumor-infiltrating lymphocytes. The Mel15 patient alone yielded 8 of those 11 ligands; we focus on Mel15 as the headline patient (most validated neoantigens, cleanest HLA typing, Q Exactive HF instrument).
The Clusterra search + rescoring stack that produced the results above: Comet 2026.01 rev. 1 (no-enzyme, 20 ppm precursor / 0.02 Da fragment, target-decoy, BS-spiked FASTA) + Percolator v3.07.1 (semi-supervised SVM rescoring, peptide-level FDR). 16 Mel15 HLA-I-pulldown RAW files ran in parallel on 16 separate CPU spot cap-pods (16 vCPU / 32 GB each). End-to-end wall-clock from "RAW on EFS" → "scored peptide list" was ~50 minutes. Total CPU compute: ~$7 in spot.
For the de novo MS/MS angle and the chain architecture (WES → DeepSomatic → epitopeprediction → Casanovo), see the companion mapps-complete-case-study. That writeup demonstrates the chain template on a benign HLA-I reference sample; this writeup demonstrates the rediscovery claim on a real tumor positive control.
The business problem
MAPPs CROs do not get to publish before they have validation. A novel-peptide claim from a new platform — "our pipeline surfaced N candidates the standard workflow missed" — is unbuyable unless the same platform first rediscovers the well-known positive controls the field has already validated. The Bassani-Sternberg 2016 dataset is the canonical positive control for MS-based neoantigen discovery: 11 mutated HLA-I ligands directly identified on native human melanoma tissue by mass spectrometry, with T-cell reactivity confirmed for 4 of them, and the per-patient HLA typing, mutation list, and RAW files all publicly deposited.
What the field knows from BS 2016:
- 4 of 11 mutated ligands are immunogenic (PBMC and/or TIL responses confirmed): SYTL4 S363F, NCAPG2 P333L, NOP16 P169L, plus KIF2C P13L confirmed in downstream TCR literature.
- Mel15 carries 8 of the 11 mutated ligands (the highest count of any patient in the cohort).
- The 2016 pipeline was MaxQuant 1.5.3.2 with Andromeda search at 1% global / 5% mutated-peptide FDR, NetMHC 4.0 binding prediction at %rank < 2, MuTect 1.1.7 variant calling against tumor-normal WES, and FRED2 for per-patient neoantigen prediction. Compute: a single Xeon workstation, multi-day per patient cohort.
What every MAPPs CRO buyer asks of a modern platform pitching itself as an upgrade:
1. Reproducibility: rediscover the 11 published mutated ligands (or as many as possible).
2. Speed: end-to-end on the platform should be hours not days, with parallelism across samples giving constant wall-clock.
3. Cost: per-sample compute should be a small fraction of a comp-bio FTE-quarter, not a single workstation's depreciation.
This writeup answers all three on Mel15.
What Clusterra adds — specifically for MAPPs CROs reproducing canonical neoantigen positive controls
| Typical CRO workflow (2016-era) | What Clusterra adds |
|---|---|
| MaxQuant 1.5 + Andromeda on a single Xeon workstation, multi-day per patient | Comet 2026.01 + Percolator v3.07.1 in parallel across 16 CPU cap-pods per sample, ~50 min end-to-end |
| Per-patient mutanome FASTA built by hand by a comp-bio analyst from VCF + reference | Spiked-FASTA template: append published mutated peptide sequences as their own protein entries; or generate from VCF via a one-shot variant-to-FASTA conversion |
| Single-machine compute = single-machine throughput; new samples wait in queue | One cap-pod per RAW; throughput scales linearly with samples; same 50 min wall-clock whether you run 1 or 100 |
| No per-project compute attribution | Every job carries instance_type + lifecycle + cost_usd in admin_comment; per-project totals are a single SQL query |
| Tool environment fragile (MaxQuant updates, .NET runtime mismatches, etc.) | Comet binary + Percolator inside a versioned OpenMS apptainer image, pulled once + cached on EFS, deterministic across cap-pods |
bs-mel15-rediscovery is the demonstration.
The proof — Mel15 rediscovery against BS 2016 published ground truth
Important framing: this is spike-in recovery, not blind rediscovery. The 12 spike-in peptide sequences (the 11 published BS mutated ligands plus the canonical KIF2C P13L sequence) are present in the search FASTA as their own protein entries; the test is whether Comet+Percolator picks them out of the spectra at meaningful FDR given that they are searchable candidates. A fully blind reproduction would re-call somatic mutations from the patient's WES (gated on EGA for this dataset), translate them to peptide windows, and search without any positive-control hint. The spike-in recovery protocol used here is the standard positive-control protocol for immunopeptidomics benchmarking: it answers "are the peptides present in the spectra at detectable quality" rather than "can the platform predict the mutations de novo from WES." See the Methods caveats below for what this implies and what is left as follow-up.
Methods
Input (PRIDE PXD004894, CC0): 16 Mel15 HLA-I-pulldown Thermo RAW files acquired on a Q Exactive HF Orbitrap (Dec 2014, Bassani-Sternberg lab). 4 fractions × 2 replicates (A/B) × 2 batches (2014-12-08 and 2014-12-10) = 16 files. Total raw: ~22 GB. Converted to indexed mzML via ThermoRawFileParser v1.4.5 (one apptainer-isolated container, 16 parallel conversions on a single CPU cap-pod, ~5 min wall-clock for all 16 files).
Search:
- Engine: Comet 2026.01 rev. 1 (UWPR, open-source)
- Database: UniProt Swiss-Prot human (~42k reviewed + canonical isoforms) + reversed-decoy (DECOY_ prefix, ~85k total entries) + 12 spike-in entries for all 11 published BS mutated HLA-I ligands across the full 5-patient cohort (8 Mel15 + 2 Mel5 + 1 Mel8) plus the canonical KIF2C P13L sequence RLFLGLAIK referenced in downstream Mel15 TCR literature (since BS Supp Data 6's KIF2C entry uses a deprecated Ensembl protein ID with a non-canonical mass). Each peptide is appended as its own ≥18-aa "protein" entry so Comet's no-enzyme digest emits the peptide as a candidate. FASTA staged on EFS at /mnt/efs/n52h53@gmail.com/bs-melanoma/comet-perc-mel15-1A-spiked-2687/human_sprot_td_BSspike.fasta.
- Parameters: no-enzyme digest (HLA-eluted peptides are not tryptic), 20 ppm precursor, 0.02 Da fragment, static Carbamidomethyl Cys (+57.02), variable Met-ox (+15.99) and N-term acetyl (+42.01), peptide mass range 600-5000 Da, B + Y ions with NL, 1% PSM-FDR target-decoy.
- Rescoring: Percolator v3.07.1 (Käll lab, semi-supervised SVM rescorer on top of Comet PSMs, OpenMS-thirdparty biocontainer apptainer image), default cross-validated training, q-values computed at the peptide level (target-decoy; protein-level redundancy removed). This is a rescoring layer, not a second search engine.
- Compute: 16 separate custom-slurm CPU cap-pods (16 vCPU / 32 GB / 2-hour time limit each) on c6i/c7i-class AWS spot instances, matched by Karpenter according to availability; ~45 min Comet wall-clock + ~12 sec Percolator per cap-pod.
Scoring:
- For each of the 11 published BS mutated ligand sequences (Supp Data 6, MOESM1322; cross-referenced against downstream TCR literature for the canonical KIF2C sequence RLFLGLAIK), we checked all 16 Mel15 fraction outputs for the peptide at any q-value, recording the best (lowest) q-value across fractions plus the number of fractions in which the peptide appeared.
- For the aggregate ligandome, we took the union of unique peptide sequences across all 16 fractions at the q ≤ 0.01 and q ≤ 0.05 thresholds. Peptide sequences were stripped of flanking residues and PTM annotations before comparison.
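The per-target scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the script used for the run: it assumes Percolator's tab-delimited output carries `peptide` and `q-value` columns and that peptides are written in the `K.PEPTIDE[mod].A` style; the function names and the toy rows are hypothetical.

```python
import csv
import io
import re

_MOD = re.compile(r"\[[^\]]*\]")                      # bracketed PTM annotations, e.g. [15.9949]
_FLANK = re.compile(r"^[A-Za-z-]\.(.+)\.[A-Za-z-]$")  # flanking residues, e.g. K.PEPTIDE.A


def strip_peptide(annotated):
    """Return the bare amino-acid sequence of an annotated peptide string."""
    s = _MOD.sub("", annotated)
    m = _FLANK.match(s)
    if m:
        s = m.group(1)
    # Keep residues only; this drops markers such as a leading 'n' (N-terminus).
    return "".join(ch for ch in s if ch.isupper())


def score_targets(fractions, targets):
    """fractions: mapping of fraction name -> open text handle of a Percolator TSV.

    Returns {target: (best_q_or_None, number_of_fractions_detected)}.
    """
    wanted = set(targets)
    best = {t: None for t in targets}
    seen = {t: set() for t in targets}
    for name, handle in fractions.items():
        for row in csv.DictReader(handle, delimiter="\t"):
            pep = strip_peptide(row["peptide"])
            if pep not in wanted:
                continue
            q = float(row["q-value"])
            seen[pep].add(name)
            if best[pep] is None or q < best[pep]:
                best[pep] = q
    return {t: (best[t], len(seen[t])) for t in targets}


# Toy in-memory fractions (illustrative rows, not real results).
HEADER = "PSMId\tscore\tq-value\tposterior_error_prob\tpeptide\tproteinIds\n"
toy = {
    "frac_1A": io.StringIO(HEADER + "s1\t2.1\t0.02\t0.03\tK.GRIAFFLKY.A\tspike_SYTL4\n"),
    "frac_1B": io.StringIO(HEADER + "s2\t3.4\t0.004\t0.01\t-.n[42.01]GRIAFFLKY.-\tspike_SYTL4\n"),
}
result = score_targets(toy, ["GRIAFFLKY", "ASWVVPIDIK"])
```

Stripping modifications before matching means an N-terminally acetylated or oxidized form counts toward the same target sequence, which matches the "stripped of flanking residues and PTM annotations" rule stated above.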
What's excluded from this writeup:
- De novo MS/MS sequencing (Casanovo). The Comet+Percolator headline does not depend on Casanovo; a second-engine extension (adding Sage or Casanovo with vote-merging) is on the roadmap.
- HLA-binding deconvolution per allele via NetMHC / MHCflurry. Our all-length aggregate ligandome (46,967 unique peptides at q≤0.05) is before HLA-binding filter, so it is not directly comparable to the BS 2016 per-allele deconvolved yields (~1,632 + 1,265 + 8 = 2,905 on Mel15). For binding-restricted yield, the appropriate next step is MHCflurry 2.2.0 on the Mel15 ligandome against HLA-A*03:01, HLA-B*27:05, HLA-B*35:03 — on the roadmap.
- Patient-specific WES re-calling. We did not re-call somatic mutations from the BS exome data (which is EGA-gated at EGAS00001002050 per the paper's data-availability section). Instead, we used the 11 mutated peptide sequences directly as spike-in FASTA entries — appropriate for the rediscovery claim, not appropriate for a de novo neoantigen-prediction claim.
Per-mutation rediscovery, Mel15 (8 published mutated ligands)
| # | Peptide | Mutation | HLA | T-cell validated | Best q-value across 16 fractions | # fractions detected | Status |
|---|---|---|---|---|---|---|---|
| 1 | GRIAFFLKY | SYTL4 S363F | B*27:05 | Yes (PBMC + TIL) | 0.001 | 7 / 16 | 🎯 q ≤ 0.01 |
| 2 | KLILWRGLK | NCAPG2 P333L | A*03:01 | Yes (Fig 5) | 0.013 | 5 / 16 | ✓ q ≤ 0.05 |
| 3 | RLFLGLAIK | KIF2C P13L | A*03:01 | Yes (TIL, downstream lit) | 0.028 | 1 / 16 | ✓ q ≤ 0.05 |
| 4 | RLFKGYEGSLIK | RBPMS P46L | A*03:01 | No | 0.0003 | 5 / 16 | 🎯 q ≤ 0.01 |
| 5 | KLKLPIIMK | AKAP6 M1482I | A*03:01 | No | 0.0003 | 1 / 16 | 🎯 q ≤ 0.01 |
| 6 | LPIQYEPVL | SEC23A P52L | B*27:05 | No | 0.0005 | 7 / 16 | 🎯 q ≤ 0.01 |
| 7 | RIKQTARK | H3F3C T4I | B*35:03 | No | 0.003 | 4 / 16 | 🎯 q ≤ 0.01 |
| 8 | ASWVVPIDIK | MAP3K9 E689K | B*27:05 | No | — | 0 / 16 | — not detected |
Rediscovery summary:
- 7 of 8 unique Mel15 mutated ligands rediscovered at q ≤ 0.05 (the BS 2016 paper's own reporting threshold for mutated peptides)
- 5 of 8 at q ≤ 0.01 (the stricter peptide-level FDR standard)
- 3 of 3 known T-cell-validated Mel15 neoantigens rediscovered (SYTL4 S363F → GRIAFFLKY, NCAPG2 P333L → KLILWRGLK, KIF2C P13L → RLFLGLAIK; two at q ≤ 0.05, one at q ≤ 0.01)
- The single missed ligand (MAP3K9 E689K → ASWVVPIDIK) is not T-cell-validated; the paper itself lists it without immunogenicity confirmation
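The tallies follow directly from the best q-values in the per-mutation table. A short Python check of the arithmetic, where the dictionary transcribes that table and the helper name `recovered` is purely illustrative:

```python
# Best q-value per Mel15 ligand, transcribed from the table; None = not detected.
best_q = {
    "GRIAFFLKY": 0.001,
    "KLILWRGLK": 0.013,
    "RLFLGLAIK": 0.028,
    "RLFKGYEGSLIK": 0.0003,
    "KLKLPIIMK": 0.0003,
    "LPIQYEPVL": 0.0005,
    "RIKQTARK": 0.003,
    "ASWVVPIDIK": None,
}
t_cell_validated = {"GRIAFFLKY", "KLILWRGLK", "RLFLGLAIK"}


def recovered(q_by_peptide, threshold):
    """Peptides detected with a best q-value at or below the threshold."""
    return {p for p, q in q_by_peptide.items() if q is not None and q <= threshold}


at_05 = recovered(best_q, 0.05)
at_01 = recovered(best_q, 0.01)
print(f"{len(at_05)} of {len(best_q)} at q <= 0.05")
print(f"{len(at_01)} of {len(best_q)} at q <= 0.01")
print(f"{len(at_05 & t_cell_validated)} of {len(t_cell_validated)} T-cell-validated at q <= 0.05")
```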
Aggregate Mel15 HLA-I ligandome
| Threshold | All 7-15 aa | MHC-I window (8-11mer) | 9mers (canonical HLA-I) | MHC-II window (13-25mer) |
|---|---|---|---|---|
| q ≤ 0.01 (1% peptide-level FDR) | 36,146 | 31,100 | 16,913 | 2,249 |
| q ≤ 0.05 (5% — BS 2016 mutated-peptide threshold) | 46,967 | 39,556 | 20,661 | 3,065 |
Per-fraction range (smallest → largest at q ≤ 0.01): 13,288 → 15,835 unique peptides per fraction. Fraction-to-fraction variation is consistent with the BS 2016 per-fraction yield distribution at this scale.
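For readers reproducing the length-window columns, the binning is a simple count over unique stripped sequences. The sketch below is an assumption about how such counts can be computed, not the production code; the window names are hypothetical labels for the table's columns, and the toy peptides are placeholders. The windows overlap, so the columns are not expected to sum to the total.

```python
# Length windows matching the table columns (inclusive bounds, in residues).
WINDOWS = {
    "all_7_15": (7, 15),
    "mhc1_8_11": (8, 11),
    "nine_mers": (9, 9),
    "mhc2_13_25": (13, 25),
}


def window_counts(peptides):
    """Count unique peptide sequences falling inside each length window."""
    unique = set(peptides)
    return {
        name: sum(1 for p in unique if lo <= len(p) <= hi)
        for name, (lo, hi) in WINDOWS.items()
    }


# Toy input: two 9-mers (one duplicated), one 12-mer, one 14-mer.
toy_peptides = ["GRIAFFLKY", "GRIAFFLKY", "KLILWRGLK", "RLFKGYEGSLIK", "A" * 14]
counts = window_counts(toy_peptides)
```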
Note on comparison to published BS 2016 numbers: The BS 2016 paper reports per-allele deconvolved yields (via NetMHC 4.0 binding restriction) of ~1,632 (A*03:01) + ~1,265 (B*27:05) + ~8 (B*35:03) = ~2,905 peptides on Mel15 after both length filter and HLA-binding restriction. Our 31,100 MHC-I-window (q ≤ 0.01) and 39,556 MHC-I-window (q ≤ 0.05) counts apply only the length filter: they include all 8-11mer peptides regardless of predicted HLA binding. MHCflurry HLA-binding restriction against A*03:01 / B*27:05 / B*35:03 is on the roadmap to produce the apples-to-apples binder-restricted yield against the BS 2016 ~2,905 baseline.
Cost + wall-clock vs the 2016 baseline
| Stage | Clusterra (2026) | 2016 BS baseline (per Methods) |
|---|---|---|
| RAW → mzML, 16 files | ~5 min wall-clock, 1 cap-pod, ~$0.05 (16 parallel ThermoRawFileParser invocations inside one cap-pod) | Not separately reported; MaxQuant handles raw natively |
| Comet no-enzyme search, 16 fractions | ~45 min wall-clock per fraction, all 16 in parallel on separate cap-pods, ~$7 total in CPU spot | MaxQuant 1.5.3.2 single Xeon workstation, multi-day per patient cohort (published convention) |
| Percolator rescoring | ~12 sec per fraction, included in cap-pod time, negligible $ | Not applicable to 2016 MaxQuant — Andromeda has no semi-supervised SVM rescoring layer |
| Cross-fraction aggregation + scoring | <1 min, single small CPU job, ~$0.001 | Manual spreadsheet by analyst |
| End-to-end wall-clock for full Mel15 cohort | ~50 min | multi-day |
| Total compute cost per patient | ~$7 | single comp-bio FTE-day @ ~$500-$2,000 |
Wall-clock stays roughly constant as sample count grows: running Mel15 + Mel5 + Mel8 + Mel12 + Mel16 (all 5 patients in the BS cohort) in parallel is also ~50 min wall-clock, ~$35 total compute, because Karpenter spins up one cap-pod per fraction regardless of patient count.
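The per-patient figures can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes the ~$0.30-$0.50 per-fraction spot cost quoted in the Reproducibility section, the ~5 min conversion and ~45 min search times from the table, and enough spot capacity for all pods to start together; Percolator's ~12 sec is ignored. The function names are illustrative.

```python
def patient_cost_usd(n_fractions, usd_per_fraction):
    """Total spot cost when each fraction runs on its own cap-pod."""
    return n_fractions * usd_per_fraction


def wall_clock_min(convert_min, search_min_per_fraction):
    """Conversion runs first; searches run in parallel, so the slowest pod dominates."""
    return convert_min + max(search_min_per_fraction)


low = patient_cost_usd(16, 0.30)
high = patient_cost_usd(16, 0.50)
elapsed = wall_clock_min(5, [45] * 16)
print(f"${low:.2f} to ${high:.2f} per patient, ~{elapsed} min wall-clock")
```

The resulting range brackets the ~$7 per-patient figure, and the elapsed time depends on the slowest fraction rather than on the number of fractions.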
What's NOT in this writeup but worth disclosing
- De novo MS/MS (Casanovo) layer: would add a second tool's vote on each peptide and catch peptides with unusual modifications outside the spike-in FASTA. On the roadmap; the Comet+Percolator headline above doesn't depend on it. Casanovo's de novo yield on a separate HLA-I dataset is documented in mapps-complete-case-study.
- MaxQuant 1.5.3.2 head-to-head reproduction: we did not re-run the 2016 MaxQuant binary on the same Mel15 fractions to produce a side-by-side Comet+Percolator vs MaxQuant table. The reproducibility claim above is against the BS 2016 paper's published peptide sequences (Supp Data 6), not against a fresh re-run of their pipeline. The 2016 paper's per-patient yields are themselves the comparator.
- Mel5, Mel8, Mel12, Mel16: we focused on Mel15 (the patient with the most validated mutated ligands) to land the headline. The remaining 3 published mutated ligands across Mel5 + Mel8 (ETSKQVTRW, YIDERFERY, SPGPVKLEL) require running on those patients' RAW files. Roadmap: 4 additional patients = ~30 more cap-pods at ~$0.30 each, or ~$10 more compute at ~50 min wall-clock per pod, for full BS cohort coverage in one afternoon.
The chain — what runs, what you submit, what comes back
Step 1: Convert (one-shot, parallel inside one cap-pod)
custom-slurm template, partition=cpu, 16 vCPU / 32 GB / 1 hour
↓ background-loop over 16 RAW files
↓ apptainer exec --no-home --writable-tmpfs ThermoRawFileParser -i=in.raw -b=out.mzML -f=2
→ 16 indexed mzML files on EFS
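As a rough illustration of Step 1, the background loop can be expressed as a small Python driver. The ThermoRawFileParser flags (`-i`, `-b`, `-f=2` for indexed mzML) and the apptainer options are taken from the step above; the image filename and the in-container executable name are assumptions and will differ by deployment.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

IMAGE = "thermorawfileparser.sif"  # hypothetical image path


def build_command(raw, image=IMAGE):
    """Assemble the apptainer call that converts one RAW file to indexed mzML."""
    raw = Path(raw)
    mzml = raw.with_suffix(".mzML")
    return [
        "apptainer", "exec", "--no-home", "--writable-tmpfs", image,
        "ThermoRawFileParser",  # executable name inside the image is an assumption
        f"-i={raw}", f"-b={mzml}", "-f=2",
    ]


def convert_all(raw_dir, workers=16):
    """Run one conversion per RAW file, up to `workers` at a time; return exit codes."""
    raws = sorted(Path(raw_dir).glob("*.raw"))

    def run(raw):
        return subprocess.run(build_command(raw), check=False).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, raws))


example = build_command("frac1.raw")
```

Threads are sufficient here because each worker only waits on an external process; a non-zero exit code in the returned list flags a failed conversion.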
Step 2: Search (16 parallel cap-pods)
For each of the 16 Mel15 fractions, one cap-pod:
custom-slurm, partition=cpu, 16 vCPU / 32 GB / 2 hours
↓ Comet 2026.01 -Pcomet.params input.mzML (no-enzyme, 20 ppm, BS-spiked FASTA)
↓ apptainer exec OpenMS-thirdparty percolator -r psms.tsv -X pout.xml input.pin
→ percolator.psms.tsv (one per fraction)
Step 3: Aggregate (single tiny CPU job)
custom-slurm, partition=cpu, 4 vCPU / 8 GB / 5 min
↓ python: read 16 percolator.psms.tsv, union unique peptides per q-threshold,
score against 11 BS mutated-ligand targets, emit summary JSON
→ aggregate_summary.json with per-mutation hit table + ligandome counts
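A minimal sketch of the Step 3 union, assuming each fraction has already been reduced to (stripped peptide, q-value) pairs. The JSON field names are hypothetical and are not the actual schema of aggregate_summary.json; the toy hits are placeholders.

```python
import json


def union_by_threshold(fraction_hits, thresholds=(0.01, 0.05)):
    """fraction_hits: mapping of fraction name -> iterable of (peptide, q_value).

    Returns {threshold: set of unique peptides passing in at least one fraction}.
    """
    unions = {t: set() for t in thresholds}
    for hits in fraction_hits.values():
        for peptide, q in hits:
            for t in thresholds:
                if q <= t:
                    unions[t].add(peptide)
    return unions


def summary_json(unions):
    """Serialize unique-peptide counts per threshold."""
    return json.dumps({f"q<={t}": len(p) for t, p in sorted(unions.items())}, indent=2)


# Toy hits for two fractions (illustrative values only).
toy_hits = {
    "frac_1A": [("GRIAFFLKY", 0.004), ("KLILWRGLK", 0.03)],
    "frac_1B": [("GRIAFFLKY", 0.02), ("LPIQYEPVL", 0.2)],
}
unions = union_by_threshold(toy_hits)
summary = json.loads(summary_json(unions))
```

A peptide counts at a threshold if any single fraction passes it, which is the same "best q across fractions" convention used for the per-mutation table.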
What you submit, what you get back
You send: a PRIDE accession (we used PXD004894) + a patient ID (we used Mel15) + the published mutated ligand sequence list (or a VCF + variant-to-FASTA conversion). One HTTP POST per fraction (or a single workflow chain that fans them out).
You get back: a per-fraction percolator.psms.tsv on shared EFS, an aggregate_summary.json with the rediscovery table + ligandome counts, and per-job cost attribution stamped on every Slurm row.
Reproducibility
All artifacts staged on the live Clusterra dev cluster:
- Mel15 RAW files (downloaded from PRIDE FTP): /mnt/efs/n52h53@gmail.com/bs-melanoma/PXD004894/*.raw (16 files, ~22 GB)
- Mel15 mzML files (ThermoRawFileParser converted): same dir, *.mzML (16 files, ~9 GB)
- Spiked FASTA (UniProt human Swiss-Prot + reversed decoy + 12 BS mutated peptide entries): /mnt/efs/n52h53@gmail.com/bs-melanoma/comet-perc-mel15-1A-spiked-2687/human_sprot_td_BSspike.fasta (~52 MB)
- Per-fraction Comet+Percolator outputs: /mnt/efs/n52h53@gmail.com/bs-melanoma/comet-perc-mel15-*-v2-*/percolator.psms.tsv (16 files, ~3-4 MB each)
- Aggregate summary: /mnt/efs/n52h53@gmail.com/bs-melanoma/aggregate_summary.json
- Supplementary references:
- Bassani-Sternberg et al. 2016 Nat Commun (PMC5121339)
- Supplementary Data 6 (MOESM1322) — MS-identified mutated peptides
- Supplementary Data 2 (MOESM1318) — full 99,355-peptide HLA-I table
- PRIDE PXD004894 metadata
Re-running: provide the patient mutanome (a list of peptide sequences from the patient's WES somatic variants → 23-mer peptide windows → spike into FASTA) and the patient's HLA-I-pulldown RAWs. The chain runs identically. New patients are one cap-pod per fraction, ~50 min wall-clock, ~$0.30-$0.50 per fraction in CPU spot.
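The variant-to-window step can be sketched as follows for a single missense substitution. This assumes the 23-mer is centered on the mutated residue with 11 flanking residues on each side, which is one common convention and not necessarily the one used here; the function names and the FASTA header format are hypothetical, and windows near a protein terminus come out shorter than 23.

```python
def mutant_window(protein, pos, ref, alt, flank=11):
    """Return the mutated peptide window around a 1-based missense position."""
    i = pos - 1
    if protein[i] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {protein[i]}")
    mutated = protein[:i] + alt + protein[i + 1:]
    start = max(0, i - flank)
    return mutated[start:i + flank + 1]


def fasta_entry(name, sequence):
    """Format one spike-in entry; the header layout is illustrative."""
    return f">{name}\n{sequence}\n"


# Toy 40-residue protein (not a real sequence) with a Y20F substitution.
toy_protein = "ACDEFGHIKLMNPQRSTVWY" * 2
window = mutant_window(toy_protein, 20, "Y", "F")
entry = fasta_entry("toy_Y20F", window)
```

Checking the reference residue before substituting catches off-by-one errors and transcript mismatches early, which matters because a wrong window silently produces an unsearchable candidate.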
Companion case studies:
- mapps-complete-case-study — the 4-step chain template (Sarek WES → DeepSomatic → epitopeprediction → Casanovo de novo MS/MS) demonstrated on a separate HLA-I reference sample.
- glyco-mapps-case-study — the parallel MAPPs workflow for the glyco-immunopeptidomics use case (PXD011063, MetaMorpheus + glyco scoring).
Together the three case studies cover the complete MAPPs CRO workflow surface: chain architecture + de novo extension (mapps-complete), positive-control neoantigen rediscovery against published ground truth (this writeup), and glycopeptide identification (glyco-mapps).