Glyco-MAPPs end-to-end on the Vakhrushev SimpleCell dataset: 1h 40m wall-clock, $1.28 AWS spot
Production Slurm for ADC target discovery: identify Tn-O-glycoprotein targets and rank them by druggability in 1h 40m wall-clock for $1.28 of AWS spot, on the canonical Vakhrushev SimpleCell Tn-O-glycoproteome dataset (PRIDE PXD011063). 253 unique O-glycopeptides at 1% glycopeptide-FDR (160 Level-1 site-localized); 63.5% Tn antigen (HexNAc-only — expected for COSMC-KO SimpleCells); 14 Boltz-2 glyco-co-folds with per-fold pLDDT and ipTM; FGFR4 and DLK1 automatically surfaced as ADC candidates. Five-step workflow (Tn-specific oxonium QC → MetaMorpheus O-Pair search → Casanovo de novo cross-validation on L4 → Boltz-2 protein + Tn-ligand co-folding on L40S → multi-factor druggability scoring) on managed Slurm in your AWS account. Three SimpleCell lines (HEK293-SC, HepG2-SC, Capan1-SC), 2h 41m sum-compute. All MIT / Apache / BSD; no license gates.
Published 22 May 2026. Dataset: PRIDE PXD011063 (Ye Z, Mao Y, Clausen H, Vakhrushev SY. "Glyco-DIA: a method for quantitative O-glycoproteomics with in silico-boosted glycopeptide libraries." Nat Methods 2019;16(9):902-910; DOI: 10.1038/s41592-019-0504-x). Three Orbitrap Fusion HCD-DDA .raw files, one per SimpleCell line (HEK293-SC trypsin+VVA, HepG2-SC trypsin+neuraminidase+VVA, Capan1-SC trypsin+VVA), all VVA-LWAC lectin-enriched, label-free. Tooling: ThermoRawFileParser 1.4.5, MetaMorpheus 1.1.7, Casanovo 5.0 on NVIDIA L4 (g6), Boltz-2 2.2.1 on NVIDIA L40S (g6e). All MIT / Apache / BSD; no license gates.
Why Tn-O-glycoproteomics matters for ADC discovery
Antibody-drug conjugates kill cancer cells by delivering cytotoxic payloads to surface antigens. The most successful targets of the last decade — HER2, TROP2, FOLR1 — are all glycoproteins. And some of the most tractable emerging targets aren't overexpressed proteins; they're aberrant glycan structures on tumor surfaces caused by disrupted O-glycan biosynthesis.
Tn antigen (GalNAc-α-O-Ser/Thr) is the canonical example. When COSMC — the ER chaperone required for the first O-glycan extension step — is inactivated, cells express truncated Tn at high surface density. The Clausen lab's SimpleCell platform (Steentoft et al., Nat Methods 2011 + EMBO J 2013) ZFN-knocks out COSMC in mammalian cell lines so that every O-glycopeptide on the surface carries Tn in its simplest form — the gold-standard tool for mapping the Tn-O-glycoproteome.
What an ADC discovery team actually wants from a pipeline like this:
- Run enrichment QC first — is the sample actually Tn-enriched, before you spend money searching?
- Find Tn-positive glycoproteins with site-level localization (Level 1) — you need the specific Ser/Thr to design an antibody epitope.
- Cross-validate the peptide ID with an orthogonal de novo method so a single search-engine artefact doesn't carry the headline.
- Predict the protein structure with the Tn glycan as a ligand to ask: is the glycosite actually accessible from outside the cell?
- Score against cell-surface localization (UniProt), glycosylation site density, free thiol availability for site-specific conjugation, and tumor-vs-normal expression.
The five-step workflow
prep (Step 1, CPU) raw/.d → mzML + MGF
│
▼
qc (Step 2, CPU) Tn-specific oxonium QC (HexNAc-only panel)
│
├──► search-n (Step 3a, CPU) ─┐
├──► search-o (Step 3b, CPU) ─┤ parallel
└──► denovo (Step 4, GPU L4) ─┘
│
▼
prologue (Tier-A intersection, glycan→SMILES)
│
▼
fold (Step 5, GPU L40S)
Boltz-2 protein + Tn-ligand co-fold +
multi-factor druggability score
Steps 3a, 3b, and 4 are independent — they all consume Step 2's outputs and emit into Step 5's prologue. Wall-clock collapses to max(41:21, 52:21, 19:17) = 52:21 for the parallel block.
Step 1: Raw conversion (9:02)
ThermoRawFileParser 1.4.5 converts all three SimpleCell .raw files in parallel on a single 16-vCPU CPU node:
- HEK293_SC_VVA.raw (md5 cc725b59…, 932 MB) → 488 MB indexed mzML + 1.4 GB MGF
- HepG2_SC_NeuVVA.raw (md5 3f20bfe4…, 995 MB) → 524 MB mzML + 1.5 GB MGF
- Capan1_SC_VVA.raw (md5 70b45026…, 792 MB) → 416 MB mzML + 1.2 GB MGF
- Total: 98,712 MS2 scans across the three files
- Wall-clock: 9:02
Step 2: Tn-specific oxonium QC (23:02 — first run on a fresh worker)
The QC step is a customer-protection gate: if the sample doesn't actually have glycopeptides, the rest of the pipeline is wasted money. A v1 panel matching any glycan-diagnostic oxonium ion — HexNAc, NeuAc, NeuGc, Fuc, the Hex-HexNAc 366.140 dimer — and calling the gate "Tn QC" is wrong: Tn = GalNAc only, no sialic acid, no galactose. NeuAc/Fuc/Hex hits inflate the pass rate with non-Tn glycopeptides. A glycoproteomics PI catches this in 30 seconds and dismisses the post.
The v2 panel introduces a qc_mode = tn-only parameter that restricts to HexNAc-specific diagnostic ions:
m/z 138.055: HexNAc fragment (C₇H₈NO₂⁺)
m/z 168.066: HexNAc − 2 H₂O
m/z 186.076: HexNAc − H₂O
m/z 204.087: HexNAc oxonium ion (parent)
On the three SimpleCell files, 41,100 of 98,712 MS2 scans (41.6%) pass the Tn-specific 2-hit gate. An enriched-but-realistic pass rate — much higher than a non-Tn-enriched sample, lower than what a panel-with-NeuAc gate would inflate to. The four-ion HexNAc-only panel is now the default for any Tn-enriched workflow; the legacy panel is opt-in via qc_mode = general.
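A minimal sketch of what a 2-hit HexNAc-only gate could look like, assuming each MS2 scan is available as a list of (m/z, intensity) pairs. The function names, the 0.01 Da tolerance, and the optional relative-intensity floor are illustrative assumptions, not the workflow's actual settings.

```python
# Hypothetical sketch of a Tn-specific oxonium gate; not the workflow's scanner.
TN_PANEL_MZ = (138.055, 168.066, 186.076, 204.087)  # HexNAc-only panel ions


def count_panel_hits(peaks, tol_da=0.01, min_rel_intensity=0.0):
    """Count how many panel ions are matched by at least one peak.

    peaks: iterable of (mz, intensity) tuples for a single MS2 scan.
    tol_da and min_rel_intensity are assumed values for illustration.
    """
    peaks = list(peaks)
    if not peaks:
        return 0
    base_peak = max(intensity for _, intensity in peaks)
    hits = 0
    for target in TN_PANEL_MZ:
        if any(
            abs(mz - target) <= tol_da and intensity >= min_rel_intensity * base_peak
            for mz, intensity in peaks
        ):
            hits += 1
    return hits


def passes_tn_gate(peaks, min_hits=2, **kwargs):
    """A scan passes when at least min_hits distinct panel ions are present."""
    return count_panel_hits(peaks, **kwargs) >= min_hits


# A scan with two HexNAc ions passes; the 366.140 Hex-HexNAc ion is ignored.
scan_a = [(138.0551, 500.0), (204.0868, 1200.0), (366.140, 900.0)]
# A scan with one HexNAc ion plus NeuAc-type ions does not pass.
scan_b = [(204.087, 100.0), (274.092, 800.0), (292.103, 700.0)]
```

Counting distinct panel ions rather than matched peaks keeps one intense ion with a shoulder peak from satisfying the gate on its own.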
Step 3: MetaMorpheus O-Pair search (52:21) — the headline biology arm
MetaMorpheus 1.1.7 GlycoSearchTask in O-Pair mode, against a curated 22-entry tumor O-glycan library (Tn / sTn / T / extended core 1-2):
- 417 O-glyco PSMs at 1% PSM-FDR; 253 unique O-glycopeptides at 1% glycopeptide-FDR; 160 Level-1 site-localized; 166 distinct O-glycoproteins; 2 in all three cell lines, 16 in ≥2. Wall-clock 52:21.
Glycopeptide-level FDR is the more conservative number (an O-glyco PSM that matches the same peptide+glycan across the three cell lines counts once at glycopeptide-FDR). MetaMorpheus emits both — they're reported together so the reader can pick the metric that fits their question.
A parallel MM N-glyco search arm (3a) runs in the same wall-clock window against the bundled 182-entry NGlycan.gdb as an orthogonal control on enrichment specificity. We report its raw output here for completeness but it should not be read as N-glyco biology on this dataset — see the dedicated control section below.
Two real MM parallel-run bugs that v1 didn't catch and that get codified in the templates: (1) MM's DatabaseIndex/ output sits next to the FASTA; when N-glyco and O-Pair run simultaneously against the same FASTA path, the second job hits IOException: The process cannot access the file 'peptideIndex.ind' because it is being used by another process. Fix: stage the FASTA into a per-job fasta-workdir/ via hardlink. (2) MM's GlycoSearchTask returns peptide-level PSMs even when no glycan database is configured — so a template with NGlycanDatabasefile = "" appears to "work" but emits zero actual glyco-PSMs. We surfaced this with an explicit assert + the canonical path to MM's bundled NGlycan.gdb.
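The per-job FASTA staging fix can be sketched as follows. This is an illustration under assumptions: the function name and the copy fallback are hypothetical, and only the fasta-workdir/ directory name and the hardlink idea come from the text above.

```python
import os
import shutil
from pathlib import Path


def stage_fasta(fasta_path, job_dir, subdir="fasta-workdir"):
    """Give one search job a private FASTA path so index files written
    next to the FASTA land in a per-job directory instead of a shared one."""
    src = Path(fasta_path)
    workdir = Path(job_dir) / subdir
    workdir.mkdir(parents=True, exist_ok=True)
    dst = workdir / src.name
    if dst.exists():
        return dst
    try:
        os.link(src, dst)  # hardlink: no extra disk use, same filesystem only
    except OSError:
        shutil.copy2(src, dst)  # assumed fallback when hardlinks are unavailable
    return dst
```

Each search job would then be pointed at its own staged path, so two concurrent jobs never write an index into the same directory.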
N-glyco arm — orthogonal enrichment-specificity control (NOT N-glyco biology)
The N-glyco search arm returns 1,813 PSMs at 1% PSM-FDR in its raw output, but this is a search-space artifact — not biological N-glycosylation. The dataset is VVA-lectin-enriched COSMC-KO SimpleCells, designed by the Clausen lab to capture O-linked Tn glycopeptides; the original Vakhrushev Nat Methods 2019 paper reports zero N-glycopeptide content. Three lines of evidence confirm the arm here is noise, not biology:
- 97% lack a real N-glycosylation sequon. MetaMorpheus's own N-Glycan Motif Check column flags 1,756 of 1,813 PSMs as N-X-S/T-negative. Of the remaining 57 flagged "True", 8 are N-P-S/T (proline at position 2, which is biologically excluded from the consensus sequon). Net biologically valid N-glycopeptide count: at most 49 PSMs (57 − 8), not 1,813.
- Glycan composition is HexNAc-only. 94% of the "N-glyco" PSMs carry HexNAc-only compositions (N1, N2, N3, N4), the same Tn / Tn-extended series the O-Pair arm identifies on Ser/Thr. Real N-glycans carry the trimannosyl-chitobiose core (≥ N2H3); the absence of complex N-glycan compositions in this arm is exactly what one expects from Tn-O-glycopeptides being mass-matched against an N-glycan database.
- Tool comparison: the 4.3:1 N:O PSM ratio (1,813:417) is a search-space inflation effect (182-entry N-glycan db vs 22-entry Tn library applied to the same scans), not an enrichment failure — VVA-LWAC is Tn-selective.
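The sequon audit in the first bullet amounts to a strict N-X-S/T check with proline excluded at X. A minimal sketch, assuming the input is the bare peptide sequence with modifications already stripped (function names are illustrative):

```python
import re

# N-X-S/T where X is any residue except proline.
# The lookahead keeps matches zero-width after N, so overlapping sequons are found.
SEQUON = re.compile(r"N(?=[^P][ST])")


def sequon_positions(peptide):
    """Return 0-based indices of Asn residues that start a valid N-X-S/T sequon."""
    return [m.start() for m in SEQUON.finditer(peptide.upper())]


def has_valid_sequon(peptide):
    return bool(sequon_positions(peptide))
```

A peptide-level check like this cannot see a sequon whose S/T falls just past the peptide's C-terminus, so a protein-level lookup would be needed for peptides ending in N or N-X.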
Customers running their own glyco-MAPPs analyses on Tn-enriched data should rely on the O-Pair arm and treat the N-glyco arm's raw count as a sanity-check of enrichment quality (low ratio of sequon-confirmed N-glyco to O-glyco hits = clean Tn enrichment). For a real N-glycoproteomics analysis, use a HILIC-enriched or lectin-WGA dataset (e.g. PXD025859) and read the N-glyco arm headline; this case study is, by dataset choice, an O-glyco-only biology study.
O-glyco compositions: Tn dominates as expected
Counted at 1% PSM-FDR, target rows only, MetaMorpheus N-shorthand notation (N = HexNAc, H = Hex, A = NeuAc):
| Composition | Identity | PSMs (q ≤ 0.01) |
|---|---|---|
| N1 | HexNAc(1) — Tn antigen | 265 |
| N2 | HexNAc(2) — chitobiose / Tn-N-core precursor | 79 |
| H1N2 | Hex(1)HexNAc(2) | 23 |
| N3 | HexNAc(3) | 19 |
| N4 | HexNAc(4) | 11 |
| H1N3 | Hex(1)HexNAc(3) | 9 |
| H1N4 | Hex(1)HexNAc(4) | 5 |
| H1N1 | Hex(1)HexNAc(1) — T antigen (Galβ1-3GalNAc, core 1) | 3 |
| H2N5 | Hex(2)HexNAc(5) | 2 |
| H1N5 | Hex(1)HexNAc(5) | 1 |
265 of 417 PSMs (63.5%) are HexNAc(1) — direct Tn-antigen hits. The Tn-extended HexNAc-only series (N1 + N2 + N3 + N4) totals 374 PSMs (89.7%). Only 3 PSMs (0.7%) carry the core-1 T antigen — consistent with COSMC-KO SimpleCells, which by definition can't elongate Tn into T-core-1.
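The percentages above can be recomputed from the composition table. The sketch below parses the N-shorthand and tallies the HexNAc-only series; the helper names are illustrative, and the counts are copied from the table.

```python
import re


def parse_composition(shorthand):
    """Parse shorthand such as 'H1N2' into {'H': 1, 'N': 2}."""
    return {k: int(v) for k, v in re.findall(r"([A-Z])(\d+)", shorthand)}


def is_hexnac_only(shorthand):
    comp = parse_composition(shorthand)
    return {k for k, v in comp.items() if v > 0} == {"N"}


# PSM counts at q <= 0.01, copied from the composition table.
psm_counts = {
    "N1": 265, "N2": 79, "H1N2": 23, "N3": 19, "N4": 11,
    "H1N3": 9, "H1N4": 5, "H1N1": 3, "H2N5": 2, "H1N5": 1,
}

total = sum(psm_counts.values())
tn_pct = round(100 * psm_counts["N1"] / total, 1)
hexnac_only = sum(n for comp, n in psm_counts.items() if is_hexnac_only(comp))
hexnac_only_pct = round(100 * hexnac_only / total, 1)
t_antigen_pct = round(100 * psm_counts["H1N1"] / total, 1)
```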
Top O-glyco hits by Level-1 localization (the ADC-relevant view)
Level 1 in MetaMorpheus's O-Pair localization scoring corresponds to a site-localization probability ≥ 0.75 — the glycan has been resolved to a specific Ser/Thr residue, the precision you need to design an antibody epitope or do site-specific conjugation. Top O-glyco hits across the three SimpleCell lines (PSM counts at 1% PSM-FDR, target only):
| UniProt | Gene | Protein | PSMs | Level-1 | Cell lines | Top glycan |
|---|---|---|---|---|---|---|
| P80303 | NUCB2 | Nucleobindin-2 (nesfatin-1 precursor) | 30 | 29 | HepG2-SC | N1 (Tn) |
| O00461 | GOLIM4 | Golgi integral membrane protein 4 | 9 | 9 | all 3 | N1 (Tn) |
| Q02818 | NUCB1 | Nucleobindin-1 | 13 | 5 | all 3 | N2 |
| P02751 | FN1 | Fibronectin | 7 | 4 | HepG2-SC | N1 (Tn) |
| Q9ULI3 | HEG1 | Heart development EGF-like | 27 | 2 | HepG2-SC | N2 |
| Q9Y6N7 | ROBO1 | Roundabout-1 | 9 | 1 | HEK293 + HepG2 | N1 (Tn) |
NUCB2 (Nesfatin-1 precursor) dominates the Level-1 count from HepG2-SC with 29 of 30 PSMs site-localized — a secreted satiety hormone with established roles in cancer cell-line biology. GOLIM4 hits all three SimpleCell lines with 9 of 9 PSMs Level-1 localized — a Golgi integral membrane protein with reported surface-shed isoforms in some cancers. NUCB1 covers all three cell lines with 5 Level-1 PSMs out of 13.
"Cross-line concordance" here is a biological-reproducibility proxy across SimpleCell-engineered HEK293, HepG2, and Capan1 — three different cell lines, not biological replicates of one. A protein hitting all three lines is more robust to cell-line-specific noise than a single-line hit, but we don't quote a formal replicate FDR.
Step 4: Casanovo 5.0 de novo cross-validation (19:17 inference)
Casanovo (Noble Lab, Apache-2.0) sequences peptides directly from MS/MS, with no database and no glycan model. Its value here is orthogonal: a Casanovo-predicted peptide backbone that matches an MM PSM after I/L collapse is independent evidence for the peptide identification (the glycan localization stays with MM; the backbone gets a second witness). Running the default tryptic checkpoint (casanovo_v5_0_0_v5_0_0.ckpt, MassIVE-KB-trained, Apache-2.0) against the 97,780 oxonium-passing scans on a single NVIDIA L4 (g6.xlarge spot):
- 91,530 spectra sequenced (6,231 skipped — invalid precursor charge)
- 8,059 at score ≥ 0.50 (8.8%) — Casanovo paper's confident threshold
- 1,279 at score ≥ 0.90 (1.4%) — high-confidence novel sequences
- Max GPU memory: 2.3 GiB on L4's 24 GB. That is overkill on the model side, but L4 was what spot inventory served (see the spot-price note under the cost table below)
- GPU inference: 16:41; full wall-clock 19:17 including the one-time conda + rdkit + casanovo bootstrap
Casanovo gives Boltz-2's prologue a peptide-backbone witness set. Every MM glyco-PSM whose I/L-collapsed Base Sequence intersects a Casanovo prediction is promoted to Tier A (MM ∩ Casanovo). Tier-A is a peptide-backbone witness, not a glycoform witness — Casanovo can't see the glycan, only the backbone — so the language matters: Tier-A means "two methods independently agree on the peptide that carries the glycan", not "the glycoform itself is confirmed."
The intersection set: 700 distinct MM glyco-peptide backbones (I/L-collapsed) at 1% PSM-FDR; 86,429 distinct Casanovo predictions across all returned scans; 40 backbones in the intersection (5.7% of MM glyco-backbones) covering 155 MM glyco-PSMs across 34 distinct glycoproteins. Step 5 caps the Tier-A list at max_glycoproteins = 15 in this run — 19 additional Tier-A glycoproteins were not folded due to the cap; raising the cap to 50 is one of the cheapest v3 improvements.
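A minimal sketch of the Tier-A promotion, assuming MM PSMs arrive as (accession, base sequence) pairs and Casanovo predictions as bare peptide strings with any modification annotations already stripped. The function names and the exact-match rule are illustrative assumptions.

```python
def collapse_il(seq):
    """Map isoleucine to leucine; the two are isobaric and cannot be
    told apart from fragment masses alone."""
    return seq.upper().replace("I", "L")


def tier_a(mm_psms, casanovo_peptides):
    """Return the MM PSMs whose I/L-collapsed backbone also appears,
    as an exact full-length match, among the Casanovo predictions."""
    witnessed = {collapse_il(p) for p in casanovo_peptides}
    return [(acc, seq) for acc, seq in mm_psms if collapse_il(seq) in witnessed]


# Made-up sequences for illustration only.
mm = [("P00001", "AIDESTK"), ("P00002", "GGLSTPR"), ("P00003", "VVSTEEK")]
denovo = ["ALDESTK", "GGISTPR", "QQQQQQK"]
promoted = tier_a(mm, denovo)
intersection_pct = round(100 * 40 / 700, 1)  # counts quoted in the text
```

Exact matching is the strictest choice; it is what makes Tier-A a precision filter rather than a coverage metric.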
5.7% looks low at first read, and a glyco PI will ask whether it's good. The honest answer: Casanovo's default checkpoint (casanovo_massivekb) is trained on tryptic non-glyco peptides from MassIVE-KB, and the glycan mass shift biases the precursor-charge and peptide-length distribution of glycopeptide spectra away from that training distribution — so the tryptic checkpoint systematically under-calls glyco-modified backbones. The non-enzymatic checkpoint (casanovo_v5_0_0_non_enzy, also Apache-2.0, drop-in swap) plus the MGF-subset filter fix (both queued for v3) should lift this intersection rate substantially. Tier-A is a high-precision sub-population, not a coverage metric — a 5.7% intersection still gives Step 5 a 34-protein high-confidence input list that the un-validated MM rows alone couldn't justify folding.
Step 5: Boltz-2 protein + Tn-ligand co-fold + ADC druggability (16:12)
Step 5 is where the workflow earns the "structural" half of its pitch. The prologue:
- Merges N + O glyco-PSMs (q ≤ 0.01), tags glycoproteins A (MM ∩ Casanovo) / C (MM-only), caps at 15 (Tier-A only for this run).
- For each accession, emits one Boltz-2 YAML with the dominant Tn glycan as a SMILES ligand. Tn = α-D-GalNAc as a free sugar; SMILES CC(=O)N[C@@H]1[C@@H](O)[C@@H](O)[C@H](CO)O[C@H]1O. Boltz-2's input schema (v2.2.1) requires smiles: or ccd:; we use SMILES with PDB-CCD-equivalents (NDG / SIA / GAL) curated for Tn / sTn / T.
- Important caveat: Boltz-2 docks the free Tn sugar near the predicted glycosite; it does not model the O-glycosidic bond to the Ser/Thr backbone. SASA reported is at the protein surface; site-resolved SASA-at-glycosite is a v3 refinement (requires post-fold rebuilding the Cα–Cβ–Oγ–C1 glycosidic torsion).
- Runs boltz predict --model boltz2 --use_msa_server --recycling_steps 3 --diffusion_samples 1.
- The epilogue computes per fold: per-residue mean SASA via Bio.PDB.SASA Shrake-Rupley; free-thiol Cys count (Cys-SG pairwise distance > 2.5 Å, i.e. not disulfide-bonded); NXS/T motif density; fetches UniProt subcellular location for the hard surface-accessibility gate; looks up TCGA tumor-vs-GTEx-normal log2 fold-change from a curated table (TCGA-LIHC values used here because HepG2-SC drives the most Level-1 PSMs; per-tissue LFCs for kidney + PAAD pending v3).
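As an illustration of the per-accession input the prologue emits, the sketch below builds a one-protein, one-ligand YAML as plain text. The layout (version / sequences / protein / ligand with id, sequence, and smiles keys) reflects our understanding of the Boltz input format and should be checked against the Boltz documentation for the version in use. The function name and the chain IDs A and B are illustrative.

```python
# SMILES for the free Tn sugar, copied from the text above.
TN_SMILES = "CC(=O)N[C@@H]1[C@@H](O)[C@@H](O)[C@H](CO)O[C@H]1O"


def boltz_yaml(protein_seq, ligand_smiles=TN_SMILES,
               protein_chain="A", ligand_chain="B"):
    """Render a minimal protein + ligand input as YAML text.

    Chain IDs are single letters; the schema layout is an assumption
    to verify against the Boltz docs.
    """
    seq = protein_seq.strip().upper()
    if not seq.isalpha():
        raise ValueError("protein sequence must contain only letters")
    if "'" in ligand_smiles:
        raise ValueError("unexpected single quote in SMILES")
    lines = [
        "version: 1",
        "sequences:",
        "  - protein:",
        f"      id: {protein_chain}",
        f"      sequence: {seq}",
        "  - ligand:",
        f"      id: {ligand_chain}",
        # Single-quoted so brackets and '@' are read as a plain string.
        f"      smiles: '{ligand_smiles}'",
    ]
    return "\n".join(lines) + "\n"
```

Writing the file by hand keeps the sketch dependency-free; a YAML library would do the quoting automatically.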
Surface-accessibility gate, precise rule: a protein passes if UniProt reports either (a) signal peptide present AND ≥ 1 transmembrane domain, OR (b) any subcellular-location token contains "Cell membrane", "Secreted", "Cell surface", "Plasma membrane", or "Extracellular". A protein fails the gate (hard zero) if it has no surface match AND any subloc token contains "Cytoplasm", "Nucleus", "Lysosome", "Mitochondrion", "Peroxisome", or "Endoplasmic reticulum lumen". Unmatched (unknown) sublocs use the soft score with a noted caveat.
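The gate rule above translates almost directly into code. A minimal sketch, assuming the UniProt annotations have already been reduced to a list of subcellular-location strings, a signal-peptide flag, and a transmembrane-domain count. The function name, the three-way return value, and case-insensitive matching are illustrative choices.

```python
SURFACE_TOKENS = ("cell membrane", "secreted", "cell surface",
                  "plasma membrane", "extracellular")
INTERNAL_TOKENS = ("cytoplasm", "nucleus", "lysosome", "mitochondrion",
                   "peroxisome", "endoplasmic reticulum lumen")


def surface_gate(sublocs, has_signal_peptide=False, tm_domains=0):
    """Return 'pass', 'fail' (hard zero), or 'unknown' (soft score)."""
    locs = [s.lower() for s in sublocs]
    surface_match = (
        (has_signal_peptide and tm_domains >= 1)
        or any(tok in loc for loc in locs for tok in SURFACE_TOKENS)
    )
    if surface_match:
        return "pass"
    if any(tok in loc for loc in locs for tok in INTERNAL_TOKENS):
        return "fail"
    return "unknown"
```

Checking the surface branch first means a protein annotated as both "Cytoplasm" and "Cell membrane" passes, matching the rule that a fail requires no surface match at all.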
Two known false-positive classes that pass the gate but aren't real ADC candidates — both are marked ⚠ in the rank table and explained in their Note columns, but neither is hard-zeroed by v2 of the gate:
- ⚠ resident-enzyme: Golgi/ER-resident glycosyltransferases and glycosidases (MAN1B1, MGAT2, TMTC3 in this run) have a TM domain but their catalytic side faces the lumen, never the cell exterior — not antibody-accessible. The v3 gate adds a "resident-glyco-enzyme" sub-class so these are filtered automatically.
- ⚠ plasma-soak: secreted abundant plasma / lipid-particle proteins (APOE, FGA, SERPINA5, FN1 in this run) pass the secreted-token branch of the gate but are unsuitable ADC targets — an antibody payload would be sponged systemically by the circulating pool ("on-target / off-tumor" sink). The v3 gate adds a "secreted-but-not-membrane-anchored" sub-class. uPAR or mesothelin-style secreted-but-membrane-anchored antigens stay through this sub-class.
Score formula (v2): cell-surface hard gate → if pass, 0.30 × SASA + 0.20 × free_thiols + 0.15 × NXS/T_penalty + 0.10 × Tier-A_bonus + 0.10 × LFC + 0.15 × confidence_multiplier; if fail, score = 0. The confidence multiplier is min(pLDDT/0.7, 1) × min(ipTM/0.6, 1) — a fold with pLDDT < 0.7 or ipTM < 0.6 gets a real penalty so low-confidence predictions don't share the top of the table with high-confidence ones. Low-conf ⚠ flag is applied to any row with pLDDT < 0.6 OR ipTM < 0.5 — currently FGA (ipTM 0.27), DSG2 (ipTM 0.40), and TCOF1 (pLDDT 0.34 / ipTM 0.19).
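A sketch of the v2 score as stated above. The text does not say how raw SASA, thiol counts, sequon density, or LFC are scaled before weighting, so this version assumes each of the five non-confidence terms has already been normalized to [0, 1]; for that reason it does not reproduce the scores in the rank table. Names are illustrative.

```python
WEIGHTS = {
    "sasa": 0.30,
    "free_thiols": 0.20,
    "nxst_penalty": 0.15,
    "tier_a_bonus": 0.10,
    "lfc": 0.10,
    "confidence": 0.15,
}


def confidence_multiplier(plddt, iptm):
    """min(pLDDT/0.7, 1) x min(ipTM/0.6, 1), as in the formula above."""
    return min(plddt / 0.7, 1.0) * min(iptm / 0.6, 1.0)


def is_low_conf(plddt, iptm):
    """Low-confidence flag: pLDDT < 0.6 OR ipTM < 0.5."""
    return plddt < 0.6 or iptm < 0.5


def druggability_score(passes_gate, terms, plddt, iptm):
    """terms: dict with the five non-confidence components, each assumed
    to be pre-normalized to [0, 1]. Returns 0.0 on a hard-gate fail."""
    if not passes_gate:
        return 0.0
    base = sum(WEIGHTS[k] * terms[k] for k in WEIGHTS if k != "confidence")
    return base + WEIGHTS["confidence"] * confidence_multiplier(plddt, iptm)
```

With the table's confidence values, DLK1 (pLDDT 0.725, ipTM 0.728) gets a multiplier of 1.0, while FGFR4 (pLDDT 0.651, ipTM 0.763) gets 0.651 / 0.7 = 0.93.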
Ranked ADC candidates with Boltz-2 confidence
14 of 15 Tier-A glycoproteins folded successfully on a single NVIDIA L40S (g6e.4xlarge spot); P02751 / Fibronectin (2,477 residues) OOM'd. Boltz-2 OOM is sequence-length-driven, not protein-count-driven, and a single-pass fold of 2,477 residues does not fit in the L40S's 48 GB of VRAM. All 14 successful folds were under 1,488 residues. Of the 14, three were excluded from the ranked table below after a post-hoc sequon audit (HSPA5/GRP78, PSAP, SDF2L1): their only N-glyco-arm support is search-space mass-coincidence on non-sequon peptides (see N-glyco-arm control section above), and none are sampled by the O-Pair arm in this dataset, so the workflow's automatic surfacing of them is not supported by this evidence and they would mislead the reader if presented as ADC candidates.
pLDDT and ipTM are Boltz-2's per-fold confidence: pLDDT is mean per-residue predicted local distance difference test (range 0–1, ≥ 0.7 typically reliable); ipTM is interface predicted template modeling score for the protein+ligand complex (range 0–1, ≥ 0.6 commonly considered a tight pocket).
| Rank | UniProt | Gene | Surface? | Mean SASA | pLDDT | ipTM (L←P) | Free thiols | NXS/T | HCC LFC | Score | Note |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | P22455 | FGFR4 | ✅ Cell membrane (TM) | 44.7 | 0.651 | 0.763 | 6 | 6 | 2.3 | 0.688 | Active clinical oncology target (FGF401 / roblitinib, a small-molecule FGFR4 kinase inhibitor rather than an ADC, + 5+ Phase I/II programs). Surfaced via the O-glyco arm (Tn on S/T) + the Boltz-2 Tn-ligand co-fold. UniProt annotates 5 N-glyco sites for FGFR4; none are sampled at quality in this Tn-enriched dataset. |
| 2 | P80370 | DLK1 | ✅ Cell membrane (TM) | 74.1 | 0.725 | 0.728 | 2 | 6 | 1.8 | 0.655 | Active clinical ADC target (ADCT-701, REGN-7325) |
| 3 | Q9UKM7 | MAN1B1 | ⚠ resident-enzyme (ER mem) | 40.0 | 0.745 | 0.914 | 3 | 0 | 0.1 | 0.653 | α-1,2-mannosidase — luminal-facing catalytic site, not ADC-tractable |
| 4 | P02649 | APOE | ⚠ plasma-soak (secreted) | 61.4 | 0.686 | 0.963 | 2 | 0 | −0.2 | 0.650 | Apolipoprotein E — abundant in plasma/lipid particles, systemic ADC sink |
| 5 | Q6ZXV5 | TMTC3 | ⚠ resident-enzyme (11 TM, ER) | 44.1 | 0.823 | 0.528 | 14 | 5 | 0.3 | 0.627 | O-mannosyltransferase 3 — Golgi/ER catalytic side, not surface |
| 6 | Q10469 | MGAT2 | ⚠ resident-enzyme (Golgi) | 46.4 | 0.829 | 0.703 | 2 | 3 | −0.2 | 0.571 | α-1,6-mannosyl-glycoprotein 2-β-N-acetylglucosaminyltransferase (EC 2.4.1.143; GlcNAc-T II) — resident Golgi enzyme |
| 7 | P05154 | SERPINA5 | ⚠ plasma-soak (secreted) | 40.8 | 0.801 | 0.622 | 2 | 3 | −0.6 | 0.557 | Serpin family plasma protein — systemic sink |
| 8 | P02671 | FGA | ⚠ plasma-soak ⚠ low-conf | 40.8 | 0.506 | 0.266 | 5 | 4 | −1.5 | 0.540 | Fibrinogen α — abundant plasma protein; ipTM 0.27 is below the 0.5 confidence floor |
| 9 | Q14126 | DSG2 | ⚠ low-conf (✅ cell-junction) | 47.6 | 0.568 | 0.402 | 4 | 10 | 0.9 | 0.530 | Active discovery-stage ADC target; ipTM 0.40 below floor — biology is real, fold confidence low (consider EThcD spectra for v3) |
| 10 | P02751 | FN1 | ⚠ plasma-soak ⚠ OOM | — | — | — | — | 13 | 0.7 | 0.198 | Fibronectin, 2,477 residues — OOM'd on L40S; longest sequence in batch |
| — | Q9HC35 | EML4 | ❌ Cytoplasm / cytoskeleton | 34.0 | 0.708 | 0.535 | 19 | 8 | 0.0 | 0.000 | Hard gate: cytoskeletal |
| — | Q13428 | TCOF1 | ❌ Nucleus / nucleolus | 13.9 | 0.338 | 0.190 | 4 | 6 | 0.3 | 0.000 | Hard gate: nuclear; low-confidence fold (pLDDT 0.34) corroborates |
Mean SASA: Ų per residue (Bio.PDB.SASA Shrake-Rupley, probe 1.40 Å, 100 points per atom). ipTM (L←P): Boltz-2's global iptm, which in this single-protein + single-ligand schema equals both ligand_iptm and pair_chains_iptm[ligand][protein] (the ligand-from-protein direction; "how well did the protein context place the ligand"). The opposite direction pair_chains_iptm[protein][ligand] ("how much does the ligand stabilize the protein structure") is much lower across all our folds (e.g. APOE 0.53, FGFR4 0.23, DLK1 0.13, FGA 0.11) because a single GalNAc doesn't structurally stabilize a 300-800-residue protein. Boltz-2's ipTM was trained primarily on protein-protein interfaces; for protein + small-ligand inputs, treat the absolute number as a pose-confidence proxy alongside the surface and plasma-soak flags, not as a standalone "ligand affinity" signal. HCC LFC: log2 fold-change TCGA-LIHC (hepatocellular carcinoma) vs GTEx liver, chosen because HepG2-SC is the HCC-derived line that drives the most Level-1 PSMs in this run; per-tissue LFCs for HEK293-SC (kidney) and Capan1-SC (PAAD) are queued for v3 once the UCSC Xena pull is wired into the epilogue.
FGFR4 (#1) and DLK1 (#2), both targets with active clinical programs, were surfaced automatically from raw mass spec without prior knowledge of which proteins to look for. The v2-formula score places them directly above six false-positive-class proteins (ranks 3-8); the v3 sub-class gate (filtering ⚠ resident-enzyme + ⚠ plasma-soak rows) would float DSG2 to #3 and demote ranks 3-8 into a "known false positive class" section. Until then, ranks 3-10 carry their ⚠ tags so a reader sees the false positives at the same scroll as the ranking.
The hard gate works. EML4 (cytoskeletal) and TCOF1 (nucleolar) both return MM glyco-PSMs but score zero because they're physically unreachable by an antibody. TCOF1's pLDDT 0.34 / ipTM 0.19 confirms Boltz-2 is itself uncertain about the nuclear protein in a surface-fold context.
Cost and wall-clock — the canonical numbers
| Step | Hardware | Wall-clock | Spot cost |
|---|---|---|---|
| 1: raw → mzML+MGF | 16-vCPU CPU (c6i.4xlarge spot) | 9:02 | ~$0.03 |
| 2: Tn-specific oxonium QC | 16-vCPU CPU | 23:02 (cold; ~2 min warm) | ~$0.10 |
| 3a: MetaMorpheus N-glyco (control) | 16-vCPU CPU | 41:21† | ~$0.18 |
| 3b: MetaMorpheus O-Pair | 16-vCPU CPU | 52:21† | ~$0.22 |
| 4: Casanovo (L4) | NVIDIA L4 (g6.xlarge spot) | 19:17† | ~$0.10 |
| 5: Boltz-2 glyco-co-fold + druggability | NVIDIA L40S (g6e.4xlarge spot) | 16:12 | ~$0.65 |
| Total wall-clock (Steps 3a+3b+4 in parallel) | | 1h 40m | ~$1.28 |
| Total compute (sum across all CPU + GPU nodes) | | 2h 41m | — |
† Steps 3a, 3b, and 4 are independent — each consumes Step 2's outputs and emits into Step 5's prologue — so the workflow runs them concurrently. The parallel-block wall-clock is max(41:21, 52:21, 19:17) = 52:21, dominated by O-Pair. End-to-end wall-clock is therefore 9:02 + 23:02 + 52:21 + 16:12 = 1h 40m 37s. Sum-of-compute (the number to give the AWS bill-payer) is 2h 41m. Subsequent samples on the same workflow re-use the cached MM index + Casanovo conda env + Boltz-2 weights cache and land at ~30 min wall-clock per sample.
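The wall-clock and sum-of-compute figures follow from the per-step timings in the table. A short sketch of the arithmetic (helper names are illustrative; timings and costs are copied from the table):

```python
def to_seconds(mmss):
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def fmt(total_seconds):
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}h {minutes:02d}m {seconds:02d}s"


serial_before = ["9:02", "23:02"]            # Steps 1 and 2
parallel_block = ["41:21", "52:21", "19:17"]  # Steps 3a, 3b, 4
serial_after = ["16:12"]                      # Step 5

# Critical path: serial steps plus the slowest member of the parallel block.
wall_clock = (
    sum(map(to_seconds, serial_before))
    + max(map(to_seconds, parallel_block))
    + sum(map(to_seconds, serial_after))
)
# Billable compute: every step counted, parallel or not.
sum_compute = sum(map(to_seconds, serial_before + parallel_block + serial_after))
spot_cost = round(sum([0.03, 0.10, 0.18, 0.22, 0.10, 0.65]), 2)
```

Here fmt(wall_clock) gives 1h 40m 37s and fmt(sum_compute) gives 2h 41m 15s, matching the table totals.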
Spot prices quoted: us-east-1, lowest of 6 AZs, 22 May 2026 (c6i.4xlarge ~$0.13/h, g6.xlarge ~$0.30/h, g6e.4xlarge ~$2.40/h spot). g5 (A10G) spot was placement-score 1/10 region-wide that day — Casanovo was originally pinned to g5.xlarge with --gres=gpu:a10g:1 and pended; switching to --gres=gpu:l4:1 let Karpenter pick g6.xlarge (L4, same 24 GB VRAM, generally healthier spot inventory) and the job ran ~20 min instead of pending forever. The template default is now untyped --gres=gpu:1 with a model floor only for VRAM-bound jobs.
Reproducibility block
- Dataset: PRIDE PXD011063. Three .raw files (Orbitrap Fusion, HCD-DDA, 2h gradient, 60k MS1 / 15k MS2). MD5s:
  - HEK293_SC_VVA.raw: cc725b596b9bcee2286bfd2c6d2eae90 (932 MB)
  - HepG2_SC_NeuVVA.raw: 3f20bfe4c7900ef516943b48b6f415b5 (995 MB)
  - Capan1_SC_VVA.raw: 70b45026deda5150a01ea5eb0bdef213 (792 MB)
- FASTA: UniProt SwissProt human, 20,659 target + 20,659 decoy = 41,318 proteins. MD5 dd3b8d5b02702bda97a395adfb7b5dac. Snapshotted 2026-05-08; no cRAP / MaxQuant-contaminants appended (TODO for v3).
- Decoy strategy: MM 1.1.7 concatenated target-decoy, decoys generated by sequence reversal with subsequent homology-scramble (MM reports "20,659 Target Proteins / 20,659 Decoy Proteins / 18,142 Total decoy Proteins scrambled due to homology" for this FASTA).
- MetaMorpheus: v1.1.7, pre-built CMD.dll at /mnt/efs/refs/metamorpheus/1.1.7/CMD.dll.
  - 3a: GlycoSearchType = "NGlyco" against MM-bundled NGlycan.gdb (182 N-glycans). MaximumNGlycanAllowed = 2, OxoniumIonFilt = true. (Run as an orthogonal enrichment-specificity control on this Tn-O-glyco-enriched dataset, not as the headline biology arm; see the dedicated N-glyco-arm control section above.)
  - 3b: GlycoSearchType = "OGlyco" against curated tumor_o_glycans.gdb (22 entries). MaximumOGlycanAllowed = 3.
  - Both: DissociationType = "HCD", PrecursorMassTolerance = "10 PPM", ProductMassTolerance = "20 PPM". MM-default digestion (trypsin/P, full cleavage, 2 missed cleavages, 7-50 aa peptide length). MM-default fixed mod Carbamidomethyl on C; variable mod Oxidation on M.
- Casanovo: v5.0.0, default tryptic checkpoint casanovo_v5_0_0_v5_0_0.ckpt (auto-downloaded from GitHub release, MassIVE-KB-trained tryptic model). Non-enzymatic checkpoint casanovo_v5_0_0_non_enzy_v5_0_0.ckpt is available and may yield more non-tryptic glycopeptide backbones; queued for v3. Torch from PyTorch cu121 index (torch 2.4.x). Conda env at /mnt/efs/<user>/_casanovo-conda-env/ with rdkit + pip-installed casanovo via PYTHONUSERBASE.
- Boltz-2: v2.2.1 + cuequivariance-ops-torch-cu13. MSA via Boltz's hosted api.colabfold.com. --model boltz2 --recycling_steps 3 --diffusion_samples 1 --use_msa_server. SASA via biopython 1.85 Bio.PDB.SASA.ShrakeRupley. Per-fold confidence read from predictions/<acc>/confidence_<acc>_model_0.json.
- UniProt subcellular cache: REST API snapshot taken at fold time, persisted at /mnt/efs/refs/uniprot/subcell_cache.json. Re-running the epilogue uses the cache (deterministic).
- Cross-line dedup: per-PSM rows are first collapsed by (Protein Accession, Base Sequence, Plausible GlycanComposition) within each .raw file, then the unique set is counted per file. "In all 3 cell lines" = the same (Protein, BaseSequence) was identified in all three .raw files (glycan composition may differ).
- Workflow templates: core/templates/definitions/hcls/proteomics/{glyco-input-prep, glycounter-qc, metamorpheus-n-glyco, metamorpheus-o-pair, casanovo-glyco}.yaml and core/templates/definitions/hcls/structure-prediction/boltz2-glycofold.yaml at commit 4ff5e5d.
- Data availability: Per-protein Boltz-2 outputs (.cif + confidence_*.json), the prologue's tiers.json, the druggability summary.tsv, and the MetaMorpheus AllPSMs.psmtsv outputs are queued for a Zenodo deposit (DOI pending; reach out at the email below if you need a pre-publication tarball).
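The cross-line dedup rule in the list above can be sketched as a two-stage grouping. This is an illustration under assumptions: PSM rows are represented as dicts with hypothetical keys (file, accession, base_sequence, glycan), and the example rows are made up.

```python
from collections import defaultdict


def cross_line_summary(psms):
    """Stage 1: collapse PSMs to unique (accession, sequence, glycan) per file.
    Stage 2: report (accession, sequence) pairs seen in every file,
    ignoring glycan composition."""
    per_file = defaultdict(set)
    for p in psms:
        per_file[p["file"]].add((p["accession"], p["base_sequence"], p["glycan"]))
    unique_per_file = {f: len(keys) for f, keys in per_file.items()}

    files_by_peptide = defaultdict(set)
    for f, keys in per_file.items():
        for accession, sequence, _glycan in keys:
            files_by_peptide[(accession, sequence)].add(f)
    n_files = len(per_file)
    in_all = sorted(k for k, fs in files_by_peptide.items() if len(fs) == n_files)
    return unique_per_file, in_all


# Made-up rows for illustration only.
rows = [
    {"file": "HEK293", "accession": "P1", "base_sequence": "AAK", "glycan": "N1"},
    {"file": "HEK293", "accession": "P1", "base_sequence": "AAK", "glycan": "N1"},
    {"file": "HepG2", "accession": "P1", "base_sequence": "AAK", "glycan": "N2"},
    {"file": "Capan1", "accession": "P1", "base_sequence": "AAK", "glycan": "N1"},
    {"file": "HepG2", "accession": "P2", "base_sequence": "GGR", "glycan": "N1"},
]
unique_per_file, in_all = cross_line_summary(rows)
```

Dropping the glycan from the second key is what lets a peptide count as "in all 3 cell lines" even when its composition differs between lines.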
Open-source toolchain
| Tool | License | Role |
|---|---|---|
| ThermoRawFileParser 1.4.5 | Apache-2.0 | Thermo .raw → mzML / MGF |
| pyteomics (Tn-specific oxonium scanner) | Apache-2.0 | QC + scan filter |
| MetaMorpheus 1.1.7 | MIT | O-Pair search (headline) + N-glyco arm (enrichment-specificity control) |
| Casanovo 5.0 (Noble Lab) | Apache-2.0 | De novo peptide sequencing |
| Boltz-2 2.2.1 (MIT lab) | MIT | Protein + glycan ligand co-folding |
| cuequivariance-torch / cuequivariance-ops-torch-cu13 | Apache-2.0 | Boltz-2 triangle-attention kernel |
| Bio.PDB.SASA (biopython, Shrake-Rupley) | BSD-3 | Per-residue SASA |
| UniProt REST API | CC-BY | Subcellular location + signal peptide + TM count |
No pGlyco3 (email-license gate). No Byonic. No MSFragger glyco (academic-license). No GlyCounter (Windows WPF, abandoned 2024). No DeepGlyco (Windows-only). No AlphaFold-Multimer commercial license — Boltz-2 is MIT and covers the same use cases (with a richer ligand schema; AF-M is still competitive on multimer-only accuracy in some CASP15 subsets, but for protein+glycan co-folding Boltz-2 wins on licensing alone). Every tool here runs on Linux from pre-built containers or pip-installable wheels.
Honest gaps — what's next
- Lift the ⚠ resident-enzyme + ⚠ plasma-soak flags from soft annotations into hard-zero gate sub-classes. The v2 rank table already tags MAN1B1 / MGAT2 / TMTC3 (resident Golgi/ER enzymes) and APOE / FGA / SERPINA5 / FN1 (secreted plasma proteins) with ⚠ markers and explains them inline — but they still appear at competitive scores in the ranked output. v3 turns those tags into hard-zero gate decisions (alongside the cytoplasm/nucleus rejection) so the ranking presents only legitimately ADC-tractable candidates at the top. FGFR4, DLK1, and DSG2 stay above the line; the false-positive classes move into a separate "passed soft filter, failed hard sub-class" section.
- Multimer folds for in-vivo dimer biology. FGFR4 dimerises on ligand binding; DSG2 is a desmosomal heterodimer; APOE dimerises in HDL. Monomer SASA over-reports accessibility at residues that would be buried at the dimer interface. Boltz-2's YAML supports multimer inputs; wiring the top 3 as bio-assembly dimers + Tn ligand is the obvious next iteration (~30 min extra compute).
- Site-resolved SASA at the localized Ser/Thr. Today the SASA is per-residue mean over the whole structure. The MM Level-1 column gives the exact glycosite; the v3 epilogue threads that residue through to a per-site SASA + glycan-pose check.
- Full TCGA LFC table via a one-time UCSC Xena pull (TcgaTargetGtex_RSEM_Hugo_norm_count) cached as EFS parquet, instead of the ~15-entry hand-curated dictionary that returns 0 for most folded proteins today.
- Broader glycan SMILES library. The current set covers Tn / sTn / T. Extending to core-2 (Galβ1-3[GlcNAcβ1-6]GalNAc) and poly-LacNAc lets us co-fold the N2 / N3 / H1N2 compositions instead of falling back to Tn-only.
- Site-localization figure. A Boltz-2 cartoon of FGFR4 colored by SASA with the Tn ligand placed at the MM Level-1 Ser/Thr is the one image this post needs and doesn't have yet. Queued for v3.
- Tier-B Casanovo-only candidates. The 1,279 high-confidence (≥ 0.90) Casanovo backbones that don't intersect any MM PSM deserve BLAST-vs-FASTA enrichment — those are the glycoproteins MetaMorpheus missed entirely because the peptide isn't tryptic-canonical or carries a modification MM didn't declare.
- Casanovo non-enzymatic checkpoint. The default tryptic checkpoint is fine for canonical-tryptic glycopeptides; the non-enzymatic checkpoint (also Apache-2.0, same release) may pull in additional non-tryptic backbones that the current model misses. Drop-in change; queued.
- MGF subset filter bug in glycounter-qc. The v2 QC writes a passing_scans.mgf for downstream Casanovo input, but the scan-ID match between mzML and MGF is too loose, so Casanovo ends up running on the full MGF, not the 41% subset. The 19-min Casanovo wall-clock would drop to ~8 min once this is fixed; queued.
- v1 footguns now codified in source. Two real template bugs found during this run: (1) Boltz-2 prologue emitted id: <UniProt> but Boltz requires single-letter chain IDs; (2) MM template had NGlycanDatabasefile = "" which silently fell back to plain peptide search. Both fixed in the templates at commit 4ff5e5d.
How to run this on your data
The glyco-mapps-complete workflow is in early access on Clusterra. If you have COSMC-KO SimpleCell, lectin-enriched (VVA / Jacalin), or chemoenzymatic (EXoO-Tn) glycoproteomics DDA data and want to run this on your own samples — including with TMT-multiplexed input, EThcD spectra (Orbitrap Fusion Lumos / Eclipse / Astral with proper localization on long Ser-Thr stretches), or timsTOF .d acquisitions — reach out at hello@clusterra.cloud. The workflow runs in your own AWS account against your data; Clusterra never sees the raw files.
Author: Nikhil Tahalramani, Clusterra (nikhil@clusterra.cloud; ORCID pending). Workflow templates at commit 4ff5e5d. Per-protein Boltz-2 confidence JSONs, MM AllPSMs.psmtsv, druggability TSV, and the full set of .cif outputs available on request — Zenodo DOI pending and will be linked here on assignment. Container OCI/SHA256 digests for the pre-built ThermoRawFileParser + MetaMorpheus + python:3.11 SIFs are not currently captured at run time; that one-line apptainer inspect step is queued for v3 of the templates so digests land in summary.tsv alongside the wall-clock and cost. Figure: a Boltz-2 cartoon of FGFR4 colored by per-residue SASA with the Tn ligand at the MM Level-1 Ser/Thr is the one image this post should land before publication — queued for v3 (requires a render env with PyMOL/ChimeraX, not yet wired into the workflow).