2026-05-21

DIA proteomics at cohort scale in your own AWS

AlphaDIA identifies 8,591 proteins from an Orbitrap Astral DIA run in 13 minutes, and a 50-sample cohort finishes in the same 13 minutes of compute (~16–18 minutes including node start-up), with every sample on its own node and no queue. Open-source, ~$0.05/sample, in your own AWS account.

Published May 21 2026. Benchmark run May 21 2026. Input: Orbitrap Astral HeLa 200 ng DIA, 21-min gradient. Library: Mann lab alphaDIA tutorial spectral library (113,978 target precursors, 9,889 proteins). AlphaDIA v2.1.1, CPU-only. Container: mannlabs/alphadia:2.1.1. Output: precursors.parquet, pg.matrix.parquet, peptide.matrix.parquet.

AlphaDIA processes one DIA raw file in 13 minutes. On Clusterra, a 50-sample cohort takes the same 13 minutes of compute, plus a few minutes for nodes to start: every sample runs simultaneously on its own dedicated cloud node, and you pay only for the compute that ran.

The analysis bottleneck in high-throughput DIA proteomics isn't the mass spectrometer. DIA search is inherently per-sample: each raw file is processed independently, and most labs end up queueing them sequentially on a single workstation or waiting 24–48 hours for a slot on a shared university cluster. A 50-sample project that takes 13 minutes per sample ties up a workstation for nearly 11 hours. A plate of 48 single cells that finishes on the instrument overnight yields results the following afternoon at best — assuming no queue backlog. Clusterra removes this bottleneck entirely, whether the cohort is a bulk DIA screening run or a single-cell plate.

TL;DR (n=1, May 21 2026)

8,591 proteins identified at 1% FDR (target-decoy, AlphaDIA two-pass) from a single Orbitrap Astral DIA run.
105,414 precursors and 95,777 peptides quantified.
13 min 2 sec from raw file to parquet output.
~$0.05 per sample on AWS spot compute.
Up to 50× throughput gain: a 50-sample cohort completes in ~13 min of compute (~16–18 min including node start-up) instead of ~10.9 hours, at identical total cost (~$2.70 either way).
No software setup: AlphaDIA v2.1.1 runs from a pre-built container — no conda, no pip, no environment management.
All open-source. AlphaDIA (Mann lab). No license fees, no per-sample API costs. Compute runs in your own AWS account.

The DDA → DIA shift: why it matters at any input scale

For most of the last decade, single-cell proteomics (SCP) relied on data-dependent acquisition (DDA): the mass spectrometer selects the top-N most abundant precursor ions for fragmentation in each cycle. At nanogram single-cell input, DDA identifies roughly 500–1,500 proteins per cell, and stochastic precursor selection means different cells get different peptides fragmented — making quantitative comparison across cells noisy. At bulk 200 ng input, DDA on a modern Orbitrap typically reaches ~2,000–4,000 proteins per run.

Data-independent acquisition (DIA) eliminates precursor selection. The instrument fragments all ions in every predefined isolation window on every cycle, so every sample gets identical fragmentation coverage. Combined with deep spectral libraries and tools like AlphaDIA, DIA on an Orbitrap Astral identifies roughly 2,000–4,500 proteins per cell at 0.1–1 ng input and 6,000–9,000 proteins at 200 ng bulk input: a meaningful depth increase over DDA at equivalent gradient length, with substantially better quantitative reproducibility across samples.

The instruments that make this practical — Orbitrap Astral (Thermo), timsTOF SCP (Bruker) — became widely available in 2023–2024. The bioinformatics infrastructure to analyse the resulting data at cohort scale is what most labs are still catching up on.

What AlphaDIA does

AlphaDIA is the Mann lab's open-source DIA search engine. The capabilities that matter most for cohort-scale analysis:

Library prediction from FASTA. For novel cell types with no published spectral library, AlphaDIA's built-in PeptDeep predicts retention times, ion mobilities, and fragment intensities for every tryptic peptide in the proteome. This one-time prediction step (~90 min for a human proteome) produces a library you reuse across all subsequent runs of that cell type.
Pre-built library reuse. For established cell lines and well-characterised cell types (HeLa, Jurkat, primary T cells), the Mann lab and others publish ready-made spectral libraries. Loading one replaces the prediction step with ~1 minute of library load. This benchmark used the Mann lab's publicly available HeLa hybrid library.
Match-between-runs (MBR). After the initial per-sample search, AlphaDIA automatically performs within-run refinement; this is included in the 13-minute benchmark. For cohort analysis, an optional cross-sample MBR step, run after all samples in the cohort are complete, transfers high-confidence identifications across samples to recover precursors that were below threshold in individual runs. The 15–30% per-cell coverage improvement reported in the literature refers specifically to this cross-sample step.
Parquet output. Results are written as precursors.parquet, peptide.matrix.parquet, and pg.matrix.parquet — directly loadable in pandas, polars, or R for downstream statistical analysis.

Benchmark setup

Input data. One Orbitrap Astral DIA run: 200 ng HeLa digest, 21-minute LC gradient, label-free. This is a bulk benchmark; true single-cell runs at 0.1–1 ng require an instrument optimised for nanogram-sensitivity DIA (Astral or timsTOF SCP). The AlphaDIA analysis pipeline is identical at both input scales — protein identification depth is determined by the instrument and input mass, not by the analysis software or compute.

Spectral library. Mann lab pre-built HeLa hybrid library (94 MB HDF5), released alongside the alphaDIA tutorial. 113,978 target precursors covering 9,889 proteins. A pre-built library reflects production usage: built once per cell type, staged to shared storage, reused across every run in a cohort.

Software. AlphaDIA v2.1.1 (mannlabs/alphadia:2.1.1), run from a pre-built container. No installation, no virtual environment management. The container image is cached on shared storage after the first run.

Compute. A single standard CPU node on AWS (16 vCPU, 64 GB RAM), 4 threads allocated to AlphaDIA. No GPU required. Runs in the customer's own AWS account; raw files and results stay in their storage.

Results

Identifications (n=1 sample)

Metric	Value
Proteins (1% FDR, target-decoy)	8,591
Precursors (1% FDR)	105,414
Peptides	95,777
Library input (target precursors)	113,978
Library input (proteins)	9,889
Precursor identification rate	92.5%
Protein identification rate	86.9%
MS2 search tolerance (post-optimisation)	10 ppm
RT FWHM	3.0 sec
DIA cycle length	301 isolation windows
DIA cycle duration	1.55 sec
Total DIA cycles	806

8,591 proteins at 1% FDR from a 21-minute gradient is consistent with published Orbitrap Astral benchmarks (typically 7,000–9,000 proteins at 200 ng input). The 92.5% precursor identification rate reflects a library built on matched instrument and gradient conditions, a practical ceiling for a well-characterised cell line. On novel cell types or shorter gradients, match rates will be lower; this is normal and does not indicate a pipeline problem.
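The identification rates in the table follow directly from the reported counts. A minimal Python sketch of that arithmetic, using only figures from the results table (the variable and function names are illustrative, not part of AlphaDIA):

```python
# Counts taken from the results table (n=1 sample).
LIBRARY_PRECURSORS = 113_978
LIBRARY_PROTEINS = 9_889
IDENTIFIED_PRECURSORS = 105_414
IDENTIFIED_PROTEINS = 8_591


def identification_rate(identified: int, library_size: int) -> float:
    """Percentage of library entries identified at 1% FDR."""
    return 100.0 * identified / library_size


precursor_rate = identification_rate(IDENTIFIED_PRECURSORS, LIBRARY_PRECURSORS)
protein_rate = identification_rate(IDENTIFIED_PROTEINS, LIBRARY_PROTEINS)

# Acquisition time implied by the cycle statistics: 806 cycles x 1.55 s each.
acquisition_min = 806 * 1.55 / 60

print(f"precursor rate: {precursor_rate:.1f}%")    # 92.5%
print(f"protein rate:   {protein_rate:.1f}%")      # 86.9%
print(f"acquisition:    {acquisition_min:.1f} min")  # 20.8 min
```

The implied acquisition time of about 20.8 minutes lines up with the stated 21-minute gradient, which is a quick sanity check that the cycle figures and the gradient length describe the same run.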

Analysis time and cost

Metric	Value
Total wall-clock (1 sample)	13 min 2 sec
— library load + prep	1 min 50 sec
— AlphaDIA core search	3 min 46 sec
— Within-run library refinement + output write	7 min 26 sec
Cost per sample	~$0.05
50-sample cohort, sequential	~10 hr 50 min, ~$2.70
50-sample cohort, parallel (Clusterra)	~16–18 min total, ~$2.70*

*13 min of AlphaDIA compute plus ~3–5 min for cloud nodes to provision on first submission. The total cost is identical whether you run sequentially or in parallel — you pay for 50 × 13 node-minutes either way. What changes is wall-clock: the cohort finishes in the time it takes to analyse one sample.
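The sequential and parallel figures can be reproduced with a few lines of arithmetic. The sketch below uses only the numbers reported in this post and assumes, for simplicity, that a sequential run has no idle time between samples and that every parallel node takes the same time to provision:

```python
from datetime import timedelta

PER_SAMPLE = timedelta(minutes=13, seconds=2)  # measured wall-clock, n=1
N_SAMPLES = 50
COHORT_COST_USD = 2.70                         # reported 50-sample total
PROVISIONING_MIN = (3, 5)                      # reported cold-start range


def sequential_time(n_samples: int, per_sample: timedelta) -> timedelta:
    """One workstation, samples processed back to back."""
    return n_samples * per_sample


def parallel_time(per_sample: timedelta, provisioning_min: int) -> timedelta:
    """One node per sample: a single sample's runtime plus node start-up."""
    return per_sample + timedelta(minutes=provisioning_min)


seq = sequential_time(N_SAMPLES, PER_SAMPLE)
par_low = parallel_time(PER_SAMPLE, PROVISIONING_MIN[0])
par_high = parallel_time(PER_SAMPLE, PROVISIONING_MIN[1])
cost_per_sample = COHORT_COST_USD / N_SAMPLES

print(seq)                         # 10:51:40
print(par_low, par_high)           # 0:16:02 0:18:02
print(f"${cost_per_sample:.3f}")   # $0.054
print(f"{seq / par_high:.0f}x to {seq / par_low:.0f}x wall-clock speed-up")
```

Counting node start-up, the wall-clock speed-up works out to roughly 36x to 41x; the 50x figure applies to compute time alone. The per-sample cost of about $0.054 is what the rounded ~$0.05 headline number corresponds to.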

From one sample to a whole cohort

A typical DIA project — 50 samples from a screening campaign, or 48 single cells from a tumor plate — takes 10–12 hours to acquire on the instrument. The bioinformatics step, run sequentially on a workstation, adds another 10+ hours. On a university HPC queue, the same jobs may wait 24–48 hours for a slot. Either way, the analyst's day is blocked and the PI waits until the following afternoon at best.

On Clusterra, each sample in the cohort runs simultaneously on its own dedicated cloud node. You configure your AlphaDIA settings once — spectral library path, search parameters, output directory — and Clusterra runs one independent search per sample in parallel. All nodes start together, all finish together, and all are automatically shut down and billed only for what ran. The analysis that would tie up a workstation for most of a working day completes in a coffee break.
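Conceptually, "configure once, run one search per sample" amounts to expanding a shared configuration into one independent job per raw file. The sketch below illustrates the idea only: the dictionary keys, paths, and library file name are hypothetical and do not reflect Clusterra's or AlphaDIA's actual job schema.

```python
from pathlib import PurePosixPath


def build_job_manifest(raw_files, library, output_root):
    """Return one independent job description per raw file.

    Every job shares the same spectral library but writes to its own
    output directory, so no job depends on any other.
    """
    jobs = []
    for raw in sorted(raw_files):
        sample = PurePosixPath(raw).stem  # "sample_01" from "sample_01.raw"
        jobs.append(
            {
                "sample": sample,
                "raw_file": raw,
                "library": library,
                "output_dir": str(PurePosixPath(output_root) / sample),
            }
        )
    return jobs


# Hypothetical shared-storage layout for a 50-sample cohort.
raw_files = [f"/shared/raw/sample_{i:02d}.raw" for i in range(1, 51)]
jobs = build_job_manifest(
    raw_files,
    library="/shared/libraries/hela_hybrid.hdf",
    output_root="/shared/results",
)

print(len(jobs), jobs[0]["output_dir"])  # 50 /shared/results/sample_01
```

Because the jobs share nothing but a read-only library, they can all start at once; that independence is what makes the cohort finish in the time of a single sample.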

The key point is that Clusterra is not faster because of different software — it runs the same AlphaDIA v2.1.1 you'd run locally. It is faster because there is no queue, no shared workstation bottleneck, and no idle time between samples. The same setup works identically for a 50-sample bulk screening project and a 48-cell single-cell plate: the biology and scale change, the analysis workflow does not.

DIA depth × parallelism: the compounded advantage

The practical case for DIA on Clusterra combines two independent improvements:

Dimension	DDA (sequential, local)	DIA on Clusterra (parallel)
Proteins per sample — bulk (200 ng)	~2,000–4,000	~6,000–9,000
Proteins per sample — single cell (0.1–1 ng, Astral)	~500–1,500	~2,000–4,500
Cohort analysis time (50 samples)	~10–12 hr sequential	~16–18 min parallel
Quantitative reproducibility across cells	Stochastic (different peptides per cell)	Consistent (same windows every sample)

More proteins per sample and same-day turnaround per cohort make DIA a practical tool for routine biology rather than a one-off heroic experiment. A lab running 3 plates per week can close the analysis loop the same day instead of accumulating a backlog that takes a postdoc a week to process.

The full pipeline

A complete DIA proteomics run on Clusterra has three stages:

#	Stage	What it does	Output
1	Raw file conversion (one-time per plate)	Converts Thermo .raw files to open mzML format via ThermoRawFileParser	mzML files on shared storage
2	DIA search — this benchmark	AlphaDIA runs independently per sample; all samples run simultaneously	precursors.parquet, pg.matrix.parquet per sample
3	Cross-sample MBR (optional)	Transfers high-confidence IDs across all samples to improve per-sample coverage	Updated per-sample parquet files with rescued identifications

Stage 2 is where all the parallelism happens: one independent search per sample. Stage 1 is a fast one-time conversion (~2 min per raw file). Stage 3 is optional and runs after all samples in stage 2 complete; it is most valuable for single-cell cohorts where individual cell coverage is sparse and cross-sample transfer meaningfully rescues identifications.
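The dependency between stage 2 and stage 3 can be expressed as a simple gate: the cross-sample step may start only once every sample that has not been explicitly excluded is done. A minimal sketch, assuming hypothetical status labels ("done", "running", "failed") rather than any real scheduler's states:

```python
def mbr_ready(statuses, excluded=frozenset()):
    """Decide whether the optional cross-sample step can start.

    statuses maps sample name -> "done", "running" or "failed".
    Samples listed in `excluded` are ignored.
    Returns (ready, blocking_samples).
    """
    blocking = sorted(
        name
        for name, state in statuses.items()
        if name not in excluded and state != "done"
    )
    return (len(blocking) == 0, blocking)


statuses = {"s01": "done", "s02": "failed", "s03": "done"}

print(mbr_ready(statuses))                    # (False, ['s02'])
print(mbr_ready(statuses, excluded={"s02"}))  # (True, [])
```

A failed sample blocks the step until it is either rerun successfully or dropped from the cohort, which is the trade-off to keep in mind when deciding whether to wait for a rerun.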

For established cell lines with a pre-built library, stage 2 costs ~$0.05/sample and runs in 13 minutes. For novel cell types, a one-time PeptDeep library prediction step (~90 min) generates a library that is reused across all subsequent cohorts at no additional cost.

Honest caveats

This benchmark used 200 ng bulk HeLa, not true single cells. Single-cell input at 0.1–1 ng requires instruments optimised for nanogram-sensitivity DIA: Orbitrap Astral, timsTOF SCP, or similar. The AlphaDIA pipeline and analysis cost are identical at both scales; identification depth per cell is determined by the instrument and input mass. At true single-cell input on an Astral, expect 2,000–4,500 proteins per cell depending on gradient length and cell type. The parallelism and cost figures hold regardless of scale.
50-sample parallelism assumes 50 cloud nodes are available simultaneously. In most AWS regions this is reliably the case for standard CPU instance types. For very large cohorts (200+ samples), configuring fallback instance types ensures provisioning completes quickly even when one instance family is temporarily scarce.
MBR requires all samples in the cohort to finish first. The cross-sample identification transfer step (stage 3) cannot start until every sample search is complete. If one sample fails, MBR is delayed until it is rerun or excluded. AlphaDIA's alphastats tooling handles failed samples gracefully.
Library quality sets the identification ceiling. The 8,591 proteins here reflect a library built from deep-fractionated Astral data on a matched gradient. A library built from shallower data will yield lower coverage; that is an instrument and library curation decision, not a compute one.
This benchmark's 13 min includes ~2 min of raw file download. In production, raw files already reside on shared storage before the search starts; pure AlphaDIA runtime is ~11 minutes.
Cloud node startup adds ~3–5 minutes for a fresh cohort submission. The 13-min figure covers the per-sample job only and excludes node provisioning. When a cohort of 50 samples is submitted to a cold cluster, all 50 nodes provision simultaneously, adding ~3–5 minutes before analysis begins. Total time from submission to results: ~16–18 minutes.
Single-run benchmark (n=1). Wall-clock and identification counts are point measurements. Expect ±10% variability across identical runs.
Peak memory not instrumented. AlphaDIA documentation indicates ~16–32 GB RAM for a single DIA file with a pre-built library. The node used here (64 GB RAM) was not memory-constrained; smaller nodes have not been tested.
Write performance for large cohorts. The within-run refinement and output write phase (7 min 26 sec) includes writing parquet files to shared storage. For cohorts larger than 50 samples running simultaneously, shared storage throughput can become a bottleneck — provisioned throughput mode is recommended for sustained high-concurrency workloads.

Run it yourself

From the Clusterra console:

Stage your raw files. Upload your Thermo .raw files to your cluster's shared storage, or use the ThermoRawFileParser workflow to convert them to mzML first (~2 min per file, runs in the background).
Point to your spectral library. Upload a pre-built .hdf library to shared storage and note the path. To smoke-test the pipeline, use the Mann lab's public HeLa library (freely available from the alphaDIA tutorial repository).
Submit the AlphaDIA search. Open the alphadia-search workflow in the console, set the number of samples, and point it at your raw files and library. The workflow fills in the AlphaDIA configuration — search parameters, thread count, output paths, parquet output format. Submit. Clusterra provisions one node per sample automatically.
Optional: run cross-sample MBR. After all sample searches complete, submit a second dependent AlphaDIA run over the full cohort output directory. This recovers additional identifications across samples via match-between-runs. Load the resulting pg.matrix.parquet into Python or R for downstream analysis.
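Before submitting the optional cross-sample run, it can help to confirm that every sample actually produced its output files. The sketch below checks for the three parquet files named in this post; the one-subdirectory-per-sample layout is an assumption made for illustration, and the demo uses empty placeholder files in a temporary directory rather than real results.

```python
import tempfile
from pathlib import Path

# Output file names as listed in this post.
EXPECTED_OUTPUTS = (
    "precursors.parquet",
    "pg.matrix.parquet",
    "peptide.matrix.parquet",
)


def missing_outputs(cohort_dir):
    """Map each sample subdirectory to the expected files it lacks."""
    missing = {}
    sample_dirs = sorted(p for p in Path(cohort_dir).iterdir() if p.is_dir())
    for sample_dir in sample_dirs:
        absent = [n for n in EXPECTED_OUTPUTS if not (sample_dir / n).is_file()]
        if absent:
            missing[sample_dir.name] = absent
    return missing


# Demo: sample_01 is complete, sample_02 lacks its protein-group matrix.
with tempfile.TemporaryDirectory() as tmp:
    layout = {
        "sample_01": EXPECTED_OUTPUTS,
        "sample_02": ("precursors.parquet", "peptide.matrix.parquet"),
    }
    for sample, files in layout.items():
        sample_dir = Path(tmp) / sample
        sample_dir.mkdir()
        for name in files:
            (sample_dir / name).touch()
    report = missing_outputs(tmp)

print(report)  # {'sample_02': ['pg.matrix.parquet']}
```

An empty report means the cohort directory is complete; anything else names the samples to rerun or exclude first.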

Time to reproduce this benchmark from a fresh console session: under 3 minutes to configure and submit, then 13–18 minutes to results (13 minutes of compute plus up to ~5 minutes of node start-up on a cold cluster). Total cost: ~$0.05.

All tools are open-source. AlphaDIA is maintained by the Mann lab at the Max Planck Institute of Biochemistry (github.com/MannLabs/alphaDIA). No license fees, no per-sample API costs. Compute runs in your own AWS account — you see the exact cloud bills, with no managed-service markup.

Run it in your own AWS

DIA proteomics at cohort scale — and the rest of your HPC stack — runs on a managed Slurm cluster in your own AWS account: on spot, no cluster to stand up, no data egress. Start at clusterra.cloud, or email hello@clusterra.cloud.