Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning

In mass-spectrometry-based proteomics, the identification and quantification of peptides and proteins heavily rely on sequence database searching or spectral library matching. The lack of accurate predictive models for fragment ion intensities impairs the realization of the full potential of these approaches. Here, we extended the ProteomeTools synthetic peptide library to 550,000 tryptic peptides and 21 million high-quality tandem mass spectra. We trained a deep neural network, termed Prosit, resulting in chromatographic retention time and fragment ion intensity predictions that exceed the quality of the experimental data. Integrating Prosit into database search pipelines led to more identifications at >10× lower false discovery rates. We show the general applicability of Prosit by predicting spectra for proteases other than trypsin, generating spectral libraries for data-independent acquisition and improving the analysis of metaproteomes. Prosit is integrated into ProteomicsDB, allowing search result re-scoring and custom spectral library generation for any organism on the basis of peptide sequence alone.

A deep learning–based tool, Prosit, predicts high-quality peptide tandem mass spectra, improving peptide-identification performance compared with that of traditional proteomics analysis methods.
