UniSpec: A Deep Learning Approach for Predicting Energy-Sensitive Peptide Tandem Mass Spectra and Generating Proteomics-Wide In-Silico Spectral Libraries

In this report, we present UniSpec, an attention-based deep neural network designed to predict the complete collision-induced fragmentation of tryptic peptides, with the aim of improving peptide and protein identification in shotgun proteomics studies. We preprocessed spectral data from the peptide tandem mass spectral libraries compiled by the National Institute of Standards and Technology (NIST), using a preprocessing approach tailored to model development, which yielded high-quality, energy-consistent spectral datasets. By analyzing all annotated fragment ions present in these libraries, we constructed an extensive peptide fragment dictionary containing 7919 isotopic ions drawn from sequence, neutral loss, internal, iminium, and amino acid fragment ions. The streamlined dictionary-based spectral training data enable UniSpec to efficiently learn the complex intensity patterns of the various product ions, resulting in reliable spectral predictions for a wide range of unmodified and modified peptides. We evaluated the model’s accuracy by comparing its performance on training and test data across diverse peptide characteristics, including peptide class, charge state, and sequence length. The model attained median cosine similarity scores of 0.951 and 0.923 on the training and test data, respectively. Unlike existing deep learning models, which often overlook the substantial part of a peptide tandem mass spectrum that lies beyond the sequence b and y ion series, UniSpec can predict up to 75% of all measured fragment intensities (including unknown signals) in the raw experimental spectra. This is a marked advance over the 43.5% coverage achieved by b and y sequence ions alone in the NIST library spectra. To evaluate the model’s practical utility for predicting proteome-wide in-silico spectral libraries, we ran a benchmark test on a HeLa cell dataset. UniSpec showed a substantial overlap of peptide identifications with the widely used search engine MS-GF+ and with the NIST experimental spectral library, demonstrating its robust performance as a standalone peptide identification tool.
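The cosine similarity scores reported above compare a predicted spectrum with a measured one after both have been mapped onto the same fragment dictionary. The following Python sketch illustrates that idea under stated assumptions: the five-entry `ION_DICTIONARY`, its ion labels, the helper `to_vector`, and the intensity values are all hypothetical and chosen for illustration only. They do not reproduce the 7919-ion dictionary or the authors' implementation.

```python
import math

# Hypothetical toy dictionary; the real one described above has 7919 entries.
ION_DICTIONARY = ["b2", "y1", "y2", "y2-H2O", "IEA"]


def to_vector(peaks):
    """Map {ion label: intensity} onto the fixed dictionary order.

    Ions missing from the spectrum get intensity 0.0 (an assumption made
    for this sketch).
    """
    return [peaks.get(ion, 0.0) for ion in ION_DICTIONARY]


def cosine_similarity(predicted, observed):
    """Cosine similarity between two equal-length intensity vectors."""
    if len(predicted) != len(observed):
        raise ValueError("vectors must have the same length")
    dot = sum(p * o for p, o in zip(predicted, observed))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_o = math.sqrt(sum(o * o for o in observed))
    if norm_p == 0.0 or norm_o == 0.0:
        return 0.0  # convention for an empty spectrum in this sketch
    return dot / (norm_p * norm_o)


# Made-up intensities for one predicted and one measured spectrum.
predicted = to_vector({"b2": 0.4, "y1": 1.0, "y2": 0.7})
observed = to_vector({"b2": 0.5, "y1": 1.0, "y2": 0.6, "y2-H2O": 0.1})
score = cosine_similarity(predicted, observed)
print(f"cosine similarity: {score:.3f}")
```

Because both spectra share one fixed vector layout, any fragment type in the dictionary (not only b and y ions) contributes to the score; a median over many such per-spectrum scores would give a summary figure of the kind quoted above.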
