Annotation of tandem mass spectrometry data using stochastic neural networks in shotgun proteomics

MOTIVATION The ability of score functions to separate correct from incorrect peptide-spectrum matches in database-search-based spectrum identification is hindered by many superfluous peaks belonging to unexpected fragment ions and by missing peaks of anticipated fragment ions.

RESULTS Here, we present a new method, called BoltzMatch, that learns score functions using a particular type of stochastic neural network, the restricted Boltzmann machine, in order to enhance their discrimination ability. BoltzMatch learns chemically explainable patterns among peak pairs in the spectrum data, and it can augment peaks depending on their semantic context or even reconstruct missing peaks of expected ions during its internal scoring mechanism. As a result, BoltzMatch achieved 50% and 33% more annotations on high- and low-resolution MS2 data, respectively, than XCorr at a 0.1% false discovery rate (FDR) in our benchmark; conversely, XCorr yielded the same number of spectrum annotations as BoltzMatch, albeit with 4-6 times more errors. In addition, BoltzMatch alone yields 14% more annotations than Prosit (which runs with Percolator), and BoltzMatch with Percolator yields 32% more annotations than Prosit at the 0.1% FDR level in our benchmark.

AVAILABILITY BoltzMatch is freely available at: https://github.com/kfattila/BoltzMatch.

SUPPORTING INFORMATION Supplementary materials are available at Bioinformatics Online.
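To make the idea of reconstructing missing peaks concrete, the following is a minimal sketch of how a restricted Boltzmann machine could fill in an absent peak before a spectrum is scored against a theoretical peptide. It is not the BoltzMatch implementation: the four-bin spectrum, the hand-set weights `W`, the biases `b` and `c`, and the function names are all hypothetical, the weights are not learned from data, and a deterministic mean-field pass stands in for stochastic sampling.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, c):
    """P(h_j = 1 | v) for each hidden unit j; W[i][j] links bin i to hidden unit j."""
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]

def visible_probs(h, W, b):
    """P(v_i = 1 | h) for each visible unit (spectrum bin) i."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]

def reconstruct(v, W, b, c):
    """One mean-field up-down pass: visible -> hidden -> visible."""
    return visible_probs(hidden_probs(v, W, c), W, b)

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

# Hypothetical toy model: one hidden unit that ties bins 0 and 2 together,
# standing in for a learned relation between a pair of peaks.
W = [[4.0], [0.0], [4.0], [0.0]]
b = [-2.0, -2.0, -2.0, -2.0]
c = [-2.0]

observed = [1.0, 0.0, 0.0, 0.0]     # the expected peak in bin 2 is missing
theoretical = [1.0, 0.0, 1.0, 0.0]  # peaks predicted for a candidate peptide

recon = reconstruct(observed, W, b, c)
raw_score = dot(observed, theoretical)  # score without reconstruction
rbm_score = dot(recon, theoretical)     # score after reconstruction

print("reconstructed bins:", [round(r, 3) for r in recon])
print("raw score:", raw_score, "score after reconstruction:", round(rbm_score, 3))
```

In this toy setting the peak observed in bin 0 activates the hidden unit, which in turn raises the probability of the missing peak in bin 2, so the candidate peptide matches the reconstructed spectrum better than the raw one. Bins 1 and 3, which have no connection to the hidden unit, stay near their low baseline.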
