Peptide-Spectra Matching with Weak Supervision

As in many other scientific domains, we face a fundamental problem when using machine learning to identify proteins from mass spectrometry data: large ground-truth datasets mapping inputs to correct outputs are extremely difficult to obtain. Instead, we have access to imperfect hand-coded models crafted by domain experts. In this paper, we apply deep neural networks to an important step of the protein identification problem: the pairing of mass spectra with short sequences of amino acids called peptides. We train our model to differentiate between the top-scoring results from a state-of-the-art classical system and the hard-negative second- and third-place results. The resulting model matches peptides to spectra much more accurately than the classical system used to generate its training data. In particular, we achieve a 43% improvement over standard matching methods and a 10% improvement over a combination of the matching method and an industry-standard cross-spectra reranking tool. Importantly, in a more difficult experimental regime that reflects current challenges facing biologists, our advantage over the previous state of the art grows to 15% even after reranking. We believe this approach will generalize to other challenging scientific problems.
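
The weak-labeling scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_weak_labels`, the input layout (a mapping from spectrum identifier to scored peptide candidates), and the convention that a higher classical score is better are all assumptions made for this example.

```python
from typing import Dict, List, Tuple


def build_weak_labels(
    ranked_hits: Dict[str, List[Tuple[str, float]]],
    num_negatives: int = 2,
) -> List[Tuple[str, str, int]]:
    """Turn classical search results into weakly labeled training examples.

    ranked_hits maps a (hypothetical) spectrum id to a list of
    (peptide, classical_score) pairs. We assume higher scores are better.
    The top hit is labeled 1; the next `num_negatives` hits are labeled 0
    and serve as hard negatives.
    """
    examples: List[Tuple[str, str, int]] = []
    for spectrum_id, hits in ranked_hits.items():
        # Sort candidates from best to worst classical score.
        ordered = sorted(hits, key=lambda hit: hit[1], reverse=True)
        if not ordered:
            continue
        # First place: treated as the (imperfect) positive label.
        examples.append((spectrum_id, ordered[0][0], 1))
        # Second and third place: treated as hard negatives.
        for peptide, _score in ordered[1:1 + num_negatives]:
            examples.append((spectrum_id, peptide, 0))
    return examples


# Toy input with made-up peptides and scores, for illustration only.
hits = {
    "scan_1": [
        ("PEPTIDEK", 3.2),
        ("PEPTLDEK", 2.9),
        ("PKPTIDER", 1.1),
        ("AAAAK", 0.4),
    ]
}
examples = build_weak_labels(hits)
```

Because the labels come from an imperfect classical system rather than from ground truth, some positives will be wrong. The negatives are hard in the sense that they are the closest competitors for the same spectrum, not randomly drawn peptides.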
