Uncovering Hidden Members and Functions of the Soil Microbiome Using De Novo Metaproteomics

The fundamental task in proteomic mass spectrometry is identifying peptides from their observed spectra. Where protein sequences are known, standard algorithms use them to narrow the list of candidate peptides. Where protein sequences are unknown, a distinct class of algorithms must interpret spectra de novo. Despite decades of work on algorithmic approaches and machine learning methods, de novo software tools remain inaccurate on environmentally diverse samples. Here we train a deep neural network on 5 million spectra from 55 phylogenetically diverse bacteria. The new model outperforms current methods by 25-100%. The diversity of the training organisms also improves the model's generality and ensures reliable performance regardless of sample origin. Notably, the model also achieves high accuracy on long peptides, which help identify taxa in samples of unknown origin. With the new tool, called Kaiko, we analyze proteomics data from six natural soil isolates for which no proteome database existed. Without any sequence information, we correctly identify the taxonomy of these soil microbes and annotate thousands of peptide spectra.
