Sipros Ensemble improves database searching and filtering for complex metaproteomics

Motivation: Complex microbial communities can be characterized by metagenomics and metaproteomics. However, metagenome assemblies often generate enormous, yet incomplete, protein databases, which undermine the identification of peptides and proteins in metaproteomics. This challenge calls for database searching and filtering algorithms that better discriminate true identifications from false ones.

Results: Sipros Ensemble was developed for metaproteomics using an ensemble approach. Three diverse scoring functions, from MyriMatch, Comet and the original Sipros, were incorporated within a single database searching engine. Supervised classification with logistic regression was used to filter the database searching results. In benchmarks with soil and marine microbial communities, Sipros Ensemble identified more peptides and proteins than MyriMatch/Percolator, Comet/Percolator, MS‐GF+/Percolator, Comet & MyriMatch/iProphet and Comet & MyriMatch & MS‐GF+/iProphet. Sipros Ensemble was computationally efficient and scalable on supercomputers.

Availability and implementation: Freely available under the GNU GPL license at http://sipros.omicsbio.org.

Contact: cpan@utk.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
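
The filtering idea summarized in the Results can be illustrated with a minimal sketch: combine three scores per peptide-spectrum match (PSM) with logistic regression, then keep the top-ranked target PSMs at a chosen false discovery rate (FDR). This is not the Sipros Ensemble implementation. The `PSM` record, the score values, the target = 1 / decoy = 0 training labels, the gradient-descent settings and the decoys/targets FDR estimate are all assumptions made for illustration, and the three numbers per PSM are merely stand-ins for MyriMatch-, Comet- and Sipros-style scores.

```python
import math
from collections import namedtuple

# Hypothetical record: three search scores (scaled to 0..1) and a decoy flag.
PSM = namedtuple("PSM", ["psm_id", "scores", "is_decoy"])


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n, n_feat = len(features), len(features[0])
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(n_feat):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probability(w, b, x):
    """Predicted probability that a PSM with scores x is a target-like match."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def filter_by_fdr(psms, probs, max_fdr):
    """Return target PSMs above the deepest rank where decoys/targets <= max_fdr."""
    ranked = sorted(zip(probs, psms), key=lambda t: t[0], reverse=True)
    targets = decoys = cutoff = 0
    for rank, (_, psm) in enumerate(ranked, start=1):
        if psm.is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = rank
    return [psm for _, psm in ranked[:cutoff] if not psm.is_decoy]


# Invented toy data, not taken from any benchmark.
psms = [
    PSM("t1", (0.90, 0.80, 0.85), False),
    PSM("t2", (0.80, 0.90, 0.80), False),
    PSM("t3", (0.85, 0.70, 0.90), False),
    PSM("t4", (0.70, 0.75, 0.80), False),
    PSM("d1", (0.30, 0.20, 0.25), True),
    PSM("d2", (0.20, 0.30, 0.10), True),
    PSM("d3", (0.25, 0.15, 0.30), True),
    PSM("t5", (0.20, 0.30, 0.10), False),  # target with the same poor scores as d2
]

# Assumed labeling scheme: targets are positives, decoys are negatives.
features = [p.scores for p in psms]
labels = [0.0 if p.is_decoy else 1.0 for p in psms]

w, b = fit_logistic(features, labels)
probs = [probability(w, b, p.scores) for p in psms]
accepted = filter_by_fdr(psms, probs, max_fdr=0.01)
print([p.psm_id for p in accepted])
```

In this toy setup the poorly scoring target `t5` cannot be separated from decoy `d2`, so it falls below the FDR cutoff along with the decoys. A real pipeline would train on far more PSMs, hold out data for evaluation, and likely use an established library rather than hand-written gradient descent.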
