Searching for a needle in a stack of needles: challenges in metaproteomics data analysis.

In recent years, the integral study of microbial communities of varying complexity has attracted growing research interest. Mass spectrometry-driven metaproteomics enables the analysis of such communities at the functional level, but this fledgling field still faces various technical and semantic challenges in experimental data analysis and interpretation. In this review, we outline the hurdles involved and cover the most valuable methods and software implementations available to researchers in the field today. Rather than focusing on protein identification alone, we provide an overview of the data pre- and post-processing steps, such as metabolic pathway analysis, that can be useful in a typical metaproteomics workflow. Finally, we briefly discuss directions for future work.
