In-depth analysis of protein inference algorithms using multiple search engines and well-defined metrics.

In mass spectrometry-based shotgun proteomics, protein identifications are usually the desired result. However, most analytical methods are based on the identification of reliable peptides rather than the direct identification of intact proteins. Thus, assembling peptides identified from tandem mass spectra into a list of proteins, referred to as protein inference, is a critical step in proteomics research. Several protein inference algorithms and tools are currently available to the proteomics community. Here, we evaluated five software tools for protein inference (PIA, ProteinProphet, Fido, ProteinLP, MSBayesPro) using three popular database search engines: Mascot, X!Tandem, and MS-GF+. All algorithms were evaluated with a highly customizable KNIME workflow on four public datasets of varying complexity (different sample preparation, species and analytical instruments). We defined a set of quality control metrics to evaluate the performance of each combination of search engines, protein inference algorithm, and parameters on each dataset. We show that the results for complex samples vary not only in the number of reported protein groups but also in the composition of those groups. Furthermore, the robustness of reported proteins when using databases of differing complexities is strongly dependent on the applied inference algorithm. Finally, merging the identifications of multiple search engines does not necessarily increase the number of reported proteins, but does increase the number of peptides per protein and thus can generally be recommended.

SIGNIFICANCE: Protein inference is one of the major challenges in MS-based proteomics today. A vast number of protein inference algorithms and implementations are currently available to the proteomics community. Protein assembly affects the final results of the research, the quantitation values and the final claims in the research manuscript. Even though protein inference is a crucial step in proteomics data analysis, a comprehensive evaluation of the many different inference methods has never been performed. The Journal of Proteomics has previously published several benchmarks of other bioinformatics algorithms in proteomics (PMID: 26585461; PMID: 22728601), which underlines the importance of such studies for the proteomics community and the journal's audience. This manuscript presents a new bioinformatics solution based on the KNIME/OpenMS platform that aims to provide a fair comparison of protein inference algorithms (https://github.com/KNIME-OMICS). Five different algorithms (ProteinProphet, MSBayesPro, ProteinLP, Fido and PIA) were evaluated using the highly customizable workflow on four public datasets of varying complexity. Three popular database search engines (Mascot, X!Tandem and MS-GF+) and combinations thereof were evaluated for every protein inference tool. In total, >186 protein lists were analyzed and carefully compared using three metrics to assess the quality of each inference method: 1) the number of reported proteins, 2) the number of peptides per protein, and 3) the number of proteins uniquely reported by each inference method. We also examined how many proteins were reported for each combination of search engines, protein inference algorithm and parameters on each dataset. The results show that 1) PIA or Fido seems to be a good choice in the analyzed workflow, regarding not only the reported proteins and the high-quality identifications, but also the required runtime. 2) Merging the identifications of multiple search engines almost always gives more confident results and increases the number of peptides per protein group. 3) Using databases that contain known isoforms in addition to the canonical proteins has a small impact on the number of reported proteins. Depending on the question behind the study, the detection of specific isoforms could compensate for the slightly shorter protein lists produced by parsimonious reporting. 4) The current workflow can easily be extended to support new algorithms and search engine combinations.
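The three quality metrics named above (reported proteins, peptides per protein, uniquely reported proteins per inference method) can be illustrated with a minimal Python sketch. It is not the KNIME workflow used in the study: the function name `inference_metrics`, the input layout (one dictionary per method, mapping a protein-group identifier to its set of peptide sequences) and the toy data are all assumptions made for illustration. Matching groups across methods by identifier alone is also a simplification, since the composition of groups can differ between methods.

```python
from statistics import mean


def inference_metrics(reports):
    """Compute three simple metrics per inference method.

    `reports` (hypothetical layout) maps a method name to a dict that maps
    each reported protein-group identifier to its set of peptide sequences.
    """
    metrics = {}
    for method, groups in reports.items():
        # Collect every group identifier reported by any *other* method.
        reported_elsewhere = set()
        for other_method, other_groups in reports.items():
            if other_method != method:
                reported_elsewhere.update(other_groups)

        metrics[method] = {
            # 1) number of reported proteins (protein groups)
            "n_proteins": len(groups),
            # 2) average number of peptides per reported protein
            "peptides_per_protein": (
                mean(len(peptides) for peptides in groups.values())
                if groups else 0.0
            ),
            # 3) proteins reported by this method only
            "n_unique": len(set(groups) - reported_elsewhere),
        }
    return metrics


# Toy input, invented for illustration only (not data from the study).
toy_reports = {
    "PIA": {"P1": {"AAK", "CCR"}, "P2": {"DDK"}},
    "Fido": {"P1": {"AAK"}, "P3": {"EEK", "FFR", "GGK"}},
}
toy_metrics = inference_metrics(toy_reports)
for method_name, values in toy_metrics.items():
    print(method_name, values)
```

In this toy input each method reports two groups, one of which the other method does not report. A real comparison would first have to decide how to match protein groups whose members only partly overlap.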
