Combining Results of Multiple Search Engines

A crucial component of the analysis of shotgun proteomics datasets is the search engine, an algorithm that attempts to identify the peptide sequence from the parent molecular ion that produced each fragment ion spectrum in the dataset. There are many different search engines, both commercial and open source, each employing a somewhat different technique for spectrum identification. The set of high-scoring peptide-spectrum matches for a defined set of input spectra differs markedly among the various search engine results; individual engines each provide unique correct identifications among a core set of correlative identifications. This has led to the approach of combining the results from multiple search engines to achieve improved analysis of each dataset. Here we review the techniques and available software for combining the results of multiple search engines and briefly compare the relative performance of these techniques.

Molecular & Cellular Proteomics 12: 10.1074/mcp.R113.027797, 2383–2393, 2013.

The most commonly used proteomics approach, shotgun proteomics, has become an invaluable tool for the high-throughput characterization of proteins in biological samples (1). This workflow relies on the combination of protein digestion, liquid chromatography (LC) separation, tandem mass spectrometry (MS/MS), and sophisticated data analysis in its aim to derive an accurate and complete set of peptides and their inferred proteins that are present in the sample being studied. Although many variations are possible, the typical workflow begins with the digestion of proteins into peptides with a protease, typically trypsin. The resulting peptide mixture is first separated via LC and then subjected to mass spectrometry (MS) analysis. The MS instrument acquires fragment ion spectra on a subset of the peptide precursor ions that it measures.

From the MS/MS spectra that measure the abundance and mass of the peptide ion fragments, peptides present in the mixture are identified and proteins are inferred by means of downstream computational analysis. The informatics component of the shotgun proteomics workflow is crucial for proper data analysis (2), and a wide variety of tools have emerged for this purpose (3). The typical informatics workflow can be summarized in a few steps: conversion from vendor proprietary formats to an open format, high-throughput interpretation of the MS/MS spectra with a search engine, and statistical validation of the results with estimation of the false discovery rate at a selected score threshold. Various tools for measuring relative peptide abundances may be applied, depending on the quantitation technique used in the experiment. Finally, the proteins present, and their abundance in the sample, are inferred based on the peptide identifications. One of the most computationally intensive and diverse steps in the computational analysis workflow is the use of a
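
As an illustration of the statistical validation step in the workflow summarized above, the following minimal Python sketch estimates the false discovery rate at a selected score threshold. It assumes the target-decoy strategy, which is only one of several possible approaches and is not specified in the passage above. The `PSM` record, its field names, and the `estimate_fdr` function are hypothetical, and higher scores are assumed to be better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSM:
    """One peptide-spectrum match; field names are illustrative only."""
    spectrum_id: str
    peptide: str
    score: float    # assumption: higher score means a better match
    is_decoy: bool  # True if the match is to a decoy sequence


def estimate_fdr(psms, threshold):
    """Estimate the FDR among target PSMs scoring at or above `threshold`.

    Uses the simple decoy/target ratio, which assumes the decoy database
    is the same size as the target database. Returns 0.0 by convention
    when no target PSM passes the threshold.
    """
    targets = sum(1 for p in psms if p.score >= threshold and not p.is_decoy)
    decoys = sum(1 for p in psms if p.score >= threshold and p.is_decoy)
    if targets == 0:
        return 0.0
    return min(1.0, decoys / targets)


if __name__ == "__main__":
    # Made-up example data, not taken from any real search result.
    example = [
        PSM("s1", "PEPTIDEK", 50.0, False),
        PSM("s2", "ELVISK", 45.0, False),
        PSM("s3", "KEDITPEP", 40.0, True),
        PSM("s4", "LIVESK", 35.0, False),
        PSM("s5", "SAMPLER", 30.0, False),
        PSM("s6", "RELPMAS", 20.0, True),
    ]
    # At threshold 30: 1 decoy / 4 targets = 0.25
    print(estimate_fdr(example, 30.0))
```

In practice the threshold would be swept over the observed scores to find the lowest value at which the estimated rate stays under the desired limit; that search is omitted here for brevity.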

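The abstract notes that individual engines each contribute unique identifications on top of a shared core set. The short Python sketch below shows one way such an overlap could be tallied. The engine names, the `(spectrum_id, peptide)` key, and the `summarize_overlap` function are hypothetical, and this is only a bookkeeping step: it says nothing about which identifications are correct and is not one of the combination techniques covered by the review.

```python
from collections import Counter


def summarize_overlap(results):
    """Split identifications into a shared core and per-engine unique sets.

    `results` maps an engine name to a set of (spectrum_id, peptide) pairs.
    Returns (core, unique): `core` holds pairs reported by every engine,
    and `unique` maps each engine to the pairs reported by it alone.
    """
    if not results:
        return set(), {}
    # Count how many engines reported each (spectrum, peptide) pair.
    counts = Counter(psm for ids in results.values() for psm in set(ids))
    n_engines = len(results)
    core = {psm for psm, c in counts.items() if c == n_engines}
    unique = {
        engine: {psm for psm in ids if counts[psm] == 1}
        for engine, ids in results.items()
    }
    return core, unique


if __name__ == "__main__":
    # Made-up example data for three hypothetical engines.
    example = {
        "engine_a": {("s1", "PEPTIDEK"), ("s2", "ELVISK"), ("s3", "LIVESK")},
        "engine_b": {("s1", "PEPTIDEK"), ("s2", "ELVISK"), ("s4", "SAMPLER")},
        "engine_c": {("s1", "PEPTIDEK"), ("s5", "TESTK")},
    }
    core, unique = summarize_overlap(example)
    print(core)
    print(unique)
```

Pairs reported by some but not all engines (such as `("s2", "ELVISK")` in the example data) fall into neither group, which is why a simple intersection or union is usually too coarse a way to combine results.
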
[1] P. Pevzner, et al. PepNovo: de novo peptide sequencing via probabilistic network modeling, 2005, Analytical Chemistry.

[2] Charles Buck, et al. Performance evaluation of existing de novo sequencing algorithms, 2006, Journal of Proteome Research.

[3] D. N. Perkins, et al. Probability‐based protein identification by searching sequence databases using mass spectrometry data, 1999, Electrophoresis.

[4] P. Pevzner, et al. InsPecT: identification of posttranslationally modified peptides from tandem mass spectra, 2005, Analytical Chemistry.

[5] J. Eng, et al. Comet: An open‐source MS/MS sequence database search tool, 2013, Proteomics.

[6] J. Yates, et al. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database, 1994, Journal of the American Society for Mass Spectrometry.

[7] Robertson Craig, et al. TANDEM: matching proteins with tandem mass spectra, 2004, Bioinformatics.

[8] Xue Wu, et al. An Unsupervised, Model-Free, Machine-Learning Combiner for Peptide Identifications from Tandem Mass Spectra, 2009, Clinical Proteomics.

[9] R. Aebersold, et al. A statistical model for identifying proteins by tandem mass spectrometry, 2003, Analytical Chemistry.

[10] J. A. Taylor, et al. Implementation and uses of automated de novo peptide sequencing by tandem mass spectrometry, 2001, Analytical Chemistry.

[11] Tamanna Sultana, et al. Optimization of the Use of Consensus Methods for the Detection and Putative Identification of Peptides via Mass Spectrometry Using Protein Standard Mixtures, 2009, Journal of Proteomics & Bioinformatics.

[12] O. Kohlbacher, et al. Probabilistic consensus scoring improves tandem mass spectrometry peptide identification, 2011, Journal of Proteome Research.

[13] B. Searle, et al. A Face in the Crowd: Recognizing Peptides Through Database Search, 2011, Molecular & Cellular Proteomics.

[14] D. Tabb, et al. MyriMatch: highly accurate tandem mass spectral peptide identification by multivariate hypergeometric analysis, 2007, Journal of Proteome Research.

[15] Mark Gerstein, et al. Global Survey of Human T Leukemic Cells by Integrating Proteomics and Transcriptomics Profiling, 2007, Molecular & Cellular Proteomics.

[16] B. Searle, et al. Improving sensitivity by probabilistically combining results from multiple MS/MS search methodologies, 2008, Journal of Proteome Research.

[17] Andrew R Jones, et al. FDRAnalysis: a tool for the integrated analysis of tandem mass spectrometry identification results from multiple search engines, 2011, Journal of Proteome Research.

[18] Alexey I Nesvizhskii, et al. Integrated Phosphoproteomics Analysis of a Signaling Network Governing Nutrient Response and Peroxisome Induction, 2010, Molecular & Cellular Proteomics.

[19] Nichole L. King, et al. Development and validation of a spectral library searching method for peptide identification from MS/MS, 2007, Proteomics.

[20] William Stafford Noble, et al. Analysis of peptide MS/MS spectra from large-scale proteomics experiments using spectrum libraries, 2006, Analytical Chemistry.

[21] Hyungwon Choi, et al. MSblender: A probabilistic approach for integrating peptide identifications from multiple database search engines, 2011, Journal of Proteome Research.

[22] Alexey I Nesvizhskii, et al. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search, 2002, Analytical Chemistry.

[23] A. Nesvizhskii, et al. Computational analysis of unassigned high‐quality MS/MS spectra in proteomic data sets, 2010, Proteomics.

[24] D. Ghosh, et al. Statistical validation of peptide identifications in large-scale proteomics using the target-decoy database search strategy and flexible mixture modeling, 2008, Journal of Proteome Research.

[25] Knut Reinert, et al. OpenMS – An open-source software framework for mass spectrometry, 2008, BMC Bioinformatics.

[26] A. Nesvizhskii. A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics, 2010, Journal of Proteomics.

[27] Yi-Kuo Yu, et al. Enhancing Peptide Identification Confidence by Combining Search Methods, 2008, Journal of Proteome Research.

[28] Ravi Tharakan, et al. Data maximization by multipass analysis of protein mass spectra, 2010, Proteomics.

[29] Natalie I. Tasman, et al. iProphet: Multi-level Integrative Analysis of Shotgun Proteomic Data Improves Peptide and Protein Identification Rates and Error Estimates, 2011, Molecular & Cellular Proteomics.

[30] S. Bryant, et al. Open mass spectrometry search algorithm, 2004, Journal of Proteome Research.

[31] R. Aebersold, et al. Mass spectrometry-based proteomics, 2003, Nature.

[32] Hyungwon Choi, et al. Semisupervised model-based validation of peptide identifications in mass spectrometry-based proteomics, 2008, Journal of Proteome Research.

[33] Brendan MacLean, et al. General framework for developing and evaluating database scoring algorithms using the TANDEM search engine, 2006, Bioinformatics.

[34] Henry H. N. Lam, et al. Data analysis and bioinformatics tools for tandem mass spectrometry in proteomics, 2008, Physiological Genomics.

[35] R. Aebersold, et al. A uniform proteomics MS/MS analysis platform utilizing open XML file formats, 2005, Molecular Systems Biology.

[36] Eric W. Deutsch, et al. File Formats Commonly Used in Mass Spectrometry Proteomics, 2012, Molecular & Cellular Proteomics.

[37] Martin Eisenacher, et al. The mzIdentML Data Standard for Mass Spectrometry-Based Proteomics Results, 2012, Molecular & Cellular Proteomics.

[38] Gilbert S Omenn, et al. An evaluation, comparison, and accurate benchmarking of several publicly available MS/MS search algorithms: Sensitivity and specificity analysis, 2005, Proteomics.

[39] P. Pevzner, et al. The Generating Function of CID, ETD, and CID/ETD Pairs of Tandem Mass Spectra: Applications to Database Search, 2010, Molecular & Cellular Proteomics.

[40] R. Beavis, et al. Using annotated peptide mass spectrum libraries for protein identification, 2006, Journal of Proteome Research.

[41] Ruben K Dagda, et al. Evaluation of the Consensus of Four Peptide Identification Algorithms for Tandem Mass Spectrometry Based Proteomics, 2010, Journal of Proteomics & Bioinformatics.

[42] Bin Ma, et al. PEAKS DB: De Novo Sequencing Assisted Database Search for Sensitive and Accurate Peptide Identification, 2011, Molecular & Cellular Proteomics.

[43] Ming Li, et al. PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry, 2003, Rapid Communications in Mass Spectrometry.