Search and Decoy: The Automatic Identification of Mass Spectra

In recent years, the generation and interpretation of MS/MS spectra for the identification of peptides and proteins has matured into a routinely used automated workflow in proteomics. Several software solutions for the automated analysis of MS/MS spectra allow high-throughput, high-performance analyses of complex samples. In connection with MS/MS searches, target-decoy approaches have become increasingly popular: in a "decoy" part of the search database, nonexistent sequences mimic real sequences (the "target" sequences). With their help, the number of falsely identified peptides and proteins can be estimated after a search, and the resulting protein list can be cut at a specified false discovery rate (FDR). This is an essential prerequisite for all quantitative approaches, as they rely on correct identifications. In particular, the label-free approach "spectral counting", which is gaining popularity because of its low cost and simplicity, depends directly on the correctness of peptide-spectrum matches (PSMs). The aim of this work is to describe five popular search engines, focusing on their general properties regarding protein identification but also on their quantification abilities where these go beyond spectral counting. This enables proteomics researchers to compare the engines' features and to choose an appropriate solution for their specific question. Furthermore, the search engines are applied to a spectrum data set generated from a complex sample with a Thermo LTQ Velos Orbitrap (Thermo Fisher Scientific, Waltham, MA, USA). The results of the search engines are compared with regard to, for example, time requirements, the peptides and proteins found, and the search engines' behavior under the decoy approach.
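The FDR cut described above can be illustrated with a minimal sketch. It is not taken from the paper: it assumes a concatenated target-decoy search with equally sized target and decoy parts, estimates the FDR at each score threshold as the number of decoy hits divided by the number of target hits (one common convention among several), and represents each PSM as a hypothetical `(score, is_decoy)` pair where a higher score is better. The function name `cut_at_fdr` is likewise illustrative.

```python
from typing import List, Tuple

# Hypothetical PSM representation: (score, is_decoy); higher score = better match.
PSM = Tuple[float, bool]


def cut_at_fdr(psms: List[PSM], max_fdr: float) -> List[PSM]:
    """Return the target PSMs above the deepest score cut whose
    estimated FDR (decoys / targets, an assumed estimator) is <= max_fdr."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    best_cut = 0  # number of top-ranked PSMs to keep
    for i, (_score, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Without any target hit the estimate is undefined; treat it as 100 %.
        fdr = decoys / targets if targets else 1.0
        if fdr <= max_fdr:
            best_cut = i
    # Decoy hits only serve the estimate and are dropped from the result.
    return [p for p in ranked[:best_cut] if not p[1]]


# Toy data, invented purely for illustration.
example_psms = [
    (9.0, False), (8.0, False), (7.0, True), (6.0, False),
    (5.0, False), (4.0, True), (3.0, True),
]
accepted = cut_at_fdr(example_psms, max_fdr=0.25)
print([score for score, _ in accepted])
```

The sketch keeps the deepest cut that still satisfies the limit rather than stopping at the first violation, because the estimated FDR is not monotonic along the ranked list. Real search engines and post-processing tools may use different estimators or q-values instead.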
