Calibrating E-values for MS2 database search methods

Background
The key to mass-spectrometry-based proteomics is peptide identification, which relies on software analysis of tandem mass spectra. Although each search engine has its strengths, combining the strengths of various search engines is not yet feasible, largely because no unified statistical framework exists that is applicable to any method.

Results
We have developed a universal scheme for statistical calibration of peptide identifications. The protocol can be used for both de novo approaches and database search methods; here we demonstrate it using database search methods only. Seven methods were calibrated: SEQUEST (v27 rev12), ProbID (v1.0), InsPecT (v20060505), Mascot (v2.1), X!Tandem (v1.0), OMSSA (v2.0) and RAId_DbS. Most of these methods, with the exception of X!Tandem and RAId_DbS, require a rescaling according to the size of the database searched. We demonstrate that our calibration protocol indeed produces unified statistics, both in terms of the average number of false positives and in terms of the probability that a peptide hit is a true positive. Although both the calibration protocol and the statistics thus calibrated are universal, the calibration formulas obtained by one laboratory, with data collected in either centroid or profile format, may not be directly usable by other laboratories. Each laboratory is therefore encouraged to calibrate the search methods it intends to use. We also address the importance of using spectrum-specific statistics and possible improvements to the current calibration protocol. The spectra used for statistical (E-value) calibration are freely available upon request.

Open peer review
Reviewed by Dongxiao Zhu (nominated by Arcady Mushegian), Alexey Nesvizhskii (nominated by King Jordan) and Vineet Bafna. For the full reviews, please go to the Reviewers' comments section.
