Statistical calibration of the SEQUEST XCorr function.

Obtaining accurate peptide identifications from shotgun proteomics liquid chromatography tandem mass spectrometry (LC-MS/MS) experiments requires a score function that consistently ranks correct peptide-spectrum matches (PSMs) above incorrect matches. We have observed that, for the Sequest score function Xcorr, the inability to discriminate between correct and incorrect PSMs is due in part to spectrum-specific properties of the score distribution. In other words, some spectra score well regardless of which peptides they are scored against, and other spectra score well because they are scored against a large number of peptides. We describe a protocol for calibrating PSM score functions, and we demonstrate its application to Xcorr and the preliminary Sequest score function Sp. The protocol accounts for spectrum- and peptide-specific effects by calculating p values for each spectrum individually, using only that spectrum's score distribution. We demonstrate that these calculated p values are uniform under a null distribution and therefore accurately measure significance. These p values can be used to estimate the false discovery rate, therefore, eliminating the need for an extra search against a decoy database. In addition, we show that the pvalues are better calibrated than their underlying scores; consequently, when ranking top-scoring PSMs from multiple spectra, p values are better at discriminating between correct and incorrect PSMs. The calibration protocol is generally applicable to any PSM score function for which an appopriate parametric family can be identified.

[1]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .

[2]  Vineet Bafna,et al.  SCOPE: a probabilistic model for scoring tandem mass spectra against a peptide database , 2001, ISMB.

[3]  William Stafford Noble,et al.  Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. , 2008, Journal of proteome research.

[4]  P. Pevzner,et al.  InsPecT: identification of posttranslationally modified peptides from tandem mass spectra. , 2005, Analytical chemistry.

[5]  William Stafford Noble,et al.  Rapid and accurate peptide identification from tandem mass spectra. , 2008, Journal of proteome research.

[6]  R. Beavis,et al.  A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes. , 2003, Analytical chemistry.

[7]  Michael J MacCoss,et al.  Improving tandem mass spectrum identification using peptide retention time prediction across diverse chromatography conditions. , 2007, Analytical chemistry.

[8]  Robertson Craig,et al.  TANDEM: matching proteins with tandem mass spectra. , 2004, Bioinformatics.

[9]  John D. Storey,et al.  Statistical significance for genomewide studies , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[10]  T. Speed,et al.  Biological Sequence Analysis , 1998 .

[11]  J. Yates,et al.  Large-scale analysis of the yeast proteome by multidimensional protein identification technology , 2001, Nature Biotechnology.

[12]  William Stafford Noble,et al.  Semi-supervised learning for peptide identification from shotgun proteomics datasets , 2007, Nature Methods.

[13]  S. Bryant,et al.  Open mass spectrometry search algorithm. , 2004, Journal of proteome research.

[14]  Yi-Kuo Yu,et al.  Calibrating E-values for MS2 database search methods , 2007, Biology Direct.

[15]  Alejandro Heredia-Langner,et al.  Comparison of probability and likelihood models for peptide identification from tandem mass spectrometry data. , 2005, Journal of proteome research.

[16]  Alexey I Nesvizhskii,et al.  Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. , 2002, Analytical chemistry.

[17]  W. Weibull A Statistical Distribution Function of Wide Applicability , 1951 .

[18]  Richard E Higgs,et al.  Estimating the statistical significance of peptide identifications from shotgun proteomics experiments. , 2007, Journal of proteome research.

[19]  R. Aebersold,et al.  ProbID: A probabilistic algorithm to identify peptides through sequence database searching using tandem mass spectral data , 2002, Proteomics.

[20]  J. Yates,et al.  An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database , 1994, Journal of the American Society for Mass Spectrometry.

[21]  W. Pearson Empirical statistical estimates for sequence similarity searches. , 1998, Journal of molecular biology.

[22]  Michael Gribskov,et al.  Estimating and Evaluating the Statistics of Gapped Local-Alignment Scores , 2003, J. Comput. Biol..

[23]  Salvador Martínez-Bartolomé,et al.  Statistical model for large-scale peptide identification in databases from tandem mass spectra using SEQUEST. , 2004, Analytical chemistry.

[24]  J. Yates,et al.  A hypergeometric probability model for protein identification and validation using tandem mass spectral data and protein sequence databases. , 2003, Analytical chemistry.