Computing Exact p-values for a Cross-correlation Shotgun Proteomics Score Function

The core of every protein mass spectrometry analysis pipeline is a function that assesses the quality of a match between an observed spectrum and a candidate peptide. We describe a procedure for computing exact p-values for the oldest and still widely used score function, SEQUEST XCorr. The procedure uses dynamic programming to efficiently enumerate the full distribution of scores for all possible peptides whose masses are close to the spectrum's precursor mass. Ranking identified spectra by p-value rather than by XCorr significantly reduces the variance caused by spectrum-specific effects on the score. In combination with the Percolator postprocessor, the XCorr p-value yields more spectrum and peptide identifications at a fixed false discovery rate than Mascot, X!Tandem, Comet, and MS-GF+ across a variety of data sets.
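
The dynamic programming idea can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several simplifying assumptions that the abstract does not state: residue masses and per-bin evidence values are integers, a peptide's score is the sum of the evidence at each of its prefix masses (excluding the full mass), and every residue sequence is weighted equally. The names `score_distribution`, `exact_p_value`, `evidence`, and `residue_masses` are hypothetical.

```python
from collections import defaultdict


def score_distribution(evidence, residue_masses, target_mass):
    """Count residue sequences of total mass target_mass, grouped by score.

    Assumption: evidence[m] is an integer score gained whenever a prefix
    of the sequence has mass m; the full mass itself contributes nothing.
    """
    # counts[m] maps score -> number of sequences whose total mass is m
    counts = [defaultdict(int) for _ in range(target_mass + 1)]
    counts[0][0] = 1  # the empty sequence has mass 0 and score 0
    for m in range(1, target_mass + 1):
        gain = evidence[m] if m < target_mass else 0
        for a in residue_masses:
            if a > m:
                continue
            # Extend every sequence of mass m - a by one residue of mass a.
            for s, n in counts[m - a].items():
                counts[m][s + gain] += n
    return dict(counts[target_mass])


def exact_p_value(dist, observed_score):
    """Fraction of enumerated sequences scoring at least observed_score."""
    total = sum(dist.values())
    tail = sum(n for s, n in dist.items() if s >= observed_score)
    return tail / total


# Toy alphabet with two "residues" of mass 1 and 2, and a target mass of 4.
toy_evidence = [0, 1, 0, 2, 0]  # indexed by integer mass 0..4
toy_dist = score_distribution(toy_evidence, [1, 2], 4)
toy_p = exact_p_value(toy_dist, 3)
```

Because the table is indexed by mass and score rather than by sequence, the work grows with the number of mass bins, distinct scores, and residues, not with the number of candidate peptides, which is what makes counting the full score distribution tractable in this sketch.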
