Non-parametric estimation of posterior error probabilities associated with peptides identified by tandem mass spectrometry

MOTIVATION: A mass spectrum produced via tandem mass spectrometry can be tentatively matched to a peptide sequence via database search. Here, we address the problem of assigning a posterior error probability (PEP) to a given peptide-spectrum match (PSM). This problem is considerably more difficult than the related problem of estimating the error rate associated with a large collection of PSMs. Existing methods for estimating PEPs rely on a parametric or semiparametric model of the underlying score distribution.

RESULTS: We demonstrate how to apply non-parametric logistic regression to this problem. The method makes no explicit assumptions about the form of the underlying score distribution; instead, the method relies upon decoy PSMs, produced by searching the spectra against a decoy sequence database, to provide a model of the null score distribution. We show that our non-parametric logistic regression method produces accurate PEP estimates for six different commonly used PSM score functions. In particular, the estimates produced by our method are comparable in accuracy to those of PeptideProphet, which uses a parametric or semiparametric model designed specifically to work with SEQUEST. The advantage of the non-parametric approach is applicability and robustness to new score functions and new types of data.

AVAILABILITY: C++ code implementing the method as well as supplementary information is available at http://noble.gs.washington.edu/proj/qvality
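The decoy-based idea described under RESULTS can be illustrated with a minimal sketch. This is not the qvality implementation: as a stand-in for non-parametric logistic regression, it uses a simple Gaussian-kernel (Nadaraya-Watson) smoother to estimate the probability p(x) that a PSM with score x is a decoy, after pooling target and decoy PSMs. Assuming decoy scores follow the same distribution as incorrect target PSMs, the PEP at score x is approximately pi0 * (n_target / n_decoy) * p(x) / (1 - p(x)), where pi0 is the fraction of incorrect target PSMs. The function names, the `pi0` default, and the `bandwidth` value are hypothetical choices made for illustration.

```python
import math


def decoy_probability(score, targets, decoys, bandwidth):
    """Kernel-weighted fraction of decoys among PSMs scoring near `score`.

    This is a Nadaraya-Watson estimate of P(decoy | score) on the pooled
    target + decoy PSMs (a stand-in for spline-based logistic regression).
    """
    def weight_sum(scores):
        return sum(
            math.exp(-0.5 * ((score - s) / bandwidth) ** 2) for s in scores
        )

    w_decoy = weight_sum(decoys)
    w_target = weight_sum(targets)
    total = w_decoy + w_target
    # If every kernel weight underflows to zero, fall back to "no information".
    return w_decoy / total if total > 0.0 else 0.5


def estimate_peps(targets, decoys, pi0=1.0, bandwidth=0.5):
    """Return one PEP estimate per target score.

    pi0 is the assumed fraction of incorrect target PSMs; pi0 = 1.0 is a
    conservative default (an assumption of this sketch, not of the paper).
    """
    if not targets or not decoys:
        raise ValueError("need at least one target and one decoy score")
    size_ratio = len(targets) / len(decoys)
    peps = []
    for score in targets:
        p = decoy_probability(score, targets, decoys, bandwidth)
        odds = p / (1.0 - p) if p < 1.0 else float("inf")
        # Clamp, because the estimated density ratio can exceed 1 by noise.
        peps.append(min(1.0, pi0 * size_ratio * odds))
    return peps
```

Calling `estimate_peps(target_scores, decoy_scores, pi0=0.5)` returns a list of PEPs aligned with `target_scores`. A production method would additionally choose the amount of smoothing from the data and would typically enforce that the PEP does not increase as the score improves; neither is attempted here.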
