Linear discriminant analysis-based estimation of the false discovery rate for phosphopeptide identifications.

The development of liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) has made it possible to characterize phosphopeptides in an increasingly large-scale and high-throughput fashion. However, extracting confident phosphopeptide identifications from the resulting large data sets in a similar high-throughput fashion remains difficult, as does rigorously estimating the false discovery rate (FDR) of a set of phosphopeptide identifications. This article describes a data analysis pipeline designed to address these issues. The first step is to reanalyze phosphopeptide identifications that contain ambiguous assignments for the incorporated phosphate(s) to determine the most likely arrangement of the phosphate(s). The next step is to employ an expectation maximization algorithm to estimate the joint distribution of the peptide scores. A linear discriminant analysis is then performed to determine how to optimally combine peptide scores (in this case, from SEQUEST) into a discriminant score that possesses the maximum discriminating power. Based on this discriminant score, the p- and q-values for each phosphopeptide identification are calculated, and the phosphopeptide identification FDR is then estimated. This data analysis approach was applied to data from a study of irradiated human skin fibroblasts to provide a robust estimate of FDR for phosphopeptides. The Phosphopeptide FDR Estimator software is freely available for download at http://ncrr.pnl.gov/software/.
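The last two stages described above, combining peptide scores into a single discriminant score and converting p-values into q-values, can be illustrated with a minimal sketch. This is not the Phosphopeptide FDR Estimator code. The function names `fisher_discriminant` and `q_values` are hypothetical, the two classes of score vectors are assumed to be already separated (in the pipeline that separation would come from the expectation maximization step), and the q-value routine assumes the proportion of true nulls is 1, which reduces it to a Benjamini-Hochberg-style adjustment.

```python
import numpy as np


def fisher_discriminant(correct, incorrect):
    """Fisher linear discriminant: unit weight vector for combining scores.

    `correct` and `incorrect` are (n_samples, n_scores) arrays of peptide
    scores for the two classes (hypothetical inputs for illustration).
    """
    mu_c = correct.mean(axis=0)
    mu_i = incorrect.mean(axis=0)
    # Pooled within-class scatter matrix.
    s_w = (np.cov(correct, rowvar=False) * (len(correct) - 1)
           + np.cov(incorrect, rowvar=False) * (len(incorrect) - 1))
    # Direction that maximizes class separation relative to scatter.
    w = np.linalg.solve(s_w, mu_c - mu_i)
    return w / np.linalg.norm(w)


def q_values(p):
    """q-values from p-values, assuming the null proportion pi0 = 1."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Estimated FDR when thresholding at each sorted p-value.
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity: q is the minimum FDR over all looser thresholds.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.minimum(ranked, 1.0)
    return q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic two-score data standing in for search-engine scores.
    correct = rng.normal(loc=[3.0, 0.5], scale=1.0, size=(200, 2))
    incorrect = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
    w = fisher_discriminant(correct, incorrect)
    print("discriminant weights:", w)
    print("q-values:", q_values([0.01, 0.04, 0.03, 0.5]))
```

The discriminant score for an identification is then the dot product of its score vector with `w`; accepting all identifications with a q-value below a chosen cutoff gives a set whose estimated FDR is at most that cutoff under the stated assumption.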
