Bayesian Nonparametric Model for the Validation of Peptide Identification in Shotgun Proteomics

Tandem mass spectrometry combined with database searching allows high throughput identification of peptides in shotgun proteomics. However, the validation of database search results, a problem for which many solutions have been proposed, still leaves room for improvement in the sensitivity, specificity, and generalizability of the validation algorithms. Here we developed a Bayesian nonparametric (BNP) model for the validation of database search results that incorporates several popular techniques in statistical learning: compression of the feature space with a linear discriminant function, flexible nonparametric probability density function estimation to accommodate the variable probability structure of complex problems, and the Bayesian method for calculating the posterior probability. Importantly, the BNP model is naturally compatible with the popular target-decoy database search strategy. We tested the BNP model on standard protein and real, complex sample data sets from multiple MS platforms and compared it with PeptideProphet, the cutoff-based method, and a simple nonparametric method that we proposed previously. The BNP model outperformed the other methods in sensitivity and generalizability on all data sets searched. Some high quality matches that had been filtered out by other methods were detected and assigned high probabilities by the BNP model. Thus, the BNP model can validate database search results effectively and extract more information from MS/MS data.
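
The Bayesian step named in the abstract can be illustrated with a minimal sketch. It is not the BNP model itself: it assumes each match has already been reduced to a single discriminant score, it swaps in a plain Gaussian kernel density estimate for the nonparametric density estimation, and it assumes that the number of decoy matches approximates the number of incorrect target matches. The function names `gaussian_kde` and `posterior_correct` are hypothetical and chosen only for this illustration.

```python
import numpy as np


def gaussian_kde(samples, points, bandwidth=None):
    """Evaluate a 1-D Gaussian kernel density estimate of `samples` at `points`."""
    samples = np.asarray(samples, dtype=float)
    points = np.asarray(points, dtype=float)
    if bandwidth is None:
        # Normal-reference rule of thumb (an assumption of this sketch).
        bandwidth = 1.06 * samples.std(ddof=1) * samples.size ** (-1 / 5)
    z = (points[:, None] - samples[None, :]) / bandwidth
    kernel_sum = np.exp(-0.5 * z**2).sum(axis=1)
    return kernel_sum / (samples.size * bandwidth * np.sqrt(2.0 * np.pi))


def posterior_correct(target_scores, decoy_scores):
    """Posterior probability that each target match is correct.

    Bayes' rule with two components:
        P(correct | s) = 1 - pi0 * f_incorrect(s) / f_all(s)
    where f_incorrect is estimated from decoy scores, f_all from target
    scores, and pi0 (prior fraction of incorrect target matches) is
    taken as n_decoy / n_target. All three choices are assumptions.
    """
    target_scores = np.asarray(target_scores, dtype=float)
    decoy_scores = np.asarray(decoy_scores, dtype=float)
    pi0 = min(1.0, decoy_scores.size / target_scores.size)
    f_all = gaussian_kde(target_scores, target_scores)
    f_incorrect = gaussian_kde(decoy_scores, target_scores)
    posterior = 1.0 - pi0 * f_incorrect / np.maximum(f_all, 1e-300)
    # Density estimates are noisy, so keep the result a valid probability.
    return np.clip(posterior, 0.0, 1.0)
```

With synthetic scores in which decoys and incorrect targets share one distribution and correct targets score higher, `posterior_correct` should return values near 1 for high-scoring targets and near 0 for targets that score like decoys.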
