Non-parametric Bayesian approach to post-translational modification refinement of predictions from tandem mass spectrometry

Motivation: Tandem mass spectrometry (MS/MS) is a dominant approach for large-scale high-throughput post-translational modification (PTM) profiling. Although current state-of-the-art blind PTM spectral analysis algorithms can predict thousands of modified peptides (PTM predictions) in an MS/MS experiment, a significant percentage of these predictions have inaccurate modification mass estimates and false modification site assignments. This problem can be addressed by post-processing the PTM predictions with a PTM refinement algorithm. We developed a novel PTM refinement algorithm, iPTMClust, which extends a recently introduced PTM refinement algorithm, PTMClust, and uses a non-parametric Bayesian model to better account for uncertainties in the quantity and identity of PTMs in the input data. This new modeling approach enables iPTMClust to provide a confidence score per modification site, which allows fine-tuning and interpretation of the resulting PTM predictions.
Results: The primary goal behind iPTMClust is to improve the quality of the PTM predictions. First, to demonstrate that iPTMClust produces sensible and accurate cluster assignments, we compare it with k-means clustering, mixtures of Gaussians (MOG) and PTMClust on a synthetically generated PTM dataset. Second, in two separate benchmark experiments using PTM data taken from a phosphopeptide study and a yeast proteome study, we show that iPTMClust outperforms state-of-the-art PTM prediction and refinement algorithms, including PTMClust. Finally, we illustrate the general applicability of our new approach on a set of human chromatin protein complex data, where we are able to identify putative novel modified peptides and modification sites that may be involved in the formation and regulation of protein complexes. Our method facilitates accurate PTM profiling, which is an important step in understanding the mechanisms behind many biological processes and should be an integral part of any proteomic study.
Availability: Our algorithm is implemented in Java and is freely avail
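
The idea of letting the number of PTM clusters remain open, as a non-parametric Bayesian model does, can be illustrated with a minimal sketch. This is not the iPTMClust model, which also handles modification sites and is implemented in Java; it is a simplified Dirichlet-process-style assignment step over modification mass shifts only. The function names, the fixed measurement spread `sd`, the concentration `alpha`, the broad base distribution, and the use of the empirical cluster mean as a plug-in estimate are all assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def crp_assignment_probs(x, clusters, alpha=1.0, sd=0.02,
                         base_mean=0.0, base_sd=100.0):
    """Probabilities of assigning mass shift x to each existing cluster,
    with the last entry being the probability of opening a new cluster.

    Hypothetical simplification: each existing cluster is scored by its
    size times a Gaussian likelihood centred on its empirical mean; a new
    cluster is scored by alpha times the likelihood under a broad base
    distribution over possible modification masses.
    """
    weights = []
    for members in clusters:
        mean = sum(members) / len(members)
        weights.append(len(members) * normal_pdf(x, mean, sd))
    # Marginal spread for a brand-new cluster: base variance plus noise variance.
    new_sd = math.sqrt(base_sd ** 2 + sd ** 2)
    weights.append(alpha * normal_pdf(x, base_mean, new_sd))
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative mass shifts in Da: one cluster near phosphorylation (~79.97),
# one near oxidation (~15.99).
clusters = [[79.96, 79.97, 79.98], [15.99, 16.00]]

near = crp_assignment_probs(79.965, clusters)  # close to the first cluster
far = crp_assignment_probs(42.01, clusters)    # near acetylation; matches neither

print("near:", near)
print("far: ", far)
```

In this sketch, a mass shift close to an existing cluster is absorbed by it, while a shift that matches no cluster puts nearly all of its probability on the final "new cluster" entry. This is how such a model avoids fixing the number of modification types in advance, unlike k-means or a finite mixture of Gaussians.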
