Computational refinement of post-translational modifications predicted from tandem mass spectrometry

Motivation: A post-translational modification (PTM) is a chemical modification of a protein that occurs naturally. Many of these modifications, such as phosphorylation, are known to play pivotal roles in the regulation of protein function. Hence, PTM perturbations have been linked to diverse diseases such as Parkinson's, Alzheimer's, diabetes and cancer. To discover PTMs on a genome-wide scale, there has been a recent surge of interest in analyzing tandem mass spectrometry (MS/MS) data, and several unrestrictive (so-called ‘blind’) PTM search methods have been reported. However, these approaches are subject to noise in mass measurements and in the predicted modification site (amino acid position) within peptides, which can result in false PTM assignments.

Results: To address these issues, we devised a machine learning algorithm, PTMClust, that can be applied to the output of blind PTM search methods to improve prediction quality by suppressing noise in the data and clustering peptides with the same underlying modification to form PTM groups. We show that our technique outperforms two standard clustering algorithms on a simulated dataset. We also show that our algorithm significantly improves sensitivity and specificity when applied to the output of three different blind PTM search engines: SIMS, InsPecT and MODmap. In addition, PTMClust markedly outperforms another PTM refinement algorithm, PTMFinder. We demonstrate that our technique is able to reduce false PTM assignments, improve overall detection coverage and facilitate novel PTM discovery, including terminus modifications. We applied our technique to a large-scale yeast MS/MS proteome profiling dataset and found numerous known and novel PTMs. Accurately identifying modifications in protein sequences is a critical first step for PTM profiling, and thus our approach may benefit routine proteomic analysis.

Availability: Our algorithm is implemented in Matlab and is freely available for academic use. The software is available online from http://genes.toronto.edu.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Contact: frey@psi.utoronto.ca
