TagGraph reveals vast protein modification landscapes from large tandem mass spectrometry data sets

Although mass spectrometry is well suited to identifying thousands of potential protein post-translational modifications (PTMs), it has historically been biased towards just a few. To measure the entire set of PTMs across diverse proteomes, software must overcome the dual challenges of covering enormous search spaces and distinguishing correct from incorrect spectrum interpretations. Here, we describe TagGraph, a computational tool that overcomes both challenges with an unrestricted string-based search method that is up to 350-fold faster than existing approaches, and a probabilistic validation model that we optimized for PTM assignments. We applied TagGraph to a published human proteomic dataset of 25 million mass spectra and tripled the number of confident spectrum identifications relative to the original analysis. We identified thousands of modification types on almost 1 million sites in the proteome. We show alternative contexts for highly abundant yet understudied PTMs such as proline hydroxylation, and reveal its unexpected association with cancer mutations. By enabling broad characterization of PTMs, TagGraph informs how their functions and regulation intersect.

A string-based computational tool enables swift, robust identification of post-translational modifications in MS/MS datasets.
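As a rough illustration of what a string-based, unrestricted modification search involves, the toy Python sketch below looks up a short sequence tag in protein sequences by exact substring matching and then checks whether the difference between an observed peptide mass and the unmodified candidate matches a known modification mass. This is not TagGraph's implementation, which the abstract does not specify; the function names, the toy protein, the modification table, and the 0.01 Da tolerance are all assumptions made for illustration.

```python
# Toy sketch only: NOT TagGraph's algorithm. All names and values below are
# illustrative assumptions, apart from standard monoisotopic masses.

# Monoisotopic residue masses (Da) for the residues used in this example.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "L": 113.08406, "K": 128.09496, "E": 129.04259,
}
WATER = 18.01056  # mass of H2O added to the sum of residue masses


def peptide_mass(seq):
    """Return the monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[r] for r in seq) + WATER


def find_tag_hits(tag, proteins):
    """Yield (protein_id, start) for every exact occurrence of tag."""
    for pid, seq in proteins.items():
        start = seq.find(tag)
        while start != -1:
            yield pid, start
            start = seq.find(tag, start + 1)


def explain_mass_shift(candidate, observed_mass, known_shifts, tol=0.01):
    """Return the mass difference and any known modifications matching it."""
    delta = observed_mass - peptide_mass(candidate)
    matches = [name for name, shift in known_shifts.items()
               if abs(delta - shift) <= tol]
    return delta, matches


# Hypothetical inputs: one toy protein and two well-known mass shifts.
proteins = {"toy_protein": "GAPGLPGPKGE"}
known_shifts = {"hydroxylation": 15.99491, "acetylation": 42.01057}

hits = list(find_tag_hits("GLPGP", proteins))

# Simulate a spectrum of the peptide GLPGPK carrying one hydroxylation.
observed = peptide_mass("GLPGPK") + 15.99491
delta, matches = explain_mass_shift("GLPGPK", observed, known_shifts)
```

Here `hits` locates the tag in the toy protein, and `matches` names the modification whose mass accounts for the unexplained difference. A real tool would also need an indexed search over the whole proteome, tolerance for errors in the tag, and a statistical model to validate each assignment.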
