Mining Large Scale Tandem Mass Spectrometry Data for Protein Modifications Using Spectral Libraries.

Experimental improvements in post-translational modification (PTM) detection by tandem mass spectrometry (MS/MS) have allowed the identification of vast numbers of PTMs. Open modification searches (OMSs) of MS/MS data, which do not require prior knowledge of the modifications present in the sample, have further increased the diversity of detected PTMs. Despite much effort, functional annotation of PTMs is still lacking. One way to narrow this annotation gap is to mine MS/MS data deposited in public repositories and to correlate PTM presence with the biological meta-information attached to the data. Because such data sets can contain tens of millions of MS/MS spectra, the data mining tools must scale to very large data volumes. Here, we present two tools, Liberator and MzMod, which are built on the MzJava class library and the Apache Spark large-scale computing framework. Liberator builds large MS/MS spectrum libraries, and MzMod searches them in OMS mode. We applied these tools to a recently published set of 25 million spectra from 30 human tissues and present tissue-specific PTMs. We also compared the results with those obtained with the OMS tool MODa and the search engine X!Tandem.
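
The idea behind an open modification search against a spectral library can be illustrated with a minimal sketch. This is a toy example under stated assumptions, not the scoring used by MzMod: the `Spectrum` class, the functions `shared_peaks` and `open_search`, the 200 Da precursor window, and the 0.02 Da fragment tolerance are all hypothetical and chosen only for illustration. A library spectrum is accepted as a candidate if its precursor mass lies within a wide window of the query, and fragment peaks may match either unshifted or shifted by the precursor mass difference, which is then reported as the putative modification mass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spectrum:
    name: str
    precursor_mass: float  # neutral precursor mass in Da
    peaks: tuple           # fragment m/z values (singly charged assumed)


def shared_peaks(query, library, delta, tol=0.02):
    """Count library peaks found in the query, unshifted or shifted by delta."""
    matched = 0
    for mz in library.peaks:
        if any(abs(q - mz) <= tol or abs(q - (mz + delta)) <= tol
               for q in query.peaks):
            matched += 1
    return matched


def open_search(query, library_spectra, window=200.0, tol=0.02):
    """Return (name, score, delta) of the best library match, or None.

    Unlike a closed search, the precursor window is wide, so a library
    spectrum of the unmodified peptide can match a modified query.
    """
    best = None
    for lib in library_spectra:
        delta = query.precursor_mass - lib.precursor_mass
        if abs(delta) > window:
            continue  # precursor mass difference too large to consider
        score = shared_peaks(query, lib, delta, tol)
        if best is None or score > best[1]:
            best = (lib.name, score, delta)
    return best
```

Calling `open_search(query, library)` returns the best-matching library entry together with the mass difference between query and library precursor. In a large-scale setting, the loop over library spectra is the part that a framework such as Apache Spark would distribute across workers.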
