PIPI: PTM-Invariant Peptide Identification Using Coding Method

In computational proteomics, identification of peptides with an unlimited number of post-translational modification (PTM) types is a challenging task. The computational cost increases exponentially with respect to the number of modifiable amino acids and linearly with respect to the number of potential PTM types at each amino acid. The problem becomes intractable very quickly if we want to enumerate all possible modification patterns. Existing tools (e.g., MS-Alignment, ProteinProspector, and MODa) avoid enumerating modification patterns in database search by using an alignment-based approach to localize and characterize modified amino acids. This approach avoids enumerating all possible modification patterns in a database search. However, due to the large search space and PTM localization issue, the sensitivity of these tools is low. This paper proposes a novel method named PIPI to achieve PTM-invariant peptide identification. PIPI first codes peptide sequences into Boolean vectors and converts experimental spectra into real-valued vectors. Then, it finds the top 10 peptide-coded vectors for each spectrum-coded vector. After that, PIPI uses a dynamic programming algorithm to localize and characterize modified amino acids. Simulations and real data experiments have shown that PIPI outperforms existing tools by identifying more peptide-spectrum matches (PSMs) and reporting fewer false positives. It also runs much faster than existing tools when the database is large.

[1]  B. Searle,et al.  High-throughput identification of proteins and unanticipated sequence modifications using a mass-based alignment algorithm for MS/MS de novo sequencing results. , 2004, Analytical chemistry.

[2]  Brian L. Frey,et al.  Global Identification of Protein Post-translational Modifications in a Single-Pass Database Search , 2015, Journal of proteome research.

[3]  Edward L. Huttlin,et al.  A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides , 2015, Nature Biotechnology.

[4]  Steven P Gygi,et al.  Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry , 2007, Nature Methods.

[5]  Heejin Park,et al.  Unrestrictive Identification of Multiple Post-translational Modifications from Tandem Mass Spectrometry Using an Error-tolerant Algorithm Based on an Extended Sequence Tag Approach*S , 2008, Molecular & Cellular Proteomics.

[6]  P. Pevzner,et al.  InsPecT: identification of posttranslationally modified peptides from tandem mass spectra. , 2005, Analytical chemistry.

[7]  Ruedi Aebersold,et al.  The standard protein mix database: a diverse data set to assist in the production of improved Peptide and protein identification software tools. , 2008, Journal of proteome research.

[8]  D. Liebler,et al.  P-Mod: an algorithm and software to map modifications to peptide sequences using tandem MS data. , 2005, Journal of proteome research.

[9]  Frederic Nikitin,et al.  QuickMod: A tool for open modification spectrum library searches. , 2011, Journal of proteome research.

[10]  Sándor Pongor,et al.  PTMTreeSearch: a novel two-stage tree-search algorithm with pruning rules for the identification of post-translational modification of proteins in MS/MS spectra , 2014, Bioinform..

[11]  Xin Huang,et al.  ISPTM: an iterative search algorithm for systematic identification of post-translational modifications from complex proteome mixtures. , 2013, Journal of proteome research.

[12]  Hyungwon Choi,et al.  LuciPHOr: Algorithm for Phosphorylation Site Localization with False Localization Rate Estimation Using Modified Target-Decoy Approach* , 2013, Molecular & Cellular Proteomics.

[13]  Peter R Baker,et al.  Modification Site Localization Scoring Integrated into a Search Engine* , 2011, Molecular & Cellular Proteomics.

[14]  L. M. Akella,et al.  SeMoP: a new computational strategy for the unrestricted search for modified peptides using LC-MS/MS data. , 2008, Journal of proteome research.

[15]  Hyungwon Choi,et al.  False discovery rates and related statistical concepts in mass spectrometry-based proteomics. , 2008, Journal of proteome research.

[16]  William Stafford Noble,et al.  Crux: Rapid Open Source Protein Tandem Mass Spectrometry Analysis , 2014, Journal of proteome research.

[17]  B. Searle,et al.  Identification of protein modifications using MS/MS de novo sequencing and the OpenSea alignment algorithm. , 2005, Journal of proteome research.

[18]  P. Pevzner,et al.  Spectral Dictionaries , 2009, Molecular & Cellular Proteomics.

[19]  F. McLafferty,et al.  Automated reduction and interpretation of , 2000, Journal of the American Society for Mass Spectrometry.

[20]  B. Kuster,et al.  Confident Phosphorylation Site Localization Using the Mascot Delta Score , 2010, Molecular & Cellular Proteomics.

[21]  William Stafford Noble,et al.  Posterior error probabilities and false discovery rates: two sides of the same coin. , 2008, Journal of proteome research.

[22]  P. Pevzner,et al.  PepNovo: de novo peptide sequencing via probabilistic network modeling. , 2005, Analytical chemistry.

[23]  William Stafford Noble,et al.  Semi-supervised learning for peptide identification from shotgun proteomics datasets , 2007, Nature Methods.

[24]  J. Yates,et al.  An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database , 1994, Journal of the American Society for Mass Spectrometry.

[25]  M. Wilm,et al.  Error-tolerant identification of peptides in sequence databases by peptide sequence tags. , 1994, Analytical chemistry.

[26]  Wen Gao,et al.  pFind 2.0: a software package for peptide and protein identification via tandem mass spectrometry. , 2007, Rapid communications in mass spectrometry : RCM.

[27]  Bo Yan,et al.  Peptide sequence tag-based blind identification of post-translational modifications with point process model , 2006, ISMB.

[28]  K. Clauser,et al.  Modification Site Localization Scoring: Strategies and Performance , 2012, Molecular & Cellular Proteomics.

[29]  Pavel A. Pevzner,et al.  Peptide Sequence Tags for Fast Database Search in Mass-Spectrometry , 2005, RECOMB.

[30]  Dekel Tsur,et al.  Identification of post-translational modifications by blind search of mass spectra , 2005, Nature Biotechnology.

[31]  F. McLafferty,et al.  High-resolution electrospray mass spectra of large molecules , 1991 .

[32]  Peter R Baker,et al.  In-depth Analysis of Tandem Mass Spectrometry Data from Disparate Instrument Types*S , 2008, Molecular & Cellular Proteomics.

[33]  J. Yates,et al.  GutenTag: high-throughput sequence tagging via an empirically derived fragmentation model. , 2003, Analytical chemistry.

[34]  A. Wool,et al.  Large-scale unrestricted identification of post-translation modifications using tandem mass spectrometry. , 2007, Analytical chemistry.

[35]  J. Eng,et al.  Comet: An open‐source MS/MS sequence database search tool , 2013, Proteomics.

[36]  Yingming Zhao,et al.  PTMap—A sequence alignment software for unrestricted, accurate, and full-spectrum identification of post-translational modification sites , 2009, Proceedings of the National Academy of Sciences.

[37]  D. N. Perkins,et al.  Probability‐based protein identification by searching sequence databases using mass spectrometry data , 1999, Electrophoresis.

[38]  Pavel A. Pevzner,et al.  Universal database search tool for proteomics , 2014, Nature Communications.

[39]  Nuno Bandeira,et al.  Protein identification by spectral networks analysis. , 2011, Methods in molecular biology.

[40]  Lan Huang,et al.  Comprehensive Analysis of a Multidimensional Liquid Chromatography Mass Spectrometry Dataset Acquired on a Quadrupole Selecting, Quadrupole Collision Cell, Time-of-flight Mass Spectrometer , 2005, Molecular & Cellular Proteomics.

[41]  Pavel A. Pevzner,et al.  De Novo Peptide Sequencing via Tandem Mass Spectrometry , 1999, J. Comput. Biol..

[42]  Mikhail M Savitski,et al.  ModifiComb, a New Proteomic Tool for Mapping Substoichiometric Post-translational Modifications, Finding Novel Types of Modifications, and Fingerprinting Complex Protein Mixtures* , 2006, Molecular & Cellular Proteomics.

[43]  Bin Ma,et al.  SPIDER: software for protein identification from sequence tags with de novo sequencing error , 2004, Proceedings. 2004 IEEE Computational Systems Bioinformatics Conference, 2004. CSB 2004..

[44]  Ming-Yang Kao,et al.  A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry , 2000, SODA '00.

[45]  William Stafford Noble,et al.  Improved False Discovery Rate Estimation Procedure for Shotgun Proteomics , 2015, Journal of proteome research.

[46]  M. Mann,et al.  Andromeda: a peptide search engine integrated into the MaxQuant environment. , 2011, Journal of proteome research.

[47]  A. Nesvizhskii,et al.  Improved sequence tag generation method for peptide identification in tandem mass spectrometry. , 2008, Journal of proteome research.

[48]  P. Pevzner,et al.  Spectral Profiles, a Novel Representation of Tandem Mass Spectra and Their Applications for De Novo Peptide Sequencing and Identification* □ S , 2022 .

[49]  Robertson Craig,et al.  TANDEM: matching proteins with tandem mass spectra. , 2004, Bioinformatics.

[50]  Steven P Gygi,et al.  A probability-based approach for high-throughput protein phosphorylation analysis and site localization , 2006, Nature Biotechnology.

[51]  Eunok Paek,et al.  Fast Multi-blind Modification Search through Tandem Mass Spectrometry* , 2011, Molecular & Cellular Proteomics.

[52]  P. Andrews,et al.  A spectral clustering approach to MS/MS identification of post-translational modifications. , 2008, Journal of proteome research.