PIPI: PTM-Invariant Peptide Identification Using Coding Method

In computational proteomics, identifying peptides with an unlimited number of post-translational modification (PTM) types is a challenging task. The computational cost of database search increases exponentially with the number of modified amino acids and linearly with the number of potential PTM types at each amino acid, so enumerating all possible PTM patterns quickly becomes intractable. To address this issue, one group of methods, called restricted tools (including Mascot, Comet, and MS-GF+), allows only a small number of PTM types in the database search. The other group, called unrestricted tools (including MS-Alignment, ProteinProspector, and MODa), avoids enumerating PTM patterns by using an alignment-based approach to localize and characterize modified amino acids. However, because of the large search space and the PTM localization issue, the sensitivity of these unrestricted tools is low. This paper proposes a novel method named PIPI to achieve PTM-invariant peptide identification. PIPI belongs to the category of unrestricted tools. It first codes peptide sequences into Boolean vectors and codes experimental spectra into real-valued vectors. For each coded spectrum, it then searches the coded sequence database to find the top-scoring peptide sequences as candidates. After that, PIPI uses dynamic programming to localize and characterize the modified amino acids in each candidate. We used simulation experiments and real data experiments to evaluate its performance in comparison with restricted tools (i.e., Mascot, Comet, and MS-GF+) and unrestricted tools (i.e., Mascot with error tolerant search, MS-Alignment, ProteinProspector, and MODa). Comparison with restricted tools shows that PIPI has comparable sensitivity and running speed. Comparison with unrestricted tools shows that PIPI has the highest sensitivity except for Mascot with error tolerant search and ProteinProspector. These two tools simplify the task by considering at most one modified amino acid in each peptide, which yields a higher sensitivity but makes it difficult to handle peptides with multiple modified amino acids. The simulation experiments also show that PIPI has the lowest false discovery proportion, the highest PTM characterization accuracy, and the shortest running time among the unrestricted tools.
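The coding-and-search step can be illustrated with a small sketch. The abstract does not specify the coding scheme, so everything below is an assumption made for illustration: peptides are coded by which amino acid 3-mers ("tags") they contain, a spectrum is coded by confidence weights for tags that are assumed to have been extracted from it already, and the score is a normalized dot product. The names `code_peptide`, `code_spectrum`, `score`, and `top_candidates` are hypothetical and are not taken from the PIPI software.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
K = 3  # assumed tag length; not stated in the abstract

# Map every possible K-mer to a fixed vector position.
TAG_INDEX = {"".join(t): i for i, t in enumerate(product(AMINO_ACIDS, repeat=K))}


def code_peptide(seq):
    """Boolean vector, stored sparsely as the set of positions that are True."""
    return frozenset(TAG_INDEX[seq[i:i + K]] for i in range(len(seq) - K + 1))


def code_spectrum(tag_weights):
    """Real-valued vector, stored sparsely as {position: weight}.

    `tag_weights` maps tags assumed to be read from the spectrum to weights.
    """
    return {TAG_INDEX[tag]: w for tag, w in tag_weights.items()}


def score(spec_code, pep_code):
    """Normalized dot product between a coded spectrum and a coded peptide."""
    dot = sum(w for i, w in spec_code.items() if i in pep_code)
    norm = (len(pep_code) * sum(w * w for w in spec_code.values())) ** 0.5
    return dot / norm if norm else 0.0


def top_candidates(spec_code, coded_db, n=2):
    """Return the n peptide sequences with the highest scores."""
    ranked = sorted(coded_db, key=lambda p: score(spec_code, coded_db[p]),
                    reverse=True)
    return ranked[:n]


# Toy database and a toy spectrum whose tags come from unmodified stretches.
coded_db = {p: code_peptide(p) for p in ("PEPTIDEK", "SAMPLERK", "ELVISLIVESK")}
spectrum = code_spectrum({"PEP": 0.9, "TID": 0.7, "IDE": 0.4})
best = top_candidates(spectrum, coded_db, n=1)
```

In this toy setting the ranking depends only on which tags are present, not on precursor mass, so a candidate can still be retrieved when some of its residues carry modifications, as long as enough tags from unmodified stretches survive. Localizing and characterizing the modifications in each candidate would be a separate step, which the abstract assigns to dynamic programming.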
