On Comparison of SimTandem with State-of-the-Art Peptide Identification Tools, Efficiency of Precursor Mass Filter and Dealing with Variable Modifications

Similarity search in theoretical mass spectra generated from protein sequence databases is a widely accepted approach for identifying peptides from query mass spectra produced by shotgun proteomics. Growing protein sequence databases and noisy query spectra demand database indexing techniques and better similarity measures for comparing theoretical spectra against query spectra. We employ a modification of the previously proposed parameterized Hausdorff distance for comparing mass spectra. The new distance outperforms the original distance, the angle distance, and the state-of-the-art peptide identification tools OMSSA and X!Tandem in the number of identified peptides, even at a q-value of only 0.001. When a precursor mass filter is used as the database indexing technique, our method searches faster than OMSSA, and when variable modifications are not searched, its search time is similar to that of X!Tandem. We show that the precursor mass filter is an efficient database indexing technique for high-accuracy data even when many variable modifications are searched. We demonstrate that more peptides are identified when variable modifications are searched separately, in multiple search runs of a peptide identification engine. Otherwise, mixing unmodified and modified spectra affects the false discovery rates and lowers the number of identified peptides. Our method is implemented in the freely available application SimTandem, which can be used within the TOPP framework based on OpenMS.
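
To illustrate the general idea of a precursor mass filter, the sketch below keeps candidate peptides sorted by mass and uses binary search to select only those within a tolerance window around the query precursor mass. This is a minimal sketch under stated assumptions, not SimTandem's implementation: the function names `build_index` and `candidates`, the ppm tolerance convention, and the example masses are all hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_index(peptides):
    """Sort (sequence, mass) pairs by mass; return them with a parallel mass list.

    Hypothetical helper: a real engine would compute masses from sequences
    and would also account for modifications.
    """
    entries = sorted(peptides, key=lambda p: p[1])
    masses = [mass for _, mass in entries]
    return entries, masses


def candidates(entries, masses, precursor_mass, tol_ppm):
    """Return the peptides whose mass lies within tol_ppm of precursor_mass."""
    delta = precursor_mass * tol_ppm * 1e-6
    lo = bisect_left(masses, precursor_mass - delta)
    hi = bisect_right(masses, precursor_mass + delta)
    return entries[lo:hi]


if __name__ == "__main__":
    # Placeholder labels and masses, chosen only to show the windowing.
    db = [("A", 1200.0), ("B", 1000.005), ("C", 1000.0), ("D", 1000.5)]
    entries, masses = build_index(db)
    for sequence, mass in candidates(entries, masses, 1000.002, 10.0):
        print(sequence, mass)
```

Only the candidates returned by the filter would then be scored with the more expensive spectrum similarity measure. With high-accuracy data the tolerance window is narrow, so few candidates survive the filter for each query.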
