Peptide Sequence Tags for Fast Database Search in Mass-Spectrometry

Filtration techniques in the form of rapid elimination of candidate sequences while retaining the true one are key ingredients of database searches in genomics. Although SEQUEST and Mascot perform a conceptually similar task to the tool BLAST, the key algorithmic idea of BLAST (filtration) was never implemented in these tools. As a result MS/MS protein identification tools are becoming too time-consuming for many applications including search for post-translationally modified peptides. Moreover, matching millions of spectra against all known proteins will soon make these tools too slow in the same way that "genome vs genome" comparisons instantly made BLAST too slow. We describe the development of filters for MS/MS database searches that dramatically reduce the running time and effectively remove the bottlenecks in searching the huge space of protein modifications. Our approach, based on a probability model for determining the accuracy of sequence tags, achieves superior results compared to GutenTag, a popular tag generation algorithm. Our tag generating algorithm along with our de novo sequencing algorithm PepNovo can be accessed via the URL http://peptide.ucsd.edu/.

[1]  Vineet Bafna,et al.  SCOPE: a probabilistic model for scoring tandem mass spectra against a peptide database , 2001, ISMB.

[2]  J. Yates,et al.  Method to correlate tandem mass spectra of modified peptides to amino acid sequences in the protein database. , 1995, Analytical chemistry.

[3]  Clifford Stein,et al.  Introduction to Algorithms, 2nd edition. , 2001 .

[4]  J. Yates,et al.  A hypergeometric probability model for protein identification and validation using tandem mass spectral data and protein sequence databases. , 2003, Analytical chemistry.

[5]  Ting Chen,et al.  Algorithms for de novo peptide sequencing using tandem mass spectrometry , 2004 .

[6]  M. Wilm,et al.  Error-tolerant identification of peptides in sequence databases by peptide sequence tags. , 1994, Analytical chemistry.

[7]  Pavel A. Pevzner,et al.  De Novo Peptide Sequencing via Tandem Mass Spectrometry , 1999, J. Comput. Biol..

[8]  E. Kolker,et al.  Standard mixtures for proteome studies. , 2004, Omics : a journal of integrative biology.

[9]  P. Bork,et al.  Nanoelectrospray tandem mass spectrometry and sequence similarity searching for identification of proteins from organisms with unknown genomes. , 2003, Methods in molecular biology.

[10]  Bo Yan,et al.  A graph-theoretic approach for the separation of b and y ions in tandem mass spectra , 2005, Bioinform..

[11]  R. Aebersold,et al.  Mass spectrometry-based proteomics , 2003, Nature.

[12]  J. Yates,et al.  Probability-based validation of protein identifications using a modified SEQUEST algorithm. , 2002, Analytical chemistry.

[13]  Xian Chen,et al.  New computational approaches for de novo peptide sequencing from MS/MS experiments , 2002, Proc. IEEE.

[14]  M. Mann,et al.  Proteomic analysis of post-translational modifications , 2003, Nature Biotechnology.

[15]  A. Gorin,et al.  PPM-chain - de novo peptide identification program comparable in performance to Sequest , 2004 .

[16]  Alfred V. Aho,et al.  Efficient string matching , 1975, Commun. ACM.

[17]  J. Yates,et al.  GutenTag: high-throughput sequence tagging via an empirically derived fragmentation model. , 2003, Analytical chemistry.

[18]  Ming Li,et al.  PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. , 2003, Rapid communications in mass spectrometry : RCM.

[19]  A. Shevchenko,et al.  MultiTag: multiple error-tolerant sequence tag search for the sequence-similarity identification of proteins by mass spectrometry. , 2003, Analytical chemistry.

[20]  J. A. Taylor,et al.  Implementation and uses of automated de novo peptide sequencing by tandem mass spectrometry. , 2001, Analytical chemistry.

[21]  J. Yates,et al.  An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database , 1994, Journal of the American Society for Mass Spectrometry.

[22]  Steven P Gygi,et al.  Intensity-based protein identification by machine learning from a library of tandem mass spectra , 2004, Nature Biotechnology.

[23]  A. Nesvizhskii,et al.  Experimental protein mixture for validating tandem mass spectral analysis. , 2002, Omics : a journal of integrative biology.

[24]  R. Appel,et al.  Popitam: Towards new heuristic strategies to improve protein identification from tandem mass spectrometry data , 2003, Proteomics.

[25]  D. Creasy,et al.  Error tolerant searching of uninterpreted tandem mass spectrometry data , 2002, Proteomics.

[26]  Ting Chen,et al.  A Suboptimal Algorithm for De Novo Peptide Sequencing via Tandem Mass Spectrometry , 2003, J. Comput. Biol..

[27]  Vineet Bafna,et al.  On de novo interpretation of tandem mass spectra for peptide identification , 2003, RECOMB '03.

[28]  J. A. Taylor,et al.  Sequence database searches via de novo peptide sequencing by tandem mass spectrometry. , 1997, Rapid communications in mass spectrometry : RCM.

[29]  R. Aebersold,et al.  A statistical model for identifying proteins by tandem mass spectrometry. , 2003, Analytical chemistry.

[30]  Vineet Bafna,et al.  InsPecT : Fast and accurate identification of post-translationally modified peptides from tandem mass spectra , 2005 .

[31]  T. Speed,et al.  Deriving statistical models for predicting peptide tandem MS product ion intensities. , 2003, Biochemical Society transactions.

[32]  Ming-Yang Kao,et al.  A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry , 2000, SODA '00.

[33]  P. Pevzner,et al.  PepNovo: de novo peptide sequencing via probabilistic network modeling. , 2005, Analytical chemistry.

[34]  Jorge Fernandez-de-Cossío,et al.  A computer program to aid the sequencing of peptides in collision- activated decomposition experiments , 1995, Comput. Appl. Biosci..

[35]  Chris L. Tang,et al.  Efficiency of database search for identification of mutated and modified proteins via mass spectrometry. , 2001, Genome research.

[36]  J. Yates,et al.  Statistical characterization of ion trap tandem mass spectra from doubly charged tryptic peptides. , 2003, Analytical chemistry.

[37]  J. Shewchuk An Introduction to the Conjugate Gradient Method Without the Agonizing Pain , 1994 .

[38]  P. Pevzner,et al.  InsPecT: identification of posttranslationally modified peptides from tandem mass spectra. , 2005, Analytical chemistry.

[39]  Z. Smilansky,et al.  Intensity-based statistical scorer for tandem mass spectrometry. , 2003, Analytical chemistry.

[40]  Eric R. Ziegel,et al.  The Elements of Statistical Learning , 2003, Technometrics.

[41]  A. Masselot,et al.  OLAV: Towards high‐throughput tandem mass spectrometry data identification , 2003, Proteomics.

[42]  B. Searle,et al.  High-throughput identification of proteins and unanticipated sequence modifications using a mass-based alignment algorithm for MS/MS de novo sequencing results. , 2004, Analytical chemistry.

[43]  Ying Xu,et al.  A computational method for assessing peptide‐ identification reliability in tandem mass spectrometry analysis with SEQUEST , 2004 .

[44]  J. Yates,et al.  Mining genomes: correlating tandem mass spectra of modified and unmodified peptides to sequences in nucleotide databases. , 1995, Analytical chemistry.

[45]  D. N. Perkins,et al.  Probability‐based protein identification by searching sequence databases using mass spectrometry data , 1999, Electrophoresis.

[46]  Ting Chen,et al.  A suffix tree approach to the interpretation of tandem mass spectra: applications to peptides of non-specific digestion and post-translational modifications , 2003, ECCB.

[47]  Alexey I Nesvizhskii,et al.  Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. , 2002, Analytical chemistry.

[48]  Rong Wang,et al.  The need for a public proteomics repository , 2004, Nature Biotechnology.