A Hidden Markov Model Based Scoring Function for Mass Spectrometry Database Search

An accurate scoring function for database search is crucial for peptide identification using tandem mass spectrometry. Although many mathematical models have been proposed to score peptides against tandem mass spectra, our method (called PepHMM, http://msms.cmb.usc.edu) is unique in that it combines information on machine accuracy, mass peak intensity, and correlation among ions into a hidden Markov model (HMM). In addition, we develop a method to calculate the statistical significance of the HMM scores. We implement the method, test it on two sets of experimental data generated by two different types of mass spectrometers, and compare the results with MASCOT and SEQUEST under the same conditions. Results on one data set show that PepHMM has a much higher accuracy (6.5% error rate) than MASCOT (17.4% error rate), and results on the other show that PepHMM identifies 43% and 31% more correct spectra than SEQUEST and MASCOT, respectively.
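As a rough illustration of how an HMM assigns a score to a sequence of observations, the sketch below implements the standard forward algorithm for a small discrete HMM. It is not the PepHMM model: the function name, the two hidden states, the two observation symbols, and all probabilities are hypothetical placeholders, and the abstract does not specify PepHMM's actual states, emissions, or significance calculation.

```python
import math


def forward_log_likelihood(init, trans, emit, observations):
    """Return log P(observations) for a discrete HMM (forward algorithm).

    init[s]     : probability of starting in hidden state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][o]  : probability of emitting symbol o while in state s
    All names and values are illustrative assumptions, not PepHMM parameters.
    """
    if not observations:
        return 0.0  # the empty sequence has probability 1

    n_states = len(init)
    log_lik = 0.0
    alpha = []
    for t, obs in enumerate(observations):
        if t == 0:
            # Initialisation: start distribution times first emission.
            alpha = [init[s] * emit[s][obs] for s in range(n_states)]
        else:
            # Recursion: sum over predecessor states, then emit.
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][obs]
                for s in range(n_states)
            ]
        # Rescale at every step so long sequences do not underflow.
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        alpha = [a / scale for a in alpha]
        log_lik += math.log(scale)
    return log_lik


# Toy model (hypothetical): state 0/1 could stand for "fragment ion
# present/absent", symbol 0/1 for "peak matched/unmatched".
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]

score = forward_log_likelihood(init, trans, emit, [0, 1, 0])
```

A higher log-likelihood means the observation sequence is better explained by the model. In a database search setting, a score of this kind would be computed for each candidate peptide and the candidates ranked by it.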
