Automatic Quality Assessment of Peptide Tandem Mass Spectra

MOTIVATION A powerful proteomics methodology couples high-performance liquid chromatography (HPLC) with tandem mass spectrometry and database-search software, such as SEQUEST. Such a set-up, however, produces a large number of spectra, many of which are of too poor quality to be useful. Hence a filter that eliminates poor spectra before the database search can significantly improve throughput and robustness. Moreover, spectra judged to be of high quality, but that cannot be identified by database search, are prime candidates for still more computationally intensive methods, such as de novo sequencing or wider database searches including post-translational modifications. RESULTS We report on two different approaches to assessing spectral quality prior to identification: binary classification, which predicts whether or not SEQUEST will be able to make an identification, and statistical regression, which predicts a more universal quality metric involving the number of b- and y-ion peaks. The best of our binary classifiers can eliminate over 75% of the unidentifiable spectra while losing only 10% of the identifiable spectra. Statistical regression can pick out spectra of modified peptides that can be identified by a de novo program but not by SEQUEST. In a section of independent interest, we discuss intensity normalization of mass spectra.

[1]  N. Sherman,et al.  Protein Sequencing and Identification Using Tandem Mass Spectrometry: Kinter/Tandem Mass Spectrometry , 2000 .

[2]  D. Liebler Introduction to proteomics : tools for the new biology /by Daniel C. Liebler ; foreword by John R. Yates, III. , 2007 .

[3]  R. Aebersold,et al.  Mass spectrometry in proteomics. , 2001, Chemical reviews.

[4]  J. Yates,et al.  GutenTag: high-throughput sequence tagging via an empirically derived fragmentation model. , 2003, Analytical chemistry.

[5]  David L. Tabb,et al.  Protein Identification by SEQUEST , 2001 .

[6]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[7]  Rovshan G Sadygov,et al.  Code developments to improve the efficiency of automated MS/MS spectra interpretation. , 2002, Journal of proteome research.

[8]  Z. Smilansky,et al.  Intensity-based statistical scorer for tandem mass spectrometry. , 2003, Analytical chemistry.

[9]  Pavel A. Pevzner,et al.  De Novo Peptide Sequencing via Tandem Mass Spectrometry , 1999, J. Comput. Biol..

[10]  J. Yates,et al.  Statistical characterization of ion trap tandem mass spectra from doubly charged tryptic peptides. , 2003, Analytical chemistry.

[11]  Vineet Bafna,et al.  SCOPE: a probabilistic model for scoring tandem mass spectra against a peptide database , 2001, ISMB.

[13]  A. Nesvizhskii,et al.  Experimental protein mixture for validating tandem mass spectral analysis. , 2002, Omics : a journal of integrative biology.

[14]  J. Yates,et al.  An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database , 1994, Journal of the American Society for Mass Spectrometry.

[15]  J. Yates,et al.  Large-scale analysis of the yeast proteome by multidimensional protein identification technology , 2001, Nature Biotechnology.

[16]  David Fenyö,et al.  RADARS, a bioinformatics solution that automates proteome mass spectral analysis, optimises protein identification, and archives data in a relational database , 2002, Proteomics.

[17]  Thorsten Joachims,et al.  Making large scale SVM learning practical , 1998 .

[18]  J. A. Taylor,et al.  Implementation and uses of automated de novo peptide sequencing by tandem mass spectrometry. , 2001, Analytical chemistry.

[19]  D. N. Perkins,et al.  Probability‐based protein identification by searching sequence databases using mass spectrometry data , 1999, Electrophoresis.

[20]  D. Ruppert The Elements of Statistical Learning: Data Mining, Inference, and Prediction , 2004 .

[21]  John I. Clark,et al.  Shotgun identification of protein modifications from protein complexes and lens tissue , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[22]  J. Yates,et al.  Probability-based validation of protein identifications using a modified SEQUEST algorithm. , 2002, Analytical chemistry.