Intensity-based protein identification by machine learning from a library of tandem mass spectra

Tandem mass spectrometry (MS/MS) has emerged as a cornerstone of proteomics, owing in part to robust spectral interpretation algorithms. Widely used algorithms, however, do not fully exploit the intensity patterns present in mass spectra. Here, we demonstrate that modeling these intensity patterns improves peptide and protein identification from MS/MS spectra. We modeled fragment ion intensities using a machine-learning approach that estimates the likelihood of observed intensities given peptide and fragment attributes. From 1,000,000 spectra, we chose 27,000 with high-quality, nonredundant matches as training data. Using the same 27,000 spectra, we built a second model of the intensities observed when spectra are paired with mismatched peptides. We used these two probabilistic models to compute the relative likelihood of an observed spectrum given that a candidate peptide is matched or mismatched. We used a 'decoy' proteome approach to estimate the frequency of incorrect matches, and demonstrated that an intensity-based method reduces peptide identification error by 50–96% without any loss in sensitivity.
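
The scoring idea in the abstract (two intensity models, one for matched and one for mismatched peptides, combined into a relative likelihood and checked against a decoy proteome) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the binned intensities, the single categorical fragment attribute, the per-fragment independence assumption, the smoothed count tables, and all names (`IntensityModel`, `log_likelihood_ratio`, `decoy_error_rate`) are simplifying assumptions chosen only for illustration. The decoy estimate shown is likewise an assumed simple ratio of decoy hits to target hits above a score threshold.

```python
import math
from collections import Counter

N_BINS = 4  # hypothetical number of discretized intensity bins


class IntensityModel:
    """Smoothed count table for P(intensity bin | fragment attribute).

    A simplified stand-in for the probabilistic model in the abstract.
    """

    def __init__(self, n_bins=N_BINS, pseudocount=1.0):
        self.n_bins = n_bins
        self.pseudocount = pseudocount
        self.counts = Counter()  # (attribute, bin) -> count
        self.totals = Counter()  # attribute -> count

    def fit(self, observations):
        """observations: list of (attribute, intensity_bin) pairs."""
        for attribute, intensity_bin in observations:
            self.counts[(attribute, intensity_bin)] += 1
            self.totals[attribute] += 1
        return self

    def log_prob(self, attribute, intensity_bin):
        # Additive smoothing keeps unseen combinations at nonzero probability.
        num = self.counts[(attribute, intensity_bin)] + self.pseudocount
        den = self.totals[attribute] + self.pseudocount * self.n_bins
        return math.log(num / den)


def log_likelihood_ratio(fragments, matched, mismatched):
    """Sum of per-fragment log ratios; assumes fragments are independent."""
    return sum(
        matched.log_prob(a, b) - mismatched.log_prob(a, b)
        for a, b in fragments
    )


def decoy_error_rate(scored_hits, threshold):
    """scored_hits: list of (score, is_decoy). Assumed estimate: decoys / targets."""
    targets = sum(1 for s, d in scored_hits if s >= threshold and not d)
    decoys = sum(1 for s, d in scored_hits if s >= threshold and d)
    return decoys / targets if targets else 0.0


# Toy training data (invented): attribute is the ion series, bin 3 = most intense.
matched = IntensityModel().fit(
    [("b", 3), ("b", 3), ("b", 2), ("y", 3), ("y", 3), ("y", 3)]
)
mismatched = IntensityModel().fit(
    [("b", 0), ("b", 0), ("b", 1), ("y", 0), ("y", 1), ("y", 0)]
)

# A candidate whose predicted fragments are intense, and one whose are weak.
strong = log_likelihood_ratio([("b", 3), ("y", 3)], matched, mismatched)
weak = log_likelihood_ratio([("b", 0), ("y", 0)], matched, mismatched)
print(strong, weak)
```

In this toy setup, a candidate whose fragments fall in the intensity bins typical of correct matches receives a positive score, and one resembling the mismatched training data receives a negative score. Searching a combined target and decoy database and calling `decoy_error_rate` at a chosen threshold then gives a rough estimate of how often accepted matches are incorrect.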
