Modeling peptide fragmentation with dynamic Bayesian networks for peptide identification

Motivation: Tandem mass spectrometry (MS/MS) is an indispensable technology for identifying proteins in complex mixtures. Proteins are digested into peptides, which are then identified by their fragmentation patterns in the mass spectrometer. Thus, at its core, MS/MS protein identification relies on the relative predictability of peptide fragmentation. Unfortunately, peptide fragmentation is complex and not fully understood, and what is understood is not always exploited by peptide identification algorithms.

Results: We use a hybrid dynamic Bayesian network (DBN)/support vector machine (SVM) approach to address these two problems. We train a set of DBNs on high-confidence peptide-spectrum matches. These DBNs, known collectively as Riptide, comprise a probabilistic model of peptide fragmentation chemistry. Examining the distributions learned by Riptide allows identification of new trends, such as prevalent a-ion fragmentation at peptide cleavage sites C-terminal to hydrophobic residues. In addition, Riptide can be used to produce likelihood scores that indicate whether a given peptide-spectrum match is correct. A vector of such scores is evaluated by an SVM, which produces a final score to be used in peptide identification. Using Riptide in this way yields improved discrimination compared with other state-of-the-art MS/MS identification algorithms, increasing the number of positive identifications by as much as 12% at a 1% false discovery rate.

Availability: Python and C source code are available upon request from the authors. The curated training sets are available at http://noble.gs.washington.edu/proj/intense/. The Graphical Model Tool Kit (GMTK) is freely available at http://ssli.ee.washington.edu/bilmes/gmtk.

Contact: noble@gs.washington.edu
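To make the evaluation criterion concrete, the following is a minimal sketch of how the number of positive identifications at a fixed false discovery rate could be counted from final scores. It is not the authors' code: the function name `count_accepted_at_fdr`, the use of decoy peptide-spectrum matches to estimate the FDR, and the estimator (decoys accepted divided by targets accepted) are all assumptions made for illustration, since the abstract does not state how the FDR was estimated.

```python
def count_accepted_at_fdr(target_scores, decoy_scores, fdr_threshold=0.01):
    """Count target matches accepted at an estimated FDR (hypothetical helper).

    Assumptions: higher scores are better, and the FDR at a score cutoff is
    estimated as (# decoys at or above cutoff) / (# targets at or above cutoff).
    """
    labeled = [(s, True) for s in target_scores]
    labeled += [(s, False) for s in decoy_scores]
    # Sort best score first; on ties, decoys come first so the estimate
    # errs on the conservative side.
    labeled.sort(key=lambda pair: (-pair[0], pair[1]))

    targets = decoys = best = 0
    for _score, is_target in labeled:
        if is_target:
            targets += 1
        else:
            decoys += 1
        # Remember the largest accepted set whose estimated FDR is acceptable.
        if targets > 0 and decoys / targets <= fdr_threshold:
            best = targets
    return best


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    targets = [10, 9, 8, 7, 6, 5]
    decoys = [6.5, 4]
    print(count_accepted_at_fdr(targets, decoys, fdr_threshold=0.01))
```

Under these assumptions, two scoring methods can be compared by running the same count on each method's scores at the same threshold and reporting the relative difference in accepted identifications.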
