Predicting tryptic cleavage from proteomics data using decision tree ensembles.

Trypsin is the workhorse protease in mass spectrometry-based proteomics experiments and is used to digest proteins into more readily analyzable peptides. To identify these peptides after mass spectrometric analysis, the actual digestion has to be mimicked as faithfully as possible in silico. In this paper we introduce CP-DT (Cleavage Prediction with Decision Trees), an algorithm based on a decision tree ensemble that was learned on publicly available peptide identification data from the PRIDE repository. We demonstrate that CP-DT is able to accurately predict tryptic cleavage: tests on three independent data sets show that CP-DT significantly outperforms the Keil rules that are currently used to predict tryptic cleavage. Moreover, the trees generated by CP-DT can make predictions efficiently and are interpretable by domain experts.
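For context, the rule-based baseline that CP-DT is compared against can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: it assumes the commonly used simplified form of the Keil rules (cleave C-terminal to lysine or arginine unless the next residue is proline) and leaves out the additional exceptions in the full rule set. The function names `cleavage_sites` and `digest` and the `missed_cleavages` parameter are hypothetical, introduced here for illustration only.

```python
def cleavage_sites(protein: str) -> list[int]:
    """Return cut positions under the simplified rule (assumption):
    cut after K or R unless the following residue is P.
    A position p means the cut falls between protein[p-1] and protein[p]."""
    return [
        i + 1
        for i, residue in enumerate(protein[:-1])
        if residue in "KR" and protein[i + 1] != "P"
    ]


def digest(protein: str, missed_cleavages: int = 0) -> list[str]:
    """In silico digestion: split at every predicted site, then also emit
    peptides that span up to `missed_cleavages` uncut sites."""
    bounds = [0] + cleavage_sites(protein) + [len(protein)]
    fragments = [protein[a:b] for a, b in zip(bounds, bounds[1:])]
    peptides = []
    for start in range(len(fragments)):
        last = min(start + missed_cleavages, len(fragments) - 1)
        for end in range(start, last + 1):
            peptides.append("".join(fragments[start:end + 1]))
    return peptides


if __name__ == "__main__":
    # Toy sequence, not taken from any data set in the paper.
    print(digest("MKRPAGKLLR"))
    print(digest("MKRPAGKLLR", missed_cleavages=1))
```

A rule like this gives a yes/no answer at each K or R based on one neighbouring residue. A learned model such as the decision tree ensemble described above can instead estimate how likely cleavage is at each site from the identification data it was trained on.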
