Semi-supervised learning for peptide identification from shotgun proteomics datasets

Shotgun proteomics uses liquid chromatography–tandem mass spectrometry to identify proteins in complex biological samples. We describe an algorithm, called Percolator, for improving the rate of confident peptide identifications from a collection of tandem mass spectra. Percolator uses semi-supervised machine learning to discriminate between correct and decoy spectrum identifications, correctly assigning peptides to 17% more spectra from a tryptic Saccharomyces cerevisiae dataset, and up to 77% more spectra from non-tryptic digests, relative to a fully supervised approach.
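The abstract does not spell out the training procedure, so the following is only a minimal sketch of one way a semi-supervised target/decoy loop could be organized. It makes three assumptions: each peptide-spectrum match is a plain feature vector, confident target matches are selected with a simple decoy-based q-value estimate, and the classifier is a logistic regression trained by gradient descent, swapped in here to keep the example dependency-free. The function names (`q_values`, `train_logistic`, `semi_supervised_rescore`) and all parameters are hypothetical and are not taken from the Percolator software.

```python
import math

def q_values(scores, is_decoy):
    """Estimate q-values from a ranked list, using decoys/targets as the FDR estimate."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fdrs, decoys, targets = [], 0, 0
    for i in order:
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    # The q-value is the smallest FDR at this rank or any lower-scoring rank.
    q = [0.0] * len(scores)
    running_min = float("inf")
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[rank])
        q[order[rank]] = min(running_min, 1.0)
    return q

def train_logistic(X, y, epochs=200, lr=0.1):
    """Fit a linear logistic model by batch gradient descent (stand-in classifier)."""
    n = len(X)
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, t in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, x))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            err = 1.0 / (1.0 + math.exp(-z)) - t
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return w, b

def semi_supervised_rescore(features, is_decoy, init_feature=0, fdr=0.01, iterations=3):
    """Iteratively relabel confident targets as positives and retrain against decoys."""
    scores = [x[init_feature] for x in features]  # start from one raw search score
    for _ in range(iterations):
        q = q_values(scores, is_decoy)
        pos = [i for i in range(len(features)) if not is_decoy[i] and q[i] <= fdr]
        neg = [i for i in range(len(features)) if is_decoy[i]]
        if not pos or not neg:
            break  # nothing to train on; keep the current scores
        X = [features[i] for i in pos + neg]
        y = [1.0] * len(pos) + [0.0] * len(neg)
        w, b = train_logistic(X, y)
        scores = [b + sum(wj * xj for wj, xj in zip(w, x)) for x in features]
    return scores
```

Only the decoy matches carry trusted labels in this sketch; the positive set is re-estimated from the target matches on every pass, which is what makes the loop semi-supervised rather than fully supervised.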
