Classification of peptide mass fingerprint data by novel no-regret boosting method

We have developed an integrated tool for the statistical analysis of large-scale LC-MS profiles of complex protein mixtures. It comprises a set of procedures for data processing, selection of biomarkers for early diagnosis, and classification of patients based on their peptide mass fingerprints. Here we propose a novel boosting technique, embedded in our framework for MS data analysis, that is based on Hannan-consistent game-playing strategies. We analyze boosting from a game-theoretic perspective and define a new class of boosting algorithms called H-boosting methods. In the experimental part of this work, we apply the new classifier, together with classical and state-of-the-art algorithms, to classify ovarian cancer and cystic fibrosis patients based on peptide mass spectra. The methods developed here provide automatic, general, and efficient means of processing large-scale LC-MS datasets. Good classification results suggest that our approach can uncover valuable information to support medical diagnosis.
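
To make the game-theoretic view of boosting concrete, the following is a minimal sketch and not the H-boosting method itself, which the abstract does not specify. It assumes the classical setting in which the booster plays the exponentially weighted average (Hedge) strategy over training examples, a standard no-regret strategy for a fixed number of rounds, against a weak learner that answers with decision stumps. All names (`best_stump`, `stump_predict`, `hedge_boost`, `predict`) and the choice of learning rate `eta` are illustrative assumptions.

```python
import numpy as np


def best_stump(X, y, dist):
    """Weak learner: threshold stump with the lowest weighted error under dist."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = dist[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best[1:]  # (feature index, threshold, sign)


def stump_predict(stump, X):
    """Predict labels in {-1, +1} with a single stump."""
    j, thr, sign = stump
    return np.where(X[:, j] <= thr, sign, -sign)


def hedge_boost(X, y, rounds=50, eta=None):
    """Boosting as a repeated game (illustrative sketch, labels in {-1, +1}).

    The booster keeps exponential weights over the training examples.
    An example suffers loss 1 in a round when the chosen stump classifies
    it correctly, so weight drifts toward examples that are often wrong.
    """
    n = len(y)
    if eta is None:
        # Assumed tuning: the usual fixed-horizon rate for losses in [0, 1].
        eta = np.sqrt(8.0 * np.log(n) / rounds)
    cum_loss = np.zeros(n)
    stumps = []
    for _ in range(rounds):
        # Subtracting the minimum only stabilises the exponentials.
        weights = np.exp(-eta * (cum_loss - cum_loss.min()))
        dist = weights / weights.sum()
        stump = best_stump(X, y, dist)
        stumps.append(stump)
        cum_loss += stump_predict(stump, X) == y
    return stumps


def predict(stumps, X):
    """Unweighted majority vote over all stumps collected during play."""
    votes = sum(stump_predict(s, X) for s in stumps)
    return np.where(votes >= 0, 1, -1)
```

In this sketch, `X` would be a samples-by-features matrix (for example, one column per aligned peptide peak) and `y` a vector of class labels coded as -1 and +1; `predict(hedge_boost(X, y), X_new)` returns the majority vote. The unweighted vote follows from the game formulation, in contrast to AdaBoost's weighted vote.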
