Boosting and Microarray Data

We have identified one reason why AdaBoost tends not to perform well on gene expression data, and simple modifications that improve its ability to find accurate class prediction rules. These modifications appear to be needed especially when there is a strong association between expression profiles and class designations. Cross-validation analysis of six microarray datasets with different characteristics suggests that, suitably modified, boosting provides competitive classification accuracy in general.

Sometimes the goal of a microarray analysis is to find a class prediction rule that is not only accurate but also depends on the expression levels of only a few genes. Because boosting seeks out genes that provide complementary sources of evidence for the correct classification of a tissue sample, it appears especially well suited to such gene-efficient class prediction. This is particularly true when there is a strong association between expression profiles and class designations, as is often the case when comparing tumor and normal samples.
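To make the "complementary genes" mechanism concrete, the sketch below shows plain AdaBoost with single-gene decision stumps, the standard setup this line of work builds on. It is an illustrative toy, not the modified algorithm described above; the function names (`train_adaboost_stumps`, `predict`) and the synthetic data are our own. Because each round reweights the samples that earlier stumps misclassified, later rounds are pushed toward genes carrying evidence the already-selected genes do not, which is why the resulting rule tends to use few genes.

```python
import numpy as np

def train_adaboost_stumps(X, y, n_rounds=10):
    """AdaBoost with single-gene decision stumps (illustrative sketch).

    X: (n_samples, n_genes) expression matrix; y: labels in {-1, +1}.
    Each round exhaustively picks the (gene, threshold, sign) stump with
    the lowest weighted error, then upweights the misclassified samples,
    so successive rounds favor genes complementary to those chosen so far.
    """
    n, d = X.shape
    w = np.full(n, 1.0 / n)      # sample weights, initially uniform
    ensemble = []                # list of (alpha, gene, threshold, sign)
    for _ in range(n_rounds):
        best = None
        for j in range(d):
            for t in np.unique(X[:, j]):
                for s in (1, -1):
                    pred = np.where(X[:, j] > t, s, -s)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, t, s)
        err, j, t, s = best
        err = min(max(err, 1e-10), 1 - 1e-10)   # guard the log below
        alpha = 0.5 * np.log((1 - err) / err)   # stump's vote weight
        pred = np.where(X[:, j] > t, s, -s)
        w *= np.exp(-alpha * y * pred)          # upweight the mistakes
        w /= w.sum()
        ensemble.append((alpha, j, t, s))
    return ensemble

def predict(ensemble, X):
    """Weighted majority vote of the learned stumps."""
    score = np.zeros(X.shape[0])
    for alpha, j, t, s in ensemble:
        score += alpha * np.where(X[:, j] > t, s, -s)
    return np.sign(score)
```

Note that the final rule consults only the genes that some stump selected, which is the sense in which a boosted classifier can be "gene-efficient": on data with thousands of genes, a short ensemble reads out only a handful of them.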
