Boosting for Tumor Classification with Gene Expression Data

MOTIVATION: Microarray experiments generate large datasets with expression values for thousands of genes but no more than a few dozen samples. Accurate supervised classification of tissue samples in such high-dimensional problems is difficult, but often crucial for successful diagnosis and treatment. A promising way to meet this challenge is to use boosting in conjunction with decision trees.

RESULTS: We demonstrate that the generic boosting algorithm needs some modification to become an accurate classifier in the context of gene expression data. In particular, we present a feature preselection method, a more robust boosting procedure, and a new approach for multi-categorical problems. These modifications improve performance, slightly in some cases and drastically in others, and yield competitive results on several publicly available datasets.

AVAILABILITY: Software for the modified boosting algorithms, as well as for decision trees, is freely available in R at http://stat.ethz.ch/~dettling/boosting.html.
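To make the combination of feature preselection and boosting with decision trees concrete, the following is a minimal sketch and not the authors' method. The abstract does not specify the preselection score, the boosting variant, or the tree size, so every choice here is an assumption: a simple rank-based pair count for preselection, discrete AdaBoost as a stand-in for the "generic boosting algorithm", and one-split trees (stumps) as base learners. All function names and the toy data are hypothetical.

```python
import math

def rank_score(values, labels):
    """Assumed preselection score: count class +1 / class -1 pairs ordered
    consistently in one direction. Large values mean the gene separates
    the two classes well; a constant gene scores 0."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == -1]
    lower = sum(1 for a in pos for b in neg if a < b)
    higher = sum(1 for a in pos for b in neg if a > b)
    return max(lower, higher)

def preselect(X, y, k):
    """Return the indices of the k genes with the highest rank score."""
    p = len(X[0])
    scores = [rank_score([row[j] for row in X], y) for j in range(p)]
    return sorted(range(p), key=lambda j: -scores[j])[:k]

def stump_predict(row, j, t, sign):
    """One-split tree: predict `sign` if gene j exceeds threshold t, else -sign."""
    return sign if row[j] > t else -sign

def fit_stump(X, y, w, features):
    """Search the preselected genes for the split with the lowest weighted error."""
    best = None
    for j in features:
        vals = sorted(set(row[j] for row in X))
        for t in [(a + b) / 2 for a, b in zip(vals, vals[1:])]:
            for sign in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, j, t, sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best

def adaboost(X, y, features, rounds=10):
    """Discrete AdaBoost with stumps; labels must be coded as -1 / +1."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w, features)
        if stump is None:  # no usable split among the selected genes
            break
        err, j, t, sign = stump
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, t, sign))
        # Up-weight misclassified samples, down-weight correct ones, renormalise.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, j, t, sign))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    """Weighted vote of all stumps in the ensemble."""
    vote = sum(a * stump_predict(row, j, t, s) for a, j, t, s in ensemble)
    return 1 if vote >= 0 else -1

# Hypothetical toy data: 6 samples, 4 "genes"; only gene 2 separates the classes.
X_toy = [
    [0.1, 5.0, 1.0, 3.30],
    [0.9, 4.0, 1.2, 3.10],
    [0.4, 6.0, 0.8, 3.20],
    [0.5, 5.5, 3.0, 3.00],
    [0.2, 4.5, 3.4, 3.40],
    [0.8, 5.2, 2.9, 3.25],
]
y_toy = [-1, -1, -1, 1, 1, 1]

selected = preselect(X_toy, y_toy, k=2)
model = adaboost(X_toy, y_toy, selected, rounds=5)
predictions = [predict(model, row) for row in X_toy]
```

The toy run predicts on the training samples only, which says nothing about generalisation. With only a few dozen samples, a realistic evaluation would use cross-validation and repeat the preselection step inside each fold, so that the held-out samples do not influence which genes are chosen.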
