Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data

A reliable and precise classification of tumors is essential for successful diagnosis and treatment of cancer. cDNA microarrays and high-density oligonucleotide chips are novel biotechnologies increasingly used in cancer research. By allowing the monitoring of expression levels in cells for thousands of genes simultaneously, microarray experiments may lead to a more complete understanding of the molecular variations among tumors and hence to a finer and more informative classification. The ability to successfully distinguish between tumor classes (already known or yet to be discovered) using gene expression data is an important aspect of this novel approach to cancer classification. This article compares the performance of different discrimination methods for the classification of tumors based on gene expression data. The methods include nearest-neighbor classifiers, linear discriminant analysis, and classification trees. Recent machine learning approaches, such as bagging and boosting, are also considered. The discrimination methods are applied to datasets from three recently published cancer gene expression studies.

[1]  E. B. Wilson PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES. , 1919, Science.

[2]  M. M. Barnard THE SECULAR VARIATIONS OF SKULL CHARACTERS IN FOUR SERIES OF EGYPTIAN SKULLS , 1935 .

[3]  R. Fisher THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS , 1936 .

[4]  N. L. Johnson,et al.  Multivariate Analysis , 1958, Nature.

[5]  A. Griffiths Introduction to Genetic Analysis , 1976 .

[6]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[7]  J. L. Hodges,et al.  Discriminatory Analysis - Nonparametric Discrimination: Consistency Properties , 1989 .

[8]  G. McLachlan Discriminant Analysis and Statistical Pattern Recognition , 1992 .

[9]  D Faraggi,et al.  Discrimination techniques applied to the NCI in vitro anti-tumour drug screen: predicting biochemical mechanism of action. , 1994, Statistics in medicine.

[10]  Ronald W. Davis,et al.  Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray , 1995, Science.

[11]  Yoshua Bengio,et al.  Pattern Recognition and Neural Networks , 1995 .

[12]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[13]  D. Lockhart,et al.  Expression monitoring by hybridization to high-density oligonucleotide arrays , 1996, Nature Biotechnology.

[14]  L. Breiman OUT-OF-BAG ESTIMATION , 1996 .

[15]  Yoav Freund,et al.  Boosting the margin: A new explanation for the effectiveness of voting methods , 1997, ICML.

[16]  P. Brown,et al.  Exploring the metabolic and genetic control of gene expression on a genomic scale. , 1997, Science.

[17]  Y. Chen,et al.  Ratio-based decisions and the quantitative analysis of cDNA microarray images. , 1997, Journal of biomedical optics.

[18]  Vladimir Cherkassky,et al.  The Nature Of Statistical Learning Theory , 1997, IEEE Trans. Neural Networks.

[19]  Leo Breiman Using convex pseudo-data to increase prediction accuracy , 1998 .

[20]  L. Breiman Arcing Classifiers , 1998 .

[21]  L. Breiman Arcing classifier (with discussion and a rejoinder by the author) , 1998 .

[22]  S. Morishita On Classi cation and Regression , 1998 .

[23]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[24]  R. Tibshirani,et al.  Clustering methods for the analysis of DNA microarray data , 1999 .

[25]  Ash A. Alizadeh,et al.  Genome-wide analysis of DNA copy number variation in breast cancer using DNA microarrays , 1999, Nature Genetics.

[26]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[27]  Mark Schena,et al.  DNA microarrays : a practical approach , 1999 .

[28]  D. Botstein,et al.  Cluster analysis and display of genome-wide expression patterns. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[29]  Ash A. Alizadeh,et al.  Genome-wide analysis of DNA copy-number changes using cDNA microarrays , 1999, Nature Genetics.

[30]  Christian A. Rees,et al.  Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[31]  Ash A. Alizadeh,et al.  Di erent types of di use large b-cell lymphoma identi ed by gene expression pro ling , 2000 .

[32]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[33]  Ash A. Alizadeh,et al.  Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling , 2000, Nature.

[34]  Christian A. Rees,et al.  Systematic variation in gene expression patterns in human cancer cell lines , 2000, Nature Genetics.

[35]  D Haussler,et al.  Knowledge-based analysis of microarray gene expression data by using support vector machines. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[36]  Christina Kendziorski,et al.  On Differential Variability of Expression Ratios: Improving Statistical Inference about Gene Expression Changes from Microarray Data , 2001, J. Comput. Biol..

[37]  Terence P. Speed,et al.  Normalization for cDNA microarry data , 2001, SPIE BiOS.

[38]  I. Mian,et al.  Identifying marker genes in transcription profiling data using a mixture of feature relevance experts. , 2001, Physiological genomics.

[39]  Russ B. Altman,et al.  Missing value estimation methods for DNA microarrays , 2001, Bioinform..

[40]  Terence P. Speed,et al.  Comparison of Methods for Image Analysis on cDNA Microarray Data , 2002 .

[41]  Terry Speed,et al.  Normalization of cDNA microarray data. , 2003, Methods.