Comparison of Support Vector Machines to Other Classifiers Using Gene Expression Data

ABSTRACT Support vector machines (SVMs) have been shown to outperform Fisher's linear discriminant analysis and two classification trees (C4.5 and MOC1) in binary classification of microarray gene expression data (MGED) (Brown et al., 2000; Furey et al., 2000). However, multiclass classification is more commonly encountered when identifying tumor subtypes from MGED. Using MGED, Dudoit et al. (2002) showed that diagonal linear discriminant analysis (DLDA) outperformed other linear and quadratic discriminants, nearest neighbor, and classification trees with univariate splits. It is therefore of interest to compare the performance of SVMs with that of DLDA and of two recent classification trees with linear splits, which performed better than trees with univariate splits, in multiclass classification of MGED. Furthermore, the performance of SVMs with different types of kernels is studied on three types of multiclass MGED. Finally, we investigate how irrelevant and correlated variables (features) influence the performance of the three classifiers. Some suggestions are made for multiclass classification of MGED.
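For readers unfamiliar with DLDA, the rule it applies can be sketched in a few lines. This is a minimal illustration and not the authors' code: it assumes the usual formulation in which each class has its own mean vector, all classes share a diagonal covariance matrix (one pooled variance per gene), and a sample is assigned to the class whose mean is nearest in variance-standardized squared distance. The function names `dlda_fit` and `dlda_predict` and the synthetic data are hypothetical.

```python
import numpy as np

def dlda_fit(X, y):
    """Estimate class means and one pooled within-class variance per feature."""
    classes = np.unique(y)
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    # Residuals of each sample from its own class mean.
    resid = X - means[np.searchsorted(classes, y)]
    # Pooled variance per feature: the diagonal of the shared covariance matrix.
    var = (resid ** 2).sum(axis=0) / (len(y) - len(classes))
    return classes, means, var

def dlda_predict(X, classes, means, var):
    """Assign each sample to the class with the smallest standardized distance."""
    diff = X[:, None, :] - means[None, :, :]      # shape: samples x classes x features
    dist = ((diff ** 2) / var).sum(axis=2)        # shape: samples x classes
    return classes[dist.argmin(axis=1)]

# Synthetic stand-in for expression data (assumption): 3 classes, 20 "genes",
# of which only the first 5 carry class signal.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 30)
X = rng.normal(size=(90, 20))
X[:, :5] += 5.0 * y[:, None]

classes, means, var = dlda_fit(X, y)
pred = dlda_predict(X, classes, means, var)
accuracy = (pred == y).mean()
```

Because every feature enters the distance with its own variance and no cross terms, irrelevant genes add noise to the distance and correlated genes are effectively counted more than once, which is the kind of sensitivity the abstract proposes to examine.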

[1] Nello Cristianini, et al. Support vector machine classification and validation of cancer tissue samples using microarray expression data, 2000, Bioinform.

[2] Heping Zhang, et al. Cell and tumor classification using gene expression data: Construction of forests, 2003, Proceedings of the National Academy of Sciences of the United States of America.

[3] Christian A. Rees, et al. Systematic variation in gene expression patterns in human cancer cell lines, 2000, Nature Genetics.

[4] Hyunjoong Kim, et al. Classification Trees With Unbiased Multiway Splits, 2001.

[5] Chih-Jen Lin, et al. A comparison of methods for multiclass support vector machines, 2002, IEEE Trans. Neural Networks.

[6] W. Loh, et al. Split Selection Methods for Classification Trees, 1997.

[7] Wei-Yin Loh, et al. A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-Three Old and New Classification Algorithms, 2000, Machine Learning.

[8] Vladimir Vapnik, et al. Statistical learning theory, 1998.

[9] S. Dudoit, et al. Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data, 2002.

[10] D. Haussler, et al. Knowledge-based analysis of microarray gene expression data by using support vector machines, 2000, Proceedings of the National Academy of Sciences of the United States of America.

[11] R. Tibshirani, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications, 2001, Proceedings of the National Academy of Sciences of the United States of America.

[12] Miguel Figueroa, et al. Competitive learning with floating-gate circuits, 2002, IEEE Trans. Neural Networks.

[13] Dustin Boswell, et al. Introduction to Support Vector Machines, 2002.

[14] J. Mesirov, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, 1999, Science.

[15] D. Botstein, et al. A gene expression database for the molecular pharmacology of cancer, 2000, Nature Genetics.

[16] Leo Breiman, et al. Random Forests, 2001, Machine Learning.

[17] L. Penland, et al. Use of a cDNA microarray to analyse gene expression patterns in human cancer, 1996, Nature Genetics.