A Comparative Study of Classification Methods for Microarray Data Analysis

In response to the rapid development of DNA microarray technology, many classification methods have been applied to microarray data. SVMs, decision trees, Bagging, Boosting and Random Forest are among the most commonly used. In this paper, we conduct an experimental comparison of LibSVMs, C4.5, BaggingC4.5, AdaBoostingC4.5, and Random Forest on seven microarray cancer data sets. The experimental results show that all ensemble methods outperform C4.5. They also show that the classification accuracy of all five methods benefits from data preprocessing, including gene selection and discretization. In addition to comparing the average accuracies of ten-fold cross-validation tests on the seven data sets, we use two statistical tests to validate the findings. We observe that the Wilcoxon signed-rank test is better suited than the sign test for this purpose.
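
To illustrate how the two statistical tests mentioned above can disagree, the following is a minimal sketch in Python, not the paper's actual procedure. The accuracy values are hypothetical placeholders chosen for illustration and are not results from the paper; the function names `sign_test` and `wilcoxon_signed_rank` are also assumptions. Both functions compute exact two-sided p-values for paired scores, which is feasible because only seven data sets are compared.

```python
from itertools import product
from math import comb


def sign_test(a, b):
    """Exact two-sided sign test on paired scores (ties are dropped)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    wins = sum(d > 0 for d in diffs)
    k = min(wins, n - wins)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def wilcoxon_signed_rank(a, b):
    """Exact two-sided Wilcoxon signed-rank test (zero differences dropped)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    # Rank the absolute differences, giving tied values their average rank
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w = min(w_plus, sum(ranks) - w_plus)
    # Exact null distribution: every rank is positive or negative with
    # probability 1/2, so enumerate all 2**n sign assignments (small n only)
    count = 0
    for signs in product((0, 1), repeat=n):
        total = sum(r for r, keep in zip(ranks, signs) if keep)
        if total <= w:
            count += 1
    return min(1.0, 2 * count / 2 ** n)


# HYPOTHETICAL accuracies (percent) of two classifiers on seven data sets.
# These numbers are made up for illustration; they are not from the paper.
method_a = [90, 85, 78, 92, 88, 70, 95]
method_b = [80, 79, 74, 89, 80, 71, 88]

p_sign = sign_test(method_a, method_b)
p_wilcoxon = wilcoxon_signed_rank(method_a, method_b)
print(f"sign test p = {p_sign}")          # 0.125
print(f"Wilcoxon p  = {p_wilcoxon}")      # 0.03125
```

In this made-up example, method A wins on six data sets by clear margins and loses on one by a single point. The sign test only counts wins and losses, so it gives p = 0.125, while the Wilcoxon signed-rank test also uses the size of each difference and gives p = 0.03125. This is one plausible reading of why a rank-based test can be more sensitive when few data sets are available.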
