A gene selection approach for classifying diseases based on microarray datasets

Gene selection is a very important problem in the classification of serious diseases in clinical information systems. A limitation of existing gene selection methods is that they may produce gene sets with some redundancy and yield an unnecessarily large number of candidate genes for classification analysis. In the current work, a hybrid approach is presented to classify diseases, such as colon cancer, leukemia, and liver cancer, based on informative genes. This hybrid approach uses clustering (K-means) with statistical analysis (ANOVA) as a preprocessing step for gene selection and Support Vector Machines (SVM) to classify diseases related to microarray experiments. To evaluate the performance of the proposed methodology, two kinds of comparisons were performed: 1) preprocessing with statistical analysis combined with a clustering algorithm (K-means) versus preprocessing without clustering, and 2) different classification algorithms: decision tree (ID3), Naïve Bayes, Adaptive Naïve Bayes, and Support Vector Machines. Combining clustering with statistical analysis gave a classification accuracy of 97%, much better than the accuracy obtained without clustering in the preprocessing phase. In addition, SVM achieved better accuracy than decision trees, Naïve Bayes, and Adaptive Naïve Bayes classification.
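The preprocessing step can be illustrated with a minimal sketch. The abstract does not say how K-means and ANOVA are combined, so the scheme below is an assumption: genes are clustered by expression profile, and the gene with the highest one-way ANOVA F-statistic is kept from each cluster, which is one way to reduce redundancy. The function names (`anova_f`, `kmeans`, `select_genes`), the farthest-point initialisation, and the toy expression matrix are all hypothetical and are not taken from the paper.

```python
from statistics import mean

def anova_f(values, labels):
    """One-way ANOVA F-statistic of one gene across the sample classes."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, n_classes = len(values), len(groups)
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (n_classes - 1)) / (ss_within / (n - n_classes))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100):
    """Lloyd's K-means; returns one cluster index per point."""
    # Assumption: deterministic farthest-point initialisation, chosen here
    # only so that the example is reproducible.
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    assign = None
    for _ in range(iters):
        new_assign = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_assign == assign:  # converged
            break
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [mean(col) for col in zip(*members)]
    return assign

def select_genes(expr, labels, k):
    """expr is genes x samples. Keep the top-F gene from each gene cluster."""
    f_scores = [anova_f(row, labels) for row in expr]
    clusters = kmeans(expr, k)
    best = {}
    for gene, c in enumerate(clusters):
        if c not in best or f_scores[gene] > f_scores[best[c]]:
            best[c] = gene
    return sorted(best.values())

# Toy data (hypothetical): 4 genes x 6 samples, two disease classes.
labels = [0, 0, 0, 1, 1, 1]
expr = [
    [1.0, 1.1, 0.9, 5.0, 5.1, 4.9],  # gene 0: strongly discriminative
    [1.1, 1.0, 1.0, 5.2, 4.8, 5.0],  # gene 1: redundant with gene 0
    [3.0, 2.9, 3.1, 3.0, 3.1, 2.9],  # gene 2: uninformative
    [3.0, 2.9, 3.1, 3.2, 3.1, 3.3],  # gene 3: weakly discriminative
]
print(select_genes(expr, labels, 2))  # [0, 3]
```

In this sketch genes 0 and 1 fall into the same cluster, so only the higher-scoring one survives. The reduced gene set would then be passed to a classifier such as an SVM, which is omitted here.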
