A Comparative Study of Clustering Methods for Relevant Gene Selection in Microarray Data

Classification of microarray cancer data has drawn the attention of the research community in recent years as a route to better clinical diagnosis. Microarray datasets are characterized by high dimensionality and small sample size. Hence, conventional wrapper methods for relevant gene selection cannot be applied directly to such datasets because of their large computation time. In this paper, a two-stage approach is proposed to determine a subset of relevant and non-redundant genes for better classification of microarray data. In the first stage, genes are partitioned into distinct clusters to identify redundant genes. To determine which clustering algorithm is better suited to grouping redundant genes, four different clustering methods were investigated. Experiments on four well-known cancer microarray datasets showed that hierarchical agglomerative clustering with complete linkage performed best in terms of average classification accuracy on three of the datasets. Comparison with other state-of-the-art methods has shown that the proposed approach, which involves gene clustering, is effective in reducing redundancy among the selected genes and thereby provides better classification.
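The clustering stage described above can be illustrated with a minimal sketch. It is not the authors' implementation: the distance measure (1 - |Pearson correlation|), the signal-to-noise relevance score, the rule of keeping one gene per cluster, and all function names and toy values are assumptions made only for illustration, since the abstract does not specify them.

```python
import math
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    mx, my = mean(x), mean(y)
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def complete_link_clusters(genes, k):
    """Agglomerative clustering of gene profiles into k clusters.

    Complete link: the distance between two clusters is the largest
    pairwise distance between their members.
    Assumed distance: 1 - |Pearson correlation|.
    """
    n = len(genes)
    dist = [[1.0 - abs(pearson(genes[i], genes[j])) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def relevance(profile, labels):
    """Assumed relevance score: signal-to-noise ratio between two classes."""
    g0 = [v for v, y in zip(profile, labels) if y == 0]
    g1 = [v for v, y in zip(profile, labels) if y == 1]
    return abs(mean(g0) - mean(g1)) / (pstdev(g0) + pstdev(g1) + 1e-12)

def select_genes(genes, labels, k):
    """Keep the most relevant gene from each of k clusters (assumed rule)."""
    clusters = complete_link_clusters(genes, k)
    return sorted(max(c, key=lambda i: relevance(genes[i], labels))
                  for c in clusters)

# Toy data (hypothetical): rows are genes, columns are six samples.
labels = [0, 0, 0, 1, 1, 1]
genes = [
    [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],  # gene 0: separates the classes
    [1.1, 1.0, 1.3, 2.5, 3.4, 2.8],  # gene 1: noisy copy of gene 0
    [5.0, 1.0, 3.0, 2.0, 5.5, 1.5],  # gene 2: unrelated to the classes
    [4.9, 1.2, 3.1, 2.1, 5.3, 1.4],  # gene 3: near copy of gene 2
]
clusters = complete_link_clusters(genes, 2)
selected = select_genes(genes, labels, 2)
print(clusters, selected)
```

In this toy run the two strongly correlated pairs fall into separate clusters, so only one gene from each redundant pair is kept. A real pipeline would pass the reduced gene set to a classifier or wrapper search, a step this sketch leaves out.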
