Simultaneous informative gene selection and clustering through multiobjective optimization

Clustering methods are used for unsupervised classification of tumor subclasses in microarray gene expression data sets, in which rows represent tumor samples and columns represent genes. Clustering algorithms can be very sensitive to the set of features (genes) considered in the clustering process, so it is important to select the informative and relevant genes to be used for clustering. In this article, a multiobjective genetic algorithm-based technique is proposed for performing gene selection and fuzzy clustering simultaneously. A novel encoding technique is developed for this purpose, and the algorithm searches for the best cluster centers while minimizing the number of selected genes. The number of clusters is evolved automatically. The performance of the proposed technique is illustrated on an artificial data set and compared with that of several related feature selection/clustering approaches. Moreover, its performance is demonstrated on two real-life multi-class gene expression data sets, namely the Brain tumor and Lung tumor data sets.
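To make the idea of a joint encoding concrete, the following is a minimal sketch and not the article's actual method. It assumes a chromosome that holds a binary gene mask together with a variable-length list of cluster centers, and it assumes two objectives to be minimized: the Xie-Beni fuzzy validity index computed on the selected genes, and the number of selected genes. The names `Chromosome`, `memberships`, `objectives`, and `dominates` are hypothetical, and the abstract does not state which validity index or encoding layout the authors use.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Chromosome:
    """Hypothetical encoding: a gene mask plus a variable number of centers."""
    mask: List[int]             # 1 = gene selected, 0 = gene ignored
    centers: List[List[float]]  # one full-length center per cluster


def sq_dist(a: Sequence[float], b: Sequence[float], mask: Sequence[int]) -> float:
    """Squared Euclidean distance restricted to the selected genes."""
    return sum(m * (x - y) ** 2 for x, y, m in zip(a, b, mask))


def memberships(samples, chrom: Chromosome, fuzz: float = 2.0) -> List[List[float]]:
    """Fuzzy c-means membership of each sample in each cluster."""
    u = []
    for s in samples:
        # Clamp distances away from zero to avoid division by zero.
        d = [max(sq_dist(s, c, chrom.mask), 1e-12) for c in chrom.centers]
        inv = [dk ** (-1.0 / (fuzz - 1.0)) for dk in d]
        total = sum(inv)
        u.append([v / total for v in inv])
    return u


def objectives(samples, chrom: Chromosome) -> Tuple[float, int]:
    """Return (Xie-Beni index, number of selected genes); both are minimized."""
    n_selected = sum(chrom.mask)
    k = len(chrom.centers)
    if n_selected == 0 or k < 2:
        return float("inf"), n_selected  # assumed penalty for degenerate solutions
    u = memberships(samples, chrom)
    # Compactness: membership-weighted squared distances to the centers.
    compact = sum(
        u[i][j] ** 2 * sq_dist(s, chrom.centers[j], chrom.mask)
        for i, s in enumerate(samples)
        for j in range(k)
    )
    # Separation: smallest squared distance between two distinct centers.
    sep = min(
        sq_dist(chrom.centers[a], chrom.centers[b], chrom.mask)
        for a in range(k)
        for b in range(a + 1, k)
    )
    if sep == 0.0:
        return float("inf"), n_selected
    return compact / (len(samples) * sep), n_selected


def dominates(f: Sequence[float], g: Sequence[float]) -> bool:
    """Pareto dominance for minimization: no worse everywhere, better somewhere."""
    return all(a <= b for a, b in zip(f, g)) and any(a < b for a, b in zip(f, g))
```

A multiobjective genetic algorithm would compare candidate solutions through `dominates` rather than through a single weighted score, which is how a compact clustering on few genes can be preferred over an equally separated clustering that also carries noisy genes. Because `centers` is a list of arbitrary length, the number of clusters can differ between chromosomes, which is one simple way to let it evolve.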
