Hybrid Algorithm for Clustering Gene Expression Data

Microarray gene expression data provide insight into genomic biomarkers that help distinguish cancerous cells from normal cells. Clustering is an unsupervised learning technique that partitions gene data into groups based on the similarity between their expression profiles, which identifies functionally related genes. In this study, a hybrid framework is designed that combines an adaptive pillar clustering algorithm with a genetic algorithm. In the first step, the adaptive pillar clustering algorithm finds the cluster centroids; the centroids and their cluster members are calculated from the average of the pairwise inner distances. The microarray gene expression data set is given as input to the adaptive pillar clustering algorithm, which partitions the gene data into the given number of clusters so that intra-cluster similarity is maximized and inter-cluster similarity is minimized. The algorithm produces a number of clusters, which are given as input to the genetic algorithm. Then, for each combination of clustered gene expression data, the optimum cluster is found using the genetic algorithm. The genetic algorithm initializes its population with the set of clusters obtained from the adaptive pillar clustering algorithm. The chromosomes with maximum fitness are selected from the selection pool to undergo genetic operations such as crossover and mutation. The genetic algorithm searches for optimum clusters using a purpose-designed fitness function, which minimizes the intra-cluster distance and maximizes the fitness value by tailoring a parameter that includes the weightage for diseased genes. The performance of the adaptive pillar clustering algorithm was compared with existing techniques such as the k-means and pillar k-means algorithms. The clusters obtained from the adaptive pillar clustering algorithm achieve a maximum cluster gain of 894.84, 812.4 and 756 for leukemia, lung and thyroid gene expression data, respectively. Further, the optimal cluster obtained by the hybrid framework achieves a cluster accuracy of 81.3, 80.2 and 78.2 for leukemia, lung and thyroid gene expression data, respectively.
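
The two-stage idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the adaptive pillar procedure or the exact fitness function, so the sketch assumes a simplified pillar-style seeding (initial centroids chosen by accumulated distance maximization), Euclidean distance, and a hypothetical fitness of 1 / (1 + intra-cluster distance). The weighting for diseased genes is omitted because its form is not given. All function names and the toy data are illustrative.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_point_seeds(profiles, k):
    """Simplified pillar-style seeding (assumption): pick k mutually distant profiles."""
    dims = len(profiles[0])
    mean = [sum(p[d] for p in profiles) / len(profiles) for d in range(dims)]
    # First seed: the profile farthest from the grand mean.
    first = max(range(len(profiles)), key=lambda i: euclidean(profiles[i], mean))
    seeds = [first]
    # Accumulated distance from each profile to the seeds chosen so far.
    acc = [euclidean(p, profiles[first]) for p in profiles]
    while len(seeds) < k:
        nxt = max((i for i in range(len(profiles)) if i not in seeds),
                  key=lambda i: acc[i])
        seeds.append(nxt)
        acc = [a + euclidean(p, profiles[nxt]) for a, p in zip(acc, profiles)]
    return [profiles[i] for i in seeds]


def assign(profiles, centroids):
    """Label each profile with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
            for p in profiles]


def centroids_from(profiles, labels, k):
    dims = len(profiles[0])
    cents = []
    for c in range(k):
        members = [p for p, l in zip(profiles, labels) if l == c]
        if not members:  # empty cluster: fall back to the grand mean
            members = profiles
        cents.append([sum(m[d] for m in members) / len(members) for d in range(dims)])
    return cents


def fitness(profiles, labels, k):
    """Hypothetical fitness: larger when the total intra-cluster distance is smaller."""
    cents = centroids_from(profiles, labels, k)
    intra = sum(euclidean(p, cents[l]) for p, l in zip(profiles, labels))
    return 1.0 / (1.0 + intra)


def evolve(profiles, seed_labels, k, generations=30, pop_size=20, seed=0):
    """Tiny elitist GA: a chromosome is a label vector, seeded from the clustering."""
    rng = random.Random(seed)

    def mutate(labels):
        child = list(labels)
        child[rng.randrange(len(child))] = rng.randrange(k)
        return child

    population = [list(seed_labels)] + [mutate(seed_labels) for _ in range(pop_size - 1)]
    for _ in range(generations):
        # Selection: keep the fitter half unchanged (elitism).
        population.sort(key=lambda ch: fitness(profiles, ch, k), reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(a))               # single-point crossover
            children.append(mutate(a[:cut] + b[cut:]))   # then mutation
        population = parents + children
    return max(population, key=lambda ch: fitness(profiles, ch, k))


# Toy expression profiles (made up for illustration): three visibly separate groups.
profiles = [[0.1, 0.2], [0.0, 0.1], [5.0, 5.1], [5.2, 4.9], [9.8, 0.1], [10.0, 0.0]]
k = 3
labels = assign(profiles, farthest_point_seeds(profiles, k))
best = evolve(profiles, labels, k)
```

Because the fitter half of each generation is carried over unchanged, the fitness of `best` can never fall below that of the seeded clustering `labels`; whether the genetic step improves on it depends on the data and on the fitness function actually used.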
