A Partitional Approach for Genomic-Data Clustering Combined with K-Means Algorithm

Bioinformatics is the science of managing, analyzing, extracting, and interpreting information from biological sequences and molecules. Recent advances in microarray technology allow the expression levels of a large number of genes to be monitored simultaneously across different experimental conditions. The resulting volume of data cannot be analyzed with the traditional techniques of biology alone; information technologies are needed. Cluster analysis is therefore of considerable interest and importance in bioinformatics, whether it is applied to the genes or to the experimental conditions (samples). Clustering genes identifies groups of genes with similar expression patterns, which helps to answer how gene expression is affected by various diseases and which genes are responsible for specific diseases. Clustering samples organizes the samples into intrinsic clusters such that samples with high similarity belong to the same cluster; this assists in diagnosing the disease condition and reveals the effect of a given treatment on genes. To cluster the large amount of gathered gene expression data, we propose a new partitional clustering approach combined with the K-Means algorithm. The combined approach is compared with both K-Means and the proposed approach before combination. The results, measured with internal and external performance measures on a set of genomic benchmarks, show the correctness and competitiveness of the proposed approach.
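
The distinction between clustering genes and clustering samples, and between internal and external performance measures, can be illustrated with a minimal sketch. It does not implement the proposed partitional approach, which the abstract does not describe. It runs only plain K-Means on a made-up toy expression matrix, and the farthest-point seeding, the within-cluster sum of squared errors (internal measure), and purity (external measure) are assumptions chosen for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100):
    """Plain Lloyd's K-Means; returns one cluster label per point.

    Seeding is a deterministic farthest-point heuristic (an assumption made
    here for reproducibility, not the seeding used in the paper).
    """
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(farthest))
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def sse(points, labels):
    """Internal measure: within-cluster sum of squared errors (lower is tighter)."""
    total = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(sq_dist(p, centroid) for p in members)
    return total

def purity(labels, truth):
    """External measure: fraction of points in their cluster's majority class."""
    hits = 0
    for c in set(labels):
        classes = [t for l, t in zip(labels, truth) if l == c]
        hits += max(classes.count(t) for t in set(classes))
    return hits / len(truth)

# Hypothetical toy expression matrix: rows are genes, columns are samples.
expr = [
    [9.0, 9.2, 8.8, 1.0, 1.1, 0.9],  # high in samples 0-2
    [8.5, 9.1, 9.0, 1.2, 0.8, 1.0],
    [1.0, 0.9, 1.2, 9.0, 8.7, 9.3],  # high in samples 3-5
    [0.8, 1.1, 1.0, 9.2, 9.1, 8.9],
]
sample_truth = ["A", "A", "A", "B", "B", "B"]  # made-up condition labels

# Gene clustering: each row (gene) is a point.
gene_labels = kmeans(expr, k=2)

# Sample clustering: transpose so that each column (sample) is a point.
samples = [list(col) for col in zip(*expr)]
sample_labels = kmeans(samples, k=2)

print("gene clusters:  ", gene_labels)
print("sample clusters:", sample_labels)
print("internal (SSE): ", sse(samples, sample_labels))
print("external (purity):", purity(sample_labels, sample_truth))
```

The same routine serves both tasks; only the orientation of the matrix changes. An internal measure such as SSE needs no reference labels, whereas an external measure such as purity requires known classes for the samples.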
