A new approach to improve the clustering accuracy using informative genes for unsupervised microarray data sets

DNA microarray technology can measure the expression levels of thousands of genes across different samples in a single experiment. Within a gene expression matrix there are usually several macroscopic phenotypes of samples related to diseases or drug effects, such as diseased, normal, or drug-treated samples. The goal of sample-based clustering is to find the phenotype structure or substructure of the samples. Most current research focuses on supervised analysis; relatively little attention has been paid to unsupervised approaches to sample-based analysis, which are important when domain knowledge is incomplete or hard to obtain. The standard k-means algorithm is effective in producing clusters for many practical applications, but its computational cost is very high on high-dimensional data, and the accuracy of the clustering result depends on the choice of initial centroids. In this paper, we present a new framework for unsupervised sample-based clustering of microarray data using informative genes. We propose a method for finding the initial centroids for k-means, and we use a similarity measure to identify the informative genes. The goal of our clustering approach is to achieve better cluster discovery on the samples using the informative genes.
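
The abstract does not specify the centroid-initialization rule or the similarity measure, so the following is only a minimal sketch of the general pipeline it describes (select informative genes, choose initial centroids deterministically, then run k-means on the samples). The variance-based gene ranking, the farthest-first initialization, the synthetic data, and all function names are assumptions made for illustration; they are not the authors' method.

```python
import numpy as np

def select_informative_genes(X, n_genes):
    """Rank genes (columns) by variance across samples and keep the top n_genes.
    Variance is an assumed stand-in for the paper's unspecified similarity measure."""
    variances = X.var(axis=0)
    return np.argsort(variances)[::-1][:n_genes]

def farthest_first_centroids(X, k):
    """Deterministic initialization (an assumption, not the paper's rule):
    start from the sample nearest the overall mean, then repeatedly add the
    sample farthest from all centroids chosen so far."""
    first = int(np.argmin(np.linalg.norm(X - X.mean(axis=0), axis=1)))
    chosen = [first]
    dist = np.linalg.norm(X - X[first], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return X[chosen].copy()

def kmeans(X, centroids, n_iter=100):
    """Standard Lloyd iterations starting from the given centroids."""
    for _ in range(n_iter):
        # Distance from every sample to every centroid, shape (n_samples, k).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Synthetic expression matrix (hypothetical data): 20 samples x 200 genes,
# where only the first 10 genes separate the two phenotypes.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 200))
X[10:, :10] += 6.0

genes = select_informative_genes(X, n_genes=10)
X_reduced = X[:, genes]
labels, centroids = kmeans(X_reduced, farthest_first_centroids(X_reduced, k=2))
print("selected genes:", sorted(genes.tolist()))
print("sample labels:", labels.tolist())
```

Clustering on the reduced matrix keeps each distance computation proportional to the number of selected genes rather than to all genes, and the deterministic initialization removes the run-to-run variation that random centroids would introduce.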
