Efficient Bisecting k-Medoids and Its Application in Gene Expression Analysis

The medoid-based clustering algorithm Partitioning Around Medoids (PAM) is more robust to noisy data and outliers than the centroid-based k-means. However, PAM can fail to recognize relatively small clusters, even in situations where good partitions around medoids clearly exist. PAM also needs O(k(n-k)^2) operations per iteration to cluster a dataset of n objects into k clusters, which is computationally prohibitive for large n and k. In this paper, we propose a new bisecting k-medoids algorithm that is capable of grouping co-expressed genes together with better clustering quality and time performance. The proposed algorithm is evaluated on three gene expression datasets that contain noise components. It takes less computation time than PAM while achieving comparable clustering performance.
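The abstract does not spell out the steps of the proposed algorithm, so the following is only a minimal sketch of the general bisecting idea applied to medoids, not the authors' method. It assumes Euclidean distance, a simple alternating assign/update 2-medoids split, and a "split the cluster with the largest total distance to its medoid" rule. All function names (`dist`, `cost`, `best_medoid`, `two_medoids`, `bisecting_k_medoids`) are hypothetical and chosen for illustration.

```python
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def cost(points, members, medoid):
    """Total distance from every member of a cluster to a candidate medoid."""
    return sum(dist(points[i], points[medoid]) for i in members)


def best_medoid(points, members):
    """Member that minimizes the total distance to all other members."""
    return min(members, key=lambda m: cost(points, members, m))


def two_medoids(points, members, rng, max_iter=20):
    """Split one cluster in two by alternating assignment and medoid update."""
    m1, m2 = rng.sample(members, 2)
    left, right = [], []
    for _ in range(max_iter):
        left, right = [], []
        for i in members:
            closer_to_m1 = dist(points[i], points[m1]) <= dist(points[i], points[m2])
            (left if closer_to_m1 else right).append(i)
        if not left or not right:
            # Degenerate case (e.g. duplicate points): fall back to an even split.
            half = len(members) // 2
            return [members[:half], members[half:]]
        n1, n2 = best_medoid(points, left), best_medoid(points, right)
        if (n1, n2) == (m1, m2):
            break  # medoids stopped moving
        m1, m2 = n1, n2
    return [left, right]


def bisecting_k_medoids(points, k, seed=0):
    """Repeatedly bisect the costliest cluster until k clusters exist.

    Returns a list of (medoid_index, member_indices) pairs.
    """
    rng = random.Random(seed)
    clusters = [list(range(len(points)))]
    while len(clusters) < k:
        splittable = [c for c in clusters if len(c) >= 2]
        if not splittable:
            break  # nothing left to split
        target = max(
            splittable,
            key=lambda c: cost(points, c, best_medoid(points, c)),
        )
        clusters.remove(target)
        clusters.extend(two_medoids(points, target, rng))
    return [(best_medoid(points, c), c) for c in clusters]


if __name__ == "__main__":
    # Toy "expression profiles": three loose groups of 2-D points.
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
            (20, 20), (20, 21), (21, 20)]
    for medoid, members in bisecting_k_medoids(data, k=3):
        print(data[medoid], sorted(members))
```

Each bisection searches for only two medoids inside a single cluster, so it works on a smaller problem than a full k-medoid swap search over the whole dataset. This is one plausible reason a bisecting scheme can reduce computation time, although the actual savings depend on the splitting and stopping rules used.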
