Enhanced bisecting k-means clustering using intermediate cooperation

Bisecting k-means (BKM) is attractive in many applications, such as document retrieval/indexing and gene expression analysis. However, in some scenarios a fraction of the dataset is left behind at each level of the binary tree with no way to re-cluster it, so a refinement step is needed to re-cluster the resulting solutions. Current approaches refine the clustering solutions produced by BKM through end-result enhancement using k-means (KM) clustering. In this hybrid model, KM waits for BKM to finish its clustering and then takes the final set of centroids as initial seeds for refinement. In this paper, a cooperative bisecting k-means (CBKM) clustering algorithm is presented. CBKM concurrently combines the results of BKM and KM at each level of the binary hierarchical tree using cooperative and merging matrices. Experimental results show that CBKM achieves better clustering quality than KM, BKM, and single linkage (SL) algorithms, with comparable time performance, over a number of artificial, text document, and gene expression datasets.
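For orientation, the hybrid baseline described above (BKM followed by end-result KM refinement) can be sketched as follows. This is a minimal illustration, not the paper's CBKM: the cooperative and merging matrices are not specified in the abstract and are not reproduced here. The function names (`kmeans`, `bisecting_kmeans`, `sse`), the choice to always split the cluster with the largest sum of squared errors, and the random seeding of each split are assumptions made for this sketch. It also assumes the data contain at least k distinct points.

```python
import numpy as np


def sse(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(((X - centroids[labels]) ** 2).sum())


def kmeans(X, centroids, n_iter=50):
    """Plain Lloyd's k-means starting from the given centroids."""
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (squared Euclidean).
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = centroids.copy()
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                new[j] = members.mean(axis=0)
        if np.allclose(new, centroids):
            break
        centroids = new
    # Final assignment against the final centroids.
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1), centroids


def bisecting_kmeans(X, k, rng):
    """Repeatedly split one cluster in two with 2-means until k clusters exist.

    Assumption for this sketch: the cluster with the largest SSE is split.
    Once a point lands in one half of a split it is never reassigned,
    which is the limitation that motivates a refinement step.
    """
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        errors = [((X[i] - X[i].mean(axis=0)) ** 2).sum() for i in clusters]
        idx = clusters.pop(int(np.argmax(errors)))
        seeds = X[rng.choice(idx, size=2, replace=False)]
        split, _ = kmeans(X[idx], seeds)
        clusters += [idx[split == 0], idx[split == 1]]
    labels = np.empty(len(X), dtype=int)
    for j, idx in enumerate(clusters):
        labels[idx] = j
    centroids = np.array([X[idx].mean(axis=0) for idx in clusters])
    return labels, centroids


def bkm_then_km(X, k, seed=0):
    """Hybrid baseline: run BKM, then refine with KM seeded by BKM centroids."""
    rng = np.random.default_rng(seed)
    bkm_labels, bkm_centroids = bisecting_kmeans(X, k, rng)
    labels, centroids = kmeans(X, bkm_centroids)
    return (bkm_labels, bkm_centroids), (labels, centroids)
```

Calling `bkm_then_km(X, k)` on an `(n, d)` float array returns both the raw BKM solution and the KM-refined one, so the two can be compared with `sse`. Because KM starts from the BKM centroids and each Lloyd iteration cannot increase the squared error, the refined solution is never worse than the BKM one on this criterion. The refinement happens only once, at the end, which is the point of contrast with the level-by-level cooperation that CBKM proposes.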
