Chi-Sim: A New Similarity Measure for the Co-clustering Task

Co-clustering has been widely studied in recent years. Exploiting the duality between objects and features improves the clustering of both. Whereas current co-clustering algorithms focus on directly finding patterns in the data matrix, in this paper we define a (co-)similarity measure, named χ-Sim, which iteratively computes the similarity between objects and between their features. This makes it possible to use any clustering method (k-means, ...) to co-cluster data. The experiments show that our algorithm outperforms not only the classical similarity measure but also some co-clustering algorithms on the document-clustering task.
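The abstract does not state the update rule, so the following is only a minimal sketch of what an iterative co-similarity computation could look like, not the authors' exact formulation. It assumes a non-negative object-by-feature matrix (for example, document-by-word counts), a normalization by products of row or column sums, and a fixed number of iterations. The function name `co_similarity` and the parameter `n_iter` are hypothetical.

```python
import numpy as np


def co_similarity(M, n_iter=4):
    """Illustrative iterative co-similarity (assumed update rule, see text).

    M is a non-negative (objects x features) matrix.
    Returns (SR, SC): object-object and feature-feature similarity matrices.
    """
    M = np.asarray(M, dtype=float)
    n_rows, n_cols = M.shape

    # Start from identity: each object (feature) is similar only to itself.
    SR = np.eye(n_rows)
    SC = np.eye(n_cols)

    # Assumed normalization: product of row sums (resp. column sums).
    row_sums = M.sum(axis=1)
    col_sums = M.sum(axis=0)
    row_norm = np.outer(row_sums, row_sums)
    col_norm = np.outer(col_sums, col_sums)
    row_norm[row_norm == 0] = 1.0  # avoid division by zero for empty rows
    col_norm[col_norm == 0] = 1.0  # same for empty columns

    for _ in range(n_iter):
        # Objects are similar when they contain similar features.
        SR = (M @ SC @ M.T) / row_norm
        np.fill_diagonal(SR, 1.0)
        # Features are similar when they occur in similar objects.
        SC = (M.T @ SR @ M) / col_norm
        np.fill_diagonal(SC, 1.0)

    return SR, SC


# Toy data: documents 0 and 1 share no word, but their words
# co-occur in document 2; document 3 is unrelated to the others.
M_demo = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
])
SR_demo, SC_demo = co_similarity(M_demo, n_iter=2)
print(np.round(SR_demo, 3))
```

In this toy example, documents 0 and 1 have no word in common, yet they receive a non-zero similarity after the second iteration because their words co-occur in document 2. The resulting object similarity matrix can then be handed to any standard clustering method, which is the usage the abstract describes.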
