HICC: an entropy splitting-based framework for hierarchical co-clustering

Two-dimensional contingency tables, or co-occurrence matrices, arise frequently in important applications such as text analysis and web-log mining. As a fundamental research topic, co-clustering aims to generate a meaningful partition of the contingency table that reveals hidden relationships between rows and columns. Traditional co-clustering algorithms usually produce a predefined number of flat partitions of both rows and columns, which do not reveal relationships among clusters. To address this limitation, hierarchical co-clustering algorithms have recently attracted considerable research interest. Although successful in various applications, existing hierarchical co-clustering algorithms are usually based on heuristics and lack a solid theoretical foundation. In this paper, we present a new co-clustering algorithm, HICC, that has such a foundation. It simultaneously constructs a hierarchical structure of both row and column clusters that retains sufficient mutual information between the rows and columns of the contingency table. We develop an efficient and effective greedy algorithm that grows a co-cluster hierarchy by successively performing the row-wise or column-wise split that leads to the maximal mutual information gain. Extensive experiments on both synthetic and real datasets demonstrate that our algorithm reveals essential relationships among row (and column) clusters and achieves better clustering precision than existing algorithms. Moreover, the experiments on real datasets show that HICC can effectively reveal hidden relationships between rows and columns in the contingency table.
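The greedy criterion described above can be illustrated with a minimal sketch. It is not the HICC implementation: the function names (`mutual_information`, `best_split`, `grow`), the brute-force enumeration of two-way splits, and the stopping threshold `min_gain` are all assumptions made for illustration, and the enumeration is only practical for tiny tables. The sketch also assumes a non-trivial starting state, because when both rows and columns form a single cluster, any split along one axis alone leaves the mutual information at zero.

```python
import math
from itertools import combinations


def mutual_information(table, row_labels, col_labels):
    """Mutual information (bits) between row clusters and column clusters."""
    total = sum(sum(row) for row in table)
    joint, row_sum, col_sum = {}, {}, {}
    # Aggregate raw counts into co-cluster blocks.
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            key = (row_labels[i], col_labels[j])
            joint[key] = joint.get(key, 0) + count
    for (r, c), n in joint.items():
        row_sum[r] = row_sum.get(r, 0) + n
        col_sum[c] = col_sum.get(c, 0) + n
    # p * log2(p / (p_r * p_c)) written in terms of counts.
    return sum(
        (n / total) * math.log2(n * total / (row_sum[r] * col_sum[c]))
        for (r, c), n in joint.items()
        if n > 0
    )


def best_split(table, row_labels, col_labels):
    """Brute-force search (illustrative only) over all two-way splits of
    every row or column cluster; returns (gain, axis, new_labels)."""
    base = mutual_information(table, row_labels, col_labels)
    best = (0.0, None, None)
    for axis, labels in (("row", row_labels), ("col", col_labels)):
        new_id = max(labels) + 1
        for cluster in set(labels):
            members = [i for i, lab in enumerate(labels) if lab == cluster]
            # Keep the first member in place to skip mirrored duplicates.
            rest = members[1:]
            for k in range(1, len(rest) + 1):
                for moved in combinations(rest, k):
                    cand = list(labels)
                    for i in moved:
                        cand[i] = new_id
                    if axis == "row":
                        mi = mutual_information(table, cand, col_labels)
                    else:
                        mi = mutual_information(table, row_labels, cand)
                    if mi - base > best[0]:
                        best = (mi - base, axis, cand)
    return best


def grow(table, row_labels, col_labels, min_gain=1e-9):
    """Apply the best split repeatedly until the gain falls below min_gain."""
    history = []
    while True:
        gain, axis, labels = best_split(table, row_labels, col_labels)
        if axis is None or gain < min_gain:
            return row_labels, col_labels, history
        if axis == "row":
            row_labels = labels
        else:
            col_labels = labels
        history.append((axis, gain))


# Toy table with two obvious blocks; columns start pre-split (an assumption).
table = [
    [5, 5, 0, 0],
    [4, 6, 0, 0],
    [0, 0, 5, 5],
    [0, 0, 6, 4],
]
rows, cols, history = grow(table, [0, 0, 0, 0], [0, 0, 1, 1])
```

On this toy table the only split with a positive gain separates the first two rows from the last two, after which no further row-wise or column-wise split adds information and the loop stops. Recording the sequence of accepted splits in `history` is what would give the hierarchy its tree structure.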
