Hierarchical, Parameter-Free Community Discovery

Given a large bipartite graph (such as a document-term or user-product graph), how can we find meaningful communities quickly and automatically? We propose to look for community hierarchies, with communities-within-communities. Our proposed method, the Context-specific Cluster Tree (CCT), finds such communities at multiple levels, with no user intervention, based on information-theoretic principles (MDL). More specifically, it partitions the graph into progressively more refined subgraphs, allowing users to quickly navigate from the global, coarse structure of a graph to more focused, local patterns. As a side benefit, and as a further indication of its quality, it achieves better compression than typical non-hierarchical methods. We demonstrate its scalability and effectiveness on large, real graphs.
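To make the link between partitioning and compression concrete, the following is a minimal sketch, not the CCT algorithm itself. It assumes a simplified cost in which each block of the bipartite adjacency matrix is encoded at its empirical entropy, and it ignores the cost of describing the partition, which a full MDL criterion would also count. The names `block_bits` and `partition_bits` and the toy matrix are hypothetical, chosen only for illustration.

```python
import math


def block_bits(ones, cells):
    """Approximate bits to encode a binary block at its empirical entropy."""
    if cells == 0 or ones == 0 or ones == cells:
        return 0.0  # an empty or perfectly homogeneous block costs nothing here
    p = ones / cells
    return cells * -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def partition_bits(matrix, row_groups, col_groups):
    """Sum the data cost over every (row group, column group) block.

    Assumption: only the data cost is counted; the model cost of
    describing the groups is left out of this sketch.
    """
    total = 0.0
    for rows in row_groups:
        for cols in col_groups:
            ones = sum(matrix[r][c] for r in rows for c in cols)
            total += block_bits(ones, len(rows) * len(cols))
    return total


# Toy bipartite adjacency matrix with two planted communities.
matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Cost with no partitioning versus cost after splitting into two communities.
flat = partition_bits(matrix, [[0, 1, 2, 3]], [[0, 1, 2, 3]])
split = partition_bits(matrix, [[0, 1], [2, 3]], [[0, 1], [2, 3]])
print(flat, split)
```

On this toy input the unpartitioned matrix is half ones and half zeros, so it is as expensive as possible to encode, while the two-community split yields only homogeneous blocks. A hierarchical method could apply the same comparison recursively inside each block, refining only where the cost drops.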
