A probabilistic framework for relational clustering

Relational clustering has attracted increasing attention due to its impact on important applications that involve multi-type interrelated data objects, such as Web mining, search marketing, bioinformatics, citation analysis, and epidemiology. In this paper, we propose a probabilistic model for relational clustering, which also provides a principled framework to unify various important clustering tasks, including traditional attribute-based clustering, semi-supervised clustering, co-clustering, and graph clustering. The proposed model seeks to identify cluster structures for each type of data object and interaction patterns between different types of objects. Under this model, we propose parametric hard and soft relational clustering algorithms for a large number of exponential family distributions. The algorithms are applicable to relational data of various structures and at the same time unify a number of state-of-the-art clustering algorithms: co-clustering algorithms, k-partite graph clustering, Bregman k-means, and semi-supervised clustering based on hidden Markov random fields.
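To make one of the unified special cases concrete, the following is a minimal sketch of hard clustering with a Bregman divergence (the "Bregman k-means" setting named above). It is not the relational algorithm proposed in the paper; it only illustrates the single-type, attribute-based case. The function names, the `init_centroids` parameter, and the toy data are assumptions introduced for illustration. The sketch relies on the known property that, for any Bregman divergence, the cluster representative minimizing the total divergence is the arithmetic mean of the cluster members.

```python
import math

def squared_euclidean(x, mu):
    # Bregman divergence generated by phi(x) = ||x||^2 (Gaussian case).
    return sum((a - b) ** 2 for a, b in zip(x, mu))

def generalized_i_divergence(x, mu):
    # Bregman divergence generated by phi(x) = sum x log x (Poisson case).
    # Assumes all entries of x and mu are strictly positive.
    return sum(a * math.log(a / b) - a + b for a, b in zip(x, mu))

def bregman_hard_clustering(points, init_centroids, divergence, n_iter=50):
    """Alternate hard assignment and mean update until labels stop changing.

    `init_centroids` is a hypothetical parameter: the caller supplies the
    starting representatives so that the run is deterministic.
    """
    centroids = [list(c) for c in init_centroids]
    k = len(centroids)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to the closest representative.
        new_labels = [
            min(range(k), key=lambda j: divergence(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: the arithmetic mean is optimal for every Bregman divergence.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old representative if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy data (assumed): two well-separated groups of positive vectors.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
labels, centroids = bregman_hard_clustering(
    points, [points[0], points[3]], squared_euclidean)
```

Swapping `squared_euclidean` for `generalized_i_divergence` changes the assumed exponential family distribution without changing the update rule, which is the sense in which a single procedure covers many distributions.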
