GIN: A Clustering Model for Capturing Dual Heterogeneity in Networked Data

Networked data often consists of interconnected multityped nodes and links. A common assumption behind such heterogeneity is the shared clustering structure. However, existing network clustering approaches oversimplify the heterogeneity by either treating nodes or links in a homogeneous fashion, resulting in massive loss of information. In addition, these studies are more or less restricted to specific network schemas or applications, losing generality. In this paper, we introduce a flexible model to explain the process of forming heterogeneous links based on shared clustering information of heterogeneous nodes. Specifically, we categorize the link generation process into binary and weighted cases and model them respectively. We show these two cases can be seamlessly integrated into a unified model. We propose to maximize a joint log-likelihood function to infer the model efficiently with Expectation Maximization (EM) algorithms. Experiments on real-world networked data sets demonstrate the effectiveness and flexibility of the proposed method in fully capturing the dual heterogeneity of both nodes and links.

[1]  Philip S. Yu,et al.  Integrating meta-path selection with user-guided object clustering in heterogeneous information networks , 2012, KDD.

[2]  Yizhou Sun,et al.  RankClus: integrating clustering with ranking for heterogeneous information network analysis , 2009, EDBT '09.

[3]  Philip S. Yu,et al.  Spectral clustering for multi-type relational data , 2006, ICML.

[4]  Bo Zhao,et al.  Probabilistic topic models with biased propagation on heterogeneous information networks , 2011, KDD.

[5]  C. Ding,et al.  On the Equivalence of Nonnegative Matrix Factorization and K-means - Spectral Clustering , 2005 .

[6]  Rajeev Motwani,et al.  The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.

[7]  Mark E. J. Newman,et al.  An efficient and principled method for detecting communities in networks , 2011, Physical review. E, Statistical, nonlinear, and soft matter physics.

[8]  Xiaowei Xu,et al.  SCAN: a structural clustering algorithm for networks , 2007, KDD '07.

[9]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[10]  M E J Newman,et al.  Finding and evaluating community structure in networks. , 2003, Physical review. E, Statistical, nonlinear, and soft matter physics.

[11]  Padhraic Smyth,et al.  A Spectral Clustering Approach To Finding Communities in Graph , 2005, SDM.

[12]  Chong Wang,et al.  Collaborative topic modeling for recommending scientific articles , 2011, KDD.

[13]  Thomas L. Griffiths,et al.  Probabilistic author-topic models for information discovery , 2004, KDD.

[14]  Mark E. J. Newman,et al.  Stochastic blockmodels and community structure in networks , 2010, Physical review. E, Statistical, nonlinear, and soft matter physics.

[15]  Charu C. Aggarwal,et al.  Relation Strength-Aware Clustering of Heterogeneous Information Networks with Incomplete Attributes , 2012, Proc. VLDB Endow..

[16]  H. Sebastian Seung,et al.  Learning the parts of objects by non-negative matrix factorization , 1999, Nature.

[17]  Yizhou Sun,et al.  Ranking-based clustering of heterogeneous information networks with star network schema , 2009, KDD.

[18]  Tie-Yan Liu,et al.  Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering , 2005, KDD '05.

[19]  Edoardo M. Airoldi,et al.  Mixed Membership Stochastic Blockmodels , 2007, NIPS.

[20]  Inderjit S. Dhillon,et al.  Co-clustering documents and words using bipartite spectral graph partitioning , 2001, KDD '01.

[21]  Thomas Hofmann,et al.  Probabilistic latent semantic indexing , 1999, SIGIR '99.

[22]  Jiawei Han,et al.  Ranking-based classification of heterogeneous information networks , 2011, KDD.