An efficient block model for clustering sparse graphs

Models for large, sparse graphs arise in many applications and are an active topic in machine learning research. We develop a new generative model that combines rich block structure with simple, efficient estimation by collapsed Gibbs sampling. The novelty of our method is that it can learn the strength of assortative and disassortative mixing among communities. Most earlier approaches, whether based on low-dimensional projections or on Latent Dirichlet Allocation, implicitly rely on one of two assumptions: some algorithms define similarity solely by connectedness, others solely by the similarity of neighborhoods, which leads to undesired results, for example in near-bipartite subgraphs. In our experiments we cluster both small and large graphs, including real and generated graphs that are known to be hard to partition. Our method outperforms earlier Latent Dirichlet Allocation based models as well as spectral heuristics.
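The two similarity assumptions can be contrasted on a toy bipartite graph. The following is a minimal sketch, not the model proposed here: the helper names (`bipartite_adjacency`, `connected`, `jaccard`) and the use of Jaccard overlap as the neighborhood similarity are assumptions made purely for illustration.

```python
def bipartite_adjacency(left, right):
    """Build adjacency sets for a complete bipartite graph (illustrative helper)."""
    adj = {v: set() for v in list(left) + list(right)}
    for u in left:
        for v in right:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def connected(adj, u, v):
    """Connectedness-based similarity: 1.0 if u and v share an edge, else 0.0."""
    return 1.0 if v in adj[u] else 0.0


def jaccard(adj, u, v):
    """Neighborhood-based similarity: Jaccard overlap of the two neighbor sets."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


left, right = [0, 1, 2], [3, 4, 5]
adj = bipartite_adjacency(left, right)

# Nodes 0 and 1 sit on the same side; nodes 0 and 3 sit on opposite sides.
for u, v in [(0, 1), (0, 3)]:
    print(u, v, "connected:", connected(adj, u, v), "jaccard:", jaccard(adj, u, v))
```

In this toy graph the two measures disagree completely: vertices on the same side have identical neighborhoods but no edge between them, while vertices on opposite sides are linked but share no neighbors. A method committed to only one of the two notions therefore groups the vertices in opposite ways.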
