A scalable community detection algorithm for large graphs using stochastic block models

Community detection in graphs is widely used in the analysis of social and biological networks, and the stochastic block model is a powerful probabilistic tool for describing graphs with community structure. However, in the era of “big data,” traditional inference algorithms for this model are increasingly limited by their high time complexity and poor scalability. In this paper, we propose a multi-stage maximum likelihood approach that recovers the latent parameters of the stochastic block model in time linear in the number of edges. We also propose a parallel algorithm based on message passing. This algorithm can overlap communication and computation, providing speedup without compromising accuracy as the number of processors grows. For example, it processes a real-world graph with about 1.3 million nodes and 10 million edges in about 6 seconds on 64 cores of a contemporary commodity Linux cluster. Experiments demonstrate that the algorithm produces high-quality results on both benchmark and real-world graphs. We also illustrate a case in which it finds more meaningful communities than a popular modularity maximization algorithm.
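
To make the modeling idea concrete, the following is a minimal sketch of maximum likelihood estimation in a Bernoulli stochastic block model when the community labels are held fixed. It is not the multi-stage or parallel algorithm proposed in the paper, whose details the abstract does not give. The function names (`block_counts`, `block_mle`, `profile_log_likelihood`) and the toy graph are illustrative assumptions, and the graph is assumed to be simple and undirected, with no self-loops or duplicate edges.

```python
import math
from collections import Counter


def block_counts(labels, edges):
    """Return {(r, s): (edge_count, pair_count)} for block pairs r <= s."""
    sizes = Counter(labels)
    edge_counts = Counter()
    for u, v in edges:  # single pass over the edge list
        r, s = sorted((labels[u], labels[v]))
        edge_counts[(r, s)] += 1
    blocks = sorted(sizes)
    counts = {}
    for i, r in enumerate(blocks):
        for s in blocks[i:]:
            if r == s:
                pairs = sizes[r] * (sizes[r] - 1) // 2  # node pairs inside block r
            else:
                pairs = sizes[r] * sizes[s]  # node pairs between blocks r and s
            counts[(r, s)] = (edge_counts[(r, s)], pairs)
    return counts


def block_mle(labels, edges):
    """Maximum likelihood connection probability for each block pair."""
    counts = block_counts(labels, edges)
    return {rs: (m / n if n else 0.0) for rs, (m, n) in counts.items()}


def profile_log_likelihood(labels, edges):
    """Log-likelihood of the graph at the MLE probabilities for these labels."""
    total = 0.0
    for m, n in block_counts(labels, edges).values():
        if 0 < m < n:  # terms with m == 0 or m == n contribute zero
            p = m / n
            total += m * math.log(p) + (n - m) * math.log(1 - p)
    return total


# Toy graph (hypothetical): two triangles joined by a single edge.
labels = [0, 0, 0, 1, 1, 1]
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
print(block_mle(labels, edges))
print(profile_log_likelihood(labels, edges))
```

For fixed labels, the edge counts come from one pass over the edge list, so the cost of this step grows linearly with the number of edges, plus a term quadratic in the number of blocks. A full inference procedure would also have to search over the labels themselves, which this sketch does not attempt.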
