Distributed community detection in web-scale networks

Partitioning large networks into smaller sub-networks (communities) is an important tool to analyze the structure of complex linked systems. In recent years, many in-memory community detection algorithms have been proposed for graphs with millions of edges. Analyzing massive graphs with billions of edges is impossible for existing algorithms. In this contribution, we show how to find community partitions of networks with billions of edges. Our approach is based on an ensemble learning scheme for community detection that provides a way to identify high quality partitions from an ensemble of partitions with lower quality. We present a pre-processing procedure for community detection algorithms that significantly decreases the problem size. After reducing the problem size, traditional non-distributed community detection algorithms can be applied. We implemented a weak but highly scalable label propagation algorithm on top of the distributed-computing framework Apache Hadoop. The evaluation of our implementation on a 50-node Hadoop cluster and with evaluation datasets up to 3.3 billion edges shows very good results with respect to clustering quality as well as scalability. For a smaller 260 million edge network, we show that our preprocessing can improve the results of the popular Louvain modularity clustering algorithm.

[1]  A. Medus,et al.  Detection of community structures in networks via global optimization , 2005 .

[2]  Marco Rosa,et al.  Layered label propagation: a multiresolution coordinate-free ordering for compressing social networks , 2010, WWW.

[3]  Réka Albert,et al.  Near linear time algorithm to detect community structures in large-scale networks. , 2007, Physical review. E, Statistical, nonlinear, and soft matter physics.

[4]  Martin Rosvall,et al.  An information-theoretic framework for resolving community structure in complex networks , 2007, Proceedings of the National Academy of Sciences.

[5]  R. Schapire The Strength of Weak Learnability , 1990, Machine Learning.

[6]  Christos Faloutsos,et al.  Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining , 2013, ASONAM 2013.

[7]  Jimmy J. Lin,et al.  Design patterns for efficient graph algorithms in MapReduce , 2010, MLG '10.

[8]  Santo Fortunato,et al.  Community detection in graphs , 2009, ArXiv.

[9]  Jianyong Wang,et al.  Parallel community detection on large networks with propinquity dynamics , 2009, KDD.

[10]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[11]  Jon M. Kleinberg,et al.  Group formation in large social networks: membership, growth, and evolution , 2006, KDD '06.

[12]  Christos Faloutsos,et al.  PEGASUS: A Peta-Scale Graph Mining System Implementation and Observations , 2009, 2009 Ninth IEEE International Conference on Data Mining.

[13]  Pietro Liò,et al.  Towards real-time community detection in large networks. , 2008, Physical review. E, Statistical, nonlinear, and soft matter physics.

[14]  Vipin Kumar,et al.  A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs , 1998, SIAM J. Sci. Comput..

[15]  Jean-Loup Guillaume,et al.  Fast unfolding of communities in large networks , 2008, 0803.0476.

[16]  M E J Newman,et al.  Finding and evaluating community structure in networks. , 2003, Physical review. E, Statistical, nonlinear, and soft matter physics.

[17]  Xiaowei Xu,et al.  SCAN: a structural clustering algorithm for networks , 2007, KDD '07.

[18]  Javier Béjar,et al.  Clustering algorithm for determining community structure in large networks. , 2006, Physical review. E, Statistical, nonlinear, and soft matter physics.

[19]  Mingyuan An,et al.  A microscopic view on community detection in complex networks , 2008, PIKM '08.

[20]  Bin Wu,et al.  Efficient Dense Structure Mining Using MapReduce , 2009, 2009 IEEE International Conference on Data Mining Workshops.

[21]  Santosh S. Vempala,et al.  On clusterings-good, bad and spectral , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[22]  Andreas Geyer-Schulz,et al.  Cluster Cores and Modularity Maximization , 2010, 2010 IEEE International Conference on Data Mining Workshops.