Scalable Graph Clustering with Pregel

We outline a method for constructing, in parallel, a collection of local clusters for a massive distributed graph. Given an input set of (vertex, cluster size) tuples, we compute approximate personalized PageRank vectors in parallel using Pregel and sweep the results using MapReduce. We show that our method converges to the serial approximate PageRank, and we report an experiment that illustrates the speedup over the serial method. We also outline a random selection and deconfliction procedure for clustering a distributed graph, and perform experiments to assess the quality of the resulting clusterings.
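
For orientation, the serial baseline mentioned above can be sketched in a few lines of Python. This is a minimal, single-machine illustration of a push-style approximate personalized PageRank followed by a conductance sweep. It is not the Pregel/MapReduce implementation, and it assumes an undirected graph with no self-loops and no isolated vertices, stored as an adjacency dict. The function names (`approximate_ppr`, `sweep_cut`) and the parameters `alpha`, `eps`, and `max_size` are hypothetical choices made for this example.

```python
from collections import deque


def approximate_ppr(adj, seed, alpha=0.15, eps=1e-4):
    """Push-style approximation of the personalized PageRank of `seed`.

    Assumption: a lazy random walk is used, and pushing stops once every
    residual satisfies r[v] < eps * degree(v).
    """
    p = {}                 # approximate PageRank mass
    r = {seed: 1.0}        # residual mass still to be distributed
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        deg_u = len(adj[u])
        ru = r.get(u, 0.0)
        if ru < eps * deg_u:
            continue       # stale queue entry, nothing to push
        # Push: keep alpha * ru at u, hold half of the rest, spread the other half.
        p[u] = p.get(u, 0.0) + alpha * ru
        r[u] = (1.0 - alpha) * ru / 2.0
        share = (1.0 - alpha) * ru / (2.0 * deg_u)
        for v in adj[u]:
            old = r.get(v, 0.0)
            r[v] = old + share
            # Enqueue v only when it first crosses its threshold.
            if old < eps * len(adj[v]) <= r[v]:
                queue.append(v)
        if r[u] >= eps * deg_u:
            queue.append(u)
    return p


def sweep_cut(adj, p, max_size=None):
    """Return the prefix of the degree-normalized ordering with lowest conductance."""
    total_vol = sum(len(nbrs) for nbrs in adj.values())
    order = sorted(p, key=lambda v: p[v] / len(adj[v]), reverse=True)
    if max_size is not None:
        order = order[:max_size]   # hypothetical cap on the cluster size
    in_set, vol, cut = set(), 0, 0
    best, best_phi = [], float("inf")
    for i, v in enumerate(order):
        in_set.add(v)
        vol += len(adj[v])
        inside = sum(1 for w in adj[v] if w in in_set)
        cut += len(adj[v]) - 2 * inside   # new boundary edges minus absorbed ones
        denom = min(vol, total_vol - vol)
        if denom == 0:
            break                          # prefix covers the whole graph
        phi = cut / denom
        if phi < best_phi:
            best_phi, best = phi, order[: i + 1]
    return set(best), best_phi


# Toy input: two triangles joined by the single edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
cluster, phi = sweep_cut(adj, approximate_ppr(adj, seed=0))
print(sorted(cluster), phi)
```

In a distributed setting, each push would presumably map onto per-vertex message passing and the sweep onto a sort-and-scan job; the sketch only fixes the serial semantics that such a version would need to reproduce.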
