Scalable graph clustering with parallel approximate PageRank

Abstract

We outline a method for constructing, in parallel, a collection of local clusters for a massive distributed graph. Given an input set of (vertex, cluster size) tuples, we compute approximations of personalized PageRank vectors in parallel using Pregel, then sweep over the results using MapReduce to create clusters. We show that our method converges to the serial approximate PageRank, and we perform an experiment that illustrates the speedup over the serial method. We also outline a random selection and de-confliction procedure for clustering a distributed graph, and we perform experiments to determine the quality of the clusterings returned.
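
To make the two stages named in the abstract concrete, the following is a minimal serial sketch of the general technique: a push-style approximate personalized PageRank followed by a conductance sweep. It is not the Pregel/MapReduce implementation described here. It runs on a single machine and assumes a simple undirected graph stored as an adjacency dictionary; the function names, the parameters `alpha`, `eps`, and `max_size`, and the toy graph are all illustrative assumptions.

```python
from collections import deque


def approximate_pagerank(adj, seed, alpha=0.15, eps=1e-4):
    """Push-style approximation of a personalized PageRank vector (lazy walk).

    adj maps each vertex to a list of neighbours (simple undirected graph).
    Returns a sparse dict p of approximate PageRank mass per vertex.
    """
    p = {}                # approximate PageRank mass
    r = {seed: 1.0}       # residual mass still to be distributed
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        deg_u = len(adj[u])
        ru = r.get(u, 0.0)
        # Skip isolated vertices and vertices whose residual is below threshold.
        if deg_u == 0 or ru < eps * deg_u:
            continue
        # Push: keep an alpha fraction, hold half of the rest, spread the other half.
        p[u] = p.get(u, 0.0) + alpha * ru
        r[u] = (1 - alpha) * ru / 2
        share = (1 - alpha) * ru / (2 * deg_u)
        for v in adj[u]:
            r[v] = r.get(v, 0.0) + share
            if r[v] >= eps * len(adj[v]):
                queue.append(v)
        if r[u] >= eps * deg_u:
            queue.append(u)
    return p


def sweep_cut(adj, p, max_size=None):
    """Sweep over vertices ordered by p[v]/deg(v); return the best-conductance prefix."""
    total_vol = sum(len(nbrs) for nbrs in adj.values())
    order = sorted(p, key=lambda v: p[v] / len(adj[v]), reverse=True)
    if max_size is not None:
        order = order[:max_size]   # hypothetical cap, mirroring a requested cluster size
    in_set = set()
    vol = 0
    cut = 0
    best_phi, best_k = float("inf"), 0
    for k, v in enumerate(order, start=1):
        inside = sum(1 for w in adj[v] if w in in_set)
        in_set.add(v)
        vol += len(adj[v])
        # Edges to vertices already inside stop being cut edges; the rest become cut edges.
        cut += len(adj[v]) - 2 * inside
        denom = min(vol, total_vol - vol)
        if denom == 0:
            break
        phi = cut / denom
        if phi < best_phi:
            best_phi, best_k = phi, k
    return set(order[:best_k]), best_phi


# Toy graph (an assumption for illustration): two triangles joined by one edge.
example_adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
example_p = approximate_pagerank(example_adj, seed=0)
example_cluster, example_phi = sweep_cut(example_adj, example_p)
```

On this toy graph, seeding at vertex 0 yields the seed's own triangle as the cluster, cut off from the rest by the single bridging edge. In a distributed setting the push loop and the sweep would be expressed as vertex programs and map/reduce passes respectively; the sketch only illustrates the serial logic that such a parallel version would be expected to converge to.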
