Optimizing Graph Algorithms on Pregel-like Systems

We study the problem of implementing graph algorithms efficiently on Pregel-like systems, which can be surprisingly challenging. Standard graph algorithms in this setting can incur unnecessary inefficiencies such as slow convergence or high communication or computation cost, typically due to structural properties of the input graphs such as large diameters or skew in component sizes. We describe several optimization techniques to address these inefficiencies. Our most general technique is based on the idea of performing some serial computation on a tiny fraction of the input graph, complementing Pregel's vertex-centric parallelism. We base our study on thorough implementations of several fundamental graph algorithms, some of which have, to the best of our knowledge, not been implemented on Pregel-like systems before. The algorithms and optimizations we describe are fully implemented in our open-source Pregel implementation. We present detailed experiments showing that our optimization techniques improve runtime significantly on a variety of very large graph datasets.
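To make the idea of finishing a computation serially concrete, the following is a minimal single-process sketch and not the implementation described in this work. It assumes a simple maximal independent set rule (an undecided vertex joins the set when its ID is smaller than the IDs of all its undecided neighbors) and simulates supersteps in a loop rather than using any real Pregel API. The names `maximal_independent_set`, `serial_threshold`, and `path_graph` are hypothetical and chosen only for illustration.

```python
def path_graph(n):
    """Build an undirected path 0 - 1 - ... - (n-1) as an adjacency dict."""
    return {i: {j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}


def maximal_independent_set(adj, serial_threshold=0):
    """Return (independent_set, supersteps) for an undirected graph.

    adj maps each vertex to the set of its neighbors.
    serial_threshold is an assumed tuning knob: once at most this many
    vertices remain undecided, the rest is finished by one serial pass.
    """
    undecided = set(adj)
    chosen = set()
    supersteps = 0

    # Simulated vertex-centric phase: in each superstep every undecided
    # vertex looks only at the state of its own neighbors.
    while len(undecided) > serial_threshold:
        winners = {
            v for v in undecided
            if all(v < u for u in adj[v] if u in undecided)
        }
        losers = {u for v in winners for u in adj[v] if u in undecided}
        chosen |= winners
        undecided -= winners | losers
        supersteps += 1

    # Serial phase: greedy pass over the small remaining subgraph.
    for v in sorted(undecided):
        if v in undecided:
            chosen.add(v)
            undecided -= adj[v] | {v}

    return chosen, supersteps


if __name__ == "__main__":
    graph = path_graph(10)
    print(maximal_independent_set(graph, serial_threshold=0))
    print(maximal_independent_set(graph, serial_threshold=8))
```

On this 10-vertex path, which stands in for a large-diameter input, the purely vertex-centric run needs 5 simulated supersteps, while the run with `serial_threshold=8` needs 1 superstep plus the serial pass and returns the same set. In a distributed setting the serial pass would also require shipping the remaining subgraph to a single machine, a cost this sketch does not model.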
