On the Distributed Complexity of Large-Scale Graph Computations

Motivated by the increasing need to understand the distributed algorithmic foundations of large-scale graph computations, we study some fundamental graph problems in a message-passing model for distributed computing where $k \geq 2$ machines jointly perform computations on graphs with $n$ nodes (typically, $n \gg k$). The input graph is assumed to be initially randomly partitioned among the $k$ machines, a common implementation in many real-world systems. Communication is point-to-point, and the goal is to minimize the number of communication {\em rounds} of the computation. Our main contribution is the {\em General Lower Bound Theorem}, a theorem that can be used to show non-trivial lower bounds on the round complexity of distributed large-scale data computations. The General Lower Bound Theorem is established via an information-theoretic approach that relates the round complexity to the minimal amount of information required by machines to solve the problem. Our approach is generic and this theorem can be used in a "cookbook" fashion to show distributed lower bounds in the context of several problems, including non-graph problems. We present two applications by showing (almost) tight lower bounds for the round complexity of two fundamental graph problems, namely {\em PageRank computation} and {\em triangle enumeration}. Our approach, as demonstrated in the case of PageRank, can yield tight lower bounds for problems (including, and especially, under a stochastic partition of the input) where communication complexity techniques are not obvious. Our approach, as demonstrated in the case of triangle enumeration, can yield stronger round lower bounds as well as message-round tradeoffs compared to approaches that use communication complexity techniques.
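The random input partition described above can be made concrete with a small sketch. It assumes a random vertex partition, in which each vertex is assigned independently and uniformly at random to a "home" machine that also stores the vertex's incident edges; the abstract only says the graph is "randomly partitioned", so this reading, along with the function name `random_vertex_partition` and its parameters, is an illustrative assumption rather than the authors' definition.

```python
import random
from collections import defaultdict


def random_vertex_partition(edges, n, k, seed=0):
    """Assign vertices 0..n-1 to k machines uniformly at random.

    Assumption (not stated in the abstract): a machine that is the home
    of a vertex also stores every edge incident to that vertex.
    Returns (home, local_edges), where home[v] is the machine of v and
    local_edges[i] is the set of edges stored on machine i.
    """
    rng = random.Random(seed)
    home = [rng.randrange(k) for _ in range(n)]
    local_edges = defaultdict(set)
    for u, v in edges:
        # An edge is known to the home machines of both endpoints.
        local_edges[home[u]].add((u, v))
        local_edges[home[v]].add((u, v))
    return home, local_edges


if __name__ == "__main__":
    n, k = 12, 3
    cycle = [(i, (i + 1) % n) for i in range(n)]  # toy input: a 12-cycle
    home, local_edges = random_vertex_partition(cycle, n, k)
    for machine in range(k):
        vertices = [v for v in range(n) if home[v] == machine]
        print(machine, vertices, len(local_edges[machine]))
```

With $n \gg k$, each machine holds roughly $n/k$ vertices in expectation under this assumption, which is the regime the round bounds are stated for.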
We then present distributed algorithms for PageRank and triangle enumeration with a round complexity that (almost) matches the respective lower bounds; these algorithms exhibit a round complexity which scales superlinearly in $k$, improving significantly over previous results for these problems [Klauck et al., SODA 2015]. Specifically, we show the following results:
\begin{itemize}
\item {\em PageRank}: We show a lower bound of $\tilde{\Omega}(n/k^2)$ rounds, and present a distributed algorithm that computes the PageRank of all the nodes of a graph in $\tilde{O}(n/k^2)$ rounds.
\item {\em Triangle enumeration}: We show that there exist graphs with $m$ edges where any distributed algorithm requires $\tilde{\Omega}(m/k^{5/3})$ rounds. This result also implies the first non-trivial lower bound of $\tilde{\Omega}(n^{1/3})$ rounds for the {\em congested clique} model, which is tight up to logarithmic factors. We then present a distributed algorithm that enumerates all the triangles of a graph in $\tilde{O}(m/k^{5/3} + n/k^{4/3})$ rounds.
\end{itemize}
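As a quick check of the congested clique claim, the following substitution assumes the usual correspondence in which the congested clique is the special case $k = n$ (one graph node per machine) and takes a dense input with $m = \Theta(n^2)$ edges. Both choices are assumptions made for illustration; the abstract does not spell them out.

\[
\tilde{\Omega}\!\left(\frac{m}{k^{5/3}}\right)\Bigg|_{k=n,\; m=\Theta(n^2)}
= \tilde{\Omega}\!\left(\frac{n^{2}}{n^{5/3}}\right)
= \tilde{\Omega}\!\left(n^{1/3}\right).
\]

Under the same substitution, the upper-bound expression evaluates to the same order, consistent with the stated tightness up to logarithmic factors:

\[
\tilde{O}\!\left(\frac{m}{k^{5/3}} + \frac{n}{k^{4/3}}\right)\Bigg|_{k=n,\; m=\Theta(n^2)}
= \tilde{O}\!\left(n^{1/3} + n^{-1/3}\right)
= \tilde{O}\!\left(n^{1/3}\right).
\]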

[1]  Leslie G. Valiant,et al.  A Scheme for Fast Parallel Communication , 1982, SIAM J. Comput..

[2]  Merav Parter,et al.  MST in Log-Star Rounds of Congested Clique , 2016, PODC.

[3]  Peter Robinson,et al.  Tight Bounds for Distributed Graph Computations , 2016, ArXiv.

[4]  Fabian Kuhn,et al.  On the power of the congested clique model , 2014, PODC.

[5]  Christoph Lenzen,et al.  Optimal deterministic routing and sorting on the congested clique , 2012, PODC '13.

[6]  Jeffrey D. Ullman,et al.  Optimizing Multiway Joins in a Map-Reduce Environment , 2011, IEEE Transactions on Knowledge and Data Engineering.

[7]  Peter Robinson,et al.  Fast Distributed Algorithms for Connectivity and MST in Large Graphs , 2015, SPAA.

[8]  Fan Chung Graham,et al.  Distributed Algorithms for Finding Local Clusters Using Heat Kernel Pagerank , 2015, WAW.

[9]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[10]  Jure Leskovec,et al.  Mining of Massive Datasets , 2014 .

[11]  David P. Woodruff,et al.  When distributed computation is communication expensive , 2013, Distributed Computing.

[12]  Christoph Lenzen,et al.  Tight bounds for parallel randomized load balancing , 2011, Distributed Computing.

[13]  Cynthia A. Phillips,et al.  Why do simple algorithms for triangle enumeration work in the real world? , 2014, Internet Math..

[14]  Sriram V. Pemmaraju,et al.  Toward Optimal Bounds in the Congested Clique: Graph Connectivity and MST , 2015, PODC.

[15]  Dan Suciu,et al.  Worst-Case Optimal Algorithms for Parallel Query Processing , 2016, ICDT.

[16]  Aart J. C. Bik,et al.  Pregel: a system for large-scale graph processing , 2010, SIGMOD Conference.

[17]  Svante Janson,et al.  Large deviations for sums of partly dependent random variables , 2004, Random Struct. Algorithms.

[18]  Leslie G. Valiant,et al.  A bridging model for parallel computation , 1990, CACM.

[19]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[20]  Jonathan W. Berry,et al.  Tolerating the community detection resolution limit with edge weighting. , 2008, Physical review. E, Statistical, nonlinear, and soft matter physics.

[21]  Anthony K. H. Tung,et al.  On Triangulation-based Dense Neighborhood Graphs Discovery , 2010, Proc. VLDB Endow..

[22]  François Le Gall,et al.  Triangle Finding and Listing in CONGEST Networks , 2017, PODC.

[23]  Duncan J. Watts,et al.  Collective dynamics of ‘small-world’ networks , 1998, Nature.

[24]  Tao Guo,et al.  Distributed Algorithms on Exact Personalized PageRank , 2017, SIGMOD Conference.

[25]  Sergei Vassilvitskii,et al.  Counting triangles and the curse of the last reducer , 2011, WWW.

[26]  David Peleg,et al.  A Near-Tight Lower Bound on the Time Complexity of Distributed Minimum-Weight Spanning Tree Construction , 2000, SIAM J. Comput..

[27]  Dong Xin,et al.  Fast personalized PageRank on MapReduce , 2011, SIGMOD '11.

[28]  Sung-Hyon Myaeng,et al.  PTE: Enumerating Trillion Triangles On Distributed Systems , 2016, KDD.

[29]  Atri Rudra,et al.  Skew strikes back: new developments in the theory of join algorithms , 2013, SIGMOD Rec..

[30]  Seif Haridi,et al.  Distributed Algorithms , 1992, Lecture Notes in Computer Science.

[31]  Atish Das Sarma,et al.  Fast Distributed PageRank Computation , 2013, ICDCN.

[32]  Stanley Wasserman,et al.  Social Network Analysis: Methods and Applications , 1994, Structural analysis in the social sciences.

[33]  David Peleg,et al.  Distributed Computing: A Locality-Sensitive Approach , 1987 .

[34]  Christoph Lenzen,et al.  "Tri, Tri Again": Finding Triangles and Small Subgraphs in a Distributed Setting - (Extended Abstract) , 2012, DISC.

[35]  Danupon Nanongkai,et al.  Distributed approximation algorithms for weighted shortest paths , 2014, STOC.

[36]  Alexandr Andoni,et al.  Parallel algorithms for geometric graph problems , 2013, STOC.

[37]  James Cheng,et al.  Triangle listing in massive networks , 2012, TKDD.

[38]  Silvio Lattanzi,et al.  Filtering: a method for solving graph problems in MapReduce , 2011, SPAA '11.

[39]  Jack J. Dongarra,et al.  Exascale computing and big data , 2015, Commun. ACM.

[40]  Hartmut Klauck,et al.  Can quantum communication speed up distributed computation? , 2012, PODC.

[41]  Lijun Chang,et al.  Scalable Distributed Subgraph Enumeration , 2016, Proc. VLDB Endow..

[42]  Christoph Lenzen,et al.  Algebraic methods in the congested clique , 2015, Distributed Computing.

[43]  David Peleg,et al.  Message Lower Bounds via Efficient Network Synchronization , 2016, SIROCCO.

[44]  Hartmut Klauck,et al.  Distributed Computation of Large-scale Graph Problems , 2015, SODA.

[45]  Geoffrey C. Fox,et al.  HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack , 2015, 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing.

[46]  Noshir S. Contractor,et al.  Is a friend a friend?: investigating the structure of friendship networks in virtual worlds , 2010, CHI Extended Abstracts.

[47]  Boaz Patt-Shamir,et al.  Minimum-Weight Spanning Tree Construction in O(log log n) Communication Rounds , 2005, SIAM J. Comput..

[48]  Rotem Oshman.  Communication Complexity Lower Bounds in Distributed Message-Passing , 2014, SIROCCO.

[49]  Konstantin Avrachenkov,et al.  Monte Carlo Methods in PageRank Computation: When One Iteration is Sufficient , 2007, SIAM J. Numer. Anal..

[50]  Christoph M. Hoffmann,et al.  A graph-constructive approach to solving systems of geometric constraints , 1997, TOGS.

[51]  Rajeev Motwani,et al.  The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.

[52]  Tomasz Jurdzinski,et al.  MST in O(1) Rounds of Congested Clique , 2018, SODA.

[53]  Danupon Nanongkai,et al.  A tight unconditional lower bound on distributed randomwalk computation , 2011, PODC '11.

[54]  Igor Rivin.  Counting cycles and finite dimensional Lp norms , 2002, Adv. Appl. Math..

[55]  Sergei Vassilvitskii,et al.  A model of computation for MapReduce , 2010, SODA '10.

[56]  Sreenivas Gollapudi,et al.  Estimating PageRank on graph streams , 2008, PODS.

[57]  Ashish Goel,et al.  Fast Incremental and Personalized PageRank , 2010, Proc. VLDB Endow..

[58]  Fazlollah M. Reza,et al.  Introduction to Information Theory , 2004, Lecture Notes in Electrical Engineering.

[59]  Sayan Bandyapadhyay,et al.  Near-Optimal Clustering in the k-machine model , 2017, ICDCN.

[60]  Pavel Berkhin,et al.  A Survey on PageRank Computing , 2005, Internet Math..

[61]  Jimmy J. Lin,et al.  Book Reviews: Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer , 2010, CL.

[62]  Christian Scheideler,et al.  Universal Routing Strategies for Interconnection Networks , 1998, Lecture Notes in Computer Science.

[63]  Vojtech Rödl,et al.  Random Graphs with Monochromatic Triangles in Every Edge Coloring , 1994, Random Struct. Algorithms.

[64]  Avery Ching,et al.  One Trillion Edges: Graph Processing at Facebook-Scale , 2015, Proc. VLDB Endow..

[65]  Ravi Kumar,et al.  An information statistics approach to data stream and communication complexity , 2004, J. Comput. Syst. Sci..