DiSLR: Distributed Sampling with Limited Redundancy For Triangle Counting in Graph Streams

Given a web-scale graph that grows over time, how should its edges be stored and processed on multiple machines for rapid and accurate estimation of the count of triangles? The count of triangles (i.e., cliques of size three) has proven useful in many applications, including anomaly detection, community detection, and link recommendation. For triangle counting in large and dynamic graphs, recent work has focused largely on streaming algorithms and distributed algorithms. To achieve the advantages of both approaches, we propose DiSLR, a distributed streaming algorithm that estimates the counts of global triangles and local triangles associated with each node. Making one pass over the input stream, DiSLR carefully processes and stores the edges across multiple machines so that the redundant use of computational and storage resources is minimized. Compared to its best competitors, DiSLR is (a) Accurate: giving up to 39X smaller estimation error, (b) Fast: up to 10.4X faster, scaling linearly with the number of edges in the input stream, and (c) Theoretically sound: yielding unbiased estimates with variances decreasing faster as the number of machines is scaled up.

[1]  Julian Shun,et al.  Multicore triangle computations without tuning , 2015, 2015 IEEE 31st International Conference on Data Engineering.

[2]  Jure Leskovec,et al.  Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters , 2008, Internet Math..

[3]  Jonathan Cohen,et al.  Graph Twiddling in a MapReduce World , 2009, Computing in Science & Engineering.

[4]  Jon M. Kleinberg,et al.  Overview of the 2003 KDD Cup , 2003, SKDD.

[5]  Silvio Lattanzi,et al.  Ego-net Community Mining Applied to Friend Suggestion , 2015, Proc. VLDB Endow..

[6]  Krishna P. Gummadi,et al.  Measurement and analysis of online social networks , 2007, IMC '07.

[7]  Anthony K. H. Tung,et al.  On Triangulation-based Dense Neighborhood Graphs Discovery , 2010, Proc. VLDB Endow..

[8]  C. Spearman The proof and measurement of association between two things. , 2015, International journal of epidemiology.

[9]  Ramana Rao Kompella,et al.  Graph sample and hold: a framework for big-graph analytics , 2014, KDD.

[10]  Srikanta Tirthapura,et al.  Parallel triangle counting in massive streaming graphs , 2013, CIKM.

[11]  Luca Becchetti,et al.  Efficient algorithms for large-scale local triangle counting , 2010, TKDD.

[12]  Sung-Hyon Myaeng,et al.  Enumerating Trillion Subgraphs On Distributed Systems , 2018, ACM Trans. Knowl. Discov. Data.

[13]  Rasmus Pagh,et al.  MapReduce Triangle Enumeration With Guarantees , 2014, CIKM.

[14]  Ryan A. Rossi,et al.  On Sampling from Massive Graph Streams , 2017, Proc. VLDB Endow..

[15]  Jinha Kim,et al.  OPT: a new framework for overlapped and parallel triangulation in large-scale graphs , 2014, SIGMOD Conference.

[16]  Christos Faloutsos,et al.  Spectral counting of triangles via element-wise sparsification and triangle-based link recommendation , 2011, Social Network Analysis and Mining.

[17]  Rasmus Pagh,et al.  On the streaming complexity of computing local clustering coefficients , 2013, WSDM.

[18]  Yufei Tao,et al.  Massive graph triangulation , 2013, SIGMOD '13.

[19]  Sung-Hyon Myaeng,et al.  PTE: Enumerating Trillion Triangles On Distributed Systems , 2016, KDD.

[20]  Krishna P. Gummadi,et al.  On the evolution of user interaction in Facebook , 2009, WOSN '09.

[21]  Chin-Wan Chung,et al.  An efficient MapReduce algorithm for counting triangles in a very large graph , 2013, CIKM.

[22]  Jure Leskovec,et al.  Defining and evaluating network communities based on ground-truth , 2012, KDD 2012.

[23]  Charalampos E. Tsourakakis,et al.  Colorful triangle counting and a MapReduce implementation , 2011, Inf. Process. Lett..

[24]  Stanley Wasserman,et al.  Social Network Analysis: Methods and Applications , 1994, Structural analysis in the social sciences.

[25]  Wook-Shin Han,et al.  TurboGraph++: A Scalable and Fast Graph Analytics System , 2018, SIGMOD Conference.

[26]  Charalampos E. Tsourakakis Fast Counting of Triangles in Large Real Networks without Counting: Algorithms and Laws , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[27]  Kun-Lung Wu,et al.  Counting and Sampling Triangles from a Graph Stream , 2013, Proc. VLDB Endow..

[28]  Yongsub Lim,et al.  MASCOT: Memory-efficient and Accurate Sampling for Counting Local Triangles in Graph Streams , 2015, KDD.

[29]  Christos Faloutsos,et al.  Think Before You Discard: Accurate Triangle Counting in Graph Streams with Deletions , 2018, ECML/PKDD.

[30]  Luca Becchetti,et al.  Efficient semi-streaming algorithms for local triangle counting in massive graphs , 2008, KDD.

[31]  Sergei Vassilvitskii,et al.  Counting triangles and the curse of the last reducer , 2011, WWW.

[32]  Eric Price,et al.  A Hybrid Sampling Scheme for Triangle Counting , 2016, SODA.

[33]  Christos Faloutsos,et al.  Tri-Fly: Distributed Estimation of Global and Local Triangle Counts in Graph Streams , 2018, PAKDD.

[34]  Madhav V. Marathe,et al.  PATRIC: a parallel algorithm for counting triangles in massive networks , 2013, CIKM.

[35]  Christos Faloutsos,et al.  DOULION: counting triangles in massive graphs with a coin , 2009, KDD.

[36]  Lorenzo De Stefani,et al.  TRIÈST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fixed Memory Size , 2016, KDD.

[37]  C. Spearman The proof and measurement of association between two things. By C. Spearman, 1904. , 1987, The American journal of psychology.

[38]  Ziv Bar-Yossef,et al.  Reductions in streaming algorithms, with an application to counting triangles in graphs , 2002, SODA '02.

[39]  Kijung Shin,et al.  WRS: Waiting Room Sampling for Accurate Triangle Counting in Real Graph Streams , 2017, 2017 IEEE International Conference on Data Mining (ICDM).