A space efficient streaming algorithm for triangle counting using the birthday paradox

We design a space-efficient algorithm that approximates the transitivity (global clustering coefficient) and total triangle count with only a single pass through a graph given as a stream of edges. Our procedure is based on the classic probabilistic result, the birthday paradox. When the transitivity is constant and there are more edges than wedges (common properties for social networks), we can prove that our algorithm requires O(√n) space (n is the number of vertices) to provide accurate estimates. We run a detailed set of experiments on a variety of real graphs and demonstrate that the memory requirement of the algorithm is a tiny fraction of the graph. For example, even for a graph with 200 million edges, our algorithm stores just 40,000 edges to give accurate results. Being a single pass streaming algorithm, our procedure also maintains a real-time estimate of the transitivity/number of triangles of a graph by storing a minuscule fraction of edges.
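To make the estimated quantities concrete, the sketch below computes the exact triangle count and transitivity of a small in-memory graph, using the standard definition of transitivity as the fraction of wedges (paths of length two) that are closed. This is not the streaming algorithm described above: it stores the whole graph and serves only as a reference for what a single-pass estimator would approximate. The function name `exact_transitivity` and the edge-list input format are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def exact_transitivity(edges):
    """Return (triangles, wedges, transitivity) for an undirected simple graph.

    Illustrative baseline only: it keeps every edge in memory, unlike a
    streaming estimator.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    # A wedge is a pair of distinct neighbors sharing a center vertex.
    wedges = sum(len(nb) * (len(nb) - 1) // 2 for nb in adj.values())

    # A wedge is closed when its two endpoints are themselves adjacent.
    closed = 0
    for nb in adj.values():
        for a, b in combinations(nb, 2):
            if b in adj[a]:
                closed += 1

    # Each triangle contributes exactly three closed wedges.
    triangles = closed // 3
    transitivity = closed / wedges if wedges else 0.0
    return triangles, wedges, transitivity


if __name__ == "__main__":
    # A triangle on vertices 1, 2, 3 with a pendant edge (3, 4).
    print(exact_transitivity([(1, 2), (2, 3), (1, 3), (3, 4)]))
```

On the example graph there are five wedges, three of which are closed by the single triangle, so the transitivity is 3/5.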
