Counting triangles in data streams

We present two space-bounded random sampling algorithms that compute an approximation of the number of triangles in an undirected graph given as a stream of edges. Our first algorithm makes no assumptions about the order of edges in the stream. It uses space that is inversely related to the ratio between the number of triangles and the number of triples with at least one edge in the induced subgraph, and constant expected update time per edge. Our second algorithm is designed for incidence streams (all edges incident to the same vertex appear consecutively). It uses space that is inversely related to the ratio between the number of triangles and the number of length-2 paths in the graph, and expected update time O(log|V|⋅(1+s⋅|V|/|E|)), where s is the space requirement of the algorithm. These results significantly improve over previous work [20, 8]. Since the space complexity depends only on the structure of the input graph and not on the number of nodes, our algorithms scale very well with increasing graph size, and so they provide a basic tool for analyzing the structure of large graphs. They have many applications, for example, in the discovery of Web communities, the computation of clustering and transitivity coefficients, and the discovery of frequent patterns in large graphs.

We have implemented both algorithms and evaluated their performance on networks from different application domains. The sizes of the considered graphs varied from about 8,000 nodes and 40,000 edges to 135 million nodes and more than 1 billion edges. For both algorithms we ran experiments with parameter s=1,000, 10,000, 100,000, 1,000,000 to evaluate running time and approximation quality. Both algorithms appear to be time efficient for these sample sizes. The approximation quality of the first algorithm varied significantly, and even for s=1,000,000 the deviation exceeded 10% for more than half of the instances. The second algorithm performed much better, and even for s=10,000 we had an average deviation of less than 6% (taken over all but the largest instance, for which we could not compute the number of triangles exactly).
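To make the two structural ratios that govern the space bounds concrete, the following minimal sketch computes them exactly on a small graph. It is not one of the streaming algorithms described above: it enumerates all vertex triples by brute force, which is feasible only for tiny inputs. The function name `triple_statistics` and the example graph are assumptions made for illustration only.

```python
from itertools import combinations


def triple_statistics(num_vertices, edges):
    """Exact counts for a small undirected graph (brute force, illustration only).

    Returns (triangles, triples_with_edge, paths2), where
      triangles         = vertex triples whose induced subgraph has 3 edges,
      triples_with_edge = vertex triples whose induced subgraph has >= 1 edge,
      paths2            = paths of length 2 (pairs of edges sharing a vertex).
    """
    adj = {v: set() for v in range(num_vertices)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    triangles = 0
    triples_with_edge = 0
    for u, v, w in combinations(range(num_vertices), 3):
        # Number of edges in the subgraph induced by {u, v, w}.
        k = (v in adj[u]) + (w in adj[u]) + (w in adj[v])
        if k >= 1:
            triples_with_edge += 1
        if k == 3:
            triangles += 1

    # Each vertex of degree d is the middle of d*(d-1)/2 length-2 paths.
    paths2 = sum(len(n) * (len(n) - 1) // 2 for n in adj.values())
    return triangles, triples_with_edge, paths2


# Hypothetical example: a triangle 0-1-2 with a pendant edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
t, triples, p2 = triple_statistics(4, edges)
ratio_first = t / triples    # ratio relevant to the first algorithm
ratio_second = t / p2        # ratio relevant to the second algorithm
transitivity = 3 * t / p2    # transitivity coefficient
print(t, triples, p2, ratio_first, ratio_second, transitivity)
```

On this example the graph has one triangle, four triples with at least one edge, and five length-2 paths. The smaller either ratio is, the more samples a sampling-based estimator would be expected to need, which matches the inverse dependence stated in the abstract.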

[1] Noga Alon, et al. The Space Complexity of Approximating the Frequency Moments, 1999.

[2] Jeffrey Scott Vitter, et al. Random sampling with a reservoir, 1985, TOMS.

[3] S. Muthukrishnan, et al. Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries, 2001, VLDB.

[4] S. Shen-Orr, et al. Network motifs: simple building blocks of complex networks, 2002, Science.

[5] Dorothea Wagner, et al. Finding, Counting and Listing All Triangles in Large Graphs, an Experimental Study, 2005, WEA.

[6] Frank Harary, et al. Matrix measures for transitivity and balance, 1979.

[7] Dennis Shasha, et al. Algorithmics and applications of tree and graph searching, 2002, PODS.

[8] Ravi Kumar, et al. Trawling the Web for Emerging Cyber-Communities, 1999, Comput. Networks.

[9] Eli Upfal, et al. Stochastic models for the Web graph, 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[10] Christian Sohler, et al. Counting Graph Minors in Data Streams, 2006.

[11] Prabhakar Raghavan, et al. Computing on data streams, 1999, External Memory Algorithms.

[12] Duncan J. Watts, et al. Collective dynamics of ‘small-world’ networks, 1998, Nature.

[13] Don Coppersmith, et al. Matrix multiplication via arithmetic progressions, 1987, STOC.

[14] Mohammad Ghodsi, et al. New Streaming Algorithms for Counting Triangles in Graphs, 2005, COCOON.

[15] Philip S. Yu, et al. Graph indexing: a frequent structure-based approach, 2004, SIGMOD '04.

[16] Anna Pagh, et al. Uniform hashing in constant time and linear space, 2003, STOC '03.

[17] Ziv Bar-Yossef, et al. Reductions in streaming algorithms, with an application to counting triangles in graphs, 2002, SODA '02.

[18] A. Bonato. Random Graph Models for the Web Graph, 2007.

[19] Sudipto Guha, et al. Data-streams and histograms, 2001, STOC '01.