Massive streaming data analytics: A case study with clustering coefficients

We present a new approach for parallel massive graph analysis of streaming, temporal data with a dynamic and extensible representation. Handling the constant stream of new data from health care, security, business, and social network applications requires new algorithms and data structures. We examine data structure and algorithm trade-offs that extract the parallelism necessary for high-performance updating analysis of massive graphs. Static analysis kernels often rely on storing input data in a specific structure. Maintaining these structures for each possible kernel with high data rates incurs a significant performance cost. A case study computing clustering coefficients on a general-purpose data structure demonstrates that incremental updates can be more efficient than global recomputation. Within this kernel, we compare three methods for dynamically updating local clustering coefficients: a brute-force local recalculation, a sorting algorithm, and our new approximation method using a Bloom filter. On 32 processors of a Cray XMT with a synthetic scale-free graph of 2^24 ≈ 16 million vertices and 2^29 ≈ 537 million edges, the brute-force method processes a mean of over 50 000 updates per second and our Bloom filter approaches 200 000 updates per second.
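To make the incremental-update idea concrete, the following is a minimal Python sketch of the exact, brute-force variant; the names (adj, tri, insert_edge, local_cc) and the simple in-memory data structures are illustrative assumptions, not the paper's STINGER representation or Cray XMT implementation. When an undirected edge (u, v) arrives, every common neighbor of u and v closes one new triangle, and the per-vertex triangle counts feed the local clustering coefficient c_v = 2·T(v) / (d_v·(d_v − 1)).

# Illustrative sketch only; names and data structures are assumptions,
# not the paper's STINGER or Cray XMT code.
from collections import defaultdict

adj = defaultdict(set)   # adjacency sets: adj[v] = neighbors of v
tri = defaultdict(int)   # tri[v] = number of triangles containing v

def insert_edge(u, v):
    """Exact (brute-force) incremental update for a new undirected edge."""
    if u == v or v in adj[u]:
        return                        # ignore self-loops and duplicate edges
    common = adj[u] & adj[v]          # each common neighbor closes one triangle
    for w in common:
        tri[w] += 1
    tri[u] += len(common)
    tri[v] += len(common)
    adj[u].add(v)
    adj[v].add(u)

def local_cc(v):
    """Local clustering coefficient c_v = 2*T(v) / (d_v * (d_v - 1))."""
    d = len(adj[v])
    return 0.0 if d < 2 else 2.0 * tri[v] / (d * (d - 1))

# Example: a triangle plus one pendant edge
for e in [(1, 2), (2, 3), (1, 3), (3, 4)]:
    insert_edge(*e)
print(local_cc(3))   # vertex 3 has degree 3 and one triangle -> 1/3

The Bloom filter approximation mentioned in the abstract replaces the exact neighbor-set intersection with membership tests against a compact bit array; because Bloom filters produce only false positives, the intersection size, and hence the triangle count, can only be slightly overestimated.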
