Distributed triangle counting in the Graphulo matrix math library

Triangle counting is a key algorithm for large graph analysis. The Graphulo library provides a framework for implementing graph algorithms on the Apache Accumulo distributed database. In this work we adapt two triangle-counting algorithms to the Graphulo library for server-side processing inside Accumulo: one that uses the adjacency matrix and another that also uses the incidence matrix. Cloud-based experiments show a similar performance profile for the two approaches on the family of power-law Graph500 graphs, for which data skew increasingly bottlenecks performance. These results motivate the design of skew-aware hybrid algorithms, which we propose as future work.
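As a rough illustration of the two matrix formulations named above, the following single-machine sketch counts triangles in a small undirected graph, first from the adjacency matrix A alone and then from A together with the vertex-by-edge incidence matrix E. It assumes a simple graph (no self-loops, no duplicate edges) and uses the textbook identities sum((A·A) ∘ A) / 6 and count((A·E) = 2) / 3. The abstract does not give the exact formulas or the Graphulo calls used in the paper, so every function name here is hypothetical and nothing below reflects the Graphulo or Accumulo API.

```python
from itertools import combinations


def adjacency_matrix(n, edges):
    """Dense symmetric 0/1 adjacency matrix of an undirected simple graph."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = 1
        A[v][u] = 1
    return A


def incidence_matrix(n, edges):
    """Dense vertex-by-edge incidence matrix: E[v][j] = 1 if v is an endpoint of edge j."""
    E = [[0] * len(edges) for _ in range(n)]
    for j, (u, v) in enumerate(edges):
        E[u][j] = 1
        E[v][j] = 1
    return E


def matmul(X, Y):
    """Plain dense matrix product, enough for a small illustration."""
    inner = len(Y)
    cols = len(Y[0]) if Y else 0
    return [
        [sum(row[k] * Y[k][j] for k in range(inner)) for j in range(cols)]
        for row in X
    ]


def triangles_adjacency(A):
    """Adjacency-only count: sum((A @ A) .* A) / 6.

    (A @ A)[i][j] is the number of length-2 paths from i to j; masking by A
    keeps only paths closed by an edge. Each triangle is counted 6 times
    (3 starting vertices times 2 directions).
    """
    A2 = matmul(A, A)
    n = len(A)
    total = sum(A2[i][j] * A[i][j] for i in range(n) for j in range(n))
    return total // 6


def triangles_adjacency_incidence(A, E):
    """Adjacency + incidence count: number of entries of A @ E equal to 2, divided by 3.

    (A @ E)[v][e] counts the endpoints of edge e that are adjacent to v.
    A value of 2 means v closes a triangle with edge e; each triangle is
    found once per edge, so 3 times in total.
    """
    C = matmul(A, E)
    return sum(1 for row in C for x in row if x == 2) // 3


if __name__ == "__main__":
    # Complete graph on 4 vertices: every vertex triple forms a triangle.
    edges = list(combinations(range(4), 2))
    A = adjacency_matrix(4, edges)
    E = incidence_matrix(4, edges)
    print(triangles_adjacency(A), triangles_adjacency_incidence(A, E))
```

Both functions return the same count on any simple graph. They differ in the shape and sparsity of the intermediate product (A·A versus A·E), and the cost of that product is what a distributed implementation has to manage.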
