Sandslash: a two-level framework for efficient graph pattern mining

Graph pattern mining (GPM) is used in diverse application areas including social network analysis, bioinformatics, and chemical engineering. Existing GPM frameworks either provide high-level interfaces that favor productivity at the cost of expressiveness, or provide low-level interfaces that can express a wide variety of GPM algorithms at the cost of increased programming complexity. Moreover, existing systems lack the flexibility to explore combinations of optimizations to achieve performance competitive with hand-optimized applications. We present Sandslash, an in-memory GPM framework that uses a novel programming interface to support productive, expressive, and efficient GPM on large graphs. Sandslash provides a high-level API that needs only a specification of the GPM problem; from that specification it implements fast subgraph enumeration, provides efficient data structures, and applies high-level optimizations automatically. To achieve performance competitive with expert-optimized implementations, Sandslash also provides a low-level API that allows users to express algorithm-specific optimizations. This enables Sandslash to support both high productivity and high efficiency without losing expressiveness. We evaluate Sandslash on shared-memory machines using five GPM applications and a wide range of large real-world graphs. Experimental results demonstrate that applications written using the Sandslash high-level or low-level API outperform the state-of-the-art GPM systems AutoMine, Pangolin, and Peregrine on average by 13.8x, 7.9x, and 5.4x, respectively. We also show that these Sandslash applications outperform expert-optimized GPM implementations by 2.3x on average with less programming effort.
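The two-level idea can be illustrated with a minimal Python sketch. It is not Sandslash's actual interface: the function `count_cliques`, the hook name `to_extend`, and the dictionary-of-sets graph representation are all assumptions made for illustration. The "high-level" call supplies only the problem specification (the clique size `k`), while the optional "low-level" hook lets the user inject an algorithm-specific pruning rule into the enumeration.

```python
from typing import Callable, Dict, List, Optional, Set

Graph = Dict[int, Set[int]]  # hypothetical representation: vertex -> neighbor set


def count_cliques(
    graph: Graph,
    k: int,
    to_extend: Optional[Callable[[List[int], int], bool]] = None,
) -> int:
    """Count k-cliques by vertex extension.

    High-level use: pass only `graph` and `k`.
    Low-level use: also pass `to_extend(partial_clique, v)` (a hypothetical
    hook) to skip candidates that cannot lead to a match.
    """

    def extend(clique: List[int], candidates: Set[int]) -> int:
        if len(clique) == k:
            return 1
        total = 0
        for v in sorted(candidates):
            if to_extend is not None and not to_extend(clique, v):
                continue  # user-supplied pruning
            # Keep only common neighbors with larger ids, so each clique
            # is enumerated exactly once (a simple symmetry-breaking rule).
            narrowed = {u for u in candidates & graph[v] if u > v}
            total += extend(clique + [v], narrowed)
        return total

    return sum(extend([v], {u for u in graph[v] if u > v}) for v in graph)


# A 4-clique on vertices 0..3 plus a pendant vertex 4 attached to vertex 3.
demo: Graph = {
    0: {1, 2, 3},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2, 4},
    4: {3},
}

# High-level: only the problem specification.
triangles = count_cliques(demo, 3)

# Low-level: prune vertices whose degree is too small to be in a triangle.
triangles_pruned = count_cliques(demo, 3, lambda clique, v: len(demo[v]) >= 2)
```

Both calls return the same count; the hook only changes how much of the search space is visited, which is the kind of algorithm-specific optimization the low-level API is described as exposing.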
