CECI: Compact Embedding Cluster Index for Scalable Subgraph Matching

Subgraph matching finds all distinct isomorphic embeddings of a query graph in a data graph. For large graphs, current solutions face scalability challenges due to expensive joins, excessive false candidates, and workload imbalance. In this paper, we propose a novel framework for subgraph listing based on the Compact Embedding Cluster Index (CECI), which divides the data graph into multiple embedding clusters for parallel processing. CECI has three unique techniques: using BFS-based filtering and reverse-BFS-based refinement to prune unpromising candidates early on, replacing edge verification with set intersection to speed up candidate verification, and using search-cardinality-based cost estimation to detect and divide large embedding clusters in advance. Experiments on several real and synthetic datasets show that CECI outperforms state-of-the-art solutions on average by 20.4× for listing all embeddings and by 2.6× for enumerating the first 1,024 embeddings.
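To make the second technique concrete, the following is a minimal Python sketch of the general idea of replacing per-edge verification with set intersection. It is an illustration only, not the paper's index or implementation: the adjacency-set representation and the function names `candidates_by_edge_verification` and `candidates_by_intersection` are assumptions introduced here.

```python
def candidates_by_edge_verification(adj, matched_neighbors, pool):
    """Baseline: keep a candidate only if an edge to every already-matched
    neighbor exists, checking the edges one candidate at a time."""
    return {v for v in pool if all(v in adj[w] for w in matched_neighbors)}


def candidates_by_intersection(adj, matched_neighbors, pool):
    """Alternative: intersect the pool with the neighbor set of each
    already-matched data vertex, so no per-edge check is needed afterwards."""
    result = set(pool)
    for w in matched_neighbors:
        result &= adj[w]  # keep only vertices adjacent to w
    return result


# Hypothetical data graph as adjacency sets (undirected, no self-loops).
adj = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1, 3},
    3: {0, 2},
}

# Triangle query: two query vertices are already matched to data vertices
# 0 and 2; the third must be adjacent to both.
pool = {0, 1, 2, 3}
by_edges = candidates_by_edge_verification(adj, [0, 2], pool)
by_sets = candidates_by_intersection(adj, [0, 2], pool)
print(sorted(by_sets))  # [1, 3]
```

Both functions return the same candidates; the difference is that the intersection form works on whole neighbor sets at once, which is the kind of operation the abstract credits with speeding up candidate verification.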
