Speeding up Subgraph Isomorphism Search in Large Graphs

Graphs are a widely used model for representing complex data in many domains. Subgraph isomorphism search is a fundamental operation in many graph databases and data mining applications that handle graph data. This thesis studies this classic problem through a set of novel techniques that address three different aspects.

First, this thesis considers speeding up subgraph isomorphism search by exploiting relationships among data vertices. Most subgraph isomorphism algorithms in the In-Memory (IM) model are based on backtracking, which computes solutions by incrementally enumerating all candidate combinations. We observe that existing algorithms verify each individual mapping separately, which often leads to extensive duplicate computation. We propose two novel concepts, Syntactic Equivalence and Query Dependent Equivalence, which we use to group specific candidate data vertices into a hypervertex. The data vertices belonging to the same hypervertex can be mapped to the same query vertex. Thus, whether the vertices in a hypervertex contribute to a solution can be determined for all of them at once instead of for each one separately. Our extensive experimental study on real datasets shows that our approach significantly speeds up existing subgraph isomorphism algorithms.

Second, this thesis considers multi-query optimization, where multiple queries are processed together to reduce the overall processing time. We propose a novel method for efficiently detecting useful common subgraphs and a data structure to organize them. Based on this data structure, we propose a heuristic algorithm that computes a query execution order in which cached intermediate results can be used effectively. To balance memory usage against the time needed to retrieve cached results, we present a novel structure for caching the intermediate results. We also provide strategies for revising existing single-query subgraph isomorphism algorithms so that they seamlessly use the cached results, which leads to significant performance improvement. Experiments on real datasets demonstrate the effectiveness and efficiency of our multi-query optimization approach.

Third, this thesis considers subgraph isomorphism search in distributed environments. We observe that current state-of-the-art distributed solutions rely on either expensive joins or cumbersome indices, which makes them hard to use in practice. Moreover, most of them follow the synchronous model, whose performance is often bottlenecked by the slowest machine in the cluster. Motivated by this, we take a substantially different approach and propose PADS, a Practical Asynchronous Distributed Subgraph enumeration system. We conducted extensive experiments to evaluate the performance of PADS. Compared with existing join-oriented solutions, our system is significantly more efficient at query processing and is also more practical to deploy. Even compared with heavily indexed solutions, our approach performs better in many cases.
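
The hypervertex grouping described in the first part can be illustrated with a small sketch. It rests on an assumption that the abstract does not state: two data vertices are treated as equivalent when they carry the same label and have exactly the same neighbor set. The thesis's formal definitions of Syntactic Equivalence and Query Dependent Equivalence may differ, and the function name `group_hypervertices` and the toy graph are hypothetical.

```python
from collections import defaultdict


def group_hypervertices(labels, adjacency):
    """Group data vertices that share a label and an identical neighbor set.

    Assumption (not from the thesis): this simplified criterion stands in
    for Syntactic Equivalence. Vertices in one group are interchangeable
    in any mapping, so a backtracking search can test one representative
    per group instead of every member.
    """
    groups = defaultdict(list)
    for v in sorted(labels):
        # The key captures everything the simplified criterion compares.
        key = (labels[v], frozenset(adjacency.get(v, ())))
        groups[key].append(v)
    return sorted(groups.values())


# Toy data graph: vertices 2 and 3 are both labeled "B" and both
# adjacent only to vertex 1, so they fall into the same hypervertex.
labels = {1: "A", 2: "B", 3: "B", 4: "B", 5: "C"}
adjacency = {1: {2, 3, 4}, 2: {1}, 3: {1}, 4: {1, 5}, 5: {4}}

hypervertices = group_hypervertices(labels, adjacency)
print(hypervertices)
```

In this toy graph, vertex 4 shares the label "B" but has an extra neighbor, so it stays in its own group. A search that matches a query vertex against the group containing 2 and 3 would only need to verify one of them.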
