Optimal Enumeration: Efficient Top-k Tree Matching

Driven by many real applications, graph pattern matching has attracted a great deal of attention recently. A twig-pattern query, however, may produce an extremely large number of matches in a graph; this not only overwhelms users with too many results but also incurs high computational cost. In this paper, we study the problem of top-k tree pattern matching: given a rooted tree T, compute its top-k matches in a directed graph G under twig-pattern matching semantics. First, we present a novel and optimal enumeration paradigm based on the principle of Lawler's procedure. We show that our enumeration algorithm runs in O(n_T + log k) time in each round, where n_T is the number of nodes in T. Since outputting a match of T already takes O(n_T) time and n_T ≥ log k in practice, our enumeration technique is optimal. Moreover, the cost of generating the top-1 match of T in our algorithm is O(m_R), where m_R is the number of edges in the transitive closure of the data graph G that involve nodes relevant to T. O(m_R) is also optimal in the worst case without prior knowledge of G. Consequently, our algorithm is optimal, with running time O(m_R + k(n_T + log k)), in contrast to the time complexity O(m_R log k + k n_T (log k + d_T)) of the existing technique, where d_T is the maximal node degree in T. Second, we propose a novel priority-based access technique that greatly reduces the number of edges accessed and yields a significant performance improvement. Finally, we apply our techniques to the general form of the top-k graph pattern matching problem (i.e., the query is a graph) to improve the existing techniques. Comprehensive empirical studies demonstrate that our techniques can improve on the existing techniques by orders of magnitude.
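
The enumeration paradigm above builds on Lawler's procedure, which can be illustrated on a much simpler problem than tree matching. The sketch below is not the algorithm of this paper: it is a generic instance of Lawler's idea, under the assumption that each of n independent slots has an ascending list of candidate costs and a solution picks one candidate per slot. The function name `top_k_assignments` and the input layout are hypothetical. In each round the best remaining solution is popped from a priority queue, and the rest of its subspace is partitioned into disjoint subspaces whose best solutions are pushed back.

```python
import heapq
from typing import List, Sequence, Tuple


def top_k_assignments(
    costs: Sequence[Sequence[float]], k: int
) -> List[Tuple[float, Tuple[int, ...]]]:
    """Return up to k cheapest ways to pick one candidate per slot.

    costs[i] must be sorted in ascending order (an assumption of this sketch).
    Each result is (total cost, tuple of chosen candidate indices).
    """
    n = len(costs)
    if k <= 0 or n == 0 or any(len(c) == 0 for c in costs):
        return []

    # The best solution overall takes the cheapest candidate in every slot.
    best = (0,) * n
    # Heap entries: (total cost, choice, first slot allowed to change).
    heap = [(sum(c[0] for c in costs), best, 0)]
    results: List[Tuple[float, Tuple[int, ...]]] = []

    while heap and len(results) < k:
        total, choice, start = heapq.heappop(heap)
        results.append((total, choice))

        # Lawler-style partition of the popped subspace minus its best
        # solution: for each slot p >= start, keep slots before p fixed and
        # force slot p to a strictly later candidate. Slots after p are
        # still at index 0, so the child below is the best of its subspace.
        for p in range(start, n):
            nxt = choice[p] + 1
            if nxt < len(costs[p]):
                child = choice[:p] + (nxt,) + choice[p + 1:]
                child_total = total - costs[p][choice[p]] + costs[p][nxt]
                heapq.heappush(heap, (child_total, child, p))

    return results
```

For example, `top_k_assignments([[1, 4], [2, 3], [0, 5]], 4)` enumerates the four cheapest of the eight possible assignments in non-decreasing order of total cost. Restricting each child to branch only at slots `p >= start` keeps the subspaces disjoint, so no solution is generated twice.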
