A Scalable Index for Top-k Subtree Similarity Queries

Given a query tree Q, the top-k subtree similarity query retrieves the k subtrees in a large document tree T that are closest to Q in terms of tree edit distance. The classical solution scans the entire document, which is slow. The state-of-the-art approach precomputes an index to reduce the query time. However, the index is large (quadratic in the document size), building the index is expensive, updates are not supported, and data-specific tuning is required. We present a scalable solution for the top-k subtree similarity problem that neither assumes specific data types nor requires any tuning. The key idea is to process promising subtrees first. A subtree is promising if it shares many labels with the query. We develop a new technique based on inverted lists that efficiently retrieves subtrees in the required order and supports incremental updates of the document. To achieve linear space, we do not materialize the lists in full but build only the relevant parts of a list on the fly. In an extensive empirical evaluation on synthetic and real-world data, our technique consistently outperforms the state-of-the-art index with respect to memory usage, indexing time, and the number of candidates that must be verified. In terms of query time, we clearly outperform the state of the art and achieve runtime improvements of up to four orders of magnitude.
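The notion of a promising subtree can be illustrated with a minimal sketch. It is not the index described in the abstract: it scans every subtree of the document, whereas the proposed technique retrieves subtrees in the required order without such a scan. The tree encoding as `(label, children)` pairs and the function names `postorder`, `build_inverted_lists`, `label_overlap`, and `rank_candidates` are assumptions made for illustration. Under unit edit costs, `max(|Q|, |T_i|) - overlap` is a lower bound on the tree edit distance between Q and a subtree T_i, so subtrees that share many labels with the query sort to the front.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def postorder(tree):
    """Flatten a (label, children) tree into [(label, subtree_size)] in postorder."""
    out = []

    def visit(node):
        label, children = node
        size = 1 + sum(visit(child) for child in children)
        out.append((label, size))
        return size

    visit(tree)
    return out


def build_inverted_lists(nodes):
    """Map each label to the sorted list of postorder positions where it occurs."""
    index = {}
    for pos, (label, _) in enumerate(nodes):
        index.setdefault(label, []).append(pos)
    return index


def label_overlap(index, query_bag, lo, hi):
    """Size of the label multiset intersection between the query and the
    subtree that covers postorder positions lo..hi (inclusive)."""
    total = 0
    for label, q_count in query_bag.items():
        positions = index.get(label, [])
        in_subtree = bisect_right(positions, hi) - bisect_left(positions, lo)
        total += min(q_count, in_subtree)
    return total


def rank_candidates(doc, query):
    """Return (lower_bound, root_position) pairs, most promising first.

    Illustrative only: this visits all subtrees, unlike the indexed approach.
    """
    nodes = postorder(doc)
    query_bag = Counter(label for label, _ in postorder(query))
    query_size = sum(query_bag.values())
    index = build_inverted_lists(nodes)

    ranked = []
    for root, (_, size) in enumerate(nodes):
        # In postorder, a subtree occupies the positions root-size+1 .. root.
        overlap = label_overlap(index, query_bag, root - size + 1, root)
        ranked.append((max(query_size, size) - overlap, root))
    ranked.sort()
    return ranked


doc = ("a", [("b", [("c", []), ("d", [])]), ("b", [("c", []), ("x", [])])])
query = ("b", [("c", []), ("d", [])])
print(rank_candidates(doc, query)[0])  # prints (0, 2)
```

A top-k procedure built on this ordering would verify candidates with an exact tree edit distance computation and stop once the lower bound of the next candidate is no smaller than the k-th best distance found so far.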
