A general algorithm for subtree similarity-search

Determining similarity between trees is an important problem in a variety of areas. The subtree similarity-search problem is the following: given a query tree Q and a large set of trees Γ = {T1, ..., Tn}, find the subtrees of trees in Γ that are most similar to Q, where similarity is defined by some tree distance function. Although subtree similarity-search has been studied before, earlier solutions mostly focused on specific tree distance functions and were usually applicable only to ordered trees. This paper presents an efficient new algorithm that solves the subtree similarity-search problem and is compatible with a wide family of tree distance functions, for both ordered and unordered trees. Extensive experimentation confirms the efficiency and scalability of the algorithm, which displays consistently good runtime even for large queries and datasets.
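To make the problem statement concrete, the following is a minimal brute-force sketch of subtree similarity-search. It is not the algorithm presented in the paper: it simply scores every subtree of every tree in Γ against Q and keeps the k closest. The names `Node`, `subtrees`, `bag_distance`, and `top_k_subtrees` are hypothetical, and the distance used here (the size of the symmetric difference between label multisets) is only an assumed stand-in for whichever tree distance function is plugged in.

```python
from collections import Counter
import heapq


class Node:
    """A labeled tree node; child order is ignored by the distance below."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def subtrees(root):
    """Yield every node of the tree; each node is the root of one subtree."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(node.children)


def bag_distance(a, b):
    """Assumed stand-in distance: size of the symmetric difference
    between the label multisets of the two trees."""
    ca = Counter(n.label for n in subtrees(a))
    cb = Counter(n.label for n in subtrees(b))
    return sum(((ca - cb) + (cb - ca)).values())


def top_k_subtrees(query, forest, k, dist=bag_distance):
    """Brute-force baseline: score every subtree of every tree in the forest
    and return the k closest as (distance, tree_index, subtree_root)."""
    candidates = (
        (dist(query, sub), index, sub)
        for index, tree in enumerate(forest)
        for sub in subtrees(tree)
    )
    # Compare on distance only, so Node objects are never compared.
    return heapq.nsmallest(k, candidates, key=lambda c: c[0])


# Usage: Q = a(b, c); the forest holds r(a(b, c), d) and a(b).
query = Node("a", [Node("b"), Node("c")])
forest = [
    Node("r", [Node("a", [Node("b"), Node("c")]), Node("d")]),
    Node("a", [Node("b")]),
]
for distance, tree_index, sub in top_k_subtrees(query, forest, k=2):
    print(distance, tree_index, sub.label)
```

The `dist` parameter is left pluggable to mirror the abstract's point that the search should work with a family of distance functions. This baseline evaluates the distance once per subtree in Γ, which is the cost a dedicated algorithm aims to avoid.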
