Extend tree edit distance for effective object identification

Similarity join on XML documents, which are usually modeled as rooted ordered labeled trees, is widely applied because references to real-world objects are often ambiguous. The conventional approach to this problem is based on tree edit distance, which lacks flexibility and efficiency. In this paper, we propose two novel edit operations together with an extended tree edit distance, which achieves good performance in similarity matching on hierarchical data structures (the run time is $O(n^{3})$ in the worst case). We then propose the $k$-generation set distance as a good approximation of the tree edit distance, which further improves join efficiency with quadratic time complexity. Experiments on real and synthetic databases demonstrate the benefit of our method in efficiency and scalability.
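
For orientation, the conventional baseline that the abstract refers to can be sketched as follows. This is only a minimal illustration of the classic tree edit distance on rooted ordered labeled trees with unit costs (insert, delete, relabel), written as a memoized forest recursion. It is not the extended distance or the $k$-generation set distance proposed in the paper, and it is not an optimized $O(n^{3})$ algorithm. The names `tree`, `forest_distance`, and `tree_edit_distance` are hypothetical helpers introduced here for illustration.

```python
from functools import lru_cache


def tree(label, *children):
    """Build an ordered labeled tree as a hashable (label, children) tuple."""
    return (label, tuple(children))


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not f:
        # Insert the rightmost root of g; its children take its place.
        _, g_kids = g[-1]
        return forest_distance(f, g[:-1] + g_kids) + 1
    if not g:
        # Delete the rightmost root of f; its children take its place.
        _, f_kids = f[-1]
        return forest_distance(f[:-1] + f_kids, g) + 1

    (f_label, f_kids), (g_label, g_kids) = f[-1], g[-1]
    # Case 1: delete the rightmost root of f.
    delete = forest_distance(f[:-1] + f_kids, g) + 1
    # Case 2: insert the rightmost root of g.
    insert = forest_distance(f, g[:-1] + g_kids) + 1
    # Case 3: match the two rightmost roots (relabel costs 1 if labels differ),
    # then solve their child forests and the remaining forests independently.
    match = (
        forest_distance(f_kids, g_kids)
        + forest_distance(f[:-1], g[:-1])
        + (0 if f_label == g_label else 1)
    )
    return min(delete, insert, match)


def tree_edit_distance(t1, t2):
    """Edit distance between two trees, each treated as a one-tree forest."""
    return forest_distance((t1,), (t2,))


# Two small documents that differ in one leaf label.
doc_a = tree("a", tree("b"), tree("c"))
doc_b = tree("a", tree("b"), tree("d"))
distance = tree_edit_distance(doc_a, doc_b)
```

In a similarity join, a pair of documents would be reported as a match when such a distance falls below a chosen threshold. The memoized recursion keeps the sketch short, but a practical implementation would use a dedicated dynamic program over tree decompositions.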