Efficient Identification of Similar XML Fragments Based on Tree Edit Distance

Similarity detection between large sets of XML fragments is widely used in applications such as data integration and XML de-duplication. Many methods have been proposed for finding similar XML fragments; among them, the state-of-the-art pq-gram method offers relatively high join quality and efficiency. In this chapter, we propose pq-hash as an improvement on pq-grams. As the basis of pq-hash, we develop a randomized data structure, the pq-array, which represents large trees as small fixed-size arrays. To perform similarity joins on XML fragment sets efficiently, we also propose a cluster-based partition strategy and a sort-merge and hash join strategy that avoid nested-loop joins. Both our theoretical analysis and our experimental results confirm that pq-hash retains high join quality while being much more efficient than pq-grams, and that our strategies for approximate joins are effective.

DOI: 10.4018/978-1-61350-356-0.ch004
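
To make the idea of representing a tree as a small fixed-size array concrete, the following Python sketch extracts the pq-grams of an ordered labeled tree and compresses them into a fixed-length min-hash signature. This is an illustration under stated assumptions, not the chapter's pq-array construction, which the abstract does not specify: the tree encoding `(label, [children])`, the function names, the `*` null label, the use of sets rather than bags, and the choice of salted BLAKE2 hashes are all hypothetical.

```python
import hashlib


def pq_grams(tree, p=2, q=3):
    """List the pq-grams of an ordered labeled tree given as (label, [children]).

    Each pq-gram is a tuple of p stem labels (a node and its ancestors)
    followed by q base labels (consecutive children), padded with '*'.
    """
    grams = []

    def visit(node, stem):
        label, children = node
        stem = stem[1:] + (label,)        # shift the current node into the stem
        base = ("*",) * q
        if not children:
            grams.append(stem + base)     # a leaf contributes one all-null base
            return
        for child in children:
            base = base[1:] + (child[0],) # slide the window over the children
            grams.append(stem + base)
            visit(child, stem)
        for _ in range(q - 1):            # trailing null children
            base = base[1:] + ("*",)
            grams.append(stem + base)

    visit(tree, ("*",) * p)
    return grams


def signature(grams, k=64):
    """Assumed min-hash signature: k minima over salted hashes of the pq-gram set."""
    encoded = {"\x1f".join(g).encode("utf-8") for g in grams}
    sig = []
    for i in range(k):
        salt = i.to_bytes(8, "little")    # one salt per simulated hash function
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(e, digest_size=8, salt=salt).digest(), "big")
            for e in encoded))
    return sig


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


t1 = ("a", [("b", []), ("c", [])])
t2 = ("a", [("b", []), ("d", [])])
s1 = signature(pq_grams(t1))
s2 = signature(pq_grams(t2))
print(len(s1), estimate_similarity(s1, s1), estimate_similarity(s1, s2))
```

Whatever the size of the input tree, the signature has exactly `k` entries, so two trees can be compared in time proportional to `k` instead of the size of their pq-gram profiles. The estimate between `t1` and `t2` depends on the hash values and is only an approximation of the overlap between their pq-gram sets.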
