Similarity Join on XML Based on k-Generation Set Distance

Similarity join is widely used because data items representing the same real-world object may differ owing to varying conventions. Traditional similarity join methods, however, are inefficient, so a method that offers both high efficiency and high join quality is needed. In this paper, we propose two new edit operations (reversing and mapping), together with similarity join algorithms based on a newly defined measure. Our method replaces the computation of tree edit distance with the computation of the k-generation set distance between trees, which greatly simplifies the join process. The time complexity of our method is O(n^2), where n is the tree size. We show that our method has advantages over existing methods and that it scales to large data sets.
