Embedding Similarity Joins into Native XML Databases

Similarity joins in databases can be used for several important tasks such as data cleaning and instance-based data integration. In this paper, we explore ways to support such tasks in a native XML database environment. The main goals of our work are: a) to show the feasibility of performing tree similarity joins in a general-purpose XML database management system; b) to support string- and tree-based similarity techniques in a unified framework; c) to avoid relying on special data preparation or data structures to support similarity evaluation, such as partitioning or tailor-made index structures; d) to achieve a seamless integration of similarity operators into the existing database architecture.
