A cluster-based approach to XML similarity joins

A natural consequence of the widespread adoption of XML as a standard for information representation and exchange is the redundant storage of large amounts of persistent XML documents. Compared to relational data tables, data represented in XML can be even more sensitive to data quality issues, because structure, in addition to textual information, may vary across XML documents that represent the same information entity. Correlating XML documents that are similar in content and structure is therefore a fundamental operation. In this paper, we present an effective, flexible, and high-performance XML-based similarity join framework. We exploit structural summaries and clustering concepts to produce compact and high-quality XML document representations; our approach outperforms previous work in terms of both performance and accuracy. In this context, we explore different ways to weigh and combine evidence from textual and structural XML representations. Furthermore, we address user interaction, which arises when the similarity framework is configured for a specific domain, and the updatability of clustering information, which is needed when new documents enter the datasets under consideration. We present a thorough experimental evaluation to validate our techniques in the context of a native XML DBMS.
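To make the idea of weighing and combining textual and structural evidence concrete, the following is a minimal sketch and not the framework described in the paper. It assumes, purely for illustration, that structure is represented as the set of root-to-node tag paths, that text is represented as a set of lowercased tokens, that each is compared with Jaccard similarity, and that the two scores are combined linearly. The names `profile`, `combined_similarity`, `similarity_join`, and the parameter `structure_weight` are hypothetical, and the nested-loop join stands in for the structural summaries and clustering that the paper uses.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def profile(xml_text):
    """Return (paths, tokens) for one document.

    Assumption: structure = set of root-to-node tag paths,
    text = set of lowercased whitespace-separated tokens.
    """
    root = ET.fromstring(xml_text)
    paths, tokens = set(), set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        if node.text:
            tokens.update(node.text.lower().split())
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths, tokens


def combined_similarity(doc_a, doc_b, structure_weight=0.5):
    """Linear combination of structural and textual Jaccard scores."""
    paths_a, tokens_a = profile(doc_a)
    paths_b, tokens_b = profile(doc_b)
    structural = jaccard(paths_a, paths_b)
    textual = jaccard(tokens_a, tokens_b)
    return structure_weight * structural + (1 - structure_weight) * textual


def similarity_join(docs, threshold=0.7, structure_weight=0.5):
    """Naive all-pairs join: index pairs whose score reaches the threshold."""
    result = []
    for (i, a), (j, b) in combinations(enumerate(docs), 2):
        score = combined_similarity(a, b, structure_weight)
        if score >= threshold:
            result.append((i, j, score))
    return result
```

In this sketch, raising `structure_weight` makes the join stricter about differing tag names, while lowering it lets documents with the same text but different markup match more easily. Tuning such a weight for a specific domain is one place where user interaction could enter.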
