XML Similarity Detection and Measures

XML has become a standard for data representation and exchange over the Internet. Because XML is so widely used, XML similarity detection plays an important role in applications such as data integration, document classification and clustering, XML querying, and change management. This paper discusses syntactic and semantic similarity measures for XML documents and surveys existing research on XML similarity detection. XML similarity measures can be broadly classified into two main categories: (1) structural similarity and (2) structural and content similarity. We review the similarity detection approaches proposed in the literature and discuss some of the challenges and future directions for research on XML similarity detection and measurement.
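To make the two categories concrete, the following minimal sketch contrasts a structure-only comparison with a structure-and-content comparison. It is an illustration only, not a measure taken from the surveyed literature: it assumes that a document can be summarized as a set of root-to-node tag paths and that Jaccard overlap is an acceptable stand-in for a similarity score. The function names (`tag_paths`, `path_values`, `jaccard`) are hypothetical.

```python
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Structure only: the set of root-to-node tag paths (assumed summary)."""
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths


def path_values(xml_text):
    """Structure and content: the set of (tag path, text value) pairs."""
    pairs = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        # node.text is None for elements without text; treat that as "".
        pairs.add((path, (node.text or "").strip()))
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return pairs


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_a = "<book><title>XML</title><year>2009</year></book>"
doc_b = "<book><title>SQL</title><year>2009</year></book>"

# Same tag paths, so the structure-only score is maximal.
structural = jaccard(tag_paths(doc_a), tag_paths(doc_b))
# The title values differ, so the combined score drops.
structural_and_content = jaccard(path_values(doc_a), path_values(doc_b))
```

In this toy case the two documents share every tag path but differ in one text value, so the structure-only score cannot tell them apart while the combined score can. Measures discussed in the literature are typically more elaborate (for example, tree edit distance for structure or taxonomy-based semantic measures for content), and this sketch does not attempt to reproduce them.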
