A novel XML document structure comparison framework based on sub-tree commonalities and label semantics

XML similarity evaluation has become a central issue in the database and information communities, with applications ranging from document clustering and version control to data integration and ranked retrieval. Various algorithms for comparing hierarchically structured data, XML documents in particular, have been proposed in the literature. Most of them rely on techniques for finding the edit distance between tree structures, XML documents being commonly modeled as Ordered Labeled Trees. Yet a thorough investigation of current approaches led us to identify several similarity aspects, namely sub-tree related structural and semantic similarities, which are not sufficiently addressed when comparing XML documents. In this paper, we provide an integrated and fine-grained comparison framework that deals with both structural and semantic similarities in XML documents (detecting the occurrences and repetitions of structurally and semantically similar sub-trees), and that allows the end-user to adjust the comparison process according to her requirements. Our framework consists of four main modules for (i) discovering the structural commonalities between sub-trees, (ii) identifying sub-tree semantic resemblances, (iii) computing tree-based edit operation costs, and (iv) computing tree edit distance. Experimental results demonstrate higher comparison accuracy with respect to alternative methods, while timing experiments reflect the impact of semantic similarity on overall system performance.
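To make the idea of a tree edit distance with label-aware costs concrete, here is a minimal Python sketch. It is not the framework described above: it uses a simplified, Selkow-style top-down distance in which whole sub-trees are inserted or deleted at a cost equal to their size, and node relabeling is priced by a pluggable `label_cost` function. The names `Node`, `tree_distance`, `toy_semantic_cost`, and the 0.2 cost assigned to the pair author/writer are assumptions made purely for illustration; a real system would derive label costs from a lexical resource such as WordNet.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """A node of an ordered labeled tree (hypothetical helper type)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def exact_label_cost(a: str, b: str) -> float:
    """Purely syntactic relabeling cost: 0 if labels match, else 1."""
    return 0.0 if a == b else 1.0


def toy_semantic_cost(a: str, b: str) -> float:
    """Illustrative semantic cost; the table below is an invented example."""
    near_synonyms = {frozenset({"author", "writer"}): 0.2}
    if a == b:
        return 0.0
    return near_synonyms.get(frozenset({a, b}), 1.0)


def tree_distance(a: Node, b: Node,
                  label_cost: Callable[[str, str], float] = exact_label_cost) -> float:
    """Selkow-style distance: relabel the roots, then align the child
    sequences with a string-edit dynamic program whose substitution
    cost is the recursive distance between the two child sub-trees."""
    m, n = len(a.children), len(b.children)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):  # delete the first i sub-trees of a
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):  # insert the first j sub-trees of b
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),
                d[i][j - 1] + size(b.children[j - 1]),
                d[i - 1][j - 1] + tree_distance(
                    a.children[i - 1], b.children[j - 1], label_cost),
            )
    return label_cost(a.label, b.label) + d[m][n]


# Two small documents that differ only in one element name.
doc_a = Node("paper", [Node("title"), Node("author")])
doc_b = Node("paper", [Node("title"), Node("writer")])
doc_c = Node("paper", [Node("title")])
```

Under these assumptions, `tree_distance(doc_a, doc_b)` charges a full relabeling for author versus writer, whereas passing `toy_semantic_cost` makes the two documents much closer. Keeping the label cost as a parameter mirrors, in a very reduced form, the idea of letting the user tune how much semantics weighs in the comparison.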
