Structural and semantic aspects of similarity of Document Type Definitions and XML schemas

The natural optimization strategy for XML-to-relational mapping methods is to exploit the similarity of XML data. However, none of the current similarity evaluation approaches is suitable for this purpose: they emphasize the semantic similarity of XML data, whereas XML-to-relational mapping methods rely mainly on analysis of its structure. In this paper we propose an approach that applies a verified strategy for structural similarity evaluation, tree edit distance, to DTD constructs. The approach copes with the fact that DTDs involve several types of nodes and can form general graphs. In addition, it is optimized for the specific features of XML data and, if required, it can exploit the semantics of element and attribute names. Using a set of experiments we show the impact of these extensions on similarity evaluation. Finally, we discuss how the approach can be extended to XSDs, which involve plenty of "syntactic sugar", i.e. constructs that are structurally or semantically equivalent.
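
To make the notion of tree edit distance concrete, the following is a minimal sketch of the classic unit-cost edit distance between two ordered, labeled trees. It is not the algorithm proposed in the paper: it assumes plain trees (no DTD operator nodes, no recursion or shared elements), a cost of 1 for every insert, delete, and relabel, and purely hypothetical element names such as `book` and `title`. The helper names `node`, `forest_distance`, and `tree_edit_distance` are likewise illustrative.

```python
from functools import lru_cache


def node(label, *children):
    """Build an immutable tree: (label, tuple_of_child_trees)."""
    return (label, tuple(children))


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not f:
        # Only insertions remain: insert the rightmost root of g,
        # then continue with its children promoted into the forest.
        _, kids = g[-1]
        return forest_distance(f, g[:-1] + kids) + 1
    if not g:
        # Only deletions remain: delete the rightmost root of f.
        _, kids = f[-1]
        return forest_distance(f[:-1] + kids, g) + 1

    (label_v, kids_v), (label_w, kids_w) = f[-1], g[-1]
    # Delete the rightmost root of f; its children take its place.
    delete = forest_distance(f[:-1] + kids_v, g) + 1
    # Insert the rightmost root of g; its children take its place.
    insert = forest_distance(f, g[:-1] + kids_w) + 1
    # Map the two rightmost roots onto each other (relabel if they differ),
    # then match their subtrees and the remaining forests independently.
    relabel = (
        forest_distance(kids_v, kids_w)
        + forest_distance(f[:-1], g[:-1])
        + (0 if label_v == label_w else 1)
    )
    return min(delete, insert, relabel)


def tree_edit_distance(t1, t2):
    """Edit distance between two single trees."""
    return forest_distance((t1,), (t2,))


# Hypothetical element trees, loosely resembling simplified DTD content models.
a = node("book", node("title"), node("author"))
b = node("book", node("title"), node("author"), node("year"))
c = node("book", node("info", node("title"), node("author")))

d_ab = tree_edit_distance(a, b)  # one inserted leaf
d_ac = tree_edit_distance(a, c)  # one inserted inner node
```

A distance can be turned into a similarity in many ways, for example by normalizing against the sizes of the two trees; the paper's own cost model and its handling of semantic name similarity are not reproduced here.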
