Measuring the structural similarity among XML documents and DTDs

Measuring the structural similarity between an XML document and a DTD has many applications, ranging from document classification and approximate structural queries on XML documents to selective dissemination of XML documents and document protection. The problem is harder than measuring structural similarity among documents because a DTD can be regarded as a generator of documents; the task is therefore to evaluate the similarity between a document and a set of documents. An effective structural similarity measure must meet several requirements: it should account for the presence and absence of required elements, for the structure and level of missing and extra elements, and for vocabulary discrepancies due to synonymous or syntactically similar tags. Starting from these requirements, this paper defines such a measure and presents an algorithm that matches a document against a DTD to obtain their structural similarity. Finally, experimental results assessing the effectiveness of the approach are presented.
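To make the requirements concrete, the following is a minimal sketch of a level-weighted similarity score. It is not the algorithm of the paper; it is a toy illustration under strong assumptions. The DTD is reduced to a hypothetical dictionary that maps each element to a list of required child tags (no `*`, `?`, `+` or `|` operators, no recursion), tags match only when identical (so synonymous or syntactically similar tags are not handled), and the names `Node`, `match`, `similarity` and the `decay` parameter are all introduced here for illustration. Elements are split into common, missing and extra ones, and each contributes `decay ** level`, so discrepancies near the root count more than those deep in the tree.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A document element: a tag plus its child elements."""
    tag: str
    children: list = field(default_factory=list)


def doc_weight(node, level, decay):
    """Weight of a whole document subtree rooted at `node`."""
    return decay ** level + sum(
        doc_weight(c, level + 1, decay) for c in node.children
    )


def dtd_weight(dtd, tag, level, decay):
    """Weight of the subtree the toy DTD requires below `tag`.

    Assumes the toy DTD is non-recursive.
    """
    return decay ** level + sum(
        dtd_weight(dtd, c, level + 1, decay) for c in dtd.get(tag, [])
    )


def match(node, dtd, level=0, decay=0.5):
    """Return (common, missing, extra) weights for `node` against the
    toy DTD declaration of the same tag."""
    common = decay ** level          # the node itself is matched
    missing = extra = 0.0
    required = list(dtd.get(node.tag, []))
    for child in node.children:
        if child.tag in required:
            required.remove(child.tag)
            c, m, e = match(child, dtd, level + 1, decay)
            common, missing, extra = common + c, missing + m, extra + e
        else:
            # Element not allowed by the DTD: its whole subtree is extra.
            extra += doc_weight(child, level + 1, decay)
    for tag in required:
        # Required element absent from the document: its whole required
        # subtree is missing.
        missing += dtd_weight(dtd, tag, level + 1, decay)
    return common, missing, extra


def similarity(root, dtd, root_tag, decay=0.5):
    """Score in [0, 1]; 1.0 means no missing and no extra elements."""
    if root.tag != root_tag:
        return 0.0
    c, m, e = match(root, dtd, 0, decay)
    return c / (c + m + e)


# Hypothetical toy DTD: book requires title and author; author requires name.
toy_dtd = {"book": ["title", "author"], "author": ["name"]}

valid = Node("book", [Node("title"), Node("author", [Node("name")])])
no_author = Node("book", [Node("title")])
with_price = Node("book", valid.children + [Node("price")])

for doc in (valid, no_author, with_price):
    print(round(similarity(doc, toy_dtd, "book"), 3))
```

In this sketch a document that satisfies the toy DTD scores 1.0, while a missing required subtree or an extra element lowers the score in proportion to its level-dependent weight. A fuller treatment would also need content-model operators and a tag-similarity function for the vocabulary discrepancies mentioned above.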
