An efficient similarity-based approach for comparing XML documents

Abstract: XML documents are widely used to interchange information among heterogeneous systems, ranging from office applications to scientific experiments. Independently of the domain, XML documents may evolve, so identifying and understanding the changes they undergo becomes crucial. Several syntactic diff approaches have been proposed to address this problem. Most are designed to compare revisions of XML documents using explicit IDs to match elements. However, elements in different revisions may not share IDs because of tool incompatibility or divergent or missing schemas. In this paper, we present Phoenix, a similarity-based approach for comparing revisions of XML documents that does not rely on explicit IDs. Phoenix uses dynamic programming and optimization algorithms to compare different features of XML elements (e.g., element name, content, attributes, and sub-elements) and calculate the degree of similarity between them. We compared Phoenix with X-Diff and XyDiff, two state-of-the-art XML diff algorithms. XyDiff was the fastest approach but failed to provide precise matching results. X-Diff achieved the best efficacy in 30 of the 56 tested scenarios but was slow. Phoenix ran in a fraction of the time required by X-Diff and achieved the best efficacy in the remaining 26 of the 56 scenarios. In our evaluations, Phoenix was by far the most efficient approach to match elements across revisions of the same XML document.
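To make the idea of ID-free, feature-based matching concrete, the following is a minimal sketch under stated assumptions and not the Phoenix implementation. It scores two elements by name, text content, attributes, and sub-elements. The equal weights in `WEIGHTS`, the longest-common-subsequence text measure, the Jaccard attribute measure, and all function names are hypothetical choices made for illustration. The exhaustive pairing in `best_assignment` is a stand-in for an assignment-problem solver such as the Hungarian method and is only practical for a handful of children.

```python
import xml.etree.ElementTree as ET
from itertools import permutations

# Hypothetical weights for (name, text, attributes, children); not taken from the paper.
WEIGHTS = (0.25, 0.25, 0.25, 0.25)


def lcs_ratio(a: str, b: str) -> float:
    """Longest-common-subsequence length (dynamic programming), scaled to [0, 1]."""
    if not a and not b:
        return 1.0
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch == other else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / max(len(a), len(b))


def jaccard(d1: dict, d2: dict) -> float:
    """Share of (attribute, value) pairs common to both elements."""
    s1, s2 = set(d1.items()), set(d2.items())
    if not s1 and not s2:
        return 1.0
    return len(s1 & s2) / len(s1 | s2)


def best_assignment(xs: list, ys: list) -> float:
    """Best one-to-one pairing of children, found exhaustively (small lists only)."""
    if not xs and not ys:
        return 1.0
    if not xs or not ys:
        return 0.0
    small, large = (xs, ys) if len(xs) <= len(ys) else (ys, xs)
    scores = [[similarity(s, l) for l in large] for s in small]
    best = max(
        sum(scores[i][j] for i, j in enumerate(perm))
        for perm in permutations(range(len(large)), len(small))
    )
    # Unmatched children of the larger side contribute zero.
    return best / len(large)


def similarity(e1: ET.Element, e2: ET.Element) -> float:
    """Weighted similarity of two elements over the four features."""
    name = 1.0 if e1.tag == e2.tag else 0.0
    text = lcs_ratio((e1.text or "").strip(), (e2.text or "").strip())
    attrs = jaccard(e1.attrib, e2.attrib)
    children = best_assignment(list(e1), list(e2))
    w_name, w_text, w_attrs, w_children = WEIGHTS
    return w_name * name + w_text * text + w_attrs * attrs + w_children * children
```

For example, `similarity(ET.fromstring('<p id="1">John</p>'), ET.fromstring('<p id="2">Jon</p>'))` combines a full name match, a partial text match, and no attribute match into a single score, which a diff tool could compare against a threshold to decide whether the two elements are revisions of each other.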
