X-Diff: an effective change detection algorithm for XML documents

XML has become the de facto standard format for Web publishing and data transportation. Because online information changes frequently, the ability to detect changes in XML documents quickly is important to Internet query systems, search engines, and continuous query systems. Previous work on change detection for XML and other hierarchically structured documents used an ordered tree model, in which the left-to-right order among siblings is significant and therefore affects the change result. We argue that an unordered model, in which only ancestor relationships are significant, is more suitable for most database applications. Change detection under the unordered model is substantially harder than under the ordered model, but the change result it generates is more accurate. We propose X-Diff, an effective algorithm that integrates key XML structure characteristics with standard tree-to-tree correction techniques. The algorithm is analyzed and compared with XyDiff [CAM02], a published XML diff algorithm, and an experimental evaluation of both algorithms is provided.
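The difference between the ordered and unordered tree models can be illustrated with a small sketch. This is not the X-Diff algorithm itself, only an illustration of the two models. It assumes Python's standard `xml.etree.ElementTree` parser, and the function names `signature` and `same_document` are hypothetical. Each subtree is hashed from its tag, attributes, text, and child signatures; under the unordered model the child signatures are sorted first, so reordering siblings does not register as a change.

```python
import hashlib
import xml.etree.ElementTree as ET


def signature(node: ET.Element, ordered: bool) -> str:
    """Hash a subtree; sibling order matters only when ordered=True."""
    children = [signature(child, ordered) for child in node]
    if not ordered:
        # Sorting discards left-to-right order, so only
        # ancestor (parent/child) relationships remain significant.
        children.sort()
    text = (node.text or "").strip()
    attrs = sorted(node.attrib.items())
    payload = repr((node.tag, attrs, text, children))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def same_document(xml_a: str, xml_b: str, ordered: bool) -> bool:
    """Report whether two XML strings are equal under the chosen model."""
    root_a = ET.fromstring(xml_a)
    root_b = ET.fromstring(xml_b)
    return signature(root_a, ordered) == signature(root_b, ordered)


old = "<book><title>X</title><author>A</author></book>"
new = "<book><author>A</author><title>X</title></book>"

print(same_document(old, new, ordered=True))   # False: siblings were swapped
print(same_document(old, new, ordered=False))  # True: same parent/child structure
```

Such a signature only answers whether two trees are equal. Producing an actual edit script between unequal trees is the harder problem the abstract refers to. The sketch also ignores text that follows a closing tag (`tail` in ElementTree), which a fuller treatment would have to account for.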

[1] Jennifer Widom, et al. Change detection in hierarchically structured information, 1996, SIGMOD '96.

[2] Fred Douglis, et al. Tracking and Viewing Changes on the Web, 1996, USENIX Annual Technical Conference.

[3] P. A. P. Moran, et al. An introduction to probability theory, 1968.

[4] Daniel S. Hirschberg, et al. Algorithms for the Longest Common Subsequence Problem, 1977, JACM.

[5] Chris Wilson, et al. Document Object Model (DOM) Level 1 Specification (Second Edition), 2000.

[6] C. M. Sperberg-McQueen, et al. Extensible Markup Language (XML), 1997, World Wide Web J.

[7] Kaizhong Zhang. A New Editing based Distance between Unordered Labeled Trees, 1993, CPM.

[8] Kaizhong Zhang, et al. Simple Fast Algorithms for the Editing Distance Between Trees and Related Problems, 1989, SIAM J. Comput.

[9] William Feller, et al. An Introduction To Probability Theory And Its Applications, 1950.

[10] Kuo-Chung Tai, et al. The Tree-to-Tree Correction Problem, 1979, JACM.

[11] Fred Douglis, et al. The AT&T Internet Difference Engine: Tracking and viewing changes on the web, 1998, World Wide Web.

[12] William Feller, et al. An Introduction to Probability Theory and Its Applications, 1967.

[13] Robert E. Tarjan, et al. Data structures and network algorithms, 1983, CBMS-NSF regional conference series in applied mathematics.

[14] Stanley M. Selkow, et al. The Tree-to-Tree Editing Problem, 1977, Inf. Process. Lett.

[15] Steven J. DeRose, et al. XML Path Language (XPath) Version 1.0, 1999.

[16] Alon Y. Halevy, et al. Updating XML, 2001, SIGMOD '01.

[17] Hector Garcia-Molina, et al. Meaningful change detection in structured data, 1997, SIGMOD '97.

[18] Kaizhong Zhang, et al. On the Editing Distance Between Unordered Labeled Trees, 1992, Inf. Process. Lett.

[19] Serge Abiteboul, et al. Detecting changes in XML documents, 2002, Proceedings 18th International Conference on Data Engineering.

[20] Christoph M. Hoffmann, et al. Pattern Matching in Trees, 1982, JACM.