Comparing Hierarchical Data in External Memory

We present an external-memory algorithm for computing a minimum-cost edit script between two rooted, ordered, labeled trees. The I/O, RAM, and CPU costs of our algorithm are, respectively, 4mn + 7m + 5n, 6S, and O(MN + (M + N)S^1.5), where M and N are the input tree sizes, S is the block size, m = M/S, and n = N/S. This algorithm can make effective use of surplus RAM capacity to quadratically reduce I/O cost. We extend to trees the commonly used mapping from sequence comparison problems to shortest-path problems in edit graphs.
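For readers unfamiliar with the edit-graph mapping mentioned above, the following is a minimal in-memory sketch of the sequence case only, not the external-memory tree algorithm of this paper. It assumes unit costs by default, and the function name `edit_graph_distance` and its cost parameters are hypothetical, introduced for illustration. Node (i, j) of the edit graph means that the first i elements of `a` have been transformed into the first j elements of `b`. Horizontal and vertical edges are insertions and deletions, and diagonal edges are matches (cost 0) or updates. The edit distance is the cost of a shortest path from (0, 0) to (len(a), len(b)).

```python
import heapq


def edit_graph_distance(a, b, ins_cost=1, del_cost=1, upd_cost=1):
    """Cost of a shortest path from (0, 0) to (len(a), len(b)) in the edit graph.

    Illustrative sketch for sequences; all names and default costs are assumptions.
    """
    m, n = len(a), len(b)
    dist = {(0, 0): 0}
    heap = [(0, 0, 0)]  # entries are (path cost so far, i, j)
    while heap:
        d, i, j = heapq.heappop(heap)
        if (i, j) == (m, n):
            return d
        if d > dist.get((i, j), float("inf")):
            continue  # stale heap entry; a cheaper path was already found
        edges = []
        if i < m:
            edges.append((i + 1, j, del_cost))  # delete a[i]
        if j < n:
            edges.append((i, j + 1, ins_cost))  # insert b[j]
        if i < m and j < n:
            # diagonal edge: free if the elements match, otherwise an update
            edges.append((i + 1, j + 1, 0 if a[i] == b[j] else upd_cost))
        for ni, nj, w in edges:
            nd = d + w
            if nd < dist.get((ni, nj), float("inf")):
                dist[(ni, nj)] = nd
                heapq.heappush(heap, (nd, ni, nj))
    return dist[(m, n)]
```

Dijkstra's algorithm is used here only to make the shortest-path view explicit. Because the edit graph is acyclic, a row-by-row dynamic program computes the same value.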
