RTED: A Robust Algorithm for the Tree Edit Distance

We consider the classical tree edit distance between ordered labeled trees, which is defined as the cost of the minimum-cost sequence of node edit operations that transforms one tree into another. The state-of-the-art solutions for the tree edit distance are not satisfactory. The main competitors in the field either have optimal worst-case complexity, but the worst case happens frequently, or they are very efficient for some tree shapes, but degenerate for others. This leads to unpredictable and often infeasible runtimes, and there is no obvious way to choose between the algorithms. In this paper we present RTED, a robust tree edit distance algorithm. The asymptotic complexity of RTED is less than or equal to the complexity of the best competitors for any input instance, i.e., RTED is both efficient and worst-case optimal. We introduce the class of LRH (Left-Right-Heavy) algorithms, which includes RTED and the fastest tree edit distance algorithms presented in the literature. We prove that RTED outperforms all previously proposed LRH algorithms in terms of runtime complexity. In our experiments on synthetic and real-world data we empirically evaluate our solution and compare it to the state of the art.
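
To make the definition of the distance concrete, the following is a minimal sketch of the textbook recursive formulation over forests. It is not RTED and makes no attempt to match its complexity. The tree encoding (a `(label, children)` tuple), the unit cost for every operation, and the names `ted`, `dist`, and `size` are assumptions made for illustration only.

```python
from functools import lru_cache


def ted(t1, t2):
    """Unit-cost edit distance between two ordered labeled trees.

    A tree is assumed to be a tuple (label, children), where children is a
    tuple of trees. A forest is a tuple of trees.
    """

    @lru_cache(maxsize=None)
    def size(forest):
        # Number of nodes in a forest.
        return sum(1 + size(children) for _, children in forest)

    @lru_cache(maxsize=None)
    def dist(f, g):
        # An empty forest is turned into the other one by inserting
        # (or deleting) every remaining node.
        if not f:
            return size(g)
        if not g:
            return size(f)
        (label_f, kids_f), (label_g, kids_g) = f[-1], g[-1]
        # Delete the rightmost root of f: its children take its place.
        delete = dist(f[:-1] + kids_f, g) + 1
        # Insert the rightmost root of g.
        insert = dist(f, g[:-1] + kids_g) + 1
        # Map the two rightmost roots onto each other; a rename costs 1
        # only when the labels differ.
        rename = (dist(kids_f, kids_g)
                  + dist(f[:-1], g[:-1])
                  + (label_f != label_g))
        return min(delete, insert, rename)

    return dist((t1,), (t2,))


if __name__ == "__main__":
    a = ("a", (("b", ()), ("c", ())))
    b = ("a", (("b", ()), ("d", ())))
    print(ted(a, b))
```

This sketch always decomposes at the rightmost root. The algorithms discussed in the abstract differ, among other things, in how they choose which subproblems to decompose.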
