Finding similar consensus between trees: an algorithm and a distance hierarchy

Abstract The problem of finding this similar consensus (also known as the largest approximately common substructures) of two trees arises in many pattern recognition applications. This paper presents a dynamic programming algorithm to solve the problem based on the distance measure originated from Tanaka and Tanaka. The algorithm runs as fast as the best-known algorithm for comparing two trees using Tanaka's distance measure when the allowed distance between the common substructures is a constant independent of the input trees. In addition, we establish a hierarchy among Tanaka's distance measure and three other edit-based distance measures published in the literature.

[1]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[2]  Wuu Yang,et al.  Identifying syntactic differences between two programs , 1991, Softw. Pract. Exp..

[3]  King-Sun Fu,et al.  A Tree System Approach for Fingerprint Pattern Recognition , 1976, IEEE Transactions on Computers.

[4]  Branimir Boguraev,et al.  Dictionaries, Dictionary Grammars and Dictionary Entry Parsing , 1989, ACL.

[5]  Kaizhong Zhang,et al.  Identifying Approximately Common Substructures in Trees Based on a Restricted Edit Distance , 1999, Inf. Sci..

[6]  Stanley M. Selkow,et al.  The Tree-to-Tree Editing Problem , 1977, Inf. Process. Lett..

[7]  Heikki Mannila,et al.  Retrieval from hierarchical texts by partial patterns , 1993, SIGIR.

[8]  Lusheng Wang,et al.  Alignment of trees: an alternative to tree edit , 1995 .

[9]  S. B. Needleman,et al.  A general method applicable to the search for similarities in the amino acid sequence of two proteins. , 1970, Journal of molecular biology.

[10]  Marc J. Rochkind,et al.  The source code control system , 1975, IEEE Transactions on Software Engineering.

[11]  Hanan Samet,et al.  Distance Transform for Images Represented by Quadtrees , 1982, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12]  E. N. Adams Consensus Techniques and the Comparison of Taxonomic Trees , 1972 .

[13]  Kaizhong Zhang,et al.  Algorithms for the constrained editing distance between ordered labeled trees and related problems , 1995, Pattern Recognit..

[14]  Eiichi Tanaka,et al.  The Tree-to-Tree Editing Problem , 1988, Int. J. Pattern Recognit. Artif. Intell..

[15]  D. Shasha,et al.  Discovering active motifs in sets of related protein sequences and using them for classification. , 1994, Nucleic acids research.

[16]  D. Robinson,et al.  Comparison of phylogenetic trees , 1981 .

[17]  Kaizhong Zhang,et al.  An Algorithm for Finding the Largest Approximately Common Substructures of Two Trees , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[18]  Dan Gusfield,et al.  Efficient algorithms for inferring evolutionary trees , 1991, Networks.

[19]  Kaizhong Zhang,et al.  A System for Approximate Tree Matching , 1994, IEEE Trans. Knowl. Data Eng..

[20]  J. Ott Analysis of Human Genetic Linkage , 1985 .

[21]  Dana Angluin,et al.  Finding Patterns Common to a Set of Strings , 1980, J. Comput. Syst. Sci..

[22]  Kuo-Chung Tai,et al.  The Tree-to-Tree Correction Problem , 1979, JACM.

[23]  Michael G. Main,et al.  An O(n log n) Algorithm for Finding All Repetitions in a String , 1984, J. Algorithms.

[24]  Esko Ukkonen,et al.  Finding Approximate Patterns in Strings , 1985, J. Algorithms.

[25]  Kaizhong Zhang,et al.  Simple Fast Algorithms for the Editing Distance Between Trees and Related Problems , 1989, SIAM J. Comput..

[26]  Kaizhong Zhang,et al.  Comparing multiple RNA secondary structures using tree comparisons , 1990, Comput. Appl. Biosci..