An Algorithm for Finding the Largest Approximately Common Substructures of Two Trees

Ordered labeled trees are trees in which each node has a label and the left-to-right order of its children (if it has any) is fixed. Such trees have many applications in vision, pattern recognition, molecular biology, and natural language processing. We consider a substructure of an ordered labeled tree T to be a connected subgraph of T. Given two ordered labeled trees T1 and T2 and an integer d, the largest approximately common substructure problem is to find a substructure U1 of T1 and a substructure U2 of T2 such that U1 is within edit distance d of U2 and no other pair of substructures V1 of T1 and V2 of T2 satisfies the distance constraint with a larger combined size, that is, with |V1| + |V2| > |U1| + |U2|. We present a dynamic programming algorithm for this problem. When the distance allowed in the common substructures is a constant independent of the input trees, the algorithm runs as fast as the fastest known algorithm for computing the edit distance of two trees. To demonstrate the utility of our algorithm, we discuss its application to discovering motifs in multiple RNA secondary structures (which are ordered labeled trees).
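
The problem statement above can be made concrete with a small brute-force sketch. This is not the dynamic programming algorithm of the paper: it enumerates every connected substructure of each tree and compares all pairs, so it takes exponential time and is only suitable for very small inputs. It assumes unit costs for insertion, deletion, and relabeling, and it represents a tree as a (label, children) tuple. The function names (forest_dist, rooted_substructures, all_substructures, largest_common_substructures) are hypothetical and chosen for illustration only.

```python
from functools import lru_cache
from itertools import product

# Assumed representation: a tree is (label, children), where children is a
# tuple of trees in left-to-right order. A forest is a tuple of trees.

def size(tree):
    """Number of nodes in a tree."""
    _, kids = tree
    return 1 + sum(size(k) for k in kids)

@lru_cache(maxsize=None)
def forest_dist(f1, f2):
    """Unit-cost edit distance between two ordered forests (textbook recursion)."""
    if not f1 and not f2:
        return 0
    if not f1:
        return sum(size(t) for t in f2)   # insert every node of f2
    if not f2:
        return sum(size(t) for t in f1)   # delete every node of f1
    (l1, k1), (l2, k2) = f1[-1], f2[-1]   # rightmost roots of each forest
    return min(
        forest_dist(f1[:-1] + k1, f2) + 1,   # delete the rightmost root of f1
        forest_dist(f1, f2[:-1] + k2) + 1,   # insert the rightmost root of f2
        forest_dist(f1[:-1], f2[:-1]) + forest_dist(k1, k2) + int(l1 != l2),  # match or relabel
    )

def rooted_substructures(tree):
    """All connected substructures that contain the root of `tree`."""
    label, kids = tree
    # For each child: either cut it away (None) or keep a connected piece rooted at it.
    options = [[None] + rooted_substructures(k) for k in kids]
    return [(label, tuple(c for c in choice if c is not None))
            for choice in product(*options)]

def all_substructures(tree):
    """All connected substructures of `tree`, rooted at any of its nodes."""
    _, kids = tree
    result = rooted_substructures(tree)
    for k in kids:
        result.extend(all_substructures(k))
    return result

def largest_common_substructures(t1, t2, d):
    """Return (total_size, U1, U2) with dist(U1, U2) <= d and maximal total size, or None."""
    best = None
    for u1 in all_substructures(t1):
        for u2 in all_substructures(t2):
            if forest_dist((u1,), (u2,)) <= d:
                total = size(u1) + size(u2)
                if best is None or total > best[0]:
                    best = (total, u1, u2)
    return best

# Two example trees that differ only in the label of the right child.
T1 = ("a", (("b", ()), ("c", ())))
T2 = ("a", (("b", ()), ("d", ())))

print(largest_common_substructures(T1, T2, 0)[0])  # 4: U1 = U2 = a(b)
print(largest_common_substructures(T1, T2, 1)[0])  # 6: the whole trees, one relabel apart
```

With d = 0 only identical substructures qualify, so the right child must be cut from both trees. Allowing d = 1 admits the relabeling of c to d, and the whole trees become the answer.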
