Unordered tree mining with applications to phylogeny

Frequent structure mining (FSM) aims to discover and extract patterns that occur frequently in structural data such as trees and graphs. FSM has applications in bioinformatics, XML processing, Web log analysis, and other areas. We present a new FSM technique for finding patterns in rooted unordered labeled trees. The patterns of interest are cousin pairs in these trees. A cousin pair is a pair of nodes sharing the same parent, the same grandparent, the same great-grandparent, and so on. Given a tree T, our algorithm finds all interesting cousin pairs of T in O(|T|^2) time, where |T| is the number of nodes in T. Experimental results on synthetic data and phylogenies show the scalability and effectiveness of the proposed technique. To demonstrate the usefulness of our approach, we discuss its applications to locating co-occurring patterns in multiple evolutionary trees, evaluating the consensus of equally parsimonious trees, and finding kernel trees of groups of phylogenies. We also describe extensions of our algorithms to undirected acyclic graphs (or free trees).
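
To make the cousin-pair definition concrete, the following is a minimal Python sketch, not the algorithm described above. It assumes the tree is given as a hypothetical `parent` dictionary mapping each node identifier to its parent (the root maps to `None`), and it assumes a distance convention in which siblings have distance 0, nodes sharing a grandparent have distance 1, and so on. The function name `cousin_pairs` and the cutoff `max_dist` are illustrative choices. This naive version compares ancestor chains for every pair of nodes, so its running time also depends on the tree height and it does not achieve the O(|T|^2) bound stated above.

```python
from itertools import combinations


def cousin_pairs(parent, max_dist=2):
    """Enumerate cousin pairs in a rooted tree.

    parent: dict mapping each node to its parent (root maps to None).
    max_dist: largest cousin distance to report (assumed convention:
              0 = same parent, 1 = same grandparent, ...).
    Returns a list of (u, v, distance) tuples.
    """

    def ancestors(node):
        # Chain of ancestors from the parent up to the root.
        chain = []
        while parent[node] is not None:
            node = parent[node]
            chain.append(node)
        return chain

    anc = {n: ancestors(n) for n in parent}
    pairs = []
    for u, v in combinations(sorted(parent), 2):
        # Compare the k-th ancestor of u with the k-th ancestor of v.
        # The first k at which they coincide means both nodes sit the
        # same number of levels below their lowest common ancestor.
        for k, (a, b) in enumerate(zip(anc[u], anc[v])):
            if a == b:
                if k <= max_dist:
                    pairs.append((u, v, k))
                break
    return pairs


# Example tree: r has children a and b; a has children c and d; b has child e.
example = {"r": None, "a": "r", "b": "r", "c": "a", "d": "a", "e": "b"}
print(cousin_pairs(example))
# [('a', 'b', 0), ('c', 'd', 0), ('c', 'e', 1), ('d', 'e', 1)]
```

In this sketch a node and its own ancestor are never reported as a pair, because their ancestor chains can only coincide at the same index when both nodes lie at the same depth. Counting how often a given pair of labels occurs across several trees would be a separate step that is not shown here.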
