SuMoTED: An intuitive edit distance between rooted unordered uniquely-labelled trees

Defining and computing distances between tree structures is a classical area of study in theoretical computer science, with practical applications in computational biology, information retrieval, text analysis, and many other fields. In this paper, we focus on rooted, unordered, uniquely-labelled trees such as taxonomies and other hierarchies. For such trees, we introduce the intuitive concept of a ‘local move’ operation as an atomic edit of a tree. We then introduce SuMoTED, a new edit distance measure between such trees, defined as the minimal number of local moves required to convert one tree into another. We show how SuMoTED can be computed using a scalable algorithm with quadratic time complexity. Finally, we demonstrate its use on a collection of music genre taxonomies.
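To make the idea of "minimal number of local moves" concrete, the following is a minimal sketch and not the quadratic-time algorithm the abstract refers to. The abstract does not define a local move, so the sketch assumes one: a node together with its subtree either moves up to become a child of its grandparent, or moves down to become a child of one of its siblings. The parent-map representation and the names `local_moves` and `brute_force_distance` are hypothetical, chosen only for illustration. The search is a plain breadth-first search whose cost grows exponentially, so it is only usable on very small trees.

```python
from collections import deque


def local_moves(parent):
    """Yield every tree reachable by one local move (ASSUMED definition).

    `parent` maps each unique label to its parent's label; the root maps to None.
    Assumption: a local move re-attaches a node (with its subtree) either to its
    grandparent ("up") or to one of its siblings ("down").
    """
    for node, par in parent.items():
        if par is None:
            continue  # the root never moves
        grandparent = parent[par]
        if grandparent is not None:
            moved = dict(parent)
            moved[node] = grandparent  # move up one level
            yield moved
        for sibling, sibling_parent in parent.items():
            if sibling != node and sibling_parent == par:
                moved = dict(parent)
                moved[node] = sibling  # move down under a sibling
                yield moved


def brute_force_distance(tree_a, tree_b):
    """Smallest number of assumed local moves turning tree_a into tree_b."""
    if set(tree_a) != set(tree_b):
        raise ValueError("both trees must use the same label set")

    def key(tree):
        return frozenset(tree.items())

    target = key(tree_b)
    seen = {key(tree_a)}
    queue = deque([(tree_a, 0)])
    while queue:
        tree, dist = queue.popleft()
        if key(tree) == target:
            return dist
        for nxt in local_moves(tree):
            k = key(nxt)
            if k not in seen:
                seen.add(k)
                queue.append((nxt, dist + 1))
    raise ValueError("target tree is not reachable (different roots?)")


# Toy genre taxonomies, invented for illustration only.
nested = {"music": None, "rock": "music", "metal": "rock", "jazz": "music"}
flat = {"music": None, "rock": "music", "metal": "music", "jazz": "music"}
under_jazz = {"music": None, "rock": "music", "metal": "jazz", "jazz": "music"}

print(brute_force_distance(nested, flat))        # 1: move "metal" up
print(brute_force_distance(nested, under_jazz))  # 2: move "metal" up, then down under "jazz"
```

Under this assumed definition, every move up can be undone by a move down, so any two trees over the same labels and the same root are connected and the search always terminates with an answer.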
