Faster Algorithms for Tree Similarity Based on Compressed Enumeration of Bounded-Sized Ordered Subtrees

In this paper, we study the efficient computation of tree similarity for ordered trees based on compressed subtree enumeration. Compressed subtree enumeration is a new paradigm for enumeration algorithms in which all subtrees of an input tree T are enumerated in the form of their compressed bit signatures. For the task of enumerating the compressed bit signatures of all k-subtrees in an ordered tree T, we first present an enumeration algorithm with O(k) delay, and then present another enumeration algorithm that directly outputs bit signatures with constant delay after O(n)-time preprocessing. Both algorithms are based on a bit-parallel speed-up technique for signature maintenance. In experiments on real and artificial datasets, both algorithms were approximately 22% to 36% faster than the corresponding algorithms without bit-parallel signature maintenance.
