A Subpath Kernel for Rooted Unordered Trees

Kernel methods are a promising approach to learning from tree-structured data, and various efficient tree kernels have been proposed to capture informative structures in trees. In this paper, we propose a new tree kernel function based on "subpath sets" that captures vertical structures in tree-structured data, since tree structures are often used to encode hierarchical information. We also propose a simple and efficient algorithm for computing the kernel by extending the multikey quicksort algorithm used for sorting strings. The algorithm runs in O((|T_1|+|T_2|) log(|T_1|+|T_2|)) time on average and uses O(|T_1|+|T_2|) space, where |T_1| and |T_2| are the numbers of nodes in the two trees T_1 and T_2. We apply the proposed kernel to two supervised classification tasks: XML classification in web mining and glycan classification in bioinformatics. The experimental results show that the predictive performance of the proposed kernel is competitive with that of the existing efficient tree kernel proposed by Vishwanathan et al., and that the proposed kernel is also empirically faster.
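To make the idea of a kernel over vertical structures concrete, the following is a minimal sketch, not the algorithm proposed in the paper. It assumes that the kernel value is the inner product of subpath counts, where a subpath is any downward sequence of node labels from an ancestor to a descendant; the exact definition and any weighting used in the paper may differ. The tree encoding `(label, [children])` and the function names `subpath_counts` and `subpath_kernel` are hypothetical, chosen for illustration. The enumeration is naive and can take quadratic time on deep trees, unlike the multikey-quicksort-based algorithm described above.

```python
from collections import Counter


def subpath_counts(tree):
    """Count every downward label path (ancestor to descendant) in a tree.

    A tree is encoded as (label, [children]); this encoding is an
    assumption made for the sketch, not taken from the paper.
    """
    counts = Counter()

    def visit(node, ancestors):
        label, children = node
        path = ancestors + (label,)
        # Every suffix of the root-to-node path is a vertical path ending here.
        for start in range(len(path)):
            counts[path[start:]] += 1
        for child in children:
            visit(child, path)

    visit(tree, ())
    return counts


def subpath_kernel(t1, t2):
    """Inner product of subpath counts (assumed, unweighted form)."""
    c1, c2 = subpath_counts(t1), subpath_counts(t2)
    return sum(n * c2[p] for p, n in c1.items())


# Two small trees over labels a, b, c.
t1 = ("a", [("b", []), ("c", [])])
t2 = ("a", [("b", [("c", [])])])
print(subpath_kernel(t1, t2))
```

Because only counts of label paths are compared, the order of children does not affect the result, which matches the unordered setting in the title.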

[1] Nello Cristianini et al. Kernel Methods for Pattern Analysis, 2003, ICTAI.

[2] Michael Collins et al. Convolution Kernels for Natural Language, 2001, NIPS.

[3] Vladimir N. Vapnik et al. The Nature of Statistical Learning Theory, 2000, Statistics for Engineering and Information Science.

[4] Thomas Gärtner et al. On Graph Kernels: Hardness Results and Efficient Alternatives, 2003, COLT.

[5] Hisashi Kashima. Machine learning approaches for structured data, 2007.

[6] Hisashi Kashima et al. Marginalized Kernels Between Labeled Graphs, 2003, ICML.

[7] Jason Weston et al. Mismatch String Kernels for SVM Protein Classification, 2002, NIPS.

[8] Alexander J. Smola et al. Fast Kernels for String and Tree Matching, 2002, NIPS.

[9] David Haussler et al. Convolution kernels on discrete structures, 1999.

[10] Robert Sedgewick et al. Fast algorithms for sorting and searching strings, 1997, SODA '97.

[11] Hiroshi Yasuda et al. A gram distribution kernel applied to glycan classification and motif extraction, 2006, Genome informatics. International Conference on Genome Informatics.

[12] Hinrich Schütze et al. Book Reviews: Foundations of Statistical Natural Language Processing, 1999, CL.

[13] D. Mount. Bioinformatics: Sequence and Genome Analysis, 2001.

[14] Tetsuo Shibuya. Constructing the Suffix Tree of a Tree with a Large Alphabet, 1999, ISAAC.

[15] Hiroshi Sakamoto et al. Design and Analysis of Convolution Kernels for Tree-Structured Data, 2006.

[16] Nello Cristianini et al. Classification using String Kernels, 2000.

[17] Takenobu Tokunaga et al. Efficient Sentence Retrieval Based on Syntactic Structure, 2006, ACL.

[18] Takashi Washio et al. State of the art of graph-based data mining, 2003, SKDD.

[19] Alessandro Sperduti et al. Route kernels for trees, 2009, ICML '09.

[20] Choon Hui Teo et al. Fast and space efficient string kernels using suffix arrays, 2006, ICML.

[21] Susumu Goto et al. GLYCAN: The Database of Carbohydrate Structures, 2003.

[22] Eleazar Eskin et al. The Spectrum Kernel: A String Kernel for SVM Protein Classification, 2001, Pacific Symposium on Biocomputing.

[23] Dan Gusfield et al. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology, 1997.

[24] Charu C. Aggarwal et al. XRules: an effective structural classifier for XML data, 2003, KDD '03.