Efficient geometric-based computation of the string subsequence kernel

Kernel methods are powerful tools in machine learning, but they must be computationally efficient to be practical. This paper builds on our previous work, which proposed a list-based approach to efficiently compute the string subsequence kernel (SSK). Here we present a novel geometric-based approach whose main idea is that the SSK computation reduces to a range query problem. We start by constructing a match list $$L(s,t)=\left\{ (i,j):s_{i}=t_{j}\right\}$$, where s and t are the strings to be compared; this match list contains only the data that contribute to the result. To compute the SSK efficiently, we extend the layered range tree data structure to a layered range sum tree, a range-aggregation data structure. The SSK computation takes $$O(p|L|\log |L|)$$ time and $$O(|L|\log |L|)$$ space, where |L| is the size of the match list and p is the length of the SSK, that is, the length of the subsequences it counts. We present an empirical evaluation of our approach against the dynamic programming and sparse dynamic programming approaches, both on synthetically generated data and on newswire article data. Experimental results show that our approach is efficient for large alphabets, except for very short strings. It can therefore be used in applications such as text categorization, information extraction, and music genre classification. Moreover, it also outperforms the sparse dynamic programming approach on long strings.
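To make the reduction to range queries concrete, the following is a minimal Python sketch, not the paper's implementation. It builds the match list and then, for each match (i, j), sums the contributions of all matches (i', j') with i' < i and j' < j, which is a two-dimensional dominance query. As a stand-in for the layered range sum tree, it uses a row-by-row sweep with a Fenwick (binary indexed) tree over the j coordinate. The names `match_list`, `Fenwick`, `ssk_sweep`, and `ssk_brute`, the decay parameter `lam`, and the 1-based indexing are all assumptions made for illustration. The sketch also assumes the usual SSK weighting, in which each common subsequence contributes `lam` raised to the sum of its spans in s and t.

```python
from collections import defaultdict
from itertools import combinations


def match_list(s, t):
    """L(s, t) = {(i, j) : s_i == t_j}, 1-based, sorted by i and then j."""
    positions = defaultdict(list)
    for j, ch in enumerate(t, start=1):
        positions[ch].append(j)
    return [(i, j) for i, ch in enumerate(s, start=1)
            for j in positions.get(ch, [])]


class Fenwick:
    """Prefix-sum tree over positions 1..n (stand-in for a range sum tree)."""

    def __init__(self, n):
        self.tree = [0.0] * (n + 1)

    def add(self, k, x):
        while k < len(self.tree):
            self.tree[k] += x
            k += k & -k

    def prefix(self, k):
        total = 0.0
        while k > 0:
            total += self.tree[k]
            k -= k & -k
        return total


def ssk_sweep(s, t, p, lam):
    """SSK of length p >= 1 computed from the match list only."""
    L = match_list(s, t)
    # value[(i, j)]: weighted count of length-q subsequences ending at (i, j).
    value = {m: lam ** 2 for m in L}
    for _ in range(p - 1):
        bit = Fenwick(len(t))
        new_value = {}
        pending = []  # matches of the current row i, not yet inserted
        for (i, j) in L:
            if pending and pending[0][0] != i:
                # Rows strictly above i become visible to the queries.
                for (a, b) in pending:
                    bit.add(b, value[(a, b)] * lam ** (-(a + b)))
                pending = []
            # Dominance sum over matches with i' < i and j' < j.
            new_value[(i, j)] = lam ** (i + j) * bit.prefix(j - 1)
            pending.append((i, j))
        value = new_value
    return sum(value.values())


def ssk_brute(s, t, p, lam):
    """Exponential-time reference: enumerate all index tuples directly."""
    total = 0.0
    for I in combinations(range(len(s)), p):
        for J in combinations(range(len(t)), p):
            if all(s[a] == t[b] for a, b in zip(I, J)):
                total += lam ** ((I[-1] - I[0] + 1) + (J[-1] - J[0] + 1))
    return total
```

For example, "cat" and "car" share only the length-2 subsequence "ca", which spans two characters in each string, so `ssk_sweep("cat", "car", 2, lam)` should agree with `lam ** 4`. Scaling stored weights by `lam ** (-(a + b))` makes each query independent of the query point, but it can lose precision on long strings. The sweep is a simplification chosen for brevity; it is not the layered range sum tree described in the abstract.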
