Efficient Computation of Gapped Substring Kernels on Large Alphabets

We present a sparse dynamic programming algorithm that, given two strings s and t, a gap penalty λ, and an integer p, computes the value of the gap-weighted length-p subsequences kernel. The algorithm runs in time O(p |M| log |t|), where M = {(i, j) | s_i = t_j} is the set of positions at which the characters of the two sequences match. The algorithm is easily adapted to handle bounded-length subsequences and different gap-penalty schemes, including penalizing by the total length of gaps, penalizing by the number of gaps, and incorporating character-specific match/gap penalties.

The new algorithm is empirically evaluated against a full dynamic programming approach and a trie-based algorithm on both synthetic and newswire article data. In the experiments, the full dynamic programming approach is the fastest on short strings, and on long strings if the alphabet is small. On large alphabets, the new sparse dynamic programming algorithm is the most efficient. On medium-sized alphabets, the trie-based approach is best if the maximum number of allowed gaps is strongly restricted.
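To make the quantity being computed concrete, the following is a minimal Python sketch of the full dynamic programming baseline, not of the sparse algorithm itself. It assumes the common formulation in which each pair of matching length-p subsequences contributes λ raised to the sum of the two spans (last index minus first index plus one, in each string). The function names `match_set` and `gap_weighted_kernel` are illustrative and do not come from the paper.

```python
def match_set(s, t):
    """Return M = {(i, j) | s_i = t_j} as a list of 1-based index pairs."""
    return [(i, j)
            for i, a in enumerate(s, 1)
            for j, b in enumerate(t, 1)
            if a == b]


def gap_weighted_kernel(s, t, p, lam):
    """Gap-weighted length-p subsequences kernel via full O(p |s| |t|) DP.

    Assumption: a common subsequence occurring with span a in s and span b
    in t contributes lam ** (a + b) to the kernel value.
    """
    n, m = len(s), len(t)
    if p < 1 or n < p or m < p:
        return 0.0
    matches = match_set(s, t)

    # kappa[i][j]: weighted count of common subsequences of the current
    # length whose last characters are exactly s_i and t_j (1-based).
    kappa = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i, j in matches:
        kappa[i][j] = lam * lam  # length 1: span 1 in each string

    for _ in range(p - 1):
        # acc[i][j] = sum over k <= i, l <= j of
        #             lam ** ((i - k) + (j - l)) * kappa[k][l]
        acc = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                acc[i][j] = (kappa[i][j]
                             + lam * acc[i - 1][j]
                             + lam * acc[i][j - 1]
                             - lam * lam * acc[i - 1][j - 1])
        # Extend every shorter subsequence by one matching pair (i, j).
        nxt = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i, j in matches:
            nxt[i][j] = lam * lam * acc[i - 1][j - 1]
        kappa = nxt

    return sum(map(sum, kappa))
```

For example, `gap_weighted_kernel("cat", "cart", 2, 0.5)` sums the contributions of the common subsequences "ca", "at", and "ct", giving λ^4 + λ^5 + λ^7 = 0.1015625. Only the cells in M ever hold non-zero values of `kappa`, which is the observation that a sparse approach exploits; this sketch still fills the whole |s| × |t| table to accumulate the discounted sums.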
