Scalable Algorithms for String Kernels with Inexact Matching

We present a new family of linear time algorithms for string comparison with mismatches under the string kernels framework. Based on sufficient statistics, our algorithms improve theoretical complexity bounds of existing approaches while scaling well in sequence alphabet size, the number of allowed mismatches and the size of the dataset. In particular, on large alphabets and under loose mismatch constraints our algorithms are several orders of magnitude faster than the existing algorithms for string comparison under the mismatch similarity measure. We evaluate our algorithms on synthetic data and real applications in music genre classification, protein remote homology detection and protein fold prediction. The scalability of the algorithms allows us to consider complex sequence transformations, modeled using longer string features and larger numbers of mismatches, leading to a state-of-the-art performance with significantly reduced running times.

[1]  Jason Weston,et al.  Multi-class Protein Classification Using Adaptive Codes , 2007, J. Mach. Learn. Res..

[2]  Jason Weston,et al.  Mismatch String Kernels for SVM Protein Classification , 2002, NIPS.

[3]  Jason Weston,et al.  Semi-supervised Protein Classification Using Cluster Kernels , 2003, NIPS.

[4]  Alexander J. Smola,et al.  Fast Kernels for String and Tree Matching , 2002, NIPS.

[5]  David Haussler,et al.  A Discriminative Framework for Detecting Remote Protein Homologies , 2000, J. Comput. Biol..

[6]  Vladimir Pavlovic,et al.  Fast protein homology and fold detection with sparse spatial sample kernels , 2008, 2008 19th International Conference on Pattern Recognition.

[7]  Ke Wang,et al.  Profile-based string kernels for remote homology detection and motif extraction , 2004, Proceedings. 2004 IEEE Computational Systems Bioinformatics Conference, 2004. CSB 2004..

[8]  Chris H. Q. Ding,et al.  Multi-class protein fold recognition using support vector machines and neural networks , 2001, Bioinform..

[9]  Pierre Baldi,et al.  A machine learning information retrieval approach to protein fold recognition. , 2006, Bioinformatics.

[10]  Tao Li,et al.  A comparative study on content-based music genre classification , 2003, SIGIR.

[11]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[12]  Vladimir Pavlovic,et al.  On the Role of Local Matching for Efficient Semi-supervised Protein Sequence Classification , 2008, 2008 IEEE International Conference on Bioinformatics and Biomedicine.

[13]  Jason Weston,et al.  Multi-class protein fold recognition using adaptive codes , 2005, ICML.

[14]  Eleazar Eskin,et al.  The Spectrum Kernel: A String Kernel for SVM Protein Classification , 2001, Pacific Symposium on Biocomputing.

[15]  Christina S. Leslie,et al.  Fast String Kernels using Inexact Matching for Protein Sequences , 2004, J. Mach. Learn. Res..

[16]  Juho Rousu,et al.  Efficient Computation of Gapped Substring Kernels on Large Alphabets , 2005, J. Mach. Learn. Res..

[17]  A. D. McLachlan,et al.  Profile analysis: detection of distantly related proteins. , 1987, Proceedings of the National Academy of Sciences of the United States of America.

[18]  Nello Cristianini,et al.  Kernel Methods for Pattern Analysis , 2004 .

[19]  Gunnar Rätsch,et al.  Large scale genomic sequence SVM classifiers , 2005, ICML.