Efficient evaluation of large sequence kernels

Classification of sequences drawn from a finite alphabet using a family of string kernels with inexact matching (e.g., spectrum or mismatch) has shown great success in machine learning. However, selection of optimal mismatch kernels for a particular task is severely limited by the inability to compute such kernels for long substrings (k-mers) with potentially many mismatches (m). In this work we introduce a new method that allows us to exactly evaluate kernels for large k and m and for arbitrary alphabet size. The task is accomplished by first solving the more tractable problem for small alphabets and then readily generalizing to any alphabet by solving a small linear system of equations. This makes it possible to explore a larger set of kernels over a wide range of kernel parameters, opening the possibility of better model selection and improved performance of string kernels. To investigate the utility of large (k,m) string kernels, we consider several sequence classification problems, including protein remote homology detection, fold prediction, and music classification. Our results show that longer k-mers with a larger number of allowed substitutions can improve classification performance.
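To make the (k,m) mismatch kernel concrete, the following is a minimal brute-force sketch and not the method introduced in this work. It assumes the standard definition, in which each k-mer of a sequence contributes to every k-mer within Hamming distance m of it. The function names are illustrative. The first function enumerates all |Σ|^k features explicitly, which is exactly the cost that becomes prohibitive for large k. The second uses the fact that the number of k-mers within m mismatches of two given k-mers depends only on their Hamming distance d, so the kernel reduces to a histogram of pairwise distances weighted by those intersection sizes (computed here by brute force, assuming an alphabet of at least two symbols).

```python
from collections import Counter
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def kmers(s, k):
    """All contiguous substrings of length k, with repeats."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def all_kmers(alphabet, k):
    """Every string of length k over the alphabet (|alphabet|**k of them)."""
    return map("".join, product(alphabet, repeat=k))


def mismatch_kernel_explicit(x, y, k, m, alphabet):
    """Kernel via the explicit feature map: sum over all possible k-mers."""
    total = 0
    for gamma in all_kmers(alphabet, k):
        cx = sum(hamming(gamma, a) <= m for a in kmers(x, k))
        cy = sum(hamming(gamma, b) <= m for b in kmers(y, k))
        total += cx * cy
    return total


def intersection_size(d, k, m, alphabet):
    """Count k-mers within m mismatches of both of two k-mers at distance d.

    Brute force over all k-mers; needs at least two alphabet symbols.
    """
    a = alphabet[0] * k
    b = alphabet[1] * d + alphabet[0] * (k - d)
    return sum(
        hamming(g, a) <= m and hamming(g, b) <= m
        for g in all_kmers(alphabet, k)
    )


def mismatch_kernel_pairwise(x, y, k, m, alphabet):
    """Same kernel, computed from the histogram of pairwise k-mer distances."""
    sizes = [intersection_size(d, k, m, alphabet) for d in range(k + 1)]
    hist = Counter(hamming(a, b) for a in kmers(x, k) for b in kmers(y, k))
    return sum(sizes[d] * n for d, n in hist.items())
```

Both functions should return the same value for any pair of sequences. With m = 0 the kernel reduces to the spectrum kernel, which counts matching k-mer pairs. The brute-force `intersection_size` is still exponential in k, so this sketch is only practical for small k and small alphabets.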
