Efficient Approximation Algorithms for Strings Kernel Based Sequence Classification

Sequence classification algorithms, such as SVM, require a definition of a distance (similarity) measure between two sequences. A commonly used notion of similarity is the number of matches between k-mers (contiguous length-k subsequences) in the two sequences. Extending this definition, by considering two k-mers to match if their distance is at most m, yields better classification performance. This, however, makes the problem computationally much more complex. Known algorithms to compute this similarity have computational complexity that renders them applicable only for small values of k and m. In this work, we develop novel techniques to efficiently and accurately estimate the pairwise similarity score, which enables us to use much larger values of k and m and get higher predictive accuracy. This opens up a broad avenue of applying this classification approach to audio, image, and text sequences. Our algorithm achieves excellent approximation performance with theoretical guarantees. In the process we solve an open combinatorial problem, which was posed as a major hindrance to the scalability of existing solutions. We give analytical bounds on the quality and runtime of our algorithm and report its empirical performance on real-world biological and music sequence datasets.
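To make the similarity notion above concrete, the following is a minimal brute-force sketch, not the approximation algorithm developed in this work. It assumes that the distance between two k-mers is the Hamming distance, which the abstract does not state, and the function names (`kmers`, `hamming`, `naive_similarity`) are hypothetical. It counts every pair of k-mers, one from each sequence, whose distance is at most m.

```python
def kmers(seq, k):
    """Return all contiguous length-k windows of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def naive_similarity(s, t, k, m):
    """Count pairs of k-mers (one from s, one from t) within distance m.

    Assumption: 'distance' is Hamming distance. This is an illustrative
    baseline only; it compares every pair of k-mers directly.
    """
    return sum(
        1
        for a in kmers(s, k)
        for b in kmers(t, k)
        if hamming(a, b) <= m
    )


if __name__ == "__main__":
    # Exact matching (m = 0) versus allowing one mismatch (m = 1).
    print(naive_similarity("ABCD", "ABCE", k=2, m=0))
    print(naive_similarity("ABCD", "ABCE", k=2, m=1))
```

This pairwise loop performs on the order of |s| · |t| · k character comparisons per pair of sequences, so it is useful only as a reference point for what the similarity score measures, not as a practical method.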
