Generalized Similarity Kernels for Efficient Sequence Classification

String kernel-based machine learning methods have been highly successful in practical analysis of structured and sequential data. They often exhibit state-of-the-art performance on tasks such as document topic elucidation, music genre classification, and protein superfamily and fold prediction. However, typical string kernel methods rely on symbolic Hamming-distance-based matching, which may not reflect the underlying (e.g., physical) similarity between sequence fragments. In this work we propose a novel computational framework that uses general similarity metrics S(·, ·) and distance-preserving embeddings with string kernels to improve sequence classification. In particular, we consider two approaches: one incorporates non-Hamming similarity S(·, ·) into similarity evaluation by matching only the features that are similar according to S(·, ·); the other retains the actual (approximate) similarity/distance scores in similarity evaluation. An embedding step, a distance-preserving bitstring mapping, is used to effectively capture similarity between otherwise symbolically different sequence elements. We show that it is possible to retain the computational efficiency of string kernels while using this more “precise” measure of similarity. We then demonstrate that on a number of sequence classification tasks, such as music and biological sequence classification, the new method can substantially improve upon state-of-the-art string kernel baselines.
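To make the first approach concrete, the following is a minimal sketch and not the paper's algorithm. It assumes sequence elements are quantized to ordinal levels 0..n_levels-1 and that S(·, ·) is the absolute difference between levels; under that assumption a unary ("thermometer") code is a distance-preserving bitstring embedding, because the Hamming distance between two codes equals the difference between the levels. All function names and the threshold `max_dist` are hypothetical. The pairwise loop is quadratic and is only for illustration; it does not reproduce the efficient evaluation the abstract refers to, and the resulting count is not claimed to be a valid positive semidefinite kernel.

```python
def thermometer_code(level, n_levels):
    """Unary code: Hamming distance between two codes equals |level_a - level_b|."""
    return tuple(1 if i < level else 0 for i in range(n_levels - 1))


def embed(kmer, n_levels):
    """Concatenate the per-element codes of a k-mer into one bitstring."""
    bits = []
    for level in kmer:
        bits.extend(thermometer_code(level, n_levels))
    return tuple(bits)


def hamming(u, v):
    """Number of positions where two equal-length bitstrings differ."""
    return sum(a != b for a, b in zip(u, v))


def kmers(seq, k):
    """All contiguous length-k fragments of a sequence."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def similarity_match_count(x, y, k, n_levels, max_dist):
    """Count k-mer pairs whose embedded bitstrings differ in at most max_dist bits.

    Brute-force illustration of similarity-based matching (assumption: S is the
    absolute level difference, summed over the k positions of a fragment).
    """
    ex = [embed(f, n_levels) for f in kmers(x, k)]
    ey = [embed(f, n_levels) for f in kmers(y, k)]
    return sum(1 for u in ex for v in ey if hamming(u, v) <= max_dist)


if __name__ == "__main__":
    x = [0, 1, 2, 3]
    y = [0, 1, 3, 3]
    # Exact symbolic matching versus matching that tolerates one level of difference.
    print(similarity_match_count(x, y, k=2, n_levels=4, max_dist=0))
    print(similarity_match_count(x, y, k=2, n_levels=4, max_dist=1))
```

With `max_dist=0` the count reduces to exact k-mer matching, as in a spectrum-style kernel; raising `max_dist` lets fragments that are symbolically different but close under the assumed S contribute to the score.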
