Efficient multivariate kernels for sequence classification

Kernel-based approaches to sequence classification have been applied successfully in a variety of domains, including text categorization, image classification, speech analysis, biological sequence analysis, and time series and music classification, where they achieve some of the most accurate results. Typical kernel functions for sequences in these domains (e.g., bag-of-words, mismatch, or subsequence kernels) are restricted to discrete univariate (i.e., one-dimensional) string data, such as sequences of words in text analysis, codeword sequences in image analysis, or nucleotide or amino acid sequences in DNA and protein sequence analysis. However, the original sequence data are often real-valued and multivariate, i.e., neither univariate nor discrete as required by typical k-mer based sequence kernel functions. In this work, we consider the problem of multivariate sequence classification (e.g., classification of multivariate music sequences or multidimensional protein sequence representations). To this end, we extend the univariate kernel functions typically used in sequence domains and propose an efficient multivariate similarity kernel method (MVDFQ-SK) based on (1) direct feature quantization (DFQ) of each sequence dimension in the original real-valued multivariate sequences and (2) novel multivariate discrete kernel measures applied to these multivariate discrete DFQ sequence representations, which capture similarity relationships among sequences more accurately and improve classification performance. Experiments with the proposed MVDFQ-SK kernel method show excellent classification performance on three challenging music classification tasks as well as on protein sequence classification, with significant 25-40% improvements over univariate kernel methods and existing state-of-the-art sequence classification methods.
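The two-step idea in the abstract (quantize each dimension, then compare the resulting discrete sequences with a k-mer kernel) can be illustrated with a minimal sketch. This is not the MVDFQ-SK kernel itself, whose exact form the abstract does not give. The uniform binning, the plain spectrum (exact k-mer match) kernel, the summation over dimensions, and all function and parameter names below are assumptions chosen only for illustration.

```python
from collections import Counter

import numpy as np


def quantize_dimension(values, n_bins, lo, hi):
    """Map real values to integer symbols 0..n_bins-1 using uniform bins on [lo, hi].

    Uniform binning is an assumption; values outside [lo, hi] fall into
    the first or last bin.
    """
    interior_edges = np.linspace(lo, hi, n_bins + 1)[1:-1]
    return np.digitize(values, interior_edges)


def kmer_counts(symbols, k):
    """Count every contiguous k-mer in a discrete symbol sequence."""
    return Counter(tuple(symbols[i:i + k]) for i in range(len(symbols) - k + 1))


def spectrum_kernel(a, b, k):
    """Univariate spectrum kernel: inner product of the two k-mer count vectors."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(ca[m] * cb[m] for m in ca.keys() & cb.keys())


def multivariate_kernel(X, Y, k=2, n_bins=4, lo=0.0, hi=1.0):
    """Hypothetical multivariate kernel for sequences of shape (length, n_dims).

    Each dimension is quantized independently and the per-dimension
    spectrum kernels are summed (an illustrative combination rule).
    """
    total = 0
    for d in range(X.shape[1]):
        qx = quantize_dimension(X[:, d], n_bins, lo, hi).tolist()
        qy = quantize_dimension(Y[:, d], n_bins, lo, hi).tolist()
        total += spectrum_kernel(qx, qy, k)
    return total
```

Summing per-dimension kernels keeps the result a valid kernel, since a sum of positive semidefinite kernels is positive semidefinite, and it lets each dimension contribute k-mer matches independently of the others.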
