Marginalized kernels for biological sequences

MOTIVATION: Kernel methods such as support vector machines require a kernel function between objects to be defined a priori. Several studies have derived kernels from probability distributions, e.g., the Fisher kernel. However, a general methodology for designing a kernel has not been fully developed.

RESULTS: We propose a way of designing a kernel when objects are generated from latent variable models (e.g., hidden Markov models, HMMs). First, a joint kernel is designed for complete data, which include both visible and hidden variables. Then a marginalized kernel for visible data is obtained by taking the expectation of the joint kernel with respect to the hidden variables. We show that the Fisher kernel is a special case of marginalized kernels, which offers another view of Fisher kernel theory. Although our approach can be applied to any kind of object, we derive in particular several marginalized kernels useful for biological sequences (e.g., DNA and proteins). The effectiveness of marginalized kernels is illustrated on the task of classifying bacterial gyrase subunit B (gyrB) amino acid sequences.
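The two-step construction can be illustrated with a minimal sketch. It assumes a toy two-state HMM over the DNA alphabet, with parameters invented purely for illustration, and takes as the joint kernel a dot product of length-normalized (symbol, hidden state) counts. Hidden paths are enumerated by brute force, which is only feasible for very short sequences; a practical implementation would presumably use the forward-backward algorithm instead. All names here (`INIT`, `TRANS`, `EMIT`, `joint_kernel`, `marginalized_kernel`) are hypothetical and not taken from the paper.

```python
import itertools
import numpy as np

# Hypothetical 2-state HMM over DNA; every number below is illustrative only.
STATES = (0, 1)
ALPHABET = "ACGT"
INIT = np.array([0.5, 0.5])
TRANS = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
EMIT = np.array([[0.4, 0.1, 0.1, 0.4],   # state 0 favours A/T
                 [0.1, 0.4, 0.4, 0.1]])  # state 1 favours C/G


def joint_prob(x, h):
    """p(x, h) for a visible sequence x and a hidden state path h."""
    p = INIT[h[0]] * EMIT[h[0], ALPHABET.index(x[0])]
    for t in range(1, len(x)):
        p *= TRANS[h[t - 1], h[t]] * EMIT[h[t], ALPHABET.index(x[t])]
    return p


def posterior(x):
    """All hidden paths for x and their posteriors p(h | x), by enumeration."""
    paths = list(itertools.product(STATES, repeat=len(x)))
    joint = np.array([joint_prob(x, h) for h in paths])
    return paths, joint / joint.sum()


def count_features(x, h):
    """Length-normalized counts of each (symbol, hidden state) pair."""
    phi = np.zeros((len(ALPHABET), len(STATES)))
    for sym, st in zip(x, h):
        phi[ALPHABET.index(sym), st] += 1
    return phi.ravel() / len(x)


def joint_kernel(x, h, x2, h2):
    """Joint kernel on complete data z = (x, h): dot product of counts."""
    return float(count_features(x, h) @ count_features(x2, h2))


def marginalized_kernel(x, x2):
    """Expectation of the joint kernel over both hidden-path posteriors."""
    paths1, post1 = posterior(x)
    paths2, post2 = posterior(x2)
    return sum(p1 * p2 * joint_kernel(x, h1, x2, h2)
               for h1, p1 in zip(paths1, post1)
               for h2, p2 in zip(paths2, post2))


if __name__ == "__main__":
    print(marginalized_kernel("ACGT", "AATT"))
```

Because the joint kernel in this sketch is a dot product of feature vectors, the marginalized kernel equals the dot product of the posterior-expected feature vectors, so it is symmetric and positive semidefinite by construction.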
