Mismatch string kernels for discriminative protein classification

MOTIVATION Classification of proteins sequences into functional and structural families based on sequence homology is a central problem in computational biology. Discriminative supervised machine learning approaches provide good performance, but simplicity and computational efficiency of training and prediction are also important concerns. RESULTS We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the problem of protein classification and remote homology detection. These kernels measure sequence similarity based on shared occurrences of fixed-length patterns in the data, allowing for mutations between patterns. Thus, the kernels provide a biologically well-motivated way to compare protein sequences without relying on family-based generative models such as hidden Markov models. We compute the kernels efficiently using a mismatch tree data structure, allowing us to calculate the contributions of all patterns occurring in the data in one pass while traversing the tree. When used with an SVM, the kernels enable fast prediction on test sequences. We report experiments on two benchmark SCOP datasets, where we show that the mismatch kernel used with an SVM classifier performs competitively with state-of-the-art methods for homology detection, particularly when very few training examples are available. Examination of the highest-weighted patterns learned by the SVM classifier recovers biologically important motifs in protein families and superfamilies.

[1]  A. D. McLachlan,et al.  Profile analysis: detection of distantly related proteins. , 1987, Proceedings of the National Academy of Sciences of the United States of America.

[2]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[3]  M. Miyamoto,et al.  Phylogenetic Analysis of DNA Sequences , 1991 .

[4]  Bernhard E. Boser,et al.  A training algorithm for optimal margin classifiers , 1992, COLT '92.

[5]  D. Haussler,et al.  Hidden Markov models in computational biology. Applications to protein modeling. , 1993, Journal of molecular biology.

[6]  M. A. McClure,et al.  Hidden Markov models of biological primary sequence information. , 1994, Proceedings of the National Academy of Sciences of the United States of America.

[7]  Sean R. Eddy,et al.  Multiple Alignment Using Hidden Markov Models , 1995, ISMB.

[8]  A G Murzin,et al.  SCOP: a structural classification of proteins database for the investigation of sequences and structures. , 1995, Journal of molecular biology.

[9]  Amos Bairoch,et al.  The PROSITE database, its status in 1995 , 1996, Nucleic Acids Res..

[10]  Alfonso Valencia,et al.  Proceedings of the 5th International Conference on Intelligent Systems for Molecular Biology , 1996 .

[11]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[12]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[13]  Gapped BLAST and PSI-BLAST: A new , 1997 .

[14]  S. Henikoff,et al.  Embedding strategies for effective use of information from multiple sequence alignments , 1997, Protein science : a publication of the Protein Society.

[15]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[16]  Terri K. Attwood,et al.  The PRINTS protein fingerprint database in its fifth year , 1998, Nucleic Acids Res..

[17]  David Haussler,et al.  Convolution kernels on discrete structures , 1999 .

[18]  C. Watkins Dynamic Alignment Kernels , 1999 .

[19]  Nello Cristianini,et al.  An introduction to Support Vector Machines , 2000 .

[20]  David Haussler,et al.  A Discriminative Framework for Detecting Remote Protein Homologies , 2000, J. Comput. Biol..

[21]  Protein Family Classification Using Sparse Markov Transducers , 2000, ISMB.

[22]  Patrice Koehl,et al.  The ASTRAL compendium for protein structure and sequence analysis , 2000, Nucleic Acids Res..

[23]  Bernhard Schölkopf,et al.  Dynamic Alignment Kernels , 2000 .

[24]  Nello Cristianini,et al.  Classification using String Kernels , 2000 .

[25]  Graziano Pesole,et al.  An algorithm for finding signals of unknown length in DNA sequences , 2001, ISMB.

[26]  K Karplus,et al.  What is the value added by human intervention in protein structure prediction? , 2001, Proteins.

[27]  Amos Bairoch,et al.  The PROSITE database, its status in 2002 , 2002, Nucleic Acids Res..

[28]  Li Liao,et al.  Combining pairwise sequence similarity and support vector machines for remote protein homology detection , 2002, RECOMB '02.

[29]  Jason Weston,et al.  Mismatch String Kernels for SVM Protein Classification , 2002, NIPS.

[30]  Eleazar Eskin,et al.  The Spectrum Kernel: A String Kernel for SVM Protein Classification , 2001, Pacific Symposium on Biocomputing.

[31]  Alexander J. Smola,et al.  Fast Kernels for String and Tree Matching , 2002, NIPS.

[32]  Gunnar Rätsch,et al.  A New Discriminative Kernel from Probabilistic Models , 2001, Neural Computation.

[33]  Jens Stoye,et al.  A Novel Approach to Remote Homology Detection: Jumping Alignments , 2002, J. Comput. Biol..

[34]  Li Liao,et al.  Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships , 2003, J. Comput. Biol..

[35]  Christina S. Leslie,et al.  Fast Kernels for Inexact String Matching , 2003, COLT.

[36]  Yoram Singer,et al.  An Efficient Extension to Mixture Techniques for Prediction and Decision Trees , 1997, COLT '97.

[37]  Steven Salzberg,et al.  On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach , 1997, Data Mining and Knowledge Discovery.