Learning Interpretable SVMs for Biological Sequence Classification

BACKGROUND Support Vector Machines (SVMs), using a variety of string kernels, have been successfully applied to biological sequence classification problems. While SVMs achieve high classification accuracy, they lack interpretability. In many applications it is not enough for an algorithm to detect a biological signal in the sequence; it should also provide a means of interpreting its solution so that biological insight can be gained.

RESULTS We propose novel and efficient algorithms for solving the so-called Support Vector Multiple Kernel Learning problem. The developed techniques can be used to interpret the resulting support vector decision function and so extract biologically relevant knowledge about the sequence analysis problem at hand. We apply the proposed methods to the task of acceptor splice site prediction and to the problem of recognizing alternatively spliced exons. Our algorithms compute sparse weightings of substring locations, highlighting which parts of the sequence are important for discrimination.

CONCLUSION The proposed method can handle thousands of examples while combining hundreds of kernels within a reasonable time, and it reliably identifies a few statistically significant positions.
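To make the idea of a sparse weighting over substring locations concrete, the following is a minimal sketch and not the authors' implementation. It assumes the usual multiple kernel learning setup, in which the combined kernel is a convex combination of sub-kernels, and it uses a deliberately simple hypothetical sub-kernel (an exact match of the length-k substring at one position) as a stand-in for the string kernels used in the paper. The names `position_kernel`, `combined_kernel`, `top_positions` and the weights `betas` are illustrative assumptions; in practice the weights would be learned together with the SVM rather than set by hand.

```python
def position_kernel(x, y, pos, k=3):
    """Hypothetical sub-kernel: 1.0 if x and y share the same length-k
    substring starting at position pos, otherwise 0.0."""
    return 1.0 if x[pos:pos + k] == y[pos:pos + k] else 0.0


def combined_kernel(x, y, betas, k=3):
    """Convex combination of one sub-kernel per sequence position.

    betas[p] is the weight of position p. The weights are assumed to be
    non-negative and to sum to one; positions with zero weight are skipped.
    """
    if any(b < 0.0 for b in betas) or abs(sum(betas) - 1.0) > 1e-9:
        raise ValueError("betas must be non-negative and sum to one")
    return sum(b * position_kernel(x, y, p, k)
               for p, b in enumerate(betas) if b > 0.0)


def top_positions(betas, n=2):
    """Indices of the n largest weights, i.e. the positions that matter most."""
    return sorted(range(len(betas)), key=lambda p: betas[p], reverse=True)[:n]


# Illustrative weights only (assumed, not taken from the paper): a sparse
# weighting that puts all mass on positions 0 and 5 of length-8 sequences.
betas = [0.75, 0.0, 0.0, 0.0, 0.0, 0.25]
x, y, z = "ACGTAGGT", "ACGTTGGT", "TTGTAGGT"

print(combined_kernel(x, y, betas))  # x and y agree at both weighted positions
print(combined_kernel(x, z, betas))  # x and z agree only at position 5
print(top_positions(betas))          # positions carrying the largest weights
```

Because most weights are exactly zero, reading off the non-zero entries of `betas` indicates which sequence positions contribute to the decision function, which is the kind of interpretation the abstract describes.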
