Large Scale Hidden Semi-Markov SVMs

We describe Hidden Semi-Markov Support Vector Machines (SHM SVMs), an extension of HM SVMs to semi-Markov chains. This allows us to predict segmentations of sequences based on segment-based features measuring properties such as the length of the segment. We propose a novel technique to partition the problem into sub-problems. The independently obtained partial solutions can then be recombined in an efficient way, which allows us to solve label sequence learning problems with several thousands of labeled sequences. We have tested our algorithm for predicting gene structures, an important problem in computational biology. Results on a well-known model organism illustrate the great potential of SHM SVMs in computational biology.

[1]  Andrew J. Viterbi,et al.  Error bounds for convolutional codes and an asymptotically optimum decoding algorithm , 1967, IEEE Trans. Inf. Theory.

[2]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[3]  David Haussler,et al.  Optimally Parsing a Sequence into Different Classes Based on Multiple Types of Evidence , 1994, ISMB.

[4]  David Haussler,et al.  A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA , 1996, ISMB.

[5]  S. Karlin,et al.  Prediction of complete gene structures in human genomic DNA. , 1997, Journal of molecular biology.

[6]  Doina Precup,et al.  Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning , 1999, Artif. Intell..

[7]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[8]  Padhraic Smyth,et al.  Segmental semi-markov models and applications to sequence analysis , 2002 .

[9]  Ben Taskar,et al.  Max-Margin Markov Networks , 2003, NIPS.

[10]  Thomas Hofmann,et al.  Hidden Markov Support Vector Machines , 2003, ICML.

[11]  Bernhard Schölkopf,et al.  Kernel Methods in Computational Biology , 2005 .

[12]  K. Heller,et al.  Sequence information for the splicing of human pre-mRNA identified by support vector machine classification. , 2003, Genome research.

[13]  Ian Korf,et al.  Gene finding in novel genomes , 2004, BMC Bioinformatics.

[14]  B. Schölkopf,et al.  Accurate Splice Site Detection for Caenorhabditis elegans , 2004 .

[15]  William W. Cohen,et al.  Semi-Markov Conditional Random Fields for Information Extraction , 2004, NIPS.

[16]  Thomas Hofmann,et al.  Gaussian process classification for segmenting and annotating sequences , 2004, ICML.

[17]  Thomas Hofmann,et al.  Large Margin Methods for Structured and Interdependent Output Variables , 2005, J. Mach. Learn. Res..

[18]  Daniel G. Brown,et al.  ExonHunter: a comprehensive approach to gene finding , 2005, ISMB.

[19]  Mikhail Belkin,et al.  Maximum Margin Semi-Supervised Learning for Structured Variables , 2005, NIPS 2005.

[20]  Gunnar Rätsch,et al.  Improving the Caenorhabditis elegans Genome Annotation Using Machine Learning , 2006, PLoS Comput. Biol..