Algorithms for Hidden Markov Models Restricted to Occurrences of Regular Expressions

Hidden Markov Models (HMMs) are widely used probabilistic models, particularly for annotating sequential data with an underlying hidden structure. Patterns in the annotation are often more relevant to study than the hidden structure itself. A typical HMM analysis consists of annotating the observed data using a decoding algorithm and then analyzing the annotation to study patterns of interest. For example, given an HMM modeling genes in DNA sequences, the focus is on occurrences of genes in the annotation. In this paper, we define a pattern through a regular expression and present a restriction of three classical algorithms that takes the number of occurrences of the pattern in the hidden sequence into account. We present a new algorithm to compute the distribution of the number of pattern occurrences, and we extend the two most widely used existing decoding algorithms to employ information from this distribution. We show experimentally that the expectation of this distribution gives a highly accurate estimate of the true number of pattern occurrences, while the typical procedure of decoding first and counting afterwards can be biased, in the sense that the number of pattern occurrences it identifies does not correspond to the true number. We furthermore show that using this distribution in the decoding algorithms improves the predictive power of the model.
