Minimal auxiliary Markov chains through sequential elimination of states

ABSTRACT When an auxiliary Markov chain is used to compute the distribution of a pattern statistic, the computational complexity is directly related to the number of Markov chain states. Theory related to minimal deterministic finite automata has been applied to large state spaces to reduce the number of Markov chain states so that only a minimal set remains. In this paper, a characterization of equivalent states is given so that extraneous states are deleted while the state space is being formed, which improves computational efficiency. The theory extends the applicability of Markov chain-based methods for computing the distribution of pattern statistics.
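
To make the role of state equivalence concrete, the following is a minimal sketch and not the algorithm of this paper. It builds a deliberately redundant auxiliary chain for a single word in an i.i.d. sequence, merges equivalent states afterwards with Moore-style partition refinement (the post-hoc minimization that the abstract contrasts with elimination during construction), and checks that the occurrence probability is the same on both chains. The function names, the pattern "101", the binary alphabet, and the symbol probabilities are all illustrative assumptions.

```python
from itertools import product


def build_naive_dfa(pattern, alphabet):
    """Redundant automaton: one state per history of length < len(pattern), plus 'HIT'."""
    m = len(pattern)
    states = ["".join(t) for k in range(m) for t in product(alphabet, repeat=k)]
    states.append("HIT")  # absorbing state: the pattern has occurred
    delta = {}
    for s in states:
        for a in alphabet:
            if s == "HIT":
                delta[s, a] = "HIT"
                continue
            w = s + a
            if w.endswith(pattern):
                delta[s, a] = "HIT"
            else:
                keep = min(len(w), m - 1)  # remember at most m-1 symbols
                delta[s, a] = w[len(w) - keep:]
    return states, delta


def equivalence_classes(states, alphabet, delta, accepting):
    """Moore-style refinement: split blocks until successor blocks agree."""
    block = {s: int(s in accepting) for s in states}
    while True:
        ids = {}
        refined = {}
        for s in states:
            sig = (block[s],) + tuple(block[delta[s, a]] for a in alphabet)
            refined[s] = ids.setdefault(sig, len(ids))
        if len(set(refined.values())) == len(set(block.values())):
            return refined  # partition is stable
        block = refined


def quotient(states, alphabet, delta, block):
    """Collapse each equivalence class to a single state."""
    q_states = sorted(set(block.values()))
    q_delta = {(block[s], a): block[delta[s, a]] for s in states for a in alphabet}
    return q_states, q_delta


def hit_probability(states, alphabet, delta, start, accepting, probs, n):
    """Push the initial distribution through n i.i.d. symbols; return mass in accepting."""
    dist = {s: 0.0 for s in states}
    dist[start] = 1.0
    for _ in range(n):
        nxt = {s: 0.0 for s in states}
        for s, mass in dist.items():
            for a in alphabet:
                nxt[delta[s, a]] += mass * probs[a]
        dist = nxt
    return sum(dist[s] for s in accepting)


# Illustrative setup (assumed values, not taken from the paper).
alphabet = "01"
pattern = "101"
probs = {"0": 0.5, "1": 0.5}

states, delta = build_naive_dfa(pattern, alphabet)
block = equivalence_classes(states, alphabet, delta, {"HIT"})
q_states, q_delta = quotient(states, alphabet, delta, block)

p_full = hit_probability(states, alphabet, delta, "", {"HIT"}, probs, 20)
p_min = hit_probability(q_states, alphabet, q_delta, block[""], {block["HIT"]}, probs, 20)
print(len(states), len(q_states), p_full, p_min)
```

In this toy case the redundant chain has 8 states and the merged chain has 4, and both give the same probability that the word has occurred by position 20. The sketch forms the full state space before reducing it; the abstract's point is that a characterization of equivalent states makes it possible to avoid forming the extraneous states at all.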
