Searching for Multiple Words in a Markov Sequence

The theory of the discrete-time Markovian arrival process (DMAP) can be applied to some statistical problems encountered when searching for multiple words in a Markov sequence. Such word searches are often emphasized in studies of the human genome. There are several advantages to the DMAP approach we present. Most notably, its derivations are transparent, and they readily unify disparate results about the exact distributions of overlapping and nonoverlapping word counts. We also present several examples and applications of our theory, including a numerical study using a random DNA dataset from the human genome.
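To make the distinction between overlapping and nonoverlapping word counts concrete, the following minimal Python sketch counts both for a set of words in a given sequence. It is only an illustration of the two counting conventions, not the DMAP method itself. The function name `count_words` and the nonoverlapping rule used here (scan left to right, count an occurrence when it completes, then restart the search) are assumptions chosen for the example.

```python
def count_words(seq, words):
    """Return (overlapping, nonoverlapping) counts of the given words in seq.

    Hypothetical helper for illustration only. Nonoverlapping counting
    follows a left-to-right renewal convention: once an occurrence is
    counted, the next counted occurrence must start at or after its end.
    """
    overlapping = 0
    nonoverlapping = 0
    next_free = 0  # earliest start index allowed for the next nonoverlapping hit
    for end in range(1, len(seq) + 1):
        # All words that end exactly at position `end` (exclusive index).
        hits = [w for w in words
                if end >= len(w) and seq[end - len(w):end] == w]
        overlapping += len(hits)
        # Count at most one nonoverlapping occurrence per end position.
        if any(end - len(w) >= next_free for w in hits):
            nonoverlapping += 1
            next_free = end
    return overlapping, nonoverlapping


if __name__ == "__main__":
    print(count_words("AAAA", ["AA"]))
    print(count_words("ACGTACG", ["ACG", "CGT"]))
```

In the sequence `AAAA`, the word `AA` occurs at three overlapping positions but only twice under the nonoverlapping convention, which is why the two counts have different distributions.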
