An Overview on the Distribution of Word Counts in Markov Chains

In this paper, we give an overview of existing results on the statistical distribution of word counts in a Markovian sequence of letters. We present results on the number of overlapping occurrences, the number of renewals, and the number of clumps, for counts of single words as well as multiple words. Most of the results are approximations that hold as the length of the sequence tends to infinity. We will see that Gaussian approximations give way to (compound) Poisson approximations for rare words. When DNA sequences or proteins are modeled by stationary Markov chains, these results can be used to study the statistical frequency of motifs in a given sequence.
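To make the three kinds of counts concrete, the following minimal Python sketch computes them for a single word in a fixed sequence. It assumes the usual conventions: an overlapping occurrence is any position where the word starts, a renewal is an occurrence that does not overlap the previous renewal (scanning left to right), and a clump is a maximal run of mutually overlapping occurrences. The function names are illustrative and do not come from the paper.

```python
def occurrence_starts(seq, word):
    """Start positions of all (possibly overlapping) occurrences of word in seq."""
    m = len(word)
    return [i for i in range(len(seq) - m + 1) if seq[i:i + m] == word]


def count_renewals(starts, m):
    """Count occurrences that do not overlap the previously counted renewal."""
    count = 0
    next_free = 0  # first position not covered by the latest renewal
    for s in starts:
        if s >= next_free:
            count += 1
            next_free = s + m
    return count


def count_clumps(starts, m):
    """Count maximal runs of overlapping occurrences.

    A new clump begins whenever an occurrence does not overlap the
    occurrence immediately before it.
    """
    count = 0
    prev = None
    for s in starts:
        if prev is None or s >= prev + m:
            count += 1
        prev = s
    return count


if __name__ == "__main__":
    seq, word = "aaaaa", "aa"
    starts = occurrence_starts(seq, word)
    print(len(starts), count_renewals(starts, len(word)), count_clumps(starts, len(word)))
```

For the word "aa" in "aaaaa", the four occurrences overlap one another in a chain, so they form a single clump, while only two of them are renewals. For a word that cannot overlap itself, the three counts coincide.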
