A compound Poisson model for word occurrences in DNA sequences

We present a compound Poisson model for the occurrence process of a set of words in a random sequence of letters. The model accounts for both the frequency of the words and their overlapping structure. We compare it with a Markov chain model in terms of fit and parsimony, with special attention to the detection of regions that are poor or rich in word occurrences. Several applications of the model are presented, and a combination of the Markov and compound Poisson models is proposed. Copyright 2002 Royal Statistical Society.
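
To make the idea of a compound Poisson count concrete, here is a minimal simulation sketch. It is not the authors' implementation. It assumes the simplest single-word setting: clumps of overlapping occurrences arrive as a Poisson number with mean `clump_rate`, and each clump contains a geometric number of occurrences governed by a self-overlap probability `overlap_prob`. Both parameter names and all function names are hypothetical, chosen for illustration only.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw a Poisson(lam) variate with Knuth's multiplication method.

    Adequate for the small rates used in this sketch.
    """
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_clump_size(rng, overlap_prob):
    """Draw a clump size K >= 1 with P(K = k) = (1 - a) * a**(k - 1).

    Here a = overlap_prob is the assumed probability that an occurrence
    is followed by another, overlapping occurrence of the word.
    """
    size = 1
    while rng.random() < overlap_prob:
        size += 1
    return size


def sample_word_count(rng, clump_rate, overlap_prob):
    """Draw a word count: a Poisson number of clumps, each of geometric size."""
    n_clumps = sample_poisson(rng, clump_rate)
    return sum(sample_clump_size(rng, overlap_prob) for _ in range(n_clumps))


def expected_word_count(clump_rate, overlap_prob):
    """Mean of the compound count: E[clumps] * E[clump size]."""
    return clump_rate / (1.0 - overlap_prob)
```

With `overlap_prob = 0` every clump has size one and the count reduces to a plain Poisson variable. A word that overlaps itself strongly gives larger clumps and therefore a more dispersed count for the same clump rate.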
