Waiting times for clumps of patterns and for structured motifs in random sequences

This paper provides exact probability results for waiting times associated with occurrences of two types of motifs in a random sequence. First, we provide an explicit expression for the probability generating function of the interarrival time between two clumps of a pattern. In particular, this expression makes it possible to measure the quality of the Poisson approximation that is currently used to evaluate the distribution of the number of clumps of a pattern. Second, we provide explicit expressions for the probability generating functions of both the waiting time until the first occurrence of a structured motif and the interarrival time between its consecutive occurrences. Distributional results for structured motifs are of interest in genome analysis because such motifs are promoter candidates. As an application, we determine significant structured motifs in a data set of DNA regulatory sequences.
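To make the notion of a clump concrete, the following minimal Python sketch locates clumps of a pattern in a sequence and computes the gaps between them. It assumes the usual definition of a clump as a maximal run of overlapping occurrences of the pattern. The function names (`occurrence_starts`, `clump_starts`, `clump_gaps`) are hypothetical, and measuring the interarrival time as the distance between the start positions of consecutive clumps is an illustrative convention that may differ from the one used in the paper. The sketch is an empirical illustration only and does not implement the paper's generating-function results.

```python
import random


def occurrence_starts(seq, pattern):
    """Return start indices of all (possibly overlapping) occurrences."""
    m = len(pattern)
    return [i for i in range(len(seq) - m + 1) if seq[i:i + m] == pattern]


def clump_starts(seq, pattern):
    """Return the start index of each clump.

    Assumption: a clump is a maximal run of occurrences in which each
    occurrence overlaps the previous one, i.e. starts fewer than
    len(pattern) positions after it.
    """
    starts = []
    prev = None
    for i in occurrence_starts(seq, pattern):
        if prev is None or i - prev >= len(pattern):
            starts.append(i)  # no overlap with the previous occurrence
        prev = i
    return starts


def clump_gaps(seq, pattern):
    """Distances between the starts of consecutive clumps (illustrative)."""
    s = clump_starts(seq, pattern)
    return [b - a for a, b in zip(s, s[1:])]


# Small deterministic example: occurrences of "AA" start at 0, 1, 2, 5, 8.
# The first three overlap and form one clump.
example = "AAAACAAGAA"
print(clump_starts(example, "AA"))
print(clump_gaps(example, "AA"))

# Empirical gaps in an i.i.d. uniform DNA sequence (assumed model).
rng = random.Random(0)
dna = "".join(rng.choice("ACGT") for _ in range(10_000))
gaps = clump_gaps(dna, "ACA")
if gaps:
    print(sum(gaps) / len(gaps))  # sample mean gap between clumps
```

An empirical histogram of such gaps could be compared against a geometric distribution to get an informal sense of how well a Poisson-type approximation for the clump count fits a given pattern.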
