DIGITAL SEARCH TREES AND CHAOS GAME REPRESENTATION

In this paper, we consider a representation of a DNA sequence as a quaternary tree in which repetitions of subwords (seen as suffixes of subsequences) can be visualized. The CGR-tree turns a sequence of letters into a Digital Search Tree (DST) built from the suffixes of the reversed sequence. Several results are known on the height and the insertion depth of DSTs built from independent successive random sequences having the same distribution. Here, the successively inserted words are strongly dependent. We give the asymptotic behaviour of the insertion depth and of the length of branches for the CGR-tree obtained from the suffixes of a reversed i.i.d. or Markovian sequence. To first order, this behaviour turns out to be the same as in the case of independent words. As a by-product, we obtain asymptotic results on the length of the longest runs in a Markovian sequence.
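As an illustration of the construction described above, the following minimal sketch builds a DST from the suffixes of the reversed sequence, inserted from shortest to longest (equivalently, the reversed prefixes of the original sequence). The function name `build_cgr_tree`, the representation of nodes as a set of word labels, and the insertion order are assumptions made for this example, not notation taken from the paper.

```python
def build_cgr_tree(seq):
    """Sketch of a CGR-tree: a DST fed with the reversed prefixes of seq.

    Nodes are stored by their label, i.e. the word read along the path
    from the root; the root carries the empty word. Returns the set of
    node labels and the insertion depth of each successive word.
    """
    nodes = {""}          # the root is present before any insertion
    depths = []
    for n in range(1, len(seq) + 1):
        # n-th inserted word: U_n U_{n-1} ... U_1, the suffix of length n
        # of the reversed sequence (insertion order is an assumption).
        word = seq[:n][::-1]
        # Follow the word down the tree until the first free position.
        # A free position always exists: before this insertion the tree
        # has at most n - 1 non-root nodes, so its height is below n.
        for d in range(1, n + 1):
            if word[:d] not in nodes:
                nodes.add(word[:d])
                depths.append(d)
                break
    return nodes, depths


if __name__ == "__main__":
    nodes, depths = build_cgr_tree("GATTACA")
    print(sorted(nodes, key=lambda w: (len(w), w)))
    print("insertion depths:", depths)
    print("height:", max(depths))
```

For a DNA sequence the alphabet has four letters, so each node has at most four children and the tree is quaternary. Storing labels in a set keeps the sketch short; a node-and-pointer structure would avoid the repeated slicing on long sequences.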
