Statistical properties of DNA sequences.

We review evidence that the DNA sequence in genes containing non-coding regions is correlated, and that the correlation is remarkably long range: nucleotides thousands of base pairs apart are correlated. We do not find such long-range correlations in the coding regions of the gene. We resolve the problem of the "non-stationarity" of the base-pair sequence by applying a new algorithm called detrended fluctuation analysis (DFA). We address the claim of Voss that there is no difference between the statistical properties of coding and non-coding regions of DNA by systematically applying the DFA algorithm, as well as standard FFT analysis, to every DNA sequence (33301 coding and 29453 non-coding) in the entire GenBank database. Finally, we briefly describe some recent work showing that non-coding sequences share certain statistical features with natural and artificial languages. Specifically, we adapt the Zipf approach for analyzing linguistic texts to DNA. These statistical properties of non-coding sequences support the possibility that non-coding regions of DNA may carry biological information.
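To make the DFA step concrete, here is a minimal sketch in Python of a first-order detrended fluctuation analysis applied to a "DNA walk". It is an illustration under stated assumptions rather than the authors' implementation: the purine/pyrimidine step rule, the box sizes, and all function names (`dna_walk`, `dfa_fluctuation`, `dfa_exponent`) are hypothetical choices made for this example. The sequence is converted to a cumulative walk, the walk is cut into boxes of length l, a least-squares line is removed from each box, and the root-mean-square residual F(l) is measured. The slope of log F(l) against log l estimates the scaling exponent; a value near 0.5 is what an uncorrelated sequence would give, while larger values would indicate long-range correlation.

```python
import math

PURINES = {"A", "G"}  # assumed step rule: purine -> +1, pyrimidine -> -1


def dna_walk(seq):
    """Cumulative 'DNA walk' built from the assumed +1/-1 step rule."""
    walk, total = [], 0
    for base in seq:
        total += 1 if base in PURINES else -1
        walk.append(total)
    return walk


def _residual_ss(ys):
    """Sum of squared residuals of a least-squares line fitted to ys."""
    n = len(ys)
    mean_x = (n - 1) / 2.0
    mean_y = sum(ys) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(ys))
    slope = sxy / sxx
    return sum((v - mean_y - slope * (i - mean_x)) ** 2
               for i, v in enumerate(ys))


def dfa_fluctuation(walk, box):
    """RMS deviation of the walk from its local linear trend, per box size."""
    n_boxes = len(walk) // box  # trailing points that do not fill a box are dropped
    ss = sum(_residual_ss(walk[k * box:(k + 1) * box]) for k in range(n_boxes))
    return math.sqrt(ss / (n_boxes * box))


def dfa_exponent(seq, boxes=(8, 16, 32, 64, 128)):
    """Least-squares slope of log F(l) versus log l over the given box sizes."""
    walk = dna_walk(seq)
    xs = [math.log(b) for b in boxes]
    fs = [math.log(dfa_fluctuation(walk, b)) for b in boxes]
    mean_x, mean_f = sum(xs) / len(xs), sum(fs) / len(fs)
    num = sum((x - mean_x) * (f - mean_f) for x, f in zip(xs, fs))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

Calling `dfa_exponent(sequence)` on a string over the alphabet A, C, G, T returns the estimated exponent. The detrending inside each box is what lets the method tolerate slowly varying local composition (the "non-stationarity" mentioned above), which a plain fluctuation analysis of the walk would mistake for correlation. The box sizes here are arbitrary defaults and would need to be chosen to suit the sequence length.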
