Nonstationary Analysis of Coding and Noncoding Regions in Nucleotide Sequences

Previous statistical analyses of DNA sequences reported that noncoding regions exhibit long-range power-law correlations, whereas coding regions behave like random sequences or sustain only short-range correlations. Considerable debate has ensued over the presence or absence of long-range correlations in nucleotide sequences, and in coding regions in particular. These results were obtained using signal processing techniques designed for stationary signals, together with statistical tools for stationary signals with slowly varying trends superimposed. However, statistical tests show that genomic sequences are nonstationary, and that the nature of their nonstationarity varies and is often much more complex than a simple trend. In this paper, we bring to bear tools for analyzing nonstationary signals that have emerged in the statistics and signal processing communities over the past few years. We use these methods to shed new light on, and help resolve, two issues: i) whether long-range correlations exist in DNA sequences, and ii) whether they are present in both coding and noncoding segments or only in the latter. It turns out that the statistical differences between coding and noncoding segments are much more subtle than stationary analysis had suggested. In particular, both coding and noncoding sequences exhibit long-range correlations, as evidenced by a 1/f^{β(n)} evolutionary (i.e., time-dependent) spectrum. However, using an index of randomness that we derive from the Hilbert transform, we demonstrate that coding segments, although not random as previously suspected, are often "closer" to random sequences than noncoding segments. Moreover, we analytically justify the use of the Hilbert spectrum by proving that narrowband nonstationary signals result in a small demodulation error under the Hilbert transform.
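
As a rough illustration of the narrowband demodulation claim, the following minimal Python sketch builds a synthetic amplitude-modulated signal and recovers its envelope and instantaneous frequency from the analytic signal. It is not the authors' procedure and does not compute their index of randomness. The function names (`analytic_signal`, `demodulate`), the test signal, and all parameter values are assumptions chosen for the example, and only NumPy is assumed to be available.

```python
import numpy as np


def analytic_signal(x):
    """Return x + j*H[x], with H the discrete Hilbert transform computed via the FFT."""
    x = np.asarray(x, dtype=float)
    n = x.size
    spectrum = np.fft.fft(x)
    # Keep DC (and Nyquist, if present), double positive frequencies, zero negative ones.
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)


def demodulate(x):
    """Estimate the envelope and the instantaneous frequency (cycles per sample)."""
    z = analytic_signal(x)
    amplitude = np.abs(z)
    phase = np.unwrap(np.angle(z))
    inst_freq = np.diff(phase) / (2.0 * np.pi)
    return amplitude, inst_freq


# Hypothetical narrowband test signal: a slow envelope on a much faster carrier.
num_samples = 4096
n = np.arange(num_samples)
carrier_freq = 0.125  # cycles per sample (assumed value)
true_amplitude = 1.0 + 0.3 * np.cos(2.0 * np.pi * n / num_samples)
x = true_amplitude * np.cos(2.0 * np.pi * carrier_freq * n)

amplitude, inst_freq = demodulate(x)
max_amp_error = float(np.max(np.abs(amplitude - true_amplitude)))
max_freq_error = float(np.max(np.abs(inst_freq - carrier_freq)))
print("max envelope error:", max_amp_error)
print("max frequency error:", max_freq_error)
```

In this idealized case the envelope spectrum lies entirely below the carrier frequency and the signal is periodic over the analysis window, so the recovered envelope and frequency match the true ones up to floating-point error. Real nucleotide sequences would first need a numerical mapping and, typically, a decomposition into narrowband components before such a demodulation is meaningful.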
