Autoregressive Modeling and Feature Analysis of DNA Sequences

A parametric signal processing approach for DNA sequence analysis based on autoregressive (AR) modeling is presented. AR model residual errors and AR model parameters are used as features. The AR residual error analysis indicates a high specificity for coding DNA sequences, while AR feature-based analysis helps distinguish between coding and noncoding DNA sequences. An AR model-based string searching algorithm is also proposed. The effect of several types of numerical mapping rules on the proposed method is demonstrated.
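As a rough illustration of the kind of features described above, the following minimal sketch maps a DNA string to a real-valued signal, fits an AR model with the Yule-Walker equations, and returns the AR coefficients together with a relative residual power. The mapping values, the function names `map_sequence` and `ar_features`, the model order, and the choice of Yule-Walker estimation are all assumptions made for this example; the abstract does not state which mapping rules or estimator the authors used.

```python
import numpy as np

# Hypothetical numerical mapping rule (an assumption for illustration;
# the abstract does not specify which mappings were used).
MAPPING = {"A": 1.5, "T": -1.5, "C": 0.5, "G": -0.5}


def map_sequence(seq, mapping=MAPPING):
    """Convert a DNA string into a real-valued signal."""
    return np.array([mapping[base] for base in seq.upper()], dtype=float)


def ar_features(x, order):
    """Fit an AR(order) model with the Yule-Walker equations.

    Returns (coeffs, rel_err), where coeffs[k-1] multiplies x[n-k] and
    rel_err is the residual power divided by the signal power.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()  # remove the mean before estimating correlations
    n = len(x)
    if n <= order:
        raise ValueError("sequence must be longer than the model order")
    # Biased autocorrelation estimates r[0..order].
    r = np.array([np.dot(x[: n - k], x[k:]) / n for k in range(order + 1)])
    if r[0] == 0.0:
        raise ValueError("constant signal has no variance to model")
    # Toeplitz autocorrelation matrix R[i, j] = r[|i - j|].
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    R = r[idx]
    coeffs = np.linalg.solve(R, r[1:])
    rel_err = (r[0] - coeffs @ r[1:]) / r[0]
    return coeffs, rel_err


# Toy usage on a strictly period-3 string (not real genomic data).
periodic = map_sequence("ATG" * 100)
coeffs, rel_err = ar_features(periodic, order=3)
print(coeffs, rel_err)
```

In this toy setting, a strongly periodic string should give a much smaller relative residual than a randomly ordered one of similar length, which is the sense in which a residual error can act as a feature. How well this carries over to real coding and noncoding regions depends on the mapping and model order chosen.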
