Numerical representation of DNA sequences based on genetic code context and its applications in periodicity analysis of genomes

Numerical representation of symbolic DNA sequences is an indispensable prerequisite for characterizing the information content of DNA molecules by computational methods. Current numerical representation methods for DNA sequences do not capture genetic code context, which may play an important role in defining protein coding regions. We propose a novel numerical representation of DNA sequences based on the genetic code context within the sequences and explore the feasibility of applying this method to identify protein coding regions in genomes. Computational experiments indicate that incorporating genetic code information into numerical representations is a promising approach: DNA sequences are represented uniquely and with more information, so that digital processing tools can be applied effectively to the periodicity analysis of DNA sequences.
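For orientation, the kind of periodicity analysis mentioned above can be sketched as follows. This is only an illustration under stated assumptions: it uses the widely known four-indicator (binary) mapping of A, C, G and T, not the genetic-code-context representation proposed here, whose details are not given in this abstract. The function names `indicator_sequences` and `power_at_period` are hypothetical.

```python
import cmath


def indicator_sequences(seq):
    """Map a DNA string to four binary indicator sequences (one per base).

    Assumption: this is the common binary indicator mapping, used here as a
    stand-in; it is not the genetic-code-context representation.
    """
    seq = seq.upper()
    return {base: [1 if ch == base else 0 for ch in seq] for base in "ACGT"}


def power_at_period(seq, period=3):
    """Summed Fourier power of the four indicator sequences at a given period.

    For each indicator sequence u, the coefficient is
    sum_n u[n] * exp(-2*pi*i*n / period); the squared magnitudes of the
    four coefficients are added together.
    """
    total = 0.0
    for u in indicator_sequences(seq).values():
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * n / period) for n, x in enumerate(u)
        )
        total += abs(coeff) ** 2
    return total


if __name__ == "__main__":
    periodic = "ATG" * 10   # strict 3-base repeat
    uniform = "A" * 30      # no 3-base structure
    print(power_at_period(periodic), power_at_period(uniform))
```

A strictly 3-periodic sequence concentrates power at period 3, while a homopolymer of the same length contributes essentially none. In practice the power is usually computed in a sliding window along the genome so that peaks can be compared against the background.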
