A Fourier Characteristic of Coding Sequences: Origins and a Non-Fourier Approximation

The 3-base periodicity, seen as a pronounced peak at frequency N/3 (where N is the length of the DNA sequence) in the Fourier power spectrum of protein-coding regions, is used as a marker in gene-finding algorithms to distinguish protein-coding regions (exons) from noncoding regions (introns) of genomes. In this paper, we explain this phenomenon: it results from a nonuniform distribution of nucleotides across the three codon positions. The power spectrum at frequency N/3 is linearly correlated with the nucleotide distributions in the three codon positions. Furthermore, this study indicates how the length of a DNA sequence and the variance of its nucleotide distributions relate to the average Fourier power spectrum, which acts as the noise signal in gene-finding methods. These results provide an efficient way to compute both the Fourier power spectrum at N/3 and the noise signal in gene-finding methods from the nucleotide distributions in the three codon positions.
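As a minimal illustration of computing the N/3 peak from codon-position counts, the sketch below compares a direct discrete Fourier transform with a count-based shortcut. It rests on assumptions not stated in the abstract: each base is encoded as a binary indicator sequence, the power spectrum is unnormalized, and N is divisible by 3. Under those assumptions, the squared magnitude at N/3 for one base reduces to F1^2 + F2^2 + F3^2 - F1*F2 - F1*F3 - F2*F3, where F1, F2, F3 are that base's counts at the three codon positions. This is a standard identity for cube roots of unity and may differ in form or normalization from the paper's own expression. The function names are hypothetical.

```python
import cmath

BASES = "ACGT"


def position_counts(seq):
    """Count each base at the three codon positions (sequence index modulo 3)."""
    counts = {base: [0, 0, 0] for base in BASES}
    for i, base in enumerate(seq):
        if base in counts:
            counts[base][i % 3] += 1
    return counts


def power_at_one_third_from_counts(seq):
    """Power at N/3 from codon-position counts only (no Fourier transform).

    Assumption: binary indicator encoding per base, unnormalized spectrum.
    """
    total = 0
    for f1, f2, f3 in position_counts(seq).values():
        total += f1 * f1 + f2 * f2 + f3 * f3 - f1 * f2 - f1 * f3 - f2 * f3
    return total


def power_at_one_third_dft(seq):
    """Power at N/3 by direct DFT evaluation, summed over the four bases."""
    n = len(seq)
    if n == 0 or n % 3 != 0:
        raise ValueError("sequence length must be a positive multiple of 3")
    k = n // 3
    total = 0.0
    for base in BASES:
        # DFT coefficient of the indicator sequence of `base` at frequency k.
        coeff = sum(
            cmath.exp(-2j * cmath.pi * k * i / n)
            for i, symbol in enumerate(seq)
            if symbol == base
        )
        total += abs(coeff) ** 2
    return total
```

For the toy sequence "ATGATGATG", where each of A, T and G occupies a single codon position, both functions give 27. A sequence such as "AAA", whose bases are spread evenly over the three positions, gives 0. The count-based version needs one pass over the sequence and no complex arithmetic.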
