Prediction of probable genes by Fourier analysis of genomic sequences

MOTIVATION: The major signal in coding regions of genomic sequences is a three-base periodicity. Our aim is to use Fourier techniques to analyse this periodicity, and thereby to develop a tool to recognize coding regions in genomic DNA.

RESULT: The three-base periodicity in the nucleotide arrangement appears as a sharp peak at frequency f = 1/3 in the Fourier (or power) spectrum. From extensive spectral analysis of DNA sequences of total length over 5.5 million base pairs from a wide variety of organisms (including the human genome), and by separately examining coding and non-coding sequences, we find that the relative height of the peak at f = 1/3 in the Fourier spectrum is a good discriminator of coding potential. We use this feature to detect probable coding regions in DNA sequences by examining the local signal-to-noise ratio of the peak within a sliding window. While the overall accuracy is comparable to that of other techniques currently in use, the proposed measure is independent of training sets and existing database information, and can thus find general application.

AVAILABILITY: A computer program, GeneScan, which locates coding open reading frames and exonic regions in genomic sequences, has been developed and is available on request.
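The sliding-window measure described in the abstract can be illustrated with a minimal sketch. This is not the GeneScan program; it is an illustration under stated assumptions. The function names `peak_to_noise` and `sliding_scores`, the use of four binary indicator sequences (one per base), the choice of noise level (mean power over all non-zero DFT frequencies), and the default window and step sizes are all assumptions made for this example and are not taken from the abstract.

```python
import cmath
import math

BASES = "ACGT"


def peak_to_noise(window):
    """Power at f = 1/3 divided by the mean power over non-zero frequencies.

    Assumption: the spectrum is the sum of the power spectra of four binary
    indicator sequences (one per base). The window length should be a
    multiple of 3 so that f = 1/3 is an exact DFT frequency.
    """
    seq = window.upper()
    n = len(seq)
    if n < 3:
        return 0.0
    # exp(-2*pi*i*pos/3) only depends on pos mod 3, so precompute 3 phases.
    phase = [cmath.exp(-2j * math.pi * r / 3) for r in range(3)]
    sums = {b: 0j for b in BASES}
    counts = {b: 0 for b in BASES}
    for pos, ch in enumerate(seq):
        if ch in sums:  # characters other than A, C, G, T are ignored
            sums[ch] += phase[pos % 3]
            counts[ch] += 1
    peak = sum(abs(z) ** 2 for z in sums.values())
    # Parseval: summed over all n DFT frequencies, the power of one indicator
    # sequence is n * count; the zero-frequency term is count ** 2.
    noise_total = n * sum(counts.values()) - sum(c * c for c in counts.values())
    if noise_total == 0:
        return 0.0
    return peak * (n - 1) / noise_total


def sliding_scores(seq, window=300, step=3):
    """Return (start, ratio) pairs for each window position.

    The default window and step are arbitrary illustrative values.
    """
    return [
        (start, peak_to_noise(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy input: a period-3 stretch flanked by period-4 stretches.
    toy = "ACGT" * 30 + "ATG" * 40 + "ACGT" * 30
    best_start, best_ratio = max(sliding_scores(toy, window=60), key=lambda t: t[1])
    print(best_start, round(best_ratio, 2))
```

On real genomic data, a ratio well above 1 in a window would suggest three-base periodicity there. Any threshold for calling a region coding would have to be chosen empirically; none is implied by this sketch.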
