Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence.

With the exponential growth of genomic sequences, there is an increasing demand to accurately identify protein coding regions (exons) from genomic sequences. Despite many progresses being made in the identification of protein coding regions by computational methods during the last two decades, the performances and efficiencies of the prediction methods still need to be improved. In addition, it is indispensable to develop different prediction methods since combining different methods may greatly improve the prediction accuracy. A new method to predict protein coding regions is developed in this paper based on the fact that most of exon sequences have a 3-base periodicity, while intron sequences do not have this unique feature. The method computes the 3-base periodicity and the background noise of the stepwise DNA segments of the target DNA sequences using nucleotide distributions in the three codon positions of the DNA sequences. Exon and intron sequences can be identified from trends of the ratio of the 3-base periodicity to the background noise in the DNA sequences. Case studies on genes from different organisms show that this method is an effective approach for exon prediction.

[1]  Changchuan Yin,et al.  A Fourier Characteristic of Coding Sequences: Origins and a Non-Fourier Approximation , 2005, J. Comput. Biol..

[2]  Alan K. Mackworth,et al.  Evaluation of gene-finding programs on mammalian sequences. , 2001, Genome research.

[3]  P. Rouzé,et al.  Current methods of gene prediction, their strengths and weaknesses. , 2002, Nucleic acids research.

[4]  Jiao Jin,et al.  Identification of Protein Coding Regions of Rice Genes Using Alternative Spectral Rotation Measure and Linear Discriminant Analysis , 2004, Genomics, proteomics & bioinformatics.

[5]  W. Godwin Article in Press , 2000 .

[6]  R. Voss,et al.  Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. , 1992, Physical review letters.

[7]  M. Yan,et al.  A new fourier transform approach for protein coding measure based on the format of the Z curve , 1998, Bioinform..

[8]  E. Hill Journal of Theoretical Biology , 1961, Nature.

[9]  P. P. Vaidyanathan,et al.  The role of signal-processing concepts in genomics and proteomics , 2004, J. Frankl. Inst..

[10]  S. Karlin,et al.  Prediction of complete gene structures in human genomic DNA. , 1997, Journal of molecular biology.

[11]  Michael Ruogu Zhang,et al.  Identification of protein coding regions in the human genome by quadratic discriminant analysis. , 1997, Proceedings of the National Academy of Sciences of the United States of America.

[12]  R. Linsker,et al.  A measure of DNA periodicity. , 1986, Journal of theoretical biology.

[13]  James W. Fickett,et al.  The Gene Identification Problem: An Overview for Developers , 1995, Comput. Chem..

[14]  Wei Wang,et al.  Computing linear transforms of symbolic signals , 2002, IEEE Trans. Signal Process..

[15]  Yizhar Lavner,et al.  Gene prediction by spectral rotation measure: a new method for identifying protein-coding regions. , 2003, Genome research.

[16]  Michael Q. Zhang Computational prediction of eukaryotic protein-coding genes , 2002, Nature Reviews Genetics.

[17]  J. Fickett Recognition of protein coding regions in DNA sequences. , 1982, Nucleic acids research.

[18]  Dimitris Anastassiou,et al.  Frequency-domain analysis of biomolecular sequences , 2000, Bioinform..

[19]  Gerhard Kauer,et al.  Applying signal theory to the analysis of biomolecules , 2003, Bioinform..

[20]  R. Guigó,et al.  Evaluation of gene structure prediction programs. , 1996, Genomics.

[21]  Jianbo Gao,et al.  Protein Coding Sequence Identification by Simultaneously Characterizing the Periodic and Random Features of DNA Sequences , 2005, Journal of biomedicine & biotechnology.

[22]  P. Vandergheynst,et al.  Fourier and wavelet transform analysis, a tool for visualizing regular patterns in DNA sequences. , 2000, Journal of theoretical biology.

[23]  J. Fickett,et al.  Assessment of protein coding measures. , 1992, Nucleic acids research.

[24]  A A Tsonis,et al.  Periodicity in DNA coding sequences: implications in gene evolution. , 1991, Journal of theoretical biology.

[25]  V. Chechetkin,et al.  Size-dependence of three-periodicity and long-range correlations in DNA sequences , 1995 .

[26]  S. Tiwari,et al.  Prediction of probable genes by Fourier analysis of genomic sequences , 1997, Comput. Appl. Biosci..

[27]  Tin Wee Tan,et al.  Xpro: database of eukaryotic protein-encoding genes , 2004, Nucleic Acids Res..