An efficient sliding window strategy for accurate location of eukaryotic protein coding regions

The sliding window is one of the key factors affecting the accuracy of coding region prediction and location in methods based on the power spectrum technique. Selecting an appropriate sliding step and window length for different organisms is difficult. In this study, we propose a novel sliding window strategy, based on power spectrum analysis, for accurately locating eukaryotic protein coding regions. The strategy is simple, and the sliding step of the window is variable rather than fixed. In our tests, the average location error of the novel method is 12 bases, compared with the previous location error of 54 bases obtained with a fixed sliding step, so location accuracy improves substantially. Furthermore, the novel strategy requires much less CPU time than the fixed-step strategy, so its computational cost is also considerably lower.
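
For orientation, the fixed-step baseline that the abstract compares against can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the common period-3 formulation, in which the power at frequency 1/3 is summed over the four base indicator sequences. The function names, the window length of 351 and the step of 3 are hypothetical choices. The variable-step rule of the proposed strategy is not described in the abstract, so it is not reproduced here.

```python
import cmath


def period3_power(window):
    """Sum of squared DFT magnitudes at frequency 1/3 over the A, C, G, T
    indicator sequences of `window` (assumed period-3 measure)."""
    total = 0.0
    for base in "ACGT":
        coeff = 0j
        for n, ch in enumerate(window):
            if ch == base:
                # Contribution of position n to the DFT coefficient at k = N/3
                coeff += cmath.exp(-2j * cmath.pi * n / 3)
        total += abs(coeff) ** 2
    return total


def sliding_period3(seq, window=351, step=3):
    """Return (start, power) pairs for a window moved by a FIXED step.

    `window` and `step` are illustrative defaults, not values from the paper.
    """
    seq = seq.upper()
    return [
        (start, period3_power(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy input: a perfectly 3-periodic stretch followed by a homogeneous one.
    toy = "ATG" * 150 + "A" * 450
    for start, power in sliding_period3(toy, window=150, step=150):
        print(start, round(power, 3))
```

In such a scheme a smaller step gives finer boundary resolution at the cost of more windows to evaluate, which is the trade-off between location error and CPU time that the abstract refers to.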
