Protein coding region prediction based on the adaptive representation method

This article proposes a new technique for predicting protein-coding regions. The technique maps a DNA sequence to numerical sequences using an adaptive representation scheme and then applies signal processing to identify coding regions. The mapping from nucleotide symbols to numerical values is learned by computing the distribution variance of each nucleotide in the DNA sequence, and the period-3 spectrum is then used to distinguish coding from non-coding regions. Compared with other spectral methods, our method boosts the period-3 spectral peaks in putative protein-coding regions and attenuates extraneous peaks in putative non-coding regions by learning to weight the signal according to the C-G to A-T ratio. According to three different performance measures, our adaptive representation method outperforms all other state-of-the-art spectral methods on every available benchmark dataset.
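To make the period-3 idea concrete, the following is a minimal sketch, not the adaptive representation proposed in the article. It assumes the fixed binary indicator representation (one 0/1 sequence per nucleotide) and shows that the spectral power at frequency 1/3 can be obtained either from a DFT term or from the variance of each nucleotide's counts over the three codon positions. The function names, window length, and step size are illustrative assumptions, and the learned C-G to A-T weighting is omitted because the abstract does not specify it.

```python
import cmath

NUCLEOTIDES = "ACGT"


def period3_power_dft(window):
    """Sum over nucleotides of |X_b(1/3)|^2, where X_b is the DFT term at
    frequency 1/3 of the binary indicator sequence for nucleotide b."""
    total = 0.0
    for base in NUCLEOTIDES:
        term = sum(
            cmath.exp(-2j * cmath.pi * n / 3)
            for n, symbol in enumerate(window)
            if symbol == base
        )
        total += abs(term) ** 2
    return total


def period3_power_variance(window):
    """Same quantity computed from position counts: for each nucleotide,
    count its occurrences at positions 0, 1, 2 (mod 3) and take 1.5 times
    the sum of squared deviations of those counts from their mean."""
    total = 0.0
    for base in NUCLEOTIDES:
        counts = [0, 0, 0]
        for n, symbol in enumerate(window):
            if symbol == base:
                counts[n % 3] += 1
        mean = sum(counts) / 3
        total += 1.5 * sum((c - mean) ** 2 for c in counts)
    return total


def sliding_period3(seq, window=351, step=3):
    """Period-3 power along a sequence using a sliding window.
    The window and step values are illustrative defaults only."""
    seq = seq.upper()
    return [
        period3_power_variance(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]
```

Under these assumptions, peaks in the output of `sliding_period3` mark windows with strong 3-base periodicity, which spectral methods treat as candidate coding regions. The count-based form avoids complex arithmetic and makes the link to nucleotide distribution variance explicit.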
