An Optimization to Protein Coding Regions Identification in Eukaryotes

Identifying coding regions in DNA sequences is an important and challenging optimization problem in bioinformatics. Several approaches have been proposed, but none is currently satisfactory. Here, the authors propose an optimization methodology to identify protein coding regions in eukaryotes. Reducing noise in the DNA signal indirectly overcomes the spectral leakage phenomenon. In contrast to the usual optimization methods, which rely on statistical and digital signal processing, the proposed methodology splits the optimization into two classes. Class one obtains a compact DNA signal with minimal spectral leakage by using a new indicator sequence, while class two reduces the 1/f background noise by employing wavelet transforms. Significant improvement in coding region identification was observed over many real datasets obtained from the National Center for Biotechnology Information (NCBI). Quantitatively, the authors observed a gain of 80.5% in coding identification with the Complex method, 42.5% with the Binary method, and 15% with the EIIP indicator sequence method on Mus musculus domesticus (house mouse), NCBI accession number NC_006914, gene length 7700 bp, with 4 coding regions. Further improvement is expected in future work using dyadic wavelet transforms.
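The Binary indicator sequence method named in the abstract as a baseline can be sketched as follows. This is a minimal illustration of the standard period-3 spectral measure, not the authors' new indicator sequence or their wavelet stage, neither of which is specified here. The function names, the window length of 351, and the step size are assumptions chosen for illustration.

```python
import numpy as np

def binary_indicators(seq):
    """Return the four binary indicator sequences (A, C, G, T) of a DNA string."""
    seq = seq.upper()
    return {b: np.array([1.0 if c == b else 0.0 for c in seq]) for b in "ACGT"}

def period3_power(seq, window=351, step=1):
    """Sliding-window spectral power at frequency 1/3, summed over the four bases.

    Hypothetical helper: `window` must be a multiple of 3 so that the
    frequency 1/3 falls exactly on a DFT bin (k = window / 3).
    """
    if window % 3 != 0:
        raise ValueError("window must be a multiple of 3")
    if len(seq) < window:
        raise ValueError("sequence is shorter than the window")
    indicators = binary_indicators(seq)
    # DFT kernel for the single bin k = window / 3.
    kernel = np.exp(-2j * np.pi * np.arange(window) / 3)
    starts = range(0, len(seq) - window + 1, step)
    power = np.zeros(len(starts))
    for i, s in enumerate(starts):
        for x in indicators.values():
            power[i] += abs(np.dot(x[s:s + window], kernel)) ** 2
    return power

if __name__ == "__main__":
    # Toy input: a perfectly 3-periodic stretch flanked by homopolymer runs.
    toy = "A" * 60 + "ATG" * 40 + "C" * 60
    p = period3_power(toy, window=30)
    print("peak window start:", int(np.argmax(p)))
```

Windows with strong 3-base periodicity yield high values, which is the property period-3 methods use to flag candidate coding regions. A rectangular window such as this one is a typical source of spectral leakage, and the raw curve on real sequences is noisy, which is the motivation the abstract gives for its two classes.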
