From DNA to protein: Why genetic code context of nucleotides for DNA signal processing? A review

Abstract Protein coding regions are commonly blurred into non-coding regions by 1/f background noise, so that reliably discerning the two becomes cumbersome. Commonly employed digital signal processing (DSP) methods lack the fundamental genetic code context of nucleotides, since they treat a DNA signal as an ordinary digital signal that can be processed with traditional DSP tools and techniques. This paper reviews the prevailing approaches to protein coding region identification that are based on common DSP concepts, and highlights why genetic code context should be considered in any computational solution to this problem. Nucleotides in a DNA signal carry certain natural characteristics, e.g. presence in a distinctive triplet format, a distinct structure, a distribution of densities owned and shared among codons, fuzzy behaviors, semantic similarities, and an unbalanced nucleotide distribution that produces a relatively high bias in nucleotide usage in coding regions. Computational solutions for protein coding region identification that exploit these fundamental characteristics of nucleotides can significantly suppress signal noise and hence contribute to better identification.
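The contrast drawn above can be illustrated with a minimal Python sketch. It is not taken from the paper: it assumes that the "common DSP concept" is the widely used period-3 spectral measure computed on binary indicator sequences, and the function names (`indicator_sequences`, `period3_power`, `codon_position_counts`) are hypothetical. The first two functions treat the DNA string as an ordinary digital signal. The third tallies nucleotides by codon position, a simple stand-in for the triplet format and usage bias mentioned in the abstract, assuming the reading frame starts at index 0.

```python
import cmath

BASES = "ACGT"

def indicator_sequences(seq):
    """Map a DNA string to four binary indicator sequences, one per base."""
    seq = seq.upper()
    return {b: [1 if s == b else 0 for s in seq] for b in BASES}

def period3_power(seq):
    """Sum over the four bases of |DFT at frequency 1/3|^2 of each indicator sequence."""
    total = 0.0
    for u in indicator_sequences(seq).values():
        # DFT coefficient at normalized frequency 1/3 (i.e. k = N/3)
        coeff = sum(x * cmath.exp(-2j * cmath.pi * n / 3) for n, x in enumerate(u))
        total += abs(coeff) ** 2
    return total

def codon_position_counts(seq):
    """Count each base at codon positions 0, 1, 2 (frame assumed to start at index 0)."""
    counts = {b: [0, 0, 0] for b in BASES}
    for n, s in enumerate(seq.upper()):
        if s in counts:  # skip ambiguity symbols such as N
            counts[s][n % 3] += 1
    return counts

if __name__ == "__main__":
    periodic = "ATG" * 30   # toy sequence with perfect triplet structure
    flat = "A" * 90         # toy sequence with no triplet structure
    print(period3_power(periodic), period3_power(flat))
    print(codon_position_counts(periodic))
```

In this toy setting the period-3 power is large only when bases are unevenly distributed over the three codon positions, which is the information `codon_position_counts` exposes directly. Real sequences would need windowing and a noise treatment that this sketch leaves out.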
