Digital Signal Processing in the Analysis of Genomic Sequences

Applications of Digital Signal Processing (DSP) in Bioinformatics have received considerable attention in recent years, leading to new and effective methods for genomic sequence analysis, such as the detection of coding regions. Applying DSP principles to genomic sequences requires an adequate numerical representation of the nucleotide bases, so that nucleotide sequences are converted into time series. Once this is done, the mathematical tools usually employed in DSP can be applied to tasks such as the identification of protein-coding DNA regions and of reading frames, among others. In this article we present an overview of the most relevant applications of DSP algorithms to the analysis of genomic sequences, show the main results obtained with these techniques, analyze their relative advantages and drawbacks, and provide relevant examples. Finally, we discuss some perspectives for DSP in Bioinformatics, considering recent research results on algebraic structures of the genetic code, which suggest further DSP applications in this field, as well as the new field of Genomic Signal Processing.
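
As a rough illustration of the numerical-representation step described above, the following sketch assumes one common choice of mapping, binary indicator sequences (one per base), and evaluates the DFT power at the bin that corresponds to a period of three bases. The mapping, the function names, and the toy sequence are assumptions made for illustration; they are not taken from the article, and real coding-region detectors typically add windowing and thresholding on top of this.

```python
import cmath

BASES = "ACGT"


def indicator_sequences(dna):
    """Map a DNA string to four binary indicator sequences, one per base.

    Assumed mapping for illustration: x_b[n] = 1 if position n holds base b, else 0.
    """
    dna = dna.upper()
    return {b: [1 if ch == b else 0 for ch in dna] for b in BASES}


def dft_power(x, k):
    """Squared magnitude of the k-th DFT coefficient of the sequence x."""
    n = len(x)
    coeff = sum(v * cmath.exp(-2j * cmath.pi * k * i / n) for i, v in enumerate(x))
    return abs(coeff) ** 2


def total_power(dna, k):
    """Sum of the DFT power at bin k over the four indicator sequences."""
    return sum(dft_power(x, k) for x in indicator_sequences(dna).values())


# Hypothetical toy input: one codon repeated, so the signal has exact period 3.
toy = "ATG" * 30
n = len(toy)

# Bin k = N/3 corresponds to a period of three samples (bases).
period3 = total_power(toy, n // 3)
neighbour = total_power(toy, n // 3 + 1)

print("power at k = N/3:    ", period3)
print("power at k = N/3 + 1:", neighbour)
```

On this artificial input the power concentrates at bin N/3 and is negligible at the neighbouring bin. Real genomic sequences show a much weaker and noisier version of this effect, which is why the choice of mapping and of spectral estimator matters.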
