Numerical encoding of DNA sequences by chaos game representation with application in similarity comparison.

Numerical encoding, in which numerical values are assigned to the symbolic characters of a sequence, plays an important role in the computational analysis of DNA. Once a sequence is represented numerically, digital signal processing methods can be used to analyze it. For the encoding to reflect the biological properties of the original sequence, it is vital that the representation be one-to-one. Chaos Game Representation (CGR) is an iterative mapping technique that assigns each nucleotide in a DNA sequence to a position in the plane, so that the sequence can be depicted as an image. Using CGR, a biological sequence can be transformed one-to-one into a numerical sequence that preserves the main features of the original. In this research, we propose to encode DNA sequences by treating their 2D CGR coordinates as complex numbers, and we apply digital signal processing methods to analyze the evolutionary relationships among the encoded sequences. Computational experiments indicate that this approach gives results comparable to those of the state-of-the-art multiple sequence alignment method, Clustal Omega, and is significantly faster. The MATLAB code for our method can be accessed from: www.mathworks.com/matlabcentral/fileexchange/57152.
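As a rough illustration of the encoding idea, the following Python sketch maps a DNA sequence to a series of complex numbers by the usual CGR midpoint rule and then recovers the sequence from the final point alone, which is what makes the mapping one-to-one. It is not the authors' MATLAB implementation. The corner assignment (A, C, G, T on the corners of the unit square), the starting point at the center, and the function names `cgr_complex` and `cgr_decode` are all assumptions made for this example; the abstract does not specify them.

```python
# Assumed corner layout on the unit square (the abstract does not give one):
# A = (0,0), C = (0,1), G = (1,1), T = (1,0), written as complex numbers.
CORNERS = {"A": 0 + 0j, "C": 0 + 1j, "G": 1 + 1j, "T": 1 + 0j}
BASES = {corner: base for base, corner in CORNERS.items()}


def cgr_complex(seq, start=0.5 + 0.5j):
    """Return the CGR trajectory of `seq` as a list of complex numbers."""
    points = []
    z = start
    for base in seq.upper():
        # Move halfway from the current point toward the nucleotide's corner.
        z = 0.5 * (z + CORNERS[base])
        points.append(z)
    return points


def cgr_decode(z, length):
    """Recover a sequence of the given length from its final CGR point."""
    bases = []
    for _ in range(length):
        # The quadrant containing z identifies the most recent nucleotide.
        corner = complex(1 if z.real > 0.5 else 0, 1 if z.imag > 0.5 else 0)
        bases.append(BASES[corner])
        # Undo the midpoint step to get the previous point.
        z = 2 * z - corner
    return "".join(reversed(bases))


points = cgr_complex("ACGT")
print(points[-1])                 # (0.78125+0.40625j)
print(cgr_decode(points[-1], 4))  # ACGT
```

In double-precision arithmetic the decoding step is exact only for short sequences, since each nucleotide consumes one bit of precision per coordinate; the complex-valued series returned by `cgr_complex` is the kind of signal that could then be passed to a digital signal processing routine.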
