A new method to cluster DNA sequences using Fourier power spectrum

Abstract A novel clustering method is proposed to classify genes and genomes. For a given DNA sequence, a binary indicator sequence is constructed for each of the four nucleotides, and the discrete Fourier transform is applied to these four sequences to obtain their respective power spectra. Mathematical moments are computed from these spectra, and a multidimensional vector of real numbers is constructed from the moments. Cluster analysis is then performed on these vectors to determine the evolutionary relationships between DNA sequences. The novelty of this method is that sequences of different lengths can be compared easily, because the moments of the power spectra map every sequence to a vector of the same dimension. Experimental results on various datasets show that the proposed method provides an efficient tool to classify genes and genomes. It gives results comparable to those of multiple sequence alignment and other alignment-free methods, and it is remarkably faster.
