A measure of DNA sequence similarity by Fourier Transform with applications on hierarchical clustering.

Multiple sequence alignment (MSA) is a prominent method for classifying DNA sequences, yet it is hampered by its inherent computational complexity. Alignment-free methods have been developed over the past decade to compare and classify DNA sequences more efficiently than MSA. However, most alignment-free methods rely on feature extraction and may therefore lose structural and functional information, so they may not fully reflect the actual differences among DNA sequences. Alignment-free methods that conserve information are needed for more accurate comparison and classification of DNA sequences. We propose a new alignment-free similarity measure for DNA sequences based on the Discrete Fourier Transform (DFT). In this method, we map each DNA sequence to four binary indicator sequences and apply the DFT to the indicator sequences to transform them into the frequency domain. The Euclidean distance between the full DFT power spectra of the DNA sequences is used as the distance metric. To compare the DFT power spectra of DNA sequences of different lengths, we propose an even scaling method that extends the shorter DFT power spectra to the length of the longest sequence compared. After even scaling, the DNA sequences are compared in a DFT frequency space of the same dimensionality. We assess the accuracy of the similarity metric in hierarchical clustering using simulated DNA sequences and virus sequences. The results demonstrate that the DFT-based method is an effective and accurate measure of DNA sequence similarity.
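
The pipeline described above (indicator sequences, DFT power spectrum, even scaling, Euclidean distance) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the scaling formula, so `even_scale` here assumes plain linear interpolation, and the per-sequence spectrum is assumed to be the sum of the four indicator power spectra. All function names are hypothetical.

```python
import cmath
import math

BASES = "ACGT"

def indicator_sequences(seq):
    """Map a DNA string to four binary indicator sequences, one per base."""
    seq = seq.upper()
    return {b: [1 if ch == b else 0 for ch in seq] for b in BASES}

def power_spectrum(signal):
    """Naive O(n^2) DFT power spectrum |X[k]|^2 for k = 0..n-1."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        x_k = sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(x_k) ** 2)
    return spectrum

def dna_power_spectrum(seq):
    """Assumption: sum the power spectra of the four indicator sequences."""
    spectra = [power_spectrum(u) for u in indicator_sequences(seq).values()]
    return [sum(vals) for vals in zip(*spectra)]

def even_scale(spectrum, m):
    """Stretch a spectrum of length n to length m >= n.

    Assumption: linear interpolation between neighbouring spectrum values;
    the abstract names an 'even scaling' step but does not define it.
    """
    n = len(spectrum)
    if n == 0 or m < n:
        raise ValueError("need a non-empty spectrum and m >= len(spectrum)")
    if n == m:
        return list(spectrum)
    scaled = []
    for k in range(m):
        pos = k * (n - 1) / (m - 1)       # fractional index into the original
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        scaled.append(spectrum[lo] + frac * (spectrum[hi] - spectrum[lo]))
    return scaled

def dft_distance(seq_a, seq_b):
    """Euclidean distance between evenly scaled power spectra."""
    m = max(len(seq_a), len(seq_b))
    pa = even_scale(dna_power_spectrum(seq_a), m)
    pb = even_scale(dna_power_spectrum(seq_b), m)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(pa, pb)))
```

Calling `dft_distance` on every pair of sequences would give a pairwise distance matrix, which could then be passed to any standard hierarchical clustering routine. For real genome-length inputs, the naive DFT loop would normally be replaced by an FFT.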
