A Novel Approach to Clustering Genome Sequences Using Inter-nucleotide Covariance

Classification of DNA sequences is an important problem in bioinformatics, yet most existing methods for phylogenetic analysis, including Multiple Sequence Alignment (MSA), are time-consuming and computationally expensive. Alignment-free methods have become popular, but many require manual intervention, which tends to reduce accuracy. Most methods also neglect the interactions among nucleotides. Here we propose a new Accumulated Natural Vector (ANV) method, which represents each DNA sequence as a point in ℝ¹⁸. From the Accumulated Indicator Functions of the nucleotides, we derive an Accumulated Natural Vector for each sequence. This vector not only captures the distribution of each nucleotide but also provides the covariance among nucleotides. Global comparison of DNA sequences or genomes can thus be carried out easily in ℝ¹⁸. Tests on datasets of different sizes and types demonstrate the accuracy and time efficiency of the proposed ANV method.
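To make the idea of an 18-dimensional, covariance-aware representation concrete, here is a minimal Python sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration: the accumulated indicator is taken to be the running count of a nucleotide along the sequence, and the 18 components are laid out as 4 counts, 4 means, 4 variances, and 6 pairwise covariances of those running counts. The function names (`accumulated_indicator`, `sketch_vector`, `euclidean`) are hypothetical, and the normalization may differ from the definitions used in the paper.

```python
from itertools import combinations

NUCLEOTIDES = "ACGT"


def accumulated_indicator(seq, nucleotide):
    """Running count of `nucleotide` at each position (assumed definition)."""
    total = 0
    out = []
    for base in seq:
        total += base == nucleotide  # adds 1 on a match, 0 otherwise
        out.append(total)
    return out


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Population covariance of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def sketch_vector(seq):
    """Illustrative 18-component vector: 4 counts, 4 means, 4 variances, 6 covariances."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    acc = {k: accumulated_indicator(seq, k) for k in NUCLEOTIDES}
    counts = [float(seq.count(k)) for k in NUCLEOTIDES]
    means = [mean(acc[k]) for k in NUCLEOTIDES]
    variances = [covariance(acc[k], acc[k]) for k in NUCLEOTIDES]
    covariances = [
        covariance(acc[a], acc[b]) for a, b in combinations(NUCLEOTIDES, 2)
    ]
    return counts + means + variances + covariances


def euclidean(u, v):
    """Euclidean distance between two vectors of the same length."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


if __name__ == "__main__":
    v1 = sketch_vector("ACGTACGTTG")
    v2 = sketch_vector("ACGTTCGATG")
    print(len(v1), euclidean(v1, v2))
```

Under these assumptions each sequence is processed in a single pass per nucleotide, and two sequences of different lengths can be compared directly through the distance between their vectors, with no alignment step.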
