A new distribution vector and its application in genome clustering.

In this paper we report a novel mathematical method that transforms DNA sequences into distribution vectors, which correspond to points in sixty-dimensional space. Each component of a distribution vector represents the distribution of one kind of nucleotide in k segments of the DNA sequence. The mathematical and statistical properties of the distribution vectors are demonstrated and examined with large datasets of human DNA sequences and random sequences. The determined expectation and standard deviation make the mapping stable and practicable. Moreover, we apply the distribution vectors to the clustering of the haemagglutinin (HA) gene of 60 H1N1 viruses from human, swine and avian hosts, the complete mitochondrial genomes of 80 placental mammals, and the complete genomes of 50 bacteria. The 60 H1N1 viruses, 80 placental mammals and 50 bacteria are classified accurately, and rapidly in comparison with multiple sequence alignment methods. The results indicate that the distances between pairs of distribution vectors can reveal the similarity and evolutionary relationships among homologous DNA sequences. Because they are fast to compute, the distribution vectors can handle very large numbers of DNA sequences efficiently.
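The abstract does not spell out how the sixty components are built, so the following Python sketch is only one plausible reading, not the authors' definition. It assumes that for each k from 1 to 5 the sequence is cut into k nearly equal segments (15 segments in total) and that each component is the fraction of one nucleotide within one segment, giving 4 x 15 = 60 components. The function names `distribution_vector` and `euclidean_distance`, the `max_k` parameter, and the choice of Euclidean distance are all illustrative assumptions.

```python
from math import sqrt

NUCLEOTIDES = "ACGT"


def distribution_vector(seq, max_k=5):
    """Map a DNA sequence to a vector of per-segment nucleotide fractions.

    ASSUMPTION (not taken from the paper): for k = 1..max_k the sequence is
    split into k nearly equal segments, and each component is the fraction
    of one nucleotide inside one segment. With max_k = 5 this gives
    4 * (1 + 2 + 3 + 4 + 5) = 60 components.
    """
    seq = seq.upper()
    n = len(seq)
    vec = []
    for base in NUCLEOTIDES:
        for k in range(1, max_k + 1):
            for j in range(k):
                # Integer arithmetic keeps segment boundaries contiguous.
                start = j * n // k
                end = (j + 1) * n // k
                segment = seq[start:end]
                # Empty segments (very short input) contribute 0.0.
                frac = segment.count(base) / len(segment) if segment else 0.0
                vec.append(frac)
    return vec


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    a = distribution_vector("ACGTACGTACGTTTGA")
    b = distribution_vector("ACGTACGAACGTTTGA")
    print(len(a))
    print(euclidean_distance(a, b))
```

Under these assumptions every sequence, whatever its length, maps to a vector of the same fixed size, so pairwise distances can be computed without aligning the sequences. A distance matrix built this way could then be passed to any standard clustering or tree-building routine.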
