A Novel Method of Characterizing Genetic Sequences: Genome Space with Biological Distance and Applications

Background: Most existing methods for phylogenetic analysis involve developing an evolutionary model and then using a computational algorithm to perform multiple sequence alignment. There are two problems with this approach: (1) different evolutionary models can lead to different results, and (2) the computation time required for multiple alignments makes it impossible to analyse the phylogeny of a whole genome. This motivates us to create a new approach to characterizing genetic sequences.

Methodology: To each DNA sequence, we associate a natural vector based on the distributions of its nucleotides. This produces a one-to-one correspondence between the DNA sequence and its natural vector. We define the distance between two DNA sequences to be the distance between their associated natural vectors. This creates a genome space with a biological distance, which makes global comparison of genomes with the same topology possible. We use our proposed method to analyze the genomes of the new influenza A (H1N1) virus, human rhinoviruses (HRV) and mammalian mitochondrial genomes. The results show that a triple-reassortant swine virus circulating in North America and the Eurasian swine virus belong to the lineage of the influenza A (H1N1) virus. For the HRV and mammalian mitochondrial genomes, the results coincide with biologists' analyses.

Conclusions: Our approach provides a powerful new tool for analyzing and annotating genomes and their phylogenetic relationships. Whole or partial genomes can be handled more easily and more quickly than with multiple alignment methods. Once a genome space has been constructed, it can be stored in a database. There is no need to reconstruct the genome space for subsequent applications, whereas multiple alignment methods require realignment whenever new sequences are added. Furthermore, one can make a global comparison of all genomes simultaneously, which no other existing method can achieve.
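As an illustration of the methodology, the following is a minimal Python sketch of a natural-vector style embedding and the distance it induces. The abstract does not list the components of the natural vector, so the sketch assumes, for each nucleotide, its count, its mean position, and a scaled second central moment of its positions; the Euclidean metric is likewise an assumption. The names `natural_vector`, `distance` and `genome_space`, and the toy sequences, are hypothetical. Because the vector is truncated at the second moment, this sketch does not by itself demonstrate the one-to-one correspondence stated above.

```python
import math

NUCLEOTIDES = "ACGT"


def natural_vector(seq):
    """Map a DNA string to a 12-dimensional vector.

    Assumed layout (not given in the abstract): four counts, then four
    mean positions, then four scaled second central moments, in the
    order A, C, G, T. Positions are 1-based.
    """
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in NUCLEOTIDES:
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            # Convention for an absent nucleotide: all three entries are zero.
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        # Second central moment, scaled by n_k * n (an assumed normalization).
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


def distance(u, v):
    """Euclidean distance between two natural vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# A toy "genome space": each sequence is embedded once and stored.
genome_space = {
    "seq1": natural_vector("ACGTACGTAC"),
    "seq2": natural_vector("ACGTACGTTC"),
}

# Adding a sequence only requires computing its own vector;
# the vectors already stored are left untouched.
genome_space["seq3"] = natural_vector("TTTTGGGGCC")

# Pairwise distances can then be read off for any pair of stored sequences.
pairwise = {
    (a, b): distance(genome_space[a], genome_space[b])
    for a in genome_space
    for b in genome_space
    if a < b
}
```

Each vector depends only on its own sequence, which is why a stored genome space can be extended without recomputation, in contrast to the realignment that alignment-based pipelines need. The resulting `pairwise` distances could be passed to any standard clustering or tree-building routine.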
