Phylogenetic analysis of DNA sequences with a novel characteristic vector

In basic biological research, one of the major tasks is to compare biological sequences in order to infer the evolutionary relationships among them. In this paper, we propose a novel characteristic vector of a DNA sequence for genetic sequence comparison and phylogenetic analysis; it takes into account both the positions and the counts of each k-word, as well as the random background. Each element of the vector characterizes the relative difference between the DNA sequence and a sequence generated by a (k − 2)th-order Markov process. We use the method to reconstruct the phylogenetic trees of 48 HEV (hepatitis E virus) sequences and of 20 eutherian mammals. The results show that the new method captures more information about k-words and improves the efficiency of sequence comparison.
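As a rough illustration of the "relative difference from a (k − 2)th-order Markov background" idea, the sketch below computes a count-based vector only. It assumes the standard Markov estimate of the expected k-word frequency, f(w1…wk−1) · f(w2…wk) / f(w2…wk−1), as used in composition-vector methods; the abstract does not give the paper's exact formula, and the positional information it mentions is not modelled here. The function names `word_freqs` and `relative_difference_vector` are hypothetical.

```python
from collections import Counter
from itertools import product


def word_freqs(seq, k):
    """Relative frequency of every overlapping k-word in seq."""
    n = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n))
    return {w: c / n for w, c in counts.items()}


def relative_difference_vector(seq, k):
    """Return a 4**k vector of (observed - expected) / expected.

    'expected' is the k-word frequency predicted from the (k-1)- and
    (k-2)-word frequencies, i.e. a (k-2)th-order Markov background.
    This is an assumed, count-only stand-in for the paper's vector.
    """
    if k < 3:
        raise ValueError("this sketch assumes k >= 3")
    fk = word_freqs(seq, k)
    fk1 = word_freqs(seq, k - 1)
    fk2 = word_freqs(seq, k - 2)
    vec = []
    # Enumerate all k-words over ACGT in a fixed (lexicographic) order.
    for letters in product("ACGT", repeat=k):
        w = "".join(letters)
        middle = fk2.get(w[1:-1], 0.0)
        if middle > 0:
            expected = fk1.get(w[:-1], 0.0) * fk1.get(w[1:], 0.0) / middle
        else:
            expected = 0.0
        observed = fk.get(w, 0.0)
        # Words with zero expected frequency contribute 0 by convention.
        vec.append((observed - expected) / expected if expected > 0 else 0.0)
    return vec


vec = relative_difference_vector("ACGTACGTAC", 3)
```

Two sequences could then be compared with any vector distance (for example a cosine-based one) to fill a distance matrix for tree building; which distance the paper uses is not stated in the abstract.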
