K-mer natural vector and its application to the phylogenetic analysis of genetic sequences.

Building on the well-known k-mer model, we propose a k-mer natural vector model that represents a genetic sequence by the numbers and distributions of the k-mers it contains. We show that there is a one-to-one correspondence between a genetic sequence and its associated k-mer natural vector. The k-mer natural vector method can be used easily and quickly to perform phylogenetic analysis of genetic sequences without requiring evolutionary models or human intervention. Whole or partial genomes can be handled more effectively with the proposed method. We apply it to the phylogenetic analysis of genetic sequences, and the results obtained demonstrate that the k-mer natural vector method is a powerful tool for analysing and annotating genetic sequences and for determining evolutionary relationships, in terms of both accuracy and efficiency.
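
To make the idea of "numbers and distributions of k-mers" concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give the exact definition, so the sketch assumes that each k-mer contributes three components: its count, its mean 1-based position, and a normalized second central moment of its positions. The function names `kmer_natural_vector` and `euclidean`, the normalization by the number of k-mer windows, and the use of Euclidean distance are all assumptions made for illustration. Because it stops at the second moment, this truncated version is not claimed to preserve the one-to-one correspondence stated above.

```python
from itertools import product


def kmer_natural_vector(seq, k=2, alphabet="ACGT"):
    """Illustrative vector of length 3 * len(alphabet)**k.

    Layout (an assumption, not taken from the paper): all counts,
    then all mean positions, then all normalized second moments,
    with k-mers in lexicographic order.
    """
    total = len(seq) - k + 1  # number of k-mer windows in the sequence
    positions = {"".join(p): [] for p in product(alphabet, repeat=k)}
    for i in range(total):
        kmer = seq[i:i + k]
        if kmer in positions:  # skip windows containing symbols such as N
            positions[kmer].append(i + 1)  # 1-based start position

    counts, means, moments = [], [], []
    for kmer in sorted(positions):
        pos = positions[kmer]
        n = len(pos)
        mu = sum(pos) / n if n else 0.0
        # Second central moment, scaled by count and number of windows
        # (hypothetical normalization chosen for this sketch).
        d2 = sum((p - mu) ** 2 for p in pos) / (n * total) if n else 0.0
        counts.append(float(n))
        means.append(mu)
        moments.append(d2)
    return counts + means + moments


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
```

Under these assumptions, pairwise distances between the vectors of several sequences form a distance matrix, which could then be passed to any distance-based tree-building method. No alignment or evolutionary model enters the computation, which is the property the abstract emphasizes.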
