Phylogenetic Analysis of HIV-1 Genomes Based on the Position-Weighted K-mers Method

HIV-1 viruses, the predominant members of the HIV family, are highly pathogenic and infectious, and they can evolve into many different variants in a very short time. In this study, we propose a new and effective alignment-free method for the phylogenetic analysis of HIV-1 viruses using complete genome sequences. The method combines the positional distribution of k-mers with their counts, and we also propose a metric for determining the optimal value of k. We name the method Position-Weighted k-mers (PWkmer). Validation and comparison with the Robinson–Foulds distance method and the modified bootstrap method on a benchmark dataset show that PWkmer is reliable for the phylogenetic analysis of HIV-1 viruses. It can also resolve within-group variation among the known subtypes of HIV-1 Group M. The method is simple and computationally fast for whole-genome phylogenetic analysis.
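To make the idea of combining k-mer counts with position information concrete, the following is a minimal Python sketch. It is not the PWkmer formula, which the abstract does not state. The weighting used here (each occurrence contributes its normalized 1-based position, and the total is divided by the number of k-mer positions), the function names `position_weighted_kmers` and `manhattan`, and the choice of Manhattan distance are all assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import product


def position_weighted_kmers(seq, k):
    """Return a 4**k feature vector that mixes k-mer counts and positions.

    Assumed weighting (hypothetical, not taken from the paper): an
    occurrence of a k-mer at 1-based position p contributes p / n, where
    n = len(seq) - k + 1 is the number of k-mer positions. The summed
    contributions for each k-mer are then divided by n.
    """
    seq = seq.upper()
    n = len(seq) - k + 1
    if n <= 0:
        raise ValueError("sequence is shorter than k")
    sums = defaultdict(float)
    for i in range(n):
        kmer = seq[i:i + k]
        # Skip k-mers containing ambiguous bases such as N.
        if set(kmer) <= set("ACGT"):
            sums[kmer] += (i + 1) / n
    # Fixed lexicographic order so vectors from different genomes align.
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return [sums[m] / n for m in all_kmers]


def manhattan(u, v):
    """Manhattan (taxicab) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


if __name__ == "__main__":
    # Same base counts, different positions: a pure count vector would
    # give distance 0 here, while the position-aware vector does not.
    a = position_weighted_kmers("AACC", 1)
    b = position_weighted_kmers("CCAA", 1)
    print(manhattan(a, b))
```

In a full pipeline, the pairwise distances between genomes would be assembled into a distance matrix and passed to a tree-building method; that step is outside this sketch.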
