Phylogenetic Tree Construction Using K-Mer Forest-Based Distance Calculation

Phylogenetics is one of the dominant data engineering research disciplines based on biological information. Here we consider raw DNA sequences and analyze them comparatively in order to draw conclusions about the organisms they come from. A phylogenetic tree is a concise representation of the evolutionary relationships among different organisms. The elementary step in constructing a phylogenetic tree is to calculate the genetic distance between species. Alignment-based and alignment-free sequence comparison are the two main distance computation approaches used to determine the genetic relatedness of different species. In this paper we propose a novel alignment-free, pairwise distance calculation method based on k-mers, together with a state-of-the-art, machine learning-based phylogenetic tree construction mechanism. The proposed approach converts long DNA sequences into compact k-mer forests, which improves the efficiency of comparison. We then construct the phylogenetic tree from the calculated distances using an algorithm built upon k-medoid clustering, which showed significant gains in efficiency and accuracy compared to traditional phylogenetic tree construction methods.
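To make the idea of an alignment-free, pairwise k-mer distance concrete, the following is a minimal sketch. It is not the k-mer forest method proposed here: it substitutes a plain Jaccard distance over k-mer sets, and the function names (`kmers`, `jaccard_distance`, `distance_matrix`) and the default `k=3` are illustrative assumptions. A matrix of this kind is the sort of input a k-medoid style clustering step could consume.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k=3):
    """Jaccard distance between the k-mer sets of two sequences.

    Assumption: a stand-in for the k-mer forest distance, which the
    abstract does not specify in detail.
    """
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        # Both sequences are shorter than k; treat them as identical.
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def distance_matrix(seqs, k=3):
    """Pairwise distances for a dict of name -> sequence.

    Returns a dict keyed by (name_a, name_b), symmetric, zero diagonal.
    """
    names = list(seqs)
    dist = {(n, n): 0.0 for n in names}
    for x, y in combinations(names, 2):
        d = jaccard_distance(seqs[x], seqs[y], k)
        dist[(x, y)] = dist[(y, x)] = d
    return dist
```

For example, `distance_matrix({"a": "ACGT", "b": "ACGA"}, k=2)` compares the 2-mer sets of the two toy sequences without aligning them.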
