Alignment-free Whole Genome Comparison Using k-mer Forests

In evolutionary biology, phylogenetics is one of the main research disciplines. It relies on comparative data, mainly DNA sequences or raw sequencing reads. Alignment-based and alignment-free comparison are the two main approaches to computing sequence similarity, and both are used to estimate the genetic relatedness of different species. Alignment-based methods are relatively complex and become computationally challenging as genome size grows, as with mammalian datasets and complex metagenomic colonies. Moreover, they show poor accuracy in certain genetic comparisons because of misalignments and algorithmic tolerances. Alignment-free comparison methods address most of these challenges and can perform considerably better in genetic distance computation. In this paper, we propose a novel alignment-free, pairwise distance calculation method based on k-mers. We convert long DNA sequences into simplified k-mer forest structures, which makes comparison more convenient. Further, we use a specialized tree pruning approach, which considerably reduces tree comparison time compared to other alignment-free methods.
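The general idea of comparing sequences through pruned k-mer trees can be illustrated with a minimal Python sketch. The abstract does not specify the forest layout, the pruning rule, or the distance measure, so everything here is an assumption made for illustration: the forest is taken to be one prefix tree per leading base, pruning drops subtrees whose k-mer count falls below a hypothetical `min_count` threshold, and a Jaccard distance over the remaining root-to-leaf paths stands in for the actual metric. The function names are likewise hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq, skipping k-mers with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def build_forest(counts):
    """Assumed layout: one prefix tree per leading base.

    Each node is {"count": int, "children": dict}, where count is the total
    number of k-mer occurrences that pass through the node.
    """
    forest = {}
    for kmer, n in counts.items():
        level = forest
        for base in kmer:
            node = level.setdefault(base, {"count": 0, "children": {}})
            node["count"] += n
            level = node["children"]
    return forest


def prune(forest, min_count):
    """Hypothetical pruning rule: drop any subtree seen fewer than min_count times."""
    return {
        base: {"count": node["count"],
               "children": prune(node["children"], min_count)}
        for base, node in forest.items()
        if node["count"] >= min_count
    }


def leaf_paths(forest, prefix=""):
    """Collect the root-to-leaf strings that remain in a (possibly pruned) forest."""
    paths = set()
    for base, node in forest.items():
        if node["children"]:
            paths |= leaf_paths(node["children"], prefix + base)
        else:
            paths.add(prefix + base)
    return paths


def forest_distance(a, b):
    """Stand-in metric: Jaccard distance between the leaf-path sets of two forests."""
    pa, pb = leaf_paths(a), leaf_paths(b)
    union = pa | pb
    if not union:
        return 0.0
    return 1.0 - len(pa & pb) / len(union)
```

With these definitions, two sequences would be compared by calling `forest_distance(prune(build_forest(kmer_counts(s1, k)), m), prune(build_forest(kmer_counts(s2, k)), m))` for a chosen k-mer length `k` and threshold `m`. Pruning before comparison shrinks both trees, which is the kind of saving the abstract attributes to its pruning step, although the actual rule used by the authors may differ.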
