Alignment- and reference-free phylogenomics with colored de Bruijn graphs

Background: The increasing amount of available genome sequence data enables large-scale comparative studies. A common task is the inference of phylogenies, which is challenging if close reference sequences are not available, genome sequences are incompletely assembled, or the high number of genomes precludes multiple sequence alignment in reasonable time.

Results: We present a new whole-genome based approach to infer phylogenies that is alignment- and reference-free. In contrast to other methods, it does not rely on pairwise comparisons to compute distances from which the edges of a tree are inferred. Instead, a colored de Bruijn graph is constructed, and information on common subsequences is extracted to infer phylogenetic splits.

Conclusions: The new methodology shows high potential for large-scale phylogenomics. Application to different datasets confirms the robustness of the approach. A comparison to other state-of-the-art whole-genome based methods indicates comparable or higher accuracy and efficiency.
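To make the idea of inferring splits from shared subsequences concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the published implementation. It builds no actual graph: a dictionary from canonical k-mers to the set of genomes containing them stands in for the color annotation of a colored de Bruijn graph. The function names (`canonical`, `kmer_colors`, `split_weights`), the plain k-mer-count weighting, and the toy sequences are all hypothetical, and any filtering of mutually incompatible splits is omitted.

```python
from collections import defaultdict

# Translation table for reverse complements (assumes plain A/C/G/T input).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_colors(genomes, k):
    """Map each canonical k-mer to the set of genome names ("colors") containing it."""
    colors = defaultdict(set)
    for name, sequence in genomes.items():
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            # Skip k-mers with ambiguous characters such as N.
            if set(kmer) <= set("ACGT"):
                colors[canonical(kmer)].add(name)
    return colors


def split_weights(genomes, k):
    """Count, for every bipartition of the genomes, the k-mers supporting it.

    A k-mer present in a proper subset S of the genomes supports the split
    S | (all genomes minus S). Weighting by raw counts is an assumption
    made for illustration only.
    """
    all_names = frozenset(genomes)
    weights = defaultdict(int)
    for color_set in kmer_colors(genomes, k).values():
        side = frozenset(color_set)
        other = all_names - side
        if not other:
            continue  # present in every genome: carries no split signal
        # Represent the split as an unordered pair of its two sides.
        weights[frozenset({side, other})] += 1
    return dict(weights)


if __name__ == "__main__":
    toy_genomes = {"A": "ACGTAC", "B": "ACGTAC", "C": "TTTTTT"}
    for split, weight in split_weights(toy_genomes, k=3).items():
        sides = sorted(sorted(side) for side in split)
        print(sides, weight)
```

Note that no pairwise genome comparison occurs anywhere in the sketch: each k-mer is inspected once, and splits fall out of the color sets directly, which is the property the abstract contrasts with distance-based methods.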
