Phylonium: fast estimation of evolutionary distances from large samples of similar genomes

Abstract

Motivation: Tracking disease outbreaks by whole-genome sequencing leads to the collection of large samples of closely related sequences. Five years ago, we published a method to accurately compute all pairwise distances for such samples by indexing each sequence. Since indexing is slow, we now ask whether it is possible to achieve similar accuracy when indexing only a single sequence.

Results: We have implemented this idea in the program phylonium and show that it is as accurate as its predecessor and roughly 100 times faster when applied to all 2678 Escherichia coli genomes contained in ENSEMBL. One of the best published programs for rapidly computing pairwise distances, mash, analyzes the same dataset four times faster but, with default settings, it is less accurate than phylonium.

Availability and implementation: Phylonium runs under the UNIX command line; its C++ sources and documentation are available from github.com/evolbioinf/phylonium.

Supplementary information: Supplementary data are available at Bioinformatics online.
