Read-SpaM: assembly-free and alignment-free comparison of bacterial genomes with low sequencing coverage

In many fields of biomedical research, it is important to estimate phylogenetic distances between taxa based on low-coverage sequencing reads. Major applications include phylogeny reconstruction, species identification from small sequencing samples, and bacterial strain typing in medical diagnostics. Here, we adapt our previously developed software program Filtered Spaced-Word Matches (FSWM), a tool for alignment-free phylogeny reconstruction, to work on unassembled reads; we call this implementation Read-SpaM. Test runs on simulated reads from bacterial genomes show that our approach can estimate phylogenetic distances with high accuracy, even for large evolutionary distances and for very low sequencing coverage.

Availability: https://github.com/burkhard-morgenstern/Read-SpaM

Contact: bmorgen@gwdg.de
